
EURASIP Journal on Advances in Signal Processing

Advanced Image Processing for Defense and Security Applications

Guest Editors: Eliza Yingzi Du, Robert Ives, Alan van Nevel, and Jin-Hua She
Copyright © 2010 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2010 of “EURASIP Journal on Advances in Signal Processing.” All articles are open access
articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Editor-in-Chief
Phillip Regalia, Institut National des Télécommunications, France

Associate Editors
Adel M. Alimi, Tunisia Sudharman K. Jayaweera, USA Douglas O’Shaughnessy, Canada
Kenneth Barner, USA Soren Holdt Jensen, Denmark Björn Ottersten, Sweden
Yasar Becerikli, Turkey Mark Kahrs, USA Jacques Palicot, France
Kostas Berberidis, Greece Moon Gi Kang, South Korea Ana Perez-Neira, Spain
Enrico Capobianco, Italy Walter Kellermann, Germany Wilfried R. Philips, Belgium
A. Enis Cetin, Turkey Lisimachos P. Kondi, Greece Aggelos Pikrakis, Greece
Jonathon Chambers, UK Alex Chichung Kot, Singapore Ioannis Psaromiligkos, Canada
Mei-Juan Chen, Taiwan Ercan E. Kuruoglu, Italy Athanasios Rontogiannis, Greece
Liang-Gee Chen, Taiwan Tan Lee, China Gregor Rozinaj, Slovakia
Satya Dharanipragada, USA Geert Leus, The Netherlands Markus Rupp, Austria
Kutluyil Dogancay, Australia T.-H. Li, USA William Sandham, UK
Florent Dupont, France Husheng Li, USA B. Sankur, Turkey
Frank Ehlers, Italy Mark Liao, Taiwan Erchin Serpedin, USA
Sharon Gannot, Israel Y.-P. Lin, Taiwan Ling Shao, UK
Samanwoy Ghosh-Dastidar, USA Shoji Makino, Japan Dirk Slock, France
Norbert Goertz, Austria Stephen Marshall, UK Yap-Peng Tan, Singapore
M. Greco, Italy C. Mecklenbräuker, Austria João Manuel R. S. Tavares, Portugal
Irene Y. H. Gu, Sweden Gloria Menegaz, Italy George S. Tombras, Greece
Fredrik Gustafsson, Sweden Ricardo Merched, Brazil Dimitrios Tzovaras, Greece
Ulrich Heute, Germany Marc Moonen, Belgium Bernhard Wess, Austria
Sangjin Hong, USA Christophoros Nikou, Greece Jar-Ferr Yang, Taiwan
Jiri Jan, Czech Republic Sven Nordholm, Australia Azzedine Zerguine, Saudi Arabia
Magnus Jansson, Sweden Patrick Oonincx, The Netherlands Abdelhak M. Zoubir, Germany
Contents
Advanced Image Processing for Defense and Security Applications, Eliza Yingzi Du, Robert Ives,
Alan van Nevel, and Jin-Hua She
Volume 2010, Article ID 432972, 1 page

Nonblind and Quasiblind Natural Preserve Transform Watermarking, G. Fahmy, M. F. Fahmy, and U. S. Mohammed
Volume 2010, Article ID 452548, 13 pages

Reversible Watermarking Using Statistical Information, Ahmad Mahmoudi Aznaveh, Farah Torkamani-Azar, Azadeh Mansouri, and Fatih Kurugollu
Volume 2010, Article ID 738972, 6 pages

On Converting Secret Sharing Scheme to Visual Secret Sharing Scheme, Daoshun Wang and Feng Yi
Volume 2010, Article ID 782438, 11 pages

Semi-Fragile Zernike Moment-Based Image Watermarking for Authentication, Hongmei Liu, Xinzhi Yao,
and Jiwu Huang
Volume 2010, Article ID 341856, 17 pages

Digital Watermarking Method Warranting the Lower Limit of Image Quality of Watermarked Images,
Motoi Iwata, Tomoo Kanaya, Akira Shiozaki, and Akio Ogihara
Volume 2010, Article ID 426085, 18 pages

A Contourlet-Based Image Watermarking Scheme with High Resistance to Removal and Geometrical
Attacks, Sirvan Khalighi, Parisa Tirdad, and Hamid R. Rabiee
Volume 2010, Article ID 540723, 13 pages

A New Robust Watermarking Scheme to Increase Image Security, Hossein Rahmani, Reza Mortezaei,
and Mohsen Ebrahimi Moghaddam
Volume 2010, Article ID 428183, 30 pages

An Efficient Prediction-and-Shifting Embedding Technique for High Quality Reversible Data Hiding,
Wien Hong
Volume 2010, Article ID 104835, 12 pages

Improved Adaptive LSB Steganography Based on Chaos and Genetic Algorithm, Lifang Yu, Yao Zhao,
Rongrong Ni, and Ting Li
Volume 2010, Article ID 876946, 6 pages

A Macro-Observation Scheme for Abnormal Event Detection in Daily-Life Video Sequences, Wei-Yao Chiu and Du-Ming Tsai
Volume 2010, Article ID 525026, 19 pages

Pedestrian Validation in Infrared Images by Means of Active Contours and Neural Networks,
Massimo Bertozzi, Pietro Cerri, Mirko Felisa, Stefano Ghidoni, and Michael Del Rose
Volume 2010, Article ID 752567, 11 pages

Vehicle Trajectory Estimation Using Spatio-Temporal MCMC, Yann Goyat, Thierry Chateau,
and Francois Bardet
Volume 2010, Article ID 712854, 8 pages
Superresolution versus Motion Compensation-Based Techniques for Radar Imaging Defense
Applications, J. M. Muñoz-Ferreras and F. Pérez-Martínez
Volume 2010, Article ID 308379, 9 pages

A Locally Adaptable Iterative RX Detector, Yuri P. Taitano, Brian A. Geier, and Kenneth W. Bauer Jr.
Volume 2010, Article ID 341908, 10 pages

Background Subtraction for Automated Multisensor Surveillance: A Comprehensive Review, Marco Cristani, Michela Farenzena, Domenico Bloisi, and Vittorio Murino
Volume 2010, Article ID 343057, 24 pages

High-Resolution Sonars: What Resolution Do We Need for Target Recognition?, Yan Pailhas,
Yvan Petillot, and Chris Capus
Volume 2010, Article ID 205095, 13 pages

An Efficient and Robust Moving Shadow Removal Algorithm and Its Applications in ITS, Chin-Teng Lin,
Chien-Ting Yang, Yu-Wen Shou, and Tzu-Kuei Shen
Volume 2010, Article ID 945130, 19 pages

Robust Tracking in Aerial Imagery Based on an Ego-Motion Bayesian Model, Carlos R. del Blanco,
Fernando Jaureguizar, and Narciso García
Volume 2010, Article ID 837405, 18 pages

Covariance Tracking via Geometric Particle Filtering, Yunpeng Liu, Guangwei Li, and Zelin Shi
Volume 2010, Article ID 583918, 9 pages

Construction of Fisheye Lens Inverse Perspective Mapping Model and Its Applications of Obstacle
Detection, Chin-Teng Lin, Tzu-Kuei Shen, and Yu-Wen Shou
Volume 2010, Article ID 296598, 23 pages

Clusters versus GPUs for Parallel Target and Anomaly Detection in Hyperspectral Images, Abel Paz and
Antonio Plaza
Volume 2010, Article ID 915639, 18 pages

A Review of Unsupervised Spectral Target Analysis for Hyperspectral Imagery, Chein-I Chang,
Xiaoli Jiao, Chao-Cheng Wu, Yingzi Du, and Mann-Li Chang
Volume 2010, Article ID 503752, 26 pages

Subinteger Range-Bin Alignment Method for ISAR Imaging of Noncooperative Targets, J. M. Muñoz-Ferreras and F. Pérez-Martínez
Volume 2010, Article ID 438615, 16 pages

Investigating the Bag-of-Words Method for 3D Shape Retrieval, Xiaolan Li and Afzal Godil
Volume 2010, Article ID 108130, 9 pages

Optical Flow and Principal Component Analysis-Based Motion Detection in Outdoor Videos, Kui Liu,
Qian Du, He Yang, and Ben Ma
Volume 2010, Article ID 680623, 6 pages

Shape Analysis of 3D Head Scan Data for U.S. Respirator Users, Ziqing Zhuang, Dennis E. Slice,
Stacey Benson, Stephanie Lynch, and Dennis J. Viscusi
Volume 2010, Article ID 248954, 10 pages
A Conditional Entropy-Based Independent Component Analysis for Applications in Human Detection
and Tracking, Chin-Teng Lin, Linda Siana, Yu-Wen Shou, and Tzu-Kuei Shen
Volume 2010, Article ID 468329, 14 pages

Objective Assessment of Sunburn and Minimal Erythema Doses: Comparison of Noninvasive In Vivo
Measuring Techniques after UVB Irradiation, Min-Wei Huang, Pei-Yu Lo, and Kuo-Sheng Cheng
Volume 2010, Article ID 483562, 7 pages

Robust Real-Time Background Subtraction Based on Local Neighborhood Patterns, Ariel Amato,
Mikhail G. Mozerov, F. Xavier Roca, and Jordi Gonzàlez
Volume 2010, Article ID 901205, 7 pages

Improving Density Estimation by Incorporating Spatial Information, Laura M. Smith, Matthew S. Keegan, Todd Wittman, George O. Mohler, and Andrea L. Bertozzi
Volume 2010, Article ID 265631, 12 pages

Adaptive Inverse Hyperbolic Tangent Algorithm for Dynamic Contrast Adjustment in Displaying
Scenes, Cheng-Yi Yu, Yen-Chieh Ouyang, Chuin-Mu Wang, and Chein-I Chang
Volume 2010, Article ID 485151, 20 pages

Multi-Threshold Level Set Model for Image Segmentation, Chih-Yu Hsu, Chih-Hung Yang,
and Hui-Ching Wang
Volume 2010, Article ID 950438, 8 pages

An Interactive Procedure to Preserve the Desired Edges during the Image Processing of Noise Reduction,
Chih-Yu Hsu, Hsuan-Yu Huang, and Lin-Tsang Lee
Volume 2010, Article ID 923748, 13 pages

Full Waveform Analysis for Long-Range 3D Imaging Laser Radar, Andrew M. Wallace, Jing Ye,
Nils J. Krichel, Aongus McCarthy, Robert J. Collins, and Gerald S. Buller
Volume 2010, Article ID 896708, 12 pages

Facial Recognition in Uncontrolled Conditions for Information Security, Qinghan Xiao and
Xue-Dong Yang
Volume 2010, Article ID 345743, 9 pages

Iris Recognition: The Consequences of Image Compression, Robert W. Ives, Daniel A. Bishop, Yingzi Du,
and Craig Belcher
Volume 2010, Article ID 680845, 9 pages

Scale Invariant Gabor Descriptor-based Noncooperative Iris Recognition, Yingzi Du, Craig Belcher,
and Zhi Zhou
Volume 2010, Article ID 936512, 13 pages

A Multifactor Extension of Linear Discriminant Analysis for Face Recognition under Varying Pose and
Illumination, Sung Won Park and Marios Savvides
Volume 2010, Article ID 158395, 11 pages

Unconstrained Iris Acquisition and Recognition Using COTS PTZ Camera, Shreyas Venugopalan and
Marios Savvides
Volume 2010, Article ID 938737, 20 pages
Fusion of PCA-Based and LDA-Based Similarity Measures for Face Verification, Mohammad T. Sadeghi,
Masoumeh Samiei, and Josef Kittler
Volume 2010, Article ID 647597, 12 pages

A Robust Iris Identification System Based on Wavelet Packet Decomposition and Local Comparisons of
the Extracted Signatures, Florence Rossant, Beata Mikovicova, Mathieu Adam, and Maria Trocan
Volume 2010, Article ID 415307, 16 pages

The Complete Gabor-Fisher Classifier for Robust Face Recognition, Vitomir Štruc and Nikola Pavešić
Volume 2010, Article ID 847680, 26 pages

Multiclient Identification System Using Adaptive Probabilistic Model, Chin-Teng Lin, Linda Siana,
Yu-Wen Shou, and Chien-Ting Yang
Volume 2010, Article ID 983581, 15 pages

Comparing an FPGA to a Cell for an Image Processing Application, Ryan N. Rakvic, Hau Ngo,
Randy P. Broussard, and Robert W. Ives
Volume 2010, Article ID 764838, 7 pages
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 432972, 1 page
doi:10.1155/2010/432972

Editorial
Advanced Image Processing for Defense and Security Applications

Eliza Yingzi Du,1 Robert Ives,2 Alan van Nevel,3 and Jin-Hua She4
1 Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis,
723 W. Michigan Street, SL 160, Indianapolis, IN 46259, USA
2 Department of Electrical Engineering, US Naval Academy, 105 Maryland Avenue, MS 14B, Annapolis, MD 21402, USA
3 Image and Signal Processing Branch, Research Department, Naval Air Warfare Center, 1900 N Knox Road, M/S 6302,

China Lake, CA 93555, USA


4 School of Computer Science, Tokyo University of Technology, 1404-1 Katakura, Hachioji, Tokyo 192-0982, Japan

Correspondence should be addressed to Eliza Yingzi Du, [email protected]

Received 31 December 2010; Accepted 31 December 2010

Copyright © 2010 Eliza Yingzi Du et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

The history of digital image processing can be traced back to the 1920s, when digital images were transferred between London and New York. In the past, however, the cost of processing was very high because imaging sensors and computational equipment were very expensive and had only limited functions. As a result, the development of digital image processing was limited.

As optics, imaging sensors, and computational technology advanced, image processing became more commonly used in many different areas. Some areas of application of digital image processing include image enhancement for better human perception, image compression and transmission, as well as image representation for automatic machine perception.

Most notably, digital image processing has been widely deployed for defense and security applications such as small target detection and tracking, missile guidance, vehicle navigation, wide area surveillance, and automatic/aided target recognition. One goal for an image processing approach in defense and security applications is to reduce the workload of human analysts in order to cope with the ever increasing volume of image data that is being collected. A second, more challenging goal for image processing researchers is to develop algorithms and approaches that will significantly aid the development of fully autonomous systems capable of decisions and actions based on all sensor inputs.

For this special issue, our aim was to bring together researchers designing or developing advanced image processing techniques/systems, with a particular emphasis on defense and security applications. 105 works from 12 countries (Canada, Taiwan, Spain, China, Egypt, Iran, France, Italy, Slovenia, Japan, and the USA) were submitted, and 44 works were accepted for publication after being thoroughly reviewed by international experts in the subject matter areas. This special issue covers the following topics: information assurance (including watermarking and visual secret sharing schemes), steganography, target detection and tracking (including abnormal event detection, human detection and tracking, vehicle trajectory estimation, radar and 3D radar image processing, multisensor surveillance, hyperspectral image processing, obstacle detection, ISAR image processing, sonar image processing, 3D shape retrieval and image analysis, and image segmentation), and biometrics (including iris recognition, face recognition, hardware design for biometrics, and multiclient identification).

Eliza Yingzi Du
Robert Ives
Alan van Nevel
Jin-Hua She
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 452548, 13 pages
doi:10.1155/2010/452548

Research Article
Nonblind and Quasiblind Natural Preserve
Transform Watermarking

G. Fahmy,1 M. F. Fahmy,2 and U. S. Mohammed2


1 German University in Cairo (GUC), New Cairo City 11835, Egypt
2 Department of Electrical Engineering, Assiut University, Assiut 71515, Egypt

Correspondence should be addressed to G. Fahmy, [email protected]

Received 10 September 2009; Revised 10 December 2009; Accepted 10 March 2010

Academic Editor: Robert W. Ives

Copyright © 2010 G. Fahmy et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper describes a new image watermarking technique based on the Natural Preserving Transform (NPT). The proposed watermarking scheme uses NPT to encode a gray-scale watermarking logo image, or text, into a host image at any location. NPT brings a unique feature: it uniformly distributes the logo across the host image in an imperceptible manner. The contribution of this paper lies in presenting two efficient nonblind and quasiblind watermark extraction techniques. In the quasiblind case, the extraction algorithm requires only a small amount of information about the original image, which is already conveyed by the watermarked image. Moreover, the proposed scheme does not introduce visual quality degradation into the host image while still being able to extract a logo with a relatively large amount of data. The performance and robustness of the proposed technique are tested by applying common image-processing operations such as cropping, noise degradation, and compression. A quantitative measure is proposed to objectify performance; under this measure, the proposed technique outperforms most recent techniques in most cases. We also implemented the proposed technique on a hardware platform, a digital signal processor (DSK 6713). Results are illustrated to show the effectiveness of the proposed technique in different noisy environments.

1. Introduction

With the widespread use of the Internet and the rapid, massive development of multimedia, there is a pressing need for efficient and powerful copyright protection techniques. A variety of image watermarking methods have been proposed [1–14], most of them based on the spatial domain [1, 2] or the transform domain [3, 4]. In recent years [14–16], several further image watermarking techniques based on the transform domain have appeared.

Digital watermarking schemes are typically classified into three categories. Private watermarking requires prior knowledge of the original information and secret keys at the receiver. Semiprivate, or semiblind, watermarking requires that the watermark information and secret keys be available at the receiver. Public, or blind, watermarking requires that the receiver know only the secret keys [14]. Private watermarking schemes are robust enough to endure signal-processing attacks. However, they are not feasible in real applications, such as DVD copy protection, where the original information may not be available for watermark detection. Semiblind and blind watermarking schemes, on the other hand, are more feasible in that situation [12], but they are less robust than private watermarking schemes [13]. In general, the requirements of a watermarking system fall into three categories: robustness, visibility, and capacity. Robustness refers to the fact that the watermark must survive attacks from potential pirates. Visibility refers to the requirement that the watermark be imperceptible to the eye. Capacity refers to the amount of information that the watermark must carry. Embedding a watermark logo typically involves a tradeoff between robustness, visibility, and capacity.

In [15], a composite approach for blind grayscale logo watermarking is presented. This approach is based on multiresolution fusion principles to embed the grayscale logo in perceptually significant blocks of the wavelet subband decompositions of the host image. Moreover, a modulus approach is used to embed a binary counterpart of the logo

in the approximation subband. However, in spite of its high complexity, the technique fails under cropping attacks. In [16], a curvelet-based watermarking technique was proposed for embedding gray-scale logos; however, the normalized correlation (NCORR) between the original and extracted logos under most watermarking attacks does not exceed 0.91. Several wavelet-based fragile watermarking techniques have been presented in [17–19]. Other similar techniques have also been presented in the DCT domain [20–22]. In spite of the successful performance of most watermarking techniques reported in the literature, they still suffer from being semifragile due to the energy concentration of their transform domains (DCT and wavelets), which makes them discard much of the mid- and high-frequency watermarked data under compression.

In [5, 7–16, 23, 24], an alternative watermarking scheme has been proposed, based on the Natural Preserve Transform (NPT). The NPT is a special orthogonal transform class that has been used to code and reconstruct missing signal portions [23]. Unlike previous watermarking schemes that use binary logos, NPT evenly distributes the watermarking gray-scale logo, or text, all over the host image. The method assumes prior knowledge of the host image for watermark extraction, and it also suffers from slow convergence in the logo extraction process. In [25, 26], an efficient fast least squares technique was proposed for NPT watermark extraction, to remedy the iterative technique originally proposed in [23, 24].

In this paper, a unified approach is proposed for nonblind and quasiblind NPT-based watermarking. In the quasiblind case, the extraction technique requires only the small amount of information about the original host image that is needed for the complete recovery of both the host image and the watermarking logo. This information is conveyed by the watermarked image itself with no or negligible degradation. Illustrative examples are given to show the quality of the watermarked images, as well as the extracted watermarking logo and its performance in the presence of attacks. Hardware implementations using a DSP have experimentally confirmed the computer simulations. Apart from its simplicity, the method is virtually insensitive to cropping attacks and performs well under compression and noise attacks. The proposed approach also delivers an extracted watermark that is not only perfect or near perfect but can also be inspected visually by the user, which gives the application more user confidence and trust.

The paper is organized as follows. Section 2 covers the mathematical background needed for the proposed NPT watermarking technique. Section 3 reviews the NPT nonblind and blind embedding and extraction techniques and describes their implementation. The experimental procedure and simulation results are presented in Section 4, along with hardware implementation results. The discussion and conclusion are in Sections 5 and 6, respectively.

2. Mathematical Background for NPT

The NPT was first introduced as a new orthogonal transform with some unusual properties that can be used for encoding and reconstructing lost data from images. The NPT of an image S of size N × N is given by

Str = ψ(α) S ψ(α),   (1)

where ψ(α) is the transformation kernel defined as [23, 24]

ψ(α) = α IN + (1 − α) HN,   (2)

where IN is the Nth-order identity matrix, 0 ≤ α ≤ 1, and HN is any orthogonal transform, such as the Hadamard, DCT, or Hartley transform. Throughout this paper, we use the 2D Hartley transform, defined by

HN(k, j) = (1/√N) [cos(2π(k − 1)(j − 1)/N) + sin(2π(k − 1)(j − 1)/N)].   (3)

We note here that the Hartley transform was chosen for its circular symmetry, as it evenly distributes the energy of the original image into the four corners of the orthogonally projected transform image. The Hartley transform thus achieves a tradeoff between the energy concentration feature (which is crucial for any transform domain used for compression) and the even distribution and spreading feature (which is crucial for watermarking and data hiding applications). Figure 1 illustrates this idea by showing the energy concentration for different well-known orthogonal transforms: DCT, wavelet, Hadamard, and Hartley.

The value of α in (2) balances the original-domain and transform-domain sample bases. Clearly, when α = 1 the transformed image is the original image, whereas when α = 0 it is its orthogonal projection (the Hartley transform in this paper). Hence the NPT is capable of concentrating the energy of the image while still preserving its original sample values, on a tradeoff basis. As a result, the NPT-domain image both retains nearly original pixel values (it cannot be visually distinguished from the original image) and allows the original image to be retrieved from a small part of the transformed image (provided that this small part has enough energy concentrated in it). The transformed image has a PSNR of the order of 20 log10(α/(1 − α)).

The original image can be retrieved from the transformed image Str using

S = ψ⁻¹(α) Str ψ⁻¹(α).   (4)

If H is symmetric, as Hartley matrices are, one can show that ψ(α)⁻¹ = ψ(α/(2α − 1)). Otherwise, the matrix ψ⁻¹(α) can be computed from the series ψ⁻¹(α) ≡ φ = (1/α)[I − ((1 − α)/α)H + (((1 − α)/α)H)² − (((1 − α)/α)H)³ + · · · ], which can be evaluated to any desired accuracy since ‖((1 − α)/α)H‖ < 1.

Instead of spreading the orthogonal projection over the complete image frame, one can spread it over only part of the image (specific blocks or quarters). This leads to the Mth partial NPT, defined as

ψM(α) = [α IM + (1 − α) HM   0;   0   IN−M].   (5)
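To make the definitions above concrete, the following Python/NumPy sketch (a minimal illustration, not the authors' implementation) builds the Hartley matrix of (3), forms the NPT kernel ψ(α) of (2), applies the forward transform (1), and inverts it via (4) using the closed form ψ(α)⁻¹ = ψ(α/(2α − 1)), which holds because the Hartley matrix is symmetric and orthogonal. The image size and α below are illustrative.

```python
import numpy as np

def hartley_matrix(n):
    # H[k, j] = (1/sqrt(n)) * [cos(2*pi*k*j/n) + sin(2*pi*k*j/n)], eq. (3) with 0-based indices;
    # this matrix is symmetric and orthogonal, so H @ H = I.
    k, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    ang = 2.0 * np.pi * k * j / n
    return (np.cos(ang) + np.sin(ang)) / np.sqrt(n)

def npt_kernel(alpha, n):
    # psi(alpha) = alpha*I + (1 - alpha)*H, eq. (2)
    return alpha * np.eye(n) + (1.0 - alpha) * hartley_matrix(n)

def npt(img, alpha):
    # Forward NPT, eq. (1): Str = psi(alpha) S psi(alpha)
    psi = npt_kernel(alpha, img.shape[0])
    return psi @ img @ psi

def inpt(img_tr, alpha):
    # Inverse NPT, eq. (4); for symmetric H, psi(alpha)^-1 = psi(alpha/(2*alpha - 1))
    phi = npt_kernel(alpha / (2.0 * alpha - 1.0), img_tr.shape[0])
    return phi @ img_tr @ phi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.uniform(0, 255, size=(256, 256))   # stand-in for a 256 x 256 host image
    alpha = 0.994                               # the value used for Figure 2
    Str = npt(S, alpha)
    err = np.max(np.abs(S - inpt(Str, alpha)))
    psnr = 20 * np.log10(255.0 / np.sqrt(np.mean((S - Str) ** 2)))
    print(f"round-trip error {err:.2e}, PSNR of the NPT image {psnr:.1f} dB")
```

The round-trip check verifies the inverse relation numerically, and the printed PSNR illustrates the 20 log10(α/(1 − α)) behavior noted above.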

Figure 1: Transform domain basis for Hartley, DCT, Hadamard, and wavelet, respectively.

Figure 2: (a) Original host image and (b) its NPT image (PSNR = 44.17 dB), computed with α = 0.994.

Figure 2 shows the Lena image and its NPT-transformed image. α is adjusted to yield a nominal PSNR of 45 dB; its value is α = 0.994, and the PSNR of the transformed image is 44.17 dB. The high similarity between the original and transformed images suggests that the NPT is very convenient for watermarking and data hiding.

3. The Proposed Image Watermarking Technique (IW-NPT)

3.1. Watermark Embedding. Let the host image S (of size N × N) be watermarked with a watermarking logo w of size m × n. In the bottom embedding technique [25], the logo is embedded into S as the last r bottom lines. The logo matrix is reshaped into a matrix w1 of size r × N, with r = mn/N. Then, the last r rows of S are replaced by the reshaped logo w1. This step yields a watermarked square image Swm = [S1; w1], where S1 = S(1 : N − r, :). Next, the NPT of Swm is obtained as

Aw = ψ(α) Swm ψ(α) ≡ [A0w; z],   (6)

where A0w contains the first N − r rows and z the last r rows, each with N columns. This step registers the watermark (distributes its energy) all over the host image. In order to make the watermarking logo invisible, we replace the last r rows of Aw with the last r rows of the original image S. Hence,

Awm = [A0w; S(N − r + 1 : N, :)].   (7)

In [26], a simpler embedding technique is proposed. It amounts to replacing part of the host image with the logo image. For simplicity, the logo is embedded in the upper left corner. So, if the host image is partitioned as S = [S11 S12; S21 S22], then the embedded image is Swm = [w S12; S21 S22]. Now, the NPT-based watermarking technique proceeds as follows.

(1) Obtain the NPT of Swm as Aw = ψ(α) Swm ψ(α).

(2) Partition

Aw = [A11 A12; A21 A22],   (8)

where A11 has m rows and n columns, and A22 has N − m rows and N − n columns.

(3) In order to make the watermarking logo invisible, the watermarked image Awm is constructed by replacing the upper left section with the true host image S11, that is, Awm = [S11 A12; A21 A22].
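As a schematic rendering of the corner-embedding steps (1)–(3) above, the sketch below embeds a logo into the upper-left corner of a host, takes the NPT, and then masks the logo region with the original pixels. It assumes the `npt_kernel` helper from the previous sketch is in scope and is not the authors' code; sizes in the comment are purely illustrative.

```python
import numpy as np

def embed_corner(S, logo, alpha):
    """Corner embedding, steps (1)-(3): returns the masked watermarked image Awm."""
    # assumes npt_kernel() from the earlier NPT sketch is available
    N = S.shape[0]
    m, n = logo.shape
    psi = npt_kernel(alpha, N)          # kernel psi(alpha) from (2)
    Swm = S.copy()
    Swm[:m, :n] = logo                  # Swm = [w S12; S21 S22]
    Aw = psi @ Swm @ psi                # step (1): Aw = psi * Swm * psi
    Awm = Aw.copy()
    Awm[:m, :n] = S[:m, :n]             # step (3): restore S11 so the logo is invisible
    return Awm

# Example call (hypothetical sizes, mirroring the 87 x 60 Assiut logo on a 256 x 256 host):
# Awm = embed_corner(host, logo, alpha=0.99)
```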

It is also worth mentioning that the logo can be embedded anywhere in the image [26]. Here, we choose to embed the logo in the image region Sk that is most similar to the logo, that is, the region with the least Euclidean distance ‖Sk − w‖.

As an illustrative example, we consider embedding the Assiut University logo, shown in Figure 3, into the Lena image, using the three embedding techniques. The gray-scale Assiut University logo has a size of 87 × 60. The complete Hartley orthogonal transformation is used with α = 0.99. In the bottom embedding case, the logo is embedded as 21 bottom rows, as described earlier. Figure 4 shows the NPT watermarked image Aw and the masked watermarked image Awm for the bottom, top, and optimum embedding cases. The watermarked PSNRs are 39.85, 39.6, and 39.71 dB, respectively. We note here that top embedding (or embedding in any of the 4 corners) gives the best watermarking and extraction performance, as will be shown later. This is due to the greater energy concentration in the corners for the Hartley transform (Figure 1).

Figure 3: The original logo image (Assiut University logo).

3.2. Watermark Extraction. The watermark extraction process is divided into a nonblind case, where the original host image is known at the receiver side and we only try to extract the logo from the watermarked image, and a blind case, where the host image is not known at the receiver side and we try to extract both the host and logo images from the watermarked image Awm.

3.2.1. The Nonblind Case. We first consider extracting the logo from the top-embedding watermarked image, as in Figure 4(b). Assuming that the original image S, the parameter α of (1), and the type of orthogonal transformation HN are known at the receiver, the extraction of the watermark from the received Awm proceeds as follows.

(1) Determine the logo size m, n. This is easily done by correlating the watermarked image Awm with the host image S to determine the region of exact matching (S11).

(2) Form

Y = Awm φ ≡ [Y11 Y12; Y21 Y22] = ψ Swm = ψ [w S12; S21 S22].   (9)

Due to the insertion of S11 in place of A11, the submatrices Y11 and Y12 are in error (nonwatermarked), while Y21 and Y22 still convey the watermark effects.

(3) Partition

ψ = [ψ11 ψ12; ψ21 ψ22],   (10)

where ψ11 has m rows and n columns. Then, as long as N − m ≥ m, the watermark w is the least squares solution of the system

Y21 − ψ22 S21 = ψ21 w.   (11)

Even though Y22 is not corrupted, we do not need it to calculate the logo w in this nonblind case.

The quality of extraction is judged by computing the normalized correlation NCORR between the original and extracted logos, that is,

NCORR = Σi Σj (wij · wex,ij) / (‖w‖ · ‖wex‖),  i = 1, …, m,  j = 1, …, n,   (12)

where wex is the extracted watermark. The nonblind extracted logo in our experiments achieved NCORR = 1. A similar approach is applied in the case of bottom embedding or optimum embedding. The technique is straightforward for the 3 cases and computationally efficient while being accurate and fast to converge. For the number of unknowns in (11) to be less than or equal to the number of known equations, the watermarking logo must be limited in size, meaning it must not have more than N/2 rows for the top, bottom, and optimum embedding cases.

3.2.2. The Quasiblind Case. When prior knowledge of the host image S is not available, the following quasiblind technique is proposed for watermark extraction from an NPT-based watermarked image. For simplicity, we consider the blind extraction of bottom-embedded logos. The proposed technique can be described as follows.

(1) Partition

ψ = [ψ11 ψ12; ψ21 ψ22],   (13)

where the upper blocks have N − r rows and the lower blocks have r rows, with N columns in total.

Figure 4: The NPT watermarked image Aw and the masked watermarked image Awm for the three embedding schemes: (a) bottom embedding, PSNR = 39.85 dB; (b) top embedding, PSNR = 39.6 dB; (c) optimum embedding, PSNR = 39.71 dB.

As Aw φ = ψ Swm, from (6), (7), and (9) we can easily show that

[A0w; S(N − r + 1 : N, :)] φ = [ψ11 ψ12; ψ21 ψ22] [S1; w1],   (14)

that is, A0w φ = ψ11 S1 + ψ12 w1.

(2) To cancel the effect of S1 in (14), construct an (N − r) square matrix V such that Vᵗ ψ12 = 0. This matrix can easily be constructed by expressing its kth vector Vk as

Vk = IN−r,k − Σj=1..r αjk ψ12(:, j),   (15)

where IN−r,k ≡ IN−r(:, k), 1 ≤ k ≤ N − r.

Figure 5: Quasiblind watermark extraction, showing the watermarked image Awm (a) and the reshuffled image Awm (b) together with the extracted logos. (a) The number of independent columns of A0 is 6, and they are clustered to the far right. (b) The number of independent columns of A0 is 21, and they are randomly distributed.

The αjk are obtained by solving the set of r linear equations

Vkᵗ ψ12(:, j) = 0,  1 ≤ j ≤ r.   (16)

Since ψ12 is an (N − r) × r matrix, its rank is r. Consequently, the rank of the matrix V is N − 2r [27].

(3) Premultiply (14) by Vᵗ to yield

Vᵗ A0w φ = Vᵗ ψ11 S1.   (17)

As the rank of ψ11 is N − r, the rank of Vᵗ ψ11 is N − 2r. So, to have a unique solution of (17), r arbitrary parameters of every column of S1 have to be known at the receiver/extractor. This can be achieved if, in the watermarked image Aw, we choose the matrix z of (6) to be S(N − 2r + 1 : N − r, :) instead of S(N − r + 1 : N, :) (it basically means replicating the r last rows of the image, as in Figure 5).

Having obtained S1 as the unique solution of (17), w1 (the logo) is extracted as in the nonblind case and subsequently reshaped to regain the original watermark w. For top and optimum quasiblind embedding, the original image is first extracted in a similar manner, and then w1 (the logo) is extracted as in the nonblind case.

We note here that, just as r parameters of every column have to be known at the receiver/extractor side in the bottom embedding case, m parameters of every column have to be known at the receiver/extractor side for both the top embedding and the optimum embedding cases, where m is the number of rows in the logo image. This means that an area equal to the logo image (in rows, and in columns equal to the host width) has to be duplicated in the host image, which makes the degradation more noticeable, as in Figure 5(b), where S(1 : m, 1 : n) is used instead of S(m + 1 : 2m, n + 1 : 2n). This justifies why the bottom embedding case is our favored option for quasiblind extraction. We use the terminology quasiblind because a minor amount of information (r parameters of every column) has to be known at the receiver side.

At this point, it is worth mentioning that there is no guarantee that the r arbitrary variables needed to solve (17) (i.e., the r last rows of S1) are clustered in the last r right columns of Vᵗ ψ12. They may be randomly distributed over the columns of Vᵗ ψ12. By empirical observation, the simulation results show that this happens only when r ≥ 6. In this case, the dependent columns have to be identified; the QR matrix decomposition of Vᵗ ψ12 is used to achieve this goal [27].

To test the quasiblind scheme, two experiments have been carried out. In the first experiment, we watermark the Lena image with a resized university logo. The logo was compressed to 44 × 30 pixels, which makes it possible to reshape it and embed it into the last 6 rows of the host image as explained. The NPT is applied to the matrix Swm = [S(1 : 250, :); w1] with α = 0.99 to get the NPT-transformed image Aw. The watermarked image Awm is constructed by replacing the last 6 rows of Aw with S(245 : 250, :). Figure 5(a) shows the watermarked Lena image as well as the extracted resized logo. The watermarked PSNR is 34.35 dB.
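Returning briefly to the nonblind case of Section 3.2.1, the extraction there reduces to a single least-squares solve of (11) followed by the NCORR score of (12). The sketch below is a minimal illustration under the same assumptions as the earlier snippets (corner embedding, `npt_kernel` in scope); it is not the authors' code.

```python
import numpy as np

def extract_corner_nonblind(Awm, S, m, n, alpha):
    """Nonblind extraction for corner embedding, following (9)-(11)."""
    # assumes npt_kernel() from the earlier NPT sketch is available
    N = S.shape[0]
    psi = npt_kernel(alpha, N)
    phi = npt_kernel(alpha / (2.0 * alpha - 1.0), N)   # phi = psi(alpha)^-1 for symmetric H
    Y = Awm @ phi                                      # (9): Y = Awm * phi, whose lower rows equal those of psi * Swm
    Y21 = Y[m:, :n]                                    # lower-left block of Y
    psi21, psi22 = psi[m:, :m], psi[m:, m:]            # lower blocks of psi
    S21 = S[m:, :n]
    # (11): solve psi21 * w = Y21 - psi22 * S21 in the least-squares sense
    w_ex, *_ = np.linalg.lstsq(psi21, Y21 - psi22 @ S21, rcond=None)
    return w_ex

def ncorr(w, w_ex):
    """Normalized correlation of (12)."""
    return float(np.sum(w * w_ex) / (np.linalg.norm(w) * np.linalg.norm(w_ex)))
```

Because the corner replacement only disturbs the first m rows of Aw, the lower rows of Y are unaffected, which is why the least-squares solve recovers the logo exactly in the noise-free case.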

Figure 6: Watermarked PSNR (left) and NCORR of the extracted logo (right) for different α values.

In the second experiment, we use the complete logo. The logo is reshaped and embedded as the last 21 rows of the host image; thus, Swm = [S(1 : 235, :); w1] with α = 0.99. The watermarked image Awm is constructed by replacing the last 21 rows of Aw with S(1 : 21, :). As expected, the rank of V equals the rank of Vᵗ ψ12, and both equal 214. However, the 21 dependent columns are distributed over the column space of Vᵗ ψ12. The QR decomposition shows that the columns numbered Icm = [1 13 19 26 31 37 42 49 54 60 66 72 79 84 90 96 102 108 114 120 127] are the dependent columns. Hence, when we embed the rows of the original image S corresponding to these columns, the PSNR of the reshuffled watermarked image Awm is lowered to 25.0 dB. Figure 5(b) shows the watermarked image Awm together with the extracted logo.

This example clearly indicates that the quality of the watermarked image is high if no data reshuffling occurs. Simulations of several examples have indicated that this is the case as long as the number of embedded rows does not exceed 6.

Figure 6 shows the PSNR values of the watermarked image for different values of α, along with the corresponding NCORR values of the extracted logo, when the watermarked image is compressed using SPIHT at 2.5 bpp. The figure shows that the smaller the value of α, the smaller the contribution of the original image in (2) and the lower the PSNR, but the larger the contribution of the Hartley basis in (2), which means more energy distribution and therefore better extraction (higher NCORR). A value of α in the range 0.985–0.99 is the optimal tradeoff point between the two curves, as in Figure 6.

4. Testing the Robustness of the Proposed Watermarking Technique

The proposed NPT watermarking extraction algorithm has been tested against cropping, compression, and noise attacks. The following simulation results show its robustness to these attacks.

4.1. Robustness to Cropping. The main feature of the proposed NPT watermarking scheme is the even distribution of the watermark all over the host image. So, as long as the size of the cropped watermarked image is greater than the size of the embedded logo, cropping has no effect on the extracted logo and one can extract the logo exactly, as the number of linear equations available to determine the logo is greater than or equal to the number of unknowns. To verify this feature, two examples have been considered. In the first, we consider half-cropping the watermarked, optimum-location-embedded Lena image Awm. The cropped part is filled with white pixels. Figure 7(a) shows the watermarked cropped image together with the extracted logo; NCORR = 1, a property that is shared by the other two embedding techniques. The second example considers the top embedding of a text on the 256 × 256 Cameraman image. Embedding is achieved using the Matlab string and character functions. The embedded text size is 8 × 70. Figure 7(b) shows the watermarked and the received cropped watermarked images, as well as the extracted text, which has been exactly reconstructed. This perfect reconstruction is valid as long as the size of the cropped image is greater than or at least equal to the logo size, to ensure a solution of the linear system of equations that

Figure 7: (a) Cropping performance for optimum location embedding, together with the extracted logo; α = 0.99, NCORR = 1, and SNR = 39.23 dB. (b) Cropping of the top-embedded Cameraman image, together with the extracted text; α = 0.99. The embedded text reads: "Quasi blind data hiding and watermarking technique. A natural preserving transformation-based technique. This paper describes a novel data hiding and watermarking technique. The proposed method is NPT-based one. Authors: Fahmy, Fahmy, and Sayed."

determines the logo. This result should be compared with approximately NCORR = 0.99 for the composite technique of [15], at most NCORR = 0.9063 for the curvelet method [16], and roughly NCORR = 0.749 for the VQ technique of [14].

4.2. Robustness to Compression Attacks. To verify that the watermarking logo can still be easily identified in the presence of compression, the watermarked image Awm is compressed using the SPIHT coder/decoder [28], implemented with different numbers of bits per pixel (bpp). Figure 8 compares the nonblind NCORR of the extracted logos versus the compression rate (bpp) used to represent the watermarked Lena image Awm, for the three embedding techniques, evaluated for different values of α. These results indicate that embedding the logos near the corners of the host image improves robustness to compression attacks, since the Hartley matrix concentrates the energy near

Figure 8: Compression performance (NCORR versus bits per pixel) of the 3 embedding schemes (bottom, top, and optimum location) for α = 0.95 and α = 0.93, together with the extracted logos for top embedding at 0.8 bpp (NCORR = 0.912 and 0.933, respectively).

the 4 corners of the host image (as described before). The results also indicate that the top embedding case competes well with the other techniques, especially as α (denoted a in the figure) decreases.

4.3. Robustness to Noise Attacks. In this simulation, the watermarked image Awm is contaminated with zero-mean AWGN as well as salt and pepper noise. The simulation is performed for 10 independent noise realizations with different seeds, and the extracted logos are averaged over these 10 simulations. Figure 9 compares the normalized correlation for both top and bottom embedding when the watermarked image is mixed with AWGN of different powers. Figure 10 shows the watermarked images as well as the extracted logos when corrupted by AWGN yielding SNR = 15 dB and by salt and pepper noise with noise density D = 0.5 (note that α is denoted a in the figure). These results compete with the composite approach of [15] and are far superior to the curvelet technique in [16], which achieves at most NCORR = 0.52 under the AWGN attack.

4.4. Online Implementation. Owing to the simplicity of the proposed NPT technique, it has been implemented on a digital signal processor (DSP) board, the TMS320C6416T DSP starter kit (DSK). This board has 512 KB of flash memory, 16 MB of SDRAM, and a C6000 floating-point digital signal processor running at 225 MHz. Figure 11 shows an example of the watermarked image (unmasked and masked with the logo), along with the extracted logo. We note that, because of memory restrictions on the DSK board, the size of the logo on the board was limited. We also note that, due to the finite representation of floating-point numbers on the DSK, the technique suffers from some truncation noise. Table 1 lists different sizes of the host and watermark images along with the corresponding embedding and extraction times.
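A rough sketch of the noise-robustness experiment of Section 4.3, assuming the extraction helpers from the previous sketches: zero-mean AWGN scaled to a target SNR is added to the watermarked image, the logo is re-extracted nonblindly, and NCORR is averaged over 10 seeds. Function and parameter names are illustrative only, not the authors' code.

```python
import numpy as np

def awgn(img, snr_db, rng):
    # Zero-mean white Gaussian noise scaled to the requested SNR (in dB)
    noise_power = np.mean(img ** 2) / (10.0 ** (snr_db / 10.0))
    return img + rng.normal(0.0, np.sqrt(noise_power), size=img.shape)

def mean_ncorr_under_awgn(Awm, S, logo, alpha, snr_db, trials=10):
    # Average NCORR of the nonblind extraction over independent noise seeds;
    # assumes extract_corner_nonblind() and ncorr() from the earlier sketch are in scope.
    scores = []
    for seed in range(trials):
        rng = np.random.default_rng(seed)
        noisy = awgn(Awm, snr_db, rng)
        w_ex = extract_corner_nonblind(noisy, S, *logo.shape, alpha)
        scores.append(ncorr(logo, w_ex))
    return float(np.mean(scores))
```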

Figure 9: Comparison of NCORR for the top and bottom embedding cases in noisy environments at different SNRs, for two α values, 0.95 and 0.93.

Figure 10: Typical performance of the top embedding case with α = 0.95 under a salt and pepper attack (D = 0.05, NCORR = 0.9) and an AWGN attack (SNR = 15 dB, NCORR = 0.938).

Figure 11: The unmasked (a) and masked (b) watermarked images, together with the extracted logo, on the DSK board.

Table 1: Simulation results for the DSK using the proposed nonblind and blind techniques.

Host image size    Watermark size    Embedding time    Extraction time    Watermarked PSNR (dB)
64 × 64 20 × 20 0.13 s 0.84 s 25.2
128 × 128 32 × 32 0.23 s 1.69 s 27.3
100 × 100 100 × 11 0.34 s 2.82 s 23.4
140 × 140 50 × 70 0.45 s 4.4 s 24.3

Figure 12: Blind logo watermarked Lena image (PSNR = 34.35 dB) with the extracted resized logo. The logo is embedded as the last 6 bottom lines, α = 0.99.

Figure 13: Blind logo watermarked (reshuffled) Lena image (PSNR = 25 dB) with the extracted resized logo. The logo is embedded as the last 21 bottom lines, α = 0.99.

4.5. Blind Noisy Experiment. Figures 12 and 13 show further results of our proposed blind watermarking technique. The quality of the extracted logo in a noisy environment is high if no data shuffling happens, as described in Section 3.2.2. One can clearly see the tradeoff between making the watermark logo more robust in a noisy environment (smaller α value) and making the watermarked image more similar to the original (higher α value).

5. Discussion

We tested our approach against cropping, noise, and compression attacks, as evident in Section 4. These attack-resistance results apply to both the nonblind and quasiblind logo extraction cases, since in the quasiblind case the only difference is the duplication of some host image columns, as described in Section 3.2.2. We note here that the proposed approaches deliver an extracted watermark that is not only perfect or near perfect but can also be inspected visually by the user, which gives the application more user confidence and trust. This could be exploited in applications where the user needs to examine the watermark visually for extra assurance and security, such as law enforcement security applications. The proposed technique can also be used for other data hiding applications, where the watermark logo image can be any type of data or side information. It is

beyond the scope of this paper to examine the effect of rotation, translation, or scaling of the watermarked image on the proposed technique.

6. Concluding Remarks

The paper presents how logos and watermarks can be efficiently embedded using an NPT-based technique. The watermark is highly invisible and robust against cropping, compression, and noise attacks. An efficient fast least squares algorithm is also described for watermark extraction, for both the nonblind and quasiblind cases. In the nonblind case, the extraction algorithm assumes prior knowledge of the host image, whereas in the quasiblind case only very little information about the host image is needed. Simulation and practical implementation results have proven the robustness of this technique to attacks, especially when the logo is embedded near the corners of the host image. The authors would like to acknowledge the helpful and constructive comments from the reviewers. This work is funded in part by the Ministry of Communication and Information Technology, Egypt (ITIDA).

References

[1] N. Nikolaidis and I. Pitas, “Robust image watermarking in the spatial domain,” Signal Processing, vol. 66, no. 3, pp. 385–403, 1998.
[2] M.-S. Hwang, C.-C. Chang, and K.-F. Hwang, “A watermarking technique based on one-way hash functions,” IEEE Transactions on Consumer Electronics, vol. 45, no. 2, pp. 286–294, 1999.
[3] J. R. Hernandez, M. Amado, and F. Perez-Gonzalez, “DCT-domain watermarking techniques for still images: detector performance analysis and a new structure,” IEEE Transactions on Image Processing, vol. 9, no. 1, pp. 55–68, 2000.
[4] M.-S. Hsieh, D.-C. Tseng, and Y.-H. Huang, “Hiding digital watermarks using multiresolution wavelet transform,” IEEE Transactions on Industrial Electronics, vol. 48, no. 5, pp. 875–882, 2001.
[5] A. M. Ahmed, Digital Image Watermarking Using Fuzzy Logic and Naturalness Preserving Transform, Ph.D. thesis, Kansas State University, Manhattan, Kan, USA, 2004.
[6] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, “Secure spread spectrum watermarking for multimedia,” IEEE Transactions on Image Processing, vol. 6, no. 12, pp. 1673–1687, 1997.
[7] S. Joo, Y. Suh, J. Shin, and H. Kikuchi, “A new robust watermark embedding into wavelet DC components,” ETRI Journal, vol. 24, no. 5, pp. 401–404, 2002.
[8] C.-T. Hsu and J.-L. Wu, “Multiresolution watermarking for digital image,” IEEE Transactions on Circuits and Systems II, vol. 45, no. 8, pp. 1097–1101, 1998.
[9] C.-S. Lu, S.-K. Huang, C.-J. Sze, and H.-Y. M. Liao, “Cocktail watermarking for digital image protection,” IEEE Transactions on Multimedia, vol. 2, no. 4, pp. 209–224, 2000.
[10] W. Zeng and B. Liu, “A statistical watermark detection technique without using original images for resolving rightful ownerships of digital images,” IEEE Transactions on Image Processing, vol. 8, no. 11, pp. 1534–1548, 1999.
[11] S. Stanković, I. Djurović, and L. Pitas, “Watermarking in the space/spatial-frequency domain using two-dimensional Radon-Wigner distribution,” IEEE Transactions on Image Processing, vol. 10, no. 4, pp. 650–658, 2001.
[12] Y. Wang and A. Pearmain, “Blind image data hiding based on self reference,” Pattern Recognition Letters, vol. 25, no. 15, pp. 1681–1689, 2004.
[13] P. H. W. Wong, O. C. Au, and Y. M. Yeung, “A novel blind multiple watermarking technique for images,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 8, pp. 813–830, 2003.
[14] Y. H.-C. Wu and C.-C. Chang, “A novel digital image watermarking scheme based on the vector quantization technique,” Computers & Security, vol. 24, pp. 460–471, 2005.
[15] E. First and X. Qi, “A composite approach for blind grayscale logo watermarking,” in Proceedings of the IEEE International Conference on Image Processing (ICIP ’07), pp. 265–268, 2007.
[16] T. D. Hien, I. Kei, H. Harak, Y.-W. Chen, Y. Nagata, and Z. Nakao, “Curvelet-domain image watermarking based on edge-embedding,” in Proceedings of the 11th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES ’07), pp. 311–317, 2007.
[17] A. Paquet and R. Ward, “Wavelet-based digital watermarking for image authentication,” in Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 879–884, May 2002.
[18] J. Fridrich, “A hybrid watermark for tamper detection in digital images,” in Proceedings of the International Symposium on Signal Processing and Applications, pp. 301–304, August 1999.
[19] C.-Y. Lin and S.-F. Chang, “Semi-fragile watermarking for authenticating JPEG visual content,” in Security and Watermarking of Multimedia Contents II, vol. 3971 of Proceedings of SPIE, pp. 140–151, January 2000.
[20] C.-Y. Lin and S.-F. Chang, “A robust image authentication method distinguishing JPEG compression from malicious manipulation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 2, pp. 153–168, 2001.
[21] E. Koch and J. Zhao, “Towards robust and hidden image copyright labeling,” in Proceedings of the IEEE Workshop on Nonlinear Signal and Image Processing, pp. 452–455, June 1995.
[22] Q. B. Sun and S. F. Chang, “Semi-fragile image authentication using generic wavelet domain features and ECC,” in Proceedings of the IEEE International Conference on Image Processing, vol. 2, pp. 901–904, September 2002.
[23] R. Yarlagadda and J. Hershey, “Natural preserving transform for image coding and reconstruction,” IEEE Transactions on Acoustics, Speech, & Signal Processing, vol. 33, no. 4, pp. 1005–1012, 1985.
[24] D. D. Day and A. M. Ahmed, “A modified natural preserving transform for data hiding and image watermarking,” in Proceedings of the 4th Joint Conference on Information Sciences of IASTED International Conference Signal and Image Processing, pp. 44–48, 2003.
[25] M. Fahmy, G. Raheem, O. Mohammed, O. Fahmy, and G. Fahmy, “Watermarking via bspline expansion and natural preserving transform,” in Proceedings of the IEEE International Symposium on Signal Processing and Information Technology, Sarajevo, Bosnia, December 2008.
[26] M. F. Fahmy, O. M. Fahmy, and G. Fahmy, “A quasi blind watermark extraction of watermarked natural preserve

transform images,” in IEEE International Conference on Image Processing (ICIP ’09), November 2009.
[27] J. W. Daniels, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs, NJ, USA, 1988.
[28] A. Said and W. A. Pearlman, “A new, fast, and efficient image
codec based on set partitioning in hierarchical trees,” IEEE
Transactions on Circuits and Systems for Video Technology, vol.
6, no. 3, pp. 243–250, 1996.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 738972, 6 pages
doi:10.1155/2010/738972

Research Article
Reversible Watermarking Using Statistical Information

Ahmad Mahmoudi Aznaveh,1 Farah Torkamani-Azar,1 Azadeh Mansouri,1


and Fatih Kurugollu (EURASIP Member)2
1 Electrical & Computer Engineering Faculty, Shahid Beheshti University, G.C., Tehran 1983963113, Iran
2 The Institute of Electronics, Communications and Information Technology, Queen’s University Belfast, Belfast BT3 9DT, UK

Correspondence should be addressed to Farah Torkamani-Azar, [email protected]

Received 1 December 2009; Accepted 9 March 2010

Academic Editor: Robert W. Ives

Copyright © 2010 Ahmad Mahmoudi Aznaveh et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

In most reversible watermarking methods, a compressed location map is exploited in order to ensure reversibility. In addition, in some methods a header containing critical information is appended to the payload for the extraction and recovery process. Such schemes have a highly fragile nature; that is, changing a single bit in the watermarked data may prohibit recovery of the original host as well as of the embedded watermark. In this paper, we propose a new scheme in which the use of a compressed location map is completely removed. In addition, the amount of auxiliary data is decreased by employing adjacent-pixel information. Therefore, in addition to quality improvement, independent authentication of different regions of a watermarked image is possible.

1. Introduction

Reversible watermarking, also called lossless data hiding, embeds the watermark data into a digital image in a reversible manner; that is, one can restore the original image without any degradation. Many techniques, focusing on capacity-distortion improvement, have been proposed during the last decade. In most of the previous work, channel degradation is not allowed; as a result, such schemes are highly fragile. This limits the usability of reversible watermarking to lossless environments only.

Lossless data embedding can be classified into the following categories: the first category utilizes additive spread spectrum [1, 2]; the second compresses selected image features to create vacancy [3, 4] and employs this spare space for embedding; the third group, namely expansion-based methods [5–16], embeds the watermark data in some features by expanding them, these features being created by some decorrelation operator; and finally, some methods are based on histogram modification [17, 18], in which peak/zero points of the histogram, either in the spatial domain or in a transform domain, are utilized for embedding.

Most of the existing reversible watermarking algorithms have a highly fragile nature in the sense that changing a single bit in the watermarked media may prevent the hidden data from being extracted. In addition, the restoration process would fail as well. This restricts the use of reversible data hiding to cases in which there is complete control over the watermarked data. Therefore, due to emerging applications of reversible watermarking schemes [19, 20], it makes sense to extend its scope to lossy environments.

The rest of this paper is organized as follows: Section 2 reviews the concept of robust reversible watermarking. In the next section, our proposed method is introduced. The experimental results are presented in Section 4, and the conclusion is drawn in Section 5.

2. Robustness Concept in Reversible Watermarking

Most reversible watermarking methods presented so far have a highly fragile nature; nevertheless, there are some methods which can be deemed semifragile techniques. In [21], Kalker and Willems provided a theoretical analysis of reversible data hiding techniques in which robust reversible watermarking is interpreted in three ways: firstly, it can refer to robustness of the embedded watermark. Next, it can refer

to reversibility of the host signal, and finally, both payload and reversibility are considered. They focused on the third option and concluded that a robust reversible data hiding scheme exploits the side information available from the received data and also error correcting codes.

De Vleeschouwer et al. proposed a modulo-256 addition based on the classical patchwork algorithm in order to achieve a reversible watermarking scheme [2]. Firstly, they map the histogram of each zone to a circle; then, instead of the concept of average value, the position of the histogram on the circle is used as the discriminating factor. As a result, unreliable retrieval caused by wrapped-around pixels impacting the average value is avoided. Due to embedding in nonoverlapping blocks and the small size of the auxiliary data, the recovery of a fragment of the payload using a grid alignment is possible for cropped images. In addition, this method, along with message repetition, is robust against JPEG compression, but only for extracting the payload. This scheme, however, is not free from salt-and-pepper noise. In addition, due to block-based embedding, the capacity is very low.

Ni et al. [22] proposed another semifragile reversible data hiding scheme based on the patchwork algorithm. They classified each block into four different categories and use different embedding schemes in order to avoid overflow and underflow. Their method employs error correction codes (BCH) in order to overcome some ambiguity raised during embedding and also to provide robustness against JPEG compression.

In order to enlarge the scope of reversible watermarking, a joint marking procedure is proposed in which a robust lossy watermark is first embedded and a reversible watermarking scheme is employed in the next stage [23]. The information needed to invert back the marked media should be stored during the reversible embedding. In case of no attack, the original image can be recovered. It should be noted that, due to the high embedding capacity required in this method, the achieved quality is not acceptable.

Among different reversible watermarking schemes, the expansion-based methods have received more attention because they offer the highest embedding capacity along with the lowest quality degradation. However, a location map is needed to determine the positions of the expanded values. This location map should be compressed in order to decrease its influence on embedding capacity. As a result, many efforts have been made to decrease the size of the location map. Using a location map, however, has some other shortcomings; for example, a single bit modification may break the entropy decoder synchronization. Furthermore, it is not possible to employ such an embedding scheme in a block-based manner.

As a result, it is necessary to remove the location map in order to prepare a less fragile method. There are some methods which do not use a location map. In [24], the locations of expanded values are determined based on some statistical analysis. However, it needs to transfer some information to the recipient separately. As a result, it cannot be considered a blind scheme.

Coltuc and Chassery proposed a method in which the use of the location map is removed too [12]. Robustness against cropping can be achieved by distributing and storing the auxiliary data close to the corresponding pixel pairs. Still, the size of the auxiliary data, especially in case of using distortion control, is a major problem. In addition, in order to prevent some decoding ambiguity, the decoder should be informed about the error threshold in advance. They generalized their method in [11], in which a high capacity can be achieved in a single pass of embedding. However, the watermarked image quality is not satisfactory: the degradation is more annoying in edges. Moreover, in case of capacity control, the amount of auxiliary data which is used to specify the noncontainer elements increases intensively. In [13], an improved version of [11] is presented. However, due to dependencies in the decoding process, it cannot be deemed robust against cropping. In addition, these dependencies may prevent the extraction of the watermark and the recovery of the original image. In [14], this problem is solved through considering an extra state in the embedding phase. Still, this method cannot be robust against cropping.

From the robustness point of view introduced in [21], most of the presented semifragile reversible watermarking algorithms can be categorized in the first group, in which just the extraction of the payload is possible in case of lossy operation. Although the proposed methods in [11, 12] can recover the original image after cropping, the large amount of auxiliary data is the main disadvantage of these methods. In [15], we proposed an alternative transform to increase the quality of the marked image by employing the checkerboard structure. In [16], we expanded the algorithm by utilizing a new error control strategy to decrease the size of auxiliary data. Since this kind of reversible embedding algorithm does not employ a location map and the size of auxiliary data is negligible, it can be used to make the method more robust against some geometrical attacks.

3. Proposed Method

Toward overcoming the highly fragile nature of reversible algorithms, two possible solutions are considered. The first one uses block-based embedding; therefore, in case of some attacks such as cropping, it is possible to extract the embedded watermark using a grid alignment. In this case, the proper grid alignment should be recovered through an exhaustive search [2]; in addition, a low embedding capacity is another disadvantage of this group of algorithms. The other way is to design a reversible method in which the dependency on a location map or any auxiliary data becomes restricted. For example, by using statistical information which can be retrieved from the watermarked media, it is possible to determine the location of container elements. It should be noted that this statistical information should be the same in both the original and the watermarked image. This self-dependent reversible watermarking can be used as a way to decrease the fragile nature of reversible methods.

To decrease the dependency on auxiliary data and toward designing a self-dependent reversible algorithm, we proposed to use a checkerboard structure [15]. This structure is utilized in order to better employ the spatial correlation of
an image. Half of the pixels are used for embedding and the remaining half is kept intact, which can be employed for specifying the location of the embedded information. The embedding is done through (1):

Pwkd = P + (An dn + As ds + Ae de + Aw dw) + [w],   (1)

where d∗ = P − P∗, ∗ ∈ {n, s, e, w}.
In (1), Pwkd is the marked pixel, and Pn, Ps, Pe, and Pw are the northern, southern, eastern, and western adjacent pixels, respectively, as depicted in Figure 1.

Figure 1: The structure of adjacent pixels in the embedding process (the pixel P surrounded by its four neighbors Pn, Pw, Pe, and Ps).

Furthermore, An, As, Ae, and Aw indicate the contribution factors of the corresponding neighbors. By increasing the contribution factors, it is possible to embed more at the expense of higher degradation. Due to the high spatial correlation in natural images, it is expected that the differences between P and its neighborhood will be small (which is the basic assumption in different expansion-based methods).
In the extraction phase, it is sufficient to compute the weighted sum of the pixel group, which is illustrated in (2):

P̂ = Pwkd + (An Pn + As Ps + Ae Pe + Aw Pw) = M × P + [w],   (2)

where M = (An + As + Ae + Aw + 1).
The sum of adjacent pixels should remain intact so that the watermark extraction and recovery of the original host become possible; consequently, for embedding in just half of the pixels, a checkerboard structure is constructed.
In (2), P̂ and w are congruent modulo M. As a result, it can be concluded from (2) that M symbols can be embedded in each pixel in a reversible manner. Therefore, the raw capacity is (log2 M)/2 bpp. To achieve more capacity, it is possible to repeat the embedding procedure by changing the roles of the two pixel groups.
As mentioned before, the sum of differences is often extremely small; however, in some cases it may cause overflow or underflow. As a result, the transform can be applied only when the intensity interval does not change:

0 ≤ P + (An dn + As ds + Ae de + Aw dw) + [w] ≤ 255.   (3)

In this case, it is necessary to distinguish between transformable and nontransformable pixels. Therefore, similar to [12], one symbol is assigned to decide between the container elements and noncontainers. Since one can embed M symbols [0, M − 1] while keeping the ability of recovering the original value, we reserve the symbol '0' for noncontainer elements; hence, the watermark symbol can be chosen from [1, M − 1]. On the other hand, the range of the auxiliary data may fall outside of this range, since in case of overflow it is necessary to use negative corrective data. Thus, a prefix is utilized for encoding the range [−M + 1, M − 1] with M − 1 symbols.
Since the transform in (1) employs the image structure more fittingly, it can better preserve the image quality, especially in edges, in comparison to similar methods [11-13]; this is because the human visual system is adapted to extract the structural information of a viewing scene [25].
As mentioned before, the pixels which are not capable of embedding the watermark should be changed so as to be recognized at the decoder side. Therefore, the increase of the auxiliary data has a doubled effect on capacity: on the one hand, there are fewer embeddable pixels; besides, the auxiliary data occupy a portion of the capacity:

C = (E / 2N) log2(M) − ((N − E) / 2N) log2(M + 1),   (4)

where E is the number of embeddable pixels and N represents the number of pixel pairs.
In this case, missing the auxiliary data due to a synchronization attack prevents the algorithm from extracting the payload and reverting back to the original media. Thus, increasing the amount of auxiliary data will increase the fragility of the reversible watermarking method.
The amount of auxiliary data, especially after applying distortion control, increases dramatically, which deteriorates the situation. As a result, to present a less fragile reversible method, reducing the amount of auxiliary data is unavoidable.
Figure 2 illustrates that there is a high correlation between the variance of adjacent pixels and the resulting error after applying the transform (1), due to the spatial correlation of natural images. As presented in [15], half of the watermarked pixels remain intact in the proposed method, and the variance of adjacent pixels can estimate the introduced error; we decide to use it as a distortion control parameter:

VARP = (1/3) Σ∗∈{n,w,s,e} (Pavr − P∗)²,   (5)

where Pavr is (Pn + Pw + Pe + Ps)/4. In order to reduce the amount of auxiliary data, the idea is to utilize only the pixels whose variance does not exceed a predefined threshold. The threshold can easily be found owing to the spatial correlation of natural images and the application demands.
Although utilizing this technique may exclude some transformable pixels, the decrease in the number of auxiliary data is more significant than the decrease in embeddable pixels. Therefore, the actual capacity is improved; furthermore, the dependency of the embedded payload on the auxiliary data is highly decreased. The relation between the embeddable pixels and the correction data for different test images is illustrated in Figure 3.
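To make the embedding and extraction steps concrete, the following Python sketch applies transform (1) to one pixel and inverts it with the weighted sum (2). It is only an illustration of the equations above: the function names are ours, and integer contribution factors are assumed so that the modulo-M separation is exact (the experiments in Section 4 use 1.5).

def embed_pixel(P, neighbors, w, A=(1, 1, 1, 1)):
    # Transform (1): Pwkd = P + sum(A_* d_*) + w, with d_* = P - P_*.
    # Returns None when condition (3) would be violated (non-transformable pixel).
    marked = P + sum(a * (P - p) for a, p in zip(A, neighbors)) + w
    return marked if 0 <= marked <= 255 else None

def extract_pixel(marked, neighbors, A=(1, 1, 1, 1)):
    # Weighted sum (2): marked + sum(A_* P_*) = M * P + w, with M = sum(A) + 1.
    # The neighbors belong to the untouched checkerboard class, so they are
    # identical at the embedder and the extractor.
    M = sum(A) + 1
    T = marked + sum(a * p for a, p in zip(A, neighbors))
    return T // M, T % M          # recovered pixel, extracted symbol

P, nbrs = 120, (118, 121, 119, 122)     # a pixel and its (N, S, E, W) neighbors
print(extract_pixel(embed_pixel(P, nbrs, w=3), nbrs))   # -> (120, 3)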
Figure 2: The relation between introduced error and standard deviation of adjacent pixels for Lena.

Figure 3: The relation of embeddable pixels to correction data (the ratio of embeddable pixels to auxiliary data versus the standard deviation of adjacent pixels, for Lena, Peppers, Cameraman, and Boat).

Figure 4: Performance comparison on test images (a) Lena, (b) Baboon, and (c) Cameraman (PSNR in dB versus capacity in bpp for the proposed method, [15], [14], and [11]).

All images are of size 256 × 256; with an error threshold equal to 10, the standard deviation varies from 30 to 5, as depicted in Figure 3. It is clearly shown that by restricting the variance, the ratio between the embeddable pixels and the auxiliary data is increased; therefore, the major part of the payload is taken by the watermark information rather than by the auxiliary data.
Consequently, by using the information from the intact pixels, the amount of auxiliary data is decreased significantly. In other words, it is possible to embed the watermark in some elements independently from the others. In this case, the scheme has the potential to be robust against synchronization attacks, because the extraction and recovery of a fragment of a picture can be done independently.
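The distortion control rule of (5) can be sketched in the same way: a pixel is kept as a container only if the spread of its four neighbors stays below a chosen threshold and transform (1) cannot overflow. This is an illustration only; the threshold value 10 below simply mirrors the error threshold quoted for Figure 3 and is not a prescription.

def neighbor_variance(neighbors):
    # VARP of (5): sample variance of the four adjacent pixels.
    avg = sum(neighbors) / 4.0
    return sum((p - avg) ** 2 for p in neighbors) / 3.0

def is_container(P, neighbors, threshold=10, A=(1, 1, 1, 1)):
    # Reject pixels whose neighbors vary too much (they would introduce a
    # large error) or for which embedding any symbol could leave [0, 255].
    if neighbor_variance(neighbors) ** 0.5 > threshold:
        return False
    base = P + sum(a * (P - p) for a, p in zip(A, neighbors))
    return 0 <= base and base + sum(A) <= 255   # symbols lie in [0, sum(A)]

print(is_container(120, (118, 121, 119, 122)))   # True: smooth neighborhood
print(is_container(120, (60, 200, 30, 240)))     # False: edge-like neighborhood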
Table 1: The number of auxiliary data in the proposed method.

              Lena                      Cameraman                 Baboon
    Capacity  Auxiliary data | Capacity  Auxiliary data | Capacity  Auxiliary data
    0.90      1421           | 0.83      2348           | 0.74      4502
    0.80      275            | 0.79      601            | 0.60      1659
    0.70      78             | 0.70      292            | 0.50      886
    0.60      22             | 0.63      122            | 0.31      211
    0.57      15             | 0.51      31             | 0.20      94
    0.50      9              | 0.36      18             | 0.10      23
    0.45      4              | 0.23      15             | 0.05      10
    0.35      3              | 0.11      7              | 0.03      3
    0.19      0              | 0.01      0              | 0.01      0

Table 2: Comparison of the number of auxiliary data between similar methods (NE: not embeddable).

    Capacity   Proposed   [15]     [11]      [12]
    0.9        1421       1635     1900      NE
    0.8        275        3450     3745      NE
    0.7        78         5314     5583      NE
    0.6        22         7245     7896      NE
    0.5        9          9352     9426      190
    0.4        4          11195    11089     2933
    0.3        3          12878    13975     6124
    0.2        0          15126    >17045    10520
    0.1        0          16850    17045     11590

4. Experimental Results

In this section, firstly, the performance of our method is evaluated in terms of capacity and distortion. Then, the results in case of some lossy environments are explored.
We evaluate the results of location map-free difference expansion (DE) based methods by comparing capacity versus distortion. The results are illustrated in Figure 4. For a fair comparison, the same expansion amount is utilized. For the proposed method and [15], 1.5 is considered as the contribution factor, while in [11, 23] the simulation is performed for n = 3. In this case, the expansion amounts are similar in all cases. The experiments are conducted on benchmarks of size 256 × 256.
As indicated, the proposed method outperforms the other methods, especially at low embedding bit rates, due to the decrease of the required auxiliary data. In this case, only a small portion of the payload is occupied by the auxiliary data.
As depicted in the above figures, decreasing the size of the auxiliary data restricts the introduced distortion. On the other hand, it decreases the dependency of the embedded watermark on the auxiliary data as well.
The number of auxiliary data for different benchmarks with different capacities is illustrated in Table 1. As clearly shown, the required auxiliary data is negligible, especially at low embedding bit rates.
We also compare our method with similar schemes in terms of auxiliary data. The auxiliary data required by different location map-free reversible watermarking methods for Lena are depicted in Table 2. The proposed method produces less auxiliary data than the other location map-free schemes. Since the watermark extraction and recovery process for each part can be done independently, the recovery of a fragment of the watermarked image is possible. Therefore, the proposed method allows robustness against cropping. It is worth noting that it is possible to recover parts of an image without considering grid alignment.

5. Conclusion

A location map-free reversible watermarking scheme is proposed. Since the information of adjacent pixels is utilized, the size of the auxiliary data decreases. Therefore, the quality of the proposed method is improved significantly, especially at low embedding rates. Furthermore, the negligible amount of auxiliary data provides robustness against some geometric attacks such as cropping, in which, in addition to extracting the embedded watermark, the original image can be recovered.

Acknowledgment

The authors would like to thank ITRC (Iran Telecommunication Research Center) for partially supporting this research.

References

[1] B. Macq, "Lossless multiresolution transform for image authenticating watermarking," in Proceedings of the 10th European Signal Processing Conference (EUSIPCO '00), Tampere, Finland, September 2000.
[2] C. De Vleeschouwer, J. F. Delaigle, and B. Macq, "Circular interpretation of bijective transformations in lossless watermarking for media asset management," IEEE Transactions on Multimedia, vol. 5, no. 1, pp. 97-105, 2003.
[3] J. Fridrich, M. Goljan, and R. Du, "Lossless data embedding for all image formats," in Security and Watermarking of Multimedia Contents IV, Proceedings of SPIE, pp. 572-583, San Jose, Calif, USA, February 2002.
[4] M. U. Celik, G. Sharma, A. M. Tekalp, and E. Saber, "Lossless generalized-LSB data embedding," IEEE Transactions on Image Processing, vol. 14, no. 2, pp. 253-266, 2005.
[5] J. Tian, "Reversible data embedding using a difference expansion," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 8, pp. 890-896, 2003.
[6] L. Kamstra and H. J. A. M. Heijmans, "Reversible data embedding into images using wavelet techniques and sorting," IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2082-2090, 2005.
[7] D. M. Thodi and J. J. Rodríguez, "Expansion embedding techniques for reversible watermarking," IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 721-730, 2007.
[8] S. Lee, C. D. Yoo, and T. Kalker, "Reversible image watermarking based on integer-to-integer wavelet transform," IEEE Transactions on Information Forensics and Security, vol. 2, no. 3, pp. 321-330, 2007.
[9] Y. Hu, H. K. Lee, and J. Li, "DE-based reversible data hiding with improved overflow location map," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 2, pp. 250-260, 2009.
[10] V. Sachnev, H. J. Kim, J. Nam, S. Suresh, and Y. Q. Shi, "Reversible watermarking algorithm using sorting and prediction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 7, pp. 989-999, 2009.
[11] D. Coltuc and J.-M. Chassery, "High capacity reversible watermarking," in Proceedings of the IEEE International Conference on Image Processing (ICIP '06), pp. 2565-2568, Atlanta, Ga, USA, October 2006.
[12] D. Coltuc and J.-M. Chassery, "Very fast watermarking by reversible contrast mapping," IEEE Signal Processing Letters, vol. 14, no. 4, pp. 255-258, 2007.
[13] D. Coltuc, "Improved capacity reversible watermarking," in Proceedings of the International Conference on Image Processing (ICIP '07), vol. 3, pp. 249-252, October 2007.
[14] M. Chaumont and W. Puech, "A high capacity reversible watermarking scheme," in Visual Communications and Image Processing, Proceedings of SPIE, San Jose, Calif, USA, February 2009, 72571H-9.
[15] A. Mahmoudi Aznaveh, A. Mansouri, and F. Torkamani-Azar, "A new approach in reversible watermarking," in Proceedings of the 8th International Workshop on Digital Watermarking (IWDW '09), vol. 5703 of Lecture Notes in Computer Science, pp. 241-251, Guilford, UK, 2009.
[16] A. Mahmoudi Aznaveh, F. Torkamani-Azar, and A. Mansouri, "Toward quality improvement in location map free reversible watermarking," in Proceedings of the 10th Pacific Rim Conference on Multimedia (PCM '09), vol. 5879 of Lecture Notes in Computer Science, pp. 867-876, Bangkok, Thailand, December 2009.
[17] Z. Ni, Y. Q. Shi, N. Ansari, and W. Su, "Reversible data hiding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 3, pp. 354-362, 2006.
[18] W.-L. Tai, C.-M. Yeh, and C.-C. Chang, "Reversible data hiding based on histogram modification of pixel differences," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 6, pp. 906-910, 2009.
[19] X. Zhang and S. Wang, "Fragile watermarking with error-free restoration capability," IEEE Transactions on Multimedia, vol. 10, no. 8, pp. 1490-1499, 2008.
[20] D. Coltuc, "On stereo embedding by reversible watermarking," in Proceedings of the International Symposium on Signals, Circuits and Systems (ISSCS '07), pp. 1-4, August 2007.
[21] T. Kalker and F. M. J. Willems, "Capacity bounds and constructions for reversible data-hiding," in Security and Watermarking of Multimedia Contents V, Proceedings of SPIE, pp. 604-611, Santa Clara, Calif, USA, February 2003.
[22] Z. Ni, Y. Q. Shi, N. Ansari, W. Su, Q. Sun, and X. Lin, "Robust lossless image data hiding designed for semi-fragile image authentication," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 4, pp. 497-509, 2008.
[23] D. Coltuc and J.-M. Chassery, "Distortion-free robust watermarking: a case study," in Security, Steganography, and Watermarking of Multimedia Contents IX, Proceedings of SPIE, San Jose, Calif, USA, March 2007, 65051N-8.
[24] H. L. Jin, M. Fujiyoshi, and H. Kiya, "Lossless data hiding in the spatial domain for high quality images," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E90-A, no. 4, pp. 771-777, 2007.
[25] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 782438, 11 pages
doi:10.1155/2010/782438

Research Article
On Converting Secret Sharing Scheme to
Visual Secret Sharing Scheme

Daoshun Wang and Feng Yi


Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Correspondence should be addressed to Daoshun Wang, [email protected]

Received 25 November 2009; Revised 28 April 2010; Accepted 4 July 2010

Academic Editor: Yingzi Du

Copyright © 2010 D. Wang and F. Yi. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Traditional Secret Sharing (SS) schemes reconstruct the secret exactly the same as the original one but involve complex computation. Visual Secret Sharing (VSS) schemes decode the secret without computation, but each share is m times as big as the original and the quality of the reconstructed secret image is reduced. Probabilistic visual secret sharing (Prob. VSS) schemes for a binary image use only one subpixel to share the secret image; however, the probability of white pixels in a white area is higher than that in a black area of the reconstructed secret image. SS schemes, VSS schemes, and Prob. VSS schemes have various construction methods and advantages. This paper first presents an approach to convert (transform) a (k, k)-SS scheme to a (k, k)-VSS scheme for greyscale images. The generation of the shadow images (shares) is based on the Boolean XOR operation. The secret image can be reconstructed directly by performing the Boolean OR operation, as in most conventional VSS schemes. Its pixel expansion is significantly smaller than that of VSS schemes. The quality of the reconstructed images, measured by average contrast, is the same as that of VSS schemes. Then a novel matrix-concatenation approach is used to extend the greyscale (k, k)-SS scheme to the more general case of a greyscale (k, n)-VSS scheme.

1. Introduction available. No information will be revealed with any k − 1


or fewer shares. VSS schemes, originally based on binary
A secret kept in a single information-carrier could be easily images, have been expanded to work with greyscale and color
lost or damaged. Secret Sharing (SS) schemes, called (k, n) images. In a (k, n)-VSS scheme, the computation complexity
threshold schemes, have been proposed since the late 1970s of reconstructing a secret image using k shadows in visual
to encode a secret into n pieces (“shadows” or “shares”) so cryptography is proportional to O(k) and proportional to the
that the pieces can be distributed to n participants at different size of the shadow images. Several (k, k)-VSS schemes have
locations [1, 2]. The secret can only be reconstructed from been designed for special k values [6–8]. In a VSS scheme,
k or more pieces (k ≤ n). Since Shamir’s scheme is every pixel of the original image is expanded to m subpixels
a basic secret sharing scheme and is easy to implement, in a shadow image. These m subpixels are referred to as pixel
it is commonly used in many applications. However, the expansion. The quality of the reconstructed secret image
computation complexity of Shamir’s scheme is O(k log2 k) is evaluated by contrast (denoted by α) in VSS schemes.
for the polynomial evaluation and interpolation in [3]. Pixel expansion m and contrast α are two factors to evaluate
Wang et al. [4] proposed a deterministic (k, k)-secret sharing a VSS scheme. Therefore, it is desirable to minimize m
scheme for greyscale images. That scheme uses simple and maximize α as much as possible. Much work has been
Boolean XOR operations and has no pixel expansion. The directed toward reducing the pixel expansion [9, 10]. Many
computation complexity of the reconstructed secret image of the previous schemes were primarily proposed for binary
is O(k). Visual secret sharing (VSS) schemes [5] have been images. A number of VSS schemes have also been proposed
proposed to encode a secret image into n “shadow” (“share”) for greyscale images [11–13]. The minimum pixel expansion
images to be distributed to n participants. The secret can of the (k, k)-VSS scheme for greyscale image in [13] is equal
be visually reconstructed only when k or more shares are to those in [11, 12], namely, m ≥ (g − 1) · 2k−1 , where g is
the number of different grey levels in the secret image. The to a (k, k)-VSS scheme for greyscale images. In our (k, n)-
deterministic VSS schemes mentioned above have achieved scheme, the pixel expansion is smaller than that of previous
minimum pixel expansion m and optimal contrast α = 1/m, deterministic (k, n)-VSS schemes [10, 11], when k ≥ n/4,
but the value of m can be still quite large, partly because m is k ≥ 4. The average contrast of our (k, n)-VSS scheme is close
proportional to the exponential of k. to that of deterministic (k, n)-VSS schemes [10, 11] when
To further reduce pixel expansion, a number of proba- k ≥ n/2, k ≥ 2.
bilistic VSS schemes (Prob.VSS schemes) have been proposed The rest of the paper is organized as follows. In Section 2,
in [14–16]. These schemes were designed for the case of g = we briefly review binary Prob. VSS scheme. Section 3
2, that is, for black and white images. In the reconstructed presents an approach to convert a greyscale (k, k)-SS scheme
secret image, the probability of white pixels in a white area is to a (k, k)-VSS scheme. In Section 4, we present a novel
higher than that in a black area. Therefore small areas, rather approach to extend the above (k, k)-SS scheme into a more
than individual pixels, of the secret image can be recovered general greyscale (k, n)-VSS scheme. Section 5 concludes the
accurately. With the trade-off in resolution, probabilistic paper.
schemes can achieve no pixel expansion (m = 1), and the
contrast is the same as the ones in the deterministic schemes.
Because the SS scheme, VSS scheme, and Prob. VSS 2. A Review of Probabilistic VSS Scheme
scheme use these different construction methods, it is Here, we briefly review probabilistic visual secret sharing
important to research the link (or relationship) among these scheme [14–16]. The following Definition 2.1 is directly from
three methods. Some studies have focused on describing the Yang’s scheme [15].
relationship of SS schemes and VSS schemes with respect
to pixel expansion and contrast. Cimato et al. [16] first Definition 2.1 (see [15]). A (k, n)-Prob. VSS scheme can be
proved that there exists a one-to-one mapping between shown as tow sets, white set C0 and black set C1 , consisting
binary VSS schemes and probabilistic binary VSS schemes of nλ and nγ n × 1 matrices, respectively. When sharing a
with no pixel expansion, where contrast is traded for the white (resp., black) pixel, the dealer first randomly chooses
probability factor. Yang et al. [17, 18] introduced secret image one n × 1 column matrix in C0 (resp., C1 ), and then randomly
sharing deterministic and probabilistic visual cryptograph selects one row of this column matrix to a relative shadow.
scheme (DPVCS), which is a two-in-one combination of VSS The chosen matrix defines the color level of pixel in every one
and PVSS schemes. Bonis and Santis [19] first analyzed the of the n shadows. A Prob. VSS Scheme is considered valid if
relationship between SS schemes and VSS schemes, focusing the following conditions are met.
attention on the amount of randomness required to generate
the shares. They proved that SS schemes for a set of secrets of (1) For these nλ (resp., nγ ) matrices in the set C0 (resp.,
size two binary SS schemes and VSS schemes are “equivalent” C1 ) the “OR”-ed value of any k-tuple column vector
with respect to the randomness. Lin et al. [20] presented an V is L(V ). There values of all matrices form a set λ
innovative approach to combine two VSS and SS scheme, (reps. γ).
the n shares are created for a given grey-valued secret image.
Each share includes both SS and VSS scheme information, (2) The two sets λ and γ satisfy that p0 ≥ pTH and
providing two options for decoding. So far the study of p1 ≤ pTH − α, where p0 and p1 are the appearance
relationships among SS, Prob. VSS, and VSS scheme has probabilities of the “0” (white color) in the set λ and
been focused mainly on the relationship between VSS and γ, respectively.
Prob. VSS scheme, the randomness relationship between
SS and VSS scheme, and the methods combining VSS and (3) For any subset with {i1 , i2 , . . . , iq } of {1, 2, . . . , n} with
SS scheme. However, another interesting topic of study q < k, the p0 and p1 are the same.
would be the relationship between SS and VSS schemes,
especially with regard to the underlying pixel expansion and The first two conditions are called contrast, and the
contrast. third is condition called security. From the above definition,
In this paper, we give the relationship between the (k, n)- the matrices in C0 and C1 are n × 1 matrices, so the pixel
SS scheme and (k, n)-VSS scheme with respect to pixel expansion is one.
expansion and contrast. We first propose a construction For conventional VSS schemes, a pixel in the original
approach to transform a traditional (k, k)-SS scheme to image is expanded to m subpixels and the number of white
a (k, k)-VSS scheme for greyscale images. That is, the subpixels of a white and black pixel is h and l. When stacking
generation of the shadow images is based on Boolean OR k shadows, we will have “m − h” B “h” W subpixels for a
and XOR operations, and the reconstruction process uses white pixel and “m − l” B “l” W subpixels for a black pixel.
Boolean OR operation, as in most other VSS schemes. In Hence, from the observation, if we use all the columns of the
our (k, k)-VSS scheme, the pixel expansion m is g − 1, much basis matrices S0 and S1 of a conventional VSS scheme as the
smaller than the (g − 1) · 2k−1 of traditional VSS scheme and n × 1 column matrices in the sets C0 and C1 , we can let the
independent of k. The quality of the reconstructed image, pixel appear in white color different probability instead of
measured in “Average Contrast” between consecutive grey expanding the original pixel to m subpixel and the frequency
levels, is 1/(g − 1) · 2k−1 , which is equal to that in the VSS of white pixel in white and black areas in the recovered image
schemes. Then we extend the traditional (k, k)-SS scheme will be p0 = h/m and p1 = l/m.
3. The Proposed Converting Method for secret image is O(k). The reconstructed secret image
a (k, k) Scheme needs to perform Boolean XOR operation described in
[15] while conventional VSS scheme performs Boolean
The purpose of this section is to show how to convert a OR operation. If a and b are integers, a ⊕ b can be
(k, k)-SS scheme to a (k, k)-VSS scheme. First, we give quality expressed in terms of OR and XOR operations as: a ⊕ b =
measures of the recovered secret image. Then we introduce OR (NOT (OR (NOT a, b)), NOT (OR (a, NOT b)) ). The
a seemingly simple but very valid method that can be used XOR operation can be performed by four NOT operations
easily to transform a greyscale image to a binary image. and three OR operations. Thus, the scheme described above
Finally, we prove that the proposed method for converting is more complex than VSS schemes based on OR operations.
the (k, k)-SS scheme to a (k, k)-VSS scheme is valid. In this case, we cannot directly use SS scheme of [15] to con-
struct a VSS scheme. A new approach must be constructed.
3.1. Quality Measurement of Recovered Secret Image. Since To address this, we propose a method to convert a greyscale
the existing probabilistic schemes were only proposed for secret image to a binary image. Then, we construct a (k, k)-
binary images, the contrast between black and white pixels VSS scheme to transform XOR operation to OR operation
was naturally chosen as an important measurement of based on scheme of [15]. The following subsection will
quality. The scheme we proposed is for greyscale images. introduce this new method to encode greyscale images into
We use the expected contrast between two pixels with binary images.
consecutive grey levels in the original image to indicate the
quality of reconstruction. This is referred to as “Average 3.3. New Encoding Method of Greyscale Image. Each pixel of
Contrast”, defined as follows. original image S can take any one of g different grey levels.
Let S = [si j ] be the φ × ϕ original secret image, i = S = [si j ]φ×ϕ , where i = 1, 2, . . . , φ, j = 1, 2, . . . , ϕ and si j ∈
1, 2, . . . , φ, j = 1, 2, . . . , ϕ, and si j ∈ {1, . . . , g }. Suppose that {1, . . . , g }. We have g = 2 for a binary image and g = 256
U = [ui j ] is the (mg · φ) × (mg · ϕ) reconstructed image, where for a greyscale image with one byte per pixel. In a greyscale
mg is the pixel expansion factor. For si j = l, l ∈ {1, . . . , g }, image with one byte per pixel, the pixel value can be an index
the corresponding pixel in Ucan be denoted as Ul = {ui j | to a color table, thus g = 256. In a color image using an RGB
si j = l}, l ∈ {1, . . . , g }. model, each pixel has three integers: R (red), G (green) and
The appearance of Ul depends on the Hamming weight B (blue). If each R, G or B takes value between 0 and 255, we
of the m dimensional vector. Because of the randomness have g = 2563 .
of the shadow images, H(U) is a random variable. We are In the construction of the shadow images, each pixel of S
interested in the average Hamming weight for all pixels Ul . is coded as a binary string of g − 1 bits. For si j = l, its coded
Let a(h)i j be the (i, j)th Boolean value in the hth shadow form is ci j = bgl−−11 = 0 g −l 1l−1 , which is a string of g − l zeros
image. Then the reconstruction results is and l − 1 ones. The order of the bits does not matter.
ui j = a(1) (2) (k)
i j + ai j + · · · + ai j . (1) Example 3.1. For example, b64−−11 can be written as 00111, or
The symbol “+” represents Boolean OR operation in formula 01101, or equivalently 11010.
(1). In other words, matrix U is Boolean OR operation of the
shares U = A1 + · · · + Ak . Note that the range of grey level for the original image
Let Pt = P(H({ui j = t | si j = l })) be the probability of and the reconstructed image pixels is from 1 to g, but the
H(Ul ) taking value t with t ∈ {1, . . . , g }, the expected value range of coded form, ci j , is from 0 to g − 1. Notation gives
g −1 a list of variable names for easy lookup.
of H(Ul ) is E(H({ui j = t | si j = l })) = t=0 t · Pt . We
Each pixel of C is expanded into g − 1 subpixels with
now define Average Grey βl and Average Contrast αl for the
a function T which converts a binary string of g − 1 bits
reconstructed image as
 ⎞    into a row vector of g − 1 components. Therefore, the pixel

H ui j | l E H ui j = t | si j = l expansion factor of this scheme is m = g − 1. Notice that
βl = E ⎝ ⎠= , this encoding method turns out to be a crucial part of
mg mg (2) construction.
 
αl = βl − βl−1 , l ∈ 2, . . . , g .
3.4. Construction of the Shares. Each pixel of C is expanded
3.2. Brief Review the (k, k)-SS Scheme Based on Boolean XOR into g − 1 subpixels with a function T which converts a binary
Operation. The (k, k)-SS scheme in [4] is deterministic and string of g − 1 bits into a row vector of g − 1 components.
the reconstructed image is exactly the same as the original Therefore, the pixel expansion factor of this scheme is mg =
one. A secret image S can share k shadows A1 , . . . , Ak . After g − 1.
obtaining all k shadows, we can perform XOR operations to Now, the description of the proposed scheme is given
recover the secret image A. in Algorithm 2.
The (k, k)-SS scheme in [4] for greyscale images is given
in Algorithm 1. 3.5. Proof of the Construction. In this section we will show
From Algorithm 1, the symbol “⊕” represents XOR that the quality of the scheme depends on the quality of
operation, the computation complexity of reconstructed the reconstructed image U. We now look at a pixel of the
Input: an integer k with k ≥ 2, and the secret image S.


Output: k distinct matrices A1 , . . . , Ak , called shadow images.
Construction: generate k − 1 random matrices B1 , . . . , Bk−1 , compute the shadow images as below:
A1 = B1 , A2 = B1 ⊕ B2 , . . . , Ak−1 = Bk−2 ⊕ Bk−1 , Ak = Bk−1 ⊕ S.
Revealing: S = A1 ⊕ A2 ⊕ · · · ⊕ Ak .

Algorithm 1
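The XOR chain of Algorithm 1 is straightforward to exercise in code. The sketch below (our own function names; 8-bit greyscale matrices and NumPy assumed) generates the k shadows and recovers the secret by XOR-ing them all; no proper subset of the shadows reveals the secret.

import numpy as np

def share_xor(secret, k, rng=np.random.default_rng()):
    # A1 = B1, A2 = B1 xor B2, ..., A(k-1) = B(k-2) xor B(k-1), Ak = B(k-1) xor S
    B = [rng.integers(0, 256, secret.shape, dtype=np.uint8) for _ in range(k - 1)]
    shadows = [B[0]]
    shadows += [B[h - 1] ^ B[h] for h in range(1, k - 1)]
    shadows.append(B[-1] ^ secret)
    return shadows

def reveal_xor(shadows):
    # S = A1 xor A2 xor ... xor Ak (all intermediate B's cancel out)
    out = shadows[0].copy()
    for s in shadows[1:]:
        out = out ^ s
    return out

secret = np.arange(16, dtype=np.uint8).reshape(4, 4)
assert np.array_equal(reveal_xor(share_xor(secret, k=4)), secret)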

Input: The secret image S, S = [si j ] in the coded form C = [ci j ]


Output: The shadow images D1 , . . . , Dk .
Share generation: Randomly generate k − 1 matrices R1 , . . . , Rk−1 of size (mg · φ) × (mg · ϕ),
where Rh = Xh , Xh ∈ {0, . . . , 2g −1 − 1}.
D1 = R1 ,
Dh = Rh−1 ⊕ Rh , h = 2, . . . , k − 1,
Dk = Rk−1 ⊕ C.
The basic construction matrix is U = [T(D1); T(D2); . . . ; T(Dk)], whose hth row is T(Dh), where the transform T converts a binary string of g − 1 bits into a row vector of g − 1 components. That is, T(Dh) = V(h) = (v1(h), . . . , vg−1(h)), h = 1, . . . , k. The hth row of the basic matrix is used to construct the share image Dh.
Revealing: U = D1 + · · · + Dk.

Algorithm 2
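For a single pixel, Algorithm 2 can be sketched as follows (our own function names; 0/1 vectors are used instead of packed (g − 1)-bit integers). Averaged over many pixels, the Hamming weight of the stacked result approaches (g − 1)·βl of Theorem 3.2, for example about 216 for g = 256, k = 3, l = 100.

import numpy as np

def encode_pixel(l, g):
    # Coded form of grey level l in {1, ..., g}: (g - 1) bits with
    # l - 1 ones and g - l zeros (the bit order is irrelevant).
    return np.array([1] * (l - 1) + [0] * (g - l), dtype=np.uint8)

def share_pixel(l, g, k, rng=np.random.default_rng()):
    # D1 = R1, Dh = R(h-1) xor Rh, Dk = R(k-1) xor C, as in Algorithm 2.
    c = encode_pixel(l, g)
    R = [rng.integers(0, 2, g - 1, dtype=np.uint8) for _ in range(k - 1)]
    return [R[0]] + [R[h - 1] ^ R[h] for h in range(1, k - 1)] + [R[-1] ^ c]

def reveal_pixel(D):
    # Stacking = bitwise OR of the k shadow vectors; the grey level is
    # judged from the Hamming weight of the result.
    u = D[0].copy()
    for d in D[1:]:
        u = u | d
    return int(u.sum())

g, k, l = 256, 3, 100
weights = [reveal_pixel(share_pixel(l, g, k)) for _ in range(2000)]
print(sum(weights) / len(weights))   # close to (1 - 1/2**(k-1))*(g - l) + (l - 1) = 216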

reconstructed image U = D1 + · · · + Dk . Theorem 3.2 states from the random matrices R1 , . . . , Rh−1 only and it must be
the average grey and average contrast of U. random.

Theorem 3.2. The proposed algorithm is a probabilistic (k, k)- Therefore, the proposed (k, k) scheme satisfies the secu-
VSS scheme with Pixel expansion mg = g − 1, Average Grey rity condition. That is, when fewer than k shadows are used,
   the original secret image C will not be revealed.
E H ui j = t | si j = l To show contrast, let mg be the pixel expansion, we have
βl = mg = g − 1 according to the construction of the shares above.
mg
   Since U = T(d1 ) + · · · + T(dk ) with “+” being Boolean

1 − 1/2k−1 g − l + (l − 1)   OR, we have
= , l ∈ 1, . . . , g ,
g −1 U = T(X1 ) + (T(X1 ) ⊕ T(X2 )) + · · · (T(Xk−2 ) ⊕ T(Xk−1 ))
(3)
+ (T(Xk−1 ) ⊕ T(s)).
and Average Contrast αl = βl − βl−1 = 1/(2k−1 · (g − 1)). (4)
Proof. To show security, since the random matrices Substituting T(Xi ) with Vi , i = 1, . . . , k − 1. We use variables
R1 , . . . , Rk−1 are all distinct, thus the matrices D1 , . . . , Dk V0 substitute T(s). We get
are also all distinct and all random, therefore each share
does not reveal any information of S and the security of the U = V1 + (V1 ⊕ V2 ) + · · · + (Vk−2 ⊕ Vk−1 ) + (Vk−1 ⊕ V0 ),
scheme is ensured. Then we will prove any k − 1 or fewer (5)
shares will not be obtained any information of C, that is:
Di 1 ⊕ Di 2 ⊕ · · · ⊕ D i h =
/ C for any set of integers {i1 , . . . , ih } Here, V0 is the coded from the original image S. That is, V0 =
when 1 ≤ h < k. We consider two cases. 0g −l 1l−1 for si j = l. Since V1 +(V1 ⊕V2 ) = V1 +V1 V2 = V1 +V2
and V1 + V2 + (V2 ⊕ V3 ) = V1 + V2 + V3 , we have
Case 1 (k ∈ {i1 , . . . , ih }). In this case, Dk ⊕ (⊕tj =s D j ) =  
C ⊕ Rk−1 ⊕ (⊕tj =s D j ) where ⊕tj =s D j means Ds ⊕ · · · ⊕ Dt Ul = ui j | si j = l = V1 + V2 + · · · + Vk−2
with s, . . . , t being the indices in i1 , . . . , ih besides n. Since   (6)
there are odd number of random matrices involved, at least + Vk−1 + (Vk−1 ⊕ V0 ), l ∈ 1, . . . , g .
one of them cannot be absorbed into zero matrix, thus This can be rewritten as
Di1 ⊕ Di2 ⊕ · · · ⊕ Dih must be random thus not equal to C.
Ul = U0 + Vk−1 + (Vk−1 ⊕ V0 ), (7)
Case 2 (k ∈/ {i1 , . . . , ih }). Since no matrix C involved in Di1 ⊕
Di2 ⊕· · ·⊕Dih to begin with, Di1 ⊕Di2 ⊕· · ·⊕Dih is constructed where U0 = V1 + V2 + · · · + Vk−2 .
We know that Vk−1 + (Vk−1 ⊕ V0 ) must have at least l − 1 By the definition of Average Grey and Average Contrast (2),
bits being 1. That is Vk−1 + (Vk−1 ⊕ V0 ) can be written as βl = E(H({ui j | si j = 1}))/g − 1, we have Average Grey
xg −l 1l−1 where each of the g − l bits, denoted by x, may take   
value 0 or 1. Therefore, Ul = {ui j | si j = l} = U0 + xg −l 1l−1 = E H ui j | si j = 1 1 1
y g −l 1l−1 also has at least l − 1 bits being 1. The probability for β1 = = = ,
g −1 3−1 2
each y bit to be 1 is p = 1 − 1/2k−1 since every of such bit
  
depends on k − 1 random matrices. The total number of 1’s E H ui j | si j = 2
among these g − l bits (the Hamming weight of the vector) 3/2 3 (13)
β2 = = = ,
is a random variable with a binomial distribution, and the g −1 3−1 4
expected value of the Hamming weight is   
E H ui j | si j = 3 2
  β3 = = = 1.
1     g −1 3−1
1− · g −l = p g −l . (8)
2k−1
Average Contrast
It follows that the expected Hamming weight of the entire
g − 1 vector is 1 1
α2 = β2 − β1 = , α3 = β3 − β2 = . (14)
  4 4
   1  
E H ui j | si j = l = 1− · g − l + (l − 1), We can reach the exactly same average contrast directly
2k−1
  from (11). The average contrast is the same as that of
l ∈ 1, . . . , g . Example 3.3.
(9)
The following Theorem 3.4 is directly from the result
Thus the Average Grey is of [15].
      
E H ui j | si j = l 1 − 1/2k−1 g − l +(l − 1) Theorem 3.4 (see [15]). In binary (k, k)-Prob.VSS scheme
βl = = . with m = 1 and the parameters threshold probability pTH =
m g −1 1/2k−1 and the contrast α = 1/2k−1 . Suppose that the secret
(10) image is black and white image, in our Theorem 3.2 above,
Pixel expansion mg = g − 1, Average Contrast αl = βl − βl−1 =
and the Average Contrast of the reconstructed image is
1/2k−1 · (g − 1). That is g = 2, we obtain m2 = 2 − 1 = 1, and
1 αl = βl − βl−1 = 1/2k−1 · (2 − 1) = 1/2k−1 . It is clear that values
αl = βl − βl−1 =  . (11) of pixel expansion and contrast of Theorem 3.2 above are same
2k−1 · g − 1
as those of Theorem 3.4.

3.6. The Minimum Size of Recognizable Regions. With a


Example 3.3 (continuation of Example 3.1). According to (9) probabilistic scheme, small regions (not individual pixels) of
of Theorem 3.2, we obtain the secret image are correctly reconstructed. The smaller such
   regions can be, the better this scheme is. We now discuss the
E H ui j | si j = 1 minimum size of the region that can be correctly recognized.
  Before examining a region of N pixels, we start with one
1   pixel taking grey level l, that is, si j = l. The reconstructed
= 1− · g − l + (l − 1)
2k−1 pixel is Ul = {ui j | si j = l} = xg −l 1l−1 , x ∈ {0, 1}. Let Yl
  be the Hamming weight of U, we have Yl = H(Ul ) ∈ {l −
1
= 1− · (3 − 1) + (1 − 1) = 1, 1, . . . , g − 1} and
22−1
    
E H ui j | si j = 2 g −l  g −l−t
P(Yl = l − 1 + t) = · pt · 1 − p , (15)
  t
1  
= 1− · g −l + (l − 1) (12)
2k−1 where p = 1 − 1/2k−1 . Clearly, Yl has a binomial distribution
 
1 3 with mean and variance being.
= 1− · (3 − 2) + (2 − 1) = ,
22−1 2 We have
        
E H ui j | si j = 3 μy = l − 1 + p g − l , δ 2y = g − l p 1 − p . (16)
 
1  
= 1− · g − l + (l − 1) Now we consider a group of N pixels with the same
2k−1 grey level l in the original image. Since all pixels are
  treated separately in the share generation, these N random
1
= 1− · (3 − 3) + (3 − 1) = 2. variables are independent and identically distributed (i.i.d.).
22−1
Therefore, the total visual effect of the region is closely related Table 1: Minimum region sizes of a binary image with the proposed
 greyscale (k, k)-VSS scheme or the scheme of [14].
to the Z = Ni=1 Yl(i) , and
⎛ ⎞
N N 
  D Black and white (2, 2) Black and white (3, 3)
E(Z) = E ⎝ Y (i) ⎠
= E Y (i) 0.00 9 27
l l
i=1 i=1 (17) 0.05 12 43
    0.10 15 75
= Nμ y = N p g − l + (l − 1) ,
0.15 19 169
where p = 1 − 1/2k−1 , 0.20 25 675
⎛ ⎞
N 
N   0.25 36
Var(Z) = Var ⎝ Y (i) ⎠
= Var y (i) l = Nσ 2 0.30 57
l y
i=1 i=1 (18) 0.35 100
    0.40 225
=N p 1− p g −l .
0.45 900
Based on Central Limit Theory, these binomial distribution
can be safely approximated by Gaussian distribution, and we
can obtain the lower bound for N. According to Empirical
Rule, about 99.73% of all values fall within three standard With p1 = 0 and p0 = 1/2k−1 , it becomes
deviations of the mean. Hence, to recognize a region of grey  
level l, the region size should satisfy 9 · p0 · 1 − p0
NYang >  2 . (24)
μl − 3σl > μl−1 + 3σl−1 + N · d, (19) p0 − d

where d determines the minimum separation between the


Table 1 gives some specific region sizes for various d values.
two distributions. That is
Comparing (22) and (24), it is immediate the following
      
N p g − l + (l − 1) − 3 N p 1 − p g − l two results.
   
> N p g − l + 1 + (l − 2) Result 1. The minimum size of a recognizable region
between grey level g and grey level g − 1 of the proposed
  
+ 3 N p 1 − p g − l + 1 + Nd, scheme is the same as that between black and white region
in the (k, k)-Prob.VSS scheme of the (k, n)-Prob.VSS scheme
     (20)
N −p + 1 − d > 3 N p 1 − p g − l in [16].
   Result 2. When our proposed scheme is applied to binary
+ 3 Np 1− p g −l +1 ,
images, that is, g = 2, its minimum region size is the same
√        as that in [15].
3 N · p 1− p · g −l + g −l+1
N> .
1− p−d
4. Converting a (k, k)-SS Scheme to
Therefore
⎛ ⎞2
a (k, n)-VSS Scheme
  g −l+ g −l+1
N > 9p 1 − p · ⎝ ⎠ . (21) We now extend the above (k, k)-VSS scheme for greyscale
1− p−d images into a (k, n)-VSS scheme.
Note that the range of original image pixel value is
slightly different from the range of its coded form, that is 4.1. Construction of the Shares. We give Example 4.1 to
si j ∈ {1, 2, . . . , g } and ci j ∈ {0, 1, . . . , g − 1}. When l = g, illustrate Algorithm 3.
the above inequality becomes
  Example 4.1 (continuation of Example 3.3). The greyscale
9p 1 − p (2, 3)-VSS scheme with g = 3. The three basic construction
N >  2 , (22)
1− p−d matrices for the three distinct (2, 2)-VSS schemes are
which indicates the minimum size of a recognizable region ⎛  ⎞
between grey level g and grey level g − 1. When g = 2, T d(1) |w  
⎜ ⎟ 3
the above is the minimum region size in a binary image. Bi(2,2) =⎝  ⎠, w = 1, . . . , . (25)
1 2
In the (k, n) probabilistic VSS scheme proposed in [15], the T d(2) |w
minimum region size is
⎛ ⎞2
p0 (1 − p0 ) + p1 (1 − p1 ) For example, ci j = 01, d(1) ∈ {10, 00, 01, 11}, we let d(1) |w =
NYang >9·⎝ ⎠ . (23) 00, or 10, or 11. The three basis matrices are listed in Table 2
p0 − p1 − d
as follows.
Input: The secret image S, S = [si j ] in the coded form C = [ci j ].


Output: The shadow images D1 , . . . , Dn .
Share construction procedure: For (k, n) scheme, we create a construction matrix with n rows from the k rows
of the construction matrix of the (k, k)-VSS scheme as described previously. We do it in four steps.
Step 1: Generate ( nk ) distinct construction matrices for ( nk ) different (k, k)-VSS schemes to the same secret image. Notice that the
n g −1 − 1)}. For the wth scheme, its construction matrix is
random matrices are Rh⎡= X(h), ⎤ X(h) ∈ {0, . . . , ( k ) · (2
⎛ ⎞ V1
(w)
T(D(1) |w )
⎜ .. ⎟ ⎢ .. ⎥
Bw(k,k) ⎠=⎢ ⎥ n
=⎝ . ⎣ . ⎦, where w = 1, . . . , ( k ), h = 1, 2, . . . , k and D |w is created directly
(h)

T(D(k) |w ) Vk
(w)

from D(1) |w , . . .,D(k) |w


needs w group distinct random matrices, each group matrix has k − 1 distinct random matrices.
The D(h) |w includes k − 1 distinct random matrices. (See Section 3.5 for details), and Vh(w) is a m-dimensional row vector.
Step 2: Consider a function f : Z + → Z + , q ∈ {1, . . . , k}, f (q) ∈ {1, . . . , n}, for example, when n = 3 and k = 2,
one possible such functions are f (1) = 1, f (2) = 2, or f (2) = 1, f (3) = 2, or f (1) = 1, f (3) = 2. There are ( nk ) different
ways to define such a function. Let w ∈ {1, . . . , ( nk )} and lw be one of such functions.Here, we denote ( nk )
by the number of k-combinations of an n-element set.
⎡ (w) ⎤
V1
(k,n) (k,n) ⎢ . ⎥
Step 3: Generate a random matrix B w of n rows, Bw = ⎣ .. ⎦.
(w)
Vn
(k,n)
For q ∈ {1, . . . , k}, set Vq(w)

(w)
= Vq and q = fw (q). In other words, substitute k rows of B w with the rows of Bw(k,k)
⎡ (1)
⎤ ⎡ r ⎤ ⎡ (3) ⎤
V1 (2)
V1
could be ⎣ V (1) ⎦, or ⎣ V1 ⎦, or ⎣ r ⎦, where r is
(k,n)
according to function fw . For example, with n = 3 and k = 2, Bw
2 (2) (3)
r V2 V2
randomly generated, w ∈ {1, 2, 3}
(k,n) (k,n) (k,n) (k,n)
Step 4: Concatenate all ( nk ) different matrices B w together and obtain B (k,n) = B 1 ◦ B 2 ◦ · · · ◦ B (n)
k
n
as the resulting n × (m · ( k )). Construction matrix for our (k, n) scheme. Finally, the hth row of B (k,n)
(k,n)
is used to create share image Ah . Notice that each Bw is different from Bw(k,k) .
Revealing: U = Dw1 + Dw2 + · · · + Dwk for w1 , . . . , wk ∈ {1, . . . , n}.

Algorithm 3
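Under our reading of Steps 1-4 above, the share construction for one pixel can be sketched as follows: one independent (k, k) block is built per k-subset of {1, . . . , n}, its k rows are scattered to the rows named by that subset, the remaining rows are filled with random bits, and the C(n, k) blocks are concatenated column-wise. The helper names below are ours, and the per-block (k, k) rows reuse the XOR chain of Algorithm 2.

import numpy as np
from itertools import combinations

def kk_rows(c, k, rng):
    # Rows D1..Dk of one (k, k) block for the coded pixel c (a (g-1)-bit vector).
    R = [rng.integers(0, 2, c.size, dtype=np.uint8) for _ in range(k - 1)]
    return np.stack([R[0]] + [R[h - 1] ^ R[h] for h in range(1, k - 1)] + [R[-1] ^ c])

def kn_share_pixel(c, k, n, rng=np.random.default_rng()):
    # One block per k-subset of the n participants; unused rows are random filler.
    blocks = []
    for subset in combinations(range(n), k):
        B = rng.integers(0, 2, (n, c.size), dtype=np.uint8)
        B[list(subset), :] = kk_rows(c, k, rng)
        blocks.append(B)
    return np.hstack(blocks)            # row h is this pixel's part of shadow h

def stack(rows):
    # OR any k rows; the grey level is judged from the Hamming weight.
    return int(np.bitwise_or.reduce(rows, axis=0).sum())

g, k, n, l = 3, 2, 3, 2
c = np.array([1] * (l - 1) + [0] * (g - l), dtype=np.uint8)   # coded grey level
M = kn_share_pixel(c, k, n)            # pixel expansion (g - 1)*C(n, k) = 6
print(stack(M[[0, 1]]), stack(M[[0, 2]]), stack(M[[1, 2]]))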

(2,2)
 rows 1, 2 of matrix Bw , here q1 , q2 ∈ {1, 2, 3}.
copiedfrom
Table 2: Share construction procedure of (2, 3)-VSS scheme with
g = 3. With 2 = 3 different combinations of two elements out
3
(k,n)
R(1) |w d (1) |w ci j d (2) |w = d (1) |w ⊕C of the three, there are three
 different matrices Bw . The
1 00 00 01 01
concatenation of these 32 matrices forms the basic matrix
as below
2 10 10 01 11
⎛{1,2}⎞ ⎛{1,3}⎞ ⎛ ⎞
3 11 10 01 11 {2,3}
⎜ %&'( ⎟ ⎜ %&'( ⎟ ⎜ %&'( ⎟
⎜0 0⎟ ⎜1 0⎟ ⎜ r r ⎟
0 0 1 0 1 1 B(k,n) =⎜ ⎟◦⎜ ⎟◦⎜ ⎟
⎝0 1⎠ ⎝ r r ⎠ ⎝1 1⎠
We have B1(2,2)
= 01 , = 11 , B2(2,2)
= 11 . B3(2,2)
3 (k,n) rr 11 11
Using the 2 possible functions f , we create 3 matrices Bw (27)
⎛{1,2} {1,3} ⎞
as follows: %&'( %&'( {2,3}
⎜ %&'( ⎟
⎛{1,2}⎞ ⎛{1,3}⎞ ⎛
⎞ ⎜0 0 1 0 r r ⎟
{2,3} =⎜ ⎟.
⎜ %&'( ⎟ ⎜ %&'( ⎟ ⎜ %&'( ⎟ ⎝0 1 r r 1 1⎠
(k,n) ⎜ 0 0 ⎟ (k,n) ⎜ 1 0 ⎟ (k,n) ⎜ r r ⎟
B1 = ⎜ ⎟, B 2 = ⎜ ⎟, B3 = ⎜ ⎟. rr 11 11
⎝0 1⎠ ⎝r r ⎠ ⎝1 1⎠
rr 11 11
We now give an application of the scheme above.
(26)
(k,n) Example 4.2. Application example of the greyscale (2, 3)-VSS
The first two rows of B 1 are from the first two B1(2,2) scheme with 3 grey levels.
(k,n)
matrices. The first row, and the third row of B2 are from
(2,2) The secret image is shown in Figure 1(a). The three
the first row and the second row of B2 . The second row and
(k,n) shadow images (shares) are in parts 1(b), 1(c), and 1(d). And
the third row of B3 are from the first row and the second
the reconstructed image is in Figures 1(e)–1(h).
row of B3(2,2) . Here, the symbol r represents a random bit,
taking value 0 or 1. The two random bits in a matrix may or Theorem 4.3. Algorithm 3 is a probabilistic (k, n)-VSS scheme
(k,n)
may not take the same value. In matrix Bw , rows q1 , q2 are with
Figure 1: (a) The secret image. (b) Share 1. (c) Share 2. (d) Share 3. (e) Share 1+Share 2. (f) Share 1 + Share 3. (g) Share 2 + Share 3. (h)
Share 1 + Share 2 + Share 3.

n  
Pixel expansion: mg = (g − 1) · k , Since U = T(Vh1 ) + · · · + T(Vhk ) and there is only one

set V corresponding to the (k, k)-VSS scheme. Based on
Average Grey: βl = E(H({ui j | si j = l}))/m = 1 + (g − Theorem 3.2 above, concatenation of random matrices does
   
1)(2k − nk ) + (l − 1)/(g − 1) · 2k−1 · nk , not affect the total Hamming weight. Thus
  
    n
Average Contrast: αl = βl − βl−1 = 1/(g − 1) · 2k−1 · nk . Ul = ui j | si j = l =
g −l
xU 1l−1 +
g −1
− 1 xU
k
(28)
Proof. To show security, the shares D1 |w , D2 |w , . . . , Dk |w n
[(g −1)( k )+1−l] l−1
are all random and all independent of each other. From = xU 1 .
the construction of the shares given  in the Section 4.1,
we can see that the (k − 1) · nk  random matrices From Theorem 3.2, the Average Grey of the (k, k)-VSS
 scheme is H(V  )=(1 − 1/2k−1 ) · (g − l) + (l − 1) for the
D(1) |w , D(2) |w , . . . , D(k−1) |w , w = 1, . . . , nk , are all distinct  pixels

with grey level l in the original image, the other nk − 1
and all independent of each other. Each Bw(k,k) forms a
sets of V  are random vectors. Among these V  vectors, the
(k, k)-VSS scheme. We know that the k rows of matrix
(k,n) number of 1’s is (1 − 1/2k−1 )(g − 1), that is
Bw are from the corresponding k rows of Bw(k,k) , and can   
(k,n)
be used to reconstruct the secret image. The matrix Bw E(H(Ul )) = E H ui j | Si j = l
is a special (k, n)-VSS scheme, which can construct the  
1  
secret image using special k rows of n rows. The matrix = 1− · g − l + (l − 1)
(k,n) (k,n) (k,n)   2k−1
B(k,n) (= B 1 ◦ B2 ◦ · · · ◦ B(n) ) includes nk distinct   
k 
(k,n) (k,n) (k,n) n 1  
submatrices, B1 , B2 , . . . , B(n) . In matrix B(k,n) , there + −1 1− g −l ,
k k 2k−1
exist some special rows, which come from B1(k,k) , B2(k,k) , . . .,     n
and B((k,k)
n . From the construction method above (see in
  n (l − 1) + 1 − g · k
k) E(H(Ul )) = g − 1 · + ,
Section 4.1), those rows are distinct random rows, we cannot
k 2k−1
get any information of the secret image from the special   
E(H(Ul )) (l − 1) + 1 − g nk
rows of the matrix B(k,n) . Each row of the matrix B(k,n) is a βl = =1+    ,
m 2k−1 nk g − 1
random matrix, namely, A1 |w , A2 |w , . . . , Ak |w are all random
(29)
and all independent of each other. With less than k shares,
n
no information about the secret image is revealed, thus the Therefore, αl = βl − βl−1 = 1/(g − 1) · 2k−1 · k .
security of the system is ensured. When n = k, Theorem 4.3 reduces to the case of the
To show the pixel expansion, similar to the proposed (k, k)-VSS scheme.
 scheme (see Section 3), the pixel expansion mg =
(k, k)-VSS When g = 2, it reduces to a black and white VSS
n
(g −1)· nk is obvious from the shadow construction process.  n  expansion m = k and Average Contrast
scheme with pixel
−1
We now look at its Average Grey and Average Contrast. αl = 1/2 · k .
k
4.2. Comparison with a Previous VSS Scheme with Respect to Pixel Expansion. We will compare our scheme above with the traditional schemes in terms of their pixel expansion. Blundo et al. [10] gave an estimate of the value of the pixel expansion of a (k, n)-VSS scheme for black-and-white images; the following Theorem 4.4 is from Lemma 3.3 of [10].

Theorem 4.4 (see [10]). For any n > k ≥ 2, the pixel expansion m of a (k, n)-VSS scheme satisfies

$m \in \Bigl[\binom{n-1}{k-1}\,2^{k-2} + 1,\ \binom{n-1}{k-1}\,2^{k-1} + 1\Bigr].$   (30)

Muecke [11] and Blundo et al. [12] gave the optimal pixel expansion $m^*$ for g grey level (k, n)-VSS schemes.

Theorem 4.5 (see [11, 12]). In a (k, n)-VSS scheme with g grey levels, the pixel expansion $m^*$ and the contrast $\alpha_g$ between grey levels are

$m^* = (g - 1)\,m, \qquad \alpha_g = \frac{\alpha}{g - 1},$   (31)

where m and α are the pixel expansion and contrast of binary VSS schemes.

Formulas (30) and (31) imply that

$m^* = (g - 1) \cdot m \in \Bigl[\bigl(\binom{n-1}{k-1}\,2^{k-2} + 1\bigr)(g - 1),\ \bigl(\binom{n-1}{k-1}\,2^{k-1} + 1\bigr)(g - 1)\Bigr].$   (32)

The relative contrast is $\alpha_i^* = 1/m^*$, i = 0, . . . , g − 2.

From Theorem 4.3, the pixel expansion of a probabilistic (k, n)-VSS scheme is $m_g = (g - 1)\binom{n}{k}$, and the Average Contrast is $\alpha_l = \beta_l - \beta_{l-1} = 1/\bigl((g - 1) \cdot 2^{k-1} \cdot \binom{n}{k}\bigr)$, l = 1, . . . , g.

It is clear that the pixel expansion in our (k, n)-VSS scheme (see Theorem 3.4) is smaller than that of previous deterministic (k, n)-VSS schemes [10, 11] when k ≥ n/4, k ≥ 4. The Average Contrast of our (k, n)-VSS scheme is close to that of deterministic (k, n)-VSS schemes [10, 11] when k ≥ n/2, k ≥ 2; in the other cases our contrast is lower than that of the (k, n)-VSS schemes of [10, 11].

In a deterministic SS scheme for greyscale images, we pay a higher computation complexity so that the reconstruction is guaranteed. In our proposed scheme we pay a smaller pixel expansion with a (small) probability of making a mistake in reconstructing the secret image. In some applications we may wish a trade-off: we are willing to sacrifice some contrast in order to reduce the complexity of the VSS scheme, or vice versa.

4.3. The Minimum Size of Recognizable Region in (k, n)-VSS Scheme. In the proof of Theorem 4.3, we obtained

$U_l = \bigl(u_{ij} \mid s_{ij} = l\bigr) = x_U^{\,g-l}\,1^{\,l-1} + \Bigl(\binom{n}{k} - 1\Bigr)x_U^{\,g-1} = x_U^{\,[(g-1)\binom{n}{k} + 1 - l]}\,1^{\,l-1}.$   (33)

For the pixels with grey level l in the original image, the reconstructed pixel $U_l$ has Hamming weight $H(U_l) \in [\,l - 1,\ (g - 1)\binom{n}{k}\,]$. The probability of $H(U_l) = l - 1 + t$ is

$p_{l-1+t} = \binom{(g-1)\binom{n}{k} + 1 - l}{t}\Bigl(1 - \frac{1}{2^{k-1}}\Bigr)^{t}\Bigl(\frac{1}{2^{k-1}}\Bigr)^{[(g-1)\binom{n}{k} + 1 - l] - t}, \qquad t = 0, \ldots, (g - 1)\binom{n}{k} - l + 1.$   (34)

In our analysis of the region size, let the random variable $X_l$ represent the Hamming weight above; thus $X_l \in [\,l - 1,\ (g - 1)\binom{n}{k}\,]$ and $X_l$ has a binomial distribution with mean value and variance

$\mu_x = \Bigl((g - 1)\binom{n}{k} + 1 - l\Bigr)\Bigl(1 - \frac{1}{2^{k-1}}\Bigr) + (l - 1), \qquad \sigma_x^2 = \Bigl((g - 1)\binom{n}{k} + 1 - l\Bigr)\Bigl(1 - \frac{1}{2^{k-1}}\Bigr)\frac{1}{2^{k-1}}.$   (35)

Now we consider a group of N pixels with the same grey level l in the original image. Since all pixels are treated separately in the share generation, these N random variables are independent and identically distributed (i.i.d.). Therefore, the total visual effect of the region is closely related to $Z = \sum_{i=1}^{N} X_l^{(i)}$, and

$E(Z) = E\Bigl(\sum_{i=1}^{N} X_l^{(i)}\Bigr) = \sum_{i=1}^{N} E\bigl(X_l^{(i)}\bigr) = N\mu_x = N\Bigl[\,p\Bigl((g - 1)\binom{n}{k} + 1 - l\Bigr) + (l - 1)\Bigr],$   (36)

where $p = 1 - 1/2^{k-1}$,

$\mathrm{Var}(Z) = \mathrm{Var}\Bigl(\sum_{i=1}^{N} X_l^{(i)}\Bigr) = \sum_{i=1}^{N} \mathrm{Var}\bigl(X_l^{(i)}\bigr) = N\sigma_x^2 = N\,p(1 - p)\Bigl((g - 1)\binom{n}{k} + 1 - l\Bigr).$   (37)

Using a Gaussian distribution to approximate the above binomial distribution, we can obtain a lower bound for N. According to the Empirical Rule, about 99.73% of all values fall within three standard deviations of the mean. Hence, to recognize a region of grey level l, the region size should
Table 3: Minimum region sizes of the proposed (2, 3)-VSS scheme with g = 3.

(2, 3)-VSS with g = 3        Between grey levels 1 and 2 (l = 2, g − l = 1)        Between grey levels 2 and 3 (l = 3, g − l = 0)
d = 0.00 198 162
d = 0.05 244 200
d = 0.10 309 253
d = 0.15 404 300
d = 0.20 549 449
d = 0.25 791 646
d = 0.30 1235 1010
d = 0.35 2196 1795
d = 0.40 4940 4038
d = 0.45 19760 16150

satisfy $\mu_l - 3\sigma_l > \mu_{l-1} + 3\sigma_{l-1} + N \cdot d$, where d determines the minimum separation between the two distributions. That is,

$N\Bigl[\Bigl((g-1)\binom{n}{k} + 1 - l\Bigr)p + l - 1\Bigr] - 3\sqrt{N\Bigl((g-1)\binom{n}{k} + 1 - l\Bigr)p(1-p)} > N\Bigl[\Bigl((g-1)\binom{n}{k} + 2 - l\Bigr)p + l - 2\Bigr] + 3\sqrt{N\Bigl((g-1)\binom{n}{k} + 2 - l\Bigr)p(1-p)} + N \cdot d,$

$N(1 - p - d) > 3\sqrt{N\,p(1-p)}\Bigl(\sqrt{(g-1)\binom{n}{k} + 1 - l} + \sqrt{(g-1)\binom{n}{k} + 2 - l}\Bigr),$

$N > 9\,\frac{p(1-p)}{(1 - p - d)^2}\Bigl(\sqrt{(g-1)\binom{n}{k} + 1 - l} + \sqrt{(g-1)\binom{n}{k} + 2 - l}\Bigr)^2,$   (38)

where $p = 1 - 1/2^{k-1}$, $(1 - p - d) > 0$, and $d < 1 - p = 1/2^{k-1}$.

When k = n, $N > 9p(1-p)\bigl(\sqrt{g - l} + \sqrt{g - l + 1}\bigr)^2/(1 - p - d)^2$ is the minimum region size. For a (2, 3) scheme, n = 3, k = 2, g = 3; when $d < 1/2^{k-1} = 0.5$, Table 3 shows the region sizes for a few d values.
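The bound in (38) is straightforward to evaluate. The short Python sketch below (ours, not part of the paper) computes the minimum region size and, for k = 2, n = 3, g = 3, reproduces values of the kind listed in Table 3, for example 198 and 162 pixels at d = 0.

# Minimal sketch: minimum recognizable region size from the bound (38).
from math import ceil, comb, sqrt

def min_region_size(l, k, n, g, d=0.0):
    p = 1 - 1 / 2 ** (k - 1)
    a = (g - 1) * comb(n, k) + 1 - l        # number of random sub-pixels at level l
    if 1 - p - d <= 0:
        raise ValueError("d must be smaller than 1/2^(k-1)")
    return ceil(9 * p * (1 - p) * (sqrt(a) + sqrt(a + 1)) ** 2 / (1 - p - d) ** 2)

# (2, 3)-VSS scheme with g = 3, the two grey-level transitions of Table 3
for d in (0.0, 0.05, 0.10):
    print(d, min_region_size(2, 2, 3, 3, d), min_region_size(3, 2, 3, 3, d))
# -> 198/162, 244/200, 309/253 pixels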
5. Conclusions

This paper proposes an approach to convert a deterministic (k, k)-SS scheme to a (k, k)-VSS scheme for greyscale images with maximum number of grey levels g. Its pixel expansion factor is g − 1, which is independent of k, and it is significantly smaller than the previous result $2^{k-1} \cdot (g - 1)$. The quality of the reconstructed image, measured in Average Contrast between consecutive grey levels, is the same as that of the traditional greyscale VSS schemes. When our scheme is applied to binary images, it has the same minimum size for recognizable regions as that of the Prob.VSS scheme of [15]. This (k, k)-SS scheme is extended to a more general greyscale (k, n)-VSS scheme based on XOR operations. The pixel expansion in our (k, n)-VSS scheme (see Theorem 3.4) is smaller than that of previous deterministic (k, n)-VSS schemes [10, 11] when k ≥ n/4, k ≥ 4. The Average Contrast of our (k, n)-VSS scheme is close to that of deterministic (k, n)-VSS schemes [10, 11] when k ≥ n/2, k ≥ 2; in the other cases our contrast is lower than that of the (k, n)-VSS schemes of [10, 11]. However, there remains the problem of how to ensure that the favorable pixel expansion and contrast provided by the (k, n)-SS scheme are also available in the (k, n)-VSS scheme.

Notation

Original image: $S = \{s_{ij}\}$, i = 1, . . . , φ, j = 1, . . . , ϕ, $s_{ij} \in \{1, \ldots, g\}$
Coded image: $C = \{c_{ij}\}$, i = 1, . . . , φ, j = 1, . . . , ϕ, $c_{ij} \in \{0, 1, \ldots, g - 1\}$
Reconstructed image: $U = \{u_{ij}\}$, i = 1, . . . , m·φ, j = 1, . . . , m·ϕ
Number of grey levels: g
Grey level values: l, t
Average grey: $\beta_l$
Average contrast: $\alpha_l$
Intermediate matrices: $R_h = \{X_h\}$, $X_h \in \{0, \ldots, 2^{g-1} - 1\}$; $D_h = \{d_{ij}^{(h)}\}$, h = 1, 2, . . . , n
Shadow images: $A_h$, $D_h$, h = 1, 2, . . . , n
Threshold value: $k \in \{2, \ldots, n\}$
A set of share indices: $\{q_1, \ldots, q_k\}$
Pixel expansion: M
Basic matrix: B
Binary vectors: V
Probability: P
Region size (pixels): N
The number of combinations of an n-element set: $\binom{n}{k}$
Index to (k, k) schemes in the generation of a (k, n) scheme: $w = 1, \ldots, \binom{n}{k}$
Temporary variables: $x, y \in \{0, 1\}$, $q, z \in Z^1$.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable suggestions to improve this paper. The authors thank Professor Xiaobo Li of the University of Alberta for his suggestions and help in the early stage of the development of this paper. This research was supported in part by the National Natural Science Foundation of China under Grants nos. 60873249, 60902102, 60673065, and 60971006, the 863 Project of China under Grant 2008AA01Z419, and the Postdoctoral Foundation of China under Grant no. 20090460316.

References

[1] G. R. Blakley, "Safeguarding cryptographic keys," in Proceedings of the AFIPS National Computer Conference, vol. 48, pp. 313–317, 1979.
[2] A. Shamir, "How to share a secret," Communications of the ACM, vol. 22, no. 11, pp. 612–613, 1979.
[3] C.-C. Chang and R.-J. Hwang, "Sharing secret images using shadow codebooks," Information Sciences, vol. 111, no. 1–4, pp. 335–345, 1998.
[4] D. S. Wang, L. Zhang, N. Ma, and X. Li, "Two secret sharing schemes based on Boolean operations," Pattern Recognition, vol. 40, no. 10, pp. 2776–2785, 2007.
[5] M. Naor and A. Shamir, "Visual cryptography," in Proceedings of the Advances in Cryptology (EUROCRYPT '94), vol. 950 of Lecture Notes in Computer Science, pp. 1–12, 1994.
[6] V. Rijmen and B. Preneel, "Efficient colour visual encryption or 'Shared Colors of Benetton'," in Proceedings of the EUROCRYPT '96 Rump Session, 1996, http://www.iacr.org/conferences/ec96/rump/preneel.ps.
[7] C.-N. Yang, "A note on efficient color visual encryption," Journal of Information Science and Engineering, vol. 18, no. 3, pp. 367–372, 2002.
[8] Y.-C. Hou, "Visual cryptography for color images," Pattern Recognition, vol. 36, no. 7, pp. 1619–1629, 2003.
[9] E. R. Verheul and H. C. A. Van Tilborg, "Constructions and properties of k out of n visual secret sharing schemes," Designs, Codes, and Cryptography, vol. 11, no. 2, pp. 179–196, 1997.
[10] C. Blundo, A. De Bonis, and A. De Santis, "Improved schemes for visual cryptography," Designs, Codes, and Cryptography, vol. 24, no. 3, pp. 255–278, 2001.
[11] I. Muecke, Greyscale and colour visual cryptography, M.S. thesis, Computer Science of Dalhousie University-Daltech, Halifax, Canada, 1999.
[12] C. Blundo, A. De Santis, and M. Naor, "Visual cryptography for grey level images," Information Processing Letters, vol. 75, no. 6, pp. 255–259, 2000.
[13] M. Iwamoto and H. Yamamoto, "The optimal n-out-of-n visual secret sharing scheme for gray-scale images," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E85-A, no. 10, pp. 2238–2247, 2002.
[14] R. Ito, H. Kuwakado, and H. Tanaka, "Image size invariant visual cryptography," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E82-A, no. 10, pp. 2172–2177, 1999.
[15] C.-N. Yang, "New visual secret sharing schemes using probabilistic method," Pattern Recognition Letters, vol. 25, no. 4, pp. 481–494, 2004.
[16] S. Cimato, R. De Prisco, and A. De Santis, "Probabilistic visual cryptography schemes," Computer Journal, vol. 49, no. 1, pp. 97–107, 2006.
[17] C.-N. Yang and T.-S. Chen, "An image secret sharing scheme with the capability of previewing the secret image," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '07), pp. 1535–1538, July 2007.
[18] C.-N. Yang, A.-G. Peng, and T.-S. Chen, "Secret image sharing: DPVCS a two-in-one combination of (D)eterministic and (P)robabilistic (V)isual (C)ryptography (S)chemes," Journal of Imaging Science and Technology, vol. 52, no. 6, Article ID 060508, 12 pages, 2008.
[19] A. De Bonis and A. De Santis, "Randomness in secret sharing and visual cryptography schemes," Theoretical Computer Science, vol. 314, no. 3, pp. 351–374, 2004.
[20] S.-J. Lin and J.-C. Lin, "VCPSS: a two-in-one two-decoding-options image sharing method combining visual cryptography (VC) and polynomial-style sharing (PSS) approaches," Pattern Recognition, vol. 40, no. 12, pp. 3652–3666, 2007.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 341856, 17 pages
doi:10.1155/2010/341856

Research Article
Semi-Fragile Zernike Moment-Based Image Watermarking for
Authentication

Hongmei Liu,1 Xinzhi Yao,2 and Jiwu Huang1


1 Department of Electronics and Communication, Sun Yat-sen University, Guangzhou 510006, China
2 Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong

Correspondence should be addressed to Hongmei Liu, [email protected]

Received 30 November 2009; Revised 17 May 2010; Accepted 6 July 2010

Academic Editor: Jin-Hua She

Copyright © 2010 Hongmei Liu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose a content-based semi-fragile watermarking algorithm for image authentication. In content-based watermarking
scheme for authentication, one of the most challenging issues is to define a computable feature vector that can capture the major
content characteristics. We identify Zernike moments of the image to generate feature vector and demonstrate its good robustness
and discriminative capability for authentication. The watermark is generated by quantizing Zernike moments magnitudes (ZMMs)
of the image and embedded into DWT (Discrete Wavelet Transform) subband. It is usually hard to locate the tampered area
by using global feature in the content-based watermarking scheme. We propose a structural embedding method to locate the
tampered areas by using the separability of Zernike moments-based feature vector. The authentication process does not need the
original feature vector. By using the semi-fragilities of the feature vector and the watermark, the proposed authentication scheme
is robust to content-preserved processing, while being fragile to malicious attacks. As an application of our algorithm, we apply it
on Chinese digital seals and the results show that it works well. Compared with some existing algorithms, the proposed scheme
achieves better performance in discriminating high-quality JPEG compression from malicious attacks.

1. Introduction than appended to it, eliminating the extra storage require-


ments of visual-hash-based system [2]. The watermark-
With the development of advanced image editing software, it based system may be further divided into two categories,
has become easier to modify or forge digital image [1]. When content-independent watermarking [6–11] and content-
the digital image contains important information, its cred- based watermarking [13–22]. The security of content-
ibility must be ensured. So a reliable image authentication independent watermarking scheme is not so good. Due
system is necessary. Because the image can allow for lossy to the fact that the watermark in this kind of method
representations with graceful degradation, the image authen- is content independent and the detection of tampering is
tication system should be able to tolerate some commonly mainly based on the fragility of the hidden watermark, a wise
used incidental modification, such as JPEG compression malicious manipulation that does not change the watermark
and noise corruption. Therefore, the traditional bit-by-bit will cheat the scheme. For example, the algorithms in [6]
verification based on cryptographic hash is no longer a and [7] cannot detect the modifications that are multiples
suitable way to authenticate the image. Image authentication of watermarking quantization steps, which may be exploited
that validates based on the content is desired [2]. to pass an image with large modification as authentic [12].
In content-dependent watermarking scheme, the general
In the literature, image authentication can be roughly framework for authentication includes the following parts.
classified into two categories, visual-hash-based [3–5]
and watermark-based [6–22]. In visual-hash-based system, (i) Generating feature vector from the host image.
authentication information needs extra channel to transmit (ii) Embedding quantized feature vector as watermark
or store. In watermarked-based system, the authentication into the host image and getting the watermarked
information is imperceptibly embedded in the image rather image.

(iii) Authenticating the test image by comparing the image degrades a lot. In [19], the entropy of the probability
watermark extracted from the test image and the distribution of gray level values in block is used to generate
feature vector generating from the test image. binary feature mask. Positions of malicious manipulations
can be localized. In [20], five features are generated and
One of the most challenging issues of this framework tested. Some are block-based local features, such as edge
is to define a feature vector. An ideal feature vector for shape, standard deviation and mean value, and some are
authentication should have the following properties. frame-based global features, such as edge shape and statis-
tical feature. With global features, the location of attacked
(i) It is computable and can capture the major content
areas cannot be recognized. With local features, there are
characteristics [12].
some problems in tolerance to the incidental operations,
(ii) It is semi-fragile. It is robust to different incidental especially with the block-based edge shape feature. In [21],
manipulations while fragile to malicious manipula- the image is partitioned into nonoverlapping 4 × 4 pixel
tions. blocks in the spatial domain. The mean values of these blocks
(iii) It has good discriminative capability. It is able to form n-dimensional vectors, which are quantized to the
distinguish malicious manipulations from incidental nearest lattice point neighbors. However, it is not robust
ones. to JPEG compression. In [22], the authors proposed to
extract content-based features from the DWT approximation
Without these properties, the feature-based watermark will subband to generate two complementary watermarks: edge-
degenerate as a content-independent watermark in authenti- based watermark to detect the manipulations and content-
cation. based watermark to localize tampered regions.
A number of features have been proposed in content- In content-based watermarking scheme for image
based watermarking schemes for image authentication. In authentication, in order to locate the tampered areas, local
[13], Lin and Chang found that the magnitude relation- feature is usually computed and embedded locally, just like
ship between two coefficients remains invariable through the algorithms in [13, 15, 16, 19–22]. However, restricted by
repetitive JPEG compression. The authentication could be the embedding capacity and invisibility of the watermarked
verified by a 1-bit signature which represents the magnitude image, the watermark generated by local feature should be
relationship between the two coefficients. It is an elegant low bitrate. Thus the feature will not have the first property
algorithm. However, the drawback of the method is that listed above and the algorithm is susceptible to attack, such as
once the DCT pairs are known, an attacker can easily the feature in [13, 20]. Global feature can generate relatively
modify DCT coefficients and keep the original relationship lower bitrate watermark, but it is usually hard to locate the
unchanged [14]. The algorithm in [15] extends and improves tampered areas, such as the global features in [20]. All the
the scheme in [13] by generating the signature bit from feature vectors in the existing schemes are assumed to have
the difference between two wavelet coefficients to which a the second and third characteristics. However, they are not
random bias is added. The signature is inserted into the addressed and analyzed explicitly.
wavelet coefficients using nonuniform quantization-based In this paper, we propose to use Zernike moments to
method. Though the method of feature extraction increases generate feature vector. By using this global feature, we can
the difficulty of the attacker to manipulate the feature, it decide whether the image is maliciously manipulated or not
cannot get the global information of the original image. and locate the tampered areas. At first, we identify Zernike
In [16], the robust signature is cryptographically gener- moments to generate feature vector and demonstrate its
ated on the basis of invariant features called significance- good semi-fragile and discriminative capability for authen-
linked connected component extracted from the image and tication. Moments have been utilized as pattern features
then signed and embedded into the wavelet domain as in many applications to achieve invariant recognition of
a watermark using the quantization-based method. The image pattern. Of various types of moments defined in
algorithm of feature extraction produces too many bits the literature, Zernike moments have been shown to be
of watermark information, which reduces the robustness. superior to the others in terms of their insensitivity to image
In [17], according to the approximation component and noise, information content, and ability to provide faithful
the energy relationship between the subbands of the detail image representation [23] and thus have been used in many
components in DWT domain, global feature and local applications [24–28], for example, invariant watermarking
feature are both generated. Then the global watermark and [26–28] to resist RST (rotation, scale, and translation)
local watermark are generated from global feature and local manipulations. But there is little research on the semi-
feature, respectively. This scheme has lower false positive fragility and discriminative capability of Zernike moments
probability than Lin and Chang’s scheme in [13] and the when different kinds of manipulations are applied to the
false positive probability is 0.07% when quality factor of image in authentication application. In this paper, we analyze
JPEG compression is 70. In [18], Tsai and Chien proposed and demonstrate these properties of Zernike moments.
an authentication scheme with recovery of tampered area. Then, we propose a Zernike moments-based semi-fragile
The features for watermark are generated from LL2 bands watermarking algorithm in DWT domain. It is usually hard
of DWT and embedded into the high-frequency bands. to locate the tampered areas using global feature. We propose
This method needs additional information to extract the a structural embedding method to solve this problem by
watermark, and when recovery is achieved, the quality of the using the separability of Zernike moments feature vector,

which can be separated into individual moments. The authentication process uses a two-stage decision method. In the first stage, we decide if the test image is maliciously manipulated by a metric measure. In the case of malicious manipulation, we further locate the tampered areas in the second stage.

Experimental results show that the proposed authentication scheme has better performance in discriminating high-quality JPEG compression from malicious manipulations when compared with some existing methods. We also test the performance of the proposed method under the situation in which a malicious manipulation is followed by other manipulations. Under this situation, the system can work well too. Our scheme can be used on different kinds of images. The experiments on Chinese digital seals support this conclusion.

The paper is organized as follows. Section 2 describes the Zernike moments and their semi-fragile property. The outline of the proposed system, the content-based watermark and its structural embedding method, and how to authenticate an image are described in Section 3. Section 4 demonstrates the experimental results and the analysis. Conclusions and discussions of future works are given in Section 5.

2. Zernike Moments Magnitudes and Semi-Fragile Property

In a content-based watermarking scheme for image authentication, extraction of the feature vector is one of the most challenging issues. An ideal feature vector should have the three properties listed in Section 1. In this section, we propose to generate the feature vector based on Zernike moments and analyze the properties of this feature vector. The invariance of Zernike moments, that is, the robustness to geometric distortions, has been investigated by the authors of [24, 26, 28]. But the semi-fragile property of Zernike moments has not been investigated in the literature. In this section, we will demonstrate this property and explain how to discriminate malicious manipulations from incidental manipulations by using it. Some of the materials in the following are based on [24, 28].

2.1. Zernike Moment. In [29], Zernike introduced a set of complex polynomials that form a complete orthogonal set over the interior of the unit circle, $x^2 + y^2 = 1$. Let the set of these polynomials be denoted by $\{V_{nm}(x, y)\}$. The polynomials can be expressed as

$V_{nm}(x, y) = V_{nm}(\rho, \theta) = R_{nm}(\rho)\exp(jm\theta),$   (1)

where n is a non-negative integer and m is an integer such that n − |m| is non-negative and even. ρ and θ represent polar coordinates over the unit circle and $R_{nm}$ are polynomials of ρ (Zernike polynomials) given by

$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^s\,(n - s)!\;\rho^{n-2s}}{s!\,\bigl((n + |m|)/2 - s\bigr)!\,\bigl((n - |m|)/2 - s\bigr)!}.$   (2)

Note that $R_{n,-m}(\rho) = R_{n,m}(\rho)$. These polynomials are orthogonal and satisfy

$\iint_{x^2 + y^2 \le 1} V_{nm}^{*}(x, y)\,V_{pq}(x, y)\,dx\,dy = \frac{\pi}{n + 1}\,\delta_{np}\,\delta_{mq}$   (3)

with

$\delta_{ab} = 1$ if $a = b$, and $\delta_{ab} = 0$ otherwise.   (4)

Zernike moments are the projection of the image function onto these orthogonal basis functions. The Zernike moment of order n with repetition m for a continuous image function f(x, y) that vanishes outside the unit circle is

$A_{nm} = \frac{n + 1}{\pi}\iint_{x^2 + y^2 \le 1} f(x, y)\,V_{nm}^{*}(\rho, \theta)\,dx\,dy.$   (5)

For a digital image, we have

$A_{nm} = \frac{n + 1}{\pi}\sum_{x}\sum_{y} f(x, y)\,V_{nm}^{*}(\rho, \theta), \qquad x^2 + y^2 \le 1.$   (6)

To compute the Zernike moments of a given image, the center of the image is taken as the origin and the pixel coordinates are mapped to the range of the unit circle. Those pixels falling outside the unit circle are not used in the computation. Note that $A_{nm}^{*} = A_{n,-m}$.

Suppose that one knows all moments $A_{nm}$ up to order $N_{\max}$ of f(x, y). Using the orthogonality of the Zernike basis, we can reconstruct the image f(x, y),

$\hat{f}(x, y) = \sum_{n=0}^{N_{\max}}\sum_{m} A_{nm}\,V_{nm}(\rho, \theta).$   (7)

Note that as $N_{\max}$ approaches infinity, $\hat{f}(x, y)$ will approach f(x, y).

The reconstruction process is illustrated in Figure 1. For a 64 × 64 gray image of the letter A, the reconstructed images are generated by using (7) followed by mapping the pixel values to [0, 255]. It shows that the lower-order moments capture gross shape information and the high-frequency details are filled in by higher-order moments.

According to the research in [24] and our experiments, Zernike moments of order 12 give a good trade-off between performance (detecting accuracy) and computation complexity, which will be illustrated in Section 2.2.

2.2. Semi-Fragile Property of Zernike Moments-Based Feature Vector. In authentication, semi-fragile means that the feature vector is robust to commonly used incidental modifications that preserve the perceptual quality while fragile to malicious manipulations. Although the classification of incidental and malicious manipulations depends on a specific application, in most cases JPEG compression and slight noise corruption are generally regarded as incidental manipulations, while cut and replace are regarded as malicious manipulations. We adopt this

(a) (b) 4-order (c) 8-order (d) 12-order (e) 15-order

Figure 1: Reconstruction of a gray image. From left to right: the original image, the reconstructed image with order 4, 8, 12 and 15,
respectively.

(a) (b) (c) (d)

(e) (f) (g) (h)

Figure 2: Some example images.

point of view and investigate the semi-fragile property of the Zernike moments-based feature vector. We also verify the robustness of Zernike moments to rotation through experiments. The moments are computed by keeping the size of the manipulated image unchanged.

The semi-fragile property is described by the distance between two images. Each image is represented by an N-dimensional feature vector and the distance is computed on two feature vectors. A smaller distance means a better match of the images. The distance between two feature vectors may be measured using the Euclidean distance [24]. In this paper, we use the absolute difference to simplify the computation. The distance SE (Simplified Euclidean distance) is defined as

$\mathrm{SE}\bigl(f_1(x, y), f_2(x, y)\bigr) = \mathrm{SE}(Z_1, Z_2) = \sum_{i=1}^{N}\bigl|\mathrm{ZMM}_{1,i} - \mathrm{ZMM}_{2,i}\bigr|,$   (8)

where $Z_1$ and $Z_2$ are the feature vectors of the images $f_1(x, y)$ and $f_2(x, y)$, $Z_i = (\mathrm{ZMM}_{i,1}, \mathrm{ZMM}_{i,2}, \ldots, \mathrm{ZMM}_{i,N}) = (|A_{00}|, |A_{11}|, |A_{20}|, \ldots, |A_{N_{\max}N_{\max}}|)$, and $\mathrm{ZMM}_{i,k}$ is the kth Zernike moment magnitude of the feature vector $Z_i$.

Assume that $f_2(x, y)$ is obtained by processing $f_1(x, y)$. We measure the distance between the feature vectors of $f_1(x, y)$ and $f_2(x, y)$. Then we address the difference of the distance when the following different kinds of manipulations are applied to $f_1(x, y)$ to get $f_2(x, y)$.

The experiments are conducted on 300 256 × 256 images that come from [30]. Some of them are shown in Figure 2. Each image is processed by

(i) JPEG compression with QF ∈ [90, 80, 70, 60, 50, 40, 30, 20],
(ii) additive noise with varying strength Sn ∈ [1, 2, 3, 4, 5, 6], where noise in [−5 Sn, 5 Sn] is added randomly,
(iii) rotation with increasing angle ∈ [5°, 15°, 25°, 35°, 45°],
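For concreteness, the following self-contained NumPy sketch (our illustration, not the authors' implementation) builds the 49-component ZMM feature vector of order 12 and evaluates the SE distance of (8); the coordinate mapping and the normalisation are common choices that the paper does not fix, so they should be read as assumptions.

# Sketch: order-12 Zernike moment magnitudes (49 ZMMs) and the SE distance (8).
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho), equation (2), for m >= 0."""
    R = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s) /
             (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    return R

def zmm_features(img, n_max=12):
    """Magnitudes |A_nm| for 0 <= n <= n_max, m >= 0 and n - m even (49 values for n_max = 12)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    y, x = np.mgrid[:h, :w]
    xn = (2 * x - w + 1) / (w - 1)          # map pixel coordinates onto [-1, 1]
    yn = (2 * y - h + 1) / (h - 1)
    rho = np.hypot(xn, yn)
    theta = np.arctan2(yn, xn)
    inside = rho <= 1.0                     # pixels outside the unit circle are ignored
    feats = []
    for n in range(n_max + 1):
        for m in range(0, n + 1):
            if (n - m) % 2:
                continue
            Vc = radial_poly(n, m, rho) * np.exp(-1j * m * theta)   # conjugate basis V*_nm
            A = (n + 1) / np.pi * np.sum(img[inside] * Vc[inside])  # equation (6)
            feats.append(abs(A))
    return np.array(feats)

def se_distance(z1, z2):
    """Simplified Euclidean distance of equation (8)."""
    return float(np.sum(np.abs(z1 - z2)))

# toy usage: distance between an image and a slightly noisy copy
rng = np.random.default_rng(0)
f1 = rng.integers(0, 256, size=(256, 256)).astype(float)
f2 = np.clip(f1 + rng.uniform(-5, 5, f1.shape), 0, 255)
print(se_distance(zmm_features(f1), zmm_features(f2)))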

Table 1: Comparison of 8-order, 12-order, and 15-order Zernike moments.

                                                                   8-Order   12-Order   15-Order
Distinguishing ability: incidental SEs identified as malicious         31         65         95
Distinguishing ability: malicious SEs identified as incidental        366        327        316
Computation time (second) for a 256 × 256 image                    1.6607     4.1001     6.9235

(iv) cutting out blocks at randomly chosen areas. The Thus, 12-order Zernike moments would gain an overall
block sizes are 16 by 16, 24 by 24, 32 by 32, 40 by better performance by considering the distinguishing ability
40, and 48 by 48, respectively, and computing complexity, compared with 8-order and 15-
(v) Replacing the cut block by other content. The block order Zernike moments. In the following sections, we will
sizes are 16 by 16, 24 by 24, 32 by 32, 40 by 40, and 48 adopt 12-order, 49 Zernike moments to generate the feature
by 48, respectively. vector. The detailed distributions of 12-order SEs used in our
experiments are illustrated in Figure 4.
The first three kinds of manipulations are regarded as Assume that f2 (x, y) is obtained by cutting a block from
incidental ones, while the last two kinds of manipulations f1 (x, y). We also conduct the experiments to address the
are regarded as malicious ones. Thus we get 29 processed relationship between SE(f1 (x, y), f2 (x, y)) and the size of cut
images for each original image. Totally we have 8700 block in the image. The results on the images in Figure 2 are
processed images. We measure the distance between Zernike shown in Figure 5, where x-axis is the size of the cut block
moments based feature vectors of the original image and and y-axis is SE(f1 (x, y), f2 (x, y)). We can observe that the
its manipulated image by (8). Zernike moments of 8-order distance between the original image and the processed image
(25 moments), 12-order (49 moments), and 15-order (72 becomes larger when the size of the cut block increases.
moments) are tested in experiments. The results are shown It means that the distance of feature vector can reflect the
in Figure 3. Figures 3(a), 3(c), and 3(e) demonstrate the dis- degree of the content change of the image.
tribution of the distances, where x-axis represents manipula-
tions and y-axis is log10 (SE( f1 (x, y), f2 (x, y))). From Figures 3. Proposed Authentication Algorithm
3(a), 3(c), and 3(e), we can see that distances between the
feature vectors of the original images and their incidentally In this section, the Zernike moments-based watermarking
manipulated images are usually much smaller than those algorithm for authentication is given. The framework, the
between the feature vectors of the original images and structural embedding method of the Zernike moments-
their maliciously manipulated images, and thus can be based watermark, the location of the tampered areas, and the
classified into two groups. One group includes most of the authentication process are described.
distances obtained from the incidental manipulations and
another includes most of those obtained from the malicious 3.1. The Framework of the Proposed Scheme. Figure 6 gives
manipulations. We also give the histograms of the distances, the block diagrams of the embedding and authentication
one for the incidental manipulations and the other for the processes.
malicious manipulations, which are shown in Figures 3(b), The embedding steps are as follows.
3(d), and 3(f), where x-axis represents the distance and y-
axis is the number of occurrences of the distance. From (i) Compute 49 ZMMs of the host image f1 (x, y) .
Figures 3(b), 3(d), and 3(f), we can see that two histograms Each ZMM is quantized to 12 bits and the 9
are separated clearly. Figure 3 tells that we can separate these most significant bits are selected to be part of the
two kinds of manipulations by using the following rule: watermark.

(ii) Apply 3-level DWT to f1 (x, y)and get 10 subbands,
    
⎨Malicious, SE f1 x, y , f2 x, y > T1 , LL3 , HL3 , LH3 , HH3 , HL2 , LH2 , HH2 , HL1 , LH1 ,
decision = ⎩ (9) HH1 , where the low frequency subband LL3 is a low
Incidental, otherwise , pass approximation of the original image.
where T1 is a predefined threshold, which will be given in (iii) The watermark generated from ZMMs is structurally
Section 4 through experiments. embedded in LL3 subband.
Obtained from Figure 3, we also list in Table 1 the (iv) IDWT is applied and the watermarked image is
performance of distinguishing incidental from malicious obtained.
attacks for 8-order, 12-order, and 15-order Zernike moments The authentication steps are as follows:
by using the SEs. The computing time of Zernike moments
for a 256×256 test image with individual order is also given. (i) Compute 49 ZMMs of the test image f2 (x, y).
As can be seen in Table 1, when the order grows from 8 to (ii) Apply 3-level DWT to f2 (x, y) and extract watermark
15, incidental SEs are more easily regarded as malicious ones from LL3 subband. The watermark is restored as
while malicious SEs are less easily regarded as incidental ones; 49 ZMMs, which is the estimation of 49 ZMMs of the
at the same time, the computing time increases gradually. original host image f1 (x, y).

[Figure 3 (plots): panels (a), (c), and (e) show log10(SE) under the five manipulations for 8-order, 12-order, and 15-order moments, separated into non-malicious and malicious attacks; panels (b), (d), and (f) show the corresponding histograms of the number of occurrences versus log10(SE).]

Figure 3: The distribution of the distances.



[Figure 4 (plots, panels (a)-(e)): log10(SE) versus JPEG quality factor, noise strength, rotation angle, size of cut block, and size of replaced block.]

Figure 4: The distribution of SEs in order 12.

(iii) The first decision stage. Compute SE( f1 (x, y), it. If the blocks are more than ZMMs in number,
f2 (x, y))and compare it with a predefined threshold then some of ZMMs can be embedded repeatedly.
to decide whether the test image is authentic or not. The secret key can be used to improve the security
In the case of inauthentic, go to next step. of the scheme.
(iv) The second decision stage. Locate the attacked area (iii) ZMM1,i is embedded in the selected block with one
by using the structure of the embedded watermark. bit in one coefficient. The embedding method we
adopted can be found in [31],
3.2. Structural Embedding Method and Location of Attacked
3
Area. In content-based watermarking scheme, it is usually A (i) = A(i) − A(i) mod Sw + Sw if X = 1,
hard to locate the tampered areas by using global feature. 4
(10)
In our system, we locate the tampered regions using the 1
blockwise method by resorting to the separability of the A (i) = A(i) − A(i) mod Sw + Sw if X = 0,
4
Zernike moments-based feature vector and the change of
where A(i) and A (i) are the DWT coefficients before and
watermark.
after embedding, respectively. X is the watermark bit. Sw is
From the description in Section 2, we can know that
the watermark strength which is a positive natural number.
the Zernike moments-based feature vector is composed by
The watermark bit X  can be extracted by the following
individual ZMMs. Each ZMM can be embedded separately
method:
into a block. When some parts of the watermarked image
are changed, the ZMMs embedded in these areas will be 1
A (i) mod Sw ≥ Sw then X  = 1,
changed and thus can be used to locate the tampered areas. 2
(11)
The structural embedding method is as follows.  1 
A (i) mod Sw < Sw then X = 0,
(i) LL3 subband is segmented into nonoverlapped 3 × 3 2
blocks. Denote ZMM1,i and ZMM 1 ( j) ( j)
1,i are the ith ZMMs in
(ii) For each ZMM1,i in the feature vector Z1 of f1 (x, y), Z1 embedded in and extracted from the selected jth block,
we randomly select a block by a secret key to embed respectively. The authentication process is as follows.

(i) Compute 49 ZMMs, ZMM2,i (i = 1 − 49), of the ×103


feature vector Z2 of the test image f2 (x, y). 16
( j) 14
(ii) Extract the watermark and get ZMM 1 1,i from each
block of LL3 subband of f2 (x, y). 12

(iii) In the first stage, the authenticity of the image is 10


decided by the following rule

SE
8
6
decision 4
⎧      2

⎪Malicious SE f1 x, y , f2 x, y


⎨   (12) 0
16 × 16 24 × 24 32 × 32 40 × 40 48 × 48
=
⎪ = SE Z41 , Z2 > T1 ,

⎪ Size of cut

⎩Incidental otherwise, Figure 5: The relationship between distance and the size of cut
block.
where T1 is a predefined threshold. Z41 is the estimation of
( j)
Z1 and restored from the extracted watermark ZMM1 1,i by
averaging those with same i. ( j)
y-axis represents |ZMM1,i − ZMM 1 ( j)
1,i |. From Figure 7(a4 ),
(iv) In the second stage, the tampered areas are located by we can observe that malicious manipulation introduces
the following rule: much greater changes to the embedded watermarks in the
tampered blocks than to the individual components of the
( j)
feature vector. So using the estimated watermark ZMM 4 1,i in
decision (13) will not affect the locating of tampered areas too much.
⎧ 2 2 1 ( j)
⎪ 24 ( j)
1 ( j) 2 The error between the extracted watermark ZMM 1,i and the

⎪ jth block is attacked 2ZMM1,i − ZMM 1,i 2

⎨ 4 ( j)
estimated watermark ZMM 1,i is shown in Figure 7(a5 ). X-
= > T2 ,

⎪ axis represents the serial number of the block in LL3 subband


⎩ and y-axis represents |ZMM4 ( j)
1 ( j)
jth block is not attacked otherwise, 1,i − ZMM1,i |. We can observe
(13) that the bursts in the right image of Figure 7(a4 ) are still
kept in Figure 7(a5 ). Figure 7(a6 ) shows the location result by
4 ( j) comparing the errors in Figure 7(a5 ) with T2 . From Figure 7,
where T2 is a predefined threshold and ZMM 1,i are the
( j) we can see that the structural embedding method is effective
estimation of ZMM1,i . In our scheme, they are estimated in locating the tampered areas by resorting to the location of
from Z 2 . That is, we assume that each ZMM2,i in Z2 is the changed watermark.
embedded and get its corresponding block by the same
( j)
secret key used in embedding side and get ZMM 4 1,i . We
( j)
3.3. The Robustness of Watermark to Incidental Manipulations.
will demonstrate that it is reasonable to estimate ZMM1,i The robustness of watermark to incidental manipulations
from Z 2 by an example in the following part. is very important in authentication, because the extracted
There are three parameters in our schme. T1 in (12) can watermark is used to estimate original feature vector of the
be selected by the ROC (Receiver Operator Characteristic, image and decide if the test image is authentic. We measure
shown in Section 4) of the scheme and the requirements the robustness of the watermark by computing the distance
of the false positive probability and the false negative between the original feature vector of the image and the
probability. T2 in (13) is set as 512 by extensive experiments estimated feature vector from the extracted watermark by
and Sw is chosen to be 64. (8). The experiments are conducted on the 300 images used
Figure 7 demonstrates the method of locating the tam- in Section 2.2. Each watermarked image is processed by
pered area. Figures 7(a1 ), 7(a2 ), and 7(a3 ) are the original
image f1 (x, y), the watermarked image, and the maliciously (i) JPEG with QF ∈ [90, 80, 70, 60, 50],
manipulated image f 2 (x, y). The cars on the road of
(ii) additive noise with varying strength Sn ∈
Figure 7(a2 ) are copied and pasted to get Figure 7(a3 ). The
differences between ZMM1,i and ZMM2,i of Figure 7(a1 ) [1, 2 , 3 , 4 , 5].
and Figure 7(a3 ) are shown in the left image of Figure 7(a4 ).
The histogram of the distance is shown in Figure 8, where
X-axis represents serial number of ZMMs and y-axis
x-axis represents the distance and y-axis is the occurrence
represents |ZMM1,i − ZMM2,i |. The errors between the
( j) number of the distance. From Figure 8, we can see that most
extracted watermark ZMM 1 1,i from jth block of Figure 7(a) of the distance is zero. It means that the extracted watermark
( j)
and the original watermark ZMM1,i embedded in jth block is equal to the embedded watermark in most cases and thus
are shown in the right image of Figure 7(a3 ). X-axis repre- the watermark is robust to high-quality JPEG compression
sents the serial number of the block in LL3 subband and and noise.

[Figure 6 (block diagram): (a) embedding: compute the ZMMs and the DWT of the host image, embed the watermark by the structural method, and apply the IDWT to obtain the watermarked image; (b) authentication: compute the ZMMs and the DWT of the test image, extract the watermark, decide whether the image is authentic, and locate the tampered areas when it is not.]

Figure 6: The framework of the proposed scheme: (a) embedding process (b) authentication process.
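The embedding path of Figure 6(a) can be summarised by the following Python sketch. It assumes the zmm_features() helper sketched in Section 2.2, PyWavelets for the 3-level DWT, the embedding rule of equation (10) with Sw = 64 as in Section 3.2, and simplifies the secret-key block selection to a keyed permutation of LL3 coefficients, so it illustrates the idea rather than reproducing the authors' exact code; the 12-bit quantisation range is likewise an assumption.

# Sketch of the embedding pipeline of Figure 6(a).
import numpy as np
import pywt

def embed_bit(a, x, sw=64):
    """Rule (10): move the coefficient to the 1/4 or 3/4 point of its Sw bin."""
    return a - np.mod(a, sw) + (0.75 * sw if x == 1 else 0.25 * sw)

def quantise_zmm(z, z_max, n_bits=12, keep=9):
    """Quantise one ZMM to n_bits and keep its `keep` most significant bits."""
    q = min(int(z / z_max * (2 ** n_bits - 1)), 2 ** n_bits - 1)
    return [(q >> (n_bits - 1 - i)) & 1 for i in range(keep)]

def embed_watermark(img, key=2010, sw=64):
    feats = zmm_features(img, n_max=12)                        # 49 ZMMs of the host image
    bits = [b for z in feats for b in quantise_zmm(z, feats.max())]   # 49 x 9 = 441 bits
    coeffs = pywt.wavedec2(np.asarray(img, dtype=float), 'haar', level=3)
    ll3 = coeffs[0].copy()
    flat = ll3.ravel()
    positions = np.random.default_rng(key).permutation(flat.size)     # key-driven placement
    for pos, bit in zip(positions, bits):
        flat[pos] = embed_bit(flat[pos], bit, sw)
    coeffs[0] = ll3
    return pywt.waverec2(coeffs, 'haar')

The authentication path of Figure 6(b) reverses these steps: it recomputes the ZMMs of the test image, extracts the embedded bits from LL3 with rule (11), and compares the two feature vectors with the SE measure.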

[Figure 7 (images and plots): (a1) the original image, (a2) the watermarked image, (a3) the maliciously manipulated image; (a4) the per-moment differences (sum error of moments versus serial number of ZMMs) and the per-block watermark errors (sum error of watermarks versus serial number of the block); (a5) the errors between the extracted and estimated watermarks per block; (a6) the localization result.]

Figure 7: Demonstration of the location method of the attacked area.
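To make the localisation step demonstrated in Figure 7 concrete, the sketch below (again our illustration) extracts the watermark bits from LL3 with rule (11), restores one ZMM value per block, and flags block j as tampered when the difference between the extracted and the estimated ZMM exceeds T2 = 512, as stated for (13); ll3_blocks, estimated_zmms, and z_max are hypothetical inputs, namely the 3 × 3 LL3 blocks of the test image, the ZMMs estimated from the test image and mapped to blocks with the embedder's key, and the quantisation range used at embedding.

# Sketch: watermark extraction (rule (11)) and block-wise tamper localisation.
import numpy as np

def extract_bit(a, sw=64):
    """Rule (11): the bit is 1 iff the remainder lies in the upper half of the Sw bin."""
    return 1 if np.mod(a, sw) >= sw / 2 else 0

def restore_zmm(bits, z_max, n_bits=12):
    """Rebuild a ZMM value from its 9 most significant quantisation bits."""
    q = sum(b << (n_bits - 1 - i) for i, b in enumerate(bits))
    return q / (2 ** n_bits - 1) * z_max

def locate_tampered(ll3_blocks, estimated_zmms, z_max, sw=64, t2=512):
    tampered = []
    for j, (block, zmm_est) in enumerate(zip(ll3_blocks, estimated_zmms)):
        bits = [extract_bit(a, sw) for a in np.ravel(block)[:9]]   # one bit per coefficient
        zmm_ext = restore_zmm(bits, z_max)
        if abs(zmm_ext - zmm_est) > t2:                            # decision of (13)
            tampered.append(j)
    return tampered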



Table 2: Some Pfp and Pfn values at different thresholds T1.

T1 Number of the false negative image Pfn Number of the false positive image Pf p
2680 7 0.0023 77 0.0257
2820 10 0.0033 69 0.0230
3000 14 0.0047 65 0.0217
3320 17 0.0057 63 0.0210
3940 20 0.0067 61 0.0203
4300 25 0.0083 60 0.0200
4900 30 0.0100 59 0.0197
6700 44 0.0147 49 0.0163
8000 56 0.0187 47 0.0157
9000 70 0.0233 44 0.0147
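Rows such as those of Table 2 are obtained by counting threshold crossings of the SE values. The short sketch below shows the computation; se_incidental and se_malicious are made-up example arrays, not measurements from the paper.

# Sketch: false-positive / false-negative rates from SE values and a threshold T1.
import numpy as np

def error_rates(se_incidental, se_malicious, t1):
    p_fp = float(np.mean(np.asarray(se_incidental) > t1))    # incidental flagged as malicious
    p_fn = float(np.mean(np.asarray(se_malicious) <= t1))    # malicious accepted as authentic
    return p_fp, p_fn

se_incidental = np.array([210.0, 850.0, 1500.0, 3900.0])     # hypothetical SE values
se_malicious = np.array([2100.0, 5200.0, 12000.0, 40000.0])
for t1 in (2680, 3320, 4900):
    print(t1, error_rates(se_incidental, se_malicious, t1))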

[Figures 8 and 9 (plots): the histogram of the sum error of the watermark under JPEG and noise attacks, and the ROC curve of log10(Pfn) versus log10(Pfp); the marked operating point is T1 = 3320 with Pfp = 0.021 and Pfn = 0.0057.]
Figure 9: ROC curve.

JPEG attack
Noise attack (ii) Non-malicious manipulations. Compressing by
JPEG with QF ∈ [90, 80, 70, 60, 50] and adding
Figure 8: The robustness of watermark to incidental manipula-
tions. Gaussian noise with strength Sn ∈ [1, 2, 3, 4, 5].
We generate 6000 processed images. Among them 3000
images are produced by incidental manipulations and 3000
images are generated by malicious attacks. Pfp and Pfn are
4. Experimental Results used to represent the false positive probability and the false
To demonstrate the power of our authentication system, we negative probability, respectively. Some Pfp and Pfn under
study the ROC of the scheme and set the threshold T1 . Then different thresholds are shown in Table 2. Our observation
we present some results obtained by applying only malicious shows that the false positive image usually is the JPEG
or incidental manipulation on standard test images and compressed image with QF 50 and the false negative image
Chinese seal images. We also demonstrate the results of is usually the maliciously manipulated image with small
locating the tampered areas when the image is processed by size content change. The ROC of the scheme is shown in
combining malicious manipulation with JPEG compression, Figure 9, where x-axis is log10 (Pfp ) and y-axis is log10 (Pfn ).
sharpening, or blurring. Comparisons with some existing The thresholds are between 2680 and 9000. T1 is set as 3320
schemes will also be presented. in our experiments because we can get relatively low Pfp and
Pfn at the same time by using this threshold.

4.1. ROC and Threshold. Experiments are performed on 300 4.2. Authentication Results When Single Attack Is Applied.
images that come from [30], which do not include the images The experiments are firstly conducted on the standard test
used in Section 2. All of these images are watermarked and images in Figure 10. The PSNRs of their watermarked images
then processed by two kinds of manipulations as follows. are shown in Table 3. Table 4 lists the authentication results
when JPEG compressions are applied to their watermarked
(i) Malicious attacks. Adding, erasing, and replacing images. Figure 11 shows the tamper localization results when
something with different sizes. malicious attacks are applied to some of them. Then we

I01 I02 I03 I04 I05

I06 I07 I08 I09 I10

I11 I12 I13 I14 I15

I16 I17 I18 I19 I20

Figure 10: The test images.

Table 3: PSNRs obtained by watermarking the images in Figure 10.

Image in Figure 10 I01 I02 I03 I04 I05 I06 I07 I08 I09 I10
PSNR (dB) 42.6 42.5 42.3 42.2 42.4 42.6 42.7 42.0 42.3 42.8
Image in Figure 10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20
PSNR (dB) 42.1 42.6 42.4 42.3 42.8 42.7 42.3 42.9 42.5 42.9

conduct experiments on Chinese seal images in Figure 12 4.3. Authentication Results When Combined Attacks Are
and show the authentication results when malicious attacked Applied. The objective of this section is to check whether
are applied to the watermarked images. Table 5 lists the our scheme can successfully detect and locate a malicious
authentication results when JPEG compressions are applied manipulation when some other manipulations are applying
to their watermarked images. From Tables 4 and 5, we can to the image simultaneously. We apply two-stage decision
see that our system can successfully pass almost all the JPEG- method. The authenticity of the test image is firstly decided.
compressed images with QF as low as 40. As for the additive We observe that combined manipulations introduce more
Gaussian noise, our scheme can tolerate noisy images with changes to the watermark and the feature vector than single
PSNR as low as 33.6 dB. From the experiment results, manipulation. In first stage, SE( f1 (x, y), f2 (x, y)) > T1
we can see that the proposed scheme is robust to JPEG in (12) is true and the image is regarded as maliciously
compression while sensitive to malicious manipulations with manipulated. Figure 13 shows the tampering and location
good capability in locating the attacked areas. results in the second stage. The manipulations following

I02 I13 I15 I18

WI02 (PSNR = 42.5) WI13 (PSNR = 42.4) WI15 (PSNR = 42.8) WI18 (PSNR = 42.9)

TI02 TI13 TI15 TI18

LI02 LI13 LI15 LI18

Figure 11: Authentication results when some standard test images are maliciously manipulated where I: original standard image, WI:
watermarked image, TI: tampered watermarked image and the oval highlights the tampered part, LI: location of the attacked areas.

malicious tampering include JPEG compressions, blurring, 3 × 3 and becomes worse when window size increases. In
and sharpening. In order to compare with the algorithm in the case of a combined manipulation involving sharpening,
[8], we adopt the same symbols. We can see that our scheme the results are good when the sharpening factor is smaller
can work well in most cases. In the case of a combined than 50. When the sharpening factor exceeds 50, the result
manipulation involving JPEG, Figure 13 indicates that when becomes worse when the factor increases.
the quality factor is as low as 40, the detection result is still
good. In the case of a combined manipulation involving 4.4. Performance Comparison. In authentication, one of
blurring, the detection result is good when window size is the most important issues is discriminating the incidental

S01 S02 S03 S04

WS01 (PSNR = 42.3) WS02 (PSNR = 42.1) WS03 (PSNR = 42.1) WS04 (PSNR = 42)

TS01 TS02 TS03 TS04

LS01 LS02 LS03 LS04

Figure 12: Authentication results when some Chinese digital seal images are maliciously manipulated where S: original seal image, WS:
watermarked seal image, TS: tampered watermarked seal image and the oval highlights the tampered part, LS: location of the attacked areas.

and malicious attacks. Conventional content independent as low as 30 as authentic. But for image I20 in Figure 10,
watermarking approaches, such as the schemes in [7, 8, the JPEG compressed image with QF as high as 70 is still
11], do not provide a rational metric measure for the mistaken as maliciously attacked image. The scheme in this
discriminating. They use the detected attacked areas to paper gives a two-stage scheme and a metric measure for
decide whether the image is maliciously attacked. Because the discriminating. For 20 images in Figure 10, this measure
incidental manipulations can introduce error of watermark can pass most JPEG compressed images with QF as low
which may be mistaken as the result of maliciously attack, as 40. The comparison between our algorithm and that
sometime the scheme does not work well. For example, in in [8] can be found in Tables 6 and 7, where Table 6
[8], the scheme works very well on 11 of 12 test images demonstrates the performance of discriminating when only
in Figure 10 and passes JPEG compressed images with QF JPEG compression is applied to the images and Table 7

W image T image T T+B3×3 T+B5×5 T+B7×7

T + S10 T + S20 T + S30 T + S40 T + S50 T + S60

T + S70 T + S80 T + S90 T + J90 T + J80 T + J70

T + J60 T + J50 T + J40 T + J30 T + J20 T + J10

Figure 13: The detection results when combined attacks are applied to watermarked image. W image and T image denote the watermarked
image and the tampered watermarked image, respectively. The oval in T image highlights the tampered part. The symbols T, J, B and S
denote malicious tampering, JPEG-compression, blurring and sharpening, respectively. + means followed by. The number following each
symbol is the parameter adopted by the manipulation in Photoshop.


Table 4: Authentication results when JPEG compressions are applied to the corresponding watermarked images. √ means that our scheme regards the manipulation as incidental and × means that our scheme regards the manipulation as malicious.

Image in Figure 10
Manipulation I01 I02 I03 I04 I05 I06 I07 I08 I09 I10
√ √ √ √ √ √ √ √ √ √
JPEG (QF > 40)
√ √ √ √ √ √ √ √ √
JPEG (QF = 40) ×
√ √ √
JPEG (QF = 30) × × × × × × ×
Image in Figure 10
Manipulation I11 I12 I13 I14 I15 I16 I17 I18 I19 I20
√ √ √ √ √ √ √ √ √ √
JPEG (QF > 40)
√ √ √ √ √ √ √ √ √ √
JPEG (QF = 40)

JPEG (QF = 30) × × × × × × × × ×

shows the detection results when combined manipulations attacked areas as the approach in [8], but the scheme
are applied to the images. From Table 6 and the experimental in [8] uses the original watermark in the authentication
results in Sections 4.2 and 4.3, we can see that our scheme is process.
more stable in discriminating high-quality JPEG compres- Comparisons with some other content-independent [7,
sion from malicious attacks than the approach in [8] and can 9–11] and-dependent [22] watermarking approaches for
be used on different kinds of images. Table 7 shows that our authentication are listed in Table 8. From Table 8, we can
scheme can give similar detection results for the maliciously see that the performance of discriminating JPEG from the


Table 5: Authentication results when JPEG compressions are applied to the watermarked seal images. √ and × have the same meanings as those used in Table 4.
Images in Figure 12
Manipulation WS01 WS02 WS03 WS04
√ √ √ √
JPEG (QF > 50)
√ √ √ √
JPEG (QF = 50)
√ √ √
JPEG (QF = 40) ×
√ √ √ √
noise (Sn ≤ 5)
√ √
noise (Sn = 6) × ×

Table 6: Comparison with the scheme in [8] on images I18 and I20 in Figure 10.

The scheme in [8] The proposed scheme The scheme in [8] The proposed scheme
Images in Figure 10
Manipulation I18 I18 I20 I20
√ √ √ √
JPEG (QF >= 80)
√ √ √
JPEG (QF = 70) ×
√ √ √
JPEG (QF = 60) ×
√ √ √
JPEG ( QF = 50) ×
√ √ √
JPEG (QF = 40) ×

JPEG (QF = 30) × × ×
JPEG (QF = 20) × × × ×

Table 7: The comparison of the performance of detecting attacked discriminating the incidental manipulations from
areas when combined manipulations are applied to the image I18 in malicious attacks,
Figure 10. The symbols and the numbers have the same meanings
as those in Figure 12. (4) by using the separability of Zernike moments-based
feature vector, a structural embedding method for the
T +B T +J T +S
ZMMs-based watermark is given. Extensive experi-
Proposed algorithm 3×3 40 40 ments show that this method can locate the attacked
The scheme in [8] 3×3 40 40 area effectively. It can locate the altered blocks even if
the altered image has been lossy compressed, blurred,
or sharpened with medium strength,
malicious manipulation of our scheme is superior to those of
the algorithms in [7, 9–11, 22]. (5) the proposed authentication scheme has better per-
Comparisons with some content-based watermarking formance of discriminating high-quality JPEG com-
approaches in [13, 17] are shown in Table 9, where the data pression from malicious attacks than some existing
in the last two columns come from [17]. From Table 9, we schemes. The scheme does not need the original
can see that our scheme has better robustness to high-quality feature vector for authentication process,
JPEG compression.
(6) the proposed scheme can be used on different kinds
5. Conclusion and Future Works of images. The experiments on Chinese seal images
with a very homogeneous background support this
In this paper, we propose a content-based watermarking conclusion.
scheme for image authentication. The contributions of this
paper are as follows: The feature vector of Zernike moments can also work
(1) to have found the semi-fragile property of the Zernike well to authenticate binary images [32] like documents and
Moments-based feature vector. CAD images. It can also be used in video authentication.
Our extensive experiments show that this feature vector has
(2) to have proposed to use Zernike feature vector good semi-fragile characteristics for video processing. Some
as the feature in image authentication. Extensive preliminary results on video authentication by using Zernike
experiments show that Zernike moments have good moments-based feature vector has been published in [33].
robustness and discriminating capability for authen- Our future works include researching on the embedding
tication, algorithm robust to geometric distortions and improving the
(3) to have proposed a two-stage decision method in precision in locating the altered areas. Recovery [18] of the
authentication process and a metric measure for tampered area will be also studied in our future work.

Table 8: Comparisons with other methods in [7, 9–11, 22] on image “Lena.”

Proposed Kundur’s Lu’sscheme in Bao’s scheme in Yang’s scheme in Qi’s scheme in


algorithm scheme in [7] [9] [10] [11] [22]
PSNR(dB) 42.8 43.0 30.5 40.5 36.34 39.46
Robustness to JPEG (QF) 40 50 80 80 60 50

Table 9: Comparisons with some content-based watermarking methods (Pfp %).

Proposed algorithm Wang’s scheme in [17] Lin’s scheme in [13]


Robustness to JPEG with QF 70 0 0.07 3.1
No-attack 0 0 0.2

Acknowledgment [11] H. Yang and X. Sun, “Semi-fragile watermarking for image


authentication and tamper detection using HVS model,” in
This work is supported by 973 Program (2011CB302204), Proceedings of the International Conference on Multimedia and
GDIID Program (GDIID2008IS046), and Guangdong Sci- Ubiquitous Engineering (MUE ’07), pp. 1112–1117, April 2007.
ence and Technology program (2009B090300345). [12] B. Zhu, M. D. Swanson, and A. H. Tewfik, “When seeing isn’t
believing,” IEEE Signal Processing Magazine, vol. 21, no. 2, pp.
40–49, 2004.
References
[13] C.-Y. Lin and S.-F. Chang, “Semi-fragile watermarking for
[1] J. Fridrich, “Security of fragile authentication watermarks authenticating JPEG visual content,” in Security and Water-
with localization,” in Security and Watermarking of Multimedia marking of Multimedia Contents II, vol. 3971 of Proceedings of
Contents IV, vol. 4675 of Proceedings of SPIE, pp. 691–700, San SPIE, pp. 140–151, San Jose, Calif, USA, 2000.
Jose, Calif, USA, 2002. [14] R. Radhakrishnan and N. Memon, “On the security of the
[2] C. Fei, D. Kundur, and R. H. Kwong, “Analysis and design SARI image authentication system,” in Proceedings of the IEEE
of secure watermark-based authentication systems,” IEEE International Conference on Image Processing (ICIP ’01), pp.
Transactions on Information Forensics and Security, vol. 1, no. 971–974, October 2001.
1, pp. 43–55, 2006. [15] K. Maeno, Q. Sun, S.-F. Chang, and M. Suto, “New semi-
[3] A. Swaminathan, Y. Mao, and M. Wu, “Robust and secure fragile image authentication watermarking techniques using
image hashing,” IEEE Transactions on Information Forensics random bias and nonuniform quantization,” IEEE Transac-
and Security, vol. 1, no. 2, pp. 215–230, 2006. tions on Multimedia, vol. 8, no. 1, pp. 32–45, 2006.
[4] C. D. Roover, C. D. Vleeschouwer, F. Lefebvre, and B. Macq, [16] Q. Sun and S.-F. Chang, “Semi-fragile image authentication
“Robust image hashing based on radial variance of pixels,” using generic wavelet domain features and ECC,” in Proceed-
in Proceedings of the IEEE International Conference on Image ings of the International Conference on Image Processing (ICIP’
Processing (ICIP ’05), vol. 3, pp. 77–80, Genova, Italy, 2005. 02), pp. 901–904, September 2002.
[5] C.-H. Lin and W.-S. Hsieh, “Semi-fragile image authentication [17] J. Wang, S. Lian, Z. Liu, R. Zhen, and Y. Dai, “Multimedia data
method for robust to JPEG, JPEG2000 compressed and scaled authentication in wavelet domain,” in Independent Component
images,” in Information Hiding and Application, vol. 227 of Analyses, Wavelets, Unsupervised Smart Sensors, and Neural
Studies in Computational Intelligence, pp. 141–162, Springer, Networks IV, vol. 6247 of Proceedings of SPIE, pp. 3–12, 2006.
Berlin, Germany, 2009. [18] M. J. Tsai and C. C. Chien, “Authentication and recovery for
[6] B. Zhu, M. D. Swanson, and A. H. Tewfik, “Transparent wavelet-based semifragile watermarking,” Optical Engineering,
robust authentication and distortion measurement technique vol. 47, no. 6, p. 067005, 2008.
for images,” in Proceedings of the 1996 7th IEEE Digital Signal [19] S. Thiemert, H. Sahbi, and M. Steinebach, “Using entropy
Processing Workshop, pp. 45–48, September 1996. for image and video authentication watermarks,” in Security,
[7] D. Kundur and D. Hatzinakos, “Digital watermarking for Steganography and Watermarking of Multimedia Contents VIII,
telltale tamper proofing and authentication,” Proceedings of the vol. 6072 of Proceedings of SPIE, pp. 1–10, 2006.
IEEE, vol. 87, no. 7, pp. 1167–1180, 1999. [20] J. Dittmann, “Content-fragile watermarking for image
[8] G.-J. Yu, C.-S. Lu, and H.-Y. M. Liao, “Mean-quantization- authentication,” in Security and Watermarking of Multimedia
based fragile watermarking for image authentication,” Optical Contents III, vol. 4314 of Proceedings of SPIE, pp. 175–184,
Engineering, vol. 40, no. 7, pp. 1396–1408, 2001. 2001.
[9] Z.-M. Lu, D.-G. Xu, and S.-H. Sun, “Multipurpose image [21] M. Schlauweg, D. Profrock, T. Palfner, and E. Muller,
watermarking algorithm based on multistage vector quanti- “Quantization-based semi-fragile public-key watermarking
zation,” IEEE Transactions on Image Processing, vol. 14, no. 6, for secure image authentication,” in Mathematics of
pp. 822–831, 2005. Data/Image Coding, Compression, and Encryption VIII,
[10] P. Bao and X. Ma, “Image adaptive watermarking using with Application, vol. 5915 of Proceedings of SPIE, pp. 1–11,
wavelet domain singular value decompostion,” IEEE Transac- 2005.
tions on Circuits and Systems for Video Technology, vol. 15, no. [22] X. Qi, X. Xin, and R. Chang, “Image authentication and
1, pp. 96–102, 2005. tamper detection using two complementary watermarks,” in
EURASIP Journal on Advances in Signal Processing 17

Proceedings of the International Conference on Image Processing


(ICIP ’09), pp. 4257–4260, 2009.
[23] C. Teh and R. T. Chin, “On image analysis by the methods of
moments,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 10, no. 4, pp. 496–513, 1988.
[24] A. Khotanzad and Y. H. Hong, “Invariant image recognition
by Zernike moments,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 12, no. 5, pp. 489–497, 1990.
[25] C. Maaoui, H. Laurent, and C. Rosenberger, “2D color shape
recognition using Zernike moments,” in Proceedings of the
IEEE International Conference on Image Processing (ICIP ’05),
vol. 3, pp. 976–979, September 2005.
[26] Y. Xin, S. Liao, and M. Pawlak, “A multibit geometrically
robust image watermark based on Zernike moments,” in
Proceedings of the 17th International Conference on Pattern
Recognition (ICPR ’04), vol. 4, pp. 861–864, August 2004.
[27] P. Amin and K. P. Subbalakshmi, “Rotation and cropping
resilient data hiding with zernike moments,” in Proceedings
of the International Conference on Image Processing (ICIP ’04),
vol. 4, pp. 2175–2178, October 2004.
[28] H. S. Kim and H.-K. Lee, “Invariant image watermark using
Zernike moments,” IEEE Transactions on Circuits and Systems
for Video Technology, vol. 13, no. 8, pp. 766–775, 2003.
[29] V. F. Zernike, “Beugungstheorie des schneidenver-fahrens
und seiner verbesserten form, der phasenkontrastmethode,”
Physica, vol. 1, no. 7–12, pp. 689–704, 1934.
[30] http://www.cs.washington.edu..
[31] M.-J. Tsai, K.-Y. Yu, and Y.-Z. Chen, “Joint wavelet and spatial
transformation for digital watermarking,” IEEE Transactions
on Consumer Electronics, vol. 46, no. 1, pp. 241–245, 2000.
[32] X. Yao, H. Liu, W. Rui, and J. Huang, “Content-based
authentication algorithm for binary images,” in Proceedings of
the International Conference on Image Processing (ICIP ’09), pp.
2893–2896, 2009.
[33] H. Liu, L. Zhu, and J. Huang, “A hybrid watermarking scheme
for video authentication,” in Proceedings of the International
Conference on Image Processing (ICIP ’06), pp. 2569–2572,
2006.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 426085, 18 pages
doi:10.1155/2010/426085

Research Article
Digital Watermarking Method Warranting the Lower Limit of
Image Quality of Watermarked Images

Motoi Iwata, Tomoo Kanaya, Akira Shiozaki, and Akio Ogihara


Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Sakai-shi, Osaka 599-8531, Japan

Correspondence should be addressed to Motoi Iwata, [email protected]

Received 30 November 2009; Revised 16 February 2010; Accepted 2 June 2010

Academic Editor: Jin-Hua She

Copyright © 2010 Motoi Iwata et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose a digital watermarking method warranting the lower limit of the image quality of watermarked images. The proposed
method controls the degradation of a watermarked image by using a lower limit image. The lower limit image means the image of
the worst quality that users can permit. The proposed method accepts any lower limit image and does not require it at extraction.
Therefore lower limit images can be decided flexibly. In this paper, we introduce 2-dimensional human visual MTF model as an
example of obtaining lower limit images. Also we use JPEG-compressed images of quality 75% and 50% as lower limit images.
We investigate the performance of the proposed method by experiments. Moreover we compare the proposed method using three
types of lower limit images with the existing method in view of the tradeoff between PSNR and the robustness against JPEG
compression.

1. Introduction any lower limit image and does not require it at extraction.
Therefore lower limit images can be decided flexibly. In
Digital watermarking is a technique that embeds additional this paper, we introduce 2-dimensional human visual MTF
data into digital contents so that the distortion by embedding model as an example of obtaining lower limit images. Also
them is perceptually undetectable [1]. The distortion of we use JPEG-compressed images of quality 75% and 50% as
watermarked images by general digital watermarking meth- lower limit images, which are popular formats as degraded
ods is fixed only after embedding. Some digital watermarking images.
methods [2] prevent the degradation of the image quality of The rest of this paper consists of five sections. We
watermarked images by using human visual system. However describe our approach in Section 2 and introduce the existing
the lower limit of the image quality of the watermarked techniques in Section 3. Then we describe the detail of the
images was not clear. Such obscurity of the lower limit proposed method in Section 4 and show and discuss the
disturbs the practical use of digital watermarking. performance of the proposed method in Section 5. Finally we
The method proposed by Yoshiura and Echizen [2] conclude this paper in Section 6.
decided the embedding strength by introducing uniform
color space so that the degradation of all regions in a image 2. Our Approach
was equalized. However there is the fact that the degradation
by modification in uniform color space depends on the We assume that there is a range in which the changes for
direction of the modification described in Section 2. pixel values are imperceptible. We call the range “perceptual
In this paper, we propose a digital watermarking capacity.” Existing methods do not modify pixel values in
method warranting the lower limit of the image quality of the perceptual capacity strictly. Therefore we introduce a
watermarked images. The proposed method controls the lower limit image which means the image of the worst
degradation of a watermarked image by using a lower limit quality that users can permit, that is, which provides with
image. The lower limit image means the image of the worst perceptual capacity. The contribution of the introduction of
quality that users can permit. The proposed method accepts lower limit images is the separation of perceptual capacity
2 EURASIP Journal on Advances in Signal Processing


and watermarking procedures. The separation yields the ⎪ B

⎪ , B ≤ 0.04045,
independence of investigation. ⎨ 12.92
B = ⎪ 2.4
The proposed method warrants the lower limit of the ⎪


⎩ (B + 0.055) , otherwise,
image quality of a watermarked image by approximating an 1.055
original image to the corresponding lower limit image for
embedding. Moreover we introduce L∗ a∗ b∗ color space for Rs
R = ,
equalizing the degradation by embedding, where L∗ a∗ b∗ 255
color space is one of the popular uniform color spaces. Gs
Then the quality of a watermarked image is between that G = ,
255
of the original image and that of the lower limit image in
L∗ a∗ b∗ color space. The lower limit image can be decided Bs
B = ,
flexibly because the proposed method does not require it at 255
extraction. (1)
In general, the modification with the same quantity in
a uniform color space perceptually yields the same degrada- where R, G, and B are the values in gamma-transformed
tion. However the direction of the modification is important, sRGB color space, and Rs , Gs , and Bs are the values in sRGB
too. We found this fact by comparing the degradation of the color space.
modified images approaching human visual filtered images Then we obtain XYZ color space from gamma-
with that of the modified images leaving the filtered images, transformed sRGB color space by the following equations:
where the modification was done in L∗ a∗ b∗ color space. ⎛ ⎞ ⎛ ⎞⎛ ⎞
X 0.412453 0.35758 0.180423 R
The human visual filter cuts off redundant component for ⎜ ⎟ ⎜ ⎟⎜ ⎟
⎝Y ⎠ = ⎝0.212671 0.71516 0.072169⎠⎝G⎠. (2)
visual sensation. Figure 1 shows the difference in quality Z 0.019334 0.119193 0.950227 B
by the direction of modification, where the human visual
filter used here is mathematical 2-dimensional human
visual MTF model described in Section 3.2.2. As shown in 3.1.2. L∗ a∗ b∗ Color Space. L∗ a∗ b∗ color space is one of
Figure 1, the degradation of the modified image approaching uniform color spaces established by CIE in 1976 [4]. In
the filtered image is more imperceptible than that of the a uniform color space, the distances between colors are
modified image leaving the filtered image. We utilize this fixed based on the perceptual differences between the colors
feature by employing the images filtered by mathematical [3, 5, 6].
2-dimensional human visual MTF model as one of the L∗ a∗ b∗ color space is obtained from XYZ color space by
types of lower limit images. Also we use JPEG compressed the following equations:
images of quality 75% and 50% as lower limit images, which  
Y
are popular formats as degraded images. In other words, L∗ (Y ) = 116 f − 16,
Yn
employing the MTF model is a theoretical approach to 5    6
generate lower limit images, while using JPEG-compression X Y
a∗ (X, Y ) = 500 f −f ,
is a practical approach. Xn Yn
5    6
Y Z (3)
3. Existing Techniques b∗ (Y , Z) = 200 f −f ,
Yn Zn
3.1. Color Spaces. In this section, we describe XYZ color ⎧ 1/3

⎨t , if t > 0.008856,
space, L∗ a∗ b∗ color space, and opponent color space in
f (t) = ⎪
Sections 3.1.1, 3.1.2, and 3.1.3, respectively. ⎩7.787t + 16 , otherwise,
116
3.1.1. XYZ Color Space. XYZ color space is a color where Xn , Yn , and Zn are coefficients which depend upon the
space established by CIE (Commission Internationale de illuminant (for daylight illuminant D65, Xn = 95.045, Yn =
l’Eclairage) in 1931. The transformation of sRGB color space 100, and Zn = 108.892).
into XYZ color space is as follows [3].
First we obtain gamma-transformed sRGB color space by
3.1.3. Opponent Color Space. Opponent color space is based
the following equations:
on input signals from L cone, M cone, and S cone in retina.
⎧  Opponent color space is obtained from XYZ color space by
⎪ R

⎪ , R ≤ 0.04045,
⎨ 12.92 the following equation:
R = ⎪ 2.4 ⎛ ⎞ ⎛ ⎞⎛ ⎞



⎩ (R + 0.055) , otherwise, ⎜
Jw/k
⎟ ⎜
0.279 0.720 −0.107 X
⎟⎜ ⎟
1.055 ⎝ Jr/g ⎠ = ⎝−0.449 0.290 −0.077⎠⎝Y ⎠, (4)
⎧ J y/b 0.086 −0.590 0.501 Z
⎪ G

⎪ , G ≤ 0.04045,
⎨ 12.92 where Jw/k , Jr/g , and J y/b represent luminance channel
G = ⎪ 2.4


 and the opponent channels of red-green and yellow-blue,
⎩ (G + 0.055) , otherwise,
1.055 respectively.
EURASIP Journal on Advances in Signal Processing 3

luminance, contrast sensitivity is high for medium spatial


frequency and is suddenly low for high spatial frequency. It is
known that the shape of human visual MTF for other color
stimulus is similar to that for luminance.

3.2.2. Mathematical 2-Dimensional Human Visual MTF


Model. Ishihara et al. [7, 8] and Miyake [9] revealed that
human visual MTF depends on directivity in spatial
frequency and mean of stimulus. Moreover they proposed
mathematical 2-dimensional human visual MTF model
Original image about tristimulus on opponent color space.
(a) Let u, v be horizontal
√ and vertical spatial frequency,
respectively, let w = u2 + v2 be the spatial frequency on u-v
plane, and let φ = arctan(v/u) be the direction of w. Then
contrast sensitivity M(u, v) obtained by mathematical 2-
dimensional human visual MTF model is defined as follows:
  2 2
M(u, v) = M0 (w) 1 − 1 − γ(u) 2sin 2φ2 , (5)

where γ(u) represents the ratio of diagonal contrast sensi-


tivity to horizontal contrast sensitivity when the horizontal
spatial frequency is equal to u, and M0 (w) is defined as
follows:
Approaching filtered image  
(b) M0 (w) = β(o)MB w, m p , σ p
 
− β(o)MB mc , m p , σ p MB (w, mc , σc ), (6)
 
MB (w, m, σ) = exp −2π 2 σ 2 (w − m)2 ,

where β(o) represents the maximum value of horizontal


contrast sensitivity on human visual MTF when the mean
of stimulus is equal to o.
We define Mr/g (u, v) and M y/b (u, v) as contrast sensitivity
M(u, v) for red-green channel and yellow-blue channel,
respectively. We can obtain Mr/g (u, v) and M y/b (u, v) from
Leaving filtered image
(5) using the parameters shown in Table 1. The parameters
(c) βr/g (Jr/g ), β y/b (J y/b ), γrg (w), and γ y/b (w) in Table 1 are
calculated by the following equations:
Figure 1: Difference by the direction of modification.
  2
βr/g Jr/g = −0.07570Jr/g + 8.731Jr/g − 1.839,
  2
3.2. Two-Dimensional Human Visual MTF Model β y/b J y/b = −0.054J y/b + 4.851J y/b − 0.8439,
(7)
3.2.1. Modulation Transfer Function. Modulation Transfer γr/g (u) = 0.001531u2 − 0.06149u + 1.14,
Function (MTF) describes the relationship between spacial
frequency and contrast sensitivity. Spatial frequency is a γ y/b (u) = 0.001919u2 − 0.06427u + 1.09,
measure of how often a structure repeats per unit of distance.
As shown in Figure 2, any pattern corresponds to the spacial where Jr/g and J y/b represent the means of all Jr/g and J y/b in
frequency. On the other hand, contrast sensitivity is a an image, respectively. In the literature [7–9], M0 (w) is not
measure of the ability to discern luminances of different calculated when w < 1.5. For the correction of this incident,
levels in a static image. Contrast sensitivity depends on we regard M0 (w) as M0 (1.5) when w < 1.5 so as to obey the
spatial frequency. For example, it tends to be high for meaning of contrast sensitivity. Figure 4 shows the shapes of
medium spatial frequency, while it tends to be low for high Mr/g (u, v) and M y/b (u, v) with or without the correction.
spatial frequency.
Figure 3 shows the shape of human visual MTF for 3.3. Filtering Based on Two-dimensional Human Visual MTF
luminance. As shown in Figure 3, contrast sensitivity is Model. The filter of 2-dimensional human visual MTF
numerically expressed by MTF. In human visual MTF for model cuts off imperceptible components from images.
4 EURASIP Journal on Advances in Signal Processing

Low spacial frequency High spacial frequency


(a) (b)

Figure 2: Patterns with low spatial frequency and high spatial frequency.

80 Step 2. Jr/g (x, y) and J y/b (x, y) are transformed into Fr/g (u, v)
70 and F y/b (u, v) by discrete Fourier transform (DFT), respec-
tively.
60
Contrast sensitivity

50
Step 3. The filtered discrete Fourier transform coefficients
 
Fr/g (u, v) and F y/b (u, v) are, respectively, obtained by the
40 following equations:
30

Fr/g (u, v) = Fr/g (u, v) × Mr/g (u, v),
20 (8)

10 F y/b (u, v) = F y/b (u, v) × M y/b (u, v).

0 Step 4. The filtered pixel values in opponent color space are


0 5 10 15 20 25 30
 
Spatial frequency obtained from Fr/g (u, v) and F y/b (u, v) by inverse DFT. Then
the lower limit image is obtained by the transformation of
Figure 3: Shape of human visual MTF for luminance. opponent color space into sRGB color space.

4. Proposed Method
Table 1: Parameters for MTF.
Jr/g J y/b 4.1. Embedding Procedure. Firstly we divide an original
o Jr/g J y/b
image with Nx × N y pixels and the corresponding lower limit
image into blocks with L × L pixels. Moreover the blocks are
β(o βr/g (Jr/g ) β y/b (J y/b )
divided into subblocks with Ls × Ls pixels. Let B(i, j) and
γ(u) γr/g (u) γ y/b (u) 1 j) be the (i, j)-th block in the original image and the
B(i,
mp 1.5 1.5
lower limit image, respectively, where 0 ≤ i < Nx /L, 0 ≤
σp 1/70 1/45
j < N y /L. Let Bs (k, l) and B1s (k, l) be the (k, l)-th subblock
 0 1/4
in B(i, j) and B(i,1 j), respectively, where 0 ≤ k < L/Ls ,
mc — 7.5
0 ≤ l < L/Ls (“(i, j)” is omitted in the representation of
σc — 1/20
Bs (k, l) and B1s (k, l) for simplicity). The proposed method
embeds one watermark bit into one block. Let bi j ∈ {0, 1}
be the watermark bit embedded in B(i, j).
The embedding procedure of bi j is as follows.
In this paper, only red-green and yellow-blue channels
are filtered, which are based on the characteristic that Step 1. Let g(m, n) be the pixel value located at (m, n) in
modification in luminance is more perceptual than that in B(i, j), where 0 ≤ m < L, 0 ≤ n < L. Then the pixel
red-green or yellow-blue channel. value g(m, n) is regarded as the point Pmn (L∗mn , a∗mn , bmn

)
∗ ∗ ∗
in L a b color space. In the same manner, g1(m, n) and
Step 1. An original image with Nx × N y pixels is transformed P1mn (L
1∗ 1∗ 1∗ 1
mn , amn , bmn ) are defined from B(i, j).
into opponent color space. Let Jr/g (x, y) and J y/b (x, y) be the
values of red-green and yellow-blue channels located at the Step 2. Let D(m, n) be the distance between the origin O
coordinate (x, y), respectively. 1
and the point Pmn , and let D(m, n) be the distance between
EURASIP Journal on Advances in Signal Processing 5

Mr/g (u, v) without correction L∗

80 P , mn 1 − t 1
Pmn
Contrast sensitivity

60 Pmn t
40
20
0
30
20 D(m, n) 1
D(m, n)
−30 −20 0 10
−10 −10 b∗
0
10 −20 v
u 20 30 −30
(a) a∗
O
M y/b (u, v) without correction

Figure 5: Modification Pmn into Pmn .
40
Contrast sensitivity

30
20
10
0
30
20
−30
−20 0 10
−10 −10
0
10 −20 v
u 20 30 −30 Aerial Airplane

(b)
Mr/g (u, v) with correction
1
0.8
Contrast sensitivity

0.6
0.4
0.2 Balloon Couple
0

−30
−20
30 −10
20 10 0
0 10
−10 −20 20 v
u −30 30
(c) Earth Girl
M y/b (u, v) with correction
1
0.8
Contrast sensitivity

0.6
0.4
0.2
0
Lena Mandrill
30
20
−30 −20 10
−10 −10
0
0 −20
10 v
u 20 30 −30
(d)

Figure 4: Shapes of Mr/g (u, v) and M y/b (u, v).


Milkdrop Parrots

the origin O and the point P1mn . D(m, n) and D(m,


1 n) are
obtained by the following:equations:

D(m, n) = L∗mn 2 + a∗mn 2 + bmn


∗ 2,

(9) Pepper Sailboat


1
D(m, 1 2 + a∗
n) = L∗mn 1 2 1 2

mn + bmn . Figure 6: Original images.
6 EURASIP Journal on Advances in Signal Processing

Aerial Airplane Aerial Airplane

Balloon Couple Balloon Couple

Earth Girl Earth Girl

Lena Mandrill Lena Mandrill

Milkdrop Parrots Milkdrop Parrots

Pepper Sailboat Pepper Sailboat

Figure 7: Watermarked images (MTF). Figure 8: Watermarked images (JPEG75).

Step 3. The difference ΔD(m, n) between the norms of the Step 4. The sum W(k, l) of D(m, n) in Bs (k, l) is obtained by
pixels in the original image and the lower limit image is the following equation:
obtained by the following equation:
s −1 L
L s −1
1
ΔD(m, n) = D(m, n) − D(m, n). (10)  
W(k, l) = D Ls k + x, Ls l + y . (12)
y =0 x=0
Moreover the sum Σ+ of positive values and the sum Σ− of
negative values in B(i, j) are obtained as follows:
 Step 5. The mean W of the sums W(k, l) of all subblocks in
Σ+ = ΔD(m, n), B(i, j) is obtained by the following equation:
(∀m,n)ΔD(m,n)≥0
 (11)
L/Ls −1 L/Ls −1  
Σ− = ΔD(m, n). x=0 x=0 W x, y
(∀m,n)ΔD(m,n)<0 W= . (13)
L2 /L2s
EURASIP Journal on Advances in Signal Processing 7

(i) when WQ ≡ bi j mod 2,



W = WQ × Q; (15)

(ii) when WQ ≡
/ bi j mod 2,
⎧ 

⎨ W Q + 1 × Q, if |Σ+ | ≥ |Σ− |,

Aerial Airplane W = ⎪  (16)
⎩ WQ − 1 × Q, if |Σ+ | < |Σ− |.

Moreover we obtain the quantity K which is added to W for


embedding by the following equation:

K = W − W. (17)
Balloon Couple Step 8. We obtain the quantity K(m, n) which is added to
each pixel value D(m, n) in B(i, j) for embedding as follows:
(i) when K ≥ 0,

⎨ KL ΔD(m, n) ,
2

if ΔD(m, n) ≥ 0,
K(m, n) = ⎪ Σ+ L2s (18)
⎩0, if ΔD(m, n) < 0;
Earth Girl
(ii) when K < 0,


⎪ if ΔD(m, n) ≥ 0,
⎨0,
K(m, n) = ⎪ KL2 ΔD(m, n) (19)

⎩ , if ΔD(m, n) < 0.
Σ− L2s
  
Lena Mandrill Step 9. Let P  mn (L∗mn , a∗mn , bmn

) be the watermarked point of

Pmn . As shown in Figure 5, we change Pmn into Pmn so as

to satisfy D (m, n) − D(m, n) = K(m, n) by the following
equation:
⎛ 
⎞ 1 ⎛ ⎞ ⎛ ⎞
L∗mn L∗mn L∗mn
⎜ ∗  ⎟ ⎜ ⎟
⎜a ⎟ = (1 − t)⎜
⎝a∗
⎟ ⎜ ∗1 ⎟
mn ⎠ + t ⎝ amn ⎠, (20)
⎝ mn ⎠
Milkdrop Parrots ∗ ∗
bmn 1

bmn bmn

where t is the ratio for changing of Pmn into Pmn . The ratio t
satisfies 0 ≤ t ≤ 1 and the following equation:
 
ΔL∗mn 2 + Δa∗mn 2 + Δbmn
∗ 2 2
t
 
+ 2 L∗mn ΔL∗mn + a∗mn Δa∗mn + bmn
∗ ∗
Δbmn t (21)
Pepper Sailboat
2
Figure 9: Watermarked images (JPEG50). − 2D(m, n)K(m, n) − K(m, n) = 0,
where ΔL∗mn , Δa∗mn , and Δbmn

are obtained by the following
equation:
Step 6. The quantized mean W Q is obtained by the following 1
equation: ΔL∗mn = L∗mn − L∗
mn ,

7 8 1
W Δa∗mn = a∗mn − a∗
mn , (22)
WQ = + 0.5 , (14)
Q ∗ 1
∗ ∗
Δbmn = bmn − bmn .
where x means the maximum integer which is smaller than Step 10. The watermarked points Pmn
are transformed into
x. The quantizer Q acts as embedding strength. sRGB color space, where the transformation of real numbers
into integers (round-up or round-down) is decided so that
Step 7. The quantized mean W Q will be modified so as to be the influence on Pmn 
is minimized. Then we obtain the
even when bi j = 0 and be modified so as to be odd when watermarked block B(i, j).
bi j = 1 by the following steps (Step 7∼Step 9).

The watermarked value W of the quantized mean is We obtain the watermarked image after all watermark
obtained as follows: bits have been embedded.
8 EURASIP Journal on Advances in Signal Processing

4.2. Extracting Procedure. Firstly we obtain the blocks B(i, j)


and the subblocks Bs (k, l) from a watermarked image in the
same manner as embedding procedure.
The extracting procedure for a block B(i, j) is as follows.

Step 1. The pixel values g(m, n) in B(i, j) are transformed


into L∗ a∗ b∗ color space and are regarded as the points Aerial Airplane
Pmn (L∗mn , a∗mn , bmn

) in L∗ a∗ b∗ color space.

Step 2. The sum W(k, l) of D(m, n) in Bs (k, l) is obtained for


each sub-block in the same manner as (12).

Step 3. The mean W of the sums W(k, l) of all subblocks in


B(i, j) is obtained in the same manner as (13). Balloon Couple

Step 4. The quantized mean W Q is obtained in the same


manner as (14). Then we extract bi j as follows:


⎨0, if W Q ≡ 0 mod 2,
bi j = ⎩ (23)
1, if W Q ≡ 1 mod 2. Earth Girl

We obtain all the watermark bits after extracting for all


blocks.

5. Experiments Lena Mandrill


5.1. Environments. Firstly we investigated the image quality
of watermarked images and lower limit images. Then we
confirmed that embedded watermark bits were perfectly
extracted from watermarked images. Next we investigated
the available range of the embedding strength Q because
the embedding strength should be decided so that the
Milkdrop Parrots
ratio t can exist. Moreover we investigated the property
of the proposed method when the embedding strength
Q was variable for each block. The variable embedding
strength was the maximum value for each block. Finally we
investigated the robustness against JPEG compression and
the comparison with an existing method in view of image
quality and robustness.
Pepper Sailboat
As shown in Figure 6 we used twelve color images
“aerial,” “airplane,” “balloon,” “couple,” “earth,” “girl,” “lena,” Figure 10: Lower limit images (MTF).
“mandrill,” “milkdrop,” “parrots,” “pepper”, and “sailboat”
as original images. They were standard images widely used
for experiments. The size of all original images was 256 × We used PSNR for the evaluation of image quality. PSNR
256 pixels, that is, Nx = 256, and N y = 256. We used was calculated by the following equation:
L = 32 and Ls = 16 as the size of blocks and subblocks,
respectively. All the watermark bits bi j were decided so as to 2552
PSNR = 10 log10 ,
satisfy WQ ≡ / bi j mod 2. Then the watermarked images that MSE
used such watermark bits were worst degraded among those 3Nx N y  (24)
that used any watermark bit. We used Q = Mmin /6 as the 1  2
MSE = , imgi − oimgi ,
embedding strength so that the ratio t in Step 9 in Section 4.1 3Nx N y i=1
could exist, where Mmin represents the minimum value of
the larger ones of |Σ+ | and |Σ− | in each block. The lower where imgi and oimgi represent the pixels in one image
limit images consist of three types, that is, “MTF” which is and the other image, respectively. We also used mean
described in Section 3.2, and “JPEG75” and “JPEG50” which structural similarity (MSSIM) index [10] to evaluating the
are JPEG-compressed images of quality 75% and 50%. The similarity between watermarked images and lower limit
quality 75% of JPEG compression is the standard quality. images. MSSIM index is obtained by calculating the mean
EURASIP Journal on Advances in Signal Processing 9

Aerial Airplane Aerial Airplane

Balloon Couple Balloon Couple

Earth Girl Earth Girl

Lena Mandrill Lena Mandrill

Milkdrop Parrots Milkdrop Parrots

Pepper Sailboat Pepper Sailboat

Figure 11: Watermarked images (MTF, maxQ). Figure 12: Watermarked images (JPEG75, maxQ).

5.2. Results and Discussion


of SSIM indices of all windows on the images. SSIM index
between two window I0 and I1 of size 8 × 8 pixels was 5.2.1. Image Quality. Figures 7∼9 show the watermarked
calculated by the following equation: images using “MTF,” “JPEG75”, and “JPEG50” as the lower
limit images, respectively. As shown in Figure 7∼9, the degra-
  dation of all the watermarked images was imperceptible.
2μ0 μ1 + C1 (2σ01 + C2 )
SSIM(I0 , I1 ) =  2  , (25) Table 2 shows the PSNRs of the watermarked images
μ0 + μ21 + C1 σ02 + σ12 + C2
and the lower limit images against the original images. As
shown in Table 2, the PSNRs of the watermarked images
where μ0 and μ1 represent the means of I0 and I1 , respectively, except for “milkdrop” and “sailboat” are the lowest when the
and σ0 and σ1 represent the variances of I0 and I1 , respec- type of the lower limit images is “MTF.” The PSNRs of the
tively. The constant values C1 and C2 are defined as default watermarked images “milkdrop” and “sailboat” using “MTF”
values, that is, C1 = (0.01 × 255)2 and C2 = (0.03 × 255)2 , are higher than those using “JPEG50,” although the PSNRs
respectively. of the lower limit images of type “MTF” are less than those
10 EURASIP Journal on Advances in Signal Processing

Table 2: PSNRs of watermarked images and lower limit images


against original images.

Watermarked images Lower limit images


MTF JPEG75 JPEG50 MTF JPEG75 JPEG50
aerial 34.9 37.5 35.8 26.1 28.3 26.9
Aerial Airplane airplane 39.5 45.8 43.6 25.7 30.2 28.5
balloon 43.6 48.9 46.8 32.9 34.9 33.3
couple 42.0 47.5 44.0 29.6 34.1 32.6
earth 41.0 48.0 44.6 31.4 33.7 32.0
girl 42.0 46.8 45.0 28.3 32.7 31.5
lena 42.2 45.1 44.2 26.6 32.4 30.6
mandrill 32.4 39.0 37.8 21.9 27.2 25.4
Balloon Couple
milkdrop 43.6 44.1 42.4 30.7 32.3 30.8
parrots 40.0 48.3 46.5 26.0 34.3 31.6
pepper 37.7 41.1 39.7 25.16 28.8 27.4
sailboat 43.9 44.9 43.0 27.8 31.0 29.4

Table 3: The minimum and maximum of Q.


Earth Girl
MTF JPEG75 JPEG50
min max
aerial 8 52 146 181
airplane 8 11 18 35
balloon 7 7 19 26
couple 6 66 42 89
Lena Mandrill
earth 7 51 56 86
girl 6 37 44 56
lena 10 47 55 54
mandrill 6 101 136 154
milkdrop 10 44 78 89
parrots 7 7 38 43
Milkdrop Parrots pepper 8 81 123 162
sailboat 12 13 37 54

Such degradation tends to be imperceptible. Therefore the


images filtered by 2-dimensional human visual MTF model
were appropriate for lower limit images in view of the
Pepper Sailboat direction of modification by embedding. However the lower
Figure 13: Watermarked images (JPEG50, maxQ). limit images of type “MTF” were slightly inappropriate in
view of the strength of modification by embedding because
some degradation was perceptible as shown in Figure 10.
Therefore one of the future works is the improvement of the
of type “JPEG50.” This suggests that the arbitrariness of the decision of the embedding strength.
type of lower limit images is useful. Although the PSNRs
of the watermarked images “aerial” and “mandrill” using 5.2.3. Flexibility of Embedding Strength. Table 3 shows the
“MTF” were less than 37 [dB] and were relatively low, the minimum and maximum of the embedding strength Q. The
degradation of these images was imperceptible because these minimum values of Q of “JPEG75” and “JPEG50” are similar
images mainly consisted of texture-like or noisy regions as to those of “MTF.” The minimum of the embedding strength
shown in Figure 7. was fixed so that the embedded watermark could be perfectly
extracted from the watermarked image. The maximum of
5.2.2. Validity of Lower Limit Images. Figure 10 shows the the embedding strength was fixed so that the ratio t could
lower limit images of type “MTF.” As shown in Figure 10, exist (the maximum of Q is equal to Mmin /6). As shown in
the degradation of the lower limit images of type “MTF” Table 3, the range of available Q depended on images. In
appeared as emphasizing the difference of color, for example, “balloon” and “parrots,” the flexibility of Q was low because
the hair in “mandrill” or the profile of parrots in “parrots.” the maximum of Q is equal to the minimum of Q. It is the
EURASIP Journal on Advances in Signal Processing 11

Original Existing method (32.9 (dB))


(a) (b)

Proposed method (same Q-MTF, 32.4 (dB)) Proposed method (max Q-JPEG50, 31.5 (dB))
(c) (d)

Figure 14: Comparison in the image quality.

future work to investigate the relationship between the range Table 4: PSNRs of watermarked images using the maximum of Q
of available embedding strengths and the robustness against of each block.
attacks. MTF JPEG75 JPEG50
aerial 30.0 34.4 32.8
5.2.4. Performance Using the Maximum of Q of Each Block. airplane 28.6 35.1 34.9
We investigated the property of the proposed method when
balloon 37.6 40.1 38.7
the maximum of Q (= Mmin /6) of each block is used; that
is, embedding strength Q is variable by a block. The demerit couple 33.5 40.0 38.3
of using the maximum of Q of each block is the increase of earth 34.9 40.2 38.1
the quantity of data saved for extracting. In the following, girl 32.8 38.1 37.2
we call the methods using the same Q and the maximum lena 30.1 38.2 36.3
of Q “sameQ” and “maxQ”, respectively. Note that the high mandrill 25.0 33.4 31.5
maximum of Q in Table 3 does not always cause the low milkdrop 34.9 37.7 36.5
PSNR of the watermarked image of “maxQ” with such Q parrots 29.5 40.4 37.1
because the PSNR does not depend on the maximum of Q pepper 29.9 34.1 31.8
among all blocks but on the distribution of Q for each block
when the maximum of Q of each block is used. sailboat 32.6 35.9 35.2
12 EURASIP Journal on Advances in Signal Processing

Original Existing method (32.9 (dB))


(a) (b)

Proposed method (same Q-MTF, 32.4 (dB)) Proposed method (max Q-JPEG50, 31.5 (dB))
(c) (d)

Figure 15: Comparison in the image quality by enlarged partial regions.

Figures 11∼13 show the watermarked images using but also too large modification by embedding. However
“MTF,” “JPEG75”, and “JPEG50” as the lower limit image, we obtained practical results when we use “JPEG75” and
respectively. The embedding strength of all the watermarked “JPEG50” as the lower limit images.
images is ”maxQ”. Table 4 shows the PSNRs of watermarked
images using the maximum of Q of each block. The
5.2.5. Similarity between Watermarked Images and Lower
degradation of all the watermarked images using “JPEG75”
Limit Images. Table 5 shows the MSSIMs between water-
and “JPEG50” was imperceptible. The degradation of the
marked images and lower limit images. As shown in Table 5,
watermarked image “mandrill” using “MTF” was slightly
we confirmed that all the watermarked images were similar
perceptible as scattered green dots in the hair of mandrill.
to the lower limit images because all the MSSIMs were high.
Table 4 shows the PSNRs of the watermarked images using
It is natural that the MSSIMs of “maxQ” are larger than
“maxQ” as the embedding strength. Although PSNR of
those of “sameQ” because the use of larger Q yields the closer
“airplane” using “MTF” is under 30 [dB], the degradation
watermarked images to the lower limit images. It is the reason
of “airplane” was imperceptible because the degradation was
why the MSSIMs of “maxQ” are not 1.0 that there are some
chromatic. On the other hand, although the degradation
pixels of which K(m, n) are equal to 0 in (18) or (19).
of “mandrill” was mainly texture-like chromatic noise on
texture-like regions, the degradation of “mandrill” was
slightly perceptible because the modification by embedding 5.2.6. Robustness against JPEG Compression. We define the
was large. We confirmed that the use of “MTF” caused not number of correctly extracted bits divided by the number
only the right direction of the modification by embedding of all embedded bits as extraction rate. Tables 6 and 7 show
EURASIP Journal on Advances in Signal Processing 13

Aerial Airplane
100 90
95
80
90
85 70
Extraction rate (%)

Extraction rate (%)


80
75 60

70 50
65
60 40
55
30
50
45 20
28 30 32 34 36 38 40 42 28 30 32 34 36 38 40 42 44 46
PSNR (dB) PSNR (dB)
(a) (b)

Ballon Couple
100 100

90
90

80
Extraction rate (%)

Extraction rate (%)

80
70
70
60
60
50

40 50

30 40
30 32 34 36 38 40 42 44 46 48 50 30 32 34 36 38 40 42 44 46 48
PSNR (dB) PSNR (dB)
(c) (d)
Earth Girl
95 95
90 90
85 85
80 80
Extraction rate (%)

Extraction rate (%)

75 75
70 70
65 65
60 60
55 55
50 50
45 45
30 32 34 36 38 40 42 44 46 48 30 32 34 36 38 40 42 44 46 48
PSNR (dB) PSNR (dB)

Conv MaxQ-MTF Conv MaxQ-MTF


SameQ-MTF MaxQ-JPEG50 SameQ-MTF MaxQ-JPEG50
SameQ-JPEG50 MaxQ-JPEG75 SameQ-JPEG50 MaxQ-JPEG75
SameQ-JPEG75 SameQ-JPEG75
(e) (f)

Figure 16: Comparison in the robustness against JPEG compression of quality 75% (1).
14 EURASIP Journal on Advances in Signal Processing

Lena Mandrill
100 100
90
90
80
Extraction rate (%)

Extraction rate (%)


80 70
60
70
50
60 40
30
50
20

40 10
30 32 34 36 38 40 42 44 46 24 26 28 30 32 34 36 38 40 42
PSNR (dB) PSNR (dB)
(a) (b)
Milkdrop Parrots
85 95
90
80
85
75
80
Extraction rate (%)

Extraction rate (%)

70 75
65 70

60 65
60
55
55
50
50
45 45
30 32 34 36 38 40 42 44 46 28 30 32 34 36 38 40 42 44 46 48 50
PSNR (dB) PSNR (dB)
(c) (d)
Pepper Sailboat
75 90

70
80
65
Extraction rate (%)

Extraction rate (%)

70
60
55 60

50
50
45
40
40

35 30
28 30 32 34 36 38 40 42 30 32 34 36 38 40 42 44 46
PSNR (dB) PSNR (dB)

Conv MaxQ-MTF Conv MaxQ-MTF


SameQ-MTF MaxQ-JPEG50 SameQ-MTF MaxQ-JPEG50
SameQ-JPEG50 MaxQ-JPEG75 SameQ-JPEG50 MaxQ-JPEG75
SameQ-JPEG75 SameQ-JPEG75
(e) (f)

Figure 17: Comparison in the robustness against JPEG compression of quality 75% (2).
EURASIP Journal on Advances in Signal Processing 15

Aerial Airplane
100 100
95
90 90
85
Extraction rate (%)

Extraction rate (%)


80 80
75
70 70
65
60 60
55
50 50
28 30 32 34 36 38 40 42 28 30 32 34 36 38 40 42 44 46
PSNR (dB) PSNR (dB)
(a) (b)

Ballon Couple
100 100
95
90
90
Extraction rate (%)

Extraction rate (%)

85 80
80
70
75
70 60
65
50
60
55 40
30 32 34 36 38 40 42 44 46 48 50 30 32 34 36 38 40 42 44 46 48
PSNR (dB) PSNR (dB)
(c) (d)
Earth Girl
100 100

90
95
80
Extraction rate (%)

Extraction rate (%)

90
70

85 60

50
80
40
75
30

70 20
30 32 34 36 38 40 42 44 46 48 30 32 34 36 38 40 42 44 46 48
PSNR (dB) PSNR (dB)

Conv MaxQ-MTF Conv MaxQ-MTF


SameQ-MTF MaxQ-JPEG50 SameQ-MTF MaxQ-JPEG50
SameQ-JPEG50 MaxQ-JPEG75 SameQ-JPEG50 MaxQ-JPEG75
SameQ-JPEG75 SameQ-JPEG75
(e) (f)

Figure 18: Comparison in the robustness against JPEG compression of quality 90% (1).
16 EURASIP Journal on Advances in Signal Processing

Lena Mandrill
100 100
95 95
90
90
85
Extraction rate (%)

Extraction rate (%)


85
80
75 80
70 75
65
70
60
55 65

50 60
30 32 34 36 38 40 42 44 46 24 26 28 30 32 34 36 38 40 42
PSNR (dB) PSNR (dB)
(a) (b)
Milkdrop Parrots
100 100
95
90 90
85
Extraction rate (%)

Extraction rate (%)

80
80

70 75
70
60 65
60
50
55
40 50
30 32 34 36 38 40 42 44 46 28 30 32 34 36 38 40 42 44 46 48 50
PSNR (dB) PSNR (dB)
(c) (d)
Pepper Sailboat
100 100

95
90
90
Extraction rate (%)

Extraction rate (%)

80
85

70 80

75
60
70
50
65

40 60
28 30 32 34 36 38 40 42 30 32 34 36 38 40 42 44 46
PSNR (dB) PSNR (dB)

Conv MaxQ-MTF Conv MaxQ-MTF


SameQ-MTF MaxQ-JPEG50 SameQ-MTF MaxQ-JPEG50
SameQ-JPEG50 MaxQ-JPEG75 SameQ-JPEG50 MaxQ-JPEG75
SameQ-JPEG75 SameQ-JPEG75
(e) (f)

Figure 19: Comparison in the robustness against JPEG compression of quality 90% (2).
EURASIP Journal on Advances in Signal Processing 17

Table 5: MSSIM between watermarked images and lower limit the extraction rates in JPEG compression of quality 75% and
images. 90%, respectively.
As shown in Table 6, the proposed method using
MTF JPEG50 JPEG75
“sameQ” had no robustness against JPEG compression
sameQ maxQ sameQ maxQ sameQ maxQ of quality 75%. Using “maxQ,” some extraction rates of
aerial 0.9954 0.9972 0.9415 0.9514 0.9640 0.9702 “JPEG75”and “JPEG50” against JPEG compression of quality
airplane 0.9952 0.9980 0.9551 0.9710 0.9720 0.9841 75% were larger than 90%. It was noticeable that some
balloon 0.9973 0.9986 0.9614 0.9765 0.9757 0.9855 extraction rates of “JPEG75” were larger than those of
couple 0.9881 0.9940 0.9603 0.9698 0.9713 0.9806 “JPEG50” although the PSNRs of “JPEG75” were larger than
earth 0.9933 0.9965 0.9692 0.9776 0.9813 0.9868 those of “JPEG50.” The investigation of the relationship
girl 0.9739 0.9872 0.9550 0.9703 0.9684 0.9807 between lower limit images and robustness is one of our
lenna 0.9750 0.9912 0.9562 0.9703 0.9738 0.9814 future works.
mandrill 0.9811 0.9911 0.9211 0.9449 0.9588 0.9696
As shown in Table 7, the proposed method using
“sameQ” had partial robustness against JPEG compression
milkdrop 0.9700 0.9840 0.9551 0.9650 0.9704 0.9768
of quality 90%. On the other hand, almost all the extraction
parrots 0.9826 0.9933 0.9578 0.9733 0.9756 0.9831
rates using “maxQ” were equal to 100%. Therefore the
pepper 0.9603 0.9782 0.9629 0.9733 0.9765 0.9819 proposed method using “maxQ” had the robustness against
sailboat 0.9889 0.9953 0.9662 0.9788 0.9802 0.9882 JPEG compression of quality 90%.

5.2.7. Comparison with Existing Method. We use the existing


Table 6: Extraction rates in JPEG compression of quality 75%. method proposed by Yoshiura and Echizen in the literature
[2] for comparison. Yoshiura’s method used the correla-
sameQ maxQ tion of 2-dimensional random sequences which was one
MTF JPEG75 JPEG50 MTF JPEG75 JPEG50 of popular watermarking procedures. Moreover Yoshiura’s
aerial 50.00 48.44 56.25 56.25 96.88 95.31 method took into consideration human visual system by
airplane 37.50 50.00 48.44 23.44 82.81 62.50 using L∗ u∗ v∗ color space which was one of uniform color
balloon 53.13 50.00 65.63 34.36 87.50 90.63 spaces. Therefore Yoshiura’s method was appropriate to the
comparison.
couple 45.31 48.44 57.81 56.25 93.75 82.81
Figure 14 shows the original image “mandrill” and the
earth 57.81 46.88 53.13 45.31 93.75 87.50
watermarked images of the existing method and the pro-
girl 53.13 48.44 46.88 56.25 93.75 89.06 posed methods using “sameQ-MTF” and “maxQ-JPEG50.”
lenna 40.63 51.56 50.00 82.81 92.19 84.38 The PSNRs of the watermarked images were approximately
mandrill 43.75 70.31 50.00 15.63 93.75 96.88 equalized as described in Figure 14. As shown in Figure 14,
milkdrop 50.00 64.06 45.31 59.38 82.81 67.19 chromatic block noises were perceptible in the watermarked
parrots 54.69 59.38 45.31 48.44 81.25 93.75 image of the existing method, while the degradation was
pepper 39.06 65.63 67.19 43.75 75.00 71.88 imperceptible in the watermarked images of the pro-
sailboat 42.19 40.63 46.88 35.98 89.06 59.38 posed methods using “sameQ-MTF” and “maxQ-JPEG50”
although the PSNRs of of the proposed methods were lower
than the PSNR of the existing method. Figure 15 shows the
enlarged partial regions of the images in Figure 14. As shown
in Figure 15, the degradation of each watermarked image was
Table 7: Extraction rates in JPEG compression of quality 90%. able to be observed in detail. The degradation of the existing
sameQ maxQ method was chromatic block noise. The degradation of the
MTF JPEG75 JPEG50 MTF JPEG75 JPEG50
proposed method using “sameQ-MTF” was strong chro-
matic edge enhancement. The degradation of the proposed
aerial 89.06 100.00 100.00 100.00 100.00 100.00
method using “maxQ-JPEG50” was imperceptible even if
airplane 57.81 53.13 71.88 93.75 100.00 100.00 the partial region was enlarged. It was the reason why the
balloon 57.81 78.13 89.06 82.81 96.88 98.44 degradation of the proposed method using “maxQ-JPEG50”
couple 59.38 42.19 89.06 100.00 98.44 100.00 was not block noise that the location of the pixels modified
earth 89.06 92.19 96.88 100.00 100.00 100.00 by embedding was scattered by (18) and (19).
girl 23.44 28.13 54.69 79.69 100.00 100.00 Figures 16∼19 show the comparison of Yoshiura’s
lenna 98.44 98.44 95.31 100.00 100.00 100.00 method and the proposed method using “MTF,” “JPEG75” or
mandrill 70.31 100.00 100.00 100.00 100.00 100.00 “JPEG50” as the lower limit images and “sameQ”, or “maxQ”
milkdrop 87.50 96.88 98.44 100.00 100.00 100.00 as the embedding strength. The horizontal axis of the graphs
parrots 50.00 92.19 96.88 93.75 98.44 98.44
in Figures 16∼19 represents PSNR[dB] of watermarked
images, while the vertical axis represents extraction rate[%].
pepper 95.31 100.00 100.00 100.00 100.00 100.00
In Figures 16∼19, the performance of the proposed method
sailboat 68.75 81.25 98.44 100.00 100.00 100.00 is represented by the point for each condition, while that of
18 EURASIP Journal on Advances in Signal Processing

the existing method is represented by the curve. We evaluated human vision models,” IEICE Transactions on Information and
the superiority of the proposed method by checking whether Systems, vol. E89-D, no. 1, pp. 256–270, 2006.
the point of the proposed method was above the curve of the [3] “JIS handbook 61 Color 2007,” Japanese Standards Associa-
existing method or not. As shown in Figures 16 and 17, only tion, 2007.
the point corresponding to “maxQ-JPEG75” was above the [4] Wikipedia, “Lab color space,” November 2009, http://en
curve of the existing method for the results of all test images. wikipedia.org/wiki/Lab color space.
Therefore the proposed method using “maxQ-JPEG75” was [5] T. Oyama, Invitation to Visual Psycology, Saiensu-sha Co.,
superior to the existing method for all test images in view of 2000.
the robustness against JPEG compression of quality 75%. In [6] “Colors & Dyeing Club in Nagoya,” Osaka, November 2009,
http://www005.upp.so-net.ne.jp/fumoto/.
comparison with each parameter of the proposed method,
[7] T. Ishihara, K. Ohishi, N. Tsumura, and Y. Miyake, “Depen-
in the results of “balloon,” “mandrill,” and “parrots,” the dence of directivity in spatial frequency responseof the human
point corresponding to “maxQ-JPEG50” was located on the eye (1): measurement of modulation transfer function,”
upper-left of “maxQ-JPEG75.” The superiority of “maxQ- Journal of the Society ofPhotographic Science and Technology of
JPEG50” on the above cases would be decided depending on Japan, vol. 65, no. 2, pp. 121–127, 2002.
the importance of an extraction rate and a PSNR. As shown [8] T. Ishihara, K. Ohishi, N. Tsumura, and Y. Miyake, “Depen-
in Figures 18 and 19, the points corresponding to “maxQ- dence of directivity in spatial frequency responseof the human
JPEG75” and “maxQ-JPEG50” were above the curve of the eye (2): mathematical modeling of modulation transfer
existing method for the results of all test images. Therefore function,” Journal of the Societyof Photographic Science and
the proposed method using “maxQ-JPEG75” or “maxQ- Technology of Japan, vol. 65, no. 2, pp. 128–133, 2002.
JPEG50” was superior to the existing method for all test [9] Y. Miyake, T. Ishihara, K. Ohishi, and N. Tsumura, “Measure-
images in view of the robustness against JPEG compression ment and modeling of the two dimensionalMTF of human
of quality 90%. Moreover the extraction rates of “maxQ- eye and its application for digital color reproduction,” in
Proceedings of the 9th IS&T and SID Color Image Conference,
JPEG75” and “maxQ-JPEG50” for all test images were over
pp. 153–157, Scottsdale, Ariz, USA, 2001.
95%, where the errors could be recovered by using error
[10] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli,
correcting codes. In comparison with each parameter of “Image quality assessment: from error visibility to structural
the proposed method, the PSNRs of “maxQ-JPEG75” were similarity,” IEEE Transactions on Image Processing, vol. 13, no.
higher than those of “maxQ-JPEG50” for all test images. 4, pp. 600–612, 2004.
From above discussion, the performance of “maxQ-JPEG75”
was totally the best because of the imperceptibility shown in
Figure 12 and the robustness against JPEG compression.

6. Conclusion
We have proposed a watermarking method warranting the
lower limit of the image quality of watermarked images.
The proposed method warrants the lower limit of the image
quality of watermarked images by introducing lower limit
images and equalizes the degradation by embedding on
watermarked images by using L∗ a∗ b∗ color space. We have
investigated the image quality of watermarked images, the
validity of the lower limit images filtered by mathematical 2-
dimensional human visual MTF model, the flexibility of the
embedding strength, the performance using the maximum
of Q of each block, the similarity between watermarked
images and lower limit images, the robustness against JPEG
compression, and the comparison with the existing method.
Our future works should be to investigate the relationship
between the robustness against general image processing
and lower limit images and to improve the decision of the
embedding strength for each block so as to improve the
tradeoff of PSNR and an extraction rate.

References
[1] K. Matsui, Fundamentals of Digital Watermarking, Morikita
Shuppan, 1998.
[2] H. Yoshiura and I. Echizen, “Maintaining picture quality
and improving robustness of color watermarking by using
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 540723, 13 pages
doi:10.1155/2010/540723

Research Article
A Contourlet-Based Image Watermarking Scheme with
High Resistance to Removal and Geometrical Attacks

Sirvan Khalighi,1, 2 Parisa Tirdad,1 and Hamid R. Rabiee2


1 Electicaland Computer Engineering Department, Islamic Azad University of Qazvin, Iran
2 AICTC Research Center, Department of Computer Engineering, Sharif University of Technology, Iran

Correspondence should be addressed to Sirvan Khalighi, [email protected]

Received 16 August 2009; Revised 8 January 2010; Accepted 1 June 2010

Academic Editor: Yingzi Du

Copyright © 2010 Sirvan Khalighi et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

We propose a new nonblind multiresolution watermarking method for still images based on the contourlet transform (CT). In
our approach, the watermark is a grayscale image which is embedded into the highest frequency subband of the host image in
its contourlet domain. We demonstrate that in comparison to other methods, this method enables us to embed more amounts of
data into the directional subbands of the host image without degrading its perceptibility. The experimental results show robustness
against several common watermarking attacks such as compression, adding noise, filtering, and geometrical transformations. Since
the proposed approach can embed considerable payload, while providing good perceptual transparency and resistance to many
attacks, it is a suitable algorithm for fingerprinting applications.

1. Introduction domain or in frequency domain. Recent works on digital


watermarking for still images are applied on frequency
Recent rapid growth of distributed networks such as Inter- domain.
net enables the users and content providers to access, Among the transform domain techniques, discrete
manipulate, and distribute digital contents in high volumes. wavelet transform-(DWT-) based techniques are more pop-
In this situation, there is a strong need for techniques to ular, since DWT has a number of advantages over other
protect the copyright of the original data to prevent its transforms including space-frequency localization, multires-
unauthorized duplication. One approach to address this olution representation, superior HVS modeling, linear com-
problem involves adding an invisible structure to a host plexity, and adaptivity [10]. In general, the DWT algorithms
media to prove its copyright ownership. These structures try to locate regions of high frequency or middle frequency to
are known as digital watermarks. Digital watermarking is embed information, imperceptibly [11]. Even though DWT
performed upon various types of digital contents such as is popular, powerful, and familiar among watermarking
images, audio, text, video, and 3D models. It is applied techniques, it has its own limitations in capturing the
to many applications, such as copyright protection, data directional information such as smooth contours and the
authentication, fingerprinting, and data hiding [1]. Current directional edges of the image. This problem is addressed by
methods of watermarking images, depending on whether contourlet transform (CT) [12]. The contourlet transform
the original image is used during watermark extraction was developed as an improvement over wavelet where
process or not, could be divided into two categories: the directional information is important. In addition to
blind and non-blind methods. Schemes reported in [2, 3] multiscale and time-frequency localization proprieties of
are nonblind methods, while the methods in [4–9] are wavelets, CT offers directionality and anisotropy.
categorized as blind methods. Most of the reported schemes Zaboli and Moin [2] used the human visual System
use an additive watermark to the image in the spatial characteristics and an entropy-based approach to create an
2 EURASIP Journal on Advances in Signal Processing

efficient watermarking scheme. It decomposes the origi-


···
nal image in CT domain in four hierarchical levels and LPD
watermarks it with a binary logo image which is scrambled (2,2) DFB
through a well-known PN sequence. They showed adding a LFD
Image Coarse scale
scrambled watermark to high-pass coefficients in an adaptive DFB
way based on entropy results in a high performance detection
capability for watermark extraction.
Jayalakshmi et al. [3] proposed a non-blind watermark- Fine scale Directional subbands
ing scheme using the pixels selected from high frequency
coefficients based on directional subband which doubles Figure 1: Contourlet filter bank [6].
at every level. They noted that contourlet-based methods
perform much better than wavelet-based methods in images a
like maps. The watermark was a 16×16 binary logo.
Duan et al. [4] proposed a watermarking algorithm x b
using nonredundant contourlet transform that exploits the H M M G −
+
energy relations between parent and children coefficients.
This special relationship provides energy invariance before
and after the JPEG compression. They embedded a pseudo- (a)
random binary watermark exploiting the modulation of the a
energy relations.
Xiao et al. [5] proposed an adaptive watermarking scheme based on texture and luminance features in the CT domain, which uses the texture and luminance features of the host image to find the positions in which the watermark is embedded. Salahi et al. [6] presented a new blind spread spectrum method in the contourlet domain, where the watermark is embedded through a PN sequence in selected contourlet coefficients of the cover image, and the data embedding is performed in selected subbands, providing higher resiliency through a better spread of spectrum compared to the other subbands.

Shu et al. [7] proposed a blind HVS-based watermarking algorithm in the translation-invariant circular symmetric contourlet transform. This approach shows good resistance against Gaussian white noise attacks. Lian et al. [8] presented a method based on the nonsampled contourlet transform (NSCT). The algorithm provides an HVS model in the NSCT domain, exploiting the masking characteristics of the HVS to embed the watermark adaptively. Wei et al. [9] presented an adaptive watermarking method in the CT domain based on clustering of mean shift texture features. During clustering, three texture features, namely energy, entropy, and contrast, are selected for mean shift fast clustering. The watermark is directly embedded in the strong texture regions of the host image.

In [13], we proposed a new contourlet-based image watermarking method which embeds a grayscale watermark with as much as 25% of the host image size in the 16th directional subband of the host image. Since the original image is required for watermark extraction, our method is considered to be nonblind. In this paper, we employ the method introduced in [13] with more details and some improvements in our algorithm, and provide comprehensive experiments with more host images. The remainder of the paper is organized as follows. In Section 2, we present the Contourlet Transform (CT). In Section 3, we introduce the proposed approach. Experimental results are discussed in Section 4. Final remarks are outlined in Section 5.

2. Discrete Contourlet Transform

The contourlet transform (CT) is a geometrical image-based transform that was introduced in [12]. In the contourlet transform, the Laplacian pyramid (LP) is first used to capture point discontinuities. It is then followed by a directional filter bank (DFB) to link point discontinuities into linear structures [14]. As shown in Figure 1, the first stage is LP decomposition and the second stage is DFB decomposition. The overall result is an image expansion using basic elements like contour segments, hence the name contourlet transform, which is implemented by a pyramidal directional filter bank (PDFB) [15]. At each level, the LP decomposition generates a downsampled lowpass version of the original, and the difference between the original and the prediction results in a bandpass image. Figure 2 illustrates this process, where H and G are the analysis and synthesis filters, respectively, and M is the sampling matrix.

Figure 2: Laplacian pyramid scheme (a) analysis and (b) reconstruction [12].

The bandpass image obtained in the LP decomposition is further processed by a DFB. The DFB is designed to capture high-frequency content such as smooth contours and directional edges. It is efficiently implemented via a K-level binary tree decomposition that leads to 2^K subbands with wedge-shaped frequency partitioning, as shown in Figure 3. The contourlet decomposition is illustrated in Figure 4 using the Lena test image of size 512×512, decomposed into four levels. At each successive level, the number of directional subbands is 2, 4, 8, and 16.
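As a concrete illustration of the LP analysis step described above, the following Python sketch (NumPy/SciPy) computes one pyramid level. The separable binomial kernel standing in for the analysis filter H and the sampling matrix M = 2I are simplifying assumptions for illustration only, not the PKVA filters used later in the paper.

```python
import numpy as np
from scipy.ndimage import convolve

def lp_analysis_level(img, h):
    """One Laplacian-pyramid level: lowpass + downsample, then bandpass residual."""
    low = convolve(img, np.outer(h, h), mode="reflect")    # lowpass with H
    coarse = low[::2, ::2]                                  # downsample by M = diag(2, 2)
    # Prediction: upsample the coarse image and filter with the synthesis kernel G
    up = np.zeros_like(img)
    up[::2, ::2] = coarse
    prediction = convolve(up, 4 * np.outer(h, h), mode="reflect")
    bandpass = img - prediction                             # difference = bandpass image
    return coarse, bandpass

# Example with a 5-tap binomial kernel as a stand-in analysis filter
h = np.array([1, 4, 6, 4, 1], dtype=float) / 16
image = np.random.rand(512, 512)            # placeholder for a 512x512 host image
coarse, bandpass = lp_analysis_level(image, h)
print(coarse.shape, bandpass.shape)          # (256, 256) (512, 512)
```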
Figure 3: Frequency partitioning (k = 3, 2^k = 8 wedge-shaped frequency subbands) [12].

Figure 4: Contourlet decomposition of Lena.

Figure 5: Energy variation in the last level.

Embedding the watermark in high-frequency components improves the perceptibility of the watermarked image. Therefore, we have selected the highest frequency subband, which possesses the maximum energy, for watermark embedding (Figure 5). The energy E of a subband s(i, j), 0 ≤ i, j ≤ N, is computed by

E = Σ_i Σ_j |s(i, j)|².    (1)

The majority of coefficients in the highest frequency subband are significant values compared to the other subbands of the same level, indicating the presence of directional edges.
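To make the subband-selection rule of (1) concrete, the sketch below computes the energy of each directional subband returned by a contourlet decomposition and picks the most energetic one. The `contourlet_decompose` call in the usage comment is a hypothetical stand-in for whatever CT implementation is available (the paper itself uses a PDFB with PKVA filters).

```python
import numpy as np

def subband_energy(s):
    """Energy of a subband, E = sum_{i,j} |s(i,j)|^2, as in (1)."""
    return float(np.sum(np.abs(s) ** 2))

def select_max_energy_subband(directional_subbands):
    """Return the index and coefficients of the most energetic directional subband."""
    energies = [subband_energy(s) for s in directional_subbands]
    k = int(np.argmax(energies))
    return k, directional_subbands[k], energies

# Hypothetical usage: `directional_subbands` would be the 16 subbands produced at
# the finest level of a 4-level contourlet decomposition.
# lowpass, directional_subbands = contourlet_decompose(host_image, levels=[2, 4, 8, 16])
# k, best, energies = select_max_energy_subband(directional_subbands)
```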
3. The Proposed Approach

We select the contourlet transform for watermark embedding because it captures directional edges and smooth contours better than other transforms. Since the human visual system is less sensitive to edges, embedding the watermark in a directional subband improves the perceptibility of the watermarked image, but it is hardly robust. To achieve robustness, we could embed the watermark in the lowpass image of the contourlet decomposition; however, the perceptibility of the watermarked image then degrades. In our scheme, although the watermark is embedded into the highest frequency subbands, it is likely to be spread out into all subbands when we reconstruct the watermarked image, due to the special transform structure of the Laplacian pyramid (LP) [16]. Because the high-frequency subbands of the watermarked image contain the watermarking components, the proposed scheme is highly robust against various low-frequency attacks, which remove the low-frequency components of the image. On the other hand, some watermarking components are preserved in the low-frequency subbands. Thus, the scheme is expected to be robust also to high-frequency attacks, which destroy the high-frequency components of the image. Consequently, the proposed watermarking scheme is robust to the wide range of spectral attacks resulting from both low- and high-frequency processing techniques. The proposed approach is presented in Section 3.1.

3.1. Watermark Embedding Technique. In the proposed algorithm, the watermark, which is a grayscale image with as much as 25% of the host image size, is embedded into the gray-level host image of size N × N. The host image and the watermark are transformed into the contourlet domain. Then, the CT coefficients of the last directional subband of the host image are modified to embed the watermark. The steps involved in watermark embedding are shown in Figure 6. We use f(i, j) to denote the host image, f'(i, j) the watermarked image, and w(i, j) the watermark. The technique comprises three main steps, as discussed below.

Step 1. The host image f(i, j) of size N × N and the watermark w(i, j) of size N/2 × N/2 are transformed into the CT domain. An "n"-level pyramidal structure is selected for the LP decomposition. At each level l_k there are 2^(l_k) directional subbands, where k = 1, 2, 3, ..., n. The highest frequency subband of the host image is selected for watermark embedding. The watermark decomposition results in two subbands w1, w2 and a lowpass image. Since w1 and w2 have the same resolution, we choose one of them, in addition to the lowpass image, for watermark embedding.
Figure 6: Embedding algorithm.

Figure 7: Extraction algorithm.

Step 2. The coefficients of the selected subband are modified as follows [17]:

f'_lk(i, j) = f_lk(i, j) + α · W(i, j),    (2)

where f_lk(i, j) represents the lth-level, kth directional subband coefficients, and α is a weighting factor which controls robustness and perceptual quality.

Step 3. The inverse contourlet transform (ICT) is applied to the modified directional subbands to obtain the watermarked image.
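The additive rule in (2) amounts to a single array operation once the CT coefficients are available. The sketch below assumes the selected directional subband is given as a NumPy array and that the watermark subband W has already been decomposed to the same shape; it illustrates (2) under those assumptions rather than reproducing the authors' implementation.

```python
import numpy as np

def embed_subband(f_lk, w, alpha=0.1):
    """Additive embedding of (2): f'_lk(i,j) = f_lk(i,j) + alpha * W(i,j)."""
    if f_lk.shape != w.shape:
        raise ValueError("watermark subband must match the host subband shape")
    return f_lk + alpha * w

# f_lk would be the highest-energy directional subband of the host decomposition,
# w the corresponding watermark subband; alpha = 0.1 is the value used in Section 4.
```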
3.2. Watermark Extraction Process. For retrieving the watermark, we need a copy of the original image as a reference. By using the inverse embedding formula (3), we can extract the embedded watermark:

w'(i, j) = (f'_lk(i, j) − f_lk(i, j)) / α.    (3)

The extraction process consists of the following steps.

Step 1. Both the watermarked and original images are transformed into the CT domain.

Step 2. The directional subband and the lowpass image of the embedded watermark are retrieved by subtracting the highest frequency subbands of the original and the watermarked image using (3).

Step 3. For reconstructing the watermark, the Laplacian pyramid requires both directional subbands (W1, W2) and the lowpass image (L). Instead of inputting (L, W1, W2), we input (L, W1, W1) into the LP.
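Correspondingly, the nonblind extraction of (3) simply inverts the additive rule given the original subband. The sketch below mirrors the steps above under the same assumptions as the embedding snippet; the `(L, W1, W1)` trick of Step 3 is shown only as a comment because it depends on the particular LP reconstruction routine used.

```python
def extract_subband(f_lk_marked, f_lk_original, alpha=0.1):
    """Inverse of the embedding rule, i.e. (3): w'(i,j) = (f'_lk - f_lk) / alpha."""
    return (f_lk_marked - f_lk_original) / alpha

# Step 3 of the extraction then rebuilds the watermark from the lowpass image L and
# the recovered directional subband, e.g. (hypothetical helper):
# watermark = lp_reconstruct(L, [w1, w1])   # (L, W1, W1) instead of (L, W1, W2)
```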

The watermark extraction process is summarized in Figure 7.

By increasing the number of decomposition levels, the watermarking capacity is also increased, and the quality of the extracted watermark is improved. In order to achieve this goal, after selecting a subband, we can use the other directional subbands which have the highest level of energy.

The watermarked image quality is measured by the PSNR between f and f', formulated by

PSNR = 10 log10(255² / MSE) (dB),
MSE = (1 / (M × N)) Σ_{i=1..M} Σ_{j=1..N} (f(i, j) − f'(i, j))².    (4)

To evaluate the performance of the watermark retrieval process, the normalized correlation (NC) is used. Here, W1 and W2 are the original and recovered watermark signals, respectively. The normalized correlation is calculated by

NC = [ Σ_i Σ_j W1(i, j) · W2(i, j) ] / sqrt( [ Σ_i Σ_j W1(i, j)² ] · [ Σ_i Σ_j W2(i, j)² ] ).    (5)
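For completeness, (4) and (5) translate directly into a few lines of NumPy; this is a generic rendering of the two quality measures, not code taken from the paper.

```python
import numpy as np

def psnr(f, f_marked):
    """PSNR of (4), in dB, for 8-bit images."""
    mse = np.mean((f.astype(float) - f_marked.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def normalized_correlation(w1, w2):
    """NC of (5) between original and recovered watermarks."""
    w1 = w1.astype(float).ravel()
    w2 = w2.astype(float).ravel()
    return float(np.sum(w1 * w2) / np.sqrt(np.sum(w1 ** 2) * np.sum(w2 ** 2)))
```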
4. Experimental Results

We have performed experiments with various watermarks and popular host images such as Lena, Barbara, Baboon, Cameraman, City, Couple, Man, Boat, Elaine, Peppers, and Zelda of size 512×512. The watermark is a grayscale fingerprint (.bmp) of size 128×128, which contains many curves and significant details; it is therefore a suitable benchmark for measuring the performance of the proposed method. In addition, it can be used in fingerprinting applications. In (2), α was set to 0.1 to obtain a tradeoff between perceptibility and robustness. In both the LP and DFB decompositions, "PKVA" filters [18] were used because of their efficient implementation. We decomposed the host image into four levels, and the watermark into one level.

4.1. Watermark Invisibility. Figures 8(a) and 8(b) provide the comparison between the original Lena test image and its corresponding watermarked image. The original watermark and the extracted watermark are also shown in Figures 8(c) and 8(d), respectively.

Figure 8: (a) Lena image. (b) Watermarked image. (c) Original watermark. (d) Extracted watermark.

Figure 9: Recovered watermarks from Lena image after JPEG2000 compression. (a) Rate = 0.3, (b) Rate = 0.4, (c) Rate = 0.5, (d) Rate = 0.6, (e) Rate = 0.7, (f) Rate = 0.8, (g) Rate = 0.9.
Figure 10: Normalized correlation results of different test images under different filtering attacks (window size = 3×3): (a) embedding the watermark in the highest frequency subband of the host image; (b) embedding the watermark in the 16th directional subband.

Figure 11: Normalized correlation results of different test images under image enhancement attacks: (a) embedding the watermark in the highest frequency subband of the host image; (b) embedding the watermark in the 16th directional subband.

The results of embedding data in the highest frequency subband of the host image are shown in Table 1. Our experiments on the test images showed that the 16th directional subbands have the highest priority for watermark embedding. The results of embedding the watermark in the 16th directional subbands of the host images were as follows: watermark invisibility can be guaranteed at an average PSNR value of 46.96 dB for all the test images, owing to their similar characteristics, and an NC value of 0.9862 for all the extracted watermarks, except for the Man image, for which the PSNR and NC values were 47.09 and 0.9838, respectively.

The results of hiding larger amounts of data in the highest and other directional subbands of the Lena test image are shown in Table 2; the PSNR and NC values for the other subbands are given in columns 2 and 3 of the same table, respectively. We used the 1st and the 4th directional subbands, which have the highest level of energy after the 16th subband. In addition to embedding the watermark into the 16th directional subband, we hide another version of the watermark into the 1st and the 4th subbands, and thus we could embed 34 KB of data into the host image without degrading its perceptual quality.
Figure 12: Recovered watermarks from Lena image under various filtering and enhancement attacks: (a) FMLR, (b) Gaussian LPF, (c) hard thresholding, (d) soft thresholding, (e) reduce color, (f) image sharpening, (g) Wiener filtering, (h) histogram equalization.

Figure 13: Normalized correlation results of different test images under noise attacks: (a) embedding the watermark in the highest frequency subband of the host image; (b) embedding the watermark in the 16th directional subband.

Embedding the watermark in other subbands with lower energy than a given threshold will result in perceptual distortion in the watermarked image. Table 3 shows the results of embedding data in the Lena test image with different sizes; the size of the watermark is 25% of the size of the host image.

4.2. Resistance to Various Attacks. It is known that embedding the watermark in the high-frequency subbands of an image is sensitive to many image processing operations such as lowpass filtering, lossy compression, noise, and geometrical distortion. On the other hand, a watermark in the low-frequency subbands of an image is sensitive to other image processing operations such as histogram equalization and cropping. As we mentioned in Section 3, although the watermark is embedded into the highest frequency subbands, it is likely to be spread out into all subbands when we reconstruct the watermarked image, due to the special
Figure 14: Recovered watermarks from Lena image after applying different noises: (a) salt and pepper noise (density = 0.0001), (b) Gaussian noise (density = 0.0001), (c) speckle noise (density = 0.0001), (d) Poisson noise.

Figure 15: Normalized correlation results of different test images under geometrical attacks: (a) embedding the watermark in the highest frequency subband of the host image; (b) embedding the watermark in the 16th directional subband.

Table 1: Results of embedding data in the highest frequency subband of the host image.

Host image | Highest frequency subband | PSNR | NC
Baboon     | 13 | 37.0757 | 0.986389
Barbara    | 6  | 36.7178 | 0.983328
Boat       | 4  | 36.7234 | 0.985742
Cameraman  | 1  | 45.7128 | 0.987057
City       | 13 | 37.0794 | 0.98637
Couple     | 13 | 37.0754 | 0.986388
Elaine     | 1  | 45.7083 | 0.987072
Lena       | 16 | 46.968  | 0.986253
Man        | 4  | 36.9369 | 0.98061
Peppers    | 4  | 36.7585 | 0.985308
Zelda      | 4  | 36.7616 | 0.985401

Table 2: Results of embedding more data into the 16th and another directional subband.

Subbands | PSNR    | NC
16 & 1   | 43.3194 | NC16 = 0.9596, NC1 = 0.9852
16 & 4   | 36.3084 | NC16 = 0.8954, NC4 = 0.9858

Table 3: Results of embedding data in the Lena image with different sizes.

Host image size | Watermark size      | PSNR (dB) | NC
1024×1024       | 256×256 (65 KB)     | 47.1197   | 0.996708
512×512         | 128×128 (17 KB)     | 46.968    | 0.986253
256×256         | 64×64 (5.05 KB)     | 36.9065   | 0.977844

transform structure of the Laplacian pyramid. In this section, we attempt to show the robustness of our watermarking scheme against both high- and low-frequency signal processing attacks. MATLAB 7.0 and Checkmark 1.2 [19] were used for testing the robustness of the proposed method.

Table 4: Normalized correlation coefficients after JPEG2000 compression on watermarked images in which the watermark is embedded in
the highest frequency subband.

Host 0.3 0.4 0.5 0.6 0.7 0.8 0.9


Baboon 0.907042 0.954985 0.973306 0.978915 0.983985 0.98639 0.98639
Barbara 0.971182 0.977175 0.980079 0.98293 0.983295 0.983295 0.983295
Boat 0.96877 0.976222 0.980652 0.985366 0.985643 0.985643 0.985643
Cameraman 0.983941 0.986697 0.987041 0.987041 0.987041 0.987041 0.987041
City 0.955788 0.973253 0.981524 0.984497 0.986288 0.986288 0.986288
Couple 0.971682 0.980032 0.983572 0.986312 0.986312 0.986312 0.986312
Elaine 0.964434 0.980293 0.984266 0.98591 0.986947 0.986947 0.986947
Lena 0.97721 0.982396 0.985192 0.986252 0.986252 0.986252 0.986252
Man 0.9623 0.971662 0.975141 0.980085 0.980611 0.980611 0.980611
Peppers 0.967675 0.977996 0.982511 0.985192 0.985192 0.985192 0.985192
Zelda 0.977523 0.980563 0.985064 0.985319 0.985319 0.985319 0.985319

Table 5: Normalized correlation coefficients after JPEG2000 compression on watermarked images in which the watermark is embedded in
the 16th directional subband.
Host 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Baboon 0.9407 0.9770 0.9836 0.9854 0.9865 0.9871 0.9871
Barbara 0.9824 0.9846 0.9860 0.9869 0.9871 0.9871 0.9871
Boat 0.9766 0.9841 0.9863 0.9870 0.9872 0.9872 0.9872
Cameraman 0.9850 0.9864 0.9872 0.9872 0.9872 0.9872 0.9872
City 0.9730 0.9837 0.9852 0.9865 0.9872 0.9872 0.9872
Couple 0.9835 0.9849 0.9864 0.9872 0.9872 0.9872 0.9872
Elaine 0.9792 0.9850 0.9861 0.9869 0.9872 0.9872 0.9872
Lena 0.9842 0.9862 0.9869 0.9872 0.9872 0.9872 0.9872
Man 0.9697 0.9798 0.9832 0.9845 0.9848 0.9848 0.9848
Peppers 0.9832 0.9851 0.9864 0.9872 0.9872 0.9872 0.9872
Zelda 0.9840 0.9863 0.9869 0.9871 0.9871 0.9871 0.9871

Table 6: Comparison of the proposed method with other-domain methods.

Characteristic              | Proposed method   | Elbasi & Eskicioglu's method | Wang & Pearmain's method
Transform                   | Contourlet        | Wavelet                      | DCT
Watermark type              | Gray scale        | PRN sequence                 | Binary
No. watermark bits embedded | 17 KB (128×128)   | —                            | 910
PSNR in dB                  | 46.97             | 40.86                        | 39.21
No. reported attacks        | 13                | 9                            | 4
Extraction type             | Nonblind          | Semiblind                    | Blind

The wide class of existing attacks can be divided into four main categories: removal attacks, geometrical attacks, cryptographic attacks, and protocol attacks [20]. We investigate the robustness of our method against removal and geometrical attacks.

4.2.1. Removal Attacks. Removal attacks aim at the complete removal of the watermark information from the watermarked data without cracking the security of the watermarking algorithm [20]. To test the robustness of our method against removal attacks, JPEG2000 compression, image enhancement techniques, various noise processes, and filtering attacks were used.

The JPEG2000 attack was tested using JasPer 1.900.1 [21]. Table 4 shows the results of applying the JPEG2000 attack on watermarked images in which the watermark is embedded in the highest frequency subband of the host image, and Table 5 shows the corresponding results when the watermark is embedded in the 16th directional subband. The results demonstrate excellent robustness of our method against JPEG2000 compression. Figure 9 shows the extracted watermarks after compressing the Lena image with different compression rates.
Figure 16: Recovered watermarks from Lena image under geometric attacks: (a) cropping top half (NC = 0.9759), (b) cropping 400×450 (NC = 0.9398), (c) cropping right half (NC = 0.6697), (d) cropping left half (NC = 0.7260), (e) rotation (angle = 20°), (f) scaling (factor = 2), (g) cropping bottom half (NC = 0.1277).

To assess the robustness of the proposed method to various types of filtering and enhancement techniques, frequency mode Laplacian removal (FMLR), Gaussian lowpass filtering, soft thresholding, hard thresholding, Wiener filtering, image sharpening, reduced color, and histogram equalization were used.

Figures 10 and 11 show the normalized correlation coefficient results of applying filtering attacks with a 3×3 window size and image enhancement techniques on the different test images, respectively.
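The robustness experiments summarized in Figures 10–15 all follow the same attack–extract–measure pattern. A minimal sketch of such a loop is shown below; the attack functions are rough stand-ins for the Checkmark attack set, and `embed`, `extract`, and `nc` are placeholders for whatever embedding, extraction, and NC implementations are used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def evaluate_attacks(host, watermark, embed, extract, nc):
    """Embed, attack, extract, and report NC per attack (illustrative only)."""
    marked = embed(host, watermark)
    attacks = {
        "gaussian_lpf": lambda im: gaussian_filter(im, sigma=1.0),
        "salt_pepper": lambda im: np.where(np.random.rand(*im.shape) < 1e-4,
                                           255 * np.random.randint(0, 2, im.shape), im),
        # "cropping" simulated by zeroing the top half so sizes stay comparable
        "crop_top_half": lambda im: np.vstack([np.zeros_like(im[: im.shape[0] // 2]),
                                               im[im.shape[0] // 2:]]),
    }
    results = {}
    for name, attack in attacks.items():
        recovered = extract(attack(marked), host)
        results[name] = nc(watermark, recovered)
    return results
```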
Table 7: Comparison of the proposed method with similar-domain methods.

Characteristic              | Proposed method  | Method I & Method II | CEW
Watermark type              | Gray scale       | Binary               | Binary
No. watermark bits embedded | 17 KB (128×128)  | Not mentioned        | 16×16
PSNR in dB                  | 46.97            | Not mentioned        | ≈47
No. reported attacks        | 13               | 3                    | 3
Extraction type             | Nonblind         | Nonblind             | Nonblind

Figure 17: The robustness comparison result of the proposed method with [2, 3] under Gaussian attack (mean, var) = (0, 0.0001).

Figure 12 shows the recovered watermarks of the Lena test image under these attacks. The results show good robustness properties of the proposed method against all the tested attacks except for thresholding and Wiener filtering.
To test the robustness of our method under various noise processes, Gaussian noise, salt & pepper noise, speckle noise, and Poisson noise with a density of 0.0001 were used. Figure 13 shows the normalized correlation coefficient results of applying the various noise attacks on the different test images. Figure 14 shows the recovered watermarks of the Lena test image under the different noise processes. The results demonstrate excellent resistance of our method against common noise.

Figure 18: The robustness comparison result of the proposed method with [2, 3] under cropping attack (400×450).

4.2.2. Geometrical Attacks. In contrast to removal attacks, geometrical attacks do not actually remove the embedded watermark itself but intend to distort the watermark detector synchronization with the embedded information [20]. The most common geometrical attacks are rotation, scaling, and cropping. The parameters used in these attacks are a rotation angle of 20°, a scaling factor of 2, and a cropping size of 256 × 512 (the top half is removed). Figure 15 illustrates the normalized correlation coefficient results of these attacks on the different test images. Figure 16 shows the extracted watermarks after applying geometric attacks on the Lena test image; results of cropping other parts of the Lena test image are also shown in Figure 16. The results demonstrate good resistance of our method against cropping and scaling, but poor resistance against rotation attacks.

Figure 19: The robustness comparison result of the proposed method with [2, 3] under rotation attack (angle = 6°).

4.3. Comparison. The performance of the proposed method was compared with two other methods based on two different decomposition types, and the results are shown in Table 6. Wang and Pearmain's method [22] is a blind watermarking technique based on patchwork estimation; a total of 910 watermark bits were embedded in the Lena test image using DCT, the reported PSNR was 39.21 dB, and only 4 attacks were reported. Elbasi and Eskicioglu's method [23] is a semiblind DWT watermarking technique which embeds a pseudorandom number (PRN) sequence as a watermark in three bands of the image, using coefficients that are higher than a given threshold. The reported PSNR was 40.86 dB and the number of
attacks reported was 9. In the proposed method, 17 KB are embedded and the obtained PSNR is 46.97 dB. The watermarked image in our method can survive many attacks, and it is superior in terms of PSNR compared to these methods. Furthermore, we compared our method with three related works which also use contourlet decomposition: Method I and Method II are reported in [3], and CEW is reported in [2]. Table 7 summarizes the comparison results of the proposed method with these methods. Figures 17, 18, and 19 show the comparison results between Method I, Method II, CEW, and the proposed method on the popular test images under the Gaussian noise, cropping, and rotation attacks, respectively. In the Gaussian noise and cropping attacks, our method outperforms the other methods, but in the rotation attack (angle = 6°) the performance of CEW was better.

5. Conclusion

In this paper, we proposed a new multiresolution watermarking method using the contourlet transform. In this method, a grayscale watermark was added to the highest frequency subband of the host image. The quality of the watermarked image was good in terms of perceptibility and PSNR (average of 39.4107 dB) measures. We showed that we can embed a remarkable amount of data (34 KB) by using other high-frequency subbands in addition to the highest frequency subband. Moreover, we showed that this method is robust against various removal and geometrical attacks such as JPEG2000 compression, salt and pepper noise, Gaussian noise, speckle noise, Poisson noise, frequency mode Laplacian removal, Gaussian lowpass filtering, reduced color, image sharpening, cropping, scaling, and histogram equalization. We compared the robustness of the proposed method with three other contourlet methods under cropping, Gaussian noise, and rotation attacks. Compared to the DWT-based and DCT-based methods, the proposed method is superior in terms of embedding capacity, PSNR, and survival to a number of image attacks. Considering the good characteristics of our method, such as imperceptibility, robustness, and nonblind extraction, it would be a suitable choice for fingerprinting applications. Our future focus will be on enhancing the robustness properties of the proposed algorithm against various attacks.

References

[1] J. Cox, M. L. Miller, and J. A. Bloom, "Watermarking applications and their properties," in Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC '00), pp. 6–10, Las Vegas, Nev, USA, 2000.
[2] S. Zaboli and M. S. Moin, "CEW: a non-blind adaptive image watermarking approach based on entropy in contourlet domain," in 2007 IEEE International Symposium on Industrial Electronics (ISIE 2007), pp. 1687–1692, Spain, June 2007.
[3] M. Jayalakshmi, S. N. Merchant, and U. B. Desai, "Digital watermarking in contourlet domain," in 18th International Conference on Pattern Recognition (ICPR 2006), pp. 861–864, China, August 2006.
[4] G. Duan, A. T. S. Ho, and X. Zhao, "A novel non-redundant contourlet transform for robust image watermarking against non-geometrical and geometrical attacks," in Proceedings of the 5th International Conference on Visual Information Engineering (VIE '08), pp. 124–129, August 2008.
[5] S. Xiao, H. Ling, F. Zou, and Z. Lu, "Adaptive image watermarking algorithm in contourlet domain," in 2007 Japan-China Joint Workshop on Frontier of Computer Science and Technology (FCST 2007), pp. 125–130, China, November 2007.
[6] E. Salahi, M. S. Moin, and A. Salahi, "A new visually imperceptible and robust image watermarking scheme in contourlet domain," in 2008 4th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2008), pp. 457–460, China, August 2008.
[7] Z. Shu, S. Wang, C. Deng, G. Liu, and L. Zhang, "Watermarking algorithm based on contourlet transform and human visual model," in 2008 International Conference on Embedded Software and Systems (ICESS-08), pp. 348–352, China, July 2008.
[8] X. Lian, X. Ding, and D. Guo, "Digital watermarking based on non-sampled contourlet transform," in 2007 IEEE International Workshop on Anti-counterfeiting, Security, Identification (ASID), pp. 138–141, China, April 2007.
[9] F. Wei, T. Ming, and J. Hong-Bing, "An adaptive watermark scheme based on contourlet transform," in International Symposium on Computer Science and Computational Technology (ISCSCT 2008), pp. 677–681, China, December 2008.
[10] P. Meerwald and A. Uhl, "A survey of wavelet domain watermarking algorithms," in Electronic Imaging, Security and Watermarking of Multimedia Contents, vol. 4314 of Proceedings of SPIE, January 2001.
[11] D. Kundur and D. Hatzinakos, "Towards robust logo watermarking using multiresolution image fusion principles," IEEE Transactions on Image Processing, vol. 6, no. 1, pp. 185–198, 2004.
[12] M. N. Do and M. Vetterli, "The contourlet transform: an efficient directional multiresolution image representation," IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2091–2106, 2005.
[13] S. Khalighi, P. Tirdad, and H. R. Rabiee, "A new robust non-blind digital watermarking scheme in contourlet domain," in Proceedings of the 9th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT '09), Ajman, UAE, December 2009.
[14] D. D.-Y. Po and M. N. Do, "Directional multiscale modeling of images using the contourlet transform," IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1610–1620, 2006.
[15] M. N. Do and M. Vetterli, "Pyramidal directional filter banks and curvelets," in Proceedings of IEEE International Conference on Image Processing (ICIP '01), vol. 3, pp. 158–161, Thessaloniki, Greece, October 2001.
[16] M. N. Do and M. Vetterli, "Framing pyramids," IEEE Transactions on Signal Processing, vol. 51, no. 9, pp. 2329–2342, 2003.
[17] I. J. Cox, J. Kilian, T. Leighton, and T. G. Shamoon, "Secure spread spectrum watermarking for multimedia," in Proceedings of IEEE International Conference on Image Processing (ICIP '97), vol. 6, pp. 1673–1687, Santa Barbara, Calif, USA, October 1997.
[18] S. Phoong, C. W. Kim, P. P. Vaidyanathan, and R. Ansari, "New class of two-channel biorthogonal filter banks and wavelet bases," IEEE Transactions on Signal Processing, vol. 43, no. 3, pp. 649–665, 1995.
[19] May 2010, http://watermarking.unige.ch/Checkmark/index.html.

[20] S. Voloshynovskiy, S. Pereira, T. Pun, J. J. Eggers, and J. K.


Su, “Attacks on digital watermarks: Classification, estimation-
based attacks, and benchmarks,” IEEE Communications Mag-
azine, vol. 39, no. 8, pp. 118–125, 2001.
[21] May 2010, http://www.ece.uvic.ca/~mdadams/jasper/.
[22] Y. Wang and A. Pearmain, “Blind image data hiding based on
self reference,” Pattern Recognition Letters, vol. 25, no. 15, pp.
1681–1689, 2004.
[23] E. Elbasi and A. M. Eskicioglu, “A DWT-based robust semi-
blind image watermarking algorithm using two bands,” in
Security, Steganography, and Watermarking of Multimedia
Contents VIII, vol. 6072 of Proceedings of SPIE, San Jose, Calif,
USA, January 2006.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 428183, 30 pages
doi:10.1155/2010/428183

Research Article
A New Robust Watermarking Scheme to Increase Image Security

Hossein Rahmani, Reza Mortezaei, and Mohsen Ebrahimi Moghaddam


Electrical and Computer Engineering Department, Shahid Beheshti University, G.C., Tehran 1983963113, Iran

Correspondence should be addressed to Mohsen Ebrahimi Moghaddam, m [email protected]

Received 12 December 2009; Revised 7 July 2010; Accepted 16 October 2010

Academic Editor: Yingzi Du

Copyright © 2010 Hossein Rahmani et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

In digital image watermarking, an image is embedded into a picture for a variety of purposes such as captioning and copyright
protection. In this paper, a robust private watermarking scheme for embedding a gray-scale watermark is proposed. In the
proposed method, the watermark and original image are processed by applying blockwise DCT. Also, a Dynamic Fuzzy Inference
System (DFIS) is used to identify the best place for watermark insertion by approximating the relationship established between the
properties of HVS model. In the insertion phase, the DC coefficients of the original image are modified according to DC value of
watermark and output of Fuzzy System. In the experiment phase, the CheckMark (StirMark MATLAB) software was used to verify
the method robustness by applying several conventional attacks on the watermarked image. The results showed that the proposed
scheme provided high image quality while it was robust against various attacks, such as Compression, Filtering, additive Noise,
Cropping, Scaling, Changing aspect ratio, Copy attack, and Composite attack in comparison with related methods.

1. Introduction

Owing to the recent advances in network and multimedia techniques, digital images may be transmitted over nonsecure channels such as the Internet. Therefore, the enforcement of multimedia copyright protection has become an important issue in the literature.

Watermarking and cryptography are two standard multimedia security methods. However, cryptography is not an effective method because it does not provide permanent protection for the multimedia content after delivery to consumers: after decryption there is no protection for the documents. Digital watermarking technologies allow users to hide appropriate information in the original image that is imperceptible during normal use but readable by a special application. Therefore, the major purpose of digital watermarks is to provide protection for intellectual property that is in digital format. To evaluate a watermark system, the following attributes are generally considered [1, 2].

(1) Readability. A watermark should convey as much information as possible, be statistically detectable, and be sufficient to identify ownership and copyright unambiguously.

(2) Security. Only authorized users gain access to the watermark data.

(3) Imperceptibility. The embedding process should not introduce any perceptible artifacts into the original image and should not degrade the perceived quality of the image.

(4) Robustness. The watermark should be able to withstand various attacks while remaining detectable in the extraction process.

The most important watermarking schemes are invisible ones, which are secure and robust. Moreover, in invisible watermarking the embedding locations are secret, and only authorized persons who have the secret keys can extract the watermark.

On the other hand, watermarking algorithms are also classified as follows: methods which require the original information and secret keys for extracting the watermark are called private watermark algorithms; methods which require the watermark information and secret keys are called semiprivate or semiblind algorithms; and those which need only secret keys rather than the original information are called blind watermark algorithms.
2 EURASIP Journal on Advances in Signal Processing

In another classification, digital watermarking algo- original image. For creation of compound watermark, the
rithms can be divided into two groups: spatial domain synthetic image is created by Gaussian and Laplacian random
[5–7] and frequency domain [8–12] methods according number generator. The choice of these two distributions
to the processing domain of the host image. The spatial for modeling the DC and AC coefficients of image DCT is
domain algorithms are simple and the watermark can be motivated by empirical results presented in Reininger and
damaged easily, but the frequency domain algorithms can Gibson [25] and Mohanty et al. [26]. Next, the original
resist versus intensity attack and watermark information watermark is embedded in insensitive area of synthetic
cannot be damaged easily [13]. image using any DCT-based visible watermarking algorithm.
However, in all frequency domain watermarking Asatryan proposed another method that combines spatial
schemes, there is a conflict between robustness and and frequency domain to hide a grayscale watermark in
transparency. If the watermark is embedded in the lower- grayscale original image by mapping the values of DCT
frequency bands, the scheme would be robust to attacks but coefficients of compressed watermark image to the interval
the watermark may be difficult to hide. On the other hand, if [0,255] (max and min value of grayscale image) by a fixed
the watermark is embedded in the higher-frequency bands, linear transform and inserts these values in the original
it would be easier to hide the watermark but the scheme has image [27]. But, this method introduces perceptible artifacts
less resistant to attacks. Therefore, finding a proper place to into original image and degrades the perceived quality of
embed the watermark is very important. image.
In 1996, Cox et al. [14] advised that the watermark In this paper, we have proposed a new robust water-
should be embedded in the low-frequency coefficients of marking method in frequency domain to insert a gray level
DCT domain to ensure the robustness. To improve this watermark in an image. The proposed method is more robust
method, Lu et al. [15] used a cocktail watermark to and makes image with higher quality than related ones. The
increase robustness and HVS to maintain high fidelity of the basic idea of the proposed method is based on this fact that
watermarked image. Barni and Hsu [16, 17], respectively, most of the signal energy of the DCT block is compacted in
recommended that the watermark should be embedded in the DC component and the remaining energy is distributed
the middle frequency coefficients to reduce the distortion. reductively in the AC components in zigzag scan order [4].
But Huang et al. in [18] points out that the DC coefficient is Also, for most images, the main characteristics of the DCT
more proper to be used for embedding watermark, and this coefficients in one block have high correlation with the
conclusion is obtained based on his robustness test between adjacent blocks. Gonzales et al. [3] described a technique
the DC coefficient and two low-frequency coefficients. which estimates the first five AC coefficients precisely. In this
Also, DWT as another frequency transform technique method, DC values of a 3 × 3 neighborhood of blocks are
has been used by many researchers such as Xie and Arce for used to estimate the AC coefficients for the center block. They
digital image watermarking [19]. The proposed method by did not consider variations in the image in AC coefficients
Zhao et al. in [20] is a sample of DCT/DWT domain-based estimation, but Veeraswamy and Kumar in [4] proposed
method which uses a dual watermarking scheme exploiting a new method that considered the variation in the image
the orthogonality of image subspaces to provide robust and accordingly AC coefficients have been estimated with
authentication. As other examples, in [21, 22], the proposed different equations. This method is better than Gonzales
DCT/DWT methods embed a binary visual watermark by method in terms of reduced blocking artifacts and improved
modulating the middle-frequency components. These two PSNR value. Based on these ideas, here, at first, a grayscale
methods are robust to common image attacks; but geometric watermark image is created by applying DCT on each b × b
attacks are still challenges. In [23], another approach to nonoverlapping block of original grayscale watermark image
combine DWT and DCT has been proposed to improve the and setting all AC coefficients of each one to zero. Then, the
performance of the DWT-based watermarking algorithms. original image is divided into a × a nonoverlapping blocks
In this method, watermarking is done by embedding the and DCT is applied on each a × a block. Next, a Dynamic
watermark in the first and second level of DWT subbands Fuzzy Inference System (DFIS) is used to select the number
of the host image, followed by the application of DCT of original image blocks for embedding watermark. Finally,
on the selected DWT subbands. The combination of these DC value of each b × b DCT block of watermark image
two transforms improved the watermarking performance is embedded in DC value of a × a DCT block of original
considerably in comparison with DWT-only watermarking image by using the output of the DFIS. In the extraction
approach. process, DCT is applied on the test image to extract the
Most of the existing watermarking methods use a DC coefficients of each b × b DCT block of watermark and
pseudorandom sequence or binary image as a watermark. the AC coefficients of each b × b DCT block of extracted
However, using grayscale images as watermarks has drawn watermark are estimated based on proposed technique by
much attention for copyright protection since many logos Veeraswamy and Kumar [4] to construct the watermark
are grayscale in nature. One of the methods that hide a with higher quality. The proposed method was tested on
grayscale watermark image in original image was proposed several bench mark images using StirMarkMATLAB software
by Mahanty and Bhargava [24]. In this method, at first, based and its results were satisfactory. The results showed that
on Human Visual System (HVS), the most perceptually the proposed method created the high-quality watermarked
important region of original image is found. Then, a images while they were more robust against attacks such as
compound watermark is created to insert in this region of the JPEG compression, additive noise, filtering, cropping.
Figure 1: Inputs (texture sensitivity, luminance sensitivity, location sensitivity) and outputs (suitability α, weighting factor β) of the Dynamic Fuzzy Inference System (DFIS).

Figure 2: Dynamic membership function for texture sensitivity (support points A, B, C, D, E; "Low", "Not high and not low", "High").

The rest of the paper is organized as follows. In Section 2, the proposed approach is introduced; in Section 3, the proposed method is motivated and structurally compared with related ones; Section 4 describes the experimental results; and Section 5 concludes the paper.

2. Proposed Algorithm

In this section, the proposed algorithm is described in detail. The algorithm is divided into four parts: block selection, watermark creation, watermark embedding, and watermark extraction, which are described in Sections 2.1–2.4, respectively.

2.1. Block Selection. In this section, we try to find the best blocks for embedding the watermark. For this purpose, the original image is divided into a × a nonoverlapping blocks and subsequently the DCT is applied on each block. In the remainder of this paper, the value of a is taken as 8 to increase the method's robustness against compression, because standard JPEG is based on 8 × 8 blocks. Then, the following properties of the Human Visual System (HVS) model suggested in [24, 28] are used for selecting the blocks that are suitable for embedding the watermark.

(i) Luminance Sensitivity (Lk). The brighter the background, the lower the visibility of the embedded watermark. It is estimated by the following relation:

Lk = XDC,k / X̄DC,    (1)

where XDC,k is the DC coefficient of the kth block and X̄DC is the mean value of the DC coefficients of the original image.

(ii) Texture Sensitivity (Tk). The stronger the texture, the lower the visibility of the embedded watermark. It can be estimated by quantizing the DCT coefficients of a block (Xk) using the JPEG quantization table (Q). The results are rounded to the nearest integers, and the number of nonzero coefficients is then computed. This number represents the texture of that block:

Tk = NumberOfNonzero( Round( Xk / Q ) ),    (2)

where Xk are the coefficients and NumberOfNonzero counts the nonzero coefficients in the kth block.

(iii) Location Sensitivity (Ck). The center quarter of an image is perceptually more important than other areas of the image. We estimate the location of each block by computing the following ratio:

Ck = center(Bk) / 64,    (3)

where center(Bk) is the number of pixels of the kth block lying in the center quarter (25%) of the image.

When these parameters have been computed, they can be used to select blocks and determine the weighting factor for embedding. In the proposed method, a Fuzzy Inference System (FIS) is used for calculating the relationship established between all properties of the HVS model, because an FIS provides a simple mapping from a given set of inputs to another set of outputs without the complexity of mathematical modeling.

Here, a DFIS has been used [28] and optimized in order to approximate the relationship established between the three properties of the HVS model for both the block selection and embedding processes. We suppose that the location sensitivity parameter is independent of the image; therefore, in this model a static membership function is used for location sensitivity and only texture sensitivity and luminance sensitivity have dynamic membership functions. In the proposed DFIS, as shown in Figure 1, the inputs consist of the texture sensitivity, luminance sensitivity, and location sensitivity parameters of each block, and the outputs consist of the corresponding suitability and weighting factors. The shape and support set values of the input and output MFs (membership functions) have been derived from experiments on various images. The suitability parameter (α) depends on all three inputs, but the weighting factor (β) depends only on the texture sensitivity and luminance sensitivity.
Figure 3: Membership functions for (a) luminance sensitivity and (b) location sensitivity.

Figure 4: Membership functions for the outputs of the DFIS: (a) suitability (α) and (b) weighting factor (β).

Let us now explain, for instance, how the texture sensitivity membership function is computed. The structure of the texture sensitivity membership function is shown in Figure 2. To compute the membership function parameters, first A and E are set to the minimum and maximum values of the texture sensitivity Tk, as in (4):

A = min(Tk),  E = max(Tk),    (4)

where Tk is the texture sensitivity of the kth block of the image. Then, in order to find point C, the average of the texture sensitivity over all 8 × 8 blocks in the image is computed as shown in (5), where BN is the number of 8 × 8 blocks in the image:

C = (1 / BN) Σ_{k=1..BN} Tk.    (5)

Finally, points B and D are determined in such a manner that they never overlap or precede points A or E. Point B is the median of the texture values lying between A and C, as shown in (6), and point D is the median of the texture values lying between C and E, as shown in (7):

B = median{ Tk | A ≤ Tk ≤ C },    (6)
D = median{ Tk | C ≤ Tk ≤ E }.    (7)

When points A, B, C, D, and E are determined, the slopes of all membership functions (MFs) are computed. The membership function of the other dynamic parameter (luminance sensitivity) is created in the same way. The membership functions for luminance sensitivity and location sensitivity are shown in Figure 3. It is worth mentioning that the shape of the location sensitivity membership functions is different from the others, because the experiments showed that this kind of MF fits the data better than the others. So, the location sensitivity membership function (μ(u)) is defined as the Z-function, which models this property by the following equation:

μ(u) = 1,                                  u ≤ p,
μ(u) = 1 − 2 × ((u − p)/(q − p))²,         p < u ≤ (p + q)/2,
μ(u) = 2 × ((u − q)/(q − p))²,             (p + q)/2 < u ≤ q,
μ(u) = 0,                                  otherwise.    (8)

Figure 3(b) shows a plot of this function. In (8), p and q are two constant values that should be specified heuristically; for example, the best values we found for the "not center" curve were p = 0 and q = 1. The same curves have been used for all images.
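The construction of the dynamic support points in (4)–(7) and the Z-shaped function of (8) can be written down directly; the sketch below is a plain NumPy rendering of those formulas (the triangular/trapezoidal shapes implied by Figure 2 are not reproduced here).

```python
import numpy as np

def dynamic_support_points(T):
    """Support points A-E of the dynamic texture-sensitivity MFs, per (4)-(7)."""
    T = np.asarray(T, dtype=float)
    A, E = T.min(), T.max()                     # (4)
    C = T.mean()                                # (5): average over all BN blocks
    B = np.median(T[(T >= A) & (T <= C)])       # (6)
    D = np.median(T[(T >= C) & (T <= E)])       # (7)
    return A, B, C, D, E

def z_function(u, p, q):
    """Z-shaped membership function of (8)."""
    if u <= p:
        return 1.0
    if u <= (p + q) / 2:
        return 1.0 - 2.0 * ((u - p) / (q - p)) ** 2
    if u <= q:
        return 2.0 * ((u - q) / (q - p)) ** 2
    return 0.0

# Example usage: A, B, C, D, E = dynamic_support_points(texture_values)
```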
Figure 5: (a) Original Lena, (b) Lena created using only the DC coefficients of the original image, (c) Lena created using the estimation formulas (Gonzales method [3]) on image (b), and (d) Lena created using the estimation formulas (Veeraswamy method [4]) on image (b).

Figure 6: The position of the five AC components in the central block that are estimated using the nine DC components of its 3 × 3 neighborhood.

The membership functions for the outputs of the DFIS (α and β) are shown in Figure 4. After defuzzification, we have crisp values for αk and βk, which determine the suitability and weighting factor of the kth block of the image. The blocks with the highest αk values are selected for the embedding process.

2.2. Watermark Creation. As we know, most of the signal energy of the block DCT is compacted in the DC component, and the remaining energy is distributed diminishingly over the AC components in zigzag scan order [4]. On the other hand, the DC component is more robust than the AC components against different attacks. Moreover, keeping only the DC coefficient of each block of an image still conveys the overall look of that image; for example, Figure 5(b) shows the Lena image created using only the DC component of each 4 × 4 DCT block.

Since the DCT coefficients in one block have, for most images, a high correlation with those of the adjacent blocks, Gonzales et al. in [3] described a technique which estimates a few low-frequency AC coefficients precisely. Moreover, only the DC values of the 3 × 3 neighborhood of each central block are needed to estimate the AC coefficients of that central block, as shown in Figure 6. The estimation relations for the first five AC coefficients of each 4 × 4 DCT block are shown in Table 1 (first column), and Figure 5(c) shows the Lena image created using these relations.

Gonzales et al. [3] did not consider variations in the image in the AC coefficient estimation, but in [4] a new method was proposed that considers the variation in the image, and accordingly the AC coefficients are estimated with different equations. This method is better than the Gonzales method in terms of reduced blocking artifacts and improved PSNR value. In this method, at first the entropy of each block is calculated; blocks with entropy values less than a threshold value are defined as smoother blocks, and blocks with entropy values equal to or greater than the threshold are considered featured blocks. Based on the entropy values, three cases are considered in the estimation relations: (1) smoother blocks, (2) featured blocks, and (3) featured blocks surrounded by featured blocks. The estimation formulas based on the Veeraswamy method for a 4 × 4 DCT block in these three cases are shown in Table 1 (second and third columns) and Table 2, correspondingly.

Based on this idea, only the DC coefficients are needed to estimate the AC coefficients of each block [4]. Therefore, the estimation formulas (as shown in Tables 1 and 2 for a 4 × 4 DCT block) are employed to find these coefficients. Figure 5(d) shows a sample image created by this method when the block size is 4 × 4.

For the watermark creation process, as shown in Figure 7, the original grayscale watermark image is divided into b × b (e.g., 4 × 4) nonoverlapping blocks and the DCT is subsequently performed on each block. Next, all AC coefficients are set to zero.

In the proposed method, this created watermark image is inserted in the original image. In the extraction process, the estimation formulas (as shown in Tables 1 and 2 for 4 × 4 DCT blocks) are employed to reconstruct the watermark.
Table 1: The formulas to estimate the first five AC coefficients of each 4 × 4 DCT block (1).

Gonzales's method:
AC(0,1) = 0.14125 × (DC4 − DC6); AC(1,0) = 0.14125 × (DC2 − DC8); AC(0,2) = 0.03485 × (DC4 + DC6 − 2 × DC5); AC(2,0) = 0.03485 × (DC2 + DC8 − 2 × DC5); AC(1,1) = 0.02026 × (DC1 + DC9 − DC3 − DC7).

Smoother blocks (Veeraswamy's method):
AC(0,1) = 0.175 × (DC4 − DC6); AC(1,0) = 0.175 × (DC2 − DC8); AC(0,2) = 0.083 × (DC4 + DC6 − 2 × DC5); AC(2,0) = 0.083 × (DC2 + DC8 − 2 × DC5); AC(1,1) = 0.029 × (DC1 + DC9 − DC3 − DC7).

Featured blocks (Veeraswamy's method):
AC(0,1) = 0.231 × (DC4 − DC6); AC(1,0) = 0.231 × (DC2 − DC8); AC(0,2) = 0.118 × (DC4 + DC6 − 2 × DC5); AC(2,0) = 0.118 × (DC2 + DC8 − 2 × DC5); AC(1,1) = 0.15 × (DC1 + DC9 − DC3 − DC7).

Table 2: The formulas to estimate the first five AC coefficients of each 4 × 4 DCT block (2).

Featured blocks surrounded by horizontal and vertical featured blocks (Veeraswamy's method):
AC(0,1) = 0.231 × (DC1 − DC3); AC(1,0) = 0.231 × (DC2 − DC8); AC(0,2) = 0.15 × (DC1 − DC3); AC(2,0) = 0.118 × (DC2 + DC8 − 2 × DC5); AC(1,1) = 0.15 × (DC1 + DC9 − DC3 − DC7).

Featured blocks surrounded by horizontal featured blocks (Veeraswamy's method):
AC(0,1) = 0.231 × (DC4 − DC6); AC(1,0) = 0.231 × (DC1 − DC7); AC(0,2) = 0.118 × (DC4 + DC6 − 2 × DC5); AC(2,0) = 0.15 × (DC1 − DC7); AC(1,1) = 0.15 × (DC1 + DC9 − DC3 − DC7).

Featured blocks surrounded by vertical featured blocks (Veeraswamy's method):
AC(0,1) = 0.231 × (DC1 − DC3); AC(1,0) = 0.231 × (DC1 − DC7); AC(0,2) = 0.15 × (DC1 − DC3); AC(2,0) = 0.15 × (DC1 − DC7); AC(1,1) = 0.15 × (DC1 + DC9 − DC3 − DC7).
2.3. Watermark Insertion. To describe the proposed method, we suppose that the original image (I) and the created watermark image (W') are grayscale images of size M × N and L × K, respectively.

In the watermark embedding process, the original image is transformed to the frequency domain by DCT. Because the JPEG standard is based on the 8 × 8 block DCT, a block DCT of size 8 × 8 is commonly used in image watermarking to make the scheme robust against JPEG compression [29]. Based on this idea, the original image is divided into 8 × 8 nonoverlapping blocks and the DCT is applied on each block. Next, the Dynamic Fuzzy Inference System (DFIS) is used to calculate αk and βk for each 8 × 8 DCT block of the original image. Then, (L/b × K/b) blocks of the original image with the highest αk are selected for embedding the watermark image, where b × b is the size of the DCT block of the watermark. On the other side, the image created by the approach described in Section 2.2 (the used watermark) is divided into b × b nonoverlapping blocks and the DCT is performed on each block. If b is smaller than a (in the proposed method, the value of a is 8), more robustness against attacks and a visually enhanced extracted watermark can be achieved, but the quality of the watermarked image decreases. Thus, b provides a tradeoff between robustness after attacks and the quality of the watermarked image. Finally, the DC value of each b × b DCT block of the created watermark is embedded in the DC value of a selected 8 × 8 DCT block of the original image (based on (9)). Therefore, the watermarked image is created by modifying the DC value of each selected 8 × 8 DCT block of the original image. As shown in Figure 8, the following steps are used to insert the watermark in the original image.

Algorithm 1 (The watermark embedding). We have the following.

Input: An original image I, watermark (W).

Output: A watermarked image IW.

Step 1. Divide the original image I into 8 × 8 nonoverlapping blocks and apply the DCT on each block. Next, compute the HVS model properties and the αk and βk values of each block with the Fuzzy Inference System, as described in Section 2.1. Finally, sort the blocks in descending order of their αk values.

Step 2. Create the used watermark (W') from the original watermark (W) as described in Section 2.2.

Step 3. Select the first (L/b × K/b) blocks of the sorted blocks computed in Step 1 for the embedding process (L × K is the size of the created watermark).

Step 4. Use (9) for invisible insertion of the created watermark (used watermark) into the DC coefficients of the selected blocks of the original image:

X'DC,k = XDC,k + ϕk × βk × W'DC,k,    (9)

where X'DC,k and XDC,k are the DC coefficients of the kth block of the watermarked image and the original image, respectively, and W'DC,k is the DC coefficient of the kth block of the created watermark.
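Step 4 is again a one-line update per selected block; the sketch below assumes the per-block DCTs, the fuzzy outputs βk, and the ±1 sequence ϕk have already been computed, and only illustrates (9) (the seed value is an arbitrary placeholder).

```python
import numpy as np

def embed_dc(block_dcts, selected, wm_dc, beta, rng_seed=12345):
    """Modify the DC coefficients of the selected host blocks per (9)."""
    rng = np.random.default_rng(rng_seed)
    phi = rng.choice([-1, 1], size=len(selected))           # secret +/-1 pattern
    for n, k in enumerate(selected):                         # k indexes an 8x8 host block
        block_dcts[k][0, 0] += phi[n] * beta[k] * wm_dc[n]   # (9)
    return block_dcts, phi
```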
Figure 7: The watermark creation procedure.

Figure 8: The proposed watermark embedding process.


Figure 9: Proposed watermark extraction process.

Figure 10: (a) Original Lena image, (b) original Baboon image, (c) original Peppers image, and (d) original Crowd image.

controls the tradeoff between invisibility, robustness, and detection fidelity of the watermarked image and is computed by the DFIS as described in Section 2.1. The ϕk parameter is a pseudorandom (1, −1) bit pattern that determines the addition or subtraction involved at each position; it can be any arbitrarily chosen pseudorandom sequence and is used purely for security purposes.

Step 5. Use the inverse DCT on each 8 × 8 block to obtain the watermarked image IW.
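To make Steps 1–5 concrete, the following sketch shows the DC-coefficient insertion of (9) in Python/NumPy. It assumes that the per-block suitability values αk and weights βk produced by the DFIS of Section 2.1 are already available as (M/8) × (N/8) arrays and that the ±1 sequence ϕk comes from a seeded pseudorandom generator; all function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_watermark(original, used_wm, alpha, beta, seed=1234, b=4):
    """Insert the DC of each b x b block of the used watermark into the DC
    of the most suitable 8 x 8 blocks of the original image (Eq. (9))."""
    M, N = original.shape
    L, K = used_wm.shape
    n_blocks = (L // b) * (K // b)

    # Sort the 8 x 8 host blocks by suitability (alpha) in descending order.
    order = np.argsort(alpha.ravel())[::-1][:n_blocks]
    phi = np.random.default_rng(seed).choice([-1, 1], size=n_blocks)

    marked = original.astype(float).copy()
    for idx, flat in enumerate(order):
        bi, bj = divmod(flat, N // 8)
        r, c = 8 * bi, 8 * bj
        host = dctn(marked[r:r+8, c:c+8], norm='ortho')

        wi, wj = divmod(idx, K // b)
        wm_dc = dctn(used_wm[b*wi:b*wi+b, b*wj:b*wj+b].astype(float), norm='ortho')[0, 0]

        # Eq. (9): X'_DC,k = X_DC,k + phi_k * beta_k * W'_DC,k
        host[0, 0] += phi[idx] * beta.ravel()[flat] * wm_dc
        marked[r:r+8, c:c+8] = idctn(host, norm='ortho')
    return marked
```

Only one DC coefficient per selected 8 × 8 block is altered, by an amount scaled with βk, which is what keeps the modification below the visibility threshold estimated by the fuzzy system.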

Figure 11: (a) and (c) Original watermarks; (b) and (d) used watermarks.

Figure 12: (a)–(d) Watermarked images after embedding the 128 × 128 watermark image shown in Figure 11(b) (PSNR: 52.33, 52.44, 51.72, and 51.38 dB). (e)–(h) Watermarks extracted from (a)–(d), respectively (γ = 0.9974, 0.9973, 0.9979, and 0.9974). (i)–(l) Watermarked images after embedding the 128 × 128 watermark image shown in Figure 11(d) (PSNR: 52.92, 53.08, 52.27, and 52.04 dB). (m)–(p) Watermarks extracted from (i)–(l), respectively (γ = 0.9975, 0.9970, 0.9982, and 0.9971).

2.4. Watermark Extraction. The watermark extraction process is the reverse of the embedding process and requires the original image. As illustrated in Figure 9, first the watermarked image (IW) and the original one (I) are divided into 8 × 8 nonoverlapping blocks and the DCT is performed on each block of both images. Next, as described in Section 2.1, the αk and βk values of each block of the original image are computed with the DFIS, and then the (L/b × K/b) blocks with the highest αk are selected (L/b × K/b is the number of b × b blocks in the watermark image and L × K is its size). Then, the DC coefficients of the extracted watermark are computed as follows:

W′_DC,k = (X′_DC,k − X_DC,k) / (ϕ_k · β_k),   (10)

Figure 13: (a)–(d) Watermarked images after embedding the 256 × 256 watermark image shown in Figure 11(b) (PSNR: 48.27, 48.47, 47.94, and 47.63 dB). (e)–(h) Watermarks extracted from (a)–(d), respectively (γ = 0.9975, 0.9966, 0.9971, and 0.9965). (i)–(l) Watermarked images after embedding the 256 × 256 watermark image shown in Figure 11(d) (PSNR: 48.84, 49.11, 48.66, and 48.26 dB). (m)–(p) Watermarks extracted from (i)–(l), respectively (γ = 0.9972, 0.9967, 0.9969, and 0.9963).


where X′_DC,k and X_DC,k are the DC coefficients of the kth block of the watermarked image and of the original image, respectively, and W′_DC,k is the DC coefficient of the kth block of the extracted watermark. The βk parameter is the weighting factor computed in Step 1, and the ϕk parameter is the pseudorandom (1, −1) bit pattern generated with the arbitrary seed used in the insertion process.

Finally, the W′_DC,k values and the estimation formulas described in Section 2.2 are used to create the b × b DCT blocks of the watermark; then, by performing a block-wise inverse DCT, the watermark in the spatial domain is created. The following steps are used for watermark extraction.

Algorithm 2 (The watermark extraction). We have the following.

Input: An original image (I) and a watermarked image (IW).

Output: An extracted watermark (W′).

Step 1. Divide the original image into 8 × 8 nonoverlapping blocks and compute the DCT on each block. Then compute the HVS model properties as described in Section 2.1 and compute the αk and βk values of each block with the fuzzy approach (DFIS). Finally, sort the blocks in descending order of their αk values.

Figure 14: (a)–(d) The watermarked Lena, Baboon, Peppers, and Crowd images with the 256 × 256 watermark image shown in Figure 11(b) after JPEG compression with quality factors 40%, 30%, 20%, and 10%, respectively. (e)–(h) The extracted 256 × 256 watermarks from (a)–(d), respectively. (i)–(l) The watermarked Lena, Baboon, Peppers, and Crowd images with the 256 × 256 watermark image shown in Figure 11(d) after JPEG compression with quality factors 40%, 30%, 20%, and 10%, respectively. (m)–(p) The extracted 256 × 256 watermarks from (i)–(l), respectively. Panel PSNR values: (a) 34.89, (b) 31.71, (c) 32.32, (d) 27.72, (i) 34.91, (j) 31.73, (k) 32.35, (l) 27.72 dB; γ values: (e) 0.7689, (f) 0.7120, (g) 0.5794, (h) 0.4493, (m) 0.8319, (n) 0.7029, (o) 0.5629, (p) 0.4065.

Step 2. Select the first (L/b × K/b) blocks of the sorted list computed in Step 1 for the extraction process (L/b × K/b is the number of b × b blocks in the watermark and L × K is the size of the watermark).

Step 3. Divide the watermarked image IW into 8 × 8 nonoverlapping blocks and compute the DCT on each block.

Step 4. Extract the watermark from the selected blocks using (10).

Step 5. Estimate the AC coefficients of each block of the watermark extracted in Step 4, then use a b × b block-wise inverse DCT to create the extracted watermark in the spatial domain (W′). If the input watermark image is present in the extracted image, then the ownership is approved.
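A matching sketch of the extraction, again in Python/NumPy with illustrative names: it re-derives the same block ordering and ϕk sequence from the seed, applies (10) to recover each watermark DC, and, instead of the AC estimation formulas of [4], simply inverts DC-only blocks, which yields a coarser but still recognizable watermark.

```python
import numpy as np
from scipy.fft import dctn, idctn

def extract_watermark(original, marked, alpha, beta, wm_shape, seed=1234, b=4):
    """Recover the watermark DCs from the same blocks used at embedding
    (Eq. (10)); AC estimation is omitted and DC-only blocks are inverted."""
    M, N = original.shape
    L, K = wm_shape
    n_blocks = (L // b) * (K // b)

    order = np.argsort(alpha.ravel())[::-1][:n_blocks]
    phi = np.random.default_rng(seed).choice([-1, 1], size=n_blocks)

    recovered = np.zeros(wm_shape)
    for idx, flat in enumerate(order):
        bi, bj = divmod(flat, N // 8)
        r, c = 8 * bi, 8 * bj
        dc_marked = dctn(marked[r:r+8, c:c+8].astype(float), norm='ortho')[0, 0]
        dc_orig = dctn(original[r:r+8, c:c+8].astype(float), norm='ortho')[0, 0]

        # Eq. (10): W'_DC,k = (X'_DC,k - X_DC,k) / (phi_k * beta_k)
        wm_dc = (dc_marked - dc_orig) / (phi[idx] * beta.ravel()[flat])

        block = np.zeros((b, b))
        block[0, 0] = wm_dc
        wi, wj = divmod(idx, K // b)
        recovered[b*wi:b*wi+b, b*wj:b*wj+b] = idctn(block, norm='ortho')
    return recovered
```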

Figure 15: (a)–(d) The watermarked Lena, Baboon, Peppers, and Crowd images with the 128 × 128 watermark image shown in Figure 11(b) after JPEG compression with quality factors 40%, 30%, 20%, and 10%, respectively. (e)–(h) The extracted 128 × 128 watermarks from (a)–(d), respectively. (i)–(l) The watermarked Lena, Baboon, Peppers, and Crowd images with the 128 × 128 watermark image shown in Figure 11(d) after JPEG compression with quality factors 40%, 30%, 20%, and 10%, respectively. (m)–(p) The extracted 128 × 128 watermarks from (i)–(l), respectively. Panel PSNR values: (a) 35.04, (b) 31.78, (c) 32.41, (d) 27.74, (i) 35.05, (j) 31.79, (k) 32.42, (l) 27.75 dB; γ values: (e) 0.8318, (f) 0.8090, (g) 0.7555, (h) 0.6239, (m) 0.8673, (n) 0.7774, (o) 0.7275, (p) 0.5601.

3. Structural Comparison of the Proposed Method with Related Ones

The techniques employed in the proposed method make it more robust and its results of higher quality. In this section, the differences and advantages of the proposed method with respect to two related methods [24, 27] are presented for the four conventional steps of watermarking methods: (1) selecting the embedding area, (2) watermark creation, (3) insertion, and (4) extraction. The motivation of the proposed method is also implied in these subsections.

3.1. Selecting Embedding Area Procedure. Mohanty's method [24] at first finds the most perceptually important subimage of the original image, whose size equals the size of the watermark (L × K), and embeds the watermark in it. To find this subimage, the properties of the Human Visual System (HVS) such as Luminance, Edginess, Contrast, Location, and Texture are calculated for each L × K subimage of the original image; the highest-scoring one is selected as the most perceptually important region, and the watermark is embedded in it. As a result, this method is not robust to geometric attacks such as tampering, data block removal, and cropping, because the watermark is embedded in consecutive blocks (a subimage) of the original image. For example, if this region of the watermarked image is cropped or tampered with, the whole watermark is removed and the extraction procedure cannot find any watermark in the test image (see Section 4.3). In the proposed method, however, the blocks of the watermark are not embedded in consecutive blocks of the original image but in nonconsecutive blocks. As a result, the proposed method is more robust against many geometric attacks such as tampering, data block removal, and cropping (see Sections 4.2 and 4.3).

In Asatryan's method [27], which inserts the watermark in the spatial domain, all pixels of the original image are used to embed the watermark. Therefore, the quality of the watermarked image is degraded and artifacts are produced in the watermarked image (see Section 4.3).

3.2. Watermark Creation Procedure. Mohanty's method creates a synthetic image by using the 8 × 8 DCT coefficients of the selected subimage of the original image and Gaussian and Laplacian distributions for the DC and AC coefficients, respectively. Then, the original watermark is embedded in the created synthetic image using any DCT-based visible watermarking algorithm to create the used watermark.

In Asatryan's method, the used watermark is created by compressing the original watermark, where the rate of compression is defined by the user.

Figure 16: (a)–(d) The watermarked Lena, Baboon, Peppers, and Crowd images with the 256 × 256 watermark image after wavelet compression with rates 0.4 bpp, 0.8 bpp, 1.5 bpp, and 3.5 bpp, respectively (PSNR: 35.22, 33.93, 38.17, and 47.02 dB). (e)–(h) The extracted 256 × 256 watermarks from (a)–(d), respectively (γ = 0.5387, 0.7754, 0.9579, and 0.9957).

In the proposed method, the used watermark is created by dividing the original watermark into b × b DCT blocks and changing the AC coefficients of each block to zero. The b parameter provides a tradeoff between the quality of the watermarked image and that of the extracted watermark. The proposed watermark creation procedure is acceptable because the watermark image created from only the DC coefficient of each b × b (where b < 8; e.g., b = 4) DCT block of the original watermark is perceptually similar to the original one. Also, the AC coefficient estimation formulas proposed in [4] can be used to increase the quality of the created watermark.

3.3. Inserting Procedure. In Mohanty's method, the used watermark is embedded into the original image by fusing the DCT coefficients of the used watermark blocks with the corresponding blocks of the selected subimage. In other words, the DCT coefficients of each 8 × 8 DCT block of the used watermark are embedded in the corresponding 8 × 8 DCT block of the selected subimage. As a result, the robustness of Mohanty's method decreases, because the AC coefficients of a DCT block are not robust to many attacks such as lowpass filtering, compression, and median filtering; many of the embedded AC coefficients of the used watermark are therefore degraded after such attacks. To overcome this drawback, in the proposed method the coefficients of the b × b (where b < 8) DCT blocks of the used watermark are embedded only in the DC coefficients of the selected 8 × 8 DCT blocks of the original image. As a result, the robustness of the proposed method is higher than that of Mohanty's method, because the DC coefficient of a DCT block is more robust than its AC coefficients.

Asatryan's method works in the spatial domain to embed the watermark in the original image. In this method, the values of the 32 × 32 block DCT coefficients of the compressed watermark are mapped to the interval [0, 255] by a fixed linear transform, and the mapped values of the DCT coefficients are embedded in the pixel values of each block of the original image. As a result, because the embedding is done in the spatial domain, the robustness of this method is decreased and the quality of the watermarked image is low (see Section 4.3). Also, mapping the DCT coefficients to the interval [0, 255] may cause distortion in the extracted watermark.

The weighting factor (β) is used in all three methods. Its value is 0.02 for the DC and 0.1 for the AC coefficients in Mohanty's method and 0.07 for all pixels in Asatryan's method. In the proposed method, however, the value of this parameter for each DCT block is based on the Texture and Luminance of that block. This follows the idea that a modification inside a highly textured block is unnoticeable to the human eye and that the brighter the background is, the lower the visibility of the embedded watermark. Therefore, the proposed method produces a watermarked image of higher quality than the two related methods.

3.4. Extracting Procedure. Mohanty's method uses a reverse embedding procedure to extract the DCT coefficients of each 8 × 8 DCT block of the watermark and applies the IDCT to create the watermark in the spatial domain. In the proposed method, a reverse embedding procedure is performed to extract only the DC coefficients of each b × b DCT block of the watermark. The estimation formulas are then used to evaluate the AC coefficients of each b × b DCT block (e.g., the first five AC coefficients when b = 4) of the watermark, and the IDCT is applied to create the watermark in the spatial domain.

Asatryan's method uses a reverse embedding procedure (in the spatial domain) to extract the mapped DCT coefficients of the watermark. Then the reverse of the linear transform used in the embedding process is used to create the DCT coefficients of the watermark.

Figure 17: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after adding Gaussian noise with mean 0 and variance 0.002; (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after adding Gaussian noise; (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after adding Gaussian noise; (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after adding Gaussian noise; (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 26.95, (b) 27.08, (c) 27.08, (d) 26.96, (i) 26.98, (j) 27.08, (k) 27.07, (l) 26.94 dB; γ values: (e) 0.6630, (f) 0.6494, (g) 0.7975, (h) 0.7497, (m) 0.7032, (n) 0.7643, (o) 0.6143, (p) 0.6070.

Finally, the IDCT is applied to create the watermark in the spatial domain. The steps of Mohanty's method, Asatryan's method, and the proposed watermarking method are summarized in Table 3.

4. Experimental Results

The proposed algorithm has been tested on different images and a large set of grayscale watermark images, but only the results for four popular images and two logos with different sizes are presented here. The selected logos are the Texas University and Shahid Beheshti University ones. We have chosen the Lena, Baboon, Peppers, and Crowd grayscale images of size 512 × 512, shown in Figure 10, to embed the watermarks, and the watermarks are grayscale logos of size 128 × 128 and 256 × 256, shown in Figure 11. Also, based on experiments on different watermark images (of size 128 × 128 and 256 × 256), the value of b was set equal to 4. The program development tool was MATLAB and the computation platform was a personal computer with a 1.66 GHz CPU and 2 GB of RAM.

Table 3: Comparing the structures of Mohanty and Bhargava [24], D. Asatryan and N. Asatryan [27], and the proposed watermarking methods.

Block size of the original image:
- Mohanty's method [24]: 8 × 8.
- Asatryan's method [27]: M/K × N/L.
- Proposed method: 8 × 8.

Selecting embedding area procedure:
- Mohanty's method: (1) Find the most perceptually significant set of blocks constituting a subimage (equal to the size of the watermark) with respect to human perception properties such as Texture, Location, Contrast, Luminance, and Edginess of the original image.
- Asatryan's method: (1) All blocks of the original image are used to embed the watermark.
- Proposed method: (1) Calculate the Texture, Luminance, and Location of each 8 × 8 DCT block of the original image. (2) Use the proposed Fuzzy Inference System to calculate the suitability factor of each block. (3) Select the (L/b × K/b) blocks with the highest suitability factor to embed the watermark.

Block size of the watermark:
- Mohanty's method: 8 × 8.
- Asatryan's method: 32 × 32.
- Proposed method: b × b.

Watermark creation procedure:
- Mohanty's method: (2) Create a synthetic image by using the 8 × 8 DCT coefficients of the most perceptually important subimage of the original image and Gaussian/Laplacian distributions for the DC and AC coefficients, respectively. (3) Embed the watermark in the synthetic image using any DCT-based visible watermarking algorithm.
- Asatryan's method: (2) Compression is performed on the watermark image until the number of chosen DCT coefficients of each 32 × 32 DCT block is significantly smaller than the number of pixels of the original watermark. (3) The values of the DCT coefficients are mapped to the interval [0, 255] by a fixed linear transform.
- Proposed method: (4) Change the AC coefficients of each b × b DCT block to zero and apply the IDCT to create the used watermark.

Inserting procedure:
- Mohanty's method: (4) The used watermark is invisibly embedded into the original image by fusing the compound watermark blocks with the corresponding blocks of the selected perceptually important subimage of the original.
- Asatryan's method: (4) Embed each mapped DCT coefficient of the watermark in a pixel of a block of the original image.
- Proposed method: (5) Embed the DC coefficient of each b × b DCT block of the used watermark in the DC coefficient of a selected 8 × 8 DCT block of the original image.

Value of the β coefficient (weighting factor):
- Mohanty's method: βi,j,k = 0.02 for DC coefficients and 0.1 for AC coefficients.
- Asatryan's method: β = 0.07 for all pixels.
- Proposed method: βk is different for each 8 × 8 DCT block of the original image and is computed by the proposed Fuzzy Inference System based on the Texture and Luminance of the selected block.

Extracting procedure:
- Mohanty's method: (1) Select the subimage of the original image where the watermark was embedded. (2) Use the reverse embedding procedure to extract the DCT coefficients of the watermark. (3) Apply the IDCT on each extracted 8 × 8 DCT block to create the watermark in the spatial domain.
- Asatryan's method: (1) Use the reverse embedding procedure to extract the mapped DCT coefficients of the watermark. (2) The reverse of the linear transform used in the embedding process is utilized to create the DCT coefficients of the watermark. (3) Apply the IDCT on each extracted DCT block to create the watermark in the spatial domain.
- Proposed method: (1) Select the blocks of the original image where the watermark was embedded. (2) Use the reverse embedding procedure to extract the DC coefficients of each b × b DCT block of the used watermark. (3) Use the extracted DC coefficients to estimate the AC coefficients of each b × b DCT block of the watermark. (4) Apply the IDCT on each estimated b × b DCT block to create the watermark in the spatial domain.
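The "value of the β coefficient" row above is where the proposed method departs most visibly from the two related ones: β is fixed in [24, 27] but block-adaptive here. As a rough illustration of that idea only (the actual values come from the DFIS membership functions of Section 2.1, which are not reproduced in this sketch), a per-block weight could be derived from simple texture and luminance measures as follows; the constants are arbitrary.

```python
import numpy as np

def adaptive_beta(original, base=0.02, gain=0.05):
    """Illustrative per-block weighting: brighter and more textured 8 x 8
    blocks tolerate a stronger modification and so receive a larger beta.
    This is a stand-in for the paper's fuzzy inference system, not the DFIS."""
    M, N = original.shape
    beta = np.zeros((M // 8, N // 8))
    img = original.astype(float)
    for i in range(M // 8):
        for j in range(N // 8):
            block = img[8*i:8*i+8, 8*j:8*j+8]
            texture = block.std() / 128.0        # rough texture measure
            luminance = block.mean() / 255.0     # normalized brightness
            beta[i, j] = base + gain * 0.5 * (texture + luminance)
    return beta
```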

The experiments confirmed the effectiveness of the proposed algorithm in producing visually pleasing watermarked images; in addition, the extracted watermark was visually recognizable and similar to both the inserted watermark and the original watermark. Our scheme requires only one key, the seed of the random number generator, to be stored for the extraction phase, so the method has no storage overhead. After the watermark is embedded into the original image, the PSNR (Peak Signal to Noise Ratio) is used to evaluate the quality of the watermarked image. The MSE and the PSNR in decibels (dB) are defined as follows:

PSNR = 10 log10(255² / MSE),
MSE = (1/(M × N)) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} (X_ij − X′_ij)²,   (11)

Figure 18: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after a Gaussian lowpass filter with window size 5 × 5; (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after the same filter; (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after the same filter; (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after the same filter; (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 33.25, (b) 32.28, (c) 30.11, (d) 28.67, (i) 33.33, (j) 32.34, (k) 30.07, (l) 28.64 dB; γ values: (e) 0.8948, (f) 0.8994, (g) 0.7074, (h) 0.7940, (m) 0.8646, (n) 0.8598, (o) 0.6502, (p) 0.7751.

where X_ij represents the (i, j) pixel value of the original image and X′_ij represents the (i, j) pixel value of the watermarked image. The other metric, used to test the quality of the retrieved watermark image, is the Normalized Cross Correlation (NCC), defined as follows:

γ = Σ_{i,j} (w_{i,j} − w̄)(w′_{i,j} − w̄′) / sqrt( Σ_{i,j} (w_{i,j} − w̄)² · Σ_{i,j} (w′_{i,j} − w̄′)² ),   (12)

where w and w′ are the extracted watermark and the inserted watermark images, respectively, and w̄ and w̄′ are their pixel mean values. The subscripts i, j of w or w′ denote the index of an individual pixel of the corresponding image. The summations are over all the image pixels.
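Both metrics follow directly from (11) and (12); a small Python/NumPy sketch, consistent with the earlier illustrative snippets:

```python
import numpy as np

def psnr(original, distorted):
    """Peak Signal-to-Noise Ratio of Eq. (11), for 8-bit grayscale images."""
    mse = np.mean((original.astype(float) - distorted.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def ncc(w, w_ext):
    """Normalized cross correlation gamma of Eq. (12)."""
    a = w.astype(float) - w.mean()
    b = w_ext.astype(float) - w_ext.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
```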

Figure 19: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after an averaging filter with window size 5 × 5; (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after the same filter; (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after the same filter; (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after the same filter; (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 29.60, (b) 29.08, (c) 25.31, (d) 24.34, (i) 29.63, (j) 29.11, (k) 25.30, (l) 24.33 dB; γ values: (e) 0.8577, (f) 0.8621, (g) 0.6194, (h) 0.7174, (m) 0.8172, (n) 0.8074, (o) 0.5654, (p) 0.6998.

The other part of the experiments involved testing the algorithm against many common attacks on the watermarked images; the extracted watermark was in almost all cases detectable and acceptable with respect to the original and inserted watermarks. In these experiments, we used the StirMark MATLAB software, which contains approximately 90 different types of image manipulations. In the following subsections, however, we present only the experimental results for the test images under nongeometric and geometric attacks such as compression, noise addition, filtering, cropping, changing the aspect ratio, tampering, and scaling on the watermarked images, in order to evaluate the robustness of the proposed scheme.

4.1. Quality of Watermarked Image and Extracted Watermark before Attack. The four images used in the embedding process are shown in Figure 10. Also, we used the two watermarks in Figure 11 at two sizes (128 × 128 and 256 × 256) to be embedded in these original images. The watermarked images and the extracted watermarks with the corresponding PSNR for the different watermark sizes (128 × 128 and 256 × 256) are shown in Figures 12 and 13, respectively.

Figure 20: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after blurring with radius 3; (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after blurring; (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after blurring; (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after blurring; (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 29.12, (b) 28.81, (c) 24.66, (d) 23.89, (i) 29.14, (j) 28.83, (k) 24.65, (l) 23.88 dB; γ values: (e) 0.9407, (f) 0.9404, (g) 0.8287, (h) 0.8830, (m) 0.9191, (n) 0.9215, (o) 0.7743, (p) 0.8701.

It is obvious that the PSNR value of the watermarked image is higher in comparison with other existing watermarking algorithms. The average PSNR value for the watermarked images was approximately 52 dB when the size of the watermark images is 128 × 128, and approximately 49 dB when the size of the watermark images is 256 × 256. So, the watermark embedding process produced high-quality watermarked images.

4.2. Quality of Watermarked Image and Extracted Watermark versus Various Attacks. In the following experiments, we used several image manipulations, including compression, noise addition, filtering, cropping, changing the aspect ratio, tampering, the copy attack, scaling, and composite attacks on the watermarked images to evaluate the robustness of the proposed scheme.

4.2.1. Compression

JPEG Compression. Using image compression before storing and transmitting images is very common. JPEG, from the Joint Photographic Experts Group, has found its way throughout digital imaging and is a very popular image compression tool for still images. We therefore evaluated the robustness of the proposed scheme by compressing the watermarked images with different JPEG quality factors. Figures 14(a)–14(d) and 14(i)–14(l) show the watermarked images with watermark size 256 × 256 after JPEG compression with quality factors 40%, 30%, 20%, and 10% for the Lena, Baboon, Peppers, and Crowd images, respectively.

Figure 21: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after sharpening; (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after sharpening; (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after sharpening; (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after sharpening; (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 25.34, (b) 25.15, (c) 23.28, (d) 21.10, (i) 25.39, (j) 25.20, (k) 23.27, (l) 21.08 dB; γ values: (e) 0.7308, (f) 0.7144, (g) 0.6174, (h) 0.5034, (m) 0.6679, (n) 0.6684, (o) 0.3421, (p) 0.4626.

Figures 14(e)–14(h) and 14(m)–14(p) show the watermarks extracted from Figures 14(a)–14(d) and 14(i)–14(l), respectively. Also, Figures 15(a)–15(d) and 15(i)–15(l) show the watermarked images with watermark size 128 × 128 after JPEG compression with quality factors 40%, 30%, 20%, and 10% for the Lena, Baboon, Peppers, and Crowd images, respectively, and Figures 15(e)–15(h) and 15(m)–15(p) show the watermarks extracted from Figures 15(a)–15(d) and 15(i)–15(l), respectively. The results show that the proposed scheme is robust against JPEG image compression, and the extracted watermarks are visually similar to the inserted watermark under different JPEG quality factors.

Wavelet Compression (JPEG2000). We also evaluated the robustness of the proposed method against wavelet compression. Figures 16(a)–16(d) show the results of applying wavelet compression to the Lena, Baboon, Peppers, and Crowd images with compression rates of 0.4 bpp, 0.8 bpp, 1.5 bpp, and 3.5 bpp, respectively. The extracted watermarks shown in Figures 16(e)–16(h) are still visually detectable after this attack.
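A JPEG robustness check of this kind can be scripted by re-encoding the watermarked image at the desired quality factor and correlating the re-extracted watermark with the embedded one. The sketch below assumes Pillow for the JPEG round trip and reuses the illustrative embed/extract/ncc helpers defined earlier; it is an example of such a test setup, not the authors' actual StirMark-based procedure.

```python
import io
import numpy as np
from PIL import Image

def jpeg_attack(image, quality):
    """Re-encode an 8-bit grayscale image array as JPEG at the given quality
    factor and return the decoded (attacked) image as a float array."""
    buf = io.BytesIO()
    Image.fromarray(np.uint8(np.clip(image, 0, 255))).save(buf, format='JPEG', quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=float)

# Example use with the earlier illustrative helpers (names assumed, not from the paper):
# attacked = jpeg_attack(marked, quality=30)
# extracted = extract_watermark(original, attacked, alpha, beta, used_wm.shape)
# print('gamma =', ncc(used_wm, extracted))
```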

Figure 22: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after a median filter with window size 5 × 5; (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after the same filter; (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after the same filter; (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after the same filter; (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 31.17, (b) 32.08, (c) 26.46, (d) 24.60, (i) 31.22, (j) 32.14, (k) 26.48, (l) 24.59 dB; γ values: (e) 0.8768, (f) 0.8818, (g) 0.6537, (h) 0.5993, (m) 0.8426, (n) 0.8476, (o) 0.6247, (p) 0.6046.

4.2.2. Noise Addition. The robustness of the proposed method has been evaluated by adding Gaussian noise with mean 0 and variance 0.002 to the watermarked images. Figures 17(a)–17(d) and 17(i)–17(l) show the results of adding Gaussian noise. The extracted watermarks are still visually detectable after this attack (as shown in Figures 17(e)–17(h) and 17(m)–17(p)), indicating that the proposed scheme is also robust to noise attacks.

4.2.3. Filtering. The robustness of the watermarking scheme has also been tested by applying various filters such as sharpening, Gaussian lowpass filtering, averaging, median filtering, and blurring to the watermarked images. Figures 18(a)–18(d) and 18(i)–18(l) show the resulting images after a Gaussian lowpass filter with window size 5 × 5. Figures 18(e)–18(h) and 18(m)–18(p) show the extracted watermarks and the corresponding γ values. The extracted watermarks are still visually detectable after this attack, indicating that the proposed scheme is also robust to the Gaussian lowpass filter attack.

Figures 19(a)–19(d) and 19(i)–19(l) show the resulting images after an averaging filter with window size 5 × 5. Figures 19(e)–19(h) and 19(m)–19(p) show the extracted watermarks and their γ values. The extracted watermarks are still visually detectable after the averaging filter attack.

Figure 23: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after scaling (1/2); (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after scaling (1/2); (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after scaling (1/2); (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after scaling (1/2); (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 33.94, (b) 31.85, (c) 31.85, (d) 29.29, (i) 34.06, (j) 31.93, (k) 31.79, (l) 29.26 dB; γ values: (e) 0.9528, (f) 0.9402, (g) 0.8835, (h) 0.8860, (m) 0.9462, (n) 0.9142, (o) 0.8651, (p) 0.8702.

Figures 20(a)–20(d) and 20(i)–20(l) show the resulting images after blurring with radius 3. Figures 20(e)–20(h) and 20(m)–20(p) show the extracted watermarks and their γ values. The extracted watermarks are still visually detectable after this attack, indicating that the proposed scheme is also robust to the blurring attack.

Figures 21(a)–21(d) and 21(i)–21(l) show the resulting images after sharpening, and Figures 21(e)–21(h) and 21(m)–21(p) show the extracted watermarks. Also, Figures 22(a)–22(d) and 22(i)–22(l) show the resulting images after median filtering with window size 5 × 5, and Figures 22(e)–22(h) and 22(m)–22(p) show the extracted watermarks and their γ values. The test results show that the watermark image is still detectable after the filter attacks. (It is worth mentioning that, because the images are shown at reduced size in the paper, the effects of some filters are not visible at these sizes.)

4.2.4. Geometric Attacks. In the following experiments, different geometric attacks such as scaling, cropping, tampering, and changing the aspect ratio are performed on the watermarked images to test the robustness of the proposed method.

Figure 24: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) after scaling (1/4); (e) and (f) the 256 × 256 watermarks extracted from (a) and (b), respectively. (c) and (d) Crowd and Baboon watermarked images with the 128 × 128 watermark image shown in Figure 11(b) after scaling (1/4); (g) and (h) the 128 × 128 watermarks extracted from (c) and (d), respectively. (i) and (j) Lena and Peppers watermarked images with the 128 × 128 watermark image shown in Figure 11(d) after scaling (1/4); (m) and (n) the 128 × 128 watermarks extracted from (i) and (j), respectively. (k) and (l) Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) after scaling (1/4); (o) and (p) the 256 × 256 watermarks extracted from (k) and (l), respectively. Panel PSNR values: (a) 28.80, (b) 28.00, (c) 24.00, (d) 23.58, (i) 28.83, (j) 28.02, (k) 23.99, (l) 23.57 dB; γ values: (e) 0.7144, (f) 0.6931, (g) 0.6285, (h) 0.4076, (m) 0.6494, (n) 0.6022, (o) 0.3308, (p) 0.4237.

Scaling. In this experiment, the watermarked images are reduced to 1/2 and 1/4 of their original size. In order to detect the watermark, the reduced images are first restored to their original dimensions. Figures 23(a)–23(d) and 23(i)–23(l) show the watermarked images after reduction to 1/2 and recovery to the original dimensions. Figures 23(e)–23(h) and 23(m)–23(p) show the watermarks extracted from Figures 23(a)–23(d) and 23(i)–23(l), respectively.

Figures 24(a)–24(d) and 24(i)–24(l) show the watermarked images after reduction to 1/4 and recovery to the original dimensions. Figures 24(e)–24(h) and 24(m)–24(p) show the watermarks extracted from Figures 24(a)–24(d) and 24(i)–24(l) and the corresponding γ values, respectively. The test results show that the watermark image is still detectable after the scaling attacks.

Cropping. In this experiment, the watermarked images are cropped. Figures 25(a)–25(d) show the cropped versions of the Lena, Baboon, Peppers, and Crowd watermarked images, respectively. Figures 25(e)–25(h) show the watermarks extracted from these figures. As shown in these figures, the extracted watermarks are still visually detectable.
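The scaling attack above can be simulated in the same spirit as the earlier JPEG sketch: shrink the watermarked image, restore it to its original dimensions, and run the extraction again. The helper below assumes Pillow's bilinear resampling and is only an illustration of the test setup.

```python
import numpy as np
from PIL import Image

def scaling_attack(image, factor=0.5):
    """Shrink an 8-bit grayscale image by 'factor' and rescale it back to its
    original dimensions, as done before re-running the extraction."""
    h, w = image.shape
    img = Image.fromarray(np.uint8(np.clip(image, 0, 255)))
    small = img.resize((int(w * factor), int(h * factor)), Image.BILINEAR)
    restored = small.resize((w, h), Image.BILINEAR)
    return np.asarray(restored, dtype=float)
```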

Figure 25: (a)–(d) The watermarked Lena, Baboon, Peppers, and Crowd images after cropping, respectively (PSNR: 18.01, 21.01, 16.56, and 14.63 dB). (e)–(h) The watermarks extracted from (a)–(d), respectively (γ = 0.4506, 0.4206, 0.5325, and 0.7223).

Figure 26: (a) and (b) The watermarked Lena and Peppers images after the change of aspect ratio attack (X = 1, Y = 1.2), respectively (PSNR: 27.50 and 26.49 dB). (c) and (d) The watermarked Baboon and Crowd images after the change of aspect ratio attack (X = 0.8, Y = 1), respectively (PSNR: 24.45 and 26.36 dB). (e)–(h) The watermarks extracted from (a)–(d), respectively (γ = 0.6733, 0.5809, 0.6440, and 0.5666).

Changing Aspect Ratio. In this experiment, the robustness of the proposed method was tested by changing the aspect ratio of the watermarked images. Figures 26(a) and 26(b) show the Lena and Peppers images after changing their aspect ratio (X = 1, Y = 1.2), and Figures 26(c) and 26(d) show the Baboon and Crowd images after changing their aspect ratio (X = 0.8, Y = 1). To extract the watermark, the images were rescaled to the original size (512 × 512); the watermarks extracted from these figures are shown in Figures 26(e)–26(h).

Tampering and Data Block Removal. We tested the robustness of the proposed method by tampering with the watermarked images. Figures 27(a)–27(d) show the results of tampering with the Lena, Baboon, Peppers, and Crowd images, respectively. As shown in Figures 27(e)–27(h), the extracted watermarks are still visually detectable after this attack, indicating that the proposed scheme is also robust to such attacks. Also, Figures 27(i)–27(l) show the results of data block removal for the Lena, Baboon, Peppers, and Crowd images, respectively, and Figures 27(m)–27(p) show the watermarks extracted

Figure 27: (a)–(d) The watermarked Lena, Baboon, Peppers, and Crowd images after tampering, respectively (PSNR: 15.35, 19.59, 16.37, and 18.46 dB). (e)–(h) The watermarks extracted from (a)–(d), respectively (γ = 0.7167, 0.9090, 0.7598, and 0.8415). (i)–(l) The watermarked Lena, Baboon, Peppers, and Crowd images after data block removal, respectively (PSNR: 15.33, 14.49, 15.87, and 14.63 dB). (m)–(p) The watermarks extracted from (i)–(l), respectively (γ = 0.7698, 0.7702, 0.7816, and 0.6724).

from Figures 27(i)–27(l), respectively. As a result, the watermarks extracted after such attacks are still visually detectable, and the proposed method is robust to tampering and data block removal.

Copy Attack. The copy attack has been used to create the false positive problem and operates as follows: (1) a watermark is first predicted from a watermarked image, (2) the predicted watermark is added to a target image to create a counterfeit watermarked image, and (3) from the counterfeit image, a watermark can be detected that wrongly claims rightful ownership.

In this experiment, the robustness of the proposed watermarking method was tested by applying the copy attack to the watermarked images. Figures 28(a) and 28(b) show the Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b), and Figures 28(c) and 28(d) show the Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d). Figures 28(e)–28(h) show the counterfeit watermarked images created from Figures 28(a)–28(d), respectively. The watermarks extracted from Figures 28(e)–28(h) are shown in Figures 28(i)–28(l), respectively. Therefore, the proposed method is robust against the copy attack.

4.2.5. Composite Attacks. The purpose of this experiment is to check whether this kind of combined attack is able to remove the watermark of the proposed method.

Figure 28: (a) and (b) Lena and Peppers watermarked images with the 256 × 256 watermark image shown in Figure 11(b) (PSNR: 48.27 and 48.47 dB); (e) and (f) the counterfeit watermarked images created from (a) and (b), respectively. (c) and (d) The Crowd and Baboon watermarked images with the 256 × 256 watermark image shown in Figure 11(d) (PSNR: 48.66 and 48.26 dB); (g) and (h) the counterfeit watermarked images created from (c) and (d), respectively. (i)–(l) The watermarks extracted from (e)–(h), respectively (γ = 0.0145, −0.1253, −0.0246, and −0.0057).

Figures 29(a)–29(h) show the watermarked images after different composite attacks, and Figures 29(i)–29(p) show the watermarks extracted from Figures 29(a)–29(h), respectively. The experimental results on quality and recognizability therefore demonstrate the performance of our method under various attacks.

4.3. Comparison with Other Related Methods. In this subsection, the results of the proposed method are compared with two related ones, presented by Mohanty and Bhargava [24] and by D. Asatryan and N. Asatryan [27]. The comparison is based on four metrics: (1) the average execution time for watermark insertion, (2) the PSNR value of the watermarked image, (3) the PSNR or correlation value (γ) of the extracted watermark, and (4) the error rate of watermark detection.

The three methods were implemented on a personal computer with a 1.66 GHz CPU and 2 GB of RAM. The average execution time of the proposed method for watermark insertion was approximately 2 s for an image of size 512 × 512 pixels and a watermark image of size 128 × 128 pixels. The execution time for Mohanty's method was about 4 s, roughly twice that of the proposed algorithm, and about 1 s for Asatryan's method, roughly half that of the proposed algorithm.

Based on the experiments, in the proposed method the average minimum value of γ at which the extracted watermark was still visually detectable was 0.4. This value for Mohanty's method and for Asatryan's method was 0.65 and 0.3, respectively.

To have a complete comparison between the proposed method and the related ones, we embedded 50 different watermark images in three sizes (64 × 64, 128 × 128, and 256 × 256) in 50 selected images in two sizes (256 × 256 and 512 × 512) and obtained 50 × 50 = 2500 watermarked images. Then we used StirMark and applied different attacks to the watermarked images, including blurring, sharpening, scaling, adding Gaussian noise, tampering, data block removal, and cropping. In addition, JPEG compression with different quality factors was applied to the watermarked images. Then, we conducted the watermark detection procedure on every attacked watermarked image.

Figure 29: (a)–(h) Watermarked images after different composite attacks: (a) Wiener filter (3 × 3) + scaling (1/2) + JPEG(80), PSNR = 31.17 dB; (b) soft threshold + blurring (1) + JPEG(85), PSNR = 27.08 dB; (c) hard threshold + average filter (3 × 3) + JPEG(80), PSNR = 24.46 dB; (d) template removal + scaling (1/2) + JPEG(85), PSNR = 24.60 dB; (e) blurring (2) + JPEG(80), PSNR = 31.22 dB; (f) median filter (3 × 3) + JPEG(85), PSNR = 32.14 dB; (g) sharpening (1) + JPEG(90), PSNR = 26.48 dB; (h) blurring (2) + JPEG(80), PSNR = 24.59 dB. (i)–(p) The watermarks extracted from (a)–(h), respectively (γ = 0.8768, 0.7999, 0.7472, 0.6889, 0.7560, 0.7983, 0.4247, and 0.8098).

Table 4 shows the average PSNR of the watermarked images and of the extracted watermarks. As shown in this table, the proposed method outperforms the two related methods in terms of the PSNR of the watermarked images and of the extracted watermarks after different attacks.

Finally, as Table 5 shows, the comparison results demonstrate that our method is capable of detecting watermarks at lower error rates than the two related methods and can more effectively stay robust under image processing attacks. Also, Table 6 shows the PSNR value of the watermarked image for the different methods. The best value in each row of these tables is shown in bold.

The quality of the watermark extracted by the proposed method and by the two related ones under different attacks is summarized in Tables 7, 8, 9, and 10 for Lena and Baboon with watermark size 256 × 256 and for Peppers and Crowd with watermark size 128 × 128, respectively. The first column of each of these tables gives the attack type, where the symbols "AF", "B", "GN", "MF", "S", "GL", "SH", "C", "CAR", "WF", and "JP" denote average filter, blurring, Gaussian noise, median filter, scaling, Gaussian lowpass filter, sharpening, cropping, change of aspect ratio, Wiener filter, and JPEG compression, respectively; the number following each symbol is the parameter of the specific operation. The second column gives the PSNR of the watermarked image after the attack, and the third, fourth, and fifth columns give the γ value of the extracted watermark. The best value in each row is shown in bold.

Table 4: Comparison of the proposed method and the two related methods. Each row lists the host image size, the watermark size, the average PSNR of the watermarked image (proposed method, Mohanty's method [24], Asatryan's method [27]), and the average PSNR of the extracted watermark after different attacks (same order).

- 512 × 512 host, 256 × 256 watermark: 49.81 dB, 39.44 dB, 34.72 dB; 19.69 dB, 16.04 dB, 18.10 dB.
- 512 × 512 host, 128 × 128 watermark: 53.15 dB, 42.31 dB, 35.63 dB; 18.68 dB, 15.57 dB, 17.96 dB.
- 256 × 256 host, 128 × 128 watermark: 48.18 dB, 37.98 dB, 33.55 dB; 20.70 dB, 16.93 dB, 18.79 dB.
- 256 × 256 host, 64 × 64 watermark: 52.90 dB, 41.09 dB, 34.73 dB; 18.74 dB, 15.59 dB, 18.52 dB.

Table 5: Error rates of detecting watermark.

Attack type Proposed method Mohanty method [24] Asatryan method [27]
Attack-free 0/2500 0/2500 0/2500
Blurring (2,3) 2/2500 2/2500 2/2500
Sharpening(1) 3/2500 4/2500 1/2500
Median Filter (5 × 5, 7 × 7) 0/2500 3/2500 5/2500
Gaussian Noise (0.001, 0.002) 3/2500 3/2500 2/2500
Gaussian Low Pass Filter (3 × 3, 5 × 5) 0/2500 0/2500 1/2500
Cropping (40%, 50%, 60%) 1/2500 20/2500 1/2500
Scaling (1/2, 1/4) 2/2500 3/2500 7/2500
JPEG Compression (10, 20, 30, 40) 3/2500 6/2500 1/2500
Tampering 0/2500 8/2500 2/2500
Data Block Removal 0/2500 5/2500 1/2500
Composite Attack 3/2500 13/2500 21/2500
Total 16 67 44

Table 6: PSNR value of watermarked image (512 × 512) by several methods.

PSNR of watermarked image


Image Watermark Watermark size Mohanty method [24] Asatryan method [27] Proposed method
Lena Figure 11(b) 128 × 128 43.35 dB 36.10 dB 52.33 dB
Lena Figure 11(b) 256 × 256 40.71 dB 36.06 dB 48.27 dB
Lena Figure 11(d) 128 × 128 44.46 dB 35.78 dB 52.92 dB
Lena Figure 11(d) 256 × 256 40.98 dB 35.82 dB 48.84 dB
Baboon Figure 11(b) 128 × 128 43.09 dB 36.42 dB 52.44 dB
Baboon Figure 11(b) 256 × 256 41.01 dB 36.46 dB 48.47 dB
Baboon Figure 11(d) 128 × 128 44.19 dB 35.91 dB 53.08 dB
Baboon Figure 11(d) 256 × 256 41.17 dB 35.92 dB 49.11 dB
Peppers Figure 11(b) 128 × 128 43.27 dB 35.18 dB 51.72 dB
Peppers Figure 11(b) 256 × 256 39.03 dB 35.20 dB 47.94 dB
Peppers Figure 11(d) 128 × 128 43.18 dB 34.90 dB 52.27 dB
Peppers Figure 11(d) 256 × 256 40.80 dB 34.86 dB 48.66 dB
Crowd Figure 11(b) 128 × 128 43.69 dB 35.33 dB 51.38 dB
Crowd Figure 11(b) 256 × 256 40.30 dB 35.37 dB 47.63 dB
Crowd Figure 11(d) 128 × 128 44.12 dB 35.10 dB 52.04 dB
Crowd Figure 11(d) 256 × 256 40.25 dB 35.04 dB 48.26 dB

Table 7: The quality of extracted watermark from Lena image (512 × 512) with watermark size 256 × 256 (Figure 11(b)) versus several
attacks in different methods.
γ of extracted watermark
Attack type Watermarked image PSNR
Mohanty method [24] Asatryan method [27] Proposed method
No Attack 48.27 dB 0.9975 0.9988 0.9975
AF(5 × 5) + JP(50) 29.49 dB 0.6533 0.6898 0.7008
S(1/2) + B(3) + JP(60) 28.91 dB 0.5611 0.5709 0.5770
SH(1) + MF(5 × 5) 29.25 dB 0.5753 0.7128 0.6389
MF(5 × 5) + S(1/2) + JP(60) 30.25 dB 0.5901 0.4866 0.7178
GN(0,0.002) + MF(5 × 5) + JP(50) 30.03 dB 0.4569 0.5684 0.5360
GL(5 × 5) + CAR(1,1.2) + JP(50) 26.38 dB 0.7072 0.6079 0.8984
WF(3 × 3) + B(2) + JP(40) 31.20 dB 0.5310 0.6404 0.6696
B(2) + GN(0,0.002) + JP(50) 29.38 dB 0.4085 0.4402 0.5535
C(40%) 18.01 dB 0.1708 0.3508 0.4506
S(1/4) + JP(40) 28.59 dB 0.4598 0.2222 0.6341

Table 8: The quality of extracted watermark from Baboon image (512 × 512) with watermark size 256 × 256 (Figure 11(b)) versus several
attacks in different methods.
γ of extracted watermark
Attack type
Watermarked image PSNR Mohanty method [24] Asatryan method [27] Proposed method
No Attack 48.47 dB 0.9961 0.9986 0.9966
AF(5 × 5) + JP(50) 24.32 dB 0.3857 0.5036 0.4550
S(1/2) + B(2) + JP(60) 25.59 dB 0.3973 0.4318 0.5109
SH(1) + MF(5 × 5) 23.86 dB 0.3128 0.6555 0.5246
MF(5 × 5) + S(1/2) + JP(60) 24.37 dB 0.3033 0.4202 0.4293
GN(0, 0.002) + MF(5 × 5) + JP(50) 24.34 dB 0.3048 0.5439 0.4032
GL(5 × 5) + CAR(1,1.2) + JP(50) 27.68 dB 0.4485 0.3643 0.6176
WF(3 × 3) + B(2) + JP(40) 26.48 dB 0.4079 0.4459 0.5079
B(2) + GN(0,0.002) + JP(50) 25.13 dB 0.4239 0.5226 0.5020
C(40%) 20.29 dB 0.1279 0.3880 0.4183
S(1/4) + JP(40) 23.45 dB 0.2017 0.1212 0.4027

Table 9: The quality of the watermark extracted from the Peppers image (512 × 512) with watermark size 128 × 128 (Figure 11(b)) versus several attacks for the different methods.
γ of extracted watermark
Attack type Watermarked image PSNR Mohanty method [24] Asatryan method [27] Proposed method
No Attack 51.72 dB 0.9974 0.9990 0.9979
AF(5 × 5) + JP(50) 29.80 dB 0.4233 0.6118 0.5943
S(1/2)+B(3) + JP(60) 29.35 dB 0.3711 0.4999 0.5492
SH(1) + MF(5 × 5) 29.52 dB 0.4277 0.6256 0.5750
MF(5 × 5) + S(1/2) + JP(60) 30.66 dB 0.4405 0.3304 0.5490
GN(0, 0.002) + MF(5 × 5) + JP(50) 30.34 dB 0.2832 0.4481 0.4021
GL(5 × 5) + CAR(1,1.2) + JP(50) 31.79 dB 0.6008 0.4568 0.7637
WF(3 × 3) + B(2) + JP(40) 31.13 dB 0.5343 0.5928 0.6697
B(2) + GN(0,0.002) + JP(50) 27.60 dB 0.3900 0.5222 0.5152
C(40%) 17.06 dB 0.2080 0.3710 0.6544
S(1/4) + JP(40) 29.06 dB 0.3827 0.3535 0.6415

Table 10: The quality of the watermark extracted from the Crowd image (512 × 512) with watermark size 128 × 128 (Figure 11(b)) versus several attacks for the different methods.
γ of extracted watermark
Attack type Watermarked image PSNR Mohanty method [24] Asatryan method [27] Proposed method
No Attack 51.38 dB 0.9969 0.9985 0.9974
AF(5 × 5) + JP(50) 25.19 dB 0.3432 0.4803 0.4346
S(1/2) + B(3) + JP(60) 24.30 dB 0.3040 0.4584 0.4490
SH(1) + MF(5 × 5) 24.53 dB 0.2400 0.4056 0.3132
MF(5 × 5) + S(1/2) + JP(60) 25.79 dB 0.3029 0.3505 0.4132
GN(0, 0.002) + MF(5 × 5) + JP(50) 25.90 dB 0.2306 0.4034 0.3389
GL(5 × 5) + CAR(1,1.2) + JP(50) 29.24 dB 0.4064 0.4708 0.5851
WF(3 × 3) + B(2) + JP(40) 27.91 dB 0.3666 0.4660 0.5051
B(2) + GN(0,0.002) + JP(50) 26.06 dB 0.3099 0.5306 0.4455
C(40%) 14.70 dB 0.1818 0.3195 0.7337
S(1/4) + JP(40) 23.86 dB 0.2030 0.2344 0.4182

5. Conclusion

In this paper, grayscale watermark insertion and extraction schemes were proposed. The proposed method works by modifying the DC values of the original image in the frequency domain to create the watermarked image. The embedding procedure is based on a fuzzy inference system that locates the best places for watermark insertion. The algorithm was tested on several standard test images, and the experimental results demonstrated that it creates high-quality images and is robust against different attacks. In the future, we plan to extend the proposed method so that it can withstand all attacks by developing a blind method based on a similar idea.

References

[1] M.-C. Hu, D.-C. Lou, and M.-C. Chang, "Dual-wrapped digital watermarking scheme for image copyright protection," Computers and Security, vol. 26, no. 4, pp. 319–330, 2007.
[2] J.-M. Shieh, D.-C. Lou, and M.-C. Chang, "A semi-blind digital watermarking scheme based on singular value decomposition," Computer Standards and Interfaces, vol. 28, no. 4, pp. 428–440, 2006.
[3] C. A. Gonzales, L. Allman, T. McCarthy, P. Wendt, and A. N. Akansu, "DCT coding for motion video storage using adaptive arithmetic coding," Signal Processing, vol. 2, no. 2, pp. 145–154, 1990.
[4] K. Veeraswamy and S. Srinivas Kumar, "Adaptive AC-coefficient prediction for image compression and blind watermarking," Journal of Multimedia, vol. 3, no. 1, pp. 16–22, 2008.
[5] V. Martin, M. Chabert, and B. Lacaze, "An interpolation-based watermarking scheme," Signal Processing, vol. 88, no. 3, pp. 539–557, 2008.
[6] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Systems Journal, vol. 35, no. 3-4, pp. 313–335, 1996.
[7] R. V. Schyndel, A. Z. Trikel, and C. F. Osborn, "A digital watermark," in Proceedings of the 1st IEEE International Conference on Image Processing, pp. 86–90, 1994.
[8] V. Saxena and J. P. Gupta, "A novel watermarking scheme for JPEG images," WSEAS Transactions on Signal Processing, vol. 5, no. 2, pp. 74–84, 2009.
[9] X. Kang, W. Zeng, and J. Huang, "A multi-band wavelet watermarking scheme," International Journal of Network Security, vol. 6, no. 2, pp. 121–126, 2008.
[10] X.-Y. Wang, L.-M. Hou, and J. Wu, "A feature-based robust digital image watermarking against geometric attacks," Image and Vision Computing, vol. 26, no. 7, pp. 980–989, 2008.
[11] Z.-J. Lee, S.-W. Lin, S.-F. Su, and C.-Y. Lin, "A hybrid watermarking technique applied to digital images," Applied Soft Computing, vol. 8, no. 1, pp. 798–808, 2008.
[12] A. K. Parthasarathy and S. Kak, "An improved method of content based image watermarking," IEEE Transactions on Broadcasting, vol. 53, no. 2, pp. 468–479, 2007.
[13] G. C. Langelaar and R. L. Lagendijk, "Optimal differential energy watermarking of DCT encoded images and video," IEEE Transactions on Image Processing, vol. 10, no. 1, pp. 148–158, 2001.
[14] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multimedia," IEEE Transactions on Image Processing, vol. 6, no. 12, pp. 1673–1687, 1997.
[15] C. S. Lu, H. Y. M. Liao, and C. J. Sze, "Cocktail watermarking on images," in Proceedings of the 3rd International Workshop on Information Hiding, pp. 333–347, 1999.
[16] M. Barni, F. Bartolini, V. Cappellini, and A. Piva, "A DCT-domain system for robust image watermarking," Signal Processing, vol. 66, no. 3, pp. 357–372, 1999.
[17] C.-T. Hsu and J.-L. Wu, "Hidden digital watermarks in images," IEEE Transactions on Image Processing, vol. 8, no. 1, pp. 58–68, 1999.
[18] J. Huang, Q. S. Yun, and W. Cheng, "Image watermarking in DCT: an embedding strategy and algorithm," Acta Electronica Sinica, vol. 28, no. 4, pp. 57–60, 2000.
[19] L. Xie and G. R. Arce, "Joint wavelet compression and authentication watermarking," in Proceedings of the 1998 International Conference on Image Processing, pp. 427–431, October 1998.
[20] Y. Zhao, P. Campisi, and D. Kundur, "Dual domain watermarking for authentication and compression of cultural heritage images," IEEE Transactions on Image Processing, vol. 13, no. 3, pp. 430–448, 2004.

[21] C.-T. Hsu and J.-L. Wu, “Hidden digital watermarks in


images,” IEEE Transactions on Image Processing, vol. 8, no. 1,
pp. 58–68, 1999.
[22] C.-T. Hsu and J.-L. Wu, “Multiresolution watermarking for
digital images,” IEEE Transactions on Circuits and Systems II,
vol. 45, no. 8, pp. 1097–1101, 1998.
[23] A. Al-Haj, “Combined DWT-DCT digital image watermark-
ing,” Journal of Computer Science, vol. 3, no. 9, pp. 740–746,
2007.
[24] S. P. Mohanty and B. K. Bhargava, “Invisible watermarking
based on creation and robust insertion-extraction of image
adaptive watermarks,” ACM Transactions on Multimedia Com-
puting, Communications and Applications, vol. 5, no. 2, article
12, 2008.
[25] R. C. Reininger and J. D. Gibson, “Distributions of the two-
dimensional DCT coefficients for images,” IEEE Transactions
on Communications, vol. 31, no. 6, pp. 835–839, 1983.
[26] S. P. Mohanty, K. R. Ramakrishnan, and M. S. Kankanhalli,
“A dual watermarking technique for images,” in Proceedings of
the 7th ACM International Multimedia Conference, pp. 49–51,
1999.
[27] D. Asatryan and N. Asatryan, “Combined spatial and fre-
quency domain watermarking,” in Proceedings of the 7th
International Conference on Computer Science and Information
Technologies, pp. 323–326, 2009.
[28] N. Sakr, J. Zhao, and V. Z. Groza, “Adaptive digital image
watermaking based on predictive embedding and a Dynamic
Fuzzy Inference System model,” International Journal of
Advanced Media and Communication, vol. 1, no. 3, pp. 237–
264, 2007.
[29] A. G. Borş and I. Pitas, “Image watermarking using block site
selection and DCT domain constraints,” Optics Express, vol. 3,
no. 12, pp. 512–523, 1998.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 104835, 12 pages
doi:10.1155/2010/104835

Research Article
An Efficient Prediction-and-Shifting Embedding Technique for
High Quality Reversible Data Hiding

Wien Hong
Department of Information Management, Yu Da University, Miaoli, 361, Taiwan

Correspondence should be addressed to Wien Hong, [email protected]

Received 27 November 2009; Revised 9 March 2010; Accepted 31 March 2010

Academic Editor: Jin-Hua She

Copyright © 2010 Wien Hong. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The embedding capacity of a histogram-based reversible data hiding technique is primarily determined by the peak height of the histogram. Recently, some studies have tried to embed data in the histogram of prediction errors by modifying the error values, and have achieved better embedding efficiency. However, these methods offer no selective embedment mechanism to exclude positions where the modification made during embedding contributes no capacity but merely degrades the image quality. In this paper, a novel coding method for reversible data hiding is presented. A two-stage prediction algorithm that fully exploits the pixel correlations is employed to create more embeddable spaces, and a selective embedment mechanism is used to enhance the image quality. According to the experimental results, the proposed method achieves the highest payload while maintaining the lowest distortion for most standard test images, compared with other existing histogram-shifting-based reversible data hiding techniques.

1. Introduction

Data hiding is a technique that embeds data into cover media by slightly modifying their content [1]; it has been used in many applications, such as tamper detection [2], copyright protection [3], and fingerprinting [4]. When data are embedded into cover media, the content of the media is inevitably modified, and distortion is thus introduced. The distortion caused by data embedding is termed embedding distortion [5]. Although the embedding distortion in many applications is small, the distorted cover media cannot be recovered to their original state [6, 7]. However, some applications, such as medical or military usages, allow no permanent embedding distortion in order to preserve content fidelity. This demand has highlighted the need for reversible data hiding and has drawn much attention in recent years [8-10].

Reversible data hiding is a technique that allows extraction of the embedded data from the stego media and exactly restores the marked media to their original states [11]. Many researchers use digital images as the cover media because they are often transmitted over the Internet, are easy to access, and arouse little suspicion. An image that is used to embed data is called a cover image, and an image with data embedded is called a stego image [12]. The earliest reversible data hiding technique reported in the literature is Barton's work [13]. Afterwards, a number of reversible data hiding techniques have been proposed to fulfill the growing demands in this field. In 2003, Tian [14] proposed a novel reversible data hiding method with high payload. In his method, the difference value between paired pixels is expanded, and a bit can be embedded into the LSB of the expanded difference; n bits can be embedded into 2n pixels. Alattar [15] extended Tian's work by increasing the payload without introducing noticeable distortion. In Alattar's method, n bits can be embedded into n + 1 pixels. Tian's and Alattar's embedding techniques can be classified as expansion-embedding techniques. For an expansion-embedding technique, difference values between pixels have to be expanded to conceal data; therefore, the embedding distortion is relatively large. Besides, the selection of embedding positions to avoid the overflow or underflow problem incurs an overhead cost, which may significantly reduce the payload.
In 2006, Ni et al. [16] proposed a reversible data hiding method that achieves very high image quality. They select pairs of peak and zero points of an image histogram and shift the histogram bins to leave embeddable spaces for data embedding. Ni et al.'s embedding method can be classified as a shifting-embedding technique. In their work, the maximum payload is limited by the peak height of the cover image histogram; therefore, the payload is smaller compared with expansion-embedding-based reversible data hiding techniques. In 2007, Thodi and Rodriguez combined the expansion-embedding and shifting-embedding techniques and proposed a reversible data hiding method with higher payload and lower distortion [5]. In their work, the prediction errors are expanded, and data are embedded into the LSBs of the expanded prediction errors. A better performance was achieved by Thodi et al.'s method than by Tian's and Ni et al.'s methods.

Recently, some researchers [17-20] adopted the concept of the shifting-embedding technique and embedded data into the prediction error histogram. Since the peak height of the prediction error histogram is usually higher than that of the image histogram for most natural images, a higher payload can be achieved. In these methods, the peak value of the prediction error histogram is calculated, and the histogram bin to the left or to the right of the peak value is shifted to vacate a histogram bin just next to the peak. Data are then embedded by modifying the prediction errors with the peak value. We classify these newly proposed methods as prediction-and-shifting embedding (PSE) methods because their data embedding mainly relies on prediction and histogram-shifting techniques. For a PSE method, the performance of the predictor plays a very important role. The peak height of the prediction error histogram should be as high as possible, since the peak height represents the number of bits that can be embedded in a prediction error histogram. One goal of a PSE method is therefore to construct a higher prediction error histogram to increase the payload. A higher prediction error histogram often results from an accurate prediction, which can be obtained by employing a well-designed predictor that fully exploits the correlation among neighboring pixels.

The PSE methods proposed so far have better embedding efficiency than the traditional histogram-shifting embedding technique; however, the existing methods offer no selective embedment mechanism (SEM) to exclude those pixels that contribute no capacity but merely cause image distortion. In this paper, a novel reversible data hiding method based on PSE is proposed. A sophisticated SEM is introduced to exclude pixels with larger prediction errors from being modified, so as to enhance the image quality. In comparison with prior reversible data hiding methods, the proposed method achieves the best performance in terms of payload and PSNR.

The rest of this paper is organized as follows. In Section 2, three PSE-based reversible data hiding methods are described. The proposed method is presented in Section 3, followed by the experimental results and discussions in Section 4. Conclusions are addressed in Section 5.

[Figure 1: Context for predicting the pixel x — the pixel x and its causal neighbors a (left), b (above), and c (above-left), together with the MED prediction rule: p = min(a, b) if c >= max(a, b); p = max(a, b) if c <= min(a, b); p = a + b - c otherwise.]

2. Related Works

In this section, we briefly describe the central idea of PSE and review three PSE-based reversible data hiding methods proposed in 2009.

2.1. The Basic Concept of PSE. The basic idea of the PSE method is to exploit the correlation of neighboring pixels inherent in images and predict the pixel values to obtain the prediction errors. To embed data, the prediction errors are scanned sequentially. If the scanned prediction error e is equal to some predetermined value (i.e., a peak value), a bit s can be embedded by modifying e according to the value of s. Otherwise, e is modified to a value that will not cause ambiguity when extracting data. The stego image is then constructed by transforming the modified prediction errors back to their spatial values.

To extract the embedded data, the stego image is sequentially scanned, and the modified prediction errors are obtained. If a scanned modified prediction error e' is equal to some predetermined value (i.e., a peak value), a bit 0 or 1 is extracted. To recover the original image, e' is modified back to its original value e, and e is then transformed back to its spatial value. Since the PSE method only modifies the prediction error values slightly (usually by plus or minus one unit), the stego image quality is often higher than that of expansion-embedding-based techniques. Besides, the payload of a PSE method is larger than that of histogram-shifting-based techniques because the peak of the prediction error histogram is often higher than that of the original image histogram.

2.2. Hong et al.'s Method. Hong et al. [18] employed the median edge detection (MED) predictor used in JPEG-LS to sequentially predict pixel values and embed data. In their work, a pixel x is predicted from the previously visited pixels a, b, and c, as shown in Figure 1. The prediction error is calculated by e = x - p, where p is the prediction value of x.

Hong et al. recognized that the prediction error histogram is sharply distributed and centered at zero. Therefore, they choose e = 0 and e = -1 as the peak values and employ the PSE technique to embed data. However, because the current pixel x is predicted from previously modified pixels, the prediction may become less accurate. A less accurate prediction decreases the peak height of the prediction error histogram, leading to a decrease in payload.
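For illustration only, the MED rule shown in Figure 1 can be written as the following short sketch (this code is ours, not part of the original paper; the function and variable names are assumed):

import numpy as np

def med_predict(a, b, c):
    # MED (median edge detection) predictor used in JPEG-LS.
    # a: left neighbor, b: upper neighbor, c: upper-left neighbor.
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c

# Example: in a smooth neighborhood the prediction is close to the true pixel,
# so the error e = x - p concentrates around zero (the peak of the histogram).
x, a, b, c = 121, 120, 122, 121
p = med_predict(a, b, c)
e = x - p  # here p = a + b - c = 121, so e = 0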
[Figure 2: The layout of the basic pixel (block center) and the nonbasic pixels for a 3 x 3 block.]

2.3. Tsai et al.'s Method. Tsai et al. [19] partitioned the cover image into blocks, and the center pixel of each block is selected as the basic pixel of that block. The basic pixels serve as reference pixels and are not modified during the embedding process (i.e., the cover image and the stego image share the same set of basic pixels). The layout of the basic pixel and the nonbasic pixels for a 3 x 3 block is shown in Figure 2.

To embed data, the value of the basic pixel is subtracted from the other nonbasic pixels in the same block. The resulting difference values are the prediction errors of those nonbasic pixels. All blocks are processed in the same manner, and all the prediction errors can be obtained. After that, the PSE technique is employed to embed data. In Tsai et al.'s method, a larger payload might be achieved by decreasing the number of basic pixels or, equivalently, increasing the block size. However, the predictor they used is simply the nearest predictor, in that the nonbasic pixel value is predicted by the value of the nearest basic pixel. A larger block may result in a less accurate prediction, leading to a decrease in payload. In Tsai et al.'s method, a 3 x 3 block is suggested to achieve the best results.

2.4. Kim et al.'s Method. Kim et al. [20] exploited the spatial correlation between subsampled images and proposed a reversible data hiding method with high payload and low distortion. They subsample the cover image into k subsampled images, and the subsampled image that maximizes the spatial correlation among the subsampled images is selected as the reference subimage. Figure 3 shows an example of a cover image with four subsampled images.

To obtain the prediction errors, the pixel values in the reference subimage are subtracted from the other subsampled images. The resulting difference values are the prediction errors. The PSE technique is then applied for data concealment. Kim et al. also provided an adjustable embedding level mechanism for larger payload at the cost of image distortion. In Kim et al.'s method, the value in the reference image is simply used as the prediction value of the other subsampled images, which is equivalent to adopting the nearest predictor and using the value in the reference image to predict the pixel values at the corresponding positions of the other subsampled images. The use of the nearest predictor may result in a less accurate prediction [21] and subsequently reduce the payload.

3. Proposed Method

In this section, we present a PSE-based reversible data hiding method that achieves high payload with low distortion. In a PSE method, the embedding capacity is determined by the peak height, that is, the count of the most frequently occurring value in the prediction errors. We term those errors "embeddable errors" because one bit can be embedded within one such error. Prediction errors other than embeddable errors are termed "nonembeddable errors." During embedding, nonembeddable errors have to be shifted or remain unchanged according to the design of the embedding algorithm. Figure 4(a) shows the relationship between embeddable errors and nonembeddable errors, and Figure 4(b) illustrates a histogram of a one-side shifted embedding algorithm in which the different types of prediction errors are marked accordingly.

Of all prediction errors, only the embeddable errors contribute to the payload. Those nonembeddable errors that have to be shifted during embedding contribute no payload but cause distortion. To enhance the embedding efficiency, it is desirable not only to increase the number of embeddable errors but also to decrease the number of those must-be-shifted nonembeddable errors.

It is known that the prediction errors of pixels located in a complex region are often larger than those of pixels located in a smooth region, and large prediction errors are likely to be nonembeddable, most of which have to be shifted. For example, Figure 5 shows an error image of Lena obtained with an MED predictor. Note that different predictors result in roughly the same error image, since edges in the error image are mostly preserved. The vertical bar indicates the magnitude of the absolute prediction errors. Note that error values in complex regions (e.g., the hair and edges) are larger than those in smooth regions (e.g., the shoulder). If a large prediction error can be detected before embedding, this prediction error can be excluded from the embedding process and thus the distortion can be reduced. A predictor used in a PSE method should therefore have the capability to increase the height of the error histogram while reducing the number of nonembeddable errors that have to be shifted during embedding.
[Figure 3: An example of image subsampling: (a) original image and (b) four subsampled images.]

[Figure 4: An illustration of embeddable and nonembeddable errors: (a) the relationship between embeddable errors, nonembeddable errors that remain unchanged during embedding, and nonembeddable errors that have to be shifted during embedding; (b) the locations of these error types in a prediction error histogram, with the shifting direction indicated.]

3.1. The Selective Embedment Mechanism. In this subsection, a selective embedment mechanism (SEM) is introduced. SEM employs a local smoothness estimator to determine whether the scanned pixel will be selected to go through the embedding process or simply skipped. For each pixel I_{i,j}, a local smoothness estimator f_s(.) is employed to estimate the smoothness of the region in which I_{i,j} is located. The estimator f_s(I_{i,j}) is computed from the deviations of the four neighboring pixels of the current pixel I_{i,j}:

f_s(I_{i,j}) = sqrt( (I_{i,j-1} - mu)^2 + (I_{i-1,j} - mu)^2 + (I_{i,j+1} - mu)^2 + (I_{i+1,j} - mu)^2 ),   (1)

where mu is the mean of I_{i,j-1}, I_{i-1,j}, I_{i,j+1}, and I_{i+1,j}.

If f_s(I_{i,j}) < T, where T is a predefined threshold, then I_{i,j} is estimated to lie in a smooth region. If f_s(I_{i,j}) >= T, I_{i,j} is estimated to lie in a complex region. The prediction error of a pixel located in a complex region is likely to be large, and a larger prediction error is likely to be shifted while contributing no payload; therefore, we simply skip I_{i,j} to reduce unnecessary pixel value modification. The SEM based on the threshold T is illustrated in Figure 6.
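As a minimal sketch of how the smoothness estimate of (1) and the threshold test could be evaluated for one pixel (our own illustration; the function names are assumed and not taken from the paper):

import numpy as np

def smoothness(I, i, j):
    # Four-neighborhood of pixel (i, j): left, up, right, down.
    nbrs = np.array([I[i, j - 1], I[i - 1, j], I[i, j + 1], I[i + 1, j]], dtype=float)
    mu = nbrs.mean()
    # Equation (1): square root of the summed squared deviations from the mean.
    return np.sqrt(np.sum((nbrs - mu) ** 2))

def selected_by_sem(I, i, j, T):
    # SEM: only pixels estimated to lie in a smooth region join the embedding process.
    return smoothness(I, i, j) < T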
[Figure 5: The image of prediction errors (absolute MED prediction errors of Lena; the vertical bar indicates the error magnitude).]

[Figure 6: The selective embedment mechanism: pixels with f_s(I_{i,j}) below the threshold T (smooth region) are selected for data embedding, while pixels with f_s(I_{i,j}) above T (complex region) are skipped.]

[Figure 7: An illustration of the prediction algorithm on a 9 x 9 neighborhood (rows i-4 to i+4, columns j-4 to j+4): the marked positions are known pixels, 16 of them are used to predict the value of the center pixel, and all other unmarked pixels will be predicted in the second prediction pass.]

3.2. The Prediction Algorithm. The prediction algorithm of the proposed method is inspired by the error coding architecture used in the multilevel progressive compression (MLP) method [22], where a linear weighted predictor P16(.) is employed to obtain prediction errors with a small variance. To begin with, pixels in the cover image are divided into two disjoint sets, namely, "Black" and "White" pixels, laid out in the same way as a checkerboard. Pixel I_{i,j} is "Black" if i + j is odd and is "White" otherwise. The embedding process involves two prediction passes. In the first pass, the values of all "Black" pixels I_{i,j} are predicted using their 16 known "White" neighbors. During embedding, the values of these "Black" pixels are modified. In the second pass, the "White" pixels are predicted using the 16 known, modified "Black" neighbors. Figure 7 shows the context of the prediction neighborhood of the pixel to be predicted.

The prediction of I_{i,j} by the predictor P16(.) is given by

P16(I_{i,j}) = 0.3164 (I_{i,j-1} + I_{i-1,j} + I_{i,j+1} + I_{i+1,j})
            - 0.0351 (I_{i-1,j-2} + I_{i-2,j-1} + I_{i-2,j+1} + I_{i-1,j+2} + I_{i+1,j+2} + I_{i+2,j+1} + I_{i+2,j-1} + I_{i+1,j-2})
            + 0.0039 (I_{i,j-3} + I_{i-3,j} + I_{i,j+3} + I_{i+3,j}).   (2)

The weights in P16(.) are calculated by bicubic polynomial interpolation and are normalized so that their sum is one [22].

For pixels located near the border of the cover image that do not have sufficient neighbors for prediction, we may simply skip these pixels or slightly modify the prediction rules so that these pixels can still be handled [22]. Because only a small portion of pixels lack sufficient neighbors, this has little practical significance.

3.3. The Embedding Algorithm. To embed data, the local smoothness estimator is employed to exclude pixels located in complex regions from the embedding process. The pixels located in smooth regions are then predicted, and data embedding is done by modifying the prediction errors. The detailed embedding procedure is listed in the following.

Input. A cover image I of size M x M and secret data S.

Output. A stego image I', a minimum threshold T_min, the end of embedding position E_P, two pairs of peaks (p_0^+, p_0^-), (p_1^+, p_1^-), and a location map L_M.

Step 1. Set k = 0, T = 0.

Step 2. Scan the pixels in I in raster scan order. For pixels I_{i,j} satisfying mod(i + j, 2) = k, where mod(x, 2) returns 1 if x is odd and 0 otherwise, the estimator f_s(.) is employed to estimate the local smoothness of those pixels. If f_s(I_{i,j}) <= T, pixel I_{i,j} is classified as lying within a smooth region; the predictor P16(.) is then employed to calculate the prediction value P_{i,j} of I_{i,j}, and the prediction error is calculated by e_{i,j} = I_{i,j} - P_{i,j}. If f_s(I_{i,j}) > T, the scanned pixel is classified as lying within a complex region; the prediction error is not calculated, that is, the scanned pixel does not join the embedding process.

Step 3. After all the prediction errors are obtained, the histogram of the prediction errors is calculated, and a pair of peaks (p_k^+, p_k^-), where p_k^+ > p_k^-, of the histogram is determined.
Step 4. Scan each prediction error e_{i,j} obtained in Step 2. If the scanned error e_{i,j} is equal to p_k^+ or p_k^-, the scanned error is embeddable, and a bit s can be embedded by using the rule

e'_{i,j} = e_{i,j} + 1   if e_{i,j} = p_k^+ and s = 1,
e'_{i,j} = e_{i,j} - 1   if e_{i,j} = p_k^- and s = 1,
e'_{i,j} = e_{i,j}       if (e_{i,j} = p_k^+ or e_{i,j} = p_k^-) and s = 0.   (3)

Otherwise, the scanned error e_{i,j} is nonembeddable and has to be modified to e'_{i,j} by using the rule

e'_{i,j} = e_{i,j} + 1   if e_{i,j} > p_k^+,
e'_{i,j} = e_{i,j} - 1   if e_{i,j} < p_k^-,
e'_{i,j} = e_{i,j}       otherwise.   (4)

The pixel value I_{i,j} in the cover image is then modified to I'_{i,j} by setting I'_{i,j} = P_{i,j} + e'_{i,j}. If I'_{i,j} > 255 or I'_{i,j} < 0, an overflow or underflow problem occurs. If this happens, record the position information (i, j) in the location map L_M, mark s as unextracted if s had already been taken from S, and set e'_{i,j} = e_{i,j}.
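A compact sketch of the embedding rules (3) and (4) applied to a single prediction error is given below (our own illustration with assumed names; the full procedure additionally handles the location map and the overflow/underflow bookkeeping described above):

def embed_error(e, peak_pos, peak_neg, bit=None):
    # Returns (modified_error, bit_consumed). bit_consumed is True only when
    # the error was embeddable and a secret bit was actually hidden in it.
    if e == peak_pos or e == peak_neg:            # embeddable error, rule (3)
        if bit == 1:
            return (e + 1 if e == peak_pos else e - 1), True
        return e, True                            # bit == 0: leave unchanged
    if e > peak_pos:                              # non-embeddable, rule (4): shift right
        return e + 1, False
    if e < peak_neg:                              # non-embeddable, rule (4): shift left
        return e - 1, False
    return e, False                               # errors strictly between the peaks stay put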
Step 5. Set k = 1 and repeat Steps 2-4 to obtain another pair of peaks (p_k^+, p_k^-) and perform data embedding.

Step 6. Repeat Steps 1-5 and perform a binary search to find a minimum threshold T_min such that |p_0^+| + |p_0^-| + |p_1^+| + |p_1^-| is just larger than the length of S, where |x| denotes the number of prediction errors equal to x, that is, the number of bits that can be embedded at peak x. The end of embedding position E_P is recorded for the purpose of data extraction.

Step 7. Output the stego image I'; the parameters T_min, p_0^+, p_0^-, p_1^+, p_1^-, E_P, and L_M serve as a key K for decoding. The key K is transmitted over a secret channel. The receiver with the correct key K can then extract the embedded message and restore the stego image to the original image.

We use eight bits to record each of T_min, p_0^+, p_0^-, p_1^+, and p_1^-, log2(M x M) bits to record E_P, and N x log2(M x M) bits to record L_M if there are N overflow and underflow pixels. The proposed embedding algorithm modifies the pixel values by plus or minus one grayscale unit at most, and pixel values of 0 or 255 occur rarely in most natural images. Therefore, for a 512 x 512 cover image with no pixel value at 0 or 255, that is, with no overflow or underflow, the key size |K| is 5 x 8 + log2(512 x 512) = 58 bits.

3.4. The Extraction and Recovery Procedures. Once the receiver receives the stego image I' and the key K, the embedded secret data can be extracted and the original image can be recovered by using the procedure listed below.

Input. The stego image I' and the key K.

Output. The recovered cover image I and the secret data S.

Step 1. Set k = 1.

Step 2. Scan the pixels I'_{i,j} whose positions satisfy mod(i + j, 2) = k, using the same order as in the embedding phase. If the position of the scanned pixel is recorded in L_M, this pixel is skipped and the scan proceeds to the next one. If f_s(I'_{i,j}) <= T_min, the prediction value P_{i,j} of I'_{i,j} is calculated by using the predictor P16(.), and the modified prediction error is obtained by e'_{i,j} = I'_{i,j} - P_{i,j}.

Step 3. If e'_{i,j} is equal to p_k^+, p_k^-, p_k^+ + 1, or p_k^- - 1, a bit s is extracted by using the rule

s = 0   if e'_{i,j} = p_k^+ or e'_{i,j} = p_k^-,
s = 1   if e'_{i,j} = p_k^+ + 1 or e'_{i,j} = p_k^- - 1.   (5)

The original prediction error e_{i,j} is recovered by using the rule

e_{i,j} = e'_{i,j} - 1   if e'_{i,j} > p_k^+,
e_{i,j} = e'_{i,j} + 1   if e'_{i,j} < p_k^-,
e_{i,j} = e'_{i,j}       otherwise.   (6)

The original pixel value I_{i,j} can then be obtained by calculating I_{i,j} = P_{i,j} + e_{i,j}.

Step 4. Repeat Steps 2-3 until the end of embedding position E_P is met.

Step 5. Set k = 0 and repeat Steps 2-4. Concatenating the extracted bits, the embedded secret data are obtained.
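The corresponding extraction and recovery rules (5) and (6) for one modified error can be sketched as follows (again an illustration with our own names; the peak values are taken from the key K):

def extract_error(e_mod, peak_pos, peak_neg):
    # Inverts embed_error(): returns (original_error, extracted_bit_or_None).
    bit = None
    if e_mod in (peak_pos, peak_neg):             # rule (5): a 0 was embedded here
        bit = 0
    elif e_mod in (peak_pos + 1, peak_neg - 1):   # rule (5): a 1 was embedded here
        bit = 1
    # Rule (6): undo the histogram shift.
    if e_mod > peak_pos:
        e = e_mod - 1
    elif e_mod < peak_neg:
        e = e_mod + 1
    else:
        e = e_mod
    return e, bit

# Round-trip check with peaks (0, -1): for any error e and bit b,
# extract_error(embed_error(e, 0, -1, b)[0], 0, -1) returns the original e
# (and b when e was embeddable), which is exactly the reversibility property.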
4. Experimental Results and Discussions

Several experiments, including tests on 8-bit and 16-bit images as well as tests using steganalysis tools, were carried out to demonstrate the effectiveness of the embedding algorithms proposed in this paper.

4.1. 8-Bit Test Images. Six standard 512 x 512 images, Airplane, Lena, Sailboat, Peppers, Boat, and Baboon, taken from the USC-SIPI database [23], were converted to 8-bit grayscale images by using the following equation if they were originally in RGB color format:

V = 0.2989 R + 0.5870 G + 0.1140 B,   (7)

where V is the converted grayscale value and R, G, and B are the red, green, and blue components of the cover image. The six grayscale test images are shown in Figure 8. The secret data were generated by a pseudorandom number generator (PRNG). The peak signal-to-noise ratio (PSNR) is used to measure the stego image quality:

PSNR = 10 log10( (2^b - 1)^2 / MSE ),   (8)

where b is the bit depth of the cover image and MSE is the mean square error between the cover image and the stego image.

[Figure 8: Six standard grayscale images: (a) Airplane, (b) Lena, (c) Sailboat, (d) Peppers, (e) Boat, (f) Baboon.]

The payload rho is measured in bpp and is calculated by

rho = E_C - O,   (9)

where E_C denotes the capacity of the given cover image and O denotes the side information that is required at the decoding stage. In the proposed method, the size of O is equal to the size of the key |K|.

According to our experiments, no overflow or underflow occurred in the test images Airplane, Lena, Sailboat, and Baboon. The key size for these four images is 58 bits. On the other hand, there are one pixel and eight pixels with overflow or underflow in Peppers and Boat, respectively. The key size is therefore 58 + 1 x log2(512 x 512) = 76 bits for Peppers and 58 + 8 x log2(512 x 512) = 202 bits for Boat.

To compare the proposed method with Hong et al.'s, Tsai et al.'s, and Kim et al.'s methods proposed in 2009, the embedding algorithms of these methods were implemented, and the parameters for each method were chosen so that the best performance could be achieved. For Hong et al.'s method, the two peaks 0 and -1 were selected for embedding, as suggested in their paper. For Tsai et al.'s method, a 3 x 3 block size was employed, since the best result is achieved with blocks of this size. For Kim et al.'s method, four subsampled images were used, and the histogram bins were shifted by one unit at most to ensure that a high-quality stego image can be achieved.

In Hong et al.'s method, the key requires eight bits to record the end of embedding position. In Tsai et al.'s method, the key is composed of two pairs of peak and zero points; each pair requires 16 bits. In Kim et al.'s method, the key is composed of two sampling factors of three bits each and a four-bit embedding level. To ensure that the pixels to be modified are changed by one grayscale unit at most, so as to achieve high stego image quality, these methods also require a location map to prevent overflow or underflow. Supposing that there are N overflow and underflow pixels for a 512 x 512 cover image, the key size of each method is listed in Table 1.

Table 1: Key size comparison.

Method             Key size (bits)
Proposed method    58 + 18N
Hong et al.        8 + 18N
Tsai et al.        32 + 18N
Kim et al.         10 + 18N
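For completeness, the quality and payload metrics defined in (7)-(9) can be computed as in the following small sketch (our own code; expressing the payload per pixel is our normalization choice):

import numpy as np

def to_gray(rgb):
    # Equation (7): weighted sum of the R, G, and B components.
    return 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]

def psnr(cover, stego, bit_depth=8):
    # Equation (8): peak signal-to-noise ratio between cover and stego images.
    mse = np.mean((cover.astype(float) - stego.astype(float)) ** 2)
    return 10 * np.log10(((2 ** bit_depth - 1) ** 2) / mse)

def payload_bpp(embedded_bits, side_info_bits, height, width):
    # Equation (9): pure payload excludes the side information (the key K).
    return (embedded_bits - side_info_bits) / (height * width)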
[Figure 9: A comparison of various test images: payload (bpp) versus PSNR (dB) curves for (a) Airplane, (b) Lena, (c) Sailboat, (d) Peppers, (e) Boat, and (f) Baboon, comparing the proposed method with SEM, the proposed method without SEM, and Hong et al.'s, Tsai et al.'s, and Kim et al.'s methods; the minimum thresholds T_min are marked along the curves of the proposed method.]

65 in smooth regions for data embedding, and these pixels


often contribute more embeddable errors. On the other
60 hand, if no SEM is applied, pixels located in smooth or
complex regions will have equal probability to be selected for
55 data embedding, resulting in significant image degradation.
Figures 9(a)–9(f) also reveal that the proposed method
PSNR (dB)

50 with SEM performs better than those without using this


mechanism at all embedding rates, and the improvements are
45 significant from small to moderate payload.
Note that the proposed method outperforms Hong et
40 al.’s, Tsai et al.’s, and Kim et al.’s methods in terms of payload
and PSNR for all test images, at all embedding rates, even if
35 no SEM is used. For example, for the smooth image, such
as Lena, the proposed method achieved 54 dB at 0.1 bpp
30 whereas their methods only achieved around 51 dB at the
0 0.2 0.4 0.6 0.8 1 1.2 1.4
same bit rate. For the complex image, such as Baboon, the
Payload (bpp)
proposed method also performs much better. For example,
Airplane Peppers the PSNR of the proposed method is 54.5 dB at 0.04 bpp
Lena Boat whereas the PSNR of their method only achieved around
Sailboat Baboon 50 dB under the same bit rate.
Although the proposed method focused on high-quality
Figure 10: Comparison of payload versus PSNR for test images.
stego images, it can be implemented for large payload using
a multi-level embedding strategy, namely, the stego image is
the cover image for the next embedding level. In this case,
For the PSE-based method, the overflow and underflow the proposed method without SEM version is used to speed
problems will only occur at pixels of the cover image valued 0 up the embedding process because the contribution of SEM
or 255. This occurs rarely for most natural images. Therefore, becomes insignificant at high payload. The side information
the key size listed in Table 1 is comparable and is suitable to produced in each embedding level is embedded together
be transmitted over the Internet. with the data bits into the next embedding level; only the
The comparison results of bpp-PSNR relationship for side information produced in the last embedding level is
different methods are shown in Figures 9(a)–9(f). The served as the key K. In the decoding phase, the key for the
minimum thresholds Tmin are marked beside the dots on previous embedding level and data embedded in the last
the curve for the proposed method. Note that the dots that embedding level are extracted with the key K, and the stego
are not marked represent having the same previous Tmin . image is restored to its previous state. This process is repeated
The lower bond of PSNR of these 8-bit test images is 10 × until all the data bits are extracted and the original image is
log10 (2552 /1)  48.13 dB because pixels in the cover images recovered. Figure 10 shows the payload versus PSNR for each
are modified one grayscale unit at most, resulting in that the test images with 10 embedding levels. As shown in the figure,
MSE between the cover image and the stego image is slightly the quality of the stego image is high at low and moderate
smaller than 1. payloads and is still acceptable at high payloads. Since the
In the proposed method, the advantage of using SEM proposed method is based on PSE technique, the payload-
can be seen from Figures 9(a)–9(f). For example, Figure 9(a) distortion performance depends on the characteristics of the
reveals that, when SEM is applied, the gain in PSNR at cover image. A better performance, that is, a higher payload
0.1 bpp is around 2 dB for the smooth image Airplane. At this with a lower distortion, can be achieved when the cover
moment, setting Tmin = 1 is enough to embed all secret data. image contains large amount of smooth regions, for example,
The threshold Tmin is gradually increased as the embedding Airplane and Lena. On the contrary, cover images with large
rates increased. This is because in SEM, a larger threshold amount of complex regions, such as Sailboat and Baboon,
will be selected in order to embed more data; however, the often exhibit lower PSNR under the same payload.
advantage of using SEM to increase the PSNR will become We also tested the proposed method and Hong et al.’s,
less significant as the payload increases. It is clear that the Tsai et al.’s, and Kim et al.’s methods using 23 natural
proposed method with SEM performs the best for all the test photographic images of the Kodak images test set, each sized
images at all embedding rates than those without SEM. This 768 × 512. These images were also used in Hong et al.’s
is because SEM evaluates a minimum threshold to prevent experiments [18]. The results were shown in Table 2 and have
pixels located in complex regions from being selected for significant improvement in payload under roughly the same
data embedding, since these pixels often contribute fewer PSNR. Note that the averaged payload is one-third higher
payloads but cause almost equally distortion. than their methods under roughly the same PSNR.
For the complex image Baboon, the gain in PSNR by It is interesting to note that for Image no. 8, the proposed
using SEM becomes significantly larger when the embedding method provides fewer payloads than that of Hong et al.’s
rates are small, as shown in Figure 9(f). This is because a method. This is because in this particular image, rich vertical
smaller threshold Tmin is used to select those pixels located and horizontal edges provide the MED predictor a better

Table 2: Maximum payload for various test images (payload is measured in bits).

Hong et al. Tsai et al. Kim et al. Proposed


Image Payload PSNR Payload PSNR Payload PSNR Payload PSNR
1 52,743 48.45 37,951 48.96 41,098 48.92 58,692 49.35
2 90,417 48.68 80,074 49.26 83,002 49.21 123,689 48.95
3 114,193 48.83 111,829 49.49 115,578 49.45 162,458 49.27
4 84,229 48.64 69,653 49.18 72,595 49.14 118,052 48.95
5 59,630 48.49 41,324 48.99 44,615 48.95 78,698 49.75
6 73,067 48.57 66,355 49.16 69,220 49.12 93,673 49.23
7 107,538 48.78 94,483 49.36 97,741 49.32 155,529 49.44
8 53,113 48.45 34,399 48.94 37,507 48.9 43,523 50.29
9 82,651 48.63 77,472 49.24 80,802 49.2 106,925 49.05
10 83,997 48.64 74,045 49.21 77,538 49.17 109,604 49.02
11 79,063 48.61 67,484 49.17 70,677 49.13 100,481 49.16
12 96,509 48.71 86,945 49.3 89,766 49.26 127,128 49.09
13 32,631 48.33 24,386 48.87 27,222 48.83 38,553 49.56
14 59,988 48.49 42,461 48.99 46,184 48.96 78,118 49.02
15 98,337 48.72 98,950 49.39 101,497 49.34 133,588 49.28
16 88,007 48.66 76,246 49.23 79,415 49.19 118,570 49.07
17 82,961 48.63 71,970 49.2 75,200 49.16 113,159 48.81
18 49,885 48.43 40,651 48.98 40,698 48.92 67,991 48.99
19 70,976 48.56 59,135 49.11 60,320 49.05 86,501 49.36
20 127,419 48.91 155,474 49.83 156,857 49.77 143,470 50.56
21 69,064 48.54 63,928 49.14 64,509 49.08 87,292 49.22
22 69,066 48.54 52,931 49.06 53,853 49.01 95,492 48.9
23 107,315 48.78 102,726 49.42 104,379 49.37 151,973 49.25
Avg. 79,686 48.61 70,907 49.19 73,490 49.15 104,050 49.29

(a) im1 (b) im2

Figure 11: Two medical images.

opportunity to produce a higher prediction error histogram than the proposed MLP predictors. Nevertheless, the PSNR of the proposed method is 1.35 dB higher than that of Hong et al.'s method.

4.2. 16-Bit Test Images. Since the proposed method particularly focuses on applications where a high-quality stego image is demanded, such as medical imaging, the proposed method was also tested on two 16-bit 512 x 512 medical images obtained from [24, 25], as shown in Figure 11. The medical images im1 and im2 shown in Figure 11 are indeed two very different images in their content. Most regions in im1 are informative parts containing body tissues, whereas in im2 the informative parts are surrounded by almost uniform dark regions.

The proposed method was tested and compared with Tsai et al.'s, Kim et al.'s, Hong et al.'s, and a newly proposed method by Fallahpour et al. [26], which is primarily designed for medical images. Fallahpour et al.'s method
[Figure 12: PSNR versus payload of two medical images, (a) im1 and (b) im2: payload (bpp) versus PSNR (dB) curves for the proposed method with and without SEM, Hong et al.'s, Tsai et al.'s, Kim et al.'s, and Fallahpour et al.'s methods, with T_min values marked along the curves of the proposed method.]

is based on partitioning the image into nonoverlapping blocks, and these blocks are prioritized based on objective or subjective quality. Data are then embedded into each block at the pixel level using a histogram-shifting technique. The results obtained with each method are shown in Figure 12. In our experiments, 16 image blocks were used in Fallahpour et al.'s method. Note that the lower bound of PSNR for 16-bit images is 10 log10((2^16 - 1)^2 / 1), approximately 96.33 dB.

As shown in Figure 12, the proposed method outperforms the other methods under various embedding rates. For example, the maximum payload of im1 with the proposed method is around 0.27 bpp, whereas the others reach less than 0.1 bpp. For image im2, the performance of the proposed method is comparable to that of Fallahpour et al.'s method but outperforms the other methods at all embedding rates.

It is interesting to note that Fallahpour et al.'s method performs better on im2 than on im1. This is because the large uniform background of im2 offers sharply distributed image histograms and provides more embeddable spaces. On the contrary, the histograms of the image blocks of im1 are relatively flat, since the pixel intensities of the image blocks are spread out, causing the embeddable spaces to decrease significantly.

[Figure 13: RS-diagram of the Lena stego image: the percentages R_M, R_{-M}, S_M, and S_{-M} plotted against the embedding ratio (%).]

4.3. Security Verification. Although the proposed method produces imperceptible, high-quality stego images, there exist steganalysis tools that detect whether an image has messages embedded in it. The RS-method proposed by Fridrich et al. [27] is one of the well-known steganalysis tools used to examine the security of a data hiding technique. The RS-method successfully detects LSB embedding by using sensitive dual statistics derived from the regular and singular (RS) grouping in images. To detect an image, the image is partitioned into groups G of n consecutive pixels. The discrimination function, the flipping function, and the mask M are used to classify the groups G into three disjoint categories: regular, singular, and unusable groups. The RS method analyzes the percentage of regular groups R_M, R_{-M} and the percentage of singular groups S_M,

S−M . For most natural images, the relationships RM  R−M [8] M. Kuribayashi, M. Morii, and H. Tanaka, “Reversible
and SM  S−M generally hold. If this relationship is violated, watermark with large capacity based on the prediction
the embedded message is suspicious to be detected; see [27] error expansion,” Transactions on Fundamentals of Electronics,
for more details. The RS-diagram of the proposed method Communications and Computer Sciences, vol. E91-A, no. 7, pp.
for the test image Lena is shown in Figure 13. 1780–1790, 2008.
[9] D. Coltuc and J.-M. Chassery, “Very fast watermarking by
As can be seen in Figure 13, the relationships RM 
reversible contrast mapping,” IEEE Signal Processing Letters,
R−M and SM  S−M hold for various embedding ratios.
vol. 14, no. 4, pp. 255–258, 2007.
According to our experiments, other test images have similar [10] S. Han, M. Fujiyoshi, and H. Kiya, “An efficient reversible
RS-diagrams, indicating that the proposed method is secure image authentication method,” Transactions on Fundamentals
from the RS-diagram steganalysis. of Electronics, Communications and Computer Sciences, vol.
E91-A, pp. 1907–1914, 2008.
5. Conclusions [11] M. U. Celik, G. Sharma, A. M. Tekalp, and E. Saber, “Lossless
generalized-LSB data embedding,” IEEE Transactions on Image
In this paper, we have presented a reversible data hiding Processing, vol. 14, no. 2, pp. 253–266, 2005.
scheme based on prediction-and-shifting embedding tech- [12] J. Fridrich and D. Soukal, “Matrix embedding for large
payloads,” IEEE Transactions on Information Forensics and
nique and achieved high payload and high image quality.
Security, vol. 1, no. 3, pp. 390–395, 2006.
The proposed method employs the SEM for determining
[13] J. M. Barton, “Method and apparatus for embedding authen-
the best threshold to exclude pixels located in complex tication information within digital data,” US patent 5 646 997,
regions to join the embedding process, so that the number of July, 1997.
modified pixels can be greatly reduced. When large payload [14] J. Tian, “Reversible data embedding using a difference expan-
is embedded, multi-level embedding technique is performed. sion,” IEEE Transactions on Circuits and Systems for Video
The proposed method has the following advantages: (1) Technology, vol. 13, no. 8, pp. 890–896, 2003.
simple and effective, (2) applicable to variety of images such [15] A. M. Alattar, “Reversible watermark using the difference
as photographical or medical images, and (3) adjustable expansion of a generalized integer transform,” IEEE Transac-
payload according to the requirement of applications. Test tions on Image Processing, vol. 13, no. 8, pp. 1147–1156, 2004.
results showed that, for a variety of test images, the proposed [16] Z. Ni, Y.-Q. Shi, N. Ansari, and W. Su, “Reversible data hiding,”
method outperforms prior works, such as Hong et al.’s, Tsai IEEE Transactions on Circuits and Systems for Video Technology,
vol. 16, no. 3, pp. 354–361, 2006.
et al.’s, Kim et al.’s, Thodi et al.’s, and Tian’s methods in terms
[17] M. Chen, Z. Chen, X. Zeng, and Z. Xiong, “Reversible
of payload and PSNR. data hiding using additive prediction-error expansion,” in
Proceedings of the 11th ACM Workshop on Multimedia and
Acknowledgment security, pp. 19–24, 2009.
[18] W. Hong, T. S. Chen, and C. W. Shiu, “Reversible data hiding
This research was supported by the National Science Council for high quality images using modification of prediction
of the Republic of China under Grant NSC98-2622-E-412- errors,” The Journal of Systems and Software, vol. 82, no. 11,
003-CC3. pp. 1833–1842, 2009.
[19] P. Tsai, Y.-C. Hu, and H.-L. Yeh, “Reversible image hiding
References scheme using predictive coding and histogram shifting,” Signal
Processing, vol. 89, no. 6, pp. 1129–1143, 2009.
[1] N. Provos and P. Honeyman, “Hide and seek: an introduction [20] K. Kim, M. Lee, H.-Y. Lee, and H.-K. Lee, “Reversible data
to steganography,” IEEE Security and Privacy, vol. 1, no. 3, pp. hiding exploiting spatial correlation between sub-sampled
32–44, 2003. images,” Pattern Recognition, vol. 42, no. 11, pp. 3083–3096,
[2] P. L. Lin, C.-K. Hsieh, and P.-W. Huang, “A hierarchical 2009.
digital watermarking method for image tamper detection and [21] R. C. Gonzalez and R. E. Woods, Digital Image Processing,
recovery,” Pattern Recognition, vol. 38, no. 12, pp. 2519–2529, Prentice-Hall, Upper Saddle River, NJ, USA, 2nd edition, 2002.
2005. [22] D. Salomon, Data Compression: The Complete Reference,
[3] S. Lee, C. D. Yoo, and T. Kalker, “Reversible image water- Springer, Berlin, Germany, 2nd edition, 2000.
marking based on integer-to-integer wavelet transform,” IEEE [23] USC-SIPI Image Database, http://sipi.usc.edu/database.
Transactions on Information Forensics and Security, vol. 2, no. [24] Medical Image Samples, http://www.barre.nom.fr/medical/
3, pp. 321–330, 2007. samples.
[4] A. K. Jain and U. Uludag, “Hiding biometric data,” IEEE [25] DICOM Sample Image Sets, http://pubimage.hcuge.ch:8080.
Transactions on Pattern Analysis and Machine Intelligence, vol. [26] M. Fallahpour, D. Megias, and M. Ghanbari, “High capacity,
25, no. 11, pp. 1494–1498, 2003. reversible data hiding in medical images,” in Proceedings of the
[5] D. M. Thodi and J. J. Rodrı́guez, “Expansion embedding IEEE International Conference on Image Processing, pp. 4241–
techniques for reversible watermarking,” IEEE Transactions on 4244, 2009.
Image Processing, vol. 16, no. 3, pp. 721–730, 2007. [27] J. Fridrich, M. Goljan, and R. Du, “Reliable Detection of LSB
[6] J. Mielikainen, “LSB matching revisited,” IEEE Signal Process- Steganography in Color and Grayscale Images,” in Proceedings
ing Letters, vol. 13, no. 5, pp. 285–287, 2006. of the International Workshop on Multimedia and Security, pp.
[7] Z. H. Wang, T. D. Kieu, C. C. Chang, and M. C. Li, “A 27–30, 2001.
novel information concealing method based on exploiting
modification direction,” Journal of Information Hiding and
Multimedia Signal Processing, vol. 1, no. 1, pp. 1–9, 2010.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 876946, 6 pages
doi:10.1155/2010/876946

Research Article
Improved Adaptive LSB Steganography Based on
Chaos and Genetic Algorithm

Lifang Yu, Yao Zhao, Rongrong Ni (EURASIP Member), and Ting Li


Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China

Correspondence should be addressed to Yao Zhao, [email protected]

Received 17 November 2009; Accepted 19 May 2010

Academic Editor: Yingzi Du

Copyright © 2010 Lifang Yu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose a novel steganographic method in JPEG images with high performance. Firstly, we propose improved adaptive LSB
steganography, which can achieve high capacity while preserving the first-order statistics. Secondly, in order to minimize visual
degradation of the stego image, we shuffle bits-order of the message based on chaos whose parameters are selected by the genetic
algorithm. Shuffling message’s bits-order provides us with a new way to improve the performance of steganography. Experimental
results show that our method outperforms classical steganographic methods in image quality, while preserving characteristics of
histogram and providing high capacity.

1. Introduction attack [6, 7] since it is based on simply flipping LSBs. F5


employs matrix encoding to decrease the change for one
Steganography is the science of hiding messages in a medium payload, but its shrinkage at 0s makes it detectable. OutGuess
called carrier or cover object in such a way that existence of embeds message bits into a part of coefficients and uses
the message is concealed. The cover object could be a digital the other part to compensate artifacts on the histogram, so
still image, an audio file, or a video file. The hidden message it preserves characteristics of histogram. But its embedding
called payload could be a plain text, an audio file, a video file, efficiency and capacity are low because of compensation.
or an image [1, 2]. Our contributions are in two folds. First, we present
Steganographic methods can be classified into spatial improved adaptive LSB steganography that can embed mes-
domain embedding and frequency domain embedding. Least sages adaptively and thus can satisfy various requirements
Significant Bit (LSB) replacing is the most widely used (high capacity, high security, high image quality, etc.).
steganographic method in spatial domain, which replaces Second, our method minimizes degradation of the stego
the cover image’s LSBs with message bits directly. Although image through finding the best mapping between the secret
it has several disadvantages such as vulnerable to attacks, message and the cover image based on chaos and the genetic
LSB steganography is a popular method because of its low algorithm (GA).
computational complexity and high embedding capacity. The rest of the paper is organized as follows. Section 2
In frequency domain, popular steganographic methods introduces general principles of chaos and GA. Section 3
mostly base on Discrete Cosine Transformation (DCT). After illustrates our proposed method in detail, which includes
performing DCT on each 8 × 8 block and quantizing the DCT the improved adaptive LSB steganography, a method to
coefficients, message bits are embedded into the quantized shuffle message bits based on the logistic map and GA,
DCT (qDCT) coefficients. Recently, many steganographic the embedding procedure and the extraction procedure.
schemes using LSB and its improved versions on qDCT Experimental results are shown in Section 4, where we
have been invented, which offer reasonably high embedding demonstrate that our method has good stego image qual-
capacity while attempting to preserve the marginal statistics ity, high security-preserving characteristics of histogram,
of the cover image, such as J-Steg [3], F5 [4], and OutGuess and high capacity. Finally, conclusions are addressed in
[5]. It is well known that J-Steg is detectable using the χ 2 Section 5.

2. Preliminary c0 c1 ··· cloc cloc+1 ··· c62 c63

2.1. Chaos and Its Application in Information Hiding. The


chaos phenomenon is a deterministic and analogously Embed l1 bits to Embed l2 bits to
stochastic process appearing in a nonlinear dynamical system each valid each valid
[8, 9]. Because of its extreme sensitivity to initial conditions coefficient coefficient
and the outspreading of orbits over the entire space, it has Figure 1: Division of 64 coefficients in a 8 × 8 block.
been used in information hiding to increase security [10, 11].
Logistic map is one of the simplest chaotic maps,
described by (7) Repeat (3) to (6) until termination condition is satis-
xn+1 = μxn (1 − xn ), (1) fied.

where 0 ≤ μ ≤ 4, xn ∈ (0, 1).


Researches on chaotic dynamical systems show that the 3. Our Proposed Method
logistic map stands in chaotic state when 3.5699456 < μ ≤ 4. 3.1. Improved Adaptive LSB (IA-LSB) Steganography. The
That is, the sequence {xn , n = 0, 1, 2, . . .} generated by classical LSB steganography replaces cover images’ LSBs with
the logistic map is nonperiodic and nonconvergent. All the messages’ bits directly. This embedding strategy leads to
sequences generated by the logistic map are very sensitive dissymmetry. When the LSB of a coefficient in the cover
to initial conditions, in the sense that two logistic sequences image equals to its corresponding message bit, no change is
generated from different initial conditions are uncorrelated made. Otherwise, this coefficient is changed from 2n to 2n+1
statistically. The logistic map was used to generate a sequence or from 2n + 1 to 2n—changes from 2n to 2n − 1 or from
as the watermark [11] or to encrypt the embedded position 2n + 1 to 2n + 2 never happen. This dissymmetry is utilized
[10, 11] in former works. In our algorithm to be described by steganalysis, known as χ 2 attack [6, 7].
below, we use the logistic map to shuffle bits-order of the In order to avoid dissymmetry, improved adaptive LSB
message. (IA-LSB) steganography is proposed. First, the number of
bits to be embedded in a certain coefficient is adaptive. With
2.2. Genetic Algorithm. The genetic algorithm (GA), intro- proper parameters, we can get high capacity while preserving
duced by Holland [12] in his seminal work, is commonly high security. Second, less modification rule (LMR) is used to
used as an adaptive approach that provides a randomized, minimize modification.
parallel, and global search. It bases on the mechanics of nat-
ural selection and genetics to find the exact or approximate
3.1.1. Adaptively Decide Bits to be Embedded in Each Coef-
solution for a given optimization problem.
ficient. Let C = c0 , c1 , . . . , c63
denote the sequence of
GA is implemented as a computer simulation in which a
quantized DCT coefficients in a certain 8 × 8 JPEG block
population of abstract representations of candidate solutions
of the cover image. loc divides 64 coefficients into two
to an optimization problem evolves toward better solutions.
parts. In the first part, l1 bits are embedded into each valid
The evolution usually starts with some randomly selected
coefficient, and in the second part, l2 bits are embedded
genes as the first generation. All genes in a generation
(shown in Figure 1). We can adjust l1 , l2 , and loc to get high
form a population. Each individual in the population is
performance according to the content of the cover image.
called chromosome, which corresponds to a solution in the
optimization problem domain. An objective, called fitness
function, is used to evaluate the quality of each chromosome. 3.1.2. Less Modification Rule (LMR). Suppose ci is assigned to
A new generation is recombined to find the best solution hold l (l ∈ {l1 , l2 }) bits. Denote ci ’s corresponding l message
by using three operators: selection, crossover, and mutation bits as mi (mi ∈ {0, 1, . . . , 2l − 1}) decimally, and denote its
[13]. The process is repeated until a predefined condition is corresponding coefficient in the stego image as si . Let LSBl (x)
satisfied. be the decimal expression of the least significant l bits of x.
Once we have the genetic representation and the fitness That is, LSBl (x) = x mod 2l .
function defined, pseudocode algorithm of GA is illustrated Let si = ci +mi −LSBl (ci ), s
i = ci − (2 − (mi − LSBl (ci ))) be
l

as follows. two candidates for si . Because LSBl (si ) = LSBl (s


 
i ), si and si


hold the same message bits. In classical LSB steganography,


(1) Generate initial population. si = si . In our method, si or s i is chosen according to less
(2) Evaluate the fitness of each individual in the popula- modification rule formulated as follows:
tion. ⎧ 2 2 2 2



⎪si if 2si − ci 2 < 2s 2
i − ci ,
(3) Select best-ranking individuals to reproduce. ⎪
⎨ 2 2 2 2
(4) Breed a new generation through crossover and muta- si = ⎪s
i if 2si − ci 2 > 2s 2
i − ci , (2)

⎪ 2 2 2 2
tion (genetic operations) and give birth to offspring. ⎪
⎩s or s , randomly,
i i if 2si − ci 2 = 2s 2
i − ci .
(5) Evaluate the individual fitness of the offspring.
(6) Replace the worst ranked part of population with In this rule, we always choose the change that introduces
offspring. less modification. For example, if l = 2, mi = 3, and ci = 8,
EURASIP Journal on Advances in Signal Processing 3

Table 1: PSNR of gray images embedded by IA-LSB with and without shuffling message bits, simply denoted as “with” and “without”.

Average embedding capacity (bpc)


PSNR (db) 0.46 0.624 0.731
with without with without with without
Lena 39.93 39.72 38.521 38.181 37.376 37.221
Baboon 33.59 33.41 33.058 32.821 32.381 32.23
Milkdrop 44.32 44.21 40.187 39.92 39.274 38.934
Plane 38.73 38.44 37.586 37.281 36.727 36.506

Table 2: PSNR (dB) of color images embedded by IA-LSB at 0.45 bpc.

PSNR (dB)     with      without
Lena          35.134    35.011
Baboon        28.62     28.556
Milkdrop      39.384    39.252
Plane         34.518    34.391

Figure 2: Process of using GA to find the best pair of inputs (x0, μ) for the logistic map (initialization of Lp candidate pairs, logistic-map shuffling of the message bits, fitness evaluation, GA operators, and iteration until the number of generations exceeds maxGen).

3.2. Shuffle Message Bits Based on Chaos and Genetic Algorithm. Shuffling the message bits changes the way the cover image is modified during embedding and thus influences the image quality and the security of the stego image. By finding a proper way to shuffle, we can improve the image quality, the security, or both. In this paper, we use the logistic map for shuffling and use GA to find proper parameters for the logistic map.
Denote the message with length L as M = {m0, m1, ..., mL−1}. The process of using the logistic map to shuffle is stated as follows.
(1) Given a pair of inputs (x0, μ), the logistic map generates a sequence {xn, n = 0, 1, 2, ...}. Wipe off the first k (e.g., 1000) elements of the sequence, and use the consecutive L different elements to form a vector Y = {y0, y1, ..., yL−1} = {xk, xk+1, ..., xk+L−1}.
(2) Sort the elements of Y in descending order. The indices of the sorted elements form a sequence I = {i0, i1, ..., iL−1}.
(3) Shuffle the message bits according to I; that is, the message bit with index ir in M is put at position r.
Here is an example of using the logistic map to shuffle message bits. Let M = {0, 1, 1, 1, 0, 1} and Y = {0.1, 0.6, 0.4, 0.2, 0.8, 0.7}; then I = {4, 5, 1, 2, 3, 0}, and the shuffled message sequence is {0, 1, 1, 1, 1, 0}.
From the shuffling process described above, we can see that the pair of parameters (x0, μ) decides the order of the shuffled message bits. In order to improve the performance of the shuffling method, GA is used to select a proper pair (x0, μ). In our scheme, we choose to improve the quality of the stego image in the sense of PSNR and select PSNR as GA's fitness function:

        fitness = PSNR = −10 · log10 { (1 / (255² · M · N)) · Σ_{m=1}^{M} Σ_{n=1}^{N} [d(m, n)]² },        (3)

where M and N are the numbers of rows and columns of the cover image, respectively, and d(m, n) is the difference between coefficients in the spatial domain at position (m, n) in the cover image and in the stego image.
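As an illustration of the three-step shuffling procedure above, the following Python sketch is offered; it assumes the standard logistic map x_{n+1} = μ · x_n · (1 − x_n), and the function names are ours rather than the paper's.

```python
def logistic_sequence(x0: float, mu: float, burn_in: int, length: int) -> list:
    """Iterate the logistic map x_{n+1} = mu * x_n * (1 - x_n), discard the
    first `burn_in` values, and return the next `length` values."""
    x = x0
    for _ in range(burn_in):
        x = mu * x * (1.0 - x)
    seq = []
    for _ in range(length):
        x = mu * x * (1.0 - x)
        seq.append(x)
    return seq

def shuffle_bits(message: list, x0: float, mu: float, burn_in: int = 1000) -> list:
    """Shuffle message bits: sort the chaotic values in descending order,
    take the resulting index sequence I, and place bit M[I[r]] at position r."""
    y = logistic_sequence(x0, mu, burn_in, len(message))
    order = sorted(range(len(message)), key=lambda i: y[i], reverse=True)
    return [message[i] for i in order]

# Reproducing the toy example from the text with Y given directly:
M = [0, 1, 1, 1, 0, 1]
Y = [0.1, 0.6, 0.4, 0.2, 0.8, 0.7]
I = sorted(range(len(Y)), key=lambda i: Y[i], reverse=True)
print(I)                    # [4, 5, 1, 2, 3, 0]
print([M[i] for i in I])    # [0, 1, 1, 1, 1, 0]
```

Because the receiver regenerates the same chaotic sequence from (x0, μ), the inverse permutation is available at the extracting side without transmitting the index sequence I itself.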
Figure 3: Embedding procedure of our proposed method (message bits → logistic-map shuffling with the GA-selected best pair (x0, μ) → IA-LSB embedding into the quantized DCT coefficients obtained by entropy decoding the cover JPEG file → entropy encoding into the stego JPEG file).

Figure 4: Extracting procedure of our proposed method (entropy decoding of the stego JPEG file → extraction of the LSBs of the valid stego quantized DCT coefficients → reordering of the shuffled message bits with the logistic map and (x0, μ)).

Figure 5: PSNR (dB) of our method, F5, and MB1 versus embedding rate (bpc).

Figure 6: Distribution of the (2,1)th AC components for the original cover image and for our method.

The process of using GA to maximize the PSNR is shown in Figure 2 and stated as follows.
(1) Initialize the population. Randomly generate Lp pairs of (x0, μ), with x0 ∈ (0, 1) and μ ∈ (3.5699456, 4]. Lp is the size of the population, and each pair (x0, μ) is an individual.
(2) For each (x0, μ), shuffle the message bits and embed the reordered message bits into the cover image using IA-LSB steganography; then compute the PSNR between the cover image and the stego image, which is the fitness function of the GA. In the following operations, the individual with the larger fitness value is considered better.
(3) The GA operators (selection, crossover, and mutation) are applied to generate the next generation.
(4) Repeat (2) and (3) until the number of generations equals the maximum generation maxGen (e.g., 100).
(5) Output the best pair (x0, μ) selected by the GA.

3.3. Embedding Procedure. A coefficient ci is valid if ci ≠ 0 and it is not a DC coefficient. The whole embedding procedure is depicted in Figure 3. First, the message bits are shuffled by the logistic map whose input pair (x0, μ) is selected by the GA. Second, the cover JPEG file is entropy decoded, obtaining the quantized DCT coefficients. Third, the shuffled message bits are embedded into the valid quantized DCT coefficients using IA-LSB steganography. Finally, the stego quantized DCT coefficients are entropy encoded into the stego JPEG file.
It needs to be taken into consideration that valid coefficients should still be valid after embedding; that is, valid coefficients should not be changed to 0. On one hand, the characteristics of the histogram can be preserved; on the other hand, the message bits can be extracted correctly and simply. If si = 0, we set si = si ± 2^l, where adding or subtracting 2^l is determined randomly.
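A compact Python sketch of this embedding pass over the quantized coefficients is given below. It assumes the AC coefficients are already available as a flat list (the JPEG entropy decoding/encoding steps are not shown), and the helper bits_per_coeff that assigns l1 or l2 bits to a coefficient is a hypothetical placeholder for the IA-LSB allocation described earlier.

```python
import random

def lmr(c, m, l):
    """Less modification rule (Section 3.1.2): of the two candidates holding
    message value m in their l LSBs, keep the one closer to the original c."""
    lsb = c % (2 ** l)
    s1, s2 = c + m - lsb, c - (2 ** l - (m - lsb))
    if abs(s1 - c) != abs(s2 - c):
        return s1 if abs(s1 - c) < abs(s2 - c) else s2
    return random.choice([s1, s2])

def embed_message(ac_coeffs, bits, bits_per_coeff):
    """Embed already-shuffled message bits into valid (nonzero) AC coefficients.
    A stego value of 0 is pushed away by +/- 2^l so that the coefficient stays
    valid and the embedded bits remain extractable."""
    out, pos = list(ac_coeffs), 0
    for idx, c in enumerate(ac_coeffs):
        if c == 0:                                  # invalid coefficient: skip
            continue
        l = bits_per_coeff(c)
        if pos + l > len(bits):
            break                                   # message exhausted
        m = int("".join(map(str, bits[pos:pos + l])), 2)
        pos += l
        s = lmr(c, m, l)
        if s == 0:
            s += random.choice([-1, 1]) * (2 ** l)  # keep the coefficient valid
        out[idx] = s
    return out

# toy usage: every valid coefficient holds 2 bits
stego = embed_message([5, 0, -3, 7, 2], [1, 0, 1, 1, 0, 1], lambda c: 2)
```

Extraction mirrors this loop: read LSBl of each valid stego coefficient and then undo the logistic-map permutation.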
3.4. Extracting Procedure. After receiving the stego JPEG file and (x0, μ), we can extract the message bits as shown in Figure 4. First, the stego JPEG file is entropy decoded to obtain the stego quantized DCT coefficients. Second, the shuffled message bits are extracted from the LSBs of the valid coefficients. Third, the shuffled message bits are reordered to their natural order using the logistic map with (x0, μ) as input. The message bits are thus obtained.

4. Experiments

In this section, we demonstrate the performance of our proposed method and compare it with that of F5 [14], MB1 [15], and Outguess [16]. The image quality of each steganography method is expressed objectively in PSNR. Standard 256 gray-level and true-color images with sizes of 256 × 256 are used as covers, such as Lena, Baboon, and Couple. The JPEG quality factor is set to 80 during compression in each method.

4.1. Image Quality. In order to demonstrate the validity of shuffling message bits, we compare the PSNR of images embedded by IA-LSB steganography with and without shuffling of the message bits. The results for gray images are shown in Table 1. Shuffling the message bits does improve the PSNR of the stego image. It can also be applied to other steganographic algorithms and provides us with a new way to improve the performance of steganography. Moreover, Table 2 shows that this shuffling scheme is applicable not only to gray images but also to color images.
Figure 5 shows the PSNR of our method, F5, and MB1. The results are averaged over 50 gray-level images. We can see that the PSNR of our proposed method is higher than that of F5 and MB1. Because the capacity of Outguess is only around 0.3 bpc, it is not shown in the figure. The PSNR of Outguess is not higher than 32.86 dB at 0.3 bpc (bits per nonzero AC coefficient), whereas that of our method is higher than 37 dB even at 0.72 bpc. We can conclude that our method outperforms F5, MB1, and Outguess in image quality.

4.2. Preserving Characteristics of the Histogram. As a representative example, Figure 6 plots the distribution of the (2,1)th quantized AC components for the cover image "Lena" and its corresponding stego image with an embedding rate of 0.46 bpc. The red line illustrates the coefficient distribution of the stego image produced by our proposed method, and the green bars illustrate that of the cover image. Figure 6 shows that our method preserves the characteristics of the histogram. This is also true for the other components (e.g., the (1,2)th and (2,2)th AC components) and the other test images.

5. Conclusion

A steganographic method using IA-LSB based on chaos and a genetic algorithm is proposed. After finding the best parameters for the logistic map using GA, the secret message is rearranged and embedded into the cover image using IA-LSB. Experimental results demonstrate that our algorithm achieves high embedding capacity while preserving good image quality and high security.
The important and distinctive feature of the proposed method is to minimize the degradation of the stego image by shuffling the secret message based on the logistic map and GA. Finding a better mapping between the secret message and the cover image so as to improve the steganographic performance is our future work.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (no. 60776794, no. 90604032, and no. 60702013), the 973 program (no. 2006CB303104), the 863 program (no. 2007AA01Z175), Beijing NSF (no. 4073038), and the Specialized Research Foundation of BJTU (no. 2006XM008 and no. 2005SZ005).

References

[1] J. Silman, "Steganography and steganalysis: an overview," Tech. Rep., SANS Institute, 2001.
[2] T. Jamil, "Steganography: the art of hiding information in plain sight," IEEE Potentials, vol. 18, no. 1, pp. 10–12, 1999.
[3] D. Upham, 1997, http://zooid.org/∼paul/crypto/jsteg/.
[4] A. Westfeld, "F5-a steganographic algorithm," in Proceedings of the 4th International Workshop on Information Hiding, pp. 289–302, Pittsburgh, Pa, USA, 2001.
[5] N. Provos, "Defending against statistical steganalysis," in Proceedings of the 10th USENIX Security Symposium, pp. 323–335, Washington, DC, USA, 2001.
[6] A. Westfeld and A. Pfitzmann, "Attacks on steganographic systems," in Proceedings of the 3rd International Workshop on Information Hiding, 2000.
[7] N. Provos and P. Honeyman, "Detecting steganographic content on the internet," Tech. Rep., Center for Information Technology Integration, University of Michigan, 2001.
[8] Z. Liu and L. Xi, "Image information hiding encryption using chaotic sequence," in Proceedings of the 11th International Conference on Knowledge-Based Intelligent Information and Engineering Systems and the XVII Italian Workshop on Neural Networks, pp. 202–208, 2007.
[9] Y. Zhang, F. Zuo, Z. Zhai, and C. Xiaobin, "A new image encryption algorithm based on multiple chaos system," in Proceedings of the International Symposium on Electronic Commerce and Security (ISECS '08), pp. 347–350, August 2008.
[10] R. Munir, B. Riyanto, S. Sutikno, and W. P. Agung, "Secure spread spectrum watermarking algorithm based on chaotic map for still images," in Proceedings of the International Conference on Electrical Engineering and Informatics, 2007.
[11] Z. Dawei, C. Guanrong, and L. Wenbo, "A chaos-based robust wavelet-domain watermarking algorithm," Chaos, Solitons and Fractals, vol. 22, no. 1, pp. 47–54, 2004.
[12] J. H. Holland, Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, Mass, USA, 1992.
[13] Y.-T. Wu and F. Y. Shih, "Genetic algorithm based methodology for breaking the steganalytic systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 36, no. 1, pp. 24–31, 2006.
[14] http://os.inf.tu-dresden.de/∼westfeld/publikationen/f5r11.zip.
[15] http://www.philsallee.com/mbsteg/index.html.
[16] http://www.outguess.org/download.php.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 525026, 19 pages
doi:10.1155/2010/525026

Research Article
A Macro-Observation Scheme for Abnormal Event Detection in
Daily-Life Video Sequences

Wei-Yao Chiu and Du-Ming Tsai


Department of Industrial Engineering and Management, Yuan-Ze University, 135 Yuan-Tung Road, Nei-Li, Tao-Yuan 32026, Taiwan

Correspondence should be addressed to Du-Ming Tsai, [email protected]

Received 19 October 2009; Revised 4 March 2010; Accepted 8 April 2010

Academic Editor: Robert W. Ives

Copyright © 2010 W.-Y. Chiu and D.-M. Tsai. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

We propose a macro-observation scheme for abnormal event detection in daily life. The proposed macro-observation
representation records the time-space energy of motions of all moving objects in a scene without segmenting individual object
parts. The energy history of each pixel in the scene is instantly updated with exponential weights without explicitly specifying the
duration of each activity. Since possible activities in daily life are numerous and distinct from each other and not all abnormal
events can be foreseen, images from a video sequence that spans sufficient repetition of normal day-to-day activities are first
randomly sampled. A constrained clustering model is proposed to partition the sampled images into groups. The new observed
event that has a distinct distance from any of the cluster centroids is then classified as an anomaly. The proposed method has been evaluated on the daily work of a laboratory and on the BEHAVE benchmark dataset. The experimental results reveal that it can well detect abnormal events such as burglary and fighting as long as they last for a sufficient duration of time. The proposed method can be used as a support system for scenes that require full-time monitoring personnel.

1. Introduction

Activity recognition has played an important role in video surveillance for security, traffic-monitoring, homecare, and healthcare applications. An activity recognition system generally involves the following four steps: low-level detection of moving objects from the background with a still camera, spatiotemporal representation of motions in an image sequence, extraction of motion features from the representation, and high-level classification.
There are two major approaches for activity recognition in video sequences: micro-observation and macro-observation. The micro-observation approach analyzes the motions based on the local detailed parts of individual moving objects. In human motion analysis, this means the body parts such as head, torso, and limbs must be identified first, followed by poses assignment based on the extracted body parts. The poses then construct a specific action, and finally a sequence of actions gives a meaningful behavior. This approach requires a bottom-up process to construct a representation from the low-level primitives of foreground objects. The macro-observation approach does not describe the motion of an object by the local details. Instead, it describes the motion from a global aspect using an abstract representation of time-space changes in a video sequence. Human beings have the remarkable ability to recognize the behavior of a single isolated person, or the interaction between multiple people, from a far distance without knowing the detailed motions of individual persons.
In this paper, we propose a fast macro-observation surveillance scheme that can detect abnormality in our daily life that involves distinct activities of a single person or a group of people. A video surveillance system that can monitor abnormal events in daily life is very complicated to construct due to unanticipated or indefinable activities.

1.1. Micro-Observation Approach. The micro-observation approach for activity recognition can well describe the details of a motion and provides a good discrimination between individual activities with subtle changes. However,

objects from the background and precise identification of be recognized were organized as a matrix. The eigenvectors
the individual body parts. An inaccurate extraction and with dominant eigenvalues of the covariance matrix formed
description of details in a lower level causes the failure in a the eigenspace. A human posture in a frame was then
higher level process. represented by a point in the eigenspace, and a motion was
Appearance-based methods [1–6] used appearance mod- described by a set of successive points in the eigenspace.
els that combine shape, color, and texture to analyze the Distance measures were finally used to match the lines of the
moving objects. Model-based methods constructed a human observed motions and those of the reference motions. The
body as articulated/kinematic or skeleton models [7–13]. eigenspace approach is computationally expensive and can
The poses identified from the object models were considered only describe specific activities.
as individual states in space, and then hidden Markov models Motion Energy Image (MEI) and Motion History Image
(HMMs) [14–18] were generally used to describe the state (MHI), first proposed by Davis and Bobick [43], are a global
changes over time. Bayesian networks and neural networks spatiotemporal representation of motion. They are treated
[19–21] were also commonly used for high-level activity as temporal templates for the match of human movement
recognition. W 4 [5] is a well-known system using such an [44]. MEI is defined as the sum of object silhouettes in
approach to recognize events between people and objects. every image frame over a fixed duration. The result of MEI
This approach is also well applied to gesture recognition is a binary image of motion shape. While MEI is used to
[22, 23] and gait recognition [24, 25]. Shah et al. [26] record the “shape” of a motion, the intensity of MHI is a
presented a surveillance system, called KNIGHT, that used function of recency of motion. The effectiveness of the MEI
rule-based algorithms to detect single object activities and and MHI representations is critically determined by the fixed
multiobject interactions. Speed, direction, and orientation of duration value. Bradski and Davis [45] extended the MHI
object silhouettes and their interobject distances were used for motion segmentation and pose recognition by extracting
as features to detect activities such as falling, running, and additional pose and directional motion information in MHI.
meeting. The gradient orientation at each pixel is derived from the
spatial derivatives along the y- and x-axis of MHI. Wong
1.2. Macro-Observation Approach. Spatiotemporal represen- and Cipolla [46] also used the gradient directions in MHI
tation of an image sequence is critical for recognizing differ- for gesture recognition. Davis and Bobick [47] used MHI for
ent activities using the macro-observation approach. Optical recognizing aerobic movements. The temporal templates of
flow [27–30] that describes each pixel in two consecutive MEI and MHI were also used for hand gesture recognition
frames by a velocity vector has been popularly used as a [48]. The temporal template has shown to be a good global
motion representation. Efros et al. [31] recognized human representation of motions. However, it is currently only
actions of individuals in a low resolution video sequence. verified for simple activities such as hand gestures and
Their algorithm started by tracking individual human figures aerobic exercises that have a fairly steady motion duration
and forming a figure-centric sequence. Then the optical flow and is only tested for single isolated object in a simple
vector field was calculated from the figure-centric sequence, background.
and a set of motion descriptors were derived from 4 channels
of the optical flow. The K-nearest neighbor classifier was 1.3. Unusual Event Detection. There were a few methods
finally used to recognize various human actions in sport proposed to tackle abnormal/rare event detection in specific
videos. domains. Vaswani et al. [49] presented a system that learned
Trajectory [32–36] is a commonly-used representation the pattern of normal activities and detected abnormal events
for describing moving objects from a far distance. In order from a very low-resolution video where the moving objects
to construct the trajectory of moving objects in an image were small enough to be modeled as point objects. The
sequence, object tracking is generally applied first, and then activity of moving objects was modeled by a polygonal
the centroid of a tracked object is marked as a point on the “shape” of the configuration of the tracked points using
trajectory. The position, speed, direction, and curve/shape Kendall’s statistical shape theory. The expected log likelihood
of the motion trajectory are used to analyze the intended of the represented Kendall’s shape for an observed sequence
behaviors of moving objects in the scene. The trajectory of fixed length was then used as the change detection statistic.
representation has been mostly applied to traffic monitoring. The system was applied to monitor passengers getting out
Stauffer and Grimson [37] used vector quantization to a plane and moving towards the terminal from a very
cluster trajectories for parking lot monitoring. The clusters far observation distance. It is basically a trajectory-based
were identified by a hierarchical analysis of the vector method and is only applicable to the monitoring of a widely
cooccurrences in the trajectories. The trajectory is good for open scene. Piciarelli and Foresti [36] proposed an on-line
the representation of a widely open scene, but may fail to trajectory clustering for anomalous event detection, and
describe the people interaction in a room-sized scene. applied it to traffic behavior monitoring on a highway. The
Eigenspace [38–41] derived from principal component trajectory is represented by a series of position coordinates
analysis is also used for motion representation in video and is matched to the clusters of a training set by a
sequences. Rahman and Ishikawa [42] recognized human distance measure. Hu et al. [50] proposed a self-organizing
motion using an eigenspace. A 2D spatial image was first method to learn activity patterns for anomaly detection and
arranged as a column vector. Then, a series of a fixed activity prediction. The activity patterns were represented
number of consecutive images for every possible motion to by trajectories, where object position, velocity and size were

used as the features. A fuzzy self-organizing neural network part extraction, and state-space modeling for all possible
was then presented to classify the activity patterns. The events to detect. The abnormal events in a daily life are very
system was applied in traffic monitoring to detect abnormal difficult to define semantically, and the normal events are too
driving trajectories. numerous to model individual day-to-day activities.
Fleet et al. [51] and Andrade et al. [52] used optical
flow patterns to detect emergency events in crowded scenes. 1.4. Overview of the Proposed Method. With the macro-
They first computed the optical flow for the whole frame, observation approach, the proposed method first segments
and retained only the flow information in the foreground moving objects from the background for each input scene
region. Principal component analysis was then performed image. The foreground objects shifting in spatial images
on the optical flow fields for a series of image frames of over time are globally represented by an energy map, where
fixed duration. The dominant eigenvectors of the training the movement strength of each pixel in the current scene
data matrix were used to form bases for the projection. The image is exponentially increased/decreased based on the state
projected optical flow vectors were then used as features. A changes of the pixel over time. The length of image frames
mixture of Gaussian hidden Markov model was trained with for different activities does not have to be explicitly specified,
the feature vectors for each video segment in the training and the energy map can be promptly updated for each new
set, and the spectral clustering was used to determine the scene image. The shape of the energy map and movement
number of HMMs to represent various flow sequences. strength of every pixel in the map carry meaningful time-
For day-to-day behavior analysis, an extremely large set of space interaction of single person or multiple people with
training image sequences may be required. The covariance the environment. A set of discriminative features can then be
matrix of such a large training set could be prohibitively effectively extracted from the energy map of each new scene
large for PCA computation. Adam et al. [53] proposed an image.
optical flow-based method for unusual event detection in Abnormal event detection in daily life can be considered
cluttered and crowded environments. The abnormality was as a very special case of one-class classification problem. No
mainly detected by evaluating the probability distribution all possible abnormal event in daily life can be foreseen.
of flow magnitude and direction in the optical flow fields. It is also very difficult to collect all possible conditions of
An unusual event without radical motion changes cannot be a specific abnormal event. Conversely, normal behaviors in
detected with this method. daily life can be easily collected for learning. The behaviors
Zhong et al. [54] presented a technique for detecting repeated daily can be grouped into many clusters, and all
unusual activities in video sequences. Moving objects in each clusters belong to the same class, that is, the normality
image frame were detected first. The simple spatial histogram class. Because the types of normal behaviors in a daily life
of the detected objects was used as image features and, could be numerously large and quite different from each
therefore, the observed activities were location-dependent. other, the images are randomly sampled from a long image
They divided the video into equal-length segments and sequence that can sufficiently represent the cyclical day-
classified the extracted features into prototypes. A cooccur- to-day activities of the observed scene. An unsupervised
rence matrix between the video segments and prototype clustering subject to distance constraints is proposed to
features was constructed for similarity comparison. The group various normal activities into a manageable number
correspondence between prototypes and video segments was of clusters so that the computation in the recognition process
then solved as a graph editing problem. can be efficiently carried out and all normal events can
The abnormal event detection methods aforementioned have distances from their cluster centroids within very tight
generally consider the motions of objects with a fixed control limits (distance thresholds). The video images with
observation duration (i.e., a predetermined number of image distinct feature distances lasting for an extended period of
frames) in video sequences, and require well-controlled time are then declared as an abnormal event.
environments or well-defined patterns of activities. Most The proposed macro-observation method mimics the
of the currently available activity recognition methods only human observer who can easily recognize abnormal events
deal with very simple activities, and are domain specific from a far distance without knowing the detailed move-
such as aerobic exercises [44] and tennis strokes [55]. In ments of individual persons. The global representation of
this study, we propose a macro-observation approach to complicated motions in a scene can well detect abnormal
detect abnormal events observed, especially, in a room-sized events as long as they can last for tens of seconds. Since the
scene from a still camera. The scene of a room may involve detailed motions of individual body parts are not separately
complicated day-to-day behaviors such as an older person extracted, the proposed method cannot be responsive to the
staying alone at home (for homecare monitoring), multiple events with subtle motion changes and the events spanning
people with/without interaction in a nursing home (for only a few seconds. The proposed monitoring system can be
healthcare monitoring), and cashier-customer interaction in used as a supplement for the personnel that requires intensive
a shop (for security monitoring). The observed objects in and constant manual monitoring of scenes for unpredictable
such scenes have moderate sizes in the image. The trajectory events from multiple cameras.
representation of an object as a point may lose meaningful This paper is organized as follows. Section 2 first dis-
interaction between people. The proposed method does cusses the foreground segmentation method to extract
not take the micro-observation approach since it requires moving objects in video images. The energy map used
complicated object tracking, object segmentation and body to represent the spatiotemporal motion is then described,

followed by the extraction of discriminative features from the energy map. The proposed clustering mechanism is then presented to group similar energy maps sampled from image sequences of normal daily life. Section 3 describes the experimental results of daily activities in a laboratory over a long period of observation and the BEHAVE benchmark dataset [56]. Section 4 concludes the paper and discusses future work.

2. Abnormal Event Detection

This section discusses the abnormal event detection scheme that comprises the processes of moving object detection, the exponential energy map for spatiotemporal representation of motions, extraction of motion features, and the constrained clustering model for classification.

2.1. Moving Object Detection. The objective of the paper is to detect abnormal events in daily life in a scene such as an office or a nursing home, where nonstationary background changes such as movements of a chair, placing of cups and newspapers on tables, revolving of ceiling fans, opening/closing of doors or curtains, and switching on/off room lights are not uncommon. Since the proposed method does not rely on the accurate detail parts of moving objects for the detection, any background subtraction techniques such as the ones in [3, 57–59] can be directly applied to foreground segmentation as long as they are computationally fast.
In background updating models, each pixel of the background image over time has been simply modeled with a single Gaussian model [3]. A more robust background modeling technique is to represent each pixel by a mixture of Gaussians [37, 57]. In order to promptly detect moving objects for nonstop monitoring of day-to-day activities, we adopt a single-Gaussian background updating approach, instead of the more complicated mixture Gaussian model, to extract foreground objects with a high processing rate.
In the single Gaussian model for each individual pixel in the image, the parameters are represented by the gray-level mean μT(x, y) and standard deviation σT(x, y) of the pixel (x, y) over a limited time duration. Different from the Gaussian background updating models [3, 57] that estimate the parameter values by a linear filtering technique, these two statistical values of the single Gaussian model can be easily and precisely calculated by deleting the last image in the series of the historical image frames and adding the current image frame for nonstop monitoring.
Let {ft(x, y), t = T, T − 1, ..., T − N + 1} be a series of N consecutive image frames, where T denotes the current time frame. The gray-level mean and variance of the single Gaussian background model for pixel (x, y) at time frame T are given by

        μT(x, y) = E[ f(x, y)] = (1/N) · ST(x, y),
        σ²T(x, y) = E[ f²(x, y)] − {E[ f(x, y)]}² = (1/N) · S²T(x, y) − μ²T(x, y),        (1)

where

        ST(x, y) = Σ_{i=0}^{N−1} fT−i(x, y) = ST−1(x, y) − fT−N(x, y) + fT(x, y),
        S²T(x, y) = Σ_{i=0}^{N−1} f²T−i(x, y) = S²T−1(x, y) − f²T−N(x, y) + f²T(x, y).        (2)

Note that ST(x, y) and S²T(x, y) can be efficiently updated by dropping the last image frame fT−N(x, y) in the image series and adding the current image frame fT(x, y) to the image series. Therefore, the updating computation involves only two simple arithmetic operations, and a very high processing rate of image frames is achieved accordingly. Note also that the mean and variance updating processes in (1) are invariant to the number of image frames N in the series.
In motion detection, the multiple temporal images of the background will present approximately the same gray value with a small variance. The gray value of a foreground pixel will be distinctly different from that of the background. The upper and lower control limits for foreground-pixel detection in the current image frame fT(x, y) can be given by μT−1(x, y) ± κ · σT−1(x, y), where κ is a control constant. If the gray level of fT(x, y) is out of the control limits, the pixel at (x, y) is then considered a foreground point. Otherwise, it is classified as a steady background point. The detection result is represented by a binary image BT(x, y), where

        BT(x, y) = 0 (background),  if | fT(x, y) − μT−1(x, y)| ≤ κ · σT−1(x, y),
                   1 (foreground),  otherwise.        (3)

Since the gray values of foreground and background points are generally distinctly different, the control constant κ is set at 5 in this study.

2.2. Spatiotemporal Representation. The goal of this subsection is to construct a global representation of motion that can describe the changes in both temporal and spatial dimensions. The existing spatiotemporal representations of motions aforementioned generally describe the temporal context with a fixed duration in a video sequence. The motion representation from a fixed number of image frames may not sufficiently capture the salient and discriminative properties for a large variety of activities encountered in daily life. A short observation duration cannot describe a full cycle of an activity. In contrast, an excessively long observation duration may mix two or more different activities or reduce the significance of a unique activity in the spatiotemporal representation.
In order to construct a more responsive spatiotemporal representation for the scene that may involve the motions of a single person or multiple people with varying time spans
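As a concrete illustration of (1)–(3), the following NumPy sketch maintains the running sums S_T and S²_T over a sliding window of N frames and flags foreground pixels; the class layout, array shapes, and names are our own choices rather than code from the paper.

```python
import numpy as np
from collections import deque

class RunningGaussianBackground:
    """Single-Gaussian background model per pixel, updated by dropping the
    oldest frame and adding the newest one (Eqs. (1)-(2))."""

    def __init__(self, n_frames: int, kappa: float = 5.0):
        self.N, self.kappa = n_frames, kappa
        self.frames = deque()          # the last N frames
        self.S = None                  # running sum of gray levels
        self.S2 = None                 # running sum of squared gray levels

    def detect(self, frame: np.ndarray) -> np.ndarray:
        """Return the binary foreground mask B_T for the current frame
        (Eq. (3)), then update the running sums with this frame."""
        f = frame.astype(np.float64)
        if self.S is None:
            self.S = np.zeros_like(f)
            self.S2 = np.zeros_like(f)
        if len(self.frames) < self.N:          # model still warming up
            mask = np.zeros(f.shape, dtype=np.uint8)
        else:
            mu = self.S / self.N               # mu_{T-1}
            var = np.maximum(self.S2 / self.N - mu ** 2, 0.0)
            sigma = np.sqrt(var)               # sigma_{T-1}
            mask = (np.abs(f - mu) > self.kappa * sigma).astype(np.uint8)
            old = self.frames.popleft()        # drop frame f_{T-N}
            self.S -= old
            self.S2 -= old ** 2
        self.frames.append(f)                  # add the current frame f_T
        self.S += f
        self.S2 += f ** 2
        return mask
```

Only one subtraction and one addition per pixel are needed per update, which matches the constant-time property emphasized above.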

of activities, we construct the motion energy map using an exponential time update process, which is defined as

        ET(x, y) = MT(x, y) + ET−1(x, y) · γ,        (4)

where γ is the energy update rate, 0 < γ < 1, and

        MT(x, y) = Tenergy,  if BT(x, y) ∈ foreground,
                   0,        otherwise.        (5)

The initial value of the energy is set to zero at time frame 0, that is, E0(x, y) = 0 for all pixels. The energy of a pixel ET(x, y) is increased if it remains a foreground point. It is only decayed when it becomes a background point. In (4) above, Tenergy is a predetermined energy constant and is assigned to each foreground pixel. Assume that the current energy value of a pixel (x, y) is E. If pixel (x, y) is a foreground point and lasts for a period of Nf frames, then the energy at (x, y) is increased up to

        Σ_{i=0}^{Nf−1} Tenergy · γ^i + E · γ^Nf.        (6)

Conversely, if pixel (x, y) is changed from a foreground point to a background point and lasts for Nb frames, the energy at (x, y) is then exponentially decreased to

        E · γ^Nb.        (7)

The choice of the Tenergy value is not critical at all, as long as it is larger than zero for foreground points and equal to zero for background points. The value of Tenergy affects only the visual representation of the energy map in the image; it does not change the detection results.
The exponential energy updating of foreground pixels assigns larger weights to the most recent image frames. The energy update rate γ gives an exponential decrement of the energy. A large γ value gives a slow decrement of the energy, and the long-term history of the pixel is taken into account for the spatiotemporal representation. In contrast, a small γ value results in an accelerated decrement of energy, and only the short-term history of the pixel is used to represent the motion. The exponential decrement of energy allows flexible adjustment of the observed period for the historical status of each pixel. The proposed exponential energy map of motions prevents the explicit choice of a predetermined number of image frames for the construction of the spatiotemporal representation. It can thus be effectively used to represent activities that last for various durations. By detecting each individual pixel as a foreground or a background point in the video sequence, the energy of the pixel can be easily updated according to (4) without knowing its associated moving part of an object. If the motion of the pixel continues (i.e., foreground point), the energy of the pixel is exponentially accumulated. Otherwise, the energy of the pixel (i.e., background point) is decreased. In the macro-observation approach, two (or multiple) movements within a scene are simply interpreted as an event in the energy map; they do not have to be separated into different moving parts.
Figure 1 displays the motion energy maps of various video sequences of one single person from daily activities in a laboratory, in which the energy constant Tenergy is set at 10 for visual display and the update rate γ is given by 0.999 for the normal walking speed of people in the room. The video images were taken at 10 frames per second. Figure 1(a) shows the original video sequence at varying time frames. The scenario in the sequence is that a single person walked towards the door from the lower-left to the upper-right in the scene. The resulting energy map is shown in the bottom row of Figure 1, where the brightness is proportional to the energy value. Figure 1(b) presents another single person who walked from the upper-right door to the lower-left corner in the opposite direction. By closely observing the two corresponding energy maps in Figures 1(a) and 1(b), both display similar representations in shape. The energy values in the upper-right are higher than those in the lower-left in the energy map of Figure 1(a), whereas the energy values in Figure 1(b) show the reverse trend. Therefore, the representative shape of the energy map describes various spatiotemporal activities, and the changes of energy values in the map implicitly indicate the moving direction. Figure 1(c) displays a single person working on a computer. The resulting energy map, as seen in the bottom row of Figure 1(c), shows that only the sitting area of the person gives bright energy values. The historical data of the movement from the lower-left to the upper-right corners were responsively decayed to very small energy values.
Figure 2(a) shows a group of people discussing in the middle-right area for a prolonged period of time and then walking back to their seats. The bottom row in Figure 2(a) gives the corresponding energy map, in which the middle-right area is brighter than the remaining regions in the image and the energy values for pixels in the walking paths are larger than those of the background. Figure 2(b) shows two people chatting around the public desk in the laboratory, and Figure 2(c) displays two people separately working on the computers. The corresponding energy maps are presented in the bottom row of Figures 2(b) and 2(c), which show that different moving frequencies of multiple people generate different energy maps. Based on the representative samples in Figures 1 and 2, the exponential energy maps can represent different day-to-day activities that involve single or multiple people.
The proposed exponential energy map can well represent spatiotemporal activities from a macro-observation view. It requires no complicated segmentation and object recognition techniques to identify the detailed parts of individuals in a group. It can well represent activities that last a sufficient period of time. In order to prevent false alarms, an activity that lasts for only a few seconds is interpreted as noise in this paper. The restriction of the exponential energy map is that it cannot be effectively used to describe activities that involve only subtle movements of the body or last only a very short period of time.
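A minimal sketch of the exponential energy update in (4)–(5) is given below; the parameter values (Tenergy = 10, γ = 0.999) follow the text, and NumPy is our choice of implementation rather than the paper's.

```python
import numpy as np

def update_energy_map(energy: np.ndarray,
                      foreground_mask: np.ndarray,
                      t_energy: float = 10.0,
                      gamma: float = 0.999) -> np.ndarray:
    """One step of the exponential energy map, Eqs. (4)-(5):
    E_T = M_T + gamma * E_{T-1}, where M_T = T_energy on foreground pixels
    and 0 elsewhere.  Start from an all-zero map (E_0 = 0)."""
    m_t = np.where(foreground_mask > 0, t_energy, 0.0)
    return m_t + energy * gamma

# usage: a pixel that stays foreground accumulates energy,
# a pixel that turns background decays exponentially.
E = np.zeros((150, 200))
mask = np.zeros((150, 200), dtype=np.uint8)
mask[60:90, 40:70] = 1                     # a moving blob
for _ in range(100):                       # 100 foreground frames
    E = update_energy_map(E, mask)
for _ in range(50):                        # then 50 background frames
    E = update_energy_map(E, np.zeros_like(mask))
```

No frame buffer or explicit activity duration is required, which is the practical advantage of the exponential update over fixed-length temporal templates.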

Figure 1: Video sequences involving different activities of a single person and their corresponding energy maps: (a) single person walking from lower-left to upper-right; (b) single person walking from upper-right to lower-left; (c) single person working on a computer. The corresponding energy map of each column sequence is shown in the bottom row. (Sample frames are shown at t = 1, 10, 15, and 20.)

2.3. Discriminative Features. The proposed exponential energy map gives a spatiotemporal representation of an activity. To construct a classification system for identifying abnormal events, we need to design and extract discriminative features from the energy map. The shape and energy statistics of the energy map are used as descriptors. Currently, we use up to 12 discriminative features, and they are described in detail as follows.

Figure 2: Video sequences involving different activities of multiple people and their corresponding energy maps: (a) a group of people discussing in the middle-right area; (b) multiple people chatting around the public desk; (c) multiple people working on the computers. The corresponding energy map of each column sequence is shown in the bottom row. (Sample frames are shown at t = 1, 100, 200, and 300.)

Invariant Moments f1 to f7. An event in the energy map forms a specific shape with the energy magnitude of each pixel as the weight. The extracted features from the energy map should be independent of the location, orientation, and size of an activity in the image. The first seven discriminative features are, therefore, based on Hu's invariant moments [60]. Features f1–f7 are invariant to position, rotation, and scale changes.

The seven invariant moments used in this study are not merely computed from the binary shape but use the energy value ET(x, y) as the density for each pixel in the energy map.

Entropy f8. Let ĒT(x, y) be the energy value normalized to an integer in the range between 0 and 255 (for an 8-bit display). Thus

        ĒT(x, y) = [ (ET(x, y) − Min_{u,v} ET(u, v)) / (Max_{u,v} ET(u, v) − Min_{u,v} ET(u, v)) ] × 255.        (8)

Denote by Pi the probability that ĒT(x, y) = i, i = 0, 1, 2, ..., 255. The entropy of the energy map is therefore defined as

        f8 = − Σ_i Pi · log Pi.        (9)

The entropy feature describes the complexity of movements in a scene. A still scene will have an entropy value close to zero. A single person sitting in a chair for study will yield a small entropy value, whereas a scene involving interaction and movements of multiple people will result in a large entropy value.

Maximum Energy f9. The maximum energy is defined as

        f9 = Max_{x,y} ET(x, y).        (10)

This feature gives the maximum energy value in the energy map. A foreground object that keeps moving for a prolonged period of time will have a larger feature value of f9, compared to that for a short period of time.

Total Energy f10. The total energy is defined as

        f10 = Σ_x Σ_y ET(x, y).        (11)

A scene with a group of people generally yields a larger total energy value, compared to the scene with a single person. A person who keeps moving around in the scene will generate a larger total energy value, whereas a person sitting for study will result in a smaller total energy value.

Area of Nonzero Energy f11. This feature is defined as

        f11 = Σ_x Σ_y b(x, y),        (12)

where

        b(x, y) = 1, if ET(x, y) > 0;  0, otherwise.        (13)

A wide moving area will result in a larger feature value of f11, even if the movement lasts for only a very short period of time, whereas a limited moving area will have a smaller feature value even if the movement lasts for a prolonged period of time.

Mean Energy f12. The mean energy is defined as

        f12 = Σ_x Σ_y ET(x, y) / Σ_x Σ_y b(x, y) = f10 / f11.        (14)

This feature gives the mean energy value in the region of nonzero energy. The total energy f10 for highly repetitive activities in a small limited area may be similar to that for nonrepetitive activities in a wide area. The mean energy can be used to describe the relationship between the repetitive motions and the moving area.

As demonstration examples, Figures 3(a1)–3(a3) present the energy maps of people sitting in chairs for study, Figures 3(b1)–3(b3) display the energy maps of a single person walking in different directions, and Figures 3(c1)–3(c3) are the energy maps of the interaction between multiple people. The corresponding feature values of f1–f12 for the individual energy maps are summarized in Table 1. It shows that similar activities yield similar feature values, and different activities result in distinct feature values.

2.4. Classification. The discriminative features extracted from the motion energy maps can now be used to identify abnormal events from the normal activities in daily life. Monitoring of abnormality in daily life cannot be restricted only to the recognition of prestudied and premodeled events. As aforementioned, there could be numerous distinct daily-life activities in an observed scene. It is extremely difficult to apply a supervised classification system, where each input sample must be manually assigned a class index. The selected classification system should also be computationally efficient in the detection stage so that it can be easily implemented for on-line, real-time monitoring. The fuzzy C-means (FCM) clustering [61] has been a widely used technique for unsupervised classification. However, the conventional clustering technique only partitions samples into clusters such that the weighted mean distance of each sample to its centroid is minimized. There is no control of the distance variance in each cluster, and it cannot handle clusters of different sizes and densities. It is extremely difficult to find a fixed global distance threshold for each cluster to separate normal and abnormal events in video images.
In this paper, a constrained clustering method is applied for training with the objective that the distance of every cluster member to its own cluster center adaptively meets a distance constraint. In order to collect sufficient representative samples of daily life under observation, the training energy maps and, thus, their corresponding discriminative features are randomly sampled from a video image sequence that spans a sufficient period for all possible day-to-day activities. Note that each single input scene image fT(x, y) has its own corresponding energy map ET(x, y). Each training sample in this study means the feature vector (f1, f2, ..., f12) of an energy map. Based on our experiments, a range between 15% and 20% of the total image frames in the video image sequence is sufficient to train the classifier. The classification system involves two processes, the learning process and the detection process, which are individually described in the following two subsections.
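The global statistics f8–f12 defined above reduce to a few array operations. The sketch below assumes a NumPy energy map and uses natural logarithms in the entropy, a detail the text leaves unspecified; the energy-weighted Hu moments f1–f7 are omitted for brevity.

```python
import numpy as np

def energy_map_features(E: np.ndarray) -> dict:
    """Features f8-f12 of Section 2.3 computed from an energy map E >= 0."""
    rng = E.max() - E.min()
    if rng == 0:
        E8 = np.zeros_like(E, dtype=np.int64)
    else:
        E8 = np.rint((E - E.min()) / rng * 255).astype(np.int64)   # Eq. (8)
    p = np.bincount(E8.ravel(), minlength=256) / E8.size
    p = p[p > 0]
    f8 = float(-(p * np.log(p)).sum())          # entropy, Eq. (9)
    f9 = float(E.max())                         # maximum energy, Eq. (10)
    f10 = float(E.sum())                        # total energy, Eq. (11)
    f11 = int((E > 0).sum())                    # area of nonzero energy, Eqs. (12)-(13)
    f12 = f10 / f11 if f11 > 0 else 0.0         # mean energy, Eq. (14)
    return {"f8": f8, "f9": f9, "f10": f10, "f11": f11, "f12": f12}
```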

Table 1: Feature values for the demonstrative energy maps in Figure 3.

Features    3(a1)     3(a2)     3(a3)     3(b1)     3(b2)     3(b3)     3(c1)     3(c2)      3(c3)
f1          1.381     1.636     1.466     2.111     2.108     2.136     2.354     2.496      2.476
f2          2.951     3.627     3.260     4.401     4.577     4.553     5.309     5.507      5.588
f3          4.678     4.630     4.798     7.227     8.090     8.711     7.555     8.126      8.534
f4          4.922     4.647     4.840     7.081     7.111     8.315     8.738     9.571      9.499
f5          9.752     9.286     9.700     14.238    14.871    16.981    16.993    20.433     20.102
f6          6.402     6.461     6.471     9.285     9.412     10.595    11.454    13.215     12.294
f7          10.175    10.829    10.041    15.189    14.854    16.978    17.089    18.421     18.517
f8          0.717     0.708     0.740     1.085     1.309     1.421     1.603     1.747      1.755
f9          186       256       340       253       249       313       361       325        640
f10         103727    106136    161232    381675    435401    513579    860751    1335763    1281152
f11         13084     13248     14248     13231     17102     20284     20063     20177      21571
f12         7         8         11        28        25        25        42        66         59

Figure 3: Demonstration examples of energy maps: (a1)–(a3) people sitting in chairs; (b1)–(b3) single person walking in different directions; (c1)–(c3) interaction between multiple people.

2.4.1. Learning Process. In this paper, we are only interested there is only one class to identify. However, different normal
in the classes of normal and abnormal events. Since the activities may have very distinct representations of energy
abnormal events are unpredictable beforehand, all training maps. We would like to group similar activities that have
samples are normal activities collected from a video sequence similar energy maps and, thus, similar feature vectors into
of daily life. They all belong to the same class, that is, the same cluster. The goal of clustering for this one-class

classification with distinct patterns problem is to assign Input. The number of clusters C in each level, maximum
similar training samples to the same cluster so that the number of clusters Cmax , and training data set X =
distance of every member in the cluster to the cluster center {x1 , x2 , . . . , xK }.
meets a minimum distance threshold.
Let X = {x1 , x2 , . . . , xK } be a set of K training samples, Step 1. Normalize the feature values.
and vi the centroid of cluster i. The distance between sample Let xk = ( fk,1 , fk,2 , . . . , fk,12 ) be the feature vector, and fk, j
xk and the centroid vi is denoted by d(xk , vi ) = xk − vi 2 . be the jth feature of sample k, for k = 1, 2, . . . , K
The objective of the proposed clustering is given by
fk, j − μ j
fk, j = , j = 1, 2, . . . , 12, (16)
σj
Min β
s.t. d(xk , vi ) ≤ μdi + β · σdi , (15) where μ j and σ j are the mean and standard deviation of
feature j for all training samples.
if xk ∈ vi , ∀k = 1, 2, . . . , K, Let xk = ( fk,1
 
, fk,2 
, . . . , fk,12 ).

Step 2. Perform the standard fuzzy C-means clustering.


where μdi and σdi are the mean and standard deviation of
Let vi be the centroid of cluster i, i = 1, 2, . . . , C, and
the distances d(xk , vi ) for all members in cluster i, and β is
a control constant. The upper control limit μdi + β · σdi is K p
k=1 wik · xk
used as an adaptive distance threshold Tdi for each individual vi = K p (17)
cluster i. Each member in its own cluster must meet the k=1 wik ,
distance constraint, and the control limit should be as tight where wik is the weight for training sample k in cluster i, p is
as possible. weighting exponent (p = 2 in this study).
In this aper, we use a hierarchical clustering technique to In each iteration, wik is updated by
group similar energy maps into clusters. In each hierarchical
level of the clustering, a small number of clusters C is given, 1
and then the standard fuzzy C-means clustering process is wik =      1/ p−1 , (18)
C  
carried out. In the resulting clusters, the Euclidean distance j =1 d xk , vi /d xk , v j

between each assigned member of the cluster and the cluster


centroid is calculated so that the mean μdi and standard where d(xk , vi ) = xk − vi 2 . Then the centroid vi is updated
deviation σdi for each cluster i can be determined. If the using the new assigned weight wik . The updating procedure
distance is less than the distance threshold Tdi , the sample is repeated until convergence.
member is retained in the cluster. Otherwise, it is removed Let Vr = {vir }Ci=1 be the resulting set of cluster centroids
from the cluster. This procedure is repeated for every cluster. at hierarchical level r (Initially, set r = 1).
At the end of the process, all removed samples are considered
as a new set of training data, and the fuzzy C-means Step 3. (a) Assign sample xk to cluster vir , for k =
clustering with C as the number of clusters is performed 1, 2, . . . , K, where i = arg minc d(xk , vcr ).
in the next hierarchical level. The clustering process is (b) Set the distance threshold of each cluster vir , i =
expanded to the lower hierarchical levels until the distance of 1, 2, . . . , C, to
every member in individual clusters is less than its distance
Tdi = μdi + β · σdi , (19)
threshold Tdi , or the maximum total number of clusters Cmax
is met. where μdi and σdi are the distance mean and standard
At the end of the hierarchical clustering process, the deviation of cluster i.
control constant β will be reduced to tighten the distance (c) Let Xir = φ and X 4 ir = φ, for i = 1, 2, . . . , C. Given that
thresholds if the distance constraints for all training samples xk ∈ vi , k = 1, 2, . . . , K, if d(xk , vir ) < Tdi , then assign
 r
under a given total number of clusters Cmax are satisfied.
= 
Otherwise, it is increased to loosen the distance thresholds. Xir ←− Xir xk , (20)
The hierarchical clustering process is then repeated with the
new control constant. The minimum value of the feasible otherwise,
control constant β can be efficiently obtained by a binary = 
search. The total number of clusters Cmax is predetermined, 4 ir ←− X
X 4 ir xk . (21)
which is related to the complexity of daily activities in
question. Experiments on various scene scenarios have At the end of the assignment, Xir contains all the members
shown that the total number of clusters around 50 and that meet the distance constraints, that is, d(xk , vir ) < Tdi , in
60 is sufficient to represent different patterns of normal cluster i. The centroid of cluster vir is updated by
activities in daily life. The number of clusters C in each
hierarchical clustering level is given by 10 in this study. 1  
vir = 2 2
2X r 2 xk , (22)
The detailed algorithm of the constrained clustering model  i
r
xk ∈Xi
with minimum adaptive distance thresholds is presented as
follows. where |Xir | is the cardinality of cluster Xir .

4 ir records the samples with d(xk , vir ) > Tdi . Let X


X 4r = abnormal event will generally last for an extended period
>C
4 r
i=1 Xi , which is the set that contains all the samples that
of time, a single alarm of xT is treated as noise. When the
violate the distance constraints at iteration r. It is passed motion energy maps have d(xT , vi∗ ) > Tdi∗ and prolong for a
along to the next hierarchical level r + 1 as a new training sufficient duration, an abnormal event is evidently detected.
set. Since the detection process involves only simple Euclidean
distance computation from a small set of cluster centroids, it
Step 4. Cluster in the lower hierarchical level. is computationally very fast.
Take X4 r as the set of new training data. Let r ← r + 1.
Repeat Steps 2 and 3 until r · C > Cmax (max. number of
4 r = φ (all samples meet the distance
clusters is violated), or X 3. Experimental Results
constraints).
This section evaluates the performance of the proposed
Step 5. Find the minimum control constant β. abnormality detection scheme from two image sequences,
4r =
If r · C > Cmax and X / φ, the current control constant β one involving the scene of a laboratory and the other
is too tight, and must be increased by setting obtained from the BEHAVE benchmark dataset. The pro-
posed algorithms were implemented using the C++ language
1 
on a Pentium 4, 3.0 GHz personal computer. The test images
β ←− β + βupper , (23)
2 in the experiments were 200 × 150 pixels wide with 8-bit
gray levels. The total computation time from foreground
else setting
segmentation to abnormality detection for an input image
1  is 0.132 seconds, of which the computation of the seven
β ←− β + βlower . (24) invariant-moments takes 0.121 seconds. It achieves a mean
2
of 7.6 fps for real-time detection of abnormal events.
Repeat Steps 2 to 5 unit Δβ < 0.1, where Δβ is the The first activity monitoring example is the daily work
difference between the old and the new β values. Currently, in a laboratory, which involves various activities of a single
the lower bound and upper bound of β are set at βlower = 0.0 person and multiple people. Some of the demonstration
and βupper = 2.0. activities in the laboratory are displayed in Figures 1 and 2.
The training image sequences were collected for two days,
The proposed clustering model can effectively assign and there are a total of 100,216 energy maps. 15% of the total
similar activities into the same cluster that all adaptively meet energy maps were randomly sampled, which corresponds to
a tight distance threshold. It is expected that normal activities 15,030 energy maps used in training. In the experiments, the
similar to the sampling ones in the training set will also have two parameters used to construct the motion energy maps
a corresponding cluster that meets the distance threshold, were set with Tenergy = 10 and γ = 0.999 for the relatively
whereas an abnormal event (the one not observed in the slow activities in the laboratory. The total number of clusters
training set) will not find any cluster that yields a distance Cmax is given by 50. The resulting minimum feasible value of
less than the threshold. the control constant β is 0.1.
> In the experiments, we simulated three abnormal events
2.4.2. Detection Process. Let Ω = ∪r Vr be the set of the final cluster centroids obtained from the training process. For a new scene image at current time frame T with the feature vector xT = (fx,1, fx,2, . . . , fx,12), the feature value is first normalized with respect to the mean μj and standard deviation σj of each feature j over the training samples, that is,

f′x,j = (fx,j − μj)/σj, j = 1, 2, . . . , 12, (25)

and let x′T = (f′x,1, f′x,2, . . . , f′x,12). The minimum distance of x′T to the cluster centroids in Ω is given by

d(x′T, vi∗) = min vi∈Ω ‖x′T − vi‖², (26)

where vi∗ = arg min vi∈Ω d(x′T, vi).

In the training process, the distance threshold Tdi of each cluster i is adaptively given by μdi + β · σdi. In the detection process, the same distance threshold of each cluster is also applied to detect abnormal events. If d(x′T, vi∗) > Tdi∗, a suspected abnormal event is declared. Otherwise, it is classified as a normal activity in daily life. Since an abnormal event must prolong for a sufficient period of time, a single alarm of xT is treated as noise. When the motion energy maps have d(x′T, vi∗) > Tdi∗ and prolong for a sufficient duration, an abnormal event is evidently detected. Since the detection process involves only simple Euclidean distance computation from a small set of cluster centroids, it is computationally very fast.
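As a concrete illustration of the per-frame rule above, the following C++ sketch normalizes the 12 energy-map features, finds the nearest trained centroid, and compares the distance with that cluster's adaptive threshold. It assumes the centroids, per-feature means and standard deviations, and per-cluster thresholds Tdi = μdi + β·σdi are already available from training; all names are illustrative.

```cpp
#include <array>
#include <limits>
#include <vector>

constexpr int kNumFeatures = 12;
using FeatureVector = std::array<double, kNumFeatures>;

struct Cluster {
    FeatureVector centroid;    // v_i in normalized feature space
    double distanceThreshold;  // T_di = mu_di + beta * sigma_di
};

// Returns true when the current energy-map features fall outside the
// threshold of their nearest cluster, i.e., a suspected abnormal frame.
bool isSuspectedAbnormal(const FeatureVector& raw,
                         const FeatureVector& mean,
                         const FeatureVector& stddev,
                         const std::vector<Cluster>& clusters) {
    FeatureVector x;                                   // Eq. (25): z-score normalization
    for (int j = 0; j < kNumFeatures; ++j)
        x[j] = (raw[j] - mean[j]) / stddev[j];

    double bestDist = std::numeric_limits<double>::max();
    double bestThreshold = 0.0;
    for (const Cluster& c : clusters) {                // Eq. (26): nearest centroid
        double d = 0.0;
        for (int j = 0; j < kNumFeatures; ++j) {
            const double diff = x[j] - c.centroid[j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; bestThreshold = c.distanceThreshold; }
    }
    return bestDist > bestThreshold;                   // exceeds T_di*: suspected abnormal
}
```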
3. Experimental Results

This section evaluates the performance of the proposed abnormality detection scheme on two image sequences, one involving the scene of a laboratory and the other obtained from the BEHAVE benchmark dataset. The proposed algorithms were implemented in the C++ language on a Pentium 4, 3.0 GHz personal computer. The test images in the experiments were 200 × 150 pixels with 8-bit gray levels. The total computation time from foreground segmentation to abnormality detection for an input image is 0.132 seconds, of which the computation of the seven invariant moments takes 0.121 seconds. This achieves a mean of 7.6 fps for real-time detection of abnormal events.

The first activity monitoring example is the daily work in a laboratory, which involves various activities of a single person and of multiple people. Some of the demonstration activities in the laboratory are displayed in Figures 1 and 2. The training image sequences were collected over two days, and there are a total of 100,216 energy maps. 15% of the total energy maps were randomly sampled, which corresponds to 15,030 energy maps used in training. In the experiments, the two parameters used to construct the motion energy maps were set to Tenergy = 10 and γ = 0.999 for the relatively slow activities in the laboratory. The total number of clusters Cmax is given by 50. The resulting minimum feasible value of the control constant β is 0.1.

In the experiments, we simulated three abnormal events: burglary, fighting, and moving furniture out of the room. All three activities are very difficult to define explicitly and to model beforehand. Scenario 1 involves only the actions of a single person. Scenarios 2 and 3 involve interactions between two people. For the burglary scenario, a person was asked to find a wallet hidden in the room as fast as possible. No further instructions on how to find the wallet were given to the pretended burglar. Figure 4(a) displays the original video sequence at varying time frames for the burglary scenario, and Figure 4(b) shows the corresponding energy maps. It can be seen that the energy in the map is weak in the early stage of the burglary activity. The energy is then accumulated, and the shape in the map becomes stable after a sufficient period of time.

For the fighting scenario, two people were fighting each other in the room. Figures 5(a) and 5(b) present, respectively, the video sequence and the corresponding energy maps for the fighting scenario. For the moving furniture scenario, two people sequentially moved a chair, a computer monitor, and other laboratory objects out of the room. Figures 6(a) and 6(b) show, respectively, the video sequence and the corresponding energy maps for the moving furniture

Figure 4: Abnormal event of a burglary scenario: (a) discrete image frames in the sequence; (b) corresponding energy maps. (Symbol t represents the frame number in the sequence with fps = 10.)

Figure 5: Abnormal event of a fighting scenario: (a) discrete image frames in the sequence; (b) corresponding energy maps.

scenario. The energy is accumulated and the shape becomes clear in the map as the activity proceeds.

When the training is done, 85% of the untrained image frames (a total of 85,186 frames) from the two-day video sequence are used to test the similarity measurement. In the total of 85,186 frames, only 27 events that have distances d(xT, vi∗) larger than the threshold Tdi∗ are falsely alarmed, and the detection results are displayed in Figure 7. Since each individual input image xT has its own corresponding cluster i∗ and, thus, a different distance threshold

Tdi∗, the plot in Figure 7 displays only the difference between the distance d(xT, vi∗) and the threshold Tdi∗, that is, Δd(xT, vi∗) = max{d(xT, vi∗) − Tdi∗, 0}. In the figure, the x-axis presents the event number and the y-axis is the excessive distance Δd. The results show that most of the 85,186 frames have distances within the control limits. All the falsely detected events last only a very few frames (as seen in the x-axis), and the excessive distance Δd is very small and less than 2 (as seen in the y-axis). The duration of the 27 falsely detected events ranges from a minimum of 0.2 seconds (1 frame) to a maximum of 4 seconds (20 frames), with a mean of 0.72 seconds (3.6 frames). Because an activity must last for some duration (i.e., a sufficient number of consecutive image frames), the isolated image frames can be classified as noise.

Figure 8 illustrates the measured distances over time for the three abnormal events of burglary, fighting, and moving furniture. The plot only displays the distance differences Δd(xT, vi∗). The scale on the x-axis in Figure 8 is exactly the same as that in Figure 7, that is, the length of an event in the x-axis of the figure also represents the duration of the activity. The results show that the abnormal activity at the beginning gives small distance values. As the abnormal activity continues, the resulting distances become distinctly large and prolong for a long duration, as seen in the x-axis and the y-axis in Figure 8. Table 2 summarizes the resulting statistics of duration and excessive distance Δd for the 2-day normal image sequence and the three abnormal events. It again reveals that the proposed detection scheme can well identify the prolonged abnormal activities with distinctly large distances Δd. The falsely alarmed events give only a very short duration with very small excessive distances and, therefore, can be effectively eliminated by introducing additional decision rules based on the event duration.

Figure 6: Abnormal event of a moving-furniture scenario: (a) discrete image frames in the sequence; (b) corresponding energy maps.

Figure 7: Excessive distances Δd over the threshold TΔd for the 27 detected events with distances beyond the control limits in the normal 2-day laboratory video sequence. (The length of each event in the x-axis represents the event duration.)

Figure 8: Excessive distances Δd of the three abnormal events in the laboratory: burglary, fighting, and moving furniture out of the room. (Note that the duration scales in the x-axis of both Figures 7 and 8 are the same.)
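A minimal C++ sketch of the duration-based decision rule mentioned above is given here: a suspected abnormal frame only raises an alarm after the suspicion has persisted for a minimum number of consecutive frames, so isolated over-threshold frames are discarded as noise. The window length is a hypothetical parameter (e.g., 50 frames ≈ 5 s at 10 fps), not a value from the paper.

```cpp
// Accumulates consecutive per-frame suspicions and confirms an alarm only
// when the suspected condition has lasted for the required duration.
class DurationFilter {
public:
    explicit DurationFilter(int minConsecutiveFrames)
        : required_(minConsecutiveFrames), count_(0) {}

    // Call once per frame with the per-frame decision; returns true only when
    // the suspicion has lasted long enough to be reported as an abnormal event.
    bool update(bool frameIsSuspected) {
        count_ = frameIsSuspected ? count_ + 1 : 0;
        return count_ >= required_;
    }

private:
    int required_;
    int count_;
};
```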

Table 2: Statistical analysis for the two-day normal video sequence of the laboratory scene and the three abnormal events.

Image sequence | Total events detected | Excessive distance Δd (Mean / Std.) | Alarm duration in sec. (Min. / Max. / Average)
Normal image sequences for 2 days (85,186 frames) | 27 | 0.33 / 0.39 | 0.2 / 4 / 0.72
Abnormal image sequence | 3 | 3.75 / 2.18 | 71.4 / 182 / 96

Table 3: False positive measures under varying event durations for the 31-day laboratory video sequence.

Category of event duration (seconds) | NFAPD: number of false alarms (per day) | MTBFA: mean time between false alarms (hours)
Event duration > 1 s | 3.6 | 6.6
Event duration > 5 s | 2.3 | 10.4
Event duration > 30 s | 1.4 | 17.1
Event duration > 60 s | 1.0 | 24.0
Event duration > 90 s | 0.7 | 34.2
Event duration > 120 s | 0.6 | 40.0

Figure 9: Box-plots of σd for the 50 clusters from the constrained clustering model and the conventional FCM.
Table 4: The definition of nine activities in the BEHAVE dataset.

Activity | Definition
In-Group | The people are in a group and not moving very much
Approach | Two people or groups with one (or both) approaching the other
Walk Together | People walking together
Ignore | Ignoring of one another
Split | Two or more people splitting from one another
Following | Being followed
Meet | Two or more people meeting one another
Fight | Two or more groups fighting
Run together | The group is running together

In order to further test the robustness of the proposed method for abnormal event detection in daily life, the same laboratory scene was continuously monitored for 31 days. There are a total of 26,784,000 image frames observed. The trained cluster centroids based on the two-day sampled images, as described previously, are also used for abnormal event detection in this long-observation sequence.

The performance of the proposed method on the 31-day image sequence is measured by the false positive rate (false alarms of normal events), given that all three abnormal events (burglary, fighting, and moving furniture) are correctly identified. Because the distances are calculated for individual image frames and an event lasts a number of consecutive image frames, the false positive rate is measured by the mean number of false alarms per day (NFAPD) and its corresponding mean time between false alarms (MTBFA). MTBFA is the average time between two consecutive events alarmed by the monitoring system. The higher the MTBFA, the higher the reliability of the monitoring system.
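As a quick consistency check, and under the assumption of continuous 24-hour monitoring, the two measures are related simply by MTBFA (in hours) = 24 / NFAPD; for instance, 2.3 false alarms per day gives 24/2.3 ≈ 10.4 hours between false alarms, which matches the corresponding entry of Table 3.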
Table 3 summarizes the NFAPD and MTBFA measures for the 31-day image sequence. The detected events are grouped into 6 categories according to their time durations. For the detected events lasting longer than 5 seconds, the mean number of false alarms is only 2.3 events per day. This indicates that the mean time between false alarms is about 10 hours, which is quite tolerable for a monitoring support system. By analyzing the falsely detected events in detail according to their durations in seconds, we found that the falsely detected events with prolonged durations are generally traceable, that is, there are assignable causes for those events, such as installing a new air conditioner in the laboratory, assembling new computer equipment by a vendor, and a tour visit to the laboratory. None of them were observed in the two-day video sequence used in training.

In order to show the effectiveness of the constrained clustering model with respect to the standard fuzzy C-means (FCM) method, the distances d(xT, vi∗) of the two methods for the 15,030 training image frames described previously are evaluated. The total number of clusters was 50 for both methods. Let σd be the standard deviation of d(xT, vi∗) for all members in a cluster. The value of σd should be as small as possible for more reliable monitoring. Figure 9 presents the box-plot that shows the maximum, minimum, median, and the lower and upper quartiles of the σd values for the resulting 50 clusters of the individual methods. It indicates that the standard FCM method generates a high variation of distances (with a mean σd of 2.93), whereas the proposed clustering model results in a smaller and more stable variation (with a mean σd of 0.93).

Surprisingly, the laboratory work can be trained in two days with a very limited number of sampled images, and the trained cluster centroids can be used to describe most of the daily work in the laboratory for over a month. It is believed that the false positive rate can be further improved by including more sampled images from sufficient observation days in the learning process.

Figure 10: Activity examples in the BEHAVE dataset: (a1) In-Group, (a2) Approach, (a3) Walk Together, (a4) Ignore, (a5) Split, and (a6) Following are scenarios in Sequence 0; (b1) Meet, (b2) Fight, and (b3) Run together are the three activities in Sequence 5, which are abnormal with respect to Sequence 0.

Table 5: The scenarios of the learning and testing sequences for the BEHAVE dataset.

Video clips Scenarios Frame number Video length


Sequence 0 In-Group, Approach, Walk Together, Ignore, Split, Following 1–11200 7 min. 27 sec.
Sequence 5 In Group, Approach, Walk Together, Split, Meet, Run Together, Fight 47300–58400 7 min. 24 sec.

Figure 11: Excessive distances Δd of the four abnormal events in Sequence 5 of the BEHAVE dataset.

Figure 12: Detection results with the 7 moment-based features f1 to f7.

The second evaluation dataset is a street scene obtained from the BEHAVE Interactions Test Case Scenarios [56]. BEHAVE is a project funded by the UK's Engineering and Physical Sciences Research Council. It involves nine different activities, such as Walk Together and Run Together, in the image sequences. The definitions of these nine activities are listed in Table 4. The BEHAVE dataset has eight video sequences, each containing a different combination of activities. The training image sequence is Sequence 0 from the BEHAVE dataset, which contains the six activities In-Group, Approach, Walk Together, Ignore, Split, and Following. The demonstration images for these six activities are shown in Figures 10(a1)–10(a6). The testing video is Sequence 5, which contains the seven events Approach, Ignore, Walk Together, Split, Meet, Run Together, and Fight. In Sequence 5, the three

activities of Meet, Fight, and Run Together are not included in the training Sequence 0. Therefore, these three activities are treated as abnormal events. The demonstration images of these three abnormal events are shown in Figures 10(b1)–10(b3).

The BEHAVE video images are captured at 25 frames per second. The activities and video lengths of the training and testing sequences are listed in Table 5. There are a total of 11,200 frames (7 minutes and 27 seconds) in Sequence 0, of which 50% (i.e., 5,600 image frames) are randomly sampled and used as the training samples. The update rate γ is set at 0.999 to construct the motion energy maps. The total number of clusters Cmax is given by 40. The resulting minimum control constant β from the training process is 0.1. The test video of Sequence 5 has a total of 11,100 frames (7 minutes and 24 seconds).

When the training is done, the whole video of Sequence 5 is used to test the detection performance. Figure 11 illustrates the testing result of Sequence 5. In the figure, the x-axis presents the frame number and the y-axis is the excessive distance Δd. The results show that all the normal activities of In-Group, Approach, Walk Together, and Split in Sequence 5 are within the control limits. There are four major abnormal events detected in Sequence 5, that is, one long fighting event, one short fighting event, one meeting event, and one running together event. The resulting distances of the four abnormal events are distinctly large and prolong for their corresponding durations, as seen in the x-axis and the y-axis in Figure 11. The running together event includes many discrete running activities where people abruptly enter and exit the street scene and, therefore, the resulting distances Δd are not continuous.

We have also conducted additional experiments on the BEHAVE dataset with various combinations of features. Figure 12 shows the detection results using only the seven moment-based features. It fails to detect the subtle activity of Running Together, and noise is also created. The long Fighting activity is not alarmed at the beginning of its duration, and the overall discrimination magnitudes for the abnormal activities are reduced.

Since feature f12 is the ratio of f10 and f11, we also evaluate the detection performance without including f12 in the feature set. As seen in Figure 13(a), the four abnormal activities Meet, long Fighting, short Fighting, and Running Together are also well detected without the use of feature f12. Comparing the detection results between Figures 11 and 13(a), the use of the 12 full features gives higher discrimination magnitudes, especially in the case of short Fighting. We have also performed the detection task by excluding features f10 and f11 (and retaining all the remaining 10 features for classification). Figure 13(b) shows the detection results. The Running Together event is misdetected, and the whole duration of the long Fighting is not fully detected.

We have also used principal component analysis (PCA) for feature selection. It finds the eigenvalues of the 12 features and sorts the features in descending order of their corresponding eigenvalues. Then 12 feature sets, each containing the dominant features from 1 to 12, are individually evaluated. The detection results consistently indicate that the use of the 12 full features generates the highest discrimination magnitudes. The discrimination power is significantly reduced when a smaller number of features is used for classification.

In order to evaluate the clustering performance between Fuzzy C-means and K-means, we have also used K-means for clustering and classification, and tested it on the BEHAVE dataset. We replicated the experiment 10 times with different random initial solutions for both FCM and K-means. Figure 14 shows representative detection results of Sequence 5 in the BEHAVE dataset. The K-means technique is less responsive to abnormal activities. Compared to the detection results of FCM in Figure 11, the K-means clustering procedure misdetects the subtle activity of Running Together. The duration of the long Fighting activity is not fully detected, and the discrimination magnitude of the short Fighting is less significant. Under the same termination criterion, K-means needs an additional 40% computation time to converge.

Figure 13: Detection results based on: (a) all features excluding f12; (b) all features excluding f10 and f11.

Figure 14: Detection results by K-means for Sequence 5 in the BEHAVE dataset.

4. Conclusions

Analysis of events has been conventionally based on the recognition of a set of predefined activities. In a scene of daily life such as home and office, it is extremely difficult to define and model every possible activity in advance. In this paper, we have proposed a macro-observation approach to

detect abnormal events such as burglary and fighting in daily life. The proposed motion energy map can simultaneously represent both the spatial context and the temporal context of an activity. All historical image frames are taken into account with exponential weights to construct the energy map. It alleviates the limitation of using a fixed duration for various activities with different paces. The constrained clustering model can effectively divide numerous activities in daily life into groups based on their similarity in energy maps. By training a sufficient number of randomly sampled energy maps in a video sequence that spans sufficient repetition of day-to-day activities, all normal events can be effectively represented by the cluster centroids. This allows fast computation of the similarity measure for each new scene image. The proposed method can therefore be applied for on-line, real-time monitoring of unpredictable abnormal events in daily life.

The merit of this paper is to show the feasibility of the easily implemented macro-observation approach for abnormality detection in daily life. The proposed scheme in its present form can well detect abnormal events with prolonged durations, especially those lasting tens of seconds or more. It is not highly responsive to events that last only a few seconds. The spatiotemporal representation and similarity metric for the analysis of short-term activities in daily life are worth further investigation.

References

[1] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time sequential images using hidden Markov models," in Proceedings of the International Conference on Pattern Recognition (ICPR '92), pp. 379–385, 1992.
[2] R. Polana and R. Nelson, "Low level recognition of human motion (or how to get your man without finding his body parts)," in Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pp. 77–82, Austin, Tex, USA, November 1994.
[3] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[4] H. Roh, S. Kang, and S.-W. Lee, "Multiple people tracking using an appearance model based on temporal color," in Proceedings of the International Conference on Pattern Recognition (ICPR '00), pp. 643–646, 2000.
[5] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809–830, 2000.
[6] Y. Chen, Y. Rui, and T. S. Huang, "JPDAF based HMM for real-time contour tracking," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), pp. 543–550, December 2001.
[7] I. Haritaoglu, D. Harwood, and L. S. Davis, "Ghost: a human body part labeling system using silhouettes," in Proceedings of the International Conference on Pattern Recognition (ICPR '98), pp. 77–82, 1998.
[8] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Computer Vision and Image Understanding, vol. 81, no. 3, pp. 231–268, 2001.
[9] C.-W. Chu, O. C. Jenkins, and M. J. Matarić, "Markerless kinematic model and motion capture from volume sequences," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), pp. 475–482, June 2003.
[10] G. J. Brostow, I. Essa, D. Steedly, and V. Kwatra, "Novel skeletal representation for articulated creatures," in Proceedings of the 8th European Conference on Computer Vision (ECCV '04), vol. 3023 of Lecture Notes in Computer Science, pp. 66–79, May 2004.
[11] R. Ishiyama, H. Ikeda, and S. Sakamoto, "A compact model of human postures extracting common motion from individual samples," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 1, pp. 187–190, August 2006.
[12] A. Sundaresan and R. Chellappa, "Segmentation and probabilistic registration of articulated body models," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 2, pp. 92–95, August 2006.
[13] C.-C. Chen, J.-W. Hsieh, Y.-T. Hsu, and C.-Y. Huang, "Segmentation of human body parts using deformable triangulation," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 1, pp. 355–358, August 2006.
[14] R. Navaratnam, A. Thayananthan, P. H. S. Torr, and R. Cipolla, "Hierarchical part-based human body pose estimation," in Proceedings of the British Machine Vision Conference (BMVC '05), pp. 479–488, 2005.
[15] C. Bregler, "Learning and recognizing human dynamics in video sequences," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), pp. 568–574, June 1997.
[16] Y. Yacoob and M. J. Black, "Parameterized modeling and recognition of activities," Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232–247, 1999.
[17] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831–843, 2000.
[18] M. Brand and V. Kettnaker, "Discovery and segmentation of activities in video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 844–851, 2000.
[19] S. S. Intille and A. F. Bobick, "Recognizing planned, multiperson action," Computer Vision and Image Understanding, vol. 81, no. 3, pp. 414–445, 2001.
[20] H. Buxton, "Learning and understanding dynamic scene activity: a review," Image and Vision Computing, vol. 21, no. 1, pp. 125–136, 2003.
[21] C. P. Town, "Ontology-driven Bayesian networks for dynamic scene understanding," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 7, p. 116, 2004.
[22] A. F. Bobick and A. D. Wilson, "A state-based approach to the representation and recognition of gesture," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 12, pp. 1325–1337, 1997.
[23] Q. Dong, Y. Wu, and Z. Hu, "Gesture segmentation from a video sequence using greedy similarity measure," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 1, pp. 331–334, August 2006.
[24] N. V. Boulgouris, K. N. Plataniotis, and D. Hatzinakos, "Gait recognition using linear time normalization," Pattern Recognition, vol. 39, no. 5, pp. 969–979, 2006.

[25] N. V. Boulgouris and Z. X. Chi, “Human gait recognition [43] J. W. Davis and A. F. Bobick, “Representation and recognition
based on matching of body components,” Pattern Recognition, of human movement using temporal templates,” in Proceed-
vol. 40, no. 6, pp. 1763–1770, 2007. ings of the IEEE Computer Society Conference on Computer
[26] M. Shah, O. Javed, and K. Shafique, “Automated visual Vision and Pattern Recognition (CVPR ’97), pp. 928–934, June
surveillance in realistic scenarios,” IEEE Multimedia, vol. 14, 1997.
no. 1, pp. 30–39, 2007. [44] A. F. Bobick and J. W. Davis, “The recognition of human
[27] R. Cutler and M. Turk, “View-based interpretation of real- movement using temporal templates,” IEEE Transactions on
time optical flow for gesture recognition,” in Proceedings Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp.
of the International Conference Automatic Face and Gesture 257–267, 2001.
Recognition, pp. 416–421, 1998. [45] G. R. Bradski and J. W. Davis, “Motion segmentation and pose
[28] M. J. Black, “Explaining optical flow events with parame- recognition with motion history gradients,” Machine Vision
terized spatio-temporal models,” in Proceedings of the IEEE and Applications, vol. 13, no. 3, pp. 174–1843, 2002.
Computer Society Conference on Computer Vision and Pattern [46] S.-F. Wong and R. Cipolla, “Continuous gesture recognition
Recognition (CVPR ’99), pp. 326–332, June 1999. using a sparse Bayesian classifier,” in Proceedings of the 18th
[29] Y. Ke, R. Sukthankar, and M. Hebert, “Efficient visual event International Conference on Pattern Recognition (ICPR ’06),
detection using volumetric features,” in Proceedings of the 10th vol. 1, pp. 1084–1087, China, August 2006.
IEEE International Conference on Computer Vision (ICCV ’05), [47] J. Davis and A. Bobick, “Virtual PAT: a virtual personal
pp. 166–173, October 2005. aerobics trainer,” in Proceedings of Perceptual User Interfaces,
[30] T. Ogata, W. Christmas, J. Kittler, and S. Ishikawa, “Improving pp. 13–18, 1998.
human activity detection by combining multi-dimensional [48] C. Shan, Y. Wei, X. Qiu, and T. Tan, “Gesture recognition
motion descriptors with boosting,” in Proceedings of the 18th using temporal template based trajectories,” in Proceedings
International Conference on Pattern Recognition (ICPR ’06), pp. of the 17th International Conference on Pattern Recognition
295–298, August 2006. (ICPR ’07), vol. 3, pp. 954–957, August 2004.
[31] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, “Recognizing [49] N. Vaswani, A. R. Chowdhury, and R. Chellappa, “Activity
action at a distance,” in Proceedings of the 9th IEEE Interna- recognition using the dynamics of the configuration of
tional Conference on Computer Vision, pp. 726–733, October interacting objects,” in Proceedings of the IEEE Computer
2003. Society Conference on Computer Vision and Pattern Recognition
[32] N. Johnson and D. Hogg, “Learning the distribution of (CVPR ’03), vol. 2, pp. 633–640, June 2003.
[50] W. Hu, D. Xie, T. Tan, and S. Maybank, “Learning activity
object trajectories for event recognition,” Image and Vision
patterns using fuzzy self-organizing neural network,” IEEE
Computing, vol. 14, no. 8, pp. 609–615, 1996.
Transactions on Systems, Man, and Cybernetics B, vol. 34, no.
[33] A. Madabhushi and J. K. Aggarwal, “A Bayesian approach to
3, pp. 1618–1626, 2004.
human activity recognition,” in Proceedings of IEEE Interna-
[51] D. J. Fleet, M. J. Black, Y. Yacoob, and A. D. Jepson,
tional Workshop on Visual Surveillance (VS ’99), pp. 25–32,
“Design and use of linear models for image motion analysis,”
1999.
International Journal of Computer Vision, vol. 36, no. 3, pp.
[34] J. Owens and A. Hunter, “Application of the self-organizing
171–193, 2000.
map to trajectory classification,” in Proceedings of IEEE [52] E. L. Andrade, R. B. Fisher, and S. Blunsden, “Detection of
International Workshop on Visual Surveillance (VS ’00), pp. 77– emergency events in crowded scenes,” in Proceedings of the
83, 2000. Institution of Engineering and Technology Conference on Crime
[35] T. W. Liao, “Clustering of time series data—a survey,” Pattern and Security, vol. 2, pp. 528–532, London, UK, 2006.
Recognition, vol. 38, no. 11, pp. 1857–1874, 2005. [53] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, “Robust real-
[36] C. Piciarelli and G. L. Foresti, “On-line trajectory clustering time unusual event detection using multiple fixed-location
for anomalous events detection,” Pattern Recognition Letters, monitors,” IEEE Transactions on Pattern Analysis and Machine
vol. 27, no. 15, pp. 1835–1842, 2006. Intelligence, vol. 30, no. 3, pp. 555–560, 2008.
[37] C. Stauffer and W. E. L. Grimson, “Learning patterns of [54] H. Zhong, J. Shi, and M. Visontai, “Detecting unusual activity
activity using real-time tracking,” IEEE Transactions on Pattern in video,” in Proceedings of the IEEE Computer Society Confer-
Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, ence on Computer Vision and Pattern Recognition (CVPR ’04),
2000. vol. 2, pp. 819–826, July 2004.
[38] H. Murase and R. Sakai, “Moving object recognition in [55] N. Rea, R. Dahyot, and A. Kokaram, “Semantic event detection
eigenspace representation: gait analysis and lip reading,” in sports through motion understanding,” in Proceedings of
Pattern Recognition Letters, vol. 17, no. 2, pp. 155–162, 1996. the 3rd International Conference on Image and Video Retrieval
[39] M. J. Black and A. D. Jepson, “EigenTracking: robust matching (CIVR ’04), vol. 3115 of Lecture Notes in Computer Science, pp.
and tracking of articulated objects using a view-based repre- 88–97, July 2004.
sentation,” International Journal of Computer Vision, vol. 26, [56] “BEHAVE Interactions Test Case Scenarios,” University
no. 1, pp. 63–84, 1998. of Edinburgh, 2007, http://groups.inf.ed.ac.uk/vision/
[40] M. M. Rahman and S. Ishikawa, “Recognizing human BEHAVEDATA/INTERACTIONS/.
behaviors employing global eigenspace,” in Proceedings of the [57] C. Stauffer and W. E. L. Grimson, “Adaptive background
International Conference on Pattern Recognition (ICPR ’02), mixture models for real-time tracking,” in Proceedings of the
2002. IEEE Computer Society Conference on Computer Vision and
[41] J. Wei, “Video content classification based on 3-D eigen Pattern Recognition (CVPR ’99), vol. 2, pp. 246–252, June 1999.
analysis,” IEEE Transactions on Image Processing, vol. 14, no. [58] P. KaewTrakulPong and R. Bowden, “An improved adap-
5, pp. 662–673, 2005. tive background mixture model for real-time tracking with
[42] M. M. Rahman and S. Ishikawa, “Human motion recognition shadow detection,” in Proceedings of the 2nd European Work-
using an eigenspace,” Pattern Recognition Letters, vol. 26, no. 6, shop on Advanced Video Based Surveillance Systems, pp. 149–
pp. 687–697, 2005. 158, Kingston, UK, 2001.

[59] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis,


“Background and foreground modeling using nonparametric
kernel density estimation for visual surveillance,” Proceedings
of the IEEE, vol. 90, no. 7, pp. 1151–1162, 2002.
[60] M.-K. Hu, “Visual pattern recognition by moment invariants,”
IEEE Transactions on Information Theory, vol. 8, no. 2, pp.
179–187, 1962.
[61] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function
Algorithms, Kluwer Academic Publishers, Norwell, Mass, USA,
1981.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 752567, 11 pages
doi:10.1155/2010/752567

Research Article
Pedestrian Validation in Infrared Images by Means of
Active Contours and Neural Networks

Massimo Bertozzi,1 Pietro Cerri,1 Mirko Felisa,1 Stefano Ghidoni,2 and Michael Del Rose3
1 VisLab, Dipartimento di Ingegneria dell’Informazione, Università di Parma, 43124 Parma, Italy
2 IAS-Lab, Dipartimento di Ingegneria dell’Informazione, Università di Padova, 35131 Padova, Italy
3 Vetronics Research Center, U. S. Army TARDEC, MI 48397, USA

Correspondence should be addressed to Massimo Bertozzi, [email protected]

Received 30 November 2009; Accepted 31 March 2010

Academic Editor: Robert W. Ives

Copyright © 2010 Massimo Bertozzi et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

This paper presents two different modules for the validation of human shape presence in far-infrared images. These modules are
part of a more complex system aimed at the detection of pedestrians by means of the simultaneous use of two stereo vision systems
in both far-infrared and daylight domains. The first module detects the presence of a human shape in a list of areas of attention
using active contours to detect the object shape and evaluating the results by means of a neural network. The second validation
subsystem directly exploits a neural network for each area of attention in the far-infrared images and produces a list of votes.

1. Introduction

During the last years, pedestrian detection has been a key topic of the research on intelligent vehicles. This is due to the many applications of this functionality, like driver assistance, surveillance, or automatic driving systems; moreover, the heavy investments made by almost all car manufacturers in this kind of research prove that particular attention is now focused on improving road safety, especially on reducing the high number of pedestrians injured every year. The U.S. Army is also actively developing systems for obstacle detection, path following, and anti-tamper surveillance for its robotic fleet [1, 2].

Finding pedestrians from a moving vehicle is, however, one of the most challenging tasks in the artificial vision field, since a pedestrian is one of the most deformable objects that can appear in a scene. Moreover, the automotive environment is often barely structured, incredibly variable, and apparently moving, due to the fact that the camera itself is in motion; therefore, really few assumptions can be made about the scene.

This paper describes two modules for pedestrian validation developed for integration into a vision-based obstacle detection system to be installed on an autonomous military vehicle. This system is able to detect all obstacles appearing in the scene and is based on the simultaneous use of two stereo camera systems: two far-infrared cameras and two daylight cameras [3]. The first stages of this system provide a reliable detection of image areas that potentially contain pedestrians; the following stages are devoted to refining and filtering these rough results to validate the pedestrians' presence. The validation is based on a multivote system: several approaches are independently used to analyze areas of attention, and each subsystem outputs a vote describing how likely the obstacle is to be a pedestrian. Then a final validation is done, based on all votes.

This paper describes two of the intermediate validation modules. The first one, in an initial stage, extracts the object shape by means of active contours [4], and then provides a vote using a neural network-based approach. The second validation stage directly exploits a neural network for evaluating the presence of human shapes in far-infrared images.

This paper is organized as follows. Section 2 describes related work in pedestrian detection systems based on artificial vision. The pedestrian detection system is discussed in Section 3. The module for the active contour-based shape detection algorithm is detailed in Section 4, while Section 5

describes the neural network-based validation step. Finally, use of neural network on images. As an example, in [20],
Section 6 ends the paper presenting few results and remarks convolutional neural networks are used as feature extractor
on the system. and classifier.

2. Related Work 3. System Description


For the U.S. Army the use of vision as a primary sensor for the The algorithms described in this work have been developed
detection of human shapes is a natural choice since cameras as a part of a tetravision-based pedestrian system [3, 21].
are noninvasive sensors and therefore do not emit signals. The whole architecture is based on the simultaneous use
Vision-based systems for pedestrian detection have been of two far-infrared and two daylight cameras. Thanks
developed exploiting different approaches, like the use of to this approach, the system is able to detect obstacles
monocular [5, 6] or stereo [7, 8] vision. Many systems and pedestrians when the use of infrared devices is more
based on the use of a stationary camera employ simple appropriate (night, low-illumination conditions, etc.) or,
segmentation techniques to obtain foreground region; but conversely, in case visible cameras are more suitable for the
this approach fails when the pedestrians have to be detected detection (hot, sunny environments, etc.).
from moving platforms. Most of the current approaches
In fact, FIR images convey a type of information that
for pedestrian detection using moving cameras treat the
is very different from those in the visible spectrum. In the
problem as a recognition task: a foreground detection is
infrared domain the image of an object depends on the
followed by a recognition step to verify the presence of a
amount of heat it emits, namely, it is generally related to
pedestrian. Some systems use motion detection [7, 9] or
its temperature (see Figure 1). Conversely, in the visible
stereo analysis [10] as a means of segmentation.
domain, objects appearance depends on how the surface
Other systems substitute the segmentation step with
of the object reflects the incident light as well as on the
a focus-of-attention approach, where salient regions in
illumination conditions.
feature maps are considered as candidates for pedestrians.
In the GOLD system [11], vertical symmetries are associated Since humans usually emit more heat than other objects
with potential pedestrians. In [12] the local image entropy like trees, background, or road artifacts, the thermal shape
directs the focus-of-attention followed by a model-matching can be often successfully exploited for pedestrian detection.
module. In such cases, pedestrians are in fact brighter than the back-
For what concerns the recognition phase, recent ground. Unfortunately, other road participants or artifacts
researches are often motion based, shape based, or multicue emit heat as well (cars, heated buildings, etc.). Moreover,
based. Motion-based approaches use the periodicity of infrared images are blurred and have a poor resolution and
human gait or gait patterns for pedestrian detection [7, 12]. the contrast is low compared with rich and colorful visible
These approaches seem to be more reliable than shape- images.
based ones, but they require temporal information and are Consequently, both visible and far-infrared images are
unable to correctly classify pedestrians that are still or have used for reducing the search space.
an unusual gait pattern. Figure 2 depicts the overall algorithm flow for the
Shape-based approaches rely on pedestrians appearance; complete pedestrian system. Different approaches have
therefore both moving and stationary people can be detected been developed for the initial detection in the two image
[11, 13]. In these approaches, the challenge is to model domains: warm areas detection, vertical edges detection, and
the variations of the shape, pose, size and appearance of an approach based on the simultaneous computation of
humans, and their background. Basic shape analysis methods disparity space images in the two domains [3, 21].
consist in matching a template with candidate foreground These first stages of detection output a list of areas of
regions. In [14], a tree-based hierarchy of human silhouettes attention in which pedestrians can be potentially detected.
is constructed and the matching follows a coarse-to-fine Each area of attention is labelled using a bounding box.
approach. In [15, 16], probabilistic templates are used to A symmetry-based approach is further used to refine this
take into account the possible variations in human shape. rough result in order to resize bounding boxes or to separate
As a final step of the recognition task, some systems also bounding boxes that can contain more pedestrians.
exploit pattern-recognition techniques based on the use of These two steps in the processing, barely, take into
classifiers, or in combination with a shape analysis with gait account specific features of pedestrians; in fact, only sym-
detection [14, 17]. metrical and size considerations are used to compute the
For the task of human shape classification, the most com- list of bounding boxes. Therefore, independent validation
mon classifiers are support vector machine [18], adaboost modules are used to evaluate the presence of human shapes
[19], and neural networks. Concerning the systems adopting inside the bounding boxes. These stages exploit specific
the neural networks approach, most of them first extract pedestrian characteristics to discard false positives from the
features from images, and then use these features as the input list of bounding boxes. In the following paragraphs the
of the classifier. In [10], foreground objects are first detected two validators shown as bold in Figure 2 are described and
through foreground/background segmentation, and then detailed.
classified as pedestrian or nonpedestrian by a trained neural A final decision step is used to balance the votes of
network. Conversely, other systems are based on the direct validators for each bounding box.

(a) (b)

Figure 1: Examples of typical scenarios in FIR and visible images.

Warm area Active Neural


contour network
Infrared Stereo

Final validator
Neural
Edge network
Merge Symmetry
Head
FIR
Tetra
vision Probabilistic
VIS
model
Visible
Detection Symmetry Validator

Figure 2: Overall algorithm flow.

4. Active Contour-Based Validator efficiently guide the snake toward the desired image features,
and on the other hand, a correct decision on the snake
As previously discussed, the pedestrian validation step is internal parameters that should provide to the snake the
composed by several validators, each one supplying a vote desired “mechanical” properties.
that is then provided to the final evaluation step. The Regarding external forces, it should be noted that they
validator detailed in this section is based on the analysis of must generate something similar to an energy field: it is
a pedestrian shape, which can be extracted using the well- therefore not enough to choose the important features, but
known active contour models, also known as snakes. rather, a method must also be defined, in order to create the
field: the snake behavior should be affected by the features
4.1. Active Contour Models. Active contour models are widely also at a certain distance—this, after all, is the meaning of
used in pattern recognition for extracting an object shape. force field.
First introduced by [22], this topic has been extensively Every point composing the snake reaches a local energy
explored also in the last years. Basically, a snake is a curve minimum; this means that the active contour does not find
described by the parametric equation v(s) = (x(s), y(s)), a global optimum position; rather, since it is based on local
where s is the normalized length, assuming values in the minimization, the final position strongly depends on the
range [0, 1]. This continuous curve becomes, in a discrete initial condition, that is, the initial snake position.
domain, a set of points that are pushed by some energies that Because initial stages of the pedestrian detection system
depend on the specific problem being addressed. Indeed, on provide a bounding box for each detected object, the snake
the image domain, over which a snake moves, energy fields initial position can be chosen as the bounding box contour;
are defined, which affect the snake movements. Such energy then, a contracting behavior should be impressed, to force
fields depend on the original image, or on an image obtained the snake to move inside the bounding box. Other energies
by processing the original one, in order to highlight those must also be introduced to make the snake stop when the
features by which the snake should be attracted. object contour is reached.
The points of the contour then move according to both It was said that there are two kinds of forces, and
these external forces and other forces that are said to be associated energies that control snake movements and that
internal to the snake, that is, that control the way each snake can be divided into two different categories: internal and
point influences its neighbors. external. Because internal energy comes from interactions
The two challenges when dealing with snakes are, on between points, it depends only on the topology of the snake,
one hand, a good choice of the external forces, in order to and controls the continuity of the curve derivatives; it is

evaluated by the equation

Eint = α(s)|vs(s)|² + β(s)|vss(s)|², (1)

where vs(s) and vss(s) are, respectively, the first and second derivatives of v(s) with respect to s. The first contribution appearing in the sum represents the tension of the snake, which is responsible for the elastic behavior; the second one gives the snake resistance to bending; α(s) and β(s) are weights. Therefore, internal energy controls the snake mechanical properties but is independent of the image; external energy, on the contrary, causes the snake to be attracted to the desired features, and should therefore be a function of the image.

Analytically, the snake will try to minimize the whole energy balance, given by the equation

Esnake = ∫₀¹ (Eint(v(s)) + Eext(v(s))) ds. (2)

Because energies are the only way to control a snake, a proper choice of both internal and external energies should be made. In particular, the external energy depending on the image must decrease in the regions where the snake should be attracted. In the following, the energies adopted to obtain an object shape are described.

As previously said, the initial snake position is chosen to be along the bounding box contour. In this system both visible and far-infrared images are available, but the latter seem much more convenient when dealing with pedestrians, due to the thermal difference between a human being and the background [3].

To extract a pedestrian shape, the Sobel filter output is a useful starting point; moreover, the edge image is also needed by previous steps of the recognition algorithm and is therefore already available. A Gaussian smoothing filter is then applied to enlarge the edges, and consequently the area capable of influencing the snake behavior, that is, the area where the field generated by external forces is sensible. The resulting image is then associated with an energy field that pushes the snake towards the edges: for this reason, the brighter a pixel in that image, the lower the associated energy; in this way, snaxels (the points into which the snake is discretized) are attracted by the strongest edges; see Figure 3.

Bright regions of the original FIR image are also considered. In fact, smoothed edges do not accurately define the object contour (mainly because they are smoothed): snake contraction has to be arrested by bright regions in the FIR image that can belong to a portion of a human body (see Figure 4). This method lets the snake correctly adapt to a body shape in a lot of situations, and it should also be noticed that this mechanism works only if there are hot regions inside the bounding box; a useful side effect, then, is an excessive snake contraction when there are no warm blobs inside a bounding box.

The minimum energy location is found by iteratively moving each snaxel, following an energy minimization algorithm. Many of them have been proposed in the literature. For this application, the greedy snake algorithm [23], applied on a 5 × 5 neighborhood, was adopted.

During the initial iterations, the snake tends to contract due to the elastic energy; this tendency stops when some other energy counterweights it, for instance, the presence of edges or a light image region. While adapting to the object shape, the snake length decreases, as well as the mean distance between two adjacent snaxels. Since this mean distance is a value that affects the internal energy, in order to keep the elastic property almost constant also during strong contraction, the snake is periodically resampled using a fixed step; in this way some unwanted snaxel accumulation can be avoided.

Due to the iterative nature of the snake contraction, computational times are not negligible. On a Core2 CPU working at 2.13 GHz the algorithm needs a time that is below 20 ms for each snake, and sensibly lower for small targets. This computational load makes the use of this technique feasible in a system that is asked to work at several frames per second, like the one being described.

4.2. Double Snake. The active contour technique turned out to be effective, but it showed some weaknesses when adapting to concave shapes, like those created by a pedestrian when his legs are open. In this case, the active contour needs to sensibly extend its length while wrapping around the concave shape, but this process is usually not complete because of the elastic energy. Moreover, the initialization, that is, the initial configuration of the snake, strongly influences the shape extracted at the end of the process. To increase the capability of adapting to concave shapes, and to partially solve the dependence on the initialization, the study in [24] proposed a technique based on two snakes: a snake external to the shape to recover, like the one previously discussed, and a new one, placed inside the pedestrian shape, that tends to adapt from inside, driven by a force that makes the snake expand instead of contracting. Moreover, the two snakes do not evolve independently, but rather interact; how they do that is a key point in the development of this technique.

The simplest interaction is obtained by adding in (2) a contribution that depends on the position of the other snake, so that each one tends to move towards the other. Note, however, that there is no guarantee that the two snakes will get very close, as there can be strong forces that make the two snakes remain far from each other; for this reason, the tuning of the parameters in the energy calculation should be carefully performed, so that the force between the two contours can balance the other components. This task turns out to be particularly difficult when dealing with images taken in the automotive scenario, which usually present a huge amount of detail and noise; it is in fact very difficult to find a set of parameters providing a good attraction between the two snakes while, at the same time, letting them move freely towards the desired image features.

A different coupling strategy, instead, guarantees that the two snakes get very close to each other. Such behavior is based on the idea that, at each iteration, every snaxel should move towards the corresponding snaxel on the other snake.

Figure 3: Energy field due to edges: (a) original image, (b) edge image obtained using the Sobel operator and Gaussian smoothing, and (c) edge energy functional with inverted sign, to obtain a more effective graphical representation.

Figure 4: Energy field due to the image: (a) original image and (b) intensity energy functional with inverted sign, to obtain a more effective graphical representation.

Snaxels are therefore coupled, so that each snaxel in one snake has a corresponding one in the other contour. Then, during the iteration process, snaxel couples are considered: for each of them, one of the points is moved towards the other one, the latter remaining in the same position; the moving point is chosen so that the energy of the couple is minimized. In general, the number of points is different for the two snakes; this means that a snaxel of the shorter contour can be included in more than one couple: such points have a greater probability of being moved, but this effect does not jeopardize the shape extraction.

In this approach the energy balance is still considered, but here it has a slightly different meaning, because it is used to choose which snaxel in the couple should move. This gives great power to the force that attracts the two snakes, and the drawback is that they can therefore neglect the other forces, namely, the features of the image that should attract them. To mitigate this power, every two iterations with the new algorithm, an iteration with the classical greedy snake algorithm is performed, so that the snakes are better influenced by the image and by the internal energy. This solution turned out to be the most effective one.

Some examples and performance comparisons of contour extraction are presented in Figure 5; in the left column, a simple case is presented: the contour of the same pedestrian is extracted using the single snake technique (a) and the double snake (c). Then, in (b) and (d) a more complex scene is considered: together with a pedestrian, some other obstacles are detected in the frame; all of the contours are extracted for the classification. In this case, the behavior of the shape extractor can be analyzed when dealing with obstacles other than pedestrians, which are usually colder than a human being: as a result, in the FIR images they will appear dark, and will therefore lack the features that attract the snakes. In this situation, contours extracted using the double snake algorithm (d) tend to become similar to a square, and are clearly different from the shape of a pedestrian; this difference is not as marked using the single snake technique, as can be seen in (b).

4.3. Neural Network Classification. Once the shape of each obstacle is extracted, it has to be classified, in order to obtain a vote to provide to the final validator. Obstacle shapes extracted using the active contour technique are validated using a neural network.

Before being validated, extracted shapes must be further processed: the neural network needs a given number of input data, but each snake has a number of points that depends on its length. For this reason, each snake is resampled with a fixed number of points, and the coordinates are normalized in the range [0; 1]. The neural network has 60 input neurons, two for each of the 30 points of the resampled snake, and only one output neuron that provides the probability that the contour represents a pedestrian; such probability will be, again, in the range [0; 1].
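The following C++ fragment is a minimal sketch of the preprocessing just described: the snake is resampled to a fixed number of points and the coordinates are normalized to the unit range, yielding the 60 network inputs (x and y for each of the 30 resampled points). The index-based resampling and bounding-box normalization shown here are simplifying assumptions for illustration; the paper does not specify the exact resampling scheme.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Converts an extracted contour into a fixed-length, [0, 1]-normalized
// feature vector suitable as neural network input (2 * numSamples values).
std::vector<double> contourToNetworkInput(const std::vector<Point>& snake,
                                          std::size_t numSamples = 30) {
    std::vector<double> input;
    if (snake.empty() || numSamples == 0) return input;

    // Bounding box of the contour, used to normalize coordinates into [0, 1].
    double minX = snake[0].x, maxX = snake[0].x;
    double minY = snake[0].y, maxY = snake[0].y;
    for (const Point& p : snake) {
        minX = std::min(minX, p.x); maxX = std::max(maxX, p.x);
        minY = std::min(minY, p.y); maxY = std::max(maxY, p.y);
    }
    const double spanX = std::max(maxX - minX, 1e-9);
    const double spanY = std::max(maxY - minY, 1e-9);

    // Uniform resampling by snaxel index (a simple stand-in for
    // arc-length resampling), producing exactly numSamples points.
    input.reserve(2 * numSamples);
    for (std::size_t k = 0; k < numSamples; ++k) {
        const std::size_t idx = (k * snake.size()) / numSamples;
        input.push_back((snake[idx].x - minX) / spanX);
        input.push_back((snake[idx].y - minY) / spanY);
    }
    return input;
}
```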

Figure 5: Examples of shape extraction. In (a), the contour of a pedestrian is extracted using the single snake algorithm, while (c) shows the result when the double snake technique is used; it can be seen that the contour is smoother in the latter case. In (b) a more complex situation is analyzed using the single snake technique, and (d) presents the same scene analyzed by the double snake algorithm (the red contour is the inner one, while the green snake is the outer one).

For the training of the network, a dataset of 1200 pedestrian contours and roughly the same number of contours of other objects has been used. They have been chosen from a number of short sequences of consecutive frames, so that each pedestrian appeared in different positions, while avoiding the use of too many snakes of the same pedestrian. During the training phase, the target output has been chosen as 0.95 and 0.05 for pedestrians and nonpedestrians, respectively; extreme values, like 0 or 1, have been avoided, because they could have caused some weighting parameters inside the network to assume too high a value, with a negative influence on the performance.

This classifier was tested on several sequences. Recall that the output of the neural network is the probability that an obstacle is a pedestrian; it is therefore interesting to analyze which values are assigned to pedestrians and to other objects on the test sequences. Output values of the network are shown in Figure 6(a), which represents the output value distribution when pedestrians are classified, while (b) is the distribution when contours of objects that are not pedestrians are analyzed.

It can be seen that the classification results are accurate, and this classifier was therefore included in the global system depicted in Figure 2. Moreover, the performance was also evaluated considering this classifier by itself, and not as a part of a greater system. A threshold was therefore calculated to obtain a hard decision; the best value turned out to be 0.4, which provided a correct classification of 79% of pedestrians and 85% of other objects.

The computational time of a neural network can be neglected, since it is anyway below 1 ms.

5. Neural Network-Based Validator

This section describes the neural network-based validator shown in Figure 2. A feed-forward multilayer neural network is exploited to evaluate the presence of pedestrians in the bounding boxes detected by previous stages. Since neural networks can express highly nonlinear decision surfaces, they are especially appropriate for classifying objects that present a high degree of shape variability, like a pedestrian. A trained neural network can implicitly represent the appearance of pedestrians in various poses, postures, sizes, clothing, and occlusion situations.

In the system described here, the neural network is directly trained on infrared images. Generally, neural network-based systems working on daylight images do not exploit the image directly; in fact, it is not appropriate for encoding the pedestrian features, since pedestrians present a high degree of variability in color and texture and, moreover, the intensity image is sensitive to illumination changes. Conversely, in the infrared domain the image of an object depends on its thermal features and therefore it is nearly invariant to color, texture, and illumination changes. The thermal footprint is useful information for the neural network to evaluate the pedestrian presence and, therefore, it is exploited as a direct input for the net (Figure 7). Since
EURASIP Journal on Advances in Signal Processing 7

0.35

0.3

0.25
Frequency

0.2

0.15

0.1

0.05

0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Probability Input layer Hidden layer Output layer
(a)
Figure 7: A three-layer feed-forward neural network: each neuron
0.5 is connected to all neurons of the following layer. The infrared
0.45 bounding boxes are exploited as input of the network.
0.4
0.35
0.3
Frequency

and temperature conditions and to avoid the overfitting.


0.25 Moreover, an additional test set has been created in order to
0.2 evaluate the performance of the validator.
0.15 The network parameters are initialized by small random
0.1 numbers between 0.0 and 1.0, and are adapted during
the training process. Therefore, the pedestrian features are
0.05
learnt from the training examples instead of being statically
0 predetermined. The network is trained to produce an output
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
of 0.9 if a pedestrian is present, and 0.1 otherwise. Thus,
Probability
the detected object is classified thresholding the output value
(b) of the trained network: if the output is larger than a given
threshold, then the input object is classified as a pedestrian,
Figure 6: Distribution of the neural network output values. On
otherwise as a nonpedestrian.
the x-axis are plotted the probability values given by the neural
network, while on the y-axis is reported the occurrence of each A weakness of the neural network approach is that it
probability value when the shapes of pedestrians (a) and other can be easily overfitted, namely, the net steadily improves
objects (b) are analyzed. its fitting with the training patterns over the epochs, at the
cost of diminishing the ability to generalize to patterns never
seen during the training. The overfitting, therefore, causes an
error rate on validation data larger than the error rate on the
a neural network needs a fixed-sized input ranged from 0 to training data. To avoid the overfitting, a careful choice of the
1, the bounding boxes are resized and normalized. training set, the number of neurons in the hidden layer, and
The net has been designed as follows: the input layer is the number of training epochs must be performed.
composed by 1200 neurons, corresponding to the number of In order to compute the optimal number of training
pixels of resized bounding boxes (20 × 60). The output layer epochs, the error on validation dataset is computed while
contains a single neuron only and its output corresponds to the network is being trained. The validation error decreases
the probability that the bounding box contains a pedestrian in the early epochs of training but after a while it begins to
(in the interval [0,1]). The net features a single hidden increase. The training session is stopped if a given number
layer. The number of neurons in the hidden layer has been of epochs have passed without finding a better error on
computed trying different solutions; values in the interval validation set and if the ratio between error on validation
25–140 have been considered. set and error on training set is greater than a specific value.
The network has been trained using the back- This point represents a good indicator of the best number of
propagation algorithm. The training set is generated epochs for training and the weights at that stage are likely to
from the results of the previous detection module that were provide the best error rate in new data.
manually labelled. Initially, a training set, composed by 1973 The determination of number of neurons in the hidden
examples, has been created. It contains 902 pedestrians, layer is a critical step as it affects the training time and
and 1071 nonpedestrians examples ranging from traffic generalization property of neural networks. Using too few
sign poles, vehicles, to trees. Then, the training set has been neurons in the hidden layer, the net results inadequate to
expanded to 4456 examples (1897 of pedestrian and 2559 correctly detect the patterns. Too much neurons, conversely,
of nonpedestrian) in order to cover different situations decreases the generalization property of the net. Overfitting,
8 EURASIP Journal on Advances in Signal Processing

0.98 0.97
0.96
0.96
0.95
0.94

Accuracy
0.94
0.92 0.93
0.92
Accuracy

0.9
0.91
0.88
0.9
0.86 20 25 50 80 100 120 140
Number of neurons
0.84
Small training set
0.82 Big training set

0.8
Figure 9: The accuracy of the net on test set depending on the
20 25 50 80 100 120 140 number of neurons in hidden layer.
Number of neurons

Small training set


Big training set

Figure 8: The accuracy of the net on validation set depending


on the number of neurons in hidden layer. The optimal neurons
number is a tradeoff between underfitting and overfitting.

in fact, occurs when the neural network has so much


information processing capacity that the limited amount of
information contained in the training set is not enough to
train all of the neurons in the hidden layer. In Figure 8,
the accuracy of the net on validation set depending on the Figure 10: The tetravision far-infrared and daylight acquisition
number of neurons in hidden layer is shown. With a larger system installed on board of the test vehicle.
training set, a bigger number of neurons in the hidden layer
are required. This is caused by the bigger complexity of the
training set that contains pedestrians in different conditions.
Therefore, a net with more processing capacity is needed. Tests were performed on both validation techniques
The trained nets have been tested on the test set that separately, in order to understand the strong and weak points
is strictly independent to the training and validation set. It of each of them; such a knowledge is needed by the final
contains examples of pedestrians and nonpedestrians in var- validator in order to properly adjust the weights of the soft
ious poses, shapes, sizes, occlusion status, and temperature decisions. The discussion will therefore focus on results given
conditions. In Figure 9, the accuracy of the net on test set by both neural networks, one working on shapes extracted by
varying the number of neurons in hidden layer is shown. The the active contours technique and the other one directly on
performance of the nets, trained on the big training set, is the regions of interest found by the algorithm early stages.
greater than that trained on the small set. This is caused by a As previously described, the approach chosen for the
higher completeness of the training set. The performance of classification of pedestrians contours is based on a neural
nets is similar to that performed on validation set (Figure 8); network, an approach that gives good results when the
but the optimal number of neurons in the hidden layer is problem description turns out to be complex. A neural
lower. The net having 80 neurons in the hidden layer and network suitable for the classification of pedestrians contours
trained on big training set is the best one, achieving an was developed, which provided good results, as can be seen
accuracy of 96.5% on the test set. in Figure 6.
In Figure 11 some examples of the contraction mecha-
nism are reported: the white lines are the snakes in the initial
6. Discussion position, that is, on the bounding box contour, while the
snakes after energy minimization are drawn in yellow. Some
The developed system has been tested in different situations examples are presented for a close pedestrian, Figure 11(a),
using an experimental vehicle equipped with the tetra-vision and for a distant pedestrian and a motorbike, Figure 11(b).
system (see Figure 10). In Figure 11(c) the importance of the initial snake position is
EURASIP Journal on Advances in Signal Processing 9

(a) (b)

(c) (d)

Figure 11: Results: in (a) and (b), shape extraction of a close and distant pedestrian, respectively; the white snake represents the initial
position, while the yellow one is the final configuration. In (c), a typical issue connected with a wrong initial snake disposition is shown: the
head is outside the extracted shape because it was also outside the bounding box. In (d) some results in a difficult working condition are
presented, that is, during summer, when a lot of background objects appear bright, due to the high temperature.

(a) (b)

(c) (d)

Figure 12: Classification results of the neural network analyzing pedestrians shapes. Bounding boxes that are filled are classified as
pedestrians, while a red contour is put around obstacles that are classified as nonpedestrians. Output values are also printed on the image.
10 EURASIP Journal on Advances in Signal Processing

(a) (b)

(c) (d)

(e) (f)

Figure 13: Neural network results: validated pedestrians are shown using a superimposed red box; the white rectangles represent the
discarded bounding boxes.

highlighted: the head is not detected because it is outside of and all the others votes greater than 0.85. In Figure 12(c),
the initial snake position (in white). Some shape extraction a distant pedestrian is correctly classified with a vote of
results are presented when the FIR images are not optimal, 0.84; in Figure 12(d) two pedestrians are present, at different
like those acquired in summer, under heavy direct sunlight; distances, and are correctly classified, with votes of 0.87 and
in this condition, many objects in the background become 0.77.
warm, and the assumption that a pedestrian has a higher Concerning the neural network-based validator, a feed-
temperature than the background is not satisfied. This forward multilayer neural network is exploited to evaluate
causes some errors in the contraction process, so that the the presence of pedestrians in the bounding boxes detected
snake in the final position does not completely adhere to by previous stages of the tetra-vision system. The neural net-
the pedestrian contour, but also includes some background work is trained on infrared images in order to acknowledge
details (Figure 11(d)). the thermal footprint of pedestrians. The training set has
In Figure 12 some classification results of the neural been generated from the results of the previous detection
network that analyzes pedestrians shapes are shown. In modules that were manually labelled. Such set contains a
Figure 12(a), a lot of potential pedestrians are found by the large number of pedestrian and nonpedestrian examples,
obstacle detector of previous system stages, but only one like traffic sign poles, vehicles, and trees, in order to cover
is classified as a pedestrian, with a vote of 0.98, while all different situations and temperature conditions. Different
the other obstacles received a vote not greater than 0.17. neural nets have been trained to understand which is the
In Figure 12(b) a scene with a lot of pedestrians is shown optimal number of training epochs, neurons in the hidden
and two obstacles: the latter received votes not exceeding layer of the net, and training examples and, therefore, to
0.19, while one of the pedestrians received a vote of 0.44, avoid the overfitting. The test set containing also pedestrians
EURASIP Journal on Advances in Signal Processing 11

partially occluded or with missing parts of the body has [11] M. Bertozzi, A. Broggi, A. Fascioli, and M. Sechi, “Shape-
been generated in order to evaluate the performance of based pedestrian detection,” in Proceedings of the IEEE Intel-
net. Experimental results show that the system is promising, ligent Vehicles Symposium, pp. 215–220, Detroit, Mich, USA,
achieving an accuracy of 96.5% on the test set. October 2000.
Figure 13 shows some results of the neural network [12] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von
validator. The validated pedestrians are shown using a super- Seelen, “Walking pedestrian recognition,” IEEE Transactions
imposed solid red box. Conversely, the empty rectangles on Intelligent Transportation Systems, vol. 1, no. 3, pp. 155–162,
2000.
represent the bounding boxes generated by previous steps
and classified as nonpedestrians. Figures 13(a) and 13(b) [13] D. Beymer and K. Konolige, “Real-time tracking of multiple
people using continuous detection,” in Proceedings of the IEEE
depict examples of pedestrians and nonpedestrians correctly
International Conference on Computer Vision, Kerkyra, Island,
classified. In Figure 13(c), an area of attention is not correctly 1999.
validated because it contains multiple pedestrians, and they
[14] D. M. Gavrila, “Pedestrian detection from a moving vehicle,”
are not in the typical pedestrian pose. Some false positives in Proceedings of the European Conference on Computer Vision,
are presented in Figures 13(d) and 13(e). vol. 2, pp. 37–49, July 2000.
[15] H. Nanda and L. Davis, “Probabilistic template based pedes-
Acknowledgment trian detection in infrared videos,” in Proceedings of the IEEE
Intelligent Vehicles Symposium, Paris, France, June 2002.
This work has been supported by the European Research [16] C. Stauffer and W. E. L. Grimson, “Similarity templates
Office of the U. S. Army under contract number N62558-07- for detection and recognition,” in Proceedings of the IEEE
P-0029. International Conference on Computer Vision and Pattern
Recognition, vol. 1, pp. 221–228, 2001.
[17] V. Philomin, R. Duraiswami, and L. Davis, “Pedestrian
References tracking from a moving vehicle,” in Proceedings of the IEEE
Intelligent Vehicles Symposium, pp. 350–355, Detroit, Mich,
[1] M. Del Rose and P. Frederick, “Pedestrian detection,” in USA, October 2000.
Proceedings of the Intelligent Vehicle Systems Symposium,
[18] A. Broggi, M. Bertozzi, M. Del Rose, M. Felisa, A. Rakotoma-
Traverse, Mich, USA, 2005.
monjy, and F. Suard, “A pedestrian detector using histograms
[2] R. Kania, M. Del Rose, and P. Frederick, “ Autonomous robotic of oriented gradients and a support vector machine classifi-
following using vision based techniques,” in Proceedings of cator,” in Proceedings of the IEEE International Conference on
the Ground Vehicle Survivability Symposium, Monterey, Calif, Intelligent Transportation Systems, pp. 144–148, Seattle, Wash,
USA, 2005. USA, September 2007.
[3] M. Bertozzi, A. Broggi, C. Caraffi, M. Del Rose, M. Felisa, [19] G. Overett and L. Petersson, “Boosting with multiple classifier
and G. Vezzoni, “Pedestrian detection by means of far-infrared families,” in Proceedings of the IEEE Intelligent Vehicles Sympo-
stereo vision,” Computer Vision and Image Understanding, vol. sium, pp. 1039–1044, Istanbul, Turkey, June 2007.
106, no. 2-3, pp. 194–204, 2007.
[20] M. Zarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata,
[4] M. Bertozzi, E. Binelli, A. Broggi, and M. Del Rose, “Stereo “Pedestrian detection with convolutional neural networks,”
vision-based approaches for pedestrian detection,” in Proceed- in Proceedings of the IEEE Intelligent Vehicles Symposium, pp.
ings of the IEEE International Workshop on Object Tracking and 224–229, Las Vegas, Nev, USA, June 2005.
Classification Beyond the Visible Spectrum, San Diego, Calif,
[21] M. Bertozzi, A. Broggi, M. Felisa, G. Vezzoni, and M. Del Rose,
USA, June 2005.
“Low-level pedestrian detection by means of visible and far
[5] A. Shashua, Y. Gdalyahu, and G. Hayun, “Pedestrian detection infra-red tetra-vision,” in Proceedings of the IEEE Intelligent
for driving assistance systems: single-frame classification Vehicles Symposium, pp. 231–236, Tokyo, Japan, June 2006.
and system level performance,” in Proceedings of the IEEE
[22] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active
Intelligent Vehicles Symposium, pp. 1–6, Parma, Italy, June
contour models,” International Journal of Computer Vision,
2004.
vol. 1, no. 4, pp. 321–331, 1988.
[6] L. Zhao, Dressed human modeling, detection, and parts
[23] D. J. Williams and M. Shah, “A fast algorithm for active
localization, Ph.D. dissertation, Carnegie Mellon University,
contours and curvature estimation,” CVGIP: Image Under-
2001.
standing, vol. 55, no. 1, pp. 14–26, 1992.
[7] R. Cutler and L. S. Davis, “Robust real-time periodic motion
[24] S. R. Qunn and M. S. Nixon, “A robust snake implementation;
detection, analysis, and applications,” IEEE Transactions on
a dual active contour,” IEEE Transactions on Pattern Analysis
Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp.
and Machine Intelligence, vol. 19, no. 1, pp. 63–68, 1997.
781–796, 2000.
[8] H. Shimizu and T. Poggie, “Direction estimation of pedestrian
from multiple still images,” in Proceedings of the IEEE Intelli-
gent Vehicles Symposium, Parma, Italy, June 2004.
[9] R. Polana and R. C. Nelson, “Detection and recognition of
periodic, nonrigid motion,” International Journal of Computer
Vision, vol. 23, no. 3, pp. 261–282, 1997.
[10] L. Zhao and C. E. Thorpe, “Stereo- and neural network-
based pedestrian detection,” IEEE Transactions on Intelligent
Transportation Systems, vol. 1, no. 3, pp. 148–154, 2000.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 712854, 8 pages
doi:10.1155/2010/712854

Research Article
Vehicle Trajectory Estimation Using Spatio-Temporal MCMC

Yann Goyat,1 Thierry Chateau,2 and Francois Bardet2


1 LCPC, Route de Bouaye, 44341 Bouguenais, France
2 LASMEA, Université Blaise Pascal, 24 Avenue des landais, 63177 Aubière, France

Correspondence should be addressed to Yann Goyat, [email protected]

Received 19 October 2009; Accepted 23 March 2010

Academic Editor: Robert W. Ives

Copyright © 2010 Yann Goyat et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents an algorithm for modeling and tracking vehicles in video sequences within one integrated framework. Most
of the solutions are based on sequential methods that make inference according to current information. In contrast, we propose a
deferred logical inference method that makes a decision according to a sequence of observations, thus processing a spatio-temporal
search on the whole trajectory. One of the drawbacks of deferred logical inference methods is that the solution space of hypotheses
grows exponentially related to the depth of observation. Our approach takes into account both the kinematic model of the vehicle
and a driver behavior model in order to reduce the space of the solutions. The resulting proposed state model explains the trajectory
with only 11 parameters. The solution space is then sampled with a Markov Chain Monte Carlo (MCMC) that uses a model-driven
proposal distribution in order to control random walk behavior. We demonstrate our method on real video sequences from which
we have ground truth provided by a RTK GPS (Real-Time Kinematic GPS). Experimental results show that the proposed algorithm
outperforms a sequential inference solution (particle filter).

1. Introduction the applications use sequential methods even though it is not


necessary.
Efficient target tracking is a critical component in many For other situations, deferred tracking is much more
computer vision applications such as visual surveillance appealing, as it is not causal. This allows the optimisa-
or robotics. The object-tracking procedure is intended to tion process to operate over a larger data set (the whole
estimate the state (position, velocity, . . .) of an object at each observation sequence), thus allowing to hope for better
time given an observation sequence. results. Deferred visual multiobject tracking have already
Tracking methods can be divided into two major cat- been successfully experienced on pedestrian tracking in [7]
egories: The first category relates to sequential inference and with a MCMC search in [8].
tracking (also called online or causal tracking), for which the The solution presented in this paper is a spatio-temporal
state of the object at a given time step has been estimated deferred logical inference approach. One of the main
as a function of the record of past and current observations challenges of such methods is that the solution space of
and the record of past states. The second concerns deferred hypotheses grows exponentially related to the (duration)
logical inference (also called offline or noncausal tracking), depth of observation. In the specific case of vehicle tracking,
for which the state estimation at a given point in time uses priors on both driver behavior and the road geometry can be
the entire observation sequence. used. Moreover, the trajectory of the object to be tracked is
Sequential tracking is needed when the tracker’s output driven by a kinematic model. Therefore, we propose an 11-
controls real-time processes, which cannot be delayed (such dimensional reduced state vector of the vehicle trajectory.
as robotic applications). Sequential tracking is also needed Since we use a probabilistic framework, the tracking
when it is not possible to record the observation data, due problem can be seen as the estimation of the distribu-
to its size, or due to regulations. Much work has been done tion of the state vector posterior distribution, given a
on sequential visual tracking (model-based approaches [1– video sequence. We propose a Markov Chain Monte Carlo
3] or learning-based approaches [4–6]). Therefore, most of (MCMC) method to sample the posterior distribution.
2 EURASIP Journal on Advances in Signal Processing

MCMC have been already used in visual tracking. In [9, 10], Priors
a MCMC based particle filter is presented for multiobject
tracking and an extension is proposed to handle a varying X(0) X∗
Proposal
number of objects (Reversible Jump Markov Chain Monte
Video
Carlo, RJMCMC). In [8], the RJMCMC algorithm is used Acceptance sequence
in a deferred logical inference framework to track several rule
vehicles offline from a video sequence.
In MCMC methods, the random walk behavior is driven I0
X(1)
by proposal distributions. We use priors on driver behav- X∗

i=1
p(X | Z) ≈ {X(i) }N
Proposal
ior and road geometry to define efficiently the proposal.

1
k} K
k=
Exploration is achieved with the Metropolis-Hasting rule Acceptance

{I
according to a global likelihood function. rule
We use a likelihood function based on a background
subtraction algorithm. A discrete set of positions of the X(2)
vehicle into the video sequence is generated from the
trajectory state. A generic 3D model of a vehicle is then
projected into each image and then compared to a back-
ground/foreground map of the video sequence. We propose
an efficient implementation of the likelihood function using X(N)
a line integral image to decrease computation time.
Experiments have been done to compare, on real video
sequences, the deferred logical inference approach with a Figure 1: Given a video sequence and an initial state, the method
classic sequential particle filter. samples the posterior distribution of the trajectory using a random
The remaining of the paper is organized as follows. step method (MCMC).
Section 2 presents the probabilistic framework proposed to
solve the tracking problem. Section 3 provides a detailed
description of the vision likelihood function. A set of exper-
a 11-dimensional state space by driver temporal command
imental results along with both qualitative and quantitative
parameters. This method drastically reduces the dimension
analysis is presented in Section 4, before we conclude in
of the state space, thus improving computational efficiency.
Section 5.

2.2. Driver Command and Vehicle Priors. The driver com-


2. Proposed Method mands are the steering wheel angle, and the vehicle longitu-
dinal acceleration, from which we deduce the vehicle speed
This section describes the core of the method, based on a
through integration. The experiments presented below have
probabilistic framework. Figure 1 represents an illustration
been conducted on a mid-velocity curve. While traveling
of the algorithm. Given a video sequence and an initial
such a curve, a light vehicle driver’s command law is
state, the method samples the posterior distribution of the
commonly modelled by a trapezoid, with steering wheel
trajectory using a random walk method (MCMC). In the
angle velocities lying between 1.5 and 4 degrees per second,
following, we begin by presenting the state vector associated
and with absolute longitudinal accelerations lying between
to the trajectory model. Then, we give an overview of the
1 m·s−2 and 3 m·s−2 . In order to take into account a
Monte Carlo Markov Chain algorithm used to sample the
wider range of driver commands, and the steering system
posterior distribution. Finally, we show how to generate
nonlinearities due to frictions and mechanical compliances,
new proposals by sampling from an object-specific proposal
we use a more compliant model: we model the steering
distribution.
command with a double sigmoid (one for entering the curve,
and one for releasing from it).
2.1. State Vector Reduction. In a spatio-temporal deferred As the experiments presented below have been conducted
logical inference approach, the solution space of hypotheses on the second half of the curve, a single sigmoid is used to
grows exponentially related to the number of observation define the steering angle generator, from parameters defined
frames. Estimating a single vehicle planar trajectory along in Section 2.2 (cf. Figure 2)
a 100 frame sequence, involves a state space of dimension
300 (planar position and orientation are estimated for each . θ
fδ (θ δ , k) =   δ,2 2 2 + θδ,1 ,
frame). Conducting a Monte-Carlo search in such a space 1 + exp θδ,3 θδ,4 − k / 2θδ,2 2 (1)
is computationally intractable! To avoid this problem, we
do not consider the vehicle position sequence as the state k = time.
vector, but we implement a trajectory generator, lying on
driver behavior priors, road geometry priors, and vehicle The same reasoning applies to the vehicle speed generator
kinematic priors. The vehicle trajectory generator detailed fv (θ v , k), calculated as fδ (θ δ , k), swapping index δ into index
in the following, generates trajectory samples, defined in v.
EURASIP Journal on Advances in Signal Processing 3

Input: The first element of the chain X0 and its weight


proportional to its likelihood: π(X0 ) ∝ P(Z | X0 )
for n = 1 to N + NB do
θδ,1 + θδ,2

θδ,3 - Choose a move m ∈ {1, . . . , M } among all the parameter


of the state X according to prior q (m).
- Draw a proposal X∗ from the distribution q(X∗ | X)
with X = Xn−1
- Evaluate its joint likelihood: p(Z | X∗ )
θδ,4 k - Compute the acceptance ratio using Metropolis-Hasting
θδ,1
rule:  
Figure 2: Graphic representation of the sigmoid parameters. p(Z | X∗ ) q(X | X∗ )
α = min 1, ×
p(Z | X) q(X∗ | X)
- Add a nth element to the chain Xn = X∗ with
probability α, (otherwise Xn = Xn−1 ).
To model the vehicle, we use a plain kinematic model, as
end for
described in Section 3.1. This model allows us to iteratively Burn-in: delete the NB first elements of the chain.
generate xk , yk , and αk , for every time step k. Output: N-element Markov Chain of state hypothesis:
The vehicle trajectory generator is represented by a {Xn }n=NB +1,...,NB +N
.
random state vector X = (l, θ δ , θ v )T with
.
(i) l = (x0 , y0 , α0 ) represents the initial position and Algorithm 1: MCMC algorithm.
orientation of the vehicle (into a world reference
frame),
.
(ii) θ δ = (θδ,1 , . . . , θδ,4 ) are the parameters of a sigmoid
function δk = fδ (θ δ , k) representing the discrete 2.4. Proposals. At iteration n, the MCMC generates a new
temporal evolution of the steering angle, proposal by sampling from a proposal distribution q(X∗ |
.
(iii) θ v = (θv,1 , . . . , θv,4 ) are the parameters of a sigmoid X(n−1) ) defined by
function vk = fv (θ v , k) representing the discrete
    
temporal evolution of the vehicle velocity. q X∗ | X(n−1) = q (m)q X∗ | X(n−1) , m , (5)
m∈{1;...;M }
2.3. MCMC. We want to estimate p(X | Z), the posterior
probability density for a model’s parameters X, given some
observed data Z. Monte-Carlo methods assume that the where q (m) is a prior distribution used to select the
posterior distribution can be approximated by a set of N parameter index of X to be modified (M denotes the size
samples: of X). A parameter-specific proposal distribution is then
 N
defined by
p(X | Z) ≈ X(n) . (2)
.  @  
n=1
−1)
Sampling from p(X | Z) is a hard problem and many q(X∗ | X, m) = p Xm∗ | Xm(n−1) δ X ∗j − X (n
j . (6)
methods have been proposed. Metropolis-Hasting is a ran- j=
/m
dom walk algorithm designed to approximate a stationary
distribution. At each step, a state X∗ is proposed according Here, only the mth component (m is selecting with the prior
to a proposal density q(X∗ | X). The proposal state is then distribution q (m)) of the state vector is moved at iteration
accepted or rejected according to an acceptance ratio defined n; the other parameters remain unchanged. The MCMC is
by the Metropolis-Hasting rule summarized in Algorithm 1.
 
p(Z | X∗ ) q(X | X∗ )
α = min 1, × . (3)
p(Z | X) q(X∗ | X) 3. Observation
Metropolis-Hasting rule can be used to build a Markov
This section presents the observation function defined to
Chain which approximates the posterior distribution p(X |
compute the likelihood p(Z | X = X(n) ) probability to
Z). The resulting method is called Markov Chain Monte
observe the video sequence, given a sample X(n) . Figure 3
Carlo. Moreover, the Nb first elements of the chain are
illustrates the observation process. A discrete set of positions
removed (burn-in) in the final sampling set. An estimate of
of the vehicle into the video sequence is generated from
the state is given by a maximum likelihood rule applied to
the trajectory sample X. A generic 3D model of a vehicle
the particle set
is then projected into each image and compared to a
.  
N  background/foreground map of the video sequence.
1 =
X arg max δ X − X(n) , (4) We propose an efficient implementation of the likelihood
X
n=1
function using a line integral image to decrease computation
where δ is the Dirac function. time.
4 EURASIP Journal on Advances in Signal Processing

Kinematic L L
modelδ State Vedio
vector
Z
X
Car
Bicycle δ
kinematic model
Camera
parameters
Figure 4: The bicycle model synthesizes the displacement of a
four-wheel vehicle, through the displacement of two wheels whose
centers are connected by a rigid axis of length L. Ackerman’s theory
serves to estimate the steering angle of the front axis of a vehicle
traveling at low speed.

k
k z0
{xk }K
k=1 {Ik }K
k=1

(Rw ) T n θnt
(R0 ) (xt ) R0 y0

 AK  zw ynt x0
{ p(zk | xk )}K
k=1 p(Zk | Xk ) = k=1 p(zk | xk )
yw
Rw
xnt
xw
Figure 3: Illustration of the likelihood function. A discrete set of
positions of the vehicle into the video sequence is generated from Figure 5: Example of a simple three-dimensional geometric model
the trajectory sample. A generic 3D model of a vehicle is then pro- used for a vehicle. It is composed of two cubes. The coordinate
jected into each image and compared to a background/foreground system associated with the cube and the other system associated
map of the video sequence. with the scene are related according to pure translation. The plane
(Oxy) of the world coordinate system and component axes are
merged with the GPS coordinate system.
3.1. Building a Discrete Set of Vehicle Positions. Let X define
a discrete set of temporal positions and orientations of the
vehicle, associated to a sample X of the posterior distribution 3.2. Computing p(zk | xk ). Since the video sequence comes
from a static camera, vehicle extraction is achieved using
.
X = {xk }Kk=1 , (7) a background/foreground extraction approach. We use a
nonparametric method [11], based on discrete modelization
. of the background probability density of the pixel color
with xk = (xk , yk , αk )T is a vector which gives the position .
and orientation of the vehicle at time k into a world reference (RGB). The algorithm provides a set of binary images I =
. T
frame Rw associated to a planar ground. xk can be computed {Ik }K
k=1 , where Ik (u) = 1 if the pixel u = (ux , u y ) is
in a recursive way using a simple kinematic model of the associated to foreground and Ik (u) = −1 if the pixel is
vehicle. Here, we used a bicycle model (cf. Figure 4) associated to background.
A simplified three-dimensional geometric model of
xk = xk−1 + T · vk−1 · cos(αk−1 ), the vehicle is used, as depicted in Figure 5. This model
is composed of two nested parallelepipeds. In a general
yk = yk−1 + T · vk−1 · sin(αk−1 ), (8) case, the model may be more complex and contain PM
v
αk = αk−1 + T · k−1 · tan(δk−1 ), parallelepipeds. Let M(R0 ) = {Mi(R0 ) }i=1,...,NM represent the
L model’s set of cube vertices (NM = 8 × PM ), expressed
where T is the sample time used for the video acquisition within a coordinate system associated with model R0 . This
and L denotes the wheelbase (distance between front and rear coordinate system is selected such that the 3 axes all lie in the
wheels). δk and vk are given by the steering angle and velocity same direction as that of the world coordinate system Rw .
parametric functions presented into Section 2.2. Each point of the vehicle model is projected onto the
The likelihood function p(Z | X) can be written by image via the following equation

p(Z | X) = p(z1 ; z2 ; . . . ; zK | x1 ; x2 ; . . . ; xK ) m B(R


4 i ∝ Cc ·(Rw ) T(R0 ) (xk ) · M 0)
, (11)
(9) i

and assuming independence of random variables with MB homogeneous coordinates associated with point
M; Cc is the camera projection matrix, and (Rw ) T(R0 ) (xk )
@
K the homogeneous transformation matrix between the world
p(Z | X) = p(zk | xk ). (10) coordinate system and the system associated with the 3D
k=1 model (cf. Figure 5).
EURASIP Journal on Advances in Signal Processing 5

The set M(Ri ) = {mi }i=1,...,NM is thus built based on the x



projection of 3D model points within the image. Ik
Ik Convex hull
For a given position xk , the likelihood is linked to
the difference between the number of foreground and y
background pixels inside the vehicle model projection in the
image. This computation performed for each particle spends
SIGNIFICANT processing time, and we are proposing
 
herein a fast likelihood calculation method based on an
Ik ((x1 , y1 )T ) Ik ((x2 , y1 )T )
approximation of the 3D model projection in the image
through its convex hull. Foreground binary map Line-integral image
.
Let E (M(R0 ) ; xk ) = {ei }i=1,...,Ne (ei = (xie , yie ) as Figure 6: Illustration of the vision likelihood computation. The
coordinates of ei in the image plane) be the list of convex 3D model of the vehicle (shown in green/clear) is reprojected onto
hull points. (Calculation of the convex hull is not developed the image generated from the background-shape extraction. This
in this article; the calculation procedure is conducted using projection is approximated by its convex hull (shown in red/dark
a classical algorithm with a complexity expressed in O(N · on the right image). The likelihood calculation proceeds in a line-
.
log N).) We will now define Ek = E (M(R0 ) ; xk ) in order by-line integral image of the log-likelihood ratio.
to streamline notations. The likelihood calculation may be
performed efficiently by use of a line-by-line integral image
defined by 4.1. Experimental Details
 T  
x  T  4.1.1. Initialisation. The first sample of the MCMC must
IΣk x, y = Ik i, y . (12)
i=1
be initialized using priors. We use a data driven method
to compute the initial position of the vehicle on the road
Points ei are categorized by pairs featuring the same y- (x0 , y0 ). A nonparametric blob detector [11] is applied to
coordinate values, such that the background/foreground image I0 . The initial velocity is
5 provided by a specific sensor. Other parameters are initialized
       
Ek = x1e , y e , x2e , y e , x3e , y e + 1 , x4e , y e + 1 , . . . using priors given by vehicle or driver behavior. The dimen-
(13) sions of the geometric model are defined for each vehicle
   6
N N with a stochastic process on width and length parameters
xNe −1 , y + e , xNe , y e + e
e
. and using the likelihood computation (cf. Section 3.2). If
2 2
dimensions seem to be incoherent, a standard vehicle is
Convex hull coding within the set Ek necessitates a chosen.
shape discretization along the image lines. Moreover, special
attention needs to be paid to coding the upper and lower
4.1.2. Proposals. A key point of the method concerns the
extremities. On the other hand, it is not at all necessary
control of the random walk behavior using proposal distri-
to sort points positioned on the same line. A compliance
butions. Parameter-specific proposals are defined. Since both
measurement relative to a convex hull is computing from the
lower and upper bounds can be defined for all parameters, we
integral image by application of the following relation:
choose proposals according to a Beta distribution
e /2
N           ξ1 −1  
∗ ξ2 −1
a(Ek ) = 2 · IΣk e2 j − IΣk e2 j −1 − x2e j − x2e j −1 . P Xm∗ | Xm(n) ∝ Xm∗ · 1 − Xm (16)
j =1
(14) parameters ξ1 and ξ2 are computed such as the maximum of
the distribution is obtained for Xm∗ = Xm(n) .
Figure 6 describes the principle behind the likelihood
calculation method using the integral image. A line-by-line
scanning is performed as part of this method. 4.1.3. Details about the Sequential Method. Behavior of the
Finally, the likelihood expression is written by proposed method is compared with a sequential particle
.
filter. The state vector is defined as Xk = (x y , yx , αx , vx , δx )T .
p(zk | xk ) ∝ CE−k1 max(0, a(Ek )) (15) Dynamics are controlled by the kinematic bicycle model with
a zero centered normal law applied to both the steering angle
. 
with the normalization constant CEk = Nj =e /2 e e
1 |x2 j − x2 j −1 | and velocity variation. Moreover, the likelihood function
defining the surface of the convex hull. is slightly modified by removing the normalizing constant.
The particle set is resampled at each iteration using a SIR
algorithm.
4. Experimental Validation
In this section, experimental results are presented to high- 4.2. Results. In order to compare the two methods, a vehicle,
light the relevance of our tracker. We compare the Offline equipped with a RTK GPS accurate to within one centimeter,
proposed approach to a sequential stochastic filter (particle has been used. (A calibration between the GPS reference
filter). frame and the camera has been achieved but details are
6 EURASIP Journal on Advances in Signal Processing

Table 1: Position error (the true position is given by a RTK GPS) for the proposed deferred logical inference method and a sequential particle
filter.
Method Position error (m) Position std. (m) Orientation error (degrees) Orientation std. (degrees)
Sequential filter 0.27 0.26 3.67 3.36
Deferred logical inference 0.20 0.22 1.12 0.97

100 100

80 80
Accuracy (%)

Accuracy (%)
60 60

40 40

20 20

0 0
0 10 20 30 40 50 0 10 20 30 40 50
Position tolerance (cm) Orientation (yaw) tolerance (degrees/10)

Deferred Deferred
Sequential Sequential
(a) (b)

Figure 7: Percentage of correct position/orientation related to the tolerance. (a) position absolute tolerance. (b) orientation (yaw angle)
absolute tolerance.

not presented in this paper.) This vehicle traveled through trajectory, thus bringing more time consistency than the
the test section 20 times along various trajectories at speeds sequential method.
ranging from 40 to 80 km/hr. The error was quantified as the Figure 8 illustrates the two methods on a real sequence.
average distance between each estimated vehicle position and Curves on the right column show zooms on local trajectories.
the straight line passing through the two closest GPS points. The middle column illustrates the image projection of the
For each test, at least five vehicle runs were carried out, vehicle position for the sequential method. The right column
which enabled deriving a very rough statistic on the recorded illustrates the image projection of the vehicle position for
measurements. For the tests actually conducted, the vehicle the deferred method. It is of high interest to notice the
has been tracked in a curve over a distance of approximately noisy estimation provided by the sequential method, where
100 m (minimum radius = 130 m). the estimated trajectory does not seem to match the vehicle
All the experiments presented here have been done using kinematic model. The reason for this weak consistency is
200 particles for the two methods. that the maximum a posteriori estimate may be found
Table 1 presents the average error and related standard on different particles at every time step. In contrast, the
deviations for the two tested methods. The proposed spatio-temporal deferred approach ensures faithfulness to
deferred logical inference provides a lower global error than the model, thus explaining the observed improvement.
the sequential particle filter.
Figure 7 plots the estimation accuracy as a percentage 5. Conclusion
of correct positions (vertical axis) versus an error tolerance
(horizontal axis) for both methods. On the left graph the We have presented a solution for estimating vehicle tra-
error tolerance is the position absolute error, ranging from jectories using a single static color camera. A spatio-
0 to 50 cm, while on the right graph the error tolerance is temporal deferred logical inference solution which takes
the vehicle orientation (yaw angle) absolute error, ranging into account both vehicle kinematics and driver behavior
from 0 to 5 degrees. The curve associated to the proposed has been proposed, using a stochastic approach to estimate
method outperforms the sequential particle filter both for the the posterior distribution of the trajectory. By choosing a
position and the orientation estimation. Moreover, the right MCMC, the random walk evolution is controlled by injecting
graph emphasizes the benefit of the deferred method, which priors on both driver and vehicle behavior and on geometric
integrates the vehicle and driver priors in every generated knowledge about the road. Moreover, a global likelihood
EURASIP Journal on Advances in Signal Processing 7

177

176

175

174

115.5 116 116.5 117

198
197
196
195
194
193
192
191
190
189
103 104 105 106 107 108 109 110 111

206.8
206.6
206.4
206.2
206
205.8
205.6
205.4
205.2
205
204.8
99.2 99.6 100 100.4 100.8

Real trajectory
Sequential method
Deferred method

Figure 8: Snapshots illustrating the two methods. Left column: zoom on local trajectories. Middle column: the bounding box illustrates
the position of the vehicle estimated with the sequential method. Right column: the bounding box illustrates the position of the vehicle
estimated with the sequential method.

function using background/foreground binary extraction the entire curve, the system is composed of three color
has been proposed, with an efficient implementation. cameras with very little overlap. The system has successfully
Experiments have been achieved to demonstrate that the analyzed observations recorded under actual traffic condi-
proposed method outperforms a classic sequential particle tions over several-day periods.
filter solution using statistics performed on real video
sequences. Two points explain this improvement. First, the
spatio-temporal deferred approach processes over the whole
data set, thus ensuring time consistency. Second, the spatio- References
temporal deferred approach, unlike the sequential approach, [1] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking
ensures total faithfulness to the model at any time step, of non-rigid objects using mean shift,” in Proceedings of IEEE
because the maximum a posteriori estimate may be found Computer Society Conference on Computer Vision and Pattern
on different particles at every time step. Recognition, vol. 2, pp. 142–149, 2000.
The method discussed in this paper is currently operating [2] M. Isard and A. Blake, “Condensation—conditional density
24 hours a day with various weather conditions to provide propagation for visual tracking,” International Journal of
statistics on curve trajectories. For the purpose of covering Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
8 EURASIP Journal on Advances in Signal Processing

[3] M. Isard and J. MacCormick, “BraMBLe: a Bayesian multiple-


blob tracker,” in Proceedings of the IEEE International Confer-
ence on Computer Vision, vol. 2, pp. 34–41, 2001.
[4] O. Williams, A. Blake, and R. Cipolla, “A sparse probabilistic
learning algorithm for real-time tracking,” in Proceedings of the
IEEE International Conference on Computer Vision, vol. 1, pp.
353–360, Nice, France, 2003.
[5] S. Avidan, “Support vector tracking,” in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR ’01), vol. 1, pp. 184–191, Kauai, Hawaii,
USA, 2001.
[6] S. Avidan, “Ensemble tracking,” in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR ’05), vol. 2, pp. 494–501, IEEE Computer
Society, Washington, DC, USA, 2005.
[7] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera
people tracking with a probabilistic occupancy map,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol.
30, no. 2, pp. 267–282, 2008.
[8] Q. Yu, G. Medioni, and I. Cohen, “Multiple target tracking
using spatio-temporal Markov chain Monte Carlo data associ-
ation,” in Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR ’07), 2007.
[9] Z. Khan, T. Balch, and F. Dellaert, “MCMC-based particle
filtering for tracking a variable number of interacting targets,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 27, no. 11, pp. 1805–1819, 2005.
[10] K. Smith, D. Gatica-Perez, and J.-M. Odobez, “Using particles
to track varying numbers of interacting people,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR ’05), vol. 1, pp. 962–969, 2005.
[11] Y. Goyat, T. Chateau, L. Malaterre, and L. Trassoudaine,
“Vehicle trajectories evaluation by static video sensors,” in Pro-
ceedings of the 9th IEEE International Conference on Intelligent
Transportation Systems (ITSC ’06), pp. 864–869, 2006.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 308379, 9 pages
doi:10.1155/2010/308379

Research Article
Superresolution versus Motion Compensation-Based Techniques
for Radar Imaging Defense Applications

J. M. Muñoz-Ferreras1 and F. Pérez-Martı́nez2


1 Department of Signal Theory and Communications, Polytechnic School, University of Alcalá, Campus Universitario,
Ctra. Madrid-Barcelona, Km. 33600, Alcalá de Henares, 28805 Madrid, Spain
2 Department of Signals, Systems and Radiocommunications, Technical University of Madrid, E.T.S.I. Telecomunicación,

Avenida Complutense, s/n, 28040 Madrid, Spain

Correspondence should be addressed to J. M. Muñoz-Ferreras, [email protected]

Received 17 November 2009; Accepted 8 April 2010

Academic Editor: Robert W. Ives

Copyright © 2010 J. M. Muñoz-Ferreras and F. Pérez-Martı́nez. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.

Radar imaging of noncooperative targets is an interesting application of all-weather high-resolution coherent radars. However,
these images are usually blurred when using the standard range-Doppler algorithm, if a long coherent processing interval (CPI)
is used, and motion compensation techniques are hence necessary to improve imaging quality. If the CPI is reduced enough,
target scatterers do not migrate of resolution cells and their corresponding Doppler frequencies are constant. Hence, for a short
CPI, motion compensation is not longer necessary, but Doppler resolution gets degraded. In that case, superresolution algorithms
may be applied. Here, we compare the superresolution-based focusing techniques with motion compensation-based methods.
Our conclusion is that imaging quality after employing the superresolution approaches is not improved and, consequently, the
use of motion compensation-based approaches to focus the radar images cannot be circumvented. Simulated and real data from
high-resolution radars have been used to make the comparisons.

1. Introduction In adverse meteorological conditions (such as fog or


haze) and in defense and security applications, imaging sen-
Radar imaging based on a static high-resolution coherent sors based on electro-optical wavelengths may have a reduced
radar is usually referred as Inverse Synthetic Aperture Radar performance [4–6]. However, the ISAR technique, because of
(ISAR) imaging. ISAR may obtain range-Doppler images of its all-weather feature, may still provide useful target images
noncooperative targets [1, 2], that is, targets whose motion is in those conditions. These images may subsequently be
unknown. A large transmitted bandwidth guarantees a high exploited by Automatic Target Recognition (ATR) algorithms
slant-range resolution, whereas a large variation of the target [7–10].
aspect angle during the coherent processing interval (CPI) In ISAR imaging, if the processing interval CPI is not too
allows obtaining a fine cross-range resolution [3]. The slant- large, target scatterers do not migrate of resolution cells and
range and cross-range resolutions are, respectively, given by their corresponding Doppler frequencies remain constant
c during the CPI. Hence, for this case, the standard range-
ρr = , (1) Doppler algorithm (RDA) obtains focused ISAR images.
2Δ f
However, these images are usually not adequate for subse-
λ quent ATR algorithms, because they have a degraded cross-
ρa = , (2) range resolution, according to (2). Note that it is likely that
2Δθ
the variation of the target aspect angle Δθ is little for this
where c is the light speed, Δ f is the transmitted bandwidth, λ short CPI.
is the transmitted wavelength, and Δθ is the variation of the On the contrary, if the CPI is large, the target scatterers
target aspect angle during the CPI. migrate of resolution cells and the Doppler histories are
2 EURASIP Journal on Advances in Signal Processing

complex functions. In this situation, RDA generates blurred By trying to move away from motion compensation
ISAR images of decreased quality and motion compensation techniques, several authors have proposed to make use of
techniques are usually necessary to improve these ISAR superresolution techniques [12–17] to focus ISAR images.
products. Because the blurring origin comes from a large CPI, the
Moreover, the previous problem is exacerbated when the subjacent idea under the superresolution approach is based
target is involved in complex motions, which is true for many on reducing the observation interval CPI. As previously
practical cases. For example, maritime targets are usually commented, for a reduced CPI, the target scatterers do
involved in complex dynamics characterized by complicated not have enough time to experiment large variations of
yaw, pitch, and roll attitude motions [11]. their Doppler frequencies or to migrate of resolution
Hence, in ISAR imaging of real noncooperative maneu- bins.
vering targets, an important trade-off emerges; it is interest- However, this CPI reduction certainly implies a loss of
ing to process a long CPI for achieving a fine cross-range Doppler (cross-range) resolution. It is here where superres-
resolution, but blurring effects arise for this long CPI because olution algorithms may theoretically improve the standard
of the complex motion. Fourier resolution. Hence, according to these approaches
For a long CPI, the reason for scatterer migrations is quite [12–17], focused ISAR images could be obtained without the
obvious; the target scatterers have enough time to migrate necessity of processing long coherent intervals or of applying
of resolution cells. On the other hand, as far as the complex motion compensation algorithms.
functions for the Doppler history are concerned, we can write In this paper, we compare the superresolution appro-
the phase history of a target scatterer as [11] aches with the results obtained after compensating the
motion, by applying the methods to simulated and real
4π data from complex targets. As far as the superresolution
ϕs (τ) = − Rs (τ), (3)
λ algorithms are concerned, we concentrate on the spectral
estimation based on autoregressive (AR) coefficients [18],
where λ is the central transmitted wavelength and Rs (τ) is the multiple signal classification (MUSIC) estimator [19],
the range from the radar to the scatterer as a function of the and the Capon estimator [20].
slow-time τ. Superresolution algorithms are based on parametric
If the target has a smooth constant rotational motion and models of the signals and, consequently, they assume that the
the CPI is short, it can be shown [11] that a very accurate data satisfy some concrete hypotheses. In the ISAR scenario,
approximation of (3) is given by we do not know to what extent the data match the models
and, hence, the results are not as promising as expected. We
4π  
ϕs (τ) = − R0 + ys + xs ωe τ , (4) have obtained images with many peaks whose positions do
λ not necessarily correspond with the true locations of the
where R0 is the range from the radar to the target rotation scatterers. On the other hand, focusing indicators (such as
center, xs is the cross-range position of the scatterer, ys entropy or contrast) may provide optimized values for the
is the slant-range position of the scatterer, and ωe is the superresolution-based images because of their peaky nature.
effective rotation rate. If we suppose that R0 does not change However, this is not indicative of an enhancement in the
its position during the CPI, that is, translational motion quality of the ISAR images, as discussed.
compensation has previously been applied; the Doppler Our conclusion is that, when dealing with complex high-
frequency associated to this scatterer is not a function of τ: resolution radar data, the performance of the superreso-
lution approach is not as good as expected and motion
1 dϕs (τ) 2xs ωe compensation methods should be applied if focused ISAR
fds = − = . (5) images are desired to be obtained.
2π dτ λ
Section 2 presents a brief introduction to RDA and
Hence, according to (5), if the target motion is smooth, motion compensation. In Section 3, the ISAR focusing
which is true for a reduced CPI, the Doppler frequency for technique based on superresolution algorithms is addressed.
each target scatterer is a constant and the standard RDA A brief description of the superresolution algorithms (AR,
will generate a focused ISAR image. Take into account that MUSIC, and Capon) is also given. Comparisons between
the Doppler frequency is proportional to the cross-range superresolution and motion compensation-based techniques
position xs of the scatterer. when using simulated data are presented in Section 4. Deep
On the other hand, according to (3), if the target is analyses of the obtained results in Section 4 let us derive
involved in complex motions and the CPI is large, the range important conclusions. After detailing the results achieved
from the radar to the scatterer Rs (τ) is a complex function with live radar data in Section 5, some final conclusions
and, consequently, the phase of the scatterer ϕs (τ) is also a conclude the paper in Section 6.
complex function of the slow-time τ. This eventually implies
that the scatterer Doppler frequency is not constant during 2. Range-Doppler Algorithm and
the illumination interval CPI and, hence, if the standard RDA Motion Compensation
is applied, a severely blurred ISAR image is to be obtained.
The problem rests in the fact that the processed CPI is too The ISAR technique allows us to generate range-Doppler
large and complex phase variations arise. images of noncooperative targets. The standard image

formation algorithm for ISAR is the range-Doppler algorithm (RDA) [11, 21], which may easily be described as follows.

(i) Acquire a set of range profiles by using a coherent high-resolution radar and stack them to form the matrix M_rτ[n, m], where n = 0, 1, ..., N − 1, m = 0, 1, ..., M − 1, N is the number of range bins, and M is the total number of acquired range profiles. Hence, the columns of M_rτ[n, m] are the range profiles.

(ii) Apply a Fast Fourier Transform (FFT) to each range bin; that is, apply an FFT to each row of M_rτ[n, m]. The resulting matrix M_rd[n, k] is the ISAR image generated by using RDA.

Figure 1 schematically shows the simple processing made by RDA.

Figure 1: Range-Doppler algorithm flowchart.

Target motion may be divided into a translational component and a rotational component [21, 22]. With respect to the line-of-sight (LOS), the translational motion may further be decomposed into a radial (along-LOS) component and a tangential (across-LOS) component. The rotational motion is formed by the yaw, pitch, and roll attitude components.

In this context, the obtained ISAR image is a projection depending on target dynamics and orientation. Concretely, the ISAR projection plane is the plane formed by the LOS vector and a vector normal to the effective rotation vector ω_ef and contained in the plane perpendicular to LOS [22]. The effective rotation vector ω_ef is the projection of the rotation vector ω over the plane perpendicular to LOS.

As an example, let us consider the scenario shown in Figure 2. A coherent high-resolution radar illuminates a pitching ship. The ship deck is aligned with LOS. In this case, the effective rotation vector ω_ef is just the pitch rotation vector ω_p, as shown in Figure 2. The obtained ISAR image is a side view of the target.

Figure 2: A scenario example to show the ISAR projection plane.

The rotational motion and the tangential translational motion may generate the desired Doppler gradient among scatterers situated in the same range bin. However, motion is also responsible for the possible appearance of blurring effects. Concretely, when the CPI is large and RDA is applied, the radial (along-LOS) component of the translational motion causes a large blurring in the ISAR images, and the rest of the motion may produce the so-called Migration Through Resolution Cells (MTRCs) [23].

Generally, before applying RDA, motion compensation techniques are necessary to improve the quality of the ISAR images. Thus, for translational motion compensation, two stages are often considered: range-bin alignment [1, 24–27] and phase adjustment [28–31]. On the other hand, for compensating the rotational motion, several methods may also be found in the literature [32–36].

In this paper, when dealing with motion compensation issues, we employ the extended envelope correlation method [26] for range-bin alignment, the entropy minimization approach [28] for phase adjustment, and the uniform-rate technique [36] for rotational motion compensation.

The focusing technique based on superresolution algorithms circumvents the use of motion compensation-based approaches by reducing the CPI, as explained in the next section.

3. ISAR Focusing Technique Based on Superresolution Algorithms

The idea behind applying superresolution algorithms for focusing ISAR images consists of processing the radar echoes for a reduced CPI in order to guarantee that scatterers do not have enough time to migrate through resolution cells or to experience complex phase variations. More formally, referring to Section 2, if we admit that M_rτ[n, m] is a matrix whose columns are the range profiles, the ISAR focusing algorithm based on superresolution algorithms may be expressed as follows.

(i) Consider a reduced number M_1 of range profiles of M_rτ[n, m]. This is equivalent to reducing the CPI. This simplified set may mathematically be expressed as M_rτ[n, m_1], where m_1 = 0, 1, ..., M_1 − 1, with M_1 ≪ M. The selection of M_1 depends on the target dynamics.

(ii) For the nth range bin, estimate its high-resolution frequency content by applying a superresolution algorithm. That is, apply a superresolution technique to each row of M_rτ[n, m_1].

(iii) Repeat the previous step for all the range bins. Subsequently, construct the superresolution ISAR image M_rd,SR[n, k_1], where k_1 is the Doppler-bin index.

The algorithm is schematically shown in the flowchart depicted in Figure 3, where the acronym SRA refers to superresolution algorithm.

Figure 3: Algorithm flowchart for the ISAR focusing technique based on superresolution algorithms.

Note that SRA may refer to the AR, MUSIC, or Capon spectral estimators, on which this paper concentrates. For completeness, a brief description of these spectral estimators is provided in the next subsections.
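Before the individual estimators are detailed, the overall processing chain of Sections 2 and 3 can be summarized by the minimal Python sketch below. It is not the authors' implementation: the function names are illustrative, and a plain zero-padded periodogram stands in for the AR, MUSIC, or Capon estimators described in Sections 3.1–3.3.

```python
# Sketch of the two focusing paths: standard RDA over the full CPI versus
# per-range-bin spectral estimation over a reduced CPI (steps (i)-(iii)).
import numpy as np

def rda_image(M_rtau):
    """Standard range-Doppler algorithm: FFT each range bin along slow-time."""
    # M_rtau has shape (N_range_bins, M_profiles); its columns are range profiles.
    return np.fft.fftshift(np.fft.fft(M_rtau, axis=1), axes=1)

def spectral_estimate(x, n_doppler):
    """Placeholder estimator (a zero-padded periodogram stands in for AR/MUSIC/Capon)."""
    return np.abs(np.fft.fftshift(np.fft.fft(x, n=n_doppler))) ** 2

def sr_image(M_rtau, M1, n_doppler=256):
    """Reduced-CPI focusing: keep only M1 profiles, estimate each bin's spectrum."""
    reduced = M_rtau[:, :M1]                 # step (i): reduce the CPI
    n_bins = reduced.shape[0]
    img = np.zeros((n_bins, n_doppler))
    for n in range(n_bins):                  # steps (ii)-(iii): one spectrum per range bin
        img[n, :] = spectral_estimate(reduced[n, :], n_doppler)
    return img
```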

3.1. Spectral Estimation Based on AR Coefficients. In this superresolution technique, it is assumed that the data are the output of an Infinite Impulse Response (IIR) filter whose input is excited with white noise of variance ρ_w [18]. If the signal is represented as s[m_1], m_1 = 0, 1, ..., M_1 − 1, then it can be expressed as the filter output as

s[m_1] = -\sum_{k=1}^{p} a[k]\,s[m_1-k] + u[m_1], \qquad (6)

where a[k] are the filter coefficients, p is the filter order, and u[m_1] is the white noise at the input.

The spectral estimator based on AR coefficients as a function of the frequency f for the signal s[m_1] may be written as [18]

P_{AR}(f) = \frac{\rho_w/f_s}{\mathbf{e}_p^{H}(f)\,\mathbf{a}\,\mathbf{a}^{H}\,\mathbf{e}_p(f)}, \qquad (7)

where H indicates conjugate transpose, f_s is the sampling frequency, and

\mathbf{e}_p(f) = \bigl[\,1 \;\; e^{j2\pi(f/f_s)} \;\; \cdots \;\; e^{j2\pi(f/f_s)p}\,\bigr]^{T}, \qquad \mathbf{a} = \bigl[\,1 \;\; a[1] \;\; \cdots \;\; a[p]\,\bigr]^{T}. \qquad (8)

Some methods to calculate the filter coefficients a[k] and the variance of the white noise ρ_w have been proposed [18]. Note that these values are necessary to evaluate (7). In this paper, we have used the modified variance method, which minimizes the forward and backward prediction errors [18].

3.2. Spectral Estimation Based on MUSIC. The MUSIC estimator is also a parametric approach, which supposes that the signal is a combination of sinusoids contaminated with white noise [19]. The MUSIC spectral estimator as a function of the frequency f for the signal s[m_1] is [19]

P_{MUSIC}(f) = \frac{1}{\mathbf{e}^{H}(f)\left(\sum_{k=N_s+1}^{N_c}\mathbf{v}_k\mathbf{v}_k^{H}\right)\mathbf{e}(f)}, \qquad (9)

where \mathbf{v}_k is the kth eigenvector of the correlation matrix \mathbf{R}_{N_c} (of dimensions N_c × N_c) of the input signal s[m_1]. The eigenvectors \mathbf{v}_k are ordered according to their corresponding eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_{N_c}, in such a way that the first N_s eigenvectors generate the signal subspace and the rest generate the noise subspace. Moreover, the vector \mathbf{e}(f) in (9) can be written as

\mathbf{e}(f) = \bigl[\,1 \;\; e^{j2\pi(f/f_s)} \;\; \cdots \;\; e^{j2\pi(f/f_s)(N_c-1)}\,\bigr]^{T}. \qquad (10)

For the determination of N_s, the extended criterion of Akaike may be employed [18]. If λ_1 ≥ λ_2 ≥ ··· ≥ λ_{N_c}, the function in (11) is calculated for each q = 1, 2, ..., N_c:

\mathrm{AIC}(q) = \left(N_c-q\right)\ln\!\left(\frac{\frac{1}{N_c-q}\sum_{i=q+1}^{N_c}\lambda_i}{\left(\prod_{i=q+1}^{N_c}\lambda_i\right)^{1/(N_c-q)}}\right) + \left(q-1\right)\left(2N_c-q-1\right). \qquad (11)

The estimated number of sinusoids N_s is the value of q which minimizes expression (11).

3.3. Capon Spectral Estimation. Finally, the Capon spectral estimator for the signal s[m_1], m_1 = 0, 1, ..., M_1 − 1, can be written in a similar way to MUSIC as [20]

P_{Capon}(f) = \frac{1/f_s}{\mathbf{e}^{H}(f)\,\mathbf{R}_{N_c}^{-1}\,\mathbf{e}(f)}, \qquad (12)

where \mathbf{e}(f) is the vector provided by (10) and \mathbf{R}_{N_c} is the correlation matrix (of dimensions N_c × N_c) of the input signal s[m_1].
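The MUSIC estimate (9) and the Capon estimate (12) can be prototyped for a single range bin as in the hedged sketch below: the sample correlation matrix is built by simple snapshot averaging, the AIC-based selection of N_s in (11) is omitted (N_s is passed in directly), and all function and variable names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def correlation_matrix(s, Nc):
    """Sample correlation matrix R_Nc from the slow-time samples of one range bin."""
    snaps = np.array([s[i:i + Nc] for i in range(len(s) - Nc + 1)])
    return (snaps.T @ snaps.conj()) / snaps.shape[0]

def steering(f, fs, Nc):
    """Steering vector e(f) of (10)."""
    return np.exp(1j * 2 * np.pi * (f / fs) * np.arange(Nc))

def capon_psd(s, freqs, fs, Nc=10):
    """Capon estimate (12) evaluated on a grid of frequencies."""
    R_inv = np.linalg.inv(correlation_matrix(s, Nc))
    return np.array([(1.0 / fs) /
                     np.real(steering(f, fs, Nc).conj() @ R_inv @ steering(f, fs, Nc))
                     for f in freqs])

def music_pseudospectrum(s, freqs, fs, Ns, Nc=10):
    """MUSIC pseudospectrum (9); the noise subspace holds the Nc-Ns smallest eigenvectors."""
    eigval, eigvec = np.linalg.eigh(correlation_matrix(s, Nc))  # ascending eigenvalues
    En = eigvec[:, :Nc - Ns]
    return np.array([1.0 / (np.linalg.norm(En.conj().T @ steering(f, fs, Nc)) ** 2)
                     for f in freqs])
```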

4. Comparison Results for Simulated Data

In this section, simulated data have been used to make the pertinent comparisons between the technique based on superresolution methods and the motion compensation-based approaches. These data have extensively been used in the literature to compare different ISAR focusing methods [37]. The data belong to a simulated MIG-25 aircraft, which is composed of 120 scatterers. The target is uniformly rotating, whereas a high-resolution radar illuminates it. The radar parameters are detailed in Table 1.

Table 1: Radar parameters for the simulated target.
Radar type: Stepped frequency
Central frequency: 9 GHz
Stepped frequencies in a burst: 64
Number of bursts: 512
Pulse repetition frequency: 15000 Hz
Bandwidth: 512 MHz
Coherent processing interval: 2.18 s

Figure 4 shows the ISAR image obtained after applying the standard RDA for the entire CPI. Clearly, this ISAR image is blurred. Because of the long processed illumination interval, the target scatterers have migrated through resolution cells.

If the processed illumination interval (CPI) is reduced, the ISAR image obtained by using RDA does not suffer from cell migrations, as shown in Figure 5 for a reduced CPI of 0.137 s (i.e., by considering only 32 bursts). However, as expected, the Doppler resolution has been decreased and the image quality is poor.

Figures 6–8 show the ISAR images obtained with the method based on superresolution algorithms when the reduced CPI of 0.137 s is considered. Figure 6 presents the result when the spectral estimator based on AR coefficients is used, whereas Figures 7 and 8 refer to the results obtained with the MUSIC and Capon estimators, respectively. For the AR coefficients, a filter order of p = 5 has been considered. For the MUSIC and Capon spectral estimators, a matrix dimension of N_c = 10 has been considered for the correlation matrix.

Finally, Figure 9 presents the ISAR image obtained after compensating the motion for the entire CPI. For this purpose, the techniques in [26, 28, 36] have been applied. Note that the target scatterers are clearly distinguishable in this result. The ISAR image in Figure 9 is highly focused.

Figure 4: ISAR image after applying RDA to the simulated data for the entire CPI.
Figure 5: ISAR image after applying RDA to the simulated data for a reduced CPI (32 bursts).
Figure 6: ISAR image obtained by applying AR coefficients to the simulated data.
Figure 7: ISAR image obtained by applying MUSIC to the simulated data.
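As a quick consistency check (our arithmetic, using the burst structure of Table 1), the two processing intervals quoted above follow directly from the pulse repetition frequency:

\[
\text{CPI}_{\text{full}} = \frac{512 \times 64 \ \text{pulses}}{15000\ \text{Hz}} \approx 2.18\ \text{s},
\qquad
\text{CPI}_{\text{reduced}} = \frac{32 \times 64}{15000} \approx 0.137\ \text{s}.
\]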

As commented in the introduction, the entropy and the contrast are focusing indicators extensively used in the literature [28, 30]. Their expressions may, respectively, be written as

E = -\sum_{n}\sum_{k}\tilde{I}_{nk}\,\ln\tilde{I}_{nk}, \qquad (13)

C = \frac{\sqrt{\left\langle\left(\left|I_{nk}\right|^{2}-\left\langle\left|I_{nk}\right|^{2}\right\rangle\right)^{2}\right\rangle}}{\left\langle\left|I_{nk}\right|^{2}\right\rangle}, \qquad (14)

where I_{nk} is the ISAR image, ⟨·⟩ denotes the sample mean, n is the range-bin index, k is the Doppler-bin index, and \tilde{I}_{nk} is given by

\tilde{I}_{nk} = \frac{\left|I_{nk}\right|^{2}}{\sum_{n}\sum_{k}\left|I_{nk}\right|^{2}}. \qquad (15)

In the literature, it is assumed that the greater the contrast and the lower the entropy are, the more focused the ISAR image is [28, 30]. This is usually valid for comparisons among different autofocusing methods. However, as shown next, the entropy and the contrast are not proper focusing indicators for measuring the quality of ISAR images obtained by using a superresolution-based technique.

The contrast and the entropy for the ISAR images depicted in Figures 4–9 are detailed in Table 2.

By carefully analyzing the results provided in this section, we may draw the following conclusions.

(i) The ISAR images obtained by applying the technique based on superresolution algorithms usually present spurious scatterers; that is, they have peaks whose positions do not correspond with locations of real scatterers. We attribute this behavior to the fact that the parametric model assumed by the superresolution techniques may not adequately fit the ISAR data. Note that the ISAR data are complex; as an example, take into account that interference among scatterers is always present in complex targets.

(ii) Hence, the qualitative appearance of the ISAR images obtained with the superresolution-based approach is not satisfactory. Their quality may be greater than that of the RDA-based images (Figures 4 and 5), but it is clear that the superresolution-based approach does not outperform the motion compensation-based results (Figure 9), where the scatterers are clearly visible and localizable. Possible subsequent ATR algorithms may have problems with the spurious peaks appearing in the superresolution-based ISAR images.

(iii) By comparing the results provided by the AR, MUSIC, and Capon spectral estimators, the most promising output is the one given by the Capon estimator, since the target contour is more detailed. On the other hand, it is clear that, for complex radar data, the Akaike criterion misestimates the number of sinusoids present in each range bin.

(iv) From a direct reading of Table 2, one may conclude that the images obtained with the superresolution-based technique are highly focused, because they have high contrast and low entropy values. However, according to the previous conclusions, we know that the superresolution approaches do not outperform the motion compensation-based techniques. The explanation for the high contrast and low entropy values must be found in the very abrupt peaks generated by the parametric approaches [18]. We admit that these focusing indicators are really useful for other ISAR contexts [28, 30], but we also conclude that they are useless for assessing the performance of superresolution-based ISAR focusing approaches.

Figure 8: ISAR image obtained by applying the Capon spectral estimator to the simulated data.
Figure 9: ISAR image after compensating the motion (simulated data).
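A direct implementation of the indicators (13)–(15) is straightforward. The sketch below assumes a complex-valued ISAR image stored as a 2-D array and uses the positive-entropy sign convention consistent with the values reported in Tables 2 and 4; names are illustrative.

```python
import numpy as np

def entropy_and_contrast(I):
    """Focusing indicators of (13)-(15) for a complex ISAR image I (2-D array)."""
    power = np.abs(I) ** 2
    I_tilde = power / power.sum()                         # eq. (15)
    E = -np.sum(I_tilde * np.log(I_tilde + 1e-30))        # eq. (13), small offset avoids log(0)
    C = np.sqrt(np.mean((power - power.mean()) ** 2)) / power.mean()  # eq. (14)
    return E, C
```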

5. Comparison Results for Real Data

In this paper, we also present the results of applying the commented algorithms to real data in order to verify that the previously drawn conclusions are still valid for live scenarios. The data belong to a sailboat, which was illuminated by a millimeter-wave high-resolution radar [38]. The radar parameters are detailed in Table 3.

Table 2: Focusing indicators for the ISAR images corresponding to the simulated data (entropy, contrast).
Figure 4: 7.82, 4.68
Figure 5: 8.38, 4.03
Figure 6: 1.33·10−4, 127.99
Figure 7: 2.67·10−4, 127.99
Figure 8: 5.58, 11.78
Figure 9: 6.46, 10.52

Table 3: Radar parameters for the live acquisition.
Radar type: LFMCW
Central frequency: 28.5 GHz
Ramp repetition frequency: 1000 Hz
Bandwidth: 1 GHz
Coherent processing interval: 0.6 s

Table 4: Focusing indicators for the ISAR images corresponding to the real data (entropy, contrast).
Figure 10: 9.22, 8.43
Figure 11: 5.83, 17.87
Figure 12: 2.79, 113.23
Figure 13: 0.64, 270.13
Figure 14: 4.77, 41.70
Figure 15: 8.53, 16.74

Figure 10 shows the ISAR image obtained after using RDA for the entire CPI. Because of the large CPI, the ISAR image is blurred. Figure 11 presents the ISAR image (by using RDA) for a reduced CPI of 0.064 s. This image has a poor Doppler resolution, as expected.

Figures 12–14 show the ISAR images obtained after applying the superresolution technique based on the AR, MUSIC, and Capon estimators, respectively. For the AR coefficients, a filter order of p = 21 has been considered. On the other hand, for the MUSIC and Capon spectral estimators, a matrix dimension of N_c = 15 has been considered for the correlation matrix.

Finally, Figure 15 shows the ISAR image obtained after compensating the motion for the entire CPI. This image is highly detailed and may be useful for subsequent recognition/identification algorithms. A photo of the sailboat is also included in Figure 15 for reference.

The contrast and the entropy for the real ISAR images in Figures 10–15 are detailed in Table 4. Again, high contrast and low entropy values are obtained for the superresolution-based ISAR images.

The results obtained with real data are analogous to the ones achieved with simulated data. Consequently, the conclusions drawn at the end of Section 4 are also applicable to the real data detailed in this section.

Figure 10: ISAR image after applying RDA to the real data for the entire CPI.
Figure 11: ISAR image after applying RDA to the real data for a reduced CPI (64 ramps).
Figure 12: ISAR image obtained by applying AR coefficients to the real data.

6. Conclusions

The ISAR technique is a radar imaging method which may be very interesting in defense and security applications. In fact, ISAR can provide images of noncooperative targets in adverse meteorological conditions and in degraded scenarios.

Generally, it is interesting to process long illumination intervals to guarantee a high Doppler resolution. In this case, it is almost mandatory to apply motion compensation techniques if focused ISAR images are desired. Otherwise, the radar images are highly blurred and are useless for recognition/identification purposes.

On the other hand, if the processed CPI is reduced, the target scatterers do not migrate through resolution cells and their associated Doppler frequencies may be considered to be constant. In this case, the ISAR images have a poor Doppler resolution, which may theoretically be improved by using superresolution algorithms.

In this paper, we have concentrated on the comparison between the superresolution-based approaches and the motion compensation-based methods with respect to their capabilities of focusing ISAR images. Both simulated and real data from complex targets have been used.

Our main conclusion is that motion compensation cannot be circumvented; that is, it is always necessary to compensate the motion if focused high-resolution ISAR images are desired. The ISAR images obtained after applying superresolution approaches usually present spurious peaks, whose positions do not correspond to locations of real scatterers. These images could not be properly exploited by subsequent ATR algorithms.

The paper also provides the values of the entropy and the contrast for all the presented ISAR images. The superresolution-based images have high contrast and low entropy values, but this is not indicative of an increase in image quality.

Figure 13: ISAR image obtained by applying MUSIC to the real data.
Figure 14: ISAR image obtained by applying the Capon spectral estimator to the real data.
Figure 15: ISAR image after compensating the motion (real data). A photo of the sailboat is given for reference.

Acknowledgments

This work was financially supported by the Spanish National Board of Scientific and Technology Research under Project TEC2008-02148/TEC. The authors thank Dr. A. Blanco-del-Campo, Dr. A. Asensio-López, and Dr. B. P. Dorta-Naranjo for providing the live data of the sailboat.

References

[1] C.-C. Chen and H. C. Andrews, "Target motion induced radar imaging," IEEE Transactions on Aerospace and Electronic Systems, vol. 16, no. 1, pp. 2–14, 1980.
[2] D. A. Ausherman, A. Kozma, J. L. Walker, H. M. Jones, and E. C. Poggio, "Developments in radar imaging," IEEE Transactions on Aerospace and Electronic Systems, vol. 20, no. 4, pp. 363–400, 1984.
[3] W. G. Carrara, R. S. Goodman, and R. M. Majewski, Spotlight Synthetic Aperture Radar: Signal Processing Algorithms, Artech House, Boston, Mass, USA, 1995.
[4] S. A. Hovanessian, Introduction to Sensor Systems, Artech House, Boston, Mass, USA, 1988.

[5] A. V. Jelalian, Laser Radar Systems, Artech House, Boston, Mass, USA, 1992.
[6] G. R. Osche and D. S. Young, "Imaging laser radar in the near and far infrared," Proceedings of the IEEE, vol. 84, no. 2, pp. 103–125, 1996.
[7] K.-T. Kim, D.-K. Seo, and H.-T. Kim, "Efficient classification of ISAR images," IEEE Transactions on Antennas and Propagation, vol. 53, no. 5, pp. 1611–1621, 2005.
[8] B. K. S. Kumar, B. Prabhakar, K. Suryanarayana, V. Thilagavathi, and R. Rajagopal, "Target identification using harmonic wavelet based ISAR imaging," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 86053, 13 pages, 2006.
[9] E. Radoi, A. Quinquis, and F. Totir, "Supervised self-organizing classification of superresolution ISAR images: an anechoic chamber experiment," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 35043, 14 pages, 2006.
[10] S. Musman, D. Kerr, and C. Bachmann, "Automatic recognition of ISAR ship images," IEEE Transactions on Aerospace and Electronic Systems, vol. 32, no. 4, pp. 1392–1404, 1996.
[11] D. R. Wehner, High Resolution Radar, Artech House, Boston, Mass, USA, 2nd edition, 1995.
[12] R. M. Nuthalapati, "High resolution reconstruction of ISAR images," IEEE Transactions on Aerospace and Electronic Systems, vol. 28, no. 2, pp. 462–472, 1992.
[13] J. W. Odendaal, E. Barnard, and C. W. I. Pistorius, "Two-dimensional superresolution radar imaging using the MUSIC algorithm," IEEE Transactions on Antennas and Propagation, vol. 42, no. 10, pp. 1386–1391, 1994.
[14] R. Wu, Z.-S. Liu, and J. Li, "Time-varying complex spectral analysis via recursive APES," IEE Proceedings on Radar, Sonar and Navigation, vol. 145, no. 6, pp. 354–360, 1998.
[15] K.-T. Kim, S.-W. Kim, and H.-T. Kim, "Two-dimensional ISAR imaging using full polarisation and superresolution processing techniques," IEE Proceedings on Radar, Sonar and Navigation, vol. 145, no. 4, pp. 240–246, 1998.
[16] Z.-S. Liu, R. Wu, and J. Li, "Complex ISAR imaging of maneuvering targets via the Capon estimator," IEEE Transactions on Signal Processing, vol. 47, no. 5, pp. 1262–1271, 1999.
[17] A. Quinquis, E. Radoi, and F.-C. Totir, "Some radar imagery results using superresolution techniques," IEEE Transactions on Antennas and Propagation, vol. 52, no. 5, pp. 1230–1244, 2004.
[18] S. L. Marple Jr., Digital Spectral Analysis with Applications, Prentice Hall, Englewood Cliffs, NJ, USA, 1987.
[19] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.
[20] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[21] V. C. Chen, Time-Frequency Transforms for Radar Imaging and Signal Analysis, Artech House, Boston, Mass, USA, 2002.
[22] V. C. Chen and W. J. Miceli, "Simulation of ISAR imaging of moving targets," IEE Proceedings on Radar, Sonar and Navigation, vol. 148, no. 3, pp. 160–166, 2001.
[23] J. L. Walker, "Range-Doppler imaging of rotating objects," IEEE Transactions on Aerospace and Electronic Systems, vol. 16, no. 1, pp. 23–52, 1980.
[24] G. Y. Delisle and H. Wu, "Moving target imaging and trajectory computation using ISAR," IEEE Transactions on Aerospace and Electronic Systems, vol. 30, no. 3, pp. 887–899, 1994.
[25] J. Wang and D. Kasilingam, "Global range alignment for ISAR," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 1, pp. 351–357, 2003.
[26] J. M. Muñoz-Ferreras and F. Pérez-Martínez, "Extended envelope correlation for range bin alignment in ISAR," in Proceedings of the IET International Conference on Radar Systems (RADAR '07), pp. 65–68, Edinburgh, UK, October 2007.
[27] D. Zhu, L. Wang, Y. Yu, Q. Tao, and Z. Zhu, "Robust ISAR range alignment via minimizing the entropy of the average range profile," IEEE Geoscience and Remote Sensing Letters, vol. 6, no. 2, pp. 204–208, 2009.
[28] L. Xi, G. Liu, and J. Ni, "Autofocusing of ISAR images based on entropy minimization," IEEE Transactions on Aerospace and Electronic Systems, vol. 35, no. 4, pp. 1240–1252, 1999.
[29] B. D. Steinberg, "Microwave imaging of aircraft," Proceedings of the IEEE, vol. 76, no. 12, pp. 1578–1592, 1988.
[30] M. Martorella, F. Berizzi, and B. Haywood, "Contrast maximisation based technique for 2-D ISAR autofocusing," IEE Proceedings on Radar, Sonar and Navigation, vol. 152, no. 4, pp. 253–262, 2005.
[31] D. E. Wahl, P. H. Eichel, D. C. Ghiglia, and C. V. Jakowatz Jr., "Phase gradient autofocus—a robust tool for high resolution SAR phase correction," IEEE Transactions on Aerospace and Electronic Systems, vol. 30, no. 3, pp. 827–835, 1994.
[32] R. Lipps and D. Kerr, "Polar reformatting for ISAR imaging," in Proceedings of the IEEE National Radar Conference, pp. 275–280, Dallas, Tex, USA, May 1998.
[33] S. A. S. Werness, W. G. Carrara, L. S. Joyce, and D. B. Franczak, "Moving target imaging algorithm for SAR data," IEEE Transactions on Aerospace and Electronic Systems, vol. 26, no. 1, pp. 57–67, 1990.
[34] A. Aprile, A. Mauri, and D. Pastina, "Real time rotational motion compensation algorithm for focusing spot-SAR/ISAR images in case of variable rotation-rate," in Proceedings of the 1st European Radar Conference (EURAD '04), pp. 141–144, Amsterdam, The Netherlands, October 2004.
[35] M. Xing, R. Wu, and Z. Bao, "High resolution ISAR imaging of high speed moving targets," IEE Proceedings on Radar, Sonar and Navigation, vol. 152, no. 2, pp. 58–67, 2005.
[36] J. M. Muñoz-Ferreras and F. Pérez-Martínez, "Uniform rotational motion compensation for inverse synthetic aperture radar with non-cooperative targets," IET Radar, Sonar and Navigation, vol. 2, no. 1, pp. 25–34, 2008.
[37] V. C. Chen, 1999, http://airborne.nrl.navy.mil/~vchen/tftsa.html.
[38] A. Blanco-del-Campo, A. Asensio-López, B. P. Dorta-Naranjo, et al., "Millimeter-wave radar demonstrator for high resolution imaging," in Proceedings of the 1st European Radar Conference (EURAD '04), pp. 65–68, Amsterdam, The Netherlands, October 2004.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 341908, 10 pages
doi:10.1155/2010/341908

Research Article
A Locally Adaptable Iterative RX Detector

Yuri P. Taitano, Brian A. Geier, and Kenneth W. Bauer Jr.


Air Force Institute of Technology, 2950 Hobson Way, Wright Patterson AFB, OH 45433-7765, USA

Correspondence should be addressed to Kenneth W. Bauer Jr., kenneth.bauer@afit.edu

Received 27 November 2009; Revised 26 February 2010; Accepted 1 April 2010

Academic Editor: Yingzi Du

Copyright © 2010 Yuri P. Taitano et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present an unsupervised anomaly detection method for hyperspectral imagery (HSI) based on data characteristics inherent
in HSI. A locally adaptive technique of iteratively refining the well-known RX detector (LAIRX) is developed. The technique
is motivated by the need for better first- and second-order statistic estimation via avoidance of anomaly presence. Overall,
experiments show favorable Receiver Operating Characteristic (ROC) curves when compared to a global anomaly detector based
upon the Support Vector Data Description (SVDD) algorithm, the conventional RX detector, and decomposed versions of the
LAIRX detector. Furthermore, the utilization of parallel and distributed processing allows fast processing time making LAIRX
applicable in an operational setting.

1. Introduction error when estimating the mean and covariance, respectively.


The subsequences are refined by removing anomalies, in
In many experiments, the RX detector is modified in a an iterative fashion, from consideration in local statistics
preprocessing fashion [1–10] in order to minimize the false estimation. Even so, the refined subsequence that is used
alarm rate while attaining a reasonable true positive rate. to estimate a mean vector, μ, and the covariance matrix,
In most cases the modifications that are proposed can 
, is likely to still be nonGaussian; but, as is demonstrated
be generally described as dimensionality reduction variants subsequently, it often provides a better false alarm rate
coupled with RX [5, 10], window adjustments for covariance than the conventional RX because its estimates are not as
estimates [2–4], and the RX detector coupled with an entropy contaminated by anomalies.
and a nonparametric approach [1, 6]. In cases where a To illustrate the potential of this idea, consider the
new anomaly detection methodology is proposed, the RX following “abbreviated” image created from a desert image
detector is often used as a performance benchmark [9, 11– (see Figure 1). The image is “abbreviated” in that the targets
14]. have been moved closer together than in the original image
The literature on anomaly detection in HSI is quite exten- by simply eliminating columns of image pixels. This creates
sive [1–5, 7–12, 14–18] with major contributions appearing a situation with a very nonhomogeneous background. This
rapidly after Reed and Yu [19, 20]. In anomaly detection, small image is 63 × 49 pixels. A 25 × 25 window will be
the goal has always been to distinguish background from used in all subsequent processing. In Figure 1(a) the truth
potential targets in an automatic fashion while jointly mask shows the known objects of interest. Figure 1(b) shows
minimizing false alarms and maximizing true positives. the RX scores for the HSI image, Figure 1(c) shows the RX
The RX detector is prone to high false alarms because scores for the 1st 10 principal components (instead of using
the local Gaussian assumption is largely inaccurate [11]. the entire HSI data-cube), Figure 1(d) shows the output of
The purpose of this paper is to propose a refinement of the LAIRX for 2 iterations (called LAIRX (2) in subsequent
RX detector by taking into account the anomaly dominance discussion), using PCA for the input. In Figures 1(b), 1(c),
upon first and second order statistic estimation. That is, we and 1(d) RX scores are displayed such that anomalies are
wish to force stability upon the subsequences, locally defined expected to “fire” as bright and the background should
with respect to a window size, in order to reduce the bias and “fire” as dark. Figure 1(b) shows that many background

pixels may be declared as anomalies (depending on the continue; otherwise, the algorithm terminates. Hence, in
threshold) for RX. Figure 1(c) shows the benefit of using LAIRX we allow the RX detector to 
be iteratively refined with
principal components as input to RX. Figure 1(d) shows a respect to the estimation of μ and while keeping track of
large reduction in lighter pixels in the background using one detected anomalies. LAIRX has the following steps:
iterative refinement of the covariance calculation used in RX.
Step 1. Reduce the dimensionality to a set of p principal
components via a global estimate of .
2. Methods
The data used in our experiment are from the ARES desert Step 2. Apply the RX detector to the data matrix using a
and forest radiance collections. In our analysis we only pixel process window. If this is not the 1st iteration, withhold
consider two classes, background and certain man-made anomalous pixels identified  in the previous iteration from
targets. The goal of our analysis is to distinguish the the local estimation of μ and .
latter from many sources of background variation, such as 2 . These are
Step 3. Identify those RX scores that exceed χα,p
brush, roads, forest, large rock formations and other natural
referred to as anomalies. This step ends an iteration.
anomalies.
Step 4. If the set of pixels identified as anomalies in Step 3
2.1. Locally Adaptive Iterative RX Detector (LAIRX). Reed are identical to the set of pixels identified as anomalies in
and Yu derived an anomaly detector using a Generalized the previous iteration then go to Step 5, otherwise; return to
Likelihood Ratio Test (GLRT), which was later dubbed the Step 2.
RX detector [19, 20]. The detector GLRT is simplified by
assuming that background pixel vectors are iid normal with Step 5. Map detected anomalies to the image space.
estimated mean, μ̂, and covariance matrix, Σ̂; that is,
Once LAIRX has terminated, we assume that the
respective window sequences' anomaly indicator has con-
verged almost surely to some target given the χ²_{α,p} cut-off.
\mathrm{RX}(\mathbf{x}) = (\mathbf{x}-\hat{\mu})^{T}\Bigl[\tfrac{n}{n+1}\hat{\Sigma} + \tfrac{1}{n+1}(\mathbf{x}-\hat{\mu})(\mathbf{x}-\hat{\mu})^{T}\Bigr]^{-1}(\mathbf{x}-\hat{\mu}) \;\xrightarrow[n\to\infty]{}\; (\mathbf{x}-\hat{\mu})^{T}\hat{\Sigma}^{-1}(\mathbf{x}-\hat{\mu}). \qquad (1)
The subsequences associated with the target may or may
not be Gaussian, but the refined μ̂ and Σ̂ estimates should
n→∞
result in a higher true positive rate coupled with a relatively
low false alarm rate when compared to the conventional
In short, an incoming pixel vector, x, is the center of a
RX detector because the sequence is forced to be more iid
neighborhood of size n which is checked for irregularity
and, hence, estimation bias is reduced. The global SVDD
via the distance formulated above. That is, the pixel vector
algorithm was chosen as a competing algorithm because of its
is checked to see if it lies outside the hyper ellipse whose
recent promise as an efficient and powerful anomaly detector.
location and shape are determined via μ1 and Σ,1 respectively.
2 ,
In what follows, the LAIRX detector is compared to
An anomalous vector is declared given that RX(x) > χα,p the SVDD algorithm, the conventional RX detector and
2 2
where χα,p is the αth quantile of a χ distribution with p a decomposed version of the LAIRX detector. This later
degrees of freedom. For more information see [18]. comparison was made in an effort to more fully understand
The idea, following the philosophy of the RX method, the performance of the LAIRX detector. Each anomaly
is to place a window about each pixel in an image and detector was applied to four images; see Figure 2, containing
use local image statistics to determine whether or not the multiple vehicles, varying land formations, sage brush and a
point is anomalous. It is evident that such a method can road. These images vary in the difficulty they present to the
suffer from at least 3 potential complications. First, the algorithms as reflected in the SNR plots of Figure 3.
window pixel vectors are almost never statistically inde-
pendent. Second, such vectors are not typically identically
distributed. Further, outliers (the things we are looking for) 2.2. Parallel Implementation of LAIRX. The RX detector is
can seriously compromise the integrity of the local statistics, naturally a computationally inefficient task since M matrix
particularly the estimated covariance matrix. The first and inversions are required, M ≈ 52000 in our analysis. This
second complications are the subject of current research. computational burden was an inspiration for the application
In this paper, we examine the third complication. A look of SVDD in HSI anomaly detection [22]. However, most
at outlier effects and some remedies is given in [18]. The implementations of the RX detector are not optimized.
basic approach given in this paper is laid out in [21]. Here, We decided to optimize the RX algorithm via parallel and
we propose to deal with the outlier effects in an iterative distributed processing on a dual quad-core machine. The
fashion. As we process the basic RX algorithm across the implementation is very simple given the Matlab Parallel
image, we maintain a catalog of anomalous pixels, this is the Computing Toolbox but is described nonetheless.
1st iteration. Indeed, if we simply quit after processing the
image once, we would have simply run the RX method. A 2.2.1. Basic Setup. The hyperspectral image is formulated
second iteration is applied, only this time we withhold the as a data matrix where columns are wavelengths and rows
anomalous pixels from consideration in calculating the local are pixels, respectively. A window is moved row-wise across
statistics. So long as we find new anomalies the iterations the data matrix at a single row increment where each center

Figure 1: Comparison of RX and LAIRX. (a) Original test image with truth mask, (b) RX scores, (c) RX using 1st 10 Principal Components, (d) LAIRX scores.
Figure 2: RGB of image scenes. (a) ARES1D, (b) ARES1F, (c) ARES2F, (d) ARES3F.
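The iterative refinement of Section 2.1 can be sketched as follows. This is a minimal Python prototype, not the authors' Matlab code: the image cube is assumed to be already reduced to p principal components, the 25 × 25 window and the chi-square cut-off follow the description above, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def lairx(cube, win=25, alpha=0.999, max_iter=50):
    """Locally adaptive iterative RX: cube has shape (rows, cols, p)."""
    rows, cols, p = cube.shape
    half = win // 2
    cut = chi2.ppf(alpha, df=p)                       # chi-square cut-off (Step 3)
    anomaly = np.zeros((rows, cols), dtype=bool)
    for _ in range(max_iter):
        new_anomaly = np.zeros_like(anomaly)
        for r in range(half, rows - half):
            for c in range(half, cols - half):
                win_pix = cube[r-half:r+half+1, c-half:c+half+1].reshape(-1, p)
                flagged = anomaly[r-half:r+half+1, c-half:c+half+1].reshape(-1)
                bg = win_pix[~flagged]                # withhold previously flagged pixels (Step 2)
                if bg.shape[0] <= p:                  # fall back if too few samples remain
                    bg = win_pix
                mu = bg.mean(axis=0)
                cov_inv = np.linalg.pinv(np.cov(bg, rowvar=False))
                d = cube[r, c] - mu
                new_anomaly[r, c] = d @ cov_inv @ d > cut
        if np.array_equal(new_anomaly, anomaly):      # Step 4: stop when the anomaly set is stable
            break
        anomaly = new_anomaly
    return anomaly                                    # Step 5: anomaly map in image space
```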

pixel serves as input to the RX detector. The data  held The row midpoint of these data matrices are the center pixels
locally in each windowis used to estimate μ and . The for that particular window. This step makes RX and LAIRX
components of μ and are based on a set of p principal an “embarrassingly parallel” problem.
components from the wavelength bands, where p = 10 (2) The data partitions are batched in batches of size G
in our analysis. The number of principal components is where G is the number of available processor cores, G = 8 in
dependent on the data collection environment and should be our setup.
determined via exploratory data analysis. From our analysis, (3) Each batched data partitions are processed using the
we found that the desert radiance images required only RX detector simultaneously.
two principal components while the forest radiance images (4) Results are pooled and used for subsequent analysis.
required about ten.
2.3. Global SVDD Anomaly Detector. The Support Vector
2.2.2. Parallelization Scheme. (1) Generate all possible win- Data Description (SVDD) algorithm was originally proposed
dow indices where each column indexes a window of data. by Tax and Duin as a means for exploring one class

Figure 3: Signal to noise ratio (SNR) pixel map, that is, μ/σ, displayed with background-target SNR histogram. (a) ARES1D, (b) ARES1F, (c) ARES2F, (d) ARES3F.

distributions [23, 24]. The SVDD anomaly detector applied that contains T. This is a constrained optimization problem
to HSI was proposed by Banerjee et al. [22]. In general, the stated as.
goal of the SVDD algorithm is to find the hypersphere with min(R) subject to xi ∈ S, i = 1, . . . , M. (2)
minimum volume about a set of random vectors. In our case
these random vectors are the background pixels with each The center a and radius R are found by optimizing the fol-
dimension corresponding to a different spectral band, that lowing Lagrangian:
  
is, for a set of pixels, T = {xi , i = 1, . . . , M }, we seek the L(R, a, αi ) = R2 − αi R2 ( xi , xi
− 2 a, xi
+ a, a
) . (3)
minimum enclosing hypersphere, S = {x : x − a 2 < R2 }, i

1 For HSI anomaly detection, Banerjee et al. propose a fast


global SVDD detector [11]. The algorithm that they propose
0.9 is fast because it uses a small subset of the data to build the
SVDD model. The algorithm is as follows:
0.8 (1) Randomly select a set of N background pixels from
the training set.
0.7
(2) Estimate an optimal value for σ 2 using a cross-
validation or minimax method given the set of
0.6 background spectra.
(3) Estimate the SVDD parameters (a, αi , R) to model the
True positive rate

0.5 region of support for the background given a random


subset of the background spectra and σ 2 .
0.4 (4) For each pixel in the data matrix perform the decision
test:
0.3
(i) if SVDD(y) is less than the detection threshold
0.2
t, the pixel is part of the background.
(ii) else, declare the pixel as an anomaly.
0.1 As you can see in Figure 4, there is a performance effect as
σ 2 is varied. The only free parameter in the RBF is σ 2 . We
0 wish to find the optimal σ 2 that is able to fully describe the
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
nontarget/background spectra.
False positive rate
A common method is to estimate σ 2 while minimizing
50 4000 the false positive rate (Pfa) via an argument based on
100 5000 leave-one-out cross validation [11, 22–25], which gives the
500 6000 following upper bound on Pfa:
905 7000

1000 8000 #SV
2000 9000 Pfa≤E , (6)
N
3000 10000

Figure 4: σ selection for ARES1D image. Effect of bandwidth on


N is the number of samples selected to train the SVDD model
SVDD performance. The highest performance is achieved when σ ≈ and #SV are the number of support vectors required to
905; σ was incrementally varied from 50 → 10, 000. describe the data. Based upon the above inequality, Banerjee
et al. [22] propose an approximate minimax estimate for σ 2
as
The next step is to apply the kernel trick with a kernel
1
M
function Φ(x), which is usually taken to be Gaussian σ1 = min PFai
σ M i=1
[11, 22–24]. Once L is optimized with respect to αi and
after incorporating the kernel function, K(y, x), we get the ⎧ ⎫
M
⎨1
following SVDD statistic [11] #SV i ⎬
≈ min E (7)
σ ⎩M N ⎭
     
 i=1
SVDD y = R2 − K y, y + ∝i K y, xi ⎧ ⎫
M
⎨1
i #SV i ⎬
   (4) ≈ min ,
σ ⎩M N ⎭
= C + 2 ∝i K y, xi , i=1
i
where M is the number of replicates. Based upon the above
where result, the algorithm to obtain the global estimate for σ 2 is
 ? ?2  the following.
  −?x − y ?
K x, y = exp . (5)
σ2 (1) Generate M equal sets of training data by randomly
selecting pixels from the background.
The radial basis function (RBF) parameter σ 2 controls how
(2) For each set of training data, the SVDD decision
well the SVDD generalizes to unseen data. The choice of
boundary is determined using different values for σ 2 .
this parameter is driven by the data and must be chosen
empirically. In our analysis, y is a pixel vector and x is (3) For each value of σ 2 , the average fraction of support

a support vector obtained via SVDD given background vectors, (1/M) M i=1 [#SV i /N], is computed over all of
spectra. the training sets.

Figure 5: 2-D Principal Component Analysis view and SVDD pixel map. (a) ARES1D, (b) ARES1F, (c) ARES2F, (d) ARES3F. Note: Blue is target and red is background.

(4) The σ 2 that produces the smallest average fraction of approach, 60% of the training data is 300 pixel vectors while
support vectors is the minimax estimate. there are 149 features. The balance between sample size and
We discovered that the average fraction of support vectors number of variables is causing the minimax estimation to
is at a minimum when σ 2 ≥ 905. Therefore, our minimax converge at a minimum that allows a fairly high Pfa while
estimate is 905 because this is the smallest σ 2 that allows maintaining an effective characterization of the background
us to effectively describe our data. If σ 2 > 905 then our spectra. This demonstrates that the bandwidth parameter
detection results may be poor because the resulting SVDD selection is robust to small sample sizes relative to the
model is overly general. Note that when using the minimax number of dimensions or spectral bands.
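For illustration, scoring a test pixel against a trained SVDD with the Gaussian kernel of (5) can be sketched as below. This is not the paper's exact statistic (4), which absorbs the data-independent terms into a constant C; the generic feature-space distance is used instead, and the support vectors, multipliers, and squared radius are assumed to come from an SVDD training step that is not shown. All names are illustrative.

```python
import numpy as np

def rbf(x, y, sigma2):
    """Gaussian kernel of (5)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma2)

def svdd_distance2(y, sv, alpha, sigma2):
    """Squared feature-space distance of pixel y from the SVDD centre."""
    k_y_sv = np.array([rbf(y, s, sigma2) for s in sv])
    k_sv_sv = np.array([[rbf(a, b, sigma2) for b in sv] for a in sv])
    # ||Phi(y) - a||^2 = K(y,y) - 2*sum_i alpha_i K(y,x_i) + sum_ij alpha_i alpha_j K(x_i,x_j)
    return 1.0 - 2.0 * alpha @ k_y_sv + alpha @ k_sv_sv @ alpha

def is_anomaly(y, sv, alpha, R2, sigma2=905.0):
    """Declare an anomaly when the pixel falls outside the learned hypersphere."""
    return svdd_distance2(y, sv, alpha, sigma2) > R2
```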

1 The goal of our analysis was to compare a promising


anomaly detector (global SVDD), a benchmark detector
0.9 (RX) and our proposal (LAIRX). The RX algorithm is run
in two different modes. RX-FULL is run on the full 149
bands of the image cube; whereas, RX-PCA is run on the
0.8
10 retained principal components. Finally, in an effort to
examine the iteration sensitivity of LAIRX we created a
0.7 version denoted LAIRX(2) that only performs one iterative
refinement (in other words, 2 RX-type iterations). In order
0.6
to make inferences about these competing algorithms we
True positive rate

investigate the runtime, percentage of target pixels, area


under the ROC curve restrained to false positive cases falling
0.5 below 0.20, and the true positive rate at a fixed false positive
rate of 0.05. In what follows (Table 1), AUC denotes the “area
0.4 under the (ROC) curve (false positive rates from 0–.2)” and
TPR is the “true positive rate” at a fixed a false positive rate
of 0.05.
0.3 At the onset, we believed that performance gains would
be substantial given that we force our multidimensional data
0.2 sequences within each window to be more iid even though
these data sequences are likely to not be locally Gaussian.
Even if the multivariate populations are not Gaussian, by
0.1
iteratively refining the anomaly detection we are introducing
robustness to our final RX(x) score map, upon which a
0 threshold is applied to determine the final classification.
0 0.05 0.1 0.15 0.2
A drawback to LAIRX is that you need to know a good
False positive rate
rejection rate for the iterative refinement which is largely data
0.01 dependent as you can see in Figure 6. This problem is a topic
0.15 of future research.
0.65

Figure 6: Rejection Criteria Effect: ROC curves for LAIRX statistic


map thresholded at 0.01, 0.15, and 0.65, respectively. 3. Results
In what follows, ROC curves are presented for each image in
Figure 7 and summary statistics are offered in Table 1.
Figure 5 is SVDD pixel map with an accompanying 2D For ARES1D, the background and target pixels are
principal component scatter plots. These will contrasted in linearly separable given the first and second principal com-
subsequent analysis. ponents (see Figure 5(a)). Additionally, the SNR is showing a
clear separation with lower SNR values present for the target
pixels, which is encouraging (see Figure 5(b)). We would
2.4. Experiment Description. In our analysis we included expect with these observations that the RX-type classifiers
all wavelength bands that were not contaminated by atmo- should do very well and that the nonlinear classifier, SVDD,
spheric absorption, which resulted in 149 bands. The global should also perform reasonably well. When viewing the ROC
SVDD approach is supervised in the sense that it requires as curve for ARES1D, you can see that PCA is beneficial for the
input a known set of background spectra while LAIRX and RX derivative methods. The SVDD algorithm is performing
RX are unsupervised. For SVDD, the sample size used when well but as you can see in the SVDD statistic map there are
sampling from the background spectra was N = 500. This many locally clustered areas which display the same in value
size was chosen based upon computing limitations while as the target pixels.
accommodating the high dimensional feature space. For The image ARES1F poses difficulties for all the algo-
LAIRX, the maximum number of iterations and principal rithms tested. ARES1F which has a very noisy SNR statistic
components was 50 and 10, respectively. Based on our map, as you can see in Figure 3(b). This image in particular,
exploratory data analysis 10 principal components was highlights the benefits of LAIRX’s iterative approach.
sufficient. A window size of 25 × 25 was employed. The image scene is similar for both ARES2F and ARES3F
Each anomaly detector was applied to four images; see (Figure 2). As you can see in Figure 3, the SNR map is
Figure 2, containing multiple vehicles, varying land forma- showing a reasonable segmentation for both of these images
tions, sage brush and a road. ARES1D is desert radiance while and the distributions are somewhat separable. Table 1 and
the other three are forest radiance. Three of the images area Figure 7 show that performance is good across the board
contains less than 1% target pixels while one image has about with LAIRX being superior. It is interesting to point out that
3.4% target pixels. the nonlinear technique is performing poorly in contrast to

Figure 7: Receiver Operating Characteristic (ROC) curves for each image (curves shown: SVDD, RX-full, RX-PCA, LAIRX, LAIRX(2)). (a) ARES1D, (b) ARES1F, (c) ARES2F, (d) ARES3F.

the other algorithms. Additionally, the data associated with 4. Discussion


these images is separable in a lower dimensional subspace,
as you can see in Figure 3, which indicates that the RX type We have presented an unsupervised automatic target detec-
approach should perform well. Also, note that for ARES2F tion algorithm which builds upon the conventional RX
and ARES3F the SVDD statistic maps depicted in Figure 5 is detector by direct manipulation of the RX algorithm. As a
showing that large natural anomalies have a similar value as practical matter, the LAIRX detector must have data pre-
the targets, which leads to a higher false positive rate. processed as principal components before detection which

Table 1: Tabular results by image.

Image Algorithm Runtime (min) Image Prop AUC TPR


SVDD 1.46 291 × 199 pixels 0.18 0.90
RX-Full 3.93 0.11 0.44
ARES1D RX-PCA 0.1992 0.41 % Targets 0.17 0.80
LAIRX 48.54 0.37 1.00
LAIRX(2) 4.12 6 Targets 0.45 1.00
SVDD 1.05 191 × 160 pixels 0.16 0.83
RX-Full 2.11 0.05 0.18
ARES1F RX-PCA 0.10 3.40% Targets 0.06 0.25
LAIRX 16.11 0.30 0.80
LAIRX(2) 1.99 10 Targets 0.14 0.26
SVDD 1.61 312 × 152 pixels 0.15 0.70
RX-Full 3.14 0.18 0.91
ARES2F RX-PCA 0.16 0.66% Targets 0.18 0.85
LAIRX 19.46 0.47 1.00
LAIRX(2) 3.24 30 Targets 0.48 1.00
SVDD 1.10 226 × 136 pixels 0.17 0.83
RX-Full 1.93 0.17 0.82
ARES3F RX-PCA 0.09 0.48% Targets 0.18 0.90
LAIRX 11.99 0.44 0.97
LAIRX(2) 1.99 20 Targets 0.47 0.96

hinders real-time viability while the global SVDD builds a References


model based upon background spectra and then classifies
raw pixel vectors as anomalous as they are received. However, [1] W. Di, Q. Pan, Y.-Q. Zhao, and L. He, “Multiple-detector
fusion for anomaly detection in multispectral imagery based
the global SVDD is a supervised algorithm given that a
on maximum entropy and nonparametric estimation,” in
set of background spectra and RBF spread parameter are Proceedings of the 8th International Conference on Signal
specified which may limit its real-time viability in a dynamic Processing (ICSP ’06), vol. 3, January 2006.
operational setting. [2] M. Hsueh and C.-I. Chang, “Adaptive causal anomaly detec-
For the types of images analyzed here, our results tion for hyperspectral imagery,” in Proceedings of the IEEE
have shown that LAIRX is a reasonable competitor to the International Geoscience and Remote Sensing Symposium
SVDD algorithm and that a linear technique can perform (IGARSS ’ 04), vol. 5, pp. 3222–3224, January 2004.
well in a nonlinear environment after statistic estimation [3] W. Liu and C.-I. Chang, “A nested spatial window-based
modification. We have also demonstrated, see Table 1 and approach to target detection for hyperspectral imagery,” in
Figure 7, that the algorithmic steps taken to create LAIRX Proceedings of the IEEE International Geoscience and Remote
interact in a way that lead to higher true positives coupled Sensing Symposium (IGARSS ’04), vol. 5, pp. 266–268, January
with low false positives. By introducing iterative refinement, 2004.
we are getting better performance because the first and [4] W. Liu and C.-I. Chang, “Multiple-window anomaly detection
second order statistic estimation have less bias and error. for hyperspectral imagery,” in Proceedings of the IEEE Inter-
Our method has demonstrated potential in an image scene national Geoscience and Remote Sensing Symposium (IGARSS
with sparse vehicle activity. Whether or not similar results ’08), vol. 2, pp. 41–44, January 2008.
follow in a densely populated target environment remains to [5] F. Mei, C. Zhao, L. Wang, and H. Huo, “Anomaly detection
be seen. in hyperspectral imagery based on kernel ICA feature extrac-
Both LAIRX and LAIRX(2) are implemented in a par- tion,” in Proceedings of the 2nd International Symposium on
Intelligent Information Technology Application (IITA ’08), vol.
allel and distributed fashion which makes these algorithms
1, pp. 869–873, January 2008.
computationally efficient. The processing time for LAIRX
[6] F. Mei, C. Zhao, H. Huo, and Y. Sun, “An adaptive kernel
and LAIRX(2) are constrained by the number of available method for anomaly detection in hyperspectral imagery,” in
processor cores. In general, LAIRX is somewhat slow as Proceedings of the 2nd International Symposium on Intelligent
you can see in Table 1 but the ready availability of cluster Information Technology Application (IITA ’08), vol. 1, pp. 874–
machines and even affordable 8 processor core machines 878, January 2008.
make LAIRX a viable algorithm in an operational setting. In [7] N. M. Nasrabadi, “A nonliner kernel-based joint fusion/
contrast, the runtime of the global SVDD algorithm is fixed detection of anomalies using hyperspectral and SAR imagery,”
from an algorithmic perspective of parallel and distributed in Proceedings of the IEEE International Conference on Image
computing. Processing (ICIP ’08), pp. 1864–1867, 2008.

[8] N. Renard and S. Bourennane, “Dimensionality reduction Transactions on Geoscience and Remote Sensing, vol. 44, no. 8,
based on tensor modeling for classification methods,” IEEE pp. 2282–2291, 2006.
Transactions on Geoscience and Remote Sensing, vol. 47, no. 4, [23] D. M. J. Tax and R. P. W. Duin, “Support vector domain
pp. 1123–1131, 2009. description,” Pattern Recognition Letters, vol. 20, no. 11–13, pp.
[9] G. Yanfeng, Z. Ye, and L. Ying, “Unmixing component 1191–1199, 1999.
analysis for anomaly detection in hyperspectral imagery,” in [24] D. M. J. Tax and R. P. W. Duin, “Support vector data
Proceedings of the IEEE International Conference on Image description,” Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
Processing (ICIP ’06), pp. 965–968, 2006. [25] V. N. Vapnik, Statistical Learning Theory, Wiley Series on
[10] Y. Gu, Y. Liu, and Y. Zhang, “A selective kernel PCA Adaptive and Learning Systems for Signal Processing, Com-
algorithm for anomaly detection in hyperspectral imagery,” in munications, and Control, John Wiley & Sons, New York, NY,
Proceedings of the IEEE International Conference on Acoustics, USA, 1998.
Speech and Signal Processing (ICASSP ’06), vol. 2, pp. 725–728,
January 2006.
[11] A. Banerjee, P. Burlina, and R. Meth, “Fast hyperspectral
anomaly detection via SVDD,” in Proceedings of the 14th IEEE
International Conference on Image Processing (ICIP ’07), vol. 4,
pp. 101–104, January 2007.
[12] J.-M. Gaucel, M. Guillaume, and S. Bourennane, “Whitening
spacial correlation filtering for hyperspectral anomaly detec-
tion,” in Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP ’05), vol. 5,
pp. 333–336, January 2005.
[13] S. Tiwari, S. Agarwal, and A. Trang, “Texture feature selection
for buried mine detection in airborne multispectral imagery,”
in Proceedings of the International Geoscience and Remote
Sensing Symposium (IGARSS ’08), vol. 1, pp. 145–148, 2008.
[14] S. M. Thornton and J. M. F. Moura, “The fully adaptive GMRF
anomally detector for hyperspectral imagery,” in Proceedings
of the IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP ’00), vol. 1, pp. 37–40, 2000.
[15] C.-I. Chang and S.-S. Chiang, “Anomaly detection and
classification for hyperspectral imagery,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 40, no. 6, pp. 1314–1325,
2002.
[16] C.-I. Chang and H. Ren, “An experiment-based quantitative
and comparative analysis of target detection and image
classification algorithms for hyperspectral imagery,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 38, no. 2,
pp. 1044–1063, 2000.
[17] M. Shi and G. Healey, “Using multiband correlation models
for the invariant recognition of 3-D hyperspectral textures,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 43,
no. 5, pp. 1201–1209, 2005.
[18] T. E. Smetek and K. W. Bauer Jr., “A comparison of mul-
tivariate outlier detection methods for finding hyperspectral
anomalies,” Military Operations Research, vol. 13, no. 4, pp.
19–44, 2008.
[19] I. S. Reed and X. Yu, “Adaptive multiple-band CFAR detection
of an optical pattern with unknown spectral distribution,”
IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. 38, no. 10, pp. 1760–1770, 1990.
[20] X. Yu, L. E. Hoff, I. S. Reed, A. M. Chen, and L. B. Stotts,
“Automatic target detection and recognition in multiband
imagery: a unified ML detection and estimation approach,”
IEEE Transactions on Image Processing, vol. 6, no. 1, pp. 143–
156, 1997.
[21] Y. P. Taitano, Hyperspectral imagery target detection using the
iterative RX detector, M.S. thesis, School of Operation Science,
Air Force Institute of Technology, Wright-Patterson AFB,
Ohio, USA, March 2007, AFIT/GOR/ENS/07-25.
[22] A. Banerjee, P. Burlina, and C. Diehl, “A support vector
method for anomaly detection in hyperspectral imagery,” IEEE
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 343057, 24 pages
doi:10.1155/2010/343057

Review Article
Background Subtraction for Automated Multisensor Surveillance:
A Comprehensive Review

Marco Cristani,1, 2 Michela Farenzena,1 Domenico Bloisi,1 and Vittorio Murino1, 2


1 Dipartimento di Informatica, University of Verona, Strada le Grazie 15, 37134 Verona, Italy
2 IIT Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy

Correspondence should be addressed to Marco Cristani, [email protected]

Received 10 December 2009; Accepted 6 July 2010

Academic Editor: Yingzi Du

Copyright © 2010 Marco Cristani et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background subtraction is a widely used operation in video surveillance, aimed at separating the expected scene (the
background) from the unexpected entities (the foreground). There are several problems related to this task, mainly due to the
blurred boundary between the definitions of background and foreground. Therefore, background subtraction is an open issue
worth addressing from different points of view. In this paper, we propose a comprehensive review of background subtraction
methods that also considers channels other than the visible optical one (such as the audio and the infrared channels). In
addition to the definition of novel kinds of background, the perspectives that these approaches open up are very appealing: in
particular, the multisensor direction seems well suited to solve or simplify several long-standing background subtraction problems.
All the reviewed methods are organized in a novel taxonomy that encapsulates the brand-new approaches in a seamless way.

1. Introduction

Video background subtraction represents one of the basic, low-level operations in the typical video surveillance
workflow (see Figure 1). Its aim is to operate on the raw video sequences, separating the expected part of the scene (the
background, BG), frequently corresponding to the static part, from the unexpected part (the foreground, FG), often
coinciding with the moving objects. Several techniques may subsequently be carried out after the video BG subtraction
stage. For instance, tracking may focus only on the FG areas of the scene [1–3]; analogously, target detection and
classification may be speeded up by constraining the search window only over the FG locations [4]. Further, recognition
methods working on shapes (FG silhouettes) are also present in the literature [5, 6]. Finally, the recently coined term of video
analytics addresses those techniques performing high-level reasoning, such as the detection of abnormal behaviors in a
scene, or the persistent presence of foreground, exploiting low-level operations like BG subtraction [7, 8].

Video background subtraction is typically an online operation generally composed of two stages, that is, background
initialization, where the model of the background is bootstrapped, and background maintenance (or updating), where the
parameters regulating the background have to be updated by online strategies.

The biggest, general problem afflicting video BG subtraction is that the distinction between the background (the
expected part of the scene) and the foreground (the unexpected part) is blurred and cannot fit into the definition
given above. For example, one of the problems in video background subtraction methods is the oscillating background:
it occurs when elements forming in principle the background, like the tree branches in Figure 2, are oscillating.
This contravenes the most typical characteristic of the background, that is, that of being static, and brings such items
to be labelled as FG instances.

The BG subtraction literature is nowadays huge and multifaceted, with some valid reviews [9–11], and several
taxonomies that could be employed, depending on the nature of the experimental settings. More specifically, a
first distinction separates the situation in which the sensors (and sensor parameters) are fixed, so that the image view
is fixed, and the case where the sensors can move or

[Figure 1 diagram: raw input sequence → BG subtraction → high-level analysis modules (tracking, detection, recognition, video analytics)]

Figure 1: A typical video surveillance workflow: after background subtraction, several, higher-order, analysis procedures may be applied.


Figure 2: A typical example of ill-posed BG subtraction issue: the oscillating background. (a) A frame representing the background scene,
where a tree is oscillating, as highlighted by the arrows. (b) A moving object passes in front of the scene. (c) The ground truth, highlighting
only the real foreground object. (d) The result of the background subtraction employing a standard method: the moving branches are
detected as foreground.

parameters can change, like cameras mounted on vehicles or PTZ (pan-tilt-zoom) cameras, respectively. In the former
case, the scene may be nonperfectly static, especially in the case of an outdoor setting, in which moving foliage or
oscillating/repetitively moving entities are present (like flags, water, or the sea surface): methods in this class try to recover
from these noisy sources. In the case of moving sensors, the background is not static any more, and typical strategies aim
to individuate the global motion of the scene, separating it from all the other, local motions that witness the
presence of foreground items.

Other taxonomies are more technical, focusing on the algorithmic nature of the approaches, like those separating
predictive/nonpredictive [12] or recursive/nonrecursive techniques [13, 14]. In any case, this kind of partition does not
apply to all the techniques present in the literature.

In this paper, we will contribute by proposing a novel, comprehensive classification of background subtraction
techniques, considering not only the mere visual sensor channel, which was the only channel considered by BG
subtraction methods until six years ago. Instead, we will analyze background subtraction in the large, focusing on
different sensor channels, such as audio and infrared data sources, as well as the combination of multiple sensor
channels, like audio + video and infrared + video.

These techniques are very recent and represent the last frontier of automated surveillance. The adoption of
different sensor channels other than video and their careful

association helps in tackling classical unsolved problems for the reviewed approaches that cope with some of them. Then,
background subtraction. for each problem, we will give a sort of recipe, distilled
Considering our multisensor scenario, we thus rewrite from all of the approaches analyzed, that indicates how that
the definition of background as whatever in the scene that specific problem can be solved. These considerations are
is, persistent, under one or more sensor channels. From this summed up in Table 1.
follows the definition of foreground—something that is, Finally, a conclusive part, (Section 9), closes the survey,
not persistent under one ore more sensor channels—and envisaging which are the unsolved problems, and discussing
of (multisensor) background subtraction, from here on just what are the potentialities that could be exploited in the
background subtraction, unless otherwise specified. future research.
The remainder of the paper is organized as follows. First, As a conclusive consideration, it is worth noting that
we present what are the typical problems that affect the BG our paper will not consider solely papers that focus in
subtraction (Section 2) and, afterwards, our taxonomy is their entirety on a BG subtraction technique. Instead, we
described (see Figure 3), using the following structure. decide to include those works where the BG subtraction
In Section 3, we analyze the BG methods that operate represents a module of a structured architecture and that
on the sole visible optical (standard video) sensor channel, bring advancements in the BG subtraction literature.
individuating groups of methods that employ a single
monocular camera, and approaches where multiple cameras
are utilized. 2. Background Subtraction’s Key Issues
Regarding a single video stream, per-pixel and per-region
Background subtraction is a hard task as it has to deal
approaches can further be singled out. The rationale under
with different and variable issues, depending on the kind of
this organization lies in the basic logic entity analyzed by
environment considered. In this section, we will analyze such
the different methods: in the per-pixel techniques, temporal
issues following the idea adopted for the development of
pixels’ profiles are modeled as independent entities. Per-
the “Wallflower” dataset (http://research.microsoft.com/en-
region strategies exploit local analysis on pixel patches, in
us/um/people/jckrumm/WallFlower/TestImages.htm) pre-
order to take into account higher-order local information,
sented in [15]. The dataset consists of different video
like edges for instance, also to strengthen the per-pixel
sequences that is, olate and portray single issues that make
analysis. Per-frame approaches are based on a reasoning
the BG/FG discrimination difficult. Each sequence contains
procedure over the entire frame, and are mostly used as
a frame which serves as test, and that is, given together with
support of the other two policies. These classes of approaches
the associated ground truth. The ground truth is represented
can come as integrated multilayer solutions where the FG/BG
by a binary FG mask, where 1 (white) stands for FG. It is
estimation, made at lower per-pixel level, is refined by the
worth noting that the presence of a test frame indicates
per-region/frame level.
that in that frame a BG subtraction issue occurs; therefore,
When considering multiple, still video, sensors (Section 4),
the rest of the sequence cannot be strictly considered as an
we can distinguish between the approaches using sensors in
instance of a BG subtraction problem.
the form of a combined device (such as a stereo camera,
Here, we reconsider these same sequences together
where the displacement of the sensors is fixed, and typically
with new ones showing problems that are not taken into
embedded in a single hardware platform), and those in which
account in the Wallflower work. Some sequences portray also
a network of separate cameras, characterized in general by
problems which rarely have been faced in the BG subtraction
overlapping view fields, is considered.
literature. In this way, a very comprehensive list of BG
In Section 5, the approaches devoted to model audio
subtraction issues is given, associated with representative
background are investigated. Employing audio signals opens
sequences (developed by us or already publicly available)
up innovative scenarios, where cheap sensors are able to
that can be exploited for testing the effectiveness of novel
categorize different kind of background situations, high-
approaches.
lighting unexpected audio events. Furthermore, in Section 6
For the sake of clarity, from now on we assume as false
techniques exploiting infrared signals are considered. They
positive a FG entity which is identified as BG, and viceversa.
are particularly suited when the illumination of the scene is
Here is the list of problems and their relative rep-
very scarce. This concludes the approaches relying on a single
resentative sequences (http://profs.sci.univr.it/∼cristanm/
sensor channel.
BGsubtraction/videos) (see Figure 4):
The subsequent part analyzes how the single sensor
channels, possibly modeled with more than one sensor,
could be jointly employed through fusion policies in order Moved Object [15]. A background object can be moved.
to estimate multisensor background models. They inherit the Such object should not be considered part of the foreground
strengths of the different sensor channels, and minimize forever after, so the background model has to adapt and
the drawbacks typical of the single separate channels. In understand that the scene layout may be physically updated.
particular, we will investigate in Section 7 the approaches that This problem is tightly connected with that of the sleeping
fuse infrared + video and audio + video signals (see Figure 3). person (see below), where a FG object stand still in the
This part concludes the proposed taxonomy and is scene and, erroneously, becomes part of the scene. The
followed by the summarizing Section 8, where the typical sequence portrays a chair that is, moved in a indoor
problems of the BG subtraction are discussed, individuating scenario.

[Figure 3 diagram: single channel (visual, with single sensor: per-pixel, per-region, per-frame, or multiple sensors: single device, multiple devices; infrared; audio) and multiple channels (visual + infrared, visual + audio)]

Figure 3: Taxonomy of the proposed background subtraction methods.

Time of Day [15]. Gradual illumination changes alter the Camouflage [15]. A pixel characteristic of a foreground
appearance of the background. In the sequence the evolution object may be subsumed by the modeled background,
of the illumination provokes a global appearance change of producing a false negative. The sequence shows a flickering
the BG. monitor that alternates shades of blue and some white
regions. At some point, a person wearing a blue shirt moves
Light Switch [15]. Sudden changes in illumination alter in front of the monitor, hiding it. The shirt and the monitor
the appearance of the background. This problem is more have similar color information, so the FG silhouette tends do
difficult than the previous one, because the background be erroneously considered as a BG entity.
does evolve with a characteristic that is, typical of a
foreground entity, that is, being unexpected. In their paper Bootstrapping [15]. A training period without foreground
[15], the authors present a sequence where a global change objects is not always available in some environments, and
in the illumination of a room occurs. Here, we articulate this makes bootstrapping the background model hard. The
this situation adding the condition where the illumination sequence shows a coffee room where people walk and stay
change may be local. This situation may happen when standing for a coffee. The scene is never empty of people.
street lamps are turned on in an outdoor scenario; another Foreground Aperture [15]. When a homogeneously colored
situation may be that of an indoor scenario, where the object moves, changes in the interior pixels cannot be
illumination locally changes, due to different light sources. detected. Thus, the entire object may not appear as fore-
We name such problem, and the associated sequence, Local ground, causing false negatives. In the Wallflower sequence,
light switch. The sequence shows an indoor scenario, where this situation is made even extreme. A person is asleep at his
a dark corridor is portrayed. A person moves between two desk, viewed from the back. He wakes up and slowly begins
rooms, opening and closing the related doors. The light to move. His shirt is uniformly colored.
in the rooms is on, so the illumination spreads out over
the corridor, locally changing the visual layout. A back- Sleeping Foreground. A foreground object that becomes
ground subtraction algorithm has to focus on the moving motionless has to be distinguished from the background. In
entity. [15], this problem has not been considered because it implies
the knowledge of the foreground. Anyway, this problem is
Waving Trees [15]. Background can vacillate, globally and similar to that of the “moved object”. Here, the difference is
locally, so the background is not perfectly static. This implies that the object that becomes still does not belong to the scene.
that the movement of the background may generate false Therefore, the reasoning for dealing with this problem may
positives (movement is a property associated to the FG). be similar to that of the “moved object”. Moreover, this prob-
The sequence, depicted also in Figure 2, shows a tree that is, lem occurs very often in the surveillance situations, as wit-
moved continuously, simulating an oscillation in an outdoor nessed by our test sequence. This sequence portrays a cross-
situation. At some point, a person comes. The algorithm has ing road with traffic lights, where the cars move and stop. In
to highlight only the person, not the tree. such a case, the cars have not to be marked as background.

[Figure 4 rows: moved object, time of day, light switch, light switch (local), waving tree, camouflage, bootstrapping, foreground aperture, sleeping foreground, shadows, reflections]

Figure 4: Key problems for the BG subtraction algorithms. Each situation corresponds to a row in the figure; the images in the first two
columns (starting from the left) represent two frames of the sequence, the image in the third column is the test image, and the image in
the fourth column is the ground truth.
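Since each test sequence comes with a binary ground-truth FG mask, the behaviour of a candidate algorithm on the test frame can be scored by simply comparing masks. A minimal sketch of this comparison is given below (Python/NumPy); the function name and the choice of returning the two error counts separately are illustrative and not part of the Wallflower protocol itself.

import numpy as np

def mask_errors(detected_fg, ground_truth_fg):
    # Compare a binary detection mask with the binary ground-truth FG mask.
    # Returns (#pixels wrongly labelled FG, #pixels wrongly labelled BG).
    detected_fg = detected_fg.astype(bool)
    ground_truth_fg = ground_truth_fg.astype(bool)
    wrong_fg = np.count_nonzero(detected_fg & ~ground_truth_fg)
    wrong_bg = np.count_nonzero(~detected_fg & ground_truth_fg)
    return wrong_fg, wrong_bg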

Shadows. Foreground objects often cast shadows that appear Reflections. the scene may reflects foreground instances, due
different from the modeled background. Shadows are simply to wet or reflecting surfaces, such as the floor, the road,
erratic and local changes in the illumination of the scene, windows, glasses, and so for, and such entities have not to
so they have not to be considered FG entities. Here be classified as foreground. In the literature, this problem
we consider a sequence coming from the ATON project has been never explicitly studied, and it has been usually
(http://cvrr.ucsd.edu/aton/testbed/), depicting an indoor aggregated with that of the shadows. Anyway, reflections
scenario, where a person moves, casting shadows on the floor are different from shadows, because they retain edge infor-
and on the walls. The ground truth presents two labels: one mation that is, absent in the shadows. We present here
for the foreground and one for the shadows. a sequence where a traffic road intersection is monitored.

Table 1: A summary of the methods discussed in this paper, associated with the problems they solve. The meaning of the abbreviations is
reported in the text.

Problem columns: MO  TD  LS  LLS  WT  C  B  FGA  SFG  SH  R

Per-pixel: √ √ √ √
Per-region: √ √ √ √ √ √
Per-frame: √ √ √
Multistage: √ √ √ √ √ √
Multicamera: √ √ √ √ √ √
Infrared-sensor
Infrared + video
Infrared + video

The floor is wet and the shining sun provokes reflections of that contain predictive and nonpredictive parts), and does
the passing cars. not give hints on the capabilities of each approach.
In the following section, we will consider these situations The Wallflower paper [19] inspired us a different tax-
with respect to how the different techniques present in the onomy, similar to the one proposed in [20], that fills this
literature solve them (we explicitly refer to those approaches gap. Such work actually proposes a method that works
that consider the presented test sequences) or may help on different spatial levels: per-pixel, per-region, and per-
in principle to reach a good solution (in this case, we frame. Each level taken alone has its own advantages and
infer that a good solution is given for a problem when the is prone to well defined key problems; moreover, each level
sequence considered are similar to those of the presented individuates several approaches in the literature. Therefore,
dataset). individuating an approach as working solely in a particular
Please note that the Wallflower sequences contain only level makes us aware of what problems that approach
video data, and so all the other new sequences. Therefore, can solve. For example, considering every temporal pixel
for the approaches that work on other sensor channels, the evolution as an independent process (so addressing the
capability to solve one of these problems will be based on per-pixel level), and ignoring information observed at the
results applied on data sequences that present analogies with other pixels (so without performing any per-region/frame
the situations portrayed above. reasoning) cannot be adequate for managing the light switch
problem. This partition of the approaches into spatial logic
3. Single Monocular Video Sensor levels of processing (pixel, region, and frame) is consistent
with the nowadays BG subtraction state of the art, permitting
In a single camera setting, background subtraction focuses to classify all the existent approaches.
on a pixel matrix that contains the data acquired by Following these considerations, our taxonomy organizes
a black/white or color camera. The output is a binary the BG subtraction methods into three classes.
mask which highlights foreground pixels. In practice, the
process consists in comparing the current frame with the (i) Per-Pixel Processing. The class of per-pixel approaches
background model, individuating as foreground pixels those is formed by methods that perform BG/FG discrim-
not belonging to it. ination by considering each pixel signal as an inde-
Different classifications of BG subtraction methods for pendent process. This class of approaches is the most
monocular sensor settings have been proposed in literature. adopted nowadays, due to the low computational
In [13, 14], the techniques are divided into recursive and effort required.
nonrecursive ones, where recursive methods maintain a
single background model that is, updated using each new (ii) Per-Region/Frame Processing. Region-based algo-
coming video frame. Nonrecursive approaches maintain a rithms relax the per-pixel independency assumption,
buffer with a certain quantity of previous video frames and thus permitting local spatial reasoning in order
estimate a background model based solely on the statistical to minimize false positive alarms. The underlying
properties of these frames. motivations are mainly twofold. First, pixels may
A second classification [12] divides existing meth- model parts of the background scene which are
ods in predictive and nonpredictive. Predictive algo- locally oscillating or moving slightly, like leafs or
rithms model a scene as a time series and develop a flags. Therefore, the information needed to capture
dynamic model to evaluate the current input based on these BG phenomena has not to be collected and
the past observations. Nonpredictive techniques neglect evaluated over a single pixel location, but on a larger
the order of the input observations and build a proba- support. Second, considering the neighborhood of a
bilistic representation of the observations at a particular pixel permits to assess useful analysis, such as edge
pixel. extraction or histogram computation. This provides
However, the above classifications do not cover the entire a more robust description of the visual appearance of
range of existent approaches (actually, there are techniques the observed scene.

(iii) Per-Frame Processing. Per-frame approaches extend These parameters are initially estimated from the first few
the local support of the per-region methods to the seconds of a video and are periodically updated for those
entire frame, thus facing global problems like the parts of the scene not containing foreground objects.
light switch. The drawback of these models are that only monomodal
background are taken into account, thus ignoring all the
3.1. Per-Pixel Processes. In order to ease the reading, we situations where multimodality in the BG is present. For
group together similar approaches, considering the most example, considering a water surface, each pixel has at least
important characteristics that define them. This permits also a bimodal distribution of colors, highlighting the sea and the
to highlight in general pros and cons of multiple approaches. sun reflections.

3.1.1. Early Attempts of BG Subtraction. To the best of our 3.1.3. Multimodal Approaches. One of the first approaches
knowledge, the first attempt to implement a background dealing with multimodality is proposed in [28], where a
subtraction model for surveillance purposes is the one in mixture of Gaussians is incrementally learned for each pixel.
[21], where the differencing of adjacent frames in a video The application scenario is the monitoring of an highway,
sequence are used for object detection in stationary cameras. and a set of heuristics for labeling the pixels representing the
This simple procedure is clearly not adapt for long-term road, the shadows and the cars are proposed.
analysis, and suffers from many practical problems (one for An important approach that introduces a parametric
all, it does not highlight the entire FG appearance, due to the modeling for multimodal background is the Mixture of
overlapping between moving objects across frames). Gaussians (MoG) model [29]. In this approach, the pixel
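As a rough illustration of the adjacent-frame differencing idea of [21], the following minimal sketch (Python/NumPy) labels as foreground the pixels whose gray value changes by more than a threshold between two consecutive frames; the threshold value is purely illustrative.

import numpy as np

def frame_difference_mask(prev_frame, curr_frame, thresh=25):
    # prev_frame, curr_frame: grayscale frames (uint8 arrays) of equal size.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > thresh  # boolean foreground mask

As noted above, such a mask only highlights the changed portions of moving objects, which is one reason why this scheme is unsuited to long-term analysis.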
evolution is statistically modeled as a multimodal signal,
3.1.2. Monomodal Approaches. Monomodal approaches described using a time-adaptive mixture of Gaussian com-
assumes that the features that characterize the BG values of a ponents, widely employed in the surveillance community.
pixel location can be segregated in a single compact support. Each Gaussian component of a mixture describes a gray
One of the first and widely adopted strategy was proposed in level interval observed at a given pixel location. A weight is
the surveillance system Pfinder [22], where each pixel signal associated to each component, mirroring the confidence of
z(t) is modeled in the YUV space by a simple mean value, portraying a BG entity. In practice, the higher the weight, the
updated on-line. At each time step, the likelihood of the stronger the confidence, and the longer the time such gray
observed pixel signal, given an estimated mean, is computed level has been recently observed at that pixel location. Due
and a FG/BG labeling is performed. to the relevance assumed in the literature and the numerous
A similar approach has been proposed in [23], exploiting proposed improvements, we perform here a detailed analysis
a running Gaussian average. The background model is of this approach.
updated if a pixel is marked as foreground for more than More formally, the probability of observing the pixel
m of the last M frames, in order to compensate for sudden value z(t) at time t is
illumination changes and the appearance of static new
  
R  
objects. If a pixel changes state from FG to BG frequently,
P z(t) = wr(t) N z(t) | μ(t) (t)
r , σr , (2)
it is labeled as a high-frequencies background element and it r =1
is masked out from inclusion in the foreground.
Median filtering sets each color channel of a pixel in where wr(t) , μ(t) (t)
r and σr are the mixing coefficients, the mean,
the background as modeled by the median value, obtained and the standard deviation, respectively, of the rth Gaussian
from a buffer of previous frames. In [24], a recursive filter is N (·) of the mixture associated with the signal at time t. The
used to estimate the median, achieving a high computational Gaussian components are ranked in descending order using
efficiency and robustness to noise. However, a notable limit the w/σ value: the most ranked components represent the
is that it does not model the variance associated to a BG “expected” signal, or the background.
value. At each time instant, the Gaussian components are
Instead of independently estimating the median of each evaluated in descending order to find the first matching with
channel, the medoid of a pixel can be estimated from the observation acquired (a match occurs if the value falls
the buffer of video frames as proposed in [25]. The idea within 2.5σ of the mean of the component). If no match
is to consider color channels together, instead of treating occurs, the least ranked component is discarded and replaced
each color channel independently. This has the advantage with a new Gaussian with the mean equal to the current
of capturing the statistical dependencies between color value, a high variance σinit , and a low mixing coefficient winit .
channels. If rhit is the matched Gaussian component, the value z(t) is
In W 4 [26, 27], a pixel is marked as foreground if its value labeled FG if
satisfies a set of inequalities, that is,

    |M − z(t)| > D  ∨  |N − z(t)| > D,    (1)

    Σ_{r=1}^{rhit−1} wr(t) > T,    (3)

where T is a standard threshold. The equation that drives the


where the (per-pixel) parameters M, N, and D represent evolution of the mixture’s weight parameters is the following:
the minimum, maximum, and largest interframe absolute
difference observable in the background scene, respectively. wr(t) = (1 − α)wr(t−1) + αM (t) , 1 ≤ r ≤ R, (4)
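To make the monomodal per-pixel models above concrete, a minimal running-Gaussian sketch in the spirit of the running average of [22, 23] is reported below (Python/NumPy); the learning rate, the initial variance, and the 2.5-sigma-style decision factor are illustrative choices rather than the values used in the cited works.

import numpy as np

class RunningGaussianBG:
    # Per-pixel running Gaussian background model (monomodal), grayscale frames.
    def __init__(self, first_frame, alpha=0.02, k=2.5):
        self.mean = first_frame.astype(np.float32)
        self.var = np.full_like(self.mean, 15.0 ** 2)
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        frame = frame.astype(np.float32)
        d2 = (frame - self.mean) ** 2
        fg = d2 > (self.k ** 2) * self.var          # per-pixel decision
        upd = ~fg                                   # selective update on BG pixels only
        self.mean[upd] += self.alpha * (frame - self.mean)[upd]
        self.var[upd] += self.alpha * (d2 - self.var)[upd]
        return fg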

Figure 5: A near infrared image (a) from CBSR dataset [16, 17] and a thermal image (b) from Terravic Research Infrared Database [17, 18].

where M (t) is 1 for the matched Gaussian (indexed by rhit ) usually causing false positives. Examples of such situations
and 0 for the others, and α is the learning rate. The other are sudden changes in the chromatic aspect of the scene, due
parameters are updated as follows: to the weather evolution or local light switching.
 
(t −1)
μ(t) (t)
rhit = 1 − ρ μrhit + ρz ,
    (5) 3.1.4. Nonparametric Approaches. In [32], a nonparamet-
  ric technique estimating the per-pixel probability density
σr2hit(t) = 1 − ρ σr2hit(t−1) + ρ z(t) − μ(t)
rhit
T
z(t) − μ(t)
rhit ,
function using the kernel density estimation (KDE) [33]
where ρ = αN (z(t) | μ(t) (t)
rhit , σrhit ). It is worth noting that the technique is developed (KDE method is an example of Parzen
higher the adaptive rate α, the faster the model is “adapted” window estimate, [34]). This faces the situation where the
to signal changes. In other words, for a low learning rate, pixel values” density function is complex and cannot be
MoG produces a wide model that has difficulty in detecting a modeled parametrically, so a non-parametric approach able
sudden change to the background (so, it is prone to the light to handle arbitrary densities is more suitable. The main
switch problem, global and local). If the model adapts too idea is that an approximation of the background density
quickly, slowly moving foreground pixels will be absorbed can be given by the histogram of the most recent values
into the background model, resulting in a high false negative classified as background values. However, as the number of
rate (the problem of the foreground aperture). samples is necessarily limited, such an approximation suffers
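A compact, single-pixel sketch of the MoG machinery just described (the 2.5-sigma match test, the weight update (4), the mean/variance update (5), and the background test (3)) may help to clarify the mechanics; it is written in Python/NumPy, and all parameter values are illustrative rather than those recommended in [29, 30].

import numpy as np

def mog_update_pixel(z, w, mu, var, alpha=0.01, T=0.7, var_init=900.0, w_init=0.05):
    # w, mu, var: 1-D arrays holding the R per-pixel mixture components (updated in place).
    order = np.argsort(-w / np.sqrt(var))            # rank components by w/sigma
    match = -1
    for r in order:
        if abs(z - mu[r]) < 2.5 * np.sqrt(var[r]):   # match test
            match = r
            break
    if match < 0:
        worst = order[-1]                            # replace the least ranked component
        mu[worst], var[worst], w[worst] = z, var_init, w_init
        is_fg = True
    else:
        rank_pos = int(np.where(order == match)[0][0])
        is_fg = w[order][:rank_pos].sum() > T        # background test, Eq. (3)
        rho = alpha * np.exp(-0.5 * (z - mu[match]) ** 2 / var[match]) \
              / np.sqrt(2.0 * np.pi * var[match])
        w *= (1.0 - alpha)                           # weight update, Eq. (4)
        w[match] += alpha
        mu[match] = (1.0 - rho) * mu[match] + rho * z
        var[match] = (1.0 - rho) * var[match] + rho * (z - mu[match]) ** 2
    w /= w.sum()                                     # keep the weights normalized
    return is_fg

For full-frame use, libraries such as OpenCV also expose a ready-made Gaussian-mixture subtractor (e.g., cv2.createBackgroundSubtractorMOG2), which can be employed when reimplementing the update rules is not required.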
MoG has been further improved by several authors, see from significant drawbacks: the histogram might provide
[30, 31]. In [30], the authors specify (i) how to cope with poor modeling of the true pdf, especially for rough bin
color signals (the original version was proposed for gray quantizations, with the tails of the true pdf often missing.
values), proposing a normalization of the RGB space taken Actually, KDE guarantees a smoothed and continuous
from [12], (ii) how to avoid overfitting and underfitting version of the histogram. In practice, the background pdf
(values of the variances too low or too high), proposing a is given as a sum of Gaussian kernels centered in the most
thresholding operation, and (iii) how to deal with sudden recent n background values, bi
and global changes of the illumination, by changing the
  1  (t)
n 
learning rate parameter. For the latter, the idea is that if P z(t) = z − bi , Σt . (6)
the foreground changes from one frame to another more n i=1
than the 70%, the learning rate value grows up, in order to
permit a faster evolution of the BG model. Note that this In this case, each Gaussian describes one sample data, and
improvement adds global (per-frame) reasoning to MoG, not a whole mode as in [29], with n in the order of 100,
so it does not belong properly to the class of per-pixel and covariance fixed for all the samples and all the kernels.
approaches. The classification of z(t) as foreground is assumed when
In [31], the number of Gaussian components is automat- P(z(t) ) < T. The parameters of the mixtures are updated
ically chosen, using a Maximum A-Posteriori (MAP) test and by changing the buffer of the background values in FIFO
employing a negative Dirichlet prior. order by selective update, and the covariance (in this case,
Even if per-pixel algorithms are widely used for their a diagonal matrix) is estimated in the time domain by
excellent compromise between accuracy and speed (in com- analyzing the set of differences between two consecutive
putational terms), these techniques present some drawbacks, values. In [32], such model is duplicated: one model is
mainly due to the interpixel independency assumption. employed for a long-term background evolution modeling
Therefore, any situation that needs a global view of the (for example dealing with the illumination evolution in a
scene in order to perform a correct BG labeling is lost, outdoor scenario) and the other for the short-term modeling
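The nonparametric estimate of Eq. (6) can be sketched per pixel as follows (Python/NumPy); a fixed scalar bandwidth is assumed here in place of the full covariance Σt, and the decision threshold is illustrative.

import numpy as np

def kde_background_probability(z, samples, sigma=10.0):
    # samples: the last n gray values classified as background at this pixel.
    samples = np.asarray(samples, dtype=np.float64)
    kernels = np.exp(-0.5 * ((z - samples) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernels.mean()  # Eq. (6) with a scalar bandwidth

def is_foreground(z, samples, threshold=1e-4, sigma=10.0):
    return kde_background_probability(z, samples, sigma) < threshold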

(for flickering surfaces of the background). Intersecting the 3.2.2. Texture- and Edge-Based Approaches. These
estimations of the two models gives the first stage results of approaches exploit the spatial local information for
detection. The second stage of detection aims at suppressing extracting structural information such as edges or textures.
the false detections due to small and unmodelled movements In [36], video sequences are analyzed by dividing the scene
of the scene background that cannot be observed employing in overlapped squared patches. Then, intensity and gradient
a per-pixel modeling procedure alone. If some parts of the kernel histograms are built for each patch. Roughly speaking,
background (a tree branch, for example) moves to occupy intensity (gradient) kernel histograms count pixel (edge)
a new pixel, but it is not part of the model for that values as weighted entities, where the weight is given by a
pixel, it will be detected as a foreground object. However, Gaussian kernel response. The Gaussian kernel, applied on
this object will have a high probability to be a part of each patch, gives more importance to the pixel located in
the background distribution at its original pixel location. the center. This formulation gives invariance to illumination
Assuming that only a small displacement can occur between changes and shadows because the edge information helps
consecutive frames, a detected FG pixel is evaluated as caused in discriminating a FG occluding object, that introduces
by a background object that has moved by considering the different edge information in the scene, and a (light) shadow,
background distributions in a small neighborhood of the that only weakens the BG edge information.
detection area. Considering this step, this approach could In [37], a region model describing local texture char-
also be intended as per-region. acteristics is presented through a modification of the Local
In their approach, the authors also propose a method for Binary Patterns [38]. This method considers for each pixel
dealing with the shadows problem. The idea is to separate a fixed circular region and calculates a binary pattern of
the color information from the lightness information. Chro- length N where each ordered value of the pattern is 1 if the
maticity coordinates [35] help in suppressing shadows, but difference between the center and a particular pixel lying on
loses lightness information, where the lightness is related to the circle is larger than a threshold. This pattern is calculated
the difference in whiteness, blackness and grayness between for each neighboring pixel that lies in the circular region.
different objects. Therefore, the adopted solution considers Therefore, a histogram of binary patterns is calculated.
S = R + G + B as a measure of lightness, where R, G This is done for each frame and, subsequently, a similarity
and B are the intensity values for each color channel of a function among histograms is evaluated for each pixel, where
given pixel. Imposing a range on the ratio between a BG the current observed histogram is compared with a set of
pixel value and its version affected by a shadow permits to K weighted existing models. Low-weighted models stand for
perform a good shadow discrimination. Please note that, in FG, and vice versa. The model most similar to the histogram
this case, the shadow detection relies on a pure per-pixel observed is the one that models the current observation, so
reasoning. increasing its weight. If no model explains the observation,
Concerning the computational efforts of the per-pixel the pixel is labeled as FG, and a novel model is substituted
processes, in [9] a good analysis is given: speed and memory with the least supported one. The mechanism is similar to
usage of some widely used algorithms are taken into account. the one used for per-pixels BG modeling proposed in [29].
Essentially, monomodal approaches are generally the fastest, The texture analysis for BG subtraction is considered also
while multimodal and non-parametric techniques exhibit in [39], where it is proposed a combined pixel-region model
higher complexity. Regarding the memory usage, non- where the color information associated to a pixel is defined
parametric approaches are the most demanding, because in a photometric invariant space, and the structural region
they need to collect for each pixel a statistics on the past information derives from a local binary pattern descriptor,
values. defined in the pixel’s neighborhood area. The two aspects
are linearly combined in a whole signature that lives in a
3.2. Per-Region Processes. Region-level analysis considers a multimodal space, which is modeled and evaluated similarly
higher level representation, modeling also interpixel rela- to MoG. This model results particularly robust to shadows.
tionships, allowing a possible refinement of the modeling Another very similar approach is presented in [40], where
obtained at the pixel level. Region-based algorithms usually color and gradient information are explicitly modeled as
consider a local patch around each pixel, where local time adaptive Gaussian mixtures.
operations may be carried out.
3.2.3. Sampling Approaches. The sampling approaches eval-
3.2.1. Nonparametric Approaches. This class could include uate a wide local area around each pixel to perform complex
also the approach of [32], above classified as per-pixel, since analysis. Therefore, the information regarding the spatial
it incorporats a part of the technique (the false suppression support is collected through sampling, which in some cases
step) that is, inherently per-region. permits to fasten the analysis.
A more advanced approach using adaptive kernel density In [41], the pixel-region mixing is carried out with a
estimation is proposed in [12]. Here, the model is genuinely spatial sampling mechanism, that aims at producing a finer
region-based: the set of pixels values needed to compute BG model by propagating BG pixels values in a local area.
the histogram (i.e., the nonparametric density estimate for a This principle resembles a region growing segmentation
pixel location) is collected over a local spatial region around algorithm, where the statistics of an image region is built
that location, and not exclusively on the past values of that by considering all the belonging pixels. In this way, regions
pixel. affected by a local, small chromatic variation (due to a cloudy
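To give a concrete flavour of the local binary pattern descriptor exploited by the texture-based models of Section 3.2.2, a minimal sketch is reported below (Python/NumPy); the ring radius, the number of sampled neighbours, and the comparison threshold are illustrative, and nearest-neighbour sampling is used instead of interpolation.

import numpy as np

def lbp_code(image, row, col, radius=2, n_points=8, t=3):
    # Binary pattern around (row, col): bit p is set when the sampled neighbour
    # on the circle exceeds the centre value by more than t.
    center = int(image[row, col])
    code = 0
    for p in range(n_points):
        angle = 2.0 * np.pi * p / n_points
        r = int(round(row + radius * np.sin(angle)))
        c = int(round(col + radius * np.cos(angle)))
        if int(image[r, c]) - center > t:
            code |= 1 << p
    return code

Histograms of such codes, collected over a local region and compared against a small set of weighted model histograms, then play the role that single gray values play in the per-pixel models.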

weather or shadows, for example), become less sensitive to then coupled together through the use of a Markov Random
the false positives. The propagation of BG samples is done Field (MRF) spatial prior. Limitations of the model concern
with a particle filter policy, and a pixel values with higher the considered approximation of the camera model, affine
likelihood of being BG is propagated farer in the space. As instead of fully perspective, but, experimentally, it has been
per-pixel model, a MoG model is chosen. The drawback of shown not to be very limiting.
the method is that it is computational expensive, due to the
particle filtering sampling process. 3.2.5. Hybrid Foreground/Background Models for BG Subtrac-
In [42] a similar idea of sampling the spatial neigh- tion. These models includes in the BG modeling a sort of
borhood for refining the per-pixel estimate is adopted. knowledge of the FG, so they may not be classified as pure
The difference here lies in the per-pixel model, that is, BG subtraction methods. In [20], a BG model competes
non-parametric, and it is based on a Parzen windows-like with an explicit FG model in providing the best description
process. The model updating relies on a random process that of the visual appearance of a scene. The method is based
substitutes old pixel values with new ones. The model has on a maximum a posteriori framework, which exhibits the
been compared favorably with the MoG model of [31] with product of a likelihood term and a prior term, in order
a small experimental dataset. to classify a pixel as FG or BG. The likelihood term is
obtained exploiting a ratio between nonparametric density
3.2.4. BG Subtraction Using a Moving Camera. The estimations describing the FG and the BG, respectively,
approaches dealing with moving cameras focus mainly and the prior is given by employing an MRF that models
on compensating the camera ego-motion, checking if the spatial similarity and smoothness among pixels. Note that,
statistics of a pixel can be matched with the one present in other than the MRF prior, also the non-parametric density
a reasonable neighborhood. This occurs through the use estimation (obtained using the Parzen Windows method)
of homographies or 2D affine transformations of layered works on a region level, looking for a particular signal
representations of the scene. intensity of the pixel in an isotropic region defined on a joint
Several methods [43–46] well apply to scenes where the spatial and color domain.
camera center does not translate, that is, when using of PTZ The idea of considering a FG model together with a
cameras (pan, tilt, or zoom motions). Another favorable BG model for the BG subtraction has been also taken into
scenario is when the background can be modeled by a plane. account in [56], where a pool of local BG features is selected
When the camera may translate and rotate, other strategies at each time step in order to maximize the discrimination
have been adopted. from the FG objects. A similar approach has been taken
In the plane + parallax framework [47–49], a homogra- into account in [57], where the authors propose a boosting
phy is first estimated between successive image frames. The approach which selects the best features for separating BG
registration process removes the effects of camera rotation, and FG.
zoom, and calibration. The residual pixels correspond either Concerning the computational efforts, per-region
to moving objects or to static 3D structures with large depth approaches exhibit higher complexity, both in space and in
variance (parallax pixels). To estimate the homographies, time, than the per-pixel ones. Anyway, the most papers claim
these approaches assume the presence of a dominant plane real-time performances.
in the scene, and have been successfully used for object
detection in aerial imagery where this assumption is usually 3.3. Per-Frame Approaches. These approaches extend the
valid. local area of refinement of the per-pixel analysis to being
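A rough sketch of the registration step underlying the plane + parallax strategy is given below; it relies on OpenCV feature tracking and RANSAC homography estimation, and the corner and threshold parameters are illustrative.

import cv2
import numpy as np

def motion_compensated_difference(prev_gray, curr_gray, thresh=30):
    # Register prev_gray onto curr_gray with a homography, then difference.
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=7)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good_prev = pts_prev[status.ravel() == 1]
    good_curr = pts_curr[status.ravel() == 1]
    H, _ = cv2.findHomography(good_prev, good_curr, cv2.RANSAC, 3.0)
    h, w = curr_gray.shape
    warped_prev = cv2.warpPerspective(prev_gray, H, (w, h))
    diff = cv2.absdiff(curr_gray, warped_prev)
    return diff > thresh  # residual pixels: moving objects or parallax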
Layer-based methods [50, 51] model the scene as piece- the entire frame. In [58], a graphical model is used to
wise planar scenes, and cluster segments based on some adequately model illumination changes of a scene. Even if
measure of motion coherency. results are promising, it is worth noting that the method has
In [52], a layer-based approach is explicitly suited for not be evaluated in its on-line version, nor it works in real-
background subtraction from moving cameras but report time; further, illumination changes should be global and pre-
low performance for scenes containing significant parallax classified in a training section.
(3D scenes). In [59], a per-pixel BG model was chosen from a set of
Motion segmentation approaches like [53, 54] sparsely pre-computed ones in order to minimize massive false alarm.
segment point trajectories based on the geometric coherency The method proposed in [60] captures spatial correla-
of the motion. tions by applying principal component analysis [34] to a
In [55], a technique based on sparse reasoning is set of NL video frames that do not contain any foreground
presented, which also deals with rigid and nonrigid FG objects. This results in a set of basis functions, whose the
objects of various size, merged in a full 3D BG. The first d are required to capture the primary appearance
underlying assumptions regard the use of an orthographic characteristics of the observed scene. A new frame can
camera model and that the background is the spatially then be projected into the eigenspace defined by these d
dominant rigid entity in the image. Hence, the idea is that the basis functions and then back projected into the original
trajectories followed by sparse points of the BG scene lie in a image space. Since the basis functions only model the static
three-dimensional subspace, estimated through RANSAC, so part of the scene when no foreground objects are present,
allowing to highlight outlier trajectories as FG entities, and the back projected image will not contain any foreground
to produce a sparse pixel FG/BG labeling. Per-pixel labels are objects. As such, it can be used as a background model.
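The eigenbackground scheme of [60] can be sketched with a few lines of linear algebra (Python/NumPy); the number of retained basis functions and the reconstruction-error threshold are illustrative, and the training stack is assumed to contain at least d foreground-free frames.

import numpy as np

def fit_eigenbackground(frames, d=10):
    # frames: (NL, H, W) stack of background-only frames, NL >= d.
    NL, H, W = frames.shape
    X = frames.reshape(NL, -1).astype(np.float32)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d]                      # mean image and d basis functions

def eigen_foreground(frame, mean, basis, thresh=30.0):
    x = frame.reshape(-1).astype(np.float32) - mean
    recon = basis.T @ (basis @ x) + mean     # project and back-project
    err = np.abs(frame.reshape(-1).astype(np.float32) - recon)
    return (err > thresh).reshape(frame.shape)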

The major limitation of this approach lies just on the original The method is slow and no real-time implementation is
hypothesis of absence of foreground objects to compute the presented by the authors, due to the computation of the
basis functions which is not always possible. Moreover, it is filters’ coefficients.
also unclear how the basis functions can be updated over This computational issue has been subsequently solved in
time if foreground objects are going to be present in the [64]. Given the same quadtree structure, instead of entirely
scene. analyzing each zone covered by a filter, only one pixel is
Concerning the computational efforts, per-frames ap- randomly sampled and analyzed for each region (filter) at the
proaches usually are based on a training step and classifica- highest level of the hierarchy. If no FG is detected, the analysis
tion step. The training part is carried out in a offline fashion, stops; otherwise, the analysis is further propagated on the
while the classification part is well suited for a real-time 4 children belonging to the lower level, down to reach the
usage. lowest one. Here, in order to get the fine boundaries of the
BG silhouette, a 4-connected neighborhood region growing
3.4. Multistage Approaches. The multistage approaches con- algorithm is performed on each of the FG children. The
sist in those techniques that are formed by several serial exploded quadtree is used as default structure for the next
heterogeneous steps, that thus cannot be included properly frame in order to cope efficiently with the overlap among FG
in any of the classes seen before. regions between consecutive frames.
In Wallflower [15], a 3-stage algorithm that operates In [65], a nonparametric, per pixel FG estimation is
respectively at pixel, region and frame level is presented. followed by a set of morphological operations in order
At the pixel level, a couple of BG models is maintained for to solve a set of BG subtraction common issues. These
each pixel independently: both the models are based on a 40- operations evaluate the joint behavior of similar and prox-
coefficients, one-step Wiener filter, where the (past) values imal pixel values by connected-component analysis that
taken into account are the predicted values by the filter in one exploits the chromatic information. In this way, if several
case, and the observed values in the other. A double check pixels are marked as FG, forming a connected area with
against these two models is performed at each time step: the possible holes inside, the holes can be filled in. If this area
current pixel value is considered as BG if it differs less than 4 is very large, the change is considered as caused by a fast
times the expected squared prediction error calculated using and global BG evolution, and the entire area is marked as
the two models. BG.
At the region level, a region growing algorithm is applied. All the multistage approaches require high compu-
It essentially closes the possible holes (false negative) in the tational efforts, if compared with the previous analysis
FG if the signal values in the false negative locations are paradigms. Anyway, in all the aforementioned papers the
similar to the values of the surrounding FG pixels. At the multistage approaches are claimed to be functioning in a
frame level, a set of global BG models is finally generated. real-time setting.
When a big portion of the scene is suddenly detected as FG,
the best model is selected, that is, the one that minimizes the 3.5. Approaches for the Background Initialization. In the
amount of FG pixels. realm of the BG subtraction approach in a monocular
A similar, multilevel approach has been presented in video scenario, a quite relevant aspect is the one of the
[61], where the problem of the local/global light switch is background initialization, that is, how a background model
taken into account. The approach lies on a segmentation of has to be bootstrapped. In general, all of the presented
the background [62] which segregates portions of the scene methods discard the solution of computing a simple mean
where the chromatic aspect is homogeneous and evolves over all the frames, because it produces an image that exhibits
uniformly. When a background region suddenly changes its blending pixel values in areas of foreground presence. A
appearance, it is considered as a BG evolution instead of a general analysis regarding the blending rate and how it may
FG appearance. The approach works well when the regions be computed is present in [66].
in the scene are few and wide. Conversely, the performances In [67], the background initial values are estimated by
are poor when the scene is oversegmented, that in general calculating the median value of all the pixels in the training
occurs for outdoor scenes. sequence, assuming that the background value in every pixel
In [63], the scene is partitioned using a quadtree location is visible more than 50% of the time during the
structure, formed by minimal average correlation energy training sequence. Even if this method avoids the blending
(MACE) filters. Starting with large-sized filters (32 × 32 effects of the mean, the output of the median will contains
pixels), 3 levels of smaller filters are employed, until the large error when this assumption is false.
lower level formed by 4 × 4 filters. The proposed technique Another proposed work [68], called adaptive smoothness
aims at avoiding false positives: when a filter detects the method, avoids the problem of finding intervals of stable
FG presence on more than 50% of its area, the analysis intensity in the sequence. Then, using some heuristics, the
is propagated to the 4 children belonging to the lower longest stable value for each pixel is selected and used as the
level, and in turn to the 4-connected neighborhood of each value that most likely represents the background.
one of the children. When the analysis reaches the lowest This method is similar to the recent Local Image
(4 × 4) level and FG is still discovered, the related set Flow algorithm [69], which generates background values’
of pixels are marked as FG. Each filter modeling a BG hypotheses by locating intervals of relatively constant inten-
zone is updated, in order to deal with slowly changing BG. sity, and weighting these hypotheses by using local motion
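For the median-based bootstrap of [67], a one-line sketch suffices (Python/NumPy); it assumes, as stated above, that the true background value is visible at each pixel for more than half of the training sequence.

import numpy as np

def bootstrap_background_median(training_frames):
    # training_frames: array of shape (N, H, W) or (N, H, W, 3).
    return np.median(training_frames, axis=0).astype(training_frames.dtype)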

Another proposed work [68], called the adaptive smoothness method, avoids this problem by finding intervals of stable intensity in the sequence. Then, using some heuristics, the longest stable value for each pixel is selected and used as the value that most likely represents the background.

This method is similar to the recent Local Image Flow algorithm [69], which generates background value hypotheses by locating intervals of relatively constant intensity, and weighting these hypotheses by using local motion information. Unlike most of the proposed approaches, this method does not treat each pixel value sequence as an i.i.d. (independent identically distributed) process, but also considers information generated by the neighboring locations.

In [62], a hidden Markov model clustering approach was proposed in order to consider homogeneous compact regions of the scene whose chromatic aspect evolves uniformly. The approach fits an HMM for each pixel location, and the clustering operates using a similarity distance which weights more heavily the pixel values portraying BG values.

In [70], an inpainting-based approach for BG initialization is proposed: the idea is to apply a region-growing spatiotemporal segmentation approach, which is able to expand a safe, local BG region by exploiting perceptual similarity principles. The idea has been further improved in [71], where the region-growing algorithm has been further developed, adopting graph-based reasoning.

3.6. Capabilities of the Approaches Based on a Single Video Sensor. In this section, we summarize the capabilities of the BG subtraction approaches based on a monocular video camera, by considering their abilities in solving the key problems expressed in Section 2.

In general, any approach that permits an adaptation of the BG model can deal with situations in which the BG globally and slowly changes in appearance. Therefore, the problem of time of day can generally be solved by this kind of method. Algorithms assuming multimodal background models face the situation where the background appearance oscillates between two or more color ranges. This is particularly useful in dealing with outdoor situations where there are several moving parts in the scene or flickering areas, such as tree leaves, flags, fountains, and the sea surface. This situation is well portrayed by the waving trees key problem. The other problems represent situations which in principle imply strong spatial reasoning, thus requiring per-region approaches. Let us discuss each of the problems separately: for each problem, we specify those approaches that explicitly focus on that issue.

Moved Objects. All the approaches examined fail in dealing with this problem, in the sense that an object moved in the scene, while belonging to the scene, is detected as foreground for a certain amount of time. This amount depends on the adaptivity rate of the background model, that is, the faster the rate, the smaller the time interval.

Time of Day. BG model adaptivity ensures success in dealing with this problem, and almost every approach considered is able to solve it.

Global Light Switch. This problem is solved by those approaches which consider the global aspect of the scene. The main idea is that when a global change does occur in the scene, that is, when a consistent portion of the frame labeled as BG suddenly changes, a recovery mechanism is instantiated which evaluates the change as a sudden evolution of the BG model, so that the amount of false positive alarms is minimized. The techniques which explicitly deal with this problem are [15, 58, 59, 61, 65]. In all the other adaptive approaches, this problem generates a massive amount of false positives until the learning rate "absorbs" the novel aspect of the scene. Another solution consists in considering texture or edge information [36].

Local Light Switch. This problem is solved by those approaches which learn in advance how the illumination can locally change the aspect of the scene. Nowadays, the only approach which deals with this problem is [61].

Waving Trees. This problem is successfully faced by two classes of approaches. One is the per-pixel methods that admit a multimodal BG model (the movement of the tree is usually repetitive and holds for a long time, causing a multimodal BG). The other class is composed of the per-region techniques which inspect the neighborhood of a "source" pixel, checking whether the object portrayed in the source has locally moved or not.

Camouflage. Solving the camouflage issue is possible when information other than the sole chromatic aspect is taken into account. For example, texture information greatly improves the BG subtraction [36, 37, 39]. The other source of information comes from the knowledge of the foreground; for example, employing contour information or connected-component analysis on the foreground, it is possible to recover from the camouflage problem by performing morphological operations [15, 65].

Foreground Aperture. Even in this case, texture information improves the expressivity of the BG model, helping where the mere chromatic information leads to ambiguity between BG and FG appearances [36, 37, 39].

Sleeping Foreground. This problem is the one most related to FG modeling: actually, using only visual information and without an exact knowledge of the FG appearance (which may help in detecting a still FG object which must remain separated from the scene), this problem cannot be solved. This is implied by the basic definition of the BG, that is, any static visual element whose appearance does not change over time is background.

Shadows. This problem can be faced employing two strategies: the first implies a per-pixel color analysis, which aims at modeling the range of variations assumed by the BG pixel values when affected by shadows, thus avoiding false positives. The best-known approach in this class is [25], where the shadow analysis is carried out in the HSV color space. Other approaches try to define shadow-invariant color spaces [30, 32, 65]. The other class of strategies considers edge information, which is more robust against shadows [36, 39, 40].
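The per-pixel chromatic test of the first strategy can be sketched as follows, in the spirit of [25]: a pixel already labeled as foreground is relabeled as shadow when its value (brightness) is attenuated with respect to the background model, while hue and saturation remain almost unchanged. The threshold values below are illustrative placeholders of ours, not the ones tuned in the original paper.

```python
import colorsys

def is_shadow(pix_rgb, bg_rgb, alpha=0.4, beta=0.9, tau_s=0.1, tau_h=0.1):
    """Return True if a foreground pixel is likely a cast shadow.

    pix_rgb, bg_rgb: (r, g, b) tuples in [0, 1] for the current frame and
    for the background model at the same location.
    """
    ph, ps, pv = colorsys.rgb_to_hsv(*pix_rgb)
    bh, bs, bv = colorsys.rgb_to_hsv(*bg_rgb)
    if bv == 0:                       # background pixel is black: no test
        return False
    ratio = pv / bv                   # brightness attenuation
    hue_diff = min(abs(ph - bh), 1.0 - abs(ph - bh))  # hue is circular
    return (alpha <= ratio <= beta and
            abs(ps - bs) <= tau_s and
            hue_diff <= tau_h)
```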
Reflections. This problem has never been considered in scenarios employing a single monocular video camera.

In general, the approaches that successfully face several of the above problems simultaneously (i.e., that present results on several Wallflower sequences) are [15, 36, 65].

4. Multiple Video Sensors

The majority of background subtraction techniques are designed to be used in a monocular camera framework, which is highly effective for many common surveillance scenarios. However, this setting encounters difficulties in dealing with sudden illumination changes, reflections, and shadows.

The use of two or more cameras for background modeling serves to overcome these problems. Illumination changes and reflections depend on the field of view of the camera and can be managed by observing the scene from different viewpoints, while shadows can be filtered out if 3D information is available. Even if it is possible to determine the 3D world positions of the objects in the scene with a single camera (e.g., [72]), this is in general very difficult and unreliable [73]. Therefore, multicamera approaches to retrieve 3D information have been proposed, based on the following.

(i) Stereo Camera. A single device integrating two or more monocular cameras with a small baseline (i.e., the distance between the focal centers of the cameras).

(ii) Multiple Cameras. A network of calibrated monocular or stereo cameras monitoring the scene from significantly different viewpoints.

4.1. Stereo Cameras. The disparity map that correlates the two views of a stereo camera can be used as input for a disparity-based background subtraction algorithm. In order to accurately model the background, a dense disparity map needs to be computed.

Obtaining an accurate dense map of correlations between two stereo images usually requires time-consuming stereo algorithms. Without the aid of specialized hardware, most of these algorithms perform too slowly for real-time background subtraction [74, 75]. As a consequence, state-of-the-art dedicated hardware solutions implement simple and less accurate stereo correlation methods instead of more precise ones [76]. In some cases, the correlation between left and right images is unreliable, and the disparity map presents holes due to "invalid" pixels (i.e., points with invalid depth values).

Stereo vision has been used in [77] to build the occupancy map of the ground plane as background model, which is used to determine moving objects in the scene. The background disparity image is computed by averaging the stereo results from an initial background learning stage where the scene is assumed to contain no people. Pixels that have a disparity larger than the background (i.e., closer to the camera) are marked as foreground.

In [78], a simple bimodal model (normal distribution plus an unmodeled token) is used to build the background model. A similar approach is exploited in [79], where a histogram of disparity values across a range of time and gain conditions is computed. Gathering background observations over long-term sequences has the advantage that lighting variation can be included in the background training set. If background subtraction methods are based on depth alone [78, 80], errors arise due to foreground objects in close proximity to the background or foreground objects having homogeneous texture. The integration of color and depth information reduces the effect of the following problems:

(1) points with similar color in background and foreground,
(2) shadows,
(3) invalid pixels in background or foreground,
(4) points with similar depth in both background and foreground.

In [81], an example of a joint (color + depth) background estimation is given. The background model is based on a multidimensional (depth and RGB color) histogram approximating a mixture of Gaussians, while foreground extraction is performed via background comparison in depth and normalized color.

In [82], a method for modeling the background that uses per-pixel, time-adaptive Gaussian mixtures in the combined input space of depth and luminance-invariant color is proposed. The background model learning rate is modulated by the scene activity, and the color-based segmentation criteria are dependent on depth observations. The method explicitly deals with illumination changes, shadows, reflections, camouflage, and changes in the background.

The same idea of integrating depth information and color intensity coming from the left view of the stereo sensor is exploited by the PLT system in [73]. It is a real-time system, based on a calibrated fixed stereo vision sensor. The system analyses three interconnected representations of the stereo data to dynamically update a model of the background, to extract foreground objects, such as people and rearranged furniture, and to track their positions in the world. The background model is a composition of intensity, disparity, and edge information, and it is adaptively updated with a learning factor that varies over time and is different for each pixel.
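Before moving to camera networks, the disparity-based test underlying stereo approaches such as [77] can be illustrated with a minimal sketch; the per-pixel statistics, the factor k, and the handling of invalid pixels are simplifying assumptions of ours, not the exact rule of the cited works.

```python
import numpy as np

def disparity_foreground(disp, bg_disp, bg_std, k=2.5, invalid=0):
    """Mark as foreground the pixels significantly closer to the camera
    than the learned background disparity.

    disp    : current disparity map, shape (H, W)
    bg_disp : per-pixel mean disparity learned on an empty scene
    bg_std  : per-pixel disparity standard deviation from the same stage
    invalid : value marking pixels with no valid stereo correspondence
    """
    valid = (disp != invalid) & (bg_disp != invalid)
    closer = disp > bg_disp + k * bg_std      # larger disparity = closer
    return valid & closer
```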
4.2. Network of Cameras. In order to monitor large areas and/or to manage occlusions, the only solution is to use multiple cameras. It is not straightforward to generalize a single-camera system into a multicamera one, because of a series of problems like camera installation, camera calibration, object matching, and data fusion.

Redundant cameras increase not only processing time and algorithmic complexity, but also the installation cost. In contrast, a lack of cameras may cause blind spots that reduce the reliability of the surveillance system. Moreover, calibration is more complex when multiple cameras are employed, and object matching among multiple cameras involves finding the correspondences between the objects in different images.

In [83], a real-time 3D tracking system using three calibrated cameras to locate and track objects and people in a conference room is presented. A background model is computed for each camera view, using a mixture of Gaussians to estimate the background color per pixel. The background subtraction is performed on both the YUV and the RG color spaces. By matching RG foreground regions and YUV regions, it is possible to cut off most of the shadows, thanks to the use of chromatic information, and, at the same time, to exploit intensity information to obtain smoother silhouettes.

M2 Tracker [84] uses a region-based stereo algorithm to find 3D points inside an object, and Bayesian classification to classify each pixel as belonging to a person or the background. Taking into account models of the foreground objects in the scene, in addition to information about the background, leads to better background subtraction results.

In [85], a planar homography-based method combines foreground likelihood information (probability of a pixel in the image belonging to the foreground) from different views to resolve occlusions and determine the locations of people on the ground plane. The foreground likelihood map in each view is estimated by modeling the background using a mixture of Gaussians. The approach fails in the presence of strong shadows. Carnegie Mellon University developed a system [86] that allows a human operator to monitor activities over a large area using a distributed network of active video sensors. Their system can detect and track people and vehicles within cluttered scenes and monitor their activities over long periods of time. They developed robust routines for detecting moving objects using a combination of temporal differencing and template tracking.

The EasyLiving project [87] aims to create a practical person-tracking system that solves most of the real-world problems. It uses two sets of color stereo cameras for tracking people during live demonstrations in a living room. Color histograms are created for each detected person and are used to identify and track multiple people standing, walking, sitting, occluding, and entering or leaving the space. The background is modeled by computing the mean and variance for each pixel in the depth and color images over a sequence of 30 frames of the empty room.

In [74], a two-camera configuration is described, in which the cameras are vertically aligned with respect to a dominant ground plane (i.e., the baseline is orthogonal to the plane on which foreground objects appear). Background subtraction is performed by computing the normalized color difference for a background conjugate pair and averaging the component differences over a 3 × 3 neighborhood. Each background conjugate pair is modeled with a mixture of Gaussians. Foreground pixels are then detected if the associated normalized color differences fall outside a decision surface defined by a global false alarm rate.

4.3. Capabilities of the Approaches Based on Multiple Visual Sensors. The use of a stereo camera represents a compact solution, relatively cheap and easy to calibrate and set up, able to manage shadows and illumination changes. Indeed, the disparity information is more invariant to illumination changes than the information provided by a single camera [88], and the insensitivity of stereo to changes in lighting mitigates to some extent the need for adaptation [77]. On the other hand, a multiple-camera network allows the scene to be viewed from many directions, monitoring an area larger than what a single stereo sensor can cover. However, multicamera systems have to deal with problems in establishing geometric relationships between views and in maintaining temporal synchronization of frames.

In the following, we analyze those problems, taken from Section 2, for which multiple visual sensors contribute to reaching optimal solutions.

Camouflage. This problem is effectively faced by integrating the depth information with the color information [73, 81, 82].

Foreground Aperture. Even in this case, texture information improves the expressivity of the BG model, helping where the mere chromatic information leads to ambiguity between the BG and the FG appearance [36, 37, 39].

Shadows. This issue is solved employing both stereo cameras [73, 81, 82] and camera networks [74, 83].

Reflections. The use of multiple cameras permits solving this problem: the solution is based on the 3D structure of the monitored scene. The 3D map permits locating the ground plane of the scene and thus suppressing all the specularities, as they appear as objects lying below this plane [74].

5. Single Audio Monaural Sensor

Analogously to image background modeling for video analysis, a logical initial phase in applying audio analysis to surveillance and monitoring applications is the detection of background audio. This would be useful to highlight sections of interest in an audio signal, like, for example, the sound of breaking glass.

There are a number of differences between the visual and audio domains with respect to the data. The reduced amount of data in audio results in lower processing overheads, and encourages a more complex computational approach to analysis. Moreover, the characteristics of the audio usually exhibit a higher degree of variability. This is due to both the superimposition of multiple audio sources within a single input signal and the superimposition of the same sound at different times (multipath echoing). Similar situations for video could occur through reflection off partially reflective surfaces. This results in the formation of complex and dynamic audio backgrounds.

Background audio can be defined as the recurring and persistent audio characteristics that dominate a portion of the signal. Foreground sound detection can be carried out as the departure from this BG model.

Outside the automated surveillance context, several approaches to computational audio analysis are present, mainly focused on the computational translation of psychoacoustics results. One class of approaches is the so-called computational auditory scene analysis (CASA) [89], aimed at the separation and classification of sounds present in
a specific environment. Closely related to this field is computational auditory scene recognition (CASR) [90, 91], aimed at an overall environment interpretation instead of analyzing the different sound sources. Besides various psychoacoustically oriented approaches derived from these two classes, a third approach, used both in CASA and CASR contexts, tried to fuse "blind" statistical knowledge with biologically driven representations of the two previous fields, performing audio classification and segmentation tasks [92] and source separation [93, 94] (i.e., blind source separation). In this last approach, many efforts are devoted to the speech processing area, in which the goal is to separate the different voices composing the audio pattern using several microphones [94] or only one monaural sensor [93].

In the surveillance context, some methods proposed in the field of BG subtraction are mainly based on the monitoring of the audio intensity [95–97], or are aimed at recognizing specific classes of sounds [98]. These methods are not adaptive to the several possible audio situations, and they do not exploit all the potential information conveyed by the audio channel.

The following approaches, instead, are more general: they are adaptive and they can cope with quite complex backgrounds. In [99], the authors implement a version of the Gaussian mixture model (GMM) method in the audio domain. The audio signal, acquired by a single microphone, is processed by considering its frequency spectrum: it is subdivided into suitable subbands, assumed to convey independent information about the audio events. Each subband is modeled by a mixture of Gaussians. Since the model is updated online over time, the method is adaptive to the possible different background situations. At each instant t, FG information is detected by considering the set of subbands that show atypical behaviors.

In [100], the authors also employ an online, unsupervised, and adaptive GMM to model the states of the audio signal. Besides, they propose some solutions to more accurately model complex backgrounds. One is an entropy-based approach for combining fragmented BG models to determine the BG states of the signal. Then, the number of states to be incorporated into the background model is adaptively adjusted according to the background complexity. Finally, an auxiliary cache is employed, with the scope of preventing the removal from the system of potentially useful observed distributions when the audio is rapidly changing.

An issue not addressed by the previous methods, quite similar to the sleeping foreground problem in video analysis (see below in Section 5.1), is when the foreground is gradual and longer lasting, like a plane passing overhead. If there is no a priori knowledge of the FG and BG, the system adapts the FG sound as background. This particular situation is addressed in [101], by incorporating explicit knowledge of the data into the process. The framework is composed of two models. First, the models for the BG and FG sounds are learnt, using a semisupervised method. Then, the learned models are used to bootstrap the system. A separate model detects the changes in the background, and it is finally integrated with the audio prediction models to decide on the final FG/BG determination.

5.1. Capabilities of the Approaches Based on a Single Audio Sensor. The definition of audio background and its modelling for background subtraction incorporates issues that are analogous to those of the visual domain. In the following, we consider the problems reported in Section 2, analyzing how they translate into the audio domain and how they are solved by current approaches. Moreover, once a correspondence is found, we define a novel name for the audio key issue, in order to gain in clarity.

In general, whereas the visual domain may be considered as formed by several independent entities, that is, the pixel signals, in the audio domain the spectral subbands assume the role of the basic independent entities. This analogy is the one mostly used in the literature, and it will guide us in linking the different key problems across modalities.

Moved Object. This situation originally consists in a portion of the visual scene that is moved. In the audio domain, a portion consists in an audio subband. Therefore, whatever approach allows a local adaptation of the audio spectrum related to the BG solves this problem. The adaptation depends also in this case on a learning rate. The higher the rate, the faster the model adaptation [99, 100]. We will name this audio problem Local change.

Time of Day. This problem appears in the audio domain when the BG spectrum slowly changes. Therefore, approaches that develop an adaptive model solve this problem [99, 100]. We will name this audio problem Slow evolution.

Global Light Switch. The global light switch can be interpreted in the audio domain as an abrupt global change of the audio spectrum. In the video, a global change of illumination must not be interpreted as a FG entity, because the change is global and persistent and because the structure of the scene does not change. The structure invariance in the video can be evaluated by employing edge or texture features, while it is clear neither what the structure of an environmental audio background is, nor what features should model it. Therefore, an abrupt change in the audio spectrum will be evaluated as an evident presence of foreground and successively absorbed as BG if the BG model is adaptive, unless a classification-based approach is employed [99, 100], which minimizes the amount of FG by choosing the most suitable BG model across a set of BG models [101]. We will name this audio problem Global fast variation.

Waving Trees. In audio, the analog of the waving trees problem is that of a multimodal audio background, in the sense that each independent entity of the model, that is, the audio subband, shows multimodal statistics. This happens, for example, when repeated signals occur in the scene (the sound produced by a factory machine). Therefore, approaches that deal with multimodality (as expressed above) in the BG modelling deal with this problem successfully [99, 100]. We will name this audio problem Repeated background.
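A minimal sketch of the per-subband adaptive mixture modeling recalled above, in the spirit of [99, 100], is the following; the class name, the parameter values, and the simplified update rule are illustrative assumptions of ours, not the exact formulation of the cited works. Each spectral subband keeps its own model and flags atypical observations as foreground.

```python
import numpy as np

class SubbandGMM:
    """Adaptive background model for one spectral subband (illustrative)."""

    def __init__(self, n_modes=3, alpha=0.01, var0=1.0, t_bg=0.7, k=2.5):
        self.w = np.ones(n_modes) / n_modes   # mode weights
        self.mu = np.zeros(n_modes)           # mode means (log-energies)
        self.var = np.full(n_modes, var0)     # mode variances
        self.alpha, self.t_bg, self.k = alpha, t_bg, k

    def update(self, x):
        """Update with the current subband log-energy x; return True if x
        is atypical for the background (i.e., foreground)."""
        d = np.abs(x - self.mu) / np.sqrt(self.var)
        hit = int(np.argmin(d))
        if d[hit] < self.k:                   # x matches an existing mode
            self.mu[hit] += self.alpha * (x - self.mu[hit])
            self.var[hit] += self.alpha * ((x - self.mu[hit]) ** 2 - self.var[hit])
            self.w = (1 - self.alpha) * self.w
            self.w[hit] += self.alpha
            matched = True
        else:                                 # replace the weakest mode
            weakest = int(np.argmin(self.w))
            self.mu[weakest], self.var[weakest] = x, 1.0
            self.w[weakest] = self.alpha
            hit, matched = weakest, False
        self.w /= self.w.sum()
        # background modes: the heaviest modes covering a fraction t_bg of weight
        order = np.argsort(self.w)[::-1]
        bg = set(order[np.cumsum(self.w[order]) <= self.t_bg].tolist()) or {int(order[0])}
        return not (matched and hit in bg)
```

In a full system, one such model would be instantiated per subband, and a frame would be flagged as containing foreground sound when a sufficient number of subbands are simultaneously atypical.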
Camouflage. The camouflage problem in audio can be reasonably seen as the presence of a FG sound which is similar to that of the BG. Using the audio spectrum as the basic model for the BG characterization solves the problem of camouflage, because different sounds having the same spectral characteristics (so, when we are in the presence of similar sounds) will produce a spectrum where the spectral intensities are summed. Such a spectrum is different from that of the single BG sound, where the intensities are lower. We will name this audio problem Audio camouflage.

Sleeping Foreground. The sleeping foreground occurs in audio when a FG sound continuously holds, becoming BG. This issue may be solved explicitly by employing FG models, as done in [101]. We will name this audio problem Sleeping audio foreground.

It is worth noting that, in this case, the visual problems of local light switch, foreground aperture, shadows, and reflections have no clear correspondence in the audio domain, and thus they are omitted from the analysis.

6. Single Infrared Sensor

Most algorithms for object detection are designed only for daytime visual surveillance and are generally not effective for dealing with night conditions, when the images have low brightness, low contrast, low signal-to-noise ratio (SNR), and nearly no color information [102].

For night-vision surveillance, two primary technologies are used: image enhancement and thermal imaging.

Image enhancement techniques aim to amplify the light reflected by the objects in the monitored scene to improve visibility. Infrared (IR) light levels are high at twilight or in halogen light, therefore a camera with good IR sensitivity can capture short-wavelength infrared (SWIR) emissions to increase the image quality. The SWIR band follows directly from the visible spectrum (VIS), and therefore it is also called near infrared.

Thermal imaging refers to the process of capturing the long-wave IR radiation emitted or reflected by objects in the scene, which is undetectable to the human eye, and transforming it into a colored or grayscale image.

The use of infrared light and night-vision devices should not be confused with thermal imaging (see Figure 5 for a visual comparison). If the scene is completely dark, then image enhancement methods are not effective and it is necessary to use a thermal infrared camera. However, the cost of a thermal camera is too high for most surveillance applications.

6.1. Near Infrared Sensors. Near infrared (NIR) sensors are low cost (around 100 dollars) when compared with thermal infrared sensors (around 1000 dollars) and have a much higher resolution. NIR cameras are suitable for environments with a low illumination level, typically between 5 and 50 lux [103]. In urban surveillance, it is not unusual to have artificial light sources illuminating the scene at night (e.g., monitored parking lots next to buildings tend to be well lit). NIR sensors represent a cheaper alternative to thermal cameras for monitoring these urban scenarios. However, SWIR-based video surveillance presents a series of challenges [103].

(i) Low SNR. With low light levels, a high gain is required to enhance the image brightness. However, a high gain tends to amplify the sensor's noise, introducing a considerable variance in pixel intensity between frames that impairs background modeling approaches based on statistical analysis.

(ii) Blooming. The presence of strong light sources (e.g., car headlights and street lamps) can lead to the saturation of the pixels involved, deforming the detected shape of objects.

(iii) Reflections. Surfaces in the scene can reflect light, causing false positives.

(iv) Shadows. Moving objects cause sharp shadows with changing orientation (with respect to the object).

In [103], a system to perform automated parking lot surveillance at night time is presented. As a preprocessing step, contrast and brightness of the input images are enhanced and spatial smoothing is applied. The background model is built as a mixture of Gaussians. In [104], an algorithm for background modeling based on spatiotemporal patches, especially suited for night outdoor scenes, is presented. Based on the spatiotemporal patches, called bricks, the background models are learned by an online subspace learning method. However, the authors claim the algorithm fails on surfaces with specular reflection.

6.2. Thermal Infrared Sensors. Thermal infrared sensors (see Figure 6) are not subject to the problems of color imagery in managing shadows, sudden illumination changes, and poor night-time visibility. However, thermal imagery has to deal with its own particular challenges.

(i) The commonly used ferroelectric BST thermal sensor yields imagery with a low SNR, which results in limited information for performing detection or tracking tasks.

(ii) Polarity and intensity of the thermal image are uncalibrated, that is, the disparity in terms of thermal properties between the foreground and the background is quite different if the background is warm or cold (see Figure 7).

(iii) Saturation, or "halo effect", which appears around very hot or cold objects, can modify the geometrical properties of the foreground objects, deforming their shape.

The majority of the object detection algorithms working in the thermal domain adopt a simple thresholding method to build the foreground mask, assuming that a foreground object is much hotter than the background and hence appears brighter, as a "hot spot" [105]. In [106], a thresholded image is computed as the first step of a human posture estimation method, based on the assumption that the temperature of the human body is hotter than the background. The hot-spot assumption is used in [107] for developing an automatic gait recognition method where the silhouettes are extracted by thresholding. In [108], the detection of hot spots is performed using a flexible threshold calculated as the balance between the thermal image mean intensity and the highest intensity; then a Support Vector Machine (SVM)-based approach is used to classify humans. In [109], the threshold value is extracted from a training dataset of rectangular boxes containing pedestrians; then probabilistic templates are exploited to capture the variations in human shape, for managing the case where contrast is low and body parts are missing.
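A minimal sketch of such a hot-spot thresholding rule is given below, inspired by the flexible threshold of [108]; the weighting parameter is an illustrative assumption of ours, not the exact rule of the cited works.

```python
import numpy as np

def hotspot_mask(thermal, weight=0.5):
    """Detect 'hot spots' in a thermal image by global thresholding.

    The threshold is placed between the mean intensity of the image and
    its highest intensity; `weight` in [0, 1] controls the balance.
    """
    t = thermal.astype(np.float32)
    threshold = (1.0 - weight) * t.mean() + weight * t.max()
    return t > threshold
```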
Figure 6: A color image (a) and a thermal image (b) from the OSU Color-Thermal Database [17, 105].

Figure 7: Uncalibrated polarity and halo issues in thermal imagery: (a) bright halo around dark objects [105]; (b) dark halo around a bright object [110].

However, the hot-spot assumption does not hold if the scene is monitored at different times of the day and/or at different environmental temperatures (e.g., during winter or summer). Indeed, at night-time (or during winter) the foreground is usually warmer than the background, but this is not always true in day-time (or summer), when the background can be warmer than the foreground.

Moreover, the presence of halos in thermal imagery compromises the use of traditional visual background subtraction techniques [105]. Since the halo surrounding the moving object usually diverges from the background model, it is classified as foreground, introducing an error in retrieving the structural properties of the foreground objects.

The above-discussed challenges in using thermal imagery have been largely ignored in the past [105]. Integrating visual and thermal imagery can help overcome those drawbacks. Indeed, in the presence of sufficient illumination, color optical sensors are oblivious to temperature differences in the scene and are typically more effective than thermal cameras when the thermal properties of the objects in the scene are similar to the surrounding environment.

6.3. Capabilities of the Approaches Based on a Single Infrared Sensor. Taken alone and evaluated in scenarios where the illumination is enough to perform also visual background subtraction, infrared sensors cannot provide robust background subtraction, because of all the limits discussed above. However, infrared is effective when the illumination is scarce, and in disambiguating a camouflage situation, where the visual aspect of the FG is similar to that of the BG. Infrared is also the only working solution in scenarios where the FG objects lie on water surfaces, since the false positive detections caused by waves can be totally filtered out.

7. Fusion of Multiple Sensors

One of the most desirable qualities of a video surveillance system is persistence, or the ability to be effective at all times. However, a single sensor is generally not effective
in all situations. The use of complementary sensors, hence, becomes important to provide complete and sufficient information: information redundancy permits validating observations, in order to enhance FG/BG separation, and it becomes essential when one modality is not available.

Fusing data from heterogeneous information sources raises new problems, such as how to associate distinct objects that represent the same entity. Moreover, the complexity of the problem increases when the sources do not have a complete knowledge of the monitored area and in situations where the sensor measurements are ambiguous and imprecise.

There is an increasing interest in developing multimodal systems that can simultaneously analyze information from multiple sources of information. The most interesting trends regard the fusion of thermal and visible imagery and the fusion of audio and video information.

7.1. Fusion of Thermal and Visible Imagery. Thermal and color video cameras are both widely used for surveillance. Thermal cameras are independent of illumination, so they are more effective than color cameras under poor lighting conditions. On the other hand, color optical sensors do not consider temperature differences in the scene, and are typically more effective than thermal cameras when the thermal properties of the objects in the scene are similar to the surrounding environment (provided that the scene is well illuminated and the objects have color signatures different from the background). Integrating visual and thermal imagery can overcome the drawbacks of both sensors, enhancing the overall performance (Figure 8).

In [105], a three-stage algorithm to detect moving objects in urban settings is described. Background subtraction is performed on thermal images, detecting the regions of interest in the scene. Color and intensity information is used within these areas to obtain the corresponding regions of interest in the visible domain. Within each image region (thermal and visible, treated independently), the input and background gradient information are combined so as to highlight only the contours of the foreground object. Contour fragments belonging to corresponding regions in the thermal and visible domains are then fused, using the combined input gradient information from both sensors. This technique permits filtering out both halos and shadows.

A similar approach that uses gradient information from both visible and thermal images is described in [112]: the fusion step is based on mutual agreement between the two modalities. In [113], the authors propose to use an IR camera in conjunction with a standard camera for detecting humans. Background subtraction is performed independently on both camera images using a single Gaussian probability distribution to model each background pixel. The pair of detected foreground masks is extracted using a hierarchical genetic algorithm, and the two registered silhouettes are then fused together into the final estimate. Another similar approach for human detection is described in [111]. Even in this case, BG subtraction is run on the two cameras independently, extracting the blobs from each camera. The blobs are then matched and aligned to reject false positives.

In [114], instead, an image fusion scheme that employs multiple scales is illustrated. The method first computes pixel saliency in the two images (IR and visible) at multiple scales; then a merging process, based on a measure of the difference in brightness across the images, produces the final foreground mask.

7.1.1. Capabilities of the Approaches Based on the Fusion of Thermal and Visible Imagery. In general, thermal imagery is taken as support for the visual modality. Considering the literature, the key problem in Section 2 where the fusion of thermal and visible imagery proves particularly effective is that of shadows: actually, all the approaches stress this fact in their experimental sections.

7.2. Fusion of Audio and Video Information. Many researchers have attempted to integrate vision and acoustic senses, with the aim to enhance object detection and tracking, more than BG subtraction. The typical scenario is an indoor environment with moving or static objects that produce sounds, monitored with fixed or moving cameras and fixed acoustic sensors.

For completeness, we report in the following some of these methods, even if they do not tackle BG subtraction explicitly. Usually each sense is processed separately and the overall results are integrated in the final step. The system developed in [115], for example, uses an array of eight microphones to initially locate a speaker and then steer a camera towards the sound source. The camera does not participate in the localization of objects, but it is used to take images of the sound source after it has been localized. However, in [116], the authors demonstrate that the localization integrating audio and video information is more robust compared to the localization based on stand-alone microphone arrays. In [117], the authors detect walking persons, with a method based on video sequences and step sounds. The audiovisual correlation is learned by a time-delay neural network, which then performs a spatiotemporal search for the walking person. In [118], the authors propose a quite complete surveillance system, focused on the integration of the visual and the audio information provided by different sensing agents. Static cameras, fixed microphones, and mobile vision agents work together to detect intruders and to capture a close-up image of them. In [119], the authors deal with tracking and identifying multiple people using discriminative visual and acoustic features extracted from cameras and microphone array measurements. The audio local sensor performs sound source localization and source separation to extract the speech signals present in the environment; the video local sensor performs people localization and face-color extraction. The association decision is based on belief theory, and the system provides robust performance even with noisy data.

A paper that instead focuses on fusing video and acoustic signals with the aim to enhance BG modeling is [120]. The authors build a multimodal model of the scene background, in which both the audio and the video are modeled by
employing a time-adaptive mixture model. The system is able to detect single auditory or visual events, as well as simultaneous audio-video situations, considering a synchrony principle. This integration permits addressing the FG sleeping problem: an audiovisual pattern can remain an actual foreground even if one of the components (audio or video) becomes BG. The setting is composed of one fixed camera and a single microphone.

Figure 8: Example of fusion of video and thermal imagery: (a) FG obtained from the thermal camera; (b) FG obtained from the video camera; (c) their fusion result [111].

7.2.1. Capabilities of the Approaches Based on the Fusion of Audio and Video Information. Coupling the audio and the visual signal is a novel direction for the background subtraction literature. Actually, most of the approaches presented in the previous section propose a coupled modeling for the foreground, instead of detailing a pure background subtraction strategy. Anyway, all those approaches work in a clear setting, that is, where the audio signal is clearly associated with the foreground entities. Therefore, the application of such techniques in real-world situations needs to be supported by techniques able to perform the subtraction of useless information in both the audio and the visual channels. In this sense, [120] is the approach that leads the most in this direction (even if it also proposes a modeling for the foreground entities).

8. How the Key Problems of Background Subtraction May Be Solved?

In this paper, we examined different approaches to background subtraction, with particular attention to how they solve typical long-standing issues. We considered different sensor channels and different multichannel integration policies. In this section, we consider all these techniques together, summarizing for each problem the main strategies adopted to solve it.

In particular, we focus on the problems presented in Section 2, without considering the translated versions of the problems in the audio channel (Section 5.1). Table 1 summarizes the main categories of methods described in this paper and the problems that they explicitly solve.

Moreover, we point out those that could be winning strategies that have not been completely exploited in the literature, hoping that some of them could be embraced and applied satisfactorily.

Moved Object (MO). In this case, mainly visual approaches are present in the literature, which are not able to solve this issue satisfactorily. Actually, when an object belonging to the scene is moved, it erroneously appears to be a FG entity, until the BG model adapts and absorbs the novel visual layout. A useful direction to solve this issue effectively is considering thermal information: actually, if the background has thermal characteristics that are different from the FG objects, the visual change provoked by an object which is relocated may be inhibited by its thermal information.

Time of Day (TD). Adaptive BG models have been shown to solve this issue effectively. When the illumination is very scarce, thermal imagery may help. A good direction could be building a structured model that introduces the thermal imagery selectively, in order to maximize the BG/FG discrimination.

Light Switch (LS). This problem has been considered in a purely visual sense. The solutions present in the literature are satisfying, and operate by considering the global appearance of the scene. When a global abrupt change happens, the BG model is suddenly adapted or selected from a set of predetermined models, in order to minimize the amount of false positive alarms.

Local Light Switch (LLS). Local light switch is a novel problem, introduced here and scarcely considered in the literature. The approaches that face this problem work on the visual channel, studying in a bootstrap phase how the illumination of the scene locally changes, monitoring when a local change does occur, and adapting the model consequently.

Waving Trees (WT). The oscillation of the background is effectively solved in the literature under a visual perspective. The idea is that the BG models have to be multimodal: this works well especially when the oscillation of the background
(or part of it) is persistent and well localized (i.e., the oscillation has to occur for a long time in the same area; in other words, it has to be predictable). When the oscillations are rare or unpredictable, approaches that consider per-region strategies are decisive. The idea is that per-pixel models share their parameters, so that a background value in a pixel may be evaluated as BG even if it occurs in a local neighborhood.

Camouflage (C). Camouflage effects derive from the similarity between the features that characterize the foreground and those used for modeling the background. Therefore, the more discriminating the features, the better the separation between FG and BG entities. In this case, under a visual perspective, gray level is the worst choice of feature. Moving to color values offers better discriminability, which can be further improved by employing edge and texture information. Particularly effective is the employment of stereo sensors, which introduce depth information into the analysis. Again, thermal imagery may help. A mixture of visual and thermal channels exploiting stereo devices has never been taken into account, and seems to be a reasonable novel strategy.

Bootstrapping (B). The bootstrapping problem is explicitly faced only under a visual perspective, by background initialization approaches. These approaches offer good solutions: they essentially build statistics for devising a BG model by exploiting the principles of temporal persistence (elements of the scene which appear continuously with the same layout represent the BG) and spatial continuity (i.e., homogeneously colored surfaces, or portions of the scene which exhibit edge continuity, belong to the BG). Bootstrapping considering other sensor channels has never been taken into account.

Foreground Aperture (FA). The problem of the spatiotemporal persistence of a foreground object, and its partial erroneous absorption into the BG model, has been faced in the literature under the sole visual modality. This problem primarily depends on a too fast learning rate of the BG model. Successful approaches employ per-region reasoning, by examining the detected FG regions, looking for holes, and filling them by morphological operators. Foreground aperture considering other sensor channels has never been taken into account.

Sleeping Foreground (SF). This problem is the one that most implies some knowledge of the FG entities, crossing the border towards goals that are typical of the tracking literature. In practice, the intuitive solution for this problem consists in inhibiting the absorption mechanism of the BG model wherever a FG object occurs in the scene. In the literature, a solution comes through the use of multiple sensor channels. Employing thermal imagery associated with visual information permits discriminating between FG and BG in an effective way. Actually, the background is assumed to be at a different temperature with respect to the FG objects: this contrast has to be maintained over time, so a still foreground will always be differentiated from the background. Employing audio signals is another way. Associating an audio pattern with a FG entity permits enlarging the set of features that need to be constant in time for provoking a total BG absorption. Therefore, a visual entity (e.g., a person) which is still, but maintains FG audio characteristics (i.e., being unexpected), remains a FG entity. Employing multiple sensor channels allows solving this problem without relying on tracking techniques: that is, the idea is to enrich the BG model, in order to better detect FG entities, that is, entities that diverge from that model.

Shadows (SH). The solution for the shadows problem comes from the visual domain, or from employing multiple sensors, or from considering thermal imagery. In the first case, color analysis is applied, by building a chromatic range over which a background color may vary when affected by shadows. Otherwise, edge or texture analysis, which has been shown to be robust to shadows, is applied. Stereo sensors discard the shadows simply relying on depth information, and multiple cameras are useful to build a 3D map where the items that are projected on the ground plane of the scene are labelled as shadows. Thermal imagery is oblivious to shadow issues.

Reflections (R). Reflections are a brand-new problem for the background subtraction literature, in the sense that very few approaches have focused on this issue. It is more difficult than dealing with shadows because, as visible in our test sequence, reflections carry color, edge, or texture information which is not brought by shadows. Therefore, methods that rely on color, edge, and texture analysis fail. The only satisfying solution comes through the use of multiple sensors. A 3D map of the scene can be built (so, the BG model is enriched and made more expressive) and geometric assumptions on where a FG object could appear or not help in discarding reflection artifacts. The use of thermal imagery and stereo sensors is intuitively useful to solve this problem, but in the literature there are no approaches that explicitly deal with it.

9. Final Remarks

In this paper, we presented a review of background subtraction methods. It has two important characteristics that make it diverse and appealing with respect to the other reviews. First, it considers different sensor channels and various integration policies of heterogeneous channels with which background subtraction may be carried out. This has never appeared before in the literature. Second, it is problem-oriented, that is, it identifies the key problems of background subtraction, and we analyze and discuss how the different approaches behave with respect to them. This permits synthesizing a global snapshot of the effectiveness of current background subtraction approaches. Almost every problem analyzed has a proper solution, which comes from different modalities or multimodal integration policies. Therefore, we hope that this problem-driven analysis may serve in devising an even more complete background subtraction system, able
to join sensor channels in an advantageous way, facing all the problems at the same time and providing convincing performance.

Acknowledgments

This work was funded by the EU Project FP7 SAMURAI, Grant FP7-SEC-2007-01 no. 217899.

References

[1] H. T. Nguyen and A. W. M. Smeulders, "Robust tracking using foreground-background texture discrimination," International Journal of Computer Vision, vol. 69, no. 3, pp. 277–293, 2006.
[2] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1631–1643, 2005.
[3] H.-T. Chen, T.-L. Liu, and C.-S. Fuh, "Probabilistic tracking with adaptive feature selection," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 736–739, August 2004.
[4] F. Martez-Contreras, C. Orrite-Urunuela, E. Herrero-Jaraba, H. Ragheb, and S. A. Velastin, "Recognizing human actions using silhouette-based HMM," in Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '09), pp. 43–48, 2009.
[5] H. Grabner and H. Bischof, "On-line boosting and vision," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 260–267, June 2006.
[6] J. Shotton, M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and segmentation," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, June 2008.
[7] M. Bicego, M. Cristani, and V. Murino, "Unsupervised scene analysis: a hidden Markov model approach," Computer Vision and Image Understanding, vol. 102, no. 1, pp. 22–41, 2006.
[8] S. Gong, J. Ng, and J. Sherrah, "On the semantics of visual behaviour, structured events and trajectories of human action," Image and Vision Computing, vol. 20, no. 12, pp. 873–888, 2002.
[9] M. Piccardi, "Background subtraction techniques: a review," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '04), pp. 3099–3104, October 2004.
[10] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, "Image change detection algorithms: a systematic survey," IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 294–307, 2005.
[11] Y. Benezeth, P. M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, "Review and evaluation of commonly-implemented background subtraction algorithms," in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), December 2008.
[12] A. Mittal and N. Paragios, "Motion-based background subtraction using adaptive kernel density estimation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 302–309, 2004.
[13] S.-C. S. Cheung and C. Kamath, "Robust techniques for background subtraction in urban traffic video," in Visual Communications and Image Processing, vol. 5308 of Proceedings of SPIE, pp. 881–892, San Jose, Calif, USA, 2004.
[14] D. H. Parks and S. S. Fels, "Evaluation of background subtraction algorithms with post-processing," in Proceedings of the 5th International Conference on Advanced Video and Signal Based Surveillance (AVSS '08), pp. 192–199, September 2008.
[15] WALLFLOWER, "Test images for wallflower paper," http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm.
[16] Center for Biometrics and Security Research, "CBSR NIR face dataset," http://www.cbsr.ia.ac.cn.
[17] "OTCBVS Benchmark Dataset Collection," http://www.cse.ohio-state.edu/otcbvs-bench/.
[18] R. Miezianko, "Terravic research infrared database."
[19] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: principles and practice of background maintenance," in Proceedings of the IEEE International Conference on Computer Vision, vol. 1, pp. 255–261, 1999.
[20] Y. Sheikh and M. Shah, "Bayesian modeling of dynamic scenes for object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1778–1792, 2005.
[21] R. Jain and H. H. Nagel, "On the analysis of accumulative difference pictures from image sequences of real world scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 206–214, 1978.
[22] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[23] J. Heikkilä and O. Silven, "A real-time system for monitoring of cyclists and pedestrians," in Proceedings of the 2nd IEEE International Workshop on Visual Surveillance, pp. 74–81, Fort Collins, Colo, USA, 1999.
[24] N. J. B. McFarlane and C. P. Schofield, "Segmentation and tracking of piglets in images," Machine Vision and Applications, vol. 8, no. 3, pp. 187–193, 1995.
[25] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts, and shadows in video streams," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, 2003.
[26] I. Haritaoglu, R. Cutler, D. Harwood, and L. S. Davis, "Backpack: detection of people carrying objects using silhouettes," Computer Vision and Image Understanding, vol. 81, no. 3, pp. 385–397, 2001.
[27] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809–830, 2000.
[28] N. Friedman and S. Russell, "Image segmentation in video sequences: a probabilistic approach," in Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI '97), pp. 175–181, Morgan Kaufmann Publishers, San Francisco, Calif, USA, 1997.
[29] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), vol. 2, pp. 246–252, 1999.
[30] H. Wang and D. Suter, "A re-evaluation of mixture-of-Gaussian background modeling," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp. 1017–1020, March 2005.
[31] Z. Zivkovic, "Improved adaptive Gaussian mixture model for background subtraction," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 28–31, August 2004.
[32] A. Elgammal, D. Harwood, and L. Davis, "Non parametric model for background subtraction," in Proceedings of the 6th European Conference on Computer Vision, Dublin, Ireland, June-July 2000.
[33] A. M. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in Proceedings of the 6th European Conference on Computer Vision, pp. 751–767, 2000.
[34] R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley & Sons, New York, NY, USA, 2001.
[35] M. Levine, Vision by Man and Machine, McGraw-Hill, New York, NY, USA, 1985.
[36] P. Noriega and O. Bernier, "Real time illumination invariant background subtraction using local kernel histograms," in Proceedings of the British Machine Vision Conference, 2006.
[37] M. Heikkila and M. Pietikainen, "A texture-based method for modeling the background and detecting moving objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 657–662, 2006.
[38] T. Ojala, M. Pietikainen, and D. Harwood, "Performance evaluation of texture measures with classification based on Kullback discrimination of distributions," in Proceedings of the International Conference on Pattern Recognition (ICPR '94), pp. 582–585, 1994.
[39] J. Yao and J.-M. Odobez, "Multi-layer background subtraction based on color and texture," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, June 2007.
[40] B. Klare and S. Sarkar, "Background subtraction in varying illuminations using an ensemble based on an enlarged feature set," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 66–73, 2009.
[41] M. Cristani and V. Murino, "A spatial sampling mechanism for effective background subtraction," in Proceedings of the 2nd International Conference on Computer Vision Theory and Applications (VISAPP '07), pp. 403–410, March 2007.
[42] O. Barnich and M. Van Droogenbroeck, "ViBE: a powerful random technique to estimate the background in video sequences," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 945–948, IEEE Computer Society, Washington, DC, USA, 2009.
[43] S. Rowe and A. Blake, "Statistical mosaics for tracking," Image and Vision Computing, vol. 14, no. 8, pp. 549–564, 1996.
[44] A. Mittal and D. Huttenlocher, "Scene modeling for wide area surveillance and image synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
[48] H. S. Sawhney, Y. Guo, and R. Kumar, "Independent motion detection in 3D scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1191–1199, 2000.
[49] C. Yuan, G. Medioni, J. Kang, and I. Cohen, "Detecting motion regions in the presence of a strong parallax from a moving camera by multiview geometric constraints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1627–1641, 2007.
[50] H. Tao, H. S. Sawhney, and R. Kumar, "Object tracking with Bayesian estimation of dynamic layer representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 75–89, 2002.
[51] J. Xiao and M. Shah, "Motion layer extraction in the presence of occlusion using graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1644–1659, 2005.
[52] Y. Jin, L. Tao, H. Di, N. I. Rao, and G. Xu, "Background modeling from a free-moving camera by multi-layer homography algorithm," in Proceedings of the IEEE International Conference on Image Processing (ICIP '08), pp. 1572–1575, October 2008.
[53] R. Vidal and Y. Ma, "A unified algebraic approach to 2-D and 3-D motion segmentation," in Proceedings of the 8th European Conference on Computer Vision, vol. 3021 of Lecture Notes in Computer Science, pp. 1–15, Prague, Czech Republic, May 2004.
[54] K. Kanatani, "Motion segmentation by subspace separation and model selection," in Proceedings of the 8th International Conference on Computer Vision, vol. 2, pp. 586–591, July 2001.
[55] Y. Sheikh, O. Javed, and T. Kanade, "Background subtraction for freely moving cameras," in Proceedings of the IEEE International Conference on Computer Vision (ICCV '09), pp. 1219–1225, 2009.
[56] R. T. Collins and Y. Liu, "On-line selection of discriminative tracking features," in Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 346–352, October 2003.
[57] T. Parag, A. Elgammal, and A. Mittal, "A framework for feature selection for background subtraction," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 1916–1923, June 2006.
[58] B. Stenger, V. Ramesh, N. Paragios, F. Coetzee, and J. M. Buhmann, "Topology free hidden Markov models: application to background modeling," in Proceedings of the IEEE International Conference on Computer Vision, vol. 1, pp. 294–301, 2001.
(CVPR ’00), vol. 2, pp. 160–167, June 2000. [59] N. Ohta, “A statistical approach to background subtraction
[45] E. Hayman and J.-O. Eklundh, “Statistical background for surveillance systems,” in Proceedings of the 8th Interna-
subtraction for a mobile observer,” in Proceedings of the 9th tional Conference on Computer Vision, vol. 2, pp. 481–486,
IEEE International Conference on Computer Vision, vol. 1, pp. July 2001.
67–74, 2003. [60] N. M. Oliver, B. Rosario, and A. P. Pentland, “A Bayesian
[46] Y. Ren, C.-S. Chua, and Y.-K. Ho, “Statistical background computer vision system for modeling human interactions,”
modeling for non-stationary camera,” Pattern Recognition IEEE Transactions on Pattern Analysis and Machine Intelli-
Letters, vol. 24, no. 1−3, pp. 183–196, 2003. gence, vol. 22, no. 8, pp. 831–843, 2000.
[47] M. Irani and P. Anandan, “A unified approach to moving [61] M. Cristani, M. Bicego, and V. Murino, “Integrated region-
object detection in 2d and 3d scenes,” IEEE Transactions on and pixel-based approach to background modelling,” in
Pattern Analysis and Machine Intelligence, vol. 20, no. 6, pp. Proceedings of the IEEE Workshop on Motion and Video
577–589, 1998. Computing, pp. 3–8, 2002.
EURASIP Journal on Advances in Signal Processing 23

[62] M. Cristani, M. Bicego, and V. Murino, “Multi-level back- [78] C. Eveland, K. Konolige, and R. C. Bolles, “Background
ground initialization using hidden Markov models,” in Pro- modeling for segmentation of video-rate stereo sequences,”
ceedings of the ACMSIGMM Workshop on Video Surveillance, in Proceedings of the IEEE Computer Society Conference on
pp. 11–19, 2003. Computer Vision and Pattern Recognition, pp. 266–271, June
[63] Q. Xiong and C. Jaynes, “Multi-resolution background 1998.
modeling of dynamic scenes using weighted match filters,” [79] T. Darrell, D. Demirdjian, N. Checka, and P. Felzenszwalb,
in Proceedings of the 2nd ACM International Workshop on “Plan-view trajectory estimation with dense stereo back-
Video Sureveillance and Sensor Networks (VSSN ’04), pp. 88– ground models,” in Proceedings of the 8th International
96, ACM Press, New York, NY, USA, 2004. Conference on Computer Vision (ICCV ’01), pp. 628–635, July
[64] J. Park, A. Tabb, and A. C. Kak, “Hierarchical data structure 2001.
for real-time background subtraction,” in Proceedings of [80] Y. Ivanov, A. Bobick, and J. Liu, “Fast lighting independent
International Conference on Image Processing (ICIP ’06), 2006. background subtraction,” International Journal of Computer
[65] H. Wang and D. Suter, “Background subtraction based on Vision, vol. 37, no. 2, pp. 199–207, 2000.
a robust consensus method,” in Proceedings of International [81] G. Gordon, T. Darrell, M. Harville, and J. Woodfill, “Back-
Conference on Pattern Recognition, vol. 1, pp. 223–226, 2006. ground estimation and removal based on range and color,”
[66] X. Gao, T. E. Boult, F. Coetzee, and V. Ramesh, “Error in Proceedings of the IEEE Computer Society Conference on
analysis of background adaption,” in Proceedings of the Computer Vision and Pattern Recognition (CVPR ’99), pp.
IEEE Conference on Computer Vision and Pattern Recognition 459–464, June 1999.
(CVPR ’00), vol. 1, pp. 503–510, June 2000. [82] M. Harville, G. Gordon, and J. Woodfill, “Foreground
[67] B. Gloyer, H. K. Aghajan, K. Siu, and T. Kailath, “Video-based segmentation using adaptive mixture models in color and
freeway-monitoring system using recursive vehicle tracking,” depth,” in Proceedings of the IEEE Workshop on Detection and
in Image and Video Processing III, vol. 2421 of Proceedings of Recognition of Events in Video, pp. 3–11, 2001.
SPIE, pp. 173–180, San Jose, Calif, USA, 1995. [83] D. Focken and R. Stiefelhagen, “Towards vision-based 3-d
[68] W. Long and Y.-H. Yang, “Stationary background generation: people tracking in a smart room,” in Proceedings of the 4th
an alternative to the difference of two images,” Pattern IEEE International Conference on Multimodal Interfaces, pp.
Recognition, vol. 23, no. 12, pp. 1351–1359, 1990. 400–405, 2002.
[69] D. Gutchess, M. Trajkovicz, E. Cohen-Solal, D. Lyons, and [84] A. Mittal and L. S. Davis, “M2tracker: a multi-view approach
A. K. Jain, “A background model initialization algorithm for to segmenting and tracking people in a cluttered scene,”
video surveillance,” in Proceedings of the IEEE International International Journal of Computer Vision, vol. 51, no. 3, pp.
Conference on Computer Vision (ICCV ’01), vol. 1, pp. 733– 189–203, 2003.
740, 2001. [85] S. M. Khan and M. Shah, “A multiview approach to tracking
[70] A. Colombari, M. Cristani, V. Murino, and A. Fusiello, people in crowded scenes using a planar homography
“Exemplarbased background model initialization,” in Pro- constraint,” in Proceedings of the 9th European Conference on
ceedings of the third ACM International Workshop on Video Computer Vision (ECCV ’06), vol. 3954 of Lecture Notes in
Surveillance and Sensor Networks, pp. 29–36, Hilton, Singa- Computer Science, pp. 133–146, Graz, Austria, 2006.
pore, 2005. [86] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, “Moving target clas-
[71] A. Colombari, A. Fusiello, and V. Murino, “Background sification and tracking from real-time video,” in Proceedings
initialization in cluttered sequences,” in Proceedings of the of the IEEE Workshop Application of Computer Vision, pp. 8–
5th Conference on Computer Vision and Pattern Recognition 14, 1998.
(CVPR ’06), 2006. [87] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S.
[72] T. Zhao and R. Nevatia, “Tracking multiple humans in Shafer, “Multi-camera multi-person tracking for easyliving,”
crowded environment,” in Proceedings of the IEEE Computer in Proceedings of the 3rd IEEE International Workshop on
Society Conference on Computer Vision and Pattern Recogni- Visual Surveillance (VS ’00), p. 3, 2000.
tion (CVPR ’04), pp. 406–413, July 2004. [88] R. Muñoz-Salinas, E. Aguirre, and M. Garcı́a-Silvente, “Peo-
[73] S. Bahadori, L. Iocchi, G. R. Leone, D. Nardi, and L. Scoz- ple detection and tracking using stereo vision and color,”
zafava, “Real-time people localization and tracking through Image and Vision Computing, vol. 25, no. 6, pp. 995–1007,
fixed stereo vision,” Applied Intelligence, vol. 26, no. 2, pp. 83– 2007.
97, 2007. [89] A. Bregman, Auditory Scene Analysis: The Perceptual Organi-
[74] S. Lim, A. Mittal, L. Davis, and N. Paragios, “Fast illumina- zation of Sound, MIT Press, London, UK, 1990.
tioninvariant background subtraction using two views: error [90] V. Peltonen, Computational auditory scene recognition, M.S.
analysis, sensor placement and applications,” in Proceedings thesis, Tampere University of Tech., Tampere, Finland, 2001.
of the IEEE Computer Society Conference on Computer Vision [91] M. Cowling and R. Sitte, “Comparison of techniques for
and Pattern Recognition, vol. 1, pp. 1071–1078, 2005. environmental sound recognition,” Pattern Recognition Let-
[75] M. Z. Brown, D. Burschka, and G. D. Hager, “Advances in ters, vol. 24, no. 15, pp. 2895–2907, 2003.
computational stereo,” IEEE Transactions on Pattern Analysis [92] T. Zhang and C.-C. Jay Kuo, “Audio content analysis for
and Machine Intelligence, vol. 25, no. 8, pp. 993–1008, 2003. online audiovisual data segmentation and classification,”
[76] N. Lazaros, G. C. Sirakoulis, and A. Gasteratos, “Review IEEE Transactions on Speech and Audio Processing, vol. 9, no.
of stereo vision algorithms: from software to hardware,” 4, pp. 441–457, 2001.
International Journal of Optomechatronics, vol. 2, no. 4, pp. [93] S. Roweis, “One microphone source separation,” in Advances
435–462, 2008. in Neural Information Processing Systems, pp. 793–799, 2000.
[77] D. Beymer and K. Konolige, “Real-time tracking of multiple [94] K. Hild II, D. Erdogmus, and J. Principe, “On-line minimum
people using continuous detection,” in Proceedings of Inter- mutual information method for time-varying blind source
national Conference on Computer Vision (ICCV ’99), 1999. separation,” in Proceedings of the International Workshop
24 EURASIP Journal on Advances in Signal Processing

on Independent Component Analysis and Signal Separation [110] E. Goubet, J. Katz, and F. Porikli, “Pedestrian tracking
(ICA ’01), pp. 126–131, 2001. using thermal infrared imaging,” in Infrared Technology and
[95] M. Stager, P. Lukowica, N. Perera, T. V. Buren, G. Troster, and Applications XXXII, vol. 6206 of Proceedings of SPIE, pp. 797–
T. Starner, “Soundbutton: design of a low power wearable 808, 2006.
audio classification system,” in Proceedings of the 7th IEEE [111] H. Zhao and S. S. Cheung, “Human segmentation by fusing
International Symposium on Wearable Computers, pp. 12–17, visiblelight and thermal imaginary,” in Proceedings of the
2003. International Conference on Computer Vision (ICCV ’09),
[96] J. Chen, A. H. Kam, J. Zhang, N. Liu, and L. Shue, “Bathroom 2009.
activity monitoring based on sound,” in Proceedings of the 3rd [112] P. Kumar, A. Mittal, and P. Kumar, “Fusion of thermal
International Conference on Pervasive Computing, vol. 3468 infrared and visible sprectrum video for robust surveillance,”
of Lecture Notes in Computer Science, pp. 47–61, Munich, in Proceedings of the 5th Indian Conference on Computer
Germany, May 2005. Vision, Graphics and Image Processing (ICVGIP ’06), vol. 4338
[97] M. Azlan, I. Cartwright, N. Jones, T. Quirk, and G. West, of Lectures Notes in Computer Science, pp. 528–539, Madurai,
“Multimodal monitoring of the aged in their own homes,” India, December 2006.
in Proceedings of the 3rd International Conference on Smart [113] J. Han and B. Bhanu, “Detecting moving humans using color
Homes and Health Telematics (ICOST ’05), 2005. and infrared video,” in Proceedings of the International Con-
[98] D. Ellis, “Detecting alarm sounds,” in Proceedings of Consis- ference on Multisensor Fusion and Integration for Intelligent
tent and Reliable Acoustic Cues for sound analysis(CRAC ’01), Systems, pp. 228–233, 2003.
Aalborg, Denmark, September 2001. [114] L. Jiang, F. Tian, L. E. Shen et al., “Perceptual-based fusion of
[99] M. Cristani, M. Bicego, and V. Murino, “On-line adaptive IR and visual images for human detection,” in Proceedings of
background modelling for audio surveillance,” in Proceedings the International Symposium on Intelligent Multimedia, Video
of the 17th International Conference on Pattern Recognition and Speech Processing (ISIMP ’04), pp. 514–517, October
(ICPR ’04), vol. 2, pp. 399–402, August 2004. 2004.
[100] S. Moncrieff, S. Venkatesh, and G. West, “Online audio back- [115] D. Rabinkin, R. Renomeron, A. Dahl, J. French, J. Flanagan,
ground determination for complex audio environments,” and M. Bianchi, “A DSP implementation of source location
ACM Transactions on Multimedia Computing, Communica- using microphone arrays,” The Journal of the Acoustical
tions and Applications, vol. 3, no. 2, Article ID 1230814, 2007. Society of America, vol. 99, no. 4, pp. 2503–2529, 1996.
[101] S. Chu, S. Narayanan, and C.-C. J. Kuo, “A semi-supervised [116] P. Aarabi and S. Zaky, “Robust sound localization using
learning approach to online audio background detection,” in multi-source audiovisual information fusion,” Information
Proceedings of the IEEE International Conference on Acoustics, Fusion, vol. 2, no. 3, pp. 209–223, 2001.
Speech, and Signal Processing (ICASSP ’09), pp. 1629–1632, [117] B. Bhanu and X. Zou, “Moving humans detection based on
April 2009. multimodal sensor fusion,” in Proceedings of the Conference
[102] K. Huang, L. Wang, T. Tan, and S. Maybank, “A real- on Computer Vision and Pattern Recognition, 2004.
time object detecting and tracking system for outdoor night [118] E. Menegatti, E. Mumolo, M. Nolich, and E. Pagello, “A
surveillance,” Pattern Recognition, vol. 41, no. 1, pp. 432–444, surveillance system based on audio and video sensory agents
2008. cooperating with a mobile robot,” in Proceedings of the 8th
[103] M. Stevens, J. Pollak, S. Ralph, and M. Snorrason, “Video International Conference on Intelligent Autonomous Systems
surveillance at night,” in Acquisition, Tracking, and Pointing (IAS ’08), 2004.
XIX, vol. 5810 of Proceedings of SPIE, pp. 128–136, 2005. [119] N. Megherbi, S. Ambellouis, O. Colôt, and F. Cabestaing,
[104] Y. Zhao, H. Gong, L. Lin, and Y. Jia, “Spatio-temporal patches “Joint audio-video people tracking using belief theory,” in
for night background modeling by subspace learning,” in Proceedings of the IEEE Conference on Advanced Video and
Proceedings of the 19th International Conference on Pattern Signal Based Surveillance (AVSS ’05), pp. 135–140, September
Recognition (ICPR ’08), December 2008. 2005.
[105] J. W. Davis and V. Sharma, “Background-subtraction using [120] M. Cristani, M. Bicego, and V. Murino, “Audio-Video
contour-based fusion of thermal and visible imagery,” Com- integration for background modelling,” in Proceedings of the
puter Vision and Image Understanding, vol. 106, no. 2-3, pp. 8th European Conference on Computer Vision (ECCV ’04),
162–182, 2007. vol. 3022 of Lecture Notes in Computer Science, pp. 202–213,
[106] S. Iwasawa, K. Ebiharai, J. Ohya, and S. Morishima, “Real- Prague, Czech Republic, May 2004.
time estimation of human body posture from monocular
thermal images,” in Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pp.
15–20, June 1997.
[107] B. Bhanu and J. Han, “Kinematic based human motion
analysis in infrared sequences,” in Proceedings of the 6th IEEE
Workshop on Applications of Computer Vision, pp. 208–212,
2002.
[108] F. Xu, X. Liu, and K. Fujimura, “Pedestrian detection and
tracking with night vision,” IEEE Transactions on Intelligent
Transportation Systems, vol. 6, no. 1, pp. 63–71, 2005.
[109] H. Nanda and L. Davis, “Probabilistic template based
pedestrian detection in infrared videos,” in Proceedings of the
IEEE Intelligent Vehicles Symposium, vol. 1, pp. 15–20, 2002.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 205095, 13 pages
doi:10.1155/2010/205095

Review Article
High-Resolution Sonars: What Resolution Do We Need for
Target Recognition?

Yan Pailhas, Yvan Petillot, and Chris Capus


School of Electrical and Physical Science, Oceans Systems Laboratory, Heriot Watt University, Edinburgh EH14 4AS, Scotland, UK

Correspondence should be addressed to Yan Pailhas, [email protected]

Received 23 December 2009; Revised 28 July 2010; Accepted 1 December 2010

Academic Editor: Yingzi Du

Copyright © 2010 Yan Pailhas et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Target recognition in sonar imagery has long been an active research area in the maritime domain, especially in the mine countermeasure context. Recently it has received even more attention as new sensors with increased resolution have been developed; new threats to critical maritime assets and a new paradigm for target recognition based on autonomous platforms have emerged. With the recent introduction of Synthetic Aperture Sonar systems and high-frequency sonars, sonar resolution has dramatically increased and noise levels decreased. Sonar images are distance images, but at high resolution they tend to appear visually as optical images. Traditionally, algorithms have been developed specifically for imaging sonars because of their limited resolution and high noise levels. With high-resolution sonars, algorithms developed in the image processing field for natural images become applicable. However, the lack of large datasets has hampered the development of such algorithms. Here we present a fast and realistic sonar simulator enabling development and evaluation of such algorithms. We develop a classifier and then analyse its performance using our simulated synthetic sonar images. Finally, we discuss the sensor resolution required to achieve effective classification of various targets and demonstrate that, with high-resolution sonars, target highlight analysis is the key to target recognition.

1. Introduction

Target recognition in sonar imagery has long been an active research area in the maritime domain. Recently, however, it has received increased attention, in part due to the development of new generations of sensors with increased resolution and in part due to the emergence of new threats to critical maritime assets and of a new paradigm for target recognition based on autonomous platforms. The recent introduction of operational Synthetic Aperture Sonar (SAS) systems [1, 2] and the development of ultrahigh-resolution acoustic cameras [3] have increased tenfold the resolution of the images available for target recognition, as demonstrated in Figure 1. In parallel, traditional dedicated ships are being replaced by small, low-cost, autonomous platforms easily deployable by any vessel of opportunity. This creates new sensing and processing challenges, as the classification algorithms need to be fully automatic and run in real time on the platforms. The platforms' behaviours also need to be adapted autonomously online, to guarantee that appropriate detection performance is met, sometimes on very challenging terrains. This creates a direct link between sensing and mission planning, sometimes called active perception, where the data acquisition is directly controlled by the scene interpretation.

Detection and identification techniques have tended to focus on saliency (global rarity or local contrast) [4-6], model-based detection [7-15] or supervised learning [16-22]. Alternative approaches investigating the internal structure of objects using wideband acoustics [23, 24] show some promise, but it is now widely acknowledged that current techniques are reaching their limits. Their performance does not yet enable rapid and effective mine clearance, and false alarm rates remain prohibitively high [4-22]. This is not a critical problem when operators can validate the outputs of the algorithms directly, as the algorithms still provide a very high data compression rate by dramatically reducing the amount of information that an operator has to review. The increasing use of autonomous platforms, however, raises fundamentally different challenges. Underwater communication is very poor due to the very low bandwidth of the medium (the data transfer rate is typically around 300 bits/s) and does not permit online

operator visualisation or intervention. For this reason the use of collaborating multiple platforms requires robust and accurate on-board decision making.

Figure 1: Example of Target in Synthetic Aperture Sonar (a) and Acoustic Camera (b). Images are courtesy of the NATO Undersea Research Centre (a) and Soundmetrics Ltd (b).

Figure 2: Snapshots of four different types of seabed: (a) flat seabed, (b) sand ripples, (c) rocky seabed and (d) cluttered environment.

The question of resolution has been raised again by the advent of very high resolution sidescan, forward-look and SAS systems. These change the quality of the images markedly, producing near-optical images. This paper looks at whether the resolution is now high enough to apply optical image processing techniques to take advantage of advances made in other fields.

In order to improve these performances, the MCM (Mine and Counter Measures) community has focused on improving the resolution of the sensors, and high resolution sonars are now a reality. However, these sensors are very expensive and very limited data (if any) are available to the research community. This has hampered the development of new algorithms for effective on-board decision making.

In this paper, we present tools and algorithms to address the challenges for the development of improved target detection algorithms using high resolution sensors. We focus on two key challenges.

(i) The development of fast simulation tools for high resolution sensors: this will enable us to tackle the current lack of real datasets to develop and evaluate new algorithms, including generative models for target identification. It will also provide a ground-truth simulation environment to evaluate potential active perception strategies.

(ii) What resolution do we need? The development of new sensors has been driven by the need for increased resolution.

The remainder of the paper is organized as follows. In Section 2, a fast and realistic sonar simulator is described. In Sections 3 and 4, the simulator is used to explore the resolution issue. Its flexibility enables the generation of realistic sonar images at various resolutions and the exploration

of the effects of resolution on classification performance. Extensive simulations provide a database of synthetic images on various seabed types. Algorithms can be developed and evaluated using the database. The importance of the pixel resolution for image-based algorithms is analysed, as well as the amount of information contained in the target shadow.

Figure 3: Decomposition of the 3D representation of the seafloor in 3 layers: partition between the different types of seabed, global elevation, roughness and texture.

Figure 4: 3D models of the different targets and minelike objects.

Figure 5: Definitions for surface reverberation modeling.

Figure 6: The trajectory of the sonar platform can be placed into the 3D environment.

2. Sidescan Simulator

Sonar images are difficult and expensive to obtain. A realistic simulator offers an alternative to develop and test MCM algorithms. High-frequency sonars and SAS increase the resolution of the sonar image from tens of cm to a few cm (3 to 5 cm). The resulting sonar images become closer to optical images. By increasing the resolution of the image the objects become sharper. Our objective here is to produce a simulator that can realistically reproduce such images in real time.

There is an existing body of research into sonar simulation [25, 26]. The simulators are generally based on ray-tracing techniques [27] or on a solution to the full wave equation [28]. SAS simulation takes into account the SAS processing and is, in general, highly complex [26]. Critically, in all cases, the algorithms are extremely slow (one hour to several days to compute a synthetic sidescan image with a desktop computer). When high frequencies are used, the path of the acoustic waves can be approximated by straight lines. In this case, classical ray-tracing techniques combined with a careful and detailed modeling of the energy-based sonar equation can be used. The results obtained are very similar to those obtained using more complex propagation models. Yet they are much faster and produce very realistic images.

Note that this simulator is a high-precision sidescan simulator, which can be equally well applied to forward looking sonar. SAS images differ from sidescan images in mainly two points: a constant pixel resolution at all ranges and a blur in the object shadows [29]. The simulator can cope with the constant range resolution so synthetic target highlights will appear similar. A fully representative SAS

shadow model remains to be implemented, but the analyses are still relevant for identification of targets from highlights in SAS imagery.

Figure 7: Display of the resulting sidescan images ((a) and (b)) of the same scene with different trajectories. The seafloor is composed of two sand ripple phenomena at different frequencies and different sediments (VeryFineSand for the high frequency ripples and VeryCoarseSand for the low frequency ripples). A Manta object has been put in the centre of the map.

The simulator presented here first generates a realistic synthetic 3D environment. The 3D environment is divided into three layers: a partition layer which assigns a seabed type to each area, an elevation profile corresponding to the general variation of the seabed, and a 3D texture that models each seabed structure. Figure 2 displays snapshots of four different types of seabed (flat sediment, sand ripples, rocky seabed and a cluttered environment) that can be generated by the simulator. All these natural structures can be well modeled using fractal representations. The simulator can also take into account various compositions of the seabed in terms of scattering strengths. The boundaries between each seabed type are also modeled using fractals.

Objects of different shapes and different materials can be inserted into the environment. For MCM algorithms, several types of mines have been modeled, such as the Manta (truncated cone shape), Rockan and cylindrical mines. The resulting 3D environment is a heightmap, meaning that to one location corresponds one unique elevation, so objects floating in midwater, for example, cannot be modelled here.

The sonar images are produced from this 3D environment, taking into account a particular trajectory of the sensor (mounted on a vessel or an autonomous platform). The seabed reflectivity is computed thanks to state-of-the-art models developed by APL-UW in the High-Frequency Ocean Environmental Acoustic Models Handbook [30], and the reflectivity of the targets is based on a Lambertian model. A pseudo ray-tracing algorithm is performed and the sonar equation is solved for each insonified area, giving the backscattered energy. Note that the shadows are automatically taken into account thanks to the pseudo ray-tracing algorithm. The processing time required to compute a sonar image of 50 m by 50 m using a 2 GHz Intel Core 2 Duo with 2 GB of memory is approximately 7 seconds. The remainder of the section details each of the modules required to perform the simulation.

2.1. 3D Digital Terrain Model Generator. The aim of this module is to generate realistic 3D seabed environments. It should be able to handle several types of seabed, to generate a realistic model for each seabed type, and to synthesize a realistic 3D elevation. For these reasons, the final 3D structure is built by superposition of three different layers: a partition layer, an elevation layer and a texture layer. Figure 3 shows an example of the three different layers which form the final 3D environment.

In the late seventies, mathematicians such as Mandelbrot [31] linked the symmetry patterns and self-similarity found in nature to mathematical objects called fractals [32-35]. Fractals have been used to model realistic textures and heightmap terrains [33]. A quick way to generate realistic 3D fractal heightmap terrains is by using a pink noise generator [33]. A pink noise is characterized by its power spectral density decreasing as 1/f^β, where 1 < β < 2.

2.1.1. The Partition Layer. In the simulator, various types of seabed can be chosen (up to three for a given image). The boundaries between the seabed types are computed using fractal borders.

2.1.2. Elevation Layer. This layer contains two types of possible elevation: a linear slope characterizing coastal seabeds

and a random 3D elevation. The random elevation is a smoothing of a pink noise process. The β parameter is used to tune the roughness of the seabed.
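The paper does not give an implementation for this pink-noise elevation layer, but the 1/f^β description above is enough to sketch one. The following Python/NumPy snippet is a minimal sketch under that description; the function name, parameter choices and normalisation are our own hypothetical choices, not the authors' code.

```python
import numpy as np

def pink_noise_heightmap(n=512, beta=1.5, seed=0):
    """Sketch of a fractal (1/f^beta) heightmap generator as described above.

    n    : size of the (n x n) heightmap in pixels
    beta : spectral decay exponent, 1 < beta < 2 (controls roughness)
    """
    rng = np.random.default_rng(seed)
    # White Gaussian noise whose spectrum will be reshaped.
    white = rng.standard_normal((n, n))
    spectrum = np.fft.fft2(white)

    # Radial spatial-frequency magnitude |f| on the FFT grid.
    fx = np.fft.fftfreq(n)
    fy = np.fft.fftfreq(n)
    f = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    f[0, 0] = 1.0  # avoid division by zero at DC

    # Shaping the amplitude by 1/f^(beta/2) gives a power spectral
    # density decreasing as 1/f^beta, as stated in the text.
    shaped = spectrum / f ** (beta / 2.0)
    shaped[0, 0] = 0.0  # remove the mean

    heightmap = np.real(np.fft.ifft2(shaped))
    # Normalise to a convenient elevation range (here +/- 0.5, arbitrary units).
    heightmap /= np.abs(heightmap).max()
    return 0.5 * heightmap

elevation = pink_noise_heightmap(n=256, beta=1.8)  # larger beta -> smoother seabed
```

A linear slope for coastal scenes can simply be added to the returned array, and a mild low-pass filter reproduces the "smoothing of a pink noise process" mentioned above.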
Figure 8: Examples of simulated sonar images for different seabed types (clutter, flat, ripples), 3D elevation and scattering strength. (a) represents a smooth seabed with some small variations, (b) represents a mixture of flat and cluttered seabed and (c) represents a rippled seabed.

2.1.3. Texture Layer. Four different textures have been created to model four kinds of seabed. Once again the textures are synthesized by fractal models derived from pink noise models.

(a) Flat Seabed. A simple flat floor is used for the flat seabed. No texture is needed in this case. Differences in reflectivity and scattering between sediment types are handled by the Image Generator module.

(b) Sand Ripples. The sand ripples are characterized by the periodicity and the direction of the ripples. A modified pink noise is used here; in this case the frequency decay is anisotropic. The magnitude of the Fourier transform follows (1). The frequency of the ripples is given by F_ripples = sqrt(f_xpeak² + f_ypeak²) and the direction is given by θ = tan⁻¹(f_xpeak / f_ypeak). The phase is modeled by a uniform distribution:

F(f_x, f_y) = 1 / (|f_x − f_xpeak|^β · |f_y − f_ypeak|^β). (1)

(c) Rocky Seabed. The magnitude of the Fourier transform of the rocky seabed is modeled by (2). The factor α

models the anisotropic erosion of the rock due to underwater currents:

F(f_x, f_y) = 1 / (α·f_x² + f_y²)^β. (2)

(d) Cluttered Environment. The cluttered environment is characterized by a random distribution of small rocks. A Poisson distribution has been chosen for the spatial distribution of the rocks on the seabed, as the mean number of occurrences is relatively small.
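As a rough illustration of how the texture spectra (1) and (2) can be turned into texture maps, the sketch below builds the anisotropic ripple magnitude of (1), peaked at (f_xpeak, f_ypeak), and the rocky magnitude of (2), attaches uniform random phase as described above and inverse-transforms. The function names, the small epsilon regularisers and the normalisation are our own assumptions, not the authors' implementation.

```python
import numpy as np

def texture_from_magnitude(mag, seed=0):
    """Attach uniform random phase to a Fourier magnitude and invert it."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0.0, 2.0 * np.pi, mag.shape)
    tex = np.real(np.fft.ifft2(mag * np.exp(1j * phase)))
    return tex / np.abs(tex).max()

def ripple_texture(n=512, f_xpeak=0.05, f_ypeak=0.02, beta=1.5, eps=1e-3):
    """Sand-ripple texture: magnitude as in (1), peaked at (f_xpeak, f_ypeak)."""
    fx = np.fft.fftfreq(n)[None, :]
    fy = np.fft.fftfreq(n)[:, None]
    mag = 1.0 / ((np.abs(fx - f_xpeak) + eps) ** beta *
                 (np.abs(fy - f_ypeak) + eps) ** beta)
    return texture_from_magnitude(mag)

def rocky_texture(n=512, alpha=4.0, beta=1.5, eps=1e-6):
    """Rocky-seabed texture: magnitude as in (2), anisotropy controlled by alpha."""
    fx = np.fft.fftfreq(n)[None, :]
    fy = np.fft.fftfreq(n)[:, None]
    mag = 1.0 / ((alpha * fx ** 2 + fy ** 2 + eps) ** beta)
    return texture_from_magnitude(mag)

ripples = ripple_texture()  # ripple frequency ~ sqrt(f_xpeak^2 + f_ypeak^2)
rocks = rocky_texture()
```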
Figure 9: Ability to detect and identify targets as a function of resolution and coverage rate (nM/h: nautical mile per hour) for the best sidescan and synthetic aperture sonars (comparison of the EdgeTech 4200 MPX, Klein 5000-455, Klein 3000-500, EdgeTech 4200-600, a typical SAS and a military SAS; detection and identification ranges plotted against the along-track resolution of the sonar). The SAS sonars here are a typical 100-300 kHz sonar in optimal conditions for synthetic aperture.

2.2. Targets. A separate module is provided for adding targets into the environment. Figure 4 displays the 3D models of 6 different targets. Location, size and material composition can be adjusted by the user. The resulting sidescan images offer a large database for detection and classification algorithms. Nonmine targets can also be generated by varying parameters in this module. Several are used to test the algorithms, with results presented in Section 4.1.2.

2.3. The Sonar Image Generator. The sonar module computes the sidescan image from a given 3D environment. The simulator is ray-tracing-based and solves the sonar equation [36] (given in (3)). Because (3) is an energetic equation, phenomena such as multipaths are not taken into account. For a monostatic sonar system, the sound propagation can be expressed from an energetic point of view as

XS = SL − 2TL + TS + DI − NL − RL, (3)

where XS is the excess level, that is, the backscattering energy, SL is the Source Level of the projector, DI is the Directivity Index, TL is the Transmission Loss, NL is the Noise Level, RL is the Reverberation Level and TS is the Target Strength. All the parameters are measured in decibels (dB) relative to the standard reference intensity of a 1 μPa plane wave.

In a wide range of cases, a good approximation to transmission loss TL can be made by considering the process as a combination of free-field spherical spreading and an added absorption loss. This working rule can be expressed as

TL = 20 log r + αr, (4)

where r is the transmission range and α is an attenuation coefficient expressed in dB/m. The attenuation coefficient can be expressed as the sum of two chemical relaxation processes and the absorption of pure water. It can be computed numerically thanks to the Francois-Garrison formula [37].

Reverberation Level is an important restricting factor in the detection process, especially in the context of MCM. At short ranges, it represents the most significant noise factor. The surface reverberation can be developed as drawn in Figure 5, where dA defines the elementary surface subtended by horizontal angle dφ and is dependent on the pulse length τ and range. Returns from the front and rear ends of the pulse determine the size of the elementary surface element, dA. So, for the seabed contribution to reverberation level, we can write

RL = SL − 2TL + Ss + 10 log((cτ/2)·φ·r), (5)

where d is the altitude of the sonar, r is the range to the seabed along the main axis of the transducer beam and t is time.

Three types of seabed have been implemented: Rough Rock, Sandy Gravel and Very Fine Sand. A theoretical Bottom Scattering Strength (Ss in (5)) can be computed thanks to [30].

The source level SL is the power of the transmitter. It is a constant and given by the sonar manufacturer. For sidescan the SL is typically between 200 and 230 dB.

The Directivity Index (DI) is a sonar-dependent factor associated with the directionality of the transducer system. The simulator includes a simple beam pattern derived from a continuous line array model of length l. The beam pattern function can be computed thanks to the following:

B(θ) = sin((πl/λ)·sin θ) / ((πl/λ)·sin θ). (6)

Also, any transducer beam pattern can be integrated into the simulator.

In our model, the targets form part of the 3D environment. The Target Scattering Strength (TS) is computed

using a Lambertian model. The reflectance factor in the Lambertian law is associated with the acoustic impedance. The simulator takes into account the acoustic impedance of the target, given by Z = ρ·c_l, where ρ is the density of the material and c_l the longitudinal sound speed in the material.

Figure 10: Snapshot of the four targets. (a) Manta on sand ripples, (b) Rockan on cluttered environment, (c) Cuboid on flat seabed, (d) Cylinder on sand ripples. The pixel size in these target images is 5 cm.

The sidescan simulator is designed for validity in the range of frequencies from 80 kHz to 300 kHz. We only consider one contribution to the ambient Noise Level: the thermal noise. For thermal agitation, the equivalent noise spectrum level is given by the empirical formula [36]:

NL = −15 + 20 log f, with f in kHz. (7)

The trajectory of the sonar platform is tuneable (as shown in Figure 6). This allows multiview sidescan images of the same environment. Figure 7 displays sonar images of the same scene with two different angles of view.
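To make the energy bookkeeping of (3)-(7) concrete, the sketch below evaluates the one-way loss (4), the seabed reverberation (5), the line-array beam pattern (6) and the thermal noise (7) for a notional sidescan geometry, and combines them into the excess level of (3). The numerical values (source level, pulse length, attenuation coefficient, bottom scattering strength, directivity index) are placeholders chosen for illustration, not figures taken from the paper, and the full Francois-Garrison absorption model is not reproduced here.

```python
import numpy as np

def transmission_loss(r, alpha_db_per_m):
    """One-way transmission loss (4): spherical spreading plus absorption."""
    return 20.0 * np.log10(r) + alpha_db_per_m * r

def reverberation_level(SL, TL, Ss, c, tau, phi, r):
    """Seabed reverberation level (5) over the insonified area (c*tau/2)*phi*r."""
    area = 0.5 * c * tau * phi * r
    return SL - 2.0 * TL + Ss + 10.0 * np.log10(area)

def beam_pattern(theta, l, lam):
    """Continuous line-array beam pattern (6); theta in radians."""
    x = (np.pi * l / lam) * np.sin(theta)
    return np.sinc(x / np.pi)  # numpy sinc is sin(pi x)/(pi x), so this is sin(x)/x

def thermal_noise_level(f_khz):
    """Thermal noise spectrum level (7), f in kHz."""
    return -15.0 + 20.0 * np.log10(f_khz)

# Placeholder sidescan parameters (assumptions, for illustration only).
SL, TS, f_khz = 210.0, -20.0, 300.0       # source level (dB), target strength (dB), frequency (kHz)
alpha = 0.08                              # absorption in dB/m, indicative value at 300 kHz
c, tau, phi = 1500.0, 1e-4, np.radians(0.5)
Ss = -30.0                                # bottom scattering strength (dB)
DI = 20.0                                 # directivity index placeholder (dB)

r = np.linspace(5.0, 50.0, 10)            # ranges in metres
TL = transmission_loss(r, alpha)
RL = reverberation_level(SL, TL, Ss, c, tau, phi, r)
NL = thermal_noise_level(f_khz)
B0 = beam_pattern(np.radians(1.0), l=0.3, lam=0.005)  # response 1 degree off axis
XS = SL - 2.0 * TL + TS + DI - NL - RL    # excess level per (3), in dB
```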
Further examples of typical images obtained for the various types of seabed are shown in Figure 8.

3. Classifier

3.1. Target Recognition Using Sonar Images. Target recognition in sonar imagery is a long-standing problem which has attracted considerable attention [7-15]. However, the resolution of the sensors available has limited not only the spectrum of techniques applicable, but also their performances. Most techniques for detection rely on matched filtering [38] or statistical modeling [11, 14], whilst recognition is mainly model-based [10, 13, 15].

The limitations of current sidescan technology are highlighted in Figure 9. It would seem from this figure that only SAS systems can give large area coverage and still give high

resolution needed for identification. However, the boundaries drawn between detection and identification are more the result of general wisdom than of solid scientific evidence. New high resolution sonars such as SAS produce images which get closer to traditional optical imagery. This is also opening a new era of algorithm development for acoustics, as techniques recently developed in computer vision become more applicable. For example, the SAS system developed by NURC (MUSCLE) can achieve a 3 to 5 cm pixel resolution, almost independent of range. Thanks to this resolution, direct analysis of the target echo rather than traditional techniques based on its shadow becomes possible.

Figure 11: Misidentification of the four targets as a function of the pixel resolution. This is considering the highlight of the targets.

Identifying the resolution required to perform target classification is not a simple problem. In sonar, this has been attempted by various authors [39-42], generally looking at the minimum resolution required to distinguish a sphere from a cube and using information theory approaches. These techniques provide a lower bound on the minimum resolution required but tend to be over-optimistic. We focus here on modern subspace algorithms based on PCA (Principal Component Analysis) as a mechanism to analyze the resolution needs for classification. Why focus on such techniques? The main reason is that they are very versatile and have been applied successfully to a variety of classical target identification problems. This has been demonstrated recently on face recognition [43] and land-based object detection problems [44].

3.2. Principal Component Analysis. The algorithm used in this paper for classification is based on the eigenfaces algorithm. The PCA-based eigenfaces approach has been used for face recognition purposes [45, 46] and is still close to the state of the art for this application [43].

Assume the training set is composed of k images of a target. Each target image M_i is an n × m matrix. The M_i are converted into vectors M̂_i of dimension 1 × n·m. A mean image of the target is computed using the following:

M_mean = (1/k) Σ_{i=1..k} M̂_i. (8)

The training vectors M̂_i are centered and normalized according to (9). In the training set, the target is selected from various ranges (from 5 m to 50 m from the sonar). The contrast and illumination change drastically through the training set; normalization by the standard deviation of the image reduces this effect. Let std(M̂_i) be the standard deviation of M̂_i:

T_i = (M̂_i − M_mean) / std(M̂_i). (9)

Let T = [T_i] be the preprocessed training set of dimension k × n·m. The covariance matrix Ω = T·T^T is calculated. The p largest eigenvalues of Ω are computed, and the p corresponding eigenvectors form the decomposition base of the target. The subspace Θ_target formed by the p eigenvectors is called the target space. The number of eigenvalues p has been set to 20.

The classifier projects the test target image I_n onto each target space. We denote P_Θtarget(I_n) the projection of I_n in the target space Θ_target. The estimated target, targ, is the target corresponding to the minimum distance between I_n and P_Θtarget(I_n), as expressed in

targ = argmin_target ‖I_n − P_Θtarget(I_n)‖. (10)

The P_Θtarget(I_n) with the minimum distance represents the most compact space which represents the object under inspection.
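A compact NumPy sketch of this eigenfaces-style pipeline, (8)-(10), is given below. It follows the steps described above (mean image, normalisation by the standard deviation, eigendecomposition, projection onto each target space and nearest-subspace decision); the variable names, the use of the small k × k Gram matrix and numpy.linalg.eigh are our own implementation choices, not the authors' code.

```python
import numpy as np

def train_target_space(images, p=20):
    """Build a target space from k images of one target (each n x m), per (8)-(9).

    Returns the mean vector and p leading eigenvectors of the covariance.
    """
    M = np.stack([img.ravel() for img in images]).astype(float)   # k x (n*m)
    M_mean = M.mean(axis=0)                                       # mean image, (8)
    T = (M - M_mean) / M.std(axis=1, keepdims=True)               # centered/normalised, (9)
    # Eigen-decomposition of the small k x k Gram matrix T T^T (eigenfaces trick).
    gram = T @ T.T
    vals, vecs = np.linalg.eigh(gram)
    idx = np.argsort(vals)[::-1][:p]
    basis = T.T @ vecs[:, idx]                                    # (n*m) x p eigenimages
    basis /= np.linalg.norm(basis, axis=0, keepdims=True) + 1e-12
    return M_mean, basis

def residual(image, M_mean, basis):
    """Distance between an image and its projection onto one target space, as in (10)."""
    x = image.ravel().astype(float) - M_mean
    proj = basis @ (basis.T @ x)
    return np.linalg.norm(x - proj)

def identify(image, target_spaces):
    """Pick the target whose subspace reconstructs the image best, (10)."""
    return min(target_spaces,
               key=lambda name: residual(image, *target_spaces[name]))

# Usage sketch: target_spaces = {"manta": train_target_space(manta_imgs), ...}
# label = identify(test_image, target_spaces)
```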
4. Results

In previous works [15, 16, 47, 48], target classification algorithms using standard sidescan sonars have mainly been based on the analysis of the targets' shadows. With high resolution sonars, we note that more information should be exploitable from the target's highlight. In this section, we investigate the resolution needed for the PCA image-based classifier described earlier to classify using only the information carried by the highlight.

The sidescan simulator presented in Section 2 will provide synthetic data in order to train and to test the PCA image-based classifier. All the sidescan images are generated with a randomly selected seafloor (from flat seabed, ripples and cluttered environment), random sonar altitude (from 2 to 10 metres altitude) and random range for the targets (from 5 to 50 metres range).

For each experiment, two separate sets of sonar images have been computed, one specifically for training (in order to compute the target space Θ_target) and one specifically for testing.

Figure 12: Snapshot of the targets used for classification. On the first line, the minelike targets: the Manta, the Rockan and the cylinder. On the second line, the nonmine targets: the cuboid, the two hemispheres, and the box-shaped target. The pixel size in these target images is 5 cm.

Figure 13: (a) Misidentification of the seven targets as a function of the pixel resolution. (b) Misclassification of the targets as a function of the pixel resolution.

At each sonar resolution and for each target, 80 synthetic target images (at random ranges, random altitude and with a randomly selected seafloor) have been used for training. A larger set of 40000 synthetic target images is used to test the classifier. The classifier is trained and tested according to the algorithm described in Section 3.2.

Figure 14: Snapshot of the shadows of the four targets (from left to right: Manta, Rockan, Cube and Cylinder) to classify, with different orientations and backgrounds. The pixel size in these target images is 5 cm. The size of each snapshot is 1.25 m × 2.75 m.

Figure 15: Percentage of misidentification versus the pixel resolution for various target types. This considers the shadow of the target and not its echo.

4.1. What Precision Is Needed?

4.1.1. Identification. In this first experiment the PCA classifier is trained for identification. Assuming a minelike object has been detected and classified as a mine, the algorithm identifies the kind of mine the target is. Four targets have been chosen: a Manta mine (truncated cone with dimensions 98 cm lower diameter, 49 cm upper diameter, 47 cm height), a Rockan mine (L × W × H: 100 cm × 50 cm × 40 cm), a cuboid with dimensions 100 cm × 30 cm × 30 cm and a cylinder 100 cm long and 30 cm in diameter. Figure 10 displays snapshots of the four different targets for a 5 cm sonar resolution.

The pixel resolution is tunable in the simulator. Simulation/classification processes have been run for 15 different pixel resolutions, from 3 cm (high resolution sonar) to 30 cm (low resolution sonar), covering the detection and classification range of side-looking sonars. Figure 11 displays the misidentification percentage of the four targets against the pixel resolution.

As expected, the image-based classifier fails at low resolutions. Between 15 and 20 cm resolution, which corresponds to the majority of standard sonar systems, classification based on the highlights is poor (between 50% and 80% correct classification). The results stabilize at around 5 cm resolution to reach around 95% correct classification.

Previous work on face recognition has shown that PCA techniques are not very robust to rotation [49]. The algorithm can be optimized using multiple subspaces for each nonsymmetric target, each of the subspaces covering a limited angular range.

4.1.2. Classification. In this section we extend the PCA classifier for underwater object classification purposes. A larger set of seven targets has been chosen, with three minelike objects: the Manta, the Rockan and a cylinder 100 cm long and 30 cm in diameter, and four nonmine objects: a cuboid with dimensions 100 cm × 50 cm × 40 cm, two hemispheres with diameters of 100 cm and 50 cm, respectively, and a box with dimensions 70 cm × 70 cm × 40 cm. Note that the nonmine targets have been chosen such that the dimensions of the big hemisphere match those of the Manta, and the dimensions of the box match those of the Rockan. Figure 12 provides snapshots of the different targets.

As described in Section 4.1.1, two data sets for training and testing have been produced. The target classification relies on two steps: first the target is identified following the same process as in Section 4.1.1, and then it is classified into one of two classes, minelike or nonmine.
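A minimal sketch of this two-step decision, reusing the identify helper from the PCA sketch in Section 3.2, is shown below; the grouping of the seven labels into the two classes follows the target list above, while the function and label names are our own.

```python
# Labels grouped per the seven-target experiment described above.
MINELIKE = {"manta", "rockan", "cylinder"}
NONMINE = {"cuboid", "big_hemisphere", "small_hemisphere", "box"}

def classify(image, target_spaces):
    """Step 1: identify the nearest target subspace (10); step 2: map to minelike/nonmine."""
    label = identify(image, target_spaces)   # identify() as sketched in Section 3.2
    category = "minelike" if label in MINELIKE else "nonmine"
    return label, category
```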
Figure 13(a) displays the results of the identification step. The misidentification curves for each target follow the general pattern described earlier in Section 4.1.1, with a low misidentification (below 5%) for a pixel resolution finer than 5 cm. In Figure 13(b), the results of the classification between minelike and nonmine are shown. Contrary to the identification process, the classification curves stabilise at a higher pixel resolution (around 10 cm) at 2-3% misclassification.

In these examples we show that the identification task needs a higher pixel resolution than the classification task to match the same performance of 95% correct identification/classification.

4.2. Identification with Shadow. As mentioned earlier, current sidescan ATR algorithms depend strongly on the target shadow for detection and classification. The usual assumption made is: at low resolution the information relative to the target is mostly contained in its shadow. In this section we aim to confirm this statement by using the classifier described in Section 3.2 directly on the target shadows.

We study here the quantity of information contained in the shape of the shadow, and how this information is retrievable depending on the pixel resolution.

Shadows are the result of the directional acoustic illumination of a 3D target. They are therefore range dependent. For the purposes of this experiment, in order to remove the effect of the range dependence of the shadows, the targets are positioned at a fixed range of 25 m from the sensor. Image segments containing the target shadows are extracted from the data. Figure 14 displays snapshots of target shadows with different orientations and backgrounds for a 5 cm pixel resolution. We process the target shadow images in exactly the same way as we did for the target highlight images in the previous sections. For each sonar resolution, 80 target shadows per object are used for training the classifier, and a set of 40000 shadow images is used for testing.
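The paper does not spell out how the shadow segments are extracted. One simple illustrative approach (our assumption, not the authors' method) is to crop a fixed-size window down-range of the detected target and keep the pixels that fall below a relative intensity threshold, as sketched below; the window size echoes the 1.25 m × 2.75 m snapshots of Figure 14 at 5 cm resolution, and the threshold value is arbitrary.

```python
import numpy as np

def extract_shadow_patch(image, target_rc, patch=(25, 55), shadow_db=-10.0):
    """Illustrative shadow extraction (assumed method, not from the paper).

    image     : sonar image in dB (2D array), range increasing along columns
    target_rc : (row, col) of the detected target highlight
    patch     : window size in pixels (cross-range, range), e.g. 1.25 m x 2.75 m at 5 cm
    shadow_db : threshold relative to the local median background level
    """
    r, c = target_rc
    h, w = patch
    # Region immediately down-range of the target, where the shadow is cast.
    window = image[max(r - h // 2, 0): r + h // 2, c: c + w]
    background = np.median(window)
    mask = (window < background + shadow_db).astype(float)
    return window, mask
```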
In total, 15 training/classification simulations have been done for 15 sonar pixel resolutions (from 5 cm to 30 cm). Figure 15 shows the percentage of misclassification versus the pixel resolution for various target types.

Concerning the Cylinder and Cuboid targets, their shadows are very similar due to their similar geometry. In Figure 14 it is almost impossible to distinguish visually between the two objects looking only at their shadows. At broadside, for example, the two shadows have exactly the same rectangular shape, explaining why the confusion between these two objects is high.

For the Manta and Rockan targets, the misidentification curves stabilize near 0% misidentification below 20 cm sonar resolution. Therefore, for standard sidescan systems with a resolution in the 10-30 cm range, the target information can be extracted from the shadow with an excellent probability of correct identification. In comparison, correct identification using the target highlights at 20 cm resolution is about 50% (cf. Figure 11).

5. Conclusions and Future Work

In this paper, a new real-time realistic sidescan simulator has been presented. Thanks to the flexibility of this numerical tool, realistic synthetic data can be generated at different pixel resolutions. A subspace target identification technique based on PCA has been developed and used to evaluate the ability of modern sonar systems to identify a variety of targets. The results of processing shadow images back up the widely accepted idea that identification from current sonars at 10-20 cm resolution is reaching its performance limit. The advent of much higher resolution sonars has now made it possible to bring in and apply techniques new to the field from optical image processing. The PCA analyses presented here, operating on highlight as opposed solely to shadow, show that these can give a significant improvement in target identification and classification performance, opening the way for reinvigorated effort in this area.

The emergence of very high resolution sonar systems such as SAS and acoustic cameras will enable more advanced target identification techniques to be used very soon. The next phase of this work will be to validate and confirm these results using real SAS data. We are currently undertaking this phase in collaboration with the NATO Undersea Research Centre and DSTL under the UK Defense Research Centre program.

Acknowledgments

This work was supported by EPSRC and DSTL under research contracts EP/H012354/1 and EP/F068956/1. The authors also acknowledge support from the Scottish Funding Council for the Joint Research Institute in Signal and Image Processing between the University of Edinburgh and Heriot-Watt University, which is a part of the Edinburgh Research Partnership in Engineering and Mathematics (ERPem).

References

[1] A. Bellettini, "Design and experimental results of a 300-kHz synthetic aperture sonar optimized for shallow-water operations," IEEE Journal of Oceanic Engineering, vol. 34, no. 3, pp. 285-293, 2009.
[2] B. G. Ferguson and R. J. Wyber, "Generalized framework for real aperture, synthetic aperture, and tomographic sonar imaging," IEEE Journal of Oceanic Engineering, vol. 34, no. 3, pp. 225-238, 2009.
[3] E. O. Belcher, D. C. Lynn, H. Q. Dinh, and T. J. Laughlin, "Beamforming and imaging with acoustic lenses in small, high-frequency sonars," in Proceedings of the Oceans Conference, pp. 1495-1499, September 1999.
[4] A. Goldman and I. Cohen, "Anomaly subspace detection based on a multi-scale Markov random field model," Signal Processing, vol. 85, no. 3, pp. 463-479, 2005.
[5] F. Maussang, J. Chanussot, A. Hétet, and M. Amate, "Higher-order statistics for the detection of small objects in a noisy background application on sonar imaging," EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 47039, 17 pages, 2007.
[6] B. R. Calder, L. M. Linnett, and D. R. Carmichael, "Spatial stochastic models for seabed object detection," in Detection and Remediation Technologies for Mines and Minelike Targets II, Proceedings of SPIE, pp. 172-182, April 1997.
[7] M. Mignotte, C. Collet, P. Perez, and P. Bouthemy, "Hybrid genetic optimization and statistical model-based approach for the classification of shadow shapes in sonar imagery," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 2, pp. 129-141, 2000.
[8] B. Calder, Bayesian spatial models for sonar image interpretation, Ph.D. dissertation, Heriot-Watt University, September 1997.
[9] G. J. Dobeck, J. C. Hyland, and L. D. Smedley, "Automated detection and classification of sea mines in sonar imagery," in Detection and Remediation Technologies for Mines and Minelike Targets II, Proceedings of SPIE, pp. 90-110, April 1997.
[10] I. Quidu, J. Ph. Malkasse, G. Burel, and P. Vilbe, "Mine classification based on raw sonar data: an approach combining Fourier descriptors, statistical models and genetic algorithms," in Proceedings of the Oceans Conference, pp. 285-290, September 2000.
[11] B. R. Calder, L. M. Linnett, and D. R. Carmichael, "Bayesian approach to object detection in sidescan sonar," IEE Proceedings: Vision, Image and Signal Processing, vol. 145, no. 3, pp. 221-228, 1998.

[12] R. Balasubramanian and M. Stevenson, "Pattern recognition for underwater mine detection," in Proceedings of the Computer-Aided Classification/Computer-Aided Design Conference, Halifax, Canada, November 2001.
[13] S. Reed, Y. Petillot, and J. Bell, "Automated approach to classification of mine-like objects in sidescan sonar using highlight and shadow information," IEE Proceedings: Radar, Sonar and Navigation, vol. 151, no. 1, pp. 48-56, 2004.
[14] S. Reed, Y. Petillot, and J. Bell, "Model-based approach to the detection and classification of mines in sidescan sonar," Applied Optics, vol. 43, no. 2, pp. 237-246, 2004.
[15] E. Dura, J. Bell, and D. Lane, "Superellipse fitting for the recovery and classification of mine-like shapes in sidescan sonar images," IEEE Journal of Oceanic Engineering, vol. 33, no. 4, pp. 434-444, 2008.
[16] B. Zerr, E. Bovio, and B. Stage, "Automatic mine classification approach based on auv manoeuverability and the cots side scan sonar," in Proceedings of the Autonomous Underwater Vehicle and Ocean Modelling Networks Conference (GOATS '00), pp. 315-322, 2001.
[17] M. Azimi-Sadjadi, A. Jamshidi, and G. Dobeck, "Adaptive underwater target classification with multi-aspect decision feedback," in Proceedings of the Computer-Aided Classification/Computer-Aided Design Conference, Halifax, Canada, November 2001.
[18] I. Quidu, J. Ph. Malkasse, G. Burel, and P. Vilbe, "Mine classification using a hybrid set of descriptors," in Proceedings of the Oceans Conference, pp. 291-297, September 2000.
[19] J. Fawcett, "Image-based classification of side-scan sonar detections," in Proceedings of the Computer-Aided Classification/Computer-Aided Design Conference, Halifax, Canada, November 2001.
[20] S. Perry and L. Guan, "Detection of small man-made objects in multiple range sector scan imagery using neural networks," in Proceedings of the Computer-Aided Classification/Computer-Aided Design Conference, Halifax, Canada, November 2001.
[21] C. Ciany and W. Zurawski, "Performance of computer aided detection/computer aided classification and data fusion algorithms for automated detection and classification of underwater mines," in Proceedings of the Computer-Aided Classification/Computer-Aided Design Conference, Halifax, Canada, November 2001.
[22] C. M. Ciany and J. Huang, "Computer aided detection/computer aided classification and data fusion algorithms
[28] G. R. Elston and J. M. Bell, "Pseudospectral time-domain modeling of non-Rayleigh reverberation: synthesis and statistical analysis of a sidescan sonar image of sand ripples," IEEE Journal of Oceanic Engineering, vol. 29, no. 2, pp. 317-329, 2004.
[29] M. Pinto, "Design of synthetic aperture sonar systems for high-resolution seabed imaging," in Proceedings of MTS/IEEE Oceans Conference, Boston, Mass, USA, 2006.
[30] A. P. L. at the University of Washington, "High-Frequency Ocean Environmental Acoustic Models Handbook," Tech. Rep. APL-UW TR 9407, October 1994.
[31] B. Mandelbrot, The Fractal Geometry of Nature, W. H. Freeman, 1982.
[32] A. P. Pentland, "Fractal-based description of natural scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 661-674, 1984.
[33] R. F. Voss, Random Fractal Forgeries in Fundamental Algorithms for Computer Graphics, R. A. Earnshaw, Ed., Springer, Berlin, Germany, 1985.
[34] P. A. Burrough, "Fractal dimensions of landscapes and other environmental data," Nature, vol. 294, no. 5838, pp. 240-242, 1981.
[35] S. Lovejoy, "Area-perimeter relation for rain and cloud areas," Science, vol. 216, no. 4542, pp. 185-187, 1982.
[36] R. J. Urick, Principles of Underwater Sound, McGraw-Hill, New York, NY, USA, 3rd edition, 1975.
[37] R. E. Francois, "Sound absorption based on ocean measurements: Part I: pure water and magnesium sulfate contributions," The Journal of the Acoustical Society of America, vol. 72, no. 3, pp. 896-907, 1982.
[38] T. Aridgides, M. F. Fernandez, and G. J. Dobeck, "Adaptive three-dimensional range-crossrange-frequency filter processing string for sea mine classification in side scan sonar imagery," in Detection and Remediation Technologies for Mines and Minelike Targets II, Proceedings of SPIE, pp. 111-122, April 1997.
[39] M. Pinto, "Performance index for shadow classification in minehunting sonar," in Proceedings of the UDT Conference, 1997.
[40] V. Myers and M. Pinto, "Bounding the performance of sidescan sonar automatic target recognition algorithms using information theory," IET Radar, Sonar and Navigation, vol. 1,
no. 4, pp. 266–273, 2007.
for automated detection and classification of underwater
[41] R. T. Kessel, “Estimating the limitations that image resolution
mines,” in Proceedings of the Oceans Conference, pp. 277–284,
and contrast place on target recognition,” in Automatic Target
September 2000.
Recognition XII, Proceedings of SPIE, pp. 316–327, usa, April
[23] Y. Pailhas, C. Capus, K. Brown, and P. Moore, “Analysis and
2002.
classification of broadband echoes using bio-inspired dolphin
pulses,” Journal of the Acoustical Society of America, vol. 127, [42] F. Florin, F. Van Zeebroeck, I. Quidu, and N. Le Bouffant,
no. 6, pp. 3809–3820, 2010. “Classification performance of minehunting sonar: theory,
[24] C. Capus, Y. Pailhas, and K. Brown, “Classification of bottom- practical, results and operational applications,” in Proceeedings
set targets from wideband echo responses to bio-inspired of the UDT Conference, 2003.
sonar pulses,” in Proceedings of the 4th International Conference [43] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and YI. Ma,
on Bio-acoustics, 2007. “Robust face recognition via sparse representation,” IEEE
[25] J. Bell, A model for the simulation of sidescan sonar, Ph.D. Transactions on Pattern Analysis and Machine Intelligence, vol.
dissertation, Heriot-Watt University, August 1995. 31, no. 2, pp. 210–227, 2009.
[26] A. J. Hunter, M. P. Hayes, and P. T. Gough, “Simulation of [44] A. Nayak, E. Trucco, A. Ahmad, and A. M. Wallace, “Sim-
multiple-receiver, broadband interferometric SAS imagery,” BIL: appearance-based simulation of burst-illumination laser
in Proceeding of IEEE Oceans Conference, pp. 2629–2634, sequences,” IET Image Processing, vol. 2, no. 3, pp. 165–174,
September 2003. 2008.
[27] J. M. Bell, “Application of optical ray tracing techniques to the [45] L. Sirovich and M. Kirby, “Low-dimensional procedure for the
simulation of sonar images,” Optical Engineering, vol. 36, no. characterization of human faces,” Journal of the Optical Society
6, pp. 1806–1813, 1997. of America A, vol. 4, no. 3, pp. 519–524, 1987.
EURASIP Journal on Advances in Signal Processing 13

[46] K. Etemad and R. Chellappa, “Discriminant analysis for


recognition of human face images,” Journal of the Optical
Society of America A, vol. 14, no. 8, pp. 1724–1733, 1997.
[47] S. Reed, Y. Petillot, and J. Bell, “An automatic approach to the
detection and extraction of mine features in sidescan sonar,”
IEEE Journal of Oceanic Engineering, vol. 28, no. 1, pp. 90–105,
2003.
[48] V. L. Myers, “Image segmentation using iteration and fuzzy
logic,” in Proceedings of the Computer-Aided Classification/
Computer-Aided Design Conference, 2001.
[49] M. Turk and A. Pentland, “Face recognition using eigenfaces,”
in Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition, pp. 586–591, 1991.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 945130, 19 pages
doi:10.1155/2010/945130

Research Article
An Efficient and Robust Moving Shadow Removal Algorithm and
Its Applications in ITS

Chin-Teng Lin,1 Chien-Ting Yang,1 Yu-Wen Shou,2 and Tzu-Kuei Shen1


1 Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan
2 Department of Computer and Communication Engineering, China University of Technology, Hsinchu 303, Taiwan

Correspondence should be addressed to Yu-Wen Shou, [email protected]

Received 1 December 2009; Accepted 10 May 2010

Academic Editor: Alan van Nevel

Copyright © 2010 Chin-Teng Lin et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose an efficient algorithm for removing the shadows of moving vehicles caused by nonuniform distributions of light reflections in the daytime. This paper presents a complete structure of feature combination and analysis for locating and labeling moving shadows, so that the foreground objects of interest can be extracted more easily from each frame of video acquired in real traffic situations. We make use of a Gaussian Mixture Model (GMM) for background removal and for the detection of moving shadows in the tested images, and we define two indices for characterizing nonshadowed regions: one describes the characteristics of lines (edges), and the other is derived from the gray-scale information of the image, which allows us to build a newly defined set of darkening ratios (modified darkening factors) based on Gaussian models. To prove the effectiveness of our moving shadow removal algorithm, we apply it to a practical application of traffic flow detection in ITS (Intelligent Transportation Systems), namely vehicle counting. Our algorithm runs at 13.84 ms/frame and improves the accuracy rate of vehicle counting by 4%-10% on our three tested videos.

1. Introduction

In recent years, research on intelligent video surveillance has increased noticeably. Foreground object detection could be considered as one of the fundamental and critical techniques in this field. In the conventional methods, background subtraction and temporal difference have been widely used for foreground extraction in the case of stationary cameras. The continuing improvements in background modeling techniques have led to many new and fascinating applications like event detection, object behavior analysis, suspicious object detection, traffic monitoring, and so forth. However, some factors like dynamic backgrounds and moving shadows might affect the results of foreground detection and make the problems more complicated. Dynamic backgrounds, one of these factors, might cause escalators and swaying trees to be detected and treated as foreground regions. Another factor, moving shadows, which occur when light is blocked by moving objects, usually leads to misclassification of foreground regions. This paper focuses on the study of moving shadows and aims at developing an efficient and robust algorithm of moving shadow removal as well as the related applications.

About the studies on the influences of moving shadows, Zhang et al. [1] classified these techniques into four categories, including color model, statistical model, textural model, and geometric model. The color model used the difference of colors between shaded and nonshaded pixels. Cucchiara et al. [2] removed moving shadows by using the concept, in the HSV color space, that the hue component of shaded pixels would vary in a smaller range and the saturation component would decrease more obviously. Some researchers proposed shadow detection methods based on the RGB color space and the normalized-RGB color space. Yang et al. [3] described the ratio of intensities between a shaded pixel and its neighboring shaded pixel in the current image, and this intensity ratio was found to be close to that in the background image. They also made use of the slight change of intensities on the normalized R and G channels between the current and background image. Cavallaro et al. [4]
found that the color components would not change their orders and the photometric invariant features would have a small variation while shadows occurred. Besides the color model, the statistical model used probabilistic functions to determine whether or not a pixel belonged to the shadows. Zhang et al. [1] introduced an illumination invariance feature and then analyzed and modeled shadows as a Chi-square distribution. They classified each moving pixel into the shadow or foreground object by performing a significance test. Song and Tai [5] applied a Gaussian model to representing the constant RGB-color ratios, and determined whether a moving pixel belonged to the shadow or the foreground object by setting ±1.5 standard deviations as a threshold. Martel-Brisson and Zaccarin [6] proposed GMSM (Gaussian Mixture Shadow Model) for shadow detection, which was integrated into a background detection algorithm based on GMM. They tested if the mean of a distribution could describe a shaded region, and if so, they would select this distribution to update the corresponding Gaussian mixture shadow model.

The texture model assumed that the texture of the foreground object would be totally different from that of the background, and that the textures would be distributed uniformly inside the shaded region. Joshi and Papanikolopoulos [7, 8] proposed an algorithm that could learn and detect the shadows by using a support vector machine (SVM). They defined four features of images, including intensity ratio, color distortion, edge magnitude distortion, and edge gradient distortion. They introduced a cotraining architecture which could make two SVM classifiers help each other in the training process, and they would need a small set of labeled samples on shadows before training the SVM classifiers for different video sequences. Leone et al. [9] presented a shadow detection method by using Gabor features. Mohammed Ibrahim and Anupama [10] proposed their method by using division image analysis and projection histogram analysis. The image division operation was performed on the current and reference frames to highlight the homogeneous property of shadows. They afterwards eliminated the remaining pixels on the boundaries of shadows by using both column- and row-projection histogram analyses. The geometric model attempted to remove shadowed regions or the shadowing effect by observing the geometric information of objects. Hsieh et al. [11] used the histograms of vehicles and the calculated center of the lane to detect the lane markings, and also developed a horizontal and vertical line-based method to remove shadows by characteristics of those lane markings. This method might become ineffective in case of no lane markings.

As for some other approaches combining the mentioned models, Benedek and Szirányi [12] proposed a method based upon the LUV color model. They used the "darkening factor", the distortion of the U and V channels, and microstructural responses as the determinative features, where microstructural responses represented a local textured feature. Their proposed algorithm integrated all those features by using a Gaussian model and segmented foreground objects, backgrounds, and shadows by calculating their probabilities. Xiao et al. [13] proposed a shadow removal method based on edge information for traffic scenes, which in sequence consisted of an edge extraction technique, morphological operations for removing the edges of shadows, and an analysis of spatial properties for separating the occluded vehicles which resulted from the influences of shadows. They could reconstruct the size of each object and decide the practical regions of shadows. However, this method intrinsically could not deal with textured shadowed regions like the regions with lane markings, and more complicated cases of occlusions, like the concave shapes of vehicles, would make this method fail to separate vehicles by use of spatial properties.

The color information from color cameras might give fine results for shadow removal, but B/W (Black & White) cameras could provide better resolution and more sensitive quality under low-illumination conditions. That is the reason why B/W cameras rather than color cameras would be much more popular for outdoor applications. The shadow removal methods based on the color model might not work in such situations. Similarly, the texture model could give better results under unstable illumination conditions without the color information. However, the texture model might give the poorest performance for textureless objects. The geometric model could be more adaptive to specific scenes due to its dependency upon the geometric relations between objects and scenes, and has so far prevailingly been applied in simulated environments. Its biggest problem, the heavy computational loading, would obviously restrict the related uses in real-time cases. By considering all these methods, we proposed a fast moving shadow removal scheme by combining texture and statistical models. Our proposed method was experimentally proved to be stable, and it used the texture model instead of the color model to simplify our systematic procedures efficiently. Furthermore, we made use of statistical methods to improve the performance of the system by successfully dealing with textureless objects.

This paper is organized as follows, including the overall architecture, foreground object extraction, feature combination, experimental results on both our proposed moving shadow removal algorithm and the additional application (vehicle counting), and conclusions.

2. Our Systematic Architecture for Moving Shadow Removal

In this section, we introduce our entire architecture for the moving shadow removal algorithm, which consists of five blocks, including foreground object extraction, foreground-pixel extraction by edge-based shadow removal, foreground-pixel extraction by gray level-based shadow removal, feature combination, and the practical applications. The architecture diagram of our proposed algorithm is shown in Figure 1.

3. Foreground-Object Extraction

As Figure 1 showed, the sequence of images in gray level should be taken as the input of the foreground object
extraction processes, and the moving object with its minimum bounding rectangle as the output of the algorithm. We would incorporate the Gaussian Mixture Model (GMM) into our mechanism, functioning as the development of the background image, which is a representative approach to background subtraction. It would be more appropriate to choose background subtraction for extracting foreground objects rather than temporal difference, under the consideration of all the pros and cons of these two typical approaches. Furthermore, the latter could do a better job in extracting all relevant pixels, and this paper aimed at tackling the problems generated from traffic monitoring systems where cameras would usually be set fixedly. Some previous studies [14, 15] have proposed a standard process of background construction, hence we would put a higher premium on the following two parts, inclusive of foreground-pixel extraction by edge-based and by gray level-based shadow removal.

Figure 1: Architecture of the proposed shadow removal algorithm.

3.1. Foreground-Pixel Extraction by Edge-Based Shadow Removal. The main idea for extracting foreground pixels by the information of edges in the detected object was inspired by the observation that the edges of the object of interest would be identified and removed much more easily if the homogeneity in the shadow region could be kept within a small range of variance. Relatively, we could also obtain the edge features for nonshadow regions. The flowchart of foreground-pixel extraction by edge-based shadow removal is shown in Figure 2. More clearly, we firstly used Sobel operations to extract the edges for both the GMM-based background images and the foreground objects. Figures 3 and 4 show the results of edge extraction by Sobel operations, BI_edge and FO_edge_MBR, where BI_edge and FO_edge_MBR represent the edges extracted from background images and foreground objects, respectively. The subscript "MBR" means the minimum bounding rectangle, the only region inside which we have to process.

In order to avoid extracting undesired edges, for example, the edges of lane markings or textures on the ground surface, we technically took advantage of pixel-by-pixel max-operations on the extracted edges of background images and foreground objects, and the result is expressed as MI_edge_MBR in (1):

MI_edge_MBR(x, y) = max(FO_edge_MBR(x, y), BI_edge(x, y)),   (1)

where (x, y) represents the coordinate of the pixel, and one example of MI_edge_MBR is illustrated in Figure 5. Then, we subtracted BI_edge from MI_edge_MBR to obtain St_edge_MBR. Figure 6 shows the result of St_edge_MBR. Similarly, Figure 7 shows the same result without using our proposed procedure. It can be easily observed that the extracted edges of lane markings were indeed reduced when our proposed procedure was applied appropriately. Figure 8 indicates that our proposed procedure would also work well on images with textured roads. To demonstrate the necessity of our proposed procedure, we circled the edges of lane markings and textures on the ground surface with red ellipses in both Figures 7 and 8(f). By using the max-operation, apparently, we could reduce the effect caused either by lane markings or by textures on ground surfaces and also keep the homogeneous property inside the shadows.

After edge extraction, we used an adaptive binarization method to obtain the binary image from St_edge_MBR. Here, we took Sauvola's method [16, 17], one kind of local binarization method, to provide good results even in the condition of nonuniform luminance. Sauvola used an n x n mask, covering the image in each scanning iteration, to calculate the local mean m(x, y) and standard deviation s(x, y) of pixel intensities in the mask so as to determine a proper threshold according to the contrast in the local neighborhood of a pixel. If there is a high contrast in some region of the image, the condition s(x, y) ≈ R may result in t_final(x, y) ≈ m(x, y). To reduce the influence of unimportant edges, we added a suppression term to Sauvola's equation. Equation (2) shows the revised equation:

t_final(x, y) = m(x, y) [1 + k (s(x, y)/R − 1)] + Th_suppress,   (2)

where m(x, y) and s(x, y) are the mean and standard deviation of the mask centered at the pixel (x, y), respectively, R is the maximum value of the standard deviation (in a gray-level image, R = 128), k is a parameter with positive values in the range [0.2, 0.5], and Th_suppress is a suppression term whose value is set to 50 empirically in this paper.
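To make the edge-based feature construction concrete, the following is a minimal Python/NumPy/OpenCV sketch of the steps around (1) and (2): Sobel edge maps for the background image and the foreground object, the pixel-by-pixel max-operation, the subtraction of the background edges, and the revised Sauvola threshold with the suppression term (here also applied as the subsequent binarization step). This is not the authors' Borland C++ implementation; the function and variable names are ours, and the foreground mask from GMM background subtraction is assumed to be computed beforehand.

```python
# A minimal sketch of the edge-feature construction of Section 3.1.
# k, R, and Th_suppress follow the values quoted above; image names are placeholders.
import cv2
import numpy as np

def sobel_magnitude(gray):
    # Gradient magnitude used as the edge map for background and foreground.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    return cv2.magnitude(gx, gy)

def edge_feature(fo_gray, bg_gray, k=0.3, R=128.0, th_suppress=50.0, win=15):
    fo_edge = sobel_magnitude(fo_gray)          # FO_edge_MBR
    bg_edge = sobel_magnitude(bg_gray)          # BI_edge
    mi_edge = np.maximum(fo_edge, bg_edge)      # pixel-by-pixel max, as in (1)
    st_edge = mi_edge - bg_edge                 # subtract background edges -> St_edge_MBR

    # Local mean and standard deviation for the revised Sauvola threshold (2).
    mean = cv2.boxFilter(st_edge, cv2.CV_32F, (win, win))
    sq_mean = cv2.boxFilter(st_edge * st_edge, cv2.CV_32F, (win, win))
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    t_final = mean * (1.0 + k * (std / R - 1.0)) + th_suppress

    # Keep only edge responses above the local threshold (0/255 binary map).
    return np.where(st_edge > t_final, 255, 0).astype(np.uint8)
```

The window size (15 x 15) is an illustrative assumption; the paper only states that an n x n mask is scanned over the image.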
Figure 2: The flowchart of foreground-pixel extraction by edge-based shadow removal.
Figure 3: Sobel edge extraction from background images. (a) Background image; (b) BI_edge.
Figure 4: Sobel edge extraction from moving objects. (a) Foreground object; (b) FO_edge_MBR.
Figure 5: Max-operations and MI_edge_MBR.
Figure 6: The result of subtracting BI_edge from MI_edge_MBR.
Figure 7: The result of subtracting BI_edge from MI_edge_MBR.

We then applied t_final(x, y) to the following binarization step at location (x, y) according to (3) once t_final(x, y) had been obtained:

BinI_MBR(x, y) = 0 if St_edge_MBR(x, y) ≤ t_final(x, y), and 255 otherwise,   (3)

where BinI_MBR represents the result after binarization. The result of using the binarization method on St_edge_MBR is given in Figure 9. Another advantage of using adaptive binarization methods instead of taking a fixed threshold is that users would not have to manually set a proper threshold for each video scene, and that should be a significant factor for automatic monitoring systems.

We also had to remove the outer borders of BinI_MBR because of the following problems. In Figure 9, the shadow region and the real foreground object have the same motion vectors, which makes them always adjacent to each other. Also, the interior region of shadowed/foreground objects should be homogeneous/nonhomogeneous (nontextured or edgeless/dominantly edged), which implies that the edges from shadows would appear at the outer borders of foreground objects. Considering these two properties, the objective of removing shadows can be treated as eliminating the outer borders and preserving the remaining edges which belong to real foreground objects. Also, the latter property mentioned above might not always be satisfied: the interior region of shadows would sometimes be slightly textured (e.g., lane markings), like the example shown in Figure 10. We could solve this kind of problem by the procedures that we have mentioned earlier. From Figure 11(b), although the interior region of shadows had a few edge points after binarization, we could easily cope with these noise-like points by using a specific filter in our subsequent processing procedures.

As mentioned above, we used a 7 x 7 mask to achieve boundary elimination. According to what has been observed, the widths of edges detected in shadows are in fact very similar regardless of whether the edges are extracted from far or near perspectives in the foreground images, and the width of an edge is less than 3 pixels in most conditions. As Figure 12 illustrates, we put the green mask on the binarized edge points (marked in yellow) of BinI_MBR, and then scanned every point in BinI_MBR. If the region covered by the mask completely belongs to foreground objects (marked as white points), we reserve this point (marked in red); otherwise, we eliminate this point (marked in light blue). After applying the outer boundary elimination, we could obtain the features for nonshadow pixels, notated as Ft_Edgebased_MBR. In Figure 13, we show an actual example with Ft_Edgebased_MBR expressed as red points.
Figure 8: Examples of the ground surface with textures. (a) Foreground object; (b) FO_edge_MBR; (c) background image; (d) BI_edge; (e) St_edge_MBR; (f) subtraction of BI_edge from FO_edge_MBR.
Figure 9: Examples of BinI_MBR. (a) Binarization of Figure 6; (b) binarization of Figure 8(e).

3.2. Foreground-Pixel Extraction by Gray Level-Based Shadow Removal. We tried to integrate foreground-pixel extraction by a gray level-based approach into our shadow removal algorithm, and this novel arrangement could enhance and stabilize the performance obtained by the edge-based scheme alone. Figure 14 shows the flowchart of gray level-based shadow removal foreground-pixel extraction. Worthily speaking, we developed a modified "constant ratio" rule based on a Gaussian model. We selected some pixels belonging to shadow-potential regions from the foreground objects, calculated their darkening factors as training data, and built one Gaussian model for each gray level. Once the Gaussian models were trained, we could use them to determine whether each of the pixels inside the foreground objects belonged to the shadowed region or not.

Here we would simply introduce the original "constant ratio" rule so as to illustrate our modifications. Some studies [12, 18-20], using the property of "constant ratio" for shadow detection, expressed the intensity of each pixel at the coordinate (x, y) in terms of I(x, y) with (4):

I(x, y) = ∫ e(λ, x, y) ρ(λ, x, y) σ(λ) dλ,   (4)
Figure 10: Homogeneous property of shadows, example 1. (a) Foreground object; (b) BinI_MBR.
Figure 11: Homogeneous property of shadows, example 2. (a) Moving object; (b) BinI_MBR.

where λ is the wavelength parameter, e(λ, x, y) is the illumination function, ρ(λ, x, y) is the spectral reflectance, and σ(λ) is the sensitivity of the camera sensor. The term e(λ, x, y) indicates the difference between nonshadowed and shadowed regions. For backgrounds, the term e(λ, x, y) is composed of the direct and diffused-reflected light components, but in the shadowed area, e(λ, x, y) only contains the diffused-reflected light components. This difference implies the constant ratio property. Equation (5) shows the ratio of I^sh(x, y) and I^bg(x, y), where I^sh(x, y) and I^bg(x, y) represent the intensities of a shadowed pixel and of a nonshadowed background pixel, respectively, and α is called the darkening factor, which will be a constant over the whole image:

I^sh(x, y) / I^bg(x, y) = α.   (5)

3.2.1. Gaussian Darkening Factor Model Updating. As Figure 15 shows, we would rather simulate the darkening factor with respect to each gray level by one Gaussian model than with respect to each pixel. In the beginning, we would select the shadow-potential pixels as the updating data of the Gaussian models by the three predefined conditions introduced in the following.

(1) Pixels must belong to the foreground objects, for the shadowed pixels must ideally be part of the foregrounds.

(2) The intensity of a pixel (x, y) in the current frame should be smaller than that in the background frame, for the shadowed pixels must be darker than background pixels.

(3) The pixels obtained from the foreground-pixel extraction by edge-based shadow removal should be excluded, to reduce the number of pixels which might be classified as nonshadowed pixels.

For the practical case shown in Figure 16, the red pixels were the selected points for Gaussian model updating.

After the pixels for updating were selected, we would update the mean and standard deviation of the Gaussian model. Figure 17 displays the flowchart of the updating process of the Gaussian darkening factor model. The darkening factor α_k can be calculated as in (6):

I^selected(x, y) / I^bg(x, y) = α_k,   (6)

where I^selected(x, y) is the intensity of the selected pixel at (x, y), and I^bg(x, y) is the intensity of the background pixel at (x, y). After calculating the darkening factor, we would update the kth Gaussian model.
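A rough sketch of the per-gray-level darkening factor models is given below. The paper specifies one Gaussian per gray level, the ratio in (6), a minimum number of updates for stability, and a cap of 200 updates per model per frame (discussed just after this point); the running mean/variance (Welford) update and the concrete minimum count are our own choices, not the authors'.

```python
# A sketch, under stated assumptions, of the Gaussian darkening-factor models of Section 3.2.1.
import numpy as np

class DarkeningFactorModels:
    def __init__(self, min_updates=30, max_updates_per_frame=200):
        self.mean = np.zeros(256)
        self.m2 = np.zeros(256)            # running sum of squared deviations
        self.count = np.zeros(256, dtype=int)
        self.min_updates = min_updates     # assumed value; the paper only says "a threshold"
        self.max_per_frame = max_updates_per_frame
        self.frame_updates = np.zeros(256, dtype=int)

    def start_frame(self):
        self.frame_updates[:] = 0

    def update(self, i_selected, i_bg):
        k = int(i_bg)                              # one Gaussian per background gray level
        if i_bg == 0 or self.frame_updates[k] >= self.max_per_frame:
            return
        alpha = float(i_selected) / float(i_bg)    # darkening factor, as in (6)
        self.count[k] += 1
        self.frame_updates[k] += 1
        delta = alpha - self.mean[k]
        self.mean[k] += delta / self.count[k]      # Welford running mean
        self.m2[k] += delta * (alpha - self.mean[k])

    def stable(self, k):
        return self.count[k] >= self.min_updates

    def std(self, k):
        return np.sqrt(self.m2[k] / max(self.count[k] - 1, 1))
```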
Figure 12: Illustrations of boundary elimination. (a) Using the mask to scan BinI_MBR; (b) the mask has covered the nonforeground region; (c) the mask is inside the foreground region; (d) final result of outer border elimination.
Figure 13: An example of boundary elimination.

We set a threshold as a minimum number of updating times, and the updating times of each Gaussian model must exceed this threshold to ensure the stability of each model. Besides, in order to reduce the computational loading of the updating procedure, we set a limit that each Gaussian model could only be updated at most 200 times for one frame.

3.2.2. Determination of Non-Shadowed Pixels. Here, we introduce how to extract the nonshadowed pixels by using the trained Gaussian darkening factor model. Figure 18 gives the rule: calculate the difference between the mean of the Gaussian model and the darkening factor, and check whether the difference is smaller than 3 times the standard deviation. If yes, the pixel is classified as shadowed; otherwise, it is considered as a nonshadowed pixel and can be reserved as a feature point.

Figure 19 describes our tasks to determine the nonshadowed pixels. If the kth Gaussian model was not trained, we would check whether the nearby Gaussian models were marked as trained or not. In our programs, we selected the nearby 6 Gaussian models for checking, and we chose the nearest one if there existed any trained Gaussian model. Figure 20 gives an example where the pixels labeled in red are the extracted feature pixels after our determination task, and we denote the set of these pixels as Ft_DarkeningFactor_MBR.
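The determination step of Section 3.2.2 can then be sketched as follows, reusing the model class above. The ±3 gray-level search window is our interpretation of the "nearby 6 Gaussian models", and the decision to preserve a pixel when no trained model is found follows our reading of Figure 19; none of this is taken from the authors' code.

```python
# A sketch of the nonshadowed-pixel test of Section 3.2.2.
import numpy as np

def nonshadow_pixels(fg_gray, bg_gray, fg_mask, models):
    keep = np.zeros_like(fg_mask)
    ys, xs = np.nonzero(fg_mask)
    for y, x in zip(ys, xs):
        i_bg = int(bg_gray[y, x])
        if i_bg == 0:
            continue
        # Look for a stable Gaussian at this gray level or at the nearest nearby one.
        candidates = sorted(range(max(i_bg - 3, 0), min(i_bg + 4, 256)),
                            key=lambda c: abs(c - i_bg))
        k = next((c for c in candidates if models.stable(c)), None)
        if k is None:
            keep[y, x] = 255                 # no trained model: preserve the pixel (assumption)
            continue
        alpha = float(fg_gray[y, x]) / i_bg
        if abs(alpha - models.mean[k]) >= 3.0 * models.std(k):
            keep[y, x] = 255                 # outside 3 sigma: not a shadow pixel
    return keep                              # Ft_DarkeningFactor_MBR
```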
Figure 14: Flowchart of gray level-based shadow removal foreground-pixel extraction.
Figure 15: An illustrative figure: each gray level has a Gaussian model.

4. Feature Combination

We combined the two kinds of features which have been introduced in Section 3 in our algorithm to extract foreground objects in a more accurate way. Figure 21 exhibits the flowchart of our feature combination. We integrated these two features by applying "OR" operations. Figure 22 demonstrates a real example of images processed by our combined features, called the feature-integration images in this paper.

After acquiring the feature-integration images, we would locate the real foreground objects, namely, the foreground objects excluding the shadowed regions. Hence, we used the connected-component-labeling approach with the minimum bounding rectangle to orientate the real foreground objects. What is different from the common applications is that we perform some necessary preprocessing procedures before applying the connected-component-labeling procedure. The preprocessing consists of filtering and dilation operations; both the median filter operation and the morphological dilation operation are conducted just once. In Figure 23, we can see that there exist some odd points in the left part of the feature-integration image. This is obviously due to the influence of lane markings (see Figure 23(a)). In Figure 23(b), these pixels could be eliminated after the filtering procedure. We then applied dilation operations to the results after filtering in order to concentrate the remaining feature pixels, as shown in Figure 24.

After that, the connected-component-labeling approach could be applied on the dilated images. As for our defined rules for this procedure, we used the minimum bounding rectangle for each independent region. Then, if any two minimum bounding rectangles are close to each other, we merge these two rectangles. Finally, we iteratively check and merge the previous results till no rectangle can be merged.

After labeling and grouping, we would use the size filter to eliminate the minimum bounding rectangles of which the width and height were both smaller than a threshold, as depicted in Algorithm 1. The subscript "k" indicates the kth minimum bounding rectangle. Figure 25 shows some examples of the final located real objects; the green and light blue rectangles reveal the foreground objects and the final located real objects, respectively.

Algorithm 1:
    if {Width_MBR_k < Th_min_MBR_Width} AND {Height_MBR_k < Th_min_MBR_Height}
        Eliminate the kth minimum bounding rectangle;
    end

5. Experimental Results

In this section, we demonstrate the results of our proposed shadow removal algorithm. We implemented our algorithm on a PC platform with a P4 3.0 GHz CPU and 1 GB RAM. The software we used is Borland C++ Builder on Windows XP. All of the testing inputs are uncompressed AVI video files. The resolution of each video frame is 320 x 240. The average processing time is 13.84 milliseconds per frame.

5.1. Experimental Results of Our Shadow Removal Algorithm. In the following, we show our experimental results under no occlusion in different scenes. In comparison with the results without using our algorithm, we used "red" and "green" rectangles to indicate the detected objects after processing with and without applying our shadow removal algorithm, respectively. In Figure 26, we can see that the proposed algorithm indeed successfully detected the real objects and neutralized the negative influences of shadows, since the intensity of the shadowed regions was low enough to be distinguished from that of the backgrounds. Besides, our proposed algorithm could also cope with larger-scale shadows and provide satisfactory results for different sizes of vehicles such as trucks, as shown in Figures 26(d) and 26(f).
Figure 16: The selected pixels for Gaussian model updating. (a) Foreground object; (b) red points are the selected pixels.
Figure 17: Gaussian darkening factor model updating procedure.
Figure 18: Illustration of determination.

In Figure 27, we demonstrate the processed results for different kinds of shadows, such as smaller-scale shadows (shown in Figure 27(a)), low contrast between the intensities of shadows and backgrounds (shown in Figure 27(c)), and both smaller-scale and lowly contrastive shadows (shown in Figure 27(e)). Clearly, our proposed method could work robustly no matter how large and how observable the shadows would be. Also, in Figure 28, we give the testing results of another scene where motorcycles and riders were precisely detected (shown in Figures 28(c) and 28(d)). Figure 29 compares the results processed by our proposed algorithm with those by representative methods which have been accessible through the internet for the highway video sequences, and shows a much better processed result by our proposed algorithm (right columns).

5.1.1. Occlusions Caused by Shadows. Here, we demonstrate some examples of occlusions caused by the influences of shadows. In Figures 30(a), 30(e), 30(g) and 30(i), two vehicles (or a motorcycle and a vehicle) were detected in the same rectangle due to the influences of shadows, and Figure 30(c) indicates an even worse case in which vehicles were framed together on account of light shadows. From all the figures in the right column of Figure 30, it is apparent that our method could correctly detect the foreground objects under the influences of shadows. Moreover, we could still obtain the correct results (shown in Figure 30(j)) in the foreground object detection for a more difficult case (shown in Figure 30(i)) in which three shadowed regions were detected as one foreground object. That is to say, our proposed algorithm could handle the problems of occlusions caused by shadows, which have always been considered as tough tasks.

5.1.2. Discussions of the Gray Level-Based Method. In Section 3.2, we used darkening factors to enhance the performance and reliability of the proposed algorithm. We hence made comparisons of the experimental results obtained by applying and not applying our proposed approach mentioned in Section 3.2. Figure 31 shows a conspicuous example in which green/red rectangles represent the foreground objects/detected objects. Figures 31(a), 31(c), and 31(e) are the results without using the foreground-pixel-extraction approach by gray level-based shadow removal.
Figure 19: Flowchart of the nonshadowed pixel determination.
Figure 20: An example of gray level-based nonshadow foreground-pixel extraction. (a) Foreground object; (b) Ft_DarkeningFactor_MBR.
Figure 21: Flowchart of feature combination.
Figure 22: An example after integration. (a) Foreground object; (b) Ft_Edgebased_MBR; (c) Ft_DarkeningFactor_MBR; (d) feature-integration image.
Figure 23: Filtering by median filter for feature-integration images. (a) Feature-integration image; (b) after filtering.
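A condensed sketch of the feature-combination stage summarized in Figures 21-23 is given below: "OR" integration of the two feature maps, median filtering, dilation, connected-component labeling, and the size filter of Algorithm 1. The merging of nearby rectangles is omitted, and the numeric thresholds are illustrative assumptions rather than the authors' settings.

```python
# A sketch of the feature-combination and object-location stage (Figure 21).
import cv2
import numpy as np

def locate_objects(ft_edge, ft_dark, min_w=20, min_h=20):
    combined = cv2.bitwise_or(ft_edge, ft_dark)                   # feature integration
    combined = cv2.medianBlur(combined, 5)                        # remove isolated noise points
    combined = cv2.dilate(combined, np.ones((5, 5), np.uint8))    # concentrate feature pixels
    n, _, stats, _ = cv2.connectedComponentsWithStats(combined)   # labeling and grouping
    boxes = []
    for i in range(1, n):                                         # label 0 is the background
        x, y, w, h, _ = stats[i]
        if w >= min_w or h >= min_h:                              # size filter of Algorithm 1
            boxes.append((x, y, w, h))
    return boxes
```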


Figure 24: The result of dilation operations on Figure 23(b).

Table 1: Descriptions of testing videos for vehicle counting.
Testing video    Scene         Shadow description    Video FPS
Video1           Highway       Obvious and large     30
Video2           Highway       Light and large       30
Video3           Expressway    Obvious and large     25

In other words, the only feature for extracting foreground objects was the one obtained from the foreground-pixel-extraction approach by edge-based shadow removal that we introduced in Section 3.1. But this would bring about detection failures when the objects were edgeless (or textureless). As the images in the left column of Figure 31 show, the car roofs could be regarded as edgeless, which might result in the detected objects being in pieces or in only the rear bumper being detected. Figures 31(b), 31(d), and 31(f) exhibit the better results which were obtained from our introduced structure of feature combination, including the information of edge-based and gray level-based shadow removal.

5.2. Vehicle Counting. To prove the validity and versatility of the proposed approach, we applied our developed algorithm to vehicle counting, one of the popular applications in ITS. We had 3 testing videos for vehicle counting, and the scenes and properties of shadows in each video are arranged in Table 1. Figure 32 shows the scene of each video for vehicle counting. In order to illustrate the compared and statistical results in a more convenient way, we divided Video1 into 6 sectors, Video2 into 13 sectors, and Video3 into 2 sectors, respectively. Each of the sectors was about 2 minutes long. We had 4 lanes for both Video1 and Video2, and 2 lanes for Video3. In Table 2, we give the number of passing vehicles on each lane for each video, counted manually.

We calculated the accuracy rate for each sector by (7) to give the comparisons in a more reasonable manner:

Accuracy rate = [1 − |N_program − N_manual| / N_manual] x 100%,   (7)

where N_manual and N_program represent the numbers of vehicles obtained from manual counting and from our programs, respectively. As Table 3 shows, we obtain the compared consequences of average accuracy rates by using the foreground object-detection results with/without applying our proposed algorithm as the inputs in the experiments of vehicle counting. Table 4 indicates the average accuracy rate for each video over all lanes.

From the testing results, we also list two kinds of failed examples in Figure 33 to indicate some erroneous detection results in the vehicle counting application, which may be reasonable to illustrate the failed conditions in a more quantized manner. One possible condition which might result in false consequences comes from much more complicated textures within the detected shadows and is illustrated in Figure 33(a). Figure 33(a) reveals that the algorithm failed to provide a correct result in vehicle counting due to the overcomplicated edge information reflected in the detected shadows. The major reason can be easily observed: the shadow of some specific object was not successfully detected and eliminated. In fact, shadow detection and removal based only upon image processing has been a tough issue, since we did not try to make use of information from any other sensors besides cameras. Moreover, this paper aimed at developing a practical algorithm in image processing procedures to efficiently remove the shadowing effect before dealing with the applications of ITS, which would have less impact on the performance of shadow removal and make the influences dependent on the specific application. Owing to the rare appearance of such conditions in a longer recorded video, the detection error rate in counting vehicles could be kept within a satisfactory range. As for the other possible condition that might result in false detection results, Figure 33(b) illustrates this kind of example and also shows the false detection consequence in the vehicle counting application. This occlusion case for two cars caused the false counting result because the shadows were too unapparent to be correctly detected and the two detected objects were moving simultaneously. This kind of problem should be categorized into another research field of image processing, and it is not the issue that we have focused on in this paper. Since this phenomenon may result from many other conditions, such as the dynamic behavior of moving vehicles, different kinds of shadows, and influences of light reflection, we would rather concentrate on developing a more practical and automatic algorithm for shadow removal. These experimental results reveal that our proposed algorithm can not only work well in vehicle counting but also improve the performance of any application under the constraints of shadows.
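As a quick worked example of (7): if the manual count for a sector is 102 vehicles (the Video1/Lane1 figure of Table 2) and the program counts, say, 98, the accuracy rate is [1 − |98 − 102|/102] x 100% ≈ 96.1%. A one-line helper makes the definition explicit; the program count of 98 is hypothetical.

```python
# Accuracy rate of (7) for one sector.
def accuracy_rate(n_program, n_manual):
    return (1.0 - abs(n_program - n_manual) / n_manual) * 100.0

print(round(accuracy_rate(98, 102), 1))   # 96.1
```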
Figure 25: Examples of final located real objects.
Figure 26: Experimental results of foreground object detection.

Table 2: Number of passing vehicles in each lane.
Testing video    Partition number    Lane1 (vehicles)    Lane2 (vehicles)    Lane3 (vehicles)    Lane4 (vehicles)
Video1           6                   102                 189                 116                 89
Video2           13                  464                 505                 373                 261
Video3           2                   58                  75                  —                   —
Figure 27: Experimental results for different kinds of shadows.

Table 3: Vehicle counting results (average accuracy rate per lane).
Testing video    Compared method            Lane1     Lane2     Lane3     Lane4
Video1           Without shadow removal     81.58%    97.50%    96.57%    82.29%
Video1           With proposed algorithm    100%      99.02%    97.22%    100%
Video2           Without shadow removal     92.88%    96.27%    95.55%    89.51%
Video2           With proposed algorithm    97.59%    99.31%    99.68%    99.26%
Video3           Without shadow removal     95.16%    97.14%    —         —
Video3           With proposed algorithm    100%      100%      —         —
Figure 28: Experimental results of foreground object detection.
Figure 29: Experimental results of foreground object detection.
Figure 30: Experimental results under the occlusion situation.
Figure 31: Compared results of not applying and applying gray level-based shadow removal foreground-pixel extraction.
Figure 32: Scenes of videos for vehicle counting. (a) Video1; (b) Video2; (c) Video3.
Figure 33: Some failed examples of image frames for (a) the much-texture case and (b) the special occlusion case.
Table 4: Average accuracy of all lanes in each video.
Testing video    Compared method            Average accuracy rate
Video1           Without shadow removal     89.49%
Video1           With proposed algorithm    99.06%
Video2           Without shadow removal     93.55%
Video2           With proposed algorithm    98.96%
Video3           Without shadow removal     96.15%
Video3           With proposed algorithm    100%

6. Conclusions

We in this paper present a real-time and efficient moving shadow removal algorithm based on versatile uses of GMM, including the background removal and the development of features by Gaussian models. Our algorithm innovates by using the homogeneous property inside the shadowed regions, and it hierarchically detects the foreground objects by extracting the edge-based and gray level-based features and combining them. Our approach can be characterized by some original procedures such as "pixel-by-pixel maximization", subtraction of edges from background images in the corresponding regions, adaptive binarization, boundary elimination, the automatic selection mechanism for shadow-potential regions, and the Gaussian darkening factor model for each gray level.

Among all these proposed procedures, "pixel-by-pixel maximization" and subtraction of edges from background images in the corresponding regions deal with the problems which result from shadowed regions with edges. Adaptive binarization and boundary elimination are developed to extract the foreground pixels of nonshadowed regions. Most significantly, we propose the Gaussian darkening factor model for each gray level to extract nonshadow pixels from foreground objects by using the gray-level information, and we integrate all the useful features to locate the real objects without shadows. Finally, in comparison with the previous approaches, the experimental results show that our proposed algorithm can accurately detect and locate the foreground objects in different scenes and for various types of shadows. What is more, we apply the presented algorithm to vehicle counting to prove its capability and effectiveness. Our algorithm indeed improves the results of vehicle counting, and it is also verified to be efficient with its prompt processing speed.

Acknowledgment

This work was supported in part by the Aiming for the Top University Plan of National Chiao Tung University, the Ministry of Education, Taiwan, under Contract 99W962, and supported in part by the National Science Council, Taiwan, under Contracts NSC 99-3114-E-009-167 and NSC 98-2221-E-009-167.

References

[1] W. Zhang, X. Z. Fang, X. K. Yang, and Q. M. J. Wu, "Moving cast shadows detection using ratio edge," IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1202-1214, 2007.
[2] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts, and shadows in video streams," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337-1342, 2003.
[3] M.-T. Yang, K.-H. Lo, C.-C. Chiang, and W.-K. Tai, "Moving cast shadow detection by exploiting multiple cues," IET Image Processing, vol. 2, no. 2, pp. 95-104, 2008.
[4] A. Cavallaro, E. Salvador, and T. Ebrahimi, "Shadow-aware object-based video processing," IEE Proceedings: Vision, Image and Signal Processing, vol. 152, no. 4, pp. 398-406, 2005.
[5] K.-T. Song and J.-C. Tai, "Image-based traffic monitoring with shadow suppression," Proceedings of the IEEE, vol. 95, no. 2, pp. 413-426, 2007.
[6] N. Martel-Brisson and A. Zaccarin, "Learning and removing cast shadows through a multidistribution approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 7, pp. 1133-1146, 2007.
[7] A. J. Joshi and N. P. Papanikolopoulos, "Learning to detect moving shadows in dynamic environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 2055-2063, 2008.
[8] A. J. Joshi and N. Papanikolopoulos, "Learning of moving cast shadows for dynamic environments," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '08), pp. 987-992, May 2008.
[9] A. Leone, C. Distante, and F. Buccolieri, "A texture-based approach for shadow detection," in Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS '05), pp. 371-376, September 2005.
[10] M. Mohammed Ibrahim and R. Anupama, "Scene adaptive shadow detection algorithm," Proceedings of World Academy of Science, Engineering and Technology, vol. 2, pp. 1307-6884, 2005.
[11] J.-W. Hsieh, S.-H. Yu, Y.-S. Chen, and W.-F. Hu, "Automatic traffic surveillance system for vehicle tracking and classification," IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 2, pp. 179-187, 2006.
[12] C. Benedek and T. Szirányi, "Bayesian foreground and shadow detection in uncertain frame rate surveillance videos," IEEE Transactions on Image Processing, vol. 17, no. 4, pp. 608-621, 2008.
[13] M. Xiao, C.-Z. Han, and L. Zhang, "Moving shadow detection and removal for traffic sequences," International Journal of Automation and Computing, vol. 4, no. 1, pp. 38-46, 2007.
[14] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), vol. 2, pp. 246-252, June 1999.
[15] P. Suo and Y. Wang, "An improved adaptive background modeling algorithm based on Gaussian mixture model," in Proceedings of the 9th International Conference on Signal Processing (ICSP '08), pp. 1436-1439, October 2008.
[16] J. Sauvola and M. Pietikäinen, "Adaptive document image binarization," Pattern Recognition, vol. 33, no. 2, pp. 225-236, 2000.
[17] F. Shafait, D. Keysers, and T. M. Breuel, "Efficient implementation of local adaptive thresholding techniques using integral images," in Document Recognition and Retrieval XV, vol. 6815 of Proceedings of SPIE, San Jose, Calif, USA, January 2008.
[18] P. L. Rosin and T. Ellis, "Image difference threshold strategies and shadow detection," in Proceedings of the 6th British Machine Vision Conference, 1994.
[19] Y. Wang, K.-F. Loe, and J.-K. Wu, "A dynamic conditional random field model for foreground and shadow segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 279-289, 2006.
[20] I. Mikić, P. C. Cosman, G. T. Kogut, and M. M. Trivedi, "Moving shadow and object detection in traffic scenes," in Proceedings of the 15th International Conference on Pattern Recognition, vol. 1, pp. 321-324, 2000.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 837405, 18 pages
doi:10.1155/2010/837405

Research Article
Robust Tracking in Aerial Imagery Based on
an Ego-Motion Bayesian Model

Carlos R. del Blanco, Fernando Jaureguizar, and Narciso Garcı́a


Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain

Correspondence should be addressed to Carlos R. del Blanco, [email protected]

Received 23 November 2009; Revised 16 April 2010; Accepted 17 June 2010

Academic Editor: Yingzi Du

Copyright © 2010 Carlos R. del Blanco et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

A novel strategy for object tracking in aerial imagery is presented, which is able to deal with complex situations where the camera
ego-motion cannot be reliably estimated due to the aperture problem (related to low structured scenes), the strong ego-motion,
and/or the presence of independent moving objects. The proposed algorithm is based on a complex modeling of the dynamic
information, which simulates both the object and the camera dynamics to predict the putative object locations. In this model, the
camera dynamics is probabilistically formulated as a weighted set of affine transformations that represent possible camera ego-
motions. This dynamic model is used in a Particle Filter framework to distinguish the actual object location among the multiple
candidates, that result from complex cluttered backgrounds, and the presence of several moving objects. The proposed strategy
has been tested with the aerial FLIR AMCOM dataset, and its performance has been also compared with other tracking techniques
to demonstrate its efficiency.

1. Introduction

Object tracking is a fundamental task in a wide range of military and civilian applications, such as surveillance, traffic monitoring and management, security, and defense. In applications with static cameras, the tracking process aims to locate a specific object in each frame of a video sequence using geometric, appearance, and motion features of the object. The main problem arises from the fact that there can be several location candidates for the object per frame, due to the presence of background structures, and other foreground objects similar to the target object. Furthermore, several disturbance phenomena, such as illumination changes due to weather conditions (typical in outdoor applications), variations in the object appearance because of the camera point of view, and occlusions, prevent using the criteria "the most similar candidate is the most adequate one." In order to solve this problem, additional information is used to try to recover the actual object location among the set of possible candidates. Typically, this information is the object dynamics, which is used to select the candidate location closer to the predicted location according to the equation of the object dynamics. However, a dynamic model based on the object dynamics is only valid for tracking systems with static or quasistatic cameras.

In aerial imagery applications, the camera system is mounted on a moving aerial platform, such as a plane, a helicopter, or an Unmanned Aerial Vehicle (UAV). As a consequence, the camera is not stabilized, and the acquired video sequences undergo a random global motion, called ego-motion, that prevents the use of the object dynamics to predict the future object location, making the tracking a challenging task. The ego-motion problem has been addressed in different manners in the scientific literature. They can be split into two categories: approaches based on the assumption of low ego-motion, and those based on the ego-motion estimation.

Approaches assuming low ego-motion consider that the motion component due to the camera is not very significant in comparison with the object dynamics. Under this restriction, some recent works expect that the object maintains a spatiotemporal connectivity along the sequence [1-3]; that is, the image regions related to the object in consecutive frames are spatially overlapped, and then they perform the

tracking using morphological connected operators. In cases presence of independent moving objects provided that there
where the hypothesis about the spatiotemporal connectivity are enough detected features belonging to the background.
does not hold, the most common approach is to search for On the other hand, in situations in which the detection
the object in a bounded area centered in the location where of distinctive features is particularly complicated, because
it is expected to find the object according to its dynamics. the acquired images are low textured and structured, an
In [4, 5] an exhaustive search is performed in a fixed-size area-based image registration technique is used to estimate
image region, centered in the previous object location. In [6] the parameters of a global parametric model. In [18], a
the initial search location is estimated using a Kalman filter, perspective camera model is computed using an optical
and then the search is performed deterministically using the flow algorithm for the detection of moving objects in an
Mean Shift algorithm [7]. Other authors [8, 9] propose a application of aerial visual surveillance. The optical flow
stochastic search based on Particle Filtering that is able to algorithm is also used in [19] to estimate the parameters of a
deal with several possible location candidates, that is, local pseudo perspective camera model, which is utilized to create
maxima/minima resulting from the cost function used to panoramic image mosaics. The same approach is followed
perform the search. As the displacement induced by the in [20, 21] for a tracking application of terrestrial targets in
ego-motion increases, all these methods lose effectiveness. airborne FLIR imagery. Also, for the same type of imagery,
The reason is that the size of the search area must be a target detection framework is presented in [22, 23], which
larger to accommodate the expected camera ego-motion, minimizes SSDs- (Sum of Squares Differences-) based error
and therefore, the probability that the tracking is distracted measure to estimate an affine camera model. A similar
by false candidates dramatically increases. framework of camera motion compensation is used in [24]
On the other hand, approaches based on the ego- for tracking vehicles in aerial infrared imagery, but utilizing
motion estimation are able to deal with strong ego-motion a different minimization algorithm. In [25], the Inverse
situations, in which the motion component due to the Compositional Algorithm is used to obtain the parameters of
camera is quite more significant than the one corresponding an affine camera model for a tracking application of vehicles
to the object dynamics. Therefore, these approaches are in aerial imagery. The main problem associated with the
more suitable for aerial imagery applications, in which the area-based image registration techniques is that the presence
ego-motion causes large displacements between consecutive of independent moving objects can drift the ego-motion
frames. They aim to compute the camera ego-motion estimation, especially if their sizes are significant.
between consecutive frames in order to compensate it, and Also, a combination of both feature- and area-based
thus recovering the spatiotemporal correlation of the video methods has been proposed in [26] to improve the quality
sequence. In airborne imagery, the scene acquired by the of the camera compensation.
camera can be considered planar, since the depth relief of All the previous approaches, independently of the spe-
the objects in the scene is small enough compared to the cific camera ego-motion compensation technique used, have
average depth, and the field of view of the camera is also in common that they compute only one parametric model
small [10]. This allows to efficiently model the camera ego- to represent the ego-motion between consecutive frames.
motion by a global parametric model, typically an affine or However, in real applications, there may be many situations
projective geometric transformation, since the effect of the where the ego-motion cannot be accurately estimated, or
parallax (apparent displacement of an object caused by a even where the estimation could be completely wrong,
change in the location of the view point) is not significant. causing the tracking failure. These situations arise as a
The existing works differ in the image registration technique consequence of very low structured or textured scenes, where
used to compute the parameters of the affine or projective the high uncertainty, derived from the so-called aperture
transformation. A thorough review of image registration problem, makes almost impossible to compute the true ego-
techniques can be found in [11] for all kinds of vision-based motion. Also, the presence of independent moving objects,
applications. Another review focused on aerial imagery is especially if they take up large regions in the image, can drift
presented in [12]. the ego-motion estimation, since the assumption of only
On the one hand, feature-based image registration tech- one global motion, that is, the ego-motion, does not hold
niques detect and match distinctive image features between anymore.
consecutive frames to estimate a global parametric camera In this work, a novel approach for object tracking in
model. In [13], a detection and tracking system of moving airborne imagery undergoing strong camera ego-motion is
objects from a moving airborne platform is described, which proposed, which is able to deal with the aforementioned
uses a feature-based approach to estimate an affine camera complex situations in order to produce a robust tracking
model. In [14], the KLT method is used to infer a bilinear along the time. The tracking algorithm models both the
camera model in an application that detects moving objects camera and object dynamics to efficiently predict the most
from a mobile robot. In the field of Forward Looking probable object locations. The camera dynamics (i.e., the
InfraRed (FLIR) imagery, the works [15–17] describe a ego-motion) is probabilistically represented by a set of global
detection and tracking system of aerial targets from an parametric models, more specifically affine transformations,
airborne platform that uses a robust statistic framework to unlike the other approaches that only use one global para-
match edge features in order to estimate an affine camera metric model. This allows to consider several possible camera
model. This system is able to successfully handle situations ego-motions, which have the advantage to be more robust
in which the camera motion estimation is disturbed by the to the aforementioned aperture and independent moving

object problems. The dynamic information is combined with an appearance object model based on the detection of bright regions, which is a characteristic feature of the target objects in infrared imagery. Both appearance and dynamic models are managed by a Bayesian framework, which recursively computes the posterior probability density function (posterior pdf) of the object location. Since the resulting expression for the posterior pdf cannot be solved analytically, it is approximated by means of a Particle Filter technique [27] based on Monte Carlo simulation. Finally, an estimation of the object location is computed from the posterior pdf using a Gaussian-MMSE estimator [28], which is able to deal with situations in which the posterior pdf is clearly multimodal. In order to prove the efficiency and robustness of the proposed tracking algorithm, it has been tested on the AMCOM dataset, which is composed of a set of airborne FLIR sequences containing many challenging tracking situations involving terrestrial vehicles. Additionally, the proposed tracking algorithm has been compared with two different tracking approaches, also based on Particle Filtering, in order to demonstrate its superior robustness and reliability.

Although the paper is focused on aerial visual tracking, the proposed tracking framework can be used in other tracking applications, provided that the scene can be considered planar; that is, the effect of the parallax is not very significant.

The rest of the paper is organized as follows. Section 2 describes the proposed tracking Bayesian filter, which combines the object appearance model and the joint camera and object dynamic model to efficiently estimate the desired tracking information. The Particle Filtering approximation of the previous optimal, but not tractable, Bayesian filter is presented in Section 3. The estimation of the object location, based on the posterior pdf, is described in Section 4. Experimental results using the FLIR AMCOM dataset are exposed in Section 5, along with a comparison with other tracking approaches. Lastly, the conclusions are presented in Section 6.

2. Bayesian Tracking

The tracking task is modeled by means of a Bayesian filter that aims to estimate a state vector xk, containing the desired tracking information, that evolves over time using a sequence of noisy observations z1:k = {zi | i = 1, . . . , k} up to time k. The state vector xk = {dk, gk} contains the object dynamics (position and velocity over the image plane), dk, and the camera dynamics, gk. The observation zk at time step k contains the object location candidates, which are obtained as a result of the processing of the frame Ik.

The Bayesian filter approach calculates some degree of belief in the state xk at time k using the available prior information about the object and the camera and the set of observations z1:k. Therefore, the tracking problem can be formulated as the estimation of the posterior probability density function (posterior pdf) of the state of the object, p(xk | z1:k), which is recursively calculated by means of two stages: prediction and update. The prediction stage obtains the prior pdf of the state p(xk | z1:k−1) at time k via the Chapman-Kolmogorov equation:

p(xk | z1:k−1) = ∫ p(xk, xk−1 | z1:k−1) dxk−1 = ∫ p(xk | xk−1) p(xk−1 | z1:k−1) dxk−1,   (1)

where p(xk−1 | z1:k−1) is the posterior pdf at the previous time step, and p(xk | xk−1) is the state transition probability, which encodes the information about the object and camera dynamics. The object dynamics is modeled by the linear function

dk = M · dk−1,   (2)

where M is a matrix that represents a first-order linear system of constant velocity. This object dynamic model is a reasonable approximation for a wide range of object tracking applications, provided that the camera frame rate is high enough. The camera dynamics is modeled by an affine geometric transformation gk, which is a satisfactory approximation of the ideal projective camera model for the case of aerial imagery, since the depth relief of the objects in the scene is small enough compared to the average depth, and the field of view is also small [10]. Then, combining both models, the joint object and camera dynamics can be expressed as

dk = gk · M · dk−1,   (3)

which firstly predicts the object position and velocity according to the object dynamic model, and then rectifies them using the affine transformation to compensate the camera motion.

Based on this joint dynamic model, the transition probability p(xk | xk−1) can be expressed as

p(xk | xk−1) = p(dk, gk | dk−1, gk−1) = p(dk | dk−1, gk−1:k) p(gk | dk−1, gk−1) = p(dk | dk−1, gk) p(gk),   (4)

where it has been assumed that, on the one hand, the current object position is conditionally independent of the camera motion in the previous time step (as the proposed joint dynamic model states), and, on the other hand, the current camera motion is conditionally independent of both the camera motion and the object position in previous time steps. This last assumption results from the fact that the camera ego-motion is completely random, not following any specific pattern. The probability term p(dk | dk−1, gk) models the uncertainty of the proposed joint dynamic model as

p(dk | dk−1, gk) = N(dk; gk · M · dk−1, σtr²),   (5)

where N(x; μ, σ²) is a Gaussian or Normal distribution of mean μ and variance σ². Thus, the term σtr² represents the unknown disturbances of the joint dynamic model.
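As an illustration of how the joint dynamic model of (2)-(5) can be realized, the following sketch (Python/NumPy, not part of the original system) performs one prediction step. The state layout [x, y, vx, vy], the unit time step, and the application of the affine hypothesis to the velocity through its linear part only are assumptions of the sketch.

import numpy as np

def predict_object_state(d_prev, g, sigma_tr=2.0, rng=None):
    """One prediction step of the joint dynamic model d_k = g_k * M * d_{k-1} (eqs. (2)-(5)).

    d_prev   : (4,) array [x, y, vx, vy], object position and velocity on the image plane.
    g        : (3, 3) homogeneous affine matrix, one camera ego-motion hypothesis.
    sigma_tr : standard deviation of the Gaussian disturbance of the joint model (eq. (5)).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Constant-velocity matrix M (first-order linear system, unit time step assumed).
    M = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    d_pred = M @ d_prev                      # prediction from the object dynamics alone

    # Rectify the prediction with the affine ego-motion hypothesis: the position is warped
    # with the full affine transformation, the velocity only with its linear part (one
    # possible reading of d_k = g_k * M * d_{k-1}; an assumption of this sketch).
    pos = g @ np.array([d_pred[0], d_pred[1], 1.0])
    vel = g[:2, :2] @ d_pred[2:4]
    d_rect = np.array([pos[0], pos[1], vel[0], vel[1]])

    # Gaussian disturbance with variance sigma_tr^2 models the unknowns of the joint model.
    return d_rect + rng.normal(0.0, sigma_tr, size=4)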

The other probability term in (4), p(gk), expresses the probability that one specific geometric transformation represents the true camera ego-motion between consecutive time steps. For the ongoing tracking application, dealing with infrared imagery, the probability of a specific geometric transformation gk is based on the quality of the image alignment achieved by gk between consecutive frames. The quality of the image alignment is computed by means of the Mean Square Error function, mse(x, y), between the current frame Ik and the previous frame Ik−1 warped by the transformation gk. Thus, the probability p(gk) is mathematically expressed as

p(gk) = N(mse(Ik, gk · Ik−1); 0, σg²),   (6)

where N(x; μ, σ²) is a Gaussian distribution of mean μ and variance σ², and σg² is the expected variance of the image alignment process.

After the prediction stage, the update stage aims to reduce the uncertainty of the predicted p(xk | z1:k−1) using the new available observation zk (observations are available at discrete times) through Bayes' rule:

p(xk | z1:k) = p(zk | xk) p(xk | z1:k−1) / p(zk | z1:k−1),   (7)

where p(zk | xk) is the likelihood function that evaluates the degree of support of the observation zk to the predicted xk.

Finding an observation model for the likelihood p(zk | xk) in airborne infrared imagery that appropriately describes the object appearance and its variations along time is quite challenging due to the special characteristics of the infrared imagery (low signal-to-noise ratio, target objects low contrasted with the background, and nonrepeatability of the target signature), changes in illumination, variations in the 3D viewpoint, and changes in the object size along the sequence. The most robust and reliable object property is the presence of bright regions or, at least, regions that are brighter than their surrounding neighborhood, which typically correspond to the engine and exhaust areas of the object. Based on this fact, the likelihood function uses an observation model that aims to detect the main bright regions of the target. This is accomplished by a rotationally symmetric Laplacian of Gaussian (LoG) filter, characterized by a sigma parameter that is tuned to the lowest dimension of the object size, so that the filter response is maximum in the bright regions with a size similar to the tracked object. The main handicap of the observation model is its lack of distinctiveness, since whatever bright region with an adequate size can be the target object. As a consequence, the resulting LoG filter response is strongly multimodal. This fact, coupled with the camera ego-motion, dramatically complicates a reliable estimation of the state vector. This situation is illustrated in Figures 1 and 2. The first one, Figure 1, shows two consecutive frames, (a) and (b), of an infrared sequence acquired by an airborne camera, in which the target object has been enclosed by a rectangle. Figure 2 shows the LoG filter response related to Figure 1(b), where the image itself has been projected over the filter response for a better interpretation, in such a way that the upper left corner of Figure 1(b) corresponds with the origin of coordinates of Figure 2. The multimodality feature is clearly observed, and in theory any of the modes could be the right object position. Moreover, for this specific case, if only the object dynamics is considered, the closest mode to the predicted object location (marked by a vertical black line) is not the true object location, because of the effects of the camera ego-motion.

Based on the previous observation model, and assuming that zk is conditionally independent of gk given dk, the likelihood probability can be expressed as

p(zk | xk) = p(zk | dk, gk) = p(zk | dk) = N(zk; dk, σL²),   (8)

where zk is the LoG filter response of the frame Ik, and the variance σL is set to highlight the main modes of zk, while discarding the low significant ones. This is illustrated in Figure 3, where only the most significant modes of Figure 2 are highlighted.

The denominator of (7) is just a normalizing constant given by

p(zk | z1:k−1) = ∫ p(zk, xk | z1:k−1) dxk = ∫ p(zk | xk) p(xk | z1:k−1) dxk.   (9)

The initial pdf p(x0 | z0) ≡ p(x0), called the prior, is initialized as a Kronecker's delta function δ(x0) using the ground truth information. In a general case, p(x0) could be initialized as a Gaussian function using the information given by an object detector algorithm, as in [1, 2, 15–17, 22, 23].

In practice, the computation of the posterior pdf by means of the recursive (1) and (7) is not feasible, since the dynamic and observation models are nonlinear and non-Gaussian. As a result, the use of approximate inference methods is necessary. In the next section, a Particle Filtering strategy is presented to obtain an approximate solution of the posterior pdf.

3. Particle Filter Approximation

The optimal solution of the posterior pdf p(xk | z1:k), given by (7), cannot be determined analytically in practice, but it can be approximated using suboptimal methods. Particle Filtering is an approximate inference method based on Monte Carlo simulation for solving Bayesian filters. In contrast to other approximate inference methods, such as Extended Kalman Filters, Unscented Kalman Filters, and Hidden Markov Models, Particle Filtering is able to deal with continuous state spaces and nonlinear/non-Gaussian processes [29], conditions that arise in real tracking situations.
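The two model ingredients introduced in Section 2, the alignment-based camera-motion probability of (6) and the LoG-based observation of (8), can be sketched as follows. This is an illustrative Python/SciPy sketch under assumed conventions (coordinate ordering of the warp, the rule used to tune the LoG sigma, and the response normalization), not the authors' implementation.

import numpy as np
from scipy import ndimage

def ego_motion_weight(I_k, I_prev, g, sigma_g=0.03):
    """Unnormalized p(g_k) of eq. (6): a zero-mean Gaussian evaluated on the mean square
    error between the current frame and the previous frame warped by the affine sample g.
    The (row, col) convention and the use of the inverse map are assumptions of the sketch."""
    g_inv = np.linalg.inv(g)
    warped = ndimage.affine_transform(I_prev.astype(float), g_inv[:2, :2],
                                      offset=g_inv[:2, 2], order=1)
    mse = np.mean((I_k.astype(float) - warped) ** 2)
    return np.exp(-0.5 * (mse / sigma_g) ** 2)

def log_likelihood_map(I_k, object_min_dim):
    """LoG-based observation z_k used in eq. (8): a rotationally symmetric Laplacian of
    Gaussian filter tuned to the lowest object dimension, so that bright blobs of a
    similar size produce the strongest (multimodal) response."""
    sigma = object_min_dim / 2.0            # tuning rule assumed for this sketch
    response = -ndimage.gaussian_laplace(I_k.astype(float), sigma=sigma)
    response -= response.min()              # shift to nonnegative values
    return response / (response.sum() + 1e-12)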

(a) (b)

Figure 1: Two consecutive frames of an FLIR sequence acquired by an airborne camera.

Figure 2: Multimodal LoG filter response related to Figure 1(b). (The in-figure annotations mark the mode corresponding to the tracked object, the predicted object location according to the object dynamics, and the closest mode to that prediction, which corresponds to the background.)

Figure 3: Likelihood distribution related to Figure 2. (The in-figure annotations mark the mode corresponding to the tracked object and a mode corresponding to the background.)

The Particle Filter technique approximates p(xk | z1:k) by a set of NS-weighted random samples {xki, i = 1, . . . , NS} [27]:

p(xk | z1:k) ≈ (1/c) Σ_{i=1}^{NS} wki δ(xk − xki),   (10)

where the function δ(x) is Kronecker's delta, {wki, i = 1, . . . , NS} is the set of weights related to the samples, and c = Σ_{i=1}^{NS} wki is a normalization factor. As the number of samples becomes very large, this approximation becomes equivalent to the true posterior pdf.

Both samples xki and weights wki are obtained using the concept of importance sampling [27, 28], which aims to reduce the variance of the estimation given by (10) by means of a Monte Carlo simulation. The set of samples {xki, i = 1, . . . , NS} is drawn from a proposal distribution function q(xk | xk−1, zk), called the importance density. The optimal q(xk | xk−1, zk) should be proportional to p(xk | z1:k) and should have the same support (the support of a function is the set of points where the function is not zero), since in this case the variance is zero. But this is only a theoretical solution, since it would imply that p(xk | z1:k) is known. The approach followed in this paper is to approximate the importance density by the likelihood and the prior probability of the camera motion:

q(xk | xk−1, zk) = p(zk | dk) p(gk),   (11)

which is an efficient simplification of the optimal, but not tractable, importance density q(xk | xk−1, zk) = p(xk | xk−1, zk) [29].

The samples xki = {dik, gki} are drawn from the previous proposal distribution by a hierarchical sampling strategy. This firstly draws samples gki from p(gk) and then draws samples dik from p(zk | dk).

The sampling procedure for obtaining samples gki from p(gk) is based on a two-stage strategy, which firstly performs a fast, but rough, sampling of the affine space, and lastly improves the affine sampling by refining the samples with higher probability through a more expensive and accurate procedure. This two-stage strategy allows to efficiently obtain a probabilistic representation of the camera motion with a relatively low computational cost. Section 3.1 describes the sampling procedure in more detail.

The object dynamic samples dik are drawn from the likelihood p(zk | dk) (11), which is a convenient decision since the main modes of the posterior distribution also appear in the likelihood function. Sampling from the likelihood function is not a trivial task, since it is a bivariate function composed of narrow modes (see Figure 3). To deal with this issue, a Markov Chain Monte Carlo (MCMC) sampling method is proposed, which is able to efficiently represent the likelihood function by a reduced number of samples. Section 3.2 describes the MCMC sampling procedure in more detail.
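A minimal sketch of one Particle Filter update, combining the hierarchical sampling of the proposal (11) with the weighting of (12)-(13) given in the next page and the subsequent SIR resampling, is given below. The function names, the representation of dik as pixel locations, and the index-wise pairing of the new samples with the previous particles are assumptions of the sketch, not details fixed by the paper.

import numpy as np

def particle_filter_step(affine_samples, affine_weights, likelihood_map,
                         d_prev, joint_dynamic_pdf, rng=None):
    """One update of the Particle Filter of Section 3.

    affine_samples / affine_weights : discrete representation of p(g_k) (weights sum to 1).
    likelihood_map    : LoG response of the current frame, used as p(z_k | d_k) and as the
                        proposal for the object-dynamic samples (eq. (11)).
    d_prev            : (N, 2) object-dynamic samples (locations) of the previous step.
    joint_dynamic_pdf : callable p(d_k | d_{k-1}, g_k), e.g. the Gaussian of eq. (5).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(d_prev)                                   # number of particles N_S

    # 1) Draw ego-motion hypotheses g_k^i from the discrete representation of p(g_k).
    g_idx = rng.choice(len(affine_samples), size=n, p=affine_weights)

    # 2) Draw object-dynamic samples d_k^i from the likelihood p(z_k | d_k)
    #    (here: pixel locations drawn proportionally to the LoG response).
    flat = likelihood_map.ravel() / likelihood_map.sum()
    idx = rng.choice(flat.size, size=n, p=flat)
    rows, cols = np.unravel_index(idx, likelihood_map.shape)
    d_new = np.stack([cols, rows], axis=1).astype(float)

    # 3) Weights of eq. (13): w_k^i proportional to p(d_k^i | d_{k-1}^i, g_k^i)
    #    (previous weights are uniform after the SIR step of the last iteration).
    w = np.array([joint_dynamic_pdf(d_new[i], d_prev[i], affine_samples[g_idx[i]])
                  for i in range(n)])
    w = (w + 1e-12) / (w + 1e-12).sum()

    # 4) SIR resampling against the degeneracy problem: particles are drawn with a
    #    probability proportional to their weights and end up equally weighted.
    keep = rng.choice(n, size=n, p=w)
    return d_new[keep], g_idx[keep]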

Figure 4: Particle Filtering-based approximation of the posterior probability p(xk | z1:k).

Figure 5: SIR resampling of p(xk | z1:k).

Once the samples xki = {dik, gki} have been obtained, the weights wki are computed by [29]

wki = wki−1 · p(zk | xki) p(xki | xki−1) / q(xki | xki−1, zk).   (12)

Using the likelihood, transition, and importance density probabilities, this expression can be simplified as

wki = wki−1 · p(zk | dik) p(dik | dik−1, gki) p(gki) / [ p(zk | dik) p(gki) ] = wki−1 p(dik | dik−1, gki).   (13)

According to this expression, the samples that best fit with the joint camera and object dynamic model will have more relevance than the rest.

The importance sampling principle has a serious drawback, called the degeneracy problem [27], whereby only one weight retains a significant value after a few iterations, while the rest of the weights become negligible. In order to overcome this problem, a resampling step is applied to reduce the degeneracy problem. This is accomplished by means of the Sampling Importance Resampling (SIR) algorithm, which selects more times the samples with higher weights, while the ones with an insignificant weight are discarded. After SIR resampling, all the samples have the same weight.

Figures 4 and 5 show the estimated posterior probability, p(xk | z1:k), and the result of applying the SIR resampling, respectively. Notice that the samples corresponding to modes related to background structures have a lower weight than the ones related to the tracked object, due to the coherence with the expected camera and object dynamics. As a result, the estimated posterior pdf concentrates all the meaningful samples in the target object region.

3.1. Sampling of the Affine Space. The sampling procedure for obtaining samples gki from p(gk) is based on a two-stage strategy, which firstly draws a set of affine transformation samples that represent a rough estimation of p(gk) and then refines the sampling by improving the accuracy of the samples with higher weight using a complex algorithm.

The goal of the first stage is to compute with a low computational cost a set of affine transformation samples, which represent a rough approximation of the underlying p(gk). The algorithm is based on a fast uniform sampling that uses the available prior knowledge for bounding the range of possible affine parameters and for estimating an appropriate sampling step. For the purpose of bounding the range of affine parameters, a subset of the video sequences used to test the proposed tracking algorithm have been used as training set, in order to analyze the set of the expected camera motions. These sequences belong to the infrared AMCOM dataset (see Section 5) and have been acquired by different infrared cameras on board an aerial platform. The camera motion estimation in this training set has been supervised by a user to accurately and reliably obtain the actual camera motion. The resulting analysis reveals that the most significant motions are translations, which can reach a value close to half of the image size for some extreme situations. On the contrary, the magnitude of the scale, rotation, and shear transformations is much less significant, close to the identity matrix transformation. On the other hand, the choice of the sampling step depends on the capability of the whole sampling procedure to converge to the actual affine transformation given an initial affine transformation sample. Regarding the convergence, the sampling step should be small to ensure that at least the distance in the affine space between one sample and the actual affine transformation that represents the camera motion is short enough. But considering the computational cost, the sampling step should be as large as possible. The convergence capability has been experimentally measured by synthetically warping an image by different affine transformations of increasing magnitude, until the convergence to the actual camera motion is not possible. In addition, since the convergence capability depends on the scene structure, this process has been performed with a set of different images belonging to several sequences of the AMCOM dataset. As a result, the sampling step for the translation components must be less than 8 pixels, while for the rest of motion components a unique sample is enough, which assumes no

scale, rotation, and/or shear distortion, since the sampling their associated weights, given by the values of consistency of
procedure satisfactorily achieves the convergence to the real the scene structure, are a rough estimation of p(gk ).
affine parameters for camera motions that take place in the Figure 7 shows the weights of the affine transformations
AMCOM video sequences. Taking into account the previous tik used to roughly approximate the camera motion proba-
sampling guidelines, the initial set of affine transformation bility p(gk ) between two consecutive time steps. The weights
samples has the form are arranged in the same way of the previous grid of initial
transformations and are encoded with a color scale. In this
⎡  i ⎤
case, the maximum weight corresponds with t74 k .
⎢1 0 tx k ⎥
⎢  i ⎥
The second stage refines the previous rough estimation
tik =⎢
⎢0 1 t y ⎥,
⎥ i = 1, . . . , NS , (14) of p(gk ) by means of an image registration algorithm
⎣ k⎦ presented in [32]. This method assumes an initial geometric
0 0 1 transformation tik and then uses the whole image intensity
information to compute a global affine transformation gki ,
where (tx )ik and (t y )ik are the translation components, with which is an improved estimation of the camera motion.
a sampling step less than 8 pixels. For the ongoing tracking This method explicitly accounts for global variations in
application, the sampling step has been fixed to 5, which is image intensities to be robust to illumination changes. To
a good tradeoff between accuracy and computational cost. reduce the computational cost, only the samples tik with
Note that the rest of affine parameters of tik are equivalent to higher probability are used to improve the estimation of
the identity matrix, meaning that there is no scale, rotation, p(gk ). Finally, the set of affine transformations {gki | i =
and shear warping with respect to the previous image frame, 1, . . . , NS } is obtained by means of an SIR resampling,
since, as stated before, the whole sampling procedure can which makes a random selection of the affine transformation
satisfactorily deal with these kinds of distortions in the samples according to their weights. The resulting set of
AMCOM dataset. Figure 6 shows the initial set of affine affine transformations is an accurate approximation of the
transformations, {tik , i = 1, . . . , 441}, arranged in a 21 × 21 underlying camera motion probability.
grid. An alternative approach to the SIR resampling could be
The set of initial affine transformations {tik | i = to select the sample with the highest weight, since it should
1, . . . , NS } are evaluated by checking the consistency of represent the most accurate camera motion. In this case, the
the scene structure between the current image and the sampling procedure would be equivalent to an optimization
compensated one, that is, the previous image warped by the approach based on an stochastic search, since only the
affine transformation sample under evaluation. Two images best sample is used. However, the statement “the highest
have a similar scene structure when their image edges have p(gki ) corresponds with the most accurate camera motion
a similar shape and spatial arrangement, indicating that estimation” is not always true. For example, in situations
they are closely aligned. The scene structure of an image is with independent moving objects, the camera ego-motion
characterized by a set of shape descriptors, called extended estimation can be biased by the moving objects. Also, a
shape contexts (E-SCs). The shape context descriptor was poor estimation is obtained when the effects of the aperture
originally proposed by Belongie et al. [30] for recognizing 2D problem [33, 34] are quite significant. As a consequence,
and 3D objects in low clutter situations. Mori and Malik [31] in both situations the actual camera motion could be
proposed an extended version of the shape context, the E-SC, represented by one gki with a probability value lower than the
to achieve a greater robustness to the clutter. The first step to one with the maximum probability value. For this reason,
evaluate the consistency of the scene structure between the a probabilistic representation of the camera motion based
previous image warped by the affine transformation under on discrete samples is more efficient than a deterministic
evaluation, tik · Ik−1 , and the current image, Ik , consists in approach that estimates the best transformation.
computing the most relevant edges of both images using the
Canny algorithm. A uniform random sampling of the edge 3.2. MCMC Sampling of the Likelihood Function. The object
locations of Ik is carried out, and then an E-SC descriptor dynamic samples dik are drawn from the likelihood p(zk | dk )
is computed in each sampled location. Both the set of (11) to finally obtain xki = {dik , gki }. This is a convenient
E-SC descriptors and their spatial distribution define the decision since the main modes of the posterior distribution
scene structure. Another set of E-SC descriptors is computed also appear in the likelihood function. Sampling from the
using the detected edges in Ik−1 . The locations of the E- likelihood function is not a trivial task, since it is a bivariate
SC descriptors are the same as those of Ik , but warped function composed by narrow modes (see Figure 3). To
by the transformation tik under evaluation. This approach deal with this issue, a Markov Chain Monte Carlo (MCMC)
is computationally much more efficient than warping the sampling method is proposed, which is able to efficiently
whole image Ik−1 using tik . The similarity of both sets of represent the likelihood function by a reduced number
descriptors is measured by computing the Bhattacharyya of samples. The MCMC approach generates a sequence
distance between corresponding E-SC descriptors. The con- of samples {dik , i = 1, . . . , NS } by means of a Markov
sistency of the scene structure is then obtained by summing Chain, in such a way that the stationary distribution is
the contributions of all the distances. A low value of the exactly the target distribution. The Metropolis-Hasting [28,
consistency of the scene structure means that Ik and tik · Ik−1 35] algorithm is an MCMC method that uses a proposal
are roughly aligned. The samples {tik | i = 1, . . . , NS } and distribution for simulating such a chain. The appropriate


Figure 6: Initial set of affine transformation samples used to roughly approximate p(gk ).
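The initial grid of translation-only affine samples of (14), as depicted in Figure 6, could be generated as in the following sketch. The grid extent is an assumption chosen to reproduce the 21 × 21 = 441 samples referenced in Figure 7, together with the 5-pixel step reported in the text.

import numpy as np

def initial_translation_grid(t_max=50, step=5):
    """Initial set of affine transformation samples t_k^i (eq. (14), Figure 6):
    translation-only matrices on a regular grid, with identity scale/rotation/shear.
    With t_max=50 and step=5 this yields a 21 x 21 grid, i.e. 441 samples."""
    samples = []
    for ty in np.arange(-t_max, t_max + 1, step):
        for tx in np.arange(-t_max, t_max + 1, step):
            t = np.array([[1.0, 0.0, tx],
                          [0.0, 1.0, ty],
                          [0.0, 0.0, 1.0]])
            samples.append(t)
    return samples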

Figure 7: Weights of {tik, i = 1, . . . , 441}, which roughly approximate p(gk).

Figure 9: Result of applying the Gaussian kernel over p(xk | z1:k) (depicted in Figure 4), along with the final state estimation x̂k marked by a black circle.

Figure 10: Tracked object accurately enclosed by a white rectangle.

Figure 8: Metropolis-Hasting sampling of the likelihood distribution depicted in Figure 2.

selection of the proposal distribution is the key for the efficient sampling of the target distribution. For the case of the likelihood p(zk | dk) sampling, a Gaussian function, with mean zero and a variance proportional to the lowest size dimension of the tracked object, has proven to be efficient. Another fundamental issue is the initialization of the Markov Chain. Since the likelihood function concentrates almost all the probability in a few sparse regions of the state space (i.e., in its sparse narrow modes), the Markov Chain needs a large amount of samples to correctly simulate it. A more efficient approach is to use a set of Markov Chains, with different initialization states given by the main local maxima of the likelihood distribution. In this way, the likelihood is efficiently simulated by a reduced number of samples located on the main modes.
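A compact sketch of this multi-chain Metropolis-Hasting sampling is given below (Python/NumPy). The number of steps per chain and the exact proposal deviation are assumptions; the paper only states that the proposal is a zero-mean Gaussian with a deviation proportional to the lowest object dimension and that one chain is started on each main local maximum of the likelihood.

import numpy as np

def metropolis_hastings_likelihood_samples(likelihood_map, start_points,
                                           n_steps=100, proposal_sigma=3.0, rng=None):
    """Sampling of the narrow, multimodal likelihood p(z_k | d_k) with several
    Metropolis-Hasting chains, each initialized on one of its main local maxima."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = likelihood_map.shape

    def lik(p):
        r, c = int(round(p[0])), int(round(p[1]))
        return likelihood_map[r, c] if (0 <= r < h and 0 <= c < w) else 0.0

    samples = []
    for start in start_points:                     # one chain per main likelihood mode
        current = np.asarray(start, dtype=float)
        for _ in range(n_steps):
            candidate = current + rng.normal(0.0, proposal_sigma, size=2)
            # Symmetric Gaussian proposal, so the acceptance ratio is the likelihood ratio.
            a = lik(candidate) / max(lik(current), 1e-12)
            if rng.uniform() < min(1.0, a):
                current = candidate
            samples.append(current.copy())
    return np.array(samples)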

(a) Frame 15 (b) Frame 16 (c) Likelihood p(zk | xk ) (d) Sampling

Figure 11: Common intermediate results for all the three tracking algorithms in a situation of strong ego-motion.


(a) Probability values of gki before SIR (b) Sampled posterior probability (c) Gaussian-MMSE-based estimation (d) Object tracking result

Figure 12: Tracking results for the BEH algorithm in a situation of strong ego-motion.

Figure 8 shows the result of applying the proposed Metropolis-Hasting sampling algorithm to simulate the likelihood distribution depicted in Figure 3. The samples have been marked with circles. Notice that the samples lie on the main modes of the likelihood distribution, in spite of the relatively low number of used samples.

4. State Estimation

The estimated posterior pdf, p(xk | z1:k), embodies all the available statistical information, allowing the computation of an optimal estimation of the state of the object x̂k. In general terms, the resulting posterior probability can be quasi-unimodal (if there is only one significant mode) or multimodal. This fact depends on the distance between the mode corresponding to the tracked object and the modes relative to the background in the likelihood function. While for the case of a quasi-unimodal posterior probability the state estimation can be efficiently performed by means of the MMSE estimator, for the case of a multimodal posterior probability the MMSE estimator does not produce a satisfactory estimation, since the background modes bias the result. To avoid such a bias in the estimation, the MMSE estimator should only use the samples relative to the tracked object mode, discarding the rest. This is achieved by means of a bivariate Gaussian kernel N(x; μe, Σe) of mean μe and covariance matrix Σe [28], which gives more relevance to the samples located close to the Gaussian mean. In this way, when the Gaussian mean is centered over the tracked object mode, only the samples related to this mode will have a significant value. The proposed Gaussian-MMSE estimator is mathematically expressed as

x̂k = max_{l=1,...,NS} ( (1/NS) Σ_{i=1}^{NS} N(xki; xkl, Σe) p(xki | z1:k) ),   (15)

where the covariance matrix Σe determines the bandwidth of the Gaussian kernel, which must be coherent with the size of the tracked object mode. Taking into account the relationship between the size of the tracked object mode and the bandwidth of the LoG filter used in the object detection (Section 2), which in turn was set according to the object size, an efficient covariance matrix can be estimated as

Σe = [ sx/2   0 ; 0   sy/2 ],   (16)

where sx and sy are the width and height of the object, respectively, which are the same parameters as the ones used in the LoG-based object detector.

Figure 9 shows the result of applying the Gaussian kernel over p(xk | z1:k), along with the maximum corresponding to the final estimation x̂k, which has been marked by a black circle. Figure 10 shows the tracked object accurately enclosed by a white rectangle corresponding to the estimated x̂k.
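A minimal sketch of this Gaussian-MMSE estimator of (15)-(16) is given below. The kernel normalization constant is dropped since it does not affect the location of the maximum, and the interface (sample locations plus their posterior values) is an assumption of the sketch.

import numpy as np

def gaussian_mmse_estimate(samples, posterior, sx, sy):
    """Gaussian-MMSE state estimation of eqs. (15)-(16): the posterior samples are
    smoothed with a bivariate Gaussian kernel with Sigma_e = diag(sx/2, sy/2), and the
    sample location that maximizes the smoothed value is returned."""
    samples = np.asarray(samples, dtype=float)      # (N, 2) sample locations x_k^i
    posterior = np.asarray(posterior, dtype=float)  # p(x_k^i | z_1:k), one value per sample
    n = len(samples)
    inv_cov = np.diag([2.0 / sx, 2.0 / sy])         # inverse of Sigma_e

    best_value, best_location = -np.inf, samples[0]
    for l in range(n):
        diff = samples - samples[l]                 # kernel centered on candidate x_k^l
        # Unnormalized Gaussian kernel (the constant is irrelevant for the argmax).
        kernel = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv_cov, diff))
        value = np.sum(kernel * posterior) / n
        if value > best_value:
            best_value, best_location = value, samples[l]
    return best_location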
5. Results

The proposed object tracking algorithm has been tested using the AMCOM dataset. This consists of 40 infrared

(a) Sampled posterior probability (b) Gaussian-MMSE-based estimation (c) Object tracking result

Figure 13: Tracking results for the DEH algorithm in a situation of strong ego-motion.

(a) Sampled posterior probability (b) Gaussian-MMSE based estimation (c) Object tracking result

Figure 14: Tracking results for the NEH algorithm in a situation of strong ego-motion.

(a) Frames 168–172 (b) Likelihood p(zk | xk) (c) Sampling

Figure 15: Common intermediate results for all the three tracking algorithms in a situation where the ego-motion compensation is especially
challenging due to the aperture problem.

(a) Probability values of gki before SIR (b) Sampled posterior probability (c) Gaussian-MMSE estimation (d) Object tracking result

Figure 16: Tracking results for the BEH algorithm in a situation where the ego-motion compensation is especially challenging due to the
aperture problem.

sequences acquired from a camera mounted on an airborne The algorithm presented in this paper uses a Bayesian model
platform. A variety of moving and stationary terrestrial for the ego-motion, and it is called tracking with Bayesian
targets can be found in two different wavelengths: mid- ego-motion handling (BEH). The second algorithm is based
wave (3 μm–5 μm) and long-wave (8 μm–12 μm). In general, on a deterministic modeling and is referred to as tracking
the tracking task is quite challenging in this dataset due with deterministic ego-motion handling (DEH). It models
to the strong camera ego motion, the magnification and the ego-motion by only one affine transformation, which is
pose variations of the target signatures, and the own equivalent to express p(gk ) by a Kronecker’s delta centered
characteristics of the FLIR imagery described in Section 2. in gkd , an affine transformation deterministically computed
In addition, the proposed object tracking algorithm through the image registration algorithm described in [32].
has been compared with other two tracking algorithms to The last algorithm, referred to as tracking with no ego-
prove its superior performance using the same AMCOM motion handling (NEH), has not an explicit model for the
dataset. These both algorithms are inspired on the existing camera ego-motion, which leads to a simplified expression
works [8, 9], which also use a Particle Filter framework of the state transition probability:
for the tracking, making easier and fairer to compare the
performance of all the three algorithms. The three algorithms
differ in the way they tackle the ego-motion: Bayesian  
modeling, deterministic modeling and not explicit modeling. p(xk | xk−1 ) = N dk ; M · dk−1 , σtr2 , (17)

(a) Sampled posterior probability (b) Gaussian-MMSE estimation (c) Object tracking result

Figure 17: Tracking results for the DEH algorithm in a situation where the ego-motion compensation is especially challenging due to the
aperture problem.

where the value of the parameter σtr2 should be larger than of strong ego-motion. Figure 11 shows the common inter-
that of BEH and DEH algorithms to try to alleviate the ego- mediate results for all the three algorithms. Figures 11(a)
motion effect. and 11(b) show two consecutive frames that have undergone
With the purpose of making a fair comparison, the same a large displacement, in which the target object has been
number of samples has been used for the three algorithms: enclosed by a black rectangle as visual aid. Figure 11(c) shows
NS = 300. This number is enough to ensure a satisfactory the multimodal likelihood function, and lastly, Figure 11(d)
approximation of the state posterior probability given the shows the resulting Metropolis-Hasting based sampling,
specific characteristics of the AMCOM dataset. In the same where each sample has been marked by a black circle.
way, the same value has been chosen for σtr2 = 2 for the BEH Figure 12 shows the tracking results for the BEH
and DEH algorithms, while a value of σtr2 = 4 has been chosen algorithm. The probability values of gki (the estimated affine
for NEH algorithm, in order to alleviate its lack of an explicit transformations) before the SIR resampling are shown in
ego-motion model and make it comparable with the other Figure 12(a), which have been arranged in a rectangular
algorithms. The BEH algorithm needs an extra parameter grid, in a similar way to Figure 5. The probability values are
which has been heuristically set to σg2 = 0.03, offering good displayed using a color scale. Notice that there is a peak in the
results for the given AMCOM dataset. However, other values middle left side, indicating that the camera has undergone
with a variation less than the 15 percent have also offered a strong right translation motion. Figure 12(b) shows the
similar results. sampled posterior probability, where the samples dik with
In the two following subsections, two different tracking higher weights are correctly located over the target object,
situations are evaluated to demonstrate the higher per- thanks to the Bayesian treatment of the camera ego-motion.
formance of the BEH algorithm in complex ego-motion Figure 12(c) shows the result of applying the Gaussian kernel
situations. The last subsection presents the overall tracking over the sampled posterior probability, which is used by
results for each of three algorithms using the aforementioned the Gaussian-MMSE estimator to compute the final state
AMCOM dataset. estimation (marked as a black circle). Finally, Figure 12(d)
shows the target object satisfactorily enclosed by white
5.1. Strong Ego-Motion Situation. The BEH algorithm has rectangle, whose coordinates are determined by the state
been compared with the DEH and NEH ones for a situation estimation. Observe that the infrared image is projected over

(a) Sampled posterior probability (b) Gaussian-MMSE estimation (c) Object tracking result

Figure 18: Tracking results for the NEH algorithm in a situation where the ego-motion compensation is especially challenging due to the
aperture problem.

the X-Y plane of each probability distribution as visual transformations) before the SIR resampling are shown in
aid. the first column, which have been arranged in a rectangular
Figures 13 and 14 show the tracking results for the grid, in a similar way to Figure 5. The probability values
DEH and NEH algorithms, respectively. Notice that the are displayed using a color scale. Notice that there is not a
tracking fails in both cases, since the dynamic model does well-defined peak, unlike the strong ego-motion situation
not correctly represent the camera and object dynamics, (Figure 12(a)), but there is a set of affine transformation
and consequently the tracking drifts to another mode of candidates with similar probability values, meaning that
the likelihood function. In the case of DEH algorithm, whatever of them could be the true camera motion. The
this fact can be checked by observing that the estimated affine transformations with higher probability value are
affine transformation corresponds to the one located in the located in the horizontal direction, indicating that the
coordinates (11, 11) of Figure 12(a), which has a probability aperture problem is especially significant in that direction. In
value much lower than the one related to the true camera other words, the horizontal translation of the camera motion
motion. cannot be reliably computed between consecutive frames.
The second column of Figure 16 shows the sampled posterior
5.2. High Uncertainty Ego-Motion Situation. A comparison probability related to each frame. Notice that there are several
or the tracking performance of all the three algorithms samples with high weights that are not located over the
(BEH, DEH, and NEH) is presented for a situation where target object, as a consequence of the high uncertainty in
the ego-motion estimation is especially challenging due to the camera ego-motion estimation. However, the majority
the aperture problem (the frames are very low-textured). of samples that have a high weight are located over the
Figure 15 shows common intermediate results, in which the target object, allowing to track it satisfactorily. This fact
first column shows five consecutive frames, where the target can be verified by observing the two last columns, which,
object has been enclosed by a black rectangle as visual aid. respectively, show the Gaussian-MMSE estimation and the
The last two rows show the resulting multimodal likelihood tracking result, where the target object has been satisfactorily
function and the Metropolis-Hasting based sampling for enclosed by a rectangle (whose coordinates are determined
each frame, respectively. by the state estimation).
Figure 16 shows the tracking results for the BEH Figures 17 and 18 show the tracking results for the DEH
algorithm. The probability values of gki (the estimated affine and NEH algorithms, arranged in the same way of Figure 16.

Observe that the tracking fails in the frame 172 for the DEH Bayesian approach for situations in which it is possible
algorithm, and also in the frame 171 for NEH algorithm. to reliably compute the camera motion. Nonetheless, the
These failures arise from the accumulation of slight errors improvement is insignificant, and in addition, the BEH
in the estimation of the object location, which, in turn, algorithm is able to cope with a wider range of situations than
are caused by the poor characterization of the camera ego- the rest.
motion. There is one situation in which none of the three
algorithms can ensure a correct tracking. This situation arises
5.3. Global Tracking Results. Finally, the global results about when the likelihood distribution has false modes very close
the performance of the BEH, DEH, and NEH algorithms to the true one (corresponding to the tracked object), and
using the sequences of the AMCOM dataset are shown in the apparent motion of the tracked object is very low. Under
Table 1. The table is divided into two sections, showing these circumstances, the tracker can be locked on a false
the tracking results for long-wave and mid-wave infrared mode.
imagery, respectively. The first two columns show the As regards the type of infrared imagery, long and mid
sequence name and the target name, respectively. The third, wave, the tracking results do not show any appreciable
fourth, and fifth columns show the first frame, the last frame, difference between them. Theoretically, mid-wave infrared
and the number of consecutive frames in which the target imagery is better to detect and track objects with hot
appears. The remaining columns show the performance spots, arising from working engines and exhaust pipes, since
of the BEH, DEH, and NEH algorithms, measured as the the target-background contrast is greater. However, if the
number of tracking failures and the tracking accuracy. The terrestrial vehicles are not working, and therefore they are
number of failures indicates the number of times that the at room temperature, the long-wave infrared imagery is
target object has been lost. An object is considered to be preferable, since the target-background contrast is much
lost when the rectangle that encloses the object according greater. Anyway, the AMCOM dataset is not oriented to
to the ground truth and the rectangle resulting from the examine these kinds of differences, since each sequence is
tracking estimation do not overlap each other. The tracking only acquired in a specific wave range, and therefore, a
accuracy has been defined as the average Euclidean distance thorough comparison is not possible. Regarding the tracking
between the object locations (centers of the corresponding performance, the only condition is that there exists an
rectangles) of the ground truth and the tracking estimation. appreciable contrast between the target and the background,
Therefore, the accuracy will be better when its value is since the proposed Bayesian framework is able to handle the
less. It is important to note that the algorithms are not clutter (background regions with similar infrared signature
reinitialized with ground truth data in case of tracking failure to the target object) by means of the coherence between each
(object lost), since one of the more appealing advantages of object region candidate and the object and camera dynamics.
the Particle Filter framework is its capability of recovering In order to provide a better understanding of the results
from tracking failures thanks to the handling of multiple presented in Table 1, the following website http://www.gti
hypotheses (or samples). This also affects the tracking .ssr.upm.es/paper/RobustTracking/ has been built, which
accuracy, since all the erroneous object locations, derived contains the object tracking results along with the ground
from tracking failures, have been taken into account in its truth for all the sequences. In addition, all the intermediate
estimation. Therefore, the sequences with a lot of tracking results (likelihood probability, MCMC sampling, probability
failures will have a much worse tracking accuracy. values of the affine transformations, posterior probability,
From the analysis of the number of tracking failures, it and Gaussian-MMSE estimation) are also available, which
can be summarized that there are 11 situations (sequences) in are useful to comprehend the obtained tracking results.
which the BEH algorithm outperforms the DEH one and 16
situations in which the BEH algorithm outperforms the NEH
one. Regarding the DEH algorithm, there are 11 situations in 6. Conclusions
which it outperforms the NEH one, and 3 situations in which
it outperforms BEH algorithm. Lastly, there are 4 situations A novel strategy for object tracking in aerial imagery is
in which the NEH algorithm outperforms the DEH one and presented, which is able to deal with complex situations
none situation in which it outperforms the BEH algorithm. in which the ego-motion cannot be reliably estimated. The
In the rest of situations, the performance is similar for all proposed algorithm uses a complex dynamic model that
the three algorithms. To sum up, the BEH algorithm is the combines the object and camera dynamics to predict the
best of all, and the DEH algorithm is better than NEH possible object locations. A probabilistic formulation is used
one, as was expected. The errors obtained by the DEH and to represent the camera dynamics by a set of affine trans-
NEH algorithms arise from the poor characterization of the formations, each one corresponding to a possible camera
camera ego-motion, which is satisfactorily solved by the BEH ego-motion. Using this robust model to encode the dynamic
algorithm. information, the tracking algorithm is able to distinguish the
The results about the tracking accuracy follow the same actual object location among multiples candidates, derived
trend. An interesting fact happens when the ego-motion from the appearance model of the object. This approach
is quite low: the tracking accuracy of the DEH algorithm has been proven to be very robust not only in situations
is slightly better than BEH one. The reason is that a with strong ego-motion but also in those situations in
deterministic approach introduces less uncertainty than a which the ego-motion cannot be accurately estimated due
Table 1: Comparison of the tracking performance of the three algorithms BEH, DEH, and NEH using the AMCOM dataset. See the text for a detailed description of the results.
Sequence name | Target | First frame | Last frame | No. of frames | BEH (No. of failures, Tracking accuracy) | DEH (No. of failures, Tracking accuracy) | NEH (No. of failures, Tracking accuracy)
Long-wave infrared imagery
L19NSS M60 1 57 57 15 6.19 0 4.10 14 6.78
L19NSS mantruck 86 100 15 0 7.33 2 12.40 2 12.30
L19NSS mantruck 211 274 64 0 4.90 0 6.69 0 4.98
L1415S Mantruck 1 280 280 8 14.14 10 14.84 25 16.32
L1607S mantrk 225 409 185 0 3.49 0 3.56 0 4.78
L1808S apc1 1 79 79 0 2.11 0 2.19 0 1.99
L1808S M60 1 289 289 0 2.95 0 2.74 1 3.73
L1808S mantrk 193 289 97 0 3.93 0 3.38 48 9.09
L1618S apc1 1 290 290 0 13.73 0 18.90 0 16.57
L1618S M60 1 100 100 0 3.51 0 1.87 0 2.35
L1701S Bradley 1 370 370 0 6.54 0 10.89 0 8.21

L1701S pickup(trk) 1 30 30 0 1.99 0 1.99 0 1.70


L1702S Mantruck 113 179 67 0 3.27 8 6.18 0 4.56
L1702S Mantruck 631 697 67 0 2.57 0 2.43 0 2.59
L1720S target 1 34 34 0 2.91 0 2.60 4 4.99
L1720S M60 43 777 735 15 6.59 0 5.80 0 4.39
L605S apc1 1 86 86 0 1.41 0 1.52 0 1.48
L605S M60 615 641 27 0 4.88 0 4.80 19 13.15
L605S tank1 614 734 125 0 2.14 0 2.17 163 65.73
L1812S M60 72 157 86 0 2.53 0 2.52 0 2.71
L1813S apc1 1 167 167 0 2.68 0 2.64 0 2.84
L1817S-1 M60 1 193 193 0 4.46 0 3.41 0 4.18
L1817S-2 M60 1 189 189 0 4.93 0 4.43 0 4.75
L1818S apc1 21 112 92 32 6.01 72 10.22 72 26.05
L1818S M60 81 202 122 60 10.55 119 17.79 118 85.25
L1818S tank1 151 364 214 119 35.89 213 54.29 213 55.47
L1906S Mantruck 1 203 203 0 8.47 0 5.58 0 8.21
L1910S apc1 56 129 74 0 10.70 0 11.47 0 11.15
L1911S apc1 1 164 164 0 8.15 0 9.19 0 8.75
L1913S apc1 1 264 264 25 6.79 262 29.93 262 27.69
L1913S M60 182 264 61 0 10.42 0 9.19 98 15.70
L1918S tank1 26 259 234 15 7.03 20 8.98 144 9.46
L2018S tank1 1 447 447 0 3.77 0 4.52 4 4.32
L2104S bradley 1 320 320 51 5.60 319 34.39 319 43.82
L2104S tank1 69 759 691 235 16.88 629 56.99 673 48.33
L2208S apc1 1 379 379 0 5.88 0 5.40 0 5.53
L2312S apc1 1 367 367 0 2.11 0 1.70 0 2.10
Table 1: Continued.
Sequence name | Target | First frame | Last frame | No. of frames | BEH failures | BEH accuracy | DEH failures | DEH accuracy | NEH failures | NEH accuracy
Mid-wave infrared imagery
M1406S Bradley 1 379 379 0 1.84 0 1.92 0 2.72
M1407S Bradley 1 399 399 0 4.83 0 5.77 0 4.37
M1410S tank1 1 497 497 0 7.45 0 5.25 0 3.91
M1413S Mantruck 1 379 379 2 7.44 1 7.73 0 6.67
M1415S Mantruck 1 10 10 0 3.14 2 6.68 0 3.09
M1415S Mantruck 15 527 513 7 5.42 512 62.30 512 65.16
to the aperture problem, strong camera motion, and/or the presence of independent moving objects. In these cases, it clearly outperforms other tracking approaches based on a deterministic ego-motion compensation or even without explicit compensation. The experimental results, performed with the AMCOM dataset, support this conclusion.

Acknowledgments

This work has been partially supported by the Comunidad de Madrid under project S-0505/TIC-0223 (Pro-Multidis) and by the Ministerio de Ciencia e Innovacion of the Spanish Government under project TEC2007-67764 (SmartVision).
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 583918, 9 pages
doi:10.1155/2010/583918

Research Article
Covariance Tracking via Geometric Particle Filtering

Yunpeng Liu,1, 2, 3, 4 Guangwei Li,5 and Zelin Shi1, 3


1 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
2 Graduate School of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of Optical-Electronics Information Processing, Chinese Academy of Science, Shenyang 110016, China
4 Key Laboratory of Image Understanding and Computer Vision, Liaoning Province 110016, China
5 Management Science and Engineering Department, Qingdao University, Qingdao 266071, China

Correspondence should be addressed to Yunpeng Liu, [email protected]

Received 30 November 2009; Accepted 24 June 2010

Academic Editor: Yingzi Du

Copyright © 2010 Yunpeng Liu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The recently proposed region covariance descriptor has been shown to be a robust and elegant way to describe a region of interest and has been applied to visual tracking. We develop a geometric method for visual tracking in which region covariance is used to model the object's appearance; tracking is then performed by implementing the particle filter under the constraint that the system state lies on a low-dimensional manifold, the affine Lie group. The sequential Bayesian updating consists of drawing state samples while moving on the manifold geodesics; the region covariance is updated using a novel approach in a Riemannian space. Our main contribution is developing a general particle-filtering-based tracking algorithm that explicitly takes the geometry of affine Lie groups into consideration in deriving the state equation on Lie groups. Theoretic analysis and experimental evaluations demonstrate the promise and effectiveness of the proposed tracking method.

1. Introduction sensitive to background clutter, occlusion, and quick moving


objects. These problems can be mitigated by stochastic
Visual tracking in an image sequence, which is now an active methods which maintain multiple hypotheses in the state
area of research in computer vision, is widely applied to space and in this way, achieve more robustness to the
vision guidance, surveillance, robotic navigation, human- local maximum. Among various stochastic methods, particle
computer interaction, and so forth. Dynamic deformation of filters [5–10] are very successful. Particle filters provide
object is a distinct problem in image-based tracking. a robust tracking framework as they are neither limited
Conventional correlation-based trackers [1, 2] use either to linear systems nor require the noise to be Gaussian.
a region’s gray information or edges and other features Particle filters simultaneously track multiple hypotheses and
as the target signatures, but it is difficult to solve the recursively approximate the posterior probability density
problem of object region deformation in the tracking. function in the state space with a set of random sampled
Over the last 10 years, numerous approaches [3–10] have particles.
been proposed to address this problem. The main idea of Many papers, such as [5–10] utilize particle filter method
them is molding geometric parameter models for the image to track deformable target. They use affine transform as
motions of points within a target region. The parameter parameter model, and the six affine parameters were treated
models including affine model, projective model, or other as a vector. However, the affine parameters belong to spaces
nonlinear models. The classic Lucas-Kanade tracker [3, which are not vector spaces, but instead a curved Lie group.
4] and Meanshift tracker [5] get the model parameters In general, the system state of the particle filter lies in
through gradient descent which minimizes the difference a constrained subspace whose dimension is much lower
between the template and the current region of the image. than the whole space dimension. Only a few recent papers
These methods are computationally efficient. However, the have tried to use the geometry of the manifold to design
methods may converge to a local maximum, they are Bayesian filtering algorithms [11, 12]. However, there is little
discussion in the literature using the intrinsic geometry of


manifold to develop particle filter-based tracking algorithms.
Object representation is one of major components for a Tx (G)
typical visual tracker. Extensive researches have been done
on this topic. Recently Tuzel et al. [13, 14] proposed an
elegant and simple solution to integrate multiple features. In expx Δ
this method, covariance matrix was employed to represent
the target. Using a covariance matrix to represent the target x
(region covariance descriptor) has many advantages: (1)
it embodies both spatial and statistical properties of the
objects; (2) it provides an elegant solution to fuse multiple
features and modalities; (3) it has a very low-dimensionality;
(4) it is capable of comparing regions without being
G
restricted to a constant window size; and (5) the estimation
of the covariance matrix can be easily implemented.
In this paper, we integrate covariance descriptor into Figure 1: Riemannian Exponential Mapping.
Mont Carlo technique for visual tracking, study the geometry
structure of affine Lie groups, and propose a tracking
algorithm through particle filtering on manifolds, which
by an integral over norms of tangents [17]. The curve
implement the particle filter with the constraint that the
with minimum length is known as the geodesic and the
system state lies in a low dimensional manifold, The
length of the geodesic is the intrinsic distance. Parameter
sequential Bayesian updating consists drawing state samples
spaces occurring in computer vision problems usually have
while moving on the manifold geodesics; this provides
well-studied geometries and closed form formulae for the
a smooth prior for the state space change. The regions
intrinsic distance are available. Tangents and geodesics are
covariance matrices are updated using a novel approach in
closely related. For each tangent Δ ∈ Tx , there is a unique
a Riemannian space. Theoretic analysis and experimental
geodesic starting at x with initial velocity Δ. The exponential
results shows the promise and effectiveness of the approach
map, expx maps Δ to the point on the manifold reached by
proposed.
this geodesic.
The paper is organized as follows. In Section 2, The
A Lie group is a group with the structure of an analytic
mathematical background is described. Section 3 shows the
manifold such that the group operations are analytic, that is
object regions descriptor and the new update solution for
the maps
those descriptors. Section 4 describes the tracking algorithm
using geometric particle filtering. Results on real image G × G −→ G (X, Y ) −→ XY ,
sequences for evaluating algorithm performance are dis- (1)
cussed in Section 5.Section 6 concludes this paper. G −→ G X −→ X −1 ,

are analytic [15]. The local neighborhood of any group


2. Manifold and Lie Group element G can be adequately described by its tangent-space.
The tangent-space at the identity element forms its Lie
The tools used here come primarily from differential geom- algebra.
etry. For more information on these subjects, the reader is The set of nonsingular n × n square matrices forms a
referred to [15, 16]. Lie group where the group product is modeled by matrix
A manifold is a topological space that is locally similar to multiplication, usually denoted by GL(n, R) for the general
an Euclidean space. Intuitively, we can think of a manifold as linear group of the order n. Lie groups are differentiable
a continuous surface lying in a higher dimensional Euclidean manifolds on which we can do calculus.
space. Analytic manifolds satisfy some further conditions of In our task, we use affine transformation as parameter
smoothness [16]. From now onwards, we restrict ourselves model. The set of all affine transformation forms a matrix
to analytic manifolds and by manifold we mean analytic Lie group.
manifold.
The tangent space, Tx at x, is the plane tangent to the
3. Region Covariance Descriptor
surface of the manifold at that point. The tangent space can
be thought of as the set of allowable velocities for a point Let I be the observed image with size of W × H, and F be
constrained to move on the manifold. For d-dimensional W × H × d dimensional feature image extracted from I
manifolds, the tangent space is a d-dimensional vector space.    
An example of a two-dimensional manifold embedded in F x, y = φ I, x, y , (2)
R3 with the tangent space Tx is shown in Figure 1. The
solid arrow Δ is a tangent at x. The distance between two where φ can be any mapping such as color, gradients, filter
points on the manifold is given in terms of the lengths of responses, and so forth. Let {zk }k=1···n be the d-dimensional
curves between them. The length of any curve is defined feature points inside a given rectangular region. The region
is represented by the d × d covariance matrix of the feature points

$$C_R = \frac{1}{n}\sum_{k=1}^{n}\left(z_k-\mu\right)\left(z_k-\mu\right)^{T}, \qquad (3)$$

where n is the number of pixels in the region and μ is the mean of the feature points.
In our task, we define φ as

$$\phi(I,x,y) = \begin{bmatrix} x & y & I & I_x & I_y & \sqrt{I_x^{2}+I_y^{2}} \end{bmatrix}, \qquad (4)$$

where x and y are the pixel location in R, I is the gray value, and I_x and I_y are the first derivatives of I. In this way, the region R is mapped into a 6 × 6 covariance matrix.

In a tracking process, the object's appearance changes over time. This dynamic behavior requires a robust temporal update of the region covariance descriptors and the definition of a dissimilarity metric for the region covariance. The important questions here are how to measure the dissimilarity between two region covariance matrices and how to update the region covariance matrix in the next time slot. Note that the covariance matrices do not lie in a Euclidean space; for example, the space is not closed under multiplication with negative scalars. So, it is necessary to measure the dissimilarity between two covariance matrices in a different space. To overcome this problem, a Riemannian manifold is used.

3.1. Dissimilarity Metric. The dissimilarity between two region covariance matrices can be given by the distance between two points of the manifold M, considering that those points are the two regions.
The covariance matrix, which is a symmetric positive definite matrix, forms a Riemannian manifold. According to [14], we define a Riemannian metric as

$$\langle y, z\rangle_X = \operatorname{tr}\left(X^{-1/2}\, y\, X^{-1} z\, X^{-1/2}\right). \qquad (5)$$

The exponential map associated to the above Riemannian metric is

$$\exp_X(y) = X^{1/2}\exp\left(X^{-1/2}\, y\, X^{-1/2}\right)X^{1/2}. \qquad (6)$$

By (6), we can obtain the logarithm map

$$y = \log_X(Y) = X^{1/2}\log\left(X^{-1/2}\, Y\, X^{-1/2}\right)X^{1/2}. \qquad (7)$$

Substituting (7) into (5) gives

$$d^{2}(X,Y) = \left\|y\right\|_X^{2} = \langle y, y\rangle_X = \left\langle \log_X(Y), \log_X(Y)\right\rangle_X = \operatorname{tr}\left(\log^{2}\left(X^{-1/2}\, Y\, X^{-1/2}\right)\right). \qquad (8)$$

Furthermore, (8) is equivalent to

$$d(X,Y) = \sqrt{\sum_{k=1}^{d}\log^{2}\lambda_k(X,Y)}, \qquad (9)$$

where λ_k are the generalized eigenvalues of X and Y.

3.2. Covariance Update. A solution for the covariance matrix update was proposed in [14] that is based on the estimation of the mean of points on a Riemannian manifold, where each point corresponds to a covariance matrix. This mean estimation is obtained using a gradient descent approach. In this paper, we propose a novel solution for the covariance matrix update that is based on the mean of the new covariance matrix and the last updated covariance. If y is the velocity that takes us from X to Y, then y/2 takes us halfway, to the point C. Using (6) and (7), we have

$$C = X^{1/2}\exp\left(X^{-1/2}\,\frac{y}{2}\,X^{-1/2}\right)X^{1/2} = X^{1/2}\exp\left(\frac{1}{2}\log\left(X^{-1/2}\, Y\, X^{-1/2}\right)\right)X^{1/2} = X^{1/2}\left(X^{-1/2}\, Y\, X^{-1/2}\right)^{1/2} X^{1/2}, \qquad (10)$$

where C is the mean of the two points on the Riemannian manifold (the updated covariance matrix). This update means that the present covariance is more important than the previous covariances. Since we are tracking objects that can change over time, the latest information about them is more reliable.

4. Tracking Model

The visual tracking problem is cast as an inference task in a Markov model with hidden state variables. The state variable S_t describes the affine parameters of the target at time t. Given a set of observed images I_{1:t} = {I_1, ..., I_t}, we aim to estimate the value of the hidden state variable S_t. Using the Bayesian theorem, we have the familiar result

$$p(S_t \mid I_{1:t-1}) = \int p(S_t \mid S_{t-1})\, p(S_{t-1} \mid I_{1:t-1})\, dS_{t-1}, \qquad (11)$$

$$p(S_t \mid I_{1:t}) = \frac{p(I_t \mid S_t)\, p(S_t \mid I_{1:t-1})}{p(I_t \mid I_{1:t-1})}. \qquad (12)$$

Equation (11) is called the prediction equation and (12) is called the update equation. The tracking process is governed by the observation model p(I_t | S_t), where we estimate the likelihood of S_t observing I_t, and the dynamical model between two states p(S_t | S_{t-1}).

4.1. Dynamical Model. The dynamical model, also known as the state transition model, describes the transition of the object state in the tracking process. In visual tracking problems, it is ideal to have an exact state transition model; in practice, however, approximate models are used. The deformation and location of a target object in an image can be represented by an affine transform. In this work, the state at time t consists of the six parameters of an affine transformation. A 2-D affine transformation of the image can be written as

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} r_1 & r_2 \\ r_3 & r_4 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} r_5 \\ r_6 \end{bmatrix}, \qquad (13)$$
where (x, y) and (x', y') denote the locations of the corresponding points between the two images, $\begin{bmatrix} r_1 & r_2 \\ r_3 & r_4 \end{bmatrix}$ is a 2 × 2 nonsingular matrix, $\begin{bmatrix} r_5 \\ r_6 \end{bmatrix}$ is the translation vector, and (r_1, r_2, r_3, r_4, r_5, r_6) denotes the affine transformation parameters. The transformation can be expressed in homogeneous coordinates as

$$A(r) = \begin{bmatrix} r_1 & r_2 & r_5 \\ r_3 & r_4 & r_6 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (14)$$

A(r) specifies the displacement between S_{t-1} and S_t. We define V_t as the velocity between S_{t-1} and S_t, which specifies the motion. These definitions are analogous to the vector space case in that the velocities are determined by the tangent vectors along geodesics connecting the observed points (S_t). Then the state transition model is of the following form:

$$S_t = S_{t-1}\exp(V_{t-1}), \qquad (15)$$

$$V_t = V_{t-1} + \mu_{t-1}, \qquad (16)$$

where S_{1:t} is a discrete-time trajectory on the six-dimensional affine Lie group, V_{1:t} is a velocity on the corresponding Lie algebra, and μ_{1:t} is a Gaussian white zero-mean stochastic process.
The tracking algorithm will not require the explicit functional form of the prior density; it will depend only on the samples generated from the prior density. In a Markovian time-series analysis, there is often a standard characterization of a time-varying posterior density in a convenient recursive form. This characterization relates an underlying Markov process to its observations at each observation time via a pair of state transition equations. The following algorithm specifies a procedure to sample from the conditional prior p(S_t | S_{t-1}).

Algorithm 1. For some t = 2, 3, ..., we are given the values for S^i_{t-1} and V^i_{t-2}. For i = 1, 2, ..., M:
(1) Generate a sample of V^i_{t-1}, given V^i_{t-2}, according to (16).
(2) For each sample of V^i_{t-1}, calculate S^i_t according to S^i_t = S^i_{t-1} exp(V^i_{t-1}).

Algorithm 1 consists in drawing state samples while moving on the manifold geodesics. This geodesic sampling gives a dynamics-based smoothing prior on the state transition space. Figure 2 is an illustration of this geodesic sampling process.

Figure 2: Drawing state samples moving on the geodesics.

4.2. Observation Model. Next, we specify the probability model for the observed images. p(I_t | S_t) is the likelihood of the observation I_t under the state S_t:

$$p(I_t \mid S_t) \propto \exp\left(-\lambda\, d^{2}\left(C^{*}, C_{S_t}\right)\right), \qquad (17)$$

where C* denotes the covariance features of the template image and C_{S_t} denotes the covariance features at the transformation S_t.

4.3. Sequential Monte Carlo Approach. The Monte Carlo idea is to approximate the posterior density of S_t by a large number of samples drawn from it. Having obtained the samples, any estimate of S_t (MMSE, MAP, etc.) can be approximated using sample averages.
A recursive formulation, which takes samples from p(S_{t-1} | I_{1:t-1}) and generates the samples from p(S_t | I_{1:t}) in an efficient fashion, is desirable. We accomplish this task using ideas from sequential methods and importance sampling. Assume that, at observation time t - 1, we have a set of M samples from the posterior, {S^i_{t-1} : i = 1, 2, ..., M}, S^i_{t-1} ~ p(S_{t-1} | I_{1:t-1}). The following are the steps to generate the set {S^i_t : i = 1, 2, ..., M}.

Prediction. The first step is to sample from p(S_t | I_{1:t-1}) given the samples from p(S_{t-1} | I_{1:t-1}). According to (11), p(S_t | I_{1:t-1}) is the integral of the product of a marginal and a conditional density. This implies that, for each element S^i_{t-1}, by generating a sample from the conditional p(S_t | S^i_{t-1}) we can generate a sample from p(S_t | I_{1:t-1}). In our case, this is accomplished using Algorithm 1. Now we have samples {Ŝ^i_t} from p(S_t | I_{1:t-1}); these samples are called predictions, but we have used a geodesic prediction different from the classic particle filter on a vector space.

Resampling. Given these predictions, the next step is to generate samples from the posterior p(S_t | I_{1:t}). For this, we utilize importance sampling as follows. The samples from the prior p(S_t | I_{1:t-1}) are resampled according to probabilities that are proportional to the likelihoods p(I_t | Ŝ^i_t). Form a discrete probability mass function on the set {Ŝ^i_t : i = 1, 2, ..., M},

$$\omega_t^{i} = \frac{p\left(I_t \mid \hat{S}_t^{i}\right)}{\sum_{j=1}^{M} p\left(I_t \mid \hat{S}_t^{j}\right)}. \qquad (18)$$

Then, resample M values from the set {Ŝ^1_t, Ŝ^2_t, ..., Ŝ^M_t} according to the probabilities ω^i_t. These values are the desired samples from the posterior p(S_t | I_{1:t}). Denote the resampled set by {S^i_t : i = 1, 2, ..., M}, S^i_t ~ p(S_t | I_{1:t}).
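As an illustration of the prediction and resampling steps above, the following is a minimal sketch, not the authors' implementation, of one filter cycle, assuming NumPy/SciPy, an illustrative 6-dimensional basis of the affine Lie algebra, and the 3 × 3 homogeneous state of (14); the velocity follows the random walk (16) on the algebra and the state moves along group geodesics as in (15) and Algorithm 1.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential on the Lie algebra

# Illustrative basis of the 6-dimensional affine Lie algebra (last row of every
# generator is zero, so expm() of any combination has last row [0, 0, 1]).
BASIS = [np.zeros((3, 3)) for _ in range(6)]
for k, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (1, 2)]):
    BASIS[k][i, j] = 1.0

def algebra(v):
    # Map a 6-vector of velocities to the corresponding Lie-algebra matrix.
    return sum(vk * Bk for vk, Bk in zip(v, BASIS))

def predict(S_prev, V_prev, sigma, rng):
    # Eq. (16): Gaussian random walk on the Lie algebra.
    V = V_prev + rng.normal(0.0, sigma)
    # Eq. (15): geodesic step on the group via the matrix exponential.
    S = S_prev @ expm(algebra(V))
    return S, V

def weight_and_resample(predictions, likelihoods, rng):
    # Eq. (18): normalized importance weights, then multinomial resampling.
    w = np.asarray(likelihoods, dtype=float)
    w = w / w.sum()
    idx = rng.choice(len(predictions), size=len(predictions), p=w)
    return [predictions[i] for i in idx], w

# Example: propagate M = 60 particles one step; the likelihoods would come from (17).
rng = np.random.default_rng(0)
sigma = np.array([0.04, 0.003, 0.003, 0.04, 4.0, 4.0])  # std. devs. as in Section 5.1
particles = [(np.eye(3), np.zeros(6)) for _ in range(60)]
predictions = [predict(S, V, sigma, rng) for S, V in particles]
```

The only difference from a vector-space particle filter lies in predict: the velocity is a Euclidean random walk, but the state itself is updated through the exponential map, so every sample stays on the affine group.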
Averaging on the Lie Group. Now that we have M samples from the posterior p(S_t | I_{1:t}), we can average them appropriately to approximate the posterior mean of S_t. It may be recalled that, for a vector space, the sample mean or average of a set {S^1, S^2, ..., S^M} is given by $\bar{S} = \frac{1}{M}\sum_{i=1}^{M} S^{i}$. However, such a notion cannot be applied directly to elements of a group manifold. There are at least two ways to define a mean value on a manifold: extrinsic means and intrinsic means. The extrinsic mean depends on the geometry of the ambient space and the embedding. The intrinsic mean is defined using only the intrinsic geometry of the manifold. In general, the intrinsic average is preferable over the extrinsic average but is often hard to compute due to the nonlinearity of the Riemannian distance function and the need to parameterize the group manifold. However, as we will see here, for matrix Lie groups the intrinsic average can be computed efficiently. In several applications, the Lie algebra is used for computing intrinsic means of points having a Lie group structure [17–19]. We adopt a similar idea to obtain the intrinsic mean on the affine Lie group.
The "true" intrinsic sample mean is given by

$$\bar{S} = \arg\min_{S\in G}\sum_{i=1}^{M} d^{2}\left(S^{i}, S\right). \qquad (19)$$

It will be recalled that, for matrix groups, the Riemannian distance is defined by the matrix logarithm operation; that is, for matrix group elements X and Y we have

$$d(X,Y) = \left\|\log\left(Y X^{-1}\right)\right\|. \qquad (20)$$

4.4. Detail of the Tracking Algorithm

Algorithm 2.
(1) Initialize: generate samples {S^i_0, i = 1, 2, ..., M} from the prior distribution p(S_0). Set initial weights ω^i_0 = 1/M.
(2) Prediction: draw {Ŝ^i_t, i = 1, 2, ..., M} from the conditional prior according to Algorithm 1.
(3) Importance weights: compute the probabilities ω^i_t, i = 1, 2, ..., M, according to (18).
(4) Resampling: generate M samples from the set {Ŝ^i_t, i = 1, 2, ..., M} with the associated probabilities {ω^i_t, i = 1, 2, ..., M}. Denote these samples by {S^i_t, i = 1, 2, ..., M}.
(5) MMSE averaging: calculate the sample average according to (19), which is the target state. Set t = t + 1 and go to step (2).

5. Experimental Results

In order to evaluate the performance of the proposed tracking algorithm based on geometric particle filtering and the new update method, we started by comparing the proposed algorithm (referred to as GPF) with the tracking algorithm based on particle filtering on a vector space (VPF) on the same real image sequences. After that, we evaluated the proposed update method against the one previously proposed in the literature. We also tested the proposed algorithm under varying illumination conditions. These algorithms are implemented in C++ running on an Intel Core-2 2.5 GHz processor with 2 GB memory.

5.1. Compared with VPF. Two typical image sequences in which the objects undergo large changes in pose and scale were tested using GPF and VPF; thus, the performance of the two algorithms has been compared under the same experimental setup.
The first sequence contains 150 frames; the size of each frame is 768 × 576 and the size of the template is 51 × 42. The tracked target undergoes a large scale change in the sequence. For the particle filtering in the visual tracking, the number of particles is set to 60. The standard deviations of the six affine parameters in (16) are assigned as (0.04, 0.003, 0.003, 0.04, 4, 4). The final tracking results of GPF and VPF are shown in Figure 3. For better visualization, we just show the tracking results of four representative frames, 52, 87, 135, and 148. The frame number is shown on the top left corner of each image. The value below each image is the likelihood of the matching; the smaller the matching error, the larger the likelihood. Figure 4(a) shows the likelihood curves.
From Figure 3, we see that the proposed tracking algorithm exhibits a robust tracking result and the tracking window adapts to the scale change of the target, while the VPF tracker begins to drift away from the target from frame 135. This is due to the fact that VPF treats the parameter space as a whole, and there are not enough observations to provide a reliable estimate, whereas GPF considers the geometry of the parameter space with its prior of smooth changes. From Figure 4(a), we see that the likelihoods of the GPF tracker are always larger than those of the VPF tracker. The second sequence contains 370 frames; the size of each frame is 352 × 420 and the size of the template is 60 × 40. The tracked target experiences large rotation and shear changes in the sequence. The number of particles is set to 60. The standard deviations of the six affine parameters in (16) are assigned as (0.04, 0.0003, 0.003, 0.04, 4, 4). The final tracking results of GPF and VPF are shown in Figure 5. As for sequence 1, we just show the tracking results of four representative frames, 165, 281, 337, and 364. We see that the proposed tracking algorithm exhibits a robust tracking result and the tracking window adapts to the deformation of the target, while the tracking window of VPF cannot enclose the target well, so its likelihoods are smaller than those of GPF. Figure 4(b) shows the likelihood curves; we see that in the first 150 frames the likelihoods of the two trackers are similar, but from the 150th frame the likelihood of GPF is always larger than that of VPF. This is due to the fact that the target does not experience rotation and shear changes before the 150th frame, just translation.
In summary, we observe that the GPF tracker outperforms VPF in the scenarios of scale, rotation, and shear changes of the target.
Figure 3: Tracking results of sequence 1: (a) tracking using VPF (frames 52, 87, 135, and 148, with matching likelihoods 0.4606, 0.4407, 0.3964, and 0.3665); (b) tracking using GPF (frames 52, 87, 135, and 148, with matching likelihoods 0.497, 0.4782, 0.4532, and 0.4557).

Figure 4: Performance comparison between VPF and GPF, plotted as matching likelihood versus frame number for each tracker: (a) sequence 1; (b) sequence 2.
5.2. Update Method. To evaluate the effectiveness of the proposed update solution, we compare its results with the ones obtained by the Porikli update proposed in [14]. We compare the likelihood curves on the above two image sequences; the results were obtained by just changing the update method.
Figure 6 shows the likelihood curves of the two update methods. From Figure 6, we see that the likelihood curves are similar; this means the two updates are equivalent.
However, the distinct advantage of this new update method is its execution time. In Table 1, we show the results in milliseconds for the two update methods. The Porikli update execution time was measured considering a stack of five region covariance matrices. The new update is much faster than the one proposed in [14], with an average performance of 0.6 ms.

5.3. Illumination Changes. To analyze the robustness against illumination changes using the covariance descriptor, we have used the algorithm on several sequences with illumination changes, one of which is a vehicle driving at
Figure 5: Tracking results of sequence 2: (a) tracking using VPF (frames 165, 281, 337, and 364, with matching likelihoods 0.4031, 0.3067, 0.2891, and 0.2996); (b) tracking using GPF (frames 165, 281, 337, and 364, with matching likelihoods 0.4637, 0.3927, 0.3758, and 0.3665).

Figure 6: Performance comparison between the two update methods (Porikli update and new update), plotted as matching likelihood versus frame number: (a) sequence 1; (b) sequence 2.

Table 1: Execution time of the two update methods.

Method | Execution time (ms)
Porikli update | 129.6
New update | 0.6

night, shown in Figure 7(a). Despite the difficult illumination conditions, our algorithm is able to track the vehicle well. We also tested the same image sequence using the image grayscale values. The tracking results are shown in Figure 7(b); we can see that from the 280th frame the tracking window drifts away from the target (the red dashed window is the real target). So the tracking algorithm using the covariance descriptor outperforms the gray-based tracking algorithm under illumination changes.

5.4. Experimental Analyses. The algorithm described in the paper consists of three components.
(1) We develop a general particle-filtering-based tracking algorithm that explicitly takes the geometry of affine Lie groups into consideration in deriving the state equation on Lie groups. This is our main contribution and the dominating factor in improving the tracking performance.
Figure 7: Vehicle moving in the night time with large illumination changes: (a) tracking using the image grayscale (frames 1, 100, 280, and 300, with matching likelihoods 0.5448, 0.4326, 0.3294, and 0.2438); (b) tracking using the covariance descriptor (frames 1, 100, 280, and 300, with matching likelihoods 0.5448, 0.4429, 0.4281, and 0.4355).

(2) We use the region covariance descriptor to model the object's appearance, so that edge-like information, which is more robust to illumination changes than the image grayscale, can be considered simultaneously with the image grayscale information and the pixel spatial information; the consequence is the quite robust tracking results seen in Figure 7.
(3) We update the region covariance using a novel approach in a Riemannian space. The new update method has improved the real-time performance.
So the order of importance to the performance among these components is 1, 2, 3.

6. Conclusion

In this paper, we have proposed a visual tracking method which integrates the covariance descriptor into a Monte Carlo tracking technique. The distinct advantage of this new approach is carrying the sequential Monte Carlo method over the affine Lie group, which takes the geometric prior of the parameter space into account. Theoretic analysis and experimental results show the promise and effectiveness of the proposed approach.
This paper highlights the role of Monte Carlo methods in statistical inference over the affine Lie group for the visual tracking problem. There are several directions for extending the new idea. One is to consider more general differentiable manifolds beyond the affine Lie group. In addition, we can deepen and broaden this research to other image processing problems.

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China (Grant no. 60603097) and the National Defense Innovation Foundation of Chinese Academy Sciences (CXJJ-65).

References

[1] D. A. Montera, S. K. Rogers, D. W. Ruck, and M. E. Oxley, "Object tracking through adaptive correlation," Optical Engineering, vol. 33, pp. 294–302, 1994.
[2] H. S. Parry, A. D. Marshall, and K. C. Markham, "Tracking targets in FLIR images by region template correlation," in Acquisition, Tracking, and Pointing XI, vol. 3086 of Proceedings of SPIE, pp. 221–232, Orlando, Fla, USA, April 1997.
[3] G. D. Hager and P. N. Belhumeur, "Efficient region tracking with parametric models of geometry and illumination," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025–1039, 1998.
[4] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: a unifying framework," International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
[5] H. Zhang, W. Huang, Z. Huang, and L. Li, "Affine object tracking with kernel-based spatial-color representation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 293–300, June 2005.
[6] M. Isard and A. Blake, "Condensation-conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[7] S. K. Zhou, R. Chellappa, and B. Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Transactions on Image Processing, vol. 13, no. 11, pp. 1491–1506, 2004.
[8] Y. Rathi, N. Vaswani, A. Tannenbaum, and A. Yezzi, “Tracking


deforming objects using particle filtering for geometric active
contours,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 29, no. 8, pp. 1470–1475, 2007.
[9] J.-M. Odobez, D. Gatica-Perez, and S. O. Ba, “Embedding
motion in model-based stochastic tracking,” IEEE Transactions
on Image Processing, vol. 15, no. 11, pp. 3514–3530, 2006.
[10] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental
learning for robust visual tracking,” International Journal of
Computer Vision, vol. 77, no. 1–3, pp. 125–141, 2008.
[11] A. Srivastava and E. Klassen, “Monte Carlo extrinsic estima-
tors of manifold-valued parameters,” IEEE Transactions on
Signal Processing, vol. 50, no. 2, pp. 299–308, 2002.
[12] H. Snoussi and A. Mohammad-Djafari, “Particle filtering on
Riemannian manifold,” in Proceedings of the 27th International
Workshop on Bayesian Inference and Maximum Entropy Meth-
ods in Science and Engineering, vol. 872 of AIP Conference
Proceedings, pp. 219–226, 2006.
[13] O. Tuzel, F. Porikli, and P. Meer, “Region covariance: a fast
descriptor for detection and classification,” in Proceedings of
the 9th European Conference on Computer Vision (ECCV ’06),
vol. 3952 of Lecture Notes in Computer Science, pp. 589–600,
2006.
[14] F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking using
model update based on Lie algebra,” in Proceedings of IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR ’06), vol. 1, pp. 728–735, New York, NY,
USA, June 2006.
[15] B. C. Hall, Lie Algebras, and Representations: An Elementary
Introduction, Springer, New York, NY, USA, 2003.
[16] M. Berger, A Panoramic View of Riemannian Geometry,
Springer, Berlin, Germany, 2003.
[17] E. Begelfor and M. Werman, “How to put probabilities on
homographies,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 27, no. 10, pp. 1666–1670, 2005.
[18] V. M. Govindu, “Lie-algebraic averaging for globally consis-
tent motion estimation,” in Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition
(CVPR ’04), vol. 1, pp. 684–691, Washington, DC, USA, July
2004.
[19] O. Tuzel, R. Subbarao, and P. Meer, “Simultaneous multiple
3D motion estimation via mode finding on lie groups,”
in Proceedings of the 10th IEEE International Conference on
Computer Vision (ICCV ’05), vol. 1, pp. 18–25, Beijing, China,
October 2005.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 296598, 23 pages
doi:10.1155/2010/296598

Research Article
Construction of Fisheye Lens Inverse Perspective Mapping Model
and Its Applications of Obstacle Detection

Chin-Teng Lin,1 Tzu-Kuei Shen,1 and Yu-Wen Shou2


1 Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan
2 Department of Computer and Communication Engineering, China University of Technology, Hsinchu 303, Taiwan

Correspondence should be addressed to Yu-Wen Shou, [email protected]

Received 1 December 2009; Revised 15 April 2010; Accepted 15 June 2010

Academic Editor: Yingzi Du

Copyright © 2010 Chin-Teng Lin et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper, we develop a vision based obstacle detection system by utilizing our proposed fisheye lens inverse perspective
mapping (FLIPM) method. The new mapping equations are derived to transform the images captured by the fisheye lens camera
into the undistorted remapped ones under practical circumstances. In the obstacle detection, we make use of the features of
vertical edges on objects from remapped images to indicate the relative positions of obstacles. The static information of the remapped image in the current frame is used to determine the features of the source images in the searching stage, from either the profile image or the temporal IPM difference image. The profile image can be acquired by several processes such as sharpening, edge detection,
morphological operation, and modified thinning algorithms on the remapped image. The temporal IPM difference image can be
obtained by a spatial shift on the remapped image in the previous frame. Moreover, the polar histogram and its post-processing
procedures will be used to indicate the position and length of feature vectors and to remove noises as well. Our obstacle detection
can give drivers the warning signals within a limited distance from nearby vehicles while the detected obstacles are even with the
quasi-vertical edges.

1. Introduction et al. [1, 2] utilized the IPM method and stereo cameras to
detect obstacles in front of the vehicle, and implemented
With the fast growing number of vehicles and traffic the parallel processor for image checking and analysis (PA-
accidents in recent years, the advanced vehicle control and PRICA) system Single Instruction Multiple Data (SIMD)
safety driving assistance in intelligent transportation systems computer architecture, to construct their obstacle and lane
(ITS) have been more and more important. It has played detection system, called GOLD (Generic Obstacle and Lane
a significant role for the lateral obstacle detection system Detection) [2]. The GOLD implemented in the ARGO
to improve the driving safety and assist drivers to reduce (derived from Argo and Argus, a research group from Italy)
the dead angles of sight while driving. Moreover, the lateral experimental vehicle made automatic driving possible. Ji [3]
obstacle detection could be integrated with that in the front utilized IPM to get the 3D information of the front vehicle,
or rear of a vehicle to make the obstacle detection system and Cerri and Grisleri [4] presented the stabilized subpixel
more robust and complete. In general, the objective of precision IPM image and the time correlation to estimate
camera calibration is to extract the intrinsic and extrinsic the possible driving space on highways. Muad et al. [5] used
information of the camera and the extracted information IPM to implement lane tracking and gave discussions of the
could be used to reconstruct the 3D world coordinate. factors which might have the influences on IPM. Tan et al. [6]
Nevertheless, the performance of camera calibration would combined IPM and the optical flow to detect obstacles for the
depend on the perspective effect, lens distortion, and the lateral blind spot of the vehicle. Jiang et al. [7] proposed the
number of cameras. An alternative method, namely inverse fast IPM algorithm and used it to detect lanes and obstacles.
perspective mapping (IPM), was proposed to reconstruct the Nieto et al. [8] introduced how to stabilize IPM images by
3D world coordinates by using a single camera only. Broggi using vanish point estimation. However in their approaches
based on IPM, the planar objects such as lane markings were


Input
eliminated and the prominent objects like quasitriangle pairs Polar histogram
video
were reserved. The performance of those detection methods
would obviously depend on the height, width, distance, and
shape of an obstacle. RGB to Histogram
There have been some other methods proposed for gray level post-processing
obstacle detection. Lai [9] used both of vision and the
ultrasonic senors on the mobile robot to detect the wall in
Fisheye lens Obstacle
the indoor environment. For the pedestrian detection, Curio inverse tracking
et al. [10] used the contour, local entropy, and binocular prespective
vision to detect pedestrians. Bertozzi et al. [11] utilized mapping
stereo infrared cameras and three steps including warm area Obstacle
detection, edge-based detection, and v-disparity computa- information
tion to detect pedestrians and used the morphological and extraction
Proprocessing
thermal characteristics of heads to validate the presence of
pedestrians. Though infrared cameras could perform well
in either daytime or nighttime, the applications would be Segment Output
still restricted because of the higher prices of those cameras. searching video
There have existed many kinds of features such as symmetry,
color, shadow, corner, Vertical/horizontal edges, texture, and Figure 1: The overall system.
vehicle light for vehicle detection [12]. Kyo et al. [13] used
edges to detect possible vehicles and further validated the
vehicles by the characteristics of symmetry, shadow, and
differences in the gray-level average intensity, and Denasi
and Quaglia [14] used pattern matching to detect and
validate vehicles. These methods would usually fail if the
obstacles did not match the defined models. For the general Optical
obstacle detection task, the optical flow-based and stereo- center
based methods have been most popular in recent researches.
The optical flow based methods would detect obstacles by
analyzing the differences between the expected and real Z
X
velocity fields. Krueger et al. [15] combined the optical
flow with odometry data to detect obstacles, but the optical Remapped
flow-based methods would have the higher computational image
complexity and might fail if the relative velocity between Y
obstacles and the detector was too small. For the stereo-
based methods, Forster and Tozzi [16] utilized disparities of
obstacles to detect obstacles and used a Kalman filter to track Figure 2: The vertical line projection of (1).
obstacles. However, stereo methods are highly dependent on
the accuracy of identification of correspondences in the two
images. In other words, searching the pairs of homogeneous
Z
points was much tougher for stereo-based methods.
Image plane
In recent years, there were two important subjects,
including improving the accuracy of compensation esti-
Optical
mation and obstacle detection. After an IPM image was center
acquired, a serious problem on resolution between the
original and remapped images might be caused. Therefore,
how to get an appropriate compensation result would be
difficult, especially in our fish-eye lens approach. In Yang World coordinate
et al. [17], the compensation estimation was gained by
X
the recursive method in trials and errors. Firstly, he chose
randomly two pixels with a predefined distance to compare
the optical flow values until gaining twenty pairs, and then
used the median pair to be the value of compensation
Y
estimation. However, the IPM remapped images may cause
a serious problem for computing the optical flow values
in case of the worse resolution. Furthermore, even if the
recursion method was used to avoid choosing nonplanar
pixels, it was still probably to get similar or nonplanar points Figure 3: The projected results of (5).
Figure 4: The figures and expected results for (a) perspective effect removing (b) a vertical straight line in the image will be projected to a
straight line whose prolongation will pass the vertical projection point of the camera on the world surface (c) a horizontal straight line in
the image will be projected to a straight line instead of an arc on the world surface.

when the values of optical flows were very close. In our rate in 58% ∼ 92%. That was because the pedestrians’ foot
approach, we adopted the edge features and images with time steps might be influenced by lane markings, shadows of
difference to improve the above problems in both static and trees, and any other planar noises. Our algorithm used the
dynamic environments. For dynamic environments, since IPM’s property; therefore, the polar histograms derived from
the nonplanar edge features may change more vibrantly than the IPM images could help to obtain the information of
planar edge features, the values of compensation estimation images in 1-D distributions. For separating from nonplanar
can be easily determined by the compensated image with obstacles, we also constructed a novel method to detect and
the minimum number of candidate pixels of obstacles. localize obstacles.
To improve stability and robustness of our system, we With the intrinsic and extrinsic parameters from camera
considered both the time interval and the earlier k frames to calibration, the obstacle detection system could establish a
average and update the latest compensation estimation. For transformation table for mapping the coordinates of real-
obstacle detection, in Ma et al. [18] approach, he adopted road surfaces into the distorted image coordinates. The
the pedestrian features and symmetrical property to search objective of IPM method was to remove the perspective
the possible positions of obstacles in the region of interest. effects caused by cameras, and the higher performance of
Although the performance of their system was acceptable, IPM methods made it possible to achieve better image
the results would be not stable and robust with the detection processing results. Since IPM methods have been proven to
Figure 5: The geometrical relations of the image and world coordinate system for deriving our equations.

Figure 6: The original and adjusted scope.

be more efficient and applicable to real traffic conditions, we would focus on developing an accurate IPM algorithm for both normal lens and fisheye lens by improving the previous IPM methods. Our obstacle detection system aims at detecting obstacles with either vertical or quasivertical edges. In fact, obstacles with significant height in vertical or quasivertical edges could be mapped to the radial lines of the transformed bird-view images. As a result, we could deal with the transformed images to extract the profile of edges and obtain the polar histogram for post-processing. We have organized the following sections in this paper, including our systematic structure, the modified normal lens IPM method, fisheye lens IPM, obstacle detection algorithms, experimental results, and conclusions.
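As a rough illustration of the last point (a minimal sketch under the stated assumptions, not the detection algorithm developed later in this paper), once a binary edge profile of the bird-view image is available, a polar histogram of edge pixels about the camera's vertical projection point can be accumulated as follows; radial obstacle edges then appear as isolated peaks, and short, noise-like edges can be filtered by thresholding the bin counts. The function and parameter names are illustrative, and NumPy is assumed.

```python
import numpy as np

def polar_histogram(edge_profile, focus_xy, n_bins=180):
    """edge_profile: binary bird-view edge image; focus_xy: (x, y) of the camera's
    vertical projection on the road plane in bird-view coordinates."""
    ys, xs = np.nonzero(edge_profile)
    angles = np.arctan2(ys - focus_xy[1], xs - focus_xy[0])  # angle of each edge pixel
    hist, bin_edges = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return hist, bin_edges

def candidate_directions(hist, bin_edges, min_count=30):
    """Keep angular bins whose count exceeds a threshold; shorter edges are discarded."""
    keep = np.nonzero(hist >= min_count)[0]
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return centers[keep], hist[keep]
```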
Figure 7: Illustrations for (a) the real scene image, (b) the distorted image, and (c) the desired image.
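The introduction mentions that, given the camera parameters, the system can establish a transformation table mapping road-surface coordinates to distorted image coordinates, which is how the desired remapped image in Figure 7(c) is produced from the distorted input in Figure 7(b). A minimal sketch of that lookup-table idea is given below; OpenCV is assumed, and world_to_image is only a placeholder for the backward mapping equations derived in the following sections, not the authors' implementation.

```python
import cv2
import numpy as np

def build_lookup_table(out_size, world_to_image):
    """world_to_image(X, Y) -> (u, v): backward mapping from road coordinates to pixels."""
    h, w = out_size
    map_x = np.zeros((h, w), np.float32)
    map_y = np.zeros((h, w), np.float32)
    for i in range(h):          # precomputed once; reused for every frame
        for j in range(w):
            u, v = world_to_image(j, i)   # placeholder for the (FL)IPM backward equations
            map_x[i, j], map_y[i, j] = u, v
    return map_x, map_y

def remap_frame(frame, map_x, map_y):
    # Sample the distorted input at the tabulated coordinates to get the bird-view image.
    return cv2.remap(frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```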

2. Our Systematic Structure

Our overall systematic structure is illustrated in Figure 1. The obstacle detection is performed after obtaining the bird-view images of road surfaces captured by the camera mounted on the lateral side of the vehicle. The edge profile of road surfaces in bird-view images or the temporal FLIPM difference image should be acquired, and then the segment searching algorithm will use the edge profile to get the feature radial lines which indicate the obstacles. After searching the feature segments, the polar histogram, which represents the direction and size of obstacles, will be computed. The histogram post-processing will also be used to filter out some noise and obstacles with shorter height. We still have to identify the detected obstacles and extract the relative information of the obstacle after the obstacle tracing process. After all the processes, we can obtain the final results in the output videos.

3. The Modified Normal Lens Inverse Perspective Mapping Method

To find more practical applications and set up the appropriate mapping equations in our system, we modify the previous approaches proposed by Bertozzi and Broggi [2] and make the obstacle detection system more complete. Let u and v represent the image coordinate system and X, Y, and Z be the world coordinate system, where (X, Y, 0) indicates the road surface. L, D, and H are the coordinates of the camera in the world coordinate system, while θ and γ are the camera's tilt and pan angles, respectively. α and β are the horizontal and vertical aperture angles. m and n indicate the height and width of an image. O is the optic axis vector, and η_x and η_y are the vectors representing the optic axis vector O projected on the road surface and its perpendicular vector:

$$X = H\cot\!\left(\theta-\beta+u\,\frac{2\beta}{m-1}\right)\cos\!\left(\gamma-\alpha+v\,\frac{2\alpha}{n-1}\right)+L,$$
$$Y = H\cot\!\left(\theta-\beta+u\,\frac{2\beta}{m-1}\right)\sin\!\left(\gamma-\alpha+v\,\frac{2\alpha}{n-1}\right)+D. \qquad (1)$$

From (1), the vertical straight line in the image coordinate system can be represented by the set of pixels whose v coordinate value is constant. If we assume that Kv = γ − α + v(2α/(n − 1)) is constant, then (1) will be simplified to

$$X = H\cot\!\left(\theta-\beta+u\,\frac{2\beta}{m-1}\right)\cos(Kv)+L,$$
$$Y = H\cot\!\left(\theta-\beta+u\,\frac{2\beta}{m-1}\right)\sin(Kv)+D. \qquad (2)$$

After simple calculations, we can obtain (3) from (2), which is shown in Figure 2:

$$X - L = (Y - D)\cot(Kv). \qquad (3)$$

Equation (3) means that a vertical straight line in the image, which represents the vertical edge of obstacles or other planar markings, will be projected in the world coordinate system into a straight line whose prolongation will pass the vertical projection point of the camera on the world surface.
Similarly, the horizontal straight line in the image coordinate system can be represented by the set of pixels whose u coordinate value is a constant. If we assume Ku = θ − β + u(2β/(m − 1)) is constant, then (1) will also be simplified to

$$X = H\cot(Ku)\cos\!\left(\gamma-\alpha+v\,\frac{2\alpha}{n-1}\right)+L = K\cos\!\left(\gamma-\alpha+v\,\frac{2\alpha}{n-1}\right)+L,$$
$$Y = H\cot(Ku)\sin\!\left(\gamma-\alpha+v\,\frac{2\alpha}{n-1}\right)+D = K\sin\!\left(\gamma-\alpha+v\,\frac{2\alpha}{n-1}\right)+D. \qquad (4)$$

Thus, we can derive (5) from (4), which is shown in Figure 3:

$$(X-L)^{2}+(Y-D)^{2}=K^{2}. \qquad (5)$$

Equation (5) means that a horizontal straight line in the image will be projected to an arc on the world surface. In order to modify the original IPM model, we propose a new pair of transformation equations for two expected results. First, a vertical straight line in the image will still be projected

Figure 8: The flow chart of image pre-processing.


Figure 9: The results in the profile searching process, (a) the remapped image (b) the profile image.

to a straight line whose prolongation will pass the vertical projection point of the camera on the world surface. Second, a horizontal straight line in the image will be projected to a straight line instead of an arc on the world surface. The results can be verified by the similar triangle theorem. With some prior knowledge, such as the assumptions of flat roads and the intrinsic and extrinsic parameters, we will be able to reconstruct a 2D image without the perspective effect. The illustrated figures and expected results are shown in Figure 4. By referring to the notations, the diagrams of the relationship between the image coordinate system and the world coordinate system are shown in Figure 5. We will derive a new pair of transformation equations by simple mathematical computations with triangular functions. From Figures 5(a) and 5(b), we can obtain

θ1 = u · 2β/(m − 1) − β,
H0 = H ∗ cot(θ),
H0 + H1 = H ∗ cot(θ + θ1),    (6)
θ2 = v · 2α/(n − 1) − α,
tan(θ2′) = tan(θ2) ∗ sec(θ + θ1).

Figure 5(c) describes how the points in the first quadrant of the image coordinate system will be projected onto the road surface. If the world coordinate of the camera is (0, 0, H), we will finally obtain (7) from the geometrical descriptions in Figures 5(c) and 5(d) and the length of each segment listed below:

H0 = H ∗ cot(θ),
H0 + H1 = H ∗ cot(θ + θ1),
H2 = H0 ∗ sec(γ),
H2 + H3 = (H0 + H1) ∗ sec(γ),
W0 = (H0 + H1) ∗ tan(γ),
W0 + W1 = (H0 + H1) ∗ tan(θ2) ∗ sec(θ + θ1),
W2 = H0 ∗ tan(γ),
W2 + W3 = H0 ∗ tan(θ2) ∗ sec(θ + θ1),

X = H2 + H3 + W1 ∗ sin(γ) = H ∗ cot(θ + θ1) ∗ (cos(γ) + sec(θ + θ1) ∗ tan(θ2) ∗ sin(γ)),
Y = W1 ∗ cos(γ) = H ∗ cot(θ + θ1) ∗ (−sin(γ) + sec(θ + θ1) ∗ tan(θ2) ∗ cos(γ)).    (7)

Now we have obtained the forward transformation equations, and the backward transformation equations shown in (8) can also be obtained easily by some mathematical computations with inverse functions:

Figure 10: Illustrations for the temporal FLIPM difference image: (a) planar object patterns, (b) nonplanar object patterns, and (c) moving nonplanar object patterns.


Figure 11: The results of the temporal FLIPM process, (a) the remapped image (b) the temporal difference image.

           
θ1 = cot⁻¹((X ∗ cos(γ) − Y ∗ sin(γ))/H) − θ,
θ2 = tan⁻¹((X ∗ sin(γ) + Y ∗ cos(γ))/(H ∗ csc(θ + θ1))),
u = ((m − 1)/2β) ∗ (cot⁻¹((X ∗ cos(γ) − Y ∗ sin(γ))/H) − θ + β),
v = ((n − 1)/2α) ∗ (tan⁻¹((X ∗ sin(γ) + Y ∗ cos(γ))/(H ∗ csc(θ + θ1))) + α).    (8)
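To make the backward mapping in (8) concrete, the following is a minimal C++ sketch (not the authors' implementation) of how each bird-view position could be mapped back to a source pixel; the structure and parameter names are assumptions introduced here for illustration only.

```cpp
#include <cmath>

// Camera and geometry parameters, following the notation of Section 3
// (names and the struct itself are illustrative, not taken from the paper).
struct IpmParams {
    double H;      // camera height above the road surface
    double theta;  // tilt angle (rad)
    double gamma;  // pan angle (rad)
    double alpha;  // horizontal aperture angle (rad)
    double beta;   // vertical aperture angle (rad)
    int m, n;      // image height and width in pixels
};

// Backward mapping of (8): a road-surface point (X, Y) is mapped to the
// source image coordinates (u, v), with the camera placed at (0, 0, H).
inline void worldToImage(const IpmParams& p, double X, double Y,
                         double& u, double& v)
{
    // cot^-1(x/H) is evaluated as atan2(H, x) so the angle stays in (0, pi).
    const double theta1 =
        std::atan2(p.H, X * std::cos(p.gamma) - Y * std::sin(p.gamma)) - p.theta;
    // 1 / csc(theta + theta1) = sin(theta + theta1).
    const double theta2 =
        std::atan((X * std::sin(p.gamma) + Y * std::cos(p.gamma)) *
                  std::sin(p.theta + theta1) / p.H);

    u = (p.m - 1) / (2.0 * p.beta)  * (theta1 + p.beta);
    v = (p.n - 1) / (2.0 * p.alpha) * (theta2 + p.alpha);
}
```

Evaluating such a function once per bird-view grid cell is one way a remapped position table (as mentioned later in Table 2) could be precomputed, so the per-frame cost reduces to a simple look-up.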



Figure 12: The results in the feature searching procedure by using profile images, (a) the sharpened remapped image, (b) the profile image,
(c) the scanned feature segments, and (d) the scanned feature segments superposed on the sharpened remapped image.


Figure 13: The results of the feature searching procedure using temporal difference FLIPM images, (a) the remapped image, (b) the temporal
difference FLIPM image, (c) the scanned feature segments, (d) the scanned feature segments superposed on the remapped image.

4. Fisheye Lens Inverse Perspective Mapping (FLIPM)

4.1. The Fisheye Undistortion Model. Zhang and Fu [19] proposed a camera spherical projection model to implement the endoscope image formation process and utilized the warping transformation equations to correct the radial distortion. The warping transformation equation pairs and their inverse pairs are shown in (9). The coordinate (X, Y, Z) is the position of a point in the 3D world coordinate system, (u1, v1) is the coordinate in the undistorted image, and (u, v) is the coordinate in the distorted one:

Figure 14: The flow chart of feature searching.


X = f ∗ u/√(R² − u² − v²);    Y = f ∗ v/√(R² − u² − v²),
u = R ∗ X/√(f² + X² + Y²);    v = R ∗ Y/√(f² + X² + Y²),    (9)

where f is the focal length of the camera and R is the radius of the sphere. We modify and redefine that model for our applications in this paper. We regard the X1-Y1 plane as the undistorted image plane and the u-v plane as the distorted one; thus we can derive the modified equations in (10):

u1 = f ∗ u/√(R² − u² − v²) = k1 ∗ √((u² + v² ∗ sin²(θ1))/cos²(θ1)),
v1 = f ∗ v/√(R² − u² − v²) = k1 ∗ √((v² + u² ∗ sin²(θ2))/cos²(θ2)),
u = R ∗ u1/√(f² + u1² + v1²) = u1/(k1 ∗ √(1 + tan²(θ1) + tan²(θ2))),
v = R ∗ v1/√(f² + u1² + v1²) = v1/(k1 ∗ √(1 + tan²(θ1) + tan²(θ2))),    (10)

where k1 = f/R, and θ1 = sin⁻¹(u/√(R² − v²)) = tan⁻¹(u1/f) and θ2 = sin⁻¹(v/√(R² − u²)) = tan⁻¹(v1/f) are the angles between the optical axis and the lines connecting the optical center with the horizontal or vertical direction projection points of an image point. Equation (11) instead of (13) will be used throughout this paper, since (13) may produce many nonpixel values of the image. We can also obtain the distorted or undistorted images, no matter whether the focal length is known or not, by tuning the parameter k1.

4.2. The Complete Fisheye Lens Inverse Perspective Mapping. A fisheye lens inverse perspective mapping (FLIPM) algorithm is completed by two parts, the forward and the backward mapping algorithms. The objective of the forward mapping algorithm is to search the dimensions or ranges of the remapped images, which is illustrated in Figure 6.

The dimensions of the scopes are only related to the view-ranges of a camera; that is to say, either the use of the normal lens or the fisheye lens with fixed tilt and pan angles will determine the factors of influence. In order to reduce the computational loading in the use of the tangent and secant triangular functions, we restrict the scope of a camera by narrowing down its view-range. Without loss of generality, we still keep the broadest range of scopes and minimize discarding far and fringe information. Furthermore, we narrow down the view-angles by using Snell's law as shown in (11), where IR simulates the index of refraction and controls the scopes of the resultant ranges. The range of IR is between 1.3 and 1.7 for glass-based lenses:

sin(u · 2β/(m − 1) − β) = IR ∗ sin(θ1),
sin(v · 2α/(n − 1) − α) = IR ∗ sin(θ2).    (11)

The angles θ1 and θ2 can be substituted into (8) to compute the extreme values of the coordinates X and Y, and in this way we obtain the dimension of the remapped image. The backward mapping algorithm is different from the forward one because the addition of a radial distortion correction step makes it more rational. We first consider the ideas of the backward mapping algorithm in Figure 7.

Since the images captured by the fisheye cameras, as shown in Figure 7(a), have perspective effects and distortions, we have to remove those undesired effects to acquire the available images just like Figure 7(b), in pursuit of Figure 7(c), where the perspective effect and distortion have been completely removed. Thus, we can derive the backward mapping algorithm by modifying (8) as (12). We also complete the distortion correction process by using (13) and the derived formulas of the angles in (12). By tuning the parameters IR and k1, we will easily obtain the undistorted and perspective-effect-removed images:

θ1 = cot⁻¹((X ∗ cos(γ) − Y ∗ sin(γ))/H) − θ,
θ2 = tan⁻¹((X ∗ sin(γ) + Y ∗ cos(γ))/(H ∗ csc(θ + θ1))).    (12)

5. Our Obstacle Detection Algorithm

In this section, we develop an obstacle detection algorithm by using both the spatial and temporal information of the FLIPM method. We use a single fisheye camera mounted on the lateral side of the vehicle to detect obstacles. Obstacles in this paper are defined as objects whose height is larger than a threshold and whose edges are quasivertical. A straight line in the vertical direction in the images represents the vertical edges of obstacles in the world coordinate system and will be projected to a straight line whose prolongation passes the vertical projection point of the camera on the world surface. To illustrate our systematic mechanism more clearly, we introduce the obstacle detection algorithm in the following parts, including some image preprocessing steps, feature selection, histogram analysis, object tracking, and information extraction.

5.1. The Preprocesses. We have to simplify the image patterns for our following procedures by some image preprocessing techniques shown in Figure 8. At first, the remapped image is smoothed by a mean filter to reduce the noise resulting from the FLIPM transformation. Our developed equations in FLIPM have the advantages of IPM in removing the information of height and can help to detect the obstacles on the road surface. We also propose two different strategies for feature extraction. We use the profile image, which will be introduced next, to extract the feature series when the detected objects and our camera are relatively motionless; otherwise we acquire the features from the obstacle-sensitive temporal FLIPM difference image, which will be clarified in Section 5.3.

5.2. Profile Image. The obvious edges of obstacles are essential for extracting the profile images. We hence enhance the edges by the unsharp mask at this stage to make up for over-blurred images, and detect edges by simple Sobel operations. The binary images can be obtained by thresholding after edge detection of the remapped image, and we have to use the morphological operations of dilation and erosion to get the useful edges for our processes. As for the extraction of the feature segments, we remodel the thinning algorithm introduced in [20] for thinning the binary edges in order to meet our real-time needs in the applications of ITS. We turn to use the center pixel of a mask to extract the exterior profile of a pattern without checking the conditions of patterns iteratively. Figure 9 shows the processed results of our profile image searching.
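As an illustration of the preprocessing chain in Sections 5.1 and 5.2, the following minimal sketch uses OpenCV, which is our own choice here and not necessarily the library used by the authors; kernel sizes and threshold values are assumptions for illustration only.

```cpp
#include <opencv2/imgproc.hpp>

// Minimal sketch of the profile-image preprocessing of Sections 5.1-5.2:
// mean-filter smoothing, unsharp masking, Sobel edge detection, thresholding,
// and dilation/erosion. All numeric constants are illustrative assumptions.
cv::Mat buildEdgeProfile(const cv::Mat& remappedGray)
{
    cv::Mat smoothed, blurred, sharpened, gradX, gradY, magnitude, binary;

    // 1) Mean filter to reduce noise introduced by the FLIPM remapping.
    cv::blur(remappedGray, smoothed, cv::Size(3, 3));

    // 2) Unsharp mask to compensate for the over-blurred result.
    cv::GaussianBlur(smoothed, blurred, cv::Size(5, 5), 0);
    cv::addWeighted(smoothed, 1.5, blurred, -0.5, 0, sharpened);

    // 3) Sobel edge detection and gradient magnitude.
    cv::Sobel(sharpened, gradX, CV_16S, 1, 0);
    cv::Sobel(sharpened, gradY, CV_16S, 0, 1);
    cv::convertScaleAbs(gradX, gradX);
    cv::convertScaleAbs(gradY, gradY);
    cv::addWeighted(gradX, 0.5, gradY, 0.5, 0, magnitude);

    // 4) Thresholding to a binary edge image.
    cv::threshold(magnitude, binary, 60, 255, cv::THRESH_BINARY);

    // 5) Dilation followed by erosion to keep the useful edges.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::dilate(binary, binary, kernel);
    cv::erode(binary, binary, kernel);

    return binary;  // input to the profile thinning and feature-segment search
}
```

The output of such a routine would then be thinned and scanned angle by angle from the camera's vertical projection point, as described in Section 5.4.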


Figure 15: The flow chart of histogram postprocessing.


Figure 16: Illustrative figures of the trapezoid histogram distributions. (a) The figure of lane markings. (b) The trapezoid histograms.

5.3. The Temporal Fisheye Lens Inverse Perspective Mapping Difference Image. The objective of the temporal FLIPM difference process is to simulate the stereo vision of the captured images. The stereo IPM can keep the non-planar objects and remove the planar objects, such as lane markings and shadows, by comparing the differences between the left and right remapped images, as illustrated in Figure 10. According to the stereo IPM method [21], two cameras should be used to acquire sufficient information about the overlapped regions. Since this paper focuses on using a single camera, we take advantage of the time difference to simulate the effect of stereo cameras. As Figure 11 shows, we have to address two important issues, the selection of the time interval and the shift displacement of the remapped image. From the FLIPM, both the maximum movement of the shift displacement and the self-moving speed are restricted by the position of the cameras. Therefore, we shall concentrate on deciding the time interval to compensate the remapped images. For instance, as shown in Figure 11(a), the remapped image can be estimated while the maximum shift displacement is the real horizontal distance of the remapped image. With the maximum moving speed, we can get the appropriate time interval by keeping the temporal difference images within an obvious displacement. We determine the value each time under the assumption that the compensated profile image has the minimum number of nonplanar pixels if the acquired value is the optimal one. As a result, we accumulate the movements in the earlier k frames to update the latest compensation displacement, and the value k varies with different speeds. When the compensation method works in a complex background, such as when many moving objects are in the field of view, the IPM effect of moving objects will cause a different projection between the front and rear frames, as shown in Figure 10(c). Furthermore, to estimate the needs of compensation, such as the moving directions of obstacles, we accumulate the movements of planar edge features in k frames to obtain the adaptive value of the compensation estimation. It is more difficult and decisive to determine the shift displacement of the remapped images, for selecting the appropriate time interval may be easier for the expected performance of our obstacle detection system. We use the average displacement of the remapped images from two cameras as the shift displacement in our temporal FLIPM method to obtain the "pseudo" stereo-effect binary difference image.

5.4. Feature Searching Algorithm. As mentioned previously, we only prefer to search the features extracted from the objects with quasivertical edges in the remapped image. Based on the observation that those qualified features will always pass through the vertical projection point of the camera, we propose a feature searching algorithm and use the polar histogram to accurately detect obstacles even in noisy images. Our searching algorithm begins with the vertical projection point of the camera in the remapped image (denoted as CP). After that, we scan the acquired profile or temporal difference image angle by angle from the center to the outmost border of a circle with a defined radius, which can be determined by the information of the remapped images. We then use a voting method in the mask searching and adjust

Figure 17: The processes of histogram post-processing, where x-axis and y-axis represent the angles of polar histograms and the
accumulation amount on each angle, respectively. (a) The polar histogram of Figure 13. (b) The histogram of (a) after the trapezoid
histogram elimination, low singleton histogram elimination and the peak searching procedure. (c) Local peak histogram.

Figure 18: The flow chart of object tracking and information extraction.

Figure 19: The results of the normal lens IPM equations (a) the original captured image (b) the bird-view image using Broggi’s equations
(c) the bird-view image using our equations.


Figure 20: The setup locations of cameras.

the searching space flexibly according to the intensities and relative distances between vehicles and obstacles. The voting threshold is fixed and can be determined by half of the total number of elements in the mask. We can keep the major features, for the Gaussian weighting values in each 5 × 5 mask indicate the important regions in this mask. For each angle, a feature segment will be taken only if its corresponding percentage is higher than some specific value in order to reduce the errors caused by statistics. The next searching point at the same angle must be close enough to the last searched segment so as to concentrate on the points close to CP. After the searching process at each angle, the number of points at each angle


Figure 21: The interface of our program and the related information about adjustable parameters.

will be used to produce a polar histogram in our system. Some results of the feature searching procedure are shown in Figures 12 and 13, and the complete flow chart of our feature searching algorithm is shown in Figure 14. As Figure 14 shows, our algorithm can deal with two kinds of problems. One is that our systematic design can effectively improve the accuracy rate and reduce the possibility of lost pixels by using the features, which are defined along each angle line by the results of the mask searching, to indicate the distance between objects and the camera. The other is that our proposed method can discriminate the meaningful pixels from the others by the presented model for checking the length of the searched feature segments. Therefore, the flaws of the polar histogram can be made up, and our obstacle detection makes great progress in performance.

5.5. Histogram Postprocessing. Figure 15 shows the processes in the histogram postprocessing. The postprocessing on histograms is necessary since we have to consider some important problems, such as how to obtain the peak values of the histogram, how to reduce the influences of light, and how to find the best way in statistics for our applications. After we obtain the polar histogram of feature segments, we still need to find our desired histograms to remove the segments of planar objects and noise. Our procedures in histogram post-processing try to reduce the undesired information which may be produced in the polar histogram. For instance, the line segments belonging to planar objects will still be extracted in the polar histogram step. By observing the polar histogram, we can discover that a trapezoid histogram represents the planar objects. We can thus remove those clusters of bins in the histogram, and we do not process the oversmall histograms (where the columns in the histogram are few) to avoid disturbances. After eliminating planar objects and noise, we search the position of the local maximum which represents the segment position of a nonplanar object in the polar histogram. Also, we only pick one peak column in each angular interval to prevent detecting too many obstacles at the same time. Some results of the histogram post-processing procedure are shown in Figures 16 and 17. As Figure 16 shows, the regions in red circles (Figure 16(b)) correspond to the lane markings shown in Figure 16(a). In Figure 16(b), the x-axis and y-axis represent the angles of the polar histograms and the accumulation on each angle, respectively.

5.6. Object Tracking and Information Extraction. Our tracking procedure is used to confirm the detected objects. We choose the displacement and the variation of angles in the extracted feature segment as the judgment conditions in the tracking process. If the feature segment has been extracted, we judge again whether this feature segment belongs to a planar object by a pattern matching approach. Our tracking process and the pattern for representing the planar object are shown in Figure 18. We can finally confirm that the detected feature segment is an obstacle and also obtain the position of the feature segments or other useful information.

6. Experimental Results

We arranged the information of the working platform as listed in Table 1 below. To show our experimental results more clearly, we categorized the experiments

Figure 22: The experimental results of FLIPM and obstacle detection in different scenes (a) Scenery1: bicyclist, street light (b) Scenery 2:
railings (c) Scenery 3: multiple vehicles in the parking area (d) Scenery 4: pedestrians, nearby vehicles (e) Scenery 5: multiple vehicles.

according to the proposed processes and approaches which have been introduced in the previous sections as follows.

Table 2 showed the runtime of each processing step defined in Figure 1. As Table 2 demonstrated, our system processed 15 frames per second. We tested the complete system in two parts, the Input and Display stages. Therefore, the performance of the whole system could be improved easily by upgrading the video I/O equipment and optimizing the FLIPM kernel functions in regions of interest. Table 3 exhibited the performance of different obstacle detection algorithms. We gave the compared results in four parts, including the runtime, type of sensor, moving compensation, and field

Figure 23: Results of obstacle tracking in Scenery 1: (a) frame 315, (b) frame 318, (c) frame 322, (d) frame 326.

Figure 24: Results of obstacle tracking in Scenery 3: (a) frame 172, (b) frame 188, (c) frame 204, (d) frame 220.



Figure 25: Results of obstacle warning in the lateral direction.

Figure 26: Results of obstacle warning in the rear direction.

of view. Although our approach included image I/O routine processes on a common development platform, it still had better performance than the others. In the detection module, we adopted the polar histogram to simplify the analytic step. It had two benefits: one was to reduce the complexity and accelerate the processing speed, and the other was to improve the detection rate and accuracy of obstacle detection. In Table 3, [2, 17] and our approach considered the polar histogram; however, our system had a better detection rate than the others and might not be

Figure 27: Results of obstacle warning with moving objects (columns: detected original image, polar histogram, local peak histogram). The upper two rows are the results with moving humans. The bottom two rows are the results with a moving vehicle turning into the corner.

easily influenced by the planar markings, shadows, and other noise.

6.1. Comparisons About the Normal Lens Inverse Perspective Mapping Method. In Section 3, we proposed a modified forward and backward normal lens IPM equation pair. The experimental results of our proposed approach and of the most popular one, Broggi's equations, were given in Figure 19. In Figure 19, the captured images of the normal lens camera were transformed to the bird-view images by using our equations. In Figures 19(b) and 19(c), the perspective effect was eliminated by both Broggi's and our equations. Nevertheless, a horizontal line in the original image would be transformed to an arc by Broggi's equations, as shown in Figure 19(b). With our modified equations, a horizontal straight line in the original image could be transformed to a horizontal straight line in the bird-view image, as shown in Figure 19(c).

6.2. The Experimental Configurations. For the experiments in obstacle detection, we mounted a fisheye lens camera at the center between the two side doors with the appropriate

Figure 28: The results in different environments with heavy noise (columns: detected original image, FLIPM image). (a) In the daytime with many shadow effects. (b) In the nighttime with self shadow projected by several road lights from different directions. (c) In the raining daytime on wet grounds with light reflection. (d) In the raining nighttime on wet grounds with strong light reflection. (e) Daytime with other vehicles. (f) Nighttime with other vehicles.

Figure 29: Examples of erroneous detected results in our system.

Table 1: The specifications of our working platform.

CPU: Intel Core Duo T2050 1.6 GHz
Memory: 1 GB DDR2 RAM
Programming tool: Borland C++ Builder 6.0
Operation system: Microsoft Windows XP
Video resolution: 640 × 480
Camera frame rate: 30 fps

Table 2: The runtime of each processing step in Figure 1 (runtime per frame, in ms).

FLIPM (with the IPM remapped position table): 2.793
Input image, transfer to gray-level image, and setting dynamic array memory: 31.98
Preprocessing: 8.847
Segment searching and polar histogram: 6.515
Histogram and postprocessing: 0.253
Obstacle tracking and extraction: 0.347
Display: 13.725
Total: 64.46

height, as shown in Figure 20. To avoid disorders of frames, we would only detect the objects whose heights are more than a threshold and whose edges are quasivertical. Objects such as sidewalks, small balls, and so on were excluded in our detection system. The experimental environments would also be constrained to brighter backgrounds, and

Table 3: Comparisons of different obstacle detection algorithms.

Compared methods | Runtime (ms) | Sensor type | Moving compensation | Field of view
Our approach | 64.46 (CPU 1.6 GHz) | Fisheye camera | YES | 125°
Bertozzi and Broggi [2] | 100 (FPGA based) | Stereo camera | NO | 28°
Ji [3] | 66.7 (CPU 3 GHz) | Single camera | NO | 34°
Cerri and Grisleri [4] | 950 (CPU 2.8 GHz) | Single camera | YES | Normal lens
Kyo et al. [14] | 65 (multiple IMAP-VISION boards) | Single camera | NO | Normal lens
Yang et al. [17] | 50 (CPU 3.6 GHz) | Single camera | YES | Normal lens
Ma et al. [18] (pedestrian candidate detection module) | 16 (CPU 3.6 GHz) | Single camera | NO | 48.8°

the speed of vehicles should be under a reasonable limit so that the objects between frames would not change too drastically.

Figure 21 showed our program interface, where block (a) gave the input frame in which the line and rectangle were used to indicate the position of obstacles, block (b) had the relative information of the videos, block (c) contained the extrinsic and intrinsic parameters of the camera and the specifications of the look-up table for FLIPM, block (d) displayed the images used in our obstacle detection algorithm, and block (e) showed the obtained polar histogram. In block (d) of Figure 21, the upper image was the remapped image, the middle one represented the judgment image for the static situations, and the lower one showed the input image for the feature-segment searching stage.

6.3. Results in Various Environments. As Figure 22 showed, we could accurately detect various kinds of obstacles with quasivertical edges by using our FLIPM methods and obstacle detection algorithm.

Our tracking process was carried out by iteratively checking the displacement and angular shift in the image, and we also demonstrated the results of the tracking process in successive frames, as given in Figures 23 and 24. According to the FLIPM method, the 3D world coordinate value could be estimated from the remapped image. In other words, when we detected an obstacle in the remapped image, we could also estimate its position information. We hence set up an obstacle warning system on the lateral and rear sides of the vehicle to give a warning signal when the detected obstacles were too close to our vehicle. We showed the results of the obstacle warning system in Figures 25, 26, and 27, where the rectangles in the upper images and the lines in the lower images indicated "the position of obstacles" and "the distance and direction between the vertical projection point of the camera and the detected obstacle", respectively.

In Figure 27, there were results of obstacle warning with moving objects. We had two simulation situations. One was that a pedestrian was walking from a parking region to another side while the demonstrated vehicle was leaving the parking region. The other situation contained a corner where a coming vehicle, a static obstacle, and clear planar markings existed. Here, we presented two issues in the moving obstacle detection and compensation estimation with rotation angles. We could get the results from the distributions of the right polar histograms, localize each nonplanar obstacle by its dominant peak, and filter out the lane markings by their trapezoid distributions in the original polar histograms.

In Figure 28, there existed some simulated environments with heavy noise, such as shadows, light reflection, and light refraction from wet roads. In case (a), where there existed many shadows of trees on the ground, we could obtain the remapped image as shown in the right one by the FLIPM transformation. With our approach, the shadows could be filtered out by the polar histogram if their accumulations on each angle were small and their shapes were trapezoid. For case (b), where there were two road lights at the front and rear of our vehicle in the nighttime, we would find two different shadows on the ground. One was not clearly recognized from far light projection and the other was obvious due to near light projection. Our proposed method, however, could take advantage of the FLIPM effects to remove the shadowing effect no matter how serious the illuminating conditions might be. Our compensation estimation could shift the new frame to the adaptive position to gain the minimum candidate pixels of obstacles. For case (c) and case (d), we demonstrated that our results were reliable and satisfactory in the raining conditions in the daytime and nighttime. In the nighttime, noise from light reflection was much more serious than that in the daytime, and our approach could successfully avoid misrecognizing it as obstacles in the fixed illumination condition. As for case (e) and case (f), we showed the experimental results in the common situations which simulated the roads in the daytime and nighttime. As a result of the advantages of the fisheye camera in a wider angle of view, our obstacle detection algorithm could detect a field of view of up to 125 degrees, which was much wider than that of other algorithms using a common lens.

6.4. Discussions. Although the performance of our obstacle detection system based on the FLIPM method was quite satisfactory, there have existed some disturbance factors, as shown below. In Figure 29(a), the street light in the remapped image

was too unapparent to be detected because its texture was similar to that of the grassland behind it. In addition, the completeness of the obstacle shape in the profile or temporal FLIPM difference image would be critical for the following obstacle detection process. Figure 29(b) showed the broken shape of obstacles in the temporal FLIPM difference image, which could lead to erroneous detection results.

7. Conclusions

In this paper, we propose a complete and novel structure for the obstacle detection system. This brand-new structure includes three major parts, the FLIPM algorithm, feature segment searching, and obstacle detection. With our modified normal lens IPM equations, the vertical/horizontal straight lines in the original image will be projected to a straight line whose prolongation will pass the vertical/horizontal projection point of the camera in the remapped image. The resultant phenomenon has two advantages in removing planar objects and detecting obstacles. One is to give more information in predicting the compensation quantification between difference frames, which helps us to remove planar objects such as shadows, water, lane lines, and so on. The other one is to reinforce the feature points of obstacles, which efficiently reduces the computational loading in searching obstacles. Besides, we consider the fisheye lens distortion effect and provide a highly efficient and all-direction feature searching method on the polar histogram for both static and dynamic environments. We use the polar histogram to find the position and length of feature segments by referring to the edge and temporal difference images. We also present the histogram post-processing to exclude the planar lane markings and noise. Finally, all the experimental results of our proposed system show satisfactory performance and provide an accurate detection rate. In the future, our obstacle detection system can be integrated into driving assistance and safety systems, including vehicle collision-free development, warning systems, and lane departure warning systems. Furthermore, we will work on different shapes of obstacles for those without quasivertical edges and speed up our detection system for more real-time applications.

References

[1] M. Bertozzi, A. Broggi, and A. Fascioli, "Stereo inverse perspective mapping: theory and applications," Image and Vision Computing, vol. 16, no. 8, pp. 585–590, 1998.
[2] M. Bertozzi and A. Broggi, "GOLD: a parallel real-time stereo vision system for generic obstacle and lane detection," IEEE Transactions on Image Processing, vol. 7, no. 1, pp. 62–81, 1998.
[3] W.-L. Ji, A CCD-based intelligent driver assistance system based on lane and vehicle tracking, Ph.D. thesis, National Cheng Kung University, Tainan, Taiwan, 2005.
[4] P. Cerri and P. Grisleri, "Free space detection on highways using time correlation between stabilized sub-pixel precision IPM images," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '05), pp. 2223–2228, Barcelona, Spain, April 2005.
[5] A. M. Muad, A. Hussain, S. A. Samad, M. M. Mustaffa, and B. Y. Majlis, "Implementation of inverse perspective mapping algorithm for the development of an automatic lane tracking system," in Proceedings of the IEEE Region 10 Conference on Analog and Digital Techniques in Electrical Engineering (TENCON '04), vol. 1, pp. 207–210, Chiang Mai, Thailand, November 2004.
[6] S. Tan, J. Dale, A. Anderson, and A. Johnston, "Inverse perspective mapping and optic flow: a calibration method and a quantitative analysis," Image and Vision Computing, vol. 24, no. 2, pp. 153–165, 2006.
[7] G. Y. Jiang, T. Y. Choi, S. K. Hong, J. W. Bae, and B. S. Song, "Lane and obstacle detection based on fast inverse perspective mapping algorithm," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 2969–2974, Nashville, Tenn, USA, October 2000.
[8] M. Nieto, L. Salgado, F. Jaureguizar, and J. Cabrera, "Stabilization of inverse perspective mapping images based on robust vanishing point estimation," in Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 315–320, Istanbul, Turkey, June 2007.
[9] J.-H. Lai, Development of an exploration system for a vision-guided mobile robot in an unknown indoor environment, M.S. thesis, St. John's University, 2006.
[10] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. Von Seelen, "Walking pedestrian recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 3, pp. 155–162, 2000.
[11] M. Bertozzi, E. Binelli, A. Broggi, and M. D. Rose, "Stereo vision-based approaches for pedestrian detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 2005, p. 16, San Diego, Calif, USA, 2005.
[12] Z. Sun, G. Bebis, and R. Miller, "On-road vehicle detection using optical sensors: a review," in Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems (ITSC '04), pp. 585–590, October 2004.
[13] S. Kyo, T. Koga, K. Sakurai, and S. Okazaki, "Robust vehicle detecting and tracking system for wet weather conditions using the IMAP-VISION image processing board," in Proceedings of the IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, pp. 423–428, Tokyo, Japan, October 1999.
[14] S. Denasi and G. Quaglia, "Obstacle detection using a deformable model of vehicles," in Proceedings of the IEEE Intelligent Vehicles Symposium (IV '01), pp. 145–150, Tokyo, Japan, 2001.
[15] W. Krueger, W. Enkelmann, and S. Roessle, "Real-time estimation and tracking of optical flow vectors for obstacle detection," in Proceedings of the Intelligent Vehicles Symposium, pp. 304–309, Detroit, Mich, USA, September 1995.
[16] C. H. Q. Forster and C. Tozzi, "Towards 3D reconstruction of endoscope images using shape from shading," in Proceedings of the 13th Brazilian Symposium on Computer Graphics and Image Processing, pp. 90–96, 2000.
[17] C. Yang, H. Hongo, and S. Tanimoto, "A new approach for in-vehicle camera obstacle detection by ground movement compensation," in Proceedings of the 11th IEEE Conference on Intelligent Transportation Systems (ITSC '08), pp. 151–156, Beijing, China, October 2008.
[18] G. Ma, D. Muller, S.-B. Park, S. Muller-Schneiders, and A. Kummert, "Pedestrian detection using a single-monochrome camera," IET Intelligent Transport Systems, vol. 3, no. 1, pp. 42–56, 2009.

[19] S. Zhang and K. S. Fu, "A thinning algorithm for discrete binary images," in Proceedings of the International Conference on Computers and Application, pp. 879–886, Beijing, China, 1984.
[20] Q.-T. Luong, J. Weber, D. Koller, and J. Malik, "An integrated stereo-based approach to automatic vehicle guidance," in Proceedings of the 5th International Conference on Computer Vision, pp. 52–57, Cambridge, Mass, USA, June 1995.
[21] G. Ma, S.-B. Park, A. Ioffe, S. Müller-Schneiders, and A. Kummert, "A real time object detection approach applied to reliable pedestrian detection," in Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 755–760, Istanbul, Turkey, June 2007.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 915639, 18 pages
doi:10.1155/2010/915639

Research Article
Clusters versus GPUs for Parallel Target and
Anomaly Detection in Hyperspectral Images

Abel Paz and Antonio Plaza


Department of Technology of Computers and Communications, University of Extremadura, 10071 Caceres, Spain

Correspondence should be addressed to Antonio Plaza, [email protected]

Received 2 December 2009; Revised 18 February 2010; Accepted 19 February 2010

Academic Editor: Yingzi Du

Copyright © 2010 A. Paz and A. Plaza. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Remotely sensed hyperspectral sensors provide image data containing rich information in both the spatial and the spectral domain,
and this information can be used to address detection tasks in many applications. In many surveillance applications, the size of the
objects (targets) searched for constitutes a very small fraction of the total search area and the spectral signatures associated to the
targets are generally different from those of the background, hence the targets can be seen as anomalies. In hyperspectral imaging,
many algorithms have been proposed for automatic target and anomaly detection. Given the dimensionality of hyperspectral
scenes, these techniques can be time-consuming and difficult to apply in applications requiring real-time performance. In this
paper, we develop several new parallel implementations of automatic target and anomaly detection algorithms. The proposed
parallel algorithms are quantitatively evaluated using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) system over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex.

1. Introduction

Hyperspectral imaging [1] is concerned with the measurement, analysis, and interpretation of spectra acquired from a given scene (or specific object) at a short, medium, or long distance by an airborne or satellite sensor [2]. Hyperspectral imaging instruments such as the NASA Jet Propulsion Laboratory's Airborne Visible Infrared Imaging Spectrometer (AVIRIS) [3] are now able to record the visible and near-infrared spectrum (wavelength region from 0.4 to 2.5 micrometers) of the reflected light of an area 2 to 12 kilometers wide and several kilometers long using 224 spectral bands. The resulting "image cube" (see Figure 1) is a stack of images in which each pixel (vector) has an associated spectral signature or fingerprint that uniquely characterizes the underlying objects [4]. The resulting data volume typically comprises several GBs per flight [5].

The special properties of hyperspectral data have significantly expanded the domain of many analysis techniques, including (supervised and unsupervised) classification, spectral unmixing, compression, target, and anomaly detection [6–10]. Specifically, the automatic detection of targets and anomalies is highly relevant in many application domains, including those addressed in Figure 2 [11–13]. For instance, automatic target and anomaly detection are considered very important tasks for hyperspectral data exploitation in defense and security applications [14, 15]. During the last few years, several algorithms have been developed for the aforementioned purposes, including the automatic target detection and classification (ATDCA) algorithm [12], an unsupervised fully constrained least squares (UFCLS) algorithm [16], an iterative error analysis (IEA) algorithm [17], or the well-known RX algorithm developed by Reed and Yu for anomaly detection [18]. The ATDCA algorithm finds a set of spectrally distinct target pixel vectors using the concept of orthogonal subspace projection (OSP) [19] in the spectral domain. On the other hand, the UFCLS algorithm generates a set of distinct targets using the concept of least squares-based error minimization. The IEA uses a similar approach, but with a different initialization condition. The RX algorithm is based on the application of a so-called RXD filter, given by the well-known Mahalanobis distance. Many

Figure 1: Concept of hyperspectral imaging. (Panels show reflectance versus wavelength for atmosphere, soil, water, and vegetation.)

Figure 2: Applications of target and anomaly detection. (Examples include defense and intelligence: military target detection, mine detection; public safety: search-and-rescue operations; precision agriculture: crop stress location; forestry: infected trees location; geology: rare mineral detection.)



other target/anomaly detection algorithms have also been proposed in the recent literature, using different concepts such as background modeling and characterization [13, 20].

Depending on the complexity and dimensionality of the input scene [21], the aforementioned algorithms may be computationally very expensive, a fact that limits the possibility of utilizing those algorithms in time-critical applications [5]. In turn, the wealth of spectral information available in hyperspectral imaging data opens ground-breaking perspectives in many applications, including target detection for military and defense/security deployment [22]. In particular, algorithms for detecting (moving or static) targets, or targets that could expand their size (such as propagating fires), often require timely responses for swift decisions that depend upon high computing performance of algorithm analysis [23]. Therefore, in many applications it is of critical importance that automatic target and anomaly detection algorithms complete their analysis tasks quickly enough for practical use. Despite the growing interest in parallel hyperspectral imaging research [24–26], only a few parallel implementations of automatic target and anomaly detection algorithms for hyperspectral data exist in the open literature [14]. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, parallel processing is expected to become a requirement in most remote sensing missions [5], including those related with the detection of anomalous and/or concealed targets. Of particular importance is the design of parallel algorithms able to detect targets and anomalies at subpixel levels [22], thus overcoming the limitations imposed by the spatial resolution of the imaging instrument.

In the past, Beowulf-type clusters of computers have offered an attractive solution for fast information extraction from hyperspectral data sets already transmitted to Earth [27–29]. The goal was to create parallel computing systems from commodity components to satisfy specific requirements for the Earth and space sciences community. However, these systems are generally expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in real-time, that is, at the same time as the data is collected by the sensor. In this regard, an exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data [15, 30]. The speed of graphics hardware doubles approximately every six months, which is much faster than the improvement rate of the CPUs (even those made up of multiple cores) which are interconnected in a cluster. Currently, state-of-the-art GPUs deliver peak performances more than one order of magnitude over high-end microprocessors. The ever-growing computational requirements introduced by hyperspectral imaging applications can fully benefit from this type of specialized hardware and take advantage of the compact size and relatively low cost of these units, which make them appealing for on-board data processing at lower costs than those introduced by other hardware devices [5].

In this paper, we develop and compare several new computationally efficient parallel versions (for clusters and GPUs) of two highly representative algorithms for target (ATDCA) and anomaly detection (RX) in hyperspectral scenes. In the case of ATDCA, we use several distance metrics in addition to the OSP approach implemented in the original algorithm. The considered metrics include the spectral angle distance (SAD) and the spectral information divergence (SID), which introduce an innovation with regards to the distance criterion for target selection originally available in the ATDCA algorithm. The parallel versions are quantitatively and comparatively analyzed (in terms of target detection accuracy and parallel performance) in the framework of a real defense and security application, focused on identifying thermal hot spots (which can be seen as targets and/or anomalies) in a complex urban background, using AVIRIS hyperspectral data collected over the World Trade Center in New York just five days after the terrorist attack of September 11th, 2001.

The remainder of the paper is organized as follows. Section 2 describes the considered target (ATDCA) and anomaly (RX) detection algorithms. Section 3 develops parallel implementations (referred to as P-ATDCA and P-RX, resp.) for clusters of computers. Section 4 develops parallel implementations (referred to as G-ATDCA and G-RX, resp.) for GPUs. Section 5 describes the hyperspectral data set used for experiments and then discusses the experimental results obtained in terms of both target/anomaly detection accuracy and parallel performance, using a Beowulf cluster with 256 processors available at NASA's Goddard Space Flight Center in Maryland and an NVidia GeForce 9800 GX2 GPU. Finally, Section 6 concludes with some remarks and hints at plausible future research.

2. Methods

In this section we briefly describe the target detection algorithms that will be efficiently implemented in parallel (using different high-performance computing architectures) in this work. These algorithms are the ATDCA for automatic target detection and classification and the RX for anomaly detection. In the former case, several distance measures are described for the implementation of the algorithm.

2.1. ATDCA Algorithm. The ATDCA algorithm [12] was developed to find potential target pixels that can be used to generate a signature matrix used in an orthogonal subspace projection (OSP) approach [19]. Let x0 be an initial target signature (i.e., the pixel vector with maximum length). The ATDCA begins with an orthogonal subspace projector specified by the following expression:

PU⊥ = I − U(UT U)⁻¹ UT,    (1)

which is applied to all image pixels, with U = [x0]. It then finds a target signature, denoted by x1, with the maximum projection in ⟨x0⟩⊥, which is the orthogonal complement space linearly spanned by x0. A second target signature x2 can then be found by applying another orthogonal subspace

projector PU⊥ with U = [x0, x1] to the original image, where the target signature that has the maximum orthogonal projection in ⟨x0, x1⟩⊥ is selected as x2. The above procedure is repeated until a set of target pixels {x0, x1, . . . , xt} is extracted, where t is an input parameter to the algorithm.

In addition to the standard OSP approach, we have explored other alternatives in the implementation of ATDCA, given by replacing the PU⊥ operator used in the OSP implementation by one of the distance measures described as follows [31, 32]:

(i) the 1-Norm between two pixel vectors xi and xj, defined by ‖xi − xj‖1,
(ii) the 2-Norm between two pixel vectors xi and xj, defined by ‖xi − xj‖2,
(iii) the Infinity-Norm between two pixel vectors xi and xj, defined by ‖xi − xj‖∞,
(iv) the spectral angle distance (SAD) between two pixel vectors xi and xj, defined by the following expression [4]: SAD(xi, xj) = cos⁻¹(xi · xj / (‖xi‖2 · ‖xj‖2)); as opposed to the previous metrics, SAD is invariant in the presence of illumination interferers, which can provide advantages in terms of target and anomaly detection in complex backgrounds,
(v) the spectral information divergence (SID) between two pixel vectors xi and xj, defined by the following expression [4]: SID(xi, xj) = D(xi‖xj) + D(xj‖xi), where D(xi‖xj) = Σₖ₌₁ⁿ pk · log(pk/qk). Here, we define pk = xi(k)/Σₖ₌₁ⁿ xi(k) and qk = xj(k)/Σₖ₌₁ⁿ xj(k).

2.2. RX Algorithm. The RX algorithm has been widely used in signal and image processing [18]. The filter implemented by this algorithm is referred to as the RX filter (RXF) and defined by the following expression:

δRXF(x) = (x − μ)T K⁻¹ (x − μ),    (2)

where x = [x(0), x(1), . . . , x(n)] is a sample, n-dimensional hyperspectral pixel (vector), μ is the sample mean, and K is the sample data covariance matrix. As we can see, the form of δRXF is actually the well-known Mahalanobis distance [8]. It is important to note that the images generated by the RX algorithm are generally gray-scale images. In this case, the anomalies can be categorized in terms of the value returned by the RXF, so that the pixel with the highest value of δRXF(x) can be considered the first anomaly, and so on.

3. Parallel Implementations for Clusters of Computers

Clusters of computers are made up of different processing units interconnected via a communication network [33]. In previous work, it has been reported that data-parallel approaches, in which the hyperspectral data is partitioned among different processing units, are particularly effective for parallel processing in this type of high-performance computing systems [5, 26, 28]. In this framework, it is very important to define the strategy for partitioning the hyperspectral data. In our implementations, a data-driven partitioning strategy has been adopted as a baseline for algorithm parallelization. Specifically, two approaches for data partitioning have been tested [28].

(i) Spectral-domain partitioning. This approach subdivides the multichannel remotely sensed image into small cells or subvolumes made up of contiguous spectral wavelengths for parallel processing.
(ii) Spatial-domain partitioning. This approach breaks the multichannel image into slices made up of one or several contiguous rows of pixel vectors for parallel processing. In this case, the same pixel vector is always entirely assigned to a single processor, and slabs of spatially adjacent pixel vectors are distributed among the processing nodes (CPUs) of the parallel system. Figure 3 shows two examples of spatial-domain partitioning over 4 processors and over 5 processors, respectively.

Figure 3: Spatial-domain decomposition of a hyperspectral data set into four (a) and five (b) partitions.

Previous experimentation with the above-mentioned strategies indicated that spatial-domain partitioning can significantly reduce inter-processor communication, resulting from the fact that a single pixel vector is never partitioned and communications are not needed at the pixel level [28]. In the following, we assume that spatial-domain decomposition is always used when partitioning the hyperspectral data

cube. The inputs to the considered parallel algorithms are a hyperspectral image cube F with n dimensions, where x denotes the pixel vector of the same scene, and a maximum number of targets to be detected, t. The output in all cases is a set of target pixel vectors {x1, x2, . . . , xt}.

3.1. P-ATDCA. The parallel version of ATDCA adopts the spatial-domain decomposition strategy depicted in Figure 3 for dividing the hyperspectral data cube in master-slave fashion. The algorithm has been implemented in the C++ programming language using calls to MPI, the message passing interface library commonly available for parallel implementations in multiprocessor systems (http://www.mcs.anl.gov/research/projects/mpi). The parallel implementation, denoted by P-ATDCA and summarized by a diagram in Figure 4, consists of the following steps.

(1) The master divides the original image cube F into P spatial-domain partitions. Then, the master sends the partitions to the workers.

(2) Each worker finds the brightest pixel in its local partition (local maximum) using x1 = arg max{xT · x}, where the superscript T denotes the vector transpose operation. Each worker then sends the spatial location of the pixel identified as the brightest one in its local partition back to the master. For illustrative purposes, Figure 5 shows the piece of C++ code that the workers execute in order to send their local maxima to the master node using the MPI function MPI_Send. Here, localmax is the local maximum at the node given by identifier node_id, where node_id = 0 for the master and node_id > 0 for the workers. MPI_COMM_WORLD is the name of the communicator, or collection of processes that are running concurrently in the system (in our case, all the different parallel tasks allocated to the P workers).

(3) Once all the workers have completed their parts and sent their local maxima, the master finds the brightest pixel of the input scene (global maximum), x1, by applying the arg max operator in step 2 to all the pixels at the spatial locations provided by the workers, and selecting the one that results in the maximum score. Then, the master sets U = x1 and broadcasts this matrix to all workers. As shown by Figure 5, this is implemented (in the workers) by a call to MPI_Recv that stops the worker until the value of the global maximum globalmax is received from the master. On the other hand, Figure 6 shows the code designed for calculation of the global maximum at the master. First, the master receives all the local maxima from the workers using the MPI_Gather function. Then, the worker which contains the global maximum out of the local maxima is identified in the for loop. Finally, the global maximum is broadcast to all the workers using the MPI_Bcast function.

(4) After this process is completed, each worker now finds (in parallel) the pixel in its local partition with the maximum orthogonal projection relative to the pixel vectors in U, using a projector given by PU⊥ = I − U(UT U)−1 UT, where I is the identity matrix. The orthogonal space projector PU⊥ is now applied to all pixel vectors in each local partition to identify the most distinct pixels (in the orthogonal sense) with regards to the previously detected ones. Each worker then sends the spatial location of the resulting local pixels to the master node.

(5) The master now finds a second target pixel by applying the PU⊥ operator to the pixel vectors at the spatial locations provided by the workers, and selecting the one which results in the maximum score as follows: x2 = arg max{(PU⊥ x)T (PU⊥ x)}. The master sets U = {x1, x2} and broadcasts this matrix to all workers.

(6) Repeat from step 4 until a set of t target pixels, {x1, x2, . . . , xt}, is extracted from the input data. It should be noted that the P-ATDCA algorithm has not only been implemented using the aforementioned OSP-based approach, but also the different metrics discussed in Section 2.2, by simply replacing the PU⊥ operator by a different distance measure.

3.2. P-RX. Our MPI-based parallel version of the RX algorithm for anomaly detection also adopts the spatial-domain decomposition strategy depicted in Figure 3. The parallel algorithm is given by the following steps, which are graphically illustrated in Figure 7.

(1) The master processor divides the original image cube F into P spatial-domain partitions and distributes them among the workers.

(2) The master calculates the n-dimensional mean vector m concurrently, where each component is the average of the pixel values of each spectral band of the unique set. This vector is formed once all the processors finish their parts. At the same time, the master also calculates the sample spectral covariance matrix K concurrently as the average of all the individual matrices produced by the workers using their respective portions. This procedure is described in detail in Figure 7 (a sketch of how the per-worker contributions can be combined with MPI is given after these steps).

(3) Using the above information, each worker applies (locally) the RXF filter given by the Mahalanobis distance to all the pixel vectors in the local partition as follows: δRXF (x) = (x − m)T K−1 (x − m), and returns the local result to the master. At this point, it is very important to emphasize that, once the sample covariance matrix is calculated in parallel as indicated by Figure 7, the inverse needed for the local computations at the workers is calculated serially at each node.

(4) The master now selects the t pixel vectors with higher associated value of δRXF and uses them to form a final set of targets {x1, x2, . . . , xt}.
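To make step (2) above more concrete, the fragment below is a minimal, illustrative sketch (not the actual P-RX source code) of how the per-worker contributions to m and K could be combined with MPI. All identifiers (local_pixels, n_local, num_bands) are hypothetical, and MPI_Allreduce is used for both reductions for brevity, whereas the implementation described above reduces at the master and broadcasts the results.

#include <mpi.h>
#include <vector>

// Combine per-worker statistics into the global mean m and covariance K.
// local_pixels holds this worker's partition as n_local pixel vectors of
// num_bands components each (row-major); m and K are returned on all ranks.
void global_mean_and_covariance(const std::vector<double>& local_pixels,
                                long n_local, int num_bands,
                                std::vector<double>& m, std::vector<double>& K) {
    long n_total = 0;
    MPI_Allreduce(&n_local, &n_total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    // Per-worker sum of pixel vectors, reduced to the global sum on all nodes.
    std::vector<double> local_sum(num_bands, 0.0);
    for (long i = 0; i < n_local; i++)
        for (int b = 0; b < num_bands; b++)
            local_sum[b] += local_pixels[i * num_bands + b];
    m.assign(num_bands, 0.0);
    MPI_Allreduce(local_sum.data(), m.data(), num_bands,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (int b = 0; b < num_bands; b++) m[b] /= (double)n_total;

    // Per-worker scatter matrix (x - m)(x - m)^T, reduced and averaged to give K.
    std::vector<double> local_scatter((size_t)num_bands * num_bands, 0.0);
    for (long i = 0; i < n_local; i++)
        for (int r = 0; r < num_bands; r++)
            for (int c = 0; c < num_bands; c++)
                local_scatter[(size_t)r * num_bands + c] +=
                    (local_pixels[i * num_bands + r] - m[r]) *
                    (local_pixels[i * num_bands + c] - m[c]);
    K.assign((size_t)num_bands * num_bands, 0.0);
    MPI_Allreduce(local_scatter.data(), K.data(), num_bands * num_bands,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (size_t j = 0; j < (size_t)num_bands * num_bands; j++) K[j] /= (double)n_total;
}

The sketch only illustrates the data flow of the reduction; in the actual P-RX design the master gathers the partial sums, forms m and K, and broadcasts them back to the workers, as shown in Figure 7.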

[Figure 4 summarizes the parallel ATDCA flow with one master and three workers: (1) each worker finds the brightest pixel in its local partition and sends it to the master; (2) the master broadcasts the brightest pixel to all workers; (3) each worker finds the local pixel with maximum distance with regards to the previously selected pixels; (4) the process is repeated over subsequent iterations until a set of t targets has been identified.]

Figure 4: Graphical summary of the parallel implementation of ATDCA algorithm using 1 master processor and 3 slaves.

4. Parallel Implementations for GPUs

GPUs can be abstracted in terms of a stream model, under which all data sets are represented as streams (i.e., ordered data sets) [30]. Algorithms are constructed by chaining so-called kernels, which operate on entire streams, taking one or more streams as inputs and producing one or more streams as outputs. Thereby, data-level parallelism is exposed to hardware, and kernels can be concurrently applied. Modern GPU architectures adopt this model and implement a generalization of the traditional rendering pipeline, which consists of two main stages [5].

(1) Vertex processing. The input to this stage is a stream of vertices from a 3D polygonal mesh. Vertex processors transform the 3D coordinates of each vertex of the mesh into a 2D screen position and apply lighting to determine their colors (this stage is fully programmable).

(2) Fragment processing. In this stage, the transformed vertices are first grouped into rendering primitives, such as triangles, and scan-converted into a stream of pixel fragments. These fragments are discrete portions of the triangle surface that correspond to the pixels of the rendered image. Apart from identifying constituent fragments, this stage also interpolates attributes stored at the vertices, such as texture coordinates, and stores the interpolated values at each fragment. Arithmetical operations and texture lookups are then performed by fragment processors to determine the ultimate color for the fragment. For this purpose, texture memories can be indexed with different texture coordinates, and texture values can be retrieved from multiple textures.

It should be noted that fragment processors currently support instructions that operate on vectors of four RGBA components (Red/Green/Blue/Alpha channels) and include dedicated texture units that operate with a deeply pipelined texture cache. As a result, an essential requirement for mapping nongraphics algorithms onto GPUs is that the data structure can be arranged according to a stream-flow model, in which kernels are expressed as fragment programs and data streams are expressed as textures. Using C-like, high-level languages such as NVidia compute unified device architecture (CUDA), programmers can write fragment programs to implement general-purpose operations. CUDA is a collection of C extensions and a runtime library (http://www.nvidia.com/object/cuda_home.html). CUDA's functionality primarily allows a developer to write C functions to be executed on the GPU. CUDA also includes memory management and execution configuration, so that a developer can control the number of GPU processors and processing threads that are to be invoked during a function's execution.

The first issue that needs to be addressed is how to map a hyperspectral image onto the memory of the GPU. Since the size of hyperspectral images usually exceeds the capacity of such memory, we split them into multiple spatial-domain partitions made up of entire pixel vectors; that is, as in our cluster-based implementations, each spatial-domain partition incorporates all the spectral information on a localized spatial region and is composed of spatially adjacent pixel vectors. Each spatial-domain partition is further divided into 4-band tiles (called spatial-domain tiles), which are arranged in different areas of a 2D texture [30]. Such partitioning allows us to map four consecutive spectral bands onto the RGBA color channels of a texture element. Once the procedure adopted for data partitioning has been described, we provide additional details about the GPU implementations of RX and ATDCA algorithms, referred to hereinafter as G-RX and G-ATDCA, respectively.

4.1. G-ATDCA. Our GPU version of the ATDCA algorithm for target detection is given by the following steps.

(1) Once the hyperspectral image is mapped onto the GPU memory, a structure (grid) in which the number of blocks equals the number of lines in the hyperspectral image and the number of threads equals the number of samples is created, thus making sure that all pixels in the hyperspectral image are processed in parallel (if this is not possible due to limited memory resources in the GPU, CUDA automatically performs several iterations, each of which processes as many pixels as possible in parallel).

(2) Using the aforementioned structure, calculate the brightest pixel x1 in the original hyperspectral scene by means of a CUDA kernel which performs part of

the calculations to compute x1 = arg max{xT · x} after computing (in parallel) the dot product between each pixel vector x in the original hyperspectral image and its own transposed version xT. For illustrative purposes, Figure 8 shows a portion of code which includes the definition of the number of blocks numBlocks and the number of processing threads per block numThreadsPerBlock, and then calls the CUDA kernel BrightestPixel that computes the value of x1. Here, d_bright_matrix is the structure that stores the output of the computation xT · x for each pixel. Figure 9 shows the code of the CUDA kernel BrightestPixel, in which each different thread computes a different value of xT · x for a different pixel (each thread is given by an identification number idx, and there are as many concurrent threads as pixels in the original hyperspectral image). Once all the concurrent threads complete their calculations, the G-ATDCA implementation simply computes the value in d_bright_matrix with maximum associated value and obtains the pixel in that position, labeling the pixel as x1. Although this operation is inevitably sequential, it is performed in the GPU.

(3) Once the brightest pixel in the original hyperspectral image has been identified as the first target U = x1, the ATDCA algorithm is executed in the GPU by means of another kernel in which the number of blocks equals the number of lines in the hyperspectral image and the number of threads equals the number of samples, thus making sure that all pixels in the hyperspectral image are processed in parallel. The concurrent threads find (in parallel) the values obtained after applying the OSP-based projection operator PU⊥ = I − U(UT U)−1 UT to each pixel (using the structure d_bright_matrix to store the resulting projection values), and then the G-ATDCA algorithm finds a second target pixel from the values stored in d_bright_matrix as follows: x2 = arg max{(PU⊥ x)T (PU⊥ x)}. The procedure is repeated until a set of t target pixels, {x1, x2, . . . , xt}, is extracted from the input data. Although in this description we have only referred to the OSP-based operation, the different metrics discussed in Section 2.2 have been implemented by devising different kernels which can be replaced in our G-ATDCA implementation in plug and play fashion in order to modify the distance measure used by the algorithm to identify new targets along the process.

4.2. G-RX. Our GPU version of the RX algorithm for anomaly detection is given by the following steps.

(1) Once the hyperspectral image is mapped onto the GPU memory, a structure (grid) containing n blocks of threads, each containing n processing threads, is defined using CUDA. As a result, a total of n × n processing threads are available.

(2) Using the aforementioned structure, calculate the sample spectral covariance matrix K in parallel by means of a CUDA kernel which performs the calculations needed to compute δRXF (x) = (x − m)T K−1 (x − m) for each pixel x. For illustrative purposes, Figure 10 shows a portion of code which includes the initialization of matrix K in the GPU memory using cudaMemset, a call to the CUDA kernel RXGPU designed to calculate δRXF, and finally a call to cudaThreadSynchronize to make sure that the initiated threads are synchronized. Here, d_hyper_image is the original hyperspectral image, d_K denotes the matrix K, and num_lines, num_samples, and num_bands respectively denote the number of lines, samples, and bands of the original hyperspectral image. It should be noted that the RXGPU kernel implements the Gauss-Jordan elimination method for calculating K−1. We recall that the entire image data is allocated in the GPU memory, and therefore it is not necessary to partition the data as it was the case in the cluster-based implementation. In fact, this is one of the main advantages of GPUs over clusters of computers (GPUs are shared memory architectures, while clusters are generally distributed memory architectures in which message passing is needed to distribute the workload among the workers). A particularity of the Gauss-Jordan elimination method is that it converts the source matrix into an identity matrix by pivoting, where the pivot is the element in the diagonal of the matrix by which the other elements are divided. The GPU naturally parallelizes the pivoting operation by applying the calculation at the same time to many rows and columns, and hence the inverse operation is calculated in parallel in the GPU.

(3) Once δRXF has been computed (in parallel) for every pixel x in the original hyperspectral image, a final (also parallel) step selects the t pixel vectors with higher associated value of δRXF (stored in d_result) and uses them to form a final set of targets {x1, x2, . . . , xt}. This is done using the portion of code illustrated in Figure 11, which calls a CUDA kernel RXResult which implements this functionality. Here, the number of blocks numBlocks equals the number of lines in the hyperspectral image, while the number of threads numThreadsPerBlock equals the number of samples, thus making sure that all pixels in the hyperspectral image are processed in parallel (if this is not possible due to limited memory resources in the GPU, CUDA automatically performs several iterations, each of which processes as many pixels as possible in parallel).

5. Experimental Results

This section is organized as follows. In Section 5.1 we describe the AVIRIS hyperspectral data set used in our experiments. Section 5.2 describes the parallel computing

if ((node_id > 0) && (node_id < num_nodes)) {
    // Worker sends the local maximum to the master node
    MPI_Send(&localmax, 1, MPI_DOUBLE, 0, node_id, MPI_COMM_WORLD);
    // Worker waits until it receives the global maximum from the master
    MPI_Recv(&globalmax, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
}

Figure 5: Portion of the code of a worker in our P-ATDCA implementation, in which the worker sends a precomputed local maximum to
the master and waits for a global maximum from the master.

// The master processor performs the following operations:

max_aux[0] = max;
max_partial = max;
globalmax = 0;

// The master receives the local maxima from the workers
MPI_Gather(&localmax, 1, MPI_DOUBLE, max_aux, 1, MPI_DOUBLE, 0,
           MPI_COMM_WORLD);

// MPI_Gather is equivalent to:
// for (i = 1; i < num_nodes; i++)
//     MPI_Recv(&max_aux[i], 1, MPI_DOUBLE, i, MPI_ANY_TAG,
//              MPI_COMM_WORLD, &status);

// The worker with the global maximum is identified
for (i = 1; i < num_nodes; i++) {
    if (max_partial < max_aux[i]) {
        max_partial = max_aux[i];
        globalmax = i;
    }
}

// Master sends all workers the id of the worker with the global maximum
MPI_Bcast(&globalmax, 1, MPI_INT, 0, MPI_COMM_WORLD);

// MPI_Bcast is equivalent to:
// for (i = 1; i < num_nodes; i++)
//     MPI_Send(&globalmax, 1, MPI_INT, i, 0, MPI_COMM_WORLD);

Figure 6: Portion of the code of the master in our P-ATDCA implementation, in which the master receives the local maxima from the
workers, computes a global maximum, and sends all workers the id of the worker which contains the global maximum.

platforms used for experimental evaluation, which comprise a Beowulf cluster at NASA's Goddard Space Flight Center in Maryland and an NVidia GeForce 9800 GX2 GPU. Section 5.3 discusses the target and anomaly detection accuracy of the parallel algorithms when analyzing the hyperspectral data set described in Section 5.1. Section 5.4 describes the parallel performance results obtained after implementing the P-ATDCA and P-RX algorithms on the Beowulf cluster. Section 5.5 describes the parallel performance results obtained after implementing the G-ATDCA and G-RX algorithms on the GPU. Finally, Section 5.6 provides a comparative assessment and general discussion of the different parallel algorithms presented in this work in light of the specific characteristics of the considered parallel platforms (clusters versus GPUs).

5.1. Data Description. The image scene used for experiments in this work was collected by the AVIRIS instrument, which was flown by NASA's Jet Propulsion Laboratory over the World Trade Center (WTC) area in New York City on September 16, 2001, just five days after the terrorist attacks that collapsed the two main towers and other buildings in the WTC complex. The full data set selected for experiments consists of 614 × 512 pixels, 224 spectral bands, and a total size of (approximately) 140 MB. The spatial resolution is 1.7 meters per pixel. The leftmost part of Figure 12 shows a false color composite of the data set selected for experiments using the 1682, 1107, and 655 nm channels, displayed as red, green, and blue, respectively. Vegetated areas appear green in the leftmost part of Figure 12, while burned areas appear dark gray. Smoke coming from the WTC area (in the red rectangle) and going down to south Manhattan appears bright blue due to high spectral reflectance in the 655 nm channel.

Extensive reference information, collected by U.S. Geological Survey (USGS), is available for the WTC scene (http://speclab.cr.usgs.gov/wtc). In this work, we use

[Figure 7 flow diagram: the master reads the hyperspectral data cube and divides it into P spatial-domain partitions, which are distributed to the workers. Each worker computes a local mean component mk from the pixels in its local partition and returns it to the master, which forms the mean vector m by adding up the individual components and broadcasts m to the workers. Each worker then subtracts m from its local pixels to form a local covariance component and returns it to the master, which forms the covariance matrix K as the average of all individual matrices returned by the workers and broadcasts K. Finally, each worker applies the Mahalanobis distance to the pixel vectors x in its local partition and returns the result to the master, which produces an output from which the t pixels with maximum value are selected.]

Figure 7: Parallel implementation of the RX algorithm in clusters of computers.
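Because the inverse of K is computed serially at each worker once the covariance matrix has been assembled (and the G-RX kernel described later also relies on Gauss-Jordan elimination), the following is a minimal, illustrative sketch of a serial Gauss-Jordan inversion with partial pivoting. It is not the authors' code; the row-major std::vector layout and the tolerance value are assumptions made only for this example.

#include <algorithm>
#include <cmath>
#include <vector>

// Replace the dense n x n matrix A (row-major) by its inverse using
// Gauss-Jordan elimination with partial pivoting; returns false if a
// pivot is numerically zero (i.e., A is singular for practical purposes).
bool gauss_jordan_inverse(std::vector<double>& A, int n) {
    std::vector<double> I((size_t)n * n, 0.0);
    for (int i = 0; i < n; i++) I[(size_t)i * n + i] = 1.0;
    for (int col = 0; col < n; col++) {
        // Partial pivoting: bring the largest remaining entry to the diagonal.
        int pivot = col;
        for (int r = col + 1; r < n; r++)
            if (std::fabs(A[(size_t)r * n + col]) > std::fabs(A[(size_t)pivot * n + col])) pivot = r;
        if (std::fabs(A[(size_t)pivot * n + col]) < 1e-12) return false;
        if (pivot != col)
            for (int c = 0; c < n; c++) {
                std::swap(A[(size_t)pivot * n + c], A[(size_t)col * n + c]);
                std::swap(I[(size_t)pivot * n + c], I[(size_t)col * n + c]);
            }
        // Normalize the pivot row and eliminate the column from all other rows.
        double d = A[(size_t)col * n + col];
        for (int c = 0; c < n; c++) { A[(size_t)col * n + c] /= d; I[(size_t)col * n + c] /= d; }
        for (int r = 0; r < n; r++) {
            if (r == col) continue;
            double f = A[(size_t)r * n + col];
            for (int c = 0; c < n; c++) {
                A[(size_t)r * n + c] -= f * A[(size_t)col * n + c];
                I[(size_t)r * n + c] -= f * I[(size_t)col * n + c];
            }
        }
    }
    A = I;
    return true;
}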

// Define the number of blocks and the number of processing threads per block
int numBlocks = num_lines;
int numThreadsPerBlock = num_samples;

// Calculate the intensity of each pixel in the original image and store the resulting values in a structure
BrightestPixel<<<numBlocks, numThreadsPerBlock>>>(d_hyper_image,
    d_bright_matrix, num_bands, lines_samples);

Figure 8: Portion of code which calls the CUDA kernel BrightestPixel that computes (in parallel) the brightest pixel in the scene in the
G-ATDCA implementation.

Table 1: Properties of the thermal hot spots reported in the rightmost part of Figure 12.

Hot spot   Latitude (North)   Longitude (West)   Temperature (Kelvin)   Area (square meters)
“A”   40°42′47.18″   74°00′41.43″   1000   0.56
“B”   40°42′47.14″   74°00′43.53″   830   0.08
“C”   40°42′42.89″   74°00′48.88″   900   0.80
“D”   40°42′41.99″   74°00′46.94″   790   0.80
“E”   40°42′40.58″   74°00′50.15″   710   0.40
“F”   40°42′38.74″   74°00′46.70″   700   0.40
“G”   40°42′39.94″   74°00′45.37″   1020   0.04
“H”   40°42′38.60″   74°00′43.51″   820   0.08

a U.S. Geological Survey thermal map (http://pubs.usgs.gov/of/2001/ofr-01-0429/hotspot.key.tgif.gif) which shows the target locations of the thermal hot spots at the WTC area, displayed as bright red, orange, and yellow spots at the rightmost part of Figure 12. The map is centered at the region where the towers collapsed, and the temperatures of the targets range from 700 F to 1300 F. Further information available from USGS about the targets (including location, estimated size, and temperature) is reported in Table 1. As shown by Table 1, all the targets are subpixel in size since the spatial resolution of a single pixel is 1.7 square meters. The thermal map displayed in the rightmost part of Figure 12 will be used in this work as ground-truth to validate the target detection accuracy of the proposed parallel algorithms and their respective serial versions.

5.2. Parallel Computing Platforms. The parallel computing architectures used in experiments are the Thunderhead

Beowulf cluster at NASA's Goddard Space Flight Center (NASA/GSFC) and an NVidia GeForce 9800 GX2 GPU.

(i) The Thunderhead Beowulf cluster is composed of 2.4 GHz Intel Xeon nodes, each with 1 GB of memory and a scratch area of 80 GB of memory shared among the different processors (http://newton.gsfc.nasa.gov/thunderhead/). The total peak performance of the system is 2457.6 Gflops. Along with the 256-processor computer core (out of which only 32 were available to us at the time of experiments), Thunderhead has several nodes attached to the core with 2 GHz optical fibre Myrinet [27]. The parallel algorithms tested in this work were run from one of such nodes, called thunder1 (used as the master processor in our tests). The operating system used at the time of experiments was Linux RedHat 8.0, and MPICH was the message-passing library used (http://www.mcs.anl.gov/research/projects/mpi/mpich1). Figure 13(a) shows a picture of the Thunderhead Beowulf cluster.

(ii) The NVidia GeForce 9800 GX2 GPU contains two G92 graphics processors, each with 128 individual scalar processor (SP) cores and 512 MB of fast DDR3 memory (http://www.nvidia.com/object/product_geforce_9800gx2_us.html). The SPs are clocked at 1.5 GHz, and each can perform a fused multiply-add every clock cycle, which gives the card a theoretical peak performance of 768 GFlop/s. The GPU is connected to a CPU Intel Q9450 with 4 cores, which uses a motherboard ASUS Striker II NSE (with NVidia 790i chipset) and 4 GB of RAM memory at 1333 MHz. Hyperspectral data are moved to and from the host CPU memory by DMA transfers over a PCI Express bus. Figure 13(b) shows a picture of the GeForce 9800 GX2 GPU.

5.3. Analysis of Target Detection Accuracy. It is first important to emphasize that our parallel versions of ATDCA and RX (implemented both for clusters and GPUs) provide exactly the same results as the serial versions of the same algorithms, implemented using the Intel C/C++ compiler and optimized via compilation flags to exploit data locality and avoid redundant computations. As a result, in order to refer to the target and anomaly detection results provided by the parallel versions of the ATDCA and RX algorithms, we will refer to them as PG-ATDCA and PG-RX in order to indicate that the same results were achieved by the MPI-based and CUDA-based implementations for clusters and GPUs, respectively. At the same time, these results were also exactly the same as those achieved by the serial implementation and, hence, the only difference between the considered algorithms (serial and parallel) is the time they need to complete their calculations, which varies depending on the computer architecture in which they are run.

Table 2 shows the spectral angle distance (SAD) values (in degrees) between the most similar target pixels detected by PG-RX and PG-ATDCA (implemented using different distance metrics) and the pixel vectors at the known target positions, labeled from “A” to “H” in the rightmost part of Figure 12. The lower the SAD score, the more similar the spectral signatures associated to the targets. In all cases, the number of target pixels to be detected was set to t = 30 after calculating the virtual dimensionality (VD) of the data [34]. As shown by Table 2, both the PG-ATDCA and PG-RX extracted targets were similar, spectrally, to the known ground-truth targets. The PG-RX was able to perfectly detect (SAD of 0 degrees, represented in the table as 0◦) the targets labeled as “A,” “C,” and “D” (all of them relatively large in size and with high temperature), while the PG-ATDCA implemented using OSP was able to perfectly detect the targets labeled as “C” and “D.” Both the PG-RX and PG-ATDCA had more difficulties in detecting very small targets.

In the case of the PG-ATDCA implemented with a distance measure other than OSP we realized that, in many cases, some of the target pixels obtained were repeated. To solve this issue, we developed a method called relaxed pixel method (RPM) which simply removes a detected target pixel from the scene so that it cannot be selected in subsequent iterations. Table 3 shows the SAD between the most similar target pixels detected by P-ATDCA (implemented using the aforementioned RPM strategy) and the pixel vectors at the known target positions. It should be noted that the OSP distance implements the RPM strategy by definition and, hence, the results reported for PG-ATDCA in Table 3 are the same as those reported in Table 2, in which the RPM strategy is not considered. As shown by Table 3, most measured SAD-based scores (in degrees) are lower when the RPM strategy is used, in particular, for targets of moderate size such as “A,” “E,” or “F.” The detection results were also improved for the target with highest temperature, that is, the one labeled as “G.” This indicated that the proposed RPM strategy can improve the detection results despite its apparent simplicity.

Finally, Table 4 shows a summary of the detection results obtained by the PG-RX and PG-ATDCA (with and without RPM strategy). It should be noted that it was not necessary to apply the RPM strategy to the PG-RX algorithm since this algorithm selects the final targets according to their value of δRXF (x) (the first pixel selected is the one with the highest value of the RXF, then the one with the second highest value of the RXF, and so on). Hence, repetitions of targets are not possible in this case. In the table, the column “detected” lists those targets that were exactly identified (at the same spatial coordinates) with regards to the ground-truth, resulting in a SAD value of exactly 0◦ when comparing the associated spectral signatures. On the other hand, the column “similar” lists those targets that were identified with a SAD value below 30◦ (a reasonable spectral similarity threshold bearing in mind the great complexity of the scene, which comprises many different spectral classes). As shown by Table 4, the RPM strategy generally improved the results provided by the PG-ATDCA algorithm, both in terms of the number of detected targets and also in terms of the number of similar targets, in particular, when the algorithm was implemented using the SAD and SID distances.
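For reference, the SAD scores reported in Tables 2 and 3 correspond to the spectral angle between a detected pixel and the ground-truth signature. The following short function is a minimal, illustrative sketch of that computation (not part of the original implementation); the identifier names are hypothetical.

#include <cmath>

// Spectral angle distance, in degrees, between pixel vector x and
// reference signature y, both with num_bands components.
double spectral_angle_degrees(const float* x, const float* y, int num_bands) {
    const double PI = 3.14159265358979323846;
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (int b = 0; b < num_bands; b++) {
        dot += (double)x[b] * y[b];
        nx  += (double)x[b] * x[b];
        ny  += (double)y[b] * y[b];
    }
    double c = dot / (std::sqrt(nx) * std::sqrt(ny));
    if (c > 1.0) c = 1.0;     // guard against rounding outside [-1, 1]
    if (c < -1.0) c = -1.0;
    return std::acos(c) * 180.0 / PI;
}

A score of 0 degrees therefore indicates identical spectral shape, which is how the “detected” targets in Table 4 are defined.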

__global__ void BrightestPixel(short int *d_hyper_image,
    float *d_bright_matrix, int num_bands, long int lines_samples)
{
    // The original hyperspectral image is stored in d_hyper_image
    int k;
    float bright = 0, value;

    // Obtain the thread id and assign an operation to each processing thread
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    for (k = 0; k < num_bands; k++) {
        value = d_hyper_image[idx + (k * lines_samples)];
        bright += value;
    }

    d_bright_matrix[idx] = bright;
}

Figure 9: CUDA kernel BrightestPixel that computes (in parallel) the brightest pixel in the scene in the G-ATDCA implementation.
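The reduction that follows this kernel (locating the maximum of d_bright_matrix and hence the position of x1) is described in the text as being executed on the GPU. One possible way to express such a device-side reduction, shown here only as an illustrative sketch and not as the authors' implementation, is through Thrust's max_element; the helper name is hypothetical.

#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

// Return the linear index of the brightest pixel once d_bright_matrix
// has been filled by the BrightestPixel kernel; the reduction runs on the GPU.
long int argmax_brightness(float* d_bright_matrix, long int lines_samples) {
    thrust::device_ptr<float> p(d_bright_matrix);
    thrust::device_ptr<float> it = thrust::max_element(p, p + lines_samples);
    return it - p;
}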

// Initialization of matrix K
cudaMemset(d_K, 0, size2InBytes);

// Calculation of RX filter
RXGPU<<< size, size >>>(d_hyper_image, d_K, lines_samples,
    num_samples, num_lines, num_bands);

cudaThreadSynchronize();

Figure 10: Portion of code which calls the CUDA kernel RXGPU designed to calculate the RX filter (in parallel) in the G-RX implementation.
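The RXGPU kernel shown in Figure 10 builds K and its inverse in addition to evaluating the filter. As a simpler, illustrative sketch (not the authors' kernel), the fragment below shows only the per-pixel evaluation of the Mahalanobis distance once the mean vector and the inverse covariance matrix are already available in GPU memory, assuming one thread per pixel, a band-sequential image layout, and hypothetical identifier names; it is deliberately unoptimized.

__global__ void MahalanobisKernel(const short int* d_hyper_image,
                                  const float* d_mean, const float* d_Kinv,
                                  float* d_result, long int lines_samples,
                                  int num_bands) {
    long int idx = (long int)blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= lines_samples) return;
    float score = 0.0f;
    for (int r = 0; r < num_bands; r++) {
        // diff_r = x(r) - m(r); accumulate diff_r * sum_c Kinv(r,c) * diff_c
        float diff_r = (float)d_hyper_image[idx + (long int)r * lines_samples] - d_mean[r];
        float acc = 0.0f;
        for (int c = 0; c < num_bands; c++) {
            float diff_c = (float)d_hyper_image[idx + (long int)c * lines_samples] - d_mean[c];
            acc += d_Kinv[r * num_bands + c] * diff_c;
        }
        score += diff_r * acc;
    }
    d_result[idx] = score;   // value of the RX filter for this pixel
}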

// Calculation of final G-RX result

// numBlocks = num_lines;
// numThreadsPerBlock = num_samples;

RXResult<<< numBlocks, numThreadsPerBlock >>>(d_hyper_image, d_K,
    d_result, lines_samples, num_samples, num_lines, num_bands);

cudaThreadSynchronize();

Figure 11: Portion of code which calls the CUDA kernel RXResult designed to obtain a final set of targets (in parallel) in the G-RX
implementation.

Table 2: Spectral angle values (in degrees) between target pixels and known ground targets for PG-ATDCA and PG-RX.

Algorithm A B C D E F G H
PG-ATDCA (OSP) 9,17◦ 13,75◦ 0,00◦ 0,00◦ 20,05◦ 28,07◦ 21,20◦ 21,77◦
PG-ATDCA (1-Norm) 9,17◦ 24,06◦ 0,00◦ 16,04◦ 37,82◦ 42,97◦ 38,39◦ 35,52◦
PG-ATDCA (2-Norm) 9,17◦ 24,06◦ 0,00◦ 16,04◦ 37,82◦ 42,97◦ 38,39◦ 25,78◦
PG-ATDCA (∞-Norm) 8,59◦ 22,35◦ 0,00◦ 13,75◦ 27,50◦ 30,94◦ 21,20◦ 26,36◦
PG-ATDCA (SAD) 9,17◦ 22,35◦ 0,00◦ 14,32◦ 38,39◦ 32,09◦ 25,21◦ 29,79◦
PG-ATDCA (SID) 9,17◦ 24,06◦ 0,00◦ 16,04◦ 39,53◦ 32,09◦ 22,92◦ 20,05◦
PG-RX 0,00◦ 12,03◦ 0,00◦ 0,00◦ 18,91◦ 28,07◦ 33,80◦ 40,68◦


Figure 12: False color composition of an AVIRIS hyperspectral image collected by NASA's Jet Propulsion Laboratory over lower Manhattan on September 16, 2001 (left). Location of thermal hot spots in the fires observed in World Trade Center area, available online: http://pubs.usgs.gov/of/2001/ofr-01-0429/hotspot.key.tgif.gif (right).

(a) Beowulf cluster (b) GPU

Figure 13: (a) Thunderhead Beowulf cluster at NASA’s Goddard Space Flight Center in Maryland. (b) NVidia GeForce 9800 GX2 GPU.

Table 3: Spectral angle values (in degrees) between target pixels and known ground targets for PG-ATDCA (implemented using RPM). The
results reported for the OSP distance in PG-ATDCA are the same as those reported for the same distance in Table 2 since OSP implements
the RPM strategy by definition.

Algorithm A B C D E F G H
PG-ATDCA (OSP) 9,17◦ 13,75◦ 0,00◦ 0,00◦ 20,05◦ 28,07◦ 21,20◦ 21,77◦
PG-ATDCA (1-Norm) 0,00◦ 12,03◦ 0,00◦ 10,89◦ 22,35◦ 31,51◦ 34,38◦ 30,94◦
PG-ATDCA (2-Norm) 0,00◦ 14,90◦ 0,00◦ 10,89◦ 27,50◦ 31,51◦ 34,95◦ 25,78◦
PG-ATDCA (∞-Norm) 8,59◦ 14,90◦ 0,00◦ 13,75◦ 25,78◦ 30,94◦ 20,05◦ 26,36◦
PG-ATDCA (SAD) 0,00◦ 14,90◦ 0,00◦ 11,46◦ 29,79◦ 28,07◦ 22,92◦ 29,79◦
PG-ATDCA (SID) 0,00◦ 17,19◦ 0,00◦ 10,89◦ 30,94◦ 28,07◦ 22,35◦ 21,77◦

5.4. Parallel Performance in the Thunderhead Cluster. In this subsection we evaluate the parallel performance of both P-ATDCA and P-RX in a Beowulf cluster. Table 5 shows the processing times in seconds for several multiprocessor versions of P-RX and P-ATDCA using different numbers of processors (CPUs) on the Thunderhead Beowulf cluster at NASA's Goddard Space Flight Center. As shown by Table 5, when 32 processors were used, the P-ATDCA (implemented using SAD) was able to finalize in about 19 seconds, thus clearly outperforming the sequential version, which takes 4 minutes of computation in one Thunderhead processor. In the case of P-RX, two versions were implemented: using communications when needed (communicative) and using redundant computations to reduce communications (independent), obtaining similar results in both cases. Here, the processing time using 32 CPUs was only about 4 seconds, while the sequential time measured in one CPU was above one minute.

Table 6 reports the speedups (number of times that the parallel version was faster than the sequential one as the number of processors was increased) achieved by multiprocessor runs of the P-ATDCA algorithm (implemented using different distances) and P-RX. It can be seen that P-ATDCA (implemented using OSP and SID) scaled better than the two considered versions of P-RX. This has to do with the number of sequential computations involved in P-RX, as indicated in Figure 7. Another reason is the fact that, although the sample covariance matrix K required by this algorithm is calculated in parallel, its inverse is calculated serially at each node. In this regard, we believe that the speedups reported for the different implementations of P-RX in Table 6 could be improved even more if not only the calculation of the sample covariance matrix but also the inverse had been computed in parallel.

For illustrative purposes, the speedups achieved by the different implementations of P-ATDCA and P-RX are graphically illustrated in Figure 14. The speedup plots in Figure 14(a) reveal that P-ATDCA scaled better when OSP and SID were used as baseline distance metrics for implementation, resulting in speedups close to linear (although these distance measures introduced higher processing times, as indicated by Table 5). On the other hand, Figure 14(b) reveals that both versions of P-RX resulted in speedup plots that started to flatten out, departing from linear speedup, from 16 CPUs onwards. This is probably due to the fact that the ratio of communications to computations increases as the partition size is made very small, an effect that is motivated by the high number of communications required by P-RX as indicated by Figure 7.

Finally, Table 7 shows the load balancing scores for all considered parallel algorithms. The imbalance is defined as D = Max / Min, where Max and Min are the maximum and minimum processor run times, respectively. Therefore, perfect balance is achieved when D = 1. As we can see from Table 7, all the considered parallel algorithms were able to provide values of D very close to optimal in the considered cluster, indicating that our implementations of P-ATDCA and P-RX achieved highly satisfactory load balance in all cases.

5.5. Parallel Performance in the GeForce 9800 GX2 GPU. In this subsection we evaluate the parallel performance of both G-ATDCA and G-RX in the NVidia GeForce 9800 GX2 GPU. Table 8 shows the execution times measured after processing the full hyperspectral scene (614 × 512 pixels and 224 spectral bands) on the CPU and on the GPU, along with the speedup achieved in each case. The C function clock() was used for timing the CPU implementation, and the CUDA timer was used for the GPU implementation. The time measurement was started right after the hyperspectral image file was read to the CPU memory and stopped right after the results of the target/anomaly detection algorithm were obtained and stored in the CPU memory.

From Table 8, it can be seen that the G-ATDCA implemented using the OSP distance scaled slightly worse than the other implementations. This suggests that the matrix inverse and transpose operations implemented by the PU⊥ orthogonal projection operator can still be optimized for efficient execution in the GPU. In this case, the speedup achieved by the GPU implementation over the optimized CPU implementation was only 3,4. When the G-ATDCA was implemented using the 1-Norm, 2-Norm, and ∞-Norm distances, the speedup increased to values around 10, for a total processing time below 10 seconds in the considered GPU. This is a significant accomplishment bearing in mind that just one GPU is used to parallelize the algorithm (in order to achieve similar speedups in the Thunderhead cluster, at least 16 CPUs were required). Table 8 also reveals that the speedups achieved by the GPU implementation were slightly increased when the SAD distance was used to implement the G-ATDCA. This suggests that the spectral angle calculations required for this distance can be efficiently parallelized in the considered GPU (in particular, calculation of cosines in the GPU was very efficient).

It is also clear from Table 8 that the best speedup results were obtained when the SID distance was used to implement the G-ATDCA. Specifically, we measured a speedup of 71,91 when comparing the processing time measured in the GPU with the processing time measured in the CPU. This is mainly due to the fact that the logarithm operations required to implement the SID distance can be executed very effectively in the GPU. Although the speedup achieved in this case is no less than impressive, the final processing time for the G-ATDCA implemented using this distance is still above two minutes after parallelization, which indicates that the use of the SID distance introduces additional complexity in both the serial and parallel implementations of the ATDCA algorithm. Similar comments apply to the parallel version of G-RX, which also takes more than 2 minutes to complete its calculations after parallelization. This is due to swapping problems in both the serial and parallel implementations (i.e., an excessive traffic between disk pages and memory pages was observed, probably resulting from an ineffective allocation of resources in our G-RX implementation). This aspect should be improved in future developments of the parallel G-RX algorithm.

Summarizing, the experiments reported on Table 8 indicate that the considered GPU can significantly increase

Table 4: Summary of detection results achieved by the PG-ATDCA and PG-RX, with and without the RPM strategy.

Algorithm Detected Similar Detected (RPM) Similar (RPM)


PG-ATDCA (OSP) C, D A, B, E, F, G C, D A, B, E, F, G
PG-ATDCA (1-Norm) C A, B, D A, C B, D, E
PG-ATDCA (2-Norm) C A, B, D, H A, C B, D, E, H
PG-ATDCA (∞-Norm) C A, B, D, E, G, H C A, B, D, E, G, H
PG-ATDCA (SAD) C A, B, D, G, H A, C B, D, E, F, G, H
PG-ATDCA (SID) C A, B, D, G, H A, C B, D, F, G, H
PG-RX A, C, D B, E, F — —

Table 5: Processing times in seconds measured for P-ATDCA (implemented using different distance measures) and P-RX, using different
numbers of CPUs on the Thunderhead Beowulf cluster.
Algorithm 1 CPU 2 CPUs 4 CPUs 8 CPUs 16 CPUs 32 CPUs
P-ATDCA (OSP) 1263,21 879,06 447,47 180,94 97,90 49,54
P-ATDCA (1-Norm) 260,43 191,33 97,28 50,00 27,72 19,80
P-ATDCA (2-Norm) 235,78 182,74 94,38 49,42 25,465 19,283
P-ATDCA (∞-Norm) 268,95 187,92 99,28 50,96 27,75 22,00
P-ATDCA (SAD) 241,93 187,83 96,14 49,24 25,35 19,00
P-ATDCA (SID) 2267,60 1148,80 579,51 305,32 165,46 99,37
P-RX (Communicative) 68,86 32,46 16,88 9,14 5,67 4,67
P-RX (Independent) 68,86 32,70 16,82 8,98 5,46 4,42

the performance of the considered algorithms, providing speedup ratios on the order of 10 for G-ATDCA (for most of the considered distances) and on the order of 14 for G-RX, although this algorithm should still be further optimized for more efficient execution on GPUs. When G-ATDCA was implemented using OSP as a baseline distance, the speedup decreased since the parallel matrix inverse and transpose operations are not fully optimized in our GPU implementation. On the other hand, when G-ATDCA was implemented using SID as a baseline distance, the speedup rose to over 71 due to the optimized capability of the considered GPU to compute logarithm-type operations in parallel. Overall, the best processing times achieved in experiments were on the order of 9 seconds. These response times are not strictly in realtime since the cross-track line scan time in AVIRIS, a push-broom instrument [3], is quite fast (8.3 milliseconds). This introduces the need to process the full image cube (614 lines, each made up of 512 pixels with 224 spectral bands) in about 5 seconds to fully achieve real-time performance. Although the proposed implementations can still be optimized, Table 8 indicates that significant speedups can be obtained in most cases using only one GPU device, with very few on-board restrictions in terms of cost, power consumption, and size, which are important when defining mission payload (defined as the maximum load allowed in the airborne or satellite platform that carries the imaging instrument).

5.6. Discussion. In the previous subsections we have reported performance data for parallel target and anomaly detection algorithms implemented on a Beowulf cluster and a GPU. From the obtained results, a set of remarks regarding the use of clusters versus GPUs for parallel processing of remotely sensed hyperspectral scenes follow.

5.6.1. Payload Requirements. A cluster of computers occupies much more space than a GPU, even if the PCs that form the cluster are concentrated in a compute core. If the cluster system is distributed across different locations, the space requirements increase. This aspect significantly limits the exploitation of cluster-based systems in on-board processing scenarios in the context of remote sensing, in which the weight of processing hardware must be limited in order to satisfy mission payload requirements. For example, a massively parallel cluster such as the Thunderhead system used in experiments occupies an area of several square meters with a total weight of several tons, requiring heavy cooling systems, uninterruptible power supplies, and so forth (see Figure 13(a)). In contrast, the GPU has the size of a PC card (see Figure 13(b)) and its weight is much more adequate in terms of current mission payload requirements. Most importantly, our experimental results have indicated that using just one GPU we can obtain parallel performance results which are equivalent to those obtained using tens of nodes in a cluster, thus significantly reducing the weight and space occupied by hardware resources while maintaining the same parallel performance.

5.6.2. Maintenance. The maintenance of a large cluster represents a major investment in terms of time and finance. Each node of the cluster is a computer in itself, with its own operating system, possible deterioration of components,

[Figure 14 plots the speedup as a function of the number of CPUs (0 to 32), against the linear speedup reference: panel (a) shows P-ATDCA implemented with the OSP, 1-Norm, 2-Norm, ∞-Norm, SAD, and SID distances; panel (b) shows the communicative and independent versions of P-RX.]

Figure 14: Speedups achieved by (a) P-ATDCA (using different distance measures) and (b) P-RX (communicative and independent versions)
on the Thunderhead Beowulf cluster at NASA’s Goddard Space Flight Center in Maryland.

Table 6: Speedups for the P-ATDCA algorithm (using different distance measures) and P-RX using different numbers of CPUs on the
Thunderhead Beowulf cluster.
Algorithm 2 CPUs 4 CPUs 8 CPUs 16 CPUs 32 CPUs
P-ATDCA (OSP) 1,437 2,823 6,981 12,902 25,498
P-ATDCA (1-Norm) 1,386 2,677 5,208 9,393 13,148
P-ATDCA (2-Norm) 1,290 2,498 4,770 9,258 12,227
P-ATDCA (∞-Norm) 1,431 2,708 5,277 9,690 12,224
P-ATDCA (SAD) 1,288 2,516 4,913 9,542 12,727
P-ATDCA (SID) 1,973 3,912 7,426 13,704 22,818

P-RX (Communicative) 2,121 4,079 7,531 12,140 14,720


P-RX (Independent) 2,105 4,092 7,662 12,594 15,558
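As a quick consistency check between Tables 5 and 6: the speedup for a given number of CPUs is the single-CPU time divided by the corresponding multiprocessor time; for P-ATDCA (OSP) on 32 CPUs this gives 1263,21/49,54 ≈ 25,5, in agreement with the value of 25,498 reported above (small differences are due to rounding of the displayed times).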

system failures, and so forth. This generally requires a team of dedicated system administrators, depending on the size of the cluster, to ensure that all the nodes in the system are running. In general terms, the maintenance costs for a cluster with P processing nodes are similar to the maintenance costs for P independent machines. However, the maintenance cost for a GPU is similar to the administration cost of a single machine. As a result, the advantages of a GPU with regards to a cluster from the viewpoint of the maintenance of the system are quite important, in particular, in the context of remote sensing data analysis scenarios in which compact hardware devices, which can be mounted on-board imaging instruments, are highly desirable. Regarding possible hardware failures in both systems, it is worth noting that such failures are generally easier to manage in GPU-based systems than in cluster systems, in which a failure may require several operations such as identifying the node that caused the failure, removing the node, finding out which software/hardware components caused the error, repairing/changing the defective components, reinstalling the software (if necessary), and reconnecting it.

5.6.3. Cost. Although a cluster is a relatively inexpensive parallel architecture, the cost of a cluster can increase significantly with the number of nodes. The estimated cost of a system such as Thunderhead, assuming a conservative estimate of 600 USD per node, is in the order of 600 × 256 = 153,600 USD (without including the cost of the communication network). In turn, the cost of a relatively modern GPU such as the GeForce 9800 GX2 used in our experiments is now below 500 USD. Our experiments reveal that the parallel performance obtained in the GPU can be superior to that obtained using 32 nodes of the Thunderhead system, and the cost of such nodes (without including the cost of the communication network) is around 32 × 600 = 19,200 USD. This reveals the important advantages introduced by GPUs in the sense of providing high-performance computing at lower costs

Table 7: Load balancing ratios for the P-ATDCA (implemented using different distance measures) and P-RX (communicative and
independent versions).

Algorithm Imbalance 2 CPUs 4 CPUs 8 CPUs 16 CPUs 32 CPUs


Max 879,06 447,47 180,94 97,90 49,54
P-ATDCA (OSP) Min 878,94 447,01 180,23 97,06 48,95
D 1,00013652 1,0010290 1,00393941 1,00865444 1,0120531
Max 191,33 97,28 50,00 27,74 19,81
P-ATDCA (1-Norm) Min 191,32 97,27 49,98 27,72 19,80
D 1,00005227 1,0001028 1,00030008 1,00072137 1,00055536
Max 182,75 94,38 49,42 25,47 19,29
P-ATDCA (2-Norm) Min 182,74 94,37 49,41 25,46 19,28
D 1,00005472 1,00006357 1,00020235 1,00047124 1,00072603
Max 187,93 99,29 50,97 27,77 22,01
P-ATDCA (∞-Norm) Min 187,92 99,28 50,96 27,75 22,00
D 1,00005321 1,0000705 1,00019623 1,0006125 1,00068179
Max 187,83 96,14 49,24 25,35 19,01
P-ATDCA (SAD) Min 187,83 96,13 49,23 25,33 19,00
D 1 1,00008321 1,00010155 1,00059202 1,00073669
Max 1148,80 579,52 305,33 165,47 99,39
P-ATDCA (SID) Min 1148,80 579,51 305,32 165,46 99,375
D 1 1,00001726 1,00003275 1,00006044 1,00016101
Max 32,46 16,88 9,14 5,67 4,67
P-RX (Communicative) Min 32,46 16,79 8,92 5,50 4,5264
D 1 1,00553901 1,02495096 1,03130568 1,03360286
Max 32,70 16,82 8,98 5,46 4,42
P-RX (Independent) Min 32,47 16,68 8,95 5,46 4,41
D 1,00701992 1,00869252 1,00334919 1,00131851 1,00194667

than those generally observed for commodity cluster-based systems.

Table 8: Processing time (seconds) and speedups measured for the CPU and GPU implementations of several target and anomaly detection algorithms.

Algorithm   Processing time (CPU)   Processing time (GPU)   Speedup
G-ATDCA (OSP)   1263,21   370,96   3,40
G-ATDCA (1-Norm)   99,24   9,03   10,98
G-ATDCA (2-Norm)   83,99   9,41   9,28
G-ATDCA (∞-Norm)   109,28   9,05   12,07
G-ATDCA (SAD)   133,63   9,06   14,74
G-ATDCA (SID)   911,85   12,67   71,91
G-RX   1955,15   139,17   14,04

5.6.4. Memory Considerations. A cluster of P nodes is a distributed memory system in which the P processors have P independent memory systems and P copies of the operating system, each subject to local failures. Although a scratch disk area is usually allocated in parallel clusters for common use of the different processing nodes, the memory in these systems is distributed. However, the GPU is a shared-memory system in which the local memory space is shared by all the multiprocessors in the GPU. This avoids the problems introduced by parallel algorithms with heavy interprocessor communications such as the P-RX illustrated in Figure 7, since these algorithms can be implemented by assuming that shared local memory will be available to all processing elements in the system, thus reducing quite significantly the penalties introduced by excessive communications while, at the same time, increasing the ratio of computations to communications. This generally results in better parallel performance, as observed in our experimental results.

From the observations above, we can conclude that commodity cluster-based parallel systems are indeed an appealing solution in order to process remote sensing image data sets which have been already transmitted to Earth. PC workstations are everywhere, and it is not difficult to put together a network and/or a cluster, given the raw materials. For instance, the processing power offered by such commodity systems has been traditionally employed in data mining applications from very large data archives, possibly distributed among different geographic locations. However, compact hardware devices such as GPUs offer significant advantages in time-critical applications that demand a response in real-time (i.e., at the same time as

the data is collected at the sensor) mainly due to the low weight and size of these devices, and to their capacity to provide high performance computing at very low costs. In previous work [29], we have quantitatively compared the performance of clusters versus field programmable gate arrays (FPGAs) in the context of remote sensing applications. FPGAs are another type of compact hardware device that offer interesting perspectives in our application domain, such as the appealing possibility of being able to adaptively select the data processing algorithm to be applied (out of a pool of available algorithms) from a control station on Earth, immediately after the data is collected by the sensor. This feature is possible thanks to the inherent reconfigurability of FPGA devices, which are generally more expensive than GPU devices. In the future, significant developments are expected in the active research area devoted to radiation-hardening of GPU and FPGA devices, which may allow their full incorporation to satellite-based Earth and planetary observation platforms in space. These systems represent the next frontier of hyperspectral remote sensing.

6. Conclusions and Future Research

With the ultimate goal of drawing a comparison of clusters versus GPUs as high-performance computing architectures in the context of remote sensing applications, this paper described several innovative parallel algorithms for target and anomaly detection in hyperspectral image analysis. As a case study of specific issues involved in the exploitation of an automatic algorithm for target detection and classification (ATDCA), we have investigated the impact of including several distance measures in the design of different parallel versions of this algorithm. This paper has also developed a new parallel version of a well-known anomaly detection algorithm (RX). The parallel algorithms have been implemented in two types of parallel computing platforms: a Beowulf cluster at NASA's Goddard Space Flight Center in Maryland and an NVidia GeForce 9800 GX2 GPU. Experimental results, oriented towards analyzing the target/anomaly detection accuracy and parallel performance of the proposed parallel algorithms, have been presented and thoroughly discussed in the context of a real defense and security application: the analysis of hyperspectral data collected by NASA's AVIRIS instrument over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex. Our experimental assessment of clusters versus GPUs in the context of this particular application indicates that commodity clusters represent a source of computational power that is both accessible and applicable to obtaining results quickly enough and with high reliability in target/anomaly detection applications in which the data has already been transmitted to Earth. However, GPU hardware devices may offer important advantages in defense and security applications that demand a response in realtime, mainly due to the low weight and compact size of these devices, and to their capacity to provide high-performance computing at very low costs.

Although the results reported in this work are very encouraging, further experiments should be conducted in order to increase the parallel performance of the proposed parallel algorithms by resolving memory issues in the cluster-based implementations and optimizing the parallel design of the algorithms in the GPU-based implementations. Regarding the cluster-based implementation of the RX algorithm reported in this work, we are planning on implementing not only the sample covariance matrix but also the inverse in parallel in order to increase scalability in future developments. Experiments with additional scenes under different target/anomaly detection scenarios are also highly desirable. Finally, experiments with radiation-hardened GPU devices will be required in order to evaluate the possibility of adapting the proposed parallel algorithms to hardware devices which have been already certified by international agencies and are mounted on-board satellite platforms for Earth and planetary observation from space.

Acknowledgments

This work has been supported by the European Community's Marie Curie Research Training Networks Programme under reference MRTN-CT-2006-035927 (HYPER-I-NET). Funding from the Spanish Ministry of Science and Innovation (HYPERCOMP/EODIX project, reference AYA2008-05965-C04-02) is gratefully acknowledged.

References

[1] A. F. H. Goetz, G. Vane, J. E. Solomon, and B. N. Rock, “Imaging spectrometry for earth remote sensing,” Science, vol. 228, no. 4704, pp. 1147–1153, 1985.
[2] A. Plaza, J. A. Benediktsson, J. W. Boardman, et al., “Recent advances in techniques for hyperspectral image processing,” Remote Sensing of Environment, vol. 113, supplement 1, pp. S110–S122, 2009.
[3] R. O. Green, M. L. Eastwood, C. M. Sarture, et al., “Imaging spectroscopy and the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS),” Remote Sensing of Environment, vol. 65, no. 3, pp. 227–248, 1998.
[4] C.-I. Chang, Hyperspectral Imaging: Techniques for Spectral Detection and Classification, Kluwer Academic Publishers, Norwell, Mass, USA, 2003.
[5] A. Plaza and C.-I. Chang, High Performance Computing in Remote Sensing, CRC Press, Boca Raton, Fla, USA, 2007.
[6] R. A. Schowengerdt, Remote Sensing: Models and Methods for Image Processing, Academic Press, New York, NY, USA, 2nd edition, 1997.
[7] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing, John Wiley & Sons, New York, NY, USA, 2003.
[8] J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis: An Introduction, Springer, London, UK, 2006.
[9] C.-I. Chang, Recent Advances in Hyperspectral Signal and Image Processing, John Wiley & Sons, New York, NY, USA, 2007.
[10] C.-I. Chang, Hyperspectral Data Exploitation: Theory and Applications, John Wiley & Sons, New York, NY, USA, 2007.
[11] C.-I. Chang and H. Ren, “An experiment-based quantitative and comparative analysis of target detection and image
computing at very low costs. and comparative analysis of target detection and image

classification algorithms for hyperspectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 2, pp. 1044–1063, 2000.
[12] H. Ren and C.-I. Chang, “Automatic spectral target recognition in hyperspectral imagery,” IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1232–1249, 2003.
[13] D. Manolakis, D. Marden, and G. A. Shaw, “Hyperspectral image processing for automatic target detection applications,” MIT Lincoln Laboratory Journal, vol. 14, pp. 79–116, 2003.
[14] A. Paz, A. Plaza, and S. Blazquez, “Parallel implementation of target and anomaly detection algorithms for hyperspectral imagery,” in Proceedings of International Geoscience and Remote Sensing Symposium (IGARSS '08), vol. 2, pp. 589–592, 2008.
[15] Y. Tarabalka, T. V. Haavardsholm, I. Kasen, and T. Skauli, “Real-time anomaly detection in hyperspectral images using multivariate normal mixture models and GPU processing,” Journal of Real-Time Image Processing, vol. 4, no. 3, pp. 287–300, 2009.
[16] D. C. Heinz and C.-I. Chang, “Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 3, pp. 529–545, 2001.
[17] R. A. Neville, K. Staenz, T. Szeredi, J. Lefebvre, and P. Hauff, “Automatic endmember extraction from hyperspectral data for mineral exploration,” in Proceedings of the 21st Canadian Symposium on Remote Sensing, pp. 401–415, 1999.
[18] I. S. Reed and X. Yu, “Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 10, pp. 1760–1770, 1990.
[19] J. C. Harsanyi and C.-I. Chang, “Hyperspectral image classification and dimensionality reduction: an orthogonal subspace projection approach,” IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 4, pp. 779–785, 1994.
[20] N. Acito, M. Diani, and G. Corsini, “A new algorithm for robust estimation of the signal subspace in hyperspectral images in the presence of rare signal components,” IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 11, pp. 3844–3856, 2009.
[21] A. Plaza, P. Martinez, J. Plaza, and R. Perez, “Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations,” IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 3, pp. 466–479, 2005.
[22] C.-I. Chang and D. C. Heinz, “Constrained subpixel target detection for remotely sensed imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 3, pp. 1144–1159, 2000.
[26] A. Plaza, J. Plaza, and D. Valencia, “Impact of platform heterogeneity on the design of parallel algorithms for morphological processing of high-dimensional image data,” Journal of Supercomputing, vol. 40, no. 1, pp. 81–107, 2007.
[27] J. Dorband, J. Palencia, and U. Ranawake, “Commodity computing clusters at Goddard Space Flight Center,” Journal of Space Communication, vol. 3, p. 1, 2003.
[28] A. Plaza, D. Valencia, J. Plaza, and P. Martinez, “Commodity cluster-based parallel processing of hyperspectral imagery,” Journal of Parallel and Distributed Computing, vol. 66, no. 3, pp. 345–358, 2006.
[29] A. Plaza and C.-I. Chang, “Clusters versus FPGA for parallel processing of hyperspectral imagery,” International Journal of High Performance Computing Applications, vol. 22, no. 4, pp. 366–385, 2008.
[30] J. Setoain, M. Prieto, C. Tenllado, A. Plaza, and F. Tirado, “Parallel morphological endmember extraction using commodity graphics hardware,” IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 3, pp. 441–445, 2007.
[31] A. Paz, A. Plaza, and J. Plaza, “Comparative analysis of different implementations of a parallel algorithm for automatic target detection and classification of hyperspectral images,” in Satellite Data Compression, Communication, and Processing V, vol. 7455 of Proceedings of SPIE, San Diego, Calif, USA, August 2009.
[32] J. C. Tilton, W. T. Lawrence, and A. J. Plaza, “Utilizing hierarchical segmentation to generate water and snow masks to facilitate monitoring change with remotely sensed image data,” GIScience and Remote Sensing, vol. 43, no. 1, pp. 39–66, 2006.
[33] R. Brightwell, L. A. Fisk, D. S. Greenberg, et al., “Massively parallel computing using commodity components,” Parallel Computing, vol. 26, no. 2, pp. 243–266, 2000.
[34] C.-I. Chang and Q. Du, “Estimation of number of spectrally distinct signal sources in hyperspectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 3, pp. 608–619, 2004.
[23] N. Acito, G. Corsini, and M. Diani, “Computational load
reduction for anomaly detection in hyperspectral images:
an experimental comparative analysis,” in Proceedings of
International Geoscience and Remote Sensing Symposium
(IGARSS ’08), pp. 3206–3209, 2008.
[24] K. Itoh, “Massively parallel Fourier-transform spectral imag-
ing and hyperspectral image processing,” Optics and Laser
Technology, vol. 25, p. 202, 1993.
[25] T. El-Ghazawi, S. Kaewpijit, and J. L. Moigne, “Parallel
and adaptive reduction of hyperspectral data to intrinsic
dimensionality,” Cluster Computing, vol. 1, pp. 102–110, 2001.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 503752, 26 pages
doi:10.1155/2010/503752

Research Article
A Review of Unsupervised Spectral Target Analysis for
Hyperspectral Imagery

Chein-I Chang,1, 2 Xiaoli Jiao,1 Chao-Cheng Wu,1 Yingzi Du,3 and Mann-Li Chang4
1 Remote Sensing Signal and Image Processing Laboratory, Department of Computer Science and Electrical Engineering,
University of Maryland, Baltimore, MD 21250, USA
2 Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan
3 Department of Electrical and Computer Engineering, Purdue School of Engineering and Technology,
Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA
4 Management and Information Department, Kang Ning Nursing and Management Junior College, Taipei, Taiwan

Correspondence should be addressed to Chein-I Chang, [email protected]

Received 27 September 2009; Revised 31 December 2009; Accepted 19 February 2010

Academic Editor: Jin-Hua She

Copyright © 2010 Chein-I Chang et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

One of the great challenges in unsupervised hyperspectral target analysis is how to obtain the desired knowledge for image analysis directly from the data in an unsupervised manner. This paper provides a review of unsupervised target analysis by first addressing two fundamental issues, "what are the material substances of interest, referred to as targets?" and "how can these targets be extracted from the data?", and then developing least squares (LS)-based unsupervised algorithms for finding spectral targets for analysis. In order to validate and substantiate the proposed unsupervised hyperspectral target analysis, three applications, endmember extraction, target detection, and linear spectral unmixing, are considered, where custom-designed synthetic images and real image scenes are used to conduct experiments.

1. Introduction

Hyperspectral imaging has become an emerging technique in remote sensing analysis. With high spectral resolution, many material substances which are not known a priori or cannot be visualized by inspection can now be revealed by hyperspectral imaging sensors for data exploitation. Consequently, two main issues are investigated in this paper. One is what the material substances of interest are. Once targets of interest are defined, the next is how to find these targets directly from the data in an unsupervised manner without prior knowledge. In order to address the first issue, we first explore a new concept of so-called "spectral" targets, which is developed to differentiate them from targets commonly addressed in traditional image processing. With no spectral bands used in traditional image processing, the targets of interest are generally identified by their spatial properties such as size, shape, and texture. In this case, targets to be recognized based on their spatial properties can be considered as "spatial" targets, and the techniques developed to recognize such spatial targets are referred to as spatial domain-based image processing techniques. On the other hand, due to the use of spectral bands specified by a range of wavelengths, a multispectral or hyperspectral image pixel is actually a column vector, of which each component is produced by a particular wavelength. As a consequence, a single image pixel vector of a hyperspectral image already contains abundant spectral information provided by hundreds of contiguous spectral bands that can be used for data exploitation. Such spectral information within a single image pixel vector is referred to as intrapixel spectral information. A target analyzed based on its spectral properties, characterized by the intrapixel spectral information on a single image pixel vector basis, is called a "spectral target", as opposed to a "spatial target" analyzed by interpixel spatial information provided by spatial correlation among sample pixels. More specifically, three major types of spectral targets are of particular interest in this paper. One is endmembers, whose spectral signatures are idealistically pure [1]. Endmembers do not usually appear in multispectral images due to low spatial and spectral resolution but have

become increasingly important in hyperspectral imaging because an endmember can be used to identify a spectral class. Another is subpixel targets, which do not fully occupy a pixel but rather are completely embedded in a single image pixel vector [2]. This type of target cannot be visualized spatially and can only be recognized based on its spectral properties. Subpixel targets occur when their spatial extent is smaller than the pixel resolution. A third type of target is mixed targets, whose spectral signatures are linearly or nonlinearly mixed by a number of target spectral signatures with appropriate fractions present in a single image pixel vector [2]. The occurrence of a mixed target is a result of low spatial and spectral resolution, and it may partially occupy more than one pixel vector. Apparently, none of these three types of spectral targets can be effectively analyzed by spatial domain-based techniques.

With a spectral target defined as above, what we are particularly interested in in this paper, from the aspect of statistical signal processing, are two types of spectral targets, one characterized by 2nd order sample intrapixel Spectral Information Statistics (SIS) and the other by sample intrapixel SIS of order higher than 2, referred to as high-order SIS. It should be noted that the term sample intrapixel SIS is defined as the correlation of intrapixel SIS among samples. In the context of sample spectral information statistics we assume that background (BKG) pixels are those spectral targets characterized by 2nd order sample intrapixel SIS, while the target pixels of interest are those characterized by high-order sample intrapixel SIS. In hyperspectral image analysis this seems a reasonable assumption, since the spectral targets of interest in hyperspectral data exploitation are those which either (1) occur with low probability or (2) have small populations when they are present. In other words, these types of spectral targets are usually relatively small, appear in small populations, and occur with low probabilities, for example, special species in agriculture and ecology, toxic wastes in environmental monitoring, rare minerals in geology, drug/smuggler trafficking in law enforcement, combat vehicles in the battlefield, landmines in war zones, chemical/biological agents in bioterrorism, weapon concealment, and mass graves. These spectral targets are generally considered as insignificant objects because of their very limited spatial information, but they are actually critical and crucial for defense and intelligence analysis: they are insignificant compared to targets with large sample pools, are generally hard to identify by visual inspection, and, from a statistical point of view, the spectral information statistics of such special targets cannot be captured by 2nd order sample intrapixel SIS but rather by high-order sample intrapixel SIS.

Once image pixel vectors are categorized into BKG and target classes according to sample intrapixel SIS, a follow-up task is how to find them, in which case two issues need to be addressed. One is how many of them there are. The other is how to extract them. The first issue can be resolved by a new concept, virtual dimensionality (VD), recently developed [2, Chapter 17], [3]. The idea of the VD is based on the assumption that if a signal source is present in the data, it will contribute energy to the 1st order statistics. In doing so, both the eigenvalues of the sample correlation matrix, {λ̂_l}, and the eigenvalues of the sample covariance matrix, {λ_l}, are calculated. If λ̂_l − λ_l is greater than zero, resulting from the lth component sample mean, it implies that there is a signal source; otherwise, no signal is present. To materialize this idea, a binary composite hypothesis testing problem is formulated in such a way that the null and alternative hypotheses, H0 and H1, represent the two scenarios H0 : λ̂_l − λ_l = 0 and H1 : λ̂_l − λ_l > 0, respectively. The Neyman-Pearson detection theory [4] is then applied to find how many times the test fails running over all spectral bands for a given false alarm probability, PF. It is the number of test failures that indicates the number of signal sources assumed to be in the data. The beauty of the VD lies in the fact that its value is completely determined by the PF. By varying the value of PF, the number of spectrally distinct signatures estimated by the VD varies. For example, if PF is set low, fewer tests will fail and thus fewer targets are assumed to be in the data, and vice versa. To address the second issue, an unsupervised spectral target finding algorithm (USTFA) is developed which is based on three least squares (LS)-based algorithms, the automatic target generation process (ATGP) [5], the unsupervised non-negativity constrained least squares (UNCLS) method [6], and the unsupervised fully constrained least squares (UFCLS) method [7]. In order for these unsupervised methods to extract and distinguish spectral targets of 2nd order sample intrapixel SIS from those of high-order sample intrapixel SIS, two data sets, the original data and its sphered data, are used. It is assumed that the BKG in a hyperspectral image is most likely characterized by 2nd order sample intrapixel SIS, while hyperspectral targets will more likely be captured by high-order sample intrapixel SIS as outliers due to their small spatial presence. In this case, high-order spectral targets are referred to as desired targets to be used for image analysis, while 2nd order spectral targets are considered as undesired targets which we would like to annihilate or suppress prior to data processing so as to improve image analysis.

In order to validate the utility of the VD and the proposed unsupervised spectral target analysis for hyperspectral imagery, custom-designed synthetic image experiments with complete ground truth are conducted to show that the VD indeed provides a reasonable estimate of the true dimensionality for qualitative and quantitative analysis in three applications, endmember extraction, unsupervised spectral target detection, and unsupervised LSU-based target classification. These same experiments are further substantiated by real image data.

As summarized, several contributions are made in this paper. First and foremost is to introduce the concept of sample intrapixel SIS to define BKG and target pixels characterized by 2nd order sample intrapixel SIS and high-order sample intrapixel SIS, respectively. A second contribution is to use the VD to determine the number of BKG and target pixels, each of which represents a particular spectral class, either a BKG or a target class. A third contribution is the idea of using two data sets, the original data and its sphered data, from which BKG and target pixels can be extracted. A fourth contribution is to design a USTFA to extract BKG and target pixels. Finally, a fifth contribution is to custom

design synthetic image experiments to validate the proof of concept for the unsupervised target analysis developed in this paper. This is then followed by real image experiments to substantiate its utility in real applications.

2. Unsupervised Least Squares-Based Algorithms

In this section, we present three least squares (LS)-based algorithms for finding spectral targets of interest using a posteriori knowledge which can be obtained directly from the data.

To accomplish this task, a designed least squares-based algorithm is first applied to the original data to extract data sample vectors characterized by 2nd order SIS. Then, the same algorithm is applied to the data which is sphered by removing the data sample mean and covariances while making the data variances one, so that all 2nd order SIS-characterized data samples will lie on the sphere and all other data sample vectors, which are characterized by high-order SIS, will be either inside the sphere (sub-Gaussian samples) or outside it (super-Gaussian samples). As a consequence of such a sphering process, the resulting data has the 1st and 2nd order SIS removed from the original data because of zero mean and constant variance one, so that samples characterized by SIS of orders higher than 2 can be extracted from inside or outside the sphere. Interestingly, although the idea of using the same algorithm in two passes, one pass for the original data and another pass for the sphered data, to extract two types of targets of interest, 2nd order targets and high-order targets, seems simple, it is by no means a trivial matter because its novelty has never been explored in the open literature. In what follows, we design and develop three LS-based algorithms for this purpose.

The first algorithm of interest was previously proposed by Ren and Chang in [5], called the Automatic Target Generation Process (ATGP), which can be considered as an unsupervised version of the orthogonal subspace projection (OSP) algorithm in [8]. Its relationships with LS-based linear spectral unmixing (LSU) were also explored in [9, 10]. With this interpretation, the ATGP can also be viewed as an unsupervised version of an unconstrained LS LSU method. The 2nd and 3rd LS-based algorithms are an unsupervised version of a partially abundance-constrained least squares LSU, referred to as Unsupervised Non-negativity Constrained Least Squares (UNCLS) [6], and an unsupervised version of a fully abundance-constrained least squares LSU, referred to as Unsupervised Fully Constrained Least Squares (UFCLS) [7].

Assume that m_1, m_2, ..., m_p are spectral signatures used to unmix the data sample vectors. Let L be the number of spectral bands and r be an L-dimensional data sample vector which can be modeled as a linear combination of m_1, m_2, ..., m_p with appropriate abundance fractions specified by α_1, α_2, ..., α_p. More precisely, r is an L × 1 column vector and M is an L × p target spectral signature matrix, denoted by [m_1 m_2 ··· m_p], where m_j is an L × 1 column vector represented by the spectral signature of the jth target resident in the pixel vector r. Let α = (α_1, α_2, ..., α_p)^T be a p × 1 abundance column vector associated with r, where α_j denotes the fraction of the jth target signature m_j present in the pixel vector r. A classical approach to solving a mixed pixel classification problem is linear unmixing, which assumes that the spectral signature of the pixel vector r is linearly mixed by m_1, m_2, ..., m_p as follows:

    r = Mα + n,   (1)

where n is noise or can be interpreted as a measurement or model error.

Equation (1) represents a standard signal detection model where Mα is a desired signal vector to be detected and n is a corrupting noise. Since we are interested in detecting one target at a time, we can divide the set of the p target signatures, m_1, m_2, ..., m_p, into a desired target, say m_p, and a class of undesired target signatures, m_1, m_2, ..., m_{p−1}. In this case, a logical approach is to eliminate the effects caused by the undesired targets m_1, m_2, ..., m_{p−1}, which are considered as interferers to m_p, before the detection of m_p takes place. With annihilation of the undesired target signatures, the detectability of m_p can therefore be enhanced. In doing so, we first separate m_p from m_1, m_2, ..., m_p in M and rewrite (1) as

    r = d α_p + U γ + n,   (2)

where d = m_p is the desired spectral signature of m_p and U = [m_1 m_2 ··· m_{p−1}] is the undesired target spectral signature matrix made up of m_1, m_2, ..., m_{p−1}, which are the spectral signatures of the remaining p − 1 undesired targets. Using (2) we can design an orthogonal subspace projector to annihilate U from the pixel vector r prior to the detection of m_p. One such desired orthogonal subspace projector is the orthogonal subspace projection (OSP) derived in [8] and given by

    P_U^⊥ = I − U U^#,   (3)

where U^# = (U^T U)^{−1} U^T is the pseudo-inverse of U. The notation ⊥ in P_U^⊥ indicates that the projector P_U^⊥ maps the observed pixel vector r into the orthogonal complement of ⟨U⟩, denoted by ⟨U⟩^⊥. By means of (3), a linear optimal signal detector for (2), denoted by δ_OSP(r), was developed in [8] and is given by

    δ_OSP(r) = d^T P_U^⊥ r.   (4)

2.1. Automatic Target Generation Process (ATGP). The ATGP can be considered as an unsupervised and unconstrained OSP technique which performs a succession of orthogonal subspace projections specified by (3) to find a set of sequential data sample vectors that represent targets of interest, as follows.

Automatic Target Generation Process (ATGP) Algorithm.

(1) Initial condition: Let ε be a prescribed error threshold and t_0 be the pixel with the brightest intensity value, that is, the largest gray level value. Set k = 0.

(2) Let k ← k + 1 and apply the projector P_{U_{k−1}}^⊥ defined via (3), with U_{k−1} = [t_0 t_1 ··· t_{k−1}], to all image pixels r in the image and find the kth target t_k generated at the kth stage as the pixel with the maximum orthogonal projection, that is,

    t_k = arg max_r [ (P_{U_{k−1}}^⊥ r)^T (P_{U_{k−1}}^⊥ r) ].   (5)

(3) If m(t_{k−1}, t_k) > ε, where m(·, ·) can be any target discrimination measure, for example, the Spectral Angle Mapper (SAM) in [2], then go to step (2). Otherwise, the algorithm is terminated. At this point, all the generated target pixels t_0, t_1, ..., t_{k−1} are considered as the desired targets.

2.2. Unsupervised Non-Negativity Constrained Least Squares (UNCLS) Method. The UNCLS is an unsupervised version of the abundance Non-negativity Constrained Least Squares (NCLS) method, where the NCLS is a partially abundance-constrained OSP technique that imposes the abundance non-negativity constraint (ANC), α ≥ 0, that is, α_j ≥ 0 for all j, on the linear mixing model specified by (1). It can be implemented as follows.

UNCLS Algorithm.

(1) Initial condition: Select ε to be a prescribed error threshold and let t_0 = arg max_r [r^T r], where r runs over all image pixel vectors, and set k = 0.

(2) Let LSE^(0)(r) = (r − α̂_0^(1)(r) t_0)^T (r − α̂_0^(1)(r) t_0) and check if max_r LSE^(0)(r) < ε. If yes, the algorithm is terminated; otherwise continue.

(3) Let k ← k + 1 and find t_k = arg max_r [LSE^(k−1)(r)].

(4) Apply the NCLS method with the signature matrix M^(k) = [t_0 t_1 ··· t_{k−1}] to estimate the abundance fractions of t_0, t_1, ..., t_{k−1}, namely α̂_1^(k)(r), α̂_2^(k)(r), ..., α̂_{k−1}^(k)(r).

(5) Find the kth maximum least squares error, defined by

    max_r LSE^(k)(r) = max_r [ (r − Σ_{j=1}^{k−1} α̂_j^(k)(r) t_j)^T (r − Σ_{j=1}^{k−1} α̂_j^(k)(r) t_j) ].   (6)

(6) If max_r LSE^(k)(r) < ε, the algorithm is terminated; otherwise go to step (3).

2.3. Unsupervised Fully Constrained Least Squares (UFCLS) Method. The UFCLS is an unsupervised version of the abundance Fully Constrained Least Squares (FCLS) method, where the FCLS is a fully abundance-constrained OSP technique that imposes both the Abundance Sum-to-one Constraint (ASC), that is, Σ_{j=1}^{p} α_j = 1, and the Abundance Non-negativity Constraint (ANC), α ≥ 0, that is, α_j ≥ 0, on the linear mixing model (1). Its implementation is provided below.

UFCLS Algorithm.

(1) Initial condition: Select ε to be a prescribed error threshold and let t_0 = arg max_r [r^T r], where r runs over all image pixel vectors, and let k = 0.

(2) Let LSE^(0)(r) = (r − α̂_0^(1)(r) t_0)^T (r − α̂_0^(1)(r) t_0) and check if max_r LSE^(0)(r) < ε. If yes, the algorithm is terminated; otherwise continue.

(3) Let k ← k + 1 and find t_k = arg max_r [LSE^(k−1)(r)].

(4) Apply the FCLS method with the signature matrix M^(k) = [t_0 t_1 ··· t_{k−1}] to estimate the abundance fractions of t_0, t_1, ..., t_{k−1}, namely α̂_1^(k)(r), α̂_2^(k)(r), ..., α̂_{k−1}^(k)(r).

(5) Find the kth maximum least squares error, defined by

    max_r LSE^(k)(r) = max_r [ (r − Σ_{j=1}^{k−1} α̂_j^(k)(r) t_j)^T (r − Σ_{j=1}^{k−1} α̂_j^(k)(r) t_j) ].   (7)

If max_r LSE^(k)(r) < ε, the algorithm is terminated; otherwise go to step (3).

3. Unsupervised Spectral Target Finding Algorithms

When the above-mentioned three unsupervised LS-based algorithms are implemented, a prescribed error threshold ε, which is determined by the particular application, is required to terminate the algorithms. In general, it is chosen by visual inspection on a trial-and-error basis and is not practical for our purpose. Therefore, instead of using ε as a stopping rule, we use the VD as an alternative rule to determine how many targets our designed LS algorithms are required to generate.

In order for the proposed LS-based algorithms to be successful, we assume that most of the image BKG is made up of a very large number of uninteresting data sample vectors which can be characterized by 2nd order statistics, as opposed to target pixels which can be captured by high-order statistics due to the small number of target pixels. By virtue of this assumption we can consider two sets of data for processing. One is the original data and the other is the sphered data, which has the mean and covariance removed from the original data. We then apply the three unsupervised LS-based algorithms to these two data sets to extract 2nd order BKG pixels as well as high-order target pixels. However, if a sample pixel shows strong signal statistics in both the original and the sphered data sets, it is considered as a target pixel and can be removed from the BKG category.

A detailed implementation of an LS-based unsupervised spectral target finding algorithm (USTFA) can be briefly described as follows, where the LS-based unsupervised algorithm used in the USTFA can be any one of the three LS unsupervised algorithms described in Section 2.

LS-Based Unsupervised Spectral Target Finding Algorithm.

(1) Find the VD for the image data to determine the number of targets required to be generated, n_VD.

(2) Apply an LS-based algorithm to the original image data and find n_VD BKG pixels, S_BKG = {b_j^LS}_{j=1}^{n_VD}.

(3) Apply the LS-based algorithm to the sphered data and find n_VD target pixels, S_target = {t_j^LS}_{j=1}^{n_VD}.

(4) Since there may be some pixels in S_BKG whose spectra are very close to those also showing up in S_target, a spectral measure such as SAM [2] is applied to extract these pixels, which are then removed from S_BKG. Let the resulting BKG sample set be denoted by S̃_BKG = {b̃_i^LS}_{i=1}^{n_BKG}, where n_BKG is the total number of BKG pixels remaining in S_BKG after the pixels in S_target ∩ S_BKG are removed.

(5) Form a signature matrix M by merging S̃_BKG and S_target, that is, by collecting the pixels in {b̃_i^LS}_{i=1}^{n_BKG} ∪ {t_j^LS}_{j=1}^{n_VD}. It should be noted that the number of pixels in M is between n_VD and 2n_VD, that is, n_VD ≤ n_VD + n_BKG ≤ 2n_VD.

(6) Apply an LSU method, such as the abundance-unconstrained classifier LSOSP, the abundance non-negativity constrained classifier NCLS, or the abundance fully constrained classifier FCLS, to perform mixed pixel classification, where only the target pixels in S_target will be classified by their corresponding abundance fractions, while the BKG pixels in S̃_BKG = {b̃_i^LS}_{i=1}^{n_BKG} will be used for BKG suppression. It should be noted that in order for an LSU to perform pure-pixel classification, we need a value to threshold the LSU-estimated abundance fractions of each target for making hard decisions. In this case, finding an appropriate threshold value is generally very challenging. In the experiments conducted in this paper, only LSU is performed to produce abundance fraction estimates for target pixels. When a specific LS-based algorithm is used, the superscript "LS" in the above algorithm is replaced with that particular algorithm. For example, if the ATGP is used for the USTFA, it is called ATGP-USTFA.

4. Synthetic Image Simulated Scenarios

The success of the three proposed LS-based algorithms in the unsupervised target analysis hinges on two hypotheses: (1) targets of interest can be characterized by their spectral statistics, with 2nd order targets corresponding to BKG pixels and high-order targets assumed to be the desired targets; (2) the VD can be used to estimate the number of targets of interest present in the data. Since neither can be verified by real image scenes, where obtaining full scene ground truth is impossible, this section presents two scenarios using a set of controllable parameters to simulate synthetic images via the real Cuprite image data shown in Figure 1, which is available at the USGS

Figure 1: (a) Cuprite AVIRIS image scene; (b) spatial positions of five pure pixels corresponding to minerals: alunite (A), buddingtonite (B), calcite (C), kaolinite (K), and muscovite (M), with the area marked "BKG" at the upper right corner.

website [11]. This scene is a 224-band image with a size of 350 × 350 pixels and was collected over the Cuprite mining site, Nevada, in 1997. It is well understood mineralogically. A total of 189 bands were used for the experiments, where bands 1-3, 105-115, and 150-170 were removed prior to the analysis due to water absorption and low SNR in those bands. Although there are more than five minerals in the data set, the ground truth available for this region only provides the locations of the pure pixels: Alunite (A), Buddingtonite (B), Calcite (C), Kaolinite (K), and Muscovite (M). The locations of these five pure minerals are labeled by A, B, C, K, and M, respectively, and shown in Figure 1. Available from the image scene is a set of reflectance spectra shown in Figure 2 which will be used to simulate synthetic images. An area marked by "BKG" at the upper right corner of Figure 1(a) was selected to find its sample mean, that is, the average of all pixel vectors within the area "BKG", denoted by b, to be used to simulate the BKG for the image scene in Figure 3; this signature is also plotted in Figure 2. The reason for this BKG selection is empirical, since the selected area "BKG" seemed more homogeneous than other regions. Nevertheless, other areas could also be selected for the same purpose.

As we can see from the spectral profiles in Figure 2, the Muscovite is the most spectrally distinct signature among all the five signatures and the signature of the Calcite is the most similar to the BKG signature. These two particular signatures will have a significant impact on data analysis, as demonstrated in the following experiments.

The synthetic image to be simulated for the experiments has a size of 200 × 200 pixel vectors with 25 panels of various sizes which are arranged in a 5 × 5 matrix and located at the center of the scene shown in Figure 3(a).

The 25 panels in Figure 3(a) were simulated as follows. The five mineral spectral signatures, {m_i}_{i=1}^{5}, in Figure 2 are used to simulate these 25 panels, where each row of five panels was simulated by the same mineral signature and each column of 5 panels has the same size. Among the 25 panels are five 4 × 4 pure-pixel panels, p_i^{4×4} for i = 1, ..., 5, lined up in five rows in the 1st column, and five 2 × 2 pure-pixel panels, p_i^{2×2} for i = 1, ..., 5, lined up in five rows in the 2nd column for pure pixel classification; the five 2 × 2 mixed

Figure 2: Five mineral reflectance spectra and BKG signature b, which is the average of the area BKG in the top right of Figure 1(a) (legend: Muscovite, Buddingtonite, Alunite, Kaolinite, Calcite, and b; horizontal axis: band number).

Table 1: Mixed panel pixels in the 3rd column for simulations.
  row 1: p^1_{3,11} = 0.5A + 0.5B;  p^1_{3,12} = 0.5A + 0.5C;  p^1_{3,21} = 0.5A + 0.5K;  p^1_{3,22} = 0.5A + 0.5M
  row 2: p^2_{3,11} = 0.5A + 0.5B;  p^2_{3,12} = 0.5B + 0.5C;  p^2_{3,21} = 0.5B + 0.5K;  p^2_{3,22} = 0.5B + 0.5M
  row 3: p^3_{3,11} = 0.5A + 0.5C;  p^3_{3,12} = 0.5B + 0.5C;  p^3_{3,21} = 0.5C + 0.5K;  p^3_{3,22} = 0.5C + 0.5M
  row 4: p^4_{3,11} = 0.5A + 0.5K;  p^4_{3,12} = 0.5B + 0.5K;  p^4_{3,21} = 0.5C + 0.5K;  p^4_{3,22} = 0.5K + 0.5M
  row 5: p^5_{3,11} = 0.5A + 0.5M;  p^5_{3,12} = 0.5B + 0.5M;  p^5_{3,21} = 0.5C + 0.5M;  p^5_{3,22} = 0.5K + 0.5M

Table 2: Subpanel pixels in the 4th and 5th columns for simulations (50% subpixel panels in the 4th column; 25% subpixel panels in the 5th column).
  row 1: p^1_{4,1} = 0.5A + 0.5b;  p^1_{5,1} = 0.25A + 0.75b
  row 2: p^2_{4,1} = 0.5B + 0.5b;  p^2_{5,1} = 0.25B + 0.75b
  row 3: p^3_{4,1} = 0.5C + 0.5b;  p^3_{5,1} = 0.25C + 0.75b
  row 4: p^4_{4,1} = 0.5K + 0.5b;  p^4_{5,1} = 0.25K + 0.75b
  row 5: p^5_{4,1} = 0.5M + 0.5b;  p^5_{5,1} = 0.25M + 0.75b

pixel panels, {p^i_{3,jk}}_{j,k=1}^{2,2} for i = 1, ..., 5, lined up in five rows in the 3rd column for mixed pixel classification; and both the five subpanel pixels, p^i_{4,1} for i = 1, ..., 5, lined up in five rows in the 4th column and the five subpixel panels, p^i_{5,1} for i = 1, ..., 5, lined up in five rows in the 5th column for subpixel classification. The purpose of introducing the panels in the 3rd column and the subpanel pixels in the 4th and 5th columns was to conduct a study and analysis of the five mineral signatures with different mixing in a pixel and of the five mineral signatures embedded in single pixels at subpixel scale.

Tables 1 and 2 tabulate the mixing details of the five mineral compositions in the 20 mixed pixels in the 3rd column in Figure 3 and of the 5 subpanel pixels with 50% abundance of mineral signatures in the 4th column and the 5 subpanel pixels with 25% abundance of mineral signatures in the 5th column in Figure 3(a), respectively.

So, in Figure 3(a) there are a total of 130 panel pixels present in the scene: 80 pure panel pixels in the 1st column, 20 pure panel pixels in the 2nd column, 20 mixed panel pixels in the 3rd column, five 50%-abundance subpanel pixels in the 4th column, and five 25%-abundance subpanel pixels in the 5th column.

The image BKG was simulated by the signature b in Figure 2 corrupted by an additive Gaussian noise to achieve a certain signal-to-noise ratio (SNR), which was defined as 50% signature (i.e., reflectance/radiance) divided by the standard deviation of the noise, as in [8]. Once target pixels and BKG are simulated, two types of target insertion can be designed to simulate experiments for various applications.

4.1. Target Implantation (TI). The first type of target insertion is referred to as Target Implantation (TI), which inserts the above 130 panel pixels into the image by replacing their corresponding BKG pixels. So, the resulting synthetic image has clean panel pixels implanted in a noisy BKG with an additive Gaussian noise of SNR = 20 : 1 for this scenario, as shown in Figure 3(b). The TI is primarily designed to simulate scenarios with pure pixels implanted as pure signatures to represent endmembers, in order to evaluate the performance of endmember extraction.

4.2. Target Embeddedness (TE). The second type of target insertion is referred to as Target Embeddedness (TE), which is the same as the TI described above except for the way the panel pixels were inserted. The BKG pixels were not removed to accommodate the inserted panel pixels, as was done in TI, but were rather superimposed with the inserted panel pixels. So, in this case, the resulting synthetic image shown in Figure 3(c) has clean panel pixels embedded in a noisy BKG. The TE is particularly designed to simulate signal detection models [12] with two hypotheses, the null hypothesis corresponding to noise and BKG and the alternative hypothesis specifying the embedded target pixels. Under this circumstance, the abundances of the pixels containing inserted targets were not normalized to one, in which case the abundance sum-to-one constraint imposed on FCLS was violated. Nevertheless, it is worth noting that the TE scenario can also be used for endmember extraction, to test whether an endmember extraction algorithm is able to extract the purest pixels when no pure pixels are present in the data.

Two remarks on the scenarios TI and TE are noteworthy.

(1) From Figure 3, one may argue that it is obvious by visual inspection that most panel pixels in Figure 3 are visible. This may lead to a belief that these two scenarios are not useful or appropriate. The truth is that what we see from images is generally not what we will expect. Specifically, what we see is only qualitative and not quantitative, a task that a computer algorithm can do well while a human being cannot.

(a) 25 simulated panels (b) Scenario TI (c) Scenario TE

Figure 3: 25 simulated panels according to Tables 1 and 2 and two scenarios for target insertion, TI and TE.

This is exactly what we need these scenarios to show as ATGP-USTFA, UNCLS-USTFA, and UFCLS-USTFA using
that can an algorithm accomplish what human eyes nVD = 6 where (a) the 2nd order BKG pixels obtained by
can or do better? Unfortunately, on many occasions applying an LS-based unsupervised algorithm to the original
misconception of human inspection usually mislead data; (b) high-order target pixels obatined by applying the
to something incorrect. This phenomenon will be same algorithm to the sphered data; (c) the remaining BKG
demonstrated in some experiments conducted in pixels in (a) after removing BKG pixels which were also
Section 5. found as target pixels; (d) total desired pixels obtained by
(2) In hyperspectral imagery noise is generally non- combining the pixels in (b) and (c). According to the results
Gaussian. This is mainly due to the fact that many obtained for the scenarios TI and TE in Figures 4 and 5, target
unknown subtle substances such as clutters, interfer- pixels and BKG pixels overlapped the total number of pixels
ers uncovered by hyperspectral imaging sensors are of interest have been found to be either 6 or 7 with five pure
actually interference and not noise, in which case panel pixels plus one or two pixels corresponding to either
these unwanted interferers should be considered as subpanel pixels or BKG pixels. These found target and BKG
structure noise to represent bias instead of random pixels can be used a posteriori target information for further
noise. If all such unknown substances are absent follow-up various tasks in image analysis.
in the image data which is the case of these two
scenarios, it leaves only random noise. Under this 5.1. Endmember Extraction. In this section, two well-known
circumstance, the Gaussian noise is most appropri- endmember extraction algorithms, PPI [13] and N-finder
ate, which is exactly the case that it is assumed in algorithm (N-FINDR) [14] were implemented for end-
communications. In light of this interpretation, it is member extraction. According to the ground truth used to
reasonable to simulate Gaussian for the scenarios TI simulate the two scenarios TI and TE, the p for TI and TE
and TE because the simulated image BKG is clean. are 5 and 0, respectively, but the VD estimated for both
scenarios TI and TE was 6, nVD = 6. Since both require data
5. Synthetic Image Experiments dimensionality reduction prior to endmember extraction,
the maximum noise fraction MNF [15] was to reduce data
This section presents three applications, endmember extrac- dimensionality to 5 and 6, that is, the number of dimensions
tion, unsupervised target detection, and LSU-based tar- needed to be retained for analysis, q = 5 and 6. Figure 6(a)
get classification to show that each application requires shows endmember extraction results by the PPI using 200
a different level of a posteriori information to perform skewers by letting q = 5 and 6 for both scenarios TI
target analysis. Among these applications the endmember and TE. Due to the fact that the PPI does not have prior
extraction is one that needs the least information with knowledge about the value of p and provides no guideline
only the number of endmembers, p required to be known. to select endmembers, all the data sample vectors with their
The value of the p can be determined by the VD. To the PPI counts greater than 0 were extracted for endmember
contrary, the LSU-based target classification requires the extraction to ensure that no pure panel pixels were left out.
most information including a posteriori information to be But, it does not mean that all the data sample vectors were
used to form the linear mixture model for LSU. endmembers. As shown in Figure 6(a) all the 100 panel pixels
Assume that no prior knowledge about the scenarios TI in the first 2 columns were among many hundreds of pixels
and TE is provided. In both scenarios the VD-estimated extracted by the PPI in the TI and TE scenarios. As a matter
value, nVD was 6 as long as the false alarm probability of fact, some pure panel pixels in these experiments had their
PF ≤ 10−1 . Therefore, nVD = 6 was used for the value PPI counts with the smallest value 1. This implies that it is
of p throughout the experiments conduced in this section. generally not true that a data sample vector with a higher
Figures 4(a)–4(d) and 5(a)–5(d) show the target pixels in PPI count is a more likely endmember. In this case, the PPI
TI and TE found by the 3 LS-based methods, referred to required human intervention to choose an appropriate value
8 EURASIP Journal on Advances in Signal Processing

(i) ATGP-USTFA and UNCLS-USTFA: (a) 6 BKG pixels, (b) 6 target pixels, (c) 1 BKG pixel, (d) 7 pixels in (b+c). (ii) UFCLS-USTFA: (a) 6 BKG pixels, (b) 6 target pixels, (c) 1 BKG pixel, (d) 7 pixels in (b+c).
Figure 4: Target pixels extracted by three unsupervised algorithms ATGP-USTFA, UNCLS-USTFA, and UFCLS-USTFA for scenario TI.
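The results in Figures 4 and 5 come from running the same LS-based algorithm twice, once on the original data and once on the sphered data, and then removing BKG pixels that duplicate target pixels. A minimal sketch of that sphering step and the SAM-based comparison (USTFA steps (2)-(4)) is given below; it assumes numpy and reuses the atgp() sketch shown after Section 3, so the names are illustrative rather than the authors' code.

    import numpy as np

    def sphere(X):
        """Remove the sample mean and whiten the covariance (zero mean, unit variance)."""
        Xc = X - X.mean(axis=1, keepdims=True)
        vals, vecs = np.linalg.eigh(np.cov(Xc))
        W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
        return W @ Xc

    def sam(x, y):
        """Spectral Angle Mapper between two spectra."""
        c = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
        return float(np.arccos(np.clip(c, -1.0, 1.0)))

    def ustfa(X, n_vd, angle_thresh=0.05):
        bkg_idx = atgp(X, n_vd)              # pass 1: 2nd-order (BKG) pixels from the original data
        tgt_idx = atgp(sphere(X), n_vd)      # pass 2: high-order (target) pixels from the sphered data
        # Drop BKG pixels whose spectra are too close (in SAM) to any extracted target pixel.
        keep = [i for i in bkg_idx
                if all(sam(X[:, i], X[:, j]) > angle_thresh for j in tgt_idx)]
        return tgt_idx, keep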

(i) ATGP-USTFA and UNCLS-USTFA: (a) 6 BKG pixels, (b) 6 target pixels, (c) 0 BKG pixels, (d) 6 pixels in (b+c). (ii) UFCLS-USTFA: (a) 6 BKG pixels, (b) 6 target pixels, (c) 1 BKG pixel, (d) 7 pixels in (b+c).
Figure 5: Target pixels extracted by three unsupervised algorithms ATGP-USTFA, UNCLS-USTFA, and UFCLS-USTFA for scenario TE.
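The experiments above fix the number of generated pixels at n_VD = 6 via the VD. A much simplified sketch of the eigenvalue-difference test described in the Introduction is shown below for orientation only: it compares correlation- and covariance-matrix eigenvalues band by band and counts significantly positive differences. The per-band threshold here uses a crude variance approximation in place of the exact Neyman-Pearson derivation of [3], so the function and its parameters are assumptions, not the published HFC estimator.

    import numpy as np
    from scipy.stats import norm

    def estimate_vd(X, pf=1e-3):
        """X: L x N array of pixel spectra. Rough estimate of the number of signal sources."""
        L, N = X.shape
        R = (X @ X.T) / N                      # sample correlation matrix
        K = np.cov(X, bias=True)               # sample covariance matrix
        lam_R = np.sort(np.linalg.eigvalsh(R))[::-1]
        lam_K = np.sort(np.linalg.eigvalsh(K))[::-1]
        # Crude approximation of the variance of each eigenvalue difference under H0.
        sigma = np.sqrt(2.0 * (lam_R**2 + lam_K**2) / N)
        tau = norm.ppf(1.0 - pf) * sigma       # per-band threshold for false alarm probability pf
        return int(np.sum(lam_R - lam_K > tau))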

were no endmembers present in the scenario TE and the N-


FINDR tried to extract the most purest panel pixels from the
data. When the value of p was set to low such as 5, the N-
FINDR extracted a mixed BKG pixel instead of panel pixels
in the 3rd row as an endmember in which case the BKG pixel
q=5 q=6 q=5 q=6 exhibited more purity than the panel pixels in the 3rd row.
TI TE This also explained that the endmembers did not necessarily
(a) PPI using 200 skewers for TI and TE have higher PPI counts in Figure 6(a). This phenomenon
was also well demonstrated in Figure 5(i) where the first
panel pixel in the 3rd row was extracted as the sixth target
pixels. If we further compare the results in Figure 6 to those
in Figure 4, it is clear that the three LS-based algorithms
performed as if they were endmember extraction algorithms
which were able to extract all the five endmembers from
p=5 p=6 p=5 p=6 both the original data and the sphered data for TI as the
TI TE
PPI and N-FINDR did in Figures 6(a) and 6(b). However,
(b) Endmembers extracted by N-FINDR for TI and TE for the TE experiments in Figure 5, only the UFCLS-USTFA
missed one endmember in the 3rd row due to the fact that the
Figure 6: Endmember extraction results by PPI and N-FINDR for
TI and TE.
TE did not satisfy the abundance sum-to-one constraint and
the UFCLS was a fully abundance-constrained algorithm.
Nevertheless, the three LS-based algorithms were also able to
extract pixels that corresponded to most purest signatures.
to threshold PPI counts to find desired endmembers despite The experiments demonstrated by Figures 4 and 5 showed
that it does not need to know p. This shows that the human that the three LS-based algorithms can be also used for the
manipulation is a key factor to make the PPI successful and purpose of endmember extraction provided that nVD is set to
effective. p in which case some of extracted targets pixels may not be
Unlike the PPI the N-FINDR did require the knowledge endmembers, particularly, for the TE scenario. This makes
of the p in which case we assumed that p = nVD = 5 and 6 for sense since a spectrally distinct signature is not necessarily
both scenarios. Figure 6(b) shows the endmember extraction an endmember and nVD is generally greater than or equal
results where the N-FINDR successfully extracted the first to p. In the scenarios of TI and TE, the p is supposed to be
panel pixels in each of five different rows that corresponded 5 and 0, respectively. The reason that nVD = 6 > p = 5
to the five distinct mineral signatures as endmembers for is because the BKG spectral signature b in Figure 2 used to
TI with p = 5 and 6. However, it is interesting to note that simulate the image BKG is very distinct from the other five
this was not true for TE with p = 5 where it missed the mineral signature in which case it must be considered as a
panel pixels in the 3rd row. In order to extract a panel pixel signature even it is a mixed signature. However, if we use a
in the 3rd row the p must be assumed to be at least 6 as BKG signature equally mixed by the five mineral signatures to
shown in the experiment. This was due to the fact that there replace the b in Figure 2 to simulate the image BKG, the nVD

Panels in row 4 Panels in row 1 Panels in row 5 Panels in row 2 Panels in row 3 BKG BKG
(a) CEM in conjunction with ATGP-USTFA and with UNCLS-USTFA

Panels in row 4 Panels in row 2 Panels in row 1 Panels in row 5 Panels in row 3 BKG BKG
(b) CEM in conjunction with UFCLS-USTFA

Figure 7: CEM detection results for TI.

Panels in row 1 Panels in row 4 Panels in row 5 Panels in row 2 Panels in row 3 Panels in row 3
(a) CEM in conjunction with ATGP-USTFA and with UNCLS-USTFA

Panels in row 1 Panels in row 3 Panels in row 4 Panels in row 5 Panels in row 2 BKG BKG
(b) CEM in conjunction with UFCLS-USTFA

Figure 8: CEM detection results for TE.
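The detection maps in Figures 7 and 8 are produced by applying the CEM filter of [17] to each target signature found by the USTFA. A minimal sketch of that filter is given below, assuming numpy, an L x N data matrix X, and a desired signature d (one of the extracted target pixels); the ridge term is an added numerical safeguard, not part of the original formulation.

    import numpy as np

    def cem(X, d, ridge=1e-6):
        """CEM detector: w = R^{-1} d / (d^T R^{-1} d), applied to every pixel."""
        L, N = X.shape
        R = (X @ X.T) / N + ridge * np.eye(L)   # sample correlation matrix (regularized)
        Rinv_d = np.linalg.solve(R, d)
        w = Rinv_d / (d @ Rinv_d)               # CEM filter weights
        return w @ X                            # detected abundance for every pixel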

turns out to be 5 which is exactly the same as p = 5. Since Comparing the result in Figure 8 to that in Figure 7 it seemed
the result was reported in [16] with more details, it is not that the CEM performed better in TE than in TI due to the
included here. fact that the target panels were superimposed atop the BKG
pixels.
5.2. Unsupervised Target Detection. Unlike endmember ex- One comment on differentiating the above unsupervised
traction the unsupervised target detection extracts targets target detection from anomaly detection is noteworthy.
regardless of whether or not they are endmembers. As While the former requires a posteriori target knowledge to
defined previously, the targets of interest to be considered detect specific targets, the latter performs target detection
in this paper are specified by their statistical properties without any target knowledge whatsoever. In this case, it
in spectral characterization and were extracted in Figures does not know what targets it detects. The following simple
4(d) and 5(d) by three LS-based algorithms. Then these example provides a clue of how controversial this issue is.
found targets of interest were further as a posteriori target Figures 9(a) and 9(b) shows a set of the same various
information to perform unsupervised target detection by the target panels with four different sizes implanted in two
constrained energy minimization (CEM) developed in [17] uniform image BKGs with sizes of 64 × 64 pixel vectors and
where Figures 7 and 8 show their CEM-detection results for 200 × 200 pixel vectors, respectively, where the 5 panels in
TI and TE, respectively. the 1st column are size of 6 × 6 pixel vectors, the 5 panels in
As shown in Figures 7 and 8, the unsupervised CEM- the 2nd column are size of 3 × 3 pixel vectors, the 5 panels in
based target detection performed well using each of the the 3rd column are size of 2 × 2 pixel vectors, and then the 5
found targets of interest as a desired target signature. panels in the 4th column are size of 1 × 1 pixel vectors.

the RXD performance in which case one target detected as


an anomaly in a large image size may not be an anomaly
in a smaller image size. Another issue is determination
of number of anomalies detected by the RXD. Since the
RXD generates real values of all image pixels which can be
considered as detected abundance fractions, it requires an
appropriate threshold to determine which pixel is anomaly
and which pixel is not. A third main issue is well-illustrated
in Figure 10 where the RXD cannot discriminate among all
pixels it detected. All of these issues present challenges for
(a) Image of 64 × (b) Image of 200 × 200 pixels image analysts. It is interesting to note that our proposed
64 pixels LS-based approach provides solutions to all these three
issues.
Figure 9: Target panels with four various sizes implanted in two
uniform image BKGs with sizes of 64×64 pixel vectors and 200×200
5.3. Linear Spectral Unmixing for Target Classification. A key
pixel vectors.
to the success in LSU is to find an appropriate signature
matrix M to form a linear mixing model r = Mα + n where
r is an image pixel and n is a model correction term. In
the supervised LSU (SLSU), this matrix M is assumed to be
known a priori. However, when it comes to unsupervised
LSU (ULSU) the knowledge of the signature matrix M
is not available and must be obtained directly from the
data. The unsupervised LS-based target finding algorithm
presented in Section 2 provides a means of finding such
n
matrix M. More specifically, let {b 4 LS } BKG and {tLS }nVD be
i i=1 j j =1
BKG and target signatures found by an LS-based target
finding algorithm. Then we can form a desired signature
matrix M = [tLS LS LS 4 LS 4 LS 4 LS
(a) Image of 64 × 64 pixels (b) Image of 200 × 200 pixels 1 t2 · · · tnVD b1 b2 · · · bnBKG ] to unmix all
image pixels r. Figures 11 and 12 show the unmixed results
Figure 10: Results of operating RXD on images in Figures 9(a) and which classified the entire image into n V D spectral classes
9(b). nVD
via target pixels,{tLS
j } j =1 found by the USTFA for scenarios
TI and TE, respectively where the ATGP, UNCLS, and UFCLS
were used as the USTFA and the classification was performed
The target panels in Figures 9(a) and 9(b) were implanted by three linear spectral unmixing methods, Least Squares
by replacing the BKG pixels with target panel pixels as the TI Orthogonal Subspace Projection (LSOSP), referred to as
scenario does. Figures 10(a) and 10(b) shows the results of signature subspace projection (SSP) in [2, pages 144–146]
operating a widely used anomaly detector developed by Reed and [19], Non-negativity Constrained Least Squares (NCLS)
and Yu [18], called RX detector (RXD) on the two images in [6] and Fully Constrained Least Squares (FCLS) [7].
Figures 9(a) and 9(b), respectively, where RXD has struggled The results in (i), (ii), and (iii) of Figures 11 and 12 were
with finding panel pixels in 3rd–5th columns in Figure 10(a) obtained by using LSOSP, NCLS, and FCLS to unmix data
and also missed most of subpanel pixels in the 4th and samples in the TI and TE scenarios via the signature matrix
5th columns in Figure 10(b). In addition, the results in M formed by the target pixels found in Figures 4(d) and 5(d),
Figures 10(a) and 10(b) did not discriminate target pixels it respectively, where target pixels were identified by the ground
detected. truth along with their quantification results for comparison.
An immediate finding by comparing the results in It should be noted that each figure was arranged in the order
Figure 10(b) to that in Figure 10(a) leads to an interesting extracted by the unsupervised target algorithm in Figures
observation: the target panels of sizes 2 × 2 and 1 × 1 4(d) and 5(d). Since the whole process is unsupervised
that are detected by the RXD in Figure 10(b) as anomalies we must unmix the data using all target pixels including
now become undetectable and are no longer anomalies in the BKG pixels. Figure 12 shows the linear unmxing results
Figure 10(a) where two images in Figures 10(a) and 10(b) are performed on the scenario TE using the target pixels found
3
shown in the same size for clear and better visual assessment. in Figure 5(d). Due to the use of a subpanel pixel p5,1 in
Moreover, the target panels of sizes 6 × 6 and 3 × 3 detected Figure 5(d) as one of the target signatures to unmix TE the
in Figure 10(a) also become smeared and blurred compared resulting abundance fractions for the two subpanel pixels
to their counterparts in Figure 10(b) which are detected were 100% in Figure 12(b) by all the three LSU methods
clearly as anomalies. Why does the same RXD produce so compared to the case in Figure 12(a) where the target and
different results for the same set of target panels? This simple BKG pixels found by ATGP and UNCLS were used for
example sheds light on several issues resulting from the RXD. unmixing and the two subpanel pixels were unmixed to
As shown in [3] the image size had tremendous effect on their correct abundance fractions of 50% and 25% Calcite,

(a) ATGP-USTFA and UNCLS-USTFA and (b) UFCLS-USTFA, each unmixed by (i) LSOSP, (ii) NCLS, and (iii) FCLS, with classification maps for the panels in rows 1-5 and BKG, and the corresponding abundance quantification plots.
Figure 11: Results of using LSOSP, NCLS, and FCLS to unmix TI via the target pixels found in Figure 4(d).
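For readers who want to reproduce the three unmixing options compared in Figures 11-14, the sketch below shows one common way to compute them with numpy and scipy: an unconstrained least squares estimate, NCLS via non-negative least squares, and an FCLS approximation that enforces the sum-to-one constraint through a heavily weighted extra equation. The augmentation trick is a standard approximation, not necessarily the exact solver used by the authors; M is an L x p signature matrix and r a length-L pixel.

    import numpy as np
    from scipy.optimize import nnls

    def unmix_ls(M, r):
        """Unconstrained least squares abundance estimate."""
        return np.linalg.lstsq(M, r, rcond=None)[0]

    def unmix_ncls(M, r):
        """Abundance non-negativity constrained estimate (NCLS-style)."""
        return nnls(M, r)[0]

    def unmix_fcls(M, r, weight=1e5):
        """FCLS approximation: append a weighted row of ones to impose sum-to-one, then solve NNLS."""
        M_aug = np.vstack([M, weight * np.ones((1, M.shape[1]))])
        r_aug = np.append(r, weight)
        return nnls(M_aug, r_aug)[0]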

respectively. Since LSOSP is unconstrained, the 20 pure panel in Figures 13 and 14 obtained by SLSU and Figures 11 and
pixels in row 3 were overestimated in Figures 12(b)–12(i) by 12 obtained by USLU the results were comparable in terms
3
using the subpanel pixel p5,1 for unmixing. of quantification for both scenarios TI and TE except one
In order to further investigate this intriguing finding, we case of the TE scenario where FCLS was used to perform
conducted experiments to use the five mineral signatures SLSU and the resulting abundance fractions for every single
plus the BKG signature plotted in Figure 2 to form the panel pixel in row 5 were estimated to be 100% as opposed to
signature matrix M and then perform SLSU for scenarios TI zero for every pixel in rows 1–4 as shown in Figure 14(c). The
and TE using the same three LSU methods. Figures 13(a)– main reason is that the panel pixels in the TE scenario were
13(c) and 14(a)–14(c) show their unmixed results for all added to the BKG pixels such that the abundance fractions
the 130 panel pixels along with their detected abundance of panel pixels and BKG pixels were not summed up to one.
fractions for TI and TE, respectively. Comparing the results However, even though the abundance sum-to-one constraint

(a) ATGP-USTFA and UNCLS-USTFA and (b) UFCLS-USTFA, each unmixed by (i) LSOSP, (ii) NCLS, and (iii) FCLS, with classification maps for the panels in rows 1-5 and BKG, and the corresponding abundance quantification plots.
Figure 12: Results of using LSOSP, NCLS, and FCLS to unmix TE via the target pixels found in Figure 5(d).

assumption is violated in TE, the FCLS still tried to impose the constraint by giving all abundance fractions to the most distinct spectral signature, which is Muscovite in this image, used to simulate panel pixels in row 5. For this particular case the ULSU was superior to the SLSU because ULSU obtains target knowledge directly from the data, which may be more realistic and accurate than the prior knowledge used by the SLSU. However, in the scenario TI the panel pixels are implanted into the BKG with the corresponding BKG pixels removed, in which case the abundance sum-to-one assumption still holds. As a result, the FCLS performed well regardless of whether the LSU is performed supervised or unsupervised. Since the LSOSP and NCLS did not impose the abundance sum-to-one constraint, they performed well for both TI and TE scenarios. These experiments also provide strong evidence of the advantages of using synthetic images, because it is nearly impossible to conduct such an experiment using real image data where no complete ground truth is available that can be used for quantitative and qualitative data analyses.
In concluding this section, two comments are worthwhile.
[Figure 13 plots omitted: quantification of the panel pixels in rows 1–5 for (a) LSOSP, (b) NCLS, and (c) FCLS.]
Figure 13: Results using LSOSP, NCLS, and FCLS to unmix TI with assuming signature knowledge in Figure 2.

[Figure 14 plots omitted: quantification of the panel pixels in rows 1–5 for (a) LSOSP, (b) NCLS, and (c) FCLS.]
Figure 14: Results using LSOSP, NCLS, and FCLS to unmix TE with assuming signature knowledge in Figure 2.

(1) Although the two synthetic image scenarios seem simple, the value of the experiments should be appreciated. These scenarios provide an objective validation of any designed algorithm under a fully controllable environment with complete ground truth. A good example is illustrated by Figure 14(c), where the FCLS completely failed in the scenario TE because the sum-to-one abundance constraint is violated. If it had been applied to real data we would not have known that a fully abundance-constrained LSU could
[Figure 15 panels omitted: (a) the HYDICE panel scene; (b) the ground truth map of the 19 panel pixels p11–p53; (c) spectra of the five panel signatures p1, p2, p3, p4, p5.]
Figure 15: (a) A HYDICE panel scene which contains 15 panels; (b) Ground truth map of spatial locations of the 15 panels; (c) five panel
signatures p1 , p2 , p3 , p4 , p5 .

not be used as a signal detection technique when the linear mixing model was used as a signal/noise detection model, in which case a signal is embedded in a pixel corrupted by an additive noise like the scenario TE so that the abundance fractions of the signal and noise were not summed up to one. If an algorithm does not pass the synthetic image experiments, it will be very likely that it may not work in real data.

(2) Due to the significantly improved spectral resolution provided by hyperspectral imaging sensors, hyperspectral imaging generally performs "target"-pixel-based spectral analysis rather than "class-map/pattern"-based spatial analysis as conducted in traditional image processing. Therefore, BKG pixels are usually not of major interest and no BKG analysis is necessary for hyperspectral imaging. Instead, they are only used for BKG suppression to improve target detection and classification. Because of that, the scenarios TI and TE suffice to serve the purpose where only complete knowledge of target panel pixels is required for target analysis and BKG pixels can be made as simple as possible by adding Gaussian noise for suppression.

6. Real Image Experiments

In the previous section the synthetic image experiments were used to show the unsupervised target analysis in three applications where the ground truth was used to substantiate the results. In this section, we further conduct real image experiments to demonstrate that unsupervised target analysis is indeed superior to supervised target analysis. The reason for this is that the prior knowledge used by the supervised target analysis generally does not represent true knowledge about the real data because of many unknown factors such as interference, noise, and so forth present in the data that may contaminate the prior knowledge. The required true knowledge must be acquired and obtained directly from the real data itself. Two sets of real data are used for experiments.

6.1. HYDICE Data. The first data set is a real HYperspectral Digital Image Collection (HYDICE) scene shown in Figure 15(a); it has a size of 64 × 64 pixel vectors with 15 panels in the scene and the ground truth map in Figure 15(b). It was acquired by 210 spectral bands with a spectral coverage from 0.4 μm to 2.5 μm. Low signal/high
[Figure 16 panels omitted: the pixel locations and their extraction orders are indicated in the original figure.]
Figure 16: ATGP-generated BKG and target pixels (a) 9 BKG pixels in original data; (b) 9 target pixels in sphered data; (c) 5 BKG pixels not
identified as target pixels; (d) 14 pixels obtained by combining the pixels in (b)–(c).

noise bands: bands 1–3 and bands 202–210; and water each of five rows and can be considered as a set of target
9
vapor absorption bands: bands 101–112 and bands 137–153 pixels Starget = {tATGP
j } j =1 . Figure 16(c) singled out the 5
were removed. So, a total of 169 bands were used in the BKG
experiments. The spatial resolution and spectral resolution of pixels which were identified as BKG pixels S4 = {b4 ATGP }
i
this image scene are 1.56 m and 10 nm, respectively. Within by removing the 4 target pixels with a similarity measure such
the scene in Figure 15(a) there is a large grass field BKG, as SAM and Figure 16(d) shows a total number of 14 pixels
BKG
and a forest on the left edge. Each element in this matrix obtained by combining the BKG set S4 in Figure 16(c) with
is a square panel and denoted by pi j with rows indexed by the target set S target
in Figure 16(b) into a BKG-target merged
i and columns indexed by j = 1, 2, 3. For each row i = BKG
set S4 ∪S target
to be used for spectral unmixing where
1, 2, . . . , 5, there are three panels painted by the same paint
the numbers in the figures indicated the orders of pixels
but with three different sizes. The sizes of the panels in the
extracted by the ATGP.
first, second, and third columns are 3 m × 3 m, 2 m × 2 m,
Similarly, Figures 17(d)–18(d) also show the results
and 1 m × 1 m, respectively. Since the size of the panels in
produced by the UNCLS with 9 target pixels and 5 BKG
the third column is 1 m × 1 m, they cannot be seen visually
pixels extracted and by the UFCLS with 9 target pixels and
from Figure 15(a) due to the fact that its size is less than
6 BKG pixels extracted. Since the target pixels of interest are
the 1.56 m pixel resolution. For each column j = 1, 2, 3,
those extracted by the three LS-based algorithms in Figures
the 5 panels have same sizes but in five different paints.
16(b), 17(b), and 18(b) from the sphered data they should
However, it should be noted that the panels in rows 2 and
have included five pure targets pixels that corresponded to
3 were made by the same material with two different paints.
all the five pure panel signatures which was exactly the
Similarly, it is also the case for panels in rows 4 and 5.
case where these five pure target pixels, p11 , p221 , p312 , p411 ,
Nevertheless, they were still considered as different panels
and p521 were found and identical in Figures 16(b), 17(b),
but our experiments will demonstrate that detecting panels
and 18(b). Additionally, among these five pure target pixels,
in row 5 (row 3) may also have effect on detection of panels
p11 , p312 , and p521 were the only three target pixels extracted
in row 4 (row 2). The 1.56 m-spatial resolution of the image
as BKG pixels in the original data in Figures 16(a), 17(a),
scene suggests that most of the 15 panels are one pixel in
and 18(a). This was due to the fact that the panel pixels
size except that the panels in the 1st column with the 2nd,
in rows 2 and 4 have very similar signatures to those in
3rd, 4th, 5th rows which are two-pixel panels, denoted by
rows 3 and 5, respectively, according to the ground truth
p211 , p221 , p311 , p312 , p411 , p412 , p511 , p521 . As a result, there are
in which case they were not extracted as endmembers.
a total 19 panel pixels. Figure 15(b) shows the precise spatial
Interestingly, these three pure target pixels will be the
locations of these 19 panel pixels where red pixels (R pixels)
only endmembers extracted by any endmember extraction
are the panel center pixels and the pixels in yellow (Y pixels)
algorithms except the dimensionality reduction is performed
are panel pixels mixed with the BKG. Figure 15(c) shows the
by the independent component analysis (ICA) as shown in
spectra of five panel signatures p1 , p2 , p3 , p4 , p5 obtained
the following section.
by averaging the center R panel pixels for each of five
rows.
6.1.1. Endmember Extraction. Once again experiments were
First of all, the VD estimated for this scene, nVD was
also conducted for endmember extraction using the PPI and
9 with the false alarm probability PF ≤ 10−3 . Figure 16(a)
N-FINDR to extract endmembers from the HYDICE scene
shows the 9 target pixels which were extracted directly from
in Figure 15(a). Figures 19(a) and 20(a) show the pixels with
the original data by the ATGP and can be considered as a set
9 their PPI counts greater than 0 extracted by PPI with using
of BKG (BKG) pixels SBKG = {bATGP j } j =1 and in which three 500 skewers and 6 endmembers extracted by the N-FINDR
panel pixels from rows 1, 3, and 5 were included. Figure 16(b) where both used MNF to reduce data dimensionality to 9.
shows the 9 target pixels extracted from the sphered data by As we can see from Figure 20(a) the N-FINDR could only
the ATGP which included five panel pixels extracted from extract two pure panel pixels, p312 , and p521 corresponding
[Figure 17 panels omitted: the pixel locations and their extraction orders are indicated in the original figure.]
Figure 17: UNCLS-generated BKG and target pixels (a) 9 BKG pixels in original data; (b) 9 target pixels in sphered data; (c) 5 BKG pixels
not identified as target pixels; (d) 14 pixels obtained by combining the pixels in (b)–(c).

[Figure 18 panels omitted: the pixel locations and their extraction orders are indicated in the original figure.]

Figure 18: UFCLS-generated BKG and target pixels, (a) 9 BKG pixels in original data; (b) 9 target pixels in sphered data; (c) 6 BKG pixels
not identified as target pixels; (d) 15 pixels obtained by combining the pixels in (b)–(c).

(a) MNF on original data (b) ICA on original data (c) Sphered data

Figure 19: Endmember extraction results by PPI using 500 skewers.

to endmembers which are among the 9 BKG pixels extracted in Figures 16(a), 17(a), and 18(a).
However, if we used the ICA instead of MNF to perform dimensionality reduction prior to endmember extraction, the PPI and N-FINDR were able to extract four pure target pixels, p11, p312, p411, and p521, in Figures 19(b) and 20(b). As a matter of fact, as shown in [20], the PPI and N-FINDR would fail to extract all the five endmembers if the dimensionality reduction was performed by 2nd order statistics transforms such as principal components analysis (PCA), MNF, or singular value decomposition (SVD). With an interesting twist, by applying the PPI and N-FINDR to the sphered data without dimensionality reduction as the USTFAs did for Figures 16–18, both were also able to extract all the five endmembers as shown in Figures 19(c) and 20(c). In other words, the only way for an endmember extraction algorithm to succeed in extracting all the five endmembers is either to use ICA for dimensionality reduction or to use the sphered data instead of the original data, where in both cases the data characterized by 1st and 2nd order statistics have been removed prior to endmember extraction. This intriguing finding showed that using the ICA as dimensionality reduction has the same effect as applying endmember extraction algorithms such as PPI, N-FINDR, or

(a) MNF on original data (b) ICA on original data (c) Sphered data

Figure 20: 9 endmembers extracted by N-FINDR.

(a) CEM in conjunction with ATGP-USTFA or with UNCLS-USTFA

(b) CEM in conjunction with UFCLS-USTFA

Figure 21: CEM detection results using the 9 targets of interest generated by the ATGP, UNCLS, and UFCLS in Figures 16(b), 17(b), and
18(b).

the three USTFAs to the sphered data where both ICA and
sphered data retain targets of interest characterized by high-
order statistics. As a result, the proposed three USTFAs can
be also used for endmember extraction.
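The sphering operation that the USTFAs rely on is defined earlier in the paper rather than here; as a reminder of the general idea, the following is only a minimal sketch (our own illustration, with invented function and variable names) of one standard way to sphere hyperspectral data, that is, to remove its 1st and 2nd order statistics so that only high-order structure remains.

```python
import numpy as np

def sphere(data):
    """Sphere (whiten) hyperspectral pixel vectors.

    data: (num_pixels, num_bands) array of spectral vectors.
    Returns pixel vectors with zero mean and identity covariance, so that
    only targets characterized by high-order statistics remain prominent.
    """
    x = data - data.mean(axis=0)               # remove the sample mean (1st order)
    cov = np.cov(x, rowvar=False)              # sample covariance (2nd order)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 1e-12)       # guard against numerically zero eigenvalues
    whitener = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return x @ whitener
```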

6.1.2. Unsupervised Target Detection. This section conducts


experiments for unsupervised target detection performed by
the CEM using a posteriori target knowledge produced by the
three USTFAs from the sphered data. The results are shown
in Figures 21(a) and 21(b) where the desired target signatures
used by the CEM were the targets of interest extracted in
Figures 16(b), 17(b), and 18(b).
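For reference, the CEM detector applied above has the standard closed form w = R⁻¹d / (dᵀR⁻¹d), where R is the sample autocorrelation matrix and d is the desired target signature. A minimal sketch of that form (our own illustration, not the authors' code) is:

```python
import numpy as np

def cem_detector(data, d):
    """Constrained Energy Minimization (CEM) detector, standard textbook form.

    data: (num_pixels, num_bands) image pixels; d: (num_bands,) desired signature.
    Returns one detection score per pixel.
    """
    R = data.T @ data / data.shape[0]          # sample autocorrelation matrix
    Rinv_d = np.linalg.solve(R, d)
    w = Rinv_d / (d @ Rinv_d)                  # CEM finite-impulse-response filter
    return data @ w                            # filter output for every pixel
```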
In order to further make comparison with anomaly detection, the RXD was implemented and its result is shown in Figure 22.

Figure 22: RXD result.

By visually comparing Figure 22 to Figure 21, the RXD missed all the five subpixel panels in the 3rd column, while the CEM was able to extract all these five subpixel panels using the a posteriori target information provided by the three USTFAs. Most importantly, the RXD could not discriminate the targets it detected. This is not the case for the CEM, which used each piece of a posteriori target knowledge to discriminate among targets found by a USTFA.

6.1.3. Linear Spectral Unmixing. In the unsupervised target detection discussed above the targets of interest were those found by an USTFA from the sphered data so as to achieve unsupervised target detection rather than target classification. In this section, we further perform target classification by LSU. In order for LSU to perform effectively, the target signatures used to form the signature matrix M must include all the target pixels {t_j^LS}_{j=1}^{n_VD} and BKG pixels {b̃_i^LS}_{i=1}^{n_BKG} to represent the entire data, where the target pixels {t_j^LS}_{j=1}^{n_VD} are the target signatures we would like to classify and the BKG pixels {b̃_i^LS}_{i=1}^{n_BKG} are considered as undesired signatures which can be removed to enhance target classification performance. The three LSU methods, LSOSP, NCLS, and FCLS, were used

(a) LSOSP

(b) NCLS

(c) FCLS

Figure 23: 9 target classes obtained by LSOSP, NCLS, and FCLS using the target pixels generated by ATGP-USTFA.

(a) LSOSP

(b) NCLS

(c) FCLS

Figure 24: 9 target classes obtained by LSOSP, NCLS, and FCLS using the target pixels generated by UNCLS-USTFA.

to unmix the high-order target pixels {t_j^LS}_{j=1}^{n_VD} extracted from the sphered data, each of which is considered to represent one specific target class. Figures 23(a)–23(c) and 25(a)–25(c) show their corresponding results, where the figures labeled by (a), (b), and (c) were unmixed results by LSOSP, NCLS, and FCLS, respectively.
Obviously, the results obtained by the NCLS and the FCLS performed better than those by the LSOSP in Figures 23–25 due to the imposed constraints on abundance fractions.
Interestingly, while the results in Figures 23(b) and 23(c) were similar to the results obtained in Figures 24(b)-24(c) and 25(b)-25(c), the unmixed results obtained by the LSOSP in Figure 25(a) were slightly better than those in Figures 23(a) and 24(a) in terms of detection of 15 panels in five rows. This improvement was mainly due to the fact that the UFCLS produced 6 BKG pixels rather than the 5 BKG pixels produced by the ATGP and UNCLS to perform better BKG suppression.
Since the evaluation of the above unmixed results was performed qualitatively by visual assessment, the conclusions may not be objective. Table 3 further tabulates their quantitative results, where the abundance fractions of the 19 panel pixels estimated by the NCLS and FCLS were very close and both the NCLS and the FCLS outperformed the unconstrained LSOSP, even though the NCLS is only a partially abundance constrained method.
Finally, in order to demonstrate that the performance of unsupervised linear hyperspectral unmixing using the designed USTFA is superior to that of supervised linear

(a) LSOSP

(b) NCLS

(c) FCLS

Figure 25: 9 target classes obtained by LSOSP, NCLS, and FCLS using the target pixels generated by UFCLS-USTFA.

“9” is probably sufficiently enough for spectral unmixing to


perform well.
Interferer If we compare the results in Figure 27 to Figures 23–25, it
is obvious that the ULSU performed significantly better than
its supervised counterparts. Table 4 also tabulates the quanti-
tative results obtained in Figures 27(a)–27(c) in comparison
Grass with the results in Table 3 where it clearly showed that
the abundance fractions estimated by unsupervised linear
unmixing in Table 3 were much more accurate than those
Tree obtained by its supervised counterpart in Table 4.
An interesting finding from Table 4 is that the abundance
fractions of panel pixels p412 estimated by the NCLS and
Road p411 , p412 estimated by the FCLS were zero. This was caused
by the fact that the five panel signatures p1 , p2 , p3 , p4 , p5
Figure 26: Four BKG classes obtained by marked areas. in Figure 15(c) used for spectral unmixing were not really
pure signatures because the panel pixels in the 3rd column
that were included for averaging were actually subpixels and
spectral unmixing, we further conducted experiments for not pure signatures. If we conducted the same experiments
supervised linear spectral unmxing where the five panel by using the panel signatures p1 , p2 , p3 , p4 , p5 that were
signatures were obtained from Figures 15(b) and 15(c) obtained by averaging only R panel center pixels in the 1st
and the other four BKG signatures were obtained by prior and 2nd columns for spectral unmixing, Table 5 tabulates the
knowledge as the areas marked in Figure 26 as interferer, tree, abundance fractions of 19 panel pixels by the LSOSP, NCLS,
grass, road to make up 5 target classes representing five panel and FCLS where the abundance fractions of panel pixels p411
signatures and 4 BKG signatures. These signatures represent and p412 were corrected and no longer zero.
exactly 9 signatures estimated by the VD. This fact provides A comment on Table 5 is noteworthy. According to the
further evidence that VD is an effective estimation method ground truth p411 and p412 are the panel center pixels.
to estimate the number of image endmembers for spectral However, from our extensive experience with the HYDICE
unmixing. scene, they are in fact not as pure pixels of 100% abundance
Figures 27(a)–27(c) show the unmixed results of 19 panel purity as we expect. As a result, even though the NCLS is par-
pixels by the three unmixing methods, LSOSP, NCLS, and tially abundance-constrained, it was very comparable to the
FCLS where the unmixed results for 4 BKG classes were not fully abundance-constrained FCLS in terms of abundance
displayed because the BKG classes were not of major interest. estimation. Nevertheless, both performed significantly better
Also, it should be noted that there are more BKG pixels that than the abundance-unconstrained LSOSP.
can be used for this purpose. As a matter of fact, in [7] there Comparing the results in Table 5 against those in Table 4,
were 34 pixels found by the unsupervised FCLS for spectral it apparently shows that using contaminated or inaccurate
unmixing. The results were similar to those presented in our prior knowledge may result in significant distortion in
experiments using only 9 image endmembers. In this case, quantification of abundance fractions. If we further compare

Table 3: Abundance fractions of 19 panel pixels estimated by LSOSP, NCLS, and FCLS using BKG and target pixels found in Figures 23–25
by ATGP-USTFA, UNCLS-USTFA, and UFCLS-USTFA.
Pixels used: {b̃_i^ATGP} ∪ {t_j^ATGP}_{j=1}^9 | {b̃_i^UNCLS} ∪ {t_j^UNCLS}_{j=1}^9 | {b̃_i^UFCLS} ∪ {t_j^UFCLS}_{j=1}^9
LSOSP NCLS FCLS | LSOSP NCLS FCLS | LSOSP NCLS FCLS
p11 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
p12 0.4096 0.4332 0.4120 0.3562 0.4165 0.4001 0.4085 0.4148 0.3850
p13 0.0002 0.0887 0.0841 −0.1073 0.0308 0.0465 0.0142 0.0307 0.0250
p211 0.8421 0.8403 0.8404 0.9180 0.8413 0.8209 0.8648 0.8384 0.8453
p221 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
p22 0.6164 0.6257 0.7308 0.6351 0.6607 0.7127 0.6990 0.6126 0.7405
p23 0.5525 0.4774 0.4724 0.3478 0.4168 0.4153 0.3798 0.4471 0.4498
p311 0.8741 0.8674 0.8627 0.9094 0.8674 0.8628 0.8969 0.8671 0.8634
p312 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
p32 0.5027 0.4249 0.4192 0.5906 0.4713 0.4727 0.5925 0.5149 0.4922
p33 0.2516 0.2614 0.2655 0.3541 0.2959 0.2929 0.3388 0.2880 0.2886
p411 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
p412 0.7685 0.3137 0.3876 0.5827 0.3222 0.3605 0.7976 0.3407 0.3923
p42 0.8085 0.6761 0.6657 0.7965 0.7495 0.7485 0.8414 0.7480 0.7477
p43 0.2363 0.1789 0.1473 0.5047 0.2851 0.2633 0.2790 0.1227 0.1542
p511 0.7204 0.7224 0.7215 0.6954 0.7245 0.7198 0.6973 0.7213 0.7235
p521 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
p52 0.7645 0.7770 0.7689 0.7027 0.7460 0.7244 0.7228 0.7753 0.7740
p53 0.1452 0.1545 0.1537 −0.0144 0.0000 0.0017 0.1215 0.1471 0.1554
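The abundance fractions in Table 3 are produced by the LSOSP, NCLS, and FCLS estimators described earlier in the paper. Purely as a rough illustration (not the authors' implementation), the following sketch shows how unconstrained least-squares abundances and nonnegativity-constrained abundances can be obtained once the signature matrix M has been assembled from the USTFA-generated target and BKG pixels; the sum-to-one constraint is approximated here with the common device of appending a heavily weighted row of ones.

```python
import numpy as np
from scipy.optimize import nnls

def unmix(M, r, delta=1e3):
    """Estimate abundances of pixel r given the signature matrix M (bands x signatures).

    Returns (unconstrained LS, nonnegative LS, nonnegative + approximately sum-to-one).
    """
    a_ls = np.linalg.pinv(M) @ r                  # unconstrained least squares
    a_nn, _ = nnls(M, r)                          # nonnegativity constraint only
    # Approximate the sum-to-one constraint by augmenting with a weighted row of ones.
    M_aug = np.vstack([delta * np.ones((1, M.shape[1])), M])
    r_aug = np.concatenate([[delta], r])
    a_full, _ = nnls(M_aug, r_aug)                # nonnegative and (approximately) sum-to-one
    return a_ls, a_nn, a_full
```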

(a) LSOSP

(b) NCLS

Panels in row 1 Panels in row 2 Panels in row 3 Panels in row 4 Panels in row 5
(c) FCLS
Figure 27: Unmixed results of 15 panels with 19 panel pixels by supervised LSOSP, NCLS, and FCLS using the five panel signatures in Figure 15(c) and 4 BKG signatures obtained by the marked areas in Figure 26.

the results in Table 3 against those in Tables 4 and 5, the ULSU outperformed SLSU significantly. These experiments further demonstrate two facts. One is that SLSU is effective only if the prior knowledge is accurate, such as the synthetic image experiments conducted for scenarios TI and TE in Section 5. Unfortunately, this may not be true when it comes to real world applications where true target knowledge is generally difficult to obtain, if not impossible. Even in the case that prior target knowledge is available, it may not be reliable due to many unknown signal sources that may

Table 4: Abundance fractions of 19 panel pixels in Figures 27(a)– Table 6: SAM values of six signatures in Figure 28(b).
27(c) estimated by LSOSP, NCLS, and FCLS.
Cinder Playa Rhyolite Shade Vegetation Anomaly
LSOSP NCLS FCLS Cinder 0 0.292 0.207 0.273 0.221 0.397
p11 1.4475 0.8309 0.0146 Playa 0.292 0 0.111 0.141 0.114 0.119
p12 0.9155 0.8510 0.8199 Rhyolite 0.207 0.111 0 0.166 0.041 0.227
p13 0.6370 0.2735 0.1946 Shade 0.273 0.141 0.166 0 0.148 0.189
p211 1.2384 0.8115 0.5457 Vegetation 0.221 0.114 0.041 0.148 0 0.224
p221 1.3146 0.7945 0.3429 Anomaly 0.397 0.119 0.227 0.189 0.224 0
p22 0.8558 0.8558 0.8524
p23 0.5912 0.4843 0.4985
p311 1.2482 0.8809 0.8298
p312 1.4713 0.8953 0.7912
Field (LCVF) located in Northern Nye County, NV. Atmo-
spheric water bands and low SNR bands have been removed
p32 0.8240 0.5935 0.7389
from the data, reducing the image cube from 224 to 158
p33 0.4565 0.2761 0.2705
bands. The image in Figure 28 has 10 nm spectral resolution
p411 1.2356 0.1617 0.0000 and 20 m spatial resolution. The ground truth of this image
p412 1.1672 0.0000 0.0000 scene identifies five areas of interest red oxidized basaltic
p42 1.2331 0.9555 0.4772 cinders, rhyolite, playa (dry lake), vegetation, and shade and
p43 0.3641 0.2393 0.2003 their marked spectral signature are shown in Figure 28(b) for
p511 1.1770 0.9599 0.9759 data analysis. It should be noted that there is a two-pixel size
p521 1.4698 0.9551 1.0000 anomaly which cannot be identified by visual inspection.
p52 1.0760 0.9925 1.0000 The VD estimated for this scene using the HFC method
p53 0.2772 0.2029 0.1763 is 4 for PF ≤ 10−3 . Since the ground truth of this image scene
is limited, only the application of LSU was conducted for
experiments. Figures 29(a) and 29(b) show 4 pixels extracted
Table 5: Abundance fractions of 19 panel pixels estimated by
by ATGP each from the original and sphered image data
LSOSP, NCLS, and FCLS using panel signatures in Figure 15(c).
where the sample labeled by 1 extracted in Figure 29(a) was
LSOSP NCLS FCLS also extracted as the sample labeled by 1 in Figure 29(b).
p11 1.2635 0.9551 0.7606 Interestingly, among the 4 extracted samples in Figure 29(b),
p12 0.7366 0.6226 0.5376 the samples labeled by 1 and 2 were identified by the SAM
to belong to the same target class which actually comprises
p13 0.4053 0.1302 0.0079
of one single target, a two-pixel anomaly. These two pixels
p211 1.0853 0.9929 0.9085
are the only samples that were identified as target pixels.
p221 1.1652 0.9530 0.7869 Therefore, a total of 7 samples (3 for BKG pixels and 4
p22 0.7495 0.8130 0.8240 for target pixels), each of which represents a spectral class,
p23 0.4578 0.4018 0.4231 were used for SLSU. According to the ground truth, samples
p311 1.0584 0.9139 0.9136 1-2, 3, 6, 7 represent anomalies, vegetation, cinder, playa,
p312 1.2402 0.9292 0.9024 and shade, respectively. Figure 30 shows the results unmixed
p32 0.7014 0.4699 0.4482 by the LSOSP, NCLS, and FCLS where FCLS seemed to
p33 0.3756 0.2072 0.2100 perform the best where the two anomalous target pixels
p411 1.0532 0.9104 0.5087 which are supposed to belong to the same class were also
p412 0.9507 0.4329 0.4384
unmixed in two different classes. As expected, two unmixed
anomaly pixels in Figure 30 were very similar. Interestingly,
p42 0.9962 0.7863 0.7575
according to the results in Figure 30, the vegetation and
p43 0.2684 0.1896 0.1571
rhyolite were unmixed into the same class due to the fact
p511 0.9531 0.8304 0.8304 that these two spectral signature shapes are very similar and
p521 1.1738 1.0295 1.0000 close according to their normalized spectra in Figure 28(c).
p52 0.8732 0.9354 0.9353 Table 6 further tabulates the SAM values of six signatures
p53 0.2303 0.1596 0.1372 in Figure 28(b) where the vegetation and rhyolite spectral
signatures are indeed very close. Such subtle difference
can be only seen from the prior knowledge provided in
contaminate the knowledge. This leads to the second fact that Figure 28(c).
to avoid using unreliable prior knowledge ULSU certainly Similarly, UNCLS-USTFA generated 4 BKG pixels in
provides a better alternative to SLSU. Figure 31(a) and 4 target pixels in Figure 31(b) where 2
pixels were in Figure 31(c) were identified as BKG pixels. By
6.2. AVIRIS Data. The second data is an Airborne Visi- combining pixels in Figures 31(b) and 31(c) a total of 6 pixels
ble InfraRed Imaging Spectrometer (AVIRIS) image scene were used by LSU as signatures for unmixing where pixels
shown in Figure 28(a) which is the Lunar Crater Volcanic 1 and 2 are anomalies, pixels 3, 5, 6 represent signatures of
[Figure 28 panels omitted: (a) the AVIRIS LCVF scene; (b) original spectra and (c) normalized spectra of the six signatures cinder, playa, rhyolite, shade, vegetation, and anomaly.]

Figure 28: AVIRIS LCVF scene.


(a) 4 BKG pixels (b) 4 target pixels (c) 3 BKG pixels (d) 7 pixels used for
unmixing

Figure 29: ATGP-USTFA generated BKG and target pixels (a) 4 BKG pixels in original data; (b) 4 target pixels in sphered data; (c) 1 BKG
pixel not identified as target pixels; (d) 7 pixels obtained by combining the pixels in (b)–(c).

(a) LSOSP

(b) NCLS

Anomaly Anomaly Vegetation/rhyolite Playa Cinder Shade


(c) FCLS

Figure 30: LSOSP, NCLS, and FCLS unmixed results using the samples extracted in Figure 29(d).


(a) 4 BKG pixels (b) 4 target pixels (c) 2 BKG pixels (d) 6 pixels used for
unmixing

Figure 31: UNCLS-generated BKG and target pixels (a) 4 BKG pixels in original data; (b) 4 target pixels in sphered data; (c) 2 pixels identified
as BKG pixels; (d) a total of 6 pixels obtained by combining the pixels in (b)–(c).

vegetation, playa, and cinders, respectively. Figure 32 shows the unmixed results of each of the 6 signatures by LSOSP, NCLS, and FCLS, where once again FCLS was the best and the vegetation and rhyolite were also unmixed into the same class.
Following the treatment of ATGP-USTFA and UNCLS-USTFA, Figure 33 shows the UFCLS-USTFA-generated 4 BKG pixels in Figure 33(a) and 4 target pixels in Figure 33(b), where 3 pixels in Figure 33(c) were identified as BKG pixels. By combining the pixels in Figures 33(b) and 33(c), a total of 7 pixels were used by LSU as signatures for unmixing, where pixels 1(4), 3, 5, 7 represent anomalies and the signatures of vegetation, playa, and cinders, respectively. Figure 34 shows the unmixed results of each of the 6 signatures by LSOSP, NCLS, and FCLS, where once again FCLS was the best and the vegetation and rhyolite were also unmixed into the same class. Interestingly, without prior knowledge the USTFA using ATGP, UNCLS, and UFCLS missed the signature of rhyolite in Figures 29, 31, and 33, but it did pick up two anomalous pixels which cannot be identified by visual inspection. The reason for missing rhyolite is that its spectral signature shape is very similar to that of vegetation according to the six spectral signatures in Figure 28(c) and their SAM values in Table 6. In this case, extracting vegetation also extracts the rhyolite, a fact demonstrated in Figures 30, 32, and 34. This experiment demonstrates that spectral signature shapes are more crucial than spectral signature amplitudes.
In order to see the effectiveness of using USTFA-generated pixels as target pixels to perform ULSU, supervised LSOSP, NCLS, and FCLS using the five signatures, cinders, playa, rhyolite, shade, and vegetation in Figure 28(b) were performed for SLSU for comparison, and Figure 35 shows their unmixed results.

(a) LSOSP

(b) NCLS

Anomaly Anomaly Vegetation/rhyolite Playa Cinder


(c) FCLS

Figure 32: LSOSP, NCLS, and FCLS unmixed results using signatures of samples in Figure 31(d).


(a) 4 BKG pixels (b) 4 target pixels (c) 3 BKG pixels (d) 7 pixels used for unmixing

Figure 33: UFCLS-generated BKG and target pixels (a) 4 BKG pixels in original data; (b) 4 target pixels in sphered data; (c) 3 BKG pixels not
identified as target pixels; (d) 7 pixels obtained by combining the pixels in (b)–(c).

By comparing the unmixed results in Figures 30, 32, and 34 to that in Figure 35, three interesting findings are worth mentioning.
(1) The unmixed results of cinders, playa, and shade produced by USTFA-based ULSU were much better than those by SLSU using the prior knowledge in Figure 28(b).
(2) Due to the unavailability of prior knowledge, there is no way for an unsupervised algorithm to differentiate rhyolite from vegetation, because their spectral signatures are so similar according to Table 6 that the USTFA considered both signatures as belonging to the same spectral class (the SAM measure behind Table 6 is sketched after this list). As a result, these two signatures were unmixed into the same class, as demonstrated in Figures 30, 32, and 34. However, with the prior knowledge provided in Figure 28(b), Figure 35 was able to distinguish the rhyolite from the vegetation.
(3) On the other hand, since the two-pixel size anomaly is not visible by inspection or prior knowledge, this target was not shown in Figure 35. But its significance was extracted by all the three unsupervised algorithms, ATGP, UNCLS, and UFCLS. As a matter of fact, this target was picked up as the first target pixel from sphered data. This indicated that this unknown target was crucial and critical for unsupervised hyperspectral target analysis. Unfortunately, it was missed in supervised target analysis in Figure 35 since it cannot be known by visual assessment.
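For reference, the spectral angle mapper (SAM) values such as those reported in Table 6 measure the angle between two signatures, so a small value indicates nearly identical signature shapes regardless of amplitude. A minimal sketch (our own illustration, not the authors' code) is:

```python
import numpy as np

def sam(s1, s2):
    """Spectral Angle Mapper: angle (radians) between two spectral signatures."""
    cos_angle = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# A small SAM value (e.g., vegetation versus rhyolite in Table 6) explains why
# the two signatures end up being unmixed into the same class.
```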

(a) LSOSP

(b) NCLS

Anomaly Vegetation/rhyolite Anomaly Playa Shade Cinders


(c) FCLS

Figure 34: LSOSP, NCLS, and FCLS unmixed results using signatures of samples in Figure 33(d).

(a) LSOSP

(b) NCLS

Cinders Playa Rhyolite Shade Vegetation


(c) FCLS

Figure 35: Unmixed results by supervised LSOSP, NCLS, and FCLS using the five signatures in Figure 28(b).

7. Conclusion

Unsupervised spectral target analysis for hyperspectral data exploitation is very challenging since many unknown signal sources which cannot be visually inspected or obtained by prior knowledge can now be uncovered by hyperspectral imaging sensors. This paper presents unsupervised spectral target analysis from a statistical signal processing viewpoint in the sense of intrapixel spectral information across the acquired wavelength range. The knowledge used to perform unsupervised spectral target analysis is obtained directly from the data a posteriori without pre-assumed prior
knowledge. In particular, the spectral targets of interest in this paper are specified by sample intrapixel spectral information statistics (SIS), which characterizes spectral targets into two categories: 2nd order spectral targets, referred to as background pixels, and high-order spectral targets, referred to as target pixels. Additionally, in order to generate spectral targets in these two categories, an unsupervised spectral target finding algorithm is developed for this purpose, where three least squares-based unsupervised linear spectral unmixing techniques, ATGP, UNCLS, and UFCLS, are used for finding spectral targets of interest. Despite the fact that many algorithms have been designed and developed for supervised target analysis with the target knowledge assumed to be known or provided a priori, very little has been done in unsupervised target analysis. Unfortunately, in real applications supervised target analysis is generally not practical because the supervised target knowledge is either difficult to obtain or may not be reliable as prior knowledge. Under such circumstances unsupervised target analysis is more realistic and applicable to real world problems. In order to validate and substantiate the unsupervised spectral target analysis, three applications are considered where two sets of experiments using custom-designed simulated synthetic images as well as real images are conducted for performance evaluation. The experimental results clearly show that unsupervised spectral target analysis generally performs significantly better than supervised target analysis in real applications. As a concluding remark, it should be noted that the proposed USTFA-based unsupervised target analysis is suitable for spectral targets that are characterized by sample intrapixel spectral information statistics. It is particularly useful and effective for very high spatial and spectral resolution hyperspectral images. Nevertheless, it is not a one-size-fits-all technique for hyperspectral imagery. For example, it may not be appropriate to use our proposed technique to analyze urban scenes, which are complex and are heavily mixed by unknown clutters and interferers.

Acknowledgment

The work of C.-I Chang was supported by the National Science Council in Taiwan under NSC 98-2811-E-005-024 and NSC 98-2221-E-005-096.

References

[1] R. A. Schewengerdt, Remote Sensing: Models and Methods for Image Processing, Academic Press, Boston, Mass, USA, 2nd edition, 1997.
[2] C.-I Chang, Hyperspectral Imaging: Techniques for Spectral Detection and Classification, Kluwer Academic/Plenum Publishers, New York, NY, USA, 2003.
[3] C.-I Chang and M. Hsueh, "Characterization of anomaly detection in hyperspectral imagery," Sensor Review, vol. 26, no. 2, pp. 137–146, 2006.
[4] H. V. Poor, An Introduction to Signal Detection and Estimation, Springer, New York, NY, USA, 2nd edition, 1994.
[5] H. Ren and C.-I Chang, "Automatic spectral target recognition in hyperspectral imagery," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1232–1249, 2003.
[6] C.-I Chang and D. Heinz, "Constrained subpixel target detection for remotely sensed imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 3, pp. 1144–1159, 2000.
[7] D. C. Heinz and C.-I Chang, "Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 3, pp. 529–545, 2001.
[8] J. C. Harsanyi and C.-I Chang, "Hyperspectral image classification and dimensionality reduction: an orthogonal subspace projection approach," IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 4, pp. 779–785, 1994.
[9] J. J. Settle, "On the relationship between spectral unmixing and subspace projection," IEEE Transactions on Geoscience and Remote Sensing, vol. 34, no. 4, pp. 1045–1046, 1996.
[10] C.-I Chang, "Further results on relationship between spectral unmixing and subspace projection," IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 3, pp. 1030–1032, 1998.
[11] http://aviris.jpl.nasa.gov/.
[12] B. Thai and G. Healey, "Invariant subpixel material detection in hyperspectral imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 3, pp. 599–608, 2002.
[13] J. W. Boardman, "Geometric mixture analysis of imaging spectrometery data," Proceedings of the International Geoscience and Remote Sensing Symposium, vol. 4, pp. 2369–2371, 1994.
[14] M. E. Winter, "N-finder: an algorithm for fast autonomous spectral endmember determination in hyperspectral data," in Image Spectrometry V, vol. 3753 of Proceedings of the SPIE, pp. 266–277, 1999.
[15] A. A. Green, M. Berman, P. Switzer, and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65–74, 1988.
[16] C.-C. Wu, C. S. Lo, and C.-I Chang, "Improved process for use of a simplex growing algorithm for endmember extraction," IEEE Geoscience and Remote Sensing Letters, vol. 6, no. 3, pp. 523–527, 2009.
[17] C.-I Chang, "Target signature-constrained mixed pixel classification for hyperspectral imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 5, pp. 1065–1081, 2002.
[18] I. S. Reed and X. Yu, "Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 10, pp. 1760–1770, 1990.
[19] C.-I Chang, X.-L. Zhao, M. L. G. Althouse, and J. J. Pan, "Least squares subspace projection approach to mixed pixel classification for hyperspectral images," IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 3, pp. 898–912, 1998.
[20] C.-I Chang, C. Wu, W. Liu, and Y. C. Ouyang, "A growing method for simplex-based endmember extraction algorithms," IEEE Transactions on Geoscience and Remote Sensing, vol. 44, no. 10, pp. 2804–2819, 2006.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 438615, 16 pages
doi:10.1155/2010/438615

Research Article
Subinteger Range-Bin Alignment Method for ISAR Imaging of
Noncooperative Targets

J. M. Muñoz-Ferreras1 and F. Pérez-Martínez2


1 Department of Signal Theory and Communications, Polytechnic School, University of Alcalá, Campus Universitario,
Ctra. Madrid-Barcelona, Km. 33, 600, Alcalá de Henares, 28805 Madrid, Spain
2 Department of Signals, Systems and Radiocommunications, Technical University of Madrid, E.T.S.I. Telecomunicación,

Avenida Complutense s/n, 28040 Madrid, Spain

Correspondence should be addressed to J. M. Muñoz-Ferreras, [email protected]

Received 17 November 2009; Accepted 25 March 2010

Academic Editor: Robert W. Ives

Copyright © 2010 J. M. Muñoz-Ferreras and F. Pérez-Martı́nez. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.

Inverse Synthetic Aperture Radar (ISAR) is a coherent radar technique capable of generating images of noncooperative targets.
ISAR may have better performance in adverse meteorological conditions than traditional imaging sensors. Unfortunately, ISAR
images are usually blurred because of the relative motion between radar and target. To improve the quality of ISAR products,
motion compensation is necessary. In this context, range-bin alignment is the first step for translational motion compensation.
In this paper, we propose a subinteger range-bin alignment method based on envelope correlation and reference profiles. The
technique, which makes use of a carefully designed optimization stage, is robust against noise, clutter, target scintillation, and
error accumulation. It provides us with very fine translational motion compensation. Comparisons with state-of-the-art range-
bin alignment methods are included and advantages of the proposal are highlighted. Simulated and live data from a high-resolution
linear-frequency-modulated continuous-wave radar are included to perform the pertinent comparisons.

1. Introduction targets [5, 6]. Furthermore, these images may be used


for subsequent recognition tasks [7–10]. Although ISAR
Traditional imaging sensors, such as visible and infrared is usually understood as a complement for electro-optical
cameras or laser radar systems, may have a reduced sensors, it may in fact outperform these traditional sensors in
performance in adverse weather conditions, like fog [1– adverse conditions, because it inherits the all-weather feature
3]. Furthermore, in defense and security scenarios, smoke [11] from the long wavelength radars.
screens [4] may literally blind these imaging sensors based The standard scenario for ISAR consists of a static
on very short wavelengths. high-resolution coherent radar which illuminates a moving
The origin of this degradation must be found in the noncooperative target [12]. In this context, a noncooperative
extreme scattering that these wavelengths suffer when inter- target is a target whose motion is unknown.
acting with the little particles present in the atmosphere [1– In ISAR, the two image dimensions are slant-range
3]. When a high signal attenuation is present, the operation and Doppler (cross-range). High slant-range resolution is
range of these sensors diminishes considerably. achieved by transmitting a large bandwidth signal, whereas
However, in important applications related to defense high cross-range resolution depends on a large aspect angle
and security, it is still necessary to obtain images for recogni- variation of the target during the illumination time [12].
tion/identification purposes, regardless of the meteorological Specifically, the slant-range resolution is given by
and scenario conditions.
Inverse Synthetic Aperture Radar (ISAR) is a coherent c
ρr = , (1)
radar technique which may obtain images of noncooperative 2Δ f

where c is the light speed and Δ f is the transmitted 1900


bandwidth. The cross-range resolution may be written as
1800
λ 1700
ρa = , (2)
2Δθ 1600

where λ is the transmitted wavelength and Δθ is the 1500

EC(τ)
variation of the target aspect angle during the illumination 1400
(observation) time.
1300
Target motion may be divided into a translational
component and a rotational component [13, 14]. The first 1200
one is further decomposed into a radial and a tangential 1100
component, whereas the second one has three attitude
components: yaw, pitch, and roll. 1000
On the one hand, the radial component of the transla- 10 12 14 16 18 20 22
tional motion (i.e., the component along the line-of-sight τ
(LOS)) is undesired, because it does not induce variation of
Figure 1: Envelope correlation (as a function of the range shift τ)
the target aspect angle; that is, it does not generate Doppler between a real shifted range profile and its corresponding reference
gradient among target scatterers situated in the same range profile.
bin. Furthermore, this component causes a large blurring in
ISAR images.
On the other hand, the rest of motion components
may produce the desired Doppler gradient among scatterers, f
hence obtaining bidimensional information. It is true that
the rotational motion (and the tangential component of TC
the translational motion) may also generate blurring effects
on the image [15], called Migration Through Resolution
f0 Δf ···
Cells (MTRCs), but these effects have minor importance in
comparison to the large blurring generated by the radial
component of the translational motion, which must always 1/PRF
be compensated. t
Methods for translational motion compensation work in
two steps [16]: range-bin alignment and phase adjustment. Figure 2: Waveform for an LFMCW radar.
For the first stage, which is the motivation of this paper,
several methods may be found in the literature, such as the
peak tracking approach [5], the centroid tracking algorithm
[17], the envelope correlation method [5, 18, 19], the global alignment approach is based on an optimization stage which
range alignment approach [20], or the minimum-entropy- has carefully been designed in order to avoid possible
based technique [21]. For the second stage regarding phase convergence to local maxima.
adjustment, the literature provides us with famous methods
such as prominent point processing [22], phase gradient The method is robust against noise, clutter, and target
autofocus [23], entropy minimization [16], or contrast scintillation. Moreover, it properly solves the error accumu-
maximization [24]. lation problem. Its performance is similar to the state-of-
In this paper, we concentrate on the range-bin align- the-art current methods such as the global range alignment
ment stage, which is fundamental to guarantee a proper approach [20] and the minimum entropy-based technique
translational motion compensation. Concretely, we present [21], although it provides two clear advantages; it properly
a subinteger range-bin alignment approach based on the solves extreme situations with large range shifts from pulse
traditional envelope correlation method and the use of to pulse (unlike the global range alignment algorithm) and
reference profiles. This work was preliminarily presented in it has moreover the ability to produce subinteger fine range
the conference paper [19]. Here, deep analyses as well as adjustments over a wide range of offsets (in contrast to the
exhaustive comparisons with other existing methods both for minimum entropy-based method).
simulated and live data are provided. Furthermore, the careful design of the method against
The proposed method makes use of reference profiles local maxima makes it very robust, as shown here for
in order to mitigate the error accumulation phenomenon controlled simulated examples for which the state-of-the-art
and the target scintillation effects [18], typical limitations methods have convergence problems.
of the earlier range-bin alignment approaches such as the Simulated and live data from a high-resolution linear-
peak and centroid tracking methods. Furthermore, the frequency-modulated continuous-wave (LFMCW) radar are
technique makes a subinteger alignment which provides us used to validate the proposed approach and to make the
with a very fine range profile adjustment. This subinteger pertinent comparisons.
[Figure 3 plot omitted: scatterer positions, relative height (m) versus length (m).]
Figure 3: Scatterer distribution for a simulated target illuminated by an LFMCW radar.

2. Subinteger Range-Bin Alignment Method

The proposed method uses the cross-correlation of range profiles in order to estimate the misalignment between them. The correlation is not calculated between the current range profile and the previous one, which would generate the undesired error accumulation effect [18]. On the contrary, the cross-correlation is calculated between the current range profile and a reference profile obtained as a combination of the previously aligned range profiles. This reduces the error accumulation effect and provides robustness against noise, clutter, and target scintillation. Moreover, the proposed alignment between the current and the reference profiles may be a fraction of one range bin, providing the method with a subinteger capability. This fine alignment is achieved after an optimization stage which has been designed to minimize possible convergence to local maxima. The following paragraphs describe the method.
Let pm(n) be the mth acquired range profile, where n = 0, ..., N − 1, m = 0, ..., M − 1, N is the total number of range bins, and M is the number of acquired range profiles. Let us call p̄m(n) the aligned profile of pm(n), after the alignment process.
As previously commented, in order to reduce the error accumulation effect and to increase the method robustness against noise, clutter, and target scintillation, it is interesting to define a reference profile [18]. We calculate this reference profile as a combination of the previously aligned range profiles. Concretely, we follow the recommendations of [16] in order to define the reference profile r_{m+1}(n) for the alignment of p_{m+1}(n) as

r_{m+1}(n) = (m/(m+1)) r_m(n) + (1/(m+1)) |p̄_m(n)|²,   (3)

where r_m(n) is the reference profile for the alignment of the mth range profile p_m(n). Note that the calculation of r_{m+1}(n) requires the knowledge of the previously aligned range profiles. This clearly demonstrates the iterative nature of the method.
Once the (m+1)th range profile p_{m+1}(n) and its associated reference profile r_{m+1}(n) are available, the objective is to obtain the (m+1)th aligned range profile p̄_{m+1}(n). For this purpose, the envelope cross-correlation between r_{m+1}(n) and a shifted version of p_{m+1}(n) is defined as

EC(τ_{m+1}) = Σ_{n=0}^{N−1} |r_{m+1}(n)|² · |p_{m+1}(n − τ_{m+1})|²,   (4)

where τ_{m+1} is the range shift applied to the (m+1)th range profile p_{m+1}(n). The value τ_{m+1} is not necessarily an integer, so that p_{m+1}(n − τ_{m+1}) is calculated by using the shift property of the Fourier transform as follows:

p_{m+1}(n − τ_{m+1}) = FFT{ exp(j(2π/N) τ_{m+1} n) · IFFT{ p_{m+1}(n) } },   (5)

where n is the vector [0, 1, ..., N − 1]^T.
In the context of this paper, a maximum value of the envelope correlation is indicative of an optimum alignment between the (m+1)th reference profile r_{m+1}(n) and the shifted range profile p_{m+1}(n − τ_{m+1}). Hence, we are interested in obtaining the optimum shift τ̂_{m+1} that maximizes the envelope correlation. Mathematically, this optimum shift may be expressed as

τ̂_{m+1} = arg max_{τ_{m+1}} EC(τ_{m+1}).   (6)

After solving the optimization problem, we can finally obtain the aligned profile p̄_{m+1}(n) as

p̄_{m+1}(n) = p_{m+1}(n − τ̂_{m+1}),   (7)

where the range shift of the original profile is implemented by using (5), because this optimum shift τ̂_{m+1} may not be an integer.
It is really interesting to analyze the optimization problem expressed in (6) in order to visualize its nature. Figure 1 shows the value of the envelope correlation (see (4)) between a live shifted range profile and its associated reference profile as a function of the shift τ. In this example, the variable τ is expressed in number of range bins.
The envelope correlation shown in Figure 1 is a standard example which we have found typical for both simulated and real data. It can easily be seen that the objective cost function suffers from local maxima.
As a consequence, it is obvious that we have to pay great attention to the correct initialization of the optimization algorithm in order to guarantee the desired convergence to the global maximum. That is, if the initial guess for the optimization method is near to the global maximum, a correct convergence to it is achievable.
Since the problem under study is an optimization problem with local maxima, it is clear that we could employ blind approaches to solve it, such as genetic algorithms or
[Figure 4 panels omitted: (a) range profiles (slant-range versus number of range profile) and (b) ISAR image (Doppler versus slant-range), both in dB.]

Figure 4: (a) Range profiles and (b) ISAR image for the simulated example without radial translational velocity.

[Figure 5 panels omitted: (a) range profiles (slant-range versus number of range profile) and (b) ISAR image (Doppler versus slant-range), both in dB.]

Figure 5: (a) Range profiles and (b) ISAR image for the simulated example with a radial translational velocity of vr = 10 m/s.

exhaustive procedures (like a grid method or a random walk).
Nevertheless, according to our observations of our available simulated and real data, we have noticed that the peak corresponding to the global maximum of the cost function is quite wide, as seen in Figure 1. This means that a proper initial guess for the optimum shift τ̂_{m+1} is simply the integer range shift for which the cross-correlation is maximum. Hence, we evaluate the cross-correlation for the possible integer shifts n = 0, ..., N − 1 and select the one for which the cost function is maximum as the initial guess for the subsequent standard optimization algorithm. This process lets us finally converge to the desired global maximum. In this paper, a zero-order optimization algorithm (the Nelder-Mead algorithm [25]) has been utilized to obtain the desired subinteger refinement for τ̂_{m+1}.
We would like to highlight that the commented method to obtain the initial guess properly worked in all the simulated and real data available to the authors. On the other hand, it is true that other optimization approaches, like a gradient-based method or the Newton method, may have been used for the refinement stage. Our experience is that the zero-order Nelder-Mead algorithm worked properly for the analyzed examples.
As a summary, the following steps implement the proposed technique for ISAR subinteger range-bin alignment.

Step 1 (m = 0). Consider p̄_0(n) = p_0(n).

Step 2. Calculate the (m+1)th reference profile r_{m+1}(n) using (3).

Step 3. Obtain the envelope correlation (see (4)) between r_{m+1}(n) and the (m+1)th shifted range profile p_{m+1}(n − τ_{m+1}), with τ_{m+1} being an integer in the interval [0, 1, ..., N − 1] and N the number of range bins.

Step 4. Calculate the integer value of τ_{m+1}, from the possible set [0, 1, ..., N − 1], that maximizes the result of the previous step. Call this initial value τ_{m+1,0}.
EURASIP Journal on Advances in Signal Processing 5

970 0 Table 1: LFMCW radar parameters for the simulated example in


−5 Figure 3.
980 −10
Central frequency ( f0 ) 10 GHz
−15
Slant-range (m)

990 Bandwidth (Δ f ) 500 MHz


−20 Ramp Repetition Frequency (PRF) 500 Hz

(dB)
1000 −25 TC 0.2 ms
−30 Illumination Time (CPI) 0.5 s
1010 −35
−40
1020 If we consider a point-scatterer at a range Rtk from the
−45
radar, the received signal from the scatterer is
−50
50 100 150 200 250      2 
2R 2R
Number of range profile
sRk (t) = σk exp j 2π f0 t − tk + πγ 4t − tk ,
c c
Figure 6: Range profiles for the simulated example after applying
(9)
the peak tracking method. Note that the target scintillation
phenomenon makes this approach fail.
where σk is a complex value associated to the scatterer, whose
amplitude represents the scatterer backscattering and the
propagation losses, whereas its phase models a possible phase
change inserted by the scatterer.
Step 5. Solve (6) with the Nelder-Mead algorithm, taking LFMCW radars usually apply hardware deramping,
τm+1,0 as the initial guess for the iterative technique. As a which consists of mixing the received signal with a replica
result, obtain the optimum range shift τ1m+1 . of the transmitted signal. By considering that the target has
K scatterers, the beat signal after the deramping processing
Step 6. Obtain pm+1 (n) using (7). Use (5) if the optimum
may be written as
shift τ1m+1 is not an integer.
  

K
4πγRtk 4π f0 4πγR2tk
Step 7 (m ← m + 1). If m ≤ M − 2, where M is number of sb (t) = σk exp j 4
t+ Rtk − .
range profiles, go to Step 2 to align the next range profile. k=1
c c c2
(10)

3. Simulation of an LFMCW Radar

In this section, we provide the tools to simulate targets illuminated by an LFMCW radar, because the simulated and real data used in this paper correspond to this type of radar. The next section will detail the properties of the proposed alignment method for this kind of data. Although the results detailed in the paper only refer to LFMCW radars, it is important to highlight that the proposed method is applicable to any coherent imaging radar.

An LFMCW radar transmits a continuous waveform whose instantaneous frequency as a function of time is depicted in Figure 2. The central transmitted frequency is f0, the pulse repetition frequency is PRF, and the transmitted bandwidth is represented by Δf, whereas the parameter TC represents the time necessary for the radar circuits to guarantee signal coherence from ramp to ramp.

Next, we describe the radar signal model. The complex envelope of the transmitted signal sT(t) may be written as

sT(t) = exp[ j(2π f0 t + π γ t̂²) ],   (8)

where t̂ = t − m·PRF⁻¹, t is the time, m is the number of the range profile (i.e., the number of the ramp), and γ is the chirp rate. Note that the received signal in the intervals corresponding to TC is not processed.

If we consider a point-scatterer at a range Rtk from the radar, the received signal from the scatterer is

sRk(t) = σk exp{ j[ 2π f0 (t − 2Rtk/c) + π γ (t̂ − 2Rtk/c)² ] },   (9)

where σk is a complex value associated to the scatterer, whose amplitude represents the scatterer backscattering and the propagation losses, whereas its phase models a possible phase change inserted by the scatterer.

LFMCW radars usually apply hardware deramping, which consists of mixing the received signal with a replica of the transmitted signal. By considering that the target has K scatterers, the beat signal after the deramping processing may be written as

sb(t) = Σ_{k=1}^{K} σk exp{ j[ (4π γ Rtk/c) t̂ + (4π f0/c) Rtk − 4π γ Rtk²/c² ] }.   (10)

Note that, for each transmitted ramp, a range profile may be obtained. For an LFMCW radar, it is necessary to make a Fourier transform of the beat signal to obtain the range profiles. Effectively, by neglecting the last phase term in (10), it is clear that a Fourier transform in t̂ supplies the range profile for each m. The beat frequency ftk for a scatterer situated at a range Rtk may be written, from (10), as

ftk = 2 γ Rtk / c.   (11)

A correct sampling of (10) provides us with the possibility of simulating complex scenes. This is simply done by calculating the ranges from the radar to all target scatterers for all ramps. For each ramp, the corresponding range profile may be obtained by applying a Fourier transform to (10). As an interesting example, let us consider the distribution of 2000 scatterers depicted in Figure 3. Let us say that these scatterers belong to a ship.

Let us consider that this target, while pitching with a rotation rate of Ω = 0.05 rad/s, is moving away along the LOS and is being illuminated by an LFMCW radar. The radar parameters are detailed in Table 1, where the illumination time is also referred to as the Coherent Processing Interval (CPI). The rotation center of the target is situated at position [0, 0] in Figure 3. Moreover, let us consider that the range from the radar to the rotation center is 1000 m in the middle of the illumination time.

Table 1: LFMCW radar parameters for the simulated example in Figure 3.

Central frequency (f0)            10 GHz
Bandwidth (Δf)                    500 MHz
Ramp repetition frequency (PRF)   500 Hz
TC                                0.2 ms
Illumination time (CPI)           0.5 s
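As a rough illustration of (10) and (11), the sketch below (not from the paper; the scatterer ranges and the beat-signal sampling rate are made-up example values) simulates the beat signal of a few point-scatterers on one ramp and obtains the corresponding range profile with an FFT.

```python
import numpy as np

c = 3e8
f0, B, PRF = 10e9, 500e6, 500.0          # values of Table 1
T = 1.0 / PRF                            # ramp duration (TC neglected in this sketch)
gamma = B / T                            # chirp rate
fs = 5e6                                 # beat-signal sampling rate (assumed)
t_hat = np.arange(0, T, 1.0 / fs)

ranges = np.array([995.0, 1000.0, 1004.0])              # scatterer ranges (made up)
sigmas = np.exp(2j * np.pi * np.random.rand(len(ranges)))

# Beat signal of (10): one complex exponential per scatterer.
sb = np.zeros(len(t_hat), dtype=complex)
for R, sigma in zip(ranges, sigmas):
    phase = (4 * np.pi * gamma * R / c) * t_hat + (4 * np.pi * f0 / c) * R \
            - 4 * np.pi * gamma * R**2 / c**2
    sb += sigma * np.exp(1j * phase)

# Range profile: FFT over t_hat; the beat frequency maps to range through (11).
profile = np.fft.fft(sb)
freqs = np.fft.fftfreq(len(sb), d=1.0 / fs)
slant_range = freqs * c / (2 * gamma)                   # invert f = 2*gamma*R/c
print(slant_range[np.argmax(np.abs(profile))])          # close to one of the simulated ranges
```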
Figure 7: (a) Range profiles and (b) reconstructed ISAR image after applying the proposed subinteger range-bin alignment method to the simulated data with translational motion.

Figure 8: Photo of the two-mast sailboat.

For this simulated example, we have considered that |σk| = 1 for all k and that ∠σk is uniformly distributed between 0 and 2π. Moreover, a signal-to-noise ratio of 10 dB has been considered, with the noise being additive, white, and Gaussian.

Figure 4(a) shows the 250 range profiles for this example, when the target is considered to have a radial translational speed of vr = 0 m/s. By applying an FFT in each range bin, we obtain the ISAR image in the conventional range-Doppler coordinates (Figure 4(b)). In the context of this paper, Figure 4(b) must be considered as the optimum ISAR image for this simulation, because there is no blurring (i.e., the radial component of the translational motion is zero).

On the other hand, Figures 5(a) and 5(b), respectively, show the range profiles and the ISAR image for the previously simulated example, but considering that the target is moving away with a radial speed of vr = 10 m/s. The leaning observed in the range profiles and the large blurring in the ISAR image are characteristic effects due to the radial component of the translational motion.

Because the dynamics for noncooperative targets are unknown, the objective of the blind motion compensation techniques consists of focusing the ISAR images without additional information. For our simulated example, the blind techniques should obtain an image similar to Figure 4(b) from processing the data in Figure 5.

The next sections detail the performance of the proposed alignment algorithm in comparison with other existing methods for compensating the translational motion. In order to make fair comparisons among the diverse range-bin alignment techniques, in this paper we always use the method in [16] for the phase adjustment step.

4. Properties of the Proposed Method

This section addresses the performance of the proposed range-bin alignment technique in relation to important features: robustness against target scintillation, against clutter, and so forth. Both simulated and real data are used to verify the good performance of the proposed method.

4.1. Robustness against Target Scintillation. The signal received by the radar is the coherent sum of many contributions from target scatterers. This implies that the power in each range bin is not constant during the illumination time. This effect is known as target scintillation.

Target scintillation makes the standard tracking approaches fail. For example, Figure 6 shows the range profiles for our simulated example with radial translational motion aligned after applying the peak tracking method. Clearly, the simulated data suffer from the target scintillation phenomenon, because we have simulated many point-scatterers. Target scintillation causes the location of the target global maximum to strongly fluctuate between range profiles.

Fortunately, the subinteger range-bin alignment method does not suffer from target scintillation. Note that envelope correlation is a much more robust approach to calculate the existing shift among range profiles. As a proof of it, Figure 7(a) details the range profiles obtained after applying the proposed method to the simulated data of Figure 5. It can be seen that a proper range-bin alignment has been obtained.
Figure 9: (a) Range profiles and (b) ISAR image for the real data of the two-mast sailboat. No motion compensation technique has been applied.

Figure 10: (a) Range profiles and (b) ISAR image for the real data of the two-mast sailboat after applying the centroid tracking method.

Figure 11: (a) Range profiles and (b) ISAR image for the real data of the two-mast sailboat after applying the proposed method.
Figure 12: (a) Range profiles and (b) ISAR image without using reference profiles. The results correspond to the simulated data presented in Figure 5.

Table 2: Real radar parameters for the acquired data corresponding to the sailboat in Figure 8.

Central frequency (f0)            28.5 GHz
Bandwidth (Δf)                    1 GHz
Ramp repetition frequency (PRF)   1000 Hz
TC                                0.1 ms
Illumination time (CPI)           0.4 s

Figure 13: Photo of the vessel.

In fact, this result is very similar to the range profiles shown in Figure 4(a), which may be understood as the optimum range profiles, because they correspond to the simulated example without translational motion.

Figure 7(b) shows the motion-compensated ISAR image obtained with the proposed technique for the range-bin alignment stage and with the method in [16] for the phase adjustment stage. Hence, Figure 7(b) is the reconstructed ISAR image after using the proposed technique. Note that this reconstruction is a very good approximation to the optimum ISAR image shown in Figure 4(b).

4.2. Robustness against Noise and Clutter. Thermal noise is always present in real systems. The simulated examples shown in this paper include thermal noise. On the other hand, clutter may exist depending on the acquisition scenario. For example, in maritime scenarios, the clutter due to echoes from the sea may be a problem.

Figure 8 shows the photograph of a noncooperative two-mast sailboat, which was illuminated during an acquisition campaign made in the Strait of Gibraltar. The circles approximately indicate the positions of the mast bases and tips and the sailboat bow and stern. The real radar [26] is a high-resolution millimeter-wave LFMCW radar, whose parameters for this acquisition are detailed in Table 2.

This live example is an interesting one, since the sea state was 4 and many clutter echoes were received. Figures 9(a) and 9(b), respectively, show the range profiles and the ISAR image for this capture without applying any motion compensation technique. We can see that the signal-to-noise ratio is poor and that energetic echoes corresponding to clutter are evident. Moreover, the ISAR image is blurred because of the translational motion of the target, which is evident from the characteristic leaning observed in the range profiles in Figure 9(a). Certainly, it is difficult to distinguish the details of the ship in the ISAR image shown in Figure 9(b).

The high levels of clutter and noise have influence on the performance of the standard tracking-based techniques. As an example, refer to the range profiles aligned after applying the centroid tracking-based technique [17], which are detailed in Figure 10(a). Clearly, by simple visual inspection, we can conclude that the alignment has not been satisfactory. Figure 10(b) shows the ISAR image obtained after applying the centroid tracking method for the range-bin alignment stage (and the method in [16] for the phase adjustment stage). The quality of the ISAR image has not improved in relation to Figure 9(b), that is, the ISAR image is still blurred.
Figure 14: (a) Range profiles and (b) ISAR image for the real data of the vessel without compensating the translational motion.

Figure 15: (a) Range profiles and (b) ISAR image for the real data of the vessel after applying the proposed method.

Certainly, the high levels of noise and clutter for this example make the tracking-based methods fail.

On the other hand, Figure 11(a) presents the range profiles aligned by the proposed subinteger range-bin alignment method. In this case, we can see that the range profiles are more properly aligned. Figure 11(b) shows the reconstructed ISAR image for this case, where the masts and the deck are more detailed in comparison to Figures 9(b) and 10(b).

Hence, the proposed range-bin alignment method is robust against high levels of noise and clutter. So far, the drawn conclusions are based on visual inspection. However, we can use focusing indicators to quantify the quality improvement observed in the ISAR image of Figure 11(b). In this context and for this kind of examples, we can use the entropy [16] and the contrast [24], whose mathematical definitions are given in the corresponding references. The lower the entropy, the more focused the ISAR image is [16]. And the greater the contrast, the more focused the ISAR image is [24].

Table 3: Focusing indicators for the live ISAR images in Figures 9(b), 10(b), and 11(b).

               Entropy   Contrast
Figure 9(b)    7.34      17.41
Figure 10(b)   7.46      15.90
Figure 11(b)   6.90      24.80

These focusing indicators for the ISAR images of Figures 9(b), 10(b), and 11(b) are detailed in Table 3. According to these results, the ISAR image after applying the proposed technique is more focused than the original image without compensating the translational motion and is also more focused than the image obtained with the tracking-based range-bin alignment approach. We even obtain that the quality of the ISAR image after applying the centroid tracking technique gets worse in relation to Figure 9(b).
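For reference, one common way of computing these two focusing indicators is sketched below; the exact definitions used in [16] and [24] may differ in normalization, so this is only an illustrative form: the Shannon entropy of the normalized image intensity and the contrast as the ratio between the standard deviation and the mean of the intensity.

```python
import numpy as np

def image_entropy(isar_image):
    """Shannon entropy of the normalized intensity (lower = better focused)."""
    intensity = np.abs(isar_image) ** 2
    p = intensity / np.sum(intensity)
    p = p[p > 0]                      # avoid log(0)
    return -np.sum(p * np.log(p))

def image_contrast(isar_image):
    """Ratio of standard deviation to mean of the intensity (higher = better focused)."""
    intensity = np.abs(isar_image) ** 2
    return np.std(intensity) / np.mean(intensity)
```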
4.3. Robustness against Error Accumulation. The error accumulation phenomenon is an effect observed when using some range-bin alignment methods [18]. The proposed method tries to mitigate this phenomenon by using reference profiles, whose calculation is based on the previously aligned range profiles.
Figure 16: (a) Range profiles and (b) ISAR image for the real data of the vessel after using the global range alignment algorithm.

Figure 17: Range profiles of the vessel with an artificially induced vibration.

Table 4: Focusing indicators for the simulated ISAR images in Figures 7(b) and 12(b).

               Entropy   Contrast
Figure 7(b)    7.17      10.60
Figure 12(b)   7.81      8.97
For example, for envelope correlation-based methods, the error accumulation effect takes importance when the alignment of the current range profile is only based on the previously aligned range profile, as shown next.

Figure 12(a) shows the range profiles obtained after using a reduced version [5] of the proposed method for aligning the range profiles of the simulated example in Figure 5. This simplified algorithm defines the reference profile rm+1(n) for the alignment of the (m+1)th range profile pm+1(n) as the previously aligned range profile, that is, rm+1(n) = pm(n), according to the nomenclature used in Section 2. In Figure 12(a), some misalignment error accumulates, as clearly shown if we concentrate, for example, on the range history for the scatterers on the sailboat stern. These range variations are not observed in Figure 7(a).

The error accumulation effect in Figure 12(a) has minor incidence on the results of Figure 7(a). Hence, the proposed range-bin alignment method based on the use of reference profiles is robust against this phenomenon.

On the other hand, Figure 12(b) shows the ISAR image obtained from the range profiles in Figure 12(a). This image is defocused in comparison with the ISAR image in Figure 7(b). This is obvious for the scatterers on the sailboat deck, for example. In fact, the same may be concluded if we calculate the focusing indicators for Figures 7(b) and 12(b), as Table 4 details.

5. Comparison with State-of-the-Art Methods

In the previous section, we exposed the properties of the proposed alignment method by using both simulated and real data. Here, it is our intention to compare the proposed approach with state-of-the-art methods recently proposed in the literature: concretely, the global range alignment algorithm [20] and the minimum entropy-based approach [21].

The performance of the proposed subinteger range-bin alignment technique is similar to that of these state-of-the-art methods, as shown next. However, the proposed approach can deal with extreme situations with large range shifts from range profile to range profile. Moreover, its subinteger alignment capability is also noticeable and, unlike the other methods, the careful design of the optimization stage increases its robustness against possible convergence to local maxima.

To make the pertinent comparisons, we use the live data described in the following. Figure 13 shows the photo of a vessel illuminated by the millimeter-wave LFMCW radar prototype [26]. The radar parameters for this acquisition are detailed in Table 5.

Figures 14(a) and 14(b), respectively, show the range profiles and the ISAR image for this live acquisition without compensating the translational motion. The ISAR image is largely blurred because of the radial component of the translational motion, which can easily be guessed from the leaning observed in the range profiles of Figure 14(a).
Figure 18: Range profiles aligned after applying (a) the proposed subinteger range-bin alignment method and (b) the global range alignment algorithm to the vessel data with artificially induced vibration.

Figure 19: ISAR images obtained after applying (a) the proposed subinteger range-bin alignment method and (b) the global range alignment algorithm to the vessel data with artificially induced vibration.

Table 5: Real radar parameters for the acquired data corresponding to the vessel in Figure 13.

Central frequency (f0)            28.5 GHz
Bandwidth (Δf)                    1 GHz
Ramp repetition frequency (PRF)   500 Hz
TC                                0.2 ms
Illumination time (CPI)           1.27 s

Figure 20: Alignment process of two range profiles for the global range alignment algorithm.

Figure 15 presents the aligned range profiles and the focused ISAR image after applying the proposed subinteger range-bin alignment method and the method in [16] for the phase adjustment stage. Note the increase in the ISAR image quality. The masts and the deck appear focused. Certainly, the alignment produced by the proposed method seems to be good.

The focusing indicators, provided in Table 6, also speak about the quality enhancement after applying the subinteger range-bin alignment approach.
Figure 21: Alignment results for two simulated range profiles by applying (a) the proposed technique and (b) the global range alignment algorithm. The proposed method is robust against local maxima of its cost function. On the contrary, the global range alignment method may have convergence difficulties.

Figure 22: (a) Range profiles and (b) ISAR image for the real data of the vessel after using the minimum entropy-based approach.

Table 6: Focusing indicators for the real ISAR images of the vessel in Figures 14(b) and 15(b).

               Entropy   Contrast
Figure 14(b)   9.32      8.7
Figure 15(b)   7.37      39.8

5.1. Comparison with the Global Range Alignment Algorithm. The global range alignment method [20] is also a robust method which usually performs well in diverse scenarios. As a proof of that, Figures 16(a) and 16(b), respectively, show the range profiles and the ISAR image for the noncooperative data of Figure 14 after applying this global range alignment approach. Figure 16(b) is practically indistinguishable from the one obtained by the proposed approach (Figure 15(b)).

Hence, the global range alignment algorithm provides a good alignment. In fact, the focusing indicators for Figure 16(b) are almost the same as the ones for Figure 15(b). The entropy and contrast for Figure 16(b) are 7.36 and 39.9, respectively.

However, when we are faced with situations in which large range shifts from range profile to range profile may arise, the global range alignment algorithm may fail, as also indicated in [21].
Figure 23: Zoom of Figure 22(a). The jumps in the range profiles are clearly visible.

Figure 24: Two simple misaligned range profiles with little target scintillation.
As a proof of that, let us concentrate on Figure 17, which shows the range profiles of the vessel with an artificially induced vibration. To simulate this vibration, each range profile has independently been shifted, with the shift being a Gaussian random variable of zero mean and a standard deviation of 10 cm. This has let us simulate a large vibration.

Figure 18 presents the results provided by the proposed and the global range alignment methods for the range profiles shown in Figure 17. By analyzing Figure 18, we can conclude that the proposed technique is robust against large range displacements, whereas the global range alignment algorithm cannot properly solve these situations.

It is obvious that the poor performance of the global range alignment method shown in Figure 18(b) has influence on the subsequently obtained ISAR image. Figure 19 shows the ISAR images obtained after applying the proposed method and the global range alignment algorithm to the vessel data with artificially induced vibration. Again, the technique in [16] has been used for the phase adjustment step. Because of the misalignment, the ISAR image after the global range alignment approach is blurred. On the contrary, the ISAR image obtained with the proposed technique is very similar to the one given in Figure 15(b).

On the other hand, it is noticeable that the optimization stage in the proposed technique has carefully been designed. On the contrary, the global range alignment method may have difficulties with local maxima of its cost function C̃ [20, equation (8)]. To visualize this, let us consider two single misaligned range profiles, as Figure 20(a) indicates. The parameters a, b, c, and d are the absolute values of the echoes in the corresponding range bins. An intermediate step in the optimization process of the global range alignment method is depicted in Figure 20(b). Figure 20(c) indicates the correct alignment of the two range profiles.

It can easily be shown that the values of C̃ in the situations depicted in Figures 20(a) and 20(b) are, respectively,

C̃(a) = 2(a² + b² + c² + d² + ad),
C̃(b) = 2(a² + b² + c² + d² + ac + db).   (12)

From the optimization algorithm given in [20], if C̃(a) > C̃(b), that is, if ad > ac + db, then the global range alignment algorithm will not converge to the correct result in Figure 20(c). Instead of that, it will try to refine the situation in Figure 20(a). As a proof of this fact, Figure 21 shows the two range profiles aligned after using the proposed method and the global range alignment algorithm, when a = 3, b = 0.25, c = 0.5, and d = 1. The global range alignment method is clearly affected by the local maximum and does not converge to the situation in Figure 20(c). On the contrary, the proposed method can deal with these cases because of the careful design of its optimization stage.
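A quick numeric check of (12) for the values used in Figure 21 (a = 3, b = 0.25, c = 0.5, d = 1) confirms that the failure condition holds; the snippet below simply evaluates both expressions.

```python
a, b, c, d = 3.0, 0.25, 0.5, 1.0

C_a = 2 * (a**2 + b**2 + c**2 + d**2 + a * d)           # situation of Figure 20(a)
C_b = 2 * (a**2 + b**2 + c**2 + d**2 + a * c + d * b)   # situation of Figure 20(b)

print(C_a, C_b)               # 26.625 24.125
print(a * d > a * c + d * b)  # True: ad = 3.0 exceeds ac + db = 1.75
```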
5.2. Comparison with the Minimum Entropy-Based Approach. The minimum entropy-based approach [21] for range-bin alignment is an iterative approach which is based on integer shifts of the range profiles, unlike the proposed method and the global range alignment algorithm.

Figure 22(a) shows the range profiles obtained after applying the entropy-based method to the real data of the vessel in Figure 14. Figure 22(b) shows the corresponding ISAR image.

Apparently, the range profiles shown in Figure 22(a) seem to be properly aligned. However, the fact that the range profiles may only be shifted in integer steps implies that undesired jumps occur in the range history of the target scatterers. Figure 23 presents a zoom of Figure 22(a) which allows distinguishing the commented jumps. Some of them are marked with an arrow.

These discontinuities appearing in the range profiles induce an amplitude modulation in Doppler. This is the reason why spurious images are clearly visible in Figure 22(b). Hence, the ISAR image obtained with the minimum entropy-based approach has a quality poorer than the one in Figure 15(b). The entropy and contrast for the ISAR image in Figure 22(b) (8.01 and 35.3, resp.) are also indicative of this quality decrease. Please refer to Table 6.

We have tried to implement an extension of the minimum entropy-based method in order to consider subinteger range-bin alignments.
Figure 25: Alignment results for the two simulated range profiles in Figure 24 by applying (a) the proposed technique and (b) the minimum entropy-based approach.

This has been done in a way similar to the optimization stage given in the proposed method. Unfortunately, the commented jumps still appear when applying this extension.

On the other hand, the minimum entropy-based approach may also have problems with local maxima. As a proof of that, let us consider two misaligned range profiles as shown in Figure 24. The values a, a′, b, and b′ are the absolute values of the echoes in the corresponding range bins. Let us also assume that we have a little target scintillation, in such a way that a ≈ a′ and b ≈ b′.

From the equations given in [21], it can easily be shown that the minimum entropy-based method will not be able to align the two range profiles in Figure 24 if the next two conditions are met:

a ln a + b ln b > a ln a′ + b ln b′,
a′ ln a′ + b′ ln b′ > a′ ln a + b′ ln b.   (13)

Let us consider that a = 3.1, b = 1, a′ = 3, and b′ = 1.1. These values satisfy (13). Figure 25(b) shows the two range profiles after applying the minimum entropy-based approach for this case. As predicted, the method is unable to align the two range profiles. On the contrary, the proposed method can align the two range profiles, as shown in Figure 25(a). Again, we would like to highlight that the optimization stage of the proposed method has carefully been designed.
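The quoted values can be checked directly against the two conditions of (13) as written above (with the primes restored); both inequalities hold, although only by a margin of about 0.006, which is consistent with the assumption of little scintillation.

```python
from math import log

a, b = 3.1, 1.0
ap, bp = 3.0, 1.1      # a' and b'

cond1 = a * log(a) + b * log(b) > a * log(ap) + b * log(bp)
cond2 = ap * log(ap) + bp * log(bp) > ap * log(a) + bp * log(b)
print(cond1, cond2)    # True True
```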
6. Conclusions

The traditional imaging sensors, such as cameras or laser radars, may have a reduced performance in adverse meteorological conditions or in difficult scenarios where, for example, smoke screens are present. ISAR is an all-weather radar technique which may provide images of noncooperative targets in such adverse environments. Hence, such images are interesting for defense and security applications. Furthermore, the ISAR images may be exploited for subsequent recognition/identification tasks.

Unfortunately, the standard ISAR images are usually blurred because of the target motion. Motion compensation techniques should be applied in order to have focused ISAR images. Generally, it is at least necessary to compensate the radial component of the translational motion. To achieve this, the methods for translational motion compensation work in two stages: range-bin alignment and phase adjustment.

In order to increase the quality of the ISAR images, the range-bin alignment step must properly be designed. In this paper, we have proposed a range-bin alignment method based on the envelope correlation between the range profiles and their corresponding reference profiles, calculated as a combination of the previously aligned range profiles. Furthermore, the method achieves an accurate subinteger refinement for the range profile alignment. This subinteger adjustment is based on an optimization stage which has carefully been designed in order to avoid convergence to undesired local maxima.
The paper addresses the performance of the proposed algorithm in an exhaustive manner, by using both simulated and real data from LFMCW radars. In this context, it has been shown that the method is robust against target scintillation, noise, and clutter. Its robustness against the error accumulation effect has also been verified.

On the other hand, the proposed method has also been compared with recently proposed state-of-the-art range-bin alignment methods, such as the global range alignment algorithm and the minimum entropy-based approach. We have verified that the subinteger feature of the proposed method provides us with extremely accurate range-bin alignments, in contrast to the minimum entropy-based approach. It has also been shown that the method may deal with large range shifts from range profile to range profile, unlike the global range alignment algorithm. Finally, the careful design of the proposed optimization stage has been highlighted. We have addressed simple simulated examples in which both the global range alignment algorithm and the minimum entropy-based technique have problems with local maxima.

The proposed algorithm is robust in many scenarios and is hence a very interesting alternative for the range-bin alignment stage in the task of ISAR translational motion compensation. The improved obtained ISAR images may be of interest for subsequent automatic target recognition methods.

Acknowledgments

This work was financially supported by the Spanish National Board of Scientific and Technology Research under Project TEC2008-02148/TEC. The authors thank Dr. A. Blanco-del-Campo, Dr. A. Asensio-López, and Dr. B. P. Dorta-Naranjo for providing the live data of the sailboat and the vessel.

References

[1] S. A. Hovanessian, Introduction to Sensor Systems, Artech House, Boston, Mass, USA, 1988.
[2] A. V. Jelalian, Laser Radar Systems, Artech House, Boston, Mass, USA, 1992.
[3] G. R. Osche and D. S. Young, "Imaging laser radar in the near and far infrared," Proceedings of the IEEE, vol. 84, no. 2, pp. 103–125, 1996.
[4] H.-Y. Chen, I.-Y. Tarn, and Y.-J. Hwang, "Infrared extinction of the powder of brass 70Cu/30Zn calculated by the FDTD and turning bands methods," IEEE Transactions on Geoscience and Remote Sensing, vol. 33, no. 6, pp. 1321–1324, 1995.
[5] C.-C. Chen and H. C. Andrews, "Target motion induced radar imaging," IEEE Transactions on Aerospace and Electronic Systems, vol. 16, no. 1, pp. 2–14, 1980.
[6] D. A. Ausherman, A. Kozma, J. L. Walker, H. M. Jones, and E. C. Poggio, "Developments in radar imaging," IEEE Transactions on Aerospace and Electronic Systems, vol. 20, no. 4, pp. 363–400, 1984.
[7] K.-T. Kim, D.-K. Seo, and H.-T. Kim, "Efficient classification of ISAR images," IEEE Transactions on Antennas and Propagation, vol. 53, no. 5, pp. 1611–1621, 2005.
[8] B. K. Shreyamsha Kumar, B. Prabhakar, K. Suryanarayana, V. Thilagavathi, and R. Rajagopal, "Target identification using harmonic wavelet based ISAR imaging," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 86053, 13 pages, 2006.
[9] E. Radoi, A. Quinquis, and F. Totir, "Supervised self-organizing classification of superresolution ISAR images: an anechoic chamber experiment," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 35043, 14 pages, 2006.
[10] S. Musman, D. Kerr, and C. Bachmann, "Automatic recognition of ISAR ship images," IEEE Transactions on Aerospace and Electronic Systems, vol. 32, no. 4, pp. 1392–1404, 1996.
[11] M. I. Skolnik, Introduction to Radar Systems, McGraw Hill Higher Education, New York, NY, USA, 3rd edition, 2001.
[12] D. R. Wehner, High Resolution Radar, Artech House, Boston, Mass, USA, 2nd edition, 1995.
[13] V. C. Chen and W. J. Miceli, "Simulation of ISAR imaging of moving targets," IEE Proceedings: Radar, Sonar and Navigation, vol. 148, no. 3, pp. 160–166, 2001.
[14] V. C. Chen, Time-Frequency Transforms for Radar Imaging and Signal Analysis, Artech House, Boston, Mass, USA, 2002.
[15] J. L. Walker, "Range-Doppler imaging of rotating objects," IEEE Transactions on Aerospace and Electronic Systems, vol. 16, no. 1, pp. 23–52, 1980.
[16] L. I. Xi, G. Liu, and N. Jinlin, "Autofocusing of ISAR images based on entropy minimization," IEEE Transactions on Aerospace and Electronic Systems, vol. 35, no. 4, pp. 1240–1252, 1999.
[17] J. M. Muñoz-Ferreras, J. Calvo-Gallego, F. Pérez-Martínez, A. Blanco-del-Campo, A. Asensio-López, and B. P. Dorta-Naranjo, "Motion compensation for ISAR based on the shift-and-convolution algorithm," in Proceedings of the IEEE Conference on Radar, pp. 366–370, Verona, NY, USA, April 2006.
[18] G. Y. Delisle and H. Wu, "Moving target imaging and trajectory computation using ISAR," IEEE Transactions on Aerospace and Electronic Systems, vol. 30, no. 3, pp. 887–899, 1994.
[19] J. M. Muñoz-Ferreras and F. Pérez-Martínez, "Extended envelope correlation for range bin alignment in ISAR," in Proceedings of the IET International Conference on Radar Systems, pp. 65–68, Edinburgh, UK, October 2007.
[20] J. Wang and D. Kasilingam, "Global range alignment for ISAR," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 1, pp. 351–357, 2003.
[21] D. Zhu, L. Wang, Y. Yu, Q. Tao, and Z. Zhu, "Robust ISAR range alignment via minimizing the entropy of the average range profile," IEEE Geoscience and Remote Sensing Letters, vol. 6, no. 2, pp. 204–208, 2009.
[22] B. D. Steinberg, "Microwave imaging of aircraft," Proceedings of the IEEE, vol. 76, no. 12, pp. 1578–1592, 1988.
[23] D. E. Wahl, P. H. Eichel, D. C. Ghiglia, and C. V. Jakowatz Jr., "Phase gradient autofocus—a robust tool for high resolution SAR phase correction," IEEE Transactions on Aerospace and Electronic Systems, vol. 30, no. 3, pp. 827–835, 1994.
[24] M. Martorella, F. Berizzi, and B. Haywood, "Contrast maximisation based technique for 2-D ISAR autofocusing," IEE Proceedings: Radar, Sonar and Navigation, vol. 152, no. 4, pp. 253–262, 2005.
[25] J. A. Nelder and R. Mead, "A simplex method for function minimization," The Computer Journal, vol. 7, pp. 308–313, 1965.
[26] A. Blanco-del-Campo, A. Asensio-López, B. P. Dorta-Naranjo, et al., "Millimeter-wave radar demonstrator for high resolution imaging," in Proceedings of the 1st European Radar Conference (EuRAD '04), pp. 65–68, Amsterdam, The Netherlands, October 2004.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 108130, 9 pages
doi:10.1155/2010/108130

Research Article
Investigating the Bag-of-Words Method for 3D Shape Retrieval

Xiaolan Li1 and Afzal Godil2


1 College of Computer Science & Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China
2 Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

Correspondence should be addressed to Xiaolan Li, [email protected]

Received 1 December 2009; Revised 27 February 2010; Accepted 3 March 2010

Academic Editor: Yingzi Du

Copyright © 2010 X. Li and A. Godil. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper investigates the capabilities of the Bag-of-Words (BWs) method in the 3D shape retrieval field. The contributions of this paper are (1) the 3D shape retrieval task is categorized from different points of view: specific versus generic, partial-to-global retrieval (PGR) versus global-to-global retrieval (GGR), and articulated versus nonarticulated; (2) the spatial information, represented as concentric spheres, is integrated into the framework to improve the discriminative ability; (3) the analysis of the experimental results on the Purdue Engineering Benchmark (PEB) reveals that some properties of the BW approach make it perform better on the PGR task than on the GGR task; and (4) the BW approach is evaluated on the nonarticulated database PEB and on the articulated database McGill Shape Benchmark (MSB) and compared to other methods.

1. Introduction

With recent advances in scanning and modeling technologies, large numbers of 3D models are created and stored in databases. For these databases to be used effectively, methods for indexing, retrieval, and clustering are required. Therefore, retrieval and classification of 3D objects are becoming an increasingly important task in modern applications such as computer vision, computer aided design/computer aided manufacturing, multimedia, molecular biology, biometrics, security, and robotics.

Because of its simplicity, flexibility, and effectiveness, the Bag-of-Words (BWs) method, which originated from the document retrieval field, has recently attracted a large amount of interest in the computer vision field. It has been applied in applications such as image/video classification [1] and 3D shape analysis and retrieval [2–5]. We will explore its performance especially for the 3D shape retrieval task in this paper.

A typical 3D shape retrieval task can be defined as follows: given a query 3D shape, obtain a list of 3D shapes ordered by the similarity between the query object and each one on the list. Several methods have been proposed to solve the problem, such as Light Field descriptors [6], the spherical harmonics descriptor [7], the D2 shape distribution [8], Reeb Graph-based descriptors [9], and local feature-based methods [4, 5]. The performance of the methods varies mainly according to the specific tasks. In fact, from different points of view, the 3D shape retrieval task can be further refined as follows.

(1) Considering the object category extent, the task can be discussed in the "specific" and "generic" domains, which depends on the purpose and interest of the specialists. Representative benchmarks of the latter include the Princeton Shape Benchmark [10] and the NIST 3D Benchmark [11], while CAD [12], protein [13], and biometrics [14] analysis are several important "specific" domains, which have their own properties. For example, CAD models have a more complicated structure with holes and other local features. Using only global information, these subtle details could be neglected and lead to less ideal retrieval results.

(2) Based on the completeness of the query shape, the task can be divided into two subtasks: "Partial-to-Global Retrieval (PGR)" and "Global-to-Global Retrieval (GGR)". For the former, every query shape is regarded as an incomplete object, which is used to obtain similar complete objects from the database. This happens in many cases. For example, when using 3D range scanners to capture 3D data in real time, because of the limitation of the view angle, the occlusion in the scene, and the real-time requirement, only parts of the object can be captured during scanning. This incomplete point cloud may then be used as the query shape to retrieve the corresponding complete model from an existing database. Solving this problem will also benefit
several other applications, such as data registration [15] and model fixing [16]. Most of the global-based shape retrieval methods [6, 8], which require the complete geometry of a 3D object, cannot be applied directly to PGR. To our knowledge, there are only a few literature contributions [3, 17] that solve the PGR problem.

(3) Based on the deformability of the shape, there exist "Articulated Shape Retrieval (ASR)" and "Nonarticulated Shape Retrieval (NASR)". Many natural and man-made objects are deformable. For instance, in CAESAR [14], each person is scanned in three different postures: standing, sitting with arms open, and sitting with arms down. When performing shape retrieval using a sitting model of person A as the query model, the preferred result is to obtain the other two differently postured models of person A rather than to retrieve the sitting models of other persons. According to the results in [4], the Light Field method [6], which performs greatly when dealing with the NASR problem, produces poor results for the ASR task.

To some extent, in [3, 4, 17], the above three different tasks are discussed within the BWs framework, but a thorough investigation is still lacking. Several open problems remain unsolved, such as how to integrate spatial information into the BWs framework to improve the performance. In this paper, we investigate deeply into these three different cases within the framework of the BWs method with spin images [18] as the low-level features, and provide detailed experimental results to support the discussion.

The organization of the paper is as follows. Several related works are summarized in Section 2. The performance measure is discussed in Section 3. In Section 4, we first introduce the ordinary procedure for the BWs framework in the 3D domain. Then, taking the CAD database [12] as an example, a Concentric Bag-of-Words (CBW) approach is proposed to enhance the discriminative ability of the original BWs method. Several interesting phenomena are studied for the PGR problem in Section 5. As for the ASR task, the McGill articulated shape benchmark [19] is adopted to test the effectiveness of our approach in Section 6. Finally, we conclude the paper in Section 7.

2. Related Work

Many efforts have been made to perform 3D shape retrieval recently. Among them, the BWs method, which represents a 3D shape as an orderless collection of local features, has demonstrated an impressive level of performance.

In [2, 3], the BWs method is explored to accomplish the PGR task, in which a visual feature dictionary is constituted by clustering spin images [18]. Then, Kullback-Leibler divergence is proposed as a similarity measurement in [3], while a probabilistic framework is introduced in [2].

For the ASR task, Ohbuchi et al. [4] apply the SIFT algorithm to depth buffer images of the model captured from uniformly sampled locations on a view sphere to collect visual words. After vector quantization, Kullback-Leibler divergence measures the similarities of the models. It also demonstrates that (a) given enough samples, the BWs method can reach a retrieval result comparable to a vision-based method like Light Field [6] when dealing with the NASR task; (b) the BWs method performs better than Light Field [6] when dealing with the ASR task. In this paper, spin images are used as local features, which can be extracted directly and in arbitrary numbers in the 3D domain. On the other hand, according to [1], dense features, such as spin images, perform better than sparse features, such as SIFT.

Although the BWs method has many advantages, it suffers from its lack of spatial information. Some methods focus on integrating the spatial layout information into the BWs method. Lazebnik et al. [20] propose a spatially enriched Bag-of-Words approach. It works by partitioning the image into increasingly fine subregions and computing histograms of local features found inside each subregion. Implicit geometric correspondences of the subregions are built in the pyramid matching scheme [21]. In [22], the object is an ensemble of canonical parts linked together by an explicit homographic relationship. Through an optimization procedure, the model corresponding to the lowest residual error gives the class label to the query object along with the localization and pose estimation. Yuan and Wu [23] describe a context-aware clustering method, which captures the contextual information between data. For the BWs method, this means the visual dictionary is constructed based on both the primitive visual features and spatial contexts. Li et al. [5] propose to treat the model in two different domains, named the feature domain and the spatial domain. The visual word dictionary is built in the feature domain as in the ordinary BWs method. On the other side, the whole model is partitioned into several pieces in the spatial domain. Thereafter, each piece of the model is represented as a word histogram. The whole model is recorded as several word histograms along with a geometry matrix which stores the relative distances between every pair of the pieces. The weighted sums of dissimilarity measurements from these two domains are used to measure the differences between models.

3. Performance Measure

The performance measure used in this study is the precision-recall curve. The precision-recall curve is the most common metric to evaluate a 3D shape retrieval system. Precision is the ratio of retrieved objects that are relevant to all retrieved objects in the ranked list. Recall is the ratio of relevant objects retrieved in the ranked list to all relevant objects.

Let A be the set of all relevant objects, and B be the set of all retrieved objects; then

precision = |A ∩ B| / |B|,   recall = |A ∩ B| / |A|.   (1)

Basically, recall evaluates how well a retrieval algorithm finds what we want, and precision evaluates how well it weeds out what we do not want. There is a tradeoff between recall and precision: one can increase recall by retrieving more objects, but this can decrease precision.
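As a small illustration of (1), the sketch below (not from the paper) computes precision and recall at every cutoff of a ranked retrieval list, which is how a precision-recall curve such as those in Figures 4 and 5 can be traced.

```python
def precision_recall_curve(ranked_ids, relevant_ids):
    """Precision and recall after each retrieved object in a ranked list."""
    relevant = set(relevant_ids)
    hits, precisions, recalls = 0, [], []
    for k, obj in enumerate(ranked_ids, start=1):
        if obj in relevant:
            hits += 1
        precisions.append(hits / k)            # |A ∩ B| / |B|
        recalls.append(hits / len(relevant))   # |A ∩ B| / |A|
    return precisions, recalls

# Toy example: 3 relevant models, 6 retrieved (made-up identifiers).
p, r = precision_recall_curve(["m1", "m7", "m2", "m9", "m3", "m8"], ["m1", "m2", "m3"])
print(p)  # ≈ [1.0, 0.5, 0.67, 0.5, 0.6, 0.5]
print(r)  # ≈ [0.33, 0.33, 0.67, 0.67, 1.0, 1.0]
```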
Figure 1: Comparing BWs and CBW representations. (a) Representing shapes with one global Bag-of-Words model. The left and the right
shapes are both composed with 5 different words: a, b, c, d, e. Both feature vectors are [3, 5, 7, 4, 5], which count the occurrences of each
word. That means using BWs representation, the left and the right shapes are regarded as the same. (b) Representing shapes with Concentric
Bags-of-Words model. Even there are the same two shapes as shown in (a), because of the concentric sphere partitioning, the left and the
right shapes are different. Along the arrow’s direction, counting from the outer sphere to the inner one, their feature vectors are [2 3 1 3 3; 1
1 5 1 1; 0 1 1 0 1] and [0 3 3 2 3; 3 1 2 2 2; 0 1 2 0 0], respectively.
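Using the numbers quoted in the caption, the following quick check (illustrative only) shows why the concentric partitioning adds discriminative power: the two global BWs histograms are identical, while the concatenated CBW histograms differ.

```python
bow_left  = [3, 5, 7, 4, 5]
bow_right = [3, 5, 7, 4, 5]

cbw_left  = [2, 3, 1, 3, 3,  1, 1, 5, 1, 1,  0, 1, 1, 0, 1]
cbw_right = [0, 3, 3, 2, 3,  3, 1, 2, 2, 2,  0, 1, 2, 0, 0]

l1 = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))
print(l1(bow_left, bow_right))   # 0  -> the global histograms cannot tell the shapes apart
print(l1(cbw_left, cbw_right))   # 14 -> the concentric histograms can
```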

Figure 2: A schematic description of CBW method.


4 EURASIP Journal on Advances in Signal Processing

The number of support points


Support point q
located in this area

h (quantized β)
Tangent plane n β

r α
p

Oriented basis point on the surface


w (quantized α)

Figure 3: The demonstration of spin image.

1 4. Bag-of-Words and Concentric


0.9 Bag-of-Words Methods
0.8 We first describe the original formulation of BWs represen-
0.7 tation [1, 3], and then introduce the whole procedure of
Concentric Bags-of-Words (CBW) method. Their difference
0.6 is demonstrated in Figure 1. Its effectiveness is shown by the
Precision

0.5 experiment performed on “specific” shape database PEB [12]


to reveal its effectiveness.
0.4

0.3 4.1. Bag-of-Words Descriptor. Let us use the ordinary 3D


0.2 shape retrieval as an example to give an explanation of the
BWs framework. Denote N be the total number of labels
0.1 (“visual words”) in the learned visual dictionary. The 3D
0 shape can be represented as a vector with length N, in which
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 the elements count the occurrences of the corresponding
Recall label. The procedure can be completed in three steps.
Light field descriptor 3D shape distribution
2.5D spherical harmonics Surface area and volume
(1) Local feature descriptors, such as spin image [18], are
2D shape histogram Random applied to the 3D model to acquire low-level features.
3D spherical harmonics Bag of words
(2) Visual words, denoted as the discrete set {V1 , V2 , . . . ,
Solid angle histogram Concentric bag of words
VN }, are formed by clustering the features into N
Figure 4: Precision-recall (PR) plots of CBW, BWs, and other clusters, so that each local feature is assigned to a
methods listed in [12]. discrete label.
(3) The shape of the 3D model is summarized with
0.7
a global histogram (“Bag-of-Words”), denoted as
a vector fv = (x1 , x2 , . . . , xN ), by counting the
0.6
occurrences of each visual word.
0.5
4.2. Concentric Bag-of-Words Method. Rather than using
0.4 only a global histogram, this paper advocates using more
Precision

than one histogram along with its related spatial information


0.3 to reveal the 3D shape in more detail. Specifically, the model
is partitioned with several concentric spheres, and all the
0.2 parts between two neighboring spheres are recorded with
original BWs descriptor, which leads to the name Concentric
0.1 Bag-of-Words. A schematic description of the approach is
given in Figure 2.
0 The first block in Figure 2 represents low-level feature
0 0.2 0.4 0.6 0.8 1
Recall extraction. Although several local features, such as depth
buffer image, can be adopted to extract low-level features
KL convergence from 3D models, spin image is the one adopted here.
L1 distance Compared to using depth buffer image, adopting spin
Figure 5: Precision-recall (PR) plots obtained with KL and L1 image has at least two advantages. First, it is quicker to
dissimilarity measurement when doing PGR. compute. Second, it can capture the details from the concave
EURASIP Journal on Advances in Signal Processing 5

Partial retrieval
use 1/6 part of the object as query

(a)

Partial retrieval
use 1/6 part of the object as query

(b)

Figure 6: The example to show the difference between GGR and PGR. (a) First example to show the difference between Global-to-Global
retrieval (GGR) and Partial-to-Global retrieval (PGR). The left group shows the GGR result using a complete model (the top-left image) as
the query. The right group shows the PGR result using 1/6 part of the complete model (the small image shown in the bottom-left corner of
the first image) as the query. The top 20 models are listed orderly according to the dissimilarity measurement. (b) The second example to
show the difference between GGR and PGR. The layout of the images is the same as that of (a).

area and the self-hidden area. As shown in Figure 3, it Instead of representing one model with a histogram
characterizes the local properties around its basis point p of the words from the dictionary, it is partitioned into
within the support range r. It is a two-dimensional histogram M regions by grouping the oriented-basis points with M
accumulating the number of points located at the coordinate concentric spheres as demonstrated in the third block.
(α, β), where α and β are the lengths of the two orthogonal Thereafter, the model is recorded as a set of histograms.
edges of the triangle formed by the oriented-basis point p, Because all the models are scaled into unified scale and
whose orientation is defined by the normal n, and support the partitioning is also unified, the correspondence between
point q. The final size of the spin images is defined by the the regions of two models is obvious. It can be constructed
width w and the height h of the spin plane. We uniformly from outer sphere to inner sphere, as shown in Figure 1(b),
sample Nb oriented-basis points and Ns support points on or reverse. Thus, the CBW feature vector is recorded as
the surface of the model, which satisfies insensitivity to the  
tessellation and resolution of the mesh. c f v = f v1 , f v2 , . . . , f vM
After extracting a set of spin images for each model,   (2)
we construct a shape dictionary as shown in the second 1
= x11 , x21 , . . . , xN , . . . , x1M , x2M , . . . , xNM .
block, whose size is predetermined as N, by clustering all
spin images acquired from the whole training dataset with When performing 3D shape retrieval, the CBW repre-
k-means method. sentation of the query shape is constructed on line, and
6 EURASIP Journal on Advances in Signal Processing

1 Engineering Benchmark [12], which contains 866 3D CAD


models and is classified into 42 classes such as, “Discs”,
0.9
“T-shaped parts”, and “Bracket-like parts”. In Figure 4 we
0.8 compare the Precision-Recall curves obtained with CBW
and BWs using L1 as dissimilarity measurement to those
0.7 methods defined in [12], such as Light Field Descriptor, 2.5D
0.6
Spherical harmonics, 2D Shape Histogram, 3D Spherical
Harmonics, Solid Angle Histogram, 3D Shape Distribution,
0.5 Surface Area and Volume. Here, for CBW, M = 9.
Obviously, the concentric sphere partition improves the PR
0.4
rate, and makes the local Feature-based method comparable
0.3 to the global Feature-based method, such as 2.5D Spherical
Harmonics listed as the second best method in [12].
0.2

0.1 5. Partial-to-Global Retrieval


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
In CAD domain, the PGR task is specifically important.
EVD SHD on embedding Suppose that the query partial model is a screw, the target
CCD SHD complete model we want to obtain is the same size screw
LFD on embedding BW with screw cap on. Here, we design an experiment on PEB to
LFD
simulate the case described above: (1) Represent the models
Figure 7: Precision-recall (PR) plots for various descriptors when with BWs method and save the descriptors for the following
applied to the McGill database of articulated shapes [19]. Except usage as block 1 and 2 in Figure 2. (2) When performing
BWs, all of the other results can be found in [25]. the Partial-to-Global Retrieval, the sampled oriented-basis
4.3. Experimental Results. According to the discussions in [3, 17], several parameters related to the CBW approach are defined as follows.

(1) The support range r: r = 0.4 ∗ R, where R is the radius of the model.
(2) The width w and the height h of the spin plane: w = h = 12.
(3) The number of oriented-basis points for one model Nb: Nb = 500.
(4) The number of oriented-basis points for one model Ns: Ns = 5000.
(5) The size of the dictionary N: N = 1500.
(6) The number of the concentric spheres M: M < 10.

The CBW approach can be applied both in "specific" and "generic" domains. Here we demonstrate it on the Purdue Engineering Benchmark [12], which contains 866 3D CAD models classified into 42 classes such as "Discs", "T-shaped parts", and "Bracket-like parts". In Figure 4 we compare the Precision-Recall curves obtained with CBW and BWs, using L1 as the dissimilarity measurement, to those of the methods evaluated in [12], such as Light Field Descriptor, 2.5D Spherical Harmonics, 2D Shape Histogram, 3D Spherical Harmonics, Solid Angle Histogram, 3D Shape Distribution, and Surface Area and Volume. Here, for CBW, M = 9. Obviously, the concentric sphere partition improves the PR rate and makes the local feature-based method comparable to global feature-based methods such as 2.5D Spherical Harmonics, listed as the second best method in [12].

Figure 7: Precision-recall (PR) plots for various descriptors (EVD, CCD, LFD, LFD on embedding, SHD, SHD on embedding, and BW) when applied to the McGill database of articulated shapes [19]. Except BWs, all of the other results can be found in [25].

5. Partial-to-Global Retrieval

In the CAD domain, the PGR task is specifically important. Suppose that the query partial model is a screw; the target complete model we want to obtain is the same size screw with the screw cap on. Here, we design an experiment on PEB to simulate the case described above: (1) represent the models with the BWs method and save the descriptors for later use, as in blocks 1 and 2 of Figure 2; (2) when performing the Partial-to-Global Retrieval, the sampled oriented-basis points are first grouped into M_p regions according to their geometric positions. Then one of the groups is chosen as the partial query shape. The BWs representation is constructed on line and used to compare with the saved BWs descriptors of the complete models. The requirements on the dissimilarity measure for the partial-to-global retrieval task are quite different from those of the global-to-global retrieval problem. As described in [3], the dissimilarity between the query data and the target model is not equal to that between the target model and the query data. It means that the dissimilarity metric should be asymmetric. An ordinary symmetric distance measurement, such as L1 or L2, is not a suitable choice. KL divergence is chosen here to satisfy the asymmetric property. When using one sixth of the model as the query shape, the two PR curves in Figure 5 demonstrate the improvement introduced by KL compared to L1.

Figure 6 provides two examples comparing the retrieval results of Global-to-Global Retrieval and Partial-to-Global Retrieval, in which one sixth of a gear is used as the query shape. It shows that PGR is better than GGR, since PGR lists more gears at the top of the list than GGR does. Why does PGR perform better? Recalling the definition of the feature vector provides some clues to the answer. The feature vector describes the frequency of the visual words appearing in the shape. When the entire gear model is used as the query data, the plane-kind of visual word overwhelms the other features. However, when part of the object is used as the query data, the gear teeth shape dominates the whole shape, so more gears are picked out and listed at the top of the list.
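The asymmetric comparison used for PGR can be made concrete by replacing the per-shell dist(·,·) of (3) with a KL divergence from the query histogram to the target histogram. The following is a minimal sketch of that choice under the same assumptions as the earlier snippet; the smoothing constant eps is an added assumption to avoid log(0) for words absent from one model.

```python
import numpy as np

def dist_kl(query_cfv, target_cfv, eps=1e-10):
    # D(query || target) summed over shells; note D(q, t) != D(t, q),
    # which gives the asymmetry wanted for partial-to-global retrieval
    total = 0.0
    for q, t in zip(query_cfv, target_cfv):
        p = (q + eps) / (q + eps).sum()
        r = (t + eps) / (t + eps).sum()
        total += float(np.sum(p * np.log(p / r)))
    return total
```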
Figure 8: Some retrieval results from the McGill database [19]: the green bold frame defines the query shape, and the orange bold frame defines the false pick up. The three groups show three different shapes, which from top to bottom are (a) spider, (b) spectacles, and (c) human.

6. Articulated Shape Retrieval

Articulated Shape Retrieval requires that the shape descriptor be deformation invariant, which is not satisfied by several previous methods [6–8]. They perform well when dealing with rigid objects, but manifest poor performance when dealing with deformable ones [4]. The BWs method can still be used effectively for the ASR task. The descriptors for the models are constructed by following the procedure shown as blocks 1 and 2 in Figure 2.

We applied the BWs method to the ASR task on the McGill Shape Benchmark (MSB) [19]. The configuration of the parameters is almost the same as that listed in Section 4.3, except that (a) the width w and the height h of the spin plane: w = h = 16, and (b) the number of oriented-basis points for one model Nb: Nb = 1000. Since all of the models in MSB are regarded as complete, the L1 distance is chosen to measure the dissimilarity.

In Figure 7, the BW-based retrieval result is compared with several methods described in [25]. The BWs method is comparable to the best method, EVD. However, except BWs, all the other methods are based on geodesic distance computation, which is computationally expensive. On the contrary, our method is constrained to a local area and can be applied for on-line retrieval.

Figure 8 shows three visual results of articulated shape retrieval. Only the top 18 results are listed, in which the green bold framed shape is the query shape, and the orange bold framed shapes are the false recalls. Figure 8(a) shows the results of retrieving a spider shape from the database. Among these 18 retrieved shapes, only two shapes do not belong to the spider class but to the ant class. Figures 8(b) and 8(c) are the results using a spectacles and a human shape as the query model, respectively. Even though there is quite a large amount of bending in the shapes, the performance is quite good.

7. Conclusion and Discussion

In this paper, we explore the BWs framework to solve several different tasks in the 3D shape retrieval field, which are classified as specific versus generic, partial-to-global versus global-to-global retrieval, and articulated versus nonarticulated. For each type, the effectiveness of the BWs method is discussed in detail. First, the CBW method is introduced to improve the discrimination ability of the original BWs representation. Second, BWs is applied on PEB to perform the partial-to-global retrieval task, and several results revealed that, for some shapes (gear-like shapes), PGR performs better than GGR. Finally, we compared the results of BWs to several other methods on the McGill articulated shape database. Our results are comparable to the best results in [25]. More experiments need to be done to verify the influence of the parameters listed in Section 4.3.

Acknowledgments

The authors would like to thank the SIMA program and the IDUS program for supporting this work. This work has also been partially supported by NSF Grants of China (60873218) and NSF Grants of Zhejiang (Z1080232).

References

[1] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 2, pp. 524–531, June 2005.
[2] Y. Shan, H. S. Sawhney, B. Matei, and R. Kumar, "Shapeme histogram projection and matching for partial object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 568–577, 2006.
[3] Y. Liu, H. Zha, and H. Qin, "Shape topics: a compact representation and new algorithms for 3D partial shape retrieval," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 2025–2032, 2006.
[4] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, "Salient local visual features for shape-based 3D model retrieval," in Proceedings of the IEEE International Conference on Shape Modeling and Applications (SMI '08), pp. 93–102, Stony Brook, NY, USA, 2008.
[5] X. Li, A. Godil, and A. Wagan, "Spatially enhanced bags of words for 3D shape retrieval," in Proceedings of the 4th International Symposium on Advances in Visual Computing (ISVC '08), vol. 5358 of Lecture Notes in Computer Science, pp. 349–358, Las Vegas, Nev, USA, 2008.
[6] D.-Y. Chen, M. Ouhyoung, X.-P. Tian, and Y.-T. Shen, "On visual similarity based 3D model retrieval," Computer Graphics Forum, vol. 22, no. 3, pp. 223–232, 2003.
[7] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz, "Rotation invariant spherical harmonic representation of 3D shape descriptors," in Proceedings of the ACM International Conference Symposium on Geometry Processing, vol. 43, pp. 156–164, Aachen, Germany, June 2003.
[8] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM Transactions on Graphics, vol. 21, no. 4, pp. 807–832, 2002.
[9] S. Biasotti, "Reeb graph representation of surfaces with boundary," in Proceedings of the Shape Modeling International (SMI '04), pp. 371–374, 2004.
[10] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, "The Princeton shape benchmark," in Proceedings of the International Conference on Shape Modeling and Applications (SMI '04), pp. 167–178, Genova, Italy, June 2004.
[11] R. Fang, A. Godil, X. Li, and A. Wagan, "A new shape benchmark for 3D object retrieval," in Proceedings of the 4th International Symposium on Visual Computing, vol. 5358 of Lecture Notes in Computer Science, pp. 381–392, Las Vegas, Nev, USA, 2008.
[12] S. Jayanti, Y. Kalyanaraman, N. Iyer, and K. Ramani, "Developing an engineering shape benchmark for CAD models," Computer Aided Design, vol. 38, no. 9, pp. 939–953, 2006, Shape Similarity Detection and Search for CAD/CAE Applications.
[13] H. M. Berman, J. Westbrook, Z. Feng, et al., "The protein data bank," Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.
[14] CAESAR Anthropometric Database, http://store.sae.org/caesar/.
[15] N. J. Mitra, L. Guibas, J. Giesen, and M. Pauly, "Probabilistic fingerprints for shapes," in Proceedings of the 4th Eurographics Symposium on Geometry Processing, vol. 256, pp. 121–130, Sardinia, Italy, 2006.
[16] T. Funkhouser, M. Kazhdan, P. Shilane, et al., "Modeling by example," in Proceedings of the 31st International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '04), pp. 652–663, 2004.

[17] X. Li, A. Godil, and A. Wagan, "3D part identification based on local shape descriptors," in Proceedings of the Performance Metrics for Intelligent Systems Workshop (PerMIS '08), Gaithersburg, Md, USA, August 2008.
[18] A. E. Johnson and M. Hebert, “Using spin images for efficient
object recognition in cluttered 3D scenes,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp.
433–449, 1999.
[19] J. Winn, A. Criminisi, and T. Minka, “Object categorization by
learned universal visual dictionary,” in Proceedings of the 10th
IEEE International Conference on Computer Vision (ICCV ’05),
vol. 2, pp. 1800–1807, 2005.
[20] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of
features: spatial pyramid matching for recognizing natural
scene categories,” in Proceedings of the Computer Vision and
Pattern Recognition (CVPR ’06), vol. 2, pp. 2169–2178, 2006.
[21] K. Grauman and T. Darrell, “The pyramid match kernel:
discriminative classification with sets of image features,” in
Proceedings of the IEEE International Conference on Computer
Vision (ICCV ’05), vol. 2, pp. 1458–1465, October 2005.
[22] S. Savarese and L. Fei-Fei, “3D generic object categorization,
localization and pose estimation,” in Proceedings of the IEEE
11th International Conference on Computer Vision (ICCV ’07),
pp. 1–8, Rio de Janeiro, Brazil, October 2007.
[23] J. Yuan and Y. Wu, “Context aware clustering,” in Proceedings
of the Computer Vision and Pattern Recognition (CVPR ’08),
pp. 1–8, Anchorage, Alaska, USA, June 2008.
[24] R. Gal and D. Cohen-Or, “Salient geometric features for
partial shape matching and similarity,” ACM Transactions on
Graphics, vol. 25, no. 1, pp. 130–150, 2006.
[25] V. Jain and H. Zhang, “A spectral approach to shape-based
retrieval of articulated 3D models,” Computer Aided Design,
vol. 39, no. 5, pp. 398–407, 2007.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 680623, 6 pages
doi:10.1155/2010/680623

Research Article
Optical Flow and Principal Component Analysis-Based
Motion Detection in Outdoor Videos

Kui Liu, Qian Du, He Yang, and Ben Ma


Department of Electrical and Computer Engineering, Mississippi State University, MS 39762, USA

Correspondence should be addressed to Qian Du, [email protected]

Received 6 December 2009; Accepted 16 January 2010

Academic Editor: Yingzi Du

Copyright © 2010 Kui Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose a joint optical flow and principal component analysis (PCA) method for motion detection. PCA is used to analyze
optical flows so that major optical flows corresponding to moving objects in a local window can be better extracted. This joint
approach can efficiently detect moving objects and more successfully suppress small turbulence. It is particularly useful for motion
detection from outdoor videos with low quality. It can also effectively delineate moving objects in both static and dynamic
background. Experimental results demonstrate that this approach outperforms other existing methods by extracting the moving
objects more completely with lower false alarms.

1. Introduction

The detection of moving objects is critical in many defense and security applications, where motion detection is usually performed in a preprocessing step, a key to the success of the following target tracking and recognition. Many videos used in defense and security applications are outdoor videos whose quality may be degraded by various noise sources, such as atmospheric turbulence and sensor platform scintillation. Meanwhile, moving objects may be very small, occupying a few pixels only, which makes motion detection very challenging. Under this circumstance, existing approaches may generate a significant amount of false alarms.

Motion detection has been extensively investigated [1–3]. Many research works are conducted for indoor videos with large objects. As one of the major techniques, optical flow-based approaches have been widely used for motion detection. There are two classic methods of optical flow computation in computer vision: the Horn-Schunck (HS) method and the Lucas-Kanade (LK) method [4–7]. Both of them are based on two-frame differential algorithms. The LK method may not perform well in a dense flow field; on the other hand, the HS method can detect minor motion of objects and provide a 100% flow field [7]. Thus, we focus on the HS method for optical flow computation in our research. Considering outdoor videos with low quality, special care needs to be taken in order to better extract features related to moving objects from optical flows while suppressing false alarms.

Principal component analysis (PCA) is a typical approach in multivariate analysis [8]. It is also named the discrete Karhunen-Loève transform (KLT) or the Hotelling transform [9]. PCA includes the eigen-decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering. It projects the original data onto an orthogonal subspace, where each direction is mutually decorrelated and major data information is present in the first several principal components (PCs). For optical flows in a local window, moving objects have consistent flows while pixels with only turbulence have random flows. Thus, if PCA is applied to the two-dimensional (2D) data of optical flows, the difference between desired motion pixels and random motion pixels may be magnified because their contributions to the two eigenvalues are very different; the contribution from random motion pixels can be very small, even to the second eigenvalue. Experimental results show that this approach actually is an effective way of analyzing outdoor videos; it can reduce false alarms for videos with either static or dynamic background, and it is also useful to delineate the size of moving objects.
This paper is organized as follows. Section 2 explains the proposed method based on optical flow and PCA. Section 3 presents experiments using ground-based and airborne videos. Section 4 draws the conclusion.

2. Proposed Method

The HS method is a special approach of using a global smoothness constraint to express brightness variation in certain areas of the frames in a video sequence. It is also a specially defined framework to lay out the smoothness of the flow field. Let I(x, y, t) represent the brightness of a pixel at (x, y) coordinates in the tth frame. According to [4], the image constraint at I(x, y, t) with a Taylor series can be expressed as

(∂I/∂x) ∂x + (∂I/∂y) ∂y + (∂I/∂t) ∂t = 0,   (1)

which results in

I_x u + I_y v + I_t = 0,   (2)

where u = ∂x/∂t and v = ∂y/∂t are the x and y components of the velocity or optical flow of I(x, y, t), respectively, and I_x = ∂I/∂x, I_y = ∂I/∂y, and I_t = ∂I/∂t are the derivatives of the image at (x, y, t) in the corresponding directions. A constrained minimization problem can be formulated to calculate the optical flow vector (u^{k+1}, v^{k+1}) for the (k + 1)th iteration:

u^{k+1} − ū^k = I_x · (I_x ū^k + I_y v̄^k + I_t)/(α² + I_x² + I_y²),
v^{k+1} − v̄^k = I_y · (I_x ū^k + I_y v̄^k + I_t)/(α² + I_x² + I_y²),   (3)

where ū^k and v̄^k are the estimated local average optical flow velocities, and α is a weighting factor. A larger value of α results in a smoother flow; in our experiments using 8-bit videos, it is empirically set to 30000. Based on the norm of an optical flow vector, one can determine whether motion exists or not, while the direction of this vector provides the motion orientation.
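For reference, a minimal Horn-Schunck iteration implementing an update of the form of (3) is sketched below. The derivative and averaging kernels follow a common textbook discretization rather than anything specified here, and the signs follow the usual Horn-Schunck formulation; treat it as an illustrative sketch under those assumptions, not the authors' code.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(frame1, frame2, alpha=30000.0, num_iters=100):
    """Dense optical flow (u, v) between two grayscale frames."""
    f1, f2 = frame1.astype(float), frame2.astype(float)
    kx = 0.25 * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    ky = 0.25 * np.array([[-1.0, -1.0], [1.0, 1.0]])
    kt = 0.25 * np.ones((2, 2))
    Ix = convolve(f1, kx) + convolve(f2, kx)      # spatial derivatives
    Iy = convolve(f1, ky) + convolve(f2, ky)
    It = convolve(f2, kt) - convolve(f1, kt)      # temporal derivative
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]]) / 12.0
    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    for _ in range(num_iters):
        u_bar = convolve(u, avg)                  # local average flow
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v
```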
Two optical flow images can be constructed from the pixel optical flow vectors (u, v). A mask of size n × n slides through these u and v images. At location (i, j), a two-dimensional (2D) data matrix X can be constructed, which includes all the 2D vectors covered by the mask. The covariance matrix can be calculated as

Σ = X^T X,   (4)

where X is the optical flow matrix after mean removal. After eigen-decomposition, two eigenvalues (λ1, λ2) are assigned to the central pixel of the mask. Motion detection is accomplished by analyzing or thresholding the eigenvalue(s). Since λ1 is the major flow component and λ2 is the minor flow component, it may be more effective to consider (λ1, λ2) than the values in the original (u, v) space.

Intuitively, only λ1 needs to be considered because it corresponds to the major flow component and λ2 corresponds to the minor flow component or even turbulence. An appropriate threshold can be determined by using Otsu's method on the λ1 histogram [10]. However, in practice, λ2 should be considered as well since pixels inside object boundaries usually have quite large λ2 but not λ1. Thus, thresholding may need to be taken on the λ2 histogram; a pixel is claimed to have motion if either λ1 or λ2 is above the corresponding threshold.

Thus, the motion detection algorithm can be described as follows.
(1) Calculate optical flows between two adjacent frames (after registration as needed).
(2) For each pixel in the 2D optical flow data, perform PCA for a local mask (of size 3 × 3 in the experiment), and assign the two eigenvalues to the central pixel.
(3) Apply Otsu's thresholding to the eigenvalues of all the pixels (λ2 in the experiment).

Figure 1 illustrates the framework of the proposed method with a 3 × 3 mask and the resulting 2 × 9 data matrices. It is noteworthy that some variants exist when implementing the proposed method differently.
(1) In Step (1), we may use the optical flow data from multiple frames. For instance, optical flow data from Frames 1 and 2 can be combined with optical flow data from Frames 2 and 3; this may help to emphasize the desired optical flows of moving objects and the randomness of turbulence.
(2) In Step (2), masks with different sizes can be used. Intuitively, for a large moving object, the mask size should be large.
(3) In Step (3), thresholding can take place on either λ1 or λ2, depending upon the object size and the features of turbulence.

In the experiments, we use two adjacent frames, a 3 × 3 mask, and only λ2 for thresholding. This is to show that even such a simplest implementation is sufficient to provide better performance than other widely used techniques.
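A minimal sketch of Steps (2) and (3) is given below, assuming the flow fields u and v from Step (1) are available as 2D arrays. The plain double loop is written for clarity rather than speed, and threshold_otsu from scikit-image stands in for the thresholding of [10]; it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from skimage.filters import threshold_otsu

def pca_motion_map(u, v, mask_size=3):
    """Eigenvalues of the local flow covariance and a lambda_2 detection map."""
    half = mask_size // 2
    rows, cols = u.shape
    lam1 = np.zeros((rows, cols))
    lam2 = np.zeros((rows, cols))
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            wu = u[i - half:i + half + 1, j - half:j + half + 1].ravel()
            wv = v[i - half:i + half + 1, j - half:j + half + 1].ravel()
            X = np.column_stack((wu, wv))          # 9 x 2 data matrix for a 3 x 3 mask
            X = X - X.mean(axis=0)                 # mean removal
            evals = np.linalg.eigvalsh(X.T @ X)    # 2 x 2 covariance, as in (4)
            lam1[i, j], lam2[i, j] = evals[1], evals[0]
    detection = lam2 > threshold_otsu(lam2)        # threshold the minor eigenvalue
    return detection, lam1, lam2
```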
3. Experiments

In the experiments, videos with both static and dynamic backgrounds were analyzed. They were taken by a commercial Sony camcorder. We compared our proposed method with the original optical flow method and with the motion detection methods based on Kalman filtering [11], background modeling using a Gaussian mixture model [12], difference-based spatial temporal entropy image (DSTEI) [13], and forward-backward motion history images (MHI) [14]. They were chosen for comparison because they are either typical methods or designed specifically for more complicated videos (e.g., those with dynamic background).
Figure 1: The framework of the proposed method: the u and v flow images, the local 2 × 9 data matrices, Σ = X^T X, the eigenvalues λ1 and λ2, and the detection map.
Figure 2: The two input frames of a helicopter video: (a) frame 1; (b) frame 2.
Figure 3: The result from the optical flow method.
Figure 4: The result from Kalman filtering.
Figure 5: The result from background modeling.

3.1. Experiment 1: Ground-Based Video with Relatively Large Object. In this experiment, a video with static background in a small regional airport was studied, which was taken when the camcorder was mounted on a tripod. As shown in Figure 2, a Hughes Cayuse helicopter was the moving object. Since the video was taken during a humid summer afternoon, there were significant atmospheric turbulence effects, which were visible around the vehicle, runway, and tree profiles.

Figure 3 shows the detection result using optical flow only, where detected pixels were highlighted in red. It contained many false alarm pixels in the runway and tree profiles. Figures 4, 5, 6, and 7 are the detection results using the Kalman filtering, background modeling, DSTEI, and MHI methods, respectively. We can see that they all could detect the helicopter but with some regions missing and a few false alarm background pixels. The background modeling method could detect the largest areas of the helicopter; however, there were erroneously detected pixels scattered in the scene (even in the sky area). This method relies on an accurate background model, generally requiring complicated computations.

Figure 8 is the result of the proposed method, where almost all the false alarm pixels were removed (only two pixels in the vehicles were left) and major regions of the helicopter were detected. Compared to Figure 3, introducing PCA can significantly improve the performance of optical flow-based detection. Compared to the results in Figures 4–7, the proposed method can reduce false alarms while detecting larger regions of the moving object.
Figure 6: The result from the DSTEI method.
Figure 7: The result from the MHI method.
Figure 8: The result from the joint optical flow and PCA method.
Figure 9: The two input frames of an airborne video: (a) frame 1; (b) frame 2.
Figure 10: The result from the optical flow method.

3.2. Experiment 2: Airborne Videos with Small Objects. The second experiment used an airborne video with low quality. It was taken by the camcorder mounted on the helicopter in the video shown in Experiment 1. In addition to atmospheric turbulence, scintillation from the airborne platform (i.e., the small helicopter) further degraded the video quality. As shown in Figure 9, there were three moving vehicles on the highway, highlighted in yellow circles. They consisted of only a few pixels. The two frames were pre-registered using the method in [15].

Figure 10 shows the detection result using optical flow only, where the three vehicles on the highway were completely detected and the shapes of the vehicles were outlined compactly. Figures 11, 12, 13, and 14 are the results for comparison, where the three vehicles were detected but not well delineated. For instance, the detected vehicle sizes were too small when using Kalman filtering and background modeling, and too big when using DSTEI and MHI. More false alarm pixels were contained in these results. Figure 15 is the result using optical flow and PCA, which could further reduce false alarms, and the vehicle sizes seemed to be more reasonable. Although the proposed method provided the best result, there were still several false alarmed pixels, mainly located around the edges of buildings.

We found out that such false alarms in airborne videos with small moving objects can be better removed by corner-based detection [16]. Harris corners were detected from two difference images, and many false alarm pixels around buildings could be removed; false alarms were further reduced through local tracking of detected corners in several consecutive frames. The drawback is that the detected result contains only object corners. In conjunction with the proposed method, the complete regions of moving objects can be segmented by the corner-based detection while the false alarms can be reduced by the proposed method. As shown in Figure 16(a), the corner-based method can accurately detect the three vehicles without false alarms; however, it detects only a corner corresponding to an object, as detailed in Figure 16(b). Figure 16(c) shows the extracted vehicles using the MHI method, where the object sizes were slightly magnified. Figure 16(d) is the extracted vehicles using the proposed method, where the object sizes were reasonably reduced and pruned.
Figure 11: The result from Kalman filtering.
Figure 12: The result from background modeling.
Figure 13: The result from the DSTEI method.
Figure 14: The result from the MHI method.
Figure 15: The result from the joint optical flow and PCA method.
Figure 16: The result of combining the corner detection and the proposed method: (a) detected vehicles based on corner detection; (b) the three vehicles in (a); (c) extracted entire region using the MHI method; (d) extracted entire region using our method.

The result using another airborne video is shown in Figure 17, which further demonstrates that our method can better extract object sizes.

4. Conclusion

In this paper, we propose a joint optical flow and PCA approach for motion detection. Instead of considering the original optical flow, the two eigenvalues of the covariance matrix of local optical flows are analyzed. Since the first eigenvalue represents the major motion component and the second eigenvalue represents the minor motion component or turbulence, they are more useful to detect true motions while more successfully suppressing false alarms. The proposed method is also effective in extracting the actual size of moving objects.
Figure 17: The result of combining the corner detection and the proposed method in another airborne video: (a) detected vehicles based on corner detection; (b) the four vehicles in (a); (c) extracted entire region using the MHI method; (d) extracted entire region using our method.

The computational complexity involved in PCA includes the calculation of the covariance matrix of local optical flows and its eigen-decomposition. For a mask of size n × n, the number of multiplications in calculating the 2 × 2 covariance matrix is (2n)², and the complexity of eigen-decomposition is generally O(2³). For an image frame with m pixels, the total computational complexity is O(m(2n)² + m·2³). It can be reduced to O(βm(2n)²) if using iterative PCA (IPCA) as discussed in [17], where β is a small integer. As future work, we will investigate the performance when using IPCA to expedite motion detection.

Acknowledgment

This research was supported by the National Geospatial-Intelligence Agency of the United States.

References

[1] A. Mitiche and P. Bouthemy, "Computation and analysis of image motion: a synopsis of current problems and methods," International Journal of Computer Vision, vol. 19, no. 1, pp. 29–55, 1996.
[2] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man and Cybernetics Part C, vol. 34, no. 3, pp. 334–352, 2004.
[3] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: a survey," ACM Computing Surveys, vol. 38, no. 4, pp. 1–45, 2006.
[4] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, no. 1–3, pp. 185–203, 1981.
[5] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 59, no. 1-2, pp. 81–87, 1993.
[6] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the 7th International Joint Conference on Artificial Intelligence, vol. 2, pp. 674–679, 1981.
[7] A. Bruhn, J. Weickert, and C. Schnorr, "Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods," International Journal of Computer Vision, vol. 61, no. 3, pp. 211–231, 2005.
[8] K. Pearson, "On lines and planes of closest fit to systems of points in space," Philosophical Magazine, vol. 2, no. 6, pp. 559–572, 1901.
[9] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.
[10] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[11] K.-P. Karmann and A. V. Brandt, "Moving object recognition using an adaptive background memory," in Time-Varying Image Processing and Moving Object Recognition, V. Cappellini, Ed., vol. 2, pp. 297–307, Elsevier, Amsterdam, The Netherlands, 1990.
[12] P. Kaewtrakulpong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Proceedings of the 2nd European Workshop on Advanced Video-Based Surveillance Systems (AVBS '01), September 2001.
[13] G. Jing, C. E. Siong, and D. Rajan, "Foreground motion detection by difference-based spatial temporal entropy image," in Proceedings of IEEE TENCON Conference, pp. A379–A382, 2004.
[14] Z. Yin and R. Collins, "Moving object localization in thermal imagery by forward-backward MHI," in Proceedings of IEEE Computer Vision and Pattern Recognition Workshops (CVPRW '06), 2006.
[15] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," in Proceedings of the 2nd European Conference on Computer Vision, vol. 588, pp. 237–252, Springer, 1992.
[16] H. Yang, B. Ma, and Q. Du, "Very small moving object detection from airborne videos using corners in differential images," in Proceedings of IEEE International Conference on Image Processing, 2010.
[17] Q. Du and J. E. Fowler, "Low-complexity principal component analysis for hyperspectral image compression," International Journal of High Performance Computing Applications, vol. 22, no. 4, pp. 438–448, 2008.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 248954, 10 pages
doi:10.1155/2010/248954

Research Article
Shape Analysis of 3D Head Scan Data for U.S. Respirator Users

Ziqing Zhuang,1 Dennis E. Slice,2 Stacey Benson,3 Stephanie Lynch,1 and Dennis J. Viscusi1
1 National Personal Protective Technology Laboratory, National Institute for Occupational Safety and Health,
Pittsburgh, PA 15236, USA
2 Department of Scientific Computing, Florida State University, Dirac Science Library, Tallahassee, FL 32306-4120, USA
3 EG&G Technical Services Inc., Pittsburgh, PA 15236, USA

Correspondence should be addressed to Ziqing Zhuang, [email protected]

Received 25 November 2009; Accepted 29 January 2010

Academic Editor: Yingzi Du

Copyright © 2010 Ziqing Zhuang et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In 2003, the National Institute for Occupational Safety and Health (NIOSH) conducted a head-and-face anthropometric survey
of diverse, civilian respirator users. Of the 3,997 subjects measured using traditional anthropometric techniques, surface scans
and 26 three-dimensional (3D) landmark locations were collected for 947 subjects. The objective of this study was to report
the size and shape variation of the survey participants using the 3D data. Generalized Procrustes Analysis (GPA) was conducted
to standardize configurations of landmarks associated with individuals into a common coordinate system. The superimposed
coordinates for each individual were used as commensurate variables that describe individual shape and were analyzed using
Principal Component Analysis (PCA) to identify population variation. The first four principal components (PC) account for 49%
of the total sample variation. The first PC indicates that overall size is an important component of facial variability. The second PC
accounts for long and narrow or short and wide faces. Longer narrow orbits versus shorter wider orbits can be described by PC3,
and PC4 represents variation in the degree of ortho/prognathism. Geometric Morphometrics provides a detailed and interpretable
assessment of morphological variation that may be useful in assessing respirators and devising new test and certification standards.

1. Introduction

Millions of workers across the United States depend on respirators for personal protection every day. Respirators have to fit to provide adequate protection to these workers. Assessing respirator fit has for many years been based on fit test panels from Air Force data from the 1970s [1, 2]. Given an array of respirator styles and sizes, it is important to determine their fit and efficacy with respect to their intended user population and to quantify those facial features relevant to fit. It is largely recognized that data based on a population of young, healthy military personnel from over 30 years ago are not likely to be representative of the diversity of the contemporary workforce that fit test panels should target [3]. To address this deficiency, the National Institute for Occupational Safety and Health (NIOSH) conducted a facial morphological survey of contemporary workers that require the use of a respirator in the course of their work [4, 5].

Besides being based on a group not likely to be completely representative of the contemporary respirator-user population, previous studies focused on the association between linear facial dimensions in the development of test panels to capture facial variation. In the field of anthropometrics, from which the facial measurements were borrowed, there has been considerable recent innovation in the quantification and statistical analysis of shapes based on the study of the Cartesian coordinates of the landmarks that usually serve as the basis for traditional measurement definitions [6, 7]. These new methods, collectively referred to as Geometric Morphometrics (GMs), have proven more powerful and efficient than traditional approaches in many cases, and it is worthwhile to determine the extent to which they can advance the goal of respirator fit assessment. Such studies, in turn, could feed back into respirator design to achieve more efficient and comfortable product style and sizing. In anticipation of this, the NIOSH study included the collection of both facial surface scans and three-dimensional landmark locations for a large subset (∼25%) of their surveyed individuals [4].

The dependence of respirator-fit assessment standards on a base population morphologically distinct from the target
population, and the reliance of the development of those standards on a limited and somewhat arbitrary suite of traditional (curvi-)linear anthropometric measurements, were some of the problems identified by an independent review committee that examined the current state of respirator-fit assessment [8]. It was the purpose of this study to address some of these concerns by further investigating the nature of facial shape variation in the latest data assembled, using GM techniques.

2. Materials and Methods

2.1. Data. Data for this study were obtained from the NIOSH National Personal Protective Technology Laboratory (NPPTL) facial anthropometric survey [4]. The main body of data consisted of 947 data files in the format of a Unix-based, 3D package called INTEGRATE [9]. Each file contained three-dimensional coordinate locations of anatomical landmarks (Figure 1) for one individual. In addition, demographic information including sex, age group, racial group, and traditional anthropometric measures were collected. All data were visually inspected using morphometrics software to identify mislabeled or obviously erroneous coordinate values. These were marked as missing data.

The proper handling of missing data is a complicated endeavor [10]. One possible course of action would be to eliminate all individuals with any missing landmarks. That would call for the removal of over 25% of the data set, which seems extreme. Several other cut points would be defensible, for example, removing individuals with more than 3 missing landmarks, 5, and so forth. It was decided, instead, to retain all 947 individuals. Most individuals (72%) had no missing landmark coordinates, and less than 1% had six or more missing landmarks out of the twenty-eight with missing data. If the occurrence of missing data is not random with respect to the morphology of the individuals, then removing individuals will reduce the variability that this study is seeking to quantify. Missing data were estimated by simply substituting mean coordinate values.

2.2. Generalized Procrustes Analysis. Landmark coordinates are not directly comparable as quantitative measures of shape because they are (usually) recorded with respect to an arbitrary set of orthogonal reference axes. In its simplest case, irrelevant variation is introduced into the coordinate values by the position and orientation of the specimen relative to the digitizing apparatus or scanning device. In addition, many standard morphometric analyses, using both traditional measurements and landmark coordinates, seek to sequester size variation, which often tends to dominate sample variability, into a separate variable. To address these problems and issues, geometric morphometric methods include a data processing step that standardizes configurations of landmarks associated with individuals into a common coordinate system and, further, usually standardizes these configurations to a common size. The scale factor used in the latter standardization can be saved as a size measure for further investigations of the relationship between shape and size in the sample.

The way the required standardization is usually done is through Generalized Procrustes Analysis (GPA) [6, 11, 12]. In GPA, landmark configurations are mean-centered so that their average coordinate location for all landmarks is the origin. They are then scaled so that the square root of the sum of squared distances of each landmark in a configuration to their joint average location (the origin after mean-centering) is 1.0. This measure is called centroid size and has the desirable property that it is the only size measure that is independent of shape variation in the presence of small, isometric random variation in landmark location around a mean configuration [13]. Next, an arbitrary configuration of landmarks from the mean-centered and size-standardized data set (usually the first specimen) is used as a reference configuration. All specimens in the data are rotated so that the sum of squared distances between individual configuration landmarks and corresponding landmarks on the reference is minimized. Once so rotated, a mean configuration is estimated as the arithmetic average of landmark coordinates in the superimposed data set. The average configuration is then scaled to unit centroid size and the sample refit to the new estimated mean. This process is guaranteed to monotonically converge on a mean estimate for the sample [11] and is not substantively affected by the initial choice of reference. After little or no change is seen due to the rotation and mean estimation steps, the process is deemed complete, and the superimposed coordinates for each individual can be used as commensurate variables that describe individual shape and can be subjected to multivariate analyses, such as the principal components analysis used here.

This approach, in its standard form, is not the best for the purposes of this study directed at assessing variability that influences the fit and function of respirators. Here, size variation is not less important to the ultimate goal than shape variation, and even sequestering it in a separate variable for joint or separate analysis is, at least initially, irrelevant. For this reason, scale was restored to the results of a standard GPA by multiplying the resulting shape variables by the inverse of the scale factor applied to them in the course of the superimposition of individual configurations onto the grand mean. These are the "form" (shape + size) data used in subsequent statistical analyses of population variation.

2.3. Population Variation. Population variation for the data set, after GPA, was analyzed by principal components analysis (PCA) to identify patterns of covariation in the data. Major directions of variation were compared and visualized using GM methods and software.

2.4. Software. The above analyses were carried out using a combination of standard statistical software, existing morphometrics software, and new routines developed specifically for analyzing the data used in the study. All standard statistical analyses, such as PCA, were carried out in the open source R package [14]. The matrix capabilities of R were also used for some custom data manipulation and testing.
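Returning to the GPA procedure of Section 2.2, a compact sketch of the superimposition with the size restoration described there is given below. It assumes the configurations are stored as an (n_subjects, n_landmarks, 3) array; the SVD-based orthogonal Procrustes rotation and the simple convergence test are standard choices rather than a description of the INTEGRATE or R routines used in the study, and reflections are not explicitly excluded.

```python
import numpy as np

def gpa_with_size_restoration(configs, tol=1e-8, max_iters=100):
    """Generalized Procrustes Analysis plus restoration of centroid size."""
    X = np.asarray(configs, dtype=float).copy()
    X -= X.mean(axis=1, keepdims=True)                     # mean-center each configuration
    sizes = np.sqrt((X ** 2).sum(axis=(1, 2), keepdims=True))
    X /= sizes                                             # scale to unit centroid size
    mean = X[0].copy()                                     # arbitrary initial reference
    for _ in range(max_iters):
        for i in range(X.shape[0]):
            # least-squares rotation of specimen i onto the current mean
            u, _, vt = np.linalg.svd(X[i].T @ mean)
            X[i] = X[i] @ (u @ vt)
        new_mean = X.mean(axis=0)
        new_mean /= np.sqrt((new_mean ** 2).sum())         # rescale the estimated mean
        if np.abs(new_mean - mean).max() < tol:
            break
        mean = new_mean
    shape_vars = X                                         # superimposed shape variables
    form_vars = X * sizes                                  # scale restored: "form" data
    return shape_vars, form_vars, sizes.squeeze()
```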
Figure 1: Location and identification of the 26 landmarks used for the PCA: Tragion (1, 19), Zygion (2, 17), Gonion (3, 18), Frontotemporale (4, 15), Zygofrontale (5, 16), Infraorbitale (6, 14), Glabella (7), Sellion (8), Pronasale (9), Subnasale (10), Menton (11), Chelion (12, 13), Pupil (20, 21), Nasal root point (22, 23), Alare (24, 25), and Chin (26). (a) Frontal view; (b) lateral view.

Where possible, new Java-based, cross-platform programs (m vis and the new version of Morpheus et al.), currently under development by one of the authors (Slice), were used for visualization and data manipulation and analysis. A number of new routines were added to these programs to facilitate the current study. When morphometric-specific visualization or analytical routines were not available in the most recent versions of this software, an older Microsoft Windows version of Morpheus et al., written in C++, was used [15].

3. Results

Figure 2 shows the data set extracted from the GPA with size restoration as described above. Each cluster of symbols represents the scatter of individual landmark locations for the 947 individuals in this data set. The coordinates of the 26 landmarks per individual represented in Figure 2 are a slightly redundant set of 78 (26 points × 3 coordinates per point) form variables that characterize the size and shape of individual faces within a coordinate system common to all.

Principal component analysis of the 947 superimposed configurations in the space of the 78 form variables showed a substantial proportion of the total sample variability in the first four PCs (26%, 10%, 8%, and 5%, resp.). The variance on PCs beyond the third (all 5% or less of the total) trails off gradually, suggesting no strong patterns of intercorrelation amongst the variables. Nonetheless, the first two PCs together represent only 36% of total sample variability and the first four only 49%. In fact, it requires the first 27 PCs as a group to account for 90% of total sample variation. This suggests that the bivariate approach used in constructing fit panels may be ignoring a substantial and important aspect of total sample variability.

The eigenvectors for each PC are used to multiply the superimposed coordinates to obtain the scores for each PC.

Table 1: Eigenvectors from Principal Component Analysis.

Face Dimensions PC1 PC2 PC3 PC4


x −0.092402 −0.09838 −0.082324 0.124948
Right Tragion y 0.055546 −0.03336 −0.000476 −0.046974
z −0.226846 −0.015639 −0.081106 0.103059
x −0.105446 −0.037155 −0.250892 0.06538
Right Bizigomatic y 0.058617 −0.021507 −0.008662 0.046299
z −0.264815 −0.218134 0.230527 0.146395
x −0.175016 −0.170624 −0.077249 −0.010609
Right Bigonion y −0.153711 0.283158 0.025585 0.251424
z −0.236316 −0.052436 −0.174264 −0.223689
x −0.087244 −0.058213 −0.171941 0.181386
Right Frontotemporale y 0.185991 −0.083649 −0.231942 0.088256
z −0.065722 0.041468 −0.118217 −0.008149
x −0.124784 −0.061425 −0.131633 0.128663
Right Zygofrontale y 0.099636 −0.079601 −0.122976 0.12982
z −0.049474 0.035962 −0.217425 −0.005845
x −0.155163 −0.153839 0.1021 −0.008096
Right Infraorbitale y 0.002082 −0.081024 0.071887 −0.030466
z −0.084867 −0.135606 0.177772 0.031694
x −0.031366 −0.009799 0.004402 0.088434
Glabella y 0.07538 −0.246936 0.075979 0.015563
z 0.065039 0.032831 −0.023948 −0.165925
x −0.024137 0.000893 −0.000995 0.071671
Sellion y 0.054995 −0.202543 0.014098 −0.056409
z 0.046018 0.007447 −0.028295 −0.150108
x −0.048801 0.006432 0.009248 0.016389
Pronasale y −0.021646 0.023393 0.066119 0.015546
z 0.098337 −0.077382 −0.050498 −0.070001
x −0.041782 0.003084 0.003778 −0.032842
Subnasale y −0.03442 0.021448 0.078113 −0.003448
z 0.090048 −0.049884 −0.013582 0.052307
x −0.026572 −0.075211 0.009377 −0.04029
Menton y −0.239456 0.282161 0.064054 −0.137788
z 0.072425 0.131381 −0.022598 0.095107
x −0.104235 −0.011205 −0.065256 −0.122249
Right Chelion y −0.122439 0.115036 0.072276 −0.114041
z 0.062885 0.044391 −0.022735 0.257529
x 0.025779 −0.027681 0.030278 −0.124742
Left Chelion y −0.121982 0.114939 0.085959 −0.114223
z 0.127491 0.038363 0.030083 0.256637
x 0.153162 0.204991 −0.155992 −0.025605
Left Infraorbitale y 0.011816 −0.066336 0.069849 −0.028775
z 0.05999 0.024546 0.045091 0.03005
x 0.116605 0.011214 0.243554 −0.059974
Left Frontotemporale y 0.184285 −0.067633 −0.286911 0.050317
z 0.021191 0.081583 0.026568 −0.150199
x 0.119144 0.015247 0.290324 −0.048237
Left Zygofrontale y 0.104492 −0.039884 −0.209893 0.097919
z 0.066023 0.07948 −0.091059 −0.117063

Table 1: Continued.
Face Dimensions PC1 PC2 PC3 PC4
x 0.267026 0.194492 −0.025595 −0.159568
Left Bizigomatic y 0.041141 −0.027079 0.028875 0.100681
z −0.100136 −0.100385 0.361976 0.051431
x 0.284425 0.16777 0.239609 0.221905
Left Gonion y −0.155504 0.305318 0.073897 0.35161
z −0.012303 0.053004 −0.081593 −0.188544
x 0.234216 0.087168 0.122075 −0.178044
Left Tragion y 0.051841 −0.041882 0.015991 −0.093947
z −0.099114 0.04377 0.018098 0.007539
x −0.105114 −0.025959 −0.076164 −0.00666
Right Interpupilary y 0.035167 −0.102977 0.002433 −0.063109
z 0.015465 −0.006616 −0.017302 0.009583
x 0.064232 0.051679 0.046711 −0.042373
Left Interpupilary y 0.036144 −0.100849 0.006195 −0.081916
z 0.085829 0.009695 0.04579 0.009617
x −0.04234 −0.026263 −0.028729 0.047738
Right Nasal Root y 0.044661 −0.149109 −0.007595 −0.078473
z 0.0254 −0.038598 0.029479 −0.071444
x 0.00875 0.056566 −0.014386 0.024296
Left Nasal Root y 0.042245 −0.149522 −0.005162 −0.072066
z 0.052601 0.006138 0.0221 −0.097465
x −0.104936 0.024283 −0.057818 −0.11649
Right Alare y −0.018704 −0.011104 0.069058 −0.012788
z 0.038854 −0.053892 −0.011676 0.017082
x 0.04214 0.012693 0.015318 0.033131
Left Alare y −0.024419 −0.021711 0.089439 −0.007221
z 0.107528 −0.043892 0.034216 0.090584
x −0.046142 −0.080757 0.022201 −0.028164
Chin y −0.191757 0.381253 −0.03619 −0.205793
z 0.104467 0.162404 −0.067402 0.089818

The first principal component score is calculated as follows: PC1 = −0.092402 ∗ (X coordinate for Right Tragion) + 0.055546 ∗ (Y coordinate for Right Tragion) − 0.226846 ∗ (Z coordinate for Right Tragion) · · · − 0.046142 ∗ (X coordinate for Chin) − 0.191757 ∗ (Y coordinate for Chin) + 0.104467 ∗ (Z coordinate for Chin). In the above equation, only the right tragion and chin terms are shown; the other 24 landmarks are omitted. The eigenvectors for the x, y, z coordinates of the 26 landmarks are shown in Table 1 for PC1–PC4. The superimposed coordinates are not provided due to limited space.

The projections of the form data for the 947 individual configurations onto PCs 1 through 4 are shown in Figure 3. Each point represents a linear combination of the 78 coordinates for a single subject. By design, most of the scatter in the data is along the first PC and somewhat less along the second. Variation on higher PCs is reduced but nonetheless substantial, suggesting that future research should examine this more closely and also examine its relationship to respirator fit and function.

The above is a standard analysis and plotting approach for multivariate data, but since this analysis has been driven by the principles of GM that maintain the relationships in physical space amongst the variables, these results can be used to construct hypothetical configurations of landmarks representing arbitrary points in the space of principal components. Since PCA is based on mean-centered data, and the PCs, themselves, are linear combinations of the original coordinate variables, one can construct the configuration at a specific point in PC space by simply multiplying the coefficients for the linear combination of coordinates represented by each PC of interest by the coordinate of the point of interest on that PC. The most common use of this technique is to generate visualizations of the patterns of variation captured by particular PCs. Landmark configurations representing patterns of variation along PC1 magnified by a factor of 100 are shown in Figure 4. The coefficients are scaled so that the sum of their squares equals 1.0; hence, with 78 coefficients, these are small numbers that require considerable magnification.
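The score computation and the reconstruction of a configuration at a chosen point along a PC amount to a few matrix operations. The sketch below assumes the 78 form variables are ordered (x, y, z) per landmark as in Table 1 and that the eigenvectors are stored as columns; it is an illustrative sketch, not the R or Morpheus et al. code used in the study.

```python
import numpy as np

def pc_scores(form_vars, eigvecs):
    """Scores of each subject on each PC: (n_subjects x 78) times (78 x n_pcs).
    Row i, column k is the sum over the 78 coordinates of the eigenvector
    coefficients times subject i's (mean-centered) coordinates, as in the
    PC1 expression above."""
    centered = form_vars - form_vars.mean(axis=0)
    return centered @ eigvecs

def configuration_along_pc(mean_config, eigvec, score, magnify=1.0):
    """Hypothetical 26 x 3 landmark configuration at a given score along one
    PC, optionally magnified (e.g., by 100 as in Figure 4) for display."""
    return mean_config + magnify * score * eigvec.reshape(-1, 3)
```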
Figure 2: Screen shot from the Morpheus et al. software illustrating a frontal scatter plot of the 947 individuals for each of the 26 landmarks in the data set after GPA and size restoration. The coordinates of the 26 landmarks represented for each specimen are form variables that describe the shape and size of each specimen.

The pattern of variation specified by PC1 and shown graphically in Figure 4 shows a general movement of landmarks away from their joint center of gravity in the positive direction along the PC. Shape change in the negative direction is, of course, the complement of this, with landmarks all moving more-or-less toward the configuration's center at approximately the same rate (distance per unit change along the axis). It is important to note that the polarity of these axes is arbitrary, and positive and negative directions can be exchanged without impacting the variance of the projections, which is the only criterion by which they are constructed.

Such a pattern clearly represents an overall increase or diminution of the configuration, as results from isometric size change. Indeed, the correlation of the scores of individuals on this axis with their centroid size is 0.99 (Pearson's product moment correlation; Kendall's tau = 0.92). Such a result indicates that overall size is an important component of facial variability in the studied population and is likely an important component of respirator fit assessment, but it would not be captured by a standard GM analysis that focuses more on pure shape change. The relatively low proportion of variation (0.26) suggests, however, that size is not the only important consideration.

Figure 5 shows a visualization of facial change in the positive direction of PC2. As before, what represents positive versus negative change along this axis is arbitrary, and the negative change in this representation would simply be the reflection of the displacements shown in red along their own axes.

There is a general tendency for landmarks to be displaced medially. The landmarks associated with the upper part of the face, especially those of the eyes and the bridge of the nose, tend to be displaced upwards. Those associated with the lower face (the corners of the mouth, the angle of the jaw, and the chin) tend to be displaced downwards. This has a relatively simple interpretation: individuals with more positive scores on this axis have relatively narrower and longer faces. Conversely, individuals with more negative scores would have shorter, wider faces. Given the high correlation of the first PC with size, it is not surprising that there is a low association between size and this axis (Pearson's product-moment correlation = 0.09, Kendall's tau = 0.05). This represents independence between overall facial shape (long/narrow versus short/wide) and facial size. In traditional biological terms, this is an indication of a lack of "allometry." Furthermore, this result means that simple concepts of small, medium, and large with respect to respirators cannot capture much of this component of variation.

Together, PC1 and PC2 represent 36% of total variation in the data. Thus, even higher PCs may represent patterns of variation that are important components in the general workforce population.

Figure 6 shows the pattern of variation specified by PC3, accounting for about 8% of the total variation.
Figure 3: Projections of the 947 landmark configurations onto PCs 1 and 2 (a) and PCs 3 and 4 (b).

The pattern here is more complicated and less easy to summarize than those on lower PCs. Important features in the positive direction appear to be a relative lateral displacement of the centers of the pupils and a larger lateral displacement of the landmarks associated with the frontal bone, sides of the head, and angles of the jaw (gonion). In contrast, the landmarks defining the tip and sides of the nose and the corners of the mouth are displaced upwards. In seeming contrast, the right and left infraorbitale appear medially displaced. In lateral view, gonion, frontotemporale, and zygofrontale are displaced posteriorly while zygion shifts anteriorly. This pattern defies simple description, though the nose and mouth do appear to shift superiorly relative to the rest of the face, while the face, itself, appears to widen. Projections on more negative values of this axis, of course, are represented by the complement of these changes.

The pattern specified by PC4 (Figure 7), though accounting for only 5% of the total variation, is somewhat more clear. The pupils, nasal root points, the corners of the mouth, and the chin landmarks are shifted inferiorly, while gonion is shifted superiorly. Tragion, zygion, frontotemporale, and zygofrontale are shifted medially, and the alare are displaced laterally in frontal view. In lateral view, gonion and the landmarks of the nasal bridge and orbital rim are shifted posteriorly, while the mouth, chin, tragion, and zygion are shifted anteriorly. Configurations projected to more negative scores along this axis manifest the complement of these changes. In general, there is an impression that this component might represent variation in the degree of ortho/prognathism, with positively scoring individuals having longer, wider, and more projecting lower jaws than negatively scoring individuals.

4. Discussion

The comprehensive assessment of morphological variation in users may contribute to understanding how differences in facial form can affect the fit and efficacy of commercial respirators. Such knowledge should facilitate the optimal design of these products and inform the development of standards and protocols by which such devices are evaluated and certified. Recent advances in the quantitative analysis of anatomical variation, called geometric morphometric methods, have the potential to provide more powerful and complete descriptions of morphological diversity in a target population than the traditional anthropometric measurements upon which current respirator standards are based. Furthermore, it is important that emerging standards be reflective of an ever-changing workforce that is not likely represented by the military-based standards currently used [5].

The data were carefully checked visually and statistically for incorrect data coding, erroneous values, and other problems that could compromise their use in characterizing relevant morphological variation. Where possible, data coding problems were repaired and erroneous values were marked as missing, and a conservative mean-substitution approach was used to impute the coordinate locations. The result was a final, clean data set of 947 individuals for which coordinates for 26 anatomical landmarks were available (either recorded or imputed) for all subjects.

Principal components analysis of variation in the form (size + shape) variables of the data revealed that approximately 26% of total sample variance could be expressed as a single linear combination of the original variables, PC1. Since this analysis was based on GM methods, the coefficients for this combination could be used to visualize the nature of the captured variation in the physical space of the face. Inspection of the results revealed that the first PC reflected largely isometric size variation. That is, variation in the overall size of faces in the population was the single greatest source of variability within the studied group.
Figure 4: Visualization of PC1. (a) and (c) show the frontal view of the transformation determined by the first PC; (b) and (d) are the same for the right lateral view. The top row is in the positive direction; the bottom row is the negative. Green circles represent the average location of landmarks in the entire, superimposed data set. The black lines are links to aid visualization. The red line segments represent the coefficients for each coordinate of each landmark specified by the first PC, magnified by a factor of 100 to emphasize the pattern of variation. That is, the red lines indicate the path (direction and relative magnitude) of the landmarks as they change location and move along the specified PC in the indicated direction. The ends of the line segments in the images on the top row indicate the positions of the landmarks at a point 100 units out in the positive direction on PC1. The bottom row is the same for 100 units in the negative direction.

While expressing the greatest amount of variation, PC1 does not express most of the variation in the sample, and higher PCs may be important in respirator fit research. Visualization of PC2 (expressing about 10% of sample variation) revealed a contrast between longer, narrower, shallower heads/faces versus shorter, wider, deeper heads/faces that is statistically independent of overall head size. These results for PC1 and PC2 are consistent with the results reported by Zhuang et al., who performed a PCA using 10 linear dimensions related to respirator fit [5]. More complex, but still interpretable and potentially relevant, variation was identified on PC3 (∼8% of sample variation) and PC4 (∼5%).

After analysis, concerns were raised about splits in some of the heads that are the result of movement during the scan. A review of all the scans revealed 109 scans with a split greater than 4 mm. These scans were removed from the PCA so that it could be reanalyzed to see if these aberrations due to movement impacted the results. The resultant PCA showed no statistical difference when compared to the original. Because of this, the information from all heads was retained.

Further study will investigate the correlation between respirator fit and these PCs. This will be done via regression of shape-coordinate and ancillary anthropometric data onto respirator fit measures for 30 test subjects. The result will be a statistical summary and visualization of the components of facial variation most associated with respirator fit. Also, the residuals from the landmark-traditional comparison would be assessed for significant association with the respirator fit data. A significant result may indicate important information captured by the coordinate analysis and missed by the traditional measurements.

The NIOSH anthropometric survey data, respirator fit test panels, and digital 3D headforms have been incorporated into national and international respiratory protection standards [4, 5, 16]. Products certified under these standards are used to protect against chemical, biological, radiological, and nuclear agents for fire fighters and emergency responders. They are also used to protect hospital workers and air travelers from H1N1 exposures. If the PC scores are highly correlated to respirator fit, the proposed method in this paper will be applied to develop respirator fit test panels and
EURASIP Journal on Advances in Signal Processing 9

(a)
(a) (b)

Figure 5: Visualization of PC2. (a) is frontal view. (b) is lateral view.


These are positive-direction displacements. Negative direction is the
reflection of all of the red vectors along their own axes.

(b)

Figure 7: Visualization of PC4. (a) is frontal view. (b) is lateral view.


These are positive-direction displacements. Negative direction is the
reflection of the red vectors along their own axes.
(a)

function of commercial respirators and devising new test


and certification standards. A significant amount of this
variation is contained in the first few PCs, but a substantial
portion remains that could be important in respirator fit.
Principal component analysis is not designed to optimize or
take into account the results of the respirator fit testing. The
relationship between this measure and the results reported
here will be the subject of subsequent analyses.

(b) Disclaimer
Figure 6: Visualization of PC3. (a) is frontal view. (b) is lateral view. The findings and conclusions in this report are those of
These are positive displacements. Negative direction is the reflection the authors and do not necessarily represent the views of
of the red vectors along their own axes. the National Institute for Occupational Safety and Health.
Mention of commercial product or trade name does not
constitute endorsement by the National Institute for Occu-
digital headforms which will in turn be applicable to defense pational Safety and Health.
and security.
Acknowledgment
5. Conclusions
Ms. S. Lynch performed this research while holding a
In all, these analyses show that the GM-based approach National Research Council Resident Research Associateship
to morphological variation provides a detailed and inter- at the National Institute for Occupational Safety and Health
pretable assessment of morphological variation in the pro- (NIOSH), National Personal Protective Technology Labora-
vided sample that should be very useful in assessing the tory (NPPTL).
10 EURASIP Journal on Advances in Signal Processing

References
[1] A. L. Hack, E. C. Hyatt, B. J. Held, T. D. Moore, C. P.
Richards, and J. T. McConville, Selection of Respirator Test
Panels Representative of U.S. Adult Facial Sizes, Los Alamos
Scientific Laboratory, Los Alamos, NM, USA, 1974.
[2] A. L. Hack and J. T. McConville, “Respirator protection
factors—part I: development of an anthropometric test panel,”
American Industrial Hygiene Association Journal, vol. 39, no.
12, pp. 970–975, 1978.
[3] Z. Zhuang, J. Guan, H. Hsiao, and B. Bradtmiller, “Evaluating
the representativeness of the LANL respirator fit test panels for
the current U.S. civilian workers,” Journal of the International
Society for Respiratory Protection, vol. 21, pp. 83–93, 2004.
[4] Z. Zhuang and B. Bradtmiller, “Head-and-face anthropomet-
ric survey of U.S. respirator users,” Journal of Occupational and
Environmental Hygiene, vol. 2, no. 11, pp. 567–576, 2005.
[5] Z. Zhuang, B. Bradtmiller, and R. E. Shaffer, “New respirator
fit test panels representing the current U.S. civilian work
force,” Journal of Occupational and Environmental Hygiene,
vol. 4, no. 9, pp. 647–659, 2007.
[6] D. E. Slice, “Modern morphometrics,” in Modern Morpho-
metrics in Physical Anthropology, D. E. Slice, Ed., Kluwer
Academic/Plenum Publishers, New York, NY, USA, 2005.
[7] D. E. Slice, “Geometric morphometrics,” Annual Review of
Anthropology, vol. 36, pp. 261–281, 2007.
[8] J. C. Bailar III, E. A. Meyer, and R. Pool, Eds., Assessment
of the NIOSH Head-and-Face Anthropometric Survey of U.
S. Respirator Users, Institute of Medicine of the National
Academies. National Academies Press, Washington, DC, USA,
2007.
[9] D. Burnsides, P. M. Files, and J. J. Whitestone, “INTEGRATE
1.25: a prototype for evaluating three-dimensional visualiza-
tion, analysis, and manipulation functionality,” Tech. Rep.
AL/CF-TR-1996-0095, Crew Systems Directorate, Human
Engineering Division, Wright-Patterson AFB, Dayton, Ohio,
USA, 1996.
[10] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing
Data, John Wiley & Sons, New York, NY, USA, 1987.
[11] J. C. Gower, “Generalized procrustes analysis,” Psychometrika,
vol. 40, no. 1, pp. 33–51, 1975.
[12] F. J. Rohlf and D. E. Slice, “Extensions of the Procrustes
method for the optimal superimposition of landmarks,”
Systematic Zoology, vol. 39, pp. 40–59, 1990.
[13] F. L. Bookstein, Morphometric Tools for Landmark Data:
Geometry and Biology, Cambridge University Press, New York,
NY, USA, 1991.
[14] R Development Core Team, “R: a language and environment
for statistical computing,” R Foundation for Statistical Com-
puting, Vienna, Austria, 2007, http://www.R-project.org/.
[15] D. E. Slice, Morpheus et al.: Software for Morphometric
Research. Revision 01-31-00, Department of Ecology and
Evolution, State University of New York, Stony Brook, NY,
USA, 1998.
[16] Z. Zhuang, S. Benson, and D. J. Viscusi, “Digital 3-D
headforms with facial features representative of the current
U.S. work force,” Ergonomics, vol. 53, no. 5, 2010.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 468329, 14 pages
doi:10.1155/2010/468329

Research Article
A Conditional Entropy-Based Independent Component Analysis
for Applications in Human Detection and Tracking

Chin-Teng Lin,1 Linda Siana,1 Yu-Wen Shou,2 and Tzu-Kuei Shen1


1 Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan
2 Department of Computer and Communication Engineering, China University of Technology, Hsinchu 303, Taiwan

Correspondence should be addressed to Yu-Wen Shou, [email protected]

Received 1 December 2009; Revised 11 February 2010; Accepted 12 April 2010

Academic Editor: Yingzi Du

Copyright © 2010 Chin-Teng Lin et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present in this paper a modified independent component analysis (mICA) based on the conditional entropy to discriminate
unsorted independent components. We make use of the conditional entropy to select an appropriate subset of the ICA features
with superior capability in classification and apply support vector machine (SVM) to recognizing patterns of human and
nonhuman. Moreover, we use the models of background images based on Gaussian mixture model (GMM) to handle images with
complicated backgrounds. Also, the color-based shadow elimination and head models in ellipse shapes are combined to improve
the performance of moving objects extraction and recognition in our system. Our proposed tracking mechanism monitors the
movement of humans, animals, or vehicles within a surveillance area and keeps tracking the moving pedestrians by using the
color information in HSV domain. Our tracking mechanism uses the Kalman filter to predict locations of moving objects for the
conditions in lack of color information of detected objects. Finally, our experimental results show that our proposed approach can
perform well for real-time applications in both indoor and outdoor environments.

1. Introduction from backgrounds and identified the extracted objects by


neural networks. Although the stereo-based vision technique
Video-based human detection and tracking has been a pop- has been proved to be more robust, it required at least
ular research area and widely applied in various applications two cameras and could be used only for the short distance
such as homecare, security, and patient monitoring. With detection. Orrite Uruñuela et al. [2] used multiple cameras
the increasing criminal rate, the development of automatic to analyze 3D skeletal structure in gait sequences and 3D
visual surveillance with computer visions has attracted more skeletons to extract human body shapes completely and
and more researchers’ attentions. Therefore, the ability constructed the point distribution model (PDM) by using
to distinguish people from other moving objects such as Principal Component Analysis (PCA). Jiang et al. [3] used
animals or vehicles has become an important issue for the background subtraction method to segment an isolate
tracking targets and analyzing their behaviors. human and took advantage of the homogeneous properties
Human detection system could be divided into two parts, of shadows and background objects to reduce the shadowing
segmentation of the moving objects from backgrounds and effects. An area threshold was also used to avoid a sudden
discrimination of humans from nonhuman objects. There change of light and interfering the results of moving object
have been several methods for segmenting moving objects extraction by illumines. Tian and Hampapur [4] combined
from backgrounds such as the optical flow, stereo-based the background subtraction and optical flow methods to
vision, and temporal difference method. The optical flow locate the motion area and to remove the false foreground
method could succeed in detecting independent moving pixels. They modeled the background image as Gaussian
objects, but would be more computational and sensitive distributions to adapt to the gradual change of light by
to the change of intensity. Zhao and Thorpe [1] exploited recursively updating the arguments of models with an
the stereo based segmentation algorithm to extract objects adaptive filter. However, this basic model would sometimes
2 EURASIP Journal on Advances in Signal Processing

Image frames Background


LPF GMM Dilation
substraction
Moving object
Human Fitting Ellipse Shadow
Tracking SVM Modified ICA Connected
classifier function ellimination component

Moving object extraction

Figure 1: System architecture.

fail to handle complicated backgrounds such as water wave an appropriate subset of ICA features. Sorting variables
and tree shaking. Stauffer and Grimson [5] constructed a may be an important step to enhance the high-dimensional
mixture of Gaussian model by modeling each pixel as a dataset, which gave us the idea to place correlated or
mixture of Gaussians and using an online approximation to similar dimensions close to each other in high-dimensional
update the extracted backgrounds. Our proposed real-time visual space to help human users perceive relationships
system firstly used a simpler way to segment moving objects among those variables easier [11]. The remainder of this
to reduce the time complexity, and applied Gaussian mixture paper could be organized as follows. Section 2 described
model (GMM) to constructing a dynamic background model the moving object extraction, including shadow elimination
as to handle dynamic backgrounds or unstable illumination and occlusion handling. Section 3 introduced the modified
in images. ICA. Section 4 described the color-based tracking method.
After moving objects have been segmented, the next pro- Section 5 showed the experimental results. We finally sum-
cess would be human recognition. There have been several marized discussions and conclusions in Section 6.
kinds of methods for human recognition like shape-based,
motion-based, and multicue-based ones. Zhou and Hoang 2. Moving Object Extraction
[6] used the shape information of human bodies to construct
a codebook and to tell human beings from other objects. The architecture of our moving object extraction was indi-
This method obviously would work well if the extracted cated in the dotted-line block of Figure 1 and the remained
human shape was obvious. However, this shape-based would blocks represented our processes in human feature extraction
usually fail for the cases of partially occluded humans or and classification. For the moving object extraction, we
the detected humans carrying something. Histograms of used the background subtraction method in order to meet
Oriented Gradients (HOG) [7, 8], the algorithms based on the real-time acquirements. Besides, we built up a dynamic
Fast Fourier Transform, extracted features from the shape background model based on GMM algorithm to deal with
information. Curio et al. [9] carried out the detection process more complicated backgrounds. Our background model was
based on the geometrical features of human at the first constructed by using three different Gaussian distributions.
step, and then used motion patterns of limb movements We in this paper took the difference of luminance in images
to determine the initial hypotheses of objects. Yoon and since human eyes would be more sensitive to luminance than
Kim [10] made use of the robust skin color, background chrominance. The difference DI for each pixel (x, y) could be
subtraction, and human upper body appearance information calculated by
to classify human or other objects with similar skin color   2    2
regions. For the approaches based on neural networks for DI x, y = 2Ic x, y − Ib x, y 2, (1)
human identification [11], used the back-propagation model
to recognize the pedestrians, to analyze the shape of object, where Ic and Ib denote the luminance of the current
and to classify human beings from other objects. Mostly, and background image, respectively. Practically, the moving
researchers have focused on the issue of feature extraction objects would have larger variances than the background, so
but paid much less attention to the field of feature selection. the determined threshold was set by the variance of each
In this paper, we presented a modified ICA approach based Gaussian background model and the possible foreground
on conditional entropy. In the recent years, ICA has been image PFI could be described in the following equation
applied to human feature extraction for constructing a ⎧    
  ⎨1 if DI x, y ≥ 3σ x, y ,
sufficient set of features describing human beings. ICA is PFI x, y = ⎩ (2)
   
a high-order statistical analysis method, and can be usually 0 if DI x, y < 3σ x, y .
regarded as an extension of PCA, which addresses only the
second-order statistical arguments. Unlike PCA features, the Each Gaussian distribution N ∼ (μ, σ) could adapt to
ICA features are not sorted, thus the conditional entropy is the gradual change of light by recursively updating each
applied to feature selection, the sorting process, and choosing pixel over time. In the practical conditions, the captured
EURASIP Journal on Advances in Signal Processing 3

background might be in gray scale or in an edge map. Both Histogram analysis


150
of the background types had their individual advantages and
disadvantages. The background image in gray scale might Threshold line
take longer in the updating process than that in the edge map,
but could model the background in more details. Relatively,
the edge-type background was less sensitive than the gray-

Accumulate number
100
type one, and would be more suitable for noisy images
or environments with unstable intensities. For the strategic
design in modeling a background in this paper, the Gaussian Non-shadow
low-pass filter would be carried out in the consecutive input
frames before processed at the GMM stage so as to reduce the 50
Shadow
influences of noises and disturbances.

2.1. Color-Based Shadow Elimination. Our color based


shadow elimination are based on RGB-color channels. It 0
0 5 10 15 20 25
can be easily observed that the luminance of shadow pixels
is lower than that of the corresponding pixels in the Ω value
background image. Thus, if we denote ICF and IB the intensity Figure 2: The distribution of POI pixels.
of current frame and the background image, respectively, the
pixel (x, y) satisfying (3) may be in the shadowed region
    r and g denote the spectral ratios of R-B and G-B, respec-
ICF x, y < IB x, y . (3)
tively. As what we have observed, the shadow on the back-
Some other observed characteristics of shadows can be ground pixel may result in a bigger change of brightness than
arranged as follows. First, the texture of shadows like color. Assume that the color of illumination may not change
edge would have a smaller fluctuation than that of the with the effect of shadows, thus the spectral ratio r(x, y) is
corresponding pixels in the background image. Similarly, the invariant to the magnitude of illumination. Similarly, the
chromaticity value of shadows would have a slighter change spectral ratio g(x, y) is invariant under shadows or different
than that of the corresponding pixels in the background conditions of illumination. Thus, pixel (x, y) is in the shadow
image. These observations are described in region if both the current and background spectral ratios are
    the same. The error of spectral ratios can be computed by
ICF x, y I x, y Θ(x, y) defined in
Between-pixel invariant −→   = B ,
ICF x + 1, y IB x + 1, y
  2    2 2    2
    Θ x, y = 2rCF x, y − rB x, y 2 + 2gCF x, y − gB x, y 2.
  I x, y   I x, y
dh x, y = ln  , dv x, y = ln  , (7)
I x + 1, y I x, y + 1
(4) The total error in discriminating (x, y) from shadows is
described in
where I(x, y)/I(x + 1, y) is the ratio between pixel (x, y)
and its neighboring pixel (x + 1, y), dh (x, y), and dv (x, y)      
Ω x, y = α · Ψ x, y + (1 − α) · Θ x, y , (8)
denote the ratio maps which can keep the texture- and
edge-information without the interferences of shadows. We where α denotes the weighting parameter. Finally, a thresh-
will consider the pixel (x, y) in the shadow region if its olding operation will be applied on Ω(x, y) to determine
ratio map is similar to that of the background pixel. The whether the pixel (x, y) belongs to the shadow or foreground
error in discriminating the pixel (x, y) from shadows can be object.
calculated by The distribution of pixels of the possible object image
   2    2 (POI) which contains both of the moving object and
Ψ x, y = 2dCF,h i, j − dB,h i, j 2
shadowed pixels is shown in Figure 2. Figure 2 illustrates
(i, j )∈W (5) a smaller distribution for the shadowed region than the
2    2 extracted region of moving objects. We hence take advantage
+ 2dCF,v i, j − dB,v i, j 2,
of this Ω(x, y) observation to determine a threshold for
where Ψ(x, y) denotes the sum of difference of the ratio map discriminating the shadowed regions. The threshold value
in a small neighborhood window W with the center at (x, y) decides if a pixel (x, y) is in a shadowed region, and can be
  denoted in
    R x, y
Within-pixel invariant −→ rCF x, y = rB x, y = ln   ,
B x, y Ths = μPO − β · σPO , (9)
 
    G x, y where β is a weighting value, μPO and σPO are the mean and
gCF x, y = gB x, y = ln   ,
B x, y standard deviation of POI, respectively. And the region of
(6) shadow image SI would be described in (10). To enhance
4 EURASIP Journal on Advances in Signal Processing

∗ ∗ ∗ ∗∗∗∗∗∗∗∗∗∗∗ ∗∗ ∗
∗ ∗ ∗ ∗∗∗∗∗∗∗∗∗∗∗ ∗∗ ∗
∗ ∗ ∗ ∗∗ ∗∗ ∗∗ ∗
∗ ∗ ∗ ∗ ∗ ∗∗ ∗
∗ ∗ ∗ ∗∗ ∗
∗ ∗ ∗∗ ∗
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗
∗ ∗ (0,0) ∗ ∗
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗
(a) ∗ ∗ ∗ ∗∗ ∗

(a)

(b)

(b)

Figure 4: (a) The ellipse head model. (b) The pyramid down
sampling process.

(c)
as only the human bodies are occluded we can use the head
Figure 3: The results in shadow elimination, (a) the original image, information to overcome the occluded problem. If heads are
(b) the extracted object before shadow elimination, (c) the extracted partially or fully occluded with each other, then our ellipse
object after shadow elimination. head model will find the head with the best match.
The proposed head model is shown in Figure 4(a) where
the dot “•” represents the pixels of a head, the star “∗”
the results by shadow removal, we have the results during the represents the pixels of background, and the point (0, 0) is the
process of shadow elimination in Figure 3 center of the ellipse. The process in down sampling is applied
⎧     to fit the ellipse model in different sizes of a moving object.

⎪ 1 if IPO x, y > IB x, y , By setting a threshold of the similarity value, we can decide


  ⎨   which point is a possible center of the head. Consequently,
SI x, y = ⎪ Ω x, y < μPO − β · σPO , (10)


there may be more than one center detected in the real head

⎩0, region, which would be illustrated in the group of green
otherwise.
points in Figure 5. Thus, we have to project the original head
region into x-axis and y-axis, and to group these points to
2.2. Occlusion Handling. The moving objects could be
determine the final representative center as shown in blue
detected as a group of people who may move together or may
points of Figure 5. We also show some results in our human
be partially occluded by each other. In this case, the moving
detection mechanism for individual humans in Figure 6.
object extraction system will label the group of people as
one object by connected components. Without separating
the group of people into each individual, the classification 3. Modified ICA Based on Conditional Entropy
process may usually fail to identify the human beings. In
most conditions, however, the heads are usually separate The independent component analysis (ICA) is a statistical
when the human bodies have been occluded. Besides, the method for transforming an observed multidimensional
shape of human heads is almost invariant even though a random vector into components that are statistically inde-
person rotates his head in different phases. Therefore, as long pendent. ICA can be considered as a generalization of
EURASIP Journal on Advances in Signal Processing 5

Center histogram
7
5
3
1
0 10 20 30 40 50 60 70
0.5

1.5

2.5
Possible Final
1

1
2
3
4
5
6
0 center center 0

10 10

20 20

30 30

40 40

50 50

60 60

70 70

80 80

90 90
Object histogram
90
80
70
60
50
40
30
20
10
0
0 10 20 30 40 50 60 70

Figure 5: Projection of head region.

principal component analysis (PCA) with appended inde- which might not be sorted by the creating sort and might not
pendent properties in the second order equations. In the depend on the binary classification.
field of signal processing, ICA can separate the waveform of Let us have m-training images including both humans
the original source from the sensor array without uses of and nonhumans with the size (nr ×nc ). Figure 8 displays the
the characteristics of the source signal. The main purpose bases of our image set. Reshape all the training data into
in this work is to separate the patterns of humans and an N-length vector, and the mixture data X is an m × N
nonhumans. As Figure 7 shows, ICA is a statistic approach matrix. Also, the mixture data x1 , x2 , . . . , xm are the linear
in the higher order and can transform each input image to combination of n independent and zero-mean of the source
the combination of bases. For the two major problems we are signal s1 , s2 , . . . , sn (typically m ≥ n) as described in
confronting, one is how to choose the bases with the higher
capability in classification. The other is how to enhance the
discriminability of independent components between classes x j = h j1 s1 + h j2 s2 + · · · + h jn sn . (11)
6 EURASIP Journal on Advances in Signal Processing

(a) (b)

(c) (d)

Figure 6: The results of separate humans.

depend on the capability of binary classifiers. In Figure 9, the


= u1 × +u2 × + · · · + un × solid line and dashed line indicate the positive and negative
values of ICA coefficients, respectively. If the distribution is
like Figure 9(a), we can separate humans from nonhumans
Figure 7: ICA-image decomposition. by the dotted threshold line in an easier way. Unfortunately,
the information provided by the binary classifier is too
insufficient to select ICA features. Like what is shown in
The matrix H is expressed in terms of the elements hi j , and Figure 9(b), we cannot easily separate the distributions into
it is an unknown full rank (m × N) mixture matrix. Since all two classes by using a threshold line. That is also the
vectors are column vectors and the transpose of X is a row major reason for us to modify the original ICA by using
vector, we can rewrite (11) to (12) by using vector-matrix the information of conditional entropies for selecting the
notations optimal ICA bases in this paper.
If the entropy is the amount of information provided
X = HS. (12) by a random variable, then our conditional entropy can be
defined as the amount of information about one random
Without loss of generality, we assume that both the mixture
variable provided by another random variable. The entropy
variables and independent components have zero mean
of a random variable reflects the more truthful information
and non-Gaussian distributions. For the nonzero mean
of the observed variable. If the variable is more random, it
distributions, the observable variables x j can always be
means unpredictable and unstructured, which may result in
centered by subtracting the sample mean to become the zero
the large entropy value. Figure 10 illustrates how the entropy
mean distributions. If W denotes the inverse of the basis
values are relevant to the distributions of variables. In such a
matrix S, the coefficients matrix U for training matrix XT
case, the higher entropy value in Figure 10(a) reveals that the
will be expressed in
variable Z1 is more random than Z2 .
U = WXT . (13) The 2-D data space obtained from ICA feature extraction
needs to be discretized into a matrix of grid cells by
The n-component base vectors which have the best distin- separating each dimension into a set of intervals or bins.
guishability for detecting humans and nonhumans should be The discretization process begins with calculating the mean
chosen from many candidate components. It can be achieved value of data in one dimension and dividing the data into two
by calculating the ratios of between-class and within-class halves with that mean value. Recursively, each half is divided
variability r for each coefficient, and the largest ratio r implies into halves with its own mean value. The recursion will stop
the best distinguishability. Or the base vectors can be selected when we obtain the required number of intervals or meet the
by using perceptions in neural network. These two methods constraint of total bins. Let a discrete random variable Z be
EURASIP Journal on Advances in Signal Processing 7

Figure 8: The bases of image set.

The PDF of one coefficient The PDF of one coefficient


0.03 0.08
Class 1
0.07
0.025 Class 2
0.06
Probability density

Probability density

0.02
0.05

0.015 0.04

0.03
0.01
0.02
0.005
0.01

0 0
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Coefficient ICA coefficient

Train H Train H
Train NH Train NH
(a) (b)

Figure 9: Different distributions of ICA coefficient in (a) the ideal case and (b) the real case.

H(Z1 ) = 2.98 H(Z2 ) = 1.92

p(Z2 )
p(Z1 )

(a) (b)

Figure 10: The information entropy. (a) Z1 and (b) Z2 .


8 EURASIP Journal on Advances in Signal Processing

Table 1: Results for feature selection and classification.


Feature-classifier Method Number of SV Accuracy (%)
Entropy 895 92.58
Fisher’s criterion 1197 91.24
20 IC-SVM
Neural Network 1198 90.57
Non 2166 84.07
Entropy 825 93.88
Fisher’s criterion 1154 93.21
30 IC-SVM
Neural Network 1137 92.20
Non 1800 89.58

100 4000

Number of SVs
Accuracy (%)

3000
90
2000
80
1000

70 0
10 20 30 40 50 60 70 10 20 30 40 50 60 70
Number of ICA features Number of ICA features

Non NN Non NN
Fisher Entropy Fisher Entropy
(a) (b)

Figure 11: Analysis of feature selection. (a) Accuracy rate. (b) Number of SV.

with possible values {z1 , z2 , . . . , zn }. The information entropy SVM classifier to indentify humans or nonhumans. Table 1
of Z with the probability density p(z) is defined in and Figure 11 showed the comparisons of results by our
conditional entropy based feature selection approach with

n
those by others for feature selection and classification. All the
H(Z) = − p(zi ) log p(zi ). (14)
i=1
comparisons in this paper used the same training and testing
database. The training database consisted of 1843 human
The conditional entropy quantifies the uncertainty of a and 840 nonhuman images. Meanwhile, 3178 human and
random variable Y if given that the value of a second random 2847 nonhuman images were used in the testing database.
variable Z is known. Each coefficient has to be normalized The same ICA algorithm was used for feature extraction
to [−1, 1] and quantized to n bins. Let Y = {−1, 1} be the and SVM in classification, and the only difference for
desired class, then the conditional entropy can be described obtaining reasonable compared results lied in the feature
in selection method. Our feature selection approach was based
     on the conditional entropy, and was compared with Fisher’s
H(Y | Z) = − p y, z log p y | z = H(Y , Z) − H(Z). criterion, neural networks, and without feature selection.
z y
Our used parameters in the comparison process were the
(15) number of support vectors (SV) and the accuracy rate which
was obtained from each method with respect to the number
The conditional entropy (Y |Z) is a weighted sum of the
of independent components. Figure 11 showed the accuracy
entropy values in all columns, where the joint entropy is
rate and the number of SV for all number of ICA features
defined by
and indicated the maximum number of ICA to be 76. We
     chose two subsets of independent components as 20 and
H(Z, Y ) = − p z, y log p z, y . (16)
z y 30 and displayed the accuracy rate and number of SV in
more details in Table 1. Table 1 exhibited that the conditional
We sort the conditional entropy (Y | Z) and use the sorted entropy based approach had the accuracy rate in more than
results to select corresponding independent components. 90% but needed the smallest number of support vectors
The coefficients or independent components with the better (SV). For all approaches, the accuracy rate will increase and
classification ability are associated with the small conditional the corresponding number of SVs will decrease when the
entropy. The selected ICA features will be used in the number of ICA features increases from zero to the specific
EURASIP Journal on Advances in Signal Processing 9

M-target N-target
candidate models
Tracking A

Color Color
histogram histogram
representation representation

Bhattachaya
similarity
measure

Tracking B

Similarity F # target model F Define new


target model
>Thb mismatch

T
T

Update target Prediction by


Kalman filter
model Kalman filter

Position Color histogram


Tracking A

Figure 12: The tracking module.

value. When the number of ICA features increases from the Distribution of HSV color space
0.2
specific value to the maximum number, we can use more SVs 0.18
as to maintain the accuracy rate. 0.16
0.14
0.12
4. Tracking 0.1
0.08
Our tracking module is depicted in Figure 12. The proposed 0.06
tracking system is based on the color appearance model 0.04
0.02
because the color distribution will be typically stable under 0
rotations, scaling, or partial occluded conditions. At the same 0 50 100 150 200 250
time, Kalman filter is applied to calculate and predict new
Figure 13: The PDF of color histograms.
locations of each moving object, and to solve the occlusion
problems which the color models may be invalid with. Let
hist(i) represent the ith bin of total N bins of the color
histogram, and the PDF of target models can be computed of HSV color channel occurs when the saturation value S is
by close to 0. In this condition, the hue H will become quite
noisy. Therefore, in practical applications, the HS-histogram
hist(i) will be used only when S is larger than a threshold value
pi = N . (17)
i=1 hist(i)
0.1. Otherwise, only intensity V-histogram is used, and the
total number of histogram’s bin becomes NH NS + NV . In
Most of the color features is unstable under the change order to reduce the computational time and increasing the
of lightness. The HSV color channel extracts the lightness accuracy of object tracking, we use three of fourth of the
information from the RGB color channel, therefore the original moving object region with the same centers as shown
sensitivity to illumination can be reduced. But the problem in Figure 13. Moreover, Bhattacharya similarity measure is
10 EURASIP Journal on Advances in Signal Processing

Figure 14: The positive database.

Figure 15: The negative database.

applied to compute the similarity value between two PDF, database were acquired by considering various conditions
the target model pi and target candidate qi as shown in and activities such as the detected images contained part of
lateral or frontal human shapes, the detected humans were
  
N
, walking or running, the detected moving object did not have
BC p, q = pi · qi . (18) a complete human shape, and so forth. We also took the cases
i=1
under both indoor and outdoor environments into account
When the target candidate qi and target model pi are similar, and meanwhile some nonhuman targets in complicated
the PDF of target models of moving objects can be updated conditions such like trees, animals, and vehicles were used
by the weighting factor γ in the tracking process, which is in the testing database in this paper. All the image data were
expressed in normalized to the 40 × 40 block size. The normalization
  algorithm used in our work was carried out by comparing
pi = 1 − γ · pi + γ · qi , (19) the width and height of moving-object regions. If the width
of moving-object region was larger than the height, the
where γ = BC/4. moving object would be centralized by shifting horizontally,
otherwise by shifting vertically. We showed several positive
5. Experimental Results and negative images in our database after normalization in
Figures 14 and 15.
Our training database captured from 16 different videos We also listed the compared results in the number
included 1843 positive and 2066 negative data, and the of required features, the accuracy rate, and the detection
database in the testing phase were captured from 18 videos in time by our proposed conditional entropy-based feature
3178 positive and 2847 negative data. The images used in our selection approach with those by others in Table 2 and
EURASIP Journal on Advances in Signal Processing 11

100 2000

97.5 1800

95 1600

Number of SVs
Accuracy (%)

1400
92.5

1200
90

1000
87.5
800
85
10 15 20 25 30 35 40 45 50 55 60 10 15 20 25 30 35 40 45 50 55 60
Number of ICA features Number of ICA features
Non NN Non NN
Fisher Entropy Fisher Entropy
(a) (b)

Figure 16: Analysis of different methods of feature selection in (a) Accuracy rate, and (b) Number of SVs.

Table 2: Comparisons in the computational time.

IC selection method Number of IC Number of SV Accuracy (%) Detection (ms/object)


Entropy 30 825 93.88 1.13
Fisher 30 1157 93.21 1.33
Entropy 40 958 94.51 1.41
Fisher 40 1194 94.40 1.65
NN 40 1028 94.58 1.51

Table 3: Accuracy of human detection system (%).

Training Data Testing Data


Method
Human Nonhuman Human Nonhuman
mICA+SVM 97.72 95.84 94.15 93.57
ICA+cosine 90.87 85.73 90.34 85.49
ICA+SVM 97.55 93.90 93.17 91.13
Codebook 87.95 92.83 90.88 93.68
PCA+BP 99.18 99.46 89.65 94.09

Figure 16. We had 5 videos with a total number of frames, and the back-propagation model in neural networks for
14056, and the computational time indicated in our entropy classification. In the other two approaches, ICA + Cosine and
based method would be 1.13–1.41 miniseconds depending ICA + SVM, the IC-features were determined by calculating
on the number of independent components (IC). With the ratios of between-class and within-class variables r for
the increasing number of IC in human feature extraction, each coefficient and choosing a larger r as the features with
the number of support vectors (SV) would also increase, the better distinguishability. After the features have been
which made the system take longer to detect a human. determined, they used the cosine similarity measurement
Moreover, in Table 3, we compared the accuracy of our and SVM for classification, respectively. Table 3 showed the
mICA+SVM approach with that of some others both in the higher accuracy of our mICA+SVM approach in the training
training and testing data. The codebook matching approach part than all the others except PCA+BP. However, in the
in Table 3 used the human shape as the features, and matched testing part, our mICA+SVM approach demonstrated the
the moving object by the code vectors in the codebook. highest accuracy to identify humans among all the compared
The PCA+BP method used PCA for feature extraction methods.
12 EURASIP Journal on Advances in Signal Processing

Figure 17: The detection results for humans and nonhumans.

Figure 18: The processed results for some occlusion cases.

Figures 17–20 showed the human detection results in kinds of conditions, and our approach would accurately
different conditions, where the white color blocks described detect humans for cases that the humans were running,
the nonhuman moving objects and the blocks in other colors walking in different positions and directions, and could
indicated the moving humans. Figure 17 revealed that the correctly recognize the vehicles, moving tree leaves, or animal
proposed human detection system could work well in many as nonhuman objects. Figure 18 showed our experimental
EURASIP Journal on Advances in Signal Processing 13

Figure 19: The results in human tracking—Environment 1.

Figure 20: The results in human tracking—Environment 2.

results in the occluded cases where people were occluded by experimental results have proved the conditional entropy to
each other or by other objects. Figures 19 and 20 displayed be effective in sorting features with the better classification
the results of human tracking in consecutive frames where ability. The SVM classifier is applied to classify the features
we indicated the number of frames and the label of identified into two classes, humans and nonhumans. The Kalman filter
humans in the lower left and the upper left in each image, and Bhattacharya color similarity measurement are both
respectively. used to predict and track the humans in the consecutive
frames. Our experiments also indicate the higher perfor-
mance in human detection and tracking. Besides, we use
6. Conclusions and Discussions the GMM method which is used to model and update
a background image for moving object segmentation to
The modified ICA approach using conditional entropy has handle the dynamic backgrounds. The color-based shadow
been proposed for human detection in this paper. The elimination algorithm is also implemented in our work to
14 EURASIP Journal on Advances in Signal Processing

supported in part by the National Science Council, Taiwan,


under Contracts NSC 99-3114-E-009 -167 and NSC 98-
2221-E-009-167.

References
[1] L. Zhao and C. E. Thorpe, “Stereo and neural network-
based pedestrian detection,” IEEE Transactions on Intelligent
Transportation Systems, vol. 1, no. 3, pp. 148–154, 2000.
[2] C. Orrite-Uruñueta, J. M. del Rincón, J. E. Herrero-Jaraba, and
(a) G. Rogez, “2D silhouette and 3D skeletal models for human
detection and tracking,” in Proceedings of the 17th International
Conference on Pattern Recognition (ICPR ’04), pp. 244–247,
Cambridge, UK, August 2004.
[3] Z.-L. Jiang, S.-F. Li, and D.-F. Gao, “A time saving method for
human detection in wide angle camera images,” in Proceedings
of the International Conference on Machine Learning and
Cybernetics, pp. 4029–4034, Dalian, China, August 2006.
[4] Y.-L. Tian and A. Hampapur, “Robust salient motion detec-
tion with complex background for real-time video surveil-
lance,” in Proceedings of the IEEE Workshop on Motion and
Video Computing (MOTION ’05), pp. 30–35, Breckenridge,
(b) Colo, USA, August 2005.
[5] C. Stauffer and W. E. L. Grimson, “Adaptive background
mixture models for real-time tracking,” in Proceedings of the
IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR ’99), pp. 246–252, Fort Collins,
Colo, USA, June 1999.
[6] J. Zhou and J. Hoang, “Real time robust human detection and
tracking system,” in Proceedings IEEE Conference on Computer
Vision and Pattern Recognition, pp. 246–252, San Francisco,
Calif, USA, 2005.
[7] R. Polana and R. Nelson, “Detecting activities,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision
(c) and Pattern Recognition, pp. 2–7, New York, NY, USA, June
1993.
Figure 21: The negative examples in human detection. [8] M. Bertozzi, A. Broggi, M. D. Rose, M. Felisa, A. Rakotoma-
monjy, and F. Suard, “A pedestrian detector using histograms
of oriented gradients and a support vector machine classifier,”
in Proceedings of the 10th International IEEE Conference on
reduce the influences of grouping shadows by connected Intelligent Transportation Systems (ITSC ’07), pp. 143–148,
components effectively. In order to make our approach Seattle, Wash, USA, October 2007.
much more practical and perfect, in the near future we [9] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von
Seelen, “Walking pedestrian recognition,” IEEE Transactions
may consider more conditions such as the clothing colors
on Intelligent Transportation Systems, vol. 1, no. 3, pp. 155–163,
of detected humans are close to those of the backgrounds 2000.
(Figure 21(a)), the shadowed regions of detected humans are [10] S. M. Yoon and H. Kim, “Real-time multiple people detection
much larger than the truthful moving objects (Figure 21(b)), using skin color, motion and appearance information,” in
and the heads of detected humans in the sampled images Proceedings of the IEEE International Workshop on Robot
are too small to be detected more accurately (Figure 21(c)). and Human Interactive Communication, pp. 331–334, Tokyo,
To sum up, the conditional entropy-based mICA approach Japan, September 2004.
has solved most problems in human detection and provides [11] D. Guo, “Coordinating computational and visual approaches
the better discriminability in classes for ICA which may not for interactive feature selection and multivariate clustering,”
depend on the binary classification in an efficient computa- Information Visualization, vol. 2, no. 4, pp. 232–246, 2003.
tional time, 1.13–1.41 ms/object, and in the accuracy of more
than 93% for real-time applications.

Acknowledgments
This work was supported in part by the Aiming for the
Top University Plan of National Chiao Tung University, the
Ministry of Education, Taiwan, under Contract 99W962, and
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 483562, 7 pages
doi:10.1155/2010/483562

Research Article
Objective Assessment of Sunburn and Minimal Erythema Doses:
Comparison of Noninvasive In Vivo Measuring Techniques after
UVB Irradiation

Min-Wei Huang,1, 2 Pei-Yu Lo,2 and Kuo-Sheng Cheng2


1 Department of Psychiatry, Chia-Yi Veterans Hospital, Chia-Yi 600, Taiwan
2 Institute of Biomedical Engineering, National Cheng Kung University, Tainan 701, Taiwan

Correspondence should be addressed to Kuo-Sheng Cheng, [email protected]

Received 29 November 2009; Revised 9 February 2010; Accepted 30 March 2010

Academic Editor: Yingzi Du

Copyright © 2010 Min-Wei Huang et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Military personnel movement is exposed to solar radiation and sunburn is a major problem which can cause lost workdays and lead
to disciplinary action. This study was designed to identify correlation parameters in evaluating in vivo doses and epidermis changes
following sunburn inflammation. Several noninvasive bioengineering techniques have made objective evaluations possible. The
volar forearms of healthy volunteers (n = 20), 2 areas, 20 mm in diameter, were irradiated with UVB 100 mj/cm2 and 200 mj/cm2 ,
respectively. The skin changes were recorded by several monitored techniques before and 24 hours after UV exposures. Our results
showed that chromameter a∗ value provides more reliable information and can be adopted with mathematical model in predicting
the minimal erythema dose (MED) which showed lower than visual assessment by 10 mj/cm2 (Pearson correlation coefficient
I = 0.758). A more objective measure for evaluation of MED was established for photosensitive subjects’ prediction and sunburn
risks prevention.

1. Introduction the correlation parameter in evaluation of in-vivo doses


and epidermal changes following UVB irradiation using
The main effect of UVB (wavelengths of 320 to 340 nm) noninvasive techniques, especially those showing very weak
is thought to take place mainly in the epidermis. UVB is reactions. Using more objective method, the photosensitive
very active in human skin and can induce sunburn, tanning, soldiers could be screened to prevent from sunburn risk.
and many photodermatoses after exposure. Both skin cancer
and aging may occur following chronic repeated exposure
[1]. Sunburn inflammation has been used as end point for
2. Materials and Methods
many photobiologic studies of skin. The patient’s minimal 2.1. Materials
erythema dose (MED), defined as the minimal dose in
producing just-perceptible erythema determined 24 hours 2.1.1. UVB Irradiation Source. UVB irradiation was admin-
after irradiation, is an example. Photosensitive subjects have istered using a light box (UV 3001, Waldmann, Germany)
low MED values and are vulnerable to UV radiation [2]. irradiation unit.
The sensitivity of human skin to UV radiation must be
determined, especially military personnel movement that is 2.1.2. Colorimeter (CR). For quantitative measurement of
exposed to sun, and sunburn is a major problem which skin color, we made use of a chromameter (Minolta CR-400,
can cause lost workdays and lead to disciplinary action. The Osaka, Japan) [3]. It measures the erythema and skin color
assessment of acute effect of epidermis after UVB exposure based on Commission International de I’eclairage L∗ a∗ b∗
is rarely analyzed by noninvasive quantitative means. The color space. The L∗ a∗ b∗ color space method (CIELAB),
traditional visual MED reading lacks accuracy, reproducibil- developed in 1976, is the most frequently used to objectively
ity, and quantification. This study was designed to identify assess colors [4]. In this system the L∗ coordinate correlates
2 EURASIP Journal on Advances in Signal Processing

with the intensity of the reflected light (brightness) and the epidermis at a cellular level [8]. Since epidermis changes after
a∗ and b∗ coordinates are chromatic, covering the spectrum UV irradiation are usually examined by biopsy, it is quite
from red to green and from yellow to blue, respectively. The difficult to monitor the skin changes dynamically. Therefore,
a∗ value is well recognized to linearly correlate with skin we aimed to test the potential of LSCM of epidermis in vivo
erythema [5]. after UVB irradiation [9].
A commercially available LSCM (Vivascope1500, Lucid,
2.1.3. Multiprobe Adapter. The multiprobe adapter (MPA) is Henrietta, New York) was used [10]. The following param-
a flexible, economic plug-in system to combine all skin mea- eters are assessed: thickness of stratum corneum (SC),
surement probes of Courage + Khazaka electronic GmbH measured from skin surface to the first recognizable nucleus
[3], including tewameter for transepidermal water loss in the granular layer; and minimal thickness of epidermis
(TEWL), corneometer CM825 for skin moisture, and mex- (DP), defined as the distance between the skin surface and
ameter Hb MX18 for skin pigmentation and erythema [6]. the most apical recognizable dermal structure [8, 11].

(a) Mexameter Hb (MI, EI). The melanin and erythema 2.1.5. Laser Doppler Perfusion Imager (LDI). Laser Doppler
indexes were evaluated by a mexameter Hb (MX-18, perfusion imager (LDI) is a standard technique in the non-
Courage an Khazaka, Cologne, Germany). The mexameter invasive monitor of blood flow and has been widely applied
is equipped with LED (light emitting diode) light sources in the studies of vascular changes within skin area of interest.
and a silicon diode detector for detecting reflected light In UV irradiation, dermal microperfusion is increased
from skin. The instrument measures the intensity of reflected when inflammation induced by UV exposure starts with
green (568 nm), red (660 nm), and infrared (880 nm). The vasodilatation. In this study, cutaneous microcirculation was
definitions of the melanin (MI) and erythema (EI) indexes measured with a laser Doppler Perfusion imager (Moor
calculated automatically by the mexameter are as follows LDI2, UK) [7]. The output of the LDI system consists of
[5, 7]: two different two-dimensional data sets, perfusion and total
back-scattered light intensity (TLI), with a point-to-point
Melanin index: correspondence. The blood perfusion data set, represented
500 by a color-coded image, was calculated from back-scattered
MI =   , and Doppler-shifted light, defined as the product of red
log 5 log infrared − log I red + 500 (1) blood cells’ mean velocity times their concentration in the
Erythema index: sampled tissue volume. The second data set maps the TLI
  and was coded into a photographic-liked gray-scale image of
EI = 500 = log 5 log I red − log I green + 500. the lesion [4]. The distance between laser Doppler perfusion
imager and skin surface is 30 cm (with measurements taken
(b) TEWL Measurement (TEWL). TEWL was measured from centre of the studied area for all subjects). In this
using an evaporimeter (Tewameter, Courage an Khazaka, study, each recorded image consists of multiple measurement
Cologne, Germany) on the arms of subjects before and after sites and represents the blood perfusion in a skin area of
a single dose of UV-light radiation. In vitro measurement approximately 68 × 86 mm2 .
of TEWL provides a clear indication of the skin barrier
integrity. All investigations were performed at 23–25◦ C and
40–60 relative humidity [6]. 2.2. Methods
2.2.1. Study Design and Subjects. This study was approved by
(c) Electrical Capacitance (COR). Electrical capacitance was the ethics committee for human studies of Veterans General
measured with corneometer (CM 825, Courage an Khaz- Hospital-Kaohsiung. Twenty healthy Chinese volunteers (15
aka, Cologne, Germany). The technique is based on the males and 5 females with mean age at 28 y/o, SD 5.6),
completely different dielectric constant of water and other who gave their informed consents, were enrolled. Subjects
substances (mostly > 7). The measuring capacitor shows have not been exposed to systemic corticosteroids, immuno-
changes of capacitance according to the moisture content of suppresive medicines, or sunbathing in the past 4 weeks.
the samples and provides temperature stability. The capacitor Both temperature and humidity in the room were recorded.
shows changes of capacitance according to moisture content The temperature was maintained within the range between
of the samples [6]. 20◦ C and 25◦ C and relative humidity was within 40% to
60%. Smoking was not allowed within 4 hours prior to the
2.1.4. Laser Scanning Confocal Microscopy (LSCM). The skin measurements. Both coffee and tea intakes were not allowed
changes have been extensively investigated by histological within 1 hour prior to measurement. To design the study
examination. But, biopsy may alter the original morphology properly, the knowledge of intraregional variation and daily
and induce an iatrogenic trauma, and thus noninvasive variability of measurement parameter is of utmost impor-
methods are more desirable for application. Laser scanning tance. In our study, the measurement for each individual was
confocal microscopy (LSCM) allows noninvasive in-vivo taken at exactly the same time each day. Each skin site acts as
optical sectioning of layers of skin in real time. Using melanin its own control in measurement of basal skin color and blood
as main endogenous contrast, the technique can analyze the flow on the skin on day 1 [12].
EURASIP Journal on Advances in Signal Processing 3

2.2.2. Broadband Light Testing to UVB. Broadband light test- proposed data analysis method consists of three concrete
ing to UVB was conducted for all subjects as follows. The test steps.
sites are nonexposed skin of the mid-lower back. The expo-
sure doses for the MED testing were 50 mj/cm2 , 70 mj/cm2 , (a) Evaluation of data validity to exclude data having
100 mj/cm2 , 120 mj/cm2 , 140 mj/cm2 , and 160 mj/cm2 , inherent artifacts: before processing of perfusion
respectively, according to previous experience [8]. These and TLI images, the collected material was prepro-
were determined visually by 2 experienced investigators. The cessed to identify the unwanted artifacts induced by
readings of erythema were also taken for all exposure sites involuntary patient movements. The preprocessing
using chromameter. Digital photographs were taken. procedure involves visual inspection and comparison
of multiple TLI images emanating from the same
2.2.3. Traditional Visual MED Reading (VS) [7, 12]. There lesion.
are
(b) Lesion delineation to define the optical boundaries
0 No erythema, of each lesion in TLI imaging: the region of interest
+ Menimal perceptible erythema with sharp borders (ROI) was superimposed on the perfusion image, and
(1 MED), the perfusion parameters are calculated.
+ Pink erythema, (c) Blood perfusion feature extraction and resulting
++ Marked erythema, no edema, no pain, values were statistically evaluated between different
irradiated groups.
+++ Fair red erythema, mild edema, mild pain,
++++ Violaceus erythema, marked edema, strong
2.3. Statistics. The time course of different measurements
pain, strong edema, partial blistering.
was analyzed within the context of repeated-measurement
ANOVA models because measurements over time and over
2.2.4. Background Skin Reaction Measurements. The consti-
measurements are repeatedly taken within the same volun-
tutional skin color was measured at infra-axillary areas with
teers. A method can be considered to be discriminatory if
chromameter. The forearm of each individual was positioned
the changes of skin condition were detected and ought to
on an arm support at heart level for 15 minutes. The baseline
be judged significant. The discriminatory ability of different
values of cutaneous blood flow and skin condition were
measurements was compared using respective F-values of
measured for all test sites as described below.
the ANOVA models. The highest F-value represents the best
2.2.5. The Test Standard Dosage of UVB Irradiation. The discriminatory ability.
MED does not exhibit an incremental increase of erythema An ANOVA model was calculated for every combination
with increasing UV doses. Takiwaki et al. suggested that 2 of measurement. A dose-dependent effect of a specific
independent MEDs are more appropriate to assess the UV- measurement was significant if the P-value of the F-test is
induced skin reactions in oriental skin since tanning induced < .05. This corresponds to an F-value of 4.35 in a given
by doses lower than 2 MEDs is too weak to discriminate situation (DF 1, 19).
the differences between various reactions. On the contrary, In the second step, pairwise comparisons for the time
doses higher than 2 MEDs often result in desquamation, dependent effect between t0 and t24 were performed in each
which makes assessment by colorimeter difficult [13]. The ANOVA model. A time-dependent effect between two time
mean MED of all suspected photosensitive subjects from points was classified as statistically signification if the P-value
2000 to 2005 in our hospital is 107.3 ± 30.72 mj/cm2 (85 is < .05. This is true for a (Bonferroni-adjusted) t-value of
individuals, 45 females and 40 males with a mean age at 50.3 1.725 in the given situation (DF 19). All calculations were
years) from our previous studies. Accordingly, 100 mj/cm2 done using SPSS10.0.
and 200 mj/cm2 were chosen as our test standard.
Forearm (flexor side) is used for monitoring of skin 3. Results
reactions. Each test area was marked using a template with
2 holes and a green ink pen. On the ventral side of right 3.1. Constitutive Colorimetric Readings. The background
forearm (about 3-4 cm distance from the anticubitale fossa readings before irradiation gave a mean of 58.6 ± 2.9 L∗
and the wrist), 2 areas of each volunteer with each 20 mm (L) units, 5.7 ± 1.2 a∗ (A) units, and 13.6 ± 1.2 b∗ (B)
in diameter were irradiated with UVB 100 mj/cm2 , and units when read with chromameter, respectively. The value of
200 mj/cm2 respectively. The skin of volar forearm skin erythema index (EI) was 271 ± 66 units and melanin index
before UV exposure and 24 hours later was examined by (MI) 216.7 ± 44 units using the mexamter Hb. No significant
traditional visual MED reading (VS), tewameter (TEWL), correlation was observed between MEDs and constitutive
mexameter (E, M), corneometer (COR), laser scanning colorimetric readings.
confocal microscopy (LSCM) for SC and DP reading,
chromameter (L, A, B), and laser-Doppler imager (LDI). 3.2. Assessment of Melanin Pigmentation after UV Irradi-
ation by Laser Scanning Confocal Microscopy. Laser scan-
2.2.6. Image Processing and Data Analysis Method for Extrac- ning confocal microscopy (LSCM) was used to measure
tion of Perfusion Parameters of Laser-Doppler Imager. The the melanocytes in forearm skin after ultraviolet exposure

200 mj/cm2 -irradiated doses. There is no discrimination for


SC, DP, COR, and TEWL.
In summary, both a∗ and erythema index (EI) show
positive linear relation to VS. The Pearson correlation
coefficients of a∗ value and EI relating visual scoring are
compared (P = .578 over .501 at 100 mj/cm2 ; P = .767
over .759 at 200 mj/cm2 ). The a∗ value is better than EI
index. The a∗ value provides more reliable information
and can be used in mathematical model in predicting the
minimal erythema dose (MED).
(a) (b)

3.5. Colorimetric Determination of MED with Mathematical


Model and Comparing with Conventional Visual Method.
The a∗ data of chromameter were mathematically modeled
to assess the MED values. Objectively we proposed that the
lowest subthreshold UV doses do not induce the erythema
and, thus giving a horizontal line. At the threshold where
UV begins to cause erythema (MED), a curve with a
positive gradient would commence. The data were modeled
to an initial horizontal line with a curve commencing at
an unknown point (Figure 3) [7]. This unknown point,
the intersection of the line and curve, was determined by
(c) (d) mathematical modeling as the MED. Both average intercepts
Figure 1: Laser scanning confocal microscopy (LSCM) of supranu- and slopes for each subject were calculated to see if a more
clear melanin caps of dermal-epidermal junction in the forearm objective measure of MED could be obtained.
skin in relation to UVB exposure. (a) and (b) One representative The individual MED value evaluated by conventional
case before and after (24 hours) 100 mj UVB irradiation. (c) and (d) visual determination and mathematical model is revealed
The same case before and after (24 hours) 200 mj UVB irradiation. at Figure 4. The average MED of all volunteers is 86.5
± 22 mj/cm2 (change in a∗ of +2.08 ± 0.74 units) by
conventional visual method. The corresponding MED using
mathematical model is 76.5 ± 25 mj/cm2 (change in a∗ of
(Figure 1). The brightness of basal layer in LSCM decreased +1 ± 0.77 units). Using mathematical modeling, we are able
24 hours after UV irradiation exposure. to detect a∗ change in erythema at lower UV doses than
the conventional visual assessment except case 14. Thus,
3.3. Assessment of Blood Flux after UVB Irradiation. The we are 95% sure that MED determined by mathematical
software of LDI can demonstrate both the flux image and model is not equal to conventional visual method. However,
video images. The LDI features a camera in production of the correlation of MED values between visual method
color images of scanned area, making the positioning and and mathematical prediction is fair (Pearson correlation
comparison of images easier. Before UVB irradiation, we coefficient = 0.758).
observed no increase of blood flow. Twenty four hours later
after 100 or 200 mj/cm2 UVB exposure, the color scale of
perfusion image increases (Figure 2). The blood perfusion 4. Discussion
feature was extracted and statistically evaluated between Appropriate instruments and mythology are indispensable
different irradiated areas. to monitor the differences of epidermis response. The
present study was designed to identify the discrimination
3.4. Statically Analysis of Different Noninvasive Biomedical capability among different noninvasive techniques. Previous
Techniques. Twenty volunteers are enrolled in all nonin- studies have shown the ability of instruments such as the
vasive biomedical analysis. The mean differences of mea- chromameter (CR) and the mexameter Hb to quantitate
surements related to UV exposure are listed in Table 1. It more sensitive measures of skin color changes [7]. From our
is evident that EI, VS, and a∗ increase with higher UVB results, colorimeric measurements (including mexameter
irradiation. On the contrary, MI, L∗ , and b∗ decrease. and chromameter) and visual scoring give the highest
TEWL, COR, SC, DP, and LDI show discrepant. The repeated discrimination of UVB irradiation, and a∗ and EI show a
measure ANOVA test found a discrimination of MI and EI of linear relationship to VS. Taken together, the reproducibility
mexameter, L, A, and B of chormameter, LDI and visual score and convenience of a∗ is most satisfactory, even if three
(VS). Among them, MI, EI, A, and VS exhibited the hightest times of measurements were used. The a∗ provides reliable
discrimination power. The pairwise comparison (t-test) is information and can be designed as in a mathematical model
significant for 100 mj/cm2 and 200 mj/cm2 UVB-irradiated in predicting MED. Through detailed comparison of differ-
doses in MI, EI, L, A, B, and VS. The LDI is only effective at ent UVB doses and MEDs, the parameters derived could be

Table 1: The descriptive statistics of mean differences with biomedical techniques. EI, COR, and a∗ increase with higher UVB irradiation.
On the contrary, MI, L∗ , and b∗ decrease. Discrepancy: TEWL, COR, SC, DP, LDI.

Cases Minimum Maximum Mean SD


MI100D 20 −28.00 12.00 −8.2285 11.0630
MI200D 20 −75.00 2.00 −45.6355 23.2291
EI100D 20 −10.75 92.00 47.5690 29.9449
EI200D 20 36.00 299.00 154.9840 68.2407
TEWL100D 20 −2.20 6.00 0.7100 2.3799
TEWL200D 20 −6.40 3.20 0.1000 2.1866
COR100D 20 −10.00 23.89 2.4405 8.0193
COR200D 20 −11.20 35.56 1.2560 10.2402
SC100D 20 −6.00 13.00 0.8971 4.0347
SC200D 20 −4.34 7.00 0.8825 3.4885
DP100D 20 −13.00 26.00 2.5990 10.2211
DP200D 20 −15.00 21.00 1.4970 9.9801
LDI100D 20 −20.80 18.80 −0.2200 11.6452
LDI200D 20 −7.00 625.00 213.7200 177.6635
L100D 20 −4.41 0.56 −1.2520 1.1509
L200D 20 −6.77 −0.25 −3.1450 1.8638
A100D 20 0.21 5.09 1.6305 1.1571
A200D 20 1.11 9.57 4.6560 2.2557
B100D 20 −0.96 0.35 −0.1850 0.3261
B200D 20 −1.46 0.77 −0.5840 0.5619
VS100D 20 0.00 2.00 0.7000 0.5712
VS200D 20 1.00 3.00 2.2500 0.7864
MI: mexameter melanin index; EI: mexameter erythema index; TEWL: transepidermal water loss; COR: corneometer CM 825; SC, thickness of stratum
corneum; DP, minimal thickness of the epidermis; L, A, and B: colorimetric L∗ , a∗ and b∗ measurements using Chromameter CR 400; LDI: laser-Doppler
perfusion imaging; VS: visual score.
#100D: Mean differences of # measure at 100 mj/cm2 UVB-irradiated doses.
#200D: Mean differences of # measure at 200 mj/cm2 UVB-irradiated doses.
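For readers who want to reproduce this kind of summary, the short Python sketch below shows how the per-dose mean differences and the descriptive statistics of Table 1 can be computed; the arrays are hypothetical stand-ins for the measured values and this is illustrative only, not the authors' analysis code.

```python
import numpy as np

# Illustrative sketch: per-subject differences (24 h minus baseline) for one
# measure at one UVB dose, summarized as in Table 1. Array names, values,
# and shapes are assumptions, not the study data.
rng = np.random.default_rng(0)
baseline = rng.normal(5.7, 1.2, size=20)          # e.g., a* before irradiation
after_24h = baseline + rng.normal(1.6, 1.1, size=20)

diff = after_24h - baseline                        # "mean difference" entries such as A100D
summary = {
    "cases": diff.size,
    "min": diff.min(),
    "max": diff.max(),
    "mean": diff.mean(),
    "sd": diff.std(ddof=1),                        # sample standard deviation
}
print(summary)
```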

(a) (b)

Figure 2: The LDI range features a CCD camera which produces a color image of the scanned area, making the positioning and comparison
of images easier. (a) Before UVB irradiation. (b) After 100 and 200 mj/cm2 UVB-irradiation.

applied in photobiologic studies. Besides, standardization of the method developed in the present study may have defense implications in the future.

Olson et al. reported that MED correlated well with melanosome size, quantity, density, and distribution in various skin colors [14]. Lee et al. showed that hyperkeratosis and acanthosis were more prominent 24-48 hours after a single dose of 2 MED UVB, and marked hyperkeratosis was observed in about 20% of subjects after 24 hours [11]. However, we found no dose-dependent effect and no obvious discrimination in terms of stratum corneum thickness. Possibly, any minor movement of our subjects may interfere with the delicate measurements of CLSM. Besides, the brightness decreased after UVB irradiation. In the LSCM analysis, the temporary decrease of brightness may be due to rapid proliferation of keratinocytes rich in supranuclear melanin caps after UV exposure and their upward movement, resulting in a darkened epidermal basal layer. Further examination of images from the stratum corneum to the epidermal-dermal junction could be performed by image processing.

[Figure 3 appears here: plot of the a* function against UV test site (0-6), showing the fitted horizontal component and the intercept where the rising curve begins.]

Figure 3: Colorimetric determination of MED with mathematical model.

[Figure 4 appears here: bar chart of MED value (0-160 mj/cm2) for cases 1-20, comparing MED and MEDCOUNT.]

Figure 4: The individual MED value evaluated by conventional visual determination and mathematical model. MED: minimal erythema dose from visual assessment. MEDCOUNT: MED counted by mathematical model.

By using LDI, we also found a positive relationship between skin blood flow and UVB irradiation. It appears that UVB irradiation can enhance the dermal microperfusion, consistent with the inflammation and vasodilation of skin induced by UV exposure [4].

Transepidermal water loss (TEWL) is a well-documented method for studying the skin barrier function through evaporation changes [15, 16]. However, these changes are partly due to environmental factors as well as the psychological/physiological status of the person tested. The bioengineering methods of this study could not discriminate the effects of UVB on barrier function and water content of skin. Schempp et al. demonstrated that exposure to 5% saltwater leads to a decrease of the threshold for elicitation of UVB-induced erythema with an increase of the erythemal response [17, 18]. As to the correlation between water content and UVB irradiation, no discrimination was demonstrated between the measures and the mean difference of the corneometer, implying that water content and barrier function are not correlated with UVB irradiation. Besides, barrier disruption should be interpreted with caution, as a decrease is seen initially. Indeed, Fluhr et al. mentioned that barrier damage is a late effect of acute UV irradiation. Attention can also be detected using VS during the early phase (until 48 hours) [12].

Consistent with Westerhof et al. [19], we found no significant correlation between MEDs and constitutive colorimetric readings. The MED is only an estimate of the amount of UV radiation required for erythema. We cannot simply predict the MED value from constitutive colorimetric readings, because no correlation was observed between UV sensitivity and skin color.

Dose-response data of erythema more accurately measure the responses of human skin to UVB. A sophisticated chromameter supported by mathematical modeling can offer objective measurement of erythema to UV radiation and of the dose-response relationship.

When plotting erythema curves (as measured by a*) for each volunteer, each curve seems to be the composite of two components: a horizontal component without measurable erythema and a second curve containing measurable erythema. The mathematical modeling we proposed can identify these two curves, their intercept, and the slope of the second curve. We proposed that the intercept represents the threshold where the biological change in erythema occurs. Theoretically, the MED predicted by the mathematical model is lower than that from visual assessment (by 10 mj/cm2 in our study). Using mathematical modeling, we are able to detect a* changes in erythema at lower UV doses than the conventional visual assessment. The reason why the 14th case in Figure 4 shows a higher computed MED value may be that only 6 points were available in the phototests; we suspect that more UVB-irradiated points would allow a more accurate intercept to be determined in future studies. In addition, whether the slope is a useful means of differentiating between skin types compared with the visual MED deserves further investigation.

Diverse skin responses need different measurement modalities to achieve a satisfactory discrimination. It is desirable to have an objective measurement in the assessment of UV damage, such as the MED. The proposed mathematical model, using the intercept predicted by the chromameter a* function, may be a supplementary method for the measurement of MED. With this approach, we are able to monitor the changes of erythema after UVB irradiation. The assessment of the sensitivity of human skin to UV radiation is important for military personnel movements with sun exposure, for whom sunburn is a major problem. This study identifies a noninvasive parameter for the evaluation of in-vivo dose and epidermis changes following UVB irradiation. A more objective MED is of great help for sunburn risk screening and prevention.
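As an illustration of the two-component fit described above (a flat a* baseline joined to a rising curve, with their intersection taken as the modeled MED), the following Python sketch fits such a model to hypothetical dose-response data by scanning candidate breakpoints. It is a minimal sketch under assumed data, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the horizontal-plus-rising-curve model: for each candidate
# breakpoint, fit a flat baseline below it and a straight line above it, keep
# the split with the smallest residual error, and report the dose where the
# line meets the baseline as the modeled MED. Doses and a* values are hypothetical.
doses = np.array([0.0, 50.0, 70.0, 100.0, 120.0, 140.0, 160.0])   # mJ/cm^2
a_star = np.array([5.6, 5.7, 5.8, 6.9, 7.8, 8.9, 10.1])           # chromameter a*

best = None
for k in range(2, len(doses) - 1):                # candidate breakpoint index
    flat = a_star[:k].mean()                      # horizontal component
    slope, intercept = np.polyfit(doses[k:], a_star[k:], 1)   # rising component
    sse = (((a_star[:k] - flat) ** 2).sum()
           + ((a_star[k:] - (slope * doses[k:] + intercept)) ** 2).sum())
    med_estimate = (flat - intercept) / slope     # intersection of line and baseline
    if best is None or sse < best[0]:
        best = (sse, med_estimate, slope)

print(f"modeled MED ~ {best[1]:.0f} mJ/cm^2, slope {best[2]:.3f} a* per mJ/cm^2")
```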

Acknowledgment loss as a function of skin temperature in hairless mice,” Skin


Pharmacology and Applied Skin Physiology, vol. 16, no. 5, pp.
This research was supported by the Veterans General 283–290, 2003.
Hospital-Kaohsiung, Grant no. VGHKS 96-87. [17] M. Moehrle, W. Koehle, K. Dietz, and G. Lischka, “Reduction
of minimal erythema dose by sweating,” Photodermatology
References Photoimmunology & Photomedicine, vol. 16, no. 6, pp. 260–
262, 2000.
[1] I. E. Kochevar and C. R. Taylor, “Photophysics, photo- [18] C. M. Schempp, K. Müller, J. Schulte-Mönting, E. Schöpf, and
chemistry and photobiology,” in Fitzpatrick’s Dermatologhy in J. C. Simon, “Salt water bathing prior to UVB irradiation leads
General Medicine, I. M. Freedberg, A. Z. Eisen, K. Wolff, et al., to a decrease of the minimal erythema dose and an increased
Eds., pp. 1267–1275, McGraw-Hill, New York, NY, USA, 6th erythema index without affecting skin pigmentation,” Photo-
edition, 2003. chemistry and Photobiology, vol. 69, no. 3, pp. 341–344, 1999.
[2] J. Krutmann, H. Honigsmann, C. A. Elmets, and P. R. [19] W. Westerhof, O. Estevez-Uscanga, J. Meens, A. Kammeyer,
Bergstresser, “Dermatological phototherapy and photodi- M. Durocq, and I. Cario, “The relation between constitutional
agnostic methods,” in The Photopatch Test, pp. 338–341, skin color and photosensitivity estimated from UV-induced
Springer, Berlin, Germany, 2001. erythema and pigmentation dose-response curves,” Journal of
[3] http://www.konicaminolta.com/sensingusa/products/color/ Investigative Dermatology, vol. 94, no. 6, pp. 812–816, 1990.
colorimeters/cr400-410/index.html.
[4] M. A. Allias, K. Wårdell, M. Stücker, C. Anderson, and E. G.
Salerud, “Assessment of pigmented skin lesions in terms of
blood perfusion estimates,” Skin Research and Technology, vol.
10, no. 1, pp. 43–49, 2004.
[5] C. K. Kraemer, D. B. Menegon, and T. F. Cestari, “Determina-
tion of the minimal phototoxic dose and colorimetry in pso-
ralen plus ultraviolet A radiation therapy,” Photodermatology
Photoimmunology & Photomedicine, vol. 21, no. 5, pp. 242–
248, 2005.
[6] http://www.courage-khazaka.de/.
[7] T. S. C. Poon, J. M. Kuchel, A. Badruddin, et al., “Objective
measurement of minimal erythema and melanogenic doses
using natural and solar-simulated light,” Photochemistry and
Photobiology, vol. 78, no. 4, pp. 331–336, 2003.
[8] K. Sauermann, S. Clemann, S. Jaspers, et al., “Age related
changes of human skin investigated with histometric measure-
ments by confocal laser scanning microscopy in vivo,” Skin
Research and Technology, vol. 8, no. 1, pp. 52–56, 2002.
[9] S. Nouveau-Richard, M. Monot, P. Bastien, and O. de
Lacharrière, “In vivo epidermal thickness measurement: ultra-
sound vs. confocal imaging,” Skin Research and Technology,
vol. 10, no. 2, pp. 136–140, 2004.
[10] http://vivascopy.com/medical-imagers/vivascope-1500.asp.
[11] T. Gambichler, K. Sauermann, M. A. Altintas, et al., “Effects
of repeated sunbed exposures on the human skin. In vivo
measurements with confocal microscopy,” Photodermatology
Photoimmunology & Photomedicine, vol. 20, no. 1, pp. 27–32,
2004.
[12] J. W. Fluhr, O. Kuss, T. Diepgen, et al., “Testing for irritation
with a multifactorial approach: comparison of eight non-
invasive measuring techniques on five different irritation
types,” British Journal of Dermatology, vol. 145, no. 5, pp. 696–
703, 2001.
[13] B. L. Diffey, C. T. Jansen, F. Urbach, and H. C. Wulf, “The
standard erythema dose: a new photobiological concept,”
Photodermatology Photoimmunology & Photomedicine, vol. 13,
no. 1-2, pp. 64–66, 1997.
[14] R. L. Olson, J. Gaylor, and M. A. Everett, “Skin color, melanin,
and erythema,” Archives of Dermatology, vol. 108, no. 4, pp.
541–544, 1973.
[15] H. Miyauchi, T. Horio, and Y. Asada, “The effect of ultra-
violet radiation on the water-reservoir functions of the
stratum corneum,” Photodermatology Photoimmunology &
Photomedicine, vol. 9, no. 5, pp. 193–197, 1992.
[16] J. J. Thiele, F. Dreher, H. I. Maibach, and L. Packer, “Impact of
ultraviolet radiation and ozone on the transepidermal water
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 901205, 7 pages
doi:10.1155/2010/901205

Research Article
Robust Real-Time Background Subtraction Based on
Local Neighborhood Patterns

Ariel Amato, Mikhail G. Mozerov, F. Xavier Roca, and Jordi Gonzàlez


Computer Vision Center (CVC), Universitat Autonoma de Barcelona, Campus UAB Edifici O, 08193 Bellaterra, Spain

Correspondence should be addressed to Mikhail G. Mozerov, [email protected]

Received 1 December 2009; Accepted 21 June 2010

Academic Editor: Yingzi Du

Copyright © 2010 Ariel Amato et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper describes an efficient background subtraction technique for detecting moving objects. The proposed approach is able to
overcome difficulties like illumination changes and moving shadows. Our method introduces two discriminative features based on
angular and modular patterns, which are formed by similarity measurement between two sets of RGB color vectors: one belonging
to the background image and the other to the current image. We show how these patterns are used to improve foreground detection
in the presence of moving shadows and in the case when there are strong similarities in color between background and foreground
pixels. Experimental results over a collection of public and own datasets of real image sequences demonstrate that the proposed
technique achieves a superior performance compared with state-of-the-art methods. Furthermore, both the low computational
and space complexities make the presented algorithm feasible for real-time applications.

1. Introduction

Moving object detection is a crucial part of automatic video surveillance systems. One of the most common and effective approaches to localizing moving objects is background subtraction, in which a model of the static scene background is subtracted from each frame of a video sequence. This technique has been actively investigated and applied by many researchers during the last years [1-3]. The task of moving object detection is strongly hindered by several factors such as shadows cast by moving objects, illumination changes, and camouflage. In particular, cast shadows are the areas projected on a surface because objects are occluding, partially or totally, direct light sources. Obviously, an area affected by a cast shadow experiences a change of illumination; therefore, in this case the background subtraction algorithm can misclassify background as foreground [4, 5]. Camouflage occurs when there is a strong similarity in color between background and foreground, so foreground pixels are classified as background. Broadly speaking, these issues raise problems such as shape distortion, object merging, and even object losses. Thus a robust and accurate algorithm to segment moving objects is highly desirable.

In this paper, we present an adaptive background model, which is formed by temporal and spatial components. These components are basically computed by measuring the angle and the Euclidean distance between two sets of color vectors. We will show how these components are combined to improve the robustness and the discriminative sensitivity of the background subtraction algorithm in the presence of (i) moving shadows and (ii) strong similarities in color between background and foreground pixels. Another important advantage of our algorithm is its low computational complexity and its low space complexity, which makes it feasible for real-time applications.

The rest of the paper is organized as follows. Section 2 introduces a brief literature review. Section 3 presents our method. In Section 4 experimental results are discussed. Concluding remarks are available in Section 5.

2. Related Work

Many publications are devoted to the background subtraction technique [1-3]. However, in this section we consider only the papers that are directly related to our work.

Haritaoglu et al. state that in W4 [6] the background is modeled by representing each pixel by three values: its minimum and maximum intensity values and the maximum intensity differences between consecutive frames observed

during this training period. Pixels are classified as foreground if the differences between the current value and the minimum and maximum values are greater than the values of the maximal interframe difference. However, this approach is rather sensitive to shadows and lighting changes, since only the illumination intensity cue is used, and the memory resource required to implement this algorithm is extremely high.

Horprasert et al. [7] implement a statistical color background algorithm, which uses color chrominance and brightness distortion. The background model is built using four values: the mean, the standard deviation, the variation of the brightness, and the chrominance distortion. However, this approach usually fails for low and high intensities.

Kim et al. [8] use a similar approach to [7], but they obtain more robust motion segmentation in the presence of illumination and scene changes using a background model with codebooks. The codebook idea gives the possibility to learn more about the model in the training period. The authors propose to cope with the unstable information of the dark pixels, but they still have some problems in the low- and high-intensity regions. Furthermore, the space complexity of their algorithm is high.

Stauffer and Grimson [9] address the low- and high-intensity regions problem by using a mixture of Gaussians to build a background color model for every pixel. Pixels from the current frame are checked against the background model by comparing them with every Gaussian in the model until a matching Gaussian is found. If so, the mean and variance of the matched Gaussian are updated; otherwise a new Gaussian with the mean equal to the current pixel color and some initial variance is introduced into the mixture.

McKenna et al. [10] assume that cast shadows result in a significant change in intensity without much change in chromaticity. Pixel chromaticity is modeled using its mean and variance, and the first-order gradient of each background pixel is modeled using gradient means and magnitude variance. Moving shadows are then classified as background if the chromaticity or gradient information supports their classification.

Cucchiara et al. [11] use a model in Hue-Saturation-Value (HSV) and stress their approach on shadow suppression. The idea is that shadows change the hue component slightly and decrease the saturation component significantly. In the HSV color space a more realistic noise model can be used. However, this approach also has drawbacks. The similarity measured in the nonlinear HSV color space usually generates ambiguity at gray levels. Furthermore, threshold handling is the major limitation of this approach.

3. Proposed Algorithm

A simple and common background subtraction procedure involves subtraction of each new image from a static model of the scene. As a result a binary mask with two labels (foreground and background) is formed for each pixel in the image plane. Broadly speaking, this technique can be separated in two stages, one dealing with the scene modeling and another with the motion detection process. The scene modeling stage represents a crucial part of the background subtraction technique [12-17].

Usually a simple unimodal approach uses statistical parameters such as mean and standard deviation values, for example, [7, 8, 10], and so forth. Such statistical parameters are obtained during a training period and then are dynamically updated. In the background modeling process the statistical values depend on both the low- and high-frequency changes of the camera signal. If the standard deviations of the low- and high-frequency components of the signal are comparable, methods based on such statistical parameters exhibit robust discriminability. When the standard deviation of the high-frequency change is significantly less than that of the low-frequency change, the background model can be improved to make the discriminative sensitivity much higher. Since a considerable change in the low-frequency domain is produced for the majority of real video sequences, we propose to build a model that is insensitive to low-frequency changes. The main idea is to estimate only the high-frequency change of each pixel value over one interframe interval. The general background model in this case can be explained as the subtraction between the current frame and the previous frame, which is supposed to be the background image. Two values for each pixel in the image are computed to model background changes during the training period: the maximum difference in angular and Euclidean distances between the color vectors of consecutive image frames. The angular difference is used because it can be considered as a photometric invariant of color measurement and in turn as a significant cue to detect moving shadows.

Often pixelwise comparison is not enough to distinguish background from foreground, and in our classification process we further analyze the neighborhood of each pixel position. In the next section we give a formal definition of the proposed similarity measurements.

3.1. Background Scene Modeling

3.1.1. Similarity Measurements. Four similarity measurements are used to compare a background image with a current frame.

(i) Angular similarity measurement Δθ between two color vectors p(x) and q(x) at position x in the RGB color space is defined as follows:

\Delta\theta\left(p(x), q(x)\right) = \cos^{-1}\!\left( \frac{p(x) \cdot q(x)}{\| p(x) \| \, \| q(x) \|} \right). \quad (1)

(ii) Euclidean distance similarity measurement ΔI between two color vectors p(x) and q(x) in the RGB color space is defined as follows:

\Delta I\left(p(x), q(x)\right) = \left\| p(x) - q(x) \right\|. \quad (2)
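A minimal NumPy sketch of the two pixelwise measurements in (1) and (2) is given below; the array shapes and the small eps guard against zero-length vectors are our own assumptions and not part of the original formulation.

```python
import numpy as np

# Sketch of the pixelwise measures (1) and (2): the angle and the Euclidean
# distance between the background RGB vector and the current RGB vector,
# vectorized over a whole frame.
def angular_difference(p, q, eps=1e-6):
    """Delta-theta in radians for HxWx3 float arrays p (background) and q (current)."""
    dot = (p * q).sum(axis=-1)
    norms = np.linalg.norm(p, axis=-1) * np.linalg.norm(q, axis=-1) + eps
    return np.arccos(np.clip(dot / norms, -1.0, 1.0))

def intensity_difference(p, q):
    """Delta-I: Euclidean distance between the two color vectors."""
    return np.linalg.norm(p - q, axis=-1)

background = np.random.rand(240, 320, 3)
frame = np.random.rand(240, 320, 3)
d_theta = angular_difference(background, frame)
d_i = intensity_difference(background, frame)
print(d_theta.shape, d_i.shape)
```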

[Figure 1 appears here: (a) the background and current color vectors pBg and pf in RGB space with their angular difference Δθ and magnitude difference ΔI; (b) the Foreground, Background, and Shadow decision regions in the polar difference plane, separated by the thresholds γθ Tθ, γI TI, and γS TI.]

Figure 1: (a) Angle and magnitude difference between two color vectors in RGB space. (b) Difference in angle and magnitude in 2D “polar
difference space.” The axes are computed as x = ΔI · cos(Δθ) and y = ΔI · sin(Δθ).

[Figure 2 appears here: two bar charts of Error (%) over Sequences 1-4, comparing our approach, W4, K. Kim, Stauffer and Grimson, and Horprasert; left panel False positive error, right panel False negative error.]

Figure 2: Segmentation errors. (a) FPE and (b) FNE.
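For reference, a small sketch of how the FPE and FNE percentages plotted in Figure 2 can be computed from binary masks, following the error definition given later in Section 4 (misclassified pixels normalized by the number of true foreground pixels); the toy masks below are hypothetical.

```python
import numpy as np

# False positives: background pixels labeled foreground.
# False negatives: foreground pixels labeled background.
# Both are expressed as a percentage of the true foreground pixel count.
def segmentation_errors(predicted_fg, ground_truth_fg):
    n_true_fg = ground_truth_fg.sum()
    fpe = 100.0 * np.logical_and(predicted_fg, ~ground_truth_fg).sum() / n_true_fg
    fne = 100.0 * np.logical_and(~predicted_fg, ground_truth_fg).sum() / n_true_fg
    return fpe, fne

gt = np.zeros((120, 160), dtype=bool)
gt[40:80, 60:100] = True                      # toy ground-truth foreground
pred = np.roll(gt, 5, axis=1)                 # toy imperfect segmentation
print("FPE %.1f%%, FNE %.1f%%" % segmentation_errors(pred, gt))
```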

For each of the described similarity measurements a threshold function is associated:

T_\theta(\Delta\theta, \theta^{T}) = \begin{cases} 1, & \text{if } \Delta\theta > \theta^{T}, \\ 0, & \text{otherwise}, \end{cases} \qquad T_I(\Delta I, I^{T}) = \begin{cases} 1, & \text{if } |\Delta I| > I^{T}, \\ 0, & \text{otherwise}, \end{cases} \quad (3)

where θ^T and I^T are intrinsic parameters of the threshold functions of the similarity measurements.

To describe a neighbourhood similarity measurement, let us first characterize the index vector x = (n, m)^t ∈ Ω = {0, 1, ..., n, ..., N; 0, 1, ..., m, ..., M}, which defines the position of a pixel in the image. We also need the neighbourhood radius vector w = (i, j)^t ∈ W = {−W, ..., 0, 1, ..., i, ..., W; −W, ..., 0, 1, ..., j, ..., W}, which defines the positions of pixels that belong to the neighbourhood relative to any current pixel. Indeed, the domain W is just a square window around a chosen pixel.

(iii) Angular neighborhood similarity measurement ηθ between two sets of color vectors in the RGB color

Figure 3: (a) Original image, segmentation result of (b) our method, (c) Stauffer method, and (d) K. Kim method.

space p(x + w) and q(x + w) (w ∈ W) can be written as

\eta\theta\left(\vartheta, \theta^{T}\right) = \sum_{w \in W} T_\theta\left( \Delta\theta(\vartheta), \theta^{T} \right), \quad (4)

where Tθ, θ^T, and Δθ are defined in (3) and (1), respectively, and ϑ is (p(x + w), q(x + w)).

(iv) Euclidean distance neighborhood similarity measurement μI between two sets of color vectors in the RGB color space p(x + w) and q(x + w) (w ∈ W) can be written as

\mu I\left(\vartheta, I^{T}\right) = \sum_{w \in W} T_I\left( \Delta I(\vartheta), I^{T} \right), \quad (5)

where TI, I^T, and ΔI are defined in (3) and (2), respectively. With each of the neighbourhood similarity measurements we associate a threshold function:

T_{\eta\theta}\left(\eta\theta(\vartheta), \eta^{T}\right) = \begin{cases} 1, & \text{if } \eta\theta(\vartheta) > \eta^{T}, \\ 0, & \text{otherwise}, \end{cases} \qquad T_{\mu I}\left(\mu I(\vartheta), \mu^{T}\right) = \begin{cases} 1, & \text{if } \mu I(\vartheta) > \mu^{T}, \\ 0, & \text{otherwise}, \end{cases} \quad (6)

where η^T and μ^T are intrinsic parameters of the threshold functions of the neighborhood similarity measurements.

3.1.2. Scene Modeling. Our background model (BG) will be represented with two classes of components, namely, running components (RCs) and training components (TCs). The RC is a color vector in RGB space, and only this component can be updated in the running process. The TC is a set of fixed threshold values obtained during the training.

The background model is represented by

BG(x) = \left\{ p(x), \bar{T}_\theta(x), \bar{T}_I(x), W \right\}, \quad (7)

where \bar{T}_\theta(x) is the maximum of the chromaticity variation, \bar{T}_I(x) is the maximum of the intensity variation, and W is the half size of the neighbourhood window.

A training process has to be performed to obtain the background parameters defined by (7). This first step consists of estimating the value of the RC and TC during the training period. To initialize our BG we put the RC = {p0(x)} as the initial frame. \bar{T}_\theta(x) and \bar{T}_I(x) are estimated during the training period by computing the angular difference and the Euclidean distance between the pixel belonging to the previous frame and the pixel belonging to the current frame:

\bar{T}_\theta(x) = \max_{f \in \{1, 2, \ldots, F\}} \Delta\theta\left( p_{f-1}(x), p_f(x) \right), \qquad \bar{T}_I(x) = \max_{f \in \{1, 2, \ldots, F\}} \Delta I\left( p_{f-1}(x), p_f(x) \right), \quad (8)

where F is the number of frames in the training period.
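The training stage of (8) can be sketched in NumPy as follows; the helper functions restate the pixelwise measures (1) and (2), and the random frame list is only a stand-in for a real background-only training sequence.

```python
import numpy as np

# Sketch of the training stage in (8): the per-pixel thresholds are the maxima
# of the frame-to-frame angular and Euclidean color changes over the training
# frames, and the running component is initialized to the first frame.
def _dtheta(p, q, eps=1e-6):
    dot = (p * q).sum(-1)
    norms = np.linalg.norm(p, axis=-1) * np.linalg.norm(q, axis=-1) + eps
    return np.arccos(np.clip(dot / norms, -1.0, 1.0))

def _di(p, q):
    return np.linalg.norm(p - q, axis=-1)

def train_background(frames):
    """frames: list of HxWx3 float arrays assumed to contain background only."""
    rc = frames[0].copy()                       # running component p(x)
    t_theta = np.zeros(rc.shape[:2])            # max chromatic variation
    t_i = np.zeros(rc.shape[:2])                # max intensity variation
    for prev, cur in zip(frames[:-1], frames[1:]):
        t_theta = np.maximum(t_theta, _dtheta(prev, cur))
        t_i = np.maximum(t_i, _di(prev, cur))
    return rc, t_theta, t_i

training = [np.random.rand(120, 160, 3) for _ in range(10)]
model = train_background(training)
print([a.shape for a in model])
```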


Figure 4: Sample visual results of our background subtraction algorithm in various environments. (a) Background Image, (b) Current
Image, and (c) Foreground (red) /Shadows (green) /Background (black) detection. (1) PETS 2009 View 7, (2) PETS 2009 View 8, (3) ATON
(Laboratory), (4) ISELAB (ETSE Outdoor), (5) LVSN (HallwayI), (6) VSSN, and (7) ATON (Intelligentroom).

3.2. Classification Process. Our classification rules consist of two steps.

Step One. Pixels that have a strong dissimilarity with the background are classified directly as foreground, in the case when the following rule expression is equal to 1 (TRUE):

Fr(x) = T_\theta\!\left( \Delta\theta\left(p_{bg}(x), p_f(x)\right), \gamma_\theta \right) \cap T_I\!\left( \Delta I\left(p_{bg}(x), p_f(x)\right), \gamma_I \right), \quad (9)

where γθ and γI are experimental scale factors. Otherwise, when (9) is not TRUE, the classification has to be done in the following step.

Step Two. This step consists of two test rules. One verifies a test pixel for the shadow class (10) and the other verifies it for the foreground class (11):

Sh(x) = T_{\mu I}\!\left( \mu I\!\left( \left(p_{bg}(x+w), p_f(x+w)\right), \gamma_I \bar{T}_I(x) \right), k_{FI} \right) \cap \left( \| p_{bg}(x) \| > \| p_f(x) \| \right) \cap \left( 1 - T_{\eta\theta}\!\left( \eta\theta\!\left( \left(p_{bg}(x+w), p_f(x+w)\right), \gamma_\theta \bar{T}_\theta(x) \right), k_{S\theta} \right) \right) \cap \left( 1 - T_{\mu I}\!\left( \mu I\!\left( \left(p_{bg}(x+w), p_f(x+w)\right), \gamma_S \bar{T}_I(x) \right), k_{SI} \right) \right), \quad (10)

Fr(x) = T_{\mu I}\!\left( \mu I\!\left( \left(p_{bg}(x+w), p_f(x+w)\right), \gamma_I \bar{T}_I(x) \right), k_{FI} \right) \cap \left( 1 - Sh(x) \right). \quad (11)

The rest of the pixels, those not classified as shadow or foreground, must be classified as background pixels. Figure 1 illustrates the classification regions. All the implemented thresholds were obtained on the basis of a tuning process with different video sequences (γθ = 10°, γI = 55, γI = 10, γθ = 2°, γS = 80, and kFI = kSθ = kSI = 1).

3.3. Model Updating. In order to maintain the stability of the background model over time, the model needs to be dynamically updated. As explained before, only the RCs have to be updated. The update process is done at every frame, but only in the case when the updated pixels are classified as background. The model is updated as follows:

p_c^{bg}(x, t) = \beta\, p_c^{bg}(x, t-1) + (1 - \beta)\, p_c^{f}(x, t), \quad c \in \{R, G, B\}, \quad (12)

where β (0 < β < 1) is the update rate. Based on our experiments the value of this parameter has to be β = 0.45.

4. Experimental Results

In this section we present the performance of our approach in terms of quantitative and qualitative results applied to 5 well-known datasets taken from 7 different video sequences: PETS 2009 (http://www.cvg.rdg.ac.uk/ (View 7 and 8)), ATON (http://cvrr.ucsd.edu/aton/shadow/ (Laboratory and Intelligentroom)), ISELAB (http://iselab.cvc.uab.es (ETSE Outdoor)), LVSN (http://vision.gel.ulaval.ca/CastShadows/ (HallwayI)), and VSSN (http://mmc36.informatik.uni-augsburg.de/VSSN06 OSAC/).

Quantitative Results. We have applied our proposed algorithm to several indoor and outdoor video scenes. Ground-truth masks have been manually extracted to numerically evaluate and compare the performance of our proposed technique with respect to the most similar state-of-the-art approaches [6-9]. Two metrics were considered to evaluate the segmentation results, namely, False Positive Error (FPE) and False Negative Error (FNE). FPE means that background pixels were set as Foreground, while FNE indicates that foreground pixels were identified as Background. We show this comparison in terms of accuracy in Figure 2:

\text{Error}(\%) = \frac{\text{No. of misclassified pixels}}{\text{No. of correct foreground pixels}} \times 100\%. \quad (13)

Qualitative Results. Figure 3 shows a visual comparison between our technique and some well-known methods. It can be seen that our method performs better in terms of camouflage area segmentation and suppression of strong shadows. Visual results are also shown in Figure 4, where we have applied our method to several sequences. It can be seen that the foreground objects are detected without shadows, thereby preserving their shape properly.

5. Conclusions

This paper proposes an efficient background subtraction technique which overcomes difficulties like illumination changes and moving shadows. The main novelty of our method is the incorporation of two discriminative similarity measures based on angular and Euclidean distance patterns in local neighborhoods. Such patterns are used to improve foreground detection in the presence of moving shadows and strong similarities in color between background and foreground. Experimental results over a collection of public and own datasets of real image sequences demonstrate the effectiveness of the proposed technique. The method shows an excellent performance in comparison with other methods. Most recent approaches are based on very complex models designed to achieve an extremely effective classification; however, these approaches become unfeasible for real-time applications. Alternatively, our proposed method exhibits low computational and space complexities that make our proposal very appropriate for real-time processing in surveillance systems with low-resolution cameras or Internet webcams.
in terms of quantitative and qualitative results applied to 5 cams.

Acknowledgments [14] A. Mittal and N. Paragios, “Motion-based background sub-


traction using adaptive kernel density estimation,” in Proceed-
This work has been supported by the Spanish Research Pro- ings of the IEEE Computer Society Conference on Computer
grams Consolider-Ingenio 2010:MIPRCV (CSD200700018) Vision and Pattern Recognition (CVPR ’04), vol. 2, pp. 302–309,
and Avanza I+D ViCoMo (TSI-020400-2009-133) and by Washington, DC, USA, July 2004.
the Spanish projects TIN2009-14501-C02-01 and TIN2009- [15] Y.-T. Chen, C.-S. Chen, C.-R. Huang, and Y.-P. Hung,
14501-C02-02. “Efficient hierarchical method for background subtraction,”
Pattern Recognition, vol. 40, no. 10, pp. 2706–2715, 2007.
[16] L. Li, W. Huang, I. Y.-H. Gu, and Q. Tian, “Statistical modeling
of complex backgrounds for foreground object detection,”
References IEEE Transactions on Image Processing, vol. 13, no. 11, pp.
1459–1472, 2004.
[1] M. Karaman, L. Goldmann, D. Yu, and T. Sikora, “Compar- [17] J. Zhong and S. Sclaroff, “Segmenting foreground objects from
ison of static background segmentation methods,” in Visual a dynamic textured background via a robust Kalman filter,”
Communications and Image Processing, vol. 5960 of Proceedings in Proceedings of the 9th IEEE International Conference on
of SPIE, no. 4, pp. 2140–2151, 2005. Computer Vision (ICCV ’03), pp. 44–50, Nice, France, October
[2] M. Piccardi, “Background subtraction techniques: a review,” 2003.
in Proceedings of the IEEE International Conference on Systems,
Man and Cybernetics (SMC ’04), vol. 4, pp. 3099–3104, The
Hague, The Netherlands, October 2004.
[3] A. McIvor, “Background subtraction techniques,” in Proceed-
ings of the International Conference on Image and Vision
Computing, Auckland, New Zealand, 2000.
[4] A. Prati, I. Mikic, M. M. Trivedi, and R. Cucchiara, “Detecting
moving shadows: algorithms and evaluation,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 25, no.
7, pp. 918–923, 2003.
[5] G. Obinata and A. Dutta, Vision Systems: Segmentation
and Pattern Recognition, I-TECH Education and Publishing,
Vienna, Austria, 2007.
[6] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: real-time
surveillance of people and their activities,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp.
809–830, 2000.
[7] T. Hoprasert, D. Harwood, and L. S. Davis, “A statistical
approach for real-time robust background subtraction and
shadow detection,” in Proceedings of the 7th IEEE International
Conference on Computer Vision, Frame Rate Workshop (ICCV
’99), vol. 4, pp. 1–9, Kerkyra, Greece, September 1999.
[8] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis,
“Real-time foreground-background segmentation using code-
book model,” Real-Time Imaging, vol. 11, no. 3, pp. 172–185,
2005.
[9] C. Stauffer and W. E. L. Grimson, “Learning patterns of
activity using real-time tracking,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757,
2000.
[10] S. J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H.
Wechsler, “Tracking groups of people,” Computer Vision and
Image Understanding, vol. 80, no. 1, pp. 42–56, 2000.
[11] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti,
“Improving shadow suppression in moving object detection
with HSV color information,” in Proceedings of the IEEE
Intelligent Transportation Systems Proceedings, pp. 334–339,
Oakland, Calif, USA, August 2001.
[12] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower:
principles and practice of background maintenance,” in Pro-
ceedings of the 7th IEEE International Conference on Computer
Vision (ICCV ’99), vol. 1, pp. 255–261, Kerkyra, Greece,
September 1999.
[13] A. Elgammal, D. Harwood, and L. S. Davis, “Nonparametric
background model for background subtraction,” in Proceed-
ings of the European Conference on Computer Vision (ECCV
’00), pp. 751–767, Dublin, Ireland, 2000.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 265631, 12 pages
doi:10.1155/2010/265631

Research Article
Improving Density Estimation by
Incorporating Spatial Information

Laura M. Smith, Matthew S. Keegan, Todd Wittman,


George O. Mohler, and Andrea L. Bertozzi
Department of Mathematics, University of California, Los Angeles, CA 90095, USA

Correspondence should be addressed to Laura M. Smith, [email protected]

Received 1 December 2009; Accepted 9 March 2010

Academic Editor: Alan van Nevel

Copyright © 2010 Laura M. Smith et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Given discrete event data, we wish to produce a probability density that can model the relative probability of events occurring
in a spatial region. Common methods of density estimation, such as Kernel Density Estimation, do not incorporate geographical
information. Using these methods could result in nonnegligible portions of the support of the density in unrealistic geographic
locations. For example, crime density estimation models that do not take geographic information into account may predict events
in unlikely places such as oceans, mountains, and so forth. We propose a set of Maximum Penalized Likelihood Estimation methods
based on Total Variation and H1 Sobolev norm regularizers in conjunction with a priori high resolution spatial data to obtain more
geographically accurate density estimates. We apply this method to a residential burglary data set of the San Fernando Valley using
geographic features obtained from satellite images of the region and housing density information.

1. Introduction live or work [1]. Some law enforcement agencies currently


use software that makes predictions in unrealistic geographic
High resolution and hyperspectral satellite images, city and locations. Methods that incorporate geographic information
county boundary maps, census data, and other types of have recently been proposed and are an active area of research
geographical data provide much information about a given [2, 3].
region. It is desirable to integrate this knowledge into models A common method for creating a probability density is
defining geographically dependent data. Given spatial event to use Kernel Density Estimation [4, 5], which approximates
data, we will be constructing a probability density that the true density by a sum of kernel functions. A popular
estimates the probability that an event will occur in a choice for the kernel is the Gaussian distribution which is
region. Often, it is unreasonable for events to occur in smooth, spatially symmetric and has noncompact support.
certain regions, and we would like our model to reflect Other probability density estimation methods include the
this restriction. For example, residential burglaries and other taut string, logspline, and the Total Variation Maximum
types of crimes are unlikely to occur in oceans, mountains, Penalized Likelihood Estimation models [6–10]. However,
and other regions. Such areas can be determined using none of these methods utilize information from external
aerial images or other external spatial data, and we denote spatial data. Consequently, the density estimate typically
these improbable locations as the invalid region. Ideally, the has some nonzero probability of events occurring in the
support of our density should be contained in the valid invalid region. Figure 1 demonstrates these problems with
region. the current methods and how the methods we will propose
Geographic profiling, a related topic, is a technique used in this paper resolve them. Located in the middle of the
to create a probability density from a set of crimes by a image are two disks where events cannot occur, depicted in
single individual to predict where the individual is likely to Figure 1(a). We selected randomly from the region outside

the disks using a predefined probability density, which is provided in Figure 1(b). The 4,000 events chosen are shown in Figure 1(c). With a variance of σ = 2.5, we see in Figure 1(d) that the Kernel Density Estimation predicts that events may occur in our invalid region.

In this paper we propose a novel set of models that restrict the support of the density estimate to the valid region and ensure realistic behavior. The models use Maximum Penalized Likelihood Estimation [11, 12], which is a variational approach. The density estimate is calculated as the minimizer of some predefined energy functional. The novelty of our approach is in the way we define the energy functional with explicit dependence on the valid region, such that the density estimate obeys our assumptions on its support. The results from our methods for this simple example are illustrated in Figures 1(f), 1(g), and 1(h).

The paper is structured in the following way. In Section 2 Maximum Penalized Likelihood Methods are introduced. In Sections 3 and 4 we present our set of models, which we name the Modified Total Variation MPLE model and the Weighted H1 Sobolev MPLE model, respectively. In Section 5 we discuss the implementation and numerical schemes that we use to solve for the solutions of the models. We provide examples for validation of the models and an example with actual residential burglary data in Section 6. In this section, we also compare our results to the Kernel Density Estimation model and other Total Variation MPLE methods. Finally, we discuss our conclusions and future work in Section 7.

2. Maximum Penalized Likelihood Estimation

Assuming that u(x) is the desired probability density for x ∈ R^2, and that the known events occur at locations x1, x2, ..., xn, Maximum Penalized Likelihood Estimation (MPLE) models are given by

\hat{u}(x) = \arg\min_{\int_\Omega u\,dx = 1,\; 0 \le u} \Big\{ P(u) - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) \Big\}. \quad (1)

Here, P(u) is a penalty functional, which is generally designed to produce a smooth density map. The parameter μ determines how strongly weighted the maximum likelihood term is, compared to the penalty functional.

A range of penalty functionals has been proposed, including P(u) = \int_\Omega |\nabla \sqrt{u}\,|^2\,dx [11, 12] and P(u) = \int_\Omega |\nabla(\log u)|^2\,dx [4, 11]. More recently, variants of the Total Variation (TV) functional [13], P(u) = \int_\Omega |\nabla u|\,dx, have been proposed for MPLE [8-10]. These methods do not explicitly incorporate the information that can be obtained from external spatial data, although some note the need to allow for various domains. Even though the TV functional will maintain sharp gradients, the boundaries of the constant regions do not necessarily agree with the boundaries within the image. This method also performs poorly when the data is too sparse, as the density is smoothed to have equal probability almost everywhere. Figure 1(e) demonstrates this, in addition to how this method predicts events in the invalid region with nonnegligible estimates.

The methods we propose use a penalty functional that depends on the valid region determined from the geographical images or other external spatial data. Figure 1 demonstrates how these models improve on the current methods.

3. The Modified Total Variation MPLE Model

The first model we propose is an extension of the Maximum Penalized Likelihood Estimation method given by Mohler et al. [10]:

\hat{u}_1(x) = \arg\min_{\int_\Omega u\,dx = 1,\; 0 \le u} \Big\{ \int_\Omega |\nabla u|\,dx - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) \Big\}. \quad (2)

Once we have determined a valid region, we wish to align the level curves of the density function u with the boundary of the valid region. The Total Variation functional is well known to allow discontinuities in its minimizing solution [13]. By aligning the level curves of the density function with the boundary, we encourage a discontinuity to occur there to keep the density from smoothing into the invalid region.

Since ∇u/|∇u| gives the unit normal vectors to the level curves of u, we would like

\frac{\nabla(\mathbf{1}_D)}{|\nabla(\mathbf{1}_D)|} = \frac{\nabla u}{|\nabla u|}, \quad (3)

where 1_D is the characteristic function of the valid region D. The region D is obtained from external spatial data, such as aerial images. To avoid division by zero, we use \theta := \nabla(\mathbf{1}_D)/|\nabla(\mathbf{1}_D)|_\varepsilon, where |\nabla v|_\varepsilon = \sqrt{v_x^2 + v_y^2 + \varepsilon^2}. To align the density function and the boundary one would want to minimize |\nabla u| - \theta \cdot \nabla u. Integrating this and applying integration by parts, we obtain the term \int_\Omega |\nabla u| + u\,\nabla\cdot\theta\,dx. We propose the following Modified Total Variation penalty functional, where we adopt the more general form of the above functional:

\hat{u}_1(x) = \arg\min_{\int_\Omega u\,dx = 1,\; 0 \le u} \Big\{ \int_\Omega |\nabla u|\,dx + \lambda \int_\Omega u\,\nabla\cdot\theta\,dx - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) \Big\}. \quad (4)

The parameter λ allows us to vary the strength of the alignment term. Two pan-sharpening methods, P + XS and Variational Wavelet Pan-sharpening [14, 15], both include a similar term in their energy functional to align the level curves of the optimal image with the level curves of the high resolution pan-chromatic image.

4. The Weighted H1 Sobolev MPLE Model

A Maximum Penalized Likelihood Estimation method with penalty functional \int_\Omega \frac{1}{2}|\nabla u|^2\,dx, the H1 Sobolev norm, gives results equivalent to those obtained using Kernel

[Figure 1 appears here; color scale from 0 to about 3.6e-4 (top row) and 0 to about 3.3e-4 (bottom row). Panels: (a) Valid region, (b) True density, (c) 4,000 events, (d) Kernel density estimate, (e) TV MPLE, (f) Our modified TV MPLE method, (g) Our weighted H1 MPLE method, (h) Our weighted TV MPLE method.]

Figure 1: This is a motivating example that demonstrates the problem with existing methods and how our methods will improve density
estimates. (a) and (b) give the valid region to be considered and the true density for the example. Figure (c) gives the 4000 events sampled
from the true density. (d) and (e) show two of the current methods used. (f), (g), and (h) show how our methods will produce better
estimates. The color scale represents the relative probability of an event occurring in a given pixel. The images are 80 pixels by 80 pixels.

Density Estimation [11]. We enforce the H1 regularizer term away from the boundary of the invalid region. This results in the model

\hat{u}(x) = \arg\min_{\int_\Omega u\,dx = 1,\; 0 \le u} \Big\{ \frac{1}{2} \int_{\Omega \setminus \partial D} |\nabla u|^2\,dx - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) \Big\}. \quad (5)

This new term is essentially the smoothness term from the Mumford-Shah model [16]. We approximate the H1 term by introducing the Ambrosio-Tortorelli approximating function z_\varepsilon(x) [17], where z_\varepsilon \to (1 - \delta(\partial D)) in the sense of distributions. More precisely, we use a continuous function which has the property

z_\varepsilon(x) = \begin{cases} 1 & \text{if } d(x, \partial D) > \varepsilon, \\ 0 & \text{if } x \in \partial D. \end{cases} \quad (6)

Thus, the minimization problem becomes

\hat{u}(x) = \arg\min_{\int_\Omega u\,dx = 1,\; 0 \le u} \Big\{ \frac{1}{2} \int_\Omega z_\varepsilon^2 |\nabla u|^2\,dx - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) \Big\}. \quad (7)

The weighting away from the edges is used to control the diffusion into the invalid region. This method of weighting away from the edges can also be used with the Total Variation functional in our first model, and we will refer to this as our Weighted TV MPLE model.

5. Implementation

5.1. The Constraints. In the implementation of the Modified Total Variation MPLE method and the Weighted H1 MPLE method, we must enforce the constraints 0 ≤ u(x) and \int_\Omega u(x)\,dx = 1 to ensure that u(x) is a probability density estimate. The u ≥ 0 constraint will be satisfied in our numerical solution by solving quadratic equations that have at least one nonnegative root.

We enforce the second constraint by first adding it to the energy functional as an L2 penalty term. For the H1 method, this change results in the new minimization problem

\hat{u}_H(x) = \arg\min_{u} \Big\{ \frac{1}{2} \int_\Omega z_\varepsilon^2 |\nabla u|^2\,dx - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) + \frac{\gamma}{2} \Big( \int_\Omega u(x)\,dx - 1 \Big)^2 \Big\}, \quad (8)

where we have denoted \hat{u}_H(x) as the solution of the H1 model. The constraint is then enforced by applying Bregman iteration [18]. Using this method, we formulate our problem as

(u_H, b_H) = \arg\min_{u, b} \Big\{ \frac{1}{2} \int_\Omega z_\varepsilon^2 |\nabla u|^2\,dx - \mu \sum_{i=1}^{n} \log\left(u(x_i)\right) + \frac{\gamma}{2} \Big( \int_\Omega u(x)\,dx + b - 1 \Big)^2 \Big\}, \quad (9)

where $b$ is introduced as the Bregman variable of the sum-to-unity constraint. We solve this problem using alternating minimization, updating the $u$ and the $b$ iterates as

$$(\mathrm{H1}) \quad \begin{cases} u^{(k+1)} = \operatorname*{argmin}_u \left\{ \dfrac{1}{2}\displaystyle\int_{\Omega} z_\epsilon^2\, |\nabla u|^2\,dx - \mu \sum_{i=1}^{n} \log(u(x_i)) + \dfrac{\gamma}{2}\left( \displaystyle\int_\Omega u(x)\,dx + b^{(k)} - 1 \right)^2 \right\}, \\[1mm] b^{(k+1)} = b^{(k)} + \displaystyle\int_\Omega u^{(k+1)}\,dx - 1, \end{cases} \qquad (10)$$

with $b^{(0)} = 0$. Similarly, for the modified TV method, we solve the alternating minimization problem

$$(\mathrm{TV}) \quad \begin{cases} u^{(k+1)} = \operatorname*{argmin}_u \left\{ \displaystyle\int_{\Omega} |\nabla u|\,dx + \lambda \displaystyle\int_\Omega u\,\nabla\cdot\theta\,dx - \mu \sum_{i=1}^{n} \log(u(x_i)) + \dfrac{\gamma}{2}\left( \displaystyle\int_\Omega u(x)\,dx + b^{(k)} - 1 \right)^2 \right\}, \\[1mm] b^{(k+1)} = b^{(k)} + \displaystyle\int_\Omega u^{(k+1)}\,dx - 1, \end{cases} \qquad (11)$$

with $b^{(0)} = 0$.

5.2. Weighted H1 MPLE Implementation. For the Weighted H1 MPLE model, the Euler-Lagrange equation for the $u$ minimization is given by

$$(\mathrm{H1}) \quad -\nabla\cdot\left( z_\epsilon^2\, \nabla u \right) - \frac{\mu}{u(x)} \sum_{i=1}^{n} \delta(x - x_i) + \gamma \left( \int_\Omega u(x)\,dx + b^{(k)} - 1 \right) = 0. \qquad (12)$$

We solve this using a Gauss-Seidel method with central differences for the $\nabla z_\epsilon^2$ and $\nabla u$. Once we have discretized the partial differential equation, solving this equation simplifies to solving the quadratic

$$\left( 4 z_\epsilon^2 + \gamma \right) u_{i,j}^2 - \alpha_{i,j}\, u_{i,j} - \mu\, w_{i,j} = 0 \qquad (13)$$

for the positive root, where

$$\alpha_{i,j} = z_{i,j}^2 \left( u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} \right) + \frac{z_{i+1,j}^2 - z_{i-1,j}^2}{2}\,\frac{u_{i+1,j} - u_{i-1,j}}{2} + \frac{z_{i,j+1}^2 - z_{i,j-1}^2}{2}\,\frac{u_{i,j+1} - u_{i,j-1}}{2} + \gamma \left( 1 - b^{(k)} - \sum_{(i',j') \neq (i,j)} u_{i',j'} \right), \qquad (14)$$

and where $w_{i,j}$ is the given number of sampled events that occurred at the location $(i, j)$. We chose our parameters $\mu$ and $\gamma$ so that the Gauss-Seidel solver will converge. In particular, we have $\mu = O((NM)^{-2})$ and $\gamma = O(\mu(NM))$, where the image is $N \times M$.

5.3. Modified TV MPLE Implementation. There are many approaches for handling the minimization of the Total Variation penalty functional. A fast and simple method for doing this is to use the Split Bregman technique (see [10, 19] for an in-depth discussion; see also [20]). In this approach, we substitute the variable $d$ for $\nabla u$ in the TV norm and then enforce the equality $d = \nabla u$ using Bregman iteration. To apply Bregman iteration, we introduce the variable $g$ as the Bregman vector of the $d = \nabla u$ constraint. This results in a minimization problem in which we minimize both $d$ and $u$. Beginning the iteration with $g^{(0)} = 0$, the minimization is written as

$$\begin{aligned} \left( u^{(k+1)}, d^{(k+1)} \right) &= \operatorname*{argmin}_{u,\, d} \left\{ \|d\|_1 + \lambda \int_\Omega u\,\nabla\cdot\theta\,dx - \mu \sum_{i=1}^{n} \log(u(x_i)) + \frac{\gamma}{2}\left( \int_\Omega u(x)\,dx + b^{(k)} - 1 \right)^2 + \frac{\alpha}{2}\left\| d - \nabla u - g^{(k)} \right\|_2^2 \right\}, \\ g^{(k)} &= g^{(k-1)} + \nabla u^{(k)} - d^{(k)}. \end{aligned} \qquad (15)$$

Alternating the minimization of $u^{(k+1)}$ and $d^{(k+1)}$, we obtain our final formulation for the TV model as

$$(\mathrm{TV}) \quad \begin{cases} u^{(k+1)} = \operatorname*{argmin}_u \left\{ \lambda \displaystyle\int_\Omega u\,\nabla\cdot\theta\,dx - \mu \sum_{i=1}^{n} \log(u(x_i)) + \dfrac{\gamma}{2}\left( \displaystyle\int_\Omega u(x)\,dx + b^{(k)} - 1 \right)^2 + \dfrac{\alpha}{2}\left\| d^{(k)} - \nabla u - g^{(k)} \right\|_2^2 \right\}, \\[1mm] d^{(k+1)} = \operatorname{shrink}\!\left( \nabla u^{(k+1)} + g^{(k)},\, \dfrac{1}{\alpha} \right), \\[1mm] g^{(k+1)} = g^{(k)} + \nabla u^{(k+1)} - d^{(k+1)}, \\[1mm] b^{(k+1)} = b^{(k)} + \displaystyle\int_\Omega u^{(k+1)}\,dx - 1. \end{cases} \qquad (16)$$

The shrink function is given by

$$\operatorname{shrink}(z, \eta) = \frac{z}{|z|} \max\left( |z| - \eta,\, 0 \right). \qquad (17)$$

Solving for $d^{(k+1)}$ and $g^{(k+1)}$, we use forward difference discretizations, namely

$$\nabla u^{(k+1)} = \left( u_{i+1,j} - u_{i,j},\ u_{i,j+1} - u_{i,j} \right)^T. \qquad (18)$$
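To make the discrete updates above concrete, the following short Python sketch implements the two scalar building blocks used by both solvers: the shrink (soft-thresholding) operator of (17) and the selection of the nonnegative root of a per-pixel quadratic of the form appearing in (13) (and in (20) below). It is a minimal illustration, not the authors' code; the array shapes and the helper names are our own.

```python
import numpy as np

def shrink(z, eta):
    """Soft-thresholding operator of (17), applied per pixel.

    For a vector-valued field z (e.g., a gradient stacked along the last
    axis), the magnitude |z| is taken per pixel so that the whole vector
    is scaled, matching shrink(z, eta) = z/|z| * max(|z| - eta, 0).
    """
    norm = np.sqrt(np.sum(z ** 2, axis=-1, keepdims=True))
    scale = np.maximum(norm - eta, 0.0) / np.maximum(norm, 1e-12)
    return z * scale

def positive_root(c2, c1, c0):
    """Nonnegative root of c2*u^2 - c1*u - c0 = 0 (c2 > 0, c0 >= 0).

    This is the per-pixel update used in the Gauss-Seidel sweeps of the
    Weighted H1 model (13) and the Modified TV model (20); taking the
    '+' branch of the quadratic formula guarantees u >= 0.
    """
    return (c1 + np.sqrt(c1 ** 2 + 4.0 * c2 * c0)) / (2.0 * c2)

# Small usage example: one soft-thresholding step on a random gradient field
# and one scalar positive-root update.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.normal(size=(80, 80, 2))      # forward-difference gradient, as in (18)
    d = shrink(grad, eta=0.5)                # TV splitting variable update
    u = positive_root(4.0 + 1e-3, 2.0, 0.1)  # illustrative coefficients, not fitted values
    print(d.shape, u)
```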

The Euler-Lagrange equation for the variable $u^{(k+1)}$ is

$$-\frac{\mu}{u(x)} \sum_{i=1}^{n} \delta(x - x_i) + \lambda\,\mathrm{div}\,\theta - \alpha\,\Delta u + \mathrm{div}\, g^{k} - \mathrm{div}\, d^{k} + \gamma \left( \int_\Omega u\,dx + b^{(k)} - 1 \right) = 0. \qquad (19)$$

Discretizing this simplifies to solving for the positive root of

$$\left( 4\alpha + \gamma \right) u_{i,j}^2 - \beta_{i,j}\, u_{i,j} - \mu\, w_{i,j} = 0, \qquad (20)$$

where

$$\begin{aligned} \beta_{i,j} = {}& \alpha \left( u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} \right) - \lambda\,\mathrm{div}\,\theta \\ &- \alpha \left( d^{k}_{x,i,j} - d^{k}_{x,i-1,j} + d^{k}_{y,i,j} - d^{k}_{y,i,j-1} \right) + \alpha \left( g^{k}_{x,i,j} - g^{k}_{x,i-1,j} + g^{k}_{y,i,j} - g^{k}_{y,i,j-1} \right) \\ &+ \gamma \left( 1 - b^{(k)} - \sum_{(i',j') \neq (i,j)} u_{i',j'} \right). \end{aligned} \qquad (21)$$

We solved for $u^{(k+1)}$ with a Gauss-Seidel solver. Heuristically, we found that using the relationships $\alpha = 2\mu N^2 M^2$ and $\gamma = 2\mu N M$ was sufficient for the solver to converge and provide good results. We also set $\lambda$ to have values between 1.0 and 1.2. The parameter $\mu$ is the last remaining free parameter. This parameter can be chosen using V-fold cross validation or other techniques, such as the sparsity l1 information criterion [8].

6. Results

In this section, we demonstrate the strengths of our models by providing several examples. We first show how our methods compare to existing methods for a dense data set. We then show that our methods perform well for sparse data sets. Next, we explore an example with an aerial image and randomly selected events to show how these methods could be applied to geographic event data. Finally, we calculate probability density estimates for residential burglaries using our models.

6.1. Model Validation Example. To validate the use of our methods, we took a predefined probability map with sharp gradients that is shown in Figure 2(a). The chosen valid region and the 8,000 selected events are displayed in Figures 2(b) and 2(c), respectively. We performed density estimates with the Gaussian Kernel Density Estimate and the Total Variation MPLE method. The variance used for the Kernel Density Estimation is σ = 2. The results are provided in Figures 2(d) and 2(e). The density estimates obtained from our Modified TV MPLE method and Weighted H1 MPLE method are shown in Figures 2(f) and 2(g), respectively. We also included our Weighted TV MPLE in Figure 2(h).

Our methods maintain the boundary of the invalid region and appear close to the true solution. In addition, they keep the sharp gradient in the density estimate. The L2 errors for these methods are located in Table 1.

Table 1: This is the L2 error comparison of the five methods shown in Figure 2. Our proposed methods performed better than both the Kernel Density Estimation method and the TV MPLE method.

    Method                       8,000 Events
    Kernel density estimate      8.1079e-6
    TV MPLE                      6.6155e-6
    Modified TV MPLE             4.1213e-6
    Weighted H1 MPLE             3.8775e-6
    Weighted TV MPLE             4.3195e-6

6.2. Sparse Data Example. Crimes and other types of events may be quite sparse in a given geographical region. Consequently, it becomes difficult to determine the probability that an event will occur in the area. It is challenging for density estimation methods that do not incorporate the spatial information to distinguish between invalid regions and areas that have not had any crimes but are still likely to have events. Using the same predefined probability density from Section 1 in Figure 1(b), we demonstrate how our methods maintain these invalid regions for sparse data. The 40 events selected are shown in Figure 3(b). The density estimates for current methods and our methods are given in Figure 3. We used a variance σ = 15 for the Gaussian Kernel Density Estimate.

For this sparse problem, our Weighted H1 MPLE and Modified TV MPLE methods maintain the boundary of the invalid region and appear close to the true solution. Table 2 contains the L2 errors for both this example of 40 events and the example of 4,000 events from the introduction.

Table 2: This is the L2 error comparison of the five methods for both the introductory example shown in Figure 1 and the sparse example shown in Figure 3. Our proposed methods performed better than both the Kernel Density Estimation method and the TV MPLE method.

    Method                       40 Events      4,000 Events
    Kernel density estimate      2.3060e-5      7.3937e-6
    TV MPLE                      2.5347e-5      7.7628e-6
    Modified TV MPLE             1.4345e-5      5.7996e-7
    Weighted H1 MPLE             3.8449e-6      2.1823e-6
    Weighted TV MPLE             1.5982e-5      3.6179e-6

Table 3: This is the L2 error comparison of the three methods for the Orange County Coastline example shown in Figures 7, 8, and 9. Our proposed methods performed better than the Kernel density estimation method.

    Method                       200 Events     2,000 Events   20,000 Events
    Kernel density estimate      7.0338e-7      2.8847e-7      1.5825e-7
    Modified TV MPLE             3.0796e-7      2.6594e-7      8.9353e-8
    Weighted H1 MPLE             5.4658e-7      1.5988e-7      5.8038e-8

Notice that our Modified TV and Weighted H1 MPLE methods performed the best for both examples. The Weighted H1 MPLE was exceptionally better for the sparse data set. The Weighted TV MPLE method does not perform as well for sparse data sets and fails to keep the boundary of the valid

Figure 2: This is a model-validating example with dense data set of 8000 events. The piecewise-constant true density is given in (a), and the valid region is provided in (b). The sampled events are shown in (c). (d) and (e) show the two current density estimation methods, Kernel Density Estimation and TV MPLE. (f), (g), and (h) show the density estimates from our methods. The color scale represents the relative probability of an event occurring in a given pixel (0 to 3.5677e-4). The images are 80 pixels by 80 pixels.

Figure 3: This is a sparse example with 40 events. The true density is given in (a), and it is the same density from the example in the introduction. The sampled events are shown in (b). (c) and (d) show the two current density estimation methods, Kernel Density Estimation and TV MPLE. (e), (f), and (g) show the density estimates from our methods. The color scale represents the relative probability of an event occurring in a given pixel (0 to 3.6227e-4). The images are 80 pixels by 80 pixels.

Figure 4: This shows how we obtained our valid region for the Orange County Coastline example. Figure (a) is the initial aerial image of the region to be considered. The region of interest is about 15.2 km by 10 km. Figure (b) is the denoised version of the initial image. We took this denoised image and smoothed away from regions of large discontinuities to obtain figure (c).

Figure 5: After thresholding the intensity values of Figure 4(c), we obtain the valid region for the Orange County Coastline. This valid region is shown in (a). We then constructed a probability density shown in figure (b). The color scale represents the relative probability of an event occurring per square kilometer (0 to 0.0265).

Figure 6: From the probability density in Figure 5, we sampled 200, 2,000, and 20,000 events. These events are given in (a), (b), and (c), respectively.

Figure 7: These images are the Gaussian Kernel Density estimates for 200, 2,000, and 20,000 sampled events of the Orange County Coastline example (with σ = 35, 18, and 6.25, respectively). The color scale for these images is located in Figure 5.

Figure 8: These images are the Modified TV MPLE estimates for 200, 2,000, and 20,000 sampled events of the Orange County Coastline example. The color scale for these images is located in Figure 5.

Figure 9: These images are the Weighted H1 MPLE estimates for 200, 2,000, and 20,000 sampled events of the Orange County Coastline example. The color scale for these images is located in Figure 5.

region. Since the rest of the examples contain sparse data sets, we will omit the Weighted TV MPLE method from the remaining sections.

6.3. Orange County Coastline Example. To test the models with external spatial data, we obtained from Google Earth a region of the Orange County coastline with clear invalid regions (see Figure 4(a)). For the purposes of this example, it was determined to be impossible for events to occur in the ocean, rivers, or large parks located in the middle of the region. One may use various segmentation methods for selecting the valid region. For this example, we only have data from the true color aerial image, not multispectral data. To obtain the valid and invalid regions, we removed the "texture" (i.e., fine detailed features) using a Total Variation-based denoising algorithm [13]. The resulting image, shown in Figure 4(b), still contains detailed regions obtained from large features, such as large buildings. We wish to remove these and maintain prominent regional boundaries. Therefore, we smooth away from regions of large discontinuities. This is shown in Figure 4(c). Since oceans, rivers, parks, and other such areas generally have lower intensity values than other regions, we threshold to find the boundary between the valid and invalid regions. The final valid region is displayed in Figure 5(a).

From the valid region, we constructed a toy density map to represent the probability density for the example and to generate data. It is depicted in Figure 5(b). Regions with colors farther to the right on the color scale are more likely to have events. Sampling from this constructed density, we took distinct data sets of 200, 2,000, and 20,000 selected events, given in Figure 6. For each set of events, we included three probability density estimations for comparison. We first give the Gaussian Kernel Density Estimate, followed by our Modified Total Variation MPLE model and our Weighted H1 MPLE model. We provide all images together to allow for visual comparisons of the methods.

Summing up Gaussian distributions gives a smooth density estimate. Figure 7 contains the density estimates obtained using the Kernel Density Estimation model. The standard deviations σ of the Gaussians are given with each image. In all of these images, a nonzero density is estimated in the invalid region.

Taking the same set of events as the Kernel Density Estimation, the images in Figure 8 were obtained using our first model, the Modified Total Variation MPLE method with the boundary edge aligning term. The parameter λ must be sufficiently large in the TV method in order to prevent the diffusion of the density into the invalid region. In doing so, the boundary of the valid region may attain density values too large in comparison to the rest of the image when the size of the image is very large. To remedy this, we may take the resulting image from the algorithm, set the boundary of the valid region to zero, and rescale the image to have a sum of one. The invalid region in this case sometimes has a very small nonzero estimate. For visualization purposes we have set this to zero. However, we note that the method has the strength that density does not diffuse through small sections of the invalid region back into the valid region on the opposite side. Events on one side of an object, such as a lake or river, should not necessarily predict events on the other side.

The next set of images, in Figure 9, estimate the density using the same sets of event data but with our Weighted H1 MPLE model. Notice the difference in the invalid regions between our models and the Kernel Density Estimation model. This method does very well for the sparse data sets of 200 and 2,000 events.

6.3.1. Model Comparisons. The density estimates obtained from using our methods show a clear improvement in maintaining the boundary of the valid region. To determine how our models did in comparison to one another and to the Kernel Density Estimate, we calculated the L2 errors located in Table 3. Our models consistently outperform the Kernel Density Estimation model. The Weighted H1 MPLE method performs the best for the 2,000 and 20,000 events and visually appears closer to the true solution for the 200 events than the other methods. Qualitatively, we have noticed that with sparse data, the TV penalty functional gives results which are near constant. Thus, it gives a good L2 error for the Orange County Coastline example, which has a piecewise-constant true density, but gives a worse result for the sparse data example of Figure 3, where the true density has a nonzero gradient. Even though the Modified TV MPLE method has a lower L2 error in the Orange County Coastline example, the density estimation fails to give a good indication of regions of high and low likelihood.

6.4. Residential Burglary Example. The following example uses actual residential burglary information from the San Fernando Valley in Los Angeles. Figure 10 shows the area of interest and the locations of 4,487 burglaries that occurred in the region during 2004 and 2005. The aerial image was obtained using Google Earth. We assume that residential burglaries cannot occur in large parks, lakes, mountainous areas without houses, airports, and industrial areas. Using census or other types of data, housing density information for a given region can be calculated. Figure 10(c) is the housing density for our region of interest. The housing density provides us with the exact locations of where residential burglaries may occur. However, our methods prohibit the density estimates from spreading through the boundaries of the valid region. If we were to use this image directly as the valid region, then crimes on one side of a street would not have an effect on the opposite side of the road. Therefore, we fill in small holes and streets in the housing density image and use the image located in Figure 10(d) as our valid region.

Figure 10: These figures are for the San Fernando Valley residential burglary data. In (a), we have the aerial image of the region we are considering, which is about 16 km by 18 km. Figure (b) shows the residential burglaries of the region. Figure (c) gives the housing density for the San Fernando Valley. We show the valid region we obtained from the housing density in figure (d).

Figure 11: These images are the density estimates for the San Fernando Valley residential burglary data. (a) and (b) show the results of the current methods Kernel Density Estimation and TV MPLE, respectively. The results from our Modified TV MPLE method and our Weighted H1 MPLE method are shown in figures (c) and (d), respectively. The color scale represents the number of residential burglaries per year per square kilometer (0 to 17.5).

Using our Weighted H1 MPLE and Modified TV MPLE models, the Gaussian Kernel Density Estimate with variance σ = 21, and the TV MPLE method, we obtained the density estimations shown in Figure 11.
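As a rough illustration of the valid-region construction described in Section 6.3 (Total Variation denoising, smoothing, and thresholding of an aerial image), the following Python sketch builds a binary mask from a grayscale aerial image. It is only a schematic reconstruction under our own choice of libraries (scikit-image and SciPy) and parameter values, not the authors' pipeline; the weight, blur, and threshold values are placeholders.

```python
import numpy as np
from scipy import ndimage
from skimage import io, color, restoration

def valid_region_mask(image_path, tv_weight=0.1, blur_sigma=3.0, threshold=0.35):
    """Sketch of the valid-region extraction of Section 6.3.

    Steps: (1) remove fine texture with TV denoising [13], (2) smooth to
    suppress remaining detail, (3) threshold the low-intensity areas
    (water, parks) to obtain the invalid region. All parameter values
    here are illustrative placeholders.
    """
    img = io.imread(image_path)
    gray = color.rgb2gray(img[..., :3]) if img.ndim == 3 else img.astype(float)
    denoised = restoration.denoise_tv_chambolle(gray, weight=tv_weight)
    smoothed = ndimage.gaussian_filter(denoised, sigma=blur_sigma)
    valid = smoothed > threshold          # True where events are allowed
    # Fill small holes (e.g., streets) so density can connect across them,
    # as done for the housing-density mask in Section 6.4.
    valid = ndimage.binary_fill_holes(valid)
    return valid
```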

7. Conclusions and Future Work

In this paper we have studied the problem of determining a more geographically accurate probability density estimate. We demonstrate the importance of this problem by showing how common density estimation techniques, such as Kernel Density Estimation, fail to restrict the support of the density in a set of realistic examples.

To handle this problem, we proposed a set of methods, based on Total Variation and H1-regularized MPLE models, that demonstrates great improvements in accurately enforcing the support of the density estimate when the valid region has been provided a priori. Unlike the TV-regularized methods, our H1 model has the advantage that it performs well for very sparse data sets.

The effectiveness of the methods is shown in a set of examples in which burglary probability densities are approximated from a set of crime events. Regions in which burglaries are impossible, such as oceans, mountains, and parks, are determined using aerial images or other external spatial data. These regions are then used to define an invalid region in which the density should be zero. Therefore, our methods are used to build geographically accurate probability maps.

It is interesting to note that there appears to be a relationship in the ratio between the number of samples and the size of the grid. In fact, each model has shown very different behavior in this respect. The TV-based methods appear to be very sensitive to large changes in this ratio, whereas the H1 method seems to be robust to these same changes. We are uncertain about why this phenomenon exists, and this would make an interesting future research topic.

There are many directions in which we can build on the results of this paper. We would like to devise better methods for determining the valid region, possibly evolving the edge set of the valid region using Γ-convergence [17]. Since this technique can be used for many types of event data, including residential burglaries, we would also like to apply this method to Iraq Body Count data. Finally, we would like to handle possible errors in the data, such as incorrect positioning of events that place them in the invalid region, by considering a probabilistic model of their position.

Acknowledgments

This work was supported by NSF Grant BCS-0527388, NSF Grant DMS-0914856, ARO MURI Grant 50363-MA-MUR, ARO MURI Grant W911NS-09-1-0559, ONR Grant N000140810363, ONR Grant N000141010221, and the Department of Defense. The authors would like to thank George Tita and the LAPD for the burglary data set. They would also like to thank Jeff Brantingham, Martin Short, and the IPAM RIPS program at UCLA for the housing density data, which was obtained using ArcGIS and the LA County tax assessor data. The aerial images were obtained from Google Earth.

References

[1] D. Kim Rossmo, Geographic Profiling, CRC Press, 2000.
[2] G. O. Mohler and M. B. Short, "Geographic profiling from kinetic models of criminal behavior," in review.
[3] M. O'Leary, "The mathematics of geographic profiling," Journal of Investigative Psychology and Offender Profiling, vol. 6, pp. 253–265, 2009.
[4] B. W. Silverman, "Kernel density estimation using the fast Fourier transform," Applied Statistics, Royal Statistical Society, vol. 31, pp. 93–97, 1982.
[5] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986.
[6] P. L. Davies and A. Kovac, "Densities, spectral densities and modality," Annals of Statistics, vol. 32, no. 3, pp. 1093–1136, 2004.
[7] C. Kooperberg and C. J. Stone, "A study of logspline density estimation," Computational Statistics and Data Analysis, vol. 12, no. 3, pp. 327–347, 1991.
[8] S. Sardy and P. Tseng, "Density estimation by total variation penalized likelihood driven by the sparsity l1 information criterion," Scandinavian Journal of Statistics, vol. 37, no. 2, pp. 321–337, 2010.
[9] R. Koenker and I. Mizera, "Density estimation by total variation regularization," in Advances in Statistical Modeling and Inference, Essays in Honor of Kjell A. Doksum, pp. 613–634, World Scientific, 2007.
[10] G. O. Mohler, A. L. Bertozzi, T. A. Goldstein, and S. J. Osher, "Fast TV regularization for 2D maximum penalized likelihood estimation," to appear in Journal of Computational and Graphical Statistics.
[11] P. P. B. Eggermont and V. N. LaRiccia, Maximum Penalized Likelihood Estimation, Springer, Berlin, Germany, 2001.
[12] I. J. Good and R. A. Gaskins, "Nonparametric roughness penalties for probability densities," Biometrika, vol. 58, no. 2, pp. 255–277, 1971.
[13] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, no. 1–4, pp. 259–268, 1992.
[14] M. Moeller, T. Wittman, and A. L. Bertozzi, "Variational wavelet pan-sharpening," CAM Report 08-81, UCLA, 2008.
[15] C. Ballester, V. Caselles, L. Igual, J. Verdera, and B. Rougé, "A variational model for P + XS image fusion," International Journal of Computer Vision, vol. 69, no. 1, pp. 43–58, 2006.
[16] D. Mumford and J. Shah, "Optimal approximations by piecewise smooth functions and associated variational problems," Communications on Pure and Applied Mathematics, vol. 42, no. 5, pp. 577–685, 1989.
[17] L. Ambrosio and V. M. Tortorelli, "Approximation of functional depending on jumps by elliptic functional via Γ-convergence," Communications on Pure and Applied Mathematics, vol. 43, no. 8, pp. 999–1036, 1990.
[18] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, "An iterative regularization method for total variation-based image restoration," Multiscale Modeling and Simulation, vol. 4, no. 2, pp. 460–489, 2005.

[19] T. Goldstein and S. Osher, "The split Bregman method for L1 regularized problems," SIAM Journal on Imaging Sciences, vol. 2, pp. 323–343, 2009.
[20] Y. Wang, J. Yang, W. Yin, and Y. Zhang, "A new alternating minimization algorithm for total variation image reconstruction," SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 485151, 20 pages
doi:10.1155/2010/485151

Research Article
Adaptive Inverse Hyperbolic Tangent Algorithm for Dynamic
Contrast Adjustment in Displaying Scenes

Cheng-Yi Yu,1, 2 Yen-Chieh Ouyang,1 Chuin-Mu Wang,2 and Chein-I Chang3


1 Department of Electrical Engineering, National Chung Hsing University, Taichung 402, Taiwan
2 Department of Computer Science and Information Engineering, National Chin Yi University of Technology, Taichung 411, Taiwan
3 Remote Sensing Signal and Image Processing Laboratory, Department of Computer Science and Electrical Engineering,

University of Maryland, Baltimore County, Baltimore, MD 21250, USA

Correspondence should be addressed to Yen-Chieh Ouyang, [email protected]

Received 16 November 2009; Accepted 11 February 2010

Academic Editor: Yingzi Du

Copyright © 2010 Cheng-Yi Yu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Contrast has a great influence on the quality of an image in human visual perception. A poorly illuminated environment can
significantly affect the contrast ratio, producing an unexpected image. This paper proposes an Adaptive Inverse Hyperbolic Tangent
(AIHT) algorithm to improve the display quality and contrast of a scene. Because digital cameras must maintain the shadow in
a middle range of luminance that includes a main object such as a face, a gamma function is generally used for this purpose.
However, this function has a severe weakness in that it decreases highlight contrast. To mitigate this problem, contrast enhancement
algorithms have been designed to adjust contrast to tune human visual perception. The proposed AIHT determines the contrast
levels of an original image as well as parameter space for different contrast types so that not only the original histogram shape
features can be preserved, but also the contrast can be enhanced effectively. Experimental results show that the proposed algorithm
is capable of enhancing the global contrast of the original image adaptively while extruding the details of objects simultaneously.

1. Introduction

Digital cameras, which have gradually replaced conventional cameras, store photographs in a digital format. However, this is not the only way in which digital cameras differ from conventional cameras. In conventional mechanical cameras, the diaphragm and focal distance are adjusted by the photographer to obtain better scenes. Digital cameras, on the contrary, capture scenes using a sensor which might record a tremendous amount of energy from one material in a certain wavelength, while recording another material at much less energy in the same wavelength. Besides, the photographer cannot adjust the diaphragm or focal distance.

In real-world situations, light intensities have a large range. At the low end, the average intensity of starlight is approximately 10⁻³ cd/cm², and a sunny day can produce a high light intensity of 10⁵ cd/cm² or more. However, the visible range perceived by the human eye is only 1 to 10⁴ cd/cm². As a result, we lose almost all the details appearing in the darkest and brightest ranges of the visible spectrum. Today, most digital video cameras have the capability of capturing high dynamic range (HDR) images with a luminance level of 200% to 600%. Nevertheless, most display devices are only capable of a low dynamic range (LDR), with a luminance level of about 110% [1].

Visual adaptation in humans provides us with the ability to see in a wide range of conditions, from the darkness of night to the brightness of the midday sun. Adaptation means that the signals from our photoreceptors are processed to amplify weak signals and weaken strong signals, thereby preventing saturation. Previous physiological research shows that ganglion cells, the output cells of the retina, adapt at lower light levels than horizontal cells, one of the downstream targets of the cones [2]. This provides evidence for adaptation in the retinal circuitry and in the cone photoreceptors, known as receptor adaptation. As light levels increase, the main site of adaptation switches from the retinal circuitry to the cone photoreceptors.

We see light that enters the eye and falls on the retina. The retina has two types of photosensitive cells, both of which

contain pigments that absorb visible light to give us the sense of vision. The rods, which are numerous, are spread all over the retina and respond only to light and dark. They are very sensitive and can respond to a single photon of light. There are about 110,000,000 to 125,000,000 rods in the eye [3]. The other type of cell is the cones, located in one small area of the retina (the fovea). Their number is about 6,400,000. These cells are sensitive to colors but require more intense light, in the order of hundreds of photons. Incidentally, the cones are very sensitive to red, green, and blue (Figures 1 and 2) [4, 5], which is the reason why monitors use these colors as primaries. There are three types of cones: A, B, and C cones. The A cones are sensitive to red light, and the B cones are sensitive to green light (slightly more than the A cones). The C cones are sensitive to blue light, but their sensitivity is about 1/30 that of the A or B cones.

The human eye features a much higher resolution than cameras, but its effective resolution is even higher when we consider that the eye can move and refocus itself about three to four times a second. This means that in a single second the eye can sense and send to the brain about half a billion pixels.

Figure 1: Human visual system mapping curve (photoreceptor response R/Rmax versus log luminance, with the rod and cone operating ranges spanning starlight, moonlight, office light, and daylight).

Figure 2: Sensitivity of the cones (fraction of light absorbed by each type of cone as a function of wavelength, 400–680 nm).

The eye is a complex biological device. A camera is often compared to an eye because both focus the light from external objects in the visual field into a light-sensitive medium. In the case of the camera, this medium is film or an electronic sensor, as opposed to the eye, which is an array of visual receptors. According to the laws of optics, this simple geometrical similarity means that both eyes and a CCD camera function as transducers.

Light entering the eye is refracted as it passes through the cornea. It then passes through the pupil (controlled by the iris) and is further refracted by the lens. The cornea and lens act together as a compound lens to project an inverted image onto the retina.

A common problem in digital cameras is that the range of reflectance values collected by a sensor may not match the capabilities of the digital format or color display monitor. So, image enhancement techniques are generally required to make an image easier to analyze and interpret. The range of brightness values within an image is referred to as contrast. Contrast enhancement is a process that makes the image features stand out more clearly by optimizing the colors available on the display or an output device. A contrast enhancement algorithm allows users to custom design the enhancement to improve image quality, representation, and interpretation.

Many factors contribute to image quality, including brightness, contrast, noise, color reproduction, detail reproduction, visual acuity simulation, glare simulation, and artifacts. Accordingly, various digital image processing techniques have been developed. Among them is contrast enhancement, which plays the most important role in increasing the visual quality of an image [6, 7]. For this reason, contrast enhancement has been the major approach to improve image quality.

According to image contrast, an image is generally categorized into one of five groups: dark image, bright image, back-lighted image, low-contrast image, and high-contrast image. A dark image has particularly low gray levels in intensity, while a bright image has very high gray levels in intensity. The gray levels of a back-lighted image are usually distributed at the two ends of the dark and bright regions. On the other hand, the gray levels of a low-contrast image are generally centralized in the middle region, while the gray levels of a high-contrast image are scattered across the whole spectrum (Figure 3) [8].

Five categories of commonly used gray level transfer functions shown in Figure 4 are generally used to perform contrast enhancement so as to achieve different types of contrast [8]. For example, for dark images with mean <0.5, the function in Figure 4(a) is used, whereas the function in Figure 4(b) is used for a bright image with mean >0.5 for the same purpose. For images whose gray levels are centralized in the middle region with mean near 0.5, the function in Figure 4(c) is used. For images whose gray levels are distributed at the two ends of the dark and bright regions, the function in Figure 4(d) is used. For images whose gray levels are uniformly scattered across the whole spectrum, the function in Figure 4(e) is used.

This paper presents an Adaptive Inverse Hyperbolic Tangent (AIHT) algorithm for image contrast enhancement

that is suitable for interactive applications. It can automatically produce contrast-enhanced images with good quality while using a spatially uniform mapping function that is based on a simple brightness perception model to achieve better efficiency. In addition, the AIHT also provides users with a tool for tuning the image appearance on the fly in terms of brightness and contrast and thus is suitable for interactive applications. The AIHT-processed images can be reproduced within the capabilities of the display medium to have better detailed and faithful representations of original scenes.

The remainder of this paper is organized as follows. Section 2 reviews the previous work done in the literature. Section 3 develops the AIHT contrast enhancement algorithm along with its parameters and usage. Section 4 conducts experiments including simulations. Finally, Section 5 provides future directions of further research.

Figure 3: Five kinds of contrast types.

2. Contrast Enhancement for an Image

Each pixel in a gray-scaled image has brightness ranging from 0 to 255, with the values of 0 and 255 representing black and white, respectively. A histogram shows the number of pixels with the various levels of brightness. The "0" value on the left of a histogram shows the number of pixels that are black, while the "255" value on the right indicates the number of pixels being white. Normalizing the histogram by the total number of pixels in the image produces a probability distribution of brightness levels.

Since a digital image is encoded by L bits, the gray level of brightness varies from 0 to $2^L - 1$. Assume that $r_k$ is the $k$th gray level. Its probability is defined by

$$P(r_k) = \frac{n_k}{N}, \qquad (1)$$

where $n_k$ is the number of pixels specified by $r_k$, and $N$ is the total number of pixels in the image. If the histogram of an image has a narrow dynamic range, it will be a low-contrast image. In this case, different objects in the image will have their brightness in nearly the same gray level range, which may cause difficulty in object identification, object classification, and image processing. Under such a circumstance, contrast enhancement is generally performed to expand the gray level range to mitigate the problem. One popular technique to accomplish this task is histogram equalization (Gonzalez and Woods [9]).

There are two categories of contrast enhancement techniques: global methods and local methods. The advantages of using a global method are its high efficiency and low computational load. The drawback of using a global operator is its inability to reveal image details of local luminance variation. On the contrary, the advantage of a local operator is its capability of revealing the details of luminance level information in an image, at the expense of a very high computational cost that may be unsuitable for video applications without hardware realization. Two types of global contrast enhancement techniques, linear and nonlinear, are discussed as follows.
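As a small illustration of (1), the following Python sketch computes the normalized histogram of an 8-bit grayscale image, that is, the probability P(r_k) of each gray level. The function name and the use of NumPy are our own choices, not part of the paper.

```python
import numpy as np

def gray_level_probabilities(gray_image, levels=256):
    """Normalized histogram P(r_k) = n_k / N of an integer gray-scale image, as in (1)."""
    counts = np.bincount(gray_image.ravel(), minlength=levels)  # n_k for each gray level
    return counts / gray_image.size                             # divide by N

# Usage example on a synthetic low-contrast image whose gray levels cluster mid-range.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    img = np.clip(rng.normal(128, 10, size=(64, 64)), 0, 255).astype(np.uint8)
    p = gray_level_probabilities(img)
    print(p.sum(), p.argmax())   # prints 1.0 and a gray level near 128
```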

Figure 4: Five categories of commonly used gray level transform functions: (a) dark image, (b) bright image, (c) back-lighted image, (d) low-contrast image, and (e) high-contrast image.

Figure 5: Logarithm curve for different values of β (β = 2, 3, 5, 10, and 20).

Figure 6: Gamma function curve for different values of γ (γ = 0.3, 0.4, 0.5, 0.65, 0.8, 1, 1.25, 1.6, 2.1, 2.8, and 3.8).

2.1. Linear Contrast Enhancement. Linear contrast enhancement is also referred to as contrast stretching. It linearly expands the original digital luminance values of an image to a new distribution. Expanding the original input values of the image makes it possible to use the entire sensitivity range of the display device. Linear contrast enhancement also highlights subtle variations within the data. This type of enhancement is best suited to remotely sensed images with Gaussian or near-Gaussian histograms. There are three methods of linear contrast enhancement.

2.1.1. Minimum-Maximum Linear Contrast Stretch. The minimum-maximum linear contrast stretch assigns the original minimum value to the new minimum value and the original maximum value to the new maximum value, where the original intermediate values are scaled proportionately between the new minimum and maximum values. Many digital image processing systems can automatically expand these minimum and maximum values to optimize the full range of available brightness values.

2.1.2. Percentage Linear Contrast Stretch. The percentage linear contrast stretch technique is similar to the minimum-maximum linear contrast stretch except that the minimum and maximum values are found in a way that the values between them cover a given percentage of pixels from the mean of the histogram. A standard deviation from the mean is often used to push the tails of the histogram beyond the original minimum and maximum values.

2.1.3. Piecewise Linear Contrast Stretch. When the distribution of an image histogram is bi- or tri-modal, it is possible to stretch certain values of the histogram to increase contrast enhancement in selected areas. The piecewise linear contrast enhancement involves the identification of a number of linear enhancement steps that can expand the brightness ranges in multiple modes of the histogram. Compared to a normal linear contrast stretch, which stretches the minimum and maximum values to the lowest and highest gray levels linearly at a constant level of intensity, the piecewise linear contrast stretch defines several breakpoints that increase or decrease the contrast of the image for a given range of values. A low slope of an image histogram produces a lower contrast for the same range of values. On the other hand, a high slope of an image histogram produces a higher contrast for the same range of values. So, the higher the slope, the narrower the range of values mapped from the x-axis. This approach creates a wider spread output for the same original values, thus increasing the contrast for that range of values. A piecewise stretch method performs a series of small min-max stretches within a single histogram and is very useful in contrast enhancement. This benefit is traded off against the fact that image analysts must be very familiar with the modes of the histogram and the features they represent in the real world to take advantage of it.

2.2. Nonlinear Contrast Enhancement. Nonlinear contrast enhancement often involves histogram equalization, which requires an algorithm to accomplish the task. One major disadvantage resulting from the nonlinear contrast stretch is that each value in the input image can have several values in the output image, so that objects in the original scene lose their correct relative brightness values. There are two methods of nonlinear contrast enhancement.

2.2.1. Histogram Equalization. Histogram equalization is one of the most useful forms of nonlinear contrast enhancement (Gonzalez and Woods [9]). When an image's histogram is equalized, all pixel values of the image are redistributed. As a result, there are approximately an equal number of pixels for each of the user-specified output gray-scale classes (e.g., 32, 64, and 256). Contrast is increased at the most populated range of brightness values of the histogram (or "peaks"). It automatically reduces the contrast in very light or dark parts of the image, which are associated with the tails of a normally distributed histogram. Histogram equalization can also separate pixels into distinct groups if there are few output values over a wide range [10].

Image analysts should be aware of the fact that while histogram equalization often provides an image with the most contrast of any enhancement technique, it may also hide much needed information. This technique groups pixels that are very dark or very bright into a very few gray scales. If one is trying to bring out information in terrain shadows, or if there are clouds in the image, histogram equalization may not be appropriate.

Duan and Qiu extended this idea to color images, but the equalized images are not visually pleasing for most cases [11]. When the equalization process is applied to gray-scale images or the luminance component of color images, regions with overstated contrast usually create visually annoying artifacts. In this case, the visually unsatisfactory results caused by equalization are not acceptable because they give the image an unnatural appearance.

2.2.2. Contrast-Limited Adaptive Histogram Equalization. Contrast-Limited Adaptive Histogram Equalization (CLAHE) is an improved version of Adaptive Histogram Equalization (AHE), both of which overcome the limitations of standard histogram equalization. The CLAHE was originally developed for medical images and has improved enhancement of low-contrast images such as portal films [12].

The CLAHE is a local contrast enhancement technique and operates on small regions in an image, called tiles, rather than the entire image. Each tile's contrast is enhanced in such a way that the histogram of the output region approximately matches the histogram specified by the "Distribution" parameter. The neighboring tiles are then combined by bilinear interpolation to eliminate artificially induced boundaries. The contrast, especially in homogeneous areas, can be limited to avoid amplifying any noise that might be present in the image. In other words, the CLAHE partitions an image into a set of contextual regions and applies histogram equalization to each one of them. This evens out the distribution of used grey values and thus makes hidden features of the image more visible. The full gray level spectrum is used to express the image [13].

Figure 7: A flowchart of the AIHT algorithm (image in RGB, color conversion from RGB to HSV/HSI, adaptive inverse hyperbolic tangent mapping of the luminance channel, normalization, and conversion from HSV/HSI back to RGB to give the enhanced image).

Figure 8: A flowchart of the evaluation of the AIHT parameters: the mean and variance of the luminance are used to evaluate bias(x) and gain(x), which feed the inverse hyperbolic tangent function.

Figure 9: AIHT is approximately linear over the middle range of values, where the choice of a semisaturation constant determines how input values are mapped to display values (gain = 1; curves for bias = 0.2, 1, and 4, together with the linear mapping).

Figure 10: Inverse hyperbolic tangent curve: (a) inverse hyperbolic tangent curve, (b) shifted to [0, 1].

2.2.3. Logarithm Curve. Using a logarithm curve for contrast enhancement is usually performed for images with low complexity. Stockham was the first to discuss the advantages of this technique [14]. In a later report [15], Drago et al. presented a perception-motivated tone mapping algorithm for interactive display of high-contrast scenes. In Drago's algorithm the scene luminance values are compressed using logarithmic functions, which are computed using different bases depending on scene content. The log2 function is used in the darkest areas to ensure good contrast and visibility, while the log10 function is used for the highest luminance values to reinforce the contrast compression. In between, luminance is remapped using logarithmic values based on the shape of a chosen bias function. However, this approach has drawbacks: for extreme types of images (such as back-lighted, too bright, and too dark images), power function-based image contrast enhancement methods cannot retain the detailed brightness distribution of the original image and therefore lead to distortion [4].

Bennett and McMillan also used a logarithm-like function in their video enhancement algorithm [4, 16]. The difference between a logarithm curve and a gamma curve is that the former obeys the Weber-Fechner law of just noticeable difference (JND) response in human vision but provides a parameter to adapt the logarithmic mapping in a way similar to the log map function, while the high slope of standard gamma correction for low intensities can result in loss of detail in shadow regions.

Contrast masking is one of the most important concepts in human visual systems. In 1987, Whittle presented a concept that complied with this Weber-Fechner law [17], indicating that larger luminance gradients crossing an image require more stretch than smaller luminance gradients to achieve the same contrast perceived by the human eye. This concept is adopted in our algorithm.

Bennett and McMillan [16] and Stockham [14] suggested a simple form of a logarithm curve to enhance the contrast of an image:

$$v(x, y) = \frac{\log\left( w(x, y) \times (\beta - 1) + 1 \right)}{\log \beta}, \qquad (2)$$

where $v(x, y)$ and $w(x, y)$ are the enhanced luminance level and input luminance level, respectively. The parameter β is a control factor that determines the strength of contrast enhancement. Figure 5 shows the relationship between the extent of enhancement and the β value. A larger β value results in more enhancement. This is similar to the gamma function in γ correction (Figure 6). The selection of the β value is crucial. The curve in (2) is designed for global contrast enhancement, in which all pixels share the same β value. The following section describes the method we will use, which is similar to the proposed algorithm.
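For concreteness, a minimal Python sketch of the global logarithm mapping in (2) is given below, applied to a luminance image normalized to [0, 1]. The function name and the sample β value are our own choices, not part of the paper.

```python
import numpy as np

def log_curve(w, beta=10.0):
    """Global logarithmic contrast mapping of (2).

    w    : input luminance in [0, 1] (array or scalar)
    beta : control factor; a larger beta gives stronger enhancement (Figure 5)
    """
    return np.log(w * (beta - 1.0) + 1.0) / np.log(beta)

# Usage example: dark input values are lifted more strongly than bright ones.
if __name__ == "__main__":
    w = np.linspace(0.0, 1.0, 5)
    print(log_curve(w, beta=10.0))   # monotone map from [0, 1] onto [0, 1]
```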

Figure 11: Inverse hyperbolic tangent curves produced by varying the gain and bias values: (a) gain(x) parameter fixed and varying the bias(x) parameter, (b) bias(x) parameter fixed and varying the gain(x) parameter, (c) varying the gain and bias values of the mapping curves.

Figure 12: Bias power function curve for different values of the mean.

Figure 13: Gain function curve for different values of the variance.

3. Adaptive Inverse Hyperbolic Tangent Algorithm

The proposed Adaptive Inverse Hyperbolic Tangent (AIHT) algorithm automatically converts any color image to a 24-bit pixel format to avoid working with palettes. The HSV (hue, saturation, and value) method is a common approach used for such color-to-gray-scale conversion. In general, a color image can be converted to a gray-scale value by computing the luminance value for each color pixel by (3):

$$\text{Luminance} = \frac{1}{3}(R + G + B). \qquad (3)$$

This luminance value is the grayscale component in the HSV color space. The weights reflect the eye's brightness sensitivity to the primary colors.

All the gray levels of the original image must be normalized to the range [0, 1] before implementing AIHT.
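A small Python sketch of this preprocessing step is shown below: the per-pixel luminance of (3) followed by the normalization of the gray levels to [0, 1] mentioned above (the min-max formula itself is given later as (4)). The function names are our own.

```python
import numpy as np

def luminance(rgb):
    """Per-pixel luminance of (3): the mean of the R, G, and B channels."""
    return rgb[..., :3].mean(axis=-1)

def normalize_to_unit(gray):
    """Map gray levels to [0, 1] before applying the AIHT mapping."""
    lo, hi = gray.min(), gray.max()
    return (gray - lo) / (hi - lo) if hi > lo else np.zeros_like(gray, dtype=float)

# Usage example on a random 8-bit RGB image.
if __name__ == "__main__":
    rng = np.random.default_rng(2)
    img = rng.integers(0, 256, size=(4, 4, 3)).astype(float)
    g = normalize_to_unit(luminance(img))
    print(g.min(), g.max())   # prints 0.0 and 1.0
```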

Figure 14: The gain function determines the steepness of the curve. Steeper slopes map a smaller range of input values to the display range. (a) Bias parameter fixed (bias = 1) and eight different gain values of the mapping curves. (b) Fixed bias = 1, processed images.

Figure 15: The value of bias(x) controls the centering of the inverse hyperbolic tangent. (a) Gain parameter fixed (gain = 0.85) and nine different bias values of the mapping curves. (b) Fixed gain = 0.85, processed images.
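To visualize the behavior described in Figures 11, 14, and 15, the following Python sketch draws a generic S-shaped mapping built from the inverse hyperbolic tangent, rescaled to [0, 1], with illustrative gain and bias controls. This is only an illustration of how such controls shape the curve; it is not the paper's exact parameterization of bias(x) and gain(x), which is developed later in Section 3.

```python
import numpy as np

def iht_curve(x, gain=0.9, bias=1.0):
    """Illustrative inverse-hyperbolic-tangent tone curve on [0, 1].

    gain in (0, 1): values closer to 1 make the curve steeper near the ends
    (compare Figure 14); bias > 0 shifts the centering of the curve by
    pre-weighting the input with a power function (compare Figure 15).
    This is NOT the paper's AIHT definition, only a shape illustration.
    """
    u = 2.0 * np.power(x, bias) - 1.0             # map [0, 1] to [-1, 1]
    y = np.arctanh(gain * u) / np.arctanh(gain)   # inverse hyperbolic tangent, normalized
    return 0.5 * (y + 1.0)                        # rescale back to [0, 1]

# Usage example: the same inputs under two different bias settings.
if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 11)
    print(iht_curve(x, gain=0.9, bias=1.0))   # monotone curve from 0 to 1
    print(iht_curve(x, gain=0.9, bias=1.6))   # center shifted toward darker inputs
```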

Figure 16: Various types of bad contrast images illustrating the difference between contrast enhancement by histogram equalization, contrast-limited adaptive histogram equalization, and our method (outdoor images). [Rows: dawn (gain = 0.68365, bias = 1.1493), afternoon (gain = 0.94427, bias = 1.0264), and night (gain = 0.39556, bias = 0.63203) scenes; columns: original image, histogram equalization, contrast-limited adaptive histogram equalization, adaptive inverse hyperbolic tangent, and the adaptive inverse hyperbolic tangent mapping curve, together with the corresponding histograms.]

Figure 17: Various types of bad contrast images illustrating the difference between contrast enhancement by histogram equalization, contrast-limited adaptive histogram equalization, and our method (indoor images). [Rows: park (gain = 0.70529, bias = 0.7962), airport hall (gain = 0.85469, bias = 0.94577), and studio (gain = 0.88414, bias = 0.93678) scenes; columns as in Figure 16.]

Specifically, let x be the gray level of the original image. Then the normalized gray level g can be obtained by

g = (x − min(x)) / (max(x) − min(x)),  (4)

where min(x) and max(x) represent the minimum and maximum gray levels in the original image, respectively. Having implemented AIHT, g is mapped to g′ using

g′ = T(a, b, g),  (5)

where T indicates an AIHT transform and a and b represent parameters to be adjusted. If x′ is a gray level of the enhanced image, then x′ can be expressed as

x′ = (max(x) − min(x)) g′ + min(x).  (6)

Figure 7 shows a block diagram of the AIHT algorithm. The input data is converted from its original format to a floating-point representation of RGB values. The principal characteristic of our proposed enhancement function is an adaptive adjustment of the Inverse Hyperbolic Tangent (IHT) function determined by each pixel's radiance. After reading the image file, the bias(x) and gain(x) are computed. These parameters control the shape of the IHT function. Figure 8 shows a block diagram of the evaluation of the AIHT parameters, including bias(x) and gain(x).

3.1. AIHT. The Adaptive Inverse Hyperbolic Tangent algorithm has several desirable properties. For very small and very large luminance values, its logarithmic function enhances the contrast in both dark and bright areas of an image. Because this function is an asymptote, the output mapping is always bounded between 0 and 1.

Figure 18: Various types of bad contrast images illustrating the difference between contrast enhancement by histogram equalization, contrast
limited adaptive histogram equalization, and our method (aerial images).

Another advantage of this function is that it supports an approximately inverse hyperbolic tangent mapping for intermediate luminance, that is, luminance distributed between dark and bright values. Figure 9 shows an example where the middle section of the curve is approximately linear.

The form of the AIHT fits data obtained from measuring the electrical response of photo-receptors to flashes of light in various species [18]. It has also provided a good fit to other electro-physiological and psychophysical measurements of human visual function [19–21].

The contrast of an image can be enhanced using the inverse hyperbolic function

tanh^{-1}(x) = (1/2) log((1 + x)/(1 − x)).  (7)

Replace the variable x in (7) with x_{ij}, where x_{ij} is the image gray level of the ith row and jth column. We also raise x_{ij} to the power bias(x) to speed up the change. The gain function is a weighting function which is used to determine the steepness of the AIHT curve; a steeper slope maps a narrower range of input values to the display range. The gain function is used to help shape how fast the midrange of objects in a soft region goes from 0 to 1, and a higher gain value means a higher rate of change. The enhanced pixel x_{ij} is defined as follows:

Enhance(x_{ij}) = ( log( (1 + x_{ij}^{bias(x)}) / (1 − x_{ij}^{bias(x)}) ) − 1 ) × gain(x).  (8)

Therefore the steepness of the inverse hyperbolic tangent curve can be dynamically adjusted. Figure 10(a) plots the inverse hyperbolic tangent function over the domain −1 < x < 1, with a shift to the domain 0 < x < 1 in Figure 10(b).

3.2. Bias Power Function. Figure 11 shows that the bias(x) value of the inverse hyperbolic tangent function determines the turning points of the curve. If bias(x) is greater than mean = 0.5, then the curve bends upward.

Figure 19: Images captured at different times of day (7:30 AM, 2:00 PM, 10:00 PM; 6:00 AM, 1:00 PM, 8:00 PM; 4:00 AM, 1:00 PM, 8:30 PM) and the corresponding results of contrast enhancement with the adaptive inverse hyperbolic tangent method.

In this case, the pixel value is mapped to a higher value. A bias(x) value less than mean = 0.5 shifts the straight-line portion of the inverse hyperbolic tangent toward lower levels of light. Figure 11(a) illustrates these relationships where gain(x) = 0.5. Similarly, a family of inverse hyperbolic tangent remapping curves can be generated by keeping the bias(x) parameter fixed, for example at mean = 0.5, and varying the gain(x) parameter, as shown in Figure 11(b). Decreasing the gain(x) value increases the contrast of the remapped image. Shifting the distribution toward lower levels of light (i.e., decreasing bias(x)) decreases the highlights. By adjusting bias(x) and gain(x), it is possible to tailor a remapping function with appropriate amounts of image contrast enhancement, highlights, and shadow lightness, as shown in Figure 11(c).
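As a concrete illustration of this tailoring, the short sketch below generates such a family of remapping curves directly from the mapping in (8). The particular bias and gain values are arbitrary choices for illustration only, not values prescribed by the method; the data-driven definitions of bias(x) and gain(x) follow in (9) and (10).

```python
import numpy as np

def iht_curve(levels, bias, gain):
    """Remapping curve of (8) for normalized input levels in (0, 1).

    Note: as printed, (8) is not normalized to [0, 1]; rescaling the output
    for display is left to the caller (an assumption of this sketch).
    """
    x = np.clip(levels, 1e-6, 1.0 - 1e-6) ** bias
    return (np.log((1.0 + x) / (1.0 - x)) - 1.0) * gain

levels = np.linspace(0.0, 1.0, 256)
# Fixed bias, varying gain (cf. Figure 11(b)).
curves_by_gain = {g: iht_curve(levels, bias=1.0, gain=g) for g in (0.4, 0.7, 1.0)}
# Fixed gain, varying bias (cf. Figure 11(a)).
curves_by_bias = {b: iht_curve(levels, bias=b, gain=0.5) for b in (0.6, 1.0, 1.4)}
```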

(Figure 20 panels: the original histogram, the histogram after the AIHT mapping, and the histogram after histogram equalization, with annotations marking the blocking effect, missing details, un-smooth edges, and chromatic aberration introduced by equalization.)
Figure 20: Histogram equalization problems.

To make the inverse hyperbolic tangent curve produce a smooth mapping, we rely on the Perlin and Hoffert "bias" function [22]. Bias was first presented as a density modulation function to change the density of the soft boundary between the inside and the outside of a procedural hypertexture. It is a standard tool in texture synthesis and is also used for many different computer graphics tasks. The bias function is a power function defined over the unit interval which remaps x according to the bias transfer function. The bias function is used to bend the density function either upwards or downwards over the [0, 1] interval.

The bias power function is defined by

bias(x) = ( mean(x) / 0.5 )^{0.25} = \left( \frac{(1/(m \times n)) \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}}{0.5} \right)^{0.25}.  (9)

Figure 12 shows the bias curve for different mean values.
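To make the preceding definitions concrete, the following minimal sketch applies the AIHT mapping of (8) to a whole gray-level image, with bias(x) computed from (9) and gain(x) taken from the variance-based definition that Section 3.3 gives in (10). Whether the mean and variance are taken over the normalized or the raw gray levels, and how the result of (8) is brought back to the display range, are not spelled out in the text, so both choices below are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def aiht_enhance(image):
    """Sketch of AIHT contrast enhancement based on (4), (8), (9), and (10)."""
    x = image.astype(np.float64)

    # Normalize gray levels to [0, 1] as in (4).
    g = (x - x.min()) / (x.max() - x.min())

    # Bias power function (9) and gain function (10),
    # both computed here from the normalized image (assumption).
    bias = (g.mean() / 0.5) ** 0.25
    gain = 0.1 * np.sqrt(np.mean((g - g.mean()) ** 2))

    # Adaptive inverse hyperbolic tangent mapping of (8).
    gb = np.clip(g, 1e-6, 1.0 - 1e-6) ** bias
    out = (np.log((1.0 + gb) / (1.0 - gb)) - 1.0) * gain

    # Rescale to [0, 1] for display (assumption; (8) itself is unbounded).
    return (out - out.min()) / (out.max() - out.min())
```

For an 8-bit grayscale array img, enhanced = aiht_enhance(img) returns the remapped image in [0, 1].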


Figure 21: (a) is a poor-contrast image; (b)–(i) illustrate results of the proposed method with bias factors from 0.6 to 2.0, yielding different scales of enhanced image detail.

3.3. Gain Function. The gain function determines the steepness of the AIHT curve. A steeper slope maps a smaller range of input values to the display range. The gain function helps reshape how the midrange of objects in a soft region goes from 0 to 1.

The gain function is defined by

gain(x) = 0.1 \times (variance(x))^{0.5} = 0.1 \times \left( \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} − μ)^{2} \right)^{0.5},  (10)

where

μ = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}.  (11)

Figure 13 shows the gain curve for different mean values; the gain function determines the steepness of the AIHT curve.

Figure 14 shows mapping curves and processed images for different gain values. Mapping curves for the gain values 1, 0.99, 0.97, 0.93, 0.85, 0.69, and 0.37 are shown in Figure 14(a), and the corresponding results are shown in Figure 14(b).

Figure 15 shows processed images and the bias curves for different bias values. Mapping curves for the nine bias values 0.4, 0.5, 0.65, 0.8, 1.0, 1.25, 1.6, 2.1, and 2.8 are shown in Figure 15(a), and the corresponding results are shown in Figure 15(b).

4. Implementation and Experimental Results

Images with different types of histogram distributions were tested in the experiments. These include daily-life images with poor contrast, which are used to demonstrate the enhancement results. The images are categorized into outdoor, indoor, and aerial images: the outdoor images include dawn, afternoon, and night images; the indoor images include park, hall, and studio images; and the aerial images include runway, apron, and city images. There are four types of extreme images: dark images, bright images, back-lighted images, and low-contrast images. Figures 16, 17, and 18 show various types of poor-contrast images and display the results of enhancement by histogram equalization, contrast-limited adaptive histogram equalization, and the proposed AIHT method, for outdoor images, indoor images, and aerial images, respectively. Table 1 lists the values of the gain and bias parameters used for the AIHT method. Table 2 compares the results of histogram equalization and contrast-limited adaptive histogram equalization with those produced by the AIHT method using the measures of MSE, SNR, and PSNR; the AIHT method performed better than histogram equalization and contrast-limited adaptive histogram equalization.

The comparative analysis between the proposed method and currently used methods shows their effectiveness.


Figure 22: The user interface of the AIHT system: (a) automatic mode, (b) manual mode.

Table 1: Gain and bias parameters.

Type             Name       Gain    Bias
Outdoor images   Dawn       0.684   1.149
                 Afternoon  0.944   1.026
                 Night      0.396   0.632
Indoor images    Park       0.705   0.796
                 Hall       0.855   0.946
                 Studio     0.884   0.937
Aerial images    Runway     0.363   0.466
                 City       0.342   0.499
                 Apron      0.470   0.591

CLAHE improved the local contrast of the poor images, and the AIHT technique keeps the sharpness of defect edges well. Therefore, CLAHE and AIHT can greatly enhance poor images and will be helpful for defect recognition.

Figure 19 shows images captured at different times and the contrast enhancement produced by the AIHT method. Comparing these processing results, we can see that histogram equalization and contrast-limited adaptive histogram equalization produce severe chromatic aberration, blocking effects, missing details, and rough edges; furthermore, there is also a histogram shape change issue (Figure 20). Our approach has no chromatic aberration, blocking effects, missing details, or rough edges, and it still maintains the original histogram shape.

Figure 23: Enhancing the real-time image using the AIHT system.

Table 2: MSE, SNR, and PSNR.

                           Adaptive inverse hyperbolic tangent   Histogram equalization          Contrast-limited adaptive histogram equalization
Type             Name      MSE      SNR      PSNR                MSE      SNR     PSNR           MSE     SNR     PSNR
Outdoor images   Dawn      0.0386   22.1137  25.8902             0.16913  5.0272  5.9127         0.0432  10.316  23.143
                 Afternoon 0.0099   45.0963  101.170             0.00961  46.391  104.08         0.0432  10.316  23.143
                 Night     0.0174   0.85413  57.3469             0.16739  0.0890  5.9741         0.0596  0.2499  16.780
Indoor images    Park      0.0034   28.8848  295.286             0.05862  1.6688  17.060         0.0339  2.8852  29.495
                 Hall      0.0163   22.1386  61.4400             0.00206  175.15  486.08         0.0492  7.3229  20.323
                 Studio    0.0157   21.0555  63.6445             0.02702  12.245  37.013         0.0561  5.8972  17.825
Aerial images    Runway    0.0125   1.78040  79.7332             0.07680  0.0533  13.021         0.0369  0.1109  27.098
                 City      0.0214   0.18545  46.6504             0.11459  0.0347  8.7265         0.0575  0.0691  17.381
                 Apron     0.0125   1.07540  80.3171             0.12156  0.1102  8.2265         0.0585  0.2289  17.098
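For reference, a minimal sketch of how the three measures reported in Table 2 can be computed is given below. The choices of the original image as the reference and of 1.0 as the peak value for PSNR (for images scaled to [0, 1]) are assumptions, since the text does not state them.

```python
import numpy as np

def quality_metrics(reference, processed):
    """MSE, SNR (dB), and PSNR (dB) between two images scaled to [0, 1]."""
    ref = reference.astype(np.float64)
    proc = processed.astype(np.float64)
    err = ref - proc
    mse = np.mean(err ** 2)
    snr = 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))
    psnr = 10.0 * np.log10(1.0 / mse)  # peak signal value assumed to be 1.0
    return mse, snr, psnr
```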

Finally, Figure 21 demonstrates the multiscale property of the AIHT method. The original image (Figure 21(a)) has poor contrast; Figures 21(b)–21(i) present the results produced by the AIHT method using different factors to yield different scales of enhanced image details. Figure 22 shows the AIHT system interface in manual and automatic mode. The automatic mode selects the best parameters (gain and bias) based on automatically computed image characteristics (mean and variance) (Figure 22(a)). In manual mode, users can adjust the parameters according to their own preference (Figure 22(b)). The AIHT method can also adjust the contrast of real-time-processed images, as shown in Figure 23.

5. Conclusions

This paper presents an effective approach to image contrast enhancement. The proposed algorithm uses an Adaptive Inverse Hyperbolic Tangent algorithm as a contrast function to map the original image into a transformed image. This algorithm can improve the displayed quality of contrast in the scenes and offers the efficiency of fast computation.

Experimental results show that it is possible to maintain a large portion, if not all, of the perceived contrast of lightness while enhancing the image contrast significantly. The form of the curves used for enhancement was determined by a simple series of interpolations from a set of optimized reference curves. The proposed algorithm lets the user correctly identify the target and dynamically adjust the parameters by using the multiscale method. Experimental results also show that the new algorithm can adaptively enhance image contrast and produce better visual quality than histogram equalization and contrast-limited adaptive histogram equalization. In addition, it can be implemented in real time in various monitor systems. For overexposed and underexposed images the proposed algorithm also shows great benefit in improving contrast enhancement, with no adverse effects resulting from the environment. It is our belief that these functions will play a crucial role in developing a more universal approach to color gamut mapping.

Acknowledgments

This work was supported by the National Science Council (NSC) of Taiwan, R.O.C. (NSC 98-2221-E-005-064).

References

[1] Y. Monobe, H. Yamashita, T. Kurosawa, and H. Kotera, "Dynamic range compression preserving local image contrast for digital video camera," IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 1–10, 2005.
[2] F. A. Dunn, M. J. Lankheet, and F. Rieke, "Light adaptation in cone vision involves switching between receptor and post-receptor sites," Nature, vol. 449, no. 7162, pp. 603–606, 2007.
[3] G. Osterberg, "Topography of the layer of rods and cones in the human retina," Acta Ophthalmologica, vol. 13, supplement 6, pp. 1–103, 1935.
[4] C. Y. Yu, Y. Y. Chang, T. W. Yu, Y. C. Chen, and D. Y. Jiang, "A local-based adaptive adjustment algorithm for digital images," in Proceedings of the 2nd Cross-Strait Technology, Humanity Education and Academy-Industry Cooperation Conference, pp. 637–643, 2008.
[5] T. W. Yu, S. S. Su, C. Y. Yu, C. Y. Lin, and Y. Y. Chang, "Adaptive displaying scenes for real-time image," in Proceedings of the 3rd Intelligent Living Technology Conference, pp. 731–737, 2008.
[6] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[7] D. Garvey, "Perceptual strategies for purposive vision," Technical Note 117, AI Center, SRI International, 1976.
[8] C. Y. Yu, Y. C. Ouyang, C. M. Wang, C. I. Chang, and Z. W. Yu, "Contrast adjustment in displaying scenes using inverse hyperbolic function," in Proceedings of the 22nd IPPR Conference on Computer Vision, Graphics, and Image Processing, pp. 1020–1027, 2009.
[9] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice Hall, Upper Saddle River, NJ, USA, 3rd edition, 2008.
[10] ERDAS, Inc., Overview of ERDAS IMAGINE 8.2, ERDAS, Atlanta, Ga, USA, 1995.
[11] J. Duan and G. Qiu, "Novel histogram processing for colour image enhancement," in Proceedings of the 3rd International Conference on Image and Graphics (ICIG '04), pp. 18–22, Hong Kong, December 2004.
[12] J. Rosenman, C. A. Roe, R. Cromartie, K. E. Muller, and S. M. Pizer, "Portal film enhancement: technique and clinical utility," International Journal of Radiation Oncology Biology Physics, vol. 25, no. 2, pp. 333–338, 1993.
[13] K. Zuiderveld, "Contrast limited adaptive histogram equalization," in Graphics Gems IV, P. S. Heckbert, Ed., chapter 8.5, pp. 474–485, Academic Press, Cambridge, Mass, USA, 1994.
[14] T. G. Stockham, "Image processing in the context of a visual model," Proceedings of the IEEE, vol. 60, no. 7, pp. 828–842, 1972.
[15] F. Drago, K. Myszkowski, T. Annen, and N. Chiba, "Adaptive logarithmic mapping for displaying high contrast scenes," in Proceedings of the 24th Annual Conference of the European Association for Computer Graphics (EUROGRAPHICS '03), vol. 22, pp. 419–426, Granada, Spain, September 2003.
[16] E. P. Bennett and L. McMillan, "Video enhancement using per-pixel virtual exposures," ACM Transactions on Graphics, vol. 24, no. 3, pp. 845–852, 2005 (Proceedings of ACM SIGGRAPH 2005).
[17] P. Whittle, "Increments and decrements: luminance discrimination," Vision Research, vol. 26, no. 10, pp. 1677–1691, 1986.
[18] K. I. Naka and W. A. Rushton, "S-potentials from luminosity units in the retina of fish (cyprinidae)," Journal of Physiology, vol. 185, no. 3, pp. 587–599, 1966.
[19] J. Kleinschmidt and J. E. Dowling, "Intracellular recordings from gecko photoreceptors during light and dark adaptation," Journal of General Physiology, vol. 66, no. 5, pp. 617–648, 1975.
[20] D. C. Hood and M. A. Finkelstein, "A comparison of changes in sensitivity and sensation: implications for the response-intensity function of the human photopic system," Journal of Experimental Psychology: Human Perception and Performance, vol. 5, no. 3, pp. 391–405, 1979.
[21] D. C. Hood, M. A. Finkelstein, and E. Buckingham, "Psychophysical tests of models of the response function," Vision Research, vol. 19, no. 4, pp. 401–406, 1979.
[22] K. Perlin and E. M. Hoffert, "Hypertexture," ACM SIGGRAPH Computer Graphics, vol. 23, no. 3, pp. 253–262, 1989 (Proceedings of ACM SIGGRAPH '89).
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 950438, 8 pages
doi:10.1155/2010/950438

Research Article
Multi-Threshold Level Set Model for Image Segmentation

Chih-Yu Hsu,1 Chih-Hung Yang,2 and Hui-Ching Wang2


1 Department of Information & Communication Engineering, ChaoYang University of Technology,
168 Jifeng E. Rd., Wufeng, Taichung, 41349, Taiwan
2 Department of Applied Mathematics, National Chung-Hsing University, 250 Kuo Kuang Rd., Taichung 40227, Taiwan

Correspondence should be addressed to Chih-Yu Hsu, [email protected]

Received 1 December 2009; Accepted 8 March 2010

Academic Editor: Yingzi Du

Copyright © 2010 Chih-Yu Hsu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A multi-threshold level set model for image segmentations is presented in the paper. The multi-threshold level set formulation
uses a speed function for stopping the locations of the active contours. The speed function with multiple thresholds is designed
for detecting boundaries of multiple regions of interest (ROI) in images. These thresholds can be automatically obtained by Fuzzy
C-means method. The experimental results show that the proposed method is able to capture boundaries of multiple regions of
interest.

1. Introduction

Image segmentation has been widely deployed for defense and security applications such as small target detection and tracking, vehicle navigation, and automatic/aided target recognition. The most important image processing technique for target detection is image segmentation, whose purpose is to detect the regions of interest in an image. The multi-threshold level set model presented here is an image segmentation approach that can be used in defense and security applications; it can automatically detect the targets of interest and so reduce the workload of human analysts. Segmentation is a technique to extract the regions of interest in order to find the shapes of objects in images. Deformable models in mechanics describe the deformation of elastic structures under external loads and constraints. Since 1986, deformable models have been applied to image matching in computer vision and computer graphics [1–5]. Image segmentation methods are developed to extract the pixels in object regions or on their contours. The Active Contour Model (ACM) was first proposed by Kass et al. [6] in 1988 and is still widely used and discussed. The classic ACM is well known as the "Snake" because it can capture the contours of objects in an image. To stop the evolving curve on the boundary of the desired object, the GVF active contour model [7] was proposed by modifying the image forces to overcome the drawbacks of classical active contour models. Several papers [8–11] studied applications of image segmentation by active contours. Active contour models use the calculus of variations to obtain Euler-Lagrange equations from a minimized energy functional for solving image segmentation problems. The numerical solutions of these equations are curves located in the image after some iterations. The functionals can be classified into two groups: one group contains functionals without an edge detector and the other group contains functionals with an edge detector. The two most famous functionals are the Mumford–Shah functional [12] and the functional of the snake model.

Euler-Lagrange equations of active contours can have parametric or geometric mathematical descriptions. Parametric active contours are explicitly represented as parameterized curves. Osher and Sethian [13, 14] in 1988 proposed the level set model, which is a geometric active contour model. Instead of evolving curves directly, a level set model evolves a higher-dimensional function and obtains the curves as its zero level. Level set models are also widely applied to image matching in computer vision and computer graphics [15–23]. Active contours of a level set model can split and merge automatically, so they are suitable for the segmentation of multiple regions [24–27] and can be used for the segmentation of multiple ROIs in images. A threshold level set model [28] was proposed for image segmentation by defining an interval of the pixel intensities to find ROIs

in images. The governing equations of the threshold level set formulation involve a speed function term to drive the active contours. The classic threshold level set method is able to segment objects whose pixel values lie in only one interval. This paper proposes a multi-threshold level set model for segmenting multiple ROIs of objects whose intensities lie in several intervals. Since the intensity information is used to define the intervals that classify several objects of interest, the active contour will propagate and stop according to the defined speed function. If the speed function has several intervals separated by multiple thresholds, the level set model is called the multi-threshold level set model. In Section 2, the threshold level set approach with multiple thresholds is briefly presented. In Section 3, the idea of the Fuzzy C-means method for the automatic selection of thresholds is described. The experimental results are discussed in Section 4.

2. Method

Level set models and the speed functions with thresholds are introduced before designing the threshold parameters.

2.1. Level Set Models. A curve Γ(t) can be considered as the zero level of a function φ(X, t) in a higher-dimensional space [13, 18]. The position vector X is equal to the two-dimensional coordinate vector x i + y j, where i and j are the unit vectors. The active contour Γ(t) is embedded in the evolving function φ(X, t) as its zero level {φ = 0}. The evolution equation for the evolving function φ(X, t) can be formulated as

∂φ/∂t = −|∇φ| (n · v),  (1)

where n = ∇φ/|∇φ| and v = dX/dt = (dx/dt) i + (dy/dt) j. By defining F = −n · v, the level set equation takes the form

∂φ/∂t = −|∇φ| F,  (2)

where the speed function F is designed for edge detection. The level set function φ and the evolving curve Γ are shown in Figure 1. If the speed function F in (2) depends on the curvature κ, it can be formulated as follows:

F = P(I)(1 − εκ),  (3)

where ε is a constant parameter controlling the degree of smoothness of the active contours, I is the image intensity, and the curvature κ can be calculated by the following equation:

κ = (φ_{xx} φ_y^2 + φ_{yy} φ_x^2 − 2 φ_{xy} φ_x φ_y) / (φ_x^2 + φ_y^2)^{3/2}.  (4)

If the function P(I) approaches zero, then the active contours stop on the boundaries where the image gradients are large. The function P(I) can be designed as the following equation:

P(I) = (1 + |∇G_σ ∗ I|)^{−1},  (5)

where ∇ is the gradient operator and G_σ ∗ I denotes the image convolved with a Gaussian smoothing filter of characteristic width σ. Thus, the level set equation for segmentation is given by

∂φ/∂t = −|∇φ| P(I)(1 − εκ).  (6)

2.2. Threshold Level Set Method. The term P(I) in (6) only keeps the points of the active contours on the boundaries where the gradients are locally optimal. The speed function was improved by Lefohn et al. [28] as follows:

F = −(αD(I) + (1 − α)κ).  (7)

The constant α is a weight on the curvature κ. The term D(I) is a function of the image intensity I as follows:

D(I) = (U − L)/2 − |I − (U + L)/2|,  (8)

where the upper threshold U and the lower threshold L are parameters. In (8), when the brightness I is equal to (U + L)/2, the value of the function D(I) is equal to (U − L)/2, as shown in Figure 2. Equation (8) makes the active contours enclose the boundaries of regions whose intensities are in the interval [L, U]. The function D(I), based on the image intensity, causes the model to expand over regions with gray values within the specified interval [L, U] and to contract otherwise. By adjusting the parameters L and U, the threshold level set model controls the active contours to capture the regions of interest.

2.3. Multi-Threshold Speed Functions. In Section 2.2, the function D(I) makes the level set function φ evolve so that the active contours segment the ROIs whose intensity values all lie in one single interval (L, U). If there are ROIs in multiple intervals with different grey levels in an image, the function D(I) has to be redesigned for effectively segmenting the desired ROIs. To construct the function D(I) for finding multiple ROIs, the grey intervals of the multiple objects of interest should be selected. Here an example is illustrated for constructing a function D(I) that drives active contours toward the boundaries of two intervals (L1, U1) and (L2, U2). It is helpful to introduce functions Θ1(I) and Θ2(I) to construct the function D(I). The Heaviside function H is used to define the functions Θ1(I) and Θ2(I) as follows:

Θ1 = H(I) − H(I − (1/2)(U1 + L2)),  (9)

Θ2 = H(I − (1/2)(U1 + L2)) − H(I − 255),  (10)

where the number 255 is the maximum value of a gray image. The parameters are the lower thresholds L1, L2 and the upper

3.5
3 1 0
2.5
2

0
1.5 0
1
0.5
−1
0
3
2 3
1 2 −2
0 1
−1 0
−1
−2 −2 −3
−3 −3 −3 −2 −1 0 1 2 3
(a) (b)

Figure 1: The level set (a) evolution function and (b) the curve obtained with zero level of the level set function.

80 30
20
60
10
The value of D(I)
The value of D(I)

40
0
20 −10

0 −20
−30
−20 L U L1 U1 L2 U2
−40
−40 −50
−60 −60
0 50 100 150 200 250 300 0 50 100 150 200 250 300
The intensity value The intensity value

Figure 2: The function D(I) is defined by parameters L and U. Figure 3: The function D(I) is defined by parameters L1 , U1 and
L2 , U2 .

thresholds U1 , U2 . The function D(I) for only two intervals


is defined by (11):
 2 2 [L1 , U1 ] and negative in the other intervals. The speed
U 1 − L1 2 U + L1 2
D(I) = Θ1 · −2
2 I− 1 2 function F(I) evolutes level set function and the active
2 2 2
 2 2 (11) contours approach the boundaries of regions whose intensity
U 2 − L2 2 U + L2 2 values are in between [L1 , U1 ] or [L2 , U2 ] ranges.
+ Θ2 · −2
2I− 2 2 .
2 2 2 An example is to segment a synthetic image as shown
in Figure 4(a). There are nine numbered square regions
Using the term D(I), the speed function F(I) is formulated with different gray intensities in Figure 4(b). The gray value
in (12): of the first block is twenty and the other square regions
   2 2
U 1 − L1 2 U + L1 2 have gray values increased by twenty-five. Figure 4(c) is the
F(I) = − α Θ1 · −2
2 I− 1 2
2 2 2 segmentation result obtained by threshold level set method
 2 2 with only one interval [L, U] = [35, 250] containing all
U 2 − L2 2 U 2 + L2 2
+Θ2 · −2
2I −
2 grey values. Because the gray values of the 2th to 9th
2 2 2 square regions are between 35 and 250, these square regions
 are segmented. Figure 4(d) shows the segmentation result
+(1 − α)κ . by setting intervals [L1 , U1 ] = [35, 80] and [L2 , U2 ] =
(12) [200, 250]. The 2th, 3th, 8th, and 9th square regions are
surrounded by red lines. Figure 4(c) shows the segmentation
Figure 3 shows the curve of function D(I) in (11), and the results with only a single interval and Figure 4(d) shows the
values of the D(I) function are positive in the intervals segmentation results with two intervals.
4 EURASIP Journal on Advances in Signal Processing

1 4 7

2 5 8

3 6 9

(a) (b)

50 50

100 100

150 150

200 200

250 250

300 300

50 100 150 200 250 300 50 100 150 200 250 300
(c) (d)

Figure 4: (a) Original image. (b) The number of each blocks. (c) Segmentation result obtained by threshold level set method.
(d) Segmentation result obtained by multi-threshold level set method.

(a) (b) (c)

Figure 5: (a) Original image. (b) The result classified by FCM.


EURASIP Journal on Advances in Signal Processing 5

50 50
100 100
150 150
200 200
250 250
300 300
350 350
400 400
450 450
500 500
50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500
(a) (b)

50 50
100 100
150 150
200 200
250 250
300 300
350 350
400 400
450 450
500 500
50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500
(c) (d)

Figure 6: (a) Original image and segmentation result obtained by (b) level set method, (c) threshold level set method, and (d) multi-
threshold level set method.

Since the curve of function D(I) of threshold level set method used as image segmentation will group and segment
model can be designed with more than a single interval, (9), a collection of objects into subsets or clusters, such that
(10) and (11) can be generalized for more than two intervals those within each cluster are more closely related to one
with [L3 , U3 ], . . . , [Ln , Un ]. The threshold level set model is another than objects assigned to different clusters. A cluster
called the Multi-Threshold Level Set Method that is mainly is therefore a collection of objects which are similar to each
used for the segmentation of multiple regions of interest. other and are dissimilar to the objects belonging to other
clusters. An object can be described by a set of measurements
3. Thresholds for Image Segmentations or features. Thus each object can be represented by a unique
point x j , 1 ≤ j ≤ M in the M-dimensional feature space. If
Fuzzy C-mean is a clustering method to classify the pixels we want N cluster to be classified, the centers p1 , . . . , pN for
in an image. Level set method is to find the boundary of each group C1 , . . . , CN should be randomly selected initially.
Regions of interest (ROI). The initial contour of threshold Fuzzy C-means clustering method uses an iterative method
level set method is not related to the output of Fuzzy C- to update centers continually by the following equation:
mean, but the final locations of active contours stopped by
M (k)
level set method are decided by the output of Fuzzy C-mean j =1 ui j x j
method. The multi-threshold level set model is able to seg- pi(k+1) = M (k) , (13)
ment multiple regions with designed intervals. The optimal j =1 ui j
intervals selected for multiple ROIs are an important issue.
To automatically select thresholds for image segmentations, where k is one step, ui j is the membership of jth data point
the Fuzzy C-Means (FCM) method is very suitable because at the position x j which belongs to the ith group, and the
the gray values belonging to one object are similar. The FCM membership is defined as
method determines the intervals [L1 , U1 ], . . . , [L2 , U2 ] of ⎡ ⎛ ⎞1/(r −1) ⎤−1
multiple regions in images. To determine multiple thresholds ⎢ 
c d 2

ui j = ⎣ ⎝ 2 ⎠
ij
by these intervals [L1 , U1 ], . . . , [L2 , U2 ], the FCM method is ⎦ , 1 ≤ i ≤ c, 1 ≤ j ≤ M,
k=1
dk j
briefly described.
(14)
3.1. Fuzzy C-Means Method. Fuzzy C-means clustering
(FCM) method proposed by Bezdek [29] is a process of where di j = x j − pi  denotes the Euclidean distance between

clustering data points to one of the N groups. The cluster x j and pi and the condition Ni=1 ui j = 1 is always true. The
6 EURASIP Journal on Advances in Signal Processing

50 50
100 100
150 150
200 200
250 250
300 300
350 350
400 400

50 100 150 200 250 300 350 400 450 500 550 50 100 150 200 250 300 350 400 450 500 550
(a) (b)

50 50
100 100
150 150
200 200
250 250
300 300
350 350
400 400

50 100 150 200 250 300 350 400 450 500 550 50 100 150 200 250 300 350 400 450 500 550
(c) (d)

50
100
150
200
250
300
350
400

50 100 150 200 250 300 350 400 450 500 550
(e)

Figure 7: (a) Original RGB image. (b) Blue channel of original RGB image and segmentation results by (c) level set model, (d) threshold
level set model, and (e) adaptive multi-threshold level set model.

optimal data clustering to minimize the objective function is FCM method, we can automatically obtain the thresholds
shown in the following equation: of multiregions of interest and do not need to guess the
optimal thresholds every time. The interval of grayscale for
  
c 
m ? ?2
? ? Black region is 0∼124, Red region is 125∼139, Green region
J ui j , pi = urij ?x j − pi ? , (15)
i=1 j =1 is 140∼145, Cyan region is 146∼154, and Blue region is
155∼244. These values are candidates for being thresholds
where r > 1 is a tuning parameter which controls the degree for image segmentations. If L1 = 140, U1 = 145, L2 = 155,
of fuzziness. Minimization of the objective function in (15), and U2 = 244 are chosen as thresholds, the green and blue
the optimality of fuzzy membership ui j , and centers p j can regions in Figure 5(b) are segmented by level set model and
be obtained. these two regions are colored yellow in Figure 5(c).

3.2. Automatic Selection of Thresholds. Figure 5(a) is a syn- 4. Experimental Results


thetic image and there are five regions with different intensity
values listed in Table 1. Figure 5(b) is a clustering result by To demonstrate the performance of our new speed function
FCM method. The result classified by FCM is represented by for the level set framework, a series of experiments are shown
five regions with colors (red, green, blue, black, and cyan). in Figures 6 and 7. Figure 6(a) is a synthetic image that
Comparing Table 1 and Figure 5(b), there are five inter- includes five letters A, B, C, D, and E. The order of intensity
vals of grayscale values for five color regions. Using the of letters is A < B < C < D < E. The regions of interest are
EURASIP Journal on Advances in Signal Processing 7

Table 1: The intervals of grayscale values are listed for color regions References
in Figure 5(b).
[1] D. Terzopoulos, “On matching deformable models to images,”
Color region Interval of grayscale Tech. Rep. 60, Schlumberger Palo Alto Research, Palo Alto,
Black 0∼124 Calif, USA, 1986.
Red 125∼139 [2] D. Terzopoulos, “On matching deformable models to images,”
in Proceedings of the Topical Meeting on Machine Vision, vol.
Green 140∼145
12 of Technical Digest Series, pp. 160–167, Optical Society of
Gray 146∼154 America, Washington, DC, USA, 1987.
Blue 155∼244 [3] D. Terzopoulos, A. Witkin, and M. Kass, “Constraintson
deformable models: recovering 3D shape and nonrigid
motion,” Artificial Intelligence, vol. 36, no. 1, pp. 91–123, 1988.
[4] D. Terzopoulos and K. Fleischer, “Deformable models,” The
letter B and letter D. Figure 6(b) is a segmentation result Visual Computer, vol. 4, no. 6, pp. 306–331, 1988.
obtained only by using (6) with ε = 10−6 . By (6), it is [5] F. Leymarie and M. D. Levine, “Tracking deformable objects
impossible to select the letters that we want. Figure 6(c) is in the plane using an active contour model,” IEEE Transactions
a segmentation result obtained by threshold level set model on Pattern Analysis and Machine Intelligence, vol. 15, no. 6, pp.
with [L, U] = [130, 200]. In Figure 6(c), these letters A, 617–634, 1993.
B, and C are segmented by using the threshold level set [6] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active
method because [L, U] = [130, 200] includes intensity of contour models,” International Journal of Computer Vision,
vol. 1, no. 4, pp. 321–331, 1988.
letters B, C, and letter D. Figure 6(d) is the segmentation
[7] C. Xu and J. L. Prince, “Snakes, shapes, and gradient vector
results by multi-threshold level set model with thresholds
flow,” IEEE Transactions on Image Processing, vol. 7, no. 3, pp.
[L1 , U1 ] = [130, 150] and [L2 , U2 ] = [180, 200]. These 359–369, 1998.
thresholds are automatically obtained by using FCM to [8] V. Caselles, R. Kimmel, and G. Sapiro, “On geodesic active
obtain grayscale value intervals of letter B and letter D. The contours,” International Journal of Computer Vision, vol. 22,
results in Figure 6(d) demonstrate that multi-threshold level no. 1, pp. 61–79, 1997.
set model can capture multiple of interest regions. [9] S. C. Zhu, T. S. Lee, and A. L. Yuille, “Region competition: uni-
Figure 7(a) is an image taken by a camera and the fying snakes, region growing, energy/bayes/MDL for multi-
image includes some objects. The regions of interest are band image segmentation,” in Proceedings of the 5th IEEE
forks, spoons, cups, and smaller bowls. In Figure 7(b), blue International Conference on Computer Vision, pp. 416–423,
channel of original RGB image is selected for segmentation. Cambridge, Mass, USA, June 1995.
In Figure 7(c), all objects are segmented by using (6) and ε = [10] K. Siddiqi, Y. B. Lauzière, A. Tannenbaum, and S. W. Zucker,
10−6 . Figure 7(d) is a segmentation result by the threshold “Area and length minimizing flows for shape segmentation,”
level set model with thresholds [L, U] = [50, 110] and IEEE Transactions on Image Processing, vol. 7, no. 3, pp. 433–
the parameter α = 0.95. Forks and spoons are captured. 443, 1998.
[11] T. McInerney and D. Terzopoulos, “On matching deformable
Multi-threshold level set model with thresholds [L1 , U1 ] =
models to images,” Medical Image Analysis, vol. 1, no. 2, pp.
[50, 110], [L2 , U2 ] = [130, 170] and the parameter α = 0.95 is 91–108, 1996.
used to segmentation as shown in Figure 7(e). Forks, spoons, [12] D. Mumford and J. Shah, “Optimal approximation by piece-
and smaller bowls are obtained. Comparing Figures 7(c), wise smooth functions and associated variational problems,”
7(d), and 7(e), multiple regions of interest can be segmented Communications on Pure and Applied Mathematics, vol. 42, pp.
by multi-threshold level set model. 577–685, 1989.
[13] S. Osher and J. A. Sethian, “Fronts propagating with
curvature-dependent speed: algorithms based on Hamilton-
5. Conclusions and Future Work Jacobi formulations,” Journal of Computational Physics, vol. 79,
no. 1, pp. 12–49, 1988.
In this paper, a speed function for multi-threshold level [14] S. J. Osher and R. Fedkiw, Level Set Methods and Dynamic
set model is proposed. The speed functions use thresholds Implicit Surfaces, Springer, London, UK, 2002.
to segment regions of interest. To obtain the thresholds of [15] V. Caselles, “Geometric models for active contours,” in Pro-
multiregions of interest automatically, the FCM method is ceedings of IEEE International Conference on Image Processing
used to select the thresholds obtained from image intensities. (ICIP ’95), vol. 3, pp. 9–12, Washington, DC, USA, 1995.
The experimental results show that the multi-threshold level [16] J. A. Sethian, Level Set Methods, Cambridge University Press,
set model can capture multiple boundaries of ROIs. In the Cambridge, UK, 1996.
future, the multi-threshold level set model can be extended [17] G. Sapiro, Geometric Partial Differential Equations and Image
from 2D to 3D applications. Analysis, Cambridge University Press, Cambridge, UK, 2001.
[18] R. Malladi, J. A. Sethian, and B. C. Vemuri, “Shape modeling
with front propagation: a level set approach,” IEEE Transac-
Acknowledgment tions on Pattern Analysis and Machine Intelligence, vol. 17, no.
2, pp. 158–175, 1995.
The authors thank the National Science Council (NSC) for [19] V. Caselles, F. Catté, T. Coll, and F. Dibos, “A geometric
partial financial support (NSC 97-2115-M-324 - 001) and model for active contours in image processing,” Numerische
(NSC 98-2115-M-324 - 001). Mathematik, vol. 66, no. 1, pp. 1–31, 1993.
8 EURASIP Journal on Advances in Signal Processing

[20] S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A.


Yezzi, “Gradient flows and geometric active contour models,”
in Proceedings of IEEE International Conference on Computer
Vision, pp. 810–815, Cambridge, Mass, USA, 1995.
[21] R. Malladi, J. A. Sethian, and B. C. Vemuri, “Shape modeling
with front propagation: a level set approach,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 17, no.
2, pp. 158–175, 1995.
[22] T. F. Chan and L. A. Vese, “Active contours without edges,”
IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266–
277, 2001.
[23] C.-Y. Hsu, C.-H. Yang, and H.-C. Wang, “Topological control
of level set method depending on topology constraints,”
Pattern Recognition Letters, vol. 29, no. 4, pp. 537–546, 2008.
[24] N. Vu and B. S. Manjunath, “Shape prior segmentation of
multiple objects with graph cuts,” in Proceedings of 26th
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR ’08), pp. 1–8, Anchorage, Alaska, USA, June 2008.
[25] T. Chan and W. Zhu, “Level set based shape prior segmenta-
tion,” in Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR ’05), vol.
2, pp. 1164–1170, San Diego, Calif, USA, June 2005.
[26] T. Brox and J. Weickert, “Level set segmentation with multiple
regions,” IEEE Transactions on Image Processing, vol. 15, no. 10,
pp. 3213–3218, 2006.
[27] L. A. Vese and T. F. Chan, “A multiphase level set framework
for image segmentation using the Mumford and Shah model,”
International Journal of Computer Vision, vol. 50, no. 3, pp.
271–293, 2002.
[28] A. E. Lefohn, J. E. Cates, and R. T. Whitaker, “Interactive,
GPU-based level sets for 3D segmentation,” in Proceedings of
Medical Image Computing and Computer Assisted Intervention
(MICCAI ’03), vol. 2878, pp. 564–572, November 2003.
[29] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function
Algorithms, Plenum Press, New York, NY, USA, 1981.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 923748, 13 pages
doi:10.1155/2010/923748

Research Article
An Interactive Procedure to Preserve the Desired Edges during
the Image Processing of Noise Reduction

Chih-Yu Hsu,1 Hsuan-Yu Huang,2 and Lin-Tsang Lee2


1 Department of Information and Communication Engineering, ChaoYang University of Technology, Taichung 41349, Taiwan
2 Department of Applied Mathematics, National Chung-Hsing University, Taichung 40227, Taiwan

Correspondence should be addressed to Chih-Yu Hsu, [email protected]

Received 1 December 2009; Revised 5 February 2010; Accepted 30 March 2010

Academic Editor: Yingzi Du

Copyright © 2010 Chih-Yu Hsu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The paper propose a new procedure including four stages in order to preserve the desired edges during the image processing of
noise reduction. A denoised image can be obtained from a noisy image at the first stage of the procedure. At the second stage, an
edge map can be obtained by the Canny edge detector to find the edges of the object contours. Manual modification of an edge
map at the third stage is optional to capture all the desired edges of the object contours. At the final stage, a new method called
Edge Preserved Inhomogeneous Diffusion Equation (EPIDE) is used to smooth the noisy images or the previously denoised image
at the first stage for achieving the edge preservation. The Optical Character Recognition (OCR) results in the experiments show
that the proposed procedure has the best recognition result because of the capability of edge preservation.

1. Introduction

Digital images are noisy due to environmental disturbances. To ensure image quality, noise reduction is a very important step before images are analyzed or used. An Optical Character Recognition (OCR) system is an example that is very sensitive to noise: the quality of documents influences the recognition results, and image noise decreases the recognition accuracy of OCR software because of blurred edges. Great damage can be caused in defense and security applications when OCR software is used for the scanning and recognition of documents such as passports and ID cards in busy airports, where speed and accuracy are critical for processing thousands of documents daily. The most important image processing technique for noise reduction is image denoising, whose purpose is to increase the signal-to-noise ratio (SNR) of an image. However, noise reduction usually blurs edges during the denoising process, so the development of an edge-preserving image denoising method is necessary for OCR software. This paper develops a denoising procedure with edge-preservation capability. OCR is a research field in pattern recognition [1, 2] and is used to convert papers, books, and documents into electronic files [3]. Researchers have developed several methods to remove image noise, including Gaussian noise and salt-and-pepper noise [4]. There are several image filters used for image denoising [5, 6], and the Gaussian filter is a well-known one [7]. In the period between 1984 and 1987, Koenderink and Hummel showed that Gaussian filtering is equivalent to the dispersion effect of the isotropic diffusion equation, so Gaussian filters are also called diffusion filters.

Isotropic diffusion equations can reduce noise but blur the contours of images. To improve on this drawback, Perona and Malik modified the diffusion coefficient of isotropic diffusion filters in 1990 to produce anisotropic diffusion filters (ADFs), whose coefficient is a function of the image gradient [8]. The coefficients of isotropic diffusion equations are constants, whereas the diffusion coefficients of anisotropic diffusion equations decrease as image gradients increase, which makes anisotropic diffusion equations more effective at edge preservation. However, because noise also has large gradients, the coefficients of anisotropic diffusion filters become small at noisy pixels, so this solution cannot solve the problem completely. Research efforts [9, 10] continue to focus on improving the diffusion coefficients; these methods may remove image noise, but edges still cannot be fully preserved.

250

200

150

100

50

0
0 50 100 150 200
(a) (b)

Figure 1: (a) Synthetic image and (b) the grayscale value of the 36th row of (a).

250 250 250

200 200 A2 200 A2

150 150 150

100 100 100


A1 A1
50 50 50

0 0 0
0 50 100 150 200 0 50 100 150 200 0 50 100 150 200
α1 β1 α2 β2 α1 β1 α2 β2
(a) (b) (c)

Figure 2: (a) The front of signal of Figure 1(b), (b) the back of signal of Figure 1(b), and (c) the signal is superimposed by (a) and ( b).

In this paper, we propose a new procedure including u(x, y) is an image intensity that is between 0 and 255. For
four stages. At the first stage of the procedure, any kind a gray image, the function u has grayscale values of image
of denoising algorithm can be applied on an original noisy pixels. The coordinates (x, y) are locations of the pixels in
image to get a well-denoised image. At the second stage, an image. The grayscale values of the ith row of u(x, y) are
an edge map can be obtained to find the edges of the denoted by u(i, 1 : n) which can be considered as a one-
object contours by the Canny edge detector applied on the dimensional signal with length n. For example, the red line as
previously denoised image at the first stage. Since the contour shown in Figure 1(a) is the 36th row of the image. As shown
edges are not found completely, then the users maybe need in Figure 1(b), the grayscale values profile is composed of two
interactively modify the edge map to keep the edges of the Box functions.
desired object contours. At the third stage, manually modify
the edges of edge map to match the desired edges. At the 2.1.1. One-Dimensional Signals. One-dimensional signals
final stage, a new method Edge Preserved Inhomogeneous can be considered as piecewise constant functions, Heaviside
Diffusion Equation is used to smooth the original noisy function is suitable to discrete piecewise constant functions
image or the previously denoised image at the first stage and [11]. Heaviside function H(x) is defined as:
achieve preserving desired edge. The proposed procedure has
the edge preservation capability that makes OCR results the M
0, x < 0,
best in this experiment. H(x) = (1)
1, x > 0.
2. Mathematic Formulation The Heaviside function H(x) is discontinuous at x = 0,
Section 2.1 introduces the digital image as a matrix, and one and the value is usually defined by 1/2 at x = 0. If the
row can be considered as a signal. Section 2.2 introduces how Heaviside function H(x) is shifted a, then Heaviside function
to find the solutions of a one-dimensional inhomogeneous is H(x − a). Box Function φ can be represented by Heaviside
diffusion equation by using Fourier series. Section 2.3 pro- function H(x) as φ(x) = (H(x) − H(x − 1)). The Box function
posed a flow chart of the EPIDE denoising method. can be represented as follows:
M
2.1. Digital Images and Signals. We defined a m × n grayscale 1, 0 < x < 1,
φ(x) = (2)
digital image as a function u(x, y). The value of the function 0, otherwise.
EURASIP Journal on Advances in Signal Processing 3

250 250 250

200 200 200

150 150 150

100 100 100

50 50 50

0 0 0
0 50 100 150 200 0 50 100 150 200 0 50 100 150 200
(a) (b) (c)

Figure 3: (a) The grayscale value of the 36th row of Figure 1(a), (b) the Fourier series with 300 terms, and (c) diffused result by diffusion
equation at t = 3.

×105 ×10−4
3.5 8

3 6

4
2.5
2
2
0
1.5
−2
1
−4
0.5 −6

0 −8
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3

(a) (b)

Figure 4: (a) δ(x) = (ε/π)/(x2 + ε2 ), ε = 10−6 and (b) δ  (x) = −2εx/π(x2 + ε2 )2 , ε = 10−6 .

As shown in Figure 1(b), the 36th row of u(x, y) is u(36, 1 : n) summation of the finite Box functions. If u(i, 1 : n) is an
and the profile of the row is represented by two Box functions integrable function on [0, π], then u(i, 1 : n) can approximate
as in the following equation: the continuous Fourier series [12] as follows:
 ∞

2
   1
u(36, 1 : n) = Ak H(x − αk ) − H x − βk . (3) u(i, 1 : n) = a0 + (ak cos kx + bk sin kx), (5)
2 k=1
k=1

One signal can be superimposed by two signals. Figure 2 where L the coefficients are represented L π by ak =
π
shows how to use two box functions to superimpose the (1/π) −π f (x) cos kxdx and bk = (1/π) −π f (x) sin kxdx
signal in Figure 1(b). equations.
If there are M Box functions in the ith row of u(x, y), the For example as shown in Figure 3, the grayscale value
profile of u(i, 1 : n) can be represented as follows: of the 36th row of Figure 1(a) is shown in Figure 3(a).
Figure 3(b) shows a profile to approximate signal of

M
   Figure 3(a) by using Fourier series with 300 terms.
u(i, 1 : n) = Ak H(x − αk ) − H x − βk . (4) Figure 3(c) shows defused result at t = 3 where the variable t
k=1
will be explained in Section 2.2.1.
The letter αk is the left-location value of the kth Box Function
and βk is value of the right location of the kth Box Function. 2.2. One-Dimensional Inhomogeneous Diffusion Equation.
The letter M denotes the total number of Box functions and We want to solve the problem of finding the intensity
the symbol Ak are coefficient constants. u(x, t) of every row in an image. At both sides of the
interval 0 ≤ x ≤ n, the intensity values are set to
2.1.2. Fourier Series of Box Function. According to (4), the be zero. By adding the inhomogeneous terms into the
function u(i, 1 : n) as one signal can be represented by diffusion equation with the derivative of Delta functions,
4 EURASIP Journal on Advances in Signal Processing

Figure 5: The flow chart of finding the edge map during the three stages.

2.2.1. Diffusion Equation Formulation. Consider the inhomogeneous differential equation [13]

∂u(x, t)/∂t = K ∇²u(x, t) + F(x),  (6)

where x is the spatial coordinate and t is time; the temperature u(x, t) is now replaced by the image intensity, a function of the position x and the time t, and K is a constant called the "thermal diffusivity" of the material. The function F(x) is an inhomogeneous term that is explained in (7).

The function F(x) is used to produce the edge-preserving effect and is obtained from the derivative of the right side of (4):

F(x) = Σ_{k=1}^{M} A_k [δ′(x − α_k) − δ′(x − β_k)].  (7)

The function δ′ is a Dipole distribution, the derivative of the Delta (or impulse) function δ. The relation between the Delta function and the step function is

δ(x) = dH(x)/dx,  (8)

where the step function can be approximated by H(x) = (1/2)(1 + (2/π) arctan(x/ε)), so that δ(x) = H′(x) = (ε/π)/(x² + ε²), with ε = 10⁻⁶. The δ(x) and δ′(x) functions are shown in Figures 4(a) and 4(b).

The Delta function is a generalized function; its properties are as follows [14]:

∫_a^b δ(x − ξ) dx = 1 if a ≤ ξ ≤ b, and 0 otherwise (ξ < a or ξ > b),
δ(x − ξ) = 0 for x ≠ ξ,  (9)

where a, b, and ξ are constants.

Substituting (7) into (6) gives

∂u(x, t)/∂t = K ∇²u(x, t) + Σ_{k=1}^{M} A_k [δ′(x − α_k) − δ′(x − β_k)].  (10)

Equation (10) is an inhomogeneous diffusion equation used to preserve the edges. In (4), (7), and (10), the edge locations α_k and β_k are decided by the locations of the edge pixels in the signal, that is, one row of the image. Since it is not easy to obtain the edge locations from a noisy image, some image preprocessing techniques and the Canny edge detection method [15] are used to find the edge map of the object contours. The locations α_k and β_k are then taken from the edge map. By modifying the edge map, the user can decide which contours to keep for their requirements.
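For illustration, the construction of the inhomogeneous term F(x) in (7) from an edge map can be sketched in NumPy as follows. This is a minimal example written for this discussion, not the implementation used in the paper; the function names, the vectorised form, and the use of ε = 10⁻⁶ as a module constant are our own choices.

```python
import numpy as np

EPS = 1e-6  # the regularisation constant epsilon used in the text

def heaviside(x):
    # smoothed step function H(x) ~ (1/2)(1 + (2/pi) * arctan(x / eps))
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(x / EPS))

def delta(x):
    # delta(x) = H'(x) = (eps/pi) / (x^2 + eps^2)
    return (EPS / np.pi) / (x ** 2 + EPS ** 2)

def dipole(x):
    # delta'(x) = -2*eps*x / (pi * (x^2 + eps^2)^2)
    return -2.0 * EPS * x / (np.pi * (x ** 2 + EPS ** 2) ** 2)

def inhomogeneous_term(x, alphas, betas, amps):
    # F(x) = sum_k A_k [ delta'(x - alpha_k) - delta'(x - beta_k) ], as in (7)
    x = np.asarray(x, dtype=float)
    F = np.zeros_like(x)
    for a, b, A in zip(alphas, betas, amps):
        F += A * (dipole(x - a) - dipole(x - b))
    return F
```

Given edge locations α_k, β_k taken from the (possibly user-modified) edge map and amplitudes A_k estimated from the row profile, inhomogeneous_term returns the F(x) that drives the edge-preserving behaviour of (10).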

2.2.2. Fourier Series Solutions. According to (6), the function F(x) can be represented by a Fourier series [12] as follows:

F(x) = Σ_{n=1}^{∞} β_n · sin(nπx/L).  (11)

The solution u(x, t) can be found by Fourier series, with the initial and boundary conditions

u(x, 0) = Σ_{k=1}^{M} A_k [H(x − α_k) − H(x − β_k)],
u(0, t) = u(L, t) = 0,  (12)

where L is the length of the signal. The solution u(x, t) and the function F(x) can be expanded in Fourier sine series:

u(x, t) = Σ_{n=1}^{∞} b_n(t) · sin(nπx/L),  (13)

where the coefficients β_n = (2/π) ∫_0^L u(x, 0) sin nx dx are determined by the function F(x), and the coefficients b_n can be found by substituting the expansions (11) and (13) into the diffusion equation:

Σ_{n=1}^{∞} (∂b_n(t)/∂t) · sin(nπx/L) = Σ_{n=1}^{∞} (−(n²π²/L²) b_n(t) + β_n) · sin(nπx/L).  (14)

Comparing the coefficients of sin(nπx/L) on both sides yields

∂b_n(t)/∂t + (n²π²/L²) b_n(t) = β_n.  (15)

Equation (15) can easily be solved to obtain b_n:

b_n(t) = exp(−n²π²t/L²) ∫_0^t β_n exp(n²π²s/L²) ds.  (16)

The solution u(x, t) is then obtained by substituting this formula for b_n into (13).

Figure 6: The flow chart to get the denoised image at the final stage.

2.3. Proposed Procedure. The goal of the proposed four-stage procedure is to preserve the desired edges during the noise-reduction processing, so the EPIDE method plays an important role. However, the edges of the object contours in an image must be extracted beforehand for the EPIDE method. The Canny edge detector can automatically find some edges in images. Since not all of the contour edges are found, the user may want to interactively modify the edges to capture all desired object contours.

In the first stage of the procedure, any kind of denoising algorithm can be applied to the original noisy image to obtain a denoised image. In the second stage, an edge map is obtained by applying the Canny edge detector to the image denoised in the first stage, in order to find the edges of the object contours. Since the contour edges are not found completely, the user may need to interactively modify the edge map to keep the edges of the desired object contours. In the third stage, the user manually modifies the edges of the edge map to match the desired edges. In the final stage, the Edge Preserved Inhomogeneous Diffusion Equation (EPIDE) method is used to smooth the original noisy image, or the image denoised in the first stage, while preserving the desired edges. Two flow charts of the proposed procedure are shown in Figures 5 and 6.

Figure 5 shows the flow chart of the three stages. In the first stage, any kind of denoising algorithm can be applied to a noisy image (N) to get a previously denoised image (P). In the second stage, an edge map (E) is obtained by applying the Canny edge detector to the previously denoised image, to find the edges of the object contours. In the third stage, the modified edge map (I) captures all the desired edges manually. The flow chart for obtaining the denoised image in the final stage with the EPIDE method is shown in Figure 6.
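As a concrete sketch of the final-stage row smoothing, the following NumPy fragment expands one row and its edge term F(x) in a Fourier sine series and evolves each coefficient with the closed-form solution of (15). It is an illustration under stated assumptions, not the authors' code: the diffusivity K, the diffusion time t, the number of series terms, and the explicit inclusion of the initial-condition term alongside the forced term of (16) are our choices.

```python
import numpy as np

def epide_row(row, F, t=3.0, K=1.0, n_terms=300):
    # Smooth one image row: expand u(x,0) (the row) and F(x) in Fourier sine
    # series, evolve each coefficient analytically, and resynthesise u(x,t).
    row = np.asarray(row, dtype=float)
    F = np.asarray(F, dtype=float)
    L = len(row)
    x = np.arange(L, dtype=float)
    u_t = np.zeros(L)
    for n in range(1, n_terms + 1):
        basis = np.sin(n * np.pi * x / L)
        bn0 = (2.0 / L) * np.sum(row * basis)     # sine coefficient of u(x, 0)
        beta_n = (2.0 / L) * np.sum(F * basis)    # sine coefficient of F(x), cf. (11)
        lam = K * (n * np.pi / L) ** 2
        # closed-form solution of b_n'(t) + lam * b_n(t) = beta_n with b_n(0) = bn0
        bn_t = bn0 * np.exp(-lam * t) + (beta_n / lam) * (1.0 - np.exp(-lam * t))
        u_t += bn_t * basis
    return u_t
```

Applying epide_row to every row (x-direction) and every column (y-direction), with F built from the modified edge map, and averaging the two results gives the denoised image D of Figure 6.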

Figure 7: Comparing the performance of various noises on "Nine Square Regions". (a) Original image. (b) Gaussian noise image, σ = 0.01. (d) Salt-and-Pepper noise image, noise density 0.05. (f) Poisson noise image. (c), (e), and (g) The results of the proposed procedure applied to (b), (d), and (f), respectively.

Figure 8: (a) "Nine Square Regions", (b) noisy image with Gaussian noise with σ = 0.01, (c) wavelet, PSNR = 25.493, (d) ADF, PSNR = 23.835, and (e) the proposed procedure, PSNR = 28.495.

In the final stage, the EPIDE method is used to smooth the noisy image (N), or the previously denoised image (P), with the modified edge map (I). The EPIDE method is applied in both the x-direction and the y-direction, generating two images X and Y. Finally, a denoised image (D) is obtained by averaging the images X and Y.

3. Experimental Results

There are four test images, "Nine Square Regions", "Number and Character", "Chinese Words", and "BarCode", corrupted by Gaussian noise with zero mean.

3.1. The Peak Signal-to-Noise Ratio (PSNR). The performance is measured by the peak signal-to-noise ratio, defined as

PSNR = 20 · log₁₀(255/RMSE),  (17)

where RMSE is the Root Mean Square Error,

RMSE = sqrt( (1/(m × n)) Σ_{i=1}^{m} Σ_{j=1}^{n} [f(i, j) − g(i, j)]² ).  (18)

The functions f(i, j) and g(i, j) are the original and denoised images, respectively, and m and n give the size of the image.
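The quality measures of (17) and (18) are straightforward to compute; a small illustrative helper (written for this discussion, not taken from the paper) is shown below for 8-bit images.

```python
import numpy as np

def rmse(f, g):
    # Root mean square error of (18) between original f and denoised g
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.sqrt(np.mean((f - g) ** 2))

def psnr(f, g):
    # Peak signal-to-noise ratio of (17), with a peak value of 255
    return 20.0 * np.log10(255.0 / rmse(f, g))
```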

3.2. Results and Discussions. Section 3.2.1 shows the denoised results of the noise-reduction test. Section 3.2.2 compares the denoised results of the proposed procedure with those of the wavelet and ADF denoising methods. Section 3.2.3 describes the third stage of the proposed procedure. Section 3.2.4 shows an OCR application of the proposed procedure.

Figure 9: (a) "Number and Character", (b) noisy image with Gaussian noise with σ = 0.01, (c) wavelet, PSNR = 21.946, (d) ADF, PSNR = 22.285, and (e) the proposed procedure, PSNR = 22.495.

Figure 10: (a) Four Chinese words, (b) noisy image with Gaussian noise with σ = 0.05, (c) denoised image by wavelet, PSNR = 16.453, (d) denoised image by ADF, PSNR = 17.244, and (e) the proposed procedure, PSNR = 20.769.

3.2.1. Noise Reduction Test. The experiments tested the synthetic image "Nine Square Regions" with Gaussian noise, Salt-and-Pepper noise, and Poisson noise; the results are shown in Figure 7. The test image "Nine Square Regions" is the synthetic image shown in Figure 7(a). Figure 7(b) is the test image corrupted by adding Gaussian noise with variance 0.01. Figure 7(d) is the test image corrupted by adding Salt-and-Pepper noise with density 0.05. Figure 7(f) is the test image corrupted by adding Poisson noise. Figures 7(c), 7(e), and 7(g) are the images denoised by the proposed edge-preserving procedure.

3.2.2. Comparison with Algorithms. In the experiments we used the images "Nine Square Regions", "Number and Character", "Chinese Words", and "BarCode" to demonstrate the edge preservation capability of the proposed procedure. The PSNR values of all the denoised images, for the three denoising methods, are given in Table 1. The PSNR results of the proposed procedure are 28.495, 22.495, 20.769, and 28.021 for the four test images "Nine Square Regions", "Number and Character", "Chinese Words", and "BarCode", respectively. These values are larger than those of the wavelet and ADF denoising methods, so the proposed procedure has better denoising capability.

The test image "Nine Square Regions" is the synthetic image shown in Figure 8(a). Figure 8(b) is the test image corrupted by adding Gaussian noise with variance 0.01. Figure 8(c) is the image denoised by the wavelet method, Figure 8(d) by the ADF method, and Figure 8(e) by the proposed procedure. From the visual evaluation of images (c), (d), and (e), the proposed procedure has the best edge preservation.

The test image "Number and Character", shown in Figure 9(a), is an image of a car license plate captured by a camera. Figure 9(b) is the test image corrupted by adding Gaussian noise with a variance of 0.01. Figure 9(c) is the image denoised by the wavelet method, Figure 9(d) by the ADF method, and Figure 9(e) by the proposed procedure. From the visual evaluation of images (c), (d), and (e), the proposed procedure has the best edge preservation.

Figure 11: (a) BarCode image, (b) noisy image with Gaussian noise with σ = 0.05, (c) denoised image by wavelet, PSNR = 17.762, (d) denoised image by ADF, PSNR = 16.940, and (e) denoised image by the proposed procedure, PSNR = 28.021.

Table 1: RMSE and PSNR (dB) values of the denoised images by the EPIDE, Wavelet, and ADF methods for the four test images "Nine Square Regions", "Number and Character", "Chinese Words", and "BarCode".

Image         Nine Square Regions    Number and Character    Chinese Words       BarCode
              RMSE      PSNR         RMSE      PSNR          RMSE      PSNR      RMSE      PSNR
Noise Image   21.809    21.358       23.885    20.568        40.206    16.045    18.197    22.931
EPIDE          9.5896   28.495       19.133    22.495        23.340    20.769    10.127    28.021
Wavelet       13.549    25.493       20.379    21.947        38.319    16.453    32.994    17.762
ADF           16.397    23.835       19.602    22.285        35.022    17.244    36.27     16.940

The test image "Chinese Words" is an image with four Chinese characters, shown in Figure 10(a). Figure 10(b) is the test image corrupted by adding Gaussian noise with a variance of 0.05. Figure 10(c) is the image denoised by the wavelet method, Figure 10(d) by the ADF method, and Figure 10(e) by the proposed procedure. From the visual evaluation of images (c), (d), and (e), the proposed procedure has the best edge preservation.

The test image "BarCode" is a noise-free image shown in Figure 11(a). Figure 11(b) is the test image corrupted by adding Gaussian noise with variance 0.05. Figure 11(c) is the image denoised by the wavelet method, Figure 11(d) by the ADF method, and Figure 11(e) by the proposed procedure. From the visual evaluation of images (c), (d), and (e), the proposed procedure has the best edge preservation.

3.2.3. Interactively Modified Edge Map. The edge-preserving performance of the EPIDE method depends on the edge detection. It is difficult to find all edges in a noisy image with the Canny edge detector because of the noise interference. Although the Canny edge detector performs the same function as a Gaussian filter, it is better to use some kind of denoising algorithm in the first stage of the proposed procedure. The following experiment shows how to modify the edge map produced by the Canny edge detector, and the results of the images denoised with the EPIDE method.

Figure 12: (a) Synthetic image, (b) noisy image with Gaussian noise with σ = 0.05, (c) edge map obtained from (b), (d) denoised image at the final stage, (e) denoised image by neighborhood filters, (f) edge map obtained from (e), (g) modified edge map, (h) denoised image at the final stage, and (i) denoised image at the final stage.

Figure 12(a) is a synthetic image and Figure 12(b) is the noisy image with Gaussian noise with σ = 0.05. Figure 12(c) is an edge map obtained by applying the Canny edge detector to Figure 12(b). Figure 12(d) is the image denoised by the EPIDE method applied to Figure 12(b) with the edge map in Figure 12(c). Figure 12(e) is an image denoised by neighborhood filters applied to Figure 12(b), and Figure 12(f) is an edge map obtained by applying the Canny edge detector to Figure 12(e). Figure 12(g) is a manually modified edge map derived from the edge map of Figure 12(f). Figure 12(h) is the image denoised by the EPIDE method with the edge map in Figure 12(g) applied to the image in Figure 12(e), while Figure 12(i) is the image denoised by the EPIDE method with the same edge map applied to the image in Figure 12(b). By visual evaluation, Figure 12(i) is better than Figure 12(h), and from Table 2 the PSNR value of Figure 12(i) is higher than that of Figure 12(h). The first stage of the proposed procedure can therefore be treated as an option for different image denoising cases.

Table 2: RMSE and PSNR (dB) values of the denoised images B, D, H, and I.

        B         D         H         I
RMSE    6.3711    3.7040    3.6321    3.1651
PSNR    32.049    36.757    36.928    38.123

* Character "B" represents Figure 12(b), "D" represents Figure 12(d), "H" represents Figure 12(h), and "I" represents Figure 12(i).

Figure 13: The left column shows (a) the noisy image and the images denoised by (c) wavelet, (e) ADF, and (g) the proposed procedure; the right column, (b), (d), (f), and (h), shows the corresponding OCR results.

3.2.4. The Proposed Procedure Applications. This experiment demonstrates that denoised images with good edge preservation give better OCR results. The noisy image is corrupted with Gaussian noise with a variance of 0.08, as shown in Figure 13(a). In Figure 13, the left column shows the noisy image and the images denoised by (c) the wavelet, (e) the ADF, and (g) the EPIDE methods; the right column, (b), (d), (f), and (h), shows the OCR results.

Figure 14: The 20th row of u(x, y) of the image "Nine Square Regions" with various noises: (a) Gaussian noise, (c) Salt-and-Pepper noise, (e) Poisson noise. The denoised results are, respectively, (b), (d), and (f).

and it’s variance is 0.08. The denoised images are shown results in Figure 13(h). All the characters in two sentences
in Figures 13(c), 13(e) and 13(g) and they are denoised by “I Love You“ and “Please Give Me Your Favor” are correctly
wavelet, ADF and EPIDE methods. Figures 13(b), 13(d), recognized in the image denoised by EPIDE method. There
13(f) and 13(h) are OCR results of images in the left column. are some errors for character recognition are in Figures
To evaluate the denoising performance of wavelet, ADF and 13(b), 13(d) and 13(f). These results are obtained by JOCR
EPIDE methods, it is suitable to use the character recognition software on the noisy and denoised images by wavelet and
software JOCR [16] to obtain the words in noised and ADF methods.
denoised images. The experimental results show that the The results of the above experiments demonstrate the
image denoised by EPIDE can have the best recognition effective of our proposed denoising procedure. In next

3.2.5. The Denoising Capability of the Diffusion Equation. There are many types of image noise, such as Gaussian noise, Salt-and-Pepper noise, Shot noise, and Uniform noise. Noise is randomly distributed in the image intensity values, and at different pixels the intensity values are independent of one another. For example, Gaussian noise, Shot noise, Salt-and-Pepper noise, and Uniform noise follow Gaussian, Poisson, fat-tail, and Uniform distributions, respectively. The 20th row of u(x, y) of the image "Nine Square Regions" in Figure 7 with Gaussian noise, Salt-and-Pepper noise, and Poisson noise has the three profiles shown in Figures 14(a), 14(c), and 14(e). The three profiles of the denoised image are shown in Figures 14(b), 14(d), and 14(f). The profiles in Figures 14(a), 14(c), and 14(e) are the initial conditions of the diffusion equation (6), while the profiles in Figures 14(b), 14(d), and 14(f) are the steady-state solutions of the steady-state diffusion equation

K ∇²u(x, t) + F(x) = 0.  (19)

The image denoising results in Figures 14(b), 14(d), and 14(f) are consistent with the theoretical explanation of (19).

4. Conclusion

The contribution of this paper is a procedure for smoothing a noisy image, or an image denoised by any kind of denoising algorithm, while preserving the desired edges. To achieve preservation of the designated edges, the inhomogeneous terms of the diffusion equation are formulated using the derivative of the Delta function. Fourier series are used to obtain the exact solution of the diffusion equation; the exact solution is a function of time whose value is the intensity of each pixel in the image. The Delta functions in the diffusion equation are used to locate the positions of the edge pixels of each object in the image. To locate the contour pixels of each object, some image preprocessing methods and an edge detection method are needed to find the edges of the object contours. Since not all contour edges are found, the user can interactively modify the edge map to keep the desired object contours. In the experiments, the proposed denoising method with edge preservation capability gives the best OCR results compared with the wavelet denoising method and anisotropic diffusion filters.

Acknowledgment

The authors thank the National Science Council (NSC) for partial financial support (NSC 97-2115-M-324-001 and NSC 98-2115-M-324-001).

References

[1] W. E. Weideman, M. T. Manry, and H. C. Yau, "A comparison of a nearest neighbor classifier and a neural network for numeric handprint character recognition," in Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, USA, 1989.
[2] C. C. Tappert, "Recognition System for Run-on Handwritten Characters," US patent no. 4731857, International Business Machines Corporation, Armonk, NY, USA, March 1988.
[3] S.-H. Hahn, J.-H. Lee, and J.-H. Kim, "A study on utilizing OCR technology in building text database," in Proceedings of the 10th International Workshop on Database and Expert Systems Applications, pp. 582–586, 1999.
[4] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice-Hall, Upper Saddle River, NJ, USA, 2002.
[5] M. Welk and J. Weickert, "Semidiscrete and discrete well-posedness of shock filtering," in Mathematical Morphology, Springer, Berlin, Germany, 2005.
[6] S. Guillon, P. Baylou, M. Najim, and N. Keskes, "Adaptive nonlinear filters for 2D and 3D image enhancement," Signal Processing, vol. 67, no. 3, pp. 237–254, 1998.
[7] Y. B. Yuan, T. V. Vorburger, J. F. Song II, and T. B. Renegar, "A simplified realization for the Gaussian filter in surface metrology," in Proceedings of the 10th International Colloquium on Surfaces, M. Dietzsch and H. Trumpold, Eds., p. 133, Shaker, Chemnitz, Germany, January-February 2000.
[8] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 629–639, 1990.
[9] S. K. Weeratunga and C. Kamath, "A comparison of PDE-based non-linear anisotropic diffusion techniques for image denoising," in Image Processing: Algorithms and Systems II, Proceedings of SPIE, Santa Clara, Calif, USA, January 2003.
[10] G. Gerig, O. Kubler, R. Kikinis, and F. A. Jolesz, "Nonlinear anisotropic filtering of MRI data," IEEE Transactions on Medical Imaging, vol. 11, no. 2, pp. 221–232, 1992.
[11] R. P. Kanwal, Generalized Functions: Theory and Applications, Birkhäuser, Boston, Mass, USA, 3rd edition, 2004.
[12] G. B. Folland, Fourier Analysis and Its Applications, Brooks/Cole, Pacific Grove, Calif, USA, 1992.
[13] M. Sen, Analytical Heat Transfer, Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, Ind, USA, 2008.
[14] R. Bracewell, The Fourier Transform and Its Applications, McGraw-Hill, New York, NY, USA, 2nd edition, 1986.
[15] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986.
[16] http://home.megapass.co.kr/∼woosjung/Product JOCR.html.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 896708, 12 pages
doi:10.1155/2010/896708

Research Article
Full Waveform Analysis for Long-Range 3D Imaging Laser Radar

Andrew M. Wallace (EURASIP Member), Jing Ye (EURASIP Member), Nils J. Krichel, Aongus McCarthy, Robert J. Collins, and Gerald S. Buller
School of Engineering and Physical Sciences, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS, UK

Correspondence should be addressed to Andrew M. Wallace, [email protected]

Received 27 December 2009; Accepted 21 June 2010

Academic Editor: Yingzi Du

Copyright © 2010 Andrew M. Wallace et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

The new generation of 3D imaging systems based on laser radar (ladar) offers significant advantages in defense and security
applications. In particular, it is possible to retrieve 3D shape information directly from the scene and separate a target from
background or foreground clutter by extracting a narrow depth range from the field of view by range gating, either in the sensor
or by postprocessing. We discuss and demonstrate the applicability of full-waveform ladar to produce multilayer 3D imagery, in
which each pixel produces a complex temporal response that describes the scene structure. Such complexity caused by multiple and
distributed reflection arises in many relevant scenarios, for example in viewing partially occluded targets, through semitransparent
materials (e.g., windows) and through distributed reflective media such as foliage. We demonstrate our methodology on 3D image
data acquired by a scanning time-of-flight system, developed in our own laboratories, which uses the time-correlated single-photon
counting technique.

1. Introduction

In general, laser range finding can be achieved on the basis of triangulation or time-of-flight, of which the latter method is more suited to long-range measurement. In the context of time-of-flight, the principal methodologies include measurement of the phase shift of an amplitude-modulated signal, measurement of the frequency shift of a frequency-modulated signal, or measurement of the transmit-receive pulse separation in a pulsed system [1]. To build a 3D image, either the laser beam must be scanned across the scene, or a static laser beam diverges to encompass the target and a focal plane array of independent pixels records the received radiation.

Full waveform ladar [2, 3] requires the analysis of multiple returns that occur within a single measurement or pixel. One of the major applications of full waveform topographic ladar analysis is in the survey of forest canopies to monitor environmental changes [4, 5], but this analysis also has important applications in defense and security [6]. One key application is the detection and classification of targets on the ground under tree cover using airborne imagery, which is related to environmental mapping and is the focus of the Jigsaw [7] and Swedish Defence Research [8] systems. However, full waveform analysis is also required in many other situations where single pixel returns are composed of multiple reflections within the laser footprint. For example, this occurs at an occluding boundary, that is, one object behind another, where objects are partially obscured, for example, behind foliage, camouflage, or blinds, when imaging through semitransparent surfaces, or where a single surface may be distributed in depth or moving during the exposure. If selected infrared wavelengths are used, these can penetrate better through the atmosphere or glass [6], and if multiple wavelengths are used, this can be more informative for surface classification [9, 10].

In many defense and security applications, it is also desirable that the active laser pulse is eye-safe and "covert", and that it be of short duration and low energy. To that end, we have developed a 3D imaging ladar system based on a low-power pulsed laser source and a time-correlated single photon counting detector, for which the detailed optical design is described in [11].

The twin demands of low power and multiwaveform analysis place significant demands on the signal processing methodology.

Typical techniques within the frequentist framework are to calculate the maximum likelihood estimates (MLE) of the parameters for every possible number of signal returns, and then use information-theoretic criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the minimum description length (MDL) [12], to determine the number of signals. One popular tool for finding the MLE is expectation-maximization (EM) [13]. Compared with the centroid method and the matched filter, this algorithm is computationally more expensive, but it may give estimates of higher accuracy. However, EM holds a potential risk in that it might converge to a local maximum of the likelihood [14] or diverge to an infinite value [15]. Additionally, it is sensitive to initial values and is not efficient for data sets containing numerous observed events, in our case the timing information for the received photons. Moreover, even though AIC, BIC, and MDL introduce penalty terms to avoid overfitting the data, that is, adding more returns to increase the likelihood, they still have a tendency to produce more complicated models, which correspond to more signal returns [14].

In [16], a hybrid approach is proposed, which first applies a deterministic nonparametric bump-hunting process for initial estimates of the signal returns, and second applies Poisson-MLE to refine the estimates. Although it is effective in many cases, it fails to resolve two closely separated peaks and is not able to produce satisfactory results when the background noise level is comparable to or higher than the signal amplitudes.

In order to detect multiple, small returns embedded in background, noise, and clutter, we have been developing concurrently ladar signal analysis methods within the Bayesian framework, based on reversible-jump Markov chain Monte Carlo (RJMCMC) techniques, for both single pixel and image data [14]. In this paper, we report the development and application of these methods to process images from the new covert, depth imaging sensor, and compare our results with conventional cross-correlation and peak detection applied to the same data.

The organisation of the paper is as follows. In Section 2, we describe briefly the 3D image sensor and the conditions for data acquisition. In Section 3, we describe the processing methodology. In Section 4, we apply this methodology to images acquired by the sensor to detect wholly visible and partially concealed targets at a moderate range of 325 meters, using our own test facility. We also show how the RJMCMC method can improve our interpretation of the data. Finally, in Section 5, we conclude and summarise some of the key issues that must be addressed to develop these ideas further.

Figure 1: (a) Schematic diagram indicating the principal components of the scanning system. Electrical paths are denoted by solid lines, optical paths by dashed lines. Si-SPAD is a silicon single-photon avalanche diode. (b) The transceiver head assembly. The system dimensions are approximately 275 mm by 275 mm by 175 mm. The two galvanometer servo-control circuit boards (not visible) are on the underside of the slotted baseplate.

2. The Ladar Imaging System

In a time-correlated single photon counting (TCSPC) ranging system, the general principle is to direct a pulsed laser beam towards the target and to collect and record the times of arrival (since pulse transmission) of the back-scattered photons. Hence, the distance to the target (z) can be computed, and knowing the geometry of the imaging system, the direction of the transmitted laser signal can be used to compute the (x, y) coordinates. This basic principle is applicable both to scanning systems, such as our own, and to arrays of single photon counting detectors such as that reported by Sudharasan et al. [17]. While arrayed detectors provide parallel data acquisition, which has clear advantages in acquiring data from moving targets and in eliminating scanning components, there are problems with crosstalk and fill-factor. In general, we can achieve better temporal response and sensitivity with a single element detector, which is of considerable importance for covert, low-power operation. The system of interest is illustrated in Figure 1(a).

The system uses a pulsed semiconductor diode laser, of pulse half-width 90 ps, operating at 842 nm wavelength, that emits low energy pulses (<30 pJ). The laser is capable of operating at repetition rates in excess of 10 MHz, although 2 MHz was the maximum rate used in these measurements. Scene scanning is performed by a pair of galvanometer mirrors. The optical system is used to direct the outgoing laser pulses onto each optical field position of the target, and also to efficiently collect the scattered photons returned from each corresponding pixel of the imaged scene.

The collected return photons are routed using polarisation optics to an individual, high-performance single-photon detector module via a single mode optical fiber. The signal from the single-photon detector is recorded as a timed photon event, equivalent to a range (z), which can be associated with an (x, y) coordinate that is known from the calibrated scanning optics. For the particular optical configuration and scanning parameters used in these measurements, the maximum field of view was 55 mrad and the beam width and scanning resolution were both approximately 23 mm at a standoff distance of 325 m. In general, each detector event records a photon arrival, of which some will be returned from the target, some from stray events (other light sources), and some will be due to detector dark counts. To reduce the stray photon events, our system includes spatial filtering (by coupling into the single mode optical fiber), spectral filtering (by narrowband filtering at the known laser wavelength), and temporal filtering (by the TCSPC technique, as there is a finite window in which to record a photon event). Timing uncertainty is introduced by jitter in the master clock, the laser driver, the detector (silicon SPAD), and the timing electronics. For all those reasons, we use many pulses to build up a statistical distribution of the number of recorded photon arrivals as a function of the arrival time. This can be interpreted as a range measurement and, by scanning and recording distributions at each pixel, as a depth image. An example of a measurement that records data from more than one surface in the field of view of a single pixel is shown in Figure 2.

Figure 2: Multiple returns recorded from a distributed target in the field of view of a single pixel. The horizontal axis is equivalent to the round-trip distance, and the vertical axis is a measure of the strength of the signal return.

3. Full Waveform Ladar Analysis Based on RJMCMC

3.1. Bayesian Modelling of Ladar Signals. In previous work, we have shown how Bayesian analysis (using the reversible jump Markov chain Monte Carlo (RJMCMC) computational methodology [18]) can be used to construct multilayered 3D images [14] when the laser return consists of multiple peaks due to the footprint of the beam impinging on a target with surfaces distributed in depth. In dense ladar images, one can improve the quality of the 3D data by considering spatial context through a Markov random field (MRF) [19]. We have also shown how multispectral LiDAR can be used to classify different types of surface response on the basis of different colour responses, using a maximum of six wavelengths [10]. We have applied these techniques successfully to both Burst Illumination Laser (BIL) [20] and TCSPC [21] ladar systems. As pointed out by Mallet and Bretar [3] in their survey on full waveform LiDAR for remote sensing, our RJMCMC method is robust (finding a global minimum in a multimodal distribution), no initialization or gradient computations are required, and the grammar of instrumental models is extensible.

To interpret this data, we use a piecewise exponential model for the Si-SPAD return, first introduced in [22], because it has the appropriate shape parameters to model the physical transport processes within the Si-SPAD detector. The parametric form of the expected temporal variation of the photon count distribution is given by

f_system = β · { exp(−(t1 − t0)²/2σ²) exp((i − t1)/τ1),                      i < t1;
                 exp(−(i − t0)²/2σ²),                                        t1 ≤ i < t2;
                 exp(−(t2 − t0)²/2σ²) exp(−(i − t2)/τ2),                     t2 ≤ i < t3;
                 exp(−(t2 − t0)²/2σ²) exp(−(t3 − t2)/τ2) exp(−(i − t3)/τ3),  i ≥ t3,      (1)

where β is an amplitude factor, t0 is the time of the peak maximum, and t1, t2, and t3 are the points at which the changeovers between the functions occur, as shown in Figure 3(b). In this study, we assume that the shape parameters are fixed and known from the instrumental response. Hence, we only need to compute the amplitude and the time of arrival, measures of reflectance and distance, respectively.
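For reference, the instrumental response of (1) can be evaluated as in the following minimal NumPy sketch. This is an illustration written for this discussion rather than the authors' code, and the argument names are our own.

```python
import numpy as np

def f_system(i, beta, t0, sigma, t1, t2, t3, tau1, tau2, tau3):
    # Piecewise exponential instrumental response of (1), evaluated on an
    # array of channel indices i; the shape parameters are assumed known.
    i = np.asarray(i, dtype=float)
    g = lambda t: np.exp(-(t - t0) ** 2 / (2.0 * sigma ** 2))  # Gaussian factor
    return beta * np.piecewise(
        i,
        [i < t1, (i >= t1) & (i < t2), (i >= t2) & (i < t3), i >= t3],
        [lambda x: g(t1) * np.exp((x - t1) / tau1),               # rising edge
         lambda x: np.exp(-(x - t0) ** 2 / (2.0 * sigma ** 2)),   # Gaussian peak
         lambda x: g(t2) * np.exp(-(x - t2) / tau2),              # first decay
         lambda x: g(t2) * np.exp(-(t3 - t2) / tau2) * np.exp(-(x - t3) / tau3)])  # long tail
```

With the shape parameters quoted in the caption of Figure 3, this can be used to plot the instrumental response for a given amplitude β and peak time t0.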

For full waveform ladar, multiple returns are observed against a background level whose expected value is constant across all bins of the photon (intensity) histogram, y. This is considered as a sample of a nonnormalized statistical mixture distribution with density

F(i; k, φ) = Σ_{j=1}^{k} f_system(i; β_j, t0_j) + B,  (2)

where k is the number of peaks, B is the background, and φ is the set of parameters of each signal and the background: φ = (β, t0, B) with β = (β1, β2, . . . , βk) and t0 = (t01, t02, . . . , t0k); f_system is defined by (1). The number of photons recorded, y_i, in each channel i is considered as a random sample of a Poisson distribution with intensity F(i; k, φ),

P(y_i | k, φ) = e^{−F(i;k,φ)} F(i; k, φ)^{y_i} / y_i!.  (3)

Assuming that the observations recorded in each channel i of the histogram are conditionally independent given the values of the parameters, the joint probability distribution of y is defined as

L(y | k, φ) = Π_{i=1}^{imax} e^{−F(i;k,φ)} F(i; k, φ)^{y_i} / y_i!.  (4)

Figure 3: (a) A single magnified selected peak from Figure 2. (b) Instrumental response of the TCSPC ladar signal (dotted line) and fitting result (solid line) using the piecewise exponential model, with fitting errors (dashed line). The parameter sets corresponding to (1) are: β = 5.41, t0 = 2128.70, (t1, t2, t3) = (2111.15, 2146.36, 2193.93), (τ1, τ2, τ3) = (6.32, 10.04, 292.79).

In the Bayesian paradigm, our goal is to find the posterior distribution of the number, positions, and amplitudes of the multiple returns in the full waveform ladar signal. The posterior distribution is defined as

π(k, φ | y) = L(y | k, φ) f(k, φ) / ∫ L(y | k, φ) f(k, φ) d(k, φ) ∝ L(y | k, φ) f(k, φ),  (5)

where the likelihood function L(y | k, φ) is defined by (4) and the full joint prior distribution is given by f(k, φ).

3.2. RJMCMC Methodology for Ladar Signal Analysis. We follow the methodology described in [14] by constructing a Markov chain whose transitions involve changes to the number, positions, and amplitudes of the peaks in the return signal. Hence, we consider the histogram as a discrete representation of a spatially heterogeneous Poisson process whose intensity is a linear superposition of the scaled and shifted returns, as defined in (2). In the RJMCMC paradigm, the transitions of the Markov chain involve several moves within a single "sweep":

(1) Updating the positions t0.
(2) Updating the amplitudes β.
(3) Updating the background B.
(4) Random birth or death of a peak.
(5) Random splitting of a peak into two peaks or merging of two peaks into a single peak.

At each iteration of the chain, we follow the Metropolis-Hastings algorithm. Moves of type (1), (2), and (3) allow the posterior distribution to be explored within a state space with a fixed dimension, k. The Metropolis-Hastings method draws the proposed values from an arbitrary proposal probability distribution q(·, ·). These values are accepted with probability α(·, ·); otherwise they are rejected and the existing values are retained. The acceptance probability is expressed as

α(φ, φ′) = min{1, [π(φ′ | y) q(φ′, φ)] / [π(φ | y) q(φ, φ′)]}.  (6)

In this case, φ can be β, t0, or B, depending on whether it is a move of type (1), (2), or (3).

For full waveform ladar, we do not know the value of k, so the state space becomes a set of parameter subspaces with different dimensionality. Since both k and φ_k are subject to inference, it is necessary to compare the different models while learning about the parameters within each model. Therefore, we use the RJMCMC algorithm, which allows jumps between subspaces for different k in addition to within-model parameter updates for a particular k, that is, steps (4) and (5) above. Still following the Metropolis-Hastings procedure, the target distribution now becomes π(k, φ | y) and the acceptance probability becomes

α = min{1, [π(k′, φ′ | y) r_m(φ′)] / [π(k, φ | y) r_m(φ) q(u)] · |∂(φ′)/∂(φ, u)|},  (7)

where r_m(φ) is the probability of move type m when in state φ, q(u) is the density function of a continuous random vector u, and the Jacobian term |∂(φ′)/∂(φ, u)| arises from the deterministic transfer from the variables (φ, u) to φ′.
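To make the fixed-dimension moves concrete, the following sketch (an illustration, not the authors' sampler) evaluates the Poisson log-likelihood of (4) for a k-return mixture and performs one Gaussian random-walk Metropolis-Hastings update of a peak position. Here `response` is any callable implementing f_system of (1) for a single return; the flat-prior assumption on the position and the step size are illustrative choices.

```python
import numpy as np

def poisson_log_likelihood(y, betas, t0s, B, response):
    # log of (4), dropping the log(y_i!) term, which does not depend on the
    # parameters and therefore cancels in Metropolis-Hastings ratios;
    # B > 0 keeps the mixture intensity strictly positive
    i = np.arange(1, len(y) + 1, dtype=float)
    F = B + sum(response(i, b, t) for b, t in zip(betas, t0s))
    return float(np.sum(y * np.log(F) - F))

def update_position(y, betas, t0s, B, response, j, step=10.0, rng=None):
    # one random-walk Metropolis-Hastings update of the jth peak position t0[j]
    rng = rng or np.random.default_rng()
    proposal = list(t0s)
    proposal[j] = t0s[j] + step * rng.standard_normal()
    log_ratio = (poisson_log_likelihood(y, betas, proposal, B, response)
                 - poisson_log_likelihood(y, betas, t0s, B, response))
    # with a symmetric proposal and a flat prior on the position, the
    # acceptance probability of (6) reduces to the likelihood ratio
    if np.log(rng.uniform()) < log_ratio:
        return proposal
    return list(t0s)
```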

3.3. Convergence Assessment. A properly designed Markov chain Monte Carlo sampler should generate a convergent Markov sequence whose limiting distribution is the true joint posterior distribution of interest [23]. However, in practical applications only a finite number of samples can be produced, and it is therefore important to choose the chain length appropriately and assess the convergence of the Markov chain to the stationary distribution.

Three separate but related issues need to be considered when determining the chain length [24, 25]. First, evaluate the length of the burn-in period, that is, determine from which observation point the chain has "forgotten" its starting value and escaped from its influence. At this point, the chain has reached the stationary distribution, and the previous samples should be discarded to eliminate the estimation bias introduced by the transient period. Second, determine whether the chain is long enough to fully represent the underlying distribution and conclude its convergence to an asymptotic distribution. Third, evaluate whether the samples are adequate to achieve a certain precision of estimation.

Over the last two decades, a number of different convergence diagnostics have been proposed, which can be classified into two categories. For theoretical approaches, the attempt is to predetermine the number of iterations required to ensure convergence by analyzing the Markov transition kernel and stationary distribution; a collection of approaches can be found in [26, 27] and references therein. Although they hold formal guarantees, these algorithms are not feasible in practice due to sophisticated mathematical calculation and loose convergence bounds. Therefore, as pointed out in [26], empirical methods are almost always applied, relying on the outputs of MCMC samplers and diagnostics computed from the produced sequence to check convergence. On the one hand, they provide evidence of convergence; on the other hand, all the diagnostics are unreliable, since in practice the target limiting distribution always remains unknown and it is impossible to conclude with certainty that the finite MCMC samples are sufficient to cover the whole support of the underlying stationary distribution. From this point of view, we should be cautious about the diagnostic results.

In the literature, empirical methods seek to conclude the convergence through bias and/or variance evaluation. The Gelman and Rubin diagnostic methodology presented in [23, 28] compares the samples drawn from several independent sequences with different starting points and quantitatively evaluates mixing by analyzing the within-sequence and between-sequence variance. The estimation bias arising from the produced samples is uncovered by multiple separate chains rather than a single chain, and therefore it has comparatively higher diagnostic reliability in terms of detecting whether the underlying stationary distribution has been fully explored and the chains have converged to the same limiting distribution. This is particularly significant when applied to multimodal posterior distributions.

The idea of the Gelman and Rubin method is that, as the number of samples increases, each individual chain will explore larger parts of the parameter space, and consequently the overall and within-sequence variances will both converge to the true model variance. Assume that we simulate I > 2 independent sequences initialized with overdispersed starting points, each of length 2T, and discard the first T samples as the burn-in period. For any scalar function x(θ), we label the tth observation in chain i as x_i^t and calculate the between-sequence variance B,

B = (T/(I − 1)) Σ_{i=1}^{I} (x_i^· − x^··)²,  (8)

where

x_i^· = (1/T) Σ_{t=T+1}^{2T} x_i^t,   x^·· = (1/I) Σ_{i=1}^{I} x_i^·.  (9)

The within-sequence variance W is estimated by

W = (1/I) Σ_{i=1}^{I} s_i²,  (10)

where

s_i² = (1/(T − 1)) Σ_{t=T+1}^{2T} (x_i^t − x_i^·)².  (11)

The variance of x in the target distribution, V, is estimated by

V̂ = ((T − 1)/T) W + (1 + 1/I) B/T.  (12)

The convergence of the Markov chain is monitored by the estimated potential scale reduction factor (PSRF)

R̂ = sqrt(V̂ / W).  (13)

As T → ∞, the total variance estimate V̂ should decrease while the within-sequence variance W should increase, and finally the PSRF should theoretically decline to 1. If R̂ is large, it indicates that the posterior distribution should be further explored. Once the PSRF is close to 1, we assume the Markov chain has converged to the target distribution.

4. Experimental Comparison: Cross-Correlation, MCMC and RJMCMC

In this section, we present the analysis of images acquired under bright daylight conditions of two distant outdoor scenes, comparing methods based on cross-correlation and on fixed- and variable-dimension Markov chain Monte Carlo analysis. Our images are of a life-sized mannequin (a human figure) in full view of the sensor, and of the same mannequin partially concealed behind a fence. The data were acquired at a range of approximately 325 meters. The equivalent scene dimensions were 0.8 m width by 2.0 m height, and the scanned image resolution was 32 by 128 pixels for the whole mannequin. The pulse repetition frequency was 2 MHz, resulting in an average optical power of 40 μW. The pixel dwell time was 1.0 s.

To assess the ability of the RJMCMC algorithm for multiple peak detection, and particularly its capacity to resolve closely separated peaks, we also set up a remote target containing several distributed surfaces with known separations, which provides ground truth and allows us to compare the performance with the cross-correlation method.

Figure 4: Analysis of time-of-flight ladar data, in which the histogram bins have been converted to relative depth in meters. The first column shows the raw pixel data (in blue). The second column magnifies the plots of the signal peaks in the first column. The third column shows the normalized cross-correlation values (blue curves) and the frequencies of positions (black bars) obtained from the MCMC samplers. The last column tracks the corresponding PSRF values against the number of samples. The final fit estimates (from MCMC) are the red curves in the first column.
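For comparison, the cross-correlation baseline can be sketched as follows. The scaling of the correlation values and the handling of the template alignment offset here are illustrative simplifications, not the exact procedure used to produce Figure 4.

```python
import numpy as np

def cross_correlation_estimate(hist, template):
    # Correlate the photon-count histogram with the instrumental response
    # template and return the channel of the maximum, the single-return
    # estimate used by the baseline detector, plus the scaled correlation.
    hist = np.asarray(hist, dtype=float)
    template = np.asarray(template, dtype=float)
    c = np.correlate(hist, template - template.mean(), mode="same")
    c = c / np.max(np.abs(c))   # scale to [-1, 1] for plotting
    return int(np.argmax(c)), c
```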

4.1. Mannequin in Full View: Cross-Correlation and MCMC. In the first example, the mannequin is in full view, standing in front of a concrete pillar, as shown in Figure 5. It was anticipated that the majority of pixels would have clear and distinct single returns from the surface of either the mannequin or the pillar. Given the divergence of the beam, there may be some mixed pixels at the occluding boundary of the mannequin, and there may be pixels with no return as they miss the targets altogether. In short, this is a situation in which a cross-correlation detector based on the system instrumental response should perform well, and there should be questionable need for the added complexity of Markov chain Monte Carlo analysis. Further, since the expectation in processing this data set is to estimate the range of a single surface return from either the mannequin or the pillar, we apply the fixed-dimension Markov chain Monte Carlo (MCMC) approach to avoid redundant computation caused by trans-dimension jumps. Accordingly, only the first three steps in Section 3.2 are used.

The unknowns (t0, β, B) subject to inference have independent priors. To completely eliminate any prior knowledge of the peak position, t0 is drawn from a uniform distribution on [1, imax]. The peak amplitudes (β) and background (B) follow Gamma distributions Γ(C, D) and Γ(E, F), with the shape parameters C, E set to 6 and 1.5, while the scale parameters D, F are (max(y)/2)/6 and mean(y), respectively, where y is the histogram of photon counts. The previously unspecified proposal distributions are set as follows: all of the parameter updates employ a Gaussian random walk whose proposal means are the current sample values. The standard deviations for the amplitude (σ_β) and background (σ_B) updates are both 0.3. For position updates, a delayed rejection step [14] is carried out to allow movement between posterior estimates that correspond to more widely separated channels.


When using delayed rejection, the scale in each step is characterized by σ_t0^(step 1) = 1000 and σ_t0^(step 2) = 10, respectively.

We first generate multiple chains for each pixel and evaluate the convergence. After finding a safe convergence length, we then run single MCMC chains with k = 1 on all the pixels with a bounded number of iterations (5000), including the 500-sample burn-in period. This is consistent with the initial estimate. Subsequently, to assess the convergence of the MCMC chains, we produce four independent sequences for each pixel and monitor the Gelman and Rubin diagnostic statistic (PSRF) defined in Section 3.3 every 100 samples. The chain generation is terminated when convergence is concluded, that is, when the PSRF falls below a preset threshold of 1.002, at which point the posterior distributions p(t0 | y, k = 1) obtained from all the sample trajectories become approximately the same.

Figure 4(a) presents a representative pixel with a single distinct return. For this type of pixel data, there is a clear, sharp peak in the normalised cross-correlation plot and a distinct preference in the frequency of positions obtained from the MCMC sequences; their maximum values are located in the same channel index, as shown in Figure 4(c). In this circumstance, the cross-correlation approach can easily detect the surface return, and according to Figure 4(d), the MCMC chains converge rapidly with a small number of samples (about 500 samples after the burn-in period) due to the simplicity of the parameter space.

For the low-amplitude return in Figure 4(e), the cross-correlation approach gives several extrema, as displayed in Figure 4(g). Such low amplitude may be caused primarily by lower reflectance back towards the receiver, either because of the material properties or its angle to the beam direction. In this case, it is difficult to decide with certainty where the surface return is located, although we can always define it to be the one corresponding to the maximum cross-correlation value. In comparison, the power of the MCMC methodology lies in supplying Bayesian evidence for the final answer. In other words, the histogram of t0 indicates the posterior distribution of the estimates. As the parameter space becomes more complex, the posterior distribution is spread over a wider channel range and becomes bimodal, which in turn results in a slower convergence rate and an increased chain length in excess of 4000 samples.

Another example is shown in Figure 4(k). For this pixel, the bin index of the maximum cross-correlation does not equal that of the p(t0 | y, k = 1) posterior mode. Hence, the MCMC chain gives a different and better substantiated estimate of the true value, further demonstrating the power of the Bayesian approach.

Figure 5: 32 × 128 pixel image of a life-sized mannequin scanned at a distance of 325 m in daylight conditions. (a) Photograph of the 1.8 m tall mannequin in the scan position. (b) and (c) Three-dimensional plots of the processed depth information using the cross-correlation and MCMC methods, respectively. Empty pixels in the plots contained depth values outside the displayed range. The lower number of missing pixels in (c) on noncooperative target surfaces with low reflectance, especially the mannequin's trousers, demonstrates the MCMC algorithm's advantage in resolving low-intensity returns.

3D images based on these two methods are provided in Figure 5, where a target range gate is set and those pixels with target position estimates beyond this preset gate are treated as zero return. It is observed that there are a few more pixels beyond the target range with cross-correlation, which

implies that the maximum values do not always correspond to the correct surface position. This is consistent with the discussion of the illustrative pixel data showing the strength of the MCMC method in processing low-amplitude ladar signals hidden in background, in that the posterior mode is more informative, robust, and reliable.

Figure 6: Close-up photograph of the upper half of the mannequin positioned 1 m behind a wooden fence. The scene was scanned at a standoff distance of 325 m in daylight.

4.2. Mannequin Concealed by Fence: Cross-Correlation and RJMCMC. In the next example, a wooden fence is placed approximately 1 meter in front of the mannequin, as shown in Figure 6. The image resolution of the scanned upper half of the mannequin is 32 by 48 pixels. Because of the area of the laser footprint, it is highly likely that some pixels may observe multiple reflections composed of some or all of the fence, the mannequin, and the pillar behind, where the beam hits occluding boundaries. In this situation, determination of the number of surfaces is an additional crucial issue, and so we apply the RJMCMC method to obtain varying-dimensional ladar signal analysis.

In one sweep of the RJMCMC algorithm, the fixed-dimensional parameter updates (steps 1–3 of Section 3.2) follow the MCMC sampler settings. Jumps between parameter subspaces with different dimensions are accomplished by steps 4 and 5, in the same manner as [14]. Although our expectation would be that the number of surface returns in any single pixel would not be greater than three in this example, we are conservative in allowing the varying-dimension sampler to explore k values from 0 to 5.

Figure 7 illustrates representative pixels containing zero, one (either mannequin or fence), two (fence and mannequin), or three returns (fence, mannequin, and pillar), with the corresponding photon-count histograms, unified cross-correlation values, p(k | y) estimates, and fitting results from the RJMCMC sampler.

The first row of Figure 7 illustrates a pixel in which the beam misses all three targets, so that no surface return exists. The use of the cross-correlation method is difficult when there is no surface return, as shown in Figures 7(a) to 7(d). In comparison with Figure 4(g), from a pixel with a small signal-to-background ratio, Figure 7(c) shows the probable existence of at least one surface return. However, according to the asymptotic posterior probability estimate of p(k | y), no target return is the most probable conclusion. If we examine the second and third rows of Figure 7, then we see situations analogous to Figure 4(e), in that there are single returns from the fence and the mannequin, respectively. The difference in this case is that we have applied full RJMCMC chains, so that the posterior probability estimate, p(k | y), shows one return.

Of more interest are those pixels containing more than one return, shown in Figures 7(m)–7(x). The fourth row has distinct returns from the fence and mannequin, and the RJMCMC sampler has a very strong preference for two returns. The fifth row is far less distinct, but the sampler again shows a strong posterior probability estimate of two peaks, although the second one might be difficult to detect automatically with a cross-correlation detector, for example, using a fixed (or even proportional) threshold. Due to the varying surface reflectances and angles, pixels can have different photon intensities, which makes it a difficult problem to choose a reliable threshold. The corresponding parameter estimates of the two surface returns shown in Figure 7(q) correspond in depth to the known ground truth of the relative separation. Finally, the last row shows one of the pixels in which the beam partially reflects from the fence, partially transmits through a gap and hence reflects from the mannequin, but near an occlusion boundary so that part reflects from the pillar behind. The posterior estimate of k favours 3 surfaces, but it is by no means as clear cut as the earlier examples, and the parameter estimates of the 3 surface positions shown in Figure 7(u) correspond to the fence, mannequin, and pillar separations at this point.

To better illustrate the posterior estimates of the number of surfaces, p(k | y), Figure 8 shows those pixels in which 0, 1, 2, and 3 surfaces were estimated. Physically, one expects no returns when the laser hits no surface, or where the surface angle is so oblique (e.g., at the extremities of the pillar) that no return is likely. In this image, these are primarily where the beam goes through the fence but above both mannequin and pillar. When k = 1 it hits a single surface, and when k = 2, two surfaces, as described above. There are only a few pixels for which k = 3, where the beam grazes the left arm, and no estimates of k > 3. Figure 9 shows a surface plot of the meshed (X, Y, Z) data for the 3D image of the partially concealed mannequin behind the fence. As the mannequin surface has been interpolated and smoothed from the raw data values, it should be considered as illustrative, but there was no necessity for outlier removal, and the shape of the upper body is relatively well defined.

4.3. Real Data with Known Geometry: Cross-Correlation and RJMCMC. We set up a remote target at a range of approximately 325 meters, which contained 6 distributed surfaces with separations between adjacent surfaces of {450, 10, 200, 30, and 90 mm}.

10 EURASIP Journal on Advances in Signal Processing

4 1 3500
3.5 3000

Cross-correlation
Photon counts 3 0.8
2500

Frequency
2.5 0.6 2000
2
0.4 1500
1.5
1 1000
0.2
0.5 500
0 0 0
1 2 3 4 5 6 7 8 4 4.5 5 5.5 0 2 4 6 8 0 1 2 3
Depth (m) Depth (m) Depth (m) Number of peaks
(u) (v) (w) (x)

Figure 7: Comparison of ladar signal analysis for the concealed mannequin using RJMCMC and cross-correlation. The first column shows
the raw data and the posterior parameter estimates from the RJMCMC method, while the second column gives the magnified plot of the
signal peaks. The third column shows the cross-correlation function. The right hand column shows the posterior probability estimate of the
number of surface returns, p(k | y).

4.3. Real Data with Known Geometry: Cross-Correlation and RJMCMC. We set up a remote target at a range of approximately 325 meters, which contained 6 distributed surfaces with separations between adjacent surfaces of {450, 10, 200, 30, and 90 mm}. The photon counting histogram in Figure 10 was collected with the scanning system using a 3 MHz pulse repetition frequency and 50 μW average laser power; the bin resolution was 4 ps. The RJMCMC sampler used here is exactly the same as the one for the fence data, but allows k to vary from 0 to 10.

According to Figure 10, both the RJMCMC and cross-correlation methods succeed in detecting distinct return signals. The two surfaces separated by 30 mm merge into a single peak in the cross-correlation values; in comparison, with the assistance of the Merge/Split updates, RJMCMC can easily separate them. However, both methods fail to distinguish the peaks 10 mm (17 channel bins) away from one another, and instead place a combined return, which results in increased estimated distances from the combined signal to its neighbouring peaks, that is, the two peaks corresponding to the surfaces separated by 450 and 200 mm.
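For comparison, the thresholded cross-correlation analysis can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the pulse shape, the peak threshold, and the function name are our own assumptions, and the range conversion simply reflects the fact that a 4 ps bin corresponds to roughly 0.6 mm of range, so that two surfaces 10 mm apart are only about 17 bins apart, as noted above.

```python
import numpy as np

C = 2.998e8  # speed of light in m/s

def correlation_peaks(histogram, pulse, bin_width_s=4e-12, threshold=0.5):
    """Thresholded cross-correlation of a photon-count histogram with a
    sampled instrumental response.  Local maxima of the normalised
    cross-correlation above `threshold` are reported as surface returns,
    together with their ranges in metres."""
    xcorr = np.correlate(np.asarray(histogram, float), np.asarray(pulse, float), mode="same")
    if xcorr.max() > 0:
        xcorr = xcorr / xcorr.max()
    peaks = [i for i in range(1, len(xcorr) - 1)
             if xcorr[i] >= threshold and xcorr[i] > xcorr[i - 1] and xcorr[i] >= xcorr[i + 1]]
    ranges_m = [i * bin_width_s * C / 2.0 for i in peaks]  # two-way flight time to range
    return peaks, ranges_m
```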
Figure 8: Map of the estimated number of surfaces for different k values: k = 0 in navy blue, k = 1 in Cambridge blue, k = 2 in yellow, and k = 3 in carmine.
5. Conclusions and Future Work

In this paper, we have demonstrated the application of Bayesian analysis using Markov chains to analyse full-waveform ladar pixel and image data acquired by a new scanning sensor. The sensor uses time-correlated photon counting technology, and coupled with algorithmic development, we are able to detect multiple surface returns within the field of view of single pixels, creating multilayer images. This has application in defence and security when objects of interest may be partially concealed, or viewed through semitransparent surfaces, such as through windows.

To demonstrate the method, and compare with thresholded correlation analysis, we have used selected data from two images of a distant target, the first in full view, the second viewed through a trellis fence. In general, RJMCMC analysis is advantageous in supplying principled estimates of both the number of surface returns and the associated parameter vectors (range, amplitude, and background level). This allows us to construct multilayered 3D images. The methodology is effective in dealing with low amplitude returns, a few photons at maximum in a single bin. This adds to the covert capability of the sensor, aimed at detecting returns from uncooperative surfaces at medium range using a low-power source laser diode.

However, there are a number of outstanding problems that require future work. In the long term, we need to acquire image data at an approximate rate of one frame per second, or better, and to process the data in comparable time frames. Currently, we are investigating the use of convergence diagnostics to better control the chain length, the validity of initialising the chains by correlation data, and multicore programming in combination with vector processor and FPGA technology. In general, all of these can lead to faster, single-pixel processing. Another possibility is to promote an investigation on the Dirichlet process (DP) mixture model developed in [29] and recently studied in [30], which provides natural estimates for Bayesian inference in both model number and associated parameters with efficient simulations.
Figure 9: Fence layer and the reconstructed 3D image of the mannequin layer with interpolation and smoothing techniques.

[Figure 10 plots photon counts and fitting results against depth (m).]

Figure 10: Analysis of TCSPC data from a real target containing 6 distributed surfaces with known separation distances: {450, 10, 200, 30, and 90 mm}. The blue line gives the 5 peaks detected by the RJMCMC method with separations determined to be {452.4, 207.6, 27, and 100.2 mm}. The green line is the cross-correlation of the signal (for the sake of display clarity, the maximum value is scaled to be 6), which gives 4 peaks with separations {454.8, 225.6, and 97.8 mm}.

Acknowledgments

The work reported in this paper was funded in part by the UK Engineering and Physical Sciences Research Council, and in part by the Electro-Magnetic Remote Sensing (EMRS) Defence Technology Centre, established by the UK Ministry of Defence and run by a consortium of SELEX Sensors and Airborne Systems (now SELEX Galileo), Thales Defence, Roke Manor Research and Filtronic.

References

[1] M.-C. Amann, T. Bosch, M. Lescure, R. Myllylä, and M. Rioux, "Laser ranging: a critical review of usual techniques for distance measurement," Optical Engineering, vol. 40, no. 1, pp. 10–19, 2001.
[2] W. Wagner, A. Ullrich, V. Ducic, T. Melzer, and N. Studnicka, "Gaussian decomposition and calibration of a novel small-footprint full-waveform digitising airborne laser scanner," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 60, no. 2, pp. 100–112, 2006.
[3] C. Mallet and F. Bretar, "Full-waveform topographic lidar: state-of-the-art," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 64, no. 1, pp. 1–16, 2009.
[4] C. Véga and B. St-Onge, "Height growth reconstruction of a boreal forest canopy over a period of 58 years using a combination of photogrammetric and lidar models," Remote Sensing of Environment, vol. 112, no. 4, pp. 1784–1794, 2008.
[5] F. Hosoi and K. Omasa, "Estimating vertical plant area density profile and growth parameters of a wheat canopy at different growth stages using three-dimensional portable lidar imaging," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 64, pp. 151–158, 2009.
[6] D. Letalick, T. Chevalier, and H. Larsson, "3D imaging of partly concealed targets by laser radar," Tech. Rep., Division of Sensor Technology, the Swedish Defence Research Agency, October 2005.
[7] D. Ludwig, A. Kongable, S. Krywick, et al., "Identifying targets under trees—Jigsaw 3D-LADAR test results," in Laser Radar Technology and Applications VIII, vol. 5086 of Proceedings of SPIE, pp. 16–26, 2003.
[8] C. Gronwall, T. Chevalier, G. Tolt, and P. Andersson, "An approach to target detection in forested scenes," in Laser Radar Technology and Applications XIII, vol. 6950 of Proceedings of SPIE, pp. S1–S12, 2008.
[9] M. Voss and R. Sugumaran, "Seasonal effect on tree species classification in an urban environment using hyperspectral data, LiDAR, and an object-oriented approach," Sensors, vol. 8, no. 5, pp. 3020–3036, 2008.
[10] A. M. Wallace, G. S. Buller, R. C. W. Sung et al., "Multi-spectral laser detection and ranging for range profiling and surface characterization," Journal of Optics A, vol. 7, no. 6, pp. S438–S444, 2005.
[11] A. McCarthy, R. J. Collins, N. J. Krichel, V. Fernandez, A. M. Wallace, and G. S. Buller, "Long-range time of flight scanning sensor based on high speed time-correlated photon counting," Applied Optics, vol. 48, no. 32, pp. 6241–6251, 2009.
[12] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, pp. 461–464, 1978.
[13] A. P. Dempster, N. M. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," Journal of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.
[14] S. Hernández-Marín, A. M. Wallace, and G. J. Gibson, "Bayesian analysis of lidar signals with multiple returns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2170–2180, 2007.
[15] C. Fraley and A. E. Raftery, "Bayesian regularization for normal mixture estimation and model-based clustering," Journal of Classification, vol. 24, no. 2, pp. 155–181, 2007.
[16] A. M. Wallace, R. C. W. Sung, G. S. Buller, R. D. Harkins, R. E. Warburton, and R. A. Lamb, "Detecting and characterizing returns in a multi-spectral pulsed lidar system," IEE
Proceedings-Vision Image and Signal Processing, vol. 153, no. 2, pp. 160–172, 2006.
[17] R. Sudharasan, P. Yuan, J. Boisvert, et al., “Single photon
counting Geiger mode InGaAs(P)/InP avalanche photodiode
arrays for 3D imaging,” in Laser Radar Technology and
Applications XII, vol. 6950 of Proceedings of SPIE, p. 69500N,
2008.
[18] P. J. Green, “Reversible jump Markov chain Monte Carlo
computation and Bayesian model determination,” Biometrika,
vol. 82, pp. 711–732, 1995.
[19] S. Hernandez-Marin, A. M. Wallace, and G. J. Gibson,
“Multilayered 3D LiDAR image construction using spatial
models in a Bayesian framework,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 30, no. 6, pp. 1028–
1040, 2008.
[20] S. Duncan, J. Cople, G. Harvey, D. Humphreys, J. Gonglewski,
and I. Baker, “Advances in laser gated imaging in an airborne
environment,” in Infrared Technology and Applications XXXII,
vol. 6206 of Proceedings of SPIE, p. 650607, April 2006.
[21] G. S. Buller and A. M. Wallace, “Ranging and three-
dimensional imaging using time-correlated single-photon
counting and point-by-point acquisition,” IEEE Journal on
Selected Topics in Quantum Electronics, vol. 13, no. 4, pp. 1006–
1015, 2007.
[22] S. Pellegrini, G. S. Buller, J. M. Smith, A. M. Wallace, and S.
Cova, “Laser-based distance measurement using picosecond
resolution time-correlated single-photon counting,” Measure-
ment Science and Technology, vol. 11, no. 6, pp. 712–716, 2000.
[23] A. Gelman, Markov Chain Monte Carlo in Practice: Interdis-
ciplinary Statistics, chapter 8, Chapman & Hall/CRC, Boca
Raton, Fla, USA, 1995.
[24] S. El Adlouni, A.-C. Favre, and B. Bobée, “Comparison of
methodologies to assess the convergence of Markov chain
Monte Carlo methods,” Computational Statistics and Data
Analysis, vol. 50, no. 10, pp. 2685–2701, 2006.
[25] S. G. Giakoumatos, I. D. Vrontos, P. Dellaportas, and D. N.
Politis, “A Markov chain Monte Carlo convergence diagnostic
using subsampling,” Journal of Computational and Graphical
Statistics, vol. 8, no. 3, pp. 431–451, 1999.
[26] M. K. Cowles and B. P. Carlin, “Markov chain Monte Carlo
convergence diagnostics: a comparative review,” Journal of the
American Statistical Association, vol. 91, no. 434, pp. 883–904,
1996.
[27] K. Mengersen, S. Knight, and C. P. Robert, “MCMC: how
do we know when to stop?” Tech. Rep., 1999, http://
www.stat.fi/isi99/proceedings/arkisto/varasto/meng0251.pdf.
[28] A. Gelman and D. B. Rubin, “Inference from iterative
simulation using multiple sequences,” Statistical Science, vol.
7, no. 4, pp. 457–472, 1992.
[29] M. D. Escobar and M. West, “Bayesian density-estimation and
inference using mixtures,” Journal of the American Statistical
Association, vol. 90, pp. 577–588, 1995.
[30] M. I. Jordan, “Hierarchical models, nested models and com-
pletely random measures,” in Frontiers of Statistical Decision
Making and Bayesian Analysis, M.-H. Chen, D. Dey, P. Mueller,
D. Sun, and K. Ye, Eds., Springer, New York, NY, USA, 2010.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 345743, 9 pages
doi:10.1155/2010/345743

Research Article
Facial Recognition in Uncontrolled Conditions for
Information Security

Qinghan Xiao1 and Xue-Dong Yang2


1 Defence Research and Development Canada, Ottawa, 3701 Carling Avenue, Ottawa, ON, Canada K1A 0Z4
2 Department of Computer Science, University of Regina, Regina, SK, Canada S4S 0A2

Correspondence should be addressed to Qinghan Xiao, [email protected]

Received 1 December 2009; Accepted 3 February 2010

Academic Editor: Yingzi Du

Copyright © 2010 Q. Xiao and X.-D. Yang. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

With the increasing use of computers nowadays, information security is becoming an important issue for private companies
and government organizations. Various security technologies have been developed, such as authentication, authorization, and
auditing. However, once a user logs on, it is assumed that the system would be controlled by the same person. To address this flaw,
we developed a demonstration system that uses facial recognition technology to periodically verify the identity of the user. If the
authenticated user’s face disappears, the system automatically performs a log-off or screen-lock operation. This paper presents our
further efforts in developing image preprocessing algorithms and dealing with angled facial images. The objective is to improve
the accuracy of facial recognition under uncontrolled conditions. To compare the results with others, the frontal pose subset of
the Face Recognition Technology (FERET) database was used for the test. The experiments showed that the proposed algorithms
provided promising results.

1. Introduction

With the growing need to exchange information and share resources, information security has become more important than ever in both the public and private sectors. Although many technologies have been developed to control access to files or resources, to enforce security policies, and to audit network usages, there does not exist a technology that can verify that the user who is using the system is the same person who logged in. Considering the heightened security requirements of military organizations to exchange information, Defence Research and Development Canada (DRDC) Ottawa started a research project in 2004 to develop a demonstration system that automatically logs the user off the computer or locks the screen when the authenticated user cannot be identified by examining images of the person sitting in front of the computer [1]. Facial recognition technology has been adopted to monitor the presence of the authenticated user throughout a session. Therefore, only the legitimate user could operate the computer and unauthorized entities have less chance to hijack the session. The objective is to enhance the level of system security by periodically checking the user's identity without disrupting the user's activities.

Various biometric technologies, which measure human physiological or behavioural characteristics, have been proposed for user authentication [2–4]. Physiological biometric traits, such as fingerprints, hand geometry, retina, iris, and facial images, are collected from direct measurements of the human body, while behavioural biometric characteristics, such as signature, keystroke rhythms, gait pattern, and voice recordings, are associated with a specific set of actions of a person. Based on the level of user involvement when capturing the biometric traits, biometrics can be further defined as either active or passive. Passive biometrics do not require the user to actively submit a measurement, while active biometrics need cooperation from the user [2]. For approaches that enable continuous verification of identity but do not interrupt the user's activity, passive biometric technologies, such as keystroke analysis and facial recognition, have shown great potential [5–8]. However, the user alternates between the mouse and the keyboard,
thus rendering monitoring difficult with keystroke rhythms. Recently, some researchers have investigated the possibility of using multiple biometric modalities to continuously authenticate the user [9, 10]. It has been demonstrated that a multimodal biometric system provides a higher level of authentication assurance, but needs more computational resources than a unimodal biometric system. Therefore, we developed a video-based facial presence monitoring demonstration system, which acquires images from a video camera and runs on a Windows-based computer in near-real time [5].

Experiments have been carried out where users were allowed to perform different tasks, such as answering a phone call or drinking a soda, while being able to move freely within their normal working space in front of a camera. A major challenge is that facial images are taken under uncontrolled conditions, such as changes in illumination, pose, facial expression, and so forth. The authors of [11] claimed that "In such uncontrolled conditions, all current commercial and academic face recognition systems fail". This motivated us to conduct further research on new algorithms to improve the recognition accuracy.

The rest of the paper is organized as follows. Section 2 reviews the video-based facial recognition technologies. Section 3 briefly introduces the research background and previous work. Section 4 presents the image preprocessing algorithms. Section 5 deals with multiangle facial image analysis. Section 6 presents performance evaluation and experimental results, and the conclusion and future work are discussed in Section 7.

2. Video-Based Facial Recognition

Video-based facial recognition is a promising technology that allows covert and unobtrusive monitoring of individuals. Generally, video sequences are a collection of sequential static frames, thereby allowing the use of still-image-based techniques. However, in video-based techniques, one can utilize the temporal continuity of the image sequences to enhance the robustness of the recognition process. Chellappa and Zhou [12] proposed a system that uses static images as the training data and video sequences as the probe data. They used a state space model to fuse the temporal information contained in the video sequences by tracking the subject identity using kinematics. A computationally efficient sequential importance sampling (SIS) algorithm was developed to estimate the posterior distribution. For identity n at each time instant t, by propagating the joint posterior distribution p(nt, θt | z0:t) of the motion (denoted by θt) and subject information (z0:t = (z0, z1, ..., zt)), a marginalization procedure yielded a robust estimate of identity. Evaluation was performed on two databases containing 12 and 30 subjects. The training data consisted of a single frontal face image of each subject and the probe data were videos of each of the subjects walking straight towards the camera. The first database contained images with no variation in illumination or pose, while the second, larger database contained images with large illumination variation and slight variations in pose. With the first database, the proposed system achieved a recognition rate of 100% and, interestingly, it was shown that the posterior probability p(nt | z0:t) of identity increased with time, whereas the conditional entropy decreased. Using the second database, with large fluctuations of illumination, the system produced an average classification rate of 90.75%.

Chen et al. [13] used the spatio-temporal nature of video sequences to model the motion characteristics of individual faces. For a given subject, they extracted motion flow fields from the video sequences using wavelet transforms. The high-dimensional vectors encoding these flow fields were reduced in size by applying a Principal Component Analysis (PCA) followed by a Linear Discriminant Analysis (LDA). Recognition was performed using a nearest neighbour classifier. The training data was collected by recording 28 subjects pronouncing two words in Mandarin. For each subject, nine video sequences were captured under different poses. For the testing data, they used the same sequences, but applied an artificial light source of varying intensity, as the goal of their evaluation was to measure robustness to illumination variations. Face alignment was performed by cropping the faces below the positions of the eyes, which were indicated manually. This method was evaluated against the Fisherface algorithm and exhibited much more stable performance across a wide range of illumination; it also achieved a correct classification rate of ∼70% versus ∼20% for Fisherface on the equivalent test data.

Liu and Chen [14] applied adaptive Hidden Markov Models (HMM) for pose-varying video-based face recognition. All face images were reduced to low-dimensional feature vectors by using PCA. In the training process, an HMM was generated to learn both the statistics of the video sequences and the temporal dynamics of each subject. During the recognition stage, the temporal characteristics of the probe face sequence were analyzed over time by the HMM corresponding to each subject. The likelihood scores provided by the HMMs were compared. Based on maximum likelihood scores, the identity of a face in the video sequence was recognized. Face regions were manually extracted from the images. The test database included 21 subjects. Two sequences were collected for each subject: one contained 322 frames for training and the other had around 400 frames for testing. The recognition performance of the proposed algorithm was compared with an individual PCA method, which showed a 4.0% error rate for HMM versus a 9.9% error rate for PCA.

A key factor in our application is that the system must be able to perform multiple tasks and recognize the user's face in near-real time. Unfortunately, there is little emphasis on such a requirement in the existing literature. Therefore, DRDC conducted research on facial recognition algorithms and developed a prototype system that performs periodic verification and allows the user to carry out ordinary tasks. As illustrated in Figure 1, while a user is typing in Microsoft Word and running Internet Explorer, the system automatically performs the facial verification every 30 milliseconds.
Figure 1: Authentication is performed periodically while a user is working on a Microsoft Word file.

3. Previous Work and System Overview

Traditionally, the authentication process only verifies the identity of a user once, at login or sign-on. Afterward, the assumption is that the system remains under the control of the same authenticated user. This authentication mechanism is fairly secure for one-time applications, such as accessing a protected file or withdrawing money from an automatic banking machine. However, there is a security threat if an unauthorized user takes over the session after the legitimate user successfully logged in. A facial presence monitoring system was developed in our previous work to verify the user's identity throughout the entire session [5].

Facial recognition applications can be categorized into three scenarios: under tightly controlled, loosely controlled, and uncontrolled conditions. A tightly controlled facial recognition system operates under strict rules. The users are required to cooperate with the system and the facial images can only be accepted when they satisfy certain conditions, such as a full front view of the face with a neutral expression and both eyes open under a uniform lighting condition without reflection or glare. Unfortunately, most real-world applications cannot satisfy these conditions. In an uncontrolled facial recognition system, the users may be unaware that a system is taking their facial images. Hence, there exist considerable variations in illumination, pose, and expression. Many applications fall into the loosely controlled category. For instance, the facial recognition system may work under uncontrolled illumination and background, but a narrow range of face pose. The system we developed works under uncontrolled conditions. Figure 2 shows the system diagram, and the module functionalities are briefly summarized.

[Figure 2 shows the overall system architecture: a webcam feeds a video frame buffer, which is processed in turn by the face detection, face segmentation, relighting, and face matching modules, supported by a GUI and a face database.]

Figure 2: Overall system architecture.

(1) Input. The system captures video images and processes either 24-bit or 32-bit colour images. The captured images are displayed on the computer screen via DirectX in real time.

(2) Face Detection. A face detection algorithm subsequently examines the captured images. The locations of potential human faces are recorded. Later, when a video image is finally rendered to the monitor, a red rectangle encloses each potential face. Because of the near-real time requirement, this algorithm cannot be expected to perform flawlessly. Therefore, it is anticipated that some detected objects are not actually faces. In order to obtain an accurate result, the system examines more than one image frame to determine if an object is likely a face or not, which is one of the advantages of using video-based facial recognition. When an object presumed to be a face is discovered, the corresponding region in the previous frame is examined. If there was no face found in that area in the previous frame, then the current object is unlikely to be a face. As a result, the object will not be recorded as a potential face. Conversely, if there had been a face present in that region, the likelihood that the current object represents a face is greater. In such a situation, the system assumes that the object is a face.
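This frame-to-frame confirmation step can be summarised with a small sketch. The following Python fragment is illustrative only: the overlap measure, its threshold, and the function name are our own assumptions, since the paper does not specify how the corresponding region is compared.

```python
def confirm_faces(current_boxes, previous_boxes, min_overlap=0.3):
    """Temporal filtering of face detections: a detection in the current frame
    is only accepted as a face if a detection overlapped the same region in
    the previous frame.  Boxes are (x, y, width, height) tuples; `min_overlap`
    is an assumed intersection-over-union threshold."""
    def iou(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    return [box for box in current_boxes
            if any(iou(box, prev) >= min_overlap for prev in previous_boxes)]
```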
(3) Face Segmentation. Because the position of a face computed by the face detection module is not accurate, a more precise location is necessary for a good face-matching performance. Since the size of a user's face appearing in a video frame also varies depending on the distance of the user from the web camera, the face image must be normalized to a standard size. There are some features in face images that may change from time to time. For example, the hairstyle can change significantly from one day to another. In order to reduce the effects of such dynamic features, a standard elliptical region with a fixed aspect ratio is used to extract the face region.

(4) Face Matching. Turk and Pentland pioneered the eigenface method [15], which relies on the Karhunen-Loeve (KL) transform or the Principal Component Analysis (PCA). To improve the performance of the eigenface method, it is important to have a good alignment between the live and the stored face images. It means that the nose has to be in the middle, the eyes have to be at a stable vertical position, and the scale of the face images must be normalized. A significant portion of our efforts addressed these issues. An elliptical facial region extracted from a video frame is matched against the facial models stored in a database. Each face image is first converted to a vector. This vector is projected onto eigenfaces through inner product calculations. Each face produces a weight vector. The Euclidean distance between two weight vectors is used to measure the similarity between the two faces. This distance is then mapped to a normalized matching score.
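The projection and comparison steps just described can be written compactly. The sketch below is a minimal illustration, not the system's code: the exponential mapping from distance to a normalized score and the `scale` parameter are our own assumptions, since the paper does not state which mapping it uses.

```python
import numpy as np

def match_score(face, mean_face, eigenfaces, enrolled_weights, scale=10.0):
    """Eigenface-style matching: project the normalised face onto the
    eigenfaces, compare weight vectors by Euclidean distance, and map the
    distance to a normalised score in (0, 1].

    `face` and `mean_face` are flattened face chips, `eigenfaces` is a matrix
    whose rows are eigenfaces, and `enrolled_weights` is the stored weight
    vector of a database face."""
    weights = eigenfaces @ (face - mean_face)        # inner-product projections
    distance = np.linalg.norm(weights - enrolled_weights)
    return float(np.exp(-distance / scale))          # 1.0 means identical weight vectors
```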
(5) Relighting. This module provides a histogram-based intensity mapping function to normalize the intensity distribution of the segmented face image. It is noted that some areas, such as the eyes, can be very dark due to the light direction. It is potentially beneficial to enhance the features in dark regions to improve the recognition performance.

(6) Facial Database. It is assumed that data from up to eight users may be saved in the database. Each user is required to take at least one picture in the user's normal working environment, within the normal sitting space and under the normal lighting conditions.

(7) Output. The demonstration system has three main outputs.

(i) Live video: the detected face in the scene is surrounded by a red rectangle.

(ii) Matching results: the segmented face image from the current test scene is displayed, along with up to five candidate faces from the database in descending priority order.

(iii) Performance data: several performance data are displayed in real time, such as the overall frame rate, the face detection time, the face recognition time, and the best matching score.

In order to evaluate the robustness of the system, we conducted tests in different scenarios that might happen in a real office environment. These tests include detecting multiple users in a scene and recognizing a user answering a phone call or drinking a soda. As shown in Figure 3, the system performed very well, and even partially occluded faces, such as a mouth covered by a telephone or eyes covered by dark sunglasses, could still be recognized correctly.

4. Image Preprocessing

A study on image preprocessing algorithms has been carried out to improve the accuracy performance. It focused on the areas that affect the accuracy of facial recognition, which include geometric correction, face alignment, masking, and photometric normalization.

4.1. Face and Eye Detection. Each image frame is searched for faces using a fast Viola-Jones [16] face detector. The approach is enhanced over conventional implementations in the following aspects. It is invariant to faces of multiple sizes, in-plane (IP) head rotations of θ = ±45°, and out-of-plane (OOP) head rotations of ±30°, as shown in Figure 4. Once the face is located, a virtual bounding box is formed around it and a built-in eye finder is initiated in this region. The eye finder outputs the 2D locations of the left and right irises. To improve operating speed, it is essential to adapt the face search regions based on previous face localization results. This dynamically limits the search area and, as a result, allows this step to operate in near-real time at resolutions up to 1024 × 768.

4.2. Face Alignment and Masking. Using the 2D coordinates of the left and right irises, the in-plane rotation angle θ of the face is estimated trigonometrically. The face is then rotated by −θ°, effectively placing both eyes at equal height, as shown in Figure 5. Once normalized for rotation, the face is geometrically warped in order to set the inter-eye distance to 40 pixels, chosen as a reasonable value to ensure effective performance of the subsequent recognition steps. The system then crops the face image to 64 × 64 pixels, leaving out the top of the head in order to reduce the impact of hair style on the classification. Similarly, a mask is applied to the shoulder and lower-neck areas to avoid the possible influence of different clothing on the face recognition.
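A minimal sketch of this geometric normalisation is given below. It assumes OpenCV for the affine warp and chooses its own canonical eye positions inside the 64 × 64 chip; it is not the system's implementation.

```python
import numpy as np
import cv2  # OpenCV is an assumption; any affine warping routine would do

def align_face(gray, left_eye, right_eye, eye_dist=40, out_size=64):
    """Rotate by -theta so the eyes are level, scale so the inter-eye distance
    is `eye_dist` pixels, and resample an `out_size` x `out_size` face chip.

    left_eye / right_eye are the (x, y) iris locations from the eye finder.
    """
    (lx, ly), (rx, ry) = left_eye, right_eye
    dx, dy = rx - lx, ry - ly
    d2 = float(dx * dx + dy * dy)
    c = eye_dist * dx / d2          # combined rotation by -theta and scaling
    s = eye_dist * dy / d2
    M = np.array([[c, s, 0.0],
                  [-s, c, 0.0]])
    # assumed canonical left-eye position inside the output chip
    dst_left = np.array([(out_size - eye_dist) / 2.0, 0.35 * out_size])
    M[:, 2] = dst_left - M[:, :2] @ np.array([lx, ly], dtype=float)
    return cv2.warpAffine(gray, M, (out_size, out_size))
```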
4.3. Photometric Normalization. Variable lighting, both in intensity and location, can cause dramatic changes to a person's appearance as a result of shadows and specularity. These issues arise especially in uncontrolled environments, such as outdoors or even in a windowed office. For a facial recognition system using an appearance-based method, it is crucial to apply photometric normalization. Here, homomorphic filtering [17] is adopted to mitigate the effects of shadows and specularity on the face. Homomorphic filtering is commonly used to simultaneously normalize brightness across an image and increase contrast.
(a) (b)

Figure 3: Recognition of partially occluded faces.

(a) Scale (b) IP head rotation (c) OOP head rotation

Figure 4: Face and eye detection under different situations.

Figure 5: Face alignment and masking: (a) original frame, (b) aligned and masked face ready to be analyzed (dimensions shown for illustrative purposes).

(a) Original frame (b) Filtering result

Figure 6: Homomorphic filtering.

As shown in Figure 6(b), the shadows caused by lateral lighting on the right side of the person's face are filtered while preserving important structural details. As this step usually demands many computationally intensive steps, the filter is implemented in a separable fashion by breaking a two-dimensional signal into two one-dimensional signals with vertical and horizontal projections. In homomorphic filtering, it is necessary to convolve the input image with a Gaussian filter, which is separable by nature because 2D Gaussians are circularly symmetric.

To convolve an image with a separable filter kernel, each row in the image is convolved with the horizontal projection of the filter to obtain an intermediate image. Next, we convolve each column of the intermediate image with the vertical projection of the filter. Hence the resulting image is identical to the direct convolution of the original image and the filter kernel. The convolution of an N × N image with an M × M filter kernel requires a time proportional to N² × M². In comparison, convolution in the separable fashion only requires a time proportional to N² × M. Therefore, the processing speed is improved to achieve real-time operation.
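The row–column decomposition can be illustrated with a few lines of NumPy. This is a generic sketch of separable Gaussian filtering, not the system's homomorphic filter itself; the sigma and kernel radius are our own illustrative values (in the full homomorphic filter, this low-pass estimate would be applied to the log-intensity image).

```python
import numpy as np

def separable_gaussian_filter(image, sigma=8.0, radius=24):
    """Separable 2D Gaussian filtering: convolve every row with the 1D
    horizontal projection of the kernel, then every column of the intermediate
    result with the 1D vertical projection.  The cost is O(N^2 * M) instead of
    O(N^2 * M^2) for direct 2D convolution."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    g /= g.sum()                                      # 1D projection of the 2D Gaussian
    img = np.asarray(image, dtype=float)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, img)
    out = np.apply_along_axis(lambda col: np.convolve(col, g, mode="same"), 0, tmp)
    return out
```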
5. MultiAngle Facial Image Analysis

Because there were more side-view face images than front-view images in the captured video stream, a study has been conducted to explore the possibility of using multiangle face images to increase the recognition rate.
(a) (b) (c)

Figure 7: Database capture conditions: (a) overhead light with neutral expression, (b) overhead and frontal light with neutral expression, (c) overhead and frontal light with varying facial expressions.

(a) First session (b) Second session

Figure 8: An example of a repeat subject.

(a) Wide (b) Narrow

Figure 9: Angular range.

[Figure 10 is a bar chart of mean classification rate (%) versus angular range (degrees): 97.03% at ±30°, 92.31% at ±10°, and 90% at 0°.]

Figure 10: Mean classification rate versus angular range.

A database called the CIM face database was constructed at the Centre for Intelligent Machines (CIM) at McGill University. Currently, it consists of 43 subjects, 19 of which returned for a second recording session. The recording sequences involved the subject rotating through a range of 180°. For each subject, three sequences were recorded under different lighting conditions and facial expressions, as shown in Figure 7. About one-third of the participants were asked to return for a second recording session at least one week after their first session. They were requested to change their facial appearance by growing facial hair, wearing glasses, applying makeup, and so forth, as illustrated in Figure 8. Each sequence was saved as a raw video file and manually cropped at the approximate start and stop times of the participant's rotation. This step allowed us to estimate the rotation angle difference between consecutive frames, which was 2.5° on average.

The benefits of using multiangle images were evaluated by increasing the training data within an angular range of either ±30°, ±10°, or 0°, as illustrated in Figure 9. In addition, an ad hoc third trial was conducted in which the test set consisted of outdoor images of eight of the subjects.

In the experiments, thirty subjects were randomly selected to construct the training set and twelve subjects were used as imposters to evaluate the false accept rate (FAR).
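As an illustration of how training sets for these angular ranges can be assembled from a rotation sequence sampled roughly every 2.5°, consider the following sketch. The frame indexing and the assumption that the sequence spans −90° to +90° are our own; it is not the authors' selection code.

```python
def select_training_frames(num_frames, angular_range_deg, step_deg=2.5):
    """Pick the frame indices of a 180-degree rotation sequence whose head
    angle falls within +/- angular_range_deg of frontal, assuming about
    2.5 degrees of rotation between consecutive frames.  Frame 0 is taken to
    face -90 degrees; the exact indexing is an assumption."""
    selected = []
    for i in range(num_frames):
        angle = -90.0 + i * step_deg          # approximate head angle of frame i
        if abs(angle) <= angular_range_deg:
            selected.append(i)
    return selected

# e.g. a 73-frame sequence: +/-30 deg keeps 25 frames, 0 deg keeps only the frontal frame
frontal_only = select_training_frames(73, 0)
```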
[Figure 11 plots FRR (%) against FAR (%) in panel (a) and the FAR/FRR error rates (%) against threshold in panel (b), for training angular ranges of ±30 degrees, ±10 degrees, and 0 degrees.]

Figure 11: CIM database results.

[Figure 12 plots FRR (%) against FAR (%) in panel (a) and the FAR/FRR error rates (%) against threshold in panel (b).]

Figure 12: FERET database results.

In order to compare with other published algorithms, the proposed approach has been tested using the widely used FERET database. The fb (frontal pose) subset, consisting of 725 subjects, was used, in which 580 subjects were selected randomly as the training set and the remaining 145 used as impostors. The performance metrics consisted of the mean classification rate, the false accept rate, and the false recognition rate (FRR). The FAR is the likelihood of incorrectly accepting an impostor, while the FRR is the likelihood of incorrectly rejecting an individual in the training set. As a common rule, the thresholds were adjusted based on the classification confidence values to evaluate the trade-off between FAR and FRR.

6. Experimental Evaluation

Three sets of experiments were conducted on the CIM and FERET [18] face databases. Performance of the classifier is associated with the angular range of the training data, as seen in Figure 10. The best performance was achieved on the ±30° range, that is, the same range used for the testing sequences. Reducing the range to ±10° resulted in a 4.72% drop in performance, while training on the 0° range (i.e., a single frontal image) led to a 7.03% decrease.

It is desirable to evaluate system accuracy by comparing the FAR to the FRR for different choices in angular range of training data. For any given threshold T, we measured the FAR and FRR values. By varying T, we recorded the value sets of FAR and FRR. Plotting the value sets, the result is called a Receiver Operating Characteristic (ROC) curve (Figure 11(a)). The ROC curve is a concise graphical summary of the performance of a biometric device [19], and it enables performance comparisons between different systems under similar conditions or of a single system operating under differing conditions [20]. In the plot, an ROC curve which lies to the lower left of another curve has better accuracy performance. Therefore, in Figure 11(a), the results for ±10° and ±30° are very similar.

Figure 11(b) presents the FAR and FRR, individually, as functions of the threshold value. Since higher threshold values increase FRR and decrease FAR, analysis of these results is only meaningful when it takes into account both FAR and FRR together. The point at which the FAR is equal to the FRR is called the equal error rate (EER). This is another commonly used measure to assess the overall performance of biometric systems. The result in Figure 11(b) shows that error is minimized when training with greater angular range, as expected. Training with an angular range of ±30° yields an equal error rate of 4.86%, whereas this increases to 6.80% and 7.42% for angular ranges of ±10° and 0°, respectively.
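The threshold sweep used to produce these curves can be expressed in a few lines. The sketch below is generic rather than the authors' evaluation code: it assumes similarity scores in [0, 1] for genuine and impostor comparisons and reads the EER off the crossing point, and the function and variable names are our own.

```python
import numpy as np

def far_frr_curve(genuine_scores, impostor_scores, num_thresholds=100):
    """Sweep the decision threshold T and record FAR/FRR pairs, as done for
    ROC curves.  A claim is accepted when score >= T, so impostor scores above
    the threshold contribute to FAR and genuine scores below it to FRR."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    far = np.array([(impostor >= t).mean() for t in thresholds])   # accepted impostors
    frr = np.array([(genuine < t).mean() for t in thresholds])     # rejected genuine users
    eer_index = int(np.argmin(np.abs(far - frr)))
    eer = 0.5 * (far[eer_index] + frr[eer_index])
    return thresholds, far, frr, eer
```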
To deal with outdoor scenarios, a preliminary test has been conducted by using static images of eight of the subjects taken outdoors; six of the eight subjects (75%) were successfully classified. Note that the training data for these individuals came exclusively from indoor data. Due to the very limited number of testing samples, these results should not be taken as definitive statements of performance.

Despite very limited opportunity for tuning the algorithms, the mean classification rate obtained on the FERET database was 92.5% (an EER of 7.5%), demonstrating that the presented algorithm is scalable to relatively large databases. This result is only 1.5% lower than the best published classification rate (94%) in the literature for the same database [18]. The FAR versus FRR plots and error rates versus threshold values are shown in Figure 12.

7. Conclusions and Future Work

As networks become larger, more complex, and more distributed, information security has become more critical than it has ever been in the past. Many efforts have been made aiming to accurately authenticate and authorize trusted individuals, and audit their activities. Once a user is successfully logged in, the prevailing technique assumes that the system is controlled by the same person. Focusing on this security challenge, we developed an enhanced authentication method that uses video-based facial recognition technology to monitor the user during the entire session in which the person is using the system. It can automatically lock the screen or log out the user when the authenticated user's face disappears from the vicinity of the computer system for an adjustable time interval.

In order to improve the accuracy performance, further research has been conducted in developing image preprocessing algorithms and using multiangle facial images in training. The experiments conducted on the CIM and FERET face databases showed promising results. On the FERET database, an EER of 7.5% is obtained, which is comparable to the best published EER rate of 6% in the literature. A major advantage of video-based face recognition is that a set of images of the same subject can be captured in a video sequence, while a main problem of video-based face recognition lies in the low image quality in video frames. In order to improve recognition accuracy, an effort has been put into combining front and angle face images.

Uncontrolled face recognition from video is still a challenging task. During our experiments, it became clear that most classification errors were due to instability in the alignment process. Specifically, we noticed that even though the eye detector accurately finds the eyes, the position tends to oscillate in the dark area of the pupils, which causes fluctuations in the computed in-plane head rotation angle. One area for future research is to investigate bootstrapping and integration of spatio-temporal filtering methods in the eye detector to mitigate this issue. We will perform more research on the relationships among front and angle-face images to extract some nose features that cannot be obtained or accurately measured from the front face itself, such as nose slope and depth. Not only will we use newly developed algorithms to improve the facial presence monitoring system, but also we will explore other application areas that will benefit from uncontrolled face recognition. In addition, we need to conduct research to analyze legal and social aspects of monitoring the user's presence at a workplace.

Acknowledgments

The authors would like to thank Dr. Martin Levine and Dr. Jeremy Cooperstock for the great contribution they made under contract W7714-071076. Without their contribution, the authors would not have been able to achieve the results presented herein. The authors would also like to thank Dr. Daniel Charlebois for his support and valuable comments throughout the course of this research project.

References

[1] X. D. Yang, P. Kort, and R. Dosselmann, "Automatically log off upon disappearance of facial image," Contract Report CR 2005-051, DRDC, Ottawa, Canada, March 2005.
[2] P. Reid, Biometrics for Network Security, Prentice-Hall, Upper Saddle River, NJ, USA, 2003.
[3] J. Chirillo and S. Blaul, Implementing Biometric Security, John Wiley & Sons, Indianapolis, Ind, USA, 2003.
[4] R. Manoj, "Biometric security: the making of biometrics era," InfoSecurity, pp. 16–22, July 2007.
[5] Q. Xiao and X. D. Yang, "A facial presence monitoring system for information security," in Proceedings of the IEEE Workshop on Computational Intelligence in Biometrics: Theory, Algorithms, and Applications (CIB '09), pp. 69–76, March 2009.
[6] R. Janakiraman, S. Kumar, S. Zhang, and T. Sim, "Using continuous face verification to improve desktop security," in Proceedings of the 7th IEEE Workshop on Applications of Computer Vision (WACV '07), pp. 501–507, January 2007.
[7] B. Rao, Continuous keystroke biometric system, M.S. thesis, Media Arts and Technology, University of California, Santa Barbara, Calif, USA, 2005.
[8] R. H. C. Yap, T. Sim, G. X. Y. Kwang, and R. Ramnath, "Physical access protection using continuous authentication," in Proceedings of the IEEE International Conference on Technologies for Homeland Security (HST '08), pp. 510–512, May 2008.
[9] T. Sim, S. Zhang, R. Janakiraman, and S. Kumar, "Continuous verification using multimodal biometrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 687–700, 2007.
[10] A. Azzini and S. Marrara, "Impostor users discovery using a multimodal biometric continuous authentication fuzzy system," in Proceedings of the 12th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES '08), vol. 5178 of Lecture Notes in Computer Science, pp. 371–378, Springer, September 2008.
[11] S. J. D. Prince, "Latent identity variables: a generative framework for face recognition in uncontrolled conditions," EP/E065872/1, EPSRC, September 2007.
[12] R. Chellappa and S. Zhou, "Face tracking and recognition from videos," in Handbook of Face Recognition, S. Z. Li and A. K. Jain, Eds., pp. 169–192, Springer, Berlin, Germany, 2005.
[13] L.-F. Chen, H.-Y. M. Liao, and J.-C. Lin, "Person identification using facial motion," in Proceedings of the IEEE International
Conference on Image Processing (ICIP '01), vol. 2, pp. 677–680, October 2001.
[14] X. Liu and T. Chen, “Video-based face recognition using
adaptive hidden Markov models,” in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 1, pp. 340–345, June 2003.
[15] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal
of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[16] P. Viola and M. Jones, “Robust real-time object detection,”
International Journal of Computer Vision, vol. 57, no. 2, pp.
137–154, 2004.
[17] D. B. Williams and V. Madisetti, Digital Signal Processing
Handbook, CRC Press, Boca Raton, Fla, USA, 1999.
[18] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET
evaluation methodology for face-recognition algorithms,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 22, no. 10, pp. 1090–1104, 2000.
[19] M. Schuckers, “Some statistical aspects of biometric identifi-
cation device performance,” Stats Magazine, p. 3, September
2001.
[20] D. Davis, P. Higgins, P. Kormarinski, J. Marques, N.
Orlans, and J. Wayman, “State of the art biometrics excel-
lence roadmap: technology assessment: volume 1,” Tech.
Rep., MITRE Corporation, 2008, http://www.biometriccoe.gov/SABER/index.htm.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 680845, 9 pages
doi:10.1155/2010/680845

Research Article
Iris Recognition: The Consequences of Image Compression

Robert W. Ives,1 Daniel A. Bishop,2 Yingzi Du,3 and Craig Belcher3


1 Department of Electrical and Computer Engineering, U.S. Naval Academy, Annapolis, MD 21402-5000, USA
2 School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
3 Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA

Correspondence should be addressed to Robert W. Ives, [email protected]

Received 11 November 2009; Accepted 9 March 2010

Academic Editor: Alan van Nevel

Copyright © 2010 Robert W. Ives et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Iris recognition for human identification is one of the most accurate biometrics, and its employment is expanding globally. The
use of portable iris systems, particularly in law enforcement applications, is growing. In many of these applications, the portable
device may be required to transmit an iris image or template over a narrow-bandwidth communication channel. Typically, a full
resolution image (e.g., VGA) is desired to ensure sufficient pixels across the iris to be confident of accurate recognition results. To
minimize the time to transmit a large amount of data over a narrow-bandwidth communication channel, image compression can
be used to reduce the file size of the iris image. In other applications, such as the Registered Traveler program, an entire iris image
is stored on a smart card, but only 4 kB is allowed for the iris image. For this type of application, image compression is also the
solution. This paper investigates the effects of image compression on recognition system performance using a commercial version
of the Daugman iris2pi algorithm along with JPEG-2000 compression, and links these to image quality. Using the ICE 2005 iris
database, we find that even in the face of significant compression, recognition performance is minimally affected.

1. Introduction

Iris recognition is gaining popularity as the method of choice for human identification in society today. The iris, the colored portion of the eye that surrounds the pupil, contains unique patterns which are prominent under near-infrared illumination. These patterns are relatively permanent, remaining stable from a very young age, barring trauma or disease. They allow accurate identification with a very high level of confidence.

Commercial iris systems are used in a number of applications such as access to secure facilities or other resources, and even criminal/terrorist identification in the Global War on Terror. The identification process begins with enrollment of an individual into a commercial iris system, requiring the capture of one or more images from a video stream. Typically, the database for such a system does not contain actual iris images, but rather it stores a binary file that represents the distinctive information contained in each enrolled iris (called the template). Most commercial iris systems today use the Daugman algorithm [1–3]. In the Daugman algorithm, the template is stored as 512 bytes per eye.

Data compression is beginning to play a part in the employment of iris recognition systems. Law enforcement agencies, such as the Border Patrol, the Coast Guard, and even the Armed Forces, are using portable wireless iris recognition devices. In cases where the devices require a query to a master database for identification, it may be required to transmit captured images or templates over a narrow-bandwidth communication channel. In this case, minimizing the amount of data to transmit (which is possible through compression) minimizes the time to transmit, and saves precious battery power. There are other iris applications that require a full-resolution iris image to be carried on a smart card, but require a small fixed data storage size. An example is the Registered Traveler Interoperability Consortium (RTIC) standard, where only 4 kB is allocated on the RT smart card for the iris image [4]. Since the standard iris image used for recognition is VGA-resolution (640 × 480, grayscale), it contains 307 kilobytes; significant compression would be required to fit a VGA iris image into 4 kilobytes.
Applications of this nature serve as the primary motivation for this research.

This paper explores whether image compression can be utilized while maintaining recognition accuracy, and the effects on performance. We evaluate the effects of image compression on recognition using JPEG-2000 compression along with a commercial implementation of the Daugman recognition algorithm [5]. The database used in this research is described in the following section.

2. Data

Iris images used in this paper are available from the National Institute of Standards and Technology (NIST). The database of iris images used in this research is the Iris Challenge Evaluation (ICE) 2005 database [6]. This iris database is composed of a total of 2953 iris images, collected from 132 subjects. Of these images, 1425 were of right eyes from 124 different individuals and 1528 were left eyes from 120 individuals. The images are all VGA resolution, 480 rows by 640 columns, with 8-bit grayscale resolution.

This database contains images with a wide range of visual quality; some images seem near perfect, while others are very blurry, have irises that extend off the periphery of the image, contain significantly occluded irises, and/or have video interlace artifacts. All of these factors impair recognition performance. Several examples of images from this database are shown in Figures 1, 2, and 3.

Figure 1: An example image from the ICE 2005 database (image no. 245596). The visual quality is very good.

Figure 2: An example image from the ICE 2005 database (image no. 245795). Note the extent of the occlusion, including eyelashes.

3. Image Compression

The JPEG-2000 algorithm is published by the Joint Photographic Experts Group (JPEG) as one of its still-image compression standards [7]. JPEG-2000 uses state-of-the-art compression techniques based on wavelets, unlike the more popular JPEG standard, which is based on the discrete cosine transform (DCT). JPEG-2000 contains options that allow both lossless and lossy compression of imagery, as does JPEG. When using any lossy compression technique, some information is lost in the compression, and the amount and type of information that is lost depends on several factors, including the algorithm used for compression, the amount of compression desired (which determines the size of the compressed file), and special options offered in the algorithm such as Region-of-Interest (ROI) processing. In ROI processing, select regions of the image are deemed more important than other areas, such that less information is lost in those regions.

The effect of image compression on iris recognition system performance has been addressed [8, 9]. In particular, in [8], iris images were compressed up to 50 : 1 using both JPEG-2000 and JPEG. In [9], Daugman and Downing used a portion of the ICE-2005 iris database and JPEG-2000 compression. Daugman used the Region-of-Interest (ROI) capability of JPEG-2000, which resulted in compression ratios of up to 145 : 1. He used segmentation methods to completely isolate the iris so as to reduce the size of the images from 480 × 640 to 320 × 320, and then completely discarded the regions of the smaller image that did not include the iris. Since the images were reduced in size to only contain the segmented iris, higher compression ratios were obtained with minimal effects on recognition performance. However, storing iris database images in this manner precludes testing of alternate segmentation methods. In our research, we opted to compress entire images rather than just the area of the iris-only information. This allows a more general approach to algorithm development research using a compressed iris database.

For this paper, we used the entire ICE-2005 database to obtain our results. We compressed the images using JPEG-2000, with the default parameters and options available in the JasPer implementation [10]. The source code is freely available from the JasPer Project. We did not use the ROI capability, so that entire images were compressed as a whole and segmentation testing could be performed on compressed images.

Figure 4 displays an original iris image from the ICE-2005 database before and after its compression to a ratio of 100 : 1 using JPEG-2000. This is image number 245596, the same as displayed in Figure 1. Comparing both images in Figure 4 closely reveals some detectable differences, primarily in the areas of high frequency content (high detail), such as the eyelashes, where compression artifacts or smoothing is noted. Statistically, the two images are not very different; the maximum difference in value between any two pixels in the two images is 22, and the average gray level difference between the two images is essentially unchanged (0.02), with a standard deviation of 1.56.
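These statistics are straightforward to reproduce for any original/compressed image pair. The sketch below assumes the images can be decoded with Pillow; the file paths, and the use of Pillow rather than the JasPer tools mentioned above, are our own assumptions.

```python
import numpy as np
from PIL import Image  # Pillow with JPEG 2000 decoding support is an assumption

def compression_statistics(original_path, compressed_path):
    """Compute the simple difference statistics quoted above for an original
    iris image and its JPEG-2000 compressed version: maximum absolute pixel
    difference, mean difference, and standard deviation of the difference."""
    original = np.asarray(Image.open(original_path).convert("L"), dtype=float)
    decoded = np.asarray(Image.open(compressed_path).convert("L"), dtype=float)
    diff = original - decoded
    return {
        "max_abs_difference": float(np.abs(diff).max()),
        "mean_difference": float(diff.mean()),
        "std_difference": float(diff.std()),
    }
```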
EURASIP Journal on Advances in Signal Processing 3

Figure 3: An example image from the ICE 2005 database (image


no. 243843). Note the extent of blurriness and the video interlace
artifacts.
Figure 5: Zoomed view of the iris image from Figure 4. At this
level of zoom, the compression artifacts are noticeable, particularly
around areas of high frequency (such as eyelashes). Also note the
smoothed out areas throughout the iris.

4. Quality Metric

The information distance-based quality measure is used to evaluate the iris image quality [11, 12]. Prior to the application of the quality measure, the iris is first segmented and transformed to polar coordinates. This quality measure includes three parts: the Feature Correlation Measure, the Occlusion Measure, and the Dilation Measure, which are then combined into a quality score. These three parts and the fusion that forms the quality score are described below.

(1) Feature Correlation Measure (FCM). The compression process will introduce artificial iris patterns, which may have low correlation with the true patterns. Using this property, we applied the information distance (see [13]) between adjacent rows of the unwrapped image to measure the correlation within regions of the iris.

Suppose the row length is L with a starting location at (u, v). The filtered magnitude values (from feature extraction) of the L pixels in the row are formed as a vector r. The probability mass function (pmf) of this selected portion is p, and q is the pmf of the neighboring row [13]. The information distance of this portion is J_(u,v)(p, q), which can be calculated by

J(p, q) = D(p ‖ q) + D(q ‖ p),   (1)

where D(· ‖ ·) is the Kullback-Leibler information distance, D(p ‖ q) = Σ_i p_i log2(p_i / q_i). In our algorithm, values that do not appear within the selected portions of rows are not considered in the pmfs, to prevent a divide-by-zero condition in (1).

The feature correlation measure (FCM) of an iris image is then calculated by

FCM = (1/N) Σ_i J_{i,i+1},   (2)

where J_{i,i+1} is the representative information distance of the ith row and N is the total number of rows used for the feature information calculation.

(2) Occlusion Measure (O). The total amount of invalid iris patterns can affect the recognition accuracy. Here, the occlusion measure (O) is used to measure the percentage of the iris area that is invalid due to eyelids, eyelashes, and other noise.

(3) Dilation Measure (D). The dilation of the pupil can also affect the recognition accuracy. Here, the dilation measure (D) is calculated as the ratio of the pupil radius to the iris radius.

(4) Score Fusion (Q). The three measures are then combined into one quality score based on the FCM, O, and D. Rather than simply multiplying them, we first normalize each of the measure scores:

Q = f(FCM) · g(O) · h(D),   (3)

where f(·), g(·), and h(·) are normalization functions. The f(·) function is used to normalize the FCM score from 0 to 1 and is defined as follows:

f(FCM) = { α · FCM,  0 ≤ FCM ≤ β
         { 1,        FCM > β.   (4)

In (4), β = 0.005 and α = 1/β. The value of β was chosen experimentally. For most original images, the J_{i,i+1} scores were above 0.005, while for compressed images most J_{i,i+1} scores were lower than 0.005. The value α is the normalization factor that ensures that when FCM = β, f(FCM) = 1.

We analyzed the relationship between the available iris patterns and the iris recognition accuracy to determine the normalization functions empirically. This relationship is more exponential than linear. Based on [14, 15], the g function is calculated as

g(O) = (1 − e^{−λ(1−O)}) / κ.   (5)

In (5), κ = 0.9179 and λ = 2.5. Similar to the occlusion, the dilation also has a nonlinear relationship with the recognition accuracy. The h function is calculated as

h(D) = { 1,              D ≤ 0.6
       { e^{−γ(D−ξ)},    0.6 < D ≤ 1.   (6)

Here, ξ = 0.6 and γ = 40. For dilation, ξ is selected based on the dilation behavior of a normal eye.

Figure 6 shows two sample images from the ICE database, along with each image compressed to ratios of 25 : 1, 50 : 1, 75 : 1, and 100 : 1. A zoomed-in portion of the iris is also displayed for visual evaluation of the quality. For each image, the resulting quality score is displayed. Additional quality results are included in the following section.
  database did play a role in the performance, as demonstrated
1 − e−λ(1−O) in Figure 9. Here, two images of different eyes have been
g(O) = . (5) segmented (segmentation is shown in the images), and
κ
both segmentations are poor. Still, successful segmentation
In (5), κ = 0.9179 and λ = 2.5. Similar to the occlusion, allowed template generation, so each image was represented
the dilation is also a nonlinear function compared to the by a template that could be compared. When the templates
recognition accuracy. The h function is calculated as of these two different eyes were compared, the net result was
M that there was only one valid bit in the Hamming distance
1, D ≤ 0.6, computation, resulting in a HD = 0. There were two other
h(D) = −γ(D−ξ)
(6) such comparisons of different eyes with a low number of
e , 0.6 < D ≤ 1.
valid bits (3 bits and 9 bits), both also resulting in a HD =
Here, ξ = .6, and γ = 40. For dilation, ξ is selected based on 0. All three of these cases would result in false matches. The
the dilation functionality of a normal eye. issue of a low number of bits being compared and their effect
Figure 6 shows two sample images from the ICE on Hamming distance was addressed by Daugman in [16], in
database, along with each image compressed to ratios of which he compared use of performance using raw Hamming
25 : 1, 50 : 1, 75 : 1 and 100 : 1. A zoomed in portion of the iris distance (as we use here) and normalized Hamming distance,
is displayed also, for visual evaluation of the quality. For each defined as
image, the resulting quality score is displayed. Additional 9
n
quality results are included in the following section. HDnorm = 0.5 − (0.5 − HDraw ) . (8)
911
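The two dissimilarity measures in (7) and (8) can be sketched directly on boolean template and mask arrays. The template/mask layout below is an assumption for illustration (the real IrisCodes are 2D phase-bit arrays); the 911 scaling constant follows the text.

```python
# Minimal sketch of the raw fractional Hamming distance (7) and Daugman's
# normalized Hamming distance (8) on boolean arrays.
import numpy as np

def fractional_hd(code_a, code_b, mask_a, mask_b):
    """Return (raw HD, number of valid bits); None if nothing is comparable."""
    valid = mask_a & mask_b
    n = int(valid.sum())
    if n == 0:
        return None, 0
    disagreements = (code_a ^ code_b) & valid
    return disagreements.sum() / n, n

def normalized_hd(hd_raw, n_bits):
    """HDnorm = 0.5 - (0.5 - HDraw) * sqrt(n / 911), see (8)."""
    return 0.5 - (0.5 - hd_raw) * np.sqrt(n_bits / 911.0)

rng = np.random.default_rng(1)
a = rng.integers(0, 2, 2048).astype(bool)
b = rng.integers(0, 2, 2048).astype(bool)
m = rng.random(2048) > 0.2                 # stand-in occlusion masks
hd, n = fractional_hd(a, b, m, m)
print(hd, normalized_hd(hd, n))
```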

Figure 6: Two sample images and their compressed versions from the ICE database. Image quality is annotated for each image. (a) Sample image 1 — Original: 0.9975; 25 : 1: 0.3874; 50 : 1: 0.266; 75 : 1: 0.2539; 100 : 1: 0.2008. (b) Sample image 2 — Original: 0.9935; 25 : 1: 0.9935; 50 : 1: 0.9935; 75 : 1: 0.8929; 100 : 1: 0.7574.

Figure 7: This image failed to generate an iris template at 75 : 1 and 100 : 1 compression (image no. 245561). (a) Original image. (b) 100 : 1 compression.

Figure 8: This image failed to generate an iris template in its original form and at all compression ratios (image no. 242451).

The size and number of subjects in the ICE database, the number of images that successfully segmented so that a template could be formed, and the number of valid bits used in comparing two templates were all factors that determined the number of actual comparisons that were made. Recall that five databases were used in this research, one for the uncompressed images and one for each of the compression ratios used. The number of comparisons made (genuine or impostor) differed when comparing different databases. Part of the difference in the number of comparisons comes about because when comparing one database to itself, we do not count comparisons of each image to itself (HD = 0 in this case). However, when comparing two different databases, the difference in compression ratios means that there are no identical images, and this allows an additional number of
valid comparisons. The number of comparisons also varies because a few images do not generate templates, so some databases had fewer templates for comparison than other databases. The original, 25 : 1, and 50 : 1 databases held 2952 templates, while the 75 : 1 and 100 : 1 databases held 2950 templates. Finally, we only compare templates if at least 400 bits were valid in the comparison. As a result, the overall numbers of genuine, imposter, and total comparisons are shown in Table 2.

Table 2: Number of matches in each database comparison.

Compared databases          Genuine matches    Imposter matches    Total matches
Original versus Original    26,656             4,329,020           4,355,676
Original versus 25 : 1      29,650             4,329,068           4,358,628
Original versus 50 : 1      29,549             4,329,079           4,358,628
Original versus 75 : 1      30,876             4,327,749           4,358,625
Original versus 100 : 1     31,119             4,327,506           4,358,625
25 : 1 versus 25 : 1        26,600             4,329,076           4,355,676
25 : 1 versus 50 : 1        29,527             4,329,101           4,358,628
25 : 1 versus 75 : 1        30,862             4,327,763           4,358,625
25 : 1 versus 100 : 1       31,106             4,327,519           4,358,625
50 : 1 versus 50 : 1        26,617             4,329,059           4,355,676
50 : 1 versus 75 : 1        30,886             4,327,739           4,358,625
50 : 1 versus 100 : 1       31,136             4,327,489           4,358,625
75 : 1 versus 75 : 1        26,609             4,323,166           4,349,775
75 : 1 versus 100 : 1       29,804             4,322,291           4,352,725
100 : 1 versus 100 : 1      26,692             4,323,083           4,349,775

Figure 9: Segmentation images. (a) Image no. 247076 compressed to 25 : 1. (b) Image no. 246215. These two images from different eyes generated templates, but their qualities resulted in poor segmentation. As a result, in the template comparisons there was only one valid bit compared, resulting in a false match using raw HD (HD = 0.0). Normalized HD would have resulted in an HD = 0.5 (no false match). Also, the iCAP software version used in this research was an early version that did not include capability for off-axis images or partially out-of-frame images. The fact that these images did not segment properly could be expected.

Figure 10: Probability mass function curves for compression 25 : 1, as a function of Hamming distance (25 : 1 versus original, 25 : 1 versus 25 : 1, 25 : 1 versus 50 : 1, 25 : 1 versus 75 : 1, and 25 : 1 versus 100 : 1). Genuine match scores lie at low Hamming distances and imposter match scores near 0.45; the genuine curves shift right with increasing compression ratios.

The performance curves that follow are derived from the probability mass functions (PMF) of the fractional Hamming distance scores. The PMF is an estimate of the underlying probability distribution using the histogram of HD values. An example of the effects of compression on the PMFs of genuine and imposter distributions is shown in Figure 10. Here the database of 25 : 1 compressed image templates is compared to the original image templates. We point out that compression does not really change the imposter distribution, but as the compression ratio increases, the genuine distributions move closer to the imposter distribution, which reduces performance. We note that in the comparisons between this database and the original, and between this database and itself (25 : 1 versus 25 : 1), there is a distinct second peak in the PMF close to a Hamming distance of 0; we attribute this to the comparison of images that are only slightly different (i.e., the compression does not result in much change in the iris template). At higher compression ratios, more change is induced in the templates, resulting in higher Hamming distances when comparing them.

Since the imposter distributions are relatively unchanged as compression ratios increase, we further analyze the changes seen in the genuine distributions. Here, we investigate the changes in HD values as the compression ratio is increased, when compared to the original images. Statistics have been gathered for five database comparisons: original versus original, original versus 25 : 1, original versus 50 : 1, original versus 75 : 1, and original versus 100 : 1. The minimum, average, and maximum HDs were recorded for each
database comparison. We expected that all of these values would increase as the compression ratio increased, since more of the original data is lost in the compression. These results are included in Table 3. We note that the minimum HD is 0.0 for comparisons between the original and 25 : 1, and between the original and 50 : 1 databases. We attribute this to the fact that JPEG-2000 is efficient in how it performs the compression and the impacts on the iris and the iris template are minimal, so the template of a given iris image is in general close to the template of the same image compressed to 25 : 1 or 50 : 1. As mentioned earlier, for comparisons of a database with itself, comparisons of an image with itself are excluded because they trivially give a Hamming distance of zero. This is why the minimum Hamming distances in Table 3 for the comparisons between the original and 25 : 1 and between the original and 50 : 1 are lower than the minimum for the first row, which compares the original database with itself.

Table 3: Minimum, mean, and maximum HDs.

Compared databases          Minimum HD    Mean HD    Maximum HD
Original versus Original    0.0025        0.1535     0.4795
Original versus 25 : 1      0.0000        0.1514     0.4818
Original versus 50 : 1      0.0000        0.1685     0.4705
Original versus 75 : 1      0.0008        0.1912     0.4742
Original versus 100 : 1     0.0035        0.2109     0.4802

Figure 11 is an example of the performance curves created for this research. Here, each pair of curves (False Rejection Rate (FRR) and False Accept Rate (FAR)) represents the comparison of a compressed database against the original database. An original versus original comparison is included as a baseline. We note that as the compression ratio increases, the FAR curve remains virtually unchanged, while the FRR curves move further to the right. This will cause an increased Equal Error Rate (EER, where FAR = FRR) and an increased number of errors (False Accepts + False Rejects), which reduces overall system accuracy. Some overall results are included in Table 4, where we record: (1) best accuracy achieved, which reflects varying the identity threshold to minimize the total number of errors; (2) the EER point, in percent; (3) the FRR when FAR is fixed at 0.001 (one false accept in 1,000 imposter comparisons); and (4) the FRR when FAR is fixed at 0.0001 (one false match in 10,000 imposter comparisons). This table reflects all possible comparisons of the databases used (original and compressed), where the number of valid bits is ≥400.

Table 4: Summary of performance results.

EER (%)
            Original    Cr25      Cr50      Cr75      Cr100
Original    1.350       1.470     1.540     2.020     2.500
Cr25                    1.730     1.770     2.280     2.800
Cr50                              2.010     2.420     3.000
Cr75                                        3.010     3.350
Cr100                                                 4.450

Best Accuracy (%)
            Original    Cr25      Cr50      Cr75      Cr100
Original    99.969      99.961    99.950    99.931    99.917
Cr25                    99.952    99.940    99.920    99.904
Cr50                              99.931    99.909    99.895
Cr75                                        99.897    99.885
Cr100                                                 99.867

FRR at FAR = 0.001
            Original    Cr25      Cr50      Cr75      Cr100
Original    0.022       0.024     0.028     0.042     0.057
Cr25                    0.030     0.035     0.049     0.067
Cr50                              0.044     0.057     0.075
Cr75                                        0.079     0.088
Cr100                                                 0.125

FRR at FAR = 0.0001
            Original    Cr25      Cr50      Cr75      Cr100
Original    0.036       0.043     0.060     0.087     0.106
Cr25                    0.070     0.088     0.126     0.149
Cr50                              0.105     0.134     0.153
Cr75                                        0.177     0.170
Cr100                                                 0.223
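The FAR/FRR curves and the EER reported in Table 4 follow directly from the genuine and imposter HD distributions. The sketch below shows one way to derive them; the score arrays are synthetic stand-ins, not the ICE results.

```python
# Minimal sketch: FAR/FRR curves and an approximate EER from HD score lists.
import numpy as np

def far_frr_curves(genuine, imposter, thresholds):
    """FRR(t) = fraction of genuine HDs above t; FAR(t) = fraction of imposter HDs at or below t."""
    genuine, imposter = np.asarray(genuine), np.asarray(imposter)
    frr = np.array([(genuine > t).mean() for t in thresholds])
    far = np.array([(imposter <= t).mean() for t in thresholds])
    return far, frr

def equal_error_rate(far, frr):
    """Approximate EER: point where the two curves are closest."""
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0

rng = np.random.default_rng(2)
genuine = rng.normal(0.17, 0.05, 30000).clip(0, 0.5)     # stand-in genuine HDs
imposter = rng.normal(0.45, 0.02, 400000).clip(0, 0.5)   # stand-in imposter HDs
thresholds = np.linspace(0, 0.5, 501)
far, frr = far_frr_curves(genuine, imposter, thresholds)
print(f"EER ~ {100 * equal_error_rate(far, frr):.2f}%")
```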
5.2. Quality Measure. The quality measure described in Section 4 was determined for every image utilized (original and compressed). In most cases, as the compression ratio increases, the quality degrades. An example is the quality of image number 243843, which is displayed in Figure 3. For this image, Table 5 displays the quality of the original and compressed versions of this image, as well as the Hamming distance (HD) when compared with the original image, and the number of valid bits that were used in the comparison. In addition, it shows the decidability of the two distributions, defined as

d′ = |μ_genuine − μ_imposter| / sqrt(0.5 (σ²_genuine + σ²_imposter)).   (9)

This equation combines the means and standard deviations of the pmfs of the genuine and imposter distributions into a measure of how well separated the two probability mass functions are from each other [9]. A larger decidability value is indicative of a greater separation between the distributions, which should lead to improved recognition performance.

We note that for this image, the measured quality decreases and the Hamming distance increases as the compression ratio increases, which is the general trend when using a large database of images. The number of valid bits compared does not follow this trend. We attribute this to the fact that the compression introduces artifacts that alter the spatial makeup of the image, and these artifacts are reflected in a change in the masks used in the computation of Hamming distances. Overall, the mean qualities of the databases used are shown in Table 6.
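The decidability index in (9) is a one-line computation given the two score populations. A minimal sketch, with synthetic stand-in scores:

```python
# Minimal sketch of the decidability index d' in (9).
import numpy as np

def decidability(genuine, imposter):
    """d' = |mu_g - mu_i| / sqrt(0.5 * (var_g + var_i))."""
    g, i = np.asarray(genuine), np.asarray(imposter)
    return abs(g.mean() - i.mean()) / np.sqrt(0.5 * (g.var() + i.var()))

rng = np.random.default_rng(3)
print(decidability(rng.normal(0.17, 0.05, 10000), rng.normal(0.45, 0.02, 10000)))
```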

Figure 11: Performance curves (FRR and FAR as a function of the Hamming distance threshold) for each compression ratio versus the original images.

Table 5: Image quality and Hamming distances.

Image       Quality    HD (versus Original)    Bits compared          Decidability (d′)
Original    0.9395     0.0                     All unoccluded bits    5.06
25 : 1      0.8743     0.023354                728                    4.65
50 : 1      0.6713     0.025097                1036                   4.47
75 : 1      0.6220     0.044146                1042                   4.13
100 : 1     0.5687     0.092567                713                    3.83
100 : 1 0.5687 0.092567 713 3.83 but some may not be.
Overall, the iris images in this research were subjected
to considerable compression, and yet the recognition per-
Table 6: Database qualities. formance was only minimally affected. This is a significant,
particularly when compared to the FBI’s wavelet scalar
Database Quality Decidability (d ) quantization (WSQ) compression of fingerprint images. In
Original 0.9255 5.06 the FBI standard, fingerprints can be WSQ compressed
25 : 1 0.8565 4.65 with loss to a maximum ratio of 15 : 1 [17], while in
50 : 1 0.7916 4.47 this research the images were compressed up to 100 : 1.
75 : 1 0.7576 4.13 This proves the effectiveness of JPEG-2000 compression,
100 : 1 0.7306 3.83 and its ability to preserve the important information in
the compression process. Of further note, the iris images
here were compressed without the benefit of the region-of-
interest options available in JPEG-2000, which might allow
6. Conclusions even twice the compression with comparable results.

As expected, and as shown in other researches, as iris images


are compressed more, recognition performance reduces. The Acknowledgments
FAR remains fairly unaffected by changes in the image
data, while the FRR is noticeably affected. The compression For the iCAP software implementation of the Daugman
introduces artifacts into the iris images which alter the algorithm and advice on the use of the SDK, the authors
distinct patterns that are present in the original images, gratefully acknowledge Dr. Jun Hong, Chief Scientist, Mr.
making the compressed images more dissimilar. There are Joseph Hwang, Senior Software Engineer, Mr. Samir Shah,
some cases in which the compression introduced was small Senior Software Engineer, and Mr. Tim Meyerhoff, Project
enough such that the templates of an original and the same Manager, LG Electronics U.S.A. Inc., Iris Technology Divi-
image compressed by some amount resulted in the same sion. This work was supported in part by the Department
template. The cases of zero Hamming distance between of Defense and the National Institute of Justice (Award no.
compression ratios came about due to a combination of 2007-DE-BX-K182). This work was conducted under USNA
small changes in the phase and mask bits so that none of the IRB approval no. USNA.2007.0004-CR01-EM4-A.

References
[1] J. Daugman, “How iris recognition works,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 14, no. 1, pp.
21–30, 2004.
[2] J. G. Daugman, “High confidence visual recognition of per-
sons by a test of statistical independence,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 15, no. 11,
pp. 1148–1161, 1993.
[3] J. Daugman, “The importance of being random: statistical
principles of iris recognition,” Pattern Recognition, vol. 36, no.
2, pp. 279–291, 2003.
[4] “Registered Traveler Interoperability Consortium (RTIC)
Technical Interoperability Standard Version 1.2,” http://
www.rtconsortium.org/ docpost/RTICTIGSpec v1.2.pdf.
[5] J. Hong, J. Hwang, S. Shah, and T. Meyerhoff, “The iCAP and
SDKs are licensed commercial products”.
[6] “NIST’s Iris Challenge Evaluation (ICE),” http://iris.nist
.gov/ICE/.
[7] “The JPEG-2000 Standard,” May, 2010, http://www.jpeg
.org/jpeg2000/index.html.
[8] R. Ives, R. Broussard, L. Kennell, and D. Soldan, “Effects of
image compression on iris recognition system performance,”
Journal of Electronic Imaging, vol. 17, no. 1, Article ID 011015,
2008.
[9] J. Daugman and C. Downing, “Effect of severe image com-
pression on iris recognition performance,” IEEE Transactions
on Information Forensics and Security, vol. 3, no. 1, pp. 52–61,
2008.
[10] The JasPer Project, June 2007, http://www.ece.uvic.ca/
∼mdadams/jasper/.
[11] C. Belcher and Y. Du, “A selective feature information
approach for Iris image-quality measure,” IEEE Transactions
on Information Forensics and Security, vol. 3, no. 3, pp. 572–
577, 2008.
[12] Z. Zhou, Y. Du, and C. Belcher, “Transforming traditional iris
recognition systems to work on non-ideal situations,” IEEE
Transactions on Industry Electronics, vol. 56, no. 8, pp. 3203–
3213, 2009.
[13] T. Cover and J. Thomas, Elements of Information Theory, John
Wiley & Sons, New York, NY, USA, 1991.
[14] Y. Du, R. Ives, B. Bonney, and D. Etter, “Analysis of partial iris
recognition,” in Biometric Technology for Human Identification
II, vol. 5779 of Proceedings of SPIE, pp. 31–40, Orlando, Fla,
USA, March 2005.
[15] Y. Du, B. Bonney, R. W. Ives, D. M. Etter, and R. Schultz,
“Analysis of partial iris recognition using a 1D approach,” in
Proceedings of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP ’05), vol. 2, pp. 961–964,
March 2005.
[16] J. Daugman, “New methods in iris recognition,” IEEE Trans-
actions on Systems, Man and Cybernetics, vol. 37, no. 5, 2007.
[17] “The Federal Bureau of Investigation (FBI)’s Forensic Hand-
book,” http://www.fbi.gov/hq/lab/handbook/forensics.pdf.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 936512, 13 pages
doi:10.1155/2010/936512

Research Article
Scale Invariant Gabor Descriptor-Based
Noncooperative Iris Recognition

Yingzi Du, Craig Belcher, and Zhi Zhou


Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis,
Indianapolis, IN 46202, USA

Correspondence should be addressed to Yingzi Du, [email protected]

Received 1 January 2010; Accepted 18 March 2010

Academic Editor: Robert W. Ives

Copyright © 2010 Yingzi Du et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A new noncooperative iris recognition method is proposed. In this method, the iris features are extracted using a Gabor descriptor.
The feature extraction and comparison are scale, deformation, rotation, and contrast-invariant. It works with off-angle and low-
resolution iris images. The Gabor wavelet is incorporated with scale-invariant feature transformation (SIFT) for feature extraction
to better extract the iris features. Both the phase and magnitude of the Gabor wavelet outputs were used in a novel way for
local feature point description. Two feature region maps were designed to locally and globally register the feature points and
each subregion in the map is locally adjusted to the dilation/contraction/deformation. We also developed a video-based non-
cooperative iris recognition system by integrating video-based non-cooperative segmentation, segmentation evaluation, and score
fusion units. The proposed method shows good performance for frontal and off-angle iris matching. Video-based recognition
methods can improve non-cooperative iris recognition accuracy.

1. Introduction

Performing noncooperative iris recognition is important for a number of tasks, such as video surveillance and watchlist monitoring (identifying most-wanted criminals/suspects) [1–4]. In addition, noncooperative iris recognition systems can provide added convenience for cooperative users for identification [5]. However, it is challenging to design an iris recognition system that can work in a noncooperative situation, where the image quality may be low and the eye may be deformed due to a nonfrontal gaze.

In recent years, several methods have been developed for iris recognition [2, 3]. Most of these methods are designed for frontal and high-quality iris images. Among them, Daugman's approach has been most widely used in commercialized iris recognition systems [6–9]. This method transforms the segmented iris image into log-polar coordinates, extracts the iris features using a 2D Gabor wavelet, and encodes the phase information into a binary iris code [7, 9]. Hamming distance is used to match two iris codes [7]. Daugman's method has been tested and evaluated using large databases, such as the United Arab Emirates (UAE) database, with over 600,000 iris images and over 200 billion comparisons [8]. Chen et al. proposed using Daugman's 2D Gabor filter with quality measure enhancement to improve the recognition accuracy [10]. Matey et al. used Daugman's method in their Iris on the Move (IOM) system [11] with better optics and illumination to perform iris recognition at a distance. Masek and Kovesi found that Gabor wavelets can have a DC component and proposed using a 1D Log-Gabor filter [12]. Ma et al. proposed using a 2D filter similar to the Gabor filter [13]. Other works include Wildes [14], who proposed use of a Laplacian pyramid to decompose the iris features for matching. Sudha et al. [15, 16] proposed using edge maps to extract iris patterns and using the Hausdorff distance for pattern matching. Boles and Boashash proposed using normalized wavelet transform zero-crossings [17]. Sun et al. proposed using moment-based iris blob matching [18]. Hollingsworth et al. proposed using the "best bits" in the middle band to improve the recognition accuracy [19]. Thornton et al. proposed using correlation filters with Bayesian deformation estimation [20]. Du et al. proposed using 1D Local Texture Patterns [21]. Velisavljevic used oriented separable wavelet transforms to perform iris feature
extraction [22]. Miyazawa et al. used a 2D Fourier Phase Code (FPC) method for representing iris information [23]. Tajbakhsh et al. used both visible light and near infrared iris images for iris recognition [24]. None of these methods are designed for nonideal situations.

In the past, several researchers have worked on nonideal iris recognition [25, 26]. Proenca and Alexandre [5, 27] have worked on frontal iris recognition under visible wavelengths using the UBIRIS database [28]. Compared to NIR images, such as the CASIA [29] and ICE [30] databases, there is more reflection noise in visible wavelength iris images. Vatsa et al. applied a set of selected quality local enhancement algorithms to generate a single high-quality iris image for iris recognition [31]. In [32], we added several modules to help the traditional iris recognition system work in nonideal situations. However, all of these methods are designed for frontal-gaze iris images.

For nonfrontal iris recognition, Daugman proposed using Fourier-based trigonometry to estimate the two spherical components of the angle of gaze and used an affine transformation to "correct" the image and center the gaze [9]. Schuckers et al. proposed two methods to calculate the angle of gaze: using Daugman's integrodifferential operator, and also an angular deformation calibration model [33]. It is assumed that an estimate of the degree of off-angle is available for the algorithms, and subjects are required to place their heads on a chin rest looking forward (while the camera is rotated horizontally in fixed angles). Both methods are limited because "the affine transformation assumes the iris is planar, whereas in fact it has some curvature" [9].

Recently, we proposed the Regional Scale Invariant Feature Transform (SIFT) approach [34] for noncooperative iris recognition, which works for off-angle iris images. Iris features are described without a polar or affine transformation, and the feature point descriptors are scale and rotation invariant. However, the iris region consists of both noise and patterns, and Regional SIFT describes the area around a feature point using gradient information, which is not best suited for feature extraction. Most importantly, Regional SIFT would not work well with local pattern deformation.

If the strengths of SIFT and Gabor wavelets can be combined for feature extraction, it may improve the recognition accuracy for off-angle iris images. A simple combination of the SIFT and Gabor wavelet methods would not work, and it is challenging to design a method that can take advantage of both. (1) The SIFT method may select many feature points in a small region. This increases the computational complexity. More importantly, the heavy overlapping of the feature descriptors could result in extremely high weighting for the matching results in that small region. (2) The eyes are deformed in off-angle images. Global Gabor wavelet parameters would not work for off-angle iris images. This means that we need to design localized Gabor wavelets. How to design the Gabor wavelets to locally describe the feature points is a challenge. More importantly, how does one design the approach to ensure local deformation invariance? (3) In the SIFT method, the matching of the feature points does not incorporate global information. In iris recognition, the feature location is an important piece of information in performing the matching. How does one incorporate the global information in the matching to ensure accurate recognition results? (4) For off-angle iris images, the image segmentation is a challenge. How does one design a recognition method that can be tolerant of small segmentation errors?

It would be desirable if an iris recognition algorithm had the following capabilities:

(i) perform iris recognition for both frontal and off-angle iris images,

(ii) be scale invariant at both local and global levels,

(iii) extract iris features efficiently and locally even under deformation,

(iv) be tolerant of segmentation error.

The goal of our work is to design such an iris recognition algorithm. The proposed research effort has four novelties as compared to previous works. (1) To better extract the iris features, we incorporated Gabor wavelets with SIFT for feature extraction. (2) Both the phase and magnitude of the Gabor wavelet output are weighted and fused in a novel way for local feature point description. (3) To compensate for local iris deformation due to dilation/contraction and off-angle image acquisition, provide global registration information of the feature points, and improve the matching efficiency, we used two feature region-based methods and ensured that each subregion was locally adjusted to the dilation/contraction/eye deformation. In this paper, we propose two subregion maps for each image for feature point detection. (4) To be tolerant of segmentation error, we allowed a feature point from one image to match with feature points in its nearby locations in another image.

The rest of the paper is organized as follows. In Section 2, we give a brief review of SIFT-based methods and discuss why these methods could not work in noncooperative iris recognition. In Section 3, we introduce the proposed Gabor descriptor method and provide technical details about how to develop the feature subregion maps, select the feature points, describe each feature point, and match feature points from different images. The experimental results, comparisons with state-of-the-art algorithms, and discussions are reported in Section 4. Section 5 describes the proposed video-based noncooperative iris recognition system and discusses the implementation results. Section 6 draws some conclusions.

2. Brief Review of SIFT-Based Methods

Local descriptor-based methods are widely used. Local descriptors computed for interest regions are distinctive, robust to occlusion, and (sometimes) do not require segmentation [35]. Lowe [36] proposed the scale invariant feature transformation (SIFT) method, which describes an object as a group of feature points such that the object can be found in an image with invariance to scale, rotation, and affine transformations. The SIFT features are local and based on the appearance of the object at particular interest points,

Figure 1: Demonstration of the iris subregion selection: (a) iris area; (b) subregion map 1; (c) subregion map 2. (Note that there are more subregions than shown here.) Subregion map 2 is shifted by about half of the angular resolution of subregion map 1.

and are invariant to image scale and rotation. This method has been widely used in current object recognition. In [37], Ke and Sukthankar applied Principal Components Analysis (PCA) to the normalized image gradient patch. Carneiro and Jepson [38] proposed a phase-based descriptor. In [35], Mikolajczyk and Schmid proposed the gradient location and orientation histogram (GLOH), which is an extension of the SIFT descriptor. In [39], Cheng et al. proposed using a multiple-support-region-based SIFT approach for deformable image recognition. In [40], Mortensen et al. proposed a SIFT method with a global context vector that adds curvilinear shape information from a much larger neighborhood. Some other local descriptor methods include the geodesic intensity histogram [41], spin images [42, 43], shape context [44–46], and steerable filters [47–50]. Recently, Lepetit and Fua [51] proposed using training to find the most repeatable object keypoints. The application of this method is very limited because it requires that the training images contain a well-registered target object, which is often difficult in real-life scenarios.

In [52], Quelhas et al. used a SIFT-like method to select the feature points and used bag-of-visterms (an extended concept of bag-of-visual-words) to model the local features. The bag-of-visual-words (BoV) model is analogous to the bag-of-words (BoW) model in natural language processing and information retrieval [53–56], where a text is represented as an unordered collection of words, while grammar and even word order are disregarded. The BoV approach in object recognition uses a similar concept to describe the local patterns using feature vectors. A codebook of the features is generated by offline quantization of local descriptors. The challenge in BoV is to find general and representative feature vectors that can describe the local pattern, and training is often necessary [57–59]. BoV does not take spatial relationships into consideration, which introduces ambiguity in object recognition. The hybrid model by Quelhas et al. [52] can largely improve the recognition accuracy over using the BoV model alone. However, similar to general BoV, it remains a challenge to perform robust object recognition.

All of the above methods are designed for general object feature extraction and did not take the iris features into account. In [34], we proposed a region-based SIFT approach to improve the recognition accuracy over the traditional SIFT method. However, as we discussed in Section 1, this method has its own limitations. In [9], Daugman used a 2D Gabor wavelet for iris recognition and showed that it worked very well in iris pattern extraction. If the strengths of both the local descriptor and the Gabor wavelet can be combined in feature extraction, we can design a better local descriptor.

Figure 2: Masked feature point map. (a) Masked feature point map for feature subregion map 1; (b) masked feature point map for feature subregion map 2. (Note that these images are for illustration only and do not show the full 720 bins used.) White regions do not contain any feature points. Black regions have one feature point each. The gray regions are noisy regions that should not be used in matching.

3. Gabor Descriptor-Based Off-Angle Iris Recognition

To perform iris recognition, the first step is to segment the iris images. The segmentation of an off-angle iris image is itself a challenging problem [11, 27, 31, 60]. In [61], we designed a video-based noncooperative iris segmentation method. In this research, we use the general conic model in [61] for off-angle image segmentation. In this paper, our focus is how to design the iris feature extraction and matching method. Our proposed method includes the following steps: feature point selection, feature description, and region-based matching.

Figure 3: Stable feature points found in two subregion maps in a real image. The left image is an original image with feature points. The top right and bottom right images show the feature points detected in subregion map 1 and subregion map 2, respectively.

Figure 4: The description of SR, SA, θ, (x, y), and (xp, yp).

3.1. Feature Subregion Maps. The SIFT method finds many points stable within scale space, with many points possible in a very small region. The goal of our approach is to increase the opportunity to correctly match feature points within a similar relative position with respect to the pupil across multiple iris images; therefore, we should have a small number of features in small areas. More importantly, iris patterns have their own special characteristics, which need to be considered when designing the feature extraction method. (1) The spatial correlations in iris patterns are important in recognition. Therefore, in designing an effective local descriptor method, we should have global information about the feature points. (2) The pupil may dilate or contract. For off-angle eyes, the distance between the pupil boundary and the limbic boundary would not be uniform. As a result, the iris patterns can be deformed locally. It is important to have some local normalization process in selecting feature points. (3) There could be noisy regions in an iris area, such as eyelids, eyelashes, and glare. These areas should be identified and removed in matching.

In this research, we divide the iris area into a fixed number of subregions and ensure that each region has at most one feature point. In this way, for each feature point, we know its subregion and the correlation of that subregion to other subregions. Therefore, we have the global information of the feature points for matching.

To take local deformation into account, we should not make the subregions the same size; rather, we should take eye deformation information into account when assigning the subregions. We know that deformation of the eye changes the iris ring radius (the distance between the pupil and limbic boundaries in the radial direction). Especially for off-angle eyes, the radial distance is nonuniform over the angular direction. In this research, we assign x bins in the radial direction and y bins from 0 to 2π in the angular direction, where x represents the resolution in the radial direction and y represents the resolution in the angular direction of the region. It is important to select proper x and y sizes to ensure efficiency in matching and accuracy in feature extraction. In total, we will have x · y subregions in the entire iris region. The research results by Daugman in [7] have shown that 8 rings can work well with iris recognition. To include tolerance of segmentation error at the pupil boundary region and the limbic region (i.e., adding 2 more regions in the radial direction), we use x = 10 in our research. In the angular direction, by our observation, we found that the resolution of the iris patterns is usually within 5 degrees. Therefore, we select y = 72. In this way, a normalized map of size 10 by 72 is formed. Each subregion can have only one feature point. In Section 3.2, we will discuss how to ensure at most one feature point per subregion.

There could be noise (such as eyelids, eyelashes, and glare) in an iris area. It is important to mask these noisy regions. Therefore, we have three kinds of subregions: noisy regions, regions with one feature point each, and regions without feature points.

The head may tilt, and the start point of the subregion map then becomes arbitrary in terms of the particular iris patterns. As a result, some iris patterns may cross two subregions in the angular direction. To take this into consideration, we created an additional feature subregion map with an offset of half of the angular resolution, that is, a 2.5 degree offset (Figure 1). In this way, if a feature point happens to lie on the edge of a subregion in one feature subregion map due to tilt, it will be in the middle of the corresponding subregion of the other feature subregion map.

3.2. Feature Point Selection. In this research, we used the SIFT Difference of Gaussian (DoG) approach to find potential feature points [36]. However, as we discussed in Section 3.1, it is important to combine the selection of candidate points with the feature subregion maps.

Below is a brief description of the candidate feature point selection process using the SIFT feature point selection approach [36]. To find stable feature points, the first step is to apply a nominal Gaussian blur, (1), resulting in I(x, y),

G(x, y) = (1 / (2π σn²)) exp(−(x² + y²) / (2 σn²)).   (1)

Here σn = 0.5. Then, the nominally blurred image, I(x, y), is progressively Gaussian blurred. The first Gaussian image is created using

gσ = sqrt(σ0² + σn²),   (2)

where σ0 = 1.5 √2, so that

G(x, y, 1) = G_gσ ∗ I(x, y).   (3)

Figure 5: Examples of odd and even Gabor filters with different sizes and orientations. The first row shows even filters and the second row shows odd filters.

The remaining Gaussian images are created using

σ = 1.5 (√2)^m   (m = 0, 1, 2, 3),   (4)

resulting in five Gaussian-blurred images G(x, y, s) (s = 0, . . . , 4). The size of the Gaussian filter is always the closest odd number to 3σ. These parameters were selected empirically and are the same for all images. Then the four DoG images are created by subtracting each Gaussian image from the previous Gaussian image in scale:

D(x, y, s) = G(x, y, s + 1) − G(x, y, s)   (s = 0, 1, 2, 3).   (5)

In this research we only use the layers s = 1 and 2.

Unlike the general SIFT approach, we only allow one feature point to be selected per layer. For D(x, y, 1) and D(x, y, 2), the local minimum or maximum with the highest magnitude of all the points contained within the subregion is stored, so that every subregion contains two potential feature points, one scale apart, unless some portion of the subregion is occluded or masked due to noise.

For illustration purposes, Figure 2 shows an example of how the iris area can be divided into multiple subregions, with subregions that include occluded pixels (eyelids, eyelashes, or glare) being masked entirely. Since the pupil and limbic boundaries are modeled as ellipses, the sizes of these subregions vary in the radial direction for each of the 72 angular bins. This is a major difference from the previous Regional SIFT method, in that the entire iris area can potentially have feature points and every bin size changes with dilation. For further illustration, Figure 2 shows how some bins will contain a feature point corresponding to a point in the annular iris region, whereas others will not. (Note that for ease of viewing, Figure 2 does not show the full 720 bins used.) In addition, to compensate for feature points that are on the boundaries of subregions, a second 10 by 72 normalized feature point map is created with a 2.5 degree angular offset. Note that the two subregion maps may have different feature points.

Once potential feature points are identified and mapped to the feature point map, the 3D quadratic method is used to eliminate unstable feature points, using the Taylor expansion (up to the quadratic terms) of the DoG images, D(x, y, s), shifted so that the origin is at the selected point:

D(Δx) = D + (∂D^T/∂x) Δx + (1/2) Δx^T (∂²D/∂x²) Δx,   (6)

where D and its derivatives are evaluated at the selected point and Δx = (Δx, Δy, Δs)^T is the offset from this point. Taking the derivative of this function with respect to x and setting it equal to zero, we determine the extremum, Δx̂, to be

Δx̂ = −(∂²D/∂x²)^{−1} (∂D/∂x).   (7)

To reject points that have low contrast,

D(Δx̂) = D + (1/2) (∂D^T/∂x) Δx̂.   (8)

If |D(Δx̂)| is less than 0.03 for a given extremum point, that point is rejected.

To determine whether an extremum point lies along an edge, the Hessian matrix is used [18],

H = [ Dxx  Dxy ; Dxy  Dyy ],   (9)

where D is the second partial derivative of the DoG image D(x, y, s) at a scale s. The following inequality is used to find edge and corner points: if

Tr(H)² / Det(H) < (r + 1)² / r,   (10)

the extremum point is considered to be a corner; otherwise, the point is rejected as an edge point. Here r = 10.

After rejecting points based on contrast, edge value, and stability, the remaining points are assigned a description. However, if in one subregion there are still 2 feature points available (one feature point per scale), we then choose the more dominant one (i.e., the one with the higher |D(Δx̂)| value). In this way, we ensure that each subregion can have at most one feature point. As we discussed in Section 3.1, we will have 2 feature subregion maps per iris. This means that we will have 2 sets of feature points per iris. Figure 3 shows an example of stable feature points found for an iris in two subregion maps.

3.3. Feature Description. For each feature point, a feature description of length 64 is created based on the normalized and Gaussian-weighted position of each point within a normalized window around the feature point (4 x-bins and 4 y-bins) and the magnitude and phase response (4 phase orientation bins).

Table 1: ICE 2005 database matching results. ∗ In the NIST ICE Phase 2005, Duagman 1 and Duagman 2 were listed as Cam 1 and Cam 2.
The EERs are not clear to view from the report. The GAR at FAR = 01% and GAR at FAR = 0.01% were taken from the plot of the report in
[30].

(a) Right eyes.

Algorithm #Images EER GAR at FAR =.1% GAR at FAR =.01%


Daugman 1∗ 1426 — 0.9940 0.9910
Daugman 2∗ 1426 — 0.9950 0.9920
2D Gabor 1426 0.0062 0.9900 0.9850
1D Log-Gabor 1426 0.0079 0.9870 0.9735
Regional SIFT 1426 0.0557 0.7320 0.5640
Proposed 1426 0.0185 0.9588 0.9386
(b) Left eyes.

Algorithm #Images EER GAR at FAR =.1% GAR at FAR =.01%


Daugman 1∗ 1527 — 0.9880 0.9850
Daugman 2∗ 1527 — 0.9890 0.9880
2D Gabor 1527 0.0126 0.9750 0.9629
1D Log-Gabor 1527 0.0106 0.9739 0.9533
Regional SIFT 1527 0.0689 0.6346 0.3741
Proposed 1527 0.0257 0.9316 0.8916

Phase 1 where SA = ( (x − x p )2 + (y − y p )2 ) · (2π/360) · 5 and N


is the number of bins used to describe the relative position
Phase 2
of a point to a feature point (here N = 4). SA is the spatial
extension of the frame around the feature point (x, y) in
Phase 3
the angular direction, (x p , y p ) are the coordinates of pupil
center. SA is used to normalize the window around that
feature point and changes in size based on the distance
Phase 4
between the feature point and pupil centre.
The coordinates of all pixels in the window are then
(a) Normalised window (b) Gabor descriptor
normalized: the pixel (x, y) is normalized as
  
Figure 6: The process of a Gabor descriptor. The different colors in   (x − xs ) cos θ + y − ys sin θ
the left image shows that the Gabor filter results show they are in nx , n y =
different phase layers. SR
   (12)
−(x − xs ) sin θ + y − ys cos θ
.
SA
Imagery axis
SR is the spatial extension around the feature point in the
radial direction, and is used to normalize the window around
Phase 2 Phase 1
that feature point and changes in size based on the amount
of dilation. SA is the same as what we defined in (11). (xs , ys )
(0, 0) Real axis
is the feature point. θ is the angle between the line of the
pixel and the feature point to the line of the feature point and
Phase 3 Phase 4 the pupil center (Figure 4). It is used to orient the window
around the feature point such that the same feature point in
another image will be able to be matched despite differences
in angular position with respect to the pupil center.
Figure 7: Phase areas. In order to capture the iris features around a given feature
point, a bank of 2D Gabor filters are used:
 
G x, y
For each feature point, we first choose its local window
M   2 N
for feature description. The window size is determined as 1 (x − x0 )2 y − y0
= exp −π + (13)
J√ K 2πσβ σ2 β2
N +1  
W= 2 · SA · + 0.5 , (11) × exp i ξ0 x + v0 y .
2
EURASIP Journal on Advances in Signal Processing 7

score 1
Subregion map 1 from image X Subregion map 1 from image Y
sco
re
2

3
re
sco
Subregion map 2 from image X Subregion map 2 from image Y
score 4

Figure 8: Subregion matching between two iris images.

Angular direction

Radial
direction 1
Angular direction
1A 1B 1C
(a) Feature point 1 from image X Radial
direction 1D 1E 1F
Angular direction 1G
A B C
Radial (c) Feature point 1 from image X is matched
direction D E F with feature points A to G from image Y
G

(b) The feature points that are


located in the same location and
neighbor locations from image Y

Figure 9: Feature point matching.

In this equation, (x0, y0) is the center of the receptive field in the spatial domain, (ξ0, v0) is the frequency of the filter, and σ and β are the standard deviations of the elliptical Gaussian along the x and y directions. By properly designing these parameters, we can change the Gabor wavelet to fit the specific region for feature extraction. Figure 5 shows examples of Gabor odd and even filters in different sizes and orientations.

For each point in the normalized window around the feature point, the magnitude and phase response of the appropriate 2D Gabor wavelet is calculated with the wavelet centered on the point being considered. The magnitude is then Gaussian weighted based on the relative spatial distance from the feature point, so that points in the window closest to the feature point carry the most weight and points further away carry less.

A Gabor descriptor is created by first computing the gradient magnitude and orientation phase at each point in a normalized window around the feature point location, as shown in Figure 6. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into four phase quadrants; in this paper, we separate the phase into 4 areas (Figure 7).

The weight of the Gaussian, wn, is calculated as

wn = e^(−0.5((nx)^2/(2σx^2) + (ny)^2/(2σy^2))), (14)

where σy = N/2 and σx changes based on the dilation around the feature point. Finally, the weight of each point is calculated as

weight = wn · mg, (15)

where mg is the magnitude response of the 2D Gabor wavelet, and weight is added to one of 64 bins based on the relative distance from the feature point and the quantized phase response of the 2D Gabor wavelet. The resulting 64-bin feature point descriptor is then normalized to a unit vector by dividing by its 2-norm:

descrnorm = descr / ||descr||2. (16)

Since each descriptor is normalized, the relative difference in magnitude response from the 2D Gabor filter remains the same for the same points around a feature point across iris images with different global illumination. And since phase is not affected by illumination, the same points in two iris images affect the same descriptor bins. Therefore, each feature point descriptor has each of its 64 bins uniquely affected by the surrounding points based on distance from the feature point and on the 2D Gabor wavelet response magnitude and phase, and an accurate descriptor is formed based entirely on the annular iris data.
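To make the descriptor construction in (14)-(16) concrete, here is a minimal NumPy sketch. It assumes the 2D Gabor magnitude and phase responses over the normalized window are already available, and it assumes a 16 spatial-bin by 4 phase-quadrant layout for the 64 bins; the function and parameter names are illustrative, not the authors' implementation.

```python
import numpy as np

def gabor_descriptor(mag, phase, sigma_x, sigma_y, n_spatial=16, n_phase=4):
    """Build a 64-bin descriptor from Gabor magnitude/phase responses.

    mag, phase : (N, N) arrays of 2D Gabor wavelet magnitude and phase over
                 the normalized window around a feature point (assumed given).
    sigma_x, sigma_y : Gaussian widths from (14); sigma_y = N/2, sigma_x
                 adjusted for local dilation (assumed given here).
    The 16 x 4 = 64 bin layout (spatial cells x phase quadrants) is an
    illustrative assumption, not the authors' exact binning.
    """
    N = mag.shape[0]
    cy = cx = (N - 1) / 2.0
    descr = np.zeros(n_spatial * n_phase)

    # Gaussian weight of each point relative to the feature point, as in (14)
    ys, xs = np.mgrid[0:N, 0:N]
    ny, nx = ys - cy, xs - cx
    wn = np.exp(-0.5 * (nx**2 / (2 * sigma_x**2) + ny**2 / (2 * sigma_y**2)))

    # weight = wn * mg, (15), accumulated into bins by distance and phase
    weight = wn * mag
    r = np.sqrt(nx**2 + ny**2)
    dist_bin = np.minimum((r / (r.max() + 1e-12) * n_spatial).astype(int),
                          n_spatial - 1)
    phase_bin = ((phase % (2 * np.pi)) / (2 * np.pi) * n_phase).astype(int) % n_phase
    np.add.at(descr, dist_bin.ravel() * n_phase + phase_bin.ravel(), weight.ravel())

    # Normalize to a unit vector, as in (16)
    return descr / (np.linalg.norm(descr) + 1e-12)
```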

Figure 10: Remote iris image acquisition station setup.

Figure 11: IUPUI Remote Iris Image Database, multiple gaze angles: (a) Look Left, (b) Look Center, (c) Look Right, (d) Look Up-Left, (e) Look Up, (f) Look Up-Right.

3.4. Region-Based Matching. To match two iris images, the two sets of 10 by 72 feature point maps are compared and the Euclidean distance is found between each pair of feature point descriptors (Figure 8). In other words, the two feature point maps from image A are compared to the two feature point maps from image B, resulting in 4 matching scores. The smallest matching score is then used as the matching score between the two images. Recall that the two feature point maps for an iris image describe the same regions but are offset by half of the angular resolution of the bins. This is done in order to accommodate feature points that fall on the boundaries of subregions within a feature point map.

To match two feature point maps, the average of the distance scores between all overlapping feature points is calculated and used as the matching score between the two feature point maps. To make the proposed method tolerant of segmentation error and eye rotation, each feature point in feature point map 1 is compared to each feature point in the fifteen surrounding bins (two bins on either side and one bin above and below) in feature point map 2 (Figure 9), and the minimum average distance score is stored for the two feature point maps compared. In this way, the proposed method is less sensitive to the segmentation error that is prone to occur in nonideal iris images, since feature points can occur anywhere within a bin and allowances are made to maximize the opportunity for the same two feature points in two images to be compared. Algorithms that sample the iris region and encode globally require more stringent segmentation results so as to correctly match each encoded point.
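Below is a minimal NumPy sketch of this region-based matching under one reasonable reading of the description: each iris is assumed to be represented by two 10 x 72 grids whose cells hold either a 64-dimensional unit descriptor or None, each feature point is compared against the 15 surrounding bins of the other map (with angular wrap-around), the per-point minima are averaged, and the smallest of the four map-to-map scores is returned. All names are illustrative rather than the authors' implementation.

```python
import numpy as np

def map_distance(map_a, map_b, radial=10, angular=72):
    """Average, over feature points of map_a, of the minimum Euclidean
    distance to descriptors in the 15 surrounding bins of map_b
    (2 bins on either side in angle, 1 bin above/below in radius)."""
    dists = []
    for r in range(radial):
        for t in range(angular):
            d_a = map_a[r][t]
            if d_a is None:
                continue
            best = None
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < radial:
                    continue
                for dt in (-2, -1, 0, 1, 2):
                    d_b = map_b[rr][(t + dt) % angular]  # angular wrap-around
                    if d_b is None:
                        continue
                    dist = np.linalg.norm(np.asarray(d_a) - np.asarray(d_b))
                    best = dist if best is None else min(best, dist)
            if best is not None:
                dists.append(best)
    return np.mean(dists) if dists else np.inf

def iris_distance(maps_x, maps_y):
    """maps_x, maps_y: the two offset feature point maps of each iris image.
    Four map-to-map scores are computed and the smallest one is returned."""
    scores = [map_distance(mx, my) for mx in maps_x for my in maps_y]
    return min(scores)
```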

4. Experimental Results

4.1. ICE 2005 Database Results and Comparison. The ICE 2005 database [30] from the National Institute of Standards and Technology (NIST) consists mostly of frontal look eyes. It includes 2 subdatabases: a left iris image database with 1527 images from 120 subjects, and a right iris image database with 1426 images from 124 subjects. In this experiment, we used the left eyes and right eyes, respectively, following the ICE phase 2005 protocol organized by NIST, and our goal was to compare the proposed method with traditional methods using only frontal eyes.

Table 1 shows the comparisons using different methods: Daugman's two methods, called the Cambridge 1 and Cambridge 2 methods [30]; our implementation of the traditional 2D Gabor wavelet matching and 1D Log-Gabor matching; our Regional SIFT; and the proposed method, on annular iris images. The two Cambridge results are the best in the list. The technical difference between the two Cambridge methods is unknown; they could use different segmentation methods or Gabor wavelet parameters. To be comparable, all our methods (our implementation of the 2D Gabor wavelet, the 1D Log-Gabor wavelet, the SIFT method, and the proposed method) used one segmentation method. It is shown that our implementation of the traditional methods obtains good results which are close to Daugman's results, and our proposed method obtains comparable results. Note that Daugman's methods used his own segmentation approaches, which are unknown to the public. The performance of Regional SIFT is understandable given the limitations previously mentioned. To reduce the effect of segmentation error on the traditional methods, manual segmentation was used to find the pupil and limbic boundaries, which were modeled as circles.

Figure 12: Comparisons between the Regional SIFT method and the proposed Gabor Descriptor method for the Center, Left, Right, Up-right, Up-left, Up, and All gaze classes using (a) EER (the lower the better), (b) GAR at FAR = 0.1% (the higher the better), and (c) GAR at FAR = 0.01% (the higher the better).

4.2. IUPUI Remote Iris Image Database Recognition Results and Comparison. The IUPUI Remote Iris Image Database was acquired at 10.3 feet from the camera to the subject using a MicroVista NIR camera with a Fujinon zoom lens. Six videos were captured for each subject under different scenarios: frontal look (1st video); reading from posters 15 feet from the subject and 5 feet behind the camera (2nd and 3rd videos) (Figure 10(a)); searching the wall to count the number of occurrences of a certain symbol (4th and 5th videos) (Figure 10(a)); and performing simple calculations using numbers posted on the ceiling (6th video) (Figure 10(b)). Each video was acquired at 30 frames per second with 1280 × 1024 resolution. The average iris radius of the video images in the database is 95 pixels. During the image acquisition, subjects could move their heads and eyes freely to perform the tasks, which simulates a remote, noncooperative situation, such as when a subject looks at flight times in an airport. In addition, the subjects could show their own emotions during the acquisition process (some of the subjects smiled in some tasks). The authors are working with the IRB to make this database publicly available.

In this experiment, a database with 10 video frames per iris for each of six classifications of gaze angle with respect to the camera was constructed (Figure 11) from both sessions: looking center, left, right, up-left, up-right, and up. This resulted in 60 images per iris, with the exception that three iris videos were missing. The total number of images used for this experiment was 3690 and included both left and right eyes from 31 subjects (3 videos were incorrectly acquired and were not used in this paper).

4.2.1. Frontal Look Recognition Results and Comparison. Table 2 shows that our results using the proposed method and the Regional SIFT method are comparable to the results achieved using traditional matching on the centered eyes from our noncooperative database. The pupil and limbic boundaries were modeled as circles, which is a simple and reasonable approximation of the pupil and limbic boundaries' geometries. We did not apply this same matching algorithm to the other classes since they are not frontal looking images and it would be difficult to reliably sample the iris pattern for off-angle images without some transformation such as Daugman proposed [2]. While this approach seems reasonable, we argue that due to the 3D nature of iris patterns, it is more reasonable to encode iris patterns without a transformation and more accurately represent the patterns presented to the camera.

4.2.2. Multiple Angle Recognition Results and Comparison. Table 3 shows the experimental results using the Regional SIFT method, and Table 4 shows the experimental results using the proposed method. Here, all-to-all matching is used to match all irises. The genuine matches are the matching results from the same eye with the same looking angle; the impostors are the matching results from different eyes with the same or different looking angles. By comparing them (Figure 12), we see that the Regional SIFT method does not perform as well as the proposed method on the noncooperative iris images. The main reason is that Regional SIFT selects feature points using local gradient magnitude and angle information, whereas the proposed method encodes feature information around feature points using the magnitude and phase response of 2D Gabor wavelets, which is more capable of capturing iris feature characteristics. In addition, the proposed method is less sensitive to segmentation error: the subregions of the proposed method are locally adjusted to the iris dilation, contraction, and deformation.

Table 2: IUPUI remote database frontal look eyes matching results.

Algorithm        #Images   EER      GAR at FAR = 0.1%   GAR at FAR = 0.01%
2D Gabor         610       0.0179   0.9297              0.8856
1D Log-Gabor     610       0.0295   0.9235              0.8980
Regional SIFT    610       0.0350   0.9173              0.8572
Proposed         610       0.0273   0.9213              0.8761

Table 3: Recognition results of the Regional SIFT method for same eyes divided into classes based on angle of gaze.

Classes     #Images   EER      GAR at FAR = 0.1%   GAR at FAR = 0.01%
Center      610       0.0350   0.9173              0.8572
Left        620       0.0454   0.7800              0.5865
Right       620       0.0454   0.8340              0.6980
Up-right    600       0.0567   0.7941              0.6530
Up-left     620       0.0610   0.7725              0.6170
Up          620       0.1392   0.6265              0.5320
All         3690      0.0588   0.8024              0.6763

Table 4: Recognition results of the proposed method for same eyes divided into classes based on angle of gaze.

Classes     #Images   EER      GAR at FAR = 0.1%   GAR at FAR = 0.01%
Center      610       0.0273   0.9213              0.8761
Left        620       0.0214   0.9487              0.9108
Right       620       0.0162   0.9613              0.9180
Up-right    600       0.0540   0.8956              0.8422
Up-left     620       0.0492   0.8742              0.8079
Up          620       0.1251   0.6950              0.6358
All         3690      0.0478   0.8966              0.8476

Figure 13: The proposed video-based noncooperative iris recognition system: video image acquisition, video-based segmentation, segmentation evaluation (frames with poor segmentation are discarded), Gabor descriptor-based feature extraction, matching, and fusion to produce the identification output.

Figure 14: Matching protocol for noncooperative iris recognition: frames from a near-infrared (NIR) video are matched against the enrollment database to produce similarity scores.

For many methods, including the Regional SIFT method, center gaze would achieve better recognition accuracy than off-angle eyes. However, for the proposed method, our experimental results show that the left- and right-looking eyes achieved higher accuracy than frontal looking images (Figure 12 and Table 4). When the eye is looking left or right, the image resolution for one side of the iris (left or right) is reduced, but the resolution for the other side of the iris is increased. This increased resolution of the iris pattern helps to select stable feature points in recognition. As a result, our method performs slightly better for left-looking or right-looking iris images. This shows that the proposed method is well suited for use in a nonfrontal gaze situation.
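To make the experimental protocol of this subsection concrete, the following short sketch (with hypothetical names, continuing the eer_and_gar example above) partitions an all-to-all distance matrix into the genuine and impostor score sets as defined here: genuine scores come from the same eye at the same gaze angle, impostor scores from different eyes at any angle.

```python
import numpy as np

def split_scores(dist, eye_ids, angles):
    """dist: (n, n) all-to-all distance matrix; eye_ids, angles: length-n labels.
    Returns (genuine, impostor) score arrays following the Section 4.2.2 protocol."""
    genuine, impostor = [], []
    n = len(eye_ids)
    for i in range(n):
        for j in range(i + 1, n):          # each unordered pair once
            if eye_ids[i] == eye_ids[j]:
                if angles[i] == angles[j]: # same eye, same looking angle
                    genuine.append(dist[i, j])
            else:                          # different eyes, any looking angles
                impostor.append(dist[i, j])
    return np.array(genuine), np.array(impostor)

# Example: eer, gar = eer_and_gar(*split_scores(dist, eye_ids, angles))
```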

5. Video-Based Noncooperative Iris Recognition

5.1. Proposed Video-Based Noncooperative Iris Recognition System. Figure 13 describes the proposed system, which consists of acquiring video sequences of iris data, using video-based noncooperative iris image segmentation, evaluating the segmentation results, feature extraction using the proposed Gabor descriptor, feature matching, and fusion.

Matching Protocol. Under a noncooperative situation, the iris images tend to have lower quality and can be off-angle. To ensure accuracy, it is important to have multiple enrollment images with different eye-looking angles. In this paper, we propose a matching protocol in which multiple enrollment images are matched against the input video images (Figure 14).

Video-Based Noncooperative Iris Image Segmentation. Since noncooperative iris images can be especially difficult to segment using traditional methods [3, 9, 27–34], the video-based noncooperative iris image segmentation algorithm developed in our lab [61] is used in this paper. It uses a coarse-to-fine approach and a general conic to model the pupil and limbic boundaries. More details of this method can be found in [61].

Segmentation Evaluation. The segmentation evaluation method developed in our lab [32] was used in this paper to estimate the accuracy of the segmentation result.

Feature Extraction and Matching. The 10 images with the best segmentation scores are used for recognition. The proposed Gabor descriptor method (introduced in Section 3) was used for feature extraction and matching.

Score Fusion. The matching score between the enrollment image and the individual video frame is fused with the segmentation evaluation score. After a majority vote, if the best matching score of the video to an enrollment iris satisfies the matching threshold, that matching score becomes the matching result for the video to that enrollment eye. Matching results from the video sequence to other enrollment eyes are set to 1 (1 means no match). If even the highest matching score does not satisfy the matching threshold, the video is not matched to any eye.
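The exact fusion formula is not spelled out above, so the following is only a rough sketch under explicit assumptions: fused scores are assumed to be the per-frame matching distances penalized by poor segmentation-evaluation scores, the majority vote is taken over the retained frames, and all names (fuse_video_scores, alpha, etc.) are hypothetical rather than the authors' implementation.

```python
import numpy as np

def fuse_video_scores(frame_scores, seg_scores, threshold, alpha=0.5):
    """frame_scores: dict {eye_id: list of per-frame matching distances}
    seg_scores: list of per-frame segmentation evaluation scores in [0, 1].
    The multiplicative fusion rule and alpha are assumptions; the paper only
    states that the matching score is fused with the segmentation score."""
    fused = {}
    for eye, scores in frame_scores.items():
        # assumed fusion: penalize frames with poor segmentation quality
        fused[eye] = [s * (1.0 + alpha * (1.0 - q))
                      for s, q in zip(scores, seg_scores)]

    # majority vote over frames: which enrollment eye wins the most frames?
    votes = {}
    for k in range(len(seg_scores)):
        winner = min(fused, key=lambda e: fused[e][k])
        votes[winner] = votes.get(winner, 0) + 1
    best_eye = max(votes, key=votes.get)

    best_score = min(fused[best_eye])
    results = {eye: 1.0 for eye in frame_scores}   # 1 means "no match"
    if best_score <= threshold:
        results[best_eye] = best_score             # accept the winning eye
    # otherwise the video is not matched to any eye (all results stay 1)
    return results
```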
5.2. Experimental Results and Discussion. In this experiment, 10 images per eye from the first session were used for enrollment. They include the different off-angles (left, right, up-left, up-right, and up). The total number of enrollment images is 620, with 62 irises from 31 subjects. We automatically match the enrollment images with the video frames in the 5 videos for each person from the second session (the frontal-look-only video was excluded, as its frames are all frontal images) for 30 subjects and 60 irises. One subject did not have a second session, and 2 subjects only had 4 videos from the second session. In total, we have 298 video sequences.

The result is FAR = 0 and EER = 0 for all thresholds, since only one or zero matching scores are retained for each video. 73 videos (about 24.5% of the videos) were not recognized, since some videos could not generate satisfactory matching results. For the rest of the videos, there is 100% recognition accuracy (0% FAR at 0% FRR). The results show that 100% accuracy can be obtained using multiple enrollment images, video sequences of an iris, and fusion of matching scores, even in a noncooperative iris database.

6. Conclusion

In this paper, we proposed Gabor descriptor-based noncooperative iris recognition. The proposed solution to noncooperative iris recognition does not transform the iris to polar coordinates, is normalized for changes in dilation/contraction/deformation, and is tolerant of the segmentation errors that are likely to occur in a noncooperative situation. Experimental results show that the proposed method is comparable to traditional methods on the ICE 2005 database [30] and performs well for the IUPUI Remote Iris Image database. Results also show that visible iris features change as the gaze of an iris changes and that video-based iris recognition can greatly improve recognition accuracy when multiple-angle enrollment iris images are used.

Acknowledgments

The authors would like to gratefully thank Luke Thomas for his help in this project. The authors also would like to thank Professor J. R. Matey of the US Naval Academy for helpful discussions of his published papers on iris recognition in low constraint scenarios. They would also like to thank Mr. R. Kirchner from the Department of Defense for his help and support. They would also like to thank MicroVista for partial support of the camera equipment [62]. They would also like to thank the people who contributed their iris data for this project. The research in this paper uses the ICE database provided by NIST [30]. This project is sponsored by the ONR Young Investigator Program (Award no. N00014-07-1-0788).

References

[1] A. K. Jain, A. Ross, and S. Pankanti, "Biometrics: a tool for information security," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.
[2] Y. Du, "Review of iris recognition: cameras, systems, and their applications," Sensor Review, vol. 26, no. 1, pp. 66–69, 2006.
[3] K. W. Bowyer, K. Hollingsworth, and P. J. Flynn, "Image understanding for iris biometrics: a survey," Computer Vision and Image Understanding, vol. 110, no. 2, pp. 281–307, 2008.

[4] A.K. Jain, P. Flynn, A. A. Ross, et al., Handbook of Biometrics, [23] K. Miyazawa, K. Ito, T. Aoki, K. Kobayashi, and H. Nakajima,
Springer, New York, NY, USA, 2008. “An effective approach for Iris recognition using phase-based
[5] H. Proenca and L. A. Alexandre, “Toward noncooperative image matching,” IEEE Transactions on Pattern Analysis and
iris recognition: a classification approach using multiple Machine Intelligence, vol. 30, no. 10, pp. 1741–1756, 2008.
signatures,” IEEE Transactions on Pattern Analysis and Machine [24] N. Tajbakhsh, B. N. Araabi, and H. Soltanianzadeh, “An
Intelligence, vol. 29, no. 4, pp. 607–612, 2007. intelligent decision combiner applied to noncooperative iris
[6] J. Daugman, “Statistical richness of visual phase information: recognition,” in Proceedings of the 11th International Confer-
update on recognizing persons by iris patterns,” International ence on Information Fusion (FUSION ’08), 2008.
Journal of Computer Vision, vol. 45, no. 1, pp. 25–38, 2001. [25] C. Belcher and Y. Du, “A selective feature information
[7] J. Daugman, “How iris recognition works,” IEEE Transactions approach for iris image quality measure,” IEEE Transactions on
on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. Information Forensics and Security, vol. 3, no. 3, pp. 572–577,
21–30, 2004. 2008.
[8] J. Daugman, “Probing the uniqueness and randomness of
[26] Y. Du, C. Belcher, Z. Zhou, and R. W. Ives, “Feature correlation
iriscodes: results from 200 billion iris pair comparisons,”
evaluation approach for iris image quality measure,” Signal
Proceedings of the IEEE, vol. 94, no. 11, pp. 1927–1934, 2006.
Processing, vol. 90, no. 4, pp. 1176–1187, 2010.
[9] J. Daugman, “New methods in iris recognition,” IEEE Trans-
actions on Systems, Man, and Cybernetics. Part B, vol. 37, no. 5, [27] H. Proenca and L. A. Alexandre, “Iris segmentation methodol-
pp. 1167–1175, 2007. ogy for non-cooperative recognition,” IEE Proceedings: Vision,
[10] Y. Chen, S. C. Dass, and A. K. Jain, “Localized iris image Image and Signal Processing, vol. 153, no. 2, pp. 199–205, 2006.
quality using 2-D wavelets,” in Proceedings of the IEEE [28] UBIRIS database, http://iris.di.ubi.pt/.
International Conference on Biometrics, Hong Kong, China, [29] CASIA database, http://www.cbsr.ia.ac.cn/IrisDatabase.htm.
2006. [30] P. J. Phillips, K. W. Bowyer, P. J. Flynn, X. Liu, and W. T.
[11] J. R. Matey, O. Naroditsky, K. Hanna, et al., “Iris on the move: Scruggs, “The iris challenge evaluation 2005,” in Proceceedings
acquisition of images for iris recognition in less constrained of the IEEE 2nd International Conference on Biometrics: Theory,
environments,” Proceedings of the IEEE, vol. 94, no. 11, pp. Applications and Systems, Arlington, Va, USA, 2008.
1936–1946, 2006. [31] M. Vatsa, R. Singh, and A. Noore, “Improving iris recogni-
[12] L. Masek and P. Kovesi, MATLAB Source Code for a Biometric tion performance using segmentation, quality enhancement,
Identification System Based on Iris Patterns, University of match score fusion, and indexing,” IEEE Transactions on
Western Australia, Perth, Australia, 2003. Systems, Man, and Cybernetics. Part B, vol. 38, no. 4, pp. 1021–
[13] L. Ma, T. Tan, Y. Wang, and D. Zhang, “Personal identification 1035, 2008.
based on iris texture analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1519– [32] Z. Zhou, Y. Du, and C. Belcher, “Transforming traditional
1533, 2003. iris recognition systems to work in nonideal situations,” IEEE
[14] R. P. Wildes, “Iris recognition: an emerging biometrie tech- Transactions on Industrial Electronics, vol. 56, no. 8, pp. 3203–
nology,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1348–1363, 3213, 2009.
1997. [33] S. A.C. Schuckers, N. A. Schmid, A. Abhyankar, V. Dorairaj,
[15] N. Sudha, N. B. Puhan, H. Xia, and X. Jiang, “Iris recognition C. K. Boyce, and L. A. Hornak, “On techniques for angle
on edge maps,” in Proceedings of the 2007 6th International compensation in nonideal iris recognition,” IEEE Transactions
Conference on Information, Communications and Signal Pro- on Systems, Man, and Cybernetics. Part B, vol. 37, no. 5, pp.
cessing, Singapore, 2007. 1176–1190, 2007.
[16] N. Sudha, N. B. Puhan, H. Xia, and X. Jiang, “Iris recognition [34] C. Belcher and Y. Du, “Region-based SIFT approach to iris
on edge maps,” IET Computer Vision, vol. 3, no. 1, pp. 1–7, recognition,” Optics and Lasers in Engineering, vol. 47, no. 1,
2009. pp. 139–147, 2009.
[17] W. W. Boles and B. Boashash, “A human identification [35] K. Mikolajczyk and C. Schmid, “A performance evaluation of
technique using images of the iris and wavelet transform,” local descriptors,” IEEE Transactions on Pattern Analysis and
IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 1185– Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
1188, 1998. [36] D. G. Lowe, “Distinctive image features from scale-invariant
[18] Z. Sun, Y. Wang, T. Tan, and J. Cui, “Improving iris recog- keypoints,” International Journal of Computer Vision, vol. 60,
nition accuracy via cascaded classifiers,” IEEE Transactions on no. 2, pp. 91–110, 2004.
Systems, Man and Cybernetics. Part C, vol. 35, no. 3, pp. 435–
[37] Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive
441, 2005.
representation for local image descriptors,” in Proceedings of
[19] K. P. Hollingsworth, K. W. Bowyer, and P. J. Flynn, “The best
the IEEE Computer Society Conference on Computer Vision and
bits in an Iris code,” IEEE Transactions on Pattern Analysis and
Pattern Recognition, vol. 2, pp. II506–II513, Washington, DC,
Machine Intelligence, vol. 31, no. 6, pp. 964–973, 2009.
USA, 2004.
[20] J. Thornton, M. Savvides, and B. V. K. V. Kumar, “A Bayesian
approach to deformed pattern matching of iris images,” IEEE [38] G. Carneiro and A. D. Jepson, “Flexible spatial configuration
Transactions on Pattern Analysis and Machine Intelligence, vol. of local image features,” IEEE Transactions on Pattern Analysis
29, no. 4, pp. 596–606, 2007. and Machine Intelligence, vol. 29, no. 12, pp. 2089–2104, 2007.
[21] Y. Du, R. W. Ives, D. M. Etter, and T. B. Welch, “Use of one- [39] H. Cheng, Z. Liu, N. Zheng, et al., “A deformable local
dimensional iris signatures to rank iris pattern similarities,” image discriptor,” Proceedings of the IEEE Computer Society
Optical Engineering, vol. 45, no. 3, 037201, pp. 1–10, 2006. Conference on Computer Vision and Pattern Recognition (CVPR
[22] V. Velisavljevic, “Low-complexity iris coding and recognition ’08), vol. 29, pp. 1–8, 2008.
based on directionlets,” IEEE Transactions on Information [40] E. N. Mortensen, H. Deng, and L. Shapiro, “A SIFT descriptor
Forensics and Security, vol. 4, no. 3, pp. 410–417, 2009. with global context,” in Proceedings of the IEEE Computer

Society Conference on Computer Vision and Pattern Recognition [57] Z. Zhang, S. Chan, and L.-T. Chia, “Codebook+: a new
(CVPR ’05), vol. 1, pp. 184–190, San Diego, Calif, USA, 2005. module for creating discriminative codebooks,” in Proceedings
[41] L. Haibin and D.W. Jacobs, “Deformation invariant image of the 2007 IEEE International Conference on Multimedia and
matching,” in Proceedings of the 10th IEEE International Expo (ICME ’07), pp. 815–818, 2007.
Conference on Computer Vision (ICCV ’05), vol. 1, 2005. [58] L. Wu, S. Luo, and W. Sun, “Create efficient visual codebook
[42] A. E. Johnson and M. Hebert, “Using spin images for efficient based on weighted mRMR for object categorization,” in
object recognition in cluttered 3D scenes,” IEEE Transactions Proceedings of the International Conference on Signal Processing
on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. (ICSP ’08), pp. 1392–1395, 2008.
433–449, 1999. [59] F. Perronnin, “Universal and adapted vocabularies for generic
[43] J. Assfalg, M. Bertini, A. Del Bimbo, and P. Pala, “Content- visual categorization,” IEEE Transactions on Pattern Analysis
based retrieval of 3-D objects using spin image signatures,” and Machine Intelligence, vol. 30, no. 7, pp. 1243–1256, 2008.
IEEE Transactions on Multimedia, vol. 9, no. 3, pp. 589–599, [60] Z. He, T. Tan, Z. Sun, and X. Qiu, “Toward accurate and fast
2007. iris segmentation for iris biometrics,” IEEE Transactions on
[44] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp.
object recognition using shape contexts,” IEEE Transactions on 1670–1684, 2009.
Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. [61] Y. Du, E. Arslanturk, Z. Zhou, and C. Belcher, “Video-based
509–522, 2002. non-cooperative iris image segmentation,” IEEE Transactions
[45] G. Mori, S. Belongie, and J. Malik, “Efficient shape matching on Systems, Man, and Cybernetics. Part B. In press.
using shape contexts,” IEEE Transactions on Pattern Analysis [62] M. Cameras, http://www.intevac.com/intevacphotonics/prod-
and Machine Intelligence, vol. 27, no. 11, pp. 1832–1837, 2005. ucts/microvista-nir.
[46] G. Mori and J. Malik, “Recovering 3D human body config-
urations using shape contexts,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 28, no. 7, pp. 1052–
1062, 2006.
[47] W. T. Freeman and E.H. Adelson, “The design and use of
steerable filters,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 13, no. 9, pp. 891–906, 1991.
[48] E. P. Simoncelli and H. Farid, “Steerable wedge filters for local
orientation analysis,” IEEE Transactions on Image Processing,
vol. 5, no. 9, pp. 1377–1382, 1996.
[49] M. Jacob and M. Unser, “Design of steerable filters for feature
detection using Canny-like criteria,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp.
1007–1019, 2004.
[50] X. Shi, A. L. Ribeiro Castro, R. Manduchi, and R. Mont-
gomery, “Rotational invariant operators based on steerable
filter banks,” IEEE Signal Processing Letters, vol. 13, no. 11, pp.
684–687, 2006.
[51] V. Lepetit and P. Fua, “Keypoint recognition using randomized
trees,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 9, pp. 1465–1479, 2006.
[52] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, and T.
Tuytelaars, “A thousand words in a scene,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp.
1575–1589, 2007.
[53] T. Deselaers, L. Pimenidis, and H. Ney, “Bag-of-visual-
words models for adult image classification and filtering,”
in Poceedings of the 19th International Conference on Pattern
Recognition (ICPR ’08), Tampa, Fla, USA, 2008.
[54] N. Lazic and P. Aarabi, “Importance of feature locations
in bag-of-words image classification,” in Poceedings of the
IEEE International Conference on Acoustics, Speech and Signal
Processing, vol. 1, pp. I641–I644, Honolulu, Hawaii, USA,
2007.
[55] T. Botterill, S. Mills, and R. Green, “Speeded-up bag-of-words
algorithm for robot localisation through scene recognition,”
in Poceedings of the 23rd International Conference Image and
Vision Computing New Zealand (IVCNZ ’08), 2008.
[56] J.-H. Hsiao, C.-S. Chen, and M.-S. Chen, “A novel language-
model-based approach for image object mining and re-
ranking,” in Poceedings of the IEEE International Conference on
Data Mining, (ICDM ’08), pp. 243–252, 2008.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 158395, 11 pages
doi:10.1155/2010/158395

Research Article
A Multifactor Extension of Linear Discriminant Analysis for Face
Recognition under Varying Pose and Illumination

Sung Won Park and Marios Savvides


Electrical and Computer Engineering Department, Carnegie Mellon University, 5000 Forbes Avenue Pittsburgh, PA 15213, USA

Correspondence should be addressed to Sung Won Park, [email protected]

Received 11 December 2009; Revised 27 April 2010; Accepted 20 May 2010

Academic Editor: Robert W. Ives

Copyright © 2010 S. W. Park and M. Savvides. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Linear Discriminant Analysis (LDA) and Multilinear Principal Component Analysis (MPCA) are leading subspace methods for
achieving dimension reduction based on supervised learning. Both LDA and MPCA use class labels of data samples to calculate
subspaces onto which these samples are projected. Furthermore, both methods have been successfully applied to face recognition.
Although LDA and MPCA share common goals and methodologies, in previous research they have been applied separately and
independently. In this paper, we propose an extension of LDA to multiple factor frameworks. Our proposed method, Multifactor
Discriminant Analysis, aims to obtain multilinear projections that maximize the between-class scatter while minimizing the
within-class scatter, which is the same fundamental objective as that of LDA. Moreover, Multifactor Discriminant Analysis (MDA), like
MPCA, uses multifactor analysis and calculates subject parameters that represent the characteristics of subjects and are invariant
to other changes, such as viewpoints or lighting conditions. In this way, our proposed MDA combines the best virtues of both LDA
and MPCA for face recognition.

1. Introduction images, also called out-of-sample images. Finally, these


test images are classified with respect to different subjects,
Face recognition has significant applications for defense and and the classification accuracy is computed to evaluate the
national security. However, today, face recognition remains effectiveness of the discrimination.
challenging because of large variations in facial image Multilinear Principal Component Analysis (MPCA) [1,
appearance due to multiple factors including facial feature 2] and Linear Discriminant Analysis (LDA) [3, 4] are two
variations among different subjects, viewpoints, lighting of the most widely used dimension reduction methods for
conditions, and facial expressions. Thus, there is great face recognition. Unlike traditional PCA, both MPCA and
demand to develop robust face recognition methods that LDA are based on supervised learning that makes use of given
can recognize a subject’s identity from a face image in class labels. Furthermore, both MPCA and LDA are subspace
the presence of such variations. Dimensionality reduction projection methods that calculate low-dimensional projec-
techniques are common approaches applied to face recog- tions of data samples onto these trained subspaces. Although
nition not only to increase efficiency of matching and LDA and MPCA have different ways of calculating these
compact representation, but, more importantly, to highlight subspaces, they have a common objective function which
the important characteristics of each face image that provide utilizes a subject’s individual facial appearance variations.
discrimination. In particular, dimension reduction methods MPCA is a multilinear extension of Principal Com-
based on supervised learning have been proposed and ponent Analysis (PCA) [5] that analyzes the interaction
commonly used in the following manner. Given a set of face between multiple factors utilizing a tensor framework. The
images with class labels, dimension reduction methods based basic methodology of PCA is to calculate projections of data
on supervised learning make full use of class labels of these samples onto the linear subspace spanned by the principal
images to learn each subject’s identity. Then, a generalization directions with the largest variance. In other words, PCA
of this dimension reduction is achieved for unlabeled test finds the projections that best represent the data. While PCA

calculates one type of low-dimensional projection vector for The remainder of this paper is organized as follows.
each face image, MPCA can obtain multiple types of low- Section 2 reviews subspace methods from which the pro-
dimensional projection vectors; each vector parameterizes posed method is derived. Section 3 first addresses the advan-
a different factor of variations such as a subject’s identity, tages and disadvantages of multifactor analysis and discrimi-
viewpoint, and lighting feature spaces. MPCA establishes nant analysis individually, and then Section 4 proposes MDA
multiple dimensions based on multiple factors and then with the combined virtues of both methods. Experimental
computes multiple linear subspaces representing multiple results for face recognition in Section 5 show that the
varying factors. proposed MDA outperforms major dimension reduction
In this paper, we separately address the advantages and methods on the CMU PIE database and the Extended Yale B
disadvantages of multifactor analysis and discriminant anal- database. Section 6 summarizes the results and conclusions
ysis and propose Multifactor Discriminant Analysis (MDA) of our proposed method.
by synthesizing both methods. MDA can be thought of as an
extension of LDA to multiple factor frameworks providing
2. Review of Subspace Projection Methods
both multifactor analysis and discriminant analysis. LDA
and MPCA have different advantages and disadvantages, In this section, we review MPCA and LDA, two methods
which result from the fact that each method assumes on which our proposed Multifactor Discriminant Analysis is
different characteristics for data distributions. LDA can based.
analyze clusters distributed in a global data space based on
the assumption that the samples of each class approximately
2.1. Multilinear PCA. Multilinear Principal Component
create a Gaussian distribution. On the other hand, MPCA
Analysis (MPCA) [1, 2] is a multilinear extension of PCA.
can analyze the locally repeated distributions which are
MPCA computes a linear subspace representing the variance
caused by varying one factor under fixed other factors. Based
of data due to the variation of each factor as well as the linear
on synthesizing both LDA and MPCA, our proposed MDA
subspace of the image space itself. In this paper, we consider
can capture both global and local distributions caused by a
three factors: different subjects, viewpoints (i.e., pose types),
group of subjects.
and lighting conditions (i.e., illumination). While PCA is
Similar to our MDA, the Multilinear Discriminant
based on Singular Value Decomposition (SVD) [7], MPCA
Analysis proposed in [6] applies both tensor frameworks
is based on High-Order Singular Value Decomposition
and LDA to face recognition. Our method aims to analyze
(HOSVD) [8], which is a multidimensional extension of
multiple factors such as subjects’ identities and lighting
SVD.
conditions in a set of vectored images. On the other
Let X be the m p × n data matrix whose columns are
hand, [6] is designed to analyze multidimensional images
vectored training images x1 , x2 , . . . , xn with n p pixels. We
with a single factor, that is, subjects’ identities. In [6],
assume that these data samples are centered at zero. By SVD,
each face image constructs an n-mode tensor, and the
the matrix X can be decomposed into three matrices U, S,
low-dimensional representation of this original tensor is
and V:
calculated as another n-mode tensor with a smaller size. For
example, if we simply use 2-mode tensors, that is, matrices, X = USVT . (1)
representing 2D images, the method proposed in [6] reduces
each dimension of the rows and columns by capturing the If we keep only the m < n column vectors of U and V
repeated tendencies in rows and the repeated tendencies in corresponding to the m largest singular values and discard
columns. On the other hand, our proposed MDA analyzes the rests of the matrices, the sizes of the matrices in (1) are as
the repeated tendencies caused by varying each factor in a follows: U ∈ Rn p ×m , S ∈ Rm×m , and V ∈ Rn×m . For a sample
subspace obtained by LDA. The goal of MDA is to reduce the x, PCA obtains an m-dimensional representation:
impacts of environmental conditions, such as viewpoint and
lighting, from the low-dimensional representations obtained yPCA = UT x. (2)
by LDA. While [6] obtains a single tensor with a smaller
size for each image tensor, our proposed MDA obtains Note that these low-dimensional projections preserve the dot
multiple low-dimensional vectors, for each image vector, products of training images. We define the matrix YPCA ∈
which decompose and parameterize the impacts of multiple Rm×n consisting of these projections obtained by PCA:
factors. Thus, for each image, while the low-dimensional
representation obtained by [6] is still influenced by variance YPCA = UT X = SVT . (3)
in environmental factors, multiple parameters obtained by
our MDA are expected to be independent from each other. Then, we can see that the Gram matrices of X and YPCA are
The extension of [6] to multiple factor frameworks cannot identical since
be simply drawn because this method is formulated only
using a single factor, that is to say, subjects’ identities. On G = XT X = YTPCA YPCA = VS2 VT . (4)
the other hand, our proposed MDA decomposes the low-
dimensional representations obtained by LDA into multiple Since a Gram matrix is a matrix of all possible dot products, a
types of factor-specific parameters such as subject para- set of yPCA also preserves the dot products of original training
meters. images.

   
While PCA parameterizes a sample x with one low- Then, Z ∈ Rm ×ns nv nl can be easily derived as
dimensional vector y, MPCA [1] parameterizes the sample  
using multiple vectors associated with multiple factors of Z = UT X Vsubj ⊗ Vview ⊗ Vlight (10)
a data set. In this paper, we consider three factors of face from (7). For a training image xs,v,l assigned as one column
images: ns identities (or subjects), nv poses, and nl lighting subj light
of X, the three factor parameters vs , vvview , and vl are
conditions. xi,p,l denotes a vectored training image of the
identical to the sth row of Vsubj , vth row of Vview , and l
ith subject in the pth pose and the lth lighting condition.
th row of Vlight , respectively. In this paper, to solve for the
These training images are sorted in a specific order so as to
three parameters of an arbitrary unlabeled image x, one first
construct a data matrix X ∈ Rm×ns nv nl :
calculates the Kronecker product of these parameters using
 
X = x1,1,1 , x2,1,1 , . . . , xns ,1,1 , x1,2,1 , . . . , xns ,nv ,nl . (5) (6):
vsubj ⊗ vview ⊗ vlight = Z+ UT x, (11)
Using MPCA, an arbitrary image x and a data matrix X
are represented as where+ denotes the Moore-Penrose pseudoinverse. To
  decompose the Kronecker product of multiple parameters
x = UZ vsubj ⊗ vview ⊗ vlight , (6) into individual ones, two leading methods have been applied
in [2] and [9]. The best rank-1 method [2] reshapes the
 T
X = UZ Vsubj ⊗ Vview ⊗ Vlight , (7) vector vsubj ⊗ vview ⊗ vlight ∈ Rn s n v nl to the matrix
vsubj (vview ⊗ vlight )T ∈ Rn s ×n v n l , and using SVD of
respectively, where ⊗ denotes the Kronecker product and U this matrix, vsubj is calculated as the left singular vector
is identical to the matrix U in (1). A matrix Z results from corresponding to the largest singular value. Another method
the pixel-mode flattening of a core tensor [1]. In (6), we is the rank-(1, 1, . . . , 1) approximation using the alternating
can see that MPCA parameterizes a single image x using least squares method proposed in [9]. In this paper, we

three parameters: subject parameter vsubj ∈ Rns , viewpoint employed the decomposition method proposed in [2],
 
parameter vview ∈ Rnv , and lighting parameter vlight ∈ Rnl , which produced slightly better performances for face
  
where ns ≤ ns , nx ≤ nv , and nl ≤ nl . Similarly, X in (7) recognition than the method proposed in [9].

is represented by three orthogonal matrices Vsubj ∈ Rns ×ns , Based on the observation that the Gram-like matrices in
×nv ×nl (8) are formulated using the dot products, Multifactor Kernel
V view ∈ R nv , and V light ∈ Rn l . The columns of each
matrix span the linear subspace of the data space formed by PCA (MKPCA), a kernel-based extension of MPCA, was
varying each factor. Therefore, Vsubj , Vview , and Vlight consist introduced [10]. If we define a kernel function k, the kernel
of eigenvectors corresponding to the largest eigenvalues of versions of the Gram-like matrices in (8) can be directly
three Gram-like matrices Gsubj , Gview , and Glight respectively, calculated. Thus, for training images, Vsubj , Vview , and Vlight
where the (r, c) entry of these matrices is calculated as can be also calculated using eigen decomposition of these
matrices. Equations (10) and (11) show that in order to
1 v l T obtain vsubj , vview , and vlight for any test image, also called
n n
subj
Grc = x xc,p,l , an out-of-sample image, x, we must be able to calculate
nv nl p=1 l=1 r,p,l
UT X and UT x. Note that UT X and UT x are projections of
training samples and a test sample onto nonlinear subspace,
1 s l T
n n
Gview
rc = x xi,c,l , (8) respectively, and these can be calculated by KPCA as shown
ns nl i=1 l=1 i,r,l in [11].
1 s v T
n n
light
Grc = x xi,p,c . 2.2. Linear Discriminant Analysis. Since Linear Discriminant
ns nv i=1 p=1 i,p,r Analysis (LDA) [3, 4] is a supervised learning algorithm,
class labels of all samples are provided to the traditional LDA
These three Gram-like matrices Gsubj , Gview , Glight , represent approach. Let li ∈ 1, 2, . . . , c be the class label corresponding
similarities between different subjects, different poses, and to xi , where i = 1, 2, . . . , n and c is the number of classes.
different lighting conditions, respectively. For example, Gsubj
c ni be the number of samples in the class i such that
Let
can be thought of as the average similarity, measured by the i=1 ni = n. LDA calculates the optimal projection direction
dot product, between the rth subject’s face images and the cth w maximizing Fisher’s criterion
subject’s face images under varying viewpoints and lighting
conditions. wT Sb w
J(w) = , (12)
Three orthogonal matrices Vsubj , Vview , and Vlight are wT Sw w
calculated by SVD of the three Gram-like matrices: where Sb and Sw are the between-class and within-class
2 T
scatter matrices:
Gsubj = Vsubj Ssubj Vsubj , 
c
Sb = ni (mi − m)(mi − m)T ,
view view view 2 view T i=1
G =V S V , (9) (13)
n 
  T
light light 2 light T Sw = xi − mli xi − mli ,
Glight = V S V . i=1

Figure 1: Low-dimensional representations of training images obtained by PCA using the CMU PIE database. (a) Each set of samples with
the same color represents each subject’s face images. (b) Each set of samples with the same color represents face images under each viewpoint.
(c) Each set of samples with the same color represents face images under each lighting condition. (d) The red C-shape curve connects face
images under various lighting conditions for one person and one viewpoint. The blue V-shape curve connects face images under various
viewpoints for one person and one lighting condition. Green dots represent 30 subjects’ face images under one viewpoint and one lighting
condition. We can see that varying viewpoints and lighting conditions create clusters, rather than varying subjects.
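To connect plots like those in Figure 1 to the PCA formulation in (1)-(3), here is a minimal NumPy sketch that computes low-dimensional PCA projections of vectored face images; grouping the columns of Y by subject, viewpoint, or lighting label then reproduces views like Figure 1(a)-(c). The three-component choice and all variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def pca_projections(X, m=3):
    """X: (n_pixels, n_images) matrix of vectored face images (one per column).
    Returns U (n_pixels, m) and the m-dimensional projections Y = U^T X,
    following X = U S V^T and Y_PCA = S V^T in (1)-(3)."""
    Xc = X - X.mean(axis=1, keepdims=True)       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    U, S, Vt = U[:, :m], S[:m], Vt[:m, :]        # keep the m leading directions
    Y = S[:, None] * Vt                          # equals U.T @ Xc
    return U, Y

# Example: Y[:, subject_labels == s] gives the point cloud for one subject,
# Y[:, pose_labels == p] the cloud for one viewpoint, as in Figure 1(a)-(b).
```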

where mi denotes the sample mean for the class i. The that p < c. Despite the success of the LDA algorithm in

solution of (12) is calculated as the eigenvectors correspond- many applications, the dimension of yLDA ∈ Rn p is often
ing to the largest eigenvalues of the following generalized insufficient for representing each sample. This is caused by
eigenvector problem: the fact that the number of available projection directions is
lower than the class number c. To improve this limitation of
Sb w = λSw w. (14) LDA, variants of LDA, such as the null subspace algorithm
Since Sw does not have full column rank and thus is not [12] and a direct LDA algorithm [13], were proposed.
invertible, (14) can be solved not by eigen decomposition but
instead by a generalized eigenvector problem. LDA obtains a
low-dimensional representation yLDA for an arbitrary sample 3. Limitations of Multifactor Analysis and
x: Discriminant Analysis
yLDA = WT x, (15) LDA and MPCA have different advantages and disadvan-
tages, which result from the fact that each method assumes

where the columns of the matrix W ∈ Rn p ×n p consist of different characteristics for data distributions. MPCA’s sub-
w1 , w2 , . . . , wp . In other words, yLDA is the projection of x ject parameters represent the average positions of a group of
onto the linear subspace spanned by w1 , w2 , . . . , wp . Note subjects across varying viewpoints and lighting conditions.

LDA inspires multiple advanced variants such as Kernel


Discriminant Analysis (KDA) [14, 15], which can obtain
nonlinear subspaces. However, these subspaces are still based
on the analysis of the clusters distributed in a global data
space. Thus, there is no guarantee that KDA can be successful
if face images which belong to the same subject are scattered
rather than distributed as clusters. In sum, LDA cannot be
successfully applied unless, in a given data set, data samples
are distributed as clusters due to different classes.

3.2. The Assumption of MPCA: Repeated Distributions Caused


by Varying One Factor. MPCA is based on the assumption
that the variation of one factor repeats similar shapes of
distributions, and these common shapes rarely depend on
the variation of other factors. For example, the subject
parameters represent the averages of the relative positions
of subjects in the data space across varying viewpoints and
lighting conditions. To illustrate this, we consider viewpoint-
and lighting-invariant subsets of a given face image set; each
subset consists of the face images of ns subjects captured
under fixed viewpoint and lighting:
Figure 2: Ideal factor-specific submanifolds in an entire manifold
 
on which face images lie. Each red curve connects face images X:,v,l = x1,v,l x2,v,l · · · xns ,v,l ∈ Rn p ×ns (16)
only due to varying viewpoint while each blue curve connects face
images only due to varying illumination.
That is, each column of X:,v,l represents each image in this
subset. As shown in Figure 4(a), there are nv nl viewpoint-
and lighting-invariant subsets, and Gsubj in (8) can be
MPCA’s averaging is premised on the assumption that these rewritten as the average of the Gram matrices calculated in
subjects maintain similar relative positions in a data space these subsets:
under each viewpoint and lighting condition. On the other 1 v l T
n n
hand, LDA is based on the assumption that the samples Gsubj = X X:,v,l . (17)
nv nl v=1 l=1 :,v,l
of each class approximately create a Gaussian distribution.
Thus, we can expect that the comparative performances of In Euclidean geometry, the dot product between two vectors
MPCA and LDA vary with the characteristics of a data set. formulates the distance and linear similarity between them.
For classification tasks, LDA sometimes outperforms MPCA; Equation (9) shows that Gsubj is also the Gram matrix of
at other times MPCA outperforms LDA. In this section, we T
a set of the column vectors of the matrix Ssubj Vsubj ∈
demonstrate the assumptions on which each method is based 
×
R s . Thus, these ns column vectors represent the average
n ns
and the conditions where one can outperform the other.
distances between pairs of ns subjects. Therefore, the row
vectors of Vsubj , that is, the subject parameters, depend on
3.1. The Assumption of LDA: Clusters Caused by Different these average distances between ns subject across varying
Classes. Face recognition is a task to classify face images viewpoints and lighting conditions. Similarly, the viewpoint
with respect to different subjects. LDA assumes that each parameters and the lighting parameters depend on the
class, that is, each subject, approximately causes a Gaussian average distances between nv viewpoints and nl lighting
distribution in a data set. Based on this assumption, LDA cal- conditions, respectively, in a data space.
culates a global linear subspace which is applied to the entire Figure 2 illustrates an ideal case to which MPCA can
data set. However, a real-world face image set often includes be successfully applied. Face images lie on a manifold, and
other factors, such as viewpoints or lighting conditions viewpoint- and lighting-invariant subsets construct red and
in addition to differences between subjects. Unfortunately, blue curves, respectively. Each red curve connects face images
the variation of viewpoints or lighting conditions often only due to varying illumination while each blue curve
constructs global clusters across the entire data set while connects face images only due to varying viewpoints. Since
the variation of subjects creates only local distribution all of the red curves have identical shapes, nl different lighting
as shown in Figure 1. In the CMU PIE database, both conditions can be perfectly represented by nl row vectors of

viewpoints and lighting conditions create global clusters, as Vlight ∈ Rnl ×nl . Also, since all of the blue curves have identical
shown in Figures 1(b) and Figure 1(c), while a group of shapes, nv different viewpoints can be perfectly represented

subjects creates a local distribution, as shown in Figure 1(a). by nv row vectors of Vview ∈ Rnv ×nv . For each factor,
Therefore, low-dimensional projections obtained by LDA are when these subsets construct similar structures with small
not appropriate for face recognition in these samples, which variations, the average of these structures can successfully
are not globally separable. cover each sample.
Figure 3: Low-dimensional representations of training images obtained by PCA and MPCA. (a) the PCA projections of 9 subjects’ face
images generated by varying viewpoints under one lighting condition. (b) the viewpoint parameters obtained by MPCA. (c) the PCA
projections of 9 subjects’ face images generated by varying lighting conditions under one viewpoint. (d) the lighting parameters obtained by
MPCA.
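The viewpoint parameters plotted in Figure 3(b) come from the eigendecomposition of the Gram-like matrix G^view in (8)-(9). The sketch below shows that computation for the viewpoint factor only, assuming the data matrix is ordered as in (5) (subject index varying fastest, then viewpoint, then lighting); the tensor layout handling and names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def viewpoint_parameters(X, n_s, n_v, n_l, m=3):
    """X: (n_pixels, n_s*n_v*n_l) data matrix ordered so that the image of
    subject s, viewpoint v, lighting l sits at column (l*n_v + v)*n_s + s,
    i.e., subject varies fastest, as in (5). Returns the n_v x m viewpoint
    parameters (rows of V_view), following (8) and (9)."""
    T = X.reshape(X.shape[0], n_l, n_v, n_s)       # pixels x lighting x view x subject
    G_view = np.zeros((n_v, n_v))
    for s in range(n_s):
        for l in range(n_l):
            A = T[:, l, :, s]                      # pixels x n_v, fixed subject/lighting
            G_view += A.T @ A                      # dot products between viewpoints
    G_view /= (n_s * n_l)                          # average over subjects and lightings
    w, V = np.linalg.eigh(G_view)                  # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:m]
    return V[:, order]                             # one row per viewpoint

# Connecting the n_v rows of the returned matrix in viewpoint order produces a
# curve like the V-shape shown in Figure 3(b).
```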

We observe that each blue curve in Figure 3(a) that a variety of data distributions. In Figure 3(a), some curves
represents viewpoint variation seems to repeat a similar have W-shapes while most of the other curves have V-shapes.
V-shape for each person and each lighting condition. Also, Thus, in this case, we cannot expect reliable performances
Figure 3(b) visualizes the viewpoint parameters yv , learned from MPCA because the average shape obtained by MPCA
by MPCA; the curve connecting the viewpoint parameters for each factor insufficiently covers individual shapes of
roughly fits the average shape of the blue curves. As a curves.
result, yv in Figure 3(b) also has a V-shape. Also, the 3D
visualization of the lighting parameters in Figure 3(d) 4. Multifactor Discriminant Analysis
roughly averages the C-shapes of red curves shown in
Figure 3(c), each connecting face images under various As shown in Section 3.1, for face recognition, LDA is
lighting conditions for one person and one viewpoint. preferred if in a given data set, face images are distributed
Similar observations were illustrated in [9]. as clusters due to different subjects. Unlike LDA, as shown
Based on the above expectations, if varying just one in Section 3.2, MPCA can be successfully applied to face
factor generates dissimilar shapes of distribution, multilinear recognition if various subjects’ face images repeat similar
subspaces based on these average shapes do not represent shapes of distributions under each viewpoint and lighting,


even if these subjects do not seem to create these clusters. In and Vlight , and subsequently Z . Then, during testing, for the
this paper, we propose a novel method which can offer the LDA projection yLDA of an arbitrary test image, we calculate
advantages of both methods. Our proposed method is based the factor-specific parameters by decomposing Z+ yLDA .
on an extension of LDA to multiple factor frameworks. Thus, In Section 3.2, factor-specific parameters obtained by
we can call our method Multifactor Discriminant Analysis MPCA preserve the three Gram-like matrices Gsubj , Gview ,
(MDA). From yLDA , MDA aims to remove the remaining and Glight defined in (8). Figure 4 demonstrates that MPCA
characteristics which are caused by other factors, such as viewpoints and lighting conditions.

We start with the observation that MPCA is based on the relationships between yPCA, the low-dimensional representations obtained by PCA, and multiple factor-specific parameters. Combining (3) and (7), we can see that the matrix YPCA ∈ R^(np × ns nv nl) is rewritten as

YPCA = U^T X = Z (Vsubj ⊗ Vview ⊗ Vlight)^T.   (18)

Similarly, combining (2) and (7), for an arbitrary image x, yPCA can be decomposed into three vectors by MPCA:

yPCA = U^T x = Z (vsubj ⊗ vview ⊗ vlight),   (19)

where yPCA is the low-dimensional representation of x obtained by PCA. Thus, we can think that Z performs a linear transformation which maps the Kronecker product of multiple factor-specific parameters to the low-dimensional representations provided by PCA. In other words, yPCA is decomposed into vsubj, vview, and vlight by using the transformation matrix Z.

In this paper, instead of decomposing yPCA, decomposing yLDA is proposed, where yLDA is the low-dimensional representation of x provided by LDA, as defined in (15). yLDA often has more discriminant power than yPCA, but it still has the combined characteristics caused by multiple factors. Thus, we first formulate yLDA into the Kronecker product of the subject, viewpoint, and lighting parameters:

yLDA = W^T x = Z' (v'subj ⊗ v'view ⊗ v'light),   (20)

where W ∈ R^(np × n'p) is the LDA transformation matrix defined in (14) and (15). As reviewed in Section 2.2, n'p, the number of available projection directions, is lower than the class number ns: n'p < ns. Note that yLDA in (20) is formulated in a similar way to yPCA in (19) using different factor-specific parameters and Z'. We expect v'subj in (20), the subject parameter obtained by MDA, to be more reliable than both yLDA and vsubj, since v'subj provides the combined virtues of both LDA and MPCA. Using (15), we also calculate the matrix YLDA ∈ R^(n'p × ns nv nl) whose columns are the LDA projections of training samples.

While MPCA decomposes the data matrix X ∈ R^(np × ns nv nl) consisting of training samples, our proposed MDA aims to decompose the LDA projection matrix YLDA:

YLDA = W^T X = Z' (V'subj ⊗ V'view ⊗ V'light)^T.   (21)

To obtain the factor-specific parameters of an arbitrary test image x, we perform the following steps. During training, we first calculate the three orthogonal matrices, V'subj, V'view, and V'light. MDA calculates subject, viewpoint, and lighting parameters using only the colored parts in the Gram matrix. These colored parts represent the dot products between pairs of samples that have only one varying factor. For example, the colored parts in Figure 4(a) represent the dot products of different subjects' face images under fixed viewpoint and lighting condition. Based on these observations, among the dot products of pairs of LDA projections, we only use the dot products which correspond to the colored parts of G in Figure 4. Replacing x with yLDA, we define three new Gram-like matrices, G'subj, G'view, and G'light:

(G'subj)_{m,n} = Σ_{v=1}^{nv} Σ_{l=1}^{nl} (yLDA_{m,v,l})^T yLDA_{n,v,l}
              = Σ_{v=1}^{nv} Σ_{l=1}^{nl} (x_{m,v,l})^T W W^T x_{n,v,l},
(G'view)_{m,n} = Σ_{s=1}^{ns} Σ_{l=1}^{nl} (yLDA_{s,m,l})^T yLDA_{s,n,l},   (22)
(G'light)_{m,n} = Σ_{s=1}^{ns} Σ_{v=1}^{nv} (yLDA_{s,v,m})^T yLDA_{s,v,n},

where yLDA_{s,v,l} denotes the LDA projection of a training image x_{s,v,l} of the sth subject under the vth viewpoint and the lth lighting condition. In (9), for MPCA, Vsubj, Vview, and Vlight are calculated as the eigenvector matrices of Gsubj, Gview, and Glight, respectively. In similar ways, for MDA, V'subj ∈ R^(ns × ns), V'view ∈ R^(nv × nv), and V'light ∈ R^(nl × nl) can be calculated as the eigenvector matrices of G'subj, G'view, and G'light, respectively. Again, each row vector of V'subj represents the subject parameter of each subject in a training set.

We remember that YLDA ∈ R^(n'p × ns nv nl) and n'p < ns. Thus, if we define the Gram matrix G' as

G' = YLDA^T YLDA = X^T W W^T X,   (23)

this matrix G' ∈ R^(ns nv nl × ns nv nl) does not have full column rank. If G' is decomposed by SVD, G' has at most ns − 1 nonzero singular values. However, each of the matrices G'subj, G'view, and G'light has full column rank, since these matrices are defined in terms of the averages of different parts of G' as shown in Figure 4. Thus, even if n'p < nv or n'p < nl, one can calculate valid ns, nv, and nl eigenvectors from G'subj, G'view, and G'light, respectively.

After calculating these three eigenvector matrices, Z' ∈ R^(n'p × ns nv nl) can be easily calculated as

Z' = YLDA (V'subj ⊗ V'view ⊗ V'light).   (24)
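To make these steps concrete, the following is a minimal NumPy sketch of how the Gram-like matrices in (22), their eigenvector matrices, the transformation matrix Z' in (24), and the test-time subject parameter (via the pseudo-inverse step in (25) below) could be computed. It assumes the columns of YLDA are ordered with the lighting index varying fastest, then viewpoint, then subject; all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def mda_train(Y_lda, ns, nv, nl):
    """Sketch of the MDA training step, cf. (22)-(24).

    Y_lda: (n_p_prime, ns*nv*nl) matrix of LDA projections, columns ordered so
    that column index = s*nv*nl + v*nl + l (subject slowest, lighting fastest).
    """
    # Gram matrix of LDA projections, cf. (23), viewed as a 6-way array.
    G = (Y_lda.T @ Y_lda).reshape(ns, nv, nl, ns, nv, nl)

    # Gram-like matrices of (22): sum the dot products of pairs that share the
    # two remaining factors (averaging, as in Figure 4, only rescales them and
    # leaves the eigenvectors unchanged).
    G_subj = np.einsum('svltvl->st', G)    # same viewpoint and lighting
    G_view = np.einsum('svlswl->vw', G)    # same subject and lighting
    G_light = np.einsum('svlsvm->lm', G)   # same subject and viewpoint

    def eig_desc(A):
        w, V = np.linalg.eigh(A)
        return V[:, ::-1]                  # eigenvectors, descending eigenvalues

    V_subj, V_view, V_light = eig_desc(G_subj), eig_desc(G_view), eig_desc(G_light)

    # Transformation matrix Z' of (24).
    Z = Y_lda @ np.kron(np.kron(V_subj, V_view), V_light)
    return V_subj, V_view, V_light, Z

def mda_subject_parameter(y_lda, Z, ns, nv, nl):
    """Sketch of the test step: pseudo-inverse as in (25), then the dominant
    left singular vector of the rank-one rearrangement gives v'subj."""
    k = np.linalg.pinv(Z) @ y_lda          # v'subj ⊗ v'view ⊗ v'light
    U, _, _ = np.linalg.svd(k.reshape(ns, nv * nl), full_matrices=False)
    return U[:, 0]
```

The final reshape relies on the same column ordering assumed for YLDA, so that the Kronecker product unfolds into the rank-one matrix v'subj (v'view ⊗ v'light)^T whose dominant left singular vector is the subject parameter.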

(a) G (left) and Gsubj (right)
(b) G (left) and Gview (right)
(c) G (left) and Glight (right)

Figure 4: The relationships between the Gram matrix G defined in (4) and each of the Gram-like matrices Gsubj , Gview , and Glight defined
in (8), where a training set has two subjects, three viewpoints, and two lighting conditions. Each of Gsubj , Gview , and Glight is calculated as
the average of parts of the Gram matrix G. Each entry of these three Gram-like matrices is the average of same-color entries of G. (a) Gsubj
consists of averages of dot products which represent the averages of the pairwise relationships between a group of subjects. (b) Gview consists
of averages of dot products which represent the averages of the pairwise relationships between different viewpoints. (c) Glight consists of
averages of dot products which represent the averages of the pairwise relationships between different lighting conditions.

Thus, using this transformation matrix Z', the Kronecker product of the three factor-specific parameters is calculated as

(v'subj ⊗ v'view ⊗ v'light) = Z'^+ yLDA.   (25)

Again, as done in (11), by SVD of the matrix v'subj (v'view ⊗ v'light)^T, v'subj is calculated as the left singular vector corresponding to the largest singular value. Consequently, we can obtain v'subj of an arbitrary test image x.

5. Experimental Results

In this section, we demonstrate that Multifactor Discriminant Analysis is an appropriate method for dimension reduction of face images with varying factors. To test the quality of dimension reduction, we conducted face recognition tests. In all experiments, face images are aligned using eye coordinates and then cropped. Then, face images were resized to 32 × 32 gray-scale images, and each vectored image was normalized with unit norm and zero mean. After aligning and cropping, the left and right eyes are located at (9, 10) and (24, 10), respectively, in each 32 × 32 image.

For the face recognition experiments, we used two databases: the Extended YaleB database [16] and the CMU PIE database [17]. The Extended YaleB database contains 28 subjects captured under 64 different lighting conditions in 9 different viewpoints. For each of the subjects, we used all of the 9 viewpoints and the first 30 lighting conditions to reduce time for experiments. Among the face images, we used 10 lighting conditions in 5 viewpoints for each person for training and all of the remaining images for testing. Next, we used the CMU PIE database, which contains 68 individuals with 13 different viewpoints and 21 different lighting conditions. Again, to reduce time for experiments, we utilized 30 subjects. Also, we did not use two viewpoints: the leftmost profile and the rightmost profile. For each person, 5 lighting conditions in 5 viewpoints were used for training and all of the remaining images were used for testing. For each set of data, experiments were repeated 10 times using randomly selected lighting conditions and viewpoints. The averages of the results were reported in Tables 1 and 2.

Figure 5: Two dimensional projections of 10 classes in the Extended Yale B database. (a) features calculated by LDA, (b) subject parameters calculated by MDA.

Figure 6: The first two coordinates of lighting feature vectors computed by Multifactor Discriminant Analysis using the Extended Yale database (test images of Person 1 with pose 8 and Person 4 with pose 1).

Table 1: Rank-1 recognition rate on the Extended YaleB database.

Method   Untrained lighting   Untrained viewpoints   Untrained viewpoints & lighting
PCA      0.87 ± 0.03          0.64 ± 0.03            0.59 ± 0.05
MPCA     0.90 ± 0.01          0.70 ± 0.05            0.65 ± 0.06
KPCA     0.88 ± 0.03          0.67 ± 0.04            0.64 ± 0.06
LDA      0.89 ± 0.03          0.65 ± 0.03            0.62 ± 0.05
MDA      0.94 ± 0.03          0.77 ± 0.04            0.70 ± 0.05

We compare the performance of our proposed method, Multifactor Discriminant Analysis, and other traditional subspace projection methods with respect to dimension reduction: PCA, MPCA, KPCA, and LDA. For PCA and KPCA, we used the subspaces consisting of the minimum numbers of eigenvectors whose cumulative energy is above 0.95. For MPCA, we set the threshold in pixel mode to 0.95 and the threshold in other modes to 1.0. KPCA used RBF kernels with σ set to 100. We compared the rank-1 recognition rates of all of the methods using the simple cosine distance.

As shown in Tables 1 and 2, our proposed method, Multifactor Discriminant Analysis, outperforms the other methods for face recognition. This seems to be because Multifactor Discriminant Analysis offers the combined virtues of both multifactor analysis methods and discriminant analysis methods. Like multilinear subspace methods, Multifactor Discriminant Analysis can analyze one sample in a multiple factor framework, which improves face recognition performance.

Figure 5 shows two dimensional projections of 10 subjects under varying viewpoints and lighting conditions calculated by LDA and Multifactor Discriminant Analysis. For each image, while LDA calculated one kind of projection vector as shown in Figure 5(a), Multifactor Discriminant Analysis obtained individual projection vectors for subject, viewpoint, and lighting. Among the factor parameters, Figure 5(b) shows subject parameters obtained by MDA. Since these parameters are independent from varying viewpoints and lighting conditions, the subject parameters of face images are distributed as clusters created by varying subjects rather than the scattered results in Figure 5(a). For the same reason, Tables 1 and 2 show that MPCA and Multifactor Discriminant Analysis outperformed PCA and LDA, respectively.
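For reference, the rank-1 evaluation used for Tables 1 and 2 can be written compactly; the short sketch below (function and variable names are ours, not the paper's) assumes one feature vector per row.

```python
import numpy as np

def rank1_accuracy(gallery, gallery_labels, probes, probe_labels):
    """Nearest-gallery-neighbour classification under the cosine distance,
    as used for the rank-1 recognition rates."""
    G = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    P = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    nearest = np.argmax(P @ G.T, axis=1)   # largest cosine similarity
    hits = np.asarray(gallery_labels)[nearest] == np.asarray(probe_labels)
    return float(np.mean(hits))
```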

Table 2: Rank-1 recognition rate on the CMU PIE database.

Method   Untrained lighting   Untrained viewpoints   Untrained viewpoints & lighting
PCA      0.89 ± 0.06          0.70 ± 0.05            0.22 ± 0.05
MPCA     0.91 ± 0.04          0.74 ± 0.05            0.24 ± 0.06
KPCA     0.91 ± 0.04          0.73 ± 0.05            0.23 ± 0.06
LDA      0.90 ± 0.06          0.72 ± 0.05            0.23 ± 0.05
MDA      0.96 ± 0.04          0.79 ± 0.04            0.27 ± 0.06

Also, Figure 6 shows the first two coordinates of the lighting features calculated by Multifactor Discriminant Analysis for the face images of two different subjects in different viewpoints. These two-dimensional mappings are continuously distributed with steadily varying lighting, while differences in subjects or viewpoint appear to be relatively insignificant. For example, for both Person 1 in Viewpoint 8 and Person 4 in Viewpoint 1, the mappings for face images that were lit from the subjects' right side appear on the top left-hand corner, while dark images appear on the top-right corner; images captured under neutral lighting conditions lie on the bottom right. On the other hand, any two images captured under similar lighting conditions tend to be located close to each other even if they are of different subjects in different viewpoints. Therefore, we can conclude that the lighting features calculated by our proposed MDA preserve neighbors for lighting, which are captured under similar lighting conditions.

6. Conclusion

In this paper, we propose a novel dimension reduction method for face recognition: Multifactor Discriminant Analysis. Multifactor Discriminant Analysis can be thought of as an extension of LDA to multiple factor frameworks, providing both multifactor analysis and discriminant analysis. Moreover, we have shown through experiments that MDA extracts more reliable subject parameters compared to the low-dimensional projections obtained by LDA and MPCA. These subject parameters obtained by MDA represent locally repeated shapes of distributions due to differences in subjects for each combination of other factors. Consequently, MDA can offer more discriminant power, making full use of both the global distribution of the entire data set and the local factor-specific distribution. Reference [6] introduced another method which is theoretically based on both MPCA and LDA: Multilinear Discriminant Analysis. However, Multilinear Discriminant Analysis cannot analyze multiple factor frameworks, while our proposed Multifactor Discriminant Analysis can. Relevant examples are shown in Figure 5, where our proposed approach has been able to yield a discriminative two dimensional subspace that can cluster multiple subjects in the Yale-B database. On the other hand, LDA completely spreads the data samples into one global undiscriminative distribution of data samples. These results show the dimension reduction power of our approach in the presence of nuisance factors such as viewpoints and lighting conditions. This improved dimension reduction power will allow us to have reduced size feature sets (optimal for template storage) and increased matching speed due to these smaller dimensional features. Our approach is thus attractive for robust face recognition for real-world defense and security applications. Future work will include evaluating this approach on larger data sets such as the CMU Multi-PIE database and NIST's FRGC and MBGC databases.

References

[1] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear image analysis for facial recognition," in Proceedings of the International Conference on Pattern Recognition, vol. 1, no. 2, pp. 511–514, 2002.
[2] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear independent components analysis," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 547–553, San Diego, Calif, USA, 2005.
[3] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, Calif, USA, 2nd edition, 1999.
[4] A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.
[5] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[6] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.-J. Zhang, "Multilinear discriminant analysis for face recognition," IEEE Transactions on Image Processing, vol. 16, no. 1, pp. 212–220, 2007.
[7] G. H. Golub and C. F. V. Loan, Matrix Computations, The Johns Hopkins University Press, London, UK, 1996.
[8] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
[9] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear projection for appearance-based recognition in the tensor framework," in Proceedings of the IEEE International Conference on Computer Vision (ICCV '07), pp. 1–8, 2007.
[10] Y. Li, Y. Du, and X. Lin, "Kernel-based multifactor analysis for image synthesis and recognition," in Proceedings of the IEEE International Conference on Computer Vision, vol. 1, pp. 114–119, 2005.
[11] B. Scholkopf, A. Smola, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," in Neural Computation, pp. 1299–1319, 1996.
[12] X. Wang and X. Tang, "Dual-space linear discriminant analysis for face recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 564–569, 2004.
[13] H. Yu and J. Yang, "A direct LDA algorithm for high dimensional data with application to face recognition," Pattern Recognition, pp. 2067–2070, 2001.
[14] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[15] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller, "Fisher discriminant analysis with kernels," in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pp. 41–48, 1999.

[16] K.-C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[17] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression database," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1615–1618, 2003.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 938737, 20 pages
doi:10.1155/2010/938737

Research Article
Unconstrained Iris Acquisition and Recognition Using
COTS PTZ Camera

Shreyas Venugopalan and Marios Savvides


Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

Correspondence should be addressed to Shreyas Venugopalan, [email protected]

Received 2 December 2009; Revised 3 May 2010; Accepted 19 July 2010

Academic Editor: Yingzi Du

Copyright © 2010 S. Venugopalan and M. Savvides. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

Uniqueness of iris patterns among individuals has resulted in the ubiquity of iris recognition systems in virtual and physical spaces,
at high security facilities around the globe. Traditional methods of acquiring iris patterns in commercial systems scan the iris
when an individual is at a predetermined location in front of the scanner. Most state-of-the-art techniques for unconstrained iris
acquisition in literature use expensive custom equipment and are composed of a multicamera setup, which is bulky, expensive, and
requires calibration. This paper investigates a method of unconstrained iris acquisition and recognition using a single commercial
off-the-shelf (COTS) pan-tilt-zoom (PTZ) camera, that is compact and that reduces the cost of the final system, compared to
other proposed hierarchical multicomponent systems. We employ state-of-the-art techniques for face detection and a robust eye
detection scheme using active shape models for accurate landmark localization. Additionally, our system alleviates the need for
any calibration stage prior to its use. We present results using a database of iris images captured using our system, while operating
in an unconstrained acquisition mode at 1.5 m standoff, yielding an iris diameter in the 150−200 pixels range.

1. Introduction style approaches, other methods build on the method


proposed by Wildes in [5] to develop an alternate scheme
Biometrics is a fast-emerging field that is used in association for iris segmentation, feature extraction, and matching.
with security provisions in many establishments around the Additionally, matching based on partial iris patterns has been
globe. A good biometric is one that does not change with investigated in [6, 7].
time (stability); it is unique for each individual (distinctive- In most security systems that are based on iris patterns, a
ness); it has features that are not restricted to a certain class major concern is the acquisition of pristine quality iris pat-
of people (availability); it is easily acquirable (accessible); terns (i.e., accessibility of the iris pattern). This often requires
it should not pose any inconvenience to the individual significant cooperation from the individual whose eye is
whose biometrics are being acquired (acceptability and being imaged. The usual constraints include positioning the
unobtrusiveness factor). Of the various biometrics that have subject at a predefined location, at a predefined distance
been studied, the iris pattern has gained a lot of popularity from the camera, providing sufficient near-IR illumination
in recent years. The iris is actually the sphincter muscle for acquisition while maintaining prescribed eye-safety limits
within the sclera (the white region of the eye) that controls [8, 9]. One example of a widely used commercial device is
the contraction and dilation of the pupil depending on the the LG IrisAccess4000 [10]. This device uses voice prompts to
amount of light that is incident on the eye. Additionally, direct the user to the optimal position so that the system can
depending on the amount of melanin content in the iris, it acquire an in-focus iris image. The need for fine adjustment
varies in color from person to person. The iris pattern is of user position arises from the limited capture volume of
unique to an individual, and, as a result, this method has the system. The capture volume is defined as the volume
reported very high identification and verification accuracies of space in front of the image acquisition system within
in literature [1–4]. In addition to the popular Daugman which the user has to be present to acquire iris patterns of

acceptable in-focus quality. Once the iris of the user is within Another system that uses multiple cameras is the Retica
the acceptable capture volume, the user typically remains in Eagle Eye system [16]. It uses a scene camera, a face camera,
that position with limited motion until the system acquires and two iris cameras, which account for its large form factor.
a good quality image. In general, with these systems, this The capture volume of this setup is larger compared to the
positioning process can seem significant for some users and systems described so far, yielding a 3 × 2 × 3 m capture
may be unintuitive for relatively new users. This can result in volume with increased stand-off (average of 5 m). Yoon et
failure to acquire (FTA) results. al. [17] use a light stripe projection to detect the presence of
Other systems have been proposed that involve less coop- users and to estimate the distance from the camera. A system
eration from the users. One good example is the Iris-On- to perform recognition at a stand-off of 2 meters has been
the-Move system (IOM) that was proposed and developed introduced by AOptix [18]. This is designed to determine
by Sarnoff Corporation [11]. Iris patterns are captured while depth information from a capture 2D image, which helps
users walk through a portal that has near-IR illuminators it to set the focus at the user’s position. Also, it reports a
within the side panels. The system has a throughput of capture volume with a depth of 1 meter, enabling it to enroll
an average of 20 persons/min at a walking rate of around and verify users whose heights are between 0.9 meters and
1 meter/second. The subject stand-off required by the system 1.9 meters. The system uses adaptive optics that employ a
is 3 meters. The stand-off distance is the distance between the multistage, real-time closed loop control system, in order to
front of the lens and the subject. This acquisition system is a find the subject within the capture volume zone.
fixed focus system with a reported depth of field, based on the All of these systems involve custom-made components
iris match scores, of 5 cm−10 cm [11]. It should be noted here to capture iris patterns of required quality. Also, most of
that in the context of iris acquisition and recognition, the them depend on the presence of a wide angle “scene” camera
depth of field is defined as the depth of the capture volume to perform the initial face detection before handing off to
within which an iris image of acceptable quality may be an “eye” camera to perform iris acquisition. As mentioned,
acquired. The quality should suffice for successful operation the scene camera is typically equipped with a wide angle
of the iris-matching algorithm of the system. Compared to lens with a small value of focal length, while the eye camera
traditional desktop/wall mount systems (such as Panasonic, will have a telephoto lens for achieving high magnification.
LG, and others), this has an advantage of an increased stand- The system described in this paper has a twofold aim. It
off distance and reduced level of cooperation effort needed by aims to decrease the amount of cooperation required from
the subject. However, the issues of limited and fixed capture the individual who is being identified in an access control
volume remains. Iris image acquisition fails if a user’s iris is type scenario by using a PTZ-based system. It may be used
not acquired through this small capture volume. Also, this for access to high security areas within a building. Also, we
work suggests a modular approach to increase the height of propose to do this using a single camera (i.e., the functions of
the capture volume. Multiple cameras are stacked one above the scene and the eye cameras in the previous works will now
the other so that the iris can be captured irrespective of the be performed by a single camera through the utilization of
height of the user. This, however, increases the cost of the sys- the in-built zoom lens and autofocusing mechanism). More
tem. Both the LG system and the IOM are shown in Figure 1. importantly, we built this system using Commercial Off-
Another category of acquisition systems involves a The-Shelf (COTS) equipment, so there are no custom made
pan-tilt-zoom camera setup, which has a “dynamic” capture components in our system, which significantly reduces the
volume. A pan-tilt-zoom system can alleviate the need for cost compared to other systems and allows for rapid building
a fixed capture volume in order to get pristine iris pattern and reproduction.
images. By introducing 3 new degrees of freedom on the side The remainder of the paper has four sections. Section 2
of the acquirer (i.e., panning, tilting, and zooming), these describes our system hardware architecture and the various
systems give greater freedom of position and motion to the components that are used to build it. Section 3 gives an
subject. Early attempts at such a setup are reported by Oki overview of the various algorithms that are used for face
IrisPass-M [12], Sensar R1 [13], recent development from detection, iris acquisition, and iris matching. Section 4
Wheeler et al. [14], and the Mitsubishi Corporation [15]. presents the results of using this system over a database
All of these systems are based on the use of two cameras—a of images that was acquired using the PTZ camera under
scene camera having a wide angle lens to detect the eye in the varying lighting conditions and positions of the subject. We
scene and the second camera having a high magnification conclude this paper in Section 5.
lens, specifically aimed to capture the iris pattern on each
eye. The former three systems use a biocular setup for the
scene camera. The set of stereo images thus obtained may 2. System Hardware Architecture
be used to recover the 3D world position of the user. Using
this information, the pan and tilt required is estimated. The This section details the architecture of our proposed system
focus of the second camera can be estimated using the depth and the devices used therein. We discuss the specifications
information. The approach used by [15] is similar except needed of the proposed PTZ camera setup and the additional
they estimate the position of the user in 3D space using COTS optics required.
the disparity between facial features. This naturally involves
a calibration stage in which one has to learn the relation 2.1. Pan-Tilt-Zoom Camera. The acquisition device used
between facial features and position in space. in the proposed system is the Axis 233D Network

(a) (b)

Figure 1: Two commercial state of the art iris acquisition and recognition systems (a) The LG IrisAccess4000 which uses voice prompts
to direct the user to the optimal position so that an in-focus iris image may be acquired. (b) is the Iris-On-the-Move system developed by
Sarnoff Corporation which captures iris images with an increased stand-off of 3 meters. In both (a) and (b), the reader will note that the
capture volume is fixed, that is, the user is expected to be in a fixed location for iris acquisition.

(a) (b)

Figure 2: The Axis 233D network dome camera used in this work. (b) shows the camera after adding the necessary lenses for our work.

Pan-Tilt-Zoom (PTZ) camera manufactured by Axis Com- The camera (see Figure 2) captures up to 30 frames
munications. Details about the camera are available on a second and the frames are encoded in motion JPEG
the specification sheet [19]. The interesting specifications stream format. Application Programming Interfaces (APIs)
needed for our application include the following: a network for controlling the motion of the camera are freely available
PTZ camera built around a 1/4-inch ExView HAD progres- [21]. If the device is installed at the exterior of a building
sive scan CCD. It is capable of 35X optical zoom with a then the in-built Electronic Image Stabilizer (EIS) reduces
pan capability of 360◦ and a tilt capability of 180◦ . Both the vibration caused by traffic or wind, and thus yields
pan and tilt motions have adjustable speeds ranging from sharper images which are of importance for maintaining
0.05◦ to 450◦ /second. A built-in switchable IR cutoff filter spatial frequency content for matching. The 128X Wide
varies the light sensitivity depending on the ambient light. Dynamic Range (WDR) of the camera allows us to capture
Typically, in low light conditions, this filter is automatically detailed images in complex lighting conditions. It is able to
removed, increasing the sensitivity of the CCD to 0.008 lux perform automatic backlight compensation, which adjusts
(i.e., monochrome night mode as opposed to 0.05 lux in day the dynamic range of the camera, allowing it to capture
light mode). We operate the camera in the night mode since very detailed images even when the illumination is not
we want to image the scene in the near infrared wavelengths. sufficient. To some extent, this reduces the illumination
This is because, as the melanin content in the eye increases, requirements that must be imposed on our proposed system.
the iris absorbs more and more energy in the visible The camera is equipped with an autofocusing mechanism,
spectrum. The near infra-red spectrum preserves most of the enabling the lens to achieve focus at every zoom level. The
information in the iris pattern, and this is what we record autofocusing mechanism that is used in this device uses a
using the camera for further analysis [20]. Additionally, scene contrast measurement technique for focusing. This is the
reflections from the environment (typically observed by the traditional autofocus approach used in most digital image
naked eye in the visible domain) are no longer visibly present. acquisition devices that use the same sensor for focusing

and image acquisition (i.e., when no mechanical shutter is present for image acquisition and there is no range finding capability). As the image comes into focus, the spatial frequency content of the image increases, particularly at higher bands. Many algorithms exist to perform contrast analysis: some are frequency-based, and some are more computationally simplistic, using postprocessing of adjacent pixels to compute some figure of contrast measure.

2.2. Magnification. In order to get irises that are suitable for recognition purposes, the iris images must have at least 150 pixels across their diameter [22]. We know that a typical human iris has a diameter of around 12 mm [23]. Given this data, a useful analysis would be to determine the focal length of the lens required in order to capture iris images with this resolution, and the stand-off distance from the camera.

The Axis 233D network dome camera has a minimum focal length of 3.4 mm (wide) and a maximum focal length of 119 mm (telephoto). In this work, once the position of the eye has been detected on the user's face, the system zooms into the frame keeping the eye at the center; the focal length of the lens is increased to 119 mm before the iris image is captured so as to get maximum magnification. The focusing mechanism ensures that the eye is in focus during image capture. With this in mind, the following analysis determines what the maximum stand-off distance of the subject can be so that iris images of required resolution are obtained.

The Axis 233D uses an ExView HAD sensor, which has a pixel side of 6.45 μm. If an iris image has to have a resolution of at least 150 pixels, then the diameter of the iris image on the sensor is given by

v = 150 × 6.45 μm = 0.9675 mm.   (1)

If we consider an average person, with iris diameter u = 12 mm [23], the magnification is given by

M = image size (v) / object size (u) = 0.9675 / 12 = 0.0806.   (2)

The magnification of a lens system with effective focal length f, for an object at a distance D in front of the lens, is given by the standard relation [24]

M = f / (D − f).   (3)

In our work, the iris images are captured using a lens focal length of 119 mm, that is, f = 119 mm. So,

M = 0.0806 = 119 / (D − 119),   D = 1.595 m.   (4)

Hence, in the system that is set up, the maximum allowable subject stand-off distance is approximately 1.6 meters if the required iris resolution is at least 150 pixels across the iris.

2.2.1. Close Up Lens for Magnification. From the specifications of the Axis 233D camera, we see that when the lens is at its telephoto end (i.e., maximum focal length or full zoom), the lens is designed specifically to focus only on objects that are at a distance of more than a meter away. For any user who is at a stand-off distance of less than this, the iris image will be out of focus. This limits the depth of the capture volume because we want the users to have an option of standing at distances of less than a meter from the camera lens. The reader can compare this effect to a person who has the vision defect hypermetropia, or longsightedness. In this case, too, the lens in the eye cannot focus on objects closer to it than a certain distance. The solution to this problem is reading glasses, which actually use a converging lens that helps to bring the object back into focus onto the retina. Using a similar approach, we overcame this limitation in our system and allowed the users to stand closer (if needed) by fixing a secondary converging lens to the front of the primary lens of the PTZ camera. This effect is illustrated in Figures 3(b) and 3(c).

The design question then is the focal length of the secondary lens to be used. In photography, such secondary lenses are called closeup lenses [24]. Commonly available closeup lenses start from a power specification of +1 dioptre, the unit of power of a lens, which is equal to the inverse of the focal length in meters. So, a 1 dioptre lens has a focal length of one meter. The greater the dioptre rating, the greater the converging power of the lens. While selecting a secondary lens for our application, we need to make sure that, apart from bringing closer objects into focus, the combination of primary and secondary lenses has no adverse effect on the final magnification of the system. For a combination of lenses that are attached to each other, as in our case, the effective focal length is determined from [24]

1/feff = 1/f1 + 1/f2,   (5)

where f1 is the focal length of the primary lens (119 mm in our case) and f2 is the focal length of the secondary converging lens. Hence, the effect of adding the secondary lens is that of decreasing the effective focal length. But if the effective focal length decreases, as we can see from (3), the magnification will decrease. So, we need to fit a secondary lens so that there is minimal effect on magnification, and we still get the required iris resolution during acquisition. This translates to using a lens with a higher value of f2, from (5), and, hence, a lower dioptre value. As mentioned previously, common photographic closeup lenses start from 1 dioptre; thus, we choose this lens as our secondary lens. The effect on magnification can be analyzed by looking at (5) again. On adding the aforementioned secondary lens, the effective focal length is determined to be

feff = (1/119 + 1/1000)^(-1) = 106.34 mm.   (6)

Backtracking through the analysis performed in (1), (2), and (3), we get the image size to be v = 0.8572 mm when the object is at a stand-off distance of 1.6 meters. This corresponds to around 132 pixels across the iris, which is not a significant change from the original 150 pixel requirement

[Figure 3 schematic, panels (a)-(c): primary lens, secondary lens, image plane (IP), object (iris) of diameter u = 12 mm at distance D from the lens, iris image of diameter v, magnification M = v/u, minimum focus distance Dmin.]

Figure 3: The figure illustrates how a secondary converging lens, used along with the primary lens, reduces the minimum focus distance of the system. (a) shows an object placed farther away than the minimum focus distance Dmin from the primary lens. (b) shows an object placed closer to the primary lens; here, the object is not in focus on the image plane IP. (c) shows how the addition of a converging lens brings the image back into focus on the IP.

as our verification experiment shows (Section 4). Also, the minimum stand-off distance of 1 meter. The converging
since the secondary lens consists of only a single optical power of the lens at this setting is given by the inverse of the
element, we should not expect a significant degradation of focal distance (i.e., it is equal to (1/1 meter) = 1 dioptre). If
the net modulation transfer function (MTF) of the system. we now add a secondary lens of 1 dioptre to this system, the
Next consider the camera’s lens system alone without the net converging power is the sum of both powers, 2 dioptres.
secondary lens attached and consider the case when the Hence, the minimum focus distance at this setting (with the
autofocus mechanism has positioned itself to focus objects at secondary lens) of the autofocus mechanism of the camera

[Figure 4 schematic: camera with a capture volume starting 0.5 m from the lens; 200 pixels across the iris gives a 0.5 m depth, 150 pixels a 0.9 m depth, and 100 pixels a 1.6 m depth.]

Figure 4: The figure illustrates the varying depths of the capture volume based on the minimum iris resolution required. The iris resolution is indicated in terms of the number of pixels obtained across the iris in the captured eye image. The number to the right of the respective resolution indicates the depth of the capture volume in each case. 0.5 meters is the theoretical minimum focusing distance as determined in Section 2.1. We see that for minimum iris resolutions of 200, 150, and 100 pixels, the respective capture volume depths are 0.5 meters, 0.9 meters, and 1.6 meters. These results are based on analysis done using (1), (2), and (3).
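As a quick check of the numbers in Figure 4, the maximum stand-off distance for a given iris resolution follows directly from (1)-(3); the short sketch below (the helper name and printed rounding are ours) evaluates it for the effective focal length of (6).

```python
# Thin-lens stand-off estimate from (1)-(3): iris image size v on the sensor,
# required magnification M = v/u, and M = f/(D - f) solved for D.
PIXEL_PITCH_MM = 6.45e-3     # ExView HAD pixel side
IRIS_DIAMETER_MM = 12.0      # typical human iris diameter u
EFFECTIVE_F_MM = 106.34      # primary lens plus +1 dioptre close-up lens, cf. (6)

def max_standoff_m(pixels_across_iris, f_mm=EFFECTIVE_F_MM):
    v = pixels_across_iris * PIXEL_PITCH_MM        # (1)
    m = v / IRIS_DIAMETER_MM                       # (2)
    return (f_mm / m + f_mm) / 1000.0              # (3) rearranged, in metres

for px in (200, 150, 100):
    print(px, "px:", round(max_standoff_m(px), 2), "m maximum stand-off")
# Subtracting the ~0.5 m minimum focusing distance gives capture-volume depths
# close to the 0.5 m / 0.9 m / 1.6 m values quoted in Figure 4.
```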

is the inverse of this total power (i.e., 1/(2 dioptres) = 0.5 be used, but they must be certified to conform to eye-safety
meters). Examples of such secondary lenses can be seen in standards [8, 9]. To ensure complete safety, we use ambient
[25]. We have used the Tiffen 1 dioptre 52 mm lens shown in illumination or a table lamp with a standard 100 W bulb in
the same website. our lab experiments. Our face and eye detection algorithms
perform better if the face is not saturated with light. Hence,
2.2.2. Capture Volume. As mentioned previously, we note the lamp is directed at an angle to the user of the system
that the depth of field of an iris acquisition system is depth (approximately 30◦ with a line joining the lamp to the center
of the capture volume within which iris images of required of the subject’s face), to ensure there is enough illumination
resolution for recognition can be captured. This fact, coupled to acquire a good quality iris. At the same time, this off-angle
with the presence of an autofocusing mechanism in the illumination ensures that the face is not saturated with light.
camera, means that we practically get a depth of field limited Additionally, the camera is fitted with a near-IR filter,
by only the iris pixel resolution required. Other works which blocks most of the visible light and passes only the
in unconstrained iris acquisition, including commercial near-IR component of the incident light. It is seen that as
products, use a fixed focal length “iris camera” [10, 11, the melanin content in the human iris increases, much of
16, 17], where the focal length of the telephoto lens used the visible light component that is incident on the iris is
to acquire the eye images is fixed, and the only degree of absorbed. However, the near-IR wavelengths are reflected,
freedom is in image focus parameter (i.e., they have no carrying the reflected iris pattern information [20]. This is
zoom capability). As a result, the iris capture volume is at what is captured on the CCD sensor of the camera when
a fixed distance from the camera and has a fixed depth we use the IR pass filter. We experimented with various
of field. In our work, the presence of the combination of filters [26]. The filters are manufactured using custom made
the autofocusing mechanism, along with the pan and tilt precision ISO2002 German glass. The graph below (Figure 5)
capability, allows us to achieve a large torus-shaped capture shows the amount of visible light transmittance that is
volume. Hence, as mentioned, the depth of capture volume achievable using the various filters of this type.
is limited only by the iris resolution required (see Figure 4 On experimenting with various filters, we found that
below) for iris matching (i.e., the focus parameter is no the use of the 715 nm and the 780 nm wavelength filters in
longer a constraint as in traditional iris scanners). the proposed system achieved sufficient quality acquisition
Figure 4 shows the maximum stand-off distance achiev- images, exposing most of the iris pattern needed for feature
able if we can work with iris pixel resolutions as low extraction. The choice of the 715 nm filter performed
as 100 pixels and look at higher resolutions of 150 pixels the best, yielding the most iris-matching results in our
and 200 pixels across the iris diameter. The distances are experiments. Figure 6 compares the eye image of a user with
calculated as before using (1), (2), and (3). For instance, if and without a near IR filter attached to the camera. Figure 7
we require 200 pixels across the iris for a recognition system, illustrates the entire setup of the system.
then the maximum stand-off distance of the user is 1 meter
in front of the camera and the minimum stand-off distance is 2.4. Process Flow. This section describes the entire process of
0.5 m, which is the minimum focusing distance as explained iris acquisition using the axis 233D PTZ camera. The camera
in Section 2.1. is positioned at the required location with sufficient ambient
near-IR illumination. For this, there should be ambient
2.3. Illumination. If the near-infrared (NIR) content in the sunlight within the room. If this is not the case, a desk lamp
ambient light is not sufficient at the location where the image with a standard 100 W bulb can be used, as mentioned in the
is captured, a desk lamp with a standard 100 W bulb may be previous section. When a person approaches it, the camera
used while acquiring the images to increase ambient illumi- tracks the face until the person stops moving. The face track-
nation. Specific near-IR LED-based illumination sources can ing is performed using real time face detection at every input

[Figure 5 plot: IR filter glass transmittance versus wavelength (600-1100 nm) for the X-Nite630, X-Nite665, X-Nite715, X-Nite780, X-Nite830, X-Nite850, and X-Nite1000 filters; source www.MaxMax.com.]

Figure 5: Transmittance of various filters. The graph shows the various IR filters available, out of which X-Nite715, X-Nite780, X-Nite830, and X-Nite1000 were tested for use with the proposed system. It was found that the acquired images were suitable for iris matching when using the X-Nite715 filter. The source of the graph is www.maxmax.com.

frame, and our algorithm used for this task is described in Section 3. The camera zooms into the face, so that the active shape model may be accurately fit. The position of the eye can be very accurately obtained from the Active Shape Model (ASM) fitting (see Section 3 for details), which uses multiple facial landmarks around the eye region. Furthermore, the ASM can provide a pose estimate to tell when the user is looking toward the camera. Using the Cartesian coordinates of the eye boundary from the ASM model, we calculate the coordinates of the center of the eye and direct the camera to make this point the center of the acquisition frame. The camera provides built-in functions in order to perform this action. Once this is done, the camera zooms into the frame keeping the eye center as the approximate center of the frame. Figure 8 illustrates the entire process of iris acquisition. From our experiments, we noted that once the user is still, the entire process takes an average of 5 seconds to complete on a standard desktop machine.

A key point to note is that we do not require any calibration of the camera before starting the system. Other stereo camera systems, such as those proposed in [12-14], require a calibration stage that will estimate the 3D world-coordinate system of the face to predict the pan-tilt angles necessary to center on the iris and use the estimated depth to obtain the required focus on the iris region.

In our system, the face detection and eye detection method alleviates the need for calibration. We estimate the pan-tilt angles based purely on the outputs from the face and eye detector. Additionally, the contrast-based focusing mechanism of the camera (Section 2.1) does not require any knowledge of the distance of the individual from the camera, alleviating the need for 3D depth estimation. Figure 9 shows different subjects and their irises acquired automatically using our proposed system.

3. Real-Time Iris Acquisition and Recognition

This section briefly outlines the various algorithms that are used in this system, specifically, the algorithms used for face detection, eye detection, and iris matching.

3.1. Real Time Face Detection. The detection of faces in the frame of the camera is done using the face detection method proposed by Viola and Jones [27]. This method is extremely efficient and has been shown to give good detection accuracy at high frame rates. The algorithm involves three phases: feature extraction, classification using boosting, and multiscale detection. Here, we present a brief overview of the algorithm. Feature extraction for face detection uses a large set of Haar-wavelet-like rectangular features, and an image representation called the integral image is used to allow for a fast way to compute the equivalent inner products of these feature sets with the image using only lookup operations. A few examples of possible features selected in the cascade are shown in Figure 10.

The intuition behind the use of these features lies in the fact that faces offer more or less similar topographical distributions of shades. These features can thus indicate the presence (or absence) of certain characteristics in the image, such as multiresolution edges (i.e., to detect the eye region or mouth region). The value of a two-rectangle feature is the difference between the sums of pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent. A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. A four-rectangle feature computes the difference between diagonal pairs of rectangles. The different rectangle features are shown in Figure 10. A rich feature set allows for a more robust face detector to be learned by the Adaboost algorithm.

Each feature can be considered to be a "weak" binary classifier of a face. The final detector combines the output of each of these features and decides whether the given image has a face or not. The original supervised learning algorithm used by Viola and Jones is AdaBoost. Adaboost takes a number of these "weak" classifiers and obtains a final strong decision rule that is a weighted and signed combination of the weak ones. We can obtain a "weak" classifier h(x) from a feature by assigning it a threshold θ that is determined during the training phase. If the value of this feature is above/below the threshold value, then it is predicted that the image is or is not a face. Adaboost takes a weighted vote of the decisions of several of these features and outputs the final decision of whether or not the image is a face. This is given as

H(x) = sign( Σ_i αi hi(x) ),
h(x) = 1 if value of feature > θ, −1 otherwise.   (7)
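The integral-image representation mentioned above reduces every rectangle sum to four array lookups; a minimal sketch is given below (the helper names are our own and are not taken from the cited implementation [29]).

```python
import numpy as np

def integral_image(img):
    """Running 2D cumulative sum, padded with a zero row/column so that
    rectangle sums need no edge-case handling."""
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle via four lookups on the integral image."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def two_rect_feature(ii, top, left, height, width):
    """A two-rectangle feature: difference between the sums of two vertically
    adjacent regions of identical size, cf. the two-rectangle features of Figure 10."""
    return rect_sum(ii, top, left, height, width) - \
           rect_sum(ii, top + height, left, height, width)
```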

(a) (b)

Figure 6: The image illustrates eyes captured with and without the X-Nite715 near infrared (IR) filter. This filter passes near-IR wavelengths with 50% transmittance at 715 nm. We see that the iris pattern details are clear when we image in the near IR spectrum.

[Figure 7 schematic: PTZ camera with secondary lens and near-IR filter, desk lamp used for illumination, subject at a stand-off of about 1 m; pan left/right and tilt up/down arrows; field of view 55.8° × 41.8°, about 0.97 m × 0.73 m at 1 m; capture volume whose depth depends on the iris resolution required.]

Figure 7: This figure shows the system that we set up. It consists of the Axis 233D pan-tilt-zoom camera along with a table lamp with a standard 100 W bulb for illumination. The capture volume indicated is limited in depth by the pixel iris resolution required. Because of the autofocus mechanism, even at a fixed zoom level, which is 119 mm in our work, the person can come up to 0.5 meters in front of the lens. The arrows indicate how the volume can move based on the pan-tilt motion of the camera. The field of view of the camera is 55.8° × 41.8°, which is approximately 0.97 × 0.73 meters at a subject stand-off of 1 meter.
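Because the field of view quoted in Figure 7 is known, the pixel offset of a detected eye from the frame center can be converted into approximate pan and tilt corrections. The sketch below is our own linear pixel-to-angle approximation, not the control law of the camera API, and the names are illustrative.

```python
# Map a detected eye position to pan/tilt offsets, assuming the quoted
# angular field of view (55.8 deg horizontal, 41.8 deg vertical) and a
# simple linear relation between pixel offset and angle.
FOV_H_DEG, FOV_V_DEG = 55.8, 41.8

def pan_tilt_offsets(eye_x, eye_y, frame_w, frame_h):
    pan = (eye_x - frame_w / 2.0) / frame_w * FOV_H_DEG    # + right, - left
    tilt = (frame_h / 2.0 - eye_y) / frame_h * FOV_V_DEG   # + up, - down
    return pan, tilt
```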

Adaboost maintains a weight distribution over all training samples. The way Adaboost combines features is by analyzing them one feature at a time and finding which feature best separates the data. The data samples correctly classified are down weighted, while the wrongly classified samples are up weighted, and the whole process is reiterated. The error at any given iteration is given by the sum of the weights of the misclassified samples. The αt represents how confident the tth weak classifier is and is given by

αt = (1/2) ln( (1 − εt) / εt ),   (8)

where εt is the classification error when using the tth weak classifier. If Dt(i) is the weight assigned to the ith sample after the previous iteration, then εt is given by

εt = Σ_{i=1}^{n} Dt(i) I(yi ≠ ht(xi)),   (9)

where I(·) is the indicator function such that

I(yi ≠ ht(xi)) = 1 if yi ≠ ht(xi), 0 otherwise,   (10)

yi is the true value of the ith sample, while ht(xi) is the predicted output of this sample by the weak classifier ht(·) [28].

We use the implementation of this algorithm that is provided in [29] for our work.
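A compact sketch of the quantities in (7)-(10), the weighted vote of threshold decisions and one boosting round, is given below. The polarity term and the exponential reweighting follow the standard AdaBoost formulation rather than anything specific to [29], and all names are illustrative.

```python
import numpy as np

def strong_classify(feature_values, thresholds, polarities, alphas):
    """Weighted vote of weak threshold classifiers, cf. H(x) in (7).
    Each argument is an array of length T (one entry per weak classifier)."""
    weak = np.where(polarities * (feature_values - thresholds) > 0, 1, -1)  # h_t(x)
    return int(np.sign(np.sum(alphas * weak)))                              # H(x)

def boosting_round(decisions, labels, weights):
    """One AdaBoost round: weighted error (9)-(10), confidence (8), and the
    standard exponential reweighting of the training distribution."""
    miss = decisions != labels                     # I(y_i != h_t(x_i))
    eps = float(np.sum(weights[miss]))             # epsilon_t
    eps = min(max(eps, 1e-12), 1.0 - 1e-12)        # guard the logarithm
    alpha = 0.5 * np.log((1.0 - eps) / eps)        # alpha_t, (8)
    weights = weights * np.exp(-alpha * labels * decisions)
    return alpha, weights / weights.sum()          # renormalised D_{t+1}
```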


Figure 8: The figure illustrates the entire process of iris acquisition using the COTS PTZ camera. (a) The camera detects any face within the frame and moves (pans/tilts) along with it, keeping the face at the center. (b) Once it is sure that the face is still (i.e., the person stops moving), it zooms into the face, (c) fits the Active Shape Model (ASM) on the face, (d) the position of the eye of interest is read off from the ASM output, and (e) the camera zooms into the frame keeping this eye at the center (by pan/tilt).


Figure 9: Subjects using the proposed system. This figure shows the visual interface for the system as seen by the user. (a) the face detector
detects the face, (b) the ASM model is fitted on the face when it is still, (c) the eye is detected and (d) the zoomed in image of the eye with
200 pixels across the iris.

(a) (b) (c) (d)

(e) (f) (g) (h)

Figure 10: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels that lie within the white
rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (a) and (b). (c) shows a three-
rectangle feature, and (d) a four-rectangle feature. (e), (f), (g), and (h) show these features overlaid on an acquired face, during the face
detection stage. The outputs of several such features are combined to make a final decision as to whether a face is present or not in the frame.

3.2. Eye Detection Using Active-Shape Models. For the pur-


pose of eye detection, we rely on a method that parameterizes
an input face based solely on the shape of the face. The
advantage of this approach is that the eye detection is now
free of any effects arising out of facial texture difference,
ambient light variations, and so forth. As mentioned previ-
ously, we use active-shape models (ASM), which automat-
ically detect landmark points that define the shape of any
statistically modeled object—in our case, the human face.
Here, we briefly describe the formation of an ASM model and
how an ASM model is fitted on any previously unseen image.
For a more detailed treatment of this approach, the reader
is directed to [30–32]. Specifically, we used the latest robust
ASM developed by [30]. Figure 11 shows the ASM points that
are fit using [30].
In order to build a statistical facial model, a training set
that has images with manually annotated landmarks is used
(Figure 11). The training set comprises a subset of images
from the MBGC 2008 database [33]. The following are the
stages during training of the ASM:

Figure 11: The 79 keypoints that parameterize the active shape model for a given face.

(1) The coordinates of all the keypoints are stored as a shape vector x = [x1, y1, . . . , xN, yN], where xi and yi are the coordinates of the ith keypoint and N is the number of landmarks used.
(2) Generalized Procrustes Analysis [34] is used to align all the various shapes (represented by x) with each other.
(3) Following this, principal component analysis (PCA) is computed on these shapes. Eigenvectors that correspond to 97% of the variations are stored. We keep a record of the mean shape, x̄ (i.e., the mean of all the x), because this serves as the initial shape when we try to fit the ASM on a previously unseen face.
(4) During the training phase, we need to "profile" the individual keypoints in order to generate statistical models of the gray level intensity regions around each landmark. This will aid the ASM fitting on a

(a) (b)

Figure 12: Profile construction during training is shown, to avoid clutter, for a few points along the boundary of the ASM model and one point each on the nose and the lips. We do the profiling for the entire set of 79 ASM points on the face during training. (a) 1D profiles constructed using lines normal to the shape boundary at a keypoint; (b) 2D profiles constructed using square regions around each keypoint.

previously unseen face. Profiling is used to build a are made at the finest level. The best location for a
subspace of variations of these intensities across all landmark is determined by constructing profiles of
keypoints, in all the training images. A 1D profile of neighboring patches around candidate points [30].
a keypoint is constructed by sampling the grey level The candidate point that bears a profile most similar
intensities of points that lie alonglines normal to the to the mean profile for that keypoint (the latter being
shape boundary at that point (each line comprises calculated during the training phase, as mentioned
17 pixels)—see Figure 12. The normalized gradient before) will be chosen as the new location for the
of gray level intensities is stored as a vector. The landmark. The similarity measure used in this case is
mean of such vectors for each keypoint across all the Mahalanobis distance (D). If the candidate profile
training images is called the mean profile vector is given by g and the corresponding mean profile is
(g) and the covariance matrix of all such vectors given by g then, D is given by
is denoted by Sg . The mean profile vector and
covariancematrix are computed for each keypoint  T  
at four differentlevels in an image pyramid (with D = g − g Sg −1 g − g . (11)
each image in the pyramidhalf the size of the image This process is repeated until the best location for each
at the previous level). Similarly, 2D profiles can be keypoint is located. Let this new shape for the given face be
constructed for each keypoint by sampling the image represented by the vector xL . This can be given by the relation
gradient in a square region around each landmark.
The pixels used for both 1D and 2D profiles are xL = x + Pb, (12)
shown for a few keypoints in Figure 12.
where x is the mean shape estimated during the training
Next, we consider what happens when we need to fit an phase, P is the eigenvector matrix from the ASM shapes
ASM on a new face, that is, one that was not present during determined during the training phase, and b is a vector
the training phase. of projection coefficients that needs to be calculated. This
stage is necessary to make sure that the obtained shape is
(1) The mean face x is scaled, rotated, and translated a legal shape modeled by the PCA face subspace [30]. This
to best fit the output of the face detection stage. As is done by iteratively minimizing the mean squared error
mentioned previously, this mean shape model serves 2
xL − T(x + Pb) during the testing phase. T is a similarity
as the initial shape model. It has to be deformed
transform that minimizes the distance between xL and the
to obtain the final shape model associated with the
shape given by x + Pb. Details about the determination of b
input face.
and T can be found in [31]. The keypoints are shifted at a
(2) Profiling is performed in the same manner as during particular pyramid level till no significant change in position
the training phase. Multilevel profiles are constructed is observed between two successive iterations (indicated by
around each keypoint from a coarse to a fine level (see a lower value of the distance measure). Following this, the
Figure 13). Large adjustments to keypoint locations landmarks are scaled and used as the initial positions for the
are made at the coarsest level and smaller adjustments next level of the pyramid.
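To make the landmark update of (11) and the shape constraint of (12) concrete, the following Python sketch outlines both steps. It is illustrative only and is not the authors' implementation: the way candidate profiles are generated, the omission of the similarity transform T, and the ±3 standard deviation clamp on the shape coefficients are assumptions of the example.

import numpy as np

def best_candidate(candidate_profiles, g_mean, S_g_inv):
    # Choose the candidate whose profile minimizes the Mahalanobis
    # distance to the mean profile for this keypoint, cf. (11).
    dists = [float((g - g_mean).T @ S_g_inv @ (g - g_mean))
             for g in candidate_profiles]
    return int(np.argmin(dists))

def constrain_shape(x_L, x_mean, P, eigvals, n_std=3.0):
    # Project the suggested shape onto the PCA shape subspace and clamp
    # the coefficients so the result stays a "legal" shape, cf. (12).
    # (Alignment by the similarity transform T is omitted for brevity.)
    b = P.T @ (x_L - x_mean)
    limit = n_std * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)
    return x_mean + P @ b

In a fitting pass of the kind described above, best_candidate would be applied to every landmark at each pyramid level, and constrain_shape would then be applied to the updated shape before moving to the next finer level.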

(a)

(b)

(c)

Figure 13: The mean shape fit on the output of the face detector. (b) shows the different iterations over the face at different resolutions until
convergence is achieved at the highest level. This involves profiling, as well as PCA reconstruction-based (see (12)) restriction of the shape
to a legal shape [30]. (c) shows the final ASM fit on the detected face.

The process is continued until there is convergence at the highest (finest) level of the pyramid. At this stage, we get the final keypoint locations. The process is illustrated in Figure 12. Figure 13 shows the ASM shape corresponding to two users of the system.
Once the ASM model has been fit for the new face, the eye position can be read off from the keypoints around the eye (see Figure 11). More precisely, we calculate the center of the right eye as represented by the ASM keypoints by determining the average location. The camera is instructed to keep this point as the center of the frame and then zoom into the frame. This ensures that the eye is at the center of the frame when the magnified image is captured.

Figure 14: Active shape model fitting for two users of the system. The position of the eye can be read off from the keypoint locations around the eye (see Figure 13). The center of the eye is taken as the center of mass of these keypoints.

3.3. Iris Recognition. In order to evaluate the usefulness of the acquired iris images for open set identification, we use a variation of Daugman's approach to iris matching, as described by Kerekes et al. [35]. We acquire iris images from 12 subjects at distances varying from 0.5 m to 1.5 m from the front of the lens. The iris patterns in these eye images are segmented before running the iris verification algorithm. Given a person's iris, the iris system segments and encodes an iris template (iris code), which is then matched to a set of iris code templates in a stored database. Open set identification is performed in our experiments. First, the user is associated with a class in our database to which he/she is most similar (as indicated by the similarity in iris codes); we then verify whether the person actually belongs to that class. A receiver operating characteristic (ROC) curve is plotted to visualize how robust this verification is in our proposed system (Figure 21).

3.3.1. Segmentation. Any iris identification algorithm must be preceded by a stage that segments the iris pattern from the acquired eye image. This stage consists of clearly delineating the pupillary boundary and the limbic boundary. Once these boundaries are identified, the iris pattern can be isolated by cropping out the region between these boundaries.
In order to correctly identify both these boundaries, the segmentation algorithm used in this work has the following stages, as described in [36].

(1) Specular Reflection Removal. Often the main hurdle to accurate iris segmentation is the presence of specularities within the iris region. These specularities are due to the reflection of the illumination source. This stage of segmentation aims to remove these specularities using pixel information from areas bordering them. A specularity is identified by a simple intensity thresholding operation, owing to its high intensity value compared to the rest of the eye image. For our system, an empirical value for the threshold was fixed at 90% of the maximum intensity. Geometrical constraints are also imposed at this stage so that only high intensities closer to the center of the image are detected. This is shown in Figure 15(b). These regions are filled with pixel information from the neighboring pixels, so they look like an iris region. Due to the off-angle alignment of the illumination source in our system setup (Section 2.3), a valid assumption here is that the specularity is usually seen within the iris region. The whole image is smoothed using a median filter to remove any noise. A median filter is better suited for this application, compared to an averaging filter, since the specular reflection that we have to remove now takes on a salt and pepper appearance (Figure 15). Another advantage of using a median filter instead of a global averaging filter is that the former preserves the gradient at the limbic and pupillary boundary better than the latter. This is used in the following subsection. Figures 15(c) and 15(d) depict this process.

(2) Determination of Pupillary and Limbic Boundaries. Because the pupil is the darkest region in the resulting image (Figure 15(d)), it can be localized by intensity thresholding. However, thresholding alone can result in several candidate regions like eyebrows and eyelashes. To eliminate these false candidate regions we analyze the geometric properties of the connected component as suggested in [36]. A possible geometric property that we have explored is an approximate height and width of the pupil region. These are empirically set to constant values. Once this is done, we calculate the center of mass of the detected region, to identify the center of the pupil. (The reader should note here that this is not the same as the center of eye determined in Section 2.4 because the latter is a rough estimate provided to the pan-tilt mechanism for it to zoom into the eye region of the frame.) The pupillary radius is simply the distance from this center to the furthest point in the detected region.
Similarly, the limbic boundary can be determined from Figure 15(d), using the determined pupil center as a starting point. This is done by drawing radiating lines from the pupil center. The sclera appears much brighter than the iris region and, hence, the limbic boundary is the point along these lines where the gradient is maximized. We can locate the iris center by finding the intersection of perpendicular bisectors of lines joining all pairs of these points [36]. Once the center is determined, the radius is calculated as the average distance from the center to all these points.
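The two segmentation stages above can be prototyped compactly. The Python/OpenCV sketch below is only illustrative: the 90% specularity threshold follows the text, whereas the pupil intensity threshold, the median kernel size, the blob size limits, and the use of inpainting for hole filling are assumed choices (the constraint that specularities lie near the image centre is omitted).

import numpy as np
import cv2

def remove_specularities(eye, kernel=7):
    # Specular highlights: threshold at 90% of the maximum intensity,
    # fill them from the neighbourhood, then apply a median filter.
    mask = (eye >= 0.9 * eye.max()).astype(np.uint8)
    filled = cv2.inpaint(eye, mask, 5, cv2.INPAINT_TELEA)
    return cv2.medianBlur(filled, kernel)

def locate_pupil(smoothed, thresh=40, min_area=200, max_area=20000):
    # The pupil is the darkest blob; keep connected components of a
    # plausible size and return the centre of mass and a radius estimate.
    binary = (smoothed < thresh).astype(np.uint8)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    for i in range(1, n):
        if min_area <= stats[i, cv2.CC_STAT_AREA] <= max_area:
            ys, xs = np.where(labels == i)
            cx, cy = centroids[i]
            radius = float(np.max(np.hypot(xs - cx, ys - cy)))
            return (cx, cy), radius
    return None, None

The limbic boundary search along radiating lines would then start from the pupil centre returned by locate_pupil, as described in the text.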


Figure 15: Here, we illustrate the steps to localize the limbic and pupillary boundaries. (a) shows the input eye image to be segmented.
(b) shows the localization of the specularity present in the image. Intensity thresholding as well as geometrical constraints are used for this
purpose. (c) shows the specularity filled with pixel values from the neighbourhood and finally (d) is the result of applying a median filter to
(c). Intensity thresholding and geometric constraints help to localize the pupil in (d) following which we identify the pupillary and limbic
boundaries as described in Section 3.3.1(2).

Figure 16 shows the segmentation results on a few eye images acquired using our PTZ system.
At the end of this subsection, the reader should note that the median filtered result as depicted in Figure 15(d) is only for the segmentation stage. As mentioned before, the use of the median filter preserves most of the gradient information with regards to the pupillary and limbic boundaries versus using an averaging filter. Once we have determined the centers of the pupil and the iris, as well as the corresponding radii, this information is used to segment the iris from the original image, resulting in the final segmented iris pattern.

3.3.2. Recognition Based on Iris Pattern. Once the boundaries of the pupil and the iris pattern are determined, the next step is to use the segmented iris pattern for identification of the user. We fit a polar coordinate system to the iris region and the image obtained by unwrapping this polar coordinate system is referred to as the iris plane [35, 37].
The feature extraction method used is described in [37]. An iris code is generated by projecting the iris plane onto a set of complex valued 2D Gabor functions. Each Gabor function Gi(x) is given by

Gi(x) = G0(Rθ x, σxi, σyi, fi).   (13)

The unrotated Gabor function G0(x) is given by

G0(x, σx, σy, f) = exp( −(1/2)( x²/σx² + y²/σy² ) + j 2π f0 t ).   (14)

σxi, σyi, fi, and θi are the parameters for the ith Gabor function and Rθ is a 2D rotation matrix of angle θ which is given by

Rθ = [  cos θ   sin θ
       −sin θ   cos θ ].   (15)

The rotated values of coordinates x and y may be obtained by multiplying the vector x = [x y] by Rθ. Examples of Gabor functions with different parameters are shown in Figure 17.
The iris bit code c(x) is then generated for each iris image f, as follows:

c(x) = [ sgn(Re(f ∗ G1(x))), sgn(Im(f ∗ G1(x))), sgn(Re(f ∗ G2(x))), sgn(Im(f ∗ G2(x))), ..., sgn(Re(f ∗ GN(x))), sgn(Im(f ∗ GN(x))) ]^T.   (16)
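The encoding of (13)–(16) amounts to filtering the unwrapped iris plane with a bank of oriented complex Gabor kernels and keeping only the signs of the real and imaginary responses. The Python sketch below is a simplified illustration, not the system's actual code; the kernel size, the single carrier frequency per filter, and the sampling of every pixel are assumptions.

import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, sigma_x, sigma_y, freq, theta):
    # Complex 2D Gabor: a rotated Gaussian envelope times a complex carrier.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (xr**2 / sigma_x**2 + yr**2 / sigma_y**2))
    carrier = np.exp(1j * 2.0 * np.pi * freq * xr)
    return envelope * carrier

def iris_bit_code(iris_plane, params):
    # Two bits per pixel and per filter: the signs of the real and
    # imaginary parts of the filter response, cf. (16).
    bits = []
    for (sx, sy, f, th) in params:
        resp = fftconvolve(iris_plane, gabor_kernel(17, sx, sy, f, th), mode="same")
        bits.append(np.real(resp) >= 0)
        bits.append(np.imag(resp) >= 0)
    return np.stack(bits)   # boolean array of shape (2 * len(params), H, W)

Here params would hold one (σx, σy, f, θ) tuple per Gabor function Gi in the filter bank.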


Figure 16: Results of applying the segmentation described in Section 3.3.1 on eye images acquired using our setup. The pupillary boundary
is depicted by the red circle while the limbic boundary is depicted by the green circle. The iris pattern is isolated by simply cropping out the
region between these two boundaries.
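Before the Gabor encoding, the annular iris region delimited by these two boundaries is unwrapped into the rectangular iris plane by sampling along radial lines between the pupillary and limbic circles. A minimal sketch of such a pseudo-polar unwrapping is given below; the output resolution and the assumption of concentric circular boundaries are simplifications made for the example.

import numpy as np

def unwrap_iris(image, center, r_pupil, r_iris, radial=64, angular=360):
    # Sample the annulus between the pupillary and limbic boundaries on a
    # fixed (radius, angle) grid to obtain the rectangular iris plane.
    cx, cy = center
    thetas = np.linspace(0.0, 2.0 * np.pi, angular, endpoint=False)
    radii = np.linspace(r_pupil, r_iris, radial)
    plane = np.zeros((radial, angular), dtype=image.dtype)
    for i, r in enumerate(radii):
        xs = np.clip((cx + r * np.cos(thetas)).astype(int), 0, image.shape[1] - 1)
        ys = np.clip((cy + r * np.sin(thetas)).astype(int), 0, image.shape[0] - 1)
        plane[i, :] = image[ys, xs]
    return plane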

The bit code uses two bits per pixel of the Gabor response, one bit each for the real part of the response and the imaginary part of the response at each position. A match score between any template iris image code ct(x) and a query iris image code cq(x) is computed by simply counting the number of matching bits at corresponding locations. The template image is the iris pattern against which the verification is performed and the query image is that pattern which has to be verified as belonging to the class represented by the template. Hence, the match score m(d) at a relative shift d between template and query is given by

m(d) = (1 / |St|) Σ_{y ∈ St} ct(y)^T cq(y − d),   (17)

where St is the support of the template iris code and hence |St| is the size of the template. The white pixels in Figure 18(b) indicate the number of pixels that were considered a match when the iris plane in Figure 18(a) was matched with a subject from the same class.
A problem to be considered during iris matching of query irises from the same test subject is that the iris plane may undergo local nonlinear deformations due to the contraction and dilation of the pupil during acquisition, as a result of varying lighting conditions. Even though we normalize the radius of the iris to a fixed length, this linear normalization does not fix the nonlinear deformations as the iris sphincter muscles move angularly as well as radially. This nonlinear deformation is represented as a coarse vector field as in [35].
The template iris code is divided into K regions. When a query image from the same class is compared to this template image, it is assumed that each of these K regions within the template will be closely matched to a neighboring region in the query image. In order to determine the possible alignments of the regions, match scores are computed for shifts of 10 pixels in both the vertical and horizontal direction at each region. So, for each region i a 21 × 21 match score matrix mi(x) is computed. There will be K such match score matrices.
Additionally, for each pixel x of the iris plane, an occlusion metric π(x) is computed that measures the likelihood that the pixel belongs to the eyelid rather than the iris pattern. This metric is computed from four local statistics as in [35], namely (1) the mean intensity values within a neighborhood around the pixel, (2) the standard deviation of these intensity values, (3) the percentage of pixels whose intensity is greater than one standard deviation above the mean for the centers of the upper and lower eyelids, and (4) shortest Euclidean distance to the centers of the upper and lower eyelid. A Fisher linear discriminant [38] is then used to generate a single scalar quantity π(x) from the set of four statistics at each pixel. Finally, an overall occlusion metric πi is computed for each region i as the mean of all π(x) in that region.
If a certain segment in the image has a particular deformation or occlusion state, neighboring segments are more likely to have similar states rather than different ones. As in [35], we use a graphical model to represent this relationship between regions within the iris code.

Figure 17: Examples of Gabor functions that are used in the feature extraction stage.

Figure 18: (a) shows an example of an iris plane that is obtained by unwrapping the iris region. The area occupied by the sclera is the U-shaped structure along the top of the rectangular image and at the top corners of the image. (b) shows the number of pixels in this image that were matched with a query iris image belonging to the same subject.
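The regional matching described above, namely the fractional bit-agreement score of (17) evaluated over a window of relative shifts for each of the K regions, can be prototyped along the following lines. This Python sketch is illustrative only: it treats the codes as boolean arrays, uses circular shifts for brevity, ignores the occlusion weighting, and assumes a ±10 pixel shift range, which gives the 21 × 21 score matrix per region mentioned in the text.

import numpy as np

def match_score(ct, cq, support):
    # Fraction of matching bits over the template support, cf. (17).
    return np.count_nonzero((ct == cq) & support) / np.count_nonzero(support)

def region_score_matrix(ct_region, cq_region, support, max_shift=10):
    # Evaluate the match score for every vertical/horizontal shift in
    # [-max_shift, +max_shift]; the result is a (2*max_shift+1)^2 matrix.
    size = 2 * max_shift + 1
    scores = np.zeros((size, size))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(cq_region, dy, axis=-2), dx, axis=-1)
            scores[dy + max_shift, dx + max_shift] = match_score(ct_region, shifted, support)
    return scores

In the full system, the K regional score matrices are then combined using the occlusion- and deformation-aware weights of the graphical model, as developed in the text that follows.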

Based on this model, conditional likelihoods P(di | O1, O2, ..., OK) and P(ωi | O1, O2, ..., OK) are determined. (For a detailed derivation of these likelihoods, the reader is directed to [35].) Here, di contains the vertical and horizontal shifts of the template region i with respect to the query region i, and ωi is a binary-valued occlusion variable which is either 0 or 1. The former indicates that the corresponding region is not occluded, and the latter indicates otherwise. Oi is the estimated value of mi(x) and πi for the region i, where i = 1, 2, ..., K. Once we know these conditional likelihoods, the score for each subregion i is given by

Mi = Σ_d mi(d) P(di = d | O1, O2, ..., OK),   (18)

where mi(d) corresponds to the pixel shift represented in d. The final score M from all the K subregions is computed as follows:

M = ( Σ_{i=1}^{K} βi Mi ) / ( Σ_{i=1}^{K} βi ),   (19)

where the weights βi correspond to the belief of nonocclusion in the region i, that is, βi = P(ωi = 0 | O1, O2, ..., OK). Receiver Operating Curves (ROCs) generated using such match scores are shown in the next section.

Figure 19: Face detection, eye detection, and iris acquisition shown for two users of the system. (a) shows the face detector. Once the face is determined to be stationary, the ASM is fitted and this is depicted in (b). (c) shows the position of the right eye read off from the ASM and finally (d) is the image of the eye and iris pattern acquired after the lens zooms into the location indicated in (c).

4. Results

This section shows the results of the face detector and the ASM-based eye detector, along with the captured image of the eye/iris. The iris is obtained from this acquired image and is used as the query image for a database of irises that have been collected. The database on which the system was tested has 12 people, with 7 images from each. The images were taken at distances ranging from 0.6 m to 1.5 m with the people in different postures (standing versus sitting) in front of the camera. The average acquisition time for the iris, once the face was detected, is measured to be 5 seconds.

4.1. Face and Eye Detection. The camera turns towards the face of the user wherever he/she may be in the frame, due to the face detection module of our system. In the case of multiple faces, it picks the face that is more dominant in the frame. Once the user limits their motion for about 5 frames, the camera zooms into the frame such that the face occupies roughly 200 pixels across the frame. (This is because the active-shape model fitting requires that the face occupy a significant portion of the frame.) Following this, the ASM model is fit on the face in the frame, and the center of the eye is determined, as described in Section 2. The acquisition process is depicted in Figure 19. The ASM fitting is not always robust along the boundary of the face. This is seen in the figure on the right in Figure 19. This is not the case however with the ASM fitting around the eye, as can be seen in both cases in Figure 19; the key points are always localized properly.
Various eye images acquired using this method are shown in Figure 20, along with their segmentation results, and discussed in Section 3.3.1.

4.2. Open Set Identification. As described in Section 3.3.2, the metric used to measure the similarity between two iris bit codes is the number of matching bits at corresponding locations. It was shown how the match score M is computed from a given template iris code ct(x) and query iris code cq(x). (See (17), (18), and (19).) Clearly, 0 ≤ M ≤ 1, with M = 1 for a perfect match.
In this work, we perform open set identification. In other words, the user is first associated with a class with which the user's iris code matches best. Following this, a verification experiment is performed to verify whether the user actually belongs to this class or if the user is an imposter. To perform the verification experiments, we first enrolled the 12 users manually, using the same PTZ camera used in the verification stage. Once this was done, the camera was set up to perform iris acquisition as and when users approached the system (as described in Section 2.4). All the irises acquired, along with class information, were recorded before starting the verification experiment.
A user is said to be verified as belonging to a particular class only when M is below a set threshold. In real-world applications, there can be both false positives and true positives, depending on the value of this threshold. A Receiver Operating Curve (ROC) can be plotted by varying the value of this threshold and noting how many false positives and true positives occur across the database.

Figure 20: Eye images acquired from various users who were at different positions within the capture volume are shown in the first row. The
second row shows the segmentation results of these images. The segmentation is discussed in Section 3.3.1.

Table 1: Useful values for visualization of the result depicted in Figure 21. The first column indicates the number of true positives for increasing values of thresholds in the iris similarity metric. The second column indicates the number of false positives for the same.

Number of true accepts    Number of false accepts
79                        1
80                        5
81                        10
83                        15
84                        22

Figure 21: Receiver operating curve for the database captured with the Axis 233D Network PTZ camera. The verification rate obtained at a false accept rate of 1% (1 image out of 84) is 95% (79 images out of 84). The number of classes used in this experiment is 12, with 7 images per class.

The ROC in Figure 21 shows the verification rate (VR) versus the false accept rate (FAR). The vertical axis of this plot represents the true accept rate (i.e., the rate of correctly verifying a person), and the horizontal axis represents the false accept rate (i.e., the rate of erroneously verifying a person as being someone else). We see that we get a VR of 95% (79 out of 84 images) when the FAR is 1% (1 image out of 84). In other words, 95% of the people in the database were correctly identified, while only 1% were falsely identified. Table 1 helps to visualize this graph in terms of the actual numbers of irises that were correctly identified versus those that were incorrectly identified. Each row in this table is a point on the graph in Figure 21, that is, for a particular value of threshold. The threshold value increases from row 1 to 5 in the table. Table 2 compares the relevant features of our system with other state of the art systems mentioned in the introduction.
We see from Table 2 that the proposed system achieves a greater capture volume than other systems that were mentioned in the introduction. We do this using the pan-tilt functionality of the system. Additionally, there is no upper limit for the capture volumes; this depends entirely on the iris pixel resolution required by the iris verification algorithm. We required a minimum resolution of 100 pixels using [35], and, hence, the limit for our experiments is 2.1 meters, as depicted in Figure 4. At this maximum stand-off, the illumination provided by a standard 100 W bulb suffices, in order to acquire the required iris images. This decreases the illumination requirement of our system as compared to the rest of the systems shown. An additional point to note is that all the other systems have a large form factor requiring more than two cameras and additional hardware. This is another area where our system improves on other systems because it uses only a standard security COTS PTZ camera along with COTS filters and lenses. (See Figure 2(b).) This significantly reduces the cost of the final system. Our system is also plug and play, requiring no calibration before use. For all other systems, a multicamera approach is advocated. The purpose of this is to use stereo vision to estimate depth. This depth information is then used to zoom (if required) to the iris and to perform autofocus.

5. Conclusion

This paper describes an unconstrained medium stand-off range iris acquisition and recognition system built using a commercial off-the-shelf (COTS) PTZ camera, which can be used for access to high-security areas within buildings. Other possible scenarios for use of this system include border patrol and immigration. The device may also be used as a desktop security system for workstations. The novelty in this approach over other published literature is

Table 2: Comparison of relevant features of the proposed system with other state of the art systems. For each feature, the entries are given in the order: Sarnoff Iris On the Move [11] / Retica Eagle Eyes [16] / Wheeler et al. [14] / Proposed System.

Time to iris acquisition (seconds): 2 / 6.1 / 3.2 / 5.
Height × Width of capture volume (m): 0.2 × 0.4 / 3 × 2 / 0.076 × 0.076 for the narrow field of view camera (the details are not mentioned for the wide field of view camera) / Not constrained, due to pan/tilt capability.
Depth of capture volume (m): 0.1 / 3 / <1.5 / 1.6 (limited by iris resolution).
Minimum stand-off distance (meters): 3 / 3 / Not mentioned in paper / 0.5.
Number of cameras used: 4 / 3 / 3 / 1.
Additional requirements: Portal with multiple near-IR illuminators / Laser illuminator, range-finder / Near-IR illumination panel / Standard 100 W table lamp if ambient IR illumination (from sunlight) is not sufficient.
Initial calibration required?: No / Yes, in order to determine camera heights and camera focus position / Yes, for the wide field of view cameras, in order to estimate depth from stereo / No.

the use of a single COTS camera (a monocular system) with since the face and periocular regions are being captured
relevant algorithms that can handle both face capture and during the process of acquiring the iris.
iris capture and can provide the proposed large dynamic
capture volume for unconstrained acquisition at about 1.5 m
stand-off. Almost all state-of-the-art research in reduced References
constrained iris acquisition use two cameras for this purpose
[1] J. Daugman, “Probing the uniqueness and randomness of
[14–17]). The long-range Iris On-the-Move (IOM) system
iriscodes: results from 200 billion iris pair comparisons,”
uses only one camera for both face and iris capture. However,
Proceedings of the IEEE, vol. 94, no. 11, pp. 1927–1934, 2006.
the drawback in this system is the limited capture volume
[2] J. Daugman, “Biometric personal identification system based
(5 cm−10 cm [11]) and associated large form factor and cost
on iris analysis,” US patent no. 5, 291, 560, March 1994.
as compared to our system, which is of the form of a simple
[3] P. J. Phillips, K. W. Bowyer, P. J. Flynn, X. Liu, and W. T.
PTZ surveillance camera.
Scruggs, “The iris challenge evaluation 2005,” in Proceedings of
Our work has blended many state-of-the-art techniques the IEEE 2nd International Conference on Biometrics: Theory,
for face detection and facial shape modeling for the purpose Applications and Systems (BTAS ’08), pp. 1–8, October 2008.
of accurate eye detection and iris pattern feature extraction. [4] P. Phillips, W. Scruggs, A. Toole et al., “FRVT 2006 and
Face detection helps to localize the position of the user ICE 2006 large-scale results,” Tech. Rep. NISTIR 7408, 2006,
in the frame and to track the user’s movements before http://iris.nist.gov/ice/ice2006.htm.
applying the ASM for facial landmark localization (useful [5] R. P. Wildes, “Iris recognition: an emerging biometrie tech-
for pose estimation). The tracking output controls the pan- nology,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1348–1363,
tilt mechanism of the camera. Once the user is detected, the 1997.
camera automatically zooms in to capture the face and fit the [6] Y. Du, R. Ives, B. Bonney, and D. Etter, “Analysis of partial iris
ASM model (which works optimally when the user restricts recognition,” in Biometric Technology for Human Identification
motion). The ASM model helps to localize the position of the II, vol. 5779 of Proceedings of SPIE, pp. 31–40, 2005.
eye in the frame. The lens then zooms into its telephoto end, [7] Y. Du, B. Bonney, R. Ives, D. Etter, and R. Schnltz, “Analysis of
keeping this eye position at the center of the frame. partial iris recognition using a 1-D approach,” in Proceedings
Initial experimental setup of 12 test subjects with 7 eye of the IEEE International Conference on Acoustics, Speech, and
images per subject, show that we are able to consistently Signal Processing (ICASSP ’05), vol. 2, pp. II961–II964, March
acquire eye images with 200 pixels across the iris on an 2005.
average. This is a significant achievement because most [8] American National Standards Institute (ANSI), Z136-1-1993.
standard algorithms require greater than 100 pixels diameter [9] American Conference of Government Industrial Hygienists
for useful feature extraction and matching. In addition to the (ACGIH)‘Threshold Limits Values’ 1994.
use of only the iris information that is captured, the system [10] LG IrisAcessTM 4000, http://www.lgiris.com/ps/products/
can be modified for multibiometric recognition applications irisaccess4000.htm.

[11] J. R. Matey, O. Naroditsky, K. Hanna et al., “Iris on the move: [33] “Multiple Biometric Grand Challenge—details,” http://face
acquisition of images for iris recognition in less constrained .nist.gov/mbgc/.
environments,” Proceedings of the IEEE, vol. 94, no. 11, pp. [34] J. C. Gower, “Generalized procrustes analysis,” Psychometrika,
1936–1946, 2006. vol. 40, no. 1, pp. 33–51, 1975.
[12] OKI Irispass, http://www.oki.com/en/iris/. [35] R. Kerekes, B. Narayanaswamy, J. Thornton, M. Savvides, and
[13] U. Cahn von Seelen, T. Camus, P. Venetianer, G. Zhang, B. V. K. Vijaya Kumar, “Graphical model approach to iris
M. Salganicoff, and M. Negin, “Active vision as an enabling matching under deformation and occlusion,” Proceedings of
technology for user-friendly iris identification,” in Proceedings the IEEE Computer Society Conference on Computer Vision and
of 2nd IEEE Workshop on Automatic Identification Advanced Pattern Recognition, 2007.
Technologies, pp. 169–172, 1999. [36] Y. Li and M. Savvides, A robust approach to specularity removal
[14] F. W. Wheeler, A. G. A. Perera, G. Abramovich, B. Yu, and P. and iris segmentation, Ph.D. thesis, Department of Electrical
H. Tu, “Stand-off iris recognition system,” in Proceedings of and Computer Engineering, Carnegie Mellon University,
the IEEE 2nd International Conference on Biometrics: Theory, Pittsburgh, Pa, USA, 2009.
Applications and Systems (BTAS ’08), October 2008.
[37] J. Daugman, “High confidence recognition of persons by
[15] G. Guo, M. Jones, and P. Beardsley, “A system for auto- iris patterns,” in Proceedings of the 35th Annual International
matic iris capturing,” Tech. Rep. TR2005-044, Mitsubishi Carnahan Conference on Security Technology, pp. 254–263,
Electric Research Laboratories, 2005, http://www.merl.com/ October 2001.
publications/TR2005-044/.
[38] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification,
[16] F. Bashir, P. Casaverde, D. Usher, and M. Friedman, “Eagle-
John Wiley & Sons, New York, NY, USA, 2001.
eyesTM : a system for iris recognition at a distance,” in Proceed-
ings of the IEEE International Conference on Technologies for
Homeland Security (HST ’08), pp. 426–431, May 2008.
[17] S. Yoon, H. Jung, K. Park, and J. Kim, “Non-intrusive iris
image acquisition system based on a pan-tilt-zoom camera
and light stripe projection,” in Optical Engineering, vol. 48, pp.
037202–037202-15, 2009.
[18] http://www.aoptix.com/biometrics.html.
[19] Axis 233D Datasheet, http://www.axis.com/products/cam
233d/index.htm.
[20] C. Boyce, A. Ross, M. Monaco, L. Hornak, and X. Li, “Multi-
spectral iris analysis: a preliminary study,” in Proceedings
of Computer Vision and Pattern Recognition on Biometrics
Workshop, 2006.
[21] Application Programming Interface, http://www.axis.com/.
[22] ANSI INCITS 379-2004: Iris Image Interchange Format.
[23] J. Forrester, A. Dick, P. Mcmenamin, and W. Lee, The Eye: Basic
Sciences in Practice, W. B. Saunder, London, UK, 2001.
[24] S. Ray, Applied Photographic Optics, 3rd edition.
[25] http://www.bhphotovideo.com/c/product/56779 REG/Tiffen
52CUS 52mm Close up Glass Lens.html.
[26] http://www.maxmax.com/.
[27] P. Viola and M. Jones, “Rapid object detection using a
boosted cascade of simple features,” in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition, pp. I511–I518, December 2001.
[28] Y. Freund and R. Schapire, “A short introduction to boosting,”
Journal of Japanese Society for Artificial Intelligence, vol. 14, pp.
771–780, 1999.
[29] http://sourceforge.net/projects/opencvlibrary/.
[30] K. Seshadri and M. Savvides, “Robust modified active shape
model for automatic facial landmark annotation of frontal
faces,” in Proceedings of the IEEE 3rd International Conference
on Biometrics: Theory, Applications and Systems (BTAS ’09),
2009.
[31] T. F. Cootes and C. J. Taylor, “Statistical models of appear-
ance for computer vision,” Tech. Rep., Imaging Science and
Biomedical Engineering, University of Manchester, Manch-
ester, UK, 2004.
[32] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active
shape models—their training and application,” Computer
Vision and Image Understanding, vol. 61, no. 1, pp. 38–59,
1995.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 647597, 12 pages
doi:10.1155/2010/647597

Research Article
Fusion of PCA-Based and LDA-Based Similarity Measures for
Face Verification

Mohammad T. Sadeghi,1 Masoumeh Samiei,1 and Josef Kittler2


1 Signal Processing Research Group, Department of Electrical and Computer Engineering, Yazd University,
P.O. Box 89195-741, Yazd, Iran
2 Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, Surrey, GU2 7XH, UK

Correspondence should be addressed to Mohammad T. Sadeghi, [email protected]

Received 1 December 2009; Accepted 19 July 2010

Academic Editor: Yingzi Du

Copyright © 2010 Mohammad T. Sadeghi et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

The problem of fusing similarity measure-based classifiers is considered in the context of face verification. The performance of
face verification systems using different similarity measures in two well-known appearance-based representation spaces, namely
Principle Component Analysis (PCA) and Linear Discriminant Analysis (LDA) is experimentally studied. The study is performed
for both manually and automatically registered face images. The experimental results confirm that our optimised Gradient
Direction (GD) metric within the LDA feature space outperforms the other adopted metrics. Different methods of selection and
fusion of the similarity measure-based classifiers are then examined. The experimental results demonstrate that the combined
classifiers outperform any individual verification algorithm. In our studies, the Support Vector Machines (SVMs) and Weighted
Averaging of similarity measures appear to be the best fusion rules. Another interesting achievement of the work is that although
features derived from the LDA approach lead to better results than those of the PCA algorithm for all the adopted scoring functions,
fusing the PCA- and LDA-based scores improves the performance of the system.

1. Introduction Different similarity measures have been adopted in


different machine vision applications. In [1], a number
In spite of the rapid advances in machine learning, in many of commonly used similarity measures including the City-
pattern recognition problems, the decision making is based block, Euclidean, Normalised Correlation (NC), Chi-square
on simple concepts such as distance from or similarity to (χ 2 ), and Chebyshev distance have been considered in an
some reference patterns. This type of approach is particularly image retrieval system. The reported experimental results
relevant when the number of training samples available to demonstrate that the City-block and Chi-square metrics
model a class of objects is very limited. Examples of such are more efficient in terms of both retrieval accuracy and
situations include content-based retrieval from image or retrieval efficiency. In a similar comparative study, it has been
video databases, where the query image is the only sample shown that the Chi-square statistics measure outperforms
at our disposal to define the object model, or biometrics the other similarity measures for remote sensing image
where only one or a few biometric traits can be acquired retrieval [2]. In another study, the effect of 14 scoring
during subject enrolment to create a reference template. functions such as the City-block, Euclidean, NC, Canberra,
In biometric identity verification, a similarity function Chebyshev, and Distance based Correlation Coefficients
measures the degree of similarity of an unknown pattern has been studied in the context of the face recognition
to the claimed identity template. If the degree exceeds a problem [3] in the PCA space. It has been shown that a
pre-specified threshold, the unknown pattern is accepted simplified form of Mahalanobis distance outperforms the
to be the same as the claimed identity. Otherwise, it is other metrics. In [4], four classical distance measures, City-
rejected. block, Euclidean, Normalised Correlation, and Mahalanobis

distance have been compared in the PCA space. It has decision-level, that is, combining similarity scores output by
been shown that when the number of eigenvectors is individual classifiers. Thus, the scores are treated as features,
relatively high, the Mahalanobis distance outperforms the and a second-level classifier is constructed to fuse these
other measures. Otherwise, a similar performance is achieved scores.
using different measures. It has been also propounded that Fusion rules can be divided into two main categories:
no significant improvement is achieved by combining the fixed rules such as the sum, product, minimum, maximum,
distance measures. and median rule [11–13] and trained rules like the weighted
A similarity score is computed in a suitable feature space. averaging of classifiers outputs [14, 15], Support Vector
Commonly, similarity would be quantised in terms of a Machines (SVM) [10], bagging, and boosting [16]. Overall,
distance function, on the grounds that similar patterns will the fixed rules are most often used because of their simplicity
lie physically close to each other. Thus, the smaller the and the fact that they do not require any training. Accord-
distance, the greater the similarity of two entities. The role ingly, equal weights are used for all the classifiers [11, 17].
of the feature space in similarity measurement is multifold. However, in many studies it has been demonstrated
First of all, the feature space is selected so as to maximise that trained classifiers such as Support Vector Machines
the discriminatory information content of the data projected (SVMs) have the potential to outperform the simple fusion
into the feature space and to remove any redundancy. rules, especially when enough training data is available. In
However, additional benefits sought after from mapping the [18], AdaBoost has been adopted for combining unimodal
original pattern data into a feature space is to simplify the features extracted from face and speech signals of individuals
similarity measure deployed for decision making. in multimodal biometrics. In [8] the fusion problem was
PCA and LDA are two classical tools widely used in the solved by selecting the best classifier or a group of classifiers
appearance-based approaches for dimensionality reduction dynamically with the help of a gating function learnt for each
and feature extraction. Many face recognition methods, such similarity measure.
as eigenfaces [5] and fisherfaces [6], are built on these In summary, it is clear that it is still pertinent to
two techniques or their variants. Different researches show ask which classifiers provide useful information and how
that in solving the pattern classification problems the LDA- the expert scores should be fused to achieve the best
based algorithms outperform the PCA-based ones, since the possible performance of the face verification system. In
former take the between classes variations into account. [19], considering a set of similarity measure-based classifiers
The LDA is a powerful feature extraction tool for pattern within the LDA feature space, a sequential search algorithm
recognition in general and for face recognition in particular. was applied in order to find an optimum subset of similarity
It was introduced to this application area by Belhumeur et measures to be fused as a basis for decision making. The SVM
al. in 1997 [6]. An important contributing factor in the classifier was used for fusing the selected classifiers.
performance of a face authentication system is the metric In this paper, a variety of fixed and trained fusion rules
used for defining a matching score. Theoretically, Euclidean are compared in the context of face authentication. Five
distance provides an optimal measure in the LDA space. In fixed fusion rules (sum, min, max, median, and product)
[7], it has been demonstrated that it is outperformed by the and two trained rules (the support vector machines and
Normalised Correlation (NC) and Gradient Direction (GD). weighted averaging of scores) are considered. It is shown that
Also, in [8], the performance of the NC scoring function a better performance is obtained by fusing the classifiers.
has been compared with the GD metric. The study has been Moreover, the adopted trained rules outperform the fixed
performed on the BANCA database [9] using internationally rule. Although, the PCA-based classifiers perform nearly 3
agreed experimental protocols by applying a geometric face times worse than the LDA-based one, an interesting finding
registration method based on manually or automatically of this paper compared to our previous work [19] is that
annotated eyes positions. It has been concluded that overall the performance of the verification system can be further
the NC function is less sensitive to missregistration error but improved by fusing the LDA- and PCA-based classifiers. In
in certain conditions GD metric performs better. However, in [20], a similar study has been performed using Euclidean
[10], it has been further demonstrated that by optimising the distance as the scoring function. In the training stage of
GD metric, this metric almost always outperforms the NC the proposed algorithm, adopting a fixed reference as the
metric for both manually and automatically registered data. central value of the decision making threshold, client specific
In this study, a variety of other metrics have been inves- weights are determined by calculating the average value of
tigated, including Euclidean, City-block, Chebyshev, Can- the Euclidean distance of all the patterns from each client
berra, Chi-square (χ 2 ), NC, GD, and Correlation coefficient- template. The client specific weights are determined in both
based distance. The experimental results in face verification LDA and PCA spaces. The weights are then used within
confirm that, individually, other metrics on the whole do the framework of three simple untrained fusion rules. In
not perform as well as the NC and GD metrics in the LDA the adopted experimental protocol, each subject images are
space. However, in different conditions, certain classifiers can divided into two parts as the training and test sets. The
deliver a better performance. experimental study performed on the ORL and Yale data sets
It is well known that a combination of many differ- demonstrate that the combined classifier outperforms the
ent classifiers can improve classification accuracy. Various individual PCA- and LDA-based classifiers [20]. Although
schemes have been proposed for combining multiple clas- the training and test images are different, since the same
sifiers. We concentrate on classifier combination at the subjects are available within the training and test sets, the

weighting process is somehow biased so that the performance of the system in the presence of new impostors (not those used for training) could be worse.
The rest of the paper is organised as follows. In the next section, the adopted scoring functions are introduced. Fusion rules are reviewed in Section 3. A description of the experimental design including the face database used in the study, the experimental protocols, and the experimental setup are given in Section 4. The experimental results using the adopted scoring functions and the fusion results are presented and discussed in Section 5. Finally a summary of the main findings and conclusions can be found in Section 6.

2. Similarity Functions

In a similarity measure-based face verification system, a matching scheme measures the similarity or distance of the test sample, x, to the template of the claimed identity, μi, projected into an appropriate feature space. The general form of a group of similarity measures which is called Minkowski Distance or power norm metrics (Lp) is defined as

sM = [ Σ_{j=1}^{m} |μij − xj|^p ]^{1/p},   (1)

where m is the dimensionality and j indexes the components of the two vectors.
The most commonly used similarity measures, Manhattan or City-block metric, Euclidean Distance (ED), and Chebyshev Distance are special cases of the Minkowski metric for p = 1, p = 2, and p → ∞, respectively, that is, L1, L2, and L∞ metrics:

sCity = Σ_{j=1}^{m} |μij − xj|,   (2)

sED = sqrt( (x − μi)^T (x − μi) ),   (3)

sCheby = max_j |μij − xj|.   (4)

The Canberra Distance is also given by

sCanb = Σ_{j=1}^{m} |μij − xj| / (|μij| + |xj|).   (5)

This can be considered as the normalised Manhattan Distance. The Chi-squared (χ2) Distance is defined by

sχ2 = Σ_{j=1}^{m} (μij − xj)² / (|μij| + |xj|),   (6)

which is basically a relative Euclidean squared distance and is usually meant for nonnegative variables only.
In [7], it has been demonstrated that a matching score based on the Normalised Correlation (NC) scoring function, defined by the following equation, is more efficient:

sN = |x^T μi| / sqrt( (x^T x)(μi^T μi) ).   (7)

Another similarity measure which is conceptually the same as the NC function is the Correlation Coefficients-based distance. For more details, the reader is referred to [3].
The Gradient Direction (GD) metric proposed in [7, 21] measures the distance between a probe image and a model in the gradient direction of the a posteriori probability function P(i | x) associated with the hypothesised client identity i. A mixture of Gaussian distributions with isotropic covariance matrix has been assumed as the density function representing the anticlass (world population) estimated from the data provided by all the other users (for all j ≠ i). The diagonal elements of the isotropic covariance matrix are assumed to have values related to the magnitude of variation of the image data in the feature space. It was demonstrated that in a face verification system, applying the GD metric is even more efficient than the NC function. This matching score is defined as

sO = | (x − μi)^T ∇O P(i | x) | / ||∇O P(i | x)||,   (8)

where ∇O P(i | x) refers to the gradient direction. For the isotropic structure of the covariance matrix, that is, Σ = σI, the optimal direction would be

∇I P(i | x) = Σ_{j=1, j≠i}^{m} p(x | j) (μj − μi).   (9)

Note that the magnitude of σ will affect the gradient direction through the values of the density p(x | j).

3. Similarity Scores Fusion

One of the very promising research directions in the field of pattern recognition and computer vision is classifier fusion. It has been recognised that the classical approach to designing a pattern recognition system which focuses on finding the best classifier has a serious drawback. Any complementary discriminatory information that other classifiers may capture is not tapped. Multiple expert fusion aims to make use of many different designs to improve the classification performance. In the case considered here, as different metrics span the feature space in different ways, it seems reasonable to expect that a better performance could be obtained by combining the resulting classifiers.
Since the scores for different classifiers lie in different ranges, a normalisation process is required to transform these scores to the same range before combining them [22]. The simplest normalisation technique is the min-max normalisation. The min-max normalisation is best suited for the case where the bounds (maximum and minimum values) of the scores produced by a matcher are known.
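As a concrete illustration of the scoring functions (1)–(7), the distances can be written directly as vector operations. The following Python/numpy sketch only restates the formulas and is not code from the paper; the GD metric of (8) is omitted because it additionally requires the anticlass density of (9).

import numpy as np

def city_block(mu, x):                  # (2), L1
    return np.sum(np.abs(mu - x))

def euclidean(mu, x):                   # (3), L2
    return np.sqrt(np.dot(x - mu, x - mu))

def chebyshev(mu, x):                   # (4), L-infinity
    return np.max(np.abs(mu - x))

def canberra(mu, x):                    # (5)
    return np.sum(np.abs(mu - x) / (np.abs(mu) + np.abs(x)))

def chi_square(mu, x):                  # (6)
    return np.sum((mu - x) ** 2 / (np.abs(mu) + np.abs(x)))

def normalised_correlation(mu, x):      # (7)
    return np.abs(x @ mu) / np.sqrt((x @ x) * (mu @ mu))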

In this case, we can easily shift the minimum and maximum scores to 0 and 1, respectively. Given a set of scores for each classifier si, i = 1, 2, ..., M, where M is the number of samples, the normalised scores are given by

Si = (si − min_i) / (max_i − min_i),   (10)

where si and Si are, respectively, the original and normalised scores associated to the ith sample. min_i and max_i are the minimum and maximum scores determined from a training set.
As mentioned earlier, two main groups of fusion rules, untrained (fixed) and trained rules, can be applied for classifier fusion. The untrained methods such as Sum (or Average), Product, Min, Max, and Median are very well known approaches. For example, the Sum rule is defined as

Snew = Σ_{i=1}^{M} Si,   (11)

where M is the number of classifiers. This is simply equivalent to averaging the normalised scores over the classifiers. A variety of trained fusion techniques such as the neural network classifier, the Bayesian classifier, and SVM have been suggested. It has been shown that the SVM classifier is among the best trained fusion rules. In [10], a decision level fusion strategy using SVMs has been adopted for combining the similarity measure-based classifiers. A very good performance has been reported using the adopted method.
Another promising trained rule involves a weighted averaging of similarity scores. Obviously, the technique used for determining the weight is an important factor in such a method.

3.1. Support Vector Machines. A Support Vector Machine is a two-class classifier showing superior performance to other methods in terms of Structural Risk Minimisation [23]. For a given training sample {xi, yi}, i = 1, ..., N, where xi ∈ R^D is the object marked with a label yi ∈ {−1, 1}, it is necessary to find the direction w along which the margin between objects of two classes is maximal. Once this direction is found, the decision function is determined by threshold b:

y(x) = sgn(w · x + b).   (12)

The threshold is usually chosen to provide equal distance to the closest objects of the two classes from the discriminant hyperplane w · x + b = 0, which is called the optimal hyperplane. When the classes are linearly nonseparable, some objects can be shifted by a value δi towards the right class. This converts the original problem into one which exhibits linear separation. The parameters of the optimal hyperplane and the optimal shifts can be found by solving the following quadratic programming problem:

minimise   w · w + C Σ_{i=1}^{N} δi
subject to: yi(w · xi + b) ≥ 1 − δi,
            δi ≥ 0,  i = 1, ..., N,   (13)

where parameter C defines the penalty for shifting the objects that would otherwise be misclassified in the case of linearly nonseparable classes.
The QP problem is usually solved in a dual formulation:

minimise   Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xi · xj
subject to: Σ_{i=1}^{N} αi yi = 0,
            0 ≤ αi ≤ C,  i = 1, ..., N.   (14)

Those training objects xi with αi > 0 are called Support Vectors, because only they determine the direction w:

w = Σ_{i=1, αi>0}^{N} αi yi xi.   (15)

The dual QP problem can be rapidly solved by the Sequential Minimal Optimisation method, proposed by Platt [24]. This method exploits the presence of linear constraints in (14). The QP problem is iteratively decomposed into a series of one variable optimisation problems which can be solved analytically.
For the face verification problem, the size of the training set for clients is usually less than the one for impostors. In such a case, the class of impostors is represented better. Therefore, it is necessary to shift the optimal hyperplane towards the better represented class. In this paper, the size of the shift is determined in the evaluation step based on the Equal Error Rate criterion.

3.2. Weighted Averaging of Similarity Measures. Compared to the simple averaging rule, in the case of weighted averaging, different weights are considered for the scores achieved from different classifiers, that is,

Snew = Σ_{i=1}^{M} wi Si,   (16)

where wi is the weight assigned to the ith classifier output. In this study, three methods of weighted averaging are considered. In the first group, each classifier weight is determined based on the performance of the classifier in an evaluation step. The smaller the error rate, the greater the weight assigned to the classifier output, that is,

wi = 1 / TEREi,   i = 1, 2, ..., M,   (17)

where TEREi is the Total Error Rate of the ith classifier in the Evaluation stage.
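The fixed and trained fusion rules discussed in this section can be prototyped in a few lines. The sketch below is illustrative rather than the authors' implementation: it combines the min-max normalisation of (10), the sum rule of (11), the error-rate-based weighting of (16)–(17), and a linear SVM trained on the stacked scores; the use of scikit-learn's SVC and the chosen default parameters are assumptions of the example.

import numpy as np
from sklearn.svm import SVC

def min_max_normalise(scores, lo, hi):
    # (10): lo and hi are per-classifier bounds estimated on training data.
    return (scores - lo) / (hi - lo)

def sum_rule(norm_scores):
    # (11): simple averaging of the normalised scores of the M classifiers.
    return norm_scores.mean(axis=1)

def weighted_rule(norm_scores, eval_total_error_rates):
    # (16)-(17): weight each classifier by the inverse of its evaluation TER.
    w = 1.0 / np.asarray(eval_total_error_rates)
    return norm_scores @ (w / w.sum())

def train_svm_fusion(norm_scores, labels, C=1.0):
    # Trained fusion: the vector of M normalised scores is treated as a
    # feature vector and a linear SVM separates clients from impostors.
    return SVC(kernel="linear", C=C).fit(norm_scores, labels)

Here norm_scores is an (n_samples × M) array; for the trained rule, the signed distances returned by decision_function can then be thresholded at the operating point (e.g., the Equal Error Rate point) found on the evaluation set.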

where TEREi is the Total Error Rate of the ith classifier in the In the BANCA protocol, 7 different distinct experimental
Evaluation stage. configurations have been specified, namely, Matched Con-
The main idea behind the second adopted method is trolled (Mc), Matched Degraded (Md), Matched Adverse
to minimise the correlation between classifier outputs. In (Ma), Unmatched Degraded (Ud), Unmatched Adverse (Ua),
practise, outputs of multiple classifiers are not uncorre- Pooled test (P), and Grand test (G). Table 1 describes the
lated, but some classifiers are more correlated than others. Therefore, it is reasonable to assign different weights to different classifiers according to their correlation. Principal Component Analysis (PCA) is one of the statistical techniques frequently used to decorrelate the data [25]. Denote by S the vector of scores delivered by the M classifiers, that is,

S = [S_1 \; S_2 \; \cdots \; S_M]^T.    (18)

Let \lambda_i and v_i, i = 1, ..., M, be the eigenvalues and eigenvectors of the covariance matrix of the evaluation score vectors S, retaining a certain proportion of the score variance. The eigenvectors are used as the bases of a new feature space. Applying the simple averaging rule (equation (11)) to the scores transformed to this feature space is equivalent to the weighted averaging of the original scores in (16), where the weights w_i are determined using the following equation:

w_i = \sum_{j=1}^{M} v_{ij}.    (19)

As the third method of weighted averaging of the scores, the above-mentioned idea can be extended by applying the LDA algorithm. In a face verification system, two groups of score vectors are considered: client scores and impostor scores. In the evaluation step, these classes of data can be used within the framework of Linear Discriminant Analysis (LDA) for computing the feature space bases and the classifier weights.

4. Experimental Design

In this section, the face verification experiments carried out on images of the BANCA database are described. The BANCA database is briefly introduced first. The main specification of the experimental setup is then presented.

4.1. BANCA Database. The BANCA database has been designed in order to test multimodal identity verification systems deploying different cameras in different scenarios (Controlled, Degraded, and Adverse). The database has been recorded in several languages in different countries. Our experiments were performed on the English section of the database. Each section contains 52 subjects (26 males and 26 females).
Each subject participated in 12 recording sessions in different conditions and with different cameras. Sessions 1–4 contain data acquired under Controlled conditions, whereas sessions 5–8 and 9–12 contain the Degraded and Adverse scenarios, respectively. In order to create more independent experiments, the images in each session have been divided into two groups of 26 subjects (13 males and 13 females). Experiments can be performed on each group separately. Table 1 shows the usage of the different sessions in each configuration: "T" refers to the client training session, while "C" and "I" denote client and impostor test sessions, respectively.

4.2. Experimental Setup. The performance of the different decision making methods discussed in Section 2 is experimentally evaluated on the BANCA database using the configurations discussed in the previous section. The evaluation is performed in the LDA and PCA spaces. The original resolution of the image data is 720 × 576. The experiments were performed with relatively low resolution face images, namely 64 × 49. The results reported in this paper have been obtained by applying a geometric face normalisation based on the eyes positions. The eyes positions were localised either manually or automatically; a fast method of face detection and eyes localisation was used for the automatic localisation of the eyes centres [26]. The XM2VTS database [27] was used for calculating the LDA and PCA projection matrices.
The thresholds in the decision making system have been determined based on the Equal Error Rate criterion, that is, at the operating point where the false rejection rate (FRR) is equal to the false acceptance rate (FAR). The thresholds are set either globally (GT) or using the client specific thresholding (CST) technique [21]. In the training sessions of the BANCA database, 5 client images per person are available. In the case of the global thresholding method, all these images are used for training the client template, and the other group data is then used to set the threshold. In the case of the client specific thresholding strategy, only two images are used for the template training, and the other three, along with the other group data, are used to determine the thresholds. Moreover, in order to increase the amount of training data and to take the errors of the geometric normalisation into account, 24 additional face images were generated for each image by perturbing the location of the eyes position around the annotated positions.
In previous studies [21], it has been demonstrated that the Client Specific Thresholding (CST) technique is superior in the matched scenarios (Mc, Md, Ma, and G), whereas the Global Thresholding (GT) method gives a better performance on the unmatched protocols. The results reported in the next section using thresholding have been acquired using this criterion.
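The PCA-based weighting of (18) and (19) reduces to an eigendecomposition of the evaluation-score covariance. The following is a minimal sketch, not the authors' implementation; the score-matrix layout, the retained-variance fraction, and the final weight normalisation are assumptions made for the example.

```python
import numpy as np

def pca_fusion_weights(eval_scores, variance_kept=0.95):
    """Derive per-classifier weights from the eigenvectors of the
    evaluation-score covariance matrix, as in (18)-(19).

    eval_scores : array of shape (n_samples, M), one column per classifier.
    Returns a weight vector of length M (normalised here for convenience;
    the paper does not specify a normalisation).
    """
    cov = np.cov(eval_scores, rowvar=False)            # M x M covariance of S
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # keep enough eigenvectors to retain the requested score variance
    kept = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept) + 1
    V = eigvecs[:, :kept]                              # columns are eigenvectors v_j
    w = V.sum(axis=1)                                  # w_i = sum_j v_ij, eq. (19)
    return w / np.abs(w).sum()

def fuse(scores, w):
    """Weighted average of the M classifier scores for each sample."""
    return scores @ w

# toy usage with random scores from M = 8 similarity-measure classifiers
rng = np.random.default_rng(0)
S_eval = rng.normal(size=(500, 8))
w = pca_fusion_weights(S_eval)
fused = fuse(S_eval, w)
```

In the LDA-based variant described above, the client and impostor score vectors of the evaluation set would be used to derive the projection and the weights instead of the covariance eigenvectors.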

Table 1: The usage of the different sessions in the BANCA experimental protocols.

Protocol   1    2    3    4    5    6    7    8    9    10   11   12
Mc         TI   CI   CI   CI
Md                             TI   CI   CI   CI
Ma                                                 TI   CI   CI   CI
Ud         T                   I    CI   CI   CI
Ua         T                                       I    CI   CI   CI
P          TI   CI   CI   CI   I    CI   CI   CI   I    CI   CI   CI
G          TI   CI   CI   CI   TI   CI   CI   CI   TI   CI   CI   CI
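The global thresholding (GT) strategy of Section 4.2 places the decision threshold at the Equal Error Rate operating point of the evaluation data. A minimal sketch of that criterion is given below; the convention that higher scores indicate the claimed client, and the exhaustive scan over candidate thresholds, are assumptions of the example.

```python
import numpy as np

def eer_threshold(client_scores, impostor_scores):
    """Return the threshold where FAR is closest to FRR (EER criterion).

    Assumes higher scores indicate a better match to the claimed identity.
    """
    candidates = np.unique(np.concatenate([client_scores, impostor_scores]))
    best_t, best_gap = candidates[0], np.inf
    for t in candidates:
        frr = np.mean(client_scores < t)      # clients wrongly rejected
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# toy usage with synthetic evaluation scores
rng = np.random.default_rng(1)
clients = rng.normal(1.0, 0.5, 200)
impostors = rng.normal(-1.0, 0.5, 2000)
threshold = eer_threshold(clients, impostors)
```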

[Figure 1 (plots): Total Error rate (TEE on evaluation, TET on test) versus the value of σ, with curves for the Mc, Md, Ud, and P protocols. Panels: (a) Evaluation (Manual registration), (b) Test (Manual registration), (c) Evaluation (Automatic registration), (d) Test (Automatic registration).]
Figure 1: The performance of the GD metric versus the value of σ.

5. Experimental Results and Discussion

As mentioned earlier, in the GD metric the impostor distributions have been approximated by isotropic Gaussian functions with a standard deviation of σ, that is, Σ = σI. The order of σ is related to the order of the standard deviation of the input data (grey level values in the LDA feature space). In the previous work [8], a fixed value equal to 10^4 was used for σ. In this work, in order to optimise the metric for dealing with different imaging conditions, the value of σ is adaptively determined in the evaluation step, where the performance of the system for different values of σ is evaluated. As examples, Figure 1 contains plots of the Total Error rate versus the value of σ in the evaluation and test steps for the Mc, Md, Ud, and P protocols.
The evaluation plots show that, by increasing the value of σ, the Total Error rate first decreases rapidly. Then, for larger values of σ, the TE rate remains relatively constant or increases gradually. From these plots, one can also see that the behaviour of the system in the evaluation and test phases is almost consistent. Therefore, the optimum σ can be found in the evaluation step by looking for the point after which the performance of the system is not significantly improved by increasing the value of σ. The associated value of σ is then used in the test stage. Since the effectiveness of a similarity measure depends on the adopted method of feature extraction, the next subsection reports the experimental results using the PCA and LDA algorithms. The fusion rules are presented in the sequel.

5.1. Experimental Results in the PCA and LDA Feature Spaces. Figure 2 contains the results obtained using the individual scoring functions on the evaluation and test data sets in the PCA and LDA spaces when manually annotated eyes positions were used for the face geometric normalisation. The Total Error rates in the Evaluation (TEE) and Test (TET) stages have been used as performance measures in the plots. These results clearly demonstrate that, among the adopted metrics, the GD metric is individually the outright winner.
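The adaptive choice of σ described above amounts to a one-dimensional sweep on the evaluation set, stopping once the Total Error rate no longer improves appreciably. A minimal sketch under stated assumptions: the evaluation experiment is abstracted into a callable, the toy error curve stands in for real TEE values, and the candidate grid and tolerance are illustrative only.

```python
import numpy as np

def select_sigma(tee_fn, grid, tol=0.5):
    """Sweep sigma over `grid`, evaluate the Total Error rate on the
    evaluation set via `tee_fn(sigma)`, and keep the last sigma that still
    improved the error by more than `tol` (percentage points)."""
    errors = np.array([tee_fn(s) for s in grid])
    chosen, best = grid[0], errors[0]
    for s, e in zip(grid[1:], errors[1:]):
        if best - e > tol:          # still a significant improvement
            best, chosen = e, s
        # otherwise keep the previously chosen sigma
    return chosen, best

# toy stand-in for the evaluation experiment: error drops quickly, then flattens
def toy_tee(sigma):
    return 5.0 + 30.0 * np.exp(-sigma / 3e4)

grid = np.linspace(1e3, 3e5, 60)
sigma_opt, tee_opt = select_sigma(toy_tee, grid, tol=0.5)
```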

[Figure 2 (plots): Total Error rates TEE and TET for each BANCA protocol (Mc, Md, Ma, Ud, Ua, P, G) and each similarity measure (NC, GD, EU, City, Cheby, χ2, CANB, CORR). Panels: (a) Evaluation (PCA feature space), (b) Test (PCA feature space), (c) Evaluation (LDA feature space), (d) Test (LDA feature space).]
Figure 2: ID verification results using different scoring functions in the PCA and LDA feature spaces for the manually registered data.

[Figure 3 (plots): Total Error rates TEE and TET for each BANCA protocol and each similarity measure (NC, GD, EU, City, Cheby, χ2, CANB, CORR). Panels: (a) Evaluation (LDA feature space), (b) Test (LDA feature space).]
Figure 3: ID verification results using different scoring functions in the LDA feature space for automatically registered data.

For the sake of simplicity of comparison, Table 2 contains the evaluation and test results for the GD metric using the PCA and LDA spaces. These results demonstrate that a better performance can always be achieved using the LDA space.
Table 3 also contains a summary of the results obtained using the individual scoring functions on the evaluation and test sets when manually annotated eyes positions were used for the face geometric normalisation in the LDA space. The values in the table indicate the Total Error rates in the Evaluation (TEE) and Test (TET) stages, respectively.
The results of the similar experiments with automatically registered data in the LDA feature space demonstrate that in

Table 2: ID verification results using GD metric, LDA (left) and PCA (right). TEE: Total Error rate Evaluation; TET: Total Error rate Test.

Manual Registration
LDA PCA
TEE TET TEE TET
Mc 0.597 4.87 2.2 15.77
Md 1.77 7.18 4.26 25.19
Ma 1.56 8.03 8.6 20.54
Ud 26.09 24.74 49.49 48.32
Ua 27.5 27.4 48.49 50.96
P 19.56 19.64 39.64 39.6
G 2.43 4.12 8.74 18.04

Table 3: ID verification results using different similarity measures for the manually registered data in the LDA feature space.

Mc Md Ma Ud Ua P G
TEE TET TEE TET TEE TET TEE TET TEE TET TEE TET TEE TET
NC 1.93 8.08 3.57 13.36 3.79 14.61 24.81 25.93 37.63 38.81 27.69 28.01 7.26 9.75
GD 0.60 4.87 1.77 7.18 1.55 8.03 26.09 24.74 27.5 27.40 19.56 19.64 2.43 4.12
ED 7.97 25.89 17 32.34 25.06 38.62 52.37 51.15 59.26 60.42 47.12 48.22 46.33 54.93
City 11.6 29.65 22.9 37.4 34.17 43.71 57.82 58.4 66.44 67.3 54.25 54.25 57.24 62.26
Cheb 8.2 31.73 16.22 39.23 16 35.86 56.44 56.3 58.94 57.41 51.56 51.85 32.54 43.79
χ2 7.49 20.41 14.88 28.88 22.99 34.17 48.17 47.15 56.35 60.48 44.46 45.45 42.91 48.12
Corr 2.25 11.22 4.74 15.6 4.54 17.43 22.66 26.25 36.57 37.44 34.44 34.54 8.02 10.85
Canb 5 13.85 8.69 20.25 12.01 24.2 34.26 33.5 51.54 52.37 26.74 27.69 22.54 24.04

Table 4: Fusion results for the different BANCA protocols using different fusion rules.

Sum WA1 WA2 WA3


TEE TET TEE TET TEE TET TEE TET
Mc 2.51 10.61 .82 6.31 1.9 8.56 1.38 6.5
Md 5.98 16.28 2.93 10.67 5.35 15.93 2.5 9.33
Ma 7.54 18.75 2.24 11.06 6.29 17.14 2.55 10.32
Ud 30.03 30.54 26.16 25.38 29.45 30.48 19.55 21.79
Ua 40.41 41.19 36.47 37.95 40.35 41.4 26.96 29.61
P 29.65 29.8 25.3 24.87 18.51 28.57 18.33 19.94
G 15.5 19.12 3.8 5.66 14.92 18.47 3.32 4.55

this case the optimised GD function again delivers a better or at least comparable performance. The performance of the other metrics, with the exception of NC, is much worse. These results are shown in Figure 3.

5.2. Fusion Results and Discussions. In the next step, we investigated the effect of fusing the classifiers employing the different similarity measures. In the first group of experiments, we compared the fixed combination rules (Sum, Product, Min, Max, and Median), in which all the classifiers are deemed to carry the same weight. The results obtained in the evaluation and test steps for both manually and automatically registered data are shown in Figure 4. These results clearly demonstrated that, among the adopted fixed rules, the Sum rule outperforms the others for both manually and automatically registered data. For the sake of simplicity of comparison of the results using the untrained and trained rules, the fusion results using the Sum rule for manually registered data have been reported in Table 4.
In the second group of fusion experiments, different weighted averaging schemes for the outputs of classifiers employing different similarity measures were examined. The results are presented in Table 4. In this table, WA1, WA2, and WA3 represent the weighted averaging results for the error minimisation method, PCA, and LDA, respectively.
As can be seen, all the adopted weighted averaging methods give better results compared to the simple averaging (Sum) rule. Also, among the weighted averaging methods, a better performance is achieved using the LDA method.
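The untrained rules compared above, and the weighted averaging schemes WA1–WA3 reported in Table 4, are simple operations on the per-sample score vectors. A minimal sketch follows; it assumes the scores have already been normalised to a common range, which the fixed rules require.

```python
import numpy as np

def fuse_fixed(scores, rule="sum"):
    """Combine an (n_samples, M) score matrix with an untrained rule.
    All classifiers carry the same weight."""
    rules = {
        "sum":     lambda s: s.mean(axis=1),
        "product": lambda s: s.prod(axis=1),
        "min":     lambda s: s.min(axis=1),
        "max":     lambda s: s.max(axis=1),
        "median":  lambda s: np.median(s, axis=1),
    }
    return rules[rule](scores)

def fuse_weighted(scores, w):
    """Weighted averaging (WA) of the classifier outputs; the weights may
    come from error minimisation, PCA (19), or LDA."""
    w = np.asarray(w, dtype=float)
    return scores @ (w / w.sum())

# toy usage: 10 access attempts scored by 5 similarity-measure classifiers
rng = np.random.default_rng(2)
S = rng.random((10, 5))
fused_sum = fuse_fixed(S, "sum")
fused_wa = fuse_weighted(S, [0.3, 0.2, 0.2, 0.2, 0.1])
```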

[Figure 4 (plots): Total Error rates TEE and TET for each BANCA protocol using the fixed fusion rules Sum, Product, Min, Max, and Median. Panels: (a) Evaluation (Manual registration), (b) Test (Manual registration), (c) Evaluation (Automatic registration), (d) Test (Automatic registration).]
Figure 4: Untrained fusion results in the evaluation and test steps for different BANCA experimental protocols.

Table 5: Fusion results on BANCA protocols with PCA and LDA space using SVM, manual registration (left), and automatic registration
(right).

         Manual Registration                                Automatic Registration
FAR_E  FRR_E  TER_E  FAR_T  FRR_T  TER_T          FAR_E  FRR_E  TER_E  FAR_T  FRR_T  TER_T
(Subscript E: Evaluation stage; subscript T: Test stage.)
Mc 0.096 0.13 0.22 0.86 0.13 0.99 5.48 5.51 10.99 6.92 6.54 13.46
Md 0.96 1.02 1.98 1.06 2.18 3.24 2.88 2.95 5.83 21.83 6.41 28.24
Ma 1.44 1.54 2.98 0.38 3.72 4.1 0.86 0.9 1.76 0.86 7.56 8.42
Ud 10.19 10.13 20.32 9.14 14.61 23.75 10.48 10.38 20.86 9.81 15 24.81
Ua 10.77 10.9 21.67 11.83 10.51 22.34 15 14.87 29.87 26.15 22.44 48.59
P 7.6 7.52 15.12 7.92 9.83 17.75 14.87 14.82 29.6 12.08 17.52 29.6
G 1.31 1.33 2.64 1.15 1.7 2.85 6.35 6.41 12.76 9.87 8.93 18.8

Figure 5 contains comparative plots of the results using the Sum rule, LDA-based weighted averaging, and the SVMs. These plots demonstrate that the trained methods outperform the untrained (Sum) rule. In most of the cases, comparable results are obtained using LDA weighting and SVMs.
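Fusing the similarity measures, or the PCA- and LDA-based scores, with an SVM amounts to training a two-class classifier on evaluation-set score vectors and thresholding its decision value at test time. A scikit-learn sketch follows; the RBF kernel, its parameters, and the synthetic scores are assumptions of the example rather than the configuration used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# evaluation-set score vectors: one row per access, one column per classifier
client_scores = rng.normal(0.8, 0.3, size=(200, 8))
impostor_scores = rng.normal(0.2, 0.3, size=(800, 8))

X_eval = np.vstack([client_scores, impostor_scores])
y_eval = np.hstack([np.ones(len(client_scores)), np.zeros(len(impostor_scores))])

# train the fusion SVM on the evaluation data
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_eval, y_eval)

# at test time, the signed distance to the separating surface acts as the
# fused score, thresholded at 0 (or at an EER-derived threshold)
X_test = rng.normal(0.5, 0.4, size=(10, 8))
fused_score = svm.decision_function(X_test)
decision = fused_score > 0
```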

[Figure 5 (plots): Total Error rates TEE and TET for each BANCA protocol using the Sum rule, weighted averaging, and SVM fusion. Panels: (a) Evaluation (Manual registration), (b) Test (Manual registration), (c) Evaluation (Automatic registration), (d) Test (Automatic registration).]
Figure 5: Fusion using the Sum rule, LDA-based weighted averaging, and SVMs.

[Figure 6 (plots): TEE and TET error rates for each BANCA protocol when fusing the LDA- and PCA-based classifiers. Panels: (a) Manual registration, (b) Automatic registration.]
Figure 6: Verification results by fusing the LDA- and PCA-based classifiers using SVMs.

Since the effectiveness of a similarity measure depends on the adopted method of feature extraction, in the next step the merit of fusing the PCA- and LDA-based classifiers using SVM was investigated. Figure 6 contains the comparative plots of the Total Error rates obtained in the Evaluation (TEE) and Test (TET) stages for both manually and automatically registered data (see Table 5). These plots demonstrate that these methods outperform the other rules.
Overall, the results clearly demonstrate that the proposed similarity measure fusion considerably improves the performance of the face verification system.

6. Conclusions

The problem of fusing similarity measure-based classifiers in face verification was considered. First, the performance of face verification systems in the PCA and LDA feature spaces with different similarity measure classifiers was experimentally evaluated. The study was performed for both manually and automatically registered face images. The experimental results confirm that our optimised Gradient Direction metric in the LDA feature space outperforms the other investigated metrics. Different methods for the selection and fusion of the various similarity measure-based classifiers were compared. The experimental results demonstrate that the combined classifiers outperform any individual verification algorithm. Moreover, the Support Vector Machines and Weighted Averaging of similarity measures have been shown to be the best fusion rules. It was also shown that, although the features derived from the LDA approach lead to better results than those of the PCA algorithm, fusing the PCA- and LDA-based scores improves the performance further. Based on our previous study within the LDA space [19], further improvement is also expected by adaptively selecting a subset of the LDA-based and PCA-based classifiers.

Acknowledgment

The financial support from the Iran Telecommunication Research Centre and the EU funded Project Mobio (http://www.mobioproject.org/) Grant IST-214324 is gratefully acknowledged.

References

[1] D. Zhang and G. Lu, “Evaluation on similarity measurement for image retrieval,” Neural Network and Signal Processing, vol. 2, pp. 228–231, 2003.
[2] Q. Bao and P. Guo, “Comparative studies on similarity measures for remote sensing image retrieval,” in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC ’04), pp. 1112–1116, October 2004.
[3] V. Perlibakas, “Distance measures for PCA-based face recognition,” Pattern Recognition Letters, vol. 25, no. 6, pp. 711–724, 2004.
[4] W. S. Yambor, B. A. Draper, and J. R. Beveridge, “Analyzing PCA-based face recognition algorithm: eigenvector selection and distance measures,” in Empirical Evaluation Methods in Computer Vision, H. Christensen and J. Phillips, Eds., World Scientific Press, Singapore, 2002.
[5] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[6] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[7] J. Kittler, Y. P. Li, and J. Matas, “On matching scores for LDA-based face verification,” in Proceedings of the British Machine Vision Conference, M. Mirmehdi and B. Thomas, Eds., pp. 42–51, 2000.
[8] M. T. Sadeghi and J. Kittler, “Confidence based gating of multiple face authentication experts,” in Proceedings of the Joint IAPR International Workshops on Syntactical and Structural Pattern Recognition and Statistical Pattern Recognition (SSPR ’06), vol. 4109 of Lecture Notes in Computer Science, pp. 667–676, Hong Kong, August 2006.
[9] E. Bailly-Bailliére, S. Bengio, F. Bimbot et al., “The BANCA database and evaluation protocol,” in Proceedings of the International Conference on Audio and Video Based Person Authentication, vol. 2688, pp. 625–638, 2003.
[10] M. T. Sadeghi, M. Samiei, S. M. T. Almodarresi, and J. Kittler, “Similarity measures fusion using SVM classifier for face authentication,” in Proceedings of the 3rd International Conference on Computer Vision Theory and Applications (VISAPP ’08), vol. 2, pp. 105–110, Funchal, Madeira, Portugal, January 2008.
[11] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[12] J. Kittler and F. Roli, Multiple Classifier Systems, vol. 2096, Springer, Berlin, Germany, 2001.
[13] L. Xu, A. Krzyzak, and C. Y. Suen, “Methods of combining multiple classifiers and their applications to handwriting recognition,” IEEE Transactions on Systems, Man and Cybernetics, vol. 22, no. 3, pp. 418–435, 1992.
[14] A. Verikas, A. Lipnickas, K. Malmqvist, M. Bacauskiene, and A. Gelzinis, “Soft combining of neural classifiers: a comparative study,” Pattern Recognition Letters, vol. 20, pp. 429–444, 1999.
[15] F. Roli and G. Fumera, “Analysis of linear and order statistics combiners for fusion of imbalanced classifiers,” in Proceedings of the 3rd International Workshop on Multiple Classifier Systems, pp. 252–261, Springer, Cagliari, Italy, June 2002.
[16] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[17] R. P. W. Duin, “The combining classifier: to train or not to train?” in Proceedings of the International Conference on Pattern Recognition, vol. 16, no. 2, pp. 765–770, 2002.
[18] K. Maghooli and M. S. Moin, “A new approach on multimodal biometrics based on combining neural networks using AdaBoost,” in Proceedings of the International ECCV Workshop on Biometric Authentication (BioAW ’04), vol. 3087, pp. 332–341, Prague, Czech Republic, May 2004.
[19] M. T. Sadeghi, M. Samiei, and J. Kittler, “Selection and fusion of similarity measure based classifiers using support vector machines,” in Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition (SSPR ’08), vol. 5342 of Lecture Notes in Computer Science, pp. 479–488, 2008.
[20] G. L. Marcialis and F. Roli, “Fusion of LDA and PCA for face verification,” in Proceedings of the International ECCV Workshop on Biometric Authentication, M. Marcialis and J. Bigun, Eds., vol. 2359 of Lecture Notes in Computer Science, pp. 30–37, 2002.
[21] M. T. Sadeghi and J. Kittler, “Decision making in the LDA space: generalised gradient direction metric,” in Proceedings of
the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 248–253, Seoul, Korea, May 2004.
[22] A. Jain, K. Nandakumar, and A. Ross, “Score normalization in
multimodal biometric systems,” Pattern Recognition, vol. 38,
no. 12, pp. 2270–2285, 2005.
[23] V. Vapnik, The Nature of Statistical Learning Theory, Springer,
New York, NY, USA, 1995.
[24] J. Platt, “Sequential minimal optimization: a fast algorithm
for training support vector machines,” Tech. Rep. 98-14,
Microsoft Research, Redmond, Wash, USA, April 1998.
[25] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski, “Face recog-
nition by independent component analysis,” IEEE Transactions
on Neural Networks, vol. 13, no. 6, pp. 1450–1464, 2002.
[26] M. Hamouz, J. Kittler, J.-K. Kamarainen, P. Paalanen, H.
Kälviäinen, and J. Matas, “Feature-based affine-invariant
localization of faces,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 27, no. 9, pp. 1490–1495, 2005.
[27] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: the extended M2VTS database,” in Proceedings of the 2nd International Conference on Audio and Video-based Biometric Person Authentication, pp. 72–77, 1999.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 415307, 16 pages
doi:10.1155/2010/415307

Research Article
A Robust Iris Identification System Based on
Wavelet Packet Decomposition and Local Comparisons of
the Extracted Signatures

Florence Rossant, Beata Mikovicova, Mathieu Adam, and Maria Trocan


LISITE (Laboratoire d’Informatique, Signal et Image, Electronique et Télécommunications),
Institut Supérieur d’Electronique de Paris (ISEP), 21 rue d’Assas, 75006 Paris, France

Correspondence should be addressed to Florence Rossant, fl[email protected]

Received 1 December 2009; Revised 16 February 2010; Accepted 18 March 2010

Academic Editor: Yingzi Du

Copyright © 2010 Florence Rossant et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

This paper presents a complete iris identification system including three main stages: iris segmentation, signature extraction,
and signature comparison. An accurate and robust pupil and iris segmentation process, taking into account eyelid occlusions,
is first detailed and evaluated. Then, an original wavelet-packet-based signature extraction method and a novel identification
approach, based on the fusion of local distance measures, are proposed. Performance measurements validating the proposed iris
signature and demonstrating the benefit of our local-based signature comparison are provided. Moreover, an exhaustive evaluation
of robustness, with regards to the acquisition conditions, attests the high performances and the reliability of our system. Tests have
been conducted on two different databases, the well-known CASIA database (V3) and our ISEP database. Finally, a comparison of
the performances of our system with the published ones is given and discussed.

1. Introduction

Biometric systems provide reliable automatic recognition (identification) of persons based on one or several biological features. These systems are progressively replacing the conventional identification methods, such as documents, login passwords, or personal identification codes. There are several benefits of using biometrics in combination with or instead of traditional techniques. A first advantage is the ease of use: to be identified, a person does not have to remember a password or identification code and does not have to carry a key, and thus the identification process can be very quick. A second advantage is the protection of the identifier: a weak spot in many traditional security systems is that users often write down their code or tend to choose a code which is easy to remember and thus also easy to break. Keys and cards can be stolen or copied. On the contrary, biometric systems can be made quite safe against forgery. Last but not least, the length of the code is a very important advantage: biometrics makes possible the use of very long codes, and thus brute-force hacking strategies are inefficient.
Different types of biometrics such as fingerprints, hand geometry, facial appearance, voice, retina, and iris have been used. Nowadays, the iris is considered as one of the most reliable traits for biometric identification because of its random morphogenesis, great variability among different persons, and stability over time. The performances of iris based algorithms are better than those using other biometrics, as, for example, the face recognition algorithms. However, iris recognition systems rely on good quality images and their performances deteriorate in unconstrained environments.
Iris based recognition systems have been widely studied for the last 20 years. It was in the early nineties that John Daugman implemented and patented an automated ready-to-use iris recognition system [1, 2]. Even though Daugman's system is the most successful and the most well known, many other approaches have been proposed. Typically, such recognition systems, in spite of their specificities, have the

same structure: the first stage consists in the iris segmenta- the acquired and the database signatures but used it only in
tion, then the image is normalized and features are extracted the verification task.
in order to generate a signature. Finally, this signature is The decision criterion used for classification is generally a
compared to reference signatures (i.e., gallery database) in simple threshold, obtained empirically on training databases.
order to measure a numerical dissimilarity value to be used Some authors proposed other types of classifiers such as
in the decision process. neural networks [12, 17].
The segmentation part, consisting of the localization and In this paper, we propose a ready-to-use iris identifi-
extraction of the iris, is crucial as the whole recognition cation solution. We developed a complete reliable system
system depends on its accuracy. Therefore, much research with high performances that includes all steps of a clas-
has been conducted on the segmentation, based essentially sical iris identification scheme (Section 2). As explained
on two main methods: an integro-differential algorithm in Sections 2.3, 2.4, and 2.5, we contributed to all these
proposed by Daugman [2] and an algorithm based on stages, either by proposing new methods or improving the
the circular Hough transform employed by Wildes [3]. existing approaches. We present a precise, accurate and
Many other methods have been proposed, combining these robust segmentation method that takes into account the
algorithms or employing some threshold-based methods [4] eyelid detection. An original method of signature extraction
and, more recently, algorithms using active contour models and comparison, based on a wavelet packet decomposition,
[5]. An important part of the segmentation is the localization is also presented. Section 3 describes a complete valida-
of occlusions caused by eyelids and eyelashes, hiding, at least tion of the proposed method, realized on two different
partially, the iris texture. If they are not taken into account, databases: CASIA (infrared) and ISEP (visible domain).
they are considered as a part of the iris structure and lead We also evaluate the robustness of the algorithm with
to the deterioration of the performances. Methods used for regards to acquisition conditions, by simulating changes
the eyelids localization are similar to those used for the in illumination, blurring, and optical axis deviation. The
iris boundaries, mainly based on Daugman’s and Wildes’ performances of the whole system are presented in Section 4
method. The eyelashes segmentation methods are principally and compared to those described in the literature. Section 5
based on thresholding. concludes the presented system and further improvements
The normalization step transforms the iris region so are suggested.
that it has fixed dimensions, in order to allow comparisons.
The dimensional differences are mainly due to variations
in the pupil dimensions, varying imaging distance, rotation
of the camera, head tilt, and ocular motion. The majority 2. Iris Identification System
of systems use the transform proposed by Daugman [6]
which translates the segmented iris into a fixed length and 2.1. Iris Databases. The images used in this study were
dimensionless polar coordinate system. Two other methods acquired in the visible domain (ISEP database) or in the
are worth of noting, the Boles’ system [7] using virtual circles near infrared domain (CASIA database). Both systems are
and Wildes’ algorithm [3] employing an image registration cooperative, that is, the iris images were captured at small
technique. distances, under controlled lighting conditions and with
The aim of the feature (signature) extraction is to cooperating subjects.
provide the most discriminating information present in an To create the ISEP database, we used a dedicated iris
iris pattern. The analysis of the iris is accomplished either imaging equipment provided by Miles research lab [18]. It
globally on the whole iris or locally. The data (coefficients) is made up of a Nikon camera with 105 mm lens. The flash
issued from this analysis are then encoded to form a illumination is precisely guided to the eye via fibre optics
biometric signature. The feature representation approaches light-guides, in order to provide uniformly illuminated eye
could be roughly divided into three major categories: phase- pictures with little light reflections (Figure 1(a)). Neverthe-
based methods (as proposed by Daugman in e.g. [1, 2]), less, light spots are present in the pupil area and sometimes in
zero-crossing representation methods [7, 8], and various the iris texture itself, when the flash is not perfectly centred.
texture-analysis-based methods [3, 9–12]. The images were resized to 600 × 400 pixels and transformed
The identification step consists in the confrontation from RGB colour format to grey-levels, each pixel being
of the tested signature to those stored in a reference coded on one byte.
database (gallery). This comparison allows establishing their The ISEP database contains 1572 images acquired from
similarity (or dissimilarity, depending on algorithms). Then, the left and/or the right eye of 337 different individuals.
a decision criterion has to be applied to classify correctly the Most of them are European (87.23%) but the database
user as being authentic or imposter. To compare signatures, contains also images from African or Indian (10.1%) or
different distance metrics are commonly applied. The Ham- Asiatic (2.67%) people. The acquisition protocol included
ming distance is the most frequently used [2, 6, 13, 14]. It two different acquisition conditions: in the first case, the
can be modified to limit the comparison only to coefficients eye was preilluminated to contract the pupil; in the second
corresponding to nonoccluded part of the iris [6, 13, 14] or case, the picture was taken in indoor conditions, with
weighted by local quality measures [15]. Other authors use no preillumination, providing images with largely dilated
Euclidean [8] or weighted Euclidean distance [16]. Wildes in pupils. We have two images of both types for each of the 403
[3] proposed a measure of normalized correlation between classes (i.e., eyes).


Figure 1: Acquisition systems and iris images from (a) ISEP database and (b) CASIA database.

The CASIA-IrisV3-Interval is a public database provided since the whole recognition system depends on the accuracy
by the Center for Biometrics and Security Research (CBSR) of this segmentation step.
[19]. Iris images are 8 bit grey-level, 320 × 280 pixels, The proposed method is based on three steps: rough
collected under near infrared illumination (Figure 1(b)). localization of the pupil, iris boundary detection, and eyelid
Almost all subjects are Chinese. Images extracted from detection. This way, the pixels belonging to iris texture are
CASIA database serve generally as references for the compar- precisely determined. Further, the iris is unwrapped into
ison of iris identification systems. In our study, we extracted a rectangular image of fixed size, using Daugman’s polar
a subdatabase including 888 images of 222 classes, having transform [6]. A binary mask is also deduced from the eyelid
therefore 4 images per eye. The extraction process was segmentation results in order to distinguish precisely iris
conducted as follows: exclusion of images with very large pixels from eyelid pixels.
occlusions (more than 50% of the iris texture) or inadequate
segmentation (as explained in Section 2.3.4), exclusion of
a class when less than four images are available under the 2.3.1. Rough Detection of the Pupil. This step can be, to a
previous conditions, and random selection of four remaining certain extent, dependent on the illumination system used
images per eye. for image acquisition. Nevertheless, the proposed method is
robust enough to process both CASIA (Figure 3(a)) and ISEP
databases. The bright spots are detected based on top-hat
2.2. Functional Scheme. The functional scheme of the pro- morphological filters. They are then removed by a specific
posed system is depicted in Figure 2 and follows the typical filter which replaces every spot-pixel (i.e., pixels contained
structure of majority of iris recognition systems. The image within a bright spot) by an average of its nearby pixels,
is first segmented in order to extract the iris texture. Two starting from the spot periphery and progressing iteratively
major steps are required: the localization of the inner towards the centre (Figure 3(b)). The average includes only
and outer iris boundaries (Sections 2.3.1 and 2.3.2) and pixels that were not labelled as spot-pixels beforehand or
the detection of the eyelids (Section 2.3.3). Our system those which have been processed at a previous iteration. So,
provides a robust and accurate segmentation, which has been the spots are filled with dark pixels without creating high
demonstrated by comparing the automatic segmentations gradients.
with manual ones (Section 2.3.4). Then, the iris ring is In the following, we assume that the pupil is almost
unwrapped to get a rectangular image of normalized size. centred in the image (cooperative acquisition system) and
The information about the eyelid boundaries are used to define a region of interest (160 × 160 pixels) in order to
define a binary mask, where pixels corresponding to iris localize it roughly (Figure 3(b)). This area is represented in
texture are coded as 1’s (Section 2.3.5). We propose then an a binary manner, for selecting the darkest pixels that are
original method for extracting signatures, based on a wavelet likely to belong to the pupil (Figure 3(c)). The threshold
packet decomposition (Section 2.4). The identification part is dynamically adjusted through the histogram analysis, in
consists in the comparison between iris signatures. Two order to keep the 15% of darkest pixels. Morphological filters
different approaches are proposed: a global comparison and are applied to improve the segmentation quality and the
a novel method based on a fusion of local distances calculated region of largest area is kept as pupil region. The pupil centre
on subregions of the iris (Section 2.5). is estimated from the bounding box and the gravity centre of
the extracted region (Figure 3(d)).

2.3. Segmentation and Normalisation. The segmentation step


aims at extracting the iris texture area from the eye image. 2.3.2. Iris Boundary Localization. This second step consists
Major difficulties come from the weak contrast between the in the localization of the inner and outer iris boundaries,
pupil and the iris (dark eyes) or the iris and the sclera (pale considered either circular or elliptical. Two major methods
eyes), but also from the poor quality of the acquired pictures. were proposed by Daugman and Wildes. Both are based on
Images are indeed often defocused or occluded by eyelids or the use of the first derivatives of the image and a circular
light spots. Nevertheless, efficient iris localization is required, parametric model of the iris contours. The best parameters

[Figure 2 (block diagram): Acquisition system → Segmentation & masking → Normalisation → Signature extraction → Identification, with a Reference database feeding the identification stage.]
Figure 2: Functional scheme of the complete recognition system.
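Figure 2 chains segmentation and masking, normalisation, signature extraction, and identification against a reference database. The sketch below only mirrors that structure; every function is a hypothetical stub with an assumed interface, not the authors' implementation.

```python
import numpy as np

def segment(eye_image):
    """Locate the pupil/iris boundaries and the eyelids.
    Stubbed: returns dummy geometry so the pipeline composes end to end."""
    return {"pupil": (150, 120, 40), "iris": (150, 120, 90, 95)}

def normalise(eye_image, geometry, size=(128, 256)):
    """Unwrap the iris ring into a fixed-size rectangle plus a validity mask."""
    return np.zeros(size), np.ones(size, dtype=bool)

def extract_signature(unwrapped, mask):
    """Wavelet-packet signature (Section 2.4); stubbed with a random ternary code."""
    return np.random.default_rng(0).integers(-1, 2, size=(2, 16, 32))

def identify(signature, gallery):
    """Compare against the reference signatures and return the closest identity."""
    return min(gallery, key=lambda ref: np.mean(signature != gallery[ref]))

# end-to-end call mirroring the chain of Figure 2
eye = np.zeros((400, 600), dtype=np.uint8)
geometry = segment(eye)
unwrapped, mask = normalise(eye, geometry)
signature = extract_signature(unwrapped, mask)
gallery = {"subject_001": np.zeros((2, 16, 32), dtype=np.int8)}
best_match = identify(signature, gallery)
```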


Figure 3: Rough localization of the pupil. (a) Source image, (b) filtered image and definition of the zone of interest for the pupil detection,
(c) pupil detection by thresholding means, and (d) pupil localization (bounding box and centre estimation).

are obtained either by maximizing the output of an integro- 2.3.3. Eyelids Localization. The detection of eyelid occlusions
differential operator [2] or by using a Hough transform is crucial to achieve good identification rates. Most proposed
applied on a binary edge map [3]. The second method is methods are based on the analysis of the gradient image,
probably less accurate than the first one since it depends assuming a high gradient at the frontier between the eyelid
on a threshold to be chosen for the edge detection and, and the iris. But the selection of relevant contours is a very
thus sensitive to different types of images and illumination difficult task since eyelashes often hide eyelid boundaries
conditions. Moreover, the results provided by the Hough and highly textured iris also provides high gradients. That
transform are often sensitive to the sampling of the space is why a priori knowledge about the shape of the searched
parameters. contour is again required. Therefore most authors model
Our method is similar to Daugman’s one, but with a eyelids as parabolic [3, 6, 15] or circular arcs, and apply
circular model of the pupil boundary and an elliptical model again an integro-differential operator [6] or a Hough
of the outer contour of the iris. The centre of the pupil transform [3] to select the best parameters of the parametric
and the centre of the iris are supposed to be close to one representation. Additional criteria are sometimes added to
another. From the first estimation of the pupil location shortlist admissible contours: eyelashes detection, since they
(Figure 3(d)), we deduce a grid of possible coordinates for are supposed to be at the border of the eyelid [20], which
the pupil centre. Then, we find the centre and the radius of requires an additional difficult step; selection of the longest
the circle that maximize the mean gradient in the orthogonal edge [15], but the searched contours are often cut by
direction of the circular curve (Figure 4(a)). The gradient eyelashes; denoising based on a statistical model that need
is estimated by a correlation with the 1D kernel [−1 − to be learnt beforehand [21].
1 0 + 1 + 1], representing an ideal step. This process To improve the robustness of the process, we propose
is applied on the preprocessed image to avoid high gradients an algorithm [22] based on three steps: preprocessing,
due to illumination spots. A similar algorithm is used for preselection of edge candidates to eyelids including first
determining the ellipse, the possible centres being restricted approximation and, finally, a decision by optimization of
in the neighbourhood of the pupil centre. However in this the mean gradient. In the preprocessing stage, a nonlinear
case, the gradient maximization is limited to the left and diffusion filter is applied to smooth the iris texture, while
right subparts of the ellipse in order to avoid possible eyelid sufficiently preserving the eyelid boundaries (Figure 5(a)).
or eyelash occlusions (Figure 4(b)). This algorithm provides The second step consists in applying a Canny-edge detector
very good results (Figure 4(c)), even in case of very low in order to obtain a map of edge candidates. A priori
contrast, because of the average effect, and because there is no knowledge about the position of the eyelids, with respect to
need of parameter tuning as we use a maximization criterion. the iris boundaries, is used to perform a first selection of


Figure 4: Iris boundary localization. (a) Estimation of the pupil contour by a circle (in yellow, the grid of the tested centres), (b) estimation
of the outer iris contour by an ellipse, and (c) obtained result.


Figure 5: Occlusion localization. (a) Preprocessed image, (b) selected edges to be fitted by a parabola, (c) final candidate selection, and (d)
edge detection after optimization.

the relevant edges. Therefore, we restrict the analysis area to the inner iris. Afterwards, we remove the left and the right parts of the iris in order to avoid connections between the eyelids/iris and the iris/sclera boundaries. We thus keep the edges that are the most likely to belong to the eyelid border (Figure 5(b)), so that the speed and the robustness of the algorithm are improved. The remaining edges, whose length is greater than the mean, are fitted by a parabolic curve. More side-knowledge is introduced by eliminating the parabolas which have an inaccurate orientation. At this stage, only 2 to 30 edge candidates still remain (Figure 5(c)). In the third step, the analysis is refined: the mean gradient along each candidate is calculated for different values of the parabola parameters around the first estimation, on a larger area (twice the size of the iris in the horizontal direction). The mean gradient is estimated on the result of the horizontal Sobel kernel filtering of the original image, in order to focus on the horizontal edges. A global maximisation is then performed to select the parabola approximating the eyelid/iris frontier (Figure 5(d)).

2.3.4. Performance Evaluation of the Segmentation Process. The automatic segmentations were compared with segmentations realized manually on images from the CASIA V3-Interval database (2655 images) and the ISEP database (1572 images), in order to provide a quantitative evaluation of the segmentation process. Let us denote by (x_p^{(A)}, y_p^{(A)}) and r_p^{(A)} the centre and radius of the pupil obtained with the proposed method, and by (x_p^{(M)}, y_p^{(M)}) and r_p^{(M)} the parameters obtained manually, serving as reference. We compute the relative error made on the pupil centre estimation by

E_{pupil centre} = \frac{\sqrt{(x_p^{(A)} - x_p^{(M)})^2 + (y_p^{(A)} - y_p^{(M)})^2}}{r_p^{(M)}},    (1)

and on the radius estimation by

E_{pupil radius} = \frac{|r_p^{(A)} - r_p^{(M)}|}{r_p^{(M)}}.    (2)

The relative errors committed on the outer iris boundary localization are calculated in the same way, replacing r_p^{(M)} by the horizontal axis of the ellipse in (1), and by the horizontal or vertical axis of the ellipse in (2). Table 1 indicates the percentage of images with a relative error of less than 5% and 10%, respectively. The correspondence between the percentage and the number of pixels (on average) is also provided.
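Equations (1) and (2) compare the automatically and manually obtained pupil parameters; the same formulas apply to the outer boundary with the ellipse axes substituted for the radius. A small sketch, with the parameter containers being assumptions of the example:

```python
import numpy as np

def pupil_centre_error(auto, manual):
    """Relative centre error of eq. (1): Euclidean distance between the
    automatic and manual centres, normalised by the manual radius."""
    dx = auto["x"] - manual["x"]
    dy = auto["y"] - manual["y"]
    return np.hypot(dx, dy) / manual["r"]

def pupil_radius_error(auto, manual):
    """Relative radius error of eq. (2)."""
    return abs(auto["r"] - manual["r"]) / manual["r"]

# toy usage: automatic vs. manually annotated pupil parameters (in pixels)
auto = {"x": 151.0, "y": 120.5, "r": 42.0}
manual = {"x": 150.0, "y": 121.0, "r": 43.0}
e_centre = pupil_centre_error(auto, manual)   # compared against the 5% / 10% bins
e_radius = pupil_radius_error(auto, manual)
```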
As expected, the presented results demonstrate better performances for the detection of the pupil on images illuminated with near-infrared light (CASIA database) rather than on images illuminated with visible light (ISEP database), since the iris/pupil contrast is higher in the first case. On the opposite, the determination of the iris centre and of the

Table 1: Evaluation of the accuracy of the segmentation process.

ISEP CASIA
<5% <10% %↔pixels <5% <10% %↔pixels
Centre (Epupil centre ) 77.4% 94.8% 1%↔0.53 px 92.8% 95.1% 1%↔0.42 px
Pupil
Radius (Epupil radius ) 86.1% 95.6% 1%↔0.53 px 95% 96% 1%↔0.42 px
Centre 92.5% 99.2% 1%↔1.4 px 81.9% 94.2% 1%↔1.1 px
Iris Horizontal axis 95% 99.6% 1%↔1.4 px 92.2% 97.8% 1%↔1.1 px
Vertical axis 57.3% 95.2% 1%↔1.4 px 64.1% 92.4% 1%↔1 px

horizontal axis parameter of the ellipse is more accurate for Segmentation performance
100
the ISEP database than for the CASIA database. Nevertheless,
the vertical axis parameters are less accurate than the

Percentage of images (%)


80
horizontal ones, for both databases. Indeed, the detection is
disturbed by the presence of eyelids and eyelashes.
60
It should be noticed that the real position of the pupil
centre was out of the grid of possible centres, determined
40
through the rough location of the pupil (Section 2.3.1 and
Figure 4(a)), for only 0.4% of the ISEP images and 0.97% of
20
the CASIA images. This demonstrates the robustness of this
first segmentation step. However, the overall segmentation
0
performances might still be improved by enlarging the grid. 0 20 40 60 80 100
These results are however difficult to compare with the Error (%)
ones given in the literature, since the databases are different
ISEP
(CASIA V1 in the literature) and no quantitative criterion
CASIA
has ever been expressed, to our knowledge.
A complete evaluation of the eyelid detection was also Figure 6: Global segmentation performance for the considered
realized on the whole CASIA-IrisV3-Interval database (2655 databases.
images) [22]. The automatic segmentations were compared
with manual segmentations. We defined the global error as
the percentage of subsegmentation or oversegmentation. For iris/sclera border. Moreover, imprecision on the eyelids
the lower eyelids, we obtain 97.6% of localizations with less detection are much less critical, since it has no influence
than 10% of global error. For the upper eyelids, 87.5% of on the normalization process. This fact was clearly demon-
localizations have less than 10% of global error. We also strated by experiments conducted on the CASIA database
noticed that the system is very robust to oversegmentation. [24]. Indeed, the identification error rates are zero on the
As stated above, these performance rates cannot be compared manually segmented CASIA database, when segmenting the
to those given in the literature, since the authors provided eyelids, and are only slightly degraded when not considering
only qualitative assessment and on a different database the occlusions. On the opposite, the identification error rates
(CASIA V1). come to 1.80% in middle security mode and to 5.43% in
Figure 6 summarizes the segmentation performances high-security mode, when testing the fully automatic system
obtained on all available images, including those presenting including the proposed segmentation step.
erroneous detection of iris borders. This graphic represents In what follows, we use only images that are correctly
the proportion of images whose global error is lower than a segmented, those for which the global error is less than 25%
given value, where the global error is computed as (thus, less than 2% of image rejection).
ErrorGlobal
number of sub or oversegmented pixels (3) 2.3.5. Normalization. The normalization stage is usu-
= 100 . ally accomplished by using the method proposed by
number of nonoccluded iris pixels
Daugman [2]. This model remaps each point within the iris
Approximately 90% of the images have a global error region to a pair of polar coordinates (r, θ) where r is on the
less than 10% for both databases, which demonstrates the interval [0, 1] and θ is an angle on the circle ([0, 2π]). The
reliability of the segmentation process. However, it should angle sampling step is predefined, as well as the number of
be pointed out that the different sources of inaccuracy do pixels sampled along each radius, so that the output image
not have the same influence on the system identification is a rectangle of fixed size. Let us denote by nθ and nr the
rates. As underlined in [23], the pupil centre position is number of points along the angle and the radius axes. We
the most critical parameter, since it serves as origin of designate the coordinates of pairs of pixels located on the
the normalization process (Section 2.3.5), while recognition pupil and the iris borders, aligned with the pupil centre
systems are more tolerant to inaccuracies related to the and forming an angle θn with the x-axis, as (x p (θn ), y p (θn ))

y The instantaneous phase is obtained by constructing an


(xi (θn ), yi (θn ))
analytic image which is a combination of the original image
x and its Hilbert transform. Module of emergent frequency
and the real and the imaginary parts of the instantaneous
phase are used to encode the iris texture.
(x p (θn ), y p (θn )) Lately, the subband decomposition methods [26] have
θn
gained a lot of interest due to their demonstrated efficiency
in characterising different types of textures. Among these
schemes, some have employed separable wavelet basis [7,
12, 15, 27], as well as wavelet packet basis [28] in order to
represent the analysed texture in a way that discriminant fea-
Figure 7: Polar transform. tures are highlighted. A major inconvenience of the wavelet
representation, however, is that only a subset of the possible
space-frequency segmentation is used for the extraction of
the spatial frequency components of the texture. Wavelet
and (xi (θn ), yi (θn )), respectively, (Figure 7). The Daugman’s
packets (WP) provide a solution to this problem so that full
transform is expressed as
or adaptive frequency segmentation for a given texture can
⎧ be obtained.

⎨x(rk , θn ) = rk xi (θn ) + (1 − rk )x p (θn ),
Especially for images with highly textured content, or

⎩ y(r , θ ) = r y (θ ) + (1 − r )y (θ ), residual textures (as the unwrapped iris images), the energy
k n k i n k p n
compactness performance of the wavelet packet subband
with structures is superior to classical wavelet one, as it has been
(4) shown in [29]. Moreover, a valid reason for using WP for iris
⎧ n

⎪ signature extraction is that cyclic events (e.g., unwrapped iris
⎨θn =


2π, n = 0, . . . , nθ − 1,
strips) produce regular patterns in the spatial domain which

⎪ k can be efficiently represented by wavelet packet means.

⎩rk = , k = 0, . . . , nr − 1.
nr − 1 As the WP transform [30–32] generalizes the dyadic
wavelet decomposition by iterating the decompositions on
Even though other normalizations have been proposed the high-pass bands, it can be implemented by using a pair
(e.g., [3, 7]), Daugman’s transformation is commonly of Quadrature Mirror Filter (QMF) banks that divide the
adopted since it easily deals with pupil dilatation or focal frequency bands into equal parts. This recursive splitting of
variations. Moreover, as the signatures are computed from the vector space is represented by the admissible WP tree
the normalized images, no further normalization is necessary (Figure 9).
to compare signatures. There are two major categories of features extraction
The size of the unwrapped images is set to nθ × nr = methods employing wavelet packets. The first one uses
256 × 128 pixels (Figure 8(a)). The binary mask defining abstract aggregates of the original wavelet packet features
the pixels belonging to the iris texture is defined in the such as: entropy, energy, distance, and so forth. on the full
same way (Figure 8(b)). The last step consists in a histogram WP decomposition tree [33, 34]. The second category clus-
equalization that increases the contrast of the texture and ters the best-basis WP feature extraction methods. Generally,
normalizes dynamically the grey-levels (Figure 8(c)). in this latter class, the WP decomposition coefficients are
used to form a feature space by merging specific nodes of the
2.4. Signature Extraction. Finding the appropriate features WP tree and splitting others, in order to produce a tree that
for the description of the unwrapped iris images represents represent the best reflection of the properties of the texture
the key for a robust signature extraction and classification. [29, 33]. The features are then extracted based on some
The literature acknowledges manifold of propositions. For criteria applied to the wavelet coefficients in the terminal
example, Daugman [2, 6, 13, 14] applied 2D Gabor filters for nodes of the resulted optimized tree [35].
extracting the phase structure information of the iris. Boles Therefore, a compact iris signature can be obtained by
and Boashash [7] have chosen zero-crossing representations quantizing the coefficients of the full WP decomposition
of the 1D wavelet transform of a concentric circle on the iris tree into one bit each, depending on their sign [34]. In the
image, at various resolution levels, in order to characterize following we propose a signature extraction method (which
the iris texture. Wildes [3] decomposed the iris region using could be classified as belonging to the first category) which
LoG (Laplacian of Gaussian) filters. The resulted filtered uses the energy of the WP coefficients as discriminator for
image is thus represented as a 4-level Laplacian pyramid and determining which subbands carry the most useful part of
further used for generating a compact iris signature. Lim et the information.
al. [12] decorrelated the iris images using a 4-level 2D Haar The subbands to be analysed will be generated by the
transform and quantized the high-frequency information full 3-level WP decomposition using the orthogonal Haar
thus obtained. Ma et al. [9–11] defined new spatial filters to transform, as shown in Figure 9. Due to the orthogonality
capture local details of the iris. Tisse et al. [25] introduced a of the involved transform, the energy preserving criterion
concept of instantaneous phase and/or emergent-frequency. is fulfilled. We can thus use the normalized WP subbands

Figure 8: Normalization process. (a) Unwrapped image, (b) binary unwrapped mask, and (c) equalized iris texture with masked occlusions.

[Figure 9 (diagram): the full 3-level wavelet packet tree, with the 64 terminal subbands indexed 0 to 63.]
Figure 9: Full 3-level wavelet packet decomposition tree: the low-pass subbands (approximations) are denoted by A and the high-pass ones are represented following their directions—H (horizontal), V (vertical), and D (diagonal).

[Figure 10 (plot): mean energy of the Haar wavelet packets (subbands 0 to 63) computed on the ISEP database.]
Figure 10: Mean wavelet-packet energy distribution.

energies E_i, i = 0, ..., 63, as discriminators for deciding which packets should be considered for signature extraction:

E_i = \frac{1}{N_i} \sum_{j,k} w_i(j,k)^2,    (5)

where w_i(j,k) denotes the ith subband wavelet packet coefficients (N_i gives the total number of coefficients of the ith subband).
As shown in Figure 10, the subbands 2 (AAV) and 10 (AVV) are efficient discriminants for the mean-energy value criterion. We propose thus to select these two most energetic subbands of the full WP decomposition tree and to generate the signature bitwise as in the following:

s_i(j,k) = sign(w_i(j,k)), if |w_i(j,k)| > T,
s_i(j,k) = 0, otherwise,    (6)

where the threshold T is dynamically computed, following the descending magnitude order of the absolute values of the coefficients, such that only the most relevant ones, retaining a certain percentage (Section 3.1) of the subband energy, are sign-quantized. This way, similarly to the wavelet-based denoising algorithms, the small coefficients are assimilated to noise and therefore filtered, in order to decrease their impact in the signature matching process. We thus obtain for each of the two energy-selected packets a robust signature, each represented by 16 × 32 symbols ({−1, 0, 1}), which will be coded on 384 bytes.
The choice of the orthogonal Haar transform and of the quantization method has also been retrospectively validated by experiments (see Section 3.1).
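The subband selection of (5) and the sign quantisation of (6) can be sketched directly on the subband coefficient arrays. The example below is a hypothetical illustration: the coefficients are taken as given numpy arrays (however they were produced), and the retained energy fraction is an assumed value, not the one tuned in Section 3.1.

```python
import numpy as np

def subband_energy(w):
    """Mean energy of one wavelet-packet subband, eq. (5)."""
    return np.mean(w.astype(float) ** 2)

def select_subbands(subbands, n_keep=2):
    """Return the indices of the n_keep most energetic subbands
    (the paper retains packets 2 and 10 of the 3-level Haar tree)."""
    energies = np.array([subband_energy(w) for w in subbands])
    return np.argsort(energies)[::-1][:n_keep]

def sign_quantise(w, energy_fraction=0.9):
    """Ternary signature of eq. (6): keep the sign of the largest-magnitude
    coefficients that together retain `energy_fraction` of the subband energy,
    and set the remaining (noise-like) coefficients to 0."""
    flat = np.abs(w).ravel()
    order = np.argsort(flat)[::-1]                 # descending magnitude
    cum = np.cumsum(flat[order] ** 2)
    k = np.searchsorted(cum, energy_fraction * cum[-1]) + 1
    threshold = flat[order[min(k, flat.size) - 1]]
    sig = np.sign(w)
    sig[np.abs(w) < threshold] = 0
    return sig.astype(np.int8)

# toy usage on 64 fake 16x32 subbands of an unwrapped 128x256 iris image
rng = np.random.default_rng(4)
subbands = [rng.normal(scale=1 + (i in (2, 10)) * 5, size=(16, 32)) for i in range(64)]
kept = select_subbands(subbands)                   # indices of the two selected packets
signatures = [sign_quantise(subbands[i]) for i in kept]
```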

2.5. Identification Method. The final task to be performed in an iris recognition system is the iris matching, that is, the comparison of the tested iris signature to the signatures stored in a reference database, followed by the decision classifying the iris as authentic or impostor. As mentioned before (see Section 1), different distance metrics have been proposed (Hamming, Euclidean, Manhattan) as well as different classifiers (from simple thresholds to neural network classifiers).
In [36], we proposed a global iris identification method based on a normalized Manhattan distance measure. The applied measure combines the normalized Manhattan metrics given by the signatures extracted from two wavelet packets (subbands 2 and 10). Two signatures for each individual are stored in the reference database. The tested signature is compared to both signatures of each individual and the minimum distance is retained to provide the final measure of dissimilarity. The eyelid occlusions are also taken into account [24].
However, this kind of global analysis cannot deal with distortions due to segmentation imprecision. Indeed, many parameters, related to the image acquisition process, impact the segmentation results and are responsible for the segmentation inaccuracies observed in Section 2.3.4. Especially, the optical axis, the illumination conditions, the pupil size, and the partial occlusions by eyelids or eyelashes are causes of segmentation variability: displacement of the circle or ellipse centre, change of the radius or ellipse axis parameters. This variability results in translations of the iris structures in the unwrapped image, as well as dilation or contraction, especially in the radial direction, thus disturbing the signature comparison step. Consequently, despite the demonstrated robustness of the segmentation process (Section 2.3.4), acquisition conditions lead to iris detection inaccuracies that strongly affect the identification performances [23, 24].
To cope with such local distortions due to segmentation imprecision, we propose a novel signature comparison method based on a fusion of local distances. The reason is that the local distortions are not uniform over the unwrapped image. Another idea is to give more importance to iris areas which are likely to provide more reliable information. In our method, the iris is divided into eight subregions, equivalent to eight rectangular subwindows in the unwrapped image, as illustrated in Figure 11. Blocks 1 to 4 correspond to the texture close to the pupil. As blocks 5 to 7 are related to peripheral textures, they are more prone to occlusions.

Figure 11: (a) The 8 areas in the source image, (b) corresponding to 8 rectangular blocks in the unwrapped image (128 × 256).

2.5.1. Local Distance Measures. The first step of our identification method consists in a global angular shifting to compensate for eye rotation between the reference and the tested image [34]. Local comparisons are then performed, as described in what follows.
The tested unwrapped image is extended by N pixels in both directions to authorize horizontal and vertical sliding of the subwindows with minimized side effects. N is the maximum shift, in pixels, of the subwindows.
The comparison is therefore independently realized on each rectangular block of the unwrapped image. The blocks of the tested unwrapped image slide along the vertical and horizontal directions, around their central position, while the subwindows of the reference image are fixed.
Let us denote by S_{i,{m,n}}^{T,b}(j,k) the coefficients of the signature derived from packet P_i (i = 2, 10), for the subwindow b (b ∈ [1, 8]) of the tested iris T and the translation {m, n} (m, n ∈ [−N, N]). We denote by S_i^{R,b}(j,k) its equivalent for the reference iris R. M_{{m,n}}^{T,b}(j,k) are the binary masks defining the nonoccluded coefficients (corresponding to iris texture) for the subwindow b and the translation {m, n}, and M^{R,b}(j,k) is its equivalent for the reference iris R [22]. The distances HD_{i,{m,n}}^{b} between the tested and reference subsignatures derived from packet P_i for the subwindow b and the translation {m, n} are computed as follows:

HD_{i,{m,n}}^{b} = \frac{1}{2 N_b} \sum_{j,k} \left| S_{i,{m,n}}^{T,b}(j,k) - S_i^{R,b}(j,k) \right| \, M_{{m,n}}^{T,b}(j,k) \, M^{R,b}(j,k),    (7)

where N_b is the number of coefficients equal to 1 in both masks for the subwindow b.
The local distances measured for the subwindow b and combining the two wavelet packets are obtained as

D_{{m,n}}^{b} = \sqrt{ HD_{2,{m,n}}^{b} \, HD_{10,{m,n}}^{b} }.    (8)

As each considered subband provides specific discriminating information, the use of the product as fusion rule is very pertinent, increasing the discrimination power of the classifier. Other conjunctive rules, such as the minimum, would provide less selective results.
Subsignatures are generated for all the (2N + 1) × (2N + 1) possible block translations and compared to the corresponding reference subsignatures. This process leads to a set of (2N + 1) × (2N + 1) distance measures for each block. The optimal superposition between the tested and the reference subsignatures is given by the minimum distance D_b:

D_b = \min_{m,n} D_{{m,n}}^{b},  b ∈ [1, 8].    (9)

In this way, eight distances are obtained, each corresponding to the analysis of a subwindow. The aim of the next step is the fusion of these local distances, in order to get a final measure representing the global dissimilarity of both irises.

2.5.2. Fusion of Local Distances. In the literature, diverse fusion rules are admitted, such as the minimum or maximum, the arithmetical or geometrical average [37]. In our system, we take into account some additional knowledge by giving more importance to the local distances corresponding to the most informative and reliable areas. Therefore, we choose a weighted sum as fusion rule, and the global distance between the tested and the reference iris is given by

D_w = \sum_{b=1}^{8} w_b D_b.    (10)

In this equation, the weight w_b is a combination of two weights, in order to take into account knowledge about the information quantity carried by the different subregions and their reliability. The first weight, denoted by w_b^d, represents the proportion of coefficients corresponding to nonoccluded iris texture. The second weight, w_b^s, allows giving less importance to the local measures that are statistically different from the
{m, n} of the tested iris T, and M R,b ( j, k) their equivalent for others. Thus, the influence of blocks that are likely to be
10 EURASIP Journal on Advances in Signal Processing

unreliable or nondiscriminating is reduced [24]. Let m be Table 2: EER and FRR (FAR = 0%) measured on (a) CASIA and (b)
the mean and σ the standard deviation of the eight local ISEP databases for a binary coding and the proposed coding.
distances Db obtained between the tested iris and a reference
(a)
iris (9). The weight, denoted as wsb , is given by the following
equation: EER FRR (FAR = 0%)
⎧ Binary coding 0.00% 0.45%

⎪ p, if Db < m − σ,

⎪ Proposed coding 0.00% 0.00%

wsb = ⎪ p, if Db > m + σ, (11) (b)



⎩1, otherwise, EER FRR (FAR = 0%)
Binary coding 0.21% 1.81%
where p is less than 1.
Proposed coding 0.00% 0.00%
The final weight of block b is defined by:

wb wb
wb = 8 d sk . (12) the wavelet packet coefficients and achieves also a good dis-
k
k=1 wd ws
crimination [34]. Consequently, we now refine the analysis
In our algorithm, two parameters are required: the by computing the identification performances obtained for
maximum shift N and the parameter p of the weight wsb . Both both wavelets, on both ISEP and CASIA databases. Moreover,
are obtained in a prior training phase on a representative we look for the best percentage of energy defining the sign-
subset of the database. The optimal values of parameters are quantized coefficients (Section 2.4, equation (6)). Figure 12
those which minimize the error rates: N = 3 and p = 0.8. shows the EER and the FRR (FAR = 0%) as a function of
These learnt parameters are then validated on a test set that this parameter, for both wavelets and both databases. The
is independent of the training set [24]. training set includes the first half of the individuals, and the
In what follows, we will refer to the measure derived test set the second half (as described in [24]). Note that the
from (10), (11), and (12) with N = 0 and p = 1 as global images were manually segmented to avoid misinterpretation
comparison. In this case, the measure is made globally on due to segmentation errors. The comparison method is
the nonoccluded iris texture. global.
As experiments show (see Sections 3 and 4), significant The training shows that the most suitable wavelet is the
improvement of performances is obtained when applying the Haar wavelet for both CASIA and ISEP databases. Keeping
local identification approach instead of the global one. respectively around 99% of the energy for defining the
threshold T leads to the best performances. These results
3. Validation and Performances Evaluation are confirmed on the test sets. They reinforce the idea that
some WP coefficients are more related to noise than to iris
In this Section, we propose the simulation framework texture features and therefore must have less importance
for both CASIA and ISEP databases, in order to validate in the signature comparison. Table 2 shows the error rates
parameters involved in the algorithm and to evaluate the obtained on the complete databases, for the proposed
robustness of the proposed method. In our experiments, coding (6) compared to the binary coding coupled with
we have considered 888 images of 222 classes from each a Hamming distance. The performances are significantly
of our databases. The retained ISEP images were selected improved especially on the ISEP database.
randomly. Indeed, the performances measurements should Thus, the proposed signature, extracted from manually
not be biased by different numbers of classes, in order to be segmented images, leads to an identification system without
comparable. errors. It is worth noting that coding the packet having the
In the following, we express the performance results third highest energy does not increase the separation between
in terms of false reject rate (FRR), false accept rate (FAR) authentic and impostor distributions.
and equal error rate (EER). The FAR measures the rate of
impostors accepted by the system, the FRR measures the rate 3.2. Robustness Evaluation. We now focus on the robustness
of authentics rejected by the system, and the EER is the rate evaluation of the identification process, with regards to
where FAR = FRR. Moreover, we study the error rate in high- acquisition conditions. For that, we progressively degrade
security mode: the FRR when preventing the false accept the quality of the images, by changing the illumination
errors (denoted by FRR (FAR = 0%) in what follows). conditions, blurring, and modelling optical axis deviations.
All modifications are carried out on the original images.
3.1. Signature Extraction. The energy of the WP coefficients The transformed images are unwrapped according to the
is used as discriminator for determining which subbands parameters provided by the manual segmentation and new
carry the most useful part of the information for the signatures are generated. In this way, segmentation inaccu-
signature generation, as explained in Section 2.4. The study racies do not interfere in the signature robustness analysis.
was first made using the Haar wavelet, but other wavelets Obviously, this process concerns only images of the test
could be suitable for generating the signature. Especially, database, the reference signatures being unchanged. Then,
the Biorthogonal 1.3 leads to a similar energy repartition of the performances are measured for gradually increasing
EURASIP Journal on Advances in Signal Processing 11

Learning Learning
5 50

4 40

FRR (FAR = 0%)


3 30
EER (%)

2 20

1 10

0 0
88 90 92 94 96 98 100 88 90 92 94 96 98 100
Percentage of energy defining the sign-quantized coefficients (%) Percentage of energy defining the sign-quantized coefficients (%)
(a) (b)
Test Test
5 50

4 40

FRR (FAR = 0%)


3 30
EER (%)

2 20

1 10

0 0
88 90 92 94 96 98 100 88 90 92 94 96 98 100
Percentage of energy defining the sign-quantized coefficients (%) Percentage of energy defining the sign-quantized coefficients (%)

CASIA/Haar ISEP/Haar CASIA/Haar ISEP/Haar


CASIA/Bior ISEP/Bior CASIA/Bior ISEP/Bior
(c) (d)

Figure 12: Performance results obtained for the choice of the wavelet and the percentage of energy defining the sign-quantized coefficients.
(a) EER and (b) FRR obtained on the training sets; (c) EER and (d) FRR obtained on the test sets.

Figure 15 shows the EER and the FRR (FAR = 0%) as


a function of the illumination shift k. The performances,
obtained with the global comparison method of the iris
signatures, show very good robustness to illumination
changes, since the EER and FRR remain stable in the range
(a) (b) of k [−0.4, +0.2] for the CASIA database (Figures 14(a) and
14(b)) and [0, +0.3] for the ISEP database (Figure 14(c)).
Figure 13: Examples of unwrapped images extracted from (a) ISEP These results can be linked to the grey-level repartition of
and (b) CASIA databases. the iris texture: the performances start to deteriorate when
more than approximately 2.5% of the pixels are saturated to
0 or 1. Indeed, the saturation effect leads to the removal of
defects. Measurements are done for both global and local some iris texture, explaining the observed behaviour. On the
approaches. Some visual examples are given in the next ISEP database, an underexposure leads quickly to an increase
subsections for the images represented in Figure 13. of the error rates, since the database contains at least 10% of
very dark eyes.
3.2.1. Robustness to Illumination Variations. The tested The robustness with regard to contrast changes is studied
image, denoted by I, is transformed by adding a constant in a similar way, using this time the following transformation
k, corresponding to a shift of the grey-level histogram
(overexposure or underexposure, Figure 14). The resulting
pixel values are limited to the range [0, 1]:      
         
Ia x, y = aI x, y , I x, y ∈ [0, 1],

I x, y = min max 0, I x, y + k , 1 , I x, y ∈ [0, 1].         (14)
(13) I  x, y = min max 0, Ia x, y − I a + I , 1 ,
12 EURASIP Journal on Advances in Signal Processing

(a) k = −0.4 (b) k = 0.2 (c) k = 0.3

Figure 14: Examples of illumination variations. (a, b) CASIA image, (c) ISEP image.

Illumination variation Illumination variation


50 100

40 80

FRR (FAR = 0%)


EER (%)

30 60

20 40

10 20

0 0
−0.4 −0.2 0 0.2 0.4 0.6 −0.4 −0.2 0 0.2 0.4 0.6
k k

CASIA, global analysis ISEP, global analysis CASIA, global analysis ISEP, global analysis
CASIA, local analysis ISEP, local analysis CASIA, local analysis ISEP, local analysis
(a) (b)

Figure 15: (a) EER and (b) FRR (FAR = 0%) as a function of the illumination, measured on CASIA and ISEP databases.

where I denotes the mean grey-level of the image. The 3.2.3. Robustness to Optical Axis Deviation. Ideally, the
contrast is decreased when a is lower than 1 and increased images should be acquired with the optical axis orthogonal
otherwise, with a histogram mean unchanged, however with to the eye. Nevertheless, ocular motion or head rota-
a saturation effect for the darkest and the brightest pixels tion movements cause deviations and the image might
(a > 1) (Figure 16). be nonorthogonally projected on the focal plane, causing
Figure 17 shows the EER and the FRR (FAR = 0%) as deformations of the iris structures. In this study, we simulate
a function of a. The admissible loss of contrast is around vertical and horizontal axis deviations (Figure 20).
a factor of 0.1 for the CASIA images and 0.5 for the ISEP The tests show that the algorithm is robust for images
images, which demonstrates strong robustness (Figures 16(a) with an optical axis deviation up to 15◦ , in vertical or
and 16(c)). A high contrast amplification corresponds to horizontal direction. The local comparison method reduces
some binarization of the unwrapped images (Figures 16(b) both error rates (EER, FRR (FAR = 0%)), since it allows
and 16(d)). We observe again a very good robustness, since compensating for the distortions induced by the optical axis
the admissible factors are up to 3 (ISEP) or 5 (CASIA). deviation.
A similar robustness to illumination variation has been
obtained with the local comparison method.
3.2.4. Conclusion. These experiments show that our biomet-
ric signature and identification process, applied on accurately
3.2.2. Robustness to Blurring. The images are blurred with a segmented images, lead to a zero error identification system,
Gaussian filter of increasing standard deviation σ, simulating which is furthermore very robust to acquisition conditions,
a progressive focus degradation (Figure 18). It is worth in terms of illumination variability, focusing, and optical axis
noting that the algorithm is very robust, up to a standard deviation. These results suggest that restoration techniques,
deviation equal to σ = 3 (Figures 18 and 19), for the global such as deblurring [38], could be useful to improve the
comparison approach. With the local analysis approach, segmentation process, but are not necessary for extracting
the EER slightly increases for strong blurring, however and comparing signatures. As these restoration techniques
insignificantly with respect to the gain obtained in high- are heavy and complex, and as they cannot deal with all
security mode (FRR (FAR = 0%)). types image variability (such as variability due to occlusions),
EURASIP Journal on Advances in Signal Processing 13

(a) a = 0.1 (b) a = 5 (c) a = 0.5 (d) a = 3

Figure 16: Examples of contrast variations. (a, b) CASIA image, (c, d) ISEP image.

Contrast variation Contrast variation


5 60

4 50

FRR (FAR = 0%)


40
3
EER (%)

30
2
20
1 10

0 0
10−1 100 101 10−1 100 101
a a

CASIA, global analysis ISEP, global analysis CASIA, global analysis ISEP, global analysis
CASIA, local analysis ISEP, local analysis CASIA, local analysis ISEP, local analysis
(a) (b)

Figure 17: (a) EER and (b) FRR (FAR = 0%) as a function of the contrast variation, measured on CASIA and ISEP databases.

Table 3: EER and FRR (FAR = 0%) obtained with automatically


segmented databases, for both global and local comparison meth-
ods, on (a) CASIA and (b) ISEP databases. Note that the error rates
were zero on the manually segmented databases.

(a)
(a) (b)
EER FRR (FAR = 0%)
Figure 18: Examples of blurring with σ = 3. (a) CASIA image, (b) Global 1.80% [0.45% : 3.6%] 5.43% [2.7% : 8.56%]
ISEP image.
Local 1.36% [0% : 3.15%] 2.64% [0.45% : 4.95%]
(b)

EER FRR (FAR = 0%)


we prefer to address the segmentation imprecision issue Global 0.47% [0% : 1.36%] 2.3% [0.45% : 4.52%]
at the signature comparison step. In the next section, we Local 0.47% [0% : 1.82%] 1.34% [0% : 3.18%]
will evaluate the benefits of the local analysis method, for
dealing with the distortions induced by the segmentation
Table 4: Performances reached by our local analysis method,
inaccuracies.
compared with the literature.

EER FRR (FAR = 0%)


4. Global Performances Daugman’s algorithm [40] 1.44% 3.41%
Proença [40] 1.01% 2.39%
Up to now, the simulations were done on manually seg-
mented images. In the followings, we introduce the segmen- Chen et al. [15] 0.79% Non evaluated
tation process described in Section 2.3. So, the identification Proposed 1.36% 2.64%
system is fully automatic. Figure 21 shows the performances
obtained with both global and local analysis methods. The
EER and the FRR (FAR = 0%) are detailed in Table 3. The As expected, there is a performance loss due to segmen-
95% confidence intervals have been calculated using the tation imprecision, but this is significantly reduced by the
bootstrap method [39]. local comparison, especially in high-security mode. Indeed,
14 EURASIP Journal on Advances in Signal Processing

Blurring Blurring
1 16
14
0.8
12

FRR (FAR = 0%)


0.6 10
EER (%)

8
0.4 6
4
0.2
2
0 0
0 1 2 3 4 5 0 1 2 3 4 5
σ σ

CASIA, global analysis CASIA, local analysis CASIA, global analysis CASIA, local analysis
ISEP, global analysis ISEP, local analysis ISEP, global analysis ISEP, local analysis
(a) (b)

Figure 19: (a) EER and (b) FRR (FAR = 0%) as a function of the standard deviation of the blurring Gaussian filter, measured on CASIA and
ISEP databases.

Vertical axis Y Vertical axis Y


2.5 30

2 25
FRR (FAR = 0%)

20
1.5
EER (%)

15
1
10
0.5 5

0 0
0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40
Angular deviation (◦ ) Angular deviation (◦ )
(a) (b)

Horizontal axis X Horizontal axis X


1.4 15

1.2
FRR (FAR = 0%)

1
10
EER (%)

0.8
0.6
5
0.4
0.2

0 0
0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40
Angular deviation (◦ ) Angular deviation (◦ )

CASIA, global analysis ISEP, global analysis CASIA, global analysis ISEP, global analysis
CASIA, local analysis ISEP, local analysis CASIA, local analysis ISEP, local analysis
(c) (d)

Figure 20: (a, c) EER and (b, d) FAR (FRR = 0%) as a function of the optical axis deviation in the vertical (a, b) and horizontal (c, d)
directions, measured on CASIA and ISEP databases.
EURASIP Journal on Advances in Signal Processing 15

CASIA ISEP
6 2.5

5 2
4
1.5

FRR (%)
FRR (%)

3
1
2

1 0.5

0 0
10−2 10−1 100 101 102 10−2 10−1 100 101 102
FAR (%) FAR (%)

Global Global
Local Local
(a) (b)

Figure 21: ROC curves obtained with automatically segmented databases, for both global and local comparison methods, on (a) CASIA and
(b) ISEP databases.

the local analysis allows compensating for segmentation Experimental results demonstrate their efficiency in terms
inaccuracies, as explained in [24]. The results obtained for of error rates and the benefit of our local-based signa-
the ISEP database (Figure 21(b)) confirm the analysis. ture scheme for compensating segmentation imprecision.
Finally, the performances of the fully automatic system, Moreover, a robustness evaluation of such signature scheme,
relying on the local analysis method, have been compared with regards to the acquisition conditions, has been carried
with those published in the literature [15, 40], on the CASIA out. The performance comparison, conducted on different
databases. In [40], Proença compared Daugman’s algorithm iris-databases, highlights the efficiency of our identification
[2] with his own approach, on the CASIA-IrisV3-Interval step in normal conditions, as well as its robustness to the
database with a subset of 800 images and 80 subjects. most common degradation factors: illumination variations,
Similarly, Chen et al. [15] evaluated the EER performance optical axe deviation, and focus degradation. The overall
on the CASIA V1 database with a subset of 756 images and performances of the system could be further increased
108 subjects. In all these experiments, the images were hand- by improving the accuracy of the automatic segmentation
selected to eliminate incorrectly segmented irises, following a method. Moreover, the presented system could be jointly
selection procedure similar to that presented in Sections 2.1 used with other biometrics approaches (fingerprint, face
and 2.3.4. recognition, etc.) in the design of a multimodal high-security
As it can be seen in Table 4, we obtain similar results for system.
approximately the same number of iris images (888). How-
ever our database contains two or three more subjects (222)
and so, these results demonstrate a very good robustness with
References
respect to the increase of subjects. [1] J. Daugman, “Biometric personal identification system based
on iris analysis,” US patent no. 5291560, March 1994.
[2] J. Daugman, “High confidence visual recognition of persons
5. Conclusion and Perspectives by a test of statistical independence,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp.
In this paper, we presented a ready-to-use iris identification 1148–1161, 1993.
system. Several methods have been proposed and validated [3] R. P. Wildes, “Iris recognition: an emerging biometrie tech-
for iris segmentation, signature extraction and comparison. nology,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1348–1363,
Firstly, a precise and robust pupil and iris segmentation 1997.
scheme, taking into account eyelid occlusions, is described. [4] T. Tambouratzis and M. Masouris, “GA-based iris/sclera
The segmentation process is done automatically and its boundary detection for biometric iris identification,” in
Proceeding of the International Conference on Adaptive and
performance is objectively measured. The measurements,
Natural Computing Algorithms, pp. 457–466, 2007.
realized on two different types of images (visible domain [5] J. Daugman, “New methods in iris recognition,” IEEE Trans-
and near infrared), demonstrate that 90% of the irises are actions on Systems, Man, and Cybernetics, Part B, vol. 37, no. 5,
detected with less than 10% of global error. pp. 1167–1175, 2007.
A wavelet-packet-based signature extraction method, [6] J. Daugman, “How iris recognition works,” in Proceedings of
as well as a novel identification approach, based on the the IEEE International Conference on Image Processing, vol. 1,
fusion of local distance measures, are further proposed. pp. 33–36, 2002.
16 EURASIP Journal on Advances in Signal Processing

[7] W. Boles and B. Boashash, “A human identification technique [26] T. Randen and J. H. Husøy, “Filtering for texture classification:
using images of the Iris and wavelet transform,” IEEE Transac- a comparative study,” IEEE Transactions on Pattern Analysis
tions on Signal Processing, vol. 46, no. 4, pp. 1185–1188, 1998. and Machine Intelligence, vol. 21, no. 4, pp. 291–310, 1999.
[8] C. Sanchez-Avila, R. Sanchez-Reillo, and D. De Martin- [27] S. Livens, P. Scheunders, G. Van de Wouwer, and D. Van Dyck,
Roche, “Iris-based biometric recognition using dyadic wavelet “Wavelets for texture analysis,” in Proceeding of IEE Conference
transform,” IEEE Aerospace and Electronic Systems Magazine, on Image Processing and Its Applications, San Diego, Calif, USA,
vol. 17, no. 10, pp. 3–6, 2002. July 1997.
[9] L. Ma, “Iris recognition based on multichannel gabor filter- [28] M. Acharyya and M. K. Kundu, “Adaptive basis selection
ing,” in Proceeding of the Asian Conference on Computer Vision for multi texture segmentation by M-band wavelet packet
(ACCV ’02), Melbourne, Australia, January 2002. frames,” in Proceeding of IEEE International Conference on
[10] L. Ma, Y. Wang, and T. Tan, “Iris recognition using circular Image Processing (ICIP ’01), vol. 2, pp. 622–625, October 2001.
symmetric filters,” in Proceedings of the International Confer- [29] N. M. Rajpoot, “Texture classiffication using discriminant
ence on Pattern Recognition, vol. 16, pp. 414–417, 2002. wavelet packet subbands,” in Proceeding of IEEE Midwest
[11] L. Ma, T. Tan, Y. Wang, and D. Zhang, “Personal identification Syposium on Circuits and Systems, August 2002.
based on Iris texture analysis,” IEEE Transactions on Pattern [30] R. R. Coifman and M. V. Wickerhauser, “Wavelets and
Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1519– adapted waveform analysis,” in Wavelets: Mathematics and
1533, 2003. Applications, J. J. Benedetto and M. W. Frazier, Eds., pp. 399–
[12] S. Lim, K. Lee, O. Byeon, and T. Kim, “Efficient iris recognition 424, CRC Press, Boca Raton, Fla, USA, 1994.
through improvement of feature vector and classifier,” ETRI [31] V. M. Wickerhauser and R. R. Coifman, “Entropy-based
Journal, vol. 23, no. 2, pp. 61–70, 2001. algorithms for best basis selection,” IEEE Transactions on
[13] J. Daugman, “High confidence personal identification by rapid Information Theory, vol. 38, no. 2, pp. 713–718, 1992.
video analysis of the iris texture,” in Proceedings of the IEEE [32] M. V. Wickerhauser and A. K. Peters, Adapted Wavelet Analysis
International Carnahan Conference on Security Technology, from Theory to Software, AK Peters, Ltd., Wellesley, Mass, USA,
1992. 1994.
[14] J. Daugman, “The importance of being random: statistical [33] A. Laine and J. Fan, “Texture classification by wavelet packet
principles of iris recognition,” Pattern Recognition, vol. 36, no. signatures,” IEEE Transactions on Pattern Analysis and Machine
2, pp. 279–291, 2003. Intelligence, vol. 15, no. 11, pp. 1186–1191, 1993.
[15] Y. Chen, S. C. Dass, and A. K. Jain, “Localized iris image [34] E. Rydgren, T. Ea, F. Amiel, A. Amara, and F. Rossant, “Iris
quality using 2-D wavelets,” in Proceeding of the International features extraction using wavelet packets,” in Proceedings of the
Conference on Biometrics, pp. 373–381, 2006. International Conference on Image Processing (ICIP ’04), vol. 2,
[16] Y. Zhu, T. Tan, and Y. Wang, “Biometric personal identifica- pp. 861–864, Singapore, 2004.
tion based on iris patterns,” in Proceedings of the International [35] L. Deqiang, W. Pedrycz, and N. J. Pizzi, “Fuzzy wavelet
Conference on Pattern Recognition, vol. 15, pp. 801–804, 2000. packet based feature extraction method and its application
[17] Z. Ma, M. Qi, H. Kang, S. Wang, and J. Kong, “Iris verification to biomedical signal classification,” IEEE Transactions on
using wavelet moments and neural network,” in Proceeding Biomedical Engineering, vol. 52, no. 6, pp. 1132–1139, 2005.
of the International Conference on Life System Modeling and [36] F. Rossant, M. Torres Eslava, T. Ea, F. Amiel, and A. Amara,
Simulation (LSMS ’07), pp. 218–226, 2007. “Iris identification and robustness evaluation of a wavelet
[18] http://www.milesresearch.com/main/products.htm. packets based algorithm,” in Proceedings of the International
[19] http://www.cbsr.ia.ac.cn/IrisDatabase.htm. Conference on Image Processing (ICIP ’05), vol. 3, pp. 257–260,
[20] J. Cui, Y. Wang, T. Tan, L. Ma, and Z. Sun, “A fast and robust Genova, Italy, 2005.
iris localization method based on texture segmentation,” in [37] R. Snelick, U. Uludag, A. Mink, M. Indovina, and A. Jain,
Proceedings of the Conference on Biometric Technology for “Large-scale evaluation of multimodal biometric authenti-
Human Identification, vol. 5404, pp. 401–408, 2004. cation using state-of-the-art systems,” IEEE Transactions on
[21] Z. He, T. Tan, Z. Sun, and X. Qiu, “Robust eyelid, eyelash and Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp.
shadow localization for iris recognition,” in Proceedings of the 450–455, 2005.
International Conference on Image Processing (ICIP ’08), pp. [38] B. J. Kang and K. R. Park, “Real-time image restoration for iris
265–268, 2008. recognition systems,” IEEE Transactions on Systems, Man, and
[22] M. Adam, F. Rossant, F. Amiel, B. Mikovicova, and T. Ea, Cybernetics, Part B, vol. 37, no. 6, pp. 1555–1566, 2007.
“Reliable eyelid localization for iris recognition,” in Proceeding [39] B. Efron, Introduction to the Bootstrap, chapter 13, Stanley
of Advanced Concepts for Intelligent Vision Systems (ACIVS ’08), Thornes, Cheltenham, UK, 1994.
pp. 1062–1070, Juan-les-Pins, France, October 2008. [40] H. Proença and L. A. Alexandre, “Toward noncooperative
[23] H. Proença and L. A. Alexandre, “A method for the identifica- iris recognition: a classification approach using multiple
tion of inaccuracies in pupil segmentation,” in Proceedings of signatures,” IEEE Transactions on Pattern Analysis and Machine
the 1st International Conference on Availability, Reliability and Intelligence, vol. 29, no. 4, pp. 607–612, 2007.
Security (ARES ’06), vol. 1, pp. 224–228, April 2006.
[24] M. Adam, F. Rossant, and B. Mikovicova, “Iris identification
based on a local analysis of the iris texture,” in Proceedings
of the 6th International Symposium on Image and Signal
Processing and Analysis (ISPA ’09), pp. 523–528, Salzburg,
Austria, September 2009.
[25] C.-L. Tisse, L. Martin, L. Torres, and M. Robert, “Person
identification technique using human iris recognition,” in
Proceedings of the International Conference on Vision Interface,
Calgary, Canada, May 2002.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 847680, 26 pages
doi:10.1155/2010/847680

Research Article
The Complete Gabor-Fisher Classifier for
Robust Face Recognition

Vitomir Štruc and Nikola Pavešić


Laboratory of Artificial Perception, Systems and Cybernetics, Faculty of Electrical Engineering,
University of Ljubljana, SI-1000 Ljubljana, Slovenia

Correspondence should be addressed to Vitomir Štruc, [email protected]

Received 2 December 2009; Revised 15 April 2010; Accepted 20 April 2010

Academic Editor: Robert W. Ives

Copyright © 2010 V. Štruc and N. Pavešić. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

This paper develops a novel face recognition technique called Complete Gabor Fisher Classifier (CGFC). Different from existing
techniques that use Gabor filters for deriving the Gabor face representation, the proposed approach does not rely solely on Gabor
magnitude information but effectively uses features computed based on Gabor phase information as well. It represents one of the
few successful attempts found in the literature of combining Gabor magnitude and phase information for robust face recognition.
The novelty of the proposed CGFC technique comes from (1) the introduction of a Gabor phase-based face representation and
(2) the combination of the recognition technique using the proposed representation with classical Gabor magnitude-based
methods into a unified framework. The proposed face recognition framework is assessed in a series of face verification and
identification experiments performed on the XM2VTS, Extended YaleB, FERET, and AR databases. The results of the assessment
suggest that the proposed technique clearly outperforms state-of-the-art face recognition techniques from the literature and that
its performance is almost unaffected by the presence of partial occlusions of the facial area, changes in facial expression, or severe
illumination changes.

1. Introduction image acquisition procedure, the majority of applications


(especially those linked to unconstrained face recognition,
Biometrics is a scientific discipline that uses unique and e.g., surveillance) cannot. In such cases, image characteris-
measurable physical, biological, or/and behavioral human tics, such as changes in illumination, partial occlusions of the
characteristics that can be processed to establish identity, facial area, or different facial expressions, heavily influence
to perform identity verification, or to recognize a person the appearance of the face in the acquired image and render
through automation [1–3]. Among the different character- much of the existing face recognition technology useless
istics suitable for biometric recognition, the human face [5]. To prove useful in unconstrained environments, the
and the associated face recognition technology bear the deployed face recognition system has to utilize recognition
most potential. This potential is fueled by the countless techniques capable of providing reliable recognition results
application possibilities of face recognition technology in the regardless of the (variable) characteristics of the acquired
private as well as the public sector. Examples of potential images.
application domains range from entertainment, human- Many researchers have tackled the problem of robust
machine interaction, homeland security, smart surveillance, face recognition in uncontrolled (out-door) environments
access and border control to user authentication schemes in by trying to develop face recognition techniques insensitive
e-commerce, e-health, and e-government services [1, 2, 4]. to image degradations caused by various external factors.
While, for example, access control applications can often Sanderson and Paliwal, for example, proposed a feature
ensure stable and controlled external conditions for the extraction technique called DCT-mod2. The DCT-mod2
2 EURASIP Journal on Advances in Signal Processing

technique first applies the discrete cosine transform (DCT) filters, the filter bank can be constructed in such a way that
to subregions (or image blocks) of facial images to extract it excludes the frequency bands most affected by lighting
a number of DCT coefficients. Next, it compensates for any variations, resulting in robustness to lighting changes).
potential illumination changes, by replacing the coefficients While the existing Gabor-based methods are among the
most affected by illumination with their corresponding most successful face recognition techniques, one could still
horizontal and vertical delta coefficients. By doing so, the voice some misgivings, as they rely only on the Gabor
technique derives a face representation (partially) insensitive magnitude information and discard the potentially useful
to external lighting changes. The authors assessed their Gabor phase information. In this paper, we tackle this issue
approach on various databases comprised of images with and propose a novel face representation called oriented
illumination-induced variability. On all databases the DCT- Gabor phase congruency image, which, as the name suggests,
mod2 technique resulted in promising results [6]. is derived from the Gabor phase congruency model [15].
Gao and Leung [7] went a different way and proposed The proposed face representation is based on the phase
a face representation called Line Edge Map (LEM). Here, a responses of the Gabor filer bank rather than the Gabor
given face image is first processed with the Sobel operator magnitude responses and as such offers an alternative (or
to extract edge pixels, which are then combined into line complement) to the established Gabor magnitude methods.
segments that constitute the LEMs. While the authors We show that the face representation derived from the
suggest that the LEMs ensure illumination and expression oriented Gabor phase congruency images is more compact
invariant face recognition, the developed face representation, than the commonly used Gabor magnitude representation of
nevertheless, inherits the shortcomings of the gradient- face images and that it also exhibits an inherent robustness to
based Sobel operator, which is known to struggle with its illumination changes.
performance under severe lighting variations. The novel representation is combined with the mul-
Fidler et al. [8] tried to achieve robust face recognition ticlass linear discriminant analysis to obtain the so-called
by exploiting an elaborate subsampling procedure. The sub- phase-based Gabor-Fisher classifier (PBGFC). The devel-
sampling procedure first detects image pixels representing oped PBGFC method is ultimately fused with the GFC
statistical outliers in each of the facial images and then technique to result in the complete Gabor-Fisher classifier
derives a low-dimensional representation of each facial (CGFC), which effectively uses Gabor magnitude as well as
image by considering only valid image pixels (i.e., based on Gabor phase information for robust face recognition. The
statistical “inliers”). The authors show that their procedure feasibility of the proposed techniques is assessed in a series
ensures (partial) robustness to facial occlusions as well as to of face recognition experiments performed on the popular
extreme facial expression changes. XM2VTS, FERET, AR, and Extended YaleB databases. The
More recently, Wright et al. [9] introduced a novel results of the assessments show that the proposed technique
method for robust face recognition exploiting recent compare favorably with face recognition methods from the
advances from the field of compressed sensing. Their literature in terms of robustness as well as face recognition
method, called the Sparse Representation Classifier (SRC), performance.
derives a sparse representation from the given face image The rest of the paper is structured as follows. In Section 2
and simultaneously assumes that the image is contaminated a brief review of Gabor filters and Gabor filter base face
with a spatially sparse error. Under the assumption of recognition techniques is given. In Section 3, the novel face
the sparse error, the authors are able to construct robust representation in form of oriented Gabor phase congru-
classifiers capable of performing well under a variety of ency images is introduced. Sections 4 and 5 develop the
image degradations caused, for example, by illumination phase-based and complete Gabor-Fisher classifiers for face
changes, noise, or facial occlusions. recognition. Section 6 presents the classification rules, while
One of the most popular solutions to the problem of Section 7 describes the employed experimental databases.
robust face recognition was, however, proposed by Liu and The feasibility of the proposed technique is assessed in
Wechsler in [10]. Here, the authors proposed to adopt a Section 8. The paper concludes with some final remarks in
filter bank of forty Gabor filters to derive an augmented Section 9.
feature vector of Gabor magnitude features, and then apply
a variant of the multiclass linear discriminant analysis to
the constructed Gabor feature vector to improve the vector’s 2. Review of Gabor Filters for Face Recognition
compactness. The efficiency of the technique, named the
This section briefly reviews the use of Gabor filters for face
Gabor-Fisher Classifier (GFC), was determined on a large
recognition. It commences with the introduction of Gabor
and challenging database and is, furthermore, evidenced by
filters and the basic concepts of feature extraction using the
the large number of papers following up on the work in [10],
Gabor filter bank and proceeds with the presentation of
for example, [11–14].
the Gabor (magnitude) face representation, which forms the
It should be noted that the Gabor face representation (as
foundation for many popular face recognition techniques,
proposed in [10]) exhibits (partial) robustness to changing
including the prominent Gabor-Fisher Classifier [10].
facial expressions as well as illumination variations. The
former is a consequence of the local nature of the Gabor
feature vector, while the latter is linked to the properties of 2.1. Gabor Filter Construction. Gabor filters (also called
the Gabor filter bank (as Gabor filters represent band-limited Gabor wavelets or kernels) have proven themselves to be
EURASIP Journal on Advances in Signal Processing 3

(a) (b)

Figure 1: An example of the real (a) and imaginary (b) part of a Gabor filter.

a powerful tool for facial feature extraction and robust face


recognition. They represent complex band-limited filters
with an optimal localization in both the spatial as well
as the frequency domain. Thus, when employed for facial
feature extraction, they extract multiresolutional, spatially
local features of a confined frequency band [6]. Like all filters
operating in the scale-space, Gabor filters also relate to the
simple cells of the mammalian visual cortex and are, hence,
relevant from the biological point of few as well.
In general, the family of 2D Gabor filters can be defined
in the spatial domain as follows [10, 11, 13, 14, 16–20]:

  fu2 −(( fu2 /κ2 )x 2 +( fu2 /η2 )y 2 ) j2π fu x


ψu,v x, y = e e , (1)
πκη Figure 2: The real parts of the Gabor filter bank commonly used
for feature extraction in the field of face recognition.
where x = x cos θv + y sin θv , y  = −x sin θv + y cos θv ,
fu = fmax /2(u/2) , and θv = vπ/8. As can be seen from
the filters definition, each Gabor filer represents a Gaussian
kernel function modulated by a complex plane wave whose given face image I(x, y) with the Gabor filter ψu,v (x, y) of size
center frequency and orientation are given by fu and θv , u and orientation v [10, 11, 13, 14, 17, 19, 20], that is
respectively. The parameters κ and η determine the ratio
     
between the center frequency and the size of the Gaussian Gu,v x, y = I x, y ∗ ψu,v x, y , (2)
envelope and, when set to a fixed value, ensure that Gabor
filters of different scales behave as scaled versions of each where Gu,v (x, y) denotes the complex filtering output that
other [6]. It should also be noted that with fixed values of can be decomposed into its real (Eu,v (x, y)) and imaginary
the parameters κ and η, the scale of the given Gabor filter is (Ou,v (x, y)) parts:
uniquely defined by the value of its center frequency fu .     
While different choices of the parameters determining Eu,v x, y = Re Gu,v x, y ,
     (3)
the shape and characteristics of the filters define different Ou,v x, y = Im Gu,v x, y .
families of Gabor filters, the most common √ parameters used
for face recognition are κ = η = 2 and fmax = 0.25 Based on these results, the magnitude (Au,v (x, y)) and
[6, 10, 11, 13, 19, 20]. When using the Gabor filters for facial phase (φu,v (x, y)) responses of the filtering operation can be
feature extraction, researchers typically construct a filter computed as follows:
bank featuring filters of five scales and eight orientations, that      
is, u = 0, 1, . . . , p − 1 and v = 0, 1, . . . , r − 1, where p = 5 Au,v x, y = Eu,v
2 x, y + O2 x, y ,
u,v
and r = 8. An example of the real and imaginary parts of   
  Ou,v x, y (4)
a Gabor filter is presented in Figure 1, while the real parts φu,v x, y = arctan   .
of the entire filter bank commonly used for facial feature Eu,v x, y
extraction (comprised of 40 filters) are shown in Figure 2. The majority of Gabor-based face recognition techniques
found in the literature discard the phase information of the
2.2. Feature Extraction Using Gabor Filters. Let I(x, y) stand filtering output and rely solely on the magnitude information
for a grey-scale face image of size a × b pixels and, moreover, when constructing the Gabor face representation. By doing
let ψu,v (x, y) denote a Gabor filter given by its center so, they discard potentially valuable discriminative informa-
frequency fu and orientation θv . The feature extraction tion that could prove useful for the recognition task. The
procedure can then be defined as a filtering operation of the magnitude responses usually retained by the Gabor-based
4 EURASIP Journal on Advances in Signal Processing

(a) (b)

Figure 3: An example of the Gabor magnitude output: a sample image (a) and the magnitude output of the filtering operation with the
entire Gabor filter bank of 40 Gabor filters (b).

recognition techniques are presented in Figure 3 for a sample feature vector x can be defined as follows [10, 11, 13, 14, 19,
face image. 20]:
 T
T T T T
2.3. The Gabor (Magnitude) Face Representation. When x = g0,0 , g0,1 , g0,2 , . . . , g4,7 . (5)
deriving the Gabor (magnitude) face representation from a
It should be noted that in the experimental section, we
given facial image, the first step is the construction of the
use images of size 128 × 128 pixels and a rectangular
Gabor filter bank. As we have pointed out already, most of
sampling grid with 16 horizontal and 16 vertical lines,
the existing techniques in the literature adopt a filter bank
which corresponds to a downsampling factor of ρ =
comprising Gabor filters of five scales (u = 0, 1, . . . , 4) and
64. The resulting feature vector, or, in other words, the
eight orientations (v = 0, 1, . . . , 7).
resulting Gabor (magnitude) face representation forms the
Next, the given face image is filtered with all 40 filters
foundation for the Gabor-Fisher classifier, which will be
from the filter bank resulting in an inflation of data
presented in Section 4 in more detail.
dimensionality to 40 times its initial size. Even for a small face
image of, for example, 128 × 128 pixels, the 40 magnitude
responses reside in a 655360 (128 × 128 × 40) dimensional 3. The Oriented Gabor Phase Congruency
space, which is far too extensive for efficient processing Face Representation
and storage. Thus, to overcome this dimensionality issue,
downsampling strategies are normally exploited. The down- Up until now we have been concerned with Gabor magnitude
sampling techniques reduce the dimensionality of the Gabor responses and face representations derived from them. In
magnitude responses, unfortunately often at the expense this section, however, we will focus on face representations
of valuable discriminatory information. One of the most derived from Gabor phase responses and their usefulness for
popular downsampling strategies relies on a rectangular face recognition. The section commences by reviewing the
sampling grid (as shown in Figure 4) superimposed over existing attempts at incorporating the Gabor phase infor-
the image to be sampled. In the downsampled image only mation into the face recognition procedure and, thus, the
the values located under the sampling grid’s nodes are attempts at further improving the recognition performance
retained, while the rest is discarded. The downsampling of Gabor-based recognition techniques. Next, it presents
procedure is applied to all magnitude responses, which a novel representation of face images called the oriented
are ultimately normalized using a properly selected nor- Gabor phase congruency image, and, finally, it develops the
malization procedure and then concatenated into the final oriented Gabor phase congruency face representation, which
Gabor (magnitude) face representation or, as named by forms the basis for the phase-based Gabor-Fisher classifier
Liu and Wechsler [10], into the augmented Gabor feature presented in Section 4.
vector. (Note that typically zero-mean and unit variance
normalization is applied at this step. However, as other 3.1. Background. Before we turn our attention to the novel
normalization techniques might be superior to the zero- representation of face images, let us take a closer look at why
mean and unit-variance scheme, the issue of selecting the the Gabor phase information is commonly discarded when
most appropriate normalization procedure will empirically using Gabor filters for face recognition.
be investigated in the experimental section.) Unlike the (Gabor) magnitude, which is known to vary
If we denote the downsampled Gabor magnitude slowly with the spatial position, the (Gabor) phase takes
responses in vector form at the uth filter scale and vth very different values even if it sampled at image locations
orientation by gu,v , then the augmented Gabor (magnitude) only a few pixels apart [6, 21, 22]. This inherent instability
EURASIP Journal on Advances in Signal Processing 5

(a) (b) (c)

Figure 4: Down-sampling of a magnitude filter response with a rectangular sampling grid (from left to right): (a) an example of a magnitude
response, (b) an example of the magnitude response with a superimposed rectangular sampling grid, and (c) a downsampled magnitude
response.

of the Gabor phase makes it difficult to extract stable and the resulting representation directly for recognition rather
discriminative features from the phase responses of (4) and than solely for feature selection.
is also the primary reason that most of the existing methods The difference to the existing face recognition methods
rely solely on the Gabor magnitude when constructing the using Gabor phase information is even more pronounced
Gabor feature vector. if we consider only techniques adopting histograms of
To the best of our knowledge, there are only a few studies Gabor phase (or phase difference) patterns. (Note that
in the literature that successfully derive useful features from in the remainder of the paper we will use the term
Gabor phase responses for the task of face recognition, that is, Gabor phase patterns for all Gabor phase-based patterns
[6, 21–28]. A common characteristic of these methods is the whether they were computed from actual phase responses
fact that they do not rely on face representations constructed or phase-differences due to the similarity of the descriptors.)
directly from the Gabor phase responses; rather they use These methods alleviate the problem of phase instability
features derived from the “raw” Gabor phase responses or by observing local histograms of the phase patterns, or in
combine the phase information with other descriptors to other words, by adopting histograms of the Gabor phase
compensate for the instability of the Gabor phase. patterns computed from image subblocks as the basic image
Zhang et al. [21, 22], for example, adopt local histograms descriptors. In a sense, they (e.g., [21, 22, 26, 27]) assume
of the phase responses encoded via the local binary patterns that despite the irregular changes of the Gabor phase from
(LBPs) [29, 30] as facial descriptors and consequently show one spatial position to a neighboring one, the distribution
that, over small image areas, the Gabor phase patterns exhibit of the Gabor phase values over a small spatial area is
regularity that can be exploited for face recognition [6]. consistent and, thus, useful for recognition. Furthermore, to
A similar procedure is introduced in [27] by Guo and reduce the variability of the Gabor phase responses prior
Xu and later extended by Guo et al. in [26]. Different from to histogram construction (it should be noted that the
the procedure of Zhang et al. presented in [21], Guo et al. computed histograms serve as non-parametric estimates of
rely on Gabor phase differences instead of the “raw” phase the true Gabor phase distribution), they encode the phase
values to compute the binary patterns. In the second step, the using different versions of binary patterns, for example, [29].
computed patterns corresponding to the phase response of Unlike the presented Gabor-phase-pattern-based methods,
the Gabor filter of a given scale and orientation are grouped which exploit regularities of the Gabor phase in the spatial
to form local (subregion-based) histograms and ultimately domain, the face representation presented in this paper
concatenated into extended histograms, which encode local relies on regularities in the scale-space domain (or frequency
as well as global aspects of the given phase response. domain—see Section 2.1). Hence, it exploits a completely
Other authors (e.g., [23–25]) incorporate the Gabor different approach to overcome the instability problems
phase information by employing the so-called phase congru- related to the Gabor phase.
ency model (developed by Kovesi [15]) for edge detection As will be shown in the remainder of this section, the
in the facial image and then deploy the “edge” image for proposed representation does not offer only an efficient way
detection of interest points that are used with other image of overcoming the Gabor phase instability, but also exhibits
descriptors, such as Gabor magnitude features. several desirable properties for the task of face recognition.
The face representation developed in this paper differs
greatly from the existing Gabor phase-based approaches
presented above. It is related to work presented in [23– 3.2. The 2D Phase Congruency Model. The original 2D
25] only as far as it uses the concept of phase congruency phase congruency model as proposed by Kovesi in [15] was
for encoding the Gabor phase information. However, unlike developed with the goal of robust edge and corner detection
previous work on this subject, it derives a face representation in digital images. Unlike classical gradient-based edge detec-
based on a modified model of phase congruency and employs tors, which search for image points of maximum intensity
6 EURASIP Journal on Advances in Signal Processing

gradients and are known to be susceptible to image contrast for robust edge and corner detection, its usefulness for face
and illumination conditions, the phase congruency model recognition is at least questionable. First of all, the edges
searches for points of order in the frequency spectrum, and detected by the model are highly localized, suggesting that
provides an illumination invariant model of edge detection even small variation in facial expression or misalignment
[15, 24]. would drastically change the appearance of the PCI of a
For 1D signals, the phase congruency PC(x) is defined given subject; and, second of all, the phase congruency
implicitly by the relation of the energy at a given point in representation does not make use of multiorientational
the signal E(x) and the sum of the Fourier amplitudes An as information, which can provide important clues for the
shown by Venkatesh and Owens [31]: recognition task.
 To overcome the presented shortcomings, we propose in
E(x) = PC(x) An , (6) this paper a novel face representation, called the oriented
n
Gabor phase congruency image (OGPCI). Rather than com-
where n denotes the number of Fourier components. Thus, bining phase congruency information computed over several
phase congruency at a given location of the signal x is defined orientations, and using the result for construction of the
as the ratio of the local energy at this location and the sum of facial feature vector, we compute the oriented form of phase
Fourier amplitudes. congruency for each of the employed filter orientations and
Kovesi extended the above concept to 2D signals by construct an augmented Gabor phase congruency feature
computing the phase congruency with logarithmic Gabor vector based on the results [6]. Note that differently from the
filters using the following expression: original model of Kovesi [15], we deploy conventional Gabor
r −1  p−1    
filter as given by (1) rather than logarithmic Gabor filters.
  v=0 u=0 Au,v x, y ΔΦu,v x, y Taking into account the original definition of phase
PC2D x, y = r −1  p−1   , (7) congruency, we derive an oriented form of phase congruency,
v=0 u=0 Au,v x, y + ε
which, when presented in image form, reveals the OGPCI for
where Au,v (x, y) denotes the magnitude response of the the vth orientation:
logarithmic Gabor filter at scale u and orientation v, ε  p−1    
  u=0 Au,v x, y ΔΦu,v x, y
represents a small constant that prevents divisions with zero, OGPCI x, y =  p−1   . (9)
and ΔΦu,v (x, y) stands for a phase deviation measure defined u=0 Au,v x, y + ε
as Some examples of the OGPCIs for different number of
       employed filter scales p and a fixed orientation of θv = 0◦ are
ΔΦu,v x, y = cos φu,v x, y − φv x, y
shown in Figure 6. We can see that the choice of the number
2     2
(8)
2 2 of filter scales p influences the appearance of the OGPCIs
− 2sin φu,v x, y − φv x, y 2.
and for optimal face recognition performance should be set
Here φu,v (x, y) denotes the phase angle of the logarithmic based on preliminary results on some development data (see
Gabor filters at the uth scale and vth orientation, while φv (z) Section 8.4 for more details).
represents the mean phase angle at the vth orientation.
  p−1 3.4. The Oriented Gabor Phase Congruency Face Representa-
Clearly, the expression rv−=10 u=0 Au,v (x, y)ΔΦu,v (x, y)
approximates the local energy at the spatial location (x, y), tion. The OGPCIs introduced in the previous section form
while the denominator of (7) represents the sum of the the foundation for the derivation of the oriented Gabor
(logarithmic) Gabor amplitudes over all orientations and phase congruency face representation or, in accordance
scales. Obviously, the phase congruency as defined by (7) with the notation used by Liu and Wechsler in [10],
represents a quantity that is independent of the overall the augmented Gabor phase congruency vector, which is
magnitude of its underlying signal and is, hence, invariant computed by taking the following steps.
to changes in contrast and illumination [6, 15, 24, 25]. The (i) For a given face image the OGPCIs are computed for
model detects points in an image where the logarithmic all r orientations (an example of all OGPCIs for a
Gabor filter responses are maximally in phase, or in other sample image with r = 8 and p = 2 is presented in
words, scans the logarithmic Gabor phase responses for Figure 7).
regularities in the scale-space. (ii) The computed OGPCIs are downsampled by a down-
At closer examination of the 2D phase congruency model sampling factor ρ (similar as depicted in Figure 4).
we can notice that it first computes the phase congruency
(iii) The downsampled OGPCIs are normalized using an
for each of the employed filter orientations and subsequently
appropriate normalization procedure.
combines the results to form the final output. Some examples
of a facial image subjected to logarithmic Gabor filter banks (iv) The downsampled and normalized OGPCIs in vector
with different numbers of scales p and orientations r are form (denoted by Dv ) are concatenated to form the
shown in Figure 5. We can see that both parameters effect the augmented Gabor phase congruency feature vector x.
appearance of the resulting phase congruency image (PCI). Formally, the augmented Gabor phase congruency fea-
ture vector can be defined as follows:
3.3. The Oriented Gabor Phase Congruency Model. While  T
the 2D phase congruency model given by (7) is suitable x = DT0 , DT1 , DT2 , . . . , DTr−1 , (10)
EURASIP Journal on Advances in Signal Processing 7

(a) (b) (c) (d) (e)

Figure 5: Examples of phase congruency images (from left to right): (a) the original face image, (b) the PCI for p = 3 and r = 6, (c) the PCI
for p = 5 and r = 6, (d) and the PCI for p = 3 and r = 8, (e) the PCI for p = 5 and r = 8.

(a) (b) (c) (d) (e)

Figure 6: Examples of OGPCIs (from left to right): (a) the original face image, (b) the OGPCI for p = 2, (c) the OGPCI for p = 3, (d) and
the OGPCI for p = 4, (e) the OGPCI for p = 5.

where T denotes the transform operator and Dv , for v = classes C1 , C2 , . . . , CN , one first computes the between-class
0, 1, . . . , r − 1, represents the vector form of the OGPCI at and the within-class scatter matrices ΣB and ΣW :
the vth orientation.
Note that in the experiments presented in Section 8 
N   T
the augmented Gabor phase congruency feature vector was ΣB = ni μi − μ μi − μ ,
i=1
constructed using a downsampling factor of ρ = 16, as (11)
opposed to the augmented Gabor magnitude vector, where N  
  T
a downsampling factor of ρ = 64 was employed. This setup ΣW = x j − μi x j − μi ,
led to similar lengths of the two augmented feature vectors i=1 x j ∈Ci
allowing for a fair comparison of their usefulness for face
recognition [6]. and then one derives the LDA transformation matrix W
which maximizes Fisher’s discriminant criterion [32, 33]:
2 T 2
2W ΣB W2
4. The Gabor-Fisher and Phase-Based J(W) = arg max , (12)
W |W T Σ W W |
Gabor-Fisher Classifiers
where ni denotes the number of samples in the ith class, μi
We have already emphasized that both the augmented stands for the class conditional mean and μ represents the
Gabor magnitude feature vector the augmented Gabor global mean of all training samples [4].
phase congruency feature vector, despite the downsampling Fisher’s discriminant criterion is maximized when W is
procedure, still reside in a very high-dimensional space. constructed by a simple concatenation of the d ≤ N − 1
The Gabor-Fisher Classifier presented by Liu and Wechsler leading eigenvectors of the following eigenproblem:
[10] and the Phase-based Gabor-Fisher Classifier introduced
−1
here overcome this dimensionality issue by subjecting the ΣW ΣB wi = λi wi , i = 1, 2, . . . , d , (13)
augmented feature vectors to Fisher’s Discriminant Analysis
(also know as Linear Discriminant Analysis). The subspace that is, W = [w1 w2 · · · wd ].
projection reduces the size of the augmented feature vectors Once the transformation matrix W is calculated, it can
and allows for an efficient implementation of the matching be used to project a test pattern (i.e., an arbitrary augmented
procedure. feature vector) x into the LDA subspace, thus reducing the
The employed dimensionality reduction technique (i.e., pattern’s dimension from d to d :
LDA) derives a transformation matrix (i.e., the projection  
basis) which is used to project the augmented feature y = WT x − μ , (14)
vectors into a subspace where between-class variations of
the projected patterns are maximized while within-class where y represents the d -dimensional projection of the
variations are minimized [32]. centered pattern x [4]. To avoid singularity issues, when
Given a set of n d-dimensional training patterns (i.e., computing the inverse of the within-class scatter matrix ΣW ,
augmented feature vectors) xi arranged into a d × n data LDA is implemented in the PCA subspace as suggested by
matrix X = [x1 , x2 , . . . , xn ], each belonging to one of N Belhumeur et al. in [33].
8 EURASIP Journal on Advances in Signal Processing

(a) (b)

Figure 7: An example of all OGPCIs: the original face image (a), the OGPCIs (for r = 8) (b), which in their downsampled and normalized
form constitute the augmented Gabor phase congruency vector.

When the presented technique is applied to the aug- [35]. In verification mode, the goal of the system is to
mented Gabor magnitude feature vectors, we obtain the determine the validity of the identity claim made by the
Gabor-Fisher Classifier and, similarly, if the underlying user currently presented to the system. This is achieved by
feature vectors take the form of the augmented Gabor phase comparing the so-called “live” feature vector y extracted
congruency vectors, we obtain the Phase-based Gabor-Fisher from the given face image of the user with the template
Classifier. corresponding to the claimed identity. Based on the outcome
of this comparison, the identity claim is either rejected or
5. The Complete Gabor-Fisher Classifier accepted.
Formally this can be written as follows: given the live
The presented Gabor-Fisher and phase-based Gabor-Fisher feature vector y and a claimed identity Ci associated with
classifiers (GFC and PBGFC, resp.) operate on different a user-template yi , where i ∈ 1, 2, . . . , N and N represents
feature types derived from the Gabor filter responses. Since the number of enrolled users, determine the validity of the
the first relies on Gabor magnitude information and the identity claim by classifying the pair (y, Ci ) into one of two
second encodes Gabor phase information, we combine classes w1 or w2 [2, 36]:
both classifiers into the Complete Gabor-Fisher Classifier ⎧  
(CGFC), which should exhibit enhanced face recognition   ⎨w1 , if δ y, yi ≥ Δ, i = 1, 2, . . . , N,
y, Ci ∈ ⎩ (16)
performance when compared to either of the classifiers on w2 , otherwise,
their own.
The fusion of the classifiers is implemented at the where w1 denotes the class of genuine identity claims, w2
matching score level using the fusion scheme shown in stands for the class of illegitimate identity claims, δ(·, ·) rep-
Figure 8. Here, the final matching score of the CGFC method resents a function measuring the similarity of its arguments,
δCGFC is computed using the following expression [34]: which in our case takes the form of the cosine similarity
  measure, that is,
δCGFC = 1 − γ δGFC + γδPBGFC , (15)
  yT yi
where δGFC denotes the matching score obtained with the δ y, yi = , (17)
yT yyiT yi
GFC technique, δPBGFC denotes the matching score obtained
with the PBGFC approach and γ ∈ [0, 1] denotes the fusion and Δ stands for a predefined decision threshold.
parameter that controls the relative importance of the two In a face recognition system operating in the identifi-
matching scores. ( Note that the matching scores for the cation mode the problem statement is different from that
individual classifiers are computed based on the procedure presented above. In case of the identification task we are not
described in Section 6.) When set to γ = 0, the CGFC interested whether the similarity of the live feature vector
method turns into the GFC method, when set to γ = 1, with a specific user-template is high enough; rather, we are
the CGFC technique turns into the PBGFC technique, while looking for the template in the database that best matches the
for any other value of γ the CGFC technique considers live feature vector. This can be formalized as follows: given
both feature types. It has to be noted that the value of the a live feature vector y and a database containing N user-
fusion parameter γ should be optimized for the best possible templates y1 , y2 , . . . , yN of the enrolled users (or identities)
performance (see Section 8.5). C1 , C2 , . . . , CN , determine the most suitable identity [2], that
is,
6. The Classification Rule ⎧    

⎨ Ci , N
if δ y, yi =max δ y, y j ≥ Δ,
In general, a face recognition system can operate in one of y∈⎪ j =1 (18)
⎩C
two modes, either in verification or in identification mode N+1 , otherwise,
EURASIP Journal on Advances in Signal Processing 9

LDA
Matching

PBGFC Final
Fusion matching
GFC score

LDA
Matching

CGFC

Figure 8: Block scheme of the Complete Gabor-Fisher Classifier.

where δ(y, yi ) again denotes the cosine similarity measure


and CN+1 stands for the case, where no appropriate identity
from the database can be assigned to the live feature vector
y. The presented expression postulates that, if the similarity
of the live feature vector and the template associated with the
ith identity is the highest among the similarities with all user-
templates in the system, then the ith identity is assigned to
the live feature vector y.
It should be noted that, in the experiments presented
in the remainder of this paper, the user-templates are Figure 9: Sample images from the image part of the XM2VTS
constructed as the mean vectors of the feature vectors database.
extracted from the enrollment images of the users.

(as shown in Figure 9). Thus, images of the same subject


7. The Databases and Experimental Setups differ in terms of hair-style, presence or absence of make-
This section presents the experimental databases, setups up and glasses, pose, expression, and so forth. Since two
and performance measures used to assess the feasibility images were taken at each of the four recording sessions, 8
of the Complete Gabor-Fisher Classifier (CGFC) for face facial images are featured in the database for each of the 295
recognition. Four popular face databases are selected for subjects.
the experiments presented in the next section, namely, the To ensure that our results are comparable to other
XM2VTS database [37], the Extended YaleB database [38, results obtained on the XM2VTS database and published
39], the FERET database [40], and the AR database [41, 42]. in the literature, we follow the first configuration of the
These databases are employed either in face verification or experimental protocol (for face verification) associated with
face identification experiments to demonstrate the effective- the database, also known as the Lausanne protocol [37].
ness and robustness of the proposed CGFC framework. The first configuration of the protocol was chosen for the
experiments, since it is considered to be the most difficult
of the different experimental configurations defined by the
7.1. The XM2VTS Database. The XM2VTS database is Lausanne protocol. As stated by the protocol, we split the
a large multimodal database featuring image, video and subjects of the database into two disjoint groups of 200
speech data of 295 subjects [37]. For the experiments clients and 95 impostors (25 evaluation impostors and 70
presented in Section 8 we adopt only the (face) image test impostors). (Note that the term client refers to a subject
part of the database—the datasets labeled as CD001 and making a legitimate identity claim, while the term impostor
CD006. These two datasets contain a total of 2360 images refers to an user making an illegitimate identity claim.) These
that were captured in four separate recording sessions. The two groups are then further partitioned into image sets
recording sessions were distributed evenly over a period of employed for training, evaluation and testing. Specifically,
approximately five months and at each session the external the first configuration of the protocol results in the following
conditions were controlled. This means that all images were experimental setup [35]:
taken against a more or less uniform background, that good
illumination conditions were present during the recording, (i) number of training images: 3 per client (600 in total),
and that only small tilts and in-plane rotations were allowed.
The described recording setup resulted in the facial images (ii) number of client verification attempts on the evalua-
exhibiting variations mainly induced by the temporal factor tion image sets: nce = 600 (3 × 200),
10 EURASIP Journal on Advances in Signal Processing

(i) number of training images: 7 per client (265 in total),


(ii) number of identification experiments with images
from subset 2: ns2 = 456,
(iii) number of identification experiments with images
from subset 3: ns3 = 455,
(iv) number of identification experiments with images
from subset 4: ns4 = 525,
Figure 10: Sample images from the Extended YaleB database. (v) number of identification experiments with images
from subset 5: ns5 = 714.

(iii) number of impostor verification attempts in the It should be noted that not all subjects from the database
evaluation image sets: nie = 40000 (25 × 8 × 200), are represented with an equal number of images due to
difficulties during the image acquisition stage. The corrupted
(iv) number of client verification attempts in the test images were excluded from the database prior to our
image sets: nct = 400 (2 × 200), and experiments. This exclusion resulted in less than the initial
(v) number of impostor verification attempts in the test 64 images being available for each of the 38 subjects and in
image sets: nit = 112000 (70 × 8 × 200). the image subset sizes presented above.
The above numbers are obtained when matching each client
image from the evaluation or test image set against the 7.3. The FERET Database. The third database chosen for the
corresponding client template and all impostor images from evaluation of the Complete Gabor Fisher Classifier is the
the evaluation or test image sets against all client templates FERET database. The database has long been the standard
stored in the systems database. database to assess new face identification techniques, not
The training set is used to train the system, that is, to only due to its size but also due to the great challenges
generate the face space where facial images are compared, that it poses to the existing face recognition technology. The
and to build the client templates (or models). The evaluation images in the database differ in terms of facial expression,
image set is employed to tune any potential parameters of the illumination, age and ethnicity.
face space and adjust the decision threshold, while the test set For our experiments we adopt the standard FERET
is used exclusively for the final performance assessment with evaluation protocol, where 1196 frontal face images of
predefined system parameters. 1196 subjects are defined as gallery (target) images, and
four different probe (query/test) sets are employed for
determining the recognition rates of the assessed techniques,
7.2. The Extended YaleB Database. The second database used that is, [40, 43]:
in our experiments is the Extended YaleB (EYB) database [38,
39]. Different from the XM2VTS database, the EYB database (i) the Fb probe set, which contains 1195 images exhibit-
is used in our experiments to assess the relative usefulness of ing different facial variations in comparison to the
the CGFC method for face identification. gallery images (nsFb = 1195),
The EYB database was recorded at the Yale University and (ii) the Fc probe set, which contains 194 images exhibit-
features 2415 frontal face images of 38 subjects. Different ing different illumination conditions in comparison
from the XM2VTS database, images of the EYB were to the gallery images (nsFc = 194),
captured at a single recording session in a relatively short
time. Hence, the images are free from severe expression- (iii) the Dup I probe set, which contains 722 images
changes and session-induced variability, but exhibit large acquired between one minute and 1031 days after the
variations in illumination, as shown in Figure 10. corresponding gallery images (nsDupI = 722),
To make the experimental protocol as challenging as (iv) the Dup II probe set, which contains 234 images
possible, we partition the EYB database into five image acquired at least 18 months after the corresponding
subsets based on the extremity in illumination, as suggested gallery images (nsDupII = 234).
by the authors of the database [38, 39], and use the
first subset (the subset with images captured in “good” Some examples of these images are shown in Figure 11. It
illumination conditions) for training and the remaining should be noted that the standard experimental protocol
four subsets for testing. This setup results in highly miss- associated with the FERET database does not define a
matched conditions between the training and test images and fixed set of training images. Therefore, we select the most
represents quite a challenge to the recognition techniques commonly adopted training set of 1002 images (corre-
used. Furthermore, it is also in accordance with real- sponding to 428 subjects) for our experiments. (Please visit
life settings, where the enrollment stage can typically be http://luks.fe.uni-lj.si/en/staff/vitomir/index.html for the list
supervised, while the operation environment is unknown in of training images used in our experiments.)
advance and can feature arbitrary conditions. Specifically, the
presented partitioning of the database results in the following 7.4. The AR Database. The last database employed in the
experimental setup: experimental section is the AR database [41]. The database
EURASIP Journal on Advances in Signal Processing 11

(20 per subject)—denoted by SC, L, R, F, G, GL, GR,


S, SL, and SR in Figure 12.

7.5. Performance Measures. The recognition performance of


the techniques assessed in the next section is measured by the
standard error and recognition rates commonly used in the
field of face recognition.
Figure 11: Sample images from the FERET database. For the verification experiments the false acceptance
error and false rejection error rates (FAR and FRR, resp.) as
well as the half total error rate (HTER) are used. The FAR
contains more than 4000 color images of 126 subjects and FRR are defined as follows:
taken during two separate recording sessions. While the nrc nai
variability in the face images of the AR database is caused, FRR = 100%, FAR = 100%, (19)
nc ni
for example, by different illumination and facial expressions,
the main characteristic that has popularized the database are while the HTER is given by
occlusions of the facial area due to the presence of scarves
and glasses. HTER = 0.5(FAR + FRR). (20)
Each subject in the AR database is accounted for with
26 images taken under different conditions. Figure 12 shows In the above equations nrc denotes the number of rejected
all 13 images of one subject from the AR database acquired legitimate identity claims, nc stands for the number of all
in the first recording session. The remaining 13 images legitimate identity claims made, nai denotes the number of
were recorded under the same conditions during the second accepted illegitimate identity claims, and ni represents the
recording session and are, hence, similar in appearance. number of all illegitimate identity claims made.
Following the experimental setup presented in [9], we Note that both the FAR and the FRR depend on the value
select a subset of 100 subjects (50 males and 50 females) of the decision threshold Δ (see (15)). Selecting a threshold
for our experiments. We choose the first three images from that ensures a small value of the FAR inevitably results in a
both recording sessions for training, that is, images denoted large value of the FRR and vice versa, a threshold that ensures
by N, H, and A in Figure 12 (6 images per subject, 600 in a small FRR results in a large value of the FAR. Thus, to fairly
total), and group the remaining images into a number of compare the different recognition techniques the decision
probe (or test) sets. Here, each of the probe sets is designed threshold has to be set in such a way that it ensures some
in such a way that only a predefined type of image variability predefined ratio of the FAR and FRR on some evaluation
(or combination of specific variability types) and its (their) dataset or, alternatively, the two error rates have to be plotted
impact on the CGFC technique is assessed at a time, that is: against all possible values of the decision threshold, resulting
in the so-called performance curves. For our assessment
(i) the scarves probe set is designed to assess the impact we chose the latter approach and represent the results in
of lower face occlusions on the recognition accuracy the form of Detection Error Trade-off (DET) curves, which
of the CGFC technique, and features a total of nsS = plot the FAR against the FRR at different values of Δ on a
600 images (6 per subject)—denoted by S, SL, and SR scale defined by the inverse of a cumulative Gaussian density
in Figure 12, function.
(ii) the glasses probe set is designed to assess the impact of For the performance evaluation on the test image sets we
upper face occlusions on the recognition accuracy of again use performance curves, which this time take the form
the CGFC technique, and features a total of nsG = 600 of Expected Performance Curves (EPCs) [44]. To generate an
images (6 per subject)—denoted by G, GL, and GR in EPC two separate image sets are needed. The first image set,
Figure 12, that is, the evaluation image set, is used to find a threshold
that minimizes the following weighted error function (WER)
(iii) the scream probe set is designed to assess the for different values of α:
impact of extreme facial expression variations on the
recognition accuracy of the CGFC technique, and WER(Δ, α) = αFAR(Δ) + (1 − α)FRR(Δ), (21)
features a total of nsSC = 200 images (2 per subject)—
denoted by SC in Figure 12, where α denotes a weighting factor that controls the relative
importance of the FAR and FRR in the above expression.
(iv) the lighting probe set is designed to assess the
Next, the second image set, that is, the test image set, is
impact of illumination variations on the recognition
employed to estimate the value of the HTER at the given α
accuracy of the CGFC technique, and features a total
and with the computed value of the decision threshold Δ.
of nsL = 600 images (6 per subject)—denoted by L,
When plotting the HTER (obtained on the test image sets)
R, and F in Figure 12,
against different values of the weighting factor α, an example
(v) the all probe set is designed to assess the robustness of the EPC is generated.
of the CGFC technique to various types of image For the identification experiments we provide results not
variability, and features a total of nsA = 2000 images in the form of error rates, but rather in form of recognition
12 EURASIP Journal on Advances in Signal Processing

Figure 12: Sample images of one subject from the AR database taken at the first photo session. The images exhibit the following
characteristics: (upper row—from left to right) neutral face (N), happy face (H), angry face (A), screaming face (SC), left light (L), right
light (R), and frontal light (F); (lower row—from left to right) occluded by glasses (G), occluded by glasses and lit from left (GL), occluded
by glasses and lit from right (GR), occluded by a scarf (S), occluded by a scarf and lit from left (SL), and occluded by a scarf and lit from
right (SR).

Table 1: Results of the face identification experiments on the EYB database for varying lengths of the PCA and LDA feature vectors.

PCA LDA
NOF NOF
subset 2 subset 3 subset 4 subset 5 subset 2 subset 3 subset 4 subset 5
10 56.6 29.5 11.2 15.6 5 98.3 56.9 9.9 13.6
50 93.4 54.9 16.7 22.0 10 100 85.3 27.2 29.7
100 93.6 54.9 16.7 22.0 20 100 97.8 47.0 43.7
150 93.6 54.9 16.7 22.0 30 100 99.3 53.6 47.6
200 93.6 54.9 16.7 22.0 37 100 99.8 56.3 51.0

rates. To this end, we compute the so-called rank one preceding the assessment of the proposed face recognition
recognition rate (ROR) for each of the probe (test) sets of approach and continues by analyzing the results of the
the given database. Here, the ROR is defined as follows: assessment.
nsi
ROR = 100%, (22) 8.1. Image Preprocessing. Before we turn our attention to the
ns
experiments, let us say a few words on the preprocessing
where nsi denotes the number of images successfully assigned preceding our experiments. Since we are concerned with
to the right identity and ns stands for the overall number of the recognition of faces from digital images and not the
images trying to assign an identity to. performance of facial detectors, we assume that all facial
In addition to the ROR, we also make use of cumulative images are properly localized and aligned. In any case, the
match characteristic (CMC) curves, which represent perfor- results presented in the remainder of the paper can be
mance curves for biometric recognition systems operating considered as an upper bound on the performance with
in identification mode. While the ROR carries information a properly working face detector. The reader is referred
about the percentage of images where the closest match elsewhere for details on how to obtain properly localized face
in the database corresponds to the correct identity, it is images, for example, [45, 46].
sometimes also of interest whether the correct identity is To localize the facial region in the experimental images,
among the top r ranked matches (where r = 1, 2, . . . , N we use the eye coordinates provided with the four databases.
and N denotes the number of subjects in the database Based on these coordinates, we first rotate and scale the
of the biometric recognition system). This is especially images in such a way that the centers of the eyes are located
important for law enforcement applications, where the top r at predefined pixel positions. This procedure ensures that
matches can additionally be inspected by a human operator. the images are aligned with each other. Next, we crop the
When computing the recognition rate for the rth rank, facial region to a standard size of 128 × 128 pixels and finally
the identification procedure is considered successful if the normalize the cropped region using histogram equalization
correct identity is among the top r ranked matches and is followed by the zero-mean and unit-variance normalization.
considered unsuccessful otherwise. Plotting the computed Some examples of the facial images from the XM2VTS,
recognition rates against the corresponding rank results in EYB, AR and FERET databases processed with the described
an example of the CMC curve. procedure are presented in Figure 13.

8. Experiments and Results 8.2. The Baseline Performance. The first series of recognition
experiments assesses the performance of some baseline face
This section presents the experiments with the CGFC tech- recognition techniques and adjusts the parameters of the
nique. It commences by describing the basic preprocessing techniques for the best possible performance. It should be
EURASIP Journal on Advances in Signal Processing 13

the three operating points. Note that for the PCA technique
the DET curves are generated only for feature vectors with
300 or less features, while the cross sections (in Figure 14(b))
contain the entire span of feature vector lengths. The reason
for such a presentation of the results lies in the fact that the
performance of the PCA technique saturates at feature vector
lengths of around 200, and, thus, the DET curves, if shown
up to the maximum number of features possible, would be
illegible.
Different from the PCA case, where the improvements
in the face verification performance become marginal with
the increase of the number of features once a certain feature
vector length is reached, the performance of LDA technique
steadily increases with the increase of the feature vector
length. This setting is evidenced by the DET curves as well
as their cross sections presented in Figures 14(c) and 14(d),
respectively.
Similar observations as with the XM2VTS database can
also be made with the EYB database, only this time for
the task of face identification. From the results of the
experiments on the EYB database presented in Table 1 in
form of rank one recognition rates, we can again see that
Figure 13: Examples of preprocessed images of (from top to the PCA technique saturates in performance with less than
bottom): the XM2VTS database, the EYB database, the AR database, the maximum possible feature vector length and reaches its
and FERET database. top performance with 150 features. The LDA technique, on
the other hand, once more requires the entire set of (in this
case 37) features to obtain the best possible performance
achievable with this technique. It is evident that, for the
noted that, at this point, only two out of four experimental optimal performance, the LDA technique requires 100% of
databases (i.e., the XM2VTS and EYB databases) are used in the features comprising its feature vectors, while the PCA
the experiments. The findings from this series of experiments approach saturates in its performance with a feature vector
are ultimately employed with the remaining two databases length which on both databases ensures that approximately
in the remainder of this section. Such an experimental 98% of the (training set) variance is retained. (Note that
configuration reflects real-life settings, where the parameters the 200 features of the XM2VTS correspond to 97.67% of
of the adopted recognition technique have to be set in the variance, while the 150 features of the EYB database
advance on an independent (generic) database due to the account for 98.44% of the variance in the training data.)
fact that the actual test images are not available in the The role of the presented experimental results is twofold: (i)
optimization stage. they provide a baseline performance on the two databases
We select the Principal Component Analysis (PCA) [47] for the following comparative studies with the phase-based
and the Linear Discriminant Analysis (LDA) [33, 48] as and complete Gabor-Fisher classifiers, and (ii) they serve as
our baseline techniques and assess their performance with a guideline for selecting the feature vector lengths on the
different numbers of features (NOFs) in the PCA and LDA remaining two databases.
feature vectors. Considering the number of subjects and
available training images in each of the two databases, the
maximum length of the feature vector for the PCA method is 8.3. The Baseline Performance of the Gabor-Based Classifiers.
599 for the XM2VTS database and 264 for the EYB database, The second series of face recognition experiments assesses
while the maximum number of features constituting the LDA the performance of the classical Gabor-Fisher Classifier
feature vectors equals 199 for the XM2VTS database and 37 and the novel Phase-based Gabor-Fisher Classifier and,
for the EYB database. furthermore, evaluates the relative usefulness of additional
With the presented experimental setup in mind, let us normalization techniques applied to the augmented (Gabor
first turn to the results of the assessment on the XM2VTS magnitude and Gabor phase congruency) feature vectors
database presented in Figure 14. Here, the graphs depicted prior to the deployment of LDA for dimensionality reduc-
in Figures 14(a) and 14(c) represent DET curves generated tion. It has to be noted that commonly only a zero-mean
during the experiments with the PCA and LDA techniques, and unit-variance normalization is applied to the Gabor
respectively. The plots in Figures 14(b) and 14(d), on the magnitude features, usually with the justification of adjusting
other hand, were created by computing the HTER at three the dynamic range of the responses to a common scale.
characteristic operating points, that is, at FAR = FRR, at However, as will be shown in this section, the same result can
FAR = 0.1FRR, and at FAR = 10FRR, on the DET curves and also be achieved with other normalization techniques, which
thus in a sense represent cross sections of the DET curves at can also have a positive effect on the final face recognition
14 EURASIP Journal on Advances in Signal Processing

24
40 22
20

20 18

Half total error rate (%)


False rejection rate (%)

16
10 14
5 12
10
2
8
1
6
0.5
4
0.2
2
0.1
0
0.1 0.2 0.5 1 2 5 10 20 40 0 60 120 180 240 300 360 420 480 540 600
False acceptance rate (%) Number of features (NOF)

NOF = 10 NOF = 100 At FAR = FRR


NOF = 30 NOF = 200 At FAR = 10FRR
NOF = 50 NOF = 300 At FAR = 0.1FRR

(a) DET curves for the PCA technique (b) Cross sections of the DET curves for the PCA technique
24
40 22
20

20 18
Half total error rate (%)
False rejection rate (%)

16
10 14
5 12
10
2
8
1
6
0.5
4
0.2
2
0.1
0
0.1 0.2 0.5 1 2 5 10 20 40 0 20 40 60 80 100 120 140 160 180
False acceptance rate (%) Number of features (NOF)

NOF = 10 NOF = 100 At FAR = FRR


NOF = 25 NOF = 150 At FAR = 10FRR
NOF = 50 NOF = 199 At FAR = 0.1FRR

(c) DET curves for the LDA technique (d) Cross sections of the DET curves for the LDA technique

Figure 14: Results of the face verification experiments on the XM2VTS database for varying lengths of the PCA and LDA feature vectors.

performance. Note that again only the XM2VTS and EYB also chosen for this series of experiments. We follow three
databases are employed in the experiments. different strategies to normalize the downsampled Gabor
We implement the traditional Gabor Fisher Classifier as magnitude responses (GMRs) and the oriented Gabor phase
well as the Phase-based Gabor-Fisher Classifier with a Gabor congruency images (OGPCIs):
filter bank containing filters of five scales (p = 5) and eight
orientation (r = 8). Such a filter bank is the most common (i) after the downsampling of the GMRs (or OGPSIs),
composition of Gabor filters used for deriving the Gabor face each downsampled GMR (or OGPCI) is normalized
representation [6, 10, 11, 13, 14, 19, 20, 22], and is therefore to zero-mean and unit-variance before concatenation
EURASIP Journal on Advances in Signal Processing 15

15

ZMUV

ZMUV
10

0
−4 −2 −1 0 1 2 3 4

(a) Zero mean and unit variance normalization

6
5
HQ + ZMUV

HQ + ZMUV
4
3
2
1
0
−2 −1 0 1 2

(b) Histogram equalization normalization

0
GS

GS

0
−4 −2 0 2 4

(c) Normalization with pixel value Gaussianization

Figure 15: Visualization of the applied normalization schemes.

into the final augmented Gabor magnitude (or OGPCIs, and the most right image shows the impact of the
phase) feature vector (denoted by ZMUV), applied normalization procedure on the histogram of the
downsampled GMR (or OGPCI).
(ii) after the downsampling of the GMRs (or OGP-
For the implementation of the subspace projection
SIs), each downsampled GMR (or OGPCI) is first
technique (LDA) the following feature vector lengths were
subjected to histogram equalization followed by
chosen: 199 for the XM2VTS database and 37 for the EYB
zero-mean and unit-variance normalization before
database. These lengths were selected based on preliminary
concatenation into the final augmented Gabor mag-
experiments, which suggested the same result as the baseline
nitude (or phase) feature vector (denoted by HQ),
experiments from the previous section, that is, that the best
(iii) after the downsampling of the GMRs (or OGPSIs), performance with LDA applied on augmented Gabor (mag-
each downsampled GMR (or OGPCI) is first sub- nitude and phase congruency) feature vectors is obtained
jected to gaussianization [49] before concatenation with the maximum feature vector length.
into the final augmented Gabor magnitude (or The results of this series of experiments are presented in
phase) feature vector (denoted by GS). (It should Figure 16 for the XM2VTS database and in Table 2 for the
be noted that the term gaussianization refers to the EYB database. If we first focus on the PBGFC technique, we
remapping of the histogram of an image or pattern can notice that overall the best performance was achieved
vector to a normal distribution with predefined with the help of the HQ technique. Figure 16(a) clearly
parameters. In our case the target distribution is shows that the DET curve generated during the experiments
N (0, 1).) with the HQ techniques outperforms the remaining two
normalization techniques at almost all operating points.
The described strategies are also shown in Figure 15, where Similarly, the technique also results in the best identification
the most left of each image-triplet depicts the normalization performance on three out of four test (probe) subsets of the
procedure applied on the downsampled GMRs, the center EYB database when compared to any of the remaining two
image depicts the normalization procedure applied to the normalization techniques.
16 EURASIP Journal on Advances in Signal Processing

40 40

20 20
False rejection rate (%)

False rejection rate (%)


10 10

5 5

2 2

1 1
0.5 0.5

0.2 0.2
0.1 0.1

0.1 0.2 0.5 1 2 5 10 20 40 0.1 0.2 0.5 1 2 5 10 20 40


False acceptance rate (%) False acceptance rate (%)

ZMUV ZMUV
HQ HQ
GS GS
(a) DET curves for the PBGFC technique (b) DET curves for the GFC technique

Figure 16: Comparison of the impact of different normalization techniques on the face verification performance on the XM2VTS database.

Differently from the PBGFC technique, we observe literature [10, 11], all with the same result—that 5 scales and
the best performance for the GFC method with the GS 8 orientations result in the best performance.
normalization technique both on the XM2VTS database as From the results presented in Figure 17 for the XM2VTS
well as on the EYB database. The result is rather expected database and Table 3 for the EYB database we can notice
since the Gaussian distribution most appropriately reflects that differently from the GFC technique, the PBGFC does
the sparse nature of the Gabor wavelet face representation. not perform at its optimum with 5 filter scales. Rather,
When compared to the baseline results obtained with the the best performance for the XM2VTS database is observed
PCA and LDA techniques, both Gabor-based classifiers with only 2 filter scales, that is, p = 2. Here, an equal
significantly improve upon the baseline performance on both error rate of 1.16% is achieved with the PBGFC approach
experimental databases. However, putting this issue aside, using 2 filter scales only. Similar results are also observed on
we can conclude that this series of recognition experiments the EYB database, where again the recognition performance
suggests that the PBGFC technique should be implemented increases with the decrease of used filter scales. However, the
with the HQ normalization technique, while the GFC performance peaks with p = 3 filter scales.
method should be combined with the GS normalization Based on this series of experiments, we chose to imple-
procedure and that these combinations should be used in the ment the construction procedure of the augmented phase
following comparative assessments. congruency feature vector with 2 scales for the XM2VTS
database and 3 scales for the EYB database for the inclusion
into the Complete Gabor-Fisher Classifier that will be
assessed in the next section.
8.4. Impact of Filter Scales. The third series of experiments
evaluates the impact of the number of the filter scales p
in the Gabor filer bank on the performance of the PBGFC
technique. We fix the angular resolution of the filter bank to 8.5. Optimization of the Complete Gabor-Fisher Classifier. Up
r = 8 and gradually change the value of the employed filter until now, the experiments focused on a separate assessment
scales for phase congruency computation from p = 2 to p = of the Gabor-Fisher and Phase-based Gabor-Fisher Classi-
5. In all of the experiments we set the feature vector lengths fiers for face recognition. However, the results obtained so
to their maximum value and adopt the HQ technique for far suggest that the combination of both techniques into a
normalization of the augmented phase congruency feature unified framework, that is, into the Complete Gabor-Fisher
vectors. It should be noted that we do not assess the impact Classifier, could improve upon the recognition performance
of the filter scales on the performance of the GFC techniques achievable with either of the two Gabor-based classifiers
since various studies on this topic can be found in the alone.
EURASIP Journal on Advances in Signal Processing 17

Table 2: Comparison of the impact of different normalization techniques on the rank one recognition rates (in %) on the EYB database.

PBGFC GFC
Norm. tech.
subset 2 subset 3 subset 4 subset 5 subset 2 subset 3 subset 4 subset 5
ZMUV 100 99.8 88.8 93.8 100 100 83.2 89.1
HQ 100 100 86.1 94.8 100 100 82.3 89.1
GS 100 100 84.8 93.0 100 100 84.6 92.2

Table 4: Rank one recognition rates (in %) obtained with the CGFC
40 approach on the EYB database for four different values of the fusion
parameter γ.

γ subset 2 subset 3 subset 4 subset 5


20
γ = 0.1 100 100 84.8 95.5
False rejection rate (%)

γ = 0.3 100 100 91.8 98.2


10
γ = 0.5 100 100 95.4 98.6
γ = 0.7 100 100 98.5 98.2
5

1 low frequency Gabor filters, which suggests that the Gabor


0.5 phase congruency and Gabor magnitude features represent
feature types with complementary information and can
0.2
therefore be combined into a highly efficient unified Gabor-
0.1 based face recognition approach [6].
0.1 0.2 0.5 1 2 5 10 20 40
As suggested in Section 5, we build the Complete Gabor-
False acceptance rate (%) Fisher Classifier by combining the GFC and PBGFC tech-
niques at the matching score level [34]. Recall that in this
p=5 p=3 setting the final CGFC similarity score depends on the proper
p=4 p=2 selection of the fusion parameter γ. To assess the robustness
of the fusion scheme, the fourth series of face recognition
Figure 17: DET curves generated for different numbers of filter experiments on the XM2VTS and EYB databases evaluates
scales used for computing the OGPCIs. the performance of the CGFC technique with respect to
different values of the fusion parameter γ, where γ ∈ [0, 1].
The results obtained with the XM2VTS database are
Table 3: Rank one recognition rates (in %) obtained with PBGFC presented in Figure 18. Here, Figure 18(a) shows DET curves
technique on the EYB database for different numbers of filter scales obtained at three different values of the fusion parameter
employed during construction of the OGPCIs.
γ, while Figure 18(b) depicts the HTER at the same three
No. of scales subset 2 subset 3 subset 4 subset 5 characteristic operating points as in the case of Figures 14(b)
p=5 100 100 86.1 94.8 and 14(d). From the two graphs we can see that the fusion of
p=4 100 100 94.5 94.4
two Gabor-based classifiers is quite robust, as the recognition
performance for a wide range of parameter values of γ
p=3 100 100 96.4 96.4
improves upon the performance of the individual classifiers
p=2 100 100 94.7 96.6 or, as exemplified by the cross section of the DET curves
at the equal error operating point (i.e., FAR = FRR) in
Figure 18(b), performs at least as well as the better of the two
The experimental results presented in Section 8.4 showed Gabor-based classifiers.
that the PBGFC requires less than the 40 filters needed Similar findings can also be made with the EYB database.
by the GFC technique to achieve optimal face recognition Here, the recognition rates on the test subsets 2 and 3 are
performance. Thus, the PBGFC technique operates on a 100% regardless of the value of the fusion parameter γ. On
much narrower frequency band than the GFC approach the test subsets 4 and 5, however, the performance peaks at
with most of the discriminatory Gabor phase congruency parameter values in the range from γ = 0.4 to γ = 0.8 as
information being contained in the OGPCIs obtained with shown in Figure 19. The actual rank one recognition rates
Gabor filters of high frequencies. (Note that the number for four different values of the fusion parameter on all four
of filter scales is directly proportional to the filters banks test images subsets are presented in Table 4. Again, we can
coverage of the frequency domain.) In addition to the high see that among the listed values of the fusion parameter, the
frequency filters, the GFC method effectively uses also the values of γ = 0.5 and γ = 0.7 result in the best performance.
18 EURASIP Journal on Advances in Signal Processing

5
40
4.5

4
20
False acceptance rate (%)

Half total error rate (%)


3.5
10
3
5
2.5
2
2
1
0.5 1.5

0.2 1
0.1
0.5
0.1 0.2 0.5 1 2 5 10 20 40 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
False rejection rate (%) γ

γ = 0.2 At FAR = FRR


γ = 0.5 At FAR = 10FRR
γ = 0.8 At FAR = 0.1FRR

(a) DET curves for different values of γ (b) Cross section o the DETs at three operating points

Figure 18: Assessment of the CFGC technique on the XM2VTS database.

100
98
Rank one recognition rate (%)

96
94
92
90
88
86
84
82
80
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
γ

Subset 2 Subset 4
Subset 3 Subset 5

Figure 19: Rank one recognition rates (in %) obtained during the optimization of the CGFC method on the EYB database.

Considering the results of this series of experiments, we databases. In the last series of recognition experiments,
select a value of γ = 0.7 for the calculation of the CGFC however, we finally make use of all four databases and
matching scores and use this value for the implementation evaluate the robustness of the proposed CGFC approach to
of the CGFC approach on all four experimental databases. various sources of image variability commonly encountered
in the field of face recognition.
8.6. Recognition in the Presence of Illumination Variations, First, we compute the feature vector lengths employed
Partial Occlusions, and Facial Expression Changes. Up to this with the PCA and LDA techniques for the AR and FERET
point we have assessed only the impact of different parameter databases. As suggested by the experimental results obtained
values and normalization techniques on the performance of in Section 8.2, we fix the LDA feature vector length to its
the CGFC technique using only two out of four experimental maximum possible value, which for the AR database equals
EURASIP Journal on Advances in Signal Processing 19

1 1

0.9 0.9

0.8 0.8

0.7 0.7
Recognition rate

Recognition rate
0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
1 20 40 60 80 100 120 140 160 180 200 1 20 40 60 80 100 120 140 160 180 200
Rank Rank
(a) Fb probe set (b) Fc probe set

1 1

0.9 0.9

0.8 0.8

0.7 0.7
Recognition rate

Recognition rate

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
1 20 40 60 80 100 120 140 160 180 200 1 20 40 60 80 100 120 140 160 180 200
Rank Rank

PCA GF + PCA PCA GF + PCA


LDA GFC LDA GFC
PCF + PCA CG + PCA PCF + PCA CG + PCA
PBGFC CGFC PBGFC CGFC
(c) Dup I probe set (d) Dup II probe set

Figure 20: CMC curves obtained with different probe sets of the FERET database.

vector length in such a way that approximately 98% of the


training data variance is preserved, resulting in the feature
vector length of d = 448 for the FERET database and in the
feature vector length of d = 358 for the AR database. Next,
Figure 21: Examples of rendered face images (from left to right): we implement the PBGFC technique for the AR and FERET
the original face image, and the rendered image for τ = 40, the databases using three filter scales (p = 3), and use the value
rendered image for τ = 80, the rendered image for τ = 120, the of γ = 0.7 for construction of the Complete Gabor-Fisher
rendered image for τ = 160. Classifier.
Using the presented parameters, we first evaluate the
performance of the CGFC technique on the standard probe
to d = 99, and for the FERET database takes the value of sets, that is, the Fb, Fc, Dup I, and Dup II probe sets, of
d = 427. For the PCA approach we compute the feature the FERET database. The comparative results of the CGFC
20 EURASIP Journal on Advances in Signal Processing

30 30

27 27

24 24

21 21

18 18
HTER (%)

HTER (%)
15 15

12 12

9 9

6 6

3 3

0 0
0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
α α
(a) EPC curves for τ = 40 (b) EPC curves for τ = 80
30 30

27 27

24 24

21 21

18 18
HTER (%)

HTER (%)

15 15

12 12

9 9

6 6

3 3

0 0
0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
α α
PCA GF + PCA PCA GF + PCA
LDA GFC LDA GFC
PCF + PCA CG+PCA PCF + PCA CG+PCA
PBGFC CGFC PBGFC CGFC
(c) EPC curves for τ = 120 (d) EPC curves for τ = 160

Figure 22: EPC curves obtained on the test sets of the XM2VTS database for different values of the parameter τ.

technique and our own implementations of seven popular with the PCA and LDA techniques (i.e., CG+PCA and CGFC,
face recognition techniques are presented in form of CMC resp.) and three state-ofthe-art methods, namely, the Local
curves in Figure 20. Here, the Eigenface technique (PCA) Gabor Binary Pattern Histogram Sequence (LGBPHS) [50]
[47], the Fisherface technique (LDA) [33, 48], the Phase- approach, a Local Binary Pattern-(LBP) [51] based technique
based Gabor-Fisher Classifier (PBGFC) [6], the Gabor-Fisher and the best performing method from the original Sep96
Classifier (GFC) [10], the PCA technique applied to the FERET evaluation (BYS) [40]—see Table 5. (Note that n/a
augmented Gabor magnitude and Gabor phase congruency stands for the fact that the results for the specific probe set
feature vectors (GF+PCA and PCF+PCA, resp.) as well as the were not provided in the original publication.)
PCA technique in combination with the complete Gabor face Note that the CGFC technique results in competitive
representation (CG+PCA) were adopted for the comparison. recognition rates on all probe sets. It achieves the ROR of
In addition to the graphical results, we also present the rank 98.7% on the Fb probe set, which comprises images with
one recognition rates (RORs) for the baseline methods PCA different facial expressions as in the gallery set, the ROR of
and LDA, for the complete Gabor representations combined 92.8% on the Fc probe set, where the probe (test) images
EURASIP Journal on Advances in Signal Processing 21

1 1

0.9 0.9

0.8 0.8

0.7 0.7
Recognition rate

Recognition rate
0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
1 5 10 15 20 25 30 35 1 5 10 15 20 25 30 35
Rank Rank
(a) CMCs for subset 2 (b) CMCs for subset 3
1 1

0.9 0.9

0.8 0.8

0.7 0.7
Recognition rate

Recognition rate

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
1 5 10 15 20 25 30 35 1 5 10 15 20 25 30 35
Rank Rank

PCA GF + PCA PCA GF + PCA


LDA GFC LDA GFC
PCF + PCA CG + PCA PCF + PCA CG + PCA
PBGFC CGFC PBGFC CGFC
(c) CMCs for subset 4 (d) CMCs for subset 5

Figure 23: CMC curves obtained on the four probe subsets of the EYB database.

feature illumination variations, and RORs of 77.2% and training set for the construction of the PCA/LDA subspace
57.3% on the Dup I and Dup II sets, respectively, where and, thus, the results of the comparison with other methods
the goal is to assess aging effects on the performance of from the literature should be interpreted with this fact in
the given face recognition technique. As already indicated in mind.
the introduction, the rather good performance of the CGFC Let us now turn our attention to the XM2VTS database.
method on the Fb probe set can be linked to the local nature All experiments on this database conducted so far have been
of the Gabor features, which ensures robustness to changes performed on the evaluation image sets, while the test image
in facial expression, while the robustness to illumination sets were not used. In this series of experiments, however,
changes evidenced by the recognition rates on the Fc probe we employ the test image sets for our comparative study and
set can be related to frequency band coverage of the Gabor implement all recognition techniques with their hyperpa-
filter bank. Despite the competitiveness of the proposed rameters (such as decision thresholds, feature vector lengths,
approach in our experiments, it should, nonetheless, be number of employed filter scales, etc.) predefined using the
noted that the FERET database does not define a standard evaluation image sets. As suggested in Section 7.5, we report
22 EURASIP Journal on Advances in Signal Processing

1 1

0.9 0.9

0.8 0.8

0.7 0.7
Recognition rate

Recognition rate
0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
1 5 10 15 20 25 30 35 40 45 50 1 5 10 15 20 25 30 35 40 45 50
Rank Rank
(a) Scarves probe set (b) Glasses probe set
1 1

0.9 0.9

0.8 0.8

0.7 0.7
Recognition rate

Recognition rate

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
1 5 10 15 20 25 30 35 40 45 50 1 5 10 15 20 25 30 35 40 45 50
Rank Rank

PCA GF+PCA PCA GF+PCA


LDA GFC LDA GFC
PCF+PCA CG+PCA PCF+PCA CG+PCA
PBGFC CGFC PBGFC CGFC
(c) Scream probe set (d) Lighting probe set

Figure 24: CMC curves obtained on the four probe sets (scarves, glasses, scream, and lighting) of the AR database.

the results in form of EPC curves. To make the assessment introduced artificial illumination change [6]. The authors
more challenging, we introduce an artificial illumination of [52] stressed that their model (see (23)) does not cover
change to the test image sets from the XM2VTS database all possible illumination effects in real life settings, but is
and adopt the artificial illumination model of Sanderson nevertheless useful for providing suggestive results regarding
and Paliwal [52] for rendition of the facial images. The the robustness of the assessed face recognition techniques
model simulates different illumination conditions during the to illumination changes. Some examples of facial images
image capturing process by modifying the preprocessed facial rendered with the presented model for different values of
images I(x, y), that is, the parameter τ are shown in Figure 21. Due to the extensive
   
experimental section, we do not report the performance on
I4 x, y = I x, y + mx + τ, (23) the original test sets, but provide only comparative results
on the harder (degraded) test image sets. Similar as with
where x = 0, 1, . . . , a − 1; y = 0, 1, . . . , b − 1; m = −2τ/(b − 1); the FERET database, we implement seven face recognition
τ denotes the parameter that controls the “strength” of the techniques from the literature for our comparison.
EURASIP Journal on Advances in Signal Processing 23

Figure 25: CMC curves (recognition rate versus rank) for the challenging All probe set of the AR database for the PCA, LDA, PCF+PCA, PBGFC, GF+PCA, GFC, CG+PCA, and CGFC techniques.

Table 5: Rank one recognition rates (in %) obtained on different probe sets of the FERET database.

Method        Fb      Fc      Dup I   Dup II
PCA           77.3    11.4    34.2    10.7
LDA           91.9    75.3    52.9    18.0
CG+PCA        82.9    62.4    51.4    35.9
CGFC          98.7    92.8    77.2    57.3
LGBPHS [50]   94.0    97.0    68.0    53.0
LBP [51]      93.0    51.0    61.0    50.0
BYS [40]      82.0    37.0    52.0    32.0

Table 6: Rank one recognition rates (in %) obtained on the EYB database during the comparative assessment.

Method      subset 2   subset 3   subset 4   subset 5
PCA         93.6       55.0       16.7       22.0
LDA         100        99.8       56.3       51.0
CG+PCA      100        100        96.8       98.2
CGFC        100        100        98.5       98.2
LBP [53]    100        99.8       97.3       87.5
LS [39]     100        100        85.0       n/a
HI [54]     100        100        97.3       n/a
GA [55]     100        100        98.6       n/a

The results of this series of face recognition (verification) experiments are presented in Figure 22 in the form of EPC curves. The first thing to notice from the presented results is that the CGFC method systematically outperforms all other techniques, significantly improving upon the baseline performance of the PCA and LDA methods. Moreover, it also results in the most robust performance in the presence of (artificial) lighting variations, again due to the properties of the Gabor filter bank. To evaluate the performance of the CGFC technique on real illumination changes and, consequently, to further demonstrate the robustness of the proposed technique to illumination changes, we perform the next round of experiments on the EYB database.

The final recognition experiments on the EYB database are again conducted in accordance with the experimental setup presented in Section 7.2. From the results presented in the form of CMC curves in Figure 23, we can see that, while on probe subsets 2 and 3 almost all evaluated methods resulted in a recognition rate of 100% for all ranks, only the CGFC method is able to retain an almost perfect recognition rate for the more challenging probe subsets 4 and 5 as well. In the comparison with state-of-the-art techniques from the literature, that is, an LBP-based method [53], the Linear Subspace (LS) approach [39], the Harmonic Image (HI) technique [54], and the Gradient Angle (GA) procedure [55], presented in Table 6, the CGFC technique again resulted in competitive RORs, making it a suitable choice for various (security, defense, etc.) applications requiring robust recognition in difficult illumination conditions.

Last but not least, we evaluated the CGFC technique on the AR database, where the goal was to assess the robustness of the proposed method against extreme facial expression changes, disguise (partial occlusion of the facial area), illumination changes, and different combinations of the listed image characteristics. When looking at the CMC curves generated during the recognition experiments on the scarves, glasses, scream, and lighting probe sets (Figure 24), we can see that the CGFC method achieved a ROR of more than 98% on all probe sets. These results suggest that our method is highly robust to facial occlusions (which usually occur when the subject to be identified tries to conceal his identity through disguise), to extreme facial expression changes, and, as already shown in the previous experiments, to illumination variations. When looking at the results obtained with the most challenging probe set, that is, the All probe set, shown in Figure 25, we can notice that the overall ROR on this set again reaches a value of more than 99%. To the best of our knowledge, the recognition rates obtained on the different probe sets of the AR database represent the best results achieved on this database and published in the literature. This fact is also evidenced by the comparison with state-of-the-art techniques from the literature presented in Table 7. Here, the RORs are shown for the SubSampling (SS) method from [8], the recently proposed Sparse Representation Classifier (SRC) [9], and the Attributed Relational Graph (ARG) based method [56], as well as for four techniques implemented to produce the CMC curves in Figure 24. The reader should note that, while the results taken from the literature were obtained using similar training images, the probe (test) sets for the SS, SRC, and ARG techniques were either smaller in size or featured only one degradation at a time. For example, while we adopted all images featuring a scarf for our scarves probe set regardless of the illumination conditions, the SS, SRC, and ARG methods produced their results based on only scarf images taken in the same lighting conditions as the training (target) images. Similar observations could also be made for
Table 7: Rank one recognition rates (in %) obtained on the five probe sets of the AR database.

Method      scarves   glasses   scream   lighting   all
PCA         55.8      45.7      59.0     64.9       64.7
LDA         58.8      56.8      72.5     70.0       71.6
CG+PCA      97.8      97.3      99.0     97.9       98.3
CGFC        98.7      99.2      100      99.1       99.2
SS [8]      93.0      84.0      87.0     n/a        n/a
SRC [9]     93.5      97.5      n/a      n/a        n/a
ARG [56]    85.2      80.7      66.7     98.5       n/a

the remaining probe sets. The performance of the CGFC method on the challengingly designed probe sets of the AR database offers a final demonstration of the competitiveness of the proposed approach.

9. Conclusion

In this paper we have proposed a novel face classifier for face recognition called the Complete Gabor-Fisher Classifier. Unlike the majority of Gabor filter-based methods from the literature, which mainly rely only on the Gabor magnitude features for representing facial images, the proposed classifier exploits both Gabor magnitude features as well as features derived from Gabor phase information. The feasibility of the proposed technique was assessed on four publicly available databases, namely, the XM2VTS, FERET, AR, and Extended YaleB databases. On all datasets, the proposed technique resulted in a promising face recognition performance and outperformed several popular face recognition techniques, such as PCA, LDA, the Gabor-Fisher Classifier, and others. The proposed method was also shown to ensure robust recognition performance in the presence of extreme facial changes, severe lighting variations, and partial occlusions of the facial area.

The source code used in all of our experiments will be made freely available. The interested reader is referred to http://luks.fe.uni-lj.si/en/staff/vitomir/index.html for more information.

Acknowledgments

The presented research has been supported in parts by the national research program P2-0250(C) Metrology and Biometric Systems, the bilateral project with the Bulgarian Academy of Sciences—Face and Signature Biometrics, the national project AvID M2-0210, the COST Action 2101 Biometrics for Identity Documents and Smart Cards, and the EU-FP7 project 217762 Homeland security, biometric Identification and personal Detection Ethics (HIDE).

References

[1] R. W. Ives, Y. Du, D. M. Etter, and T. B. Welch, “A multidisciplinary approach to biometrics,” IEEE Transactions on Education, vol. 48, no. 3, pp. 462–471, 2005.
[2] A. K. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 4–20, 2004.
[3] J. D. Woodword, N. M. Orlans, and P. T. Higgins, Biometrics, McGraw-Hill, New York, NY, USA, 2002.
[4] V. Štruc and N. Pavešić, “Phase congruency features for palm-print verification,” IET Signal Processing, vol. 3, no. 4, pp. 258–268, 2009.
[5] A. Franco and L. Nanni, “Fusion of classifiers for illumination robust face recognition,” Expert Systems with Applications, vol. 36, no. 5, pp. 8946–8954, 2009.
[6] V. Štruc, B. Vesnicer, and N. Pavešić, “The phase-based Gabor Fisher classifier and its application to face recognition under varying illumination conditions,” in Proceedings of the 2nd International Conference on Signal Processing and Communication Systems (ICSPCS ’08), pp. 1–6, Gold Coast, Australia, December 2008.
[7] Y. Gao and M. K. H. Leung, “Face recognition using line edge map,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 764–779, 2002.
[8] S. Fidler, D. Skočaj, and A. Leonardis, “Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 337–350, 2006.
[9] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[10] C. Liu and H. Wechsler, “Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition,” IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 467–476, 2002.
[11] L. Shen and L. Bai, “A review of Gabor wavelets for face recognition,” Pattern Analysis and Applications, vol. 9, no. 2, pp. 273–292, 2006.
[12] L. Nanni and D. Maio, “Weighted sub-Gabor for face recognition,” Pattern Recognition Letters, vol. 28, no. 4, pp. 487–492, 2007.
[13] L. Shen, L. Bai, and M. Fairhurst, “Gabor wavelets and general discriminant analysis for face identification and verification,” Image and Vision Computing, vol. 25, no. 5, pp. 553–563, 2007.
[14] V. Štruc and N. Pavešić, “Gabor-based kernel partial-least-squares discrimination features for face recognition,” Informatica, vol. 20, no. 1, pp. 115–138, 2009.
[15] P. Kovesi, “Image features from phase congruency,” Videre: Journal of Computer Vision Research, vol. 1, no. 3, pp. 1–26, 1999.
[16] A. Eleyan, H. Özkaramanli, and H. Demirel, “Complex wavelet transform-based face recognition,” EURASIP Journal on Advances in Signal Processing, vol. 2008, Article ID 185281, 13 pages, 2008.
[17] V. Kyrki, J.-K. Kamarainen, and H. Kälviäinen, “Simple Gabor feature space for invariant object recognition,” Pattern Recognition Letters, vol. 25, no. 3, pp. 311–318, 2004.
[18] M. Lades, J. C. Vorbrueggen, J. Buhmann, et al., “Distortion invariant object recognition in the dynamic link architecture,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 300–311, 1993.
[19] C. Liu, “Capitalize on dimensionality increasing techniques for improving face recognition grand challenge performance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 725–737, 2006.
[20] L. Shen and L. Bai, “Information theory for Gabor feature selection for face recognition,” EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 30274, 11 pages, 2006.
[21] B. Zhang, S. Shan, X. Chen, and W. Gao, “Histogram of Gabor phase patterns (HGPP): a novel object representation approach for face recognition,” IEEE Transactions on Image Processing, vol. 16, no. 1, pp. 57–68, 2007.
[22] B. Zhang, Z. Wang, and B. Zhong, “Kernel learning of histogram of local Gabor phase patterns for face recognition,” EURASIP Journal on Advances in Signal Processing, vol. 2008, Article ID 469109, 8 pages, 2008.
[23] E. Bezalel and U. Efron, “Efficient face recognition method using a combined phase congruency/Gabor wavelet technique,” in Optical Information Systems III, vol. 5908 of Proceedings of SPIE, pp. 1–8, San Diego, Calif, USA, September 2005.
[24] S. Gundimada and V. K. Asari, “A novel neighborhood defined feature selection on phase congruency images for recognition of faces with extreme variations,” International Journal of Information Technology, vol. 3, no. 1, pp. 25–31, 2006.
[25] S. Gundimada, V. K. Asari, and N. Gudur, “Face recognition in multi-sensor images based on a novel modular feature selection technique,” Information Fusion, vol. 11, no. 2, pp. 124–132, 2010.
[26] Y. Guo, G. Zhao, J. Chen, M. Pietikäinen, and Z. Xu, “A new Gabor phase difference pattern for face and ear recognition,” in Proceedings of the International Conference on Computer Analysis of Images and Patterns, vol. 5702 of Lecture Notes in Computer Science, pp. 41–49, Münster, Germany, September 2009.
[27] Y. Guo and Z. Xu, “Local Gabor phase difference pattern for face recognition,” in Proceedings of the 19th International Conference on Pattern Recognition (ICPR ’08), pp. 1–4, Tampa, Fla, USA, December 2008.
[28] L. Qing, S. Shan, X. Chen, and W. Gao, “Face recognition under varying lighting based on the probabilistic model of Gabor phase,” in Proceedings of the International Conference on Pattern Recognition, vol. 3, pp. 1139–1142, Hong Kong, August 2006.
[29] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[30] T. Ojala, M. Pietikäinen, and T. Mäenpää, “A generalized local binary pattern operator for multiresolution gray scale and rotation invariant texture classification,” in Proceedings of the 2nd International Conference on Advances in Pattern Recognition, pp. 397–406, Rio de Janeiro, Brazil, March 2001.
[31] S. Venkatesh and R. Owens, “An energy feature detection scheme,” in Proceedings of the International Conference on Image Processing, pp. 553–557, Singapore, 1989.
[32] T. Savič and N. Pavešić, “Personal recognition based on an image of the palmar surface of the hand,” Pattern Recognition, vol. 40, no. 11, pp. 3152–3163, 2007.
[33] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[34] A. Ross, K. Nandakumar, and A. K. Jain, “Introduction to multibiometrics,” in Handbook of Biometrics, A. K. Jain, P. Flynn, and A. Ross, Eds., pp. 271–292, Springer, New York, NY, USA, 2008.
[35] V. Štruc, F. Mihelič, and N. Pavešić, “Face authentication using a hybrid approach,” Journal of Electronic Imaging, vol. 17, no. 1, pp. 1–11, 2008.
[36] B. Vesnicer and F. Mihelič, “The likelihood ratio decision criterion for nuisance attribute projection in GMM speaker verification,” EURASIP Journal on Advances in Signal Processing, vol. 2008, Article ID 786431, 11 pages, 2008.
[37] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: the extended M2VTS database,” in Proceedings of the 2nd International Conference on Audio- and Video-Based Person Authentication, pp. 72–77, Washington, DC, USA, 1999.
[38] K.-C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[39] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[40] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.
[41] A. M. Martinez and R. Benavente, “The AR face database,” Tech. Rep. 24, CVC, New York, NY, USA, June 1998.
[42] A. M. Martínez, “Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 748–763, 2002.
[43] K. Delac, M. Grgic, and S. Grgic, “Independent comparative study of PCA, ICA, and LDA on the FERET data set,” International Journal of Imaging Systems and Technology, vol. 15, no. 5, pp. 252–260, 2005.
[44] S. Bengio and J. Marithoz, “The expected performance curve: a new assessment measure for person authentication,” in Proceedings of the Speaker and Language Recognition Workshop Odyssey, pp. 279–284, Toledo, Spain, 2004.
[45] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[46] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, and Y. Ma, “Towards a practical face recognition system: robust registration and illumination by sparse representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’09), pp. 597–604, June 2009.
[47] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[48] W. Zhao, A. Krishnaswamy, R. Chellappa, D. L. Swets, and J. Weng, “Discriminant analysis of principal components for face recognition,” in Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds., pp. 73–85, Springer, Berlin, Germany, 1998.
[49] V. Štruc, J. Žibert, and N. Pavešić, “Histogram remapping
as a preprocessing step for robust face recognition,” WSEAS
Transactions on Information Science and Applications, vol. 6,
no. 3, pp. 520–529, 2009.
[50] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang, “Local Gabor binary pattern histogram sequence (LGBPHS): a novel non-statistical model for face representation and recognition,” in Proceedings of the International Conference on Computer Vision, pp. 786–791, Beijing, China, October 2005.
[51] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face recognition with local binary patterns,” in Proceedings of the European Conference on Computer Vision, vol. 3021, pp. 469–481, Prague, Czech Republic, May 2004.
[52] C. Sanderson and K. K. Paliwal, “Fast features for face
authentication under illumination direction changes,” Pattern
Recognition Letters, vol. 24, no. 14, pp. 2409–2419, 2003.
[53] X. Tan and B. Triggs, “Enhanced local texture feature sets
for face recognition under difficult lighting conditions,” in
Proceedings of the Analysis and Modelling of Faces and Gestures,
vol. 4778 of Lecture Notes in Computer Science, pp. 168–182,
Springer, 2007.
[54] R. Basri and D. Jacobs, “Lambertian reflectance and linear
subspaces,” in Proceedings of the 8th International Conference
on Computer Vision, pp. 383–390, Vancouver, Canada, July
2001.
[55] H. F. Chen, P. N. Belhumeur, and D. W. Jacobs, “In search of illumination invariants,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR ’00), pp. 254–261, South Carolina, USA, June 2000.
[56] B.-G. Park, K.-M. Lee, and S.-U. Lee, “Face recognition using
face-ARG matching,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 27, no. 12, pp. 1982–1988, 2005.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 983581, 15 pages
doi:10.1155/2010/983581

Research Article
Multiclient Identification System Using
Adaptive Probabilistic Model

Chin-Teng Lin,1 Linda Siana,1 Yu-Wen Shou,2 and Chien-Ting Yang1


1 Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan
2 Department of Computer and Communication Engineering, China University of Technology, Hsinchu 303, Taiwan

Correspondence should be addressed to Yu-Wen Shou, [email protected]

Received 1 December 2009; Revised 26 February 2010; Accepted 14 April 2010

Academic Editor: Yingzi Du

Copyright © 2010 Chin-Teng Lin et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

This paper integrates the detection and identification of human faces into a practical, real-time face recognition system. The proposed face detection system is based on the cascaded Adaboost method, and its precision and robustness toward unstable surrounding lighting are improved by a histogram lighting normalization step and by a region-based clustering process that accurately locates the face regions. We also address the problem of multiscale faces by using 12 different scales of searching windows, and we use 5 different head orientations for each client in pursuit of view-independent face identification. Our face identification system has two main methodological parts: PCA (principal component analysis) facial feature extraction and an adaptive probabilistic model (APM). The APM constructs the likelihood function of each client as a weighted combination of simple probabilistic functions, and the similarity measures follow this probabilistic constraint. In addition, thanks to the constructed APM, our proposed method can add a new client and update the information of registered clients online. The experimental results show the superior performance of our proposed system for both offline and real-time online testing.

1. Introduction

Biometrics has been an emerging technology for identifying people by their physical and behavioral characteristics [1, 2], and its applications have attracted more and more attention from researchers recently. Some physical characteristics of an individual can be used in a biometric identification/verification system, such as the fingerprint, palm print, face, and ear. Similarly, the behavioral characteristics include the signature, speech, gesture, and gait. Among all biometric identification fields, face recognition has always been considered especially popular and significant. Face detection and recognition are also used in video surveillance and human-computer interfaces. Furthermore, face recognition, with its passive and nonintrusive benefits, is well suited to personal identification.

A typical face recognition system is composed of two parts, face detection and face identification. It is quite challenging for face detection to localize the faces in an image because the detected results might highly depend on the surrounding conditions such as environments, movements, lighting, orientations, and even the expressions of faces. These variant factors may lead to changes of colors, luminance, shadows, and contours of images. For this reason, it is impractical to detect faces by using a single feature. Papageorgiou et al. [3] proposed 2-D Haar features to detect objects by training an SVM (Support Vector Machine) on multiple Haar features. Li et al. [4] proposed a “FloatBoost” algorithm that removes the weaker face features to improve the detection rate and speed. Lienhart and Maydt [5] proposed an extended set of Haar-like features for rapid object detection, which broadened the use of Haar features and improved the precision of object detection. Then Viola and Jones [6] proposed three important methods to detect objects efficiently. First, they applied integral images to reduce the computational loading of feature evaluation. Second,

Face database

Training
Camera Face regions
Mutiscale Lighting Face detector Region based
searching window normalization based on Adaboost candidate clustering
Face region
Decision Face identifier Facial feature
Client/impostor extraction
based on APM
(eigenfaces)
Training Updating
Client
database

Figure 1: Architecture of the proposed face recognition system.

a simple and efficient classifier based on the Adaboost learning algorithm [7] was used to select a small number of features from a very large range of potential features. Third, they presented a cascaded-classifier method to speed up the processing time.

Face identification is to identify faces registered in the database. While many approaches to face identification have aimed at identifying faces under slight changes of lighting, facial expressions, and poses, reliable techniques for identification under drastic variations have proven elusive. The major issue in view-independent face identification is how to identify a registered face from different viewing directions. There are different kinds of methods for handling pose variations in face identification, including invariant-feature methods, 3D model-based methods, and multiview methods. The invariant-feature method attempts to extract features of faces from novel views and uses these features to identify the faces [8–10]. One major disadvantage of this method is the difficulty of finding sufficient invariant features for identification. The 3D model-based method focuses on constructing a prototypical view from a 3D model. As mentioned in [11], the 3D model-based method can work well for faces with small angles of rotation. However, this kind of method might fail for faces with larger rotations due to the invisibility of some important features [12]. The multiview method can be more effective since a sufficient number of faces in different views is taken into consideration to deal with the pose problem [13]. Beymer [14] modeled faces from 15 views, sampling different poses from the viewing sphere; his face identification consisted of two main stages, geometrical alignment and correlation for matching. Other works have also been presented and proven to be robust to changes of viewpoint. One of them is the well-known single-view eigenspace approach, whose concept is based on principal component analysis (PCA) [15–20]. Many related works have been proposed to improve either the performance of face detection or of face identification. However, a complete face recognition system including both a face detector and an identifier has rarely been proposed in recent research. Moreover, most of the presented works lack the flexibility to add new clients and to update the clients' information automatically. For more real-time applications, in this paper we propose a practical system that systematically integrates both face detection and identification. Our proposed system uses the cascaded Adaboost learning algorithm in face detection, and achieves the multiclient identification mechanism by using an adaptive probabilistic model (APM). This paper is organized as follows. First, our face detection system, including histogram lighting normalization, feature selection, the cascaded Adaboost classifier, and the region-based clustering algorithm, is presented. After that, the identification process, including the similarity measurement and the parameter adjustment by APM, is introduced. Finally, the experimental results and conclusions are summarized.

2. Face Recognition System

The architecture of the face recognition system in our work is shown in Figure 1. The system consists of a face detection subsystem localizing the face regions in a captured image, and a face identifier determining which client an extracted face belongs to. We present a novel idea, searching windows with various sizes, which are used to find face candidates at multiple scales. The face candidates at different scales reflect the various distances of clients from the camera. We define 12 searching windows of various sizes, from the smallest block size of 24 × 24 up to the biggest one, each scale larger than the previous by a multiplier of 1.25. While a camera acquires an image, it produces images of different illumination intensities depending on the light surrounding the clients. Therefore, it is necessary for an accurate recognition process to normalize the changes of light with respect to the surrounding environments.
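As a small illustration of the multiscale search just described, the sketch below generates the 12 window sizes under the stated assumptions (a square 24 × 24 base block and a scale step of 1.25; rounding to the nearest integer is our own choice, not a detail given in the text):

```python
# A sketch of generating the multiscale searching window sizes described above.
def searching_window_sizes(base=24, factor=1.25, count=12):
    sizes = []
    size = float(base)
    for _ in range(count):
        sizes.append(int(round(size)))  # window side length in pixels
        size *= factor                  # next scale is 1.25 times larger
    return sizes

print(searching_window_sizes())
# [24, 30, 38, 47, 59, 73, 92, 114, 143, 179, 224, 279]
```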
2.1. Lighting Normalization. The lighting normalization is based on a histogram fitting method. The primary task of histogram fitting is to transform the candidate histogram H(l) into the target one G(l) for l = 0, ..., L − 1, where L represents the number of discrete gray levels. Our target histogram G(l) is chosen as the histogram of the image closest to the mean of the face database. Both the original H(l)

and the target histogram G(l) are mapped to the uniform distributions M_{H→U}(l) and M_{G→U}(l):

M_{H→U}(l) = Σ_{j=0}^{l} H(j) / Σ_{j=0}^{L−1} H(j),   M_{G→U}(l) = Σ_{j=0}^{l} G(j) / Σ_{j=0}^{L−1} G(j),   (1)

where M_{H→U}(l) and M_{G→U}(l) are monotonically increasing. The histogram H(l) can then be mapped to G(l) by M_{H→G}(l) in the following equation:

M_{H→G}(l) = M_{U→G}(M_{H→U}(l)),   (2)

where M_{U→G}(l) denotes the inverse mapping of M_{G→U}(l). For each pixel in the original image, if the value of some pixel is h_i, we first map h_i to its corresponding value M_{H→U}(h_i), as shown in Figure 2. After that, M_{H→U}(h_i) is mapped to M_{H→G}(h_i) by using the iterative scheme illustrated in Figure 3. To demonstrate the practical changes after the lighting normalization, we show the chosen target image and the images before and after normalization in Figures 4(a), 4(b), and 4(c), respectively. The images with over-dark or over-light intensities are normalized to the target one. Therefore, the histograms after lighting normalization are similar to the histograms of the targets.

Figure 2: The candidate and target histograms H(l) and G(l), and their corresponding cumulative distributions M_{H→U}(l) and M_{G→U}(l).

Figure 3: The distribution of M_{H→G}(l) obtained by composing M_{H→U} with the inverse mapping M_{U→G}.
The histograms H(l) can be mapped to G(l) by MH → G (l) in A single rectangle feature which best separates the face
the following equation: and nonface samples can be considered as a weak classifier
h(x, f , p, θ) as shown in the following equation:
MH → G (l) = MU → G (MH → U (l)). (2)

  ⎨1, if p f (x) < pθ,
MU → G (l) denotes the inverse mapping of MG → U (l). For each h x, f , p, θ = ⎩ (4)
pixel in the original image, if the value of some pixel is hi , 0, otherwise.
we will firstly map hi to its corresponding value MH → U (l) as
shown in Figure 2. After that, MH → U (l) will be mapped to The weak classifier h(x, f , p, θ) used to determine if the
MH → G (l) by using the iterative scheme, which can be also x-block image is a face or a nonface depends on the feature
illustrated in Figure 3. To demonstrate the practical changes f (x, y, w, and h type), the threshold θ, and the polarity p

Figure 4: Lighting normalization. (a) Target image and its histogram G(l); (b) input images and their histograms H(l); (c) lighting-normalized images and their remapped histograms M_{H→G}(l).

Figure 5: Four types of rectangle features (Type 1–Type 4), each defined by its origin (x, y), width w, and height h: vertical edge, horizontal edge, vertical line, and diagonal edge.

indicating the sign of the inequality. For each weak classifier, an optimal threshold is chosen to minimize the number of misclassifications. The threshold for each rectangle feature is acquired through a training process on our database, which consists of 4,000 face images and 59,000 nonface images. Figures 6(a) and 6(b) present some face and nonface examples in our database. In this procedure, we collect the distributions of f(x, y, w, h, Type) for each image in the database, and then choose the threshold with the higher distinguishability between the two classes. Although each rectangle feature can be computed easily, computing the complete set of all features is extremely costly. Taking the smallest searching window of 24 × 24 block size, for example, the entire number of rectangle features is about 160,000.
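The integral-image trick mentioned in the introduction is what makes evaluating (3) cheap despite this large feature count. The sketch below shows one two-rectangle (Type 1) feature computed that way; the helper names and the sign convention (white half minus black half) are illustrative assumptions rather than the authors' implementation:

```python
# A sketch of evaluating one Haar-like rectangle feature with an integral image.
import numpy as np

def integral_image(window):
    return np.cumsum(np.cumsum(window, axis=0), axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left corner (x, y), width w, height h."""
    ii = np.pad(ii, ((1, 0), (1, 0)), mode="constant")  # so (x, y) = (0, 0) works
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def vertical_edge_feature(window, x, y, w, h):
    """Type-1 feature: left half minus right half, i.e. value_subtracted in (3)."""
    ii = integral_image(window.astype(np.int64))
    left = rect_sum(ii, x, y, w // 2, h)
    right = rect_sum(ii, x + w // 2, y, w // 2, h)
    return left - right
```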
The Adaboost method combines a collection of weak classifiers to form a stronger classifier. Since the stronger classifier is rather time consuming, the structure of cascaded classifiers by Viola and Jones [6] is preferred in order to improve the detection performance and reduce the computational time. As a result, our cascaded Adaboost classifier classifies each extracted face image step by step. In each step, only an image block classified as a face proceeds to the next step. The number of steps must be sufficient to achieve

Figure 6: Database of the face detection system. (a) Face images; (b) nonface images.

an excellent detection rate and minimal computational loading. For example, an overall detection rate of 0.9 can be achieved by a 10-step classifier with a detection rate of 0.99 in each step (0.9 ≈ 0.99^10). The procedure of our implemented Adaboost process can be described by the following equations. If m and l are the numbers of nonface and face samples, respectively, and j indexes the training samples, the initial weight w_{i,j} for the ith stage is defined as w_{i,j} = 1/(2m) for y_j = 0 and w_{i,j} = 1/(2l) for y_j = 1. The normalized weighted error with respect to a weak classifier can be expressed in the following equation:

ε_i = min_{f,p,θ} Σ_j w_{i,j} | h(x_j, f, p, θ) − y_j |.   (5)

The updated weights for each iteration are defined in (6), where e_j equals 0 if the sample is classified correctly and 1 otherwise:

w_{i,j} = w_{i,j} β_i^{1−e_j}.   (6)

Also, the final classifier for the ith stage is defined in the following equation:

C(x_j) = 1, if Σ_i α_i h(x_j, f, p, θ) ≥ (1/2) Σ_i α_i; 0, otherwise,   (7)

where α_i = log(1/β_i) and β_i = ε_i/(1 − ε_i).
where αi = log(1/βi ) and βi = εi /(1 − εi ) satisfies overlap rate(x, y) ≥ THoverlap rate rather than select
multiple blocks. At the end, the global scale clustering will
2.3. Region Based Clustering. The face detector usually finds use the blocks obtained from local scale clustering, and label
more than one face candidate even though only one single the face regions by the average size of all available blocks.
face appears in an image, which is illustrated in Figure 8, Some results in the entire region based clustering process
and a region-based clustering method is used to solve this for both local-scale and global-scale levels will be shown in
kind of problems. The proposed region-based clustering Figure 10. From the right image in Figure 8, in fact, only
6 EURASIP Journal on Advances in Signal Processing

Non-face

Database Lighting Each feature’s


normalization threshold value

Frames
Face

Subtracted value

Figure 7: The process in selecting the threshold value for features in each rectangle.

(a) (b)

Figure 8: Some results in face detector.

(a) (b) (c)

Figure 9: Examples of overlapped regions and the distance of centers of two blocks, (a) the same cluster, (b) two different clusters, (c) a
special case, more than two blocks overlapping.

(a) (b)

Figure 10: The results in the region based clustering, (a) for the local-scale clustering process and (b) for the global-scale clustering
process.

Figure 11: The pattern information for specific dimensions of the image subspace: (a) the sum of eigenvalues and (b) the detection rate, both plotted against the number of principal components.

Figure 12: Five different head orientations of a client.

one block is precisely clustered as a face region after applying our local and global clustering processes, even though more than five face candidates are obtained for an image with only five faces.
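A small sketch of the pairwise clustering rule (8) used above follows; since the paper does not spell out how the overlap percentage is computed, the intersection area divided by the smaller block's area is used here as a plausible stand-in, and the two thresholds are free parameters:

```python
# A sketch of the local-scale clustering decision of Eq. (8).
# Blocks are (x, y, w, h) tuples in image coordinates.
def overlap_rate(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / float(min(aw * ah, bw * bh)) if inter else 0.0

def center_distance(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ((ax + aw / 2 - bx - bw / 2) ** 2 + (ay + ah / 2 - by - bh / 2) ** 2) ** 0.5

def same_cluster(a, b, th_overlap=0.5, th_distance=10.0):
    """cluster(x, y) from Eq. (8): 1 if the two blocks belong to the same cluster."""
    return int(overlap_rate(a, b) >= th_overlap and center_distance(a, b) <= th_distance)
```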
3. Face Identification

There are two major parts of face identification in this work, facial feature extraction and the adaptive probabilistic model (APM). The facial feature extractor is constructed by principal component analysis (PCA) [17], which effectively reduces the number of dimensions by maximizing the projected scatter of all samples. To begin with, we have a training set of N images, and each image consists of n elements. In our case, N equals 4000, the total number of images in the database, and each image has n elements with a size of 24 × 24, that is, a 576-element vector.

The process of obtaining a single space consists of finding the covariance matrix C of the training set and computing the eigenvectors v_k for k = 1, 2, ..., n. The eigenvectors v_k corresponding to the largest eigenvalues λ_k span the base of the searching subspaces. Each original image can be projected into the subspace as in the following equation:

η_k = v_k^T · Φ_s,   k = 1, 2, ..., m,   (9)

where m (m < n) is the chosen dimensionality of the image subspace, and Φ_s = Γ_s − Ψ represents the relation of Γ_s, a set of projected training images, and Ψ, the average image of the training set. The closer m is to n, the more precise the results of face identification are, but the face identification then takes more computational time to project the original images into the corresponding subspaces. Hence, we have to choose an appropriate dimensionality of the image subspaces. Figure 11 shows one instance of how we determine the number of image subspaces. Figure 11(a) indicates that the pattern information about representative facial features grows as the number of principal components increases. But the detection rate in Figure 11(b) saturates to a limit value or is even reduced when the number

of principal components is larger than a specific value (50 in this case). The reason is simply that the pattern information may include both significant information and noise, so that extracting more eigenvectors involves more noise. Therefore, how to determine the optimal number of eigenvectors is much more significant, and this idea contributes one of the major points of this paper. We can easily observe that the detection rate descends because of the effect of noise. From the observations in our simulations, we determine the number of eigenvectors in our face identification system such that the pattern information and detection rate are about 81% and 93%, respectively.

3.1. Similarity Measure. The APM method is proposed to achieve faster and more functional face identification. It can register new clients and update the clients' information online. This capability enhances the practicability and raises the identification rate of the proposed system for further applications. The primary concept of the APM architecture is based on view-independent face identification. The view-independent model of face identification is more robust than the single-view one because the head orientation of a person may change in real conditions. In our proposed system, the view-independent model of the face identification system is built with five different head orientations for each client, as shown in Figure 12.

The APM method follows the probabilistic constraint in the similarity measures to design a model of likelihood functions, since the judgment rules of the classification depend on the degree of likelihood. We denote a testing sample x, and the similarity between x and each registered client can be computed by the likelihood function of each client. The testing sample x is classified as the client with the largest similarity. The likelihood function APM_c(x) for class c is a mixture of probabilistic functions, which is shown in the following equation:

APM_c(x) = Σ_{j=1}^{5} w_{c,j,t} p_{c,j}(x),   (10)

where p_{c,j}(x) for j = 1, 2, ..., 5 is defined as the probabilistic function in (11) and w_{c,j,t} is the weighting value, which satisfies (12):

p_{c,j}(x) = 1 / ((2π)^{d/2} σ_{c,j,t}^{d}) · exp( −(1/2) (x − μ_{c,j,t})^T [Σ]^{−1} (x − μ_{c,j,t}) ),   (11)

Σ_{j=1}^{5} w_{c,j,1} = 1.   (12)

With the assumptions in (13), the probabilistic function (11) can be simplified to (14), and our initial weight w_{c,j,1} is set to 0.2 for each of the j head orientations:

|Σ| = σ_{c,j,t}^{2d} → |Σ|^{1/2} = σ_{c,j,t}^{d},   [Σ] = σ_{c,j,t}^{2} · I → [Σ]^{−1} = σ_{c,j,t}^{−2} · I,   (13)

p_{c,j}(x) = 1 / ((2π)^{d/2} σ_{c,j,t}^{d}) · exp( −(x − μ_{c,j,t})^T (x − μ_{c,j,t}) / (2 σ_{c,j,t}^{2}) ).   (14)

The other parameters in (10)–(14), that is, t, d, μ_{c,j,t}, and σ_{c,j,t}, represent the time of the update of each client's information, the dimension of the input vectors, the mean vector, and the covariance matrix, respectively.

Figure 13: The detection rate with respect to different values of j in the covariance matrix of (15).

3.2. Parameter Tuning and Adaptive Updating. The covariance matrix σ_{c,j,t} may affect the performance of APM, so we are motivated to optimize it. Our face database contains the images of 10 persons in these simulations, and we initialize the covariance matrix σ_{c,j,0} with the variance of the training data. We then obtain the updated covariance matrix σ_{c,j,1} by the following equation:

σ_{c,j,1} = (1/j) × σ_{c,j,0}.   (15)

The detection rate with respect to different values of j in the covariance matrix σ_{c,j,1} is shown in Figure 13. The detection rate is obviously improved for 4 < j < 43. We thus choose the parameter j to be 5 and obtain an optimized covariance matrix σ_{c,j,1} in the parameter tuning of APM throughout this paper.

The adaptive updating process focuses on the parameter updating of APM. The design of adaptive updating for APM can improve the detection rate of face identification. As the number of updating iterations increases, APM becomes more robust and can identify the heads of our testing clients in different orientations more accurately.
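A sketch of the APM similarity of (10)-(14) is given below (Python/NumPy): each client is a weighted mixture of five isotropic Gaussians, one per head orientation, evaluated on the eigenface feature vector. The storage layout (per-client lists of weights, means, and sigmas) is our own choice; the initial weights would be 0.2 each, as stated above:

```python
# A sketch of the APM likelihood, Eqs. (10)-(14).
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """p_{c,j}(x) of Eq. (14) with covariance sigma^2 * I."""
    d = x.size
    diff = x - mu
    norm = (2.0 * np.pi) ** (d / 2.0) * sigma ** d
    return np.exp(-diff.dot(diff) / (2.0 * sigma ** 2)) / norm

def apm_likelihood(x, weights, means, sigmas):
    """APM_c(x) of Eq. (10): weighted sum over the five head orientations."""
    return sum(w * gaussian_pdf(x, mu, s) for w, mu, s in zip(weights, means, sigmas))

def identify(x, clients):
    """Return the client with the largest likelihood; clients maps name -> parameters."""
    scores = {name: apm_likelihood(x, *params) for name, params in clients.items()}
    return max(scores, key=scores.get), scores
```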

Figure 14: Detection rate as a function of (a) the parameter α and (b) the parameter ρ.

(a) (b)

Figure 15: Face detection results for (a) a single face and (b) multifaces in an image.

While a client is identified correctly, the functional APM is updated immediately by using the following equations:

w_{c,j,t} = (1 − α) w_{c,j,t−1} + α M_{c,j,t},   M_{c,j,t} = 1 if orientation j is matched, 0 otherwise,
μ_{c,j,t} = (1 − ρ) μ_{c,j,t−1} + ρ x,
σ_{c,j,t}^{2} = (1 − ρ) σ_{c,j,t−1}^{2} + ρ (x − μ_{c,j,t})^T (x − μ_{c,j,t}),   (16)

where α and ρ are the learning rates for the weights, mean, and covariance matrix. The parameters μ_{c,j,t} and σ_{c,j,t} of the unmatched distributions remain the same. The magnitude of the learning rates influences the efficiency of the APM updating: a large learning rate makes the likelihood functions of APM overfit, while a small learning rate results in a worse detection rate. We use the ORL database, which contains 40 persons, to select the parameters α and ρ. We have ten images for each person; five of them are used as training data, two as testing data, and the others as updating data. Figures 14(a) and 14(b) show the detection rates with respect to the parameters α and ρ, respectively. To obtain the best detection rates in the experimental results, we choose 0.05 for the parameter α and 0.2 for the parameter ρ.
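The update rule (16) translates almost directly into code; the sketch below assumes the matched orientation index is known from the identification step and uses the values α = 0.05 and ρ = 0.2 selected above:

```python
# A sketch of the adaptive APM update of Eq. (16) for a correctly identified client.
import numpy as np

def apm_update(weights, means, sigmas, x, matched_j, alpha=0.05, rho=0.2):
    for j in range(len(weights)):
        m = 1.0 if j == matched_j else 0.0            # M_{c,j,t}
        weights[j] = (1 - alpha) * weights[j] + alpha * m
    # Only the matched orientation's mean and covariance are updated.
    means[matched_j] = (1 - rho) * means[matched_j] + rho * x
    diff = x - means[matched_j]
    sigmas[matched_j] = np.sqrt((1 - rho) * sigmas[matched_j] ** 2 + rho * diff.dot(diff))
    return weights, means, sigmas
```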
4. Experimental Results

The experimental results can be divided into two sections, face detection and face identification. We also compare our face detection results with OpenCV (Open Source Computer Vision Library). For the face identification experiments, face detection and face identification have to work together. Our proposed system works well for both offline and online testing. Our online testing mechanism automatically captures images frame by frame from a fixed camera in order to achieve real-time operation, and updates the parameters of APM or adds new clients.

4.1. Face Detection. Figures 15(a) and 15(b) present the results of face detection for an image with a single face or multiple faces. In the case of multiple persons with faces of different sizes, the proposed face detector can precisely locate the face regions. In order to estimate the performance of the face detector in a more quantitative manner, we show the performance of our system by estimating the detection rates, error rates, and the numbers of false accept images. The testing set consisted of 100

Figure 16: The performance of the face detector at different thresholds, with and without lighting normalization: (a) detection rates, (b) error rates, (c) numbers of false accept images.

Figure 17: Face detection results of the proposed system and OpenCV.

pictures with 434 labeled frontal faces, collected randomly from BaoDataBase, the Carnegie Mellon Test Images, and pictures collected by ourselves. Table 1 displays the comparison of the face detector with and without lighting normalization. Figures 16(a)–16(c) show the detection rates, error rates, and numbers of false accept images at different thresholds. We tested 7 different thresholds for our face detector and calculated the accuracy, error, and number of false accept images correspondingly. In our results, the performance of the face detector with lighting normalization is better than that without lighting normalization. We also compared our face detection system with OpenCV on the same testing set. The testing results of OpenCV were a detection rate of 81.36%, an error rate of 26%, and 5 false-accept images. Although the number of false-accept images of OpenCV was smaller than that of our method in some special cases, the detection rate of our method is better than that of OpenCV for threshold values smaller than 0.56. Figure 17 demonstrates the results of face detection, where the images in the top and bottom rows indicate the detected results of our proposed face detector and of OpenCV, respectively. It can be observed that our face detector localizes most face regions precisely, except for the images shown in the second row. Most of the erroneous results occurred because the face images in the used database were mostly from Westerners rather than Asians.

4.2. Face Identification. Table 2 illustrates the comparison of the proposed face identification method PCA+APM and

Table 1: The comparison of the face detector with and without lighting normalization.

Threshold   Detection rate (%)        Error rate (%)        False accept images
            without      with         without    with       without    with
0.5600      72.02        78.60        40         13         13         20
0.5550      73.12        82.80        30         7          21         31
0.5525      74.53        85.11        26         6          23         34
0.5500      74.55        85.57        26         5          29         39
0.5450      76.92        86.70        20         4          38         43
0.5400      80.24        87.29        13         4          51         58
0.5350      82.86        88.29        8          2          54         66

Table 2: The comparison of characteristics of the proposed method and others.

Comparison parameters          Eigenfaces [17]      PCA+CN [19]      SOM+CN [20]           PCA+APM
Method
  Feature                      Eigenfaces           Eigenfaces       Self-organizing map   Eigenfaces
  Classification               K-nearest neighbor   Neural network   Neural network        Adaptive probabilistic model
Characteristic
  Training time                Quick                Slow             Slow                  Quick
  Update ability               No                   No               No                    Yes
  Add client                   Yes                  No               No                    Yes
  Practicability               Better               Worse            Worse                 Better

other methods. The compared methods include Eigenfaces, PCA+CN (principal component analysis and convolutional neural network), and SOM+CN (self-organizing map and convolutional neural network). Our comparison is based on the training computational time, the updating ability, and the ability to add clients to the database in practice. Because our system is developed with efficiency and real-time performance in mind, practicability is the most important of all these factors. Our system can register new clients and update the clients' information online in real-time use, which achieves this practicability.

Figure 18: The performance of the face identifier with respect to the number of clients.

4.2.1. Offline Testing. Table 3 presents the comparison of detection rates, in which the testing set uses the ORL database of 40 persons with ten images each. We select 3, 4, and 5 images of each person as the training data, and the remainder for testing. Table 3 shows that the more images of each person are used in the training process, the higher the detection rate. The lighting normalization also has a great impact on the detection rate: the detection rate without lighting normalization is slightly below that of SOM+CN, and with lighting normalization the detection rate is higher by about 0.3%–3%. In order to test the tolerance of our system, we designed an experiment to measure the accuracy for different numbers of clients in particular. Figure 18 shows the performance of the face identifier with different numbers of clients. The detection rate reaches 100% for fewer than 10 clients and decreases when the number of clients exceeds 10; it is reduced to 86% when the number of clients reaches 63. This could be acceptable as long as the detection rate of the face identifier stays higher than 80%; in such cases our system is still within a tolerable range, and our proposed approach can accept more than 63 clients. For measuring the performance of the adaptive updating process, we use five images of each person in the ORL database for the training data, three for testing, and two for updating. We use cross validation to estimate the performance of adaptive updating. The upper and lower lines in Figure 19 illustrate the detection rates after and before the adaptive updating process, respectively. From these experimental data, the performance of the face identifier with the adaptive updating process is obviously improved.

Table 3: The comparison of detection rates of the proposed method and others.

Training images per person   Eigenfaces   PCA+CN   SOM+CN   PCA+APM (without / with lighting normalization)
3                            81.8         86.8     88.2     86.8 / 89.3
4                            84.6         87.9     92.9     92.5 / 92.9
5                            89.5         92.5     96.2     95.0 / 98.0

Figure 19: The performance of adaptive updating (detection rate per cross-validation fold, before and after updating).

Figure 20: The threshold distinguishing the clients and impostors (FRR and FAR in percent as functions of the threshold; the two curves cross near 0.000125).
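A sketch of how such a threshold can be read off the two curves follows: it scans a grid of candidate thresholds and keeps the one where FAR and FRR are closest, assuming the client and impostor similarity scores are available as arrays (the grid and the score orientation, higher meaning more client-like, are our assumptions):

```python
# A sketch of choosing the operating threshold at the FAR/FRR crossing point (Figure 20).
import numpy as np

def eer_threshold(client_scores, impostor_scores, candidates):
    client_scores = np.asarray(client_scores)
    impostor_scores = np.asarray(impostor_scores)
    best_t, best_gap = None, float("inf")
    for t in candidates:
        frr = np.mean(client_scores < t)       # clients falsely rejected
        far = np.mean(impostor_scores >= t)    # impostors falsely accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Usage: threshold = eer_threshold(clients, impostors, np.linspace(0.0, 0.01, 1000))
```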

Table 4: The summarized thresholds for distinguishing the clients and impostors.

Threshold    FRR     FAR
0            4.5     89.9
0.000125     18.5    18.5
0.001        32.8    5.1
0.002        40.5    1.3
0.003        48.9    0
0.004        50.7    0

We had a total of 26 persons for finding the threshold that tells clients from impostors in the proposed system. For 13 of them, the clients, five images each are used as the training data and the rest are taken as the testing data. We additionally selected five images from each of the remaining 13 impostors as testing images. Figure 20 presents the results for the candidate thresholds for telling the clients from impostors, and the results are summarized in Table 4 for some threshold values. The threshold values are similarity measures of APM. In Figure 20, FRR and FAR represent the false reject and false accept rates. We use the intersection of the FAR and FRR curves to determine the similarity threshold. Based on the experimental results, the threshold for telling the clients from impostors can be set to 0.000125.

4.2.2. Online Testing. The experimental results of the online testing were obtained on a system with an Intel P4 2.60 GHz CPU and 1 GB RAM. The development tool was Borland C++ Builder 6.0 on Windows XP. The input images were captured from cameras at a resolution of 320 × 240. Figure 21 shows some examples of clients in our database. The process of identifying a newly arriving face as a client or an impostor starts with capturing an image and running the face detector to localize the face regions. The face identifier module then recognizes the faces as registered clients or impostors. Some results are shown in Figure 22. The registration process for new clients requires images with five different head orientations: upward, downward, leftward, rightward, and frontal. Figure 23 shows the state of the database when a client is updated. The total run time of the whole process, which begins with capturing an image, continues with face detection and localization of face regions, and terminates at face recognition and client identification, is estimated to be 0.4–0.5 milliseconds. Moreover, each subprocess, such as image acquisition, face detection, and client identification, accounts for between a

Figure 21: Examples of clients in the database.

Figure 22: Testing results generated by the proposed face identifier: detected faces are labeled as registered clients (e.g., Whei, Linda) or as impostors.

Figure 23: (a) A new client with different head orientations; (b) an overview of the updated database; (c) the face identifier before and after registering a new client.

(a) (b)

Figure 24: Examples of detection failures in (a) false detection, and (b) missed face regions.

(a) (b)

Figure 25: The example of false acceptance for (a) an impostor and (b) a client.

fourth and a half of the total runtime in a complete process of multiclient detection and identification.

5. Discussions and Conclusions

The integration of face detection and face identification for a real-time face recognition application has been proposed in this paper. The design of this system focuses on robustness and practicability. We demonstrate that our proposed approach accurately detects the face regions in an image. Besides, the system provides an identification mechanism to determine which client in the database an extracted face belongs to, and a judgment rule to regard a detected face as an impostor or a new client. In face detection, the lighting normalization actually improves the detection rate, and a region-based clustering method is able to deal with the problem of multiple candidates around a target face. However, some nonface images with face-like shapes, as shown in Figure 24(a), or partially occluded faces in an image, as shown in Figure 24(b), may result in detection errors. In the special case shown in Figure 25, where two clients are too similar to be distinguished, false acceptance may occur inevitably. For face identification, an adaptive probabilistic model (APM) is introduced to model the characteristics of clients. According to the design of APM, the system can register a new client and update the information of clients online. By the process of adaptive updating, the weights for different poses and the matched probabilistic functions are adjusted to reflect the latest information of the registered clients. The experimental results finally show that the proposed APM-based technique indeed performs well for both face detection and identification in most cases. We will consider more of the exceptional cases that may not be handled by our proposed system in the near future.

Acknowledgments

This work was supported in part by the Aiming for the Top University Plan of National Chiao Tung University, the Ministry of Education, Taiwan, under Contract 99W962, and supported in part by the National Science Council, Taiwan, under Contracts NSC 99-3114-E-009-167 and NSC 98-2221-E-009-167.

References

[1] A. K. Jain, R. Bolle, and S. Pankanti, Biometrics: Personal Identification in Networked Society, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.
[2] D. Zhang, Automated Biometrics: Technologies and Systems, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2000.

Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 764838, 7 pages
doi:10.1155/2010/764838

Research Article
Comparing an FPGA to a Cell for an Image Processing Application

Ryan N. Rakvic,1 Hau Ngo,1 Randy P. Broussard,2 and Robert W. Ives1


1 Department of Electrical and Computer Engineering, U.S. Naval Academy, Annapolis, MD 21402-5000, USA
2 Department of Systems Engineering, U.S. Naval Academy, Annapolis, MD 21402-5000, USA

Correspondence should be addressed to Ryan N. Rakvic, [email protected]

Received 2 December 2009; Accepted 8 March 2010

Academic Editor: Yingzi Du

Copyright © 2010 Ryan N. Rakvic et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to exploit the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms with high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, iris recognition systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.

1. Introduction

For most of the history of computing, the amazing gains in performance we have experienced were due to two factors: decreasing feature size and increasing clock speed. However, there are fundamental physical limits to this approach: decreasing feature size gets more and more expensive and difficult due to the physics of the photolithographic process used to make CPUs, and increasing clock speed results in a subsequent increase in power consumption and heat dissipation requirements. Parallel computation has been in use for many years in high performance computing; however, in recent years, multicore architectures have become the dominant computer architecture for achieving performance gains. The signal of this shift away from ever increasing clock speeds occurred when Intel Corporation cancelled development of its new single core processors to focus development on dual core technology. Executing programs in parallel on hardware specifically designed with parallel capabilities is the new model to increase processor capabilities while not entering into the realm of extensive cooling and power requirements.

The Cell processor is a joint effort by Sony Computer Entertainment, Toshiba Corporation, and IBM that began in 2000, with the goal of designing a processor with performance an order of magnitude over that of desktop systems shipping in 2005. The result was the first-generation Cell Broadband Engine (BE) processor, which is a multicore chip comprised of a 64-bit Power Architecture processor core and eight synergistic processor cores. A high-speed memory controller and high-bandwidth bus interface are also integrated on-chip [1].

The Cell processor, shown in Figure 1, has a unique heterogeneous architecture compared to the homogeneous Intel Core architecture. It has a main processor called the Power Processing Element (PPE), a two-way SMT PowerPC-based processor, and eight fully functional coprocessors called the Synergistic Processing Elements, or SPEs. The PPE directs the SPEs, where the bulk of the computation occurs. The PPE is intended primarily for control processing, running operating systems, managing system resources, and managing SPE threads. The SPEs are single-instruction, multiple-data (SIMD) processors, shown in Figure 1, with a RISC core [2].

Figure 1: Cell BE high-level architecture diagram.

According to IBM, the Cell BE is capable of achieving in many cases 10 times the performance of the latest PC processors [3]. The first major commercial application of the Cell processor was in Sony's PlayStation3 game system. The PlayStation3 has only 6 SPU cores available, due to one core being reserved by the OS and one core being disabled in order to increase production yields. Sony has made it very easy to install a new Linux-based operating system onto the PlayStation3, thereby making the game system a popular choice for experimenting with the Cell BE.

Historically, programmers have thought in sequential terms, and programming these multicore processors can be difficult. Often, this involves completely redesigning an existing program from the ground up and implementing complex synchronization protocols. Parallel programming is based on the simple idea of division of labor: large problems can be broken up into smaller ones that can be worked on simultaneously. Making it more challenging is the fact that the SPEs in the Cell do not share memory with the PPE. Additionally, they are not visible to the operating system, thereby leaving all management of SPE code and data to the programmer.

Another popular approach to parallelization is to use Field Programmable Gate Arrays (FPGAs). FPGAs are complex programmable logic devices that are essentially a "blank slate" integrated circuit from the manufacturer and can be programmed with nearly any parallel logic function. They are fully customizable, and the designer can prototype, simulate, and implement a parallel logic function without the costly process of having a new integrated circuit manufactured from scratch. FPGAs are commonly programmed via VHDL (VHSIC Hardware Description Language). VHDL statements are inherently parallel, not sequential. VHDL allows the programmer to dictate the type of hardware that is synthesized on an FPGA. For example, if you would like to have many ALUs that execute in parallel, then you program this in the VHDL code.

In this work, we have parallelized a repeatedly executed portion of an image processing algorithm with both an FPGA and a Cell processor. In Section 2 we present the iris recognition algorithm and iris template matching. In Section 3, we present an approach to iris matching utilizing parallel logic with field-programmable gate arrays and Cell processors. In Section 4 we demonstrate this efficiency with a comparison between the FPGA, the Cell processor, and a sequential processor. We provide concluding statements in Section 5.

2. Iris Recognition Algorithm

Iris recognition stands out as one of the most accurate biometric methods in use today. One of the first iris recognition algorithms was introduced by pioneer Dr. John Daugman [4]. An alternate iris recognition algorithm, referred to as the Ridge Energy Direction (RED) algorithm [5], will be the basis for this work. There are many iris recognition algorithms; what follows is a brief description of the RED algorithm. Since this research is focused on computational acceleration, we refer the reader to [6–12].

Figure 2: RED iris recognition algorithm. Visible is the associated two-dimensional encoding of the iris image into energy data [14].

Figure 3: A 9 × 9 filter computing the filtering of the top left portion of hypothetical input energy data. In this instance, each coefficient of the filter is multiplied by the corresponding image data within the scope of the filter (filter kernel), where some of the data is repeated from the opposite side. These filter coefficient and input data products make up a partial result, the sums of which generate a local result corresponding to the centroid of the filter.

The iris is the colored part of the eye, protected by the cornea, that extends from the pupil to the white of the eye. Its patterns remain stable over a lifetime. An example iris image is depicted in Figure 2. Typically, an iris image is captured in the near-infrared light spectrum. Most iris capture systems have dedicated illumination and capture a 640 by 480 pixel image containing eight bits per pixel. Once a digital image of the iris is captured, the system begins processing the image to transform it from a two-dimensional array of pixels to a two-dimensional encoded string of bits for comparison (see "Segment Iris into Polar Coordinates" in Figure 2). In this process, the first step is to identify the iris among other facial elements such as the eyelids, sclera (white part of the eye), pupil (dark circle in the center of the eye), and eyelashes. The algorithm finds the pupil by thresholding the image and using basic features such as circularity to find the most circle-like object in the thresholded image. The outer boundary is found using local kurtosis, which has near-zero values at the boundary. Details of this segmentation method are described in prior art [13]. Once these boundaries are located, the computer can extract only the meaningful portions of the iris.

Once the iris is segmented, the algorithm divides the iris into m concentric annuli and n radial lines, which results in an m × n representation of the iris. This step is effectively a rectangular to polar coordinate conversion. The energy of each pixel is merely the square of the value of the infrared intensity within the pixel and is used to distinguish features within the iris. The next step is to encode the iris image from two-dimensional brightness data down to a two-dimensional binary signature, referred to as the template ("Template Generation" in Figure 2). To accomplish this, the energy data are passed into two directional filters to determine the existence of ridges and their orientation. The RED algorithm uses directional filtering to generate the iris template, a set of bits that meaningfully represents a person's iris.

To help perform this filtering, the energy data passed from the iris segmentation process is made periodic in the horizontal dimension to account for edge effects when performing the rectangular to polar conversion. The filter passes over this periodic array, taking in 81 (9 × 9) values at a time (note that in [5], an 11 × 11 filter is used). More specifically, the result is computed by first multiplying each filter value by the corresponding energy data value. Then a summation is performed, and the result is stored in a memory location that corresponds to the centroid of the filter. This process repeats for each pixel in the energy data, stepping right, column by column, and down, row by row, until the filtering is complete, as shown in Figure 3. Finally, the template is generated by comparing the results of two different directional filters (horizontal and vertical, see Figure 3) and writing a single bit that represents the filter with the highest output at the equivalent location. The output of each filter is compared and, for each pixel, a "1" is assigned for strong vertical content or a "0" for strong horizontal content. These bits are concatenated to form a bit vector unique to the "iris signal" that conveys the identifiable information. In this study, we assume that a template consists of 2048 bits, representing the uniqueness of the iris.

A template mask is also created during this filtering process. If both filter output values are not above a certain threshold, then a mask bit is cleared for that particular pixel location. The template mask is used to identify pixel locations where neither the vertical nor the horizontal direction is identified.

Once encoded, the iris recognition system must be able to reliably match the newly created template with a database of previously enrolled templates. The newly encoded iris is compared to a database of previously created templates using a fractional Hamming Distance (HD) calculation, which is defined in (1) and illustrated in Figure 4:

HD = \frac{\left\| (\text{template A} \otimes \text{template B}) \cap \text{mask A} \cap \text{mask B} \right\|}{\left\| \text{mask A} \cap \text{mask B} \right\|}.  (1)

Figure 4: A new template is compared with each template stored in a database.

The ⊗ operator is the exclusive-or operation used to detect disagreement between corresponding bit pairs in the two templates, ∩ represents the binary AND function, and masks A and B identify the values in each template that are not corrupted by artifacts such as eyelids/eyelashes and specularities. The denominator of (1) ensures that only valid bits are included in the calculation, after artifacts are discounted. The lower the HD result, the greater the match between the two irises being compared. The fractional Hamming distance between two templates is compared to a predetermined threshold value, and a match or nonmatch declaration is made.

The HD calculation, or iris matching, is critical to the throughput performance of iris recognition since this task is repeated many times, as seen in Figure 4. Traditional systems for HD calculation have been coded in sequential logic (software); databases have been spread across multiple processors to take advantage of the parallelism of the database search, but the inherent parallelism of the HD calculation itself has not been fully exploited.

3. Implementations

3.1. Sequential on a CPU. Currently, iris recognition algorithms are deployed globally in a variety of systems ranging from computer access to building security to national-scale databases. These systems typically use central processing unit- (CPU-) based computers. CPU-based computers are general purpose machines, designed for all types of applications, and are to first order programmed as sequential machines, though there are provisions for multiprocessing and multithreading. Recently, there has been interest in exploring the parallel nature of this application [15]. It is challenging to exploit the inherent parallelism of many algorithms in such architectures.

In particular, the matching portion of the algorithm is important since it needs to be repeated many times (depending on the number of iris comparisons necessary). Illustrated in Figure 5 is optimized C++ code for computing the fractional HD between two templates. The optimizations in this code include the use of 32-bit logical operations and the use of a lookup table for bit counting.

We would like to highlight the sequential nature of this code. For example, since the XOR function is performed 32 bits at a time, a loop (the for loop shown) is necessary. Since it is computing 2048 bits, this loop is executed 64 times. Also, note that the XOR and AND computations are performed sequentially. These instructions could be scheduled to execute in parallel, but a modern CPU has a limited number of functional units, therefore limiting the amount of parallel execution. Summation of the bits is performed using lookup tables. Finally, the HD score is computed as a ratio of the number of differences between the templates to the total number of bits that are not masked.

Illustrated in Figure 6 is the associated assembly code created for the Hamming distance calculation. The code is compiled for a Xeon processor, and hence IA-32 assembly code is produced [16]. For each C++ computation, at least 5 assembly language instructions are required. For example, the AND computation in the C++ code generates 4 MOV instructions and one AND instruction. The MOV instructions are required to move data to and from memory. The AND instruction is a 32-bit bitwise computation performed by an ALU functional unit in the processor. As stated before, instruction execution bandwidth for a processor is limited by the number of functional units that it has. Loop instructions require overhead assembly instructions to again move the proper data to and from memory. Each iteration of the loop requires a total of 38 assembly instructions. Therefore, this code requires 64 loops × 38 assembly instructions (roughly 2,400 instructions) to perform one template match.

3.2. Parallel on an FPGA. Field Programmable Gate Arrays (FPGAs) are complex programmable logic devices that are essentially a "blank slate" integrated circuit from the manufacturer and can be programmed with nearly any parallel logic function. They are fully customizable, and the designer can prototype, simulate, and implement a parallel logic function without the costly process of having a new integrated circuit manufactured from scratch. FPGAs are commonly programmed via VHDL (VHSIC Hardware Description Language). VHDL statements are inherently parallel, not sequential. VHDL allows the programmer to dictate the type of hardware that is synthesized on an FPGA. Ideally, if 2,048 matching elements could fit onto the FPGA, all 2048 bits of the template could be compared at once, with a corresponding increase in throughput. Here we perform the same function as the aforementioned C++ code; however, we are doing this computation completely in parallel. There are 2,048 XOR gates and 4,096 AND gates required for this computation. In addition, adders are required for summing and calculating the score.

This code is contained within a "process" statement. The process statement is only initiated when a signal in the sensitivity list changes value. The sensitivity list of the process contains the clock signal, and therefore the code is executed once per clock cycle. In this code, the clock signal is drawn from our FPGA board, which contains a 50 MHz clock. Therefore, every 20 ns, this Hamming distance calculation is computed. This code is fully synthesizable and can be downloaded onto an FPGA for direct hardware execution.
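Two small checks on the figures just quoted, derived only from numbers stated in the text: one way to account for the 4,096 AND gates is two AND operations per template bit (masking the XOR result with both masks), and the 50 MHz clock fixes the match throughput:

2 \times 2048 = 4096 \ \text{AND operations}, \qquad \frac{1}{50\ \text{MHz}} = 20\ \text{ns per match} \;\Rightarrow\; 5 \times 10^{7}\ \text{matches per second}.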

for (IntPtr1 = (unsigned int *)&matrix[row][0],
     IntPtr2 = (unsigned int *)&InMatrix->matrix[0][0],
     MaskPtr1 = (unsigned int *)&Mask1->matrix[row][0],
     MaskPtr2 = (unsigned int *)&Mask2->matrix[0][0];
     IntPtr1 < (unsigned int *)&matrix[row][ActualCols - 4];
     IntPtr1++, IntPtr2++, MaskPtr1++, MaskPtr2++)
{
    // AND two Masks using 32 bit pointers
    Mask = *MaskPtr1 & *MaskPtr2;
    // XOR templates, AND with Masks using 32 bit pointers
    XOR = (*IntPtr1 ^ *IntPtr2) & Mask;
    // Sum lower 16 bits of XOR using lookup table
    Sum += Value[XOR & 0x0000ffff];
    // Sum upper 16 bits of XOR
    Sum += Value[(XOR >> 16) & 0x0000ffff];
    // Sum lower 16 bits of Mask
    MaskSum += Value[Mask & 0x0000ffff];
    // Sum upper 16 bits of Mask
    MaskSum += Value[(Mask >> 16) & 0x0000ffff];
};
Score->matrix[row][0] = (float)Sum / (float)MaskSum;

Figure 5: C++ code for fractional Hamming Distance Computation.
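The Value[] array indexed in Figure 5 is the bit-counting lookup table mentioned in Section 3.1. A minimal sketch of how such a table could be constructed follows; the table name and 16-bit indexing mirror the figure, and everything else is an assumption rather than the authors' code:

#include <array>
#include <cstdint>

// Value[x] holds the number of set bits in the 16-bit value x, so each 32-bit
// word can be counted with two table lookups, as in Figure 5.
std::array<std::uint8_t, 65536> buildPopcountTable()
{
    std::array<std::uint8_t, 65536> Value{};
    for (std::uint32_t x = 0; x < 65536; ++x) {
        std::uint32_t bits = 0;
        for (std::uint32_t v = x; v != 0; v >>= 1) {
            bits += v & 1u;        // add one for each set bit
        }
        Value[x] = static_cast<std::uint8_t>(bits);
    }
    return Value;
}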

3.3. Parallel on a Cell. We have also parallelized the HD calculation on the Cell processor in the PlayStation3. As stated before, SPE management is left entirely to the programmer. We therefore have completely separate code and compilations for the PPE and the SPEs. The code on the PPE works as the master, spawning off threads of work to the 6 individual SPEs. The work is divided up on iris template matching boundaries, not within a template match. Therefore, each SPE is individually responsible for 1/6th of the HD comparisons. To maximize performance, the HD calculation is vectorized on the SPEs, taking advantage of the SIMD capabilities of the SPUs.

4. Results

The CPU experiment is executed on an Intel Xeon X5355 [17] workstation class machine. The processor is equipped with 8 cores, a 2.66 GHz clock, and an 8 MB L2 cache. While there are eight cores available, only one core is used to perform this test, therefore allowing all cache and memory resources for the code under test. The HD code was compiled under Windows XP using the Visual Studio software suite. The code has been fully optimized to enhance performance. Additionally, millions of matches were executed to ensure that the templates are fully cached in the on-chip L2 cache. We report the best-case per-match execution time.

The PlayStation3 is used for our Cell experiments. Fedora Core 8 was chosen for installation onto the PlayStation3. Fedora Core 8 is not the most recent release of Fedora, but it was chosen because it is the most recent release that has been fully adapted to the PlayStation3. Additionally, the installation procedures available online for FC8 are the most detailed and complete of any Linux distribution. Furthermore, the IBM SDK, which is required for writing code that runs on the Cell's SPUs, is only released for the commercial Red Hat Enterprise Edition Linux or the freely available Fedora Core.

The FPGA experiment is executed on a DE2 [18] board provided by Altera Corporation. The DE2 board includes a Cyclone-II EP2C35 FPGA chip, as well as the required programming interface. Although the DE2 board is utilized for this research, only the Cyclone-II chip is necessary to execute our algorithm. The Cyclone-II [19] family is designed for high-performance, low-power applications. It contains over 30,000 logic elements (LEs) and over 480,000 embedded memory bits. In order to program our VHDL onto the Cyclone-II, we utilize the Altera Quartus software for implementation of our VHDL program. The Quartus suite includes compilation, synthesis, simulation, and programming environments. We are able to determine the size required of our program on the FPGA, and the resulting execution time. The optimized C++ code time is actually faster than some of the times reported in the literature for commercial implementations [20]. We attribute this difference to improvements in CPU speed and efficiency between the time of our experiments and the previous reports. However, this indicates that our C++ code is a reasonable target for comparison and that we may reasonably expect similar improvements from application of FPGA technology to other HD-based algorithms.

All VHDL code is fully synthesizable and is downloaded onto our DE2 board for direct hardware execution. As discussed above, our code is fully contained within a "process" statement. The process statement is only initiated when a signal in its sensitivity list changes value. The sensitivity list of our process contains the clock signal, and therefore the code is executed once per clock cycle. In this code, the clock signal is drawn from our DE2 board, which contains a 50 MHz clock. Therefore, every 20 ns, our calculation is computed.

Table 1 illustrates the execution times and acceleration achieved for our implemented FPGA version on the Cyclone-II EP2C35, a Cell-based version, and a Xeon-based C++ version. The optimized C++ version takes 383 ns per match, the Cell version with 6 SPEs takes 50 ns, and the FPGA takes 20 ns per match.
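The work division described in Section 3.3, where whole template matches are handed out to the 6 SPEs, can be sketched generically with standard C++ threads. This is only an analogue for illustration; the actual implementation uses the IBM Cell SDK to manage SPE threads, whose API is not shown here, and Bits and fractionalHD are the types from the earlier sketch.

#include <bitset>
#include <cstddef>
#include <thread>
#include <vector>

using Bits = std::bitset<2048>;
double fractionalHD(const Bits&, const Bits&, const Bits&, const Bits&); // from the earlier sketch

struct Enrolled { Bits templ; Bits mask; };   // hypothetical database entry

void matchDatabase(const Bits& probe, const Bits& probeMask,
                   const std::vector<Enrolled>& db, std::vector<double>& scores)
{
    constexpr std::size_t kWorkers = 6;       // mirrors the 6 available SPEs
    scores.resize(db.size());
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < kWorkers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker takes every 6th comparison; the split is on
            // template-match boundaries, never inside a single match.
            for (std::size_t i = w; i < db.size(); i += kWorkers) {
                scores[i] = fractionalHD(probe, db[i].templ, probeMask, db[i].mask);
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
}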

Mask = *MaskPtr1 & *MaskPtr2;// AND Masks with 32 bit pointers


00401D63 mov ecx,dword ptr [ebp-24h]
00401D66 mov edx,dword ptr [ebp-28h]
00401D69 mov eax,dword ptr [ecx]
00401D6B and eax,dword ptr [edx]
00401D6D mov dword ptr [ebp-30h],eax
XOR = (*IntPtr1 ^ *IntPtr2) & Mask;
00401D70 mov ecx,dword ptr [ebp-1Ch]
00401D73 mov edx,dword ptr [ebp-20h]
00401D76 mov eax,dword ptr [ecx]
00401D78 xor eax,dword ptr [edx]
00401D7A and eax,dword ptr [ebp-30h]
00401D7D mov dword ptr [ebp-2Ch],eax
Sum += Value[XOR & 0x0000ffff]; // Sum lower 16 bits of XOR using lookup table

00401D80 mov ecx,dword ptr [ebp-2Ch]


00401D83 and ecx,0FFFFh
00401D89 mov edx,dword ptr [ebp-34h]
00401D8C add edx,dword ptr [ecx*4+4519E0h]
00401D93 mov dword ptr [ebp-34h],edx
Sum += Value[(XOR>>16) & 0x0000ffff]; // Sum upper 16 bits XOR

00401D96 mov eax,dword ptr [ebp-2Ch]


00401D99 shr eax,10h
00401D9C and eax,0FFFFh
00401DA1 mov ecx,dword ptr [ebp-34h]
00401DA4 add ecx,dword ptr [eax*4+4519E0h]
00401DAB mov dword ptr [ebp-34h],ecx
MaskSum += Value[Mask & 0x0000ffff]; // Sum lower 16 bits of Mask
00401DAE mov edx,dword ptr [ebp-30h]
00401DB1 and edx,0FFFFh
00401DB7 mov eax,dword ptr [ebp-38h]
00401DBA add eax,dword ptr [edx*4+4519E0h]
00401DC1 mov dword ptr [ebp-38h],eax
MaskSum += Value[(Mask>>16) & 0x0000ffff]; // Sum upper 16 bits of Mask
00401DC4 mov ecx,dword ptr [ebp-30h]
00401DC7 shr ecx,10h
00401DCA and ecx,0FFFFh
00401DD0 mov edx,dword ptr [ebp-38h]
00401DD3 add edx,dword ptr [ecx*4+4519E0h]

Figure 6: C++ code (highlighted) and IA-32 Assembly Code for Hamming Distance Calculations.
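A rough accounting of the listing in Figure 6 against the measured match time is consistent; the following figures are derived only from numbers quoted in the text, not separately measured:

64 \times 38 = 2432\ \text{instructions per match}, \qquad 383\ \text{ns} \times 2.66\ \text{GHz} \approx 1019\ \text{cycles}, \qquad 2432 / 1019 \approx 2.4\ \text{instructions per cycle},

a plausible sustained rate for this class of CPU.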

Table 1: FPGA versus CPU comparison for iris match execution.

                        Optimized Xeon   Cell on PS3     Cyclone-II EP2C35   Cyclone-II estimated   Stratix IV estimated
                        code             (with 6 SPEs)   (50 MHz)            @ 100 MHz              @ 500 MHz
Time per match (ns)     383              50              20                  10 (est.)              2 (est.)
Speedup over Xeon       n/a              7.66            19.15               38.3                   191.5
% usage of chip         n/a              n/a             73%                 n/a                    7.3% (est.)
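The speedup row in Table 1 follows directly from the per-match times, using the 383 ns Xeon result as the baseline:

\frac{383}{50} \approx 7.66, \qquad \frac{383}{20} \approx 19.15, \qquad \frac{383}{10} \approx 38.3, \qquad \frac{383}{2} \approx 191.5.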

The main result of this research is that the HD calculation on a modest-sized FPGA is approximately 19 times faster than a state-of-the-art CPU design and 2.5 times faster than the image processing Cell processor. The Cell processor greatly outperforms the Xeon machine and scales well across its cores, but still does not outperform a modestly sized FPGA.

In the Cyclone-II FPGA, there are over 400,000 memory bits available for on-chip storage. The iris templates must be stored either in memory on the FPGA or off-chip. In one instance of our implementation, we have implemented a 2048-bit wide memory in VHDL. We have added this to our code to verify that a small database can be stored on chip. One of the two templates compared is received from this dual-ported, 2048-bit wide, single-cycle cache implemented on our Cyclone-II FPGA. Therefore, once per clock cycle, a 2048-bit vector is fetched from on-chip memory, and the HD calculation is performed. Again, therefore, the entire process can be executed in 20 ns. We have successfully implemented and tested the HD calculation with and without a memory device.
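The on-chip capacity figures can be sanity-checked against the 2048-bit template size assumed throughout; the chip figures below are those quoted in the text:

\frac{480{,}000\ \text{bits}}{2048\ \text{bits/template}} \approx 234\ \text{templates (Cyclone-II)}, \qquad \frac{22.4\ \text{Mbits}}{2048\ \text{bits/template}} \approx 11{,}000\ \text{templates (Stratix IV)},

consistent with the figures of approximately 230 and 10,000 templates quoted below.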

Also reported in Table 1 is the utilization of the FPGA resources. Our implementation of the Hamming Distance algorithm utilizes 73% of our Cyclone-II FPGA. In terms of on-chip memory usage, one of the two templates compared is stored in the dual-ported, 2048-bit wide, single-cycle cache implemented on our Cyclone-II FPGA. Each stored template consumes 0.7% of on-chip memory. We have added this to our code to verify that a small database of approximately 230 templates can be stored on chip.

The Cyclone-II is not built for performance and is also not a state-of-the-art design. A projection of the performance of a faster Cyclone-II (100 MHz) and a state-of-the-art Stratix IV (500 MHz) FPGA is given in Table 1. A still modest Cyclone version clocked at 100 MHz is able to outperform the sequential version by a factor of 38. The faster Stratix IV is projected to perform approximately 190 times faster than the sequential version. Additionally, our implementation on the Stratix IV would only consume approximately 7.3% of the chip. On-chip memory for the Stratix IV is also much larger, with 22.4 Mbits of on-chip storage. For example, a database consisting of 10,000 irises can be stored on the Stratix IV. We anticipate this storage scaling trend to continue into the future, with larger and larger database storage becoming available. If a larger database is necessary, we propose an implementation where a DRAM chip is provided as part of the package, and the on-chip database is concurrently loaded while Hamming distances are being computed. In addition, with a larger FPGA, it is possible to compute multiple matches in parallel. This available parallelism is also demonstrated in Table 1.

5. Conclusion

The trend in modern computing is toward multicore designs. In this research, we are interested in the performance of a modern multicore processor, the Cell, compared to an FPGA for an image processing algorithm. We demonstrate that a vital portion of an iris recognition algorithm can be parallelized on both systems, and our results on an FPGA are 2.5 times better than the Cell processor. FPGAs have been on an impressive scaling trend over the last 10 years. We expect this scaling trend to continue in the short term, and we even believe that an FPGA could potentially be a part of the general purpose computer of tomorrow.

References

[1] "Synergistic processing in Cell's multicore architecture," http://www.research.ibm.com/people/m/mikeg/papers/2006ieeemicro.pdf.
[2] Cell Broadband Engine Programming, IBM DeveloperWorks, https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D.
[3] "Cell Broadband Engine," https://www-01.ibm.com/chips/techlib/techlib.nsf/products/Cell Broadband Engine.
[4] J. Daugman, "Probing the uniqueness and randomness of iriscodes: results from 200 billion iris pair comparisons," Proceedings of the IEEE, vol. 94, no. 11, pp. 1927–1935, 2006.
[5] R. W. Ives, R. P. Broussard, L. R. Kennell, R. N. Rakvic, and D. M. Etter, "Iris recognition using the ridge energy direction (RED) algorithm," in Proceedings of the 42nd Annual Asilomar Conference on Signals, Systems and Computers, pp. 1219–1223, Pacific Grove, Calif, USA, November 2008.
[6] C.-H. Park, J.-J. Lee, M. J. T. Smith, and K.-H. Park, "Iris-based personal authentication using a normalized directional energy feature," in Proceedings of the Audio and Video Based Biometric Person Authentication Conference, vol. 2688, pp. 224–232, 2003.
[7] Y. Chen, S. C. Dass, and A. K. Jain, "Localized iris image quality using 2-D wavelets," in Proceedings of the International Conference on Biometrics (ICB '06), pp. 373–381, Hong Kong, January 2006.
[8] S. Shao and M. Xie, "Iris recognition based on feature extraction in kernel space," in Proceedings of the IEEE Biometrics Symposium, Baltimore, Md, USA, September 2006.
[9] R. P. Broussard, L. R. Kennell, and R. W. Ives, "Identifying discriminatory information content within the iris," in Biometric Technology for Human Identification V, Proceedings of SPIE, Orlando, Fla, USA, March 2008.
[10] G. Gupta and M. Agarwal, "Iris recognition using non filter-based technique," in Proceedings of the Biometrics Symposium, pp. 45–47, Arlington, Va, USA, September 2005.
[11] R. W. Ives, L. Kennell, R. Broussard, and D. Soldan, "Iris recognition using directional energy," in Proceedings of the IEEE International Conference on Image Processing (ICIP '08), San Diego, Calif, USA, October 2008.
[12] L. Masek, Recognition of human iris patterns for biometric identification, M.S. thesis, The University of Western Australia, Perth Crawley, Australia, 2003, http://www.csse.uwa.edu.au/~pk/studentprojects/libor/LiborMasekThesis.pdf.
[13] L. Kennell, R. W. Ives, and R. M. Gaunt, "Binary morphology and local statistics applied to iris segmentation for recognition," in Proceedings of the IEEE International Conference on Image Processing (ICIP '06), Atlanta, Ga, USA, October 2006.
[14] J. Daugman, "Statistical richness of visual phase information: update on recognizing persons by iris patterns," International Journal of Computer Vision, vol. 45, no. 1, pp. 25–38, 2001.
[15] R. P. Broussard, R. N. Rakvic, and R. W. Ives, "Accelerating iris template matching using commodity video graphics adapters," in Proceedings of the 2nd IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS '08), Crystal City, Va, USA, September 2008.
[16] Intel Corporation, June 2008, http://www.intel.com/products/processor/manuals/index.htm.
[17] Intel Corporation, June 2008, http://processorfinder.intel.com/details.aspx?sSpec=SL9YM.
[18] Altera Corporation, June 2008, http://www.altera.com/education/univ/materials/boards/unv-de2-board.html.
[19] Altera Corporation, June 2008, http://www.altera.com/products/devices/cyclone2/cy2-index.jsp.
[20] J. Daugman, "How iris recognition works," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 21–30, 2004.
