
Efficient Histogram-Based Similarity Search in

Ultra-High Dimensional Space

Jiajun Liu1 , Zi Huang1,2 , Heng Tao Shen1 , and Xiaofang Zhou1,2


1
School of ITEE, University of Queensland, Australia
2
Queensland Research Laboratory, National ICT Australia
{jiajun,huang,shenht,zxf}@itee.uq.edu.au

Abstract. Recent developments in image content analysis have shown
that the dimensionality of an image feature can reach thousands or
more for satisfactory results in some applications such as face recognition.
Although high-dimensional indexing has been extensively studied in
tion. Although high-dimensional indexing has been extensively studied in
database literature, most existing methods are tested for feature spaces
with less than hundreds of dimensions and their performance degrades
quickly as dimensionality increases. Given the huge popularity of histogram
features in representing image content, in this paper we propose
a novel indexing structure for efficient histogram-based similarity search
in ultra-high dimensional space which is also sparse. Observing that all
possible histogram values in a domain form a finite set of discrete states,
we leverage the time and space efficiency of inverted file. Our new struc-
ture, named two-tier inverted file, indexes the data space in two levels,
where the first level represents the list of occurring states for each in-
dividual dimension, and the second level represents the list of occurring
images for each state. In the query process, candidates can be quickly
identified with a simple weighted state-voting scheme before their actual
distances to the query are computed. To further enrich the discrimina-
tive power of inverted file, an effective state expansion method is also
introduced by taking neighbor dimensions’ information into considera-
tion. Our extensive experimental results on real-life face datasets with
15,488 dimensional histogram features demonstrate the high accuracy
and the great performance improvement of our proposal over existing
methods.

1 Introduction
Image retrieval based on content similarity has been in the spotlight for the
past few decades [8]. The histogram, constructed by counting the number of
pixels from an image in each of a fixed list of bins, is one of the most popular
features used in many applications [11], where each image is represented by a
high-dimensional histogram feature vector. Among many distance functions pro-
posed for histogram comparison, the histogram intersection and the Euclidean
distance are widely used due to their high efficiency and effectiveness [16]. The
dimensionality of an image histogram is typically about tens or hundreds. Re-
cently, driven by the significant need of real-life applications such as identity

J.X. Yu, M.H. Kim, and R. Unland (Eds.): DASFAA 2011, Part II, LNCS 6588, pp. 1–15, 2011.
© Springer-Verlag Berlin Heidelberg 2011

verification, video surveillance, automated border control, crime scene footage
analysis, and so on, more sophisticated image features are required to reduce
false alarm rate under various conditions and noises in face recognition. For ex-
ample, Local Binary Patterns (LBP) [1] and recently proposed Local Derivative
Patterns (LDP) [19] are well known and proved to be very effective. According to
the particular settings, an 88 × 88 face image can generate a 15,488-dimensional
histogram feature or more. A major challenge that prevents face recognition
from being widely applied on large-scale or real-time applications is the vast
computational cost when faces are compared based on the above ultra-high di-
mensional histogram features. Obviously, without any database support, few
applications can actually bear such a high computational cost rooted in the
ultra-high dimensionality. Although many high-dimensional indexing methods
have been introduced in database literature [4], performance results on feature
spaces with thousands of dimensions are hardly found.
In this paper, we frame our work in the context of histogram-based similarity
search. Our main idea comes from the following observations on histogram fea-
tures. Firstly, given the known image resolution and the fixed number of bins,
all the possible values in a histogram feature vector form a finite set of discrete
values. Therefore, a value in an arbitrary dimension has a finite number of possi-
ble states. Secondly, many dimensional values could be zeros since features may
not be evenly distributed, especially in the ultra-high dimensional space. Our
LDP feature dataset extracted from standard face datasets shows that more than
30% of the dimensional values are zeros. The particular characteristics of discrete states
and high sparsity in the high-dimensional feature space have not been previously
exploited to tackle the similarity search problem.
Motivated by the above observations and the high efficiency of inverted file
in text retrieval where data are also discrete and sparse, we propose a novel
two-tier inverted file structure to index the ultra-high dimensional histograms
for efficient similarity search, where a dimension for a state (and a state for an
image) is analogous to a word for a document. To be more specific, we make the
following contributions.

– We model histogram feature values in a finite set of discrete states, based
on which a two-tier inverted file structure is proposed to leverage the high
efficiency of inverted file. In the new structure, the first tier represents the
list of states for each individual dimension, and the second tier represents
the list of images for each state. Meanwhile, techniques are also employed to
remove those indiscriminative state lists for further performance improvement
and space reduction.
– We propose a fast query processing algorithm based on a simple weighted
state-voting scheme. Only those images with the highest voting scores with
respect to the query are retained for the actual similarity computations in
the original space.
– We propose an effective state expansion method for each dimensional value
of a histogram by taking its local information into consideration. Each di-
mension of an image is assigned with a larger number of possible states by
comparing itself with its left and right neighbor dimensions. The purpose of
this is to further increase the discriminative power of inverted file.
– We conduct an extensive performance study on real-life face datasets with
up to 15,488-dimensional histogram features. The results demonstrate the
high accuracy and the significant performance improvement of our proposal
over existing methods.

The rest of the paper is organized as follows. We review some related work in
Section 2. Section 3 provides some preliminary information on the ultra-high
dimensional histogram feature and the related similarity measure. The proposed
two-tier inverted file indexing structure is introduced in Section 4, followed
by the query processing in Section 5. Extensive experiments regarding effectiveness,
efficiency and scalability have been conducted and analyzed in Section 6.
Finally we conclude our work in Section 7.

2 Related Work
Towards effective database support for high-dimensional similarity search, considerable
research effort has been made in the database community. Various categories
of high-dimensional indexing methods have been proposed to tackle the
“curse of dimensionality”.
Tree structures have achieved notable success in managing low-dimensional
feature vectors, from early R-tree, kd-tree and their variants, to M-tree [6], A-
tree [13] and many other trees [4]. The key idea is to prune tree branches as much
as possible based on the established bounding distances so that the number of
accessed feature vectors (or points) can be reduced significantly. However, their
performance rapidly degrades as feature dimensionality increases, and eventually
most of them are outperformed by sequential scan when dimensionality reaches
the high tens, due to the massive overlap among different branches [18].
Apart from exact search, approximate search has recently drawn much atten-
tion. The aim is to gain performance improvement by sacrificing a little accuracy.
One typical approach is Locality Sensitive Hashing (LSH) [9]. The basic idea is
to use a family of locality sensitive hash functions composed of linear projec-
tion over random directions in the feature space. The intuition behind it is that
for at least one of the hash functions, nearby objects have a high probability of
being hashed into the same bucket. Improvements to LSH have been made
continually over the past decade, regarding its accuracy, time efficiency and space
efficiency by improving the hashing distribution [7], by enforcing its projection
method [3], and by combining efficient tree structures [17]. However, how to gen-
erate effective hash functions for thousands of dimensions or higher is unclear.
One-dimensional indexing using the efficient B+ -tree is another category, such
as iDistance [10]. It partitions data points into clusters and indexes all the points
by their distances to their respective reference points using a single B+ -tree. Its
efficiency comes from the localized distances to corresponding reference points
and B+ -tree. Its performance is further improved by finding the optimal refer-
ence points which can maximize the performance of B+ -tree [14]. Nonetheless,
single-dimensional distance values become indistinguishable for ultra-high
dimensional feature vectors.
Another direction is to reduce the number of dimensions of the high-dimensional
data before indexing it. The data is first transformed into a much lower-
dimensional space using dimensionality reduction methods and then an index is
built on it to further facilitate the retrieval [15,5]. The key idea is to transform
data from a high-dimensional space to a lower dimensional space without losing
much information. However, it is mostly infeasible to reduce the dimensionality
from thousands or higher to tens without losing critical information.
Instead of reducing dimensionality, some methods aim to approximate data,
such as VA-file [18]. It approximates each dimension with a small number of
bits, by dividing the data space into 2^b rectangular cells where b denotes a user
specified number of bits. The VA-File allocates a unique bit-string of length b
for each cell, and approximates data points that fall into a cell by that bit-string.
The VA-File itself is simply an array of these compact, geometric approxima-
tions. Query processing is performed by scanning the entire approximation file and
excluding points from the actual distance computation based on the lower and
upper bounds established from these approximations. This approach is insen-
sitive to the dimensionality and thus able to outperform sequential scan if a
small number of candidates are finally accessed. However, the improvement ra-
tio is rather limited since every single dimension needs to be encoded. Some
refined approaches based on VA-file have also been proposed to handle datasets
of different distributions [2,12].
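The cell-quantization step behind the VA-file idea can be sketched as follows. This is only an illustrative outline under our own naming and data layout, not the implementation of [18]: each dimension is uniformly quantized into 2^b cells and only the b-bit cell id is kept per dimension, so a scan of these compact approximations can filter candidates before exact distances are computed.

```python
def va_approximate(points, b):
    """Quantize every dimension of every point into one of 2**b uniform
    cells over that dimension's observed range, and return the per-dimension
    b-bit cell ids (the VA-file's compact approximation)."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    cells = []
    for p in points:
        row = []
        for d in range(dims):
            span = (hi[d] - lo[d]) or 1.0   # avoid division by zero
            # clamp the top boundary value into the last cell
            row.append(min(int((p[d] - lo[d]) / span * (1 << b)), (1 << b) - 1))
        cells.append(row)
    return cells

print(va_approximate([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], b=2))
# -> [[0, 3], [3, 0], [2, 2]]
```

With b = 2, each dimension costs only 2 bits regardless of the dimensionality, which is why this style of approximation remains usable where tree structures fail.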
It is clear that most existing works are not deemed to index ultra-high di-
mensional feature vectors for efficient similarity search. VA-file is likely the most
feasible one to have comparable performance with sequential scan in ultra-high
dimensional spaces since its is dimension independent. Interestingly, inverted file
has been a very effective solution for indexing large-scale text databases with
extremely high dimensionality [20]. In this paper, by analyzing the histogram
intrinsic properties, we introduce a novel and compact indexing structure called
two-tier inverted file to index ultra-high dimensional histograms. The fact that
dimensional values in histogram are discrete and finite motivates us to utilize
the efficiency of inverted file for histogram-based similarity search.

3 Preliminaries

In this section, we provide the information on how ultra-high dimensional feature
vectors can be generated from images and explain the observations which
motivate our design. For easy illustration, we take the recently proposed Local
Derivative Pattern (LDP) feature [19] in face recognition as the example.

3.1 LDP Histogram

Face recognition is a very important topic in pattern recognition. Given a query
face image, it aims at finding the most similar face from a face database. Due to
the strong requirement for high accuracy, face images are usually represented by
very sophisticated features in order to capture the face at very detailed levels.
Given a certain similarity measure, face recognition can be considered as the
nearest neighbor search problem in ultra-high dimensional spaces.
An effective face feature or descriptor is one of the key issues for a well-designed
face recognition system. The feature should discriminate well between classes,
have low intra-class variance, and be easy to compute. Local Binary
Pattern (LBP) is a simple yet very efficient texture descriptor which labels
the pixels of an image by thresholding the neighborhood of each pixel with the
value of the center pixel and considers the result as a binary number [1]. Due to
its discriminative power and computational simplicity, LBP has become a pop-
ular approach in face recognition. As an extension to LBP, the high-order Local
Derivative Pattern (LDP) has been recently proposed as a more robust face de-
scriptor, which significantly outperforms LBP for face identification and face veri-
fication under various conditions [19]. Next, we provide a brief review of these two
descriptors.
Derived from a general definition of texture in a local neighborhood, LBP is
defined as a grayscale invariant texture measure and is a useful tool to model
texture images. The original LBP operator labels the pixels of an image by
thresholding the 3 × 3 neighborhood of each pixel with the value of the central
pixel and concatenating the results binomially to form an 8-bit binary sequence
for each pixel. LBP encodes the binary result of the first-order derivative among
local neighbors.
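As a rough illustration of the operator described above (a sketch under our own naming; the thresholding convention, here ">=", varies across LBP variants):

```python
def lbp_code(img, r, c):
    """8-bit LBP code for pixel (r, c): threshold the 3x3 neighborhood
    against the center value and concatenate the bits clockwise from
    the top-left neighbor."""
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in offsets:
        code = (code << 1) | (1 if img[r + dr][c + dc] >= center else 0)
    return code

flat = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
print(lbp_code(flat, 1, 1))  # a flat patch yields 255 (all neighbors >= center)
```

The per-pixel codes are then accumulated into a histogram over the image (or block), which is the feature vector the rest of the paper indexes.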
As an extension to LBP, LDP encodes the higher-order derivative information
which contains more detailed discriminative features. The second order LDP
descriptor labels the pixels of an image by encoding the first-order local derivative
direction variations and concatenating the results as a 32-bit binary sequence for
each pixel. A histogram can then be constructed based on the LDP descriptor
to represent an image.
To get more precise image representation, an image is typically divided into
small blocks, on which more accurate histograms are calculated. For example,
an image with a resolution of 88 × 88 can be divided into 484 blocks of size
4 × 4. In [19], each block is represented by 4 local 8-dimensional
histograms along four different directions, where each dimension represents the
number of pixels in the bin. The final LDP histogram of the image is generated
by concatenating all the local histograms of each block, i.e., 484 32-dimensional
histograms. Its overall dimensionality is the number of blocks multiplied by the
local histogram size, i.e., 484 × 32 = 15, 488. Theoretically, the maximum di-
mensionality could reach 88 × 88 × 32 when each pixel is regarded as a block.
This LDP histogram is claimed as a robust face descriptor which is insensitive
to rotation, translation and scaling of images.
For histogram features, the number of bins for an image (or block) is always
predetermined. Since the number of pixels in the image (or block) is also known,
the value along each dimension in the histogram is an integer within the range
from 0 to the maximum number of pixels in the image (or block). For example,
in LDP histogram, if the block size is 4 × 4, then the value in the histogram
can only be an integer in the range of [0,16]. Clearly, the first observation is
that the histogram values are discrete and from a finite set of numbers, where
each number is regarded as a state. Note that the values could also be floats if some
normalization is applied. However, normalization does not change the nature
of being discrete and finite. At the same time, many dimensions may have zero
value in ultra-high dimensional histograms. Motivated by the discrete and sparse
characteristics, we utilize the efficiency of inverted file to achieve efficient similar-
ity search in ultra-high dimensional histogram feature spaces, as to be presented
in Section 4.

3.2 Histogram Similarity Measures


Many similarity measures have been proposed for histogram matching. The his-
togram intersection is a widely used similarity measure. Given a pair of LDP
histograms H and S with D dimensions, the histogram intersection is defined as
    Sim(H, S) = Σ_{i=1}^{D} min(H_i, S_i)                              (1)

In the metric defined above, the intersection is incremented by the number of
pixels which are common between the target image and the query image along
each dimension. Its computational complexity is very low. It is used to calculate
the similarity for nearest neighbor identification and has shown very good accu-
racy for face recognition [19]. Another popular measure is the classical Euclidean
distance which has also been used in many other feature spaces. Although other
similarity measures can be used, in this paper we will test both the histogram
intersection and the Euclidean distance to see their effects on the performance.
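As a minimal sketch, the histogram intersection of Equation (1) can be computed as follows (function name is ours):

```python
def histogram_intersection(h, s):
    """Histogram intersection of Eq. (1): the sum over all D dimensions of
    min(H_i, S_i). Larger values mean more similar histograms."""
    return sum(min(hi, si) for hi, si in zip(h, s))

# Two toy 4-bin histograms sharing 1 + 1 + 2 + 0 = 4 pixels across bins
print(histogram_intersection([1, 3, 2, 0], [2, 1, 2, 5]))  # -> 4
```

Note that, unlike the Euclidean distance, this is a similarity (higher is better) and costs a single pass over the D dimensions.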

4 Two-Tier Inverted File


As introduced in Section 3, the face image feature, LDP histogram, is usually in
ultra-high dimensionality (i.e., more than ten thousand). Given its extremely
high dimensionality, it is not practical to perform the full similarity computations
for all database images. In this section, we present a novel two-tier inverted file
for indexing ultra-high dimensional histograms, based on the discrete and sparse
characteristics of histograms.
Inverted file has been used widely in text databases for its high efficiency
[20] in both time and space, where the text dimensionality (i.e., the number of
words) is usually very high and the word-document matrix is very sparse since
a document only contains a small subset of the word dictionary. However, it
has not been well investigated in the low-level visual feature databases. Here we
exploit the discrete and finite nature of histograms and design a two-tier inverted
file structure for efficient similarity search in ultra-high dimensional space.
In the traditional text-based inverted file, each word points to a list of docu-
ments which contain the word. Naive adoption of inverted file to histograms is
to regard each dimension as a word pointing to a list of images whose values (or
states) on the dimension are not zero. By doing this, all zero entries in histograms
are removed. However, histograms also have some different features from text
datasets. Firstly, the word-document matrix is far sparser than histograms,
since the word dictionary size is typically much larger than the average number
of words in documents. This leads to a rather long image list for each dimension.
Secondly, all values in histograms are distributed in a predetermined state range
from 0 to the maximum number of pixels allowed in a bin. This inspires us to
create another level of inverted file for each dimension by regarding each state on
the dimension as a word pointing to a list of images which have the same state.
Therefore, a long image list can be further partitioned into multiple shorter lists
for quicker identification. Thirdly, comparing with the number of images, the
number of states is often much smaller. For example, LDP histograms generated
from 4 × 4 sized blocks have 16 possible states only, without considering the zero
state. To further improve the discriminative power of inverted file, we design an
effective state expansion method, before we look at the overall structure of the
two-tier inverted file.

4.1 State Expansion


Given that the number of states in histograms is relatively small, we aim to ex-
pand the number of states to balance the state list size and the image list size for
better performance. The basic idea is to expand the original state on a dimen-
sion of an image into multiple states which are more specific and discriminative.
The difficulty for state expansion lies in the preservation of the original state
information. We propose to take the local neighbor information into account for
expansion.
To illustrate the idea, we assume an image is divided into 4×4 sized blocks in
LDP histogram. The number of pixels in each bin ranges from 0 to B, where B
is the block size, i.e., B=16. Thus the number of possible states for a dimension
is B+1. Since all zero entries in histograms are not indexed in inverted file, we
have B states left to consider.
To expand the number of states, we consider the relationship between the
state of the ith dimension and its neighbor dimensions, i.e., its left and right
neighbors. Comparing the values of the ith dimension and the (i−1)th dimension
for an image, there exist three relationships: "<", ">" and "=". Similarly, the
comparison between the values of the ith dimension and the (i+1)th dimension has three
relationships as well. Therefore, by considering the relationship with its left and
right neighbor dimensions, a single ith dimension’s state can be expanded into
3 × 3 possible states.
Given an image histogram H = (h_1, h_2, ..., h_D), it can be transformed to the
expanded feature H' = (h'_1, h'_2, ..., h'_D), where h'_i is calculated by the
following formula

    h'_i = h_i × 9 + t_1 × 3 + t_2 × 1,                              (2)

where

    t_1 = 0 if h_i < h_{i-1} or i = 1,    t_2 = 0 if h_i < h_{i+1} or i = D,
    t_1 = 1 if h_i = h_{i-1},             t_2 = 1 if h_i = h_{i+1},
    t_1 = 2 if h_i > h_{i-1},             t_2 = 2 if h_i > h_{i+1}.

[Figure 1 depicts state expansion for a dimension whose original state is 8: the
nine combinations of {<, =, >} relationships with the left and right neighbor
dimensions map it to the new states 72-80, e.g., 8×9+0×3+0 = 72 and
8×9+2×3+2 = 80.]

Fig. 1. An example for state expansion

Basically, each state is stretched into an interval which contains nine new
states based on the local relationship with its left and right neighbors. The term
hi × 9 is used to distinguish original states into different intervals, and the term
t1 × 3 + t2 × 1 is used to differentiate nine local relationships within an interval.
Figure 1 depicts an example where ith dimension has an original state of 8
and is expanded into nine new states. Since a dimension of an image originally
has B possible states without considering zero, the total number of states after
expansion becomes 3 × 3 × B. For example, when B is 16, the total number of
possible states for a dimension is expanded to 3 × 3 × 16 = 144.
State expansion is performed on the original feature for each dimension of
every histogram. The ith dimension of the jth image, H_ij, is assigned the new
value H'_ij = H_ij × 9 + t_1 × 3 + t_2 × 1. Note that more local relationships
can be exploited if more neighbor dimensions are considered. If the number of
histograms is overwhelming, more neighbors like (i−2)th dimension and (i+2)th
dimension can be used. For our data scale used in experiments, expansion on
two neighbor dimensions has shown very satisfactory performance.
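A minimal sketch of the state expansion of Equation (2), including the boundary cases t_1 = 0 at i = 1 and t_2 = 0 at i = D; the function name is our own:

```python
def expand_states(h):
    """Expand each dimension's state per Eq. (2): h'_i = h_i*9 + t1*3 + t2,
    where t1/t2 encode the {<, =, >} relation to the left/right neighbor
    (and are 0 at the boundaries i = 1 and i = D)."""
    D = len(h)
    rel = lambda a, b: 0 if a < b else (1 if a == b else 2)
    out = []
    for i in range(D):
        t1 = 0 if i == 0 else rel(h[i], h[i - 1])
        t2 = 0 if i == D - 1 else rel(h[i], h[i + 1])
        out.append(h[i] * 9 + t1 * 3 + t2)
    return out

# Three equal bins of value 8: the middle one maps to 8*9 + 1*3 + 1 = 76
print(expand_states([8, 8, 8]))  # -> [73, 76, 75]
```

Each original state thus lands in its own interval of nine new states, matching the expansion from B to 3 × 3 × B states described above.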
State expansion achieves a more detailed description of histogram by consid-
ering neighbor information. It plays an important role in accelerating the search
process, by distributing a fixed number of images over a larger number of states.
The average number of images on each state is hence reduced, making query
processing more efficient, as to be explained in Section 5.

4.2 Index Construction


Given an image dataset consisting of N histograms in D dimensionality, Figure 2
illustrates the general process of constructing the two-tier inverted file.
Given an image represented as a histogram, H = (h_1, h_2, ..., h_D), it is first
transformed to H' = (h'_1, h'_2, ..., h'_D) by state expansion. In H', each
dimension of an image is associated with a new state value, which is generated
by considering the relationships with its neighbor dimensions.
[Figure 2 illustrates the construction process: each histogram entry H_ij is
first transformed by state expansion (H'_ij = H_ij × 9 + t_1 × 3 + t_2) and
then indexed. For every dimension, the first tier lists its occurring states, and
each state points to a second-tier list of the images taking that state on the
dimension.]
Fig. 2. Construction of the two-tier inverted file indexing structure

Motivated by the discrete nature of values (or states) in histograms, we propose
a two-tier inverted file to effectively index H' and handle the sparsity issue. The
right sub-figure in Figure 2 shows an overview of the indexing structure. In the
first tier, an inverted list of states is constructed for each individual dimension
among all images. This tier indicates what states exist on a dimension. If the
number of states is small while the number of images is large, all dimensions
will basically have a complete list of states. By effective state expansion, each
dimension is likely to have a different list of states. In the second tier, an inverted
list of images is built for each state existing in a dimension. Denote the number of
states as M . The maximum number of image lists is M × D. Given the relatively
small block size, M is usually much smaller than D and N . With state expansion,
M can be enlarged so that a better balance between the state lists and the image
lists can be obtained.
Like the traditional inverted file for documents, the new two-tier inverted file
for histograms does not index the original zero states. Meanwhile, one question
arises here. Is it necessary to keep those states with very long image lists? In text
retrieval, we understand that frequent words are removed since they are not dis-
criminative. Here we adopt the same assumption which is also verified by our ex-
periments. A threshold on the length of image list, ε, is used to determine if an
image list should be removed from the indexing structure. Only the states (and
their image lists) with fewer images than this threshold are kept in
the two-tier inverted file. Note that rare states are also retained in our structure
since some applications such as face recognition only search for the nearest neigh-
bor. Rare information could be helpful in identifying the most similar result.
Thus, the original histograms in the ultra-high dimensional space are finally
indexed by the compact two-tier inverted file. Given an image query, it can be
efficiently processed in the structure via a simple weighted state-voting scheme,
as to be explained next.
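The construction described above can be sketched as follows. This is our own simplified outline: it indexes the original non-zero states and prunes image lists longer than the threshold ε, and for brevity omits the state-expansion step that would be applied to each histogram first.

```python
from collections import defaultdict

def build_two_tier_index(histograms, eps):
    """Two-tier inverted file sketch: index[dim][state] -> list of image ids.
    Zero states are not indexed (sparsity), and states whose image lists
    exceed the threshold eps are dropped as indiscriminative."""
    index = defaultdict(lambda: defaultdict(list))
    for img_id, h in enumerate(histograms):
        for dim, state in enumerate(h):
            if state == 0:                      # skip zero entries
                continue
            index[dim][state].append(img_id)
    # prune overly long (frequent, indiscriminative) image lists
    return {dim: {s: imgs for s, imgs in states.items() if len(imgs) <= eps}
            for dim, states in index.items()}

idx = build_two_tier_index([[1, 0, 2], [1, 2, 0], [1, 0, 2]], eps=2)
print(idx[2][2])  # images 0 and 2 share state 2 on dimension 2 -> [0, 2]
```

In this toy example, state 1 on dimension 0 occurs in all three images, so its list (length 3 > ε) is pruned, exactly as frequent words are dropped in text retrieval.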
Input: Q[], D, B, L[][]
Output: Nearest Neighbor
1. for (i = 1; i <= D; i++) do
2.     q'_i ← ComputeState(q_i);
3. end for
4. Candidates+ ← {};
5. for (i = 1; i <= D; i++) do
6.     Candidates+ ← Candidates+ ∪ L[i].q'_i;
7. end for
8. Candidates[k] ← WeightedStateVoting(Candidates+);
9. NearestNeighbor ← ComputeNearestNeighbor(Candidates[k]);
10. return NearestNeighbor;

Algorithm 1. The Query Processing Algorithm

5 Query Processing
Based on the two-tier inverted file, query processing is efficient and straight-
forward. We use a simple weighted state-voting scheme to quickly rank all the
images and only a small set of candidates will be selected for full similarity
computations in the original space. Algorithm 1 outlines the query process.
Given a query image histogram, Q = (q_1, ..., q_D), we first transform it to
Q' = (q'_1, ..., q'_D) by applying state expansion (lines 1-3). ComputeState() is the
method to compute the new state value for q_i, based on Equation 2. Next, the
two-tier inverted file, denoted as L[][], is searched. For the ith dimension, the
image list which has the same state value as q'_i is quickly retrieved by locating
the ith dimension in the first tier and then q'_i in the second tier of the structure.
After all dimensions are searched, a set of candidates is generated (lines
5-7). Each image in the candidate set shares one or more common states with the
query image in certain dimensions. Here a weighted state-voting method is em-
ployed to compute the amount of contribution to the final similarity between the
query and a candidate. The frequency of an image in the candidate set reflects
the number of common states it shares with the query image. Note that candi-
dates are generated by matching states on each dimension. However, different
matched states contribute differently to the final similarity when the histogram
intersection is used. Matched states with larger values contribute more to the
final similarity. Therefore, state values have to be considered when candidates
are ranked. Since only expanded states are indexed in the data structure, the
matched state q'_i has to be transformed back to the original state q_i, according
to Equation 2. WeightedStateVoting() is the method to rank all the candidates
(line 8). When the histogram intersection is applied, the ranking score for each
candidate is computed as Σ q̄_i, where q̄_i is the value of a matched state
between Q and a candidate. Top-k candidates are returned for the actual histogram
intersection computations to find the nearest neighbor to Q (line 9). For example,
assume that D = 3, Q = (1, 3, 2), L[1].1 = {img1, img2}, L[2].3 = {img2, img4},
and L[3].2 = {img4}, without state expansion. The weighted state-voting results
for img1, img2 and img4 are 1, 1+3, and 3+2 respectively. By setting k = 2,
img4 and img2 are returned as the final candidates to compute their histogram
intersection similarities with respect to the query to find the nearest neighbor.
When the Euclidean distance is applied, a matched dimension contributes a distance
of 0. In this case, by setting the same weight for all matched states, the top-k most
frequently occurring candidates in the candidate set are returned for the actual
Euclidean distance computations. This is reasonable since more matched dimen-
sions lead to a smaller overall distance with a higher probability. This algorithm
is also flexible in returning more nearest neighbors, which affects the
setting of k. The effect of k will be examined in the experiments.
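The candidate generation and ranking steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the nested-dict layout `index[dim][state] -> image ids` and the function name `weighted_state_voting` are our assumptions, and dimensions are 0-indexed.

```python
from collections import defaultdict

def weighted_state_voting(index, query, k):
    """Rank candidate images by summing the values of their matched
    states (histogram-intersection weighting); return the top-k ids.

    index: two-tier inverted file as index[dim][state] -> image ids
    query: query[dim] -> (expanded) state value of the query image
    """
    scores = defaultdict(int)
    for dim, state in enumerate(query):
        # Tier 1: locate the dimension; tier 2: the matching state.
        for img in index.get(dim, {}).get(state, []):
            # A matched state contributes its value to the vote.
            scores[img] += state
    # Only the top-k candidates go on to the exact similarity step.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Worked example with D = 3 and Q = (1, 3, 2):
index = {0: {1: ["img1", "img2"]},
         1: {3: ["img2", "img4"]},
         2: {2: ["img4"]}}
print(weighted_state_voting(index, [1, 3, 2], k=2))  # ['img4', 'img2']
```

img2 scores 1+3 = 4 and img4 scores 3+2 = 5, so with k = 2 those two proceed to the exact histogram intersection computation.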
Note that the above query processing algorithm only returns ap-
proximate results. There are three factors which affect the accuracy. Firstly, state
expansion may cause information loss. In state expansion, one original state may
be expanded into different new states if the neighbor relationships are different.
Since the algorithm selects the candidates based on matching states and their
voting scores, two different new states with the same original state cannot be
matched. It is expected that this loss becomes relatively less significant as dimen-
sionality increases, and the encoded local information can compensate for the loss
to a certain extent. Secondly, the removal of frequent states from the two-tier inverted
file may also affect the accuracy, as studied in text retrieval. Thirdly, since only
top-k candidates are selected for final similarity computations, the correctness
of the results cannot be guaranteed. In the next section, we extensively study
the effects of the above three factors. Results on real-life ultra-high dimensional
histograms show very promising performance with negligible sacrifice in quality,
despite the lack of a correctness guarantee.

6 Experiments
6.1 Set Up
We have collected 40,000 face images from different sources, including various
standard face databases1 , such as FERET, PIE Database, CMU, the Yale Face
Database, etc., and faces extracted from different digital albums. Both database
images and query images are represented by 15,488 dimensional LDP histograms
which have shown very good accuracy in face recognition [19]. All experiments
are conducted on a desktop with 2.93GHz Intel CPU and 8GB RAM.
To measure the search effectiveness of our proposal, we use the standard
precision, where the ground-truth for a query is the search results from sequential
scan in the original space. In face recognition, typically only the top one result is
needed. Thus, we only evaluate results on the nearest neighbor search, although
more nearest neighbors can also be returned.

1 http://www.face-rec.org/databases/
12 J. Liu et al.

Before the performance comparison with existing methods, we first conduct
experiments on FERET to test our method. FERET2 is a standard face dataset
consisting of 3,541 gray-level face images representing the faces of 1,196 peo-
ple under various conditions (i.e., variant facial expression, illumination, and
ageing). The dataset is divided into five categories, fa (i.e., frontal images), fb
(i.e., facial expression variations), fc (i.e., under various illumination conditions),
dup1 (i.e., face images taken later, between one minute and 1,031 days) and
dup2 (i.e., a subset of dup1: face images taken at least 18 months later). FERET
is widely used as a standard dataset for evaluation of face recognition related
algorithms and systems. For effectiveness and efficiency evaluation, categories
fb, fc, dup1 and dup2 of FERET are considered as four query image sets.
Our scheme has two parameters, ε and k, representing the threshold
on the image list length for a state and the number of candidates for actual similarity
computations respectively; both of them need to be tested. By default, state
expansion is applied, ε is 5% of the size of the image dataset, and k = 20. Due
to the space limit, we only report the results obtained with the histogram intersection
similarity measure; Euclidean distance shows very similar results.
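For concreteness, the histogram intersection similarity [16] used throughout is simply the sum of the dimension-wise minima of the two histograms; a minimal sketch:

```python
def histogram_intersection(p, q):
    # Similarity = sum over all bins of min(p_i, q_i);
    # a larger value means the histograms are more similar.
    return sum(min(a, b) for a, b in zip(p, q))

# Two toy 3-bin histograms:
print(histogram_intersection([1, 3, 2], [2, 3, 0]))  # prints 4 (= 1 + 3 + 0)
```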

6.2 Effect of ε
In our two-tier indexing structure, we assume that the length of an image list for
a state reflects its discrimination power. If the number of images for a state is
greater than ε, this image list is considered as non-discriminative and removed
from the index structure. We test different values of ε, including 5%, 7%, 10%
and 20% of the total dataset size, to observe its effect on effectiveness.
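The index construction with ε-pruning can be sketched as follows; the dict-based layout and the function name `build_pruned_index` are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def build_pruned_index(features, epsilon_frac=0.05):
    """Build a two-tier inverted file, then drop non-discriminative
    image lists longer than epsilon (a fraction of the dataset size),
    much like stop-word removal in text retrieval.

    features: list of (image_id, state_vector) pairs
    """
    index = defaultdict(lambda: defaultdict(list))
    for img_id, states in features:
        for dim, state in enumerate(states):
            if state:  # sparse histograms: zero entries are not indexed
                index[dim][state].append(img_id)
    threshold = epsilon_frac * len(features)
    for dim in index:
        # States shared by more than epsilon of the images carry
        # little discrimination power, so their lists are removed.
        index[dim] = {s: imgs for s, imgs in index[dim].items()
                      if len(imgs) <= threshold}
    return index
```

For instance, with three images and ε = 50% of the dataset, a state matched by two of the three images exceeds the threshold and its list is dropped.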
As observed from Figure 3(a), a larger ε leads to a better precision for all query
sets, since more image lists are maintained in the data structure. However, the
overall precision under all settings is promising, i.e., higher than 98%, and the
differences among settings are insignificant. The search time for different ε values
is shown in Figure 3(b). As ε goes up, the indexing structure is larger and more
images are likely to be accessed and compared. Therefore, for all query sets, the
search time increases quickly as ε goes up. Since ε has a greater
impact on efficiency, by default we set ε = 5%.

6.3 Effect of k
In query processing, a voting scheme is applied to generate a set of candidates for
further similarity calculation. Different settings of k lead to different precisions.
Figure 3(c) shows the results of k = 5, 10, 20 and 50 for the nearest neighbor
search. Precision reaches almost 100% when k ≥ 20. The reason is that the
more candidates we include, the higher probability that the correct results are
finally accessed and returned. The search time increases as k increases since more
candidates are compared, as shown in Figure 3(d). k = 20 is a reasonable default
value, balancing precision and efficiency.
2 http://www.itl.nist.gov/iad/humanid/feret/feret_master.html

6.4 Effect of State Expansion

A key factor that contributes to the high effectiveness and efficiency of our
method is that we expand the space of effective states and consequently encode
more local distinctiveness into each of the states. In this subsection, we also test
the effect of state expansion.
Figures 3(e) and 3(f) depict the selectivity improvement made by state expan-
sion. The total number of image lists and the average number of images in each
list are reported. Clearly by expanding the number of states, the average number
of images for each state is greatly reduced. The average number of images per
list is about 30 after state expansion.
The effect of state expansion on precision and efficiency is reflected in Figures
3(g) and 3(h) respectively. Very surprisingly, with our state expansion the accu-
racy is even higher, especially for the fc, dup1 and dup2 query sets. This is somewhat
counter-intuitive, since state expansion may miss some results if the local neighbor
relationships among their dimensions differ. Without state expansion, in-
formation loss mainly comes from the removal of long image lists. Because image
lists without state expansion are expected to be much longer than those with
state expansion (as depicted in Figure 3(f)), there is a greater risk of removing
lists from the indexing structure. As a result, more information could be lost if the
states are not expanded. Undoubtedly, state expansion improves the search efficiency
(as shown in Figure 3(h)), since fewer and shorter lists are searched. In short,
state expansion achieves improvement in both precision and efficiency.

6.5 Performance Comparison

In the last experiment, we conduct a comparison study on efficiency with se-
quential scan, VA-file and iDistance. Sequential scan is included because, in
the ultra-high dimensional space, its performance is even better than that of most
existing indexing methods due to the “curse of dimensionality”. VA-file, on the
other hand, is less sensitive to the dimensionality than most tree-based index
structures. Two bits are used for each dimension in the VA-file, since only 17 original
states exist in the LDP histogram. Note that the above index structures return com-
plete results, while the two-tier inverted file is an approximate search scheme which
offers superior efficiency with negligible precision loss. In order to compare the
two-tier inverted file with other approximate search schemes, we also adopt iDistance
as an approximate search scheme. Ten clusters are used in iDistance and its search
radius is increased until the scheme reaches the same precision as the two-tier
inverted file. The whole dataset of 40,000 face images is used for this experiment.
Figure 3(i) shows the average search time for a single query with the four different
methods. We observe that our method outperforms the other three methods by
more than two orders of magnitude. The search time for all methods increases
as the data size increases. However, our method grows very slowly as the data
size increases from 1,000 to 40,000 (up to 0.1 second), while the search times for
sequential scan, VA-File and iDistance increase dramatically. Notice that VA-File
is outperformed by sequential scan. There are two main reasons. Firstly, LDP

[Figure omitted: nine panels — (a) precision vs. ε, (b) response time vs. ε, (c) precision vs. k, (d) response time vs. k, (e) number of non-empty lists and (f) average list length with/without state expansion, (g) precision and (h) response time with/without state expansion, (i) scalability over dataset size.]

Fig. 3. Effectiveness, efficiency and scalability

histograms are highly skewed in different localities. Secondly, it is difficult for the VA-
File to obtain a tight bound on the histogram intersection similarity to achieve
efficient pruning. iDistance shows slightly better performance than sequential
scan. However, its search time still climbs quickly, because the distance between
any point and the reference point tends to be very similar when the dimensionality is
extremely high, so a minor increase in the search radius includes an excessive
number of data points to process. This experiment proves that, by utilizing the
high efficiency of the inverted file, our method is able to achieve real-time retrieval
in ultra-high dimensional histogram spaces.

7 Conclusion

In this paper, we present a two-tier inverted file indexing method for efficient
histogram-based similarity search in ultra-high dimensional spaces. Observing that
histogram values are discrete and drawn from a finite value set, it indexes the
sparse and ultra-high dimensional histograms with a compact structure which
utilizes the high efficiency of the inverted file. An effective state expansion
method is designed to further discriminate the data for an efficient and effective
feature representation. An extensive study on a large-scale face image dataset
confirms the novelty and practical significance of the proposal.

References
1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns:
Application to face recognition. IEEE TPAMI 28(12), 2037–2041 (2006)
2. An, J., Chen, H., Furuse, K., Ohbo, N.: Cva file: an index structure for high-
dimensional datasets. Knowl. Inf. Syst. 7(3), 337–357 (2005)
3. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. CACM 51(1), 117–122 (2008)
4. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index
structures for improving the performance of multimedia databases. ACM Comput.
Surv. 33(3), 322–373 (2001)
5. Chakrabarti, K., Mehrotra, S.: Local dimensionality reduction: A new approach to
indexing high dimensional spaces. In: VLDB, pp. 89–100 (2000)
6. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity
search in metric spaces. In: VLDB, pp. 426–435 (1997)
7. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing
scheme based on p-stable distributions. In: Symposium on Computational Geom-
etry, pp. 253–262 (2004)
8. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and
trends of the new age. ACM Comput. Surv. 40(2) (2008)
9. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hash-
ing. In: VLDB, pp. 518–529 (1999)
10. Jagadish, H.V., Ooi, B.C., Tan, K.-L., Yu, C., Zhang, R.: iDistance: An adaptive
B+ -tree based indexing method for nearest neighbor search. ACM TODS 30(2),
364–397 (2005)
11. Lew, M.S., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information
retrieval: State of the art and challenges. ACM TOMCCAP 2(1), 1–19 (2006)
12. Lu, H., Ooi, B.C., Shen, H.T., Xue, X.: Hierarchical indexing structure for efficient
similarity search in video retrieval. IEEE TKDE 18(11), 1544–1559 (2006)
13. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: An index struc-
ture for high-dimensional spaces using relative approximation. In: VLDB, pp. 516–
526 (2000)
14. Shen, H.T., Ooi, B.C., Zhou, X., Huang, Z.: Towards effective indexing for very
large video sequence database. In: SIGMOD, pp. 730–741 (2005)
15. Shen, H.T., Zhou, X., Zhou, A.: An adaptive and dynamic dimensionality reduction
method for high-dimensional indexing. VLDB Journal 16(2), 219–234 (2007)
16. Swain, M.J., Ballard, D.H.: Color indexing. IJCV 7(1), 11–32 (1991)
17. Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Quality and efficiency in high dimensional
nearest neighbor search. In: SIGMOD, pp. 563–576 (2009)
18. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study
for similarity-search methods in high-dimensional spaces. In: VLDB, pp. 194–205
(1998)
19. Zhang, B., Gao, Y., Zhao, S., Liu, J.: Local derivative pattern versus local binary
pattern: face recognition with high-order local pattern descriptor. IEEE TIP 19(2),
533–544 (2010)
20. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput.
Surv. 38(2) (2006)
