Efficient Histogram-Based Similarity Search in Ultra-High Dimensional Space
1 Introduction
Image retrieval based on content similarity has been in the spotlight for the
past few decades [8]. The histogram, constructed by counting the number of
pixels of an image that fall into each of a fixed list of bins, is one of the most
popular features used in many applications [11], where each image is represented
by a high-dimensional histogram feature vector. Among the many distance functions
proposed for histogram comparison, the histogram intersection and the Euclidean
distance are widely used due to their high efficiency and effectiveness [16]. The
dimensionality of an image histogram is typically in the tens or hundreds. Re-
cently, driven by the significant need of real-life applications such as identity
comparing itself with its left and right neighbor dimensions. The purpose of
this is to further increase the discriminative power of the inverted file.
– We conduct an extensive performance study on real-life face datasets with
up to 15488-dimensional histogram features. The results demonstrate the
high accuracy and the significant performance improvement of our proposal
over existing methods.
The rest of the paper is organized as follows. We review related work in
Section 2. Section 3 provides preliminary information on the ultra-high
dimensional histogram feature and the related similarity measure. The proposed
two-tier inverted file indexing structure is introduced in Section 4, followed
by the query processing in Section 5. Extensive experiments regarding effective-
ness, efficiency and scalability are conducted and analyzed in Section 6.
Finally, we conclude our work in Section 7.
2 Related Work
Toward effective database support for high-dimensional similarity search, a great
deal of research effort has been devoted in the database community. Various cate-
gories of high-dimensional indexing methods have been proposed to tackle the
“curse of dimensionality”.
Tree structures have achieved notable success in managing low-dimensional
feature vectors, from the early R-tree, kd-tree and their variants, to the M-tree [6],
A-tree [13] and many other trees [4]. The key idea is to prune tree branches as much
as possible based on established bounding distances so that the number of
accessed feature vectors (or points) can be reduced significantly. However, their
performance degrades rapidly as feature dimensionality increases, and eventually
most of them are outperformed by sequential scan once dimensionality reaches the
high tens, due to the massive overlap among different branches [18].
Apart from exact search, approximate search has recently drawn much atten-
tion. The aim is to gain performance improvement by sacrificing minor accuracy.
One typical approach is Locality Sensitive Hashing (LSH) [9]. The basic idea is
to use a family of locality sensitive hash functions composed of linear projec-
tions over random directions in the feature space. The intuition is that, for at
least one of the hash functions, nearby objects have a high probability of being
hashed into the same bucket. Improvements to LSH have been made continually
during the past decade regarding its accuracy, time efficiency and space
efficiency, by improving the hashing distribution [7], by refining its projection
method [3], and by combining efficient tree structures [17]. However, how to gen-
erate effective hash functions for thousands of dimensions or more remains unclear.
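To make the projection idea concrete, the following is a minimal sketch of one such hash function, not the exact construction of [9]; the number of projections and the bucket width are hypothetical parameters chosen only for illustration.

```python
import numpy as np

def make_lsh_hash(dim, num_projections=8, bucket_width=4.0, seed=0):
    """One locality-sensitive hash built from random linear projections.

    Nearby vectors are likely to receive the same bucket tuple.
    Illustrative sketch only; parameters are hypothetical, not those of [9].
    """
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((num_projections, dim))   # random directions
    offsets = rng.uniform(0.0, bucket_width, size=num_projections)

    def h(vector):
        projections = directions @ np.asarray(vector, dtype=float)
        return tuple(np.floor((projections + offsets) / bucket_width).astype(int))

    return h
```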
One-dimensional indexing using the efficient B+-tree is another category, exemplified
by iDistance [10]. It partitions data points into clusters and indexes all the points
by their distances to their respective reference points using a single B+-tree. Its
efficiency comes from the localized distances to the corresponding reference points
and the B+-tree. Its performance is further improved by finding optimal reference
points that maximize the performance of the B+-tree [14]. Nonetheless,
3 Preliminaries
Due to the strong requirement of high accuracy, face images are usually represented
by very sophisticated features in order to capture the face at a very detailed level.
Given a certain similarity measure, face recognition can be considered as a
nearest neighbor search problem in ultra-high dimensional spaces.
An effective face feature or descriptor is one of the key issues for a well-designed
face recognition system. The feature should have high ability to discriminate
between classes, low intra-class variance, and be easy to compute. Local Bi-
nary Pattern (LBP) is a simple yet very efficient texture descriptor which labels
the pixels of an image by thresholding the neighborhood of each pixel with the
value of the center pixel and considering the result as a binary number [1]. Due to
its discriminative power and computational simplicity, LBP has become a pop-
ular approach in face recognition. As an extension to LBP, the higher-order Local
Derivative Pattern (LDP) has recently been proposed as a more robust face de-
scriptor, which significantly outperforms LBP for face identification and face veri-
fication under various conditions [19]. Next, we provide a brief review of these two
descriptors.
Derived from a general definition of texture in a local neighborhood, LBP is
defined as a grayscale-invariant texture measure and is a useful tool for modeling
texture images. The original LBP operator labels the pixels of an image by
thresholding the 3 × 3 neighborhood of each pixel with the value of the central
pixel and concatenating the results to form an 8-bit binary sequence for each
pixel. LBP encodes the binary result of the first-order derivative among local
neighbors.
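As a concrete illustration of the operator just described, here is a minimal sketch of the 3 × 3 LBP code for a single interior pixel; the clockwise neighbor ordering is an assumed convention, not prescribed by the text.

```python
import numpy as np

def lbp_code(image, y, x):
    """8-bit LBP code of the interior pixel (y, x): threshold the 3x3
    neighborhood against the center value and pack the bits.
    Sketch only; the clockwise neighbor ordering is an assumption."""
    center = image[y, x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise from top-left
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if image[y + dy, x + dx] >= center:
            code |= 1 << bit
    return code  # an integer in [0, 255]
```

Applying this operator to every pixel and histogramming the resulting codes yields the LBP feature of an image or block.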
As an extension to LBP, LDP encodes the higher-order derivative information
which contains more detailed discriminative features. The second order LDP
descriptor labels the pixels of an image by encoding the first-order local derivative
direction variations and concatenating the results as a 32-bit binary sequence for
each pixel. A histogram can then be constructed based on the LDP descriptor
to represent an image.
To get a more precise image representation, an image is typically divided into
small blocks, on each of which a more accurate histogram is calculated. For example,
an image with a resolution of 88 × 88 can be divided into 484 blocks of size
4 × 4. In [19], each block is represented by 4 local 8-dimensional
histograms along four different directions, where each dimension records the
number of pixels in the corresponding bin. The final LDP histogram of the image is
generated by concatenating all the local histograms of each block, i.e., 484
32-dimensional histograms. Its overall dimensionality is the number of blocks
multiplied by the local histogram size, i.e., 484 × 32 = 15,488. Theoretically, the
maximum dimensionality could reach 88 × 88 × 32 if each pixel were regarded as a
block. This LDP histogram is claimed to be a robust face descriptor which is
insensitive to rotation, translation and scaling of images.
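The dimensionality bookkeeping above can be summarized in a short sketch; the per-block routine is kept abstract, since the exact second-order LDP encoding of [19] is beyond the scope of this sketch.

```python
import numpy as np

BLOCK_SIZE = 4   # block side length, as in the 88 x 88 example above

def ldp_image_histogram(image, block_histogram):
    """Concatenate per-block LDP histograms into one image feature.

    `block_histogram(block)` is assumed to return the 32-dimensional
    local histogram of one block; for an 88 x 88 image this gives
    (88 // 4) * (88 // 4) = 484 blocks and 484 * 32 = 15,488 dimensions.
    """
    height, width = image.shape
    parts = []
    for y in range(0, height, BLOCK_SIZE):
        for x in range(0, width, BLOCK_SIZE):
            parts.append(block_histogram(image[y:y + BLOCK_SIZE, x:x + BLOCK_SIZE]))
    return np.concatenate(parts)
```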
For histogram features, the number of bins for an image (or block) is always
predetermined. Since the number of pixels in the image (or block) is also known,
the value along each dimension in the histogram is an integer within the range
from 0 to the maximum number of pixels in the image (or block). For example,
in an LDP histogram, if the block size is 4 × 4, then the value in the histogram
can only be an integer in the range [0, 16]. Clearly, the first observation is
that the histogram values are discrete and drawn from a finite set of numbers,
where each number is regarded as a state. Note that the values could also be floats
if some normalization is applied; however, normalization does not change their
discrete and finite nature. At the same time, many dimensions may have zero
values in ultra-high dimensional histograms. Motivated by these discrete and sparse
characteristics, we exploit the efficiency of the inverted file to achieve efficient
similarity search in ultra-high dimensional histogram feature spaces, as presented
in Section 4.
The basic idea is to regard each dimension as a word pointing to a list of images
whose values (or states) on that dimension are not zero. By doing this, all zero
entries in the histograms are removed. However, histograms also differ from text
datasets in several ways. Firstly, a word-document matrix is far sparser than
histograms, since the word dictionary size is typically much larger than the average
number of words in a document. This leads to a rather long image list for each
dimension. Secondly, all values in histograms are distributed in a predetermined
state range, from 0 to the maximum number of pixels allowed in a bin. This inspires
us to create another level of inverted file for each dimension, by regarding each
state on the dimension as a word pointing to a list of images which have the same
state. Therefore, a long image list can be further partitioned into multiple shorter
lists for quicker identification. Thirdly, compared with the number of images, the
number of states is often much smaller. For example, LDP histograms generated
from 4 × 4 sized blocks have only 16 possible states, without considering the zero
state. To further improve the discriminative power of the inverted file, we design an
effective state expansion method, before we look at the overall structure of the
two-tier inverted file.
$$
t_1 = \begin{cases}
0 & \text{if } h_i < h_{i-1} \text{ or } i = 1\\
1 & \text{if } h_i = h_{i-1}\\
2 & \text{if } h_i > h_{i-1}
\end{cases}
\qquad
t_2 = \begin{cases}
0 & \text{if } h_i < h_{i+1} \text{ or } i = D\\
1 & \text{if } h_i = h_{i+1}\\
2 & \text{if } h_i > h_{i+1}
\end{cases}
$$
[Figure 1. State expansion example: the original state 8 on dimension i is expanded into nine new states (72–80) according to its relations (<, =, >) with dimensions i−1 and i+1, e.g., 8 × 9 + 0 × 3 + 0 = 72, 8 × 9 + 2 × 3 + 1 = 79 and 8 × 9 + 2 × 3 + 2 = 80.]
Basically, each state is stretched into an interval which contains nine new
states, based on its local relationship with its left and right neighbors. The term
hi × 9 separates the original states into different intervals, and the term
t1 × 3 + t2 × 1 differentiates the nine local relationships within an interval.
Figure 1 depicts an example where the ith dimension has an original state of 8
and is expanded into nine new states. Since a dimension of an image originally
has B possible states without considering zero, the total number of states after
expansion becomes 3 × 3 × B. For example, when B is 16, the total number of
possible states for a dimension is expanded to 3 × 3 × 16 = 144.
State expansion is performed on the original feature, for each dimension of
every histogram. The ith dimension of the jth image, Hij, is assigned the new
value H'ij = Hij × 9 + t1 × 3 + t2 × 1. Note that more local relationships
can be exploited if more neighbor dimensions are considered. If the number of
histograms is overwhelming, more neighbors such as the (i−2)th and (i+2)th
dimensions can be used. For the data scale used in our experiments, expansion over
two neighbor dimensions already shows very satisfactory performance.
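A direct transcription of this rule, with t1 and t2 as defined above and the boundary cases i = 1 and i = D handled explicitly, might look as follows (a sketch, using 0-based indexing):

```python
def expand_states(h):
    """Expand every dimension of histogram h to h_i * 9 + t1 * 3 + t2,
    where t1 (t2) encodes the relation of h_i to its left (right) neighbor.
    The first and last dimensions take t1 = 0 and t2 = 0 respectively,
    matching the i = 1 and i = D cases of the definition."""
    D = len(h)
    expanded = []
    for i in range(D):
        if i == 0 or h[i] < h[i - 1]:
            t1 = 0
        elif h[i] == h[i - 1]:
            t1 = 1
        else:
            t1 = 2
        if i == D - 1 or h[i] < h[i + 1]:
            t2 = 0
        elif h[i] == h[i + 1]:
            t2 = 1
        else:
            t2 = 2
        expanded.append(h[i] * 9 + t1 * 3 + t2)
    return expanded
```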
State expansion achieves a more detailed description of the histogram by consid-
ering neighbor information. It plays an important role in accelerating the search
process by distributing a fixed number of images over a larger number of states.
The average number of images per state is hence reduced, making query processing
more efficient, as explained in Section 5.
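Combining the expanded states with the dimension-then-state organization described earlier, the structure can be sketched as a nested mapping, reusing expand_states from the previous sketch; how zero-valued dimensions interact with expansion is an assumption here — they are simply left unindexed.

```python
from collections import defaultdict

def build_two_tier_index(histograms):
    """Two-tier inverted file: first tier keyed by dimension, second tier
    keyed by expanded state, each entry holding a list of image ids.
    Dimensions whose original value is zero are not indexed at all.
    Sketch only; `histograms` maps image id -> original histogram."""
    index = defaultdict(lambda: defaultdict(list))
    for img_id, h in histograms.items():
        expanded = expand_states(h)          # from the sketch above
        for dim, value in enumerate(h):
            if value != 0:                   # skip zero entries entirely
                index[dim][expanded[dim]].append(img_id)
    return index
```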
5 Query Processing
Based on the two-tier inverted file, query processing is efficient and straight-
forward. We use a simple weighted state-voting scheme to quickly rank all the
images and only a small set of candidates will be selected for full similarity
computations in the original space. Algorithm 1 outlines the query process.
Given a query image histogram Q = (q1, ..., qD), we first transform it into
Q' = (q'1, ..., q'D) by applying state expansion (lines 1-3). ComputeState() is the
method that computes the new state value for qi, based on Equation 2. Next, the
two-tier inverted file, denoted as L[][], is searched. For the ith dimension, the corre-
sponding image list which has the same state value as q'i is quickly retrieved by
locating the ith dimension in the first tier and then q'i in the second tier of the struc-
ture.
ture. After all dimensions are searched, a set of candidates is generated (lines
5-7). Each image in the candidate set shares one or more common states with the
query image in certain dimensions. Here a weighted state-voting method is em-
ployed to compute the amount of contribution to the final similarity between the
query and a candidate. The frequency of an image in the candidate set reflects
the number of common states it shares with the query image. Note that candi-
dates are generated by matching states on each dimension. However, different
matched states contribute differently to the final similarity when the histogram
intersection is used. Matched states with larger values contribute more to the
final similarity. Therefore, state values have to be considered when candidates
are ranked. Since only expanded states are indexed in the data structure, each
matched state q'i has to be transformed back to its original state qi, according
to Equation 2. WeightedStateVoting() is the method that ranks all the candidates
(line 8). When the histogram intersection is applied, the ranking score for each
candidate is computed as the sum of q̄i over all matched dimensions, where q̄i is the
value of a matched state between Q and the candidate. The top-k candidates are
returned for the actual histogram intersection computations to find the nearest
neighbor to Q (line 9). For example,
assume that D = 3, Q = (1, 3, 2), L[1].1 = {img1, img2}, L[2].3 = {img2, img4},
and L[3].2 = {img4}, without state expansion. The weighted state-voting results
for img1, img2 and img4 are 1, 1 + 3 = 4, and 3 + 2 = 5 respectively. By setting k = 2,
img4 and img2 are returned as the final candidates to compute their histogram
intersection similarities with respect to the query to find the nearest neighbor.
When the Euclidean distance is applied, a matched dimension contributes a distance
of 0. In this case, by setting the same weight for all matched states, the top-k most
frequently occurring candidates in the candidate set are returned for the actual
Euclidean distance computations. This is reasonable since more matched dimen-
sions lead to a smaller overall distance with a higher probability. The algorithm
also offers the flexibility of returning more nearest neighbors, which affects the
setting of k. The effect of k will be examined in the experiments.
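A hedged sketch of this procedure, reusing expand_states and the index layout from the earlier sketches, is given below; it follows the textual description rather than reproducing Algorithm 1, and shows the histogram-intersection variant of the weighting.

```python
from collections import defaultdict

def histogram_intersection(a, b):
    """Standard histogram intersection similarity."""
    return sum(min(x, y) for x, y in zip(a, b))

def query_nn(q, index, histograms, k=20):
    """Approximate nearest neighbor under histogram intersection.

    q          : original query histogram (q_1, ..., q_D)
    index      : two-tier inverted file, index[dim][expanded state] -> image ids
    histograms : image id -> original histogram, for the final verification
    Sketch only; it follows the textual description of Algorithm 1 above.
    """
    q_exp = expand_states(q)                 # lines 1-3: state expansion
    scores = defaultdict(float)
    for dim, state in enumerate(q_exp):      # lines 5-7: probe both tiers
        weight = q[dim]                      # original state (equivalently state // 9)
        for img_id in index.get(dim, {}).get(state, []):
            scores[img_id] += weight         # line 8: weighted state voting
    candidates = sorted(scores, key=scores.get, reverse=True)[:k]
    # line 9: exact similarity only on the top-k candidates
    return max(candidates,
               key=lambda i: histogram_intersection(q, histograms[i]),
               default=None)
```

For the Euclidean-distance variant described above, the vote weight would simply be 1 for every matched state, and the final re-ranking would use the Euclidean distance instead of the intersection.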
Note that the above query processing algorithm only returns approximate results.
Three factors affect the accuracy. Firstly, state expansion may cause information
loss: one original state may be expanded into different new states if the neighbor
relationships differ. Since the algorithm selects candidates based on matching states
and their voting scores, two different new states with the same original state cannot
be matched. This loss is expected to become relatively less significant as dimen-
sionality increases, and the encoded local information can compensate for it to a
certain extent. Secondly, the removal of frequent states in the two-tier inverted
file may also affect the accuracy, as studied in text retrieval. Thirdly, since only
top-k candidates are selected for the final similarity computations, the correctness
of the results cannot be guaranteed. In the next section, we extensively study
the effects of these three factors. Results on real-life ultra-high dimensional
histograms show very promising performance with negligible sacrifice in quality,
despite the lack of a correctness guarantee.
6 Experiments
6.1 Setup
We have collected 40,000 face images from different sources, including various
standard face databases1, such as FERET, the PIE Database, CMU, the Yale Face
Database, etc., as well as faces extracted from different digital albums. Both database
images and query images are represented by 15,488-dimensional LDP histograms,
which have shown very good accuracy in face recognition [19]. All experiments
are conducted on a desktop with a 2.93GHz Intel CPU and 8GB RAM.
To measure the search effectiveness of our proposal, we use the standard
precision, where the ground truth for a query is the search result from a sequential
scan in the original space. In face recognition, typically only the top result is
needed; thus, we only evaluate results for nearest neighbor search, although
more nearest neighbors can also be returned.
1 http://www.face-rec.org/databases/
6.2 Effect of ε
In our two-tier indexing structure, we assume that the length of the image list for
a state reflects its discriminative power. If the number of images for a state is
greater than ε, the image list is considered non-discriminative and removed
from the index structure. We test different values of ε, namely 5%, 7%, 15%
and 20% of the total number of images, to observe its effect on effectiveness.
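This filtering amounts to deleting any second-tier list longer than a fixed fraction of the collection; a minimal sketch over the two-tier index, with ε given as a fraction (an assumption consistent with the percentages above):

```python
def prune_frequent_states(index, num_images, epsilon=0.05):
    """Drop non-discriminative entries: any (dimension, state) list longer
    than epsilon * num_images is removed. epsilon = 0.05 mirrors the 5%
    default used below; sketch only."""
    limit = epsilon * num_images
    for dim in list(index):
        for state in list(index[dim]):
            if len(index[dim][state]) > limit:
                del index[dim][state]
        if not index[dim]:
            del index[dim]
    return index
```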
As observed from Figure 3(a), a larger ε leads to a better precision for all query
sets, since more image lists are maintained in the data structure. The overall
precision under the various settings is promising, i.e., all higher than 98%, and the
differences among the settings are not significant. The search time for different
ε values is shown in Figure 3(b). As ε goes up, the indexing structure becomes
larger and more images are likely to be accessed and compared. Therefore, the
different sets of queries show the same trend: the search time increases quickly as
ε goes up. Since ε has a greater impact on efficiency, we set ε = 5% by default.
6.3 Effect of k
In query processing, a voting scheme is applied to generate a set of candidates for
further similarity calculation. Different settings of k lead to different precisions.
Figure 3(c) shows the results for k = 5, 10, 20 and 50 for the nearest neighbor
search. Precision reaches almost 100% when k ≥ 20. The reason is that the
more candidates we include, the higher the probability that the correct results are
eventually accessed and returned. The search time increases with k since more
candidates are compared, as shown in Figure 3(d). Hence, k = 20 is a reasonable
default value with respect to both precision and efficiency.
2 http://www.itl.nist.gov/iad/humanid/feret/feret_master.html
6.4 Effect of State Expansion
A key factor contributing to the high effectiveness and efficiency of our
method is that we expand the space of effective states and consequently encode
more local distinctiveness into each of the states. In this subsection, we test the
effect of state expansion.
Figures 3(e) and 3(f) depict the selectivity improvement made by state expan-
sion; the total number of image lists and the average number of images in each
list are reported. Clearly, by expanding the number of states, the average number
of images for each state is greatly reduced: the average number of images per
list is about 30 after state expansion.
The effect of state expansion on precision and efficiency is shown in Figures
3(g) and 3(h) respectively. Surprisingly, with our state expansion the accu-
racy is even higher, especially for the fc, dup1 and dup2 query sets. This appears
counter-intuitive at first, since state expansion may miss some results if the local
neighbor relationships among their dimensions differ. However, without state
expansion, information loss mainly comes from the removal of long image lists.
Because image lists without state expansion are expected to be much longer than
those with state expansion (as depicted in Figure 3(f)), more lists are at risk of being
removed from the indexing structure. As a result, more information can be lost if
the states are not expanded. As expected, state expansion also improves the search
efficiency (as shown in Figure 3(h)), since fewer and shorter lists are searched. In
short, state expansion achieves improvements in both precision and efficiency.
[Figure 3. Experimental results over the fb, fc, dup1 and dup2 query sets: (d) effect of k on search time; (e) and (f) effect of state expansion on the total number of image lists and the average number of images per list; (g) and (h) effect of state expansion on precision and search time; (i) scalability of search time against the number of records (×10^3), compared with Sequential Scan and VA-File.]
histograms are highly skewed in different localities. Secondly, it is difficult for VA-
File to obtain a tight bound on the histogram intersection similarity to achieve
efficient pruning. iDistance shows slightly better performance than sequential
scan. However, its search time still climbs quickly, because the distance from any
point to the reference point tends to be very similar when the dimensionality is
extremely high, so that a minor increase in the search radius includes an excessive
number of data points to process. This experiment shows that, by utilizing the
high efficiency of the inverted file, our method is able to achieve real-time retrieval
in ultra-high dimensional histogram spaces.
7 Conclusion
In this paper, we present a two-tier inverted file indexing method for efficient
histogram-based similarity search in ultra-high dimensional spaces. Observing that
histogram values are discrete and drawn from a finite value set, it indexes the
sparse, ultra-high dimensional histograms with a compact structure that exploits
the high efficiency of the inverted file. An effective state expansion method is
designed to further discriminate the data for efficient and effective search.
References
1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns:
Application to face recognition. IEEE TPAMI 28(12), 2037–2041 (2006)
2. An, J., Chen, H., Furuse, K., Ohbo, N.: Cva file: an index structure for high-
dimensional datasets. Knowl. Inf. Syst. 7(3), 337–357 (2005)
3. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. CACM 51(1), 117–122 (2008)
4. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index
structures for improving the performance of multimedia databases. ACM Comput.
Surv. 33(3), 322–373 (2001)
5. Chakrabarti, K., Mehrotra, S.: Local dimensionality reduction: A new approach to
indexing high dimensional spaces. In: VLDB, pp. 89–100 (2000)
6. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity
search in metric spaces. In: VLDB, pp. 426–435 (1997)
7. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing
scheme based on p-stable distributions. In: Symposium on Computational Geom-
etry, pp. 253–262 (2004)
8. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and
trends of the new age. ACM Comput. Surv. 40(2) (2008)
9. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hash-
ing. In: VLDB, pp. 518–529 (1999)
10. Jagadish, H.V., Ooi, B.C., Tan, K.-L., Yu, C., Zhang, R.: iDistance: An adaptive
B+ -tree based indexing method for nearest neighbor search. ACM TODS 30(2),
364–397 (2005)
11. Lew, M.S., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information
retrieval: State of the art and challenges. ACM TOMCCAP 2(1), 1–19 (2006)
12. Lu, H., Ooi, B.C., Shen, H.T., Xue, X.: Hierarchical indexing structure for efficient
similarity search in video retrieval. IEEE TKDE 18(11), 1544–1559 (2006)
13. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: An index struc-
ture for high-dimensional spaces using relative approximation. In: VLDB, pp. 516–
526 (2000)
14. Shen, H.T., Ooi, B.C., Zhou, X., Huang, Z.: Towards effective indexing for very
large video sequence database. In: SIGMOD, pp. 730–741 (2005)
15. Shen, H.T., Zhou, X., Zhou, A.: An adaptive and dynamic dimensionality reduction
method for high-dimensional indexing. VLDB Journal 16(2), 219–234 (2007)
16. Swain, M.J., Ballard, D.H.: Color indexing. IJCV 7(1), 11–32 (1991)
17. Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Quality and efficiency in high dimensional
nearest neighbor search. In: SIGMOD, pp. 563–576 (2009)
18. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study
for similarity-search methods in high-dimensional spaces. In: VLDB, pp. 194–205
(1998)
19. Zhang, B., Gao, Y., Zhao, S., Liu, J.: Local derivative pattern versus local binary
pattern: face recognition with high-order local pattern descriptor. IEEE TIP 19(2),
533–544 (2010)
20. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput.
Surv. 38(2) (2006)