Optimizing Bitmap Indices With Efficient Compression: ACM Transactions On Database Systems March 2006
Bitmap indices are efficient for answering queries on low-cardinality attributes. In this article,
we present a new compression scheme called Word-Aligned Hybrid (WAH) code that makes com-
pressed bitmap indices efficient even for high-cardinality attributes. We further prove that the new
compressed bitmap index, like the best variants of the B-tree index, is optimal for one-dimensional
range queries. More specifically, the time required to answer a one-dimensional range query is
a linear function of the number of hits. This strongly supports the well-known observation that
compressed bitmap indices are efficient for multidimensional range queries because results of
one-dimensional range queries computed with bitmap indices can be easily combined to answer
multidimensional range queries. Our timing measurements on range queries not only confirm the
linear relationship between the query response time and the number of hits, but also demonstrate
that WAH compressed indices answer queries faster than the commonly used indices including
projection indices, B-tree indices, and other compressed bitmap indices.
Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data compaction
and compression; H.3.1 [Information Systems]: Content Analysis and Indexing—Indexing
methods
General Terms: Performance, Algorithms
Additional Key Words and Phrases: Compression, bitmap index, query processing
1. INTRODUCTION
Bitmap indices are known to be efficient, especially for read-mostly or append-
only data. Many researchers have demonstrated this [O’Neil 1987; Chaudhuri
and Dayal 1997; Jürgens and Lenz 1999]. Major DBMS vendors including
This work was supported by the Office of Science of the U.S. Department of Energy under Contract
No. DE-AC03-76SF00098.
Authors’ addresses: Mailstop 50B-3238, Computational Research Division, Lawrence Berkeley National Laboratory, Scientific Data Management Group, 1 Cyclotron Rd., Berkeley, CA 94720; email:
{Kwu,EJOtoo,AShoshani}@lbl.gov.
© 2006 ACM 0362-5915/06/0300-0001 $5.00
ACM Transactions on Database Systems, Vol. 31, No. 1, March 2006, Pages 1–38.
2 • K. Wu et al.
Fig. 1. A sample bitmap index. Each column b0 , . . . , b3 is called a bitmap in this article.
ORACLE, Sybase, and IBM have implemented them in their respective DBMS
products. However, users are usually cautioned not to use them for high-
cardinality attributes. In this article, we present an efficient compression
scheme, called Word-Aligned Hybrid (WAH) code, that not only reduces the
index sizes but also guarantees a theoretically optimal query response time
for one-dimensional range queries. A number of empirical studies have shown
that WAH compressed bitmap indices answer queries faster than uncom-
pressed bitmap indices, projection indices, and B-tree indices, on both high-
and low-cardinality attributes [Wu et al. 2001a, 2002, 2004; Stockinger et al.
2002]. This article complements the observations with rigorous analyses. The
main conclusion of the article is that the WAH compressed bitmap index is
in fact optimal. Some of the most efficient indexing schemes, such as B+-tree and B∗-tree indices, have a similar optimality property [Comer
1979; Knuth 1998]. However, a unique advantage of compressed bitmap in-
dices is that the results of one-dimensional queries can be efficiently com-
bined to answer multidimensional queries. This makes WAH compressed
bitmap indices well suited for ad hoc analyses of large high-dimensional
datasets.
One of the first commercial products to make extensive use of the bitmap index was Model 204 [O’Neil 1987]. In many
data warehouse applications, bitmap indices perform better than tree-based
schemes, such as the variants of B-tree or R-tree [Jürgens and Lenz 1999; Chan
and Ioannidis 1998; O’Neil 1987; Wu and Buchmann 1998]. According to the
performance model proposed by Jürgens and Lenz [1999], bitmap indices are
likely to be even more competitive in the future as disk technology improves.
In addition to supporting queries on a single table as shown in this article, re-
searchers have also demonstrated that bitmap indices can accelerate complex
queries involving multiple tables [O’Neil and Graefe 1995].
The simple bitmap index shown in Figure 1 is known as the basic bitmap
index. The basic bitmap index can be built for integer attributes as well as
floating-point values and strings. The main practical difference is that floating-
point attributes typically have more distinct values, and therefore their bitmap
indices require more bitmaps.
There is only one attribute in the above example. With more than one at-
tribute, typically a bitmap index is generated for each attribute. It is straight-
forward to process queries involving multiple attributes. For example, to process
a query with the condition “Energy > 15 GeV and 7 < NumParticles < 13,” a
bitmap index on attribute Energy and a bitmap index on NumParticles are
used separately to generate two bitmaps representing rows satisfying the con-
ditions “Energy > 15 GeV” and “7 < NumParticles < 13,” and the final answer
is then generated with a bitwise logical AND operation on these two interme-
diate bitmaps.
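The steps above can be sketched in a few lines of Python. This is a toy illustration with hypothetical data; Python integers serve as bit vectors, with bit i standing for row i.

```python
def build_bitmap_index(column):
    """Basic bitmap index: one bitmap per distinct value; bit i of a
    value's bitmap is set iff row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

# Hypothetical rows for the two attributes in the example query.
energy = [10, 20, 16, 5, 30]
num_particles = [8, 12, 3, 9, 10]
e_idx = build_bitmap_index(energy)
n_idx = build_bitmap_index(num_particles)

# OR the bitmaps of all values satisfying each one-dimensional condition...
cond1 = 0
for v, bm in e_idx.items():
    if v > 15:                      # "Energy > 15 GeV"
        cond1 |= bm
cond2 = 0
for v, bm in n_idx.items():
    if 7 < v < 13:                  # "7 < NumParticles < 13"
        cond2 |= bm

# ...then a single bitwise AND yields the multidimensional answer.
answer = cond1 & cond2              # rows 1 and 4 satisfy both conditions
```

With this data, rows 1 and 4 satisfy both conditions, so `answer` has exactly bits 1 and 4 set.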
1.4 Outline
The remainder of this article is organized as follows. In Section 2 we review
three commonly used compression schemes and identify their key features.
These three were selected as representatives for later performance compar-
isons. Section 3 contains the description of the Word-Aligned Hybrid code
(WAH). Section 4 contains the analysis of bitmap index sizes, and Section 5
presents the analysis of time complexity to answer a range query. We present
some performance measurements in Sections 6 and 7 to support the analyses.
A short summary is given in Section 8, while algorithms to perform logical operations are presented in the Appendix.
Because operating on whole bytes is more efficient than working on individual bits, this byte alignment property makes BBC more efficient than other techniques such as ExpGol.
Fig. 2. A WAH compressed bitmap. Each WAH word (last row) represents a multiple of 31 bits from the input bitmap, except the last word, which represents the four leftover bits.
In our earlier tests, word-aligned schemes outperformed similar schemes without word alignment by two orders of magnitude [Wu et al. 2001b]. The reason for this performance difference is that the word alignment ensures logical operations only access whole words, not bytes or bits.
Figure 2 shows the WAH compressed representation of 128 bits. We assume
that each computer word contains 32 bits. Under this assumption, each literal
word stores 31 bits from the bitmap, and each fill word represents a multiple of
31 bits. The second line in Figure 2 shows the bitmap as 31-bit groups, and the
third line shows the hexadecimal representation of the groups. The last line
shows the WAH words also as hexadecimal numbers. The first three words are
regular words, the first and the third are literal words, and the second a fill
word. The fill word 80000002 indicates a 0-fill of two words long (containing 62
consecutive 0 bits). Note that the fill word stores the fill length as two rather
than 62. The fourth word is the active word; it stores the last few bits that could
not be stored in a regular word.1
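The encoding just described can be sketched as follows. This is an illustrative Python sketch assuming w = 32 (so 31-bit groups), not the paper's implementation; here a single isolated all-0 or all-1 group is simply kept as a literal word.

```python
def wah_encode(bits, w=32):
    """Sketch of WAH encoding for a list of 0/1 bits (w-bit words).
    A literal word (MSB 0) holds w-1 bits; a fill word has the MSB set,
    the next bit giving the fill value, and the low w-2 bits counting
    how many (w-1)-bit groups the fill covers."""
    g = w - 1                        # bits per group
    ngroups = len(bits) // g         # leftover bits go to the active word
    words, i = [], 0
    while i < ngroups:
        group = bits[i * g:(i + 1) * g]
        val = int("".join(map(str, group)), 2)
        if val in (0, (1 << g) - 1):                 # all-0 or all-1 group
            n = 1
            while i + n < ngroups and bits[(i + n) * g:(i + n + 1) * g] == group:
                n += 1
            if n > 1:                                # collapse run into one fill word
                fill_bit = 1 if val else 0
                words.append((1 << (w - 1)) | (fill_bit << (w - 2)) | n)
                i += n
                continue
        words.append(val)                            # literal word
        i += 1
    if len(bits) % g:                                # active word: leftover bits
        words.append(int("".join(map(str, bits[ngroups * g:])), 2))
    return words
```

For example, 62 consecutive 0 bits compress to the single fill word 80000002 (hexadecimal), matching the 0-fill of two groups in Figure 2; the fill length is stored as two, not 62.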
For sparse bitmaps, where most of the bits are 0, a WAH compressed bitmap
would consist of pairs of a fill word and a literal word. If the bitmap is truly
sparse, say only one bit in 1000 is 1, then each literal word would likely contain
a single bit that is 1. In this case, for a set of bitmaps to contain N bits of 1, the
total size of the compressed bitmap is about 2N words. In the next section, we
will give a rigorous analysis of sizes of compressed bitmaps.
The detailed algorithms for performing logical operations are given in the
Appendix. Here we briefly describe one example (C = A AND B), as shown in
Figure 3. To perform a logical AND operation, we essentially need to match each
31-bit group from the two operands, and generate the corresponding groups for
the result. Each column of the table is reserved for one such group. A literal
word occupies the location for the group, and a fill word is given at the first space reserved for the fill.
1 Note that we need to indicate how many bits are represented in the active word, and we have chosen to store this information separately (not shown in Figure 2). We also chose to store the leftover bits as the least significant bits in the active word so that, during bitwise logical operations, the active word can be treated the same as a regular literal word.
The first 31-bit group of the result C is the same
as that of A because the corresponding group in B is part of a 1-fill. The next
three groups of C contain only 0 bits. The active words are always treated
separately.
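The group-matching step can be illustrated by first expanding each operand into its 31-bit groups and then combining them pairwise. This is an illustrative sketch only; the actual algorithm in the Appendix avoids full decompression.

```python
def wah_decode(words, w=32):
    """Expand WAH regular words into a flat list of (w-1)-bit groups.
    (The active word is handled separately and omitted here.)"""
    g = w - 1
    groups = []
    for word in words:
        if word >> (w - 1):                          # fill word
            n = word & ((1 << (w - 2)) - 1)          # fill length in groups
            fill = (1 << g) - 1 if (word >> (w - 2)) & 1 else 0
            groups.extend([fill] * n)
        else:                                        # literal word
            groups.append(word)
    return groups

# C = A AND B, computed group by group after expansion.
A = [0x40000000, 0x80000002]     # one literal group, then a 0-fill of 2 groups
B = [0xC0000003]                 # a 1-fill of 3 groups
C = [a & b for a, b in zip(wah_decode(A), wah_decode(B))]
```

As in the Figure 3 discussion, the first group of C equals the first group of A because the corresponding group of B is part of a 1-fill, and the remaining groups of C contain only 0 bits.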
When two sparse bitmaps are lined up for a bitwise logical operation, we
expect most of the literal words not to fall on top of each other or adjacent to
each other. In this case, the results of logical OR and XOR are as large as the
total size of the two input bitmaps. It is easy to see that the time required to
perform these logical operations would be linear in the total size of the two
input bitmaps. We have observed this linearity in a number of different tests
[Wu et al. 2001a, 2002, 2004]. We formally show this linearity in Section 5.
The version of WAH presented in this article has two important improve-
ments over the earlier version [Wu et al. 2001b]. The first change is that we use
the multiple of (w − 1) bits to measure the fill length rather than the number
of bits. The second change is that we assume the length of a bitmap can be
represented in one computer word. These changes allow us to store a fill of any
length in one fill word, which reduces the complexity of the encoding and decoding procedures. In addition, our current implementation of the bitwise logical operations also takes advantage of shortcuts available for specific logical operations, while the original performance measurement [Wu et al. 2001b] used
the generic algorithm shown in the Appendix. All of these changes improved
the efficiency of bitwise logical operations.
Operations on WAH compressed bitmaps should be faster than the same
operations on BBC compressed bitmaps for three main reasons.
(1) The encoding scheme of WAH is simpler than that of BBC; therefore, the algorithms for performing logical operations are also simpler. In particular, the header byte of a BBC run is considerably more complex than any WAH word.
(2) The words in WAH compressed bitmaps have no dependency among them,
but the bytes in BBC have complicated dependencies. Therefore, accessing
BBC compressed bitmaps is more complicated and more time consuming
than accessing WAH compressed bitmaps.
(3) BBC can encode short fills, say those with fewer than 60 bits, more compactly than WAH. However, this compactness comes at a cost: each time BBC encounters such a fill it starts a new run, while WAH represents such fills in literal words.
It takes much less time to operate on a literal word in WAH than on a run
in BBC.
One way to improve the speed of bitwise logical operations on bitmaps with
many short fills is to decompress all the short fills. Indeed, decompressing all
short fills in both WAH and BBC could decrease the time to perform bitwise
logical operations on these bitmaps. However, it usually increases the time to
perform operations between an uncompressed bitmap and another compressed
bitmap. It is possible that decompressing selected bitmaps could reduce the av-
erage query response time, but in all tests, the query response time increased
rather than decreased. Further investigation may reveal exactly what to decom-
press to improve the query response time; however, we will leave that for future
work. For the remainder of this article, unless explicitly stated, all bitmaps are
fully compressed.
2 If the database contains more than 2^w rows, multiple bitmaps can be used, each representing a subset of the rows. To improve the flexibility of generating the indices and caching them in memory, a bitmap may contain only a few million bits, corresponding to a small partition of the data. This will create more than c bitmaps. The total size of the bitmap index due to the per-bitmap overhead will increase accordingly; however, the number of words required to represent the bulk of 0s and 1s will not change.
Fig. 5. Breaking the 31-bit groups from Figure 2 into three counting groups.
PROOF. To prove this lemma, we first observe that any l literal groups can
be broken into (l − 1) counting groups. If the l literal groups form a fill, then
the (l − 1) counting groups are also fills. The two literal groups at the ends
may be combined with their adjacent groups outside the fill to form up to two
counting groups. According to our definition of a WAH fill, the groups adjacent
to the fill must be one of the following: (1) a fill of a different type, (2) a literal
group with a mixture of 0 bits and 1 bits, or (3) null, that is, there are no more
literal groups before or after the fill. In the last case, no more counting groups
can be constructed. In the remaining cases, the counting groups constructed
with one literal group inside the fill and one literal group outside the fill are
not fills. Therefore the fill generates exactly (l − 1) counting groups that are
fills.
Combining the above two lemmas gives an easy way to compute the number
of words needed to store a compressed bitmap.
THEOREM 3. Let G denote the number of counting groups that are fills and
let M denote the number of literal groups. Then the number of regular words in
a compressed bitmap is M − G.
PROOF. The number of regular words in a compressed bitmap is the sum of the number of fills and the number of literal groups that are not part of any fill. Let L be the number of fills of a bitmap and let $l_i$ be the size of the ith fill; according to the previous lemma, the ith fill contributes $(l_i - 1)$ counting groups that are fills. By definition, $G \equiv \sum_{i=1}^{L} (l_i - 1)$. The number of fills is $L = \sum_{i=1}^{L} l_i - \sum_{i=1}^{L} (l_i - 1)$, and the number of literal groups that are not in any fill is $M - \sum_{i=1}^{L} l_i$. Altogether, the number of fills plus the number of literal groups outside fills is
$$\sum_{i=1}^{L} l_i - \sum_{i=1}^{L} (l_i - 1) + M - \sum_{i=1}^{L} l_i = M - \sum_{i=1}^{L} (l_i - 1) = M - G.$$
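Theorem 3 can be checked numerically with a short script. This is an illustrative sketch; `wah_regular_words` is our hypothetical helper that collapses each maximal run of identical fill groups into one word, which is what the WAH regular words do.

```python
FULL = (1 << 31) - 1            # an all-ones 31-bit group (w = 32)

def wah_regular_words(groups):
    """Number of regular words WAH needs for a list of 31-bit groups:
    each maximal run of identical all-0 or all-1 groups collapses to
    one (fill) word; every other group is one literal word."""
    words, i = 0, 0
    while i < len(groups):
        j = i + 1
        if groups[i] in (0, FULL):                   # extend the fill run
            while j < len(groups) and groups[j] == groups[i]:
                j += 1
        words += 1
        i = j
    return words

groups = [0, 0, 0, 0x12345, FULL, FULL, 0x40000000, 0, FULL]
M = len(groups)                                      # literal groups
G = sum(1 for a, b in zip(groups, groups[1:])        # counting groups that
        if a == b and a in (0, FULL))                # are fills
```

Here M = 9 and G = 3 (two adjacent 0-pairs plus one adjacent 1-pair), and the compressed representation indeed needs M − G = 6 regular words.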
one parameter: the bit density d, which is defined to be the fraction of bits that are 1 (0 ≤ d ≤ 1).
The efficiency of a compression scheme is often measured by the compression
ratio, which is the ratio of its compressed size to its uncompressed size. For a
bitmap with N bits, the uncompressed scheme (LIT) needs $\lfloor N/w \rfloor$ words, and the decompressed form of WAH requires M + 2 words. The compression ratio of the decompressed WAH is $(\lfloor N/(w-1) \rfloor + 2)/\lfloor N/w \rfloor \approx w/(w-1)$. All compression schemes pay an overhead to represent incompressible bitmaps. For WAH,
this overhead is 1 bit per word. When a word is 32 bits, the overhead is about
3%. The overhead for BBC is about 1 byte per 15 bytes, which is roughly 6%.
Let d be the bit density of a uniform random bitmap. The probability of finding a counting group that is a 1-fill, that is, 2w − 2 consecutive bits that are all 1, is $d^{2w-2}$. Similarly, the probability of finding a counting group that is a 0-fill is $(1-d)^{2w-2}$. With WAH compression, the expected size of a bitmap with N bits is
$$m_R(d) \equiv M + 2 - G = \left\lfloor \frac{N}{w-1} \right\rfloor + 2 - \left( \left\lfloor \frac{N}{w-1} \right\rfloor - 1 \right) \left( (1-d)^{2w-2} + d^{2w-2} \right)$$
$$\approx \frac{N}{w-1} \left( 1 - (1-d)^{2w-2} - d^{2w-2} \right). \qquad (1)$$
The above approximation neglects the constant 2, which corresponds to the two words comprising the active word and the counter for the active word. We loosely refer to these as the "per bitmap overhead." This overhead may become important when G is close to M, that is, when $m_R$ is close to 2. For applications where compression is useful, N is typically much larger than w. In these cases, dropping the floor operator $\lfloor \cdot \rfloor$ does not introduce any significant error.
The compression ratio is approximately $w(1 - (1-d)^{2w-2} - d^{2w-2})/(w-1)$. For d between 0.05 and 0.95, the compression ratio is nearly 1; in other words, these random bitmaps cannot be compressed with WAH. For sparse bitmaps, say $2wd \ll 1$, we have $m_R(d) \approx 2dN$, because $d^{2w-2} \to 0$ and $(1-d)^{2w-2} \approx 1 - (2w-2)d$. In this case, the compression ratio is approximately 2wd. Let h denote the number of bits that are 1. By definition, h = dN. The compressed size of a sparse bitmap is related to h by the following equation:
$$m_R(d) \approx 2dN = 2h. \qquad (2)$$
In such a sparse bitmap, all literal words contain only a single bit that is 1, and
each literal word is separated from the next one by a 0-fill. On average, two
words are used for each bit that is 1.
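Equation (1) and its sparse-bitmap limit in Equation (2) are easy to check numerically. This is a sketch; the function name is ours.

```python
def expected_wah_words(N, d, w=32):
    """Expected number of regular words for WAH on a uniform random
    bitmap of N bits with bit density d, per the approximate form of
    Equation (1)."""
    return N / (w - 1) * (1 - (1 - d) ** (2 * w - 2) - d ** (2 * w - 2))
```

For N = 10^6 and d = 10^-4, the formula gives a value close to the sparse-limit estimate 2dN = 200 words of Equation (2); for d = 0.5 it gives essentially N/(w − 1) words, the incompressible case.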
Next, we compute the compressed sizes of bitmaps generated from a two-
state Markov process, as illustrated in Figure 6. These bitmaps require a second
The computation of f is similar to that of the rehash chain length [O’Neil and O’Neil 2000,
p. 520].
5 The exact maximum is 4N − 2w − 2(N %(w − 1)), where operator % denotes modulus.
Fig. 7. The expected size of bitmap indices on random data and Markov data with various clus-
tering factors.
The majority of bitmaps have three regular words plus the active word.6 There are a few
bitmaps using two or three words rather than four.7 For a large range of high-
cardinality attributes, say c < N /10, the maximum size of WAH compressed
bitmap indices is about 2N words.
For attributes with a clustering factor f greater than one, the stable plateau
is reduced by a factor close to 1/ f . Another factor that reduces the total size of
the compressed bitmap index is that the cardinality of an attribute is usually
much smaller than N . For attributes with Zipf distribution, the stable plateau
is the same as the uniform random attribute. However, because the actual
cardinality is much less than N , it is very likely that the size of the compressed
bitmap index would be about 2N words. For example, for an attribute with a Zipf distribution with z = 1 and i < 10^9, among 100 million values we see about 27 million distinct values, and the index size is about 2.3N words. Clearly,
for Zipf distributions with larger z, we expect to see fewer distinct values and
the index size would be smaller. For example, for z = 2, we see about 14,000
distinct values for nearly any limit on i that is larger than 14,000. In these
cases, the index size is about 2N words. The following proposition summarizes
these observations.
PROPOSITION 4. Let N be the number of rows in a table, and let c be the
cardinality of the attribute to be indexed. Then the total size s of all compressed
bitmaps in an index is such that
(1) it never takes more than 4N words,
(2) if c < N /10, the maximum size of the compressed bitmap index of the at-
tribute is about 2N words,
6 Since all active words have the same number of bits, one word is sufficient to store this number.
7 The three regular words in the majority of the bitmaps represent a 0-fill, a literal group, and a
0-fill. There are w bitmaps without the first 0-fill and w bitmaps without the last 0-fill. The 2w
bitmaps use three words each. There are also (N %(w − 1)) bitmaps whose 1 bits are in their active
words. In these bitmaps, only one regular word representing a 0-fill is used.
(3) and if the attribute has a clustering factor f > 1 and c < N /10, the maxi-
mum size of its compressed bitmap index is
$$s \sim \frac{N}{w-1} \left( 1 + \frac{2w-3}{f} \right),$$
which is nearly inversely proportional to the clustering factor f.
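A quick numerical check of part (3), using a hypothetical helper name: with f = 1, the bound reduces to N(2w − 2)/(w − 1) = 2N words, agreeing with part (2), and larger clustering factors shrink the fill-word contribution.

```python
def max_index_size(N, f, w=32):
    """Approximate maximum size (in words) of a WAH compressed bitmap
    index on N rows with clustering factor f, per part (3):
    s ~ N/(w-1) * (1 + (2w-3)/f)."""
    return N / (w - 1) * (1 + (2 * w - 3) / f)
```

For example, with w = 32 and f = 1 the bound is 2N words, while f = 2 roughly halves the dominant (2w − 3)/f term.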
When the probability distribution is not known, we can estimate the sizes
in a number of ways. If we only know the cardinality, we can use Equation (4)
to give an upper bound on the index size. If the histogram is computed, we
can use the frequency of each value as the probability pi to compute the size
of each bitmap using Equation (7). We can further refine the estimate if the
clustering factors are computed. For a particular value in a dataset, comput-
ing the clustering factor requires one to scan the data to find out how many
times a particular value appears consecutively, including groups of size 1. The
clustering factor is the total number of appearances divided by the number of
consecutive groups. With this additional parameter, we can compute the size of the compressed bitmaps using Equation (3). However, since the total size of the
compressed bitmap index is relatively small in all cases, one can safely generate
an index without first estimating its size.
The above formulas only include the size of the compressed bitmaps of an
index. They do not include the attribute values or other supporting information
required by a bitmap index. When stored on disk, we use two arrays in addition
to the bitmaps, one to store the attribute values and the other to store the
starting position of the bitmaps. Since we pack the bitmaps sequentially in a
single file, there is no need to store the end positions of most of the bitmaps
except the last one. For an index with c bitmaps, there are c attribute values
and c starting positions. If each attribute value and each starting position can
be stored in a single word, the file containing a bitmap index uses 2c more words
than the total size of bitmaps. For most high-cardinality attributes, the index
file size is at most 2N + 4c because the total bitmap size is about 2N + 2c, as
shown in Equation (6). In the extreme case where c = N , the index file size is
6N . It may be desirable to reduce the size in this extreme case, for example, by
using different compression schemes or using a RID list [O’Neil 1987]. In our
experience, we hardly ever see this extreme case even for floating-point valued
attributes. The maximum size of WAH compressed bitmap indices in a typical
application is 2N words, and the average size may be a fraction of N words.
Fig. 8. The range of possible values for the number of iterations $I_w$ through the main WHILE loop of function generic_op, defined in Listing 1 in the Appendix.
bitmap and use it to store the result of the operation.8 We call this the in-
place OR operation and denote it as x |= y. In addition to avoiding repeated
allocation of new memory for the intermediate results, the in-place OR also
avoids repeatedly generating 0-fills in the intermediate results.
Following the derivation of Equation (10), we can easily compute the total time required by algorithm inplace_or. Let $m_y$ denote the number of words in y.vec. The algorithm calls function run::decode $m_y$ times at a cost of $C_d m_y$. Let $L_y$ denote the total length of all 1-fills in y; the total number of iterations through the inner loop marked "assign 1-fill" is $L_y$. Assume each inner iteration takes $C_4$ seconds; then the total cost of this inner loop is $C_4 L_y$. The main loop is executed $m_y$ times, and the time spent in the main loop, excluding that spent in the inner loops, should be linear in the number of iterations. This cost and the cost of decoding can be combined as $C_3 m_y$. The total time of algorithm inplace_or is
$$t_I = C_3 m_y + C_4 L_y. \qquad (11)$$
Let $d_y$ denote the bit density of y. For sparse bitmaps, where $2wd_y \ll 1$, we have $L_y = N d_y^{2w-2}/(w-1) \to 0$. In this case, the above formula can be stated as follows.
THEOREM 8. On a sparse bitmap y, the time complexity of algorithm
inplace or is O(m y ), where m y is the number of regular words in y.
8 This is similar to the basic method used by Johnson [1999]. However, Johnson’s implementation involved a literal bitmap and a compressed bitmap, while only WAH compressed bitmaps are used in our case.
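The idea behind the in-place OR can be sketched as follows. This is an illustrative Python sketch, not the paper's algorithm: `dense` stands for the decompressed buffer of 31-bit groups, and the point of the algorithm is that a 0-fill merely advances the position without touching memory.

```python
def inplace_or(dense, words, w=32):
    """Sketch of the in-place OR (x |= y): fold a WAH-compressed bitmap
    (words) into `dense`, a decompressed list of (w-1)-bit groups.
    A 0-fill only advances the position; a 1-fill assigns whole groups;
    a literal word is ORed into a single group."""
    g = w - 1
    full = (1 << g) - 1
    pos = 0
    for word in words:
        if word >> (w - 1):                          # fill word
            n = word & ((1 << (w - 2)) - 1)          # fill length in groups
            if (word >> (w - 2)) & 1:                # 1-fill: assign the run
                for k in range(pos, pos + n):
                    dense[k] = full
            pos += n                                 # 0-fill: just skip ahead
        else:                                        # literal word
            dense[pos] |= word
            pos += 1
    return dense
```

For a sparse y, almost every word is a 0-fill or a literal, so the work is proportional to the number of regular words $m_y$, as Theorem 8 states.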
Assuming all bitmaps have the same size m, the above formula simplifies to
TG ≤ C2 m(k + 2)(k − 1)/2. In other words, the total time grows quadratically
in the number of input bitmaps.9 In an earlier test, we actually observed this
quadratic relation [Wu et al. 2004]. Next we show that a linear relation is
achieved with inplace or.
To use inplace_or, we first need to produce an uncompressed bitmap. Let
C0 denote the amount of time to produce this uncompressed bitmap. To com-
plete the operations, we need to call inplace or k times on k input compressed
bitmaps. The total time required is
$$T_I = C_0 + C_3 \sum_{i=1}^{k} m_i + C_4 \sum_{i=1}^{k} L_i,$$
where $L_i$ denotes the total length of the 1-fills in the ith bitmap. Under the sparse-bitmap assumption, the term $C_4 \sum_{i=1}^{k} L_i$ in the above equation is much smaller than the others. This leads to
$$T_I \approx C_0 + C_3 \sum_{i=1}^{k} m_i. \qquad (13)$$
9 Note that this quadratic relation holds only if all the intermediate results are also sparse bitmaps. If some intermediate results are not sparse, we should use Equation (9) rather than Equation (10) to compute the total time, which would lead to a more realistic upper bound. When a small number of bitmaps is involved, using generic_op is faster than using inplace_or; as the number of bitmaps increases, inplace_or becomes significantly better well before Equation (12) becomes a gross exaggeration [Wu et al. 2004]. For this reason, we have chosen to omit the formula for dense intermediate results.
We did not label the last two propositions as theorems because there are a
number of factors that may cause the observed time to deviate from the expected
linear relation. Next we describe three major ones.
The first one is that the "constant" $C_0$ is actually a linear function of M. To generate an uncompressed bitmap, one has to allocate M + 2 words and fill them with zero values. Because the uncompressed bitmap is generated in memory, this procedure is very fast, and the observed value of $C_0$ is typically negligible.
The second factor is the memory hierarchy in computers. Given two sets of
sparse bitmaps with the same number of 1s, the procedure of using inplace or
is basically taking one bitmap at a time to modify the uncompressed bitmap,
and the content of the uncompressed bitmap is modified one word at a time.
In the worst case, the total number of words to be modified is H, which is the
same for both sets of bitmaps. However, because the words are loaded from
main memory into the caches one cache line at a time, this causes some of the
words to be loaded into various levels of caches more than once. Many words are
loaded into caches unnecessarily. We loosely refer to these as extra work. The
lower the bit density, the more extra work is required. This makes the observed
value of C3 increase as the attribute cardinality increases. In the extreme case,
H cache lines are loaded, one for each hit. In short, the value of C3 depends on
some characteristics of the data, but it approaches an asymptotic maximum as
the attribute cardinality increases.
The third factor is that the above analysis neglected the time required to
determine what bitmaps are needed to answer a query. Our test software uses
binary searches on the attribute values to locate the bitmaps. Theoretically, a
binary search takes O(log(c)) time. One way to reduce this time would be to
use a hash function which can reduce this time to O(1) [Czech and Majewski
1993; Fox et al. 1991]. Timing measurements show that the binary searches
take negligible amounts of time; therefore, we have not implemented the hash
functions.
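Such a lookup can be done with a standard-library binary search. This is a sketch with a hypothetical helper name, mirroring the lookup described above: one bitmap per distinct attribute value, with the values kept sorted.

```python
import bisect

def bitmaps_for_range(values, lo, hi):
    """Locate, by binary search, the positions of the bitmaps needed to
    answer the range condition lo <= X <= hi, given the sorted distinct
    attribute values (one bitmap per value)."""
    return range(bisect.bisect_left(values, lo),
                 bisect.bisect_right(values, hi))
```

Each call costs O(log c) comparisons, which, as noted above, is negligible compared to the time spent on the bitwise logical operations.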
Fig. 9. Logical operation time is proportional to the compression ratio of the operands, and therefore to the total size of the operands. On the STAR bitmap indices, the total CPU time used by BBC is about 12 times that used by WAH.
Fig. 10. The sizes of the compressed bitmaps. The symbols for the Markov bitmaps are marked
with their clustering factors. The dashed lines are predictions based on Equations (1) and (3).
1 approach a constant. For very small bitmaps, where the logical operation
time is measured to be a few microseconds, the measured time deviates from
the linear relation because of factors such as the timing overhead and function
call overhead. The regression lines for WAH and BBC are about a factor of 10
apart in both plots.
If we sum up the execution time of all logical operations performed on the
STAR bitmaps for each compression scheme, the total time for BBC is about 12
times that for WAH. Much of this difference can be attributed to the large number of relatively short fills in the bitmaps. BBC is more effective in compressing
these short fills, but it also takes more time to use these fills in bitwise logi-
cal operations. In contrast, WAH typically does not compress these short fills.
When performing bitwise logical operations on these uncompressed fills, WAH
can be nearly 100 times faster than BBC. The performance differences between
WAH and BBC are the smallest on sparse bitmaps. On very sparse bitmaps,
WAH is about four to five times faster than BBC when bitmaps are in memory.
When the bitmaps are read from disk, WAH is about two to three times faster
than BBC.
Table I. Total Sizes of the Indices on STAR Data and the Average Time Needed to Process a
Random One-Dimensional Range Query

                              Commercial DBMS              Our Bitmap Indices
                          Projection  B-tree  Bitmap     LIT        BBC     WAH
 12 lowest     Size (MB)     113       370       7          84       4        7
 cardinality   Time (s)        0.57      0.85    0.01        0.01    0.007    0.003
 12 commonly   Size (MB)     113       408     111     726,484     118      186
 queried       Time (s)        0.57      0.95    0.66        —       0.32     0.052
gzip. For bit densities between 0.001 and 0.01, WAH uses about 8/3 (≈ 2.7) times the space of BBC. In fact, in extreme cases, WAH may use four times as
much space as BBC. Fortunately, these cases do not dominate the total space
required by a bitmap index. In a typical bitmap index, the compression ratios
of bitmaps vary widely, and the total size is dominated by bitmaps with the
largest compression ratios. Since most schemes use about the same amount of
space to store these incompressible bitmaps, the differences in total sizes are
usually much smaller than the extreme cases. For example, on the set of STAR
data, the bitmap indices compressed with WAH are about 60% (186/118 ∼ 1.6)
bigger than those compressed with BBC, as shown in Table I.
12 The values x1 and x2 are chosen separately and each is chosen from the domain of X with uniform
probability distribution. If x1 > x2 , we swap the two. If x1 = x2 , the range is taken to be x1 ≤ X .
13 In this case, before a query is run, the file system containing the raw data and indices is unmounted
Fig. 11. The average response time of five-dimensional queries on the STAR dataset. The query
size is the expected fraction of events that are hits.
Fig. 12. The average time and the maximum time needed to process a random range query using
a WAH compressed bitmap index.
Fig. 13. The query response time using WAH compressed bitmap indices is at worst linear in the
number of hits.
plot is for high-cardinality attributes. Each data point is the average time of
1000 different queries. For both high- and low-cardinality attributes, we see
that WAH compressed bitmap indices use significantly less time than others.
On low-cardinality attributes, the uncompressed bitmap indices are smaller
than projection indices and B-tree indices, and are also much more efficient.
This agrees with what has been observed by others [O’Neil and Quass 1997].
We report the performance of a commercial implementation of the compressed
bitmap index to validate our own implementation. Our implementation of the
BBC compressed index and the DBMS implementation performed about as
well as the uncompressed bitmap indices (marked LIT in Figure 11) on the low-
cardinality attributes. On the 12 most frequently queried attributes, the query
response time is longer than that on the low-cardinality attributes; however,
the WAH compressed indices are still more efficient than others. On these high-
cardinality attributes, projection indices are about three times as fast as B-
tree indices; the WAH compressed indices are at least three times as fast as
projection indices.
Table I shows the total sizes of various indices and the average time required
to answer a random one-dimensional query on the STAR data.14 We consider the
projection index as the reference method since it is compact and efficient [O’Neil
and Quass 1997]. The projection index size is the same as the raw data, which
is smaller than most indices on high-cardinality attributes. The bitmap index
sizes reported are the index file sizes which include the bitmaps, attribute val-
ues, and starting positions of bitmaps. The particular B-tree reported in Table I
is nearly four times the size of the raw data. We did not generate the uncom-
pressed bitmap indices for high-cardinality attributes because they would take
726 GB of disk space and clearly would not be competitive against other indices.
On the high-cardinality attributes, the WAH compressed bitmap index is
about 60% (186/118 ∼ 1.6) larger than the BBC compressed index. In terms of
query response time, the WAH compressed indices are about 13 (0.66/0.052 ∼
13) times faster than the commercial implementation of the compressed index,
and about six (0.32/0.052 ∼ 6) times faster than our own implementation of
the BBC compressed index. In the previous section, we observed that WAH per-
forms bitwise logical operations about 12 times as fast as BBC. The difference
in query response time is less than 12 for two main reasons. First, the aver-
age query response time weighs operations on sparse bitmaps more heavily
because it takes more sparse bitmaps to answer a query. On sparse bitmaps,
the performance difference between WAH and BBC could be as low as 2. Sec-
ond, the query response time also includes time for operations such as network
communication, parsing of the queries, locking, and other administrative oper-
ations. In this test, all indices can fit into memory. For larger datasets, where
more I/O is required during query processing, the relative difference between
using WAH and using BBC would be smaller [Stockinger et al. 2002; Wu et al.
2004]. However, unless the I/O system is extremely slow, say 2 MB/s, using
WAH is preferred [Stockinger et al. 2002]. This observation was made with two
different BBC implementations, one by the authors and one by Dr. Johnson
[Johnson 1999; Stockinger et al. 2002].
14 The sizes reported for the commercial DBMS are the actual bytes used, not the total size of the
pages occupied.
The bitmap index is known to be efficient for low-cardinality attributes. For
scientific data such as that from STAR, where the cardinalities of some
attributes are in the millions, there were doubts as to whether it remains
efficient. In Figure 11(b), we see many cases where compressed bitmap indices
perform worse than projection indices. However, WAH compressed indices are
more efficient than projection indices in all test cases. This shows that WAH
compressed indices are efficient for both low- and high-cardinality attributes.
Next, we demonstrate that WAH compressed indices scale linearly.
15 More precisely 56%, which can be computed as follows. Let β be the fraction of bitmaps read
and S the total size of the bitmaps in bytes. The time to read βS bytes at 16 MB/s is
βS/(16 × 10^6) seconds, and the measured time is 3.5 × 10^−8 S seconds. The two time values
must be the same, βS/(16 × 10^6) = 3.5 × 10^−8 S, which leads to β = 16 × 10^6 × 3.5 × 10^−8 = 0.56.
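The footnote's arithmetic can be checked directly; this is a minimal sketch in which the size S cancels out of both sides:

```python
# Fraction of bitmaps read (beta): reading beta*S bytes at 16 MB/s takes
# beta*S / 16e6 seconds; the measured time is 3.5e-8 * S seconds.
# Setting the two equal, S cancels:
read_rate = 16e6          # bytes per second
time_per_byte = 3.5e-8    # measured seconds per byte of bitmap
beta = read_rate * time_per_byte
print(round(beta, 2))     # → 0.56
```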
— How do bitmap indices perform on other types of queries, such as θ-joins, top-k
queries, similarity queries, and queries with complex arithmetic expressions?
— How can bitmap indices be updated in the presence of frequent updates? One
approach might be to mark the bits corresponding to the modified records in
the bitmap indices as invalid, store the updated records separately, say in
memory, and update indices when the system is idle.
— How does one deal with the extreme case where every value is distinct? In
this case, the compressed bitmap index may be three times as large as the
normal case. Model 204 uses a compact RID list as an alternative to the
uncompressed bitmap index when the attribute cardinality is high [O’Neil
1987]. It might be worthwhile to consider using a compact RID list instead
of the compressed bitmap index as well.
— How can bitmap indices be efficiently generated and organized on disk? Our
implementation currently uses high-level I/O functions to lay out the index
without considering the intrinsic paging and blocking structure of the file
systems. Systematically studying the index generation process could be an
interesting activity, especially if issues such as paging/blocking, recovery,
and maintainability are also considered.
— How does one handle frequent queries on the same set of attributes? For
ad hoc range queries on high-dimensional data, generating one compressed
bitmap index for each attribute is an efficient approach. However, if some
attributes are frequently used together, it may be more efficient to use a
composite index. A WAH compressed version of this composite index should
be optimal for range queries on the indexed attributes. Demonstrating this
could also be interesting for future work.
— How could we fully explore the design choices of different compression
schemes for bitmaps? We cannot expect any compression scheme to make
the compressed bitmap indices scale better than WAH; however, there might
be compression schemes with smaller scaling constants. In Section 3, we
mentioned two potential ways of improving bitmap compression: one would
be to decompress some dense bitmaps and the other would be to explore dif-
ferent ways of extending BBC to be word aligned. There may be many other
options.
— How do other compressed indices scale? There are some indications that
BBC may scale linearly as well [Wu et al. 2004]. It would be an interesting
exercise to formally prove that it indeed scales linearly. It is possible that all
compression schemes based on run-length encoding have this property.
IF (xrun.isFill)
IF (yrun.isFill)
nWords = min(xrun.nWords, yrun.nWords),
z.appendFill(nWords, (*(xrun.it) ◦ *(yrun.it))),
xrun.nWords -= nWords, yrun.nWords -= nWords;
ELSE
z.active.value = xrun.fill ◦ *yrun.it,
z.appendLiteral(),
-- xrun.nWords, yrun.nWords = 0;
ELSEIF (yrun.isFill)
z.active.value = yrun.fill ◦ *xrun.it,
z.appendLiteral(),
-- yrun.nWords, xrun.nWords = 0;
ELSE
z.active.value = *xrun.it ◦ *yrun.it,
z.appendLiteral(),
xrun.nWords = 0, yrun.nWords = 0;
}
z.active.value = x.active.value ◦ y.active.value;
z.active.nbits = x.active.nbits;
}
bitmap::appendLiteral() {
Input: 31 literal bits stored in active.value.
Output: vec extended by 31 bits.
IF (vec.empty())
vec.push_back(active.value); // cbi = 1
ELSEIF (active.value == 0)
IF (vec.back() == 0)
vec.back() = 0x80000002; // cbi = 3
ELSEIF (vec.back() ≥ 0x80000000 AND vec.back() < 0xC0000000)
++vec.back(); // cbi = 4
ELSE
vec.push_back(active.value); // cbi = 4
ELSEIF (active.value == 0x7FFFFFFF)
IF (vec.back() == active.value)
vec.back() = 0xC0000002; // cbi = 4
ELSEIF (vec.back() ≥ 0xC0000000)
++vec.back(); // cbi = 5
ELSE
vec.push_back(active.value); // cbi = 5
ELSE
vec.push_back(active.value); // cbi = 3
}
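As a concrete (hypothetical) rendering of the cases above, here is a minimal Python sketch of appendLiteral; `vec` is a list of 32-bit words, where words below 0x80000000 hold 31 literal bits, words in [0x80000000, 0xC0000000) are 0-fills, words at or above 0xC0000000 are 1-fills, and the low 30 bits of a fill word count 31-bit groups:

```python
ALLZERO = 0x00000000   # a group of 31 zero bits
ALLONE  = 0x7FFFFFFF   # a group of 31 one bits

def append_literal(vec, value):
    """Append one 31-bit literal group to the word list vec, merging
    all-zero/all-one groups into fill words as in the pseudocode above."""
    if not vec:
        vec.append(value)
    elif value == ALLZERO:
        if vec[-1] == ALLZERO:
            vec[-1] = 0x80000002                 # merge two 0-groups
        elif 0x80000000 <= vec[-1] < 0xC0000000:
            vec[-1] += 1                         # extend an existing 0-fill
        else:
            vec.append(value)
    elif value == ALLONE:
        if vec[-1] == ALLONE:
            vec[-1] = 0xC0000002                 # merge two 1-groups
        elif vec[-1] >= 0xC0000000:
            vec[-1] += 1                         # extend an existing 1-fill
        else:
            vec.append(value)
    else:
        vec.append(value)                        # ordinary literal word
    return vec
```

For example, appending three all-zero groups in a row yields the single fill word 0x80000003, encoding 93 zero bits in one word.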
bitmap::appendFill(n, fillBit) {
Input: n and fillBit, describing a fill with 31n bits of fillBit
Output: vec extended by 31n bits of value fillBit.
COMMENT: Assuming active.nbits = 0 and n > 0.
IF (n > 1 AND ! vec.empty())
IF (fillBit == 0)
IF (vec.back() ≥ 0x80000000 AND vec.back() < 0xC0000000)
vec.back() += n; // cbi = 3
ELSE
vec.push_back(0x80000000 + n); // cbi = 3
ELSEIF (vec.back() ≥ 0xC0000000)
vec.back() += n; // cbi = 3
ELSE
vec.push_back(0xC0000000 + n); // cbi = 3
ELSEIF (vec.empty())
IF (fillBit == 0)
vec.push_back(0x80000000 + n); // cbi = 3
ELSE
vec.push_back(0xC0000000 + n); // cbi = 3
ELSE
active.value = (fillBit?0x7FFFFFFF:0), // cbi = 3
appendLiteral();
}
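The appendFill pseudocode admits a similar hedged Python sketch (a translation for illustration, not the authors' implementation; the n = 1 fallback to appendLiteral is inlined):

```python
def append_fill(vec, n, fill_bit):
    """Append n groups of 31 identical bits (value fill_bit) to the word
    list vec, merging with a trailing fill word of the same kind.
    Assumes n > 0 and no pending active bits, as in the pseudocode."""
    header = 0xC0000000 if fill_bit else 0x80000000
    if n > 1 and vec:
        if header <= vec[-1] < header + 0x40000000:
            vec[-1] += n                   # extend the trailing fill
        else:
            vec.append(header + n)         # start a new fill word
    elif not vec:
        vec.append(header + n)             # first word of the bitmap
    else:
        # n == 1 on a non-empty vector: append as one literal group,
        # applying the same merge rules as appendLiteral
        if header <= vec[-1] < header + 0x40000000:
            vec[-1] += 1
        elif vec[-1] == (0x7FFFFFFF if fill_bit else 0):
            vec[-1] = header + 2
        else:
            vec.append(0x7FFFFFFF if fill_bit else 0)
    return vec
```

For instance, append_fill([], 3, 0) yields [0x80000003] (93 zero bits), and a subsequent append_fill of one-bits starts a separate 1-fill word rather than extending the 0-fill.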
ACKNOWLEDGMENTS
The authors wish to express their sincere gratitude to Professor Ding-Zhu Du for
his helpful suggestion of the inequality in Lemma 5, and to Drs. Doron Rotem
and Kurt Stockinger for their help in reviewing the drafts of this article.
REFERENCES
AMER-YAHIA, S. AND JOHNSON, T. 2000. Optimizing queries on compressed bitmaps. In VLDB 2000,
Proceedings of 26th International Conference on Very Large Data Bases, September 10–14, 2000,
Cairo, Egypt, A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter,
and K.-Y. Whang, Eds. Morgan Kaufmann, San Francisco, CA, 329–338.
ANTOSHENKOV, G. 1994. Byte-aligned bitmap compression. Tech. rep. Oracle Corp., Redwood
Shores, CA. U.S. Patent number 5,363,098.
ANTOSHENKOV, G. AND ZIAUDDIN, M. 1996. Query processing and optimization in ORACLE RDB.
VLDB J. 5, 229–237.
CHAN, C.-Y. AND IOANNIDIS, Y. E. 1998. Bitmap index design and evaluation. In Proceedings of the
1998 ACM SIGMOD: International Conference on Management of Data. ACM Press, New York,
NY, 355–366.
CHAN, C. Y. AND IOANNIDIS, Y. E. 1999. An efficient bitmap encoding scheme for selection queries.
In SIGMOD 1999, Proceedings of the ACM SIGMOD International Conference on Manage-
ment of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA, A. Delis, C. Faloutsos, and S.
Ghandeharizadeh, Eds. ACM Press, New York, NY, 215–226.
CHAUDHURI, S. AND DAYAL, U. 1997. An overview of data warehousing and OLAP technology. ACM
SIGMOD Rec. 26, 1 (Mar.), 65–74.
COMER, D. 1979. The ubiquitous B-tree. Comput. Surv. 11, 2, 121–137.
CZECH, Z. J. AND MAJEWSKI, B. S. 1993. A linear time algorithm for finding minimal perfect hash
functions. Comput. J. 36, 6 (Dec.), 579–587.
FOX, E. A., CHEN, Q. F., DAOUD, A. M., AND HEATH, L. S. 1991. Order-preserving minimal perfect
hash functions and information retrieval. ACM Trans. Inf. Syst. 9, 3, 281–308.
FURUSE, K., ASADA, K., AND IIZAWA, A. 1995. Implementation and performance evaluation of com-
pressed bit-sliced signature files. In Information Systems and Data Management, 6th Interna-
tional Conference, CISMOD’95, Bombay, India, November 15–17, 1995, Proceedings, S. Bhalla,
Ed. Lecture Notes in Computer Science, vol. 1006. Springer, Berlin, Germany, 164–177.
GAILLY, J. AND ADLER, M. 1998. zlib 1.1.3 manual. Source code available online at
http://www.info-zip.org/pub/infozip/zlib.
ISHIKAWA, Y., KITAGAWA, H., AND OHBO, N. 1993. Evaluation of signature files as set access facilities
in OODBs. In Proceedings of the ACM SIGMOD International Conference on Management of
Data, (Washington, DC May 26–28), P. Buneman and S. Jajodia, Eds. ACM Press, New York, NY,
247–256.
JOHNSON, T. 1999. Performance measurements of compressed bitmap indices. In VLDB’99, Pro-
ceedings of 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edin-
burgh, Scotland, UK, M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie,
Eds. Morgan Kaufmann, San Francisco, CA, 278–289. A longer version appeared as AT&T report
number AMERICA112.
JÜRGENS, M. AND LENZ, H.-J. 1999. Tree based indexes vs. bitmap indexes—a performance study.
In Proceedings of the International Workshop on Design and Management of Data Warehouses,
DMDW’99, Heidelberg, Germany, June 14-15, 1999, S. Gatziu, M. A. Jeusfeld, M. Staudt, and
Y. Vassiliou, Eds.
KNUTH, D. E. 1998. The Art of Computer Programming, 2nd ed. Vol. 3. Addison Wesley, Reading,
MA.
KOUDAS, N. 2000. Space efficient bitmap indexing. In Proceedings of the Ninth International
Conference on Information Knowledge Management (CIKM 2000, November 6–11, McLean, VA).
ACM Press, New York, NY, 194–201.
LEE, D. L., KIM, Y. M., AND PATEL, G. 1995. Efficient signature file methods for text retrieval.
IEEE Trans. Knowl. Data Eng. 7, 3, 423–435.
MOFFAT, A. AND ZOBEL, J. 1992. Parameterised compression for sparse bitmaps. In Proceedings of
the ACM-SIGIR International Conference on Research and Development in Information Retrieval,
(Copenhagen, Denmark, June), N. Belkin, P. Ingwersen, and A. M. Pejtersen, Eds. ACM Press,
New York, NY, 274–285.
O’NEIL, P. 1987. Model 204 architecture and performance. In 2nd International Workshop in
High Performance Transaction Systems, Asilomar, CA. Lecture Notes in Computer Science, vol.
359. Springer-Verlag, Berlin, Germany, 40–59.
O’NEIL, P. AND O’NEIL, E. 2000. Database: Principles, Programming, and Performance, 2nd ed.
Morgan Kaufmann, San Francisco, CA.
O’NEIL, P. AND QUASS, D. 1997. Improved query performance with variant indices. In Proceedings
of the ACM SIGMOD International Conference on Management of Data (Tucson, AZ, May 13–15),
J. Peckham, Ed. ACM Press, New York, NY, 38–49.
O’NEIL, P. E. AND GRAEFE, G. 1995. Multi-table joins through bitmapped join indices. SIGMOD
Rec. 24, 3, 8–11.
SHOSHANI, A., BERNARDO, L. M., NORDBERG, H., ROTEM, D., AND SIM, A. 1999. Multidimensional
indexing and query coordination for tertiary storage management. In Proceedings of the 11th
International Conference on Scientific and Statistical Database Management (Cleveland, OH,
28–30 July). IEEE Computer Society Press, Los Alamitos, CA, 214–225.
STOCKINGER, K., DUELLMANN, D., HOSCHEK, W., AND SCHIKUTA, E. 2000. Improving the performance
of high-energy physics analysis through bitmap indices. In Proceedings of the 11th International
Conference on Database and Expert Systems Applications (DEXA 2000, London, Greenwich, UK).
STOCKINGER, K., WU, K., AND SHOSHANI, A. 2002. Strategies for processing ad hoc queries on large
data warehouses. In Proceedings of DOLAP’02 (McLean, VA), 72–79. A draft appeared as Tech
rep. LBNL-51791, Lawrence Berkeley National Laboratory, Berkeley, CA.
WONG, H. K. T., LIU, H.-F., OLKEN, F., ROTEM, D., AND WONG, L. 1985. Bit transposed files. In
Proceedings of VLDB 85 (Stockholm, Sweden). 448–457.
WU, K.-L. AND YU, P. 1996. Range-based bitmap indexing for high cardinality attributes with
skew. Tech. rep. RC 20449. IBM Watson Research Division, Yorktown Heights, NY.
WU, M.-C. AND BUCHMANN, A. P. 1998. Encoded bitmap indexing for data warehouses. In Proceed-
ings of the Fourteenth International Conference on Data Engineering (February 23–27, Orlando,
FL). IEEE Computer Society, ACM Press, Los Alamitos, CA, 220–230.
WU, K., KOEGLER, W., CHEN, J., AND SHOSHANI, A. 2003. Using bitmap index for interactive ex-
ploration of large datasets. In Proceedings of SSDBM 2003 (Cambridge, MA), 65–74. A draft
appeared as Tech rep. LBNL-52535, Lawrence Berkeley National Laboratory, Berkeley, CA.
WU, K., OTOO, E. J., AND SHOSHANI, A. 2001a. A performance comparison of bitmap indexes. In
Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge
Management (Atlanta, GA, November 5–10). ACM Press, New York, NY, 559–561.
WU, K., OTOO, E. J., AND SHOSHANI, A. 2002. Compressing bitmap indexes for faster search oper-
ations. In Proceedings of SSDBM’02 (Edinburgh, Scotland), 99–108. Also published as Tech rep.
LBNL-49627, Lawrence Berkeley National Laboratory, Berkeley, CA.
WU, K., OTOO, E. J., AND SHOSHANI, A. 2004. On the performance of bitmap indices for high-
cardinality attributes. In Proceedings of the Thirtieth International Conference on Very Large
Data Bases, Toronto, Canada, August 31-September 3, 2004, M. A. Nascimento, M. T. Özsu,
D. Kossmann, R. J. Miller, J. A. Blakeley, and K. B. Schiefer, Eds. Morgan Kaufmann, San
Francisco, CA, 24–35.
WU, K., OTOO, E. J., SHOSHANI, A., AND NORDBERG, H. 2001b. Notes on design and implementation
of compressed bit vectors. Tech. rep. LBNL/PUB-3161. Lawrence Berkeley National Laboratory,
Berkeley, CA.
ZIV, J. AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans.
Inform. Theor. 23, 3, 337–343.
Received April 2004; revised December 2004, June 2005; accepted July 2005