L5 HashTables
Hashed Indexes
At the end of this lecture students should be able to:
• Describe static hashed tables and indexes, and how to
handle collisions
• Describe dynamic hashed tables, how database operations
are carried out on them, and the advantages offered
• Explain the main properties of hash functions, and
multi-attribute hashing techniques
Hashed Indexing and Collision Handling
• Associative Tables
• (Dynamic) Hashed Tables
• Hash Functions
• Collisions and How to Handle Them
Introduction: Hashing
• Many applications require a dynamic set that supports only the
dictionary operations INSERT, SEARCH, and DELETE. For example,
a compiler for a computer language maintains a symbol table, in
which the keys of elements are arbitrary character strings that
correspond to identifiers in the language.
• A hash table is an effective data structure for implementing
dictionaries.
• Although searching for an element in a hash table can take as long as
searching for an element in a linked list (O(n) time in the worst case),
in practice hashing performs extremely well.
• Under reasonable assumptions, the expected time to search for an
element in a hash table is O(1).
• The bottom line is that hashing is an extremely effective and practical
technique: the basic dictionary operations require only O(1) time on
average.
• “Perfect hashing” can support searches in O(1) worst-case time when
the set of keys being stored is static (that is, when the set of keys never
changes once stored).
Associative Tables
• Consider an array of integers:
• It is easy enough to find the value at slot 3 (that is, 91). But how do
we find the slot that holds the value 76?
– We could search in the array.
– But it is much faster to use associative lookup.
• Hash functions provide association by mapping from the stored value
(for example 76) to the location where it is stored (for example slot 4).
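The contrast between searching and associative lookup can be sketched in a few lines of Python. The table size T = 8 and the hash function h(v) = v mod T are assumptions made for illustration; they happen to reproduce the two placements quoted above (91 in slot 3, 76 in slot 4). The function names are hypothetical.

```python
# Sketch only: T = 8 and h(v) = v % T are assumptions, chosen because they
# reproduce the two placements quoted in the text (91 -> slot 3, 76 -> slot 4).
T = 8

def h(value: int) -> int:
    """Map a stored value to the slot where it should live."""
    return value % T

def linear_search(table, value):
    """Scan the slots one by one until the value is found: up to T comparisons."""
    for slot, stored in enumerate(table):
        if stored == value:
            return slot
    return None

def associative_lookup(table, value):
    """Go straight to slot h(value): a single probe, no scan."""
    slot = h(value)
    return slot if table[slot] == value else None

table = [None] * T
for v in (91, 76):            # the only two values the text names
    table[h(v)] = v

print(linear_search(table, 76))        # 4
print(associative_lookup(table, 76))   # 4
print(table[3])                        # 91
```

Both functions return the same slot; the difference is that the scan inspects slots until it hits the value, while the associative lookup computes the slot directly from the value.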
Hash Tables
• Basic idea:
– Have a hash table H of T slots, addressed from 0 to T−1.
– Have a hash function h, that is, any function that maps keys
(integers, or strings, or ...) into integers in the range 0 to T − 1.
– Store record Rk at slot v = h(k).
• Ideally the access cost is O(1), regardless of data volume. That is, no
matter how much data we have, the lookup cost is the same.
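The basic idea above can be sketched as a small Python class. Two things here are assumptions rather than part of the definition: the hash function is Python's built-in hash() reduced modulo T, and each slot keeps a short list so that keys hashing to the same slot can coexist (chaining, one of several ways to handle collisions). The class and method names are hypothetical.

```python
class HashTable:
    """Table H of T slots addressed 0..T-1; record R_k is stored at slot h(k)."""

    def __init__(self, T: int = 8):
        self.T = T
        # Each slot holds a list (chain) of (key, record) pairs. Chaining is an
        # assumption made here so that colliding keys have somewhere to go.
        self.slots = [[] for _ in range(T)]

    def h(self, key) -> int:
        # Python's built-in hash() accepts integers and strings; the modulo
        # folds the result into the range 0..T-1.
        return hash(key) % self.T

    def insert(self, key, record) -> None:
        chain = self.slots[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                  # key already present: overwrite
                chain[i] = (key, record)
                return
        chain.append((key, record))

    def search(self, key):
        for k, record in self.slots[self.h(key)]:
            if k == key:
                return record
        return None                       # key not in the table

    def delete(self, key) -> bool:
        chain = self.slots[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


H = HashTable(T=8)
H.insert(76, "record for 76")
H.insert(84, "record for 84")             # 84 % 8 == 4, same slot as 76
print(H.h(76), H.search(76))
print(H.h(84), H.search(84))
```

Each operation touches only the one slot given by h(k), which is why the cost stays close to O(1) as long as the chains remain short.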
• Example: Consider a hash table of 8 slots:
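• When T is large and λ = n/T is bounded (that is, not too big, as in our
practical application), the above distribution can be simplified using
the Poisson approximation to the binomial distribution:
$P(k) \approx e^{-\lambda} \lambda^{k} / k!$, where P(k) is the probability
that a given slot receives exactly k of the n records.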
• Note that a uniform distribution does not mean that all slots get
roughly the same number of hits. In this case, 45% of the hash table is
empty.
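A quick numerical check shows how close the Poisson approximation is to the binomial distribution. The figures n = 80,000 records and T = 100,000 slots are an assumption inferred from the overflow numbers quoted later in these notes; they give λ = 0.8, for which the probability of an empty slot is about 0.449, matching the 45% above. The function names are hypothetical.

```python
import math

def binomial_pmf(k: int, n: int, T: int) -> float:
    """P(a given slot receives exactly k of n records), assuming each record
    lands in that slot independently with probability 1/T."""
    p = 1.0 / T
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson approximation with mean lam = n / T."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, T = 80_000, 100_000        # assumed figures, giving lam = 0.8
lam = n / T

# Columns: k, binomial probability, Poisson probability.
for k in range(5):
    print(k, round(binomial_pmf(k, n, T), 4), round(poisson_pmf(k, lam), 4))
```

The k = 0 row is the fraction of empty slots. Even though every slot is equally likely to be hit, the counts per slot are far from equal.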
Overflows
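• The number of overflow records is: 1 for each bucket with 2 records; 2
for each bucket with 3 records; 3 for each bucket with 4 records; and so on.
That is, the number of overflow records is
$\sum_{k \ge 2} (k - 1)\, T\, P(k)$, where $T\,P(k)$ is the expected number
of buckets that receive exactly k records.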
• That is, 24,800 ÷ 80,000 or 31% of records can't be stored in the slot
given by their hash value. If T = n = 100,000, this fraction rises to
about 37%.
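The overflow fractions can be reproduced with a short calculation. This is a sketch under two assumptions: bucket occupancy follows the Poisson approximation with λ = n/T, and the first case uses T = 100,000 slots (inferred from the second case, not stated explicitly). Summing (k − 1) over all occupied buckets also gives the closed form n − T(1 − e^{−λ}), which the second function uses as a cross-check. The function names are hypothetical.

```python
import math

def overflow_fraction(n: int, T: int, k_max: int = 60) -> float:
    """Fraction of the n records that overflow, counting k - 1 overflow
    records for every bucket that receives k >= 2 records."""
    lam = n / T
    overflow = 0.0
    for k in range(2, k_max + 1):
        p_k = math.exp(-lam) * lam**k / math.factorial(k)   # Poisson P(k)
        overflow += (k - 1) * T * p_k                       # T * P(k) buckets
    return overflow / n

def overflow_fraction_closed(n: int, T: int) -> float:
    """Same quantity from the closed form (n - T * (1 - e^-lam)) / n."""
    lam = n / T
    return 1.0 - (1.0 - math.exp(-lam)) / lam

for n, T in ((80_000, 100_000), (100_000, 100_000)):
    print(n, T, round(overflow_fraction(n, T), 3),
          round(overflow_fraction_closed(n, T), 3))
```

With these assumptions the two cases come out at roughly 31% and 37%, in line with the figures above; the small gap between the computed count and the quoted 24,800 is consistent with rounding in the original working.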