L5 HashTables


Hashed Indexes
At the end of this lecture students should be able to:
• Describe static hashed tables and indexes, and how to
handle collisions
• Describe dynamic hashed tables, how database operations
are carried out on them, and the advantages offered
• Explain the main properties of hash functions, and multi-attribute hashing techniques
Hashed Indexing and Collision
Handling
• Associative Tables
• (Dynamic) Hashed Tables
• Hash Functions
• Collisions and How to Handle Them
Introduction: Hashing
• Many applications require a dynamic set that supports only the
dictionary operations INSERT, SEARCH, and DELETE. For example,
a compiler for a computer language maintains a symbol table, in
which the keys of elements are arbitrary character strings that
correspond to identifiers in the language.
• A hash table is an effective data structure for implementing
dictionaries.
• Although searching for an element in a hash table can take as long as
searching for an element in a linked list (O(n) time in the worst case),
in practice hashing performs extremely well.
• Under reasonable assumptions, the expected time to search for an
element in a hash table is O(1).
• The bottom line is that hashing is an extremely effective and practical
technique: the basic dictionary operations require only O(1) time on
the average.
• “Perfect hashing” can support searches in O(1) worst-case time when
the set of keys being stored is static (that is, when the set of keys never
changes once stored).
Associative Tables
• Consider an array of integers:
[figure: an array of integers, with the value 91 stored in slot 3 and 76 in slot 4]
• It is easy enough to find the value at slot 3 (that is, 91). How do we find
the slot with value 76?
– We could search in the array.
– But it is much faster to use associative lookup.
• Hash functions provide association by mapping from the stored value
(for example 76) to the location where it is stored (for example slot 4).
Hash Tables
• Basic idea:
– Have a hash table H of T slots, addressed from 0 to T−1.
– Have a hash function h, that is, any function that maps keys
(integers, or strings, or ...) into integers in the range 0 to T − 1.
– Store record Rk at slot v = h(k).
• Ideally the access cost is O(1), regardless of data volume. That is, no
matter how much data we have, the lookup cost is the same.
Hash Tables
• Example: Consider a hash table of 8 slots, addressed 0 to 7.
• Suppose the hash function is h(k) = k mod 8.
Then h(76) = 4 and h(91) = 3.

• Why use hash tables for databases? As each database record has a key,
hash tables can provide very fast access:
– Hash the key to get a slot number.
– Store the record in the slot. Alternatively, store a pointer to the
record in the slot.
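The example above can be sketched in a few lines of Python (my choice of language; the slides give no code). Collisions are ignored here; they are the subject of the following slides.

```python
# Sketch of the slides' example: an 8-slot hash table with h(k) = k mod 8.
T = 8                      # number of slots, addressed 0 to T-1
table = [None] * T         # one slot per possible hash value

def h(k):
    """Map a key to a slot number in the range 0 to T-1."""
    return k % T

# Store records (represented here by just their keys) at their hash slots:
for key in (76, 91):
    table[h(key)] = key    # h(76) = 4 and h(91) = 3, as in the slides
```

Looking up a key then costs a single hash computation and a single array access, regardless of how many records are stored.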
Hash Tables in Memory
• The number of slots T is fixed. Each slot is a structure (similar to a C
language struct) that holds a key and maybe some other information
(for example, the rest of the record).
• To insert a record Rk with key k into hash table H compute its hash
value v = h(k) and set table entry H[v] to contain Rk.
• Problems:
– Collisions, when two records hash to the same slot.
– Static size, when the hash table is not of the appropriate capacity
(too big or too small).
Collisions
• A “collision” happens when more than one key maps to the same array
index (slot).
• A good hash function is a bit like a random number generator. Given a
regular pattern of input, the output appears chaotic. A good hash
function for integers is
h(k) = ( ( p0 × k + p1 ) MOD p2 ) MOD T
where p0, p1, and p2 are large primes. (T does not have to be prime.)
• Aside: Not only does this look like a random number generator, but it
is a good random number generator. To get 16-bit pseudo-random
numbers, use:
unsigned R0 = seed
Rn+1 = ( ( Rn + p0 ) × p1 ) MOD p2
return Rn+1 MOD 2^16
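Both formulas can be sketched directly in Python (my choice of language; the primes p0, p1, and p2 below are the ones quoted on the next slide):

```python
# The slide's integer hash function and the related pseudo-random generator.
p0, p1, p2 = 438_439, 34_723_753, 376_307   # large primes from the next slide

def h(k, T):
    """h(k) = ((p0 * k + p1) MOD p2) MOD T."""
    return ((p0 * k + p1) % p2) % T

def rand16(seed, steps=1):
    """16-bit pseudo-random numbers via R(n+1) = ((R(n) + p0) * p1) MOD p2."""
    r = seed
    for _ in range(steps):
        r = ((r + p0) * p1) % p2
    return r % 2**16       # keep only the low 16 bits
```

Note that T does not have to be prime; only the three constants need to be.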
Collisions

• Consider a hash table of size T = 10 and suppose p0 = 438,439, p1 =
34,723,753, and p2 = 376,307.
• h(k) = ( ( 438439 × k + 34723753 ) MOD 376307 ) MOD 10
[figure: h(k) for ten sample keys k]
• There are two 2's and two 7's but no 1 and no 9.
• We say that the hash function is uniform, because all 10 possible hash
values are equally likely. But that does not mean that hashing 10 numbers k
will give ten different addresses h(k). Some of the h(k) values will be the
same; that is, they will collide.
Handling collisions
• There are three standard methods for in-memory collision resolution.
1. Linear probing
2. Double hashing
3. Chaining
Linear probing
• If H[v] is occupied, try H[(v+1) MOD T], H[(v+2) MOD T], ... until a
free slot is found; that is, look in the slot next door, then the slot next to
that, and so on.
• More generally: try H[(v + r) MOD T], H[(v + 2r) MOD T] ... where r
and T have no common divisor greater than 1.
• Linear probing leads to clustering of records; once the table is nearly
full, search degenerates to O(T).
• On search, the same procedure is followed, with a check at each stage
to see if the current slot is either empty or has a matching k value.
• At most T records can be inserted.
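The procedure above can be sketched as follows, using step r = 1 (a minimal sketch in Python; the function names are mine, not the lecture's):

```python
# Linear probing: on collision, scan forward (wrapping around) for a free slot.
def lp_insert(table, key, h):
    T = len(table)
    v = h(key)
    for i in range(T):
        slot = (v + i) % T
        if table[slot] is None:    # free slot found
            table[slot] = key
            return slot
    raise RuntimeError("table full: at most T records can be inserted")

# Search follows the same probe sequence, stopping at an empty slot or a match.
def lp_search(table, key, h):
    T = len(table)
    v = h(key)
    for i in range(T):
        slot = (v + i) % T
        if table[slot] is None:    # empty slot: the key cannot be present
            return None
        if table[slot] == key:
            return slot
    return None                    # probed every slot without a match
```

With the earlier example, inserting 12 after 76 (both hash to 4 under k mod 8) places 12 in slot 5, illustrating how colliding records cluster next to each other.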
Double Hashing
• Set v = h(k) as before.
• If H[v] is occupied, use a secondary hash function h′ to compute v′ = h′(k).
• Try H[(v + v′) MOD T], H[(v + 2v′) MOD T], ...
• Although records no longer cluster, search still degenerates to O(T),
but not as quickly.
• There is also an absolute problem of overflow — at most T records
can be inserted.
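A hedged sketch of the probe sequence above (names are mine); note that the step v′ must share no divisor greater than 1 with T, or some slots are never probed:

```python
# Double hashing: the probe step is itself derived from the key, so records
# that collide at h(k) follow different probe sequences and do not cluster.
def dh_insert(table, key, h, h2):
    T = len(table)
    v, step = h(key), h2(key)      # h2(key) must be nonzero and coprime with T
    for i in range(T):
        slot = (v + i * step) % T  # i = 0 tries H[v] itself
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table full: at most T records can be inserted")
```

For a power-of-two T, any odd step is coprime with T, so a secondary function such as h2(k) = 2 × (k mod 4) + 1 (my example) is safe for T = 8.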
Chaining
• The hash table is an array of pointers, not an array of data.
• Each pointer in the hash table points to a linked list of records that
hash to that location.
• On insertion, a record Rk with hash value v = h(k) is added to the
linked list for slot H(v).
• On search, the list must be traversed to find the matching k value.
• A hash table of T slots can now be used to index n > T records, but if n
>> T then access may be slow.
• Access costs degrade gracefully: the expected search cost grows with the
average chain length n/T, rather than with n itself.
• Extra space is required for pointers, as the table contains an array of
pointers not required in the other schemes.
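A sketch of chaining, with Python lists standing in for the linked lists of records (names are mine):

```python
# Chaining: the table is an array of chains; colliding records share a chain.
def ch_insert(table, key, h):
    table[h(key)].append(key)      # add the record to its slot's chain

def ch_search(table, key, h):
    chain = table[h(key)]
    for pos, k in enumerate(chain):
        if k == key:
            return pos             # position of the record within its chain
    return None                    # traversed the whole chain without a match
```

Unlike the probing schemes, insertion never fails: the chains simply grow, which is why n > T records can be indexed.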
Chaining
• Aside: If some keys are more commonly searched for than others, it is
efficient to move the most-recently accessed record to the front of the
list. The likelihood is, then, that the searched-for record will be at the
start of its list. A chained hash table with move-to-front on access in
the chains is almost always the most efficient in-memory data
structure, unless it is essential that the data be kept sorted.
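The move-to-front heuristic can be sketched as a small variant of chained search (assuming the chained table above; the name is mine):

```python
# Move-to-front on access: a found record is promoted to the head of its
# chain, so frequently searched keys tend to sit near the front.
def mtf_search(table, key, h):
    chain = table[h(key)]
    for pos, k in enumerate(chain):
        if k == key:
            chain.insert(0, chain.pop(pos))   # promote the record
            return True
    return False
```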
Cost of Collisions
• What is the cost of these collision schemes? That is, how many slots
(or records) need to be visited to find the record being searched for?
• To estimate this cost, we need to know the probability of a collision,
p(collision). In each of the three schemes, the expected number of
accesses is always at least 1 + p(collision).
• For linear probing and double hashing, the costs are much
greater once the table is more than (say) 2/3 full — but as these
methods are hard to analyse we will ignore them!
• The probability of collision in a chained hash table can be worked out
as follows.
• Assume that the hash addresses generated by the hash function are
uniformly-distributed random numbers in the interval 0 to T – 1.
Cost of Collisions
• Suppose that a total of n records are hashed. The probability that
exactly m records hash to a particular slot is given by the binomial
distribution:

p(m) = C(n, m) × (1/T)^m × (1 − 1/T)^(n−m)

• When T is large and λ = n/T is bounded (that is, not too big, as in our
practical application), the above distribution can be simplified using
the Poisson approximation to the binomial distribution:

p(λ, m) = ( λ^m × e^−λ ) ÷ m!

where λ is the mean number of records per slot.
Example:
• Suppose T = 100,000 and n = 80,000 so that λ = 0.80 records per slot.
• Then p(λ, 2) = p(0.8, 2) = ( 0.8² × e^−0.8 ) ÷ ( 2 × 1 ) = 0.14378
• Thus 0.14378 × 100,000 = 14,378 slots get exactly 2 records each. In
each case, one of these records must be an overflow.

• Note that a uniform distribution does not mean that all slots get
roughly the same number of hits. In this case, 45% of the hash table is
empty.
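The worked example can be checked numerically with a direct sketch of the Poisson formula above (Python is my choice of language):

```python
import math

# Poisson approximation: p(lam, m) = lam**m * e**(-lam) / m!
def p(lam, m):
    return lam**m * math.exp(-lam) / math.factorial(m)

# With T = 100,000 and n = 80,000, lam = 0.8:
two = p(0.8, 2)    # fraction of slots holding exactly 2 records, ~0.1438
empty = p(0.8, 0)  # fraction of empty slots, ~0.4493 (the "45% empty" above)
```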
Overflows
• The number of overflow records is: 1 for each bucket with 2 records; 2
for each bucket with 3 records; 3 for each bucket with 4 records ...
That is, the number of overflow records is

Σ over m ≥ 2 of ( m − 1 ) × T × p(λ, m) = n − T × ( 1 − e^−λ )

which for T = 100,000 and n = 80,000 comes to about 24,800.

• That is, 24,800 ÷ 80,000 or 31% of records can't be stored in the slot
given by their hash value. If T = n = 100,000, this fraction rises to
about 37%.
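The overflow figures above can also be checked numerically. Summing (m − 1) over all slot occupancies m ≥ 2 under the Poisson model telescopes to n − T(1 − e^−λ) (this closed form is my addition; the slides quote only the totals):

```python
import math

# Expected number of overflow records: sum over m >= 2 of (m-1) * T * p(lam, m),
# which simplifies to n - T * (1 - e**(-lam)) with lam = n/T.
def overflows(n, T):
    lam = n / T
    return n - T * (1 - math.exp(-lam))

frac = overflows(80_000, 100_000) / 80_000   # ~0.31: about 31% of records overflow
```

With n = T the overflow fraction is 1 − (1 − e^−1) = e^−1 ≈ 0.368, the "about 37%" quoted above.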
