6.5. Hashing
In previous sections we were able to make improvements in our search
algorithms by taking advantage of
information about where items are
stored in the collection with respect to one another. For example, by
knowing that a list was ordered, we could search in logarithmic time
using a binary search. In this section
we will attempt to go one step
further by building a data structure that can be searched in
O(1) time. This
concept is referred to as hashing.
In order to do this, we will need to know even more about where the
items might be when we go to look for
them in the collection. If every
item is where it should be, then the search can use a single comparison
to
discover the presence of an item. We will see, however, that this is
typically not the case.
A hash table is a collection of items stored in such a way as to make them
easy to find later. Each position of the hash table, often called a slot,
can hold an item and is named by an integer value starting at 0; a table of
size m has slots named 0 through m-1.
The mapping between an item and the slot where that item belongs in the
hash table is called the hash function. The hash function will take
any item in the collection and return an integer in the range of slot
names, between 0 and m-1. Assume that we have the set of integer items
54, 26, 93, 17, 77, and 31. Our
first hash function, sometimes referred
to as the “remainder method,” simply takes an item and divides it
by the
table size, returning the remainder as its hash value
(h(item) = item%11). Table 4 gives all of
the
hash values for our example items. Note that this remainder method
(modulo arithmetic) will typically
be present in some form in all hash
functions, since the result must be in the range of slot names.
Table 4: Simple Hash Function Using Remainders

Item    Hash Value
54      10
26      4
93      5
17      6
77      0
31      9
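These hash values are easy to verify; the snippet below simply applies the remainder method to each of the example items, using the table size of 11 from the example:

>>> items = [54, 26, 93, 17, 77, 31]
>>> [(item, item % 11) for item in items]
[(54, 10), (26, 4), (93, 5), (17, 6), (77, 0), (31, 9)]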
Once the hash values have been computed, we can insert each item into
the hash table at the designated
position as shown in
Figure 5. Note that 6 of the 11 slots are now occupied. This
is referred to as the load factor, and is commonly denoted by
λ = numberofitems/tablesize. For this example, λ = 6/11.
Now when we want to search for an item, we simply use the hash function
to compute the slot name for
the item and then check the hash table to
see if it is present. This searching operation is O(1), since
a
constant amount of time is required to compute the hash value and then
index the hash table at that
location. If everything is where it should
be, we have found a constant time search algorithm.
You can probably already see that this technique is going to work only
if each item maps to a unique
location in the hash table. For example,
if the item 44 had been the next item in our collection, it would
have a
hash value of 0 (44%11 == 0). Since 77 also had a hash
value of 0, we would have a problem.
According to the hash function, two
or more items would need to be in the same slot. This is referred to as
a collision (it may also be called a “clash”). Clearly, collisions
create a problem for the hashing technique.
We will discuss them in
detail later.
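A quick check in the interpreter confirms that both items land in slot 0 under the remainder method:

>>> 77 % 11
0
>>> 44 % 11
0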
One way to always have a perfect hash function is to increase the size
of the hash table so that each
possible value in the item range can be
accommodated. This guarantees that each item will have a unique
slot.
Although this is practical for small numbers of items, it is not
feasible when the number of possible
items is large. For example, if the
items were nine-digit Social Security numbers, this method would
require
almost one billion slots. If we only want to store data for a class of
25 students, we will be wasting
an enormous amount of memory.
Our goal is to create a hash function that minimizes the number of
collisions, is easy to compute, and
evenly distributes the items in the hash table. There are a number of
common ways to extend the simple remainder method. We will consider a few
of them here.
One such extension is the mid-square method: we first square the item and
then extract some portion of the resulting digits before applying the
remainder step. Table 5 shows the hash values produced by the simple
remainder method and the mid-square method for our example items.

Table 5: Comparison of Remainder and Mid-Square Methods

Item    Remainder    Mid-Square
54      10           3
26      4            7
93      5            9
17      6            8
77      0            4
31      9            6

We can also create hash functions for character-based items such as
strings. The word "cat" can be thought of as a sequence of ordinal values:
>>> ord('c')
99
>>> ord('a')
97
>>> ord('t')
116
We can then take these three ordinal values, add them up, and use the
remainder method to get a hash
value (see Figure 6).
Listing 1 shows a function called hash that takes a
string and a table size and
returns the hash value in the range from 0
to tablesize-1.
Listing 1

def hash(astring, tablesize):
    sum = 0
    for pos in range(len(astring)):
        sum = sum + ord(astring[pos])
    return sum%tablesize
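Applying this to the word "cat" with a table size of 11 gives (99 + 97 + 116) % 11 = 312 % 11 = 4. Assuming the hash function from Listing 1 has been defined, we can confirm this interactively:

>>> hash('cat', 11)
4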
One method for resolving collisions looks into the hash table and tries
to find another open slot to hold the
item that caused the collision. A
simple way to do this is to start at the original hash value position
and
then move in a sequential manner through the slots until we
encounter the first slot that is empty. Note that
we may need to go back
to the first slot (circularly) to cover the entire hash table. This
collision resolution
process is referred to as open addressing in
that it tries to find the next open slot or address in the hash
table.
By systematically visiting each slot one at a time, we are performing an
open addressing technique
called linear probing.
Once we have built a hash table using open addressing and linear
probing, it is essential that we utilize the
same methods to search for
items. Assume we want to look up the item 93. When we compute the hash
value, we get 5. Looking in slot 5 reveals 93, and we can return
True . What if we are looking for 20? Now
the hash value is 9, and
slot 9 is currently holding 31. We cannot simply return False since
we know that
there could have been collisions. We are now forced to do a
sequential search, starting at position 10,
looking until either we find
the item 20 or we find an empty slot.
The general name for this process of looking for another slot after a
collision is rehashing. With simple
linear probing, the rehash
function is newhashvalue = rehash(oldhashvalue) where
rehash(pos) = (pos + 1)%sizeoftable.
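As a small sketch of how this rehash function wraps circularly around the table (the variable names here are just for illustration), the probe sequence that begins after slot 9 in an 11-slot table visits slots 10, 0, 1, and so on:

>>> size = 11
>>> position = 9
>>> probes = []
>>> for i in range(4):
...     position = (position + 1) % size   # rehash: move to the next slot, wrapping around
...     probes.append(position)
...
>>> probes
[10, 0, 1, 2]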
An alternative method for handling collisions, called chaining, allows each
slot to hold a reference to a collection (or chain) of items. When we want
to search for an item, we use the hash function to generate the slot where
it should reside. Since each slot holds a collection, we use a searching
technique to decide whether the item is present.
The
advantage is that on the average there are likely to be many fewer items
in each slot, so the search is
perhaps more efficient. We will look at
the analysis for hashing at the end of this section.
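As a rough sketch of this chaining idea (the ChainHashTable class and its method names below are ours for illustration, not the implementation developed later in this section), each slot can simply hold a Python list of the items that hash to it:

class ChainHashTable:
    """Minimal chaining sketch: each slot holds a list (chain) of items."""

    def __init__(self, size=11):
        self.size = size
        self.slots = [[] for _ in range(self.size)]

    def hashfunction(self, item):
        # Simple remainder method, as in the examples above.
        return item % self.size

    def put(self, item):
        # Append the item to the chain at its hash slot.
        self.slots[self.hashfunction(item)].append(item)

    def contains(self, item):
        # Only the chain at the item's hash slot needs to be searched.
        return item in self.slots[self.hashfunction(item)]

For example, inserting the items used earlier and then searching for 44 only examines the short chain in slot 0:

>>> c = ChainHashTable()
>>> for item in [54, 26, 93, 17, 77, 31, 44, 55, 20]:
...     c.put(item)
...
>>> c.contains(44)
True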
Self Check
Q-1: In a hash table of size 13, which index positions would the following two keys map to? 27, 130
A. 1, 10
B. 13, 0
C. 1, 0
D. 2, 3
Q-2: Suppose you are given the following set of keys to insert into a hash table that holds
exactly 11 values: 113 , 117 , 97 , 100 , 114 , 108 , 116 , 105 , 99 Which of the following best
demonstrates the contents of the hash table after all the keys have been inserted using linear
probing?
A. 100, __, __, 113, 114, 105, 116, 117, 97, 108, 99
B. 99, 100, __, 113, 114, __, 116, 117, 105, 97, 108
C. 100, 113, 117, 97, 14, 108, 116, 105, 99, __, __
D. 117, 114, 108, 116, 105, 99, __, __, 97, 100, 113
One of the great benefits of a dictionary is the fact that given a key,
we can look up the associated data
value very quickly. In order to
provide this fast lookup capability, we need an implementation that
supports an efficient search. We could use a list with sequential or
binary search, but it would be even better to use
a hash table as
described above since looking up an item in a hash table can approach
O(1)
performance.
Listing 2
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size   # key items
        self.data = [None] * self.size    # associated data values
Listing 3
def put(self,key,data):
    hashvalue = self.hashfunction(key,len(self.slots))

    if self.slots[hashvalue] == None:
        self.slots[hashvalue] = key
        self.data[hashvalue] = data
    else:
        if self.slots[hashvalue] == key:
            self.data[hashvalue] = data   # key already present, replace data
        else:
            nextslot = self.rehash(hashvalue,len(self.slots))
            while self.slots[nextslot] != None and \
                    self.slots[nextslot] != key:
                nextslot = self.rehash(nextslot,len(self.slots))

            if self.slots[nextslot] == None:
                self.slots[nextslot] = key
                self.data[nextslot] = data
            else:
                self.data[nextslot] = data   # key already present, replace data

def hashfunction(self,key,size):
    return key%size

def rehash(self,oldhash,size):
    return (oldhash+1)%size
Listing 4
def get(self,key):
    startslot = self.hashfunction(key,len(self.slots))

    data = None
    stop = False
    found = False
    position = startslot
    while self.slots[position] != None and \
            not found and not stop:
        if self.slots[position] == key:
            found = True
            data = self.data[position]
        else:
            position = self.rehash(position,len(self.slots))
            if position == startslot:
                stop = True
    return data

def __getitem__(self,key):
    return self.get(key)

def __setitem__(self,key,data):
    self.put(key,data)
>>> H=HashTable()
>>> H[54]="cat"
>>> H[26]="dog"
>>> H[93]="lion"
>>> H[17]="tiger"
>>> H[77]="bird"
>>> H[31]="cow"
>>> H[44]="goat"
>>> H[55]="pig"
>>> H[20]="chicken"
>>> H.slots
[77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
>>> H.data
['bird', 'goat', 'pig', 'chicken', 'dog', 'lion', 'tiger', None, None, 'cow', 'cat']
Next we will access and modify some items in the hash table. Note that the
value for the key 20 is being replaced.
>>> H[20]
'chicken'
>>> H[17]
'tiger'
>>> H[20]='duck'
>>> H[20]
'duck'
>>> H.data
['bird', 'goat', 'pig', 'duck', 'dog', 'lion', 'tiger', None, None, 'cow', 'cat']
>>> print(H[99])
None
For a successful search using open addressing with linear probing, the
average number of comparisons is approximately ½(1 + 1/(1−λ)), and an
unsuccessful search gives ½(1 + (1/(1−λ))²). If we are using chaining, the
average number of comparisons is 1 + λ/2 for the successful case, and
simply λ comparisons if the search is unsuccessful.
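As a quick worked check using the load factor from the earlier example, λ = 6/11 ≈ 0.55: a successful search with open addressing and linear probing averages about ½(1 + 11/5) = 1.6 comparisons, an unsuccessful search about ½(1 + (11/5)²) ≈ 2.9 comparisons, while chaining averages about 1 + (6/11)/2 ≈ 1.27 comparisons for a successful search.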