Unit - 2 - Hashing
Hashing:
Hashing is a technique that is used to uniquely identify a specific object
from a group of similar objects. Some examples of how hashing is used in
our lives include:
In both these examples the students and books were hashed to a unique
number.
Assume that you have an object and you want to assign a key to it to make
searching easy. To store the key/value pair, you can use a simple array-like
data structure in which the keys (integers) are used directly as indexes to
store the values.
In hashing, large keys are converted into small keys by using hash
functions. The values are then stored in a data structure called a hash
table. The idea of hashing is to distribute entries (key/value pairs)
uniformly across an array. Each element is assigned a key (converted key).
By using that key you can access the element in O(1) time on average. Using
the key, the algorithm (hash function) computes an index that suggests where
an entry can be found or inserted.
2) Hash Function
1) Division Method
2) Mid square Method
3) Digit folding Method
4) Digit Analysis Method/Binary/Radix Method
Division Method
For example, if the records 52, 68, 99 and 84 are to be placed in a hash
table of size 10, then:
H(52) = 52 % 10 = 2
H(68) = 68 % 10 = 8
H(99) = 99 % 10 = 9
H(84) = 84 % 10 = 4
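#include <stdio.h>

#define MAX 100

int a[MAX];   /* hash table */
int size;     /* table size entered by the user */

/* Division method: the index is the remainder of e divided by size. */
int hashfunction(int e)
{
    int key;
    key = e % size;
    return key;
}

int main(void)
{
    int i, j, n, element;
    printf("enter size of hash table ");
    if (scanf("%d", &size) != 1 || size < 1 || size > MAX)
        return 1;
    printf("enter number of elements ");
    if (scanf("%d", &n) != 1)
        return 1;
    for (i = 0; i < n; i++)
    {
        printf("enter element ");
        if (scanf("%d", &element) != 1 || element < 0)
            return 1;
        j = hashfunction(element);
        a[j] = element;   /* no collision handling: a later key overwrites */
    }
    return 0;
}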
So, 3101*3101=9616201
Hash Table
If the square has a single digit, the value is placed at that index. For
example, 3 * 3 = 9, so the value 3 is stored at location 9.
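The mid-square idea above can be sketched in C as follows. This is only an illustration under stated assumptions: the function name mid_square and the parameter r (how many middle digits to keep) are hypothetical, and when the number of surplus digits is odd the extra digit is dropped from the left, a choice the notes do not specify.

```c
/* Count the decimal digits of n (0 counts as one digit). */
static int digit_count(unsigned long n)
{
    int d = 1;
    while (n >= 10) {
        n /= 10;
        d++;
    }
    return d;
}

/* Mid-square sketch: square the key and keep the middle r digits.
   If the square has r digits or fewer, the square itself is the index.
   Assumes key * key fits in an unsigned long. */
unsigned long mid_square(unsigned long key, int r)
{
    unsigned long sq = key * key;
    unsigned long mod = 1;
    int d = digit_count(sq);
    int drop, i;

    if (d <= r)
        return sq;
    drop = (d - r) / 2;          /* digits discarded on the right */
    for (i = 0; i < drop; i++)
        sq /= 10;
    for (i = 0; i < r; i++)
        mod *= 10;
    return sq % mod;             /* discard the remaining left digits */
}
```

With r = 3, mid_square(3101, 3) squares 3101 to 9616201 and keeps the middle digits 162; mid_square(3, 1) returns 9, matching the single-digit case above.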
Shift Folding
h = P1 + P2 + P3 + P4 + P5
  = 123 + 203 + 241 + 112 + 20
  = 699
Folding at the boundaries (P2 and P4 are reversed before adding)
h = P1 + reverse(P2) + P3 + reverse(P4) + P5
  = 123 + 302 + 241 + 211 + 20
  = 897
Shift folding
H(key) =124+655+12 = 791
791 is the index to store the value 12465512
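A minimal C sketch of the folding calculations above. The function name fold, its width parameter and the reverse_even flag are hypothetical names chosen for illustration; the key is split into parts of width digits starting from the left, as in the worked examples, and the key 12320324111220 is simply the parts P1 to P5 written together.

```c
#include <stdio.h>
#include <string.h>

/* Fold a decimal key into parts of `width` digits (taken from the left)
   and add the parts. If reverse_even is non-zero, every second part is
   reversed before adding (folding at the boundaries). */
unsigned long fold(unsigned long long key, int width, int reverse_even)
{
    char digits[32];
    unsigned long sum = 0;
    int len, start, i;

    if (width < 1)
        width = 1;               /* guard against a zero or negative width */
    snprintf(digits, sizeof digits, "%llu", key);
    len = (int)strlen(digits);

    for (start = 0, i = 1; start < len; start += width, i++) {
        int end = start + width < len ? start + width : len;
        unsigned long part = 0;
        int j;

        if (reverse_even && i % 2 == 0) {
            for (j = end - 1; j >= start; j--)   /* read the part backwards */
                part = part * 10 + (unsigned long)(digits[j] - '0');
        } else {
            for (j = start; j < end; j++)
                part = part * 10 + (unsigned long)(digits[j] - '0');
        }
        sum += part;
    }
    return sum;
}
```

Under these assumptions fold(12465512ULL, 3, 0) gives 124 + 655 + 12 = 791, and fold(12320324111220ULL, 3, 1) gives 897. The sum would normally still be reduced modulo the table size when it can exceed the largest index.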
We then examine the digits of each identifier, deleting those digits that
have the most skewed distributions. We continue deleting digits until
the number of remaining digits is small enough to give an address in
the range of the hash table. The digits used to calculate the hash
address must be the same for all identifiers and must not have
abnormally high peaks or valleys (the standard deviation must be
small).
For Example store values 4,8,3,7 in the hash table considering the
radix 2.
Hash Table
1) The hash function should generate different hash values for similar
strings.
3) The hash function should produce keys that are distributed uniformly
over the array.
5) The hash function is a perfect hash function when it uses all the input
data.
Collision
A collision is a situation in which the hash function returns the same hash
key for more than one record. Resolving a collision may in turn lead to an
overflow condition, and frequent collisions and overflows are the mark of a
poor hash function.
1) Chaining
2) Linear Probing (Open addressing)
3) Quadratic Probing (Open addressing) and
4) Double Hashing (Open addressing).
In the diagram, bucket 1 holds two records (31 and 61), which are kept in a
linked list. This is the chaining method.
H(31) = 31%10 = 1
H(33) = 33%10 = 3
H(77) = 77%10 = 7
H(61) = 61%10 = 1
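The chaining example above can be sketched in C with one linked list per bucket. The names BUCKETS, chain_insert, chain_search and chain_length are hypothetical, keys are assumed to be non-negative, and new records are added at the head of the chain, which is one common choice rather than the only one.

```c
#include <stdlib.h>

#define BUCKETS 10

struct node {
    int key;
    struct node *next;
};

static struct node *table[BUCKETS];   /* every chain starts empty (NULL) */

/* Insert key at the head of the chain for bucket key % BUCKETS.
   Returns 0 on success, -1 if memory could not be allocated. */
int chain_insert(int key)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->next = table[key % BUCKETS];
    table[key % BUCKETS] = n;
    return 0;
}

/* Returns 1 if key is present in its chain, 0 otherwise. */
int chain_search(int key)
{
    struct node *p;
    for (p = table[key % BUCKETS]; p != NULL; p = p->next)
        if (p->key == key)
            return 1;
    return 0;
}

/* Number of records chained in one bucket. */
int chain_length(int bucket)
{
    int count = 0;
    struct node *p;
    for (p = table[bucket]; p != NULL; p = p->next)
        count++;
    return count;
}
```

After inserting 31, 33, 77 and 61, chain_length(1) is 2 because 31 and 61 share bucket 1, while buckets 3 and 7 hold one record each.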
2) Linear probing (Open addressing)
This is a simple way to handle a collision. When a collision occurs, the
second record is placed in the next empty slot found by scanning linearly
down the table. The drawback of this method is clustering: blocks of
occupied slots build up in parts of the hash table.
Insert 56 and 64:
56 % 10 = 6
64 % 10 = 4

Index  Value
0      NULL
1      NULL
2      NULL
3      NULL
4      64
5      NULL
6      56
7      NULL
8      NULL
9      NULL

Insert 36:
36 % 10 = 6
Index 6 is already filled with 56, so it is not empty and a collision has
occurred. To resolve it, check the next location, i.e. 6 + 1 = 7. Index 7 is
NULL, so insert 36 at index 7.

Index  Value
0      NULL
1      NULL
2      NULL
3      NULL
4      64
5      NULL
6      56
7      36
8      NULL
9      NULL

Insert 71:
71 % 10 = 1
Index 1 is NULL, so 71 can be inserted at index 1.

Index  Value
0      NULL
1      71
2      NULL
3      NULL
4      64
5      NULL
6      56
7      36
8      NULL
9      NULL
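#include <stdio.h>

#define MAX 100

int a[MAX];   /* hash table; 0 marks an empty slot, so keys must be positive */
int size;     /* table size entered by the user */

/* Linear probing: start at e % size and step forward one slot at a time,
   wrapping around, until an empty slot is found.
   Returns the free index, or -1 if the table is full. */
int hashfunction(int e)
{
    int i, key;
    for (i = 0; i < size; i++)
    {
        key = (e % size + i) % size;
        if (a[key] == 0)
            return key;
    }
    printf("hash table is FULL\n");
    return -1;
}

int main(void)
{
    int i, j, n, element;
    printf("enter size of hash table ");
    if (scanf("%d", &size) != 1 || size < 1 || size > MAX)
        return 1;
    printf("enter number of elements ");
    if (scanf("%d", &n) != 1)
        return 1;
    for (i = 0; i < n; i++)
    {
        printf("enter element ");
        if (scanf("%d", &element) != 1 || element < 1)
            return 1;
        j = hashfunction(element);
        if (j != -1)
            a[j] = element;
    }
    for (i = 0; i < size; i++)
        printf("%d\t%d\n", i, a[i]);
    return 0;
}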
H(key)=(H(key)+x*x)%table_size
Insert 67, 90, 55, 17 and 49 into a table of size 10:
67 % 10 = 7
90 % 10 = 0
55 % 10 = 5
17 % 10 = 7
49 % 10 = 9

Index  Value
0      90
1
2
3
4
5      55
6
7      67
8
9

Here 67, 90 and 55 can be inserted directly, but 17 collides with 67 at
index 7, so the quadratic probe is applied.
To insert 17:
The initial index generated is 17 % 10 = 7.
With x = 0: (7 + 0*0) % 10 = 7, which is the occupied index 7 itself.
Incrementing x to 1: (7 + 1*1) % 10 = 8.
Index 8 is empty, so 17 is placed there. 49 then goes directly to index 9.

Index  Value
0      90
1
2
3
4
5      55
6
7      67
8      17
9      49
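The quadratic probing steps above can be sketched in C as below. The names quad_init and quad_insert and the EMPTY marker are hypothetical, keys are assumed to be non-negative, and the probe simply gives up after TABLE_SIZE attempts, since a quadratic sequence is not guaranteed to visit every slot.

```c
#define TABLE_SIZE 10
#define EMPTY (-1)

/* Mark every slot as empty. */
void quad_init(int table[])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key by probing (key % TABLE_SIZE + x*x) % TABLE_SIZE
   for x = 0, 1, 2, ...
   Returns the index used, or -1 if no free slot was found. */
int quad_insert(int table[], int key)
{
    int home = key % TABLE_SIZE;
    int x;

    for (x = 0; x < TABLE_SIZE; x++) {
        int index = (home + x * x) % TABLE_SIZE;
        if (table[index] == EMPTY) {
            table[index] = key;
            return index;
        }
    }
    return -1;
}
```

Inserting 67, 90, 55, 17 and 49 in that order places them at indexes 7, 0, 5, 8 and 9, the same layout as the table above.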
H2(key) = P - (key % P)
where P is a prime number smaller than the size of the hash table.

67 % 10 = 7
90 % 10 = 0
55 % 10 = 5
17 % 10 = 7
H2(17) = 7 - (17 % 7)
       = 7 - 3
       = 4
49 % 10 = 9

Index  Value
0      90
1      17
2
3
4
5      55
6
7      67
8
9      49

Here 67, 90 and 55 can be inserted using the first hash function, but for
17 the bucket at index 7 is already occupied, so we use the second hash
function H2(key) = P - (key % P). P is a prime number smaller than the
size of the hash table, so here P = 7. H2(17) = 4 means that we take 4
jumps from index 7 to place 17: (7 + 4) % 10 = 1. Therefore 17 is placed
at index 1.
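#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE 10
#define PRIME 7          /* prime smaller than TABLE_SIZE, used by the second hash */

int h[TABLE_SIZE];       /* 0 marks an empty slot, so keys must be positive */

void insert(void)
{
    int key, index, i, hkey, hash2;
    printf("\nenter a value to insert into hash table\n");
    if (scanf("%d", &key) != 1 || key < 1)
        return;
    hkey = key % TABLE_SIZE;
    hash2 = PRIME - (key % PRIME);
    for (i = 0; i < TABLE_SIZE; i++)
    {
        /* i-th probe: jump i steps of length hash2 from the home index */
        index = (hkey + i * hash2) % TABLE_SIZE;
        if (h[index] == 0)
        {
            h[index] = key;
            break;
        }
    }
    if (i == TABLE_SIZE)
        printf("\nelement cannot be inserted\n");
}

void search(void)
{
    int key, index, i, hkey, hash2;
    printf("\nenter search element\n");
    if (scanf("%d", &key) != 1 || key < 1)
        return;
    hkey = key % TABLE_SIZE;
    hash2 = PRIME - (key % PRIME);
    for (i = 0; i < TABLE_SIZE; i++)
    {
        /* follow the same probe sequence that insert() used */
        index = (hkey + i * hash2) % TABLE_SIZE;
        if (h[index] == key)
        {
            printf("value is found at index %d\n", index);
            break;
        }
    }
    if (i == TABLE_SIZE)
        printf("\nvalue is not found\n");
}

void display(void)
{
    int i;
    printf("\nelements in the hash table are\n");
    for (i = 0; i < TABLE_SIZE; i++)
        printf("at index %d\tvalue = %d\n", i, h[i]);
}

int main(void)
{
    int opt;
    printf("DOUBLE HASHING\n");
    while (1)
    {
        printf("\nPress 1.Insert 2.Display 3.Search 4.Exit \n");
        if (scanf("%d", &opt) != 1)
            break;
        switch (opt)
        {
        case 1: insert();
            break;
        case 2: display();
            break;
        case 3: search();
            break;
        case 4: exit(0);
        }
    }
    return 0;
}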
Loading density
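The loading density or loading factor of a hash table is
α = n / (sb)
where n is the number of identifiers in the table, s is the number of slots
per bucket and b is the number of buckets.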
The hash function must map each of the possible identifiers onto one
of the numbers 0-25.
Using this scheme, the library functions acos, define, float, exp, char,
atan, ceil, floor, clock, and ctime hash into buckets 0, 3, 5, 4, 2, 0, 2,
5, 2, and 2, respectively.
Table) Hash table with 26 buckets and two slots per bucket
The identifier clock hashes into the bucket ht[2]. Since this bucket is
full, we have an overflow.
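The bucket numbers listed above are consistent with hashing on the first letter of the identifier ('a' to 0, 'b' to 1, and so on), so the sketch below assumes that scheme. Both function names are hypothetical, and loading_density just evaluates α = n / (sb).

```c
#include <ctype.h>

/* First-letter hash (assumed from the bucket numbers in the example):
   'a' -> 0, 'b' -> 1, ..., 'z' -> 25. The identifier must start with a letter. */
int first_letter_hash(const char *id)
{
    return tolower((unsigned char)id[0]) - 'a';
}

/* Loading density alpha = n / (s * b):
   n identifiers stored, s slots per bucket, b buckets. */
double loading_density(int n, int s, int b)
{
    return (double)n / ((double)s * (double)b);
}
```

For the ten library functions listed, with s = 2 and b = 26, loading_density(10, 2, 26) is 10/52, roughly 0.19. first_letter_hash returns 2 for char, ceil, clock and ctime, which is why the two slots of bucket 2 fill up and clock overflows.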
Key (binary)   Last two bits
2 (010)        10
4 (100)        00
5 (101)        01
3 (011)        11

Index  Value
00     4
01     5
10     2
11     3

Now if we want to insert any more values in the hash table, they will
overflow. To avoid this we can use dynamic hashing: we add some more
memory, readjust the values already stored in the hash table, and then add
the new values.

Keys (binary): 2 (010), 4 (100), 5 (101), 3 (011), 7 (111), 6 (110)

Index  Value
000
001
010    2
011    3
100    4
101    5
110    6
111    7
We consider two forms of Dynamic hashing- one uses a directory and the
other does not.
Dynamic Hashing
o First, calculate the hash address of the key.
o Check how many bits are used in the directory; call this number i.
o Take the least significant i bits of the hash address. This gives an
index into the directory.
o Using this index, go to the directory and find the address of the bucket
where the record might be.
o To insert a record, first follow the same procedure as for retrieval,
ending up in some bucket.
o If there is still space in that bucket, place the record in it.
o If the bucket is full, split the bucket and redistribute the records.
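The retrieval and insertion steps above can be sketched in C as follows. This is a simplified illustration, not the notes' own implementation: the names dh_find, dh_insert, SLOTS and MAX_DEPTH are hypothetical, each bucket is assumed to hold two records, the table stores hash addresses directly, and the directory starts with a single bucket and is capped at 2^8 entries.

```c
#include <stdlib.h>

#define SLOTS 2        /* records per bucket (assumed) */
#define MAX_DEPTH 8    /* directory never grows past 2^8 entries (assumed) */

struct bucket {
    int depth;               /* number of low bits shared by all records here */
    int count;               /* records currently stored */
    unsigned addr[SLOTS];    /* hash addresses of the stored records */
};

static struct bucket *dir[1 << MAX_DEPTH];   /* the directory */
static int dir_bits = 0;                     /* i: bits used to index the directory */

/* Least significant `bits` bits of hash address h. */
static unsigned low_bits(unsigned h, int bits)
{
    return h & ((1u << bits) - 1u);
}

/* Returns 1 if hash address h is stored, 0 otherwise. */
int dh_find(unsigned h)
{
    struct bucket *b = dir[low_bits(h, dir_bits)];
    int k;
    if (b == NULL)
        return 0;
    for (k = 0; k < b->count; k++)
        if (b->addr[k] == h)
            return 1;
    return 0;
}

/* Insert hash address h. Returns 0 on success, -1 on failure. */
int dh_insert(unsigned h)
{
    if (dir[0] == NULL && (dir[0] = calloc(1, sizeof *dir[0])) == NULL)
        return -1;
    for (;;) {
        struct bucket *b = dir[low_bits(h, dir_bits)];
        struct bucket *nb;
        unsigned bit;
        int n, k, kept;

        if (b->count < SLOTS) {              /* room: place the record */
            b->addr[b->count++] = h;
            return 0;
        }
        if (b->depth == dir_bits) {          /* directory too small: double it */
            if (dir_bits == MAX_DEPTH)
                return -1;
            n = 1 << dir_bits;
            for (k = 0; k < n; k++)
                dir[n + k] = dir[k];         /* new half duplicates the pointers */
            dir_bits++;
        }
        if ((nb = calloc(1, sizeof *nb)) == NULL)
            return -1;
        bit = 1u << b->depth;                /* the bit that now separates records */
        nb->depth = ++b->depth;
        n = 1 << dir_bits;
        for (k = 0; k < n; k++)              /* repoint half of b's entries */
            if (dir[k] == b && ((unsigned)k & bit))
                dir[k] = nb;
        for (k = 0, kept = 0; k < b->count; k++) {   /* redistribute the records */
            if (b->addr[k] & bit)
                nb->addr[nb->count++] = b->addr[k];
            else
                b->addr[kept++] = b->addr[k];
        }
        b->count = kept;                     /* then retry the insertion */
    }
}
```

As a usage example, inserting the illustrative hash addresses 01001 (9) and 10101 (21) and then 10001 (17) forces splits until three bits are in use: the two addresses ending in 001 share one bucket and the address ending in 101 moves to another.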
Example 1: Consider the following grouping of keys and insert them
into buckets, depending on the last bits of their hash addresses:
The last two bits of the hash addresses of 2 and 4 are 00, so they go into
bucket B0. The last two bits for 5 and 6 are 01, so they go into bucket B1.
The last two bits for 1 and 3 are 10, so they go into bucket B2. The last
two bits for 7 are 11, so it goes into B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, it must go into bucket B1.
But bucket B1 is full, so it is split.
o The split separates 5 and 9 from 6: the last three bits for keys 5 and 9
are 001, so they go into bucket B1, and the last three bits for key 6
are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. B0 is pointed to by the 000 and 100
entries, because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. B2 is pointed to by the 010 and 110
entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. B3 is pointed to by the 111 and 011 entries,
because the last two bits of both entries are 11.
Advantages of dynamic hashing
o In this method, the performance does not decrease as the data grows
in the system. It simply increases the size of memory to accommodate
the data.
o This method is good for the dynamic database where data grows and
shrinks frequently.
Disadvantages of dynamic hashing
o In this method, if the data size increases, the bucket size also
increases. The addresses of the data are maintained in the bucket
address table, because the data addresses keep changing as buckets
grow and shrink. If there is a huge increase in data, maintaining the
bucket address table becomes tedious.
o Bucket overflow can still occur in this case, but it may take longer to
reach this situation than in static hashing.
k h(k)
A0 100 000
A1 100 001
B0 101 000
B1 101 001
C2 110 010
C3 110 011
C5 110 101
C1 110 001
C4 110 100
The size of directory d is 2^r, where r is the number of bits used to
identify all h(k).
Depth 2
In the directory, entry 01 points to a bucket in which both slots are
already filled, with A1 and B1. The key C5 overflows from this bucket. Now
let us consider the least significant 3 bits of the hash address.
Find the least u such that h(C5, u) differs from h(k, u) for some key k in
the h(C5, 2) (01) bucket.
In this case, u = 3.
Since u > r, expand the size of d to 2^u and duplicate the pointers to the
new half (why?).
Depth 3
Let r = u = 3.
Find the least u such that h(C1, u) differs from h(k, u) for some key k in
the h(C1, 3) (001) bucket.
In this case, u = 4.
Since u > r, expand the size of d to 2^u and duplicate the pointers to the
new half.
Depth 4
When C4 (110100) is to enter:
Find the least u such that h(C4, u) differs from h(k, u) for some key k in
the h(C4, 4) (0100) bucket.
In this case, u = 3.
Since u = 3 < r = 4, d is not required to expand its size.
Advantages
Only the directory is doubled, rather than the whole hash table as in
static hashing.
Only the entries in the bucket that overflows are rehashed.
Directory Less Dynamic Hashing
Directory less Dynamic hashing is also known as linear dynamic hashing
k h(k)
A0 100 000
A1 100 001
B0 101 000
B1 101 001
B4 101 100
B5 101 101
C1 110 001
C2 110 010
C3 110 011
C5 110 101
Figure a) shows a directory less hash table ht with r = 2, the number of
bits of h(k) used to index into the hash table, and q = 0. There are 4
active buckets, indexed 00, 01, 10 and 11. Each active bucket has 2 slots.
The number of buckets d is 2^r.
Figure c) r=2 , q=2, h(k,3) has been used for chains 000 , 100 and 101.
Insert C1, q=2 means 2 buckets added
Initially, there are four pages/buckets, each addressed by two bits (Figure
(a)). Two of the pages/buckets are full, and two have one identifier each.
In the next step, we insert the identifier C1. The last 3 bits of C1 are
001. Since it hashes to the same page as C5, we use another overflow node
to store it.
A1 - 100 001
B5 - 101 101
C5 - 110 101
C1 - 110 001
We add another new page to the end of the file and rehash the identifiers in
the second page. The address of the new bucket added to the hash table is
101. The last 3 bits of A1 and C1 are 001. The last 3 bits of B5 and C5 are
101. So B5 and C5 are stored in the new bucket, while A1 and C1 stay in
bucket 001, so that there is no overflow.
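The notes do not spell out how a key's bucket is computed once some buckets have been split. A common rule for directory less (linear) dynamic hashing is: use the last r bits, and if that address is below q (a bucket that has already been split), use the last r + 1 bits instead. The sketch below assumes that rule, and the function name lh_address is hypothetical.

```c
/* Bucket address in a directory less dynamic hash table.
   h: hash address h(k); r: bits normally used; q: number of buckets
   (0 .. q-1) that have already been split.
   Assumed rule: buckets below q are addressed with r + 1 bits. */
unsigned lh_address(unsigned h, int r, unsigned q)
{
    unsigned a = h & ((1u << r) - 1u);          /* h(k, r) */
    if (a < q)
        a = h & ((1u << (r + 1)) - 1u);         /* h(k, r + 1) */
    return a;
}
```

With r = 2 and q = 2, C5 (110 101) and B5 (101 101) map to bucket 101, A1 (100 001) and C1 (110 001) map to bucket 001, and C2 (110 010) still uses two bits and maps to bucket 10.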
• Message Digest
• Password Verification
• Data Structures(Programming Languages)
• Compiler Operation
• Rabin-Karp Algorithm
• Linking File name and path together
9) Show the hash function h(k)=k%17 does not satisfy the one way
property, weak collision resistance and strong collision
resistance.
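Using the hash function we can find the index of k and store k in the hash
table at that position.
H(153) = 153 % 17 = 0
One-way property: given a hash value, it should be infeasible to find a key
that produces it. Here, given the value 0, keys such as 0, 17 or 153 (any
multiple of 17) are found immediately, so the hash function does not satisfy
the one-way property.
Collision resistance: it should be infeasible to find two keys with the same
hash value. Weak collision resistance fails because, for any given k, the
key k + 17 has the same hash value (for example H(153) = H(170) = 0).
Strong collision resistance fails because a colliding pair such as 0 and 17
is found at once.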
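A short C sketch makes the point concrete for h(k) = k % 17, assuming non-negative keys. The helper names h17, preimage and second_preimage are hypothetical. Each helper produces its answer with one arithmetic step, which is exactly what a one-way, collision-resistant function must make infeasible.

```c
/* The hash function from the question. */
int h17(int k)
{
    return k % 17;
}

/* A key that hashes to v (0 <= v < 17): v itself works,
   so the function is not one-way. */
int preimage(int v)
{
    return v;
}

/* A different key with the same hash as k: adding 17 never changes
   k % 17, so weak collision resistance fails. */
int second_preimage(int k)
{
    return k + 17;
}
```

For example, h17(153) is 0, preimage(0) returns 0 which also hashes to 0, and second_preimage(153) returns 170 with h17(170) equal to 0 as well. Any pair such as 0 and 17 is a collision, so strong collision resistance fails too.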
10) Write and explain procedure to insert a dictionary pair into a dynamic hash
table that uses a directory.
Page no’s 20,21,22,23,24
11) Define key density and loading density. Explain with example.
Key density
Loading density
α = n / (sb)
The hash function must map each of the possible identifiers onto one
of the numbers 0-25.
Using this scheme, the library functions acos, define, float, exp, char,
atan, ceil, floor, clock, and ctime hash into buckets 0, 3, 5, 4, 2, 0, 2,
5, 2, and 2, respectively.
The identifier clock hashes into the bucket ht[2]. Since this bucket is
full, we have an overflow.
12) With suitable example explain about linear probing, quadratic probing and
rehashing.
Rehashing:
Why Rehashing?
Rehashing is done because whenever key value pairs are inserted into
the map, the load factor increases, which implies that the time
complexity also increases as explained above. This might not give the
required time complexity of O(1).
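Rehashing can be sketched in C as below. The assumptions are labelled in the code: collisions are resolved by linear probing, keys are non-negative, the table doubles in size, and the 0.75 load factor threshold is only an illustrative choice. The names table_init, table_insert and struct table are hypothetical.

```c
#include <stdlib.h>

#define EMPTY (-1)

struct table {
    int *slots;
    int size;     /* number of slots */
    int count;    /* number of keys stored */
};

/* Allocate `size` empty slots. Returns 0 on success, -1 on failure. */
int table_init(struct table *t, int size)
{
    int i;
    t->slots = malloc((size_t)size * sizeof *t->slots);
    if (t->slots == NULL)
        return -1;
    for (i = 0; i < size; i++)
        t->slots[i] = EMPTY;
    t->size = size;
    t->count = 0;
    return 0;
}

/* Linear probing insert; the caller guarantees a free slot exists. */
static void place(struct table *t, int key)
{
    int i = key % t->size;
    while (t->slots[i] != EMPTY)
        i = (i + 1) % t->size;
    t->slots[i] = key;
    t->count++;
}

/* Build a table twice as large and re-insert every stored key. */
static int rehash(struct table *t)
{
    struct table bigger;
    int i;
    if (table_init(&bigger, 2 * t->size) != 0)
        return -1;
    for (i = 0; i < t->size; i++)
        if (t->slots[i] != EMPTY)
            place(&bigger, t->slots[i]);
    free(t->slots);
    *t = bigger;
    return 0;
}

/* Insert key, rehashing first if the load factor would exceed 0.75
   (assumed threshold). Returns 0 on success, -1 on failure. */
int table_insert(struct table *t, int key)
{
    if (4 * (t->count + 1) > 3 * t->size && rehash(t) != 0)
        return -1;
    place(t, key);
    return 0;
}
```

Starting from 4 slots, the keys 1, 2 and 3 fit without rehashing; inserting a fourth key would push the load factor to 1.0, so the table first grows to 8 slots and the existing keys are re-inserted using the new size.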
13) Write and explain procedure to insert a dictionary pair into a dynamic hash
table that uses a directory.
What is Collision?
Since a hash function gets us a small number for a key which is a big
integer or string, there is a possibility that two keys result in the same
value. The situation where a newly inserted key maps to an already
occupied slot in the hash table is called collision and must be handled
using some collision handling technique.
15) Write and explain procedure to delete a dictionary pair from a dynamic hash
table that uses a directory.
16) Let α=n/b be the loading density of a uniform hashing function h. Then derive
expressions for the expected number of key comparisons Un and the average
number of key comparisons Sn for linear open addressing and for chaining.
for linear open addressing
Un ≈ (1/2) [1 + 1/(1 − α)^2]      Sn ≈ (1/2) [1 + 1/(1 − α)]
for random probing, quadratic probing and rehashing
Un ≈ 1/(1 − α)                    Sn ≈ −(1/α) log_e (1 − α)
for chaining
Un ≈ α                            Sn ≈ 1 + α/2
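To get a feel for these approximations, the standard formulas for linear open addressing, Un ≈ (1/2)[1 + 1/(1 − α)^2] and Sn ≈ (1/2)[1 + 1/(1 − α)], and for chaining, Un ≈ α and Sn ≈ 1 + α/2, can be evaluated directly. The function names below are hypothetical and α is assumed to satisfy 0 <= α < 1 for the open addressing case.

```c
/* Expected comparisons for an unsuccessful search, linear open addressing. */
double linear_un(double alpha)
{
    return 0.5 * (1.0 + 1.0 / ((1.0 - alpha) * (1.0 - alpha)));
}

/* Average comparisons for a successful search, linear open addressing. */
double linear_sn(double alpha)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
}

/* Expected comparisons for an unsuccessful search, chaining. */
double chain_un(double alpha)
{
    return alpha;
}

/* Average comparisons for a successful search, chaining. */
double chain_sn(double alpha)
{
    return 1.0 + alpha / 2.0;
}
```

At α = 0.5, linear open addressing gives Un = 2.5 and Sn = 1.5, while chaining gives Un = 0.5 and Sn = 1.25. As α approaches 1 the open addressing values grow without bound, whereas the chaining values grow only linearly.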
Linear probing
in which the interval between probes is fixed — often set to 1.
Quadratic probing
in which the interval between probes increases quadratically (hence,
the indices are described by a quadratic function).
Double hashing
in which the interval between probes is fixed for each record but is
computed by another hash function.
The main tradeoffs between these methods are that linear probing has
the best cache performance but is most sensitive to clustering, while
double hashing has poor cache performance but exhibits virtually no
clustering; quadratic probing falls in-between in both areas. Double
hashing can also require more computation than other forms of
probing.
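The three probe sequences differ only in how the i-th probe index is computed, which a small C sketch can show side by side. The function names are hypothetical: home is the index from the first hash function, m is the table size, and step is the value of the second hash function for the key.

```c
/* i-th probe (i = 0, 1, 2, ...) with a fixed interval of 1. */
int linear_probe(int home, int i, int m)
{
    return (home + i) % m;
}

/* i-th probe with a quadratically growing interval. */
int quadratic_probe(int home, int i, int m)
{
    return (home + i * i) % m;
}

/* i-th probe with a per-key interval `step` from a second hash function. */
int double_probe(int home, int step, int i, int m)
{
    return (home + i * step) % m;
}
```

For key 17 in a table of size 10 the home index is 7. The first retry lands on index 8 with linear or quadratic probing, and on index 1 with double hashing when the step is 7 - (17 % 7) = 4.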
Shift folding
19) Discuss the techniques for inserting a dictionary pair into a dynamic hash
table that uses a directory.
Page 32
21) Explain the digit analysis Hashing technique with suitable examples.
Page 8
22) Discuss the techniques for deleting a dictionary pair from a dynamic hash
table that uses a directory.
Page no’s 20,21,22,23,24
23) Write an algorithm to insert a dictionary pair into a directory less dynamic
hash table.
Page 32
24) What are hash functions? List some techniques that are used to implement Hash
functions.
a) Division Method
b) Mid square Method
c) Digit folding Method
d) Digit Analysis Method/Binary/Radix method
Why rehashing?
Rehashing is done because whenever key value pairs are inserted into
the map, the load factor increases, which implies that the time
complexity also increases as explained above. This might not give the
required time complexity of O(1).
0
1
2 12
3 13
4 2
5 3
6 23
7 5
8 18
9 15
This effort will be less than it took to insert them into the original
table.
a) Division Method
b) Mid square Method
c) Digit folding Method
d) Digit Analysis Method/Binary/Radix method
30) Explain about the analysis of closed hashing for successful search and deletion.
Load factor
α = n/m = number of elements stored in the table / total number of slots in the table
a) Chaining
b) Linear Probing
c) Quadratic Probing and
d) Double Hashing.
32) What are primary and secondary clustering problems? Suggest some open
addressing methods to avoid them.
Open Addressing
Insert can insert an item in a deleted slot, but the search doesn’t
stop at a deleted slot.
Linear probing has the best cache performance but suffers from
clustering. Another advantage of linear probing is that it is easy to
compute.
In open addressing, all the keys are stored inside the hash table.
No key is stored outside the hash table. The Techniques used for
open addressing are-
a) Linear Probing
b) Quadratic Probing
c) Double Hashing
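The remark about deleted slots can be sketched in C with linear probing. The marker values EMPTY and DELETED, the table size M and the function names are hypothetical, and keys are assumed to be non-negative. A deleted slot is marked rather than emptied, so that a search keeps probing past it while an insert may reuse it.

```c
#define M 10
#define EMPTY   (-1)   /* slot never used */
#define DELETED (-2)   /* slot used earlier, key since removed */

/* Returns the index of key, or -1 if it is not in the table. */
int oa_search(const int t[], int key)
{
    int i;
    for (i = 0; i < M; i++) {
        int idx = (key % M + i) % M;
        if (t[idx] == EMPTY)
            return -1;            /* a never-used slot ends the search */
        if (t[idx] == key)
            return idx;           /* DELETED slots are simply passed over */
    }
    return -1;
}

/* Inserts key in the first EMPTY or DELETED slot on its probe path.
   Returns the index used, or -1 if the table is full. */
int oa_insert(int t[], int key)
{
    int i;
    for (i = 0; i < M; i++) {
        int idx = (key % M + i) % M;
        if (t[idx] == EMPTY || t[idx] == DELETED) {
            t[idx] = key;
            return idx;
        }
    }
    return -1;
}

/* Marks the slot holding key as DELETED. Returns its index or -1. */
int oa_delete(int t[], int key)
{
    int idx = oa_search(t, key);
    if (idx >= 0)
        t[idx] = DELETED;
    return idx;
}
```

If 56, 36 and 46 are inserted they occupy indexes 6, 7 and 8. After 36 is deleted, a search for 46 still succeeds because it does not stop at the deleted slot 7, and a later insert of 66 reuses that slot.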