14-HashTable
2 21/03/2025
Map
Map is an abstract data type designed to efficiently store and retrieve <key,
value> pairs. Each pair is an entry. The keys of all entries need to be unique, so
that a mapping can be defined between a key and its corresponding value.
Also referred to as an associative array, because a key can serve as an index, but
unlike standard arrays, keys need not be numeric.
Operations:
insert(key, value)
get(key)
remove(key)
Applications
A university’s information system relies on some form of a student ID as a key that is
mapped to that student’s associated record (such as the student’s name, address, and course
grades) serving as the value.
The domain-name system (DNS) maps a host name, such as www.wiley.com, to an
Internet-Protocol (IP) address, such as 208.215.179.146.
A social media site typically relies on a (nonnumeric) username as a key that can be
efficiently mapped to a particular user’s associated information.
A company’s customer base may be stored as a map, with a customer’s account number or
unique user ID as a key, and a record with the customer’s information as a value. The map
would allow a service representative to quickly access a customer’s record, given the key.
A computer graphics system may map a color name, such as 'turquoise', to the triple of
numbers that describes the color’s RGB (red-green-blue) representation, such as (64, 224,
208).
Implementation?
How can we implement this Map ADT as efficiently as possible?
Which data structures have we studied so far?
Examples:
1. Let's say we have a text file and we need to count the frequency of each letter.
Sorted array of letters?
2. What if we need to store a dictionary of words for spell checking?
Sorted array?
Balanced search trees?
3. What if we need to look up a username and password for authentication?
Time complexity?
Does the size of the data matter?
Hash Table
A data structure which provides insertion, deletion and search in O(1) time in
the average case.
Idea: Use the key as an address to find its associated value. There are two
components:
Look-up table/array: an array to hold data; each position is referred to as a slot or bucket.
Hash function: a function that maps a key to an integer value representing a slot of the array (index).
When the key itself is used directly as the index, the structure is known as a direct address table.
[Figure: keys are passed through the Hash-Function to slots of the Look-Up Array]
Hash Table
Letter Frequency- Hash Table
Unique mapping of key to slot
Simply call the hash function to access the required slot
Slot = (ASCII of key) mod N
[Figure: keys 'a', 'i', 'h' are passed through the Hash-Function into a Look-Up Array with slots 0 to 25, each slot holding a letter count]
Operations are independent of the size of the data.
If the mapping of key to slot is unique:
Search, delete, insert will take O(1) time in the worst case
The array need not be sorted
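The letter-frequency table above can be sketched in Java as follows. This is a minimal illustration, not code from the slides: the class name LetterFrequency, the table size N = 26 and the restriction to lowercase letters a-z are assumptions.

```java
class LetterFrequency {
    static final int N = 26; // assumed table size: one slot per lowercase letter

    // Hash function from the slide: slot = (ASCII of key) mod N.
    static int slot(char key) {
        return key % N;
    }

    // Counts lowercase letters a-z; every other character is ignored.
    static int[] count(String text) {
        int[] table = new int[N];
        for (char c : text.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                table[slot(c)]++; // one array access, independent of text length
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = count("hash table");
        System.out.println("'a' is in slot " + slot('a') + " with count " + table[slot('a')]);
    }
}
```

Because the 26 codes 97 to 122 are consecutive, taking them mod 26 gives 26 different slots, so no two letters share a slot.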
Hash Table
Phone Directory- Hash Table
How many possible numbers?
One-to-one mapping is not feasible
Slot = sum(ASCII of each character) mod N
[Figure: keys 'zaki', 'noor', 'saad' are passed through the Hash-Function into a Look-Up Array whose slots hold (name, number) entries such as (saad, 1290), (zaki, 4567), (noor, 4567)]
Size of table < range of data
Any potential issue with a smaller hash table? Collision
Hash Table
Two Challenges:
1. Hash function
How to map keys to slots?
Should minimize collision, which occurs when two different keys are mapped to same index/slot.
Hash Function
Hash function calculates two things:
1. Hash-Code/Value: Conversion of the key into an
integer value, as the key may not be an integer
Independent of array size.
No compulsion to be in range of [0, N-1]
In simple hash functions, the 1st step may not be present
• Key is already an integer
2. Compression: Mapping of the calculated hash code to the
range [0, N-1], where N is the length of the look-up
array.
Dependent on array size
A very common way is taking the modulus of the hash value with N
N can be taken as a prime number
This reduces the chance of different values giving the same remainder and gives a more uniform
distribution of values.
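The two steps can be kept as two separate functions. The sketch below is only an illustration: the class name TwoStepHash and the choice of a character-sum hash code (borrowed from the phone directory example) are assumptions.

```java
class TwoStepHash {
    // Step 1 (hash code): turn the key into an integer, independent of the table size.
    static int hashCodeOf(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i); // sum of character codes
        }
        return sum;
    }

    // Step 2 (compression): map the hash code into [0, n-1].
    static int compress(int hashCode, int n) {
        return Math.floorMod(hashCode, n); // floorMod stays non-negative even for negative codes
    }

    public static void main(String[] args) {
        int code = hashCodeOf("zaki");
        System.out.println("hash code " + code + ", slot " + compress(code, 11));
    }
}
```

Only the second step has to change when the array is resized.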
Java’s Built-in Hash Function
Java has a built-in hashCode() method defined in the Object class, which returns a
32-bit integer as the hash value of any object.
The default implementation is identity-based (often described as derived from the object's memory
address); distinct objects usually get distinct values, but uniqueness is not guaranteed.
A hash function must give the same hash value for two objects if they have the same key.
Two objects are considered equal if they have the same keys.
If object1.equals(object2) == true
Then object1.hashCode() == object2.hashCode()
The default implementation of equals(Object o) compares object identity (references).
So you must override hashCode() whenever you override equals().
You should do this to provide your own meaning of equality and hashing.
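A minimal sketch of overriding both methods together is shown below. The Student class and its fields are hypothetical; the assumption is that the student ID alone acts as the key.

```java
import java.util.Objects;

class Student {
    private final String id;   // hypothetical key field
    private final String name; // part of the record, not part of the key

    Student(String id, String name) {
        this.id = id;
        this.name = name;
    }

    // Two students are equal when their keys (IDs) are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Student)) return false;
        return id.equals(((Student) o).id);
    }

    // Must use the same field(s) as equals, so equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(id);
    }

    public static void main(String[] args) {
        Student a = new Student("s-001", "Zaki");
        Student b = new Student("s-001", "Z. Ahmed");
        System.out.println(a.equals(b) + " " + (a.hashCode() == b.hashCode()));
    }
}
```

If only equals() were overridden, two equal students would usually land in different buckets of a HashMap or HashSet, and lookups by an equal key would fail.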
Common Hashing Techniques
The purpose of a hash function is to produce hash values that are as unique as possible.
One simple idea is to size the array according to the possible range of keys,
but if the actual data is small while the range of keys is huge, this becomes extremely space
inefficient.
Given n entries and an array size N ≥ n, there are in theory many
ways to get a unique mapping, but the value should be easily computable and the
array should not waste space. The time taken by the hash function should not dominate the time of the basic
Map operations.
The technique depends upon the type of keys:
Numbers
Strings
Compound Keys
Common Hashing Techniques
Direct Hashing:
Using the key itself as the index
Guaranteed one-to-one mapping
Needs a very large array if the key is a large integer; space can be wasted
Modular Hashing or Division Method
If key is an integer
it can be mapped to an array slot directly by taking its modulus with N
Value of N affects the distribution.
If keys are {20, 12, 3, 19, 6, 8, 5, 18}, try with different values of N.
A prime number not close to a power of 2 is often considered a good choice for N
If key is a String
You can sum up the ASCII values and then take the modulus.
h("ab") = (97 + 98) mod N
Can produce the same result for anagrams such as abc and bca, or baba and baab.
This can be handled by exploiting relative positions
h("ab") = (97*1 + 98*2) mod N
h("ba") = (98*1 + 97*2) mod N
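The two string hashes can be compared side by side. This is a small sketch under assumptions: the class and method names are made up for illustration, and N = 11 is an arbitrary prime.

```java
class StringHashes {
    // Plain sum of character codes: anagrams always collide.
    static int sumHash(String key, int n) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % n;
    }

    // Position-weighted sum: the character at index i is multiplied by (i + 1).
    static int positionalHash(String key, int n) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i) * (i + 1);
        }
        return sum % n;
    }

    public static void main(String[] args) {
        int n = 11; // assumed table size
        System.out.println(sumHash("ab", n) + " vs " + sumHash("ba", n));
        System.out.println(positionalHash("ab", n) + " vs " + positionalHash("ba", n));
    }
}
```

With N = 11 the plain sum sends both "ab" and "ba" to slot 8, while the weighted version separates them (slots 7 and 6). Weighting reduces anagram collisions but cannot rule out collisions in general.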
Common Hashing Techniques
Modular Hashing or Division Method
Java's built-in algorithm to compute the hash value for Strings:
int R = 31;   // multiplier
int hash = 0;
for (int i = 0; i < string.length(); i++)
    hash = hash * R + string.charAt(i);
R is a prime number.
The result is a 32-bit integer that can be very large and, because of overflow, may even be negative. It can be mapped to an index using
modulus.
Check the details:
http://www.javamex.com/tutorials/collections/hashmaps4.shtml
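The loop can be checked against String.hashCode() directly. In the sketch below the class name PolynomialHash and the helper index() are assumptions; Math.floorMod is used for the compression step because the 32-bit result can be negative.

```java
class PolynomialHash {
    // Same recurrence as the loop above: hash = hash * 31 + next character.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow wraps around silently
        }
        return h;
    }

    // Compression to [0, n-1]; floorMod avoids a negative index.
    static int index(String s, int n) {
        return Math.floorMod(hash(s), n);
    }

    public static void main(String[] args) {
        String word = "hashtable";
        System.out.println(hash(word) == word.hashCode());
        System.out.println(index(word, 11));
    }
}
```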
Common Hashing Techniques
Mid-Square
Square the key, then take the middle portion
40 → 40² = 1600 → middle digits 60 → 60 mod N = array-slot
Folding
Shift folding: If keys are compound, like an account number, ISBN, phone number or
CNIC, divide the key into equal-sized portions (the last may be shorter) and add them
051-342671 → 051 + 342 + 671 = 1064 → 1064 mod N = array-slot
Another approach is boundary folding: reverse alternate portions before adding:
051-342671 → 051 + 243 + 671 = 965 → 965 mod N = array-slot
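Both techniques can be sketched in a few lines. The names below are illustrative, and two details are assumptions not fixed by the slides: mid-square keeps the middle two digits, and folding uses 3-digit portions.

```java
class MidSquareAndFolding {
    // Mid-square: square the key and keep the middle two digits (assumed width).
    static int midSquare(int key) {
        String sq = Long.toString((long) key * key);
        if (sq.length() <= 2) return Integer.parseInt(sq);
        int start = (sq.length() - 2) / 2;
        return Integer.parseInt(sq.substring(start, start + 2));
    }

    // Folding over 3-digit portions of a digit string.
    // reverseAlternate = false gives shift folding, true gives boundary folding.
    static int fold(String digits, boolean reverseAlternate) {
        int sum = 0;
        for (int i = 0, part = 0; i < digits.length(); i += 3, part++) {
            String piece = digits.substring(i, Math.min(i + 3, digits.length()));
            if (reverseAlternate && part % 2 == 1) {
                piece = new StringBuilder(piece).reverse().toString();
            }
            sum += Integer.parseInt(piece);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(midSquare(40));              // middle of 1600
        System.out.println(fold("051342671", false));   // shift folding
        System.out.println(fold("051342671", true));    // boundary folding
    }
}
```

Each result is a hash code only; it still has to be compressed with mod N to obtain the array slot.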
Compression Function Variation
Mapping of integer hash codes/values to an array index. Hash values may exceed the
array size, so we need to map them to an index. A few approaches:
Division
value mod N
Let's say hash values are {12, 15, 18, 21, 24, 27} and N=6
Multiple values will collide at the same index (every value maps to 0 or 3).
What if we take N as a prime number?
MAD (multiply-add-divide)
((a*value+b) mod p) mod N
p is a prime number larger than N (for example, the next prime after N)
a and b are two random integers in the range [0, p-1], where a>0
Choose the technique wisely, to avoid collisions
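The effect of N on the division method, and the MAD formula, can be tried out with the sketch below. The class name and the sample parameters a = 3, b = 5, p = 13 are assumptions chosen only for illustration; in practice a and b would be picked at random.

```java
class Compression {
    // Division method: value mod n, kept non-negative.
    static int division(int value, int n) {
        return Math.floorMod(value, n);
    }

    // MAD method: ((a * value + b) mod p) mod n, computed in long to avoid overflow.
    static int mad(int value, int a, int b, int p, int n) {
        long inner = Math.floorMod((long) a * value + b, (long) p);
        return (int) (inner % n);
    }

    public static void main(String[] args) {
        int[] values = {12, 15, 18, 21, 24, 27};
        for (int v : values) {
            System.out.println(v + ": mod 6 = " + division(v, 6)
                    + ", mod 7 = " + division(v, 7)
                    + ", MAD = " + mad(v, 3, 5, 13, 11));
        }
    }
}
```

With N = 6 the six values share only slots 0 and 3, because every value is a multiple of 3 and 3 divides 6. With the prime N = 7 they fall into six different slots.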
Collision
If the hash function is not perfect, it can cause a collision. A collision occurs when two
different keys get mapped to the same slot of the array.
h(k1) = 3, h(k2) = 3
Collisions can be handled, but ideally they should be avoided or minimized to keep the
operation time constant.
Load Factor:
A key factor in determining hash table performance.
If there are n entries to be hashed, and the table size is N,
λ = n/N, called the load factor of the hash table, should be bounded by a small value, ideally less
than 1.
As the load factor grows, collisions increase and the hash table slows down.
A good hash function will minimize these collisions by distributing keys uniformly.
The load factor of individual buckets also matters
Collision Handling
There are many different strategies to handle collisions:
Separate chaining / Open Hashing
Open addressing / Closed Hashing
Linear Probing
Quadratic Probing
Double Hashing
Separate Chaining
Chain all collisions for a specific slot in another container/bucket, and store that container in the slot.
This chain can be formed using a linked list.
Every slot will hold the address of the head of its linked list
If the bucket of some slot is empty, the slot will contain null
Insertion?
Simply add the new entry at the front of the list: O(1)
Search and Delete?
Search for the key by traversing the list
Will take time proportional to the size of the bucket
Advantage:
An unlimited number of collisions can be handled, even if the load factor is >= 1
Disadvantage:
Requires an auxiliary data structure and pointer maintenance
Worst case O(n), where n is the number of entries
Separate Chaining
Worst case?
The size of an individual bucket becomes proportional to the number of entries n.
If all keys are mapped to only one bucket: O(n)
Ideally there should be 2-3 values in a bucket
Average case? O(1)
Average size of an individual list = λ, average search time ≈ λ/2
So in separate chaining, the load factor of individual buckets also matters.
Two hash tables with n=35 and N=51 both have load factor ≈ 0.69
But one is distributing values well, while the other is hashing all values to only 2 or 3 buckets
Separate chaining is also called open hashing, as values are hashed into another data structure
rather than into the array itself
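A compact separate-chaining map might look like the sketch below. It is an illustration under assumptions: integer keys, String values, h(k) = k mod N, and the class name ChainedMap are all chosen here, not taken from the slides. Unlike the plain "add at front" rule, insert first walks the chain so that keys stay unique.

```java
class ChainedMap {
    // One node of a bucket's singly linked list.
    private static class Node {
        final int key;
        String value;
        Node next;
        Node(int key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] table; // each slot holds the head of its chain, or null

    ChainedMap(int n) {
        table = new Node[n];
    }

    private int slot(int key) {
        return Math.floorMod(key, table.length); // assumed hash: k mod N
    }

    void insert(int key, String value) {
        int i = slot(key);
        for (Node cur = table[i]; cur != null; cur = cur.next) {
            if (cur.key == key) { cur.value = value; return; } // key exists: overwrite
        }
        table[i] = new Node(key, value, table[i]); // new key: add at front of chain
    }

    String get(int key) {
        for (Node cur = table[slot(key)]; cur != null; cur = cur.next) {
            if (cur.key == key) return cur.value;
        }
        return null;
    }

    boolean remove(int key) {
        int i = slot(key);
        Node prev = null;
        for (Node cur = table[i]; cur != null; prev = cur, cur = cur.next) {
            if (cur.key == key) {
                if (prev == null) table[i] = cur.next; else prev.next = cur.next;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ChainedMap map = new ChainedMap(11);
        map.insert(13, "thirteen");
        map.insert(24, "twenty-four"); // collides with 13 in slot 2
        System.out.println(map.get(24));
    }
}
```

All three operations cost time proportional to the length of one chain, which is why the distribution across buckets matters as much as the overall load factor.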
Open Addressing
Stores colliding entries in other slots of the same array rather than in another data structure. The main idea is
to probe forward from the calculated index to find the next empty slot, treating
the array as circular.
Has many variants
Linear probing
Probe the next slot at an interval of 1
Quadratic probing
Probe the next slot at intervals of the squares of 1, 2, 3, ...
Double hashing
Probe the next slot at an interval calculated by a secondary hash function
It is also called closed hashing, as all hashing occurs in the same array and no extra data
structure is required.
Linear Probing
If h(k) returns value j and A[j] is occupied, then try the next slot A[(j+1) mod N];
if that is occupied, then try A[(j+2) mod N], and so on until a free slot is found.
Let's say N=11
Step                                   0     1     2     3     4     5     6     7     8     9     10
h(21)=10                               null  null  null  null  null  null  null  null  null  null  21
h(13)=2                                null  null  13    null  null  null  null  null  null  null  21
h(9)=9                                 null  null  13    null  null  null  null  null  null  9     21
h(24)=2, (2+1) mod 11=3                null  null  13    24    null  null  null  null  null  9     21
h(20)=9, (9+1) mod 11=10,
(9+2) mod 11=0                         20    null  13    24    null  null  null  null  null  9     21
When to stop?
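The insertions above can be replayed with a short sketch. It assumes h(k) = k mod N, which matches every hash value in the example but is not stated on the slide; the class name is also an assumption. One possible answer to "When to stop?" is used here: give up after N probes, since by then every slot has been examined once.

```java
class LinearProbing {
    // Inserts key into table (null = empty slot).
    // Returns the slot used, or -1 if all N slots were probed without success.
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int j = Math.floorMod(key, n); // assumed hash: h(k) = k mod N
        for (int i = 0; i < n; i++) {
            int slot = (j + i) % n;    // wrap around: the array is treated as circular
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1; // table is full
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        for (int key : new int[] {21, 13, 9, 24, 20}) {
            System.out.println(key + " -> slot " + insert(table, key));
        }
    }
}
```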
Linear Probing
Primary Clustering
If many collisions occur for a specific slot, neighboring slots fill up and form
blocks; this is called primary clustering. You have to probe through the cluster to find the next
available slot.
Clustering causes more collisions for other values that are going to be mapped into those
blocks.
Slot:    0     1     2     3     4     5     6     7     8     9     10
Table:   20    11    13    24    12    null  null  null  null  9     21
h(11)=0: the hash value of 20 is not 0, but 20 occupies slot 0, so 11 will go to 1
h(12)=1: slot 1 is filled, so it will go to 2, it is filled; go to 3, it is filled; go to 4
You have to skip the clusters created for slots 0 and 2,
and then, if needed, others as well
Quadratic Probing
If h(k) returns value j and A[j] is occupied, then try slot A[(j + i²) mod N], for i = 1, 2, 3,
… and so on, until an empty slot is found.
So every time a collision occurs, a perfect-square number of slots is added to the original index j:
j+1, j+4, j+9, j+16 and so on.
Let's say N=11
Step                                   0     1     2     3     4     5     6     7     8     9     10
h(21)=10                               null  null  null  null  null  null  null  null  null  null  21
h(9)=9 (13 already in slot 2)          null  null  13    null  null  null  null  null  null  9     21
h(24)=2, (2+1) mod 11=3                null  null  13    24    null  null  null  null  null  9     21
When to stop?
Quadratic Probing
Secondary Clustering
Quadratic probing greatly reduces primary clustering, but it has its own clustering problem,
called secondary clustering.
It is called secondary because colliding keys are now mapped to alternate (non-adjacent) slots from the current slot, and
clusters form along those alternate cells.
But it has another key problem: as it does not probe linearly, there is no guarantee that it will check
all slots in its path.
This problem becomes worse if the load factor > 0.75; not all insertions are guaranteed to succeed.
Insert 7 in the following table
Slot:    0     1     2     3     4     5     6     7     8     9     10
Table:   22    12    13    24    null  16    null  20    19    9     21
h(7)=7 (occupied)
(7+1) mod 11=8 (occupied)
(7+4) mod 11=0 (occupied)
(7+9) mod 11=5 (occupied)
(7+16) mod 11=1 (occupied)
(7+25) mod 11=10 (occupied)
Keep adding intervals
Can you decide when to stop further probing?
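The failing insertion of 7 can be reproduced with the sketch below. It assumes h(k) = k mod N and a hypothetical class name. The stopping rule is also an assumption rather than something the slides state: give up after N probes, because i² mod N repeats with period N, so later probes can only revisit slots already tried.

```java
class QuadraticProbing {
    // Tries slots (j + i*i) mod N for i = 0 .. N-1.
    // Returns the slot used, or -1 if none of the probed slots was free.
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int j = Math.floorMod(key, n); // assumed hash: h(k) = k mod N
        for (int i = 0; i < n; i++) {
            int slot = (j + i * i) % n;
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1; // probe sequence exhausted, even though free slots may exist
    }

    public static void main(String[] args) {
        Integer[] table = {22, 12, 13, 24, null, 16, null, 20, 19, 9, 21};
        System.out.println("insert(7) -> " + insert(table, 7));
    }
}
```

For this table the probe sequence starting at 7 only ever visits slots 7, 8, 0, 5, 1 and 10, so the free slots 4 and 6 are never reached and the insertion fails.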
Double Hashing
Double hashing applies a secondary hash function to the key when a collision
occurs. The result of the second hash function is the number of positions
to move from the point of collision for each probe. There are a couple of requirements for the
second function:
It must not return 0
It must make sure no slot will be skipped during probing
Example secondary hash function:
h2(key) = R - (key mod R), where R is a prime number that is smaller than the size of the table.
The interval depends on the key itself, so it greatly reduces the probability of clustering.
Double Hashing
Let's say N=11 and h2(key) = 7 - (key mod 7)
Step                                   0     1     2     3     4     5     6     7     8     9     10
h(21)=10                               null  null  null  null  null  null  null  null  null  null  21
h(13)=2                                null  null  13    null  null  null  null  null  null  null  21
h(9)=9                                 null  null  13    null  null  null  null  null  null  9     21
h(24)=2, h2=4, (2+4) mod 11=6          null  null  13    null  null  null  24    null  null  9     21
h(20)=9, h2=1, (9+1) mod 11=10,
(9+2) mod 11=0                         20    null  13    null  null  null  24    null  null  9     21
h(11)=0, h2=3, (0+3) mod 11=3          20    null  13    11    null  null  24    null  null  9     21
(had slot 3 been occupied, the next probe would be (0+6) mod 11=6)
When to stop?
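A sketch of insertion with double hashing follows, using the slide's h2(key) = 7 - (key mod 7). The primary hash h(k) = k mod N and the class name are assumptions. Probing stops after N attempts; when N is prime and the step is between 1 and N-1, those N probes cover every slot exactly once.

```java
class DoubleHashing {
    static final int R = 7; // prime smaller than the table size

    // Secondary hash: always in 1..R, so it never returns 0.
    static int h2(int key) {
        return R - Math.floorMod(key, R);
    }

    // Probes slots (j + i * h2(key)) mod N for i = 0 .. N-1.
    // Returns the slot used, or -1 if no free slot was found.
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int j = Math.floorMod(key, n); // assumed primary hash: k mod N
        int step = h2(key);
        for (int i = 0; i < n; i++) {
            int slot = (j + i * step) % n;
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        for (int key : new int[] {21, 13, 9, 24, 20, 11}) {
            System.out.println(key + " -> slot " + insert(table, key));
        }
    }
}
```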
Open Addressing
Advantage:
No other data structure needed
Suitable for smaller entries
Better memory management, no hassle of references
Disadvantage:
Insertions cannot be done if the load factor = 1.
Only a limited number of collisions can be handled
When the load factor exceeds 2/3, hash table performance degrades,
so open addressing requires a mandatory array resize when the load factor grows beyond a set
limit.
The hash function must minimize clustering of values that share consecutive probe orders.
Open Addressing
Best case
No clustering O(1)
Worst case
O(N)
Average case
≈ ½ (1 + 1/(1 − λ)²) probes
If λ is too high, close to 1, the cost approaches O(N); the equation is undefined if λ=1
Try λ=0.99
If λ is less than 0.75, it is bounded by a constant and independent of N
For example, if λ=0.75, it is 8.5
So a careful selection of the hash function and N will fairly provide an O(1) bound.
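The average-case estimate can be evaluated for a few load factors. The sketch assumes the formula ½ (1 + 1/(1 − λ)²), the commonly quoted estimate for an insertion or unsuccessful search under linear probing, which reproduces the value 8.5 quoted for λ = 0.75. The class and method names are illustrative.

```java
class ExpectedProbes {
    // Estimated average number of probes at load factor lambda (0 <= lambda < 1).
    static double averageProbes(double lambda) {
        double free = 1.0 - lambda; // fraction of empty slots
        return 0.5 * (1.0 + 1.0 / (free * free));
    }

    public static void main(String[] args) {
        for (double lambda : new double[] {0.5, 0.75, 0.9, 0.99}) {
            System.out.println("lambda = " + lambda + " -> " + averageProbes(lambda));
        }
    }
}
```

At λ = 0.5 the estimate is 2.5 probes and at λ = 0.75 it is 8.5, but at λ = 0.99 it jumps to roughly 5000, which is why the load factor is kept well below 1.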
Lazy deletion
How will deletion work in open addressing?
Assume linear probing here, and N=7.
Slot:      0     1     2     3     4     5     6
Before:    20    7     23    9     8     null  13
Let's say we delete 20? (h(20)=6, but it was stored in slot 0)
After:     ?     7     23    9     8     null  13
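One common answer to the "?" is to leave a special marker (a tombstone) in the slot instead of null. The sketch below illustrates that idea under assumptions: h(k) = k mod N, non-negative integer keys, and the marker values EMPTY and DELETED are all chosen here for illustration.

```java
class LazyDeletion {
    static final int EMPTY = -1;   // assumed marker for a never-used slot
    static final int DELETED = -2; // assumed tombstone marker for a removed entry

    // Linear-probing search: stops at EMPTY, but walks past DELETED slots.
    static int find(int[] table, int key) {
        int n = table.length;
        for (int i = 0; i < n; i++) {
            int slot = (key % n + i) % n;
            if (table[slot] == EMPTY) return -1; // a truly empty slot ends the search
            if (table[slot] == key) return slot;
        }
        return -1;
    }

    // Lazy deletion: mark the slot instead of emptying it.
    static boolean remove(int[] table, int key) {
        int slot = find(table, key);
        if (slot < 0) return false;
        table[slot] = DELETED;
        return true;
    }

    public static void main(String[] args) {
        int[] table = {20, 7, 23, 9, 8, EMPTY, 13};
        remove(table, 20);
        System.out.println("7 found at slot " + find(table, 7));
    }
}
```

If slot 0 were reset to EMPTY instead, the search for 7 (whose home slot is 0) would stop immediately and wrongly report that 7 is missing, even though it sits in slot 1.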
ReHashing
When the load factor becomes large (more than 2/3), the performance of the hash table degrades.
Solution:
Resize the table to a bigger size
Rehash the values according to the new size of the table
Why?
The hash table will become slow and collisions will increase
It will not allow further insertions in the case of closed hashing
The hash function can be refined, as the table size will be a different prime number.
Entries marked by lazy deletion will be removed
When?
If load factor > ½ or a certain limit
If an insertion fails
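Rehashing can be sketched as building a new, larger array and reinserting every live entry with the new table size. The example assumes linear probing, h(k) = k mod N, null for empty slots, and a new size picked by the caller (23 here, a prime roughly double the old size); none of these choices come from the slides.

```java
class Rehash {
    // Load factor = number of stored entries / table size.
    static double loadFactor(Integer[] table) {
        int n = 0;
        for (Integer key : table) {
            if (key != null) n++;
        }
        return (double) n / table.length;
    }

    // Returns a new table of size newSize with every entry reinserted by linear probing.
    static Integer[] rehash(Integer[] old, int newSize) {
        if (newSize < old.length) {
            throw new IllegalArgumentException("new table must not be smaller");
        }
        Integer[] bigger = new Integer[newSize];
        for (Integer key : old) {
            if (key == null) continue;            // skip empty slots
            int j = Math.floorMod(key, newSize);  // hash again with the NEW size
            while (bigger[j] != null) {
                j = (j + 1) % newSize;
            }
            bigger[j] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        Integer[] table = {20, 11, 13, 24, 12, null, null, null, null, 9, 21};
        System.out.println("before: " + loadFactor(table));
        Integer[] bigger = rehash(table, 23);
        System.out.println("after:  " + loadFactor(bigger));
    }
}
```

Entries cannot simply be copied to the same indices, because their slots depend on N: 24 sat in slot 3 of the old table but belongs in slot 1 when N = 23.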
Hash Table
Advantages
Faster, O(1) average cost of operations.
Suitable for larger number of records
If keys are known ahead of time, a collision free mapping is possible by choosing
suitable hash function, and array size
Disadvantages
Not good for smaller number of keys
Not efficient, if too many collisions
Visualizations
Check the following links to understand collision handling using visuals.
https://www.cs.usfca.edu/~galles/visualization/OpenHash.html
https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
Practice Problem
What will be the final state of a hash table after all insertions, with the following properties:
Fixed-size table, N=11
h(k) = (3k+5) mod 11
Keys: 12, 44, 13, 88, 23, 94, 11, 39, 20, 16, and 5, assuming collisions are handled by:
Chaining
Linear probing
Quadratic probing
Double hashing using the secondary hash function h′(k) = 7−(k mod 7)?
What will be the final state of a hash table with the following properties:
Initial table size is 5
Keys are integers
h(k)= k mod N
Collision Resolution: Linear probing.
Rehashing at load factor >= 0.5
Operation sequence: add(15); add(5); add(13); add(24); add(32); remove(13); add(17); add(44);
remove(15); add(47);
Use any character as a symbol for an entry that is removed.