14-HashTable

The document discusses the Map Abstract Data Type (ADT), which is designed to store and retrieve unique key-value pairs efficiently. It details various implementations of maps, including sorted arrays, search trees, and hash tables, emphasizing the importance of hash functions and collision resolution techniques. Additionally, it covers applications of maps in real-world scenarios and the challenges associated with hash tables, such as collision handling and load factors.
Maps, Hash Table

Data Structure and Algorithms


Outline
 Map ADT
 Structure
 Operations
 Implementation
 Sorted Array
 Search Tree
 Hash Map/Table
 Hash Function
 Collision Resolution

2 21/03/2025
Map
 A map is an abstract data type designed to store and retrieve <key, value> pairs
efficiently. Each pair is an entry. The keys of all entries must be unique, so
that a mapping can be defined between each key and its corresponding value.
 Also referred to as an associative array, because a key can serve as an index, but
unlike standard arrays, keys need not be numeric.
 Operations:
 insert(key, value)
 get(key)
 remove(key)

Applications
 A university’s information system relies on some form of a student ID as a key that is
mapped to that student’s associated record (such as the student’s name, address, and course
grades) serving as the value.
 The domain-name system (DNS) maps a host name, such as www.wiley.com, to an
Internet-Protocol (IP) address, such as 208.215.179.146.
 A social media site typically relies on a (nonnumeric) username as a key that can be
efficiently mapped to a particular user’s associated information.
 A company’s customer base may be stored as a map, with a customer’s account number or
unique user ID as a key, and a record with the customer’s information as a value. The map
would allow a service representative to quickly access a customer’s record, given the key.
 A computer graphics system may map a color name, such as 'turquoise', to the triple of
numbers that describes the color’s RGB (red-green-blue) representation, such as (64, 224,
208).

Implementation?
 How can we implement the Map ADT as efficiently as possible?
 Which data structures have we studied so far?
 Examples:
1. Say we have a text file and need to count the frequency of each letter.
 Sorted array of letters?
2. We need to store a dictionary of words for spell checking.
 Sorted array?
 Search tree (balanced)?
3. We need to look up a username and password for authentication.
 Time complexity?
 Does the size of the data matter?

Hash Table
 A data structure that provides insertion, deletion and search in O(1) time in the
average case.
 Idea: use the key to compute the address of its associated value. There are two
components:
 Look-up table/array: an array to hold the data; each position is referred to as a slot or bucket.
 Hash function: a function that maps a key to an integer identifying a slot of the array (its index).
 When the key itself is used directly as the index, the structure is known as a direct-address table.
[Figure: keys pass through the hash function to slots 0 to 4 of the look-up array.]

Hash Table
 Letter Frequency - Hash Table
 Unique mapping of key to slot
 Simply call the hash function to access the required slot
 Operations are independent of the size of the data.
 If the mapping of key to slot is unique:
 Search, delete and insert take O(1) time in the worst case
 The array does not need to be sorted
[Figure: keys ‘a’, ‘i’, ‘h’ pass through the hash function Slot = (ASCII of key) mod N into a look-up array with slots 0 to 25 holding letter counts; for example slot 0 holds 5, slot 1 holds 4, slot 19 holds 9 and slot 25 holds 0.]

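The letter-frequency table can be sketched in Java as follows. This is a minimal sketch, assuming lowercase ASCII letters only and N = 26 (the slide does not state N; 26 is an assumption that makes the mapping unique). The class and method names are illustrative.

```java
class LetterFrequency {
    static final int N = 26; // assumed table size: one slot per lowercase letter

    // Slot = (ASCII of key) mod N, as on the slide.
    static int slot(char key) {
        return key % N;
    }

    // Counts lowercase letters; every other character is ignored.
    static int[] count(String text) {
        int[] table = new int[N];
        for (char c : text.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                table[slot(c)]++; // O(1): no searching, no sorting
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = count("hash tables are fast");
        System.out.println("'a' lives in slot " + slot('a')
                + " with count " + table[slot('a')]);
    }
}
```

Because the 26 lowercase codes are consecutive, taking them mod 26 gives 26 distinct slots, so no two letters collide under this assumption.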
Hash Table
 Phone Directory - Hash Table
 How many possible keys are there?
 One-to-one mapping is not feasible
 Size of table < range of data
 Any potential issue with a smaller hash table?
 Collision
[Figure: keys ‘zaki’, ‘noor’, ‘saad’ pass through the hash function Slot = sum(ASCII of each character) mod N into a look-up array whose slots hold entries such as (saad, 1290), (zaki, 4567) and (noor, 4567).]

Hash Table
 Two challenges:
1. Hash function
 How to map keys to slots?
 Should minimize collisions, which occur when two different keys are mapped to the same index/slot.
2. Size of the look-up array
 The size can also cause collisions
 The goal is to minimize both wasted space and collisions

Hash Function
 The goal of a hash function h(key) is to map each key k to an integer in the
range [0, N - 1] using some arithmetic, where N is the capacity of the bucket array
of the hash table. Given a key, it computes and returns that integer.
 The hash function is the backbone of a hash table, as it is used by every operation performed
on the table.
 insert(‘a’, 1)
 uses the hash function to calculate the slot/index at which to insert the given pair in the array.
 A[h(‘a’)] = Entry<‘a’, 1>
[Figure: look-up array with slot 0 holding (a, 1) and slot 1 holding (b, 3).]
 Hash functions have many applications beyond hash tables:
 Cryptographic hash functions, used in information security
 Password verification
 Digital signatures

Hash Function
 Properties:
 It should be deterministic: the same key should always get the same hash value
 It should distribute inputs uniformly over the outputs, to avoid collisions
 Its output should be of fixed length
 It should be efficient to compute
 See the details:
 https://en.wikipedia.org/wiki/Hash_function
 Perfect hash function
 If a hash function provides a unique mapping of keys to array slots, it is
called a perfect hash function

Hash Function
 A hash function computes two things:
1. Hash code/value: conversion of the key into an
integer value, as the key may not be an integer
 Independent of array size
 Need not lie in the range [0, N - 1]
2. Compression: mapping of the calculated hash code to the
range [0, N - 1], where N is the length of the look-up
array
 Dependent on array size
 A very common way is taking the modulus of the hash value with N
 N can be taken as a prime number
 This reduces the chance of different values giving the same remainder and gives a more uniform
distribution of values.
 In simple hash functions, the 1st step may not be present
• Key is already an integer

Java’s Built-in Hash Function
 Java has a built-in hashCode() method defined in the Object class, which returns a
32-bit integer as the hash value of any object.
 The default implementation typically derives this integer from the object’s identity (for example its memory
address), so distinct objects usually, though not necessarily, get distinct values.
 A hash function must give the same hash value for two objects that have the same key.
 Two objects are considered equal if they have the same key.
 If object1.equals(object2) == true
 then object1.hashCode() == object2.hashCode()
 The default implementation of equals(Object o) also relies on object identity.
 So you must override hashCode() whenever you override equals().
 Do this to provide your own meaning of equality and hashing.

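A minimal sketch of overriding equals() and hashCode() together. StudentId is a hypothetical key class invented for illustration; it assumes the id string is the whole key and is never null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class: two StudentId objects are equal when their ids match.
final class StudentId {
    private final String id;

    StudentId(String id) {
        this.id = Objects.requireNonNull(id);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StudentId)) return false;
        return id.equals(((StudentId) o).id);
    }

    @Override
    public int hashCode() {
        return id.hashCode(); // equal ids always give equal hash codes
    }
}

class StudentIdDemo {
    public static void main(String[] args) {
        Map<StudentId, String> records = new HashMap<>();
        records.put(new StudentId("s-101"), "Noor");
        // A different object with the same id finds the same entry.
        System.out.println(records.get(new StudentId("s-101"))); // prints Noor
    }
}
```

If hashCode() were left at its default, the second StudentId object would most likely land in a different bucket and the lookup would return null.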
Common Hashing Techniques
 The purpose of a hash function is to produce hash values that are as unique as possible.
 One simple idea is to size the array according to the possible range of keys,
but if the actual data is small while the range of keys is huge, this becomes extremely space
inefficient.
 Given n entries and an array of size N >= n, in theory there are always many
ways to get a unique mapping, but the value should be easy to compute and the
array should not waste space. The time taken by the hash function should not dominate the time of the basic
Map operations.
 The technique depends on the type of keys:
 Numbers
 Strings
 Compound keys

Common Hashing Techniques
 Direct hashing:
 Uses the key itself as the index
 Guaranteed one-to-one mapping
 Needs a very large array if the key is a large integer, so space can be wasted
 Modular hashing or division method
 If the key is an integer
 it can be mapped to an array slot directly by taking its modulus with N
 The value of N affects the distribution.
 If the keys are {20, 12, 3, 19, 6, 8, 5, 18}, try different values of N.
 A prime number not close to a power of 2 is often considered a good choice for N
 If the key is a String
 You can sum the ASCII values and then take the modulus.
 h("ab") = (97 + 98) mod N
 This produces the same result for anagrams such as abc and bca, or baba and baab.
 This can be handled by exploiting the relative positions of the characters
 h("ab") = (97*1 + 98*2) mod N
 h("ba") = (98*1 + 97*2) mod N

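The two string hashes described above can be sketched in Java as follows. The class name, method names and the table size N = 11 used in main are assumptions made for illustration.

```java
class StringSumHash {
    // Sum of character codes, then modulus: anagrams always collide.
    static int sumHash(String key, int n) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % n;
    }

    // Each character is weighted by its 1-based position, so order matters.
    static int positionalHash(String key, int n) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i) * (i + 1);
        }
        return sum % n;
    }

    public static void main(String[] args) {
        int n = 11; // assumed table size
        System.out.println(sumHash("ab", n) + " " + sumHash("ba", n));               // prints 8 8
        System.out.println(positionalHash("ab", n) + " " + positionalHash("ba", n)); // prints 7 6
    }
}
```

Positional weights separate "ab" from "ba" here, but they do not make the function collision free; different strings can still share a slot.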
Common Hashing Techniques
 Modular hashing or division method
 Java’s built-in algorithm to compute the hash value of a String:
int R = 31;
int hash = 0;
for (int i = 0; i < string.length(); i++)
    hash = hash * R + string.charAt(i);
 R is a prime number.
 It can return a very large integer (a Java int may also overflow and become negative), which is then mapped to an index using the
modulus.
 Check the details:
 http://www.javamex.com/tutorials/collections/hashmaps4.shtml
 The function can be modified by using a different value of R, where R is a small prime number.

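A runnable sketch of the polynomial string hash with R = 31, followed by compression to a slot. The class and method names are illustrative; Math.floorMod is used instead of % so that a negative hash (after int overflow) still yields an index in [0, N - 1].

```java
class PolynomialHash {
    // hash = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed iteratively.
    static int hashCodeOf(String s) {
        final int R = 31;
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * R + s.charAt(i); // int overflow simply wraps around
        }
        return hash;
    }

    // Compression step: map the hash code to a slot of an array of length n.
    static int slot(String key, int n) {
        return Math.floorMod(hashCodeOf(key), n);
    }

    public static void main(String[] args) {
        String key = "hash table";
        // Matches the value returned by the JDK for the same string.
        System.out.println(hashCodeOf(key) == key.hashCode()); // prints true
        System.out.println(slot(key, 11));
    }
}
```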
Common Hashing Techniques
 Mid-square
 Square the key, then take the middle portion
 40 → 40² = 1600 → 60 mod N = array slot
 Folding
 Shift folding: if keys are compound, like an account number, ISBN, phone number or
CNIC, divide the key into equal-sized portions (the last may be shorter) and add them
 051-342671 → 051 + 342 + 671 = value; value mod N = array slot
 Another approach is boundary folding: reverse alternate portions before adding
 051-342671 → 051 + 243 + 671 = value; value mod N = array slot

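A small Java sketch of mid-square and folding, reproducing the numbers on the slide. The helper names are illustrative, the "middle portion" is assumed to be two digits, and the key is assumed to be passed as a string of digits without the dash. The caller applies "mod N" to the returned value.

```java
class FoldingHash {
    // Mid-square: square the key and keep the two middle digits (assumed width).
    static int midSquareDigits(int key) {
        String square = Long.toString((long) key * key);
        int start = Math.max(0, (square.length() - 2) / 2);
        int end = Math.min(square.length(), start + 2);
        return Integer.parseInt(square.substring(start, end));
    }

    // Folding: split the digits into parts of partSize and add them.
    // With boundary = true, every second part is reversed before adding.
    static int foldSum(String digits, int partSize, boolean boundary) {
        int sum = 0;
        int part = 0;
        for (int i = 0; i < digits.length(); i += partSize, part++) {
            String piece = digits.substring(i, Math.min(digits.length(), i + partSize));
            if (boundary && part % 2 == 1) {
                piece = new StringBuilder(piece).reverse().toString();
            }
            sum += Integer.parseInt(piece);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(midSquareDigits(40));              // prints 60
        System.out.println(foldSum("051342671", 3, false));   // prints 1064
        System.out.println(foldSum("051342671", 3, true));    // prints 965
    }
}
```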
Compression Function Variation
 Mapping of integer hash codes/values to an array index. Hash values may exceed the
array size, so we need to map them to an index. A few approaches:
 Division
 value mod N
 Say the hash values are {12, 15, 18, 21, 24, 27} and N = 6
 Multiple values collide at the same index: every value maps to slot 0 or slot 3.
 What if we take N as a prime number?
 MAD (multiply-add-divide)
 ((a*value + b) mod p) mod N
 p is a prime number larger than N
 a and b are two random integers in the range [0, p - 1], where a > 0
 Choose the technique wisely to avoid collisions

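The two compression methods can be compared on the sample values with a short Java sketch. The constants a = 3, b = 2 and p = 7 in main are arbitrary illustrative choices, not values from the slides.

```java
import java.util.Arrays;

class Compression {
    // Division method: value mod n.
    static int division(int hash, int n) {
        return Math.floorMod(hash, n);
    }

    // MAD method: ((a*hash + b) mod p) mod n, with p prime, p > n, 0 < a < p, 0 <= b < p.
    static int mad(int hash, int n, long a, long b, long p) {
        return (int) (Math.floorMod(a * hash + b, p) % n);
    }

    public static void main(String[] args) {
        int[] hashes = {12, 15, 18, 21, 24, 27};
        int[] byDivision6 = new int[hashes.length];
        int[] byDivision7 = new int[hashes.length];
        int[] byMad = new int[hashes.length];
        for (int i = 0; i < hashes.length; i++) {
            byDivision6[i] = division(hashes[i], 6);
            byDivision7[i] = division(hashes[i], 7);
            byMad[i] = mad(hashes[i], 6, 3, 2, 7); // assumed a = 3, b = 2, p = 7
        }
        System.out.println(Arrays.toString(byDivision6)); // [0, 3, 0, 3, 0, 3]
        System.out.println(Arrays.toString(byDivision7)); // [5, 1, 4, 0, 3, 6]
        System.out.println(Arrays.toString(byMad));       // [3, 5, 0, 2, 4, 0]
    }
}
```

With N = 6 the division method uses only two slots, a prime N = 7 spreads the six values over six slots, and MAD with N = 6 uses five of the six slots under these particular constants.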
Collision
 If the hash function is not perfect, it can cause collisions. A collision occurs when two
different keys are mapped to the same slot of the array.
 h(k1) = 3, h(k2) = 3
 Collisions can be handled, but ideally they should be avoided or minimized to keep the
running time constant.
 Load factor:
 A key factor in determining hash table performance.
 If there are n entries to be hashed and the table size is N,
 λ = n/N is called the load factor of the hash table. It should be bounded by a small value, ideally less
than 1.
 As the load factor grows, collisions increase and the hash table slows down.
 A good hash function minimizes collisions by distributing keys uniformly.
 The load of individual buckets also matters.

Collision Handling
 There are many different strategies to handle collisions:
 Separate chaining (open hashing)
 Open addressing (closed hashing)
 Linear probing
 Quadratic probing
 Double hashing

Separate Chaining
 Chain all entries that collide at a specific slot in a separate container (bucket), and store that container in the slot.
 The chain can be formed using a linked list.
 Every slot holds a reference to the head of its linked list.
 If the bucket of a slot is empty, the slot contains null.
 Insertion?
 Simply add the new entry at the front of the list: O(1)
 Search and delete?
 Search for the key by traversing the list
 Takes time proportional to the size of the bucket

 Advantage:
 Unlimited collisions can be handled, even if the load factor is >= 1
 Disadvantage:
 Requires an auxiliary data structure and pointer maintenance
 Worst case: O(n), where n is the number of entries

Separate Chaining
 Worst case?
 The size of an individual bucket can grow in proportion to the number of entries.
 If all keys are mapped to one bucket only: O(n)
 Ideally there should be no more than 2-3 values in a bucket
 Average case? O(1)
 Average size of an individual list = λ, average search time ≈ λ/2
 So in separate chaining, the load of individual buckets also matters.
 Consider two hash tables with n = 35 and N = 51, so load factor ≈ 0.69
 One distributes values well; the other hashes all values to only 2 or 3 buckets

 Separate chaining is also called open hashing, as values are hashed into another data structure
rather than into the array itself.

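A minimal separate-chaining sketch in Java with String keys and int values. The class is illustrative, not a library API. One deliberate difference from the slide: insert first walks the chain to keep keys unique, so it costs O(chain length) rather than O(1).

```java
class ChainedHashMap {
    // One node of a singly linked chain.
    private static final class Node {
        final String key;
        int value;
        Node next;
        Node(String key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] slots; // each slot holds the head of its chain, or null
    private int size;

    ChainedHashMap(int capacity) {
        slots = new Node[capacity];
    }

    private int index(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    void insert(String key, int value) {
        int i = index(key);
        for (Node n = slots[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // key exists: update
        }
        slots[i] = new Node(key, value, slots[i]); // new entry goes to the front
        size++;
    }

    Integer get(String key) {
        for (Node n = slots[index(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null; // not present
    }

    boolean remove(String key) {
        int i = index(key);
        Node prev = null;
        for (Node n = slots[i]; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev == null) slots[i] = n.next; else prev.next = n.next;
                size--;
                return true;
            }
        }
        return false;
    }

    double loadFactor() {
        return (double) size / slots.length;
    }
}
```

Creating the map with capacity 1 forces every key into the same chain, which is a quick way to observe both the "load factor >= 1 still works" advantage and the O(n) worst case.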
Open Addressing
 Maps colliding entries to other array slots rather than into another data structure. The main idea is
to probe forward from the calculated index to find the next empty slot, treating
the array as circular.
 Has many variants
 Linear probing
 Probe the next slot at an interval of 1
 Quadratic probing
 Probe the next slot at intervals of the squares of 1, 2, 3, ...
 Double hashing
 Probe the next slot at an interval calculated by a secondary hash function
 It is also called closed hashing, as all hashing occurs in the same array; no extra data
structure is required.

Linear Probing
 If h(k) returns j and A[j] is occupied, try the next slot A[(j+1) mod N];
if that is occupied, try A[(j+2) mod N], and so on until a free slot is found.
 Say N = 11. The array after each insertion (slots 0 to 10, left to right):
 h(21) = 10
null null null null null null null null null null 21
 h(13) = 2
null null 13 null null null null null null null 21
 h(9) = 9
null null 13 null null null null null null 9 21
 h(24) = 2, occupied
 (2+1) mod 11 = 3
null null 13 24 null null null null null 9 21
 h(20) = 9, occupied
 (9+1) mod 11 = 10, occupied
 (9+2) mod 11 = 0
20 null 13 24 null null null null null 9 21
 When to stop?

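The linear-probing insertions can be replayed with a short Java sketch. It assumes h(k) = k mod N, which matches the hash values on the slide, and it answers "when to stop?" by giving up after N probes, at which point every slot has been checked.

```java
import java.util.Arrays;

class LinearProbing {
    // Inserts key and returns the slot used, or -1 if the table is full.
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n); // assumed h(k) = k mod N
        for (int i = 0; i < n; i++) {
            int slot = (home + i) % n;    // wrap around: the array is circular
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1; // all N slots were probed and none was free
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        for (int key : new int[] {21, 13, 9, 24, 20}) {
            insert(table, key);
        }
        System.out.println(Arrays.toString(table));
        // [20, null, 13, 24, null, null, null, null, null, 9, 21]
    }
}
```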
Linear Probing
 Primary clustering
 If many collisions occur at a specific slot, the neighboring slots fill up and form
blocks; this is called primary clustering. You have to probe through a cluster to find the next
available slot.
 Clustering causes more collisions for other keys that map into those
blocks.
 h(11) = 0
 Slot 0 is occupied by 20, whose hash value is not even 0, so 11 goes to slot 1
 h(12) = 1
 Slot 1 is filled, go to 2; it is filled
 Go to 3; it is filled
 Go to 4
 You have to skip the clusters created around slots 0 and 2
 and then, if needed, others as well
 The array after both insertions (slots 0 to 10, left to right):
20 11 13 24 12 null null null null 9 21

Quadratic Probing
 If h(k) returns j and A[j] is occupied, try A[(j + i²) mod N] for i = 1, 2, 3,
... and so on, until an empty slot is found.
 So every time a collision occurs, the next perfect square is added to the original index.
 j+1, j+4, j+9, j+16 and so on.
 Say N = 11. The array after each insertion (slots 0 to 10, left to right):
 h(21) = 10
null null null null null null null null null null 21
 h(13) = 2
null null 13 null null null null null null null 21
 h(9) = 9
null null 13 null null null null null null 9 21
 h(24) = 2, occupied
 (2+1) mod 11 = 3
null null 13 24 null null null null null 9 21
 h(20) = 9, occupied
 (9+1) mod 11 = 10, occupied
 (9+4) mod 11 = 2, occupied
 (9+9) mod 11 = 7
null null 13 24 null null null 20 null 9 21
 When to stop?

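A Java sketch of quadratic-probing insertion that replays the example, again assuming h(k) = k mod N. The probe limit of N attempts is a design choice of this sketch: since (i + N)² and i² leave the same remainder mod N, the probe sequence repeats after N steps, so further probing cannot reach new slots.

```java
import java.util.Arrays;

class QuadraticProbing {
    // Inserts key and returns the slot used, or -1 if no free slot was reached.
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n);  // assumed h(k) = k mod N
        for (int i = 0; i < n; i++) {
            int slot = (home + i * i) % n; // offsets 0, 1, 4, 9, 16, ...
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1; // the sequence repeats from here on, so stop
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        for (int key : new int[] {21, 13, 9, 24, 20}) {
            insert(table, key);
        }
        System.out.println(Arrays.toString(table));
        // [null, null, 13, 24, null, null, null, 20, null, 9, 21]
    }
}
```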
Quadratic Probing
 Secondary clustering
 Quadratic probing substantially reduces primary clustering, but it has a clustering problem of its own,
called secondary clustering.
 It is called secondary because keys that hash to the same slot follow the same sequence of alternate slots, so
clusters form along that sequence rather than in adjacent cells.
 It has another key problem: as it does not move linearly, there is no guarantee that it will check
all slots on its path.
 This problem becomes worse if the load factor > 0.75; insertions are then not guaranteed to succeed.
 Insert 7 into the following table (slots 0 to 10, left to right):
22 12 13 24 null 16 null 20 19 9 21
 h(7) = 7, occupied
 (7+1) mod 11 = 8, occupied
 (7+4) mod 11 = 0, occupied
 (7+9) mod 11 = 5, occupied
 (7+16) mod 11 = 1, occupied
 (7+25) mod 11 = 10, occupied
 Keep adding intervals
 Can you decide when to stop further probing?

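To see why inserting 7 fails even though slots 4 and 6 are free, the following Java sketch lists the distinct slots that the probe sequence (home + i²) mod N can ever reach. The class name is illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class QuadraticProbeCoverage {
    // Distinct slots visited by (home + i*i) mod n for i = 0 .. n-1,
    // in the order they are first reached. Later values of i only repeat these.
    static Set<Integer> probedSlots(int home, int n) {
        Set<Integer> slots = new LinkedHashSet<>();
        for (int i = 0; i < n; i++) {
            slots.add((home + i * i) % n);
        }
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(probedSlots(7, 11)); // prints [7, 8, 0, 5, 1, 10]
    }
}
```

Only six of the eleven slots are reachable from home slot 7, and all six are occupied in the table above, so probing can stop after N attempts with the conclusion that the insertion fails.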
Double Hashing
 Double hashing applies a secondary hash function to the key when a collision
occurs. The result of the second hash function is the number of positions
to move from the point of collision on each probe. There are a couple of requirements for the
second function:
 It must not return 0
 It must make sure no slot is skipped during probing
 Example secondary hash function:
 h2(key) = R - (key mod R), where R is a prime number smaller than the size of the table.

 The interval depends on the key itself, so clustering is greatly reduced.

Double Hashing
 Say N = 11 and h2(key) = 7 - (key mod 7). The array after each insertion (slots 0 to 10, left to right):
 h(21) = 10
null null null null null null null null null null 21
 h(13) = 2
null null 13 null null null null null null null 21
 h(9) = 9
null null 13 null null null null null null 9 21
 h(24) = 2, occupied; h2 = 4
 (2+4) mod 11 = 6
null null 13 null null null 24 null null 9 21
 h(20) = 9, occupied; h2 = 1
 (9+1) mod 11 = 10, occupied
 (9+2) mod 11 = 0
20 null 13 null null null 24 null null 9 21
 h(11) = 0, occupied; h2 = 3
 (0+3) mod 11 = 3
20 null 13 11 null null 24 null null 9 21
 If slot 3 had been occupied, the next probe would be (0+6) mod 11 = 6, and so on.
 When to stop?

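A Java sketch of double-hashing insertion for the keys 21, 13, 9, 24, 20, 11, assuming h(k) = k mod 11 and h2(k) = 7 - (k mod 7). Because N = 11 is prime and the step is between 1 and 7, N probes visit every slot exactly once, so the loop can safely stop after N attempts.

```java
import java.util.Arrays;

class DoubleHashing {
    // Inserts key and returns the slot used, or -1 if the table is full.
    static int insert(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n);     // assumed h(k) = k mod N
        int step = 7 - Math.floorMod(key, 7); // h2(k), always in 1..7, never 0
        for (int i = 0; i < n; i++) {
            int slot = (home + i * step) % n;
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        for (int key : new int[] {21, 13, 9, 24, 20, 11}) {
            insert(table, key);
        }
        System.out.println(Arrays.toString(table));
        // [20, null, 13, 11, null, null, 24, null, null, 9, 21]
    }
}
```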
Open Addressing
 Advantage:
 No other data structure is needed
 Suitable for smaller entries
 Better memory management, no hassle of references
 Disadvantage:
 Insertions cannot be done once the load factor reaches 1.
 Only a limited number of collisions can be handled
 When the load factor exceeds 2/3, hash table performance degrades,
 so open addressing requires a mandatory array resize when the load factor grows beyond a set
threshold.
 The hash function must minimize clustering of values that have consecutive probe order.

 Read the following article for detailed insight.
 https://en.wikipedia.org/wiki/Hash_table#Open_addressing

Open Addressing
 If h(k) is not the required slot, we need to search the table by skipping slots
according to the probe interval until
 Either the slot is null
 it is an empty slot, so the key is not present
 Or currentIndex == h(k)
 it means you have come back to the same index and the table is actually full

Open Addressing
 Best case
 No clustering: O(1)
 Worst case
 O(N)
 Average case (linear probing, insertion or unsuccessful search)
 ≈ ½ (1 + 1/(1 - λ)²) probes
 If λ is too high, close to 1, this approaches O(N); the expression is undefined at λ = 1
 Try λ = 0.99: about 5000 probes
 If λ is less than 0.75, it is bounded by a constant and independent of N
 Say λ = 0.75: it is 8.5
 So careful selection of the hash function and N will fairly reliably provide an O(1) bound.

Lazy deletion
 How does deletion work in open addressing?
 Assume linear probing here, with N = 7 (slots 0 to 6, left to right):
20 7 23 9 8 null 13
 Say we delete 20.
 h(20) = 6
 Probe linearly until you find 20 or null
 If found, delete it
 If null is found, the element is not present
 With what value should 20 be replaced? Can we place null?
null 7 23 9 8 null 13
 How would we then delete 7?
 h(7) = 0 and that slot is null, which suggests 7 is not present in the array
 But that is not true.

Lazy deletion
 Instead of deleting 20, mark it as dead or deleted (shown as ? below)
20 7 23 9 8 null 13
? 7 23 9 8 null 13
 During insertion, a dead slot is treated as free
 During search, it is treated as occupied

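Lazy deletion with linear probing can be sketched in Java as follows. The class is illustrative, assumes integer keys with h(k) = k mod N, and uses a parallel boolean array as the "dead" marker; other encodings of the marker are equally valid.

```java
class LazyDeletionTable {
    private final Integer[] keys;
    private final boolean[] dead; // true = the entry in this slot was removed

    LazyDeletionTable(int capacity) {
        keys = new Integer[capacity];
        dead = new boolean[capacity];
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Returns the slot holding key, or -1 if it is not present.
    int find(int key) {
        for (int i = 0; i < keys.length; i++) {
            int slot = (home(key) + i) % keys.length;
            if (keys[slot] == null) return -1;                 // truly empty: stop
            if (!dead[slot] && keys[slot] == key) return slot; // dead slots are skipped
        }
        return -1; // probed the whole table
    }

    boolean insert(int key) {
        if (find(key) >= 0) return false; // keys stay unique
        for (int i = 0; i < keys.length; i++) {
            int slot = (home(key) + i) % keys.length;
            if (keys[slot] == null || dead[slot]) { // a dead slot counts as free
                keys[slot] = key;
                dead[slot] = false;
                return true;
            }
        }
        return false; // table is full
    }

    boolean remove(int key) {
        int slot = find(key);
        if (slot < 0) return false;
        dead[slot] = true; // mark only; the slot is not reset to null
        return true;
    }
}
```

Inserting 13, 20, 7, 23, 9, 8 into a table of capacity 7 reproduces the array on the slide; after remove(20), find(7) still succeeds because the search walks past the dead slot 0 instead of stopping there.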
Rehashing
 When the load factor becomes large (more than 2/3), the performance of the hash table degrades.
 Solution:
 Resize the table to a bigger size
 Rehash the values according to the new size of the table
 Why?
 The hash table becomes slow and collisions increase
 Closed hashing does not allow further insertions once the table is full
 The hash function can be refined, as the table size will be a different prime number
 Lazily deleted entries are removed
 When?
 If the load factor > ½ or another chosen limit
 If an insertion fails

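A minimal rehashing sketch in Java for a linear-probing table of integer keys. It assumes h(k) = k mod N and that the new capacity is larger than the number of stored keys (otherwise the inner loop would never find a free slot). The new capacity of 17 in main is an arbitrary prime chosen for illustration.

```java
import java.util.Arrays;

class Rehashing {
    // Builds a new table of the given capacity and re-inserts every key.
    // Keys are hashed again because h(k) = k mod N depends on the table size.
    static Integer[] rehash(Integer[] old, int newCapacity) {
        Integer[] bigger = new Integer[newCapacity];
        for (Integer key : old) {
            if (key == null) continue; // empty slot: nothing to move
            int slot = Math.floorMod(key, newCapacity);
            while (bigger[slot] != null) {
                slot = (slot + 1) % newCapacity; // linear probing
            }
            bigger[slot] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        Integer[] old = {20, 7, 23, 9, 8, null, 13}; // N = 7, load factor 6/7
        Integer[] bigger = rehash(old, 17);          // load factor 6/17
        System.out.println(Arrays.toString(bigger));
    }
}
```

In this example no key collides in the larger table: 20, 23, 7, 8, 9 and 13 land in slots 3, 6, 7, 8, 9 and 13 respectively.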
Hash Table
 Advantages
 Faster, O(1) average cost of operations.
 Suitable for larger number of records
 If keys are known ahead of time, a collision free mapping is possible by choosing
suitable hash function, and array size
 Disadvantages
 Not good for smaller number of keys
 Not efficient, if too many collisions

Visualizations
 Check the following links to understand collisions using visuals.

 https://www.cs.usfca.edu/~galles/visualization/OpenHash.html
 https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

Practice Problem
 What will be the final state of a hash table after all insertions, given the following properties:
 Fixed-size table, N = 11
 h(k) = (3k+5) mod 11
 Keys: 12, 44, 13, 88, 23, 94, 11, 39, 20, 16, and 5, assuming collisions are handled by:
 Chaining
 Linear probing
 Quadratic probing
 Double hashing using the secondary hash function h′(k) = 7 - (k mod 7)?
 What will be the final state of a hash table with the following properties:
 Initial table size is 5
 Keys are integers
 h(k) = k mod N
 Collision resolution: linear probing
 Rehashing at load factor >= 0.5
 Operation sequence: add(15); add(5); add(13); add(24); add(32); remove(13); add(17); add(44);
remove(15); add(47);
 Use any character as a symbol for an entry that is removed.
