
Data Structures and Algorithms

Lecture 13 – Hash Tables

Sayed Faheem Qadry


Department of Computer Science
The Institute of Finance Management
Storing data
 So far we have learned about arrays, linked lists and trees.
 These approaches can perform quite differently on the particular tasks we expect to carry out on the items, such as insertion, deletion and searching.
 There is no single best way of storing data in general; the right choice depends on the particular application.
Now let's look at another way to store data:
 We want to put each item in an easily determined location, so that we never need to search for it and have no ordering to maintain when inserting or deleting items.
 This gives impressive performance as far as time is concerned.
 BUT the disadvantages are the need for more memory, more complicated algorithms and a harder implementation.
The Table abstract data type
The specification of the table abstract data type is as follows:
 A table can be used to store objects
 Every record stored in the table is identified by a unique key
Methods or procedures:
 Boolean IsEmpty()
 Boolean IsFull()
 void Insert(Record)
 Record Retrieve(key)
 void Update(Record)
 void Delete(key)
 void Traverse()
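
A minimal Java sketch of this ADT; the interface name, the generic Record type and the use of integer keys are illustrative assumptions, not part of the lecture:

    interface Table<Record> {
        boolean isEmpty();              // true if the table holds no records
        boolean isFull();               // true if no further records can be inserted
        void insert(Record record);     // add a record under its unique key
        Record retrieve(int key);       // fetch the record stored under the given key
        void update(Record record);     // replace the stored record that has the same key
        void delete(int key);           // remove the record stored under the given key
        void traverse();                // visit every record, e.g. in key order
    }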
Implementations of the table data structure
 Implementation via sorted arrays: deletion is a challenge, since elements have to be shifted to close the gap.
 Implementation via binary search trees: the best option is a self-balancing binary search tree, which is complicated to implement.
 Implementation via hash tables: this is better than the above alternatives, but uses more memory.
Hash Tables
 Given a key, there is a way of jumping straight to the entry for that key.
 So there is no need to search at all.
 Assume that we have an array data to hold our entries.
 If we had a function h(k) that maps each key k to the index (an integer) where the associated entry will be stored, then we could just look up data[h(k)] to find the entry with key k.
 If the possible keys were just small numbers, say in the range 0 to 99, then we could use an array a[100] and let h(k) simply be k itself (the index into the array), making access to the values straightforward.
 BUT what if the keys are huge, say 14-digit NIDA ID numbers?
 Then we use a non-trivial function h, the so-called hash function, to map the space of possible keys onto the set of indices of our array.
 Particular attention should be paid to choosing the hash function h in such a way that collisions, i.e. different keys mapped to the same index, are unlikely to occur.
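
As a rough sketch (the table size, class and method names here are assumptions, not from the slides), a primary hash function for a large numeric key such as a 14-digit ID can simply reduce the key modulo the table size:

    class SimpleHash {
        static final int TABLE_SIZE = 101;     // a prime table size helps spread the keys

        // Map a non-negative key into the index range 0 .. TABLE_SIZE - 1.
        static int h(long key) {
            return (int) (key % TABLE_SIZE);
        }

        public static void main(String[] args) {
            long nidaId = 19990123456789L;     // hypothetical 14-digit ID number
            System.out.println(h(nidaId));     // index where this record would be stored
        }
    }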
Collision likelihoods and load factors for hash tables
The von Mises birthday paradox
 For 23 people's birthdays among 365 calendar days, the probability of a collision is greater than 50%.
 It may be surprising that p(22) = 0.476 and p(23) = 0.507, which means that as soon as there are more than 22 people in a group, it is more likely than not that two of them share a birthday.
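
In formula form (the standard birthday-problem calculation; the slides only quote the resulting values), the probability that at least two of n people share a birthday is

    p(n) = 1 − (365 × 364 × … × (365 − n + 1)) / 365^n

which gives p(22) ≈ 0.476 and p(23) ≈ 0.507.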
Implications for hash tables
 If 23 random locations in a table of size 365 have more than a 50% chance of overlapping, it seems
inevitable that collisions will occur in any hash table that does not waste an enormous amount of memory.
 And collisions will be even more likely if the hash function does not distribute the items randomly
throughout the table.
The load factor of a hash table
 Suppose we have a hash table of size m, and it currently has n entries.
 Then we call λ = n/m the load factor of the hash table.
 Therefore, to minimize collisions, it is prudent to keep the load factor low. Fifty percent is an often quoted
good maximum figure, while beyond an eighty percent load the performance deteriorates considerably.
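
As a quick worked example (the numbers match the 8-slot table on the following slides):

    λ = n / m = 4 / 8 = 0.5

i.e. a 50% load, which is exactly at the often-quoted maximum.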
A simple Hash Table in operation
 Assume we have the following table of key/value pairs, where the key is a name and the value is the age of the individual:

    Key      Value
    Paul     29
    Jane     35
    Chacha   18
    Alex     30

 We want to build a hash table (a dictionary in this case) such that ideally the operations insert(), search(), delete() and update() will be O(1).
 Say my hash table is an array of 8 elements, and I decide to locate each name by the distance of its first letter from the letter 'A'.
 Since I have only 8 spaces in my hash table, I can only store indices 0 – 7.
 But if, for example, the first letter is 'P' for Paul, I get 'P' – 'A' = 15, and I cannot store anything at index 15 in my array.
 Therefore I have to use a modulo operation; modular arithmetic more generally is widely used when constructing good hash functions.
 That is, 15 mod 8 = 7 (the remainder after dividing by 8). The result 7 is called the hash code, and (x – 'A') mod 8 is called the hash function; a short code sketch follows below.
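
A small code sketch of this hash function (class and method names are illustrative assumptions):

    class NameHash {
        static final int TABLE_SIZE = 8;

        // Distance of the (upper-cased) first letter from 'A', reduced modulo the table size.
        static int hash(String name) {
            char first = Character.toUpperCase(name.charAt(0));
            return (first - 'A') % TABLE_SIZE;
        }

        public static void main(String[] args) {
            System.out.println(hash("Paul"));    // 'P' - 'A' = 15, 15 mod 8 = 7
            System.out.println(hash("Jane"));    // 'J' - 'A' = 9,  9 mod 8 = 1
            System.out.println(hash("Chacha"));  // 'C' - 'A' = 2
            System.out.println(hash("Alex"));    // 'A' - 'A' = 0
        }
    }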
A simple Hash Table in operation (cont)

    Key      Value   Index
    Paul     29      7
    Jane     35      1
    Chacha   18      2
    Alex     30      0

    Index:  0        1        2          3   4   5   6   7
    Entry:  Alex 30  Jane 35  Chacha 18                  Paul 29

 What if we want to add the following names Amina, Chakubanga, Jamila, Peter?
Strategies for dealing with collisions
Buckets
 One obvious option is to reserve a two-dimensional array from the start.
 The disadvantage of this approach is that it has to reserve quite a bit more space than will be
eventually required, since it must take into account the likely maximal number of collisions.
 Also, when searching for a particular key, it is necessary to scan the entire column associated with its expected position, at least until an empty slot is reached (linear search). Alternatively, each column can be kept sorted so that binary search can be used.

    Index 0: Alex 30, Amina 23
    Index 1: Jane 35, Jamila 37
    Index 2: Chacha 18, Chakubanga 25
    Index 7: Paul 29, Peter 50
    (Indices 3 – 6 remain empty.)
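
A minimal sketch of the bucket strategy, assuming the first-letter hash function from the earlier slide; the sizes and the Entry record are illustrative assumptions:

    class BucketTable {
        record Entry(String key, int value) {}

        static final int TABLE_SIZE = 8;
        static final int BUCKET_DEPTH = 4;             // likely maximal number of collisions per index

        Entry[][] data = new Entry[TABLE_SIZE][BUCKET_DEPTH];

        int hash(String key) {
            return (Character.toUpperCase(key.charAt(0)) - 'A') % TABLE_SIZE;
        }

        void insert(String key, int value) {
            int index = hash(key);
            for (int i = 0; i < BUCKET_DEPTH; i++) {
                if (data[index][i] == null) {          // first empty slot in this column
                    data[index][i] = new Entry(key, value);
                    return;
                }
            }
            throw new IllegalStateException("bucket " + index + " is full");
        }
    }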
Strategies for dealing with collisions (cont)
Direct chaining
 Use linked lists instead of the full array.
 This approach does not reserve any space that will not be taken up, but it has the disadvantage that in order to find a particular item, a list has to be traversed.
 However, the hashing step still speeds up retrieval considerably.
 With a good hash function and a low load factor, the expected complexity of insertion, retrieval and deletion remains constant, i.e. O(1); see the sketch after the diagram below.
 For traversal in key order, we first need to sort the keys, which can be done in O(n log2 n).
 Hence, this method is better than the previous one.
    Index 0: Alex 30 → Amina 23
    Index 1: Jane 35 → Jamila 37
    Index 2: Chacha 18 → Chakubanga 25
    Index 7: Paul 29 → Peter 50
    (Indices 3 – 6 hold empty lists.)
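
A minimal sketch of direct chaining using the standard library's linked lists; the class names, the Entry record and the table size are illustrative assumptions:

    import java.util.LinkedList;

    class ChainedTable {
        record Entry(String key, int value) {}

        static final int TABLE_SIZE = 8;
        @SuppressWarnings("unchecked")
        LinkedList<Entry>[] slots = new LinkedList[TABLE_SIZE];

        int hash(String key) {
            return (Character.toUpperCase(key.charAt(0)) - 'A') % TABLE_SIZE;
        }

        void insert(String key, int value) {
            int index = hash(key);
            if (slots[index] == null) slots[index] = new LinkedList<>();
            slots[index].add(new Entry(key, value));   // a collision simply extends this list
        }

        Integer retrieve(String key) {
            LinkedList<Entry> chain = slots[hash(key)];
            if (chain == null) return null;
            for (Entry e : chain)                      // only this short chain is traversed
                if (e.key().equals(key)) return e.value();
            return null;
        }
    }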
Strategies for dealing with collisions (cont)
Open addressing
 involves finding another open location for any entry which cannot be placed where its
hash function points.
 We refer to that position as a key’s primary position (so in the earlier example, Alex
and Amina have the same primary position).
 The easiest strategy for achieving this is to search for open locations by simply
decreasing the index considered by one until we find an empty space.
 If this reaches the beginning of the array, i.e. index 0, we start again at the end. This
process is called linear probing.
 A better approach is to search for an empty location using a secondary hash function.
 This process is called double hashing.
Open addressing
Linear probing
 On a collision, insert into the nearest empty slot to the left of the primary position, by reducing the index one step at a time.
 A hash table that uses open addressing should have at least one empty slot at any time, and should be declared full when only one empty location is left.
 While the load factor is kept low, the time for search stays close to constant, i.e. O(1).
 However, linear probing creates clusters of occupied slots around popular hash codes. These blocks, or clusters, keep growing, not only when we hit the same primary location repeatedly, but also when we hit anything that is part of an existing cluster; the latter effect is called secondary clustering.
 Note that searching for keys is also adversely affected by these clustering effects.
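
A minimal sketch of linear probing as described above (step one slot to the left and wrap around past index 0); names and sizes are illustrative assumptions:

    class LinearProbingTable {
        record Entry(String key, int value) {}

        static final int TABLE_SIZE = 8;
        Entry[] data = new Entry[TABLE_SIZE];
        int count = 0;

        int hash(String key) {
            return (Character.toUpperCase(key.charAt(0)) - 'A') % TABLE_SIZE;
        }

        void insert(String key, int value) {
            if (count >= TABLE_SIZE - 1)               // always keep at least one empty slot
                throw new IllegalStateException("table full");
            int index = hash(key);                     // the key's primary position
            while (data[index] != null)                // probe to the left, wrapping past index 0
                index = (index - 1 + TABLE_SIZE) % TABLE_SIZE;
            data[index] = new Entry(key, value);
            count++;
        }
    }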
Open addressing (cont)
Double hashing
 The obvious way to avoid the clustering problems of linear probing is to do something
slightly more sophisticated than trying every position to the left until we find an empty
one.
 We apply a secondary hash function to tell us how many slots to jump to look for an
empty slot if a key’s primary position has been filled already.
 Say we choose a secondary hash function such as ((x – a) / 6) mod 6; if it evaluates to 3 for a given key, then we look for an empty slot every 3rd location.
 Suppose you want to insert (Peter, 50): its primary position, index 7, is already occupied by Paul, so counting 3 slots to the left places it at index 4, as shown below.

    Index:  0        1        2          3   4         5   6   7
            Alex 30  Jane 35  Chacha 18      Peter 50          Paul 29

    (Counting 3 slots to the left of Paul's slot: index 6 is the 1st, 5 the 2nd and 4 the 3rd, so Peter is placed at index 4.)
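
A minimal sketch of double hashing; the secondary hash used here, 1 + (x mod (m − 1)), is a common textbook choice and an assumption of this sketch, so it does not reproduce the slide's exact step values:

    class DoubleHashingTable {
        record Entry(String key, int value) {}

        static final int TABLE_SIZE = 8;
        Entry[] data = new Entry[TABLE_SIZE];

        int primaryHash(String key) {
            return (Character.toUpperCase(key.charAt(0)) - 'A') % TABLE_SIZE;
        }

        // Secondary hash: decides how many slots to jump; never returns 0.
        int secondaryHash(String key) {
            int x = Character.toUpperCase(key.charAt(0)) - 'A';
            return 1 + (x % (TABLE_SIZE - 1));
        }

        void insert(String key, int value) {
            int index = primaryHash(key);
            int step = secondaryHash(key);             // used only when the primary position is taken
            for (int probes = 0; probes < TABLE_SIZE; probes++) {
                if (data[index] == null) {
                    data[index] = new Entry(key, value);
                    return;
                }
                index = (index - step + TABLE_SIZE) % TABLE_SIZE;  // jump 'step' slots to the left, wrapping
            }
            // With a prime table size (see the next slide) every slot would be reachable.
            throw new IllegalStateException("no free slot reachable with this step");
        }
    }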


Choosing good hash functions
For primary hash functions
 make sure that it spreads the space of possible keys onto the set of hash table indices as evenly
as possible, so that few collisions occur.
 it is advantageous if any potential clusters in the space of possible keys are broken up, so that similar keys do not produce a continuous run of occupied locations.
For secondary hash functions
 different keys with the same primary position give different results when the secondary hash
function is applied.
 one has to be careful to ensure that the secondary hash function cannot result in a number which
has a common divisor with the size of the hash table.
For example, if the hash table has size 10, and we get a secondary hash function which gives 2 (or
4, 6 or 8) as a result, then only half of the locations will be checked, which might result in failure
(an endless loop, for example) while the table is still half empty. Even for large hash tables, this can
still be a problem if the secondary hash keys can be similarly large.
A simple remedy for this is to always make the size of the hash table a prime number.
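
A small illustration of the divisor problem described above (the numbers are hypothetical): with a table of size 10 and a secondary-hash step of 2, probing from index 0 only ever visits the even slots:

    class ProbeCycleDemo {
        public static void main(String[] args) {
            int tableSize = 10, step = 2, index = 0;
            for (int i = 0; i < 10; i++) {
                System.out.print(index + " ");         // prints 0 8 6 4 2 0 8 6 4 2
                index = (index - step + tableSize) % tableSize;
            }
            // The odd indices are never checked, so insertion can fail (or loop forever)
            // while the table is still half empty; a prime table size avoids this.
        }
    }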
Applications of Hash tables
 Password Verification
Cryptographic hash functions are very commonly used in password verification. Let's understand this using an example:
When you use any online website that requires a user login, you enter your e-mail and password to authenticate that the account you are trying to use belongs to you. When the password is entered, a hash of the password is computed and sent to the server for verification. The passwords stored on the server are actually the computed hash values of the original passwords, not the passwords themselves. This is done so that the original password is not exposed even if the traffic between client and server is sniffed.
 File System
Hashing is used to link a file name to the location of the file. When you interact with a file system as a user, you see the file name and perhaps the path to the file. But to store the correspondence between the file name and path and the physical location of that file on the disk, the system uses a map, and that map is usually implemented as a hash table.
