
UNIT-5

Hashing
• General Idea
• Advantages
• Hash Function
• Collisions
• Collision Resolution Techniques
• Separate Chaining
• Linear Probing
• Quadratic Probing
• Double Hashing
• Application
• Rehashing & Extensible Hashing
• Dictionaries
Introduction
• Hashing is a technique that is used to uniquely identify a specific object
from a group of similar objects.
• Some examples of how hashing is used in our lives include:
• In universities, each student is assigned a unique roll number that can be
used to retrieve information about them.
• In libraries, each book is assigned a unique number that can be used to
determine information about the book, such as its exact position in the
library or the users to whom it has been issued.
• In both these examples, the students and books were hashed to a
unique number.
Continue…
• Assume that you have an object and you want to assign a key to it to
make searching easy. To store the key/value pair, you can use a simple
array-like data structure where the keys (integers) can be used directly as
indices to store values.
• However, in cases where the keys are large and cannot be used directly
as an index, you should use hashing.
Continue…
• In hashing, large keys are converted into small keys by using hash functions.
• The values are then stored in a data structure called a hash table.
• The idea of hashing is to distribute entries (key/value pairs)
uniformly across an array.
• Each element is assigned a key (converted key).
• By using that key you can access the element in O(1) time. Using the key, the
algorithm (hash function) computes an index that suggests where an entry can
be found or inserted.
Continue…
• Hashing is implemented in two steps:
• An element's key is converted into an integer by using a hash function. This integer can
be used as an index to store the original element, which falls into the hash table.
• The element is stored in the hash table where it can be quickly retrieved using
the hashed key.
• hash = hashfunc(key)
• index = hash % array_size
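
A minimal sketch of these two steps in Python (hashfunc here is an illustrative choice, not a prescribed function; any mapping from keys to integers would do):

def hashfunc(key):
    return hash(key)                  # step 1: convert the key to an integer

array_size = 10
table = [None] * array_size           # a simple fixed-size table

def insert(key, value):
    index = hashfunc(key) % array_size    # step 2: map the integer to a slot
    table[index] = (key, value)            # ignores collisions for now

insert("apple", 100)

Collision handling, covered later in this unit, is deliberately left out of this sketch.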
Advantages
• The main advantage of hash tables over other table data structures is
speed.
• This advantage is more apparent when the number of entries is large
(thousands or more).
• Hash tables are particularly efficient when the maximum number of
entries can be predicted in advance, so that the bucket array can be
allocated once with the optimum size and never resized.
Hash function
• A hash function is any function that can be used to map a data set of an
arbitrary size to a data set of a fixed size, which falls into the hash table.
• The values returned by a hash function are called hash values, hash codes,
hash sums, or simply hashes.
Continue…
• To achieve a good hashing mechanism, it is important to have a good hash
function with the following basic requirements:
• Easy to compute
• Uniform distribution
• Less Collision
Different Hash functions
Division Method
• It is the simplest method of hashing an integer x. This method divides x by M
and uses the remainder obtained. In this case, the hash function can be given as:
h(x) = x mod M
Example:
• Calculate the hash values of keys 1234 and 5642.
Solution: Setting M = 97, the hash values can be calculated as:
h(1234) = 1234 % 97 = 70
h(5642) = 5642 % 97 = 16
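
A one-line sketch of the division method in Python, reproducing the example above (M = 97 as in the example):

def division_hash(x, M=97):
    return x % M                  # h(x) = x mod M

print(division_hash(1234))        # 70
print(division_hash(5642))        # 16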
Continue…
• Multiplication Method
The steps involved in the multiplication method are as follows:
Step 1: Choose a constant A such that 0 < A < 1.
Step 2: Multiply the key k by A.
Step 3: Extract the fractional part of kA.
Step 4: Multiply the result of Step 3 by the size of hash table (m).
Hence, the hash function can be given as:
h(k) = floor(m (kA mod 1) )
• where (kA mod 1) gives the fractional part of kA and m is the total number
of indices in the hash table.
Continue…
Example:
Given a hash table of size 1000, map the key 12345 to an appropriate
location in the hash table.
Solution: We will use A = 0.618033, m = 1000, and k = 12345.
h(12345) = floor(1000 × (12345 × 0.618033 mod 1))
h(12345) = floor(1000 × (7629.617385 mod 1))
h(12345) = floor(1000 × 0.617385)
h(12345) = floor(617.385)
h(12345) = 617
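
A short sketch of the multiplication method, using the same A = 0.618033 and m = 1000 as the worked example:

import math

def multiplication_hash(k, m=1000, A=0.618033):
    frac = (k * A) % 1            # the fractional part of k*A
    return math.floor(m * frac)   # h(k) = floor(m * (kA mod 1))

print(multiplication_hash(12345))  # 617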
Continue…
Mid-Square Method
The mid-square method is a good hash function which works in two steps:
Step 1: Square the value of the key. That is, find k².
Step 2: Extract the middle r digits of the result obtained in Step 1.
Example:
Calculate the hash values for keys 1234 and 5642 using the mid-square method. The hash table has
100 memory locations.
Solution:
Note that the hash table has 100 memory locations whose indices vary from 0 to 99.
This means that only two digits are needed to map a key to a location in the hash table, so r = 2.
When k = 1234, k² = 1522756; the middle two digits give h(1234) = 27.
When k = 5642, k² = 31832164; the middle two digits give h(5642) = 32.
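
A sketch of the mid-square method; since the "middle" of an odd number of digits is ambiguous, the starting position below is chosen so that the worked example is reproduced:

def mid_square_hash(k, r=2):
    square = str(k * k)                    # step 1: square the key
    start = (len(square) - r + 1) // 2     # step 2: locate the middle r digits
    return int(square[start:start + r])

print(mid_square_hash(1234))   # 1234^2 = 1522756 -> 27
print(mid_square_hash(5642))   # 5642^2 = 31832164 -> 32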
Continue…
Folding Method
The folding method works in the following two steps:
Step 1: Divide the key value into a number of parts. That is, divide
k into parts k1, k2, ..., kn, where each part has the same number of
digits except the last part, which may have fewer digits than the other
parts.
Step 2: Add the individual parts. That is, obtain the sum k1 + k2 +
... + kn. The hash value is produced by ignoring the last carry, if any.
Continue…
Example:
Given a hash table of 100 locations, calculate the hash values
using the folding method for keys 5678, 321, and 34567.
Solution:
Since there are 100 memory locations to address, we will break
each key into parts where each part (except the last) will contain two
digits. The hash values can be obtained as shown below:
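
A small sketch that reproduces the values; "ignoring the last carry" is implemented here by taking the sum modulo the table size of 100:

def folding_hash(key, part_digits=2, table_size=100):
    digits = str(key)
    # split the key (left to right) into two-digit parts
    parts = [int(digits[i:i + part_digits]) for i in range(0, len(digits), part_digits)]
    return sum(parts) % table_size         # ignore any carry beyond two digits

print(folding_hash(5678))    # 56 + 78 = 134 -> ignore the carry -> 34
print(folding_hash(321))     # 32 + 1 = 33
print(folding_hash(34567))   # 34 + 56 + 7 = 97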
Continue…
Collisions
• Collisions occur when the hash function maps two different keys to
the same location. Obviously, two records cannot be stored in the
same location.
• Therefore, a method used to solve the problem of collisions, also
called a collision resolution technique, is applied.
• The two most popular methods of resolving collisions are:
1. Open addressing
2. Chaining
Open addressing
• Choosing a hash function that minimizes the number of collisions
and also hashes keys uniformly is another critical issue.
• In open addressing (also called closed hashing) we use:
• Linear probing
• Quadratic probing
• Double hashing
Linear Probing
• In open addressing, instead of in linked lists, all entry records are
stored in the array itself.
• When a new entry has to be inserted, the hash index of the hashed
value is computed and then the array is examined (starting with the
hashed index).
• If the slot at the hashed index is unoccupied, then the entry record is
inserted in the slot at the hashed index; otherwise it proceeds in some probe
sequence until it finds an unoccupied slot.
Continue....
• The probe sequence is the sequence that is followed while traversing
through entries. In different probe sequences, you can have different
intervals between successive entry slots or probes.
• When searching for an entry, the array is scanned in the same
sequence until either the target element is found or an unused slot is
found. This indicates that there is no such key in the table. The name
"open addressing" refers to the fact that the location or address of the
item is not determined solely by its hash value.
Continue....
• Linear probing is when the interval between successive probes is
fixed (usually to 1). Let’s assume that the hashed index for a particular
entry is index. The probing sequence for linear probing will be:
index = index % hashTableSize
index = (index + 1) % hashTableSize
index = (index + 2) % hashTableSize
index = (index + 3) % hashTableSize
let hash(x) be the slot index computed using a hash function and M be the table size

If slot hash(x) % M is full, then we try (hash(x) + 1) % M


If (hash(x) + 1) % M is also full, then we try (hash(x) + 2) % M
If (hash(x) + 2) % M is also full, then we try (hash(x) + 3) % M
..................................................
..................................................
Let us consider a simple hash function as “key mod 7” and a sequence of keys as 50, 700, 76, 85,
92, 73, 101.
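
A sketch that inserts these keys into a table of size 7 with linear probing (the table is exactly filled by the seven keys, so no full-table check is included):

M = 7
table = [None] * M

def linear_probe_insert(key):
    index = key % M
    while table[index] is not None:        # probe index, index+1, index+2, ... mod M
        index = (index + 1) % M
    table[index] = key

for k in [50, 700, 76, 85, 92, 73, 101]:
    linear_probe_insert(k)

print(table)   # [700, 50, 85, 92, 73, 101, 76]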
Quadratic Probing
• Quadratic probing is similar to linear probing and the only difference
is the interval between successive probes or entry slots.
• Here, when the slot at a hashed index for an entry record is already
occupied, you must start traversing until you find an unoccupied slot.
• The interval between slots is computed by adding the successive value
of an arbitrary polynomial in the original hashed index.
Continue....
• Let us assume that the hashed index for an entry is index and at
index there is an occupied slot. The probe sequence will be as
follows:
let hash(x) be the slot index computed using hash function.
If slot hash(x) % M is full, then we try (hash(x) + 1*1) % M
If (hash(x) + 1*1) % M is also full, then we try (hash(x) + 2*2) % M
If (hash(x) + 2*2) % M is also full, then we try (hash(x) + 3*3) % M

Suppose we have a hash table of size 20 (m = 20). We want to insert the following elements using quadratic probing:
{96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}
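
A sketch of quadratic probing for this example (m = 20, h(x) = x mod 20; the duplicate key 48 is simply inserted into a second slot since no duplicate check is made):

M = 20
table = [None] * M

def quadratic_probe_insert(key):
    base, i = key % M, 0
    while table[(base + i * i) % M] is not None:   # probe (base + i^2) mod M
        i += 1
    table[(base + i * i) % M] = key

for k in [96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61]:
    quadratic_probe_insert(k)

print(table)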
Double Hashing
• Double hashing is similar to linear probing and the only difference is
the interval between successive probes. Here, the interval between
probes is computed by using two hash functions.
• Let us say that the hashed index for an entry record is an index that is
computed by one hashing function and the slot at that index is already
occupied. You must start traversing in a specific probing sequence to
look for an unoccupied slot. The probing sequence will be:

index = (index + 1 * indexH) % hashTableSize;
index = (index + 2 * indexH) % hashTableSize;
(where indexH is the value returned by the second hash function)
let hash(x) be the slot index computed using hash function.
If slot hash(x) % M is full, then we try (hash(x) + 1*hash2(x)) % M
If (hash(x) + 1*hash2(x)) % M is also full, then we try (hash(x) + 2*hash2(x)) % M
If (hash(x) + 2*hash2(x)) % M is also full, then we try (hash(x) + 3*hash2(x)) % M

Suppose we have a hash table of size 20 (m = 20). We want to insert the following elements using double hashing:
{96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}
h1(x) = x mod 20
h2(x) = x mod 13
h(x, i) = (h1(x) + i·h2(x)) mod 20
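
A sketch of double hashing with these two functions. Note that h2(x) = x mod 13 can evaluate to 0, which would stall the probe sequence; the keys in this example happen to avoid that case because 65 lands in an empty slot on its first probe:

M = 20
table = [None] * M
h1 = lambda x: x % 20
h2 = lambda x: x % 13

def double_hash_insert(key):
    i = 0
    # probe h(x, i) = (h1(x) + i*h2(x)) mod 20 for i = 0, 1, 2, ...
    while table[(h1(key) + i * h2(key)) % M] is not None:
        i += 1
    table[(h1(key) + i * h2(key)) % M] = key

for k in [96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61]:
    double_hash_insert(k)

print(table)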
Separate Chaining (Open Hashing)
• Separate chaining is one of the most commonly used collision
resolution techniques.
• It is usually implemented using linked lists. In separate chaining,
each element of the hash table is a linked list.
• To store an element in the hash table you must insert it into a
specific linked list.
• If there is any collision (i.e. two different elements have the same hash
value), then store both elements in the same linked list.
Example-separate chaining
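
A minimal sketch of separate chaining, using Python lists as the chains and a table size of 7 (an assumption for illustration):

M = 7
table = [[] for _ in range(M)]     # one chain per slot

def chain_insert(key):
    table[key % M].append(key)     # colliding keys share the same chain

def chain_search(key):
    return key in table[key % M]

for k in [50, 700, 76, 85, 92, 73, 101]:
    chain_insert(k)

print(table)             # keys 50, 85, 92 all hash to 1 and share chain 1
print(chain_search(85))  # True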
Applications
• Associative arrays: Hash tables are commonly used to implement many types of
in-memory tables. They are used to implement associative arrays (arrays whose
indices are arbitrary strings or other complicated objects).
• Database indexing: Hash tables may also be used as disk-based data structures
and database indices (such as in DBMS).
• Caches: Hash tables can be used to implement caches i.e. auxiliary data tables
that are used to speed up the access to data, which is primarily stored in slower
media.
• Object representation: Several dynamic languages, such as Perl, Python,
JavaScript, and Ruby use hash tables to implement objects.
• Hash Functions are used in various algorithms to make their computing faster.
Rehashing:
Rehashing means hashing again.

 Every hash table maintains a load factor, which specifies how full the table may become before collisions start to degrade performance.
Load factor = n / k
Here n = number of elements occupied in the hash table
k = size of the table
 Basically, when the load factor increases beyond its pre-defined value (the default value of the load factor is 0.75), the complexity increases.
 So to overcome this, the size of the array is increased (doubled) and all the values are hashed again and stored in the new double-sized array
to maintain a low load factor and low complexity.

Rehashing is done because whenever key-value pairs are inserted, the load factor increases, which implies that the time complexity also increases
as explained above. This might not give the required time complexity of O(1).
Hence, rehashing must be done, increasing the size of the bucketArray so as to reduce the load factor and the time complexity.

Rehashing can be done as follows:


 For each addition of a new entry, check the load factor.
 If it’s greater than its pre-defined value (or the default value of 0.75 if not given), then rehash.
 For rehashing, make a new array of double the previous size and make it the new bucketArray.
 Then traverse each element in the old bucketArray and call insert() for each, so as to insert it into the new, larger bucketArray, as sketched below.
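
A hedged sketch of this procedure (separate chaining, a load-factor threshold of 0.75, and doubling on rehash; the name bucketArray follows the text):

class SimpleHashTable:
    def __init__(self, size=8, max_load=0.75):
        self.bucketArray = [[] for _ in range(size)]
        self.count = 0
        self.max_load = max_load

    def insert(self, key, value):
        self.bucketArray[hash(key) % len(self.bucketArray)].append((key, value))
        self.count += 1
        if self.count / len(self.bucketArray) > self.max_load:   # check the load factor
            self.rehash()

    def rehash(self):
        old = self.bucketArray
        self.bucketArray = [[] for _ in range(2 * len(old))]     # double the array
        self.count = 0
        for chain in old:
            for key, value in chain:                             # re-insert every entry
                self.insert(key, value)

t = SimpleHashTable()
for i in range(20):
    t.insert(i, i * i)
print(len(t.bucketArray))   # the array has grown beyond the initial size of 8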
Extendible Hashing
It is a dynamic hashing method wherein directories and buckets are used to hash data. It is an aggressively flexible method in which the hash
function also experiences dynamic changes.

Main features of Extendible Hashing:
 Directories: The directories store addresses of the buckets in pointers. An id is assigned to each directory, which may change each time
Directory Expansion takes place.

 Buckets: The buckets are used to hash the actual data.

Basic Structure of Extendible Hashing:


The different terms used in Extendible Hashing:

 Directories: These containers store pointers to buckets. Each directory is given a unique id, which may change each time expansion takes
place. The hash function returns this directory id, which is used to navigate to the appropriate bucket. Number of Directories = 2^Global Depth.
 Buckets: They store the hashed keys. Directories point to buckets. A bucket may have more than one pointer to it if its local depth is less
than the global depth.
 Global Depth: It is associated with the Directories. It denotes the number of bits used by the hash function to categorize the keys.
Global Depth = Number of bits in directory id.
 Local Depth: It is the same as the Global Depth except for the fact that Local Depth is associated with the buckets and not the directories.
The local depth, in accordance with the global depth, is used to decide the action to be performed in case an overflow occurs. Local Depth is
always less than or equal to the Global Depth.
 Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then the bucket is split into two parts.
 Directory Expansion: Directory Expansion Takes place when a bucket overflows. Directory Expansion is performed when the local depth of
the overflowing bucket is equal to the global depth.

 Step 1 – Analyze Data Elements: Data elements may exist in various forms, e.g., integer, string, float, etc. Currently, let us consider data
elements of type integer, e.g., 49.
 Step 2 – Convert into binary format: Convert the data element into binary form. For string elements, consider the ASCII equivalent integer of
the starting character and then convert that integer into binary form. Since we have 49 as our data element, its binary form is 110001.
 Step 3 – Check Global Depth of the directory. Suppose the global depth of the Hash-directory is 3.
 Step 4 – Identify the Directory: Consider the ‘Global-Depth’ number of LSBs in the binary number and match it to the directory id (see the sketch after this list).
E.g. the binary obtained is 110001 and the global depth is 3. So, the hash function will return the 3 LSBs of 110001, viz. 001.
 Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id 001.
 Step 6 – Insertion and Overflow Check: Insert the element and check if the bucket overflows. If an overflow is encountered, go to step
7 followed by Step 8, otherwise, go to step 9.
 Step 7 – Tackling Over Flow Condition during Data Insertion: Many times, while inserting data in the buckets, it might happen that the
Bucket overflows. In such cases, we need to follow an appropriate procedure to avoid mishandling of data.
First, Check if the local depth is less than or equal to the global depth. Then choose one of the cases below.
 Case1: If the local depth of the overflowing Bucket is equal to the global depth, then Directory Expansion, as well as Bucket Split,
needs to be performed. Then increment the global depth and the local depth value by 1. And, assign appropriate pointers.
Directory expansion will double the number of directories present in the hash structure.
 Case2: In case the local depth is less than the global depth, then only Bucket Split takes place. Then increment only the local depth
value by 1. And, assign appropriate pointers.

 Step 8 – Rehashing of Split Bucket Elements: The Elements present in the overflowing bucket that is split are rehashed w.r.t the new global
depth of the directory.
 Step 9 – The element is successfully hashed.
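
A tiny sketch of the directory lookup in Steps 2–4: the directory id is just the global-depth number of least-significant bits of the key. Bucket splitting and directory expansion (Steps 7–8) are deliberately omitted here:

def directory_id(key, global_depth):
    return key & ((1 << global_depth) - 1)     # keep only the LSBs

# With global depth 3, key 49 (binary 110001) maps to directory 001.
print(format(directory_id(49, 3), "03b"))      # 001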
Example based on Extendible Hashing: Now, let us consider a prominent example of hashing the following
elements: 16,4,6,22,24,10,31,7,9,20,26.

Bucket Size: 3 (Assume)

Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.
 Solution: First, calculate the binary forms of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 01101
 Initially, the global-depth and local-depth are always 1. Thus, the hashing frame looks like this:

 Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the
directory with id=0.

 Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 in their LSB. Hence, they are hashed as follows:
 Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed by directory 0 is already full. Hence, Over Flow occurs.

 As directed by Step 7-Case 1, since Local Depth = Global Depth, the bucket splits and directory expansion takes place. Also, rehashing of the
numbers present in the overflowing bucket takes place after the split. And, since the global depth is incremented by 1, the global depth is now
2. Hence, 16, 4, 6, 22 are now rehashed w.r.t. 2 LSBs. [ 16(10000), 4(100), 6(110), 22(10110) ]
 *Notice that the bucket which did not overflow has remained untouched. But, since the number of directories has doubled, we now have 2
directories, 01 and 11, pointing to the same bucket. This is because the local depth of that bucket has remained 1. And, any bucket having a
local depth less than the global depth is pointed to by more than one directory.
 Inserting 24 and 10: 24(11000) and 10 (1010) can be hashed based on directories with id 00 and 10. Here, we encounter no overflow
condition.
 Inserting 31,7,9: All of these elements[ 31(11111), 7(111), 9(1001) ] have either 01 or 11 in their LSBs. Hence, they are mapped on the
bucket pointed out by 01 and 11. We do not encounter any overflow condition here.
 Inserting 20: Insertion of data element 20 (10100) will again cause the overflow problem.
 20 is inserted in the bucket pointed out by 00. As directed by Step 7-Case 1, since the local depth of the bucket = global depth, directory
expansion (doubling) takes place along with bucket splitting. Elements present in the overflowing bucket are rehashed with the new global depth.
Now, the new hash table looks like this:
 Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are considered. Therefore 26 best fits in the bucket pointed out by directory 010.

 The bucket overflows, and, as directed by Step 7-Case 2, since the local depth of bucket < Global depth (2<3), directories are not doubled
but, only the bucket is split and elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.
 Hashing of 11 Numbers is Thus Completed.
Advantages:
1. Data retrieval is less expensive (in terms of computing).
2. No problem of Data-loss since the storage capacity increases dynamically.
3. With dynamic changes in hashing function, associated old values are rehashed w.r.t the new hash function.
Limitations Of Extendible Hashing:
1. The directory size may increase significantly if several records are hashed to the same directory, i.e., when the record distribution is non-
uniform.
2. Size of every bucket is fixed.
3. Memory is wasted in pointers when the global depth and local depth difference becomes drastic.
4. This method is complicated to code

Dictionaries
• In a dictionary, we separate the data into two parts.
• Each item stored in a dictionary is represented by a key/value pair.
• The key is used to access the item.
• With the key you can access the value, which typically has more
information.
• Each key identifies one entry; that is, each key is unique.
• However, nothing prevents two different keys from referencing the same
value.
• The contains test is, in a dictionary, replaced by a test to see if a given key
is legal.
• Finally, data is removed from a dictionary by specifying the key for the data
value to be deleted.
Operations on Dictionaries
Operation Function

put(key, value) Place the key and value association into the dictionary

get(key) Retrieve the value associated with the given key.

containsKey(key) Return true if key is found in dictionary

removeKey(key) Remove the key and its associated value from the dictionary

keys() Return iterator for keys in dictionary

size() Return number of elements in dictionary
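
For comparison, Python's built-in dict supports the same operations; the mapping shown below is illustrative:

d = {}                    # an empty dictionary
d["alice"] = 42           # put(key, value)
print(d["alice"])         # get(key)         -> 42
print("alice" in d)       # containsKey(key) -> True
del d["alice"]            # removeKey(key)
print(list(d.keys()))     # keys()           -> []
print(len(d))             # size()           -> 0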
