Binary Search, Hashing and File Structures
A linear search or sequential search is a method for finding the location of an element to be
searched within a list. It sequentially checks each element of the list until a match is found or
until the list is exhausted, i.e., all the elements have been searched.
The following figure of an array shows how the search starts from the first element and then proceeds one element at a time in sequential order.
Index: 0  1  2  3  4  5  6  7
Value: 12 15 10 25 30 28 50 17
(The search starts at index 0.)
If the item is in the list, there are three different possibilities. In the best case, we are fortunate enough to find the element in the first position, at the beginning of the list; then we need only one comparison. In the worst case, we do not find the item until the very last comparison, the nth comparison.
In the average case, we find the item about halfway into the list; that is, about n/2 comparisons are made. As n gets large, the coefficient becomes insignificant in our approximation; therefore, the complexity of sequential search is O(n).
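The sequential scan described above can be sketched in Python (the function name linear_search is our choice for illustration):

```python
def linear_search(items, target):
    """Scan the list from the beginning; return the index of the
    first match, or -1 if the list is exhausted."""
    for i, value in enumerate(items):
        if value == target:   # one comparison per element
            return i
    return -1

# The array from the figure above:
arr = [12, 15, 10, 25, 30, 28, 50, 17]
print(linear_search(arr, 25))  # found at index 3
print(linear_search(arr, 99))  # not present: -1
```

The worst case (99, not present) walks all n elements, matching the O(n) bound above.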
Binary Search
The binary search algorithm is also known as Half Interval Search or Logarithmic Search. The
prerequisite of a binary search is that the list should be sorted in an ascending or in a descending
order. Binary search compares the element to be searched with the middle element of the list;
if they are unequal, the half in which the element cannot lie is eliminated and the search
continues on the remaining half until it is successful. If the search ends with the remaining half
being empty, the element is not in the list. It works by repeatedly dividing in half the portion
of the list that may contain the item, until we have narrowed down the possible locations to just
one.
Example:
Let us consider a sorted list of integers in ascending order, as given below, in which we want to search for the element 35:
5 10 15 20 25 30 35 40 45 50 55
Step 1:
The middle element in the list is 30. Compare 35 with 30. As 35>30, discard the first half of
the list because now it is evident that the element to be searched is in the second half of the list
or is not present in the list at all. We actually ignore half of the elements just after one
comparison.
5 10 15 20 25 30 35 40 45 50 55
Step 2:
The middle element of the second half of the list is 45. Compare 35 with 45. As 35<45, discard the second half of the list because the element cannot lie in the second half.
5 10 15 20 25 30 35 40 45 50 55
Step 3:
Now the middle element is 35, which is exactly the element we were searching for.
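The halving procedure above can be sketched in Python (the function name binary_search is ours):

```python
def binary_search(sorted_list, target):
    """Repeatedly halve the portion of the list that can contain
    the target; return its index, or -1 if it is absent."""
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            low = mid + 1   # discard the first half
        else:
            high = mid - 1  # discard the second half
    return -1

# The sorted list from the example above, searching for 35:
data = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
print(binary_search(data, 35))  # index 6, found after three probes
```

Tracing the call reproduces the three steps of the example: the probes visit 30, then 45, then 35.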
HASH TABLE
The searching methods studied earlier were generally based on comparisons of keys. We need search techniques that avoid unnecessary key comparisons. The objective of searching for a desired record efficiently and quickly can be achieved by minimizing the number of comparisons. In a hashing technique, the location of the desired record in the search table depends only on the given key and not on the other keys; therefore it is a very efficient searching technique.
Suppose we have an employee table having n records, each identified by a unique enrollment-number key that takes values from 1 to n inclusive. If the enrollment number is used as an index into the employee table, we can directly find the information of an employee in that table. Therefore, arrays can be used to organize the records in such a search table.
Hashing is a technique that is used to uniquely identify a specific object from a group of similar
objects. Some examples of hashing may be seen in our daily lives as illustrated below:
• In a school, every student is assigned a unique roll number that can be used to retrieve
information about her.
• In a library, each book is assigned a unique number that can be used to get information
about the book, such as its location in the library or the person to whom it has been issued.
In both of the above examples the students and the books were hashed to a unique number.
Suppose we have an object and we want to assign a key to it to make searching easy. To store
the key/value pair, we can use an array data structure where keys can be used directly as an
index to store values.
Let us consider that there are five records whose keys are
5 6 9 8 3
The keys are stored in an array arr as shown below:
Index: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
Key:    -   -   -   3   -   5   6   -   8   9
Here, we can see that the record which has a key value 6 can be directly accessed through an
array index arr[6].
However, in cases where the keys are large and cannot be used directly as an index, we should
use hashing.
In hashing, large keys are converted into small keys by using hash functions, and the values are then stored in a data structure called a hash table. In a hash table, each element is assigned a key (the converted small key); using that key, we can access the element in O(1) time. Given a key, the hash function computes an index that tells where the entry can be found or inserted.
Hashing
Hashing is a technique which converts a range of key values into a range of index values of an array; we use the modulo operator (%) to map a key into that range. Hashing can be implemented in two steps, as illustrated below:
1. An element's key is converted into an integer by using a hash function. This integer can be used as the index at which the element is stored in the hash table.
H = hashFunction(key)
2. The element is stored in the hash table, where it can be quickly retrieved using the hashed key.
In this method, the hash H is independent of the array size and it is then reduced to an index (a
number between 0 and N − 1) by using modulo operator (%).
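The two steps above can be sketched as follows (the names hash_index and N, and the use of Python's built-in hash as the hash function, are our assumptions for illustration):

```python
N = 10  # table size (an assumed value for this sketch)

def hash_index(key):
    """Step 1: map the key to an integer H with a hash function.
    Step 2: reduce H to an index between 0 and N - 1 with the
    modulo operator (%)."""
    H = hash(key)        # any hash function may be used here
    return H % N

# Store a record at the index computed from its key:
table = [None] * N
key = 42
table[hash_index(key)] = "some record"
```

Note that H itself is independent of the array size; only the final modulo step ties it to the table.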
Hash function
A hash function maps a data set of arbitrary size to a data set of a fixed size that fits into the hash table. The values returned by a hash function are called hash values or hash codes.
Hash Table is a data structure which stores data in an associative manner. In a hash table, data
elements are stored in an array, where each data element has its own unique index value. Access
of data becomes very fast if we know the index of the desired data element.
Therefore, hash table is a data structure in which insertion and search operations are very fast
irrespective of the size of the data set. Hash Table uses an array as storage and uses hash
technique to generate an index where an element is located or it has to be inserted.
The basic idea in hashing is the transformation of a key into the corresponding location in the hash table. This is done by applying a hash function. A hash function can be defined as a function that takes a key as input and transforms it into a hash index. It is generally denoted by H.
H(K) -> M
Where, H is the hash function, K is the set of keys, M is the set of memory addresses.
Example:
Let us consider a simple hash table:
Index Key, Value
0
1 0001, Data 1
2
3 0011, Data 3
4 0100, Data 4
5
6
7 0111, Data 7
8
9 1001, Data 9
Here, the keys are applied to hash function to find the addresses.
H(0001) -> 1, At address with index 1, Data 1 is stored.
H(0011) -> 3, At address with index 3, Data 3 is stored.
H(0111) -> 7, At address with index 7, Data 7 is stored.
H(1001) -> 9, At address with index 9, Data 9 is stored.
Sometimes such a function H may not yield distinct values; it is possible that two different keys
K1 and K2 may yield the same hash address. This situation is called hash collision.
Example:
Suppose the keys are 16, 24, 26, 31, 35 and the hash function to find the index is defined as
K % 10 -> M
Then, the indices are generated as follows:
16%10 = 6
24%10 = 4
26%10 = 6
31%10 = 1
35%10 = 5
As we can see, when the keys 16 and 26 are given as inputs to the hash function, the same index value 6 is generated. Two items cannot be stored at the same address (index); this situation is known as a hash collision.
Choosing a hash function
It is important to choose a good hash function. The basic requirements to achieve a good
hashing mechanism are as illustrated below:
1. Hash function should be easy to compute.
2. It should provide a uniform distribution across the hash table and should not result in
clustering.
3. Collisions occur when pairs of elements are mapped to the same hash value. The ideal
case is that a hash function is chosen in such a way that no collisions occur. But,
irrespective of how good a hash function is, collisions are bound to occur. Therefore,
to maintain the performance of a hash table, it is important to manage collisions through
various collision resolution techniques.
Truncation Method
In the truncation method, a part of the key is taken as the address. Let us take a few 6-digit keys:
754321, 457643, 249801, 443276, 234598
Suppose we choose the two least significant digits (two rightmost digits) as the addresses of
given keys. Then, the addresses are 21, 43, 01, 76 and 98 respectively. It is the simplest way to
compute a hash address, but here the chances of collision are very high.
For floating point numbers, discard the integral part and multiply the fractional part of a given
key by the table size. An integral part of the result is the hash address.
For example, if the key is 334.4704 and the table size is 37. Then
.4704 x 37 = 17.4048
Therefore, Hash (334.4704) = 17
If the key is in the range of 0 to 1, then multiply it by the table size. The integral part of the
result is the hash address.
For example, if the key is .87654 and the table size is 37. Then
.87654 x 37 = 32.43198
Therefore, Hash (.87654) = 32
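Both variants of the truncation method above can be sketched as follows (the function names are ours):

```python
def truncate_hash(key):
    """Take the two least significant digits of an integer key
    as its hash address."""
    return key % 100

keys = [754321, 457643, 249801, 443276, 234598]
print([truncate_hash(k) for k in keys])  # [21, 43, 1, 76, 98]

def fraction_hash(key, table_size):
    """For a floating-point key, discard the integral part and
    multiply the fractional part by the table size; the integral
    part of the result is the hash address."""
    fraction = key - int(key)
    return int(fraction * table_size)

print(fraction_hash(334.4704, 37))  # 17
print(fraction_hash(0.87654, 37))   # 32
```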
Mid Square Method
In the mid-square method, we first find the square of a given key and then take the middle digits of the square as the address. Let us take a few 3-digit keys:
839, 784, 476, 526
The squares are 703921, 614656, 226576 and 276676. Choosing the middle two digits of each square, the hash addresses of the given keys are 39, 46, 65 and 66 respectively.
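The mid-square computation above can be sketched as follows (the function name is ours; the sketch assumes 3-digit keys whose squares have six digits):

```python
def mid_square_hash(key):
    """Square the 3-digit key and take the middle two digits of
    the 6-digit square as the hash address."""
    sq = str(key * key).zfill(6)   # e.g. 839 * 839 = 703921
    return int(sq[2:4])            # middle two digits -> 39

print([mid_square_hash(k) for k in [839, 784, 476, 526]])  # [39, 46, 65, 66]
```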
Folding Method
In the folding method, we first break the key into pieces, add the pieces, and finally apply the truncation method to get the hash address. Let us take a few 8-digit keys:
92427643, 87653421, 97320856
Suppose we break each key into pieces of 3, 2 and 3 digits and add the pieces:
924 + 27 + 643 = 1594
876 + 53 + 421 = 1350
973 + 20 + 856 = 1849
Suppose the table size is 1000, so the hash addresses can run from 0 to 999. Therefore, keep only the three least significant digits of each sum (i.e., take the sum modulo 1000) to get the hash addresses.
Hash(92427643) = 594
Hash(87653421) = 350
Hash(97320856) = 849
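The folding steps above can be sketched as follows (the function name is ours; the sketch assumes 8-digit keys split 3-2-3 as in the example):

```python
def folding_hash(key, table_size=1000):
    """Break an 8-digit key into pieces of 3, 2 and 3 digits,
    add the pieces, and keep only the last three digits of the
    sum (the sum modulo the table size)."""
    s = str(key)
    total = int(s[:3]) + int(s[3:5]) + int(s[5:])
    return total % table_size

for k in (92427643, 87653421, 97320856):
    print(k, "->", folding_hash(k))  # 594, 350, 849
```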
Modulus Method
In the modulus method, we take the modulus of the key with the table size to compute the hash address. We take a prime table size to minimize hash collisions. Let us consider a few keys:
456987, 2341, 4367, 9076
Let us have a table size of 37, then the hash addresses are
456987 % 37 = 0
2341 % 37 = 10
4367 % 37 = 1
9076 % 37 = 11
Other methods can be combined with the modulus method. For example, we can first apply the folding method, chopping a given key into parts and summing them, and then apply the modulus method, as below:
For example, suppose the key is “rishi” and the table size is 97. Sum the ASCII codes of all
characters coming in a string “rishi” as follows:
114 + 105 + 115 + 104 + 105 = 543
Take the modulus with the table size 97 to get the hash address:
Hash("rishi") = 543 % 97 = 58
Therefore the key "rishi" is mapped to position 58 in the hash table.
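The modulus method, including the string-folding variant above, can be sketched as follows (the function names are ours):

```python
def modulus_hash(key, table_size=37):
    """The hash address is the remainder of the key divided by a
    prime table size."""
    return key % table_size

def string_hash(s, table_size=97):
    """Fold a string by summing the ASCII codes of its characters,
    then apply the modulus method."""
    return sum(ord(c) for c in s) % table_size

print(modulus_hash(456987))  # 456987 % 37 = 0
print(string_hash("rishi"))  # 543 % 97 = 58
```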
Collision Resolution
A mapping from a potentially huge set of strings into a small set of integers cannot be unique.
The hash function maps keys to indexes in a many-to-one fashion. When a second key hashes into a previously used slot, a collision occurs. If a collision occurs, there are various methods which
can be employed to resolve the collision. The collision resolution methods deal with keys that
are mapped to the same addresses. The collision resolution methods are as follows:
1. Separate chaining
2. Open addressing
a. Linear probing
b. Quadratic probing
c. Double hashing
d. Bucket addressing
Example:
Suppose the hash function is hash(key) = key % 7 and the key string is
“SEPARATECHAINING”
(decimal values of the ASCII codes are taken):
Key:  S E P A R A T E C H A I N I N G
Hash: 6 6 3 2 5 2 0 6 4 2 2 3 1 3 1 1   (M = 7)
We maintain M header nodes; here, M = 7 linked lists are created. Keys that collide are stored in the same linked list, and the lists support sequential search and insert operations. The choice of M generally depends on factors such as the availability of memory. Typically, M is chosen relatively small so as not to use up a large area of contiguous memory, but large enough that the lists stay short for efficient sequential search.
Hash Table:
0: T -> \0
1: G -> N -> N -> \0
2: A -> H -> A -> A -> \0
3: I -> I -> P -> \0
4: C -> \0
5: R -> \0
6: E -> E -> S -> \0
(Each new key is inserted at the head of its list.)
We observe that the hash table contains pointers that hold the addresses of the linked lists. For an insertion operation, the hash value is first computed through the hash function and mapped to a hash table position; the element is then inserted into the corresponding linked list.
In a search operation, at first we get the hash value in the hash table through a hash function,
then searching of a key is done in a corresponding linked list.
In a deletion operation, at first the position of an element is searched, then the element is
removed from the linked list.
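The chaining example above can be sketched as follows (the class name is ours, and plain Python lists stand in for linked lists; new keys are placed at the head of their chain, as in the example):

```python
class ChainedHashTable:
    """Separate chaining: each of the M slots holds a chain of
    the keys that hash to it."""
    def __init__(self, m=7):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _index(self, key):
        return ord(key) % self.m   # same hash as the example above

    def insert(self, key):
        # Insert at the head of the chain (newest key first).
        self.slots[self._index(key)].insert(0, key)

    def search(self, key):
        # Sequential search within one chain only.
        return key in self.slots[self._index(key)]

table = ChainedHashTable()
for ch in "SEPARATECHAINING":
    table.insert(ch)
print(table.slots[6])  # chain for index 6: ['E', 'E', 'S']
```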
Linear Probing
In a linear probing, at first a hash function is applied to compute the hash value. If there is a
hash collision, then a new position is calculated.
Suppose a table size is 10 and the function for computing the hash value is hash(key)
= k % 10
First, find the hash value. If there is a collision, calculate the new position by applying the hash
function, p = (1+p) % 10
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty Empty Empty Empty Empty Empty
If we want to insert 15 at index = 15%10 = 5, then there is a collision as the place with index
5 is already occupied.
Therefore, compute a new position by using p = (1+ p) % 10.
p = (1 + 5) % 10 = 6.
But at the new position 6 again there is a hash collision as the space with index 6 is already
occupied. Therefore, again compute the next position as
p = (1 + 6) % 10 = 7.
Finally, insert 15 at index 7.
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 15 18 Empty
If we want to insert 35 at index = 35%10 = 5, then there is a collision as the place with index
5 is already occupied.
Therefore, compute a new position by using p = (1+ p) % 10.
p = (1 + 5) % 10 = 6.
But at the new position 6 again there is a hash collision as the space with index 6 is already
occupied. Therefore, again compute the next position as
p = (1 + 6) % 10 = 7.
But at the new position 7 again there is a hash collision as the space with index 7 is already
occupied. Therefore, again compute the next position as
p = (1 + 7) % 10 = 8.
But at the new position 8 again there is a hash collision as the space with index 8 is already
occupied. Therefore, again compute the next position as
p = (1 + 8) % 10 = 9.
Finally, insert 35 at index 9.
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 15 18 35
If we want to insert 55 at index = 55%10 = 5, then there is a collision as the place with index
5 is already occupied.
Therefore, if we compute the new positions by using p = (1+ p) % 10, we get the indexes 6, 7,
8 and 9 and there is a collision at each of these positions. Therefore again apply the formula to
calculate the next position as
p = (1 + 9) % 10 = 0.
Finally, insert 55 at index 0.
0 1 2 3 4 5 6 7 8 9
55 Empty Empty Empty Empty 25 16 15 18 35
We can easily observe that there is a severe flaw in the linear probing method: when we tried to insert 35 and 55, there were repeated collisions. The problem worsens once about half of the table is filled, when it becomes difficult to find an empty space in which to insert a key. This condition is known as clustering.
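The probe sequence above can be sketched as follows (the function name is ours; the table is a plain Python list, with None marking an empty slot):

```python
def linear_probe_insert(table, key):
    """Insert key using hash(key) = key % len(table); on a
    collision, step to p = (1 + p) % len(table) until an empty
    slot is found."""
    n = len(table)
    p = key % n
    while table[p] is not None:
        p = (1 + p) % n   # probe the next slot
    table[p] = key
    return p

table = [None] * 10
for k in (25, 16, 18, 15, 35, 55):
    linear_probe_insert(table, k)
print(table)  # [55, None, None, None, None, 25, 16, 15, 18, 35]
```

The final state matches the last table of the example: 15 lands at 7, 35 at 9, and 55 wraps around to 0.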
Quadratic Probing
In quadratic probing, we compute the hash value as we did in linear probing, but if a collision occurs, the ith probe is made at p = (hash(key) + i²) % table_size for i = 1, 2, 3, ..., which reduces the clustering seen in linear probing.
Suppose we again insert the same key values inserted in linear probing, 25, 16, 18, 15, 35 and
55.
In inserting 25, 16 and 18, there are no collisions, therefore, insert them as
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 Empty
To insert 15, first compute hash(15) = 15 % 10 = 5, but index 5 is already occupied. Probing with i = 1 gives (15 + 1²) % 10 = 6, again a collision, as index 6 is occupied. Probing with i = 2 gives (15 + 2²) % 10 = 19 % 10 = 9, which is free.
Therefore, insert 15 at index 9 as
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 15
To insert 35, hash(35) = 35 % 10 = 5 collides with 25. With i = 1, (35 + 1²) % 10 = 6 is occupied; with i = 2, (35 + 2²) % 10 = 39 % 10 = 9 is occupied; with i = 3, (35 + 3²) % 10 = 44 % 10 = 4 is free.
Therefore, insert 35 at index 4 as
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty 35 25 16 Empty 18 15
To insert 55, hash(55) = 55 % 10 = 5 collides. The probes i = 1, 2, 3 give indices 6, 9 and 4, all occupied; with i = 4, (55 + 4²) % 10 = 71 % 10 = 1 is free.
Therefore, insert 55 at index 1 as
0 1 2 3 4 5 6 7 8 9
Empty 55 Empty Empty 35 25 16 Empty 18 15
This method reduces clustering, but it cannot probe all the locations. If the hash table size is a prime number, quadratic probing is guaranteed to probe about half of the locations.
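The quadratic probe sequence above can be sketched as follows (the function name is ours):

```python
def quadratic_probe_insert(table, key):
    """On a collision, probe p = (key + i*i) % len(table) for
    i = 1, 2, 3, ... until an empty slot is found."""
    n = len(table)
    p = key % n
    i = 1
    while table[p] is not None:
        p = (key + i * i) % n   # ith quadratic probe
        i += 1
    table[p] = key
    return p

table = [None] * 10
for k in (25, 16, 18, 15, 35, 55):
    quadratic_probe_insert(table, k)
print(table)  # [None, 55, None, None, 35, 25, 16, None, 18, 15]
```

The final state matches the last table of the example: 15 at index 9, 35 at 4, and 55 at 1.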
Double Hashing
Another way to sharply reduce clustering is to increment p, not by a constant as we did in linear
probing but by an amount that depends on the key. Instead of applying the hash function p =
(1 + p) % table_size we apply p = (p + increment (Key)) % table_size. This technique is called
double hashing.
Let us suppose increment (Key) = 1 + (Key % 7).
Suppose we insert the key values used in linear probing: 25, 16, 18, 15 and 55.
In inserting 25, 16 and 18 there are no collisions, so they go directly to indices 5, 6 and 8:
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 Empty
To insert 15, index 15 % 10 = 5 is occupied. Since increment(15) = 1 + (15 % 7) = 2, the next probe is p = (5 + 2) % 10 = 7, which is free, so 15 is inserted at index 7.
To insert 55, index 55 % 10 = 5 is again occupied. Since increment(55) = 1 + (55 % 7) = 7, the next probe is p = (5 + 7) % 10 = 2, which is free, so 55 is inserted at index 2.
0 1 2 3 4 5 6 7 8 9
Empty Empty 55 Empty Empty 25 16 15 18 Empty
As we observe, each key probes the array positions in a different order, so clustering is avoided. Keys with the same value of both hash(key) and increment(key) do exist, but such coincidences are rare compared to the collisions of linear probing.
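The double-hashing scheme above, with increment(key) = 1 + (key % 7), can be sketched as follows (the function names are ours):

```python
def increment(key):
    """Second hash: the step size depends on the key itself."""
    return 1 + key % 7

def double_hash_insert(table, key):
    """On a collision, step by increment(key) instead of a
    constant, so different keys probe in different orders."""
    n = len(table)
    p = key % n
    while table[p] is not None:
        p = (p + increment(key)) % n
    table[p] = key
    return p

table = [None] * 10
for k in (25, 16, 18, 15, 55):
    double_hash_insert(table, k)
print(table)  # [None, None, 55, None, None, 25, 16, 15, 18, None]
```

Note that 15 and 55 both first hash to index 5, yet they settle at 7 and 2 because their increments differ (2 versus 7).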
Bucket Addressing
In bucket addressing, the hash table slots are grouped into buckets. When the key is to be
inserted into the table, the key value is hashed to determine the bucket where it will be placed.
If the space is already occupied and hash collision occurs, then the key is stored in the next slot
of the same bucket. If all the slots of the bucket are occupied, then the key value is stored in an
overflow bucket of infinite capacity. This overflow bucket is common for all the buckets.
Example:
Bucket:    0       1       2       3       4
Slot:    0   1   2   3   4   5   6   7   8   9
Overflow bucket: 0 1 2 3 4 5 6 7 8
Suppose we have a table with five buckets each having two slots and an overflow bucket. Let
us insert keys 27, 32, 20, 12, 70 and 35.
Insert key 27 at 27%5 = 2
Bucket 0: [ - , - ]   Bucket 1: [ - , - ]   Bucket 2: [ 27 , - ]   Bucket 3: [ - , - ]   Bucket 4: [ - , - ]
Insert key 32 at 32 % 5 = 2. As the first slot of bucket 2 is occupied, insert the key in the second slot.
Bucket 0: [ - , - ]   Bucket 1: [ - , - ]   Bucket 2: [ 27 , 32 ]   Bucket 3: [ - , - ]   Bucket 4: [ - , - ]
Insert key 20 at 20 % 5 = 0. The first slot of bucket 0 is free, so insert it there.
Bucket 0: [ 20 , - ]   Bucket 1: [ - , - ]   Bucket 2: [ 27 , 32 ]   Bucket 3: [ - , - ]   Bucket 4: [ - , - ]
Insert key 12 at 12 % 5 = 2. As both slots of bucket 2 are occupied, insert the key in the overflow bucket.
Bucket 0: [ 20 , - ]   Bucket 1: [ - , - ]   Bucket 2: [ 27 , 32 ]   Bucket 3: [ - , - ]   Bucket 4: [ - , - ]
Overflow bucket: [ 12 ]
Insert key 70 at 70 % 5 = 0. As the first slot of bucket 0 is occupied, insert the key in the second slot.
Bucket 0: [ 20 , 70 ]   Bucket 1: [ - , - ]   Bucket 2: [ 27 , 32 ]   Bucket 3: [ - , - ]   Bucket 4: [ - , - ]
Insert key 35 at 35 % 5 = 0. As both slots of bucket 0 are occupied, insert the key in the overflow bucket.
Bucket 0: [ 20 , 70 ]   Bucket 1: [ - , - ]   Bucket 2: [ 27 , 32 ]   Bucket 3: [ - , - ]   Bucket 4: [ - , - ]
Overflow bucket: [ 12 , 35 ]
Suppose we want to retrieve 35. It hashes to 35 % 5 = 0. The key 35 is not in either slot of bucket 0, so the overflow bucket is searched; the first key (12) does not match, but at index 1 the key 35 is found.
The advantage of bucket addressing is that it avoids linked-list operations; the disadvantage is that a lot of space may be wasted.
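The bucket scheme above can be sketched as follows (the class name is ours, and Python lists stand in for the fixed-size buckets; the overflow list grows without bound, matching the "infinite capacity" assumption of the example):

```python
class BucketHashTable:
    """Five buckets of two slots each, plus one shared overflow
    bucket, as in the example above."""
    def __init__(self, buckets=5, slots=2):
        self.slot_count = slots
        self.buckets = [[] for _ in range(buckets)]
        self.overflow = []

    def insert(self, key):
        b = self.buckets[key % len(self.buckets)]
        if len(b) < self.slot_count:
            b.append(key)            # next free slot in the bucket
        else:
            self.overflow.append(key)  # all slots full

    def search(self, key):
        b = self.buckets[key % len(self.buckets)]
        return key in b or key in self.overflow

t = BucketHashTable()
for k in (27, 32, 20, 12, 70, 35):
    t.insert(k)
print(t.buckets)   # [[20, 70], [], [27, 32], [], []]
print(t.overflow)  # [12, 35]
```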
Memory Management
Programs that are scheduled to run must reside in memory; therefore, programs must be allocated space in main memory to be executed. However, because primary memory is volatile, a user needs to store programs in a non-volatile store, i.e., on a secondary storage medium. Programs and files may be disk resident and loaded whenever their execution is required. Therefore, some form of memory management is needed at both the primary and secondary memory levels.
Memory management is the process of assigning portions called blocks to various
running programs, controlling and coordinating computer memory, to optimize overall system
performance.
Cache memory: It is the fastest and most costly form of storage. It is usually very small in size and is managed by the hardware.
Main memory: It is the storage area for data available to be operated on; general-purpose machine instructions operate on main memory. Owing to its volatile nature, the contents of main memory are usually lost when the system is shut down or in case of a power failure. It is too small and too expensive to store an entire database.
Magnetic-disk storage: It is the primary medium for long-term storage; generally the entire database is stored on disk. Data must be moved from disk to main memory in order to be operated on, and after operations are performed, the data must be copied back to the magnetic disk. Disk
storage is known as direct access storage as it is possible to read data on the disk in any order,
i.e., non-sequential access to data is possible. Disk storage generally survives power failures
and system crashes.
Optical storage: Compact Disk Read-Only Memory (CD-ROM) is one example of optical storage. Data are burnt onto a CD-ROM once and can then be read whenever required.
Magnetic tape storage: It is used primarily for backup and archival data. It is cheaper but has
much slower access since tape must be read sequentially from the beginning. It may be used as
protection from disk failures. This storage is a suitable medium to store sequential files.
Garbage Collection
This technique is used to reclaim all the nodes that were previously allocated but are not currently in use. It is the automatic recycling of dynamically allocated memory space. When a node is deleted, some memory space becomes free and turns into reusable space that is available for future use.
One way to do this is to immediately insert the space that has become free into the availability list. But this method may be time consuming for the operating system. Therefore, another method, called garbage collection, is applied by the operating system to do this task.
The method applied by the operating system for garbage collection is that at first the data
objects that cannot be accessed in the future are found and then the resources used by those
data objects are reclaimed.
The garbage collection is an automatic process which is performed by the Garbage Collector.
This process is done in two steps.
i. The operating system sequentially visits all the nodes in memory and tags all the cells that are currently in use.
ii. The operating system goes through all the nodes again, collects the untagged space, and adds this collected space to the availability list.
Garbage collection may be performed when a small amount of free space is left in the system, when no free space is left, or when the central processing unit (CPU) is idle and has time to collect the garbage.
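The two passes above can be sketched as a toy mark-and-collect routine (the dictionary representation of the heap and the function name are our assumptions):

```python
def garbage_collect(nodes, roots):
    """Pass i: tag every node reachable from the roots (in use).
    Pass ii: collect the untagged nodes into the availability
    list. `nodes` maps a node name to the names it references."""
    # Pass i: mark all reachable nodes.
    tagged = set()
    stack = list(roots)
    while stack:
        n = stack.pop()
        if n not in tagged:
            tagged.add(n)
            stack.extend(nodes[n])
    # Pass ii: everything untagged is garbage.
    availability_list = [n for n in nodes if n not in tagged]
    return availability_list

# C and D reference each other but are unreachable from the root A:
heap = {"A": ["B"], "B": [], "C": ["D"], "D": ["C"]}
print(garbage_collect(heap, roots=["A"]))  # ['C', 'D']
```

Note that the C-D cycle is collected even though each node is still referenced, because neither is reachable from a root.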
Primary key: A primary key is a key that uniquely identifies a record. In the above file,
Book_no can be taken as a primary key.
Secondary key: Other keys that can be used for searching a record are called secondary keys.
In the above file, Author, Title, Author + Title can be used as the secondary keys to search a
given record.
Record Structures
There are two types of record structures:
i. Fixed length records
ii. Variable length records.
ii. Variable-length records: In variable-length records, the number of fields is fixed, but the fields themselves can vary in length. Each record begins with a length indicator, and a delimiter is placed at the end of each record. An index file is used to keep track of the addresses of the records: it stores the byte offset of each record, which allows us to search the index to determine where a record begins.
File Organization
There are four basic types of file organization if a file is viewed as a sequence of records:
i. Sequential file organization
ii. Relative file organization
iii. Indexed sequential file organization
iv. Multi key file organization
With the relative key, we can randomly access any record without starting from the first record.
The disadvantage of relative file organization is its dependence on relative keys. If we do not
know the relative key of a particular record, we cannot randomly access the file.
Index: It is a data structure that allows a particular record in a file to be located more quickly, just as the index of a book helps us find a topic. An index can be either dense or sparse.
Dense Index: In a dense index, there is an index record for every search-key value in the file. This makes searching faster but requires more space to store the index records themselves. Each index record contains a search-key value and a pointer to the actual record in the file.
Suppose we want to find the population of Madrid with key value “Spain”. Since every key
value is stored in an index record, the data is directly found using the corresponding pointer.
Sparse Index: In a sparse index, index records are not created for every search key. An index record contains a search key and a pointer to the record. To search for a data item, we first go to the index record with the largest search-key value less than or equal to the given search-key value and follow its pointer to the corresponding location in the data file. If the record we are looking for is not at that location, a sequential search is done from there until the desired record is found.
Suppose we want to find the population of Lampyong with key “Thailand”. The key “Thailand” is present in the index record, so the search proceeds sequentially from the position given by “Thailand”, the largest search-key value equal to the given key, to find the data.
Suppose we want to find the population of Madrid with key value “Spain”. But this key is not
present in the index record. Therefore, the record is searched sequentially taking a key value
“Russia” which is the largest search key value less than “Spain” to find the data.
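The sparse-index lookup above can be sketched as follows; since the original population table is not reproduced here, the country records and figures below are illustrative assumptions:

```python
import bisect

# A toy sorted data file and a sparse index over it:
# (key, record) pairs, and (key, file position) index entries.
data_file = [("Brazil", 1), ("Russia", 2), ("Spain", 3), ("Thailand", 4)]
sparse_index = [("Brazil", 0), ("Russia", 1), ("Thailand", 3)]

def sparse_lookup(key):
    """Find the largest index key <= the search key, then scan
    the data file sequentially from that position."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None            # key precedes every index entry
    _, pos = sparse_index[i]
    for k, record in data_file[pos:]:
        if k == key:
            return record
    return None

print(sparse_lookup("Spain"))     # enters at "Russia", scans to "Spain"
print(sparse_lookup("Thailand"))  # key present in the index itself
```

The "Spain" lookup mirrors the example above: the index sends us to "Russia", the largest key less than "Spain", and a short sequential scan finds the record.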
A file containing the logical records is called a data file; in it the records are stored sequentially. A file containing the index records is called an index file, which has a tree structure. The field used to order the index records in the index file is termed the indexing field.
A sorted data file with a primary index is called an indexed sequential file. The advantage of an indexed sequential file is that it allows both sequential searching and individual record retrieval through the index value. The structure of an indexed sequential file includes a primary storage area, a separate index or indexes, and an overflow area. The B+ tree is one of the most widely used structures for databases.