Binary Search, Hashing and File Structures

Linear Search

A linear search or sequential search is a method for finding the location of an element within a
list. It checks each element of the list in turn until a match is found or until the list is
exhausted, i.e., all the elements have been examined.
The following figure of an array shows how the search starts from the first element and then
proceeds one element at a time in sequential order.

Index:  0   1   2   3   4   5   6   7
Value:  12  15  10  25  30  28  50  17
        ^
      Start

Algorithm: linearSearch[array, item, size]

1. Set position = 0
2. Set found = false
3. Repeat steps 4 and 5 while position < size and not found
4.     If (array[position] == item) then
           Set found = true and return position
5.     Else
           Set position = position + 1
   [End of while loop]
6. End
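
The algorithm translates directly into C. Below is a minimal sketch; the function name and the convention of returning -1 for a failed search are our own choices, not part of the pseudocode above.

#include <stdio.h>

/* Returns the index of item in array[0..size-1], or -1 if not found. */
int linearSearch(const int array[], int size, int item)
{
    for (int position = 0; position < size; position++)
        if (array[position] == item)
            return position;      /* found: stop at the first match */
    return -1;                    /* list exhausted without a match */
}

int main(void)
{
    int a[] = {12, 15, 10, 25, 30, 28, 50, 17};
    printf("%d\n", linearSearch(a, 8, 25));   /* prints 3 */
    return 0;
}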

Analysis of Sequential Search


Assume that the list is unsorted; this means that the elements are stored in no particular order.
Therefore, the probability of finding the item at any particular position is the same for every
position of the list. The complexity of a searching algorithm is found by counting the number of
comparisons. A single comparison does not guarantee a successful search: if the item is not in a
list of n elements, we cannot conclude that it is absent until all n comparisons have been made.

If the item is in the list, there are three different possibilities. In the best case we are
fortunate enough to find the element at the very beginning of the list, so only one comparison is
needed. In the worst case, we do not find the item until the very last comparison, the nth
comparison.

In the average case, we find the item about halfway into the list; that is, about n/2 comparisons
are made. As n gets large, the constant coefficient becomes insignificant in our approximation;
therefore, the complexity of the sequential search is O(n).
Binary Search
The binary search algorithm is also known as Half Interval Search or Logarithmic Search. The
prerequisite of a binary search is that the list should be sorted in ascending or descending
order. Binary search compares the element to be searched with the middle element of the list;
if they are unequal, the half in which the element cannot lie is eliminated and the search
continues on the remaining half until it is successful. If the search ends with the remaining half
being empty, the element is not in the list. It works by repeatedly dividing in half the portion
of the list that may contain the item, until we have narrowed down the possible locations to just
one.

Example:
Let us consider that there is a sorted list of integers in ascending order as given below:

5 10 15 20 25 30 35 40 45 50 55

Suppose an element that is to be searched is 35.

Step 1:
The middle element in the list is 30. Compare 35 with 30. As 35>30, discard the first half of
the list because now it is evident that the element to be searched is in the second half of the list
or is not present in the list at all. We actually ignore half of the elements just after one
comparison.

Remaining portion: 35 40 45 50 55

Step 2:
The middle element of the remaining half is 45. Compare 35 with 45. As 35<45, discard the second
half of this portion because the element cannot be there.
Remaining portion: 35 40

Step 3:
Now the middle element is 35, which is exactly the element we were searching for.

Algorithm: Binary Search


binarySearch[Sorted Array (A), Lower Bound (LB), Upper Bound (UB), ITEM]
1. Repeat steps 2 to 7 while LB <= UB
2.     Set MID = (LB + UB) / 2
3.     If (A[MID] < ITEM) then
4.         Set LB = MID + 1
5.     Else if (A[MID] > ITEM) then
6.         Set UB = MID - 1
7.     Else return MID
       [End of if-else statement]
   [End of while loop]
8. Return "Item not found"
9. End
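
A possible iterative C implementation of this algorithm, searching the example list for 35. Computing MID as LB + (UB - LB)/2 is an equivalent form of (LB + UB)/2 that avoids integer overflow for very large bounds.

#include <stdio.h>

/* Iterative binary search over the sorted array A[LB..UB].
   Returns the index of ITEM, or -1 if it is not present. */
int binarySearch(const int A[], int LB, int UB, int ITEM)
{
    while (LB <= UB) {
        int MID = LB + (UB - LB) / 2;   /* same midpoint, overflow-safe */
        if (A[MID] < ITEM)
            LB = MID + 1;               /* discard the first half */
        else if (A[MID] > ITEM)
            UB = MID - 1;               /* discard the second half */
        else
            return MID;                 /* found */
    }
    return -1;                          /* item not found */
}

int main(void)
{
    int a[] = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55};
    printf("%d\n", binarySearch(a, 0, 10, 35));   /* prints 6 */
    return 0;
}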

Analysis of Binary Search


Binary search follows the divide-and-conquer paradigm; the idea is to reduce the time complexity
to O(log₂ n). The input to a binary search is an already sorted list; therefore, all the elements
in the first part are smaller than the middle element and all the elements in the second part are
greater than the middle element.
As we discard one half of the search space in every step of binary search and perform the search
operation on the other half, the worst case time complexity is O(log₂ n).
For example, with a total of 64 elements, binary search eliminates half of the possible entries
in each iteration, so at most 6 comparisons are needed to find a value (log₂ 64 = 6). This is the
power of a binary search.
In an iterative implementation, the auxiliary space of a binary search is O(1); in a recursive
implementation it is O(log n) due to the call stack space used by the recursion.

HASH TABLE
The searching methods we studied earlier are generally based on comparisons of keys. We need
search techniques in which there are no unnecessary comparisons of keys. The objective of finding
the desired record efficiently and quickly can be achieved by minimizing the number of
comparisons. In a hashing technique, the location of the desired record in the search table
depends only on the given key and not on the other keys, and therefore it is a very efficient
searching technique.
Suppose we have an employee table having n records, where each record is identified by a unique
enrollment number key. This key takes values from 1 to n inclusive. If the enrollment number is
used as an index into the employee table, we can directly find the information of an employee in
that table. Therefore, arrays can be used to organize records in such a search table.
Hashing is a technique that is used to uniquely identify a specific object from a group of similar
objects. Some examples of hashing may be seen in our daily lives as illustrated below:
• In a school, every student is assigned a unique roll number that can be used to retrieve
information about her.
• In a library, each book is assigned a unique number that can be used to get information
about the book, such as its location in the library or the person to whom it has been issued.
In both of the above examples the students and the books were hashed to a unique number.
Suppose we have an object and we want to assign a key to it to make searching easy. To store
the key/value pair, we can use an array data structure where keys can be used directly as an
index to store values.
Let us consider that there are five records whose keys are
5 6 9 8 3
The keys are stored in an array arr as shown below:

Index:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
Key:                 3       5   6       8   9

Here, we can see that the record which has a key value 6 can be directly accessed through an
array index arr[6].
However, in cases where the keys are large and cannot be used directly as an index, we should
use hashing.
In hashing, large keys are converted into small keys by using the hash functions. The values
are then stored in a data structure called a hash table. In a hash table, each element is assigned
a key (converted small key). By using that key we can access the element in O(1) time. Using
the key, the hash function computes an index that tells where an entry can be found or inserted.

Hashing
Hashing is a technique which converts a range of key values into a range of index values of an
array. We use the modulo operator (%) to map hash values into the range of array indices. Hashing
can be implemented in two steps as illustrated below:
1. The key is converted into an integer by using a hash function. This integer can then be
used as an index into the hash table.

H = hashFunction(key)

2. The element is stored in the hash table, where it can be quickly retrieved using the hashed
key.

Index = H % N, where N is the size of the array

In this method, the hash H is independent of the array size; it is then reduced to an index (a
number between 0 and N − 1) by using the modulo operator (%).
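
The two steps can be sketched in C as below. hashFunction here is a stand-in (it simply returns the key) for whatever real hash function is chosen; only the reduction H % N is fixed by the scheme above.

/* Step 1: convert the key to an integer hash (placeholder hash here). */
unsigned int hashFunction(unsigned int key)
{
    return key;                       /* identity hash, for illustration only */
}

/* Step 2: reduce the hash to an index in [0, N-1] with the modulo operator. */
unsigned int tableIndex(unsigned int key, unsigned int N)
{
    unsigned int H = hashFunction(key);
    return H % N;
}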

Hash function
A hash function can be used to map a data set of arbitrary size to a data set of a fixed size
that fits into the hash table. The values returned by a hash function are called hash values
or hash codes.
Hash Table is a data structure which stores data in an associative manner. In a hash table, data
elements are stored in an array, where each data element has its own unique index value. Access
of data becomes very fast if we know the index of the desired data element.
Therefore, hash table is a data structure in which insertion and search operations are very fast
irrespective of the size of the data set. Hash Table uses an array as storage and uses hash
technique to generate an index where an element is located or it has to be inserted.
The basic idea in hashing is the transformation of a key into a corresponding location in the
hash table. This is done by a hash function, which can be defined as a function that takes a key
as input and transforms it into a hash index. It is generally denoted by H:
H: K → M
where H is the hash function, K is the set of keys, and M is the set of memory addresses.

Example:
Let us consider a simple hash table:
Index   Key, Value
0
1       0001, Data 1
2
3       0011, Data 3
4       0100, Data 4
5
6
7       0111, Data 7
8
9       1001, Data 9

Here, the keys are applied to the hash function to find the addresses.
H(0001) -> 1, At address with index 1, Data 1 is stored.
H(0011) -> 3, At address with index 3, Data 3 is stored.
H(0111) -> 7, At address with index 7, Data 7 is stored.
H(1001) -> 9, At address with index 9, Data 9 is stored.

Sometimes such a function H may not yield distinct values: it is possible that two different keys
K1 and K2 yield the same hash address. This situation is called a hash collision.

Example:
Suppose the keys are 16, 24, 26, 31, 35 and the hash function to find the index is defined as
H(K) = K % 10
Then, the indices are generated as follows:
16%10 = 6
24%10 = 4
26%10 = 6
31%10 = 1
35%10 = 5
As we can see, when the keys 16 and 26 are given as inputs to the hash function, the same index
value 6 is generated. Two items cannot be stored at the same address (index). This situation is
known as a hash collision.
Choosing a hash function
It is important to choose a good hash function. The basic requirements to achieve a good
hashing mechanism are as illustrated below:
1. Hash function should be easy to compute.
2. It should provide a uniform distribution across the hash table and should not result in
clustering.
3. Collisions occur when pairs of elements are mapped to the same hash value. The ideal
case is that a hash function is chosen in such a way that no collisions occur. But,
irrespective of how good a hash function is, collisions are bound to occur. Therefore,
to maintain the performance of a hash table, it is important to manage collisions through
various collision resolution techniques.

Let us discuss a few of the techniques for choosing hash function.

Truncation Method
In the truncation method, part of the key is taken as the address. Let us take a few 6-digit
keys:
754321, 457643, 249801, 443276, 234598

Suppose we choose the two least significant digits (the two rightmost digits) as the addresses of
the given keys. Then the addresses are 21, 43, 01, 76 and 98 respectively. It is the simplest way
to compute a hash address, but the chances of collision are very high.

For floating point numbers, discard the integral part and multiply the fractional part of the
given key by the table size. The integral part of the result is the hash address.

For example, if the key is 334.4704 and the table size is 37, then
.4704 x 37 = 17.4048
Therefore, Hash(334.4704) = 17

If the key is in the range of 0 to 1, then multiply it by the table size. The integral part of the
result is the hash address.

For example, if the key is .87654 and the table size is 37, then
.87654 x 37 = 32.43198
Therefore, Hash(.87654) = 32
Mid Square Method
In the mid square method, we first find the square of the given key and then take the middle
digits of the square as the address. Let us take a few 3-digit keys:
839, 784, 476, 526

Key     839     784     476     526
Square  703921  614656  226576  276676

Now choose the middle two digits of each square as the address. Therefore, 39, 46, 65 and 66 are
the hash addresses of the given keys respectively.

Folding Method
In the folding method, we first break the key into pieces, add the pieces, and finally apply the
truncation method to get the hash address. Let us take a few 8-digit keys:
92427643, 87653421, 97320856

Suppose we break each key into pieces of 3, 2 and 3 digits and add them:

92427643 = 924 + 27 + 643 = 1594
87653421 = 876 + 53 + 421 = 1350
97320856 = 973 + 20 + 856 = 1849

Suppose the table size is 1000, so the hash addresses can range from 0 to 999. Therefore, keep
only the three least significant digits of each sum to get the hash address.

Hash(92427643) = 594
Hash(87653421) = 350
Hash(97320856) = 849

Modulus Method
In the modulus method, we take the modulus of the key to compute the hash address. We choose a
prime number as the table size to minimize hash collisions. Let us consider a few keys:
456987, 2341, 4367, 9076

Let the table size be 37; then the hash addresses are
456987 % 37 = 0
2341 % 37 = 10
4367 % 37 = 1
9076 % 37 = 11
Other methods can be mixed with the modulus method. Suppose we first apply the folding method,
chopping each key into two parts and summing them, and then apply the modulus method as below:

456 + 987 = 1443, 1443 % 37 = 0


23 + 41 = 64, 64 % 37 = 27
43 + 67 = 110, 110 % 37 = 36
90 + 76 = 166, 166 % 37 = 18

Other methods can also be mixed to get better hash addresses.
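
As an illustration, the four address-computation methods above might be sketched in C as below. The digit choices (two least significant digits for truncation, the middle two digits of a 6-digit square, 3-2-3 folding into a table of size 1000) follow the worked examples; the function names are our own.

/* Truncation: take the two least significant digits of the key. */
int truncationHash(long key)
{
    return (int)(key % 100);            /* 754321 -> 21 */
}

/* Mid square: square a 3-digit key (a 6-digit square) and take the
   middle two digits, e.g. 839 * 839 = 703921 -> 39. */
int midSquareHash(int key)
{
    long sq = (long)key * key;
    return (int)((sq / 100) % 100);     /* drop 2 digits, keep the next 2 */
}

/* Folding: split an 8-digit key into pieces of 3, 2 and 3 digits, add
   them, and keep the three least significant digits of the sum. */
int foldingHash(long key)
{
    long low3  = key % 1000;            /* last 3 digits   */
    long mid2  = (key / 1000) % 100;    /* middle 2 digits */
    long high3 = key / 100000;          /* first 3 digits  */
    return (int)((high3 + mid2 + low3) % 1000);   /* 92427643 -> 594 */
}

/* Modulus: remainder on division by a (preferably prime) table size. */
int modulusHash(long key, int tableSize)
{
    return (int)(key % tableSize);      /* 2341 % 37 -> 10 */
}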


Suppose a string is taken as a key. Every alphabetic or alphanumeric symbol has an ASCII code,
which can be used to compute the hash address: the ASCII values of the characters are summed and
the modulus with the table size is taken to generate the hash address.

For example, suppose the key is “rishi” and the table size is 97. Sum the ASCII codes of all the
characters in the string “rishi”:
114 + 105 + 115 + 104 + 105 = 543
Take the modulus with the table size 97 to get the hash address:
Hash(“rishi”) = 543 % 97 = 58
Therefore the key “rishi” is mapped to position 58 in the hash table.
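
A C sketch of this string-hashing scheme; with the values above, stringHash("rishi", 97) returns 58.

/* Sum the ASCII codes of the characters, then reduce modulo the table size. */
int stringHash(const char *key, int tableSize)
{
    int sum = 0;
    for (int i = 0; key[i] != '\0'; i++)
        sum += (unsigned char)key[i];   /* add the ASCII value of each character */
    return sum % tableSize;
}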

Collision Resolution
A mapping from a potentially huge set of keys into a small set of integers cannot be unique. The
hash function maps keys to indexes in a many-to-one fashion. When a second key hashes to a
previously used slot, we have a collision. If a collision occurs, there are various methods which
can be employed to resolve it. The collision resolution methods deal with keys that are mapped to
the same address. They are as follows:

1. Separate chaining

2. Open addressing

a. Linear probing
b. Quadratic probing
c. Double hashing
d. Bucket addressing

Separate Chaining (Open Hashing)


This collision resolution method was invented by H. P. Luhn in 1953. In this method, keys hashing
to the same address are kept in a list attached to that address: for each table address, a linked
list of the records whose keys hash to that address is created. Separate chaining is useful for
highly dynamic situations where the number of search keys cannot be predicted beforehand.

Example:
Suppose the hash function is hash (key) = key%7 and the key string is
“SEPARATECHAINING”
(Decimal values of ASCII code are taken)

The address for ‘S’ is 83%7 = 6


The address for ‘E’ is 69%7 = 6
The address for ‘P’ is 80%7 =3
The address for ‘A’ is 65%7 =2
The address for ‘R’ is 82%7 = 5
The address for ‘T’ is 84%7 = 0
The address for ‘C’ is 67%7 =4
The address for ‘H’ is 72%7 =2
The address for ‘I’ is 73%7 = 3
The address for ‘N’ is 78%7 = 1
The address for ‘G’ is 71%7 =1

Records: each character of the string is a key.

Key         S E P A R A T E C H A I N I N G
Hash (M=7)  6 6 3 2 5 2 0 6 4 2 2 3 1 3 1 1

We maintain M list header nodes. Here, 7 linked lists are created; each row shown below is a
linked list. Colliding keys are stored in the same linked list, which is searched sequentially
for the search and insert operations. The choice of M generally depends on factors such as the
availability of memory. Typically, M is chosen relatively small so as not to use up a large area
of contiguous memory, but large enough that the lists are short for efficient sequential search.
Hash Table:

Index 0: T → \0
Index 1: G → N → N → \0
Index 2: A → H → A → A → \0
Index 3: I → I → P → \0
Index 4: C → \0
Index 5: R → \0
Index 6: E → E → S → \0

(Each new key is inserted at the head of its list, so the lists read in reverse order of
insertion.)
We observe that the hash table contains pointers which hold the addresses of the linked lists.
For an insertion operation, the hash value is first computed through the hash function, which
gives the position in the hash table; then the element is inserted into the corresponding linked
list.
In a search operation, we first get the position in the hash table through the hash function,
then the key is searched in the corresponding linked list.
In a deletion operation, the position of the element is first found by searching, then the
element is removed from its linked list.
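
A minimal C sketch of separate chaining with M = 7, matching the example above. New keys are inserted at the head of their chain, which is why the lists read in reverse order of insertion; the function names are our own.

#include <stdlib.h>

#define M 7                      /* number of chains, as in the example */

struct Node {
    int key;
    struct Node *next;
};

struct Node *table[M];           /* M header pointers, initially NULL */

int hashKey(int key) { return key % M; }

/* Insert at the head of the chain the key hashes to. */
void insertChain(int key)
{
    struct Node *n = malloc(sizeof *n);
    int h = hashKey(key);
    n->key  = key;
    n->next = table[h];
    table[h] = n;
}

/* Sequentially search only the one chain the key can live in. */
struct Node *searchChain(int key)
{
    for (struct Node *p = table[hashKey(key)]; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}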

Open Addressing (Closed Hashing)


In open hashing, linked lists are used to resolve hash collisions; in closed hashing, arrays are
used instead. In open addressing, each position in the array is in one of three states: empty,
deleted or occupied. Initially, all positions in the array are empty. An occupied position
contains an actual value, i.e., a (key, data) pair; otherwise, it contains no value. Note that
when a value is deleted, its position is marked not as empty but as deleted.

The algorithm for inserting a key into a table is illustrated as below:


i. Compute the position at which the key is to be stored as p = hash(key).
ii. If position p is not occupied, store the key at this place.
iii. If position p is occupied, there is a hash collision. Compute another position p by applying
some probing method, and repeat steps ii and iii until an empty position is found.

Linear Probing
In linear probing, a hash function is first applied to compute the hash value. If there is a hash
collision, a new position is calculated.
Suppose the table size is 10 and the hash function is hash(k) = k % 10.
First, find the hash value. If there is a collision, calculate the new position by applying
p = (1 + p) % 10

Initially all the positions in the table are empty.

0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty Empty Empty Empty Empty Empty

Suppose we insert 25, 16 and 18 one by one.


Insert 25 at index = 25%10 = 5
Insert 16 at index = 16%10 = 6
Insert 18 at index = 18%10 = 8
There is no collision in inserting 25, 16 and 18. Insert them.
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 Empty

If we want to insert 15 at index = 15%10 = 5, then there is a collision as the place with index
5 is already occupied.
Therefore, compute a new position by using p = (1+ p) % 10.
p = (1 + 5) % 10 = 6.
But at the new position 6 again there is a hash collision as the space with index 6 is already
occupied. Therefore, again compute the next position as
p = (1 + 6) % 10 = 7.
Finally, insert 15 at index 7.

0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 15 18 Empty

If we want to insert 35 at index = 35%10 = 5, then there is a collision as the place with index
5 is already occupied.
Therefore, compute a new position by using p = (1+ p) % 10.
p = (1 + 5) % 10 = 6.
But at the new position 6 again there is a hash collision as the space with index 6 is already
occupied. Therefore, again compute the next position as
p = (1 + 6) % 10 = 7.
But at the new position 7 again there is a hash collision as the space with index 7 is already
occupied. Therefore, again compute the next position as
p = (1 + 7) % 10 = 8.
But at the new position 8 again there is a hash collision as the space with index 8 is already
occupied. Therefore, again compute the next position as
p = (1 + 8) % 10 = 9.
Finally, insert 35 at index 9.
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 15 18 35

If we want to insert 55 at index = 55%10 = 5, then there is a collision as the place with index
5 is already occupied.
Therefore, if we compute the new positions by using p = (1+ p) % 10, we get the indexes 6, 7,
8 and 9 and there is a collision at each of these positions. Therefore again apply the formula to
calculate the next position as
p = (1 + 9) % 10 = 0.
Finally, insert 55 at index 0.

0 1 2 3 4 5 6 7 8 9
55 Empty Empty Empty Empty 25 16 15 18 35
We can easily observe a severe flaw in the linear probing method: when we tried to insert 35 and
55 there were repeated collisions. Keys that hash to nearby addresses pile up into long runs of
occupied slots, and the problem gets aggravated once about half the table is filled, when it
becomes difficult to find empty space to insert a key. This condition is known as clustering.
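
A C sketch of linear-probing insertion that reproduces the walkthrough above. It assumes the table never becomes completely full; a production version would track the load factor.

#include <stdio.h>

#define SIZE  10
#define EMPTY -1

int table[SIZE];

/* On a collision, step to (1 + p) % SIZE until an empty slot is found.
   The caller must ensure the table is not full. */
void insertLinear(int key)
{
    int p = key % SIZE;
    while (table[p] != EMPTY)
        p = (1 + p) % SIZE;
    table[p] = key;
}

int main(void)
{
    int keys[] = {25, 16, 18, 15, 35, 55};
    for (int i = 0; i < SIZE; i++) table[i] = EMPTY;
    for (int i = 0; i < 6; i++) insertLinear(keys[i]);
    for (int i = 0; i < SIZE; i++) printf("%d ", table[i]);
    printf("\n");                 /* prints: 55 -1 -1 -1 -1 25 16 15 18 35 */
    return 0;
}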

Quadratic Probing
In quadratic probing, we compute the hash value as in linear probing, but if a collision occurs,
the i-th probe is made at position p = (hash(key) + i²) % table_size, which reduces the
collisions caused by clustering.

Suppose we again insert the same key values inserted in linear probing, 25, 16, 18, 15, 35 and
55.

In inserting 25, 16 and 18, there are no collisions, therefore, insert them as

0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 Empty

To insert 15, we compute 15 % 10 = 5, which collides with 25. The first probe is
(15 + 1²) % 10 = 6, but there is again a collision as index 6 is already occupied. The next probe
is (15 + 2²) % 10 = 19 % 10 = 9.
Therefore, insert 15 at index 9:

0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 15

To insert 35, we compute 35 % 10 = 5, which collides with 25. The first probe
(35 + 1²) % 10 = 6 collides with 16, and the second probe (35 + 2²) % 10 = 39 % 10 = 9 collides
with 15. The next probe is (35 + 3²) % 10 = 44 % 10 = 4.
Therefore, insert 35 at index 4:

0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty 35 25 16 Empty 18 15

To insert 55, we compute 55 % 10 = 5, which collides with 25. The probes (55 + 1²) % 10 = 6,
(55 + 2²) % 10 = 9 and (55 + 3²) % 10 = 4 all collide. The next probe is
(55 + 4²) % 10 = 71 % 10 = 1.
Therefore, insert 55 at index 1:

0 1 2 3 4 5 6 7 8 9
Empty 55 Empty Empty 35 25 16 Empty 18 15
This method reduces clustering, but it cannot search all the locations. If the hash table size is a
prime number, then by quadratic probing we will be able to search around half of the locations.
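
A C sketch of quadratic-probing insertion under the same assumptions. The loop bounds the number of probes at the table size, returning -1 when the probe sequence finds no free slot, since, as noted above, quadratic probing cannot reach every location.

#define SIZE  10
#define EMPTY -1

int qtable[SIZE];                 /* all slots set to EMPTY beforehand */

/* The i-th probe inspects (home + i*i) % SIZE, where home = key % SIZE.
   Returns the slot used, or -1 if SIZE probes find no free slot. */
int insertQuadratic(int key)
{
    for (int i = 0; i < SIZE; i++) {
        int p = (key % SIZE + i * i) % SIZE;
        if (qtable[p] == EMPTY) {
            qtable[p] = key;
            return p;
        }
    }
    return -1;
}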

Double Hashing
Another way to sharply reduce clustering is to increment p, not by a constant as we did in linear
probing but by an amount that depends on the key. Instead of applying the hash function p =
(1 + p) % table_size we apply p = (p + increment (Key)) % table_size. This technique is called
double hashing.
Let us suppose increment(Key) = 1 + (Key % 7).
Suppose we insert the key values used in linear probing: 25, 16, 18, 15 and 55.

In inserting 25, 16 and 18, there are no collisions, therefore, insert them as

0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 Empty 18 Empty

To insert 15, we compute hash(15) = 15 % 10 = 5, but there is a collision as the space at index 5
is already occupied. Therefore, we compute a new location.
First find increment(15) = 1 + (15 % 7) = 2.
Now find the new position as
p = (p + increment(Key)) % 10 = (5 + 2) % 10 = 7
Insert 15 at index 7:
0 1 2 3 4 5 6 7 8 9
Empty Empty Empty Empty Empty 25 16 15 18 Empty

To insert 55, we compute hash(55) = 55 % 10 = 5, but there is a collision as the space at index 5
is already occupied. Therefore, we compute a new location.
First find increment(55) = 1 + (55 % 7) = 7.
Now find the new position as
p = (p + increment(Key)) % 10 = (5 + 7) % 10 = 2
Insert 55 at index 2:
0 1 2 3 4 5 6 7 8 9
Empty Empty 55 Empty Empty 25 16 15 18 Empty

As we observe, each key probes the array positions in a different order, so there is no
clustering. There do exist keys that share both the same value of hash(key) and the same value of
increment(key), but such coincidences are rare compared to the collisions of linear probing.
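
A C sketch of double hashing with the increment function used in the example above.

#define SIZE  10
#define EMPTY -1

int dtable[SIZE];                 /* all slots set to EMPTY beforehand */

int increment(int key) { return 1 + (key % 7); }   /* key-dependent step */

/* After a collision at p, advance by increment(key) rather than by 1.
   Note: with SIZE = 10 a probe sequence can cycle without visiting every
   slot; in practice the table size is chosen prime so that it is coprime
   to every possible increment. */
void insertDouble(int key)
{
    int p = key % SIZE;
    while (dtable[p] != EMPTY)
        p = (p + increment(key)) % SIZE;
    dtable[p] = key;
}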
Bucket Addressing
In bucket addressing, the hash table slots are grouped into buckets. When the key is to be
inserted into the table, the key value is hashed to determine the bucket where it will be placed.
If the space is already occupied and hash collision occurs, then the key is stored in the next slot
of the same bucket. If all the slots of the bucket are occupied, then the key value is stored in an
overflow bucket of infinite capacity. This overflow bucket is common for all the buckets.

Example:
Slots:    [0][1]    [2][3]    [4][5]    [6][7]    [8][9]
          Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4

Overflow Bucket: slots [0] to [8]

Suppose we have a table with five buckets each having two slots and an overflow bucket. Let
us insert keys 27, 32, 20, 12, 70 and 35.
Insert key 27 at 27%5 = 2

[  ][  ]  [  ][  ]  [27][  ]  [  ][  ]  [  ][  ]
Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4

Insert key 32 at 32%5 = 2. As the first slot of bucket 2 is occupied, insert key in the second
slot.

[  ][  ]  [  ][  ]  [27][32]  [  ][  ]  [  ][  ]
Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4


Insert key 20 at 20%5 = 0.

[20][  ]  [  ][  ]  [27][32]  [  ][  ]  [  ][  ]
Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4

Insert key 12 at 12%5 = 2. As both the slots of bucket 2 are occupied, insert the key in an
overflow bucket

[20][  ]  [  ][  ]  [27][32]  [  ][  ]  [  ][  ]
Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4

Overflow Bucket: [12][  ][  ][  ][  ][  ][  ][  ][  ]

Insert key 70 at 70%5 = 0. As the first slot of bucket 0 is occupied, insert key in the second
slot.

[20][70]  [  ][  ]  [27][32]  [  ][  ]  [  ][  ]
Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4

Insert key 35 at 35%5 = 0. As both the slots of bucket 0 are occupied, insert the key in an
overflow bucket
[20][70]  [  ][  ]  [27][32]  [  ][  ]  [  ][  ]
Bucket 0  Bucket 1  Bucket 2  Bucket 3  Bucket 4

Overflow Bucket: [12][35][  ][  ][  ][  ][  ][  ][  ]

Suppose we want to retrieve 35. It is hashed to 35%5 = 0. The key 35 is not in either slot of
bucket 0, so the overflow bucket is searched. The first key there does not match, but at index 1
the key 35 is found.
The advantage of bucket addressing is that we avoid linked list operations. The disadvantage is
that a lot of space may be wasted.
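
As an illustration, bucket addressing with five two-slot buckets and a shared overflow area, as in the example above, might be sketched in C as below. The fixed size of the overflow array is our own simplification; the text treats the overflow bucket as unbounded.

#define NBUCKETS 5
#define SLOTS    2
#define OVF      9
#define EMPTY   -1

int buckets[NBUCKETS][SLOTS];
int overflowBucket[OVF];

void initTable(void)
{
    for (int b = 0; b < NBUCKETS; b++)
        for (int s = 0; s < SLOTS; s++)
            buckets[b][s] = EMPTY;
    for (int i = 0; i < OVF; i++)
        overflowBucket[i] = EMPTY;
}

/* Place the key in the first free slot of its home bucket; if the
   bucket is full, fall back to the shared overflow bucket. */
int insertKey(int key)
{
    int b = key % NBUCKETS;
    for (int s = 0; s < SLOTS; s++)
        if (buckets[b][s] == EMPTY) { buckets[b][s] = key; return 0; }
    for (int i = 0; i < OVF; i++)
        if (overflowBucket[i] == EMPTY) { overflowBucket[i] = key; return 0; }
    return -1;   /* even the overflow bucket is full */
}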

Memory Management
The programs that are scheduled to run must reside in memory; therefore, programs must be
allocated space in main memory to be executed. However, because primary memory is volatile, a
user needs to store the program in some non-volatile store, i.e., on a secondary storage medium.
Programs and files may be disk resident and loaded into main memory whenever their execution is
required. Therefore, some form of memory management is needed at both the primary and the
secondary memory levels.
Memory management is the process of controlling and coordinating computer memory, assigning
portions called blocks to the various running programs so as to optimize overall system
performance.

Physical Media Storage


There are several types of data storage present in computer systems. They vary in speed of
access, cost and reliability.

Cache memory: It is the most costly and fastest form of storage. It is usually very small in
size and managed by the operating system.

Main memory: It is the memory used as the storage area for data available to be operated on.
General-purpose machine instructions operate on main memory. Due to its volatile nature, the
contents of main memory are usually lost when the system is shut down or in case of a power
failure. It is too small and too expensive to store an entire database.

Flash memory: This is a form of Electrically Erasable Programmable Read-Only Memory (EEPROM) in
which stored data survives even a power failure. Reading data from flash memory is as fast as
reading from main memory, but writing data into flash memory is more complicated: to overwrite
data, one has to first erase an entire bank of the memory.

Magnetic-disk storage: It is the primary medium for long-term storage. Generally the entire
database is stored on disk. Data must be moved from disk to main memory in order to be operated
on, and after the operations are performed, data must be copied back to the disk. Disk storage is
known as direct access storage because it is possible to read data on the disk in any order,
i.e., non-sequential access to data is possible. Disk storage generally survives power failures
and system crashes.

Optical storage: Compact Disc Read-Only Memory (CD-ROM) is one example of optical storage. Data
are burnt onto a CD-ROM once and can then only be read whenever required.

Magnetic tape storage: It is used primarily for backup and archival data. It is cheaper but has
much slower access since tape must be read sequentially from the beginning. It may be used as
protection from disk failures. This storage is a suitable medium to store sequential files.

Organization of Records into Blocks


As data elements are stored in memory blocks, it is convenient to assign records to blocks in
such a way that each block contains related records. With a fixed-length representation of
variable-length records, we may use several fixed-length records to represent one variable-length
record, and we prefer to store these records together. However, when a new record is inserted,
the block may be full, and we must then move a record to other space. An alternative method uses
more memory space but gives greater efficiency in data access: we assign one bucket to each
value, and this bucket stores the entire variable-length record for the corresponding value. A
bucket consists of as many blocks as necessary, but buckets never share blocks. The first record
in the bucket holds the value for that bucket; subsequent records in the bucket hold the
repeating fields for that value.
A bucket may require several blocks, which we chain together; we can also keep a chain of unused
blocks for insertions. Ideally, all the blocks of a bucket are stored on the same cylinder of a
disk. If deletion is more frequent than insertion and we reserve emptied blocks for the buckets,
we can accumulate many empty blocks. When the buckets become sufficiently disorganized that
performance begins to suffer, the database can be reorganized: the database is copied to tape,
the blocks are relocated, and the database is reloaded so that the buckets are no longer
fragmented.

Garbage Collection
This technique is used to reclaim all the nodes that were previously allocated but are no longer
in use. It is the automatic recycling of dynamically allocated memory space: when a node is
deleted, some memory space becomes free and turns into reusable space that is available for
future use.
One way to do this is to immediately insert the space which has become free into the availability
list. But this method may be time consuming for the operating system. Therefore, another method,
called garbage collection, is applied to do this task: first the data objects that cannot be
accessed in the future are found, and then the resources used by those data objects are
reclaimed.
Garbage collection is an automatic process performed by the garbage collector, in two steps:
i. The operating system sequentially visits all the nodes in memory and tags all those cells
which are currently in use.
ii. The operating system goes through all the nodes again, collects the untagged space and adds
this collected space to the availability list.
Garbage collection may be performed when a small amount of free space is left in the system, when
no free space is left, or when the central processing unit (CPU) is idle and has time to collect
the garbage.

Compacting Garbage Collection


When the garbage has been removed from the heap (free memory), a compacting (copying) collector
can compact the resulting set of objects to remove the spaces between them.
The memory space freed by the garbage collector may be scattered among many small blocks of
memory. This external fragmentation may prevent the memory from being used efficiently. A
compacting collector moves the blocks of allocated memory together, compacting them so that there
is no unused space between them. Compacting also tends to make caches more effective, which
improves run-time performance after garbage collection.
Compacting collectors are difficult to implement because they change the locations of objects in
the heap; therefore, all pointers to moved objects must also be updated. This extra work can be
expensive in time and storage.
File Structure
A file can be seen either as a stream of bytes with no structure or as a collection of records
with fields. A stream file is viewed as a sequence of bytes; in a stream file, the data semantics
is lost, and there is no way to pull the data apart into meaningful units again.
In a file which is seen as a collection of records, the data semantics is preserved, and data
structure operations can easily be performed. A record is a collection of related fields; a field
is the smallest logically meaningful unit of information in a file; and a key is a subset of the
fields in a record used to identify the record uniquely.

For example, suppose there is a file of books:


In the file, each line corresponds to a record.
Each record of the file has the fields:
Book_no, Title, Author, Version, Year, Price

Book_no  Title  Author  Version  Year  Price
101      A      X       I        2015  600
102      B      Y       III      2018  750
...      ...    ...     ...      ...   ...
120      U      Z       I        2005  545

Primary key: A primary key is a key that uniquely identifies a record. In the above file,
Book_no can be taken as a primary key.
Secondary key: Other keys that can be used for searching a record are called secondary keys.
In the above file, Author, Title, Author + Title can be used as the secondary keys to search a
given record.

Record Structures
There are two types of record structures:
i. Fixed length records
ii. Variable length records.

i. Fixed-length records: There are two ways of making fixed-length records:


1. Fixed-length records with fixed-length fields.
2. Fixed-length records with variable-length fields.

ii. Variable-length records: In variable-length records, the number of fields is fixed but the
fields themselves may vary in length. Each record begins with a length indicator, and a delimiter
is placed at the end of each record. An index file is used to keep track of the addresses of the
records: it keeps the byte offset of each record, which allows us to search the index to
determine where a record begins.
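
As an illustration, a fixed-length record with fixed-length fields for the book file above might be declared in C as below; the field widths are assumptions made for the example, not values given in the text.

/* Every record occupies sizeof(struct Book) bytes, so record i starts
   at byte offset i * sizeof(struct Book) in the file. */
struct Book {
    int  book_no;      /* primary key, e.g. 101 */
    char title[40];    /* fixed-length character fields */
    char author[30];
    char version[8];
    int  year;
    int  price;
};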

File Organization
There are four basic types of file organization if a file is viewed as a sequence of records:
i. Sequential file organization
ii. Relative file organization
iii. Indexed sequential file organization
iv. Multi key file organization

i. Sequential File Organization


In a sequential file, records are stored contiguously on the storage device, and the file is read
from the beginning to the end. Some operations, such as computing an average over a field of
every record, are very efficient on sequential files. A sequential file can be organized in two
ways:
a. Unordered sequential file (pile file)
b. Sorted sequential file (records are ordered by some field)
a. Unordered Sequential File (Pile File)
A pile file is a succession of records, placed one after another with no additional structures. In
a pile file records may vary in length. If any record has to be searched, we must examine each
record sequentially in the file starting from the first record.
b. Sorted Sequential Files
A sorted sequential file organization is considered for efficient processing of records in sorted
order on some search key. In a sequential file the records are linked together by pointers to
permit fast retrieval in search key order. Each pointer points to next record in order. Sorted
files are read sequentially to produce lists, such as mailing lists, invoices etc. A sorted file
cannot remain in order after insertions of records, so it has an overflow area for newly inserted
records; this overflow area is not sorted. To search for a record, the sorted area is examined
first; if the record is not found there, the overflow area is searched. If there are a large
number of overflows, the access time degenerates to that of an unordered sequential file.

ii. Relative File Organization


A relative file organization has features of both a sequential file organization and a
fixed-length-record file organization. In a relative file, the records can have varying lengths,
as in a sequential file, but they are stored on disk or tape in fixed-length areas, as in a
fixed-length-record file. In a relative file, a record begins with a 2-byte record length; the
physical record size is equal to the maximum record size plus the two-byte record length.

| 2-byte length | Data of record 1 | 2-byte length | Data of record 2 | 2-byte length | Data of record 3 |
      Record 1                           Record 2                           Record 3

With the relative key, we can randomly access any record without starting from the first record.
The disadvantage of relative file organization is its dependence on relative keys. If we do not
know the relative key of a particular record, we cannot randomly access the file.

iii. Indexed Sequential File Organization


Indexed sequential file organization has features of both the sequential file organization and
the relative file organization. It is an effective way of organizing the records when there is a
requirement to access the records sequentially using some key value and also to access a record
individually using the same key value.

Index: It is a data structure that allows a particular record in a file to be located more
quickly, much as the index of a book lets us find a topic. An index can be either dense or
sparse.

Dense Index: In a dense index, there is an index record for every search key value in the file.
This makes searching faster but requires more space to store the index records themselves. An
index record contains the search key value and a pointer to the actual record in the file.

Suppose we want to find the population of Madrid with key value “Spain”. Since every key
value is stored in an index record, the data is directly found using the corresponding pointer.

Sparse Index: In a sparse index, index records are not created for every search key. An index
record contains a search key and a pointer to the record. To search for a data item, we start
from the index record with the largest search key value less than or equal to the given search
key value and follow its pointer to the actual location; if the record we are looking for is not
at that location, a sequential search is done from there until the desired record is found.
Suppose we want to find the population of Lampyong with the key “Thailand”, and the key
“Thailand” is present in the index record. Starting from the index entry for “Thailand”, which is
the largest search key value less than or equal to the given key, the records are scanned
sequentially to find the data.

Suppose instead we want to find the population of Madrid with the key value “Spain”, but this key
is not present in the index records. Then the scan starts from the index entry for “Russia”, the
largest search key value less than “Spain”, and proceeds sequentially to find the data.

A file containing the logical records is called a data file, in which the records are stored
sequentially. A file containing the index records is called an index file, which has a tree
structure. The field used to order the index records in the index file is termed the indexing
field.

There are two types of indexes:


Primary Index: A primary index is an index ordered in the same way as the data file that is
sequentially ordered according to a key. In a primary index the indexing field is equal to this
key.
Secondary Index: A secondary index is an index which is defined on a non-ordering field of
the data file. In a secondary index the indexing field does not contain unique values.
A data file can have at most one primary index, in addition to several secondary indexes.

A sorted data file with a primary index is called an Indexed Sequential File. The advantage of an
indexed sequential file is that it allows both sequential searching and individual record
retrieval through the index value. The structure of an indexed sequential file includes a primary
storage area, a separate index or indexes, and an overflow area. The B+ tree is one of the most
widely used index structures for databases.

iv. Multi-Key File Organization


An indexed sequential file organization provides access by one primary key field only. If we want
to access the file by fields other than the primary key, secondary key fields can be used. Many
applications require such multi-key files.
Suppose we want to access a student file by student number, then by surname, and later by course.
Similarly, suppose we want to access an employee file by employee number, by date of joining, by
designation, by last name, and by department. A multi-key file organization allows the file to be
accessed through several different key fields.
There are two main techniques employed to implement multi-key file organization: multi-lists and
inverted lists. In a multi-list organization, indexes are defined on the multiple fields that are
frequently used to search the records. Inverted-list structures likewise maintain multiple
indexes on the file; the difference is that instead of maintaining pointers in each record, as
multi-lists do, the indexes of an inverted file maintain multiple pointers to the records.
To implement a multi-key file organization, we create multiple indexes, each keyed on a field
through which the address of a stored record can be found.
All the indexes in a multi-key file organization are memory-address sensitive, i.e., if a record
is relocated in memory, changes have to be made in all indexes. This can be avoided by keeping
one index of (primary key, memory address) pairs and all the other indexes as (secondary key,
primary key) pairs; if a record is then relocated in memory, only one index has to be changed.
Fields for which indexes have already been created may be removed from the actual database to
save memory; this is known as inversion. In a completely inverted list, no field remains in the
actual database and all fields are represented only in the indexes. This is chiefly of
theoretical interest, because normally a record need not be accessed by every field.
