DS_TM_Study_Material_Presentations_Unit-4_1TM
Unit-4
Hashing & File Structure
(File Structure)
What is a File?
A file is a collection of records, where a record consists of one or more fields. Each record
contains the same sequence of fields.
Each field is normally of fixed length.
A sample file with four records is shown below:
Name      Roll No.  Year  Marks
AMIT      1000      1     82
KALPESH   1005      2     54
JITENDRA  1009      1     75
RAVI      1010      1     79
• There are four records.
• There are four fields (Name, Roll No., Year, Marks).
• Records can be uniquely identified on the field 'Roll No.' Therefore, Roll No. is the key field.
• A database is a collection of files.
File Organizations
File Organizations:
1. Sequential files
2. Relative files
3. Direct files
4. Indexed Sequential files
5. Index files
Primitive Operations on a File:
1. Creation
2. Reading
3. Insertion
4. Deletion
5. Updation
6. Searching
Sequential Files
It is the most common type of file.
A fixed format is used for each record.
All records are of the same length.
The position of each field in a record and the length of each field are fixed.
Records are physically ordered on the value of one of the fields, called the ordering field.
Block 1
Name      Roll No.  Year  Marks
AMIT      1000      1     82
KALPESH   1005      1     54
JITENDRA  1009      1     75
RAVI      1010      1     79
Block 2
Name      Roll No.  Year  Marks
RAMESH    1015      1     75
ROHIT     1025      1     65
JANAK     1026      1     75
AMAR      1029      1     79
Advantages of Sequential Files
Reading of records in order of the ordering key is extremely efficient.
Finding the next record in order of the ordering key usually does not require an additional block
access. The next record may be found in the same block.
Searching on the ordering key is much faster. Binary search can be used. A binary
search requires about $\log_2 b$ block accesses, where b is the total number of blocks in the file.
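The block-level binary search described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `find_record`, the in-memory list of non-empty blocks, and the (roll number, name) tuples are assumptions made for the example, and each list access stands in for one disk block access.

```python
def find_record(blocks, key):
    """Binary search over sorted, non-empty blocks.

    Returns (record, block_accesses); record is None if the key is absent.
    """
    lo, hi = 0, len(blocks) - 1
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # simulated block access
        accesses += 1
        if key < block[0][0]:        # key lies before this block
            hi = mid - 1
        elif key > block[-1][0]:     # key lies after this block
            lo = mid + 1
        else:                        # key can only be in this block
            for record in block:
                if record[0] == key:
                    return record, accesses
            return None, accesses
    return None, accesses


# Two blocks ordered on Roll No. (the ordering field).
blocks = [
    [(1000, "AMIT"), (1005, "KALPESH"), (1009, "JITENDRA"), (1010, "RAVI")],
    [(1015, "RAMESH"), (1025, "ROHIT"), (1026, "JANAK"), (1029, "AMAR")],
]

record, accesses = find_record(blocks, 1026)
```

With only two blocks the search touches at most two of them; for a file of b blocks the loop halves the candidate range on every access.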
Disadvantages of Sequential Files
A sequential file does not give any advantage when the search operation is to be carried out on
a non-ordering field.
Inserting a record is an expensive operation. Insertion of a new record requires finding the place
of insertion, and then all records after it must be moved to create space for the record to be
inserted. This could be very expensive for large files.
Deleting a record is an expensive operation. Deletion too requires movement of records.
Modifying the value of the ordering field could be time consuming. Modifying the ordering field
means the record can change its position. This requires deletion of the old record followed by
insertion of the modified record.
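To make the insertion cost concrete, the sketch below counts how many records must shift when a new key is placed into a sorted file. The helper `insert_sorted` and the use of a Python list in place of disk blocks are assumptions for illustration only; the new roll number 1003 is a made-up value.

```python
import bisect


def insert_sorted(keys, new_key):
    """Insert new_key into the sorted list and return how many records shifted."""
    pos = bisect.bisect_left(keys, new_key)   # find the place of insertion
    moved = len(keys) - pos                   # every later record moves one slot
    keys.insert(pos, new_key)
    return moved


roll_numbers = [1000, 1005, 1009, 1010, 1015, 1025, 1026, 1029]
moved = insert_sorted(roll_numbers, 1003)     # 7 of the 8 records shift
```

Inserting near the front of the file moves almost every record, which is why the cost grows with file size.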
Hashing (Direct file organization)
[Figure: Hashing with buckets of chained blocks. A Bucket Directory with entries 0 to B-1 points to Bucket 0, Bucket 1, Bucket 2, ...]
Hashing (Direct file organization)
It is a common technique used for fast accessing of records on secondary storage.
Records of a file are divided among buckets.
A bucket is either one disk block or a cluster of contiguous blocks.
A hashing function maps a key into a bucket number. The buckets are numbered 0, 1, 2, ..., b-1.
A hash function f maps each key value into one of the integers 0 through b - 1.
If x is a key, f(x) is the number of the bucket that contains the record with key x.
The blocks making up each bucket could either be contiguous blocks or they can be chained
together in a linked list.
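As a small illustration of a hash function f mapping keys to bucket numbers, the sketch below uses f(x) = x mod b. The modulo function, the choice b = 10, and the name `bucket_number` are assumptions for the example; the slides do not say which hash function is used.

```python
def bucket_number(key, b):
    """Assumed hash function: map an integer key to a bucket in 0..b-1."""
    return key % b


b = 10                                   # assumed number of buckets
keys = [230, 321, 232, 460, 531, 242]

buckets = {}
for key in keys:
    # Group each key under the bucket its hash value selects.
    buckets.setdefault(bucket_number(key, b), []).append(key)

# buckets == {0: [230, 460], 1: [321, 531], 2: [232, 242]}
```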
Hashing (Direct file organization)
Translation of a bucket number to a disk block address is done with the help of the bucket directory. It
gives the address of the first block of the chained blocks in a linked list.
Hashing is quite efficient in retrieving a record on the hashed key. The average number of block
accesses for retrieving a record is:
$$\text{block accesses} = 1\ (\text{bucket directory}) + \frac{\text{No. of records}}{\text{No. of buckets} \times \text{No. of records per block}}$$
Thus the operation is roughly b times faster (b = number of buckets) than searching an unordered file.
To insert a record with key value x, the new record can be added to the last block in the chain for
bucket f(x). If the record does not fit into the existing block, the record is stored in a new block and
this new block is added at the end of the chain for bucket f(x).
A well-designed hashed structure requires two block accesses for most operations.
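The insertion and retrieval steps above can be sketched in a few lines. This is only a rough model: the class name `HashedFile`, the block capacity of 2 records, the hash function key mod b, and the use of Python lists in place of linked disk blocks are all assumptions for illustration.

```python
BLOCK_CAPACITY = 2  # assumed number of records per block


class HashedFile:
    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        # Bucket directory: entry i leads to the chain of blocks for bucket i.
        self.directory = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return key % self.num_buckets        # assumed hash function f(x)

    def insert(self, key, record):
        chain = self.directory[self._bucket(key)]
        if not chain or len(chain[-1]) >= BLOCK_CAPACITY:
            chain.append([])                 # last block full: add a new block
        chain[-1].append((key, record))

    def lookup(self, key):
        accesses = 1                         # reading the bucket directory
        for block in self.directory[self._bucket(key)]:
            accesses += 1                    # one access per block in the chain
            for k, record in block:
                if k == key:
                    return record, accesses
        return None, accesses


hf = HashedFile(10)
for key in [230, 480, 460, 790, 321, 531]:
    hf.insert(key, f"rec{key}")

first = hf.lookup(480)    # found in the first block of bucket 0
second = hf.lookup(790)   # found in the second block of the chain
```

When each bucket fits in a single block, every lookup costs two accesses: one for the directory and one for the block.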
Indexing
Indexing is used to speed up retrieval of records.
It is done with the help of a separate sequential file.
Each record in the index file consists of two fields: a key field and a pointer into the main file.
To find a specific record for a given key value, the index is searched for that key value.
Binary search can be used to search the index file. After getting the address of the record from the index file,
the record in the main file can easily be retrieved.
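A minimal sketch of this lookup, assuming the main file is an unsorted in-memory list and a "pointer" is simply a position in that list (both are simplifications for illustration; the function name `lookup` is hypothetical):

```python
import bisect

# Main file: records are stored in arrival order, not sorted.
main_file = [
    (1010, "RAVI", 79),
    (1000, "AMIT", 82),
    (1009, "JITENDRA", 75),
    (1005, "KALPESH", 54),
]

# Index file: (key, pointer) pairs kept sorted on the key.
index = sorted((record[0], pos) for pos, record in enumerate(main_file))
index_keys = [key for key, _ in index]


def lookup(key):
    """Binary-search the index, then follow the pointer into the main file."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return main_file[index[i][1]]
    return None


record = lookup(1009)
```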
Indexing
[Figure: Index File and Main File]
The index file is ordered on the ordering key Roll No. Each record of the index file points to
the corresponding record in the main file. The main file is not sorted.
Advantages of Indexing
A sequential file can be searched efficiently on the ordering key. When it is necessary to search for a
record on the basis of an attribute other than the ordering key field, the sequential file
representation is inadequate.
Multiple indexes can be maintained, one for each field used for searching. Thus, indexing
provides much better flexibility.
An index file usually requires less storage space than the main file, so a binary search on the
sequential file itself requires accessing more blocks than a binary search on the index.
This can be explained with the help of the following example.
Consider a sequential file with r = 1024 fixed-length records of record size R = 128 bytes,
stored on a disk with block size B = 2048 bytes.
Advantages of Indexing
Size of Sequential File
Number of blocks required to store the file:
$(1024 \times 128) / 2048 = 64$
Number of block accesses for searching a record:
$\log_2 64 = 6$
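The arithmetic above can be checked with a short script. The index-side figures are not given in the slides: the 16-byte index entry (key plus pointer), the one-entry-per-record index, and the extra access to fetch the record from the main file are assumptions chosen only to illustrate the comparison.

```python
import math


def blocks_needed(num_records, record_size, block_size):
    """Blocks required when records do not span block boundaries."""
    per_block = block_size // record_size
    return math.ceil(num_records / per_block)


def binary_search_accesses(num_blocks):
    """Block accesses for a binary search over num_blocks blocks."""
    return math.ceil(math.log2(num_blocks))


# Sequential file from the example: r = 1024, R = 128 bytes, B = 2048 bytes.
file_blocks = blocks_needed(1024, 128, 2048)             # 64
file_accesses = binary_search_accesses(file_blocks)      # 6

# Assumed index: one 16-byte entry per record (hypothetical size).
INDEX_ENTRY_SIZE = 16
index_blocks = blocks_needed(1024, INDEX_ENTRY_SIZE, 2048)    # 8
index_accesses = binary_search_accesses(index_blocks) + 1     # 3 + 1 for the data block
```

Under these assumptions the index needs 8 blocks instead of 64, and a lookup costs 4 block accesses instead of 6.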
Types of Indexes
With indexing, new records can be added at the end of the main file. This does not require movement
of records, as it does in the case of a sequential file.
Updation of the index file requires fewer block accesses compared to a sequential file.
Types of Indexes:
1. Primary indexes
2. Clustering indexes
3. Secondary indexes
Primary Indexes (Indexed Sequential File)
[Figure: Primary Index on ordering key field Roll Number. Labels: Index File, Data File (Sequential File).]
Primary Indexes (Indexed Sequential File)
An indexed sequential file is characterized by:
• Sequential organization (ordered on primary key)
• Indexed on primary key
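A rough sketch of how such a file might be searched, assuming the primary index holds one entry per data block (the first key stored in that block). The block contents are partly made up: the keys 150, 300 and 380 are hypothetical filler, and the function name `find_block` is an assumption.

```python
import bisect

# Data file: blocks ordered on the primary key (Roll Number).
data_blocks = [
    [101, 150, 200],
    [201, 300, 350],
    [351, 380, 400],
]

# Primary index: (first key in block, block number), one entry per block.
primary_index = [(block[0], block_no) for block_no, block in enumerate(data_blocks)]
anchor_keys = [key for key, _ in primary_index]


def find_block(key):
    """Return the number of the block holding key, or None if absent."""
    i = bisect.bisect_right(anchor_keys, key) - 1   # last anchor <= key
    if i < 0:
        return None                                  # key precedes the first block
    block_no = primary_index[i][1]
    return block_no if key in data_blocks[block_no] else None


block_no = find_block(300)
```

Only the small index is binary-searched; a single data block is then read to confirm the key.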
Clustering Indexes
[Figure: Clustering index. Labels: Clustering Field, Index File, Data File.]
Clustering Indexes
If records of a file are ordered on a non-key field, we can create a different type of index known
as clustering index.
A non-key field does not have a distinct value for each record.
A clustering index is also an ordered file with two fields.
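The idea can be sketched as below. The slides do not name the two fields, so this example assumes the common textbook layout: a clustering field value and a pointer to the first block containing that value. The record labels `r1`..`r9` and the function name `find_all` are hypothetical.

```python
# Data file ordered on a non-key field; several records share each value.
blocks = [
    [(100, "r1"), (100, "r2"), (105, "r3")],
    [(105, "r4"), (105, "r5"), (106, "r6")],
    [(106, "r7"), (108, "r8"), (108, "r9")],
]

# Clustering index: one entry per distinct value -> first block holding it.
# A dict keeps insertion order, which here matches the sorted field order.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for value, _ in block:
        clustering_index.setdefault(value, block_no)


def find_all(value):
    """Collect every record with the given clustering field value."""
    start = clustering_index.get(value)
    if start is None:
        return []
    result = []
    for block in blocks[start:]:
        matches = [rec for rec in block if rec[0] == value]
        if not matches:
            break                 # file is ordered, so no later block can match
        result.extend(matches)
    return result


records_105 = find_all(105)
```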
Secondary Indexes (Simple Index File)
[Figure: Secondary index (simple index file).]
Secondary Indexes (Simple Index File)
A secondary index requires more storage space and longer search time than does a primary
index.
A secondary index file has an entry for every record, whereas a primary index file has an entry for
every block in the data file.
There is a single primary index file, but there can be several secondary indexes.
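The difference in size can be seen in a small sketch. It assumes a data file of two blocks ordered on Roll No., a primary index with one entry per block, and a secondary index on the Name field with one entry per record; the (block, slot) pointer format and the function name `find_by_name` are illustrative assumptions.

```python
import bisect

# Data file ordered on Roll No., two records per block.
blocks = [
    [(1000, "AMIT"), (1005, "KALPESH")],
    [(1009, "JITENDRA"), (1010, "RAVI")],
]

# Primary index: one entry per block (first key of the block).
primary_index = [(block[0][0], block_no) for block_no, block in enumerate(blocks)]

# Secondary index on Name: one entry per record, sorted on the name.
secondary_index = sorted(
    (name, block_no, slot)
    for block_no, block in enumerate(blocks)
    for slot, (_, name) in enumerate(block)
)
names = [entry[0] for entry in secondary_index]


def find_by_name(name):
    """Binary-search the secondary index and fetch the record it points to."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        _, block_no, slot = secondary_index[i]
        return blocks[block_no][slot]
    return None


record = find_by_name("KALPESH")
```

Here the primary index has 2 entries and the secondary index has 4, one per record, which is the extra storage the comparison above refers to.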
Data Structures (DS)
Thank You