DSA Unit6 Theory

This document summarizes different file organization techniques including sequential, direct access, and indexed sequential files. It describes sequential files and their inefficient searching. Indexed sequential files maintain a separate index file to accelerate record retrieval using the primary key. Direct access files use hashing to directly access records by hash key, requiring collision handling. Indexed sequential files allow variable records and efficient access via indexes, while direct access files have fixed records and more efficient retrieval by hash key. Primary indexes in indexed sequential files have one-to-one relationships between index entries and data blocks.

Uploaded by

Ankush Amrutkar
Copyright
© All Rights Reserved

Subject: Data Structures and Algorithms (210252) SE COMPUTER (2019) Pattern

Unit VI File Organization


Syllabus: Files: concept, need, primitive operations. Sequential file
organization- concept and primitive operations, Direct Access File-
Concepts and Primitive operations, Indexed sequential file
organization-concept, types of indices, structure of index sequential file,
Linked Organization- multilist files, coral rings, inverted files and
cellular partitions.

1. File
A file is a named collection of related information that is recorded on
secondary storage such as magnetic disks, magnetic tapes and optical
disks.

A sample file with four records is shown in the figure.

The file has four records, each with four fields (Name, Roll No., Year, Marks).
Records can be uniquely identified by the field 'Roll No.'; therefore, Roll No.
is the key field. A database is a collection of files.

2. Primitive Operations
Primitive Operations on a File :
1. Creation
2. Reading
3. Insertion
4. Deletion
5. Updation
6. Searching

3. Types of File
There are basically two types of files:
1. Text Files
Text files are regular files that contain information readable by the user.
This information is stored in ASCII. You can display and print these files.
The lines of a text file must not contain NUL characters, and none can
exceed {LINE_MAX} bytes in length, including the new-line character.
The term text file does not prevent the inclusion of control or other
nonprintable characters (other than NUL). Therefore, standard utilities that
list text files as inputs or outputs are either able to process the special
characters gracefully or they explicitly describe their limitations within their
individual sections.
2. Binary Files
Binary files are regular files that contain information readable by the
computer. Binary files may be executable files that instruct the system to
accomplish a job. Commands and programs are stored in executable,
binary files. Special compiling programs translate ASCII text into binary
code.
The only difference between text and binary files is that text files have lines
of less than {LINE_MAX} bytes, with no NUL characters, each terminated
by a new-line character.
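As a rough illustration of the distinction above, the sketch below stores the same values once as a text file and once as a binary file. The file names, the sample marks and the 4-byte integer layout are assumptions made only for this example.

```python
import os
import struct
import tempfile

marks = [72, 85, 91]  # hypothetical data

workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "marks.txt")
bin_path = os.path.join(workdir, "marks.bin")

# Text file: one human-readable line per value, each ended by a new-line.
with open(text_path, "w", encoding="ascii", newline="\n") as f:
    for m in marks:
        f.write(f"{m}\n")

# Binary file: each value packed as a 4-byte little-endian integer.
with open(bin_path, "wb") as f:
    for m in marks:
        f.write(struct.pack("<i", m))

# Reading back: text is parsed line by line, binary is unpacked by layout.
with open(text_path, "r", encoding="ascii") as f:
    text_values = [int(line) for line in f]

with open(bin_path, "rb") as f:
    raw = f.read()
bin_values = [v[0] for v in struct.iter_unpack("<i", raw)]

print(text_values, bin_values, len(raw))
```

Both files hold the same information; only the text version can be displayed or printed directly, while the binary version must be interpreted using its layout.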

4. File Organization
File organization refers to the logical relationships among the various records
that constitute the file, particularly with respect to the means of
identification and access to any specific record. In simple terms, storing
the records of a file in a certain order is called file organization.

1. Random / Direct Access File


Random organization is a kind of file organization in which records are
stored at random locations on the disk.
There are three techniques used in random organization: direct addressing,
directory lookup and hashing (see the following fig.).

a. Direct Addressing
In direct addressing two types of records are handled: fixed length records
and variable length records.
For storing fixed length records, the disk space is divided into nodes.
Each node is large enough to hold an individual record.
Every fixed length record is stored in the node whose number is equal to its
primary key value. For example, if the primary key value is 185 then the
record must be present in node number 185.

If we consider that the records are stored on external storage devices,
then deletion and searching of a record each require one disk access. If we
want to update a record then it requires two disk accesses: one for reading
the record and one for writing the updated data back to the disk.
For storing variable length records on the disk, the address (pointer) of
each individual record is stored in the file at a specific index. We can locate
a variable length record using the index of its pointer. This pointer
points to the desired record on the disk.
Variable length records make storage management more complex.
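The fixed length case of direct addressing can be sketched as follows. This is a minimal sketch, not a definitive implementation: the record layout (roll number, marks, 12-byte name), the sample values and the use of an in-memory buffer in place of a disk file are all assumptions, and key 0 is treated as unused so that a zero-filled node can be recognised as empty.

```python
import io
import struct

# Hypothetical fixed-length layout: roll number (int), marks (int), 12-byte name.
RECORD = struct.Struct("<ii12s")

def write_record(f, key, marks, name):
    """Store the record in node number `key` (one seek, one write)."""
    f.seek(key * RECORD.size)
    f.write(RECORD.pack(key, marks, name.encode("ascii")))

def read_record(f, key):
    """Fetch node number `key` with a single seek and read."""
    f.seek(key * RECORD.size)
    data = f.read(RECORD.size)
    if len(data) < RECORD.size:
        return None                      # node lies beyond the end of the file
    k, marks, name = RECORD.unpack(data)
    if k != key:
        return None                      # empty (zero-filled) node
    return k, marks, name.rstrip(b"\0").decode("ascii")

def update_marks(f, key, new_marks):
    """An update needs two accesses: read the record, then write it back."""
    rec = read_record(f, key)
    if rec is not None:
        write_record(f, key, new_marks, rec[2])

disk = io.BytesIO()                      # stands in for a file on disk
write_record(disk, 185, 78, "Asha")
print(read_record(disk, 185))            # record stored in node 185
print(read_record(disk, 42))             # node 42 was never written
```

The node number is turned into a byte offset by multiplying it by the record size, which is why every record must have the same length in this scheme.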
b. Directory Lookup
• In this scheme an index of pointers to the records is
maintained.
• To retrieve the desired record, the index is first searched for the
record address, and then the actual record is accessed using this
address.
• The drawback of this method is that it requires more disk accesses
than the direct addressing method.
• The advantage of this method is that it utilizes disk space more
effectively than the direct addressing method.
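The lookup described above can be sketched as below. The directory is modelled as a dictionary from primary key to (offset, length), and an in-memory buffer stands in for the data area on disk; both choices, and the sample records, are assumptions for illustration.

```python
import io

data_file = io.BytesIO()   # stands in for the data area on disk
directory = {}             # primary key -> (offset, length)

def insert(key, payload: bytes):
    """Append a variable length record and remember where it was placed."""
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    data_file.write(payload)
    directory[key] = (offset, len(payload))

def fetch(key):
    entry = directory.get(key)       # access 1: search the directory
    if entry is None:
        return None
    offset, length = entry
    data_file.seek(offset)           # access 2: read the record itself
    return data_file.read(length)

insert(185, b"Asha,SE,78")
insert(7, b"Ravi,TE,64,repeat")
print(fetch(7))
```

Records are packed one after another with no empty nodes, which is where the better space utilization comes from; the price is the extra directory access on every retrieval.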
c. Hashing
• Hashing is a technique in which a hash key is obtained using some
suitable hash function and the record is placed in the hash table with the
help of this hash key.
• Thus in this random organization, a record can be searched quickly
with the help of the hash function being used.
• To create the hash table, the available file space is divided into
buckets and slots.
• Some file space is set aside for handling overflow.
The number of slots per bucket is equal to the number of records
each bucket can hold.
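A minimal sketch of buckets, slots and an overflow area is given below. The bucket count, the slot count, the division-method hash function and the sample keys are assumptions chosen only to make a collision visible.

```python
NUM_BUCKETS = 5        # assumed sizes, chosen only for the example
SLOTS_PER_BUCKET = 2

buckets = [[] for _ in range(NUM_BUCKETS)]
overflow = []          # file space set aside for overflow records

def h(key):
    """A simple division-method hash function (an assumption)."""
    return key % NUM_BUCKETS

def insert(key, record):
    bucket = buckets[h(key)]
    if len(bucket) < SLOTS_PER_BUCKET:
        bucket.append((key, record))
    else:
        overflow.append((key, record))   # bucket is full: use the overflow area

def search(key):
    for k, record in buckets[h(key)]:
        if k == key:
            return record
    for k, record in overflow:           # fall back to the overflow area
        if k == key:
            return record
    return None

for key in (12, 17, 22, 5):
    insert(key, f"record-{key}")

print(buckets[2], overflow)
```

Keys 12, 17 and 22 all hash to bucket 2, so the third of them lands in the overflow area once both slots are taken.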
Operations on Direct Access File
Various operations that can be performed on direct access file are
1. Create
2. Insert a record into a file
3. Delete a record from a file
4. Update a record.

2. Index Sequential File Organization

The main drawback of a sequential file is that the searching operation
is not efficient, because in sequential organization the primary key
of every record is compared with the search key. To optimize this
operation, the concept of the index sequential file is introduced.
In index sequential file organization, a separate file storing an
index entry for every record is maintained along with the master file.

The index sequential organization accelerates the retrieval of any desired
record. In this case, we need not scan the entire memory block of
records. Instead, using the primary key (such as EMP_ID) and the position
stored in the index, we can access the record directly from the master file.
Advantages
1. The desired record can be accessed efficiently by using the index, which is
maintained in a separate file.
2. Variable length records can also be handled using an index sequential
file.
Disadvantages
1. At least two files need to be maintained: one master file and
one index file. Hence extra storage is required in
order to maintain the index file.
2. Insertion and deletion require the index to be updated as
well.
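The idea of a master file plus a separate index can be sketched as follows. Lists stand in for the two files, and the EMP_ID values and names are hypothetical.

```python
import bisect

# Master file: records kept in EMP_ID order (hypothetical data).
master = [
    (101, "Asha"), (104, "Ravi"), (110, "Meena"), (123, "John"),
]

# Index file: sorted keys with the position of each record in the master file.
index_keys = [emp_id for emp_id, _ in master]
index_pos = list(range(len(master)))

def lookup(emp_id):
    """Binary-search the index, then read one record from the master file."""
    i = bisect.bisect_left(index_keys, emp_id)
    if i < len(index_keys) and index_keys[i] == emp_id:
        return master[index_pos[i]]
    return None

print(lookup(110))
print(lookup(105))
```

Only the small index is searched; the master file is touched once, at the position the index supplies, instead of being scanned record by record.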

Comparison
Comparison between Index sequential and direct access file.

Sr. No. | Index Sequential File | Direct Access File
1 | Desired record can be obtained using the index which is maintained in a separate file. | Desired record can be obtained using the hash key. This hash key is returned by some hash function.
2 | Insertion and deletion operations can be performed by simple index manipulation. | On insertion or deletion of records, collision may occur. Hence collision handling techniques are required.
3 | Variable length records are allowed. | Record length must be fixed.
4 | Index key is obtained using primary key of the record. | Hash key is obtained by passing primary key of record to hash function.
5 | Records are arranged sequentially in the master file. | Records can be arranged randomly. This arrangement is influenced by hash key.
6 | It is less efficient. | It is more efficient.

5. Types of Indices
1. Primary Index
2. Secondary Index
3. Clustering Index

1. Primary Index
Primary Index is an ordered file of fixed-length entries with two fields.
The first field is the same as the primary key and the second field points to
the data block that holds it. In the primary index, there is always a one-to-one
relationship: each entry in the index table points to exactly one data block.
The primary indexing in DBMS is further divided into two types:
• Dense Index
• Sparse Index
Dense Index: In a dense index, an index record is created for every search-key
value in the database. This makes searching faster but needs more
space to store the index records. In this indexing method, each index record
contains the search-key value and a pointer to the actual record on the disk.

Sparse Index: A sparse index contains index records for only some of the
search-key values in the file, which resolves the space issue of dense
indexing. In this technique, a range of search-key values shares the same
data block address, and when data needs to be retrieved, that block address
is fetched and the block is searched.
A sparse index needs less space and less maintenance overhead for insertions
and deletions, but it is slower than a dense index for locating records.
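The trade-off between the two kinds of primary index can be sketched as follows. The blocks, keys and values are hypothetical, and the sparse index here keeps the first key of each block, which is one common choice rather than the only one.

```python
import bisect

# Data file split into blocks of sorted records (hypothetical keys).
blocks = [
    [(5, "a"), (9, "b"), (14, "c")],
    [(21, "d"), (27, "e"), (33, "f")],
    [(40, "g"), (48, "h")],
]

# Dense index: one entry for every search-key value.
dense = {key: (b, i)
         for b, blk in enumerate(blocks)
         for i, (key, _) in enumerate(blk)}

# Sparse index: only the first key of each block.
sparse_keys = [blk[0][0] for blk in blocks]

def dense_lookup(key):
    pos = dense.get(key)
    return None if pos is None else blocks[pos[0]][pos[1]][1]

def sparse_lookup(key):
    b = bisect.bisect_right(sparse_keys, key) - 1   # last block whose first key <= key
    if b < 0:
        return None
    for k, value in blocks[b]:                      # sequential scan inside the block
        if k == key:
            return value
    return None

print(len(dense), len(sparse_keys))
print(dense_lookup(27), sparse_lookup(27))
```

The dense index holds eight entries against three for the sparse index, but the sparse lookup has to scan inside the block after the index search.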

2. Secondary Index
The secondary index in DBMS can be generated on a field which has a
unique value for each record, and it should be a candidate key. It is also
known as a non-clustering index.
This two-level database indexing technique is used to reduce the mapping
size of the first level. For the first level, a large range of numbers is
selected; because of this, the mapping size always remains small.
Example of secondary indexing
Let's understand secondary indexing with a database index example.
In a bank account database, data is stored sequentially by acc_no; you
may want to find all accounts of a specific branch of ABC bank.
Here, you can have a secondary index in DBMS for every search-key. Each index
record points to a bucket that contains pointers to all the records
with that specific search-key value.
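The bank example can be sketched as below. The account numbers, branch names and balances are hypothetical; each bucket is modelled as a list of record positions.

```python
from collections import defaultdict

# Records stored sequentially by acc_no (hypothetical data).
accounts = [
    (1001, "Pune", 5000),
    (1002, "Mumbai", 12000),
    (1003, "Pune", 800),
    (1004, "Nashik", 4300),
]

# Secondary index on branch: each entry is a bucket of record positions.
branch_index = defaultdict(list)
for pos, (_, branch, _) in enumerate(accounts):
    branch_index[branch].append(pos)

def accounts_in(branch):
    """Follow the bucket's pointers instead of scanning every account."""
    return [accounts[pos] for pos in branch_index.get(branch, [])]

print(accounts_in("Pune"))
```

The data stays ordered by acc_no; the secondary index only adds another access path on a different field.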

3. Clustering Index
In a clustered index, the records themselves are stored in the index, not
pointers. Sometimes the index is created on non-primary key columns
which might not be unique for each record. In such a situation, you can
group two or more columns to get unique values and create an index,
which is called a clustered index. This also helps you to identify the record
faster.
Example: Suppose a company has several employees in each
department. Suppose we use a clustering index, where all employees who
belong to the same Dept_ID are considered to be within a single cluster, and
index pointers point to the cluster as a whole. Here Dept_ID is a non-unique
key.

4. Multilevel Indexing
As the size of the database grows, the indices also grow. If the index
is to be kept in main memory, a single-level index might become too large
to fit, and searching it then requires multiple disk accesses. Multilevel
indexing segregates the main index into various smaller blocks so that each
can be stored in a single block. The outer blocks are divided into inner
blocks, which in turn point to the data blocks. The outer index can then be
stored easily in main memory with less overhead.
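A two-level version of this idea can be sketched as follows. The keys, block numbers and block sizes are hypothetical; the outer index keeps the first key of each inner block.

```python
import bisect

# Inner index blocks: each holds (key, data block number) pairs (hypothetical).
inner_blocks = [
    [(10, 0), (20, 1), (30, 2)],
    [(40, 3), (50, 4), (60, 5)],
    [(70, 6), (80, 7)],
]

# Outer index: first key of each inner block; small enough for main memory.
outer_keys = [blk[0][0] for blk in inner_blocks]

def find_data_block(key):
    """Return the data block number for `key`, or None if it is not indexed."""
    o = bisect.bisect_right(outer_keys, key) - 1   # search the outer index
    if o < 0:
        return None
    for k, block_no in inner_blocks[o]:            # read one inner block
        if k == key:
            return block_no
    return None

print(find_data_block(50))
```

Only the three-entry outer index needs to stay in memory; each search then reads a single inner block before going to the data block.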

6. Linked Organization
In linked organization the logical sequence of the records is different from
the physical sequence. In any sequential organization, if the nth record
is at location Loc_i then the (n+1)th record is located at (Loc_i + c), where
c is a constant which represents the length of the record or some
inter-record spacing.
In linked organization we can access the next logical record by following the
link-value pair. The link-value pair denotes each individual record.
The typical structure of every record is as follows.

Thus records in the linked organization can be stored as follows

1. Multi-list Files
In linked organization, records can be searched using the primary
as well as the secondary keys. Hence an index must be maintained for each of
these keys. This leads to the multi-list structure for linked file
organization.
Example: Consider a multi-list structure for student database -
Each individual record looks like this

The index for each key field i.e. Class, Sex and Marks is as follows -

The index on the primary key Roll_no is maintained using the multi-list
structure shown below.
In the Roll_no record structure, there are 3 fields: value, length and pointer
to the first record.
• The value field indicates the upper bound value for the Roll_no. For
instance, if the roll number of a particular student is 437 then it lies
between 0 and 500, so the upper bound is 500. Hence the
record of that student must be associated with value 500. Similarly, if
the roll number of a particular student is 689 then it must be associated
with the value 700.
• The length field denotes the total number of records. In the above Fig.,
length 2 for value 500 means there are 2 records whose Roll_no lies
between 0 and 500. Similarly, length 2 for value 700 means there are
2 records whose Roll_no lies between 501 and 700.
• The third field is the pointer or link field, which points to the first
record. For instance, in the above Fig., for value 500 the pointer field
points to record BBB, and in record BBB there is a Roll_no link
field which points to record EEE. Thus there are 2 records in total, BBB
and EEE, with value 500.

Similarly, for value 700 there are 2 records, and the pointer field points to
the first record, DDD. Record DDD identifies the next record by pointing
to AAA (refer to the Roll_no link of record DDD).

For value 900 there is only one record, CCC, which is pointed to by the
pointer field. The index maintained for each key field is useful for
executing queries. Observe the above Fig. of the class index. It tells us
that there are 2 records for fifth standard and 3 records for tenth
standard. The first student of the fifth standard class is BBB and the second
is CCC (refer to the class field of record BBB in the Fig.).
Thus we can solve the query "select * from stud_table where class = fifth"
and the answer will be BBB and CCC. If we observe the above Fig. of the Marks
index, the second class column shows that there are 3 records, of
which the first is AAA. The Marks field of record AAA
denotes the next record as DDD, and the Marks field of record DDD
denotes CCC as the next record. For record CCC the Marks field holds
the value NULL. All this indicates that the second class holders are AAA,
DDD and CCC.
Advantages
1) The multi-list structure provides a satisfactory solution to simple and
range queries. For instance, a query such as "Select * from dept_table where
salary > 10000" can be executed efficiently using the
multi-list structure.
2) Quick access to every individual record is possible.
Disadvantage
1) Some memory is consumed in maintaining the link or
address fields.
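The chains described in this section can be sketched as below. Record names, index values and link targets follow the text; link fields the text does not spell out (for example the chain for tenth standard) are simply left as None, and the dictionary layout and field names are assumptions of this sketch.

```python
# Each record stores one link field per indexed key.
# Links follow the chains described in the text; unspecified links are None.
records = {
    "AAA": {"roll_link": None,  "class_link": None,  "marks_link": "DDD"},
    "BBB": {"roll_link": "EEE", "class_link": "CCC", "marks_link": None},
    "CCC": {"roll_link": None,  "class_link": None,  "marks_link": None},
    "DDD": {"roll_link": "AAA", "class_link": None,  "marks_link": "CCC"},
    "EEE": {"roll_link": None,  "class_link": None,  "marks_link": None},
}

# Index entries: key value -> (length, pointer to first record).
roll_index = {500: (2, "BBB"), 700: (2, "DDD"), 900: (1, "CCC")}
class_index = {"fifth": (2, "BBB")}
marks_index = {"second class": (3, "AAA")}

def follow(index, value, link_field):
    """Walk the list for `value`, following `link_field` from record to record."""
    _length, name = index[value]
    chain = []
    while name is not None:
        chain.append(name)
        name = records[name][link_field]
    return chain

print(follow(class_index, "fifth", "class_link"))
print(follow(marks_index, "second class", "marks_link"))
```

Each query starts from one index entry and then hops through the link fields stored inside the records themselves, which is the memory cost named in the disadvantage above.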

2. Coral Ring
Coral ring is a file organization in which a doubly linked multi-list
structure is maintained. Each list is a circular list. Thus the coral ring
is a structure in which circular doubly linked lists are
maintained for connecting the records together.

Associated with each record there are two link fields, a forward link and
a back link. Thus records are associated with each other by circular linked
lists.
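A single ring of this kind can be sketched as follows. The class name, helper names and record names are placeholders for illustration; a real coral ring would keep one such pair of links per indexed key.

```python
class Record:
    def __init__(self, name):
        self.name = name
        self.forward = self   # a single record forms a ring with itself
        self.back = self

def insert_after(node, new):
    """Link `new` into the ring immediately after `node`."""
    new.forward = node.forward
    new.back = node
    node.forward.back = new
    node.forward = new

def walk_forward(start):
    names, node = [start.name], start.forward
    while node is not start:
        names.append(node.name)
        node = node.forward
    return names

def walk_back(start):
    names, node = [start.name], start.back
    while node is not start:
        names.append(node.name)
        node = node.back
    return names

head = Record("AAA")            # record names are placeholders
b, c = Record("BBB"), Record("CCC")
insert_after(head, b)
insert_after(b, c)
print(walk_forward(head), walk_back(head))
```

Because the list is circular, a traversal can start at any record and still reach all the others, in either direction.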
3. Inverted Files
• Inverted files are similar to multi-lists.
• The difference between multi-lists and inverted files is that in
multi-lists, records with the same key value are linked together and
the link information is kept in the individual records, whereas in
inverted files this link information is kept in the index itself.

Consider the above Fig. (a) of the Roll_no index, which shows records BBB, EEE,
DDD, AAA and CCC. In Fig. (b) of the class index there are two classes, fifth
and tenth, and we can observe that the link information is stored in the index
itself. Hence for the fifth class the records are BBB and CCC, and for the
tenth class the records are EEE, DDD and AAA.
Similarly, from the sex index in Fig. (c), it is clear that BBB, DDD and CCC
are females and EEE and AAA are males.
• The above index structure is a dense index structure.
A dense index is a kind of index in which an entry appears for every
search key value in the file.
• In inverted files the index entries are of variable length. Hence
the inverted file structure is more complex than the multi-list file structure.
The following two steps are adopted while searching for a record
in inverted files:
i) The index entry for the required record is searched first.
ii) Then the actual record is retrieved.
• In inverted files the index structure is important. The records can be
arranged sequentially, randomly or linked, depending on the primary
key.
• The number of disk accesses required = Number of records being
retrieved + Processing for indexes.
Advantage
1) Inverted files are space saving as compared to other file structures when
record retrieval does not require retrieval of key fields.
Disadvantages
1) Insertion and deletion of records is complex because it requires the
ability to insert and delete within indexes.
2) Index maintenance is complicated as compared to multi-list.
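The class and sex indexes described above can be sketched as plain dictionaries whose entries hold the full list of matching records. The index contents come from the text; the dictionary representation and the query helper are assumptions of this sketch.

```python
# Index entries hold the full list of matching records (values from the text).
class_index = {"fifth": ["BBB", "CCC"], "tenth": ["EEE", "DDD", "AAA"]}
sex_index = {"female": ["BBB", "DDD", "CCC"], "male": ["EEE", "AAA"]}

def query(class_value, sex_value):
    """Answer 'class = X and sex = Y' using only the indexes."""
    wanted = set(sex_index.get(sex_value, []))
    return [r for r in class_index.get(class_value, []) if r in wanted]

print(query("tenth", "female"))
print(query("fifth", "female"))
```

The combined query is answered from the indexes alone; the data records are read only for the names that survive the intersection.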
4. Cellular Partitions
• For reducing the searching time during file operations, the storage
media (e.g. secondary memory, magnetic disk, magnetic tape etc.)
may be divided into cells.
• The cells can be of two types –
i) Entire disk pack can be a cell
ii) A cylinder can be a cell.
• A list of records can occupy either entire disk pack or it may lie on
particular cylinder.
• If all the records lie on the same cylinder then without moving the
read/write head the records can be accessed.

• If the cell is an entire disk pack then the disk is partitioned
into different partitions. Such partitions are called cellular partitions.
These different cells can then be searched in parallel.
Advantages
1) Various read operations can be performed in parallel in order to
reduce the search time.
2) Faster execution of any query.
Disadvantage
1) If multiple records lie in the same cell then reading a single cell
becomes a time consuming process.

7. External Sort
• In external sorting, the data stored on secondary memory is loaded part by
part into main memory, where it is sorted.
• The sorted data is then stored in intermediate files. Finally
these intermediate files are merged repeatedly to get the sorted data.
• Thus a huge amount of data can be sorted using this technique.
Consequential Processing and Merging Two Lists
External merge sort is a technique in which the data is loaded into
intermediate files. Each intermediate file is sorted independently and then
the files are combined or merged to get the sorted data.
For example: Consider that there are 10,000 records that have to be
sorted. Clearly we need to apply an external sorting method. Suppose main
memory has the capacity to store 500 records in blocks, with each block
holding 100 records.

The 5 sorted blocks (i.e. 500 records) are stored in an intermediate file. This
process is repeated 20 times (10,000 / 500 = 20) to get all the records sorted
in chunks.
In the second step, we merge pairs of intermediate files in main
memory to get the output file.
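The two steps of this example can be sketched as below. Plain lists stand in for the intermediate files, the record values are random numbers generated only for the illustration, and `heapq.merge` is used as a convenient way to merge two sorted sequences; a real external sort would stream the files from disk rather than hold them in memory.

```python
import heapq
import random

MEMORY_CAPACITY = 500          # records that fit in main memory (from the example)

random.seed(0)
records = [random.randrange(1_000_000) for _ in range(10_000)]

# Step 1: load 500 records at a time, sort them, and keep each sorted chunk
# as one "intermediate file" (a plain list stands in for a file here).
runs = [sorted(records[i:i + MEMORY_CAPACITY])
        for i in range(0, len(records), MEMORY_CAPACITY)]
initial_runs = len(runs)

# Step 2: merge the intermediate files pairwise until one file remains.
passes = 0
while len(runs) > 1:
    runs = [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
    passes += 1

print(initial_runs, passes)    # 20 intermediate files, 5 merge passes
```

With pairwise merging the number of files goes 20, 10, 5, 3, 2, 1, so five passes over the data follow the initial sorting pass.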

Example:

Multiway merge
• Multiway merge sort is a technique of merging 'm' sorted lists into a
single sorted list. The two-way merge is a special case of multiway
merge sort.
• The two-way merge sort makes use of two input tapes and two output
tapes for sorting the records.
• It works in two stages:
• Stage 1: Break the records into blocks. Sort each block with the
help of the two input tapes.
• Stage 2: Merge the sorted blocks and create a single sorted file with
the help of the two output tapes.

Algorithm for Two-Way Merge Sort:

Step 1) Divide the elements into blocks of size M. Sort each block and
then write it to disk.
Step 2) Merge two runs:
1. Read the first value of each of the two runs.
2. Compare the two values and select the smaller one.
3. Write the selected record to the output tape.
Step 3) Repeat step 2 to get longer and longer runs on alternate
tapes. Finally, we get a single sorted list.
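The steps above can be sketched as follows, using lists of runs in place of tapes. The function names and the way an unpaired run is carried over to the next pass are assumptions of this sketch; the input list and M = 3 are taken from the worked example in this unit.

```python
def merge_two(run_a, run_b):
    """Merge two sorted runs by repeatedly comparing their front values."""
    out, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i] <= run_b[j]:
            out.append(run_a[i])
            i += 1
        else:
            out.append(run_b[j])
            j += 1
    out.extend(run_a[i:])
    out.extend(run_b[j:])
    return out

def two_way_merge_sort(data, m):
    # Step 1: sorted blocks of size m, written to two tapes alternately.
    blocks = [sorted(data[i:i + m]) for i in range(0, len(data), m)]
    tape_a, tape_b = blocks[0::2], blocks[1::2]
    # Steps 2 and 3: merge one run from each tape, spreading the longer
    # runs over two tapes again, until a single run is left.
    while len(tape_a) + len(tape_b) > 1:
        merged = [merge_two(a, b) for a, b in zip(tape_a, tape_b)]
        merged += tape_a[len(tape_b):]      # an unpaired run is carried over
        tape_a, tape_b = merged[0::2], merged[1::2]
    return tape_a[0] if tape_a else []

data = [20, 47, 15, 8, 9, 4, 40, 30, 12, 17, 11, 56, 28, 35]
print(two_way_merge_sort(data, 3))
```

With M = 3 the 14 elements form five initial runs, and each pass roughly halves the number of runs while doubling their length.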

Example: Sort the following list of elements using two way merge sort with
M = 3. 20, 47, 15, 8, 9, 4, 40, 30, 12, 17, 11, 56, 28, 35.
Solution: As M = 3, we break the records into groups of 3 and sort
each group. Then we store the sorted groups on alternate
tapes.
Stage I: Sorting Phase

Stage II : Merging Phase

K way Merge Algorithm


In this method k tapes are used instead of two. The basic two-way merge
algorithm is used. The representation of the multiway merge technique is as
shown in Fig.

Algorithm:
1) Read M values at a time into internal memory, sort them, and write them to disk.
2) Merge k runs:
a) Read the first value of each of the k runs and build a min heap.
b) Remove the minimum from the heap and write it to disk.
c) Read the next value from the run that supplied the minimum and insert it into the heap.
3) Repeat step 2 until the first k runs are processed.
4) Finally merge all the runs into a single run to get the sorted list.
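The algorithm above can be sketched with Python's `heapq` module as the min heap. Lists stand in for runs on disk, the function names are hypothetical, and k is assumed to be at least 2; the input list and k = 3 come from the worked example, with M = 3 as in its solution.

```python
import heapq

def make_runs(data, m):
    """Step 1: read m values at a time, sort them, and store each as a run."""
    return [sorted(data[i:i + m]) for i in range(0, len(data), m)]

def merge_k(runs):
    """Step 2: merge the given runs with a min heap of their front values."""
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, r, i = heapq.heappop(heap)      # remove the minimum
        out.append(value)                      # "write it to disk"
        if i + 1 < len(runs[r]):               # read the next value of that run
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out

def k_way_merge_sort(data, m, k):
    """Steps 3 and 4: merge k runs at a time until one run is left (k >= 2)."""
    runs = make_runs(data, m)
    while len(runs) > 1:
        runs = [merge_k(runs[i:i + k]) for i in range(0, len(runs), k)]
    return runs[0] if runs else []

data = [20, 47, 15, 8, 9, 4, 40, 30, 12, 17, 11, 56, 28, 35]
print(k_way_merge_sort(data, 3, 3))
```

The heap never holds more than k values at a time, one per run being merged, which is what keeps the memory requirement small.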
Example: Sort the following list of elements using k way merge sort with
k = 3. 20, 47, 15, 8, 9, 4, 40, 30, 12, 17, 11, 56, 28, 35.
Solution: We read three records into memory, sort them and store them
on tape Tb1, then read the next three records, sort them and store them on
tape Tb2, and similarly store the next three sorted records on Tb3.
Stage I: Sorting Phase

Stage II : Merging
