UNIT-4(DS-NOTES)
DEFINITION
A directory is a list of files that stores all the related information about each file. Each entry of a
directory describes one file: its name, type, version number, size, owner, access rights, date of
creation and date of last backup.
A directory organizes files and folders in a hierarchical manner, as shown in the
following Fig.1.
1. Single-level directory:-
The single-level directory is the simplest directory structure. All files are contained in the same
directory, as shown in the following Fig.2, which makes it easy to support and understand.
A single-level directory has a significant limitation, however, when the number of files increases
or when the system has more than one user. Since all the files are in the same directory, they
must have unique names. If two users both call their data set test, the unique-name rule is
violated.
Advantages:
Disadvantages:
• Name collisions are likely, because two files cannot have the same name.
• Searching becomes time-consuming when the directory is large.
• Files of the same type cannot be grouped together.
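The name-collision problem can be sketched in a few lines of Python. This is only an illustration, not how any real file system is implemented; the class name SingleLevelDirectory and the file contents are made up for the example.

```python
class SingleLevelDirectory:
    """All files of all users live in one flat name -> contents table."""

    def __init__(self):
        self.entries = {}

    def create(self, name, contents):
        # A flat directory has a single namespace, so a duplicate name is an error.
        if name in self.entries:
            raise FileExistsError(f"name collision: {name!r} already exists")
        self.entries[name] = contents

    def search(self, name):
        # A linear scan stands in for the lookup cost: it grows with directory size.
        for entry_name, contents in self.entries.items():
            if entry_name == name:
                return contents
        raise FileNotFoundError(name)


directory = SingleLevelDirectory()
directory.create("test", "data set of user A")
try:
    directory.create("test", "data set of user B")  # second user, same name
    collided = False
except FileExistsError:
    collided = True
```

After this runs, collided is True: the second user cannot create a file called test.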
2. Two-level directory:-
As we have seen, a single-level directory often leads to confusion of file names among different
users. The solution to this problem is to create a separate directory for each user.
In the two-level directory structure, as shown in the following Fig.3, each user has their own user
file directory (UFD). The UFDs have similar structures, but each lists only the files of a single
user. The system's master file directory (MFD) is searched whenever a user logs in. The
MFD is indexed by user name or account number, and each entry points to the UFD for that user.
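A minimal sketch of the MFD/UFD idea, assuming the MFD is a dictionary indexed by user name and each UFD is a dictionary of that user's files. The class name, method names and user names below are hypothetical.

```python
class TwoLevelDirectory:
    def __init__(self):
        self.mfd = {}  # master file directory: user name -> that user's UFD

    def add_user(self, user):
        self.mfd.setdefault(user, {})  # each UFD starts out empty

    def create(self, user, name, contents):
        ufd = self.mfd[user]  # the MFD entry points to the UFD
        if name in ufd:
            raise FileExistsError(f"{user} already has a file named {name!r}")
        ufd[name] = contents

    def read(self, user, name):
        return self.mfd[user][name]


fs = TwoLevelDirectory()
fs.add_user("alice")
fs.add_user("bob")
fs.create("alice", "test", "alice's data")
fs.create("bob", "test", "bob's data")  # same file name, different UFD: no collision
```

Names only have to be unique inside one UFD, so both users can keep a file called test.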
Advantages:
• We can give a full path such as /User-name/directory-name/.
• Different users can have the same directory name as well as the same file name.
• Searching for files becomes easier because of path names and grouping by user.
Disadvantages:
3. Tree-structured directory:-
Once we view a two-level directory as a tree of height 2, the natural generalization is to
extend the directory structure to a tree of arbitrary height.
As shown in the following Fig.4, this generalization allows users to create their own
subdirectories and to organize their files accordingly. A tree structure is the most common
directory structure. The tree has a root directory, and every file in the system has a unique path.
Fig. 4 : Tree-structured directory
Advantages:
• Very general, since a full path name can be given.
• Very scalable; the probability of name collision is low.
• Searching becomes very easy; we can use both an absolute path (the full path starting from the root
directory) and a relative path (a path starting from the current working directory).
Disadvantages:
• Not every file fits into the hierarchical model; a file may need to be saved in multiple
directories.
• We cannot share files.
• It is inefficient, because accessing a file may require going through multiple directories.
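Absolute and relative path lookup in a tree-structured directory can be sketched as follows. The sketch assumes directories are nested dictionaries and files are strings; the function resolve and the sample tree are hypothetical.

```python
def resolve(root, path, cwd="/"):
    """Return the node named by path (dict = directory, str = file contents)."""
    # A relative path is interpreted starting from the current working directory.
    full = path if path.startswith("/") else cwd.rstrip("/") + "/" + path
    node = root
    for part in full.split("/"):
        if part:  # skip empty components produced by leading or doubled slashes
            node = node[part]  # one directory lookup per path component
    return node


root = {"home": {"asha": {"notes": {"unit4.txt": "directory structures"}}}}
absolute = resolve(root, "/home/asha/notes/unit4.txt")
relative = resolve(root, "notes/unit4.txt", cwd="/home/asha")
```

Both calls reach the same file. The loop also shows the cost noted in the last disadvantage: every path component is one more directory to go through.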
4. Acyclic-graph directory:-
An acyclic graph is a graph with no cycles; it allows subdirectories and files to be shared. The same
file or subdirectory may appear in two different directories. As shown in the following Fig.5, it is a
natural generalization of the tree-structured directory.
It is used in situations such as two programmers working on a joint project who both need to
access its files. The associated files are stored in a subdirectory, separating them from other
projects and from the files of other programmers. Since both programmers work on the project, each
wants the subdirectory to appear in their own directory, so the common subdirectory should be shared.
This is where acyclic-graph directories are used.
Note that a shared file is not the same as a copy of the file. If either programmer makes
a change in the shared subdirectory, the change is visible from both directories.
Disadvantages:
• Files are shared via links, so deleting a file may create problems.
• In the Unix operating system, if the link is a soft link (just an alias for the original file name),
then after the file is deleted we are left with a dangling pointer (the link still names a file that
no longer exists). In the case of a hard link (another directory entry referring to the same file), the
file is actually deleted only after all the references to it have been deleted.
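The difference between sharing and copying, and the rule that a hard-linked file disappears only when its last reference is removed, can be sketched with a link counter. This is an in-memory model with hypothetical names (FileNode, Directory); it does not model soft links and is not how Unix stores links on disk.

```python
class FileNode:
    def __init__(self, data):
        self.data = data
        self.link_count = 0  # number of directory entries naming this file


class Directory:
    def __init__(self):
        self.entries = {}  # name -> FileNode

    def hard_link(self, name, node):
        self.entries[name] = node
        node.link_count += 1

    def unlink(self, name):
        node = self.entries.pop(name)
        node.link_count -= 1
        return node.link_count == 0  # True once the storage can be reclaimed


shared = FileNode("v1")
dir_a, dir_b = Directory(), Directory()
dir_a.hard_link("report", shared)
dir_b.hard_link("report", shared)

dir_a.entries["report"].data = "v2"  # change made through one directory
seen_from_b = dir_b.entries["report"].data  # the other directory sees it

freed_after_first = dir_a.unlink("report")
freed_after_second = dir_b.unlink("report")
```

The change made through dir_a is visible through dir_b because both entries name the same object, and the file can be freed only after the second unlink.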
5. General graph directory :-
In the general graph directory structure, as shown in the following Fig.6, cycles are allowed within the
directory structure, and a directory can have more than one parent
directory.
The main problem with this kind of directory structure is calculating the total size or space
taken by the files and directories.
Disadvantages:
• It is more costly than others.
• It needs garbage collection.
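The size-calculation problem comes from cycles: a naive recursive walk would never terminate. A minimal sketch, assuming directories are dictionaries and a file is represented just by its size in bytes (both assumptions made for the example), keeps a set of directories already visited:

```python
def total_size(directory, visited=None):
    """Sum the file sizes reachable from directory, counting each directory once."""
    if visited is None:
        visited = set()
    if id(directory) in visited:
        return 0  # already counted: a shared directory or a cycle
    visited.add(id(directory))
    size = 0
    for child in directory.values():
        if isinstance(child, dict):
            size += total_size(child, visited)
        else:
            size += child  # a file is represented by its size in bytes
    return size


root = {"a.txt": 100, "docs": {"b.txt": 50}}
root["docs"]["back_to_root"] = root  # a link that creates a cycle
```

Without the visited set, the call total_size(root) would recurse forever through back_to_root.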
TOPIC 2: TYPES OF FILE ORGANIZATION
FILE ORGANIZATION
A file is a collection of related records, and a record is a collection of related fields.
If a file is viewed as a table, then records are its rows and fields are its columns. The technique used to
represent and store the records in a file is known as file organization.
File organization ensures that records are available for processing. It determines
the way records are stored and accessed. For example, if we want to retrieve employee records in
alphabetical order of name, sorting the file by employee name is a good file organization.
However, if we want to retrieve all employees whose marks are in a certain range, a file
ordered by employee name would not be a good file organization.
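The trade-off in this example can be sketched with a small file kept in name order. The records and function names are made up for illustration; binary search works for a name lookup, but a range query on marks has to examine every record.

```python
from bisect import bisect_left

# Hypothetical records (name, marks), stored sorted by name.
employees = [("Asha", 72), ("Bala", 55), ("Chitra", 91), ("Dev", 64)]
names = [name for name, _ in employees]


def find_by_name(name):
    # The file is ordered by name, so binary search applies.
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return employees[i]
    return None


def find_by_marks(low, high):
    # The ordering by name does not help here: every record is examined.
    return [record for record in employees if low <= record[1] <= high]
```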
Based on how the records are accessed from the file, the file organization techniques are
categorized into four types as given below:
1. Sequential file organization:-
In a sequential file, a search starts from the beginning of the file, and records can be added at the
end of the file.
It is not possible to add a record in the middle of a sequential file without rewriting the
file.
Sequential files are generally used for backup or for transporting data to a different
system.
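The three properties above (search from the beginning, append at the end, rewrite to insert in the middle) can be sketched with an ordinary text file. The comma-separated record format, the file name and the function names are assumptions made for the example.

```python
import os
import tempfile


def append_record(path, record):
    with open(path, "a") as f:  # new records go at the end of the file
        f.write(record + "\n")


def search(path, key):
    with open(path) as f:  # the search always starts at the beginning
        for line in f:
            if line.split(",")[0] == key:
                return line.rstrip("\n")
    return None


def insert_in_middle(path, position, record):
    # There is no in-place insert: the whole file is read and written back.
    with open(path) as f:
        lines = f.readlines()
    lines.insert(position, record + "\n")
    with open(path, "w") as f:
        f.writelines(lines)


path = os.path.join(tempfile.mkdtemp(), "students.txt")
append_record(path, "101,Asha")
append_record(path, "103,Dev")
insert_in_middle(path, 1, "102,Bala")
```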
Advantages:
Disadvantages:
2. Direct access file organization:-
Direct access file organization is also known as random access or relative file organization; the
technique is also called hashing.
In a direct access file, all records are stored on a direct access storage device (DASD), such
as a hard disk. The records are placed throughout the file in no particular order.
The records do not need to be in sequence, because each one is updated directly and
rewritten in the same location.
A hashing algorithm is used to compute the address of a record. It uses a hash
function, which calculates the address of the block storing the record. The hash function can be
any simple or complex mathematical function. It is applied to a field of the record (defined
as the primary key), and the result is used as an address; in this way primary
key values are converted directly into addresses. For example, you input a Student ID number, a mathematical
formula is applied to it, and the resulting value points to the storage location on
disk where the record can be found (as shown in Fig.9). This means that we need to know the
key value to retrieve a particular record. The primary key value is the input to the algorithm
and the block address of the record is the output, as shown below:
Example: The structure of a direct access file (Fig.9) for the student information table
of Fig.8 is as follows.
[Fig.9: Direct access file. Columns: ADDRESSES, RECORDS. The hash of Student_ID gives the address of
each record; the records appear in the order R2, R4, R3, R5, R1.]
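The key-to-address computation can be sketched as follows. The hash function (Student_ID modulo the number of blocks), the block count and the sample IDs are assumptions chosen for illustration; the actual values of Fig.8 and Fig.9 are not reproduced here.

```python
NUM_BLOCKS = 7  # assumed number of blocks in the file


def block_address(student_id):
    # Hash function: primary key in, block address out.
    return student_id % NUM_BLOCKS


class DirectFile:
    def __init__(self):
        # Each block holds the records whose keys hash to its address.
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def store(self, student_id, record):
        self.blocks[block_address(student_id)].append((student_id, record))

    def fetch(self, student_id):
        # Only one block is examined, however large the file is.
        for key, record in self.blocks[block_address(student_id)]:
            if key == student_id:
                return record
        return None


student_file = DirectFile()
student_file.store(1001, "R1")
student_file.store(1002, "R2")
student_file.store(1008, "R3")  # hashes to the same block as 1001
```

Two keys may hash to the same address, as 1001 and 1008 do here. This sketch simply keeps both records in that block; other collision strategies exist.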
Advantages:
• Sorting of the records is not required.
• The desired record is accessed immediately.
• Several files can be updated quickly.
• It gives better control over record allocation.
Disadvantages:
• A direct access file does not provide a backup facility.
• It is expensive.
• It uses storage space less efficiently than a sequential file.
3. Indexed sequential access file organization:-
An indexed sequential access file combines sequential file and direct access file organization.
In an indexed sequential access file, the records are stored randomly on a direct access device such as a
magnetic disk and are identified by a primary key; this part represents the direct access file.
A separate table called an index table (the sequential part) contains the primary key values and a
pointer giving the physical address of each record occurrence in the direct access file.
Example: The structure of an indexed sequential access file can be shown for the student
information table given in Fig.8.
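A minimal sketch of the idea, assuming five records R1 to R5 stored in arbitrary order and an index table sorted by key. The keys and positions are hypothetical, not the contents of Fig.8.

```python
from bisect import bisect_left

# Data file: records in arbitrary physical order (hypothetical layout).
data_file = ["R2", "R4", "R3", "R5", "R1"]

# Index table: (primary key, position in data file), kept sorted by key.
index = [(1, 4), (2, 0), (3, 2), (4, 1), (5, 3)]
keys = [key for key, _ in index]


def direct_lookup(key):
    # Direct access: search the small sorted index, then follow the pointer.
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return data_file[index[i][1]]
    return None


def sequential_scan():
    # Sequential access: walk the index in key order.
    return [data_file[position] for _, position in index]
```

The same index supports both access modes: a lookup by key and a pass over all records in key order.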
Advantages:
• Both sequential and random access to the file are possible.
• Records are accessed very fast if the index table is properly organized.
• Records can be inserted in the middle of the file.
• It reduces the amount of sequential searching.
Disadvantages:
• An indexed sequential access file requires unique keys and periodic reorganization.
• Searching the index adds time to every data access or retrieval.
• It requires more storage space.
• It is expensive because it requires special software.
• It uses storage space less efficiently than other file organizations.
4. Multi-key file organization:-
The multi-key file organization technique allows records to be accessed by more than one key
field. Until this point, we have considered only single-key file organizations: sequential, by a
given key; direct access, by a particular key; and indexed sequential, giving both direct and
sequential access by a single key. Now we enlarge our base to include those file organizations that
enable a single data file (the main file holding all records) to support multiple access paths, each by a
different key, using many index files.
The ability to search on many keys is provided by building multiple index files (a multi-key
file) “on top of” the data file, as shown in the following Fig.11.
Many techniques have been used to implement multi-key
file organization. Two such techniques are:
a. Multi-list (or Multi-level) File Organization
b. Inverted File Organization
a. Multi-list (or Multi-level) File Organisation:-
In real-world applications, we have very large files that may contain millions of records. For
such files, a simple indexing technique is not sufficient. In such a situation, we use multi-level
indices.
To understand this concept, consider a file that has 10,000 records. If we use simple indexing,
then we need an index table with at least 10,000 entries to point to the 10,000 records. If
each entry in the index table occupies 4 bytes, then we need an index table of 4 × 10,000 bytes =
40,000 bytes. Finding that much contiguous space is not always easy. So, a better scheme is
to index the index table.
The following Fig.12 shows two-level multi-indexing. We could continue with
three-level indexing and so on, but in practice two-level indexing is used. In the figure, the main
index table (second level) stores pointers to three inner index tables (first level). The inner index
tables in turn store pointers to the records. The ‘Record number’ in the index
tables is a unique primary key field of the main data file, where all the records are stored.
[Fig.12: Two-level multi-indexing. Index table (second level), index tables (first level), main data file.]
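The 10,000-record example can be worked through in code. The sketch assumes each inner index table covers 100 consecutive record numbers; that table size, and all names below, are choices made for illustration and are not taken from Fig.12.

```python
RECORDS = 10_000
ENTRY_BYTES = 4
INNER_SIZE = 100  # assumed number of entries per first-level table

single_level_bytes = RECORDS * ENTRY_BYTES  # one contiguous table of 40,000 bytes
largest_table_bytes = INNER_SIZE * ENTRY_BYTES  # each small table needs only 400 bytes

data_file = [f"record-{n}" for n in range(RECORDS)]

# First level: each inner table maps 100 consecutive record numbers to positions.
inner_tables = [
    list(range(start, start + INNER_SIZE))
    for start in range(0, RECORDS, INNER_SIZE)
]

# Second level: the main table holds one pointer per inner table.
main_table = list(range(len(inner_tables)))


def lookup(record_number):
    inner = inner_tables[main_table[record_number // INNER_SIZE]]  # second level
    position = inner[record_number % INNER_SIZE]  # first level
    return data_file[position]
```

With this split, the main table and every inner table have 100 entries each, so no single table needs more than 400 contiguous bytes.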
b. Inverted File Organisation:-
Inverted files are similar to multilists. The difference is that in multilists all the index
files are linked together by the same key value (one primary key) or field of the records in the
main data file, whereas in inverted files the index files are linked together by
different key values (one primary key and many secondary keys) or fields of the records.
The primary key is the one used for unique identification of records; there can be
only one primary key per data file/table. A secondary key is also
used for identification of records but is not usually unique, and there can be multiple
secondary keys per data table. For example, in the student information table of Fig.8, we can select
Student_ID as the primary key and Student_Enroll and Student_Name as secondary keys.
In partially inverted file organization, index files are built only for some selected fields of the
records in the data file, whereas in fully inverted file organization, an index file is built for each field of
the records.
Example:
Now let us have an example of partially inverted file organization as shown in the following
Fig.13 for the table in the Fig.8 by building the index files for the selected fields: Student_ID as
primary key and Student_Enroll and Student_Name as secondary keys.
Fig.13: Inverted file organization
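Building one index file per selected field can be sketched as below. The field names follow the text; the three sample rows are invented, since the contents of Fig.8 are not reproduced here.

```python
from collections import defaultdict

# Hypothetical rows of the main data file.
data_file = [
    {"Student_ID": 1, "Student_Enroll": 2021, "Student_Name": "Asha"},
    {"Student_ID": 2, "Student_Enroll": 2022, "Student_Name": "Bala"},
    {"Student_ID": 3, "Student_Enroll": 2021, "Student_Name": "Asha"},
]


def build_index(field):
    # One index file: field value -> positions of the matching records.
    index = defaultdict(list)
    for position, record in enumerate(data_file):
        index[record[field]].append(position)
    return index


# Index files for the selected fields: one primary key, two secondary keys.
indexes = {
    field: build_index(field)
    for field in ("Student_ID", "Student_Enroll", "Student_Name")
}


def find(field, value):
    return [data_file[position] for position in indexes[field].get(value, [])]
```

A secondary key such as Student_Name may match several records, while the primary key matches at most one.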
In both organizations, because records are linked together by many index files, the
insertion and deletion of records is easier once the place at which the insertion or
deletion is to be made is known.
But because of the presence of the inverted file, searching for records in inverted file
organization is faster when a particular field has to be retrieved based on its name.
B-trees and B+ trees are advanced file organization methods; they are dynamic implementations of
both indexed sequential access file organization and multi-list file organization.
Examples:
NOTE: A B-tree must have an order greater than 2; a B-tree of order 2 would be the
same as a BST.
1) A B-tree of order m=3 is shown in the following image, having a maximum of (3-1)=2 keys
per node and a maximum of 3 children per node. All the leaf nodes are at the same level, level 2.
2) A B tree of order 4 is shown in the following image:
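Searching a B-tree can be sketched as follows. The tree below is a small hand-built B-tree of order 3 made up for this sketch (it is not the tree from the images), and only search is shown, not insertion or deletion.

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys  # sorted; at most m - 1 keys for order m
        self.children = children or []  # empty for a leaf, else len(keys) + 1 subtrees


def search(node, key):
    # Find the first key in this node that is not smaller than the search key.
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return True  # a key can be found in an internal node or in a leaf
    if not node.children:
        return False  # reached a leaf without finding the key
    return search(node.children[i], key)


# Order 3: every node has at most 2 keys and 3 children; all leaves are at level 2.
root = BTreeNode([20], [
    BTreeNode([10], [BTreeNode([5]), BTreeNode([15])]),
    BTreeNode([30, 40], [BTreeNode([25]), BTreeNode([35]), BTreeNode([45, 50])]),
])
```

For example, search(root, 35) goes right at the root, takes the middle child of the node [30, 40] and finds 35 in a leaf, while search(root, 30) stops at an internal node.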
Definition of B+ tree :
A B+ tree is another data structure used to store data. It looks almost the same as
the B-tree, but has 2 differences, as listed below:
• In a B-tree, the keys and data (pointers to records) can be stored in both the internal nodes (inner
nodes, including the root) and the leaf nodes (external nodes), whereas in a B+ tree, the data is
stored only in the leaf nodes and the internal nodes hold only keys that guide the search.
This means that all non-leaf key values are duplicated in the leaf nodes.
• The leaf nodes of a B+ tree are linked together in the form of a singly linked list to
make search operations more efficient, which is not the case in a B-tree.
Examples:
1) A B+ tree of order 3 is shown in the following image. Note that the internal node
keys 13, 30, 9, 11, 16 and 38 are duplicated in the list of leaf nodes at the bottom of the tree.
2) Another B+ tree of order 3 is shown in the following image.
Here you can see that all records are stored in the leaf nodes of the B+ tree, and the index is used
as the key for building the B+ tree. No records are stored in non-leaf nodes. Each leaf node
has a reference (link) to the next leaf in the tree. A database can perform a direct search by
using the index, or a sequential search through every element by travelling
only through the leaf nodes.
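The two search styles can be sketched on a small hand-built B+ tree of order 3. The keys and record names are made up, and the classes Leaf and Internal are hypothetical; only lookup and range scan are shown.

```python
class Leaf:
    def __init__(self, keys, records):
        self.keys = keys  # sorted keys
        self.records = records  # records are stored only in leaves
        self.next = None  # link to the next leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys  # copies of leaf keys, used only to guide the search
        self.children = children


def find_leaf(node, key):
    # Direct search: descend through the index to the leaf that may hold key.
    while isinstance(node, Internal):
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    return node


def range_scan(root, low, high):
    # Sequential search: follow the leaf links instead of revisiting the index.
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for key, record in zip(leaf.keys, leaf.records):
            if key > high:
                return result
            if key >= low:
                result.append(record)
        leaf = leaf.next
    return result


leaf1 = Leaf([5, 9], ["r5", "r9"])
leaf2 = Leaf([13, 16], ["r13", "r16"])
leaf3 = Leaf([30, 38], ["r30", "r38"])
leaf1.next, leaf2.next = leaf2, leaf3
root = Internal([13, 30], [leaf1, leaf2, leaf3])  # 13 and 30 also appear in leaves
```

A range query such as range_scan(root, 9, 30) descends the index once and then only walks the leaf list.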
B Tree VS B+ Tree: The following comparison table shows the merits and demerits of
both the trees.
SN | B Tree | B+ Tree
1 | Search keys cannot be stored repeatedly. | Redundant search keys can be present.
2 | Data can be stored in leaf nodes as well as internal nodes. | Data can only be stored in the leaf nodes.
3 | Searching for some data is a slower process, since data can be found in internal nodes as well as in the leaf nodes. | Searching is comparatively faster, as data can only be found in the leaf nodes.
4 | Deletion of internal nodes is complicated and time-consuming. | Deletion is never a complex process, since an element is always deleted from the leaf nodes.
5 | Leaf nodes are not linked together. | Leaf nodes are linked together to make search operations more efficient.