File Organization Notes


File Organization

Basic Concepts
• Data is usually stored in the form of records. Each record consists of a collection of related data values or items, where each value is formed of one or more bytes and corresponds to a particular field of the record.
• A file is a sequence of records. The records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory.
• A file is a collection of blocks, each containing a collection of records. A record is a collection of related fields, and each field is a data item.
Basic Concepts
• A file header or file descriptor contains information about a file that is needed by the system programs that access the file records. The header includes information used to determine the disk addresses of the file blocks, as well as the record format descriptions.
• An access method, on the other hand, provides a group of operations that can be applied to a file.
• Methods for organizing the records of a file on disk are discussed in the upcoming slides. Several general techniques, such as ordering, hashing, and indexing, are used to create access methods.
What is File Organization?
• Just as arrays, lists, trees and other data structures are used to organise data in main memory, a number of strategies are used to support the organisation of data in secondary memory. A file organisation is a technique for organising data in secondary memory.
• The order in which records are stored and accessed in the file depends on the file organization.

File organization refers to the organization of the data of a file into records, blocks, and access structures; this includes the way records and blocks are placed on the storage medium and interlinked.
Types of file organization
• Heap (unordered) files: records are placed on disk in no particular order.
• Sequential (ordered) files: records are ordered by the value of a specified field.
• Hash files: records are placed on disk according to a hash function.
Heap files/ Files of unordered records/Pile Files
• In this simplest and most basic type of organization, records are placed in the file in the order in which they are inserted, so new records are inserted at the end of the file.
• Inserting a new record is very efficient. The last disk block of the file is copied into
a buffer, the new record is added, and the block is then rewritten back to disk.
The address of the last file block is kept in the file header.
• However, searching for a record using any search condition involves a linear
search through the file block by block—an expensive procedure.
• To delete a record, a program must first find its block, copy the block into a buffer,
delete the record from the buffer, and finally rewrite the block back to the disk.
• Deletion requires periodic reorganization of the file to reclaim the unused space
of deleted records.
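The insert, linear search, and delete steps above can be sketched as a small in-memory model. This is only an illustration: the `HeapFile` class, `BLOCK_SIZE`, and the list-of-lists block layout are made-up names, with each disk block modeled as a short Python list.

```python
BLOCK_SIZE = 3  # records per block (tiny, for demonstration only)

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # the file header would keep the address of the last block

    def insert(self, record):
        # Copy the last block into a buffer, add the record, rewrite the block.
        last = self.blocks[-1]
        if len(last) == BLOCK_SIZE:
            self.blocks.append([])
            last = self.blocks[-1]
        last.append(record)

    def search(self, key, value):
        # Linear search: scan the file block by block (expensive).
        for b, block in enumerate(self.blocks):
            for record in block:
                if record[key] == value:
                    return b, record
        return None

    def delete(self, key, value):
        # Find the block, remove the record from the buffer, rewrite the block.
        hit = self.search(key, value)
        if hit:
            b, record = hit
            self.blocks[b].remove(record)
        return hit is not None

f = HeapFile()
for i in range(7):
    f.insert({"id": i})
print(f.search("id", 5))   # (1, {'id': 5})
print(f.delete("id", 5))   # True
```

Note that insertion touches only the last block, while search and delete may scan every block, matching the cost analysis above.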
Files of ordered records/ Sequential Files
• We can physically order the records of a file on disk based on the values of
one of their fields—called the ordering field. This leads to an ordered or
sequential file.
• If the ordering field is also a key field of the file—a field guaranteed to
have a unique value in each record—then the field is called the ordering
key for the file.
• Inserting and deleting records are expensive operations because the
records must remain physically ordered. To insert a record, we must find
its correct position in the file, based on its ordering field value, and then
make space in the file to insert the record in that position.
Files of ordered records/ Sequential Files
• One option for making insertion more efficient is to keep some unused
space in each block for new records. However, once this space is used up,
the original problem resurfaces.
• Another frequently used method is to create a temporary unordered file called an overflow or transaction file. With this technique, the actual ordered file is called the main or master file. New records are inserted at the end of the overflow file rather than in their correct position in the main file. Periodically, the overflow file is sorted and merged with the master file during file reorganization.
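The master/overflow technique above can be sketched in a few lines. The names `master`, `overflow`, `insert`, and `reorganize` are illustrative; a real system would merge on disk rather than sort in memory.

```python
master = [10, 20, 30, 40]   # ordered main (master) file, keyed by an integer field
overflow = []               # temporary unordered overflow (transaction) file

def insert(key):
    overflow.append(key)    # cheap: append at the end of the overflow file

def reorganize():
    # Periodic reorganization: sort the overflow file and merge it with the master.
    global master, overflow
    merged = sorted(master + overflow)  # a real system would do an external merge sort
    master, overflow = merged, []

insert(25)
insert(5)
reorganize()
print(master)  # [5, 10, 20, 25, 30, 40]
```

Between reorganizations, a search must check both the master file (binary search) and the overflow file (linear search).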
Files of ordered records/ Sequential Files
• The third option is to use a pointer field in each record:
We insert new records at the end of the file by changing only two pointer values.
Deletion is not physical; we change only two next-pointer values.
Records are logically sorted but physically unsorted.
To physically sort the records, the file is reorganized from time to time.
Two special pointers are used, start and available. Start always points to the first record of the file; it heads the linked list of sorted records.
Available points to the first deleted record.
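The pointer-based option can be sketched as a linked list over physically unsorted slots. Everything here is illustrative: each slot is a `[key, next_index]` pair, `start` heads the sorted list, and `avail` heads the list of deleted (reusable) slots.

```python
records = []   # each slot: [key, next_index]; None terminates a list
start = None   # first record in logical (sorted) order
avail = None   # first deleted slot, available for reuse

def insert(key):
    global start
    records.append([key, None])          # physically: appended at the end
    idx = len(records) - 1
    # Logically: relink two pointers to keep the sorted order.
    if start is None or records[start][0] >= key:
        records[idx][1] = start
        start = idx
    else:
        cur = start
        while records[cur][1] is not None and records[records[cur][1]][0] < key:
            cur = records[cur][1]
        records[idx][1] = records[cur][1]
        records[cur][1] = idx

def delete(key):
    # Deletion only changes two pointers; the record stays physically in place.
    global start, avail
    prev, cur = None, start
    while cur is not None and records[cur][0] != key:
        prev, cur = cur, records[cur][1]
    if cur is None:
        return False
    if prev is None:
        start = records[cur][1]
    else:
        records[prev][1] = records[cur][1]
    records[cur][1] = avail              # push the freed slot onto the available list
    avail = cur
    return True

def logical_order():
    out, cur = [], start
    while cur is not None:
        out.append(records[cur][0])
        cur = records[cur][1]
    return out

for k in [30, 10, 20]:
    insert(k)
print(logical_order())  # [10, 20, 30], though storage order is 30, 10, 20
delete(20)
print(logical_order())  # [10, 30]
```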
• A binary search for disk files can be done on the blocks rather than on the records, if we assume all blocks are in RAM or the disk addresses of the file blocks are available in the file header. The time to search is then log2 B block accesses, where B is the number of blocks in the file.
• Even if not all the blocks are in RAM, the search is more efficient than in a heap file, because we search block by block rather than record by record.
• A binary search usually accesses log2 B blocks, whether the record is found or not: an improvement over a linear search, which on average accesses B/2 blocks when the record is found and B blocks when it is not found.
• Ordering does not provide any advantage for random or ordered access to the records based on values of the other, non-ordering fields of the file.
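A block-level binary search can be sketched as follows; the file, block size, and counter are illustrative. Each probe reads one whole block and compares the key against that block's first and last records, so the search touches about log2 B blocks.

```python
BLOCK_SIZE = 4
records = list(range(0, 64, 2))   # ordered file: keys 0, 2, ..., 62
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

def block_binary_search(key):
    lo, hi, accesses = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]           # one disk block access
        accesses += 1
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:                         # key falls inside this block's key range
            return (key in block), accesses
    return False, accesses

found, cost = block_binary_search(34)
print(found, cost)  # True 3  (8 blocks, so at most log2(8) = 3 accesses here)
```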
Hashing/ Hash file
• Hashing provides very fast access to records under certain search
conditions.
• The search condition must be an equality condition on a single field,
called the hash field. In most cases, the hash field is also a key field of
the file, in which case it is called the hash key.
• The idea behind hashing is to provide a function h, called a hash
function or randomizing function, which is applied to the hash field
value of a record and yields the address of the disk block in which the
record is stored.
• A search for the record within the block can be carried out in a main
memory buffer. For most records, we need only a single-block access
to retrieve that record.
Hashing/ Hash file
A simple hash function is:
Hash Address = K mod M
where K is the hash key, an integer equivalent to the value of the hashing attribute, and M is the number of buckets needed to accommodate the table.

Suppose K is a hash key value; the hash function h maps this value to a block address as follows: h(K) = address of the block containing the record with the key value K.
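The division hash function from the slide can be written directly; M = 7 and the sample keys are arbitrary illustrative values.

```python
M = 7                       # number of buckets

def h(K):
    return K % M            # Hash Address = K mod M: a relative bucket number

for K in [15, 22, 8]:
    print(K, "->", h(K))    # 15 -> 1, 22 -> 1, 8 -> 1: distinct keys, same address
```

Note that 15, 22, and 8 all map to bucket 1, which previews the collision problem discussed next.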
Hashing/ Hash file
• The problem with most hashing functions is that they do not
guarantee that distinct values will hash to distinct addresses.
• A collision occurs when the hash field value of a new record that is
being inserted hashes to an address that already contains a different
record. In this situation, we must insert the new record in some other position. The process of finding another position is called collision resolution.
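One common collision-resolution scheme (chaining) can be sketched as follows; the slides do not prescribe a specific scheme, so this is just one illustrative choice, with made-up names.

```python
M = 7
buckets = [[] for _ in range(M)]             # each bucket holds a chain of records

def insert(key, record):
    buckets[key % M].append((key, record))   # a collision simply extends the chain

def search(key):
    for k, record in buckets[key % M]:       # scan only this one bucket's chain
        if k == key:
            return record
    return None

insert(15, "A")
insert(22, "B")   # 22 mod 7 == 15 mod 7 == 1: a collision, resolved by chaining
print(search(22))  # B
```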
External Hashing
• Hashing for disk files is called external hashing.
• To suit the characteristics of disk storage, the hash address space is made up of buckets.
• Each bucket consists of either one disk block or a cluster of contiguous
(neighboring) blocks, and can accommodate a certain number of
records.
• A hash function maps a key into a relative bucket number, rather than
assigning an absolute block address to the bucket.
• A table maintained in the file header converts the relative bucket
number into the corresponding disk block address.
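The two-step address translation described above can be sketched directly; the block addresses below are made-up illustrative values standing in for the header's conversion table.

```python
M = 4
# Table maintained in the file header: relative bucket number -> disk block address.
bucket_to_block = {0: 0x1A00, 1: 0x1A40, 2: 0x2B00, 3: 0x2B40}

def block_address(key):
    bucket = key % M                  # hash function yields a relative bucket number
    return bucket_to_block[bucket]    # header table converts it to a block address

print(hex(block_address(10)))  # 0x2b00 (key 10 falls in bucket 2)
```

Keeping the indirection table in the header lets the system move buckets on disk without changing the hash function.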
Indexed Sequential Files
• The main disadvantage of sequential files is that access is always sequential, so a binary search cannot be applied directly
• A binary search can only be applied to a sequential file if either the entire file is in RAM or the file header contains the addresses of all blocks of the file
How to make sequential files random access
We can make a sequential file random access by maintaining a small file, called the index of the file, which contains only two columns: the search key (the index is built on the ordering field, i.e., the field on which the file is sorted) and a block pointer.
The index file is always sorted by the search key value.
The search key is an attribute of the table that is present in the index file. It may or may not be the primary key.
[Figure: the data file is ordered on the Ordering Field; the index file has two columns, Search Key and Block Pointer.]
Questions?
Q: What is the maximum number of disk accesses required to fetch a record from an indexed sequential file, assuming the index file is in RAM?
Ans: 1

Q: In the above question, if the index file is on the hard disk and all pointers in the index blocks are available, how many disk accesses are required?
Ans: log2 Bi, where Bi is the number of blocks in the index file

If there is no index file in the above case, the time would be log2 B, where B is the number of blocks in the main file; B >> Bi.
Indexed sequential files reduce disk accesses considerably in comparison to sequential files.
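The lookup path behind these answers can be sketched as follows. The data, the `(search key, block pointer)` index entries, and the helper names are all illustrative; with the index held in RAM, only the final data-block read costs a disk access.

```python
import bisect

data_blocks = [[1, 3, 5], [7, 9, 11], [13, 15, 17]]         # file ordered on the key
index = [(blk[0], i) for i, blk in enumerate(data_blocks)]  # (search key, block ptr)
keys = [k for k, _ in index]

def lookup(key):
    # Binary search of the index: free if the index is in RAM, log2(Bi)
    # block accesses if the index itself lives on disk.
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None
    block = data_blocks[index[pos][1]]    # the single data-block disk access
    return key if key in block else None

print(lookup(9), lookup(4))  # 9 None
```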
Types of Indexes
• Indexes in a database are similar to the index of a book
• They provide a faster way to access the records of a file
• We can have more than one index; an index can be created on any field
• Types
• Single level ordered indexes
• Primary
• Clustered Index
• Secondary Indexes
• Multilevel Indexes
• Dynamic Multilevel Indexes
Single level ordered indexes
Primary Index
• The index is built on the ordering field, which is a candidate key of the data file
• So the file is sorted on a candidate key, and the index is built on that candidate key
• A file can have at most one primary index
Clustered Index
• The index is built on the ordering field, which is not a candidate key of the data file
• So the file is sorted on a non-key field, and that non-key field is used to build the index
• A file can have at most one clustered index
Secondary Index
• The index is built on a non-ordering field, which may or may not be a candidate key
• A file can have any number of secondary indexes
Multilevel Indexes
[Figure: lower index levels reside on the hard disk; the top level can be held in RAM.]
• The index file at the first level is sorted by the search key; we can build an index of the index file itself, repeating until we reach an index file that fits in a single block