File Organization Notes
Basic Concepts
• Data is usually stored in the form of records. Each record consists of a collection of
related data values or items, where each value is formed of one or more bytes and
corresponds to a particular field of the record.
• A file is a sequence of records. The records of a file must be allocated to disk blocks
because a block is the unit of data transfer between disk and memory.
• A file is a collection of blocks, each containing a collection of records. A record is a
collection of related fields. Each field is a data item.
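The relationship between records and blocks can be sketched in a few lines of Python. This is only an illustration: the function name pack_into_blocks, the sample records, and the capacity of 3 records per block are assumptions, and no record is split across blocks.

```python
from typing import Dict, List

def pack_into_blocks(records: List[Dict], records_per_block: int) -> List[List[Dict]]:
    """Group records into fixed-capacity blocks; no record is split across blocks."""
    if records_per_block <= 0:
        raise ValueError("records_per_block must be positive")
    return [
        records[i:i + records_per_block]
        for i in range(0, len(records), records_per_block)
    ]

# Hypothetical file of 7 records, each with two fields.
records = [{"id": i, "name": f"emp{i}"} for i in range(1, 8)]

# Assumed capacity: 3 records fit in one block.
blocks = pack_into_blocks(records, records_per_block=3)
```

With 7 records and 3 records per block, the file occupies 3 blocks, and the last block is only partly filled.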
• A file header or file descriptor contains information about a file that is
needed by the system programs that access the file records. The header
includes information to determine the disk addresses of the file blocks as
well as record format descriptions.
• An access method, on the other hand, provides a group of operations that
can be applied to a file.
• Methods for organizing records of a file on disk are discussed in the
upcoming slides. Several general techniques, such as ordering, hashing,
and indexing, are used to create access methods.
What is File Organization?
• Just as arrays, lists, trees and other data structures are used to
implement data organization in main memory, a number of strategies
are used to support the organization of data in secondary memory. A file
organization is a technique to organize data in secondary memory.
• The order in which records are stored and accessed in the file is
dependent on the file organization.
Hashing / Hash File
Suppose K is a hash key value. The hash function h maps this value to a block
address as follows: h(K) = address of the block containing the record with
key value K.
• The problem with most hashing functions is that they do not
guarantee that distinct values will hash to distinct addresses.
• A collision occurs when the hash field value of a new record that is
being inserted hashes to an address that already contains a different
record. In this situation, we must insert the new record in some other
position. The process of finding another position is called collision
resolution.
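A collision and one way of resolving it can be sketched as follows. The notes do not name a resolution technique, so linear probing (try the next position, wrapping around) is used here only as one common choice. The hash function key % size, the table size of 7, and the sample keys are assumptions.

```python
from typing import List, Optional, Tuple

Slot = Optional[Tuple[int, str]]

def h(key: int, size: int) -> int:
    """Assumed hash function: remainder of the key modulo the table size."""
    return key % size

def insert(table: List[Slot], key: int, value: str) -> int:
    """Insert (key, value) and return the position used.

    If the hashed position holds a different record, probe the following
    positions one by one (linear probing) until a free one is found.
    """
    size = len(table)
    start = h(key, size)
    for step in range(size):
        pos = (start + step) % size
        slot = table[pos]
        if slot is None or slot[0] == key:
            table[pos] = (key, value)
            return pos
    raise RuntimeError("hash table is full")

table: List[Slot] = [None] * 7
pos_a = insert(table, 10, "A")  # 10 % 7 == 3, position 3 is free
pos_b = insert(table, 17, "B")  # 17 % 7 == 3 as well: collision, probe onward
```

Keys 10 and 17 are distinct but hash to the same address, so the second record is placed in the next free position.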
External Hashing
• Hashing for disk files is called external hashing.
• To suit the characteristics of disk storage, the hash address space is
made of buckets.
• Each bucket consists of either one disk block or a cluster of contiguous
(neighboring) blocks, and can accommodate a certain number of
records.
• A hash function maps a key into a relative bucket number, rather than
assigning an absolute block address to the bucket.
• A table maintained in the file header converts the relative bucket
number into the corresponding disk block address.
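The two-step mapping described above (key to relative bucket number, then bucket number to disk block address through the header table) can be sketched like this. The bucket count, the block addresses, and the in-memory dictionary standing in for the disk are all hypothetical, and bucket overflow is not handled.

```python
from typing import Dict, List, Optional, Tuple

NUM_BUCKETS = 4  # assumed number of buckets

# Hypothetical table kept in the file header:
# position = relative bucket number, value = disk block address.
bucket_directory: List[int] = [4096, 8192, 20480, 12288]

# Simulated disk: block address -> records stored in that bucket.
disk: Dict[int, List[Tuple[int, str]]] = {addr: [] for addr in bucket_directory}

def bucket_number(key: int) -> int:
    """Hash function: maps a key to a relative bucket number, not an address."""
    return key % NUM_BUCKETS

def block_address(key: int) -> int:
    """Convert the relative bucket number to a block address via the header table."""
    return bucket_directory[bucket_number(key)]

def insert(key: int, value: str) -> None:
    disk[block_address(key)].append((key, value))

def lookup(key: int) -> Optional[str]:
    """Read one bucket and scan its records for the key."""
    for k, v in disk[block_address(key)]:
        if k == key:
            return v
    return None

insert(10, "x")  # 10 % 4 == 2, bucket 2 lives at address 20480
insert(14, "y")  # 14 % 4 == 2, same bucket
```

Because the hash function only produces bucket numbers, a bucket can be moved to a different block by updating the header table, without changing the hash function.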
Indexed Sequential Files
• The main disadvantage of sequential files is that access is always sequential,
so binary search cannot be applied directly.
• Binary search can be applied to a sequential file only if either the entire
file is in RAM or the file header contains the addresses of all blocks of the file.
How to make sequential files random access
We can give a sequential file random access by maintaining a small file
called the index of the file, which contains only two columns: the search key
(the index is built on the ordering field, i.e. the field on which the file is
sorted) and the block pointer.
The index file is always sorted by search key value.
The search key is an attribute of the table that is present in the index file. It
may or may not be the primary key.
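A minimal sketch of such an index, assuming one index entry per data block that holds the first search key of the block and a pointer to it. The sample data, the names index and find, and the use of list positions as block pointers are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Data file: blocks of (search key, record) pairs, sorted on the ordering field.
data_blocks: List[List[Tuple[int, str]]] = [
    [(5, "a"), (8, "b"), (12, "c")],
    [(15, "d"), (21, "e"), (27, "f")],
    [(30, "g"), (34, "h"), (40, "i")],
]

# Index file: two columns, (search key, block pointer), sorted by search key.
index: List[Tuple[int, int]] = [
    (block[0][0], block_no) for block_no, block in enumerate(data_blocks)
]
index_keys = [key for key, _ in index]

def find(key: int) -> Optional[str]:
    """Binary-search the index, then read only the one block it points to."""
    i = bisect_right(index_keys, key) - 1  # last entry with search key <= key
    if i < 0:
        return None  # key is smaller than every key in the file
    _, block_no = index[i]
    for k, record in data_blocks[block_no]:
        if k == key:
            return record
    return None
```

For example, find(21) locates the index entry with search key 15, follows its pointer to the second block, and returns the record found there.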
Ordering Field
Q: In the above question, if the index file is on the hard disc and all pointers in the index block are
available, then how many disc accesses are needed?
Ans: log2(Bi), where Bi is the number of blocks in the index file.
If there is no index file in the above case, then the number of disc accesses would be
log2(B), where B is the number of blocks in the main file; B >> Bi.
Indexed sequential files reduce disc accesses considerably in comparison to sequential files.
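The saving can be made concrete with a small calculation. The block counts below (B = 3000 data blocks, Bi = 45 index blocks) are hypothetical values chosen for illustration, and the result is rounded up because a block access cannot be fractional.

```python
import math

def binary_search_block_accesses(num_blocks: int) -> int:
    """Worst-case block accesses for a binary search over num_blocks blocks."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return max(1, math.ceil(math.log2(num_blocks)))

B = 3000  # hypothetical number of blocks in the main file
Bi = 45   # hypothetical number of blocks in the index file

without_index = binary_search_block_accesses(B)   # search the data file itself
with_index = binary_search_block_accesses(Bi)     # search only the index file

# Note: with_index counts the search of the index only; reading the data
# block that the index entry points to would be one further access.
```

Under these assumed sizes the search drops from 12 block accesses on the data file to 6 on the index.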
Types of Indexes
• Indexes in a database are similar to the index of a book
• They provide a faster way to access the records of a file
• A file can have more than one index; an index can be created on any field
• Types
  • Single level ordered indexes
    • Primary Index
    • Clustered Index
    • Secondary Index
  • Multilevel Indexes
  • Dynamic Multilevel Indexes
Single level ordered indexes
Primary Index
• The index is built on the ordering field, which is a candidate key of the data file
• So, the file is sorted on the candidate key and the index is built on that candidate key
• A file can have at most one primary index
Clustered Index
• The index is built on the ordering field, which is not a candidate key of the data
file
• So, the file is sorted on a non-key field and that non-key field is used to build the index
• A file can have at most one clustered index
Secondary Index
• The index is built on a non-ordering field, which may or may not be a candidate key
• A file can have any number of secondary indexes
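The three definitions above reduce to two yes/no questions about the indexed field. A small sketch, with the function name index_type and its parameters chosen here for illustration:

```python
def index_type(on_ordering_field: bool, is_candidate_key: bool) -> str:
    """Classify a single level ordered index using the rules above."""
    if on_ordering_field:
        # File is sorted on this field: primary if it is a candidate key,
        # clustered otherwise.
        return "primary" if is_candidate_key else "clustered"
    # Field is not the ordering field: secondary, key or not.
    return "secondary"
```

Since a file is physically sorted on only one field, it can have either a primary or a clustered index on that field, while secondary indexes are not limited in this way.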
Multilevel Indexes