UNIT-IV - File Organization

The document provides an overview of storage and indexing in database management systems. It discusses how data is stored on external storage devices like disks and tapes and organized into files, records, and pages. It describes different types of file organization, including unordered, ordered, and hash files. It also covers index structures like primary, secondary, and cluster indexes that allow efficient retrieval of records. Index data structures can be hash-based or tree-based, with B-trees being a common tree structure used.

UNIT-V

Overview of Storage and Indexing: Data on External Storage – File Organization and
Indexing – Cluster Indexes, Primary and Secondary Indexes – Index Data Structures –
Hash Based Indexing – Tree Based Indexing
Data on External Storage

 A DBMS stores vast quantities of data.
 Therefore, data is stored on external storage devices, such as disks and tapes, and
fetched into main memory as needed for processing.
 The unit of information read from or written to disk is a page.
 The size of a page is typically 4 KB or 8 KB.
 Database systems are carefully optimized to minimize the cost of page I/O.
 If we read several pages in the order that they are stored physically, the cost can
be much less than the cost of reading the same pages in a random order.
 Each record in a file has a unique identifier called a record id (or rid). Using the rid
we can identify the disk address of the page containing the record.
 A layer of software called the buffer manager reads data into memory for processing
and writes it to disk for persistent storage.
 Space on disk is managed by the disk space manager.
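
The notes do not say how a rid is laid out. A minimal sketch, assuming the common textbook convention that a rid is a (page number, slot number) pair and that pages are stored back to back in the file, shows how a rid leads to a page address. The names Rid, page_offset and PAGE_SIZE are hypothetical.

```python
from typing import NamedTuple

PAGE_SIZE = 4096  # bytes; the notes cite 4 KB or 8 KB as typical


class Rid(NamedTuple):
    """Assumed rid layout: (page number, slot number)."""
    page_no: int  # which page of the file holds the record
    slot_no: int  # position of the record within that page


def page_offset(rid: Rid, page_size: int = PAGE_SIZE) -> int:
    """Byte offset in the file at which the record's page starts."""
    return rid.page_no * page_size


rid = Rid(page_no=3, slot_no=7)
print(page_offset(rid))  # 12288
```

Only the page number is needed to decide which page to fetch; the slot number is used after the page is in memory.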

File organization: The physical arrangement of data in a file into records and pages on
secondary storage. The order in which records are stored and accessed in the file
depends on the file organization. The main types of file organization are:
 Heap (unordered) Files: Records are placed on disk in no particular order.
 Sequential (ordered) Files: Records are ordered by the value of a specified
field.
 Hash Files: Records are placed on disk according to a hash function.
Unordered Files
The records are placed in the file in the same order as they are inserted. A new record is
inserted in the last page of the file; if there is insufficient space in the last page, a new
page is added to the file. This makes insertion very efficient. However, as a heap file has
no particular ordering with respect to field values, a linear search must be performed to
access a record. A linear search involves reading pages from the file until the required
record is found. This makes retrievals from heap files that have more than a few pages
relatively slow. To delete a record, the required page first has to be retrieved, the record
marked as deleted, and the page written back to disk. The space occupied by deleted records is
not reused. Heap files are one of the best organizations for bulk loading data into a table,
as records are inserted at the end of the sequence; there is no overhead of calculating
what page the record should go on.
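
The heap file operations described above can be sketched as follows. This is an in-memory toy under stated assumptions, not a real storage engine: a page is modelled as a Python list, and the HeapFile class and the page_capacity parameter are hypothetical.

```python
class HeapFile:
    """Toy heap file: a list of pages, each page a list of records."""

    def __init__(self, page_capacity=4):
        self.page_capacity = page_capacity
        self.pages = [[]]

    def insert(self, record):
        # Append to the last page; add a new page when the last one is full.
        if len(self.pages[-1]) >= self.page_capacity:
            self.pages.append([])
        self.pages[-1].append(record)

    def search(self, field, value):
        """Linear search. Returns (record or None, number of pages read)."""
        for pages_read, page in enumerate(self.pages, start=1):
            for record in page:
                if record is not None and record[field] == value:
                    return record, pages_read
        return None, len(self.pages)

    def delete(self, field, value):
        # Mark the slot as deleted (None); the space is not reused.
        for page in self.pages:
            for i, record in enumerate(page):
                if record is not None and record[field] == value:
                    page[i] = None
                    return True
        return False


hf = HeapFile(page_capacity=2)
for emp_id in [30, 10, 20, 50, 40]:
    hf.insert({"id": emp_id})
print(hf.search("id", 40))  # ({'id': 40}, 3): found only on the last page
```

Insertion touches only the last page, while a search may have to read every page, which is the trade-off the paragraph describes.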
Ordered Files
The records in a file can be sorted on the values of one or more of the fields. The
resulting file is called an ordered or sequential file. The field(s) that the file is sorted on is
called the ordering field. If the ordering field is also a key of the file, and therefore
guaranteed to have a unique value in each record, the field is also called the ordering key.
To search for a particular record, a binary search can be performed because the
records are already in sorted order. In general, a binary search is more efficient than a linear
search.
Inserting and deleting records in a sorted file is problematic because the order of records
has to be maintained. To insert a new record, we must find the correct position in the
ordering for the record and then find space to insert it. If there is sufficient space in the
required page for the new record, then the single page can be reordered and written back
to disk. If this is not the case, then it is necessary to move one or more records on
to the next page. Again, the next page may have no free space, so records on that
page must be moved too, and so on.
To delete a record, we must reorganize the records to remove the newly freed slot.
Ordered files are rarely used for database storage unless a primary index is added to the
file.
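
To make the comparison between binary and linear search concrete, the sketch below counts how many entries each one inspects on the same sorted list of keys. It is an illustration under simplifying assumptions: the keys sit in one in-memory list rather than in disk pages, and the function names are hypothetical.

```python
def binary_search(sorted_keys, target):
    """Return (index, probes); index is -1 when target is absent."""
    lo, hi = 0, len(sorted_keys) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return mid, probes
        if sorted_keys[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1, probes


def linear_search(keys, target):
    """Return (index, probes); index is -1 when target is absent."""
    for probes, key in enumerate(keys, start=1):
        if key == target:
            return probes - 1, probes
    return -1, len(keys)


keys = list(range(0, 2000, 2))  # 1000 sorted keys
print(binary_search(keys, 1998))
print(linear_search(keys, 1998))
```

For 1000 keys the binary search needs at most 10 probes, whereas the linear search inspects all 1000 entries to reach the last key.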
Hash Files
In a hash file, records do not have to be written sequentially to the file. Instead, a hash
function calculates the address of the page in which the record is to be stored based on
one or more fields in the record. The base field is called the hash field, or if the field is
also a key field of the file, it is called the hash key. Records in a hash file will appear to
be randomly distributed across the available file space. The hash function is chosen so
that records are as evenly distributed as possible throughout the file. One popular
technique is division-remainder hashing. This technique uses the MOD function,
which takes the field value, divides it by some predetermined integer value, and uses the
remainder of this division as the disk address.
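
Division-remainder hashing can be sketched in a few lines. The divisor of 7 pages and the employee ids are assumptions chosen for illustration; hash_page and NUM_PAGES are hypothetical names.

```python
NUM_PAGES = 7  # hypothetical predetermined divisor (number of pages)


def hash_page(hash_field_value: int, num_pages: int = NUM_PAGES) -> int:
    """Division-remainder hashing: the remainder is the page address."""
    return hash_field_value % num_pages


# Place a few hypothetical employee ids on their pages.
pages = {p: [] for p in range(NUM_PAGES)}
for emp_id in [100, 101, 109, 115, 205]:
    pages[hash_page(emp_id)].append(emp_id)

print(pages[2])  # [100, 205]
print(pages[3])  # [101, 115]
print(pages[4])  # [109]
```

Two ids that leave the same remainder land on the same page, which is why the divisor is chosen to spread records as evenly as possible.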

Index: A data structure that allows the DBMS to locate particular records in a file more
quickly and thereby speed up the response to user queries.
 An index structure is associated with a particular search key and contains records
consisting of the key value and the address of the logical record in the file
containing the key value.
 The file containing the logical records is called the data file.
 The file containing the index records is called the index file.
 The values in the index file are ordered according to the indexing field, which is
usually based on a single attribute.
Types of Index
 Primary Index: - The data file is sequentially ordered by an ordering key field
and the indexing field is built on the ordering key field, which is guaranteed to
have a unique value in each record.
 Clustering Index: - The data file is sequentially ordered on a non-key field, and
the indexing field is built on this non-key field, so that there can be more than one
record corresponding to a value of the indexing field. The non-key field is called a
clustering attribute.
 Secondary Index: - An index that is defined on a non-ordering field of the data
file is called Secondary index.
 Sparse Index: - In a sparse index, index records are not created for every search
key. An index record here contains a search key and an actual pointer to the data
on the disk. To search for a record, we first follow the index record to reach a
location in the data file. If the data we are looking for is not at the location we
reach by following the index, then the system performs a sequential search from there until the
desired data is found.

 Dense Index: - In a dense index, there is an index record for every search key
value in the database. This makes searching faster but requires more space to store
the index records themselves. Index records contain a search key value and a pointer to the
actual record on the disk.
 Multilevel Index: - Index records comprise search-key values and data
pointers. The index is stored on the disk along with the actual database
files. As the size of the database grows, so does the size of the indices. There is a
strong need to keep the index records in main memory so as to speed up
search operations. If a single-level index is used, a large index cannot be
kept in memory, which leads to multiple disk accesses. A multilevel index breaks the
index into several smaller levels so that the outermost level is small enough to be
kept in main memory.

NOTE: - A file can have at most one primary index or one clustering index, and in
addition can have several secondary indexes.
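
The difference between a dense and a sparse index can be sketched on a small ordered data file. The page contents below are made up for illustration, and the sparse index is assumed to hold one entry per page (the first key on that page), which is one common arrangement rather than the only one.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each inner list is one page of sorted keys.
data_pages = [[10, 15, 20], [25, 30, 35], [50, 55, 65], [75, 80, 85]]

# Dense index: one entry per search-key value -> (page number, slot number).
dense_index = {
    key: (page_no, slot_no)
    for page_no, page in enumerate(data_pages)
    for slot_no, key in enumerate(page)
}

# Sparse index: one entry per page, holding the first key on that page.
sparse_keys = [page[0] for page in data_pages]


def sparse_lookup(key):
    """Follow the sparse index to a page, then scan that page sequentially."""
    pos = bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every indexed key
    for slot_no, candidate in enumerate(data_pages[pos]):
        if candidate == key:
            return (pos, slot_no)
    return None


print(len(dense_index), len(sparse_keys))  # 12 4
print(dense_index[30], sparse_lookup(30))  # (1, 1) (1, 1)
```

Both indexes find key 30 at the same place, but the sparse index stores 4 entries instead of 12 and pays for it with a short sequential scan inside the page.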
INDEX DATA STRUCTURES

 One way to organize data entries is to hash data entries on the search key
(HASH-BASED INDEXING).
 Another way to organize data entries is to build a tree-like data structure that
directs a search for data entries (TREE-BASED INDEXING).

HASH BASED INDEXING


We can organize records using a technique called hashing to quickly find records
that have a given search key value. For example, if the file of employee records is hashed
on the name field, we can retrieve all records about Joe.
 The records in a file are grouped in buckets.
 A bucket consists of a primary page and possibly, additional pages linked in a
chain.
 The bucket to which a record belongs can be determined by applying a special
function, called a hash function to the search key.
 INSERT: On insert, the record is placed in the appropriate bucket, with
overflow pages allocated as necessary.
 SEARCH: To search for a record with a given search key value we apply the hash
function to identify the bucket to which such records belong and look at all pages
in that bucket.

Hash indexing is illustrated in the figure below, where the data is stored in a file that is
hashed on age; the data entries in this first index file are the actual data records. Applying
the hash function to the age field identifies the page that the record belongs to. The hash
function h for this example is quite simple; it converts the search key value to its binary
representation and uses the two least significant bits as the bucket identifier.
Fig. Index-Organized File Hashed on age, with Auxiliary Index on sal.
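
The hash function h described above, which keeps the two least significant bits of age, can be sketched as follows. The employee records are hypothetical, and each bucket is modelled as a single Python list, so the chain of a primary page plus overflow pages is flattened.

```python
def h(age: int) -> int:
    """Bucket identifier: the two least significant bits of age."""
    return age & 0b11


# Four buckets, one per possible two-bit value.
buckets = {b: [] for b in range(4)}


def insert(record):
    # INSERT: place the record in the bucket chosen by the hash function.
    buckets[h(record["age"])].append(record)


def search(age):
    # SEARCH: hash to one bucket and look at every record in it.
    return [r for r in buckets[h(age)] if r["age"] == age]


# Hypothetical employee records.
for name, age in [("Smith", 44), ("Jones", 40), ("Ashby", 25), ("Cass", 50)]:
    insert({"name": name, "age": age})

print(h(44), h(25), h(50))  # 0 1 2
print(search(44))           # [{'name': 'Smith', 'age': 44}]
```

Ages 44 and 40 both end in binary 00, so they share bucket 0; the search still compares the full age to separate them.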

TREE BASED INDEXING


An alternative to hash-based indexing is to organize records using a tree-like data
structure. The data entries are arranged in sorted order by search key value, and a
hierarchical search data structure is maintained that directs searches to the correct page of
data entries.
The following figure shows the employee records in a tree-structured index with search key
age. The lowest level of the tree, called the leaf level, contains the data entries.

Example: Find all data entries with 24<age<50
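
The range query in the example can be sketched on the leaf level alone. The leaf pages below are hypothetical ages, and the upper levels of the tree are stood in for by a binary search over the first key of each leaf page; a real tree would descend through index nodes instead.

```python
from bisect import bisect_right

# Hypothetical leaf level: pages of data entries sorted by age.
leaf_pages = [[22, 25], [29, 33], [40, 44], [44, 50]]


def range_search(low, high):
    """Return every age a with low < a < high, in sorted order."""
    # Step 1: locate the first leaf page that can contain a value above low.
    first_keys = [page[0] for page in leaf_pages]
    start = max(bisect_right(first_keys, low) - 1, 0)
    # Step 2: scan the leaf pages sequentially until the upper bound is hit.
    result = []
    for page in leaf_pages[start:]:
        for age in page:
            if age >= high:
                return result
            if age > low:
                result.append(age)
    return result


print(range_search(24, 50))  # [25, 29, 33, 40, 44, 44]
```

Because the leaf entries are sorted, the scan can stop at the first age that reaches the upper bound, which is what makes tree-based indexes suitable for range queries.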


What is the difference between ISAM and B+ Trees?

SNO | ISAM (Indexed Sequential Access Method) | B+ Tree
1 | It is a static index structure. | It is a dynamic index structure.
2 | It is effective when the file is not frequently updated. | It remains effective when the file is frequently updated.
3 | It is unsuitable for files that grow and shrink a lot. | It is suitable for files that grow and shrink a lot.
4 | It does not adjust gracefully to changes in the file. | It adjusts gracefully to changes in the file.
5 | It is a rarely used index structure. | It is the most widely used index structure.
6 | It supports equality and range queries, but performance degrades as overflow chains grow. | It supports both equality and range queries efficiently.
7 | It suffers from long overflow chains. | It does not suffer from long overflow chains.
8 | The set of primary leaf pages is static. | The set of primary leaf pages is not static.

DIFFERENCE BETWEEN B AND B+ TREE


B tree indices are similar to B+ tree indices. The primary distinction between the two
approaches is that a B tree eliminates the redundant storage of search key values. Search
keys are not repeated in B tree indices.
The major differences between the B tree and B+ tree structures are given below.
1. In a B tree, search keys and data are stored in both internal and leaf nodes. In a B+ tree,
data is stored only in leaf nodes.
2. Searching for data in a B+ tree is simple because all data is found in leaf nodes,
whereas in a B tree a search may end at an internal node or at a leaf node.
3. In a B tree, data may be found in a leaf or a non-leaf node, and deleting from a non-leaf
node is complicated. In a B+ tree, data is always found in a leaf node, so deletion is easier.
4. Insertion into a B tree is more complicated than insertion into a B+ tree.
5. A B+ tree stores redundant search keys, but a B tree has no redundant values.
6. In a B+ tree, the leaf nodes are linked in a sequential linked list, but in a B tree the leaf
nodes are not linked in a list.
7. In a B tree, pointers to data records exist at all levels of the tree. In a B+ tree, all
pointers to data records exist at the leaf-level nodes.
8. A B+ tree can have fewer levels (or a higher capacity of search values) than the
corresponding B tree, because its internal nodes hold only search keys and child pointers
and can therefore fit more entries per node.
Many database system implementers prefer the structural simplicity of a B+ tree.

Searching a record in B+ Tree


Suppose we want to search for 65 in the B+ tree structure below. First we fetch the
intermediary node, which directs us to the leaf node that can contain the record for 65. So we
find the branch between the 50 and 75 entries in the intermediary node. Then we are
redirected to the third leaf node. Here the DBMS performs a sequential search to
find 65. Suppose, instead of 65, we have to search for 60. What will happen in this case?
We follow the same path but do not find 60 in the leaf node, so the search reports that
the record does not exist. No insert, update, or delete is allowed during
the search in a B+ tree.
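
The search walk-through can be sketched on a two-level tree. The figure is not reproduced in these notes, so the keys below are an assumption chosen to agree with the text (65 sits in the third leaf, between the intermediary entries 50 and 75); intermediary_keys, leaves and search are hypothetical names.

```python
from bisect import bisect_right

# Assumed two-level B+ tree consistent with the walk-through above.
intermediary_keys = [25, 50, 75]
leaves = [[10, 15, 20], [25, 30, 35], [50, 55, 65, 70], [75, 80, 85]]


def search(key):
    """Return (leaf number, position in leaf), or None if key is absent."""
    # Step 1: the intermediary node picks the branch whose range covers key.
    child = bisect_right(intermediary_keys, key)
    # Step 2: sequential search inside the chosen leaf node.
    for position, candidate in enumerate(leaves[child]):
        if candidate == key:
            return child, position
    return None


print(search(65))  # (2, 2): third leaf, third entry
print(search(60))  # None: same leaf is searched, but 60 is not there
```

Leaf numbers start at 0, so leaf 2 is the third leaf mentioned in the text.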

Insertion in B+ tree

Suppose we have to insert a record 60 in the structure below. It will go to the 3rd leaf node after
55. Since it is a balanced tree and that leaf node is already full, we cannot insert the
record there. But it should be inserted there without affecting the fill factor, balance and
order. So the only option here is to split the leaf node. But how do we split the nodes?

With the new record, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and the
entry in the intermediary node that leads to it is 50.
We split the leaf node in the middle so that the balance is not altered. So we can
group (50, 55) and (60, 65, 70) into 2 leaf nodes. If these two are to be leaf nodes, the
intermediary node cannot branch from 50 alone. It should have 60 added to it, and then it can
hold a pointer to the new leaf node.

This is how we insert a new entry when there is overflow. In the normal case, we simply
find the leaf node where the entry fits and place it there.
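
The leaf split described above can be sketched on the same kind of two-level tree. This is a simplified illustration, not a full B+ tree: the leaf capacity of 4 and the starting keys are assumptions, and an overflow of the intermediary node itself is not handled.

```python
from bisect import bisect_right, insort

LEAF_CAPACITY = 4  # hypothetical maximum number of entries per leaf


def insert(intermediary_keys, leaves, key):
    """Insert key into its leaf, splitting the leaf if it overflows."""
    child = bisect_right(intermediary_keys, key)
    leaf = leaves[child]
    insort(leaf, key)  # keep the leaf sorted
    if len(leaf) <= LEAF_CAPACITY:
        return  # normal case: the entry fits
    # Overflow: split in the middle and add the first key of the new
    # right-hand leaf to the intermediary node.
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    leaves[child:child + 1] = [left, right]
    insort(intermediary_keys, right[0])


keys = [25, 50, 75]
leaves = [[10, 15, 20], [25, 30, 35], [50, 55, 65, 70], [75, 80, 85]]
insert(keys, leaves, 60)
print(keys)       # [25, 50, 60, 75]
print(leaves[2])  # [50, 55]
print(leaves[3])  # [60, 65, 70]
```

The result matches the text: the full leaf becomes (50, 55) and (60, 65, 70), and 60 is added to the intermediary node.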

Delete in B+ tree

Suppose we have to delete 60 from the above example. What will happen in this case?
We have to remove 60 from the 4th leaf node as well as from the intermediary node. If we
remove it from the intermediary node, the tree will no longer satisfy the B+ tree rules. So we need to
modify it to have a balanced tree. After deleting 60 from the above B+ tree and re-arranging the
nodes, it will appear as below.

Suppose we have to delete 15 from the above tree. We traverse to the 1st leaf node and
simply delete 15 from that node. There is no need for any re-arrangement, as the tree is
balanced and 15 does not appear in the intermediary node.
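
The two deletion cases can be told apart with a small check. The sketch below only removes the key from its leaf and reports whether the intermediary node also holds it; the re-arrangement itself, and any merging of leaves that become too empty, are deliberately left out. The tree contents are assumed to be the state after 60 was inserted.

```python
from bisect import bisect_right


def delete(intermediary_keys, leaves, key):
    """Remove key from its leaf.

    Returns True if key also appears in the intermediary node, meaning the
    tree needs re-arranging, and False if no re-arrangement is needed.
    Raises KeyError if key is not in the tree.
    """
    child = bisect_right(intermediary_keys, key)
    leaf = leaves[child]
    if key not in leaf:
        raise KeyError(key)
    leaf.remove(key)
    return key in intermediary_keys


# Assumed tree after the insertion of 60.
keys = [25, 50, 60, 75]
leaves = [[10, 15, 20], [25, 30, 35], [50, 55], [60, 65, 70], [75, 80, 85]]

print(delete(keys, leaves, 15))  # False: 15 is only in the 1st leaf
print(delete(keys, leaves, 60))  # True: 60 is also in the intermediary node
```

Deleting 60 touches the 4th leaf, as in the text, and is flagged because the intermediary node still carries 60 as an entry.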
