File Organization

The document discusses file organizations and indexing in databases. It describes different file organizations like sorted files and heap files. It also explains primary and secondary indexing as well as different index structures like B-trees and hash indexes. The purpose of indexing is to improve efficiency of retrieving records from files.

Uploaded by

yashnaik7664
Copyright
© All Rights Reserved
Overview of Storage and Indexing

 Data in a DBMS is a collection of records, or a file, and each file
consists of one or more pages.

 Understanding how records are organized is essential to using a
database system effectively, and it is the main topic of this
chapter.

 A file organization is a method of arranging the records in a
file when the file is stored on disk. Each file organization makes
certain operations efficient but other operations expensive.
Example: a file of employee records, each containing age, name,
and sal fields.

1) If we want to retrieve employee records in order of increasing age:
 sorting the file by age is a good file organization
 sort order is expensive to maintain if the file is frequently modified

2) Other operations on a given collection of records:
 Example: retrieve all employees who make more than $5000
 We have to scan the entire file to find such employee records.
 A technique called indexing can help when we have to access a
collection of records in multiple ways, in addition to efficiently
supporting various kinds of selection.
DATA ON EXTERNAL STORAGE
 A DBMS stores vast quantities of data, and the data must persist
across program executions. Therefore, data is stored on external
storage devices such as disks and tapes, and fetched into main
memory as needed for processing.

 The unit of information read from or written to disk is a page.
The size of a page is a DBMS parameter, and typical values are
4KB or 8KB.
 Data is read into memory for processing, and written to disk
for persistent storage.

 The cost of page I/O (input from disk to main memory and
output from memory to disk) dominates the cost of typical
database operations, and database systems are carefully
optimized to minimize this cost.
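
As a rough illustration of why page I/O is the unit of cost, the sketch below estimates how many pages a file occupies, and therefore how many page reads a full scan needs. The 8KB page size comes from the typical values above; the 100-byte fixed-length record and the function name are assumptions made for this example only.

```python
import math

PAGE_SIZE = 8 * 1024   # bytes; one of the typical page sizes mentioned above
RECORD_SIZE = 100      # bytes; hypothetical fixed-length record


def pages_needed(num_records: int,
                 page_size: int = PAGE_SIZE,
                 record_size: int = RECORD_SIZE) -> int:
    """Pages occupied by the file, ignoring page headers and free space."""
    records_per_page = page_size // record_size
    return math.ceil(num_records / records_per_page)


# A full scan reads every page once, so its cost in page I/Os is the page count.
scan_cost = pages_needed(1_000_000)
print(scan_cost)
```

Real systems reserve part of each page for headers and slot directories, so actual page counts are somewhat higher than this estimate.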
FILE ORGANIZATIONS AND INDEXING
 The file of records is an important abstraction in a DBMS
 A file can be created, destroyed, and have records inserted into and
deleted from it. It also supports scans; a scan operation allows us to
step through all the records in the file one at a time.
 A relation is typically stored as a file of records.
 The file layer stores the records in a file in a collection of disk
pages. It keeps track of pages allocated to each file, and as records
are inserted into and deleted from the file, it also tracks available
space within pages allocated to the file.
 The simplest file structure is an unordered file, or heap file.
Records in a heap file are stored in random order across the pages
of the file.
 An index is a data structure that organizes data records on
disk to optimize certain kinds of retrieval operations.

 An index allows us to efficiently retrieve all records that satisfy
search conditions on the search key fields of the index.

 We can also create additional indexes on a given collection of
data records, each with a different search key, to speed up
search operations that are not efficiently supported by the file
organization used to store the data records.
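
A minimal sketch of these ideas, assuming a toy in-memory heap file with two records per page and a dictionary standing in for an index on sal. The class name, the record ids of the form (page number, slot), and the sample employees are all hypothetical.

```python
from collections import defaultdict


class HeapFile:
    """Unordered file: records go wherever there is space (here, the last page)."""

    def __init__(self, records_per_page: int = 2):
        self.records_per_page = records_per_page
        self.pages = [[]]

    def insert(self, record: dict) -> tuple:
        if len(self.pages[-1]) >= self.records_per_page:
            self.pages.append([])          # allocate a new page
        self.pages[-1].append(record)
        return (len(self.pages) - 1, len(self.pages[-1]) - 1)  # record id

    def scan(self):
        """Step through all records one at a time."""
        for page in self.pages:
            yield from page

    def fetch(self, rid: tuple) -> dict:
        page_no, slot = rid
        return self.pages[page_no][slot]


employees = HeapFile()
sal_index = defaultdict(list)   # search key sal -> list of record ids

for name, age, sal in [("Ashby", 25, 3000), ("Basu", 33, 6000),
                       ("Cass", 50, 5500), ("Daniels", 22, 6000)]:
    rid = employees.insert({"name": name, "age": age, "sal": sal})
    sal_index[sal].append(rid)

# Without an index: scan the whole file.
high_paid = [r["name"] for r in employees.scan() if r["sal"] > 5000]

# With the index: go straight to the matching records.
earning_6000 = [employees.fetch(rid)["name"] for rid in sal_index[6000]]
```

The dictionary only supports equality lookups on sal; the tree and hash structures discussed later in these notes are the on-disk counterparts of this idea.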
Clustered Indexes
 When a file is organized so that the ordering of data records
is the same as or close to the ordering of data entries in some
index, we say that the index is clustered; otherwise, it
is an unclustered index.
Primary and Secondary Indexes
Primary Index

 An index whose search key specifies the sequential order of
the data file is a primary index.
 One index record appears for every search-key value in the
file.
 The index record contains the search-key value and a
pointer to the block containing the first data record with
that search key value
Secondary index
 An index whose search key does not specify the sequential
order of the data file is a secondary index.
 Secondary indices use an extra level of indirection
 The pointers do not point directly to the file. Instead, each
pointer points to a bucket that contains pointers to the
block containing the record
 Why is searching on an index faster than searching without
an index?
Dense Index

 There is an index record for every search key value in the data file
 This record contains the search key and also a reference to the first
data record with that search key value.
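
A small sketch of a dense index over a data file sorted by age, assuming three records per block. The block contents, the (key, block number, slot) layout of an index record, and the function name are assumptions for illustration.

```python
from bisect import bisect_left

# Data file sorted by age; each inner list is one block (hypothetical data).
blocks = [
    [(22, "Daniels"), (25, "Ashby"), (25, "Bristow")],
    [(29, "Cass"), (33, "Basu"), (33, "Jones")],
    [(44, "Smith"), (50, "Madayan"), (50, "Tracy")],
]

# One index record per distinct search key value, pointing at the first
# data record with that value.
dense_index = []
for block_no, block in enumerate(blocks):
    for slot, (age, _name) in enumerate(block):
        if not dense_index or dense_index[-1][0] != age:
            dense_index.append((age, block_no, slot))


def dense_lookup(key: int) -> list:
    """Binary-search the index, then read matching records sequentially."""
    keys = [k for k, _, _ in dense_index]
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return []                      # key is absent: no data block is read
    _, block_no, slot = dense_index[i]
    matches = []
    while block_no < len(blocks):
        for rec in blocks[block_no][slot:]:
            if rec[0] != key:
                return matches
            matches.append(rec)
        block_no, slot = block_no + 1, 0
    return matches
```

Because every key value has an entry, a search for a missing key such as 30 is answered from the index alone.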
Sparse Index
 The index record appears only for a few items in the data file.
Each item points to a block as shown.
 To locate a record, we find the index record with the largest
search key value less than or equal to the search key value we
are looking for.
 We start at that record pointed to by the index record, and
proceed along with the pointers in the file (that is,
sequentially) until we find the desired record.
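
The lookup procedure just described can be sketched as follows, assuming one index record per block (holding the first key in that block) and a data file sorted by age. The sample data and names are hypothetical, and the sketch assumes that records with equal keys do not straddle a block boundary.

```python
from bisect import bisect_right

# Data file sorted by age; each inner list is one block (hypothetical data).
blocks = [
    [(22, "Daniels"), (25, "Ashby"), (25, "Bristow")],
    [(29, "Cass"), (33, "Basu"), (33, "Jones")],
    [(44, "Smith"), (50, "Madayan"), (50, "Tracy")],
]

# Sparse index: one (first key in block, block number) record per block.
sparse_index = [(block[0][0], block_no) for block_no, block in enumerate(blocks)]


def sparse_lookup(key: int) -> list:
    keys = [k for k, _ in sparse_index]
    # Index record with the largest search key value <= key.
    i = bisect_right(keys, key) - 1
    if i < 0:
        return []                      # key is smaller than every indexed key
    matches = []
    # Start at the block that index record points to and read sequentially.
    for block in blocks[sparse_index[i][1]:]:
        for rec in block:
            if rec[0] > key:
                return matches
            if rec[0] == key:
                matches.append(rec)
    return matches
```

The index holds three entries for nine records, but each lookup has to scan inside a block after the binary search.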
Dense versus Sparse Index
 Dense indexes require more maintenance than sparse indexes at
write-time. Since every row must have an entry, the database must
maintain the index on inserts, updates, and deletes.
 Having an entry for every row also means that dense indexes will
require more memory.
 The benefit of a dense index is that values can be quickly found with
just a binary search.

 Sparse indexes require less maintenance than dense indexes at
write-time since they only contain a subset of the values. This lighter
maintenance burden means that inserts, updates, and deletes will be
faster.
 Having fewer entries also means that the index will use less memory.
 Finding data is slower since a scan across the page typically follows the
binary search. Sparse indexes are also only an option when working
with ordered data.
INDEX DATA STRUCTURES

 One way to organize data entries is to build a tree-like data
structure that directs a search for data entries.
 Another way to organize data entries is to hash data entries on
the search key.
Tree-Based Indexing
 In tree-based indexing, records are organized using a tree-like data
structure. The data entries are arranged in sorted order by search
key value, and a hierarchical search data structure is maintained
that directs searches to the correct page of data entries.

 The figure shows the employee records organized in a tree-structured
index with search key age. Each node in this figure is a physical
page, and retrieving a node involves a disk I/O.

 The lowest level of the tree, called the leaf level, contains the
data entries; in our example, these are employee records.
Tree-Based Indexing
 This structure allows us to efficiently locate all data entries with
a particular search key value
 All searches begin at the topmost node, called the root, and the
contents of pages in non-leaf levels direct searches to the correct
leaf page.
 Non-leaf pages contain node pointers separated by search key
values.
 The node pointer to the left of a key value k points to a
subtree that contains only data entries less than k.
 The node pointer to the right of a key value k points to a
subtree that contains only data entries greater than or equal
to k.
Example

1. Find all employees who are 25 years of age
2. Find all employees having 24 < age < 50
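
Both example queries can be traced with a minimal sketch of the search rule for non-leaf pages: the pointer to the left of key k leads to entries less than k, the pointer to the right to entries greater than or equal to k. The node classes, the linked leaf pages, and the sample employees are assumptions; building and rebalancing the tree is not shown.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    entries: list                      # sorted (age, name) data entries
    next: Optional["Leaf"] = None      # link to the next leaf page


@dataclass
class Internal:
    keys: list                         # search key values separating the pointers
    children: list                     # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Follow node pointers from the root; return (leaf, pages read)."""
    pages_read = 1
    while isinstance(node, Internal):
        # bisect_right sends key == k to the pointer on the right of k.
        node = node.children[bisect_right(node.keys, key)]
        pages_read += 1
    return node, pages_read


def range_search(root, low, high):
    """All data entries with low <= age <= high."""
    leaf, _ = find_leaf(root, low)
    out = []
    while leaf is not None:
        for age, name in leaf.entries:
            if age > high:
                return out
            if age >= low:
                out.append((age, name))
        leaf = leaf.next
    return out


# Hand-built tree with three levels (hypothetical data).
l5 = Leaf([(55, "Jones"), (60, "Madayan")])
l4 = Leaf([(44, "Smith"), (50, "Tracy")], l5)
l3 = Leaf([(29, "Cass"), (33, "Basu")], l4)
l2 = Leaf([(25, "Ashby"), (25, "Bristow")], l3)
l1 = Leaf([(20, "Guldu"), (22, "Daniels")], l2)
root = Internal([44], [Internal([25, 29], [l1, l2, l3]), Internal([55], [l4, l5])])

aged_25 = range_search(root, 25, 25)        # query 1
between = range_search(root, 25, 49)        # query 2: 24 < age < 50, integer ages
```

Query 1 reads one page per level to reach the leaf holding age 25; query 2 reaches the same leaf and then follows the leaf links until it sees an age of 50 or more.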
Tree-Based Indexing
 Thus, the number of disk I/Os incurred during a search is equal
to the length of a path from the root to a leaf, plus the number
of leaf pages with qualifying data entries.

 Finding the correct leaf page is faster than binary search of the
pages in a sorted file because each non-leaf node can
accommodate a very large number of node-pointers, and the
height of the tree is rarely more than three or four in practice.

 The height of a balanced tree is the length of a path from root
to leaf; in the figure, the height is three. The number of I/Os to
retrieve a desired leaf page is four, including the root and the
leaf page.
 The average number of children for a non-leaf node is called the
fan-out of the tree.

 In practice, the fan-out F is at least 100, which means a tree of
height four contains 100 million leaf pages (100^4 = 100,000,000).
Thus, we can search a file with 100 million leaf pages and find the
page we want using four I/Os: log_F(100,000,000) = 4 for F = 100.

 In contrast, binary search of the same file would take
log_2(100,000,000) (over 25) I/Os.
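
The arithmetic behind this comparison can be checked with a few lines, assuming a fan-out of exactly 100 at every non-leaf node. The helper name is hypothetical, and integer multiplication is used for the tree levels to avoid floating-point rounding in the logarithm.

```python
import math

F = 100                       # assumed fan-out
leaf_pages = F ** 4           # leaf pages reachable through four levels of pointers


def levels_needed(pages: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= pages."""
    h, reach = 0, 1
    while reach < pages:
        reach *= fanout
        h += 1
    return h


tree_levels = levels_needed(leaf_pages, F)              # tree search
binary_search_ios = math.ceil(math.log2(leaf_pages))    # binary search of a sorted file

print(leaf_pages, tree_levels, binary_search_ios)
```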
Hash-Based Indexing

 We can organize records using a technique called hashing to
quickly find records that have a given search key value.
 In this approach, the records in a file are grouped in buckets,
where a bucket consists of a primary page and, possibly,
additional pages linked in a chain.
 The bucket to which a record belongs can be determined by
applying a special function, called a hash function, to the
search key.
 Given a bucket number, a hash-based index structure allows us
to retrieve the primary page for the bucket in one or two disk
I/Os.
 Hash indexing is illustrated in Figure 8.2, where the data is
stored in a file that is hashed on age; the data entries in this first
index file are the actual data records.

 Applying the hash function to the age field identifies the page
that the record belongs to.

 The hash function h for this example is quite simple; it converts
the search key value to its binary representation and uses the
two least significant bits as the bucket identifier.
 To search for a record with a given search key value, we apply
the hash function to identify the bucket to which such records
belong and look at all pages in that bucket.

 On inserts, the record is inserted into the appropriate bucket,
with 'overflow' pages allocated as necessary.

 If the search condition is not on the search key of the hashed
file, for example, the file is hashed on age and we want records
with a given sal value, we have to scan all pages in the file.
 Figure 8.2 also shows an index with search key sal that contains
(sal, rid) pairs as data entries.

 The rid (short for record id) component of a data entry in this
second index is a pointer to a record with search key value sal

 The file of employee records is hashed on age. The second
index, on sal, also uses hashing to locate data entries, which are
now <sal, rid of employee record> pairs.
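
The two structures can be sketched together, assuming four buckets and two records per page. The hash on age uses the two least significant bits, as described above; the hash on sal, the rid layout (bucket, page number, slot), and the sample employees are hypothetical choices for this example.

```python
PAGE_CAPACITY = 2   # assumed number of records per page


def h_age(age: int) -> int:
    """Two least significant bits of the binary representation."""
    return age & 0b11


def h_sal(sal: int) -> int:
    """Hypothetical hash for the second index; the notes do not specify one."""
    return (sal // 1000) % 4


# File hashed on age: each bucket is a chain of pages, primary page first.
age_file = [[[]] for _ in range(4)]
# Second index on sal: each bucket holds (sal, rid) data entries.
sal_index = [[] for _ in range(4)]


def insert(record: dict) -> None:
    bucket = h_age(record["age"])
    pages = age_file[bucket]
    if len(pages[-1]) >= PAGE_CAPACITY:
        pages.append([])                       # allocate an overflow page
    pages[-1].append(record)
    rid = (bucket, len(pages) - 1, len(pages[-1]) - 1)
    sal_index[h_sal(record["sal"])].append((record["sal"], rid))


def find_by_age(age: int) -> list:
    """Look at all pages of the one bucket the hash function identifies."""
    return [r for page in age_file[h_age(age)] for r in page if r["age"] == age]


def find_by_sal_scan(sal: int) -> list:
    """Without an index on sal, every page of the file is scanned."""
    return [r for pages in age_file for page in pages for r in page
            if r["sal"] == sal]


def find_by_sal(sal: int) -> list:
    """Hash on sal to get (sal, rid) entries, then follow each rid."""
    return [age_file[b][p][s]
            for entry_sal, (b, p, s) in sal_index[h_sal(sal)]
            if entry_sal == sal]


for name, age, sal in [("Ashby", 25, 3000), ("Basu", 33, 6000), ("Cass", 29, 5500),
                       ("Daniels", 22, 6000), ("Smith", 44, 2000)]:
    insert({"name": name, "age": age, "sal": sal})
```

Ages 25, 33, and 29 all end in the bits 01, so bucket 1 needs an overflow page once its primary page is full.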
Organization of Records in Files
 A relation is a set of records. Given a set of records, the next
question is how to organize them in a file.
 Several of the possible ways of organizing records in files are:
 Heap file organization: Any record can be placed anywhere in
the file where there is space for the record. There is no ordering of
records. Typically, there is either a single file or a set of files for
each relation.

 Sequential file organization: Records are stored in sequential
order, according to the value of a “search key” of each record.

 Multitable clustering file organization: Generally, a
separate file or set of files is used to store the records of each
relation. However, in a multitable clustering file organization, records
of several different relations are stored in the same file, and in fact
in the same block within a file, to reduce the cost of certain join
operations.
 B+-tree file organization: The traditional sequential file
organization does support ordered access even if there are
insert, delete, and update operations, which may change the
ordering of records. However, in the face of a large number
of such operations, efficiency of ordered access suffers.
 We study another way of organizing records, called the B+-
tree file organization. The B+-tree file organization is related to
the B+-tree index structure and can provide efficient
ordered access to records even if there are a large number of
insert, delete, or update operations. Further, it supports very
efficient access to specific records, based on the search key.
 Hashing file organization: A hash function is computed
on some attribute of each record. The result of the hash
function specifies in which block of the file the record should
be placed.
Data-Dictionary Storage
 A relational database system needs to maintain data about the
relations, such as the schema of the relations. In general, such
“data about data” are referred to as metadata.
 Relational schemas and other metadata about relations are
stored in a structure called the data dictionary or system catalog.
Among the types of information that the system must store are
these:
 Names of the relations
 Names of the attributes of each relation
 Domains and lengths of attributes
 Names of views defined on the database, and definitions of those views
 Integrity constraints (e.g., key constraints)
 Further, the database may store statistical and descriptive
data about the relations and attributes, such as the number of
tuples in each relation, or the number of distinct values for
each attribute.
 The data dictionary may also note the storage organization
(heap, sequential, hash, etc.) of relations, and the location
where each relation is stored:
 If relations are stored in operating system files, the dictionary would note
the names of the file (or files) containing each relation.
 If the database stores all relations in a single file, the dictionary may note
the blocks containing records of each relation in a data structure such as a
linked list
 There is also a need to store information about each index on
each of the relations:
 Name of the index
 Name of the relation being indexed
 Attributes on which the index is defined
 Type of index formed
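
One way to picture the catalog contents listed above is as a handful of record types. The sketch below is only an illustration with hypothetical class and field names; an actual DBMS typically stores this metadata in relations of its own.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeInfo:
    name: str
    domain: str          # e.g. "integer", "char"
    length: int          # length in bytes


@dataclass
class RelationInfo:
    name: str
    attributes: list             # list of AttributeInfo
    storage_organization: str    # heap, sequential, hash, ...
    location: str                # e.g. name of the file holding the relation
    num_tuples: int = 0          # statistical data


@dataclass
class IndexInfo:
    name: str
    relation: str                # name of the relation being indexed
    attributes: list             # attributes on which the index is defined
    index_type: str              # type of index formed


@dataclass
class DataDictionary:
    relations: dict = field(default_factory=dict)
    indexes: dict = field(default_factory=dict)

    def add_relation(self, rel: RelationInfo) -> None:
        self.relations[rel.name] = rel

    def add_index(self, idx: IndexInfo) -> None:
        if idx.relation not in self.relations:
            raise KeyError(f"unknown relation: {idx.relation}")
        self.indexes[idx.name] = idx

    def indexes_on(self, relation: str) -> list:
        return [i.name for i in self.indexes.values() if i.relation == relation]


catalog = DataDictionary()
catalog.add_relation(RelationInfo(
    name="employee",
    attributes=[AttributeInfo("name", "char", 20),
                AttributeInfo("age", "integer", 4),
                AttributeInfo("sal", "integer", 4)],
    storage_organization="hash",
    location="employee.dat",
))
catalog.add_index(IndexInfo("emp_sal_idx", "employee", ["sal"], "hash"))
```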
References

 Chapter 8: Overview of Storage and Indexing.
Database Management Systems, Raghu Ramakrishnan and Johannes
Gehrke, Tata McGraw-Hill, 3rd Edition, 2003.

 Chapter 13: Data Storage Structures.
Database System Concepts, A. Silberschatz, H. F. Korth, and
S. Sudarshan, McGraw-Hill, 6th Edition, 2006.
