Module-5 Dbms Cs208 Notes
MODULE 5
Physical Data Organization
Single level ordered indexes
Indexes are used to speed up the retrieval of records in response to certain search conditions. The index
structures are additional files on disk that provide secondary access paths, that is, alternative ways to
access the records without affecting the physical placement of records in the
primary data file on disk. They enable efficient access to records based on the indexing fields that are
used to construct the index.
For a file with a given record structure consisting of several fields (or attributes), an index access
structure is usually defined on a single field of a file, called an indexing field (or indexing attribute). The
index typically stores each value of the index field along with a list of pointers to all disk blocks that
contain records with that field value. The values in the index are ordered so that we can do a binary
search on the index.
KTU STUDENTS
There are several types of ordered indexes. A primary index is specified on the ordering key field of an
ordered file of records. An ordering key field is used to physically order the file records on disk, and every
record has a unique value for that field. If the ordering field is not a key field—that is, if numerous records
in the file can have the same value for the ordering field—another type of index, called a clustering
index, can be used. The data file is called a clustered file in this latter case. A third type of index, called a
secondary index, can be specified on any nonordering field of a file. A data file can have several
secondary indexes in addition to its primary access method.
Primary indexes
A primary index is an ordered file whose records are of fixed length with two fields, and it acts like an
access structure to efficiently search for and access the data records in a data file. The first field is of the
same data type as the ordering key field—called the primary key—of the data file, and the second field is
a pointer to a disk block (a block address). There is one index entry (or index record) in the index file for
each block in the data file. Each index entry has the value of the primary key field for the first record in a
block and a pointer to that block as its two
field values. We will refer to the two field values of index entry i as <K(i), P(i)>.
The total number of entries in the index is the same as the number of disk blocks in the ordered data file.
The first record in each block of the data file is called the anchor record of the block, or simply the block
anchor.
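The lookup procedure implied by the <K(i), P(i)> entries can be sketched as follows. This is a minimal illustration in Python, not a real storage engine: the tiny in-memory "file", the names data_blocks, primary_index and lookup, and the sample keys are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each inner list is one disk block,
# and records are (key, payload) pairs sorted on the key.
data_blocks = [
    [(2, "a"), (5, "b"), (8, "c")],
    [(12, "d"), (15, "e"), (19, "f")],
    [(23, "g"), (27, "h"), (31, "i")],
]

# Primary index: one entry <K(i), P(i)> per block, where K(i) is the key
# of the block anchor and P(i) is the block number.
primary_index = [(block[0][0], block_no) for block_no, block in enumerate(data_blocks)]
anchor_keys = [k for k, _ in primary_index]


def lookup(key):
    """Return the record with the given key, or None if it is absent."""
    # Binary search for the last index entry whose anchor key is <= key.
    i = bisect_right(anchor_keys, key) - 1
    if i < 0:
        return None  # key is smaller than the first block anchor
    block_no = primary_index[i][1]
    # One block access, then a search inside the block in memory.
    for record in data_blocks[block_no]:
        if record[0] == key:
            return record
    return None
```

The index holds only three entries for nine records, and a lookup touches exactly one data block after the binary search on the index.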
Indexes can also be characterized as dense or sparse. A dense index has an index entry for every search
key value (and hence every record) in the data file. A sparse (or nondense) index, on the other hand, has
index entries for only some of the search values. A sparse index has fewer entries than the number of
records in the file. Thus, a primary index is a nondense (sparse) index, since it includes an entry for each
disk block of the data file and the keys of its anchor record rather than for every search value (or every
record).
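The saving from a sparse index can be checked with a short calculation. The sketch below uses assumed sizes (block size, record size, number of records and index entry size are illustrative values, not figures taken from these notes) and counts block accesses for a binary search with and without the index.

```python
import math

# All sizes below are assumed values chosen for illustration.
block_size = 4096        # bytes per disk block
record_size = 100        # bytes per data record
num_records = 300_000    # records in the ordered data file
index_entry_size = 15    # bytes per <K(i), P(i)> entry (key + block pointer)

# Data file: blocking factor and number of blocks.
records_per_block = block_size // record_size               # 40
data_blocks = math.ceil(num_records / records_per_block)    # 7500

# Binary search directly on the ordered data file.
accesses_without_index = math.ceil(math.log2(data_blocks))  # 13

# Sparse primary index: one entry per data block.
entries_per_block = block_size // index_entry_size          # 273
sparse_index_blocks = math.ceil(data_blocks / entries_per_block)  # 28

# Binary search on the index, plus one access to fetch the data block.
accesses_with_index = math.ceil(math.log2(sparse_index_blocks)) + 1  # 6

# A dense index on the same file needs one entry per record instead.
dense_index_blocks = math.ceil(num_records / entries_per_block)  # 1099
```

Under these assumptions the sparse index occupies 28 blocks against 1099 for a dense one, and a search needs 6 block accesses instead of 13.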
A major problem with a primary index—as with any ordered file—is insertion and deletion of records.
With a primary index, the problem is compounded because if we attempt to insert a record in its correct
position in the data file, we must not only move records to make space for the new record but also
change some index entries, since moving records will change the anchor records of some blocks.
Clustering Indexes
If file records are physically ordered on a nonkey field—which does not have a distinct
value for each record—that field is called the clustering field and the data file is called a clustered file.
We can create a different type of index, called a clustering index, to speed up retrieval of all the records
that have the same value for the clustering field. This differs from a primary index, which requires that
the ordering field of the data file have a distinct value for each record.
A clustering index is also an ordered file with two fields; the first field is of the same type as the clustering
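field of the data file, and the second field is a disk block pointer. There is one entry in the clustering index
for each distinct value of the clustering field; it contains the value and a pointer to the first block in the
data file that has a record with that value.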
Record insertion and deletion still cause problems because the data records are physically ordered. To
alleviate the problem of insertion, it is common to reserve a whole block (or a cluster of contiguous
blocks) for each value of the clustering field; all records with that value are placed in the block (or block
cluster). This makes insertion and deletion relatively straightforward.
A clustering index is another example of a nondense index because it has an entry for every distinct value
of the indexing field, which is a nonkey by definition and hence has duplicate values rather than a unique
value for every record in the file.
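A clustering index can be sketched along the same lines. The example below is only an illustration under assumptions: the file is a small in-memory list of blocks ordered on a nonkey department number, and the names clustering_index and records_with are hypothetical.

```python
from bisect import bisect_left

# Hypothetical clustered file ordered on a nonkey field (department number).
# Each inner list is one disk block of (dept_no, name) records.
data_blocks = [
    [(1, "Ann"), (1, "Bob"), (2, "Cy")],
    [(2, "Dee"), (2, "Eli"), (3, "Fay")],
    [(3, "Gus"), (5, "Hal"), (5, "Ida")],
]

# One index entry per distinct clustering value, pointing to the first
# block that contains a record with that value.
clustering_index = []
for block_no, block in enumerate(data_blocks):
    for dept_no, _ in block:
        if not clustering_index or clustering_index[-1][0] != dept_no:
            clustering_index.append((dept_no, block_no))

index_keys = [k for k, _ in clustering_index]


def records_with(value):
    """Return all records whose clustering field equals value."""
    i = bisect_left(index_keys, value)
    if i == len(index_keys) or index_keys[i] != value:
        return []
    result = []
    # Start at the first block for this value and read forward until a
    # larger value appears; matching records are physically contiguous.
    for block in data_blocks[clustering_index[i][1]:]:
        for record in block:
            if record[0] == value:
                result.append(record)
            elif record[0] > value:
                return result
    return result
```

The index has four entries for nine records, one per distinct department number, which is why it is nondense.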
Secondary Indexes
A secondary index provides a secondary means of accessing a data file for which some primary access
already exists. The data file records could be ordered, unordered, or hashed. The secondary index may be
created on a field that is a candidate key and has a unique value in every record, or on a nonkey field with
duplicate values. The index is again an ordered file with two fields. The first field is of the same data type
as some nonordering field of the data file that is an indexing field. The second field is either a block
pointer or a record pointer.
First we consider a secondary index access structure on a key (unique) field that has a distinct value for
every record. Such a field is sometimes called a secondary key; in the relational model, this would
correspond to any UNIQUE key attribute or to the primary key attribute of a table. In this case there is one
index entry for each record in the data file, which contains the value of the field for the record and a
pointer either to the block in which the record is stored or to the record itself. Hence, such an index is
dense.
Because the records of the data file are not physically ordered by values of the secondary key field, we
cannot use block anchors.
When the pointers P(i) in the index entries are block pointers rather than record pointers, the appropriate
disk block is first transferred to a main memory buffer, and a search for the desired record within the
block is then carried out.
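A dense secondary index on a key field might look like the following sketch. The data file is deliberately not ordered on the indexed field; the field names emp_id and ssn, the sample values and the function find_by_ssn are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical data file that is NOT ordered on the indexed field (ssn).
# Each inner list is one disk block of (emp_id, ssn) records.
data_blocks = [
    [(1, 907), (2, 315), (3, 642)],
    [(4, 118), (5, 771), (6, 256)],
]

# Dense secondary index: one <K(i), P(i)> entry per record, sorted on K(i).
# P(i) is a block pointer (here simply the block number).
secondary_index = sorted(
    (ssn, block_no)
    for block_no, block in enumerate(data_blocks)
    for _, ssn in block
)
index_keys = [k for k, _ in secondary_index]


def find_by_ssn(ssn):
    """Binary-search the index, fetch the block, then scan it in memory."""
    i = bisect_left(index_keys, ssn)
    if i == len(index_keys) or index_keys[i] != ssn:
        return None
    block = data_blocks[secondary_index[i][1]]
    return next(record for record in block if record[1] == ssn)
```

Because the data records are unordered on ssn, every record needs its own index entry; there are no block anchors to rely on.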
We can also create a secondary index on a nonkey, nonordering field of a file. In this case, numerous
records in the data file can have the same value for the indexing field. There are several options for
implementing such an index:
■ Option 1 is to include duplicate index entries with the same K(i) value—one for each record. This would
be a dense index.
■ Option 2 is to have variable-length records for the index entries, with a repeating field for the pointer.
■ Option 3, which is more commonly used, is to keep the index entries themselves at a fixed length and
have a single entry for each index field value, but to create an extra level of indirection to handle the
multiple pointers as shown in figure below.
Retrieval via the index requires one or more additional block accesses because of the extra level, but the
algorithms for searching the index and (more importantly) for inserting new records in the data file are
straightforward.
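Option 3 can be sketched as below. This is a simplified in-memory illustration: record pointers are modelled as (block number, slot number) pairs, the indirection level is a plain list called pointer_blocks, and all names and sample data are assumptions.

```python
from bisect import bisect_left

# Hypothetical unordered data file of (name, dept_no) records.
# A record pointer is modelled as (block_no, slot_no).
data_blocks = [
    [("Ann", 3), ("Bob", 5), ("Cy", 3)],
    [("Dee", 8), ("Eli", 5), ("Fay", 3)],
]

# Collect the record pointers for each distinct value of the indexing field.
groups = {}
for block_no, block in enumerate(data_blocks):
    for slot_no, (_, dept_no) in enumerate(block):
        groups.setdefault(dept_no, []).append((block_no, slot_no))

pointer_blocks = []   # the extra level: one block of record pointers per value
index = []            # fixed-length <K(i), P(i)> entries, one per distinct value
for dept_no in sorted(groups):
    index.append((dept_no, len(pointer_blocks)))
    pointer_blocks.append(groups[dept_no])
index_keys = [k for k, _ in index]


def find_all(dept_no):
    """Return every record whose indexing field equals dept_no."""
    i = bisect_left(index_keys, dept_no)
    if i == len(index_keys) or index_keys[i] != dept_no:
        return []
    record_pointers = pointer_blocks[index[i][1]]   # the additional access
    return [data_blocks[b][s] for b, s in record_pointers]
```

Each index entry stays the same size no matter how many records share a value; the variable-length part lives in the pointer blocks.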
A secondary index provides a logical ordering on the records by the indexing field. If we access the
records in order of the entries in the secondary index, we get them in order of the indexing field. The
primary and clustering indexes assume that the field used for physical ordering of records in the file is
the same as the indexing field.
Multi-level indexes
If an index is small enough to be kept entirely in main memory, the search time to find an entry
is low. However, if the index is so large that not all of it can be kept in memory, index blocks
must be fetched from disk when required. (Even if an index is smaller than the main memory of
a computer, main memory
is also required for a number of other tasks, so it may not be possible to keep the entire index in
memory.) The search for an entry in the index then requires several disk-block reads.
In such a case, we can create yet another level of index. Indeed, we can repeat this process as
many
times as necessary. Indices with two or more levels are called multilevel indices. Searching for
records with a multilevel index requires significantly fewer I/O operations than does searching
for records by binary search. Multilevel indices are closely related to tree structures, such as the
binary trees used for in-memory indexing.
The multilevel scheme can be used on any type of index—whether it is primary, clustering, or
secondary—as long as the first-level index has distinct values for K(i) and fixed-length entries.
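The construction and search of a multilevel index can be sketched as follows. The fan-out of 3 entries per index block, the function names build_levels and search, and the sample first-level index are assumptions chosen to keep the example small; the sketch also assumes a non-empty first level.

```python
from bisect import bisect_right

FAN_OUT = 3  # assumed number of index entries that fit in one index block


def build_levels(first_level):
    """Build index levels bottom-up until the top level fits in one block.

    Each level is a list of blocks; each block is a list of (key, pointer)
    entries. In level 1 the pointer is a data-block number; in higher
    levels it is the number of a block in the level below.
    """
    levels = []
    entries = first_level
    while True:
        blocks = [entries[i:i + FAN_OUT] for i in range(0, len(entries), FAN_OUT)]
        levels.append(blocks)
        if len(blocks) <= 1:
            return levels
        # The next level holds one entry per block of this level,
        # keyed by the first key stored in that block.
        entries = [(block[0][0], n) for n, block in enumerate(blocks)]


def search(levels, key):
    """Return (data_block_pointer, index_blocks_read) for the given key."""
    block_no = 0
    for blocks in reversed(levels):          # start at the top level
        block = blocks[block_no]             # one index block access
        keys = [k for k, _ in block]
        i = max(bisect_right(keys, key) - 1, 0)
        block_no = block[i][1]
    return block_no, len(levels)


# Hypothetical first-level index: 27 entries with keys 10, 20, ..., 270.
first_level = [(key, n) for n, key in enumerate(range(10, 280, 10))]
levels = build_levels(first_level)
```

With 27 first-level entries and a fan-out of 3, the sketch builds 3 levels, so each search reads 3 index blocks. A binary search over the 9 blocks of the first level alone could need up to 4.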
A multilevel index reduces the number of blocks accessed when searching for a record, given its indexing
field value. We are still faced with the problems of dealing with index insertions and deletions, because all
index levels are physically ordered files. To retain the benefits of using multilevel indexing while reducing
index insertion and deletion problems, designers adopted a multilevel index called a dynamic multilevel
index that leaves some space in each of its blocks
for inserting new entries and uses appropriate insertion/deletion algorithms for creating and deleting
new index blocks when the data file grows and shrinks. It is often implemented by using data structures
called B-trees and B+-trees.
B+-Trees
A tree is formed of nodes. Each node in the tree, except for a special node called the root, has one
parent node and zero or more child nodes. The root node has no parent. A node that does not have any
child nodes is called a leaf node; a nonleaf node is called an internal node. The level of a node is always
one more than the level of its parent, with the level of the root node being zero. A subtree of a node
consists of that node and all its descendant nodes.
In a B+-tree, data pointers are stored only at the leaf nodes of the tree; hence, the structure of leaf nodes
differs from the structure of internal nodes. The leaf nodes have an entry for every value of the search
field, along with a data pointer to the record (or to the block that contains
this record) if the search field is a key field. For a nonkey search field, the pointer points to a block
containing pointers to the data file records, creating an extra level of indirection.
The leaf nodes of the B+-tree are usually linked to provide ordered access on the search field to the
records. These leaf nodes are similar to the first (base) level of an index. Internal nodes of the B+-tree
correspond to the other levels of a multilevel index. Some search field values from the leaf nodes are
repeated in the internal nodes of the B+-tree to guide the search.
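The node layout described above can be illustrated with a small hand-built tree. The sketch below covers only searching and ordered scanning, not insertion or deletion. The class names Leaf and Internal, the string data pointers, and the convention that keys less than or equal to keys[i] are found under children[i] are assumptions made for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    data_pointers: List[str]              # one data pointer per key
    next_leaf: Optional["Leaf"] = None    # link to the next leaf in key order


@dataclass
class Internal:
    keys: List[int]      # search values repeated from the leaves
    children: list       # tree pointers; len(children) == len(keys) + 1


def find_leaf(node, key):
    """Follow tree pointers from the root to the leaf that may hold key."""
    while isinstance(node, Internal):
        # Assumed convention: keys <= keys[i] are under children[i].
        node = node.children[bisect_left(node.keys, key)]
    return node


def search(root, key):
    """Return the data pointer for key, or None if key is absent."""
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.data_pointers[i]
    return None


def range_scan(root, low, high):
    """Return (key, pointer) pairs with low <= key <= high via the leaf links."""
    result = []
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, p in zip(leaf.keys, leaf.data_pointers):
            if k > high:
                return result
            if k >= low:
                result.append((k, p))
        leaf = leaf.next_leaf
    return result


# Hand-built example tree: the root repeats the values 5 and 8 from the leaves.
leaf3 = Leaf([9, 12], ["r9", "r12"])
leaf2 = Leaf([6, 7, 8], ["r6", "r7", "r8"], leaf3)
leaf1 = Leaf([1, 3, 5], ["r1", "r3", "r5"], leaf2)
root = Internal([5, 8], [leaf1, leaf2, leaf3])
```

A point lookup walks from the root to one leaf, while a range scan descends once and then follows next_leaf links without returning to the internal nodes.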
The pointers in internal nodes are tree pointers to blocks that are tree nodes, whereas the pointers in leaf
nodes are data pointers to the data file records or blocks. Because entries in the internal nodes of a
B+-tree include search values and tree pointers without any data pointers, more entries can be packed into
an internal node of a B+-tree than for a similar B-tree. Thus, for the same block (node) size, the order p
will be larger for the B+-tree than for the B-tree. This can lead to fewer B+-tree levels, improving search
time. Because the structures for internal and for leaf nodes of a B+-tree are different, the order p of an
internal node can differ from the order of a leaf node.
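The effect of dropping data pointers from internal nodes can be checked numerically. The calculation below uses assumed sizes for the block, the search field value and the two kinds of pointer; the symbols B, V, P and Pr are introduced here for illustration only.

```python
# All sizes are assumed values chosen for illustration.
B = 512    # block (node) size in bytes
V = 9      # size of a search field value
P = 6      # size of a tree (block) pointer
Pr = 7     # size of a data (record) pointer

# B-tree node: p tree pointers and p - 1 <value, data pointer> pairs.
#   p*P + (p - 1)*(Pr + V) <= B
p_btree = (B + Pr + V) // (P + Pr + V)      # 24

# B+-tree internal node: p tree pointers and p - 1 values, no data pointers.
#   p*P + (p - 1)*V <= B
p_internal = (B + V) // (P + V)             # 34

# B+-tree leaf node: p_leaf <value, data pointer> pairs plus one next-leaf pointer.
#   p_leaf*(Pr + V) + P <= B
p_leaf = (B - P) // (Pr + V)                # 31
```

With these sizes an internal B+-tree node holds 34 tree pointers against 24 for a B-tree node of the same size, and the leaf order (31) differs from the internal order.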
Query processing
The scanner identifies the query tokens—such as SQL keywords, attribute names, and relation names—
that appear in the text of the query, whereas the parser checks the query syntax to determine whether it
is formulated according to the syntax rules (rules of grammar) of the
query language. An internal representation of the query is then created, usually as a tree data structure
called a query tree or query graph. The DBMS must then devise an execution strategy or query plan for
retrieving the results of the query from the database files. A query typically has many possible execution
strategies, and the process of choosing a suitable one for processing a query is known as query
optimization.
The scanner and parser of an SQL query first generate a data structure that corresponds
to an initial query representation, which is then optimized according to heuristic rules. This leads to an
optimized query representation, which corresponds to the query execution strategy. Following that, a
query execution plan is generated to execute groups of operations based on the access paths available on
the files involved in the query.
One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other
binary operations, because the size of the file resulting from a binary operation—such as JOIN—is usually
a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size
of a file and hence should be applied before a join or other binary operation.
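The benefit of this heuristic shows up in the size of the intermediate result. The sketch below compares two plans for the same query on small made-up relations; the relation contents, the attribute names and the helper functions join and select are assumptions for illustration, and the join is a simple nested loop.

```python
# Hypothetical relations as lists of dicts.
employee = [{"emp_id": i, "dept_no": i % 4} for i in range(100)]
department = [
    {"dept_no": d, "city": "Kochi" if d == 1 else "Other"} for d in range(4)
]


def join(left, right, attr):
    """Nested-loop equijoin on a common attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def in_kochi(row):
    return row["city"] == "Kochi"


# Plan A: JOIN first, then SELECT.
joined = join(employee, department, "dept_no")     # 100-row intermediate result
plan_a = select(joined, in_kochi)

# Plan B: SELECT first, then JOIN.
kochi_depts = select(department, in_kochi)         # 1-row intermediate result
plan_b = join(employee, kochi_depts, "dept_no")
```

Both plans return the same 25 rows, but Plan B joins 100 rows against 1 instead of 4 and never materializes the 100-row intermediate relation.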
A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the
input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as
internal nodes. An execution of the query tree consists of executing an internal node operation whenever
its operands are available and then replacing that internal node by the relation that results from
executing the operation.
The order of execution of operations starts at the leaf nodes, which represent the input database
relations for the query, and ends at the root node, which represents the final operation of the query. The
execution terminates when the root node operation is executed and produces the result relation for the
query.
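Bottom-up execution of a query tree can be sketched with a small recursive evaluator. The classes Relation and Operation, the function execute and the sample relations are assumptions for illustration; a real DBMS would choose access paths and algorithms for each node rather than call Python functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Relation:               # leaf node: an input relation
    name: str
    rows: List[dict]


@dataclass
class Operation:              # internal node: a relational algebra operation
    name: str
    apply: Callable[..., List[dict]]
    operands: list            # child nodes (Relation or Operation)


def execute(node, trace):
    """Evaluate a query tree bottom-up and return the result rows."""
    if isinstance(node, Relation):
        return node.rows
    # An internal node can run only once all of its operands are available.
    inputs = [execute(child, trace) for child in node.operands]
    trace.append(node.name)
    return node.apply(*inputs)


employee = Relation("EMPLOYEE", [
    {"name": "Ann", "dept_no": 5},
    {"name": "Bob", "dept_no": 4},
])
department = Relation("DEPARTMENT", [
    {"dept_no": 4, "dname": "Admin"},
    {"dept_no": 5, "dname": "Research"},
])

select_node = Operation(
    "SELECT dname = 'Research'",
    lambda rows: [r for r in rows if r["dname"] == "Research"],
    [department],
)
join_node = Operation(
    "JOIN on dept_no",
    lambda left, right: [
        {**l, **r} for l in left for r in right if l["dept_no"] == r["dept_no"]
    ],
    [employee, select_node],
)
root = Operation(
    "PROJECT name",
    lambda rows: [{"name": r["name"]} for r in rows],
    [join_node],
)

trace = []
result = execute(root, trace)
```

The trace records the operations in the order SELECT, JOIN, PROJECT, so execution finishes with the root node, whose output is the result relation.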