DBMS Unit-4
Benefits
Improved Query Performance: Secondary indexes can improve the performance of
queries by reducing the amount of data that needs to be scanned to find the desired
records. With a secondary index, the database can directly access the required records,
rather than having to scan the entire table.
Flexibility: Secondary indexes provide greater flexibility in managing a database, as
they can be created and dropped at any time. This allows for a more dynamic
approach to database management, as the needs of the database can change over time.
Simplified Search: Secondary indexes simplify the search for specific records within
a database, making it easier to find the desired data.
Compact Data Structure: A secondary index stores only the search-key values and pointers
to the matching records, so it is much smaller than the table it indexes and can often be
held in memory. Note, however, that the index is additional storage on top of the table
data rather than a replacement for it.
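To illustrate the query-performance benefit above, here is a minimal sketch (not tied to any particular DBMS; the Employee struct and the city column are invented for the example) of a secondary index that maps a column value to the positions of the matching rows, so a lookup can jump straight to them instead of scanning the whole table:

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical table stored in memory as a vector of rows.
struct Employee {
    int id;
    std::string city;
};

int main() {
    std::vector<Employee> table = {
        {1, "Delhi"}, {2, "Mumbai"}, {3, "Delhi"}, {4, "Chennai"}
    };

    // Secondary index on the non-key column "city": it maps each city
    // value to the positions of the matching rows.
    std::unordered_map<std::string, std::vector<std::size_t>> cityIndex;
    for (std::size_t i = 0; i < table.size(); ++i)
        cityIndex[table[i].city].push_back(i);

    // Query: employees in Delhi. The index hands back the row positions
    // directly, so the whole table does not have to be scanned.
    for (std::size_t pos : cityIndex["Delhi"])
        std::cout << "id = " << table[pos].id << '\n';
}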
Types of Secondary Indexes
B-tree Index: A B-tree index is a type of index that stores data in a balanced tree
structure. B-tree indexes are commonly used in relational databases and provide
efficient search, insert, and delete operations.
Hash Index: A hash index is a type of index that uses a hash function to map data to a
specific location within the index. Hash indexes are commonly used in non-relational
databases, such as NoSQL databases, and provide fast access to data.
Bitmap Index: A bitmap index is a type of index that uses a bitmap to represent the
data in a database. Each bit in the bitmap represents a specific record in the database,
and the value of the bit indicates whether the record is present or not. Bitmap indexes
are commonly used in data warehousing and business intelligence applications, as
they provide efficient access to large amounts of data.
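As a rough illustration of the bitmap idea (the status column and its values are invented for the example), each distinct value gets one bitmap, with one bit per record indicating whether that record holds the value:

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Invented column with few distinct values, one entry per record.
    std::vector<std::string> status = {"paid", "open", "paid", "paid", "open"};

    // Bitmap index: one bitmap per distinct value; bit i is set when
    // record i holds that value.
    std::map<std::string, std::vector<bool>> bitmaps;
    for (std::size_t i = 0; i < status.size(); ++i) {
        std::vector<bool>& bm = bitmaps[status[i]];
        bm.resize(status.size(), false);   // one bit per record
        bm[i] = true;
    }

    // Query: which records have status == "paid"?
    const std::vector<bool>& paid = bitmaps["paid"];
    for (std::size_t i = 0; i < paid.size(); ++i)
        if (paid[i])
            std::cout << "record " << i << " matches\n";
}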
When to Use Secondary Indexing
Secondary indexing should be used in database management systems when there is a
need to improve the performance of data retrieval operations that search for data based on
specific conditions. Secondary indexing is particularly useful in the following scenarios:
Queries with Complex Search Criteria: Secondary indexes can be used to support
complex queries that search for data based on multiple conditions. By creating a
secondary index based on the columns used in the search criteria, database
management systems can access the data more efficiently.
Large Data Sets: Secondary indexing can be beneficial for large data sets where the
time and resources required for data retrieval operations can be significant. By
creating a secondary index, database management systems can access the data more
quickly, reducing the time and resources required for data retrieval operations.
Frequently Accessed Data: Secondary indexing should be used for frequently
accessed data to reduce the time and resources required for data retrieval operations.
This is because secondary indexes provide a fast and efficient way to access data
stored in a database.
Sorting and Aggregating Data: Secondary indexing can be used to support sorting
and aggregating data based on specific columns. By creating a secondary index based
on the columns used for sorting and aggregating, database management systems can
access the data more efficiently, reducing the time and resources required for data
retrieval operations.
Data Structure: The underlying organization of the data can also affect the choice of
secondary index. For example, if the data file is already maintained in sorted order, a
B-tree index may be the most appropriate type of secondary index, while point lookups on
unordered data are well served by a hash index.
The following sections discuss each file organization method, along with the differences
and the advantages/disadvantages of each.
The easiest method of file organization is the sequential method. In this method the
records are stored one after another in a sequential manner. There are two ways to
implement it:
Pile File Method – This method is quite simple: records are stored in sequence, one after
the other, in the order in which they are inserted into the table.
Sorted File Method – As the name suggests, whenever a new record has to be inserted, it is
always placed at its sorted (ascending or descending) position. Sorting of records may be
based on the primary key or on any other key.
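A minimal sketch of the two approaches, using plain integers as stand-in records (the key values are made up for illustration): the pile file simply appends, while the sorted file finds the correct position before inserting:

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // Pile file: records are simply appended in arrival order.
    std::vector<int> pile;
    for (int key : {42, 7, 19})
        pile.push_back(key);                       // pile holds 42 7 19

    // Sorted file: each new record is placed at its sorted position.
    std::vector<int> sorted;
    for (int key : {42, 7, 19}) {
        auto pos = std::lower_bound(sorted.begin(), sorted.end(), key);
        sorted.insert(pos, key);                   // later records are shifted
    }

    for (int key : sorted)
        std::cout << key << ' ';                   // prints 7 19 42
    std::cout << '\n';
}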
Heap File Organization works with data blocks. In this method, records are inserted at the
end of the file, into the data blocks. No sorting or ordering is required. If a data block
is full, the new record is stored in some other block; this need not be the very next data
block, it can be any block in memory. It is the responsibility of the DBMS to store and
manage the new records.
Insertion of new record –
Suppose we have five records in the heap, R1, R5, R6, R4 and R3, and a new record R2 has
to be inserted. Since the last data block, data block 3, is full, R2 is inserted into
whichever data block the DBMS selects, say data block 1.
If we want to search, delete or update data in heap file organization, we traverse the
file from the beginning until we find the requested record. Thus, if the database is very
large, searching, deleting or updating a record takes a lot of time.
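As a rough sketch of this behaviour (the block capacity, record numbers and block count below are made-up values, not anything prescribed by a particular DBMS): insertion simply finds any block with free space, while search must scan the file block by block:

#include <cstddef>
#include <iostream>
#include <vector>

// Made-up fixed-capacity data block holding record numbers.
struct Block {
    static constexpr std::size_t CAPACITY = 2;
    std::vector<int> records;
    bool full() const { return records.size() >= CAPACITY; }
};

// Insert a record into any block that still has free space.
void insertRecord(std::vector<Block>& heapFile, int record) {
    for (Block& b : heapFile)
        if (!b.full()) {
            b.records.push_back(record);
            return;
        }
    heapFile.push_back(Block{{record}});           // all blocks full: allocate a new one
}

int main() {
    std::vector<Block> heapFile(3);                // data blocks 1..3

    for (int r : {1, 5, 6, 4, 3})                  // R1, R5, R6, R4, R3
        insertRecord(heapFile, r);
    insertRecord(heapFile, 2);                     // new record R2 goes to any free block

    // Searching has to scan the file block by block from the beginning.
    int target = 4;
    for (std::size_t b = 0; b < heapFile.size(); ++b)
        for (int r : heapFile[b].records)
            if (r == target)
                std::cout << "R" << target << " found in block " << b + 1 << '\n';
}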
Pros and Cons of Heap File Organization –
Pros –
Fetching and retrieving records is faster than in sequential file organization, but only
for small databases.
This method is best suited when a huge amount of data needs to be loaded into the
database at one time.
Cons –
Problem of unused memory blocks.
Inefficient for larger databases.
If the database is very large then searching, updating or deleting of record will be time-
consuming because there is no sorting or ordering of records. In the heap file organization, we
need to check all the data until we get the requested record.
Pros and Cons of Sequential File Organization –
Pros –
o It is a fast and efficient method for handling huge amounts of data.
o In this method, files can easily be stored on cheaper storage media such as magnetic
tape.
o It is simple in design and requires little effort to store the data.
o This method is used when most of the records have to be accessed, such as grade
calculation of students or generating salary slips.
o This method is used for report generation and statistical calculations.
Cons –
o It wastes time, as we cannot jump directly to a required record but have to move
through the records sequentially.
o The sorted file method takes more time and space for sorting the records.
Hash File Organization
o In this method of file organization, a hash function is applied to the hash key columns
to compute the address of the data block in which a record is stored.
o When a record has to be retrieved using the hash key columns, the address is generated
and the whole record is fetched from that address. In the same way, when a new record has
to be inserted, the address is generated using the hash key and the record is directly
inserted there. The same process is applied in the case of delete and update.
o In this method, there is no effort spent on searching and sorting the entire file; each
record is simply stored at the memory location given by the hash function.
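A minimal sketch of this idea, with a made-up modulo hash function and an in-memory vector standing in for the data blocks (the record fields and key values are invented for illustration):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Made-up record keyed by an integer id.
struct Record {
    int id;
    std::string name;
};

const std::size_t NUM_BLOCKS = 8;

// Hash function: maps the hash key directly to a block address.
std::size_t blockAddress(int key) {
    return static_cast<std::size_t>(key) % NUM_BLOCKS;
}

int main() {
    std::vector<std::vector<Record>> blocks(NUM_BLOCKS);

    // Insert: compute the address from the key and place the record there.
    for (const Record& r : {Record{101, "Asha"}, Record{205, "Ravi"}, Record{309, "Meena"}})
        blocks[blockAddress(r.id)].push_back(r);

    // Retrieve: the same hash computation tells us which block to read,
    // so there is no need to search or sort the whole file.
    int key = 205;
    for (const Record& r : blocks[blockAddress(key)])
        if (r.id == key)
            std::cout << "found " << r.name << '\n';
}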
B+ File Organization
o B+ tree file organization is an advanced form of the indexed sequential access method.
It uses a tree-like structure to store records in a file.
o It uses the same concept of key-index where the primary key is used to sort the records.
For each primary key, the value of the index is generated and mapped with the record.
o The B+ tree is similar to a binary search tree (BST), but it can have more than two
children. In this method, all the records are stored only at the leaf node. Intermediate
nodes act as a pointer to the leaf nodes. They do not contain any records.
As an example, consider a B+ tree with the following structure:
o There is one root node of the tree, i.e., 25.
o There is an intermediary layer with nodes. They do not store the actual record. They have
only pointers to the leaf node.
o The node to the left of the root contains values smaller than the root, and the node to
the right contains values larger than the root, i.e., 15 and 30 respectively.
o The leaf nodes contain only the values, i.e., 10, 12, 17, 20, 24, 27 and 29.
o Searching for any record is easier as all the leaf nodes are balanced.
o In this method, any record can be reached by traversing a single path from the root, so
it can be accessed easily.
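The following sketch shows only the search path of a B+ tree; the node layout is simplified to two levels and the key values are illustrative rather than the exact tree described above. Internal nodes hold separator keys and child pointers, and only the leaves hold the actual values:

#include <cstddef>
#include <iostream>
#include <vector>

// Simplified B+ tree node: internal nodes hold separator keys and child
// pointers; leaf nodes hold the actual values and have no children.
struct Node {
    std::vector<int> keys;
    std::vector<Node*> children;
    bool isLeaf() const { return children.empty(); }
};

// Search follows a single root-to-leaf path guided by the separator keys.
bool search(const Node* n, int key) {
    while (!n->isLeaf()) {
        std::size_t i = 0;
        while (i < n->keys.size() && key >= n->keys[i])
            ++i;
        n = n->children[i];
    }
    for (int v : n->keys)
        if (v == key)
            return true;
    return false;
}

int main() {
    Node leaf1{{10, 12}, {}};
    Node leaf2{{17, 20, 24}, {}};
    Node leaf3{{25, 27, 29}, {}};
    Node root{{17, 25}, {&leaf1, &leaf2, &leaf3}};   // separator keys only

    std::cout << std::boolalpha
              << search(&root, 20) << ' '            // true
              << search(&root, 21) << '\n';          // false
}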
Indexing in Databases
Indexing improves database performance by minimizing the number of disk visits required
to fulfil a query. It is a data structure technique used to locate and quickly access data
in databases. An index is built on one or more fields of a table. The first column is the
Search key, which holds a copy of the primary key or candidate key of the table; these
values are kept in sorted order so that the corresponding data can be retrieved quickly
(the table data itself need not be stored in sorted order). The second column is the Data
Reference or Pointer, which contains a set of pointers holding the address of the disk
block where that particular key value can be found.
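A minimal sketch of such an index, with the search key and a block number standing in for the data pointer (the values are invented for illustration); because the entries are kept sorted on the search key, a binary search finds the pointer quickly:

#include <algorithm>
#include <iostream>
#include <vector>

// One index row: a search-key value plus a pointer (here just a block
// number) to the disk block holding the record with that key.
struct IndexEntry {
    int searchKey;
    int blockAddress;
};

int main() {
    // Index entries are kept sorted on the search key.
    std::vector<IndexEntry> index = {
        {10, 3}, {25, 1}, {40, 7}, {55, 2}
    };

    // Because the keys are sorted, a binary search locates the pointer quickly.
    int wanted = 40;
    auto it = std::lower_bound(index.begin(), index.end(), wanted,
        [](const IndexEntry& e, int key) { return e.searchKey < key; });

    if (it != index.end() && it->searchKey == wanted)
        std::cout << "key " << wanted << " -> block " << it->blockAddress << '\n';
}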
Attributes of Indexing
Access Types: This refers to the type of access such as value based search, range
access, etc.
Access Time: It refers to the time needed to find a particular data element or set of
elements.
Insertion Time: It refers to the time taken to find the appropriate space and insert new
data.
Deletion Time: Time taken to find an item and delete it as well as update the index
structure.
Space Overhead: It refers to the additional space required by the index.
In general, the following types of indexing are used to organize and access the data:
Primary Indexing
This is a type of clustered indexing in which the data is sorted according to the search
key and the primary key of the database table is used to create the index. It is the
default format of indexing and induces a sequential file organization. Because primary
keys are unique and stored in sorted order, the performance of the searching operation is
quite efficient.
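A rough sketch of a sparse primary index over a file that is already sorted on the key (the block contents and key values are made up, and the lookup assumes the searched key is not smaller than the first key in the file): one index entry per block is enough, because the chosen block can then be scanned directly:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// A data block of the file, which is already sorted on the primary key.
struct DataBlock {
    std::vector<int> keys;
};

int main() {
    std::vector<DataBlock> dataFile = {
        {{10, 12, 17}}, {{20, 24, 27}}, {{29, 33, 40}}
    };

    // Sparse primary index: one entry per block, holding the block's first key.
    std::vector<int> sparseIndex;
    for (const DataBlock& b : dataFile)
        sparseIndex.push_back(b.keys.front());

    // Look up key 24: find the last index entry <= 24, then scan only that block.
    int key = 24;
    auto it = std::upper_bound(sparseIndex.begin(), sparseIndex.end(), key);
    std::size_t blockNo = static_cast<std::size_t>(it - sparseIndex.begin()) - 1;
    for (int k : dataFile[blockNo].keys)
        if (k == key)
            std::cout << "found " << key << " in block " << blockNo << '\n';
}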
Non-clustered or Secondary Indexing
A non-clustered index just tells us where the data lies, i.e. it gives us a list of
virtual pointers or references to the locations where the data is actually stored. The
data is not physically stored in the order of the index; the index entries merely point to
it. For example, consider the contents page of a book: each entry gives us the page number
or location of the information, while the actual information on the pages is not organized
according to the contents page; we simply have an ordered reference to where the data
actually lies. Only dense ordering is possible in a non-clustered index; sparse ordering
is not possible because the data is not physically organized accordingly.
It requires more time than a clustered index, because extra work is done to extract the
data by following the pointer. In the case of a clustered index, the data is directly
present in front of the index.
Multilevel Indexing
With the growth of the size of the database, indices also grow. Since the index is kept in
main memory, a single-level index may become too large to fit there, forcing multiple disk
accesses. Multilevel indexing segregates the index into several smaller blocks, each small
enough to be stored in a single block: the outer index blocks point to inner index blocks,
which in turn point to the data blocks. This can easily be kept in main memory with little
overhead.
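A two-level sketch of this idea (the keys and block numbers are invented): the small outer index stays in main memory and narrows the search to one inner index block, which then points to the data block:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Inner index block: maps search keys to data-block numbers.
struct InnerBlock {
    std::vector<std::pair<int, int>> entries;   // {search key, data block}
};

int main() {
    std::vector<InnerBlock> inner = {
        {{{10, 0}, {20, 1}}},
        {{{30, 2}, {40, 3}}},
        {{{50, 4}, {60, 5}}}
    };

    // Outer index: one entry per inner block, holding that block's first key.
    std::vector<int> outer;
    for (const InnerBlock& b : inner)
        outer.push_back(b.entries.front().first);

    // Look up key 40: the small outer index (kept in main memory) selects the
    // inner index block, and the inner block then points to the data block.
    int key = 40;
    auto it = std::upper_bound(outer.begin(), outer.end(), key);
    const InnerBlock& ib = inner[static_cast<std::size_t>(it - outer.begin()) - 1];
    for (const std::pair<int, int>& e : ib.entries)
        if (e.first == key)
            std::cout << "key " << key << " -> data block " << e.second << '\n';
}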
Advantages of Indexing
Improved Query Performance: Indexing enables faster data retrieval from the
database. The database may rapidly discover rows that match a specific value or
collection of values by generating an index on a column, minimising the amount of
time it takes to perform a query.
Efficient Data Access: Indexing can enhance data access efficiency by lowering the
amount of disk I/O required to retrieve data. The database can maintain the data pages
for frequently visited columns in memory by generating an index on those columns,
decreasing the requirement to read from disk.
Optimized Data Sorting: Indexing can also improve the performance of sorting
operations. By creating an index on the columns used for sorting, the database can
avoid sorting the entire table and instead sort only the relevant rows.
Consistent Data Performance: Indexing can help ensure that the database performs
consistently even as the amount of data in the database rises. Without indexing, queries
may take longer to run as the number of rows in the table grows, while indexing maintains
roughly consistent speed.
Enforced Data Integrity: By ensuring that only unique values are inserted into columns
that have been indexed as unique, indexing can also be utilized to ensure the integrity of
data. This avoids storing duplicate data in the database, which might lead to issues when
running queries or reports.
Overall, indexing in databases provides significant benefits for improving query
performance, efficient data access, optimized data sorting, consistent data performance,
and enforced data integrity.
Disadvantages of Indexing
Increased storage space: Indexing necessitates more storage space to hold the index data
structure, which can increase the total size of the database.
Increased database maintenance overhead: Indexes must be maintained as data is
added, destroyed, or modified in the table, which might raise database maintenance
overhead.
Slower writes: Indexing can reduce insert and update performance, since the index data
structure must be updated each time data is modified.
Choosing an index can be difficult: It can be challenging to choose the right indexes
for a specific query or application and may call for a detailed examination of the data
and access patterns.
Features of Indexing
The development of data structures, such as B-trees or hash tables, that provide quick
access to certain data items is known as indexing. The data structures themselves are
built on the values of the indexed columns, which are utilized to quickly find the data
objects.
The columns to be indexed are selected based on how frequently they are used and the
sorts of queries they are subjected to. The cardinality, selectivity, and uniqueness of
the candidate columns can also be taken into account.
There are several different index types used by databases, including primary,
secondary, clustered, and non-clustered indexes. Based on the particular needs of the
database system, each form of index offers benefits and drawbacks.
For the database system to function at its best, periodic index maintenance is required.
According to changes in the data and usage patterns, maintenance work involves
building, updating, and removing indexes.
Database query optimization involves indexing, which is essential. The query
optimizer utilizes the indexes to choose the best execution strategy for a particular
query based on the cost of accessing the data and the selectivity of the indexing
columns.
Databases make use of a range of indexing strategies, including covering indexes,
index-only scans, and partial indexes. These techniques maximize the utilization of
indexes for particular types of queries and data access.
When non-contiguous data blocks are stored in an index, it can result in index
fragmentation, which makes the index less effective. Regular index maintenance, such
as defragmentation and reorganisation, can decrease fragmentation.
2. Methods –
When a message is passed, the body of code that is executed is known as a
method. Whenever a method is executed, it returns a value as output. A method can
be of two types:
Read-only method: When the value of a variable is not affected by a method,
then it is known as the read-only method.
Update method: When the value of a variable is changed by a method, then it is
known as an update method.
3. Variables –
They store the data of an object. The data stored in the variables makes one object
distinguishable from another.
2. Object Classes:
An object, which represents a real-world entity, is an instance of a class. Hence, first
we need to define a class, and then objects are created; they differ in the values they
store but share the same class definition. The objects in turn respond to the messages and
hold the variables defined for them.
Example –
#include <string>
using std::string;

class CLERK
{
    // Variables (the data stored in each object)
    string name;
    string address;
    int id;
    int salary;

public:
    // Messages (the interface through which the object is accessed)
    string get_name();
    string get_address();
    int annual_salary();
};
In the above example, we can see that CLERK is a class that holds the object's variables
and messages.
An OODBMS also supports inheritance in an extensive manner as in a database there
may be many classes with similar methods, variables and messages. Thus, the concept of
the class hierarchy is maintained to depict the similarities among various classes.
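As a small sketch of such a hierarchy (the EMPLOYEE base class and the MANAGER subclass below are hypothetical, not part of the example above), a subclass inherits the variables and messages of its parent and adds its own:

#include <iostream>
#include <string>

// Simplified, self-contained base class in the spirit of the CLERK example.
class EMPLOYEE {
public:
    std::string name;
    int salary = 0;
    int annual_salary() const { return 12 * salary; }
};

// Hypothetical subclass: MANAGER inherits the variables and messages of
// EMPLOYEE and adds its own, forming a class hierarchy.
class MANAGER : public EMPLOYEE {
public:
    int bonus = 0;
    int annual_salary_with_bonus() const { return annual_salary() + bonus; }
};

int main() {
    MANAGER m;
    m.name = "Asha";
    m.salary = 50000;
    m.bonus = 20000;
    std::cout << m.name << " earns " << m.annual_salary_with_bonus() << " per year\n";
}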
The concept of encapsulation, that is, data or information hiding, is also supported by
the object-oriented data model. This data model also provides the facility of abstract
data types in addition to the built-in data types like char, int and float. ADTs are
user-defined data types that hold values within them and can also have methods attached
to them.
Thus, OODBMS provides numerous facilities to its users, both built-in and user-defined.
It incorporates the properties of an object-oriented data model with a database
management system, and supports the concept of programming paradigms like classes
and objects along with the support for other concepts like encapsulation, inheritance, and
the user-defined ADT’s (abstract data types).
Features of ODBMS:
Object-oriented data model: ODBMS uses an object-oriented data model to store and
manage data. This allows developers to work with data in a more natural way, as objects
are similar to the objects in the programming language they are using.
Complex data types: ODBMS supports complex data types such as arrays, lists, sets,
and graphs, allowing developers to store and manage complex data structures in the
database.
Automatic schema management: ODBMS automatically manages the schema of the
database, as the schema is defined by the classes and objects in the application code. This
eliminates the need for a separate schema definition language and simplifies the
development process.
High performance: ODBMS can provide high performance, especially for applications
that require complex data access patterns, as objects can be retrieved with a single query.
Data integrity: ODBMS provides strong data integrity, as the relationships between
objects are maintained by the database. This ensures that data remains consistent and
correct, even in complex applications.
Concurrency control: ODBMS provides concurrency control mechanisms that ensure
that multiple users can access and modify the same data without conflicts.
Scalability: ODBMS can scale horizontally by adding more servers to the database
cluster, allowing it to handle large volumes of data.
Support for transactions: ODBMS supports transactions, which ensure that multiple
operations on the database are atomic and consistent.
Characteristics
Easy to link with programming language: The programming language and the
database schema use the same type definitions, so developers may not need to learn a
new database query language.
No need for user defined keys: Object Database Management Systems have an
automatically generated OID associated with each of the objects.
Easy modeling: ODBMS can easily model real-world objects, hence, are suitable for
applications with complex data.
Can store non-textual data: ODBMS can also store audio, video and image data.
Advantages
Speed: Access to data can be faster because an object can be retrieved directly
without a search, by following pointers.
Improved performance: These systems are most suitable for applications that use
object-oriented programming.
Extensibility: Unlike traditional RDBMSs, where the basic data types are hard-coded,
ODBMS allows the user to encode any kind of data structure to hold the data.
Data consistency: When ODBMS is integrated with an object-based application,
there is much greater consistency between the database and the programming
language since both use the same model of representation for the data. This helps
avoid the impedance mismatch.
Capability of handling a variety of data: Unlike other database management systems,
ODBMS can also store non-textual data such as images, videos and audio.
Disadvantages
No universal standards: There are no universally agreed standards for operating an
ODBMS. This is the most significant drawback, as the user is free to manipulate the data
model as they wish, which can be an issue when handling enormous amounts of data.
No security features: Since the use of ODBMS is very limited, there are not adequate
security features for storing production-grade data.
Exponential increase in complexity: ODBMS becomes very complex very fast. When there is
a lot of data and many relations between the data, managing and optimising an ODBMS
becomes difficult.
Scalability: Unable to support large systems.
Query optimization is challenging: Optimising ODBMS queries requires complete
information about the data, such as its type and size. This compromises the
data-encapsulation feature that ODBMS has to offer.
Types:
1. Homogeneous Database: A homogeneous database stores data uniformly across all locations.
All sites utilize the same operating system, database management system, and data structures.
They are therefore simple to handle.
Data may be stored at several sites in two ways using distributed data storage:
1. Replication - In this strategy, the entire relation is stored redundantly at two or more
sites. If the entire database is available at every site, it is a fully redundant database.
Because of replication, systems preserve copies of the data. This has advantages, since it
makes more data accessible at many locations; moreover, query requests can now be handled
in parallel. But there are some drawbacks as well. Data must be updated often, and all
changes performed at one site must be propagated to every site where that relation is
stored in order to avoid inconsistent results. This creates a lot of overhead. Moreover,
since concurrent access must now be coordinated across several sites, concurrency
management becomes far more complicated.
2. Fragmentation - In this method, the relation is divided into smaller fragments and each
fragment is stored at the sites where it is needed. To ensure there is no data loss, the
fragments must be defined in a way that allows the original relation to be reconstructed.
As fragmentation doesn't result in duplicate data, consistency is not a concern.
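A minimal sketch of horizontal fragmentation (the ACCOUNT relation, the branch column and the site names are invented for illustration): each site keeps only its own rows, and the union of the fragments reconstructs the original relation:

#include <iostream>
#include <string>
#include <vector>

// A row of an invented ACCOUNT relation.
struct Account {
    int id;
    std::string branch;
};

int main() {
    std::vector<Account> account = {
        {1, "Delhi"}, {2, "Mumbai"}, {3, "Delhi"}, {4, "Mumbai"}
    };

    // Horizontal fragmentation: each site stores only the rows it needs.
    std::vector<Account> siteDelhi, siteMumbai;
    for (const Account& a : account)
        (a.branch == "Delhi" ? siteDelhi : siteMumbai).push_back(a);

    // Reconstruction: the union of the fragments gives back the original
    // relation, so nothing is lost and nothing is stored twice.
    std::vector<Account> rebuilt = siteDelhi;
    rebuilt.insert(rebuilt.end(), siteMumbai.begin(), siteMumbai.end());
    std::cout << "original rows: " << account.size()
              << ", rebuilt rows: " << rebuilt.size() << '\n';
}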
A distributed database system typically provides:
o Location independence
o Distributed query processing
o Distributed transaction management
o Hardware independence
o Operating system and network independence
o Transaction transparency
o DBMS independence
All of the physical sites in a homogeneous distributed database system use the same operating
system and database software, as well as the same underlying hardware. It can be significantly
simpler to build and administer homogeneous distributed database systems since they appear to
the user as a single system. The data structures at each site must either be the same or
compatible for a distributed database system to be considered homogeneous. Also, the database
software used at each site must be the same or compatible.
The hardware, operating systems, or database software at each site may vary in a heterogeneous
distributed database. Although separate sites may employ various technologies and schemas, a
variation in schema might make query and transaction processing challenging.
Various nodes could have dissimilar hardware, software, and data structures, or they might be
located at sites that are not fully compatible with one another. Users may be able to access
data stored at a different place but not upload or modify it. Because heterogeneous distributed
databases are often challenging to use, many organizations find them to be economically
unviable.
o As distributed databases provide modular development, systems may be enlarged by putting new
computers and local data in a new location and seamlessly connecting them to the distributed
system.
o With centralized databases, failures result in a total shutdown of the system. Distributed database
systems, however, continue to operate with lower performance when a component fails until the
issue is resolved.
o If the data is kept near where it is most often used, administrators can reduce transmission
costs in distributed database systems. Centralized systems are unable to accommodate this.
Types of Distributed Database
o Data instances are created in various areas of the database using replicated data. Distributed
databases may access identical data locally by utilizing duplicated data, which reduces
bandwidth. Read-only and writable data are the two types of replicated data that may be
distinguished.
o Only the initial instance of replicated data can be changed in read-only versions; all subsequent
corporate data replications are then updated. Data that is writable can be modified, but only the
initial occurrence is affected.
o Primary keys that point to a single database record are used to identify horizontally fragmented
data. Horizontal fragmentation is typically used when business locations only want access to the
database for their own branch.
o Vertically fragmented data is organized using copies of primary keys that are accessible to
each branch of the database. Vertically fragmented data is used when a company's branch and
central location deal with the same accounts in different ways.
o Data that has been edited or modified for decision support databases is referred to as reorganised
data. When two distinct systems are managing transactions and decision support, reorganised data
is generally utilised. When there are numerous requests, online transaction processing must be
reconfigured, and decision support systems might be challenging to manage.
o In order to accommodate various departments and circumstances, separate schema data separates
the database and the software used to access it. Often, there is overlap between many databases
and separate schema data.
Distributed database examples
o Apache Ignite, Apache Cassandra, Apache HBase, Couchbase Server, Amazon SimpleDB,
Clusterpoint, and FoundationDB are just a few examples of the numerous distributed databases
available.
o Large data sets may be stored and processed with Apache Ignite across node clusters. GridGain
Systems released Ignite as open source in 2014, and it was later approved into the Apache
Incubator program. RAM serves as the database's primary processing and storage layer in Apache
Ignite.
o Apache Cassandra has its own query language, Cassandra Query Language (CQL), and it supports
clusters that span several locations. Replication strategies in Cassandra may also be
customized.
o Apache HBase offers a fault-tolerant mechanism to store huge amounts of sparse data on top of
the Hadoop Distributed File System. Moreover, it offers per-column Bloom filters, in-memory
execution, and compression. Although Apache Phoenix offers a SQL layer for HBase, HBase is
not meant to replace SQL databases.
o An interactive application that serves several concurrent users by producing, storing, retrieving,
aggregating, altering, and displaying data is best served by Couchbase Server, a NoSQL software
package. Scalable key value and JSON document access is provided by Couchbase Server to
satisfy these various application demands.
o Along with Amazon S3 and Amazon Elastic Compute Cloud, Amazon SimpleDB is utilised as a
web service. Developers may request and store data with Amazon SimpleDB with a minimum of
database maintenance and administrative work.
o Relational database designs' complexity, scalability problems, and performance restrictions are all
eliminated with Clusterpoint. Open APIs are used to handle data in the XML or JSON formats.
Clusterpoint does not have the scalability or performance difficulties that other relational database
systems experience since it is a schema-free document database.