UNIT V Transaction and Indexing
Introduction
• A transaction is a unit of program execution that accesses and possibly updates various data items.
• A transaction consists of all operations executed between the begin and end statements of the
transaction.
• Transaction operations: access to the database is accomplished in a transaction by the following two
operations:
• read(X): performs the reading operation of data item X from the database.
• write(X): performs the writing operation of data item X to the database.
• A transaction must see a consistent database. During transaction execution the database may be
temporarily inconsistent; when the transaction commits, the database must be consistent again.
• Two main issues to deal with:
1. Failures, e.g. hardware failures and system crashes
2. Concurrency, i.e. simultaneous execution of multiple transactions
To ensure the consistency and completeness of the database under concurrent access and system failures,
the following ACID properties are enforced on the database:
1. Atomicity,
2. Consistency,
3. Isolation and
4. Durability
Atomicity:
• This property states that a transaction must execute atomically: either all of its instructions are
carried out or none of them is executed.
• It involves the following two operations:
o Abort: if a transaction aborts, the changes it made to the database are not visible.
o Commit: if a transaction commits, the changes it made are visible.
• Atomicity is also known as the 'all or nothing' rule.
Example:
• Consider the following transaction T consisting of two steps T1 and T2: transfer of 100 from account X
to account Y, where T1 debits X (read(X), X := X − 100, write(X)) and T2 credits Y (read(Y),
Y := Y + 100, write(Y)).
If the transaction fails after completion of T1 but before completion of T2 (say, after write(X) but
before write(Y)), then the amount has been deducted from X but not added to Y. This results in an
inconsistent database state. Therefore, the transaction must be executed in its entirety to ensure the
correctness of the database state. A sketch of this all-or-nothing behaviour is given below.
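A minimal sketch of atomic execution, using a hypothetical in-memory accounts dictionary as a stand-in
for the database (not any particular DBMS API):

# Minimal sketch of an atomic transfer: either both writes happen or neither.
def transfer(accounts, x, y, amount):
    snapshot = dict(accounts)          # remember state for rollback
    try:
        accounts[x] -= amount          # T1: read(X), X := X - amount, write(X)
        if accounts[x] < 0:
            raise ValueError("insufficient funds")
        accounts[y] += amount          # T2: read(Y), Y := Y + amount, write(Y)
    except Exception:
        accounts.clear()
        accounts.update(snapshot)      # abort: undo every change
        raise
    # commit: both updates are now visible

accounts = {"X": 500, "Y": 200}
transfer(accounts, "X", "Y", 100)      # accounts == {"X": 400, "Y": 300}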
Consistency
• The database must remain in a consistent state after any transaction, ensuring the
correctness of the database. This means that integrity constraints must be maintained, so that the
database is consistent before and after the transaction. Consistency refers to the correctness of a database.
• Each transaction, run by itself with no concurrent execution of other transactions, must preserve the
consistency of the database. The DBMS assumes that this property holds for each transaction;
ensuring it is the responsibility of the user.
Example:
• Consider again the transaction T transferring 100 from account X to account Y. The total amount is
preserved: before T the total is X + Y, and after T it is (X − 100) + (Y + 100) = X + Y. Since the sum of
the balances is unchanged, the integrity constraint holds and the database remains consistent.
Isolation:
• This property ensures that when multiple transactions execute concurrently, the intermediate,
uncommitted results of one transaction are not visible to any other transaction.
Example:
• Let X = 50,000 and Y = 500, and let T transfer 50 from X to Y.
Consider two transactions T and T''. Suppose T has executed up to Read(Y); that is, it has already
written X = 50,000 − 50 = 49,950 but has not yet updated Y. T'' now interleaves, reads X = 49,950 and
the old value Y = 500, and computes the sum
T'': X + Y = 49,950 + 500 = 50,450,
which is not consistent with the true total at the end of the transaction:
T: X + Y = 49,950 + 550 = 50,500.
• The sum observed by T'' is short by the 50 units that T has deducted from X but not yet added to Y.
Hence, transactions must take place in isolation, and their changes should become visible to other
transactions only after they commit. A small simulation of this interleaving is given below.
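The lost-update arithmetic above can be reproduced with a few lines of straight-line code (illustrative
values only; a real DBMS would interleave two concurrent transactions):

# Reproducing the inconsistent sum observed by T'' during the interleaving.
X, Y = 50_000, 500                 # initial state; the correct total is 50,500

# T starts: transfer 50 from X to Y
X = X - 50                         # T: read(X), X := X - 50, write(X)

# T'' interleaves here, before T has written the new value of Y
observed_sum = X + Y               # T'' reads X = 49,950 and the old Y = 500
print(observed_sum)                # 50450 -- 50 units appear to be lost

# T finishes
Y = Y + 50                         # T: read(Y), Y := Y + 50, write(Y)
print(X + Y)                       # 50500 -- consistent again after T commits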
Durability:
• This property ensures that once the transaction has completed execution, the updates and
modifications to the database are stored in and written to disk and they persist even if a system failure
occurs.
• These updates now become permanent and are stored in non-volatile memory. The effects of the
transaction, thus, are never lost.
-------------------------------------------------------x----------------------------------------------------------------
5.1 Transaction State
A transaction can be referred to be an atomic operation by the user, but in fact it goes through a number of
states during its lifetime as given as follows:
1. Active State –
When the instructions of the transaction are running, the transaction is in the active state. If all the read
and write operations are performed without any error, it goes to the partially committed state; if any
instruction fails, it goes to the failed state.
2. Partially Committed –
After completion of all the read and write operations, the changes exist only in main memory or the
local buffer. If the changes are then made permanent on the database, the state changes to the committed
state; in case of failure it goes to the failed state.
3. Failed State –
A transaction enters the failed state when any of its instructions fails, or when a failure occurs while
making its changes permanent on the database.
4. Aborted State –
After any type of failure the transaction moves from the failed state to the aborted state. Since in the
previous states its changes were made only to the local buffer or main memory, those changes are
deleted or rolled back.
5. Committed State –
This is the state reached when the changes have been made permanent on the database and the
transaction is complete. A sketch of these transitions is given below.
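The five states and their legal transitions can be summarised in a small sketch (the state names follow
the list above; the function names are illustrative):

# Sketch of the transaction life cycle as a state machine.
TRANSITIONS = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "committed":           set(),     # terminal state
    "aborted":             set(),     # terminal state
}

def move(state, new_state):
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

s = "active"
s = move(s, "partially committed")   # all reads/writes done in the local buffer
s = move(s, "committed")             # changes made permanent on the database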
----------------------------------------------------------x-----------------------------------------------------------
Durability and atomicity can be ensured by using the recovery manager, which is available by default in
every DBMS. Atomicity can be implemented using
1. the shadow-copy technique, or
2. the recovery manager available by default in the DBMS.
Shadow copying technique:
• A shadow copy of the original database is maintained, and the changes made by a transaction are
reflected in the database only after the transaction commits.
• The scheme also assumes that the database is simply a file on disk.
• A pointer called db-pointer is maintained on disk; it points to the current copy of the database. In the
shadow-copy scheme, a transaction that wants to update the database first creates a complete copy of the
database. All updates are done on the new database copy, leaving the original copy, the shadow copy,
untouched. If at any point the transaction has to be aborted, the system merely deletes the new copy. The
old copy of the database is not affected.
• If the transaction completes, it is committed as follows.
• First, the operating system is asked to make sure that all pages of the new copy of the database have
been written out to disk. (Unix systems use the fsync system call for this purpose.)
• After the operating system has written all the pages to disk, the database system updates the pointer db-
pointer to point to the new copy of the database; the new copy then becomes the current copy of the
database. The old copy of the database is then deleted
• We now consider how the technique handles transaction and system failures.
• First, consider transaction failure. If the transaction fails at any time before db-pointer is updated, the
old contents of the database are not affected. We can abort the transaction by just deleting the new
copy of the database. Once the transaction has been committed, all the updates that it performed are in
the database pointed to by db-pointer. Thus, either all updates of the transaction are reflected, or none
of the effects are reflected, regardless of transaction failure.
• Now consider the issue of system failure. Suppose that the system fails at any time before the updated
db-pointer is written to disk. Then, when the system restarts, it will read db-pointer and will thus see the
original contents of the database, and none of the effects of the transaction will be visible on the
database. Next, suppose that the system fails after db-pointer has been updated on disk. Before the
pointer is updated, all updated pages of the new copy of the database were written to disk. Again, we
assume that, once a file is written to disk, its contents will not be damaged even if there is a system
failure. Therefore, when the system restarts, it will read db-pointer and will thus see the contents of the
database after all the updates performed by the transaction.
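A minimal sketch of the commit sequence just described, assuming the database is a single file on disk
(as the scheme itself assumes); os.replace supplies the atomic pointer update:

import os, shutil

# Sketch of a shadow-copy update. 'db-pointer' is a small file on disk whose
# content names the current copy of the database.
def update_with_shadow_copy(pointer_file, transaction):
    with open(pointer_file) as f:
        current = f.read().strip()        # current copy of the database
    new_copy = current + ".new"
    shutil.copyfile(current, new_copy)    # complete copy; shadow stays untouched

    transaction(new_copy)                 # all updates go to the new copy only;
                                          # aborting = simply deleting new_copy

    with open(new_copy, "rb") as f:
        os.fsync(f.fileno())              # force all pages of the new copy to disk

    tmp = pointer_file + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_copy)                 # new value of db-pointer
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, pointer_file)         # atomic update of db-pointer
    # a crash before os.replace leaves db-pointer at the old, consistent copy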
-----------------------------------------------------------x--------------------------------------------------------
5.3 Concurrent Executions
• Concurrent execution means executing a set of transactions simultaneously in a preemptive,
time-shared manner. In a DBMS, concurrent execution of transactions is implemented by interleaving
their operations.
Transaction Schedules:
• Schedule: a list of the actions executed by a set of transactions, in the order in which they occur.
• A schedule groups the operations of several transactions together and executes them in a predefined
order.
• Schedules can be classified into 2 types:
1. Serial schedule.
2. Concurrent (interleaved) schedule.
1. Serial schedule:
In a serial schedule the transactions execute one after the other, ensuring correctness of data. A
schedule is called a serial schedule if the transactions in it execute one after the other, with no
interleaving.
T1              T2
R(A)
W(A)
R(B)
W(B)
                R(A)
                R(B)
Here R(A) denotes that a read operation is performed on data item 'A'. This is a serial schedule, since
the transactions execute serially in the order T1 → T2.
5.4 Serializability
1. Conflict Serializable:
A schedule is called conflict serializable if it can be transformed into a serial schedule by swapping
non-conflicting operations.
Conflicting operations: two operations are said to be conflicting if all of the following conditions hold:
1. They belong to different transactions
2. They operate on the same data item
3. At least one of them is a write operation
Example:
• The pair (R1(A), W2(A)) is conflicting because the two operations belong to different transactions,
operate on the same data item A, and one of them is a write operation.
• Similarly, (W1(A), W2(A)) and (W1(A), R2(A)) pairs are also conflicting.
• On the other hand, the pair (R1(A), W2(B)) is non-conflicting because the operations act on different
data items.
• Similarly, the pair (W1(A), W2(B)) is non-conflicting.
Example1:
Consider the following schedule:
S1: R1(A), W1(A), R2(A), W2(A), R1(B), W1(B), R2(B), W2(B)
If Oi and Oj are two operations of the same transaction and Oi < Oj (Oi is executed before Oj), the same
order must be preserved in any rearranged schedule. Using this property, the two transactions of schedule
S1 are T1: R1(A), W1(A), R1(B), W1(B) and T2: R2(A), W2(A), R2(B), W2(B). Repeatedly swapping
adjacent non-conflicting operations of S1 yields
S12: R1(A), W1(A), R1(B), W1(B), R2(A), W2(A), R2(B), W2(B).
S12 is a serial schedule in which all operations of T1 are performed before any operation of T2. Since S1
has been transformed into the serial schedule S12 by swapping non-conflicting operations, S1 is conflict
serializable.
Example2:
Let us take another Schedule:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)
Swapping non-conflicting operations R1(A) and R2(B) in S2, the schedule becomes,
S21: R2(A), W2(A), R2(B), W1(A), R1(B), W1(B), R1(A), W2(B)
Similarly, swapping non-conflicting operations W1(A) and W2(B) in S21, the schedule becomes,
S22: R2(A), W2(A), R2(B), W2(B), R1(B), W1(B), R1(A), W1(A)
In schedule S22, all operations of T2 are performed first, but operations of T1 are not in order (order
should be R1(A), W1(A), R1(B), W1(B)). So S2 is not conflict serializable.
Conflict Equivalent: Two schedules are said to be conflict equivalent when one can be transformed into
the other by swapping non-conflicting operations. In the example discussed above, S12 is conflict
equivalent to S1 (S1 can be converted into S12 by swapping non-conflicting operations), as are the
intermediate schedules produced along the way.
Note 1: Although S2 is not conflict serializable, it is still conflict equivalent to S21 and S22, because S2
can be converted into S21 and S22 by swapping non-conflicting operations.
Note 2: A schedule that is conflict serializable is always conflict equivalent to at least one serial
schedule. Schedule S1 discussed above (which is conflict serializable) is equivalent to the serial schedule
(T1 → T2). A sketch of a conflict-pair detector is given below.
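The three conditions above translate directly into code. A small sketch that lists the conflicting pairs of
a schedule, with each operation written as a (transaction, action, item) tuple (a representation chosen
here for illustration):

# Sketch: find all conflicting operation pairs in a schedule.
# An operation is (transaction, action, item), e.g. (1, 'R', 'A') for R1(A).
def conflicts(schedule):
    pairs = []
    for i in range(len(schedule)):
        for j in range(i + 1, len(schedule)):
            (t1, a1, x1), (t2, a2, x2) = schedule[i], schedule[j]
            if t1 != t2 and x1 == x2 and 'W' in (a1, a2):
                pairs.append((schedule[i], schedule[j]))
    return pairs

S1 = [(1,'R','A'), (1,'W','A'), (2,'R','A'), (2,'W','A'),
      (1,'R','B'), (1,'W','B'), (2,'R','B'), (2,'W','B')]
print(conflicts(S1))   # e.g. ((1,'W','A'), (2,'R','A')) is a conflicting pair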
2. View serializability.
o View serializability is used to determine whether a schedule is view serializable. A schedule is said to
be view serializable if it is view equivalent to a serial schedule (one in which no interleaving of
transactions occurs).
o If a schedule is conflict serializable, then it is also view serializable.
o A schedule that is view serializable but not conflict serializable contains blind writes.
View Equivalent: two schedules S1 and S2 are said to be view equivalent if they satisfy the following
three conditions.
1) Initial Read
If a transaction Ti reads the initial value of a data item A in S1, then in S2 also Ti should read the initial
value of A.
T1        T2        T3
------------------------
          R(A)
W(A)
                    R(A)
          R(B)
Here transaction T2 performs the initial read of A from the database; in any view-equivalent schedule,
T2 must also be the first to read A.
2) Updated Read
If Ti reads a value of A that was updated by Tj in S1, then in S2 also Ti should read the value of A
written by Tj.
S1:                              S2:
T1      T2      T3               T1      T2      T3
-----------------------          -----------------------
W(A)                             W(A)
        W(A)                                     R(A)
                R(A)                     W(A)
The two schedules above are not view equivalent: in S1, T3 reads A updated by T2, while in S2, T3
reads A updated by T1.
3) Final Write
If a transaction Ti performs the final write on data item A in S1, then in S2 also Ti should perform the
final write on A.
S1:                      S2:
T1      T2               T1      T2
----------------         ----------------
R(A)                     R(A)
        W(A)             W(A)
W(A)                             W(A)
The two schedules above are not view equivalent: the final write on A in S1 is done by T1, while in S2 it
is done by T2.
View Serializability: a schedule is called view serializable if it is view equivalent to some serial schedule
(one with no overlapping transactions).
Example:
Consider a schedule S over three transactions T1, T2 and T3 (schedule figure omitted).
With 3 transactions, the total number of possible serial schedules is 3! = 6:
1. S1 = <T1 T2 T3>
2. S2 = <T1 T3 T2>
3. S3 = <T2 T3 T1>
4. S4 = <T2 T1 T3>
5. S5 = <T3 T1 T2>
6. S6 = <T3 T2 T1>
Taking the first serial schedule S1 = <T1 T2 T3>, we check whether S is view equivalent to it using the
three conditions above, and similarly for the remaining five serial orders; S is view serializable if any of
the six checks succeeds. A brute-force sketch of this test is given below.
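A brute-force sketch of the test, under the same (transaction, action, item) representation as before:
enumerate all n! serial orders and compare reads-from relations and final writes (fine for the 3! = 6
orders here):

from collections import Counter
from itertools import permutations

# Sketch: view-serializability test.
def view_info(schedule):
    """Multiset of (reader, item, writer read from) plus the final writers."""
    last_write, reads = {}, []
    for t, a, x in schedule:
        if a == 'R':
            reads.append((t, x, last_write.get(x)))   # None = initial read
        else:
            last_write[x] = t
    return Counter(reads), last_write

def is_view_serializable(schedule):
    txns = {t for t, _, _ in schedule}
    for order in permutations(txns):    # every possible serial order
        serial = [op for t in order for op in schedule if op[0] == t]
        if view_info(schedule) == view_info(serial):
            return True
    return False

S = [(1,'R','A'), (2,'W','A'), (1,'W','A')]
print(is_view_serializable(S))          # False: no serial order matches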
Recoverable Schedules:
• Schedules in which transactions commit only after all transactions whose changes they read have
committed are called recoverable schedules. In other words, if some transaction Tj is reading a value
updated or written by some other transaction Ti, then the commit of Tj must occur after the commit
of Ti.
Example 1:
Consider the following schedule, in which T9 reads the value of A written by T8:
T8              T9
read(A)
write(A)
                read(A)
read(B)
This schedule is not recoverable if T9 commits immediately after its read.
If T8 should then abort, T9 would have read (and possibly shown to the user) an inconsistent database
state. Hence, the database must ensure that schedules are recoverable.
Example 2: Consider the following schedule involving two transactions T1 and T2.
T1              T2
R(A)
W(A)
                W(A)
                R(A)
commit
                commit
This is a recoverable schedule, since T1 commits before T2, which makes the value read by T2 correct.
A sketch of a mechanical recoverability check is given below.
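The recoverability condition can be checked mechanically. A sketch, again with (transaction, action,
item) tuples, where a commit is written as (transaction, 'C', None):

# Sketch: every transaction may commit only after the transactions it read from.
def is_recoverable(schedule):
    last_writer, read_from, committed = {}, {}, []
    for t, a, x in schedule:
        if a == 'W':
            last_writer[x] = t
        elif a == 'R' and x in last_writer and last_writer[x] != t:
            read_from.setdefault(t, set()).add(last_writer[x])
        elif a == 'C':
            if not read_from.get(t, set()) <= set(committed):
                return False    # t commits before a transaction it read from
            committed.append(t)
    return True

S = [(1,'R','A'), (1,'W','A'), (2,'W','A'), (2,'R','A'),
     (1,'C',None), (2,'C',None)]
print(is_recoverable(S))        # True: T1 commits before T2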
Cascading rollback: a single transaction failure can lead to a series of transaction rollbacks. Consider the
following schedule, where none of the transactions has yet committed (so the schedule is recoverable):
T10                 T11                 T12
read(A)
read(B)
write(A)
                    read(A)
                    write(A)
                                        read(A)
If T10 fails, T11 and T12 must also be rolled back, since each has read a value written by a transaction
that has not committed.
Cascadeless schedules:
o Cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data
item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
o Every cascadeless schedule is also recoverable
o It is desirable to restrict the schedules to those that are cascadeless
Example: Consider the following schedule:
T1              T2              T3
R(A)
W(A)
commit
                R(A)
                W(A)
                commit
                                R(A)
o T2 reads data item A, which was previously written by T1, and the commit operation of T1 appears
before the read operation of T2.
o In the same way, T3 reads data item A, which was previously written by T2, and the commit
operation of T2 appears before the read operation of T3.
o This type of schedule is called a cascadeless schedule.
NOTE-
o Cascadeless schedule allows only committed read operations.
o However, it allows uncommitted write operations.
--------------------------------------------------------------X------------------------------------------------------------
o A database must provide a mechanism that will ensure that all possible schedules are either conflict
or view serializable, and are recoverable and preferably cascadeless.
o A policy (e.g.: using a lock) in which only one transaction can execute at a time generates serial
schedules, but provides a poor degree of concurrency
o Goal – to develop concurrency control protocols that will assure serializability.
o These protocols impose a discipline that avoids nonserializable schedules.
o A common concurrency control protocol uses locks.
o While one transaction is accessing a data item, no other transaction can modify it.
o Require a transaction to lock the item before accessing it.
Lock-Based Protocols
o A lock is a mechanism to control concurrent access to a data item. Lock requests are made to the
concurrency-control manager, and a transaction can proceed only after its request is granted.
o Data items can be locked in two modes:
1. exclusive (X) mode. Data item can be both read and written. X-lock is requested using the
lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S.
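A toy sketch of these two lock modes (shared locks are compatible with each other; an exclusive lock is
compatible with nothing; the class and its methods are illustrative, not a real DBMS API):

# Toy lock table: data item -> (mode, set of holder transactions).
class LockManager:
    def __init__(self):
        self.table = {}

    def lock_S(self, txn, item):
        mode, holders = self.table.get(item, (None, set()))
        if mode == 'X' and holders != {txn}:
            return False                  # X-locked by another txn: must wait
        self.table[item] = (mode or 'S', holders | {txn})
        return True

    def lock_X(self, txn, item):
        mode, holders = self.table.get(item, (None, set()))
        if holders - {txn}:
            return False                  # any other holder blocks an X-lock
        self.table[item] = ('X', {txn})
        return True

    def unlock(self, txn, item):
        mode, holders = self.table[item]
        holders.discard(txn)
        if not holders:
            del self.table[item]

lm = LockManager()
lm.lock_S('T1', 'A'); lm.lock_S('T2', 'A')   # granted: S-locks are compatible
print(lm.lock_X('T3', 'A'))                  # False: T3 must wait for T1 and T2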
--------------------------------------------------------------X------------------------------------------------------------
5.7 Testing for Serializability
A schedule S is tested for conflict serializability by constructing its precedence graph: a directed graph
with one node per transaction and an edge Ti → Tj whenever an operation of Ti conflicts with, and is
executed before, an operation of Tj.
• If a precedence graph contains a single edge Ti → Tj, then all the instructions of Ti are executed
before the first instruction of Tj is executed.
• If the precedence graph for schedule S contains a cycle, then S is non-serializable. If the precedence
graph has no cycle, then S is serializable.
Example: (precedence graph figures omitted)
The precedence graph for schedule S1 contains a cycle, which is why schedule S1 is non-serializable.
The precedence graph for schedule S2 contains no cycle, which is why schedule S2 is serializable.
A sketch of the cycle test is given below.
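A sketch of the full test, reusing the (transaction, action, item) representation: build the precedence
graph from the conflicting pairs, then detect a cycle with a depth-first search:

# Sketch: conflict-serializability test = precedence graph + cycle detection.
def precedence_graph(schedule):
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and 'W' in (a1, a2):
                edges.add((t1, t2))       # edge Ti -> Tj for each conflict
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, on_stack = set(), set()
    def dfs(u):
        visited.add(u); on_stack.add(u)
        for v in graph.get(u, ()):
            if v in on_stack or (v not in visited and dfs(v)):
                return True
        on_stack.discard(u)
        return False
    return any(dfs(u) for u in graph if u not in visited)

S = [(1,'R','A'), (2,'W','A'), (1,'W','A')]   # edges T1 -> T2 and T2 -> T1
print(has_cycle(precedence_graph(S)))         # True: not conflict serializable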
--------------------------------------------------------------X------------------------------------------------------------
5.8 Failure Classification
To find where a problem has occurred, we generalize failures into the following categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
A transaction failure occurs when a transaction fails to execute or reaches a point from which it
cannot proceed any further. If a single transaction or process fails, this is called a transaction failure.
Reasons for a transaction failure include:
1. Logical errors: a logical error occurs when a transaction cannot complete due to a code error or
an internal error condition.
2. System errors: a system error occurs when the DBMS itself terminates an active transaction
because the database system is not able to execute it. For example, the system aborts an active
transaction in case of deadlock or resource unavailability.
2. System Crash
o A system crash can occur due to power failure or other hardware or software failure.
Example: an operating system error.
o Fail-stop assumption: in a system crash, non-volatile storage is assumed not to be
corrupted.
3. Disk Failure
o A disk failure occurs when a hard-disk drive or other storage drive fails; this was a common
problem in the early days of technology evolution.
o Disk failures arise from the formation of bad sectors, a disk head crash, unreachability of the
disk, or any other fault that destroys all or part of disk storage.
--------------------------------------------------------------X------------------------------------------------------------
5.9 Storage, Recovery and Atomicity
Storage Structure
• Modifying the database without ensuring that the transaction will commit may leave the database in
an inconsistent state.
• Consider a transaction Ti that transfers $50 from account A to account B; the goal is either to perform
all database modifications made by Ti or none at all.
• Several output operations may be required for Ti (to output A and B). A failure may occur after one
of these modifications has been made but before all of them are made.
• To ensure atomicity despite failures, we first output information describing the modifications to
stable storage, without modifying the database itself.
• We study two approaches: log-based recovery and shadow paging.
• We assume (initially) that transactions run serially, that is, one after the other. A sketch of the
log-first rule is given below.
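As a preview of log-based recovery, a minimal sketch of the rule just stated: a record describing the
modification is forced to stable storage before the database item itself is updated (the log format here is
hypothetical):

import os

# Sketch of write-ahead logging: append <Ti, X, old, new> to the log and force
# it to disk before the database item itself is modified.
def log_and_write(log_path, db, txn, item, new_value):
    with open(log_path, "a") as log:
        log.write(f"<{txn}, {item}, {db.get(item)}, {new_value}>\n")
        log.flush()
        os.fsync(log.fileno())      # log record reaches stable storage first
    db[item] = new_value            # only now modify the database itself

db = {"A": 1000, "B": 2000}
log_and_write("tx.log", db, "T1", "A", 950)    # transfer step: A := A - 50
log_and_write("tx.log", db, "T1", "B", 2050)   # transfer step: B := B + 50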
--------------------------------------------------------------X------------------------------------------------------------
Index Classification
• Primary vs. secondary index:
o If the search key contains the primary key, then the index is called a primary index.
• Clustered vs. unclustered index:
o If the order of data records is the same as, or close to, the order of data entries, then the index is
called a clustered index.
Primary indexing:
• Primary indexing is defined mainly on the primary key of the data-file, in which the data-file
is already ordered based on the primary key.
• Primary Index is an ordered file whose records are of fixed length with two fields. The first
field of the index replicates the primary key of the data file in an ordered manner, and the
second field of the ordered file contains a pointer that points to the data-block where a record
containing the key is available.
• The first record of each block is called the Anchor record or Block anchor. There exists a
record in the primary index file for every block of the data-file.
• The average number of block accesses using the primary index is log2(B) + 1, where B is the number
of index blocks: a binary search over the index blocks costs log2(B) accesses, plus one access to the
data block itself.
• Example: (figure omitted) a sketch of the block-anchor lookup is given below.
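The log2(B) + 1 figure corresponds to a binary search over the block anchors followed by one
data-block access. A sketch, assuming a hypothetical index held as a sorted list of (anchor_key,
block_no) pairs:

from bisect import bisect_right

# Sketch: primary-index lookup. 'index' holds one (anchor_key, block_no) entry
# per data block, where the anchor is the first (smallest) key in that block.
def find_block(index, key):
    pos = bisect_right(index, (key, float("inf"))) - 1   # binary search
    return index[pos][1] if pos >= 0 else None           # candidate data block

index = [(100, 0), (200, 1), (300, 2), (400, 3)]
print(find_block(index, 250))   # block 1: keys 200..299 live there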
Secondary Indexing:
• A secondary index is an index that is not a primary index and may have duplicates.
• Example:
• Consider a database containing a list of students at a college, each of whom has a unique student ID
number. A typical database would use the student ID number as the key; however, one might also
reasonably want to look up students by last name. We can therefore construct a secondary index in
which the secondary key is the student's last name.
• Any number of secondary indexes may be associated with a given primary database, up to
limitations on available memory and the number of open file descriptors. A sketch of such an index
is given below.
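A secondary index with duplicates is essentially a map from secondary-key value to a list of record
pointers. A tiny sketch using the student example (the data is illustrative):

from collections import defaultdict

# Sketch: secondary index on last name; the rids point into the primary file.
students = {101: ("Smith", "CSE"), 102: ("Rao", "ECE"), 103: ("Smith", "EEE")}

by_last_name = defaultdict(list)      # secondary key -> list of rids
for rid, (last, _) in students.items():
    by_last_name[last].append(rid)

print(by_last_name["Smith"])          # [101, 103]: duplicates are allowed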
Clustered indexing:
A clustered index is the type of index that establishes a physical sort order on the rows. Suppose you
have a table Student_info that contains ROLL_NO as a primary key; the clustered index, which is
created automatically on that primary key, will sort the Student_info table by ROLL_NO. A clustered
index is like a dictionary: the sorting order is alphabetical and there is no separate index page.
Examples:
Input:
CREATE TABLE Student_info
(
ROLL_NO int(10) primary key,
NAME varchar(20),
DEPARTMENT varchar(20)
);
insert into Student_info values(1410110405, 'H Agarwal', 'CSE');
insert into Student_info values(1410110404, 'S Samadder', 'CSE');
insert into Student_info values(1410110403, 'MD Irfan', 'CSE');
Unclustered index:
A non-clustered index is an index structure separate from the data stored in a table, ordering one or
more selected columns. Non-clustered indexes are created to improve the performance of frequently
used queries that are not covered by the clustered index. A non-clustered index is like a textbook,
where the index is kept on separate pages rather than interleaved with the content.
Examples:
Input:
CREATE TABLE Student_info
(
ROLL_NO int(10),
NAME varchar(20),
DEPARTMENT varchar(20)
);
insert into Student_info values(1410110405, 'H Agarwal', 'CSE');
insert into Student_info values(1410110404, 'S Samadder', 'CSE');
insert into Student_info values(1410110403, 'MD Irfan', 'CSE');
-- a non-clustered index on NAME (the index name below is illustrative)
CREATE INDEX IX_Student_Name ON Student_info (NAME);
--------------------------------------------------------------X------------------------------------------------------------
5.14 Index Data Structures: Hash-Based Indexing, Tree-Based Indexing
• Organizes the records using a technique called hashing to quickly find records that have a given
search key value.
• In this approach, the records in a file are grouped in buckets, where a bucket consists of a primary
page and, possibly, additional pages linked in a chain. The bucket to which a record belongs can be
determined by applying a special function, called a hash function, to the search key. Given a bucket
number, a hash-based index structure allows us to retrieve the primary page for the bucket in one or
two disk I/Os.
• On inserts, the record is inserted into the appropriate bucket, with 'overflow' pages allocated as
necessary.
• To search for a record with a given search key value, we apply the hash function to identify the
bucket to which such records belong and look at all pages in that bucket.
Example: (figure omitted)
• The data is stored in a file that is hashed on age; the data entries in this first index file are the actual
data records.
• Applying the hash function to the age field identifies the page that the record belongs to.
• The hash function h converts the search key value to its binary representation and uses the two least
significant bits as the bucket identifier; a sketch of this is given after this list.
• A second index, with search key sal, contains (sal, rid) pairs as data entries; the rid component of a
data entry in this second index is a pointer to a record with search key value sal.
• In a tree-based index, by contrast, the data entries are arranged in sorted order by search key value,
and a hierarchical search data structure is maintained that directs searches to the correct page of data
entries.
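The hash function described above (keep the two least significant bits of the key's binary
representation) is a one-liner. A sketch routing some illustrative age values to 4 buckets:

# Sketch: hash-based index with h(key) = two least significant bits of the key.
def bucket_of(key, bits=2):
    return key & ((1 << bits) - 1)       # same as key % 4 when bits == 2

buckets = {b: [] for b in range(4)}
for age in [25, 30, 41, 52, 33]:
    buckets[bucket_of(age)].append(age)  # e.g. 25 = 0b11001 -> bucket 1

print(buckets)   # {0: [52], 1: [25, 41, 33], 2: [30], 3: []}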
--------------------------------------------------------------X------------------------------------------------------------
• Files and indexes can also be organized according to a composite search key such as (age, sal), in
which the data entries are sorted first on age and, among equal ages, on sal.
--------------------------------------------------------------X------------------------------------------------------------
Effective indexes are one of the best ways to improve performance in a database application. Without an
index, the SQL Server engine is like a reader trying to find a word in a book by examining each page. By
using the index in the back of a book, a reader can complete the task in a much shorter time. In database
terms, a table scan happens when there is no index available to help a query. In a table scan, SQL
Server examines every row in the table to satisfy the query results. Table scans are sometimes
unavoidable, but on large tables, scans have a severe impact on performance.
One of the most important jobs for the database is finding the best index to use when generating an
execution plan. Most major databases ship with tools to show you execution plans for a query and help in
optimizing and tuning indexes. This article outlines several good rules of thumb to apply when creating and
modifying indexes for your database. First, let’s cover the scenarios where indexes help performance, and
when indexes can hurt performance.
Useful Index Queries
Just like the reader searching for a word in a book, an index helps when you are looking for a specific
record or set of records with a WHERE clause. This includes queries looking for a range of values, queries
designed to match a specific value, and queries performing a join on two tables. For example, both of the
queries against the Northwind database below will benefit from an index on the UnitPrice column.
SELECT * FROM Products WHERE UnitPrice BETWEEN 14.00 AND 16.00
DELETE FROM Products WHERE UnitPrice = 1
Index Drawbacks
Indexes are a performance drag when the time comes to modify records. Any time a query modifies the
data in a table the indexes on the data must change also. Achieving the right number of indexes will require
testing and monitoring of your database to see where the best balance lies. Static systems, where databases
are used heavily for reporting, can afford more indexes to support the read only queries. A database with a
heavy number of transactions to modify data will need fewer indexes to allow for higher throughput.
Indexes also use disk space. The exact size depends on the number of records in the table as well as the
number and size of the columns in the index. Generally this is not a major concern, as disk space is easy to
trade for better performance.
Distinct Keys
The most effective indexes are the indexes with a small percentage of duplicated values. As an analogy,
think of a phone book for a town where almost everyone has the last name of Smith. A phone book in this
town is not very useful if sorted in order of last name, because you can only discount a small number of
records when you are looking for a Smith.
An index with a high percentage of unique values is a selective index. Obviously, a unique index is highly
selective since there are no duplicate entries. Many databases will track statistics about each index so they
know how selective each index is. The database uses these statistics when generating an execution plan for
a query.
Covering Queries
Indexes generally contain only the data values for the columns they index and a pointer back to the row
with the rest of the data. This is similar to the index in a book: the index contains only the key word and
then a page reference you can turn to for the rest of the information. Generally the database will have to
follow pointers from an index back to a row to gather all the information required for a query. However,
if the index contains all of the columns needed for a query, the database can save a disk read by not
returning to the table for more information.
Take the index on UnitPrice we discussed earlier. The database could use just the index entries to satisfy
the following query.
SELECT Count(*), UnitPrice FROM Products
GROUP BY UnitPrice
We call these types of queries covered queries, because all of the columns requested in the output are
covered by a single index. For your most crucial queries, you might consider creating a covering index to
give the query the best performance possible. Such an index would probably be a composite index (using
more than one column), which appears to go against our first guideline of keeping index entries as short as
possible. Obviously this is another tradeoff you can only evaluate with performance testing and monitoring.
Clustered Indexes
Many databases have one special index per table where all of the data from a row exists in the index. SQL
Server calls this index a clustered index. Instead of an index at the back of a book, a clustered index is
closer to a phone book, because each index entry contains all the information you need: there are no
references to follow to pick up additional data values.
As a general rule of thumb, every non-trivial table should have a clustered index. If you only create one
index for a table, make the index a clustered index. In SQL Server, creating a primary key will
automatically create a clustered index (if none exists) using the primary key column as the index key.
Clustered indexes are the most effective indexes (when used, they always cover a query), and in many
database systems they will help the database efficiently manage the space required to store the table.
When choosing the column or columns for a clustered index, be careful to choose a column with static
data. If you modify a record and change the value of a column in a clustered index, the database might need
to move the index entry (to keep the entries in sorted order). Remember, index entries for a clustered index
contain all of the column values, so moving an entry is comparable to executing a DELETE statement
followed by an INSERT, which can obviously cause performance problems if done often. For this reason,
clustered indexes are often found on primary or foreign key columns. Key values will rarely, if ever,
change.