DBMS UNIT-5
Transaction
A transaction is a set of operations performed as a single logical unit of work. For example, suppose an amount of 800 is transferred from X's account to Y's account. This transfer involves the following two sets of operations:
X's Account
1. Open_Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close_Account(X)
Y's Account
1. Open_Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close_Account(Y)
Operations of Transaction:
Following are the main operations of a transaction:
Read(X): The read operation reads the value of X from the database and stores it in a buffer in
main memory.
Write(X): The write operation writes the value back to the database from the buffer.
Let's take the example of a debit transaction on an account, which consists of the following
operations:
1. R(X);
2. X = X - 500;
3. W(X);
Let's assume the value of X before starting of the transaction is 4000.
o The first operation reads X's value from database and stores it in a buffer.
o The second operation will decrease the value of X by 500. So buffer will contain 3500.
o The third operation will write the buffer's value to the database. So X's final value will be
3500.
But it may be possible that, because of a hardware, software, or power failure, etc., the
transaction fails before finishing all the operations in the set.
For example: If the debit transaction above fails after executing operation 2, then X's value will
remain 4000 in the database, which is not acceptable by the bank.
To solve this problem, we have two important operations:
Commit: It is used to save the work done permanently.
Rollback: It is used to undo the work done.
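A minimal sketch of these operations (an in-memory dictionary stands in for the database and a local variable for the buffer; this is an illustration only, not real DBMS code):

    # Sketch of the debit transaction R(X); X = X - 500; W(X)
    database = {"X": 4000}

    def debit_transaction(db, amount):
        snapshot = dict(db)           # remembered so Rollback can undo the work
        try:
            buffer = db["X"]          # Read(X): copy the value into a buffer in memory
            buffer = buffer - amount  # modify the buffered value
            db["X"] = buffer          # Write(X): write the buffer back to the database
            # Commit: make the change permanent (here: simply keep it)
            print("Committed, X =", db["X"])
        except Exception:
            # Rollback: undo the work done so far by restoring the snapshot
            db.clear()
            db.update(snapshot)
            print("Rolled back, X =", db["X"])

    debit_transaction(database, 500)  # X becomes 3500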
States of Transaction
Active state
o The active state is the first state of every transaction. In this state, the transaction is being
executed.
o For example: Insertion or deletion or updating a record is done here. But all the records
are still not saved to the database.
Partially committed
o In the partially committed state, a transaction executes its final operation, but the data is
still not saved to the database.
o In the total mark calculation example, a final display of the total marks step is executed in
this state.
Committed
A transaction is said to be in a committed state if it executes all its operations successfully. In
this state, all the effects are now permanently saved on the database system.
Failed state
o If any of the checks made by the database recovery system fails, then the transaction is
said to be in the failed state.
o In the example of total mark calculation, if the database is not able to fire a query to fetch
the marks, then the transaction will fail to execute.
Aborted
o If any of the checks fail and the transaction has reached a failed state then the database
recovery system will make sure that the database is in its previous consistent state. If not
then it will abort or roll back the transaction to bring the database into a consistent state.
o If the transaction fails in the middle of its execution, then all the operations executed so
far are rolled back to bring the database back to its previous consistent state.
o After aborting the transaction, the database recovery module will select one of the two
operations:
o Re-start the transaction
o Kill the transaction
ACID Properties
The term ACID stands for Atomicity, Consistency, Isolation, and Durability:
1) Atomicity
The word "atomicity" refers to the data's ability to remain atomic. It means that any operation
performed on the data should be done either fully or not at all; there should be no partial or
incomplete execution of the process. When operations are carried out on a transaction, the whole
action must be performed, not only a part of it.
Example: Suppose Remo has account A containing $30, from which he wishes to send $10 to
Sheero's account B. Account B already holds $100, so when the $10 is transferred to account B,
the sum will become $110. Two operations take place: the $10 that Remo wants to transfer is
debited from his account A, and the same amount is credited to account B, i.e., into Sheero's
account. Now suppose the first operation, the debit, executes successfully, but the credit
operation fails. Then in Remo's account A the value becomes $20, while Sheero's account B
remains at $100 as it was previously.
In the first case, it can be seen that after the failed credit the amount is still $100 in account B,
so the transfer is not an atomic transaction.
In the second case, both the debit and credit operations are done successfully; thus the
transaction is atomic.
Thus, when atomicity is lost, it becomes a huge issue for banking systems, and so atomicity is a
main focus in such systems.
2) Consistency
Consistency means that the database must remain in a valid state before and after a transaction.
Any modification made to the database should preserve the integrity of the data, since this
ensures the correctness of the data in the DBMS; data integrity must hold both before and after
the transaction.
Example:
Consider three accounts, A, B, and C, where A makes a transaction T to both B and C one by
one. Two operations take place, i.e., debit and credit. Account A first debits $50 to account B,
and the amount in account A read by B before the transaction is $300. After the successful
transaction T, the available amount in B becomes $150. Now, A debits $20 to account C, and at
that time the value read by C is $250 (which is correct, as the debit of $50 to B has already been
done successfully). The debit and credit operations from account A to C are done successfully.
We can see that the transaction is completed and the value is read correctly; thus, the data is
consistent. If the value read by B and C had been $300, the data would be inconsistent, because
after the debit operation executes, that value would no longer be correct.
3) Isolation
'Isolation' refers to a state of separation. In a DBMS, isolation is the property that ensures that
concurrently executing transactions do not affect each other's data. To put it briefly, one
transaction's operations should appear to run only after the previous transaction's operations
have finished, so that two processes working on the database cannot affect each other's
intermediate values. When two or more transactions happen at the same time, consistency should
still be preserved. Until a modification is committed, it cannot be observed by other transactions.
Example: If two operations are concurrently running on two different accounts, then the value of
both accounts should not get affected. The value should remain persistent. For example, if
account A makes transactions T1 and T2 to accounts B and C, both transactions execute
independently without affecting each other. This is known as isolation.
4) Durability
Durability guarantees permanence. The word "durability" in database management systems
(DBMS) refers to the guarantee that data is permanently stored in the database following a
successful transaction. The data needs to be so completely persistent that the database remains
intact even in the case of a crash or system failure. The recovery manager is responsible for
ensuring that the database remains durable in such unfortunate circumstances. To commit the
data, we must use the COMMIT command each time we make a modification.
Let's get straight to the banking example, where funds are being moved between accounts. Let's
look at this example's ACID characteristics one by one:
Atomicity: Money must be moved from one account to another for the transaction to be
completed. The data would become inconsistent if funds were taken out of one account and not
added to the other.
Consistency: Consider a database constraint that an account balance cannot be lower than $0.
Any changes made to the account balance during a transaction must result in a legitimate,
non-negative balance when the transaction is completed; otherwise, the transaction should be
cancelled.
Isolation: Take into consideration two requests for money transfers from the same bank account
made at the same time. When the transfer requests are processed sequentially and
simultaneously, the outcome should be the same.
Durability: As soon as a database verifies that funds have been moved from one bank account to
another, take into consideration a power outage. Despite the unexpected failure, the database
ought to still have the updated data.
Concurrency Problems
When multiple transactions execute concurrently without any control, the following problems can arise:
1. Lost Update Problem (Write-Write conflict):
This occurs when two transactions read the same data item and then update it one after the other;
the update made by the first transaction is overwritten by the second and is therefore lost.
T1        | T2
----------|-----------
Read(A)   |
A = A+50  |
          | Read(A)
          | A = A+100
Write(A)  |
          | Write(A)
Result: T2's Write(A) overwrites T1's Write(A), so the update made by T1 is lost.
2. Dirty Read Problem (Write-Read conflict):
This occurs when a transaction reads a value written by another transaction that has not yet
committed; if the writing transaction later rolls back, the reader has used a value that never
really existed.
T1                | T2
------------------|-----------
Read(A)           |
A = A+50          |
Write(A)          |
                  | Read(A)
                  | A = A+100
                  | Write(A)
Read(A) (rollback)|
                  | commit
Result: T2 has a "dirty" value that was never committed in T1 and doesn't actually exist in the
database.
3. Unrepeatable Read Problem (Read-Write conflict):
This problem occurs when a single transaction reads the same row multiple times and observes
different values each time, because another concurrent transaction has modified the row between
the two reads.
Example:
T1 | T2
----------|----------
Read(A) |
| Read(A)
| A = A+100
| Write(A)
Read(A) |
Result: Within the same transaction, T1 has read two different values for the same data item.
This inconsistency is the unrepeatable read.
Serializability in DBMS
A schedule is an order of execution of the operations of multiple transactions in a concurrent
environment.
Serial Schedule: The schedule in which the transactions execute one after the other is called a
serial schedule. It is consistent in nature.
For example: Consider following two transactions T1 and T2.
T1 | T2
----------|----------
Read(A) |
Write(A) |
Read(B) |
Write(B) |
| Read(A)
| Write(A)
| Read(B)
| Write(B)
All the operations of transaction T1 on data items A and B execute first, and then all the
operations of transaction T2 on data items A and B execute.
Non-Serial Schedule: The schedule in which the operations of different transactions are
intermixed. This may lead to conflicts in the result or inconsistency in the resultant data.
For example- Consider following two transactions,
T1 | T2
----------|----------
Read(A) |
Write(A) |
| Read(A)
| Write(B)
Read(A) |
Write(B) |
| Read(B)
| Write(B)
The above schedule is non-serial, which may result in inconsistency or conflicts in the data.
Types of Serializability
1. Conflict Serializability
2. View Serializability
Conflict Serializability
Conflict serializability is a form of serializability where the order of non-conflicting operations is
not significant. It determines if the concurrent execution of several transactions is equivalent to
some serial execution of those transactions.
Two operations are said to be in conflict if they belong to different transactions, they operate on
the same data item, and at least one of them is a Write operation. Otherwise, they are
non-conflicting.
Examples of non-conflicting operations
T1        | T2
----------|----------
Read(A)   | Read(A)
Read(A)   | Read(B)
Write(B)  | Read(A)
Read(B)   | Write(A)
Write(A)  | Write(B)
Examples of conflicting operations
T1 | T2
----------|----------
Read(A) | Write(A)
Write(A) | Read(A)
Write(A) | Write(A)
A schedule is conflict serializable if it can be transformed into a serial schedule (i.e., a schedule
with no overlapping transactions) by swapping non-conflicting operations. If it is not possible to
transform a given schedule to any serial schedule using swaps of non-conflicting operations, then
the schedule is not conflict serializable.
To determine if S is conflict serializable:
Precedence Graph (Serialization Graph): Create a graph where:
Nodes represent transactions.
Draw an edge from \( T_i \) to \( T_j \) if an operation in \( T_i \) precedes and conflicts with an
operation in \( T_j \).
For the given example:
T1 | T2
----------|----------
Read(A) |
| Read(A)
Write(A) |
| Read(B)
| Write(B)
R1(A) conflicts with W1(A), but both operations belong to T1, so this pair is ignored because
operations of the same transaction do not create edges.
R2(A) conflicts with W1(A), so there's an edge from T2 to T1.
No other conflicting pairs.
The graph has nodes T1 and T2 with an edge from T2 to T1. There are no cycles in this graph.
Decision: A cycle is a path through which we can start from one node and reach the same node
again. Since the precedence graph does not have any cycles, the schedule S is conflict
serializable. The equivalent serial schedule, based on the graph, would be T2 followed by T1.
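The precedence-graph test can be sketched in a few lines of Python; the encoding of a schedule as (transaction, operation, item) tuples is an assumption made only for this illustration:

    # Build a precedence graph from a schedule and test conflict serializability.
    def precedence_graph(schedule):
        edges = set()
        for i, (ti, op_i, item_i) in enumerate(schedule):
            for tj, op_j, item_j in schedule[i + 1:]:
                # conflict: different transactions, same item, at least one write
                if item_i == item_j and ti != tj and "W" in (op_i, op_j):
                    edges.add((ti, tj))      # Ti precedes and conflicts with Tj
        return edges

    def has_cycle(edges):
        graph = {}
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set())
        visiting, done = set(), set()
        def dfs(node):
            visiting.add(node)
            for nxt in graph[node]:
                if nxt in visiting or (nxt not in done and dfs(nxt)):
                    return True
            visiting.discard(node)
            done.add(node)
            return False
        return any(dfs(n) for n in graph if n not in done)

    # The example schedule from above: R1(A), R2(A), W1(A), R2(B), W2(B)
    S = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"),
         ("T2", "R", "B"), ("T2", "W", "B")]
    edges = precedence_graph(S)                      # {("T2", "T1")}: T2 must precede T1
    print("Conflict serializable:", not has_cycle(edges))   # True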
View Serializability
View Serializability is one of the types of serializability in DBMS that ensures the consistency of
a database schedule. Unlike conflict serializability, which cares about the order of conflicting
operations, view serializability only cares about the final outcome. That is, two schedules are
view equivalent if they have:
Initial Read: The same set of initial reads (i.e., a read by a transaction with no preceding
write by another transaction on the same data item).
Updated Read: For any other writes on a data item in between, if a transaction \(T_j\)
reads the result of a write by transaction \(T_i\) in one schedule, then \(T_j\) should read
the result of a write by \(T_i\) in the other schedule as well.
Final Write: The same set of final writes (i.e., a write by a transaction with no
subsequent writes by another transaction on the same data item).
Let's understand view serializability with an example:
Consider two transactions \(T_1\) and \(T_2\):
Schedule 1 (S1), the serial schedule <T2, T1>:
| Transaction T1 | Transaction T2 |
|---------------------|---------------------|
| | Read(A) |
| | Write(B) |
| Write(A) | |
| Read(B) | |
| Write(B) | |
| Commit | Commit |
Schedule 2 (S2), a non-serial schedule:
| Transaction T1 | Transaction T2 |
|---------------------|---------------------|
| | Read(A) |
| Write(A) | |
| | Write(B) |
| Read(B) | |
| Write(B) | |
| Commit | Commit |
Here,
1. Initial Read: In both S1 and S2, the initial read of A is performed by \(T_2\); no other
transaction writes A before \(T_2\) reads it.
2. Updated Read: In both S1 and S2, \(T_1\) reads the value of B written by \(T_2\), because
\(T_2\)'s Write(B) is the most recent write of B before \(T_1\)'s Read(B).
3. Final Write: In both S1 and S2, the final write of A and the final write of B are performed
by \(T_1\).
Considering the above conditions, S1 and S2 are view equivalent. Since S1 is a serial schedule
(T2 followed by T1), S2 is view serializable.
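The three view-equivalence conditions can also be checked mechanically. A minimal sketch, again assuming the (transaction, operation, item) encoding of schedules used earlier:

    # Collect the initial reads, the reads-from pairs, and the final writes of a schedule.
    def view_info(schedule):
        initial_reads, reads_from, final_writes = set(), set(), {}
        last_writer = {}                        # item -> transaction that last wrote it
        for txn, op, item in schedule:
            if op == "R":
                writer = last_writer.get(item)  # None means an initial read
                if writer is None:
                    initial_reads.add((txn, item))
                else:
                    reads_from.add((txn, item, writer))
            elif op == "W":
                last_writer[item] = txn
                final_writes[item] = txn
        return initial_reads, reads_from, set(final_writes.items())

    def view_equivalent(s1, s2):
        return view_info(s1) == view_info(s2)

    # The schedules S1 and S2 from the example above.
    S1 = [("T2", "R", "A"), ("T2", "W", "B"), ("T1", "W", "A"),
          ("T1", "R", "B"), ("T1", "W", "B")]
    S2 = [("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "B"),
          ("T1", "R", "B"), ("T1", "W", "B")]
    print(view_equivalent(S1, S2))   # True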
Recoverability in DBMS
Recoverability refers to the ability of a system to restore its state to a point where the integrity of
its data is not compromised, especially after a failure or an error.
When multiple transactions are executing concurrently, issues may arise that affect the system's
recoverability. The interaction between transactions, if not managed correctly, can result in
scenarios where a transaction's effects cannot be undone, which would violate the system's
integrity.
Importance of Recoverability:
The need for recoverability arises because databases are designed to ensure data reliability and
consistency. If a system isn't recoverable:
The integrity of the data might be compromised.
Business processes can be adversely affected due to corrupted or inconsistent data.
The trust of end-users or businesses relying on the database will be diminished.
Levels of Recoverability
1. Recoverable Schedules
A schedule is said to be recoverable if, for any pair of transactions \(T_i\) and \(T_j\), if \(T_j\)
reads a data item previously written by \(T_i\), then \(T_i\) must commit before \(T_j\) commits.
If a transaction fails for any reason and needs to be rolled back, the system can recover without
having to rollback other transactions that have read or used data written by the failed transaction.
Example of a Recoverable Schedule:Suppose we have two transactions \(T_1\) and \(T_2\).
| Transaction T1 | Transaction T2 |
|---------------------|---------------------|
| Write(A) | |
| | Read(A) |
| Commit | |
| | Write(B) |
| | Commit |
In the above schedule, \(T_2\) reads a value written by \(T_1\), but \(T_1\) commits before
\(T_2\), making the schedule recoverable.
2. Non-Recoverable Schedules
A schedule is said to be non-recoverable (or irrecoverable) if there exists a pair of transactions
\(T_i\) and \(T_j\) such that \(T_j\) reads a data item previously written by \(T_i\), but \(T_i\) has
not committed yet and \(T_j\) commits before \(T_i\). If \(T_i\) fails and needs to be rolled back
after \(T_j\) has committed, there's no straightforward way to roll back the effects of \(T_j\),
leading to potential data inconsistency.
Example of a Non-Recoverable Schedule:Again, consider two transactions \(T_1\) and \(T_2\).
| Transaction T1 | Transaction T2 |
|---------------------|---------------------|
| Write(A) | |
| | Read(A) |
| | Write(B) |
| | Commit |
| Commit | |
In this schedule, \(T_2\) reads a value written by \(T_1\) and commits before \(T_1\) does. If
\(T_1\) encounters a failure and has to be rolled back after \(T_2\) has committed, we're left in a
problematic situation since we cannot easily roll back \(T_2\), making the schedule non-
recoverable.
3. Cascading Rollback
A cascading rollback occurs when the rollback of a single transaction causes one or more
dependent transactions to be rolled back. This situation can arise when one transaction reads
uncommitted changes of another transaction, and then the latter transaction fails and needs to be
rolled back. Consequently, any transaction that has read the uncommitted changes of the failed
transaction also needs to be rolled back, leading to a cascade effect.
Example of Cascading Rollback
Consider two transactions \(T_1\) and \(T_2\):
| Transaction T1 | Transaction T2 |
|---------------------|---------------------|
| Write(A) | |
| | Read(A) |
| | Write(B) |
| Abort(some failure) | |
| Rollback | |
Here, \(T_2\) reads an uncommitted value of A written by \(T_1\). When \(T_1\) fails and is
rolled back, \(T_2\) also has to be rolled back, leading to a cascading rollback. This is
undesirable because it wastes computational effort and can complicate recovery procedures.
4. Cascadeless Schedules
A schedule is considered cascadeless if transactions only read committed values. This means, in
such a schedule, a transaction can read a value written by another transaction only after the latter
has committed. Cascadeless schedules prevent cascading rollbacks.
Example of Cascadeless Schedule
Consider two transactions \(T_1\) and \(T_2\):
| Transaction T1 | Transaction T2 |
|---------------------|---------------------|
| Write(A) | |
| Commit | |
| | Read(A) |
| | Write(B) |
| | Commit |
In this schedule, \(T_2\) reads the value of A only after \(T_1\) has committed. Thus, even if
\(T_1\) were to fail before committing (not shown in this schedule), it would not affect \(T_2\).
This means there's no risk of cascading rollback in this schedule.
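A small sketch that classifies a schedule (with explicit commit operations) as recoverable and/or cascadeless, under the same assumed (transaction, operation, item) encoding; "C" marks a commit:

    def analyse(schedule):
        last_writer = {}          # item -> transaction that most recently wrote it
        commit_position = {}      # transaction -> position of its commit in the schedule
        reads_from = []           # (reader, writer) pairs
        cascadeless = True
        for pos, (txn, op, item) in enumerate(schedule):
            if op == "W":
                last_writer[item] = txn
            elif op == "R":
                writer = last_writer.get(item)
                if writer and writer != txn:
                    reads_from.append((txn, writer))
                    if writer not in commit_position:
                        cascadeless = False   # read of a value that is not yet committed
            elif op == "C":
                commit_position[txn] = pos
        # Recoverable: every writer commits before the transaction that read from it commits.
        recoverable = all(commit_position.get(writer, float("inf"))
                          < commit_position.get(reader, float("inf"))
                          for reader, writer in reads_from)
        return recoverable, cascadeless

    # The non-recoverable example above: T2 reads A written by T1 and commits first.
    bad = [("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "B"),
           ("T2", "C", None), ("T1", "C", None)]
    print(analyse(bad))   # (False, False)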
Implementation of Isolation in DBMS: Isolation is one of the core ACID properties of a
database transaction, ensuring that the operations of one transaction remain hidden from other
transactions until completion. It means that no two transactions should interfere with each other
and affect the other's intermediate state.
Isolation Levels
Isolation levels define the degree to which a transaction must be isolated from the data
modifications made by other transactions in the database system. There are four levels of
transaction isolation defined by SQL –
1. Serializable
The highest isolation level.
Guarantees full serializability and ensures complete isolation of transaction operations.
2. Repeatable Read
This is the second most restrictive isolation level.
The transaction holds read locks on all rows it references.
It holds write locks on all rows it inserts, updates, or deletes.
Since other transactions cannot update or delete these rows, non-repeatable reads are
avoided.
3. Read Committed
This isolation level allows only committed data to be read.
Thus it does not allow dirty reads (i.e., one transaction reading data written by another
transaction that has not yet committed).
The transaction holds a read or write lock on the current row, and thus prevents other
transactions from updating or deleting it.
4. Read Uncommitted
It is the lowest isolation level.
In this level, one transaction may read changes made by other transactions that are not yet
committed.
This level allows dirty reads.
The proper isolation level or concurrency control mechanism to use depends on the specific
requirements of a system and its workload. Some systems may prioritize high throughput and
can tolerate lower isolation levels, while others might require strict consistency and higher
isolation.
| Isolation Level  | Dirty Read | Non-Repeatable Read |
|------------------|------------|---------------------|
| Serializable     | NO         | NO                  |
| Repeatable Read  | NO         | NO                  |
| Read Committed   | NO         | YES                 |
| Read Uncommitted | YES        | YES                 |
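As an illustration of how an application requests one of these levels, the sketch below issues the standard SQL SET TRANSACTION ISOLATION LEVEL statement through a generic connection. Here connect() and the accounts table are hypothetical placeholders, and the exact API, syntax, and supported levels vary between database systems and drivers:

    # Sketch only: 'connect' is a hypothetical placeholder for a DB-API driver's connect().
    conn = connect("dbname=bank")          # hypothetical connection
    cur = conn.cursor()
    # Standard SQL; support and exact wording differ between systems.
    cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    first = cur.fetchone()
    # ... a concurrent transaction may try to modify this row here ...
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    second = cur.fetchone()                # repeatable read: first == second
    conn.commit()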
Implementation of Isolation
Implementing isolation typically involves concurrency control mechanisms. Here are common
mechanisms used:
1. Locking Mechanisms
Locking ensures exclusive access to a data item for a transaction. This means that while one
transaction holds a lock on a data item, no other transaction can access that item.
Shared Lock (S-lock): Allows a transaction to read an item but not write to it.
Exclusive Lock (X-lock): Allows a transaction to read and write an item. No other
transaction can read or write until the lock is released.
Two-phase Locking (2PL): This protocol ensures that a transaction acquires all the
locks before it releases any. This results in a growing phase (acquiring locks and not
releasing any) and a shrinking phase (releasing locks and not acquiring any).
2. Timestamp-based Protocols
Every transaction is assigned a unique timestamp when it starts. This timestamp determines the
order of transactions. Transactions can only access the database if they respect the timestamp
order, ensuring older transactions get priority.
Lock Based Protocols in DBMS:Locks are essential in a database system to ensure:
1. Consistency: Without locks, multiple transactions could modify the same data item
simultaneously, resulting in an inconsistent state.
2. Isolation: Locks ensure that the operations of one transaction are isolated from other transactions,
i.e., they are invisible to other transactions until the transaction is committed.
3. Concurrency: While ensuring consistency and isolation, locks also allow multiple transactions to
be processed simultaneously by the system, optimizing system throughput and overall
performance.
4. Avoiding Conflicts: Locks help in avoiding data conflicts that might arise due to simultaneous
read and write operations by different transactions on the same data item.
5. Preventing Dirty Reads: With the help of locks, a transaction is prevented from reading data
that hasn't yet been committed by another transaction.
Lock-Based Protocols
1. Simple Lock Based Protocol
The simple lock-based protocol is a mechanism in which a data item is locked for the exclusive
use of the transaction currently operating on it.
Types of Locks: There are two types of locks used –
Shared Lock (S-lock)
This lock allows a transaction to read a data item. Multiple transactions can hold shared locks on
the same data item simultaneously. It is denoted by Lock-S. This is also called as read lock.
Exclusive Lock (X-lock):
This lock allows a transaction to read and write a data item. If a transaction holds an exclusive
lock on an item, no other transaction can hold any kind of lock on the same item. It is denoted
as Lock-X. This is also called as write lock.
T1 | T2
----------|----------
Lock-S(A) |
Read(A) |
Unlock(A) | Lock-X(A)
| Read(A)
| Write(A)
| Unlock(A)
| Shared Lock | Exclusive Lock |
|--------------|-----------------|
| Any number of transactions can hold a shared lock on an item. | An exclusive lock can be held by only one transaction. |
| Using a shared lock, a data item can only be viewed (read). | Using an exclusive lock, data can be inserted or deleted. |
2. Two-Phase Locking (2PL) Example
In the following schedule, each transaction acquires all the locks it needs (growing phase) before
it releases any (shrinking phase):
T1        | T2
----------|----------
Lock-S(A) |
          | Lock-S(A)
Lock-X(B) |
Unlock(A) |
          | Lock-X(C)
Unlock(B) |
          | Unlock(A)
          | Unlock(C)
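The shared/exclusive compatibility rules above can be captured in a tiny lock-table sketch; a real lock manager would also queue waiting transactions and handle lock upgrades and deadlocks:

    # Sketch of S/X lock compatibility on single data items.
    class LockTable:
        def __init__(self):
            self.locks = {}                      # item -> (mode, set of holders)

        def acquire(self, txn, item, mode):
            held = self.locks.get(item)
            if held is None:
                self.locks[item] = (mode, {txn})
                return True
            held_mode, holders = held
            # Only S with S is compatible; any combination involving X conflicts.
            if mode == "S" and held_mode == "S":
                holders.add(txn)
                return True
            return False                         # caller must wait (or be rolled back)

        def release(self, txn, item):
            mode, holders = self.locks.get(item, (None, set()))
            holders.discard(txn)
            if not holders:
                self.locks.pop(item, None)

    lt = LockTable()
    print(lt.acquire("T1", "A", "S"))   # True  - shared lock granted
    print(lt.acquire("T2", "A", "S"))   # True  - shared locks are compatible
    print(lt.acquire("T3", "A", "X"))   # False - exclusive lock must wait
    lt.release("T1", "A"); lt.release("T2", "A")
    print(lt.acquire("T3", "A", "X"))   # True  - item is now free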
Validation Based Protocol (Optimistic Concurrency Control)
In this approach, a transaction executes in three phases: a read phase, in which it reads values
from the database and applies updates only to local copies; a validation phase; and a write phase.
Validation Phase
All the data items are checked to ensure that serializability will not be violated if the
transaction's updates are actually applied to the database. Any violation causes the transaction to
be rolled back. Transaction timestamps are used, and the write-sets and read-sets are maintained.
To check that transaction TransA does not interfere with transaction TransB, one of the following
must hold −
TransB completes its write phase before TransA starts its read phase.
TransA starts its write phase after TransB completes its write phase, and the read set of
TransA has no items in common with the write set of TransB.
Both the read set and the write set of TransA have no items in common with the write set of
TransB, and TransB completes its read phase before TransA completes its read phase.
Write Phase
The transaction's updates are applied to the database if validation is successful. Otherwise, the
updates are discarded and the transaction is aborted and restarted. This scheme does not use any
locks and is therefore deadlock-free; however, starvation of transactions may occur.
Problem
S: W1(X), R2(Y), R1(Y), R2(X)
TS(T1) = 3
TS(T2) = 4
Check whether timestamp ordering protocols allow schedule S.
Solution
Initially for a data-item X, RTS(X)=0, WTS(X)=0
Initially for a data-item Y, RTS(Y)=0, WTS(Y)=0
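Applying the basic timestamp-ordering rules (reject Ri(Q) if TS(Ti) < WTS(Q); reject Wi(Q) if TS(Ti) < RTS(Q) or TS(Ti) < WTS(Q)), with TS(T1) = 3 and TS(T2) = 4:
1. W1(X): TS(T1) = 3 is not less than RTS(X) = 0 or WTS(X) = 0, so the write is allowed; WTS(X) = 3.
2. R2(Y): TS(T2) = 4 is not less than WTS(Y) = 0, so the read is allowed; RTS(Y) = 4.
3. R1(Y): TS(T1) = 3 is not less than WTS(Y) = 0, so the read is allowed; RTS(Y) remains 4.
4. R2(X): TS(T2) = 4 is not less than WTS(X) = 3, so the read is allowed; RTS(X) = 4.
Every operation is allowed, so the timestamp-ordering protocol allows schedule S.
A small sketch that mechanizes the same check (timestamps and rules as above):

    ts = {"T1": 3, "T2": 4}
    rts, wts = {}, {}                      # read/write timestamps, 0 when absent

    def allowed(txn, op, item):
        t = ts[txn]
        if op == "R":
            if t < wts.get(item, 0):       # reading a value written "in the future"
                return False
            rts[item] = max(rts.get(item, 0), t)
        else:  # "W"
            if t < rts.get(item, 0) or t < wts.get(item, 0):
                return False
            wts[item] = t
        return True

    S = [("T1", "W", "X"), ("T2", "R", "Y"), ("T1", "R", "Y"), ("T2", "R", "X")]
    print(all(allowed(*step) for step in S))   # True: the protocol allows S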
Deadlock in DBMS
A deadlock arises when transactions wait for each other's locks. For example, transaction T1 may
be waiting for T2 to release its lock while, at the same time, transaction T2 is waiting for T1 to
release its lock. All activity comes to a halt and remains at a standstill until the DBMS detects
the deadlock and aborts one of the transactions.
Below is a list of conditions necessary for a deadlock to occur:
o Circular Waiting: Two or more transactions wait indefinitely for one another to release a
lock that each of them needs.
o Partial Allocation: A transaction acquires some of the required data items but not all of
them, as the rest may be exclusively locked by other transactions.
o Non-Preemptive Scheduling: A data item cannot be forcibly taken away from the
transaction holding it; it can be released only voluntarily.
o Mutual Exclusion: A data item can be locked exclusively by only one transaction at a time.
To avoid a deadlock, at least one of the above-mentioned necessary conditions must not hold.
Deadlock Avoidance
o When a database can get stuck in a deadlock state, it is better to avoid the deadlock rather
than abort and restart the transactions afterwards, as that is a waste of time and resources.
o Deadlock avoidance mechanism is used to detect any deadlock situation in advance. A
method like "wait for graph" is used for detecting the deadlock situation but this method
is suitable only for the smaller database. For the larger database, deadlock prevention
method can be used.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect
whether the transaction is involved in a deadlock or not. The lock manager maintains a wait-for
graph to detect deadlock cycles in the database.
Wait for Graph
o This is a suitable method for deadlock detection. In this method, a graph is created
based on the transactions and their locks. If the created graph has a cycle or closed loop,
then there is a deadlock.
o The wait-for graph is maintained by the system for every transaction that is waiting for
some data held by another. The system keeps checking whether there is any cycle in the
graph.
Deadlock Prevention
o Deadlock prevention method is suitable for a large database. If the resources are allocated
in such a way that deadlock never occurs, then the deadlock can be prevented.
o The database management system analyzes the operations of a transaction to determine
whether they can create a deadlock situation or not. If they can, then the DBMS never
allows that transaction to be executed.
Each transaction has a unique identifier called a timestamp, which is assigned when the
transaction starts. For example, if transaction T1 starts before transaction T2, then the timestamp
of T1 will be less than the timestamp of T2. The timestamp decides whether a transaction should
wait or abort and roll back. Aborted transactions retain their timestamp values and hence their
seniority.
The following deadlock prevention schemes using timestamps have been proposed.
o Wait-Die scheme
o Wound wait scheme
The significant disadvantage of both of these techniques is that some transactions are aborted and
restarted unnecessarily even though those transactions never actually cause a deadlock.
Wait-Die scheme
In this scheme, if a transaction requests a resource that is already held with a conflicting lock by
another transaction, then the DBMS simply checks the timestamps of both transactions and
allows the older transaction to wait until the resource is available.
Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of any
transaction T. Suppose Tj holds a lock on some resource and Ti requests that resource. The
DBMS performs the following actions:
1. If TS(Ti) < TS(Tj), i.e., Ti is the older transaction, then Ti is allowed to wait until the
data item is available. That means an older transaction requesting a resource locked by a
younger transaction simply waits.
2. If TS(Ti) > TS(Tj), i.e., Ti is the younger transaction, then Ti is killed (aborted) and
restarted later with a random delay but with the same timestamp.
In this scheme, Ti is the requesting transaction, Tj is the transaction holding the lock on the data
item, and t(Ti) is the timestamp of transaction Ti.
Consider an example in which transaction T1 is older than T2, and T3 is younger than T2, i.e.,
t(T1) < t(T2) < t(T3).
If T1 requests a data item locked by transaction T2, then T1 has to wait until T2 completes and
all locks acquired by it are released, because t(T1) < t(T2). On the other hand, if transaction T3
requests a data item locked by transaction T2, then T3 has to abort and roll back, i.e., it dies,
because t(T3) > t(T2).
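A minimal sketch of the Wait-Die decision; the concrete timestamps (5, 10, 15) are assumed values chosen only so that T1 is older and T3 is younger than T2:

    # Wait-Die decision when `requester` asks for an item locked by `holder`.
    # Smaller timestamp = older transaction.
    def wait_die(requester_ts, holder_ts):
        if requester_ts < holder_ts:
            return "WAIT"    # older transaction waits for the younger one
        return "DIE"         # younger transaction is aborted (restarted later, same timestamp)

    # Assumed example timestamps: t(T1)=5, t(T2)=10, t(T3)=15.
    print(wait_die(5, 10))   # T1 requests an item held by T2 -> WAIT (T1 is older)
    print(wait_die(15, 10))  # T3 requests an item held by T2 -> DIE  (T3 is younger)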
Deadlock Detection and Recovery
In the deadlock detection scheme, a deadlock detection algorithm periodically examines the state
of the system to check whether a deadlock has occurred; if a deadlock exists, the system tries to
recover from it.
In order to detect a deadlock, the system must maintain information about the current allocation
of data items to transactions and about outstanding requests. The system must provide an
algorithm that uses this information to examine whether the system has entered a deadlock state.
If a deadlock exists, then the system attempts to recover from it.
Recovery from Deadlock
If the wait-for graph used for deadlock detection contains a cycle, i.e., a deadlock situation, then
the cycle must be broken to recover from the deadlock. The most widely used technique for
recovering from a deadlock is to roll back one or more transactions until the system no longer
exhibits a deadlock condition.
The selection of the transactions to be rolled back is based on the following deliberations:
Selection of victim: There may be many transactions involved in a deadlock, i.e., deadlocked
transactions. To recover from the deadlock, some of these transactions must be rolled back. The
transaction that is rolled back is known as the victim transaction, and the mechanism is known as
victim selection.
The transactions selected to be rolled back should be those that have just started or have not made
many changes. Avoid selecting transactions that have made many updates and have been running
for a long time.
Rollback: Once the transaction to be rolled back has been selected, we should determine how far
it should be rolled back. The simplest solution is a total rollback, i.e., abort the transaction and
restart it. However, it is more effective to roll the transaction back only as far as necessary to
break the deadlock; this requires maintaining additional information about the state of the
currently executing transactions.
Starvation: To recover from a deadlock, we must ensure that the same transaction is not selected
again and again as the victim; otherwise that transaction may never complete. To avoid
starvation, a transaction should be picked as a victim only a finite number of times.
A widely used solution is to include the number of times a transaction has already been rolled
back in the cost factor used to select the victim.
Failure Classification
To find that where the problem has occurred, we generalize a failure into the following
categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
A transaction failure occurs when a transaction fails to execute or reaches a point from
where it cannot proceed any further. If a transaction or process is damaged, this is called a
transaction failure.
Reasons for a transaction failure could be -
1. Logical errors: If a transaction cannot complete due to some code error or an
internal error condition, then the logical error occurs.
2. Syntax error: It occurs when the DBMS itself terminates an active transaction
because the database system is not able to execute it. For example, the system
aborts an active transaction in case of deadlock or resource unavailability.
2. System Crash
o System failure can occur due to power failure or other hardware or software
failure. Example: Operating system error.
Fail-stop assumption: In the system crash, non-volatile storage is assumed not to
be corrupted.
3. Disk Failure
o It occurs when hard-disk drives or storage drives fail. This was a common problem
in the early days of technology evolution.
o Disk failure occurs due to the formation of bad sectors, a disk head crash,
unreachability of the disk, or any other failure that destroys all or a part of disk
storage.
STORAGE: In a DBMS (Database Management System), storage refers to the organization and
management of data on physical storage devices like disks or tapes, ensuring efficient data
handling and access for applications and queries. The storage system in a DBMS plays a crucial
role in efficient data handling, from rapid processing in primary storage to long-term backup in
tertiary storage.
Here's a more detailed explanation:
Data Storage:
DBMS stores data as files on physical storage devices like disks, managing the allocation and
deallocation of storage space.
Storage Management:
DBMS components, like the disk space manager and file manager, handle the interaction with
the storage devices, ensuring efficient data access and retrieval.
File Organization:
DBMS uses various file organization techniques to store data efficiently, such as heap files,
sorted files, and indexed files.
Storage Hierarchy:
Database storage systems often employ a hierarchy of storage devices, from fast primary
storage (like RAM) to slower secondary storage (like hard drives) and tertiary storage (like
tapes) for backups.
Storage Engines:
Some DBMS systems use storage engines, which are specialized components responsible for
managing the storage of data, interacting with the file system, and providing efficient data
access.
RAID:
Redundant Array of Independent Disks (RAID) technology is often used in DBMS to improve
performance, reliability, and scalability by distributing data across multiple storage devices.
Metadata:
DBMS stores metadata, which is data about data, such as table names, column names, data
types, and constraints, to describe the structure and organization of the database.
Indexing in DBMS
Indexing is a technique used to quickly locate and access the data in a database table. An index
typically has two columns:
The search key is the index's first column; it contains a duplicate or copy of the table's
candidate key or primary key. The key values are stored in sorted order so that the related
data can be quickly accessed.
The data reference is the index's second column; it contains a group of pointers that point
to the disk block where the value of the corresponding key can be found.
Types of Indexes:
1. Single-level Index: A single index table that contains pointers to the actual data records.
2. Multi-level Index: An index of indexes. This hierarchical approach reduces the number
of accesses (disk I/O operations) required to find an entry.
3. Dense and Sparse Indexes:
o In a dense index, there's an index entry for every search key value in the database.
o In a sparse index, there are fewer index entries; one entry might point to several
records (see the sketch after this list).
4. Primary and Secondary Indexes:
o A primary index is an ordered file whose records are of fixed length with two
fields. The first field is the same as the primary key, and the second field is a
pointer to the data block. There's a one-to-one relationship between the number of
entries in the index and the number of records in the main file.
o A secondary index provides a secondary means of accessing data. For each
secondary key value, the index points to all the records with that key value.
5. Clustered vs. Non-clustered Index:
o In a clustered index, the rows of data in the table are stored on disk in the same
order as the index. There can only be one clustered index per table.
o In a non-clustered index, the order of rows does not match the index's order. You
can have multiple non-clustered indexes.
6. Bitmap Index: Used mainly for data warehousing setups, a bitmap index uses bit arrays
(bitmaps) and usually involves columns that have a limited number of distinct values.
7. B-trees and B+ trees: Balanced tree structures that ensure logarithmic access time. B+
trees are particularly popular in DBMS for their efficiency in disk I/O operations.
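A minimal sketch contrasting the dense and sparse indexes described in item 3 above; the sample records and the block size of 3 are assumptions made purely for illustration:

    import bisect

    # Sorted data file split into blocks of 3 records each.
    records = [(5, "Amit"), (12, "Bela"), (19, "Chen"),
               (23, "Dina"), (31, "Esha"), (44, "Farid"),
               (52, "Gita"), (60, "Hari"), (71, "Indu")]
    blocks = [records[i:i + 3] for i in range(0, len(records), 3)]

    # Dense index: one (key, location) entry per record.
    dense = [(key, (b, r)) for b, block in enumerate(blocks)
                           for r, (key, _) in enumerate(block)]
    # Sparse index: one entry per block, holding the first key stored in that block.
    sparse = [(block[0][0], b) for b, block in enumerate(blocks)]
    print(len(dense), "dense entries vs", len(sparse), "sparse entries")   # 9 vs 3

    def sparse_lookup(key):
        first_keys = [k for k, _ in sparse]
        b = bisect.bisect_right(first_keys, key) - 1   # last block whose first key <= key
        if b < 0:
            return None
        for k, value in blocks[b]:                     # scan within the one candidate block
            if k == key:
                return value
        return None

    print(sparse_lookup(31))   # 'Esha'
    print(sparse_lookup(32))   # None (not present)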
Benefits of Indexing:
Faster search and retrieval times for database operations.
Drawbacks of Indexing:
Overhead for insert, update, and delete operations, as indexes need to be maintained.
Additional storage requirements for the index structures.
B+ Tree
o The B+ tree is a balanced search tree in which a node can have more than two children (it
is not a binary tree). It follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. The B+ tree ensures that all leaf
nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can
support random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of
the order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.
Internal node
o An internal node of the B+ tree can contain at least n/2 child pointers, except the root
node.
o At most, an internal node of the tree contains n child pointers.
Leaf node
o The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node after
55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert 60 there.
In this case, we have to split the leaf node, so that it can be inserted into tree without affecting
the fill factor, balance and order.
The 3rd leaf node has the values (50, 55, 60, 65, 70), and its parent entry is 50. We will split the
leaf node in the middle so that its balance is not altered. So we can group (50, 55) and (60, 65, 70)
into two leaf nodes.
If these two are to be leaf nodes, the parent (intermediate) node cannot reach them using only the
key 50. It should have 60 added to it, and then we can have a pointer to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy to
find the node where it fits and then place it in that leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from
the intermediate node as well as from the 4th leaf node. After removing it, the tree will no longer
satisfy the rules of the B+ tree, so we need to modify it to obtain a balanced tree again.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
B+ Trees
A B+ Tree is a type of self-balancing tree structure commonly used in databases and file systems
to maintain sorted data in a way that allows for efficient insertion, deletion, and search
operations. B+ Trees are an extension of B-Trees but differ mainly in the way they handle leaf
nodes, which contain all the key values and point to the actual records.
A B+ Tree of order `n` has the following properties:
1. Every node has a maximum of `n` children.
2. Every node (except the root) has a minimum of `n/2` children.
3. The tree is perfectly balanced, meaning that all leaf nodes are at the same level.
4. All keys are stored in the leaf nodes, and the internal nodes act as 'guides' to locate the
leaf nodes faster.
Operations on B+ Trees:
1. Search: Starts at the root and traverses down the tree, guided by the key values in each
node, until it reaches the appropriate leaf node.
2. Insert: Inserts a new key-value pair and then reorganizes the tree as needed to maintain its
properties.
3. Delete: Removes a key-value pair and then reorganizes the tree, again to maintain its
properties.
Example of B+ Tree Operations
Let's say we have a B+ Tree of order 4, and we want to insert the keys `[10, 20, 5, 6, 12, 30, 7,
17]` into an initially empty tree.
-------------------
|| || || ||
-------------------
Insertion
1. Insert 10:
- The tree is empty, so 10 becomes the root.
[10]
2. Insert 20:
- There's room in the leaf node for 20.
[10, 20]
3. Insert 5:
- Still room in the leaf node for 5.
[5, 10, 20]
4. Insert 6:
- The leaf node overflows, so split it into two leaves and copy 10 up into a new root.
[10]
/ \
[5, 6] [10, 20]
5. Insert 12:
- Insert into the appropriate leaf node.
[10, , ]
/ \
[5, 6] [10, 12, 20]
6. Insert 30:
- Need to split the right leaf node, promote 20.
[10 , 20 , ]
/ | \
[5, 6] [10, 12] [20, 30]
7. Insert 7:
- Insert into the appropriate leaf node.
[10 , 20 , ]
/ | \
[5, 6, 7] [10, 12] [20, 30]
8. Insert 17:
- 17 is at least 10 and less than 20, so it belongs in the middle leaf node, which still has room.
[10 , 20 , ]
/ | \
[5, 6, 7] [10, 12, 17] [20, 30]
Search (for 12):
- Start at the root; because 12 is at least 10 and less than 20, go down the second child and find
12 in the corresponding leaf node.
Deletion (of 10):
1. Remove 10 from its leaf node; the leaf [10, 12, 17] becomes [12, 17], which still satisfies the
minimum occupancy of a leaf.
[10 , 20 , ]
/ | \
[5, 6, 7] [12, 17] [20, 30]
2. Since the key 10 is also present in the internal node as a separator, we replace it with the new
smallest key of that leaf (its in-order successor), which is 12.
[12 , 20 , ]
/ | \
[5, 6, 7] [12, 17] [20, 30]
And that's how B+ Trees work for search, insert, and delete operations. B+ Trees are dynamic,
adapting efficiently as keys are added or removed, which makes them quite useful for databases
where high-speed data retrieval is crucial.
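A minimal sketch of how search walks the two-level B+ tree produced by the insertions above; the nested-dictionary node layout is an assumption made for illustration (real implementations store nodes as disk pages):

    import bisect

    # Root holds separator keys and child pointers; each leaf holds sorted keys.
    root = {"keys": [10, 20],
            "children": [{"keys": [5, 6, 7]},
                         {"keys": [10, 12, 17]},
                         {"keys": [20, 30]}]}

    def bplus_search(node, key):
        while "children" in node:                        # descend through internal nodes
            i = bisect.bisect_right(node["keys"], key)   # child i covers keys below keys[i]
            node = node["children"][i]
        return key in node["keys"]                       # leaf: check for the key

    print(bplus_search(root, 12))   # True
    print(bplus_search(root, 15))   # False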
Hash Based Indexing:
For a huge database, it can be nearly impossible to search through all the index values across all
its levels and then reach the destination data block to retrieve the desired data.
without using index structure. Hashing uses hash functions with search keys as parameters to
generate the address of a data record.
Hash Organization
Bucket − A hash file stores data in bucket format. Bucket is considered a unit of storage. A
bucket typically stores one complete disk block, which in turn can store one or more records.
Hash Function − A hash function, h, is a mapping function that maps the set of all search keys K
to the addresses where the actual records are placed. It is a function from search keys to bucket
addresses.
Static Hashing
In static hashing, the hash function always computes the same bucket address for a given search
key, and the number of buckets remains fixed.
Operation
Insertion − When a record needs to be entered using static hashing, the hash function h computes
the bucket address for search key K, where the record will be stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function can be used to retrieve the
address of the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.
Bucket Overflow − The condition of bucket overflow is known as a collision. This is a fatal state
for any static hash function. In this case, overflow chaining can be used.
Overflow Chaining − When buckets are full, a new bucket is allocated for the same hash result
and is linked after the previous one. This mechanism is called Closed Hashing.
Linear Probing − When a hash function generates an address at which data is already stored,
the next free bucket is allocated to it. This mechanism is called Open Hashing.
Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the size of
the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets are
added and removed dynamically and on demand. Dynamic hashing is also known as extended
hashing. In dynamic hashing, the hash function is made to produce a large number of values, and
only a few are used initially.
Organization
The prefix of an entire hash value is taken as a hash index. Only a portion of the hash value is
used for computing bucket addresses. Every hash index has a depth value to signify how many
bits are used for computing a hash function. These bits can address 2^n buckets. When all these
bits are consumed − that is, when all the buckets are full − then the depth value is increased
linearly and twice as many buckets are allocated.
Operation
Querying − Look at the depth value of the hash index and use those bits to compute the bucket
address.
Update − Perform a query as above and update the data.
Deletion − Perform a query to locate the desired data and delete the same.
Insertion − Compute the address of the bucket.
If the bucket is already full:
Add more buckets.
Add additional bits to the hash value.
Re-compute the hash function.
Else:
Add data to the bucket.
If all the buckets are full, perform the remedies of static hashing.
Hashing is not favorable when the data is organized in some ordering and the queries require a
range of data. When data is discrete and random, hash performs the best.
Hashing algorithms have higher complexity than indexing, but all hash operations are done in
constant time.
Hash-Based Indexing
In hash-based indexing, a hash function is used to convert a key into a hash code. This hash code
serves as an index where the value associated with that key is stored. The goal is to distribute the
keys uniformly across an array, so that access time is, on average, constant.
Let's break down some of these elements to further understand how hash-based indexing works
in practice:
Buckets
In hash-based indexing, the data space is divided into a fixed number of slots known as
"buckets." A bucket usually contains a single page (also known as a block), but it may have
additional pages linked in a chain if the primary page becomes full. This is known as overflow.
Hash Function
The hash function is a mapping function that takes the search key as an input and returns the
bucket number where the record should be located. Hash functions aim to distribute records
uniformly across buckets to minimize the number of collisions (two different keys hashing to the
same bucket).
Disk I/O Efficiency
Hash-based indexing is particularly efficient when it comes to disk I/O operations. Given a
search key, the hash function quickly identifies the bucket (and thereby the disk page) where the
desired record is located. This often requires only one or two disk I/Os, making the retrieval
process very fast.
Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the appropriate
bucket. If the primary page of the bucket is full, an additional overflow page is allocated and
linked to the primary page. The new record is then stored on this overflow page.
Search Operations
To find a record with a specific search key, the hash function is applied to the search key to
identify the bucket. All pages (primary and overflow) in that bucket are then examined to find
the desired record.
Limitations
Hash-based indexing is not suitable for range queries or when the search key is not known. In
such cases, a full scan of all pages is required, which is resource-intensive.
Hash-Based Indexing Example
Let's consider a simple example using employee names as the search key.
Employee Records
| Name | Age | Salary
|-----------|----------|--------
| Alice | 28 | 50000
| Bob | 35 | 60000
| Carol | 40 | 70000
Hash Function: H(x) = ASCII value of first letter of the name mod 3
Alice: 65 mod 3 = 2
Bob: 66 mod 3 = 0
Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
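A minimal sketch of this example; the Python dictionary of lists stands in for the bucket pages:

    # Bucket number = ASCII value of the first letter of the name mod 3.
    def h(name):
        return ord(name[0]) % 3

    employees = [("Alice", 28, 50000), ("Bob", 35, 60000), ("Carol", 40, 70000)]

    buckets = {0: [], 1: [], 2: []}
    for record in employees:                 # insertion: hash the key to pick a bucket
        buckets[h(record[0])].append(record)

    def find(name):                          # equality search: hash again, scan one bucket
        for record in buckets[h(name)]:
            if record[0] == name:
                return record
        return None

    print(buckets[0])        # [('Bob', 35, 60000)]
    print(find("Carol"))     # ('Carol', 40, 70000)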
Pros of Hash-Based Indexing
Extremely fast for exact match queries.
Well-suited for equality comparisons.
Cons of Hash-Based Indexing
Not suitable for range queries (e.g., "SELECT * FROM table WHERE age BETWEEN
20 AND 30").
Performance can be severely affected by poor hash functions or a large number of
collisions.
Tree-based Indexing
The most commonly used tree-based index structure is the B-Tree, and its variations like B+
Trees and B* Trees. In tree-based indexing, data is organized into a tree-like structure. Each
node represents a range of key values, and leaf nodes contain the actual data or pointers to the
data.
Why Tree-based Indexing?
Tree-based indexes like B-Trees offer a number of advantages:
Sorted Data: They maintain data in sorted order, making it easier to perform range
queries.
Balanced Tree: B-Trees and their variants are balanced, meaning the path from the root
node to any leaf node is of the same length. This balancing ensures that data retrieval
times are consistently fast, even as the dataset grows.
Multi-level Index: Tree-based indexes can be multi-level, which helps to minimize the
number of disk I/Os required to find an item.
Dynamic Nature: B-Trees are dynamic, meaning they're good at inserting and deleting
records without requiring full reorganization.
Versatility: They are useful for both exact-match and range queries.
[1, 3]
/ \
[1] [3, 4]
/ \ / \
1 2 3 4
In the tree, navigating from the root to the leaf nodes will lead us to the desired data record.
Pros of Tree-based Indexing:
Efficient for range queries.
Good for both exact and partial matches.
Keeps data sorted.
Cons of Tree-based Indexing:
Slower than hash-based indexing for exact queries.
More complex to implement and maintain.