Advanced Database Ch2 and 3
QUERY PROCESSING AND OPTIMIZATION
Chapter 2
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND (s.position = 'Manager' AND b.city = 'London');
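This query can be written in (at least) three equivalent relational algebra forms. The first and the last of these are the ones costed on the following slides; the middle form (a Selection over a join) is included here only as the usual third alternative for this kind of comparison and is an assumption about what the assignment intends:

1. σ (position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
2. σ (position='Manager') ∧ (city='London') (Staff ⋈ Staff.branchNo=Branch.branchNo Branch)
3. (σ position='Manager' (Staff)) ⋈ Staff.branchNo=Branch.branchNo (σ city='London' (Branch))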
Assignment
Assume there are:
– 1000 tuples in Staff
– 50 tuples in Branch
– 50 Managers
– 5 London branches
■ Find the total cost to access disk for each query.
■ Which one is the best query?
First query
σ (position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
■ It calculates the Cartesian product of Staff and Branch, which requires
(1000 + 50) disk accesses to read the relations, and creates a relation with
(1000 * 50) tuples.
■ We then have to write this intermediate relation to disk and read each of its
tuples back again to test them against the selection predicate, at a cost of
another 2*(1000 * 50) disk accesses, giving a total cost of:
(1000 + 50) + 2*(1000 * 50) = 101,050 disk accesses
Final query
■ The final query first reads each Staff tuple to determine the Manager tuples, which requires 1000
disk accesses and produces a relation with 50 tuples.
■ The second Selection operation reads each Branch tuple to determine the London branches,
which requires 50 disk accesses and produces a relation with 5 tuples (the London branches).
■ The final operation is the join of the reduced Staff and Branch relations, which requires (50 + 5)
disk accesses, giving a total cost of:
1000 + 2*50 + 5 + (50 + 5) = 1,160 disk accesses
(counting, as before, the writing to disk and reading back of the two intermediate relations)
Query Decomposition
■ It is the first phase of query processing.
■ Used to transform a high-level query into a relational
algebra query.
■ Helps to check whether the query is syntactically correct
(it follows the grammar rules of the query language, checked at
compile time) and semantically correct (it is meaningful with
respect to the database schema).
Analysis
■ The query is lexically and syntactically analyzed using the
techniques of programming language compilers.
■ In addition, this stage verifies that the relations and attributes
specified in the query are defined in the system catalog.
■ It also verifies that any operations applied to database objects are
appropriate for the object type.
■ For example, consider the following query:
SELECT staffNumber FROM Staff WHERE position >10;
This query would be rejected on two grounds (a corrected version is sketched below):
1. In the select list, the attribute staffNumber is not defined for the Staff
relation (it should be staffNo).
2. In the WHERE clause, the comparison “>10” is incompatible with the
data type of position, which is a variable-length character string.
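A corrected version of this query that would pass the analysis stage could look as follows; the predicate is only an assumption, since the intent of the original query is not stated:

SELECT staffNo FROM Staff WHERE position = 'Manager';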
■ Normalization
Converts the query into a normalized form that can
be more easily manipulated.
The predicate (in SQL, the WHERE condition), which
may be arbitrarily complex, can be converted into
one of two forms (conjunctive normal form or
disjunctive normal form) by applying a few
transformation rules, as the example below shows.
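As an illustrative sketch with a made-up predicate, the condition position = 'Manager' AND (branchNo = 'B003' OR salary > 20000) can be written in either form:

Conjunctive normal form: (position = 'Manager') ∧ (branchNo = 'B003' ∨ salary > 20000)
Disjunctive normal form: (position = 'Manager' ∧ branchNo = 'B003') ∨ (position = 'Manager' ∧ salary > 20000)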
Semantic analysis
■ The objective of semantic analysis is to reject normalized queries that
are incorrectly formulated or contradictory.
■ A query is incorrectly formulated if components do not contribute to
the generation of the result, which may happen if some join
specifications are missing.
■ A query is contradictory if its predicate cannot be satisfied by any
tuple.
■ For example, the predicate (position = ‘Manager’ ^ position =
‘Assistant’) on the Staff relation is contradictory, as a member of staff
cannot be both a Manager and an Assistant simultaneously.
■ Thus the predicate ((position = ‘Manager’ ^ position = ‘Assistant’) v
salary > 20000) could be simplified to (salary > 20000) by interpreting
the contradictory clause as the Boolean value FALSE, as the SQL sketch below illustrates.
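In SQL terms, an assumed query carrying this contradictory clause, and its simplified equivalent, would be:

SELECT * FROM Staff
WHERE (position = 'Manager' AND position = 'Assistant') OR salary > 20000;
-- simplifies to:
SELECT * FROM Staff WHERE salary > 20000;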
Simplification
■ The objectives of the simplification stage are:
Detect redundant qualifications
Eliminate common subexpressions
Transform the query to a semantically equivalent but more
easily and efficiently computed form.
■ An initial optimization is to apply the well-known idempotency rules of
Boolean algebra (rules for operations that give the same result no matter
how many times they are applied), such as:
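The rules referred to here are the standard identities of Boolean algebra; for a predicate p they include:

p ∧ p ≡ p              p ∨ p ≡ p
p ∧ true ≡ p           p ∨ false ≡ p
p ∧ false ≡ false      p ∨ true ≡ true
p ∧ (¬p) ≡ false       p ∨ (¬p) ≡ true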
Query restructuring
■ In the final stage of query decomposition, the query is
restructured to provide a more efficient implementation.
■ The main techniques (approaches) used for query optimization are:
Heuristic rules that order the operations in a query
Comparing different strategies based on their relative costs
Dynamic optimization
■ Advantage
All information required to select an optimum strategy is up to date
■ Disadvantage
The performance of the query is affected because the query has to be
parsed, validated, and optimized before it can be executed
To keep this run-time overhead acceptable, the optimizer may have to
limit the number of execution strategies it analyzes, which may have
the effect of selecting a less than optimum strategy.
Static optimization
■ Advantage
The query is parsed, validated, and optimized only once.
Run-time overhead is removed.
There may be more time available to evaluate a larger
number of execution strategies, thereby increasing
the chances of finding a more optimum strategy.
■ Disadvantage
The execution strategy that is chosen as being
optimal when the query is compiled may no longer
be optimal when the query is run.
■ As the Equijoin and Natural join are special cases of the Theta
join, then this rule also applies to these Join operations. For
example, using the Equijoin of Staff and Branch:
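Assuming the rule being illustrated is the commutativity of the Theta join (R ⋈F S = S ⋈F R), the example for the Equijoin of Staff and Branch would read:

Staff ⋈ Staff.branchNo=Branch.branchNo Branch = Branch ⋈ Staff.branchNo=Branch.branchNo Staff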
Materialization
■ Materialization: the results of intermediate relational algebra
operations are written temporarily to disk.
■ The output of one operation is stored in a temporary relation for
processing by the next operation.
■ Operations are evaluated one at a time, starting at the lowest level, and
the intermediate results materialized into temporary relations are used to
evaluate the next-level operations.
■ Materialized evaluation is always applicable.
■ The cost of writing results to disk and reading them back can be quite high.
Our cost formulas for individual operations ignore the cost of writing their
results to disk, so:
Overall cost = sum of the costs of the individual operations +
cost of writing the intermediate results to disk
(A SQL-level sketch of this strategy is given below.)
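A minimal SQL-level sketch of materialization, reusing the Staff/Branch schema from earlier; the temporary-table name is made up for this illustration, and the exact CREATE TEMPORARY TABLE syntax varies slightly between DBMSs:

-- Materialization: the intermediate result is written to a temporary relation
CREATE TEMPORARY TABLE ManagerStaff AS
  SELECT * FROM Staff WHERE position = 'Manager';   -- first operation, result stored on disk

SELECT m.staffNo, b.city
FROM ManagerStaff m, Branch b
WHERE m.branchNo = b.branchNo;                       -- next operation reads the stored result back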
Pipelining
■ Pipelining (stream-based processing or on-the-fly processing) is to
pipeline the results of one operation to another operation without
creating a temporary relation to hold the intermediate result.
■ Clearly, if we can use pipelining we can save on the cost of creating
temporary relations and reading the results back in again.
■ evaluate several operations simultaneously, passing the results of one
operation on to the next.
■ Much cheaper than materialization: no need to store a temporary
relation to disk.
■ Pipelining may not always be possible – e.g., sort, hash-join.
■ For example, given a query with two Selection predicates, one of them on
salary for which an index exists, we could use the index to efficiently process
the first Selection on salary, store the result in a temporary relation, and then
apply the second Selection to the temporary relation.
■ The pipelined approach dispenses with the temporary relation and instead
applies the second Selection to each tuple in the result of the first Selection
as it is produced, adding any qualifying tuples to the result (see the SQL
sketch below).
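By contrast, pipelined evaluation corresponds to leaving the query as a single statement so that each tuple produced by the first Selection can be tested against the second predicate as it is produced; the salary threshold here is an assumed value:

-- Pipelining: no temporary relation; qualifying tuples flow straight through
SELECT *
FROM Staff
WHERE salary > 20000        -- first Selection (an index on salary could be used here)
  AND position = 'Manager'; -- second Selection applied on the fly to each qualifying tuple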
Database Statistics
■ The success of estimating the size and cost of intermediate relational
algebra operations depends on the amount and currency of the
statistical information that the DBMS holds.
■ DBMS should hold the following types of information in its system
catalog.
■ For each base relation R:
■ nTuples(R) – the number of tuples (records) in relation R (that is, its
cardinality).
■ bFactor(R) – the blocking factor of R (that is, the number of tuples of R
that fit into one block). From these two values, nBlocks(R) can be derived
as shown below.
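From these two statistics the optimizer can estimate the number of blocks needed to store the relation. As a sketch with assumed values nTuples(Staff) = 1000 and bFactor(Staff) = 20:

nBlocks(R) = [nTuples(R) / bFactor(R)]  (rounded up)
nBlocks(Staff) = [1000 / 20] = 50 blocks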
Selection Operation
■ The Selection operation in the relational algebra works on a single relation R
and defines a relation S containing only those tuples of R that satisfy
the specified predicate.
■ There are a number of different implementations for the Selection
operation, depending on the structure of the file in which the relation is
stored.
■ The main strategies that we consider are (illustrated with example
predicates below):
Linear search (unordered file, no index)
Binary search (ordered file, no index)
Equality on hash key
Equality condition on primary key
Inequality condition on primary key
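As an illustration, assuming staffNo is the primary key of Staff and that the other attributes may or may not be indexed, predicates that would typically map to these strategies are:

SELECT * FROM Staff WHERE position = 'Manager';  -- linear search if Staff is unordered and position is not indexed
SELECT * FROM Staff WHERE staffNo = 'SG37';      -- equality on the primary key (at most one matching tuple)
SELECT * FROM Staff WHERE staffNo > 'SG37';      -- inequality on the primary key (an ordered file or index helps)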
Binary Search
■ If the predicate is of the form (A = x) and the file is ordered on
attribute A, which is also a key attribute of relation R, then the
cost estimate for the search is:
[log2(nBlocks(R))]
■ If A is not a key attribute, more than one tuple may satisfy the predicate,
and the cost estimate becomes:
[log2(nBlocks(R))] + [SCA(R)/bFactor(R)] - 1
where SCA(R) is the selection cardinality, the number of tuples of R that
satisfy the predicate (a worked example follows).
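As a worked example with assumed statistics, suppose nBlocks(Staff) = 200, bFactor(Staff) = 20, and 100 tuples satisfy the predicate (SCA(Staff) = 100):

Key attribute:      cost = [log2(200)] = 8 block accesses
Non-key attribute:  cost = [log2(200)] + [100 / 20] - 1 = 8 + 5 - 1 = 12 block accesses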
End of Chapter 2
Chapter 3
Transaction processing concepts
Transaction
• A transaction is a set of logically related operations;
it contains a group of tasks.
• A transaction is an action, or series of actions, performed
by a single user or application program to access or change
the contents of the database.
• For example, a transaction that credits 800 to account Y consists of
the following operations on Y's account:
Open_Account(Y)
Old_Balance = Y.balance
New_Balance = Old_Balance + 800
Y.balance = New_Balance
Close_Account(Y)
Operations of Transaction
• Following are the main operations of a
transaction:
• Read(X): the Read operation is used to read the value
of X from the database and store it in a buffer in
main memory.
• Write(X): the Write operation is used to write the
value back to the database from the buffer.
• Let's take the example of a debit transaction on an account, which consists
of the following operations:
1. R(X);
2. X = X - 500;
3. W(X);
• Let's assume the value of X before the start of the transaction is 4000.
• The first operation reads X's value from the database and stores it in a buffer.
• The second operation decreases the value of X by 500, so the buffer will contain
3500.
• The third operation writes the buffer's value to the database, so X's final value
will be 3500.
• However, it is possible that, because of a hardware, software or power failure,
the transaction fails before finishing all the operations in the set.
For example: if the above debit transaction fails after
executing operation 2, then X's value will remain 4000 in
the database, which is not acceptable to the bank.
To solve this problem, we have two important operations
(sketched in SQL below):
• Commit: used to save the work done permanently.
• Rollback: used to undo the work done.
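A minimal SQL sketch of the debit example with commit and rollback; the table and column names are assumptions, and some DBMSs spell BEGIN as START TRANSACTION:

BEGIN;                                    -- start the transaction
UPDATE Account SET balance = balance - 500
WHERE accountNo = 'X';                    -- the debit of 500 (operations 1-3 above)
COMMIT;                                   -- save the work permanently
-- If any step fails before COMMIT, the DBMS (or the application) issues:
-- ROLLBACK;                              -- undo all work done by the transaction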
Transaction property
• A transaction has four properties (often called the ACID properties).
• These are used to maintain consistency in the database before
and after the transaction.
Property of Transaction
Atomicity
Consistency
Isolation
Durability
Atomicity
• It states that all operations of the transaction take place at once; if
not, the transaction is aborted.
• There is no midway, i.e., the transaction cannot occur partially.
Each transaction is treated as one unit and either runs to
completion or is not executed at all.
• Atomicity involves the following two operations:
Abort: if a transaction aborts, then all the changes made are not
visible.
Commit: if a transaction commits, then all the changes made are
visible.
• Example: Let's assume the following transaction T consists of two parts,
T1 and T2. Account A has a balance of ETB 600 and account B has ETB 300.
Transfer ETB 50 from account A to account B.
T1            T2
Read(A)       Read(B)
A := A - 50   B := B + 50
Write(A)      Write(B)
• Atomicity requires that either both T1 and T2 complete or neither does
(a minimal SQL sketch of this transfer follows).
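A minimal SQL sketch of the same transfer as one atomic unit; again, the table and column names are assumptions:

BEGIN;                                                             -- T1 and T2 form one atomic transaction
UPDATE Account SET balance = balance - 50 WHERE accountNo = 'A';  -- T1: debit A
UPDATE Account SET balance = balance + 50 WHERE accountNo = 'B';  -- T2: credit B
COMMIT;  -- either both updates become visible, or (after a failure and ROLLBACK) neither does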
View Equivalence
1. Initial Read
• The initial read of each data item must be performed by the same transaction
in both schedules.
The two example schedules S1 and S2 are view equivalent on this condition because
the initial read operation in S1 is done by T1 and in S2 it is also done by T1.
Cont….
2. Updated Read
In schedule S1, if Ti reads A after it has been updated by Tj, then in S2
also, Ti should read A after it has been updated by Tj.
The two schedules above are not view equivalent because, in S1, T3 reads A
updated by T2, while in S2, T3 reads A updated by T1.
Cont…
3. Final Write
• The final write must be the same in both schedules: if in schedule S1
a transaction T1 performs the final write on A, then in S2 the final write
on A should also be done by T1.
The two schedules above are view equivalent because the final write operation in
S1 is done by T3 and in S2 the final write operation is also done by T3.
END OF CHAPTER 3