DBMS UNIT 4 Part 1

UNIT - IV Query Processing, Query optimization

Query Processing: Overview, Measures of Query cost, Selection operation, sorting, Join Operation,
other operations, Evaluation of Expressions.
Query optimization: Overview, Transformation of Relational Expressions, Estimating statistics of
Expression results, Choice of Evaluation Plans, Materialized views, Advanced Topics in Query
Optimization.

Overview of Query Processing:


 Query Processing is the activity of extracting data from the database. Query
processing takes several steps to fetch the data from the database. The steps
involved are:
1. Parsing and translation
2. Optimization
3. Evaluation

Parsing and translation:


 Translate the query into its internal form. This is then translated into relational algebra.
 Parser checks syntax, validates relations, attributes and access permissions
 The parser creates a tree of the query, known as a 'parse tree', which is then translated
into relational algebra form.
Example:
 In SQL, suppose a user wants to fetch the salaries of employees earning more than
10000. For doing this, the following query is issued:
select salary from Employee where salary > 10000;
 Thus, to make the system understand the user query, it needs to be translated into
relational algebra. This query can be expressed in two equivalent relational algebra forms:
o σsalary>10000 (πsalary (Employee))
o πsalary (σsalary>10000 (Employee))
 After translating the given query, we can execute each relational algebra operation by
using different algorithms.
Evaluation:
 In addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and
evaluating each operation. Thus, after translating the user query, the system executes a
query evaluation plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to construct a query evaluation plan.
o The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
o Such relational algebra with annotations is referred to as Evaluation Primitives.
o Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query execution
plan.
o A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user
query.
Optimization:
o The cost of the query evaluation can vary for different types of queries. Although the
system is responsible for constructing the evaluation plan, the user need not write
their query efficiently.
o Usually, a database system generates an efficient query evaluation plan that minimizes
its cost. This task, performed by the database system, is known as Query
Optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
******END*****
Measures of Query cost:
 Cost is generally measured as total elapsed time for answering query
o Many factors contribute to time cost, for instance disk accesses, CPU, or even
network communication
 Typically disk access is the predominant cost, and is also relatively easy to estimate.
Measured by taking into account
Number of seeks * average-seek-cost
Number of blocks read * average-block-read-cost
Number of blocks written * average-block-write-cost
 For simplicity we just use the number of block transfers from disk and the number of
seeks as the cost measures
o tT – time to transfer one block
o tS – time for one seek
o Cost for b block transfers plus S seeks
b * tT + S * tS
 We ignore CPU costs for simplicity
 Costs of algorithms depend on the size of the buffer in main memory, as having more
memory reduces need for disk access. Thus memory size should be a parameter while
estimating cost; often use worst case estimates.
 We refer to the cost estimate of algorithm A as EA. We do not include cost of writing
output to disk
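As an illustration, the cost formula b * tT + S * tS can be coded directly. This is a minimal sketch, not from the source; the timing constants below are invented example values, not measurements.

```python
# Hypothetical cost-model helper; t_transfer and t_seek are made-up
# example values (ms per block transfer, ms per seek).
def query_cost(block_transfers, seeks, t_transfer=0.1, t_seek=4.0):
    """Estimated cost for b block transfers plus S seeks: b * tT + S * tS."""
    return block_transfers * t_transfer + seeks * t_seek

# e.g. reading 100 blocks sequentially after a single seek:
cost = query_cost(100, 1)   # 100 * 0.1 + 1 * 4.0 = 14.0
```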
*****END*****
Selection operation:
 Selection operation using basic scan algorithms are as follows
File scan:
 In query processing, the file scan is the lowest-level operator to access data. File scans
are search algorithms that locate and retrieve records that fulfill a selection condition.
 In relational systems, a file scan allows an entire relation to be read in those cases where
the relation is stored in a single, dedicated file.
Algorithm A1 (linear search): scan each file block and test all records to see whether they
satisfy the selection condition. Cost: tS + br ∗ tT (one initial seek plus br block transfers).
If the selection is an equality on a key attribute, the scan can stop as soon as the record
is found, for an average cost of tS + (br/2) ∗ tT.
Algorithm A2 (binary search): applicable when the selection is an equality comparison on
the attribute on which the file is ordered. Cost: ⌈log2(br)⌉ ∗ (tT + tS).
Algorithm A4 (secondary index, equality on key/Non-key): use the index to retrieve a single
record if the search key is a candidate key, or multiple records otherwise; each retrieved
record may require a separate I/O, since the records may lie on different blocks.
Algorithm A5 (primary index, comparison): for a condition such as σA≥v(r), use the index to
find the first tuple with A ≥ v and scan the relation sequentially from that point.
A6 (secondary index, comparison): use the index to find the matching index entries, then
fetch each pointed-to record individually.
Cost: (hi + n) ∗ (tT + tS), where hi is the height of the index and n is the number of
records fetched.
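The basic scan algorithms can be sketched in Python: the linear scan corresponds to A1 and the binary search on an ordered file to A2. This is a toy sketch, not from the source; the data and attribute names are invented for illustration.

```python
import bisect

def linear_search(records, attr, value):
    """A1: scan every record; works on any file and any condition."""
    return [r for r in records if r[attr] == value]

def binary_search_eq(sorted_keys, records, value):
    """A2: binary search on a file ordered by a key, equality on that key."""
    i = bisect.bisect_left(sorted_keys, value)
    if i < len(sorted_keys) and sorted_keys[i] == value:
        return [records[i]]
    return []

# invented example data; the file is "ordered" on id
employees = [{"id": 1, "salary": 9000}, {"id": 2, "salary": 12000}]
keys = [e["id"] for e in employees]
assert linear_search(employees, "salary", 12000) == [employees[1]]
assert binary_search_eq(keys, employees, 2) == [employees[1]]
```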


Implementation of Complex Selections

*****END******
Sorting:
 Why sorting?
 SQL queries can specify that the output be sorted.
 Several of the relational operations, such as joins, can be implemented efficiently if
the input relations are first sorted.
 We may build an index on the relation, and then use the index to read the relation in
sorted order. May lead to one disk block access for each tuple.
 For relations that fit in memory, techniques like quick sort can be used.
 For relations that don’t fit in memory, external sort-merge is a good choice.
External Sort-Merge Algorithm:
 Sorting of relations that do not fit in memory is called external sorting. The most
commonly used technique for external sorting is the external sort–merge algorithm.
 Let M denote the number of blocks in the main memory buffer available for sorting
1. In the first stage, a number of sorted runs are created; each run is sorted but contains only
some of the records of the relation.
i = 0;
repeat
read M blocks of the relation into memory
sort the in-memory part of the relation;
write the sorted data to run file Ri;
i = i + 1;
until the end of the relation
2. In the second stage, the runs are merged. Suppose, for now, that the total number of
runs, N, is less than M, so that we can allocate one block to each run and have space left
to hold one block of output. The merge stage operates as follows:
read one block of each of the N files Ri into a buffer block in memory;
repeat
o choose the first tuple (in sort order) among all buffer blocks;
o write the tuple to the output, and delete it from the buffer block;
o if the buffer block of any run Ri is empty and not end-of-file(Ri)
then read the next block of Ri into the buffer block;
until all input buffer blocks are empty
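The two stages above can be sketched in Python, with in-memory lists standing in for disk blocks and run files. As a simplification of the block-based algorithm, M here is a record count rather than a block count.

```python
import heapq

def external_sort_merge(relation, M):
    """Sketch of external sort-merge; lists stand in for run files on disk.
    M = number of records that fit in the memory buffer (simplified)."""
    # Stage 1: create sorted runs of at most M records each.
    runs = [sorted(relation[i:i + M]) for i in range(0, len(relation), M)]
    # Stage 2: N-way merge of the runs (assumes the number of runs N < M,
    # so one buffer block per run plus one output block would suffice).
    return list(heapq.merge(*runs))

data = [24, 19, 31, 33, 14, 16, 21, 3, 2, 7]
assert external_sort_merge(data, 3) == sorted(data)
```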
Example: External Sorting Using Sort-Merge:
[Figure: external sorting using sort–merge. The initial unsorted relation is first split
into sorted runs (create runs); the runs are then merged in merge pass–1 and again in
merge pass–2 to produce the fully sorted output.]
Cost of External Merge Sort: Block Transfers
 Run generation and every merge pass read and write each block of the relation once, so
the total is br (2⌈log⌊M/bb⌋−1(br / M)⌉ + 1) block transfers, where ⌈br / M⌉ initial runs
are merged ⌊M/bb⌋ − 1 at a time (the final write of the result is not counted).
Cost of External Merge Sort: Seeks
 Run generation needs 2⌈br / M⌉ seeks (one per read and one per write of each run);
each merge pass needs about 2⌈br / bb⌉ seeks when bb buffer blocks are read or
written at a time, giving 2⌈br / M⌉ + ⌈br / bb⌉ (2⌈log⌊M/bb⌋−1(br / M)⌉ − 1) seeks in total.
*****END*****
Join Operation:
 Several different algorithms to implement joins
o Nested-loop join
o Block nested-loop join
o Indexed nested-loop join
o Merge-join
o Hash-join
Nested-Loop Join:
Algorithm:
 r is called the outer relation and s the inner relation of the join.
 Requires no indices and can be used with any kind of join condition.
 Expensive since it examines every pair of tuples in the two relations. If the smaller
relation fits entirely in main memory, use that relation as the inner relation.
Cost for Nested-Loop Join:
 The cost of the nested-loop join algorithm.
o The number of pairs of tuples to be considered is nr * ns where nr denotes the number
of tuples in r, and ns denotes the number of tuples in s.
 For each record in r, complete scan on s is performed.
 In the worst case, if there is enough memory only to hold one block of each relation, the
estimated cost is nr * bs + br disk accesses.
 In the best case, the smaller relation fits entirely in memory; use it as the inner relation.
This reduces the cost estimate to br + bs disk accesses.
Example:
 Number of records of student: nstudent = 5000.
 Number of blocks of student: bstudent = 100.
 Number of records of takes: ntakes = 10, 000.
 Number of blocks of takes: btakes = 400.
o student as the outer relation and takes as the inner relation
o Assuming the worst case memory availability scenario, cost estimate will be
5000 *400 + 100 = 2, 000, 100 disk accesses
o If the smaller relation (student) fits entirely in memory, the cost estimate will
be 400 + 100 = 500 disk accesses.
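The nested-loop join can be sketched in a few lines of Python. This is a toy sketch; the student/takes tuples below are invented for illustration.

```python
def nested_loop_join(r, s, theta):
    """Nested-loop join: for each tuple tr of the outer relation r,
    scan the entire inner relation s and emit pairs satisfying theta."""
    return [tr + ts for tr in r for ts in s if theta(tr, ts)]

# invented example data
student = [(1, "Ann"), (2, "Bob")]
takes = [(1, "CS101"), (2, "CS102"), (1, "CS103")]
result = nested_loop_join(student, takes, lambda tr, ts: tr[0] == ts[0])
assert len(result) == 3
```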
Block Nested-Loop Join:
Algorithm:

 The primary difference between Nested-Loop join and Block Nested-Loop join is that each
block of the inner relation s is read only once for each block of the outer relation (instead
of once for each tuple of the outer relation)
 Worst case estimate Block Nested-Loop join is: br * bs + br block accesses.
 Best case for Block Nested-Loop join is: br + bs block accesses.
Example:
 Number of records of student: nstudent = 5000.
 Number of blocks of student: bstudent = 100.
 Number of records of takes: ntakes = 10, 000.
 Number of blocks of takes: btakes = 400.
o student as the outer relation and takes as the inner relation
o Assuming the worst case memory availability scenario, cost estimate will be
100 *400 + 100 = 40, 100 disk accesses
o If the smaller relation (student) fits entirely in memory, the cost estimate will
be 400 + 100 = 500 disk accesses.
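A sketch of the block-based variant, with nested lists standing in for disk blocks (data invented for illustration):

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Block nested-loop join: the inner relation s is scanned once
    per *block* of the outer relation r, not once per tuple."""
    out = []
    for r_blk in r_blocks:              # one outer block in memory at a time
        for s_blk in s_blocks:          # one inner pass per outer block
            for tr in r_blk:
                for ts in s_blk:
                    if theta(tr, ts):
                        out.append(tr + ts)
    return out

# invented example: two "blocks" of student, one "block" of takes
r_blocks = [[(1, "Ann"), (2, "Bob")], [(3, "Cid")]]
s_blocks = [[(1, "CS101"), (3, "CS102")]]
matches = block_nested_loop_join(r_blocks, s_blocks, lambda a, b: a[0] == b[0])
assert len(matches) == 2
```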
Indexed nested-loop join:
 In a nested-loop join, if an index is available on the inner loop’s join attribute, index
lookups can replace file scans.
 For each tuple tr in the outer relation r, the index is used to look up tuples in s that will
satisfy the join condition with tuple tr.
 It can be used with existing indices, as well as with temporary indices created for the sole
purpose of evaluating the join.
 For each tuple in outer relation r, a lookup is performed on the index s and relevant
tuples are retrieved.
 Looking up tuples in s that will satisfy the join conditions with a given tuple tr is
essentially a selection on s
 Worst case: buffer has space for only one page of r and one page of the index.
o br disk accesses are needed to read relation r , and, for each tuple in r , we
perform an index lookup on s.
o Cost of the join: br + nr * c, where c is the cost of a single selection on s using
the join condition.
 If indices are available on both r and s, use the one with fewer tuples as the outer relation.

Example:
 Number of records of student: nstudent = 5000.
 Number of blocks of student: bstudent = 100.
 Number of records of takes: ntakes = 10, 000.
 Number of blocks of takes: btakes = 400.
o nstudent as outer relation and ntakes as inner realtion
o Assuming the worst case memory availability scenario, cost estimate will be
100 +5000 * 5 = 25, 100 disk accesses
Merge-join
 The merge-join algorithm (also called the sort-merge-join algorithm) can be used to compute
natural joins and equi-joins. Let r(R) and s(S) be the relations whose natural join is to be
computed, and let R ∩ S denote their common attributes.
1. First sort both relations on their join attribute (if not already sorted on the join attributes).
2. Merge the sorted relations to join them
o Join step is similar to the merge stage of the sort-merge algorithm.
o Main difference is handling of duplicate values in join attribute — every pair with same
value on join attribute must be matched

Cost Analysis:
 Each block needs to be read only once (assuming all tuples for any given value of the join
attributes fit in memory)
 Thus the cost of merge join is:
bR + bS block transfers + (⌈bR / bb⌉ + ⌈bS / bb⌉) seeks, plus the cost of sorting if the
relations are unsorted, where
nR = number of tuples of R
bR = number of blocks containing tuples of R
bb = number of buffer blocks allocated for each relation
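The merge step can be sketched in Python, assuming both inputs are already sorted on the join attribute (here the first tuple position); the data is invented for illustration. Note how every pair sharing a join value is matched, which is the main difference from the merge stage of sorting.

```python
def merge_join(r, s):
    """Merge-join on the first attribute of each tuple; both inputs
    must already be sorted on it."""
    out = []
    i = 0
    for tr in r:
        # advance past s-tuples with a smaller join value
        while i < len(s) and s[i][0] < tr[0]:
            i += 1
        # match tr with every s-tuple sharing its join value (duplicates)
        j = i
        while j < len(s) and s[j][0] == tr[0]:
            out.append(tr + s[j])
            j += 1
    return out

student = [(1, "Ann"), (2, "Bob")]                    # sorted on ID
takes = [(1, "CS101"), (1, "CS103"), (2, "CS102")]    # sorted on ID
assert merge_join(student, takes) == [
    (1, "Ann", 1, "CS101"), (1, "Ann", 1, "CS103"), (2, "Bob", 2, "CS102")]
```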
hybrid merge-join:
1. hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index
on the join attribute
o Merge the sorted relation with the leaf entries of the B+-tree.
o Sort the result on the addresses of the unsorted relation’s tuples
o Scan the unsorted relation in physical address order and merge with previous
result, to replace addresses by the actual tuples
Hash-join
 A hash function h is used to partition tuples of both relations into sets that have the same
hash value on the join attributes, as follows:
Hash–Join algorithm
The hash-join of r and s is computed as follows.
1. Partition the relation s using hash function h. When partitioning a relation, one block
of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
3. For each i :
a) Load Hsi into memory and build an in-memory hash index on it using the join
attribute. This hash index uses a different hash function than the earlier one h.
b) Read the tuples in Hri from disk one by one. For each tuple tr locate each matching
tuple ts in Hsi using the in-memory hash index. Output the concatenation of their
attributes.
 Relation s is called the build input and r is called the probe input.
Recursive partitioning:
 Recursive partitioning required if number of partitions n is greater than number of
pages M of memory.
 Instead of partitioning n ways, use M – 1 partitions for s
 Further partition the M – 1 partitions using a different hash function
 Use same partitioning method on r
Cost of Hash–Join
 If recursive partitioning is not required: 3(br + bs) + 4 ∗ nh block transfers, where nh is
the number of partitions (the 4nh overhead for partially filled blocks is usually small).
 If recursive partitioning is required, the number of passes required for partitioning s is
⌈logM−1(bs)⌉ − 1. This is because each final partition of s should fit in memory.
 The number of partitions of probe relation r is the same as that for build relation s;
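The partition/build/probe structure above can be sketched in Python, with lists standing in for on-disk partitions; the data and partition count are invented for illustration, and recursive partitioning is omitted.

```python
def hash_join(r, s, nh=4):
    """Hash-join sketch: s is the build input, r the probe input."""
    h = lambda t: hash(t[0]) % nh           # partitioning hash function
    r_parts = [[] for _ in range(nh)]
    s_parts = [[] for _ in range(nh)]
    for t in r:                             # step 2: partition r
        r_parts[h(t)].append(t)
    for t in s:                             # step 1: partition s
        s_parts[h(t)].append(t)
    out = []
    for ri, si in zip(r_parts, s_parts):    # step 3: per-partition join
        index = {}                          # in-memory hash index on Hsi
        for ts in si:                       # build phase
            index.setdefault(ts[0], []).append(ts)
        for tr in ri:                       # probe phase
            for ts in index.get(tr[0], []):
                out.append(tr + ts)
    return out

student = [(1, "Ann"), (2, "Bob")]
takes = [(1, "CS101"), (2, "CS102"), (1, "CS103")]
assert len(hash_join(student, takes)) == 3
```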
******END*******

Other Operations
Duplicate elimination:
 Duplicate elimination can be implemented via hashing or sorting.
 On sorting duplicates will come adjacent to each other, and all but one of a set of
duplicates can be deleted.
 Optimization: duplicates can be deleted during run generation as well as at intermediate
merge steps in external sort-merge.
 Hashing is similar – duplicates will come into the same bucket.
Projection:
 Projection is implemented by performing projection on each tuple followed by duplicate
elimination.
Aggregation:
 Aggregation can be implemented in a manner similar to duplicate elimination.
o Sorting or hashing can be used to bring tuples in the same group together, and
then the aggregate functions can be applied on each group.
o Optimization: combine tuples in the same group during run generation and
intermediate merges, by computing partial aggregate values.
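Hash-based aggregation can be sketched in Python; the group keys and values are invented for illustration, and `sum` stands in for any aggregate function.

```python
def hash_aggregate(tuples, agg=sum):
    """Group-by via hashing: tuples of (group_key, value) are brought
    into the same bucket, then the aggregate is applied per group."""
    groups = {}
    for key, value in tuples:
        groups.setdefault(key, []).append(value)
    return {k: agg(v) for k, v in groups.items()}

rows = [("cs", 10), ("ee", 5), ("cs", 7)]     # invented example data
assert hash_aggregate(rows) == {"cs": 17, "ee": 5}
```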
Set operations:
 Set operations (∪, ∩ and −): can use either a variant of merge-join after sorting, or a
variant of hash-join.
Outer Join

****END****
Evaluation of Expressions
 The obvious way to evaluate an expression is simply to evaluate one operation at a time,
in an appropriate order.
 To evaluate an expression, two approaches are used: 1. materialization and 2. pipelining.
Materialization
 In materialized evaluation, the result of each operation is materialized in a temporary
relation for subsequent use. A disadvantage of this approach is the need to construct the
temporary relations, which (unless they are small) must be written to disk.
Pipelining
 In pipelined evaluation, several operations are evaluated simultaneously, with the results
of one operation passed on to the next, without the need to store a temporary relation.
 E.g., in the expression tree for the earlier query, don't store the result of the selection;
instead, pass tuples directly to the join. Similarly, don't store the result of the join;
pass tuples directly to the projection.
 Much cheaper than materialization: no need to store a temporary relation to disk.
 Pipelining may not always be possible – e.g., sort, hash-join.
 For pipelining to be effective, use evaluation algorithms that generate output tuples even
as tuples are received for inputs to the operation.
 Pipelines can be executed in two ways: demand driven and producer driven
In demand-driven or lazy evaluation
 The system repeatedly requests the next tuple from the top-level operation
 Each operation requests the next tuple from its children operations as required, in order
to output its next tuple
 In between calls, an operation has to maintain "state" so it knows what to return next
Implementation of demand-driven pipelining
 Each operation is implemented as an iterator implementing the following operations
open()
o E.g. file scan: initialize file scan
 state: pointer to beginning of file
o E.g.merge join: sort relations;
 state: pointers to beginning of sorted relations
next()
o E.g. for file scan: Output next tuple, and advance and store file pointer
o E.g. for merge join: continue with merge from earlier state till next output tuple is
found. Save pointers as iterator state.
close()
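The open()/next()/close() iterator interface can be sketched in Python with two toy operators, a file scan and a selection; the class and attribute names are hypothetical, chosen for illustration.

```python
class FileScan:
    """Leaf operator; state is a position in the 'file' (here a list)."""
    def __init__(self, records):
        self.records = records
    def open(self):
        self.pos = 0                       # state: pointer to start of file
    def next(self):
        if self.pos >= len(self.records):
            return None                    # end of relation
        t = self.records[self.pos]
        self.pos += 1                      # advance and store file pointer
        return t
    def close(self):
        self.pos = None

class Select:
    """Pulls tuples from its child only on demand (lazy evaluation)."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()
    def next(self):
        while (t := self.child.next()) is not None:
            if self.pred(t):
                return t
        return None
    def close(self):
        self.child.close()

plan = Select(FileScan([5, 12, 7, 20]), lambda t: t > 10)
plan.open()
out = []
while (t := plan.next()) is not None:      # top level pulls tuples one by one
    out.append(t)
plan.close()
assert out == [12, 20]
```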

In producer-driven or eager pipelining


 Operators produce tuples eagerly and pass them up to their parents
o Buffer maintained between operators, child puts tuples in buffer, parent removes
tuples from buffer
o if buffer is full, child waits till there is space in the buffer, and then generates
more tuples
 System schedules operations that have space in output buffer and can process more input
tuples
Implementation of Producer-driven pipelining
 Producer-driven pipelines, on the other hand, are implemented in a different manner.
 For each pair of adjacent operations in a producer-driven pipeline, the system creates a
buffer to hold tuples being passed from one operation to the next.
 The processes or threads corresponding to different operations execute concurrently.
Each operation at the bottom of a pipeline continually generates output tuples, and puts
them in its output buffer, until the buffer is full.
 Once the output buffer is full, the operation waits until its parent operation removes
tuples from the buffer so that the buffer has space for more tuples.
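The bounded-buffer interaction between a child producer and its parent consumer can be sketched with Python threads and a queue; the function and buffer names are hypothetical, chosen for illustration.

```python
import queue
import threading

def producer_driven(records, pred, buf_size=2):
    """Producer-driven pipeline sketch: the child eagerly pushes tuples
    into a bounded buffer; the parent removes and filters them."""
    buf = queue.Queue(maxsize=buf_size)    # buffer between the two operators
    SENTINEL = object()                    # marks end of the child's output

    def producer():                        # child operator, runs concurrently
        for t in records:
            buf.put(t)                     # blocks while the buffer is full
        buf.put(SENTINEL)

    threading.Thread(target=producer).start()
    out = []
    while (t := buf.get()) is not SENTINEL:   # parent consumes from buffer
        if pred(t):
            out.append(t)
    return out

assert producer_driven([5, 12, 7, 20], lambda t: t > 10) == [12, 20]
```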
