Query Processing
Distributed Query Processing
◦ A distributed database query is processed in the following stages:
1. Query Mapping:
◦ The input query on distributed data is specified formally using a query language.
◦ It is then translated into an algebraic query on global relations.
◦ This translation is done by referring to the global conceptual schema.
◦ This translation is largely identical to the one performed in a centralized DBMS.
◦ The query is first normalized, analyzed for semantic errors, simplified, and finally restructured into an algebraic
query.
2. Localization:
◦ This stage maps the distributed query on the global schema to separate queries on individual fragments
using data distribution and replication information.
3. Global Query Optimization:
◦ Optimization consists of selecting, from a list of candidate strategies, the one that is closest to optimal.
◦ A list of candidate queries can be obtained by permuting the ordering of operations within a fragment
query generated by the previous stage.
◦ The total cost is a weighted combination of costs such as CPU cost, I/O costs, and communication costs.
4. Local Query Optimization:
◦ This stage is common to all sites in the DDB.
◦ The techniques are similar to those used in centralized systems.
The first three stages discussed above are performed at a central control site, while the last stage is
performed locally.
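The weighted cost combination used in global optimization can be sketched as follows. This is only a minimal illustration, not the method of any particular DDBMS: the `Strategy` class, the weight parameters, and the sample cost figures are all hypothetical and chosen just to show how the choice of weights changes which candidate is selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A candidate execution strategy with estimated costs (hypothetical units)."""
    name: str
    cpu: float
    io: float
    comm: float


def total_cost(s, w_cpu=1.0, w_io=1.0, w_comm=1.0):
    """Weighted combination of CPU, I/O, and communication costs."""
    return w_cpu * s.cpu + w_io * s.io + w_comm * s.comm


def choose_strategy(candidates, **weights):
    """Return the candidate with the lowest weighted total cost."""
    return min(candidates, key=lambda s: total_cost(s, **weights))


# Made-up estimates: A is cheap locally but ships more data over the network.
candidates = [
    Strategy("A", cpu=5.0, io=20.0, comm=40.0),
    Strategy("B", cpu=30.0, io=30.0, comm=10.0),
]

equal_weights_choice = choose_strategy(candidates)
slow_network_choice = choose_strategy(candidates, w_comm=10.0)
```

With equal weights the sketch picks strategy A; when communication is weighted ten times more heavily, as it might be on a slow network, it picks strategy B.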
Data Transfer Costs of Distributed Query Processing
◦ In a distributed system, several additional factors further complicate query processing.
◦ The first is the cost of transferring data over the network.
◦ This data includes intermediate files that are transferred to other sites for further processing, as well as
the final result files that may have to be transferred to the site where the query result is needed.
◦ These costs may not be very high if the sites are connected via a high-performance local area network, but
they become quite significant in other types of networks.
◦ DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an
optimization criterion in choosing a distributed query execution strategy.
Example:
◦ Suppose that the EMPLOYEE and DEPARTMENT relations are stored at two different sites: EMPLOYEE at site 1
(1,000,000 bytes) and DEPARTMENT at site 2 (3,500 bytes).
◦ Suppose that the query result contains 10,000 records, each 40 bytes long.
◦ The query is submitted at a distinct site 3, which is called the result site because the query result is
needed there.
◦ Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3.
◦ There are three simple strategies for executing this distributed query:
1. First:
◦ Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at
site 3.
◦ In this case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred.
2. Second:
◦ Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.
◦ The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes
must be transferred.
3. Third:
◦ Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3.
◦ In this case, 400,000 + 3,500 = 403,500 bytes must be transferred.
*If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3.
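The arithmetic for the three strategies can be checked with a short script. This is a sketch that only restates the figures given in the example; the variable names are chosen here for illustration.

```python
# Sizes taken from the example above, in bytes.
EMPLOYEE_BYTES = 1_000_000      # EMPLOYEE relation, stored at site 1
DEPARTMENT_BYTES = 3_500        # DEPARTMENT relation, stored at site 2
RESULT_BYTES = 40 * 10_000      # 10,000 result records of 40 bytes each

# Bytes transferred over the network by each strategy.
transfer_bytes = {
    1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
    2: EMPLOYEE_BYTES + RESULT_BYTES,      # ship EMPLOYEE to site 2, result to site 3
    3: DEPARTMENT_BYTES + RESULT_BYTES,    # ship DEPARTMENT to site 1, result to site 3
}

# Pick the strategy that moves the least data.
best = min(transfer_bytes, key=transfer_bytes.get)

for strategy, size in transfer_bytes.items():
    print(f"Strategy {strategy}: {size:,} bytes")
print(f"Least data transfer: strategy {best}")
```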
◦ Now consider another query Q:
◦ For each department, retrieve the department name and the name of the department manager.
◦ This can be stated as follows in the relational algebra: