Database_Technologies_ch3
DATABASE TECHNOLOGIES
Parallel and Distributed Databases
Department of Computer Science and Engineering
1) Here, all processors have their own memory and their own disk or disks, as in the figure.
2) All communication is via the network, from processor to processor.
3) For example, if one processor P wants to read tuples from the disk of another processor
Q, then processor P sends a message to Q asking for the data.
4) Processor Q obtains the tuples from its disk and ships them over the network in another
message, which is received by P.
5) As we mentioned, the shared-nothing architecture is the most commonly used
architecture for database systems.
6) Shared-nothing machines are relatively inexpensive to build; one buys racks of commodity
machines and connects them with the network connection that is typically built into the
rack. Multiple racks can be connected by an external network.
Parallel Databases - 3) Shared-Nothing Machines (Cont’d)
8) But when we design algorithms for these machines we must be aware that it is costly to
send data from one processor to another.
9) Normally, data must be sent between processors in a message, which has considerable
overhead associated with it. Both processors must execute a program that supports the
message transfer, and there may be contention or delays associated with the
communication network as well.
10) Typically, the cost of a message can be broken into a large fixed overhead plus a small
amount of time per byte transmitted.
11) Thus, there is a significant advantage to designing a parallel algorithm so that
communications between processors involve large amounts of data sent at once.
12) For instance, we might buffer several blocks of data at processor P, all bound for processor
Q. If Q does not need the data immediately, it may be much more efficient to wait until we
have a long message at P and then send it to Q.
13) Fortunately, the best known parallel algorithms for database operations can use long
messages effectively.
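To make the cost argument concrete, here is a small illustrative sketch (the overhead and per-byte figures are assumptions, not from the source) of the message cost model in point 10 and of how batching several blocks into one long message amortizes the fixed overhead:

```python
# Illustrative sketch: cost(n) = fixed_overhead + n * per_byte_time,
# and why batching blocks into long messages pays off.

FIXED_OVERHEAD = 1e-3      # assumed: seconds of setup cost per message
PER_BYTE_TIME  = 1e-9      # assumed: seconds to transmit one byte

def message_cost(num_bytes):
    """Cost of sending a single message of num_bytes."""
    return FIXED_OVERHEAD + num_bytes * PER_BYTE_TIME

def total_cost(block_size_bytes, num_blocks, blocks_per_message):
    """Total cost of shipping num_blocks blocks from P to Q,
    batching blocks_per_message blocks into each message."""
    messages = -(-num_blocks // blocks_per_message)   # ceiling division
    return messages * message_cost(block_size_bytes * blocks_per_message)

# Sending 1000 blocks of 4 KB one at a time vs. 100 at a time:
print(total_cost(4096, 1000, 1))     # ~1.004 s, dominated by fixed overhead
print(total_cost(4096, 1000, 100))   # ~0.014 s, overhead amortized
```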
Parallel Databases - 3) Shared-Nothing Machines
14) Parallel databases using the shared-nothing architecture are relatively inexpensive to build: today, commodity processors are connected in this fashion on a rack, and several racks can be connected by an external network.
15) Each processor has its own memory and disk storage.
16) The shared-nothing architecture affords the possibility of achieving parallelism in query
processing at three levels, which we will discuss below:
1) individual operator parallelism,
2) intraquery parallelism, and
3) interquery parallelism.
Parallel Databases - 3) Shared-Nothing Machines
Studies have shown that by allocating more processors and disks, linear speed-up (a linear reduction in the time taken for operations) is possible.
There are two types of query parallelism: interquery parallelism and intraquery parallelism.
1) Interquery Parallelism
Interquery parallelism refers to the ability of the database to accept queries from multiple
applications at the same time. Each query runs independently of the others, but the database
manager runs all of them at the same time. Db2 database products have always supported this
type of parallelism.
2) Intraquery Parallelism
Intraquery parallelism refers to the simultaneous processing of parts of a single query, using
either intrapartition parallelism, interpartition parallelism, or both.
Parallel Databases - Query Execution
• We have discussed how each individual operation may be executed by distributing the data among
multiple processors and performing the operation in parallel on those processors.
• To achieve a parallel execution of a query, one approach is to use a parallel algorithm for each
operation involved in the query, with appropriate partitioning of the data input to that operation.
• Another opportunity to parallelize comes from the evaluation of an operator tree where some of the
operations may be executed in parallel because they do not depend on one another.
• These operations may be executed on separate processors. If the output of one of the operations
can be generated tuple-by-tuple and fed into another operator, the result is pipelined parallelism.
• An operator that does not produce any output until it has consumed all its inputs is said to block the
pipelining.
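As an illustration of the last two points, here is a minimal sketch (the relation and operators are assumed toy examples, not from the source) using Python generators: the selection emits tuples as soon as they arrive and can therefore be pipelined, whereas the sort must consume its entire input before producing anything, so it blocks the pipeline.

```python
# Contrast a pipelined operator with a blocking one using generators.

def scan(table):
    """Leaf operator: produce tuples one at a time."""
    for row in table:
        yield row

def select(pred, child):
    """Pipelined operator: emits each qualifying tuple as soon as it arrives."""
    for row in child:
        if pred(row):
            yield row

def sort(key, child):
    """Blocking operator: cannot emit anything until it has consumed
    its entire input, so it breaks the pipeline."""
    rows = list(child)          # drain the whole input first
    rows.sort(key=key)
    for row in rows:
        yield row

emp = [(1, "A", 300), (2, "B", 100), (3, "C", 200)]
plan = sort(key=lambda r: r[2],
            child=select(lambda r: r[2] >= 150, scan(emp)))
print(list(plan))               # [(3, 'C', 200), (1, 'A', 300)]
```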
Parallel Databases - Interquery Parallelism
5) We can speed up query execution by performing individual operations, such as sorting, selection, projection, join, and aggregation, using their parallel execution.
6) We may achieve further speed-up by executing parts of the query tree that are
independent in parallel on different processors.
7) However, it is difficult to achieve interquery parallelism in shared-nothing parallel
architectures or shared-disk architectures.
8) One area where the shared-disk architecture has an edge is its more general applicability: unlike the shared-nothing architecture, it does not require the data to be stored in a partitioned manner.
Map Reduce
MapReduce Processing - Advantages
1. The model is easy to use since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
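A toy, single-process word-count sketch (an assumed example, not taken from the source) shows what the user actually writes under the MapReduce model: only the map and reduce functions, while a real framework would supply the parallelization, fault tolerance, locality optimization, and load balancing mentioned above.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1                      # intermediate (key, value) pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                # combine all values for one key

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):        # map phase
            groups[ik].append(iv)          # shuffle / group by key
    out = []
    for ik, ivs in groups.items():         # reduce phase
        out.extend(reduce_fn(ik, ivs))
    return out

docs = [("d1", "parallel databases scale"), ("d2", "parallel queries scale")]
print(sorted(run_mapreduce(docs, map_fn, reduce_fn)))
# [('databases', 1), ('parallel', 2), ('queries', 1), ('scale', 2)]
```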
Parallel Algorithms on Relations
Distributed Databases
Data is stored across several sites, each site managed by a DBMS capable of running independently.
Distributed Database Concepts
1. Management of distributed data with different levels of transparency: hiding the details
of where each file (table, relation) is physically stored within the system.
2. The possible types of transparency are:
a) Distribution or network transparency – freedom for the user from the operational
details of the network. It can be divided into location and naming transparency.
b) Replication transparency – makes the user unaware of the existence of copies.
c) Fragmentation transparency – makes the user unaware of the existence of
fragments.
Advantages of Distributed Databases
2. Increased reliability and availability: two of the most commonly cited potential advantages of DDBs:
a. Reliability – the probability that a system is running (not down) at a certain point in time
b. Availability – the probability that the system is continuously available during a time interval
3. Improved performance: a DDBMS fragments the database by keeping the data closer to where
it is needed most (data localization reduces CPU and I/O).
4. Easier expansion: adding more data, increasing database sizes, or adding more processors is
much easier in a DDB.
Additional Functions of DDB
1) Keeping track of data: the ability to keep track of the data distribution, fragmentation, and
replication,
2) Distributed query processing: the ability to access remote sites via a communication network,
3) Distributed transaction management: the ability to execute transactions that access data from
many sites,
4) Replicated data management: the ability to decide which copy of a replicated data item to access and to maintain the consistency of copies of replicated data items,
5) Distributed database recovery: the ability to recover from individual site crashes (or
communication failures),
6) Security: proper management of the security of the data and the authorization/access
privileges of users,
7) Distributed directory (catalog) management: design and policy issues for the placement of
directory.
Data Fragmentation
• The techniques used to break up a database into logical units are collectively called fragmentation.
• Information concerning data fragmentation, allocation, and replication is stored in a global DB.
• A horizontal fragment (or shard) of a relation is a subset of the tuples in the relation (e.g., DNO = 5).
• Horizontal fragmentation is also known as sharding or horizontal partitioning.
• A vertical fragment of a relation keeps only a subset of the attributes of a relation (e.g., SSN, AGE).
• A mixed (hybrid) fragment of a relation combines horizontal and vertical fragmentation.
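A small sketch (the relation and attribute names are illustrative assumptions) showing the three kinds of fragments on an EMPLOYEE-style relation:

```python
# Horizontal, vertical, and mixed fragmentation of a toy EMPLOYEE relation.

employee = [
    {"SSN": "111", "NAME": "Ann",  "AGE": 34, "DNO": 5},
    {"SSN": "222", "NAME": "Bob",  "AGE": 41, "DNO": 4},
    {"SSN": "333", "NAME": "Carl", "AGE": 29, "DNO": 5},
]

# Horizontal fragment (shard): a subset of the tuples, e.g. DNO = 5.
horizontal = [t for t in employee if t["DNO"] == 5]

# Vertical fragment: a subset of the attributes; the key (SSN) is kept so
# the original relation can be reconstructed by joining the fragments.
vertical = [{"SSN": t["SSN"], "AGE": t["AGE"]} for t in employee]

# Mixed (hybrid) fragment: apply both at once.
mixed = [{"SSN": t["SSN"], "AGE": t["AGE"]} for t in employee if t["DNO"] == 5]

print(horizontal)
print(vertical)
print(mixed)
```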
Data Replication
Distributed Transactions
Distributed Commit
Two-Phase Commit Protocol
Concurrency Control
2PC - PHASE 1:
1. The coordinator places a log record <Prepare T> on the log at its site.
2. The coordinator sends to each component’s site the message Prepare T.
3. Each site receiving the message Prepare T decides whether to commit or abort its component
of T.
4. If a site wants to commit its component, it must enter a state called pre-committed. Once in
the pre-committed state, the site cannot abort its component of T without a directive to do
so from the coordinator:
a. Perform whatever steps necessary to be sure the local component of T will not have to abort.
b. Place the record <Ready T> on the local log and flush the log to disk.
c. Send Ready T message to the coordinator.
5. If the site wants to abort its component of T, then it logs the record <Don’t Commit T> and
sends the message Don’t Commit T to the coordinator.
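A hedged sketch (class and method names are assumptions; this is not a full implementation) of how a participant site might react to the Prepare T message in Phase 1:

```python
class Participant:
    def __init__(self, log, coordinator):
        self.log = log                    # append-only local log
        self.coordinator = coordinator    # object with a receive(msg) method

    def on_prepare(self, t, can_commit):
        """Decide the local fate of transaction t's component."""
        if can_commit:
            # Enter the pre-committed state: from here the site may no
            # longer abort t on its own.
            self.log.append(f"<Ready {t}>")
            self.flush_log()                        # force the log to disk
            self.coordinator.receive(("Ready", t))
        else:
            self.log.append(f"<Don't Commit {t}>")
            self.flush_log()
            self.coordinator.receive(("Don't Commit", t))

    def flush_log(self):
        pass  # placeholder: force-write the log tail to stable storage
```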
Two-Phase Commit Protocol - Phase 2
1. Only if the coordinator has received Ready T from all components of T does it decide to commit T. The coordinator logs <Commit T> at its site and then sends the message Commit T to all sites involved in T.
2. If the coordinator has received Don’t Commit T from one or more sites, it logs <Abort T> at its site and then sends the Abort T message to all sites involved in T.
3. If a site receives a Commit T message, it commits the component of T at that
site, logging <Commit T> as it does.
4. If a site receives the message Abort T, it aborts the component of T at that site and writes the log record <Abort T>.
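A minimal coordinator-side sketch of Phase 2 under the same assumptions (names are illustrative, not the protocol's canonical API): the coordinator commits only if every site voted Ready, logs its decision, and then informs all sites.

```python
def phase2_decision(votes):
    """votes: dict mapping site -> 'Ready' or "Don't Commit"."""
    return "Commit" if all(v == "Ready" for v in votes.values()) else "Abort"

class Coordinator:
    def __init__(self, log, sites):
        self.log = log
        self.sites = sites                  # objects with a receive(msg) method

    def finish(self, t, votes):
        decision = phase2_decision(votes)
        self.log.append(f"<{decision} {t}>")    # log the decision first
        for site in self.sites:                  # then tell every site
            site.receive((decision, t))
        return decision
```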
Three-Phase Commit Protocol
• All concurrency mechanisms must preserve data consistency and complete each atomic action in finite time.
• Important capabilities are
Distributed Query Processing
• In a distributed system, the following must be taken into account while evaluating the cost:
1. The cost of data transmission over the network.
2. The potential gain in performance from having several sites process parts of the query
in parallel.
Semi Join
• Only the relevant part of each relation is shipped to the site of the other relation(s)
• Project the join attributes Y of S and send the result π_Y(S) to the site where R is
• Compute R ⋈ π_Y(S) there
• Send the result of R ⋈ π_Y(S) to the site where S is to compute R ⋈ S
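An illustrative sketch (relations, schemas, and site assignments are assumptions) of the semijoin strategy for computing R ⋈ S across two sites:

```python
R = [("r1", 1), ("r2", 2), ("r3", 5)]        # at site 1, schema (A, Y)
S = [(1, "s1"), (2, "s2"), (3, "s3")]        # at site 2, schema (Y, B)

# Step 1 (site 2): project the join attribute Y of S and ship it to site 1.
proj_Y_S = {y for (y, _) in S}               # small set of join values

# Step 2 (site 1): keep only the R-tuples that can possibly join (R ⋉ S).
R_semijoin = [(a, y) for (a, y) in R if y in proj_Y_S]

# Step 3: ship the (usually much smaller) R_semijoin back to site 2 and
# finish the join there.
result = [(a, y, b) for (a, y) in R_semijoin for (y2, b) in S if y == y2]
print(result)    # [('r1', 1, 's1'), ('r2', 2, 's2')]
```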
Distributed Locking
Primary-Copy Locking
• An improvement on the centralized locking approach, one which also allows replicated data, is to
distribute the function of the lock site, but still maintain the principle that each logical element has a
single site responsible for its global lock.
• This distributed-locking method, called primary copy, avoids the possibility that the central lock site will become a bottleneck, while still maintaining the simplicity of the centralized method.
• Each logical element X has one of its copies designated as the “primary copy.” In order to get a lock
on logical element X, a transaction sends a request to the site of the primary copy of X. The site of
the primary copy maintains an entry for X in its lock table and grants or denies the request as
appropriate.
• Most lock requests require three messages, except for those where the transaction and the primary
copy are at the same site.
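A hedged sketch (data structures and names are assumptions, not a production lock manager; exclusive locks only, for brevity) of the primary-copy idea: each logical element has one designated primary site whose lock table grants or denies requests for that element.

```python
class PrimaryCopyLockManager:
    def __init__(self, primary_site_of):
        # primary_site_of: dict mapping logical element -> its primary site id
        self.primary_site_of = primary_site_of
        # one lock table per site: element -> holding transaction (or None)
        self.lock_tables = {s: {} for s in set(primary_site_of.values())}

    def request_lock(self, txn, element):
        """Route the request to the primary copy's site and try to grant it."""
        site = self.primary_site_of[element]
        table = self.lock_tables[site]
        if table.get(element) in (None, txn):
            table[element] = txn
            return True                      # lock granted
        return False                         # lock denied (held by another txn)

    def release_lock(self, txn, element):
        site = self.primary_site_of[element]
        if self.lock_tables[site].get(element) == txn:
            self.lock_tables[site][element] = None

mgr = PrimaryCopyLockManager({"X": "site1", "Y": "site2"})
print(mgr.request_lock("T1", "X"))   # True
print(mgr.request_lock("T2", "X"))   # False until T1 releases X
```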
Peer to Peer Distributed Search
Consider the problem of searching records in a (very large) set of key-value pairs.
1. Associated with each key K is a value V. For example, K might be the identifier of a
document. V could be the document itself or it could be the set of nodes at which the
document can be found.
2. If the size of the key-value data is small, we could use a central node that holds the
entire key-value table
3. All nodes would query the central node when they wanted the value V associated with a
given key K
4. A pair of query-response messages would answer any lookup question for any node
5. Alternatively, we could replicate the entire table at each node, so there would be no
messages needed at all
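A toy sketch (purely illustrative) of the centralized alternative described above: one node holds the whole key-value table, and every lookup is a single query-response exchange with it.

```python
class CentralNode:
    def __init__(self):
        self.table = {}                 # the entire key-value table

    def put(self, key, value):
        self.table[key] = value

    def lookup(self, key):
        # One query message in, one response message out, per lookup.
        return self.table.get(key)

central = CentralNode()
central.put("doc-17", ["node3", "node9"])    # value: nodes holding the document
print(central.lookup("doc-17"))
```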
Distributed Hashing Problem
• The problem becomes more challenging when the key-value table is too large to be
handled by a single node
• Consider this problem, using the following constraints:
1. At any time, only one node among the peers knows the value associated with any
given key K.
2. The key-value pairs are distributed roughly equally among the peers.
3. Any node can ask the peers for the value V associated with a chosen key K. The value
of V should be obtained in a way such that the number of messages sent among the
peers grows much more slowly than the number of peers.
4. The amount of routing information needed at each node to help locate keys must
also grow much more slowly than the number of nodes.
Centralized Solutions for Distributed Hashing Problem
If the set of participants in the network is fixed or the set of participants changes slowly, then use a hash function h that hashes keys into node numbers. We place the key-value pair (K, V) at the node h(K).
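A tiny sketch (the hash function and node count are assumptions) of this static scheme: every peer applies the same hash to a key to find the node that stores the pair.

```python
import hashlib

N_NODES = 8                                   # assumed fixed number of peers

def h(key):
    """Hash a key (string) to a node number 0..N_NODES-1."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % N_NODES

# Place (K, V) at node h(K); every peer computes the same home node.
print(h("doc-42"))
```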
Chord Circles
1. Arrange the peers in a “chord circle.”
2. Each node knows its predecessor and successor around the circle
3. Nodes also have links to nodes located at an exponentially growing set of distances around
the circle
4. To place a node in the circle, hash its ID i and place it at position h(i). Refer to this node as N_h(i).
5. The successor of each node is the next higher one, clockwise around the circle.
1. The nodes are located around the circle using a hash function h that is capable of mapping both keys and node IDs to m-bit numbers, for some m.
2. Suppose m = 6; then there are 64 different locations for nodes around the circle.
3. Key-value pairs are also distributed around the circle using the hash function h.
4. If (K, V) is a key-value pair, then we compute h(K) and place (K, V) at the lowest-numbered node N_j such that h(K) ≤ j. For example, any (K, V) pair such that 42 < h(K) ≤ 48 would be stored at N_48. If h(K) is any of 57, 58, ..., 63, 0, 1, then (K, V) would be placed at N_1.
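A simplified sketch (the hash function and node IDs are assumptions; m = 6 as in the example above) of placing nodes on the circle and finding, via the successor rule, the node responsible for a key:

```python
import hashlib

M = 6                                          # number of bits in positions
RING = 2 ** M                                  # 64 positions on the circle

def h(x):
    """Hash node IDs and keys to an m-bit number."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big") % RING

# Positions of the participating nodes (hashed from assumed node IDs).
node_positions = sorted(h(node_id) for node_id in ["a", "b", "c", "d", "e"])

def successor(position):
    """Lowest-numbered node N_j with position <= j, wrapping around the circle."""
    for j in node_positions:
        if position <= j:
            return j
    return node_positions[0]                   # wrap: h(K) past the last node

key = "some-document-key"
print(f"key hashes to {h(key)}, stored at node N{successor(h(key))}")
```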