ADBMS
Agenda
The problem domain of designing parallel & distributed
databases (ch. 18-20)
The data allocation problem
The data processing algorithms
[Diagram: centralized vs. distributed architecture — application, DBMS, and hardware layers; in the distributed case, distributed control and distributed services span the hardware nodes]
Server processes
These receive user queries
(transactions), execute them and send
results back
Processes may be multithreaded,
allowing a single process to execute
several user queries concurrently
Lock manager process
Reduces lock contention; coordinated via
spin-locks/semaphores
Database writer process
Output modified buffer blocks to
disks continually
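The multithreaded server-process idea above can be sketched as follows (a minimal Python sketch; `execute_query` and the worker-pool setup are assumptions for illustration, not part of the slides):

```python
# One server process, several worker threads: each thread executes
# a user query concurrently and the result is sent back to the client.
from concurrent.futures import ThreadPoolExecutor

def execute_query(q):
    # stand-in for parse/optimize/execute; returns a result to ship back
    return f"result of {q}"

with ThreadPoolExecutor(max_workers=4) as pool:   # one multithreaded process
    futures = [pool.submit(execute_query, q)      # queries run concurrently
               for q in ["Q1", "Q2", "Q3"]]
    results = [f.result() for f in futures]       # results shipped to clients
```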
Data Servers
Data servers appear as a distributed DBMS that exchanges low-level
objects, e.g. pages
Ship data to client machines where processing is performed, and then
ship results back to the server machine.
This architecture requires full back-end functionality at the clients.
Used in LANs, where there is a very high speed connection between
the clients and the server, the client machines are comparable in
processing power to the server machine, and the tasks to be executed
are compute intensive.
Issues:
Page-Shipping versus Item-Shipping
Locking
Data Caching
Lock Caching
Data Caching
Data can be cached at client even in between transactions
But check that data is up-to-date before it is used (cache coherency)
Check can be done when requesting lock on data item
Lock Caching
Locks can be retained by client system even in between transactions
Transactions can acquire cached locks locally, without contacting
server
Server calls back locks from clients when it receives conflicting lock
request. Client returns lock once no local transaction is using it.
Similar to de-escalation, but across transactions.
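The lock-caching protocol above can be sketched as follows (class and method names are assumptions, not from the slides; the deferred return of an in-use lock after release is left out for brevity):

```python
# Client retains locks between transactions; the server "calls back"
# a lock on a conflicting request, and the client returns it only
# when no local transaction is using it.

class Server:
    def grant(self, item, client):
        pass  # server-side lock-table bookkeeping omitted

class ClientLockCache:
    def __init__(self):
        self.cached = set()   # locks retained across transactions
        self.in_use = set()   # locks held by active local transactions

    def acquire(self, item, server):
        if item not in self.cached:   # cache miss: contact the server
            server.grant(item, self)
            self.cached.add(item)
        self.in_use.add(item)         # cache hit needs no round trip

    def release(self, item):
        self.in_use.discard(item)     # lock stays cached at the client

    def callback(self, item):
        # server requests the lock back due to a conflicting request
        if item not in self.in_use:
            self.cached.discard(item)
            return True               # returned to the server
        return False                  # still in use by a local transaction

cache = ClientLockCache()
cache.acquire("r1", Server())         # miss: goes to the server
cache.release("r1")                   # lock remains cached
cache.acquire("r1", Server())         # hit: served locally
```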
Issues:
SQL cache coherency
Transaction management
Optimization over materialized results
Parallel Systems
Parallel database systems consist of multiple processors and multiple
disks connected by a fast interconnection network.
A coarse-grain parallel machine consists of a small number of
powerful processors
A massively parallel or fine grain parallel machine utilizes thousands
of smaller processors.
Two main performance measures:
throughput --- the number of tasks that can be completed in a
given time interval
response time --- the amount of time it takes to complete a single
task from the time it is submitted
Scaleup: increase the size of both the problem and the system
N-times larger system used to perform N-times larger job
Measured by:
scaleup = (elapsed time of small system on small problem) /
(elapsed time of big system on big problem)
Scaleup is linear if this ratio equals 1.
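The scaleup ratio can be computed with a small worked example (the timings below are illustrative, not from the slides):

```python
# Scaleup = elapsed time of the small system on the small problem,
# divided by the elapsed time of the N-times larger system on the
# N-times larger problem. Linear scaleup = 1.

def scaleup(t_small_system_small_problem, t_big_system_big_problem):
    return t_small_system_small_problem / t_big_system_big_problem

# 1 processor takes 100 s on a 1 GB job;
# 10 processors take 100 s (ideal) or 125 s (with overhead) on a 10 GB job
print(scaleup(100, 100))  # 1.0 -> linear scaleup
print(scaleup(100, 125))  # 0.8 -> sub-linear scaleup
```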
Distributed Systems
Distributed Databases
Homogeneous distributed databases
Same software/schema on all sites, data may be partitioned among sites
Goal: provide a view of a single database, hiding details of distribution
Heterogeneous distributed databases
Different software/schema on different sites
Goal: integrate existing databases to provide useful functionality
Differentiate between local and global transactions
A local transaction accesses data in the single site at which the
transaction was initiated.
A global transaction either accesses data in a site different from the one
at which the transaction was initiated or accesses data in several
different sites.
Implementation issues
Advantages of Replication
Availability: failure of site containing relation r does not result in
unavailability of r if replicas exist.
Parallelism: queries on r may be processed by several nodes in parallel.
Reduced data transfer: relation r is available locally at each site
containing a replica of r.
Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to
distinct replicas may lead to inconsistent data unless special
concurrency control mechanisms are implemented.
One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy
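The primary-copy solution above can be sketched as follows (a minimal sketch with assumed names, using eager propagation for simplicity; real systems also apply locking at the primary, which is omitted here):

```python
# All updates to relation r are routed through one designated
# primary replica, which then propagates them to the other replicas,
# so concurrency control is applied at a single site.

class Replica:
    def __init__(self, site):
        self.site = site
        self.data = {}

class PrimaryCopy:
    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries

    def update(self, key, value):
        self.primary.data[key] = value       # concurrency control point
        for rep in self.secondaries:         # propagate to the replicas
            rep.data[key] = value

sites = [Replica(s) for s in ("A", "B", "C")]
r = PrimaryCopy(primary=sites[0], secondaries=sites[1:])
r.update("acct42", 500)                      # all replicas now agree
```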
Fragmentation
Relation is partitioned into several fragments stored in distinct sites
Implementation issues
Introduction
Parallelism in Databases
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the
relations on multiple disks.
Round robin:
Advantages
Best suited for sequential scan of entire relation on each query.
All disks have almost an equal number of tuples; retrieval work is
thus well balanced between disks.
Hash partitioning:
Good for sequential access
Assuming hash function is good, and partitioning attributes form a
key, tuples will be equally distributed between disks
Retrieval work is then well balanced between disks.
Good for point queries on partitioning attribute
Can lookup single disk, leaving others available for answering
other queries.
Index on partitioning attribute can be local to disk, making lookup
and update more efficient
No clustering, so difficult to answer range queries
If a relation contains only a few tuples which will fit into a single disk
block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the available disks.
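The two partitioning strategies above can be sketched side by side (a Python sketch; the relation, its `id` attribute, and the disk count are assumptions for illustration):

```python
# Round-robin vs. hash partitioning of tuples across n_disks disks.

def round_robin_partition(tuples, n_disks):
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):          # i-th tuple -> disk i mod n
        disks[i % n_disks].append(t)
    return disks

def hash_partition(tuples, n_disks, key):
    disks = [[] for _ in range(n_disks)]
    for t in tuples:                        # tuple -> disk h(key) mod n
        disks[hash(t[key]) % n_disks].append(t)
    return disks

rows = [{"id": i, "val": i * 10} for i in range(8)]
rr = round_robin_partition(rows, 3)   # balanced regardless of values
hp = hash_partition(rows, 3, "id")    # a point query on id hits one disk
```

Note the trade-off from the slides: round-robin always balances tuple counts, while hash partitioning additionally sends all tuples with the same partitioning-attribute value to the same disk, which is what makes point queries cheap.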
Handling of Skew
Use virtual partitions: create many more partitions than processors,
and map several virtual partitions to each processor
Basic idea:
If any one-partition-per-processor scheme would have been skewed,
it is very likely the skew is spread over a number of virtual partitions
Skewed virtual partitions get spread across a number of
processors, so work gets distributed evenly!
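The virtual-partition idea can be sketched as follows (a minimal sketch with assumed names; round-robin mapping of virtual partitions to processors is one common choice):

```python
# Create many more virtual partitions than processors, then map
# virtual partition v to processor v mod n_procs, so consecutive
# (possibly skewed) virtual partitions land on different processors.

def assign_virtual_partitions(n_virtual, n_procs):
    return {v: v % n_procs for v in range(n_virtual)}

mapping = assign_virtual_partitions(n_virtual=12, n_procs=3)
# each of the 3 processors receives 4 virtual partitions
load = [sum(1 for p in mapping.values() if p == proc)
        for proc in range(3)]
```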