Distributed Databases Data Warehousing: CPS 216 Advanced Database Systems
Data Warehousing
CPS 216
Advanced Database Systems
Review
Distributed DBMS
Top-down approach
Data partitioning
Query processing
Query optimization
Concurrency control and recovery
Bottom-up approach
Query processing and optimization
[Figures omitted: Plan A; relations R1, R2, R3, R4 at Site 1; operations op1; op2 and op3; op4 at Sites 1, 2, and 3, with outcomes commit, abort, commit]
Two-phase commit
Coordinator
Initial -> Wait: T.commit / prepare*
Wait -> Commit: OK* / commit*
Wait -> Abort: NOK / abort*
Participant
Initial -> Ready: prepare / OK
Initial -> Abort: prepare / NOK
Ready -> Commit: commit / -
Ready -> Abort: abort / -
Notation: incoming message / outgoing message; * = everyone
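The coordinator and participant state machines can be sketched in a few lines of Python. This is only an illustrative, single-process sketch: the class names, the `can_commit` flag, and the `run_2pc` helper are hypothetical, messages are modeled as direct method calls, and logging, timeouts, and failure recovery are left out.

```python
from enum import Enum


class State(Enum):
    INITIAL = "Initial"
    WAIT = "Wait"
    READY = "Ready"
    COMMIT = "Commit"
    ABORT = "Abort"


class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit  # hypothetical local decision
        self.state = State.INITIAL

    def on_prepare(self):
        """Initial: on prepare, reply OK (go to Ready) or NOK (go to Abort)."""
        if self.can_commit:
            self.state = State.READY
            return "OK"
        self.state = State.ABORT
        return "NOK"

    def on_decision(self, decision):
        """Ready: on commit or abort, change state; no reply is sent."""
        if self.state is State.READY:
            self.state = State.COMMIT if decision == "commit" else State.ABORT


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.state = State.INITIAL

    def commit_transaction(self):
        # Phase 1: on T.commit, send prepare to everyone and collect votes.
        self.state = State.WAIT
        votes = [p.on_prepare() for p in self.participants]
        # Phase 2: commit only if every vote is OK; otherwise abort everyone.
        decision = "commit" if all(v == "OK" for v in votes) else "abort"
        for p in self.participants:
            p.on_decision(decision)
        self.state = State.COMMIT if decision == "commit" else State.ABORT
        return decision


def run_2pc(votes):
    """Run one transaction; votes[i] says whether site i+1 is able to commit."""
    parts = [Participant(f"site{i + 1}", ok) for i, ok in enumerate(votes)]
    decision = Coordinator(parts).commit_transaction()
    return decision, [p.state.value for p in parts]
```

Calling `run_2pc([True, False, True])` models one site voting NOK, in which case every site ends in the Abort state.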
Bottom-up approach to building a distributed DBMS
Data already in various sources
Build a distributed DBMS to provide global, uniform access to all data
How to integrate data?
How to deal with heterogeneous and autonomous sources?
Mediation approach
Wrapper/mediator architecture
[Figure: clients send queries to a mediator, which has a catalog; the mediator talks to one wrapper per source database]
Mediator
Accept queries from clients
Rewrite and optimize queries
Send subplans to be executed by wrappers
Combine results from wrappers and perform any additional local processing necessary
Mediator catalog stores global schema and external schema of sources as exported by wrappers
No source-specific code in a mediator!
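The division of labor between mediator and wrappers might look roughly like the following Python sketch. All class and method names (`Wrapper.scan`, `DictSourceWrapper`, `CsvSourceWrapper`, `Mediator.register`) are hypothetical, and the "subplan" sent to a wrapper is reduced to a full table scan; the point is only that source-specific code lives in the wrappers, while the mediator sees one uniform interface.

```python
class Wrapper:
    """Hypothetical uniform interface through which every source is exported."""

    def scan(self, table):
        raise NotImplementedError


class DictSourceWrapper(Wrapper):
    """Wraps a source that already stores rows as dictionaries."""

    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        return [dict(row) for row in self.tables[table]]


class CsvSourceWrapper(Wrapper):
    """Wraps a source that stores each table as comma-separated text."""

    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        header, *lines = self.tables[table].strip().splitlines()
        columns = header.split(",")
        return [dict(zip(columns, line.split(","))) for line in lines]


class Mediator:
    def __init__(self):
        # Catalog: global table name -> list of (wrapper, source table name).
        self.catalog = {}

    def register(self, global_table, wrapper, source_table):
        self.catalog.setdefault(global_table, []).append((wrapper, source_table))

    def query(self, global_table, predicate=lambda row: True):
        rows = []
        for wrapper, source_table in self.catalog[global_table]:
            rows.extend(wrapper.scan(source_table))  # subplan run by a wrapper
        return [row for row in rows if predicate(row)]  # local processing


# Usage with two made-up sources that both export a global Books table.
mediator = Mediator()
mediator.register(
    "Books", DictSourceWrapper({"books": [{"title": "A", "author": "X"}]}), "books"
)
mediator.register(
    "Books", CsvSourceWrapper({"catalog": "title,author\nB,Y\nC,X"}), "catalog"
)
books_by_x = mediator.query("Books", lambda row: row["author"] == "X")
```

Note that `Mediator` never inspects how a source stores its data; adding a new kind of source means writing a new wrapper only.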
Wrapper
Hide heterogeneity from the mediator
Translate mediator requests so that they are understood by sources
No wrap_join rule: the wrapper declares that it cannot process joins
med_pushdown(subplan) = subplan
Condition: subplan.site = mediator
And more
Plan enumeration
Call all wrap_access and med_access rules to generate single-table access plans
Repeatedly call all wrap_join and med_join rules to generate multi-table join plans
Example final plans
FILTER_med(RECEIVE(SEND(FETCH_Web(Books, title LIKE string))), author = string), versus
FILTER_med(RECEIVE(SEND(FETCH_Web(Books, author = string))), title LIKE string)
RECEIVE(SEND(JOIN_DBMS(R, S))), versus
JOIN_med(RECEIVE(SEND(R)), RECEIVE(SEND(S)))
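The two-step enumeration can be illustrated with a small Python sketch for the R and S join example. Everything here is an assumption made for illustration: the plan representation (a dictionary with `expr`, `site`, and `tables`), the `TABLE_SITE` and `SITE_CAN_JOIN` catalogs, and the rule bodies. Mediator-side access is folded into `med_pushdown` rather than modeled as a separate med_access rule, and operator subscripts are written with an underscore (JOIN_med, JOIN_DBMS).

```python
from itertools import combinations

# Hypothetical catalog: where each table lives, and which sites can join.
TABLE_SITE = {"R": "DBMS", "S": "DBMS"}
SITE_CAN_JOIN = {"DBMS": True, "Web": False}


def wrap_access(table):
    """Single-table access plan executed at the table's source."""
    return [{"expr": table, "site": TABLE_SITE[table], "tables": frozenset([table])}]


def med_pushdown(plan):
    """Ship a source-side plan to the mediator; mediator plans are unchanged."""
    if plan["site"] == "mediator":
        return plan
    return {**plan, "expr": f"RECEIVE(SEND({plan['expr']}))", "site": "mediator"}


def wrap_join(left, right):
    """Join at the source, only if both inputs are at one join-capable site."""
    site = left["site"]
    if site != right["site"] or not SITE_CAN_JOIN.get(site, False):
        return []  # corresponds to a wrapper with no wrap_join rule
    return [{"expr": f"JOIN_{site}({left['expr']}, {right['expr']})",
             "site": site, "tables": left["tables"] | right["tables"]}]


def med_join(left, right):
    """Join at the mediator after shipping both inputs there."""
    lhs, rhs = med_pushdown(left), med_pushdown(right)
    return [{"expr": f"JOIN_med({lhs['expr']}, {rhs['expr']})",
             "site": "mediator", "tables": lhs["tables"] | rhs["tables"]}]


def enumerate_plans(tables):
    # Step 1: single-table access plans.
    plans = [p for t in tables for p in wrap_access(t)]
    # Step 2: repeatedly apply the join rules until no new plan appears.
    changed = True
    while changed:
        changed = False
        for left, right in combinations(list(plans), 2):
            if left["tables"] & right["tables"]:
                continue  # inputs overlap, not a valid join
            for new in wrap_join(left, right) + med_join(left, right):
                if new not in plans:
                    plans.append(new)
                    changed = True
    target = frozenset(tables)
    return sorted(med_pushdown(p)["expr"] for p in plans if p["tables"] == target)
```

For `enumerate_plans(["R", "S"])` the sketch yields the two alternatives compared above: joining at the source and shipping the result, or shipping both tables and joining at the mediator. A real optimizer would also prune plans by cost instead of keeping all of them.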
Costing
Wrapper-supplied cost model
Lots of work for wrapper developers
Calibration
Define a generic cost model with parameters for all wrappers
Example: cost = c × (# of tuples)
Learning curve
Use recent statistics to adjust cost estimates
Example: cost = average over last three runs
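The calibration and learning-curve approaches can each be sketched in a few lines of Python. The class names, the `window` parameter, and the `initial_estimate` fallback are hypothetical; the formulas are the two examples given above (a per-wrapper constant c times the number of tuples, and an average over the last three runs).

```python
from collections import deque


class CalibratedCost:
    """Generic model: cost = c * (# of tuples), with one constant c per wrapper."""

    def __init__(self, c):
        self.c = c  # calibrated once per wrapper

    def estimate(self, num_tuples):
        return self.c * num_tuples


class LearnedCost:
    """Estimate = average observed cost over the last `window` runs."""

    def __init__(self, window=3, initial_estimate=0.0):
        self.history = deque(maxlen=window)  # older runs fall off automatically
        self.initial_estimate = initial_estimate  # used before any run is seen

    def record(self, observed_cost):
        self.history.append(observed_cost)

    def estimate(self):
        if not self.history:
            return self.initial_estimate
        return sum(self.history) / len(self.history)


# Usage with made-up numbers.
calibrated = CalibratedCost(c=0.5).estimate(100)
learned = LearnedCost()
for observed in (10, 20, 30, 40):
    learned.record(observed)
recent_average = learned.estimate()  # averages only the runs 20, 30, 40
```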
Summary of wrapper/mediator
Not all sources are created equal!
What's in a source?
Wrapper: source schema -> external schema
Mediator: external schema -> global schema
Data warehousing
Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources
Sales, inventory, customer, ...
NC branch, NY branch, CA branch, ...
OLTP
Mostly updates
Short, simple transactions
Clerical users
Goal: ACID, transaction throughput
OLAP (On-Line Analytical Processing)
Mostly reads
Long, complex queries
Analysts, decision makers
Goal: fast queries
Mediation
Lazy integration
On demand: at query time
Answer is more up-to-date
Warehousing
Faster
Can operate when sources are unavailable
Incremental maintenance
Compute only the changes to V: $\Delta V$ (insertions) and $\nabla V$ (deletions)
Apply the changes to $V_{old}$: $V_{new} \leftarrow (V_{old} - \nabla V) \cup \Delta V$
Potentially much faster if changes are small
Incremental maintenance
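Example: $V = \sigma_p R$
Change: $R_{new} \leftarrow R_{old} \cup \Delta R$
$V_{new} = \sigma_p R_{new} = \sigma_p (R_{old} \cup \Delta R) = \sigma_p R_{old} \cup \sigma_p \Delta R = V_{old} \cup \Delta V$
Change propagation: $\Delta V = \sigma_p \Delta R$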
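The selection example can be checked with a short Python sketch over sets. The function names and the sample data are made up, relations are modeled as Python sets (set semantics), and handling deletions $\nabla R$ the same way as insertions is an assumption that extends the insertion-only example above.

```python
def select(p, relation):
    """Selection: keep the tuples of `relation` that satisfy predicate p."""
    return {t for t in relation if p(t)}


def propagate_selection(p, delta_r, nabla_r):
    """Changes to V = select(p, R), computed from the changes to R alone."""
    return select(p, delta_r), select(p, nabla_r)


def apply_changes(old, delta, nabla):
    """new <- (old - nabla) | delta"""
    return (old - nabla) | delta


# Made-up data: R holds integers, p keeps values greater than 6.
p = lambda x: x > 6
r_old = {1, 5, 8, 12}
v_old = select(p, r_old)

delta_r, nabla_r = {3, 9}, {12}  # insertions into and deletions from R
delta_v, nabla_v = propagate_selection(p, delta_r, nabla_r)

v_incremental = apply_changes(v_old, delta_v, nabla_v)
v_recomputed = select(p, apply_changes(r_old, delta_r, nabla_r))
```

The incremental result touches only the three changed tuples, while recomputation scans all of the new R; both give the same view.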
More change propagation equations
$(R \cup \Delta R) \bowtie S = (R \bowtie S) \cup (\Delta R \bowtie S)$
$(R - \nabla R) \bowtie S = (R \bowtie S) - (\nabla R \bowtie S)$
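The join propagation equations can be checked on a small example in Python. This is a sketch under set semantics with made-up data: R holds (a, b) pairs, S holds (b, c) pairs, the hypothetical `join` helper joins them on b, and `delta_r` and `nabla_r` stand for the inserted and deleted tuples of R.

```python
def join(r, s):
    """Join (a, b) tuples of r with (b, c) tuples of s on the shared b value."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


# Made-up relations and changes.
r = {(1, "x"), (2, "y")}
s = {("x", 10), ("y", 20), ("y", 30)}
delta_r = {(3, "y")}  # tuples inserted into R
nabla_r = {(2, "y")}  # tuples deleted from R

# Insertions: (R u delta_r) join S = (R join S) u (delta_r join S)
insert_lhs = join(r | delta_r, s)
insert_rhs = join(r, s) | join(delta_r, s)

# Deletions: (R - nabla_r) join S = (R join S) - (nabla_r join S)
delete_lhs = join(r - nabla_r, s)
delete_rhs = join(r, s) - join(nabla_r, s)
```

In both cases the right-hand side reuses the old join result and only joins the small change with S, but it does need access to S.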
Self-maintainability
A warehouse is self-maintainable if it can be
maintained without accessing the sources
Self-maintainable: $V = \sigma_p R$
Not self-maintainable: $V = R \bowtie S$
$\Delta R$ and $\nabla R$ need to be joined with S
$\Delta S$ and $\nabla S$ need to be joined with R
Problem: need to query the source for maintenance
What if the source/network is slow?
What if the source/network is down?
What if the source has been updated again?
Next time
Warehouse design
Data cube
ROLAP versus MOLAP