
Principles of Distributed Database

Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
n Introduction
n Distributed and Parallel Database Design
n Distributed Data Control
n Distributed Query Processing
n Distributed Transaction Processing
n Data Replication
n Database Integration – Multidatabase Systems
n Parallel Database Systems
n Peer-to-Peer Data Management
n Big Data Processing
n NoSQL, NewSQL and Polystores
n Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
n Introduction
q What is a distributed DBMS
q History
q Distributed DBMS promises
q Design issues
q Distributed DBMS architecture

© 2020, M.T. Özsu & P. Valduriez 3


Distributed Computing

n A number of autonomous processing elements (not


necessarily homogeneous) that are interconnected by a
computer network and that cooperate in performing their
assigned tasks.
n What is being distributed?
q Processing logic
q Function
q Data
q Control

© 2020, M.T. Özsu & P. Valduriez 4


Current Distribution – Geographically
Distributed Data Centers

© 2020, M.T. Özsu & P. Valduriez 5


What is a Distributed Database System?

A distributed database is a collection of multiple, logically


interrelated databases distributed over a computer network

A distributed database management system (Distributed


DBMS) is the software that manages the DDB and provides
an access mechanism that makes this distribution
transparent to the users

© 2020, M.T. Özsu & P. Valduriez 6


What is not a DDBS?

n A timesharing computer system


n A loosely or tightly coupled multiprocessor system
n A database system which resides at one of the nodes of
a network of computers - this is a centralized database
on a network node

© 2020, M.T. Özsu & P. Valduriez 7


Distributed DBMS Environment

© 2020, M.T. Özsu & P. Valduriez 8


Implicit Assumptions

n Data stored at a number of sites → each site logically


consists of a single processor
n Processors at different sites are interconnected by a
computer network → not a multiprocessor system
q Parallel database systems
n Distributed database is a database, not a collection of
files → data logically related as exhibited in the users’
access patterns
q Relational data model
n Distributed DBMS is a full-fledged DBMS
q Not remote file system, not a TP system

© 2020, M.T. Özsu & P. Valduriez 9


Important Point

Logically integrated
but
Physically distributed

© 2020, M.T. Özsu & P. Valduriez 10


Outline
n Introduction
q

q History
q
q

© 2020, M.T. Özsu & P. Valduriez 11


History – File Systems

© 2020, M.T. Özsu & P. Valduriez 12


History – Database Management

© 2020, M.T. Özsu & P. Valduriez 13


History – Early Distribution
Peer-to-Peer (P2P)

© 2020, M.T. Özsu & P. Valduriez 14


History – Client/Server

© 2020, M.T. Özsu & P. Valduriez 15


History – Data Integration

© 2020, M.T. Özsu & P. Valduriez 16


History – Cloud Computing

On-demand, reliable services provided over the Internet in


a cost-efficient manner
n Cost savings: no need to maintain dedicated compute
power
n Elasticity: better adaptivity to changing workload

© 2020, M.T. Özsu & P. Valduriez 17


Data Delivery Alternatives

n Delivery modes
q Pull-only
q Push-only
q Hybrid
n Frequency
q Periodic
q Conditional
q Ad-hoc or irregular
n Communication Methods
q Unicast
q One-to-many
n Note: not all combinations make sense
© 2020, M.T. Özsu & P. Valduriez 18
Outline
n Introduction
q

q Distributed DBMS promises


q

© 2020, M.T. Özsu & P. Valduriez 19


Distributed DBMS Promises

1. Transparent management of distributed, fragmented, and replicated data
2. Improved reliability/availability through distributed transactions
3. Improved performance
4. Easier and more economical system expansion

© 2020, M.T. Özsu & P. Valduriez


Transparency

n Transparency is the separation of the higher-level


semantics of a system from the lower level
implementation issues.
n Fundamental issue is to provide
data independence
in the distributed environment
q Network (distribution) transparency
q Replication transparency
q Fragmentation transparency
n horizontal fragmentation: selection
n vertical fragmentation: projection
n hybrid

© 2020, M.T. Özsu & P. Valduriez


Example

© 2020, M.T. Özsu & P. Valduriez 22


Transparent Access

SELECT ENAME, SAL
FROM   EMP, ASG, PAY
WHERE  DUR > 12
AND    EMP.ENO = ASG.ENO
AND    PAY.TITLE = EMP.TITLE

[Figure: sites Tokyo, Boston, Paris, New York and Montreal connected by a communication network; Paris stores Paris employees, Paris projects, Paris assignments and Boston employees; Boston stores Boston employees, Boston projects and Boston assignments; New York stores New York employees, New York projects, New York assignments and Boston projects; Montreal stores Montreal employees, Montreal projects, Montreal assignments, Paris projects and New York projects with budget > 200000]
© 2020, M.T. Özsu & P. Valduriez 23


Distributed Database - User View

Distributed Database

© 2020, M.T. Özsu & P. Valduriez 24


Distributed DBMS - Reality
[Figure: several sites, each running its own DBMS software, connected by a communication subsystem; user queries and user applications are issued at different sites but see a single database]
© 2020, M.T. Özsu & P. Valduriez 25


Types of Transparency

n Data independence
n Network transparency (or distribution transparency)
q Location transparency
q Fragmentation transparency
n Fragmentation transparency
n Replication transparency

© 2020, M.T. Özsu & P. Valduriez 26


Reliability Through Transactions

n Replicated components and data should make distributed


DBMS more reliable.
n Distributed transactions provide
q Concurrency transparency
q Failure atomicity

• Distributed transaction support requires implementation of


q Distributed concurrency control protocols

q Commit protocols

n Data replication
q Great for read-intensive workloads, problematic for updates
q Replication protocols

© 2020, M.T. Özsu & P. Valduriez 27


Potentially Improved Performance

n Proximity of data to its points of use

q Requires some support for fragmentation and replication

n Parallelism in execution

q Inter-query parallelism

q Intra-query parallelism

© 2020, M.T. Özsu & P. Valduriez 28


Scalability

n Issue is database scaling and workload scaling

n Adding processing and storage power

n Scale-out: add more servers


q Scale-up: increase the capacity of one server → has limits

© 2020, M.T. Özsu & P. Valduriez 29


Outline
n Introduction
q

q
q Design issues
q

© 2020, M.T. Özsu & P. Valduriez 30


Distributed DBMS Issues

n Distributed database design


q How to distribute the database
q Replicated & non-replicated database distribution
q A related problem in directory management

n Distributed query processing


q Convert user transactions to data manipulation instructions
q Optimization problem
n min{cost = data transmission + local processing}
q General formulation is NP-hard

© 2020, M.T. Özsu & P. Valduriez 31


Distributed DBMS Issues

n Distributed concurrency control


q Synchronization of concurrent accesses
q Consistency and isolation of transactions' effects
q Deadlock management

n Reliability
q How to make the system resilient to failures
q Atomicity and durability

© 2020, M.T. Özsu & P. Valduriez 32


Distributed DBMS Issues

n Replication
q Mutual consistency
q Freshness of copies
q Eager vs lazy
q Centralized vs distributed
n Parallel DBMS
q Objectives: high scalability and performance
q Not geo-distributed
q Cluster computing

© 2020, M.T. Özsu & P. Valduriez 33


Related Issues

n Alternative distribution approaches


q Modern P2P
q World Wide Web (WWW or Web)
n Big data processing
q 4V: volume, variety, velocity, veracity
q MapReduce & Spark
q Stream data
q Graph analytics
q NoSQL
q NewSQL
q Polystores

© 2020, M.T. Özsu & P. Valduriez 34


Outline
n Introduction
q

q
q

q Distributed DBMS architecture

© 2020, M.T. Özsu & P. Valduriez 35


DBMS Implementation Alternatives

© 2020, M.T. Özsu & P. Valduriez 36


Dimensions of the Problem

n Distribution
q Whether the components of the system are located on the same machine or
not
n Heterogeneity
q Various levels (hardware, communications, operating system)
q DBMS important one
n data model, query language, transaction management algorithms
n Autonomy
q Not well understood and most troublesome
q Various versions
n Design autonomy: Ability of a component DBMS to decide on issues related to its
own design.
n Communication autonomy: Ability of a component DBMS to decide whether and
how to communicate with other DBMSs.
n Execution autonomy: Ability of a component DBMS to execute local operations in
any manner it wants to.

© 2020, M.T. Özsu & P. Valduriez 37


Client/Server Architecture

© 2020, M.T. Özsu & P. Valduriez 38


Advantages of Client-Server
Architectures
n More efficient division of labor
n Horizontal and vertical scaling of resources
n Better price/performance on client machines
n Ability to use familiar tools on client machines
n Client access to remote data (via standards)
n Full DBMS functionality provided to client workstations
n Overall better system price/performance

© 2020, M.T. Özsu & P. Valduriez 39


Database Server

© 2020, M.T. Özsu & P. Valduriez 40


Distributed Database Servers

© 2020, M.T. Özsu & P. Valduriez 41


Peer-to-Peer Component Architecture

© 2020, M.T. Özsu & P. Valduriez 42


MDBS Components & Execution

© 2020, M.T. Özsu & P. Valduriez 43


Mediator/Wrapper Architecture

© 2020, M.T. Özsu & P. Valduriez 44


Cloud Computing

On-demand, reliable services provided over the Internet in


a cost-efficient manner
n IaaS – Infrastructure-as-a-Service

n PaaS – Platform-as-a-Service

n SaaS – Software-as-a-Service

n DaaS – Database-as-a-Service

© 2020, M.T. Özsu & P. Valduriez 45


Simplified Cloud Architecture

© 2020, M.T. Özsu & P. Valduriez 46


Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
n Introduction
n Distributed and Parallel Database Design
n Distributed Data Control
n Distributed Query Processing
n Distributed Transaction Processing
n Data Replication
n Database Integration – Multidatabase Systems
n Parallel Database Systems
n Peer-to-Peer Data Management
n Big Data Processing
n NoSQL, NewSQL and Polystores
n Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
n Distributed and Parallel Database Design
q Fragmentation
q Data distribution
q Combined approaches

© 2020, M.T. Özsu & P. Valduriez 3


Distribution Design

© 2020, M.T. Özsu & P. Valduriez 4


Outline
n Distributed and Parallel Database Design
q Fragmentation
q

© 2020, M.T. Özsu & P. Valduriez 5


Fragmentation

n Can't we just distribute relations?


n What is a reasonable unit of distribution?
q relation
n views are subsets of relations ⇒ locality
n extra communication
q fragments of relations (sub-relations)
n concurrent execution of a number of transactions that access
different portions of a relation
n views that cannot be defined on a single fragment will require extra
processing
n semantic data control (especially integrity enforcement) more
difficult

© 2020, M.T. Özsu & P. Valduriez 6


Example Database

© 2020, M.T. Özsu & P. Valduriez 7


Fragmentation Alternatives – Horizontal

PROJ1 : projects with budgets


less than $200,000
PROJ2 : projects with budgets
greater than or equal
to $200,000

© 2020, M.T. Özsu & P. Valduriez 8


Fragmentation Alternatives – Vertical

PROJ1: information about


project budgets
PROJ2: information about
project names and
locations

© 2020, M.T. Özsu & P. Valduriez 9


Correctness of Fragmentation

n Completeness
q Decomposition of relation R into fragments R1, R2, ..., Rn is
complete if and only if each data item in R can also be found in
some Ri
n Reconstruction
q If relation R is decomposed into fragments R1, R2, ..., Rn, then
there should exist some relational operator ∇ such that
R = ∇1≤i≤nRi
n Disjointness
q If relation R is decomposed into fragments R1, R2, ..., Rn, and
data item di is in Rj, then di should not be in any other fragment
Rk (k ≠ j ).
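
These three rules can be checked mechanically. Below is a minimal illustrative sketch (not from the book) that models relations as Python sets of tuples and verifies the rules for the budget-based fragmentation of PROJ used earlier; union is taken as the reconstruction operator ∇ for horizontal fragments.

```python
# Hypothetical sketch: checking completeness, reconstruction and disjointness
# of a horizontal fragmentation, with relations modelled as sets of tuples.

def is_complete(R, fragments):
    # every data item (tuple) of R appears in some fragment
    return all(any(t in Ri for Ri in fragments) for t in R)

def reconstructs(R, fragments):
    # for horizontal fragmentation the operator ∇ is union
    return set().union(*fragments) == R

def is_disjoint(fragments):
    # no tuple appears in more than one fragment
    seen = set()
    for Ri in fragments:
        if seen & Ri:
            return False
        seen |= Ri
    return True

R = {("P1", 150000), ("P2", 135000), ("P3", 250000)}
PROJ1 = {t for t in R if t[1] < 200000}   # budget < 200000
PROJ2 = {t for t in R if t[1] >= 200000}  # budget >= 200000
assert is_complete(R, [PROJ1, PROJ2])
assert reconstructs(R, [PROJ1, PROJ2])
assert is_disjoint([PROJ1, PROJ2])
```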

© 2020, M.T. Özsu & P. Valduriez 10


Allocation Alternatives

n Non-replicated
q partitioned : each fragment resides at only one site
n Replicated
q fully replicated : each fragment at each site
q partially replicated : each fragment at some of the sites
n Rule of thumb:
If (read-only queries) / (update queries) ≫ 1, replication is advantageous;
otherwise replication may cause problems

© 2020, M.T. Özsu & P. Valduriez 11


Comparison of Replication Alternatives

© 2020, M.T. Özsu & P. Valduriez 12


Fragmentation

n Horizontal Fragmentation (HF)


q Primary Horizontal Fragmentation (PHF)
q Derived Horizontal Fragmentation (DHF)

n Vertical Fragmentation (VF)


n Hybrid Fragmentation (HF)

© 2020, M.T. Özsu & P. Valduriez 13


PHF – Information Requirements

n Database Information
q relationship

q cardinality of each relation: card(R)

© 2020, M.T. Özsu & P. Valduriez 14


PHF - Information Requirements
n Application Information
q simple predicates : Given R[A1, A2, …, An], a simple predicate pj is
pj : Ai θ Value
where θ ∈ {=, <, ≤, >, ≥, ≠}, Value ∈ Di and Di is the domain of Ai.
For relation R we define Pr = {p1, p2, …, pm}
Example :
PNAME = "Maintenance"
BUDGET ≤ 200000
q minterm predicates : Given R and Pr = {p1, p2, …, pm}, define M = {m1, m2, …, mz} as
M = { mi | mi = ∧pj∈Pr pj* }, 1 ≤ j ≤ m, 1 ≤ i ≤ z
where pj* = pj or pj* = ¬pj.

© 2020, M.T. Özsu & P. Valduriez 15


PHF – Information Requirements

Example
m1: PNAME="Maintenance" ∧ BUDGET≤200000

m2: NOT(PNAME="Maintenance") ∧ BUDGET≤200000

m3: PNAME="Maintenance" ∧ NOT(BUDGET≤200000)

m4: NOT(PNAME="Maintenance") ∧ NOT(BUDGET≤200000)

© 2020, M.T. Özsu & P. Valduriez 16


PHF – Information Requirements

n Application Information
q minterm selectivities: sel(mi)
n The number of tuples of the relation that would be accessed by a
user query which is specified according to a given minterm
predicate mi.

q access frequencies: acc(qi)


n The frequency with which a user application qi accesses data.
n Access frequency for a minterm predicate can also be defined.

© 2020, M.T. Özsu & P. Valduriez 17


Primary Horizontal Fragmentation

Definition :
Rj = σFj(R), 1 ≤ j ≤ w
where Fj is a selection formula, which is (preferably) a minterm
predicate.
Therefore,
A horizontal fragment Ri of relation R consists of all the tuples of R
which satisfy a minterm predicate mi.

⇓
Given a set of minterm predicates M, there are as many horizontal
fragments of relation R as there are minterm predicates.
Set of horizontal fragments also referred to as minterm fragments.
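
As an illustration of this definition, the following hypothetical Python sketch (the data and predicate set are assumptions, not the book's code) enumerates the minterm predicates over a set of simple predicates and materializes the corresponding minterm fragments of PROJ.

```python
# Illustrative sketch: generate minterm predicates from simple predicates and
# apply them as selections (σ_mi) to fragment PROJ horizontally.
from itertools import product

PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "DB Develop.",     "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM",         "BUDGET": 250000, "LOC": "New York"},
]

# simple predicates p1, p2 over PROJ
simple_predicates = [
    lambda t: t["PNAME"] == "Maintenance",
    lambda t: t["BUDGET"] <= 200000,
]

# each minterm m_i is a conjunction of every p_j or its negation
def minterms(preds):
    for signs in product([True, False], repeat=len(preds)):
        yield lambda t, s=signs: all(p(t) == keep for p, keep in zip(preds, s))

# R_i = σ_mi(PROJ); empty (contradictory or unused) fragments are dropped
fragments = [[t for t in PROJ if m(t)] for m in minterms(simple_predicates)]
fragments = [f for f in fragments if f]
for i, f in enumerate(fragments, 1):
    print(f"PROJ{i}:", [t["PNO"] for t in f])
```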

© 2020, M.T. Özsu & P. Valduriez 18


PHF – Algorithm

Given: A relation R, the set of simple predicates Pr


Output: The set of fragments of R = {R1, R2,…,Rw} which
obey the fragmentation rules.

Preliminaries :
q Pr should be complete
q Pr should be minimal

© 2020, M.T. Özsu & P. Valduriez 19


Completeness of Simple Predicates

n A set of simple predicates Pr is said to be complete if


and only if the accesses to the tuples of the minterm
fragments defined on Pr require that any two tuples of the
same minterm fragment have the same probability of
being accessed by any application.

n Example :
q Assume PROJ[PNO,PNAME,BUDGET,LOC] has two
applications defined on it.
q Find the budgets of projects at each location. (1)
q Find projects with budgets less than $200000. (2)

© 2020, M.T. Özsu & P. Valduriez 20


Completeness of Simple Predicates

According to (1),
Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”}

which is not complete with respect to (2).


Modify
Pr ={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}

which is complete.

© 2020, M.T. Özsu & P. Valduriez 21


Minimality of Simple Predicates

n If a predicate influences how fragmentation is performed,


(i.e., causes a fragment f to be further fragmented into,
say, fi and fj) then there should be at least one
application that accesses fi and fj differently.
n In other words, the simple predicate should be relevant
in determining a fragmentation.
n If all the predicates of a set Pr are relevant, then Pr is
minimal.
acc(mi)/card(fi) = acc(mj)/card(fj)

© 2020, M.T. Özsu & P. Valduriez 22


Minimality of Simple Predicates

Example :
Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}

is minimal (in addition to being complete). However, if we


add
PNAME = “Instrumentation”

then Pr is not minimal.

© 2020, M.T. Özsu & P. Valduriez 23


COM_MIN Algorithm

Given: a relation R and a set of simple predicates Pr


Output: a complete and minimal set of simple predicates
Pr' for Pr

Rule 1: a relation or fragment is partitioned into at least


two parts which are accessed differently by at
least one application.

© 2020, M.T. Özsu & P. Valduriez 24


COM_MIN Algorithm

1. Initialization :
q find a pi ∈ Pr such that pi partitions R according to Rule 1
q set Pr' = {pi} ; Pr ← Pr – {pi} ; F ← {fi}
2. Iteratively add predicates to Pr' until it is complete
q find a pj ∈ Pr such that pj partitions some fk defined according to a
minterm predicate over Pr' according to Rule 1
q set Pr' = Pr' ∪ {pj}; Pr ← Pr – {pj}; F ← F ∪ {fj}
q if ∃pk ∈ Pr' which is nonrelevant then
Pr' ← Pr' – {pk}
F ← F – {fk}

© 2020, M.T. Özsu & P. Valduriez 25


PHORIZONTAL Algorithm

Makes use of COM_MIN to perform fragmentation.


Input: a relation R and a set of simple predicates Pr
Output: a set of minterm predicates M according to which
relation R is to be fragmented

1. Pr' ← COM_MIN (R, Pr)
2. determine the set M of minterm predicates
3. determine the set I of implications among pi ∈ Pr
4. eliminate the contradictory minterms from M

© 2020, M.T. Özsu & P. Valduriez 26


PHF – Example

n Two candidate relations : PAY and PROJ.


n Fragmentation of relation PAY
q Application: Check the salary info and determine raise.
q Employee records kept at two sites ⇒ application run at two
sites
q Simple predicates
p1 : SAL ≤ 30000
p2 : SAL > 30000
Pr = {p1,p2} which is complete and minimal ⇒ Pr' = Pr
q Minterm predicates
m1 : (SAL ≤ 30000)
m2 : NOT(SAL ≤ 30000) = (SAL > 30000)

© 2020, M.T. Özsu & P. Valduriez 27


PHF – Example

© 2020, M.T. Özsu & P. Valduriez 28


PHF – Example
n Fragmentation of relation PROJ
q Applications:
n Find the name and budget of projects given their no.
q Issued at three sites
n Access project information according to budget
q one site accesses ≤200000 other accesses >200000

q Simple predicates
q For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
q For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
q Pr = Pr' = {p1,p2,p3,p4,p5}

© 2020, M.T. Özsu & P. Valduriez 29


PHF – Example

n Fragmentation of relation PROJ continued


q Minterm fragments left after elimination
m1 : (LOC = “Montreal”) ∧ (BUDGET ≤ 200000)
m2 : (LOC = “Montreal”) ∧ (BUDGET > 200000)
m3 : (LOC = “New York”) ∧ (BUDGET ≤ 200000)
m4 : (LOC = “New York”) ∧ (BUDGET > 200000)
m5 : (LOC = “Paris”) ∧ (BUDGET ≤ 200000)
m6 : (LOC = “Paris”) ∧ (BUDGET > 200000)

© 2020, M.T. Özsu & P. Valduriez 30


PHF – Example

© 2020, M.T. Özsu & P. Valduriez 31


PHF – Correctness

n Completeness
q Since Pr' is complete and minimal, the selection predicates are
complete

n Reconstruction
q If relation R is fragmented into FR = {R1,R2,…,Rr}

R = ∪Ri∈FR Ri
n Disjointness
q Minterm predicates that form the basis of fragmentation should
be mutually exclusive.

© 2020, M.T. Özsu & P. Valduriez 32


Derived Horizontal Fragmentation

n Defined on a member relation of a link according to a


selection operation specified on its owner.
q Each link is an equijoin.
q Equijoin can be implemented by means of semijoins.

© 2020, M.T. Özsu & P. Valduriez 33


DHF – Definition

Given a link L where owner(L)=S and member(L)=R, the


derived horizontal fragments of R are defined as
Ri = R ⋉F Si, 1≤i≤w
where w is the maximum number of fragments that will be
defined on R and
Si = σFi(S)

where Fi is the formula according to which the primary


horizontal fragment Si is defined.

© 2020, M.T. Özsu & P. Valduriez 34


DHF – Example

Given link L1 where owner(L1)=SKILL and member(L1)=EMP


EMP1 = EMP ⋉ SKILL1
EMP2 = EMP ⋉ SKILL2
where
SKILL1 = σSAL≤30000(SKILL)
SKILL2 = σSAL>30000(SKILL)
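
A small sketch of how these derived fragments could be computed, assuming toy EMP and SKILL data; the semijoin helper is illustrative and not taken from the book.

```python
# Derived horizontal fragmentation: EMP is fragmented by semijoining it with
# the primary fragments of SKILL on the join attribute TITLE (assumed data).
SKILL = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Programmer",  "SAL": 24000},
]
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe",    "TITLE": "Elect. Eng."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
]

SKILL1 = [t for t in SKILL if t["SAL"] <= 30000]   # σ_SAL≤30000(SKILL)
SKILL2 = [t for t in SKILL if t["SAL"] > 30000]    # σ_SAL>30000(SKILL)

def semijoin(member, owner, attr):
    # R ⋉ S : tuples of R that join with some tuple of S on attr
    keys = {t[attr] for t in owner}
    return [t for t in member if t[attr] in keys]

EMP1 = semijoin(EMP, SKILL1, "TITLE")   # employees whose title earns ≤ 30000
EMP2 = semijoin(EMP, SKILL2, "TITLE")   # employees whose title earns > 30000
```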

© 2020, M.T. Özsu & P. Valduriez 35


DHF – Correctness

n Completeness
q Referential integrity
q Let R be the member relation of a link whose owner is relation S
which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A
be the join attribute between R and S. Then, for each tuple t of
R, there should be a tuple t' of S such that
t[A] = t' [A]
n Reconstruction
q Same as primary horizontal fragmentation.
n Disjointness
q Simple join graphs between the owner and the member
fragments.

© 2020, M.T. Özsu & P. Valduriez 36


Vertical Fragmentation

n Has been studied within the centralized context


q design methodology
q physical clustering
n More difficult than horizontal, because more alternatives
exist.
Two approaches :
q grouping
n attributes to fragments
q splitting
n relation to fragments

© 2020, M.T. Özsu & P. Valduriez 37


Vertical Fragmentation

n Overlapping fragments
q grouping

n Non-overlapping fragments
q splitting

We do not consider the replicated key attributes to be


overlapping.
Advantage:
Easier to enforce functional dependencies
(for integrity checking etc.)

© 2020, M.T. Özsu & P. Valduriez 38


VF – Information Requirements

n Application Information
q Attribute affinities
n a measure that indicates how closely related the attributes are
n This is obtained from more primitive usage data
q Attribute usage values
n Given a set of queries Q = {q1, q2,…, qq} that will run on the relation
R[A1, A2,…, An],

use(qi, Aj) = 1 if attribute Aj is referenced by query qi, 0 otherwise
use(qi, •) can be defined accordingly

© 2020, M.T. Özsu & P. Valduriez 39


VF – Definition of use(qi,Aj)

Consider the following 4 queries for relation PROJ


q1: SELECT BUDGET q2: SELECT PNAME,BUDGET
FROM PROJ FROM PROJ
WHERE PNO=Value
q3: SELECT PNAME q4: SELECT SUM(BUDGET)
FROM PROJ FROM PROJ
WHERE LOC=Value WHERE LOC=Value

© 2020, M.T. Özsu & P. Valduriez 40


VF – Affinity Measure aff(Ai,Aj)

The attribute affinity measure between two attributes Ai and Aj


of a relation R[A1, A2, …, An] with respect to the set of
applications Q = (q1, q2, …, qq) is defined as follows :

aff(Ai, Aj) = Σ (query access), summed over all queries that access both Ai and Aj

where, for each such query,
query access = Σ over all sites (access frequency of the query × access execution)
© 2020, M.T. Özsu & P. Valduriez 41


VF – Calculation of aff(Ai, Aj)

Assume each query in the previous example accesses the attributes once during each execution. Also assume the access frequencies:

     S1   S2   S3
q1   15   20   10
q2    5    0    0
q3   25   25   25
q4    3    0    0

Then
aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45
and the attribute affinity matrix AA can be computed
(Let A1=PNO, A2=PNAME, A3=BUDGET, A4=LOC)
© 2020, M.T. Özsu & P. Valduriez 42


VF – Clustering Algorithm

n Take the attribute affinity matrix AA and reorganize the


attribute orders to form clusters where the attributes in
each cluster demonstrate high affinity to one another.
n Bond Energy Algorithm (BEA) has been used for
clustering of entities. BEA finds an ordering of entities
(in our case attributes) such that the global affinity
measure is maximized.
AM = Σi Σj (affinity of Ai and Aj with their neighbors)

© 2020, M.T. Özsu & P. Valduriez 43


Bond Energy Algorithm

Input: The AA matrix


Output: The clustered affinity matrix CA which is a
perturbation of AA
1. Initialization: Place and fix one of the columns of AA in CA.
2. Iteration: Place the remaining n–i columns in the remaining i+1 positions in the CA matrix. For each column, choose the placement that makes the most contribution to the global affinity measure.
3. Row order: Order the rows according to the column ordering.

© 2020, M.T. Özsu & P. Valduriez 44


Bond Energy Algorithm

“Best” placement? Define contribution of a placement:

cont(Ai, Ak, Aj) = 2·bond(Ai, Ak) + 2·bond(Ak, Aj) – 2·bond(Ai, Aj)

where
bond(Ax, Ay) = Σz=1..n aff(Az, Ax)·aff(Az, Ay)
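
A minimal sketch of the BEA placement loop built on these definitions; bond, cont and bea_order are illustrative names, and the affinity matrix AA is assumed to be a dict of dicts keyed by attribute name (as in the earlier sketch).

```python
# Illustrative BEA sketch: insert each attribute column where its contribution
# to the global affinity measure is largest.
def bond(Ax, Ay, AA, attrs):
    # bond(Ax,Ay) = Σ_z aff(Az,Ax) * aff(Az,Ay)
    return sum(AA[Az][Ax] * AA[Az][Ay] for Az in attrs)

def cont(Ai, Ak, Aj, AA, attrs):
    # contribution of placing Ak between Ai and Aj (boundary columns count as 0)
    b = lambda x, y: 0 if x is None or y is None else bond(x, y, AA, attrs)
    return 2 * b(Ai, Ak) + 2 * b(Ak, Aj) - 2 * b(Ai, Aj)

def bea_order(AA, attrs):
    order = list(attrs[:2])                       # place and fix the first two columns
    for Ak in attrs[2:]:
        slots = [(None, order[0])] + list(zip(order, order[1:])) + [(order[-1], None)]
        best = max(range(len(slots)),
                   key=lambda i: cont(slots[i][0], Ak, slots[i][1], AA, attrs))
        order.insert(best, Ak)
    return order                                  # row order then follows column order
```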

© 2020, M.T. Özsu & P. Valduriez 45


BEA – Example
Consider the following AA matrix and the corresponding CA matrix where
PNO and PNAME have been placed. Place BUDGET:

Ordering (0-3-1) :
cont(A0,BUDGET,PNO) = 2bond(A0, BUDGET)+2bond(BUDGET, PNO)
–2bond(A0 , PNO)
= 8820
Ordering (1-3-2) :
cont(PNO,BUDGET,PNAME) = 10150
Ordering (2-3-4) :
cont (PNAME,BUDGET,LOC) = 1780

© 2020, M.T. Özsu & P. Valduriez 46


BEA – Example

n Therefore, the CA matrix has the form

n When LOC is placed, the final form of the CA matrix


(after row organization) is

© 2020, M.T. Özsu & P. Valduriez 47


VF – Algorithm

How can you divide a set of clustered attributes {A1, A2, …, An} into two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An} such that there are no (or a minimal number of) applications that access both (or more than one) of the sets?

© 2020, M.T. Özsu & P. Valduriez 48


VF – ALgorithm

Define
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications
that access only TA
CBQ = total number of accesses to attributes by applications
that access only BA
COQ = total number of accesses to attributes by applications
that access both TA and BA
Then find the point along the diagonal that maximizes
CTQ*CBQ – COQ²
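
A compact sketch of this binary split search; it assumes use/acc dictionaries shaped as in the earlier affinity sketch, and the function names are illustrative.

```python
# Illustrative split search over the clustered attribute order.
def split_quality(order, split, use, acc):
    TA, BA = set(order[:split]), set(order[split:])
    CTQ = CBQ = COQ = 0
    for q, refs in use.items():
        attrs = {a for a, v in refs.items() if v}
        total = sum(acc[q])                    # total accesses of q over all sites
        if attrs <= TA:
            CTQ += total                       # q accesses only the top fragment
        elif attrs <= BA:
            CBQ += total                       # q accesses only the bottom fragment
        else:
            COQ += total                       # q accesses both fragments
    return CTQ * CBQ - COQ ** 2

def best_split(order, use, acc):
    # try every point along the diagonal, keep the one maximizing CTQ*CBQ - COQ^2
    return max(range(1, len(order)), key=lambda s: split_quality(order, s, use, acc))
```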

© 2020, M.T. Özsu & P. Valduriez 49


VF – Algorithm

Two problems :
1. Cluster forming in the middle of the CA matrix
q Shift a row up and a column left and apply the algorithm to find
the “best” partitioning point
q Do this for all possible shifts
q Cost O(m²)
2. More than two clusters
q m-way partitioning
q try 1, 2, …, m–1 split points along diagonal and try to find the
best point for each of these
q Cost O(2^m)

© 2020, M.T. Özsu & P. Valduriez 50


VF – Correctness

A relation R, defined over attribute set A and key K,


generates the vertical partitioning FR = {R1, R2, …, Rr}.
n Completeness
q The following should be true for A:
A = ∪ ARi

n Reconstruction
q Reconstruction can be achieved by
R = ⋈K Ri, ∀Ri ∈ FR

n Disjointness
q TID's are not considered to be overlapping since they are
maintained by the system
q Duplicated keys are not considered to be overlapping

© 2020, M.T. Özsu & P. Valduriez 51


Hybrid Fragmentation

© 2020, M.T. Özsu & P. Valduriez 52


Reconstruction of HF

© 2020, M.T. Özsu & P. Valduriez 53


Outline
n Distributed and Parallel Database Design
q

q Data distribution
q

© 2020, M.T. Özsu & P. Valduriez 54


Fragment Allocation

n Problem Statement
Given
F = {F1, F2, …, Fn} fragments
S ={S1, S2, …, Sm} network sites
Q = {q1, q2,…, qq} applications
Find the "optimal" distribution of F to S.
n Optimality
q Minimal cost
n Communication + storage + processing (read & update)
n Cost in terms of time (usually)
q Performance
Response time and/or throughput
q Constraints
n Per site constraints (storage & processing)

© 2020, M.T. Özsu & P. Valduriez 55


Information Requirements
n Database information
q selectivity of fragments
q size of a fragment
n Application information
q access types and numbers
q access localities
n Communication network information
q unit cost of storing data at a site
q unit cost of processing at a site
n Computer system information
q bandwidth
q latency
q communication overhead
© 2020, M.T. Özsu & P. Valduriez 56
Allocation

File Allocation (FAP) vs Database Allocation (DAP):


q Fragments are not individual files
n relationships have to be maintained

q Access to databases is more complicated


n remote file access model not applicable
n relationship between allocation and query processing

q Cost of integrity enforcement should be considered


q Cost of concurrency control should be considered

© 2020, M.T. Özsu & P. Valduriez 57


Allocation Model

General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint

Decision Variable
xij = 1 if fragment Fi is stored at site Sj, 0 otherwise

© 2020, M.T. Özsu & P. Valduriez 58


Allocation Model

n Total Cost

Σ over all queries (query processing cost) +
Σ over all sites Σ over all fragments (cost of storing a fragment at a site)

n Storage Cost (of fragment Fj at Sk)
(unit storage cost at Sk) * (size of Fj) * xjk

n Query Processing Cost (for one query)
processing component + transmission component
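
For concreteness, here is a hypothetical sketch of the storage part of this cost model, with the placement given as a 0/1 matrix x[i][j]; the numbers are made up.

```python
# Illustrative sketch of the storage term of the allocation cost model.
def storage_cost(x, frag_size, unit_storage_cost):
    # Σ_sites Σ_fragments (unit storage cost at Sj) * (size of Fi) * x[i][j]
    return sum(unit_storage_cost[j] * frag_size[i] * x[i][j]
               for i in range(len(frag_size))
               for j in range(len(unit_storage_cost)))

x = [[1, 0], [0, 1], [1, 1]]          # F3 replicated at both sites
frag_size = [100, 250, 80]            # fragment sizes, e.g. in MB
unit_storage_cost = [0.02, 0.05]      # per MB at S1 and S2
print(storage_cost(x, frag_size, unit_storage_cost))
```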

© 2020, M.T. Özsu & P. Valduriez 59


Allocation Model

n Query Processing Cost

Processing component
access cost + integrity enforcement cost + concurrency control cost
q Access cost

Σ over all sites Σ over all fragments (no. of update accesses + no. of read accesses) * xij * (local processing cost at a site)

q Integrity enforcement and concurrency control costs


n Can be similarly calculated

© 2020, M.T. Özsu & P. Valduriez 60


Allocation Model

n Query Processing Cost


Transmission component
cost of processing updates + cost of processing retrievals
q Cost of updates

Σ over all sites Σ over all fragments (update message cost) +
Σ over all sites Σ over all fragments (acknowledgment cost)

q Retrieval Cost
Σ over all fragments min over all sites (cost of retrieval command + cost of sending back the result)

© 2020, M.T. Özsu & P. Valduriez 61


Allocation Model

n Constraints
q Response Time
execution time of query ≤ max. allowable response time for that query

q Storage Constraint (for a site)

Σ over all fragments (storage requirement of a fragment at that site) ≤ storage capacity at that site

q Processing constraint (for a site)
Σ over all queries (processing load of a query at that site) ≤ processing capacity of that site

© 2020, M.T. Özsu & P. Valduriez 62


Allocation Model

n Solution Methods
q FAP is NP-complete
q DAP also NP-complete

n Heuristics based on
q single commodity warehouse location (for FAP)
q knapsack problem
q branch and bound techniques
q network flow

© 2020, M.T. Özsu & P. Valduriez 63


Allocation Model

n Attempts to reduce the solution space


q assume all candidate partitionings known; select the “best”
partitioning

q ignore replication at first

q sliding window on fragments

© 2020, M.T. Özsu & P. Valduriez 64


Outline
n Distributed and Parallel Database Design
q

q Combined approaches

© 2020, M.T. Özsu & P. Valduriez 65


Combining Fragmentation & Allocation

Partition the data to dictate where it is located


n Workload-agnostic techniques
q Round-robin partitioning
q Hash partitioning
q Range partitioning
n Workload-aware techniques
q Graph-based approach
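
The three workload-agnostic schemes listed above can be sketched as follows (illustrative Python over assumed data):

```python
# Illustrative sketch of round-robin, hash and range partitioning over n sites.
import hashlib

def round_robin(tuples, n):
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)          # i-th tuple goes to site i mod n
    return parts

def hash_partition(tuples, n, key):
    parts = [[] for _ in range(n)]
    for t in tuples:
        h = int(hashlib.md5(str(t[key]).encode()).hexdigest(), 16)
        parts[h % n].append(t)          # site chosen by hashing the partitioning key
    return parts

def range_partition(tuples, boundaries, key):
    # boundaries [b1, b2, ...] define ranges (-inf, b1], (b1, b2], ..., (bk, +inf)
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        i = sum(t[key] > b for b in boundaries)
        parts[i].append(t)
    return parts

PROJ = [{"PNO": f"P{i}", "BUDGET": b}
        for i, b in enumerate([150000, 135000, 250000, 310000], 1)]
print(range_partition(PROJ, [200000], "BUDGET"))  # two range fragments on BUDGET
```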

© 2020, M.T. Özsu & P. Valduriez 66


Round-robin Partitioning

© 2020, M.T. Özsu & P. Valduriez 67


Hash Partitioning

© 2020, M.T. Özsu & P. Valduriez 68


Range Partitioning

© 2020, M.T. Özsu & P. Valduriez 69


Workload-Aware Partitioning

n Examplar: Schism
q Graph G=(V,E) where
n vertex vi ∈ V represents a tuple in database,
n edge e=(vi,vj) ∈ E represents a query that accesses both tuples vi
and vj;
n each edge has weight counting the no. of queries that access both
tuples
q Perform vertex-disjoint graph partitioning
n Each vertex is assigned to exactly one partition
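
A rough sketch of this co-access graph construction; the representation (tuple identifiers and a counter of weighted edges) is an assumption, and the min-cut k-way partitioning itself would be delegated to an external partitioner such as METIS.

```python
# Illustrative Schism-style graph construction: vertices are tuple ids,
# edge weights count how many queries co-access the two tuples.
from itertools import combinations
from collections import Counter

def build_coaccess_graph(query_tuple_sets):
    edges = Counter()
    for accessed in query_tuple_sets:            # tuples touched by one query
        for u, v in combinations(sorted(accessed), 2):
            edges[(u, v)] += 1                   # weight = no. of co-accessing queries
    return edges

workload = [{"t1", "t2"}, {"t2", "t3"}, {"t1", "t2", "t3"}]
graph = build_coaccess_graph(workload)
# The weighted graph is then handed to a min-cut k-way partitioner so that
# frequently co-accessed tuples end up in the same partition.
print(graph)
```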

© 2020, M.T. Özsu & P. Valduriez 70


Incorporating Replication

n Replicate each vertex based on the no. of transactions


accessing that tuple ⇒ each transaction accesses a
separate copy

© 2020, M.T. Özsu & P. Valduriez 71


Dealing with graph size

n Each tuple a vertex ⇒ graph too big ⇒ directory too big


n SWORD
q Use hypergraph model
q Compress the directory

© 2020, M.T. Özsu & P. Valduriez 72


Adaptive approaches

n Redesign as physical (network characteristics, available


storage) and logical (workload) changes occur.
n Most focus on logical
n Most follow combined approach
n Three issues:
1. How to detect workload changes?
2. How to determine impacted data items?
3. How to perform changes efficiently?

© 2020, M.T. Özsu & P. Valduriez 73


Detecting workload changes

n Not much work


n Periodically analyze system logs
n Continuously monitor workload within DBMS
q SWORD: no. of distributed queries
q E-Store: monitor system-level metrics (e.g., CPU utilization) and
tuple-level access

© 2020, M.T. Özsu & P. Valduriez 74


Detecting affected data items

n Depends on the workload change detection method


n If monitoring queries ⇒ queries will identify data items
q Apollo: generalize from “similar” queries
SELECT PNAME FROM PROJ WHERE BUDGET>20000 AND
LOC=‘LONDON’


SELECT PNAME FROM PROJ WHERE BUDGET>? AND LOC=‘?’
n If monitoring tuple-level access (E-Store), this will tell
you

© 2020, M.T. Özsu & P. Valduriez 75


Performing changes

n Periodically compute redistribution


q Not efficient
n Incremental computation and migration
q Graph representation ⇒ look at changes in graph
n SWORD and AdaptCache: Incremental graph partitioning initiates
data migration for reconfiguration
q E-Store: determine hot tuples for which a migration plan is
prepared; cold tuples are reallocated as well
n Optimization problem; real-time heuristic solutions
q Database cracking: continuously reorganize data to match query
workload
n Incoming queries are used as advice
n When a node needs data for a local query, this is hint that data may
need to be moved
© 2020, M.T. Özsu & P. Valduriez 76
Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
n Introduction
n Distributed and Parallel Database Design
n Distributed Data Control
n Distributed Query Processing
n Distributed Transaction Processing
n Data Replication
n Database Integration – Multidatabase Systems
n Parallel Database Systems
n Peer-to-Peer Data Management
n Big Data Processing
n NoSQL, NewSQL and Polystores
n Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
n Distributed Data Control
q View management
q Data security
q Semantic integrity control

© 2020, M.T. Özsu & P. Valduriez 3


Semantic Data Control

n Involves:
q View management
q Security control
q Integrity control

n Objective :
q Ensure that authorized users perform correct operations on the
database, contributing to the maintenance of the database
integrity.

© 2020, M.T. Özsu & P. Valduriez 4


Outline
n Distributed Data Control
q View management
q

© 2020, M.T. Özsu & P. Valduriez 5


View Management

View – virtual relation
q generated from base relation(s) by a query
q not stored as base relations

Example :
CREATE VIEW SYSAN(ENO,ENAME)
AS SELECT ENO,ENAME
FROM EMP
WHERE TITLE = "Syst. Anal."

EMP
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

© 2020, M.T. Özsu & P. Valduriez 6


View Management

Views can be manipulated as base relations

Example :

SELECT ENAME, PNO, RESP


FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO

© 2020, M.T. Özsu & P. Valduriez 7


Query Modification

Queries expressed on views
⇓
Queries expressed on base relations


Example :
SELECT ENAME, PNO, RESP
FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO

SELECT ENAME,PNO,RESP
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND TITLE = "Syst. Anal."

© 2020, M.T. Özsu & P. Valduriez 8


View Management

n To restrict access
CREATE VIEW ESAME
AS SELECT *
FROM EMP E1, EMP E2
WHERE E1.TITLE = E2.TITLE
AND E1.ENO = USER
n Query
SELECT *
FROM ESAME

© 2020, M.T. Özsu & P. Valduriez 9


View Updates

n Updatable
CREATE VIEW SYSAN(ENO,ENAME)
AS SELECT ENO,ENAME
FROM EMP
WHERE TITLE="Syst. Anal."

n Non-updatable
CREATE VIEW EG(ENAME,RESP)
AS SELECT ENAME,RESP
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO

© 2020, M.T. Özsu & P. Valduriez 10


View Management in Distributed DBMS

n Views might be derived from fragments.


n View definition storage should be treated as database
storage
n Query modification results in a distributed query
n View evaluations might be costly if base relations are
distributed
q Use materialized views

© 2020, M.T. Özsu & P. Valduriez 11


Materialized View

n Origin: snapshot in the 1980’s


q Static copy of the view, avoid view derivation for each query
q But periodic recomputing of the view may be expensive
n Actual version of a view
q Stored as a database relation, possibly with indices
n Used much in practice
q DDBMS: No need to access remote, base relations
q Data warehouse: to speed up OLAP
n Use aggregate (SUM, COUNT, etc.) and GROUP BY

© 2020, M.T. Özsu & P. Valduriez 12


Materialized View Maintenance

n Process of updating (refreshing) the view to reflect


changes to base data
q Resembles data replication but there are differences
n View expressions typically more complex
n Replication configurations more general
n View maintenance policy to specify:
q When to refresh
q How to refresh

© 2020, M.T. Özsu & P. Valduriez 13


When to Refresh a View

n Immediate mode
q As part of the updating transaction, e.g. through 2PC
q View always consistent with base data and fast queries
q But increased transaction time to update base data
n Deferred mode (preferred in practice)
q Through separate refresh transactions
n No penalty on the updating transactions
q Triggered at different times with different trade-offs
n Lazily: just before evaluating a query on the view
n Periodically: every hour, every day, etc.
n Forcedly: after a number of predefined updates

© 2020, M.T. Özsu & P. Valduriez 14


How to Refresh a View

n Full computing from base data


q Efficient if there has been many changes
n Incremental computing by applying only the changes to
the view
q Better if a small subset has been changed
q Uses differential relations which reflect updated data only

© 2020, M.T. Özsu & P. Valduriez 15


Differential Relations

Given relation R and update u


R+ contains tuples inserted by u
R- contains tuples deleted by u
Type of u
insert: R- empty
delete: R+ empty
modify: R+ ∪ (R – R-)
Refreshing a view V is then done by computing
V+ ∪ (V – V-)
computing V+ and V- may require accessing base data
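
A minimal sketch of such a refresh, assuming the view and its differentials are represented as sets of tuples; the EG values are made up.

```python
# Illustrative differential refresh: V := V+ ∪ (V − V−).
def refresh(V, V_plus, V_minus):
    return (V - V_minus) | V_plus

EG = {("M. Smith", "Analyst"), ("J. Doe", "Manager")}
EG_minus = {("J. Doe", "Manager")}     # derived from EMP-/ASG- against base data
EG_plus = {("A. Lee", "Consultant")}   # derived from EMP+/ASG+ against base data
EG = refresh(EG, EG_plus, EG_minus)
```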

© 2020, M.T. Özsu & P. Valduriez 16


Example

EG = SELECT DISTINCT ENAME, RESP


FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO

EG+= (SELECT DISTINCT ENAME, RESP


FROM EMP, ASG+
WHERE EMP.ENO=ASG+.ENO) UNION
(SELECT DISTINCT ENAME, RESP
FROM EMP+, ASG
WHERE EMP+.ENO=ASG.ENO) UNION
(SELECT DISTINCT ENAME, RESP
FROM EMP+, ASG+
WHERE EMP+.ENO=ASG+.ENO)

© 2020, M.T. Özsu & P. Valduriez 17


Techniques for Incremental View
Maintenance
n Different techniques depending on:
q View expressiveness
n Non recursive views: SPJ with duplicate elimination, union and
aggregation
n Views with outerjoin
n Recursive views
n Most frequent case is non recursive views
q Problem: an individual tuple in the view may be derived from
several base tuples
n Example: tuple ⟨M. Smith, Analyst⟩ in EG corresponding to
q ⟨E2, M. Smith, …⟩ in EMP
q ⟨E2, P1, Analyst, 24⟩ and ⟨E2, P2, Analyst, 6⟩ in ASG
n Makes deletion difficult
q Solution: Counting

© 2020, M.T. Özsu & P. Valduriez 18


Counting Algorithm

n Basic idea
q Maintain a count of the number of derivations for each tuple in
the view
q Increment (resp. decrement) tuple counts based on insertions
(resp. deletions)
q A tuple in the view whose count is zero can be deleted
n Algorithm
1. Compute V+ and V- using V, base relations and diff. relations
2. Compute positive in V+ and negative counts in V-
3. Compute V+ ∪ (V – V-), deleting each tuple in V with count=0
n Optimal: computes exactly the view tuples that are
inserted or deleted
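
A toy sketch of the counting idea, assuming the view is stored as a multiset of tuples with their derivation counts; names and data are illustrative.

```python
# Illustrative Counting maintenance: the view keeps, for each tuple, the number
# of derivations; a tuple is removed when its count drops to zero.
from collections import Counter

def refresh_with_counts(view_counts, v_plus, v_minus):
    # v_plus / v_minus map view tuples to the number of derivations added / removed
    for t, c in v_plus.items():
        view_counts[t] += c
    for t, c in v_minus.items():
        view_counts[t] -= c
        if view_counts[t] <= 0:
            del view_counts[t]
    return view_counts

EG = Counter({("M. Smith", "Analyst"): 2})   # tuple derived from 2 ASG tuples
EG = refresh_with_counts(EG, v_plus={}, v_minus={("M. Smith", "Analyst"): 1})
print(EG)   # count drops to 1; the tuple stays in the view
```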
© 2020, M.T. Özsu & P. Valduriez 19
Exploiting Data Skew

n Basic idea
q Partition the relations on heavy / light values for join attributes
n Threshold depends on data size and user parameter
q Maintain the join of different parts using different plans
n Most cases done using delta processing (Counting)
n Few cases require pre-materialization of auxiliary views
q Rebalance the partitions to reflect heavy ↔ light changes
n Reasons for change:
q Much more/less occurrences of a value than before
q The heavy/light threshold changes due to change in data size
n Update times are amortized to account for occasional rebalancing

© 2020, M.T. Özsu & P. Valduriez 20


Example: Triangle Count

The triangle count query Q counts the number of tuples in the join of R, S, and T:
Q = Σa,b,c R(a, b) · S(b, c) · T(c, a)
n Data model
q Relations are functions mapping tuples to multiplicities
q Updates also map tuples to multiplicities
n Triangle count query
q Joins relations R, S and T on common variables
q Aggregates away all variables a, b and c
q Sums over the product of the multiplicities of matching tuples
n Next: Maintenance under single-tuple update to R
q Single-tuple update ∆R maps (a′, b′) to multiplicity m
q If m > 0 (m < 0) then the update is an insert (delete)

© 2020, M.T. Özsu & P. Valduriez 21


Naïve Maintenance for Triangle Count
“Compute from scratch!”
n Compute from scratch
newR := R + ∆R
Q = Σa,b,c newR(a, b) · S(b, c) · T(c, a)
n Maintenance time: O(N^1.5)
q Assuming the input relations have size O(N)
q Using existing worst-case optimal join algorithms
n No extra space needed: O(N) to store the input relations

© 2020, M.T. Özsu & P. Valduriez 22
Delta Processing for Triangle Count
“Compute the difference!”
n Compute the change
Σa,b,c (R(a, b) + ∆R(a′, b′)) · S(b, c) · T(c, a)
= Σa,b,c R(a, b) · S(b, c) · T(c, a) + ∆R(a′, b′) · Σc S(b′, c) · T(c, a′)
n Maintenance time: O(N)
q Intersect the set of c values paired with b′ in S and with a′ in T
n No extra space needed: O(N) to store the input relations
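
A small sketch of this delta rule, assuming relations are dicts mapping tuples to multiplicities; the O(N) cost corresponds to intersecting the c values paired with b′ in S and with a′ in T.

```python
# Illustrative delta processing for the triangle count under an update to R.
def delta_triangle_count(S, T, a1, b1, m):
    # ∆Q = ∆R(a′, b′) · Σ_c S(b′, c) · T(c, a′)
    c_in_S = {c for (b, c) in S if b == b1}
    c_in_T = {c for (c, a) in T if a == a1}
    return m * sum(S[(b1, c)] * T[(c, a1)] for c in c_in_S & c_in_T)

S = {("b1", "c1"): 1, ("b1", "c2"): 1}
T = {("c1", "a1"): 1}
Q = 0                                             # current triangle count
Q += delta_triangle_count(S, T, "a1", "b1", +1)   # insert (a1, b1) into R
```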

© 2020, M.T. Özsu & P. Valduriez 23


Materialized View for Triangle Count (Higher-Order IVM)
“Compute the difference by using pre-materialized views!”
n Pre-materialize VST(b, a) = Σc S(b, c) · T(c, a)
n Compute the change
Σa,b,c R(a, b) · S(b, c) · T(c, a) + ∆R(a′, b′) · VST(b′, a′)
n Maintenance time:
q Updates to R: O(1) time to look up in VST
q Updates to S and T: O(N) time to maintain VST
n Extra space: O(N²) to store the input relations and VST (improvable to O(N^1.5))

© 2020, M.T. Özsu & P. Valduriez 24


"
VM Exhibits a Time-Space Tradeo↵
Data Skew for Triangle Count
Given " 2 [0, 1], IVM" maintains the triangle count with
• O(N max{",1 "}
) amortized time and
n• O(N𝜀1+min{",1
For ∈ [0,1],"}the triangle count can be maintained with
) space.
𝑂(𝑁 !"#{%,'(%} ) update time and 𝑂(𝑁 '*!+,{%,'(%} ) space.
complexity
O(N 1.5 )
Space
Amortized Time
O(N)

worst-case optimality
O(N 0.5 ) " = 0.5

"
0 0.5 1

n• No algorithm
Known can approaches
maintenance -./(0
attain 𝑂(𝑁are ) for any
recovered γ >. 0.
by IVM "

© 2020, M.T. Özsu & P. Valduriez 25


Heavy/Light Partitioning of Relations

n Fix ε ∈ [0, 1] and partition R on a into a light part RL and a heavy part RH
q RL = { t ∈ R : |σA=t.A(R)| < N^ε }
q RH = { t : t ∈ R, t ∉ RL }
n Cardinality bounds
q For every value a′: |σA=a′(RL)| < N^ε
q |πA(RH)| ≤ N^(1−ε)
[Figure: R split row-wise into the light part RL, where each a value occurs fewer than N^ε times, and the heavy part RH, which contains at most N^(1−ε) distinct a values]
n Also partition S on b and T on c

© 2020, M.T. Özsu & P. Valduriez 26


Maintenance for Skew-Aware Views

Q = ΣU,V,W∈{L,H} Σa,b,c RU(a, b) · SV(b, c) · TW(c, a)

n For joins of light parts only or heavy parts only
q Maintenance using delta processing (Counting)
n For joins of a heavy part with a light part
q Maintenance using pre-materialized views
n Next: Consider one skew-aware view at a time
q Single-tuple update ∆R(a′, b′) to R

© 2020, M.T. Özsu & P. Valduriez 27


Case 1: Light-Light Interaction

Given an update ∆R(a, b) = {(a′, b′) ↦ m}, compute the difference for the skew-aware view (any partition of R)
Σa,b,c R(a, b) · SL(b, c) · TL(c, a)
n Delta, evaluated from left to right: ∆R(a′, b′) · Σc SL(b′, c) · TL(c, a′)
q There are at most N^ε c values paired with b′ in SL
q For each such value c, we check (c, a′) in TL in O(1)
n Maintenance time: O(N^ε)

© 2020, M.T. Özsu & P. Valduriez 28


Case 2: Heavy-Heavy Interaction

n Skew-aware view (any partition of R)
Σa,b,c R(a, b) · SH(b, c) · TH(c, a)
n Delta under update ∆R(a′, b′), evaluated from left to right: ∆R(a′, b′) · Σc TH(c, a′) · SH(b′, c)
q There are at most N^(1−ε) c values paired with a′ in TH
q For each such value c, we check (b′, c) in SH in O(1)
n Maintenance time: O(N^(1−ε))

© 2020, M.T. Özsu & P. Valduriez 29


Case 3: Light-Heavy Interaction

n Skew-aware view (any partition of R)
Σa,b,c R(a, b) · SL(b, c) · TH(c, a)
n Two possible maintenance plans under update ∆R(a′, b′)
q ∆R(a′, b′) · Σc SL(b′, c) · TH(c, a′) in O(N^ε): there are at most N^ε c values paired with b′ in SL
q ∆R(a′, b′) · Σc TH(c, a′) · SL(b′, c) in O(N^(1−ε)): there are at most N^(1−ε) c values paired with a′ in TH
n Maintenance time: O(min{N^ε, N^(1−ε)}) = O(N^min{ε,1−ε})

© 2020, M.T. Özsu & P. Valduriez 30


Case 4: Heavy-Light Interaction

n Skew-aware view (any partition of R)
Σa,b,c R(a, b) · SH(b, c) · TL(c, a)
n Materialized auxiliary views
q VST(b, a) = Σc SH(b, c) · TL(c, a)
q VRS(a, c) = πa,c [RH(a, b) · SL(b, c)]
q VTR(a, c) = πa,c [TH(c, a) · RL(a, b)]
n Maintenance under update ∆R(a′, b′)
q Delta: ∆R(a′, b′) · VST(b′, a′), a lookup in VST
n Maintenance time
q O(1) for the skew-aware view
q For the auxiliary view VRS: an update to RH(a, b) costs O(N^ε) (sum over SL(b′, c′)), an update to SL(b, c) costs O(N^(1−ε)) (sum over RH(a′, b′))
n Size of auxiliary view: O(N^(1+min{ε,1−ε}))

© 2020, M.T. Özsu & P. Valduriez 31


View Self-maintainability

n A view is self-maintainable if the base relations need not


be accessed
q Not the case for the Counting algorithm
n Self-maintainability depends on views’ expressiveness
q Most SPJ views are often self-maintainable wrt. deletion and
modification, but not wrt. insertion
q Example: a view V is self-maintainable wrt to deletion in R if the
key of R is included in V

© 2020, M.T. Özsu & P. Valduriez 32


Outline
n Distributed Data Control
q

q Data security
q

© 2020, M.T. Özsu & P. Valduriez 33


Data Security

n Data protection
q Prevents the physical content of data from being understood by
unauthorized users
q Uses encryption/decryption techniques (Public key)

n Access control
q Only authorized users perform operations they are allowed to on
database objects
q Discretionary access control (DAC)
n Long been provided by DBMS with authorization rules
q Multilevel access control (MAC)
n Increases security with security levels

© 2020, M.T. Özsu & P. Valduriez 34


Discretionary Access Control

n Main actors
q Subjects (users, groups of users) who execute operations
q Operations (in queries or application programs)
q Objects, on which operations are performed
n Checking whether a subject may perform an op. on an
object
q Authorization= (subject, op. type, object def.)
q Defined using GRANT OR REVOKE
q Centralized: one single user class (admin.) may grant or revoke
q Decentralized, with op. type GRANT
n More flexible but recursive revoking process which needs the hierarchy
of grants

© 2020, M.T. Özsu & P. Valduriez 35


Problem with DAC

n A malicious user can access unauthorized data through


an authorized user
n Example
q User A has authorized access to R and S
q User B has authorized access to S only
q B somehow manages to modify an application program used by
A so it writes R data in S
q Then B can read unauthorized data (in S) without violating
authorization rules
n Solution: multilevel security based on the famous Bell
and LaPadula model for OS security

© 2020, M.T. Özsu & P. Valduriez 36


Multilevel Access Control

n Different security levels (clearances)


q Top Secret > Secret > Confidential > Unclassified
n Access controlled by 2 rules:
q No read up
n subject S is allowed to read an object of level L only if level(S) ≥ L
n Protect data from unauthorized disclosure, e.g. a subject with secret
clearance cannot read top secret data
q No write down:
n subject S is allowed to write an object of level L only if level(S) ≤ L
n Protect data from unauthorized change, e.g. a subject with top secret
clearance can only write top secret data but not secret data (which could
then contain top secret data)
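
A tiny illustrative sketch of these two rules over a totally ordered set of clearance levels:

```python
# Illustrative MAC check: "no read up" and "no write down".
LEVELS = ["Unclassified", "Confidential", "Secret", "Top Secret"]
rank = {l: i for i, l in enumerate(LEVELS)}

def can_read(subject_level, object_level):
    # no read up: level(S) >= level(O)
    return rank[subject_level] >= rank[object_level]

def can_write(subject_level, object_level):
    # no write down: level(S) <= level(O)
    return rank[subject_level] <= rank[object_level]

assert can_read("Secret", "Confidential") and not can_read("Secret", "Top Secret")
assert can_write("Secret", "Top Secret") and not can_write("Secret", "Confidential")
```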

© 2020, M.T. Özsu & P. Valduriez 37


MAC in Relational DB

n A relation can be classified at different levels:


q Relation: all tuples have the same clearance
q Tuple: every tuple has a clearance
q Attribute: every attribute has a clearance
n A classified relation is thus multilevel
q Appears differently (with different data) to subjects with different
clearances

© 2020, M.T. Özsu & P. Valduriez 38


Example

PROJ*: classified at attribute level

PNO SL1 PNAME SL2 BUDGET SL3 LOC SL4


P1 C Instrumentation C 150000 C Montreal C
P2 C DB Develop. C 135000 S New York S
P3 S CAD/CAM S 250000 S New York S

PROJ* as seen by a subject with confidential clearance

PNO SL1 PNAME SL2 BUDGET SL3 LOC SL4


P1 C Instrumentation C 150000 C Montreal C
P2 C DB Develop. C Null C Null C

© 2020, M.T. Özsu & P. Valduriez 39


Distributed Access Control

n Additional problems in a distributed environment


q Remote user authentication
n Typically using a directory service
q Should be replicated at some sites for availability
q Management of DAC rules
n Problem if users’ group can span multiple sites
q Rules stored at some directory based on user groups location
q Accessing rules may incur remote queries
q Covert channels in MAC

© 2020, M.T. Özsu & P. Valduriez 40


Covert Channels
n Indirect means to access unauthorized data
n Example
q Consider a simple DDB with 2 sites: C (confidential) and S
(secret)
q Following the “no write down” rule, an update from a subject with
secret clearance can only be sent to S
q Following the “no read up” rule, a read query from the same
subject can be sent to both C and S
q But the query may contain secret information (e.g. in a select
predicate), so is a potential covert channel
n Solution: replicate part of the DB
q So that a site at security level L contains all data that a subject at
level L can access (e.g. S above would replicate the confidential
data so it can entirely process secret queries)

© 2020, M.T. Özsu & P. Valduriez 41


Outline
n Distributed Data Control
q

q Semantic integrity control

© 2020, M.T. Özsu & P. Valduriez 42


Semantic Integrity Control

Maintain database consistency by enforcing a set of


constraints defined on the database.
n Structural constraints
q Basic semantic properties inherent to a data model e.g., unique
key constraint in relational model

n Behavioral constraints
q Regulate application behavior, e.g., dependencies in the
relational model
n Two components
q Integrity constraint specification
q Integrity constraint enforcement

© 2020, M.T. Özsu & P. Valduriez 43


Semantic Integrity Control

n Procedural
q Control embedded in each application program
n Declarative
q Assertions in predicate calculus
q Easy to define constraints
q Definition of database consistency clear
q But inefficient to check assertions for each update
n Limit the search space
n Decrease the number of data accesses/assertion
n Preventive strategies
n Checking at compile time

© 2020, M.T. Özsu & P. Valduriez 44


Constraint Specification Language
Predefined constraints
specify the more common constraints of the relational model
q Not-null attribute
ENO NOT NULL IN EMP
q Unique key
(ENO, PNO) UNIQUE IN ASG
q Foreign key
A key in a relation R is a foreign key if it is a primary key of another
relation S and the existence of any of its values in R is dependent
upon the existence of the same value in S
PNO IN ASG REFERENCES PNO IN PROJ
q Functional dependency
ENO IN EMP DETERMINES ENAME

© 2020, M.T. Özsu & P. Valduriez 45


Constraint Specification Language

Precompiled constraints
Express preconditions that must be satisfied by all tuples in a
relation for a given update type
(INSERT, DELETE, MODIFY)
NEW - ranges over new tuples to be inserted
OLD - ranges over old tuples to be deleted
General Form
CHECK ON <relation> [WHEN <update type>]
<qualification>

© 2020, M.T. Özsu & P. Valduriez 46


Constraint Specification Language

Precompiled constraints

q Domain constraint

CHECK ON PROJ (BUDGET≥500000 AND BUDGET≤1000000)

q Domain constraint on deletion

CHECK ON PROJ WHEN DELETE (BUDGET = 0)

q Transition constraint

CHECK ON PROJ (NEW.BUDGET > OLD.BUDGET AND


NEW.PNO = OLD.PNO)

© 2020, M.T. Özsu & P. Valduriez 47


Constraint Specification Language

General constraints
Constraints that must always be true. Formulae of tuple
relational calculus where all variables are quantified.
General Form
CHECK ON <variable>:<relation>,(<qualification>)
q Functional dependency
CHECK ON e1:EMP, e2:EMP
(e1.ENAME = e2.ENAME IF e1.ENO = e2.ENO)
q Constraint with aggregate function
CHECK ON g:ASG, j:PROJ
(SUM(g.DUR WHERE g.PNO = j.PNO) < 100 IF
j.PNAME = "CAD/CAM")

© 2020, M.T. Özsu & P. Valduriez 48


Integrity Enforcement

Two methods
n Detection
Execute update u: D → Du
If Du is inconsistent then
if possible: compensate Du → Du’
else
undo Du → D
n Preventive
Execute u: D → Du only if Du will be consistent
q Determine valid programs

q Determine valid states

© 2020, M.T. Özsu & P. Valduriez 49


Query Modification
n Preventive
n Add the assertion qualification to the update query
n Only applicable to tuple calculus formulae with
universally quantified variables
UPDATE PROJ
SET BUDGET = BUDGET*1.1
WHERE PNAME = "CAD/CAM"

UPDATE PROJ
SET BUDGET = BUDGET*1.1
WHERE PNAME = "CAD/CAM"
AND NEW.BUDGET ≥ 500000
AND NEW.BUDGET ≤ 1000000

© 2020, M.T. Özsu & P. Valduriez 50


Compiled Assertions
Triple (R,T,C) where
R relation
T update type (insert, delete, modify)
C assertion on differential relations
Example: Foreign key assertion
"g Î ASG, $j Î PROJ : g.PNO = j.PNO
Compiled assertions:
(ASG, INSERT, C1), (PROJ, DELETE, C2), (PROJ, MODIFY, C3)
where
C1: ∀NEW ∈ ASG+, ∃j ∈ PROJ: NEW.PNO = j.PNO
C2: ∀g ∈ ASG, ∀OLD ∈ PROJ−: g.PNO ≠ OLD.PNO
C3: ∀g ∈ ASG, ∀OLD ∈ PROJ−, ∃NEW ∈ PROJ+:
g.PNO ≠ OLD.PNO OR OLD.PNO = NEW.PNO

© 2020, M.T. Özsu & P. Valduriez 51


Differential Relations

Given relation R and update u


R+ contains tuples inserted by u
R- contains tuples deleted by u

Type of u
insert R- empty
delete R+ empty
modify R+ ∪ (R – R−)

© 2020, M.T. Özsu & P. Valduriez 52


Differential Relations
Algorithm:
Input: Relation R, update u, compiled assertion Ci
Step 1: Generate differential relations R+ and R–
Step 2: Retrieve the tuples of R+ and R– which do not
satisfy Ci
Step 3: If retrieval is not successful, then the assertion is
valid.
Example :
u is delete on PROJ. Enforcing (PROJ, DELETE, C2) :
retrieve all tuples of PROJ-
into RESULT
where not(C2)
If RESULT = {}, the assertion is verified

© 2020, M.T. Özsu & P. Valduriez 53
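As an illustration (a minimal sketch, not the book's implementation), the detection algorithm above can be coded directly: relations are modeled as lists of dicts, PROJ− is computed from the delete predicate, and the compiled assertion C2 is checked against ASG. The relation contents and helper names below are assumptions for the example only.

# Hypothetical sketch: enforcing the compiled assertion (PROJ, DELETE, C2)
# with C2: forall g in ASG, forall OLD in PROJ-: g.PNO != OLD.PNO

def proj_minus(proj, delete_predicate):
    """Differential relation PROJ-: tuples that the update would delete."""
    return [t for t in proj if delete_predicate(t)]

def violates_c2(asg, proj_minus_rel):
    """Return ASG tuples that still reference a deleted project (violations of C2)."""
    deleted_pnos = {t["PNO"] for t in proj_minus_rel}
    return [g for g in asg if g["PNO"] in deleted_pnos]

# Example data (assumed)
PROJ = [{"PNO": "P1", "BUDGET": 150000}, {"PNO": "P2", "BUDGET": 0}]
ASG = [{"ENO": "E1", "PNO": "P1"}]

# Update u: delete projects with BUDGET = 0 -> PROJ- = {P2}; no ASG tuple references P2,
# so RESULT is empty and the assertion is verified; deleting P1 instead would be rejected.
result = violates_c2(ASG, proj_minus(PROJ, lambda t: t["BUDGET"] == 0))
print("assertion verified" if not result else f"reject update, violations: {result}")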


Distributed Integrity Control

n Problems:
q Definition of constraints
n Consideration for fragments

q Where to store
n Replication

n Non-replicated : fragments

q Enforcement
n Minimize costs

© 2020, M.T. Özsu & P. Valduriez 54


Types of Distributed Assertions

n Individual assertions
q Single relation, single variable
q Domain constraint

n Set oriented assertions


q Single relation, multi-variable
n functional dependency

q Multi-relation, multi-variable
n foreign key

n Assertions involving aggregates

© 2020, M.T. Özsu & P. Valduriez 55


Distributed Integrity Control
n Assertion Definition
q Similar to the centralized techniques
q Transform the assertions to compiled assertions
n Assertion Storage
q Individual assertions
n One relation, only fragments
n At each fragment site, check for compatibility
n If compatible, store; otherwise reject
n If all the sites reject, globally reject
q Set-oriented assertions
n Involves joins (between fragments or relations)
n May be necessary to perform joins to check for compatibility
n Store if compatible

© 2020, M.T. Özsu & P. Valduriez 56


Distributed Integrity Control
n Assertion Enforcement
q Where to enforce each assertion depends on
n Type of assertion
n Type of update and where update is issued
q Individual Assertions
n If update = insert
q Enforce at the site where the update is issued
n If update = qualified
q Send the assertions to all the sites involved
q Execute the qualification to obtain R+ and R-
q Each site enforces its own assertion
q Set-oriented Assertions
n Single relation
q Similar to individual assertions with qualified updates
n Multi-relation
q Move data to perform joins; then send the result to query master site

© 2020, M.T. Özsu & P. Valduriez 57


Conclusion

n Solutions initially designed for centralized systems have


been significantly extended for distributed systems
q Materialized views and group-based discretionary access control
n Semantic integrity control has received less attention
and is generally not well supported by distributed DBMS
products
n Full data control is more complex and costly in
distributed systems
q Definition and storage of the rules (site selection)
q Design of enforcement algorithms which minimize
communication costs

© 2020, M.T. Özsu & P. Valduriez 58


Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
n Introduction
n Distributed and parallel database design
n Distributed data control
n Distributed Transaction Processing
n Data Replication
n Database Integration – Multidatabase Systems
n Parallel Database Systems
n Peer-to-Peer Data Management
n Big Data Processing
n NoSQL, NewSQL and Polystores
n Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
n Distributed Transaction Processing
q Distributed Concurrency Control
q Distributed Reliability

© 2020, M.T. Özsu & P. Valduriez 3


Transaction

A transaction is a collection of actions that make consistent


transformations of system states while preserving system
consistency.
q concurrency transparency
q failure transparency

© 2020, M.T. Özsu & P. Valduriez 4


Transaction Characterization

Begin_transaction

Read
Read

Write
Read

Commit
n Read set (RS)
q The set of data items that are read by a transaction
n Write set (WS)
q The set of data items whose values are changed by this transaction
n Base set (BS)
q RS ∪ WS

© 2020, M.T. Özsu & P. Valduriez 5


Principles of Transactions

ATOMICITY
q all or nothing

CONSISTENCY
q no violation of integrity constraints

ISOLATION
q concurrent changes invisible ⇒ serializable

DURABILITY
q committed updates persist

© 2020, M.T. Özsu & P. Valduriez 6


Transactions Provide…

n Atomic and reliable execution in the presence of failures

n Correct execution in the presence of multiple user


accesses

n Correct management of replicas (if they support it)

© 2020, M.T. Özsu & P. Valduriez 7


Distributed TM Architecture

© 2020, M.T. Özsu & P. Valduriez 8


Outline
n Distributed Transaction Processing
q Distributed Concurrency Control
q

© 2020, M.T. Özsu & P. Valduriez 9


Concurrency Control

n The problem of synchronizing concurrent transactions


such that the consistency of the database is maintained
while, at the same time, maximum degree of
concurrency is achieved.
n Enforce isolation property
n Anomalies:
q Lost updates
n The effects of some transactions are not reflected on the database.
q Inconsistent retrievals
n A transaction, if it reads the same data item more than once, should
always read the same value.

© 2020, M.T. Özsu & P. Valduriez 10


Serializability in Distributed DBMS

n Two histories have to be considered:


q local histories
q global history

n For global transactions (i.e., global history) to be


serializable, two conditions are necessary:
q Each local history should be serializable → local serializability
q Two conflicting operations should be in the same relative order
in all of the local histories where they appear together →
global serializability

© 2020, M.T. Özsu & P. Valduriez 11


Global Non-serializability

T1: Read(x); x ← x − 100; Write(x); Read(y); y ← y + 100; Write(y); Commit
T2: Read(x); Read(y); Commit

n x stored at Site 1, y stored at Site 2


n LH1, LH2 are individually serializable (in fact serial), but the
two transactions are not globally serializable.
LH1={R1(x),W1(x), R2(x)}
LH2={R2(y), R1(y),W1(y)}
© 2020, M.T. Özsu & P. Valduriez 12
Concurrency Control Algorithms

n Pessimistic
q Two-Phase Locking-based (2PL)
n Centralized (primary site) 2PL
n Primary copy 2PL
n Distributed 2PL
q Timestamp Ordering (TO)
n Basic TO
n Multiversion TO
n Conservative TO
n Optimistic
q Locking-based
q Timestamp ordering-based

© 2020, M.T. Özsu & P. Valduriez 13


Locking-Based Algorithms

n Transactions indicate their intentions by requesting locks


from the scheduler (called lock manager).
n Locks are either read lock (rl) [also called shared lock] or
write lock (wl) [also called exclusive lock]
n Read locks and write locks conflict (because Read and
Write operations are incompatible)
        rl    wl
  rl   yes   no
  wl   no    no
n Locking works nicely to allow concurrent processing of
transactions.
© 2020, M.T. Özsu & P. Valduriez 14
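A minimal sketch (an assumption, not the book's code) of a lock table that enforces the rl/wl compatibility matrix above: many readers may share an item, writers are exclusive, and a sole holder may upgrade its lock.

class LockManager:
    def __init__(self):
        self.locks = {}  # item -> {"mode": "rl" or "wl", "holders": set of txn ids}

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if the transaction must wait."""
        entry = self.locks.get(item)
        if entry is None:
            self.locks[item] = {"mode": mode, "holders": {txn}}
            return True
        if entry["mode"] == "rl" and mode == "rl":      # rl/rl compatible
            entry["holders"].add(txn)
            return True
        if entry["holders"] == {txn}:                    # re-request or upgrade by sole holder
            entry["mode"] = "wl" if "wl" in (mode, entry["mode"]) else "rl"
            return True
        return False                                     # rl/wl and wl/wl conflict

    def release(self, txn, item):
        entry = self.locks.get(item)
        if entry and txn in entry["holders"]:
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[item]

lm = LockManager()
print(lm.request("T1", "x", "rl"), lm.request("T2", "x", "rl"), lm.request("T2", "x", "wl"))
# -> True True False: two read locks coexist, the write lock must wait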
Centralized 2PL

n There is only one 2PL scheduler in the distributed system.


n Lock requests are issued to the central scheduler.

© 2020, M.T. Özsu & P. Valduriez 15


Distributed 2PL

n 2PL schedulers are placed at each site. Each scheduler


handles lock requests for data at that site.
n A transaction may read any of the replicated copies of
item x, by obtaining a read lock on one of the copies of x.
Writing into x requires obtaining write locks for all copies
of x.

© 2020, M.T. Özsu & P. Valduriez 16


Distributed 2PL Execution

© 2020, M.T. Özsu & P. Valduriez 17


Deadlock

n A transaction is deadlocked if it is blocked and will


remain blocked until there is intervention.
n Locking-based CC algorithms may cause deadlocks.
n TO-based algorithms that involve waiting may cause
deadlocks.
n Wait-for graph
q If transaction Ti waits for another transaction Tj to release a lock
on an entity, then Ti → Tj in WFG.

Ti → Tj

© 2020, M.T. Özsu & P. Valduriez 18


Local versus Global WFG

n T1 and T2 run at site 1, T3 and T4 run at site 2.


n T3 waits for a lock held by T4 which waits for a lock held by T1 which
waits for a lock held by T2 which, in turn, waits for a lock held by T3.

Local WFG

Global WFG

© 2020, M.T. Özsu & P. Valduriez 19


Deadlock Detection

n Transactions are allowed to wait freely.


n Wait-for graphs and cycles.
n Topologies for deadlock detection algorithms
q Centralized
q Distributed
q Hierarchical

© 2020, M.T. Özsu & P. Valduriez 20


Centralized Deadlock Detection

n One site is designated as the deadlock detector for the


system. Each scheduler periodically sends its local WFG
to the central site which merges them to a global WFG to
determine cycles.
n How often to transmit?
q Too often ⇒ higher communication cost but lower delays due to
undetected deadlocks
q Too late ⇒ higher delays due to deadlocks, but lower
communication cost
n Would be a reasonable choice if the concurrency control
algorithm is also centralized.
n Proposed for Distributed INGRES
© 2020, M.T. Özsu & P. Valduriez 21
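As a sketch (an assumption, not the actual Distributed INGRES code), the central detector's job can be shown in a few lines: merge the local WFGs into a global WFG and search it for a cycle with a depth-first search.

def merge(local_wfgs):
    """Union of local WFGs; each WFG maps a transaction to the set it waits for."""
    gwfg = {}
    for wfg in local_wfgs:
        for t, waits_for in wfg.items():
            gwfg.setdefault(t, set()).update(waits_for)
    return gwfg

def find_cycle(gwfg):
    """Return one deadlock cycle as a list of transactions, or None."""
    def dfs(node, path, on_path):
        for succ in gwfg.get(node, ()):
            if succ in on_path:
                return path[path.index(succ):] + [succ]
            found = dfs(succ, path + [succ], on_path | {succ})
            if found:
                return found
        return None
    for start in gwfg:
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None

# Site 1: T1 -> T2; Site 2: T2 -> T3, T3 -> T1  =>  global cycle T1 -> T2 -> T3 -> T1
print(find_cycle(merge([{"T1": {"T2"}}, {"T2": {"T3"}, "T3": {"T1"}}])))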
Hierarchical Deadlock Detection

Build a hierarchy of detectors

© 2020, M.T. Özsu & P. Valduriez 22


Distributed Deadlock Detection

n Sites cooperate in detection of deadlocks.


n One example:
q Form local WFGs at each site, modified as follows:
1) Potential deadlock cycles from other sites are
added as edges
2) Join these with regular edges
3) Pass these local WFGs to other sites

q Each local deadlock detector:


n looks for a cycle that does not involve the external edge. If it exists,
there is a local deadlock which can be handled locally.
n looks for a cycle involving the external edge. If it exists, it indicates a
potential global deadlock. Pass on the information to the next site.

© 2020, M.T. Özsu & P. Valduriez 23


Timestamp Ordering

1) Transaction (Ti) is assigned a globally unique timestamp ts(Ti).
2) Transaction manager attaches the timestamp to all operations
issued by the transaction.
3) Each data item is assigned a write timestamp (wts) and a read
timestamp (rts):
q rts(x) = largest timestamp of any read on x
q wts(x) = largest timestamp of any write on x
4) Conflicting operations are resolved by timestamp order.

Basic T/O:
for Ri(x): if ts(Ti) < wts(x) then reject Ri(x)
           else accept Ri(x); rts(x) ← ts(Ti)
for Wi(x): if ts(Ti) < rts(x) or ts(Ti) < wts(x) then reject Wi(x)
           else accept Wi(x); wts(x) ← ts(Ti)
© 2020, M.T. Özsu & P. Valduriez 24
Basic Timestamp Ordering

Two conflicting operations Oij of Ti and Okl of Tk → Oij executed before


Okl iff ts(Ti) < ts(Tk).
q Ti is called older transaction
q Tk is called younger transaction

for Ri(x): if ts(Ti) < wts(x) then reject Ri(x)
           else accept Ri(x); rts(x) ← ts(Ti)
for Wi(x): if ts(Ti) < rts(x) or ts(Ti) < wts(x) then reject Wi(x)
           else accept Wi(x); wts(x) ← ts(Ti)

© 2020, M.T. Özsu & P. Valduriez 25
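A literal encoding of the basic TO rule above, as a sketch (restart handling and distributed timestamp generation are omitted; the class and method names are assumptions for the example).

class BasicTO:
    def __init__(self):
        self.rts = {}  # item -> largest read timestamp seen
        self.wts = {}  # item -> largest write timestamp seen

    def read(self, ts, x):
        if ts < self.wts.get(x, -1):
            return "reject"                       # would read an already-overwritten value
        self.rts[x] = max(self.rts.get(x, -1), ts)
        return "accept"

    def write(self, ts, x):
        if ts < self.rts.get(x, -1) or ts < self.wts.get(x, -1):
            return "reject"                       # a younger transaction already read/wrote x
        self.wts[x] = ts
        return "accept"

s = BasicTO()
print(s.write(5, "x"), s.read(7, "x"), s.write(6, "x"))   # accept accept reject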


Conservative Timestamp Ordering

n Basic timestamp ordering tries to execute an operation


as soon as it receives it
q progressive
q too many restarts since there is no delaying
n Conservative timestamping delays each operation until
there is an assurance that it will not be restarted
n Assurance?
q No other operation with a smaller timestamp can arrive at the
scheduler
q Note that the delay may result in the formation of deadlocks

© 2020, M.T. Özsu & P. Valduriez 26


Multiversion Concurrency Control
(MVCC)

n Do not modify the values in the database, create new


values.
n Typically timestamp-based implementation
ts(xr) < ts(Ti) < ts(Tj)
n Implemented in a number of systems: IBM DB2, Oracle,
SQL Server, SAP HANA, BerkeleyDB, PostgreSQL

© 2020, M.T. Özsu & P. Valduriez 27


MVCC Reads

n A Ri(x) is translated into a read on one version of x.


q Find a version of x (say xv) such that ts(xv) is the largest
timestamp less than ts(Ti).

© 2020, M.T. Özsu & P. Valduriez 28


MVCC Writes

n A Wi(x) is translated into Wi(xw) and accepted if the


scheduler has not yet processed any Rj(xr) such that
ts(xr) < ts(Ti) < ts(Tj)

© 2020, M.T. Özsu & P. Valduriez 29
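A sketch (an assumption, not a product implementation) of the two MVCC rules above for a single item x: a read returns the version with the largest timestamp below ts(Ti); a write is rejected if a younger reader has already read an older version that the new write should have superseded.

class MVCC:
    def __init__(self, x0=0):
        self.versions = [(0, x0)]   # list of (version timestamp, value) for item x
        self.reads = []             # list of (reader timestamp, version timestamp read)

    def read(self, ts):
        vts, val = max((v for v in self.versions if v[0] < ts), key=lambda v: v[0])
        self.reads.append((ts, vts))
        return val

    def write(self, ts, value):
        # reject if some Rj(xr) with ts(xr) < ts(Ti) < ts(Tj) has been processed
        if any(vts < ts < rts for rts, vts in self.reads):
            return "reject"
        self.versions.append((ts, value))
        return "accept"

m = MVCC()
m.read(10)                 # T10 reads the initial version (timestamp 0)
print(m.write(5, 99))      # reject: T10 already read a version older than 5
print(m.write(20, 42))     # accept: creates a new version with timestamp 20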


Optimistic Concurrency Control
Algorithms

Pessimistic execution

Validate → Read → Compute → Write

Optimistic execution

Read → Compute → Validate → Write

© 2020, M.T. Özsu & P. Valduriez 30


Optimistic Concurrency Control
Algorithms
n Transaction execution model: divide into subtransactions
each of which execute at a site
q Tij: transaction Ti that executes at site j

n Transactions run independently at each site until they


reach the end of their read phases
n All subtransactions are assigned a timestamp at the end
of their read phase
n Validation test performed during validation phase. If one
fails, all rejected.

© 2020, M.T. Özsu & P. Valduriez 31


Optimistic CC Validation Test

1) If all transactions Tk where ts(Tk) < ts(Tij) have


completed their write phase before Tij has started its
read phase, then validation succeeds
q Transaction executions in serial order

© 2020, M.T. Özsu & P. Valduriez 32


Optimistic CC Validation Test

2) If there is any transaction Tk such that ts(Tk) < ts(Tij) and


which completes its write phase while Tij is in its read
phase, then validation succeeds if WS(Tk) ∩ RS(Tij) = ∅
q Read and write phases overlap, but Tij does not read data
items written by Tk

© 2020, M.T. Özsu & P. Valduriez 33


Optimistic CC Validation Test

3) If there is any transaction Tk such that ts(Tk) < ts(Tij) and


which completes its read phase before Tij completes its
read phase, then validation succeeds if
WS(Tk) ∩ RS(Tij) = ∅ and WS(Tk) ∩ WS(Tij) = ∅
q They overlap, but don't access any common data items.

© 2020, M.T. Özsu & P. Valduriez 34
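A sketch (an assumption about one way to code it) that cascades the three validation rules above for a subtransaction Tij against every older transaction Tk; transactions are represented as dicts with read/write phase bounds and RS/WS sets, which are assumptions for the example.

def validate(tij, older):
    """Return True if Tij passes validation against all older transactions."""
    for tk in older:
        if tk["write_end"] <= tij["read_start"]:
            continue                                   # rule 1: serial order, always fine
        if tk["write_end"] <= tij["read_end"]:
            if tk["WS"] & tij["RS"]:
                return False                           # rule 2: Tij must not read what Tk wrote
            continue
        if tk["read_end"] <= tij["read_end"]:
            if tk["WS"] & (tij["RS"] | tij["WS"]):
                return False                           # rule 3: no common data items at all
            continue
        return False                                   # otherwise the test cannot succeed
    return True

Tk  = {"read_start": 0, "read_end": 4, "write_end": 6, "RS": {"x"}, "WS": {"x"}}
Tij = {"read_start": 5, "read_end": 9, "write_end": 11, "RS": {"y"}, "WS": {"y"}}
print(validate(Tij, [Tk]))   # True: Tk finished writing during Tij's read phase, but WS(Tk) ∩ RS(Tij) = ∅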


Snapshot Isolation (SI)

n Each transaction “sees” a consistent snapshot of the


database when it starts and R/W this snapshot
n Repeatable reads, but not serializable isolation
n Read-only transactions proceed without significant
synchronization overhead
n Centralized SI-based CC
1) Ti starts, obtains a begin timestamp tsb(Ti)
2) Ti ready to commit, obtains a commit timestamp tsc(Ti) that is greater
than any of the existing tsb or tsc
3) Ti commits if there is no committed Tj such that tsc(Tj) ∈ [tsb(Ti), tsc(Ti)] and Tj
updated a data item that Ti also updated; otherwise Ti is aborted (first committer wins)
4) When Ti commits, changes visible to all Tk where tsb(Tk)> tsc(Ti)

© 2020, M.T. Özsu & P. Valduriez 35
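A minimal sketch (an assumption, not a product implementation) of the centralized SI rules above: begin and commit timestamps come from a single counter, and a transaction aborts if a concurrent transaction with an overlapping write set committed first.

class SnapshotIsolation:
    def __init__(self):
        self.clock = 0
        self.committed = []            # list of (tsb, tsc, write_set)

    def begin(self):
        self.clock += 1
        return self.clock              # tsb(Ti)

    def commit(self, tsb, write_set):
        self.clock += 1
        tsc = self.clock               # tsc(Ti), greater than any existing timestamp
        for (_, other_tsc, ws) in self.committed:
            if tsb < other_tsc <= tsc and ws & write_set:
                return None            # a concurrent Tj already committed a conflicting write
        self.committed.append((tsb, tsc, write_set))
        return tsc

si = SnapshotIsolation()
t1, t2 = si.begin(), si.begin()
print(si.commit(t1, {"x"}))   # commits
print(si.commit(t2, {"x"}))   # None: aborted, first committer wins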


Distributed CC with SI
n Computing a consistent distributed snapshot is hard
n Similar rules to serializability
q Each local history should be SI
q Global history is SI → commitment orders at each site are the same

n Dependence relationship: Ti at site s (Ti^s) is dependent on Tj^s
(dependent(Ti^s, Tj^s)) iff
(RS(Ti^s) ∩ WS(Tj^s) ≠ ∅) ∨ (WS(Ti^s) ∩ RS(Tj^s) ≠ ∅) ∨ (WS(Ti^s) ∩ WS(Tj^s) ≠ ∅)
n Conditions
1) dependent(Ti, Tj) ∧ tsb(Ti^s) < tsc(Tj^s) ⟹ tsb(Ti^t) < tsc(Tj^t) at every site t where
Ti and Tj execute together
2) dependent(Ti, Tj) ∧ tsc(Ti^s) < tsb(Tj^s) ⟹ tsc(Ti^t) < tsb(Tj^t) at every site t where
Ti and Tj execute together
3) tsc(Ti^s) < tsc(Tj^s) ⟹ tsc(Ti^t) < tsc(Tj^t) at every site t where Ti and Tj execute
together
© 2020, M.T. Özsu & P. Valduriez 36
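The dependence test itself is a set-intersection check; a tiny helper (hypothetical names, sets of item identifiers) makes it concrete.

def dependent(rs_i, ws_i, rs_j, ws_j):
    """Two local subtransactions at the same site are dependent iff their
    read/write sets intersect in any of the three ways listed above."""
    return bool((rs_i & ws_j) or (ws_i & rs_j) or (ws_i & ws_j))

print(dependent({"x"}, {"y"}, {"y"}, {"z"}))   # True: WS(Ti) ∩ RS(Tj) = {y}
print(dependent({"x"}, set(), {"x"}, set()))   # False: read-read only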
Distributed CC with SI – Executing Ti
n Coordinating TM: sends Ti's operations to each site s; collects the replies; updates its
event clock to the max of all sites' clocks; if any reply is negative, Ti is globally aborted,
otherwise globally committed
n Each site s: checks whether the first two conditions hold for Ti; replies positive or
negative; updates its event clock to max(own, coordinating TM's)
n If global commit, each site: 1) persists Ti's updates, 2) updates its event clock
© 2020, M.T. Özsu & P. Valduriez 37
Outline
n Distributed Transaction Processing
q

q Distributed Reliability

© 2020, M.T. Özsu & P. Valduriez 38


Reliability

Problem:
How to maintain

atomicity

durability

properties of transactions

© 2020, M.T. Özsu & P. Valduriez 39


Types of Failures

n Transaction failures
q Transaction aborts (unilaterally or due to deadlock)
n System (site) failures
q Failure of processor, main memory, power supply, …
q Main memory contents are lost, but secondary storage contents
are safe
q Partial vs. total failure
n Media failures
q Failure of secondary storage devices → stored data is lost
q Head crash/controller failure
n Communication failures
q Lost/undeliverable messages
q Network partitioning
© 2020, M.T. Özsu & P. Valduriez 40
Distributed Reliability Protocols

n Commit protocols
q How to execute commit command for distributed transactions.
q Issue: how to ensure atomicity and durability?
n Termination protocols
q If a failure occurs, how can the remaining operational sites deal with it.
q Non-blocking: the occurrence of failures should not force the sites to
wait until the failure is repaired to terminate the transaction.
n Recovery protocols
q When a failure occurs, how do the sites where the failure occurred deal
with it.
q Independent: a failed site can determine the outcome of a transaction
without having to obtain remote information.
n Independent recovery ⇒ non-blocking termination

© 2020, M.T. Özsu & P. Valduriez 41


Two-Phase Commit (2PC)

Phase 1 : The coordinator gets the participants ready to


write the results into the database
Phase 2 : Everybody writes the results into the database
q Coordinator :The process at the site where the transaction
originates and which controls the execution
q Participant :The process at the other sites that participate in
executing the transaction
Global Commit Rule:
1) The coordinator aborts a transaction if and only if at least one
participant votes to abort it.
2) The coordinator commits a transaction if and only if all of the
participants vote to commit it.

© 2020, M.T. Özsu & P. Valduriez 42


State Transitions in 2PC

Coordinator:
INITIAL → WAIT: Commit received / send Prepare
WAIT → ABORT: Vote-abort received / send Global-abort
WAIT → COMMIT: Vote-commit received from all / send Global-commit

Participant:
INITIAL → READY: Prepare received / send Vote-commit
INITIAL → ABORT: Prepare received / send Vote-abort
READY → ABORT: Global-abort received / send Ack
READY → COMMIT: Global-commit received / send Ack
© 2020, M.T. Özsu & P. Valduriez 43


Centralized 2PC

[Figure: coordinator C exchanges messages with the participants P]
Phase 1: C → P: Ready?   P → C: Yes/No
Phase 2: C → P: Commit/Abort   P → C: Confirmation

© 2020, M.T. Özsu & P. Valduriez 44


2PC Protocol Actions
Coordinator:
write begin_commit in log; send PREPARE to all participants; enter WAIT
if any VOTE-ABORT is received: write abort in log; send GLOBAL-ABORT
if all participants send VOTE-COMMIT: write commit in log; send GLOBAL-COMMIT
collect the ACKs; write end_of_transaction in log

Participant:
on PREPARE: if ready to commit, write ready in log and send VOTE-COMMIT;
otherwise write abort in log and send VOTE-ABORT (unilateral abort)
on GLOBAL-COMMIT (GLOBAL-ABORT): write commit (abort) in log, commit (abort) locally, send ACK

© 2020, M.T. Özsu & P. Valduriez 45
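A sketch (an assumption, not a complete implementation) of the coordinator's decision logic: real systems also force the log records named above before sending each message, and handle timeouts. The participant functions passed in are hypothetical stand-ins for the vote collection.

def run_2pc(participants):
    """participants: dict name -> function returning 'vote-commit' or 'vote-abort'."""
    votes = {}
    for name, prepare in participants.items():        # Phase 1: send PREPARE, collect votes
        votes[name] = prepare()
        if votes[name] == "vote-abort":                # any abort vote settles the outcome
            break
    ok = all(v == "vote-commit" for v in votes.values()) and len(votes) == len(participants)
    decision = "global-commit" if ok else "global-abort"
    acks = {name: decision for name in participants}  # Phase 2: broadcast the decision
    return decision, acks

decision, acks = run_2pc({"A": lambda: "vote-commit", "B": lambda: "vote-abort"})
print(decision)   # global-abort: the coordinator aborts if any participant votes to abort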


Linear 2PC

Phase 1 (forward): C sends Prepare to the first participant; each participant forwards its
vote (V-C/V-A) to the next one in the chain
Phase 2 (backward): the decision (G-C/G-A) made at the last participant propagates back
through the chain to C

V-C: Vote-Commit, V-A: Vote-Abort, G-C: Global-commit, G-A: Global-abort

© 2020, M.T. Özsu & P. Valduriez 46


Distributed 2PC

[Figure: the coordinator C sends Prepare to all participants P; each participant broadcasts
its Vote-commit/Vote-abort to all other participants, so the global decision is made
independently at every site]

© 2020, M.T. Özsu & P. Valduriez 47


Variations of 2PC

To improve performance by
1) Reduce the number of messages between coordinator &
participants
2) Reduce the number of times logs are written
n Presumed Abort 2PC
q Participant polls coordinator about transaction’s outcome
q No information → abort the transaction
n Presumed Commit 2PC
q Participant polls coordinator about transaction’s outcome
q No information → assume transaction is committed
q Not an exact dual of presumed abort 2PC

© 2020, M.T. Özsu & P. Valduriez 48


Site Failures - 2PC Termination

Coordinator
n Timeout in INITIAL
q Who cares
n Timeout in WAIT
q Cannot unilaterally commit
q Can unilaterally abort
n Timeout in ABORT or COMMIT
q Stay blocked and wait for the acks

© 2020, M.T. Özsu & P. Valduriez 49


Site Failures - 2PC Termination

Participant
n Timeout in INITIAL
q Coordinator must have failed in INITIAL state
q Unilaterally abort
n Timeout in READY
q Stay blocked

© 2020, M.T. Özsu & P. Valduriez 50


Site Failures - 2PC Recovery

Coordinator
n Failure in INITIAL
q Start the commit process upon recovery
n Failure in WAIT
q Restart the commit process upon recovery
n Failure in ABORT or COMMIT
q Nothing special if all the acks have been received
q Otherwise the termination protocol is involved

© 2020, M.T. Özsu & P. Valduriez 51


Site Failures - 2PC Recovery

Participant
n Failure in INITIAL
q Unilaterally abort upon recovery
n Failure in READY
q The coordinator has been informed about the local decision
q Treat as timeout in READY state and invoke the termination protocol
n Failure in ABORT or COMMIT
q Nothing special needs to be done

© 2020, M.T. Özsu & P. Valduriez 52


2PC Recovery Protocols –
Additional Cases
Arise due to non-atomicity of log and message send
actions
n Coordinator site fails after writing “begin_commit” log
and before sending “prepare” command
q treat it as a failure in WAIT state; send “prepare” command
n Participant site fails after writing “ready” record in log but
before “vote-commit” is sent
q treat it as failure in READY state
q alternatively, can send “vote-commit” upon recovery
n Participant site fails after writing “abort” record in log but
before “vote-abort” is sent
q no need to do anything upon recovery
© 2020, M.T. Özsu & P. Valduriez 53
2PC Recovery Protocols –
Additional Case

n Coordinator site fails after logging its final decision


record but before sending its decision to the participants
q coordinator treats it as a failure in COMMIT or ABORT state
q participants treat it as timeout in the READY state
n Participant site fails after writing “abort” or “commit”
record in log but before acknowledgement is sent
q participant treats it as failure in COMMIT or ABORT state
q coordinator will handle it by timeout in COMMIT or ABORT state

© 2020, M.T. Özsu & P. Valduriez 54


Problem With 2PC

n Blocking
q Ready implies that the participant waits for the coordinator
q If coordinator fails, site is blocked until recovery
q Blocking reduces availability
n Independent recovery is not possible
n However, it is known that:
q Independent recovery protocols exist only for single site failures;
no independent recovery protocol exists which is resilient to
multiple-site failures.
n So we search for these protocols – 3PC

© 2020, M.T. Özsu & P. Valduriez 55


Three-Phase Commit

n 3PC is non-blocking.
n A commit protocols is non-blocking iff
q it is synchronous within one state transition, and
q its state transition diagram contains
n no state which is “adjacent” to both a commit and an abort state,
and
n no non-committable state which is “adjacent” to a commit state
n Adjacent: possible to go from one state to another with a
single state transition
n Committable: all sites have voted to commit a
transaction
q e.g.: COMMIT state

© 2020, M.T. Özsu & P. Valduriez 56


State Transitions in 3PC

Coordinator:
INITIAL → WAIT: Commit received / send Prepare
WAIT → ABORT: Vote-abort received / send Global-abort
WAIT → PRE-COMMIT: Vote-commit received from all / send Prepare-to-commit
PRE-COMMIT → COMMIT: Ready-to-commit received / send Global-commit

Participant:
INITIAL → READY: Prepare received / send Vote-commit
INITIAL → ABORT: Prepare received / send Vote-abort
READY → ABORT: Global-abort received / send Ack
READY → PRE-COMMIT: Prepare-to-commit received / send Ready-to-commit
PRE-COMMIT → COMMIT: Global-commit received / send Ack
3PC Protocol Actions

Coordinator:
write begin_commit in log; send PREPARE to all participants; enter WAIT
if any VOTE-ABORT is received: write abort in log; send GLOBAL-ABORT
if all participants send VOTE-COMMIT: write prepare_to_commit in log; send PREPARE-TO-COMMIT
when all READY-TO-COMMIT messages arrive: write commit in log; send GLOBAL-COMMIT
collect the ACKs; write end_of_transaction in log

Participant:
on PREPARE: if ready to commit, write ready in log and send VOTE-COMMIT;
otherwise write abort in log and send VOTE-ABORT (unilateral abort)
on GLOBAL-ABORT: write abort in log; send ACK
on PREPARE-TO-COMMIT: write prepare_to_commit in log; send READY-TO-COMMIT
on GLOBAL-COMMIT: write commit in log; send ACK
© 2020, M.T. Özsu & P. Valduriez 58
Network Partitioning

n Simple partitioning
q Only two partitions
n Multiple partitioning
q More than two partitions
n Formal bounds:
q There exists no non-blocking protocol that is resilient to a
network partition if messages are lost when partition occurs.
q There exist non-blocking protocols which are resilient to a single
network partition if all undeliverable messages are returned to
sender.
q There exists no non-blocking protocol which is resilient to a
multiple partition.

© 2020, M.T. Özsu & P. Valduriez 59


Independent Recovery Protocols for
Network Partitioning

n No general solution possible


q allow one group to terminate while the other is blocked
q improve availability
n How to determine which group to proceed?
q The group with a majority
n How does a group know if it has majority?
q Centralized
n Whichever partition contains the central site should terminate the
transaction
q Voting-based (quorum)

© 2020, M.T. Özsu & P. Valduriez 60


Quorum Protocols

n The network partitioning problem is handled by the


commit protocol.
n Every site is assigned a vote Vi.
n Total number of votes in the system V
n Abort quorum Va, commit quorum Vc
q Va + Vc > V where 0 ≤ Va , Vc ≤ V
q Before a transaction commits, it must obtain a commit quorum Vc
q Before a transaction aborts, it must obtain an abort quorum Va

© 2020, M.T. Özsu & P. Valduriez 61


Paxos Consensus Protocol

n General problem: how to reach an agreement


(consensus) among TMs about the fate of a transaction
q 2PC and 3PC are special cases
n General idea: If a majority reaches a decision, the global
decision is reached (like voting)
n Roles:
q Proposer: recommends a decision
q Acceptor: decides whether to accept the proposed decision
q Learner: discovers the agreed-upon decision by asking or it is
pushed

© 2020, M.T. Özsu & P. Valduriez 62


Paxos & Complications

n Naïve Paxos: one proposer


q Operates like a 2PC
n Complications
q Multiple proposers can exist at the same time; acceptor has to
choose
n Attach a ballot number
q Multiple proposals may result in split votes with no majority
n Run multiple consensus rounds → performance implication
n Choose a leader
q Some acceptors fail after they accept a decision; the remaining
acceptors may not constitute majority
n Use ballot numbers

© 2020, M.T. Özsu & P. Valduriez 63


Basic Paxos – No Failures
Proposer (or Leader):
send prepare(bal) to the acceptors and record the Acks
if Acks are received from a majority:
  if no Ack reports a previously accepted (bal′, val′): val ← the value the proposer wants, nbal ← bal
  otherwise: val ← val′ of the highest-ballot accepted proposal, nbal ← bal
send accept(nbal, val); ignore replies whose ballot ≠ nbal; record accepted(nbal, val) replies

Acceptor:
on prepare(bal): if bal > any ballot received so far, record bal and reply Ack (reporting any
previously accepted proposal); otherwise ignore and wait
on accept(nbal, val): if nbal is still the highest ballot promised, record accepted(nbal, val);
otherwise ignore

© 2020, M.T. Özsu & P. Valduriez 64
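A minimal single-decree acceptor sketch (an illustration only: proposers, learners, retries and persistence are omitted, and the class is a hypothetical name, not a library API).

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted = None        # (ballot, value) of the last accepted proposal, if any

    def prepare(self, bal):
        """Phase 1b: promise not to accept smaller ballots; report any accepted value."""
        if bal > self.promised:
            self.promised = bal
            return ("ack", self.accepted)
        return ("ignore", None)

    def accept(self, bal, val):
        """Phase 2b: accept if no higher ballot has been promised in the meantime."""
        if bal >= self.promised:
            self.promised = bal
            self.accepted = (bal, val)
            return "accepted"
        return "rejected"

a = Acceptor()
print(a.prepare(1))          # ('ack', None)
print(a.accept(1, "commit")) # 'accepted'
print(a.prepare(0))          # ('ignore', None): an older ballot is ignored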


Basic Paxos with Failures

n Some acceptors fail but there is quorum


q Not a problem
n Enough acceptors fail to eliminate quorum
q Run a new ballot
n Proposer/leader fails
q Choose a new leader and start a new ballot

© 2020, M.T. Özsu & P. Valduriez 65


Distributed Transaction Processing
Paulo Vieira - 2024
Introduction to Transactions
A unit of work that must be executed completely or not at all
Example: bank transfer (debit one account, credit another)

2
ACID Properties
1. Atomicity: all or nothing
2. Consistency: preserves data integrity
3. Isolation: transactions appear to execute in isolation
4. Durability: after commit, changes are permanent

3
Challenges in Distributed Systems
Data distributed across several nodes
Need for coordination between nodes
Possibility of partial failures
Network latency

4
Concurrency Control
1. Locking
2. Timestamp Ordering
3. Optimistic methods

5
Two-Phase Locking (2PL)
1. Growing phase: locks may be acquired, but none may be released
2. Shrinking phase: locks may be released, but none may be acquired

6
Lock Types
Shared lock: for reads
Exclusive lock: for writes

7
Strict 2PL
Holds write locks until the end of the transaction
Advantages:
Avoids cascading rollbacks
Guarantees recoverability
Disadvantage: reduces concurrency

8
Distributed 2PL
Approaches:
1. Centralized: one node manages all locks
2. Distributed: each node manages the locks for its local data

9
Centralized 2PL
Process:
1. The transaction requests the lock from the central node
2. The central node checks compatibility
3. It grants or rejects the lock acquisition
Pros and cons:
Simpler to implement
Single point of failure
Possible bottleneck
10
Fully Distributed 2PL
Process:
1. The transaction requests the lock from the node that holds the required data
2. The node checks compatibility locally
3. It grants or rejects the lock acquisition
Pros and cons:
More robust and scalable
More complex to implement
May lead to distributed deadlocks
11
Deadlocks
Definition:
A situation where two or more transactions wait for each other indefinitely
Example:
T1 holds a lock on A, needs B
T2 holds a lock on B, needs A

12
Deadlock Management
Methods:
1. Prevention
2. Detection and resolution
Detection approaches:
Centralized
Hierarchical
Distributed
13
Centralized Deadlock Detection
Process:
1. The central node gathers lock-intention information from all nodes
2. It builds the global graph
3. It periodically searches the global graph for cycles
Pros and cons:
Simpler to implement
Possible single point of failure in complex systems
14
Distributed Deadlock Detection
Process:
1. Each node maintains a graph of local lock intentions
2. Nodes exchange information about potential cycles
3. The detection algorithm runs on multiple nodes
Pros and cons:
More scalable
More complex to implement and maintain
15
Deadlock Resolution
Methods:
1. Select a "victim" (the transaction to abort)
2. Roll back the "victim" transaction
3. Release its locks
Possible selection criteria:
Transaction age
Transaction progress
Number of locked resources
16
Distributed Atomic Commit
Challenge:
Ensure that a distributed transaction is entirely committed or entirely aborted at
all participating nodes
Solutions:
1. Two-Phase Commit protocol (2PC)
2. Three-Phase Commit protocol (3PC)

17
Two-Phase Commit Protocol (2PC)
Phases:
1. Voting phase
2. Decision phase
Roles:
Coordinator (decides the transaction's outcome)
Participants (nodes involved in the transaction)
18
2PC: Voting Phase
1. The coordinator sends "prepare" to the participants
2. Participants check whether they can commit
3. Participants reply "yes" or "no" to the coordinator
4. Participants enter the "prepared" state if they vote "yes"

19
2PC: Decision Phase
1. If everyone voted "yes", the coordinator decides commit
2. If anyone voted "no", the coordinator decides abort
3. The coordinator sends the decision to the participants
4. Participants apply the decision and acknowledge

20
2PC: Logging
Coordinator:
BEGIN_COMMIT
GLOBAL_COMMIT or GLOBAL_ABORT
COMPLETE
Participant:
READY
COMMIT or ABORT
21
Limitations of 2PC
1. Blocking if the coordinator fails after "prepare"
2. Participants may be left in an uncertain state
3. Performance affected by network latency
4. Vulnerable to network failures (partitions)

22
Three-Phase Commit Protocol (3PC)
Goal:
Make the commit protocol non-blocking
Phases:
1. Voting phase
2. Pre-commit phase
3. Final commit phase
23
3PC: Voting Phase
Identical to the 2PC voting phase
Participants vote "yes" or "no"

24
3PC: Pre-commit Phase
1. If everyone voted "yes", the coordinator sends "pre-commit"
2. Participants acknowledge receipt of the "pre-commit"
3. An intermediate state between prepared and commit

25
3PC: Final Commit Phase
1. The coordinator sends the final "commit"
2. Participants commit and acknowledge
Advantage:
Participants can decide if the coordinator fails

26
Comparison: 2PC vs 3PC
2PC:
Simpler and widely used
Blocking in some failure situations
3PC:
Non-blocking
Higher message overhead
More complex to implement
27
Recovery in Distributed Transactions
Scenarios:
1. Participant failure
2. Coordinator failure
3. Network failures (partitions)
Principle:
Use log information to rebuild the state
28
Recovery from Participant Failure
1. The participant checks its log on restart
2. If it finds READY, it contacts the coordinator for the final decision
3. If it finds COMMIT or ABORT, it performs the corresponding action

29
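As a small sketch of the three cases above (an assumption about how one might code it; the log is a simple list of record names and ask_coordinator is a hypothetical callback returning the global decision):

def recover_participant(log, ask_coordinator):
    last = log[-1] if log else None
    if last == "READY":
        return ask_coordinator()          # still in doubt: the coordinator decides
    if last in ("COMMIT", "ABORT"):
        return last                       # decision already known locally
    return "ABORT"                        # no READY record: unilaterally abort

print(recover_participant(["READY"], lambda: "COMMIT"))   # COMMIT
print(recover_participant([], lambda: "COMMIT"))          # ABORT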
Recovery from Coordinator Failure
1. A new coordinator is elected (if necessary)
2. The new coordinator queries the participants about the transaction's state
3. It decides based on the participants' replies

30
Conclusion
Distributed transactions are complex but essential
2PL and 2PC/3PC are fundamental to guaranteeing ACID
Balance consistency, availability and partition tolerance (CAP theorem)
Modern systems may "relax" some of these guarantees to improve performance

31
HBase
Non-relational, column-oriented NoSQL database
Based on the Google BigTable model
Offers high scalability and performance

32
HBase Components
HMaster: manages the RegionServers
RegionServer: manages the data regions
ZooKeeper: coordination and state management

33
Replication in HBase
Asynchronous replication between clusters
Supports master-slave and multi-master topologies
Improves availability and fault tolerance

34
Fragmentation in HBase
Data divided into regions
Automatic distribution of regions across RegionServers
Automatic load balancing

35
Transactions in HBase
ACID transactions supported only at the row level
Version-based isolation
Implemented through the Transaction Manager component

36
Phoenix
SQL layer on top of HBase
Enables SQL queries and ACID transactions
Maps SQL tables to HBase tables

37
Phoenix Components
Phoenix client: JDBC driver
Phoenix Query Server: query execution
Phoenix Compiler: query optimization

38
Replication with Phoenix
Supports HBase's native replication
Allows SQL queries on secondary replicas
Improves read performance

39
Transactions in Phoenix
Full ACID transactions
Snapshot isolation

40
Hive
Distributed data warehouse for big data analytics
Supports SQL-like queries (HiveQL)
Runs MapReduce or Spark jobs

41
Hive Components
Metastore: stores table metadata
Driver: query compilation and optimization
Execution Engine: query execution (MapReduce/Spark)

42
Fragmentation in Hive
Supports table partitioning
Allows bucketing for better data distribution
Optimizes queries with partition pruning

43
Transactions in Hive
ACID transaction support (since version 0.14)
Implemented on top of HDFS and MapReduce/Tez
Uses optimistic concurrency control

44
Hue
Web interface for the Hadoop ecosystem
Simplifies interaction with HBase, Hive and other components
Offers SQL editors, data browsers and dashboards

45
Hue Components
SQL Editor: interface for Hive and HBase queries
File Browser: HDFS navigation
Job Browser: job monitoring

46
What is ZooKeeper?
A coordination service for distributed systems
Developed by the Apache Software Foundation
Acts as a "trusted arbiter" between distributed services

47
What Does ZooKeeper Coordinate?
Mainly:
1. Metadata and configuration state
2. Control and coordination information between services
Not used for:
Coordinating application data at large scale
Storing large files

48
Problems Solved by ZooKeeper
1. Keeping multiple servers consistent
2. Handling partial failures in the system
3. Managing shared state correctly

49
How Does ZooKeeper Work?
Uses a hierarchical structure of "znodes", similar to a file system
Each znode can store up to 1 MB of data
Clients watch znodes to detect changes

50
Watcher Mechanism
Clients can "watch" znodes to notice changes
Notifications are sent when changes occur
Enables fast reaction to configuration or state changes

51
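A toy in-memory model of the znode/watch idea (this is not the ZooKeeper client API; the class and paths below are assumptions for illustration): clients register one-shot watches on a path and are notified on the next change.

class ZNodeTree:
    def __init__(self):
        self.data = {}       # path -> bytes (each znode can hold a small payload)
        self.watches = {}    # path -> list of callbacks, fired once on the next change

    def get(self, path, watch=None):
        if watch:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        for cb in self.watches.pop(path, []):   # one-shot: watches are cleared when fired
            cb(path, value)

tree = ZNodeTree()
tree.get("/config/db", watch=lambda p, v: print("changed:", p, v))
tree.set("/config/db", b"host=replica2")        # triggers the watcher notification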
Typical Use Cases
1. Dynamic configuration: update configurations in real time
2. Leader election: determine the leader server in a cluster
3. Distributed locks: coordinate access to shared resources
4. Task queues: distribute work among multiple workers

52
Mapping of Concepts
Replication: implemented in HBase and Phoenix
Fragmentation: present in HBase (regions) and Hive (partitions)
Transactions: supported in HBase, Phoenix and Hive
Concurrency control: MVCC (Multi-Version Concurrency Control) in HBase/Phoenix,
optimistic in Hive

53
Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
n Introduction
n Distributed and parallel database design
n Distributed data control
n Distributed Transaction Processing
n Data Replication
n Database Integration – Multidatabase Systems
n Parallel Database Systems
n Peer-to-Peer Data Management
n Big Data Processing
n NoSQL, NewSQL and Polystores
n Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
n Data Replication
q Consistency criteria
q Update Management Strategies
q Replication Protocols
q Replication and Failure Management

© 2020, M.T. Özsu & P. Valduriez 3


Replication

n Why replicate?
q System availability
n Avoid single points of failure
q Performance
n Localization
q Scalability
n Scalability in numbers and geographic area
q Application requirements
n Why not replicate?
q Replication transparency
q Consistency issues
n Updates are costly
n Availability may suffer if not careful

© 2020, M.T. Özsu & P. Valduriez 4


Execution Model

n There are physical copies of logical objects in the system.


n Operations are specified on logical objects, but translated to operate
on physical objects.
n One-copy equivalence
q Transaction effects on replicated objects should be the same as if they
had been performed on a single set of objects.
Write(x)

x Logical data item

Write(x1) Write(x2) Write(xn)

x1 x2 … xn

Physical data item (replicas, copies)

© 2020, M.T. Özsu & P. Valduriez 5


Replication Issues

n Consistency models - how do we reason about the


consistency of the “global execution state”?
q Mutual consistency
q Transactional consistency
n Where are updates allowed?
q Centralized
q Distributed
n Update propagation techniques – how do we propagate
updates to one copy to the other copies?
q Eager
q Lazy

© 2020, M.T. Özsu & P. Valduriez 6


Outline
n Data Replication
q Consistency criteria
q

q
q

© 2020, M.T. Özsu & P. Valduriez 7


Consistency

n Mutual Consistency
q How do we keep the values of physical copies of a logical data
item synchronized?
q Strong consistency
n All copies are updated within the context of the update transaction
n When the update transaction completes, all copies have the same
value
n Typically achieved through 2PC
q Weak consistency
n Eventual consistency: the copies are not identical when update
transaction completes, but they eventually converge to the same
value
n Many versions possible:
q Time-bounds
q Value-bounds
q Drifts

© 2020, M.T. Özsu & P. Valduriez 8


Transactional Consistency

n How can we guarantee that the global execution history


over replicated data is serializable?
n One-copy serializability (1SR)
q The effect of transactions performed by clients on replicated
objects should be the same as if they had been performed one
at-a-time on a single set of objects.
n Weaker forms are possible
q Snapshot isolation
q RC-serializability

© 2020, M.T. Özsu & P. Valduriez 9


Example 1

Site A holds x; Site B holds x, y; Site C holds x, y, z
T1: x ← 20; Write(x); Commit
T2: Read(x); y ← x + y; Write(y); Commit
T3: Read(x); Read(y); z ← (x∗y)/100; Write(z); Commit
Consider the three histories:
HA={W1(xA), C1}
HB={W1(xB), C1, R2(xB), W2(yB), C2}
HC={W2(yC), C2, R3(xC), R3(yC), W3(zC), C3, W1(xC), C1}

Global history non-serializable: HB: T1→T2, HC: T2→T3→T1

Mutually consistent: Assume xA=xB=xC=10, yB=yC=15, zC=7 to begin; in the end
xA=xB=xC=20, yB=yC=35, zC=3.5

© 2020, M.T. Özsu & P. Valduriez 10


Example 2

Site A Site B
x x

T1: Read(x) T2: Read(x)


x ← x+5 x ← x∗10
Write(x) Write(x)
Commit Commit

Consider the two histories:

HA={R1(xA),W1(xA), C1, R2(xA), W2(xA), C2}


HB={R2(xB), W2(xB), C2, R1(xB), W1(xB), C1}

Global history non-serializable: HA: T1→ T2, HB: T2→ T1


Mutually inconsistent: Assume xA=xB=1 to begin; in the end xA=60, xB=15

© 2020, M.T. Özsu & P. Valduriez 11


Outline
n Data Replication
q

q Update Management Strategies


q
q

© 2020, M.T. Özsu & P. Valduriez 12


Update Management Strategies

n Depending on when the updates are propagated


q Eager
q Lazy
n Depending on where the updates can take place
q Centralized
q Distributed
(The four combinations — eager/lazy × centralized/distributed — are shown on the
Replication Protocols slide)

© 2020, M.T. Özsu & P. Valduriez 13


Eager Replication

n Changes are propagated within the scope of the transaction making the
changes. The ACID properties apply to all copy updates.
q Synchronous
q Deferred
n ROWA protocol: Read-one/Write-all

[Figure: the transaction's updates and its commit are applied at Sites 1–4 within the
transaction boundary]

© 2020, M.T. Özsu & P. Valduriez 14


Lazy Replication

● Lazy replication first executes the updating transaction on one copy. After
the transaction commits, the changes are propagated to all other copies
(refresh transactions)
● While the propagation takes place, the copies are mutually inconsistent.
● The time the copies are mutually inconsistent is an adjustable parameter
which is application dependent.

[Figure: the transaction updates and commits at Site 1; the changes are propagated to
Sites 2–4 after the commit]

© 2020, M.T. Özsu & P. Valduriez 15


Centralized

● There is only one copy which can be updated (the master), all others
(slave copies) are updated reflecting the changes to the master.

Site 1 Site 2 Site 3 Site 4

Site 1 Site 2 Site 3 Site 4

© 2020, M.T. Özsu & P. Valduriez 16


Distributed

● Changes can be initiated at any of the copies. That is, any of the
sites which owns a copy can update the value of the data item.

Transaction
updates commit

Site 1 Site 2 Site 3 Site 4

Transaction
updates commit

Site 1 Site 2 Site 3 Site 4

© 2020, M.T. Özsu & P. Valduriez 17


Forms of Replication
Eager
+ No inconsistencies (identical copies)
+ Reading the local copy yields the most up-to-date value
+ Changes are atomic
− A transaction has to update all sites
− Longer execution time

Lazy
+ A transaction is always local (good response time)
− Data inconsistencies
− A local read does not always return the most up-to-date value
− Changes to all copies are not guaranteed
− Replication is not transparent

Centralized
+ No inter-site synchronization is necessary (it takes place at the master)
+ There is always one site which has all the updates
− The load at the master can be high
− Lower availability
− Reading the local copy may not yield the most up-to-date value

Distributed
+ Any site can run a transaction
+ Load is evenly distributed
− Copies need to be synchronized

© 2020, M.T. Özsu & P. Valduriez 18


Outline
n Data Replication
q

q Replication Protocols
q

© 2020, M.T. Özsu & P. Valduriez 19


Replication Protocols

The previous ideas can be combined into 4 different replication protocols:

              Centralized          Distributed
Eager     Eager centralized   Eager distributed
Lazy      Lazy centralized    Lazy distributed

© 2020, M.T. Özsu & P. Valduriez 20


Eager Centralized Protocols
n Design parameters:
q Distribution of master
n Single master: one master for all data items
n Primary copy: different masters for different (sets of) data items
q Level of transparency
n Limited: applications and users need to know who the master is
q Update transactions are submitted directly to the master
q Reads can occur on slaves
n Full: applications and users can submit anywhere, and the
operations will be forwarded to the master
q Operation-based forwarding
n Four alternative implementation architectures, only three
are meaningful:
q Single master, limited transparency
q Single master, full transparency
q Primary copy, full transparency

© 2020, M.T. Özsu & P. Valduriez 21


Eager Single Master/Limited
Transparency
n Applications submit update transactions directly to the master
n Master:
q Upon read: read locally and return to user
q Upon write: write locally, multicast write to other replicas (in FIFO or timestamp order)
q Upon commit request: run 2PC coordinator to ensure that all have really installed the
changes
q Upon abort: abort and inform other sites about abort
n Slaves install writes that arrive from the master
[Figure: update transactions (Op(x) … Commit) are submitted to the master site;
read-only transactions (Read(x) …) are submitted to the slave sites A, B, C]

© 2020, M.T. Özsu & P. Valduriez 22


Eager Single Master/Limited
Transparency (cont’d)
n Applications submit read transactions directly to an appropriate slave
n Slave
q Upon read: read locally
q Upon write from master copy: execute conflicting writes in the proper order
(FIFO or timestamp)
q Upon write from client: refuse (abort the transaction; this is an error)
q Upon commit request from read-only: commit locally
q Participant of 2PC for update transaction running on primary

© 2020, M.T. Özsu & P. Valduriez 23


Eager Single Master/Full Transparency
Applications submit all transactions to the Transaction Manager at their
own sites (Coordinating TM)

Coordinating TM:
1. Send op(x) to the master site
2. Send Read(x) to any site that has x
3. Send Write(x) to all the slaves where a copy of x exists
4. When Commit arrives, act as coordinator for 2PC

Master Site:
1. If op(x) = Read(x): set read lock on x and send “lock granted” msg to the coordinating TM
2. If op(x) = Write(x): set write lock on x, update the local copy of x, inform the coordinating TM
3. Act as participant in 2PC
Eager Primary Copy/Full Transparency

n Applications submit transactions directly to their local TMs


n Local TM:
q Forward each operation to the primary copy of the data item
q Upon granting of locks, submit Read to any slave, Write to all slaves
q Coordinate 2PC

[Figure: the transaction's Op(x), Op(y), Commit are forwarded to Master(x) at Site A and
Master(y) at Site C; Sites B and D hold slave copies]

© 2020, M.T. Özsu & P. Valduriez 25


Eager Primary Copy/Full Transparency
(cont’d)
n Primary copy site
q Read(x): lock x and reply to TM
q Write(x): lock x, perform update, inform TM
q Participate in 2PC
n Slaves: as before


© 2020, M.T. Özsu & P. Valduriez 26


Eager Distributed Protocol

n Updates originate at any copy


q Each site uses two-phase locking (2PL).
q Read operations are performed locally.
q Write operations are performed at all sites (using a distributed locking
protocol).
q Coordinate 2PC
n Slaves:
q As before
[Figure: Transaction 1 and Transaction 2 each issue Write(x) … Commit at different sites;
the writes are applied at all sites A–D]

© 2020, M.T. Özsu & P. Valduriez 27


Eager Distributed Protocol (cont’d)

n Critical issue:
q Concurrent Writes initiated at different master sites are executed in the
same order at each slave site
q Local histories are serializable (this is easy)
n Advantages
q Simple and easy to implement
n Disadvantage
q Very high communication overhead
n n replicas; m update operations in each transaction: n*m messages (assume
no multicasting)
n For throughput of k tps: k* n*m messages
n Alternative
q Use group communication + deferred update to slaves to reduce
messages

© 2020, M.T. Özsu & P. Valduriez 28


Lazy Single Master/Limited
Transparency
n Update transactions submitted to master
n Master:
q Upon read: read locally and return to user
q Upon write: write locally and return to user
q Upon commit/abort: terminate locally
q Sometime after commit: multicast updates to slaves (in order)
n Slaves:
q Upon read: read locally
q Refresh transactions: install updates
[Figure: Transaction 1's Write(x) … Commit executes at the master site and its updates are
propagated to the slaves after commit; Transaction 2's Read(x) executes at a slave site]

© 2020, M.T. Özsu & P. Valduriez 29


Lazy Primary Copy/Limited
Transparency
n There are multiple masters; each master execution is similar to lazy
single master in the way it handles transactions

n Slave execution complicated: refresh transactions from multiple


masters and need to be ordered properly

© 2020, M.T. Özsu & P. Valduriez 30


Lazy Primary Copy/Limited
Transparency – Slaves
n Assign system-wide unique timestamps to refresh transactions and
execute them in timestamp order
q May cause too many aborts
n Replication graph
q Similar to serialization graph, but nodes are transactions (T) + sites (S);
edge 〈Ti,Sj〉exists iff Ti performs a Write(x) and x is stored in Sj
q For each operation (opk), enter the appropriate nodes (Tk) and edges; if
graph has no cycles, no problem
q If cycle exists and the transactions in the cycle have been committed at
their masters, but their refresh transactions have not yet committed at
slaves, abort Tk; if they have not yet committed at their masters, Tk waits.
n Use group communication

© 2020, M.T. Özsu & P. Valduriez 31


Lazy Single Master/Full Transparency

n This is very tricky


q Forwarding operations to a master and then getting refresh
transactions cause difficulties
n Two problems:
q Violation of 1SR behavior
q A transaction may not see its own reads
n Problem arises in primary copy/full transparency as well

© 2020, M.T. Özsu & P. Valduriez 32


Example 3
n Site M (Master) holds x, y; Site B holds slave copies of x, y
n T1: Read(x), Write(y), Commit — submitted at Site B
n T2: Write(x), Write(y), Commit
[Figure: message sequence between Site B and Site M — T1's Read(x) executes locally at B;
T2's writes and T1's Write(y) are forwarded to M; after each commit, M sends the
corresponding refresh transaction to B, which executes and commits it]
HM = {W2(xM), W2(yM), C2, W1(yM), C1}
HB = {R1(xB), C1, W2R(xB), W2R(yB), C2R, W1R(yB), C1R}
At M the conflict on y orders T2 → T1, while at B the conflict on x orders T1 → T2,
so the global history is not 1SR

© 2020, M.T. Özsu & P. Valduriez 33


Example 4

n Master site M holds x, site C holds slave copy of x


n T3: Write(x), Read(x), Commit
n Sequence of execution
1. W3(x) submitted at C, forwarded to M for execution
2. W3(x) is executed at M, confirmation sent back to C
3. R3(x) submitted at C and executed on the local copy
4. T3 submits Commit at C, forwarded to M for execution
5. M executes Commit, sends notification to C, which also
commits T3
6. M sends refresh transaction for T3 to C (for W3(x) operation)
7. C executes the refresh transaction and commits it
n When C reads x at step 3, it does not see the effects of
Write at step 2
© 2020, M.T. Özsu & P. Valduriez 34
Lazy Single Master/
Full Transparency - Solution
n Assume T = Write(x)
n At commit time of transaction T, the master generates a
timestamp for it [ts(T)]
n Master sets last_modified(xM) ← ts(T)
n When a refresh transaction arrives at a slave site i, it
also sets last_modified(xi) ← last_modified(xM)
n Timestamp generation rule at the master:
q ts(T) should be greater than all previously issued timestamps
and should be less than the last_modified timestamps of the
data items it has accessed. If such a timestamp cannot be
generated, then T is aborted.

© 2020, M.T. Özsu & P. Valduriez 35


Lazy Distributed Replication
n Any site:
q Upon read: read locally and return to user
q Upon write: write locally and return to user
q Upon commit/abort: terminate locally
q Sometime after commit: send refresh transaction
q Upon message from other site
n Detect conflicts
n Install changes
n Reconciliation may be necessary
[Figure: Transaction 1 and Transaction 2 each Write(x) … Commit locally at their own sites;
refresh transactions are exchanged between the sites afterwards]

© 2020, M.T. Özsu & P. Valduriez 36


Reconciliation

n Such problems can be solved using pre-arranged


patterns:
q Latest update win (newer updates preferred over old ones)
q Site priority (preference to updates from headquarters)
q Largest value (the larger transaction is preferred)
n Or using ad-hoc decision making procedures:
q Identify the changes and try to combine them
q Analyze the transactions and eliminate the non-important ones
q Implement your own priority schemas

© 2020, M.T. Özsu & P. Valduriez 37
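A small sketch (an assumption, not a product feature) of the pre-arranged patterns above: pick a winner among conflicting lazy updates by latest timestamp, breaking ties by a static site priority. The site names and priorities are hypothetical.

SITE_PRIORITY = {"HQ": 0, "branch1": 1, "branch2": 2}   # smaller = preferred

def reconcile(conflicting_updates):
    """Each update is (timestamp, site, value); return the value that wins."""
    return max(conflicting_updates,
               key=lambda u: (u[0], -SITE_PRIORITY.get(u[1], 99)))[2]

print(reconcile([(10, "branch1", "x=5"), (12, "branch2", "x=7")]))   # newest wins: x=7
print(reconcile([(12, "HQ", "x=1"), (12, "branch2", "x=7")]))        # tie: HQ preferred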


Replication Strategies

Eager Centralized
+ Updates do not need to be coordinated
+ No inconsistencies
− Longest response time
− Only useful with few updates
− Local copies can only be read

Eager Distributed
+ No inconsistencies
+ Elegant (symmetrical solution)
− Long response times
− Updates need to be coordinated

Lazy Centralized
+ No coordination necessary
+ Short response times
− Local copies are not up to date
− Inconsistencies

Lazy Distributed
+ No centralized coordination
+ Shortest response times
− Inconsistencies
− Updates can be lost (reconciliation)

© 2020, M.T. Özsu & P. Valduriez 38


Group Communication

n A node can multicast a message to all nodes of a group


with a delivery guarantee
n Multicast primitives
q There are a number of them
q Total ordered multicast: all messages sent by different nodes are
delivered in the same total order at all the nodes
n Used with deferred writes, can reduce communication
overhead
q Remember eager distributed requires k*m messages (with
multicast) for throughput of k tps when there are n replicas and m
update operations in each transaction
q With group communication and deferred writes: 2k messages

© 2020, M.T. Özsu & P. Valduriez 39


Outline
n Data Replication
q

q
q Replication and Failure Management

© 2020, M.T. Özsu & P. Valduriez 40


Failures

n So far we have considered replication protocols in the


absence of failures
n How to keep replica consistency when failures occur
q Site failures
n Read One Write All Available (ROWAA)
q Communication failures
n Quorums
q Network partitioning
n Quorums

© 2020, M.T. Özsu & P. Valduriez 41


ROWAA with Primary Site

n READ = read any copy, if time-out, read another copy.


n WRITE = send W(x) to all copies. If one site rejects the
operation, then abort. Otherwise, all sites not responding
are “missing writes”.
n VALIDATION = To commit a transaction
q Check that all sites in “missing writes” are still down. If not, then
abort the transaction.
n There might be a site recovering concurrent with transaction
updates and these may be lost
q Check that all sites that were available are still available. If some
do not respond, then abort.

© 2020, M.T. Özsu & P. Valduriez 42


Distributed ROWAA
n Each site has a copy of V
q V represents the set of sites a site believes is available
q V(A) is the “view” a site has of the system configuration.
n The view of a transaction T [V(T)] is the view of its coordinating site,
when the transaction starts.
q Read any copy within V; update all copies in V
q If at the end of the transaction the view has changed, the transaction is
aborted
n All sites must have the same view!
n To modify V, run a special atomic transaction at all sites.
q Take care that there are no concurrent views!
q Similar to commit protocol.
q Idea: Vs have version numbers; only accept new view if its version
number is higher than your current one
n Recovery: get missed updates from any active node
q Problem: no unique sequence of transactions

© 2020, M.T. Özsu & P. Valduriez 43


Quorum-Based Protocol

n Assign a vote to each copy of a replicated object (say


Vi) such that ∑iVi = V
n Each operation has to obtain a read quorum (Vr) to read
and a write quorum (Vw) to write an object
n Then the following rules have to be obeyed in
determining the quorums:
q Vr+ Vw>V an object is not read and written by two
transactions concurrently
q Vw>V/2 two write operations from two transactions cannot
occur concurrently on the same object

© 2020, M.T. Özsu & P. Valduriez 44
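A sketch (an assumption about one way to code it) of quorum-based replica access: each copy carries a vote, the configuration must satisfy Vr + Vw > V and Vw > V/2, and an operation proceeds only if the votes of the reachable sites meet its quorum. Site names and vote values are hypothetical.

def check_quorums(votes, vr, vw):
    total = sum(votes.values())
    assert vr + vw > total and vw > total / 2, "invalid quorum configuration"

def collect(votes, reachable, needed):
    """True if the reachable sites together hold at least `needed` votes."""
    return sum(votes[s] for s in reachable) >= needed

votes = {"A": 1, "B": 1, "C": 1}       # V = 3
vr, vw = 2, 2                           # Vr + Vw = 4 > 3 and Vw = 2 > 1.5
check_quorums(votes, vr, vw)
print(collect(votes, {"A", "B"}, vr))   # True: a read quorum is reachable
print(collect(votes, {"C"}, vw))        # False: a write cannot proceed in this partition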


Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
n Introduction
n Distributed and Parallel Database Design
n Distributed Data Control
n Distributed Query Processing
n Distributed Transaction Processing
n Data Replication
n Database Integration – Multidatabase Systems
n Parallel Database Systems
n Peer-to-Peer Data Management
n Big Data Processing
n NoSQL, NewSQL and Polystores
n Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
n Database Integration – Multidatabase Systems
q Schema Matching
q Schema Integration
q Schema Mapping
q Query Rewriting
q Optimization Issues

© 2020, M.T. Özsu & P. Valduriez 3


Problem Definition

n Given existing databases with their Local Conceptual


Schemas (LCSs), how to integrate the LCSs into a
Global Conceptual Schema (GCS)
q GCS is also called mediated schema
n Bottom-up design process

© 2020, M.T. Özsu & P. Valduriez 4


Integration Alternatives

n Physical integration
q Source databases integrated and the integrated database is
materialized
q Data warehouses
n Logical integration
q Global conceptual schema is virtual and not materialized
q Enterprise Information Integration (EII)

© 2020, M.T. Özsu & P. Valduriez 5


Data Warehouse Approach

Materialized
Global
Database

ETL
tools

Database 1 Database 2 ··· Database n

© 2020, M.T. Özsu & P. Valduriez 6


Bottom-up Design

n GCS (also called mediated schema) is defined first


q Map LCSs to this schema
q As in data warehouses
n GCS is defined as an integration of parts of LCSs
q Generate GCS and map LCSs to this GCS

© 2020, M.T. Özsu & P. Valduriez 7


GCS/LCS Relationship

n Local-as-view
q The GCS definition is assumed to exist, and each LCS is treated
as a view definition over it
n Global-as-view
q The GCS is defined as a set of views over the LCSs

Objects Objects
expressible as queries expressible as queries
over the source DBMSs over the GCS

Objects Source Source


DBMS ··· DBMS
accessible
through 1 n
GSC

© 2020, M.T. Özsu & P. Valduriez 8


Database Integration Process

GCS

Schema Generator
Schema
Mapping

Schema
Integration

Schema
Matching

InS1 InS2 ··· InSn

Translator 1 Translator 2 ··· Translator n

Database 1 Database 2 ··· Database n


Schema Schema Schema

© 2020, M.T. Özsu & P. Valduriez 9


Database Integration Issues –
Schema Translation
n Component database schemas translated to a common
intermediate canonical representation
n What is the canonical data model?
q Relational
q Entity-relationship
n DIKE
q Object-oriented
n ARTEMIS
q Graph-oriented
n DIPE, TranScm, COMA, Cupid
n Translation algorithms
q These are well-known

© 2020, M.T. Özsu & P. Valduriez 10


Database Integration Issues –
Schema Generation
n Intermediate schemas are used to create a global conceptual
schema
n Schema matching
q Finding the correspondences between multiple schemas
n Schema integration
q Creation of the GCS (or mediated schema) using the correspondences
n Schema mapping
q How to map data from local databases to the GCS
n Important: sometimes the GCS is defined first, and schema
matching and schema mapping is done against this target
GCS

© 2020, M.T. Özsu & P. Valduriez 11


Outline
n Database Integration – Multidatabase Systems
q Schema Matching
q

q
q

© 2020, M.T. Özsu & P. Valduriez 12


Running Example

Relational E-R Model Project


Name
Responsibility
Number Name Number Budget

N 1
City WORKER WORKS IN PROJECT Location

N
Title Salary
Duration

CONTRACTED BY

EMP(ENO, ENAME, TITLE) 1


Contract
PROJ(PNO, PNAME, BUDGET, LOC, CNAME) number

ASG(ENO, PNO, RESP, DUR) CLIENT

PAY(TITLE, SAL)
Client name Address

© 2020, M.T. Özsu & P. Valduriez 13


Schema Matching

n Schema heterogeneity
q Structural heterogeneity
n Type conflicts
n Dependency conflicts
n Key conflicts
n Behavioral conflicts
q Semantic heterogeneity
n More important and harder to deal with
n Synonyms, homonyms, hypernyms
n Different ontology
n Imprecise wording

© 2020, M.T. Özsu & P. Valduriez 14


Schema Matching (cont’d)

n Other complications
q Insufficient schema and instance information
q Unavailability of schema documentation
q Subjectivity of matching
n Issues that affect schema matching
q Schema versus instance matching
q Element versus structure level matching
q Matching cardinality

© 2020, M.T. Özsu & P. Valduriez 15


Schema Matching Approaches

Individual matchers

Schema-based Instance-based

Element-level Structure-level Element-level

Linguistic Constraint-based Constraint-based Linguistic Constraint-based Learning-based

© 2020, M.T. Özsu & P. Valduriez 16


Linguistic Schema Matching

n Use element names and other textual information (textual


descriptions, annotations)
n May use external sources (e.g., Thesauri)
n 〈SC1.element-1 ≈ SC2.element-2, p,s〉
q Element-1 in schema SC1 is similar to element-2 in schema SC2 if
predicate p holds with a similarity value of s
n Schema level
q Deal with names of schema elements
q Handle cases such as synonyms, homonyms, hypernyms, data type
similarities
n Instance level
q Focus on information retrieval techniques (e.g., word frequencies, key
terms)
q “Deduce” similarities from these
© 2020, M.T. Özsu & P. Valduriez 17
Linguistic Matchers

n Use a set of linguistic (terminological) rules


n Basic rules can be hand-crafted or may be discovered from outside
sources (e.g., WordNet)
n Predicate p and similarity value s
q hand-crafted ⇒ specified,
q discovered ⇒ may be computed or specified by an expert after
discovery
n Examples
q 〈uppercase names ≈ lower case names, true, 1.0〉
q 〈uppercase names ≈ capitalized names, true, 1.0〉
q 〈capitalized names ≈ lower case names, true, 1.0〉
q 〈DB1.ASG ≈ DB2.WORKS_IN, true, 0.8〉

© 2020, M.T. Özsu & P. Valduriez 18


Automatic Discovery of Name
Similarities
n Affixes
q Common prefixes and suffixes between two element name strings
n N-grams
q Comparing how many substrings of length n are common between the
two name strings
n Edit distance
q Number of character modifications (additions, deletions, insertions) that
needs to be performed to convert one string into the other
n Soundex code
q Phonetic similarity between names based on their soundex codes
n Also look at data types
q Data type similarity may suggest stronger relationship than the
computed similarity using these methods or to differentiate between
multiple strings with same value

© 2020, M.T. Özsu & P. Valduriez 19


N-gram Example

n 3-grams of string “Responsibility” are the following:


q Res, esp, spo, pon, ons, nsi, sib, ibi, bil, ili, lit, ity
n 3-grams of string “Resp” are
q Res
q esp
n 3-gram similarity: 2/12 = 0.17
© 2020, M.T. Özsu & P. Valduriez 20
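A quick sketch of this computation (Python; normalizing by the larger n-gram set is one common choice and reproduces the 2/12 figure above):

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    # shared n-grams divided by the n-gram count of the longer string
    return len(ga & gb) / max(len(ga), len(gb))

print(round(ngram_similarity("Responsibility", "Resp"), 2))   # 2/12 ≈ 0.17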
Edit Distance Example

n Again consider “Responsibility” and “Resp”


n To convert “Responsibility” to “Resp”
q Delete characters “o”, “n”, “s”, “i”, “b”, “i”, “l”, “i”, “t”, “y”

n To convert “Resp” to “Responsibility”


q Add characters “o”, “n”, “s”, “i”, “b”, “i”, “l”, “i”, “t”, “y”

n The number of edit operations required is 10


n Similarity is 1 − (10/14) = 0.29

© 2020, M.T. Özsu & P. Valduriez 21
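A compact Levenshtein-distance sketch that reproduces the numbers above (illustrative Python, not tied to any particular matching tool):

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("Responsibility", "Resp"))               # 10
print(round(edit_similarity("Responsibility", "Resp"), 2))   # 0.29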


Constraint-based Matchers

n Data always have constraints – use them


q Data type information
q Value ranges
q …
n Examples
q RESP and RESPONSIBILITY: n-gram similarity = 0.17, edit
distance similarity = 0.19 (low)
q If they come from the same domain, this may increase their
similarity value
q ENO in relational, WORKER.NUMBER and PROJECT.NUMBER
in E-R
q ENO and WORKER.NUMBER may have type INTEGER while
PROJECT.NUMBER may have STRING
© 2020, M.T. Özsu & P. Valduriez 22
Constraint-based Structural Matching

n If two schema elements are structurally similar, then


there is a higher likelihood that they represent the same
concept
n Structural similarity:
q Same properties (attributes)
q “Neighborhood” similarity
n Using graph representation
n The set of nodes that can be reached within a particular path length
from a node are the neighbors of that node
n If two concepts (nodes) have similar set of neighbors, they are likely
to represent the same concept

© 2020, M.T. Özsu & P. Valduriez 23


Learning-based Schema Matching

n Use machine learning techniques to determine schema


matches
n Classification problem: classify concepts from various
schemas into classes according to their similarity. Those
that fall into the same class represent similar concepts
n Similarity is defined according to features of data
instances
n Classification is “learned” from a training set

© 2020, M.T. Özsu & P. Valduriez 24


Learning-based Schema Matching

Training data τ = {Di.em ≈ Dj.en} → Learner → probabilistic
knowledge

New schemas Dk, Dl → Classifier (using the learned knowledge) →
classification predictions

© 2020, M.T. Özsu & P. Valduriez 25
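As a toy illustration of the idea (not from the book): a classifier is trained on labeled attribute pairs described by simple features; the features, training pairs and values below are entirely synthetic.

# Illustrative sketch only (synthetic data): classify attribute pairs as
# match / no-match from simple features such as name similarity and type equality.
from sklearn.linear_model import LogisticRegression

# Each row: [name 3-gram similarity, same data type?]; label 1 = same concept
X_train = [[0.9, 1], [0.7, 1], [0.8, 0], [0.1, 1], [0.05, 0], [0.2, 0]]
y_train = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)

# e.g. (RESP, RESPONSIBILITY): low name similarity but same type/domain
print(clf.predict([[0.17, 1]]))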


Combined Schema Matching Approaches

n Use multiple matchers


q Each matcher focuses on one area (name, etc)

n Meta-matcher integrates these into one prediction


n Integration may be simple (take average of similarity
values) or more complex (see Fagin’s work)

© 2020, M.T. Özsu & P. Valduriez 26


Outline
n Database Integration – Multidatabase Systems
q

q Schema Integration
q
q

© 2020, M.T. Özsu & P. Valduriez 27


Schema Integration

n Use the correspondences to create a GCS


n Mainly a manual process, although rules can help

Integration process

Binary n-ary

Ladder Balanced One-shot Iterative

© 2020, M.T. Özsu & P. Valduriez 28


Binary Integration Methods

Stepwise Pure binary

© 2020, M.T. Özsu & P. Valduriez 29


N-ary Integration Methods

One-pass Iterative

© 2020, M.T. Özsu & P. Valduriez 30


Outline
n Database Integration – Multidatabase Systems
q

q Schema Mapping
q

© 2020, M.T. Özsu & P. Valduriez 31


Schema Mapping

n Mapping data from each local database (source) to GCS


(target) while preserving semantic consistency as
defined in both source and target.
n Data warehouses ⇒ actual translation
n Data integration systems ⇒ discover mappings that can
be used in the query processing phase
n Mapping creation
n Mapping maintenance

© 2020, M.T. Özsu & P. Valduriez 32


Mapping Creation

Given
q A source LCS: S = {S1, …, Sn}
q A target GCS: T = {T1, …, Tm}
q A set of value correspondences discovered during the schema
matching phase: V = {V1, …, Vp}
Produce a set of queries that, when executed, will create
GCS data instances from the source data.
We are looking, for each Tk, for a query Qk defined on a
(possibly proper) subset of the relations in S such that,
when executed, it will generate data for Tk from the source
relations
© 2020, M.T. Özsu & P. Valduriez 33
Mapping Creation Algorithm

General idea:

n Consider each Tk in turn
q Divide Vk into subsets {Vk1, …, Vkn} such that each Vkj specifies
one possible way that values of Tk can be computed
n Each Vkj can be mapped to a query qkj that, when
executed, would generate some of Tk’s data
n Union of these queries gives Qk (= ∪j qkj)

© 2020, M.T. Özsu & P. Valduriez 34


Outline
n Database Integration – Multidatabase Systems
q

q
q Query Rewriting
q

© 2020, M.T. Özsu & P. Valduriez 35


Multidatabase Query Processing

n Mediator/wrapper architecture
n MDB query processing architecture
n Query rewriting using views
n Query optimization and execution
n Query translation and execution

© 2020, M.T. Özsu & P. Valduriez 36


Recall Mediator/Wrapper Architecture
USER

System User Query


Result
Responses Requests Global view Processing
Integration

Mediator Mediator

Local Local Local


Schema Schema Schema
Mediator Mediator

Wrapper Wrapper Wrapper

DBMS DBMS DBMS DBMS

© 2020, M.T. Özsu & P. Valduriez 37


Issues in MDB Query Processing

n Component DBMSs are autonomous and may range


from full-fledge relational DBMS to flat file systems
q Different computing capabilities
n Prevents uniform treatment of queries across DBMSs
q Different processing cost and optimization capabilities
n Makes cost modeling difficult
q Different data models and query languages
n Makes query translation and result integration difficult
q Different runtime performance and unpredictable behavior
n Makes query execution difficult

© 2020, M.T. Özsu & P. Valduriez 38


Component DBMS Autonomy

n Communication autonomy
q The ability to terminate services at any time
q How to answer queries completely?
n Design autonomy
q The ability to restrict the availability and accuracy of information
needed for query optimization
q How to obtain cost information?
n Execution autonomy
q The ability to execute queries in unpredictable ways
q How to adapt to this?

© 2020, M.T. Özsu & P. Valduriez 39


MDB Query Processing Architecture

Query on
global relations

Global Global/local
REWRITING Schema correspondences

MEDIATOR Query on
SITE local relations

OPTIMIZATION & Allocation & Allocation and


EXECUTION Capability Inf. capabilities

Distributed
query execution plan

WRAPPER TRANSLATION & Wrapper Local/DBMS


SITES EXECUTION Information mappings

Results

© 2020, M.T. Özsu & P. Valduriez 40


Query Rewriting Using Views

n Views used to describe the correspondences between


global and local relations
q Global As View: the global schema is integrated from the local
databases and each global relation is a view over the local
relations
q Local As View: the global schema is defined independently of
the local databases and each local relation is a view over the
global relations
n Query rewriting best done with Datalog, a logic-based
language
q More expressive power than relational calculus
q Inline version of relational domain calculus

© 2020, M.T. Özsu & P. Valduriez 41


Datalog Terminology

n Conjunctive (SPJ) query: a rule of the form


q Q(T) :- R1(T1), … Rn(Tn)
q Q(T) : head of the query denoting the result relation
q R1(T1), … Rn(Tn): subgoals in the body of the query
q R1, … Rn: predicate names corresponding to relation names
q T1, … Tn: refer to tuples with variables and constants
q Variables correspond to attributes (as in domain calculus)
q “-” means unnamed variable
n Disjunctive query = n conjunctive queries with same
head predicate

© 2020, M.T. Özsu & P. Valduriez 42


Datalog Example

EMP(E#,ENAME,TITLE,CITY)
WORKS(E#,P#,RESP,DUR)

SELECT E#, TITLE, P#


FROM EMP NATURAL JOIN WORKS
WHERE TITLE = "Programmer" OR DUR=24

Q(E#,TITLE,P#) :- EMP(E#,ENAME,"Programmer",CITY),
WORKS(E#,P#,RESP,DUR).
Q(E#,TITLE,P#) :- EMP(E#,ENAME,TITLE,CITY),
WORKS(E#,P#,RESP,24).

© 2020, M.T. Özsu & P. Valduriez 43


Rewriting in GAV

n Global schema similar to that of homogeneous


distributed DBMS
q Local relations can be fragments
q But no completeness: a tuple in the global relation may not exist
in local relations
n Yields incomplete answers
q And no disjointness: the same tuple may exist in different local
databases
n Yields duplicate answers
n Rewriting (unfolding)
q Similar to query modification
n Apply view definition rules to the query and produce a union of
conjunctive queries, one per rule application
n Eliminate redundant queries

© 2020, M.T. Özsu & P. Valduriez 44


GAV Example Schema

Global relations:
EMP(E#,ENAME,CITY)
WORKS(E#,P#,TITLE,DUR)

Local relations:
EMP1(E#,ENAME,TITLE,CITY)
EMP2(E#,ENAME,TITLE,CITY)
WORKS(E#,P#,DUR)

EMP(E#,ENAME,CITY):- EMP1(E#,ENAME,TITLE,CITY). (d1)


EMP(E#,ENAME,CITY):- EMP2(E#,ENAME,TITLE,CITY). (d2)
WORKS(E#,P#,TITLE,DUR):- EMP1(E#,ENAME,TITLE,CITY),
WORKS(E#,P#,DUR). (d3)
WORKS(E#,P#,TITLE,DUR):- EMP2(E#,ENAME,TITLE,CITY),
WORKS(E#,P#,DUR). (d4)

© 2020, M.T. Özsu & P. Valduriez 45


GAV Example Query

Let Q: project for employees in Paris


Q(e,p) :- EMP(e,ENAME,"Paris"),WORKS(e,p,TITLE,DUR).
Unfolding produces Q′
Q′(e,p) :- EMP1(e,ENAME,"Paris"),
WORKS(e,p,TITLE,DUR). (q1)
Q′(e,p):- EMP2(e,ENAME,"Paris"),
WORKS(e,p,TITLE,DUR). (q2)
where
q1 is obtained by applying d3 only or both d1 and d3
In the latter case, there are redundant queries
same for q2 with d2 only or both d2 and d4
© 2020, M.T. Özsu & P. Valduriez 46
Rewriting in LAV

n More difficult than in GAV


q No direct correspondence between the terms in GS (EMP,
ENAME) and those in the views (EMP1, EMP2, ENAME)
q There may be many more views than global relations
q Views may contain complex predicates to reflect the content of
the local relations
n e.g. a view EMP3 for only programmers
n Often not possible to find an equivalent rewriting
q Best is to find a maximally-contained query which produces a
maximum subset of the answer
n e.g. EMP3 can only return a subset of the employees

© 2020, M.T. Özsu & P. Valduriez 47


Rewriting Algorithms

n The problem to find an equivalent query is NP-complete


in the number of views and number of subgoals of the
query
n Thus, algorithms try to reduce the numbers of rewritings
to be considered
n Three main algorithms
q Bucket
q Inverse rule
q MiniCon

© 2020, M.T. Özsu & P. Valduriez 48


LAV Example Schema

Local relations:
EMP1(E#,ENAME,TITLE,CITY)
EMP2(E#,ENAME,TITLE,CITY)
WORKS1(E#,P#,DUR)

Global relations:
EMP(E#,ENAME,CITY)
WORKS(E#,P#,TITLE,DUR)

EMP1(E#,ENAME,TITLE,CITY) :- EMP(E#,ENAME,CITY),
                             WORKS(E#,P#,TITLE,DUR).   (d5)
EMP2(E#,ENAME,TITLE,CITY) :- EMP(E#,ENAME,CITY),
                             WORKS(E#,P#,TITLE,DUR).   (d6)
WORKS1(E#,P#,DUR) :- WORKS(E#,P#,TITLE,DUR).           (d7)

© 2020, M.T. Özsu & P. Valduriez 49


Bucket Algorithm

n Considers each predicate of the query Q independently


to select only the relevant views
Step 1
q Build a bucket b for each subgoal q of Q that is not a comparison
predicate
q Insert in b the heads of the views which are relevant to answer q
Step 2
q For each view V of the Cartesian product of the buckets, produce
a conjunctive query
n If it is contained in Q, keep it
n The rewritten query is a union of conjunctive queries

© 2020, M.T. Özsu & P. Valduriez 50


LAV Example Query
Q(e,p) :- EMP(e,ENAME,"Paris"), WORKS(e,p,TITLE,DUR).

Step1: we obtain 2 buckets (one for each subgoal of Q)


b1 = {EMP1(E#,ENAME,TITLE′,CITY),
EMP2(E#,ENAME,TITLE′,CITY)}
b2 = {WORKS1(E#,P#,DUR′)}
(the prime variables (TITLE’ and DUR’) are not useful)

Step2: produces
Q′(e,p) :- EMP1(e,ENAME,TITLE,"Paris"),
WORKS1(e,p,DUR). (q1)
Q′(e,p) :- EMP2(e,ENAME,TITLE,"Paris"),
WORKS1(e,p,DUR). (q2)

© 2020, M.T. Özsu & P. Valduriez 51


Outline
n Database Integration – Multidatabase Systems
q

q
q

q Optimization Issues

© 2020, M.T. Özsu & P. Valduriez 52


Query Optimization and Execution

n Takes a query expressed on local relations and


produces a distributed QEP to be executed by the
wrappers and mediator
n Three main problems
q Heterogeneous cost modeling
n To produce a global cost model from component DBMS
q Heterogeneous query optimization
n To deal with different query computing capabilities
q Adaptive query processing
n To deal with strong variations in the execution environment

© 2020, M.T. Özsu & P. Valduriez 53


Heterogeneous Cost Modeling

n Goal: determine the cost of executing the subqueries at


component DBMS
n Three approaches
q Black-box: treats each component DBMS as a black-box and
determines costs by running test queries
q Customized: customizes an initial cost model
q Dynamic: monitors the run-time behavior of the component
DBMS and dynamically collect cost information

© 2020, M.T. Özsu & P. Valduriez 54


Black-box Approach
n Define a logical cost expression
q Cost = init cost + cost to find qualifying tuples
+ cost to process selected tuples
n The terms will differ much with different DBMS
n Run probing queries on component DBMS to compute
cost coefficients
q Count the numbers of tuples, measure cost, etc.
q Special case: sample queries for each class of important queries
n Use of classification to identify the classes
n Problems
q The instantiated cost model (by probing or sampling) may
change over time
q The logical cost function may not capture important details of
component DBMS

© 2020, M.T. Özsu & P. Valduriez 55


Customized Approach

n Relies on the wrapper (i.e. developer) to provide cost


information to the mediator
n Two solutions
q Wrapper provides the logic to compute cost estimates
n Access_cost = reset + (card-1)*advance
q reset = time to initiate the query and receive a first tuple
q advance = time to get the next tuple (advance)
q card = result cardinality
q Hierarchical cost model
n Each node associates a query pattern with a cost function
n The wrapper developer can give cost information at various levels of
details, depending on knowledge of the component DBMS

© 2020, M.T. Özsu & P. Valduriez 56
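A sketch of that wrapper-provided cost function (Python for illustration; the reset, advance and card values would come from the wrapper developer or the catalog, and the numbers below are invented):

def access_cost(reset, advance, card):
    """reset: time to initiate the query and get the first tuple;
    advance: time to get each next tuple; card: estimated result cardinality."""
    return reset + (card - 1) * advance

# e.g. 50 ms for the first tuple, 2 ms per further tuple, 1000 tuples
print(access_cost(reset=0.050, advance=0.002, card=1000))   # ≈ 2.048 s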


Hierarchical Cost Model
select(Collection, Predicate)
CountObject = . . .
Default-scope TotalSize = . . .
rules TotalTime = . . .
etc

Wrapper- Source 1: Source 2:


scope rules select(Collection, Predicate) select(Collection, Predicate)
TotalTime = . . . TotalTime = . . .

Collection- select(PROJ, Predicate) select(EMP, Predicate)


scope rules TotalSize = . . . TotalTime = . . .

Predicate- select(EMP,TITLE = value) select(EMP,ENAME = value)


scope rules TotalTime = . . . TotalTime = . . .

Query-
specific rules

© 2020, M.T. Özsu & P. Valduriez 57


Dynamic Approach

n Deals with execution environment factors which may


change
q Frequently: load, throughput, network contention, etc.
q Slowly: physical data organization, DB schemas, etc.
n Two main solutions
q Extend the sampling method to consider some new queries as
samples and correct the cost model on a regular basis
q Use adaptive query processing which computes cost during
query execution to make optimization decisions

© 2020, M.T. Özsu & P. Valduriez 58


Heterogeneous Query Optimization

n Deals with heterogeneous capabilities of component DBMS


q One DBMS may support complex SQL queries while another only
simple select on one fixed attribute
n Two approaches, depending on the M/W interface level
q Query-based
n All wrappers support the same query-based interface (e.g. ODBC or
SQL/MED) so they appear homogeneous to the mediator
n Capabilities not provided by the DBMS must be supported by the
wrappers
q Operator-based
n Wrappers export capabilities as compositions of operators
n Specific capabilities are available to mediator
n More flexibility in defining the level of M/W interface

© 2020, M.T. Özsu & P. Valduriez 59


Query-based Approach

n We can use 2-step query optimization with a


heterogeneous cost model
q But centralized query optimizers produce left-linear join trees
whereas in MDB, we want to push as much processing in the
wrappers, i.e. exploit bushy trees
n Solution: convert a left-linear join tree into a bushy tree
such that
q The initial total cost of the QEP is maintained
q The response time is improved
n Algorithm
q Iterative improvement of the initial left-linear tree by moving
down subtrees while response time is improved

© 2020, M.T. Özsu & P. Valduriez 60


Operator-based Approach

n M/W communication in terms of subplans


n Use of planning functions (Garlic)
q Extension of cost-based centralized optimizer with new
operators
n Create temporary relations
n Retrieve locally stored data
n Push down operators in wrappers
n accessPlan and joinPlan rules
q Operator nodes annotated with
n Location of operands, materialization, etc.

© 2020, M.T. Özsu & P. Valduriez 61


Planning Functions Example

n Consider 3 component databases with 2 wrappers:


q w1.db1: EMP(ENO,ENAME,CITY)
q w1.db2: ASG(ENO,PNAME,DUR)
q w2.db3: EMPASG(ENAME,CITY,PNAME,DUR)
n Planning functions of w1
q accessPlan(R: rel, A: attlist, P: pred) = scan(R, A, P, db(R))
q joinPlan(R1, R2: rel, A: attlist, P: joinpred) = join(R1, R2, A, P)
n condition: db(R1) ≠ db(R2)
n implemented by w1
n Planning functions of w2
q accessPlan(R: rel, A: attlist, P: pred) = fetch(city=c)
n condition: (city=c) included in P
q accessPlan(R: rel, A: attlist, P: pred) = scan(R, A, P, db(R))
n implemented by w2

© 2020, M.T. Özsu & P. Valduriez 62


Heterogenous QEP

SELECT ENAME,PNAME,DUR
FROM EMPASG
WHERE CITY = "Paris" AND DUR>24
Mediator m:   Union
Wrapper w1:   Join( Scan(CITY="Paris") on EMP at db1,
                    Scan(DUR>24) on WORKS at db2 )
Wrapper w2:   Scan(DUR>24) over Fetch(CITY="Paris") on EMPASG at db3

© 2020, M.T. Özsu & P. Valduriez 63


Query Translation and Execution

n Performed by wrappers using the component DBMS


q Conversion between common interface of mediator and DBMS-
dependent interface
n Query translation from wrapper to DBMS
n Result format translation from DBMS to wrapper
q Wrapper has the local schema exported to the mediator (in
common interface) and the mapping to the DBMS schema
q Common interface can be query-based (e.g. ODBC or
SQL/MED) or operator-based
n In addition, wrappers can implement operators not
supported by the component DBMS, e.g. join

© 2020, M.T. Özsu & P. Valduriez 64


Wrapper Placement

n Depends on the level of


autonomy of component DB MEDIATOR
n Cooperative DB
q May place wrapper at component Common Interface
DBMS site
q Efficient wrapper-DBMS com.
WRAPPER
n Uncooperative DB
q May place wrapper at mediator DBMS-dependent
Interface
q Efficient mediator-wrapper com.
COMPONENT
n Impact on cost functions DBMS

65
SQL Wrapper for Text Files

n Consider EMP (ENO, ENAME, CITY) stored in a Unix text file in


componentDB
q Each EMP tuple is a line in the file, with attributes separated by “:”
n SQL/MED definition of EMP
CREATE FOREIGN TABLE EMP (
  ENO INTEGER, ENAME VARCHAR(30), CITY CHAR(30))
SERVER componentDB
OPTIONS(Filename '/usr/EngDB/emp.txt', Delimiter ':')
n The query
SELECT ENAME FROM EMP
can be translated by the wrapper into the Unix shell command
cut -d: -f2 /usr/EngDB/emp.txt

© 2020, M.T. Özsu & P. Valduriez 66


Wrapper Management Issues
n Wrappers mostly used for read-only queries
q Makes query translation and wrapper construction easy
q DBMS vendors provide standard wrappers
n ODBC, JDBC, ADO, etc.
n Updating makes wrapper construction harder
q Problem: heterogeneity of integrity constraints
n Implicit in some legacy DB
q Solution: reverse engineering of legacy DB to identify implicit
constraints and translate in validation code in the wrapper
n Wrapper maintenance
q schema mappings can become invalid as a result of changes in
component DB schemas
n Use detection and correction, using mapping maintenance
techniques

© 2020, M.T. Özsu & P. Valduriez 67


Advanced Distributed
Databases
Paulo Vieira
2023

Master's in Data Science


Apache Hadoop Ecosystem architecture | Download Scientific Diagram (researchgate.net)
BDDA - MCD - Paulo Vieira 2
Hadoop
• Google and Yahoo! were the first to face the scalability challenges of
the internet
• In 2006 it was published as an open source project at the Apache Software
Foundation
• Its creator, Doug Cutting, was working on an indexing project and was
inspired by Google's publications:
• Google File System
• MapReduce: Simplified Data Processing on Large Clusters

BDDA - MCD - Paulo Vieira 3


Hadoop
• A data storage and processing platform built around one central
concept: data locality
• The goal is to bring compute/processing capacity close to the data
• It is a schema-on-read system
• It consists of three main components:
• Hadoop Distributed File System (HDFS) – the storage subsystem
• Yet Another Resource Negotiator (YARN) – the process scheduling
subsystem
• MapReduce – the processing framework

BDDA - MCD - Paulo Vieira 4


Hadoop

https://blog.verbat.com/hadoop-ecosystem-beginners-overview/
BDDA - MCD - Paulo Vieira 5
Hadoop – use cases
• Data warehousing, extract load transform (ELT) and extract transform load
(ETL)
• Events and complex event processing
• Ingestion and processing of sensor, message or log data
• Usually associated with the concept of IoT (Internet of Things)
• Data mining and machine learning

BDDA - MCD - Paulo Vieira 6


HDFS
• Virtual file system made up of blocks that are distributed
across one or more nodes of a cluster
• Files are split according to a given block size
at upload time
• A process known as data ingestion
• Blocks are then distributed and replicated across the cluster nodes
to enable fault tolerance and additional opportunities for
local processing of the data
• In line with the central concept of data locality

BDDA - MCD - Paulo Vieira 7


HDFS
• Main data source and target of processing operations
• Originally developed to support the requirements of search
engines, e.g., Yahoo!
• Inspired by the GoogleFS paper
• Scalable (economical)
• Fault tolerant
• Uses mass-produced commodity hardware (PCs)
• Supports high concurrency
• Favors high bandwidth over low latency for
random accesses
BDDA - MCD - Paulo Vieira 8
HDFS
• Immutable: data committed to the file system cannot be
updated
• WORM (write once, read many)
• Blocks have a default size of 128 MB
• Files are split at ingestion time
• If a cluster has more than one node, the blocks are distributed
• They are also replicated by a replication factor (typically 3, although in
pseudo-distributed mode it is 1)
• Increases the probability of data locality
• Fault tolerance

BDDA - MCD - Paulo Vieira 9


HDFS - Interaction
• Shell (hdfs dfs)
• hdfs dfs -put lord-of-the-rings.txt /data/books
• hdfs dfs -ls

• Java API

• RESTful proxy interfaces (HttpFS and WebHDFS)

BDDA - MCD - Paulo Vieira 10


HDFS - NameNode
• Master HDFS node that manages the file system metadata
• Metadata is kept in memory to allow efficient reads and writes
• Responsible for the durability and consistency of the metadata
• Mandatory process, required for HDFS to work
• User interface available, by default, on HTTP port 9870
• HDFS NameNode Web UI: http://localhost:9870/

BDDA - MCD - Paulo Vieira 11


HDFS - DataNode
• Slave HDFS process that runs on one or more nodes of the HDFS cluster
• Responsible for managing block storage and read/write
access, as well as block replication

BDDA - MCD - Paulo Vieira 12


YARN
• Orchestrator of data processing in Hadoop
• Based on a master-slave architecture with a master node
called ResourceManager and one or more slave nodes called
NodeManagers

BDDA - MCD - Paulo Vieira 13


YARN - ResourceManager
• Responsible for granting computational resources to the applications
running on the cluster
• Provided in the form of containers, which are pre-defined combinations
of CPU cores and memory
• Monitors cluster capacity as applications finish and
release resources
• Like the NameNode, it also provides a web UI
• http://localhost:8088/cluster

BDDA - MCD - Paulo Vieira 14


YARN – timeline
1. Clients submit applications to the ResourceManager
2. The ResourceManager allocates the first available container on a
NodeManager as a delegate process called the
ApplicationMaster
3. This ApplicationMaster negotiates the subsequent containers
needed to run the application

BDDA - MCD - Paulo Vieira 15


YARN - NodeManager
• The slave node that manages the containers on a given host
• Containers run the tasks involved in an application

BDDA - MCD - Paulo Vieira 16


YARN – Deployment modes
• Fully distributed – the master nodes are on machines
different from the slave nodes

• Pseudo-distributed – all daemons run in separate JVMs

• LocalJobRunner – all components run in the same JVM

BDDA - MCD - Paulo Vieira 17


Sqoop
• An abstraction over MapReduce
• An import operation proceeds as follows:
• Connect to the DBMS using JDBC
• Examine the table to be imported
• Create a Java class that represents the structure of the specified table
• Use YARN to run the MapReduce job (by default, 4 parallel
tasks)

BDDA - MCD - Paulo Vieira 18


Data ingestion using Sqoop
• Sqoop (sql-to-hadoop) is an ASF project designed to
import relational databases and ingest them into HDFS.
• It can also be used to send data from Hadoop to a
relational database, useful for pushing processed data to a
system that supports transactions.
• It includes tools to:
• List databases and tables
• Import tables
• Import data using SELECT statements
• Export data from HDFS to a remote database (an export sketch follows the import example below)

BDDA - MCD - Paulo Vieira 19


Sqoop - example
sqoop import-all-tables \
--username pjvieira \
--password ***** \
--connect jdbc:mysql://databaseserver.local/db

BDDA - MCD - Paulo Vieira 20
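For the export direction mentioned above, a Sqoop invocation along these lines could be used (the table name and the HDFS directory are illustrative, the connection details reuse the import example):

sqoop export \
  --connect jdbc:mysql://databaseserver.local/db \
  --username pjvieira \
  --password ***** \
  --table processed_results \
  --export-dir /user/output/processed_results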


NoSQL
• In 2006 Google published the paper “Bigtable: A Distributed Storage
System for Structured Data”
• Bigtable's design scales up to petabytes of data and
thousands of machines:
• Used in Google Analytics, Google Finance and Google Earth
• Schemaless at design time, “schema-on-read” at runtime
• Data is not related
• Joins are usually avoided

BDDA - MCD - Paulo Vieira 21


NoSQL
• Types of NoSQL systems:
• Key-value stores – hold a set of indexed keys and associated
values. E.g., Cassandra, Amazon DynamoDB and HBase
• Document stores – store documents, i.e., complex objects (JSON or
BSON). Documents are assigned a “document ID” and the content is
semi-structured data. E.g., MongoDB and CouchDB
• Graph stores – based on graph theory and graph processing concepts.
E.g., Neo4J and GraphBase

BDDA - MCD - Paulo Vieira 22


HBase
• Stores information as a sorted, sparse,
multidimensional map
• The map is indexed by row key, and the
values are stored in cells (consisting of a
column key and a column value)
• The column key and the row key are strings, and the
column value is a byte array that can represent
any kind of data – primitive or complex.
• Depends on ZooKeeper, another ASF project

BDDA - MCD - Paulo Vieira 23


HBase – access methods
• get, put, scan and delete

• Via the shell or APIs

• The shell is a REPL (Read-Evaluate-Print-Loop)
• Accessed through the hbase shell command

BDDA - MCD - Paulo Vieira 24
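For example, the four operations can be tried directly from the hbase shell; the table and column names here are only illustrative (the tutorial at the end of this document uses similar ones):

create 'turma', 'info'
put 'turma', '1', 'info:nome', 'Ana Silva'
get 'turma', '1'
scan 'turma'
delete 'turma', '1', 'info:nome'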


ZooKeeper
• Centralized service for maintaining configuration information,
naming, distributed synchronization and group services.
• Simple interface to a centralized coordination service
• Distributed and reliable
• Enables coordination of distributed processes through a
hierarchical, distributed namespace

ProjectDescription - Apache ZooKeeper - Apache Software Foundation


BDDA - MCD - Paulo Vieira 25
Phoenix
• Distributed relational database management system built
on top of Apache HBase
• “Hadoop’s database”
• ACID transactional support through the use of the Apache
Tephra project
• Supports views, with some limitations (cf. the “Limitations” section)

Overview | Apache Phoenix BDDA - MCD - Paulo Vieira 26


Phoenix – data types

Data Types | Apache Phoenix


BDDA - MCD - Paulo Vieira 27
CRUD commands
• CREATE
• UPSERT
• SELECT
• DELETE

CREATE TABLE IF NOT EXISTS us_population (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city)
);

UPSERT INTO us_population VALUES ('NY','New York',8143197);

SELECT
state as "State",
count(city) as "City Count",
sum(population) as "Population Sum"
FROM
us_population
GROUP BY
state
ORDER BY
sum(population) DESC
;

Phoenix in 15 minutes or less | Apache Phoenix
BDDA - MCD - Paulo Vieira 28
DEMO – Phoenix & HUE

BDDA - MCD - Paulo Vieira 29


Hive
• Data warehousing and query system for the
Apache Hadoop ecosystem
• Used as a data warehouse solution in Hadoop
environments
• Originally developed by Facebook
• Enables analysis and querying of large datasets
stored in Hadoop
• Uses a language called HiveQL, which is similar to SQL
• Uses a component called the Metastore to store
metadata about the structure and location of the data in the cluster
• helps with organizing and optimizing queries.

BDDA - MCD - Paulo Vieira 30


Hive
• High-level interface to MapReduce
• At Facebook in 2010, few analysts had
Java programming skills
• They did have SQL skills
• Introduces a new language called HiveQL that implements
a subset of SQL-92
• Implements an abstraction over objects in HDFS
• Data in HDFS can be accessed through DML,
as in conventional DBMSs

BDDA - MCD - Paulo Vieira 31


Hive
• Even so, it has some differences:
• UPDATE is not supported
• There are no transactions, rollbacks or transactional
isolation levels
• There are no primary keys, foreign keys or declarative
integrity constraints
• Incorrectly formatted data is represented as
NULL
• It has a relational database (the metastore) that is
written and read by the Hive client
• By default, a Derby database
BDDA - MCD - Paulo Vieira 32
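A small HiveQL sketch of the idea (table name, columns and HDFS path are made up for this example): an external table is declared over delimited files already stored in HDFS, and then queried with familiar SQL-style DML:

CREATE EXTERNAL TABLE logs (ts STRING, severity STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

SELECT severity, count(*) AS total
FROM logs
GROUP BY severity;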
Parquet
• A columnar storage format
• Open source, for the Apache Hadoop data processing
ecosystem
• Specifically designed to be efficient for storing and
processing large volumes of data

https://parquet.apache.org/

BDDA - MCD - Paulo Vieira 33


Parquet
• Main characteristics:
• Efficient compression and encoding
• Integration with big data tools
• Schema evolution:
• The format supports schema evolution, allowing columns to be added or modified
without rewriting existing datasets
• Optimized for read queries
• Interoperability:
• Parquet can be used from a variety of programming languages and
platforms, easing interoperability between different systems and tools.

https://parquet.apache.org/

BDDA - MCD - Paulo Vieira 34
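A minimal illustration of writing and reading Parquet from Python (pandas with the pyarrow engine; the file name and columns are invented for the example):

import pandas as pd

df = pd.DataFrame({"state": ["NY", "CA"], "population": [8143197, 3971883]})
df.to_parquet("us_population.parquet", engine="pyarrow", compression="snappy")

# Columnar layout: reading only the needed columns avoids scanning the rest
pop = pd.read_parquet("us_population.parquet", columns=["population"])
print(pop)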


References
• White, Tom. (2015). Hadoop: The Definitive Guide (4th. ed.). O'Reilly
Media, Inc. ISBN: 9781491901632

BDDA - MCD - Paulo Vieira 35


Hadoop, YARN and
ZooKeeper
Notes on
session #3
Paulo Vieira - 2024
2
Hadoop: Introduction
Open source framework for distributed processing
Designed to scale from single servers to thousands of machines
Google and Yahoo! were the first to face the scalability challenges of the
internet
In 2006 it was published as an open source project at the Apache Software Foundation

3
Hadoop: Main Characteristics
Distributed storage | Parallel processing
Fault tolerance
High availability
A data storage and processing platform built around a
central concept: data locality
The goal is to bring compute/processing capacity close to the data
It is a schema-on-read system

4
Main Hadoop Components
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)

5
HDFS: Overview
Distributed file system
Designed to store large volumes
of data
Runs on commodity hardware
Made up of blocks that are distributed
(replicated) across one or more nodes of
a cluster
Files are split according
to a given block size
at upload time
6
HDFS: Overview (cont.)
Immutable: data committed to the file system cannot be updated
WORM (write once, read many)
Blocks have a default size of 128 MB
Files are split at ingestion time
If a cluster has more than one node, the blocks are distributed
They are also replicated by a replication factor (typically 3, although in
pseudo-distributed mode it is 1)
Increases the probability of data locality
7
HDFS: Interaction
Shell (hdfs dfs)
hdfs dfs -put lord-of-the-rings.txt /data/books
hdfs dfs -ls

Java API
RESTful proxy interfaces (HttpFS and WebHDFS)

8
HDFS: Architecture
Master-slave architecture
NameNode (master)
DataNodes (slaves)

9
NameNode
Mandatory process, required for HDFS to work
Responsible for the durability and consistency of the metadata:
Stores metadata for all files and directories
Manages the file system namespace
Regulates client access to files
Performs operations such as opening, closing and renaming files
Maintains the file system tree
Single point of failure (can be mitigated with a secondary NameNode)
10
DataNode
Responsible for managing block storage and read/write access, as
well as block replication
Report to the NameNode with lists of the blocks they are storing
Perform read and write operations for file system clients
Carry out block creation, deletion and replication under instruction from the NameNode

11
12
YARN: Introduction
Yet Another Resource Negotiator
Introduced in Hadoop 2.0
Orchestrator of data processing
in Hadoop
Separates the "resource management"
functionality from the "processing
engine"

13
YARN: Main Components
1. ResourceManager
2. NodeManager
3. ApplicationMaster
4. Container

14
YARN: ResourceManager
Global resource manager
Allocates resources to distributed applications
Responsible for granting computational resources to the applications running on the
cluster
Provided in the form of containers, which are pre-defined combinations of CPU
cores and memory
Monitors cluster capacity as applications finish and release
resources
15
YARN: NodeManager
Agent on each compute node
Responsible for the containers
Monitors resource usage

16
YARN: ApplicationMaster
Negotiates resources with the ResourceManager
Works with the NodeManagers to run and monitor tasks

17
YARN: Container
Unit of resources in YARN
Includes elements such as memory, CPU, disk, network, etc.

18
YARN: Workflow
1. Client submits an application
2. ResourceManager creates a container for the ApplicationMaster
3. ApplicationMaster negotiates additional resources
4. ApplicationMaster starts execution in the allocated containers

19
20
YARN: Advantages
Greater scalability
Better use of cluster resources
Supports non-MapReduce workloads

21
MapReduce: Overview
Programming model for
large-scale data processing
Splits processing into two phases:
Map and Reduce

22
MapReduce: Map Phase
Processes the input data
Produces intermediate key-value pairs

23
MapReduce: Reduce Phase
Combines the intermediate key-value pairs
Produces the final result

24
What is ZooKeeper?
Coordination service for distributed systems
Developed by the Apache Software Foundation
Acts as a "trusted arbiter" between distributed
services

25
What does ZooKeeper coordinate?
Mainly:
1. Configuration metadata and state
2. Control and coordination information between services
It is not used for:
Coordinating application data at large scale
Storing large files

26
Problems Solved by ZooKeeper
1. Keeping multiple servers consistent
2. Dealing with partial failures in the system
3. Managing shared state correctly

27
How does ZooKeeper work?
Uses a hierarchical structure of "znodes", similar to a file system
Each znode can store up to 1 MB of data
Clients watch znodes to detect changes

28
Watcher Mechanism
Clients can "watch" znodes to be aware of changes
Notifications are sent when changes occur
Allows fast reaction to configuration or state changes

29
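As a concrete illustration with the zkCli.sh shell bundled with recent ZooKeeper releases (paths and data are invented for the example; the -w option on get sets a watch, so the client is notified when the znode changes):

create /config "cluster-wide settings"
create /config/app1 "batch.size=64"
get -w /config/app1
set /config/app1 "batch.size=128"
ls /config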
Typical use cases
1. Dynamic configuration: update configuration in real time
2. Leader election: determine the leader server in a cluster
3. Distributed locks: coordinate access to shared resources
4. Task queues: distribute work among multiple workers

30
Practical example: counting the words in a text
"O Iscte – Instituto Universitário de Lisboa é uma instituição pública de
ensino universitário, criada em 1972, que dispõe de campi em Lisboa e
Sintra. Em 2010 foi implementada uma nova estrutura organizacional que
resultou na composição de unidades orgânicas descentralizadas: quatro
Escolas, 16 Departamentos, oito Unidades de Investigação."

31
32
33
Conclusão
Hadoop: framework para processamento distribuído
HDFS: armazenamento distribuído
MapReduce: modelo de programação para processamento em larga escala
YARN: gestão de recursos e agendamento de tarefas
ZooKeeper: coordenação de sistemas distribuídos

34
References
1. Apache Hadoop. (n.d.). Apache Hadoop. https://hadoop.apache.org/
2. White, T. (2015). Hadoop: The Definitive Guide. O'Reilly Media.
3. Apache YARN. (n.d.). Apache Hadoop YARN.
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
4. Apache ZooKeeper. (n.d.). Apache ZooKeeper. https://zookeeper.apache.org/
5. Sammer, E. (2012). Hadoop Operations. O'Reilly Media.

35
Docker
Creating
containers and
main commands
Paulo Vieira 2024
What is Docker?
Launched in 2013, it quickly became an industry standard
Open source platform for creating containers
Lets you package applications and their dependencies
Linux-based

2
What is Docker for?
1. Creating isolated and consistent environments
2. Making application development and deployment easier
3. Ensuring applications behave identically across different environments
4. Simplifying infrastructure management and scalability

3
Main Docker components
1. Docker Engine: the runtime that runs the containers
2. Docker Images: read-only templates used to create containers
3. Docker containers: running instances of Docker images
4. Dockerfile: script for building Docker images
5. Docker Hub: public repository of Docker images

4
Essential Docker Commands
# List containers
docker ps

# Run a container
docker run [options] image

# Build an image
docker build -t image_name:tag .

# Stop a container
docker stop container_id

# Remove a container
docker rm container_id

# List images
docker images

# Remove an image
docker rmi image_name
5
Docker Compose
Tool for defining and managing multi-container applications
Uses a YAML file to configure services, networks and volumes

6
Example docker-compose.yml
version: '3'
services:
web:
build: .
ports:
- "5000:5000"
redis:
image: "redis:alpine"

7
Docker Compose Commands
# Start services
docker-compose up

# Stop services
docker-compose down

# List running services
docker-compose ps

8
Environment Variables in Docker
Used to configure containers at run time
Can be defined in the Dockerfile, in docker-compose.yml or on the command line

9
In the Dockerfile:
ENV VARIAVEL_EXEMPLO=valor

In docker-compose.yml:
services:
web:
environment:
- VARIAVEL_EXEMPLO=valor

On the command line:
docker run -e VARIAVEL_EXEMPLO=valor imagem

10
Advantages of Docker
1. Consistency: "Works on my machine"
becomes "Works anywhere"
2. Isolation: applications and dependencies are
encapsulated
3. Efficiency: uses fewer resources than
traditional virtual machines
4. Portability: easy to move between
environments (dev, test, prod)
5. Scalability: makes it easy to create and manage
multiple instances
11
Use cases
Software development: consistent development environments
Microservices: eases microservice architecture and deployment
Continuous Integration/Continuous Deployment (CI/CD): automates testing and
deployment
Legacy applications: eases migration and modernization
Infrastructure as code: defines environments programmatically

12
Practical example: Tutorial

13
hbase-tutorial-completo.md 2024-10-20

Paulo Vieira | 2024

Tutorial 2: HBase
Apache HBase is a NoSQL data storage and processing solution, inspired by the
Google BigTable model. Built on top of the Hadoop Distributed File System (HDFS), HBase
offers a distributed, non-relational, column-oriented approach to efficiently managing large
volumes of information. This database provides consistent random access to extremely
large datasets, offering a scalable platform for applications that demand high
performance when handling structured information.
Main HBase components:
1. HMaster
The HMaster is the main process that manages the HBase cluster. It monitors and coordinates the RegionServers and
also manages operations such as creating, disabling and splitting regions. It is essential for guaranteeing correct
data distribution and scalability in the cluster.
2. RegionServer
RegionServers are responsible for storing and managing the data in HBase. Each server
manages one or more regions, which are fragments of a table. They also process read and
write operations, making them crucial for the overall performance of the system.
3. Zookeeper
Zookeeper is used by HBase to coordinate its distributed services. It ensures that
processes such as the HMaster and the RegionServers stay synchronized and keep communicating.
It works as a "supervisor" that guarantees the system distributes the workload effectively.
4. Region
A region is the basic unit of scalability in HBase. Each table is split into several regions and, as
the data grows, new regions are created and distributed across the RegionServers. This
allows efficient partitioning of the data and growth in capacity without compromising performance.
5. Table
A table in HBase is a collection of rows organized into column families. Each row is
identified by a unique row key, and column families group related columns,
making data management and storage easier.

6. Column Family
A column family is a group of related columns within a table. Each column family
is stored separately, allowing more efficient reads and writes,
especially when only a few columns are queried.
Goals and Exploring the Web UI
Before moving on to the challenges, let's explore the HBase Web UI, available at
http://localhost:16010. The Web UI gives an overview of the state of the cluster, making it possible to monitor
performance, the active regions and the "health" of the system.
Initial challenge: explore the Web UI and answer the following questions:
How many RegionServers are active in the system?
Which region had the most reads/writes in the last 24 hours?
Which table currently stores the most data?
Setting up HBase (with Docker)
To start using HBase, we will set it up using Docker.
1. Install Docker, if you have not done so already.
2. Run the following command to start an HBase container:

docker run -d -p 2181:2181 -p 16000:16000 -p 16010:16010 -p 16020:16020 -p 16030:16030 --name hbase-docker harisekhon/hbase

3. Check that the container is running and identify the container we want to use:

docker ps

4. Access the HBase shell in the hbase-docker container:

docker exec -it hbase-docker hbase shell

Manipulating data in HBase


Creating a Table for a Class
Let's create a table called "turma" (class) with two column families: "info" and "notas" (grades).

create 'turma', 'info', 'notas'


Inserting Data
Insert some data into the table:

put 'turma', '1', 'info:nome', 'Ana Silva'


put 'turma', '1', 'info:idade', '20'
put 'turma', '1', 'notas:matematica', '18'
put 'turma', '2', 'info:nome', 'João Santos'
put 'turma', '2', 'info:idade', '22'
put 'turma', '2', 'notas:matematica', '16'

Removing Data
To remove a specific cell, identify the row and the column:

delete 'turma', '1', 'notas:matematica'

To remove an entire row, it is enough to identify the row key:

deleteall 'turma', '2'

Queries and Filters
HBase offers several ways to query and filter data:
Simple Query
To fetch a specific row, specify the table and the row key:

get 'turma', '1'

Scan
To list every row in the table:

scan 'turma'

Queries with Filters

HBase supports several filter types for more complex queries:
1. Filter by Value:
scan 'turma', {FILTER => "ValueFilter(=, 'binary:Ana Silva')"}

2. Filter by Row Prefix:

scan 'turma', {FILTER => "PrefixFilter('1')"}

3. Filter by Column Family:

scan 'turma', {COLUMNS => 'info'}

4. Filter by Row Range:

scan 'turma', {STARTROW => '1', ENDROW => '3'}

5. Combining Filters:

scan 'turma', {FILTER => "ColumnPrefixFilter('nome') AND


ValueFilter(=, 'binary:João Santos')"}

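The same operations can also be performed programmatically rather than through the shell. Below is a minimal sketch using the happybase Python client; it assumes an HBase Thrift server is reachable on localhost:9090, which is not published by the docker run command above (an extra -p 9090:9090 mapping would be needed), and the row '3' / 'Rita Lopes' values are purely illustrative.

# Minimal sketch with the happybase client (pip install happybase).
# Assumes an HBase Thrift server reachable on localhost:9090 — not published by the
# docker run command above, so an extra "-p 9090:9090" mapping would be required.
import happybase

connection = happybase.Connection('localhost', port=9090)
table = connection.table('turma')

# Equivalent of: put 'turma', '3', 'info:nome', 'Rita Lopes'  (illustrative row and values)
table.put(b'3', {b'info:nome': b'Rita Lopes', b'notas:matematica': b'15'})

# Equivalent of: get 'turma', '3'
print(table.row(b'3'))

# Equivalent of: scan 'turma', {COLUMNS => 'info'}
for key, data in table.scan(columns=[b'info']):
    print(key, data)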
Challenges
Try to solve the following challenges. If in doubt, check the documentation.
Schools
1. Create a new table called "escola" with the column families "detalhes" and "estatisticas".
2. Insert data for three different schools, including name, location and number of students.
3. Run a query to fetch all the data for a specific school.
4. Update the number of students of one of the schools.
5. Create a new column family called "professores" in the "escola" table.
6. Add information about two teachers for each school.
7. Remove an entire school from the table.
8. Scan the table to see all the remaining data.
9. Count the number of rows in the "escola" table.
10. Disable the "escola" table, then re-enable it.


Products
1. Create a table to store product information with two column families – specific and general
characteristics. As specific characteristics, consider the name and the description; as general
characteristics, the price, the quantity and the VAT rate.
2. Insert data representing five different products, including the specific characteristics defined above.
3. Fetch all the details of a specific product by name or by VAT rate.
4. List the information of all products.
5. Update the price of a specific product in the table.
6. Delete the product description of a specific item.
7. Delete every product record that has no stock.
Students
1. Create a table to store student information with details about courses and grades. Grades belong to
course units (UCs) and range from A (excellent) to E (insufficient).
2. List every UC in which the student got an excellent grade (A).
3. Temporarily disable the students table.
4. Permanently delete the students table after making sure it holds no important data.
Solutions
Schools
1. Create a new table called "escola" with the column families "detalhes" and "estatisticas".

create 'escola', 'detalhes', 'estatisticas'

2. Insert data for three different schools, including name, location and number of students.

put 'escola', '1', 'detalhes:nome', 'Escola Secundária de Lisboa'
put 'escola', '1', 'detalhes:localizacao', 'Lisboa'
put 'escola', '1', 'estatisticas:num_alunos', '1000'

put 'escola', '2', 'detalhes:nome', 'Escola Básica do Porto'
put 'escola', '2', 'detalhes:localizacao', 'Porto'
put 'escola', '2', 'estatisticas:num_alunos', '750'

put 'escola', '3', 'detalhes:nome', 'Colégio de Coimbra'
put 'escola', '3', 'detalhes:localizacao', 'Coimbra'
put 'escola', '3', 'estatisticas:num_alunos', '500'

3. Run a query to fetch all the data for a specific school.

get 'escola', '1'

4. Update the number of students of one of the schools.

put 'escola', '2', 'estatisticas:num_alunos', '800'

5. Create a new column family called "professores" in the "escola" table.

alter 'escola', 'professores'

6. Add information about two teachers for each school.

put 'escola', '1', 'professores:1', 'Ana Silva'
put 'escola', '1', 'professores:2', 'João Santos'

put 'escola', '2', 'professores:1', 'Maria Oliveira'
put 'escola', '2', 'professores:2', 'Pedro Costa'

put 'escola', '3', 'professores:1', 'Carla Ferreira'
put 'escola', '3', 'professores:2', 'Rui Almeida'

7. Remove an entire school from the table.

deleteall 'escola', '3'

8. Scan the table to see all the remaining data.

scan 'escola'

9. Count the number of rows in the "escola" table.

count 'escola'

10. Disable the "escola" table, then re-enable it.

disable 'escola'
enable 'escola'

Products
1. Create a table to store product information with two column families – specific and general
characteristics. As specific characteristics, consider the name and the description; as general
characteristics, the price, the quantity and the VAT rate.

create 'produtos', 'especificas', 'gerais'

2. Insert data representing five different products, including the specific characteristics defined
above.

put 'produtos', '1', 'especificas:nome', 'Laptop'
put 'produtos', '1', 'especificas:descricao', 'Portátil de alta performance'
put 'produtos', '1', 'gerais:preco', '999.99'
put 'produtos', '1', 'gerais:quantidade', '50'
put 'produtos', '1', 'gerais:iva', '23'

# Repeat for 4 more products

3. Fetch all the details of a specific product by name or by VAT rate.

scan 'produtos', {FILTER => "SingleColumnValueFilter('especificas', 'nome', =, 'binary:Laptop')"}
scan 'produtos', {FILTER => "SingleColumnValueFilter('gerais', 'iva', =, 'binary:23')"}

4. List the information of all products.

scan 'produtos'

5. Update the price of a specific product in the table.

put 'produtos', '1', 'gerais:preco', '899.99'

6. Delete the product description of a specific item.

delete 'produtos', '1', 'especificas:descricao'

7. Delete every product record that has no stock.

# The HBase shell has no delete-by-filter; first identify the rows without stock...
scan 'produtos', {FILTER => "SingleColumnValueFilter('gerais', 'quantidade', =, 'binary:0')", COLUMNS => ['especificas:nome']}
# ...and then remove each returned row key, e.g. (assuming row '4' was returned):
deleteall 'produtos', '4'

Students
1. Create a table to store student information with details about courses and grades. Grades belong to
course units (UCs) and range from A (excellent) to E (insufficient).

create 'alunos', 'info', 'notas'

2. List every UC in which the student got an excellent grade (A).

# ValueFilter restricted to the 'notas' family returns every UC cell whose value is A
scan 'alunos', {COLUMNS => 'notas', FILTER => "ValueFilter(=, 'binary:A')"}

3. Temporarily disable the students table.

disable 'alunos'

4. Permanently delete the students table after making sure it holds no important data.

drop 'alunos'

Hive Tutorial
Hive Installation and Kick-off

Based on the work of Nikolay Dimolarov and Romain Rigaux
(https://towardsdatascience.com/making-big-moves-in-big-data-with-hadoop-hive-parquet-hue-and-docker-320a52ca175)

Preparing the Hive Docker environment

git clone https://github.com/tech4242/docker-hadoop-hive-parquet.git
cd .\docker-hadoop-hive-parquet\
docker-compose up -d

Note: if you get an error message at start-up, run the following command (on Windows):

net stop winnat

Importing data from a Parquet file

1. Pick a dataset with interesting information (e.g., from Kaggle)
a. For this tutorial, use https://www.kaggle.com/datasets/samyakb/student-stress-factors/
2. Download the file “Student Stress Factors.csv” and rename it to “Student_Stress_Factors.csv”
3. Place this file next to the “parquet_converter.py” script
4. Edit line 8 of the script, setting the correct path to the input file
5. Edit line 9, setting the desired path for the output file
6. Install the following Python modules:



pip install pandas
pip install pyarrow

7. Run the “parquet_converter.py” script to obtain the file “Student_Stress_Factors.parquet” (a sketch
of what such a script might look like is shown below)

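For reference, here is a minimal sketch of what such a conversion script might look like; the parquet_converter.py distributed with the tutorial may differ, and the two file paths below are the ones adjusted in steps 4 and 5.

# Minimal sketch of a CSV-to-Parquet converter using pandas + pyarrow.
# The actual parquet_converter.py may differ; the paths are those edited in steps 4 and 5.
import pandas as pd

csv_path = "Student_Stress_Factors.csv"          # input file (step 4)
parquet_path = "Student_Stress_Factors.parquet"  # output file (step 5)

df = pd.read_csv(csv_path)
df.to_parquet(parquet_path, engine="pyarrow", index=False)

print(df.dtypes)                  # quick look at the inferred schema
print(len(df), "rows written to", parquet_path)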
Creating a data model to support the integration

1. Install the following Python modules:

pip install parquet-tools

2. Run the following command in the directory containing the “Student_Stress_Factors.parquet” file:

parquet-tools inspect .\Student_Stress_Factors.parquet

3. Access Hue at http://localhost:8888
a. Set the password on first access

4. Create a database



5. Create a new table “demo30112023”

6. Import the data from the “csv” file after uploading it

7. Select the table name and the “csv” file format



8. Select “Parquet” as the destination format

9. Submit

10. This version uses Hive-on-MR (Hive on MapReduce), which has been deprecated. The alternative is
Spark, which will be introduced in the next course unit.



11. Visualise the data in the dashboard and run some queries over the new data.
Tip: run a query first and then switch to the chart view

12. Can we draw any conclusions from the data? (A quick local check with pandas is sketched below.)
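As a complement to the queries in Hue, a quick local look at the converted Parquet file with pandas can help spot patterns. This is a minimal sketch; it assumes the Student_Stress_Factors.parquet file produced earlier is in the current directory and that most columns are numeric ratings.

# Minimal sketch: explore the converted dataset locally with pandas.
# Assumes Student_Stress_Factors.parquet (produced earlier) is in the current directory.
import pandas as pd

df = pd.read_parquet("Student_Stress_Factors.parquet")

print(df.head())                   # first rows, to see the actual column names
print(df.describe())               # summary statistics for the numeric columns
print(df.corr(numeric_only=True))  # correlations between the numeric factors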
