
23 Distributed OLTP Databases

Intro to Database Systems
15-445/15-645 Fall 2019

Andy Pavlo
Carnegie Mellon University

ADMINISTRIVIA

Homework #5: Monday Dec 3rd @ 11:59pm

Project #4: Monday Dec 10th @ 11:59pm

Extra Credit: Wednesday Dec 10th @ 11:59pm

Final Exam: Monday Dec 9th @ 5:30pm


LAST CLASS

System Architectures
→ Shared-Memory, Shared-Disk, Shared-Nothing

Partitioning/Sharding
→ Hash, Range, Round Robin

Transaction Coordination
→ Centralized vs. Decentralized


OLTP VS. OLAP

On-line Transaction Processing (OLTP):
→ Short-lived read/write txns.
→ Small footprint.
→ Repetitive operations.

On-line Analytical Processing (OLAP):
→ Long-running, read-only queries.
→ Complex joins.
→ Exploratory queries.


DECENTRALIZED COORDINATOR

[Diagram: The application server sends a Begin Request to partition P1, which
acts as the coordinator. It then sends queries directly to the partitions
(P1-P4) that hold the data they touch. On the Commit Request, the coordinator
asks the other partitions whether it is safe to commit.]

OBSERVATION

We have not discussed how to ensure that all nodes agree to commit a txn and
then to make sure it does commit if we decide that it should.
→ What happens if a node fails?
→ What happens if our messages show up late?
→ What happens if we don't wait for every node to agree?


IMPORTANT ASSUMPTION

We can assume that all nodes in a distributed DBMS are well-behaved and under
the same administrative domain.
→ If we tell a node to commit a txn, then it will commit the txn (if there is
  not a failure).

If you do not trust the other nodes in a distributed DBMS, then you need to
use a Byzantine Fault Tolerant protocol for txns (blockchain).


TODAY'S AGENDA

Atomic Commit Protocols
Replication
Consistency Issues (CAP)
Federated Databases


ATOMIC COMMIT PROTOCOL

When a multi-node txn finishes, the DBMS needs to ask all the nodes involved
whether it is safe to commit.

Examples:
→ Two-Phase Commit
→ Three-Phase Commit (not used)
→ Paxos
→ Raft
→ ZAB (Apache Zookeeper)
→ Viewstamped Replication

TWO-PHASE COMMIT (SUCCESS)

[Diagram: The application server sends a Commit Request to the coordinator
(Node 1). Phase 1 (Prepare): the coordinator asks the participants (Node 2 and
Node 3) whether it is safe to commit, and each replies OK. Phase 2 (Commit):
the coordinator tells the participants to commit, each replies OK, and the
coordinator returns Success! to the application server.]
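To make the message flow above concrete, here is a minimal sketch of a 2PC
coordinator. The Participant class and its prepare/commit/abort methods are
hypothetical stand-ins for the real network calls; this is an illustration,
not any DBMS's implementation.

```python
# Minimal 2PC coordinator sketch (hypothetical Participant class).
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    healthy: bool = True

    def prepare(self, txn_id: int) -> bool:
        # Vote OK only if this node can durably log the txn's changes.
        return self.healthy

    def commit(self, txn_id: int) -> None:
        print(f"{self.name}: commit txn {txn_id}")

    def abort(self, txn_id: int) -> None:
        print(f"{self.name}: abort txn {txn_id}")

def two_phase_commit(txn_id: int, participants: list[Participant]) -> bool:
    # Phase 1 (Prepare): send prepare to every participant and collect votes.
    votes = [p.prepare(txn_id) for p in participants]
    decision = all(votes)
    # Phase 2 (Commit/Abort): broadcast the coordinator's decision.
    for p in participants:
        if decision:
            p.commit(txn_id)
        else:
            p.abort(txn_id)
    return decision

if __name__ == "__main__":
    nodes = [Participant("Node 2"), Participant("Node 3")]
    print("Success!" if two_phase_commit(1234, nodes) else "Aborted")
```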

TWO-PHASE COMMIT (ABORT)

[Diagram: The application server sends a Commit Request to the coordinator
(Node 1). Phase 1 (Prepare): the coordinator asks the participants, but Node 2
replies ABORT! The coordinator immediately returns Aborted to the application
server and then sends Phase 2 (Abort) messages to the participants, which
acknowledge with OK.]

2PC OPTIMIZATIONS

Early Prepare Voting
→ If you send a query to a remote node that you know will be the last one you
  execute there, then that node will also return its vote for the prepare
  phase with the query result.

Early Acknowledgement After Prepare
→ If all nodes vote to commit a txn, the coordinator can send the client an
  acknowledgement that their txn was successful before the commit phase
  finishes.


EARLY ACKNOWLEDGEMENT

[Diagram: The application server sends a Commit Request to the coordinator
(Node 1). Phase 1 (Prepare): the participants (Node 2 and Node 3) both vote
OK. The coordinator returns Success! to the application server as soon as the
prepare votes arrive, and then runs Phase 2 (Commit) with the participants in
the background.]
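A rough sketch of this early-acknowledgement variant, reusing the hypothetical
Participant objects from the 2PC sketch earlier: the coordinator answers the
client as soon as the prepare votes are unanimous and finishes Phase 2 in the
background.

```python
import threading

def two_phase_commit_early_ack(txn_id, participants, reply_to_client):
    # Phase 1 (Prepare): collect votes from every participant.
    if all(p.prepare(txn_id) for p in participants):
        # Early acknowledgement: tell the client the txn succeeded as soon as
        # the prepare votes are unanimous, before Phase 2 finishes.
        reply_to_client("Success!")
        # Phase 2 (Commit) proceeds in the background.
        threading.Thread(
            target=lambda: [p.commit(txn_id) for p in participants]
        ).start()
    else:
        reply_to_client("Aborted")
        for p in participants:
            p.abort(txn_id)

# Example (with the Participant class from the earlier sketch):
#   two_phase_commit_early_ack(
#       1234, [Participant("Node 2"), Participant("Node 3")], print)
```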

TWO-PHASE COMMIT

Each node records the outcome of each phase in a non-volatile storage log.

What happens if the coordinator crashes?
→ Participants must decide what to do.

What happens if a participant crashes?
→ The coordinator assumes that it responded with an abort if it hasn't sent
  an acknowledgement yet.
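As a rough illustration of the non-volatile log mentioned above, a participant
might append the outcome of each phase and, on restart, look for txns that
were prepared but never decided. The file name, record format, and recovery
policy here are illustrative assumptions, not any DBMS's actual scheme.

```python
# Sketch of a 2PC participant's durable log and crash recovery.
import os

LOG = "2pc_participant.log"

def log_record(txn_id: str, state: str) -> None:
    # Append the phase outcome (PREPARED / COMMITTED / ABORTED) durably.
    with open(LOG, "a") as f:
        f.write(f"{txn_id} {state}\n")
        f.flush()
        os.fsync(f.fileno())

def recover() -> list[str]:
    # Replay the log after a crash; keep the latest state seen for each txn.
    state = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                txn_id, s = line.split()
                state[txn_id] = s
    # Prepared-but-undecided txns must ask the coordinator for the outcome.
    return [t for t, s in state.items() if s == "PREPARED"]
```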


PAXOS

Consensus protocol where a coordinator proposes an outcome (e.g., commit or
abort) and then the participants vote on whether that outcome should succeed.

Does not block if a majority of participants are available and has provably
minimal message delays in the best case.


PAXOS

[Diagram: The application server sends a Commit Request to the proposer
(Node 1). The acceptors are Node 2, Node 3, and Node 4, but Node 2 is down.
The proposer sends Propose to the acceptors; Node 3 and Node 4 reply Agree.
The proposer then sends Commit; Node 3 and Node 4 reply Accept, and the
proposer returns Success! to the application server. The proposal succeeds
because a majority of the acceptors responded.]

PAXOS

[Timeline with two proposers and a set of acceptors:
→ The first proposer sends Propose(n) and the acceptors reply Agree(n).
→ A second proposer then sends Propose(n+1), which the acceptors also agree
  to.
→ When the first proposer sends Commit(n), the acceptors reply Reject(n, n+1)
  because they have already agreed to the higher proposal number n+1.
→ The second proposer sends Commit(n+1) and the acceptors reply Accept(n+1).]
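A minimal sketch of a single-decree Paxos acceptor that follows the message
names in the timeline above (Propose/Commit with proposal numbers). It is only
an illustration of why Commit(n) gets rejected after Agree(n+1), not a
production consensus implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1   # highest proposal number agreed to so far
        self.accepted = None   # (n, value) of the last accepted proposal

    def on_propose(self, n):
        # Agree only to proposal numbers higher than anything seen before.
        if n > self.promised_n:
            self.promised_n = n
            return ("Agree", n, self.accepted)
        return ("Reject", n, self.promised_n)

    def on_commit(self, n, value):
        # Accept the value only if no higher-numbered proposal was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return ("Accept", n)
        return ("Reject", n, self.promised_n)

# Two competing proposers, as in the timeline: the lower-numbered Commit(n)
# is rejected once the acceptor has agreed to n+1.
a = Acceptor()
print(a.on_propose(1))               # ('Agree', 1, None)
print(a.on_propose(2))               # ('Agree', 2, None)
print(a.on_commit(1, "commit txn"))  # ('Reject', 1, 2)
print(a.on_commit(2, "abort txn"))   # ('Accept', 2)
```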

MULTI-PAXOS

If the system elects a single leader that is in charge of proposing changes
for some period of time, then it can skip the Propose phase.
→ Fall back to full Paxos whenever there is a failure.

The system periodically renews who the leader is using another Paxos round.


2PC VS. PAXOS

Two-Phase Commit
→ Blocks if the coordinator fails after the prepare message is sent, until
  the coordinator recovers.

Paxos
→ Non-blocking if a majority of participants are alive, provided there is a
  sufficiently long period without further failures.


REPLICATION

The DBMS can replicate data across redundant nodes to increase availability.

Design Decisions:
→ Replica Configuration
→ Propagation Scheme
→ Propagation Timing
→ Update Method


REPLICA CONFIGURATIONS

Approach #1: Master-Replica
→ All updates go to a designated master for each object.
→ The master propagates updates to its replicas without an atomic commit
  protocol.
→ Read-only txns may be allowed to access replicas.
→ If the master goes down, then hold an election to select a new master.

Approach #2: Multi-Master
→ Txns can update data objects at any replica.
→ Replicas must synchronize with each other using an atomic commit protocol.

REPLICA CONFIGURATIONS

[Diagram: Master-Replica — all writes go to the master copy of P1, which
propagates them to its replicas; reads may go to the master or to the
replicas. Multi-Master — Node 1 and Node 2 each hold a master copy of P1, and
both accept reads and writes.]
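A minimal sketch of client-side routing under the Master-Replica configuration
described above; the node addresses and the is_write helper are hypothetical
placeholders, not part of any real driver.

```python
import random

# Hypothetical cluster layout for partition P1: one master plus two replicas.
MASTER = "node1:5432"
REPLICAS = ["node2:5432", "node3:5432"]

def is_write(sql: str) -> bool:
    # Crude classification for the sketch; a real router would parse the SQL.
    return sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}

def route(sql: str) -> str:
    # Master-Replica: all updates go to the master; read-only statements may
    # be allowed to run on a replica (possibly seeing stale data if the
    # propagation scheme is asynchronous).
    return MASTER if is_write(sql) else random.choice(REPLICAS)

print(route("UPDATE accounts SET balance = balance - 100 WHERE id = 1"))
print(route("SELECT balance FROM accounts WHERE id = 1"))
```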

K-SAFETY

K-safety is a threshold for determining the fault tolerance of the replicated
database.

The value K represents the number of replicas per data object that must
always be available.

If the number of replicas goes below this threshold, then the DBMS halts
execution and takes itself offline.
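A minimal sketch of the K-safety check described above; the K value, replica
map, and set of live nodes are made-up examples.

```python
K = 2  # each object must keep at least K live replicas

# Hypothetical map of data objects to the nodes currently holding a replica.
replica_map = {
    "P1": ["node1", "node2", "node3"],
    "P2": ["node2", "node3"],
}
live_nodes = {"node1", "node3"}  # node2 has failed

def is_k_safe(replica_map, live_nodes, k):
    # The DBMS stays online only while every object has >= k live replicas.
    return all(
        sum(n in live_nodes for n in nodes) >= k
        for nodes in replica_map.values()
    )

if not is_k_safe(replica_map, live_nodes, K):
    print("Below K-safety threshold: halting execution")  # P2 has only 1 left
```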


PROPAGATION SCHEME

When a txn commits on a replicated database, the DBMS decides whether it must
wait for that txn's changes to propagate to other nodes before it can send
the acknowledgement to the application.

Propagation levels:
→ Synchronous (Strong Consistency)
→ Asynchronous (Eventual Consistency)


PROPAGATION SCHEME

Approach #1: Synchronous
→ The master sends updates to replicas and then waits for them to acknowledge
  that they fully applied (i.e., logged) the changes.

Approach #2: Asynchronous
→ The master immediately returns the acknowledgement to the client without
  waiting for replicas to apply the changes.
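A minimal sketch contrasting the two propagation schemes; the Replica class
and its apply_and_flush call are hypothetical placeholders for real log
shipping and fsync.

```python
import threading

class Replica:
    # Hypothetical replica endpoint: applying the log makes changes durable.
    def apply_and_flush(self, log_records):
        pass  # network call + fsync in a real system

def commit_synchronous(log_records, replicas, ack_client):
    # Synchronous (strong consistency): wait until every replica has flushed
    # the txn's changes before acknowledging the client.
    for r in replicas:
        r.apply_and_flush(log_records)
    ack_client("committed")

def commit_asynchronous(log_records, replicas, ack_client):
    # Asynchronous (eventual consistency): acknowledge immediately and ship
    # the changes to the replicas in the background.
    ack_client("committed")
    threading.Thread(
        target=lambda: [r.apply_and_flush(log_records) for r in replicas]
    ).start()
```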

PROPAGATION TIMING

Approach #1: Continuous
→ The DBMS sends log messages immediately as it generates them.
→ Also need to send a commit/abort message.

Approach #2: On Commit
→ The DBMS only sends the log messages for a txn to the replicas once the
  txn commits.
→ Do not waste time sending log records for aborted txns.
→ Assumes that a txn's log records fit entirely in memory.
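A minimal sketch contrasting the two propagation timings; ship_to_replicas is
a hypothetical stand-in for the network send.

```python
def ship_to_replicas(records):
    pass  # hypothetical network send

class ContinuousTimingTxn:
    # Continuous: send log records as soon as they are generated, plus a
    # final commit/abort message.
    def add_log_record(self, rec):
        ship_to_replicas([rec])

    def finish(self, outcome):  # "COMMIT" or "ABORT"
        ship_to_replicas([outcome])

class OnCommitTimingTxn:
    # On commit: buffer the txn's log records in memory and only ship them
    # once the txn commits (aborted txns send nothing).
    def __init__(self):
        self.buffer = []

    def add_log_record(self, rec):
        self.buffer.append(rec)

    def finish(self, outcome):
        if outcome == "COMMIT":
            ship_to_replicas(self.buffer + ["COMMIT"])
```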


ACTIVE VS. PASSIVE

Approach #1: Active-Active
→ A txn executes at each replica independently.
→ Need to check at the end whether the txn ends up with the same result at
  each replica.

Approach #2: Active-Passive
→ Each txn executes at a single location and propagates the changes to the
  replica.
→ Can either do physical or logical replication.
→ Not the same as master-replica vs. multi-master.


CAP THEOREM

Proposed by Eric Brewer: it is impossible for a distributed system to always
be:
→ Consistent
→ Always Available
→ Network Partition Tolerant
"Pick two!" (sort of…)

Proved in 2002.


CAP THEOREM

[Diagram: C = Consistency (linearizability), A = Availability (all up nodes
can satisfy all requests), P = Partition Tolerance (still operate correctly
despite message loss). Having all three at once is impossible.]

CAP CONSISTENCY

If the master says the txn committed, then it should be immediately visible
on replicas.

[Diagram: One application server sends "Set A=2" to the master, which updates
A from 1 to 2. The master propagates the change to the replica (also now A=2)
and returns ACK. Another application server then reads A from the replica and
sees A=2.]

CAP AVAILABILITY

[Diagram: The master and the replica both store A=1 and B=8, but the network
link between them is down. One application server reads B from the master and
gets B=8; another reads A from the replica and gets A=1. Both nodes keep
answering requests even though they cannot talk to each other.]

CAP PARTITION TOLERANCE

[Diagram: A network partition splits the two nodes, and both now act as a
master. One application server sends "Set A=2" to the first master (A becomes
2) while another sends "Set A=3" to the second master (A becomes 3). Both
updates are acknowledged, so the two sides hold conflicting values when the
network comes back.]

CAP FOR OLTP DBMSs

How a DBMS handles failures determines which elements of the CAP theorem it
supports.

Traditional/NewSQL DBMSs
→ Stop allowing updates until a majority of nodes are reconnected.

NoSQL DBMSs
→ Provide mechanisms to resolve conflicts after nodes are reconnected.


OBSERVATION

We have assumed that the nodes in our distributed systems are running the
same DBMS software. But organizations often run many different DBMSs in
their applications.

It would be nice if we could have a single interface for all our data.


FEDERATED DATABASES

Distributed architecture that connects together multiple DBMSs into a single
logical system. A query can access data at any location.

This is hard and nobody does it well:
→ Different data models, query languages, limitations.
→ No easy way to optimize queries.
→ Lots of data copying (bad).


FEDERATED DATABASE EXAMPLE

[Diagram: An application server sends query requests to a middleware layer
that uses connectors to reach the back-end DBMSs. Alternatively, the
application server queries a single DBMS that reaches the back-end DBMSs
through foreign data wrappers.]

CONCLUSION

We assumed that the nodes in our distributed DBMS are friendly.

Blockchain databases assume that the nodes are adversarial. This means you
must use different protocols to commit transactions.


NEXT CLASS

Distributed OLAP Systems
