Building A Transactional Distributed Data Store With Erlang
Scalability matters
− > 10⁴ accesses per sec.
− many concurrent writes
Traditional Web 2.0 hosting
[Figure: clients accessing a traditional Web 2.0 hosting setup]
Now think big.
Really BIG.
− Application Layer
− Transaction Layer: crash recovery model, strong data consistency
− Replication Layer: improves availability at the cost of consistency
− P2P Layer: Key/Value Store (= simple database)
P2P LAYER
Key/Value Store
for storing “items” (= “key/value pairs”)
− synonyms: “key/value store”, “dictionary”, “map”, …
just 3 ops
− insert(key, value)
− delete(key)
− lookup(key)
Turing Award Winners
Key      Value
Backus   1977
Hoare    1980
Karp     1985
Knuth    1974
Wirth    1984
...      ...
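As a rough illustration, a single-node version of this three-operation interface might look as follows in Erlang. This is a minimal sketch only; the module name kv_sketch and the store-passing style are assumptions for illustration, not the actual implementation.

%% Minimal sketch of the three-operation key/value interface,
%% using a plain Erlang map as the store (assumption).
-module(kv_sketch).
-export([new/0, insert/3, delete/2, lookup/2]).

new() -> #{}.

%% insert(key, value): returns the updated store.
insert(Key, Value, Store) -> maps:put(Key, Value, Store).

%% delete(key): returns the updated store.
delete(Key, Store) -> maps:remove(Key, Store).

%% lookup(key): {ok, Value} or not_found.
lookup(Key, Store) ->
    case maps:find(Key, Store) of
        {ok, Value} -> {ok, Value};
        error       -> not_found
    end.

For example, inserting the pair ("Backus", 1977) and looking up "Backus" returns {ok, 1977}.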
Chord# - Distributed Key/Value Store
key space: total order on items (strings, numbers, …)
nodes have a random key as their position in the ring
items are stored on the successor node (clockwise)
[Figure: ring of nodes and keys; one node is responsible for the key range (Backus, …, Karp]]
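A minimal sketch of this successor rule in Erlang; the module and function names are assumptions, and node keys are simply compared with Erlang's term order.

%% Sketch: an item is stored on the first node key >= the item key
%% (clockwise successor); if none exists, responsibility wraps around
%% to the smallest node key. Names are assumptions for illustration.
-module(ring_sketch).
-export([responsible_node/2]).

responsible_node(ItemKey, NodeKeys) ->
    Sorted = lists:sort(NodeKeys),
    case [N || N <- Sorted, N >= ItemKey] of
        [Succ | _] -> Succ;          %% clockwise successor
        []         -> hd(Sorted)     %% wrap around the ring
    end.

For the table above, responsible_node("Hoare", ["Wirth", "Backus", "Karp"]) returns "Karp", the node covering the range (Backus, …, Karp].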
Routing Table and Data Lookup
Building the routing table
− log₂N pointers, exponentially spaced
Retrieving items
− ≤ log₂N hops
− Example: lookup(Hoare), started at one node in the figure; Hoare falls into the range (Backus – Karp]
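To illustrate the spacing of these pointers, here is a toy Erlang sketch that assumes global knowledge of the sorted node list (an assumption made only for this simulation; the real routing table is built without such global knowledge). It returns the nodes 1, 2, 4, … positions away, i.e. about log₂N entries.

%% Simulation-only sketch of an exponentially spaced routing table.
%% MyIndex is my 1-based position in the sorted list of all nodes.
-module(rt_sketch).
-export([routing_table/2]).

routing_table(MyIndex, Nodes) ->
    N = length(Nodes),
    Levels = trunc(math:log2(N)),
    [lists:nth(((MyIndex - 1 + (1 bsl I)) rem N) + 1, Nodes)
     || I <- lists:seq(0, Levels - 1)].

For 8 nodes, rt_sketch:routing_table(1, [n1,n2,n3,n4,n5,n6,n7,n8]) returns [n2, n3, n5], i.e. pointers 1, 2, and 4 positions away.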
Churn
Nodes join, leave, or crash at any time
[Figure: ring with nodes N1 … N4; N3 has crashed while a lookup for key k is under way]
Lookup Consistency
Violated lookup consistency, caused by imperfect failure detectors:
a lookup(k) issued at N1 ends at N3, but issued at N2 it ends at N4
[Figure: ring with nodes N1 … N4 and key k; an alive node is wrongly reported as crashed, so N1 and N2 resolve k differently]
How often does this occur?
Simulated nodes with imperfect failure detectors
(A node detects another alive node as dead probabilistically)
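A toy Erlang version of one round of such a simulation; the module name, the all-pairs check, and the false-positive probability P are assumptions for illustration only.

%% Sketch: N alive nodes check each other once; every check wrongly
%% reports the (alive) peer as dead with probability P. Returns the
%% number of false suspicions in this round.
-module(fd_sim_sketch).
-export([false_suspicions/2]).

false_suspicions(N, P) ->
    length([suspected || A <- lists:seq(1, N),
                         B <- lists:seq(1, N),
                         A =/= B,
                         rand:uniform() < P]).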
SUMMARY P2P LAYER
Chord# provides a key/value store
− scalable
− efficient: log₂N hops
REPLICATION LAYER
Replication
Many schemes
− symmetric replication
− succ. list replication
− …
Quorum algorithms
[Figure: an item replicated on r1 … r5; a majority of the replicas forms a quorum]
How to implement?
− Always read/write a majority, i.e. ⌊f/2⌋ + 1, of the f replicas.
− Any two majorities intersect, so the latest version is always in the read or write set.
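A minimal sketch of this majority-read rule in Erlang, assuming each replica answers with a {Version, Value} tuple; the module name and the reply representation are assumptions for illustration.

%% Sketch of a quorum read: once a majority of the F replicas has
%% answered, the highest version among the replies is the latest one.
-module(quorum_sketch).
-export([majority/1, read/2]).

%% Majority quorum size for F replicas.
majority(F) -> F div 2 + 1.

%% Replies: list of {Version, Value} answers received so far.
read(F, Replies) ->
    case length(Replies) >= majority(F) of
        true ->
            {_Version, Value} = lists:max(Replies),
            {ok, Value};
        false ->
            wait_for_more_replies
    end.

For example, with f = 3 replicas, quorum_sketch:read(3, [{7, "v7"}, {6, "v6"}]) already has a majority and returns {ok, "v7"}.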
How to implement?
− 2PC? Blocks if the transaction manager fails.
− 3PC? Too much latency.
− We use a variant of the Paxos Commit Protocol
non-blocking: votes of the transaction participants are sent to multiple “acceptors”
Adapted Paxos Commit
Optimistic concurrency control (CC) with fallback
Write
− 3 rounds
− non-blocking (fallback)
Reads are even faster
− read a majority of the replicas
− just 1 round
Succeeds when > f/2 nodes are alive
Adapted Paxos Commit
[Figure: steps 1–6 of the protocol between the leader, the transaction managers (TMs), and the transaction participants (TPs) that store the replicated items; step 1 reaches the TPs in O(log N) hops, and after a majority of replies the outcome is decided]
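A much-simplified sketch of the commit decision in such a protocol: each participant's vote is stored at several acceptors, and the transaction commits only if, for every participant, a majority of its acceptors stored a prepared vote. All names and the data layout here are assumptions for illustration, not the actual implementation.

%% Sketch of the decision rule: VotesPerTP holds, per transaction
%% participant, the list of votes (prepared | abort) its acceptors
%% stored; NumAcceptors is the number of acceptors per participant.
-module(paxos_commit_sketch).
-export([decide/2]).

decide(VotesPerTP, NumAcceptors) ->
    Majority = NumAcceptors div 2 + 1,
    Prepared = fun(AcceptorVotes) ->
                       length([V || V <- AcceptorVotes, V =:= prepared])
                           >= Majority
               end,
    case lists:all(Prepared, VotesPerTP) of
        true  -> commit;
        false -> abort
    end.

For instance, with 3 acceptors per participant, decide([[prepared, prepared, prepared], [prepared, abort, prepared]], 3) returns commit.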
Transactions have two purposes:
Consistency of replicas & consistency across items
On the items:
BOT
− debit (a, 100);
− deposit (b, 100);
EOT
On the replicas:
BOT
− debit (a1, 100);
− debit (a2, 100);
− debit (a3, 100);
− deposit (b1, 100);
− deposit (b2, 100);
− deposit (b3, 100);
EOT
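A small Erlang sketch of this expansion, assuming three replicas per item and a hypothetical replica_keys/2 helper that derives the replica keys a1..a3 and b1..b3; the module and tuple layout are assumptions for illustration.

%% Sketch: expand the logical "debit(A, Amount); deposit(B, Amount)"
%% transaction into the per-replica operations between BOT and EOT.
-module(txn_sketch).
-export([transfer_ops/3]).

%% Replica keys for an item, e.g. replica_keys(a, 3) -> [a1, a2, a3].
replica_keys(Item, R) ->
    [list_to_atom(atom_to_list(Item) ++ integer_to_list(I))
     || I <- lists:seq(1, R)].

transfer_ops(A, B, Amount) ->
    [{debit, K, Amount}   || K <- replica_keys(A, 3)] ++
    [{deposit, K, Amount} || K <- replica_keys(B, 3)].

Calling txn_sketch:transfer_ops(a, b, 100) yields exactly the six replica operations listed above.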
SUMMARY TRANSACTION LAYER
[Figure: the existing Wikipedia architecture with web servers, NFS, search servers, and other servers]
Our Wikipedia
− Renderer (Java): Tomcat, Plog4u
− Jinterface (Java): interface to Erlang
− Key/Value Store (Erlang): Chord# + Replication + Transactions
Mapping Wikipedia to Key/Value Store
Mapping
− page content: key = title, value = list of Wikitext for all versions
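A sketch of how this mapping could look on top of the kv_sketch module from the earlier example; the module and function names here are assumptions. The key is the page title, the value is the list of Wikitext versions, newest first.

%% Sketch of the page-content mapping: title -> list of Wikitext versions.
-module(wiki_kv_sketch).
-export([save_revision/3, current_text/2]).

%% Prepend a new revision to the stored version list.
save_revision(Title, Wikitext, Store) ->
    Versions = case kv_sketch:lookup(Title, Store) of
                   {ok, Old} -> Old;
                   not_found -> []
               end,
    kv_sketch:insert(Title, [Wikitext | Versions], Store).

%% Return the most recent revision of a page, if any.
current_text(Title, Store) ->
    case kv_sketch:lookup(Title, Store) of
        {ok, [Latest | _]} -> {ok, Latest};
        not_found          -> not_found
    end.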
Erlang Processes
− Chord#
− load balancing
− transaction framework
− supervision (OTP)
Erlang Processes (per node)
Failure Detector: supervises Chord# nodes and sends crash messages when a failure is detected.
Chord# Node: implements the main functionality of a node, e.g. maintaining the successor list and the routing table.
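A minimal OTP supervisor sketch for this per-node process structure; chordsharp_node and failure_detector are assumed callback module names and are not shown here.

%% Sketch of per-node supervision: restart the Chord# node process and
%% the failure detector if either of them crashes.
-module(node_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    ChildSpecs =
        [#{id => chordsharp_node,
           start => {chordsharp_node, start_link, []}},   %% assumed module
         #{id => failure_detector,
           start => {failure_detector, start_link, []}}], %% assumed module
    {ok, {SupFlags, ChildSpecs}}.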
[Plots: throughput and CPU load with increasing access rate over time]
Distributed Erlang
− currently has weak security and limited scalability
⇒ we implemented our own transport layer on top of TCP
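A minimal sketch of such a TCP-based transport in Erlang, serializing terms with term_to_binary/1 and using a 4-byte length prefix; this only illustrates the idea and is not the project's actual transport layer.

%% Sketch: send and receive Erlang terms over a plain gen_tcp socket.
-module(tcp_transport_sketch).
-export([connect/2, send_term/2, recv_term/1]).

connect(Host, Port) ->
    gen_tcp:connect(Host, Port, [binary, {packet, 4}, {active, false}]).

send_term(Socket, Term) ->
    gen_tcp:send(Socket, term_to_binary(Term)).

recv_term(Socket) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, Bin}       -> {ok, binary_to_term(Bin)};
        {error, Reason} -> {error, Reason}
    end.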
Numerous applications:
− Internet databases, transactional online services, …
Tradeoff: High availability vs. data consistency
Team
Thorsten Schütt
Florian Schintke
Monika Moser
Stefan Plantikow
Alexander Reinefeld
Nico Kruber
Christian von Prollius
Seif Haridi (SICS)
Ali Ghodsi (SICS)
Tallat Shafaat (SICS)
Publications
Chord#
− T. Schütt, F. Schintke, A. Reinefeld. A Structured Overlay for Multi-dimensional Range Queries. Euro-Par, August 2007.
Transactions
− M. Moser, S. Haridi. Atomic Commitment in Transactional DHTs. 1st CoreGRID Symposium, August 2007.
Wiki
− S. Plantikow, A. Reinefeld, F. Schintke. Transactions for Distributed Wikis on Structured Overlays. DSOM, October 2007.
Talks / Demos
− IEEE Scale Challenge, May 2008: 1st prize (live demo)