
Building A Transactional Distributed Data Store With Erlang

The e-commerce platforms of Amazon, eBay, or Google serve millions of customers using tens of thousands of servers located in data centers throughout the world. At this scale, components fail continuously and it is difficult to maintain a consistent state while hiding failures from the application. Peer-to-peer protocols were invented to provide availability by replicating services among peers, but current systems are tuned for sharing read-only data. To extend them beyond typical file sharing, support for transactions on distributed hash tables (DHTs) is an important but still missing feature. In this talk, given at Erlang eXchange 2008, Alexander presents a key/value store based on DHTs that supports consistent writes. He explains how the system, built by Zuse Institute Berlin and onScale solutions GmbH, comprises three layers, all implemented in Erlang: a DHT layer for scalable, reliable access to replicated distributed data; a transaction layer that ensures data consistency in the face of concurrent write operations; and an application layer with a demanding access rate of several thousand reads/writes per second. For the application layer, Zuse Institute Berlin and onScale solutions GmbH chose a distributed, scalable Wiki with full transaction support. Alexander shows that this Wiki outperforms the public Wikipedia in terms of served page requests per second, and he discusses how the development of the distributed code benefited from the use of Erlang rather than C++ or Java.


Building a transactional distributed data store with Erlang

Alexander Reinefeld, Florian Schintke, Thorsten Schütt

Zuse Institute Berlin, onScale solutions GmbH
Transactional data store - What for?
 Web 2.0 services: shopping, banking, gaming, …
− don’t need full SQL semantics; a key/value DB often suffices
− e.g. Michael Stonebraker: “One size does not fit all”

 Scalability matters
− >10^4 accesses per sec.
− many concurrent writes
Traditional Web 2.0 hosting

[figure: many clients accessing a traditional hosting setup]

Now think big. Really BIG.

Not how fast our code is today, but:
− Can it “scale out”?
− Can it run in parallel? … distributed?
− Any common resources causing locking?

Asymptotic performance matters!
Our Approach: P2P makes it scalable
 “arbitrary” number of clients
 Web 2.0 services with P2P nodes in data centers
Our Approach

Layer stack (top to bottom):
 Application Layer
 Key/Value Store (= simple database): strong data consistency, crash-recovery model
 Transaction Layer: implements ACID
 Replication Layer: improves availability at the cost of consistency
 P2P Layer: implements scalability and eventual consistency, crash-stop model
 unreliable, distributed nodes


providing a scalable distributed data store:

P2P LAYER
Key/Value Store
 for storing “items” (= “key/value pairs”)
− synonyms: “key/value store”, “dictionary”, “map”, …

 just 3 ops
− insert(key, value)
− delete(key)
− lookup(key)

Example: Turing Award winners
  Key      Value
  Backus   1977
  Hoare    1980
  Karp     1985
  Knuth    1974
  Wirth    1984
  ...      ...
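
To make the three-operation interface concrete, here is a minimal local sketch in Erlang, backed by an ets table; this is an illustration only, not the distributed Chord# implementation.

%% Minimal local sketch of the three-operation interface, backed by an ets
%% table (illustration only; not the distributed Chord# implementation).
-module(kv_store).
-export([new/0, insert/3, delete/2, lookup/2]).

new() ->
    ets:new(kv, [set, public]).

insert(Table, Key, Value) ->
    true = ets:insert(Table, {Key, Value}),
    ok.

delete(Table, Key) ->
    true = ets:delete(Table, Key),
    ok.

lookup(Table, Key) ->
    case ets:lookup(Table, Key) of
        [{Key, Value}] -> {ok, Value};
        []             -> {error, not_found}
    end.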
Chord# - Distributed Key/Value Store
 key space: total order on items (strings, numbers, …)
 nodes have a random key as their position in the ring
 items are stored on the successor node (clockwise)

[figure: Turing Award items distributed over the Chord# ring; one node is responsible
for the range (Backus, …, Karp], the next for (Karp, …, Knuth], and each item is stored
on the node whose range contains its key]
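
A small Erlang sketch of the placement rule, under the simplifying assumption that node positions are available as a plain list of keys rather than a real ring:

%% Sketch of the successor placement rule (assumed simplification: node
%% positions are given as a plain list of keys instead of a real ring).
-module(ring_placement).
-export([responsible_node/2]).

%% The clockwise successor of ItemKey stores the item; wrap around to the
%% first node when ItemKey is larger than every node key.
responsible_node(ItemKey, NodeKeys) when NodeKeys =/= [] ->
    Sorted = lists:sort(NodeKeys),
    case [N || N <- Sorted, N >= ItemKey] of
        [Successor | _] -> Successor;
        []              -> hd(Sorted)   % wrap around the ring
    end.

For the example above, ring_placement:responsible_node("Hoare", ["Karp", "Wirth", "Backus", "Knuth"]) returns "Karp", the clockwise successor of Hoare.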
Routing Table and Data Lookup

Building the routing table:
 log2 N pointers
 exponentially spaced pointers

Retrieving items:
 ≤ log2 N hops
 Example: lookup(Hoare), routed from the starting node to the node responsible for (Backus, …, Karp]
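
One greedy routing step can be sketched as follows; the ring wrap-around and the actual pointer maintenance are omitted, and the routing table is assumed to be a plain list of known node keys:

%% Sketch of one greedy routing step (assumed simplification: no wrap-around;
%% RoutingTable is the list of node keys this node has pointers to).
-module(greedy_route).
-export([next_hop/2]).

%% Forward towards the pointer that precedes the target most closely, i.e.
%% the largest known key that is =< Key; with exponentially spaced pointers
%% each step roughly halves the remaining distance, giving <= log2 N hops.
next_hop(Key, RoutingTable) when RoutingTable =/= [] ->
    Sorted = lists:sort(RoutingTable),
    case [N || N <- Sorted, N =< Key] of
        []        -> hd(Sorted);
        Preceding -> lists:last(Preceding)
    end.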
Churn
 Nodes join, leave, or crash at any time

 Need a “failure detector” to check the aliveness of nodes
− the failure detector may be wrong: is the node dead, or is the network just slow?

 Churn may cause inconsistencies
− need a local repair mechanism
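
A minimal sketch of such a ping-based failure detector in Erlang (illustration only; the detector in the real system supervises Chord# nodes and sends crash messages, see the per-node process list later):

%% Minimal sketch of a ping-based failure detector (illustration only).
-module(simple_fd).
-export([alive/2]).

%% Returns true if Node answers within Timeout ms; a 'pang' or a timeout
%% makes the node suspected, which may be wrong (dead node or slow network?).
alive(Node, Timeout) ->
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, net_adm:ping(Node)} end),
    receive
        {Ref, pong} -> true;
        {Ref, pang} -> false
    after Timeout ->
        false
    end.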
Responsibility Consistency
 Violated responsibility consistency, caused by an imperfect failure detector:
both N3 and N4 claim responsibility for item k

[figure: ring with nodes N1–N4 and item k; N3 is wrongly presumed crashed]

Lookup Consistency
 Violated lookup consistency, caused by an imperfect failure detector:
lookup(k) at N1 → N3, but at N2 → N4

[figure: ring with nodes N1–N4 and item k; N2 and N3 are each wrongly presumed crashed by some peer]
How often does this occur?
 Simulated nodes with imperfect failure detectors
(a node probabilistically detects another, still alive node as dead)
SUMMARY P2P LAYER
 Chord# provides a key/value store
− scalable
− efficient: ≤ log2 N hops

 Quality of the failure detector is crucial

 Need replication to prevent data loss …


improving availability

REPLICATION LAYER
Replication
 Many schemes
− symmetric replication ✓
− successor-list replication
− …

 Must ensure data consistency
− need quorum-based methods
Quorum-based algorithms
 Enforce consistency by operating on majorities

[figure: five replicas r1 … r5; any three of them form a majority]

 Comes at the cost of increased latency
− but latency can be avoided by clever replica distribution in data centers (cloud computing)
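
A sketch of a majority read in Erlang; the message shapes are assumptions made for this illustration (each replica process answers a {read, From, Ref} request with {reply, Ref, {Version, Value}}):

%% Sketch of a quorum read: ask all replicas, wait for a majority of answers,
%% and keep the value with the highest version.
-module(quorum_read).
-export([read/1]).

read(ReplicaPids) ->
    Majority = length(ReplicaPids) div 2 + 1,
    Ref = make_ref(),
    Self = self(),
    [Pid ! {read, Self, Ref} || Pid <- ReplicaPids],
    collect(Ref, Majority, []).

collect(_Ref, 0, Replies) ->
    {_Version, Value} = lists:max(Replies),   % newest version wins
    {ok, Value};
collect(Ref, Missing, Replies) ->
    receive
        {reply, Ref, VersionedValue} ->
            collect(Ref, Missing - 1, [VersionedValue | Replies])
    after 2000 ->
        {error, no_majority}
    end.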
SUMMARY REPLICATION LAYER

 availability in the face of churn

 quorum-based algorithms

 But need transactional data access …


coping with concurrency:
TRANSACTION LAYER
Transaction Layer
 Transactions on P2P are challenging because of …
− churn: changing node responsibilities
− the crash-stop fault model, as opposed to crash-recovery in traditional DBMSs
− the imperfect failure detector: we don’t know whether a node crashed or the network is just slow
Strong Data Consistency
 What is it?
− When a write is finished, all following reads return the new value.

 How to implement?
− Always read/write a majority, f/2 + 1, of the f replicas.
 The latest version is then always in the read or write set.
− Must ensure that the replication degree is ≤ f


Atomicity
 What is it?
− Make all or no changes!
− Either ‘commit’ or ‘abort’.

 How to implement?
− 2PC? Blocks if the transaction manager fails.
− 3PC? Too much latency.
− We use a variant of the Paxos Commit protocol
 non-blocking: votes of the transaction participants are sent to multiple “acceptors”
Adapted Paxos Commit
 Optimistic concurrency control with fallback

 Write
− 3 rounds
− non-blocking (fallback)

 Read is even faster
− reads a majority of replicas
− just 1 round

 Succeeds when > f/2 nodes are alive

[figure: message flow between the leader, the replicated Transaction Managers (TMs)
and the Transaction Participants (TPs) holding the replicated items; step 1 needs
O(log N) hops, steps 2–6 need O(1) hops, and step 6 follows after a majority of
step-5 answers]
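
As a rough illustration of the commit rule only (the acceptors, the Paxos rounds and the fallback are omitted here), a transaction commits exactly when every participant’s learned vote is ‘prepared’:

%% Hedged sketch of just the commit decision; Votes is assumed to be the list
%% of {Participant, prepared | abort} pairs learned through the acceptors.
%% The real protocol additionally runs a Paxos instance per participant,
%% which is what makes it non-blocking when a transaction manager fails.
-module(commit_decision).
-export([decide/2]).

decide(Votes, NumParticipants) ->
    Prepared = lists:usort([P || {P, prepared} <- Votes]),
    case length(Prepared) =:= NumParticipants of
        true  -> commit;   % every participant voted prepared
        false -> abort     % at least one participant voted abort (or is missing)
    end.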
Transactions have two purposes:
consistency of replicas & consistency across items

User request:
BOT
− debit(a, 100);
− deposit(b, 100);
EOT

Operations on replicas:
BOT
− debit(a1, 100); debit(a2, 100); debit(a3, 100);
− deposit(b1, 100); deposit(b2, 100); deposit(b3, 100);
EOT
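
From the client’s point of view, such a transfer might be written as below; the tx module and its function names are hypothetical, not the system’s actual API, and the fan-out to the replicas a1…a3 and b1…b3 happens inside the transaction layer:

%% Hypothetical client-side view of the transfer above. The 'tx' module and
%% its functions are illustrative names only, not the system's actual API;
%% the transaction layer expands each read/write to all replicas and commits
%% them atomically.
-module(transfer_example).
-export([transfer/3]).

transfer(A, B, Amount) ->
    T0 = tx:begin_tx(),
    {ok, BalA, T1} = tx:read(T0, A),
    {ok, BalB, T2} = tx:read(T1, B),
    T3 = tx:write(T2, A, BalA - Amount),   % debit(a, Amount)
    T4 = tx:write(T3, B, BalB + Amount),   % deposit(b, Amount)
    tx:commit(T4).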
SUMMARY TRANSACTION LAYER

 Consistent update of items and replicas

 Mitigates some of the overlay oddities


− node failures
− asynchronicity
demonstrator application:
WIKIPEDIA
Wikipedia
 Top 10 Web sites:
1. Yahoo!
2. Google
3. YouTube
4. Windows Live
5. MSN
6. Myspace
7. Wikipedia
8. Facebook
9. Blogger.com
10. Yahoo!カテゴリ

 50,000 requests/sec
− 95% are answered by squid proxies
− only 2,000 req./sec hit the backend
Public Wikipedia

[figure: architecture of the public Wikipedia: web servers, search servers, NFS storage and other servers]
Our Wikipedia
 Renderer (Java)
− Tomcat, Plog4u
 Jinterface
− interface from Java to Erlang
 Key/Value Store (Erlang)
− Chord# + Replication + Transactions
Mapping Wikipedia to Key/Value Store
 Mapping

                  key              value
   page content   title            list of Wikitext for all versions
   backlinks      title            list of titles
   categories     category name    list of titles

 For each insert or modify we must, in one write transaction,
− update backlinks
− update category page(s)
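
One possible way to encode this mapping as keys is sketched below; the actual key layout of the demonstrator is not spelled out in the slides, so these names are assumptions:

%% Hedged sketch of one possible key layout for the mapping above; the exact
%% keys used by the Wikipedia demonstrator are not given in the slides.
-module(wiki_keys).
-export([page_key/1, backlinks_key/1, category_key/1]).

page_key(Title)       -> {page, Title}.        % value: list of Wikitext for all versions
backlinks_key(Title)  -> {backlinks, Title}.   % value: list of titles linking here
category_key(CatName) -> {category, CatName}.  % value: list of titles in the category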
Erlang Processes
− Chord#
− load balancing
− transaction framework
− supervision (OTP)
Erlang Processes (per node)
 Failure Detector supervises Chord# nodes and sends crash messages when a failure is detected.

 Configuration provides access to the configuration file and maintains parameter changes made at runtime.

 Key Holder stores the identifier of the node in the overlay.

 Statistics Collector collects statistics information and forwards it to the statistics servers.

 Chord# Node performs the main functionality of the node, e.g. maintaining the successor list and routing table.

 Database stores the key/value pairs of each node.
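
These per-node processes fit naturally under an OTP supervisor; the sketch below shows one possible arrangement, with illustrative child and module names rather than the ones used in the actual implementation:

%% Hedged sketch of grouping the per-node processes under one OTP supervisor;
%% the child and module names are illustrative only.
-module(node_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Children =
        [{failure_detector, {failure_detector, start_link, []},
          permanent, 5000, worker, [failure_detector]},
         {config,           {config, start_link, []},
          permanent, 5000, worker, [config]},
         {key_holder,       {key_holder, start_link, []},
          permanent, 5000, worker, [key_holder]},
         {stats_collector,  {stats_collector, start_link, []},
          permanent, 5000, worker, [stats_collector]},
         {chord_node,       {chord_node, start_link, []},
          permanent, 5000, worker, [chord_node]},
         {database,         {database, start_link, []},
          permanent, 5000, worker, [database]}],
    {ok, {{one_for_one, 10, 60}, Children}}.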


Accessing Erlang Transactions from Java via Jinterface
void updatePage(String title, int oldVersion, String newText) {
    Transaction t = new Transaction();        // new transaction
    Page p = t.read(title);                   // read old version
    if (p.currentVersion != oldVersion) {     // concurrent update?
        t.abort();
    } else {
        t.write(p.add(newText));              // write new text
        // update categories
        for (Category c : p)
            t.write(t.read(c.name).add(title));
        t.commit();
    }
}
Performance on Linux Cluster
 test results with a load generator

[figures: throughput and CPU load with increasing access rate over time]

 1,500 transactions/sec on 10 CPUs
 2,500 transactions/sec on 16 CPUs (64 cores) and 128 DHT nodes
Implementation
 11,000 lines of Erlang code
− 2,700 for transactions
− 1,300 for Wikipedia
− 7,000 for Chord# and infrastructure

 Distributed Erlang
− currently has weak security and limited scalability
⇒ we implemented our own transport layer on top of TCP

 Java for rendering and user interface


SUMMARY
Summary
 P2P as a new paradigm for Web 2.0 hosting
− we support consistent, distributed write operations

 Numerous applications:
− Internet databases, transactional online services, …

 Tradeoff: high availability vs. data consistency
Team
 Thorsten Schütt
 Florian Schintke
 Monika Moser
 Stefan Plantikow
 Alexander Reinefeld
 Nico Kruber
 Christian von Prollius
 Seif Haridi (SICS)
 Ali Ghodsi (SICS)
 Tallat Shafaat (SICS)
Publications

Chord#
 T. Schütt, F. Schintke, A. Reinefeld. A Structured Overlay for Multi-dimensional Range Queries. Euro-Par, August 2007.
 T. Schütt, F. Schintke, A. Reinefeld. Structured Overlay without Consistent Hashing: Empirical Results. GP2PC, May 2006.

Transactions
 M. Moser, S. Haridi. Atomic Commitment in Transactional DHTs. 1st CoreGRID Symposium, August 2007.
 T. Shafaat, M. Moser, A. Ghodsi, S. Haridi, T. Schütt, A. Reinefeld. Key-Based Consistency and Availability in Structured Overlay Networks. Infoscale, June 2008.

Wiki
 S. Plantikow, A. Reinefeld, F. Schintke. Transactions for Distributed Wikis on Structured Overlays. DSOM, October 2007.

Talks / Demos
 IEEE SCALE Challenge, May 2008: 1st prize (live demo)
