BIG DATA 101, FOUNDATIONAL
KNOWLEDGE FOR A NEW PROJECT
IN 2017
@doanduyhai
Technical Advocate @ Datastax
Apache Zeppelin™ Committer
@doanduyhai1
Who Am I ?
Duy Hai DOAN
Technical Advocate @ Datastax
•  talks, meetups, confs
•  open-source devs (Achilles, Zeppelin,…)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
Apache Zeppelin™ committer
@doanduyhai2
Agenda
1) Distributed systems theories & properties
2) Data sharding & replication
3) CAP theorem
4) Distributed systems architecture: master/slave vs masterless
@doanduyhai3
Distributed systems theories
@doanduyhai4
Time
Ordering
Latency
Failure
Consensus
Time
There is no absolute time in theory (even with atomic clocks!)
Time-drift is unavoidable
•  unless you provide an atomic clock to each server
•  unless you’re Google
NTP is your friend ☞ configure it properly !
@doanduyhai5
Ordering of operations
How to order operations ?
What do "before" and "after" mean ?
•  when clocks are not 100% reliable
•  when operations occur on multiple machines …
•  … that live on multiple continents (1000s of km apart)
@doanduyhai6
Ordering of operations
Local/relative ordering is possible
Global ordering ?
•  either execute all operations on a single machine (☞ master)
•  or ensure time is perfectly synchronized on all machines executing the
operations (really feasible ?)
@doanduyhai7
Known algorithms
Lamport clock
•  algorithm for message sender
•  algorithm for message receiver
•  partial ordering between a pair of (sender, receiver) is possible
@doanduyhai8
// Sender side: increment the local clock, stamp the message
time = time + 1;
time_stamp = time;
send(message, time_stamp);

// Receiver side: take the max of both clocks, then increment
(message, time_stamp) = receive();
time = max(time_stamp, time) + 1;
Known algorithms
@doanduyhai9
Vector clock
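The slide shows vector clocks as a diagram; below is a minimal Python sketch of the idea (node names and method names are illustrative, not from the talk). Each node keeps one counter per node, increments its own entry on local events, and merges entry-wise maxima on receive; two events are concurrent when neither vector dominates the other.

# Minimal vector clock sketch (illustrative)
class VectorClock:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # Local event or send: increment our own entry
        self.clock[self.node_id] += 1
        return dict(self.clock)

    def merge(self, received):
        # On receive: entry-wise max of both clocks, then tick
        for n, t in received.items():
            self.clock[n] = max(self.clock[n], t)
        self.clock[self.node_id] += 1

def happened_before(a, b):
    # a -> b iff a <= b entry-wise and a != b (assumes same node set);
    # if neither a -> b nor b -> a, the events are concurrent
    return all(a[n] <= b[n] for n in a) and a != b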
Latency
Def: time interval between request & response.
Latency is composed of
•  network delay: router/switch delay + physical medium delay
•  OS delay (negligible)
•  time for the target to process the query (disk access, computation …)
@doanduyhai10
Latency
Speed of light physics
•  ≈ 300 000 km/s in a vacuum
•  ≈ 197 000 km/s in fiber optic cable (due to the refractive index)
London – New York bird-flight distance ≈ 5500 km → ≈ 28 ms for a
one-way trip
Conclusion: a ping between London and New York cannot take less
than 56 ms
@doanduyhai11
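A quick back-of-the-envelope check of those numbers (a sketch using the figures from the slide):

# One-way latency lower bound, London -> New York over fiber
distance_km = 5500        # great-circle distance from the slide
speed_km_per_s = 197_000  # speed of light in fiber (refractive index ~1.5)
one_way_ms = distance_km / speed_km_per_s * 1000
print(round(one_way_ms))  # ~28 ms one way, hence >= ~56 ms for a ping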
@doanduyhai12
"The mean latency is
below 10ms"
Database vendor X
✔︎ ✘︎
@doanduyhai13
"The mean latency is
below 10ms"
Database vendor X
✔︎ ✘︎
Failure modes
•  Byzantine failure: same input, different outputs → application bug !!!
•  Performance failure: response is correct but arrives too late
•  Omission failure: special case of performance failure, no response at all (timeout)
•  Crash failure: self-explanatory, the server stops responding
Byzantine failure → value issue
Other failures → timing issue
@doanduyhai14
Failure
Root causes
•  Hardware: disk, CPU, …
•  Software: packet loss, process crash, OS crash …
•  Workload-specific: flushing huge file to SAN (🙀)
•  JVM-related: long GC pause
Defining failure is hard
@doanduyhai15
@doanduyhai16
"A server fails when it does
not respond to one or
multiple request(s) in a
timely manner"
Usual meaning of failure
Failure detection
Timely manner ☞ timeout!
Failure detector:
•  heart beat: binary state (up/down), too simplistic
•  exponential backoff with threshold: a better model
•  phi accrual detector: an advanced model using statistics
@doanduyhai17
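As a rough illustration of the phi accrual idea, here is a simplified Python sketch (an exponential approximation, not the exact algorithm from the Hayashibara paper): keep a window of observed heartbeat intervals and turn "how overdue is the next heartbeat" into a continuous suspicion level instead of a binary flag.

import math, time
from collections import deque

class PhiAccrualDetector:
    # Simplified phi accrual failure detector (sketch)
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last_beat = time.time()

    def heartbeat(self):
        now = time.time()
        self.intervals.append(now - self.last_beat)
        self.last_beat = now

    def phi(self):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = time.time() - self.last_beat
        # phi = -log10(P(node still alive)), with P approximated by exp(-elapsed/mean)
        return (elapsed / mean) / math.log(10)

# The caller declares the node suspect once phi() crosses a threshold (e.g. 8),
# rather than flipping a binary up/down state.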
Distributed consensus protocols
Since time is unreliable, global ordering is hard to achieve & failure is
hard to detect ...
... how can different machines agree on a single value ?
Important properties:
•  validity: the agreed value must have been proposed by some process
•  termination: at least one non-faulty process eventually decides
•  agreement: all processes agree on the same value
@doanduyhai18
Distributed consensus protocols
2-phase commit
•  termination KO: the protocol can block if the coordinator fails
3-phase commit
•  agreement KO: in case of network partition, possibility of an inconsistent state
Paxos, Raft & Zab (ZooKeeper)
•  OK: satisfy all 3 requirements
•  QUORUM-based: require a strict majority of copies/replicas to be alive
@doanduyhai19
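For concreteness, a "strict majority" is simply more than half of the replicas:

def quorum(n_replicas):
    # Strict majority: 2 of 3, 3 of 5, ...
    return n_replicas // 2 + 1

assert quorum(3) == 2 and quorum(5) == 3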
Data sharding & replication
@doanduyhai20
Data Sharding
Why shard ?
•  scalability: map logical shards to physical hardware (machines/racks, ...)
•  divide & conquer: each shard represents the DB at a smaller scale
How to shard ?
•  user-defined algorithm: the user chooses the sharding algorithm & the target
columns to which the algorithm applies
•  fixed algorithm: the DB imposes the sharding algorithm; the user only decides
which columns to apply it to. Ex: user_id
@doanduyhai21
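A minimal Python sketch of both flavors (function names and shard counts are made up for illustration):

import hashlib

def md5_shard(key, n_shards):
    # Hash-based sharding: roughly uniform given enough distinct keys
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_shards

def first_letter_shard(email, n_shards):
    # User-defined sharding on the 1st letter: skewed (see the charts below)
    return (ord(email[0].lower()) - ord('a')) % n_shards

print(md5_shard("user_42", 5))                    # some shard in 0..4
print(first_letter_shard("alice@example.com", 6)) # shard 0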
Data Sharding
Example of user-defined sharding
•  user data with sharding key == user_id, sharding algo == MD5 🙂
@doanduyhai22
[Bar chart: MD5 data distribution over 5 shards (0-19, 20-39, 40-59, 60-79, 80-99), data ownership in % = 18 / 24 / 17 / 19 / 22: roughly even]
Data Sharding
Example of user-defined sharding
•  user data with sharding key == email, sharding algo == take 1st letter 😱
@doanduyhai23
[Bar chart: 1st-letter data distribution over 6 shards (a-c, e-h, m-p, q-t, u-x, y-z), data ownership in % = 19 / 32 / 27 / 15 / 5 / 2: heavily skewed]
Data Sharding
Example of a fixed sharding algo: Murmur3
•  user data with sharding key == user_id or whatever key 😎
@doanduyhai24
[Bar chart: Murmur3 data distribution over 5 shards (0-19, 20-39, 40-59, 60-79, 80-99), data ownership in % = 19 / 23 / 18 / 19 / 21: close to even]
@doanduyhai25
"With Murmur3 we are
guaranteed to have
even data distribution"
✔︎ ✘︎
@doanduyhai26
"With Murmur3 we are
guaranteed to have
even data distribution"
✔︎ ✘︎
Dice rolling experiment
@doanduyhai27
Dice rolling experiment
@doanduyhai28
Dice rolling experiment
@doanduyhai29
Dice rolling experiment
@doanduyhai30
It’s all about
statistics !
Data Sharding Trade-off
Logical sharding (with ordering)
•  can lead to hotspots & imbalance in data distribution
•  but allows range queries
•  WHERE sharding_key >= xxx AND sharding_key <= yyy
Hash-based sharding
•  guarantees uniform distribution (given enough distinct shard-key values)
•  range queries not possible, only point queries
•  WHERE sharding_key >= xxx AND sharding_key <= yyy ✘︎
•  WHERE sharding_key == zzz ✔︎
@doanduyhai31
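The trade-off in code form (a sketch, reusing the hypothetical md5_shard function from above): ordered sharding maps a key range onto a few contiguous shards, while hashing scatters adjacent keys, so a range query would have to hit every shard.

# Ordered (logical) sharding: keys 0..99 split into 5 contiguous ranges
def ordered_shard(key, n_shards=5, key_space=100):
    return key * n_shards // key_space

# A range scan over [30, 45] touches only shards 1 and 2 ...
print({ordered_shard(k) for k in range(30, 46)})   # {1, 2}
# ... while the same range under md5_shard scatters across shards
print({md5_shard(k, 5) for k in range(30, 46)})    # typically all of 0..4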
Data Sharding and Rebalancing
For some categories of NoSQL solutions
•  range queries are mandatory → hotspots not avoidable !!!
•  mainly K/V databases, some wide-column databases too
Rebalancing is necessary
•  sometimes an automated process
•  sometimes a manual admin process 😭
•  resource-intensive operation (CPU, disk I/O + network) → impacts live
production traffic
@doanduyhai32
Data Replication
How ? By having multiple copies
Type of replicas
•  symmetric: no roles, each replica is similar to the others
•  asymmetric: "master/slave" style. All operations (read/write) should go through
a single server
Replica definition
•  symmetric: 1 replica == 1 copy. 3 replicas == 3 copies in total
•  asymmetric: 1 replica == 1 slave copy. Total copies = master + replica(s)
@doanduyhai33
Data Replication
@doanduyhai34
Client
Replica1 Replica2 Replica3
Symmetric replicas, write operations
Parallel dispatch
Data Replication
@doanduyhai35
Client
Replica1 Replica2 Replica3
Symmetric replicas, read operations
Data Replication
@doanduyhai36
Master
Replica1 Replica2 Replica3
Client
Asymmetric replicas, write operations
Data Replication
@doanduyhai37
Master
Replica1 Replica2 Replica3
Client
Asymmetric replicas, read operations
Data Replication
@doanduyhai38
Master
Replica1 Replica2 Replica3
Client
Asymmetric replicas, read operations
BOTTLENECK !!!
Data Replication
@doanduyhai39
Master
Replica1 Replica2 Replica3
Client
Asymmetric replicas, read operations from slaves
✘
Data Replication
@doanduyhai40
Master
Replica
Asymmetric replicas, common write failure scenarios
✘ Message lost (network)
→ Master never receives ack
→ KO
Master
Replica
✘
Write dropped (overload)
→ Master never receives ack
→ KO
Master
Replica
✘
Replica crashed right away
→ Master never receives ack
→ KO
Data Replication
@doanduyhai41
Master
Replica
Asymmetric replicas, tricky write failure scenarios
✘
Ack lost (network)
→ Master never receives ack
→ KO !!!!
Master
Replica
✘
Replica crashes AFTER
sending ACK but before
flushing data to disk
→ Master receives ack
→ OK ?
CAP Theorem
@doanduyhai42
Pick 2 out of 3
CAP theorem
@doanduyhai43
Conjecture by Brewer (2000), formalized later in a paper by Gilbert & Lynch (2002):
The CAP theorem states that any networked shared-data system can
have at most two of three desirable properties
•  consistency (C): equivalent to having a single up-to-date copy of the data
•  high availability (A): of that data (for updates)
•  and tolerance to network partitions (P)
CAP triangle
@doanduyhai44
CAP theorem revised (2012)
@doanduyhai45
You cannot choose not to be partition-tolerant
The choice is not that binary:
•  in the absence of partitions, you can tend toward CA
•  when a partition occurs, choose your side (C or A)
☞ tunable consistency
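Tunable consistency boils down to one inequality: with replication factor RF, a read at consistency level R and a write at level W overlap on at least one replica whenever R + W > RF (a standard rule of thumb, sketched below):

def sees_latest_write(rf, read_cl, write_cl):
    # The read and write replica sets must intersect
    return read_cl + write_cl > rf

assert sees_latest_write(3, 2, 2)        # QUORUM reads + QUORUM writes on RF=3
assert not sees_latest_write(3, 1, 1)    # ONE + ONE may miss the latest write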
What is Consistency ?
@doanduyhai46
Meaning is different from the C of ACID
Read Uncommitted
Read Committed
Cursor Stability
Repeatable Read
Eventual Consistency
Read Your Writes
Pipelined RAM
Causal
Snapshot Isolation
Linearizability
Serializability
Without coordination
Requires coordination
Consistency with some CP (supposedly) system
@doanduyhai47
Some DB
Consistency with some AP system
@doanduyhai48
Cassandra tunable consistency
Read Uncommitted
Read Committed
Cursor Stability
Repeatable Read
Eventual Consistency
Read Your Writes
Pipelined RAM
Causal
Snapshot Isolation
Linearizability
Serializability
Without coordination
Requires coordination
Consistency Level
ONE
Consistency with some AP system
@doanduyhai49
Cassandra tunable consistency
Read Uncommitted
Read Committed
Cursor Stability
Repeatable Read
Eventual Consistency
Read Your Writes
Pipelined RAM
Causal
Snapshot Isolation
Linearizability
Serializability
Without coordination
Requires coordination
Consistency Level
QUORUM
Consistency with some AP system
@doanduyhai50
Cassandra tunable consistency
Read Uncommitted
Read Committed
Cursor Stability
Repeatable Read
Eventual Consistency
Read Your Writes
Pipelined RAM
Causal
Snapshot Isolation
Linearizability
Serializability
Without coordination
Requires coordination
LightWeight
Transaction
Single partition writes
are linearizable
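For reference, this is roughly how the consistency level is chosen per query with the DataStax Python driver (the keyspace, table and contact point are made up; treat it as a sketch and check your driver version's docs):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')   # hypothetical keyspace
stmt = SimpleStatement(
    "SELECT age FROM users WHERE id = 1",
    consistency_level=ConsistencyLevel.QUORUM)     # tuned per statement
row = session.execute(stmt).one()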
What is availability ?
@doanduyhai51
Ability to:
•  Read in the case of failure ?
•  Write in the case of failure ?
Brewer's definition: high availability of the data (for updates)
Real world example
@doanduyhai52
Cassandra claims to be highly available, is it true ?
Some marketing slides even claim continuous availability (100%
uptime), is it true ?
Network partition scenario with Cassandra
@doanduyhai53
[Ring diagram: Cassandra cluster under partition]
Read/Write at Consistency level ONE ✔︎
Network partition scenario with Cassandra
@doanduyhai54
[Ring diagram: Cassandra cluster under partition]
Read/Write at Consistency level ONE ✘︎
So how can it be highly available ???
@doanduyhai55
[Ring diagrams: two Cassandra clusters, US DataCenter and EU DataCenter, with the inter-DC link cut ✘]
Read/Write at Consistency level ONE
Datacenter-aware load balancing strategy at driver level
Architecture
@doanduyhai56
Master/Slave vs Masterless
Pure master/slave architecture
@doanduyhai57
Single server for all writes; reads can be done on the master or any slave
Advantages
•  operations can be serialized
•  easy to reason about
•  pre-aggregation is possible
Drawbacks
•  cannot scale writes (reads can be scaled)
•  single point of failure (SPOF)
Master/slave SPOF
@doanduyhai58
[Diagram: write request → MASTER, replicating to SLAVE1, SLAVE2, SLAVE3]
Multi-master/slave layout
@doanduyhai59
[Diagram: a proxy layer routes write requests to MASTER1 (SLAVE11-13, Shard1), MASTER2 (SLAVE21-23, Shard2), …]
@doanduyhai60
"Failure of a shard-master is
not a problem because it
takes less than 10ms to elect
a slave into a master"
Wrong Objection Rhetoric
The wrong objection rhetoric
@doanduyhai61
How long does it take to detect that a shard-master has failed ?
•  heart-beat is not used because it is too simplistic
•  so usually after a timeout, after several successive retries
Timeouts are usually tens of seconds
•  you cannot write during this time period
Multi-master/slave architecture
@doanduyhai62
Distribute data between shards. One master per shard
Advantages
•  operations can still be serialized in a single shard
•  easy to reason about in a single shard
•  no more big SPOF
Drawbacks
•  consistent only within a single shard (unless a global lock is used)
•  multiple small points of failure (SPOF inside a shard)
•  global pre-aggregation is no longer possible
Fake masterless/shared-nothing architecture
@doanduyhai63
In reality, a multi-master architecture …
… but branded as shared-nothing/masterless architecture
@doanduyhai64
Censored
@doanduyhai65
Censored
It was in Dec. 2016
As of May 2017
Official doc
@doanduyhai66
Censored
As of May 2017
Technical overview doc
@doanduyhai67
As of May 2017
Technical overview doc
@doanduyhai68
Remember this ?
@doanduyhai69
Beware of
marketing!
Shared-nothing architecture
Masterless architecture
Primary-shard == hidden master
Masterless architecture
@doanduyhai70
No master, every node has an equal role
☞ how to manage consistency then, if there is no master ?
☞ which replica has the right value for my data ?
Some data structures to the rescue:
•  vector clocks
•  CRDTs (Conflict-free Replicated Data Types)
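As a taste of how a CRDT converges without a master, here is a grow-only counter (G-Counter), one of the simplest CRDTs, as an illustrative Python sketch: each node increments only its own slot, and merge takes entry-wise maxima, which is commutative, associative and idempotent, so replicas converge regardless of message order.

class GCounter:
    # Grow-only counter CRDT (sketch)
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.counts = {n: 0 for n in nodes}

    def increment(self):
        self.counts[self.node_id] += 1

    def merge(self, other):
        # Entry-wise max: safe to apply in any order, any number of times
        for n in self.counts:
            self.counts[n] = max(self.counts[n], other.counts[n])

    def value(self):
        return sum(self.counts.values())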
Masterless architecture
@doanduyhai71
[Ring diagram: the client sends its request to one node, the coordinator, which forwards it to the 3 replicas]
Notion of coordinator
•  just a network proxy !
•  what if the coordinator dies ???
Masterless architecture
@doanduyhai72
[Ring diagram: the client retries through another node, which becomes the new coordinator for the same 3 replicas]
Anyone can be coordinator !!!
CRDT
@doanduyhai73
Riak
•  Registers
•  Counters
•  Sets
•  Maps
•  …
Cassandra only proposes the LWW-register (Last-Write-Wins)
•  based on the write timestamp
Timestamp, again …
@doanduyhai74
But didn't we say that timestamps are not really reliable ?
Why not implement pure CRDTs ?
Why choose the LWW-register ?
•  because last-write-wins is still the most "intuitive" semantics
•  because conflict resolution with other CRDTs is the user's responsibility
•  because one should not be required to have a PhD in CS to use Cassandra
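A minimal LWW-register sketch (illustrative Python, not Cassandra internals): every write carries a timestamp and the highest timestamp wins on merge, which is exactly why the clock caveats above matter, as the next slides show.

class LWWRegister:
    # Last-Write-Wins register (sketch)
    def __init__(self):
        self.value, self.ts = None, float("-inf")

    def write(self, value, timestamp):
        if timestamp > self.ts:      # higher timestamp wins
            self.value, self.ts = value, timestamp

    def merge(self, other):
        self.write(other.value, other.ts)

r = LWWRegister()
r.write("age=32", 10.050)
r.write("age=33", 10.020)            # issued later, but stamped earlier: lost
print(r.value)                       # age=32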
Example of write conflict with Cassandra
@doanduyhai75
UPDATE users SET age=32 WHERE id=1
[Ring diagram: the coordinator stamps the write with its local time 10:00:01.050; the 3 replicas store age=32 @ 10:00:01.050]
Example of write conflict with Cassandra
@doanduyhai76
UPDATE users SET age=33 WHERE id=1
[Ring diagram: a different coordinator stamps the write with its local time 10:00:01.020; the 3 replicas now hold both age=32 @ 10:00:01.050 and age=33 @ 10:00:01.020]
Example of write conflict with Cassandra
@doanduyhai77
UPDATE users SET age=33 WHERE id=1
[Same ring diagram: with last-write-wins, age=32 @ 10:00:01.050 beats age=33 @ 10:00:01.020, so the later update is silently lost]
Example of write conflict
@doanduyhai78
How can we cope with this ?
•  it is functionally rare for different clients to update the same column at
almost the same time (a few milliseconds apart)
•  can also force the timestamp at client-side (but now you need to synchronize the clients …)
•  can always use a LightWeight Transaction to guarantee linearizability
UPDATE user SET age = 33 WHERE id = 1 IF age = 32
Masterless architecture
@doanduyhai79
Advantages
•  no SPOF
•  no failover procedure
•  can achieve 0 downtime with correct tuning
Drawbacks
•  hard to reason about
•  requires some knowledge of distributed systems
•  pre-aggregation not possible
@doanduyhai80
Thank You !
