Kafka is a distributed messaging system that allows publishing and subscribing to streams of records organized into topics. Producers write data to topics and consumers read from them. Data is partitioned and replicated across a cluster of servers called brokers for reliability and scalability. A common data format such as Avro can be used to serialize the data.
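As a minimal illustration of this publish/subscribe flow, here is a sketch using the third-party kafka-python client; the broker address and the topic name "page-views" are assumptions for illustration:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producers write records to a topic; a record's key (when set) determines
# which partition it lands on.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user-42", value=b'{"url": "/home"}')
producer.flush()  # block until the broker has acknowledged the record

# Consumers subscribe to the same topic and read records independently.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning if no offset is stored
    consumer_timeout_ms=5000,      # stop iterating once no records arrive for 5s
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```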
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was originally developed at LinkedIn, and open-sourced in 2011, to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
This document provides an overview of Apache Kafka. It begins by defining Kafka as a distributed streaming platform and messaging system, then lists the agenda: what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. The core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments (Kai Wähner)
Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. This session gives an overview of several scenarios that may require multi-cluster solutions and discusses real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka.
Key takeaways:
In many scenarios, one Kafka cluster is not enough. Understand different architectures and alternatives for multi-cluster deployments.
Zero data loss and high availability are two key requirements. Understand how to realize them, including the trade-offs.
Learn about the features and limitations of Kafka for multi-cluster deployments.
Global Kafka and mission-critical multi-cluster deployments with zero data loss and high availability have become the norm, not the exception.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Apache Kafka is a distributed messaging system that allows publishing and subscribing to streams of records, known as topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming apps. Producers write data to topics, which are committed to disk across partitions and replicated for fault tolerance. Consumers read data from topics in a decoupled manner based on offsets. Kafka can process streaming data in real time and at large volumes with low latency and high throughput.
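To make the offset-based, decoupled consumption concrete, here is a minimal sketch with the kafka-python client; the topic, consumer group name, and broker address are illustrative assumptions:

```python
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    print("processing", payload)  # stand-in for real application logic

consumer = KafkaConsumer(
    "orders",
    group_id="billing-service",      # consumers sharing a group split the partitions
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,        # commit offsets explicitly, after processing
)
for record in consumer:
    process(record.value)
    consumer.commit()                # persist the offset so a restart resumes here
```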
We've added the presentation used by John Walter, Solution Architect for Red Hat's Training and Certification team, from our Accelerating with Ansible webinar. He discussed the emergence of radically simple Ansible automation and answered questions from attendees. Learn how Ansible automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs. Also learn how Ansible is designed for multi-tier deployments from day one and how Ansible models your IT infrastructure by describing how all your systems inter-relate, rather than just managing one system at a time.
Apache Kafka Fundamentals for Architects, Admins and Developers (confluent)
This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya... (HostedbyConfluent)
Many organizations use Apache Kafka® to build data pipelines that span multiple geographically distributed data centers, for use cases ranging from high availability and disaster recovery, to data aggregation and regulatory compliance.
The journey from single-cluster deployments to multi-cluster deployments can be daunting, as you need to deal with networking configurations, security models and operational challenges. Geo-replication support for Kafka has come a long way, with both open-source and commercial solutions that support various replication topologies and disaster recovery strategies.
So, grab your towel, and join us on this journey as we look at tools, practices, and patterns that can help us build reliable, scalable, secure, global (if not inter-galactic) data pipelines that meet your business needs, and might even save the world from certain destruction.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics, which are replicated across the cluster for fault tolerance. Consumers can then read the data from each topic partition in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
This document provides an overview and introduction to Apache Flink, a stream-based big data processing engine. It discusses the evolution of big data frameworks to platforms and the shortcomings of Spark's RDD abstraction for streaming workloads. The document then introduces Flink, covering its history, key differences from Spark like its use of streaming as the core abstraction, and examples of using Flink for batch and stream processing.
This document provides an agenda and introduction for a presentation on Apache Cassandra and DataStax Enterprise. The presentation covers an introduction to Cassandra and NoSQL, the CAP theorem, Apache Cassandra features and architecture including replication, consistency levels and failure handling. It also discusses the Cassandra Query Language, data modeling for time series data, and new features in DataStax Enterprise like Spark integration and secondary indexes on collections. The presentation concludes with recommendations for getting started with Cassandra in production environments.
Databus: LinkedIn's Change Data Capture Pipeline, SOCC 2012 (Shirshanka Das)
LinkedIn built a change data capture pipeline called Databus to extract data changes from databases and publish them to downstream applications in a consistent and timely manner. Databus uses a pull model with logical clocks to simplify distributing changes across a network of relays and consumers. Key aspects of Databus include isolating sources from consumers, managing metadata and schemas, and partitioning streams of data changes across consumer groups.
This document provides an overview of Ceph, a distributed storage system. It defines software defined storage and discusses different storage options like file, object, and block storage. It then introduces Ceph, highlighting that it provides a unified storage platform for block, object, and file storage using commodity hardware in a massively scalable, self-managing, and fault-tolerant manner. The document outlines Ceph's architecture including its components like OSDs, monitors, RADOS, RBD, RGW, and CephFS. It also discusses various access methods and use cases for Ceph like with OpenStack.
The document discusses IBM Spectrum Scale, a software-defined storage solution from IBM. It provides:
1) A family of software-defined storage products including IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Archive, IBM Spectrum Virtualize, IBM Spectrum Accelerate, and IBM Spectrum Scale.
2) IBM Spectrum Scale allows storing data everywhere and running applications anywhere. It provides highly scalable, high-performance storage for files, objects, and analytics workloads.
3) The document provides an overview of the IBM Spectrum Scale product and its capabilities for optimizing storage costs, improving data protection, enabling global collaboration, and ensuring data availability, integrity and security.
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records (ScyllaDB)
In this talk, we will discuss Happn's war story about migrating a Cassandra 2.1 cluster containing more than 68 Billion records in a counter table to ScyllaDB Open Source.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job?
This session explores the DOs and DON'Ts. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.
Whether you are considering open source Apache Kafka, a cloud service like Confluent Cloud, or another technology using the Kafka protocol like Redpanda or Pulsar, check out this slide deck.
A detailed article about this topic:
https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/
In the following slides we explore Kafka and event-driven architecture: what the Kafka platform is, how it works, and its core APIs (the Consumer API, Producer API, and Streams API). We also look at some core Kafka configuration to review before deploying to production and discuss a few best practices for building a reliable data delivery system with Kafka.
Check out our repository: https://github.com/arconsis/Eshop-EDA
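As a hedged illustration of the reliable-delivery producer settings this kind of deck typically covers, here is a kafka-python sketch; the broker address and topic are assumptions, and the slides themselves may use a different client:

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",    # wait for all in-sync replicas before acknowledging a write
    retries=5,     # retry transient broker errors instead of dropping records
    linger_ms=10,  # small batching window: trades a little latency for throughput
)
producer.send("payments", b"event-payload")
producer.flush()
```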
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
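As a rough companion to the topics/partitions/replication bullets above, here is a hedged sketch of creating a replicated, partitioned topic with kafka-python's admin client; the cluster address, topic name, and sizing are assumptions:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# Three partitions spread load across brokers; replication_factor=3 keeps a
# copy of each partition on three brokers, one of which serves as leader.
admin.create_topics([
    NewTopic(name="page-views", num_partitions=3, replication_factor=3)
])
```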
Ansible is a configuration management tool that allows users to define how systems should be configured and deployed. It uses SSH to connect to nodes and execute modules to bring systems into the desired state. Modules perform specific tasks like copying files or starting services. Ansible playbooks define common tasks and roles to apply configurations to groups of systems. Variables can customize configurations for different hosts or groups.
1) Netflix uses Apache Cassandra as its main data store and has hundreds of Cassandra clusters across multiple regions containing terabytes of customer data for services like viewing history and payments.
2) Maintaining and monitoring Cassandra at Netflix's scale presents challenges around configuration, availability across regions and availability zones, and operating Cassandra in public clouds.
3) Netflix addresses these challenges through tools like Priam for automated bootstrapping and backup/restore, monitoring through services like Mantis and Atlas, and capacity planning with tools like NDBench and Unomia.
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Developing and deploying applications with Spring Boot and Docker (@oakjug) (Chris Richardson)
This presentation was given at Oakjug.
Describes why Spring Boot is an excellent choice for building microservices.
Talks about the various ways that Docker can simplify development and deployment.
Discusses how docker-compose makes the life of a developer easier.
The document is a multimedia storyboard for a website about understanding disabilities and technology. It provides details on 3 screens/pages of the site:
1. The home page which welcomes users and provides an introduction.
2. A page on cerebral palsy which defines it, describes types of CP, and discusses how it can affect learning.
3. A page on visual impairments which defines different types, discusses conditions like macular degeneration, and outlines tools that can help those with visual disabilities.
Henry Robinson gave a presentation on upcoming features for ZooKeeper. He discussed observers, which allow non-voting servers to scale client connections without impacting performance. He also covered dynamic ensembles, which would allow changing the ZooKeeper cluster membership without downtime. Finally, he announced that Cloudera's Distribution for Hadoop will include ZooKeeper packages to integrate it more fully.
Concurrency is hard. Consistency in distributed systems is hard. And then the whole thing should be highly-available and error resilient.
Fear not, there is good news: there exists an awesome tool called ZooKeeper to help you with this. There is even a plethora of Python libraries for it, but how do you know what to use and how?
This talk will walk you through ZooKeeper and how to use it with Python. We’ll be focusing on what I think is the most prominent ZooKeeper library out there for Python: Kazoo.
You’ll see how to do things in ZooKeeper and how to implement them using Kazoo. We’ll also peek into the recipes Kazoo offers and, if we have enough time, touch on a real-life application we’ve used Kazoo and ZooKeeper to build at Spotify.
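A minimal Kazoo sketch of the kinds of ZooKeeper operations such a talk covers; the connection string and znode paths are assumptions for illustration:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/app/config")              # create parent znodes if missing
zk.create("/app/workers/worker-1", b"idle",
          ephemeral=True, makepath=True)   # znode disappears if the session dies

data, stat = zk.get("/app/config")
print(f"version={stat.version} data={data!r}")

@zk.ChildrenWatch("/app/workers")
def on_workers_change(children):
    # invoked on registration and whenever the set of workers changes
    print("live workers:", children)
```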
Apache ZooKeeper is a distributed coordination service that allows distributed applications to synchronize data and configuration information and maintain consistency. It provides a hierarchical name space and simple primitives like counters and change notifications. ZooKeeper elects a leader automatically and guarantees consistency by requiring a quorum for all writes. It is useful for applications that need to maintain consistency while remaining as available as possible: it keeps serving as long as a quorum of servers can communicate, favoring consistency during partitions. Spotify uses ZooKeeper to orchestrate distributed tasks across its infrastructure while maintaining a consistent order and remaining available within data centers.
ZooKeeper is a distributed coordination service that allows distributed applications to synchronize data and configuration information. It uses a data model of directories and files, called znodes, that can contain small amounts of structured data. ZooKeeper maintains data consistency through a leader election process and a quorum-based atomic broadcast protocol called ZAB (similar in spirit to Paxos). It provides applications with synchronization primitives and configuration maintenance in a highly available and reliable way.
ZooKeeper is a highly available, scalable, distributed configuration, consensus, group membership, leader election, naming and coordination service. It provides a hierarchical namespace and basic operations like create, delete, and read data. It is useful for building distributed applications and services like queues. Future releases will focus on monitoring improvements, read-only mode, and failure detection models. The community is working on features like children for ephemeral nodes and viewing session information.
This document discusses ZooKeeper, an open-source server that enables distributed coordination. It provides instructions for installing ZooKeeper, describes ZooKeeper's data tree and API, and exercises for interacting with ZooKeeper including creating znodes, using watches, and setting up an ensemble across multiple servers.
ZooKeeper and Embedded ZooKeeper Support for IBM InfoSphere Streams V4.0 (lisanl)
Yip-Hing Ng works as a senior software engineer on the Streams Platform development team. Yip's presentation describes how to set up and configure ZooKeeper for IBM InfoSphere Streams V4.0.
View related presentations and recordings from the Streams V4.0 Developers Conference at:
https://developer.ibm.com/answers/questions/183353/ibm-infosphere-streams-40-developers-conference-on.html?smartspace=streamsdev
Distributed system coordination by zookeeper and introduction to kazoo python... (Jimmy Lai)
ZooKeeper is a coordination tool that makes it easier to build distributed systems. In these slides, the author summarizes the usage of ZooKeeper and presents the Kazoo Python library as an example.
[Download the slide to get the entire talk in the form of presentation note embedded in the ppt] Apache ZooKeeper is the chosen leader in distributed coordination. In this talk, I have explored the atomic elements of Apache ZooKeeper, how it fits everything together and some of its popular use cases. For ZooKeeper simplicity is the key and as a consumer of the API, our imagination enables us to push the limits of the ZooKeeper world.
ZooKeeper - wait free protocol for coordinating processes (Julia Proskurnia)
ZooKeeper is a service for coordinating processes within distributed systems. A stress test of the tool was performed, and reliable multicast and dynamic Logback configuration management were implemented with ZooKeeper.
More details: http://proskurnia.in.ua/wiki/zookeeper_research
The document discusses using ZooKeeper to modify existing distributed systems. It introduces ZooKeeper and common ZooKeeper operations like starting servers, connecting clients, creating and deleting znodes, watching for changes. It then covers how to implement leader election using ZooKeeper. Finally, it discusses how to integrate ZooKeeper into a scheduling framework in Java by adapting scheduled methods and using a ZooKeeper-aware TaskScheduler. Sample code is provided and improvements like load balancing leaders are suggested.
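The talk's sample code is Java; as a rough Python equivalent of the leader-election step, here is a sketch using Kazoo's Election recipe, with connection string, path, and identifier as assumptions:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def lead():
    # Runs only while this process holds leadership; returning releases it,
    # letting another contender take over.
    print("I am the leader; running the scheduled tasks")

election = zk.Election("/app/election", identifier="scheduler-node-1")
election.run(lead)  # blocks until elected, then invokes lead()
```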
ZooKeeper is a centralized coordination service for distributed applications. It allows distributed processes to synchronize with each other using shared znodes. ZooKeeper uses a leader election algorithm and quorum consensus to ensure high availability. It provides recipes for common distributed coordination problems like queues, locks, and leader election. ZooKeeper's internals use the ZAB protocol for consensus between leader and followers through epochs and proposals.
ZooKeeper allows for dynamic reconfiguration of servers in its ensemble. Manual reconfiguration is problematic as it requires changing the configuration, restarting servers, and can result in data loss. The presented solution allows ZooKeeper to reconfigure itself automatically through a speculative reconfiguration approach. It commits the reconfiguration once quorums of both the old and new ensembles acknowledge it, and gossips the new configuration to ensure all servers sync before activation. This allows reconfigurations to complete without failures in a transparent manner to clients.
Centralized Application Configuration with Spring and Apache Zookeeper (Ryan Gardner)
From talk given at Spring One 2gx Dallas, 2014
Application configuration is an evolution. It starts as hard-coded strings in your application and hopefully progresses to something external, such as a file or system property, that can be changed without a deployment. But what happens when other enterprise concerns enter the mix, such as audit requirements or access control around who can make changes? How do you maintain the consistency of values across too many application servers to manage at one time from a terminal window? The next step in the application configuration evolution is centralized configuration that can be accessed by your applications as they move through your various environments on their way to production. Such a service transfers the ownership of configuration from the last developer who touched the code to a well-versed application owner who is responsible for the configuration of the application across all environments.

At Dealer.com, we have created one such solution that relies on Apache ZooKeeper to handle the storage and coordination of the configuration data and Spring to handle the retrieval, creation and registration of configured objects in each application. The end result is a transparent framework that provides the same configured objects that could have been created using a Spring configuration, configuration file and property value wiring. This talk will cover both the why and how of our solution, with a focus on how we leveraged the powerful attributes of both Apache ZooKeeper and Spring to rid our application of local configuration files and provide a consistent mechanism for application configuration in our enterprise.
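Dealer.com's solution is built on Spring; as a language-neutral sketch of the underlying idea of watching a configuration znode and applying changes live, here is a Kazoo-based Python analogue, with the znode path and JSON payload format as assumptions:

```python
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

current_config = {}

@zk.DataWatch("/config/my-app")
def on_config_change(data, stat):
    # Fires once at registration and again on every update to the znode,
    # so all application servers converge on the same values.
    global current_config
    if data is not None:
        current_config = json.loads(data.decode("utf-8"))
        print(f"config version {stat.version} applied:", current_config)
```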
Discovery Day 2019 Sofia - Big data clusters (Ivan Donev)
This document provides an overview of the architecture and components of SQL Server 2019 Big Data Clusters. It describes the key Kubernetes concepts used in Big Data Clusters like pods, services, and nodes. It then explains the different planes (control, compute, data) and nodes that make up a Big Data Cluster and their roles. Components in each plane like the SQL master instance, compute pools, storage pools, and data pools are also outlined.
Futures and Rx Observables: powerful abstractions for consuming web services... (Chris Richardson)
A modular, polyglot architecture has many advantages but it also adds complexity since each incoming request typically fans out to multiple distributed services. For example, in an online store application the information on a product details page - description, price, recommendations, etc - comes from numerous services. To minimize response time and improve scalability, these services must be invoked concurrently. However, traditional concurrency mechanisms are low-level, painful to use and error-prone.
In this talk you will learn about some powerful yet easy to use abstractions for consuming web services asynchronously. We will compare the various implementations of futures that are available in Java, Scala and JavaScript. You will learn how to use reactive observables, which are asynchronous data streams, to access web services from both Java and JavaScript. We will describe how these mechanisms let you write asynchronous code in a very straightforward, declarative fashion.
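The talk covers Java, Scala, and JavaScript; for a flavor of the same fan-out idea in Python, here is a small asyncio sketch with simulated service calls, where the service names and latencies are invented for illustration:

```python
import asyncio

async def fetch(name: str, delay: float) -> dict:
    await asyncio.sleep(delay)  # stand-in for a real asynchronous HTTP call
    return {"service": name, "ok": True}

async def product_details_page() -> list:
    # Invoke the backend services concurrently rather than sequentially:
    # total latency is roughly the slowest call, not the sum of all calls.
    return await asyncio.gather(
        fetch("description", 0.1),
        fetch("price", 0.2),
        fetch("recommendations", 0.3),
    )

print(asyncio.run(product_details_page()))
```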
Apache Con NA 2013 - Cassandra Internals (aaronmorton)
The document provides an overview of the architecture and internals of Apache Cassandra. It discusses the client-facing API layer including Thrift, CQL, JMX, and CLI. It then covers the Dynamo layer which handles messaging, distributed hash tables, replication strategies, and gossip protocols. Finally, it summarizes the database layer for managing tables, columns, memtables, SSTables, and read/write paths.
Apache Cassandra in Bangalore - Cassandra Internals and Performance (aaronmorton)
Cassandra internals and performance was presented. The key points covered include:
1) Cassandra has a layered architecture with APIs, a Dynamo layer, and a database layer. The Dynamo layer implements the Dynamo paper and handles replication and failure handling.
2) The database layer includes the memtable, SSTables, commit log and more. It handles writes, flushes, compactions and reads from storage.
3) A number of performance tests were shown measuring the impact of configuration parameters like memtable flush queue size, commit log sync period, and secondary indexes on write and read latency. Bloom filters, compactions and concurrency were also discussed.
CoreOS, or How I Learned to Stop Worrying and Love Systemd (Richard Lister)
Ric Lister presents patterns for running Docker in production on CoreOS, including a simple homogeneous operations cluster where sidekick units announce services in etcd and a reverse proxy discovers them, an etcd and workers pattern for low-traffic sites behind a load balancer, and an immutable servers pattern without etcd for high-traffic microservices with strict change control. He also discusses logging to ship container output off hosts, various monitoring options, alternative operating systems like RancherOS and Atomic, and scheduler options like Kubernetes, Mesos, and Deis.
Pharos is a management tool to control and monitor Docker (https://www.docker.com/) containers on a distributed cluster.
Domain Driven Design provides not only the strategic guidelines for decomposing a large system into microservices, but also offers the main tactical pattern that helps in decoupling microservices. The presentation will focus on the way domain events could be implemented using Kafka and the trade-offs between consistency and availability that are supported by Kafka.
https://youtu.be/P6IaxNcn-Ag?t=1466
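As a hedged sketch of publishing a domain event to Kafka in this spirit (not the speaker's actual implementation; the topic, key, and event shape are assumptions):

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
    acks="all",  # lean toward consistency: wait for the in-sync replicas
)

# Keying by aggregate ID keeps all events for one aggregate in one partition,
# so consumers see them in order.
producer.send(
    "order-events",
    key=b"order-123",
    value={"type": "OrderPlaced", "orderId": "order-123", "total": 99.95},
)
producer.flush()
```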
.NET Systems Programming Learned the Hard Way (petabridge)
This document discusses .NET systems programming and garbage collection. It covers garbage collection generations, modes, and considerations for minimizing allocations. It also discusses eliminating delegates to reduce allocations, using value types appropriately, avoiding empty collections, and optimizing for thread locality to reduce context switching overhead. Data structures and synchronization techniques are discussed, emphasizing the importance of choosing lock-free data structures when possible to improve performance.
Container orchestration from theory to practice (Docker, Inc.)
"Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts build the foundation of every modern orchestration system, and each technical explanation will be illustrated using SwarmKit and Kubernetes as a real-world example. Gain a deeper understanding of how orchestration systems work in practice and walk away with more insights into your production applications."
The document describes the architecture of Cassandra, including its startup process, layers (API, Dynamo, database), services (StorageProxy, MessagingService, Gossip), and failure handling. It also details how nodes communicate via the Gossip protocol to exchange state information and maintain a consistent cluster view.
Technical Overview of Apache Drill by Jacques Nadeau (MapR Technologies)
This document provides a technical overview of Apache Drill, including:
1) The basic query processing workflow involving Drillbits, distributed caching, and query execution.
2) The core modules within each Drillbit, including the SQL parser, optimizer, storage engines, and execution components.
3) How queries progress from SQL to logical and physical plans to distributed execution plans.
4) Technologies used include Java, Netty, Zookeeper, Parquet, and others.
Hidden pearls for High-Performance-Persistence (Sven Ruppert)
Small use cases with a significant amount of data for internal company usage: most developers have seen these in their career already. However, no ops team, no Kubernetes, no cluster is available as part of the solution.
In this talk, I will show a few tech stacks that help you deal with persistent data without the classic horizontal-scaling tech monsters like Kubernetes, Hadoop and many more.
Sit down, relax and enjoy the journey through a bunch of lightning-fast persistence alternatives for pure Java devs.
WSO2 Complex Event Processor (CEP) 4.0 provides real-time analytics capabilities. Some key features of CEP 4.0 include a re-architecture of the CEP server, improvements to the Siddhi query engine (version 3.0), scalable distributed processing using Apache Storm, and better high availability support. CEP 4.0 also offers a domain specific execution manager, real-time dashboards, improved usability, and pluggable event receivers and publishers. It can execute queries in a distributed manner across multiple nodes for scalability.
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME (confluent)
Confluent Platform is supporting London Metal Exchange's Kafka Centre of Excellence across a number of projects, with the main objective of providing a reliable, resilient, scalable and overall efficient Kafka-as-a-Service model to teams across the entire London Metal Exchange estate.
TDC2017 | São Paulo - Trilha Java EE: How we figured out we had a SRE team at... (tdc-globalcode)
The document discusses building reactive microservices using Vert.x. It begins with definitions of microservices and reactive systems. It then covers reactive programming and the reactive manifesto. A large portion of the document is dedicated to explaining Vert.x, including that it is a toolkit for building reactive applications on the JVM in a non-blocking way. It covers Vert.x concepts like verticles, the event bus, and service discovery. It provides examples of how to implement various Vert.x patterns and components like regular verticles, worker verticles, the event bus, circuit breakers, and service discovery.
Speaker: Jacob Aae Mikkelsen
Once you have successfully developed your application in Grails, Ratpack or your other favorite framework, you would like to see it deployed as quickly and painlessly as possible, right?
This talk will cover some of the supporting cast members of a successful modern infrastructure that developers can understand and use efficiently, with good DevOps practices.
Key elements are
Docker
Infrastructure as Code
Container Orchestration
The demo gods will hopefully be on our side, as this talk includes quite a few live demos!
Event sourcing in the functional world (22-07-2021) (Vitaly Brusentsev)
This document provides an overview of using event sourcing with F# and functional programming principles. It discusses domain design using event storming, defining primitive types and events, implementing domain logic as pure functions, serialization, error handling, and using Cosmos DB for event storage and change feeds to build eventually consistent read models. The key aspects covered include defining aggregates, commands, and events as discriminated unions; using Railway Oriented Programming for error handling; serializing to and from DTOs; implementing handlers as pure functions that return aggregates and events; and processing events in Azure Functions using change feeds to update read models.
Apache Kafka is a distributed streaming platform. It provides a high-throughput distributed messaging system with publish-subscribe capabilities. The document discusses Kafka producers and consumers, Kafka clients in different programming languages, and important configuration settings for Kafka brokers and topics. It also demonstrates sending messages to Kafka topics from a Java producer and consuming messages from the console consumer.
Talk about adding a proxy user at Spark task execution time, given at Spark Summit East 2017 by Jorge López-Malla and Abel Ricon.
full video:
https://www.youtube.com/watch?v=VaU1xC0Rixo&feature=youtu.be
A common microservice architecture anti-pattern is "the more the merrier". It occurs when an organization builds an excessively fine-grained architecture, e.g. one service per developer. In this talk, you will learn about the criteria that you should consider when deciding service granularity. I'll discuss the downsides of a fine-grained microservice architecture. You will learn how sometimes the solution to a design problem is simply a JAR file.
YOW London - Considering Migrating a Monolith to Microservices? A Dark Energy... (Chris Richardson)
This is a talk I gave at YOW! London 2022.
Let's imagine that you are responsible for an aging monolithic application that's critical to your business. Sadly, getting changes into production is a painful ordeal that regularly causes outages. And to make matters worse, the application's technology stack is growing increasingly obsolete. Neither the business nor the developers are happy. You need to modernize your application and have read about the benefits of microservices. But is the microservice architecture a good choice for your application?
In this presentation, I describe the dark energy and dark matter forces (a.k.a. concerns) that you must consider when deciding between the monolithic and microservice architectural styles. You will learn about how well each architectural style resolves each of these forces. I describe how to evaluate the relative importance of each of these forces to your application. You will learn how to use the results of this evaluation to decide whether to migrate to the microservice architecture.
Dark Energy, Dark Matter and the Microservices Patterns?! (Chris Richardson)
Dark matter and dark energy are mysterious concepts from astrophysics that are used to explain observations of distant stars and galaxies. The Microservices pattern language - a collection of patterns that solve architecture, design, development, and operational problems — enables software developers to use the microservice architecture effectively. But how could there possibly be a connection between microservices and these esoteric concepts from astrophysics?
In this presentation, I describe how dark energy and dark matter are excellent metaphors for the competing forces (a.k.a. concerns) that must be resolved by the microservices pattern language. You will learn that dark energy, which is an anti-gravity, is a metaphor for the repulsive forces that encourage decomposition into services. I describe how dark matter, which is an invisible matter that has a gravitational effect, is a metaphor for the attractive forces that resist decomposition and encourage the use of a monolithic architecture. You will learn how to use the dark energy and dark matter forces as guide when designing services and operations.
Dark energy, dark matter and microservice architecture collaboration patterns (Chris Richardson)
Dark energy and dark matter are useful metaphors for the repulsive forces, which encourage decomposition into services, and the attractive forces, which resist decomposition. You must balance these conflicting forces when defining a microservice architecture including when designing system operations (a.k.a. requests) that span services.
In this talk, I describe the dark energy and dark matter forces. You will learn how to design system operations that span services using microservice architecture collaboration patterns: Saga, Command-side replica, API composition, and CQRS patterns. I describe how each of these patterns resolve the dark energy and dark matter forces differently.
It sounds dull, but good architecture documentation is essential, especially when you are actively trying to improve your architecture.
For example, I spend a lot of time helping clients modernize their software architecture. More often than I like, I’m presented with a vague and lifeless collection of boxes and lines. As a result, it’s sometimes difficult to discuss the architecture in a meaningful and productive way. In this presentation, I’ll describe techniques for creating minimal yet effective documentation for your application’s microservice architecture. In particular, you will learn how documenting scenarios can bring your architecture to life.
Using patterns and pattern languages to make better architectural decisions (Chris Richardson)
This is a presentation that gave at the O'Reilly Software Architecture Superstream: Software Architecture Patterns.
The talk's focus is the microservices pattern language.
However, it also shows how thinking with the pattern mindset - context/problem/forces/solution/consequences - leads to better technical decisions.
The microservices architecture offers tremendous benefits, but it’s not a silver bullet. It also has some significant drawbacks. The microservices pattern language—a collection of patterns that solve architecture, design, development, and operational problems—enables software developers to apply the microservices architecture effectively. I provide an overview of the microservices architecture and examines the motivations for the pattern language, then takes you through the key patterns in the pattern language.
Rapid, reliable, frequent and sustainable software development requires an architecture that is loosely coupled and modular.
Teams need to be able to complete their work with minimal coordination and communication with other teams.
They also need to be able to keep the software’s technology stack up to date.
However, the microservice architecture isn’t always the only way to satisfy these requirements.
Yet, neither is the monolithic architecture.
In this talk, I describe loose coupling and modularity and why they are essential.
You will learn about three architectural patterns: traditional monolith, modular monolith and microservices.
I describe the benefits, drawbacks and issues of each pattern and how well it supports rapid, reliable, frequent and sustainable development.
You will learn some heuristics for selecting the appropriate pattern for your application.
Events to the rescue: solving distributed data problems in a microservice architecture (Chris Richardson)
To deliver a large complex application rapidly, frequently and reliably, you often must use the microservice architecture.
The microservice architecture is an architectural style that structures the application as a collection of loosely coupled services.
One challenge with using microservices is that in order to be loosely coupled each service has its own private database.
As a result, implementing transactions and queries that span services is no longer straightforward.
In this presentation, you will learn how event-driven microservices address this challenge.
I describe how to use sagas, an asynchronous, messaging-based pattern, to implement transactions that span services.
You will learn how to implement queries that span services using the CQRS pattern, which maintains easily queryable replicas using events.
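For a flavor of the saga approach, here is a hedged sketch of a single choreography step with the kafka-python client; the service, topics, and event names are invented for illustration:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "order-events",
    group_id="payment-service",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

for event in consumer:
    if event.value["type"] == "OrderPlaced":
        # Perform the local transaction, then publish the next saga event.
        producer.send("payment-events", value={
            "type": "PaymentAuthorized",
            "orderId": event.value["orderId"],
        })
```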
A pattern language for microservices - June 2021 (Chris Richardson)
The microservice architecture is growing in popularity. It is an architectural style that structures an application as a set of loosely coupled services that are organized around business capabilities. Its goal is to enable the continuous delivery of large, complex applications. However, the microservice architecture is not a silver bullet and it has some significant drawbacks.
The goal of the microservices pattern language is to enable software developers to apply the microservice architecture effectively. It is a collection of patterns that solve architecture, design, development and operational problems. In this talk, I’ll provide an overview of the microservice architecture and describe the motivations for the pattern language. You will learn about the key patterns in the pattern language.
QConPlus 2021: Minimizing Design Time Coupling in a Microservice Architecture (Chris Richardson)
Delivering large, complex software rapidly, frequently and reliably requires a loosely coupled organization. DevOps teams should rarely need to communicate and coordinate in order to get work done. Conway's law states that an organization and the architecture that it develops mirror one another. Hence, a loosely coupled organization requires a loosely coupled architecture.
In this presentation, you will learn about design-time coupling in a microservice architecture and why it's essential to minimize it. I describe how to design service APIs to reduce coupling. You will learn how to minimize design-time coupling by applying a version of the DRY principle. I describe how key microservices patterns potentially result in tight design time coupling and how to avoid it.
Mucon 2021 - Dark energy, dark matter: imperfect metaphors for designing microservices (Chris Richardson)
In order to explain certain astronomical observations, physicists created the mysterious concepts of dark energy and dark matter.
Dark energy is a repulsive force.
It’s an anti-gravity that is forcing matter apart and accelerating the expansion of the universe.
Dark matter has the opposite attraction effect.
Although it’s invisible, dark matter has a gravitational effect on stars and galaxies.
In this presentation, you will learn how these metaphors apply to the microservice architecture.
I describe how there are multiple repulsive forces that drive the decomposition of your application into services.
You will learn, however, that there are also multiple attractive forces that resist decomposition and bind software elements together.
I describe how as an architect you must find a way to balance these opposing forces.
Skillsmatter CloudNative eXchange 2020
The microservice architecture is a key part of cloud native.
An essential principle of the microservice architecture is loose coupling.
If you ignore this principle and develop tightly coupled services, the result will most likely be yet another "microservices failure story".
Your application will be brittle and have all of the disadvantages of both the monolithic and microservice architectures.
In this talk you will learn about the different kinds of coupling and how to design loosely coupled microservices.
I describe how to minimize design-time coupling and increase the productivity of your DevOps teams.
You will learn how to reduce runtime coupling and improve availability.
I describe how to improve availability by minimizing the coupling caused by your infrastructure.
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith to microservices (Chris Richardson)
This is a talk I gave at DDD SoCal.
1. Make the most of your monolith
2. Adopt microservices for the right reasons
3. It’s not just architecture
4. Get the support of the business
5. Migrate incrementally
6. Know your starting point
7. Begin with the end in mind
8. Migrate high-value modules first
9. Success is improved velocity and reliability
10. If it hurts, don’t do it
Decompose your monolith: Six principles for refactoring a monolith to microservices (Chris Richardson)
This was a talk I gave at the CTO virtual summit on July 28th. It describes 6 principles for refactoring to a microservice architecture.
1. Make the most of your monolith
2. Adopt microservices for the right reasons
3. Migrate incrementally
4. Begin with the end in mind
5. Migrate high-value modules first
6. Success is improved velocity and reliability
The microservice architecture is becoming increasingly important. But what is it exactly? Why should you care about microservices? And what do you need to do to ensure that your organization uses the microservice architecture successfully? In this talk, I’ll answer these and other questions. You will learn about the motivations for the microservice architecture and why simply adopting microservices is insufficient. I describe essential characteristics of microservices. You will learn how a successful microservice architecture consists of loosely coupled services with stable APIs that communicate asynchronously.
Eventuate is a platform that tackles the distributed data management challenges inherent in a microservice architecture.
https://eventuate.io/
This document discusses strategies for migrating a monolithic application to microservices. It begins with an overview of decomposing a monolith and implementing new features as microservices. It then provides an example of extracting the delivery module from a monolithic food delivery application into its own delivery microservice. The example outlines the steps of splitting the code, extracting the relevant database tables, defining and deploying the new delivery service, integrating it with the monolith, and removing the old code. Finally, it discusses implementing a delayed delivery service as another example.
Agentic AI Use Cases using GenAI LLM models (Manish Chopra)
This document presents specific use cases for Agentic AI (Artificial Intelligence), featuring Large Language Models (LLMs), Generative AI, and snippets of Python code alongside each use case.
Landscape of Requirements Engineering for/by AI through Literature ReviewHironori Washizaki
Hironori Washizaki, "Landscape of Requirements Engineering for/by AI through Literature Review," RAISE 2025: Workshop on Requirements engineering for AI-powered SoftwarE, 2025.
Join Ajay Sarpal and Miray Vu to learn about key Marketo Engage enhancements. Discover improved in-app Salesforce CRM connector statistics for easy monitoring of sync health and throughput. Explore new Salesforce CRM Synch Dashboards providing up-to-date insights into weekly activity usage, thresholds, and limits with drill-down capabilities. Learn about proactive notifications for both Salesforce CRM sync and product usage overages. Get an update on improved Salesforce CRM synch scale and reliability coming in Q2 2025.
Key Takeaways:
Improved Salesforce CRM User Experience: Learn how self-service visibility enhances satisfaction.
Utilize Salesforce CRM Synch Dashboards: Explore real-time weekly activity data.
Monitor Performance Against Limits: See threshold limits for each product level.
Get Usage Over-Limit Alerts: Receive notifications for exceeding thresholds.
Learn About Improved Salesforce CRM Scale: Understand upcoming cloud-based incremental sync.
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Eric D. Schabell
It's time you stopped letting your telemetry data pressure your budgets and get in the way of solving issues with agility! No more I say! Take back control of your telemetry data as we guide you through the open source project Fluent Bit. Learn how to manage your telemetry data from source to destination using the pipeline phases covering collection, parsing, aggregation, transformation, and forwarding from any source to any destination. Buckle up for a fun ride as you learn by exploring how telemetry pipelines work, how to set up your first pipeline, and exploring several common use cases that Fluent Bit helps solve. All this backed by a self-paced, hands-on workshop that attendees can pursue at home after this session (https://o11y-workshops.gitlab.io/workshop-fluentbit).
Discover why Wi-Fi 7 is set to transform wireless networking and how Router Architects is leading the way with next-gen router designs built for speed, reliability, and innovation.
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfTechSoup
In this webinar we will dive into the essentials of generative AI, address key AI concerns, and demonstrate how nonprofits can benefit from using Microsoft’s AI assistant, Copilot, to achieve their goals.
This event series to help nonprofits obtain Copilot skills is made possible by generous support from Microsoft.
What You’ll Learn in Part 2:
Explore real-world nonprofit use cases and success stories.
Participate in live demonstrations and a hands-on activity to see how you can use Microsoft 365 Copilot in your own work!
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AIdanshalev
If we were building a GenAI stack today, we'd start with one question: Can your retrieval system handle multi-hop logic?
Trick question, because most can’t. They treat retrieval as nearest-neighbor search.
Today, we discussed scaling #GraphRAG at AWS DevOps Day, and the takeaway is clear: VectorRAG is naive, lacks domain awareness, and can’t handle full dataset retrieval.
GraphRAG builds a knowledge graph from source documents, allowing for a deeper understanding of the data + higher accuracy.
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDinusha Kumarasiri
AI is transforming APIs, enabling smarter automation, enhanced decision-making, and seamless integrations. This presentation explores key design principles for AI-infused APIs on Azure, covering performance optimization, security best practices, scalability strategies, and responsible AI governance. Learn how to leverage Azure API Management, machine learning models, and cloud-native architectures to build robust, efficient, and intelligent API solutions.
Mastering OOP: Understanding the Four Core PillarsMarcel David
Visit for updated note:
https://www.notion.so/Four-Pillars-of-Object-Oriented-Programming-OOP-1e2d7d9612808079b7c5f938afd62a7b?pvs=4
Dive into the essential concepts of Object-Oriented Programming (OOP) with a detailed explanation of its four key pillars: Encapsulation, Inheritance, Polymorphism, and Abstraction. Understand how these principles contribute to robust, maintainable, and scalable software development.
Overview of Zookeeper, Helix and Kafka (Oakjug)
1. @crichardson
Distributed system goodies:
Zookeeper, Helix and Kafka
Chris Richardson
Author of POJOs in Action
Founder of the original CloudFoundry.com
@crichardson
[email protected]
http://plainoldobjects.com
http://microservices.io
4. @crichardson
About Chris
Founder of a startup that’s creating a platform for developing event-driven microservices (http://bit.ly/trialeventuate)
7. @crichardson
Apache ZooKeeper is an open source distributed configuration service, synchronization service, and naming registry for large distributed systems
https://zookeeper.apache.org/
8. @crichardson
Distributed system use cases…
Name service
lookup by name, e.g. service discovery: name => [host, port]*
Group membership
E.g. distributed cache
Cluster members need to talk amongst themselves
Clients need to discover the group members
9. @crichardson
…Use cases
Leader election (see the sketch after this list)
N servers, one of which needs to be the master
e.g. master/slave replication
Distributed locking and latches
e.g. cluster wide singleton
Queues
…
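To make the leader-election recipe concrete, here is a minimal sketch against the stock ZooKeeper Java client from Scala (the /election path, connection string, and timeout are invented for illustration): each candidate creates an ephemeral sequential znode, and whoever holds the lowest sequence number is the master; when it dies, its znode disappears and the survivors re-check.

import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}
import scala.collection.JavaConverters._

// assumes an ensemble at localhost:2181 and a pre-created /election znode
val zk = new ZooKeeper("localhost:2181", 5000, null)
// each candidate creates an ephemeral, sequential znode
val myPath = zk.create("/election/node-", new Array[Byte](0),
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)
// the candidate with the lowest sequence number is the leader; if it dies,
// its ephemeral znode vanishes and the others re-run this check
val candidates = zk.getChildren("/election", false).asScala.sorted
val isLeader = myPath.endsWith(candidates.head)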
11. @crichardson
Zookeeper clients
Languages:
Ships with Java, C, Perl, and Python
Community: Scala, NodeJS, Go, Lua, …
Client connects to one of a list of servers
Client establishes a session
Survives TCP disconnects
Client-specified session timeout
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZKClientBindings
12. Zookeeper data model
Hierarchical tree of named znodes
Znodes have binary data and children
Znodes can be ephemeral - they live for as long as the client session
Clients can watch a znode - get notified of changes
13. @crichardson
Zookeeper operations
create(path, data, mode)
Persistent or ephemeral?
Sequential: append parent’s counter value to name?
delete(path)
exists(path)
readData(path, watch?) : Object
writeData(path, data)
getChildren(path, watch?) : List[String]
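The readData/writeData names above come from higher-level client wrappers; in the stock Java client they roughly correspond to getData/setData. A hedged sketch of the operations from Scala (path, data, and connection string are invented; error handling omitted):

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("localhost:2181", 5000, null)
// create a persistent znode holding some binary data
zk.create("/config", "v1".getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
// read the data back, registering a one-shot watch for changes
val watcher = new Watcher {
  def process(event: WatchedEvent): Unit = println("changed: " + event.getPath)
}
val data = zk.getData("/config", watcher, null)
// write new data (-1 = any version) ... the watch above fires
zk.setData("/config", "v2".getBytes, -1)
zk.delete("/config", -1)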
15. @crichardson
Using the zkCli
$ bin/zkCli.sh -server $DOCKER_HOST_IP
[zk] create /cer x
Created /cer
[zk] create /cer/foo y
Created /cer/foo
[zk] get /cer/foo watch
y
[zk] set /cer/foo z
WatchedEvent state:SyncConnected
type:NodeDataChanged path:/cer/foo
16. @crichardson
Creating an ephemeral sequential node
[zk] create -s -e /cer/baz aa
Created /cer/baz0000000001
[zk] ls /cer watch
[baz0000000001, foo]
[zk] exit
WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/cer
[zk] ls /cer watch
[foo]
18. @crichardson
Apache Curator
Open source library developed by Netflix
Simplifies connection management
Simplifies error handling
Implements recipes
Three projects: client, framework, and recipes
http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html
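As an illustration of a recipe, here is a sketch of Curator's LeaderLatch, which packages the leader-election pattern shown earlier (the connection string and /election path are assumptions):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

// connection management (retries, reconnects) is handled by the framework
val client = CuratorFrameworkFactory.newClient("localhost:2181",
  new ExponentialBackoffRetry(1000, 3))
client.start()
// LeaderLatch is the leader-election recipe: await() blocks until this client wins
val latch = new LeaderLatch(client, "/election")
latch.start()
latch.await()
println("I am now the leader")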
19. @crichardson
Netflix Exhibitor
Supervisory process for managing a Zookeeper instance
Watches a ZK instance and makes sure it is running
Performs periodic backups
Performs periodic cleaning of the ZK log directory
A GUI explorer for viewing ZK nodes
A rich REST API
https://github.com/Netflix/exhibitor/wiki
22. @crichardson
Typical distributed systems
Partitioning - e.g. use a PK (or other attribute) to choose a server
Replication - for availability
State machines, e.g. master/slave replication
One replica is the master
Other replica is the slave
23. @crichardson
Use cases - master/slave replication
MySQL master/slave replication or MongoDB replica sets
N machines
1 master, N slaves
If the master dies then elect a new master
24. @crichardson
Use cases - Cassandra
Cluster consists of N nodes
Data consists of M partitions (aka vnodes)
Each partition has R replicas
Client can read/write any replica - no master/slave concept
Dynamic assignment of M*R partition replicas to N nodes
25. @crichardson
Use case - abstractly
Cluster:
Set of N nodes (machines)
One or more resources
A resource is partitioned and replicated
Resource has a state machine
e.g. offline/online, master/slave
State machine has constraints: 1 master replica, other replicas are slaves
Helix
dynamically assigns partitions to nodes
Manages state transitions and notifies nodes
29. @crichardson
Helix cluster setup
val admin = new ZKHelixAdmin(ZK_ADDRESS)
// register the LeaderStandby state model with the cluster
admin.addStateModelDef(clusterName, STATE_MODEL_NAME,
  new StateModelDefinition(StateModelConfigGenerator.generateConfigForLeaderStandby()))
// define a resource with NUM_PARTITIONS partitions, assigned automatically by Helix
admin.addResource(clusterName, RESOURCE_NAME, NUM_PARTITIONS, STATE_MODEL_NAME, "AUTO")
// start a controller, which manages the cluster's state transitions
HelixControllerMain.startHelixController(ZK_ADDRESS, clusterName, nodeInfo.nodeId.id,
  HelixControllerMain.STANDALONE)
30. @crichardson
Adding an instance to the cluster
val ic = new InstanceConfig(nodeInfo.nodeId.id)
ic.setHostName(nodeInfo.host)
ic.setPort("" + nodeInfo.port)
ic.setInstanceEnabled(true)
admin.addInstance(clusterName, ic)
// rebalance so that partition replicas are assigned to the newly added nodes
admin.rebalance(clusterName, RESOURCE_NAME, NUM_REPLICAS)
31. @crichardson
Helix - connecting to the cluster
// connect to the cluster as a participant
manager = HelixManagerFactory.getZKHelixManager(clusterName, instanceName,
  InstanceType.PARTICIPANT, ZK_ADDRESS)
// supply a factory that creates the callbacks for state transitions
val stateModelFactory = new MyStateModelFactory
val stateMach = manager.getStateMachineEngine
stateMach.registerStateModelFactory(STATE_MODEL_NAME, stateModelFactory)
manager.connect()
32. @crichardson
State transition callbacks
class MyStateModel(partitionName: String) extends StateModel {
  // invoked by Helix; partitionName has the form <resourceName>_<partitionNumber>
  def onBecomeStandbyFromOffline(message: Message, context: NotificationContext) {
    …
  }
  def onBecomeLeaderFromStandby(message: Message, context: NotificationContext) {
    …
  }
  …
}
class MyStateModelFactory extends StateModelFactory[StateModel] {
  def createNewStateModel(partitionName: String) =
    new MyStateModel(partitionName)
}
33. @crichardson
More about Helix
Spectators
Non-participants - don’t have resource partitions assigned to them
Get notified of changes to cluster
Property store
Write-through cache of properties in Zookeeper
Messaging
Intra-cluster communication
…
37. Kafka concepts - topic
Clients publish messages to a topic
A topic has a name
A topic is a partitioned log
Topics live on disk
Messages have an offset within a partition
Messages are kept for a retention period
38. @crichardson
Kafka is clustered
Kafka cluster consists of N machines
Each topic partition has R replicas
1 machine is the leader (think master) for the topic partition
Clients publish/consume to/from leader
R - 1 machines are followers (think slaves)
Followers consume messages from the leader
Messages are committed when all replicas have written to the log
Producers can optionally wait for a message to be committed
Consumers only ever see committed messages
39. @crichardson
Kafka producers
Publish message to a topic
Message = (key, body)
Hash of key determines topic partition
Carefully choose the key to preserve ordering, e.g. stock ticker symbol => all prices for the same symbol end up in the same partition
Makes request to topic partition’s leader
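A minimal producer along these lines, using the standard Java client from Scala (broker address, topic, key, and value are made up; acks=all is the "wait until committed" option from the previous slide):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
// wait until the message is committed to the replicas before acknowledging
props.put("acks", "all")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// key = ticker symbol, so all prices for the same symbol hash to the same partition
producer.send(new ProducerRecord[String, String]("prices", "AAPL", "123.45"))
producer.close()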
40. @crichardson
Kafka consumer
Consumes the messages from the partitions of one or more topics
Makes a fetch request to a topic partition’s leader
specifies the partition offset in each request
gets back a chunk of messages
Scale by having N topic partitions, N consumers
41. @crichardson
Kafka consumers - between a rock and a hard place
Simple Kafka consumer
Very flexible
BUT you are responsible for contacting the leader of each topic partition and for storing offsets
High level consumer
Does a lot: stores offsets in Zookeeper, deals with leaders, …
BUT it assumes that if you read a message it has been processed
A more flexible consumer is on the way
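That more flexible consumer later shipped as the unified KafkaConsumer API (Kafka 0.9). A sketch of it from Scala, with invented broker, topic, and group names:

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "price-readers")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
// leaders and offsets are handled for you; offsets are stored in Kafka itself
consumer.subscribe(Arrays.asList("prices"))
while (true) {
  val records = consumer.poll(100) // poll timeout in milliseconds
  for (r <- records.asScala)
    println(s"partition ${r.partition} offset ${r.offset}: ${r.key} = ${r.value}")
}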
43. @crichardson
Kafka at LinkedIn
1100 Kafka brokers organized into more than 60 clusters.
Writes:
Over 800 billion messages per day
Over 175 terabytes of data
Over 650 terabytes of messages are consumed daily
Peak
13 million messages per second
2.75 gigabytes of data per second
https://engineering.linkedin.com/kafka/running-kafka-scale