High throughput data replication over RAFT

© Hortonworks Inc. 2011–2018. All rights reserved
Mukul Kumar Singh, Staff Software Engineer, Hortonworks
Lokesh Jain, Software Engineer, Hortonworks

Speakers
Mukul Kumar Singh
• msingh@apache.org
• Staff Software Engineer, Hortonworks
• ASF
• Committer for Apache Hadoop
• Committer for Apache Ratis
• MS from Carnegie Mellon University, Pittsburgh
Lokesh Jain
• ljain@apache.org
• Software Engineer, Hortonworks
• ASF
• Committer for Apache Ratis
• BE (Hons) Computer Science & M.Sc. (Hons) Mathematics from BITS Pilani

Raft
• Raft is a consensus algorithm
• Works as long as a majority of the nodes in the cluster are alive
• i.e. it can tolerate the loss of a minority of the nodes.
• “In Search of an Understandable Consensus Algorithm”
• by Diego Ongaro and John Ousterhout
• USENIX ATC’14, https://raft.github.io
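The majority rule above is simple arithmetic, sketched below. The function names are illustrative and are not taken from any Raft library.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be positive")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Largest number of servers that can be lost while a majority survives."""
    return cluster_size - majority(cluster_size)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority(n), tolerated_failures(n))
```

A 3-node group tolerates one failure and a 5-node group tolerates two. Going from 3 to 4 nodes does not add fault tolerance, which is why odd group sizes are the usual choice.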

Raft Library
• Our Motivations
• Use Raft in Ozone
• “In Search of a Usable Raft Library”
• A long list of Raft implementations is available
• None of them is a general-purpose library ready to be consumed by other projects.
• Most of them are tied to another project or a part of another project.
• We need a Raft library!

Raft Basics
• Leader Election
• Every server starts as a Follower
• A Follower times out after a randomized interval, becomes a Candidate, and starts a leader election
• The Candidate sends requestVote to the other servers
• It becomes the Leader once it gets a majority of the votes.
• Append Entries
• Clients send requests to the Leader
• The Leader forwards the requests to the Followers
• The Leader sends appendEntries to the Followers
• When there are no client requests, the Leader also sends empty appendEntries (heartbeats) to the Followers to maintain leadership
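The election steps above can be sketched as a single-round simulation. This is a simplified illustration, not the full protocol: the log up-to-date check that Raft performs before granting a vote is omitted, the 150 to 300 ms timeout range is only an illustrative choice, and all class and function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    role: str = "follower"            # every server starts as a follower
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate: str) -> bool:
        """Grant at most one vote per term (log comparison omitted)."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = "follower"
        if term == self.current_term and self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(servers: List[Server], rng: random.Random) -> Server:
    """The server whose randomized timeout fires first stands for election."""
    timeouts = {s.name: rng.uniform(150, 300) for s in servers}  # illustrative ms
    candidate = min(servers, key=lambda s: timeouts[s.name])
    candidate.role = "candidate"
    candidate.current_term += 1
    candidate.voted_for = candidate.name       # a candidate votes for itself
    votes = 1
    for peer in servers:
        if peer is not candidate and peer.request_vote(
            candidate.current_term, candidate.name
        ):
            votes += 1
    if votes >= len(servers) // 2 + 1:          # majority of the group
        candidate.role = "leader"
    return candidate


if __name__ == "__main__":
    group = [Server("s1"), Server("s2"), Server("s3")]
    print(run_election(group, random.Random(0)).name)
```

Because the timeouts are randomized, two servers rarely become candidates at the same moment, which keeps split votes uncommon.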

Apache Ratis

Data Intensive Applications
• In Raft,
• All transactions and the data are written in the log
• Not suitable for data intensive applications
• In Ratis
• An application can choose not to write all of its data to the log
• State machine data and log data can be managed separately
• See the FileStore example in ratis-example
• See ContainerStateMachine in Apache Hadoop Ozone for an implementation.
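The idea of keeping bulk data out of the log can be sketched as follows. This is a minimal, in-memory illustration under assumed names (`LogEntry`, `SplitStore`); it does not use or mirror the Ratis API. The log entry carries only a small descriptor of the write, while the payload goes to a separate store that stands in for files on disk.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogEntry:
    """Metadata only: the payload itself is never stored in the log."""
    index: int
    path: str
    length: int
    sha256: str


class SplitStore:
    def __init__(self) -> None:
        self.log: List[LogEntry] = []
        self.data: Dict[str, bytes] = {}    # stands in for files on disk

    def write(self, path: str, payload: bytes) -> LogEntry:
        # The payload is written once, outside the log.
        self.data[path] = payload
        # The log records a small, fixed-size description of the write.
        entry = LogEntry(
            index=len(self.log),
            path=path,
            length=len(payload),
            sha256=hashlib.sha256(payload).hexdigest(),
        )
        self.log.append(entry)
        return entry


if __name__ == "__main__":
    store = SplitStore()
    print(store.write("file-1", b"x" * 1024))
```

The log entry stays the same size whether the payload is 1 KB or 1 GB, so the data is written to disk once rather than once in the log and once in the state machine.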

Ratis: Standard Raft Features
• Leader Election + Log Replication
• Automatically elect a leader among the servers in a Raft group
• Randomized timeout for avoiding split votes
• Log is replicated in the Raft group
• Membership Changes
• Members of a Raft group can be reconfigured at runtime
• Replication factor can be changed at runtime
• Log Compaction
• A snapshot is taken periodically
• A snapshot is sent instead of a long log history.
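Log compaction can be illustrated with a toy key-value log. The sketch below assumes a snapshot is taken after a fixed number of entries (the slides only say "periodically"), and the class and method names are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, str, int]    # (log index, key, value)


class CompactingLog:
    def __init__(self, snapshot_every: int = 3) -> None:
        self.snapshot_every = snapshot_every
        self.entries: List[Entry] = []       # entries newer than the snapshot
        self.state: Dict[str, int] = {}
        self.snapshot: Dict[str, int] = {}
        self.snapshot_index = 0
        self.last_index = 0

    def append(self, key: str, value: int) -> None:
        self.last_index += 1
        self.entries.append((self.last_index, key, value))
        self.state[key] = value
        if len(self.entries) >= self.snapshot_every:
            self.take_snapshot()

    def take_snapshot(self) -> None:
        """Capture the state and drop the log entries it already covers."""
        self.snapshot = dict(self.state)
        self.snapshot_index = self.last_index
        self.entries.clear()

    def catch_up(self, follower_index: int):
        """What a leader would send to a follower that is at follower_index."""
        if follower_index < self.snapshot_index:
            # The needed entries were compacted away: send the snapshot.
            return ("snapshot", self.snapshot_index, dict(self.snapshot),
                    list(self.entries))
        return ("entries", [e for e in self.entries if e[0] > follower_index])


if __name__ == "__main__":
    log = CompactingLog(snapshot_every=3)
    for i, key in enumerate("abcd", start=1):
        log.append(key, i)
    print(log.catch_up(0))
    print(log.catch_up(3))
```

A follower that is far behind receives the snapshot plus the short tail of entries, while an up-to-date follower receives only the new entries.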

Ratis: Pluggability
• Pluggable state machine
• Application must define its state machine
• Example: a key-value map
• Pluggable RPC
• Users may provide their own RPC implementation
• Default implementations: gRPC, Netty, Hadoop RPC
• gRPC allows implementation of native client
• Pluggable Raft log
• Users may provide their own log implementation
• The default implementation stores log in local files

Ratis: Asynchronous/Synchronous APIs
• Using gRPC bi-directional stream API
• Netty and Hadoop RPC could support async, but this is not yet implemented
• Server-to-server
• Asynchronous append entries
• Client-to-server
• Asynchronous client requests

General Ratis Use Cases
• You want to:
• (1) replicate the server log/states to multiple machines
• The replication factor and cluster membership can be changed at runtime
• It can tolerate server failures.
• or
• (2) have an HA (highly available) service
• When a server fails, another server will automatically take over.
• Clients automatically fail over to the new server.
• Apache Ratis is for you!

API
• Client Side APIs
• send/sendReadOnly
• Read-only commands, sent with sendReadOnly, do not change the state of the Raft server.
• Async versions also available (sendAsync, sendReadOnlyAsync)
• Server Side APIs
• applyTransaction
• Applies the transaction to the statemachine
• writeStateMachineData
• An optimization to avoid the double-write penalty for data-intensive applications.
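The difference between state-changing and read-only requests, and between blocking and asynchronous calls, can be illustrated with a toy client and server. Everything below is a stand-in written for this sketch: `ToyServer`, `ToyClient`, and their methods only echo the shape of the API names on the slide and are not the Ratis classes.

```python
import asyncio
from typing import List, Tuple


class ToyServer:
    def __init__(self) -> None:
        self.counter = 0
        self.log: List[str] = []

    def handle(self, command: str, read_only: bool) -> int:
        if read_only:
            return self.counter        # served without touching log or state
        self.log.append(command)       # state-changing commands are logged
        self.counter += 1
        return self.counter


class ToyClient:
    def __init__(self, server: ToyServer) -> None:
        self.server = server

    def send(self, command: str) -> int:
        return self.server.handle(command, read_only=False)

    def send_read_only(self, command: str) -> int:
        return self.server.handle(command, read_only=True)

    async def send_async(self, command: str) -> int:
        await asyncio.sleep(0)         # yield, standing in for network I/O
        return self.send(command)


async def main() -> Tuple[List[int], int]:
    client = ToyClient(ToyServer())
    # Issue five requests without waiting for each reply in turn.
    replies = await asyncio.gather(
        *(client.send_async(f"INCREMENT {i}") for i in range(5))
    )
    return list(replies), client.send_read_only("GET")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The asynchronous form lets a client keep several requests in flight, which matters for throughput when each request involves a network round trip.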

High Throughput Data Pipeline

Building a high performance data pipeline
• Requirements
• High data write throughput
• Parallelism/async interface
• Large number of transactions per second
• Configurable parameters
• Support for security

Building a high performance data pipeline
• Optimizations
• Separate user data from the raft log
– Avoids double write penalty for data
• Efficient batching of raft log entries
– High write performance during local disk write
– Efficient network replication
• Async processing of operations
– Client ops
– Append entries to followers
– StateMachine implementation

FileStoreStateMachine
• Located at org.apache.ratis.examples.filestore
• Simple state machine implementation to write bytes to a file
• Separates file data from raft log.
• File data written is persisted to disk
• Client generates random bytes of the specified file size
• Client uses writeAsync

Performance Benchmarking
• Setup, 3 nodes with
• Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
• 256GiB System memory
• 10 Gigabit Network Connection
• 4 HGST (HUS726060AL4210) HDD of 5.5TB each

Performance – Write Throughput
[Figure: Write throughput for 1GB. Y-axis: data throughput in MB/s (0 to 300). X-axis: file size in KB (128000 down to 100).]

Performance – Transactions per second
[Figure: Number of transactions with 100000 files. Y-axis: number of transactions per second (0 to 12000). X-axis: file size in bytes (100000, 10000, 1000, 100, 10).]
Ozone
[Diagram: Ozone Client, Ozone Master, Storage Container Manager, and three datanodes (DN) grouped under RATIS. Arrow labels: Get Block; Get Container Location (List of DNs); Write Data.]

Terminologies
• OM – Ozone Master
• Namespace manager inside Ozone; manages the mapping from key names to block IDs.
• Also manages the volume, bucket, and key namespaces
• SCM – Storage Container Manager
• Block manager; manages cluster membership, container location information, and containers
• Datanode
• Used to store user data; a Ratis server is spawned inside the datanode
• Ozone datanodes persist containers; blocks are allocated out of containers.

Storage Container
• Hadoop Distributed Data Storage (HDDS) introduces Storage Containers
• Provides generic data storage functionality.
• Configurable size (2GB - 16GB+)
• Unit of management and replication in SCM.
• Blocks are allocated from containers
• BID = CID + LocalID (block ID = container ID + local ID)
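One way to read "BID = CID + LocalID" is as a composite identifier: a block is named by the container it lives in plus an ID that is local to that container. The sketch below assumes that reading; the `BlockID` and `Container` classes and the sequential local IDs are illustrative and are not the HDDS implementation.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class BlockID:
    container_id: int    # CID: which container holds the block
    local_id: int        # LocalID: unique only within that container


class Container:
    def __init__(self, container_id: int) -> None:
        self.container_id = container_id
        self._next_local_id = count(1)   # illustrative: sequential local IDs

    def allocate_block(self) -> BlockID:
        return BlockID(self.container_id, next(self._next_local_id))


if __name__ == "__main__":
    container = Container(7)
    print(container.allocate_block())
    print(container.allocate_block())
```

With this layout, locating a block only requires knowing where its container is, so the manager tracks containers rather than individual blocks.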

Use of Ratis in Ozone
• Replicating data in open containers
• Replication of user data using Ratis
• Support HA in Storage Container Manager
• Work in Progress
• Support HA in Ozone Manager
• Work in Progress

Ozone Ratis Commands
• The Ozone data pipeline involves interaction between the client and the datanodes.
• Commands are marked as read-only if they do not change the state of the datanode.
• Read-only: GetKey, ReadChunk, ReadContainer
• State-changing: WriteChunk, PutKey, CreateContainer, etc.
• The Ozone client sends container commands to the leader datanode using the Ratis protocol (with gRPC as the underlying RPC)
Command Replication on Containers
[Diagram: one Leader and two Followers. Labels: Write Chunk; CSM; Response.]

Open Container Replication using Ratis
• Ratis is used to replicate data being written to Ozone datanodes.
• Ratis replicates container commands on open containers.
• The Ozone datanode provides its own state machine implementation
• This implementation handles various datanode commands (write chunk, put key, create container)
• Performance optimizations
• To avoid writing data to disk twice, the state machine implementation separates user data from block/chunk metadata.
• Multiple chunks are written in parallel.
• Append requests from the Leader to the Followers are asynchronous, which allows multiple appends in parallel.
• Raft journal on a separate disk: fast contiguous writes without seeking
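Writing multiple chunks in parallel can be sketched with a thread pool. This is a generic illustration using temporary files; the function names, the chunk file naming, and the worker count are assumptions, and it says nothing about how the Ozone datanode actually schedules chunk writes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


def write_chunk(directory: str, index: int, payload: bytes) -> str:
    """Write one chunk to its own file and force it to disk."""
    path = os.path.join(directory, f"chunk_{index:04d}")
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    return path


def write_chunks_parallel(directory: str, chunks: List[bytes],
                          workers: int = 4) -> List[str]:
    """Submit all chunk writes at once; return paths in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_chunk, directory, index, payload)
            for index, payload in enumerate(chunks)
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        paths = write_chunks_parallel(directory, [b"a" * 10, b"b" * 20, b"c" * 5])
        print([os.path.getsize(path) for path in paths])
```

Because each chunk goes to its own file, the writes do not depend on one another and can overlap while the results are still collected in chunk order.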

Ozone Data Write Performance
• The performance numbers were taken for different key sizes with 10 client writes in parallel.
• The numbers measure end-to-end throughput
• Key allocation in OM and block allocation in SCM are also included in the total throughput.
• Ozone Client
• Uses sync APIs to write data to the datanodes
• ContainerStateMachine implementation
• Parallelizes write chunk operations

Key size             10 MB    100 MB
Throughput (MB/s)    81.3     110.3
Summary
• Ratis is a Java-based implementation of the Raft protocol
• Essentially, it provides a replicated state machine.
• Suitable for data-intensive applications.
• Features
• Sync/Async client APIs
• Pluggable StateMachine
• Pluggable Raft Log Implementation
• Performance
• Write throughput: 250 to 300 MB/s
• IOPS: 10,000 txns/s

Contributors
• A big thanks to all the contributors to Apache Ratis, Apache Hadoop, and Ozone
• Animesh Trivedi, Anu Engineer, Arpit Agarwal, Brent,
• Chen Liang, Chris Nauroth, Devaraj Das, Enis Soztutar,
• garvit, Hanisha Koneru, Hugo Louro, Jakob Homan,
• Jian He, Jing Chen, Jing Zhao, Jitendra Pandey, Junping Du,
• kaiyangzhang, Karl Heinz Marbaise, Li Lu, Lokesh Jain,
• Marton Elek, Mayank Bansal, Mingliang Liu,
• Mukul Kumar Singh, Sen Zhang, Shashikant Banerjee, Sriharsha Chintalapani, Tsz Wo Nicholas Sze,
• Uma Maheswara Rao G, Venkat Ranganathan, Wangda Tan,
• Weiqing Yang, Will Xu, Xiaobing Zhou, Xiaoyu Yao, Yubo Xu,
• yue liu, Zhiyuan Yang

Apache Ratis & Apache Hadoop Ozone
• Contributions are welcome!
• Ratis
• http://ratis.incubator.apache.org
• dev@ratis.incubator.apache.org
• Ozone
• http://hadoop.apache.org
• hdfs-dev@hadoop.apache.org

Questions?

Thank you
