SlideShare a Scribd company logo
©2017 LinkedIn Corporation. All Rights Reserved.
Kafka at half the price
Dong Lin
Streams Infrastructure
©2017 LinkedIn Corporation. All Rights Reserved. 2
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 3
RAID-10 setup with RF=2
producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C
©2017 LinkedIn Corporation. All Rights Reserved. 4
RAID-10 setup with RF=2
producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C
- Tolerate only one broker failure
©2017 LinkedIn Corporation. All Rights Reserved. 5
RAID-10 setup with RF=3
producer
Broker 1 Broker 3
A
B
C
A
B
C
A
B
C
A
B
C
Broker 2
A
B
C
A
B
C
- Tolerate up to two broker failures
- 50% more storage cost
©2017 LinkedIn Corporation. All Rights Reserved. 6
JBOD setup with RF=2
producer
Broker 1
A
B
C
Broker 2
A
B
C
- Tolerate only one broker failure
- 50% less storage cost
©2017 LinkedIn Corporation. All Rights Reserved. 7
JBOD setup with RF=3
producer
Broker 1
A
B
C
Broker 3
A
B
C
Broker 2
A
B
C
- Tolerate up to two broker failures
- 25% less storage cost
©2017 LinkedIn Corporation. All Rights Reserved. 8
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
©2017 LinkedIn Corporation. All Rights Reserved. 9
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
©2017 LinkedIn Corporation. All Rights Reserved. 10
RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
4 4X 3 (300% up) 3
©2017 LinkedIn Corporation. All Rights Reserved. 11
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 12
Problem 1: All replicas become offline if any log directory fails
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C
©2017 LinkedIn Corporation. All Rights Reserved. 13
Solution: Only replicas on the failed disk become offline
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C
©2017 LinkedIn Corporation. All Rights Reserved. 14
Problem 2: Controller does not recognize disk failure
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
X No further
leader election
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
©2017 LinkedIn Corporation. All Rights Reserved. 15
Solution: Broker notifies and provides partition list to controller
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2: Broker 1 has new disk failureSTEP 1: Notify disk failure
X
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
STEP 5: Elect
another broker as
leader for partition 2
©2017 LinkedIn Corporation. All Rights Reserved. 16
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
©2017 LinkedIn Corporation. All Rights Reserved. 17
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
STEP 4:
Created partition 2
(problematic)
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
©2017 LinkedIn Corporation. All Rights Reserved. 18
Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
Good disk may
become overloaded STEP 1: I am online
STEP 4:
Created partition 2
(problematic)
©2017 LinkedIn Corporation. All Rights Reserved. 19
Solution: Controller specifies whether to create log for partition
Zookeeper
Controller
STEP 3:
Become follower for partition 2
This is NOT a new partition
STEP 4:
Partition 2 is not available
and there is offline log dir
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is alive?
- Broker -> partition list
- Broker -> is new partition?
STEP 5:
Exclude broker 1 from
leader election
for partition 2
STEP 1: I am online
©2017 LinkedIn Corporation. All Rights Reserved. 20
Problem 4: No mechanism to move replicas between disks
Broker 1
P1 P2 P3
P5P4 P6
P7
Disk 1 Disk 2
©2017 LinkedIn Corporation. All Rights Reserved. 21
Example workflow to move replicas between disks
Broker
Client
STEP 1: DescribeDirRequest
STEP 2: DescribeDirResponse
Partition list and size
STEP 3: ChangeDirRequest
Disk 1 Disk 2
STEP 4: create p1.move
STEP 5: ChangeDirResponse
(Inprogress)
STEP 6: copy data from
p1.log to p1.move
STEP 7: delete p1.log and
rename p1.move to p1.log
STEP 8: Verify new assignment
via DescribeDirRequest
©2017 LinkedIn Corporation. All Rights Reserved. 22
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 23
Alternatives
▪ RAID-0 doesn’t provide disk fault tolerance
– Assume each broker has 10 disks and RF = 2
– RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
▪ RAID-5 and RAID-6 have poor performance
▪ Hardware RAID is expensive
▪ One broker per disk
©2017 LinkedIn Corporation. All Rights Reserved. 24
one-broker-per-machine vs. one-broker-per-disk
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1 Broker 2 Broker 3
V.S.
One-broker-per-machine One-broker-per-disk
©2017 LinkedIn Corporation. All Rights Reserved. 25
one-broker-per-machine vs. one-broker-per-disk
▪ Both solutions use JBOD as disk configuration
▪ Main drawbacks of one-broker-per-disk (assume 10 disk per machine)
– 100X threads and 100X sockets per machine
– 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
– 10X broker instances and configuration files to manage
– 10X time to bounce a cluster if we bounce one broker at a time
– 10X load on external service (e.g. a service used to query per-topic ACL)
– Less efficient quota enforcement
– Less efficient rebalance across disks on the same machine
– Lower throughput
©2017 LinkedIn Corporation. All Rights Reserved. 26
Experimental setup
▪ Brokers deployed on 15 machines with 10 disks per machine
IO threads Network threads Replica-fetcher threads
One-broker-per-machine 160 120 140
One-broker-per-disk 16 12 14
▪ Producers deployed on 15 machines
acks threads sync retries retry backoff message size batch size request timeout
all 50 true MAX_INT 60 sec 100 KB 1 MB MAX_INT
▪ Topic configuration
partition replication factor min-insync-replicas
512 3 3
©2017 LinkedIn Corporation. All Rights Reserved. 27
One-broker-per-machine throughput
Average throughput is 2.3 GBps
©2017 LinkedIn Corporation. All Rights Reserved. 28
One-broker-per-disk throughput
Average throughput is 2 GBps
©2017 LinkedIn Corporation. All Rights Reserved. 29
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
©2017 LinkedIn Corporation. All Rights Reserved. 30
Changes in operational procedure
▪ Adjust replication factor and min.insync.replicas
▪ Configure num.replica.move.threads for broker
▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
©2017 LinkedIn Corporation. All Rights Reserved. 31
Future work
▪ Use more intelligent solution to select log directory for new replica
▪ Automatic load balancing across log directories on the same broker
– Reduced operational overhead
▪ Distribute segments of a given replica across multiple log directories
– Less overhead for rebalance between disks
– Higher partition size limit
▪ Handle partial disk failure, e.g. disk with degraded performance.
©2017 LinkedIn Corporation. All Rights Reserved. 32
References
▪ KIP-112: Handle disk failure for JBOD (link)
▪ KIP-113: Support replicas movement between log directories (link)
©2017 LinkedIn Corporation. All Rights Reserved. 33
Ad

More Related Content

What's hot (20)

Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
mutt_data
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
Huy Vo
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with Backstage
Opsta
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
 
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
confluent
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
Ververica
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Burasakorn Sabyeying
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger Internals
Norberto Leite
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
Brendan Gregg
 
OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)
Sebastian Poxhofer
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
mutt_data
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
Huy Vo
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with Backstage
Opsta
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
confluent
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
Ververica
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Burasakorn Sabyeying
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger Internals
Norberto Leite
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
Brendan Gregg
 
OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)
Sebastian Poxhofer
 

Similar to Kafka at half the price with JBOD setup (20)

Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Mydbops
 
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 20161049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
panagenda
 
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, CitrixXPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
The Linux Foundation
 
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
Mydbops
 
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
The Linux Foundation
 
Reusing your existing software on Android
Reusing your existing software on AndroidReusing your existing software on Android
Reusing your existing software on Android
Tetsuyuki Kobayashi
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Shuo LI
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedIn
Allen Wittenauer
 
Fusion-IO - Building a High Performance and Reliable VSAN Environment
Fusion-IO - Building a High Performance and Reliable VSAN EnvironmentFusion-IO - Building a High Performance and Reliable VSAN Environment
Fusion-IO - Building a High Performance and Reliable VSAN Environment
VMUG IT
 
Galera Cluster 3.0 Features
Galera Cluster 3.0 FeaturesGalera Cluster 3.0 Features
Galera Cluster 3.0 Features
Codership Oy - Creators of Galera Cluster
 
OpenStack Days Krakow
OpenStack Days KrakowOpenStack Days Krakow
OpenStack Days Krakow
Veronika Smidova
 
Loadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkitLoadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkit
Frederic Descamps
 
GlusterFS w/ Tiered XFS
GlusterFS w/ Tiered XFS  GlusterFS w/ Tiered XFS
GlusterFS w/ Tiered XFS
Gluster.org
 
Scylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla OperatorScylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla Operator
ScyllaDB
 
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod ColledgeDb As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
sqlserver.co.il
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
Swiss Data Forum Swiss Data Forum
 
Percona 服务器与 XtraDB 存储引擎
Percona 服务器与 XtraDB 存储引擎Percona 服务器与 XtraDB 存储引擎
Percona 服务器与 XtraDB 存储引擎
YUCHENG HU
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
Redis Labs
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
RedWireServices
 
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-ReloadedRNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
panagenda
 
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...
Mydbops
 
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 20161049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
1049: Best and Worst Practices for Deploying IBM Connections - IBM Connect 2016
panagenda
 
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, CitrixXPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
The Linux Foundation
 
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
MySQL Transformation Case Study: 80% Cost Savings & Uninterrupted Availabilit...
Mydbops
 
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
The Linux Foundation
 
Reusing your existing software on Android
Reusing your existing software on AndroidReusing your existing software on Android
Reusing your existing software on Android
Tetsuyuki Kobayashi
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Shuo LI
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedIn
Allen Wittenauer
 
Fusion-IO - Building a High Performance and Reliable VSAN Environment
Fusion-IO - Building a High Performance and Reliable VSAN EnvironmentFusion-IO - Building a High Performance and Reliable VSAN Environment
Fusion-IO - Building a High Performance and Reliable VSAN Environment
VMUG IT
 
Loadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkitLoadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkit
Frederic Descamps
 
GlusterFS w/ Tiered XFS
GlusterFS w/ Tiered XFS  GlusterFS w/ Tiered XFS
GlusterFS w/ Tiered XFS
Gluster.org
 
Scylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla OperatorScylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla Operator
ScyllaDB
 
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod ColledgeDb As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
Db As Behaving Badly... Worst Practices For Database Administrators Rod Colledge
sqlserver.co.il
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
Swiss Data Forum Swiss Data Forum
 
Percona 服务器与 XtraDB 存储引擎
Percona 服务器与 XtraDB 存储引擎Percona 服务器与 XtraDB 存储引擎
Percona 服务器与 XtraDB 存储引擎
YUCHENG HU
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
Redis Labs
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
RedWireServices
 
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-ReloadedRNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
panagenda
 
Ad

More from Dong Lin (6)

FeatHub_DataFun_2023.pptx
FeatHub_DataFun_2023.pptxFeatHub_DataFun_2023.pptx
FeatHub_DataFun_2023.pptx
Dong Lin
 
FeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptxFeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptx
Dong Lin
 
FeatHub_FFA_2022
FeatHub_FFA_2022FeatHub_FFA_2022
FeatHub_FFA_2022
Dong Lin
 
基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统
Dong Lin
 
为实时机器学习设计的算法接口与迭代引擎_FFA_2021
为实时机器学习设计的算法接口与迭代引擎_FFA_2021为实时机器学习设计的算法接口与迭代引擎_FFA_2021
为实时机器学习设计的算法接口与迭代引擎_FFA_2021
Dong Lin
 
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn
An introduction to Apache Kafka and Kafka ecosystem at LinkedInAn introduction to Apache Kafka and Kafka ecosystem at LinkedIn
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn
Dong Lin
 
FeatHub_DataFun_2023.pptx
FeatHub_DataFun_2023.pptxFeatHub_DataFun_2023.pptx
FeatHub_DataFun_2023.pptx
Dong Lin
 
FeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptxFeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptx
Dong Lin
 
FeatHub_FFA_2022
FeatHub_FFA_2022FeatHub_FFA_2022
FeatHub_FFA_2022
Dong Lin
 
基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统
Dong Lin
 
为实时机器学习设计的算法接口与迭代引擎_FFA_2021
为实时机器学习设计的算法接口与迭代引擎_FFA_2021为实时机器学习设计的算法接口与迭代引擎_FFA_2021
为实时机器学习设计的算法接口与迭代引擎_FFA_2021
Dong Lin
 
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn
An introduction to Apache Kafka and Kafka ecosystem at LinkedInAn introduction to Apache Kafka and Kafka ecosystem at LinkedIn
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn
Dong Lin
 
Ad

Recently uploaded (20)

new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Journal of Soft Computing in Civil Engineering
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 

Kafka at half the price with JBOD setup

  • 1. ©2017 LinkedIn Corporation. All Rights Reserved. Kafka at half the price Dong Lin Streams Infrastructure
  • 2. ©2017 LinkedIn Corporation. All Rights Reserved. 2 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 3. ©2017 LinkedIn Corporation. All Rights Reserved. 3 RAID-10 setup with RF=2 producer Broker 1 Broker 2 A B C A B C A B C A B C
  • 4. ©2017 LinkedIn Corporation. All Rights Reserved. 4 RAID-10 setup with RF=2 producer Broker 1 Broker 2 A B C A B C A B C A B C - Tolerate only one broker failure
  • 5. ©2017 LinkedIn Corporation. All Rights Reserved. 5 RAID-10 setup with RF=3 producer Broker 1 Broker 3 A B C A B C A B C A B C Broker 2 A B C A B C - Tolerate up to two broker failures - 50% more storage cost
  • 6. ©2017 LinkedIn Corporation. All Rights Reserved. 6 JBOD setup with RF=2 producer Broker 1 A B C Broker 2 A B C - Tolerate only one broker failure - 50% less storage cost
  • 7. ©2017 LinkedIn Corporation. All Rights Reserved. 7 JBOD setup with RF=3 producer Broker 1 A B C Broker 3 A B C Broker 2 A B C - Tolerate up to two broker failures - 25% less storage cost
  • 8. ©2017 LinkedIn Corporation. All Rights Reserved. 8 RAID vs. JBOD Setup Replication Storage cost Broker failure tolerance Disk failure tolerance RAID-10 2 (baseline) 4X 1 (too small) 3 3 6X (50% up) 2 5
  • 9. ©2017 LinkedIn Corporation. All Rights Reserved. 9 RAID vs. JBOD Setup Replication Storage cost Broker failure tolerance Disk failure tolerance RAID-10 2 (baseline) 4X 1 (too small) 3 3 6X (50% up) 2 5 JBOD 2 2X (50% down) 1 (too small) 1 (too small) 3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
  • 10. ©2017 LinkedIn Corporation. All Rights Reserved. 10 RAID vs. JBOD Setup Replication Storage cost Broker failure tolerance Disk failure tolerance RAID-10 2 (baseline) 4X 1 (too small) 3 3 6X (50% up) 2 5 JBOD 2 2X (50% down) 1 (too small) 1 (too small) 3 (future) 3X (25% down) 2 (100% up) 2 (33% down) 4 4X 3 (300% up) 3
  • 11. ©2017 LinkedIn Corporation. All Rights Reserved. 11 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 12. ©2017 LinkedIn Corporation. All Rights Reserved. 12 Problem 1: All replicas become offline if any log directory fails Broker Disk A IOException when accessing disk B Disk B Disk C Broker Disk A Disk B Disk C
  • 13. ©2017 LinkedIn Corporation. All Rights Reserved. 13 Solution: Only replicas on the failed disk become offline Broker Disk A IOException when accessing disk B Disk B Disk C Broker Disk A Disk B Disk C
  • 14. ©2017 LinkedIn Corporation. All Rights Reserved. 14 Problem 2: Controller does not recognize disk failure Zookeeper Controller Broker 1 Partition 1 Partition 2 STEP 2: - Broker -> is alive? - Broker -> partition list STEP 1: I am online X No further leader election STEP 3: Become leader for partitions 1 and 2 STEP 4: partition 2 is offline
  • 15. ©2017 LinkedIn Corporation. All Rights Reserved. 15 Solution: Broker notifies and provides partition list to controller Zookeeper Controller Broker 1 Partition 1 Partition 2 STEP 2: Broker 1 has new disk failureSTEP 1: Notify disk failure X STEP 3: Become leader for partitions 1 and 2 STEP 4: partition 2 is offline STEP 5: Elect another broker as leader for partition 2
  • 16. ©2017 LinkedIn Corporation. All Rights Reserved. 16 Problem 3: Broker always creates log for partition if not exist Zookeeper Controller STEP 3: Become follower for partition 2 Create partition 2 if non-existent Broker 1 Partition 1 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list STEP 1: I am online
  • 17. ©2017 LinkedIn Corporation. All Rights Reserved. 17 Problem 3: Broker always creates log for partition if not exist Zookeeper Controller STEP 3: Become follower for partition 2 Create partition 2 if non-existent STEP 4: Created partition 2 (problematic) Broker 1 Partition 1 Partition 2 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list STEP 1: I am online
  • 18. ©2017 LinkedIn Corporation. All Rights Reserved. 18 Problem 3: Broker always creates log for partition if not exist Zookeeper Controller STEP 3: Become follower for partition 2 Create partition 2 if non-existent Broker 1 Partition 1 Partition 2 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list Good disk may become overloaded STEP 1: I am online STEP 4: Created partition 2 (problematic)
  • 19. ©2017 LinkedIn Corporation. All Rights Reserved. 19 Solution: Controller specifies whether to create log for partition Zookeeper Controller STEP 3: Become follower for partition 2 This is NOT a new partition STEP 4: Partition 2 is not available and there is offline log dir Broker 1 Partition 1 Partition 2 X STEP 2: - Broker -> is alive? - Broker -> partition list - Broker -> is new partition? STEP 5: Exclude broker 1 from leader election for partition 2 STEP 1: I am online
  • 20. ©2017 LinkedIn Corporation. All Rights Reserved. 20 Problem 4: No mechanism to move replicas between disks Broker 1 P1 P2 P3 P5P4 P6 P7 Disk 1 Disk 2
  • 21. ©2017 LinkedIn Corporation. All Rights Reserved. 21 Example workflow to move replicas between disks Broker Client STEP 1: DescribeDirRequest STEP 2: DescribeDirResponse Partition list and size STEP 3: ChangeDirRequest Disk 1 Disk 2 STEP 4: create p1.move STEP 5: ChangeDirResponse (Inprogress) STEP 6: copy data from p1.log to p1.move STEP 7: delete p1.log and rename p1.move to p1.log STEP 8: Verify new assignment via DescribeDirRequest
  • 22. ©2017 LinkedIn Corporation. All Rights Reserved. 22 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 23. ©2017 LinkedIn Corporation. All Rights Reserved. 23 Alternatives ▪ RAID-0 doesn’t provide disk fault tolerance – Assume each broker has 10 disks and RF = 2 – RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD ▪ RAID-5 and RAID-6 have poor performance ▪ Hardware RAID is expensive ▪ One broker per disk
  • 24. ©2017 LinkedIn Corporation. All Rights Reserved. 24 one-broker-per-machine vs. one-broker-per-disk Physical Machine Disk 1 Disk 2 Disk 3 Broker 1 Physical Machine Disk 1 Disk 2 Disk 3 Broker 1 Broker 2 Broker 3 V.S. One-broker-per-machine One-broker-per-disk
  • 25. ©2017 LinkedIn Corporation. All Rights Reserved. 25 one-broker-per-machine vs. one-broker-per-disk ▪ Both solutions use JBOD as disk configuration ▪ Main drawbacks of one-broker-per-disk (assume 10 disk per machine) – 100X threads and 100X sockets per machine – 10X control plane traffic from the controller to brokers (e.g. MetadataRequest) – 10X broker instances and configuration files to manage – 10X time to bounce a cluster if we bounce one broker at a time – 10X load on external service (e.g. a service used to query per-topic ACL) – Less efficient quota enforcement – Less efficient rebalance across disks on the same machine – Lower throughput
  • 26. ©2017 LinkedIn Corporation. All Rights Reserved. 26 Experimental setup ▪ Brokers deployed on 15 machines with 10 disks per machine IO threads Network threads Replica-fetcher threads One-broker-per-machine 160 120 140 One-broker-per-disk 16 12 14 ▪ Producers deployed on 15 machines acks threads sync retries retry backoff message size batch size request timeout all 50 true MAX_INT 60 sec 100 KB 1 MB MAX_INT ▪ Topic configuration partition replication factor min-insync-replicas 512 3 3
  • 27. ©2017 LinkedIn Corporation. All Rights Reserved. 27 One-broker-per-machine throughput Average throughput is 2.3 GBps
  • 28. ©2017 LinkedIn Corporation. All Rights Reserved. 28 One-broker-per-disk throughput Average throughput is 2 GBps
  • 29. ©2017 LinkedIn Corporation. All Rights Reserved. 29 Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 30. ©2017 LinkedIn Corporation. All Rights Reserved. 30 Changes in operational procedure ▪ Adjust replication factor and min.insync.replicas ▪ Configure num.replica.move.threads for broker ▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
  • 31. ©2017 LinkedIn Corporation. All Rights Reserved. 31 Future work ▪ Use more intelligent solution to select log directory for new replica ▪ Automatic load balancing across log directories on the same broker – Reduced operational overhead ▪ Distribute segments of a given replica across multiple log directories – Less overhead for rebalance between disks – Higher partition size limit ▪ Handle partial disk failure, e.g. disk with degraded performance.
  • 32. ©2017 LinkedIn Corporation. All Rights Reserved. 32 References ▪ KIP-112: Handle disk failure for JBOD (link) ▪ KIP-113: Support replicas movement between log directories (link)
  • 33. ©2017 LinkedIn Corporation. All Rights Reserved. 33