Maximum Overdrive:
Tuning the Spark Cassandra Connector
Russell Spitzer, Datastax
© DataStax, All Rights Reserved.
Who is this guy and why should I listen to him?
Russell Spitzer, Passing Software Engineer
• Been working at DataStax since 2013
• Worked in Test Engineering and now Analytics Dev
• Have been working with Spark since 0.9
• Working with Cassandra since 1.2
• Main focus is the Spark Cassandra Connector
• Surgically grafted to the Spark Cassandra Connector Mailing List
The Spark Cassandra Connector
Connects Spark to Cassandra
It's all there in the name
• Provides a DataSource for Datasets/DataFrames
• Provides methods for writing Datasets/DataFrames
• Reading and writing RDDs
• Connection Pooling
• Type Conversions and Mapping
• Data Locality
• Open Source Software!
https://github.com/datastax/spark-cassandra-connector
WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT SCALA
DISTRIBUTED SYSTEMS
Tuning the Spark Cassandra Connector
1 Lots of Write Tuning
2 A Bit of Read Tuning
Context is Very Important
Knowing your Data is Key for Maximum Performance
Write Tuning in the SCC Is All About Batching
"Batches aren't good for performance in Cassandra."
"Not when the writes within the batch are in the same Partition and they are unlogged! I keep telling you this!"
Multi-Partition Key Batches put load on the Coordinator
[Diagram: THE BATCH of four rows being sent to the Cassandra Cluster]
A batch moves as a single entity to the Coordinator for that write.
This batch has to sit there until all the portions of it get confirmed at their set consistency level.
Even when some portions of the batch finish early, we have to wait until the entire thing is done before we can respond to the client.
We end up with a lot of rows just sitting around in
memory waiting for others to get out of the way
Single Partition Batches are Treated as a Single Mutation in Cassandra
[Diagram: THE BATCH of twelve rows being sent to the Cassandra Cluster]
Now the entire batch can be treated as a single mutation. We also only have to wait for one set of replicas.
When all of the Rows are Going to the Same Place, Writing to Cassandra is Fast
The Connector Will Automatically Batch Writes
RDD
import com.datastax.spark.connector._
rdd.saveToCassandra("bestkeyspace", "besttable")
DataFrame
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "besttable", "keyspace" -> "bestkeyspace"))
  .save()
DataFrame (using the cassandraFormat helper)
import org.apache.spark.sql.cassandra._
df.write
  .cassandraFormat("besttable", "bestkeyspace")
  .save()
By default batching happens on Identical Partition Key
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
WriteConf(batchGroupingKey= ?)
Change it as a SparkConf or DataFrame Parameter
Or directly pass in a WriteConf
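As a minimal sketch of the "SparkConf or DataFrame parameter" route, the helper below builds the key/value pair for the grouping key. The property name spark.cassandra.output.batch.grouping.key and the values partition, replica_set, and none are taken from the connector's write tuning reference; the object and method names are hypothetical and exist only for this illustration.

```scala
object BatchGroupingConf {
  // Property name and accepted values as listed in the connector's write tuning reference.
  val GroupingKeyParam = "spark.cassandra.output.batch.grouping.key"
  val AllowedValues: Set[String] = Set("partition", "replica_set", "none")

  /** Returns the single setting to merge into a SparkConf or a DataFrame options map. */
  def groupingKey(value: String): Map[String, String] = {
    require(AllowedValues.contains(value), s"unknown grouping key: $value")
    Map(GroupingKeyParam -> value)
  }
}
```

The resulting pair could be applied with SparkConf.set(key, value) or merged into the options map of a DataFrame write; check the reference for how your connector version scopes per-write options.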
Batches are Placed in Holding Until Certain Thresholds are Hit
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
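To make the thresholds concrete, here is a deliberately simplified model of grouping on partition key. It is a sketch under stated assumptions, not the connector's implementation: maxRows stands in for output.batch.size.rows, bufferSize stands in for output.batch.grouping.buffer.size, and each batch is represented only by its row count. In this model a batch is emitted when it reaches maxRows, or when more than bufferSize batches are open, in which case the largest open batch is emitted.

```scala
import scala.collection.mutable

object BatchingModel {
  /** Returns the sizes of the batches emitted for a stream of partition keys. */
  def batchSizes(keys: Seq[Int], maxRows: Int, bufferSize: Int): Vector[Int] = {
    // Open batches: partition key -> number of rows buffered so far.
    val open = mutable.LinkedHashMap.empty[Int, Int]
    val emitted = Vector.newBuilder[Int]
    keys.foreach { key =>
      val rows = open.getOrElse(key, 0) + 1
      if (rows >= maxRows) {
        // Size threshold hit: send the full batch.
        emitted += rows
        open.remove(key)
      } else {
        open.update(key, rows)
        if (open.size > bufferSize) {
          // Buffer threshold hit: send the largest open batch to make room.
          val (largestKey, largestRows) = open.maxBy(_._2)
          emitted += largestRows
          open.remove(largestKey)
        }
      }
    }
    // End of the task: flush whatever is still buffered.
    open.values.foreach(rows => emitted += rows)
    emitted.result()
  }
}
```

In this model, input ordered by partition key fills every batch (for example keys 1,1,1,1,2,2,2,2 with maxRows 4 yield two batches of 4), while many interleaved keys force small batches out of the buffer early.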
Spark Cassandra Stress for Running Basic Benchmarks
https://github.com/datastax/spark-cassandra-stress
Running benchmarks on a bunch of AWS machines:
5 M3.2XLarge
DSE 5.0.1
Spark 1.6.1
Spark CC 1.6.0
RF = 3
2M Writes / 100K C* Partitions and 400 Spark Partitions
Caveat: Don't benchmark exactly like this
I'm making some bad decisions to make some broad points
Depending on your use case Sorting within Partitions can Greatly Increase Write Performance
[Chart: Default Conf, kOps/s. Rows Out of Order: 28, Rows In Order: 77]
Grouping on Partition Key: The Safest Thing You Can Do
Including everything in the Batch
[Chart: kOps/s. Default Conf: Rows Out of Order 28, Rows In Order 77. No Batch Key: Rows Out of Order 69, Rows In Order 125]
May be Safe For Short Durations BUT WILL LEAD TO SYSTEM INSTABILITY
Grouping on Replica Set
[Chart: kOps/s. Default Conf: Rows Out of Order 28, Rows In Order 77. Grouped on Replica Set: Rows Out of Order 70, Rows In Order 125]
Safer, but still will put extra load on the Coordinator
Remember the Tortoise vs the Hare
Overwhelming Cassandra will slow you down
Limit the amount of writes per executor : output.throughput_mb_per_sec
Limit maximum executor cores : spark.cores.max
Lower concurrency : output.concurrent.writes
DEPENDING ON DISK PERFORMANCE YOUR INITIAL SPEEDS IN BENCHMARKING MAY NOT BE SUSTAINABLE
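The three throttles above can be collected into one settings map. This is a minimal sketch: the connector property names follow the slide with the spark.cassandra. prefix added (later connector versions may spell them differently), spark.cores.max is the Spark property for capping cores, and the object name is hypothetical. Any values passed in are for you to choose; none are recommendations.

```scala
object ThrottledWriteConf {
  /** Builds the three throttling settings as string pairs ready for SparkConf.set. */
  def settings(throughputMbPerSec: Double, maxCores: Int, concurrentWrites: Int): Map[String, String] = {
    require(throughputMbPerSec > 0 && maxCores > 0 && concurrentWrites > 0, "limits must be positive")
    Map(
      "spark.cassandra.output.throughput_mb_per_sec" -> throughputMbPerSec.toString, // cap write throughput
      "spark.cores.max" -> maxCores.toString,                                        // cap cores for the application
      "spark.cassandra.output.concurrent.writes" -> concurrentWrites.toString        // cap in-flight batches per task
    )
  }
}
```

A long-running job might start from conservative values and raise them only while the cluster stays stable.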
For Example Let's run with Batch Key None for a Longer Test (20M writes)
[Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Back to Default PartitionKey Batching
[Chart: kOps/s. Default Conf: Rows Out of Order 28, Rows In Order 77. 10X Length Run: Rows Out of Order 39, Rows In Order 190]
So why are we doing so much better over a longer run?
400 Spark Partitions in Both Cases
2M / 400 = 5,000 rows per Spark Partition
20M / 400 = 50,000 rows per Spark Partition
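The arithmetic above, written out as a small helper. This is only a sketch that assumes rows are spread evenly across Spark partitions; the object and method names are hypothetical.

```scala
object RowsPerTask {
  /** Average rows each Spark partition (task) writes, assuming an even spread. */
  def rowsPerSparkPartition(totalRows: Long, sparkPartitions: Int): Long = {
    require(sparkPartitions > 0, "need at least one Spark partition")
    totalRows / sparkPartitions
  }

  /** Number of Spark partitions needed so each task sees roughly targetRowsPerTask rows. */
  def partitionsFor(totalRows: Long, targetRowsPerTask: Long): Long = {
    require(targetRowsPerTask > 0, "target must be positive")
    math.max(1L, (totalRows + targetRowsPerTask - 1) / targetRowsPerTask)
  }
}
```

For the two benchmark runs this gives 5000 and 50000 rows per task; the second helper goes the other way and suggests a partition count for a chosen task size.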
Having Too Many Partitions will Slow Down your Writes
Every task has Setup and Teardown, and we can only build up good batches if there are enough elements to build them from
Depending on your use case Sorting within Partitions can Greatly Increase Write Performance
[Chart: 10X Length Run, kOps/s. Rows Out of Order: 39, Rows In Order: 190]
A Spark sort on partition key may speed up your total operation severalfold
Maximizing Performance for Out of Order Writes or No Clustering Keys
[Chart: kOps/s. 10X Length Run: Rows Out of Order 39, Rows In Order 190. Modified Conf: Rows Out of Order 47, Rows In Order 59]
Turn Off Batching
Increase Concurrency
spark.cassandra.output.batch.size.rows 1
spark.cassandra.output.concurrent.writes 2000
"Single Partition Batches are good, I keep telling you!"
This turns the connector into a Multi-Machine Cassandra Loader (basically just executeAsync as fast as possible)
https://github.com/brianmhess/cassandra-loader
Now Let's Talk About Reading!
Read Tuning is Mostly About Partitioning
• RDDs are a large Dataset broken into bits
• These bits are called Partitions
• Cassandra Partitions != Spark Partitions
• Spark Partitions are sized based on the estimated data size of the underlying C* table
• input.split.size_in_mb
[Diagram: the TokenRange divided into Spark Partitions]
OOMs Caused by Spark Partitions Holding Too Much Data
[Diagram: Executor JVM Heap shared by Core 1, Core 2, and Core 3]
As a general rule of thumb your Executor heap should be set to hold
Number of Cores * Size of Partition * 1.2
See a lot of GC? OOM? Increase the number of partitions
Some Caveats
• We don't know the actual partition size until runtime
• Cassandra on-disk size != in-memory size
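The rule of thumb can be written as a one-line estimate. This is a sketch, not a sizing guarantee: the inflation parameter is a hypothetical knob added here to account for the caveat that data in memory can be larger than on disk, and the result is rounded up to whole megabytes.

```scala
object ExecutorHeapEstimate {
  /** Heap in MB needed to hold one partition per core, with 20% headroom. */
  def requiredHeapMb(cores: Int, partitionMb: Double, inflation: Double = 1.0): Long = {
    require(cores > 0 && partitionMb > 0 && inflation >= 1.0, "invalid sizing inputs")
    // Number of Cores * Size of Partition * 1.2, scaled by the assumed in-memory inflation.
    math.ceil(cores * partitionMb * inflation * 1.2).toLong
  }
}
```

With 3 cores and 64 MB partitions the estimate is 231 MB; if the data inflates threefold in memory, it grows to 692 MB.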
input.split.size_in_mb (default: 64)
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is
1 + 2 * SparkContext.defaultParallelism
split.size_in_mb uses the system table size_estimates to determine how many Cassandra Partitions should be in a Spark Partition.
Due to Compression and Inflation, the actual in-memory size can be much larger
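A rough sketch of how these two rules interact, assuming the estimated table size in MB is already known (the connector derives it from size estimates; here it is just an input). The object and method names are hypothetical, and the real split logic works on token ranges rather than a single division.

```scala
object SplitEstimate {
  /** Approximate Spark partition count for a table scan. */
  def estimatedSparkPartitions(tableSizeMb: Long, splitSizeMb: Int, defaultParallelism: Int): Long = {
    require(tableSizeMb >= 0 && splitSizeMb > 0 && defaultParallelism > 0, "invalid inputs")
    // Size-based estimate: one Spark partition per split.size_in_mb of data.
    val bySize = math.ceil(tableSizeMb.toDouble / splitSizeMb).toLong
    // Floor stated on the slide: 1 + 2 * SparkContext.defaultParallelism.
    val minimum = 1L + 2L * defaultParallelism
    math.max(bySize, minimum)
  }
}
```

A 6400 MB table with the default 64 MB split and a parallelism of 8 gives 100 partitions, while a 640 MB table falls back to the minimum of 17.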
Certain Queries Can't be Broken Up
• Hot Spots Make a Spark Partition OOM
  • Full C* Partition in Spark Partition
• Single Partition Lookups
  • Can't do anything about this
  • Don't know how partition is distributed
• IN clauses
  • Replace with JoinWithCassandraTable
• If all else fails use CassandraConnector
Read Speed is Mostly Dictated by Cassandra's Paging Speed
input.fetch.size_in_rows (default: 1000)
Number of CQL rows fetched per driver request
Cassandra of the Future, As Fast as CSV!?!
https://issues.apache.org/jira/browse/CASSANDRA-9259 : Bulk Reading from Cassandra
Stefania Alborghetti
The End
Don't Let it End Like That!
Contribute to the Spark Cassandra Connector
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Write Code
• Doc Improvements
• Come join us!
https://github.com/datastax/spark-cassandra-connector
See you on the mailing list!
Zero to Streaming: Spark and Cassandra
Russell Spitzer
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
Russell Spitzer
 
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Tale of Two Graph Frameworks: Graph Frames and TinkerpopTale of Two Graph Frameworks: Graph Frames and Tinkerpop
Tale of Two Graph Frameworks: Graph Frames and Tinkerpop
Russell Spitzer
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
Russell Spitzer
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
Russell Spitzer
 
Ad

Recently uploaded (20)

Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Download Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With LatestDownload Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With Latest
tahirabibi60507
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Download Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With LatestDownload Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With Latest
tahirabibi60507
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 

Maximum Overdrive: Tuning the Spark Cassandra Connector

  • 1. Maximum Overdrive: Tuning the Spark Cassandra Connector. Russell Spitzer, DataStax
  • 2. © DataStax, All Rights Reserved. Who is this guy and why should I listen to him? Russell Spitzer, Passing Software Engineer • Been working at DataStax since 2013 • Worked in Test Engineering and now Analytics Dev • Have been working with Spark since 0.9 • Working with Cassandra since 1.2 • Main focus is the Spark Cassandra Connector • Surgically grafted to the Spark Cassandra Connector Mailing List
  • 3. The Spark Cassandra Connector Connects Spark to Cassandra. It's all there in the name • Provides a DataSource for Datasets/DataFrames • Provides methods for writing Datasets/DataFrames • Reading and writing RDDs • Connection pooling • Type conversions and mapping • Data locality • Open Source Software! https://github.com/datastax/spark-cassandra-connector
  • 4. WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT SCALA DISTRIBUTED SYSTEMS Tuning the Spark Cassandra Connector DISTRIBUTED SYSTEMS
  • 5. 1. Lots of Write Tuning 2. A Bit of Read Tuning
  • 6. Context is Very Important: Knowing your Data is Key for Maximum Performance
  • 7. Write Tuning in the SCC is all about Batching. "Batches aren't good for performance in Cassandra." "Not when the writes within the batch are in the same Partition and they are unlogged!" "I keep telling you this!"
  • 8. Multi-Partition Key Batches put load on the Coordinator. [Diagram: THE BATCH, holding four rows, is sent to the Cassandra cluster.]
  • 9. Multi-Partition Key Batches put load on the Coordinator. [Diagram: the four rows of the batch inside the Cassandra cluster.] A batch moves as a single entity to the Coordinator for that write. This batch has to sit there until all the portions of it get confirmed at their set consistency level.
  • 10. Multi-Partition Key Batches put load on the Coordinator. [Diagram: the four rows of the batch inside the Cassandra cluster.] Even when some portions of the batch finish early we have to wait until the entire thing is done before we can respond to the client.
  • 11. We end up with a lot of rows just sitting around in memory waiting for others to get out of the way.
  • 12. Single Partition Batches are Treated as a Single Mutation in Cassandra. [Diagram: THE BATCH, holding twelve rows, is sent to the Cassandra cluster.]
  • 13. Single Partition Batches are Treated as a Single Mutation in Cassandra. [Diagram: the twelve rows of the batch inside the Cassandra cluster.] Now the entire batch can be treated as a single mutation. We also only have to wait for one set of replicas.
  • 14. When all of the Rows are Going to the Same Place, Writing to Cassandra is Fast.
  • 15. The Connector Will Automatically Batch Writes
    RDD:
      rdd.saveToCassandra("bestkeyspace", "besttable")
    DataFrame:
      df.write.format("org.apache.spark.sql.cassandra").options(Map("table" -> "besttable", "keyspace" -> "bestkeyspace")).save()
    DataFrame, using the implicit helper:
      import org.apache.spark.sql.cassandra._
      df.write.cassandraFormat("besttable", "bestkeyspace").save()
  • 16. By default batching happens on Identical Partition Key. WriteConf(batchGroupingKey = ?): change it as a SparkConf or DataFrame parameter, or directly pass in a WriteConf. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
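As a rough illustration of the "SparkConf or DataFrame parameter" route, the sketch below builds the grouping-key property for each of the three modes benchmarked later in the talk. The property name `spark.cassandra.output.batch.grouping.key` and the values `partition`, `replica_set` and `none` are assumed from the write tuning reference linked above and should be checked against your connector version; the object and type names are hypothetical and exist only in this sketch.

```scala
object BatchGroupingKeySketch {
  // Hypothetical local names for the three grouping modes discussed in the talk.
  sealed abstract class Grouping(val confValue: String)
  case object Partition  extends Grouping("partition")   // default: one batch per partition key
  case object ReplicaSet extends Grouping("replica_set") // batch rows that share a replica set
  case object NoGrouping extends Grouping("none")        // batch anything with anything

  // Assumed property name, taken from the connector's write tuning reference.
  val GroupingKeyParam = "spark.cassandra.output.batch.grouping.key"

  /** Returns the single key/value pair that selects the given grouping mode. */
  def groupingConf(grouping: Grouping): Map[String, String] =
    Map(GroupingKeyParam -> grouping.confValue)
}
```

A map like this could be handed to `SparkConf.setAll(...)` for the whole application, or to `df.write.options(...)` for a single write.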
  • 17. Batches are Placed in Holding Until Certain Thresholds are hit https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 18. Batches are Placed in Holding Until Certain Thresholds are hit • output.batch.grouping.buffer.size • output.batch.size.bytes / output.batch.size.rows • output.concurrent.writes • output.consistency.level https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
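The holding-and-threshold behaviour can be sketched in plain Scala without Spark or Cassandra. This is a simplified model and not the connector's actual implementation: `maxRowsPerBatch` stands in for `output.batch.size.rows`, `maxOpenBatches` stands in for `output.batch.grouping.buffer.size`, byte-size limits are ignored, and all names are hypothetical.

```scala
import scala.collection.mutable

object BatchBufferSketch {
  // A row reduced to its partition key plus an opaque payload.
  final case class Row(partitionKey: String, payload: String)

  /** Groups rows into single-partition batches.
    * A batch is emitted when it reaches maxRowsPerBatch rows, or when more
    * than maxOpenBatches batches are being held and room must be made.
    */
  def batch(rows: Seq[Row], maxRowsPerBatch: Int, maxOpenBatches: Int): Vector[Vector[Row]] = {
    val open = mutable.LinkedHashMap.empty[String, Vector[Row]]
    val emitted = Vector.newBuilder[Vector[Row]]

    for (row <- rows) {
      val updated = open.getOrElse(row.partitionKey, Vector.empty[Row]) :+ row
      if (updated.size >= maxRowsPerBatch) {
        // Size threshold hit: the batch is full, so send it.
        emitted += updated
        open.remove(row.partitionKey)
      } else {
        open.update(row.partitionKey, updated)
        if (open.size > maxOpenBatches) {
          // Too many batches in holding: release the largest to free a slot.
          val (key, largest) = open.maxBy(_._2.size)
          emitted += largest
          open.remove(key)
        }
      }
    }
    // End of the task: flush whatever is still in holding.
    open.values.foreach(emitted += _)
    emitted.result()
  }
}
```

With a small buffer, rows that arrive grouped by partition key fill whole batches, while interleaved keys force partly filled batches out early. That is the effect the benchmarks on the following slides measure.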
  • 22. Spark Cassandra Stress for Running Basic Benchmarks https://github.com/datastax/spark-cassandra-stress Running benchmarks on a bunch of AWS machines: 5 M3.2XLarge, DSE 5.0.1, Spark 1.6.1, Spark CC 1.6.0, RF = 3, 2M writes / 100K C* partitions and 400 Spark partitions. Caveat: don't benchmark exactly like this. I'm making some bad decisions to make some broad points.
  • 23. Depending on your use case Sorting within Partitions can Greatly Increase Write Performance. [Chart, kOps/s. Default Conf: 28 with rows out of order, 77 with rows in order.] Grouping on Partition Key: The Safest Thing You Can Do. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 24. Including everything in the Batch. [Chart, kOps/s. Default Conf: 28 out of order, 77 in order. No Batch Key: 69 out of order, 125 in order.] May be Safe For Short Durations BUT WILL LEAD TO SYSTEM INSTABILITY. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 25. Grouping on Replica Set. [Chart, kOps/s. Default Conf: 28 out of order, 77 in order. Grouped on Replica Set: 70 out of order, 125 in order.] Safer, but will still put extra load on the Coordinator. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 26. Remember the Tortoise vs the Hare: Overwhelming Cassandra will slow you down • Limit the amount of writes per executor: output.throughput_mb_per_sec • Limit maximum executor cores: spark.cores.max • Lower concurrency: output.concurrent.writes. DEPENDING ON DISK PERFORMANCE YOUR INITIAL SPEEDS IN BENCHMARKING MAY NOT BE SUSTAINABLE
  • 27. For Example Let's run with Batch Key None for a Longer Test (20M writes)
    [Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS
    at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
    at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
    at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
    at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
    at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
    at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
    at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
  • 29. Back to Default PartitionKey Batching. [Chart, kOps/s. Default Conf: 28 out of order, 77 in order. 10X Length Run: 39 out of order, 190 in order.] So why are we doing so much better over a longer run?
  • 30. Back to Default PartitionKey Batching. [Chart, kOps/s. Default Conf: 28 out of order, 77 in order. 10X Length Run: 39 out of order, 190 in order.] 400 Spark Partitions in both cases: 2M / 400 = 5,000 rows per Spark partition; 20M / 400 = 50,000 rows per Spark partition.
  • 31. Having Too Many Partitions will Slow Down your Writes. Every task has setup and teardown, and we can only build up good batches if there are enough elements to build them from.
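The per-task arithmetic from the benchmark can be written down as a small helper, which makes it easy to check how many rows each task gets to build batches from before choosing a partition count. The function name is hypothetical; the inputs are the figures quoted in the talk.

```scala
object TaskSizingSketch {
  /** Average number of rows each Spark task receives. */
  def rowsPerSparkPartition(totalRows: Long, sparkPartitions: Int): Long = {
    require(sparkPartitions > 0, "sparkPartitions must be positive")
    totalRows / sparkPartitions
  }

  def main(args: Array[String]): Unit = {
    // Short run and 10X run from the talk, both with 400 Spark partitions.
    println(rowsPerSparkPartition(2000000L, 400))
    println(rowsPerSparkPartition(20000000L, 400))
  }
}
```

Fewer, larger Spark partitions give each task more rows per setup and teardown cycle and more candidates for each batch.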
  • 32. Depending on your use case Sorting within Partitions can Greatly Increase Write Performance. [Chart, kOps/s. 10X Length Run: 39 out of order, 190 in order.] A Spark sort on partition key may speed up your total operation severalfold. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
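One way to see why ordering matters is to count the runs of consecutive identical partition keys in a task's input: every break in a run is a point where a batch for one key may have to compete for buffer space with another. The sketch below is plain Scala with hypothetical names and does not touch Spark.

```scala
object SortedRunsSketch {
  /** Counts runs of consecutive identical keys, e.g. a,a,b,a has three runs. */
  def consecutiveRuns[K](keys: Seq[K]): Int =
    if (keys.isEmpty) 0
    else 1 + keys.zip(keys.tail).count { case (prev, next) => prev != next }

  def main(args: Array[String]): Unit = {
    val interleaved = Seq("a", "b", "a", "b")
    println(consecutiveRuns(interleaved))        // one run per row
    println(consecutiveRuns(interleaved.sorted)) // one run per distinct key
  }
}
```

In a Spark job the corresponding step would be a sort on the partition key columns before the save, for example with the DataFrame method `sortWithinPartitions`. Whether the cost of the sort pays for itself depends on the data.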
  • 33. Maximizing performance for out of Order Writes or No Clustering Keys. [Chart, kOps/s. 10X Length Run: 39 out of order, 190 in order. Modified Conf: 47 out of order, 59 in order.] Turn Off Batching, Increase Concurrency: spark.cassandra.output.batch.size.rows 1, spark.cassandra.output.concurrent.writes 2000. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 34. Maximizing performance for out of Order Writes or No Clustering Keys. [Chart, kOps/s. 10X Length Run: 39 out of order, 190 in order. Modified Conf: 47 out of order, 59 in order.] Turn Off Batching, Increase Concurrency: spark.cassandra.output.batch.size.rows 1, spark.cassandra.output.concurrent.writes 2000. Single Partition Batches are good, I keep telling you! https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
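For reference, the two settings from the "Modified Conf" run can be kept together as a small map. The keys and values are the ones shown on the slide; the object name is hypothetical, and 2000 concurrent writes suited this particular benchmark cluster rather than being a general recommendation.

```scala
object UnbatchedWriteConfSketch {
  // Settings used for the "Modified Conf" benchmark run in the talk.
  val settings: Map[String, String] = Map(
    "spark.cassandra.output.batch.size.rows"   -> "1",    // one row per batch, i.e. batching off
    "spark.cassandra.output.concurrent.writes" -> "2000"  // many more in-flight writes per task
  )
}
```

As with the grouping key, these could be applied through `SparkConf.setAll(...)` or passed as options on a single DataFrame write.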
  • 35. This turns the connector into a Multi-Machine Cassandra Loader (basically just executeAsync as fast as possible) https://github.com/brianmhess/cassandra-loader
  • 36. Now Let's Talk About Reading!
  • 37. Read Tuning is mostly About Partitioning • RDDs are a large Dataset broken into bits • These bits are called Partitions • Cassandra Partitions != Spark Partitions • Spark Partitions are sized based on the estimated data size of the underlying C* table • input.split.size_in_mb [Diagram: TokenRange mapped to Spark Partitions]
  • 38. OOMs Caused by Spark Partitions Holding Too Much Data. [Diagram: Executor JVM Heap shared by Core 1, Core 2 and Core 3.] As a general rule of thumb your Executor heap should be sized to hold Number of Cores * Size of Partition * 1.2. See a lot of GC? OOM? Increase the number of partitions. Some Caveats • We don't know the actual partition size until runtime • Cassandra on-disk size != in-memory size
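The rule of thumb translates directly into a one-line calculation. The sketch below uses hypothetical names, and the partition size passed in should be the in-memory size, which per the caveats above can only be estimated ahead of time.

```scala
object ExecutorHeapSketch {
  // Headroom factor from the rule of thumb on the slide.
  val Headroom = 1.2

  /** Minimum heap in MB so that every core can hold one Spark partition at once. */
  def minHeapMb(cores: Int, partitionSizeMb: Double): Long =
    math.ceil(cores * partitionSizeMb * Headroom).toLong

  def main(args: Array[String]): Unit = {
    // Three cores as in the diagram, with an assumed 64 MB in-memory partition.
    println(minHeapMb(3, 64.0))
  }
}
```

If the result exceeds the configured executor heap, the options are a larger heap, fewer cores per executor, or more (smaller) Spark partitions.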
  • 39. OOMs Caused by Spark Partitions Holding Too Much Data. [Diagram: Executor JVM Heap shared by Core 1, Core 2 and Core 3.] input.split.size_in_mb (default 64): approximate amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism. split.size_in_mb uses the system table size_estimates to determine how many Cassandra Partitions should be in a Spark Partition. Due to compression and inflation, the actual in-memory size can be much larger.
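A back-of-the-envelope estimate of the resulting Spark partition count follows from the two statements on this slide. This is a sketch under the assumption that the count is simply the estimated table size divided by the split size, floored at the stated minimum; the connector's real splitter works on token ranges and may produce a different number. All names are hypothetical.

```scala
object SplitEstimateSketch {
  /** Rough number of Spark partitions for a table scan. */
  def estimatedSparkPartitions(tableSizeMb: Long, splitSizeMb: Int, defaultParallelism: Int): Long = {
    require(splitSizeMb > 0, "splitSizeMb must be positive")
    // Ceiling division: a partially filled split still needs its own partition.
    val bySize = (tableSizeMb + splitSizeMb - 1) / splitSizeMb
    // Lower bound quoted on the slide.
    val minimum = 1L + 2L * defaultParallelism
    math.max(bySize, minimum)
  }
}
```

Lowering `input.split.size_in_mb` raises the first term, which is the lever to reach for when individual partitions are too large for the executor heap.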
  • 40. Certain Queries can't be broken Up • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partition
  • 41. Certain Queries can't be broken Up • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partition • Single Partition Lookups • Can't do anything about this • Don't know how partition is distributed
  • 42. Certain Queries can't be broken Up • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partition • Single Partition Lookups • Can't do anything about this • Don't know how partition is distributed • IN clauses • Replace with joinWithCassandraTable • If all else fails use CassandraConnector
  • 43. Read speed is mostly dictated by Cassandra's Paging Speed. input.fetch.size_in_rows (default 1000): number of CQL rows fetched per driver request
  • 44. Cassandra of the Future, As Fast as CSV!?! https://issues.apache.org/jira/browse/CASSANDRA-9259 : Bulk Reading from Cassandra, Stefania Alborghetti
  • 46. Don't Let it End Like That! Contribute to the Spark Cassandra Connector • OSS Project that loves community involvement • Bug Reports • Feature Requests • Write Code • Doc Improvements • Come join us! https://github.com/datastax/spark-cassandra-connector
  • 47. See you on the mailing list! https://github.com/datastax/spark-cassandra-connector