Maximum Overdrive: Tuning the Spark Cassandra Connector

Russell Spitzer, DataStax

© DataStax, All Rights Reserved.
Who is this guy and why should I listen to him?
Russell Spitzer, Passing Software Engineer
• Been working at DataStax since 2013
• Worked in Test Engineering and now Analytics Dev
• Have been working with Spark since 0.9
• Working with Cassandra since 1.2
• Main focus is the Spark Cassandra Connector
• Surgically grafted to the Spark Cassandra Connector Mailing List

The Spark Cassandra Connector
Connects Spark to Cassandra
It's all there in the name
• Provides a DataSource for Datasets/DataFrames
• Provides methods for Writing Datasets/DataFrames
• Reading and Writing RDDs
• Connection Pooling
• Type Conversions and Mapping
• Data Locality
• Open Source Software!
https://github.com/datastax/spark-cassandra-connector

WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT SCALA
DISTRIBUTED SYSTEMS

Tuning the Spark Cassandra Connector
1 Lots of Write Tuning
2 A Bit of Read Tuning

Context is Very Important
Knowing your Data is Key for Maximum Performance

Write Tuning in the SCC Is All About Batching
"Batches aren't good for performance in Cassandra."
"Not when the writes within the batch are in the same Partition and they are unlogged! I keep telling you this!"

Multi-Partition Key Batches Put Load on the Coordinator
[Diagram: THE BATCH, four rows, being sent to the Cassandra Cluster]
[Diagram: the Coordinator and the four rows of the batch in the Cassandra Cluster]
A batch moves as a single entity to the Coordinator for that write. This batch has to sit there until all the portions of it get confirmed at their set consistency level.

Even when some portions of the batch finish early, we have to wait until the entire thing is done before we can respond to the client.

We end up with a lot of rows just sitting around in memory waiting for others to get out of the way

Single Partition Batches are Treated as a Single Mutation in Cassandra
[Diagram: THE BATCH, twelve rows from one partition, being sent to the Cassandra Cluster]
Now the entire batch can be treated as a single mutation. We also only have to wait for one set of replicas.

When All of the Rows are Going to the Same Place, Writing to Cassandra is Fast

The Connector Will Automatically Batch Writes
RDD
import com.datastax.spark.connector._  // adds saveToCassandra to RDDs
rdd.saveToCassandra("bestkeyspace", "besttable")
DataFrame
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "besttable", "keyspace" -> "bestkeyspace"))
  .save()
DataFrame, using the cassandraFormat helper
import org.apache.spark.sql.cassandra._
df.write
  .cassandraFormat("besttable", "bestkeyspace")
  .save()

By Default Batching Happens on Identical Partition Key
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
WriteConf(batchGroupingKey = ?)
Change it as a SparkConf or DataFrame parameter, or directly pass in a WriteConf
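A minimal sketch of the SparkConf route, assuming spark-core is on the classpath and that your connector version uses the property name from the linked reference (spark.cassandra.output.batch.grouping.key). The object name is only an illustration.

```scala
import org.apache.spark.SparkConf

object BatchGroupingConf {
  // Values listed in the connector reference: "partition" (the default),
  // "replica_set", and "none". Check the reference for your connector version.
  val conf: SparkConf = new SparkConf()
    .set("spark.cassandra.output.batch.grouping.key", "partition")
}
```

A SparkContext built from this conf should apply the grouping key to every saveToCassandra call that does not pass its own WriteConf.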

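Batches are Placed in Holding Until Certain Thresholds are Hit
output.batch.grouping.buffer.size: how many batches a single Spark task may hold in memory before sending
output.batch.size.bytes / output.batch.size.rows: maximum size of one batch, in bytes or in rows
output.concurrent.writes: maximum number of batches a single Spark task executes in parallel
output.consistency.level: consistency level used for the writes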
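To make the holding behavior concrete, here is a toy model in plain Scala. It is a simplified sketch, not the connector's code: GroupingBuffer, Row, and batchSizes are hypothetical names, and the assumed eviction policy (send the largest open batch when the buffer is over its limit) is an approximation of what the grouping buffer does.

```scala
import scala.collection.mutable

object BatchingSketch {
  final case class Row(partitionKey: Int, value: String)

  // Hypothetical, simplified model of a grouping buffer; not the connector's actual code.
  final class GroupingBuffer(maxRowsPerBatch: Int, maxOpenBatches: Int) {
    // Open batches keyed by partition key, in insertion order.
    private val open = mutable.LinkedHashMap.empty[Int, Vector[Row]]
    private val sent = mutable.ArrayBuffer.empty[Vector[Row]]

    def add(row: Row): Unit = {
      val batch = open.getOrElse(row.partitionKey, Vector.empty[Row]) :+ row
      if (batch.size >= maxRowsPerBatch) {
        // Row threshold hit: the batch is full, so send it.
        open.remove(row.partitionKey)
        sent += batch
      } else {
        open(row.partitionKey) = batch
        if (open.size > maxOpenBatches) {
          // Buffer threshold hit: send the largest open batch to make room.
          val (key, largest) = open.maxBy(_._2.size)
          open.remove(key)
          sent += largest
        }
      }
    }

    // End of the task: send whatever is still being held.
    def finish(): Vector[Vector[Row]] = {
      sent ++= open.values
      open.clear()
      sent.toVector
    }
  }

  // Sizes of the batches produced for a stream of rows, with 4 rows per batch.
  def batchSizes(rows: Seq[Row], maxOpenBatches: Int): Vector[Int] = {
    val buffer = new GroupingBuffer(maxRowsPerBatch = 4, maxOpenBatches = maxOpenBatches)
    rows.foreach(buffer.add)
    buffer.finish().map(_.size)
  }

  // Twelve rows over three partition keys, grouped by key and interleaved.
  val inOrder: Seq[Row] = for (pk <- 1 to 3; i <- 1 to 4) yield Row(pk, s"v$i")
  val outOfOrder: Seq[Row] = for (i <- 1 to 4; pk <- 1 to 3) yield Row(pk, s"v$i")
}
```

In this model, batchSizes(inOrder, 1) gives three full batches of four rows, while batchSizes(outOfOrder, 1) gives twelve single-row batches because every new key pushes the previous one out of the buffer. Raising the buffer to three open batches lets the interleaved input fill its batches again.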


Spark Cassandra Stress for Running Basic Benchmarks
https://github.com/datastax/spark-cassandra-stress
Running benchmarks on a bunch of AWS machines:
• 5 m3.2xlarge
• DSE 5.0.1
• Spark 1.6.1
• Spark Cassandra Connector 1.6.0
• RF = 3
• 2M writes / 100K C* Partitions and 400 Spark Partitions
Caveat: Don't benchmark exactly like this.
I'm making some bad decisions to make some broad points.

Depending on Your Use Case, Sorting within Partitions can Greatly Increase Write Performance
[Chart, Default Conf kOps/s: Rows Out of Order 28, Rows In Order 77]
Grouping on Partition Key: The Safest Thing You Can Do

Including Everything in the Batch
[Chart, kOps/s: Default Conf 28 and 77; No Batch Key 69 and 125]
May be Safe For Short Durations BUT WILL LEAD TO SYSTEM INSTABILITY

Grouping on Replica Set
[Chart, kOps/s: Default Conf 28 and 77; Grouped on Replica Set 70 and 125]
Safer, but still will put extra load on the Coordinator

Remember the Tortoise vs the Hare
Overwhelming Cassandra will slow you down
• Limit the amount of writes per executor: output.throughput_mb_per_sec
• Limit maximum executor cores: spark.cores.max
• Lower concurrency: output.concurrent.writes
DEPENDING ON DISK PERFORMANCE, YOUR INITIAL SPEEDS IN BENCHMARKING MAY NOT BE SUSTAINABLE
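The three throttles could be set together like this. It is a sketch only: the numbers are placeholders rather than recommendations, the object name is hypothetical, and the property names follow the 1.6-era reference (later connector versions may spell the throughput setting differently).

```scala
import org.apache.spark.SparkConf

object ThrottledWriteConf {
  // Placeholder values; tune them against what your cluster can sustain.
  val conf: SparkConf = new SparkConf()
    // Cap write throughput, in MB/s.
    .set("spark.cassandra.output.throughput_mb_per_sec", "5")
    // Cap the total number of cores the application may claim.
    .set("spark.cores.max", "12")
    // Allow fewer batches in flight per Spark task.
    .set("spark.cassandra.output.concurrent.writes", "2")
}
```

Changing one setting at a time and re-running a long benchmark makes it easier to see which limit actually keeps the write rate sustainable.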

For Example, Let's Run with Batch Key None for a Longer Test (20M Writes)
[Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskS
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Back to Default PartitionKey Batching
[Chart, kOps/s: Default Conf 28 and 77; 10X Length Run 39 and 190]
So why are we doing so much better over a longer run?
400 Spark Partitions in both cases:
2M / 400 = 5,000 rows per Spark Partition
20M / 400 = 50,000 rows per Spark Partition

Having Too Many Partitions will Slow Down Your Writes
Every task has Setup and Teardown, and we can only build up good batches if there are enough elements to build them from
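The arithmetic behind the 2M and 20M runs, plus its inverse, can be written as a small helper. This is a sketch assuming rows are spread evenly across Spark Partitions; TaskSizingSketch and its method names are hypothetical.

```scala
object TaskSizingSketch {
  // Rows each Spark task receives when rows are spread evenly over partitions.
  def rowsPerTask(totalRows: Long, sparkPartitions: Int): Long =
    totalRows / sparkPartitions

  // Partition count that gives each task roughly targetRowsPerTask rows (at least 1).
  def partitionsFor(totalRows: Long, targetRowsPerTask: Long): Int =
    math.max(1L, math.ceil(totalRows.toDouble / targetRowsPerTask).toLong).toInt
}
```

For the 2M-row run, partitionsFor(2000000L, 50000L) suggests 40 Spark Partitions to match the per-task row count of the longer run. A number like that could be passed to Spark's coalesce or repartition before the write, though whether it helps depends on the data.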

Depending on Your Use Case, Sorting within Partitions can Greatly Increase Write Performance
[Chart, kOps/s: 10X Length Run 39 and 190]
A Spark sort on partition key may speed up your total operation several-fold

Maximizing Performance for Out of Order Writes or No Clustering Keys
[Chart, kOps/s: 10X Length Run 39 and 190; Modified Conf 47 and 59]
Turn Off Batching
Increase Concurrency
spark.cassandra.output.batch.size.rows 1
spark.cassandra.output.concurrent.writes 2000
"Single Partition Batches are good, I keep telling you!"
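The two settings above, expressed as a SparkConf. This is a sketch assuming spark-core is on the classpath; the values are the ones from the slide, and the object name is hypothetical.

```scala
import org.apache.spark.SparkConf

object UnbatchedWriteConf {
  val conf: SparkConf = new SparkConf()
    // One row per batch effectively turns batching off.
    .set("spark.cassandra.output.batch.size.rows", "1")
    // Many more writes in flight per task to make up for the lost batching.
    .set("spark.cassandra.output.concurrent.writes", "2000")
}
```

This trade only makes sense when rows for the same partition rarely arrive together; with ordered data the default partition-key batching was the faster option in the benchmark above.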

This turns the connector into a Multi-Machine Cassandra Loader (basically just executeAsync as fast as possible)
https://github.com/brianmhess/cassandra-loader

Now Let's Talk About Reading!

Read Tuning is Mostly About Partitioning
• RDDs are a large Dataset broken into bits
• These bits are called Partitions
• Cassandra Partitions != Spark Partitions
• Spark Partitions are sized based on the estimated data size of the underlying C* table
• input.split.size_in_mb
[Diagram: the TokenRange divided into Spark Partitions]

OOMs Caused by Spark Partitions Holding Too Much Data
[Diagram: Executor JVM Heap shared by Core 1, Core 2, and Core 3]
As a general rule of thumb, your Executor heap should be set to hold
Number of Cores * Size of Partition * 1.2
See a lot of GC? OOM? Increase the number of partitions.
Some Caveats
• We don't know the actual partition size until runtime
• Cassandra on-disk size != in-memory size
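The rule of thumb as a small helper, in plain Scala. It is a sketch: ExecutorHeapSketch and minHeapMb are hypothetical names, and the partition size you pass in should be your estimate of the in-memory size, which the caveats above say is not known until runtime.

```scala
object ExecutorHeapSketch {
  // Number of Cores * Size of Partition * 1.2, rounded up to whole megabytes.
  def minHeapMb(cores: Int, partitionSizeMb: Double, overhead: Double = 1.2): Long =
    math.ceil(cores * partitionSizeMb * overhead).toLong
}
```

For example, three cores each holding a partition assumed to inflate to 256 MB in memory gives ExecutorHeapSketch.minHeapMb(3, 256), about 922 MB, as a floor for the executor heap.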

input.split.size_in_mb (default 64)
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is
1 + 2 * SparkContext.defaultParallelism
split.size_in_mb uses the system table size_estimates to determine how many Cassandra Partitions should be in a Spark Partition.
Due to compression and inflation, the actual in-memory size can be much larger.
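A rough way to reason about how many Spark Partitions a read will produce, in plain Scala. This is a sketch under an assumption: that the count is about the estimated table size divided by the split size, floored at the stated minimum. The connector's real calculation works from size_estimates and may differ in detail; SplitSketch and its method names are hypothetical.

```scala
object SplitSketch {
  // Minimum stated on the slide: 1 + 2 * SparkContext.defaultParallelism.
  def minPartitions(defaultParallelism: Int): Int =
    1 + 2 * defaultParallelism

  // Assumed approximation: estimated size / split size, never below the minimum.
  def estimatedPartitions(tableSizeMb: Long, splitSizeMb: Long, defaultParallelism: Int): Long =
    math.max(
      math.ceil(tableSizeMb.toDouble / splitSizeMb).toLong,
      minPartitions(defaultParallelism).toLong
    )
}
```

Under this approximation a 10 GB table with the default 64 MB split and a parallelism of 12 yields 160 Spark Partitions, while a 640 MB table is lifted from 10 to the minimum of 25.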

Certain Queries Can't be Broken Up
• Hot Spots Make a Spark Partition OOM
  • Full C* Partition in Spark Partition
• Single Partition Lookups
  • Can't do anything about this
  • Don't know how partition is distributed
• IN clauses
  • Replace with joinWithCassandraTable
  • If all else fails use CassandraConnector
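A sketch of swapping an IN clause for joinWithCassandraTable. It assumes the connector is on the classpath, a reachable cluster, and that bestkeyspace.besttable (the names used earlier in the talk) has a single int partition key column. KeyLookupSketch and lookup are hypothetical names.

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext

object KeyLookupSketch {
  // Assumption: the table's partition key is one int column, so each key is
  // wrapped in a Tuple1 to line up with it.
  def lookup(sc: SparkContext, keys: Seq[Int]) =
    sc.parallelize(keys)
      .map(Tuple1(_))
      .joinWithCassandraTable("bestkeyspace", "besttable")
}
```

Each Spark Partition of keys is looked up separately, so the work is spread across tasks instead of being sent as one large IN query.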

Read Speed is Mostly Dictated by Cassandra's Paging Speed
input.fetch.size_in_rows (default 1000): number of CQL rows fetched per driver request
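Both read-side knobs from the talk could be set on a SparkConf as below. This is a sketch assuming spark-core is on the classpath and the 1.6-era property names; the values shown are the defaults quoted on the slides, and the object name is hypothetical.

```scala
import org.apache.spark.SparkConf

object ReadTuningConf {
  val conf: SparkConf = new SparkConf()
    // Approximate amount of data per Spark Partition, in MB.
    .set("spark.cassandra.input.split.size_in_mb", "64")
    // CQL rows fetched per driver request (the page size).
    .set("spark.cassandra.input.fetch.size_in_rows", "1000")
}
```

Lowering the split size is the usual first step when executors hit GC pressure or OOMs; the fetch size affects how many round trips each task makes while paging.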

Cassandra of the Future, As Fast as CSV!?!
https://issues.apache.org/jira/browse/CASSANDRA-9259: Bulk Reading from Cassandra
Stefania Alborghetti

Don't Let it End Like That!
Contribute to the Spark Cassandra Connector
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Write Code
• Doc Improvements
• Come join us!

See you on the mailing list!
