Spark-Cassandra Integration 2016
DuyHai DOAN
Apache Cassandra Evangelist
@doanduyhai
Main use-cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize, transform data
•  Schema migration, data conversion
Data import
•  Read data from CSV and dump it into Cassandra?
☞ Spark job to distribute the import!
Demo
Data cleaning
•  Bugs in your application?
•  Dirty input data?
☞ Spark job to clean it up!
Demo
Schema migration
•  Business requirements change over time?
•  Current data model no longer relevant?
☞ Spark job to migrate the data!
Demo
Analytics
Given existing tables of performers and albums, I want:
①  the top 10 most common music styles (pop, rock, RnB, …)
②  performer productivity (album count) by origin country and by decade
☞ Spark job to compute the analytics!
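The two questions above can be sketched with plain Scala collections, which keeps the example runnable without a cluster. The Performer and Album case classes and their fields are hypothetical, not the schema used in the talk; on Spark the same flatMap / group / count shape would be expressed over RDDs or DataFrames.

```scala
// Hypothetical data model: names and fields are assumptions for illustration.
final case class Performer(name: String, country: String, styles: Seq[String])
final case class Album(title: String, performer: String, year: Int)

object MusicAnalytics {
  // (1) Count how many performers carry each style and keep the n most common.
  def topStyles(performers: Seq[Performer], n: Int): Seq[(String, Int)] =
    performers
      .flatMap(_.styles)
      .groupBy(identity)
      .map { case (style, occurrences) => style -> occurrences.size }
      .toSeq
      .sortBy { case (style, count) => (-count, style) } // ties broken by style name
      .take(n)

  // (2) Join albums to performers by name, then count albums per (country, decade).
  // Albums whose performer is unknown are dropped by the join.
  def albumsByCountryAndDecade(
      performers: Seq[Performer],
      albums: Seq[Album]): Map[(String, Int), Int] = {
    val countryOf = performers.map(p => p.name -> p.country).toMap
    albums
      .flatMap(a => countryOf.get(a.performer).map(c => (c, a.year / 10 * 10)))
      .groupBy(identity)
      .map { case (key, rows) => key -> rows.size }
  }
}
```

Calling MusicAnalytics.topStyles(performers, 10) answers ①; albumsByCountryAndDecade answers ②, with the decade computed as the year rounded down to a multiple of 10.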
Connector Architecture
•  Cluster Deployment
•  Data Locality
•  Failure Handling
•  Cross DC/cluster operations
Cluster Deployment
•  Stand-alone cluster
[Diagram: five Cassandra (C*) nodes, each running a Spark worker (SparkW); one node also runs the Spark master (SparkM)]
Data Locality – remember token ranges?
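A: $\left]-x, -\frac{3x}{4}\right]$
B: $\left]-\frac{3x}{4}, -\frac{2x}{4}\right]$
C: $\left]-\frac{2x}{4}, -\frac{x}{4}\right]$
D: $\left]-\frac{x}{4}, 0\right]$
E: $\left]0, \frac{x}{4}\right]$
F: $\left]\frac{x}{4}, \frac{2x}{4}\right]$
G: $\left]\frac{2x}{4}, \frac{3x}{4}\right]$
H: $\left]\frac{3x}{4}, x\right]$
[Diagram: ring of eight Cassandra (C*) nodes, each owning one of the token ranges A to H]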
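As a reminder of how a token maps to its owning node, here is a minimal sketch assuming eight equal ranges over ]−x, x], each open on the left and closed on the right as on the slide. TokenRange and TokenRing are hypothetical names; a real Cassandra ring wraps around and usually uses vnodes, which this toy model ignores.

```scala
// One contiguous token range owned by a node: ]startExclusive, endInclusive].
final case class TokenRange(node: String, startExclusive: Long, endInclusive: Long) {
  def contains(token: Long): Boolean = token > startExclusive && token <= endInclusive
}

object TokenRing {
  // Split ]-x, x] into nodes.size equal ranges.
  // Assumes 2 * x is divisible by nodes.size (true for the demo values below).
  def ranges(x: Long, nodes: Seq[String]): Seq[TokenRange] = {
    val width = 2 * x / nodes.size
    nodes.zipWithIndex.map { case (node, i) =>
      TokenRange(node, -x + i * width, -x + (i + 1) * width)
    }
  }

  // Find the node whose range contains the token, if any.
  def ownerOf(token: Long, ring: Seq[TokenRange]): Option[String] =
    ring.find(_.contains(token)).map(_.node)
}
```

With x = 400 and nodes A to H, each range is 100 tokens wide, so TokenRing.ownerOf(0L, ring) falls in D = ]−100, 0] and ownerOf(1L, ring) falls in E = ]0, 100].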
Data Locality – how to
[Diagram: Spark RDD partitions mapped onto Cassandra token ranges; each Spark worker (SparkW) runs on the same node as a Cassandra (C*) instance]
Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica()
→ to a table having the same partition key as the original table
•  save back into this Cassandra table
Use case: sanitize, validate, normalize, transform data
Failure Handling
What if 1 node is down?
What if 1 node is overloaded?
☞ The Spark master will re-assign the job to another worker
Oh no, my data locality!!!
Data Locality Impl
abstract class RDD[T](…) {
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  protected def getPartitions: Array[Partition]

  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
CassandraRDD
def getPreferredLocations(split: Partition): returns the IP addresses of the Cassandra replicas corresponding to this Spark partition
Failure Handling
If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎
Tune parameters:
•  spark.locality.wait
•  spark.locality.wait.process
•  spark.locality.wait.node
Only works for fixed token ranges (vnodes)
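The fallback described here can be illustrated with a toy model. This is not Spark's scheduler: LocalityAwareScheduling and pickHost are hypothetical names, and the waiting governed by the spark.locality.wait settings is ignored. The sketch only shows why RF > 1 keeps the task on a replica when the first preferred host is unavailable.

```scala
object LocalityAwareScheduling {
  // preferred: replica hosts for a partition, in preference order
  //            (what getPreferredLocations would report).
  // alive:     hosts currently able to run tasks.
  // Returns the chosen host and whether data locality is preserved.
  def pickHost(preferred: Seq[String], alive: Set[String]): Option[(String, Boolean)] =
    preferred.find(host => alive.contains(host)) match {
      case Some(replica) => Some((replica, true))
      // No replica is available: fall back to any live host, losing locality.
      case None => alive.toSeq.sorted.headOption.map(host => (host, false))
    }
}
```

With RF = 1 the preferred list holds a single host, so losing that host always lands in the second branch and the read goes over the network.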
Cross cluster/DC operations
Tales from the field, SASI index benchmark
•  Deployment automation
•  Parallel ingestion
•  Migrating data
•  Spark + Cassandra 3.4 SASI index for topK query
Deployment Automation
Use Ansible to bootstrap a cluster
•  role tools (install vim, htop, dstat, fio, jmxterm, …)
•  role Cassandra. Do not declare all nodes as seeds…
•  role Spark (vanilla Spark). Slave on all nodes, master on a random node
DO NOT START ALL CASSANDRA NODES AT THE SAME TIME!!!!
•  bootstrap the seed nodes first
•  allow ≥ 30 secs between two node bootstraps for token range agreement
•  watch -n 5 nodetool status
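The start-up ordering above can be sketched as a small planning function. Node and BootstrapPlan are hypothetical names, and the 30-second gap is the minimum quoted on the slide; an actual deployment would drive this from the Ansible roles rather than from Scala.

```scala
// A cluster member and whether it is configured as a seed.
final case class Node(host: String, isSeed: Boolean)

object BootstrapPlan {
  // Seeds start first, then the remaining nodes; each start is delayed by
  // gapSeconds relative to the previous one. Returns (host, offset in seconds).
  def startOffsets(nodes: Seq[Node], gapSeconds: Int = 30): Seq[(String, Int)] = {
    val (seeds, others) = nodes.partition(_.isSeed)
    (seeds ++ others).zipWithIndex.map { case (node, i) => node.host -> i * gapSeconds }
  }
}
```

For nodes n1 (non-seed), n2 (seed) and n3 (non-seed), the plan starts n2 at 0 s, n1 at 30 s and n3 at 60 s.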
Parallel ingestion for SASI index benchmark
Hardware specs
•  13 nodes
•  6 cores CPU (HT)
•  4 SSD in RAID 0 😎
•  64 GB of RAM
Cassandra conf:
•  G1GC, 32 GB JVM heap
•  compaction throughput = 256 MB/s
•  concurrent compactors = 2
3.2 billion rows in 17 h
(compaction disabled)
RF = 2
☞ ≈ 8000 ips
I/O idle, high CPU
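The slide does not say how the ≈ 8000 figure was derived. One reading that is consistent with the stated numbers (13 nodes, RF = 2, 3.2 billion rows, 17 hours) is replica writes per node per second; the sketch below checks that arithmetic and should be treated as an assumption, not as the author's calculation.

```scala
object IngestionRate {
  val rows = 3200000000L        // total rows ingested
  val seconds = 17L * 3600      // 17 hours
  val nodes = 13                // from the hardware specs
  val replicationFactor = 2     // RF = 2

  // Logical rows per second across the whole cluster (integer division).
  val clusterRowsPerSec: Long = rows / seconds

  // Assumed interpretation: each row is written RF times, spread over all nodes.
  val replicaWritesPerNodePerSec: Long = clusterRowsPerSec * replicationFactor / nodes
}
```

Under that assumption the cluster absorbs about 52 000 rows/sec overall, which is about 8 000 replica writes per node per second.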
Migrating Data
TopK query
Pass 1, for each music provider
•  sum album sales count by title
•  take the top N, assign weights in descending order (1st = 1000, 2nd = 999, …)
Pass 2, retrieve all albums from pass 1
•  re-sum the sum(sales count) and the weight, grouped by title
•  order again by sum(sales count) in descending order
•  take the top N
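The two passes can be sketched on plain Scala collections. Sale and Ranked are hypothetical types, and ties are broken by title only to make the output deterministic; in the talk's setting the aggregation runs in Spark after SASI-based filtering.

```scala
// Hypothetical input row and intermediate result.
final case class Sale(provider: String, title: String, count: Long)
final case class Ranked(title: String, sales: Long, weight: Int)

object TopK {
  // Pass 1: per provider, sum sales by title, keep the top n,
  // and weight them 1000, 999, ... by descending rank.
  def pass1(sales: Seq[Sale], n: Int): Seq[Ranked] =
    sales.groupBy(_.provider).values.toSeq.flatMap { rows =>
      rows.groupBy(_.title)
        .map { case (title, rs) => title -> rs.map(_.count).sum }
        .toSeq
        .sortBy { case (title, total) => (-total, title) }
        .take(n)
        .zipWithIndex
        .map { case ((title, total), rank) => Ranked(title, total, 1000 - rank) }
    }

  // Pass 2: merge the per-provider candidates by title, summing sales and
  // weights, then order by total sales (descending) and keep the top n.
  def pass2(candidates: Seq[Ranked], n: Int): Seq[Ranked] =
    candidates.groupBy(_.title)
      .map { case (title, rs) => Ranked(title, rs.map(_.sales).sum, rs.map(_.weight).sum) }
      .toSeq
      .sortBy(r => (-r.sales, r.title))
      .take(n)
}
```

TopK.pass2(TopK.pass1(sales, n), n) gives the final ranking. Sales of a title that missed a provider's top N in pass 1 are not counted in pass 2, so the result is an approximation of the exact global top N.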
Target data set = 3.2 billion rows
•  minimum filter = 1 month (e.g. period_end_month = 201404)
•  worst filter = 3 months range
•  +8 other dynamic filters (music provider, distribution type …)
☞ SASI indices for filtering
☞ Spark for aggregation
TopK query results
3.2 billion rows in total
•  random distribution over 3 years (36 months) → 88 million rows/month
Filters                          | #rows       | Duration           | #rows/sec
3 months                         | 376 947 612 | 14 mins (840 secs) | 448 747
1 month                          | 94 239 127  | 6.1 mins (366 secs) | 257 483
1 month + 1 provider             | 7 267 983   | 2.1 mins (126 secs) | 57 682
1 month + 1 provider + 1 country | 2 737 178   | 1.5 mins (90 secs)  | 30 413
Q & A
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
