Spark + Cassandra = Real Time Analytics on Operational Data

Real-Time Analytics on Transactional Data
= Cassandra + Spark 20/02/15
Victor Coustenoble, Solutions Engineer
victor.coustenoble@datastax.com
@vizanalytics

How do you use Cassandra?
By monitoring your energy consumption
By watching streaming movies
By browsing websites
By shopping online
By paying with a smartphone
By playing very well-known video games
• Collections/Playlists
• Recommendation/Personalization
• Fraud Detection
• Messaging
• Connected Objects (Internet of Things)

Overview
Founded in April 2010
Santa Clara, Austin, New York, London, Paris, Sydney
Figures shown on the slide: ~35, 500+ and 400+, with the labels "Employees", "Percent" and "Customers"

Straightening the road
Cassandra / DataStax | RELATIONAL DATABASES
CQL | SQL
OpsCenter / DevCenter | Management tools
DSE for search & analytics | Integration
Security | Security
Support, consulting & training | 30 years ecosystem

Apache Cassandra™
• Apache Cassandra™ is an open-source, distributed NoSQL database built for modern, mission-critical online applications that must scale massively.
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• No master/slave roles (peer-to-peer), no Single Point of Failure (No SPOF)
• Distributed, with the option of spanning data centers
• 100% availability
• Massively scalable
• Linear scalability
• High performance (reads AND writes)
• Multi Data Center
• Simple to operate
• CQL language (similar to SQL)
• OpsCenter / DevCenter tools
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[Diagram: cluster of five peer nodes, Node 1 to Node 5]

High Availability and Consistency
• The failure of a single node must not bring down the system
• Consistency is chosen at the client level
• Replication Factor (RF) + Consistency Level (CL) = Success
• Example:
• RF = 3
• CL = QUORUM (a majority of the replicas, shown as "51%" on the slide; 2 of 3 when RF = 3)
©2014 DataStax Confidential. Do not distribute without consent.
[Diagram: a write at CL=QUORUM is sent in parallel to the three replicas on Node 1 (1st copy), Node 2 (2nd copy) and Node 3 (3rd copy); Node 4 and Node 5 are also shown. Acks arrive after 5 μs, 12 μs and 12 μs.]
More than 51% of the replicas have responded, so the request succeeds
CL(Read) + CL(Write) > RF => Immediate/Strong Consistency
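The rule CL(Read) + CL(Write) > RF can be checked with a few lines of arithmetic. The sketch below is only an illustration: the object and function names are hypothetical, and it assumes QUORUM means a strict majority of the replicas, i.e. floor(RF / 2) + 1.

```scala
object ConsistencyMath {
  // Replicas that must answer at CL=QUORUM, assuming a strict majority.
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  // True when every read set must overlap every write set in at least one
  // replica, which is the condition CL(Read) + CL(Write) > RF.
  def isStronglyConsistent(readReplicas: Int,
                           writeReplicas: Int,
                           replicationFactor: Int): Boolean =
    readReplicas + writeReplicas > replicationFactor
}
```

With RF = 3, `ConsistencyMath.quorum(3)` is 2, so reading and writing at QUORUM gives 2 + 2 > 3 and the check passes, whereas reading and writing at ONE gives 1 + 1 > 3, which fails.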

DataStax Enterprise
Cassandra: certified, enterprise-ready
[Diagram: DataStax Enterprise components around Cassandra: Security, Analytics, Search, Visual Monitoring, Management Services, In-Memory, Dev. IDE & Drivers, Professional Services, Support & Training, grouped under "Confidence in use" and "Enterprise features"]

DataStax Enterprise - Analytics
• Designed for running analytics on Cassandra data
• There are 4 ways to do analytics on Cassandra data:
1. Search (Solr)
2. Batch analytics (Hadoop)
3. Batch analytics with external tools (Cloudera, Hortonworks)
4. Real-time analytics

Partnership

Why Spark on Cassandra?
• Analytics on transactional data and operational applications
• Data model independent queries
• Cross-table operations (JOIN, UNION, etc.)
• Complex analytics (e.g. machine learning)
• Data transformation, aggregation, etc.
• Stream processing
• Better performance than Hadoop MapReduce

Real-time Big Data
Data Enrichment
Batch Processing
Machine Learning
Pre-computed aggregates
Data
NO ETL

Real-Time Big Data Use Cases
• Recommendation Engine
• Internet of Things
• Fraud Detection
• Risk Analysis
• Buyer Behaviour Analytics
• Telematics, Logistics
• Business Intelligence
• Infrastructure Monitoring
• …

Spark Components
• Shark or Spark SQL: structured
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph
• Spark (general execution engine)
• Cassandra compatible

DSE Spark Integration Architecture
[Diagram: four nodes (Node 1 to Node 4), each running Cassandra alongside a Spark Worker (JVM) with two Executors, plus a Spark Master (JVM) and an App Driver]

Spark Cassandra Connector
[Diagram: inside a Spark Executor, the User Application sits on the Spark-Cassandra Connector, which uses the C* Java Driver to talk to the Cassandra (C*) nodes]

Cassandra Spark Driver
•Cassandra tables exposed as Spark RDDs
•Load data from Cassandra to Spark
•Write data from Spark to Cassandra
•Object mapper : Mapping of C* tables and rows to Scala objects
•Type conversions : All Cassandra types supported and converted to Scala types
•Server side data selection
•Virtual Nodes support
•Scala and Java APIs

DSE Spark Interactive Shell
$ dse spark
...
Spark context available as sc.
HiveSQLContext available as hc.
CassandraSQLContext available as csc.
scala> sc.cassandraTable("test", "kv")
res5: com.datastax.spark.connector.rdd.CassandraRDD[com.datastax.spark.connector.CassandraRow] = CassandraRDD[2] at RDD at CassandraRDD.scala:48
scala> sc.cassandraTable("test", "kv").collect
res6: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 1, v: foo})
cqlsh> select * from test.kv;
 k | v
---+-----
 1 | foo
(1 rows)

Connecting to Cassandra
import org.apache.spark.{SparkConf, SparkContext}
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.spark.connector._
// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)

Reading Data
val table = sc
  .cassandraTable[CassandraRow]("db", "tweets") // row representation, keyspace, table
  .select("user_name", "message")               // server-side column selection
  .where("user_name = ?", "ewa")                // server-side row selection

Writing Data
-- CQL: create the target table
CREATE TABLE test.words(word TEXT PRIMARY KEY, count INT);
// Scala: save an RDD of (word, count) tuples
val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
-- cqlsh: check the result
cqlsh:test> select * from words;
 word | count
------+-------
  bar |     5
  foo |     2
(2 rows)

Mapping Rows to Objects
CREATE TABLE test.cars (
id text PRIMARY KEY,
model text,
fuel_type text,
year int
);
case class Vehicle(
id: String,
model: String,
fuelType: String,
year: Int
)
sc.cassandraTable[Vehicle]("test", "cars").collect()
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala Case Classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)
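The underscore-to-camel-case convention (`fuel_type` in CQL, `fuelType` in the case class) can be illustrated in plain Scala. This is a minimal sketch of the naming rule only, not the connector's own mapper; the object and function names are hypothetical.

```scala
object ColumnNaming {
  // Convert a CQL-style underscore name into a Scala-style camel-case name:
  // the first segment is kept as is, each later segment is capitalized.
  def underscoreToCamelCase(column: String): String = {
    val parts = column.split("_").filter(_.nonEmpty)
    if (parts.isEmpty) column
    else parts.head + parts.tail.map(_.capitalize).mkString
  }
}
```

Applied to the columns of `test.cars`, `id`, `model` and `year` are unchanged and `fuel_type` becomes `fuelType`, which matches the fields of `Vehicle`.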

Type Mapping
CQL Type | Scala Type
ascii | String
bigint | Long
boolean | Boolean
counter | Long
decimal | BigDecimal, java.math.BigDecimal
double | Double
float | Float
inet | java.net.InetAddress
int | Int
list | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map | Map, TreeMap, java.util.HashMap
set | Set, TreeSet, java.util.HashSet
text, varchar | String
timestamp | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid | java.util.UUID
uuid | java.util.UUID
varint | BigInt, java.math.BigInteger
* nullable values | Option
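Since nullable values map to `Option`, a case class for a table with optional columns would declare those fields as `Option[...]`. The sketch below is plain Scala and does not touch Cassandra; `VehicleWithNulls` and `describe` are hypothetical names, and it assumes that `fuel_type` and `year` may be null in `test.cars` and that a null column arrives as `None`.

```scala
// Hypothetical variant of Vehicle in which two columns may be null.
case class VehicleWithNulls(
  id: String,
  model: String,
  fuelType: Option[String],
  year: Option[Int]
)

object NullableColumns {
  // Render a row, substituting a default wherever the column was null (None).
  def describe(v: VehicleWithNulls): String = {
    val fuel = v.fuelType.getOrElse("unknown fuel")
    val year = v.year.map(_.toString).getOrElse("unknown year")
    s"${v.id}: ${v.model}, $fuel, $year"
  }
}
```

Declaring the field as `Option[Int]` rather than `Int` makes the missing-value case explicit in the type, so callers have to decide on a default instead of hitting an error at read time.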

Shark
• SQL query engine on top of Spark
• Not part of Apache Spark
• Hive compatible (JDBC, UDFs, types, metadata, etc.)
• Supports in-memory tables
• Available as a part of DataStax Enterprise

Spark SQL
• Spark SQL supports a subset of the SQL-92 language
• Spark SQL is optimized for Spark internals (e.g. RDDs) and performs better than Shark
• Support for in-memory computation

Usage of Spark SQL & HiveQL query
• From the Spark command line
• Mapping of Cassandra keyspaces and tables
• Read and write on Cassandra tables
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
import com.datastax.spark.connector._
// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)
// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)
// Execute SQL query
val insertResult = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")
// Execute HQL query
val joinResult = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")

Spark Streaming
• For real time analytics
• Push or pull model
• Stream TO and FROM Cassandra
• Micro batching (each batch represented as RDD)
• Fault tolerant
• Data processed in small batches
• Exactly-once processing
• Unified stream and batch processing framework
• Supports Kafka, Flume, ZeroMQ, Kinesis and MQTT producers

Usage of Spark Streaming
• Thanks to Spark's unified architecture, portions of batch and streaming code can be reused
• Because Spark Streaming is backed by Cassandra, there is no need to depend on solutions like Apache ZooKeeper™ in production
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations such as reduceByKey
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf(true)...
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)
// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// stream output
wordCounts.saveToCassandra("test", "words")
// start processing
ssc.start()
ssc.awaitTermination()
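Because each micro-batch is represented as an RDD, the word count above amounts to the same split, pair and sum steps applied to one batch of lines at a time. The sketch below mirrors those steps on an ordinary Scala collection so the logic can be tried without a cluster; `WordCountBatch` and `countWords` are hypothetical names, and `groupBy` plus `sum` stands in for `reduceByKey(_ + _)`.

```scala
object WordCountBatch {
  // Count the words in one batch of lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split every line into words
      .map(word => (word, 1))  // pair each word with a count of 1
      .groupBy(_._1)           // gather the pairs for each distinct word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

For a batch containing the lines "foo bar" and "foo", the result maps "foo" to 2 and "bar" to 1, which is the shape of the rows written to `test.words`.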

Python API
$ dse pyspark
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.8 (default, Oct 20 2014 15:05:19)
SparkContext available as sc.
>>> sc.cassandraTable("test", "kv").collect()
[Row(k=1, v=u'foo')]

DataStax Enterprise + Spark Special Features
•Easy setup and config
• no need to setup a separate Spark cluster
• no need to tweak classpaths or config files
•High availability of Spark Master
•Enterprise security
• Password / Kerberos / LDAP authentication
• SSL for all Spark to Cassandra connections
•CFS integration (no SPOF distributed file system)
•Cassandra access through Spark Python API
•Certified and Supported on Cassandra
•Shark availability

DataStax Enterprise - High Availability
• All nodes are Spark Workers
• By default resilient to Worker failures
• The first Spark node is promoted to Spark Master (state saved in CFS, no SPOF)
• A standby Master is promoted on failure (the new Spark Master reconnects to the Workers and the driver app and continues the job)

Without DataStax Enterprise
[Diagram: five Cassandra (C*) nodes, each running a Spark Worker (SparkW); a single Spark Master (SparkM) runs on one of them]

With DataStax Enterprise
[Diagram: five Cassandra (C*) nodes running Spark Workers (SparkW), with the Spark Master (SparkM) on one node and a second node marked SparkW*]
Master state in C*
Spare master for H/A

Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion

DataStax Enterprise
[Diagram labels: Operational Application; Real-Time Search; Real-Time Analytics; Batch Analytics; Analytics Transformations; RDBMS; Hadoop; External Hadoop Distribution (Cloudera, Hortonworks); OpsCenter Services (Monitoring, Operations); Advanced Security; In-Memory]
Cassandra Cluster, Nodes Ring, Column Family Storage
High Performance, Always Available, Massive Scalability

How to Spark on Cassandra?
DataStax Cassandra Spark driver
https://github.com/datastax/cassandra-driver-spark
Compatible with
•Spark 1.2
•Cassandra 2.0.x and 2.1.x
•DataStax Enterprise 4.5 and 4.6
DataStax Enterprise 4.6 = Cassandra 2.0 + Driver + Spark 1.1
Spark 1.2 in the next DSE version, 4.7 (March)

Thank you. Questions?
We power the big data apps
that transform business.
©2013 DataStax Confidential. Do not distribute without consent.
victor.coustenoble@datastax.com
@vizanalytics
