Real-time analytics on transactional data
= Cassandra + Spark 20/02/15
Victor Coustenoble, Solutions Engineer
victor.coustenoble@datastax.com
@vizanalytics
Spark + Cassandra = Real Time Analytics on Operational Data
How do you use Cassandra?
By monitoring your energy consumption
By watching streaming films
By browsing websites
By shopping online
By paying with a smartphone
By playing well-known video games
• Collections/Playlists
• Recommendation/Personalization
• Fraud Detection
• Messaging
• Connected Devices (IoT)
Overview
Founded in April 2010
~35 500+
Santa Clara, Austin, New York, London, Paris, Sydney
400+
Employees Percent Customers
Straightening the road
Cassandra / DataStax | Relational databases
CQL | SQL
OpsCenter / DevCenter | Management tools
DSE for search & analytics | Integration
Security | Security
Support, consulting & training | 30 years ecosystem
Apache Cassandra™
• Apache Cassandra™ is an open-source, distributed NoSQL database built for modern, mission-critical online applications that need to scale massively.
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless (peer-to-peer), with no single point of failure (no SPOF)
• Distributed, optionally across data centers
• 100% available
• Massively scalable
• Linear scaling
• High performance (reads AND writes)
• Multi-data-center
• Simple to operate
• CQL language (SQL-like)
• OpsCenter / DevCenter tools
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[Diagram: Dynamo and BigTable feeding into a ring of five peer nodes, Node 1 to Node 5]
High Availability and Consistency
• The failure of a single node must not cause the system to fail
• Consistency is chosen at the client level
• Replication Factor (RF) + Consistency Level (CL) = Success
• Example:
• RF = 3
• CL = QUORUM (a majority of the replicas: floor(RF/2) + 1, i.e. 2 of 3)
©2014 DataStax Confidential. Do not distribute without consent.
[Diagram: a write at CL=QUORUM is sent in parallel to the three replicas (Node 1: 1st copy, Node 2: 2nd copy, Node 3: 3rd copy); Node 4 and Node 5 hold no copy. Acks shown: 5 μs, 12 μs, 12 μs]
A majority of the replicas has answered, so the request succeeds
CL(Read) + CL(Write) > RF => immediate/strong consistency
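The quorum size and the overlap rule on this slide reduce to two lines of arithmetic. The sketch below is only an illustration of that arithmetic; `ConsistencyMath` and its method names are hypothetical and are not part of Cassandra or any driver.

```scala
// Illustrative helper: names are hypothetical, not a Cassandra or driver API.
object ConsistencyMath {
  // A quorum is a strict majority of the replicas: floor(RF / 2) + 1.
  def quorum(replicationFactor: Int): Int = {
    require(replicationFactor > 0, "replication factor must be positive")
    replicationFactor / 2 + 1
  }

  // A read is guaranteed to touch at least one replica that took part in the
  // latest successful write when the two replica counts sum to more than RF.
  def isStronglyConsistent(readReplicas: Int, writeReplicas: Int, replicationFactor: Int): Boolean =
    readReplicas + writeReplicas > replicationFactor
}
```

With RF = 3, `ConsistencyMath.quorum(3)` is 2, so QUORUM reads combined with QUORUM writes satisfy the rule (2 + 2 > 3), while reading and writing a single replica each does not (1 + 1 is not greater than 3).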
DataStax Enterprise
Cassandra: certified, enterprise-ready
Components: Security, Analytics, Search, Visual Monitoring, Management Services, In-Memory, Dev. IDE & Drivers, Professional Services, Support & Training
Confidence in use
Enterprise features
DataStax Enterprise - Analytics
• Designed to run analytics on Cassandra data
• There are 4 ways to run analytics on Cassandra data:
1. Search (Solr)
2. Batch analytics (Hadoop)
3. Batch analytics with external tools (Cloudera, Hortonworks)
4. Real-time analytics
Partnership
Why Spark on Cassandra?
• Analytics on transactional data and operational applications
• Data model independent queries
• Cross-table operations (JOIN, UNION, etc.)
• Complex analytics (e.g. machine learning)
• Data transformation, aggregation, etc.
• Stream processing
• Better performance than Hadoop MapReduce
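To make the "cross-table operations" point concrete, here is a minimal sketch of a join followed by an aggregation, written with plain Scala collections so it runs without a cluster. The `User` and `Order` row types and the `CrossTableJoin` object are hypothetical stand-ins for two Cassandra tables; on Spark the same shape would be expressed with pair RDD operations rather than local collections.

```scala
// Hypothetical row types standing in for two Cassandra tables.
final case class User(userId: Int, name: String)
final case class Order(orderId: Int, userId: Int, amount: BigDecimal)

object CrossTableJoin {
  // Inner join of orders to users on userId, then total spend per user name.
  def totalSpendByUser(users: Seq[User], orders: Seq[Order]): Map[String, BigDecimal] = {
    val nameById = users.map(u => u.userId -> u.name).toMap
    orders
      // keep only orders whose userId has a matching user (inner join)
      .flatMap(o => nameById.get(o.userId).map(name => name -> o.amount))
      .groupBy(_._1)
      .map { case (name, pairs) => name -> pairs.map(_._2).sum }
  }
}
```

A query like this cannot be written in CQL alone, since CQL has no JOIN; that gap is what the slide is pointing at.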
Real-time Big Data
[Diagram labels: Data Enrichment, Batch Processing, Machine Learning, Pre-computed aggregates, Data, NO ETL]
Real-Time Big Data Use Cases
• Recommendation Engine
• Internet of Things
• Fraud Detection
• Risk Analysis
• Buyer Behaviour Analytics
• Telematics, Logistics
• Business Intelligence
• Infrastructure Monitoring
• …
Spark Components
• Shark or Spark SQL: structured data
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph
• Spark (general execution engine)
• Cassandra compatible
Resource isolation
DSE Spark Integration Architecture
[Diagram: four nodes (Node 1 to Node 4), each running Cassandra next to a Spark Worker (JVM) with its Executors, plus a Spark Master (JVM) and the App Driver]
Spark Cassandra Connector
[Diagram: inside a Spark Executor, the User Application sits on the Spark-Cassandra Connector, which uses the C* Java Driver to reach the Cassandra (C*) nodes]
Cassandra Spark Driver
• Cassandra tables exposed as Spark RDDs
• Load data from Cassandra to Spark
• Write data from Spark to Cassandra
• Object mapper: mapping of C* tables and rows to Scala objects
• Type conversions: all Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual nodes support
• Scala and Java APIs
DSE Spark Interactive Shell
$ dse spark
...
Spark context available as sc.
HiveSQLContext available as hc.
CassandraSQLContext available as csc.
scala> sc.cassandraTable("test", "kv")
res5: com.datastax.spark.connector.rdd.CassandraRDD
[com.datastax.spark.connector.CassandraRow] =
CassandraRDD[2] at RDD at CassandraRDD.scala:48
scala> sc.cassandraTable("test", "kv").collect
res6: Array[com.datastax.spark.connector.CassandraRow] =
Array(CassandraRow{k: 1, v: foo})
cqlsh> select * from test.kv;
 k | v
---+-----
 1 | foo
(1 rows)
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)
Reading Data
val table = sc
  .cassandraTable[CassandraRow]("db", "tweets") // row representation, keyspace, table
  .select("user_name", "message")               // server-side column selection
  .where("user_name = ?", "ewa")                // server-side row selection
Writing Data
CREATE TABLE test.words(word TEXT PRIMARY KEY, count INT);
val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
cqlsh:test> select * from words;
word | count
------+-------
bar | 5
foo | 2
(2 rows)
Mapping Rows to Objects
CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);
case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)
sc.cassandraTable[Vehicle]("test", "cars").collect()
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala Case Classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)
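The underscore-to-camel-case rule in the second bullet can be written as a small function. This is only a sketch of the naming convention; `ColumnNames.toCamelCase` is a hypothetical helper and not the connector's own column mapper.

```scala
// Hypothetical helper illustrating the naming convention only.
object ColumnNames {
  // Convert a CQL-style underscore name (fuel_type) to a Scala-style
  // camel case property name (fuelType).
  def toCamelCase(column: String): String = {
    val parts = column.split("_").filter(_.nonEmpty)
    if (parts.isEmpty) "" else parts.head + parts.tail.map(_.capitalize).mkString
  }
}
```

Applied to the `test.cars` columns, `fuel_type` becomes `fuelType`, matching the `Vehicle` field, while `id`, `model` and `year` are unchanged.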
Type Mapping
CQL Type | Scala Type
ascii | String
bigint | Long
boolean | Boolean
counter | Long
decimal | BigDecimal, java.math.BigDecimal
double | Double
float | Float
inet | java.net.InetAddress
int | Int
list | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map | Map, TreeMap, java.util.HashMap
set | Set, TreeSet, java.util.HashSet
text, varchar | String
timestamp | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid | java.util.UUID
uuid | java.util.UUID
varint | BigInt, java.math.BigInteger
* nullable values | Option
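The last row of the table says nullable columns map to `Option`. The sketch below shows what that looks like on the Scala side, assuming a variant of the cars table in which `model` and `year` may be null; the `Car` class and the `NullableColumns` helper are hypothetical and exist only for this illustration.

```scala
// Hypothetical row class: model and year are assumed to be nullable columns.
final case class Car(id: String, model: Option[String], year: Option[Int])

object NullableColumns {
  // Build a display string, substituting defaults for missing values.
  def describe(car: Car): String = {
    val model = car.model.getOrElse("unknown model")
    val year = car.year.map(_.toString).getOrElse("unknown year")
    s"${car.id}: $model ($year)"
  }
}
```

Declaring the field as `Option[Int]` rather than `Int` forces the calling code to decide what a missing year means instead of failing on a null at read time.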
Shark
• SQL query engine on top of Spark
• Not part of Apache Spark
• Hive compatible (JDBC, UDFs, types, metadata, etc.)
• Supports in-memory tables
• Available as a part of DataStax Enterprise
Spark SQL
• Spark SQL supports a subset of the SQL-92 language
• Spark SQL is optimized for Spark internals (e.g. RDDs) and performs better than Shark
• Support for in-memory computation
Usage of Spark SQL & HiveQL queries
• From the Spark command line
• Mapping of Cassandra keyspaces and tables
• Read and write on Cassandra tables
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
import com.datastax.spark.connector._
// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)
// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)
// Execute SQL query
val sqlResult = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")
// Execute HQL query
val hqlResult = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")
Spark Streaming
• For real time analytics
• Push or pull model
• Stream TO and FROM Cassandra
• Micro batching (each batch represented as RDD)
• Fault tolerant
• Data processed in small batches
• Exactly-once processing
• Unified stream and batch processing framework
• Supports Kafka, Flume, ZeroMQ, Kinesis, MQTT producers
Usage of Spark Streaming
• Due to the unifying Spark architecture, portions of batch and streaming development can be reused
• Given that Spark Streaming is backed by Cassandra, there is no need to depend on solutions like Apache ZooKeeper™ in production
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations such as reduceByKey
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf(true)...
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)
// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// stream output
wordCounts.saveToCassandra("test", "words")
// start processing
ssc.start()
ssc.awaitTermination()
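The micro-batching model behind this job can be imitated with plain Scala collections, which makes it easy to see what each one-second batch produces. The sketch below is an illustration only: `MicroBatchWordCount` is a hypothetical object, each inner `Seq[String]` stands in for the lines received during one batch interval, and no Spark or Cassandra API is involved.

```scala
// Hypothetical stand-in for a DStream: a sequence of batches of text lines.
object MicroBatchWordCount {
  // Count words within a single micro-batch, mirroring
  // flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, ones) => word -> ones.map(_._2).sum }

  // Each batch is processed independently of the others.
  def process(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map(countWords)
}
```

Because each batch is counted on its own, writing the result to a table keyed by `word` replaces the previous count for that word instead of adding to it (Cassandra writes are upserts). A running total would need a stateful operation or a counter column, which this sketch does not cover.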
Python API
$ dse pyspark
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.8 (default, Oct 20 2014 15:05:19)
SparkContext available as sc.
>>> sc.cassandraTable("test", "kv").collect()
[Row(k=1, v=u'foo')]
DataStax Enterprise + Spark Special Features
•Easy setup and config
• no need to set up a separate Spark cluster
• no need to tweak classpaths or config files
•High availability of Spark Master
•Enterprise security
• Password / Kerberos / LDAP authentication
• SSL for all Spark to Cassandra connections
•CFS integration (no SPOF distributed file system)
•Cassandra access through Spark Python API
•Certified and Supported on Cassandra
•Shark availability
DataStax Enterprise - High Availability
• All nodes are Spark Workers
• By default resilient to Worker failures
• First Spark node promoted as Spark Master (state saved in CFS, no SPOF)
• Standby Master promoted on failure (the new Spark Master reconnects to the Workers and the driver app and continues the job)
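Without DataStax Enterprise
[Diagram: five Cassandra (C*) nodes, each with a Spark Worker (SparkW); one node also runs the only Spark Master (SparkM)]
With DataStax Enterprise
[Diagram: the same five C* nodes with Spark Workers; one node runs the Spark Master (SparkM), another (SparkW*) is a spare master for H/A, and the master state is stored in C*]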
Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
DataStax Enterprise
© 2014 DataStax, All Rights Reserved. Company Confidential
[Diagram labels: Operational Application; Real Time Search; Real Time Analytics; Batch Analytics; Analytics Transformations; RDBMS; Hadoop; External Hadoop Distribution (Cloudera, Hortonworks); OpsCenter Services (Monitoring, Operations); Advanced Security; In-Memory]
Cassandra Cluster – Nodes Ring – Column Family Storage
High Performance – Always Available – Massive Scalability
How to Spark on Cassandra?
DataStax Cassandra Spark driver
https://github.com/datastax/cassandra-driver-spark
Compatible with
• Spark 1.2
• Cassandra 2.0.x and 2.1.x
• DataStax Enterprise 4.5 and 4.6
DataStax Enterprise 4.6 = Cassandra 2.0 + Driver + Spark 1.1
Spark 1.2 in the next DSE version, 4.7 (March)
Thank you. Questions?
We power the big data apps that transform business.

More Related Content

What's hot (20)

PDF
Implementing Domain Events with Kafka
Andrei Rugina
 
PDF
Building an open data platform with apache iceberg
Alluxio, Inc.
 
PDF
Spark (v1.3) - Présentation (Français)
Alexis Seigneurin
 
PDF
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
PDF
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
PDF
Apache Spark Crash Course
DataWorks Summit
 
PPTX
kafka
Amikam Snir
 
PPTX
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
PDF
Fundamentals of Apache Kafka
Chhavi Parasher
 
PDF
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
PDF
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PDF
NATS Streaming - an alternative to Apache Kafka?
Anton Zadorozhniy
 
PPTX
Kafka presentation
Mohammed Fazuluddin
 
PDF
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
PDF
Apache Kafka - Martin Podval
Martin Podval
 
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
PDF
Ceph and RocksDB
Sage Weil
 
PDF
Getting Started with HBase
Carol McDonald
 
PPTX
Managing your Hadoop Clusters with Apache Ambari
DataWorks Summit
 
PPTX
Apache Kafka
Saroj Panyasrivanit
 
Implementing Domain Events with Kafka
Andrei Rugina
 
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Spark (v1.3) - Présentation (Français)
Alexis Seigneurin
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Apache Spark Crash Course
DataWorks Summit
 
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
Fundamentals of Apache Kafka
Chhavi Parasher
 
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
NATS Streaming - an alternative to Apache Kafka?
Anton Zadorozhniy
 
Kafka presentation
Mohammed Fazuluddin
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Apache Kafka - Martin Podval
Martin Podval
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Ceph and RocksDB
Sage Weil
 
Getting Started with HBase
Carol McDonald
 
Managing your Hadoop Clusters with Apache Ambari
DataWorks Summit
 
Apache Kafka
Saroj Panyasrivanit
 

Similar to Spark + Cassandra = Real Time Analytics on Operational Data (20)

PPTX
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
PDF
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
PPTX
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
PDF
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
PDF
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
PPTX
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
PPTX
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
Victor Coustenoble
 
PDF
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
PDF
Real Time Analytics with Dse
DataStax Academy
 
PDF
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Helena Edelson
 
PPTX
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
PPTX
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
DataStax
 
PPTX
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
PPTX
Big Data Analytics with Spark
DataStax Academy
 
PDF
Analytics with Cassandra & Spark
Matthias Niehoff
 
PDF
Spark and cassandra (Hulu Talk)
Jon Haddad
 
PDF
Munich March 2015 - Cassandra + Spark Overview
Christopher Batey
 
PDF
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
PDF
Cassandra and Spark
datastaxjp
 
PDF
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
NoSQLmatters
 
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
Victor Coustenoble
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
Real Time Analytics with Dse
DataStax Academy
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Helena Edelson
 
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
DataStax
 
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Big Data Analytics with Spark
DataStax Academy
 
Analytics with Cassandra & Spark
Matthias Niehoff
 
Spark and cassandra (Hulu Talk)
Jon Haddad
 
Munich March 2015 - Cassandra + Spark Overview
Christopher Batey
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Cassandra and Spark
datastaxjp
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
NoSQLmatters
 
Ad

More from Victor Coustenoble (13)

PPTX
Préparation de Données pour la Détection de Fraude
Victor Coustenoble
 
PPTX
Préparation de Données dans le Cloud
Victor Coustenoble
 
PDF
Préparation de Données Hadoop avec Trifacta
Victor Coustenoble
 
PPTX
Webinaire Business&Decision - Trifacta
Victor Coustenoble
 
PPTX
DataStax Enterprise BBL
Victor Coustenoble
 
PPTX
DataStax et Apache Cassandra pour la gestion des flux IoT
Victor Coustenoble
 
PPTX
DataStax et Cassandra dans Azure au Microsoft Techdays
Victor Coustenoble
 
PPTX
Webinar Degetel DataStax
Victor Coustenoble
 
PPTX
Quelles stratégies de Recherche avec Cassandra ?
Victor Coustenoble
 
PPTX
Cassandra 2.2 & 3.0
Victor Coustenoble
 
PPTX
DataStax Enterprise - La plateforme de base de données pour le Cloud
Victor Coustenoble
 
PPTX
Datastax Cassandra + Spark Streaming
Victor Coustenoble
 
PPTX
DataStax Enterprise et Cas d'utilisation de Apache Cassandra
Victor Coustenoble
 
Préparation de Données pour la Détection de Fraude
Victor Coustenoble
 
Préparation de Données dans le Cloud
Victor Coustenoble
 
Préparation de Données Hadoop avec Trifacta
Victor Coustenoble
 
Webinaire Business&Decision - Trifacta
Victor Coustenoble
 
DataStax Enterprise BBL
Victor Coustenoble
 
DataStax et Apache Cassandra pour la gestion des flux IoT
Victor Coustenoble
 
DataStax et Cassandra dans Azure au Microsoft Techdays
Victor Coustenoble
 
Webinar Degetel DataStax
Victor Coustenoble
 
Quelles stratégies de Recherche avec Cassandra ?
Victor Coustenoble
 
Cassandra 2.2 & 3.0
Victor Coustenoble
 
DataStax Enterprise - La plateforme de base de données pour le Cloud
Victor Coustenoble
 
Datastax Cassandra + Spark Streaming
Victor Coustenoble
 
DataStax Enterprise et Cas d'utilisation de Apache Cassandra
Victor Coustenoble
 
Ad

Recently uploaded (20)

PPTX
Digital Circuits, important subject in CS
contactparinay1
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
LOOPS in C Programming Language - Technology
RishabhDwivedi43
 
PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
Digital Circuits, important subject in CS
contactparinay1
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
LOOPS in C Programming Language - Technology
RishabhDwivedi43
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 

Spark + Cassandra = Real Time Analytics on Operational Data

  • 1. Analytique temps réel sur des données transactionnelles = Cassandra + Spark 20/02/15 Victor Coustenoble Ingénieur Solutions [email protected] @vizanalytics
  • 3. Comment utilisez vous Cassandra? En contrôlant votre consommation d’énergie En regardant des films en streaming En naviguant sur des sites Internet En achetant en ligne En effectuant un règlement via Smart Phone En jouant à des jeux-vidéo très connus • Collections/Playlists • Recommandation/Pe rsonnalisation • Détection de Fraude • Messagerie • Objets Connectés
  • 4. Aperçu Fondé en avril 2010 ~35 500+ Santa Clara, Austin, New York, London, Paris, Sydney 400+ Employés Pourcent Clients 4
  • 5. Straightening the road RELATIONAL DATABASES CQL SQL OpsCenter / DevCenter Management tools DSE for search & analytics Integration Security Security Support, consulting & training 30 years ecosystem
  • 6. Apache Cassandra™ • Apache Cassandra™ est une base de données NoSQL, Open Source, Distribuée et créée pour les applications en ligne, modernes, critiques et avec des montée en charge massive. • Java , hybride entre Amazon Dynamo et Google BigTable • Sans Maître-Esclave (peer-to-peer), sans Point Unique de Défaillance (No SPOF) • Distribuée avec la possibilité de Data Center • 100% Disponible • Massivement scalable • Montée en charge linéaire • Haute Performance (lecture ET écriture) • Multi Data Center • Simple à Exploiter • Language CQL (comme SQL) • Outils OpsCenter / DevCenter 6 Dynamo BigTable BigTable: http://research.google.com/archive/bigtable-osdi06.pdf Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf Node 1 Node 2 Node 3Node 4 Node 5
  • 7. Haute Disponibilité et Cohérence • La défaillance d’un seul noeud ne doit pas entraîner de défaillance du système • Cohérence choisie au niveau du client • Facteur de Réplication (RF) + Niveau de Cohérence (CL) = Succès • Exemple: • RF = 3 • CL = QUORUM (= 51% des replicas) ©2014 DataStax Confidential. Do not distribute without consent. 7 Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy Parallel Write Write CL=QUORUM 5 μs ack 12 μs ack 12 μs ack > 51% de réponses – donc la requête est réussie CL(Read) + CL(Write) > RF => Cohérence Immédiate/Forte
  • 8. DataStax Enterprise Cassandra Certifié, Prêt pour l’Entreprise 8 Security Analytics Search Visual Monitoring Management Services In-Memory Dev.IDE& Drivers Professional Services Support& Training Confiance d’utilisation Fonctionnalités d’Entreprise
  • 9. DataStax Enterprise - Analytique • Conçu pour faire des analyses sur des données Cassandra • Il y a 4 façons de faire de l’Analytique sur des données Cassandra: 1. Recherche (Solr) 2. Analytique en mode Batch (Hadoop) 3. Analytique en mode Batch avec des outils Externe (Cloudera, Hortonworks) 4. Analytique Temps Réel ©2014 DataStax Confidential. Do not distribute without consent.
  • 10. Partenariat ©2014 DataStax Confidential. Do not distribute without consent. 10
  • 11. Why Spark on Cassandra? • Analytics on transactional data and operational applications • Data model independent queries • Cross-table operations (JOIN, UNION, etc.) • Complex analytics (e.g. machine learning) • Data transformation, aggregation, etc. • Stream processing • Better performances than Hadoop Map/Reduce
  • 12. Real-time Big Data ©2014 DataStax Confidential. Do not distribute without consent. 12 Data Enrichment Batch Processing Machine Learning Pre-computed aggregates Data NO ETL
  • 13. Real-Time Big Data Use Cases • Recommendation Engine • Internet of Things • Fraud Detection • Risk Analysis • Buyer Behaviour Analytics • Telematics, Logistics • Business Intelligence • Infrastructure Monitoring • … ©2014 DataStax Confidential. Do not distribute without consent. 13
  • 14. Composants Sparks Shark or Spark SQL Structured Spark Streaming Real-time MLlib Machine learning Spark (General execution engine) GraphX Graph Cassandra Compatible
  • 16. Cassandra Executor ExecutorSpark Worker (JVM) Cassandra Executor ExecutorSpark Worker (JVM) DSE Spark Integration Architecture Node 1 Node 2 Node 3 Node 4 Cassandra Executor ExecutorSpark Worker (JVM) Cassandra Executor ExecutorSpark Worker (JVM) Spark Master (JVM) App Driver
  • 17. Spark Cassandra Connector C* C* C*C* Spark Executor C* Java Driver Spark-Cassandra Connector User Application Cassandra
  • 18. Cassandra Spark Driver •Cassandra tables exposed as Spark RDDs •Load data from Cassandra to Spark •Write data from Spark to Cassandra •Object mapper : Mapping of C* tables and rows to Scala objects •Type conversions : All Cassandra types supported and converted to Scala types •Server side data selection •Virtual Nodes support •Scala and Java APIs
  • 19. DSE Spark Interactive Shell $ dse spark ... Spark context available as sc. HiveSQLContext available as hc. CassandraSQLContext available as csc. scala> sc.cassandraTable("test", "kv") res5: com.datastax.spark.connector.rdd.CassandraRDD [com.datastax.spark.connector.CassandraRow] = CassandraRDD[2] at RDD at CassandraRDD.scala:48 scala> sc.cassandraTable("test", "kv").collect res6: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 1, v: foo}) cqlsh> select * from test.kv; k | v ---+----- 1 | foo (1 rows)
  • 20. Connecting to Cassandra // Import Cassandra-specific functions on SparkContext and RDD objects import com.datastax.driver.spark._ // Spark connection options val conf = new SparkConf(true) .setMaster("spark://192.168.123.10:7077") .setAppName("cassandra-demo") .set("cassandra.connection.host", "192.168.123.10") // initial contact .set("cassandra.username", "cassandra") .set("cassandra.password", "cassandra") val sc = new SparkContext(conf)
  • 21. Reading Data val table = sc .cassandraTable[CassandraRow]("db", "tweets") .select("user_name", "message") .where("user_name = ?", "ewa") row representation keyspace table server side column and row selection
  • 22. Writing Data CREATE TABLE test.words(word TEXT PRIMARY KEY, count INT); val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5))) collection.saveToCassandra("test", "words", SomeColumns("word", "count")) cqlsh:test> select * from words; word | count ------+------- bar | 5 foo | 2 (2 rows)
  • 23. Mapping Rows to Objects CREATE TABLE test.cars ( id text PRIMARY KEY, model text, fuel_type text, year int ); case class Vehicle( id: String, model: String, fuelType: String, year: Int ) sc.cassandraTable[Vehicle]("test", "cars").toArray //Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009), // Vehicle(MT8787, Hyundai x35, Diesel, 2011)  * Mapping rows to Scala Case Classes * CQL underscore case column mapped to Scala camel case property * Custom mapping functions (see docs)
  • 24. Type Mapping CQL Type Scala Type ascii String bigint Long boolean Boolean counter Long decimal BigDecimal, java.math.BigDecimal double Double float Float inet java.net.InetAddress int Int list Vector, List, Iterable, Seq, IndexedSeq, java.util.List map Map, TreeMap, java.util.HashMap set Set, TreeSet, java.util.HashSet text, varchar String timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime timeuuid java.util.UUID uuid java.util.UUID varint BigInt, java.math.BigInteger *nullable values Option
  • 25. Shark • SQL query engine on top of Spark • Not part of Apache Spark • Hive compatible (JDBC, UDFs, types, metadata, etc.) • Supports in-memory tables • Available as a part of DataStax Enterprise
  • 26. Spark SQL • Spark SQL supports a subset of SQL-92 language • Spark SQL optimized for Spark internals (e.g. RDDs) , better performances than Shark • Support for in-memory computation
  • 27. •From Spark command line •Mapping of Cassandra keyspaces and tables •Read and write on Cassandra tables Usage of Spark SQL & HiveQL query import com.datastax.spark.connector._ // Connect to the Spark cluster val conf = new SparkConf(true)... val sc = new SparkContext(conf) // Create Cassandra SQL context val cc = new CassandraSQLContext(sc) // Execute SQL query val rdd = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2") // Execute HQL query val rdd = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")
  • 28. Spark Streaming • For real time analytics • Push or pull model • Stream TO and FROM Cassandra • Micro batching (each batch represented as RDD) • Fault tolerant • Data processed in small batches • Exactly-once processing • Unified stream and batch processing framework • Supports Kafka, Flume, ZeroMQ, Kinesis, MQTT producers
  • 29. Usage of Spark Streaming • Due to the unifying Spark architecture, portions of batch and streaming development can be reused • Given that Spark Streaming is backed by Cassandra, no need to depend upon solutions like Apache Zookeeper ™ in production import com.datastax.spark.connector.streaming._ // Spark connection options val conf = new SparkConf(true)... // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) // stream input val lines = ssc.socketTextStream(serverIP, serverPort) // count words val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) // stream output wordCounts.saveToCassandra("test", "words") // start processing ssc.start() ssc.awaitTermination()
  • 30. Python API $ dse pyspark Python 2.7.8 (default, Oct 20 2014, 15:05:19) [GCC 4.9.1] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /__ / .__/_,_/_/ /_/_ version 1.1.0 /_/ Using Python version 2.7.8 (default, Oct 20 2014 15:05:19) SparkContext available as sc. >>> sc.cassandraTable("test", "kv").collect() [Row(k=1, v=u'foo')]
  • 31. DataStax Enterprise + Spark Special Features •Easy setup and config • no need to setup a separate Spark cluster • no need to tweak classpaths or config files •High availability of Spark Master •Enterprise security • Password / Kerberos / LDAP authentication • SSL for all Spark to Cassandra connections •CFS integration (no SPOF distributed file system) •Cassandra access through Spark Python API •Certified and Supported on Cassandra •Shark availability
  • 32. DataStax Enterprise - High Availability • All nodes are Spark Workers • By default resilient to Worker failures • First Spark node promoted as Spark Master (state saved in CFS, no SPOF) • Standby Master promoted on failure (New Spark Master reconnects to Workers and the driver app and continues the job)
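The failover idea on slide 32 can be illustrated with a toy model. This is only a sketch under stated assumptions, not how DSE or Spark implement it: `SharedStore` stands in for the replicated store that the slide names (CFS), and the classes, method names and node names are all hypothetical. The point it shows is that when the active master persists its state outside its own process, a standby can rebuild the list of workers and running applications after promotion.

```python
class SharedStore:
    """Hypothetical stand-in for a replicated store that survives a node failure."""

    def __init__(self):
        self._state = {}

    def save(self, state):
        self._state = dict(state)

    def load(self):
        return dict(self._state)


class Master:
    """Toy master that persists its state on every change."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.workers = set()
        self.apps = set()

    def _persist(self):
        self.store.save({"workers": sorted(self.workers), "apps": sorted(self.apps)})

    def register_worker(self, worker):
        self.workers.add(worker)
        self._persist()

    def submit(self, app):
        self.apps.add(app)
        self._persist()

    def recover(self):
        """Called on a standby when it is promoted: reload the saved state."""
        state = self.store.load()
        self.workers = set(state.get("workers", []))
        self.apps = set(state.get("apps", []))


store = SharedStore()
active = Master("node1", store)
active.register_worker("node2")
active.register_worker("node3")
active.submit("wordcount")

# node1 fails; the standby on node2 is promoted and recovers from the store.
standby = Master("node2", store)
standby.recover()
```

After `recover()`, the standby knows the same workers and applications as the failed master, which is what lets it reconnect to them and continue the job.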
  • 33. Without DataStax Enterprise [Diagram: five Cassandra (C*) nodes, each running a Spark Worker (SparkW); one node also runs the Spark Master (SparkM)]
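  • 34. With DataStax Enterprise [Diagram: five Cassandra (C*) nodes, each running a Spark Worker (SparkW); one node also runs the Spark Master (SparkM), and one worker (SparkW*) is the spare master for H/A] • Master state in C* • Spare master for H/A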
  • 35. Spark Use Cases • Load data from various sources • Analytics (join, aggregate, transform, …) • Sanitize, validate, normalize data • Schema migration, data conversion
  • 36. DataStax Enterprise • External Hadoop distribution (Cloudera, Hortonworks) • OpsCenter services: Hadoop monitoring, operations • Operational application • Real-time search • Real-time analytics • Batch analytics • RDBMS • Analytics transformations • Cassandra cluster: nodes ring, column family storage • High performance, always available, massive scalability • Advanced security • In-memory © 2014 DataStax, All Rights Reserved. Company Confidential
  • 37. How to run Spark on Cassandra? DataStax Cassandra Spark driver https://github.com/datastax/cassandra-driver-spark Compatible with • Spark 1.2 • Cassandra 2.0.x and 2.1.x • DataStax Enterprise 4.5 and 4.6 DataStax Enterprise 4.6 = Cassandra 2.0 + Driver + Spark 1.1 Spark 1.2 in the next DSE version, 4.7 (March)
  • 38. Thank you. Questions? We power the big data apps that transform business. ©2013 DataStax Confidential. Do not distribute without consent. victor.coustenoble@datastax.com @vizanalytics

Editor's Notes

  • #4: Who among you knows us? In fact, in your daily life you use DataStax technology without knowing it: eBay for product recommendations, soon Netflix for streaming films, a smartphone purchase through a new service offered by a large mutual bank, an instant message exchanged through a service from the largest telephone operator in France, and so on. In the end, in your everyday life you use the various kinds of applications offered by our 500 customers, which rely on our database technology. We are growing so fast, and in so many ways, I'm willing to bet you've used our technology several times in just the past few days and don't even realize it. Whether you did some online banking, browsed news sites, did a bit of retail shopping, filled a few prescriptions, or watched movies online (basically, if you lived your life), you used the kinds of applications that we power for over 400 customers, including over 20 of the Fortune 100.
  • #5: Key Takeaway: Introduce the company, our incredible growth and global presence, that we are in about 25% of the FORTUNE 100, and the fact that many of the online and mobile applications you already use every day are actually built on DataStax. Talk Track: DataStax, the leading distributed database technology, delivers Apache Cassandra to the world's most innovative companies such as Netflix, Rackspace, Pearson Education and Constant Contact. DataStax is built to be agile, always-on, and predictably scalable to any size. We were founded in April 2010, so we are a little over 4 years old. We are headquartered in Santa Clara, California and have offices in Austin, TX; New York; London, England; and Sydney, Australia. We now have over 330 employees; this number will reach well over 400 by the end of our fiscal year (Jan 31 2015) and double by the end of FY16. Currently 25% of the Fortune 100 use us. Our success has been built on our customers' success, and today we have over 500 customers worldwide, in over 40 countries. The logos you see here are ones that you are already using every day. These applications are all built on DataStax and Apache Cassandra. So how have we come so far in such a short time?
  • #6: DataStax's mission is to free you from these uncertainties and smooth your way along this new road. To that end, we offer a DML/DDL language called CQL, very close to the SQL your teams already master, and complete administration and monitoring tools. So, what DataStax is doing is trying to straighten that bend in the road. We are providing things like CQL, and management tools called DevCenter and OpsCenter. DataStax Enterprise provides integration into analytics and search capabilities, and we do it all within a secure environment. We also provide consultants and training courses, including free virtual training to help get you up to speed.
  • #7: Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance. It uses aspects of Dynamo's partitioning and replication and a log-structured data model similar to Bigtable's. It takes its distribution algorithm from Dynamo and its data model from Bigtable. Cassandra is a reinvented database that is lightning fast and always on, ideal for today's online applications where relational databases like Oracle can't keep up. This means that in today's world, Cassandra stores and processes real-time information with fast, predictable performance and built-in fault tolerance.
  • #9: Key Takeaway: DataStax Enterprise delivers the commercial confidence and the additional enterprise functionality that you need to support your online business applications. Talk Track: DataStax takes the latest version of open source Apache Cassandra, then certifies and prepares it for bullet-proof enterprise deployment. We deliver commercial confidence in the form of training and support, development tools and drivers, and professional implementation services to ensure that you have everything you need to successfully deploy Cassandra in support of your mainstream business applications. We also offer additional functionality: Management Services, which let you automatically manage administration and performance; Security and encryption, to ensure that your data remains safe and free from corruption; an In-Memory option, which lets you deliver online applications with lightning-fast response times; Analytics, which let you gain valuable insights into data center performance; Search, which lets you easily search your database; and Visual Monitoring, our OpsCenter product, which lets you easily manage and monitor data center performance from anywhere, on any device.
  • #11: Databricks is the company behind Apache Spark.
  • #13: Predictive analytics. Does this simple architecture look familiar to you? Lambda (Nathan Marz).
  • #15: Shark is Hive-compatible: you can run the same application on Shark. Shark integration is only in DSE; otherwise you have to wait for Spark SQL. Separate projects: Spark is a totally different project. Spark SQL has borrowed from Spark. Both promise to be Hive-compatible.
  • #16: The Cassandra Spark driver will NOT connect to a remote DC. Different nodes, profiles, etc.
  • #33: Master HA out of the box with DSE. A Spark Master controls the workflow, and a Spark Worker launches executors responsible for executing part of the job submitted to the Spark Master.
  • #36: DUYHAI