SlideShare a Scribd company logo
codecentric AG
Matthias Niehoff
Big Data Analytics with Cassandra, Spark & MLLib
codecentric AG
– Spark
 Basics
 In A Cluster
 Cassandra Spark Connector
 Use Cases
– Spark Streaming
– Spark SQL
– Spark MLLib
– Live Demo
AGENDA
codecentric AG
SELECT * FROM performer WHERE name = 'ACDC’
ok
SELECT * FROM performer WHERE name = 'ACDC’
and country = ’Australia’
not ok
SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER
BY quantity DESC
not possible
CQL – QUERYING LANGUAGE WITH LIMITATIONS
18.06.2015 5
performer
name text K
style text
country text
type text
codecentric AG
– Open Source & Apache project since 2010
– Data processing framework
 Batch processing
 Stream processing
WHAT IS APACHE SPARK?
18/06/2015 6
codecentric AG
– Fast
 up to 100 times faster than Hadoop
 a lot of in-memory processing
 scalable about nodes
– Easy
 Scala, Java and Python API
 Clean Code (e.g. with lambdas in Java 8)
 expanded API: map, reduce, filter, groupBy, sort, union, join, reduceByKey,
groupByKey, sample, take, first, count ..
– Fault-Tolerant
 easily reproducible
WHY USE SPARK?
18.06.2015 7
codecentric AG
– RDD‘s – Resilient Distributed Dataset
 Read – Only Collection of Objects
 Distributed among the Cluster (on memory or disk)
 Determined through transformations
 Automatically rebuild on failure
– Operations
 Transformations (map,filter,reduce...)  new RDD
 Actions (count, collect, save)
– Only Actions start processing!
EASILY REPRODUCIBLE?
18.06.2015 8
codecentric AG
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long = 126
RDD EXAMPLE
18.06.2015 9
codecentric AG
REPRODUCE RDD USING A TREE
18.06.2015 10
Datenquelle
rdd1
rdd3
val1 rdd5
rdd2
rdd4
val2
rdd6
val3
map(..)filter(..)
union(..)
count()
count() count()
sample(..)
cache()
codecentric AG
– Spark Cassandra Connector by Datastax
 https://github.com/datastax/spark-cassandra-connector
– Cassandra tables as Spark RDD (read & write)
– Mapping of C* tables and rows onto Java/Scala objects
– Server-Side filtering („where“)
– Compatible with
 Spark ≥ 0.9
 Cassandra ≥ 2.0
– Clone & Compile with SBT or download at Maven Central
SPARK ON CASSANDRA
18.06.2015 16
codecentric AG
– Start Spark Shell
bin/spark-shell
--jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
--conf spark.cassandra.connection.host=localhost
--driver-memory 2g
--executor-memory 3g
– Import Cassandra Classes
scala> import com.datastax.spark.connector._,
USE SPARK CONNECTOR
18.06.2015 17
codecentric AG
– Read complete table
val movies = sc.cassandraTable("movie", "movies")
// Return type: CassandraRDD[CassandraRow]
– Read selected columns
val movies = sc.cassandraTable("movie", "movies").select("title","year")
– Filter Rows
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
READ TABLE
18.06.2015 18
codecentric AG
– Access columns in ResultSet
movies.collect.foreach(r => println(r.get[String]("title")))
– As Scala tuple
sc.cassandraTable[(String,Int)]("movie","movies")
.select("title","year")
or
sc.cassandraTable("movie","movies").select("title","year")
.as((_: String, _:Int))
CassandraRDD[(String,Int)]
READ TABLE
18.06.2015 19
getString,
getStringOption,
getOrElse, getMap,
codecentric AG
case class Movie(name: String, year: Int)
sc.cassandraTable[Movie]("movie","movies")
.select("title","year")
sc.cassandraTable("movie","movies").select("title","year")
.as(Movie)
CassandraRDD[Movie]
READ TABLE – CASE CLASS
18.06.2015 20
codecentric AG
– Every RDD can be saved (not only CassandraRDD)
– Tuple
val tuples = sc.parallelize(Seq(("Hobbit",2012),
("96 Hours",2008)))
tuples.saveToCassandra("movie","movies",
SomeColumns("title","year")
- Case Class
case class Movie (title:String, year: int)
val objects = sc.parallelize(Seq( Movie("Hobbit",2012),
Movie("96 Hours",2008)))
objects.saveToCassandra("movie","movies")
WRITE TABLE
18.06.2015 21
codecentric AG
( [atomic,collection,object] , [atomic,collection,object])
val fluege= List( ("Thomas", "Berlin") ,("Mark", "Paris"), ("Thomas", "Madrid") )
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas")
.collect
.foreach(t => println(t._1 + " flog nach " + t._2))
PAIR RDD‘S
18.06.2015 23
key – not unique value
codecentric AG
– Parallelization!
 keys are use for partitioning
 pairs with different keys are distributed across the cluster
– Efficient processing of
 aggregate by key
 group by key
 sort by key
 joins, union based on keys
WHY USE PAIR RDD'S?
18.06.2015 24
codecentric AG
– keys(), values()
– mapValues(func), flatMapValues(func)
– lookup(key), collectAsMap(), countByKey()
– reduceByKey(func), foldByKey(zeroValue)(func)
– groupByKey(), cogroup(otherDataset)
– sortByKey([ascending])
– join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
– union(otherDataset), substractByKey(otherDataset)
SPECIAL OP‘S FOR PAIR RDD‘S
18.06.2015 25
codecentric AG
val pairRDD = sc.cassandraTable("movie","director")
.map(r => (r.getString("country"),r))
// Directors / Country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_+_).sortBy(-
_._2).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// All Countries
pairRDD.keys()
CASSANDRA EXAMPLE
18.06.2015 26
director
name text K
country text
codecentric AG
pairRDD.groupByCountry()
RDD[(String,Iterable[CassandraRow])]
val directors = sc.cassandraTable(..).map(r => r.getString("name"),r))
val movies = sc.cassandraTable().map(r => r.getString("director"),r))
directors.cogroup(movies)
RDD[(String,
(Iterable[CassandraRow], Iterable[CassandraRow]))]
CASSANDRA EXAMPLE
18.06.2015 27
director
name text K
country text
movie
title text K
director text
Director Movies
codecentric AG
– Joins could be expensive
 partitions for same keys in different
tables on different nodes
 requires shuffling
val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r))
val movies = sc.cassandraTable().map(r => (r.getString("director"),r))
movies.join(directors)
 RDD[(String, (CassandraRow, CassandraRow))]
CASSANDRA EXAMPLE - JOINS
18.06.2015 28
director
name text K
country text
movie
title text K
director text
DirectorMovie
codecentric AG
USE CASES
18.06.2015 30
DataLoading
Validation& Normalization
Analyses (Joins, Transformations,..)
Schema Migration
DataConversion
codecentric AG
– In particular for huge amounts of external data
– Support for CSV, TSV, XML, JSON und other
– Example:
case class User (id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
.repartition(2*sc.defaultParallelism)
.map(line => line.split(",") match { case Array(id,name) => User(java.util.UUID.fromString(id), name)})
users.saveToCassandra("keyspace","users")
DATA LOADING
18.06.2015 31
codecentric AG
– Validate consistency in a Cassandra Database
 syntactic
 Uniqueness (only relevant for columns not in the PK)
 Referential integrity
 Integrity of the duplicates
 semantic
 Business- or Application constraints
 e.g.: At least one genre per movies, a maximum of 10 tags per blog post
VALIDATION
18.06.2015 32
codecentric AG
– Modelling, Mining, Transforming, ....
– Use Cases
 Recommendation
 Fraud Detection
 Link Analysis (Social Networks, Web)
 Advertising
 Data Stream Analytics ( Spark Streaming)
 Machine Learning ( Spark ML)
ANALYSES
18.06.2015 33
codecentric AG
– Changes on existing tables
 New table required when changing primary key
 Otherwise changes could be performed in-place
– Creating new tables
 data derived from existing tables
 Support new queries
– Use the CassandraConnectors in Spark
val cc = CassandraConnector(sc.getConf)
cc.withSessionDo{session => session.execute(ALTER TABLE movie.movies ADD
year timestamp}
SCHEMA MIGRATION & DATA CONVERSION
18.06.2015 34
codecentric AG
STREAM PROCESSING
18.06.2015 35
.. with Spark Streaming
– Real Time Processing
– Supported sources: TCP, HDFS, S3, Kafka, Twitter,..
– Data as Discretized Stream (DStream)
– All Operations of the GenericRDD
– Stateful Operations
– Sliding Windows
codecentric AG
val ssc = new StreamingContext(sc,Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1",9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
SPARK STREAMING - BEISPIEL
18.06.2015 36
codecentric AG
– SQL Queries with Spark (SQL & HiveQL)
 On structured data
 On SchemaRDD‘s
 Every result of Spark SQL are SchemaRDD‘s
 All operations of the GenericRDD‘s available
– Unterstützt, auch auf nicht PK Spalten,...
 Joins
 Union
 Group By
 Having
 Order By
SPARK SQL
18.06.2015 37
codecentric AG
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) as anzahl" +
"FROM artists GROUP BY country" +
"ORDER BY anzahl DESC");
result.collect.foreach(println);
SPARK SQL – CASSANDRA BEISPIEL
18.06.2015 38
performer
name text K
style text
country text
type text
codecentric AG
val sqlContext = new SQLContext(sc)
val personen = sqlContext.jsonFile(path)
// Show the schema
personen.printSchema()
personen.registerTempTable("personen")
val erwachsene =
sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
erwachsene.collect.foreach(println)
SPARK SQL – RDD / JSON BEISPIEL
18.06.2015 39
{"name":"Michael"}
{"name":"Jan", "alter":30}
{"name":"Tim", "alter":17}
codecentric AG
– Fully integrated in Spark
 Scalable
 Scala, Java & Python APIs
 Use with Spark Streaming & Spark SQL
– Packages various algorithms for machine learning
– Includes
 Clustering
 Classification
 Prediction
 Collaborative Filtering
– Still under development
 performance, algorithms
SPARK MLLIB
18.06.2015 40
codecentric AG
age
EXAMPLE – CLUSTERING
18.06.2015 41
set of data points meaningful clusters
income
codecentric AG
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(
s.split(' ').map(_.toDouble))).cache()
// Cluster the data into 3 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate clustering by computing Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + WSSSE)
EXAMPLE – CLUSTERING (K-MEANS)
18.06.2015 42
codecentric AG
EXAMPLE – CLASSIFICATION
18.06.2015 43
codecentric AG
EXAMPLE – CLASSIFICATION
18.06.2015 44
codecentric AG
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
EXAMPLE – CLASSIFICATION (LINEAR SVM)
18.06.2015 45
codecentric AG
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
EXAMPLE – CLASSIFICATION (LINEAR SVM)
18.06.2015 46
codecentric AG
COLLABORATIVE FILTERING
18.06.2015 47
codecentric AG
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
=> Rating(user.toInt, item.toInt, rate.toDouble) })
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
COLLABORATIVE FILTERING (ALS)
18.06.2015 48
codecentric AG
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
=> Rating(user.toInt, item.toInt, rate.toDouble) })
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
COLLABORATIVE FILTERING (ALS)
18.06.2015 49
codecentric AG
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate)
=> (user, product) }
val predictions = model.predict(usersProducts).map {
case Rating(user, product, rate) => ((user, product), rate) }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2); err * err }.mean()
println("Mean Squared Error = " + MSE)
COLLABORATIVE FILTERING (ALS)
18.06.2015 50
codecentric AG
– Setup, Codes & Examples on Github
 https://github.com/mniehoff/spark-cassandra-playground
– Install Cassandra
 For local experiments: Cassandra Cluster Manager
https://github.com/pcmanus/ccm
 ccm create test -v binary:2.0.5 -n 3 –s
 ccm status
 ccm node1 status
– Install Spark
 Tar Ball (Pre Built for Hadoop 2.6)
 HomeBrew
 MacPorts
CURIOUS? GETTING STARTED!
18.06.2015 51
codecentric AG
– Spark Cassandra Connector
 https://github.com/datastax/spark-cassandra-connector
 Clone, ./sbt/sbt assembly
or
 Download Jar @ Maven Central
– Start Spark shell
spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-
SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost
– Import Cassandra classes
scala> import com.datastax.spark.connector._,
CURIOUS? GETTING STARTED!
18.06.2015 52
codecentric AG
– Production ready Cassandra
– Integrates
 Spark
 Hadoop
 Solr
– OpsCenter & DevCenter
– Bundled and ready to go
– Free for development and test
– http://www.datastax.com/download
(One time registration required)
DATASTAX ENTERPRISE EDITION
18.06.2015 53
codecentric AG
QUESTIONS?
18.06.2015 57
codecentric AG
Matthias Niehoff
codecentric AG
Zeppelinstraße 2
76185 Karlsruhe
tel +49 (0) 721 – 95 95 681
matthias.niehoff@codecentric.de
KONTAKT
18.06.2015 58
Ad

More Related Content

What's hot (20)

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
Patrick McFadin
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Piotr Kolaczkowski
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
Evan Chan
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Duyhai Doan
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
prajods
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
DataStax
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelines
Patrick McFadin
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
Josef Adersberger
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
Patrick McFadin
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Piotr Kolaczkowski
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
Evan Chan
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Duyhai Doan
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
prajods
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
Patrick McFadin
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
Russell Spitzer
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
DataStax
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelines
Patrick McFadin
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
Josef Adersberger
 

Similar to Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day (20)

Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
Denis Dus
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
Databricks
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
Christopher Batey
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
wang xing
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Natalino Busa
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?
Erik-Berndt Scheper
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
Michael Spector
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
Denis Dus
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
Databricks
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
Christopher Batey
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
wang xing
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Natalino Busa
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?
Erik-Berndt Scheper
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
 
Ad

Recently uploaded (20)

Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Ad

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

  • 1. codecentric AG Matthias Niehoff Big Data Analytics with Cassandra, Spark & MLLib
  • 2. codecentric AG – Spark  Basics  In A Cluster  Cassandra Spark Connector  Use Cases – Spark Streaming – Spark SQL – Spark MLLib – Live Demo AGENDA
  • 3. codecentric AG SELECT * FROM performer WHERE name = 'ACDC’ ok SELECT * FROM performer WHERE name = 'ACDC’ and country = ’Australia’ not ok SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER BY quantity DESC not possible CQL – QUERYING LANGUAGE WITH LIMITATIONS 18.06.2015 5 performer name text K style text country text type text
  • 4. codecentric AG – Open Source & Apache project since 2010 – Data processing framework  Batch processing  Stream processing WHAT IS APACHE SPARK? 18/06/2015 6
  • 5. codecentric AG – Fast  up to 100 times faster than Hadoop  a lot of in-memory processing  scalable about nodes – Easy  Scala, Java and Python API  Clean Code (e.g. with lambdas in Java 8)  expanded API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count .. – Fault-Tolerant  easily reproducible WHY USE SPARK? 18.06.2015 7
  • 6. codecentric AG – RDD‘s – Resilient Distributed Dataset  Read – Only Collection of Objects  Distributed among the Cluster (on memory or disk)  Determined through transformations  Automatically rebuild on failure – Operations  Transformations (map,filter,reduce...)  new RDD  Actions (count, collect, save) – Only Actions start processing! EASILY REPRODUCIBLE? 18.06.2015 8
  • 7. codecentric AG scala> val textFile = sc.textFile("README.md") textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 scala> linesWithSpark.count() res0: Long = 126 RDD EXAMPLE 18.06.2015 9
  • 8. codecentric AG REPRODUCE RDD USING A TREE 18.06.2015 10 Datenquelle rdd1 rdd3 val1 rdd5 rdd2 rdd4 val2 rdd6 val3 map(..)filter(..) union(..) count() count() count() sample(..) cache()
  • 9. codecentric AG – Spark Cassandra Connector by Datastax  https://github.com/datastax/spark-cassandra-connector – Cassandra tables as Spark RDD (read & write) – Mapping of C* tables and rows onto Java/Scala objects – Server-Side filtering („where“) – Compatible with  Spark ≥ 0.9  Cassandra ≥ 2.0 – Clone & Compile with SBT or download at Maven Central SPARK ON CASSANDRA 18.06.2015 16
  • 10. codecentric AG – Start Spark Shell bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost --driver-memory 2g --executor-memory 3g – Import Cassandra Classes scala> import com.datastax.spark.connector._, USE SPARK CONNECTOR 18.06.2015 17
  • 11. codecentric AG – Read complete table val movies = sc.cassandraTable("movie", "movies") // Return type: CassandraRDD[CassandraRow] – Read selected columns val movies = sc.cassandraTable("movie", "movies").select("title","year") – Filter Rows val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'") READ TABLE 18.06.2015 18
  • 12. codecentric AG – Access columns in ResultSet movies.collect.foreach(r => println(r.get[String]("title"))) – As Scala tuple sc.cassandraTable[(String,Int)]("movie","movies") .select("title","year") or sc.cassandraTable("movie","movies").select("title","year") .as((_: String, _:Int)) CassandraRDD[(String,Int)] READ TABLE 18.06.2015 19 getString, getStringOption, getOrElse, getMap,
  • 13. codecentric AG case class Movie(name: String, year: Int) sc.cassandraTable[Movie]("movie","movies") .select("title","year") sc.cassandraTable("movie","movies").select("title","year") .as(Movie) CassandraRDD[Movie] READ TABLE – CASE CLASS 18.06.2015 20
  • 14. codecentric AG – Every RDD can be saved (not only CassandraRDD) – Tuple val tuples = sc.parallelize(Seq(("Hobbit",2012), ("96 Hours",2008))) tuples.saveToCassandra("movie","movies", SomeColumns("title","year") - Case Class case class Movie (title:String, year: int) val objects = sc.parallelize(Seq( Movie("Hobbit",2012), Movie("96 Hours",2008))) objects.saveToCassandra("movie","movies") WRITE TABLE 18.06.2015 21
  • 15. codecentric AG ( [atomic,collection,object] , [atomic,collection,object]) val fluege= List( ("Thomas", "Berlin") ,("Mark", "Paris"), ("Thomas", "Madrid") ) val pairRDD = sc.parallelize(fluege) pairRDD.filter(_._1 == "Thomas") .collect .foreach(t => println(t._1 + " flog nach " + t._2)) PAIR RDD‘S 18.06.2015 23 key – not unique value
  • 16. codecentric AG – Parallelization!  keys are use for partitioning  pairs with different keys are distributed across the cluster – Efficient processing of  aggregate by key  group by key  sort by key  joins, union based on keys WHY USE PAIR RDD'S? 18.06.2015 24
  • 17. codecentric AG – keys(), values() – mapValues(func), flatMapValues(func) – lookup(key), collectAsMap(), countByKey() – reduceByKey(func), foldByKey(zeroValue)(func) – groupByKey(), cogroup(otherDataset) – sortByKey([ascending]) – join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset) – union(otherDataset), substractByKey(otherDataset) SPECIAL OP‘S FOR PAIR RDD‘S 18.06.2015 25
  • 18. codecentric AG val pairRDD = sc.cassandraTable("movie","director") .map(r => (r.getString("country"),r)) // Directors / Country, sorted pairRDD.mapValues(v => 1).reduceByKey(_+_).sortBy(- _._2).collect.foreach(println) // or, unsorted pairRDD.countByKey().foreach(println) // All Countries pairRDD.keys() CASSANDRA EXAMPLE 18.06.2015 26 director name text K country text
  • 19. codecentric AG pairRDD.groupByCountry() RDD[(String,Iterable[CassandraRow])] val directors = sc.cassandraTable(..).map(r => r.getString("name"),r)) val movies = sc.cassandraTable().map(r => r.getString("director"),r)) directors.cogroup(movies) RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))] CASSANDRA EXAMPLE 18.06.2015 27 director name text K country text movie title text K director text Director Movies
  • 20. codecentric AG – Joins could be expensive  partitions for same keys in different tables on different nodes  requires shuffling val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r)) val movies = sc.cassandraTable().map(r => (r.getString("director"),r)) movies.join(directors)  RDD[(String, (CassandraRow, CassandraRow))] CASSANDRA EXAMPLE - JOINS 18.06.2015 28 director name text K country text movie title text K director text DirectorMovie
  • 21. codecentric AG USE CASES 18.06.2015 30 DataLoading Validation& Normalization Analyses (Joins, Transformations,..) Schema Migration DataConversion
  • 22. codecentric AG – In particular for huge amounts of external data – Support for CSV, TSV, XML, JSON und other – Example: case class User (id: java.util.UUID, name: String) val users = sc.textFile("users.csv") .repartition(2*sc.defaultParallelism) .map(line => line.split(",") match { case Array(id,name) => User(java.util.UUID.fromString(id), name)}) users.saveToCassandra("keyspace","users") DATA LOADING 18.06.2015 31
  • 23. codecentric AG – Validate consistency in a Cassandra Database  syntactic  Uniqueness (only relevant for columns not in the PK)  Referential integrity  Integrity of the duplicates  semantic  Business- or Application constraints  e.g.: At least one genre per movies, a maximum of 10 tags per blog post VALIDATION 18.06.2015 32
  • 24. codecentric AG – Modelling, Mining, Transforming, .... – Use Cases  Recommendation  Fraud Detection  Link Analysis (Social Networks, Web)  Advertising  Data Stream Analytics ( Spark Streaming)  Machine Learning ( Spark ML) ANALYSES 18.06.2015 33
  • 25. codecentric AG – Changes on existing tables  New table required when changing primary key  Otherwise changes could be performed in-place – Creating new tables  data derived from existing tables  Support new queries – Use the CassandraConnectors in Spark val cc = CassandraConnector(sc.getConf) cc.withSessionDo{session => session.execute(ALTER TABLE movie.movies ADD year timestamp} SCHEMA MIGRATION & DATA CONVERSION 18.06.2015 34
  • 26. codecentric AG STREAM PROCESSING 18.06.2015 35 .. with Spark Streaming – Real Time Processing – Supported sources: TCP, HDFS, S3, Kafka, Twitter,.. – Data as Discretized Stream (DStream) – All Operations of the GenericRDD – Stateful Operations – Sliding Windows
  • 27. codecentric AG val ssc = new StreamingContext(sc,Seconds(1)) val stream = ssc.socketTextStream("127.0.0.1",9999) stream.map(x => 1).reduce(_ + _).print() ssc.start() // await manual termination or error ssc.awaitTermination() // manual termination ssc.stop() SPARK STREAMING - BEISPIEL 18.06.2015 36
  • 28. codecentric AG – SQL Queries with Spark (SQL & HiveQL)  On structured data  On SchemaRDD‘s  Every result of Spark SQL are SchemaRDD‘s  All operations of the GenericRDD‘s available – Unterstützt, auch auf nicht PK Spalten,...  Joins  Union  Group By  Having  Order By SPARK SQL 18.06.2015 37
  • 29. codecentric AG val csc = new CassandraSQLContext(sc) csc.setKeyspace("musicdb") val result = csc.sql("SELECT country, COUNT(*) as anzahl" + "FROM artists GROUP BY country" + "ORDER BY anzahl DESC"); result.collect.foreach(println); SPARK SQL – CASSANDRA BEISPIEL 18.06.2015 38 performer name text K style text country text type text
  • 30. codecentric AG val sqlContext = new SQLContext(sc) val personen = sqlContext.jsonFile(path) // Show the schema personen.printSchema() personen.registerTempTable("personen") val erwachsene = sqlContext.sql("SELECT name FROM personen WHERE alter > 18") erwachsene.collect.foreach(println) SPARK SQL – RDD / JSON BEISPIEL 18.06.2015 39 {"name":"Michael"} {"name":"Jan", "alter":30} {"name":"Tim", "alter":17}
  • 31. codecentric AG – Fully integrated in Spark  Scalable  Scala, Java & Python APIs  Use with Spark Streaming & Spark SQL – Packages various algorithms for machine learning – Includes  Clustering  Classification  Prediction  Collaborative Filtering – Still under development  performance, algorithms SPARK MLLIB 18.06.2015 40
  • 32. codecentric AG age EXAMPLE – CLUSTERING 18.06.2015 41 set of data points meaningful clusters income
  • 33. codecentric AG // Load and parse data val data = sc.textFile("data/mllib/kmeans_data.txt") val parsedData = data.map(s => Vectors.dense( s.split(' ').map(_.toDouble))).cache() // Cluster the data into 3 classes using KMeans with 20 iterations val clusters = KMeans.train(parsedData, 2, 20) // Evaluate clustering by computing Sum of Squared Errors val SSE = clusters.computeCost(parsedData) println("Sum of Squared Errors = " + WSSSE) EXAMPLE – CLUSTERING (K-MEANS) 18.06.2015 42
  • 34. codecentric AG EXAMPLE – CLASSIFICATION 18.06.2015 43
  • 35. codecentric AG EXAMPLE – CLASSIFICATION 18.06.2015 44
  • 36. codecentric AG // Load training data in LIBSVM format. val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt") // Split data into training (60%) and test (40%). val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L) val training = splits(0).cache() val test = splits(1) // Run training algorithm to build the model val numIterations = 100 val model = SVMWithSGD.train(training, numIterations) EXAMPLE – CLASSIFICATION (LINEAR SVM) 18.06.2015 45
  • 37. codecentric AG // Compute raw scores on the test set. val scoreAndLabels = test.map { point => val score = model.predict(point.features) (score, point.label) } // Get evaluation metrics. val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auROC = metrics.areaUnderROC() println("Area under ROC = " + auROC) EXAMPLE – CLASSIFICATION (LINEAR SVM) 18.06.2015 46
  • 39. codecentric AG // Load and parse the data (userid,itemid,rating) val data = sc.textFile("data/mllib/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) // Build the recommendation model using ALS val rank = 10 val numIterations = 20 val model = ALS.train(ratings, rank, numIterations, 0.01) COLLABORATIVE FILTERING (ALS) 18.06.2015 48
  • 40. codecentric AG // Load and parse the data (userid,itemid,rating) val data = sc.textFile("data/mllib/als/test.data") val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) // Build the recommendation model using ALS val rank = 10 val numIterations = 20 val model = ALS.train(ratings, rank, numIterations, 0.01) COLLABORATIVE FILTERING (ALS) 18.06.2015 49
  • 41. codecentric AG // Evaluate the model on rating data val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) } val predictions = model.predict(usersProducts).map { case Rating(user, product, rate) => ((user, product), rate) } val ratesAndPreds = ratings.map { case Rating(user, product, rate) => ((user, product), rate)}.join(predictions) val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) => val err = (r1 - r2); err * err }.mean() println("Mean Squared Error = " + MSE) COLLABORATIVE FILTERING (ALS) 18.06.2015 50
  • 42. codecentric AG – Setup, Codes & Examples on Github  https://github.com/mniehoff/spark-cassandra-playground – Install Cassandra  For local experiments: Cassandra Cluster Manager https://github.com/pcmanus/ccm  ccm create test -v binary:2.0.5 -n 3 –s  ccm status  ccm node1 status – Install Spark  Tar Ball (Pre Built for Hadoop 2.6)  HomeBrew  MacPorts CURIOUS? GETTING STARTED! 18.06.2015 51
  • 43. codecentric AG – Spark Cassandra Connector  https://github.com/datastax/spark-cassandra-connector  Clone, ./sbt/sbt assembly or  Download Jar @ Maven Central – Start Spark shell spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0- SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost – Import Cassandra classes scala> import com.datastax.spark.connector._, CURIOUS? GETTING STARTED! 18.06.2015 52
  • 44. codecentric AG – Production ready Cassandra – Integrates  Spark  Hadoop  Solr – OpsCenter & DevCenter – Bundled and ready to go – Free for development and test – http://www.datastax.com/download (One time registration required) DATASTAX ENTERPRISE EDITION 18.06.2015 53
  • 46. codecentric AG Matthias Niehoff codecentric AG Zeppelinstraße 2 76185 Karlsruhe tel +49 (0) 721 – 95 95 681 [email protected] KONTAKT 18.06.2015 58

Editor's Notes

  • #5: token = murmur3(partition key)
  • #9: speichert Liste der Partitionen
  • #12: tasks from different applications run in different JVMs Driver programm near to workers, e.g. same LAN
  • #13: GroupBy, ReduceBy, SQL Joins
  • #14: - Cassandra as Datasotre Spark for Data Modelling & Data Analytics Data Locality
  • #15: Data Locality Workload Isolation
  • #16: Master: Worker, Running / Completed Apps. Worker: Executors
  • #18: Anwendung: Jar hinzufügen, Context erzeugen
  • #19: Filter rows possible on primary key and indexed columns
  • #24: Alternativ: keyBy (rdd => key)
  • #27: toSeq.sortBy(-_._2) für Sortierung
  • #29: Join n keys -> n rows Cogroup -> Same keys are grouped
  • #30: Data Locality: Partitions aus den Daten des lokalen C* Nodes
  • #32: defaultParallelism: Total Number of Cores on all Nodes http://icons.iconarchive.com/icons/custom-icon-design/office/256/import-icon.png
  • #33: http://usacanadaregion.org/sites/usacanadaregion.org/files/Images/checkmark.jpg
  • #34: http://www.webhostingreviewjam.com/wp-content/uploads/2013/02/icon_analytics.png
  • #36: Intervall: 500ms bis einige Sekunden https://spark.apache.org/docs/latest/img/streaming-flow.png
  • #38: Schema RDD: Schema Informationen, besteht aus Zeilen.
  • #39: Tabelle mit Sänger, PK nur der Name
  • #40: Tabelle mit Sänger, PK nur der Name
  • #42: unsupervised
  • #43: clusters.clusterCenters
  • #44: supervised
  • #45: supervised
  • #46: SVM = Support Vector Machine