Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

codecentric AG
Matthias Niehoff
Big Data Analytics with Cassandra, Spark & MLLib

codecentric AG
– Spark
 Basics
 In A Cluster
 Cassandra Spark Connector
 Use Cases
– Spark Streaming
– Spark SQL
– Spark MLLib
– Live Demo
AGENDA

codecentric AG
SELECT * FROM performer WHERE name = 'ACDC’
ok
SELECT * FROM performer WHERE name = 'ACDC’
and country = ’Australia’
not ok
SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER
BY quantity DESC
not possible
CQL – QUERYING LANGUAGE WITH LIMITATIONS
18.06.2015 5
performer
name text K
style text
country text
type text

codecentric AG
– Open Source & Apache project since 2010
– Data processing framework
 Batch processing
 Stream processing
WHAT IS APACHE SPARK?
18/06/2015 6

codecentric AG
– Fast
 up to 100 times faster than Hadoop
 a lot of in-memory processing
 scalable about nodes
– Easy
 Scala, Java and Python API
 Clean Code (e.g. with lambdas in Java 8)
 expanded API: map, reduce, filter, groupBy, sort, union, join, reduceByKey,
groupByKey, sample, take, first, count ..
– Fault-Tolerant
 easily reproducible
WHY USE SPARK?
18.06.2015 7

codecentric AG
– RDD‘s – Resilient Distributed Dataset
 Read – Only Collection of Objects
 Distributed among the Cluster (on memory or disk)
 Determined through transformations
 Automatically rebuild on failure
– Operations
 Transformations (map,filter,reduce...)  new RDD
 Actions (count, collect, save)
– Only Actions start processing!
EASILY REPRODUCIBLE?
18.06.2015 8

codecentric AG
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long = 126
RDD EXAMPLE
18.06.2015 9

codecentric AG
REPRODUCE RDD USING A TREE
18.06.2015 10
Datenquelle
rdd1
rdd3
val1 rdd5
rdd2
rdd4
val2
rdd6
val3
map(..)filter(..)
union(..)
count()
count() count()
sample(..)
cache()

codecentric AG
– Spark Cassandra Connector by Datastax
 https://github.com/datastax/spark-cassandra-connector
– Cassandra tables as Spark RDD (read & write)
– Mapping of C* tables and rows onto Java/Scala objects
– Server-Side filtering („where“)
– Compatible with
 Spark ≥ 0.9
 Cassandra ≥ 2.0
– Clone & Compile with SBT or download at Maven Central
SPARK ON CASSANDRA
18.06.2015 16

codecentric AG
– Start Spark Shell
bin/spark-shell
--jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar
--conf spark.cassandra.connection.host=localhost
--driver-memory 2g
--executor-memory 3g
– Import Cassandra Classes
scala> import com.datastax.spark.connector._,
USE SPARK CONNECTOR
18.06.2015 17

codecentric AG
– Read complete table
val movies = sc.cassandraTable("movie", "movies")
// Return type: CassandraRDD[CassandraRow]
– Read selected columns
val movies = sc.cassandraTable("movie", "movies").select("title","year")
– Filter Rows
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
READ TABLE
18.06.2015 18

codecentric AG
– Access columns in ResultSet
movies.collect.foreach(r => println(r.get[String]("title")))
– As Scala tuple
sc.cassandraTable[(String,Int)]("movie","movies")
.select("title","year")
or
sc.cassandraTable("movie","movies").select("title","year")
.as((_: String, _:Int))
CassandraRDD[(String,Int)]
READ TABLE
18.06.2015 19
getString,
getStringOption,
getOrElse, getMap,

codecentric AG
case class Movie(name: String, year: Int)
sc.cassandraTable[Movie]("movie","movies")
.select("title","year")
sc.cassandraTable("movie","movies").select("title","year")
.as(Movie)
CassandraRDD[Movie]
READ TABLE – CASE CLASS
18.06.2015 20

codecentric AG
– Every RDD can be saved (not only CassandraRDD)
– Tuple
val tuples = sc.parallelize(Seq(("Hobbit",2012),
("96 Hours",2008)))
tuples.saveToCassandra("movie","movies",
SomeColumns("title","year")
- Case Class
case class Movie (title:String, year: int)
val objects = sc.parallelize(Seq( Movie("Hobbit",2012),
Movie("96 Hours",2008)))
objects.saveToCassandra("movie","movies")
WRITE TABLE
18.06.2015 21

codecentric AG
( [atomic,collection,object] , [atomic,collection,object])
val fluege= List( ("Thomas", "Berlin") ,("Mark", "Paris"), ("Thomas", "Madrid") )
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas")
.collect
.foreach(t => println(t._1 + " flog nach " + t._2))
PAIR RDD‘S
18.06.2015 23
key – not unique value

codecentric AG
– Parallelization!
 keys are use for partitioning
 pairs with different keys are distributed across the cluster
– Efficient processing of
 aggregate by key
 group by key
 sort by key
 joins, union based on keys
WHY USE PAIR RDD'S?
18.06.2015 24

codecentric AG
– keys(), values()
– mapValues(func), flatMapValues(func)
– lookup(key), collectAsMap(), countByKey()
– reduceByKey(func), foldByKey(zeroValue)(func)
– groupByKey(), cogroup(otherDataset)
– sortByKey([ascending])
– join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
– union(otherDataset), substractByKey(otherDataset)
SPECIAL OP‘S FOR PAIR RDD‘S
18.06.2015 25

codecentric AG
val pairRDD = sc.cassandraTable("movie","director")
.map(r => (r.getString("country"),r))
// Directors / Country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_+_).sortBy(-
_._2).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// All Countries
pairRDD.keys()
CASSANDRA EXAMPLE
18.06.2015 26
director
name text K
country text

codecentric AG
pairRDD.groupByCountry()
RDD[(String,Iterable[CassandraRow])]
val directors = sc.cassandraTable(..).map(r => r.getString("name"),r))
val movies = sc.cassandraTable().map(r => r.getString("director"),r))
directors.cogroup(movies)
RDD[(String,
(Iterable[CassandraRow], Iterable[CassandraRow]))]
CASSANDRA EXAMPLE
18.06.2015 27
director
name text K
country text
movie
title text K
director text
Director Movies

codecentric AG
– Joins could be expensive
 partitions for same keys in different
tables on different nodes
 requires shuffling
val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r))
val movies = sc.cassandraTable().map(r => (r.getString("director"),r))
movies.join(directors)
 RDD[(String, (CassandraRow, CassandraRow))]
CASSANDRA EXAMPLE - JOINS
18.06.2015 28
director
name text K
country text
movie
title text K
director text
DirectorMovie

codecentric AG
USE CASES
18.06.2015 30
DataLoading
Validation& Normalization
Analyses (Joins, Transformations,..)
Schema Migration
DataConversion

codecentric AG
– In particular for huge amounts of external data
– Support for CSV, TSV, XML, JSON und other
– Example:
case class User (id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
.repartition(2*sc.defaultParallelism)
.map(line => line.split(",") match { case Array(id,name) => User(java.util.UUID.fromString(id), name)})
users.saveToCassandra("keyspace","users")
DATA LOADING
18.06.2015 31

codecentric AG
– Validate consistency in a Cassandra Database
 syntactic
 Uniqueness (only relevant for columns not in the PK)
 Referential integrity
 Integrity of the duplicates
 semantic
 Business- or Application constraints
 e.g.: At least one genre per movies, a maximum of 10 tags per blog post
VALIDATION
18.06.2015 32

codecentric AG
– Modelling, Mining, Transforming, ....
– Use Cases
 Recommendation
 Fraud Detection
 Link Analysis (Social Networks, Web)
 Advertising
 Data Stream Analytics ( Spark Streaming)
 Machine Learning ( Spark ML)
ANALYSES
18.06.2015 33

codecentric AG
– Changes on existing tables
 New table required when changing primary key
 Otherwise changes could be performed in-place
– Creating new tables
 data derived from existing tables
 Support new queries
– Use the CassandraConnectors in Spark
val cc = CassandraConnector(sc.getConf)
cc.withSessionDo{session => session.execute(ALTER TABLE movie.movies ADD
year timestamp}
SCHEMA MIGRATION & DATA CONVERSION
18.06.2015 34

codecentric AG
STREAM PROCESSING
18.06.2015 35
.. with Spark Streaming
– Real Time Processing
– Supported sources: TCP, HDFS, S3, Kafka, Twitter,..
– Data as Discretized Stream (DStream)
– All Operations of the GenericRDD
– Stateful Operations
– Sliding Windows

codecentric AG
val ssc = new StreamingContext(sc,Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1",9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
SPARK STREAMING - BEISPIEL
18.06.2015 36

codecentric AG
– SQL Queries with Spark (SQL & HiveQL)
 On structured data
 On SchemaRDD‘s
 Every result of Spark SQL are SchemaRDD‘s
 All operations of the GenericRDD‘s available
– Unterstützt, auch auf nicht PK Spalten,...
 Joins
 Union
 Group By
 Having
 Order By
SPARK SQL
18.06.2015 37

codecentric AG
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) as anzahl" +
"FROM artists GROUP BY country" +
"ORDER BY anzahl DESC");
result.collect.foreach(println);
SPARK SQL – CASSANDRA BEISPIEL
18.06.2015 38
performer
name text K
style text
country text
type text

codecentric AG
val sqlContext = new SQLContext(sc)
val personen = sqlContext.jsonFile(path)
// Show the schema
personen.printSchema()
personen.registerTempTable("personen")
val erwachsene =
sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
erwachsene.collect.foreach(println)
SPARK SQL – RDD / JSON BEISPIEL
18.06.2015 39
{"name":"Michael"}
{"name":"Jan", "alter":30}
{"name":"Tim", "alter":17}

codecentric AG
– Fully integrated in Spark
 Scalable
 Scala, Java & Python APIs
 Use with Spark Streaming & Spark SQL
– Packages various algorithms for machine learning
– Includes
 Clustering
 Classification
 Prediction
 Collaborative Filtering
– Still under development
 performance, algorithms
SPARK MLLIB
18.06.2015 40

codecentric AG
age
EXAMPLE – CLUSTERING
18.06.2015 41
set of data points meaningful clusters
income

codecentric AG
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(
s.split(' ').map(_.toDouble))).cache()
// Cluster the data into 3 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate clustering by computing Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + WSSSE)
EXAMPLE – CLUSTERING (K-MEANS)
18.06.2015 42

codecentric AG
EXAMPLE – CLASSIFICATION
18.06.2015 43

codecentric AG
EXAMPLE – CLASSIFICATION
18.06.2015 44

codecentric AG
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
EXAMPLE – CLASSIFICATION (LINEAR SVM)
18.06.2015 45

codecentric AG
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
EXAMPLE – CLASSIFICATION (LINEAR SVM)
18.06.2015 46

codecentric AG
COLLABORATIVE FILTERING
18.06.2015 47

codecentric AG
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
=> Rating(user.toInt, item.toInt, rate.toDouble) })
// Build the recommendation model using ALS
val rank = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
COLLABORATIVE FILTERING (ALS)
18.06.2015 48

codecentric AG
// Load and parse the data (userid,itemid,rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate)
=> Rating(user.toInt, item.toInt, rate.toDouble) })
// Build the recommendation model using ALS
val rank = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
18.06.2015 49

codecentric AG
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate)
=> (user, product) }
val predictions = model.predict(usersProducts).map {
case Rating(user, product, rate) => ((user, product), rate) }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2); err * err }.mean()
println("Mean Squared Error = " + MSE)
18.06.2015 50

codecentric AG
– Setup, Codes & Examples on Github
 https://github.com/mniehoff/spark-cassandra-playground
– Install Cassandra
 For local experiments: Cassandra Cluster Manager
https://github.com/pcmanus/ccm
 ccm create test -v binary:2.0.5 -n 3 –s
 ccm status
 ccm node1 status
– Install Spark
 Tar Ball (Pre Built for Hadoop 2.6)
 HomeBrew
 MacPorts
CURIOUS? GETTING STARTED!
18.06.2015 51

codecentric AG
– Spark Cassandra Connector
 https://github.com/datastax/spark-cassandra-connector
 Clone, ./sbt/sbt assembly
or
 Download Jar @ Maven Central
– Start Spark shell
spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-
SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost
– Import Cassandra classes
scala> import com.datastax.spark.connector._,
CURIOUS? GETTING STARTED!
18.06.2015 52

codecentric AG
– Production ready Cassandra
– Integrates
 Spark
 Hadoop
 Solr
– OpsCenter & DevCenter
– Bundled and ready to go
– Free for development and test
– http://www.datastax.com/download
(One time registration required)
DATASTAX ENTERPRISE EDITION
18.06.2015 53

codecentric AG
QUESTIONS?
18.06.2015 57

codecentric AG
Matthias Niehoff
codecentric AG
Zeppelinstraße 2
76185 Karlsruhe
tel +49 (0) 721 – 95 95 681
matthias.niehoff@codecentric.de
KONTAKT
18.06.2015 58

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

Recommended

More Related Content

What's hot (20)

Similar to Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day (20)

Recently uploaded (20)

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

Editor's Notes