SlideShare a Scribd company logo
©2014 DataStax Confidential. Do not distribute without consent.
@helenaedelson
Helena Edelson
Lambda Architecture with Spark
Streaming, Kafka, Cassandra, Akka, Scala
1
• Spark Cassandra Connector committer
• Akka contributor (Akka Cluster)
• Scala & Big Data conference speaker
• Sr Software Engineer, Analytics @ DataStax
• Sr Cloud Engineer, VMware,CrowdStrike,SpringSource…
• (Prev) Spring committer - Spring AMQP, Spring Integration
Analytic
Who Is This Person?
Analytic
Analytic
Search
Use Case: Hadoop + Scalding
/** Reads SequenceFile data from S3 buckets, computes then persists to Cassandra. */
class TopSearches(args: Args) extends TopKDailyJob[MyDataType](args) with Cassandra {
PailSource.source[Search](rootpath, structure, directories).read
.mapTo('pailItem -> 'engines) { e: Search ⇒ results(e) }
.filter('engines) { e: String ⇒ e.nonEmpty }
.groupBy('engines) { _.size('count).sortBy('engines) }
.groupBy('engines) { _.sortedReverseTake[(String, Long)](('engines, 'count) -> 'tcount, k) }
.flatMapTo('tcount -> ('key, 'engine, 'topCount)) { t: List[(String, Long)] ⇒
t map { case (k, v) ⇒ (jobKey, k, v) }}
.write(CassandraSource(connection, "top_searches", Scheme(‘key, ('engine, ‘topCount))))
}
Analytic
Analytic
Search
Use Case: Spark?
Talk Roadmap
What Delivering Meaning
Why Spark, Kafka, Cassandra & Akka
How Composable Pipelines
App Robust Implementation
©2014 DataStax Confidential. Do not distribute without consent.
Sending Data Between Systems Is
Difficult Risky
6
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
Strategies
• Scalable Infrastructure
• Partition For Scale
• Replicate For Resiliency
• Share Nothing
• Asynchronous Message Passing
• Parallelism
• Isolation
• Location Transparency
Strategy Technologies
Scalable Infrastructure / Elastic
scale on demand
Spark, Cassandra, Kafka
Partition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka Cluster
Replicate For Resiliency
span racks and datacenters, survive regional outages
Spark,Cassandra, Akka Cluster all hash the node ring
Share Nothing, Masterless Cassandra, Akka Cluster both Dynamo style
Fault Tolerance / No Single Point of Failure Spark, Cassandra, Kafka
Replay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka Persistence
Failure Detection Cassandra, Spark, Akka, Kafka
Consensus & Gossip Cassandra & Akka Cluster
Parallelism Spark, Cassandra, Kafka, Akka
Asynchronous Data Passing Kafka, Akka, Spark
Fast, Low Latency, Data Locality Cassandra, Spark, Kafka
Location Transparency Akka, Spark, Cassandra, Kafka
Lambda Architecture
A data-processing architecture designed to handle massive quantities of
data by taking advantage of both batch and stream processing methods.
• Spark is one of the few data processing frameworks that allows you to
seamlessly integrate batch and stream processing
• Of petabytes of data
• In the same application
I need fast access to historical data
on the fly for predictive modeling
with real time data
from the stream
Analytic
Analytic
Search
• Fast, distributed, scalable and fault
tolerant cluster compute system
• Enables Low-latency with complex
analytics
• Developed in 2009 at UC Berkeley
AMPLab, open sourced in 2010, and
became a top-level Apache project in
February, 2014
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Support Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
The one thing in your infrastructure
you can always rely on.
•Massively Scalable
• High Performance
• Always On
• Masterless
Spark Cassandra
Connector
• Fault tolerant
• Hierarchical Supervision
• Customizable Failure Strategies & Detection
• Asynchronous Data Passing
• Parallelization - Balancing Pool Routers
• Akka Cluster
• Adaptive / Predictive
• Load-Balanced Across Cluster Nodes
• Stream data from Kafka to Cassandra
• Stream data from Kafka to Spark and write to Cassandra
• Stream from Cassandra to Spark - coming soon!
• Read data from Spark/Spark Streaming Source and write to C*
• Read data from Cassandra to Spark
Your Code
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
Most Active OSS In Big Data
Search
Apache Spark - Easy to Use API
Returns the top (k) highest temps for any location in the year
def topK(aggregate: Seq[Double]): Seq[Double] =
sc.parallelize(aggregate).top(k).collect
Returns the top (k) highest temps … in a Future
def topK(aggregate: Seq[Double]): Future[Seq[Double]] =
sc.parallelize(aggregate).top(k).collectAsync
Analytic
Analytic
Search
© 2014 DataStax, All Rights Reserved Company Confidential
Not Just MapReduce
Spark Basic Word Count
val conf = new SparkConf()
.setMaster(host).setAppName(app)



val sc = new SparkContext(conf)
sc.textFile(words)
.flatMap(_.split("s+"))
.map(word => (word.toLowerCase, 1))
.reduceByKey(_ + _)
.collect
Analytic
Analytic
Search
Analytic
Analytic
Search
Collection To RDD
scala> val data = Array(1, 2, 3, 4, 5)

data: Array[Int] = Array(1, 2, 3, 4, 5)



scala> val distributedData = sc.parallelize(data)

distributedData: spark.RDD[Int] =
spark.ParallelCollection@10d13e3e
Analytic
Analytic
Search
Transformation
Action
RDD Operations
When Batch Is Not Enough
Analytic
Analytic
Spark Streaming
• I want results continuously in the event stream
• I want to run computations in my even-driven async apps
• Exactly once message guarantees
Spark Streaming Setup
val conf = new SparkConf().setMaster(SparkMaster).setAppName(AppName)
val ssc = new StreamingContext(conf, Milliseconds(500))
// Do work in the stream
ssc.checkpoint(checkpointDir)
ssc.start()
ssc.awaitTermination
DStream (Discretized Stream)
RDD (time 0 to time 1) RDD (time 1 to time 2) RDD (time 2 to time 3)
A transformation on a DStream = transformations on its RDDs
DStream
Continuous stream of micro batches
• Complex processing models with minimal effort
• Streaming computations on small time intervals
Spark Streaming External Source/Sink
DStreams - the stream of raw data received from streaming sources:
• Basic Source - in the StreamingContext API
• Advanced Source - in external modules and separate Spark artifacts
Receivers
• Reliable Receivers - for data sources supporting acks (like Kafka)
• Unreliable Receivers - for data sources not supporting acks
33
ReceiverInputDStreams
Basic Streaming: FileInputDStream
// Creates new DStreams
ssc.textFileStream("s3n://raw_data_bucket/")
.flatMap(_.split("s+"))
.map(_.toLowerCase, 1))
.countByValue()
.saveAsObjectFile("s3n://analytics_bucket/")
Search
Streaming Window Operations
kvStream
.flatMap { case (k,v) => (k,v.value) }
.reduceByKeyAndWindow((a:Int,b:Int) =>
(a + b), Seconds(30), Seconds(10))
.saveToCassandra(keyspace,table)
Window Length:
Duration = every 10s Sliding Interval:
Interval at which the window operation
is performed = every 10 s
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
Apache Cassandra
• Elasticity - scale to as many nodes as you need, when you need
• Always On - No single point of failure, Continuous availability
• Masterless peer to peer architecture
• Replication Across DataCenters
• Flexible Data Storage
• Read and write to any node syncs across the cluster
• Operational simplicity - all nodes in a cluster are the same
• Fast Linear-Scale Performance
•Transaction Support
Security, Machine Learning
Science
Physics: Astro Physics / Particle Physics..
Genetics / Biological Computations
IoT
Europe
US
Cassandra
• Availability Model
• Amazon Dynamo
• Distributed, Masterless
• Data Model
• Google Big Table
• Network Topology Aware
• Multi-Datacenter Replication
Analytics with Spark Over Cassandra
Online
Analytics
Cassandra enables Spark nodes
to transparently communicate
across data centers for data
Handling Failure
Async Replication
What Happens When Nodes Come Back?
• Hinted handoff to the rescue
• Coordinators keep writes for downed nodes for a
configurable amount of time
Gossip
Did you hear node
1 was down??
Consensus
• Consensus, the agreement among peers on the value of a shared piece
of data, is a core building block of Distributed systems
• Cassandra supports consensus via the Paxos protocol
CREATE TABLE users (
username varchar,
firstname varchar,
lastname varchar,
email list<varchar>,
password varchar,
created_date timestamp,
PRIMARY KEY (username)
);
INSERT INTO users (username, firstname, lastname,
email, password, created_date)
VALUES ('hedelson','Helena','Edelson',
[‘helena.edelson@datastax.com'],'ba27e03fd95e507daf2937c937d499ab','2014-11-15 13:50:00’)
IF NOT EXISTS;
• Familiar syntax
• Many Tools & Drivers
• Many Languages
• Friendly to programmers
• Paxos for locking
CQL - Easy
CREATE	
  TABLE	
  weather.raw_data	
  (

	
  	
  	
  wsid	
  text,	
  year	
  int,	
  month	
  int,	
  day	
  int,	
  hour	
  int,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  

	
  	
  	
  temperature	
  double,	
  dewpoint	
  double,	
  pressure	
  double,	
  	
  
	
  	
  	
  wind_direction	
  int,	
  wind_speed	
  double,	
  one_hour_precip	
  	
  	
  	
  
	
  	
  	
  PRIMARY	
  KEY	
  ((wsid),	
  year,	
  month,	
  day,	
  hour)

)	
  WITH	
  CLUSTERING	
  ORDER	
  BY	
  (year	
  DESC,	
  month	
  DESC,	
  day	
  DESC,	
  hour	
  DESC);	
  
C* Clustering Columns Writes by most recent
Reads return most recent first
Timeseries
Cassandra will automatically sort by most recent for both both write and read
val multipleStreams = (1 to numDstreams).map { i =>
streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))
}
streamingContext.union(multipleStreams)
.map { httpRequest => TimelineRequestEvent(httpRequest)}
.saveToCassandra("requests_ks", "timeline")
A record of every event, in order in which it happened, per URL:
CREATE TABLE IF NOT EXISTS requests_ks.timeline (
timesegment bigint, url text, t_uuid timeuuid, method text, headers map <text, text>, body text,
PRIMARY KEY ((url, timesegment) , t_uuid)
);
timeuuid protects from simultaneous events over-writing one another.
timesegment protects from writing unbounded partitions.
©2014 DataStax Confidential. Do not distribute without consent.
Spark Cassandra Connector
52
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
• Write data from Spark to Cassandra
• Read data from Cassandra to Spark
• Data Locality for Speed
• Easy, and often implicit, type conversions
• Server-Side Filtering - SELECT, WHERE, etc.
• Natural Timeseries Integration
• Implemented in Scala
Spark Cassandra Connector
C*
C*
C*C*
Spark Executor
C* Driver
Spark-Cassandra Connector
User Application
Cassandra
55
Co-locate Spark and C* for Best
Performance
C*
C*C*
C*
Spark

Worker
Spark

Worker
Spark
Master
Spark
Worker
Running Spark Workers on
the same nodes as your C* Cluster
saves network hops
Analytic
Search
Writing and Reading
SparkContext
import	
  com.datastax.spark.connector._	
  
StreamingContext	
  
import	
  com.datastax.spark.connector.streaming._
Analytic
Write from Spark to Cassandra
sc.parallelize(collection).saveToCassandra("keyspace",	
  "raw_data")
SparkContext Keyspace Table
Spark RDD JOIN with NOSQL!
predictionsRdd.join(music).saveToCassandra("music",	
  "predictions")
Read From C* to Spark
val	
  rdd	
  =	
  sc.cassandraTable("github",	
  "commits")	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .select("user","count","year","month")	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .where("commits	
  >=	
  ?	
  and	
  year	
  =	
  ?",	
  1000,	
  2015)
CassandraRDD[CassandraRow]
Keyspace Table
Server-Side Column
and Row Filtering
SparkContext
val	
  rdd	
  =	
  ssc.cassandraTable[MonthlyCommits]("github",	
  "commits_aggregate")	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .where("user	
  =	
  ?	
  and	
  project_name	
  =	
  ?	
  and	
  year	
  =	
  ?",	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "helena",	
  "spark-­‐cassandra-­‐connector",	
  2015)
CassandraRow Keyspace TableStreamingContext
Rows: Custom Objects
Rows
val	
  tuplesRdd	
  =	
  sc.cassandraTable[(Int,Date,String)](db,	
  tweetsTable)	
  
	
  .select("cluster_id","time",	
  "cluster_name")	
  
	
  .where("time	
  >	
  ?	
  and	
  time	
  <	
  ?",	
  
	
  	
  	
  	
  	
  	
  	
  	
  "2014-­‐07-­‐12	
  20:00:01",	
  "2014-­‐07-­‐12	
  20:00:03”)	
  
val	
  keyValuesPairsRdd	
  =	
  sc.cassandraTable[(Key,Value)](keyspace,	
  table)	
  
Rows
val rdd = ssc.cassandraTable[MyDataType]("stats", "clustering_time")
rdd.where("key = 1").limit(10).collect	
  
rdd.where("key = 1").take(10).collect
val	
  rdd	
  =	
  ssc.cassandraTable[(Int,DateTime,String)]("stats",	
  "clustering_time")	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .where("key	
  =	
  1").withAscOrder.collect	
  
val	
  rdd	
  =	
  ssc.cassandraTable[(Int,DateTime,String)]("stats",	
  "clustering_time")	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .where("key	
  =	
  1").withDescOrder.collect	
  
Cassandra User Defined Types
CREATE TYPE address (
street text,
city text,
zip_code int,
country text,
cross_streets set<text>
);
UDT = Your Custom Field Type In Cassandra
Cassandra UDT’s With JSON
{
"productId": 2,
"name": "Kitchen Table",
"price": 249.99,
"description" : "Rectangular table with oak finish",
"dimensions": {
"units": "inches",
"length": 50.0,
"width": 66.0,
"height": 32
},
"categories": {
{
"category" : "Home Furnishings" {
"catalogPage": 45,
"url": "/home/furnishings"
},
{
"category" : "Kitchen Furnishings" {
"catalogPage": 108,
"url": "/kitchen/furnishings"
}
}
}
CREATE TYPE dimensions (
units text,
length float,
width float,
height float
);
CREATE TYPE category (
catalogPage int,
url text
);
CREATE TABLE product (
productId int,
name text,
price float,
description text,
dimensions frozen <dimensions>,
categories map <text, frozen <category>>,
PRIMARY KEY (productId)
);
©2014 DataStax Confidential. Do not distribute without consent.
Composable Pipelines
With Spark, Kafka & Cassandra
64
Spark SQL with Cassandra
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sparkContext)
cc.setKeyspace(keyspaceName)
cc.sql("""
SELECT table1.a, table1.b, table.c, table2.a
FROM table1 AS table1
JOIN table2 AS table2 ON table1.a = table2.a
AND table1.b = table2.b
AND table1.c = table2.c
""")
.map(Data(_))
.saveToCassandra(keyspace1, table3)
Cross Keyspaces (Databases)
val schemaRdd1 = cassandraSqlContext.sql("""
INSERT INTO keyspace2.table3
SELECT table2.a, table2.b, table2.c
FROM keyspace1.table2 as table2""")
val schemaRdd2 = cassandraSqlContext.sql("""
SELECT table3.a, table3.b, table3.c
FROM keyspace2.table3 as table3""")
val schemaRdd3 = cassandraSqlContext.sql("""
SELECT test1.a, test1.b, test1.c, test2.a
FROM database1.test1 AS test1

JOIN database2.test2 AS test2 ON test1.a = test2.a
AND test1.b = test2.b AND test1.c = test2.c""")


val sql = new SQLContext(sparkContext)
val json = Seq(

"""{"user":"helena","commits":98, "month":3, "year":2015}""",

"""{"user":"jacek-lewandowski", "commits":72, "month":3, "year":2015}""",

"""{"user":"pkolaczk", "commits":42, "month":3, "year":2015}""")
// write
sql.jsonRDD(json)
.map(CommitStats(_))
.flatMap(compute)
.saveToCassandra("stats","monthly_commits")

// read
val rdd = sc.cassandraTable[MonthlyCommits]("stats","monthly_commits")
cqlsh>	
  CREATE	
  TABLE	
  github_stats.commits_aggr(user	
  VARCHAR	
  PRIMARY	
  KEY,	
  commits	
  INT…);
Spark SQL with Cassandra & JSON
Analytic
Analytic
Search
Spark Streaming, Kafka, C* and JSON
cqlsh>	
  select	
  *	
  from	
  github_stats.commits_aggr;	
  


	
   user | commits | month | year
-------------------+---------+-------+------
pkolaczk | 42 | 3 | 2015
jacek-lewandowski | 43 | 3 | 2015
helena | 98 | 3 | 2015

(3	
  rows)	
  
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](

ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)

.map { case (_,json) => JsonParser.parse(json).extract[MonthlyCommits]}

.saveToCassandra("github_stats","commits_aggr")
Spark Streaming, Kafka & Cassandra
sparkConf.set("spark.cassandra.connection.host", "10.20.3.45")

val streamingContext = new StreamingContext(conf, Seconds(30))

KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](

streamingContext, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
.map(_._2)
.countByValue()
.saveToCassandra("my_keyspace","wordcount")
Spark Streaming, Twitter & Cassandra
/** Cassandra is doing the sorting for you here. */

TwitterUtils.createStream(
ssc, auth, tags, StorageLevel.MEMORY_ONLY_SER_2)

.flatMap(_.getText.toLowerCase.split("""s+"""))

.filter(tags.contains(_))

.countByValueAndWindow(Seconds(5), Seconds(5))

.transform((rdd, time) =>
rdd.map { case (term, count) => (term, count, now(time))})

.saveToCassandra(keyspace, table)
CREATE TABLE IF NOT EXISTS keyspace.table (

topic text, interval text, mentions counter,

PRIMARY KEY(topic, interval)

) WITH CLUSTERING ORDER BY (interval DESC)
Streaming From Kafka, R/W Cassandra
val ssc = new StreamingContext(conf, Seconds(30))

val stream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](

ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
stream.flatMap { detected =>
ssc.cassandraTable[AdversaryAttack]("behavior_ks",	
  "observed")	
  
	
  	
  	
  	
  	
  	
  .where("adversary	
  =	
  ?	
  and	
  ip	
  =	
  ?	
  and	
  attackType	
  =	
  ?",	
  	
  
	
  	
  	
  	
  	
  	
  	
  detected.adversary,	
  detected.originIp,	
  detected.attackType)	
  
	
  	
  	
  	
  	
  	
  .collect
}.saveToCassandra("profiling_ks",	
  "adversary_profiles")
Training
Data
Feature
Extraction
Model
Training
Model
Testing
Test
Data
Your Data Extract Data To Analyze
Train your model to predict
Spark Streaming ML, Kafka & Cassandra
val ssc = new StreamingContext(new SparkConf()…, Seconds(5)

val testData = ssc.cassandraTable[String](keyspace,table).map(LabeledPoint.parse)



val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](

ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
.map(_._2).map(LabeledPoint.parse)
trainingStream.saveToCassandra("ml_keyspace", “raw_training_data")



val model = new StreamingLinearRegressionWithSGD()

.setInitialWeights(Vectors.dense(weights))

.trainOn(trainingStream)


//Making predictions on testData
model
.predictOnValues(testData.map(lp => (lp.label, lp.features)))
.saveToCassandra("ml_keyspace", "predictions")
Spark Streaming ML, Kafka & C*
Timeseries Data Application
• Global sensors & satellites collect data
• Cassandra stores in sequence
• Application reads in sequence
Apache
Cassandra
Data Analysis
Application predictive modelling
Apache
Cassandra
Data model should look like your queries
• Store raw data per ID
• Store time series data in order: most recent to oldest
• Compute and store aggregate data in the stream
• Set TTLs on historic data
• Get data by ID
• Get data for a single date and time
• Get data for a window of time
• Compute, store and retrieve daily, monthly, annual aggregations
Design Data Model to support queries
Queries I Need
Data Model
• Weather Station Id and Time
are unique
• Store as many as needed
CREATE TABLE temperature (
weather_station text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY (weather_station,year,month,day,hour)
);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.6);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,8,-5.1);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,9,-4.9);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,10,-5.3);
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {



override val supervisorStrategy =

OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {

case _: ActorInitializationException => Stop

case _: FailedToSendMessageException => Restart
case _: ProducerClosedException => Restart
case _: NoBrokersForPartitionException => Escalate
case _: KafkaException => Escalate

case _: Exception => Escalate

}


private val producer = new KafkaProducer[K, V](producerConfig)



override def postStop(): Unit = producer.close()


def receive = {

case e: KafkaMessageEnvelope[K,V] => producer.send(e)

}

}
Kafka Producer as Akka Actor
class HttpReceiverActor(kafka: ActorRef) extends Actor with ActorLogging {
implicit val materializer = FlowMaterializer()
IO(Http) ! Http.Bind(HttpHost, HttpPort)



val requestHandler: HttpRequest => HttpResponse = {

case HttpRequest(POST, Uri.Path("/weather/v1/hourly-weather"), headers, entity, _) =>

HttpSource(headers,entity).collect { case s: HeaderSource =>
for(s <- source.extract)
kafka ! KafkaMessageEnvelope[String, String](topic, key, fs.data:_*)

HttpResponse(200, entity = HttpEntity(MediaTypes.`text/html`, s"POST $entity"))

}
getOrElse HttpResponse(404, entity = s"Unsupported request")
}


def receive : Actor.Receive = {

case Http.ServerBinding(localAddress, stream) =>
Source(stream).foreach({

case Http.IncomingConnection(remoteAddress, requestProducer, responseConsumer) =>

log.info("Accepted new connection from {}.", remoteAddress)

Source(requestProducer).map(requestHandler).to(Sink(responseConsumer)).run()

})
}}
Akka Actor as REST Endpoint
class HttpNodeGuardian extends ClusterAwareNodeGuardianActor {



val router = context.actorOf(
BalancingPool(PoolSize).props(Props(
new KafkaPublisherActor(KafkaHosts, KafkaBatchSendSize))))



Cluster(context.system) registerOnMemberUp {

val router = context.actorOf(
BalancingPool(PoolSize).props(Props(
new HttpReceiverActor(KafkaHosts, KafkaBatchSendSize))))
}
def initialized: Actor.Receive = { … }


}
Akka: Load-Balanced Kafka Work
Store raw data on ingestion
val kafkaStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder]
(ssc, kafkaParams, topicMap, StorageLevel.DISK_ONLY_2)

.map(transform)

.map(RawWeatherData(_))



/** Saves the raw data to Cassandra. */

kafkaStream.saveToCassandra(keyspace, raw_ws_data)
Store Raw Data on Ingestion
To Cassandra From Kafka Stream
/** Now proceed with computations from the same stream.. */
kafkaStream…
Now we can replay: on failure, for later computation, etc
CREATE	
  TABLE	
  weather.raw_data	
  (

	
  	
  	
  wsid	
  text,	
  year	
  int,	
  month	
  int,	
  day	
  int,	
  hour	
  int,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  

	
  	
  	
  temperature	
  double,	
  dewpoint	
  double,	
  pressure	
  double,	
  	
  
	
  	
  	
  wind_direction	
  int,	
  wind_speed	
  double,	
  one_hour_precip	
  	
  	
  	
  
	
  	
  	
  PRIMARY	
  KEY	
  ((wsid),	
  year,	
  month,	
  day,	
  hour)

)	
  WITH	
  CLUSTERING	
  ORDER	
  BY	
  (year	
  DESC,	
  month	
  DESC,	
  day	
  DESC,	
  hour	
  DESC);	
  
CREATE	
  TABLE	
  daily_aggregate_precip	
  (

	
  	
  	
  wsid	
  text,

	
  	
  	
  year	
  int,

	
  	
  	
  month	
  int,

	
  	
  	
  day	
  int,

	
  	
  	
  precipitation	
  counter,

	
  	
  	
  PRIMARY	
  KEY	
  ((wsid),	
  year,	
  month,	
  day)

)	
  WITH	
  CLUSTERING	
  ORDER	
  BY	
  (year	
  DESC,	
  month	
  DESC,	
  day	
  DESC);	
  
Our Data Model Again…
Gets the partition key: Data Locality
Spark C* Connector feeds this to Spark
Cassandra Counter column in our schema,
no expensive `reduceByKey` needed. Simply
let C* do it: not expensive and fast.
Efficient Stream Computation
val kafkaStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder]
(ssc, kafkaParams, topicMap, StorageLevel.DISK_ONLY_2)

.map(transform)

.map(RawWeatherData(_))



kafkaStream.saveToCassandra(keyspace, raw_ws_data)
/** Per `wsid` and timestamp, aggregates hourly pricip by day in the stream. */

kafkaStream.map { weather =>

(weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)

}.saveToCassandra(keyspace, daily_precipitation_aggregations)
class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
extends AggregationActor {

import akka.pattern.pipe


def receive: Actor.Receive = {

case e: GetMonthlyHiLowTemperature => highLow(e, sender)

}



def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =

sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)

.where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)

.collectAsync()

.map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}
C* data is automatically sorted by most recent - due to our data model.
Additional Spark or collection sort not needed.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala
©2014 DataStax Confidential. Do not distribute without consent. 89
@helenaedelson
github.com/helena
slideshare.net/helenaedelson
Resources
Spark Cassandra Connector
github.com/datastax/spark-cassandra-connector
github.com/killrweather/killrweather
groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
Apache Spark spark.apache.org
Apache Cassandra cassandra.apache.org
Apache Kafka kafka.apache.org
Akka akka.io
Analytic
Analytic
Thanks	
  for	
  listening!	
  
Cassandra	
  Summit
SEPTEMBER	
  22	
  -­‐	
  24,	
  2015	
  	
  |	
  	
  Santa	
  Clara	
  Convention	
  Center,	
  Santa	
  Clara,	
  CA
3,000 attendees in 2014

More Related Content

What's hot (20)

PDF
Spark
Amir Payberah
 
PPTX
Log analysis using elk
Rushika Shah
 
PPTX
ELK Stack
Phuc Nguyen
 
PDF
KSQL: Streaming SQL for Kafka
confluent
 
PDF
The Power of SPL
Splunk
 
PPTX
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
PDF
Kafka Connect and Streams (Concepts, Architecture, Features)
Kai Wähner
 
PDF
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
PDF
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PPTX
Splunk
Douglas Bernardini
 
PDF
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
PDF
Transparent Data Encryption in PostgreSQL and Integration with Key Management...
Masahiko Sawada
 
PDF
Alphorm.com Formation Elastic : Maitriser les fondamentaux
Alphorm
 
PPSX
Apache Flink, AWS Kinesis, Analytics
Araf Karsh Hamid
 
PDF
Log analytics with ELK stack
AWS User Group Bengaluru
 
PDF
ksqlDB: Building Consciousness on Real Time Events
confluent
 
PPTX
Nginx Reverse Proxy with Kafka.pptx
wonyong hwang
 
PDF
Speed-Up Kafka Delivery with AsyncAPI & Microcks | Hugo Guerrero, Red Hat
HostedbyConfluent
 
PPTX
MapReduce : Simplified Data Processing on Large Clusters
Abolfazl Asudeh
 
PDF
Combine Spring Data Neo4j and Spring Boot to quickl
Neo4j
 
Log analysis using elk
Rushika Shah
 
ELK Stack
Phuc Nguyen
 
KSQL: Streaming SQL for Kafka
confluent
 
The Power of SPL
Splunk
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kai Wähner
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Transparent Data Encryption in PostgreSQL and Integration with Key Management...
Masahiko Sawada
 
Alphorm.com Formation Elastic : Maitriser les fondamentaux
Alphorm
 
Apache Flink, AWS Kinesis, Analytics
Araf Karsh Hamid
 
Log analytics with ELK stack
AWS User Group Bengaluru
 
ksqlDB: Building Consciousness on Real Time Events
confluent
 
Nginx Reverse Proxy with Kafka.pptx
wonyong hwang
 
Speed-Up Kafka Delivery with AsyncAPI & Microcks | Hugo Guerrero, Red Hat
HostedbyConfluent
 
MapReduce : Simplified Data Processing on Large Clusters
Abolfazl Asudeh
 
Combine Spring Data Neo4j and Spring Boot to quickl
Neo4j
 

Viewers also liked (20)

PDF
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
PDF
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Simon Ambridge
 
PDF
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
PDF
Rethinking Streaming Analytics For Scale
Helena Edelson
 
PDF
Reactive dashboard’s using apache spark
Rahul Kumar
 
PDF
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
PPTX
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
PDF
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
PDF
Reactive app using actor model & apache spark
Rahul Kumar
 
PDF
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
PPTX
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
PPTX
Intro to Apache Spark
Mammoth Data
 
PPTX
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Spark Summit
 
PDF
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Patrick Di Loreto
 
PDF
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
PDF
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
PDF
Akka in Production - ScalaDays 2015
Evan Chan
 
PDF
Spark streaming + kafka 0.10
Joan Viladrosa Riera
 
PDF
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Databricks
 
PDF
H2O - the optimized HTTP server
Kazuho Oku
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Simon Ambridge
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
Rethinking Streaming Analytics For Scale
Helena Edelson
 
Reactive dashboard’s using apache spark
Rahul Kumar
 
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
Reactive app using actor model & apache spark
Rahul Kumar
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
Intro to Apache Spark
Mammoth Data
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Spark Summit
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Patrick Di Loreto
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
Akka in Production - ScalaDays 2015
Evan Chan
 
Spark streaming + kafka 0.10
Joan Viladrosa Riera
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Databricks
 
H2O - the optimized HTTP server
Kazuho Oku
 
Ad

Similar to Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala (20)

PDF
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Helena Edelson
 
PDF
Spark and cassandra (Hulu Talk)
Jon Haddad
 
PDF
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
PDF
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
PDF
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
PDF
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
PPTX
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
DataStax
 
PPTX
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
PDF
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
 
PDF
Real-Time Analytics with Apache Cassandra and Apache Spark,
Swiss Data Forum Swiss Data Forum
 
PPTX
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Lviv Startup Club
 
PDF
Data processing platforms with SMACK: Spark and Mesos internals
Anton Kirillov
 
PDF
Apache cassandra & apache spark for time series data
Patrick McFadin
 
PDF
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Spark Summit
 
PDF
Building end to end streaming application on Spark
datamantra
 
PDF
Data Science Lab Meetup: Cassandra and Spark
Christopher Batey
 
PDF
Cake Solutions: Cassandra as event sourced journal for big data analytics
DataStax Academy
 
PDF
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Martin Zapletal
 
PDF
Cassandra as event sourced journal for big data analytics
Anirvan Chakraborty
 
PDF
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Robert Stupp
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Helena Edelson
 
Spark and cassandra (Hulu Talk)
Jon Haddad
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
DataStax
 
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
 
Real-Time Analytics with Apache Cassandra and Apache Spark,
Swiss Data Forum Swiss Data Forum
 
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Lviv Startup Club
 
Data processing platforms with SMACK: Spark and Mesos internals
Anton Kirillov
 
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Spark Summit
 
Building end to end streaming application on Spark
datamantra
 
Data Science Lab Meetup: Cassandra and Spark
Christopher Batey
 
Cake Solutions: Cassandra as event sourced journal for big data analytics
DataStax Academy
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Martin Zapletal
 
Cassandra as event sourced journal for big data analytics
Anirvan Chakraborty
 
Cassandra + Spark (You’ve got the lighter, let’s start a fire)
Robert Stupp
 
Ad

More from Helena Edelson (6)

PDF
Toward Predictability and Stability
Helena Edelson
 
PDF
Patterns In The Chaos
Helena Edelson
 
PDF
Disorder And Tolerance In Distributed Systems At Scale
Helena Edelson
 
PDF
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Helena Edelson
 
PDF
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Helena Edelson
 
PDF
Streaming Big Data & Analytics For Scale
Helena Edelson
 
Toward Predictability and Stability
Helena Edelson
 
Patterns In The Chaos
Helena Edelson
 
Disorder And Tolerance In Distributed Systems At Scale
Helena Edelson
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Helena Edelson
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Helena Edelson
 
Streaming Big Data & Analytics For Scale
Helena Edelson
 

Recently uploaded (20)

PDF
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
🚀 Let’s Build Our First Slack Workflow! 🔧.pdf
SanjeetMishra29
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PDF
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
PPTX
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
🚀 Let’s Build Our First Slack Workflow! 🔧.pdf
SanjeetMishra29
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and Scala

  • 1. ©2014 DataStax Confidential. Do not distribute without consent. @helenaedelson Helena Edelson Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala 1
  • 2. • Spark Cassandra Connector committer • Akka contributor (Akka Cluster) • Scala & Big Data conference speaker • Sr Software Engineer, Analytics @ DataStax • Sr Cloud Engineer, VMware,CrowdStrike,SpringSource… • (Prev) Spring committer - Spring AMQP, Spring Integration Analytic Who Is This Person?
  • 3. Analytic Analytic Search Use Case: Hadoop + Scalding /** Reads SequenceFile data from S3 buckets, computes then persists to Cassandra. */ class TopSearches(args: Args) extends TopKDailyJob[MyDataType](args) with Cassandra { PailSource.source[Search](rootpath, structure, directories).read .mapTo('pailItem -> 'engines) { e: Search ⇒ results(e) } .filter('engines) { e: String ⇒ e.nonEmpty } .groupBy('engines) { _.size('count).sortBy('engines) } .groupBy('engines) { _.sortedReverseTake[(String, Long)](('engines, 'count) -> 'tcount, k) } .flatMapTo('tcount -> ('key, 'engine, 'topCount)) { t: List[(String, Long)] ⇒ t map { case (k, v) ⇒ (jobKey, k, v) }} .write(CassandraSource(connection, "top_searches", Scheme(‘key, ('engine, ‘topCount)))) }
  • 5. Talk Roadmap What Delivering Meaning Why Spark, Kafka, Cassandra & Akka How Composable Pipelines App Robust Implementation
  • 6. ©2014 DataStax Confidential. Do not distribute without consent. Sending Data Between Systems Is Difficult Risky 6
  • 8. Strategies • Scalable Infrastructure • Partition For Scale • Replicate For Resiliency • Share Nothing • Asynchronous Message Passing • Parallelism • Isolation • Location Transparency
  • 9. Strategy Technologies Scalable Infrastructure / Elastic scale on demand Spark, Cassandra, Kafka Partition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka Cluster Replicate For Resiliency span racks and datacenters, survive regional outages Spark,Cassandra, Akka Cluster all hash the node ring Share Nothing, Masterless Cassandra, Akka Cluster both Dynamo style Fault Tolerance / No Single Point of Failure Spark, Cassandra, Kafka Replay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka Persistence Failure Detection Cassandra, Spark, Akka, Kafka Consensus & Gossip Cassandra & Akka Cluster Parallelism Spark, Cassandra, Kafka, Akka Asynchronous Data Passing Kafka, Akka, Spark Fast, Low Latency, Data Locality Cassandra, Spark, Kafka Location Transparency Akka, Spark, Cassandra, Kafka
  • 10. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. • Spark is one of the few data processing frameworks that allows you to seamlessly integrate batch and stream processing • Of petabytes of data • In the same application
  • 11. I need fast access to historical data on the fly for predictive modeling with real time data from the stream
  • 12. Analytic Analytic Search • Fast, distributed, scalable and fault tolerant cluster compute system • Enables Low-latency with complex analytics • Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February, 2014
  • 13. • High Throughput Distributed Messaging • Decouples Data Pipelines • Handles Massive Data Load • Support Massive Number of Consumers • Distribution & partitioning across cluster nodes • Automatic recovery from broker failures
  • 15. The one thing in your infrastructure you can always rely on.
  • 16. •Massively Scalable • High Performance • Always On • Masterless
  • 18. • Fault tolerant • Hierarchical Supervision • Customizable Failure Strategies & Detection • Asynchronous Data Passing • Parallelization - Balancing Pool Routers • Akka Cluster • Adaptive / Predictive • Load-Balanced Across Cluster Nodes
  • 19. • Stream data from Kafka to Cassandra • Stream data from Kafka to Spark and write to Cassandra • Stream from Cassandra to Spark - coming soon! • Read data from Spark/Spark Streaming Source and write to C* • Read data from Cassandra to Spark
  • 22. Most Active OSS In Big Data Search
  • 23. Apache Spark - Easy to Use API Returns the top (k) highest temps for any location in the year def topK(aggregate: Seq[Double]): Seq[Double] = sc.parallelize(aggregate).top(k).collect Returns the top (k) highest temps … in a Future def topK(aggregate: Seq[Double]): Future[Seq[Double]] = sc.parallelize(aggregate).top(k).collectAsync Analytic Analytic Search
  • 24. © 2014 DataStax, All Rights Reserved Company Confidential Not Just MapReduce
  • 25. Spark Basic Word Count val conf = new SparkConf() .setMaster(host).setAppName(app)
 
 val sc = new SparkContext(conf) sc.textFile(words) .flatMap(_.split("s+")) .map(word => (word.toLowerCase, 1)) .reduceByKey(_ + _) .collect Analytic Analytic Search
  • 26. Analytic Analytic Search Collection To RDD scala> val data = Array(1, 2, 3, 4, 5)
 data: Array[Int] = Array(1, 2, 3, 4, 5)
 
 scala> val distributedData = sc.parallelize(data)
 distributedData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
  • 28. When Batch Is Not Enough Analytic Analytic
  • 29. Spark Streaming • I want results continuously in the event stream • I want to run computations in my even-driven async apps • Exactly once message guarantees
  • 30. Spark Streaming Setup val conf = new SparkConf().setMaster(SparkMaster).setAppName(AppName) val ssc = new StreamingContext(conf, Milliseconds(500)) // Do work in the stream ssc.checkpoint(checkpointDir) ssc.start() ssc.awaitTermination
  • 31. DStream (Discretized Stream) RDD (time 0 to time 1) RDD (time 1 to time 2) RDD (time 2 to time 3) A transformation on a DStream = transformations on its RDDs DStream Continuous stream of micro batches • Complex processing models with minimal effort • Streaming computations on small time intervals
  • 33. DStreams - the stream of raw data received from streaming sources: • Basic Source - in the StreamingContext API • Advanced Source - in external modules and separate Spark artifacts Receivers • Reliable Receivers - for data sources supporting acks (like Kafka) • Unreliable Receivers - for data sources not supporting acks 33 ReceiverInputDStreams
  • 34. Basic Streaming: FileInputDStream // Creates new DStreams ssc.textFileStream("s3n://raw_data_bucket/") .flatMap(_.split("s+")) .map(_.toLowerCase, 1)) .countByValue() .saveAsObjectFile("s3n://analytics_bucket/") Search
  • 35. Streaming Window Operations kvStream .flatMap { case (k,v) => (k,v.value) } .reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10)) .saveToCassandra(keyspace,table) Window Length: Duration = every 10s Sliding Interval: Interval at which the window operation is performed = every 10 s
  • 38. Apache Cassandra • Elasticity - scale to as many nodes as you need, when you need • Always On - No single point of failure, Continuous availability • Masterless peer to peer architecture • Replication Across DataCenters • Flexible Data Storage • Read and write to any node syncs across the cluster • Operational simplicity - all nodes in a cluster are the same • Fast Linear-Scale Performance •Transaction Support
  • 40. Science Physics: Astro Physics / Particle Physics..
  • 41. Genetics / Biological Computations
  • 42. IoT
  • 43. Europe US Cassandra • Availability Model • Amazon Dynamo • Distributed, Masterless • Data Model • Google Big Table • Network Topology Aware • Multi-Datacenter Replication
  • 44. Analytics with Spark Over Cassandra Online Analytics Cassandra enables Spark nodes to transparently communicate across data centers for data
  • 46. What Happens When Nodes Come Back? • Hinted handoff to the rescue • Coordinators keep writes for downed nodes for a configurable amount of time
  • 47. Gossip Did you hear node 1 was down??
  • 48. Consensus • Consensus, the agreement among peers on the value of a shared piece of data, is a core building block of Distributed systems • Cassandra supports consensus via the Paxos protocol
  • 49. CREATE TABLE users ( username varchar, firstname varchar, lastname varchar, email list<varchar>, password varchar, created_date timestamp, PRIMARY KEY (username) ); INSERT INTO users (username, firstname, lastname, email, password, created_date) VALUES ('hedelson','Helena','Edelson', [‘[email protected]'],'ba27e03fd95e507daf2937c937d499ab','2014-11-15 13:50:00’) IF NOT EXISTS; • Familiar syntax • Many Tools & Drivers • Many Languages • Friendly to programmers • Paxos for locking CQL - Easy
  • 50. CREATE  TABLE  weather.raw_data  (
      wsid  text,  year  int,  month  int,  day  int,  hour  int,                          
      temperature  double,  dewpoint  double,  pressure  double,          wind_direction  int,  wind_speed  double,  one_hour_precip              PRIMARY  KEY  ((wsid),  year,  month,  day,  hour)
 )  WITH  CLUSTERING  ORDER  BY  (year  DESC,  month  DESC,  day  DESC,  hour  DESC);   C* Clustering Columns Writes by most recent Reads return most recent first Timeseries Cassandra will automatically sort by most recent for both both write and read
  • 51. val multipleStreams = (1 to numDstreams).map { i => streamingContext.receiverStream[HttpRequest](new HttpReceiver(port)) } streamingContext.union(multipleStreams) .map { httpRequest => TimelineRequestEvent(httpRequest)} .saveToCassandra("requests_ks", "timeline") A record of every event, in order in which it happened, per URL: CREATE TABLE IF NOT EXISTS requests_ks.timeline ( timesegment bigint, url text, t_uuid timeuuid, method text, headers map <text, text>, body text, PRIMARY KEY ((url, timesegment) , t_uuid) ); timeuuid protects from simultaneous events over-writing one another. timesegment protects from writing unbounded partitions.
  • 52. ©2014 DataStax Confidential. Do not distribute without consent. Spark Cassandra Connector 52
  • 53. Spark Cassandra Connector https://github.com/datastax/spark-cassandra-connector • Write data from Spark to Cassandra • Read data from Cassandra to Spark • Data Locality for Speed • Easy, and often implicit, type conversions • Server-Side Filtering - SELECT, WHERE, etc. • Natural Timeseries Integration • Implemented in Scala
  • 54. Spark Cassandra Connector C* C* C*C* Spark Executor C* Driver Spark-Cassandra Connector User Application Cassandra
  • 55. 55 Co-locate Spark and C* for Best Performance C* C*C* C* Spark
 Worker Spark
 Worker Spark Master Spark Worker Running Spark Workers on the same nodes as your C* Cluster saves network hops
  • 56. Analytic Search Writing and Reading SparkContext import  com.datastax.spark.connector._   StreamingContext   import  com.datastax.spark.connector.streaming._
  • 57. Analytic Write from Spark to Cassandra sc.parallelize(collection).saveToCassandra("keyspace",  "raw_data") SparkContext Keyspace Table Spark RDD JOIN with NOSQL! predictionsRdd.join(music).saveToCassandra("music",  "predictions")
  • 58. Read From C* to Spark val  rdd  =  sc.cassandraTable("github",  "commits")                                            .select("user","count","year","month")                                            .where("commits  >=  ?  and  year  =  ?",  1000,  2015) CassandraRDD[CassandraRow] Keyspace Table Server-Side Column and Row Filtering SparkContext
  • 59. val  rdd  =  ssc.cassandraTable[MonthlyCommits]("github",  "commits_aggregate")                            .where("user  =  ?  and  project_name  =  ?  and  year  =  ?",                                    "helena",  "spark-­‐cassandra-­‐connector",  2015) CassandraRow Keyspace TableStreamingContext Rows: Custom Objects
  • 60. Rows val  tuplesRdd  =  sc.cassandraTable[(Int,Date,String)](db,  tweetsTable)    .select("cluster_id","time",  "cluster_name")    .where("time  >  ?  and  time  <  ?",                  "2014-­‐07-­‐12  20:00:01",  "2014-­‐07-­‐12  20:00:03”)   val  keyValuesPairsRdd  =  sc.cassandraTable[(Key,Value)](keyspace,  table)  
  • 61. Rows val rdd = ssc.cassandraTable[MyDataType]("stats", "clustering_time") rdd.where("key = 1").limit(10).collect   rdd.where("key = 1").take(10).collect val  rdd  =  ssc.cassandraTable[(Int,DateTime,String)]("stats",  "clustering_time")                            .where("key  =  1").withAscOrder.collect   val  rdd  =  ssc.cassandraTable[(Int,DateTime,String)]("stats",  "clustering_time")                            .where("key  =  1").withDescOrder.collect  
  • 62. Cassandra User Defined Types CREATE TYPE address ( street text, city text, zip_code int, country text, cross_streets set<text> ); UDT = Your Custom Field Type In Cassandra
  • 63. Cassandra UDT’s With JSON { "productId": 2, "name": "Kitchen Table", "price": 249.99, "description" : "Rectangular table with oak finish", "dimensions": { "units": "inches", "length": 50.0, "width": 66.0, "height": 32 }, "categories": { { "category" : "Home Furnishings" { "catalogPage": 45, "url": "/home/furnishings" }, { "category" : "Kitchen Furnishings" { "catalogPage": 108, "url": "/kitchen/furnishings" } } } CREATE TYPE dimensions ( units text, length float, width float, height float ); CREATE TYPE category ( catalogPage int, url text ); CREATE TABLE product ( productId int, name text, price float, description text, dimensions frozen <dimensions>, categories map <text, frozen <category>>, PRIMARY KEY (productId) );
  • 64. ©2014 DataStax Confidential. Do not distribute without consent. Composable Pipelines With Spark, Kafka & Cassandra 64
  • 65. Spark SQL with Cassandra import org.apache.spark.sql.cassandra.CassandraSQLContext val cc = new CassandraSQLContext(sparkContext) cc.setKeyspace(keyspaceName) cc.sql(""" SELECT table1.a, table1.b, table.c, table2.a FROM table1 AS table1 JOIN table2 AS table2 ON table1.a = table2.a AND table1.b = table2.b AND table1.c = table2.c """) .map(Data(_)) .saveToCassandra(keyspace1, table3)
  • 66. Cross Keyspaces (Databases) val schemaRdd1 = cassandraSqlContext.sql(""" INSERT INTO keyspace2.table3 SELECT table2.a, table2.b, table2.c FROM keyspace1.table2 as table2""") val schemaRdd2 = cassandraSqlContext.sql(""" SELECT table3.a, table3.b, table3.c FROM keyspace2.table3 as table3""") val schemaRdd3 = cassandraSqlContext.sql(""" SELECT test1.a, test1.b, test1.c, test2.a FROM database1.test1 AS test1
 JOIN database2.test2 AS test2 ON test1.a = test2.a AND test1.b = test2.b AND test1.c = test2.c""")
  • 67. 
 val sql = new SQLContext(sparkContext) val json = Seq(
 """{"user":"helena","commits":98, "month":3, "year":2015}""",
 """{"user":"jacek-lewandowski", "commits":72, "month":3, "year":2015}""",
 """{"user":"pkolaczk", "commits":42, "month":3, "year":2015}""") // write sql.jsonRDD(json) .map(CommitStats(_)) .flatMap(compute) .saveToCassandra("stats","monthly_commits")
 // read val rdd = sc.cassandraTable[MonthlyCommits]("stats","monthly_commits") cqlsh>  CREATE  TABLE  github_stats.commits_aggr(user  VARCHAR  PRIMARY  KEY,  commits  INT…); Spark SQL with Cassandra & JSON
  • 68. Analytic Analytic Search Spark Streaming, Kafka, C* and JSON cqlsh>  select  *  from  github_stats.commits_aggr;   
   user | commits | month | year -------------------+---------+-------+------ pkolaczk | 42 | 3 | 2015 jacek-lewandowski | 43 | 3 | 2015 helena | 98 | 3 | 2015
 (3  rows)   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
 ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
 .map { case (_,json) => JsonParser.parse(json).extract[MonthlyCommits]}
 .saveToCassandra("github_stats","commits_aggr")
  • 69. Spark Streaming, Kafka & Cassandra sparkConf.set("spark.cassandra.connection.host", "10.20.3.45")
 val streamingContext = new StreamingContext(conf, Seconds(30))
 KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
 streamingContext, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY) .map(_._2) .countByValue() .saveToCassandra("my_keyspace","wordcount")
  • 70. Spark Streaming, Twitter & Cassandra /** Cassandra is doing the sorting for you here. */
 TwitterUtils.createStream( ssc, auth, tags, StorageLevel.MEMORY_ONLY_SER_2)
 .flatMap(_.getText.toLowerCase.split("""s+"""))
 .filter(tags.contains(_))
 .countByValueAndWindow(Seconds(5), Seconds(5))
 .transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time))})
 .saveToCassandra(keyspace, table) CREATE TABLE IF NOT EXISTS keyspace.table (
 topic text, interval text, mentions counter,
 PRIMARY KEY(topic, interval)
 ) WITH CLUSTERING ORDER BY (interval DESC)
  • 71. Streaming From Kafka, R/W Cassandra val ssc = new StreamingContext(conf, Seconds(30))
 val stream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
 ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY) stream.flatMap { detected => ssc.cassandraTable[AdversaryAttack]("behavior_ks",  "observed")              .where("adversary  =  ?  and  ip  =  ?  and  attackType  =  ?",                  detected.adversary,  detected.originIp,  detected.attackType)              .collect }.saveToCassandra("profiling_ks",  "adversary_profiles")
  • 72. Training Data Feature Extraction Model Training Model Testing Test Data Your Data Extract Data To Analyze Train your model to predict Spark Streaming ML, Kafka & Cassandra
  • 73. val ssc = new StreamingContext(new SparkConf()…, Seconds(5)
 val testData = ssc.cassandraTable[String](keyspace,table).map(LabeledPoint.parse)
 
 val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
 ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY) .map(_._2).map(LabeledPoint.parse) trainingStream.saveToCassandra("ml_keyspace", “raw_training_data")
 
 val model = new StreamingLinearRegressionWithSGD()
 .setInitialWeights(Vectors.dense(weights))
 .trainOn(trainingStream) 
 //Making predictions on testData model .predictOnValues(testData.map(lp => (lp.label, lp.features))) .saveToCassandra("ml_keyspace", "predictions") Spark Streaming ML, Kafka & C*
  • 74. Timeseries Data Application • Global sensors & satellites collect data • Cassandra stores in sequence • Application reads in sequence Apache Cassandra
  • 75. Data Analysis Application predictive modelling Apache Cassandra
  • 76. Data model should look like your queries
  • 77. • Store raw data per ID • Store time series data in order: most recent to oldest • Compute and store aggregate data in the stream • Set TTLs on historic data • Get data by ID • Get data for a single date and time • Get data for a window of time • Compute, store and retrieve daily, monthly, annual aggregations Design Data Model to support queries Queries I Need
  • 78. Data Model • Weather Station Id and Time are unique • Store as many as needed CREATE TABLE temperature ( weather_station text, year int, month int, day int, hour int, temperature double, PRIMARY KEY (weather_station,year,month,day,hour) ); INSERT INTO temperature(weather_station,year,month,day,hour,temperature) VALUES (‘10010:99999’,2005,12,1,7,-5.6); INSERT INTO temperature(weather_station,year,month,day,hour,temperature) VALUES (‘10010:99999’,2005,12,1,8,-5.1); INSERT INTO temperature(weather_station,year,month,day,hour,temperature) VALUES (‘10010:99999’,2005,12,1,9,-4.9); INSERT INTO temperature(weather_station,year,month,day,hour,temperature) VALUES (‘10010:99999’,2005,12,1,10,-5.3);
  • 80. class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {
 
 override val supervisorStrategy =
 OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
 case _: ActorInitializationException => Stop
 case _: FailedToSendMessageException => Restart case _: ProducerClosedException => Restart case _: NoBrokersForPartitionException => Escalate case _: KafkaException => Escalate
 case _: Exception => Escalate
 } 
 private val producer = new KafkaProducer[K, V](producerConfig)
 
 override def postStop(): Unit = producer.close() 
 def receive = {
 case e: KafkaMessageEnvelope[K,V] => producer.send(e)
 }
 } Kafka Producer as Akka Actor
  • 81. class HttpReceiverActor(kafka: ActorRef) extends Actor with ActorLogging { implicit val materializer = FlowMaterializer() IO(Http) ! Http.Bind(HttpHost, HttpPort)
 
 val requestHandler: HttpRequest => HttpResponse = {
 case HttpRequest(POST, Uri.Path("/weather/v1/hourly-weather"), headers, entity, _) =>
 HttpSource(headers,entity).collect { case s: HeaderSource => for(s <- source.extract) kafka ! KafkaMessageEnvelope[String, String](topic, key, fs.data:_*)
 HttpResponse(200, entity = HttpEntity(MediaTypes.`text/html`, s"POST $entity"))
 } getOrElse HttpResponse(404, entity = s"Unsupported request") } 
 def receive : Actor.Receive = {
 case Http.ServerBinding(localAddress, stream) => Source(stream).foreach({
 case Http.IncomingConnection(remoteAddress, requestProducer, responseConsumer) =>
 log.info("Accepted new connection from {}.", remoteAddress)
 Source(requestProducer).map(requestHandler).to(Sink(responseConsumer)).run()
 }) }} Akka Actor as REST Endpoint
  • 82. class HttpNodeGuardian extends ClusterAwareNodeGuardianActor {
 
 val router = context.actorOf( BalancingPool(PoolSize).props(Props( new KafkaPublisherActor(KafkaHosts, KafkaBatchSendSize))))
 
 Cluster(context.system) registerOnMemberUp {
 val router = context.actorOf( BalancingPool(PoolSize).props(Props( new HttpReceiverActor(KafkaHosts, KafkaBatchSendSize)))) } def initialized: Actor.Receive = { … } 
 } Akka: Load-Balanced Kafka Work
  • 83. Store raw data on ingestion
  • 84. val kafkaStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder] (ssc, kafkaParams, topicMap, StorageLevel.DISK_ONLY_2)
 .map(transform)
 .map(RawWeatherData(_))
 
 /** Saves the raw data to Cassandra. */
 kafkaStream.saveToCassandra(keyspace, raw_ws_data) Store Raw Data on Ingestion To Cassandra From Kafka Stream /** Now proceed with computations from the same stream.. */ kafkaStream… Now we can replay: on failure, for later computation, etc
  • 85. CREATE  TABLE  weather.raw_data  (
      wsid  text,  year  int,  month  int,  day  int,  hour  int,                          
      temperature  double,  dewpoint  double,  pressure  double,          wind_direction  int,  wind_speed  double,  one_hour_precip              PRIMARY  KEY  ((wsid),  year,  month,  day,  hour)
 )  WITH  CLUSTERING  ORDER  BY  (year  DESC,  month  DESC,  day  DESC,  hour  DESC);   CREATE  TABLE  daily_aggregate_precip  (
      wsid  text,
      year  int,
      month  int,
      day  int,
      precipitation  counter,
      PRIMARY  KEY  ((wsid),  year,  month,  day)
 )  WITH  CLUSTERING  ORDER  BY  (year  DESC,  month  DESC,  day  DESC);   Our Data Model Again…
  • 86. Gets the partition key: Data Locality Spark C* Connector feeds this to Spark Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast. Efficient Stream Computation val kafkaStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder] (ssc, kafkaParams, topicMap, StorageLevel.DISK_ONLY_2)
 .map(transform)
 .map(RawWeatherData(_))
 
 kafkaStream.saveToCassandra(keyspace, raw_ws_data) /** Per `wsid` and timestamp, aggregates hourly pricip by day in the stream. */
 kafkaStream.map { weather =>
 (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
 }.saveToCassandra(keyspace, daily_precipitation_aggregations)
  • 87. class TemperatureActor(sc: SparkContext, settings: WeatherSettings) extends AggregationActor {
 import akka.pattern.pipe 
 def receive: Actor.Receive = {
 case e: GetMonthlyHiLowTemperature => highLow(e, sender)
 }
 
 def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
 sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
 .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
 .collectAsync()
 .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester } C* data is automatically sorted by most recent - due to our data model. Additional Spark or collection sort not needed.
  • 89. ©2014 DataStax Confidential. Do not distribute without consent. 89 @helenaedelson github.com/helena slideshare.net/helenaedelson
  • 91. Thanks  for  listening!   Cassandra  Summit SEPTEMBER  22  -­‐  24,  2015    |    Santa  Clara  Convention  Center,  Santa  Clara,  CA 3,000 attendees in 2014