STREAMING BIG DATA & ANALYTICS
FOR SCALE
Helena Edelson
@helenaedelson
Who Is This Person?
• VP of Product Engineering @Tuplejump
• Big Data, Analytics, Cloud Engineering, Cyber Security
• Committer / Contributor to FiloDB, Spark Cassandra Connector, Akka, Spring Integration
• @helenaedelson
• github.com/helena
• linkedin.com/in/helenaedelson
• slideshare.net/helenaedelson
Topics
• The Problem Domain - What needs to be solved
• The Stack (Scala,Akka,Spark Streaming,Kafka,Cassandra)
• Simplifying the architecture
• The Pipeline - integration
THE PROBLEM DOMAIN
Delivering Meaning From A Flood Of Data
The Problem Domain
Need to build scalable, fault tolerant, distributed data
processing systems that can handle massive amounts of
data from disparate sources, with different data structures.
Delivering Meaning
• Deliver meaning in sec/sub-sec latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical from the stream
I need fast access to historical data on the fly for predictive modeling with real time data from the stream
It's Not A Stream It's A Flood
• Netflix
– 50 - 100 billion events per day
– 1 - 2 million events per second at peak
• LinkedIn
– 500 billion write events per day
– 2.5 trillion read events per day
– 4.5 million events per second at peak with Kafka
– 1 PB of stream data
Reality Check
• Massive event spikes & bursty traffic
• Fast producers / slow consumers
• Network partitioning & out of sync systems
• DC down
• Wait, we've DDOS'd ourselves from fast streams?
• Autoscale issues
– When we scale down VMs how do we not lose data?
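The "fast producers / slow consumers" problem can be made concrete with a bounded buffer. The following is a minimal sketch, not anything from the deck: `BoundedBuffer` and `offerBurst` are hypothetical names, and the queue is a plain JDK `ArrayBlockingQueue` standing in for whatever sits between producer and consumer. It only shows that a bounded buffer makes overload visible to the producer instead of letting memory grow without limit.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BoundedBuffer {
  /** Offers every event of a burst to a bounded queue while nothing is consumed.
    * Returns (accepted, rejected) counts. */
  def offerBurst(capacity: Int, events: Seq[String]): (Int, Int) = {
    val queue = new ArrayBlockingQueue[String](capacity)
    // offer() never blocks: it returns false once the queue is full
    val accepted = events.count(e => queue.offer(e))
    (accepted, events.size - accepted)
  }
}
```

In a real pipeline the rejected events would have to be retried, slowed down at the source, or written to a durable log rather than dropped, which is the "don't lose data" constraint on the following slides.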
And stay within our cloud hosting budget
Oh, and don't lose data
THE STACK
About The Stack
An ensemble of technologies enabling a data pipeline,
storage and analytics.
Scala and Spark
Scala is becoming the glue that connects the dots in big data.
Spark's emergence owes much to the simplicity and high level of
abstraction of Scala.
• Distributed collections that are functional by default
• Apply transformations to them
• Spark takes that exact idea and puts it on a cluster
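To illustrate the three bullets above, here is a small sketch of a word count written against ordinary Scala collections. `WordCount` and `count` are illustrative names, not part of the deck. Spark's RDD API exposes the same style of chained transformations (`flatMap`, `filter`, `map`), so the shape of the code carries over when the collection is distributed across a cluster.

```scala
object WordCount {
  /** Counts words using plain, local Scala collection transformations. */
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // one element per word
      .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
      .map(_.toLowerCase)         // normalise case
      .groupBy(identity)          // word -> all of its occurrences
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```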
SIMPLIFYING ARCHITECTURE
Pick Technologies Wisely
Based on your requirements
• Latency
– Real time / Sub-Second: < 100ms
– Near real time (low): > 100 ms or a few seconds - a few hours
• Consistency
• Highly Scalable
• Topology-Aware & Multi-Datacenter support
• Partitioning Collaboration - do they play together well
Strategies
• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
• Data lineage and reprocessing at runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency
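As a rough illustration of "Partition For Scale & Data Locality", the sketch below assigns a key to one of N partitions by hashing. This is only the general idea: `Partitioner` and `partitionFor` are hypothetical names, and Cassandra and Kafka ship their own partitioners rather than using `String.hashCode`.

```scala
object Partitioner {
  /** Maps a key to a partition index in [0, numPartitions). */
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "need at least one partition")
    // floorMod keeps the result non-negative even when hashCode is negative
    Math.floorMod(key.hashCode, numPartitions)
  }
}
```

Because the mapping is deterministic, every event for the same key lands on the same partition, which is what lets computation be scheduled next to the data.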
My Nerdy Chart
Strategy | Technologies
Scalable Infrastructure / Elastic | Spark, Cassandra, Kafka
Partition For Scale, Network Topology Aware | Cassandra, Spark, Kafka, Akka Cluster
Replicate For Resiliency | Spark, Cassandra, Akka Cluster all hash the node ring
Share Nothing, Masterless | Cassandra, Akka Cluster both Dynamo style
Fault Tolerance / No Single Point of Failure | Spark, Cassandra, Kafka
Replay From Any Point Of Failure | Spark, Cassandra, Kafka, Akka + Akka Persistence
Failure Detection | Cassandra, Spark, Akka, Kafka
Consensus & Gossip | Cassandra & Akka Cluster
Parallelism | Spark, Cassandra, Kafka, Akka
Asynchronous Data Passing | Kafka, Akka, Spark
Fast, Low Latency, Data Locality | Cassandra, Spark, Kafka
Location Transparency | Akka, Spark, Cassandra, Kafka
few years in Silicon Valley, Cloud Engineering team
Batch analytics data flow from several years ago looked like...
[Figure: Hadoop WordCount in Java]
Painful just to look at
[Figure: Scalding WordCount]
Transforming data multiple times, multiple ways
Sweet, let's triple the code we have to update and regression test
every time our analytics logic changes
AND THEN WE GREEKED OUT
Lambda
Lambda Architecture
A data-processing architecture designed to handle
massive quantities of data by taking advantage of both
batch and stream processing methods.
• Or, "How to beat the CAP theorem"
• An approach coined by Nathan Marz
• This was a huge stride forward
https://www.mapr.com/developercentral/lambda-architecture
Implementing Is Hard
• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places
Also Hard
Performance Tuning & Monitoring on so many disparate systems
λ: Streaming & Batch Flows
Evolution Or Just Addition?
Or Just Technical Debt?
Lambda Architecture
An immutable sequence of records is ingested and fed, in parallel, into
• a batch system
• a stream processing system
WAIT, DUAL SYSTEMS?
Challenge Assumptions
Which Translates To
• Performing analytical computations & queries in dual systems
• Duplicate Code
• Untyped Code - Strings
• Spaghetti Architecture for Data Flows
• One Busy Network
Architecture?
"This is a giant mess"
- Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
These are not the solutions you're looking for
Escape from Hadoop?
Hadoop
• MapReduce - very powerful, no longer enough
• It’s Batch
• Stale data
• Slow, everything written to disk
• Huge overhead
• Inefficient with respect to memory use
• Inflexible vs Dynamic
One Pipeline
• A unified system for streaming and batch
• Real-time processing and reprocessing
• Code changes
• Fault tolerance
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
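The "one pipeline" idea can be sketched without any framework: a single piece of analytics logic is applied both to live events and to a replay of stored events, so there is no second code path to keep in sync. Everything below (`OnePipeline`, `Event`, `totals`) is a hypothetical illustration rather than code from the deck.

```scala
object OnePipeline {
  final case class Event(station: String, precip: Double)

  /** The only copy of the analytics logic: total precipitation per station.
    * It consumes an Iterator, so it does not care whether the events come
    * from a live feed or from a replay of the stored log. */
  def totals(events: Iterator[Event]): Map[String, Double] =
    events.foldLeft(Map.empty[String, Double]) { (acc, e) =>
      acc.updated(e.station, acc.getOrElse(e.station, 0.0) + e.precip)
    }
}
```

Reprocessing after a code change then means re-running `totals` over the retained events, rather than updating and regression testing separate batch and streaming implementations.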
THE PIPELINE
Stream Integration, IoT, Timeseries and Data Locality
KillrWeather
http://github.com/killrweather/killrweather
A reference application showing how to easily integrate streaming and
batch data processing with Apache Spark Streaming, Apache
Cassandra, Apache Kafka and Akka for fast, streaming computations
on time series data in asynchronous event-driven environments.
http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
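Kafka, Spark Streaming and Cassandra
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

val context = new StreamingContext(conf, Seconds(1))
// Direct (receiver-less) stream of raw key/value byte arrays from Kafka
val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  context, kafkaParams, kafkaTopics)
stream.flatMap(func1).saveToCassandra(ks1, table1)
stream.map(func2).saveToCassandra(ks1, table1)
context.start()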
Kafka, Spark Streaming, Cassandra & Akka
import akka.actor.{Actor, ActorInitializationException, OneForOneStrategy}
import akka.actor.SupervisorStrategy._
import scala.concurrent.duration._

class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {

  // Stop on initialization failure, restart on send failures, escalate everything else
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ActorInitializationException   => Stop
      case _: FailedToSendMessageException   => Restart
      case _: ProducerClosedException        => Restart
      case _: NoBrokersForPartitionException => Escalate
      case _: KafkaException                 => Escalate
      case _: Exception                      => Escalate
    }

  private val producer = new KafkaProducer[K, V](config)

  // Release the producer's connections when the actor stops
  override def postStop(): Unit = producer.close()

  def receive = {
    case e: KafkaMessageEnvelope[K,V] => producer.send(e)
  }
}
Spark Streaming ML, Kafka & Cassandra
[Figure: Your Data; Extract Data To Analyze; Training Data; Test Data; Feature Extraction; Model Training; Model Testing. Train your model to predict.]
Spark Streaming, ML, Kafka & C*
val ssc = new StreamingContext(new SparkConf()…, Seconds(5))

val testData = ssc.cassandraTable[String](keyspace,table).map(LabeledPoint.parse)

val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
  .map(_._2).map(LabeledPoint.parse)
trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))
// trainOn returns Unit, so it is called on the model instead of being chained into the assignment
model.trainOn(trainingStream)

// Making predictions on testData (predictOnValues expects a DStream of (key, features) pairs)
model
  .predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveToCassandra("ml_keyspace", "predictions")
Kafka, Spark Streaming, Cassandra & Akka
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {
  import settings._

  // Raw comma-separated lines from Kafka, parsed into RawWeatherData
  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  // Persist the raw events
  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  // Pre-aggregate hourly precipitation into the daily table
  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}
Now we can replay
• On failure
• Reprocessing on code changes
• Future computation...
Here we are pre-aggregating to a table for fast querying later -
in other secondary stream aggregation computations and scheduled computing
Data Model (simplified)
CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double,
  one_hour_precip double, -- type cut off on the slide; double assumed
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

CREATE TABLE daily_aggregate_precip (
  wsid text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
• Gets the partition key: Data Locality. Spark C* Connector feeds this to Spark.
• Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.
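The counter-column point can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions, not Cassandra itself: `CounterTable` and `DailyPrecip` are hypothetical names, and since Cassandra counters are 64-bit integers the sketch uses `Long` deltas. What it shows is that each write is a blind increment keyed by (wsid, year, month, day), so the writer never has to group or sum values beforehand.

```scala
import scala.collection.mutable

object CounterTable {
  type Key = (String, Int, Int, Int) // (wsid, year, month, day)

  /** In-memory stand-in for a table with a single counter column. */
  final class DailyPrecip {
    private val rows = mutable.Map.empty[Key, Long].withDefaultValue(0L)

    /** A counter write is an increment: no read-before-write by the caller. */
    def increment(key: Key, delta: Long): Unit = rows(key) += delta

    def read(key: Key): Long = rows(key)
  }
}
```

Each hourly event simply calls `increment` for its day; the daily total accumulates in the store instead of in a `reduceByKey` shuffle.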
Timeseries Data
C* Clustering Columns
• Writes by most recent
• Reads return most recent first
Cassandra will automatically sort by most recent for both write and read
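To make the ordering concrete, the sketch below sorts rows of one partition the way `CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC)` lays them out. `ClusteringOrder`, `Row` and `partition` are hypothetical names for illustration; in Cassandra the order is maintained by the storage engine, so no such sort runs at query time.

```scala
object ClusteringOrder {
  final case class Row(wsid: String, year: Int, month: Int, day: Int, hour: Int)

  /** Newest first: compare (year, month, day, hour) and reverse the result. */
  val newestFirst: Ordering[Row] =
    Ordering.by((r: Row) => (r.year, r.month, r.day, r.hour)).reverse

  /** Rows of a single partition (one wsid) in clustering order. */
  def partition(rows: Seq[Row]): Seq[Row] = rows.sorted(newestFirst)
}
```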
// `yield` collects one receiver stream per index so they can be unioned below
val multipleStreams = for (i <- 1 to numstreams) yield
  streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))

streamingContext.union(multipleStreams)
  .map { httpRequest => TimelineRequestEvent(httpRequest) }
  .saveToCassandra("requests_ks", "timeline")

CREATE TABLE IF NOT EXISTS requests_ks.timeline (
  timesegment bigint, url text, t_uuid timeuuid, method text, headers map <text, text>, body text,
  PRIMARY KEY ((url, timesegment), t_uuid)
);

Record Every Event In The Order In Which It Happened, Per URL
• timesegment protects from writing unbounded partitions.
• timeuuid protects from simultaneous events over-writing one another.
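One common way to derive a value like `timesegment` is to truncate the event timestamp to a fixed-size bucket. The deck does not say how its `timesegment` is computed, so the following is an assumption-labelled sketch: `TimeSegment.of` and the bucket size are hypothetical.

```scala
object TimeSegment {
  /** Truncates an epoch-millisecond timestamp to the start of its bucket.
    * All events of one URL within the same bucket share a partition. */
  def of(epochMillis: Long, bucketMillis: Long): Long = {
    require(bucketMillis > 0, "bucket size must be positive")
    epochMillis - Math.floorMod(epochMillis, bucketMillis)
  }
}
```

With an hourly bucket, for example, each (url, timesegment) partition holds at most one hour of requests, which bounds partition size regardless of how long the URL stays active.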
Cassandra & Spark Streaming: Data Locality For Free®
val stream = KafkaUtils.createDirectStream(...)
  .map(_._2.split(","))
  .map(RawWeatherData(_))

stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)
stream
  .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
  .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

Replay and Reprocess - Any Time
Data is on the nodes doing the querying
- Spark C* Connector - Partitions
• Timeseries data with Data Locality
• Co-located Spark + Cassandra nodes
• S3 does not give you
Compute Isolation: Actor
class PrecipitationActor(ssc: StreamingContext, settings: Settings) extends AggregationActor {
  import akka.pattern.pipe

  def receive : Actor.Receive = {
    case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
  }

  /** Returns the `k` highest daily precipitation values for station `wsid` in `year`. */
  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
    val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
      ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

    // Reads the pre-aggregated daily table and pipes the result back asynchronously
    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync().map(toTopK) pipeTo requester
  }
}
Queries pre-aggregated tables from the stream
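The top-k step itself is simple once the pre-aggregated values are in hand. As a local, framework-free sketch (`TopK.top` is a hypothetical helper, not part of the deck), it amounts to sorting in descending order and taking the first k values, which matches the descending order in which Spark's `RDD.top(k)` returns its results.

```scala
object TopK {
  /** Returns the k largest values, largest first. */
  def top(values: Seq[Double], k: Int): Seq[Double] =
    values.sortWith(_ > _).take(k)
}
```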
Efficient Batch Analysis
class TemperatureActor(sc: SparkContext, settings: Settings) extends AggregationActor {
  import akka.pattern.pipe

  def receive: Actor.Receive = {
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

  def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
    sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
      .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
      .collectAsync()
      .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}
C* data is automatically sorted by most recent - due to our data model.
Additional Spark or collection sort not needed.
TCO: Cassandra, Hadoop
• Compactions vs file management and de-duplication jobs
• Storage cost, cloud providers / hardware, storage format, query speeds, cost of maintaining
• Hiring talent to manage Cassandra vs Hadoop
• Hadoop has some advantages if used with AWS EMR or other hosted solutions (What about HCP?)
• HDFS vs Cassandra is not a fair comparison
– You have to decide first if you want file vs DB
– Where in the stack
– Then you can compare
• See http://velvia.github.io/presentations/2015-breakthrough-olap-cass-spark/index.html#/15/3
github.com/helena
helena@tuplejump.com
slideshare.net/helenaedelson
THANKS!
THE STACK
Appendix
Apache Spark
• Fast, distributed, scalable and fault tolerant cluster compute system
• Enables Low-latency with complex analytics
• Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February, 2014
Spark Streaming
• One runtime for streaming and batch processing
• Join streaming and static data sets
• No code duplication
• Easy, flexible data ingestion from disparate sources to disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
val ssc = new StreamingContext(conf, Milliseconds(500))
val model = KMeans.train(dataset, ...) // learn offline
// Score each incoming event against the offline-trained model
val stream = KafkaUtils
  .createStream(ssc, zkQuorum, group,..)
  .map(event => model.predict(event.feature))
Apache Cassandra
• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
• No single point of failure
• Survive regional outages
• Easy to operate
• Automatic & configurable replication
The one thing in your infrastructure you can always rely on
• Massively Scalable
• High Performance
• Always On
• Masterless
• IoT
• Security, Machine Learning
• Science
– Physics: Astro Physics / Particle Physics..
– Genetics / Biological Computations
Apache Kafka
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Support Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
Akka
High performance concurrency framework for Scala and Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams
Akka Actors
A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
Some Resources
• http://www.planetcassandra.org/blog/equinix-evaluates-hbase-and-cassandra-for-the-real-time-billing-and-fraud-detection-of-hundreds-of-data-centers-worldwide
• Building a scalable platform for streaming updates and analytics
• @SAPdevs tweet: Watch us take a dive deeper and explore the benefits of using Akka and Scala https://t.co/M9arKYy02y
• https://mesosphere.com/blog/2015/07/24/learn-everything-you-need-to-know-about-scala-and-big-data-in-oakland/
• https://www.youtube.com/watch?v=buszgwRc8hQ
• https://www.youtube.com/watch?v=G7N6YcfatWY
• http://noetl.org
SnappyData at Spark Summit 2017
Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational Scaling
Justin Cunningham
 
Etl is Dead; Long Live Streams
Etl is Dead; Long Live StreamsEtl is Dead; Long Live Streams
Etl is Dead; Long Live Streams
confluent
 
Leveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern AnalyticsLeveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern Analytics
confluent
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
ScyllaDB
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
Ad

Recently uploaded (20)

Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Ad

Streaming Big Data & Analytics For Scale

  • 1. STREAMING BIG DATA & ANALYTICS FOR SCALE Helena Edelson 1 @helenaedelson
  • 2. @helenaedelson Who Is This Person? • VP of Product Engineering @Tuplejump • Big Data, Analytics, Cloud Engineering, Cyber Security • Committer / Contributor to FiloDB, Spark Cassandra Connector, Akka, Spring Integration 2 • @helenaedelson • github.com/helena • linkedin.com/in/helenaedelson • slideshare.net/helenaedelson
  • 4. @helenaedelson Topics • The Problem Domain - What needs to be solved • The Stack (Scala,Akka,Spark Streaming,Kafka,Cassandra) • Simplifying the architecture • The Pipeline - integration 4
  • 5. @helenaedelson THE PROBLEM DOMAIN Delivering Meaning From A Flood Of Data 5 @helenaedelson
  • 6. @helenaedelson The Problem Domain Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures. 6
  • 7. @helenaedelson Delivering Meaning • Deliver meaning in sec/sub-sec latency • Disparate data sources & schemas • Billions of events per second • High-latency batch processing • Low-latency stream processing • Aggregation of historical from the stream 7
  • 8. I need fast access to historical data on the fly for predictive modeling with real time data from the stream
  • 9. @helenaedelson It's Not A Stream It's A Flood • Netflix • 50 - 100 billion events per day • 1 - 2 million events per second at peak • LinkedIn • 500 billion write events per day • 2.5 trillion read events per day • 4.5 million events per second at peak with Kafka • 1 PB of stream data 9
  • 10. @helenaedelson Reality Check • Massive event spikes & bursty traffic • Fast producers / slow consumers • Network partitioning & out of sync systems • DC down • Wait, we've DDOS'd ourselves from fast streams? • Autoscale issues – When we scale down VMs how do we not lose data? 10
  • 11. @helenaedelson And stay within our cloud hosting budget 11
  • 14. @helenaedelson About The Stack An ensemble of technologies enabling a data pipeline, storage and analytics. 14
  • 16. @helenaedelson Scala and Spark Scala is now becoming this glue that connects the dots in big data. The emergence of Spark owes directly to its simplicity and the high level of abstraction of Scala. • Distributed collections that are functional by default • Apply transformations to them • Spark takes that exact idea and puts it on a cluster 16
  • 18. Pick Technologies Wisely Based on your requirements • Latency • Real time / Sub-Second: < 100 ms • Near real time (low): > 100 ms or a few seconds - a few hours • Consistency • Highly Scalable • Topology-Aware & Multi-Datacenter support • Partitioning • Collaboration - do they play together well
  • 19. @helenaedelson Strategies 19 • Partition For Scale & Data Locality • Replicate For Resiliency • Share Nothing • Fault Tolerance • Asynchrony • Async Message Passing • Memory Management • Data lineage and reprocessing in runtime • Parallelism • Elastically Scale • Isolation • Location Transparency
  • 20. My Nerdy Chart (Strategy: Technologies)
    Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
    Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
    Replicate For Resiliency: Spark, Cassandra, Akka Cluster all hash the node ring
    Share Nothing, Masterless: Cassandra, Akka Cluster both Dynamo style
    Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
    Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
    Failure Detection: Cassandra, Spark, Akka, Kafka
    Consensus & Gossip: Cassandra & Akka Cluster
    Parallelism: Spark, Cassandra, Kafka, Akka
    Asynchronous Data Passing: Kafka, Akka, Spark
    Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
    Location Transparency: Akka, Spark, Cassandra, Kafka
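The "hash the node ring" idea behind the partitioning and replication rows can be sketched with a toy ring. This is a minimal illustration only: the `HashRing` class, the node names, and the use of `String.hashCode` as the token function are assumptions made for the example, and real systems such as Cassandra use their own partitioners and replication strategies.

```scala
import scala.collection.immutable.TreeMap

// Illustrative only: hashCode-based tokens are an assumption for this sketch,
// not what Cassandra or Akka Cluster actually use.
final class HashRing(nodes: Seq[String], replicas: Int) {
  require(nodes.nonEmpty && replicas > 0, "need at least one node and one replica")

  // Each node owns one token on the ring; the TreeMap keeps tokens sorted.
  private val ring: TreeMap[Int, String] =
    nodes.foldLeft(TreeMap.empty[Int, String])((r, n) => r.updated(n.hashCode, n))

  /** Walks clockwise from the key's token and returns the first `replicas` distinct nodes. */
  def nodesFor(key: String): List[String] = {
    val clockwise = ring.iteratorFrom(key.hashCode) ++ ring.iterator
    clockwise.map(_._2).toList.distinct.take(replicas)
  }
}
```

The same key always maps to the same owners, which is what makes partition-aware routing and data locality possible; the extra owners are the replicas.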
  • 21. @helenaedelson 21 few years in Silicon Valley Cloud Engineering team @helenaedelson
  • 22. @helenaedelson 22 Batch analytics data flow from several years ago looked like...
  • 24. @helenaedelson 24 Batch analytics data flow from several years ago looked like...
  • 27. @helenaedelson 27 Sweet, let's triple the code we have to update and regression test every time our analytics logic changes
  • 28. @helenaedelson AND THEN WE GREEKED OUT 28 Lambda
  • 29. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. • Or, "How to beat the CAP theorem" • An approach coined by Nathan Marz • This was a huge stride forward
  • 31. @helenaedelson Implementing Is Hard • Real-time pipeline backed by KV store for updates • Many moving parts - KV store, real time, batch • Running similar code in two places • Still ingesting data to Parquet/HDFS • Reconcile queries against two different places 31
  • 32. @helenaedelson Performance Tuning & Monitoring on so many disparate systems 32 Also Hard
  • 33. @helenaedelson 33 λ: Streaming & Batch Flows Evolution Or Just Addition? Or Just Technical Debt?
  • 34. Lambda Architecture An immutable sequence of records is ingested and fed into • a batch system • and a stream processing system in parallel
  • 36. @helenaedelson Which Translates To • Performing analytical computations & queries in dual systems • Duplicate Code • Untyped Code - Strings • Spaghetti Architecture for Data Flows • One Busy Network 36
  • 37. @helenaedelson Architectyr? "This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps 37
  • 38. @helenaedelson 38 These are not the solutions you're looking for
  • 39. @helenaedelson Escape from Hadoop? Hadoop • MapReduce - very powerful, no longer enough • It’s Batch • Stale data • Slow, everything written to disk • Huge overhead • Inefficient with respect to memory use • Inflexible vs Dynamic 39
  • 40. One Pipeline • A unified system for streaming and batch • Real-time processing and reprocessing • Code changes • Fault tolerance http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
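A minimal sketch of the "one pipeline" idea, using plain Scala collections rather than Spark: the analytics logic is written once and used both for live micro-batches and for reprocessing a retained log. The `Reading` type, the `OnePipeline` object, and the per-station total are hypothetical names chosen for illustration; they are not from the talk.

```scala
// Hypothetical event type for the sketch.
final case class Reading(stationId: String, precip: Double)

object OnePipeline {
  type Totals = Map[String, Double]

  /** The analytics logic, written once: fold one batch of readings into running totals. */
  def update(totals: Totals, batch: Seq[Reading]): Totals =
    batch.foldLeft(totals) { (acc, r) =>
      acc.updated(r.stationId, acc.getOrElse(r.stationId, 0.0) + r.precip)
    }

  /** Live path: apply the logic to micro-batches in arrival order. */
  def stream(microBatches: Seq[Seq[Reading]]): Totals =
    microBatches.foldLeft(Map.empty: Totals)(update)

  /** Reprocessing path: replay the retained log through the same function. */
  def reprocess(log: Seq[Reading]): Totals = update(Map.empty, log)
}
```

Because both paths call `update`, a change to the logic is made in one place and a replay of the log yields the same totals as the live path.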
  • 41. @helenaedelson THE PIPELINE 41 Stream Integration, IoT, Timeseries and Data Locality
  • 42. KillrWeather http://github.com/killrweather/killrweather A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments. http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
  • 43. Kafka, Spark Streaming and Cassandra
    val context = new StreamingContext(conf, Seconds(1))
    val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      context, kafkaParams, kafkaTopics)
    stream.flatMap(func1).saveToCassandra(ks1, table1)
    stream.map(func2).saveToCassandra(ks1, table1)
    context.start()
  • 44. Kafka, Spark Streaming, Cassandra & Akka
    class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {

      // Decide per exception type whether to stop, restart, or escalate
      override val supervisorStrategy =
        OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
          case _: ActorInitializationException  => Stop
          case _: FailedToSendMessageException  => Restart
          case _: ProducerClosedException       => Restart
          case _: NoBrokersForPartitionException => Escalate
          case _: KafkaException                => Escalate
          case _: Exception                     => Escalate
        }

      private val producer = new KafkaProducer[K, V](config)

      // Release the producer's resources when the actor stops
      override def postStop(): Unit = producer.close()

      def receive = {
        case e: KafkaMessageEnvelope[K,V] => producer.send(e)
      }
    }
  • 45. Training Data Feature Extraction Model Training Model Testing Test Data Your Data Extract Data To Analyze Train your model to predict Spark Streaming ML, Kafka & Cassandra
  • 46. Spark Streaming, ML, Kafka & C*
    val ssc = new StreamingContext(new SparkConf()…, Seconds(5))
    val testData = ssc.cassandraTable[String](keyspace,table).map(LabeledPoint.parse)

    val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
        ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
      .map(_._2).map(LabeledPoint.parse)

    trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(weights))

    // trainOn returns Unit, so it is called on the model rather than chained into the val
    model.trainOn(trainingStream)

    // Making predictions on testData
    model
      .predictOnValues(testData.map(lp => (lp.label, lp.features)))
      .saveToCassandra("ml_keyspace", "predictions")
  • 48. Kafka, Spark Streaming, Cassandra & Akka
    class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
      extends AggregationActor {
      import settings._

      // Raw comma-separated records from Kafka, parsed into RawWeatherData
      val stream = KafkaUtils.createStream(
          ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
        .map(_._2.split(","))
        .map(RawWeatherData(_))

      // Persist every raw record
      stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

      // Write hourly precipitation keyed by station and day
      stream
        .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
        .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
    }
  • 49. (Same KafkaStreamingActor code as slide 48.) Now we can replay • On failure • Reprocessing on code changes • Future computation...
  • 50. (Same KafkaStreamingActor code as slide 48.) Here we are pre-aggregating to a table for fast querying later - in other secondary stream aggregation computations and scheduled computing
  • 51. Data Model (simplified)
    CREATE TABLE weather.raw_data (
      wsid text, year int, month int, day int, hour int,
      temperature double, dewpoint double, pressure double,
      wind_direction int, wind_speed double, one_hour_precip double,
      PRIMARY KEY ((wsid), year, month, day, hour)
    ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

    CREATE TABLE daily_aggregate_precip (
      wsid text,
      year int,
      month int,
      day int,
      precipitation counter,
      PRIMARY KEY ((wsid), year, month, day)
    ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
  • 52.
    class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
      extends AggregationActor {
      import settings._

      val stream = KafkaUtils.createStream(
          ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
        .map(_._2.split(","))
        .map(RawWeatherData(_))

      stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

      stream
        .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
        .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
    }
    Gets the partition key: Data Locality. Spark C* Connector feeds this to Spark.
    Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.
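The accumulate-on-write idea behind the counter column can be modelled in memory. This is a sketch under stated assumptions, not Cassandra's counter implementation: `HourlyPrecip` and `DailyPrecipTable` are hypothetical names, and the `Double` total is a simplification, since Cassandra counters hold 64-bit integers.

```scala
import scala.collection.mutable

// Hypothetical stand-in for one parsed hourly record (field names follow the slide code).
final case class HourlyPrecip(wsid: String, year: Int, month: Int, day: Int, oneHourPrecip: Double)

// In-memory model of a table keyed by (wsid, year, month, day) whose value is
// incremented on every write, so no reduce step is needed before the write.
final class DailyPrecipTable {
  private val rows =
    mutable.Map.empty[(String, Int, Int, Int), Double].withDefaultValue(0.0)

  /** Each write adds the hourly value to the running daily total for its key. */
  def write(h: HourlyPrecip): Unit = {
    val key = (h.wsid, h.year, h.month, h.day)
    rows(key) = rows(key) + h.oneHourPrecip
  }

  /** Returns the accumulated total, or 0.0 if nothing was written for that day. */
  def read(wsid: String, year: Int, month: Int, day: Int): Double =
    rows((wsid, year, month, day))
}
```

Writing each hourly record as it arrives leaves the daily total ready to query, which is the work a `reduceByKey` over the stream would otherwise have to do.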
  • 53. Timeseries Data
    CREATE TABLE weather.raw_data (
      wsid text, year int, month int, day int, hour int,
      temperature double, dewpoint double, pressure double,
      wind_direction int, wind_speed double, one_hour_precip double,
      PRIMARY KEY ((wsid), year, month, day, hour)
    ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
    C* Clustering Columns • Writes by most recent • Reads return most recent first
    Cassandra will automatically sort by most recent for both write and read
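What the descending clustering order gives a reader can be shown with an ordinary in-memory sort. This is only an analogy: `RawKey` and `ClusteringOrder` are hypothetical names, and Cassandra keeps rows in this order on disk within a partition rather than sorting at read time.

```scala
// Hypothetical holder for the four clustering columns of raw_data.
final case class RawKey(year: Int, month: Int, day: Int, hour: Int)

object ClusteringOrder {
  // Descending on every clustering column, mirroring
  // CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC).
  val mostRecentFirst: Ordering[RawKey] =
    Ordering.by[RawKey, (Int, Int, Int, Int)](k => (k.year, k.month, k.day, k.hour)).reverse

  /** The n most recent rows of one partition, newest first. */
  def latest(rows: Seq[RawKey], n: Int): Seq[RawKey] =
    rows.sorted(mostRecentFirst).take(n)
}
```

With the table already stored this way, a "latest n readings for a station" query is a read of the first n rows of the partition, with no extra sort in Spark or in application code.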
  • 54. Record Every Event In The Order In Which It Happened, Per URL
    // Build one receiver stream per iteration, then union them
    val multipleStreams = for (i <- 1 to numstreams) yield
      streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))

    streamingContext.union(multipleStreams)
      .map { httpRequest => TimelineRequestEvent(httpRequest) }
      .saveToCassandra("requests_ks", "timeline")

    CREATE TABLE IF NOT EXISTS requests_ks.timeline (
      timesegment bigint, url text, t_uuid timeuuid, method text,
      headers map <text, text>, body text,
      PRIMARY KEY ((url, timesegment), t_uuid)
    );
    timesegment protects from writing unbounded partitions. timeuuid protects from simultaneous events over-writing one another.
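One way a `timesegment` value could be derived is by truncating the event timestamp to a fixed-width bucket, so each `(url, timesegment)` partition only holds the events of one bucket. The slide does not say how `timesegment` is computed; the `TimeSegment` helper and the choice of epoch milliseconds are assumptions made for this sketch.

```scala
object TimeSegment {
  /** Truncates an epoch-millisecond timestamp to the start of its bucket. */
  def of(epochMillis: Long, bucketMillis: Long): Long = {
    require(bucketMillis > 0, "bucket width must be positive")
    // floorDiv rounds toward negative infinity, so pre-epoch timestamps bucket correctly too.
    Math.floorDiv(epochMillis, bucketMillis) * bucketMillis
  }
}
```

For example, with a one-hour bucket every event in the same hour for the same URL lands in the same partition, and the next hour starts a new one, which keeps partition size bounded by the bucket width.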
  • 55. Cassandra & Spark Streaming: Data Locality For Free®
    val stream = KafkaUtils.createDirectStream(...)
      .map(_._2.split(","))
      .map(RawWeatherData(_))

    stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

    stream
      .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
      .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
    Replay and Reprocess - Any Time. Data is on the nodes doing the querying - Spark C* Connector - Partitions • Timeseries data with Data Locality • Co-located Spark + Cassandra nodes • S3 does not give you this
  • 56. Compute Isolation: Actor. Queries pre-aggregated tables from the stream
    class PrecipitationActor(ssc: StreamingContext, settings: Settings) extends AggregationActor {
      import akka.pattern.pipe

      def receive: Actor.Receive = {
        case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
      }

      /** Returns the `k` highest daily precipitation values for station `wsid` in `year`. */
      def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
        val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
          ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

        ssc.cassandraTable[Double](keyspace, dailytable)
          .select("precipitation")
          .where("wsid = ? AND year = ?", wsid, year)
          .collectAsync().map(toTopK) pipeTo requester
      }
    }
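For intuition about what a top-k computation does, here is a small local version using a bounded min-heap. It is a sketch in plain Scala, not Spark's implementation of `top`; the `TopK` object is a hypothetical name for the example.

```scala
import scala.collection.mutable

object TopK {
  /** Returns the k largest values in descending order, holding at most k values at a time. */
  def top[A](values: Iterator[A], k: Int)(implicit ord: Ordering[A]): List[A] = {
    require(k >= 0, "k must be non-negative")
    // Reversed ordering turns the priority queue into a min-heap:
    // the smallest retained value sits at the head and is evicted first.
    val heap = mutable.PriorityQueue.empty[A](ord.reverse)
    values.foreach { v =>
      if (heap.size < k) heap.enqueue(v)
      else if (k > 0 && ord.gt(v, heap.head)) { heap.dequeue(); heap.enqueue(v) }
    }
    heap.toList.sorted(ord.reverse)
  }
}
```

Because only k values are retained, the cost stays small even when a station has many daily rows, which is why the actor can answer from the pre-aggregated table cheaply.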
  • 57. Efficient Batch Analysis
    class TemperatureActor(sc: SparkContext, settings: Settings) extends AggregationActor {
      import akka.pattern.pipe

      def receive: Actor.Receive = {
        case e: GetMonthlyHiLowTemperature => highLow(e, sender)
      }

      def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
        sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
          .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
          .collectAsync()
          .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
    }
    C* data is automatically sorted by most recent - due to our data model. Additional Spark or collection sort not needed.
  • 58. TCO: Cassandra, Hadoop • Compactions vs file management and de-duplication jobs • Storage cost, cloud providers / hardware, storage format, query speeds, cost of maintaining • Hiring talent to manage Cassandra vs Hadoop • Hadoop has some advantages if used with AWS EMR or other hosted solutions (What about HCP?) • HDFS vs Cassandra is not a fair comparison – You have to decide first if you want file vs DB – Where in the stack – Then you can compare • See http://velvia.github.io/presentations/2015-breakthrough-olap-cass-spark/index.html#/15/3
  • 62. Apache Spark • Fast, distributed, scalable and fault tolerant cluster compute system • Enables low latency with complex analytics • Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014
  • 63. @helenaedelson Spark Streaming • One runtime for streaming and batch processing • Join streaming and static data sets • No code duplication • Easy, flexible data ingestion from disparate sources to disparate sinks • Easy to reconcile queries against multiple sources • Easy integration of KV durable storage 63
  • 64. Diagram labels: Your Data, Extract Data To Analyze, Train your model to predict (Training Data, Feature Extraction, Model Training, Model Testing, Test Data)
    val ssc = new StreamingContext(conf, Milliseconds(500))
    val model = KMeans.train(dataset, ...) // learn offline
    val stream = KafkaUtils
      .createStream(ssc, zkQuorum, group,..)
      .map(event => model.predict(event.feature))
  • 65. @helenaedelson Apache Cassandra • Extremely Fast • Extremely Scalable • Multi-Region / Multi-Datacenter • Always On • No single point of failure • Survive regional outages • Easy to operate • Automatic & configurable replication 65
  • 66. @helenaedelson 66 The one thing in your infrastructure you can always rely on
  • 67. •Massively Scalable • High Performance • Always On • Masterless
  • 69. IoT
  • 71. Science Physics: Astro Physics / Particle Physics..
  • 72. Genetics / Biological Computations
  • 73. • High Throughput Distributed Messaging • Decouples Data Pipelines • Handles Massive Data Load • Support Massive Number of Consumers • Distribution & partitioning across cluster nodes • Automatic recovery from broker failures
  • 75. @helenaedelson 75 High performance concurrency framework for Scala and Java • Fault Tolerance • Asynchronous messaging and data processing • Parallelization • Location Transparency • Local / Remote Routing • Akka: Cluster / Persistence / Streams
  • 76. @helenaedelson Akka Actors 76 A distribution and concurrency abstraction • Compute Isolation • Behavioral Context Switching • No Exposed Internal State • Event-based messaging • Easy parallelism • Configurable fault tolerance
  • 77. Some Resources • http://www.planetcassandra.org/blog/equinix-evaluates-hbase-and-cassandra-for-the-real-time-billing-and-fraud-detection-of-hundreds-of-data-centers-worldwide • Building a scalable platform for streaming updates and analytics • @SAPdevs tweet: Watch us take a dive deeper and explore the benefits of using Akka and Scala https://t.co/M9arKYy02y • https://mesosphere.com/blog/2015/07/24/learn-everything-you-need-to-know-about-scala-and-big-data-in-oakland/ • https://www.youtube.com/watch?v=buszgwRc8hQ • https://www.youtube.com/watch?v=G7N6YcfatWY • http://noetl.org