Sessionization At Scale
Using Spark Streaming in production and staying sane
Marina Grechuhin & Yuval Itzchakov
12/09/2017
Yuval Itzchakov
2Confidential
• Developer @ Clicktale for the past 3 years
• Previously developer @ IDF (8200)
• @yuvalitzchakov
• https://stackoverflow.com/users/1870803/yuval-itzchakov
• http://asyncified.io
Marina Grechuhin
• Team Leader @ Clicktale
• Previously co-founder and VP R&D @ SureVisit
• Previously – many more
Agenda
• Introduction to Spark
• Spark In Depth
• What Is Sessionization?
• Spark Brief Overview
• Sessionization With Spark Streaming
• Scale Challenges
• Structured Streaming with Stateful Aggregations
Architecture – Pipeline CEC
[Diagram: Elastic Load Balancing in front of an Auto Scaling group of Ingest Servers]
Example message:
{
  "version": 1,
  "location": "http://adobe.com/shoe.html",
  "projectId": 10,
  "documentReferrer": "",
  "visitId": 6403608503386111,
  "domContentLoaded": 324,
  "visitorId": 3246944914767871,
  "pageviewId": 1199465738272767,
  "engagementTime": 2336,
  "messageId": 0
}
Pipeline CEC – Data Types
• Init Message
• Chunk Messages 0-N
• End Message
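Sizing Pipeline CEC
[Diagram: two Elastic Load Balancing -> Ingest paths, labeled 500G/Day and 100G/Day; the figure also shows the counts 10, 14, 15, 8 and 10, whose placement is not recoverable from the extracted text]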
What is Sessionization?
Session:
“A sequence of requests made by a single end-user during a visit to a particular site”
(Wikipedia)
• To be able to aggregate user actions over time
• All data doesn’t arrive at once, but piece by piece
Pipeline CEC – Data Types
[Diagram: a Visit contains several PageViews; each PageView starts with an Init message, continues with Chunk messages and is closed by an End message (one PageView in the figure has no End yet)]
PageView – User’s journey on a single web page
Visit – User’s journey on site
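The Visit / PageView hierarchy above can be sketched in plain Python (no Spark involved). This is only an illustration: the "type" field is a hypothetical addition, since the sample message in the deck carries no explicit message type, and the rule "a PageView is complete once it has an Init and an End" is an assumption drawn from the diagram.

```python
from collections import defaultdict


def group_into_visits(messages):
    """Group raw messages into {visitId: {pageviewId: [types in arrival order]}}."""
    visits = defaultdict(lambda: defaultdict(list))
    for msg in messages:
        visits[msg["visitId"]][msg["pageviewId"]].append(msg["type"])
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {visit: dict(pageviews) for visit, pageviews in visits.items()}


def is_complete(pageview_types):
    """Assumed rule: a PageView is complete once it has both Init and End."""
    return "init" in pageview_types and "end" in pageview_types


# Messages of two PageViews of the same Visit, interleaved as they might arrive.
messages = [
    {"visitId": 1, "pageviewId": 10, "type": "init"},
    {"visitId": 1, "pageviewId": 10, "type": "chunk"},
    {"visitId": 1, "pageviewId": 11, "type": "init"},
    {"visitId": 1, "pageviewId": 10, "type": "end"},
    {"visitId": 1, "pageviewId": 11, "type": "chunk"},
]
visits = group_into_visits(messages)
```

Here PageView 10 is complete, while PageView 11 is still waiting for its End message.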
Requirements overview
• Message size ranging from 200 B to 1 KB (may grow over time)
• Process up to 100,000 incoming user messages per second
• Handle traffic peaks of up to 1,000,000 messages/second (common with Fortune 500 companies)
• Scale out as needed without user intervention (hopefully linearly)
• Save user state until a session is complete, and only then send it down the pipeline
• Latency: up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
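A quick back-of-the-envelope calculation of what these requirements mean in bandwidth terms. It assumes "1K" means 1,000 bytes and uses decimal megabytes; the function name is made up for this sketch.

```python
def throughput_mb_per_sec(messages_per_sec, message_bytes):
    """Raw ingest bandwidth in MB/s (1 MB = 1,000,000 bytes)."""
    return messages_per_sec * message_bytes / 1_000_000


# Steady state: 100,000 messages/second at the smallest and largest sizes.
steady_low = throughput_mb_per_sec(100_000, 200)
steady_high = throughput_mb_per_sec(100_000, 1_000)

# Peak: 1,000,000 messages/second at the largest size.
peak_high = throughput_mb_per_sec(1_000_000, 1_000)
```

Under these assumptions the steady-state load is 20 to 100 MB/s, and a peak with 1 KB messages reaches 1,000 MB/s.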
Spark Ecosystem
Source: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary
Spark Streaming
• Discretized Stream (DStream)
• Micro batching
• One RDD every batch
Where is the state kept between batches?
mapWithState
Source: https://databricks.com/wp-content/uploads/2016/01/blog-faster-stateful-streaming-figure-1-1024x562.png
[Diagram labels: Init, Chunk, End; PageView; Visit]
• Partial Updates
• Timeout
• Initial State
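The three features listed above (partial updates, timeout, initial state) can be illustrated with a small plain-Python stand-in. This is not the Spark API: the class, its method names and the "emit on End" rule are all hypothetical, chosen only to show the shape of a keyed state store that is updated one message at a time.

```python
class SessionStore:
    """Per-key state updated one message at a time (a stand-in for mapWithState)."""

    def __init__(self, timeout_seconds, initial_state=None):
        self.timeout = timeout_seconds
        # Initial state: sessions carried over from a previous run, if any.
        self.state = {k: list(v) for k, v in (initial_state or {}).items()}
        self.last_seen = {k: 0.0 for k in self.state}

    def update(self, key, message_type, now):
        """Partial update: append to the key's state; emit the session on 'end'."""
        self.state.setdefault(key, []).append(message_type)
        self.last_seen[key] = now
        if message_type == "end":
            del self.last_seen[key]
            return self.state.pop(key)
        return None

    def expire(self, now):
        """Timeout: emit and drop sessions idle for longer than the timeout."""
        idle = [k for k, seen in self.last_seen.items() if now - seen > self.timeout]
        expired = {}
        for key in idle:
            del self.last_seen[key]
            expired[key] = self.state.pop(key)
        return expired


store = SessionStore(timeout_seconds=30, initial_state={"pv-0": ["init"]})
store.update("pv-1", "init", now=1.0)
store.update("pv-1", "chunk", now=2.0)
done = store.update("pv-1", "end", now=3.0)  # complete session is emitted
expired = store.expire(now=40.0)             # "pv-0" has been idle too long
```

The session for "pv-1" is emitted as soon as its End message arrives, while "pv-0", restored from the initial state and never updated, is flushed by the timeout.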
DStream[(String, String)]:
(“dardasaba”, “hello”), (“dardasaba”, “goodbye”), (“hathatul”, “w00t”), (“hathatul”, “nope”), (“gargamel”, “muhaha”)
State per executor, held as OpenHashMap[String, List[String]] (Key -> Value):
Executor 1: “dardasaba” -> [“hello”, “goodbye”]
Executor 2: “hathatul” -> [“w00t”, “nope”]
Executor 3: “gargamel” -> [“muhaha”]
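A rough sketch of how keyed records end up in per-executor maps. It uses plain Python dictionaries in place of OpenHashMap and a CRC32 hash in place of Spark's own partitioner, so the key-to-executor placement will not necessarily match the figure; the function names are hypothetical.

```python
import zlib


def executor_for(key, num_executors):
    """Pick an executor from a stable hash of the key (CRC32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_executors


def build_state(records, num_executors):
    """Route every (key, value) pair to one executor and append it to that key's list."""
    executors = [dict() for _ in range(num_executors)]
    for key, value in records:
        state = executors[executor_for(key, num_executors)]
        state.setdefault(key, []).append(value)
    return executors


records = [
    ("dardasaba", "hello"),
    ("dardasaba", "goodbye"),
    ("hathatul", "w00t"),
    ("hathatul", "nope"),
    ("gargamel", "muhaha"),
]
executors = build_state(records, num_executors=3)
```

The property that matters is that all values of one key land on the same executor, so each key's state lives in exactly one map.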
What Could Possibly Go Wrong?
Scale Challenges
• Stability
• Resiliency
• Scalability (scale up / down)
• Monitoring
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing
• S3
• Task failures due to eventual consistency on read
• AWS EFS (?)
• Not suited for small file systems (limited IOPS)
• HDFS
• Best overall write performance of the three
• Can be installed on the same nodes as Spark Workers
• Relatively low maintenance (if used only for checkpointing)
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing (cont.)
• Problem:
• State is not always recoverable
• No matter the DFS, checkpointing limits your throughput:
• 1 KB message size
• 100,000 messages/sec
• 1 minute checkpoint time (a checkpoint occurs every 40 seconds)
• Workaround:
• None (in Spark Streaming)
[Image caption: Checkpointing – This is the cost???]
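The numbers on this slide can be turned into a small worked calculation. It rests on one interpretation that the slide does not spell out: that a checkpoint takes about 60 seconds to write and is triggered every 40 seconds. The function and key names are made up for this sketch, and 1 KB is taken as 1,000 bytes.

```python
def checkpoint_load(message_bytes, messages_per_sec, interval_sec, checkpoint_sec):
    """Summarise the checkpointing pressure for one set of figures."""
    ingest_mb_per_sec = message_bytes * messages_per_sec / 1_000_000
    return {
        "ingest_mb_per_sec": ingest_mb_per_sec,
        # Data that arrives between two consecutive checkpoints.
        "new_data_mb_per_interval": ingest_mb_per_sec * interval_sec,
        # Above 1.0, a new checkpoint is due before the previous one has finished.
        "checkpoint_duty": checkpoint_sec / interval_sec,
    }


load = checkpoint_load(
    message_bytes=1_000,
    messages_per_sec=100_000,
    interval_sec=40,
    checkpoint_sec=60,
)
```

Under that reading, about 4,000 MB of new data arrives per checkpoint interval and the duty ratio is 1.5, so checkpoints cannot keep up with their own schedule.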
Spark Streaming Challenges
2. Resiliency -> Managing user state between application upgrades
• Problems:
• Can’t change the graph
• Can’t change your data structures
• Workaround:
• Roll your own using `stateSnapshots()`
• Provide on start up using `StateSpec.initialState()`
* Can potentially double the job time (critical with high throughput).
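The "roll your own" workaround boils down to writing the state out in a format you control and reading it back, with a migration step, when the upgraded application starts. The sketch below shows only that idea in plain Python with JSON; it does not call Spark, and the version numbers, field names and migration rule are all hypothetical.

```python
import json

SCHEMA_VERSION = 2  # hypothetical version of the state layout


def snapshot(state):
    """Serialise the state together with the layout version that wrote it."""
    return json.dumps({"version": SCHEMA_VERSION, "state": state})


def restore(blob):
    """Load a snapshot and migrate older layouts to the current one."""
    payload = json.loads(blob)
    state = payload["state"]
    if payload["version"] == 1:
        # Hypothetical migration: v1 stored a bare list of message types per key.
        state = {
            key: {"messages": messages, "last_seen": None}
            for key, messages in state.items()
        }
    return state


# A snapshot written by the previous (v1) application version.
old_blob = json.dumps({"version": 1, "state": {"pv-1": ["init", "chunk"]}})
initial_state = restore(old_blob)
```

In a real job, the restored dictionary would be what gets handed over as the initial state on start-up.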
Spark Streaming Challenges
3. Stability -> Frozen Jobs
• Problem:
• Spark Streaming defaults to one job (batch) at a time
• If a particular job is stuck, all others wait indefinitely
• Workaround:
• Monitor job status using Spark’s driver REST API (http://<driver ip>:4040/api/v1/applications)
• Consider using speculation (should be done carefully)
• Enable blacklisting if a particular node is faulty
• If you like to live dangerously, consider modifying “spark.streaming.concurrentJobs”
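A minimal sketch of the monitoring workaround: poll the driver's REST API for running jobs and flag those that have been running for too long. The threshold, the function names and the sample data are hypothetical, and the timestamp format parsed here ("...T10:00:00.000GMT") is an assumption about what the API returns and should be checked against a real response.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_running_jobs(driver_url, app_id):
    """Fetch running jobs from the driver's REST API (not called in this sketch)."""
    url = f"{driver_url}/api/v1/applications/{app_id}/jobs?status=running"
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def stuck_jobs(jobs, now, max_age_seconds):
    """Return the ids of jobs submitted more than max_age_seconds ago."""
    stuck = []
    for job in jobs:
        # Assumed timestamp format; adjust if the API returns something else.
        submitted = datetime.strptime(job["submissionTime"], "%Y-%m-%dT%H:%M:%S.%fGMT")
        submitted = submitted.replace(tzinfo=timezone.utc)
        if (now - submitted).total_seconds() > max_age_seconds:
            stuck.append(job["jobId"])
    return stuck


# Sample data shaped like the API's job list, for illustration only.
sample_jobs = [
    {"jobId": 1, "status": "RUNNING", "submissionTime": "2017-09-12T10:00:00.000GMT"},
    {"jobId": 2, "status": "RUNNING", "submissionTime": "2017-09-12T10:09:30.000GMT"},
]
now = datetime(2017, 9, 12, 10, 10, 0, tzinfo=timezone.utc)
flagged = stuck_jobs(sample_jobs, now, max_age_seconds=300)
```

With a five-minute threshold, only job 1 (running for ten minutes) is flagged.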
Spark Streaming Challenges
4. Scalability
• Scale Up – Just works*
• Scale Down – Who takes over the worker’s state? No one!
Spark Streaming Challenges
5. Monitoring
• Logging mechanism – log only a small, random percentage of traffic
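One simple way to log only a small, random share of traffic is a per-message coin flip. The sketch below is an assumption about how such a sampler could look, not the mechanism used in the talk; the 1% rate, the seed and the function names are placeholders.

```python
import random


def make_sampler(rate, seed=None):
    """Return a function that says 'log this message' for roughly `rate` of calls."""
    rng = random.Random(seed)

    def should_log(_message):
        # random() is uniform in [0, 1), so this is true with probability `rate`.
        return rng.random() < rate

    return should_log


should_log = make_sampler(rate=0.01, seed=42)
logged = sum(1 for i in range(100_000) if should_log({"messageId": i}))
```

At a 1% rate, roughly 1,000 of the 100,000 messages are logged; the exact count depends on the seed.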
Is there a better alternative?
Structured Streaming
“The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended” (Structured Streaming Documentation)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
Structured Streaming (Cont.)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
mapGroupsWithState
A second iteration at stateful aggregations in Spark
Resiliency & Stability -> Checkpointing
• Checkpoints are incremental, only deltas!
• Allows state recovery between upgrades *
* According to a set of tests we ran; may not apply to all cases and isn’t documented behavior
Spark Structured Streaming
• More new features and cool stuff
• Event-time-based timeouts (previously only processing-time-based)
• Watermarking (New)
• Deduplication (New)
• Timeout per state item (Enhancement)
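Watermarking and deduplication work together: the watermark bounds how late an event may arrive, which in turn bounds how long seen ids must be remembered. The plain-Python sketch below illustrates that interplay only. It is not Spark code, the class and its names are hypothetical, and it advances the watermark on every event, which is a simplification.

```python
class WatermarkedDeduplicator:
    """Drop duplicates and late events, keeping only ids newer than the watermark."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = 0
        self.seen = {}  # message id -> event time

    @property
    def watermark(self):
        # Events older than (latest event time - allowed delay) count as too late.
        return self.max_event_time - self.max_delay

    def accept(self, message_id, event_time):
        if event_time < self.watermark:
            return False  # too late: behind the watermark
        if message_id in self.seen:
            return False  # duplicate
        self.seen[message_id] = event_time
        self.max_event_time = max(self.max_event_time, event_time)
        # Forget ids behind the watermark so the state stays bounded.
        self.seen = {m: t for m, t in self.seen.items() if t >= self.watermark}
        return True


dedup = WatermarkedDeduplicator(max_delay=10)
results = [
    dedup.accept("a", 100),  # first event
    dedup.accept("a", 100),  # same id again
    dedup.accept("b", 95),   # late, but within the allowed delay
    dedup.accept("c", 80),   # behind the watermark (100 - 10 = 90)
    dedup.accept("d", 120),  # moves the watermark to 110
]
```

After the last event, the ids "a" and "b" have fallen behind the watermark and are dropped from the state.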
Our experience so far
Running ~ 1 month in production with Spark 2.2 and mapGroupsWithState:
Pros:
• Queries seem to take less time on average than Spark Streaming *
• No need to save state manually
• Deduplication out of the box is awesome
• Event based timeouts + Watermarking for late data is also awesome
* In peak hours, from ~ 3 seconds per batch to 0.6 seconds per query (x5)
Our experience so far (Cont.)
Neutral:
• Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data locality (less shuffling).
• This means that in order to scale up, you must have at least a 1:1 mapping between the number of Kafka partitions and Spark Executors.
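Why the partition count caps the useful number of executors can be seen with a toy assignment. The round-robin rule below is an assumption made for illustration, not Spark's actual placement logic, and the function names are hypothetical.

```python
def assign_partitions(num_partitions, num_executors):
    """Pin each partition to one executor, round-robin (illustrative rule only)."""
    assignment = {executor: [] for executor in range(num_executors)}
    for partition in range(num_partitions):
        assignment[partition % num_executors].append(partition)
    return assignment


def idle_executors(assignment):
    """Executors that received no partition and therefore read no data."""
    return [executor for executor, partitions in assignment.items() if not partitions]


few_partitions = assign_partitions(num_partitions=4, num_executors=6)
enough_partitions = assign_partitions(num_partitions=6, num_executors=3)
```

With 4 partitions and 6 executors, two executors have nothing to read; adding executors only helps while there are partitions left to hand out.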
Cons:
• Creates a significantly larger memory overhead (due to internal state implementation)
• Makes heavier use of HDFS (many small file writes)
• Doesn’t support multiple states (yet)
• UI not as good as Streaming
Wrapping up
• Overall, Spark Streaming is a great candidate for small-to-medium loads or for streams without stateful aggregations.
• If you’re considering Spark as an option for your business, start with Structured Streaming from the get-go.
• Do consider Apache Flink as an alternative; its similar state management module allows pluggable state stores.
Resources
• Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
• Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
• Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28
• Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark
• Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
Thank you for listening!
Questions?
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Ad

Spark Streaming @ Scale (Clicktale)

  • 1. Sessionization At Scale: Using Spark Streaming in production and staying sane. Marina Grechuhin & Yuval Itzchakov, 12/09/2017
  • 2. Yuval Itzchakov
      • Developer @ Clicktale for the past 3 years
      • Previously developer @ IDF (8200)
      • @yuvalitzchakov
      • https://stackoverflow.com/users/1870803/yuval-itzchakov
      • http://asyncified.io
  • 3. Marina Grechuhin
      • Team Leader @ Clicktale
      • Previously co-founder and VP R&D @ SureVisit
      • Previously – many more
  • 4. Agenda
      • Introduction to Spark
      • Spark In Depth
      • What Is Sessionization?
      • Spark Brief Overview
      • Sessionization With Spark Streaming
      • Scale Challenges
      • Structured Streaming with Stateful Aggregations
  • 5. Architecture – Pipeline CEC [Diagram labels: Elastic Load Balancing; Auto Scaling group; Ingest Servers]
  • 6. Pipeline CEC – Data Types
      • Init Message
      • Chunk Messages 0-N
      • End Message
      • Example message: { "version": 1, "location": "http://adobe.com/shoe.html", "projectId": 10, "documentReferrer": "", "visitId": 6403608503386111, "domContentLoaded": 324, "visitorId": 3246944914767871, "pageviewId": 1199465738272767, "engagementTime": 2336, "messageId": 0 }
  • 7. Sizing Pipeline CEC [Diagram: two Elastic Load Balancing / Ingest tiers, labeled 500G/Day and 100G/Day; numeric labels as extracted: 14, 10, 14, 15, 8, 10]
  • 8. What is Sessionization?
      • Session: “A sequence of requests made by a single end-user during a visit to a particular site” (Wikipedia)
      • Lets us aggregate user actions over time
      • Data doesn’t arrive all at once, but piece by piece
  • 9. Pipeline CEC – Data Types [Diagram: a Visit made up of PageViews, each PageView a sequence of Init, Chunk, and End messages]
      • PageView – User’s journey on a single web page
      • Visit – User’s journey on site
  • 10. Requirements overview
      • Data size ranging between 200 B and 1 KB (may grow over time)
      • Process incoming user messages at up to 100,000 messages per second
      • Handle traffic peaks up to 1,000,000 messages/second (common with Fortune 500 companies)
      • Scale out as needed without user intervention (hopefully linearly)
      • Save user state until a session is complete, and only then send it down the pipeline
      • Latency - up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
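The requirement to hold user state until a session is complete can be illustrated with a small, framework-free sketch. This is plain Python rather than Spark code. The `Sessionizer` class and the `type` field are hypothetical names: the sample message in the deck shows `pageviewId` but not how Init, Chunk, and End messages are told apart.

```python
from collections import defaultdict


class Sessionizer:
    """Buffers messages per page view and emits them only once an End message arrives."""

    def __init__(self):
        # pageviewId -> list of messages seen so far (the "user state")
        self._open = defaultdict(list)

    def on_message(self, msg):
        """Add one message; return the completed page view, or None if it is still open."""
        key = msg["pageviewId"]
        self._open[key].append(msg)
        if msg["type"] == "end":  # hypothetical field marking Init / Chunk / End
            return self._open.pop(key)
        return None

    def open_count(self):
        """Number of page views still waiting for their End message."""
        return len(self._open)
```

A sketch like this keeps a page view forever if its End message never arrives, which is one reason the timeout features listed later in the deck matter.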
  • 13. Spark Streaming
      • Discretized Stream (DStream)
      • Micro batching
      • One RDD every batch
      • Where is the state kept between batches?
  • 15. Input DStream[(String, String)]: (“dardasaba”, “hello”), (“dardasaba”, “goodbye”), (“hathatul”, “w00t”), (“hathatul”, “nope”), (“gargamel”, “muhaha”). State is held per executor as OpenHashMap[String, List[String]] (Key -> Value):
      • Executor 1: “dardasaba” -> [“hello”, “goodbye”]
      • Executor 2: “hathatul” -> [“w00t”, “nope”]
      • Executor 3: “gargamel” -> [“muhaha”]
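The idea on this slide, that each key's state lives on exactly one executor, can be modeled in a few lines of plain Python. This is only a sketch: `executor_for` and `update_state` are hypothetical helpers, CRC32 is an arbitrary stable hash chosen for the example, and Spark's real partitioner is not reproduced here.

```python
import zlib
from collections import defaultdict


def executor_for(key, num_executors):
    """Pick an executor index from a stable hash of the key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_executors


def update_state(records, num_executors):
    """Route each (key, value) pair to one executor and append the value to that key's list."""
    # One dict per executor, standing in for the per-executor hash map on the slide.
    state = [defaultdict(list) for _ in range(num_executors)]
    for key, value in records:
        state[executor_for(key, num_executors)][key].append(value)
    return state


batch = [
    ("dardasaba", "hello"),
    ("dardasaba", "goodbye"),
    ("hathatul", "w00t"),
    ("hathatul", "nope"),
    ("gargamel", "muhaha"),
]
state = update_state(batch, num_executors=3)
```

Which executor ends up holding which key depends on the hash, so the sketch makes no claim about matching the slide's layout; the property it shows is that all values for one key accumulate in a single place.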
  • 17. Scale Challenges
      • Stability
      • Resiliency
      • Scalability (scale up / down)
      • Monitoring
  • 18. Spark Streaming Challenges – 1. Stability & Resiliency -> Checkpointing
      • S3
          • Task failure - Eventual consistency on read
      • AWS EFS (?)
          • Not suited for small file systems (limited IOPS)
      • HDFS
          • Best overall write performance out of the three
          • Can be installed on the same node as Spark Workers
          • Relatively low maintenance (if used only for checkpoint)
  • 19. Spark Streaming Challenges – 1. Stability & Resiliency -> Checkpointing (cont.)
      • Problem:
          • State not always recoverable
          • No matter the DFS, checkpointing limits your throughput:
              • 1KB message size
              • 100,000 messages/sec
              • 1 minute checkpoint time (occurs every 40 seconds)
      • Workaround:
          • None (in Spark Streaming)
      • Checkpointing – This is the cost???
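The figures on this slide can be put side by side with some quick arithmetic. The sketch below takes 1 KB as 1,000 bytes (an assumption) and computes input volume only; the deck does not say how much of that input ends up in checkpointed state.

```python
MESSAGE_BYTES = 1_000            # 1 KB per message, taken as 1,000 bytes (assumption)
MESSAGES_PER_SEC = 100_000       # throughput from the slide
CHECKPOINT_INTERVAL_SEC = 40     # a checkpoint starts every 40 seconds
CHECKPOINT_DURATION_SEC = 60     # each checkpoint takes about 1 minute

ingest_bytes_per_sec = MESSAGE_BYTES * MESSAGES_PER_SEC
bytes_per_interval = ingest_bytes_per_sec * CHECKPOINT_INTERVAL_SEC
duration_to_interval = CHECKPOINT_DURATION_SEC / CHECKPOINT_INTERVAL_SEC

print(f"ingest rate: {ingest_bytes_per_sec / 1e6:.0f} MB/s")
print(f"data arriving per checkpoint interval: {bytes_per_interval / 1e9:.0f} GB")
print(f"checkpoint duration / interval: {duration_to_interval:.1f}")
```

Under these assumptions roughly 100 MB/s arrives, about 4 GB per checkpoint interval, and each checkpoint runs 1.5 times longer than the gap between checkpoints, so the next one is due before the previous one has finished.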
  • 20. Spark Streaming Challenges – 2. Resiliency -> Managing user state between application upgrades
      • Problems:
          • Can’t change the graph
          • Can’t change your data structures
      • Workaround:
          • Roll your own using `stateSnapshots()`
          • Provide on start up using `StateSpec.initialState()`
      • * This can potentially double the overhead on job time (critical with high throughput).
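The "roll your own" workaround boils down to writing state out before an upgrade and seeding the new version from it. The sketch below models that in plain Python with a JSON file. The function names, the file layout, and the v1-to-v2 migration are all hypothetical; in Spark the snapshot would come from the state stream and be handed to `StateSpec.initialState()`, as the slide describes.

```python
import json
import os
import tempfile


def save_snapshot(state, path, version):
    """Write the whole key -> state mapping to disk with a schema version tag."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": version, "state": state}, f)


def load_initial_state(path):
    """Read a snapshot and upgrade old records to the current (hypothetical) v2 layout."""
    with open(path, encoding="utf-8") as f:
        snapshot = json.load(f)
    state = snapshot["state"]
    if snapshot["version"] == 1:
        # v1 stored a bare list of messages; v2 wraps it and adds a counter.
        state = {k: {"messages": v, "count": len(v)} for k, v in state.items()}
    return state


# The old application version writes its state before shutting down...
path = os.path.join(tempfile.mkdtemp(), "state-snapshot.json")
save_snapshot({"visit-1": ["init", "chunk"]}, path, version=1)

# ...and the upgraded application seeds its state from the snapshot on start-up.
initial_state = load_initial_state(path)
```

Tagging the snapshot with a version is what makes it possible to change data structures between releases, which is exactly what the built-in checkpoint does not allow according to the slide.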
  • 21. Spark Streaming Challenges – 3. Stability -> Frozen Jobs
      • Problem:
          • Spark Streaming defaults to one job (batch) at a time
          • If a particular job is stuck, all others wait indefinitely
      • Workaround:
          • Monitor job status using Spark’s driver REST API (http://<driver ip>:4040/api/v1/applications)
          • Consider using Speculation (should be done carefully)
          • Enable Blacklisting if a particular node is faulty.
          • If you like to live dangerously, consider modifying “spark.streaming.concurrentJobs”
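Monitoring for frozen jobs through the REST API could look roughly like the check below. It is a sketch that works on an already-parsed list of job records rather than calling the driver. The field names (`jobId`, `status`, `submissionTime`) and the timestamp layout are assumptions about what the jobs endpoint under `/api/v1/applications` returns, and the sample values are made up.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout assumed for the `submissionTime` field, e.g. "2017-09-12T10:00:00.000GMT".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def stuck_jobs(jobs, now, max_age):
    """Return the ids of jobs that are still RUNNING and were submitted more than max_age ago."""
    stuck = []
    for job in jobs:
        submitted = datetime.strptime(job["submissionTime"], TIME_FORMAT)
        submitted = submitted.replace(tzinfo=timezone.utc)
        if job["status"] == "RUNNING" and now - submitted > max_age:
            stuck.append(job["jobId"])
    return stuck


# Sample payload shaped like a list of job records; the values are invented.
sample = [
    {"jobId": 41, "status": "SUCCEEDED", "submissionTime": "2017-09-12T10:00:00.000GMT"},
    {"jobId": 42, "status": "RUNNING", "submissionTime": "2017-09-12T10:00:10.000GMT"},
    {"jobId": 43, "status": "RUNNING", "submissionTime": "2017-09-12T10:04:50.000GMT"},
]
now = datetime(2017, 9, 12, 10, 5, 0, tzinfo=timezone.utc)
print(stuck_jobs(sample, now, max_age=timedelta(minutes=2)))  # [42]
```

With a 10-second latency target, a threshold of a couple of minutes is deliberately generous; the right value depends on the batch interval and is left as a tuning choice.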
  • 22. Spark Streaming Challenges – 4. Scalability
      • Scale Up – Just works*
      • Scale Down – Who takes over the worker’s state? No one!
  • 23. Spark Streaming Challenges – 5. Monitoring
      • Logging mechanism – Log only a small, random percentage of traffic
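Logging only a small random share of traffic can be sketched as follows. `SampledLogger` is a hypothetical helper, the 1% rate is an example value rather than a figure from the deck, and an in-memory list stands in for a real log sink.

```python
import random


class SampledLogger:
    """Logs roughly `rate` of the messages it is offered and silently drops the rest."""

    def __init__(self, rate, seed=None):
        self.rate = rate
        self._rng = random.Random(seed)  # a seed makes runs repeatable in tests
        self.lines = []                  # stand-in for a real log sink

    def maybe_log(self, message):
        """Record the message with probability `rate`; return whether it was logged."""
        if self._rng.random() < self.rate:
            self.lines.append(message)
            return True
        return False


logger = SampledLogger(rate=0.01, seed=42)
for i in range(100_000):
    logger.maybe_log(f"message {i}")
```

At 100,000 messages per second, sampling of this kind keeps log volume near 1,000 lines per second instead of the full stream, at the cost of not seeing every message.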
  • 24. Is there a better alternative?
  • 25. Structured Streaming: “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended” (Structured Streaming Documentation). Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
  • 26. Structured Streaming (Cont.) Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
  • 27. mapGroupsWithState: a second iteration at stateful aggregations in Spark. Resiliency & Stability -> Checkpointing
      • Checkpoints are incremental, only deltas!
      • Allows state recovery between upgrades *
      • * According to a set of tests we ran; this may not apply to all cases and isn’t documented behavior
  • 28. Spark Structured Streaming: more new features and cool stuff
      • Event-time based timeouts (previously only processing-time based)
      • Watermarking (New)
      • Deduplication (New)
      • Timeout per state item (Enhancement)
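How watermarking and deduplication interact can be shown with a deliberately simplified model in plain Python. `WatermarkedDeduplicator` is a hypothetical class, event times are plain numbers of seconds, and the watermark here moves with every accepted event; Spark's own bookkeeping differs (for example, it does not advance the watermark per event), so this illustrates the concept and nothing more.

```python
class WatermarkedDeduplicator:
    """Drops duplicate ids and events older than (max event time seen - allowed lateness)."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None
        self.seen = {}  # message id -> event time, kept only while above the watermark

    def watermark(self):
        """Current watermark, or None before the first event."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def accept(self, msg_id, event_time):
        """Return True if the event should be processed, False if it is late or a duplicate."""
        wm = self.watermark()
        if wm is not None and event_time < wm:
            return False  # too late: arrived behind the watermark
        if msg_id in self.seen:
            return False  # duplicate of an id still held in state
        self.seen[msg_id] = event_time
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Forget ids that fell behind the new watermark so state stays bounded.
        wm = self.watermark()
        self.seen = {k: t for k, t in self.seen.items() if t >= wm}
        return True
```

The watermark is what lets deduplication state be dropped safely: once an id is older than the watermark, a repeat of it would be rejected as late anyway, so there is no need to remember it.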
  • 29. Our experience so far: running ~1 month in production with Spark 2.2 and mapGroupsWithState
      • Pros:
          • Queries seem to take less time on average than Spark Streaming *
          • No need to save state manually
          • Deduplication out of the box is awesome
          • Event based timeouts + Watermarking for late data is also awesome
      • * In peak hours, from ~3 seconds per batch to 0.6 seconds per query (x5)
  • 30. Our experience so far (Cont.)
      • Neutral:
          • Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data locality (less shuffling).
          • This means that in order to scale up, you must have at least a 1:1 mapping between number of Kafka partitions and Spark Executors.
      • Cons:
          • Creates a significantly larger memory overhead (due to internal state implementation)
          • Makes heavier use of HDFS (many small file writes)
          • Doesn’t support multiple states (yet)
          • UI not as good as Streaming
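The scaling consequence of pinning each Kafka partition to one executor can be seen with a toy assignment. The round-robin placement below is an assumption made for illustration, not Spark's actual placement logic, and both function names are hypothetical.

```python
def assign_partitions(num_partitions, num_executors):
    """Round-robin each Kafka partition onto one executor (a stand-in for the fixed mapping)."""
    assignment = {e: [] for e in range(num_executors)}
    for p in range(num_partitions):
        assignment[p % num_executors].append(p)
    return assignment


def idle_executors(assignment):
    """Executors that received no partition and therefore read nothing from Kafka."""
    return [e for e, parts in assignment.items() if not parts]


print(idle_executors(assign_partitions(num_partitions=4, num_executors=6)))   # [4, 5]
print(idle_executors(assign_partitions(num_partitions=12, num_executors=6)))  # []
```

With fewer partitions than executors, the extra executors receive nothing to read, so adding executors only helps once the topic has at least as many partitions.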
  • 31. Wrapping up
      • Overall, Spark Streaming is a great candidate for small-to-medium loads or for streams without stateful aggregations.
      • If you’re considering Spark as an option for your business, start with Structured Streaming from the get-go.
      • Do consider Apache Flink as an alternative; its similar state management module allows pluggable state stores.
  • 32. Resources
      • Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
      • Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
      • Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28
      • Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark
      • Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
  • 33. Thank you for listening! Questions?

Editor's Notes

  • #6: Monitor message sending mechanism; 1 Kafka topic
  • #8: Monitor message sending mechanism; 1 Kafka topic
  • #10: How do we aggregate user messages over time in a Streaming application?
  • #11: Do a brief overview of all points, 15-20 seconds per point. At the end of the slide do an intro to Spark and talk a little about why we chose it over alternatives
  • #13: Ask a question: How many people use Spark in production? How many people use Spark Streaming in production? How many do Sessionization?
  • #14: Spark is not real-time streaming, but micro batching. Where is the state held?
  • #19: Talk about each file system briefly