© Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead - @fwiffo
Data Day Texas 2017
Building Production Spark Streaming Applications
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
Context
• We built a system for large scale realtime collection, processing, and
analysis of event-oriented machine data
• On prem or in the cloud, but not SaaS
• Supportability is a big deal for us
• Predictability of performance under load and failures
• Ease of configuration and operation
• Behavior in wacky environments
Apache Spark Streaming
Spark Streaming overview
• Stream processing API built on top of the Spark execution engine
• Micro-batching
• Every n milliseconds, fetch records from the data source
• Execute Spark jobs on each input batch
• DStream API
• Wrapper around the RDD API
• Lets the developer think in terms of transformations on a stream of events
[Diagram: Input Batch -> Spark Batch Engine -> Output Batch]
Structured streaming
• New streaming API for Spark
• Re-use DataFrames API for streaming
• API was too new when we started
• First release was an alpha
• No Kafka support at the time
• Details won't apply, but the overall approach should be in the ballpark
Other notes
• Our experience is with Spark 1.6.2
• 2.0.0 was released after we started our Spark integration
• We use the Apache release of Spark
• Supports both CDH and HDP without recompiling
• We run Spark on YARN, so we're decoupled from other users on the cluster
Use Case
Real-time alerting on IT operational data
Our typical customer use cases
• >100K events / sec (8.6B events / day), sub-second end to end latency,
full fidelity retention, critical use cases
• Quality of service - “are credit card transactions happening fast enough?”
• Fraud detection - “detect, investigate, prosecute, and learn from fraud.”
• Forensic diagnostics - “what really caused the outage last Friday?”
• Security - “who’s doing what, where, when, why, and how, and is that ok?”
• User behavior - “capture and correlate user behavior with system
performance, then feed it to downstream systems in realtime.”
Overall architecture
[Diagram: weirdo formats -> transformation 1 (weirdo format -> event) -> avro events -> transformation 2 (event -> storage-specific) -> storage-specific representation of events]
Real-time alerting
• Define aggregations, conditions, and actions
• Use cases:
• Send me an e-mail when the number of failed login events from a user is > 3
within an hour
• Create a ServiceNow ticket when CPU utilization spikes to > 95% for 10 minutes
UI
Architecture
Packaging, Deployment, and Execution
Packaging
• Application classes and dependencies
• Two options
• Shade all dependencies into an uber jar
• Make sure Hadoop and Spark dependencies are marked provided
• Submit application jars and dependent jars when submitting
Deployment modes
• Standalone
• Manually start up head and worker services
• Resource control depends on options selected when launching daemons
• Difficult to mix versions
• Apache Mesos
• Coarse grained run mode, launch executors as Mesos tasks
• Can use dynamic allocation to launch executors on demand
• Apache Hadoop YARN
• Best choice if your cluster is already running YARN
Spark on YARN
• Client mode versus cluster mode
• Client mode == Spark Driver on local server
• Cluster mode == Spark Driver in YARN AM
• Spark executors run in YARN containers (one JVM per executor)
• spark.executor.instances
• Each executor core uses one YARN vCore
• spark.executor.cores
Job submission
• Most documentation covers spark-submit
• OK for testing, but not great for production
• We use the Spark submitter APIs
• Built an easier-to-use wrapper API
• Hide some of the details of configuration
• Some configuration parameters aren't respected when using the submitter API
• spark.executor.cores, spark.executor.memory
• spark.driver.cores, spark.driver.memory
Job monitoring
• Streaming applications are always on
• Need to monitor the job for failures
• Restart the job on recoverable failures
• Notify an admin on fatal failures (e.g. misconfiguration)
• Validate as much up front as possible
• Our application runs rules through a type checker and query planner before
saving
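The restart-or-notify policy above can be sketched as a small supervision loop. This is a minimal illustration, not the wrapper API described in the talk: the class StreamingJobSupervisor, the Outcome classification, and the maxRestarts cap are all hypothetical names and assumptions, and the job itself is simulated by a Supplier.

```java
import java.util.function.Supplier;

/** Hypothetical supervisor: restart on recoverable failures, notify on fatal ones. */
final class StreamingJobSupervisor {
    /** How one run of the streaming job ended (assumed classification). */
    enum Outcome { STOPPED, RECOVERABLE_FAILURE, FATAL_FAILURE }

    /**
     * Runs the job repeatedly until it stops cleanly, fails fatally,
     * or exhausts the restart budget. Returns the number of restarts performed.
     */
    static int supervise(Supplier<Outcome> runJob, Runnable notifyAdmin, int maxRestarts) {
        int restarts = 0;
        while (true) {
            Outcome outcome = runJob.get();      // blocks until this run ends
            if (outcome == Outcome.STOPPED) {
                return restarts;                 // clean shutdown, nothing to report
            }
            if (outcome == Outcome.FATAL_FAILURE || restarts >= maxRestarts) {
                notifyAdmin.run();               // e.g. misconfiguration: a human must step in
                return restarts;
            }
            restarts++;                          // recoverable failure: run the job again
        }
    }

    public static void main(String[] args) {
        // Simulated job: two recoverable failures, then a fatal one.
        Outcome[] script = {
            Outcome.RECOVERABLE_FAILURE, Outcome.RECOVERABLE_FAILURE, Outcome.FATAL_FAILURE
        };
        int[] next = {0};
        int restarts = supervise(
            () -> script[next[0]++],
            () -> System.out.println("notifying admin"),
            5);
        System.out.println("restarts = " + restarts);
    }
}
```

In a real deployment the Supplier would wrap whatever submission API you use and map its final job state onto the three outcomes.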
Instrumentation, Metrics, and Monitoring
Instrumentation
You can't fix what
you don't measure
Instrumentation APIs
• Spark supports Dropwizard (née CodaHale) metrics
• Collect both application and framework metrics
• Supports most popular metric types
• Counters
• Gauges
• Histograms
• Timers
• etc.
• Use your own APIs
• Best option if you already have your own metric collection infrastructure
Custom metrics
• Implement the org.apache.spark.metrics.source.Source interface
• Register your source with sparkEnv.metricsSystem().registerSource()
• If you're measuring something during execution, you need to register the metric
on the executors
• Register executor metrics in a static block
• You can't register a metrics source until the SparkEnv has been initialized
SparkEnv sparkEnv = SparkEnv.get();
if (sparkEnv != null) {
    // SparkEnv is initialized, so it is safe to create and register the source here
}
Metrics collection
• Configure $SPARK_HOME/conf/metrics.properties
• Built-in sinks
• ConsoleSink
• CsvSink
• JmxSink
• MetricsServlet
• GraphiteSink
• Slf4jSink
• GangliaSink
• Or build your own
Build your own
• Implement the org.apache.spark.metrics.sink.Sink interface
• We built a KafkaEventSink that sends the metrics to a Kafka topic
formatted as Osso* events
• Our system has a metrics collector
• Aggregates metrics in a Parquet table
• Query and visualize metrics using SQL
• *http://www.osso-project.org
Report and visualize
Gotcha
• Due to the order of metrics subsystem initialization, your collection plugin
must be on the system classpath, not the application classpath
• https://issues.apache.org/jira/browse/SPARK-18115
• Options:
• Deploy library on cluster nodes (e.g. add to HADOOP_CLASSPATH)
• Build a custom Spark assembly jar
Custom Spark assembly
• Maven shade plugin
• Merge upstream Spark assembly JAR with your library and dependencies
• Shade/rename library packages
• Might break configuration parameters as well 
• *.sink.kafka.com_rocana_assembly_shaded_kafka_brokers
• Mark any dependencies already in the assembly as provided
• Ask me about our akka.version fiasco
Configuration and Tuning
Architecture
Predicting CPU/task resources
• Each output operation creates a separate batch job when processing a
micro-batch
• number of jobs = number of output ops
• Each data shuffle/re-partitioning creates a separate stage
• number of stages per job = number of shuffles + 1
• Each partition in a stage creates a separate task
• number of tasks per job = number of stages * number of partitions
Resources for alerting
• Each rule has a single output operation (write to Kafka)
• Each rule has 3 stages
1. Read from Kafka, project, filter and group data for aggregation
2. Aggregate values, filter (conditions) and group data for triggers
3. Aggregate trigger results and send trigger events to Kafka
• First stage partitions = number of Kafka partitions
• Stage 2 and 3 use spark.default.parallelism partitions
Example
• 100 rules, Kafka partitions = 50, spark.default.parallelism = 50
• number of jobs = 100
• number of stages per job = 3
• number of tasks per job = 3 * 50 = 150
• total number of tasks = 100 * 150 = 15,000
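The arithmetic in this example can be written down as a few helper methods. This is a rough sketch under one simplifying assumption: every stage of a job has the same number of partitions, which holds here because Kafka partitions and spark.default.parallelism are both 50. The class and method names are illustrative only.

```java
/** Back-of-the-envelope task count for one micro-batch (illustrative names). */
final class MicroBatchTaskModel {
    /** number of stages per job = number of shuffles + 1 */
    static int stagesPerJob(int shufflesPerJob) {
        return shufflesPerJob + 1;
    }

    /** number of tasks per job = number of stages * number of partitions */
    static long tasksPerJob(int stagesPerJob, int partitionsPerStage) {
        return (long) stagesPerJob * partitionsPerStage;
    }

    /** total tasks = number of jobs (one per output operation) * tasks per job */
    static long totalTasks(int outputOps, int stagesPerJob, int partitionsPerStage) {
        return outputOps * tasksPerJob(stagesPerJob, partitionsPerStage);
    }

    public static void main(String[] args) {
        int rules = 100;               // one output operation, hence one job, per rule
        int stages = stagesPerJob(2);  // 3 stages per rule means 2 shuffles
        int partitions = 50;           // Kafka partitions == spark.default.parallelism
        System.out.println("tasks per job = " + tasksPerJob(stages, partitions));
        System.out.println("total tasks   = " + totalTasks(rules, stages, partitions));
    }
}
```

If the first stage and the later stages use different partition counts, sum the per-stage partition counts instead of multiplying.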
Task slots
• number of task slots = spark.executor.instances * spark.executor.cores
• Example
• 50 instances * 8 cores = 400 task slots
Waves
• The jobs processing the micro-batches will run in waves based on
available task slots
• Number of waves = total number of tasks / number of task slots
• Example
• Number of waves = 15,000 / 400 = 37.5, rounded up to 38 waves
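A small sketch of the slot and wave calculation, assuming a partially filled wave still counts as a full wave (so the division rounds up). The class name WaveEstimate and its methods are made up for illustration.

```java
/** Estimates task slots and scheduling waves for a micro-batch (illustrative names). */
final class WaveEstimate {
    /** number of task slots = spark.executor.instances * spark.executor.cores */
    static int taskSlots(int executorInstances, int executorCores) {
        return executorInstances * executorCores;
    }

    /** number of waves = total tasks / task slots, rounded up to a whole wave */
    static long waves(long totalTasks, int taskSlots) {
        if (taskSlots <= 0) {
            throw new IllegalArgumentException("taskSlots must be positive");
        }
        return (totalTasks + taskSlots - 1) / taskSlots; // integer ceiling division
    }

    public static void main(String[] args) {
        int slots = taskSlots(50, 8); // 50 executors with 8 cores each
        System.out.println("task slots = " + slots);
        System.out.println("waves      = " + waves(15_000, slots));
    }
}
```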
Max time per wave
• maximum time per wave = micro-batch duration / number of waves
• Example:
• 15 second micro-batch duration
• maximum time per wave = 15,000 ms / 38 waves ≈ 394 ms per wave
• If the average task time > 394 ms, then Spark Streaming will fall behind
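The same budget check as a sketch in code. It assumes every task in a wave takes roughly the average task time, which is a simplification; the names WaveBudget, maxMillisPerWave, and fallsBehind are hypothetical.

```java
/** Per-wave time budget for a micro-batch (illustrative names). */
final class WaveBudget {
    /** maximum time per wave = micro-batch duration / number of waves */
    static double maxMillisPerWave(long batchDurationMs, long waves) {
        if (waves <= 0) {
            throw new IllegalArgumentException("waves must be positive");
        }
        return (double) batchDurationMs / waves;
    }

    /** True when the average task time exceeds the per-wave budget. */
    static boolean fallsBehind(double avgTaskMs, long batchDurationMs, long waves) {
        return avgTaskMs > maxMillisPerWave(batchDurationMs, waves);
    }

    public static void main(String[] args) {
        long batchMs = 15_000; // 15 second micro-batch
        long waves = 38;
        System.out.println("budget per wave (ms) = " + maxMillisPerWave(batchMs, waves));
        System.out.println("400 ms tasks fall behind: " + fallsBehind(400, batchMs, waves));
        System.out.println("300 ms tasks fall behind: " + fallsBehind(300, batchMs, waves));
    }
}
```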
Monitoring batch processing time
Delay scheduling
• A technique of delaying scheduling of tasks to get better data locality
• Works great for long running batch tasks
• Not ideal for low-latency stream processing tasks
• Tip
• Set spark.locality.wait = 0ms
• Results
• Running a job with 800 tasks on a very small (2 task slot) cluster with a 300-event micro-batch
• With default setting: 402 seconds
• With 0ms setting: 26 seconds (15.5x faster)
Model memory requirements
• Persistent memory used by stateful operators
• reduceByWindow, reduceByKeyAndWindow
• countByWindow, countByValueAndWindow
• mapWithState, updateStateByKey
• Model retention time
• Built-in time-based retention (e.g. reduceByWindow)
• Explicit state management (e.g. org.apache.spark.streaming.State#remove())
Example
• Use reduceByKeyAndWindow to sum integers with a 30 second window
and 10 second slide over 10,000 keys
• active windows = window length / window slide
• 30s / 10s = 3
• estimated memory = active windows * num keys * (state size + key size)
• 3 * 10,000 * (16 bytes + 80 bytes) = 2,880,000 bytes ≈ 2.75 MB
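The estimate can be sketched as code. Two assumptions go beyond the slide: the window count is rounded up when the window length is not a multiple of the slide, and the per-entry sizes are taken as given (real JVM and Spark overhead is not modeled). The class name WindowStateEstimate is illustrative.

```java
/** Rough memory estimate for windowed state (illustrative names). */
final class WindowStateEstimate {
    /** active windows = window length / window slide, rounded up (assumption) */
    static long activeWindows(long windowLengthSec, long slideSec) {
        if (slideSec <= 0) {
            throw new IllegalArgumentException("slideSec must be positive");
        }
        return (windowLengthSec + slideSec - 1) / slideSec;
    }

    /** estimated memory = active windows * num keys * (state size + key size) */
    static long estimatedBytes(long activeWindows, long numKeys, long stateBytes, long keyBytes) {
        return activeWindows * numKeys * (stateBytes + keyBytes);
    }

    public static void main(String[] args) {
        long windows = activeWindows(30, 10);                  // 30 s window, 10 s slide
        long bytes = estimatedBytes(windows, 10_000, 16, 80);  // 16-byte state, 80-byte key
        System.out.println("active windows = " + windows);
        System.out.println("estimated bytes = " + bytes);
        System.out.println(String.format("estimated MB = %.2f", bytes / (1024.0 * 1024.0)));
    }
}
```

Treat the result as a lower bound and compare it against the memory you actually observe on the executors.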
Monitor Memory
Putting it all together
• Pick your packaging and deployment model based on operational needs,
not developer convenience
• Use Spark submitter APIs whenever possible
• Measure and report operational metrics
• Focus configuration and tuning on the expected behavior of your
application
• Model, configure, monitor
Questions?
@fwiffo | batman@rocana.com
Spark Summit
 
Intro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataIntro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big Data
Apache Apex
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
DataWorks Summit
 
Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
InMobi Technology
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
DevOps Supercharged with Docker on Exadata
DevOps Supercharged with Docker on ExadataDevOps Supercharged with Docker on Exadata
DevOps Supercharged with Docker on Exadata
MarketingArrowECS_CZ
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Ad

More from Joey Echeverria (10)

Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
Joey Echeverria
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
Joey Echeverria
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
Joey Echeverria
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
Joey Echeverria
 
Big data security
Big data securityBig data security
Big data security
Joey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
Joey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
Joey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
Joey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
Joey Echeverria
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
Joey Echeverria
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
Joey Echeverria
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
Joey Echeverria
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
Joey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
Joey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
Joey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
Joey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
Joey Echeverria
 

Recently uploaded (20)

IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 

Building production spark streaming applications

  • 1. © Rocana, Inc. All Rights Reserved. | 1 Joey Echeverria, Platform Technical Lead - @fwiffo Data Day Texas 2017 Building Production Spark Streaming Applications
  • 2. Joey • Where I work: Rocana – Platform Technical Lead • Where I used to work: Cloudera (’11-’15), NSA • Distributed systems, security, data processing, big data
  • 4. Context • We built a system for large scale realtime collection, processing, and analysis of event-oriented machine data • On prem or in the cloud, but not SaaS • Supportability is a big deal for us • Predictability of performance under load and failures • Ease of configuration and operation • Behavior in wacky environments
  • 5. Apache Spark Streaming
  • 6. Spark streaming overview • Stream processing API built on top of the Spark execution engine • Micro-batching • Every n milliseconds, fetch records from the data source • Execute Spark jobs on each input batch • DStream API • Wrapper around the RDD API • Lets the developer think in terms of transformations on a stream of events
  • 7. Input Batch -> Spark Batch Engine -> Output Batch
  • 8. Structured streaming • New streaming API for Spark • Re-use DataFrames API for streaming • API was too new when we started • First release was an alpha • No Kafka support at the time • Details won't apply, but the overall approach should be in the ballpark
  • 9. Other notes • Our experience is with Spark 1.6.2 • 2.0.0 was released after we started our Spark integration • We use the Apache release of Spark • Supports both CDH and HDP without recompiling • We run Spark on YARN, so we're decoupled from other users on the cluster
  • 10. Use Case: Real-time alerting on IT operational data
  • 11. Our typical customer use cases • >100K events / sec (8.6B events / day), sub-second end to end latency, full fidelity retention, critical use cases • Quality of service - “are credit card transactions happening fast enough?” • Fraud detection - “detect, investigate, prosecute, and learn from fraud.” • Forensic diagnostics - “what really caused the outage last Friday?” • Security - “who’s doing what, where, when, why, and how, and is that ok?” • User behavior - “capture and correlate user behavior with system performance, then feed it to downstream systems in realtime.”
  • 12. Overall architecture • weirdo formats • transformation 1: weirdo format -> event • Avro events • transformation 2: event -> storage-specific • storage-specific representation of events
  • 13. Real-time alerting • Define aggregations, conditions, and actions • Use cases: • Send me an e-mail when the number of failed login events from a user is > 3 within an hour • Create a ServiceNow ticket when CPU utilization spikes to > 95% for 10 minutes
  • 14. UI
  • 15. Architecture
  • 16. Packaging, Deployment, and Execution
  • 17. Packaging • Application classes and dependencies • Two options • Shade all dependencies into an uber jar • Make sure Hadoop and Spark dependencies are marked provided • Submit application jars and dependent jars when submitting
  • 18. Deployment modes • Standalone • Manually start up head and worker services • Resource control depends on options selected when launching daemons • Difficult to mix versions • Apache Mesos • Coarse grained run mode, launch executors as Mesos tasks • Can use dynamic allocation to launch executors on demand • Apache Hadoop YARN • Best choice if your cluster is already running YARN
  • 19. Spark on YARN • Client mode versus cluster mode • Client mode == Spark Driver on local server • Cluster mode == Spark Driver in YARN AM • Spark executors run in YARN containers (one JVM per executor) • spark.executor.instances • Each executor core uses one YARN vCore • spark.executor.cores
  • 20. Job submission • Most documentation covers spark-submit • OK for testing, but not great for production • We use Spark submitter APIs • Built an easier-to-use wrapper API • Hide some of the details of configuration • Some configuration parameters aren't respected when using submitter API • spark.executor.cores, spark.executor.memory • spark.driver.cores, spark.driver.memory
  • 21. Job monitoring • Streaming applications are always on • Need to monitor the job for failures • Restart the job on recoverable failures • Notify an admin on fatal failures (e.g. misconfiguration) • Validate as much up front as possible • Our application runs rules through a type checker and query planner before saving
  • 22. Instrumentation, Metrics, and Monitoring
  • 23. Instrumentation: You can't fix what you don't measure
  • 24. Instrumentation APIs • Spark supports Dropwizard (née CodaHale) metrics • Collect both application and framework metrics • Supports most popular metric types • Counters • Gauges • Histograms • Timers • etc. • Use your own APIs • Best option if you have existing metric collection infrastructure
  • 25. Custom metrics • Implement the org.apache.spark.metrics.source.Source interface • Register your source with sparkEnv.metricsSystem().registerSource() • If you're measuring something during execution, you need to register the metric on the executors • Register executor metrics in a static block • You can't register a metrics source until the SparkEnv has been initialized: SparkEnv sparkEnv = SparkEnv.get(); if (sparkEnv != null) { // create and register source }
  • 26. Metrics collection • Configure $SPARK_HOME/conf/metrics.properties • Built-in sinks • ConsoleSink • CsvSink • JmxSink • MetricsServlet • GraphiteSink • Slf4jSink • GangliaSink • Or build your own
  • 27. Build your own • Implement the org.apache.spark.metrics.sink.Sink interface • We built a KafkaEventSink that sends the metrics to a Kafka topic formatted as Osso* events • Our system has a metrics collector • Aggregates metrics in a Parquet table • Query and visualize metrics using SQL • *http://www.osso-project.org
  • 28. Report and visualize
  • 29. Gotcha • Due to the order of metrics subsystem initialization, your collection plugin must be on the system classpath, not application classpath • https://issues.apache.org/jira/browse/SPARK-18115 • Options: • Deploy library on cluster nodes (e.g. add to HADOOP_CLASSPATH) • Build a custom Spark assembly jar
  • 30. Custom Spark assembly • Maven shade plugin • Merge upstream Spark assembly JAR with your library and dependencies • Shade/rename library packages • Might break configuration parameters as well • *.sink.kafka.com_rocana_assembly_shaded_kafka_brokers • Mark any dependencies already in the assembly as provided • Ask me about our akka.version fiasco
  • 31. Configuration and Tuning
  • 32. Architecture
  • 33. Predicting CPU/task resources • Each output operation creates a separate batch job when processing a micro-batch • number of jobs = number of output ops • Each data shuffle/re-partitioning creates a separate stage • number of stages per job = number of shuffles + 1 • Each partition in a stage creates a separate task • number of tasks per job = number of stages * number of partitions
  • 34. Resources for alerting • Each rule has a single output operation (write to Kafka) • Each rule has 3 stages: 1. Read from Kafka, project, filter and group data for aggregation 2. Aggregate values, filter (conditions) and group data for triggers 3. Aggregate trigger results and send trigger events to Kafka • First stage partitions = number of Kafka partitions • Stage 2 and 3 use spark.default.parallelism partitions
  • 35. Example • 100 rules, Kafka partitions = 50, spark.default.parallelism = 50 • number of jobs = 100 • number of stages per job = 3 • number of tasks per job = 3 * 50 = 150 • total number of tasks = 100 * 150 = 15,000
  • 36. Task slots • number of task slots = spark.executor.instances * spark.executor.cores • Example • 50 instances * 8 cores = 400 task slots
  • 37. Waves • The jobs processing the micro-batches will run in waves based on available task slots • Number of waves = total number of tasks / number of task slots, rounded up • Example • Number of waves = 15,000 / 400 = 37.5, rounded up to 38 waves
  • 38. Max time per wave • maximum time per wave = micro-batch duration / number of waves • Example: • 15 second micro-batch duration • maximum time per wave = 15,000 ms / 38 waves = 394 ms per wave • If the average task time > 394 ms, then Spark streaming will fall behind
  • 39. Monitoring batch processing time
  • 40. Delay scheduling • A technique of delaying scheduling of tasks to get better data locality • Works great for long running batch tasks • Not ideal for low-latency stream processing tasks • Tip • Set spark.locality.wait = 0ms • Results • Running job with 800 tasks on a very small (2 task slot) cluster, 300 event micro-batch • With default setting: 402 seconds • With 0ms setting: 26 seconds (15.5x faster)
  • 41. Model memory requirements • Persistent memory used by stateful operators • reduceByWindow, reduceByKeyAndWindow • countByWindow, countByValueAndWindow • mapWithState, updateStateByKey • Model retention time • Built-in time-based retention (e.g. reduceByWindow) • Explicit state management (e.g. org.apache.spark.streaming.State#remove())
  • 42. Example • Use reduceByKeyAndWindow to sum integers with a 30 second window and 10 second slide over 10,000 keys • active windows = window length / window slide • 30s / 10s = 3 • estimated memory = active windows * num keys * (state size + key size) • 3 * 10,000 * (16 bytes + 80 bytes) = 2,880,000 bytes ≈ 2.75 MB
  • 43. Monitor Memory
  • 44. Putting it all together • Pick your packaging and deployment model based on operational needs, not developer convenience • Use Spark submitter APIs whenever possible • Measure and report operational metrics • Focus configuration and tuning on the expected behavior of your application • Model, configure, monitor
  • 45. Questions? @fwiffo | [email protected]
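
The capacity arithmetic on slides 33 to 38 (jobs, stages, tasks, task slots, waves, and time per wave) can be sketched as a small calculator. This is only an illustration, not code from the talk: the class and method names are hypothetical, and it assumes every stage of a job has the same number of partitions, as in the slide 35 example where the Kafka partition count equals spark.default.parallelism.

```java
// Hypothetical helper: names are illustrative, not part of Spark or the talk.
final class StreamingCapacityModel {

    /** Tasks per micro-batch = jobs * (shuffles + 1) stages * partitions per stage. */
    static long totalTasks(int outputOps, int shufflesPerJob, int partitionsPerStage) {
        int stagesPerJob = shufflesPerJob + 1;
        return (long) outputOps * stagesPerJob * partitionsPerStage;
    }

    /** Task slots = spark.executor.instances * spark.executor.cores. */
    static int taskSlots(int executorInstances, int executorCores) {
        return executorInstances * executorCores;
    }

    /** Waves needed to run all tasks, rounded up to a whole wave. */
    static long waves(long totalTasks, int taskSlots) {
        return (totalTasks + taskSlots - 1) / taskSlots;
    }

    /** Time budget per wave in milliseconds (integer division, rounds down). */
    static long maxMillisPerWave(long batchMillis, long waves) {
        return batchMillis / waves;
    }

    public static void main(String[] args) {
        // Slide 35 example: 100 rules, 3 stages (2 shuffles), 50 partitions.
        long tasks = totalTasks(100, 2, 50);
        int slots = taskSlots(50, 8);
        long waveCount = waves(tasks, slots);
        long budgetMs = maxMillisPerWave(15_000, waveCount);
        System.out.printf("tasks=%d slots=%d waves=%d budgetMs=%d%n",
                tasks, slots, waveCount, budgetMs);
        // prints: tasks=15000 slots=400 waves=38 budgetMs=394
    }
}
```

If the measured average task time is above the per-wave budget, the model suggests adding task slots, reducing partitions, or lengthening the micro-batch duration.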

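The memory estimate on slide 42 can be written out the same way. Again a rough sketch with hypothetical names: it assumes the per-key state size and key size are known constants (16 and 80 bytes in the slide's example) and ignores JVM object overhead and Spark's own bookkeeping, so real usage will likely be higher.

```java
// Hypothetical helper: names are illustrative, not part of Spark or the talk.
final class WindowStateEstimate {

    /** Windows alive at once = window length / slide interval, rounded up. */
    static long activeWindows(long windowSeconds, long slideSeconds) {
        return (windowSeconds + slideSeconds - 1) / slideSeconds;
    }

    /** Estimated bytes = active windows * keys * (state size + key size). */
    static long estimatedBytes(long activeWindows, long numKeys,
                               long stateBytes, long keyBytes) {
        return activeWindows * numKeys * (stateBytes + keyBytes);
    }

    public static void main(String[] args) {
        // Slide 42 example: 30 s window, 10 s slide, 10,000 keys.
        long windows = activeWindows(30, 10);
        long bytes = estimatedBytes(windows, 10_000, 16, 80);
        System.out.printf("windows=%d bytes=%d (%.2f MiB)%n",
                windows, bytes, bytes / (1024.0 * 1024.0));
        // prints: windows=3 bytes=2880000 (2.75 MiB)
    }
}
```
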
Editor's Notes

  • #5: YMMV; not necessarily true for you. Enterprise software: shipping stuff to people. Fine-grained events: logs, user behavior, etc. For everything: solving the problem of “enterprise wide” ops, so it’s everything from everywhere from everyone for all time (until they run out of money for nodes). This isn’t a condemnation of general purpose search engines as much as what we had to do for our domain.