SlideShare a Scribd company logo
Introduction to Streaming in
Apache Spark
Based on Apache Spark 1.6.0
Akash Sethi
Software Consultant
Knoldus Software LLP.
Agenda

What is Streaming

Abstraction Provided For Streaming

Execution Process

Transformation

Type of Transformation

Action

Performance Tuning options
High Level architecture of Spark Streaming
Streaming in Apache Spark
Provide way to consume continues stream of data.
Build on top of Spark Core
It supports Java, Scala and Python.
API is similar to Spark Core.
DStream as a continues series of Data
Streaming in Apache Spark
Spark Streaming uses a “micro-batch” architecture. New batches
are created at regular time intervals. At the beginning of each time
interval a new batch is created, and any data that arrives during
that interval gets added to that batch. At the end of the time
interval the batch is done growing. The size of the time intervals is
determined by a parameter called the batch interval.
Streaming in Apache Spark
Spark Streaming provides an abstraction called DStreams,
or discretized streams. A DStream is a sequence of data
arriving over time. Internally, each DStream is
represented as a sequence of RDDs arriving at each time
step. Here RDDs are created on the basis of Time.
Each input batch forms an RDD, and is processed using
Spark jobs to create other RDDs. The processed results can
then be pushed out to external systems in batches.
We can also specify block size in milliseconds.
By default, received data is replicated across
two nodes, so Spark Streaming can tolerate
single worker failures. Using just lineage,
however, re computation could take a long
time for data that has been built up since the
beginning of the program. Thus, Spark
Streaming also includes a mechanism called
checkpointing that saves state periodically to
a reliable file system (e.g., HDFS or S3).
Typically, you might set up checkpointing
every 5–10 batches of data. When recovering
lost data, Spark Streaming needs only to go
Streaming in Apache Spark
Execution of Spark Streaming within Spark’s Components
Streaming in Apache Spark
Transformation
Transformations apply some operation on
current DStream and generate a new
DStream.
Transformations on DStreams can be grouped into
either stateless or stateful:

In stateless transformations the processing of each batch
does not depend on the data of its previous batches.

Stateful transformations, in contrast, use data or intermediate
results from previous batches to compute the results of the
current batch. They include transformations based on sliding
windows and on tracking state across time.
Stateless Transformation Code
Transformation on DStream
Stateless Transformation
Stateful Transformations
Stateful transformations are operations on DStreams that
track data across time; that is, some data from previous
batches is used to generate the results for a new batch.
The two main types of Stateful Transformation
are:

Windowed Operations

UpdateStateByKey
Stateful Transformation

Windowed Transformations
Windowed operations compute results across a longer time
period than the StreamingContext’s batch interval, by
combining results from multiple batches
All windowed operations need two parameters, window
duration and sliding duration, both of which must be a
multiple of the StreamingContext’s batch interval. The
window duration controls how many previous batches of
data are considered
Stateful Transformation
If we had a source DStream with a
batch interval of 10 seconds and
wanted to create a sliding window
of the last 30 seconds(or last 3
batches) we would set the
windowDuration to 30 seconds.
The sliding duration, which defaults
to the batch interval, controls how
frequently the new DStream
computes results. If we had the
source DStream with a batch
interval of 10 seconds and wanted
to compute our window only on
every second batch, we would set
our sliding interval to 20 seconds.
Stateful Transformations

UpdateStateByKey Transformations
it’s useful to maintain state across the batches in a DStream .
updateStateByKey() enables this by providing access to a state
variable for DStreams of key/value pairs. Given a DStream of (key,
event) pairs, it lets you construct a new DStream of (key, state)
pairs by taking a function that specifies how to update the state for
each key given new events. For example, in a web server log, our
events might be visits to the site, where the key is the user ID.
Using updateStateByKey(), we could track the last pages each user
visited. This list would be our “state” object, and we’d update it as
each event arrives.
Action/Output Operations
Output operations specify what needs to be
done with the final transformed data in a
stream.
Output Operations are similar to spark core.
- print the output
- save to text file
- save as Object in File etc.
It contains some extra methods like. forEachRdd()
Performance Considerations
Spark Streaming applications have a few
specialized tuning options.

Batch and Window Sizes
The most common question is what minimum batch size Spark
Streaming can use. In general, 500 milliseconds has proven to
be a good minimum size for many applications. The best
approach is to start with a larger batch size (around 10
seconds) and work your way down to a smaller batch size.

Level of Parallelism
A common way to reduce the processing time of batches is
to increase the parallelism.

Increasing the number of receivers
Receivers can sometimes act as a bottleneck if
there are too many records for a single
machine to read in and distribute. You can add
more receivers by creating multiple input
DStreams (which creates multiple receivers),
and then applying union to merge them into a
single stream.

Explicitly repartitioning received data
If receivers cannot be increased anymore, you
can further redistribute the received data by
Performance Considerations
Demo
https://github.com/knoldus/knolx-spark-streaming
References

Learning Spark LIGHTNING-FAST DATA
ANALYSIS Holden Karau, Andy Konwinski, Patrick
Wendell & Matei Zaharia

http://spark.apache.org/streaming/
Thank You
Ad

More Related Content

What's hot (20)

Process management in os
Process management in osProcess management in os
Process management in os
Miong Lazaro
 
08 Operating System Support
08  Operating  System  Support08  Operating  System  Support
08 Operating System Support
Jeanie Delos Arcos
 
Introduction to linux ppt
Introduction to linux pptIntroduction to linux ppt
Introduction to linux ppt
Omi Vichare
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
Dhanashree Prasad
 
Eucalyptus, Nimbus & OpenNebula
Eucalyptus, Nimbus & OpenNebulaEucalyptus, Nimbus & OpenNebula
Eucalyptus, Nimbus & OpenNebula
Amar Myana
 
Round robin scheduling
Round robin schedulingRound robin scheduling
Round robin scheduling
Raghav S
 
2 operating system structures
2 operating system structures2 operating system structures
2 operating system structures
Dr. Loganathan R
 
operating system lecture notes
operating system lecture notesoperating system lecture notes
operating system lecture notes
AVC College of Engineering
 
Peephole optimization techniques in compiler design
Peephole optimization techniques in compiler designPeephole optimization techniques in compiler design
Peephole optimization techniques in compiler design
Anul Chaudhary
 
Underlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computingUnderlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computing
GOVERNMENT COLLEGE OF ENGINEERING,TIRUNELVELI
 
Multithreading
MultithreadingMultithreading
Multithreading
Dr. A. B. Shinde
 
Multi processing
Multi processingMulti processing
Multi processing
Muhammad Ishaq
 
Openmp
OpenmpOpenmp
Openmp
Amirali Sharifian
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Post Machine
Post MachinePost Machine
Post Machine
Saira Fazal Qader
 
Artificial Intelligence A Modern Approach
Artificial Intelligence A Modern ApproachArtificial Intelligence A Modern Approach
Artificial Intelligence A Modern Approach
Sara Perez
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
Mahesh Kumar Attri
 
Open mp
Open mpOpen mp
Open mp
Gopi Saiteja
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
Riya Choudhary
 
Presentation on flynn’s classification
Presentation on flynn’s classificationPresentation on flynn’s classification
Presentation on flynn’s classification
vani gupta
 

Viewers also liked (20)

Akka streams
Akka streamsAkka streams
Akka streams
Knoldus Inc.
 
Introduction to Scala JS
Introduction to Scala JSIntroduction to Scala JS
Introduction to Scala JS
Knoldus Inc.
 
Drilling the Async Library
Drilling the Async LibraryDrilling the Async Library
Drilling the Async Library
Knoldus Inc.
 
Getting Started With AureliaJs
Getting Started With AureliaJsGetting Started With AureliaJs
Getting Started With AureliaJs
Knoldus Inc.
 
String interpolation
String interpolationString interpolation
String interpolation
Knoldus Inc.
 
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdomMailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
Knoldus Inc.
 
Realm Mobile Database - An Introduction
Realm Mobile Database - An IntroductionRealm Mobile Database - An Introduction
Realm Mobile Database - An Introduction
Knoldus Inc.
 
Kanban
KanbanKanban
Kanban
Knoldus Inc.
 
Shapeless- Generic programming for Scala
Shapeless- Generic programming for ScalaShapeless- Generic programming for Scala
Shapeless- Generic programming for Scala
Knoldus Inc.
 
An Introduction to Quill
An Introduction to QuillAn Introduction to Quill
An Introduction to Quill
Knoldus Inc.
 
Introduction to Scala Macros
Introduction to Scala MacrosIntroduction to Scala Macros
Introduction to Scala Macros
Knoldus Inc.
 
Introduction to Java 8
Introduction to Java 8Introduction to Java 8
Introduction to Java 8
Knoldus Inc.
 
Mandrill Templates
Mandrill TemplatesMandrill Templates
Mandrill Templates
Knoldus Inc.
 
Introduction to Knockout Js
Introduction to Knockout JsIntroduction to Knockout Js
Introduction to Knockout Js
Knoldus Inc.
 
Functors, Applicatives and Monads In Scala
Functors, Applicatives and Monads In ScalaFunctors, Applicatives and Monads In Scala
Functors, Applicatives and Monads In Scala
Knoldus Inc.
 
Introduction to ScalaZ
Introduction to ScalaZIntroduction to ScalaZ
Introduction to ScalaZ
Knoldus Inc.
 
Effective way to code in Scala
Effective way to code in ScalaEffective way to code in Scala
Effective way to code in Scala
Knoldus Inc.
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testing
Knoldus Inc.
 
Functional programming in Javascript
Functional programming in JavascriptFunctional programming in Javascript
Functional programming in Javascript
Knoldus Inc.
 
HTML5, CSS, JavaScript Style guide and coding conventions
HTML5, CSS, JavaScript Style guide and coding conventionsHTML5, CSS, JavaScript Style guide and coding conventions
HTML5, CSS, JavaScript Style guide and coding conventions
Knoldus Inc.
 
Introduction to Scala JS
Introduction to Scala JSIntroduction to Scala JS
Introduction to Scala JS
Knoldus Inc.
 
Drilling the Async Library
Drilling the Async LibraryDrilling the Async Library
Drilling the Async Library
Knoldus Inc.
 
Getting Started With AureliaJs
Getting Started With AureliaJsGetting Started With AureliaJs
Getting Started With AureliaJs
Knoldus Inc.
 
String interpolation
String interpolationString interpolation
String interpolation
Knoldus Inc.
 
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdomMailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
Knoldus Inc.
 
Realm Mobile Database - An Introduction
Realm Mobile Database - An IntroductionRealm Mobile Database - An Introduction
Realm Mobile Database - An Introduction
Knoldus Inc.
 
Shapeless- Generic programming for Scala
Shapeless- Generic programming for ScalaShapeless- Generic programming for Scala
Shapeless- Generic programming for Scala
Knoldus Inc.
 
An Introduction to Quill
An Introduction to QuillAn Introduction to Quill
An Introduction to Quill
Knoldus Inc.
 
Introduction to Scala Macros
Introduction to Scala MacrosIntroduction to Scala Macros
Introduction to Scala Macros
Knoldus Inc.
 
Introduction to Java 8
Introduction to Java 8Introduction to Java 8
Introduction to Java 8
Knoldus Inc.
 
Mandrill Templates
Mandrill TemplatesMandrill Templates
Mandrill Templates
Knoldus Inc.
 
Introduction to Knockout Js
Introduction to Knockout JsIntroduction to Knockout Js
Introduction to Knockout Js
Knoldus Inc.
 
Functors, Applicatives and Monads In Scala
Functors, Applicatives and Monads In ScalaFunctors, Applicatives and Monads In Scala
Functors, Applicatives and Monads In Scala
Knoldus Inc.
 
Introduction to ScalaZ
Introduction to ScalaZIntroduction to ScalaZ
Introduction to ScalaZ
Knoldus Inc.
 
Effective way to code in Scala
Effective way to code in ScalaEffective way to code in Scala
Effective way to code in Scala
Knoldus Inc.
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testing
Knoldus Inc.
 
Functional programming in Javascript
Functional programming in JavascriptFunctional programming in Javascript
Functional programming in Javascript
Knoldus Inc.
 
HTML5, CSS, JavaScript Style guide and coding conventions
HTML5, CSS, JavaScript Style guide and coding conventionsHTML5, CSS, JavaScript Style guide and coding conventions
HTML5, CSS, JavaScript Style guide and coding conventions
Knoldus Inc.
 
Ad

Similar to Introduction to Spark Streaming (20)

Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
phanleson
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Tathagata Das
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
Palani Kumar
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
Prakash Chockalingam
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
Ryousei Takano
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
rveiga100
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
Venkateswaran Kandasamy
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
phanleson
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Tathagata Das
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
Palani Kumar
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
Ryousei Takano
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
rveiga100
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Ad

More from Knoldus Inc. (20)

Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)
Knoldus Inc.
 
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-HealingOptimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Knoldus Inc.
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)
Knoldus Inc.
 
Java 17 features and implementation.pptx
Java 17 features and implementation.pptxJava 17 features and implementation.pptx
Java 17 features and implementation.pptx
Knoldus Inc.
 
Chaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in KubernetesChaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in Kubernetes
Knoldus Inc.
 
GraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM PresentationGraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM Presentation
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
DAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime PresentationDAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime Presentation
Knoldus Inc.
 
Introduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN PresentationIntroduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN Presentation
Knoldus Inc.
 
Introduction to Argo Rollouts Presentation
Introduction to Argo Rollouts PresentationIntroduction to Argo Rollouts Presentation
Introduction to Argo Rollouts Presentation
Knoldus Inc.
 
Intro to Azure Container App Presentation
Intro to Azure Container App PresentationIntro to Azure Container App Presentation
Intro to Azure Container App Presentation
Knoldus Inc.
 
Insights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability ExcellenceInsights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability Excellence
Knoldus Inc.
 
Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)
Knoldus Inc.
 
Code Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis FrameworkCode Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis Framework
Knoldus Inc.
 
AWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS PresentationAWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS Presentation
Knoldus Inc.
 
Amazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and AuthorizationAmazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and Authorization
Knoldus Inc.
 
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web DevelopmentZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
Knoldus Inc.
 
Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.
Knoldus Inc.
 
Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)
Knoldus Inc.
 
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-HealingOptimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Knoldus Inc.
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)
Knoldus Inc.
 
Java 17 features and implementation.pptx
Java 17 features and implementation.pptxJava 17 features and implementation.pptx
Java 17 features and implementation.pptx
Knoldus Inc.
 
Chaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in KubernetesChaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in Kubernetes
Knoldus Inc.
 
GraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM PresentationGraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM Presentation
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
DAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime PresentationDAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime Presentation
Knoldus Inc.
 
Introduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN PresentationIntroduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN Presentation
Knoldus Inc.
 
Introduction to Argo Rollouts Presentation
Introduction to Argo Rollouts PresentationIntroduction to Argo Rollouts Presentation
Introduction to Argo Rollouts Presentation
Knoldus Inc.
 
Intro to Azure Container App Presentation
Intro to Azure Container App PresentationIntro to Azure Container App Presentation
Intro to Azure Container App Presentation
Knoldus Inc.
 
Insights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability ExcellenceInsights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability Excellence
Knoldus Inc.
 
Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)
Knoldus Inc.
 
Code Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis FrameworkCode Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis Framework
Knoldus Inc.
 
AWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS PresentationAWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS Presentation
Knoldus Inc.
 
Amazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and AuthorizationAmazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and Authorization
Knoldus Inc.
 
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web DevelopmentZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
Knoldus Inc.
 
Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.
Knoldus Inc.
 

Recently uploaded (20)

Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 

Introduction to Spark Streaming

  • 1. Introduction to Streaming in Apache Spark Based on Apache Spark 1.6.0 Akash Sethi Software Consultant Knoldus Software LLP.
  • 2. Agenda  What is Streaming  Abstraction Provided For Streaming  Execution Process  Transformation  Type of Transformation  Action  Performance Tuning options
  • 3. High Level architecture of Spark Streaming Streaming in Apache Spark Provide way to consume continues stream of data. Build on top of Spark Core It supports Java, Scala and Python. API is similar to Spark Core.
  • 4. DStream as a continues series of Data Streaming in Apache Spark Spark Streaming uses a “micro-batch” architecture. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval.
  • 5. Streaming in Apache Spark Spark Streaming provides an abstraction called DStreams, or discretized streams. A DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step. Here RDDs are created on the basis of Time. Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches. We can also specify block size in milliseconds.
  • 6. By default, received data is replicated across two nodes, so Spark Streaming can tolerate single worker failures. Using just lineage, however, re computation could take a long time for data that has been built up since the beginning of the program. Thus, Spark Streaming also includes a mechanism called checkpointing that saves state periodically to a reliable file system (e.g., HDFS or S3). Typically, you might set up checkpointing every 5–10 batches of data. When recovering lost data, Spark Streaming needs only to go Streaming in Apache Spark
  • 7. Execution of Spark Streaming within Spark’s Components Streaming in Apache Spark
  • 8. Transformation Transformations apply some operation on current DStream and generate a new DStream. Transformations on DStreams can be grouped into either stateless or stateful:  In stateless transformations the processing of each batch does not depend on the data of its previous batches.  Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.
  • 11. Stateful Transformations Stateful transformations are operations on DStreams that track data across time; that is, some data from previous batches is used to generate the results for a new batch. The two main types of Stateful Transformation are:  Windowed Operations  UpdateStateByKey
  • 12. Stateful Transformation  Windowed Transformations Windowed operations compute results across a longer time period than the StreamingContext’s batch interval, by combining results from multiple batches All windowed operations need two parameters, window duration and sliding duration, both of which must be a multiple of the StreamingContext’s batch interval. The window duration controls how many previous batches of data are considered
  • 13. Stateful Transformation If we had a source DStream with a batch interval of 10 seconds and wanted to create a sliding window of the last 30 seconds(or last 3 batches) we would set the windowDuration to 30 seconds. The sliding duration, which defaults to the batch interval, controls how frequently the new DStream computes results. If we had the source DStream with a batch interval of 10 seconds and wanted to compute our window only on every second batch, we would set our sliding interval to 20 seconds.
  • 14. Stateful Transformations  UpdateStateByKey Transformations it’s useful to maintain state across the batches in a DStream . updateStateByKey() enables this by providing access to a state variable for DStreams of key/value pairs. Given a DStream of (key, event) pairs, it lets you construct a new DStream of (key, state) pairs by taking a function that specifies how to update the state for each key given new events. For example, in a web server log, our events might be visits to the site, where the key is the user ID. Using updateStateByKey(), we could track the last pages each user visited. This list would be our “state” object, and we’d update it as each event arrives.
  • 15. Action/Output Operations Output operations specify what needs to be done with the final transformed data in a stream. Output Operations are similar to spark core. - print the output - save to text file - save as Object in File etc. It contains some extra methods like. forEachRdd()
  • 16. Performance Considerations Spark Streaming applications have a few specialized tuning options.  Batch and Window Sizes The most common question is what minimum batch size Spark Streaming can use. In general, 500 milliseconds has proven to be a good minimum size for many applications. The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller batch size.  Level of Parallelism A common way to reduce the processing time of batches is to increase the parallelism.
  • 17.  Increasing the number of receivers Receivers can sometimes act as a bottleneck if there are too many records for a single machine to read in and distribute. You can add more receivers by creating multiple input DStreams (which creates multiple receivers), and then applying union to merge them into a single stream.  Explicitly repartitioning received data If receivers cannot be increased anymore, you can further redistribute the received data by Performance Considerations
  • 19. References  Learning Spark LIGHTNING-FAST DATA ANALYSIS Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia  http://spark.apache.org/streaming/