18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM
UNIT IV
Apache Spark Streaming Introduction - Spark’s Memory Usage - Understanding Resilience and
Fault Tolerance in a Distributed System - Spark’s cluster manager - Data Delivery Semantics in
Spark - Data Delivery Semantics in Spark Applications - Microbatching - Dynamic Batch Interval
- Structured Stream processing model - Spark Streaming Resilience Model - Data Structures in
Spark – RDDs and DStreams - Spark Fault Tolerance Guarantees - First Steps in Structured
Streaming - Streaming Analytics Phases - Acquiring streaming data - Transforming streaming data
- Output the resulting data - Demo – Stream Processing with Spark Streaming
Apache Spark Streaming Introduction
Spark offers two different stream-processing APIs:
• Spark Streaming
• Structured Streaming
Spark Streaming: This is an API and a set of connectors in which a Spark program is served
small batches of data collected from a stream, in the form of microbatches spaced at fixed
time intervals, performs a given computation, and eventually returns a result at every interval.
Structured Streaming: This is an API and a set of connectors, built on the substrate of a SQL
query optimizer, Catalyst. It offers an API based on DataFrames and the notion of continuous
queries over an unbounded table that is constantly updated with fresh records from the stream.
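To make the contrast concrete, the following is a minimal sketch of the Spark Streaming (DStream) API; the socket source on localhost:9999 and the two-second batch interval are illustrative assumptions, not part of the original description:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Hedged sketch: a word count over microbatches collected every 2 seconds.
val conf = new SparkConf().setAppName("dstream-wordcount")
val ssc = new StreamingContext(conf, Seconds(2))          // fixed batch interval
val lines = ssc.socketTextStream("localhost", 9999)       // hypothetical text source
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()                                            // a result at every interval
ssc.start()
ssc.awaitTermination()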
Spark’s Memory Usage
Spark offers in-memory storage of slices of a dataset, which must be initially loaded from
a data source. The data source can be a distributed filesystem or another storage medium. Spark’s
form of in-memory storage is analogous to the operation of caching data.
Hence, a value in Spark’s in-memory storage has a base, which is its initial data source, and
layers of successive operations applied to it.
Failure Recovery What happens in case of a failure? Because Spark knows exactly which data
source was used to ingest the data in the first place, and because it also knows all the operations
that were performed on it thus far, it can reconstitute the segment of lost data that was on a crashed
executor, from scratch. Obviously, this goes faster if that reconstitution (recovery, in Spark’s
parlance) does not need to be totally from scratch. So, Spark offers a replication mechanism, much
as distributed filesystems do. However, because memory is such a valuable yet limited
commodity, Spark makes (by default) the cache short-lived.
Lazy Evaluation A good part of the operations that can be defined on values in Spark’s storage
have a lazy execution, and it is the execution of a final, eager output operation that will trigger the
actual execution of computation in a Spark cluster. It’s worth noting that if a program consists of
a series of linear operations, with the previous one feeding into the next, the intermediate results
disappear right after said next step has consumed its input.
Cache Hints On the other hand, what happens if we have several operations to do on a single
intermediate result? Should we have to compute it several times? Thankfully, Spark lets users
specify that an intermediate value is important and how its contents should be safeguarded for
later. Figure below presents the data flow of such an operation.
Figure: Operations on cached values
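As a hedged illustration of such cache hints (the input path is a hypothetical location), an intermediate value can be marked as important and reused by several later actions:
// Hedged sketch: keep an intermediate RDD around for several downstream uses.
val words = sc.textFile("hdfs:///data/corpus.txt").flatMap(_.split(" "))
words.cache()                                 // hint: this value will be reused
val totalWords = words.count()                // first action computes and caches it
val distinctWords = words.distinct().count()  // second action reuses the cached data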
Finally, Spark offers the opportunity to spill the cache to secondary storage in case it runs out of
memory on the cluster, extending the in-memory operation to secondary—and significantly
slower—storage to preserve the functional aspects of a data process when faced with temporary
peak loads.
Now that we have an idea of the main characteristics of Apache Spark, let’s spend some time
focusing on one design choice internal to Spark, namely, the latency versus throughput trade-off.
Understanding Resilience and Fault Tolerance in a Distributed System
Resilience and fault tolerance are absolutely essential for a distributed application: they are the
condition by which we will be able to perform the user’s computation to completion. Nowadays,
clusters are made of commodity machines that are ideally operated near peak capacity over their
lifetime.
To put it mildly, hardware breaks quite often. A resilient application can make progress with its
process despite latencies and noncritical faults in its distributed environment. A fault-tolerant
application is able to succeed and complete its process despite the unplanned termination of one
or several of its nodes.
This sort of resiliency is especially relevant in stream processing given that the applications we’re
scheduling are supposed to live for an undetermined amount of time. That undetermined amount
of time is often correlated with the life cycle of the data source. For example, if we are running a
retail website and we are analyzing transactions and website interactions as they come into the
system against the actions and clicks and navigation of users visiting the site, we potentially have a
data source that will be available for the entire duration of the lifetime of our business, which we
hope to be very long, if our business is going to be successful.
As a consequence, a system that will process our data in a streaming fashion should run
uninterrupted for long periods of time.
This “show must go on” approach of streaming computation makes the resiliency and fault-
tolerance characteristics of our applications more important. For a batch job, we could launch it,
hope it would succeed, and relaunch if we needed to change it or in case of failure. For an online
streaming Spark pipeline, this is not a reasonable assumption.
Fault Recovery
In the context of fault tolerance, we are also interested in understanding how long it takes to
recover from failure of one particular node. Indeed, stream processing has a particular aspect: data
continues being generated by the data source in real time. To deal with a batch computing failure,
we always have the opportunity to restart from scratch and accept that obtaining the results of
computation will take longer. Thus, a very primitive form of fault tolerance is detecting the failure
of a particular node of our deployment, stopping the computation, and restarting from scratch.
That process can take more than twice the original duration that we had budgeted for that
computation, but if we are not in a hurry, this is still acceptable.
For stream processing, we need to keep receiving data and thus potentially storing it, if the
recovering cluster is not ready to assume any processing yet. This can pose a problem at a high
throughput: if we try restarting from scratch, we will need not only to reprocess all of the data that
we have observed since the beginning of the application—which in and of itself can be a
challenge—but during that reprocessing of historical data, we will need it to continue receiving
and thus potentially storing new data that was generated while we were trying to catch up. This
pattern of restarting from scratch is something so intractable for streaming that we will pay special
attention to Spark’s ability to restart only minimal amounts of computation in the case that a node
becomes unavailable or nonfunctional.
Cluster Manager Support for Fault Tolerance
We want to highlight why it is still important to understand Spark’s fault tolerance guarantees, even
if there are similar features present in the cluster managers of YARN, Mesos, or Kubernetes. To
understand this, we can consider that cluster managers help with fault tolerance when they work
hand in hand with a framework that is able to report failures and request new resources to cope
with those exceptions. Spark possesses such capabilities.
For example, production cluster managers such as YARN, Mesos, or Kubernetes have the ability
to detect a node’s failure by inspecting endpoints on the node and asking the node to report on its
own readiness and liveness state. If these cluster managers detect a failure and they have spare
capacity, they will replace that node with another, made available to Spark. That particular action
implies that the Spark executor code will start anew in another node, and then attempt to join the
existing Spark cluster.
The cluster manager, by definition, does not have introspection capabilities into the applications
being run on the nodes that it reserves. Its responsibility is limited to the container that runs the
user’s code.
That responsibility boundary is where the Spark resilience features start. To recover from a failed
node, Spark needs to do the following:
• Determine whether that node contains some state that should be reproduced in the form
of checkpointed files
• Understand at which stage of the job a node should rejoin the computation
The goal here is to show that if a node is replaced by the cluster manager, Spark
has capabilities that allow it to take advantage of this new node and to distribute computation onto
it.
Within this context, our focus is on Spark’s responsibilities as an application, and we underline the
capabilities of a cluster manager only when necessary: for instance, a node could be replaced
because of a hardware failure or because its work was simply preempted by a higher-priority job.
Apache Spark is blissfully unaware of the why, and focuses on the how.
Spark’s cluster manager
Spark has two internal cluster managers:
The local cluster manager
This emulates the function of a cluster manager (or resource manager) for testing purposes.
It reproduces the presence of a cluster of distributed machines using a threading model that relies
on the cores available on your local machine. This mode is usually not very confusing
because it executes only on the user’s laptop.
The standalone cluster manager
A relatively simple, Spark-only cluster manager that is rather limited in its ability to
slice and dice resource allocation. The standalone cluster manager holds and makes available the
entire worker node on which a Spark executor is deployed and started. It also expects the executor
to have been predeployed there, and the actual shipping of that .jar to a new machine is not within
its scope. It has the ability to take over a specific number of executors, which are part of its
deployment of worker nodes, and execute tasks on them. This cluster manager is extremely useful for
the Spark developers, providing a bare-bones resource management solution that allows them to
focus on improving Spark in an environment without any bells and whistles. The standalone cluster
manager is not recommended for production deployments.
As a summary, Apache Spark is a task scheduler in that what it schedules are tasks, units of
distribution of computation that have been extracted from the user program. Spark also
communicates with and is deployed through cluster managers, including Apache Mesos, YARN, and
Kubernetes, or, in some cases, its own standalone cluster manager. The purpose of that
communication is to reserve a number of executors, which are the units in which Spark
understands equal-sized amounts of computation resources, a virtual “node” of sorts. The reserved
resources in question could be provided by the cluster manager as the following:
• Limited processes (e.g., in some basic use cases of YARN), in which processes have their
resource consumption metered but are not prevented from accessing each other’s resources
by default.
• Containers (e.g., in the case of Mesos or Kubernetes), in which containers are a relatively
lightweight resource reservation technology that is born out of the cgroups and namespaces
of the Linux kernel and has seen its most popular iteration with the Docker project.
• Either of the above deployed on virtual machines (VMs), themselves
coming with specific core and memory reservations.
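As a hedged sketch (the application name and master URLs below are illustrative), the cluster manager an application talks to is selected through the master setting when the session is created:
import org.apache.spark.sql.SparkSession
// Hedged sketch: choosing a cluster manager via the master URL.
val spark = SparkSession.builder()
  .appName("streaming-app")                // hypothetical application name
  .master("local[4]")                      // local manager: 4 threads, for testing
  // .master("spark://master-host:7077")   // standalone cluster manager
  // .master("yarn")                       // YARN resource manager
  .getOrCreate()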
Data Delivery Semantics in Spark
As you have seen in the streaming model, the fact that streaming jobs act on the basis of data that
is generated in real time means that intermediate results need to be provided to the consumer of
that streaming pipeline on a regular basis.
Those results are being produced by some part of our cluster. Ideally, we would like those
observable results to be coherent, in line, and in real time with respect to the arrival of data. This
means that we want results that are exact, and we want them as soon as possible. However,
distributed computation has its own challenges in that it sometimes includes not only individual
nodes failing, as we have mentioned, but it also encounters situations like network partitions, in
which some parts of our cluster are not able to communicate with other parts of that cluster, as
illustrated in below Figure.
Figure: A network partition
Spark has been designed using a driver/executor architecture. A specific machine, the driver, is
tasked with keeping track of the job progression along with the job submissions of a user, and the
computation of that program occurs as the data arrives. However, if a network partition
separates some part of the cluster, the driver might be able to keep track of only the part of the
executors that form the initial cluster. In the other section of our partition, we will find nodes that
are entirely able to function, but will simply be unable to report the progress of their
computation to the driver.
This creates an interesting case in which those “zombie” nodes do not receive new tasks, but might
well be in the process of completing some fragment of computation that they were previously
given. Being unaware of the partition, they will report their results as any executor would. And
because this reporting of results sometimes does not go through the driver (for fear of making the
driver a bottleneck), the reporting of these zombie results could succeed.
Because the driver, a single point of bookkeeping, does not know that those zombie executors are
still functioning and reporting results, it will reschedule the same tasks that the lost executors had
to accomplish on new nodes. This creates a double answering problem in which the zombie
machines lost through partitioning and the machines bearing the rescheduled tasks both report the
same results. This bears real consequences: one example of stream computation that we previously
mentioned is routing tasks for financial transactions. A double withdrawal, in that context, or
double stock purchase orders, could have tremendous consequences.
It is not only the aforementioned problem that causes different processing semantics. Another
important reason is that when output from a stream-processing application and state
checkpointing cannot be completed in one atomic operation, it will cause data corruption if failure
happens between checkpointing and outputting.
These challenges have therefore led to a distinction between at least once processing and at most
once processing:
• At least once: This processing ensures that every element of a stream has been processed
once or more.
• At most once: This processing ensures that every element of the stream is processed once
or less.
• Exactly once: This is the combination of “at least once” and “at most once.”
At-least-once processing is the notion that we want to make sure that every chunk of initial data
has been dealt with—it deals with the node failure we were talking about earlier. As we’ve
mentioned, when a streaming process suffers a partial failure in which some nodes need to be
replaced or some data needs to be recomputed, we need to reprocess the lost units of computation
while keeping the ingestion of data going. That requirement means that if you do not respect at-
least-once processing, there is a chance for you, under certain conditions, to lose data.
The antisymmetric notion is called at-most-once processing. At-most-once processing systems
guarantee that the zombie nodes repeating the same results as a rescheduled node are treated in a
coherent manner, in which we keep track of only one set of results. By keeping track of what data
their results were about, we’re able to make sure we can discard repeated results, yielding at-most-
once processing guarantees. The way in which we achieve this relies on the notion of idempotence
applied to the “last mile” of result reception. Idempotence qualifies a function such that if we
apply it twice (or more) to any data, we will get the same result as the first time. This can be
achieved by keeping track of the data that we are reporting a result for, and having a bookkeeping
system at the output of our streaming process.
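A minimal sketch of that “last mile” idempotence follows; the result store and record identifiers are hypothetical stand-ins for a bookkeeping system, not a Spark API:
import scala.collection.concurrent.TrieMap
// Hedged sketch: idempotent result delivery keyed by a record identifier.
val resultStore = TrieMap.empty[String, Long]
def reportResult(recordId: String, value: Long): Unit = {
  // putIfAbsent keeps only the first report for a given record, so duplicate
  // reports from "zombie" executors leave the stored state unchanged.
  resultStore.putIfAbsent(recordId, value)
}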
Microbatching
Two important approaches to stream processing:
• bulk-synchronous processing, and
• one-at-a-time record processing.
The objective of this is to connect those two ideas to the two APIs that Spark possesses for stream
processing: Spark Streaming and Structured Streaming.
Microbatching: An Application of Bulk-Synchronous Processing
Spark Streaming, the more mature model of stream processing in Spark, is roughly approximated
by what’s called a Bulk Synchronous Parallelism (BSP) system.
The gist of BSP is that it includes two things:
• A split distribution of asynchronous work
• A synchronous barrier, coming in at fixed intervals
The split is the idea that each of the successive steps of work to be done in streaming is separated
in a number of parallel chunks that are roughly proportional to the number of executors available
to perform this task. Each executor receives its own chunk (or chunks) of work and works
separately until the second element comes in. A particular resource is tasked with keeping track of
the progress of computation. With Spark Streaming, this is a synchronization point at the “driver”
that allows the work to progress to the next step. Between those scheduled steps, all of the
executors on the cluster are doing the same thing.
Note that what is being passed around in this scheduling process are the functions that describe
the processing that the user wants to execute on the data. The data is already on the various
executors, most often being delivered directly to these resources over the lifetime of the cluster.
This was coined “function-passing style” by Heather Miller in 2016 (and formalized in
[Miller2016]): asynchronously pass safe functions to distributed, stationary, immutable data in a
stateless container, and use lazy combinators to eliminate intermediate data structures.
The frequency at which further rounds of data processing are scheduled is dictated by a time
interval. This time interval is an arbitrary duration that is measured in batch processing time; that
is, what you would expect to see as a “wall clock” time observation in your cluster.
For stream processing, we choose to implement barriers at small, fixed intervals that better
approximate the real-time notion of data processing.
One-Record-at-a-Time Processing
By contrast, one-record-at-a-time processing functions by pipelining: it analyzes the whole
computation as described by user-specified functions and deploys it as pipelines using the
resources of the cluster. Then, the only remaining matter is to flow data through the various
resources, following the prescribed pipeline. Note that in this latter case, each step of the
computation is materialized at some place in the cluster at any given point. Systems that function
mostly according to this paradigm include Apache Flink, Naiad, Storm, and IBM Streams. This
does not necessarily mean that those systems are incapable of microbatching, but rather
characterizes their major or most native mode of operation and makes a statement on their
dependency on the process of pipelining, often at the heart of their processing.
The minimum latency, or time needed for the system to react to the arrival of one particular event,
is very different between those two: the minimum latency of the microbatching system is therefore
the time needed to complete the reception of the current microbatch (the batch interval) plus the
time needed to start a task at the executor where this data falls (also called scheduling time). On
the other hand, a system processing records one by one can react as soon as it meets the event
of interest.
Microbatching Versus One-at-a-Time: The Trade-Offs
Despite their higher latency, microbatching systems offer significant advantages:
• They are able to adapt at the synchronization barrier boundaries. That adaptation might
represent the task of recovering from failure, if a number of executors have been shown
to become deficient or lose data. The periodic synchronization can also give us an
opportunity to add or remove executor nodes, giving us the possibility to grow or shrink
our resources depending on what we’re seeing as the cluster load, observed through the
throughput on the data source.
• Our BSP systems can sometimes have an easier time providing strong consistency because
their batch determinations—that indicate the beginning and the end of a particular batch
of data—are deterministic and recorded. Thus, any kind of computation can be redone
and produce the same results the second time.
• Having data available as a set that we can probe or inspect at the beginning of the
microbatch allows us to perform efficient optimizations that can provide ideas on the way
to compute on the data. Exploiting that on each microbatch, we can consider the specific
case rather than the general processing, which is used for all possible input. For example,
we could take a sample or compute a statistical measure before deciding to process or drop
each microbatch.
More importantly, the simple presence of the microbatch as a well-identified element also allows
an efficient way of specifying programming for both batch processing (where the data is at rest
and has been saved somewhere) and streaming (where the data is in flight). The microbatch, even
for mere instants, looks like data at rest.
Dynamic Batch Interval
What is this notion of dynamic batch interval? The dynamic batch interval is the notion that the
recomputation of data in a streaming DataFrame or Dataset consists of an update of existing data
with the new elements seen over the wire. This update is occurring based on a trigger and the usual
basis of this would be time duration. That time duration is still determined based on a fixed world
clock signal that we expect to be synchronized within our entire cluster and that represents a single
synchronous source of time that is shared among every executor.
However, this trigger can also be the statement of “as often as possible.” That statement is simply
the idea that a new batch should be started as soon as the previous one has been processed, given
a reasonable initial duration for the first batch. This means that the system will launch batches as
often as possible. In this situation, the latency that can be observed is closer to that of one-element-
at-a-time processing. The idea here is that the microbatches produced by this system will converge
to the smallest manageable size, making our stream flow faster through the executor computations
that are necessary to produce a result. As soon as that result is produced, a new query will be
started and scheduled by the Spark driver.
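A hedged sketch of how this is expressed in Structured Streaming (the DataFrame df and the console sink are placeholders for a real query):
import org.apache.spark.sql.streaming.Trigger
// Hedged sketch: `df` stands for a streaming DataFrame defined earlier.
// With an explicit trigger, a new microbatch starts every 30 seconds:
val fixedIntervalQuery = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
// Omitting the .trigger(...) call gives the "as often as possible" behavior:
// a new batch is started as soon as the previous one completes.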
Structured Stream processing model
The main steps in Structured Streaming processing are as follows:
1. When the Spark driver triggers a new batch, processing starts with updating the accounting
of data read from a data source, in particular, getting data offsets for the beginning and the
end of the latest batch.
2. This is followed by logical planning, the construction of successive steps to be executed
on data, followed by query planning (intrastep optimization).
3. And then the launch and scheduling of the actual computation by adding a new batch of
data to update the continuous query that we’re trying to refresh.
Hence, from the point of view of the computation model, we will see that the API is significantly
different from Spark Streaming.
The Disappearance of the Batch Interval
We now briefly explain what Structured Streaming batches mean and their impact with respect to
operations. In Structured Streaming, the batch interval that we are using is no longer a computation
budget. With Spark Streaming, the idea was that if we produce data every two minutes and flow
data into Spark’s memory every two minutes, we should produce the results of computation on
that batch of data in at most two minutes, to clear the memory of our cluster for the next
microbatch. Ideally, as much data flows out as flows in, and the usage of the collective memory of
our cluster remains stable.
With Structured Streaming, without this fixed time synchronization, our ability to see performance
issues in our cluster is more complex: a cluster that is unstable—that is, unable to “clear out” data
by finishing its computation as fast as new data flows in—will see ever-growing batch processing
times, with an accelerating growth. We can expect that keeping an eye on this batch processing
time will be pivotal.
However, if we have a cluster that is correctly sized with respect to the throughput of our data,
there are a lot of advantages to having an as-often-as-possible update. In particular, we should expect
to see very frequent results from our Structured Streaming cluster, with a higher granularity than
we were used to with a conservative batch interval.
Spark Streaming Resilience Model
In most cases, a streaming job is a long-running job. By definition, streams of data observed and
processed over time lead to jobs that run continuously. As they process data, they might
accumulate intermediary results that are difficult to reproduce after the data has left the processing
system. Therefore, the cost of failure is considerable and, in some cases, complete recovery is
intractable.
In distributed systems, especially those relying on commodity hardware, failure is a function of
size: the larger the system, the higher the probability that some component fails at any time.
Distributed stream processors need to factor this chance of failure in their operational model.
We look at the resilience that the Apache Spark platform provides us: how it’s able to recover
partial failure and what kinds of guarantees we are given for the data passing through the system
when a failure occurs. We begin by getting an overview of the different internal components of
Spark and their relation to the core data structure. With this knowledge, you can proceed to
understand the impact of failure at the different levels and the measures that Spark offers to
recover from such failure.
RDDs and DStreams
Spark builds its data representations on Resilient Distributed Datasets (RDDs). Introduced in 2011
by the paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing” [Zaharia2011], RDDs are the foundational data structure in Spark. It is at this ground
level that the strong fault tolerance guarantees of Spark start.
RDDs are composed of partitions, which are segments of data stored on individual nodes and
tracked by the Spark driver; the RDD is presented to the user as a single, location-transparent data structure.
We illustrate these components in below Figure in which the classic word count application is
broken down into the different elements that comprise an RDD.
Figure: An RDD operation represented in a distributed system
The colored blocks are data elements, originally stored in a distributed filesystem, represented on
the far left of the figure. The data is stored as partitions, illustrated as columns of colored blocks
inside the file. Each partition is read into an executor, which we see as the horizontal blocks. The
actual data processing happens within the executor. There, the data is transformed following the
transformations described at the RDD level:
• .flatMap(l => l.split(" ")) separates sentences into words, splitting on spaces.
• .map(w => (w,1)) transforms each word into a tuple of the form (word, 1), in this way preparing
the words for counting.
• .reduceByKey(_ + _) computes the count, using the word as a key and applying a sum operation
to the attached number.
• The final result is attained by bringing the partial results together using the same reduce
operation (the complete chain is sketched below).
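Putting those steps together, a hedged sketch of the complete chain (the input path is a hypothetical location):
// The classic word count described above, expressed on an RDD.
val counts = sc.textFile("hdfs:///data/sentences.txt")
  .flatMap(l => l.split(" "))   // split each line into words
  .map(w => (w, 1))             // prepare each word for counting
  .reduceByKey(_ + _)           // sum the counts per word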
RDDs constitute the programmatic core of Spark. All other abstractions, batch and streaming
alike, including DataFrames, DataSets, and DStreams are built using the facilities created by RDDs,
and, more important, they inherit the same fault tolerance capabilities.
Another important characteristic of RDDs is that Spark will try to keep their data preferably in
memory for as long as it is required, provided there is enough capacity in the system. This behavior is
configurable through storage levels and can be explicitly controlled by calling caching operations.
We mention those structures here to present the idea that Spark tracks the progress of the user’s
computation through modifications of the data. Indeed, knowing how far along we are in what
the user wants to do through inspecting the control flow of their program (including loops and
potential recursive calls) can be a daunting and error-prone task. It is much more reliable to define
types of distributed data collections, and let the user create one from another, or from other data
sources.
In the figure below, we show the same word count program, now in the form of the user-provided
code (left) and the resulting internal RDD chain of operations. This dependency chain forms a
particular kind of graph, a Directed Acyclic Graph (DAG). The DAG informs the scheduler,
appropriately called the DAG Scheduler, on how to distribute the computation and is also the
foundation of the failure-recovery functionality, because it represents the internal data and their
dependencies.
Figure: RDD lineage
As the system tracks the ordered creation of these distributed data collections, it tracks the work
done, and what’s left to accomplish.
Data Structures in Spark
To understand at what level fault tolerance operates in Spark, it’s useful to go through an overview
of the nomenclature of some core concepts. We begin by assuming that the user provides a
program that ends up being divided into chunks and executed on various machines, as we saw in
the previous section, and as depicted in below Figure.
Figure: Spark nomenclature
Let’s run down those steps, which define the vocabulary of the Spark runtime:
User Program The user application in Spark Streaming is composed of user-specified function
calls operating on a resilient data structure (RDD, DStream, streaming DataSet, and so on),
categorized as actions and transformations.
Transformed User Program The user program may undergo adjustments that modify some of
the specified calls to make them simpler, the most approachable and understandable of which is
map-fusion. Query plan is a similar but more advanced concept in Spark SQL.
RDD A logical representation of a distributed, resilient dataset. In the illustration, we see that the
initial RDD comprises three parts, called partitions.
Partition A partition is a physical segment of a dataset that can be loaded independently.
Stages The user’s operations are then grouped into stages, whose boundary separates user
operations into steps that must be executed separately. For example, operations that require a
shuffle of data across multiple nodes, such as a join between the results of two distinct upstream
operations, mark a distinct stage. Stages in Apache Spark are the unit of sequencing: they are
executed one after the other. At most one of any interdependent stages can be running at any given
time.
Jobs After these stages are defined, what internal actions Spark should take is clear. Indeed, at this
stage, a set of interdependent jobs is defined. And jobs, precisely, are the vocabulary for a unit of
scheduling. They describe the work at hand from the point of view of an entire Spark cluster,
whether it’s waiting in a queue or currently being run across many machines.
Tasks Depending on where their source data is on the cluster, jobs can then be cut into tasks,
crossing the conceptual boundary between distributed and single-machine computing: a task is a
unit of local computation, the name for the local, executor-bound part of a job.
Spark aims to make sure that all of these steps are safe from harm and to recover quickly in the
case of any incident occurring in any stage of this process. This concern is reflected in fault-
tolerance facilities that are structured by the aforementioned notions: restart and checkpointing
operations that occur at the task, job, stage, or program level.
Spark Fault Tolerance Guarantees
Now that we have seen the “pieces” that constitute the internal machinery in Spark, we are ready
to understand that failure can happen at many different levels. In this section, we see Spark fault-
tolerance guarantees organized by “increasing blast radius,” from the more modest to the larger
failure. We are going to investigate the following:
• How Spark mitigates Task failure through restarts
• How Spark mitigates Stage failure through the shuffle service
• How Spark mitigates the disappearance of the orchestrator of the user program, through
driver restarts
Task Failure Recovery: Tasks can fail when the infrastructure on which they are running has a
failure, or when logical conditions in the program lead to a sporadic failure, such as an OutOfMemory error,
network or storage errors, or problems bound to the quality of the data being processed.
If the input data of the task was stored, through a call to cache() or persist() and if the chosen
storage level implies a replication of data, the task does not need to have its input recomputed,
because a copy of it exists in complete form on another machine of the cluster. We can then use
this input to restart the task. The storage levels configurable in Spark differ in their
characteristics in terms of memory usage, disk usage, and replication factor.
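As a hedged sketch (parsedRdd is a placeholder for an intermediate RDD worth protecting), a replicated storage level is requested through persist:
import org.apache.spark.storage.StorageLevel
// MEMORY_AND_DISK_2 keeps two replicas and spills to disk when memory is short;
// other levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their
// serialized (_SER) and replicated (_2) variants.
parsedRdd.persist(StorageLevel.MEMORY_AND_DISK_2)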
If, however, there was no persistence or if the storage level does not guarantee the existence of a
copy of the task’s input data, the Spark driver will need to consult the DAG that stores the user-
specified computation to determine which segments of the job need to be recomputed.
Consequently, without enough precautions to save either on the caching or on the storage level,
the failure of a task can trigger the recomputation of several others, up to a stage boundary.
Stage boundaries imply a shuffle, and a shuffle implies that intermediate data will somehow be
materialized: as we discussed, the shuffle transforms executors into data servers that can provide
the data to any other executor serving as a destination.
As a consequence, executors that participated in a shuffle have a copy of the map operations that
led up to it. That is a lifesaver if you have a dying downstream executor, which can then rely on the upstream
servers of the shuffle (which serve the output of the map-like operation). What if it’s the contrary:
you need to face the crash of one of the upstream executors?
Stage Failure Recovery We’ve seen that task failure (possibly due to executor crash) was the
most frequent incident happening on a cluster and hence the most important event to mitigate.
Recurrent task failures will lead to the failure of the stage that contains that task. This brings us to
the second facility that allows Spark to resist arbitrary stage failures: the shuffle service.
When this failure occurs, it always means some rollback of the data, but a shuffle operation, by
definition, depends on all of the prior executors involved in the step that precedes it.
As a consequence, since Spark 1.3 we have the shuffle service, which lets you work on map data
that is saved and distributed through the cluster with a good locality, but, more important, through
a server that is not a Spark task. It’s an external file exchange service written in Java that has no
dependency on Spark and is made to be a much longer-running service than a Spark executor. This
additional service attaches as a separate process in all cluster modes of Spark and simply offers a
data file exchange for executors to transmit data reliably, right before a shuffle. It is highly
optimized through the use of a netty backend, to allow a very low overhead in transmitting data.
This way, an executor can shut down after the execution of its map task, as soon as the shuffle
service has a copy of its data. And because data transfers are faster, this transfer time is also highly
reduced, reducing the vulnerable time in which any executor could face an issue.
Driver Failure Recovery Having seen how Spark recovers from the failure of a particular task
and stage, we can now look at the facilities Spark offers to recover from the failure of the driver
program. The driver in Spark has an essential role: it is the depository of the block manager, which
knows where each block of data resides in the cluster. It is also the place where the DAG lives.
Finally, it is where the scheduling state of the job, its metadata, and logs resides. Hence, if the
driver is lost, a Spark cluster as a whole might well have lost which stage it has reached in
computation, what the computation actually consists of, and where the data that serves it can be
found, in one fell swoop.
Cluster-mode deployment Spark has implemented what’s called the cluster deployment mode,
which allows the driver program to be hosted on the cluster, as opposed to the user’s computer.
The deployment mode is one of two options: in client mode, the driver is launched in the same
process as the client that submits the application. In cluster mode, however, the driver is launched
from one of the worker processes inside the cluster, and the client process exits as soon as it fulfills
its responsibility of submitting the application without waiting for the application to finish.
This, in sum, allows Spark to operate an automatic driver restart, so that the user can start a job in
a “fire and forget fashion,” starting the job and then closing their laptop to catch the next train.
Every cluster mode of Spark offers a web UI that will let the user access the log of their application.
Another advantage is that driver failure does not mark the end of the job, because the driver
process will be relaunched by the cluster manager. But this only allows recovery from scratch,
given that the temporary state of the computation—previously stored in the driver machine—
might have been lost.
Checkpointing To avoid losing intermediate state in case of a driver crash, Spark offers the option
of checkpointing; that is, periodically recording a snapshot of the application’s state to disk. The
directory given to sparkContext.setCheckpointDir() should point to reliable storage (e.g.,
Hadoop Distributed File System [HDFS]) because having the driver try to reconstruct the state of
intermediate RDDs from its local filesystem makes no sense: those intermediate RDDs are being
created on the executors of the cluster and should as such not require any interaction with the
driver for backing them up.
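A minimal sketch, assuming an HDFS path that is purely illustrative:
// Hedged sketch: point checkpoint data at reliable, shared storage.
sparkContext.setCheckpointDir("hdfs:///checkpoints/streaming-app")
// For the DStream API, the StreamingContext offers an equivalent call:
// ssc.checkpoint("hdfs:///checkpoints/streaming-app")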
First Steps in Structured Streaming
In the previous section, we learned about the high-level concepts that constitute Structured
Streaming, such as sources, sinks, and queries. We are now going to explore Structured Streaming
from a practical perspective, using a simplified web log analytics use case as an example.
Before we begin delving into our first streaming application, we are going to see how classical
batch analysis in Apache Spark can be applied to the same use case.
This exercise has two main goals:
• First, most, if not all, streaming data analytics start by studying a static data sample. It is
far easier to start a study with a file of data, gain intuition on how the data looks, what kind
of patterns it shows, and define the process that we require to extract the intended
knowledge from that data. Typically, it’s only after we have defined and tested our data
analytics job, that we proceed to transform it into a streaming process that can apply our
analytic logic to data on the move.
• Second, from a practical perspective, we can appreciate how Apache Spark simplifies many
aspects of transitioning from a batch exploration to a streaming application through the
use of uniform APIs for both batch and streaming analytics. This exploration will allow
us to compare and contrast the batch and streaming APIs in Spark and show us the
necessary steps to move from one to the other.
Batch Analytics
Given that we are working with archive log files, we have access to all of the data at once. Before
we begin building our streaming application, let’s take a brief intermezzo to have a look at what a
classical batch analytics job would look like.
First, we load the log files, encoded as JSON, from the directory where we unpacked them:
// This is the location of the unpackaged files. Update accordingly
val logsDirectory = ???
val rawLogs = sparkSession.read.json(logsDirectory)
Next, we declare the schema of the data as a case class to use the typed Dataset API. Following
the formal description of the dataset (at NASA-HTTP ), the log is structured as follows:
The logs are an ASCII file with one line per request, with the following columns:
• Host making the request. A hostname when possible, otherwise the Internet address if the
name could not be looked up.
• Timestamp in the format “DAY MON DD HH:MM:SS YYYY,” where DAY is the day
of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is
the time of day using a 24-hour clock, and YYYY is the year. The timezone is –0400.
• Request given in quotes.
• HTTP reply code.
• Bytes in the reply.
Translating that schema to Scala, we have the following case class definition:
import java.sql.Timestamp
case class WebLog(host: String,
timestamp: Timestamp,
request: String,
http_reply: Int,
bytes: Long )
We convert the original JSON to a typed data structure using the previous schema definition:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// we need to narrow the numeric type because
// the JSON reader interprets whole numbers as a wider (long) type
val preparedLogs = rawLogs.withColumn("http_reply", $"http_reply".cast(IntegerType))
val weblogs = preparedLogs.as[WebLog]
Now that we have the data in a structured format, we can begin asking the questions that interest
us. As a first step, we would like to know how many records are con‐ tained in our dataset:
val recordCount = weblogs.count
> recordCount: Long = 1871988
A common question would be: “what was the most popular URL per day?” To answer that, we
first reduce the timestamp to the day of the month. We then group by this new dayOfMonth
column and the request URL and we count over this aggregate. We finally order using descending
order to get the top URLs first:
val topDailyURLs = weblogs.withColumn("dayOfMonth", dayofmonth($"timestamp"))
.select($"request", $"dayOfMonth")
.groupBy($"dayOfMonth", $"request")
.agg(count($"request").alias("count"))
.orderBy(desc("count"))
topDailyURLs.show()
+----------+----------------------------------------+-----+
|dayOfMonth| request|count|
+----------+----------------------------------------+-----+
| 13|GET /images/NASA-logosmall.gif HTTP/1.0 |12476|
| 13|GET /htbin/cdt_main.pl HTTP/1.0 | 7471|
| 12|GET /images/NASA-logosmall.gif HTTP/1.0 | 7143|
| 13|GET /htbin/cdt_clock.pl HTTP/1.0 | 6237|
| 6|GET /images/NASA-logosmall.gif HTTP/1.0 | 6112|
| 5|GET /images/NASA-logosmall.gif HTTP/1.0 | 5865| ...
Top hits are all images. What now? It’s not unusual to see that the top URLs are images commonly
used across a site. Our true interest lies in the content pages generating the most traffic. To find
those, we first filter on html content and then proceed to apply the top aggregation we just learned.
As we can see, the request field is a quoted sequence of [HTTP_VERB] URL [HTTP_VERSION].
We will extract the URL and preserve only those ending in .html, .htm, or no extension
(directories). This is a simplification for the purpose of this example:
val urlExtractor = """^GET (.+) HTTP/\d.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs = weblogs.filter { log =>
  log.request match {
    case urlExtractor(url) =>
      val ext = url.takeRight(5).dropWhile(c => c != '.')
      allowedExtensions.contains(ext)
    case _ => false
  }
}
With this new dataset that contains only .html, .htm, and directories, we proceed to apply the same
top-k function as earlier:
val topContentPages = contentPageLogs
.withColumn("dayOfMonth", dayofmonth($"timestamp"))
.select($"request", $"dayOfMonth")
.groupBy($"dayOfMonth", $"request")
.agg(count($"request").alias("count"))
.orderBy(desc("count"))
topContentPages.show()
+----------+------------------------------------------------+-----+
|dayOfMonth| request|count|
+----------+------------------------------------------------+-----+
| 13| GET /shuttle/countdown/liftoff.html HTTP/1.0" | 4992|
| 5| GET /shuttle/countdown/ HTTP/1.0" | 3412|
| 6| GET /shuttle/countdown/ HTTP/1.0" | 3393|
| 3| GET /shuttle/countdown/ HTTP/1.0" | 3378|
| 13| GET /shuttle/countdown/ HTTP/1.0" | 3086|
| 7| GET /shuttle/countdown/ HTTP/1.0" | 2935|
| 4| GET /shuttle/countdown/ HTTP/1.0" | 2832|
| 2| GET /shuttle/countdown/ HTTP/1.0" | 2330| ...
We can see that the most popular page that month was liftoff.html, corresponding to the coverage
of the launch of the Discovery shuttle, as documented on the NASA archives. It’s closely followed
by countdown/, the days prior to the launch.
Streaming Analytics Phases
In the previous section, we explored historical NASA web log records. We found trending events
in those records, but much later than when the actual events happened.
One key driver for streaming analytics comes from the increasing demand of organizations to have
timely information that can help them make decisions at many different levels.
We can use the lessons that we have learned while exploring the archived records using a batch-
oriented approach and create a streaming job that will provide us with trending information as it
happens.
The first difference that we observe with the batch analytics is the source of the data. For our
streaming exercise, we will use a TCP server to simulate a web system that delivers its logs in real
time. The simulator will use the same dataset but will feed it through a TCP socket connection
that will embody the stream that we will be analyzing.
Connecting to a Stream
If you recall from the introduction of this chapter, Structured Streaming defines the concepts of
sources and sinks as the key abstractions to consume a stream and produce a result. We are going
to use the TextSocketSource implementation to connect to the server through a TCP socket.
Socket connections are defined by the host of the server and the port where it is listening for
connections. These two configuration elements are required to create the socket source:
val stream = sparkSession.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
Note how the creation of a stream is quite similar to the declaration of a static datasource in the
batch case. Instead of using the read builder, we use the readStream construct and we pass to it
the parameters required by the streaming source. As you will see during the course of this exercise
and later on as we go into the details of Structured Streaming, the API is basically the same
DataFrame and Dataset API for static data but with some modifications and limitations that you
will learn in detail.
Preparing the Data in the Stream
The socket source produces a streaming DataFrame with one column, value, which contains the
data received from the stream.
In the batch analytics case, we could load the data directly as JSON records. In the case of the
Socket source, that data is plain text. To transform our raw data to WebLog records, we first
require a schema. The schema provides the necessary information to parse the text to a JSON
object. It’s the structure when we talk about structured streaming.
After defining a schema for our data, we proceed to create a Dataset, following these steps:
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.sql.functions._

case class WebLog(host: String,
                  timestamp: Timestamp,
                  request: String,
                  http_reply: Int,
                  bytes: Long)

val webLogSchema = Encoders.product[WebLog].schema
val jsonStream = stream.select(from_json($"value", webLogSchema) as "record")
val webLogStream: Dataset[WebLog] = jsonStream.select("record.*").as[WebLog]
1. Obtain a schema from the case class definition
2. Transform the text value to JSON using the JSON support built into Spark SQL
3. Use the Dataset API to transform the JSON records to WebLog objects
As a result of this process, we obtain a Streaming Dataset of WebLog records.
Operations on Streaming Dataset
The webLogStream we just obtained is of type Dataset[WebLog] like we had in the batch analytics
job. The difference between this instance and the batch version is that webLogStream is a
streaming Dataset.
We can observe this by querying the object:
webLogStream.isStreaming
> res: Boolean = true
At this point in the batch job, we were creating the first query on our data: How many records are
contained in our dataset? This is a question that we can easily answer when we have access to all
of the data. However, how do we count records that are constantly arriving? The answer is that
some operations that we consider usual on a static Dataset, like counting all records, do not have
a defined meaning on a Streaming Dataset.
As we can observe, attempting to execute the count query in the following code snippet will result
in an AnalysisException:
val count = webLogStream.count()
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();;
This means that the direct queries we used on a static Dataset or DataFrame now need two levels
of interaction. First, we need to declare the transformations of our stream, and then we need to
start the stream process.
Creating a Query
What are popular URLs? In what time frame? Now that we have immediate analytic access to the
stream of web logs, we don’t need to wait for a day or a month to have a rank of the popular
URLs. We can have that information as trends unfold in much shorter windows of time.
First, to define the period of time of our interest, we create a window over some timestamp. An
interesting feature of Structured Streaming is that we can define that time interval on the timestamp
when the data was produced, also known as event time, as opposed to the time when the data is
being processed.
Our window definition will be of five minutes of event data. Given that our timeline is simulated,
the five minutes might happen much faster or slower than the clock time. In this way, we can
clearly appreciate how Structured Streaming uses the timestamp information in the events to
keep track of the event timeline.
As we learned from the batch analytics, we should extract the URLs and select only content pages,
like .html, .htm, or directories. Let’s apply that acquired knowledge first before proceeding to
define our windowed query:
// A regex expression to extract the accessed URL from weblog.request
val urlExtractor = """^GET (.+) HTTP/\d.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs: String => Boolean = url => {
val ext = url.takeRight(5).dropWhile(c => c != '.')
allowedExtensions.contains(ext)
}
val urlWebLogStream = webLogStream.flatMap { weblog =>
weblog.request match {
case urlExtractor(url) if (contentPageLogs(url)) =>
Some(weblog.copy(request = url))
case _ => None
}
}
We have converted the request to contain only the visited URL and filtered out all noncontent
pages. Now, we define the windowed query to compute the top trending URLs:
val rankingURLStream = urlWebLogStream
.groupBy($"request", window($"timestamp", "5 minutes", "1 minute"))
.count()
Start the Stream Processing
All of the steps that we have followed so far have been to define the process that the stream will
undergo. But no data has been processed yet.
To start a Structured Streaming job, we need to specify a sink and an output mode. These are two
new concepts introduced by Structured Streaming:
• A sink defines where we want to materialize the resulting data; for example, to a file in a
filesystem, to an in-memory table, or to another streaming system such as Kafka.
• The output mode defines how we want the results to be delivered: do we want to see all
data every time, only updates, or just the new records?
These options are given to a writeStream operation. It creates the streaming query that starts the
stream consumption, materializes the computations declared on the query, and produces the result
to the output sink. For now, let’s use them empirically and observe the results.
For our query, shown in below Example, we use the memory sink and output mode complete to
have a fully updated table each time new records are added to the result of keeping track of the
URL ranking.
Example. Writing a stream to a sink
val query = rankingURLStream.writeStream
.queryName("urlranks")
.outputMode("complete")
.format("memory")
.start()
The memory sink outputs the data to a temporary table of the same name given in the queryName
option. We can observe this by querying the tables registered on Spark SQL:
scala> spark.sql("show tables").show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | urlranks| true|
+--------+---------+-----------+
In the expression in Example, query is of type StreamingQuery and it’s a handler to control the
query life cycle.
Exploring the Data
Given that we are accelerating the log timeline on the producer side, after a few seconds, we can
execute the next command to see the result of the first windows, as illustrated in the figure below. Note how
the processing time (a few seconds) is decoupled from the event time (hundreds of minutes of
logs):
val urlRanks = spark.sql("select * from urlranks")
urlRanks.select($"request", $"window", $"count").orderBy(desc("count"))
Figure: URL ranking: query results by window
Acquiring streaming data
In Structured Streaming, a source is an abstraction that lets us consume data from a streaming data
producer. Sources are not directly created. Instead, the sparkSession provides a builder method,
readStream, that exposes the API to specify a streaming source, called a format, and provide its
configuration.
For example, the code in Example creates a File streaming source. We specify the type of source
using the format method. The method schema lets us provide a schema for the data stream, which
is mandatory for certain source types, such as this File source.
Example. File streaming source
val fileStream = spark.readStream
.format("json")
.schema(schema)
.option("mode","DROPMALFORMED")
.load("/tmp/datasrc")
>fileStream:
org.apache.spark.sql.DataFrame = [id: string, timestamp: timestamp ... ]
Each source implementation has different options, and some have tunable parameters. In
Example, we are setting the option mode to DROPMALFORMED. This option instructs the
JSON stream processor to drop any line that neither complies with the JSON format nor matches
the provided schema.
Behind the scenes, the call to spark.readStream creates a DataStreamReader instance. This instance
is in charge of managing the different options provided through the builder method calls. Calling
load(...) on this DataStreamReader instance validates the options provided to the builder and, if
everything checks out, it returns a streaming DataFrame.
In our example, this streaming DataFrame represents the stream of data that will result from
monitoring the provided path and processing each new file in that path as JSON-encoded data,
parsed using the schema provided. All malformed records will be dropped from this data stream.
Loading a streaming source is lazy. What we get is a representation of the stream, embodied in the
streaming DataFrame instance, that we can use to express the series of transformations that we
want to apply to it in order to implement our specific business logic. Creating a streaming
DataFrame does not result in any data actually being consumed or processed until the stream is
materialized. This requires a query, as you will see further on.
Available Sources
As of Spark v2.4.0, the following streaming sources are supported:
• json, orc, parquet, csv, text, textFile: These are all file-based streaming sources. The
base functionality is to monitor a path (folder) in a filesystem and consume files atomically
placed in it. The files found will then be parsed by the formatter specified. For example, if
json is provided, the Spark json reader will be used to process the files, using the schema
information provided.
• socket: Establishes a client connection to a TCP server that is assumed to provide text data
through a socket connection.
• kafka: Creates a Kafka consumer able to retrieve data from Kafka.
• rate: Generates a stream of rows at the rate given by the rowsPerSecond option. It's mainly
intended as a testing source.
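For instance, a rate source is handy for quick experiments. The sketch below creates one; the rowsPerSecond value is arbitrary:
// test source that emits (timestamp, value) rows at a fixed rate
val testStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()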
Transforming streaming data
As we saw in the previous section, the result of calling load is a streaming DataFrame. After we
have created our streaming DataFrame using a source, we can use the Dataset or DataFrame API
to express the logic that we want to apply to the data in the stream in order to implement our
specific use case.
Assuming that we are using data from a sensor network, in the Example below we are selecting the fields
deviceId, timestamp, sensorType, and value from a sensorStream and filtering to only those
records where the sensor is of type temperature and its value is higher than the given threshold.
Example: Filter and projection
val highTempSensors = sensorStream
.select($"deviceId", $"timestamp", $"sensorType", $"value")
.where($"sensorType" === "temperature" && $"value" > threshold)
Likewise, we can aggregate our data and apply operations to the groups over time. Example shows
that we can use timestamp information from the event itself to define a time window of five
minutes that will slide every minute.
What is important to grasp here is that the Structured Streaming API is practically the same as the
Dataset API for batch analytics, with some additional provisions specific to stream processing.
Example: Average by sensor type over time
val avgBySensorTypeOverTime = sensorStream
.select($"timestamp", $"sensorType", $"value")
.groupBy(window($"timestamp", "5 minutes", "1 minute"), $"sensorType")
.agg(avg($"value")
If you are not familiar with the structured APIs of Spark, we suggest that you familiarize yourself
with it. Covering this API in detail is beyond the scope of this book.
Streaming API Restrictions on the DataFrame API
As we hinted in the previous chapter, some operations that are offered by the standard DataFrame
and Dataset API do not make sense on a streaming context. We gave the example of stream.count,
which does not make sense to use on a stream. In general, operations that require immediate
materialization of the underlying dataset are not allowed. These are the API operations not directly
supported on streams:
• count
• show
• describe
• limit
• take(n)
• distinct
• foreach
• sort
• multiple stacked aggregations
Next to these operations, stream-stream and static-stream joins are partially supported.
Understanding the limitations Although some operations, like count or limit, do not make sense
on a stream, some other stream operations are computationally difficult. For example, distinct is
one of them. Filtering duplicates in an arbitrary stream would require remembering all of
the data seen so far and comparing each new record with all records already seen. The first condition
would require infinite memory and the second has a computational complexity of O(n²), which
becomes prohibitive as the number of elements (n) increases.
Operations on aggregated streams Some of the unsupported operations become defined after
we apply an aggregation function to the stream. Although we can’t count the stream, we could
count messages received per minute or count the number of devices of a certain type.
In Example, we define a count of events per sensorType per minute.
Example: Count of sensor types over time
val avgBySensorTypeOverTime = sensorStream
.select($"timestamp", $"sensorType")
.groupBy(window($"timestamp", "1 minutes", "1 minute"), $"sensorType")
.count()
Likewise, it’s also possible to define a sort on aggregated data, although it’s further restricted to
queries with output mode complete.
Stream deduplication We discussed that distinct on an arbitrary stream is computationally
difficult to implement. But if we can define a key that informs us when an element in the stream
has already been seen, we can use it to remove duplicates by calling dropDuplicates on that key, as sketched below.
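A minimal sketch of this idea, reusing the sensorStream from the previous section and assuming that the pair (deviceId, timestamp) identifies a record; the watermark bounds how much deduplication state must be kept:
// drop records already seen; the watermark keeps the deduplication state bounded
val dedupedSensors = sensorStream
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("deviceId", "timestamp")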
Workarounds Although some operations are not supported in the exact same way as in the batch
model, there are alternative ways to achieve the same functionality:
• foreach Although foreach cannot be directly used on a stream, there’s a foreach sink that
provides the same functionality. Sinks are specified in the output definition of a stream.
• show Although show requires an immediate materialization of the query, and hence it’s
not possible on a streaming Dataset, we can use the console sink to output data to the
screen.
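For example, the console sink can play the role of show for debugging purposes. A sketch, reusing the highTempSensors stream defined earlier; numRows is optional:
// print each microbatch of results to standard output instead of calling show()
val debugQuery = highTempSensors.writeStream
  .format("console")
  .option("numRows", "20")
  .outputMode("append")
  .start()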
Output the resulting data
All operations that we have done so far—such as creating a stream and applying transformations
on it—have been declarative. They define from where to consume the data and what operations
we want to apply to it. But up to this point, there is still no data flowing through the system.
Before we can initiate our stream, we need to first define where and how we want the output data
to go:
• Where relates to the streaming sink: the receiving side of our streaming data.
• How refers to the output mode: how to treat the resulting records in our stream.
From the API perspective, we materialize a stream by calling writeStream on a streaming
DataFrame or Dataset.
Calling writeStream on a streaming Dataset creates a DataStreamWriter. This is a builder instance
that provides methods to configure the output behavior of our streaming process.
Example. File streaming sink
val query = stream.writeStream
.format("json")
.queryName("json-writer")
.outputMode("append")
.option("path", "/target/dir")
.option("checkpointLocation", "/checkpoint/dir")
.trigger(ProcessingTime("5 seconds"))
.start()
>query: org.apache.spark.sql.streaming.StreamingQuery = ...
format
The format method lets us specify the output sink by providing the name of a built-in sink or the
fully qualified name of a custom sink.
As of Spark v2.4.0, the following streaming sinks are available:
• console sink A sink that prints to the standard output. It shows a number of rows
configurable with the option numRows.
• file sink File-based and format-specific sink that writes the results to a filesystem. The
format is specified by providing the format name: csv, hive, json, orc, parquet, avro, or
text.
• kafka sink A Kafka-specific producer sink that is able to write to one or more Kafka
topics.
• memory sink Creates an in-memory table using the provided query name as table name.
This table receives continuous updates with the results of the stream.
• foreach sink Provides a programmatic interface to access the stream contents, one
element at a time.
• foreachBatch sink foreachBatch is a programmatic sink interface that provides access to
the complete DataFrame that corresponds to each underlying microbatch of the
Structured Streaming execution.
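As an illustration, the sketch below uses foreachBatch to reuse the regular batch DataFrame writer on every microbatch; the stream name is reused from earlier and the target path is hypothetical:
// write every microbatch with the ordinary batch writer
val batchQuery = highTempSensors.writeStream
  .foreachBatch { (batchDf: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // hypothetical output path, one directory per microbatch
    batchDf.write.mode("append").parquet(s"/tmp/batches/batch-$batchId")
  }
  .start()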
outputMode
The outputMode specifies the semantics of how records are added to the output of the streaming
query. The supported modes are append, update, and complete:
• append (default mode) Adds only final records to the output stream. A record is
considered final when no new records of the incoming stream can modify its value. This
is always the case with linear transformations like those resulting from applying projection,
filtering, and mapping. This mode guarantees that each resulting record will be output only
once.
• update Adds new and updated records since the last trigger to the output stream. update
is meaningful only in the context of an aggregation, where aggregated values change as
new records arrive. If more than one incoming record changes a single result, all changes
between trigger intervals are collated into one output record.
• complete complete mode outputs the complete internal representation of the stream. This
mode also relates to aggregations, because for nonaggregated streams, we would need to
remember all records seen so far, which is unrealistic. From a practical perspective,
complete mode is recommended only when you are aggregating values over low-cardinality
criteria, like count of visitors by country, for which we know that the number of countries
is bounded.
Understanding the append semantic
When the streaming query contains aggregations, the definition of final becomes nontrivial. In
an aggregated computation, new incoming records might change an existing aggregated value
when they comply with the aggregation criteria used. Following our definition, we cannot
output a record using append until we know that its value is final. Therefore, the use of the
append output mode in combination with aggregate queries is restricted to queries for which
the aggregation is expressed using event-time and it defines a watermark. In that case, append
will output an event as soon as the watermark has expired and hence it’s considered that no
new records can alter the aggregated value. As a consequence, output events in append mode
will be delayed by the aggregation time window plus the watermark offset.
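A sketch of such a query, reusing the sensorStream fields from earlier; the 10-minute watermark is an arbitrary choice for illustration, and the paths are hypothetical:
// event-time aggregation whose groups become final once the watermark passes,
// which makes the append output mode legal for this query
val avgWithWatermark = sensorStream
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"sensorType")
  .agg(avg($"value"))

val appendQuery = avgWithWatermark.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/avg-by-type")
  .option("checkpointLocation", "/tmp/avg-checkpoint")
  .start()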
queryName With queryName, we can provide a name for the query that is used by some
sinks and also presented in the job description in the Spark UI, as depicted in Figure.
Figure: Completed Jobs in the Spark UI showing the query name in the job description
option With the option method, we can provide specific key–value pairs of configuration to the
stream, akin to the configuration of the source. Each sink can have specific configuration we can
customize using this method. We can add as many .option(...) calls as necessary to configure the
sink.
options options is an alternative to option that takes a Map[String, String] containing all the key–
value configuration parameters that we want to set. This alternative is more friendly to an
externalized configuration model, where we don't know a priori the settings to be passed to the
sink’s configuration.
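For instance, the File sink configuration from the previous example could equally be expressed as a Map (a sketch with the same values):
// equivalent to chaining several .option(...) calls
val sinkOptions = Map(
  "path"               -> "/target/dir",
  "checkpointLocation" -> "/checkpoint/dir"
)
val configuredWriter = stream.writeStream
  .format("json")
  .options(sinkOptions)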
trigger The optional trigger option lets us specify the frequency at which we want the results to
be produced. By default, Structured Streaming will process the input and produce a result as soon
as possible. When a trigger is specified, output will be produced at each trigger interval.
org.apache.spark.sql.streaming.Trigger provides the following supported triggers:
• ProcessingTime() Lets us specify a time interval that will dictate the frequency of the
query results.
• Once() A particular Trigger that lets us execute a streaming job once. It is useful for testing
and also to apply a defined streaming job as a single-shot batch operation.
• Continuous() This trigger switches the execution engine to the experimental continuous
engine for low-latency processing. The checkpoint-interval parameter indicates the
frequency of the asynchronous checkpointing for data resilience. It should not be confused
with the batch interval of the ProcessingTime trigger.
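The sketch below shows how these triggers are passed to the writer; the intervals are arbitrary and each line only configures the writer, on which start() would still have to be called:
import org.apache.spark.sql.streaming.Trigger

stream.writeStream.trigger(Trigger.ProcessingTime("1 minute"))  // produce results every minute
stream.writeStream.trigger(Trigger.Once())                      // process the available data once and stop
stream.writeStream.trigger(Trigger.Continuous("1 second"))      // experimental continuous execution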
start() To materialize the streaming computation, we need to start the streaming process.
Finally, start() materializes the complete job description into a streaming computation and
initiates the internal scheduling process that results in data being consumed from the source,
processed, and produced to the sink. start() returns a StreamingQuery object, which is a handle
to manage the individual life cycle of each query. This means that we can simultaneously start
and stop multiple queries independently of one another within the same sparkSession.
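A small sketch of running several queries within the same session; the query names and sinks are illustrative:
// two independent queries started from the same session
val consoleQuery = highTempSensors.writeStream
  .format("console")
  .start()
val memoryQuery = highTempSensors.writeStream
  .queryName("hightemps")
  .format("memory")
  .start()

// block until any of the active queries terminates
spark.streams.awaitAnyTermination()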
Demo
The first part of our program deals with the creation of the streaming Dataset:
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
> rawData: org.apache.spark.sql.DataFrame
The entry point of Structured Streaming is an existing Spark Session (sparkSession). As you can
appreciate on the first line, the creation of a streaming Dataset is almost identical to the creation
of a static Dataset that would use a read operation instead. sparkSession.readStream returns a
DataStreamReader, a class that implements the builder pattern to collect the information needed
to construct the streaming source using a fluid API. In that API, we find the format option that
lets us specify our source provider, which, in our case, is kafka. The options that follow it are
specific to the source:
• kafka.bootstrap.servers
o Indicates the set of bootstrap servers to contact as a comma-separated list of
host:port addresses
• subscribe
o Specifies the topic or topics to subscribe to
• startingOffsets
o The offset reset policy to apply when this application starts out fresh.
The load() method evaluates the DataStreamReader builder and creates a DataFrame as a result,
as we can see in the returned value:
> rawData: org.apache.spark.sql.DataFrame
A DataFrame is an alias for Dataset[Row] with a known schema. After creation, you can use
streaming Datasets just like regular Datasets. This makes it possible to use the full-fledged Dataset
API with Structured Streaming, albeit some exceptions apply because not all operations, such as
show() or count(), make sense in a streaming context.
To programmatically differentiate a streaming Dataset from a static one, we can ask a Dataset
whether it is of the streaming kind:
rawData.isStreaming
res7: Boolean = true
And we can also explore the schema attached to it, using the existing Dataset API, as demonstrated
in Example.
Example. The Kafka schema
rawData.printSchema()
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
In general, Structured Streaming requires the explicit declaration of a schema for the consumed
stream. In the specific case of kafka, the schema for the resulting Dataset is fixed and is
independent of the contents of the stream. It consists of a set of fields specific to the Kafka source:
key, value, topic, partition, offset, timestamp, and timestampType, as we can see in the Example above.
In most cases, applications will be mostly interested in the contents of the value field where the
actual payload of the stream resides.
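Because key and value are delivered as binary, a common first step, sketched below, is to cast the payload to a string before parsing it further (this assumes the session implicits are imported):
// interpret the Kafka value payload as text
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]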
Application Logic
Recall that the intention of our job is to correlate the incoming IoT sensor data with a reference
file that contains all known sensors with their configuration. That way, we would enrich each
incoming record with specific sensor parameters that would allow us to interpret the reported data.
We would then save all correctly processed records to a Parquet file. The data coming from
unknown sensors would be saved to a separate file for later analysis.
Using Structured Streaming, our job can be implemented in terms of Dataset operations:
import scala.util.Try
val iotData = rawData.select($"value").as[String].flatMap { record =>
  val fields = record.split(",")
  // wrap the parsing in a Try so that malformed records become None and are dropped by flatMap
  Try {
    SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
  }.toOption
}
val sensorRef = sparkSession.read.parquet(s"$workDir/$referenceFile")
sensorRef.cache()
val sensorWithInfo = sensorRef.join(iotData, Seq("sensorId"), "inner")
val knownSensors = sensorWithInfo
.withColumn("dnvalue", $"value"*($"maxRange"-$"minRange")+$"minRange")
.drop("value", "maxRange", "minRange")
In the first step, we transform our CSV-formatted records back into SensorData entries. We apply
Scala functional operations on the typed Dataset[String] that we obtained from extracting the value
field as a String.
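The SensorData type used above is not shown in this excerpt. A minimal case class compatible with the conversions fields(0).toInt, fields(1).toLong, and fields(2).toDouble could look as follows; the field names other than sensorId and value are assumptions:
// assumed shape of the sensor records parsed from the CSV payload
case class SensorData(sensorId: Int, timestamp: Long, value: Double)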
Then, we use a streaming Dataset to static Dataset inner join to correlate the sensor data with the
corresponding reference using the sensorId as key.
To complete our application, we compute the real values of the sensor reading using the minimum-
maximum ranges in the reference data.
Writing to a Streaming Sink
The final step of our streaming application is to write the enriched IoT data to a Parquet-formatted
file. In Structured Streaming, the write operation is crucial: it marks the completion of the declared
transformations on the stream, defines a write mode, and upon calling start(), the processing of
the continuous query will begin.
In Structured Streaming, all operations are lazy declarations of what we want to do with the
streaming data. Only when we call start() will the actual consumption of the stream begin and the
query operations on the data materialize into actual results:
val knownSensorsQuery = knownSensors.writeStream
.outputMode("append")
.format("parquet")
.option("path", targetPath)
.option("checkpointLocation", "/tmp/checkpoint")
.start()
Let’s break this operation down:
• writeStream creates a builder object where we can configure the options for the desired
write operation, using a fluent interface.
• With format, we specify the sink that will materialize the result downstream. In our case,
we use the built-in FileStreamSink with Parquet format.
• outputMode is a new concept in Structured Streaming: given that we, theoretically, have access
to all the data seen in the stream so far, we also have the option to produce different views
of that data.
• The append mode, used here, implies that the new records affected by our streaming
computation are produced to the output.
The result of the start call is a StreamingQuery instance. This object provides methods to control
the execution of the query and request information about the status of our running streaming
query, as shown in Example.
Example. Query progress
knownSensorsQuery.recentProgress
res37: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array({
"id" : "6b9fe3eb-7749-4294-b3e7-2561f1e840b6",
"runId" : "0d8d5605-bf78-4169-8cfe-98311fc8365c",
"name" : null,
"timestamp" : "2017-08-10T16:20:00.065Z",
"numInputRows" : 4348,
"inputRowsPerSecond" : 395272.7272727273,
"processedRowsPerSecond" : 28986.666666666668,
"durationMs" : {
"addBatch" : 127,
"getBatch" : 3,
"getOffset" : 1,
"queryPlanning" : 7,
"triggerExecution" : 150,
"walCommit" : 11
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[iot-data]]",
"startOffset" : {
"iot-data" : { "0" : 19048348 } },
"endOffset" : {
"iot-data" : { "0" : 19052696 } },
"numInputRow...
In Example, we can see the StreamingQueryProgress as a result of calling
knownSensorsQuery.recentProgress. If we see nonzero values for the numInputRows, we can be
certain that our job is consuming data. We now have a Structured Streaming job running
properly.
Stream Processing with Spark Streaming
Spark Streaming was the first stream-processing framework built on top of the distributed
processing capabilities of Spark. Nowadays, it offers a mature API that's widely adopted in the
industry to process large-scale data streams.
Spark is, by design, a system that is really good at processing data distributed over a cluster of
machines. Spark's core abstraction, the Resilient Distributed Dataset (RDD), and its fluent
functional API permit the creation of programs that treat distributed data as a collection. That
abstraction lets us reason about data-processing logic in the form of transformations of the
distributed dataset. By doing so, it reduces the cognitive load previously required to create and
execute scalable and distributed data-processing programs.
Spark Streaming was created upon a simple yet powerful premise: apply Spark's distributed
computing capabilities to stream processing by transforming a continuous stream of data into
discrete data collections on which Spark could operate.
As we can see in Figure, the main task of Spark Streaming is to take data from the stream, package
it into small batches, and provide them to Spark for further processing. The output is then
produced to some downstream system.
Figure. Spark and Spark Streaming in action
The DStream Abstraction
Whereas Structured Streaming, which you learned in Part II, builds its streaming capabilities on
top of the Spark SQL abstractions of DataFrame and Dataset, Spark Streaming relies on the much
more fundamental Spark abstraction of RDD. At the same time, Spark Streaming introduces a
new concept: the Discretized Stream or DStream. A DStream represents a stream in terms of
discrete blocks of data that in turn are represented as RDDs over time, as we can see in Figure.
Figure. DStreams and RDDs in Spark Streaming
The DStream abstraction is primarily an execution model that, when combined with a functional
programming model, provides us with a complete framework to develop and execute streaming
applications.
DStreams as a Programming Model
The code representation of DStreams gives us a functional programming API consistent with the
RDD API and augmented with stream-specific functions to deal with aggregations, time-based
operations, and stateful computations. In Spark Streaming, we consume a stream by creating a
DStream from one of the native implementations, such as a SocketInputStream or using one of
the many connectors available that pro‐ vide a DStream implementation specific to a stream
provider (this is the case of Kafka, Twitter, or Kinesis connectors for Spark Streaming, just to
name a few):
// creates a DStream using a client socket connected to the given host and port
val textDStream = ssc.socketTextStream("localhost", 9876)
After we have obtained a DStream reference, we can implement our application logic using the
functions provided by the DStream API. For example, if the textDStream in the preceding code
is connected to a log server, we could count the number of error occurrences:
// we break down the stream of logs into error or info (not error)
// and create pairs of `(x, y)`.
// (1, 1) represents an error, and
// (0, 1) a non-error occurrence.
val errorLabelStream = textDStream.map{line =>
if (line.contains("ERROR")) (1, 1) else (0, 1)
}
We can then count the totals and compute the error rate by using an aggregation function called
reduce:
// reduce combines all the pairs in each batch into a single (errors, total) pair
val errorCountStream = errorLabelStream.reduce {
  case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)
}
To obtain our error rate, we perform a safe division:
// compute the error rate and create a string message with the value
val errorRateStream = errorCountStream.map {case (errors, total) =>
val errorRate = if (total > 0) errors.toDouble / total else 0.0
"Error Rate:" + errorRate
}
It’s important to note that up until now, we have been using transformations on the DStream but
there is still no data processing happening. All transformations on DStreams are lazy. This process
of defining the logic of a stream-processing application is better seen as the set of transformations
that will be applied to the data after the stream processing is started. As such, it’s a plan of action
that Spark Streaming will recurrently execute on the data consumed from the source DStream.
DStreams are immutable. It’s only through a chain of transformations that we can process and
obtain a result from our data.
Finally, the DStream programming model requires that the transformations are ended by an output
operation. This particular operation specifies how the DStream is materialized. In our case, we are
interested in printing the results of this stream computation to the console:
// print the results to the console
errorRateStream.print()
In summary, the DStream programming model consists of the functional composition of
transformations over the stream payload, materialized by one or more output operations and
recurrently executed by the Spark Streaming engine.
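The preceding snippets assume an existing StreamingContext called ssc. A minimal sketch of creating one and running the job locally might look like this; the application name, master, and batch interval are illustrative:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("error-rate").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // 2-second batch interval

// ... define textDStream, the transformations above, and errorRateStream.print() ...

ssc.start()            // start consuming and processing the stream
ssc.awaitTermination() // keep the application running until stopped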
DStreams as an Execution Model
In the preceding introduction to the Spark Streaming programming model, we could see how data
is transformed from its original form into our intended result as a series of lazy functional
transformations. The Spark Streaming engine is responsible for taking that chain of functional
transformations and turning it into an actual execution plan. That happens by receiving data from
the input stream(s), collecting that data into batches, and feeding it to Spark in a timely manner.
The measure of time to wait for data is known as the batch interval. It is usually a short amount
of time, ranging from approximately two hundred milliseconds to minutes depending on the
application requirements for latency. The batch interval is the central unit of time in Spark
Streaming. At each batch interval, the data corresponding to the previous interval is sent to Spark
for processing while new data is received. This process repeats as long as the Spark Streaming job
is active and healthy. A natural consequence of this recurring microbatch operation is that the
computation on the batch’s data has to complete within the duration of the batch interval so that
computing resources are available when the new microbatch arrives. As you will learn in this part
of the book, the batch interval dictates the time for most other functions in Spark Streaming.
Ad

More Related Content

Similar to Streaming Analytics unit 4 notes for engineers (20)

A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
Jessica Morris
 
Module01
 Module01 Module01
Module01
NPN Training
 
Benchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingBenchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark Streaming
AKUDA Labs
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
IJTET Journal
 
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
Markus Michalewicz
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Spark Summit
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013
BonFIRE
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Spark1
Spark1Spark1
Spark1
Dr. G. Bharadwaja Kumar
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
David Groozman
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Michael Spector
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Qualcomm Developer Network
 
seminar 4 sem
seminar 4 semseminar 4 sem
seminar 4 sem
AISWARYA TV
 
A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
Benchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingBenchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark Streaming
AKUDA Labs
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
IJTET Journal
 
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
Markus Michalewicz
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Spark Summit
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013
BonFIRE
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Michael Spector
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Qualcomm Developer Network
 

More from ManjuAppukuttan2 (18)

SEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product managementSEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product management
ManjuAppukuttan2
 
SEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product managementSEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product management
ManjuAppukuttan2
 
Unit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming AnalyticsUnit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming Analytics
ManjuAppukuttan2
 
SRM First Review PPT Template for project
SRM First  Review PPT Template for projectSRM First  Review PPT Template for project
SRM First Review PPT Template for project
ManjuAppukuttan2
 
Streaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineersStreaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineersStreaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics unit 2 notes for engineers
Streaming Analytics unit 2 notes for  engineersStreaming Analytics unit 2 notes for  engineers
Streaming Analytics unit 2 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineersStreaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineers
ManjuAppukuttan2
 
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.pptCHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.pptCHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptCHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
ManjuAppukuttan2
 
UNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.pptUNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.ppt
ManjuAppukuttan2
 
UNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.pptUNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.ppt
ManjuAppukuttan2
 
SA UNIT III STORM.pdf
SA UNIT III STORM.pdfSA UNIT III STORM.pdf
SA UNIT III STORM.pdf
ManjuAppukuttan2
 
SA UNIT II KAFKA.pdf
SA UNIT II KAFKA.pdfSA UNIT II KAFKA.pdf
SA UNIT II KAFKA.pdf
ManjuAppukuttan2
 
SA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfSA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdf
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdfCHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdf
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdfCHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
ManjuAppukuttan2
 
SEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product managementSEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product management
ManjuAppukuttan2
 
SEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product managementSEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product management
ManjuAppukuttan2
 
Unit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming AnalyticsUnit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming Analytics
ManjuAppukuttan2
 
SRM First Review PPT Template for project
SRM First  Review PPT Template for projectSRM First  Review PPT Template for project
SRM First Review PPT Template for project
ManjuAppukuttan2
 
Streaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineersStreaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineersStreaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics unit 2 notes for engineers
Streaming Analytics unit 2 notes for  engineersStreaming Analytics unit 2 notes for  engineers
Streaming Analytics unit 2 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineersStreaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineers
ManjuAppukuttan2
 
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.pptCHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.pptCHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptCHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
ManjuAppukuttan2
 
UNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.pptUNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.ppt
ManjuAppukuttan2
 
UNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.pptUNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.ppt
ManjuAppukuttan2
 
SA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfSA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdf
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdfCHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdf
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdfCHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
ManjuAppukuttan2
 
Ad

Recently uploaded (20)

new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
samueljackson3773
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
samueljackson3773
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Ad

Streaming Analytics unit 4 notes for engineers

  • 1. 1 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM UNIT IV Apache Spark Streaming Introduction - Spark’s Memory Usage - Understanding Resilience and Fault - Tolerance in a Distributed System - Spark’s cluster manager - Data Delivery Semantics in Spark - Data Delivery Semantics in Spark Applications - Microbatching - Dynamic Batch Interval - Structured Stream processing model - Spark Streaming Resilience Model - Data Structures in Spark – RDDs and DStreams - Spark Fault Tolerance Guarantees - First Steps in Structured Streaming - Streaming Analytics Phases - Acquiring streaming data - Transforming streaming data - Output the resulting data - Demo – Stream Processing with Spark Streaming Apache Spark Streaming Introduction Spark offers two different stream-processing APIs, • Spark Streaming and • Structured Streaming: Spark Streaming: This is an API and a set of connectors, in which a Spark program is being served small batches of data collected from a stream in the form of microbatches spaced at fixed time intervals, performs a given computation, and eventually returns a result at every interval. Structured Streaming: This is an API and a set of connectors, built on the substrate of a SQL query optimizer, Catalyst. It offers an API based on DataFrames and the notion of continuous queries over an unbounded table that is constantly updated with fresh records from the stream. Spark’s Memory Usage Spark offers in-memory storage of slices of a dataset, which must be initially loaded from a data source. The data source can be a distributed filesystem or another storage medium. Spark’s form of in-memory storage is analogous to the operation of caching data. Hence, a value in Spark’s in-memory storage has a base, which is its initial data source, and layers of successive operations applied to it. Failure Recovery What happens in case of a failure? Because Spark knows exactly which data source was used to ingest the data in the first place, and because it also knows all the operations that were performed on it thus far, it can reconstitute the segment of lost data that was on a crashed executor, from scratch. Obviously, this goes faster if that reconstitution (recovery, in Spark’s parlance), does not need to be totally from scratch. So, Spark offers a replication mechanism, quite in a similar way to distributed filesystems. However, because memory is such a valuable yet limited commodity, Spark makes (by default) the cache short lived. Lazy Evaluation A good part of the operations that can be defined on values in Spark’s storage have a lazy execution, and it is the execution of a final, eager output operation that will trigger the actual execution of computation in a Spark cluster. It’s worth noting that if a program consists of a series of linear operations, with the previous one feeding into the next, the intermediate results disappear right after said next step has consumed its input. Cache Hints On the other hand, what happens if we have several operations to do on a single intermediate result? Should we have to compute it several times? Thankfully, Spark lets users specify that an intermediate value is important and how its contents should be safeguarded for later. Figure below presents the data flow of such an operation.
  • 2. 2 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Figure: Operations on cached values Finally, Spark offers the opportunity to spill the cache to secondary storage in case it runs out of memory on the cluster, extending the in-memory operation to secondary —and significantly slower—storage to preserve the functional aspects of a data pro‐ cess when faced with temporary peak loads. Now that we have an idea of the main characteristics of Apache Spark, let’s spend some time focusing on one design choice internal to Spark, namely, the latency versus throughput trade-off. Understanding Resilience and Fault - Tolerance in a Distributed System Resilience and fault tolerance are absolutely essential for a distributed application: they are the condition by which we will be able to perform the user’s computation to completion. Nowadays, clusters are made of commodity machines that are ideally operated near peak capacity over their lifetime. To put it mildly, hardware breaks quite often. A resilient application can make progress with its process despite latencies and noncritical faults in its distributed environment. A fault-tolerant application is able to succeed and complete its process despite the unplanned termination of one or several of its nodes. This sort of resiliency is especially relevant in stream processing given that the applications we’re scheduling are supposed to live for an undetermined amount of time. That undetermined amount of time is often correlated with the life cycle of the data source. For example, if we are running a retail website and we are analyzing transactions and website interactions as they come into the system against the actions and clicks and navigation of users visiting the site, we potentially have a data source that will be available for the entire duration of the lifetime of our business, which we hope to be very long, if our business is going to be successful. As a consequence, a system that will process our data in a streaming fashion should run uninterrupted for long periods of time. This “show must go on” approach of streaming computation makes the resiliency and fault- tolerance characteristics of our applications more important. For a batch job, we could launch it, hope it would succeed, and relaunch if we needed to change it or in case of failure. For an online streaming Spark pipeline, this is not a reasonable assumption.
  • 3. 3 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Fault Recovery In the context of fault tolerance, we are also interested in understanding how long it takes to recover from failure of one particular node. Indeed, stream processing has a particular aspect: data continues being generated by the data source in real time. To deal with a batch computing failure, we always have the opportunity to restart from scratch and accept that obtaining the results of computation will take longer. Thus, a very primitive form of fault tolerance is detecting the failure of a particular node of our deployment, stopping the computation, and restarting from scratch. That process can take more than twice the original duration that we had budgeted for that computation, but if we are not in a hurry, this still acceptable. For stream processing, we need to keep receiving data and thus potentially storing it, if the recovering cluster is not ready to assume any processing yet. This can pose a problem at a high throughput: if we try restarting from scratch, we will need not only to reprocess all of the data that we have observed since the beginning of the application—which in and of itself can be a challenge—but during that reprocessing of historical data, we will need it to continue receiving and thus potentially storing new data that was generated while we were trying to catch up. This pattern of restarting from scratch is something so intractable for streaming that we will pay special attention to Spark’s ability to restart only minimal amounts of computation in the case that a node becomes unavailable or nonfunctional. Cluster Manager Support for Fault Tolerance We want to highlight why it is still important to understand Spark’s fault tolerance guarantees, even if there are similar features present in the cluster managers of YARN, Mesos, or Kubernetes. To understand this, we can consider that cluster managers help with fault tolerance when they work hand in hand with a framework that is able to report failures and request new resources to cope with those exceptions. Spark possesses such capabilities. For example, production cluster managers such as YARN, Mesos, or Kubernetes have the ability to detect a node’s failure by inspecting endpoints on the node and asking the node to report on its own readiness and liveness state. If these cluster managers detect a failure and they have spare capacity, they will replace that node with another, made available to Spark. That particular action implies that the Spark executor code will start anew in another node, and then attempt to join the existing Spark cluster. The cluster manager, by definition, does not have introspection capabilities into the applications being run on the nodes that it reserves. Its responsibility is limited to the container that runs the user’s code. That responsibility boundary is where the Spark resilience features start. To recover from a failed node, Spark needs to do the following: • Determine whether that node contains some state that should be reproduced in the form of checkpointed files • Understand at which stage of the job a node should rejoin the computation The goal here is for us to explore that if a node is being replaced by the cluster man‐ ager, Spark has capabilities that allow it to take advantage of this new node and to distribute computation onto it.
Within this context, our focus is on Spark's responsibilities as an application, and we underline the capabilities of a cluster manager only when necessary: for instance, a node could be replaced because of a hardware failure or because its work was simply preempted by a higher-priority job. Apache Spark is blissfully unaware of the why, and focuses on the how.
Spark's cluster manager
Spark has two internal cluster managers:
The local cluster manager: This emulates the function of a cluster manager (or resource manager) for testing purposes. It reproduces the presence of a cluster of distributed machines using a threading model that relies on your local machine having only a few available cores. This mode is easy to reason about because it executes only on the user's laptop.
The standalone cluster manager: A relatively simple, Spark-only cluster manager that is rather limited in its ability to slice and dice resource allocation. The standalone cluster manager holds and makes available the entire worker node on which a Spark executor is deployed and started. It also expects the executor to have been predeployed there, and the actual shipping of that .jar to a new machine is not within its scope. It has the ability to take a specific number of executors, which are part of its deployment of worker nodes, and execute tasks on them. This cluster manager is extremely useful for Spark developers, because it provides a bare-bones resource-management solution that allows them to focus on improving Spark in an environment without any bells and whistles. The standalone cluster manager is not recommended for production deployments.
As a summary, Apache Spark is a task scheduler in that what it schedules are tasks, units of distribution of computation that have been extracted from the user program. Spark also communicates with and is deployed through cluster managers, including Apache Mesos, YARN, and Kubernetes, or, in some cases, its own standalone cluster manager. The purpose of that communication is to reserve a number of executors, which are the units in which Spark understands equal-sized amounts of computation resources, a virtual "node" of sorts. The reserved resources in question could be provided by the cluster manager as the following:
• Limited processes (e.g., in some basic use cases of YARN), in which processes have their resource consumption metered but are not prevented from accessing each other's resources by default.
• Containers (e.g., in the case of Mesos or Kubernetes), in which containers are a relatively lightweight resource-reservation technology born out of the cgroups and namespaces of the Linux kernel, whose most popular iteration is the Docker project.
• Either of the above deployed on virtual machines (VMs), themselves coming with specific core and memory reservations.
Data Delivery Semantics in Spark
As you have seen in the streaming model, the fact that streaming jobs act on the basis of data that is generated in real time means that intermediate results need to be provided to the consumer of that streaming pipeline on a regular basis.
  • 5. 5 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Those results are being produced by some part of our cluster. Ideally, we would like those observable results to be coherent, in line, and in real time with respect to the arrival of data. This means that we want results that are exact, and we want them as soon as possible. However, distributed computation has its own challenges in that it sometimes includes not only individual nodes failing, as we have mentioned, but it also encounters situations like network partitions, in which some parts of our cluster are not able to communicate with other parts of that cluster, as illustrated in below Figure. Figure: A network partition Spark has been designed using a driver/executor architecture. A specific machine, the driver, is tasked with keeping track of the job progression along with the job submissions of a user, and the computation of that program occurs as the data arrives. How‐ ever, if the network partitions separate some part of the cluster, the driver might be able to keep track of only the part of the executors that form the initial cluster. In the other section of our partition, we will find nodes that are entirely able to function, but will simply be unable to account for the proceedings of their computation to the driver. This creates an interesting case in which those “zombie” nodes do not receive new tasks, but might well be in the process of completing some fragment of computation that they were previously given. Being unaware of the partition, they will report their results as any executor would. And because this reporting of results sometimes does not go through the driver (for fear of making the driver a bottleneck), the reporting of these zombie results could succeed. Because the driver, a single point of bookkeeping, does not know that those zombie executors are still functioning and reporting results, it will reschedule the same tasks that the lost executors had to accomplish on new nodes. This creates a double answering problem in which the zombie machines lost through partitioning and the machines bearing the rescheduled tasks both report the same results. This bears real consequences: one example of stream computation that we previously mentioned is routing tasks for financial transactions. A double withdrawal, in that context, or double stock purchase orders, could have tremendous consequences. It is not only the aforementioned problem that causes different processing semantics. Another important reason is that when output from a stream-processing application and state checkpointing cannot be completed in one atomic operation, it will cause data corruption if failure happens between checkpointing and outputting.
  • 6. 6 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM These challenges have therefore led to a distinction between at least once processing and at most once processing: • At least once: This processing ensures that every element of a stream has been processed once or more. • At most once: This processing ensures that every element of the stream is processed once or less. • Exactly once: This is the combination of “at least once” and “at most once.” At-least-once processing is the notion that we want to make sure that every chunk of initial data has been dealt with—it deals with the node failure we were talking about earlier. As we’ve mentioned, when a streaming process suffers a partial failure in which some nodes need to be replaced or some data needs to be recomputed, we need to reprocess the lost units of computation while keeping the ingestion of data going. That requirement means that if you do not respect at- least-once processing, there is a chance for you, under certain conditions, to lose data. The antisymmetric notion is called at-most-once processing. At-most-once processing systems guarantee that the zombie nodes repeating the same results as a rescheduled node are treated in a coherent manner, in which we keep track of only one set of results. By keeping track of what data their results were about, we’re able to make sure we can discard repeated results, yielding at-most- once processing guarantees. The way in which we achieve this relies on the notion of idempotence applied to the “last mile” of result reception. Idempotence qualifies a function such that if we apply it twice (or more) to any data, we will get the same result as the first time. This can be achieved by keeping track of the data that we are reporting a result for, and having a bookkeeping system at the output of our streaming process. Microbatching Two important approaches to stream processing: • bulk-synchronous processing, and • one-at-a-time record processing. The objective of this is to connect those two ideas to the two APIs that Spark possesses for stream processing: Spark Streaming and Structured Streaming. Microbatching: An Application of Bulk-Synchronous Processing Spark Streaming, the more mature model of stream processing in Spark, is roughly approximated by what’s called a Bulk Synchronous Parallelism (BSP) system. The gist of BSP is that it includes two things: • A split distribution of asynchronous work • A synchronous barrier, coming in at fixed intervals The split is the idea that each of the successive steps of work to be done in streaming is separated in a number of parallel chunks that are roughly proportional to the number of executors available to perform this task. Each executor receives its own chunk (or chunks) of work and works separately until the second element comes in. A particular resource is tasked with keeping track of the progress of computation. With Spark Streaming, this is a synchronization point at the “driver” that allows the work to progress to the next step. Between those scheduled steps, all of the executors on the cluster are doing the same thing.
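To make the fixed-interval barrier concrete, here is a minimal Spark Streaming (DStream) sketch; it is not taken from the original material, the two-second batch interval plays the role of the synchronization barrier, and the socket host and port are placeholder values:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("bsp-microbatch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))    // 2-second batch interval = barrier frequency

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder host and port
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                       // executed once per 2-second microbatch

ssc.start()
ssc.awaitTermination()
Each print happens at the close of a microbatch, which is the observable effect of the synchronization barrier described above.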
  • 7. 7 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Note that what is being passed around in this scheduling process are the functions that describe the processing that the user wants to execute on the data. The data is already on the various executors, most often being delivered directly to these resources over the lifetime of the cluster. This was coined “function-passing style” by Heather Miller in 2016 (and formalized in [Miller2016]): asynchronously pass safe functions to distributed, stationary, immutable data in a stateless container, and use lazy combinators to eliminate intermediate data structures. The frequency at which further rounds of data processing are scheduled is dictated by a time interval. This time interval is an arbitrary duration that is measured in batch processing time; that is, what you would expect to see as a “wall clock” time observa‐ tion in your cluster. For stream processing, we choose to implement barriers at small, fixed intervals that better approximate the real-time notion of data processing One-Record-at-a-Time Processing By contrast, one-record-at-a-time processing functions by pipelining: it analyzes the whole computation as described by user-specified functions and deploys it as pipelines using the resources of the cluster. Then, the only remaining matter is to flow data through the various resources, following the prescribed pipeline. Note that in this latter case, each step of the computation is materialized at some place in the cluster at any given point. Systems that function mostly according to this paradigm include Apache Flink, Naiad, Storm, and IBM Streams. This does not necessarily mean that those systems are incapable of microbatching, but rather characterizes their major or most native mode of operation and makes a statement on their dependency on the process of pipelining, often at the heart of their processing. The minimum latency, or time needed for the system to react to the arrival of one particular event, is very different between those two: minimum latency of the micro‐ batching system is therefore the time needed to complete the reception of the current microbatch (the batch interval) plus the time needed to start a task at the executor where this data falls (also called scheduling time). On the other hand, a system pro‐ cessing records one by one can react as soon as it meets the event of interest. Microbatching Versus One-at-a-Time: The Trade-Offs Despite their higher latency, microbatching systems offer significant advantages: • They are able to adapt at the synchronization barrier boundaries. That adaptation might represent the task of recovering from failure, if a number of executors have been shown to become deficient or lose data. The periodic synchronization can also give us an opportunity to add or remove executor nodes, giving us the possibility to grow or shrink our resources depending on what we’re seeing as the cluster load, observed through the throughput on the data source. • Our BSP systems can sometimes have an easier time providing strong consistency because their batch determinations—that indicate the beginning and the end of a particular batch of data—are deterministic and recorded. Thus, any kind of computation can be redone and produce the same results the second time. • Having data available as a set that we can probe or inspect at the beginning of the microbatch allows us to perform efficient optimizations that can provide ideas on the way to compute on the data. 
Exploiting that on each microbatch, we can consider the specific
case rather than the general processing, which is used for all possible input. For example, we could take a sample or compute a statistical measure before deciding to process or drop each microbatch. More importantly, the simple presence of the microbatch as a well-identified element also allows an efficient way of specifying programming for both batch processing (where the data is at rest and has been saved somewhere) and streaming (where the data is in flight). The microbatch, even for mere instants, looks like data at rest.
Dynamic Batch Interval
What is this notion of dynamic batch interval? The dynamic batch interval is the notion that the recomputation of data in a streaming DataFrame or Dataset consists of an update of existing data with the new elements seen over the wire. This update occurs based on a trigger, and the usual basis for it is a time duration. That time duration is still determined by a fixed wall-clock signal that we expect to be synchronized within our entire cluster and that represents a single synchronous source of time shared among every executor. However, this trigger can also be the statement of "as often as possible." That statement is simply the idea that a new batch should be started as soon as the previous one has been processed, given a reasonable initial duration for the first batch. This means that the system will launch batches as often as possible. In this situation, the observable latency is closer to that of one-element-at-a-time processing. The idea here is that the microbatches produced by this system will converge to the smallest manageable size, making our stream flow faster through the executor computations that are necessary to produce a result. As soon as that result is produced, a new query will be started and scheduled by the Spark driver.
Structured Stream processing model
The main steps in Structured Streaming processing are as follows:
1. When the Spark driver triggers a new batch, processing starts with updating the accounting of data read from the data source, in particular, getting the data offsets for the beginning and the end of the latest batch.
2. This is followed by logical planning, the construction of successive steps to be executed on the data, followed by query planning (intrastep optimization).
3. Then comes the launch and scheduling of the actual computation, adding a new batch of data to update the continuous query that we are trying to refresh.
Hence, from the point of view of the computation model, we will see that the API is significantly different from Spark Streaming.
The Disappearance of the Batch Interval
We now briefly explain what Structured Streaming batches mean and their impact with respect to operations. In Structured Streaming, the batch interval that we are using is no longer a computation budget. With Spark Streaming, the idea was that if we produce data every two minutes and flow data into Spark's memory every two minutes, we should produce the results of computation on that batch of data in at most two minutes, to clear the memory of our cluster for the next microbatch. Ideally, as much data flows out as flows in, and the usage of the collective memory of our cluster remains stable.
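As a hedged illustration of the difference between a fixed trigger and the "as often as possible" default, the following sketch uses a placeholder streaming DataFrame named events and the console sink; it is not part of the original example:
import org.apache.spark.sql.streaming.Trigger

// Default behavior (no trigger): the next microbatch starts as soon as the previous one finishes.
val asapQuery = events.writeStream.format("console").start()

// Fixed trigger: closer to the classic batch interval, producing results every 30 seconds.
// val timedQuery = events.writeStream
//   .format("console")
//   .trigger(Trigger.ProcessingTime("30 seconds"))
//   .start()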
With Structured Streaming, without this fixed time synchronization, our ability to see performance issues in our cluster is more complex: a cluster that is unstable—that is, unable to "clear out" data by finishing computation on it as fast as new data flows in—will see ever-growing batch processing times, with an accelerating growth. We can expect that keeping a hand on this batch processing time will be pivotal. However, if we have a cluster that is correctly sized with respect to the throughput of our data, there are a lot of advantages to having an as-often-as-possible update. In particular, we should expect to see very frequent results from our Structured Streaming cluster, with a higher granularity than we were used to with a conservative batch interval.
Spark Streaming Resilience Model
In most cases, a streaming job is a long-running job. By definition, streams of data observed and processed over time lead to jobs that run continuously. As they process data, they might accumulate intermediary results that are difficult to reproduce after the data has left the processing system. Therefore, the cost of failure is considerable and, in some cases, complete recovery is intractable. In distributed systems, especially those relying on commodity hardware, failure is a function of size: the larger the system, the higher the probability that some component fails at any time. Distributed stream processors need to factor this chance of failure into their operational model. We look at the resilience that the Apache Spark platform provides us: how it is able to recover from partial failure and what kinds of guarantees we are given for the data passing through the system when a failure occurs. We begin by getting an overview of the different internal components of Spark and their relation to the core data structure. With this knowledge, you can proceed to understand the impact of failure at the different levels and the measures that Spark offers to recover from such failure.
RDDs and DStreams
Spark builds its data representations on Resilient Distributed Datasets (RDDs). Introduced in 2011 by the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" [Zaharia2011], RDDs are the foundational data structure in Spark. It is at this ground level that the strong fault-tolerance guarantees of Spark start. RDDs are composed of partitions, which are segments of data stored on individual nodes and tracked by the Spark driver, presented to the user as a location-transparent data structure. We illustrate these components in the figure below, in which the classic word count application is broken down into the different elements that comprise an RDD.
Figure: An RDD operation represented in a distributed system
The colored blocks are data elements, originally stored in a distributed filesystem, represented on the far left of the figure. The data is stored as partitions, illustrated as columns of colored blocks inside the file. Each partition is read into an executor, which we see as the horizontal blocks. The actual data processing happens within the executor. There, the data is transformed following the transformations described at the RDD level:
• .flatMap(l => l.split(" ")) separates sentences into words separated by spaces.
• .map(w => (w,1)) transforms each word into a tuple of the form (word, 1), in this way preparing the words for counting.
• .reduceByKey(_ + _) computes the count, using the word as a key and applying a sum operation to the attached number.
• The final result is attained by bringing the partial results together using the same reduce operation.
RDDs constitute the programmatic core of Spark. All other abstractions, batch and streaming alike, including DataFrames, Datasets, and DStreams, are built using the facilities created by RDDs, and, more important, they inherit the same fault-tolerance capabilities. Another important characteristic of RDDs is that Spark will try to keep their data preferably in memory for as long as it is required, provided there is enough capacity in the system. This behavior is configurable through storage levels and can be explicitly controlled by calling caching operations. We mention those structures here to present the idea that Spark tracks the progress of the user's computation through modifications of the data. Indeed, knowing how far along we are in what the user wants to do by inspecting the control flow of their program (including loops and potential recursive calls) can be a daunting and error-prone task. It is much more reliable to define types of distributed data collections, and let the user create one from another, or from other data sources. In the figure below, we show the same word count program, now in the form of the user-provided code (left) and the resulting internal RDD chain of operations. This dependency chain forms a particular kind of graph, a Directed Acyclic Graph (DAG). The DAG informs the scheduler, appropriately called the DAG Scheduler, on how to distribute the computation, and is also the foundation of the failure-recovery functionality, because it represents the internal data and their dependencies.
Figure: RDD lineage
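A runnable sketch of this word count is shown below; it is not part of the original slides, and the SparkSession settings and input path are placeholder assumptions:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-wordcount")
  .master("local[*]")              // local cluster manager, for illustration only
  .getOrCreate()
val sc = spark.sparkContext

// "hdfs:///data/input.txt" is a placeholder for the distributed file in the figure
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(l => l.split(" "))      // sentences -> words
  .map(w => (w, 1))                // word -> (word, 1)
  .reduceByKey(_ + _)              // sum the 1s per word; introduces a shuffle

counts.take(10).foreach(println)
// The lineage (DAG) that Spark keeps for fault recovery can be printed with:
println(counts.toDebugString)
toDebugString prints the chain of parent RDDs, which corresponds to the lineage shown in the figure.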
  • 11. 11 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM As the system tracks the ordered creation of these distributed data collections, it tracks the work done, and what’s left to accomplish. Data Structures in Spark To understand at what level fault tolerance operates in Spark, it’s useful to go through an overview of the nomenclature of some core concepts. We begin by assuming that the user provides a program that ends up being divided into chunks and executed on various machines, as we saw in the previous section, and as depicted in below Figure. Figure: Spark nomenclature Let’s run down those steps, which define the vocabulary of the Spark runtime: User Program The user application in Spark Streaming is composed of user-specified function calls operating on a resilient data structure (RDD, DStream, streaming DataSet, and so on), categorized as actions and transformations. Transformed User Program The user program may undergo adjustments that modify some of the specified calls to make them simpler, the most approachable and understandable of which is map-fusion. Query plan is a similar but more advanced concept in Spark SQL. RDD A logical representation of a distributed, resilient, dataset. In the illustration, we see that the initial RDD comprises three parts, called partitions. Partition A partition is a physical segment of a dataset that can be loaded independently. Stages The user’s operations are then grouped into stages, whose boundary separates user operations into steps that must be executed separately. For example, operations that require a shuffle of data across multiple nodes, such as a join between the results of two distinct upstream operations, mark a distinct stage. Stages in Apache Spark are the unit of sequencing: they are executed one after the other. At most one of any interdependent stages can be running at any given time. Jobs After these stages are defined, what internal actions Spark should take is clear. Indeed, at this stage, a set of interdependent jobs is defined. And jobs, precisely, are the vocabulary for a unit of scheduling. They describe the work at hand from the point of view of an entire Spark cluster, whether it’s waiting in a queue or currently being run across many machines.
Tasks
Depending on where their source data is on the cluster, jobs can then be cut into tasks, crossing the conceptual boundary between distributed and single-machine computing: a task is a unit of local computation, the name for the local, executor-bound part of a job.
Spark aims to make sure that all of these steps are safe from harm and to recover quickly in the case of any incident occurring at any stage of this process. This concern is reflected in fault-tolerance facilities that are structured by the aforementioned notions: restart and checkpointing operations that occur at the task, job, stage, or program level.
Spark Fault Tolerance Guarantees
Now that we have seen the "pieces" that constitute the internal machinery in Spark, we are ready to understand that failure can happen at many different levels. In this section, we see Spark fault-tolerance guarantees organized by "increasing blast radius," from the more modest to the larger failure. We are going to investigate the following:
• How Spark mitigates Task failure through restarts
• How Spark mitigates Stage failure through the shuffle service
• How Spark mitigates the disappearance of the orchestrator of the user program, through driver restarts
Task Failure Recovery
Tasks can fail when the infrastructure on which they are running has a failure, or when logical conditions in the program lead to a sporadic failure, like OutOfMemory, network, or storage errors, or problems bound to the quality of the data being processed. If the input data of the task was stored through a call to cache() or persist(), and if the chosen storage level implies replication of the data, the task does not need to have its input recomputed, because a copy of it exists in complete form on another machine of the cluster. We can then use this input to restart the task. The storage levels configurable in Spark differ in their characteristics in terms of memory usage and replication factor.
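As a hedged illustration of these storage levels (reusing the counts RDD from the earlier word count sketch), one might write:
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_2 keeps two in-memory replicas on different executors, so a failed task
// can restart from the surviving copy instead of replaying the lineage from the source.
counts.persist(StorageLevel.MEMORY_ONLY_2)

// MEMORY_AND_DISK (no replication) spills partitions that do not fit in memory to the
// executor's local disk instead of dropping them, trading speed for stability.
// counts.persist(StorageLevel.MEMORY_AND_DISK)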
If, however, there was no persistence, or if the storage level does not guarantee the existence of a copy of the task's input data, the Spark driver will need to consult the DAG that stores the user-specified computation to determine which segments of the job need to be recomputed. Consequently, without enough precautions to save either on the caching or on the storage level, the failure of a task can trigger the recomputation of several others, up to a stage boundary. Stage boundaries imply a shuffle, and a shuffle implies that intermediate data will somehow be materialized: as we discussed, the shuffle transforms executors into data servers that can provide the data to any other executor serving as a destination. As a consequence, executors that participated in a shuffle have a copy of the map operations that led up to it. That is a lifesaver if you have a dying downstream executor, which can rely on the upstream servers of the shuffle (which serve the output of the map-like operations). But what if it's the contrary: what if you need to face the crash of one of the upstream executors?
Stage Failure Recovery
We've seen that task failure (possibly due to executor crash) is the most frequent incident happening on a cluster and hence the most important event to mitigate. Recurrent task failures will lead to the failure of the stage that contains that task. This brings us to the second facility that allows Spark to resist arbitrary stage failures: the shuffle service. When this failure occurs, it always means some rollback of the data, because a shuffle operation, by definition, depends on all of the prior executors involved in the step that precedes it. As a consequence, since Spark 1.3 we have had the shuffle service, which lets you work on map data that is saved and distributed through the cluster with good locality, but, more important, through a server that is not a Spark task. It is an external file-exchange service written in Java that has no dependency on Spark and is made to be a much longer-running service than a Spark executor. This additional service attaches as a separate process in all cluster modes of Spark and simply offers a data-file exchange for executors to transmit data reliably, right before a shuffle (a minimal configuration sketch appears at the end of this section). It is highly optimized through the use of a netty backend, to allow a very low overhead in transmitting data. This way, an executor can shut down after the execution of its map task, as soon as the shuffle service has a copy of its data. And because data transfers are faster, this transfer time is also highly reduced, reducing the vulnerable time in which any executor could face an issue.
Driver Failure Recovery
Having seen how Spark recovers from the failure of a particular task and stage, we can now look at the facilities Spark offers to recover from the failure of the driver program. The driver in Spark has an essential role: it is the depository of the block manager, which knows where each block of data resides in the cluster. It is also the place where the DAG lives. Finally, it is where the scheduling state of the job, its metadata, and its logs reside. Hence, if the driver is lost, a Spark cluster as a whole might well have lost which stage it has reached in computation, what the computation actually consists of, and where the data that serves it can be found, in one fell swoop.
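Returning briefly to the shuffle service mentioned above: enabling it is a configuration concern. A minimal sketch, using standard Spark property names, could look as follows; the deployment side (for example, installing the auxiliary service on YARN NodeManagers) is outside the scope of this snippet:
import org.apache.spark.SparkConf

val shuffleAwareConf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")    // executors hand map output to the external service
  .set("spark.dynamicAllocation.enabled", "true")  // optional: executors can then be released or replaced safely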
Cluster-mode deployment Spark has implemented what’s called the cluster deployment mode, which allows the driver program to be hosted on the cluster, as opposed to the user’s computer. The deployment mode is one of two options: in client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched
from one of the worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application, without waiting for the application to finish. This, in sum, allows Spark to operate an automatic driver restart, so that the user can start a job in a "fire-and-forget" fashion, starting the job and then closing their laptop to catch the next train. Every cluster mode of Spark offers a web UI that will let the user access the logs of their application. Another advantage is that driver failure does not mark the end of the job, because the driver process will be relaunched by the cluster manager. But this only allows recovery from scratch, given that the temporary state of the computation—previously stored in the driver machine—might have been lost.
Checkpointing
To avoid losing intermediate state in case of a driver crash, Spark offers the option of checkpointing; that is, periodically recording a snapshot of the application's state to disk. The directory set through sparkContext.setCheckpointDir() should point to reliable storage (e.g., Hadoop Distributed File System [HDFS]), because having the driver try to reconstruct the state of intermediate RDDs from its local filesystem makes no sense: those intermediate RDDs are being created on the executors of the cluster and should as such not require any interaction with the driver for backing them up.
First Steps in Structured Streaming
In the previous section, we learned about the high-level concepts that constitute Structured Streaming, such as sources, sinks, and queries. We are now going to explore Structured Streaming from a practical perspective, using a simplified web log analytics use case as an example. Before we begin delving into our first streaming application, we are going to see how classical batch analysis in Apache Spark can be applied to the same use case. This exercise has two main goals:
• First, most, if not all, streaming data analytics start by studying a static data sample. It is far easier to start a study with a file of data, gain intuition on how the data looks, what kind of patterns it shows, and define the process that we require to extract the intended knowledge from that data. Typically, it's only after we have defined and tested our data analytics job that we proceed to transform it into a streaming process that can apply our analytic logic to data on the move.
• Second, from a practical perspective, we can appreciate how Apache Spark simplifies many aspects of transitioning from a batch exploration to a streaming application through the use of uniform APIs for both batch and streaming analytics. This exploration will allow us to compare and contrast the batch and streaming APIs in Spark and show us the necessary steps to move from one to the other.
Batch Analytics
Given that we are working with archived log files, we have access to all of the data at once. Before we begin building our streaming application, let's take a brief intermezzo to have a look at what a classical batch analytics job would look like. First, we load the log files, encoded as JSON, from the directory where we unpacked them:
// This is the location of the unpacked files. Update accordingly
val logsDirectory = ???
val rawLogs = sparkSession.read.json(logsDirectory)
Next, we declare the schema of the data as a case class to use the typed Dataset API. Following the formal description of the dataset (at NASA-HTTP), the log is structured as follows: The logs are an ASCII file with one line per request, with the following columns:
• Host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
• Timestamp in the format "DAY MON DD HH:MM:SS YYYY," where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
• Request given in quotes.
• HTTP reply code.
• Bytes in the reply.
Translating that schema to Scala, we have the following case class definition:
import java.sql.Timestamp
case class WebLog(host: String,
                  timestamp: Timestamp,
                  request: String,
                  http_reply: Int,
                  bytes: Long)
We convert the original JSON to a typed data structure using the previous schema definition:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// we need to narrow the `Integer` type because
// the JSON representation is interpreted as `BigInteger`
val preparedLogs = rawLogs.withColumn("http_reply", $"http_reply".cast(IntegerType))
val weblogs = preparedLogs.as[WebLog]
Now that we have the data in a structured format, we can begin asking the questions that interest us. As a first step, we would like to know how many records are contained in our dataset:
val recordCount = weblogs.count
> recordCount: Long = 1871988
A common question would be: "what was the most popular URL per day?" To answer that, we first reduce the timestamp to the day of the month. We then group by this new dayOfMonth column and the request URL, and we count over this aggregate. We finally order using descending order to get the top URLs first:
val topDailyURLs = weblogs.withColumn("dayOfMonth", dayofmonth($"timestamp"))
  .select($"request", $"dayOfMonth")
  .groupBy($"dayOfMonth", $"request")
  .agg(count($"request").alias("count"))
  .orderBy(desc("count"))
topDailyURLs.show()
+----------+----------------------------------------+-----+
|dayOfMonth| request|count|
+----------+----------------------------------------+-----+
| 13|GET /images/NASA-logosmall.gif HTTP/1.0 |12476|
| 13|GET /htbin/cdt_main.pl HTTP/1.0 | 7471|
| 12|GET /images/NASA-logosmall.gif HTTP/1.0 | 7143|
| 13|GET /htbin/cdt_clock.pl HTTP/1.0 | 6237|
| 6|GET /images/NASA-logosmall.gif HTTP/1.0 | 6112|
| 5|GET /images/NASA-logosmall.gif HTTP/1.0 | 5865|
...
Top hits are all images. What now? It's not unusual to see that the top URLs are images commonly used across a site. Our true interest lies in the content pages generating the most traffic. To find those, we first filter on HTML content and then proceed to apply the top aggregation we just learned. As we can see, the request field is a quoted sequence of [HTTP_VERB] URL [HTTP_VERSION]. We will extract the URL and preserve only those ending in .html, .htm, or no extension (directories). This is a simplification for the purpose of this example:
val urlExtractor = """^GET (.+) HTTP/\d\.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs = weblogs.filter { log =>
  log.request match {
    case urlExtractor(url) =>
      val ext = url.takeRight(5).dropWhile(c => c != '.')
      allowedExtensions.contains(ext)
    case _ => false
  }
}
With this new dataset that contains only .html, .htm, and directories, we proceed to apply the same top-k function as earlier:
val topContentPages = contentPageLogs
  .withColumn("dayOfMonth", dayofmonth($"timestamp"))
  .select($"request", $"dayOfMonth")
  .groupBy($"dayOfMonth", $"request")
  .agg(count($"request").alias("count"))
  .orderBy(desc("count"))
topContentPages.show()
+----------+------------------------------------------------+-----+
|dayOfMonth| request|count|
+----------+------------------------------------------------+-----+
| 13| GET /shuttle/countdown/liftoff.html HTTP/1.0" | 4992|
| 5| GET /shuttle/countdown/ HTTP/1.0" | 3412|
| 6| GET /shuttle/countdown/ HTTP/1.0" | 3393|
| 3| GET /shuttle/countdown/ HTTP/1.0" | 3378|
| 13| GET /shuttle/countdown/ HTTP/1.0" | 3086|
| 7| GET /shuttle/countdown/ HTTP/1.0" | 2935|
| 4| GET /shuttle/countdown/ HTTP/1.0" | 2832|
| 2| GET /shuttle/countdown/ HTTP/1.0" | 2330|
...
We can see that the most popular page that month was liftoff.html, corresponding to the coverage of the launch of the Discovery shuttle, as documented on the NASA archives. It's closely followed by countdown/, the days prior to the launch.
Streaming Analytics Phases
In the previous section, we explored historical NASA web log records. We found trending events in those records, but much later than when the actual events happened. One key driver for streaming analytics comes from the increasing demand of organizations to have timely information that can help them make decisions at many different levels. We can use the lessons that we have learned while exploring the archived records using a batch-oriented approach and create a streaming job that will provide us with trending information as it happens. The first difference that we observe with the batch analytics is the source of the data. For our streaming exercise, we will use a TCP server to simulate a web system that delivers its logs in real time. The simulator will use the same dataset but will feed it through a TCP socket connection that will embody the stream that we will be analyzing.
Connecting to a Stream
If you recall from the introduction of this chapter, Structured Streaming defines the concepts of sources and sinks as the key abstractions to consume a stream and produce a result. We are going to use the TextSocketSource implementation to connect to the server through a TCP socket. Socket connections are defined by the host of the server and the port where it is listening for connections. These two configuration elements are required to create the socket source:
val stream = sparkSession.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load()
Note how the creation of a stream is quite similar to the declaration of a static data source in the batch case. Instead of using the read builder, we use the readStream construct, and we pass to it the parameters required by the streaming source. As you will see during the course of this exercise and later on as we go into the details of Structured Streaming, the API is basically the same DataFrame and Dataset API for static data, but with some modifications and limitations that you will learn about in detail.
Preparing the Data in the Stream
The socket source produces a streaming DataFrame with one column, value, which contains the data received from the stream. In the batch analytics case, we could load the data directly as JSON records. In the case of the socket source, that data is plain text. To transform our raw data to WebLog records, we first require a schema. The schema provides the necessary information to parse the text into a JSON object. It is the structure we refer to when we talk about Structured Streaming.
  • 18. 18 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM After defining a schema for our data, we proceed to create a Dataset, following these steps: import java.sql.Timestamp case class WebLog(host:String, timestamp: Timestamp, request: String, http_reply:Int, bytes: Long ) val webLogSchema = Encoders.product[WebLog].schema val jsonStream = stream.select(from_json($"value", webLogSchema) as "record") val webLogStream: Dataset[WebLog] = jsonStream.select("record.*").as[WebLog] 1. Obtain a schema from the case class definition 2. Transform the text value to JSON using the JSON support built into Spark SQL 3. Use the Dataset API to transform the JSON records to WebLog objects As a result of this process, we obtain a Streaming Dataset of WebLog records. Operations on Streaming Dataset The webLogStream we just obtained is of type Dataset[WebLog] like we had in the batch analytics job. The difference between this instance and the batch version is that webLogStream is a streaming Dataset. We can observe this by querying the object: webLogStream.isStreaming > res: Boolean = true At this point in the batch job, we were creating the first query on our data: How many records are contained in our dataset? This is a question that we can easily answer when we have access to all of the data. However, how do we count records that are constantly arriving? The answer is that some operations that we consider usual on a static Dataset, like counting all records, do not have a defined meaning on a Streaming Dataset. As we can observe, attempting to execute the count query in the following code snippet will result in an AnalysisException: val count = webLogStream.count() > org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; This means that the direct queries we used on a static Dataset or DataFrame now need two levels of interaction. First, we need to declare the transformations of our stream, and then we need to start the stream process. Creating a Query What are popular URLs? In what time frame? Now that we have immediate analytic access to the stream of web logs, we don’t need to wait for a day or a month to have a rank of the popular URLs. We can have that information as trends unfold in much shorter windows of time.
First, to define the period of time of our interest, we create a window over some timestamp. An interesting feature of Structured Streaming is that we can define that time interval on the timestamp when the data was produced, also known as event time, as opposed to the time when the data is being processed. Our window definition will be of five minutes of event data. Given that our timeline is simulated, the five minutes might happen much faster or slower than the clock time. In this way, we can clearly appreciate how Structured Streaming uses the timestamp information in the events to keep track of the event timeline. As we learned from the batch analytics, we should extract the URLs and select only content pages, like .html, .htm, or directories. Let's apply that acquired knowledge first before proceeding to define our windowed query:
// A regex expression to extract the accessed URL from weblog.request
val urlExtractor = """^GET (.+) HTTP/\d\.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs: String => Boolean = url => {
  val ext = url.takeRight(5).dropWhile(c => c != '.')
  allowedExtensions.contains(ext)
}
val urlWebLogStream = webLogStream.flatMap { weblog =>
  weblog.request match {
    case urlExtractor(url) if (contentPageLogs(url)) => Some(weblog.copy(request = url))
    case _ => None
  }
}
We have converted the request to contain only the visited URL and filtered out all noncontent pages. Now, we define the windowed query to compute the top trending URLs:
val rankingURLStream = urlWebLogStream
  .groupBy($"request", window($"timestamp", "5 minutes", "1 minute"))
  .count()
Start the Stream Processing
All of the steps that we have followed so far have been to define the process that the stream will undergo. But no data has been processed yet. To start a Structured Streaming job, we need to specify a sink and an output mode. These are two new concepts introduced by Structured Streaming:
• A sink defines where we want to materialize the resulting data; for example, to a file in a filesystem, to an in-memory table, or to another streaming system such as Kafka.
• The output mode defines how we want the results to be delivered: do we want to see all data every time, only updates, or just the new records?
  • 20. 20 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM These options are given to a writeStream operation. It creates the streaming query that starts the stream consumption, materializes the computations declared on the query, and produces the result to the output sink. For now, let’s use them empirically and observe the results. For our query, shown in below Example, we use the memory sink and output mode complete to have a fully updated table each time new records are added to the result of keeping track of the URL ranking. Example. Writing a stream to a sink val query = rankingURLStream.writeStream .queryName("urlranks") .outputMode("complete") .format("memory") .start() The memory sink outputs the data to a temporary table of the same name given in the queryName option. We can observe this by querying the tables registered on Spark SQL: scala> spark.sql("show tables").show() +--------+---------+-----------+ |database|tableName|isTemporary| +--------+---------+-----------+ | | urlranks| true| +--------+---------+-----------+ In the expression in Example, query is of type StreamingQuery and it’s a handler to control the query life cycle. Exploring the Data Given that we are accelerating the log timeline on the producer side, after a few seconds, we can execute the next command to see the result of the first windows, as illustrated in Figure. Note how the processing time (a few seconds) is decoupled from the event time (hun‐ dreds of minutes of logs): urlRanks.select($"request", $"window", $"count").orderBy(desc("count")) Figure: URL ranking: query results by window
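The snippet above assumes a handle named urlRanks that the excerpt does not show being created; one plausible way to obtain it from the in-memory table registered by the memory sink is the following hypothetical definition:
// Hypothetical: read back the in-memory table registered under the query name "urlranks";
// assumes the usual sparkSession.implicits._ and org.apache.spark.sql.functions._ imports.
val urlRanks = sparkSession.sql("select * from urlranks")
urlRanks.select($"request", $"window", $"count").orderBy(desc("count")).show(false)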
Acquiring streaming data
In Structured Streaming, a source is an abstraction that lets us consume data from a streaming data producer. Sources are not directly created. Instead, the sparkSession provides a builder method, readStream, that exposes the API to specify a streaming source, called a format, and provide its configuration. For example, the code in the example below creates a File streaming source. We specify the type of source using the format method. The method schema lets us provide a schema for the data stream, which is mandatory for certain source types, such as this File source.
Example: File streaming source
val fileStream = spark.readStream
  .format("json")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .load("/tmp/datasrc")
> fileStream: org.apache.spark.sql.DataFrame = [id: string, timestamp: timestamp ... ]
Each source implementation has different options, and some have tunable parameters. In the example, we are setting the option mode to DROPMALFORMED. This option instructs the JSON stream processor to drop any line that neither complies with the JSON format nor matches the provided schema. Behind the scenes, the call to spark.readStream creates a DataStreamReader instance. This instance is in charge of managing the different options provided through the builder method calls. Calling load(...) on this DataStreamReader instance validates the options provided to the builder and, if everything checks out, returns a streaming DataFrame. In our example, this streaming DataFrame represents the stream of data that will result from monitoring the provided path and processing each new file in that path as JSON-encoded data, parsed using the schema provided. All malformed records will be dropped from this data stream. Loading a streaming source is lazy. What we get is a representation of the stream, embodied in the streaming DataFrame instance, that we can use to express the series of transformations that we want to apply to it in order to implement our specific business logic. Creating a streaming DataFrame does not result in any data actually being consumed or processed until the stream is materialized. This requires a query, as you will see further on.
Available Sources
As of Spark v2.4.0, the following streaming sources are supported:
• json, orc, parquet, csv, text, textFile: These are all file-based streaming sources. The base functionality is to monitor a path (folder) in a filesystem and consume files atomically placed in it. The files found will then be parsed by the formatter specified. For example, if json is provided, the Spark json reader will be used to process the files, using the schema information provided.
• socket: Establishes a client connection to a TCP server that is assumed to provide text data through a socket connection.
• kafka: Creates a Kafka consumer able to retrieve data from Kafka.
• rate: Generates a stream of rows at the rate given by the rowsPerSecond option. It's mainly intended as a testing source.
Transforming streaming data
As we saw in the previous section, the result of calling load is a streaming DataFrame. After we have created our streaming DataFrame using a source, we can use the Dataset or DataFrame API to express the logic that we want to apply to the data in the stream in order to implement our specific use case. Assuming that we are using data from a sensor network, in the example below we are selecting the fields deviceId, timestamp, sensorType, and value from a sensorStream, and filtering to only those records where the sensor is of type temperature and its value is higher than the given threshold.
Example: Filter and projection
val highTempSensors = sensorStream
  .select($"deviceId", $"timestamp", $"sensorType", $"value")
  .where($"sensorType" === "temperature" && $"value" > threshold)
Likewise, we can aggregate our data and apply operations to the groups over time. The next example shows that we can use timestamp information from the event itself to define a time window of five minutes that will slide every minute. What is important to grasp here is that the Structured Streaming API is practically the same as the Dataset API for batch analytics, with some additional provisions specific to stream processing.
Example: Average by sensor type over time
val avgBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType", $"value")
  .groupBy(window($"timestamp", "5 minutes", "1 minute"), $"sensorType")
  .agg(avg($"value"))
If you are not familiar with the structured APIs of Spark, we suggest that you familiarize yourself with them. Covering this API in detail is beyond the scope of this book.
Streaming API Restrictions on the DataFrame API
As we hinted in the previous chapter, some operations that are offered by the standard DataFrame and Dataset API do not make sense in a streaming context. We gave the example of stream.count, which does not make sense to use on a stream. In general, operations that require immediate materialization of the underlying dataset are not allowed. These are the API operations not directly supported on streams:
• count
• show
• describe
• limit
• take(n)
• distinct
• foreach
• sort
• multiple stacked aggregations
Next to these operations, stream-stream and static-stream joins are partially supported.
Understanding the limitations
Although some operations, like count or limit, do not make sense on a stream, some other stream operations are computationally difficult. For example, distinct is one of them. To filter duplicates in an arbitrary stream, it would require that you remember all of the data seen so far and compare each new record with all records already seen. The first condition would require infinite memory, and the second has a computational complexity of O(n^2), which becomes prohibitive as the number of elements (n) increases.
Operations on aggregated streams
Some of the unsupported operations become defined after we apply an aggregation function to the stream. Although we can't count the stream, we could count messages received per minute or count the number of devices of a certain type. In the example below, we define a count of events per sensorType per minute.
Example: Count of sensor types over time
val countBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType")
  .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
  .count()
Likewise, it's also possible to define a sort on aggregated data, although it's further restricted to queries with output mode complete.
Stream deduplication
We discussed that distinct on an arbitrary stream is computationally difficult to implement. But if we can define a key that informs us when an element in the stream has already been seen, we can use it to remove duplicates:
stream.dropDuplicates("uniqueId")   // "uniqueId" is a placeholder for the identifying column
Workarounds
Although some operations are not supported in the exact same way as in the batch model, there are alternative ways to achieve the same functionality:
• foreach: Although foreach cannot be directly used on a stream, there's a foreach sink that provides the same functionality. Sinks are specified in the output definition of a stream.
• show: Although show requires an immediate materialization of the query, and hence it's not possible on a streaming Dataset, we can use the console sink to output data to the screen, as sketched below.
Output the resulting data
All operations that we have done so far—such as creating a stream and applying transformations on it—have been declarative. They define from where to consume the data and what operations we want to apply to it. But up to this point, there is still no data flowing through the system. Before we can initiate our stream, we need to first define where and how we want the output data to go:
• Where relates to the streaming sink: the receiving side of our streaming data.
• How refers to the output mode: how to treat the resulting records in our stream.
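Before looking at writeStream in detail, here is a minimal sketch of the console-sink workaround for show mentioned above; sensorStream is the illustrative streaming Dataset from the earlier examples, and the option value is a placeholder:
val debugQuery = sensorStream.writeStream
  .format("console")         // console sink: the workaround for `show` on a stream
  .outputMode("append")
  .option("numRows", "20")   // rows to print per microbatch
  .start()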
  • 24. 24 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM From the API perspective, we materialize a stream by calling writeStream on a streaming DataFrame or Dataset. Calling writeStream on a streaming Dataset creates a DataStreamWriter. This is a builder instance that provides methods to configure the output behavior of our streaming process. Example. File streaming sink val query = stream.writeStream .format("json") .queryName("json-writer") .outputMode("append") .option("path", "/target/dir") .option("checkpointLocation", "/checkpoint/dir") .trigger(ProcessingTime("5 seconds")) .start() >query: org.apache.spark.sql.streaming.StreamingQuery = ... format The format method lets us specify the output sink by providing the name of a builtin sink or the fully qualified name of a custom sink. As of Spark v2.4.0, the following streaming sinks are available: • console sink A sink that prints to the standard output. It shows a number of rows configurable with the option numRows. • file sink File-based and format-specific sink that writes the results to a filesystem. The format is specified by providing the format name: csv, hive, json, orc, parquet, avro, or text. • kafka sink A Kafka-specific producer sink that is able to write to one or more Kafka topics. • memory sink Creates an in-memory table using the provided query name as table name. This table receives continuous updates with the results of the stream. • foreach sink Provides a programmatic interface to access the stream contents, one element at the time. • foreachBatch sink foreachBatch is a programmatic sink interface that provides access to the com‐ plete DataFrame that corresponds to each underlying microbatch of the Structured Streaming execution. outputMode The outputMode specifies the semantics of how records are added to the output of the streaming query. The supported modes are append, update, and complete: • append (default mode) Adds only final records to the output stream. A record is considered final when no new records of the incoming stream can modify its value. This is always the case with linear transformations like those resulting from applying projection, filtering, and mapping. This mode guarantees that each resulting record will be output only once. • update Adds new and updated records since the last trigger to the output stream. update is meaningful only in the context of an aggregation, where aggregated values change as
outputMode
The outputMode specifies the semantics of how records are added to the output of the streaming query. The supported modes are append, update, and complete:
• append (default mode): Adds only final records to the output stream. A record is considered final when no new record of the incoming stream can modify its value. This is always the case with linear transformations like those resulting from applying projection, filtering, and mapping. This mode guarantees that each resulting record is output only once.
• update: Adds new and updated records since the last trigger to the output stream. update is meaningful only in the context of an aggregation, where aggregated values change as new records arrive. If more than one incoming record changes a single result, all changes between trigger intervals are collated into one output record.
• complete: complete mode outputs the complete internal representation of the stream. This mode also relates to aggregations, because for nonaggregated streams we would need to remember all records seen so far, which is unrealistic. From a practical perspective, complete mode is recommended only when you are aggregating values over low-cardinality criteria, like a count of visitors by country, for which we know that the number of countries is bounded.
Understanding the append semantic
When the streaming query contains aggregations, the definition of final becomes nontrivial. In an aggregated computation, new incoming records might change an existing aggregated value when they comply with the aggregation criteria used. Following our definition, we cannot output a record using append until we know that its value is final. Therefore, the use of the append output mode in combination with aggregate queries is restricted to queries for which the aggregation is expressed using event time and defines a watermark. In that case, append will output an event as soon as the watermark has expired, at which point no new records can alter the aggregated value. As a consequence, output events in append mode are delayed by the aggregation time window plus the watermark offset. The sketch below illustrates this combination.
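A minimal, hedged sketch of the append-plus-watermark combination just described, reusing the hypothetical sensorStream (with timestamp and sensorType columns) from the aggregation example earlier; the 5-minute watermark, sink format, and paths are illustrative placeholders.
import org.apache.spark.sql.functions.window
// assumes `import spark.implicits._` is in scope, as in the other examples

// Event-time aggregation that is valid in append mode because it declares a watermark.
val countsBySensorType = sensorStream
  .withWatermark("timestamp", "5 minutes")               // tolerate events up to 5 minutes late
  .groupBy(window($"timestamp", "1 minute"), $"sensorType")
  .count()

val appendQuery = countsBySensorType.writeStream
  .outputMode("append")                    // each window is emitted once its watermark expires
  .format("parquet")                       // placeholder sink and paths
  .option("path", "/tmp/sensor-counts")
  .option("checkpointLocation", "/tmp/sensor-counts-checkpoint")
  .start()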
queryName
With queryName, we can provide a name for the query that is used by some sinks and also presented in the job description in the Spark Console, as depicted in the Figure.
Figure: Completed Jobs in the Spark UI showing the query name in the job description
option
With the option method, we can provide specific key–value pairs of configuration to the stream, akin to the configuration of the source. Each sink can have specific configuration that we can customize using this method. We can add as many .option(...) calls as necessary to configure the sink.
options
options is an alternative to option that takes a Map[String, String] containing all the key–value configuration parameters that we want to set. This alternative is friendlier to an externalized configuration model, where we don’t know a priori the settings to be passed to the sink’s configuration.
trigger
The optional trigger option lets us specify the frequency at which we want results to be produced. By default, Structured Streaming processes the input and produces a result as soon as possible. When a trigger is specified, output is produced at each trigger interval. org.apache.spark.sql.streaming.Trigger provides the following supported triggers:
• ProcessingTime(<interval>): Lets us specify a time interval that dictates the frequency of the query results.
• Once(): A particular Trigger that lets us execute a streaming job once. It is useful for testing and also for applying a defined streaming job as a single-shot batch operation.
• Continuous(<checkpoint-interval>): This trigger switches the execution engine to the experimental continuous engine for low-latency processing. The checkpoint-interval parameter indicates the frequency of the asynchronous checkpointing for data resilience. It should not be confused with the batch interval of the ProcessingTime trigger.
start()
To materialize the streaming computation, we need to start the streaming process. start() materializes the complete job description into a streaming computation and initiates the internal scheduling process that results in data being consumed from the source, processed, and produced to the sink. start() returns a StreamingQuery object, which is a handle to manage the individual life cycle of each query. This means that we can simultaneously start and stop multiple queries independently of one another within the same sparkSession. A hedged sketch of the trigger options and the query life cycle follows.
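The following brief sketch (not part of the course demo) shows how the trigger and the StreamingQuery handle returned by start() are typically used together. Here, counts is a placeholder for an aggregated streaming DataFrame (for example, the result of a groupBy(...).count()), and the 30-second interval is an arbitrary choice.
import org.apache.spark.sql.streaming.Trigger

// Microbatch every 30 seconds; Trigger.Once() or Trigger.Continuous("1 second")
// could be substituted for single-shot or experimental continuous execution.
val q = counts.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

// The returned StreamingQuery is the handle for the query's life cycle.
q.awaitTermination(60000)   // block up to 60 s, or until the query stops or fails
if (q.isActive) q.stop()    // stop the query gracefully if it is still running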
Demo
The first part of our program deals with the creation of the streaming Dataset:
val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
> rawData: org.apache.spark.sql.DataFrame
The entry point of Structured Streaming is an existing Spark session (sparkSession). As you can see on the first line, the creation of a streaming Dataset is almost identical to the creation of a static Dataset, which would use a read operation instead. sparkSession.readStream returns a DataStreamReader, a class that implements the builder pattern to collect the information needed to construct the streaming source using a fluent API. In that API, we find the format option that lets us specify our source provider, which, in our case, is kafka. The options that follow are specific to the source:
• kafka.bootstrap.servers: Indicates the set of bootstrap servers to contact, as a comma-separated list of host:port addresses.
• subscribe: Specifies the topic or topics to subscribe to.
• startingOffsets: The offset reset policy to apply when this application starts out fresh.
The load() method evaluates the DataStreamReader builder and creates a DataFrame as a result, as we can see in the returned value:
> rawData: org.apache.spark.sql.DataFrame
A DataFrame is an alias for Dataset[Row] with a known schema. After creation, you can use streaming Datasets just like regular Datasets. This makes it possible to use the full-fledged Dataset API with Structured Streaming, although some exceptions apply because not all operations, such as show() or count(), make sense in a streaming context.
To programmatically differentiate a streaming Dataset from a static one, we can ask a Dataset whether it is of the streaming kind:
rawData.isStreaming
res7: Boolean = true
And we can also explore the schema attached to it, using the existing Dataset API, as demonstrated in the Example.
Example. The Kafka schema
rawData.printSchema()
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
In general, Structured Streaming requires the explicit declaration of a schema for the consumed stream. In the specific case of kafka, the schema for the resulting Dataset is fixed and independent of the contents of the stream. It consists of a set of fields specific to the Kafka source: key, value, topic, partition, offset, timestamp, and timestampType, as we can see in the Example above. In most cases, applications are mostly interested in the contents of the value field, where the actual payload of the stream resides.
Application Logic
Recall that the intention of our job is to correlate the incoming IoT sensor data with a reference file that contains all known sensors with their configuration. That way, we would enrich each incoming record with specific sensor parameters that would allow us to interpret the reported data. We would then save all correctly processed records to a Parquet file. The data coming from unknown sensors would be saved to a separate file for later analysis. Using Structured Streaming, our job can be implemented in terms of Dataset operations:
import scala.util.Try

val iotData = rawData.select($"value").as[String].flatMap { record =>
  val fields = record.split(",")
  // malformed records yield None and are dropped by flatMap
  Try {
    SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
  }.toOption
}

val sensorRef = sparkSession.read.parquet(s"$workDir/$referenceFile")
sensorRef.cache()

val sensorWithInfo = sensorRef.join(iotData, Seq("sensorId"), "inner")

val knownSensors = sensorWithInfo
  .withColumn("dnvalue", $"value"*($"maxRange"-$"minRange")+$"minRange")
  .drop("value", "maxRange", "minRange")
In the first step, we transform our CSV-formatted records back into SensorData entries (a hedged sketch of what such a case class might look like follows this slide). We apply Scala functional operations on the typed Dataset[String] that we obtained from extracting the value field as a String. Then, we use a streaming Dataset to static Dataset inner join to correlate the sensor data with the corresponding reference, using sensorId as the key. To complete our application, we compute the real values of the sensor readings using the minimum-maximum ranges in the reference data.
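The SensorData case class itself is not shown in these notes. A plausible definition, inferred from how its fields are used in the demo (sensorId for the join, value for the range computation), might look like the following; the name of the middle Long field (timestamp) is an assumption.
// Hedged sketch: a plausible SensorData definition inferred from the demo code.
case class SensorData(sensorId: Int, timestamp: Long, value: Double)

// The parsing step in isolation, mirroring the flatMap in the demo:
import scala.util.Try
def parse(record: String): Option[SensorData] = {
  val fields = record.split(",")
  Try(SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption
}

parse("12,1502726400,0.75")   // Some(SensorData(12,1502726400,0.75))
parse("malformed-record")     // None: dropped by the flatMap in the demo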
Writing to a Streaming Sink
The final step of our streaming application is to write the enriched IoT data to a Parquet-formatted file. In Structured Streaming, the write operation is crucial: it marks the completion of the declared transformations on the stream, defines a write mode, and, upon calling start(), begins the processing of the continuous query. In Structured Streaming, all operations are lazy declarations of what we want to do with the streaming data. Only when we call start() will the actual consumption of the stream begin and the query operations on the data materialize into actual results:
val knownSensorsQuery = knownSensors.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", targetPath)
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
Let’s break this operation down:
• writeStream creates a builder object where we can configure the options for the desired write operation, using a fluent interface.
• With format, we specify the sink that will materialize the result downstream. In our case, we use the built-in FileStreamSink with Parquet format.
• outputMode is a new concept in Structured Streaming: given that we, theoretically, have access to all the data seen in the stream so far, we also have the option to produce different views of that data.
• The append mode, used here, implies that only the new records affected by our streaming computation are produced to the output.
The result of the start call is a StreamingQuery instance. This object provides methods to control the execution of the query and request information about the status of our running streaming query, as shown in the Example.
Example. Query progress
knownSensorsQuery.recentProgress
res37: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array({
  "id" : "6b9fe3eb-7749-4294-b3e7-2561f1e840b6",
  "runId" : "0d8d5605-bf78-4169-8cfe-98311fc8365c",
  "name" : null,
  "timestamp" : "2017-08-10T16:20:00.065Z",
  "numInputRows" : 4348,
  "inputRowsPerSecond" : 395272.7272727273,
  "processedRowsPerSecond" : 28986.666666666668,
  "durationMs" : {
    "addBatch" : 127,
    "getBatch" : 3,
    "getOffset" : 1,
    "queryPlanning" : 7,
    "triggerExecution" : 150,
    "walCommit" : 11
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[iot-data]]",
    "startOffset" : { "iot-data" : { "0" : 19048348 } },
    "endOffset" : { "iot-data" : { "0" : 19052696 } },
    "numInputRow...
In the Example, we can see the StreamingQueryProgress that results from calling knownSensorsQuery.recentProgress. If we see nonzero values for numInputRows, we can be certain that our job is consuming data. We now have a Structured Streaming job running properly. A hedged sketch of further ways to monitor a running query follows.
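Beyond recentProgress, the StreamingQuery handle exposes a few more introspection methods. The following is a brief sketch against the knownSensorsQuery defined above; the field names come from Spark’s StreamingQueryProgress and StreamingQueryStatus classes, and the exact printed values will of course differ per run.
// Sketch only: monitoring the running query from the demo.
println(knownSensorsQuery.isActive)       // true while the query is running
println(knownSensorsQuery.status)         // current activity, e.g. whether a trigger is active
println(knownSensorsQuery.lastProgress)   // the most recent StreamingQueryProgress, as JSON

// Surface the throughput of the last completed microbatch, if any
// (lastProgress is null before the first batch completes, hence the Option wrapper).
Option(knownSensorsQuery.lastProgress).foreach { p =>
  println(s"rows in last batch: ${p.numInputRows}, rate: ${p.processedRowsPerSecond} rows/s")
}

// Block the driver until the query terminates (or fails with an exception):
// knownSensorsQuery.awaitTermination()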
Stream Processing with Spark Streaming
Spark Streaming was the first stream-processing framework built on top of the distributed processing capabilities of Spark. Nowadays, it offers a mature API that is widely adopted in the industry to process large-scale data streams. Spark is, by design, a system that is really good at processing data distributed over a cluster of machines. Spark’s core abstraction, the Resilient Distributed Dataset (RDD), and its fluent functional API permit the creation of programs that treat distributed data as a collection. That abstraction lets us reason about data-processing logic as transformations of a distributed dataset. By doing so, it reduces the cognitive load previously required to create and execute scalable and distributed data-processing programs.
Spark Streaming was created upon a simple yet powerful premise: apply Spark’s distributed computing capabilities to stream processing by transforming a continuous stream of data into discrete data collections on which Spark can operate. As we can see in the Figure, the main task of Spark Streaming is to take data from the stream, package it into small batches, and provide them to Spark for further processing. The output is then produced to some downstream system.
Figure. Spark and Spark Streaming in action
The DStream Abstraction
Whereas Structured Streaming builds its streaming capabilities on top of the Spark SQL abstractions of DataFrame and Dataset, Spark Streaming relies on the much more fundamental Spark abstraction of the RDD. At the same time, Spark Streaming introduces a new concept: the Discretized Stream, or DStream. A DStream represents a stream in terms of discrete blocks of data that in turn are represented as RDDs over time, as we can see in the Figure.
Figure. DStreams and RDDs in Spark Streaming
The DStream abstraction is primarily an execution model that, when combined with a functional programming model, provides us with a complete framework to develop and execute streaming applications.
DStreams as a Programming Model
The code representation of DStreams gives us a functional programming API consistent with the RDD API and augmented with stream-specific functions to deal with aggregations, time-based operations, and stateful computations. In Spark Streaming, we consume a stream by creating a DStream from one of the native implementations, such as a SocketInputStream, or by using one of the many connectors available that provide a DStream implementation specific to a stream provider (this is the case of the Kafka, Twitter, or Kinesis connectors for Spark Streaming, just to name a few):
// creates a DStream using a client socket connected to the given host and port
val textDStream = ssc.socketTextStream("localhost", 9876)
After we have obtained a DStream reference, we can implement our application logic using the functions provided by the DStream API. For example, if the textDStream in the preceding code is connected to a log server, we could count the number of error occurrences:
// we break down the stream of logs into error or info (not error)
// and create pairs of `(x, y)`:
// (1, 1) represents an error, and
// (0, 1) a non-error occurrence.
val errorLabelStream = textDStream.map { line =>
  if (line.contains("ERROR")) (1, 1) else (0, 1)
}
We can then count the totals and compute the error rate by using an aggregation function called reduce:
// reduce combines the pairs in each microbatch into a single (errors, total) tuple
val errorCountStream = errorLabelStream.reduce {
  case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)
}
To obtain our error rate, we perform a safe division:
// compute the error rate and create a string message with the value
val errorRateStream = errorCountStream.map { case (errors, total) =>
  val errorRate = if (total > 0) errors.toDouble / total else 0.0
  "Error Rate:" + errorRate
}
It’s important to note that, up until now, we have only been declaring transformations on the DStream; there is still no data processing happening. All transformations on DStreams are lazy. This process of defining the logic of a stream-processing application is better seen as the set of transformations that will be applied to the data after the stream processing is started. As such, it is a plan of action that Spark Streaming will recurrently execute on the data consumed from the source DStream. DStreams are immutable: it is only through a chain of transformations that we can process and obtain a result from our data.
Finally, the DStream programming model requires that the transformations end with an output operation. This particular operation specifies how the DStream is materialized. In our case, we are interested in printing the results of this stream computation to the console:
// print the results to the console
errorRateStream.print()
In summary, the DStream programming model consists of the functional composition of transformations over the stream payload, materialized by one or more output operations and recurrently executed by the Spark Streaming engine.
DStreams as an Execution Model
In the preceding introduction to the Spark Streaming programming model, we saw how data is transformed from its original form into our intended result through a series of lazy functional transformations. The Spark Streaming engine is responsible for taking that chain of functional transformations and turning it into an actual execution plan. That happens by receiving data from the input stream(s), collecting that data into batches, and feeding it to Spark in a timely manner.
The measure of time to wait for data is known as the batch interval. It is usually a short amount of time, ranging from roughly two hundred milliseconds to minutes, depending on the application’s latency requirements. The batch interval is the central unit of time in Spark Streaming. At each batch interval, the data corresponding to the previous interval is sent to Spark for processing while new data is received. This process repeats as long as the Spark Streaming job is active and healthy. A natural consequence of this recurring microbatch operation is that the computation on a batch’s data has to complete within the duration of the batch interval, so that computing resources are available when the new microbatch arrives. As we will see, the batch interval dictates the timing of most other functions in Spark Streaming. A minimal end-to-end sketch of this DStream program follows.
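To tie the pieces together, here is a minimal, hedged sketch of how the error-rate example above could be assembled into a runnable Spark Streaming program. The application name, the local master, the 2-second batch interval, and the localhost:9876 log server are illustrative choices, not values given in these notes.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ErrorRateApp {
  def main(args: Array[String]): Unit = {
    // The batch interval (here, 2 seconds) is fixed when the StreamingContext is created.
    val conf = new SparkConf().setAppName("error-rate-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical log server reachable on localhost:9876, as in the example above.
    val textDStream = ssc.socketTextStream("localhost", 9876)

    val errorRateStream = textDStream
      .map(line => if (line.contains("ERROR")) (1, 1) else (0, 1))
      .reduce { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
      .map { case (errors, total) =>
        val errorRate = if (total > 0) errors.toDouble / total else 0.0
        "Error Rate:" + errorRate
      }

    // Output operation: print each microbatch result, then start the recurring execution.
    errorRateStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}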