18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM
UNIT IV
Apache Spark Streaming Introduction - Spark’s Memory Usage - Understanding Resilience and
Fault Tolerance in a Distributed System - Spark’s cluster manager - Data Delivery Semantics in
Spark - Data Delivery Semantics in Spark Applications - Microbatching - Dynamic Batch Interval
- Structured Stream processing model - Spark Streaming Resilience Model - Data Structures in
Spark – RDDs and DStreams - Spark Fault Tolerance Guarantees - First Steps in Structured
Streaming - Streaming Analytics Phases - Acquiring streaming data - Transforming streaming data
- Output the resulting data - Demo – Stream Processing with Spark Streaming
Apache Spark Streaming Introduction
Spark offers two different stream-processing APIs:
• Spark Streaming
• Structured Streaming
Spark Streaming: This is an API and a set of connectors in which a Spark program is served
small batches of data collected from a stream, in the form of microbatches spaced at fixed
time intervals, performs a given computation, and eventually returns a result at every interval.
Structured Streaming: This is an API and a set of connectors, built on the substrate of a SQL
query optimizer, Catalyst. It offers an API based on DataFrames and the notion of continuous
queries over an unbounded table that is constantly updated with fresh records from the stream.
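To make the contrast concrete, the following is a minimal sketch of the Spark Streaming (DStream) API; the socket source on localhost:9999 and the two-second batch interval are illustrative assumptions, not part of the original description:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Hedged sketch: a word count over microbatches collected every 2 seconds.
val conf = new SparkConf().setAppName("dstream-wordcount")
val ssc = new StreamingContext(conf, Seconds(2))          // fixed batch interval
val lines = ssc.socketTextStream("localhost", 9999)       // hypothetical text source
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()                                            // a result at every interval
ssc.start()
ssc.awaitTermination()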
Spark’s Memory Usage
Spark offers in-memory storage of slices of a dataset, which must be initially loaded from
a data source. The data source can be a distributed filesystem or another storage medium. Spark’s
form of in-memory storage is analogous to the operation of caching data.
Hence, a value in Spark’s in-memory storage has a base, which is its initial data source, and
layers of successive operations applied to it.
Failure Recovery What happens in case of a failure? Because Spark knows exactly which data
source was used to ingest the data in the first place, and because it also knows all the operations
that were performed on it thus far, it can reconstitute the segment of lost data that was on a crashed
executor, from scratch. Obviously, this goes faster if that reconstitution (recovery, in Spark’s
parlance) does not need to be totally from scratch. So, Spark offers a replication mechanism, much
as distributed filesystems do. However, because memory is such a valuable yet limited
commodity, Spark makes (by default) the cache short-lived.
Lazy Evaluation A good part of the operations that can be defined on values in Spark’s storage
have a lazy execution, and it is the execution of a final, eager output operation that will trigger the
actual execution of computation in a Spark cluster. It’s worth noting that if a program consists of
a series of linear operations, with the previous one feeding into the next, the intermediate results
disappear right after said next step has consumed its input.
Cache Hints On the other hand, what happens if we have several operations to do on a single
intermediate result? Should we have to compute it several times? Thankfully, Spark lets users
specify that an intermediate value is important and how its contents should be safeguarded for
later. Figure below presents the data flow of such an operation.
Figure: Operations on cached values
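As a hedged illustration of such cache hints (the input path is a hypothetical location), an intermediate value can be marked as important and reused by several later actions:
// Hedged sketch: keep an intermediate RDD around for several downstream uses.
val words = sc.textFile("hdfs:///data/corpus.txt").flatMap(_.split(" "))
words.cache()                                 // hint: this value will be reused
val totalWords = words.count()                // first action computes and caches it
val distinctWords = words.distinct().count()  // second action reuses the cached data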
Finally, Spark offers the opportunity to spill the cache to secondary storage in case it runs out of
memory on the cluster, extending the in-memory operation to secondary—and significantly
slower—storage to preserve the functional aspects of a data process when faced with temporary
peak loads.
Now that we have an idea of the main characteristics of Apache Spark, let’s spend some time
focusing on one design choice internal to Spark, namely, the latency versus throughput trade-off.
Understanding Resilience and Fault Tolerance in a Distributed System
Resilience and fault tolerance are absolutely essential for a distributed application: they are the
condition by which we will be able to perform the user’s computation to completion. Nowadays,
clusters are made of commodity machines that are ideally operated near peak capacity over their
lifetime.
To put it mildly, hardware breaks quite often. A resilient application can make progress with its
process despite latencies and noncritical faults in its distributed environment. A fault-tolerant
application is able to succeed and complete its process despite the unplanned termination of one
or several of its nodes.
This sort of resiliency is especially relevant in stream processing given that the applications we’re
scheduling are supposed to live for an undetermined amount of time. That undetermined amount
of time is often correlated with the life cycle of the data source. For example, if we are running a
retail website and we are analyzing transactions and website interactions as they come into the
system against the actions and clicks and navigation of users visiting the site, we potentially have a
data source that will be available for the entire duration of the lifetime of our business, which we
hope to be very long, if our business is going to be successful.
As a consequence, a system that will process our data in a streaming fashion should run
uninterrupted for long periods of time.
This “show must go on” approach of streaming computation makes the resiliency and fault-
tolerance characteristics of our applications more important. For a batch job, we could launch it,
hope it would succeed, and relaunch if we needed to change it or in case of failure. For an online
streaming Spark pipeline, this is not a reasonable assumption.
Fault Recovery
In the context of fault tolerance, we are also interested in understanding how long it takes to
recover from failure of one particular node. Indeed, stream processing has a particular aspect: data
continues being generated by the data source in real time. To deal with a batch computing failure,
we always have the opportunity to restart from scratch and accept that obtaining the results of
computation will take longer. Thus, a very primitive form of fault tolerance is detecting the failure
of a particular node of our deployment, stopping the computation, and restarting from scratch.
That process can take more than twice the original duration that we had budgeted for that
computation, but if we are not in a hurry, this is still acceptable.
For stream processing, we need to keep receiving data and thus potentially storing it, if the
recovering cluster is not ready to assume any processing yet. This can pose a problem at a high
throughput: if we try restarting from scratch, we will need not only to reprocess all of the data that
we have observed since the beginning of the application—which in and of itself can be a
challenge—but during that reprocessing of historical data, we will need it to continue receiving
and thus potentially storing new data that was generated while we were trying to catch up. This
pattern of restarting from scratch is something so intractable for streaming that we will pay special
attention to Spark’s ability to restart only minimal amounts of computation in the case that a node
becomes unavailable or nonfunctional.
Cluster Manager Support for Fault Tolerance
We want to highlight why it is still important to understand Spark’s fault tolerance guarantees, even
if there are similar features present in the cluster managers of YARN, Mesos, or Kubernetes. To
understand this, we can consider that cluster managers help with fault tolerance when they work
hand in hand with a framework that is able to report failures and request new resources to cope
with those exceptions. Spark possesses such capabilities.
For example, production cluster managers such as YARN, Mesos, or Kubernetes have the ability
to detect a node’s failure by inspecting endpoints on the node and asking the node to report on its
own readiness and liveness state. If these cluster managers detect a failure and they have spare
capacity, they will replace that node with another, made available to Spark. That particular action
implies that the Spark executor code will start anew in another node, and then attempt to join the
existing Spark cluster.
The cluster manager, by definition, does not have introspection capabilities into the applications
being run on the nodes that it reserves. Its responsibility is limited to the container that runs the
user’s code.
That responsibility boundary is where the Spark resilience features start. To recover from a failed
node, Spark needs to do the following:
• Determine whether that node contains some state that should be reproduced in the form
of checkpointed files
• Understand at which stage of the job a node should rejoin the computation
The goal here is to show that if a node is replaced by the cluster manager, Spark
has capabilities that allow it to take advantage of this new node and to distribute computation onto
it.
Within this context, our focus is on Spark’s responsibilities as an application, and we underline the
capabilities of a cluster manager only when necessary: for instance, a node could be replaced
because of a hardware failure or because its work was simply preempted by a higher-priority job.
Apache Spark is blissfully unaware of the why, and focuses on the how.
Spark’s cluster manager
Spark has two internal cluster managers:
The local cluster manager
This emulates the function of a cluster manager (or resource manager) for testing purposes.
It reproduces the presence of a cluster of distributed machines using a threading model that relies
on the cores available on your local machine. This mode is usually not very confusing
because it executes only on the user’s laptop.
The standalone cluster manager
A relatively simple, Spark-only cluster manager that is rather limited in its ability to
slice and dice resource allocation. The standalone cluster manager holds and makes available the
entire worker node on which a Spark executor is deployed and started. It also expects the executor
to have been predeployed there, and the actual shipping of that .jar to a new machine is not within
its scope. It has the ability to take over a specific number of executors, which are part of its
deployment of worker nodes, and execute tasks on them. This cluster manager is extremely useful for
the Spark developers, providing a bare-bones resource management solution that allows them to
focus on improving Spark in an environment without any bells and whistles. The standalone cluster
manager is not recommended for production deployments.
As a summary, Apache Spark is a task scheduler in that what it schedules are tasks, units of
distribution of computation that have been extracted from the user program. Spark also
communicates with and is deployed through cluster managers, including Apache Mesos, YARN, and
Kubernetes, or, in some cases, its own standalone cluster manager. The purpose of that
communication is to reserve a number of executors, which are the units in which Spark
understands equal-sized amounts of computation resources, a virtual “node” of sorts. The reserved
resources in question could be provided by the cluster manager as the following:
• Limited processes (e.g., in some basic use cases of YARN), in which processes have their
resource consumption metered but are not prevented from accessing each other’s resources
by default.
• Containers (e.g., in the case of Mesos or Kubernetes), in which containers are a relatively
lightweight resource reservation technology that is born out of the cgroups and namespaces
of the Linux kernel and has seen its most popular iteration with the Docker project.
• Either of the above deployed on virtual machines (VMs), themselves
coming with specific core and memory reservations.
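As a hedged sketch (the application name and master URLs below are illustrative), the cluster manager an application talks to is selected through the master setting when the session is created:
import org.apache.spark.sql.SparkSession
// Hedged sketch: choosing a cluster manager via the master URL.
val spark = SparkSession.builder()
  .appName("streaming-app")                // hypothetical application name
  .master("local[4]")                      // local manager: 4 threads, for testing
  // .master("spark://master-host:7077")   // standalone cluster manager
  // .master("yarn")                       // YARN resource manager
  .getOrCreate()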
Data Delivery Semantics in Spark
As you have seen in the streaming model, the fact that streaming jobs act on the basis of data that
is generated in real time means that intermediate results need to be provided to the consumer of
that streaming pipeline on a regular basis.
Those results are being produced by some part of our cluster. Ideally, we would like those
observable results to be coherent, in line, and in real time with respect to the arrival of data. This
means that we want results that are exact, and we want them as soon as possible. However,
distributed computation has its own challenges in that it sometimes includes not only individual
nodes failing, as we have mentioned, but it also encounters situations like network partitions, in
which some parts of our cluster are not able to communicate with other parts of that cluster, as
illustrated in below Figure.
Figure: A network partition
Spark has been designed using a driver/executor architecture. A specific machine, the driver, is
tasked with keeping track of the job progression along with the job submissions of a user, and the
computation of that program occurs as the data arrives. However, if a network partition
separates some part of the cluster, the driver might be able to keep track of only the part of the
executors that form the initial cluster. In the other section of our partition, we will find nodes that
are entirely able to function, but will simply be unable to report the progress of their
computation to the driver.
This creates an interesting case in which those “zombie” nodes do not receive new tasks, but might
well be in the process of completing some fragment of computation that they were previously
given. Being unaware of the partition, they will report their results as any executor would. And
because this reporting of results sometimes does not go through the driver (for fear of making the
driver a bottleneck), the reporting of these zombie results could succeed.
Because the driver, a single point of bookkeeping, does not know that those zombie executors are
still functioning and reporting results, it will reschedule the same tasks that the lost executors had
to accomplish on new nodes. This creates a double answering problem in which the zombie
machines lost through partitioning and the machines bearing the rescheduled tasks both report the
same results. This bears real consequences: one example of stream computation that we previously
mentioned is routing tasks for financial transactions. A double withdrawal, in that context, or
double stock purchase orders, could have tremendous consequences.
It is not only the aforementioned problem that causes different processing semantics. Another
important reason is that when output from a stream-processing application and state
checkpointing cannot be completed in one atomic operation, it will cause data corruption if failure
happens between checkpointing and outputting.
These challenges have therefore led to a distinction between at least once processing and at most
once processing:
• At least once: This processing ensures that every element of a stream has been processed
once or more.
• At most once: This processing ensures that every element of the stream is processed once
or less.
• Exactly once: This is the combination of “at least once” and “at most once.”
At-least-once processing is the notion that we want to make sure that every chunk of initial data
has been dealt with—it deals with the node failure we were talking about earlier. As we’ve
mentioned, when a streaming process suffers a partial failure in which some nodes need to be
replaced or some data needs to be recomputed, we need to reprocess the lost units of computation
while keeping the ingestion of data going. That requirement means that if you do not respect at-
least-once processing, there is a chance for you, under certain conditions, to lose data.
The antisymmetric notion is called at-most-once processing. At-most-once processing systems
guarantee that the zombie nodes repeating the same results as a rescheduled node are treated in a
coherent manner, in which we keep track of only one set of results. By keeping track of what data
their results were about, we’re able to make sure we can discard repeated results, yielding at-most-
once processing guarantees. The way in which we achieve this relies on the notion of idempotence
applied to the “last mile” of result reception. Idempotence qualifies a function such that if we
apply it twice (or more) to any data, we will get the same result as the first time. This can be
achieved by keeping track of the data that we are reporting a result for, and having a bookkeeping
system at the output of our streaming process.
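A minimal sketch of that “last mile” idempotence follows; the result store and record identifiers are hypothetical stand-ins for a bookkeeping system, not a Spark API:
import scala.collection.concurrent.TrieMap
// Hedged sketch: idempotent result delivery keyed by a record identifier.
val resultStore = TrieMap.empty[String, Long]
def reportResult(recordId: String, value: Long): Unit = {
  // putIfAbsent keeps only the first report for a given record, so duplicate
  // reports from "zombie" executors leave the stored state unchanged.
  resultStore.putIfAbsent(recordId, value)
}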
Microbatching
Two important approaches to stream processing:
• bulk-synchronous processing, and
• one-at-a-time record processing.
The objective of this is to connect those two ideas to the two APIs that Spark possesses for stream
processing: Spark Streaming and Structured Streaming.
Microbatching: An Application of Bulk-Synchronous Processing
Spark Streaming, the more mature model of stream processing in Spark, is roughly approximated
by what’s called a Bulk Synchronous Parallelism (BSP) system.
The gist of BSP is that it includes two things:
• A split distribution of asynchronous work
• A synchronous barrier, coming in at fixed intervals
The split is the idea that each of the successive steps of work to be done in streaming is separated
in a number of parallel chunks that are roughly proportional to the number of executors available
to perform this task. Each executor receives its own chunk (or chunks) of work and works
separately until the second element comes in. A particular resource is tasked with keeping track of
the progress of computation. With Spark Streaming, this is a synchronization point at the “driver”
that allows the work to progress to the next step. Between those scheduled steps, all of the
executors on the cluster are doing the same thing.
Note that what is being passed around in this scheduling process are the functions that describe
the processing that the user wants to execute on the data. The data is already on the various
executors, most often being delivered directly to these resources over the lifetime of the cluster.
This was coined “function-passing style” by Heather Miller in 2016 (and formalized in
[Miller2016]): asynchronously pass safe functions to distributed, stationary, immutable data in a
stateless container, and use lazy combinators to eliminate intermediate data structures.
The frequency at which further rounds of data processing are scheduled is dictated by a time
interval. This time interval is an arbitrary duration that is measured in batch processing time; that
is, what you would expect to see as a “wall clock” time observation in your cluster.
For stream processing, we choose to implement barriers at small, fixed intervals that better
approximate the real-time notion of data processing.
One-Record-at-a-Time Processing
By contrast, one-record-at-a-time processing functions by pipelining: it analyzes the whole
computation as described by user-specified functions and deploys it as pipelines using the
resources of the cluster. Then, the only remaining matter is to flow data through the various
resources, following the prescribed pipeline. Note that in this latter case, each step of the
computation is materialized at some place in the cluster at any given point. Systems that function
mostly according to this paradigm include Apache Flink, Naiad, Storm, and IBM Streams. This
does not necessarily mean that those systems are incapable of microbatching, but rather
characterizes their major or most native mode of operation and makes a statement on their
dependency on the process of pipelining, often at the heart of their processing.
The minimum latency, or time needed for the system to react to the arrival of one particular event,
is very different between those two: the minimum latency of the microbatching system is therefore
the time needed to complete the reception of the current microbatch (the batch interval) plus the
time needed to start a task at the executor where this data falls (also called scheduling time). On
the other hand, a system processing records one by one can react as soon as it meets the event
of interest.
Microbatching Versus One-at-a-Time: The Trade-Offs
Despite their higher latency, microbatching systems offer significant advantages:
• They are able to adapt at the synchronization barrier boundaries. That adaptation might
represent the task of recovering from failure, if a number of executors have been shown
to become deficient or lose data. The periodic synchronization can also give us an
opportunity to add or remove executor nodes, giving us the possibility to grow or shrink
our resources depending on what we’re seeing as the cluster load, observed through the
throughput on the data source.
• Our BSP systems can sometimes have an easier time providing strong consistency because
their batch determinations—that indicate the beginning and the end of a particular batch
of data—are deterministic and recorded. Thus, any kind of computation can be redone
and produce the same results the second time.
• Having data available as a set that we can probe or inspect at the beginning of the
microbatch allows us to perform efficient optimizations that can provide ideas on the way
to compute on the data. Exploiting that on each microbatch, we can consider the specific
case rather than the general processing, which is used for all possible input. For example,
we could take a sample or compute a statistical measure before deciding to process or drop
each microbatch.
More importantly, the simple presence of the microbatch as a well-identified element also allows
an efficient way of specifying programming for both batch processing (where the data is at rest
and has been saved somewhere) and streaming (where the data is in flight). The microbatch, even
for mere instants, looks like data at rest.
Dynamic Batch Interval
What is this notion of dynamic batch interval? The dynamic batch interval is the notion that the
recomputation of data in a streaming DataFrame or Dataset consists of an update of existing data
with the new elements seen over the wire. This update is occurring based on a trigger and the usual
basis of this would be time duration. That time duration is still determined based on a fixed world
clock signal that we expect to be synchronized within our entire cluster and that represents a single
synchronous source of time that is shared among every executor.
However, this trigger can also be the statement of “as often as possible.” That statement is simply
the idea that a new batch should be started as soon as the previous one has been processed, given
a reasonable initial duration for the first batch. This means that the system will launch batches as
often as possible. In this situation, the latency that can be observed is closer to that of one-element-
at-a-time processing. The idea here is that the microbatches produced by this system will converge
to the smallest manageable size, making our stream flow faster through the executor computations
that are necessary to produce a result. As soon as that result is produced, a new query will be
started and scheduled by the Spark driver.
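A hedged sketch of how this is expressed in Structured Streaming (the DataFrame df and the console sink are placeholders for a real query):
import org.apache.spark.sql.streaming.Trigger
// Hedged sketch: `df` stands for a streaming DataFrame defined earlier.
// With an explicit trigger, a new microbatch starts every 30 seconds:
val fixedIntervalQuery = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
// Omitting the .trigger(...) call gives the "as often as possible" behavior:
// a new batch is started as soon as the previous one completes.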
Structured Stream processing model
The main steps in Structured Streaming processing are as follows:
1. When the Spark driver triggers a new batch, processing starts with updating the accounting
of data read from a data source, in particular, getting data offsets for the beginning and the
end of the latest batch.
2. This is followed by logical planning, the construction of successive steps to be executed
on data, followed by query planning (intrastep optimization).
3. And then the launch and scheduling of the actual computation by adding a new batch of
data to update the continuous query that we’re trying to refresh.
Hence, from the point of view of the computation model, we will see that the API is significantly
different from Spark Streaming.
The Disappearance of the Batch Interval
We now briefly explain what Structured Streaming batches mean and their impact with respect to
operations. In Structured Streaming, the batch interval that we are using is no longer a computation
budget. With Spark Streaming, the idea was that if we produce data every two minutes and flow
data into Spark’s memory every two minutes, we should produce the results of computation on
that batch of data in at most two minutes, to clear the memory of our cluster for the next
microbatch. Ideally, as much data flows out as flows in, and the usage of the collective memory of
our cluster remains stable.
With Structured Streaming, without this fixed time synchronization, our ability to see performance
issues in our cluster is more complex: a cluster that is unstable—that is, unable to “clear out” data
by finishing its computation as fast as new data flows in—will see ever-growing batch processing
times, with an accelerating growth. We can expect that keeping an eye on this batch processing
time will be pivotal.
However, if we have a cluster that is correctly sized with respect to the throughput of our data,
there are a lot of advantages to having an as-often-as-possible update. In particular, we should expect
to see very frequent results from our Structured Streaming cluster, with a higher granularity than
we were used to with a conservative batch interval.
Spark Streaming Resilience Model
In most cases, a streaming job is a long-running job. By definition, streams of data observed and
processed over time lead to jobs that run continuously. As they process data, they might
accumulate intermediary results that are difficult to reproduce after the data has left the processing
system. Therefore, the cost of failure is considerable and, in some cases, complete recovery is
intractable.
In distributed systems, especially those relying on commodity hardware, failure is a function of
size: the larger the system, the higher the probability that some component fails at any time.
Distributed stream processors need to factor this chance of failure in their operational model.
We look at the resilience that the Apache Spark platform provides us: how it’s able to recover
partial failure and what kinds of guarantees we are given for the data passing through the system
when a failure occurs. We begin by getting an overview of the different internal components of
Spark and their relation to the core data structure. With this knowledge, you can proceed to
understand the impact of failure at the different levels and the measures that Spark offers to
recover from such failure.
RDDs and DStreams
Spark builds its data representations on Resilient Distributed Datasets (RDDs). Introduced in 2011
by the paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing” [Zaharia2011], RDDs are the foundational data structure in Spark. It is at this ground
level that the strong fault tolerance guarantees of Spark start.
RDDs are composed of partitions, which are segments of data stored on individual nodes and
tracked by the Spark driver; the RDD is presented to the user as a single, location-transparent data structure.
We illustrate these components in below Figure in which the classic word count application is
broken down into the different elements that comprise an RDD.
Figure: An RDD operation represented in a distributed system
The colored blocks are data elements, originally stored in a distributed filesystem, represented on
the far left of the figure. The data is stored as partitions, illustrated as columns of colored blocks
inside the file. Each partition is read into an executor, which we see as the horizontal blocks. The
actual data processing happens within the executor. There, the data is transformed following the
transformations described at the RDD level:
• .flatMap(l => l.split(" ")) separates sentences into words, splitting on spaces.
• .map(w => (w,1)) transforms each word into a tuple of the form (word, 1), in this way preparing
the words for counting.
• .reduceByKey(_ + _) computes the count, using the word as a key and applying a sum operation
to the attached number.
• The final result is attained by bringing the partial results together using the same reduce
operation (the complete chain is sketched below).
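Putting those steps together, a hedged sketch of the complete chain (the input path is a hypothetical location):
// The classic word count described above, expressed on an RDD.
val counts = sc.textFile("hdfs:///data/sentences.txt")
  .flatMap(l => l.split(" "))   // split each line into words
  .map(w => (w, 1))             // prepare each word for counting
  .reduceByKey(_ + _)           // sum the counts per word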
RDDs constitute the programmatic core of Spark. All other abstractions, batch and streaming
alike, including DataFrames, DataSets, and DStreams are built using the facilities created by RDDs,
and, more important, they inherit the same fault tolerance capabilities.
Another important characteristic of RDDs is that Spark will try to keep their data preferably in
memory for as long as it is required, provided there is enough capacity in the system. This behavior is
configurable through storage levels and can be explicitly controlled by calling caching operations.
We mention those structures here to present the idea that Spark tracks the progress of the user’s
computation through modifications of the data. Indeed, knowing how far along we are in what
the user wants to do through inspecting the control flow of their program (including loops and
potential recursive calls) can be a daunting and error-prone task. It is much more reliable to define
types of distributed data collections, and let the user create one from another, or from other data
sources.
In the figure below, we show the same word count program, now in the form of the user-provided
code (left) and the resulting internal RDD chain of operations. This dependency chain forms a
particular kind of graph, a Directed Acyclic Graph (DAG). The DAG informs the scheduler,
appropriately called the DAG Scheduler, on how to distribute the computation and is also the
foundation of the failure-recovery functionality, because it represents the internal data and their
dependencies.
Figure: RDD lineage
As the system tracks the ordered creation of these distributed data collections, it tracks the work
done, and what’s left to accomplish.
Data Structures in Spark
To understand at what level fault tolerance operates in Spark, it’s useful to go through an overview
of the nomenclature of some core concepts. We begin by assuming that the user provides a
program that ends up being divided into chunks and executed on various machines, as we saw in
the previous section, and as depicted in below Figure.
Figure: Spark nomenclature
Let’s run down those steps, which define the vocabulary of the Spark runtime:
User Program The user application in Spark Streaming is composed of user-specified function
calls operating on a resilient data structure (RDD, DStream, streaming DataSet, and so on),
categorized as actions and transformations.
Transformed User Program The user program may undergo adjustments that modify some of
the specified calls to make them simpler, the most approachable and understandable of which is
map-fusion. Query plan is a similar but more advanced concept in Spark SQL.
RDD A logical representation of a distributed, resilient dataset. In the illustration, we see that the
initial RDD comprises three parts, called partitions.
Partition A partition is a physical segment of a dataset that can be loaded independently.
Stages The user’s operations are then grouped into stages, whose boundary separates user
operations into steps that must be executed separately. For example, operations that require a
shuffle of data across multiple nodes, such as a join between the results of two distinct upstream
operations, mark a distinct stage. Stages in Apache Spark are the unit of sequencing: they are
executed one after the other. At most one of any interdependent stages can be running at any given
time.
Jobs After these stages are defined, what internal actions Spark should take is clear. Indeed, at this
stage, a set of interdependent jobs is defined. And jobs, precisely, are the vocabulary for a unit of
scheduling. They describe the work at hand from the point of view of an entire Spark cluster,
whether it’s waiting in a queue or currently being run across many machines.
Tasks Depending on where their source data is on the cluster, jobs can then be cut into tasks,
crossing the conceptual boundary between distributed and single-machine computing: a task is a
unit of local computation, the name for the local, executor-bound part of a job.
Spark aims to make sure that all of these steps are safe from harm and to recover quickly in the
case of any incident occurring in any stage of this process. This concern is reflected in fault-
tolerance facilities that are structured by the aforementioned notions: restart and checkpointing
operations that occur at the task, job, stage, or program level.
Spark Fault Tolerance Guarantees
Now that we have seen the “pieces” that constitute the internal machinery in Spark, we are ready
to understand that failure can happen at many different levels. In this section, we see Spark fault-
tolerance guarantees organized by “increasing blast radius,” from the more modest to the larger
failure. We are going to investigate the following:
• How Spark mitigates Task failure through restarts
• How Spark mitigates Stage failure through the shuffle service
• How Spark mitigates the disappearance of the orchestrator of the user program, through
driver restarts
Task Failure Recovery: Tasks can fail when the infrastructure on which they are running has a
failure, or when logical conditions in the program lead to a sporadic failure, such as an OutOfMemory error,
network or storage errors, or problems bound to the quality of the data being processed.
If the input data of the task was stored, through a call to cache() or persist() and if the chosen
storage level implies a replication of data, the task does not need to have its input recomputed,
because a copy of it exists in complete form on another machine of the cluster. We can then use
this input to restart the task. The storage levels configurable in Spark differ in their
characteristics in terms of memory usage, disk usage, and replication factor.
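As a hedged sketch (parsedRdd is a placeholder for an intermediate RDD worth protecting), a replicated storage level is requested through persist:
import org.apache.spark.storage.StorageLevel
// MEMORY_AND_DISK_2 keeps two replicas and spills to disk when memory is short;
// other levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their
// serialized (_SER) and replicated (_2) variants.
parsedRdd.persist(StorageLevel.MEMORY_AND_DISK_2)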
If, however, there was no persistence or if the storage level does not guarantee the existence of a
copy of the task’s input data, the Spark driver will need to consult the DAG that stores the user-
specified computation to determine which segments of the job need to be recomputed.
Consequently, without enough precautions to save either on the caching or on the storage level,
the failure of a task can trigger the recomputation of several others, up to a stage boundary.
Stage boundaries imply a shuffle, and a shuffle implies that intermediate data will somehow be
materialized: as we discussed, the shuffle transforms executors into data servers that can provide
the data to any other executor serving as a destination.
As a consequence, executors that participated in a shuffle have a copy of the map operations that
led up to it. That is a lifesaver if you have a dying downstream executor, which can then rely on the upstream
servers of the shuffle (which serve the output of the map-like operation). What if it’s the contrary:
you need to face the crash of one of the upstream executors?
Stage Failure Recovery We’ve seen that task failure (possibly due to executor crash) was the
most frequent incident happening on a cluster and hence the most important event to mitigate.
Recurrent task failures will lead to the failure of the stage that contains that task. This brings us to
the second facility that allows Spark to resist arbitrary stage failures: the shuffle service.
When this failure occurs, it always means some rollback of the data, but a shuffle operation, by
definition, depends on all of the prior executors involved in the step that precedes it.
As a consequence, since Spark 1.3 we have the shuffle service, which lets you work on map data
that is saved and distributed through the cluster with a good locality, but, more important, through
a server that is not a Spark task. It’s an external file exchange service written in Java that has no
dependency on Spark and is made to be a much longer-running service than a Spark executor. This
additional service attaches as a separate process in all cluster modes of Spark and simply offers a
data file exchange for executors to transmit data reliably, right before a shuffle. It is highly
optimized through the use of a netty backend, to allow a very low overhead in transmitting data.
This way, an executor can shut down after the execution of its map task, as soon as the shuffle
service has a copy of its data. And because data transfers are faster, this transfer time is also highly
reduced, reducing the vulnerable time in which any executor could face an issue.
Driver Failure Recovery Having seen how Spark recovers from the failure of a particular task
and stage, we can now look at the facilities Spark offers to recover from the failure of the driver
program. The driver in Spark has an essential role: it is the depository of the block manager, which
knows where each block of data resides in the cluster. It is also the place where the DAG lives.
Finally, it is where the scheduling state of the job, its metadata, and logs resides. Hence, if the
driver is lost, a Spark cluster as a whole might well have lost which stage it has reached in
computation, what the computation actually consists of, and where the data that serves it can be
found, in one fell swoop.
Cluster-mode deployment Spark has implemented what’s called the cluster deployment mode,
which allows the driver program to be hosted on the cluster, as opposed to the user’s computer.
The deployment mode is one of two options: in client mode, the driver is launched in the same
process as the client that submits the application. In cluster mode, however, the driver is launched
from one of the worker processes inside the cluster, and the client process exits as soon as it fulfills
its responsibility of submitting the application without waiting for the application to finish.
This, in sum, allows Spark to operate an automatic driver restart, so that the user can start a job in
a “fire and forget fashion,” starting the job and then closing their laptop to catch the next train.
Every cluster mode of Spark offers a web UI that will let the user access the log of their application.
Another advantage is that driver failure does not mark the end of the job, because the driver
process will be relaunched by the cluster manager. But this only allows recovery from scratch,
given that the temporary state of the computation—previously stored in the driver machine—
might have been lost.
Checkpointing To avoid losing intermediate state in case of a driver crash, Spark offers the option
of checkpointing; that is, periodically recording a snapshot of the application’s state to disk. The
directory given to sparkContext.setCheckpointDir() should point to reliable storage (e.g.,
Hadoop Distributed File System [HDFS]) because having the driver try to reconstruct the state of
intermediate RDDs from its local filesystem makes no sense: those intermediate RDDs are being
created on the executors of the cluster and should as such not require any interaction with the
driver for backing them up.
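A minimal sketch, assuming an HDFS path that is purely illustrative:
// Hedged sketch: point checkpoint data at reliable, shared storage.
sparkContext.setCheckpointDir("hdfs:///checkpoints/streaming-app")
// For the DStream API, the StreamingContext offers an equivalent call:
// ssc.checkpoint("hdfs:///checkpoints/streaming-app")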
First Steps in Structured Streaming
In the previous section, we learned about the high-level concepts that constitute Structured
Streaming, such as sources, sinks, and queries. We are now going to explore Structured Streaming
from a practical perspective, using a simplified web log analytics use case as an example.
Before we begin delving into our first streaming application, we are going to see how classical
batch analysis in Apache Spark can be applied to the same use case.
This exercise has two main goals:
• First, most, if not all, streaming data analytics start by studying a static data sample. It is
far easier to start a study with a file of data, gain intuition on how the data looks, what kind
of patterns it shows, and define the process that we require to extract the intended
knowledge from that data. Typically, it’s only after we have defined and tested our data
analytics job, that we proceed to transform it into a streaming process that can apply our
analytic logic to data on the move.
• Second, from a practical perspective, we can appreciate how Apache Spark simplifies many
aspects of transitioning from a batch exploration to a streaming application through the
use of uniform APIs for both batch and streaming analytics. This exploration will allow
us to compare and contrast the batch and streaming APIs in Spark and show us the
necessary steps to move from one to the other.
Batch Analytics
Given that we are working with archive log files, we have access to all of the data at once. Before
we begin building our streaming application, let’s take a brief intermezzo to have a look at what a
classical batch analytics job would look like.
First, we load the log files, encoded as JSON, from the directory where we unpacked them:
// This is the location of the unpackaged files. Update accordingly
val logsDirectory = ???
val rawLogs = sparkSession.read.json(logsDirectory)
Next, we declare the schema of the data as a case class to use the typed Dataset API. Following
the formal description of the dataset (at NASA-HTTP ), the log is structured as follows:
The logs are an ASCII file with one line per request, with the following columns:
• Host making the request. A hostname when possible, otherwise the Internet address if the
name could not be looked up.
• Timestamp in the format “DAY MON DD HH:MM:SS YYYY,” where DAY is the day
of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is
the time of day using a 24-hour clock, and YYYY is the year. The timezone is –0400.
• Request given in quotes.
• HTTP reply code.
• Bytes in the reply.
Translating that schema to Scala, we have the following case class definition:
import java.sql.Timestamp
case class WebLog(host: String,
timestamp: Timestamp,
request: String,
http_reply: Int,
bytes: Long )
We convert the original JSON to a typed data structure using the previous schema definition:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// we need to narrow the numeric type because
// the JSON reader interprets whole numbers as a wider (long) type
val preparedLogs = rawLogs.withColumn("http_reply", $"http_reply".cast(IntegerType))
val weblogs = preparedLogs.as[WebLog]
Now that we have the data in a structured format, we can begin asking the questions that interest
us. As a first step, we would like to know how many records are con‐ tained in our dataset:
val recordCount = weblogs.count
> recordCount: Long = 1871988
A common question would be: “what was the most popular URL per day?” To answer that, we
first reduce the timestamp to the day of the month. We then group by this new dayOfMonth
column and the request URL and we count over this aggregate. We finally order using descending
order to get the top URLs first:
val topDailyURLs = weblogs.withColumn("dayOfMonth", dayofmonth($"timestamp"))
.select($"request", $"dayOfMonth")
.groupBy($"dayOfMonth", $"request")
.agg(count($"request").alias("count"))
.orderBy(desc("count"))
topDailyURLs.show()
+----------+----------------------------------------+-----+
|dayOfMonth| request|count|
+----------+----------------------------------------+-----+
| 13|GET /images/NASA-logosmall.gif HTTP/1.0 |12476|
| 13|GET /htbin/cdt_main.pl HTTP/1.0 | 7471|
| 12|GET /images/NASA-logosmall.gif HTTP/1.0 | 7143|
| 13|GET /htbin/cdt_clock.pl HTTP/1.0 | 6237|
| 6|GET /images/NASA-logosmall.gif HTTP/1.0 | 6112|
| 5|GET /images/NASA-logosmall.gif HTTP/1.0 | 5865| ...
Top hits are all images. What now? It’s not unusual to see that the top URLs are images commonly
used across a site. Our true interest lies in the content pages generating the most traffic. To find
those, we first filter on html content and then proceed to apply the top aggregation we just learned.
As we can see, the request field is a quoted sequence of [HTTP_VERB] URL [HTTP_VERSION].
We will extract the URL and preserve only those ending in .html, .htm, or no extension
(directories). This is a simplification for the purpose of this example:
val urlExtractor = """^GET (.+) HTTP/\d.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs = weblogs.filter { log =>
  log.request match {
    case urlExtractor(url) =>
      val ext = url.takeRight(5).dropWhile(c => c != '.')
      allowedExtensions.contains(ext)
    case _ => false
  }
}
With this new dataset that contains only .html, .htm, and directories, we proceed to apply the same
top-k function as earlier:
val topContentPages = contentPageLogs
.withColumn("dayOfMonth", dayofmonth($"timestamp"))
.select($"request", $"dayOfMonth")
.groupBy($"dayOfMonth", $"request")
.agg(count($"request").alias("count"))
.orderBy(desc("count"))
topContentPages.show()
+----------+------------------------------------------------+-----+
|dayOfMonth| request|count|
+----------+------------------------------------------------+-----+
| 13| GET /shuttle/countdown/liftoff.html HTTP/1.0" | 4992|
| 5| GET /shuttle/countdown/ HTTP/1.0" | 3412|
| 6| GET /shuttle/countdown/ HTTP/1.0" | 3393|
| 3| GET /shuttle/countdown/ HTTP/1.0" | 3378|
| 13| GET /shuttle/countdown/ HTTP/1.0" | 3086|
| 7| GET /shuttle/countdown/ HTTP/1.0" | 2935|
| 4| GET /shuttle/countdown/ HTTP/1.0" | 2832|
| 2| GET /shuttle/countdown/ HTTP/1.0" | 2330| ...
We can see that the most popular page that month was liftoff.html, corresponding to the coverage
of the launch of the Discovery shuttle, as documented on the NASA archives. It’s closely followed
by countdown/, the days prior to the launch.
Streaming Analytics Phases
In the previous section, we explored historical NASA web log records. We found trending events
in those records, but much later than when the actual events happened.
One key driver for streaming analytics comes from the increasing demand of organizations to have
timely information that can help them make decisions at many different levels.
We can use the lessons that we have learned while exploring the archived records using a batch-
oriented approach and create a streaming job that will provide us with trending information as it
happens.
The first difference that we observe with the batch analytics is the source of the data. For our
streaming exercise, we will use a TCP server to simulate a web system that delivers its logs in real
time. The simulator will use the same dataset but will feed it through a TCP socket connection
that will embody the stream that we will be analyzing.
Connecting to a Stream
If you recall from the introduction of this chapter, Structured Streaming defines the concepts of
sources and sinks as the key abstractions to consume a stream and produce a result. We are going
to use the TextSocketSource implementation to connect to the server through a TCP socket.
Socket connections are defined by the host of the server and the port where it is listening for
connections. These two configuration elements are required to create the socket source:
val stream = sparkSession.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
Note how the creation of a stream is quite similar to the declaration of a static datasource in the
batch case. Instead of using the read builder, we use the readStream construct and we pass to it
the parameters required by the streaming source. As you will see during the course of this exercise
and later on as we go into the details of Structured Streaming, the API is basically the same
DataFrame and Dataset API for static data but with some modifications and limitations that you
will learn in detail.
Preparing the Data in the Stream
The socket source produces a streaming DataFrame with one column, value, which contains the
data received from the stream.
In the batch analytics case, we could load the data directly as JSON records. In the case of the
Socket source, that data is plain text. To transform our raw data to WebLog records, we first
require a schema. The schema provides the necessary information to parse the text to a JSON
object. It’s the structure when we talk about structured streaming.
After defining a schema for our data, we proceed to create a Dataset, following these steps:
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.sql.functions._

case class WebLog(host: String,
                  timestamp: Timestamp,
                  request: String,
                  http_reply: Int,
                  bytes: Long)

val webLogSchema = Encoders.product[WebLog].schema
val jsonStream = stream.select(from_json($"value", webLogSchema) as "record")
val webLogStream: Dataset[WebLog] = jsonStream.select("record.*").as[WebLog]
1. Obtain a schema from the case class definition
2. Transform the text value to JSON using the JSON support built into Spark SQL
3. Use the Dataset API to transform the JSON records to WebLog objects
As a result of this process, we obtain a Streaming Dataset of WebLog records.
Operations on Streaming Dataset
The webLogStream we just obtained is of type Dataset[WebLog] like we had in the batch analytics
job. The difference between this instance and the batch version is that webLogStream is a
streaming Dataset.
We can observe this by querying the object:
webLogStream.isStreaming
> res: Boolean = true
At this point in the batch job, we were creating the first query on our data: How many records are
contained in our dataset? This is a question that we can easily answer when we have access to all
of the data. However, how do we count records that are constantly arriving? The answer is that
some operations that we consider usual on a static Dataset, like counting all records, do not have
a defined meaning on a Streaming Dataset.
As we can observe, attempting to execute the count query in the following code snippet will result
in an AnalysisException:
val count = webLogStream.count()
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();;
This means that the direct queries we used on a static Dataset or DataFrame now need two levels
of interaction. First, we need to declare the transformations of our stream, and then we need to
start the stream process.
Creating a Query
What are popular URLs? In what time frame? Now that we have immediate analytic access to the
stream of web logs, we don’t need to wait for a day or a month to have a rank of the popular
URLs. We can have that information as trends unfold in much shorter windows of time.
First, to define the period of time of our interest, we create a window over some timestamp. An
interesting feature of Structured Streaming is that we can define that time interval on the timestamp
when the data was produced, also known as event time, as opposed to the time when the data is
being processed.
Our window definition will be of five minutes of event data. Given that our timeline is simulated,
the five minutes might happen much faster or slower than the clock time. In this way, we can
clearly appreciate how Structured Streaming uses the timestamp information in the events to
keep track of the event timeline.
As we learned from the batch analytics, we should extract the URLs and select only content pages,
like .html, .htm, or directories. Let’s apply that acquired knowledge first before proceeding to
define our windowed query:
// A regex expression to extract the accessed URL from weblog.request
val urlExtractor = """^GET (.+) HTTP/\d.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs: String => Boolean = url => {
val ext = url.takeRight(5).dropWhile(c => c != '.')
allowedExtensions.contains(ext)
}
val urlWebLogStream = webLogStream.flatMap { weblog =>
weblog.request match {
case urlExtractor(url) if (contentPageLogs(url)) =>
Some(weblog.copy(request = url))
case _ => None
}
}
We have converted the request to contain only the visited URL and filtered out all noncontent
pages. Now, we define the windowed query to compute the top trending URLs:
val rankingURLStream = urlWebLogStream
.groupBy($"request", window($"timestamp", "5 minutes", "1 minute"))
.count()
Start the Stream Processing
All of the steps that we have followed so far have been to define the process that the stream will
undergo. But no data has been processed yet.
To start a Structured Streaming job, we need to specify a sink and an output mode. These are two
new concepts introduced by Structured Streaming:
• A sink defines where we want to materialize the resulting data; for example, to a file in a
filesystem, to an in-memory table, or to another streaming system such as Kafka.
• The output mode defines how we want the results to be delivered: do we want to see all
data every time, only updates, or just the new records?
These options are given to a writeStream operation. It creates the streaming query that starts the
stream consumption, materializes the computations declared on the query, and produces the result
to the output sink. For now, let’s use them empirically and observe the results.
For our query, shown in below Example, we use the memory sink and output mode complete to
have a fully updated table each time new records are added to the result of keeping track of the
URL ranking.
Example. Writing a stream to a sink
val query = rankingURLStream.writeStream
.queryName("urlranks")
.outputMode("complete")
.format("memory")
.start()
The memory sink outputs the data to a temporary table of the same name given in the queryName
option. We can observe this by querying the tables registered on Spark SQL:
scala> spark.sql("show tables").show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | urlranks| true|
+--------+---------+-----------+
In the expression in Example, query is of type StreamingQuery and it’s a handler to control the
query life cycle.
Exploring the Data
Given that we are accelerating the log timeline on the producer side, after a few seconds, we can
execute the next command to see the result of the first windows, as illustrated in the figure below. Note how
the processing time (a few seconds) is decoupled from the event time (hundreds of minutes of
logs):
val urlRanks = spark.sql("select * from urlranks")
urlRanks.select($"request", $"window", $"count").orderBy(desc("count"))
Figure: URL ranking: query results by window
Acquiring streaming data
In Structured Streaming, a source is an abstraction that lets us consume data from a streaming data
producer. Sources are not directly created. Instead, the sparkSession provides a builder method,
readStream, that exposes the API to specify a streaming source, called a format, and provide its
configuration.
For example, the code in Example creates a File streaming source. We specify the type of source
using the format method. The method schema lets us provide a schema for the data stream, which
is mandatory for certain source types, such as this File source.
Example. File streaming source
val fileStream = spark.readStream
.format("json")
.schema(schema)
.option("mode","DROPMALFORMED")
.load("/tmp/datasrc")
>fileStream:
org.apache.spark.sql.DataFrame = [id: string, timestamp: timestamp ... ]
Each source implementation has different options, and some have tunable parameters. In
Example, we are setting the option mode to DROPMALFORMED. This option instructs the
JSON stream processor to drop any line that neither complies with the JSON format nor matches
the provided schema.
Behind the scenes, the call to spark.readStream creates a DataStreamReader instance. This instance
is in charge of managing the different options provided through the builder method calls. Calling
load(...) on this DataStreamReader instance validates the options provided to the builder and, if
everything checks out, it returns a streaming DataFrame.
In our example, this streaming DataFrame represents the stream of data that will result from
monitoring the provided path and processing each new file in that path as JSON-encoded data,
parsed using the schema provided. All malformed records will be dropped from this data stream.
Loading a streaming source is lazy. What we get is a representation of the stream, embodied in the
streaming DataFrame instance, that we can use to express the series of transformations that we
want to apply to it in order to implement our specific business logic. Creating a streaming
DataFrame does not result in any data actually being consumed or processed until the stream is
materialized. This requires a query, as you will see further on.
Available Sources
As of Spark v2.4.0, the following streaming sources are supported:
• json, orc, parquet, csv, text, textFile: These are all file-based streaming sources. The
base functionality is to monitor a path (folder) in a filesystem and consume files atomically
placed in it. The files found will then be parsed by the formatter specified. For example, if
json is provided, the Spark json reader will be used to process the files, using the schema
information provided.
• socket: Establishes a client connection to a TCP server that is assumed to provide text data
through a socket connection.
• kafka: Creates a Kafka consumer able to retrieve data from Kafka.
• rate: Generates a stream of rows at the rate given by the rowsPerSecond option. It's mainly
intended as a testing source.
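For instance, a rate source is handy for quick experiments. The sketch below creates one; the rowsPerSecond value is arbitrary:
// test source that emits (timestamp, value) rows at a fixed rate
val testStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()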
Transforming streaming data
As we saw in the previous section, the result of calling load is a streaming DataFrame. After we
have created our streaming DataFrame using a source, we can use the Dataset or DataFrame API
to express the logic that we want to apply to the data in the stream in order to implement our
specific use case.
Assuming that we are using data from a sensor network, in the Example below we are selecting the fields
deviceId, timestamp, sensorType, and value from a sensorStream and filtering to only those
records where the sensor is of type temperature and its value is higher than the given threshold.
Example: Filter and projection
val highTempSensors = sensorStream
.select($"deviceId", $"timestamp", $"sensorType", $"value")
.where($"sensorType" === "temperature" && $"value" > threshold)
Likewise, we can aggregate our data and apply operations to the groups over time. Example shows
that we can use timestamp information from the event itself to define a time window of five
minutes that will slide every minute.
What is important to grasp here is that the Structured Streaming API is practically the same as the
Dataset API for batch analytics, with some additional provisions specific to stream processing.
Example: Average by sensor type over time
val avgBySensorTypeOverTime = sensorStream
.select($"timestamp", $"sensorType", $"value")
.groupBy(window($"timestamp", "5 minutes", "1 minute"), $"sensorType")
.agg(avg($"value")
If you are not familiar with the structured APIs of Spark, we suggest that you familiarize yourself
with it. Covering this API in detail is beyond the scope of this book.
Streaming API Restrictions on the DataFrame API
As we hinted in the previous chapter, some operations that are offered by the standard DataFrame
and Dataset API do not make sense on a streaming context. We gave the example of stream.count,
which does not make sense to use on a stream. In general, operations that require immediate
materialization of the underlying dataset are not allowed. These are the API operations not directly
supported on streams:
• count
• show
• describe
• limit
• take(n)
• distinct
• foreach
• sort
• multiple stacked aggregations
Next to these operations, stream-stream and static-stream joins are partially supported.
Understanding the limitations Although some operations, like count or limit, do not make sense
on a stream, some other stream operations are computationally difficult. For example, distinct is
one of them. Filtering duplicates in an arbitrary stream would require remembering all of
the data seen so far and comparing each new record with all records already seen. The first condition
would require infinite memory and the second has a computational complexity of O(n²), which
becomes prohibitive as the number of elements (n) increases.
Operations on aggregated streams Some of the unsupported operations become defined after
we apply an aggregation function to the stream. Although we can’t count the stream, we could
count messages received per minute or count the number of devices of a certain type.
In Example, we define a count of events per sensorType per minute.
Example: Count of sensor types over time
val avgBySensorTypeOverTime = sensorStream
.select($"timestamp", $"sensorType")
.groupBy(window($"timestamp", "1 minutes", "1 minute"), $"sensorType")
.count()
Likewise, it’s also possible to define a sort on aggregated data, although it’s further restricted to
queries with output mode complete.
Stream deduplication We discussed that distinct on an arbitrary stream is computationally
difficult to implement. But if we can define a key that informs us when an element in the stream
has already been seen, we can use it to remove duplicates by calling dropDuplicates on that key, as sketched below.
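A minimal sketch of this idea, reusing the sensorStream from the previous section and assuming that the pair (deviceId, timestamp) identifies a record; the watermark bounds how much deduplication state must be kept:
// drop records already seen; the watermark keeps the deduplication state bounded
val dedupedSensors = sensorStream
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("deviceId", "timestamp")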
Workarounds Although some operations are not supported in the exact same way as in the batch
model, there are alternative ways to achieve the same functionality:
• foreach Although foreach cannot be directly used on a stream, there’s a foreach sink that
provides the same functionality. Sinks are specified in the output definition of a stream.
• show Although show requires an immediate materialization of the query, and hence it’s
not possible on a streaming Dataset, we can use the console sink to output data to the
screen.
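For example, the console sink can play the role of show for debugging purposes. A sketch, reusing the highTempSensors stream defined earlier; numRows is optional:
// print each microbatch of results to standard output instead of calling show()
val debugQuery = highTempSensors.writeStream
  .format("console")
  .option("numRows", "20")
  .outputMode("append")
  .start()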
Output the resulting data
All operations that we have done so far—such as creating a stream and applying transformations
on it—have been declarative. They define from where to consume the data and what operations
we want to apply to it. But up to this point, there is still no data flowing through the system.
Before we can initiate our stream, we need to first define where and how we want the output data
to go:
• Where relates to the streaming sink: the receiving side of our streaming data.
• How refers to the output mode: how to treat the resulting records in our stream.
From the API perspective, we materialize a stream by calling writeStream on a streaming
DataFrame or Dataset.
Calling writeStream on a streaming Dataset creates a DataStreamWriter. This is a builder instance
that provides methods to configure the output behavior of our streaming process.
Example. File streaming sink
val query = stream.writeStream
.format("json")
.queryName("json-writer")
.outputMode("append")
.option("path", "/target/dir")
.option("checkpointLocation", "/checkpoint/dir")
.trigger(ProcessingTime("5 seconds"))
.start()
>query: org.apache.spark.sql.streaming.StreamingQuery = ...
format
The format method lets us specify the output sink by providing the name of a built-in sink or the
fully qualified name of a custom sink.
As of Spark v2.4.0, the following streaming sinks are available:
• console sink A sink that prints to the standard output. It shows a number of rows
configurable with the option numRows.
• file sink File-based and format-specific sink that writes the results to a filesystem. The
format is specified by providing the format name: csv, hive, json, orc, parquet, avro, or
text.
• kafka sink A Kafka-specific producer sink that is able to write to one or more Kafka
topics.
• memory sink Creates an in-memory table using the provided query name as table name.
This table receives continuous updates with the results of the stream.
• foreach sink Provides a programmatic interface to access the stream contents, one
element at a time.
• foreachBatch sink foreachBatch is a programmatic sink interface that provides access to
the complete DataFrame that corresponds to each underlying microbatch of the
Structured Streaming execution.
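As an illustration, the sketch below uses foreachBatch to reuse the regular batch DataFrame writer on every microbatch; the stream name is reused from earlier and the target path is hypothetical:
// write every microbatch with the ordinary batch writer
val batchQuery = highTempSensors.writeStream
  .foreachBatch { (batchDf: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // hypothetical output path, one directory per microbatch
    batchDf.write.mode("append").parquet(s"/tmp/batches/batch-$batchId")
  }
  .start()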
outputMode
The outputMode specifies the semantics of how records are added to the output of the streaming
query. The supported modes are append, update, and complete:
• append (default mode) Adds only final records to the output stream. A record is
considered final when no new records of the incoming stream can modify its value. This
is always the case with linear transformations like those resulting from applying projection,
filtering, and mapping. This mode guarantees that each resulting record will be output only
once.
• update Adds new and updated records since the last trigger to the output stream. update
is meaningful only in the context of an aggregation, where aggregated values change as
new records arrive. If more than one incoming record changes a single result, all changes
between trigger intervals are collated into one output record.
• complete complete mode outputs the complete internal representation of the stream. This
mode also relates to aggregations, because for nonaggregated streams, we would need to
remember all records seen so far, which is unrealistic. From a practical perspective,
complete mode is recommended only when you are aggregating values over low-cardinality
criteria, like count of visitors by country, for which we know that the number of countries
is bounded.
Understanding the append semantic
When the streaming query contains aggregations, the definition of final becomes nontrivial. In
an aggregated computation, new incoming records might change an existing aggregated value
when they comply with the aggregation criteria used. Following our definition, we cannot
output a record using append until we know that its value is final. Therefore, the use of the
append output mode in combination with aggregate queries is restricted to queries for which
the aggregation is expressed using event-time and it defines a watermark. In that case, append
will output an event as soon as the watermark has expired and hence it’s considered that no
new records can alter the aggregated value. As a consequence, output events in append mode
will be delayed by the aggregation time window plus the watermark offset.
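A sketch of such a query, reusing the sensorStream fields from earlier; the 10-minute watermark is an arbitrary choice for illustration, and the paths are hypothetical:
// event-time aggregation whose groups become final once the watermark passes,
// which makes the append output mode legal for this query
val avgWithWatermark = sensorStream
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"sensorType")
  .agg(avg($"value"))

val appendQuery = avgWithWatermark.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/avg-by-type")
  .option("checkpointLocation", "/tmp/avg-checkpoint")
  .start()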
queryName With queryName, we can provide a name for the query that is used by some
sinks and also presented in the job description in the Spark UI, as depicted in Figure.
Figure: Completed Jobs in the Spark UI showing the query name in the job description
option With the option method, we can provide specific key–value pairs of configuration to the
stream, akin to the configuration of the source. Each sink can have specific configuration we can
customize using this method. We can add as many .option(...) calls as necessary to configure the
sink.
options options is an alternative to option that takes a Map[String, String] containing all the key–
value configuration parameters that we want to set. This alternative is more friendly to an
externalized configuration model, where we don't know a priori the settings to be passed to the
sink’s configuration.
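For instance, the File sink configuration from the previous example could equally be expressed as a Map (a sketch with the same values):
// equivalent to chaining several .option(...) calls
val sinkOptions = Map(
  "path"               -> "/target/dir",
  "checkpointLocation" -> "/checkpoint/dir"
)
val configuredWriter = stream.writeStream
  .format("json")
  .options(sinkOptions)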
trigger The optional trigger option lets us specify the frequency at which we want the results to
be produced. By default, Structured Streaming will process the input and produce a result as soon
as possible. When a trigger is specified, output will be produced at each trigger interval.
org.apache.spark.sql.streaming.Trigger provides the following supported triggers:
• ProcessingTime() Lets us specify a time interval that will dictate the frequency of the
query results.
• Once() A particular Trigger that lets us execute a streaming job once. It is useful for testing
and also to apply a defined streaming job as a single-shot batch operation.
• Continuous() This trigger switches the execution engine to the experimental continuous
engine for low-latency processing. The checkpoint-interval parameter indicates the
frequency of the asynchronous checkpointing for data resilience. It should not be confused
with the batch interval of the ProcessingTime trigger.
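The sketch below shows how these triggers are passed to the writer; the intervals are arbitrary and each line only configures the writer, on which start() would still have to be called:
import org.apache.spark.sql.streaming.Trigger

stream.writeStream.trigger(Trigger.ProcessingTime("1 minute"))  // produce results every minute
stream.writeStream.trigger(Trigger.Once())                      // process the available data once and stop
stream.writeStream.trigger(Trigger.Continuous("1 second"))      // experimental continuous execution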
start() To materialize the streaming computation, we need to start the streaming process.
Finally, start() materializes the complete job description into a streaming computation and
initiates the internal scheduling process that results in data being consumed from the source,
processed, and produced to the sink. start() returns a StreamingQuery object, which is a handle
to manage the individual life cycle of each query. This means that we can simultaneously start
and stop multiple queries independently of one another within the same sparkSession.
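A small sketch of running several queries within the same session; the query names and sinks are illustrative:
// two independent queries started from the same session
val consoleQuery = highTempSensors.writeStream
  .format("console")
  .start()
val memoryQuery = highTempSensors.writeStream
  .queryName("hightemps")
  .format("memory")
  .start()

// block until any of the active queries terminates
spark.streams.awaitAnyTermination()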
Demo
The first part of our program deals with the creation of the streaming Dataset:
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
> rawData: org.apache.spark.sql.DataFrame
The entry point of Structured Streaming is an existing Spark Session (sparkSession). As you can
appreciate on the first line, the creation of a streaming Dataset is almost identical to the creation
of a static Dataset that would use a read operation instead. sparkSession.readStream returns a
DataStreamReader, a class that implements the builder pattern to collect the information needed
to construct the streaming source using a fluid API. In that API, we find the format option that
lets us specify our source provider, which, in our case, is kafka. The options that follow it are
specific to the source:
• kafka.bootstrap.servers
o Indicates the set of bootstrap servers to contact as a comma-separated list of
host:port addresses
• subscribe
o Specifies the topic or topics to subscribe to
• startingOffsets
o The offset reset policy to apply when this application starts out fresh.
The load() method evaluates the DataStreamReader builder and creates a DataFrame as a result,
as we can see in the returned value:
> rawData: org.apache.spark.sql.DataFrame
A DataFrame is an alias for Dataset[Row] with a known schema. After creation, you can use
streaming Datasets just like regular Datasets. This makes it possible to use the full-fledged Dataset
API with Structured Streaming, albeit some exceptions apply because not all operations, such as
show() or count(), make sense in a streaming context.
To programmatically differentiate a streaming Dataset from a static one, we can ask a Dataset
whether it is of the streaming kind:
rawData.isStreaming
res7: Boolean = true
And we can also explore the schema attached to it, using the existing Dataset API, as demonstrated
in Example.
Example. The Kafka schema
rawData.printSchema()
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
In general, Structured Streaming requires the explicit declaration of a schema for the consumed
stream. In the specific case of kafka, the schema for the resulting Dataset is fixed and is
independent of the contents of the stream. It consists of a set of fields specific to the Kafka source:
key, value, topic, partition, offset, timestamp, and timestampType, as we can see in the Example above.
In most cases, applications will be mostly interested in the contents of the value field where the
actual payload of the stream resides.
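Because key and value are delivered as binary, a common first step, sketched below, is to cast the payload to a string before parsing it further (this assumes the session implicits are imported):
// interpret the Kafka value payload as text
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]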
Application Logic
Recall that the intention of our job is to correlate the incoming IoT sensor data with a reference
file that contains all known sensors with their configuration. That way, we would enrich each
incoming record with specific sensor parameters that would allow us to interpret the reported data.
We would then save all correctly processed records to a Parquet file. The data coming from
unknown sensors would be saved to a separate file for later analysis.
Using Structured Streaming, our job can be implemented in terms of Dataset operations:
import scala.util.Try
val iotData = rawData.select($"value").as[String].flatMap { record =>
  val fields = record.split(",")
  // wrap the parsing in a Try so that malformed records become None and are dropped by flatMap
  Try {
    SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
  }.toOption
}
val sensorRef = sparkSession.read.parquet(s"$workDir/$referenceFile")
sensorRef.cache()
val sensorWithInfo = sensorRef.join(iotData, Seq("sensorId"), "inner")
val knownSensors = sensorWithInfo
.withColumn("dnvalue", $"value"*($"maxRange"-$"minRange")+$"minRange")
.drop("value", "maxRange", "minRange")
In the first step, we transform our CSV-formatted records back into SensorData entries. We apply
Scala functional operations on the typed Dataset[String] that we obtained from extracting the value
field as a String.
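The SensorData type used above is not shown in this excerpt. A minimal case class compatible with the conversions fields(0).toInt, fields(1).toLong, and fields(2).toDouble could look as follows; the field names other than sensorId and value are assumptions:
// assumed shape of the sensor records parsed from the CSV payload
case class SensorData(sensorId: Int, timestamp: Long, value: Double)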
Then, we use a streaming Dataset to static Dataset inner join to correlate the sensor data with the
corresponding reference using the sensorId as key.
To complete our application, we compute the real values of the sensor reading using the minimum-
maximum ranges in the reference data.
Writing to a Streaming Sink
The final step of our streaming application is to write the enriched IoT data to a Parquet-formatted
file. In Structured Streaming, the write operation is crucial: it marks the completion of the declared
transformations on the stream, defines a write mode, and upon calling start(), the processing of
the continuous query will begin.
In Structured Streaming, all operations are lazy declarations of what we want to do with the
streaming data. Only when we call start() will the actual consumption of the stream begin and the
query operations on the data materialize into actual results:
val knownSensorsQuery = knownSensors.writeStream
.outputMode("append")
.format("parquet")
.option("path", targetPath)
.option("checkpointLocation", "/tmp/checkpoint")
.start()
Let’s break this operation down:
• writeStream creates a builder object where we can configure the options for the desired
write operation, using a fluent interface.
• With format, we specify the sink that will materialize the result downstream. In our case,
we use the built-in FileStreamSink with Parquet format.
• outputMode is a new concept in Structured Streaming: given that we, theoretically, have access
to all the data seen in the stream so far, we also have the option to produce different views
of that data.
• The append mode, used here, implies that the new records affected by our streaming
computation are produced to the output.
The result of the start call is a StreamingQuery instance. This object provides methods to control
the execution of the query and request information about the status of our running streaming
query, as shown in Example.
Example. Query progress
knownSensorsQuery.recentProgress
res37: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array({
"id" : "6b9fe3eb-7749-4294-b3e7-2561f1e840b6",
"runId" : "0d8d5605-bf78-4169-8cfe-98311fc8365c",
"name" : null,
"timestamp" : "2017-08-10T16:20:00.065Z",
"numInputRows" : 4348,
"inputRowsPerSecond" : 395272.7272727273,
"processedRowsPerSecond" : 28986.666666666668,
"durationMs" : {
"addBatch" : 127,
"getBatch" : 3,
"getOffset" : 1,
"queryPlanning" : 7,
"triggerExecution" : 150,
"walCommit" : 11
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[iot-data]]",
"startOffset" : {
"iot-data" : { "0" : 19048348 } },
"endOffset" : {
"iot-data" : { "0" : 19052696 } },
"numInputRow...
In Example, we can see the StreamingQueryProgress as a result of calling
knownSensorsQuery.recentProgress. If we see nonzero values for the numInputRows, we can be
certain that our job is consuming data. We now have a Structured Streaming job running
properly.
Stream Processing with Spark Streaming
Spark Streaming was the first stream-processing framework built on top of the distributed
processing capabilities of Spark. Nowadays, it offers a mature API that's widely adopted in the
industry to process large-scale data streams.
Spark is, by design, a system that is really good at processing data distributed over a cluster of
machines. Spark's core abstraction, the Resilient Distributed Dataset (RDD), and its fluent
functional API permit the creation of programs that treat distributed data as a collection. That
abstraction lets us reason about data-processing logic in the form of transformations of the
distributed dataset. By doing so, it reduces the cognitive load previously required to create and
execute scalable and distributed data-processing programs.
Spark Streaming was created upon a simple yet powerful premise: apply Spark's distributed
computing capabilities to stream processing by transforming a continuous stream of data into
discrete data collections on which Spark could operate.
As we can see in Figure, the main task of Spark Streaming is to take data from the stream, package
it into small batches, and provide them to Spark for further processing. The output is then
produced to some downstream system.
Figure. Spark and Spark Streaming in action
The DStream Abstraction
Whereas Structured Streaming, which you learned in Part II, builds its streaming capabilities on
top of the Spark SQL abstractions of DataFrame and Dataset, Spark Streaming relies on the much
more fundamental Spark abstraction of RDD. At the same time, Spark Streaming introduces a
new concept: the Discretized Stream or DStream. A DStream represents a stream in terms of
discrete blocks of data that in turn are represented as RDDs over time, as we can see in Figure.
Figure. DStreams and RDDs in Spark Streaming
The DStream abstraction is primarily an execution model that, when combined with a functional
programming model, provides us with a complete framework to develop and execute streaming
applications.
DStreams as a Programming Model
The code representation of DStreams gives us a functional programming API consistent with the
RDD API and augmented with stream-specific functions to deal with aggregations, time-based
operations, and stateful computations. In Spark Streaming, we consume a stream by creating a
DStream from one of the native implementations, such as a SocketInputStream or using one of
the many connectors available that pro‐ vide a DStream implementation specific to a stream
provider (this is the case of Kafka, Twitter, or Kinesis connectors for Spark Streaming, just to
name a few):
// creates a DStream using a client socket connected to the given host and port
val textDStream = ssc.socketTextStream("localhost", 9876)
After we have obtained a DStream reference, we can implement our application logic using the
functions provided by the DStream API. For example, if the textDStream in the preceding code
is connected to a log server, we could count the number of error occurrences:
// we break down the stream of logs into error or info (not error)
// and create pairs of `(x, y)`.
// (1, 1) represents an error, and
// (0, 1) a non-error occurrence.
val errorLabelStream = textDStream.map{line =>
if (line.contains("ERROR")) (1, 1) else (0, 1)
}
We can then count the totals and compute the error rate by using an aggregation function called
reduce:
// reduce combines all the pairs in each batch into a single (errors, total) pair
val errorCountStream = errorLabelStream.reduce {
  case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)
}
To obtain our error rate, we perform a safe division:
// compute the error rate and create a string message with the value
val errorRateStream = errorCountStream.map {case (errors, total) =>
val errorRate = if (total > 0) errors.toDouble / total else 0.0
"Error Rate:" + errorRate
}
It’s important to note that up until now, we have been using transformations on the DStream but
there is still no data processing happening. All transformations on DStreams are lazy. This process
of defining the logic of a stream-processing application is better seen as the set of transformations
that will be applied to the data after the stream processing is started. As such, it’s a plan of action
that Spark Streaming will recurrently execute on the data consumed from the source DStream.
DStreams are immutable. It’s only through a chain of transformations that we can process and
obtain a result from our data.
Finally, the DStream programming model requires that the transformations are ended by an output
operation. This particular operation specifies how the DStream is materialized. In our case, we are
interested in printing the results of this stream computation to the console:
// print the results to the console
errorRateStream.print()
In summary, the DStream programming model consists of the functional composition of
transformations over the stream payload, materialized by one or more output operations and
recurrently executed by the Spark Streaming engine.
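The preceding snippets assume an existing StreamingContext called ssc. A minimal sketch of creating one and running the job locally might look like this; the application name, master, and batch interval are illustrative:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("error-rate").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // 2-second batch interval

// ... define textDStream, the transformations above, and errorRateStream.print() ...

ssc.start()            // start consuming and processing the stream
ssc.awaitTermination() // keep the application running until stopped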
DStreams as an Execution Model
In the preceding introduction to the Spark Streaming programming model, we could see how data
is transformed from its original form into our intended result as a series of lazy functional
transformations. The Spark Streaming engine is responsible for taking that chain of functional
transformations and turning it into an actual execution plan. That happens by receiving data from
the input stream(s), collecting that data into batches, and feeding it to Spark in a timely manner.
The measure of time to wait for data is known as the batch interval. It is usually a short amount
of time, ranging from approximately two hundred milliseconds to minutes depending on the
application requirements for latency. The batch interval is the central unit of time in Spark
Streaming. At each batch interval, the data corresponding to the previous interval is sent to Spark
for processing while new data is received. This process repeats as long as the Spark Streaming job
is active and healthy. A natural consequence of this recurring microbatch operation is that the
computation on the batch’s data has to complete within the duration of the batch interval so that
computing resources are available when the new microbatch arrives. As you will learn in this part
of the book, the batch interval dictates the time for most other functions in Spark Streaming.
Ad

More Related Content

Similar to Streaming Analytics unit 4 notes for engineers (20)

A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
Jessica Morris
 
Module01
 Module01 Module01
Module01
NPN Training
 
Benchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingBenchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark Streaming
AKUDA Labs
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
IJTET Journal
 
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
Markus Michalewicz
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Spark Summit
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013
BonFIRE
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Spark1
Spark1Spark1
Spark1
Dr. G. Bharadwaja Kumar
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
David Groozman
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Michael Spector
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Qualcomm Developer Network
 
seminar 4 sem
seminar 4 semseminar 4 sem
seminar 4 sem
AISWARYA TV
 
A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
Benchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingBenchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark Streaming
AKUDA Labs
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
IJTET Journal
 
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
AskTom: How to Make and Test Your Application "Oracle RAC Ready"?
Markus Michalewicz
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Spark Summit
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013COCOMA presentation, FIA 2013
COCOMA presentation, FIA 2013
BonFIRE
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Michael Spector
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Qualcomm Developer Network
 

More from ManjuAppukuttan2 (18)

SEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product managementSEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product management
ManjuAppukuttan2
 
SEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product managementSEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product management
ManjuAppukuttan2
 
Unit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming AnalyticsUnit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming Analytics
ManjuAppukuttan2
 
SRM First Review PPT Template for project
SRM First  Review PPT Template for projectSRM First  Review PPT Template for project
SRM First Review PPT Template for project
ManjuAppukuttan2
 
Streaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineersStreaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineersStreaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics unit 2 notes for engineers
Streaming Analytics unit 2 notes for  engineersStreaming Analytics unit 2 notes for  engineers
Streaming Analytics unit 2 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineersStreaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineers
ManjuAppukuttan2
 
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.pptCHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.pptCHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptCHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
ManjuAppukuttan2
 
UNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.pptUNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.ppt
ManjuAppukuttan2
 
UNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.pptUNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.ppt
ManjuAppukuttan2
 
SA UNIT III STORM.pdf
SA UNIT III STORM.pdfSA UNIT III STORM.pdf
SA UNIT III STORM.pdf
ManjuAppukuttan2
 
SA UNIT II KAFKA.pdf
SA UNIT II KAFKA.pdfSA UNIT II KAFKA.pdf
SA UNIT II KAFKA.pdf
ManjuAppukuttan2
 
SA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfSA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdf
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdfCHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdf
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdfCHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
ManjuAppukuttan2
 
SEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product managementSEPM UNIT V.pptx software engineeing and product management
SEPM UNIT V.pptx software engineeing and product management
ManjuAppukuttan2
 
SEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product managementSEPM UNIT V.pptx software engineering and product management
SEPM UNIT V.pptx software engineering and product management
ManjuAppukuttan2
 
Unit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming AnalyticsUnit 1 Introduction to Streaming Analytics
Unit 1 Introduction to Streaming Analytics
ManjuAppukuttan2
 
SRM First Review PPT Template for project
SRM First  Review PPT Template for projectSRM First  Review PPT Template for project
SRM First Review PPT Template for project
ManjuAppukuttan2
 
Streaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineersStreaming Analytics Unit 5 notes for engineers
Streaming Analytics Unit 5 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineersStreaming Analytics Unit 3 notes for engineers
Streaming Analytics Unit 3 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics unit 2 notes for engineers
Streaming Analytics unit 2 notes for  engineersStreaming Analytics unit 2 notes for  engineers
Streaming Analytics unit 2 notes for engineers
ManjuAppukuttan2
 
Streaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineersStreaming Analytics Unit 1 notes for engineers
Streaming Analytics Unit 1 notes for engineers
ManjuAppukuttan2
 
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.pptCHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.pptCHAPTER 2 BASIC ANALYSIS.ppt
CHAPTER 2 BASIC ANALYSIS.ppt
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptCHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
ManjuAppukuttan2
 
UNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.pptUNIT 3.1 INTRODUCTON TO IDA.ppt
UNIT 3.1 INTRODUCTON TO IDA.ppt
ManjuAppukuttan2
 
UNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.pptUNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.ppt
ManjuAppukuttan2
 
SA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfSA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdf
ManjuAppukuttan2
 
CHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdfCHAPTER 2 BASIC ANALYSIS.pdf
CHAPTER 2 BASIC ANALYSIS.pdf
ManjuAppukuttan2
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdfCHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
CHAPTER 1 MALWARE ANALYSIS PRIMER.pdf
ManjuAppukuttan2
 
Ad

Recently uploaded (20)

new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
samueljackson3773
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Compiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptxCompiler Design_Lexical Analysis phase.pptx
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
samueljackson3773
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Ad

Streaming Analytics unit 4 notes for engineers

  • 1. 1 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM UNIT IV Apache Spark Streaming Introduction - Spark’s Memory Usage - Understanding Resilience and Fault - Tolerance in a Distributed System - Spark’s cluster manager - Data Delivery Semantics in Spark - Data Delivery Semantics in Spark Applications - Microbatching - Dynamic Batch Interval - Structured Stream processing model - Spark Streaming Resilience Model - Data Structures in Spark – RDDs and DStreams - Spark Fault Tolerance Guarantees - First Steps in Structured Streaming - Streaming Analytics Phases - Acquiring streaming data - Transforming streaming data - Output the resulting data - Demo – Stream Processing with Spark Streaming Apache Spark Streaming Introduction Spark offers two different stream-processing APIs, • Spark Streaming and • Structured Streaming: Spark Streaming: This is an API and a set of connectors, in which a Spark program is being served small batches of data collected from a stream in the form of microbatches spaced at fixed time intervals, performs a given computation, and eventually returns a result at every interval. Structured Streaming: This is an API and a set of connectors, built on the substrate of a SQL query optimizer, Catalyst. It offers an API based on DataFrames and the notion of continuous queries over an unbounded table that is constantly updated with fresh records from the stream. Spark’s Memory Usage Spark offers in-memory storage of slices of a dataset, which must be initially loaded from a data source. The data source can be a distributed filesystem or another storage medium. Spark’s form of in-memory storage is analogous to the operation of caching data. Hence, a value in Spark’s in-memory storage has a base, which is its initial data source, and layers of successive operations applied to it. Failure Recovery What happens in case of a failure? Because Spark knows exactly which data source was used to ingest the data in the first place, and because it also knows all the operations that were performed on it thus far, it can reconstitute the segment of lost data that was on a crashed executor, from scratch. Obviously, this goes faster if that reconstitution (recovery, in Spark’s parlance), does not need to be totally from scratch. So, Spark offers a replication mechanism, quite in a similar way to distributed filesystems. However, because memory is such a valuable yet limited commodity, Spark makes (by default) the cache short lived. Lazy Evaluation A good part of the operations that can be defined on values in Spark’s storage have a lazy execution, and it is the execution of a final, eager output operation that will trigger the actual execution of computation in a Spark cluster. It’s worth noting that if a program consists of a series of linear operations, with the previous one feeding into the next, the intermediate results disappear right after said next step has consumed its input. Cache Hints On the other hand, what happens if we have several operations to do on a single intermediate result? Should we have to compute it several times? Thankfully, Spark lets users specify that an intermediate value is important and how its contents should be safeguarded for later. Figure below presents the data flow of such an operation.
  • 2. 2 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Figure: Operations on cached values Finally, Spark offers the opportunity to spill the cache to secondary storage in case it runs out of memory on the cluster, extending the in-memory operation to secondary —and significantly slower—storage to preserve the functional aspects of a data pro‐ cess when faced with temporary peak loads. Now that we have an idea of the main characteristics of Apache Spark, let’s spend some time focusing on one design choice internal to Spark, namely, the latency versus throughput trade-off. Understanding Resilience and Fault - Tolerance in a Distributed System Resilience and fault tolerance are absolutely essential for a distributed application: they are the condition by which we will be able to perform the user’s computation to completion. Nowadays, clusters are made of commodity machines that are ideally operated near peak capacity over their lifetime. To put it mildly, hardware breaks quite often. A resilient application can make progress with its process despite latencies and noncritical faults in its distributed environment. A fault-tolerant application is able to succeed and complete its process despite the unplanned termination of one or several of its nodes. This sort of resiliency is especially relevant in stream processing given that the applications we’re scheduling are supposed to live for an undetermined amount of time. That undetermined amount of time is often correlated with the life cycle of the data source. For example, if we are running a retail website and we are analyzing transactions and website interactions as they come into the system against the actions and clicks and navigation of users visiting the site, we potentially have a data source that will be available for the entire duration of the lifetime of our business, which we hope to be very long, if our business is going to be successful. As a consequence, a system that will process our data in a streaming fashion should run uninterrupted for long periods of time. This “show must go on” approach of streaming computation makes the resiliency and fault- tolerance characteristics of our applications more important. For a batch job, we could launch it, hope it would succeed, and relaunch if we needed to change it or in case of failure. For an online streaming Spark pipeline, this is not a reasonable assumption.
  • 3. 3 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Fault Recovery In the context of fault tolerance, we are also interested in understanding how long it takes to recover from failure of one particular node. Indeed, stream processing has a particular aspect: data continues being generated by the data source in real time. To deal with a batch computing failure, we always have the opportunity to restart from scratch and accept that obtaining the results of computation will take longer. Thus, a very primitive form of fault tolerance is detecting the failure of a particular node of our deployment, stopping the computation, and restarting from scratch. That process can take more than twice the original duration that we had budgeted for that computation, but if we are not in a hurry, this still acceptable. For stream processing, we need to keep receiving data and thus potentially storing it, if the recovering cluster is not ready to assume any processing yet. This can pose a problem at a high throughput: if we try restarting from scratch, we will need not only to reprocess all of the data that we have observed since the beginning of the application—which in and of itself can be a challenge—but during that reprocessing of historical data, we will need it to continue receiving and thus potentially storing new data that was generated while we were trying to catch up. This pattern of restarting from scratch is something so intractable for streaming that we will pay special attention to Spark’s ability to restart only minimal amounts of computation in the case that a node becomes unavailable or nonfunctional. Cluster Manager Support for Fault Tolerance We want to highlight why it is still important to understand Spark’s fault tolerance guarantees, even if there are similar features present in the cluster managers of YARN, Mesos, or Kubernetes. To understand this, we can consider that cluster managers help with fault tolerance when they work hand in hand with a framework that is able to report failures and request new resources to cope with those exceptions. Spark possesses such capabilities. For example, production cluster managers such as YARN, Mesos, or Kubernetes have the ability to detect a node’s failure by inspecting endpoints on the node and asking the node to report on its own readiness and liveness state. If these cluster managers detect a failure and they have spare capacity, they will replace that node with another, made available to Spark. That particular action implies that the Spark executor code will start anew in another node, and then attempt to join the existing Spark cluster. The cluster manager, by definition, does not have introspection capabilities into the applications being run on the nodes that it reserves. Its responsibility is limited to the container that runs the user’s code. That responsibility boundary is where the Spark resilience features start. To recover from a failed node, Spark needs to do the following: • Determine whether that node contains some state that should be reproduced in the form of checkpointed files • Understand at which stage of the job a node should rejoin the computation The goal here is for us to explore that if a node is being replaced by the cluster man‐ ager, Spark has capabilities that allow it to take advantage of this new node and to distribute computation onto it.
Within this context, our focus is on Spark's responsibilities as an application, and we underline the capabilities of a cluster manager only when necessary: for instance, a node could be replaced because of a hardware failure or because its work was simply preempted by a higher-priority job. Apache Spark is blissfully unaware of the why, and focuses on the how.
Spark's cluster manager
Spark has two internal cluster managers:
The local cluster manager: This emulates the function of a cluster manager (or resource manager) for testing purposes. It reproduces the presence of a cluster of distributed machines using a threading model that relies on your local machine having only a few available cores. This mode is easy to reason about because it executes only on the user's laptop.
The standalone cluster manager: A relatively simple, Spark-only cluster manager that is rather limited in its ability to slice and dice resource allocation. The standalone cluster manager holds and makes available the entire worker node on which a Spark executor is deployed and started. It also expects the executor to have been predeployed there, and the actual shipping of that .jar to a new machine is not within its scope. It has the ability to take a specific number of executors, which are part of its deployment of worker nodes, and execute tasks on them. This cluster manager is extremely useful for Spark developers, because it provides a bare-bones resource-management solution that allows them to focus on improving Spark in an environment without any bells and whistles. The standalone cluster manager is not recommended for production deployments.
As a summary, Apache Spark is a task scheduler in that what it schedules are tasks, units of distribution of computation that have been extracted from the user program. Spark also communicates with and is deployed through cluster managers, including Apache Mesos, YARN, and Kubernetes, or, in some cases, its own standalone cluster manager. The purpose of that communication is to reserve a number of executors, which are the units in which Spark understands equal-sized amounts of computation resources, a virtual "node" of sorts. The reserved resources in question could be provided by the cluster manager as the following:
• Limited processes (e.g., in some basic use cases of YARN), in which processes have their resource consumption metered but are not prevented from accessing each other's resources by default.
• Containers (e.g., in the case of Mesos or Kubernetes), in which containers are a relatively lightweight resource-reservation technology born out of the cgroups and namespaces of the Linux kernel, whose most popular iteration is the Docker project.
• Either of the above deployed on virtual machines (VMs), themselves coming with specific core and memory reservations.
Data Delivery Semantics in Spark
As you have seen in the streaming model, the fact that streaming jobs act on the basis of data that is generated in real time means that intermediate results need to be provided to the consumer of that streaming pipeline on a regular basis.
  • 5. 5 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Those results are being produced by some part of our cluster. Ideally, we would like those observable results to be coherent, in line, and in real time with respect to the arrival of data. This means that we want results that are exact, and we want them as soon as possible. However, distributed computation has its own challenges in that it sometimes includes not only individual nodes failing, as we have mentioned, but it also encounters situations like network partitions, in which some parts of our cluster are not able to communicate with other parts of that cluster, as illustrated in below Figure. Figure: A network partition Spark has been designed using a driver/executor architecture. A specific machine, the driver, is tasked with keeping track of the job progression along with the job submissions of a user, and the computation of that program occurs as the data arrives. How‐ ever, if the network partitions separate some part of the cluster, the driver might be able to keep track of only the part of the executors that form the initial cluster. In the other section of our partition, we will find nodes that are entirely able to function, but will simply be unable to account for the proceedings of their computation to the driver. This creates an interesting case in which those “zombie” nodes do not receive new tasks, but might well be in the process of completing some fragment of computation that they were previously given. Being unaware of the partition, they will report their results as any executor would. And because this reporting of results sometimes does not go through the driver (for fear of making the driver a bottleneck), the reporting of these zombie results could succeed. Because the driver, a single point of bookkeeping, does not know that those zombie executors are still functioning and reporting results, it will reschedule the same tasks that the lost executors had to accomplish on new nodes. This creates a double answering problem in which the zombie machines lost through partitioning and the machines bearing the rescheduled tasks both report the same results. This bears real consequences: one example of stream computation that we previously mentioned is routing tasks for financial transactions. A double withdrawal, in that context, or double stock purchase orders, could have tremendous consequences. It is not only the aforementioned problem that causes different processing semantics. Another important reason is that when output from a stream-processing application and state checkpointing cannot be completed in one atomic operation, it will cause data corruption if failure happens between checkpointing and outputting.
  • 6. 6 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM These challenges have therefore led to a distinction between at least once processing and at most once processing: • At least once: This processing ensures that every element of a stream has been processed once or more. • At most once: This processing ensures that every element of the stream is processed once or less. • Exactly once: This is the combination of “at least once” and “at most once.” At-least-once processing is the notion that we want to make sure that every chunk of initial data has been dealt with—it deals with the node failure we were talking about earlier. As we’ve mentioned, when a streaming process suffers a partial failure in which some nodes need to be replaced or some data needs to be recomputed, we need to reprocess the lost units of computation while keeping the ingestion of data going. That requirement means that if you do not respect at- least-once processing, there is a chance for you, under certain conditions, to lose data. The antisymmetric notion is called at-most-once processing. At-most-once processing systems guarantee that the zombie nodes repeating the same results as a rescheduled node are treated in a coherent manner, in which we keep track of only one set of results. By keeping track of what data their results were about, we’re able to make sure we can discard repeated results, yielding at-most- once processing guarantees. The way in which we achieve this relies on the notion of idempotence applied to the “last mile” of result reception. Idempotence qualifies a function such that if we apply it twice (or more) to any data, we will get the same result as the first time. This can be achieved by keeping track of the data that we are reporting a result for, and having a bookkeeping system at the output of our streaming process. Microbatching Two important approaches to stream processing: • bulk-synchronous processing, and • one-at-a-time record processing. The objective of this is to connect those two ideas to the two APIs that Spark possesses for stream processing: Spark Streaming and Structured Streaming. Microbatching: An Application of Bulk-Synchronous Processing Spark Streaming, the more mature model of stream processing in Spark, is roughly approximated by what’s called a Bulk Synchronous Parallelism (BSP) system. The gist of BSP is that it includes two things: • A split distribution of asynchronous work • A synchronous barrier, coming in at fixed intervals The split is the idea that each of the successive steps of work to be done in streaming is separated in a number of parallel chunks that are roughly proportional to the number of executors available to perform this task. Each executor receives its own chunk (or chunks) of work and works separately until the second element comes in. A particular resource is tasked with keeping track of the progress of computation. With Spark Streaming, this is a synchronization point at the “driver” that allows the work to progress to the next step. Between those scheduled steps, all of the executors on the cluster are doing the same thing.
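To make the fixed-interval barrier concrete, here is a minimal Spark Streaming (DStream) sketch; it is not taken from the original material, the two-second batch interval plays the role of the synchronization barrier, and the socket host and port are placeholder values:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("bsp-microbatch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))    // 2-second batch interval = barrier frequency

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder host and port
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                       // executed once per 2-second microbatch

ssc.start()
ssc.awaitTermination()
Each print happens at the close of a microbatch, which is the observable effect of the synchronization barrier described above.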
  • 7. 7 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM Note that what is being passed around in this scheduling process are the functions that describe the processing that the user wants to execute on the data. The data is already on the various executors, most often being delivered directly to these resources over the lifetime of the cluster. This was coined “function-passing style” by Heather Miller in 2016 (and formalized in [Miller2016]): asynchronously pass safe functions to distributed, stationary, immutable data in a stateless container, and use lazy combinators to eliminate intermediate data structures. The frequency at which further rounds of data processing are scheduled is dictated by a time interval. This time interval is an arbitrary duration that is measured in batch processing time; that is, what you would expect to see as a “wall clock” time observa‐ tion in your cluster. For stream processing, we choose to implement barriers at small, fixed intervals that better approximate the real-time notion of data processing One-Record-at-a-Time Processing By contrast, one-record-at-a-time processing functions by pipelining: it analyzes the whole computation as described by user-specified functions and deploys it as pipelines using the resources of the cluster. Then, the only remaining matter is to flow data through the various resources, following the prescribed pipeline. Note that in this latter case, each step of the computation is materialized at some place in the cluster at any given point. Systems that function mostly according to this paradigm include Apache Flink, Naiad, Storm, and IBM Streams. This does not necessarily mean that those systems are incapable of microbatching, but rather characterizes their major or most native mode of operation and makes a statement on their dependency on the process of pipelining, often at the heart of their processing. The minimum latency, or time needed for the system to react to the arrival of one particular event, is very different between those two: minimum latency of the micro‐ batching system is therefore the time needed to complete the reception of the current microbatch (the batch interval) plus the time needed to start a task at the executor where this data falls (also called scheduling time). On the other hand, a system pro‐ cessing records one by one can react as soon as it meets the event of interest. Microbatching Versus One-at-a-Time: The Trade-Offs Despite their higher latency, microbatching systems offer significant advantages: • They are able to adapt at the synchronization barrier boundaries. That adaptation might represent the task of recovering from failure, if a number of executors have been shown to become deficient or lose data. The periodic synchronization can also give us an opportunity to add or remove executor nodes, giving us the possibility to grow or shrink our resources depending on what we’re seeing as the cluster load, observed through the throughput on the data source. • Our BSP systems can sometimes have an easier time providing strong consistency because their batch determinations—that indicate the beginning and the end of a particular batch of data—are deterministic and recorded. Thus, any kind of computation can be redone and produce the same results the second time. • Having data available as a set that we can probe or inspect at the beginning of the microbatch allows us to perform efficient optimizations that can provide ideas on the way to compute on the data. 
Exploiting that on each microbatch, we can consider the specific
case rather than the general processing, which is used for all possible input. For example, we could take a sample or compute a statistical measure before deciding to process or drop each microbatch. More importantly, the simple presence of the microbatch as a well-identified element also allows an efficient way of specifying programming for both batch processing (where the data is at rest and has been saved somewhere) and streaming (where the data is in flight). The microbatch, even for mere instants, looks like data at rest.
Dynamic Batch Interval
What is this notion of dynamic batch interval? The dynamic batch interval is the notion that the recomputation of data in a streaming DataFrame or Dataset consists of an update of existing data with the new elements seen over the wire. This update occurs based on a trigger, and the usual basis for it is a time duration. That time duration is still determined by a fixed wall-clock signal that we expect to be synchronized within our entire cluster and that represents a single synchronous source of time shared among every executor. However, this trigger can also be the statement of "as often as possible." That statement is simply the idea that a new batch should be started as soon as the previous one has been processed, given a reasonable initial duration for the first batch. This means that the system will launch batches as often as possible. In this situation, the observable latency is closer to that of one-element-at-a-time processing. The idea here is that the microbatches produced by this system will converge to the smallest manageable size, making our stream flow faster through the executor computations that are necessary to produce a result. As soon as that result is produced, a new query will be started and scheduled by the Spark driver.
Structured Stream processing model
The main steps in Structured Streaming processing are as follows:
1. When the Spark driver triggers a new batch, processing starts with updating the accounting of data read from the data source, in particular, getting the data offsets for the beginning and the end of the latest batch.
2. This is followed by logical planning, the construction of successive steps to be executed on the data, followed by query planning (intrastep optimization).
3. Then comes the launch and scheduling of the actual computation, adding a new batch of data to update the continuous query that we are trying to refresh.
Hence, from the point of view of the computation model, we will see that the API is significantly different from Spark Streaming.
The Disappearance of the Batch Interval
We now briefly explain what Structured Streaming batches mean and their impact with respect to operations. In Structured Streaming, the batch interval that we are using is no longer a computation budget. With Spark Streaming, the idea was that if we produce data every two minutes and flow data into Spark's memory every two minutes, we should produce the results of computation on that batch of data in at most two minutes, to clear the memory of our cluster for the next microbatch. Ideally, as much data flows out as flows in, and the usage of the collective memory of our cluster remains stable.
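As a hedged illustration of the difference between a fixed trigger and the "as often as possible" default, the following sketch uses a placeholder streaming DataFrame named events and the console sink; it is not part of the original example:
import org.apache.spark.sql.streaming.Trigger

// Default behavior (no trigger): the next microbatch starts as soon as the previous one finishes.
val asapQuery = events.writeStream.format("console").start()

// Fixed trigger: closer to the classic batch interval, producing results every 30 seconds.
// val timedQuery = events.writeStream
//   .format("console")
//   .trigger(Trigger.ProcessingTime("30 seconds"))
//   .start()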
With Structured Streaming, without this fixed time synchronization, our ability to see performance issues in our cluster is more complex: a cluster that is unstable—that is, unable to "clear out" data by finishing computation on it as fast as new data flows in—will see ever-growing batch processing times, with an accelerating growth. We can expect that keeping a hand on this batch processing time will be pivotal. However, if we have a cluster that is correctly sized with respect to the throughput of our data, there are a lot of advantages to having an as-often-as-possible update. In particular, we should expect to see very frequent results from our Structured Streaming cluster, with a higher granularity than we were used to with a conservative batch interval.
Spark Streaming Resilience Model
In most cases, a streaming job is a long-running job. By definition, streams of data observed and processed over time lead to jobs that run continuously. As they process data, they might accumulate intermediary results that are difficult to reproduce after the data has left the processing system. Therefore, the cost of failure is considerable and, in some cases, complete recovery is intractable. In distributed systems, especially those relying on commodity hardware, failure is a function of size: the larger the system, the higher the probability that some component fails at any time. Distributed stream processors need to factor this chance of failure into their operational model. We look at the resilience that the Apache Spark platform provides us: how it is able to recover from partial failure and what kinds of guarantees we are given for the data passing through the system when a failure occurs. We begin by getting an overview of the different internal components of Spark and their relation to the core data structure. With this knowledge, you can proceed to understand the impact of failure at the different levels and the measures that Spark offers to recover from such failure.
RDDs and DStreams
Spark builds its data representations on Resilient Distributed Datasets (RDDs). Introduced in 2011 by the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" [Zaharia2011], RDDs are the foundational data structure in Spark. It is at this ground level that the strong fault-tolerance guarantees of Spark start. RDDs are composed of partitions, which are segments of data stored on individual nodes and tracked by the Spark driver, presented to the user as a location-transparent data structure. We illustrate these components in the figure below, in which the classic word count application is broken down into the different elements that comprise an RDD.
Figure: An RDD operation represented in a distributed system
The colored blocks are data elements, originally stored in a distributed filesystem, represented on the far left of the figure. The data is stored as partitions, illustrated as columns of colored blocks inside the file. Each partition is read into an executor, which we see as the horizontal blocks. The actual data processing happens within the executor. There, the data is transformed following the transformations described at the RDD level:
• .flatMap(l => l.split(" ")) separates sentences into words separated by spaces.
• .map(w => (w,1)) transforms each word into a tuple of the form (word, 1), in this way preparing the words for counting.
• .reduceByKey(_ + _) computes the count, using the word as a key and applying a sum operation to the attached number.
• The final result is attained by bringing the partial results together using the same reduce operation.
RDDs constitute the programmatic core of Spark. All other abstractions, batch and streaming alike, including DataFrames, Datasets, and DStreams, are built using the facilities created by RDDs, and, more important, they inherit the same fault-tolerance capabilities. Another important characteristic of RDDs is that Spark will try to keep their data preferably in memory for as long as it is required, provided there is enough capacity in the system. This behavior is configurable through storage levels and can be explicitly controlled by calling caching operations. We mention those structures here to present the idea that Spark tracks the progress of the user's computation through modifications of the data. Indeed, knowing how far along we are in what the user wants to do by inspecting the control flow of their program (including loops and potential recursive calls) can be a daunting and error-prone task. It is much more reliable to define types of distributed data collections, and let the user create one from another, or from other data sources. In the figure below, we show the same word count program, now in the form of the user-provided code (left) and the resulting internal RDD chain of operations. This dependency chain forms a particular kind of graph, a Directed Acyclic Graph (DAG). The DAG informs the scheduler, appropriately called the DAG Scheduler, on how to distribute the computation, and is also the foundation of the failure-recovery functionality, because it represents the internal data and their dependencies.
Figure: RDD lineage
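A runnable sketch of this word count is shown below; it is not part of the original slides, and the SparkSession settings and input path are placeholder assumptions:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-wordcount")
  .master("local[*]")              // local cluster manager, for illustration only
  .getOrCreate()
val sc = spark.sparkContext

// "hdfs:///data/input.txt" is a placeholder for the distributed file in the figure
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(l => l.split(" "))      // sentences -> words
  .map(w => (w, 1))                // word -> (word, 1)
  .reduceByKey(_ + _)              // sum the 1s per word; introduces a shuffle

counts.take(10).foreach(println)
// The lineage (DAG) that Spark keeps for fault recovery can be printed with:
println(counts.toDebugString)
toDebugString prints the chain of parent RDDs, which corresponds to the lineage shown in the figure.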
  • 11. 11 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM As the system tracks the ordered creation of these distributed data collections, it tracks the work done, and what’s left to accomplish. Data Structures in Spark To understand at what level fault tolerance operates in Spark, it’s useful to go through an overview of the nomenclature of some core concepts. We begin by assuming that the user provides a program that ends up being divided into chunks and executed on various machines, as we saw in the previous section, and as depicted in below Figure. Figure: Spark nomenclature Let’s run down those steps, which define the vocabulary of the Spark runtime: User Program The user application in Spark Streaming is composed of user-specified function calls operating on a resilient data structure (RDD, DStream, streaming DataSet, and so on), categorized as actions and transformations. Transformed User Program The user program may undergo adjustments that modify some of the specified calls to make them simpler, the most approachable and understandable of which is map-fusion. Query plan is a similar but more advanced concept in Spark SQL. RDD A logical representation of a distributed, resilient, dataset. In the illustration, we see that the initial RDD comprises three parts, called partitions. Partition A partition is a physical segment of a dataset that can be loaded independently. Stages The user’s operations are then grouped into stages, whose boundary separates user operations into steps that must be executed separately. For example, operations that require a shuffle of data across multiple nodes, such as a join between the results of two distinct upstream operations, mark a distinct stage. Stages in Apache Spark are the unit of sequencing: they are executed one after the other. At most one of any interdependent stages can be running at any given time. Jobs After these stages are defined, what internal actions Spark should take is clear. Indeed, at this stage, a set of interdependent jobs is defined. And jobs, precisely, are the vocabulary for a unit of scheduling. They describe the work at hand from the point of view of an entire Spark cluster, whether it’s waiting in a queue or currently being run across many machines.
Tasks
Depending on where their source data is on the cluster, jobs can then be cut into tasks, crossing the conceptual boundary between distributed and single-machine computing: a task is a unit of local computation, the name for the local, executor-bound part of a job.
Spark aims to make sure that all of these steps are safe from harm and to recover quickly in the case of any incident occurring at any stage of this process. This concern is reflected in fault-tolerance facilities that are structured by the aforementioned notions: restart and checkpointing operations that occur at the task, job, stage, or program level.
Spark Fault Tolerance Guarantees
Now that we have seen the "pieces" that constitute the internal machinery in Spark, we are ready to understand that failure can happen at many different levels. In this section, we see Spark fault-tolerance guarantees organized by "increasing blast radius," from the more modest to the larger failure. We are going to investigate the following:
• How Spark mitigates Task failure through restarts
• How Spark mitigates Stage failure through the shuffle service
• How Spark mitigates the disappearance of the orchestrator of the user program, through driver restarts
Task Failure Recovery
Tasks can fail when the infrastructure on which they are running has a failure, or when logical conditions in the program lead to a sporadic failure, like OutOfMemory, network, or storage errors, or problems bound to the quality of the data being processed. If the input data of the task was stored through a call to cache() or persist(), and if the chosen storage level implies replication of the data, the task does not need to have its input recomputed, because a copy of it exists in complete form on another machine of the cluster. We can then use this input to restart the task. The storage levels configurable in Spark differ in their characteristics in terms of memory usage and replication factor.
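As a hedged illustration of these storage levels (reusing the counts RDD from the earlier word count sketch), one might write:
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_2 keeps two in-memory replicas on different executors, so a failed task
// can restart from the surviving copy instead of replaying the lineage from the source.
counts.persist(StorageLevel.MEMORY_ONLY_2)

// MEMORY_AND_DISK (no replication) spills partitions that do not fit in memory to the
// executor's local disk instead of dropping them, trading speed for stability.
// counts.persist(StorageLevel.MEMORY_AND_DISK)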
If, however, there was no persistence, or if the storage level does not guarantee the existence of a copy of the task's input data, the Spark driver will need to consult the DAG that stores the user-specified computation to determine which segments of the job need to be recomputed. Consequently, without enough precautions to save either on the caching or on the storage level, the failure of a task can trigger the recomputation of several others, up to a stage boundary. Stage boundaries imply a shuffle, and a shuffle implies that intermediate data will somehow be materialized: as we discussed, the shuffle transforms executors into data servers that can provide the data to any other executor serving as a destination. As a consequence, executors that participated in a shuffle have a copy of the map operations that led up to it. That is a lifesaver if you have a dying downstream executor, which can rely on the upstream servers of the shuffle (which serve the output of the map-like operations). But what if it's the contrary: what if you need to face the crash of one of the upstream executors?
Stage Failure Recovery
We've seen that task failure (possibly due to executor crash) is the most frequent incident happening on a cluster and hence the most important event to mitigate. Recurrent task failures will lead to the failure of the stage that contains that task. This brings us to the second facility that allows Spark to resist arbitrary stage failures: the shuffle service. When this failure occurs, it always means some rollback of the data, because a shuffle operation, by definition, depends on all of the prior executors involved in the step that precedes it. As a consequence, since Spark 1.3 we have had the shuffle service, which lets you work on map data that is saved and distributed through the cluster with good locality, but, more important, through a server that is not a Spark task. It is an external file-exchange service written in Java that has no dependency on Spark and is made to be a much longer-running service than a Spark executor. This additional service attaches as a separate process in all cluster modes of Spark and simply offers a data-file exchange for executors to transmit data reliably, right before a shuffle (a minimal configuration sketch appears at the end of this section). It is highly optimized through the use of a netty backend, to allow a very low overhead in transmitting data. This way, an executor can shut down after the execution of its map task, as soon as the shuffle service has a copy of its data. And because data transfers are faster, this transfer time is also highly reduced, reducing the vulnerable time in which any executor could face an issue.
Driver Failure Recovery
Having seen how Spark recovers from the failure of a particular task and stage, we can now look at the facilities Spark offers to recover from the failure of the driver program. The driver in Spark has an essential role: it is the depository of the block manager, which knows where each block of data resides in the cluster. It is also the place where the DAG lives. Finally, it is where the scheduling state of the job, its metadata, and its logs reside. Hence, if the driver is lost, a Spark cluster as a whole might well have lost which stage it has reached in computation, what the computation actually consists of, and where the data that serves it can be found, in one fell swoop.
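Returning briefly to the shuffle service mentioned above: enabling it is a configuration concern. A minimal sketch, using standard Spark property names, could look as follows; the deployment side (for example, installing the auxiliary service on YARN NodeManagers) is outside the scope of this snippet:
import org.apache.spark.SparkConf

val shuffleAwareConf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")    // executors hand map output to the external service
  .set("spark.dynamicAllocation.enabled", "true")  // optional: executors can then be released or replaced safely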
Cluster-mode deployment Spark has implemented what’s called the cluster deployment mode, which allows the driver program to be hosted on the cluster, as opposed to the user’s computer. The deployment mode is one of two options: in client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched
from one of the worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application, without waiting for the application to finish. This, in sum, allows Spark to operate an automatic driver restart, so that the user can start a job in a "fire-and-forget" fashion, starting the job and then closing their laptop to catch the next train. Every cluster mode of Spark offers a web UI that will let the user access the logs of their application. Another advantage is that driver failure does not mark the end of the job, because the driver process will be relaunched by the cluster manager. But this only allows recovery from scratch, given that the temporary state of the computation—previously stored in the driver machine—might have been lost.
Checkpointing
To avoid losing intermediate state in case of a driver crash, Spark offers the option of checkpointing; that is, periodically recording a snapshot of the application's state to disk. The directory set through sparkContext.setCheckpointDir() should point to reliable storage (e.g., Hadoop Distributed File System [HDFS]), because having the driver try to reconstruct the state of intermediate RDDs from its local filesystem makes no sense: those intermediate RDDs are being created on the executors of the cluster and should as such not require any interaction with the driver for backing them up.
First Steps in Structured Streaming
In the previous section, we learned about the high-level concepts that constitute Structured Streaming, such as sources, sinks, and queries. We are now going to explore Structured Streaming from a practical perspective, using a simplified web log analytics use case as an example. Before we begin delving into our first streaming application, we are going to see how classical batch analysis in Apache Spark can be applied to the same use case. This exercise has two main goals:
• First, most, if not all, streaming data analytics start by studying a static data sample. It is far easier to start a study with a file of data, gain intuition on how the data looks, what kind of patterns it shows, and define the process that we require to extract the intended knowledge from that data. Typically, it's only after we have defined and tested our data analytics job that we proceed to transform it into a streaming process that can apply our analytic logic to data on the move.
• Second, from a practical perspective, we can appreciate how Apache Spark simplifies many aspects of transitioning from a batch exploration to a streaming application through the use of uniform APIs for both batch and streaming analytics. This exploration will allow us to compare and contrast the batch and streaming APIs in Spark and show us the necessary steps to move from one to the other.
Batch Analytics
Given that we are working with archived log files, we have access to all of the data at once. Before we begin building our streaming application, let's take a brief intermezzo to have a look at what a classical batch analytics job would look like. First, we load the log files, encoded as JSON, from the directory where we unpacked them:
// This is the location of the unpacked files. Update accordingly
val logsDirectory = ???
val rawLogs = sparkSession.read.json(logsDirectory)
Next, we declare the schema of the data as a case class to use the typed Dataset API. Following the formal description of the dataset (at NASA-HTTP), the log is structured as follows: The logs are an ASCII file with one line per request, with the following columns:
• Host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
• Timestamp in the format "DAY MON DD HH:MM:SS YYYY," where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
• Request given in quotes.
• HTTP reply code.
• Bytes in the reply.
Translating that schema to Scala, we have the following case class definition:
import java.sql.Timestamp
case class WebLog(host: String,
                  timestamp: Timestamp,
                  request: String,
                  http_reply: Int,
                  bytes: Long)
We convert the original JSON to a typed data structure using the previous schema definition:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// we need to narrow the `Integer` type because
// the JSON representation is interpreted as `BigInteger`
val preparedLogs = rawLogs.withColumn("http_reply", $"http_reply".cast(IntegerType))
val weblogs = preparedLogs.as[WebLog]
Now that we have the data in a structured format, we can begin asking the questions that interest us. As a first step, we would like to know how many records are contained in our dataset:
val recordCount = weblogs.count
> recordCount: Long = 1871988
A common question would be: "what was the most popular URL per day?" To answer that, we first reduce the timestamp to the day of the month. We then group by this new dayOfMonth column and the request URL, and we count over this aggregate. We finally order using descending order to get the top URLs first:
val topDailyURLs = weblogs.withColumn("dayOfMonth", dayofmonth($"timestamp"))
  .select($"request", $"dayOfMonth")
  .groupBy($"dayOfMonth", $"request")
  .agg(count($"request").alias("count"))
  .orderBy(desc("count"))
topDailyURLs.show()
+----------+----------------------------------------+-----+
|dayOfMonth| request|count|
+----------+----------------------------------------+-----+
| 13|GET /images/NASA-logosmall.gif HTTP/1.0 |12476|
| 13|GET /htbin/cdt_main.pl HTTP/1.0 | 7471|
| 12|GET /images/NASA-logosmall.gif HTTP/1.0 | 7143|
| 13|GET /htbin/cdt_clock.pl HTTP/1.0 | 6237|
| 6|GET /images/NASA-logosmall.gif HTTP/1.0 | 6112|
| 5|GET /images/NASA-logosmall.gif HTTP/1.0 | 5865|
...
Top hits are all images. What now? It's not unusual to see that the top URLs are images commonly used across a site. Our true interest lies in the content pages generating the most traffic. To find those, we first filter on HTML content and then proceed to apply the top aggregation we just learned. As we can see, the request field is a quoted sequence of [HTTP_VERB] URL [HTTP_VERSION]. We will extract the URL and preserve only those ending in .html, .htm, or no extension (directories). This is a simplification for the purpose of this example:
val urlExtractor = """^GET (.+) HTTP/\d\.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs = weblogs.filter { log =>
  log.request match {
    case urlExtractor(url) =>
      val ext = url.takeRight(5).dropWhile(c => c != '.')
      allowedExtensions.contains(ext)
    case _ => false
  }
}
With this new dataset that contains only .html, .htm, and directories, we proceed to apply the same top-k function as earlier:
val topContentPages = contentPageLogs
  .withColumn("dayOfMonth", dayofmonth($"timestamp"))
  .select($"request", $"dayOfMonth")
  .groupBy($"dayOfMonth", $"request")
  .agg(count($"request").alias("count"))
  .orderBy(desc("count"))
topContentPages.show()
+----------+------------------------------------------------+-----+
|dayOfMonth| request|count|
+----------+------------------------------------------------+-----+
| 13| GET /shuttle/countdown/liftoff.html HTTP/1.0" | 4992|
| 5| GET /shuttle/countdown/ HTTP/1.0" | 3412|
| 6| GET /shuttle/countdown/ HTTP/1.0" | 3393|
| 3| GET /shuttle/countdown/ HTTP/1.0" | 3378|
| 13| GET /shuttle/countdown/ HTTP/1.0" | 3086|
| 7| GET /shuttle/countdown/ HTTP/1.0" | 2935|
| 4| GET /shuttle/countdown/ HTTP/1.0" | 2832|
| 2| GET /shuttle/countdown/ HTTP/1.0" | 2330|
...
We can see that the most popular page that month was liftoff.html, corresponding to the coverage of the launch of the Discovery shuttle, as documented on the NASA archives. It's closely followed by countdown/, the days prior to the launch.
Streaming Analytics Phases
In the previous section, we explored historical NASA web log records. We found trending events in those records, but much later than when the actual events happened. One key driver for streaming analytics comes from the increasing demand of organizations to have timely information that can help them make decisions at many different levels. We can use the lessons that we have learned while exploring the archived records using a batch-oriented approach and create a streaming job that will provide us with trending information as it happens. The first difference that we observe with the batch analytics is the source of the data. For our streaming exercise, we will use a TCP server to simulate a web system that delivers its logs in real time. The simulator will use the same dataset but will feed it through a TCP socket connection that will embody the stream that we will be analyzing.
Connecting to a Stream
If you recall from the introduction of this chapter, Structured Streaming defines the concepts of sources and sinks as the key abstractions to consume a stream and produce a result. We are going to use the TextSocketSource implementation to connect to the server through a TCP socket. Socket connections are defined by the host of the server and the port where it is listening for connections. These two configuration elements are required to create the socket source:
val stream = sparkSession.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load()
Note how the creation of a stream is quite similar to the declaration of a static data source in the batch case. Instead of using the read builder, we use the readStream construct, and we pass to it the parameters required by the streaming source. As you will see during the course of this exercise and later on as we go into the details of Structured Streaming, the API is basically the same DataFrame and Dataset API for static data, but with some modifications and limitations that you will learn about in detail.
Preparing the Data in the Stream
The socket source produces a streaming DataFrame with one column, value, which contains the data received from the stream. In the batch analytics case, we could load the data directly as JSON records. In the case of the socket source, that data is plain text. To transform our raw data to WebLog records, we first require a schema. The schema provides the necessary information to parse the text into a JSON object. It is the structure we refer to when we talk about Structured Streaming.
  • 18. 18 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM After defining a schema for our data, we proceed to create a Dataset, following these steps: import java.sql.Timestamp case class WebLog(host:String, timestamp: Timestamp, request: String, http_reply:Int, bytes: Long ) val webLogSchema = Encoders.product[WebLog].schema val jsonStream = stream.select(from_json($"value", webLogSchema) as "record") val webLogStream: Dataset[WebLog] = jsonStream.select("record.*").as[WebLog] 1. Obtain a schema from the case class definition 2. Transform the text value to JSON using the JSON support built into Spark SQL 3. Use the Dataset API to transform the JSON records to WebLog objects As a result of this process, we obtain a Streaming Dataset of WebLog records. Operations on Streaming Dataset The webLogStream we just obtained is of type Dataset[WebLog] like we had in the batch analytics job. The difference between this instance and the batch version is that webLogStream is a streaming Dataset. We can observe this by querying the object: webLogStream.isStreaming > res: Boolean = true At this point in the batch job, we were creating the first query on our data: How many records are contained in our dataset? This is a question that we can easily answer when we have access to all of the data. However, how do we count records that are constantly arriving? The answer is that some operations that we consider usual on a static Dataset, like counting all records, do not have a defined meaning on a Streaming Dataset. As we can observe, attempting to execute the count query in the following code snippet will result in an AnalysisException: val count = webLogStream.count() > org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; This means that the direct queries we used on a static Dataset or DataFrame now need two levels of interaction. First, we need to declare the transformations of our stream, and then we need to start the stream process. Creating a Query What are popular URLs? In what time frame? Now that we have immediate analytic access to the stream of web logs, we don’t need to wait for a day or a month to have a rank of the popular URLs. We can have that information as trends unfold in much shorter windows of time.
First, to define the period of time of our interest, we create a window over some timestamp. An interesting feature of Structured Streaming is that we can define that time interval on the timestamp when the data was produced, also known as event time, as opposed to the time when the data is being processed. Our window definition will be of five minutes of event data. Given that our timeline is simulated, the five minutes might happen much faster or slower than the clock time. In this way, we can clearly appreciate how Structured Streaming uses the timestamp information in the events to keep track of the event timeline. As we learned from the batch analytics, we should extract the URLs and select only content pages, like .html, .htm, or directories. Let's apply that acquired knowledge first before proceeding to define our windowed query:
// A regex expression to extract the accessed URL from weblog.request
val urlExtractor = """^GET (.+) HTTP/\d\.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs: String => Boolean = url => {
  val ext = url.takeRight(5).dropWhile(c => c != '.')
  allowedExtensions.contains(ext)
}
val urlWebLogStream = webLogStream.flatMap { weblog =>
  weblog.request match {
    case urlExtractor(url) if (contentPageLogs(url)) => Some(weblog.copy(request = url))
    case _ => None
  }
}
We have converted the request to contain only the visited URL and filtered out all noncontent pages. Now, we define the windowed query to compute the top trending URLs:
val rankingURLStream = urlWebLogStream
  .groupBy($"request", window($"timestamp", "5 minutes", "1 minute"))
  .count()
Start the Stream Processing
All of the steps that we have followed so far have been to define the process that the stream will undergo. But no data has been processed yet. To start a Structured Streaming job, we need to specify a sink and an output mode. These are two new concepts introduced by Structured Streaming:
• A sink defines where we want to materialize the resulting data; for example, to a file in a filesystem, to an in-memory table, or to another streaming system such as Kafka.
• The output mode defines how we want the results to be delivered: do we want to see all data every time, only updates, or just the new records?
  • 20. 20 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM These options are given to a writeStream operation. It creates the streaming query that starts the stream consumption, materializes the computations declared on the query, and produces the result to the output sink. For now, let’s use them empirically and observe the results. For our query, shown in below Example, we use the memory sink and output mode complete to have a fully updated table each time new records are added to the result of keeping track of the URL ranking. Example. Writing a stream to a sink val query = rankingURLStream.writeStream .queryName("urlranks") .outputMode("complete") .format("memory") .start() The memory sink outputs the data to a temporary table of the same name given in the queryName option. We can observe this by querying the tables registered on Spark SQL: scala> spark.sql("show tables").show() +--------+---------+-----------+ |database|tableName|isTemporary| +--------+---------+-----------+ | | urlranks| true| +--------+---------+-----------+ In the expression in Example, query is of type StreamingQuery and it’s a handler to control the query life cycle. Exploring the Data Given that we are accelerating the log timeline on the producer side, after a few seconds, we can execute the next command to see the result of the first windows, as illustrated in Figure. Note how the processing time (a few seconds) is decoupled from the event time (hun‐ dreds of minutes of logs): urlRanks.select($"request", $"window", $"count").orderBy(desc("count")) Figure: URL ranking: query results by window
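The snippet above assumes a handle named urlRanks that the excerpt does not show being created; one plausible way to obtain it from the in-memory table registered by the memory sink is the following hypothetical definition:
// Hypothetical: read back the in-memory table registered under the query name "urlranks";
// assumes the usual sparkSession.implicits._ and org.apache.spark.sql.functions._ imports.
val urlRanks = sparkSession.sql("select * from urlranks")
urlRanks.select($"request", $"window", $"count").orderBy(desc("count")).show(false)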
Acquiring streaming data
In Structured Streaming, a source is an abstraction that lets us consume data from a streaming data producer. Sources are not directly created. Instead, the sparkSession provides a builder method, readStream, that exposes the API to specify a streaming source, called a format, and provide its configuration. For example, the code in the example below creates a File streaming source. We specify the type of source using the format method. The method schema lets us provide a schema for the data stream, which is mandatory for certain source types, such as this File source.
Example: File streaming source
val fileStream = spark.readStream
  .format("json")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .load("/tmp/datasrc")
> fileStream: org.apache.spark.sql.DataFrame = [id: string, timestamp: timestamp ... ]
Each source implementation has different options, and some have tunable parameters. In the example, we are setting the option mode to DROPMALFORMED. This option instructs the JSON stream processor to drop any line that neither complies with the JSON format nor matches the provided schema. Behind the scenes, the call to spark.readStream creates a DataStreamReader instance. This instance is in charge of managing the different options provided through the builder method calls. Calling load(...) on this DataStreamReader instance validates the options provided to the builder and, if everything checks out, returns a streaming DataFrame. In our example, this streaming DataFrame represents the stream of data that will result from monitoring the provided path and processing each new file in that path as JSON-encoded data, parsed using the schema provided. All malformed records will be dropped from this data stream. Loading a streaming source is lazy. What we get is a representation of the stream, embodied in the streaming DataFrame instance, that we can use to express the series of transformations that we want to apply to it in order to implement our specific business logic. Creating a streaming DataFrame does not result in any data actually being consumed or processed until the stream is materialized. This requires a query, as you will see further on.
Available Sources
As of Spark v2.4.0, the following streaming sources are supported:
• json, orc, parquet, csv, text, textFile: These are all file-based streaming sources. The base functionality is to monitor a path (folder) in a filesystem and consume files atomically placed in it. The files found will then be parsed by the formatter specified. For example, if json is provided, the Spark json reader will be used to process the files, using the schema information provided.
• socket: Establishes a client connection to a TCP server that is assumed to provide text data through a socket connection.
• kafka: Creates a Kafka consumer able to retrieve data from Kafka.
• rate: Generates a stream of rows at the rate given by the rowsPerSecond option. It's mainly intended as a testing source.
Transforming streaming data
As we saw in the previous section, the result of calling load is a streaming DataFrame. After we have created our streaming DataFrame using a source, we can use the Dataset or DataFrame API to express the logic that we want to apply to the data in the stream in order to implement our specific use case. Assuming that we are using data from a sensor network, in the example below we are selecting the fields deviceId, timestamp, sensorType, and value from a sensorStream, and filtering to only those records where the sensor is of type temperature and its value is higher than the given threshold.
Example: Filter and projection
val highTempSensors = sensorStream
  .select($"deviceId", $"timestamp", $"sensorType", $"value")
  .where($"sensorType" === "temperature" && $"value" > threshold)
Likewise, we can aggregate our data and apply operations to the groups over time. The next example shows that we can use timestamp information from the event itself to define a time window of five minutes that will slide every minute. What is important to grasp here is that the Structured Streaming API is practically the same as the Dataset API for batch analytics, with some additional provisions specific to stream processing.
Example: Average by sensor type over time
val avgBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType", $"value")
  .groupBy(window($"timestamp", "5 minutes", "1 minute"), $"sensorType")
  .agg(avg($"value"))
If you are not familiar with the structured APIs of Spark, we suggest that you familiarize yourself with them. Covering this API in detail is beyond the scope of this book.
Streaming API Restrictions on the DataFrame API
As we hinted in the previous chapter, some operations that are offered by the standard DataFrame and Dataset API do not make sense in a streaming context. We gave the example of stream.count, which does not make sense to use on a stream. In general, operations that require immediate materialization of the underlying dataset are not allowed. These are the API operations not directly supported on streams:
• count
• show
• describe
• limit
• take(n)
• distinct
• foreach
• sort
• multiple stacked aggregations
Next to these operations, stream-stream and static-stream joins are partially supported.
Understanding the limitations
Although some operations, like count or limit, do not make sense on a stream, some other stream operations are computationally difficult. For example, distinct is one of them. To filter duplicates in an arbitrary stream, it would require that you remember all of the data seen so far and compare each new record with all records already seen. The first condition would require infinite memory, and the second has a computational complexity of O(n^2), which becomes prohibitive as the number of elements (n) increases.
Operations on aggregated streams
Some of the unsupported operations become defined after we apply an aggregation function to the stream. Although we can't count the stream, we could count messages received per minute or count the number of devices of a certain type. In the example below, we define a count of events per sensorType per minute.
Example: Count of sensor types over time
val countBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType")
  .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
  .count()
Likewise, it's also possible to define a sort on aggregated data, although it's further restricted to queries with output mode complete.
Stream deduplication
We discussed that distinct on an arbitrary stream is computationally difficult to implement. But if we can define a key that informs us when an element in the stream has already been seen, we can use it to remove duplicates:
stream.dropDuplicates("uniqueId")   // "uniqueId" is a placeholder for the identifying column
Workarounds
Although some operations are not supported in the exact same way as in the batch model, there are alternative ways to achieve the same functionality:
• foreach: Although foreach cannot be directly used on a stream, there's a foreach sink that provides the same functionality. Sinks are specified in the output definition of a stream.
• show: Although show requires an immediate materialization of the query, and hence it's not possible on a streaming Dataset, we can use the console sink to output data to the screen, as sketched below.
Output the resulting data
All operations that we have done so far—such as creating a stream and applying transformations on it—have been declarative. They define from where to consume the data and what operations we want to apply to it. But up to this point, there is still no data flowing through the system. Before we can initiate our stream, we need to first define where and how we want the output data to go:
• Where relates to the streaming sink: the receiving side of our streaming data.
• How refers to the output mode: how to treat the resulting records in our stream.
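Before looking at writeStream in detail, here is a minimal sketch of the console-sink workaround for show mentioned above; sensorStream is the illustrative streaming Dataset from the earlier examples, and the option value is a placeholder:
val debugQuery = sensorStream.writeStream
  .format("console")         // console sink: the workaround for `show` on a stream
  .outputMode("append")
  .option("numRows", "20")   // rows to print per microbatch
  .start()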
  • 24. 24 18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM From the API perspective, we materialize a stream by calling writeStream on a streaming DataFrame or Dataset. Calling writeStream on a streaming Dataset creates a DataStreamWriter. This is a builder instance that provides methods to configure the output behavior of our streaming process. Example. File streaming sink val query = stream.writeStream .format("json") .queryName("json-writer") .outputMode("append") .option("path", "/target/dir") .option("checkpointLocation", "/checkpoint/dir") .trigger(ProcessingTime("5 seconds")) .start() >query: org.apache.spark.sql.streaming.StreamingQuery = ... format The format method lets us specify the output sink by providing the name of a builtin sink or the fully qualified name of a custom sink. As of Spark v2.4.0, the following streaming sinks are available: • console sink A sink that prints to the standard output. It shows a number of rows configurable with the option numRows. • file sink File-based and format-specific sink that writes the results to a filesystem. The format is specified by providing the format name: csv, hive, json, orc, parquet, avro, or text. • kafka sink A Kafka-specific producer sink that is able to write to one or more Kafka topics. • memory sink Creates an in-memory table using the provided query name as table name. This table receives continuous updates with the results of the stream. • foreach sink Provides a programmatic interface to access the stream contents, one element at the time. • foreachBatch sink foreachBatch is a programmatic sink interface that provides access to the com‐ plete DataFrame that corresponds to each underlying microbatch of the Structured Streaming execution. outputMode The outputMode specifies the semantics of how records are added to the output of the streaming query. The supported modes are append, update, and complete: • append (default mode) Adds only final records to the output stream. A record is considered final when no new records of the incoming stream can modify its value. This is always the case with linear transformations like those resulting from applying projection, filtering, and mapping. This mode guarantees that each resulting record will be output only once. • update Adds new and updated records since the last trigger to the output stream. update is meaningful only in the context of an aggregation, where aggregated values change as
outputMode
The outputMode specifies the semantics of how records are added to the output of the streaming query. The supported modes are append, update, and complete:
• append (default mode): Adds only final records to the output stream. A record is considered final when no new record of the incoming stream can modify its value. This is always the case with linear transformations like those resulting from applying projection, filtering, and mapping. This mode guarantees that each resulting record is output only once.
• update: Adds new and updated records since the last trigger to the output stream. update is meaningful only in the context of an aggregation, where aggregated values change as new records arrive. If more than one incoming record changes a single result, all changes between trigger intervals are collated into one output record.
• complete: complete mode outputs the complete internal representation of the stream. This mode also relates to aggregations, because for nonaggregated streams we would need to remember all records seen so far, which is unrealistic. From a practical perspective, complete mode is recommended only when you are aggregating values over low-cardinality criteria, like a count of visitors by country, for which we know that the number of countries is bounded.
Understanding the append semantic
When the streaming query contains aggregations, the definition of final becomes nontrivial. In an aggregated computation, new incoming records might change an existing aggregated value when they comply with the aggregation criteria used. Following our definition, we cannot output a record using append until we know that its value is final. Therefore, the use of the append output mode in combination with aggregate queries is restricted to queries for which the aggregation is expressed using event time and defines a watermark. In that case, append will output an event as soon as the watermark has expired, at which point no new records can alter the aggregated value. As a consequence, output events in append mode are delayed by the aggregation time window plus the watermark offset. The sketch below illustrates this combination.
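A minimal, hedged sketch of the append-plus-watermark combination just described, reusing the hypothetical sensorStream (with timestamp and sensorType columns) from the aggregation example earlier; the 5-minute watermark, sink format, and paths are illustrative placeholders.
import org.apache.spark.sql.functions.window
// assumes `import spark.implicits._` is in scope, as in the other examples

// Event-time aggregation that is valid in append mode because it declares a watermark.
val countsBySensorType = sensorStream
  .withWatermark("timestamp", "5 minutes")               // tolerate events up to 5 minutes late
  .groupBy(window($"timestamp", "1 minute"), $"sensorType")
  .count()

val appendQuery = countsBySensorType.writeStream
  .outputMode("append")                    // each window is emitted once its watermark expires
  .format("parquet")                       // placeholder sink and paths
  .option("path", "/tmp/sensor-counts")
  .option("checkpointLocation", "/tmp/sensor-counts-checkpoint")
  .start()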
queryName
With queryName, we can provide a name for the query that is used by some sinks and also presented in the job description in the Spark Console, as depicted in the Figure.
Figure: Completed Jobs in the Spark UI showing the query name in the job description
option
With the option method, we can provide specific key–value pairs of configuration to the stream, akin to the configuration of the source. Each sink can have specific configuration that we can customize using this method. We can add as many .option(...) calls as necessary to configure the sink.
options
options is an alternative to option that takes a Map[String, String] containing all the key–value configuration parameters that we want to set. This alternative is friendlier to an externalized configuration model, where we don’t know a priori the settings to be passed to the sink’s configuration.
trigger
The optional trigger option lets us specify the frequency at which we want results to be produced. By default, Structured Streaming processes the input and produces a result as soon as possible. When a trigger is specified, output is produced at each trigger interval. org.apache.spark.sql.streaming.Trigger provides the following supported triggers:
• ProcessingTime(<interval>): Lets us specify a time interval that dictates the frequency of the query results.
• Once(): A particular Trigger that lets us execute a streaming job once. It is useful for testing and also for applying a defined streaming job as a single-shot batch operation.
• Continuous(<checkpoint-interval>): This trigger switches the execution engine to the experimental continuous engine for low-latency processing. The checkpoint-interval parameter indicates the frequency of the asynchronous checkpointing for data resilience. It should not be confused with the batch interval of the ProcessingTime trigger.
start()
To materialize the streaming computation, we need to start the streaming process. start() materializes the complete job description into a streaming computation and initiates the internal scheduling process that results in data being consumed from the source, processed, and produced to the sink. start() returns a StreamingQuery object, which is a handle to manage the individual life cycle of each query. This means that we can simultaneously start and stop multiple queries independently of one another within the same sparkSession. A hedged sketch of the trigger options and the query life cycle follows.
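The following brief sketch (not part of the course demo) shows how the trigger and the StreamingQuery handle returned by start() are typically used together. Here, counts is a placeholder for an aggregated streaming DataFrame (for example, the result of a groupBy(...).count()), and the 30-second interval is an arbitrary choice.
import org.apache.spark.sql.streaming.Trigger

// Microbatch every 30 seconds; Trigger.Once() or Trigger.Continuous("1 second")
// could be substituted for single-shot or experimental continuous execution.
val q = counts.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

// The returned StreamingQuery is the handle for the query's life cycle.
q.awaitTermination(60000)   // block up to 60 s, or until the query stops or fails
if (q.isActive) q.stop()    // stop the query gracefully if it is still running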
Demo
The first part of our program deals with the creation of the streaming Dataset:
val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
> rawData: org.apache.spark.sql.DataFrame
The entry point of Structured Streaming is an existing Spark session (sparkSession). As you can see on the first line, the creation of a streaming Dataset is almost identical to the creation of a static Dataset, which would use a read operation instead. sparkSession.readStream returns a DataStreamReader, a class that implements the builder pattern to collect the information needed to construct the streaming source using a fluent API. In that API, we find the format option that lets us specify our source provider, which, in our case, is kafka. The options that follow are specific to the source:
• kafka.bootstrap.servers: Indicates the set of bootstrap servers to contact, as a comma-separated list of host:port addresses.
• subscribe: Specifies the topic or topics to subscribe to.
• startingOffsets: The offset reset policy to apply when this application starts out fresh.
The load() method evaluates the DataStreamReader builder and creates a DataFrame as a result, as we can see in the returned value:
> rawData: org.apache.spark.sql.DataFrame
A DataFrame is an alias for Dataset[Row] with a known schema. After creation, you can use streaming Datasets just like regular Datasets. This makes it possible to use the full-fledged Dataset API with Structured Streaming, although some exceptions apply because not all operations, such as show() or count(), make sense in a streaming context.
To programmatically differentiate a streaming Dataset from a static one, we can ask a Dataset whether it is of the streaming kind:
rawData.isStreaming
res7: Boolean = true
And we can also explore the schema attached to it, using the existing Dataset API, as demonstrated in the Example.
Example. The Kafka schema
rawData.printSchema()
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
In general, Structured Streaming requires the explicit declaration of a schema for the consumed stream. In the specific case of kafka, the schema for the resulting Dataset is fixed and independent of the contents of the stream. It consists of a set of fields specific to the Kafka source: key, value, topic, partition, offset, timestamp, and timestampType, as we can see in the Example above. In most cases, applications are mostly interested in the contents of the value field, where the actual payload of the stream resides.
Application Logic
Recall that the intention of our job is to correlate the incoming IoT sensor data with a reference file that contains all known sensors with their configuration. That way, we would enrich each incoming record with specific sensor parameters that would allow us to interpret the reported data. We would then save all correctly processed records to a Parquet file. The data coming from unknown sensors would be saved to a separate file for later analysis. Using Structured Streaming, our job can be implemented in terms of Dataset operations:
import scala.util.Try

val iotData = rawData.select($"value").as[String].flatMap { record =>
  val fields = record.split(",")
  // malformed records yield None and are dropped by flatMap
  Try {
    SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
  }.toOption
}

val sensorRef = sparkSession.read.parquet(s"$workDir/$referenceFile")
sensorRef.cache()

val sensorWithInfo = sensorRef.join(iotData, Seq("sensorId"), "inner")

val knownSensors = sensorWithInfo
  .withColumn("dnvalue", $"value"*($"maxRange"-$"minRange")+$"minRange")
  .drop("value", "maxRange", "minRange")
In the first step, we transform our CSV-formatted records back into SensorData entries (a hedged sketch of what such a case class might look like follows this slide). We apply Scala functional operations on the typed Dataset[String] that we obtained from extracting the value field as a String. Then, we use a streaming Dataset to static Dataset inner join to correlate the sensor data with the corresponding reference, using sensorId as the key. To complete our application, we compute the real values of the sensor readings using the minimum-maximum ranges in the reference data.
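The SensorData case class itself is not shown in these notes. A plausible definition, inferred from how its fields are used in the demo (sensorId for the join, value for the range computation), might look like the following; the name of the middle Long field (timestamp) is an assumption.
// Hedged sketch: a plausible SensorData definition inferred from the demo code.
case class SensorData(sensorId: Int, timestamp: Long, value: Double)

// The parsing step in isolation, mirroring the flatMap in the demo:
import scala.util.Try
def parse(record: String): Option[SensorData] = {
  val fields = record.split(",")
  Try(SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption
}

parse("12,1502726400,0.75")   // Some(SensorData(12,1502726400,0.75))
parse("malformed-record")     // None: dropped by the flatMap in the demo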
Writing to a Streaming Sink
The final step of our streaming application is to write the enriched IoT data to a Parquet-formatted file. In Structured Streaming, the write operation is crucial: it marks the completion of the declared transformations on the stream, defines a write mode, and, upon calling start(), begins the processing of the continuous query. In Structured Streaming, all operations are lazy declarations of what we want to do with the streaming data. Only when we call start() will the actual consumption of the stream begin and the query operations on the data materialize into actual results:
val knownSensorsQuery = knownSensors.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", targetPath)
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
Let’s break this operation down:
• writeStream creates a builder object where we can configure the options for the desired write operation, using a fluent interface.
• With format, we specify the sink that will materialize the result downstream. In our case, we use the built-in FileStreamSink with Parquet format.
• outputMode is a new concept in Structured Streaming: given that we, theoretically, have access to all the data seen in the stream so far, we also have the option to produce different views of that data.
• The append mode, used here, implies that only the new records affected by our streaming computation are produced to the output.
The result of the start call is a StreamingQuery instance. This object provides methods to control the execution of the query and request information about the status of our running streaming query, as shown in the Example.
Example. Query progress
knownSensorsQuery.recentProgress
res37: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array({
  "id" : "6b9fe3eb-7749-4294-b3e7-2561f1e840b6",
  "runId" : "0d8d5605-bf78-4169-8cfe-98311fc8365c",
  "name" : null,
  "timestamp" : "2017-08-10T16:20:00.065Z",
  "numInputRows" : 4348,
  "inputRowsPerSecond" : 395272.7272727273,
  "processedRowsPerSecond" : 28986.666666666668,
  "durationMs" : {
    "addBatch" : 127,
    "getBatch" : 3,
    "getOffset" : 1,
    "queryPlanning" : 7,
    "triggerExecution" : 150,
    "walCommit" : 11
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[iot-data]]",
    "startOffset" : { "iot-data" : { "0" : 19048348 } },
    "endOffset" : { "iot-data" : { "0" : 19052696 } },
    "numInputRow...
In the Example, we can see the StreamingQueryProgress that results from calling knownSensorsQuery.recentProgress. If we see nonzero values for numInputRows, we can be certain that our job is consuming data. We now have a Structured Streaming job running properly. A hedged sketch of further ways to monitor a running query follows.
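Beyond recentProgress, the StreamingQuery handle exposes a few more introspection methods. The following is a brief sketch against the knownSensorsQuery defined above; the field names come from Spark’s StreamingQueryProgress and StreamingQueryStatus classes, and the exact printed values will of course differ per run.
// Sketch only: monitoring the running query from the demo.
println(knownSensorsQuery.isActive)       // true while the query is running
println(knownSensorsQuery.status)         // current activity, e.g. whether a trigger is active
println(knownSensorsQuery.lastProgress)   // the most recent StreamingQueryProgress, as JSON

// Surface the throughput of the last completed microbatch, if any
// (lastProgress is null before the first batch completes, hence the Option wrapper).
Option(knownSensorsQuery.lastProgress).foreach { p =>
  println(s"rows in last batch: ${p.numInputRows}, rate: ${p.processedRowsPerSecond} rows/s")
}

// Block the driver until the query terminates (or fails with an exception):
// knownSensorsQuery.awaitTermination()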
Stream Processing with Spark Streaming
Spark Streaming was the first stream-processing framework built on top of the distributed processing capabilities of Spark. Nowadays, it offers a mature API that is widely adopted in the industry to process large-scale data streams. Spark is, by design, a system that is really good at processing data distributed over a cluster of machines. Spark’s core abstraction, the Resilient Distributed Dataset (RDD), and its fluent functional API permit the creation of programs that treat distributed data as a collection. That abstraction lets us reason about data-processing logic as transformations of a distributed dataset. By doing so, it reduces the cognitive load previously required to create and execute scalable and distributed data-processing programs.
Spark Streaming was created upon a simple yet powerful premise: apply Spark’s distributed computing capabilities to stream processing by transforming a continuous stream of data into discrete data collections on which Spark can operate. As we can see in the Figure, the main task of Spark Streaming is to take data from the stream, package it into small batches, and provide them to Spark for further processing. The output is then produced to some downstream system.
Figure. Spark and Spark Streaming in action
The DStream Abstraction
Whereas Structured Streaming builds its streaming capabilities on top of the Spark SQL abstractions of DataFrame and Dataset, Spark Streaming relies on the much more fundamental Spark abstraction of the RDD. At the same time, Spark Streaming introduces a new concept: the Discretized Stream, or DStream. A DStream represents a stream in terms of discrete blocks of data that in turn are represented as RDDs over time, as we can see in the Figure.
Figure. DStreams and RDDs in Spark Streaming
The DStream abstraction is primarily an execution model that, when combined with a functional programming model, provides us with a complete framework to develop and execute streaming applications.
DStreams as a Programming Model
The code representation of DStreams gives us a functional programming API consistent with the RDD API and augmented with stream-specific functions to deal with aggregations, time-based operations, and stateful computations. In Spark Streaming, we consume a stream by creating a DStream from one of the native implementations, such as a SocketInputStream, or by using one of the many connectors available that provide a DStream implementation specific to a stream provider (this is the case of the Kafka, Twitter, or Kinesis connectors for Spark Streaming, just to name a few):
// creates a DStream using a client socket connected to the given host and port
val textDStream = ssc.socketTextStream("localhost", 9876)
After we have obtained a DStream reference, we can implement our application logic using the functions provided by the DStream API. For example, if the textDStream in the preceding code is connected to a log server, we could count the number of error occurrences:
// we break down the stream of logs into error or info (not error)
// and create pairs of `(x, y)`:
// (1, 1) represents an error, and
// (0, 1) a non-error occurrence.
val errorLabelStream = textDStream.map { line =>
  if (line.contains("ERROR")) (1, 1) else (0, 1)
}
We can then count the totals and compute the error rate by using an aggregation function called reduce:
// reduce combines the pairs in each microbatch into a single (errors, total) tuple
val errorCountStream = errorLabelStream.reduce {
  case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)
}
To obtain our error rate, we perform a safe division:
// compute the error rate and create a string message with the value
val errorRateStream = errorCountStream.map { case (errors, total) =>
  val errorRate = if (total > 0) errors.toDouble / total else 0.0
  "Error Rate:" + errorRate
}
It’s important to note that, up until now, we have only been declaring transformations on the DStream; there is still no data processing happening. All transformations on DStreams are lazy. This process of defining the logic of a stream-processing application is better seen as the set of transformations that will be applied to the data after the stream processing is started. As such, it is a plan of action that Spark Streaming will recurrently execute on the data consumed from the source DStream. DStreams are immutable: it is only through a chain of transformations that we can process and obtain a result from our data.
Finally, the DStream programming model requires that the transformations end with an output operation. This particular operation specifies how the DStream is materialized. In our case, we are interested in printing the results of this stream computation to the console:
// print the results to the console
errorRateStream.print()
In summary, the DStream programming model consists of the functional composition of transformations over the stream payload, materialized by one or more output operations and recurrently executed by the Spark Streaming engine.
DStreams as an Execution Model
In the preceding introduction to the Spark Streaming programming model, we saw how data is transformed from its original form into our intended result through a series of lazy functional transformations. The Spark Streaming engine is responsible for taking that chain of functional transformations and turning it into an actual execution plan. That happens by receiving data from the input stream(s), collecting that data into batches, and feeding it to Spark in a timely manner.
The measure of time to wait for data is known as the batch interval. It is usually a short amount of time, ranging from roughly two hundred milliseconds to minutes, depending on the application’s latency requirements. The batch interval is the central unit of time in Spark Streaming. At each batch interval, the data corresponding to the previous interval is sent to Spark for processing while new data is received. This process repeats as long as the Spark Streaming job is active and healthy. A natural consequence of this recurring microbatch operation is that the computation on a batch’s data has to complete within the duration of the batch interval, so that computing resources are available when the new microbatch arrives. As we will see, the batch interval dictates the timing of most other functions in Spark Streaming. A minimal end-to-end sketch of this DStream program follows.
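To tie the pieces together, here is a minimal, hedged sketch of how the error-rate example above could be assembled into a runnable Spark Streaming program. The application name, the local master, the 2-second batch interval, and the localhost:9876 log server are illustrative choices, not values given in these notes.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ErrorRateApp {
  def main(args: Array[String]): Unit = {
    // The batch interval (here, 2 seconds) is fixed when the StreamingContext is created.
    val conf = new SparkConf().setAppName("error-rate-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical log server reachable on localhost:9876, as in the example above.
    val textDStream = ssc.socketTextStream("localhost", 9876)

    val errorRateStream = textDStream
      .map(line => if (line.contains("ERROR")) (1, 1) else (0, 1))
      .reduce { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
      .map { case (errors, total) =>
        val errorRate = if (total > 0) errors.toDouble / total else 0.0
        "Error Rate:" + errorRate
      }

    // Output operation: print each microbatch result, then start the recurring execution.
    errorRateStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}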