Unit - 5 FBDA
Spark Streaming
Spark Streaming provides an abstraction called DStreams, or discretized streams. A DStream is a
sequence of data arriving over time. Internally, each DStream is represented as a sequence of
RDDs arriving at each time step (hence the name “discretized”). DStreams can be created from
various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of
operations: transformations, which yield a new DStream, and output operations, which write data
to an external system. DStreams provide many of the same operations available on RDDs, plus
new operations related to time, such as sliding windows.
Unlike batch programs, Spark Streaming applications need additional setup in order to operate
24/7. We will discuss checkpointing, the main mechanism Spark Streaming provides for this
purpose, which lets it store data in a reliable file system such as HDFS. We will also discuss how
to restart applications on failure or set them to be automatically restarted.
Finally, as of Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental
Python support was added in Spark 1.2, though it supports only text data. We will focus this
chapter on Java and Scala to show the full API, but similar concepts apply in Python.
You can create DStreams either from external input sources, or by applying transformations to
other DStreams. DStreams support many of the transformations that you saw on RDDs.
Additionally, DStreams also have new “stateful” transformations that can aggregate data across
time. We will discuss these in the next section. In our simple example, we created a DStream
from data received through a socket, and then applied a filter() transformation to it. This
internally creates RDDs as shown in Figure 10-3.
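A minimal Scala sketch of that simple example is shown below; the hostname, port, and the
"error" filter string are placeholders rather than values fixed by the text.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("SocketFilterExample")
  // StreamingContext with a 1-second batch interval
  val ssc = new StreamingContext(conf, Seconds(1))

  // DStream of text lines received from a socket (host and port are placeholders)
  val lines = ssc.socketTextStream("localhost", 7777)

  // Stateless filter() transformation: keep only lines containing "error"
  val errorLines = lines.filter(_.contains("error"))

  // Output operation: print the first elements of each batch
  errorLines.print()

  ssc.start()            // start receiving data and running the computation
  ssc.awaitTermination() // wait for the streaming computation to terminate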
The execution of Spark Streaming within Spark’s driver-worker components is shown in Figure
10-5 (see Figure 2-3 earlier in the book for the components of Spark). For each input source,
Spark Streaming launches receivers, which are tasks running within the application’s executors
that collect data from the input source and save it as RDDs. These receive the input data and
replicate it (by default) to another executor for fault tolerance. This data is stored in the memory
of the executors in the same way as cached RDDs. The StreamingContext in the driver program
then periodically runs Spark jobs to process this data and combine it with RDDs from previous
time steps.
Transformations on DStreams
• In stateless transformations, the processing of each batch does not depend on the data of
previous batches. They include the common RDD transformations like map(), filter(), and
reduceByKey().
• Stateful transformations, in contrast, use data or intermediate results from previous batches to
compute the results of the current batch. They include transformations based on sliding windows
and on tracking state across time.
Stateless Transformations
Stateless transformations, some of which are listed in Table 10-1, are simple RDD
transformations being applied on every batch—that is, every RDD in a DStream. We have
already seen filter() in Figure 10-3. Many of the RDD transformations are also available on
DStreams. Note that key/value DStream transformations like reduceByKey() are made available
in Scala by import StreamingContext._. In Java, as with RDDs, it is necessary to create a
JavaPairDStream using mapToPair().
As an example, in our log processing program from earlier, we could use map() and
reduceByKey() to count log events by IP address in each time step, as shown in Examples 10-10
and 10-11.
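A brief Scala sketch in the spirit of those examples follows; accessLogsDStream and the
ApacheAccessLog records with a getIpAddress() accessor are assumed to come from earlier
parsing code that is not shown here.

  // Count log events by IP address within each batch (time step)
  val ipDStream = accessLogsDStream.map(entry => (entry.getIpAddress(), 1))
  val ipCountsDStream = ipDStream.reduceByKey((x, y) => x + y)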
Stateful Transformations
Stateful transformations are operations on DStreams that track data across time; that is, some
data from previous batches is used to generate the results for a new batch. The two main types
are windowed operations, which act over a sliding window of time periods, and
updateStateByKey(), which is used to track state across events for each key (e.g., to build up an
object representing each user session).
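As an illustration only, the following sketch uses updateStateByKey() to maintain a running
count per key, reusing the hypothetical ipDStream of (IP, 1) pairs from the earlier sketch.

  // Running count of events per IP across all batches seen so far.
  // newValues: counts arriving for this key in the current batch
  // state: the previously accumulated total, if any
  def updateRunningCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  val runningIpCounts = ipDStream.updateStateByKey(updateRunningCount _)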
Stateful transformations require checkpointing to be enabled in your StreamingContext for fault
tolerance.
For local development, you can also use a local path (e.g., /tmp) instead of HDFS.
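For instance, checkpointing could be enabled with a call like the one sketched below; the
directory paths are placeholders.

  // A reliable filesystem such as HDFS is recommended for production
  ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
  // For local development, a local path also works:
  // ssc.checkpoint("/tmp/spark-checkpoints")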
Windowed transformations
Windowed operations compute results across a longer time period than the StreamingContext's
batch interval, by combining results from multiple batches. In this section, we'll show how to
use them to keep track of the most common response codes, content sizes, and clients in a web
server access log. All windowed operations need two parameters, window duration and sliding
duration, both of which must be a multiple of the StreamingContext's batch interval. The
window duration controls how many previous batches of data are considered, namely the last
windowDuration/batchInterval. If we had a source DStream with a batch interval of 10 seconds
and wanted to create a sliding window of the last 30 seconds (or last 3 batches) we would set the
windowDuration to 30 seconds. The sliding duration, which defaults to the batch interval,
controls how frequently the new DStream computes results. If we had the source DStream with a
batch interval of 10 seconds and wanted to compute our window only on every second batch, we
would set our sliding interval to 20 seconds. Figure 10-6 shows an example.
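Assuming a 10-second batch interval as in the text, a window() call for a 30-second window
recomputed every 20 seconds could look like the sketch below; accessLogsDStream is the assumed
input stream from the earlier sketch.

  // 30-second window (last 3 batches), evaluated every 20 seconds (every 2nd batch)
  val accessLogsWindow = accessLogsDStream.window(Seconds(30), Seconds(20))
  val windowCounts = accessLogsWindow.count()
  windowCounts.print()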
While we can build all other windowed operations on top of window(), Spark Streaming
provides a number of other windowed operations for efficiency and convenience. First,
reduceByWindow() and reduceByKeyAndWindow() allow us to perform reductions on each
window more efficiently. They take a single reduce function to run on the whole window, such
as +. In addition, they have a special form that allows Spark to compute the reduction
incrementally, by considering only which data is coming into the window and which data is
going out.
This special form requires an inverse of the reduce function, such as - for +. It is much more
efficient for large windows if your function has an inverse (see Figure 10-7).
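A sketch of this incremental form, again using the hypothetical ipDStream of (IP, 1) pairs and
assuming checkpointing has already been enabled as described above:

  val ipCountsWindowed = ipDStream.reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y,  // add counts for data entering the window
    (x: Int, y: Int) => x - y,  // inverse function removes counts for data leaving the window
    Seconds(30),                // window duration
    Seconds(10))                // sliding duration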