Unit - 5 FBDA
Spark Streaming
Spark Streaming provides an abstraction called DStreams, or discretized streams. A DStream is a
sequence of data arriving over time. Internally, each DStream is represented as a sequence of
RDDs arriving at each time step (hence the name “discretized”). DStreams can be created from
various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of
operations: transformations, which yield a new DStream, and output operations, which write data
to an external system. DStreams provide many of the same operations available on RDDs, plus
new operations related to time, such as sliding windows.
Unlike batch programs, Spark Streaming applications need additional setup in order to operate
24/7. We will discuss checkpointing, the main mechanism Spark Streaming provides for this
purpose, which lets it store data in a reliable file system such as HDFS. We will also discuss how
to restart applications on failure or set them to be automatically restarted.
Finally, as of Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental
Python support was added in Spark 1.2, though it supports only text data. We will focus this
chapter on Java and Scala to show the full API, but similar concepts apply in Python.
You can create DStreams either from external input sources, or by applying transformations to
other DStreams. DStreams support many of the transformations that you saw on RDDs.
Additionally, DStreams also have new “stateful” transformations that can aggregate data across
time. We will discuss these in the next section. In our simple example, we created a DStream
from data received through a socket, and then applied a filter() transformation to it. This
internally creates RDDs as shown in Figure 10-3.
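A minimal Scala sketch of that simple example is shown below; the hostname, port, and the
"error" filter string are placeholders rather than values fixed by the text.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("SocketFilterExample")
  // StreamingContext with a 1-second batch interval
  val ssc = new StreamingContext(conf, Seconds(1))

  // DStream of text lines received from a socket (host and port are placeholders)
  val lines = ssc.socketTextStream("localhost", 7777)

  // Stateless filter() transformation: keep only lines containing "error"
  val errorLines = lines.filter(_.contains("error"))

  // Output operation: print the first elements of each batch
  errorLines.print()

  ssc.start()            // start receiving data and running the computation
  ssc.awaitTermination() // wait for the streaming computation to terminate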
The execution of Spark Streaming within Spark’s driver-worker components is shown in Figure
10-5 (see Figure 2-3 earlier in the book for the components of Spark). For each input source,
Spark Streaming launches receivers, which are tasks running within the application’s executors
that collect data from the input source and save it as RDDs. These receive the input data and
replicate it (by default) to another executor for fault tolerance. This data is stored in the memory
of the executors in the same way as cached RDDs. The StreamingContext in the driver program
then periodically runs Spark jobs to process this data and combine it with RDDs from previous
time steps.
Transformations on DStreams
• In stateless transformations, the processing of each batch does not depend on the data of
previous batches. They include the common RDD transformations like map(), filter(), and
reduceByKey().
• Stateful transformations, in contrast, use data or intermediate results from previous batches to
compute the results of the current batch. They include transformations based on sliding windows
and on tracking state across time.
Stateless Transformations
Stateless transformations, some of which are listed in Table 10-1, are simple RDD
transformations being applied on every batch—that is, every RDD in a DStream. We have
already seen filter() in Figure 10-3. Many of the RDD transformations are also available on
DStreams. Note that key/value DStream transformations like reduceByKey() are made available
in Scala by import StreamingContext._. In Java, as with RDDs, it is necessary to create a
JavaPairDStream using mapToPair().
As an example, in our log processing program from earlier, we could use map() and
reduceByKey() to count log events by IP address in each time step, as shown in Examples 10-10
and 10-11.
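A brief Scala sketch in the spirit of those examples follows; accessLogsDStream and the
ApacheAccessLog records with a getIpAddress() accessor are assumed to come from earlier
parsing code that is not shown here.

  // Count log events by IP address within each batch (time step)
  val ipDStream = accessLogsDStream.map(entry => (entry.getIpAddress(), 1))
  val ipCountsDStream = ipDStream.reduceByKey((x, y) => x + y)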
Stateful Transformations
Stateful transformations are operations on DStreams that track data across time; that is, some
data from previous batches is used to generate the results for a new batch. The two main types
are windowed operations, which act over a sliding window of time periods, and
updateStateByKey(), which is used to track state across events for each key (e.g., to build up an
object representing each user session).
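As an illustration only, the following sketch uses updateStateByKey() to maintain a running
count per key, reusing the hypothetical ipDStream of (IP, 1) pairs from the earlier sketch.

  // Running count of events per IP across all batches seen so far.
  // newValues: counts arriving for this key in the current batch
  // state: the previously accumulated total, if any
  def updateRunningCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  val runningIpCounts = ipDStream.updateStateByKey(updateRunningCount _)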
Stateful transformations require checkpointing to be enabled in your StreamingContext for fault
tolerance.
For local development, you can also use a local path (e.g., /tmp) instead of HDFS.
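For instance, checkpointing could be enabled with a call like the one sketched below; the
directory paths are placeholders.

  // A reliable filesystem such as HDFS is recommended for production
  ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
  // For local development, a local path also works:
  // ssc.checkpoint("/tmp/spark-checkpoints")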
Windowed transformations
Windowed operations compute results across a longer time period than the StreamingContext's
batch interval, by combining results from multiple batches. In this section, we'll show how to
use them to keep track of the most common response codes, content sizes, and clients in a web
server access log. All windowed operations need two parameters, window duration and sliding
duration, both of which must be a multiple of the StreamingContext's batch interval. The
window duration controls how many previous batches of data are considered, namely the last
windowDuration/batchInterval. If we had a source DStream with a batch interval of 10 seconds
and wanted to create a sliding window of the last 30 seconds (or last 3 batches) we would set the
windowDuration to 30 seconds. The sliding duration, which defaults to the batch interval,
controls how frequently the new DStream computes results. If we had the source DStream with a
batch interval of 10 seconds and wanted to compute our window only on every second batch, we
would set our sliding interval to 20 seconds. Figure 10-6 shows an example.
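Assuming a 10-second batch interval as in the text, a window() call for a 30-second window
recomputed every 20 seconds could look like the sketch below; accessLogsDStream is the assumed
input stream from the earlier sketch.

  // 30-second window (last 3 batches), evaluated every 20 seconds (every 2nd batch)
  val accessLogsWindow = accessLogsDStream.window(Seconds(30), Seconds(20))
  val windowCounts = accessLogsWindow.count()
  windowCounts.print()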
While we can build all other windowed operations on top of window(), Spark Streaming
provides a number of other windowed operations for efficiency and convenience. First,
reduceByWindow() and reduceByKeyAndWindow() allow us to perform reductions on each
window more efficiently. They take a single reduce function to run on the whole window, such
as +. In addition, they have a special form that allows Spark to compute the reduction
incrementally, by considering only which data is coming into the window and which data is
going out.
This special form requires an inverse of the reduce function, such as - for +. It is much more
efficient for large windows if your function has an inverse (see Figure 10-7).
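A sketch of this incremental form, again using the hypothetical ipDStream of (IP, 1) pairs and
assuming checkpointing has already been enabled as described above:

  val ipCountsWindowed = ipDStream.reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y,  // add counts for data entering the window
    (x: Int, y: Int) => x - y,  // inverse function removes counts for data leaving the window
    Seconds(30),                // window duration
    Seconds(10))                // sliding duration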