
APACHE SPARK

Industries use Hadoop extensively to analyze their data sets, because the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main concern, however, is speed: processing large datasets means long waits between queries and long waits to run each program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
Spark can use Hadoop in two ways, for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark began in 2009 as a research project in UC Berkeley's AMPLab, started by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.

Features of Apache Spark

Apache Spark has the following features.


• Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible because it reduces the number of read/write operations to disk and stores intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying, as the short example after this list illustrates.
• Advanced Analytics − Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
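
To make the "multiple languages" and "high-level operators" points concrete, here is a minimal word-count sketch in Scala. It assumes a spark-shell session, where the SparkContext is available as sc, and the HDFS path is only a placeholder.

// Runs in the Scala spark-shell, where an active SparkContext is provided as sc.
// The input path below is a placeholder; substitute any text file visible to the cluster.
val lines = sc.textFile("hdfs:///data/sample.txt")            // load a text file as an RDD of lines
val words = lines.flatMap(line => line.split("\\s+"))         // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // sum the occurrences per word
counts.take(10).foreach(println)                              // trigger the computation and print a sample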

Spark Built on Hadoop

There are three ways in which Spark can be built with Hadoop components and deployed, as explained below.


• Standalone − In a standalone deployment, Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop Yarn − In a Hadoop YARN deployment, Spark simply runs on YARN without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to a standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

Components of Spark

The Spark stack is made up of the following components.


Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems, and it is responsible for task scheduling, memory management, fault recovery, and interaction with storage systems.
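
As a sketch of how an application talks to Spark Core, the following minimal Scala program creates a SparkContext and runs one parallel job. The application name and the local[*] master URL are illustrative choices; in practice the master is usually supplied by spark-submit rather than hard-coded.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Configuration for the application; the name and master are placeholders.
    val conf = new SparkConf()
      .setAppName("simple-spark-core-app")
      .setMaster("local[*]")             // run locally using all available cores

    val sc = new SparkContext(conf)      // entry point to Spark Core

    // A trivial job: distribute a local collection and sum it in parallel.
    val total = sc.parallelize(1 to 1000).reduce(_ + _)
    println(s"sum = $total")

    sc.stop()
  }
}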
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
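A small Scala sketch of the idea, written against the Spark 1.x API (SchemaRDD was later renamed DataFrame, and registerTempTable has since been replaced by newer methods, so names vary by version). It assumes a spark-shell session with sc available; the Person data is made up.

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named sc (for example, inside the spark-shell).
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Give an ordinary RDD a schema so it can be queried as structured data.
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 29))).toDF()
people.registerTempTable("people")

// Run a SQL query over the in-memory dataset.
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()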
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data.
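The following Scala sketch shows the mini-batch model: the stream is cut into 5-second batches, and each batch is processed with ordinary RDD-style transformations. The host and port of the text source are placeholders.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext named sc; host and port are placeholders.
val ssc = new StreamingContext(sc, Seconds(5))       // 5-second mini-batches

val lines = ssc.socketTextStream("localhost", 9999)  // ingest a stream of text lines
val words = lines.flatMap(_.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                       // print a sample of each batch's result

ssc.start()                                          // start receiving and processing
ssc.awaitTermination()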
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework built on top of Spark, taking advantage of Spark's distributed, memory-based architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
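As an illustration of the ALS algorithm mentioned above, here is a toy Scala sketch using MLlib's recommendation API. The ratings are invented data and the parameter values are arbitrary; it assumes a spark-shell session with sc available.

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy ratings: Rating(user, product, rating). The data and parameters are illustrative only.
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0),
  Rating(1, 20, 1.0),
  Rating(2, 10, 5.0)
))

// Train a small collaborative-filtering model with Alternating Least Squares:
// rank = 5 latent factors, 10 iterations, regularization lambda = 0.01.
val model = ALS.train(ratings, 5, 10, 0.01)

// Predict how user 2 would rate product 20.
println(model.predict(2, 20))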
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API, along with an optimized runtime for this abstraction.
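A brief GraphX sketch in Scala: it builds a tiny property graph from toy vertex and edge RDDs and runs connected components, one of the Pregel-based algorithms shipped with GraphX. It assumes a spark-shell session with sc available.

import org.apache.spark.graphx.{Edge, Graph}

// Toy data: vertex IDs must be Longs; the attributes here are just names and labels.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)

// Connected components is implemented on top of the Pregel abstraction.
val components = graph.connectedComponents().vertices
components.collect().foreach(println)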

Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver program,
or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase,
or any data source offering a Hadoop Input Format.
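Both creation methods look like this in the Scala spark-shell, where the SparkContext is available as sc; the HDFS path is a placeholder.

// 1. Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in an external storage system (placeholder path).
val logLines = sc.textFile("hdfs:///logs/app.log")

println(numbers.sum())      // runs a job over the parallelized collection
println(logLines.count())   // runs a job over the external file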
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(Ex − between two MapReduce jobs) is to write it to an external stable storage system (Ex − HDFS).
Although this framework provides numerous abstractions for accessing a cluster’s computational
resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. As for the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce

Multi-stage applications need to reuse intermediate results across multiple computations. When iterative operations run on MapReduce, each iteration writes its results to stable storage and the next iteration reads them back. This incurs substantial overheads due to data replication, disk I/O, and serialization, which makes the system slow.
Interactive Operations on MapReduce

When a user runs ad-hoc queries on the same subset of data, each query performs disk I/O against stable storage, which can dominate application execution time.

Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations. Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation. Spark stores the state of memory as an object across jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than over the network and disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

In iterative operations on Spark RDD, intermediate results are stored in distributed memory instead of stable storage (disk), which makes the system faster, as the sketch below illustrates.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), those results are spilled to disk.
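
A minimal Scala sketch of the pattern, assuming a spark-shell session with sc available; the HDFS path and the computation itself are only illustrative. MEMORY_AND_DISK mirrors the note above: partitions stay in RAM and spill to disk only when memory runs short.

import org.apache.spark.storage.StorageLevel

val points = sc.textFile("hdfs:///data/points.txt").map(_.toDouble)
points.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed

var estimate = 0.0
for (i <- 1 to 10) {
  // each pass reuses the cached RDD instead of re-reading the file from HDFS
  estimate = points.map(p => (p + estimate) / 2).mean()
}
println(estimate)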

Interactive Operations on Spark RDD

For interactive operations on Spark RDD, if different queries are run on the same set of data repeatedly, that particular data can be kept in memory for better execution times.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
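In code, persistence is requested explicitly; the Scala sketch below assumes a spark-shell session with sc available and a placeholder path. An RDD's storage level can only be set once, so the alternatives are shown as comments.

import org.apache.spark.storage.StorageLevel

val queryData = sc.textFile("hdfs:///data/events.txt")

// Choose one storage level per RDD (it cannot be changed once set):
queryData.persist(StorageLevel.MEMORY_AND_DISK)   // keep in RAM, spill to disk if needed
// other options include MEMORY_ONLY (what cache() uses), DISK_ONLY,
// and replicated variants such as MEMORY_ONLY_2

// The first action materializes and caches the data; later queries reuse it.
queryData.count()
queryData.filter(_.contains("ERROR")).count()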
Running Spark Jobs on YARN

When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.

Spark supports two modes for running on YARN, "yarn-cluster" mode and "yarn-client" mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application's output immediately.

Understanding the difference requires an understanding of YARN's Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The Application Master is responsible for requesting resources from the ResourceManager and, when they are allocated, for telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client: the process that starts the application can go away, and coordination continues from a process managed by YARN running on the cluster.

In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn't need to stick around for its entire lifetime.


Figure: yarn-cluster mode

The yarn-cluster mode, however, is not well suited to using Spark interactively. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:


Figure: yarn-client mode

Different Deployment Modes across the cluster

In yarn-cluster mode, the Spark client submits the Spark application to YARN, and both the Spark driver and the Spark executors run under the supervision of YARN. In yarn-client mode, only the Spark executors are under the supervision of YARN: the YARN ApplicationMaster requests resources for the Spark executors only, while the driver program runs in the client process and has nothing to do with YARN.
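
In practice the two modes are selected when the application is submitted. The class name and jar below are placeholders; older Spark releases spelled the modes as --master yarn-cluster and --master yarn-client, while newer releases use --master yarn together with --deploy-mode.

# Production job: the driver runs inside the YARN Application Master (yarn-cluster mode).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

# Interactive or debugging use: the driver runs in the local client process (yarn-client mode).
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# Interactive shells such as spark-shell always use client mode.
spark-shell --master yarn --deploy-mode client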
