Apache Spark
An Introduction
What is Apache Spark?
Apache Spark is a fast, general-purpose engine for
large-scale data processing.
THERE ARE FOUR REASONS TO USE SPARK
1. Speed
2. Ease of use
3. Generality
4. Platform-agnostic
SPEED
Spark's speed comes largely from computing in memory; as described later, in-memory data sharing is 10 to 100 times faster than going through network and disk.
EASE OF USE
● Spark offers APIs in Java, Scala, Python, and R natively, as well as ANSI SQL, making it fast and easy to build applications, including parallel and streaming ones.
● We can use popular interfaces like Jupyter Notebook and Apache Zeppelin, as well as the command shell.
● Using just two lines of code, you can count all the words in a large file, as sketched below.
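As a minimal PySpark sketch of that claim (the input path "input.txt" is a placeholder, not from the original), the word count itself is the two chained lines after the session setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# The two lines: read the file, then count each word.
text = spark.sparkContext.textFile("input.txt")  # placeholder path
counts = text.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.take(10))  # trigger the computation and show a sample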
GENERALITY
Spark combines SQL, streaming, and complex analytics in a single engine.
PLATFORM-AGNOSTIC
Besides running in nearly any environment, you can access data in the Hadoop Distributed File System (HDFS), Cassandra, HBase, Hive, or any other Hadoop data source.
For example, you can reference a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
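As a hedged PySpark sketch of reading from such a source (the namenode host, port, and file path below are placeholder values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExternalSources").getOrCreate()

# Reference a text dataset stored in HDFS; host, port, and path
# are placeholders, not real values from this document.
lines = spark.sparkContext.textFile("hdfs://namenode:9000/data/events.txt")
print(lines.count())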
MapReduce
MapReduce is a programming model for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
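To make the pattern concrete, here is a toy single-machine sketch of the map and reduce phases in plain Python; the data and function names are illustrative, and no Hadoop cluster is involved:

from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce step: group pairs by key and sum the counts.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["spark is fast", "spark is general"]
print(reduce_phase(map_phase(lines)))
# {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}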
Data sharing in MapReduce
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. In terms of storage, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Iterative Operations on MapReduce
● Multi-stage applications reuse intermediate results across multiple computations.
● In MapReduce, each iteration must write its intermediate results to stable storage (HDFS) before the next stage can read them back.
● This incurs substantial overhead from data replication, disk I/O, and serialization, which makes the system slow.
Interactive Operations on MapReduce
● Each query performs disk I/O against stable storage, which can dominate application execution time.
● Running different queries on the same data therefore repeats the same HDFS reads and writes every time.
Data Sharing using Spark RDD
● RDDs support in-memory computation: they store state in memory as an object across jobs, and that object is shareable between those jobs.
● Data sharing in memory is 10 to 100 times faster than going through network and disk.
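A minimal PySpark sketch of in-memory sharing (the file path and the "ERROR" filter are made up for illustration): the first job materializes and caches the RDD, and the second job reuses the cached partitions instead of re-reading the file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSharing").getOrCreate()

# "events.txt" is a placeholder path.
logs = spark.sparkContext.textFile("events.txt").cache()

total = logs.count()                                   # job 1: reads the file, fills the cache
errors = logs.filter(lambda l: "ERROR" in l).count()   # job 2: served from memory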
Iterative Operations on Spark RDD
For iterative operations, Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster. If the distributed memory (RAM) is not sufficient to hold the intermediate results (the state of the job), Spark stores those results on disk.
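A hedged sketch of an iterative loop in PySpark (the dataset and the per-iteration computation are invented for illustration); the MEMORY_AND_DISK storage level matches the spill-to-disk behavior described above:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Iterative").getOrCreate()

# Toy dataset; persist so each iteration reads cached partitions,
# spilling to disk only if RAM runs short.
data = spark.sparkContext.parallelize(range(1_000_000)).persist(StorageLevel.MEMORY_AND_DISK)

result = 0.0
for step in range(10):
    # Each pass reuses the in-memory data instead of recomputing it.
    result = data.map(lambda x: x * 0.5).sum()
print(result)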
Interactive Operations on Spark RDD
● If different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.
● By default, each transformed RDD may be
recomputed each time you run an action on it.
● However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
● There is also support for persisting RDDs on disk, or replicating them across multiple nodes, as sketched below.
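A short PySpark sketch of these persistence options (the path "data.txt" is a placeholder; the storage level names are real PySpark constants):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Persistence").getOrCreate()

words = spark.sparkContext.textFile("data.txt").flatMap(lambda l: l.split())

# Persist on disk only; MEMORY_ONLY_2 would instead replicate
# each in-memory partition on two nodes.
words.persist(StorageLevel.DISK_ONLY)

print(words.count())             # first action materializes and persists the RDD
print(words.distinct().count())  # later queries reuse the persisted data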