Apache Spark

Apache Spark is a unified analytics engine designed for large-scale data processing, offering features like in-memory processing, fault tolerance, and support for multiple programming languages. Developed at UC Berkeley in 2009, it addresses limitations of Hadoop MapReduce by enabling faster data processing and real-time analytics. Key components include Spark SQL for structured data, GraphX for graph processing, Spark Streaming for real-time data, and MLlib for machine learning.

5CS022 Distributed and Cloud Systems Programming
Lecture 9: Apache Spark
What is Apache Spark?
• "… a unified analytics engine for large-scale data processing."
• "… provides an interface for programming entire clusters with implicit data parallelism and fault tolerance"
• "… is a unified computing engine and a set of libraries for parallel data processing on computer clusters"
Motivation and History of Spark
• Created in 2009 at UC Berkeley's AMPLab by Matei Zaharia to address limitations of the Hadoop MapReduce cluster computing paradigm. Hadoop had 3 problems:
– Uses disk-based processing (slow compared to in-memory processing)
– Applications could only be written in Java (security concerns)
– No stream processing support, only batch processing
• The original UC Berkeley user group used Spark to monitor and predict Bay Area traffic patterns
• In 2010 Spark became open source under the BSD license
• In 2013 Spark became an Apache Software Foundation project
– Currently one of the largest projects of the Apache Foundation
• In 2014 Spark 1.0.0 was released, the largest release to that date, starting the 1.x line
• In 2016 Spark 2.0.0 was released
– Spark release at the time of this lecture: 2.2.1 (12/2017)
Key Features
• Spark decreases the number of reads and writes to disk, which significantly increases processing speed
• The 80 high-level operators help Spark overcome the Hadoop MapReduce restriction to Java
– Possible to develop parallel applications with Scala, Python and R
– Other languages could be supported in the same way with more work from the community
• In-memory processing reduces disk reads and writes
– As data sizes continue to increase, reading and writing TBs and PBs from disk isn't viable
– Storing data in the servers' RAM makes access significantly faster
• DAG (directed acyclic graph) execution engine
• Fault tolerant
– The Spark RDD abstraction handles failures of any worker nodes within the cluster, so data loss is negligible
• Spark Streaming allows for the processing of real-time data streams
– Not as powerful as Apache Storm/Heron/Flink
• A single workflow can express sophisticated analytics by integrating streaming data, machine learning, map and reduce operations, and queries
• Spark is compatible with all other forms of Hadoop data
– But Hadoop is not aimed at many important applications!
• Lazy evaluation: "Execution will not start until an action is triggered", e.g. data is not loaded before execution
• The active community around the globe working with Spark is a very important feature
Executing Spark Programs
• Spark programs can be executed standalone locally
• Interactively on a cluster
• Or submitted to a Spark cluster
Local Mode
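As an illustrative sketch (not from the slides), a minimal PySpark program run in local mode might look like this; the application name and sample data are assumptions:

```python
# local_wordcount.py - hypothetical example, assuming PySpark 2.x is installed
from pyspark.sql import SparkSession

# "local[*]" runs Spark locally, using one worker thread per CPU core
spark = (SparkSession.builder
         .master("local[*]")
         .appName("LocalWordCount")
         .getOrCreate())
sc = spark.sparkContext

# Build a small RDD in memory instead of reading a real file
lines = sc.parallelize(["spark runs locally", "spark can also run on a cluster"])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```

The same script could also be handed to spark-submit to run against a cluster instead of local threads.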
Interactive Client Mode
Cluster Mode
Spark Cluster Applications
Cluster Resource Manager
Resilient Distributed Dataset (RDD)
• A cacheable, distributed collection of data
• The fundamental data structure of Spark; it naturally supports in-memory processing
– The state of the memory is stored as an object and the object is shared among jobs
– Sharing data as objects in memory is 1-2 orders of magnitude faster than sharing via the network or disk
• Iterative and interactive processes become more efficient because of efficient data reuse
– Iterative examples include PageRank, K-means clustering and logistic regression
– An interactive example is interactive data mining, where several queries are run on the same dataset
– RDDs serve these workloads better than previous technologies
• Previous frameworks required that data be written to a stable file storage system (a distributed file system) between jobs
– This process creates overhead that can dominate execution times:
• Data replication
• Disk I/O
• Serialization
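A brief illustrative PySpark sketch of the data-reuse idea (the dataset and computation are assumptions, not from the slides):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("RDDReuse").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory collection (it could also come from sc.textFile(...))
numbers = sc.parallelize(range(1000000))

# cache() keeps the RDD in executor memory after its first computation,
# so the two actions below reuse it instead of rebuilding it from scratch
squares = numbers.map(lambda x: x * x).cache()

print(squares.count())   # first action: computes and caches the RDD
print(squares.take(5))   # second action: served from the in-memory cache
```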
Features of Spark RDD
• Lazy Evaluation
– In Spark, lazy evaluation means that results are not computed immediately; execution starts only when an action is triggered
– When a transformation occurs, no evaluation happens
• Spark simply maintains a record of the operation
• Fault Tolerance
– RDDs track data lineage and rebuild data upon failure
– RDDs do this by remembering how they were created from other datasets
• RDDs are created by transformations (map, join, …)
• Immutability
– Safely share data across processes
– Create and retrieve data at any time
• Ensures easy caching, sharing and replication
– Allows computations to be consistent
• Partitioning
– The partition is the fundamental unit of parallelism within a Spark RDD
• Each partition is a logical division of the data
– Partitions are created by performing a transformation on existing partitions
• Persistence
• Coarse-grained Operations
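A small illustrative sketch of lazy evaluation and partitioning (the partition count and data are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("RDDFeatures").getOrCreate()
sc = spark.sparkContext

# Ask for 4 partitions explicitly; each partition can be processed in parallel
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())   # -> 4

# Transformations are lazy: nothing is computed here, Spark only records the lineage
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# The action finally triggers execution of the recorded transformations
print(doubled.count())          # -> 50
```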
Supported RDD Operations
Transformations examples
• Many are classic database operations
• map() - the supplied function is applied to each element in the RDD
• flatMap() - similar to map, but each input element can produce zero or more output elements
• filter() - creates a new RDD containing only the elements that pass the supplied predicate
• mapPartitions() - works on whole RDD partitions; useful when distributed computation is done per partition
• union(dataset) - performs a union operation on two RDDs (putting them together in a single RDD)
• intersection() - finds the intersecting elements of two RDDs, creating a new RDD
• distinct() - finds the distinct elements in an RDD, creating a new RDD
• join() - performs an inner or outer join on a dataset of key-value pairs
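An illustrative sketch of a few of these transformations in PySpark (the data values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Transformations").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not", "to be"])
nums1 = sc.parallelize([1, 2, 2, 3])
nums2 = sc.parallelize([3, 4])

words  = lines.flatMap(lambda line: line.split(" "))   # flatMap: one line -> many words
upper  = words.map(lambda w: w.upper())                # map: applied to every element
short  = words.filter(lambda w: len(w) <= 2)           # filter: keep matching elements
both   = nums1.union(nums2)                            # union: combine two RDDs
common = nums1.intersection(nums2)                     # intersection: shared elements
unique = nums1.distinct()                              # distinct: remove duplicates

# join works on key-value pair RDDs
ages   = sc.parallelize([("alice", 30), ("bob", 25)])
cities = sc.parallelize([("alice", "London"), ("bob", "Leeds")])
joined = ages.join(cities)                             # inner join on the key

# These are all transformations, so nothing has executed yet;
# collect() below is the action that triggers the computation
print(joined.collect())
```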
Supported RDD Operations
Actions
• When you need to work with the actual dataset, an action must be performed: transformations simply create new RDDs from other RDDs, whereas actions actually produce values
– Values are produced by actions by executing the RDD lineage graph
• An action in Spark is an RDD operator that produces a non-RDD value
– These values are returned to the driver or written to an external storage system
• Actions are what bring the "laziness" of RDDs into the spotlight
– Results are only computed when an action is called
– The importance of synchronization is key
• Actions are one way of sending data from the executors to the driver; the other is accumulators
Supported RDD Operations
Actions Examples
• count()
• collect()
• take()
• top()
• countByValue()
• reduce()
• fold()
• aggregate()
• foreach()
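A short illustrative sketch of these actions (the input values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Actions").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([5, 1, 4, 1, 3])

print(nums.count())                      # 5 - number of elements
print(nums.collect())                    # [5, 1, 4, 1, 3] - fetch all elements to the driver
print(nums.take(2))                      # [5, 1] - first two elements
print(nums.top(2))                       # [5, 4] - two largest elements
print(nums.countByValue())               # how many times each value occurs
print(nums.reduce(lambda a, b: a + b))   # 14 - sum via pairwise reduction
print(nums.fold(0, lambda a, b: a + b))  # 14 - like reduce, with a zero value
print(nums.aggregate(0, lambda acc, x: acc + x,
                        lambda a, b: a + b))   # 14 - per-partition, then combined
nums.foreach(lambda x: None)             # run a side-effecting function on each element
```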
Transformations vs. Actions
• Transformations generate a new, modified RDD based off the parent RDD
• Actions return values from the computation performed on the RDD
RDD Operations
• 1. Transformations: define new RDDs based on the current one (RDD → new RDD)
– e.g., filter, map, flatMap, reduceByKey, etc.
• 2. Actions: return values (RDD → value)
– e.g., count, sum, collect, etc.
RDD Persistence
• An RDD can be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.
• Fault-tolerant – if any partition of a persisted RDD is lost, it will automatically be recomputed using the transformations that originally created it.

Storage Level – Meaning
• MEMORY_ONLY – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
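A brief illustrative sketch of persisting an RDD with an explicit storage level (the dataset contents are assumptions):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Persistence").getOrCreate()
sc = spark.sparkContext

logs = sc.parallelize(["INFO ok", "ERROR disk", "ERROR net", "INFO ok"])
errors = logs.filter(lambda line: line.startswith("ERROR"))

# cache() is equivalent to persist() with the default MEMORY_ONLY level;
# MEMORY_AND_DISK spills partitions to disk if they do not fit in memory
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())      # first action: computes and persists the filtered RDD
print(errors.collect())    # second action: reads the persisted partitions
errors.unpersist()         # release the storage when it is no longer needed
```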
Directed Acyclic Graph (DAG) Execution Engine
• A DAG is a programming style for distributed systems
– In a Directed Acyclic Graph, edges have a direction, represented by arrows
• Basically, Spark uses vertices and edges to represent RDDs and the operations to be applied to those RDDs, respectively
– Nodes of the graph are RDD instances or results
– Links of the DAG are transformations, actions and maps
• The DAG is modelled in Spark using a DAG scheduler, which further splits the graph into stages of tasks
• The DAG scheduler splits the original task into smaller subtasks and uses RDD transformations to perform the subtask operations
• In Spark the DAG can have any number of stages
– This prevents having to split a single job up into several jobs if more than two stages are required
– Allows simple jobs to complete after one stage
• In MapReduce the DAG has two predefined stages
– Map
– Reduce
• https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
DAG In Action
• For the DAG to work, Spark uses the Scala or Java layer as the first interpreter, to get the information on the task that has to be performed.
• Once a job or a set of tasks is submitted to Spark, it creates an Operator Graph.
• When an Action is called on a Spark RDD, Spark submits the operator graph to the DAG scheduler with the information on what to do and when to do it.
• The DAG Scheduler defines stages depending on the operations submitted to it. In one stage data partitioning can take place, and in another stage data gathering can take place after computation has been performed on the separate partitions.
• The stages are passed to the Task Scheduler, which manages all the subtasks generated for a particular job.
• The cluster manager is the tool which launches the tasks; the Task Scheduler submits those requests to the cluster manager.
• The cluster manager submits these tasks to the different machines, or workers.
• Each of the workers executes its tasks (the master-worker parallel computing paradigm).
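An illustrative way to peek at the lineage graph that the scheduler turns into stages is the RDD's toDebugString() method (the pipeline below is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("LineageDemo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a b", "b c", "c a"]).flatMap(lambda s: s.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() prints the RDD lineage; the shuffle introduced by reduceByKey
# is where the DAG scheduler inserts a stage boundary
print(counts.toDebugString().decode("utf-8"))

print(counts.collect())   # the action that actually triggers the DAG execution
```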
Execution Graph
Apache Spark Components and Relationships
Cluster Management
• A cluster manager is used to acquire cluster resources for executing jobs.
• Spark core runs over diverse cluster managers, including Hadoop YARN, Apache Mesos, Amazon EC2 and Spark's built-in standalone cluster manager.
• The cluster manager handles resource sharing between Spark applications.
• Independently of the cluster manager, Spark can access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any Hadoop data source.
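As a hedged sketch, the choice of cluster manager typically shows up in application code only as the master URL; the host names, ports and file path below are placeholders, not from the slides:

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager; the rest of the program stays the same.
#   "local[*]"                  - run locally with one thread per core
#   "spark://master-host:7077"  - Spark's built-in standalone cluster manager
#   "yarn"                      - Hadoop YARN (cluster settings come from the Hadoop config)
#   "mesos://mesos-host:5050"   - Apache Mesos
spark = (SparkSession.builder
         .master("spark://master-host:7077")   # placeholder standalone master
         .appName("ClusterManagedApp")
         .getOrCreate())

# Data access is independent of the cluster manager, e.g. a file in HDFS:
lines = spark.sparkContext.textFile("hdfs://namenode:9000/data/input.txt")
print(lines.count())
```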
Spark SQL
• Spark SQL is a new module in Spark which integrates
relational processing with Spark’s functional programming
API.
• It supports querying data either via SQL or via the Hive
Query Language.
• The DataFrame and Dataset APIs of Spark SQL provide a
higher level of abstraction for structured data.
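An illustrative DataFrame/SQL sketch (the column names and values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("SparkSQLDemo").getOrCreate()

# Build a small DataFrame; in practice this might come from spark.read.json(...) etc.
people = spark.createDataFrame(
    [("alice", 34), ("bob", 23), ("carol", 45)],
    ["name", "age"])

# DataFrame API
people.filter(people.age > 30).select("name").show()

# Equivalent SQL query against a temporary view
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```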
GraphX
• GraphX is the Spark API for graphs and graph-parallel computation.
• Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
Spark Streaming
• Spark Streaming is the component of Spark which is used to process real-time streaming data.
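An illustrative sketch of the classic streaming word count (the host, port and batch interval are assumptions):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")   # at least 2 threads: receiver + processing
ssc = StreamingContext(sc, batchDuration=5)           # process the stream in 5-second micro-batches

# Read lines from a TCP socket (e.g. fed by `nc -lk 9999` on localhost)
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                        # print each batch's counts

ssc.start()
ssc.awaitTermination()
```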
MLlib
• MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
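A brief illustrative sketch using MLlib's DataFrame-based API (the toy feature vectors and cluster count are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("MLlibDemo").getOrCreate()

# Toy dataset: two obvious clusters of 2-D points
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.1]),)],
    ["features"])

kmeans = KMeans(k=2, seed=1)        # fit two clusters
model = kmeans.fit(data)
print(model.clusterCenters())       # centres of the two learned clusters
```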
Questions?
