BDA-Lec8
Spark
[Title slide diagram: Spark in the big data ecosystem – MapReduce, Hadoop File System, Streaming, Large-scale Data Mining/ML]
What will we learn in this lecture?
01. Why Do We Need Spark?
● Spark SQL
○ – To work with structured data
○ – Allows querying data via SQL-like queries
○ – Supports many data sources (Hive tables, Parquet, JSON, …)
○ – Extends Spark RDD API
● Spark Streaming
○ – To process live streams of data
○ – Extends Spark RDD API
Spark as Unified Analytics Engine
● MLlib
○ – Scalable ML library
○ – Many distributed algorithms: feature extraction, classification,
regression, clustering, recommendation, …
● GraphX
○ – API for manipulating graphs and performing graph-parallel
computations
○ – Also includes common graph algorithms (e.g., PageRank)
○ – Extends Spark RDD API
Spark Architecture
Master/Worker Architecture
Master (Driver Program):
1- Creates the SparkContext
2- Runs the main() function
3- Converts the user program into tasks and stages
Worker Nodes (DataNodes):
Contain task executors (processes) that:
1- Execute tasks (run computations)
2- Read data from HDFS
3- Perform the transformation operations
4- Store data for applications
SparkContext: The Gateway to Apache Spark Applications
Core Components:
- SparkContext: Connection to the Spark cluster; creates RDDs (Resilient Distributed Datasets).
- Cluster Manager: Manages resources and schedules tasks.
- Worker Nodes: Execute tasks via executors.
- Executors: Run tasks on worker nodes.
- RDDs: Immutable, fault-tolerant distributed data structures.
- HDFS: Distributed file system for data storage.
How It Works (a code sketch follows the steps):
1. Driver Program: User writes a Spark application using Python, Scala, or Java.
2. RDD Creation: SparkContext reads data (e.g., from HDFS) and creates RDDs.
3. Transformations: Operations like map, filter, and reduceByKey derive new RDDs (existing RDDs are not modified).
4. Actions: Commands like count, collect trigger execution.
5. Task Scheduling: Cluster Manager schedules tasks on worker nodes.
6. Task Execution: Executors process tasks, interacting with HDFS.
7. Results: Returned to the driver for display or storage.
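As a rough illustration of these steps, here is a minimal PySpark sketch; the application name and HDFS path are placeholder assumptions, not taken from the slides.

# Driver program (step 1): user code that creates the SparkContext
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("how-it-works-sketch")  # illustrative app name
sc = SparkContext(conf=conf)

# RDD creation (step 2): read data, e.g. from HDFS, into an RDD
lines = sc.textFile("hdfs:///path/to/input.txt")      # placeholder path

# Transformations (step 3): lazily describe new RDDs; nothing executes yet
long_lines = lines.filter(lambda line: len(line) > 80)

# Action (step 4): triggers scheduling and execution (steps 5-6) on the executors;
# the result comes back to the driver (step 7)
print(long_lines.count())

sc.stop()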
Spark Architecture
● Main program (called the driver program (master)) talks to the cluster manager, which allocates resources for:
● Worker nodes, in which executors run
● Executors are processes that run computations and store data for the application
https://downloads.apache.org/spark/docs/2.0.0-preview/cluster-overview.html
Spark Architecture
● Each application consists of a driver program and executors on the cluster
○ – Driver program: the process which runs the application's main() and creates the SparkContext object
● Each application gets its own executors, which are processes that stay up for the duration of the whole application and run tasks in multiple threads
○ – Isolation of concurrent applications
● To run on a cluster, the SparkContext connects to (communicates with) the cluster manager, which allocates cluster resources on the worker nodes
● Once connected, Spark acquires executors on the cluster nodes and sends the application code (e.g., a jar) to the executors (data locality)
● Finally, the SparkContext sends tasks to the executors to run (see the sketch below)
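As a sketch of this handshake from the driver's side, creating the SparkContext with a master URL is what triggers the connection to the cluster manager; the standalone URL below is a placeholder assumption.

from pyspark import SparkConf, SparkContext

# The master URL names the cluster manager to contact;
# "spark://host:7077" (a standalone master) is a placeholder value.
conf = (SparkConf()
        .setAppName("cluster-handshake-sketch")
        .setMaster("spark://host:7077"))

sc = SparkContext(conf=conf)  # connects to the cluster manager and acquires executors
# ... define RDDs and call actions; the SparkContext ships tasks to the executors ...
sc.stop()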
Spark Programming Model
Spark on Top of Cluster Managers
● Spark can exploit many cluster resource managers which allocate
cluster resources to run the applications
○ 1. Standalone – Simple cluster manager included with Spark that
makes it easy to set up a cluster (default cluster manager).
○ 2. Hadoop YARN – Resource manager in Hadoop 2
○ 3. Mesos – General cluster manager from AMPLab
○ 4. Kubernetes
Deploy Modes and Cluster Managers
● Spark supports different deploy modes (VM, Docker, Kubernetes) and cluster managers (Standalone, Hadoop YARN, Mesos, Kubernetes), so it can run in different configurations and environments (see the master URL sketch below)
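For illustration, switching cluster manager typically amounts to changing the master URL (the same value that the --master option of spark-submit sets); a sketch with placeholder host names and ports:

from pyspark import SparkConf

# Only the master URL changes when switching cluster managers (all values are placeholders)
conf_standalone = SparkConf().setMaster("spark://host:7077")        # Standalone
conf_yarn       = SparkConf().setMaster("yarn")                     # Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://host:5050")        # Mesos
conf_k8s        = SparkConf().setMaster("k8s://https://host:6443")  # Kubernetes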
RDD
● RDDs are the key programming abstraction in Spark: a
distributed memory abstraction
● Immutable, partitioned, and fault-tolerant collection of elements that can be manipulated in parallel
○ – Like a LinkedList<MyObjects>
○ – Stored in main memory across the cluster nodes
■ Each worker node that is used to run an application contains at least one partition of the RDD(s) defined in the application
RDDs: distributed and partitioned
● Stored in the main memory of the executors running on the worker nodes (when possible) or on node-local disk (if there is not enough main memory)
● Allow executing the code invoked on them in parallel
○ – Each executor of a worker node runs the specified code on its partition of the RDD
○ – Partition: an atomic chunk of data (a logical division of the data) and the basic unit of parallelism
○ – Partitions of an RDD can be stored on different cluster nodes (see the sketch below)
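A small sketch of explicit partitioning; the partition count of 4 is an arbitrary choice for illustration.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Ask for 4 partitions explicitly; each partition is processed by one task
rdd = sc.parallelize(range(1000), 4)

print(rdd.getNumPartitions())                        # -> 4
# glom() turns each partition into a list, so we can see how elements are split
print([len(part) for part in rdd.glom().collect()])  # roughly 250 elements each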
RDDs: immutable and fault-tolerant
● Immutable once constructed
○ – i.e., RDD content cannot be modified
○ – Create new RDD based on existing RDD
● RDD suitability
○ – Best suited for applications that apply the same operation to all the elements of a dataset (coarse-grained manipulations only)
○ – Provides fine-grained control over the physical distribution of data
○ – Not a good fit for applications that make fine-grained updates to shared state
Spark and RDDs
● Spark (Spark Core) manages the split of RDDs into partitions (atomic units) and allocates the RDDs' partitions to cluster nodes (master (Spark driver) and worker nodes)
● Spark hides the complexity of fault tolerance
○ – RDDs are automatically rebuilt in case of failure using the RDD lineage DAG, which defines the logical execution plan
Directed Acyclic Graph (DAG)
● A Directed Acyclic Graph (DAG) in Spark is a set of vertices and
edges, where vertices represent the RDDs and edges represent
the operations to be applied on RDDs
○ – Generalization of MapReduce model, which has only two
operations (Map and Reduce)
Directed Acyclic Graph (DAG)
● The DAG can be visualized using the Spark Web UI
○ – Figure: WordCount DAG
● A stage is a set of operations that does not involve a shuffle of data
● As soon as a shuffle of data is needed (i.e., when a wide transformation is performed), the DAG will yield a new stage (see the toDebugString sketch below)
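The same lineage and stage structure can also be inspected from code with toDebugString; a minimal sketch (the exact output formatting varies across Spark versions):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = (sc.parallelize(["a", "b", "a", "c"])
           .map(lambda w: (w, 1))              # narrow transformation: same stage
           .reduceByKey(lambda x, y: x + y))   # wide transformation: shuffle -> new stage

# The indentation in the printed lineage marks the shuffle (stage) boundary
print(pairs.toDebugString().decode())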
Operations in RDD API
● Spark programs are written in terms of operations on RDDs
● Programming model based on parallelizable operators
○ – Higher-order functions that execute user-defined
functions in parallel
● RDDs are created and manipulated through operators; see https://spark.apache.org/docs/latest/rdd-programming-guide.html
● RDDs are created from external data or other RDDs
How to create an RDD?
● An RDD can be created by (each route is sketched in code below):
○ – Parallelizing existing data collections of the hosting programming
language (e.g., collections and lists of Scala, Java, Python, or R)
■ Number of partitions specified by user
■ RDD API: parallelize
○ – From (large) files stored in HDFS or any other file system
■ One partition per HDFS block
■ RDD API: textFile
○ – Transforming an existing RDD
■ Number of partitions depends on transformation type
■ RDD API: transformation operations (map, filter, flatMap)
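A minimal sketch of the three creation routes; the HDFS path is a placeholder.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# 1. Parallelize an existing Python collection, choosing the number of partitions
nums = sc.parallelize([1, 2, 3, 4, 5], 2)

# 2. Read a (large) file stored in HDFS: one partition per HDFS block
lines = sc.textFile("hdfs:///path/to/data.txt")   # placeholder path

# 3. Transform an existing RDD into a new one
squares = nums.map(lambda x: x * x)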
How to create an RDD?
● Turn an existing collection into an RDD
● filter: generates a new RDD by filtering the source dataset using the
specified function
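A sketch covering both bullets, i.e. parallelizing a collection and then filtering it; the sample values are made up.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Turn an existing Python list into an RDD
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# filter: keep only the elements for which the function returns True
evens = nums.filter(lambda x: x % 2 == 0)

print(evens.collect())   # -> [2, 4, 6]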
RDD transformations: flatMap
● flatMap: takes as input a function which is applied to each element of
the RDD; can map each input item to zero or more output items
○ – Note: Python's range(start, end, step) produces an ordered sequence of integer values in [start, end) with a nonzero step
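A sketch of flatMap that uses range to map each element to zero or more output items:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3])

# Each input x is mapped to the sequence 0 .. x-1; the results are flattened into one RDD
expanded = nums.flatMap(lambda x: range(x))

print(expanded.collect())   # -> [0, 0, 1, 0, 1, 2]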
RDD transformations: reduceByKey
● reduceByKey: aggregates values with identical keys using the specified function
● Runs several parallel reduce operations, one for each key
in the dataset, where each operation combines values that
have the same key
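A minimal sketch, with made-up (word, 1) pairs as input:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# One parallel reduce per key: values sharing a key are combined with the function
counts = pairs.reduceByKey(lambda x, y: x + y)

print(counts.collect())   # e.g. [('a', 3), ('b', 1)] (ordering is not guaranteed)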
RDD transformations: reduceByKey
● Let's visualize the DAG
RDD transformations: join
● join: performs an inner-join on the keys of two
RDDs
● Only keys that are present in both RDDs are output
● Join candidates are independently processed
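A minimal sketch of the inner join on keys; the sample records are made up.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

ages   = sc.parallelize([("alice", 30), ("bob", 25), ("carol", 41)])
cities = sc.parallelize([("alice", "Rome"), ("bob", "Paris"), ("dave", "Oslo")])

# Inner join: only keys present in both RDDs ("alice", "bob") appear in the result
joined = ages.join(cities)

print(joined.collect())   # e.g. [('alice', (30, 'Rome')), ('bob', (25, 'Paris'))]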
RDD transformations: join
● Let's visualize the DAG
Some RDD actions
● collect: returns all the elements of the RDD as a list
● reduce: aggregates the elements in the RDD using the specified function
● saveAsTextFile: writes the elements of the RDD as a text file either to the local file
system or HDFS
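A sketch of the three actions; the output path is a placeholder.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3, 4])

print(nums.collect())                       # -> [1, 2, 3, 4]  (all elements as a list)
print(nums.reduce(lambda x, y: x + y))      # -> 10            (aggregate with a function)
nums.saveAsTextFile("hdfs:///path/output")  # writes one text part-file per partition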
Example: WordCount in Python
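A minimal sketch of the usual PySpark WordCount; the input and output paths are placeholders, and the details may differ from the code shown on the original slide.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

counts = (sc.textFile("hdfs:///path/to/input.txt")   # placeholder path
            .flatMap(lambda line: line.split())      # split each line into words
            .map(lambda word: (word, 1))             # one (word, 1) pair per word
            .reduceByKey(lambda x, y: x + y))        # sum the counts per word

counts.saveAsTextFile("hdfs:///path/to/output")      # placeholder path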
SparkSession
● A unified entry point for manipulating data with Spark
● From Spark 2.0, SparkSession unifies the different
contexts from different APIs and represents the entry
point into all Spark functionalities
● Already available in Spark shell as variable spark
● Within an application: use the builder to create a basic SparkSession (see the sketch below)
● Only one SparkContext may be active per JVM – stop() the
active SparkContext before creating a new one
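A minimal sketch of building a SparkSession within an application; the application name is illustrative.

from pyspark.sql import SparkSession

# Builder pattern: getOrCreate() reuses an already-active session if there is one
spark = (SparkSession.builder
         .appName("session-sketch")
         .getOrCreate())

sc = spark.sparkContext   # the underlying SparkContext is still accessible

# ... use spark for DataFrames/SQL and sc for the RDD API ...

spark.stop()              # stop() the active context before creating a new one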
SparkSession vs SparkContext