3rd Year

Big Data Analytics


Dr. Nesma Mahmoud
Lecture 8: Spark II
Big Data Analytics (In short)
Goal: Generalizations
A model or summarization of the data.

Data/Workflow Frameworks: Spark, MapReduce, Hadoop File System, Streaming
Analytics and Algorithms: Large-scale Data Mining/ML
What will we learn in this lecture?
01. Why Do We Need Spark?

02. Dive into Spark

03. How Does Spark Work?


03. How Does Spark Work?
Core libraries (modules) of Apache Spark
Spark Core (== Spark)
● Provides basic functionalities (including task scheduling, memory management, fault recovery, and interacting with storage systems) used by the other components
● Provides a data abstraction called resilient distributed dataset (RDD)
○ – Spark Core provides APIs for building and manipulating these collections (RDDs)
● Written in Scala, but with APIs for Java, Python, and R
Spark as a Unified Analytics Engine
● A number of integrated higher-level modules built on top of Spark
○ – Can be combined seamlessly in the same application

● Spark SQL
○ – To work with structured data
○ – Allows querying data via SQL-like queries.
○ – Supports many data sources (Hive tables, Parquet, JSON, …)
○ – Extends Spark RDD API

● Spark Streaming
○ – To process live streams of data
○ – Extends Spark RDD API
Spark as a Unified Analytics Engine
● MLlib
○ – Scalable ML library
○ – Many distributed algorithms: feature extraction, classification,
regression, clustering, recommendation, …

● GraphX
○ – API for manipulating graphs and performing graph-parallel
computations
○ – Also includes common graph algorithms (e.g., PageRank)
○ – Extends Spark RDD API
Spark Architecture
Master/Worker Architecture

Master (Driver Program):
1- Creates the Spark context
2- Runs the main function
3- Converts the user program into tasks and stages

Worker Nodes (DataNodes):
Contain task executors (processes) that
1- Execute tasks (run computations)
2- Read data from HDFS
3- Perform the transformation operations
4- Store data for applications
SparkContext: The Gateway to Apache Spark Applications
Core Components:
- SparkContext: Connection to the Spark cluster; creates RDDs (Resilient Distributed Datasets).
- Cluster Manager: Manages resources and schedules tasks.
- Worker Nodes: Execute tasks via executors.
- Executors: Run tasks on worker nodes.
- RDDs: Immutable, fault-tolerant distributed data structures.
- HDFS: Distributed file system for data storage.

How It Works:
1. Driver Program: User writes a Spark application using Python, Scala, or Java.
2. RDD Creation: SparkContext reads data (e.g., from HDFS) and creates RDDs.
3. Transformations: Operations like map, filter, and flatMap produce new RDDs (existing RDDs are not modified, since they are immutable).
4. Actions: Commands like count and collect trigger execution.
5. Task Scheduling: Cluster Manager schedules tasks on worker nodes.
6. Task Execution: Executors process tasks, interacting with HDFS.
7. Results: Returned to the driver for display or storage.
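As an illustration of this flow, here is a minimal PySpark sketch (not part of the original slides) that follows the steps above: it creates a SparkContext, builds an RDD from an in-memory list instead of HDFS, applies transformations, and triggers execution with actions. The application name and the local[*] master are placeholder values.

```python
from pyspark import SparkConf, SparkContext

# 1. Driver program: configure and connect to a (local) cluster.
#    "local[*]" is a placeholder master; on a real cluster this would
#    point to the cluster manager (e.g., YARN).
conf = SparkConf().setAppName("FlowSketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# 2. RDD creation (here from an in-memory collection instead of HDFS).
numbers = sc.parallelize(range(1, 101))

# 3. Transformations: lazily describe new RDDs (nothing runs yet).
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# 4-6. Actions trigger scheduling and execution on the executors.
print(squares.count())   # number of elements
print(squares.take(5))   # first five results, returned to the driver

# 7. Results were returned to the driver; release resources.
sc.stop()
```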
Spark Architecture
• The main program (called the driver program, or master) talks to the cluster manager, which allocates resources for
• Worker nodes, in which executors run
• Executors are processes that run computations and store data for the application

https://downloads.apache.org/spark/docs/2.0.0-preview/cluster-overview.html
Spark Architecture
● Each application consists of a driver program and executors on the cluster
○ – Driver program: the process which runs the application's main() and creates the SparkContext object
● Each application gets its own executors, which are processes that stay up for the duration of the whole application and run tasks in multiple threads
○ – Isolation of concurrent applications
● To run on a cluster, the SparkContext connects to (communicates with) the cluster manager, which allocates cluster resources on the worker nodes
● Once connected, Spark acquires executors on the cluster nodes and sends the application code (e.g., a jar) to the executors (data locality)
● Finally, the SparkContext sends tasks to the executors to run
Spark Programming Model
Spark on Top of Cluster Managers
● Spark can use many cluster resource managers, which allocate cluster resources to run applications
○ 1. Standalone – Simple cluster manager included with Spark that
makes it easy to set up a cluster (default cluster manager).
○ 2. Hadoop YARN – Resource manager in Hadoop 2
○ 3. Mesos – General cluster manager from AMPLab
○ 4. Kubernetes
Deploy Modes and Cluster Managers
● Spark supports different deploy modes (VM, Docker, Kubernetes) and cluster managers (Standalone, Hadoop YARN, Mesos, Kubernetes), so it can run in different configurations and environments
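As a minimal sketch of how the cluster manager is chosen in practice, the PySpark snippet below sets the master URL in the configuration; the host names and ports are illustrative placeholders and are not taken from the original slides.

```python
from pyspark import SparkConf, SparkContext

# The master URL selects the cluster manager; the hosts/ports below are
# illustrative placeholders, not a real deployment.
conf = SparkConf().setAppName("ClusterManagerSketch")

conf.setMaster("local[4]")                           # no cluster manager: run locally with 4 threads
# conf.setMaster("spark://master-host:7077")         # Spark standalone cluster manager
# conf.setMaster("yarn")                             # Hadoop YARN (config taken from HADOOP_CONF_DIR)
# conf.setMaster("mesos://mesos-host:5050")          # Mesos
# conf.setMaster("k8s://https://k8s-apiserver:6443")  # Kubernetes

sc = SparkContext(conf=conf)
print(sc.master)
sc.stop()
```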
RDD
● RDDs are the key programming abstraction in Spark: a distributed memory abstraction
● Immutable, partitioned and fault-tolerant collection of elements that can be manipulated in parallel
○ – Like a LinkedList<MyObjects>
○ – Stored in main memory across the cluster nodes
■ Each worker node that is used to run an application contains at least one partition of the RDD(s) defined in the application
RDDs: distributed and partitioned
● Stored in the main memory of the executors running on the worker nodes (when possible) or on the node's local disk (if there is not enough main memory)
● Allow the code invoked on them to be executed in parallel
○ – Each executor of a worker node runs the specified code on its partition of the RDD
○ – Partition: an atomic chunk of data (a logical division of the data) and the basic unit of parallelism
○ – Partitions of an RDD can be stored on different cluster nodes
RDDs: immutable and fault-tolerant
● Immutable once constructed
○ – i.e., RDD content cannot be modified
○ – Create new RDD based on existing RDD

● Automatically rebuilt on failure (without replication)


○ – Track lineage information so that missing or lost data can be efficiently recomputed after node failures
○ – For each RDD, Spark knows how it has been constructed and can rebuild it if a failure occurs
○ – This information is represented by means of the RDD lineage DAG connecting the input data and the RDDs
RDDs: API and suitability
● RDD API
○ – Clean language-integrated API for Scala, Python, Java, and R
○ – Can be used interactively from console (Scala and PySpark)
○ – Also higher-level APIs: DataFrames and DataSets

● RDD suitability
○ – Best suited for applications that apply the same operation to all the elements of a dataset (coarse-grained manipulations only)
○ – Provides fine-grained control over the physical distribution of data
○ – Not a good fit for applications that make fine-grained updates to shared state
Spark and RDDs
● Spark (Spark Core) manages the split of RDDs into partitions (atomic units) and allocates RDDs' partitions to the cluster nodes (master (Spark driver) and worker nodes)
● Spark hides the complexity of fault tolerance
○ – RDDs are automatically rebuilt in case of failure using the RDD lineage DAG, which defines the logical execution plan
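A quick way to see the lineage that Spark tracks is RDD.toDebugString(). The short sketch below (assuming the PySpark shell, where sc is already defined) builds a small RDD chain and prints the lineage Spark would use to rebuild it.

```python
# Assumes the PySpark shell, where `sc` is already defined.
words = sc.parallelize(["spark", "rdd", "spark", "dag"])
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Print the lineage DAG Spark would use to rebuild `counts` after a failure.
print(counts.toDebugString().decode())
```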
Directed Acyclic Graph (DAG)
● A Directed Acyclic Graph (DAG) in Spark is a set of vertices and
edges, where vertices represent the RDDs and edges represent
the operations to be applied on RDDs
○ – Generalization of MapReduce model, which has only two
operations (Map and Reduce)
Directed Acyclic Graph (DAG)
● The DAG can be visualized using the Spark Web UI
○ – figure: WordCount DAG
● A stage is a set of operations that does not involve a shuffle of data
● As soon as a shuffle of data is needed (i.e., when a wide transformation is performed), the DAG yields a new stage
Operations in RDD API
● Spark programs are written in terms of operations on RDDs
● Programming model based on parallelizable operators
○ – Higher-order functions that execute user-defined
functions in parallel
● RDDs are created and manipulated through operators; see https://spark.apache.org/docs/latest/rdd-programming-guide.html
● RDDs are created from external data or other RDDs
How to create RDD?
● RDD can be created by:
○ – Parallelizing existing data collections of the hosting programming
language (e.g., collections and lists of Scala, Java, Python, or R)
■ Number of partitions specified by user
■ RDD API: parallelize
○ – From (large) files stored in HDFS or any other file system
■ One partition per HDFS block
■ RDD API: textFile
○ – Transforming an existing RDD
■ Number of partitions depends on transformation type
■ RDD API: transformation operations (map, filter, flatMap)
How to create RDD?
● Turn an existing collection into an RDD

○ – sc is the Spark context variable


○ – Important parameter: number of partitions to cut the dataset into
○ – Spark will run one task for each partition of the cluster (typical
setting: 2-4 partitions for each CPU in the cluster)
○ – Spark tries to set the number of partitions automatically based on
resource availability.
○ – You can also set it manually by passing it as a second parameter
to parallelize, e.g., sc.parallelize(data, 10).
● Load data from storage (local file system, HDFS, or S3)
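A minimal sketch of both creation paths, assuming the PySpark shell where sc is already defined; the input path is a placeholder.

```python
# Assumes the PySpark shell, where `sc` is already defined.

# 1) Parallelize an existing Python collection; the second argument
#    sets the number of partitions explicitly.
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data, 10)
print(dist_data.getNumPartitions())   # 10

# 2) Load a text file. The local path below is a placeholder and must exist
#    before the action runs; an HDFS URI such as hdfs://namenode:9000/path
#    would instead give one partition per HDFS block.
lines = sc.textFile("file:///tmp/input.txt")
print(lines.count())
```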
RDD transformations: map and filter
● map: takes as input a function which is applied to each element of the
RDD and maps each input element to another element

● filter: generates a new RDD by filtering the source dataset using the
specified function
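A short sketch of map and filter, assuming the PySpark shell where sc is already defined.

```python
# Assumes the PySpark shell, where `sc` is already defined.
nums = sc.parallelize([1, 2, 3, 4, 5])

doubled = nums.map(lambda x: x * 2)          # each element -> one output element
evens = nums.filter(lambda x: x % 2 == 0)    # keep only elements matching the predicate

print(doubled.collect())   # [2, 4, 6, 8, 10]
print(evens.collect())     # [2, 4]
```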
RDD transformations: flatMap
● flatMap: takes as input a function which is applied to each element of
the RDD; can map each input item to zero or more output items
(Note: the range function in Python produces an ordered sequence of integer values in [start, end) with a given nonzero step.)
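A short sketch of flatMap, assuming the PySpark shell where sc is already defined; the second RDD uses Python's range, as in the note above.

```python
# Assumes the PySpark shell, where `sc` is already defined.
lines = sc.parallelize(["big data", "spark rdd api"])

words = lines.flatMap(lambda line: line.split(" "))               # 0..n outputs per input element
expanded = sc.parallelize([1, 2, 3]).flatMap(lambda x: range(x))  # each int x -> 0..x-1

print(words.collect())      # ['big', 'data', 'spark', 'rdd', 'api']
print(expanded.collect())   # [0, 0, 1, 0, 1, 2]
```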
RDD transformations: reduceByKey
● reduceByKey: aggregates values with identical key using
the specified function
● Runs several parallel reduce operations, one for each key
in the dataset, where each operation combines values that
have the same key
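A short sketch of reduceByKey, assuming the PySpark shell where sc is already defined.

```python
# Assumes the PySpark shell, where `sc` is already defined.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])

# One parallel reduce per distinct key; values with the same key are combined.
sums = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(sums.collect()))   # [('a', 3), ('b', 4)]
```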
RDD transformations: reduceByKey
● • Let’s visualize the DAG
RDD transformations: join
● join: performs an inner-join on the keys of two
RDDs
● Only keys that are present in both RDDs are output
● Join candidates are independently processed
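A short sketch of join, assuming the PySpark shell where sc is already defined; the keys and values are made-up sample data.

```python
# Assumes the PySpark shell, where `sc` is already defined.
ages = sc.parallelize([("alice", 25), ("bob", 31), ("carol", 40)])
cities = sc.parallelize([("alice", "Cairo"), ("bob", "Alexandria"), ("dave", "Giza")])

# Inner join on the key: only keys present in both RDDs ('alice', 'bob') survive.
joined = ages.join(cities)

print(sorted(joined.collect()))
# [('alice', (25, 'Cairo')), ('bob', (31, 'Alexandria'))]
```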
RDD transformations: join
● • Let’s visualize the DAG
Some RDD actions
● collect: returns all the elements of the RDD as a list

● take: returns an array with the first n elements in the RDD

● count: returns the number of elements in the RDD

● reduce: aggregates the elements in the RDD using the specified function

● saveAsTextFile: writes the elements of the RDD as a text file either to the local file
system or HDFS
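A short sketch of these actions, assuming the PySpark shell where sc is already defined; the output path for saveAsTextFile is a placeholder and must not already exist.

```python
# Assumes the PySpark shell, where `sc` is already defined.
nums = sc.parallelize([5, 3, 8, 1, 9, 2])

print(nums.collect())                    # all elements as a Python list
print(nums.take(3))                      # first 3 elements: [5, 3, 8]
print(nums.count())                      # 6
print(nums.reduce(lambda x, y: x + y))   # 28

# Writes one text file per partition; placeholder path (could also be an HDFS URI).
nums.saveAsTextFile("file:///tmp/nums_output")
```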
Example: WordCount in Python
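The code for this slide did not survive extraction, so here is a hedged standalone WordCount sketch built only from the operations introduced above; the input/output paths and the application name are placeholders.

```python
from pyspark import SparkConf, SparkContext

# Placeholder paths; on a cluster these would typically be HDFS URIs.
INPUT_PATH = "file:///tmp/input.txt"
OUTPUT_PATH = "file:///tmp/wordcount_output"

conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

counts = (sc.textFile(INPUT_PATH)                   # one element per line
          .flatMap(lambda line: line.split(" "))    # split lines into words
          .map(lambda word: (word, 1))              # (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))         # sum counts per word (wide -> new stage)

print(counts.take(10))               # action: triggers the whole DAG
counts.saveAsTextFile(OUTPUT_PATH)   # placeholder output directory

sc.stop()
```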
SparkSession
● A unified entry point for manipulating data with Spark
● From Spark 2.0, SparkSession unifies the different
contexts from different APIs and represents the entry
point into all Spark functionalities
● Already available in the Spark shell as the variable spark
● Within an application: use the builder to create a basic SparkSession (see the sketch after this list)
● Only one SparkContext may be active per JVM – stop() the
active SparkContext before creating a new one
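A minimal sketch of creating a SparkSession with the builder inside an application; the application name, master, and sample data are placeholders.

```python
from pyspark.sql import SparkSession

# Build (or reuse) the single active session for this JVM.
spark = (SparkSession.builder
         .appName("SessionSketch")
         .master("local[*]")      # placeholder master
         .getOrCreate())

df = spark.createDataFrame([(1, "spark"), (2, "rdd")], ["id", "word"])
df.show()

spark.stop()   # stop before creating a new session/context
```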
SparkSession vs SparkContext

Spark 1.x:
• SparkContext was the primary entry point for Spark applications.

Spark 2.x and later:
• SparkSession replaced SparkContext as the preferred entry point starting from Spark 2.0.
• It combines functionality from SparkContext, SQLContext, and HiveContext, offering a unified entry point to work with Spark's features, including DataFrames, Datasets, and SQL queries.
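A hedged sketch of the unified entry point: the SparkSession gives access to the underlying SparkContext (RDD API) as well as the DataFrame and SQL functionality formerly exposed by SQLContext/HiveContext. Names and data below are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnifiedEntryPoint").master("local[*]").getOrCreate()

# The older entry point is still reachable through the session:
sc = spark.sparkContext                 # underlying SparkContext (RDD API)
print(sc.parallelize([1, 2, 3]).count())

# SQL / DataFrame functionality formerly in SQLContext/HiveContext:
spark.range(3).createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) AS n FROM t").show()

spark.stop()
```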
Try Spark?
● https://colab.research.google.com/drive/115k5smTNJllpJrv3fn16Z3vg6SQWnyLC?usp=sharing#scrollTo=dhzk3GE6S9RC
Thanks!
Do you have any questions?

CREDITS: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik