ECS640U/ECS765P Big Data Processing
Introduction to Spark
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Weeks 2-3: Apache Hadoop
[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]
During weeks 2 to 3, we covered MapReduce and Apache Hadoop, our first Big Data solution,
consisting of:
• Processing capabilities (MapReduce)
• Storage system (HDFS) + Scheduler (YARN)
Weeks 4-5: Processing
Weeks 4-5 will cover further Big Data processing technologies:
• Weeks 4 and 5: Apache Spark and Spark Programming
Big Data Processing: Week 4
Topic List:
● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Complex algorithms in MapReduce
[Diagram: a complex algorithm expressed as a chain of MapReduce jobs. Job 1: Input Data → Map → Reduce → Output Data 1. Job 2: Data 1 → Map → Reduce → Output Data 2. … Job n: Data n-1 → Map → Reduce → Output Data n.]
Example: the K-means algorithm
[Figure: 2-D scatter plots of the sample data and prototypes]
K-means defines K prototypes/centroids and iteratively executes two steps until a stop criterion is met:
• Step 1: Samples are assigned to the closest prototype. This results in K clusters consisting of all the
samples assigned to the same prototype.
• Step 2: The K prototypes are updated. They are obtained as the centroid of each new cluster.
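A minimal sketch of these two steps in plain Python (the sample points, K = 2 and the fixed number of iterations are made up for illustration):

# Illustrative K-means sketch: made-up 2-D samples and K = 2 initial prototypes
samples = [(0.5, 0.5), (0.7, 0.9), (2.5, 0.4), (2.8, 0.8)]
prototypes = [(0.0, 0.0), (3.0, 1.0)]

def closest(point, prototypes):
    # index of the nearest prototype (squared Euclidean distance)
    return min(range(len(prototypes)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, prototypes[k])))

for _ in range(10):  # fixed number of iterations instead of a real stop criterion
    # Step 1: assign every sample to its closest prototype
    clusters = {k: [] for k in range(len(prototypes))}
    for s in samples:
        clusters[closest(s, prototypes)].append(s)
    # Step 2: recompute each prototype as the centroid of its cluster
    prototypes = [tuple(sum(dim) / len(dim) for dim in zip(*pts)) if pts else prototypes[k]
                  for k, pts in clusters.items()]

print(prototypes)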
[Figure: successive K-means iterations on the sample data, alternating cluster assignments (Step 1) and prototype updates (Step 2)]
Example: the K-means algorithm
[Diagram: K-means as iterative MapReduce. Each iteration is a MapReduce job that reads the data and the current prototypes (Prototype 0, Prototype 1, …) and produces the updated prototypes (Prototype 1, Prototype 2, …); convergence is checked before the next job is launched.]
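As an illustration only (plain Python, not the Hadoop API), one K-means iteration from the diagram above can be phrased as a map step and a reduce step; the function names are made up:

from collections import defaultdict

def kmeans_map(point, prototypes):
    # Map: emit (index of the closest prototype, point)
    distances = [sum((p - c) ** 2 for p, c in zip(point, proto)) for proto in prototypes]
    return (distances.index(min(distances)), point)

def kmeans_reduce(points):
    # Reduce: the new prototype is the centroid of the points assigned to this key
    return tuple(sum(dim) / len(dim) for dim in zip(*points))

def kmeans_iteration(samples, prototypes):
    groups = defaultdict(list)
    for point in samples:                       # map phase
        key, value = kmeans_map(point, prototypes)
        groups[key].append(value)               # shuffle: group by prototype index
    # reduce phase (empty clusters are simply dropped in this sketch)
    return [kmeans_reduce(pts) for _, pts in sorted(groups.items())]

In Hadoop, each such iteration runs as a separate MapReduce job, re-reading the data from storage every time; this overhead motivates the in-memory approach discussed next.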
Iterative MapReduce performance
● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Hadoop/MapReduce limitations
In-memory storage (RAM) provides much faster data access than on-disk storage and offers the additional
flexibility needed in many Big Data scenarios.
In-memory processing is suitable when new data arrives at a fast pace (streams), for real-time analytics
and exploratory tasks, when iterative access is required, or when multiple jobs need the same data.
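A minimal PySpark sketch of the "multiple jobs over the same data" case (the path and thresholds are placeholders): the RDD is cached in memory once and each subsequent job reuses it instead of re-reading the file.

import pyspark

sc = pyspark.SparkContext()
points = (sc.textFile("/data/points.txt")                       # placeholder path
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())                                           # keep the parsed RDD in memory

for t in [0.5, 1.0, 1.5, 2.0]:
    # Each count() is a separate job, but it is served from the cached RDD
    print(t, points.filter(lambda p: p[0] > t).count())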
The Spark project originated at the AMPLab at UC Berkeley, and is one of the most active Apache Open-
Source projects, currently led by Databricks.
Logistic Regression
[Figure: Logistic Regression running time, significantly faster on Spark than on Hadoop]
Spark’s basic architecture
Spark manages and coordinates the execution of tasks on data distributed across a cluster:
• The cluster manager (Spark’s standalone cluster manager, YARN, which is mainly suited for Hadoop, or
Mesos, which is more generic) keeps track of the available resources.
• Spark Applications are submitted to the manager, which will grant resources to complete them.
There can be multiple Spark Applications running on a cluster at the same time.
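As an illustration (the master URLs below are placeholders), an application tells Spark which cluster manager to request resources from when it creates its context:

import pyspark

conf = (pyspark.SparkConf()
        .setAppName("my-application")
        .setMaster("local[*]"))   # or e.g. "yarn", "spark://host:7077", "mesos://host:5050"
sc = pyspark.SparkContext(conf=conf)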
Resilient Distributed Dataset (RDD)
RDDs are created by reading data from an external storage system (for instance HDFS, HBase, Amazon S3,
Cassandra, …) or from an existing collection in the driver program.
All Spark code compiles down to operations on an abstract representation, namely RDDs. However, even
though Spark allows you to write low-level applications that explicitly operate on RDDs, it is more common
to use high-level distributed collections (Datasets, DataFrames or SQL tables).
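A minimal sketch of the two ways of creating RDDs mentioned above (the HDFS path is a placeholder):

import pyspark

sc = pyspark.SparkContext()

# 1) From an external storage system, e.g. a text file on HDFS
lines = sc.textFile("hdfs:///some/input/path")

# 2) From an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])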
Spark operations
Transformations:
• Lazy operations to build RDDs from other RDDs
• Executed in parallel (similar to map and shuffle in MapReduce)
Actions:
• View data in console
• Collect data
• Write to output data sources
Spark operations
Transformations (define a new RDD from an existing one):
map, filter, sample, union, groupByKey, reduceByKey, join, persist, …
Actions (take an RDD and return a result to the driver / HDFS):
collect, reduce, count, saveAsTextFile, lookupKey, forEach, …
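A short illustrative sketch of the difference, using operations from the lists above: transformations only describe new RDDs, and nothing runs until an action is called.

import pyspark

sc = pyspark.SparkContext()
numbers = sc.parallelize(range(10))

evens   = numbers.filter(lambda x: x % 2 == 0)   # transformation: lazily defines a new RDD
doubled = evens.map(lambda x: x * 2)             # transformation: still nothing has executed
result  = doubled.collect()                      # action: the job runs, results return to the driver
print(result)                                    # [0, 4, 8, 12, 16]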
Execution plan and lazy evaluation
Given a Spark Application, Spark creates an optimised execution plan that is represented as a Directed
Acyclic Graph (DAG) of transformations that can be executed in parallel across workers on the cluster.
[Diagram: input Data flows through Transf 1 and Transf 2, producing RDDs 1-3 arranged as a DAG; an Action then produces the output Data]
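An illustrative way to inspect the plan Spark builds (paths are placeholders): toDebugString() shows the lineage of RDDs that an action would execute.

import pyspark

sc = pyspark.SparkContext()
rdd1 = sc.textFile("/some/input/path")            # RDD 1
rdd2 = rdd1.flatMap(lambda l: l.split(" "))       # RDD 2 (Transf 1)
rdd3 = rdd2.map(lambda w: (w, 1))                 # RDD 3 (Transf 2)

print(rdd3.toDebugString())                       # the lineage/DAG; no job has run yet
rdd3.saveAsTextFile("/some/output/path")          # action: the DAG is now executed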
Big Data Processing: Week 4
Topic List:
● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Spark code can be written using different programming languages. This is made possible by Spark APIs:
• Scala (Spark’s default language)
• Java
• Python
• SQL
• R
Code written in these languages includes Spark’s core operations and is translated into Spark’s low-level API
(RDDs), which is executed by the workers across the cluster.
In addition, Spark offers interactive shells that can be used for prototyping and ad-hoc data analysis,
namely pyspark (Python), spark-shell (Scala), spark-sql (SQL) and sparkR (R).
Low-level APIs: RDD
Structured APIs add simplicity, expressiveness and efficiency to Spark, as they make it possible to express
computations as common data analysis patterns.
Structured APIs include:
• DataFrames
• Datasets
• SQL Tables
DataFrames are conceptually equivalent to tables in relational databases or data frames in R or Python:
• Represent data as immutable, distributed, in-memory tables.
• Have a schema that defines column names and associated data types.
• Can be constructed from a variety of data sources such as structured data files, tables in Hive,
databases, RDDs…
• The DataFrame API is available in Scala, Java, Python and R.
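A minimal sketch of building a DataFrame with an explicit schema (the data, column names and types are made up; recent Spark versions accept a DDL-style schema string):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],           # rows
    schema="name STRING, age INT")          # schema: column names and data types
df.printSchema()
df.show()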
SparkSession
A session in computer systems is defined as an interaction between two entities. In order to use Spark’s
functionality to manipulate data, it is necessary to initiate a session with Spark.
SparkSession:
• Single, unified entry-point to access all the functionalities offered in Spark
• Encapsulates a diversity of entry-points for different functionalities: SparkContext, SQLContext, …
• Driver process that manages an application and allows Spark to execute user-defined manipulations
In a standalone Spark application, it is necessary to create the SparkSession object in the application
code. Spark’s Language APIs allow you to create a SparkSession.
When using an interactive shell, the SparkSession is created automatically and accessible via the variable
spark.
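A minimal sketch of creating the SparkSession in a standalone application (the application name and master URL are placeholders); in pyspark or spark-shell this object already exists as the variable spark:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-application")
         .master("local[*]")        # or the URL of your cluster manager
         .getOrCreate())

sc = spark.sparkContext            # the SparkContext encapsulated by the session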
Word Count in Spark (Python)
This is an example of using the low-level API (RDDs):

import pyspark
sc = pyspark.SparkContext()

# Read the input; lines is an RDD
lines = sc.textFile("/input/path")

# Transformations; words and counts are also RDDs
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Action: store the results
counts.saveAsTextFile("/output/path")
RDD I/O
[Diagram: RDD lineage for Word Count. sc.textFile() reads HDFS://inputpath into lines; flatMap( l: l.split() ), map and reduceByKey produce words and counts; counts is written to HDFS://outputpath with saveAsTextFile(). The diagram also relates these stages to MAP and REDUCE in MapReduce terms.]
Big Data Processing: Week 4
Topic List:
● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Overview of Spark’s Toolset
[Diagram: Spark’s toolset layers: the Structured APIs built on top of the Low-level APIs, with higher-level components such as Structured Streaming and MLlib on top]
Structured Streaming
• Follows a micro-batch approach (accumulates small batches of input data and then processes them in
parallel)
• A stream of data is treated as a table to which data is appended continuously
• No need to change your code to do batch or stream processing
• It is available in all the environments: Scala, Java, Python, R and SQL
• Has native support for event-time data
• Makes it easy to build Spark end-to-end applications that combine streaming, batch and interactive
queries
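A hedged sketch along the lines of the standard streaming word-count example (host and port are placeholders): the socket stream is treated as a continuously growing table, and the same DataFrame code would also work on a static batch table.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Treat lines arriving on a socket as a continuously growing table
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same transformations we would write for a static DataFrame
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts for each micro-batch
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()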
Advanced analytics: MLlib
MLlib is Spark’s high-level API for machine learning:
• It allows you to pre-process data, train models and deploy them to make predictions
• Models trained with MLlib can be deployed in Structured Streaming
• Includes machine learning algorithms for classification, regression, recommendation systems and
clustering, among others, as well as statistics and linear algebra utilities
• Interoperates with NumPy in Python and with R libraries
from pyspark.ml.classification import LogisticRegression
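Continuing the import above, a minimal illustrative sketch of training and applying a classifier (the tiny dataset, column names and parameters are made up):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# A toy training set: (label, feature vector)
training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0])),
     (1.0, Vectors.dense([0.5, 1.3]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)   # estimator
model = lr.fit(training)                             # train the model
model.transform(training).select("label", "prediction").show()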
● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset