
ECS640U/ECS765P Big Data Processing

Introduction to Spark
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science

Credit: Joseph Doyle, Jesus Carrion, Felix Cuadrado, …


The Big Data Pipeline
So where are we?

[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]
Weeks 2-3: Apache Hadoop

[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]

During weeks 2 to 4, we covered MapReduce and Apache Hadoop, our first Big Data solution
consisting of:
• Processing capabilities (MapReduce)
• Storage system (HDFS) + Scheduler (YARN)
Weeks 4-5: Processing

[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]

Weeks 4 and 5 will cover further Big Data processing technologies:
• Weeks 4 and 5: Apache Spark and Spark Programming
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Complex algorithms in MapReduce

MapReduce might not be directly applicable to scenarios where:


• More than one Map and Reduce task is needed (MapReduce defines one Map and one Reduce task)
• There are several processing stages, where each stage depends on the results from previous stages
(MapReduce allows data parallelism only)

MapReduce can still be used by:


• Chaining MapReduce jobs sequentially, so that the output of one job becomes the input of the next job.
This chaining needs to be implemented manually or programmatically, as sketched below.
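
As a rough illustration of such chaining, the Python driver below submits two Hadoop Streaming jobs in sequence, feeding the output directory of the first job into the second. The streaming jar location, the mapper/reducer script names and the HDFS paths are hypothetical placeholders; this is a minimal sketch of manual job chaining, not a prescribed recipe.

import subprocess

# Hypothetical location of the Hadoop Streaming jar; adjust to your cluster layout
STREAMING_JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"

def run_job(mapper, reducer, input_path, output_path):
    # Submit one MapReduce job and block until it finishes
    subprocess.run([
        "hadoop", "jar", STREAMING_JAR,
        "-files", f"{mapper},{reducer}",   # ship the scripts to the cluster
        "-mapper", f"python3 {mapper}",
        "-reducer", f"python3 {reducer}",
        "-input", input_path,
        "-output", output_path,
    ], check=True)

# The second job reads the directory written by the first one
run_job("map1.py", "reduce1.py", "/data/input", "/data/stage1")
run_job("map2.py", "reduce2.py", "/data/stage1", "/data/output")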
Iterative MapReduce

[Diagram: a chain of n MapReduce jobs. Job 1: Input Data → Map → Reduce → Output Data 1. Job 2: Output Data 1 → Map → Reduce → Output Data 2. … Job n: Output Data n-1 → Map → Reduce → Output Data n.]
Example: the K-means algorithm

• Clustering is a family of unsupervised machine learning techniques used in many scientific, engineering and business applications (patient risk stratification, market segmentation, …)
• K-means is a clustering algorithm that groups data samples into clusters, each represented by a prototype/centroid (covered in detail in the ECS706U/ECS766P Data Mining and ECS708P Machine Learning modules)

[Figure: scatter plots of example data samples, before and after grouping into clusters.]
Example: the K-means algorithm

K-means defines K prototypes/centroids and executes iteratively two steps until a stop criterion is met:

• Step 1: Samples are assigned to the closest prototype. This results in K clusters consisting of all the
samples assigned to the same prototype.
• Step 2: The K prototypes are updated. They are obtained as the centroid of each new cluster.
Example: the K-means algorithm

[Figure: scatter plots illustrating successive K-means iterations of assignment (Step 1) and prototype update (Step 2).]
Example: the K-means algorithm

K-means can be implemented in MapReduce as follows:

1. Select K random locations for the prototypes/centroids


2. Create an input file containing the initial prototypes and the data
3. Mapper setup: read the prototype file and store it in a data structure
4. Run Map: emit [nearest centroid, data sample], where the centroid is the key
5. Run Reducer: calculate the new prototypes as the cluster centroids and emit them
6. In the job configuration: if the difference between the old and new prototypes is zero, convergence is reached; otherwise go to 2 and repeat using the new prototypes
K-means MapReduce Pseudocode

centroids = k points sampled at random from the dataset

do:
    Mapper:
        - Given a point and the set of centroids.
        - Calculate the distance between the point and each centroid.
        - Emit the point and the closest centroid.
    Reducer:
        - Given a centroid and the points belonging to its cluster.
        - Calculate the new centroid as the arithmetic mean position of the points.
        - Emit the new centroid (collected into new_centroids).
    prev_centroids = centroids
    centroids = new_centroids
while |prev_centroids - centroids| > threshold
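
To make the loop above concrete, here is a minimal local Python/NumPy sketch of the same logic: the "map" step assigns each point to its nearest centroid and the "reduce" step averages the points of each cluster. The random data, the choice of k = 3 and the convergence threshold are illustrative assumptions, not part of the lecture material.

import numpy as np

def kmeans_iteration(points, centroids):
    # "Map" step: pair each point with the index of its nearest centroid
    assignments = [(int(np.argmin([np.linalg.norm(p - c) for c in centroids])), p)
                   for p in points]
    # "Reduce" step: recompute each centroid as the mean of its assigned points
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        cluster = [p for idx, p in assignments if idx == k]
        if cluster:
            new_centroids[k] = np.mean(cluster, axis=0)
    return new_centroids

# Driver loop: iterate until the centroids stop moving (the convergence check)
points = np.random.rand(100, 2)
centroids = points[np.random.choice(len(points), 3, replace=False)]
threshold = 1e-4
while True:
    new_centroids = kmeans_iteration(points, centroids)
    if np.linalg.norm(new_centroids - centroids) <= threshold:
        break
    centroids = new_centroids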
Example: the K-means algorithm

[Diagram: chained MapReduce jobs for K-means. Iteration 1: Data + Prototype 0 → Map → Reduce → Prototype 1 → check convergence. Iteration 2: Data + Prototype 1 → Map → Reduce → Prototype 2 → check convergence, and so on.]
Iterative MapReduce performance

In general, in an iterative implementation of MapReduce:

• Every MapReduce job is independent (no shared state)


• Data has to be transferred from Mappers to Reducers
• Data has to be loaded from disk on every iteration
• Results are saved to HDFS, with multiple replications

These repeated disk and network costs are the performance killer of iterative MapReduce.
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Hadoop/MapReduce limitations

Hadoop is a batch processing framework:


• Designed to process very large datasets
• Efficient at processing the Map stage: the data is already distributed
• Inefficient in I/O and communications: data must be loaded from and written to HDFS; shuffle and sort incur long latencies and produce heavy network traffic
• Job start-up and finish take seconds, regardless of the size of the dataset

MapReduce is not a good fit for every problem:


• Rigid structure: Map, Shuffle/Sort, Reduce
• No native support for iterations
• One synchronization barrier
Note: Hadoop was designed at a time when memory was expensive.
In-memory processing

In-memory storage (RAM) provides much faster data access than on-disk storage and offers the additional flexibility needed in many Big Data scenarios.

In an in-memory processing approach:


• Data is loaded in memory before computation
• Kept in memory during successive steps

In-memory processing is suitable when new data arrives at a fast pace (streams), for real-time analytics
and exploratory tasks, when iterative access is required, or multiple jobs need the same data.

Main initiatives using in-memory processing:


• Databases: Redis, Memcached
• Graph centric: Pregel
• General purpose: Spark, Flink
Spark

The Spark project originated at the AMPLab at UC Berkeley, and is one of the most active Apache Open-
Source projects, currently led by Databricks.

Spark’s main features include:


• Data flow programming model operating on distributed collections of records
• Collections are kept in-memory
• Support for iterations and interactive queries
• Retains the attractive properties of MapReduce (no explicit references to parallelism in the programming logic, fault tolerance, data locality, scalability)

• Significantly faster than Hadoop on iterative workloads such as logistic regression
Spark’s basic architecture

Spark manages and coordinates the execution of tasks on data distributed across a cluster:
• The cluster manager (Spark’s standalone cluster manager, YARN, which is mainly suited to Hadoop clusters, or Mesos, which is more generic) keeps track of the available resources.
• Spark Applications are submitted to the manager, which will grant resources to complete them.

A Spark Application consists of:


• Driver process: runs the main() function on a node in the cluster. It maintains all information during the lifetime of the application, responds to user input, and schedules tasks across the executors.
• Executor processes: carry out the tasks assigned by the driver and report the state of the computation back to the driver.

There can be multiple Spark Applications running on a cluster at the same time.
Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) represents a partitioned collection of records:

• Fault tolerant (can be rebuilt if a partition is lost)


• Immutable (can be transformed into new RDDs, but not edited)
• Parallelisable (can be operated on in parallel when distributed across a cluster)

RDDs are created by reading data from an external storage system (for instance HDFS, HBase, Amazon S3, Cassandra, …) or from an existing collection in the driver program.

All Spark code compiles down to operations on an abstract representation, namely RDDs. However, even though Spark allows you to write low-level applications that explicitly operate on RDDs, it is more common to use high-level distributed collections (Datasets, DataFrames or SQL tables).
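
As a minimal sketch of the two creation routes just described (the application name and the HDFS path are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation-sketch")

# From an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# From an external storage system (hypothetical HDFS path)
logs = sc.textFile("hdfs:///data/logs/sample.txt")

print(numbers.count())   # an action; triggers the computation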
Spark operations

Spark defines two types of operations: transformations and actions.

Transformations:
• Lazy operations to build RDDs from other RDDs
• Executed in parallel (similar to map and shuffle in MapReduce)

Actions:
• View data in console
• Collect data
• Write to output data sources
Spark operations

Transformations (define a new RDD from an existing one): map, filter, sample, union, groupByKey, reduceByKey, join, persist

Actions (take an RDD and return a result to the driver or to HDFS): collect, reduce, count, saveAsTextFile, lookupKey, forEach, …
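
A small sketch of the distinction between transformations and actions (the sample log lines are made up for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="transformations-vs-actions")
rdd = sc.parallelize(["ERROR disk full", "INFO ok", "ERROR timeout"])

errors = rdd.filter(lambda line: line.startswith("ERROR"))   # transformation (lazy)
pairs = errors.map(lambda line: (line.split()[1], 1))        # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)               # transformation (lazy)

print(counts.collect())   # action: triggers the whole chain and returns results to the driver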

Execution plan and lazy evaluation

Given a Spark Application, Spark creates an optimised execution plan that is represented as a Directed
Acyclic Graph (DAG) of transformations that can be executed in parallel across workers on the cluster.

Operations are evaluated lazily in Spark:


• Transformations are only executed when they are needed
• Only the invocation of an action will trigger the execution chain
• Enables building the actual execution plan to optimise the data flow
Execution plan and lazy evaluation

[Diagram: Input Data → Transf 1 → RDD 1 → Transf 2 → RDD 2 → Action → RDD 3 → Output Data; each RDD is shown split into partitions.]
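
A brief illustration of lazy evaluation, using made-up sentences: defining the transformations only records the lineage of RDDs, and nothing runs until the count() action is invoked.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-evaluation-sketch")

lines = sc.parallelize(["spark is lazy", "hadoop writes to disk", "spark caches in memory"])
spark_lines = lines.filter(lambda l: "spark" in l)        # no execution yet
words = spark_lines.flatMap(lambda l: l.split(" "))       # still no execution

print(words.toDebugString().decode())   # shows the lineage (execution plan) built so far
print(words.count())                    # the action triggers the chain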
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset

Break and Quiz


Spark’s language APIs

Spark code can be written using different programming languages. This is made possible by Spark APIs:
• Scala (Spark’s default language)
• Java
• Python
• SQL
• R

Code written in these languages includes Spark’s core operations and is translated into Spark’s low-level API (RDDs), which is executed by the workers across the cluster.

In addition, Spark offers interactive shells that can be used for prototyping and ad-hoc data analysis,
namely pyspark (Python), spark-shell (Scala), spark-sql (SQL) and sparkR (R).
Low-level APIs: RDD

All Spark code compiles down to an RDD.


Spark offers APIs to write applications that operate on RDDs directly, called low-level APIs. In addition to
RDDs, low-level APIs also allow you to distribute and manipulate shared distributed variables.

Using low-level APIs is however uncommon and is only recommended when:


• Control over physical data across a cluster is needed
• Some very specific functionality is needed
• There is legacy code using RDDs

The recommended approach is to use high-level APIs.
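
For completeness, a small sketch of the shared distributed variables mentioned above, i.e. a broadcast variable and an accumulator; the lookup table and sample data are made up for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="low-level-api-sketch")

# Broadcast variable: a read-only lookup table shipped once to every executor
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: a counter that executors can only add to and the driver can read
bad_records = sc.accumulator(0)

def score(key):
    if key not in lookup.value:
        bad_records.add(1)
        return 0
    return lookup.value[key]

rdd = sc.parallelize(["a", "b", "x", "a"])
print(rdd.map(score).collect())   # the action triggers execution on the executors
print(bad_records.value)          # 1 (the unknown key "x")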


High-level data abstractions: Structured APIs

Structured APIs add simplicity, expressiveness and efficiency to Spark, as they make it possible to express computations as common data analysis patterns.
Structured APIs include:
• DataFrames
• Datasets
• SQL Tables

DataFrames are conceptually equivalent to tables in relational databases or data frames in R or Python:
• Represent data as immutable, distributed, in-memory tables.
• Have a schema that defines column names and associated data types.
• Can be constructed from a variety of data sources such as structured data files, tables in Hive,
databases, RDDs…
• The DataFrame API is available in Scala, Java, Python and R.
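
A minimal sketch of creating and querying a DataFrame; the in-memory rows and column names are illustrative only, and the session is created with SparkSession.builder, covered on the next slide.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Hypothetical in-memory data; in practice the source is often a file, a Hive table, a database, ...
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],   # schema: column names (types are inferred here)
)

df.printSchema()                # shows the schema: name (string), age (long)
df.filter(df.age > 30).show()   # DataFrame operations are also evaluated lazily until an action such as show()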
SparkSession

A session in computer systems is defined as an interaction between two entities. In order to use Spark’s functionality to manipulate data, it is necessary to initiate a session with Spark.

SparkSession:
• Single, unified entry-point to access all the functionalities offered in Spark
• Encapsulates a diversity of entry-points for different functionalities: SparkContext, SQLContext, …
• Driver process that manages an application and allows Spark to execute user-defined manipulations

In a standalone Spark application, it is necessary to create the SparkSession object in the application
code. Spark’s Language APIs allow you to create a SparkSession.

When using an interactive shell, the SparkSession is created automatically and accessible via the variable
spark.
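
A minimal sketch of creating a SparkSession in a standalone application; the application name and the local master URL are illustrative choices.

from pyspark.sql import SparkSession

# Create (or reuse) the single entry point to Spark's functionality
spark = (SparkSession.builder
         .appName("my-application")
         .master("local[*]")   # run locally on all cores; on a cluster this is normally set at submission time
         .getOrCreate())

# The older entry points remain reachable through the session
sc = spark.sparkContext

spark.stop()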
The word count below is an example of using the low-level (RDD) API.
Word Count in Spark (Python)

import pyspark
sc = pyspark.SparkContext()

# Ingest and preprocess the input data; sc is the SparkContext
lines = sc.textFile("/input/path")

# Transformations; note that lines, words and counts are all RDDs
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Action: store the results
counts.saveAsTextFile("/output/path")
Word Count in Spark: Data Flow

[Data flow diagram, shown over three slides: HDFS://inputpath → sc.textFile() → lines → flatMap(l: l.split()) → words → map(w: (w, 1)) → reduceByKey((a, b): a + b) → counts → saveAsTextFile() → HDFS://outputpath. Legend: RDD, I/O, Transformation/Action. The final slide relates the flatMap/map stages to the MAP phase and reduceByKey to the REDUCE phase of MapReduce.]
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Overview of Spark’s Toolset

[Diagram of Spark’s toolset: Structured Streaming, Advanced Analytics and the Third-Party Ecosystem sit on top of the Structured APIs (Datasets, DataFrames, SQL), which in turn sit on top of the Low-level APIs (RDDs and distributed variables).]

Structured Streaming

Structured Streaming is Spark’s high-level API for stream processing:

• Follows a micro-batch approach (accumulates small batches of input data and then processes them in
parallel)
• A stream of data is treated as a table to which data is appended continuously
• No need to change your code to do batch or stream processing
• It is available in all the environments: Scala, Java, Python, R and SQL
• Has native support for event-time data
• Makes it easy to build end-to-end Spark applications that combine streaming, batch and interactive queries
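
A minimal sketch of the micro-batch model described above: the canonical streaming word count, which treats a socket stream as an unbounded table. The host and port are illustrative; locally, such a stream can be fed with a tool like netcat.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Treat a socket stream as a table to which lines are appended continuously
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame operations as in batch code
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Start the micro-batch query and print each updated result to the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()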
Advanced analytics: MLlib
MLlib is Spark’s high-level API for machine learning:
• It allows pre-processing data, training models and deploying them to make predictions
• Models trained with MLlib can be deployed in Structured Streaming
• Includes machine learning algorithms for classification, regression, recommendation systems and clustering, among others, as well as statistics and linear algebra utilities
• Interoperates with NumPy in Python and with R libraries
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)
https://spark.apache.org/docs/latest/ml-classification-regression.html
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset

Next Week – More on Spark Programming
