
ECS640U/ECS765P Big Data Processing

Introduction to Spark
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science

Credit: Joseph Doyle, Jesus Carrion, Felix Cuadrado, …


The Big Data Pipeline
So where are we?

[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]
Weeks 2-3: Apache Hadoop

[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]

During weeks 2 to 4, we covered MapReduce and Apache Hadoop, our first Big Data solution
consisting of:
• Processing capabilities (MapReduce)
• Storage system (HDFS) + Scheduler (YARN)
Weeks 4-5: Processing

[Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output]

Weeks 4 and 5 will cover further Big Data processing technologies:
• Weeks 4 and 5: Apache Spark and Spark Programming
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Complex algorithms in MapReduce

MapReduce might not be directly applicable to scenarios where:


• More than one Map and Reduce task is needed (MapReduce defines one Map and one Reduce task)
• There are several processing stages, where each stage depends on the results from previous stages
(MapReduce allows data parallelism only)

MapReduce can still be used by:


• Chaining MapReduce jobs sequentially, so that the output of one job becomes the input of the next job.
This chaining needs to be implemented manually or programmatically, as sketched below.
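
As a rough illustration of such chaining, the Python driver below submits two Hadoop Streaming jobs in sequence, feeding the output directory of the first job into the second. The streaming jar location, the mapper/reducer script names and the HDFS paths are hypothetical placeholders; this is a minimal sketch of manual job chaining, not a prescribed recipe.

import subprocess

# Hypothetical location of the Hadoop Streaming jar; adjust to your cluster layout
STREAMING_JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"

def run_job(mapper, reducer, input_path, output_path):
    # Submit one MapReduce job and block until it finishes
    subprocess.run([
        "hadoop", "jar", STREAMING_JAR,
        "-files", f"{mapper},{reducer}",   # ship the scripts to the cluster
        "-mapper", f"python3 {mapper}",
        "-reducer", f"python3 {reducer}",
        "-input", input_path,
        "-output", output_path,
    ], check=True)

# The second job reads the directory written by the first one
run_job("map1.py", "reduce1.py", "/data/input", "/data/stage1")
run_job("map2.py", "reduce2.py", "/data/stage1", "/data/output")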
Iterative MapReduce

[Diagram: a chain of n MapReduce jobs. Job 1: Input Data → Map → Reduce → Output Data 1. Job 2: Output Data 1 → Map → Reduce → Output Data 2. … Job n: Output Data n-1 → Map → Reduce → Output Data n.]
Example: the K-means algorithm

• Clustering is a family of unsupervised machine learning techniques used in many scientific, engineering and business applications (patient risk stratification, market segmentation, …)
• K-means is a clustering algorithm that groups data samples into clusters, each represented by a prototype/centroid (covered in detail in the ECS706U/ECS766P Data Mining and ECS708P Machine Learning modules)

[Figure: scatter plots of example data samples, before and after grouping into clusters.]
Example: the K-means algorithm

K-means defines K prototypes/centroids and executes iteratively two steps until a stop criterion is met:

• Step 1: Samples are assigned to the closest prototype. This results in K clusters consisting of all the
samples assigned to the same prototype.
• Step 2: The K prototypes are updated. They are obtained as the centroid of each new cluster.
Example: the K-means algorithm

[Figure: scatter plots illustrating successive K-means iterations of assignment (Step 1) and prototype update (Step 2).]
Example: the K-means algorithm

K-means can be implemented in MapReduce as follows:

1. Select K random locations for the prototypes/centroids


2. Create an input file containing the initial prototypes and the data
3. Mapper setup: read the prototype file and store it in a data structure
4. Run Map: emit [nearest centroid, data sample], where the centroid is the key
5. Run Reducer: calculate the new prototypes as the cluster centroids and emit them
6. In the job configuration: if the difference between the old and new prototypes is zero, convergence is reached; otherwise go to 2 and repeat using the new prototypes
K-means MapReduce Pseudocode

centroids = k points sampled at random from the dataset

do:
    Mapper:
        - Given a point and the set of centroids.
        - Calculate the distance between the point and each centroid.
        - Emit the point and the closest centroid.
    Reducer:
        - Given a centroid and the points belonging to its cluster.
        - Calculate the new centroid as the arithmetic mean position of the points.
        - Emit the new centroid (collected into new_centroids).
    prev_centroids = centroids
    centroids = new_centroids
while |prev_centroids - centroids| > threshold
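
To make the loop above concrete, here is a minimal local Python/NumPy sketch of the same logic: the "map" step assigns each point to its nearest centroid and the "reduce" step averages the points of each cluster. The random data, the choice of k = 3 and the convergence threshold are illustrative assumptions, not part of the lecture material.

import numpy as np

def kmeans_iteration(points, centroids):
    # "Map" step: pair each point with the index of its nearest centroid
    assignments = [(int(np.argmin([np.linalg.norm(p - c) for c in centroids])), p)
                   for p in points]
    # "Reduce" step: recompute each centroid as the mean of its assigned points
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        cluster = [p for idx, p in assignments if idx == k]
        if cluster:
            new_centroids[k] = np.mean(cluster, axis=0)
    return new_centroids

# Driver loop: iterate until the centroids stop moving (the convergence check)
points = np.random.rand(100, 2)
centroids = points[np.random.choice(len(points), 3, replace=False)]
threshold = 1e-4
while True:
    new_centroids = kmeans_iteration(points, centroids)
    if np.linalg.norm(new_centroids - centroids) <= threshold:
        break
    centroids = new_centroids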
Example: the K-means algorithm

[Diagram: chained MapReduce jobs for K-means. Iteration 1: Data + Prototype 0 → Map → Reduce → Prototype 1 → check convergence. Iteration 2: Data + Prototype 1 → Map → Reduce → Prototype 2 → check convergence, and so on.]
Iterative MapReduce performance

In general, in an iterative implementation of MapReduce:

• Every MapReduce job is independent (no shared state)


• Data has to be transferred from Mappers to Reducers
• Data has to be loaded from disk on every iteration
• Results are saved to HDFS, with multiple replications

These repeated disk and network costs are the performance killer of iterative MapReduce.
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Hadoop/MapReduce limitations

Hadoop is a batch processing framework:


• Designed to process very large datasets
• Efficient at processing the Map stage: the data is already distributed
• Inefficient in I/O and communications: data must be loaded from and written to HDFS; shuffle and sort incur long latencies and produce heavy network traffic
• Job start-up and finish take seconds, regardless of the size of the dataset

MapReduce is not a good fit for every problem:


• Rigid structure: Map, Shuffle/Sort, Reduce
• No native support for iterations
• One synchronization barrier
Note: Hadoop was designed at a time when memory was expensive.
In-memory processing

In-memory storage (RAM) provides much faster data access than on-disk storage and offers the additional flexibility needed in many Big Data scenarios.

In an in-memory processing approach:


• Data is loaded in memory before computation
• Kept in memory during successive steps

In-memory processing is suitable when new data arrives at a fast pace (streams), for real-time analytics
and exploratory tasks, when iterative access is required, or multiple jobs need the same data.

Main initiatives using in-memory processing:


• Databases: Redis, Memcached
• Graph centric: Pregel
• General purpose: Spark, Flink
Spark

The Spark project originated at the AMPLab at UC Berkeley, and is one of the most active Apache Open-
Source projects, currently led by Databricks.

Spark’s main features include:


• Data flow programming model operating on distributed collections of records
• Collections are kept in-memory
• Support for iterations and interactive queries
• Retains the attractive properties of MapReduce (no explicit references to parallelism in the programming logic, fault tolerance, data locality, scalability)

• Significantly faster than Hadoop on iterative workloads such as logistic regression
Spark’s basic architecture

Spark manages and coordinates the execution of tasks on data distributed across a cluster:
• The cluster manager (Spark’s standalone cluster manager, YARN, which is mainly suited to Hadoop clusters, or Mesos, which is more generic) keeps track of the available resources.
• Spark Applications are submitted to the manager, which will grant resources to complete them.

A Spark Application consists of:


• Driver process: runs the main() function on a node in the cluster. It maintains all information during the lifetime of the application, responds to user input, and schedules tasks across the executors.
• Executor processes: carry out the tasks assigned by the driver and report the state of the computation back to the driver.

There can be multiple Spark Applications running on a cluster at the same time.
Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) represents a partitioned collection of records:

• Fault tolerant (can be rebuilt if a partition is lost)


• Immutable (can be transformed into new RDDs, but not edited)
• Parallelisable (can be operated on in parallel when distributed across a cluster)

RDDs are created by reading data from an external storage system (for instance HDFS, HBase, Amazon S3, Cassandra, …) or from an existing collection in the driver program.

All Spark code compiles down to operations on an abstract representation, namely RDDs. However, even though Spark allows you to write low-level applications that explicitly operate on RDDs, it is more common to use high-level distributed collections (Datasets, DataFrames or SQL tables).
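
As a minimal sketch of the two creation routes just described (the application name and the HDFS path are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation-sketch")

# From an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# From an external storage system (hypothetical HDFS path)
logs = sc.textFile("hdfs:///data/logs/sample.txt")

print(numbers.count())   # an action; triggers the computation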
Spark operations

Spark defines two types of operations: transformations and actions.

Transformations:
• Lazy operations to build RDDs from other RDDs
• Executed in parallel (similar to map and shuffle in MapReduce)

Actions:
• View data in console
• Collect data
• Write to output data sources
Spark operations

Transformations (define a new RDD from an existing one): map, filter, sample, union, groupByKey, reduceByKey, join, persist

Actions (take an RDD and return a result to the driver or to HDFS): collect, reduce, count, saveAsTextFile, lookupKey, forEach, …
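
A small sketch of the distinction between transformations and actions (the sample log lines are made up for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="transformations-vs-actions")
rdd = sc.parallelize(["ERROR disk full", "INFO ok", "ERROR timeout"])

errors = rdd.filter(lambda line: line.startswith("ERROR"))   # transformation (lazy)
pairs = errors.map(lambda line: (line.split()[1], 1))        # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)               # transformation (lazy)

print(counts.collect())   # action: triggers the whole chain and returns results to the driver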

Execution plan and lazy evaluation

Given a Spark Application, Spark creates an optimised execution plan that is represented as a Directed
Acyclic Graph (DAG) of transformations that can be executed in parallel across workers on the cluster.

Operations are evaluated lazily in Spark:


• Transformations are only executed when they are needed
• Only the invocation of an action will trigger the execution chain
• Enables building the actual execution plan to optimise the data flow
Execution plan and lazy evaluation

[Diagram: Input Data → Transf 1 → RDD 1 → Transf 2 → RDD 2 → Action → RDD 3 → Output Data; each RDD is shown split into partitions.]
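
A brief illustration of lazy evaluation, using made-up sentences: defining the transformations only records the lineage of RDDs, and nothing runs until the count() action is invoked.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-evaluation-sketch")

lines = sc.parallelize(["spark is lazy", "hadoop writes to disk", "spark caches in memory"])
spark_lines = lines.filter(lambda l: "spark" in l)        # no execution yet
words = spark_lines.flatMap(lambda l: l.split(" "))       # still no execution

print(words.toDebugString().decode())   # shows the lineage (execution plan) built so far
print(words.count())                    # the action triggers the chain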
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset

Break and Quiz


Spark’s language APIs

Spark code can be written using different programming languages. This is made possible by Spark APIs:
• Scala (Spark’s default language)
• Java
• Python
• SQL
• R

Code written in these languages includes Spark’s core operations and is translated into Spark’s low-level API (RDDs), which is executed by the workers across the cluster.

In addition, Spark offers interactive shells that can be used for prototyping and ad-hoc data analysis,
namely pyspark (Python), spark-shell (Scala), spark-sql (SQL) and sparkR (R).
Low-level APIs: RDD

All Spark code compiles down to an RDD.


Spark offers APIs to write applications that operate on RDDs directly, called low-level APIs. In addition to
RDDs, low-level APIs also allow you to distribute and manipulate shared distributed variables.

Using low-level APIs is however uncommon and is only recommended when:


• Control over physical data across a cluster is needed
• Some very specific functionality is needed
• There is legacy code using RDDs

The recommended approach is to use high-level APIs.
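
For completeness, a small sketch of the shared distributed variables mentioned above, i.e. a broadcast variable and an accumulator; the lookup table and sample data are made up for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="low-level-api-sketch")

# Broadcast variable: a read-only lookup table shipped once to every executor
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: a counter that executors can only add to and the driver can read
bad_records = sc.accumulator(0)

def score(key):
    if key not in lookup.value:
        bad_records.add(1)
        return 0
    return lookup.value[key]

rdd = sc.parallelize(["a", "b", "x", "a"])
print(rdd.map(score).collect())   # the action triggers execution on the executors
print(bad_records.value)          # 1 (the unknown key "x")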


High-level data abstractions: Structured APIs

Structured APIs add simplicity, expressiveness and efficiency to Spark, as they make it possible to express computations as common data analysis patterns.
Structured APIs include:
• DataFrames
• Datasets
• SQL Tables

DataFrames are conceptually equivalent to tables in relational databases or data frames in R or Python:
• Represent data as immutable, distributed, in-memory tables.
• Have a schema that defines column names and associated data types.
• Can be constructed from a variety of data sources such as structured data files, tables in Hive,
databases, RDDs…
• The DataFrame API is available in Scala, Java, Python and R.
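
A minimal sketch of creating and querying a DataFrame; the in-memory rows and column names are illustrative only, and the session is created with SparkSession.builder, covered on the next slide.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Hypothetical in-memory data; in practice the source is often a file, a Hive table, a database, ...
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],   # schema: column names (types are inferred here)
)

df.printSchema()                # shows the schema: name (string), age (long)
df.filter(df.age > 30).show()   # DataFrame operations are also evaluated lazily until an action such as show()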
SparkSession

A session in computer systems is defined as an interaction between two entities. In order to use Spark’s functionality to manipulate data, it is necessary to initiate a session with Spark.

SparkSession:
• Single, unified entry-point to access all the functionalities offered in Spark
• Encapsulates a diversity of entry-points for different functionalities: SparkContext, SQLContext, …
• Driver process that manages an application and allows Spark to execute user-defined manipulations

In a standalone Spark application, it is necessary to create the SparkSession object in the application
code. Spark’s Language APIs allow you to create a SparkSession.

When using an interactive shell, the SparkSession is created automatically and accessible via the variable
spark.
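
A minimal sketch of creating a SparkSession in a standalone application; the application name and the local master URL are illustrative choices.

from pyspark.sql import SparkSession

# Create (or reuse) the single entry point to Spark's functionality
spark = (SparkSession.builder
         .appName("my-application")
         .master("local[*]")   # run locally on all cores; on a cluster this is normally set at submission time
         .getOrCreate())

# The older entry points remain reachable through the session
sc = spark.sparkContext

spark.stop()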
The word count below is an example of using the low-level (RDD) API.
Word Count in Spark (Python)

import pyspark
sc = pyspark.SparkContext()

# Ingest and preprocess the input data; sc is the SparkContext
lines = sc.textFile("/input/path")

# Transformations; note that lines, words and counts are all RDDs
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Action: store the results
counts.saveAsTextFile("/output/path")
Word Count in Spark: Data Flow

[Data flow diagram, shown over three slides: HDFS://inputpath → sc.textFile() → lines → flatMap(l: l.split()) → words → map(w: (w, 1)) → reduceByKey((a, b): a + b) → counts → saveAsTextFile() → HDFS://outputpath. Legend: RDD, I/O, Transformation/Action. The final slide relates the flatMap/map stages to the MAP phase and reduceByKey to the REDUCE phase of MapReduce.]
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset
Overview of Spark’s Toolset

[Diagram of Spark’s toolset: Structured Streaming, Advanced Analytics and the Third-Party Ecosystem sit on top of the Structured APIs (Datasets, DataFrames, SQL), which in turn sit on top of the Low-level APIs (RDDs and distributed variables).]

Structured Streaming

Structured Streaming is Spark’s high-level API for stream processing:

• Follows a micro-batch approach (accumulates small batches of input data and then processes them in
parallel)
• A stream of data is treated as a table to which data is appended continuously
• No need to change your code to do batch or stream processing
• It is available in all the environments: Scala, Java, Python, R and SQL
• Has native support for event-time data
• Makes it easy to build end-to-end Spark applications that combine streaming, batch and interactive queries
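
A minimal sketch of the micro-batch model described above: the canonical streaming word count, which treats a socket stream as an unbounded table. The host and port are illustrative; locally, such a stream can be fed with a tool like netcat.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Treat a socket stream as a table to which lines are appended continuously
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame operations as in batch code
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Start the micro-batch query and print each updated result to the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()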
Advanced analytics: MLlib
MLlib is Spark’s high-level API for machine learning:
• It allows pre-processing data, training models and deploying them to make predictions
• Models trained with MLlib can be deployed in Structured Streaming
• Includes machine learning algorithms for classification, regression, recommendation systems and clustering, among others, as well as statistics and linear algebra utilities
• Interoperates with NumPy in Python and with R libraries
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)
https://spark.apache.org/docs/latest/ml-classification-regression.html
Big Data Processing: Week 4
Topic List:

● Iterative MapReduce
● Spark concepts
● Spark programming basics
● Spark’s toolset

Next Week – More on Spark Programming
