10 Spark1
Knowledge objectives
1. Name the main Spark contributions and characteristics
2. Compare MapReduce and Spark
3. Define a dataframe
4. Distinguish dataframe from relation and matrix
5. Distinguish Spark and Pandas dataframe
6. Enumerate some abstractions on top of Spark
Application Objectives
• Provide Spark pseudo-code for a simple problem using dataframes
Background
MapReduce limitations
MapReduce intra-job coordination
[Figure: dataflow within a MapReduce job: Input → Mapper → Combiner → MergeSort (shuffle) → Reducer, transforming ⟨key, value⟩ pairs; at each phase boundary the intermediate data is written to and read from disk (R/W).]
MapReduce inter-job coordination
[Figure: a chain of MapReduce jobs (e.g., Count, then Rank, …) coordinating by writing intermediate results to the DFS and reading them back.]
MapReduce limitations
• Coordination between phases using DFS
• Map, Shuffle, Reduce
• Coordination between jobs using DFS
• Count, rank, aggregate, …
Main memory coordination in Spark
[Figure: the same chain of jobs (Count, then Rank, …) in Spark, coordinating through intermediate results kept in main memory instead of the DFS.]
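As a rough illustration of this idea (not part of the original slides), the following PySpark sketch chains a counting job and a ranking job; the intermediate counts stay in main memory via cache() instead of being re-read from the DFS. The file name and column names are assumptions.

```python
# Sketch: two chained "jobs" coordinating through main memory (cache),
# not through the distributed file system. File/column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("count-then-rank").getOrCreate()

words = spark.read.text("words.txt")             # hypothetical input file
counts = words.groupBy("value").count().cache()  # keep the counts in memory

counts.count()                            # job 1: materialize the word counts
counts.orderBy(F.desc("count")).show(10)  # job 2: rank, reusing the cached counts
```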
Dataframes
Problems of relational tables in data exploration
• Schema needs to be defined before examining the data
• Poorly structured data is difficult to query
• Generating queries requires familiarity with the schema
• Complex declarative queries are hard to debug
• SQL was not conceived for REPL (Read, Evaluate, Print Loop)
Characteristics of dataframes
• First introduced in the S language in 1990
• Symmetrical treatment of rows and columns
• Both can be referenced explicitly
• By position (data is ordered row- and column-wise)
• By name
• Data has to adhere to a schema
• Defined at runtime
• Useful for data cleaning
• A variety of operations
• Relational-like (e.g., filter, join)
• Spreadsheet-like (e.g., pivot)
• Linear algebra (e.g., multiply)
• Incrementally composable query syntax
• Native embedding in an imperative language
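A minimal Pandas sketch (not from the slides; data and labels are invented) of the symmetric treatment of rows and columns, addressing cells by position and by name:

```python
# Sketch: rows and columns are both ordered and named, so cells can be
# addressed by position (iloc) or by label (loc). Data/labels are made up.
import pandas as pd

df = pd.DataFrame({"name": ["x", "y", "z"], "flag": [True, False, True]},
                  index=["r1", "r2", "r3"])

df.iloc[0, 1]          # by position: first row, second column
df.loc["r2", "name"]   # by name: row label and column label
df.T                   # the symmetry also allows transposing
```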
Relation, Dataframe and Matrix
[Figure: side-by-side comparison. A relation R(A1, …, An) has named columns and unordered rows r1 … rm. The original dataframe orders and labels both rows and columns (1/A1 … n/An), each column has a type T1 … Tn, and cells may hold strings, booleans or None. A matrix addresses rows and columns only by position (1 … m, 1 … n) and holds only numeric values a11 … amn. A Spark dataframe keeps the ordered, named, typed columns of the original dataframe but has no row labels.]
Spark Dataframe definition
“A Dataset is a strongly typed collection of domain-specific objects that can be transformed in
parallel using functional or relational operations.”
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html
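As a minimal sketch of these three classes (Dataset/DataFrame, Row, StructType), here is one way to build a small dataframe with an explicit runtime schema; the column names and values are illustrative.

```python
# Sketch: a DataFrame is a Dataset of Row objects whose schema is a StructType
# defined at runtime. Column names and values are illustrative.
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.appName("dataframe-definition").getOrCreate()

schema = StructType([
    StructField("label", StringType(), nullable=True),
    StructField("flag", BooleanType(), nullable=False),
])
df = spark.createDataFrame([Row(label="x", flag=True),
                            Row(label=None, flag=False)], schema)
df.printSchema()
```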
Dataframe vs Matrix/Array/Tensor
Dataframe | Matrix
Heterogeneously typed | Homogeneously typed
Both numeric and non-numeric types | Only numeric types
Explicit column names (also row names in Pandas) | No names at all
Supports relational algebra | Does not support relational algebra
D. Petersohn, et al. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
Dataframe vs Relation/Table
D. Petersohn, et al. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
Dataframe implementations
Pandas | Spark
Eager evaluation of transformations | Lazy evaluation of transformations
Resides in memory | Requires a SparkSession
Not scalable (multithreaded operators exist, but manual split is required) | Transparently scalable in the Cloud
Transposable | Non-transposable (problems with too many rows)
D. Petersohn, et al. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
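A small sketch (not from the slides) contrasting the eager and lazy behaviour; the column name is invented:

```python
# Sketch: Pandas evaluates each transformation immediately; Spark only builds
# a plan and executes it when an action is called. Column name is invented.
import pandas as pd
from pyspark.sql import SparkSession

pdf = pd.DataFrame({"x": [1, 2, 3]})
filtered_pdf = pdf[pdf["x"] > 1]             # Pandas: computed right away (eager)

spark = SparkSession.builder.getOrCreate()   # Spark needs a SparkSession
sdf = spark.createDataFrame(pdf)
filtered_sdf = sdf.filter(sdf["x"] > 1)      # Spark: only a plan so far (lazy)
filtered_sdf.count()                         # the action triggers execution
```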
Spark dataframe operations
a) Input/Output
b) Transformations
• Lower abstraction: O-O interface (similar to RDDs)
• Functions over columns
• Higher abstraction: SQL
c) Actions
d) Schema management
Input/Output
• Matrix
• Pandas dataframe
• CSV
• JSON
• RDBMS
• HDFS file formats:
• ORC
• Parquet
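A sketch of reading and writing some of these formats; all paths and options below are placeholders.

```python
# Sketch: reading from and writing to several sources; all paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)
json_df = spark.read.json("events.json")
parquet_df = spark.read.parquet("hdfs:///data/events.parquet")  # HDFS file format

csv_df.write.mode("overwrite").parquet("people.parquet")  # write back as Parquet
small_pdf = csv_df.toPandas()   # to a Pandas dataframe (must fit on the driver)
```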
Transformations available
• select
• filter/where
• sample
• distinct/dropDuplicates
• sort
• replace
• groupBy+agg
• union/unionAll/unionByName
• subtract
• join
• …
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
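A sketch chaining several of these transformations; the dataframes people and cities and all column names are assumptions, and nothing is executed until an action is called.

```python
# Sketch: composing transformations; `people` and `cities` are assumed
# dataframes and all column names are made up. The result is still a plan.
from pyspark.sql import functions as F

result = (people
          .select("name", "age", "city")
          .filter(F.col("age") >= 18)
          .dropDuplicates(["name"])
          .join(cities, on="city", how="inner")
          .groupBy("country")
          .agg(F.count("*").alias("adults"))
          .sort(F.desc("adults")))
```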
Functions over columns
• Normal (lit, isNull, …)
• Math (sqrt, sin, ceil, log, …)
• Date/time (current_date, dayofweek, …)
• Collection (array_sort, forall, zip_with, …)
• Aggregate (avg, count, first, corr, max, min, …)
• Sort (asc, desc, asc_nulls_first, …)
• String (length, lower, trim, …)
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
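A sketch combining a few of these functions on an assumed dataframe df with invented column names:

```python
# Sketch: column functions from pyspark.sql.functions applied to an assumed
# dataframe `df`; the column names are invented.
from pyspark.sql import functions as F

shaped = df.select(
    F.lower(F.trim(F.col("name"))).alias("name"),       # string functions
    F.ceil(F.sqrt(F.col("amount"))).alias("magnitude"),  # math functions
    F.dayofweek(F.current_date()).alias("weekday"),      # date/time functions
    F.col("comment").isNull().alias("no_comment"),       # normal functions
)
stats = df.groupBy("city").agg(F.avg("amount"), F.max("amount"))  # aggregates
```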
Actions available
• count
• first
• collect
• take/head/tail
• show
• write
• toPandas
• …
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
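A sketch of the most common actions on an assumed dataframe df; each one triggers execution of the accumulated plan.

```python
# Sketch: actions on an assumed dataframe `df`; each call executes the plan.
n = df.count()                 # number of rows
first_row = df.first()         # a single Row object
some_rows = df.take(5)         # a small list of Row objects on the driver
df.show(5, truncate=False)     # print the first rows on the console
df.write.mode("overwrite").parquet("out.parquet")  # persist the result
pdf = df.toPandas()            # only for results that fit in driver memory
```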
Schema operations
• summary/describe
• printSchema
• columns
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
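A sketch of inspecting the schema and basic statistics of an assumed dataframe df (column names are invented):

```python
# Sketch: schema inspection on an assumed dataframe `df`.
df.printSchema()                          # column names, types and nullability
print(df.columns)                         # list of column names
df.describe("amount").show()              # count/mean/stddev/min/max of a column
df.summary("count", "min", "max").show()  # selected summary statistics
```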
Optimizations
• Lazy evaluation
• cache/persist
• unpersist
• checkpoint
• Parallelism
• repartition/coalesce
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
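A sketch of using these hooks on an assumed dataframe df with an invented column:

```python
# Sketch: caching and repartitioning an assumed dataframe `df`.
df.cache()                           # keep it in memory after the first action
df.count()                           # materializes the cache
df.groupBy("city").count().show()    # reuses the cached data
df.unpersist()                       # release the memory

wide = df.repartition(16, "city")    # more partitions, hash-partitioned by city
narrow = wide.coalesce(4)            # fewer partitions, avoiding a full shuffle
```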
Abstractions
Spark Abstractions
[Figure: abstractions layered on top of Spark: Dataframes, Spark SQL, Structured Streaming, MLlib, GraphX, …]
Spark SQL
• Besides the O-O interface of dataframes, there is a declarative one
• It can be used independently of the kind of source
• Not only Relational tables
• It is translated into functional programming by an optimizer
• Based on
a) Rules
• Predicate push down
• Column pruning
b) Cost model
• Extensible
https://spark.apache.org/sql
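A sketch of observing the optimizer on two assumed dataframes, orders and customers: explain() prints the logical and physical plans, where the effect of rules such as predicate push-down and column pruning can typically be seen.

```python
# Sketch: `orders` and `customers` are assumed dataframes; explain(True)
# prints the parsed, analyzed, optimized logical plans and the physical plan.
from pyspark.sql import functions as F

joined = (orders.join(customers, "customer_id")
                .filter(F.col("amount") > 100)     # candidate for push-down
                .select("customer_id", "amount"))  # candidate for column pruning
joined.explain(True)
```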
Spark SQL interface
• There is a catalog with all tables available
SparkSession.catalog
• Dataframes are registered as views in the catalog
DataFrame.createOrReplaceTempView(<tablename>)
• Queries:
SparkSession.sql(<query>)
• Input is simply a string
• Output is a dataframe
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html
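A sketch putting the three pieces together, for an assumed SparkSession spark and dataframe df with invented columns:

```python
# Sketch: register an assumed dataframe `df` as a view, inspect the catalog,
# and query it with SQL; the result is again a dataframe.
df.createOrReplaceTempView("people")
print(spark.catalog.listTables())      # the view now appears in the catalog

adults = spark.sql(
    "SELECT city, count(*) AS n FROM people WHERE age >= 18 GROUP BY city")
adults.show()                          # the output is a regular dataframe
```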
Shared Optimization and Execution
[Figure: the Catalyst optimizer is shared by SQL, Dataframe and Dataset programs: the query becomes an unresolved logical plan, is resolved against the catalog into a logical plan, optimized into an optimized logical plan, expanded into candidate physical plans, and a cost model selects the physical plan that is finally executed as RDDs.]
Closing
Summary
• Overcoming MapReduce limitations
• Dataframes
• Comparison
• Differences with Relations
• Differences with Matrices
• Differences between Pandas and Spark
• Operations
• Transformations
• Actions
• Optimizations
• Lazy evaluation
• Parallelism
• Abstractions
References
• H. Karau et al. Learning Spark. O'Reilly, 2015
• D. Petersohn, W. W. Ma, D. Jung Lin Lee, S. Macke, D. Xin, X. Mo, J. Gonzalez, J. M. Hellerstein, A. D. Joseph, A. G. Parameswaran. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
• A. Hogan. Procesado de Datos Masivos (Universidad de Chile). http://aidanhogan.com/teaching/cc5212-1-2020