10 Spark1
Knowledge objectives
1. Name the main Spark contributions and characteristics
2. Compare MapReduce and Spark
3. Define a dataframe
4. Distinguish dataframe from relation and matrix
5. Distinguish Spark and Pandas dataframe
6. Enumerate some abstractions on top of Spark
Application Objectives
• Provide Spark pseudo-code for a simple problem using dataframes
Background
MapReduce limitations
MapReduce intra-job coordination
[Figure: dataflow within a MapReduce job: Input → Mapper → Combiner → MergeSort (shuffle) → Reducer, transforming ⟨key, value⟩ pairs; at each phase boundary the intermediate data is written to and read from disk (R/W).]
MapReduce inter-job coordination
[Figure: a chain of MapReduce jobs (e.g., Count, then Rank, …) coordinating by writing intermediate results to the DFS and reading them back.]
MapReduce limitations
• Coordination between phases using DFS
• Map, Shuffle, Reduce
• Coordination between jobs using DFS
• Count, rank, aggregate, …
Main memory coordination in Spark
[Figure: the same chain of jobs (Count, then Rank, …) in Spark, coordinating through intermediate results kept in main memory instead of the DFS.]
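As a rough illustration of this idea (not part of the original slides), the following PySpark sketch chains a counting job and a ranking job; the intermediate counts stay in main memory via cache() instead of being re-read from the DFS. The file name and column names are assumptions.

```python
# Sketch: two chained "jobs" coordinating through main memory (cache),
# not through the distributed file system. File/column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("count-then-rank").getOrCreate()

words = spark.read.text("words.txt")             # hypothetical input file
counts = words.groupBy("value").count().cache()  # keep the counts in memory

counts.count()                            # job 1: materialize the word counts
counts.orderBy(F.desc("count")).show(10)  # job 2: rank, reusing the cached counts
```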
Dataframes
Problems of relational tables in data exploration
• Schema needs to be defined before examining the data
• Poorly structured data is difficult to query
• Generating queries requires familiarity with the schema
• Complex declarative queries are hard to debug
• SQL was not conceived for REPL (Read, Evaluate, Print Loop)
Characteristics of dataframes
• First introduced in the S language in 1990
• Symmetrical treatment of rows and columns
• Both can be referenced explicitly
• By position (data is ordered row- and column-wise)
• By name
• Data has to adhere to a schema
• Defined at runtime
• Useful for data cleaning
• A variety of operations
• Relational-like (e.g., filter, join)
• Spreadsheet-like (e.g., pivot)
• Linear algebra (e.g., multiply)
• Incrementally composable query syntax
• Native embedding in an imperative language
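A minimal Pandas sketch (not from the slides; data and labels are invented) of the symmetric treatment of rows and columns, addressing cells by position and by name:

```python
# Sketch: rows and columns are both ordered and named, so cells can be
# addressed by position (iloc) or by label (loc). Data/labels are made up.
import pandas as pd

df = pd.DataFrame({"name": ["x", "y", "z"], "flag": [True, False, True]},
                  index=["r1", "r2", "r3"])

df.iloc[0, 1]          # by position: first row, second column
df.loc["r2", "name"]   # by name: row label and column label
df.T                   # the symmetry also allows transposing
```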
Relation, Dataframe and Matrix
[Figure: side-by-side comparison. A relation R(A1, …, An) has named columns and unordered rows r1 … rm. The original dataframe orders and labels both rows and columns (1/A1 … n/An), each column has a type T1 … Tn, and cells may hold strings, booleans or None. A matrix addresses rows and columns only by position (1 … m, 1 … n) and holds only numeric values a11 … amn. A Spark dataframe keeps the ordered, named, typed columns of the original dataframe but has no row labels.]
Spark Dataframe definition
“A Dataset is a strongly typed collection of domain-specific objects that can be transformed in
parallel using functional or relational operations.”
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html
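As a minimal sketch of these three classes (Dataset/DataFrame, Row, StructType), here is one way to build a small dataframe with an explicit runtime schema; the column names and values are illustrative.

```python
# Sketch: a DataFrame is a Dataset of Row objects whose schema is a StructType
# defined at runtime. Column names and values are illustrative.
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.appName("dataframe-definition").getOrCreate()

schema = StructType([
    StructField("label", StringType(), nullable=True),
    StructField("flag", BooleanType(), nullable=False),
])
df = spark.createDataFrame([Row(label="x", flag=True),
                            Row(label=None, flag=False)], schema)
df.printSchema()
```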
Dataframe vs Matrix/Array/Tensor
Dataframe | Matrix
Heterogeneously typed | Homogeneously typed
Both numeric and non-numeric types | Only numeric types
Explicit column names (also row names in Pandas) | No names at all
Supports relational algebra | Does not support relational algebra
D. Petersohn, et al. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
Dataframe vs Relation/Table
D. Petersohn, et al. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
Dataframe implementations
Pandas | Spark
Eager evaluation of transformations | Lazy evaluation of transformations
Resides in memory | Requires a SparkSession
Not scalable (multithreaded operators exist, but manual split is required) | Transparently scalable in the Cloud
Transposable | Non-transposable (problems with too many rows)
D. Petersohn, et al. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
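A small sketch (not from the slides) contrasting the eager and lazy behaviour; the column name is invented:

```python
# Sketch: Pandas evaluates each transformation immediately; Spark only builds
# a plan and executes it when an action is called. Column name is invented.
import pandas as pd
from pyspark.sql import SparkSession

pdf = pd.DataFrame({"x": [1, 2, 3]})
filtered_pdf = pdf[pdf["x"] > 1]             # Pandas: computed right away (eager)

spark = SparkSession.builder.getOrCreate()   # Spark needs a SparkSession
sdf = spark.createDataFrame(pdf)
filtered_sdf = sdf.filter(sdf["x"] > 1)      # Spark: only a plan so far (lazy)
filtered_sdf.count()                         # the action triggers execution
```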
Spark dataframe operations
a) Input/Output
b) Transformations
• Lower abstraction: O-O interface (similar to RDDs)
• Functions over columns
• Higher abstraction: SQL
c) Actions
d) Schema management
Input/Output
• Matrix
• Pandas dataframe
• CSV
• JSON
• RDBMS
• HDFS file formats:
• ORC
• Parquet
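A sketch of reading and writing some of these formats; all paths and options below are placeholders.

```python
# Sketch: reading from and writing to several sources; all paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)
json_df = spark.read.json("events.json")
parquet_df = spark.read.parquet("hdfs:///data/events.parquet")  # HDFS file format

csv_df.write.mode("overwrite").parquet("people.parquet")  # write back as Parquet
small_pdf = csv_df.toPandas()   # to a Pandas dataframe (must fit on the driver)
```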
Transformations available
• select
• filter/where
• sample
• distinct/dropDuplicates
• sort
• replace
• groupBy+agg
• union/unionAll/unionByName
• subtract
• join
• …
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
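A sketch chaining several of these transformations; the dataframes people and cities and all column names are assumptions, and nothing is executed until an action is called.

```python
# Sketch: composing transformations; `people` and `cities` are assumed
# dataframes and all column names are made up. The result is still a plan.
from pyspark.sql import functions as F

result = (people
          .select("name", "age", "city")
          .filter(F.col("age") >= 18)
          .dropDuplicates(["name"])
          .join(cities, on="city", how="inner")
          .groupBy("country")
          .agg(F.count("*").alias("adults"))
          .sort(F.desc("adults")))
```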
Functions over columns
• Normal (lit, isNull, …)
• Math (sqrt, sin, ceil, log, …)
• Date/time (current_date, dayofweek, …)
• Collection (array_sort, forall, zip_with, …)
• Aggregate (avg, count, first, corr, max, min, …)
• Sort (asc, desc, asc_nulls_first, …)
• String (length, lower, trim, …)
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
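A sketch combining a few of these functions on an assumed dataframe df with invented column names:

```python
# Sketch: column functions from pyspark.sql.functions applied to an assumed
# dataframe `df`; the column names are invented.
from pyspark.sql import functions as F

shaped = df.select(
    F.lower(F.trim(F.col("name"))).alias("name"),       # string functions
    F.ceil(F.sqrt(F.col("amount"))).alias("magnitude"),  # math functions
    F.dayofweek(F.current_date()).alias("weekday"),      # date/time functions
    F.col("comment").isNull().alias("no_comment"),       # normal functions
)
stats = df.groupBy("city").agg(F.avg("amount"), F.max("amount"))  # aggregates
```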
Actions available
• count
• first
• collect
• take/head/tail
• show
• write
• toPandas
• …
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
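A sketch of the most common actions on an assumed dataframe df; each one triggers execution of the accumulated plan.

```python
# Sketch: actions on an assumed dataframe `df`; each call executes the plan.
n = df.count()                 # number of rows
first_row = df.first()         # a single Row object
some_rows = df.take(5)         # a small list of Row objects on the driver
df.show(5, truncate=False)     # print the first rows on the console
df.write.mode("overwrite").parquet("out.parquet")  # persist the result
pdf = df.toPandas()            # only for results that fit in driver memory
```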
Schema operations
• summary/describe
• printSchema
• columns
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
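A sketch of inspecting the schema and basic statistics of an assumed dataframe df (column names are invented):

```python
# Sketch: schema inspection on an assumed dataframe `df`.
df.printSchema()                          # column names, types and nullability
print(df.columns)                         # list of column names
df.describe("amount").show()              # count/mean/stddev/min/max of a column
df.summary("count", "min", "max").show()  # selected summary statistics
```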
Optimizations
• Lazy evaluation
• cache/persist
• unpersist
• checkpoint
• Parallelism
• repartition/coalesce
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
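A sketch of using these hooks on an assumed dataframe df with an invented column:

```python
# Sketch: caching and repartitioning an assumed dataframe `df`.
df.cache()                           # keep it in memory after the first action
df.count()                           # materializes the cache
df.groupBy("city").count().show()    # reuses the cached data
df.unpersist()                       # release the memory

wide = df.repartition(16, "city")    # more partitions, hash-partitioned by city
narrow = wide.coalesce(4)            # fewer partitions, avoiding a full shuffle
```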
Abstractions
Spark Abstractions
[Figure: abstractions layered on top of Spark: Dataframes, Spark SQL, Structured Streaming, MLlib, GraphX, …]
Spark SQL
• Besides the O-O interface of dataframes, there is a declarative one
• It can be used independently of the kind of source
• Not only Relational tables
• It is translated into functional programming by an optimizer
• Based on
a) Rules
• Predicate push down
• Column pruning
b) Cost model
• Extensible
https://spark.apache.org/sql
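A sketch of observing the optimizer on two assumed dataframes, orders and customers: explain() prints the logical and physical plans, where the effect of rules such as predicate push-down and column pruning can typically be seen.

```python
# Sketch: `orders` and `customers` are assumed dataframes; explain(True)
# prints the parsed, analyzed, optimized logical plans and the physical plan.
from pyspark.sql import functions as F

joined = (orders.join(customers, "customer_id")
                .filter(F.col("amount") > 100)     # candidate for push-down
                .select("customer_id", "amount"))  # candidate for column pruning
joined.explain(True)
```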
Spark SQL interface
• There is a catalog with all tables available
SparkSession.catalog
• Dataframes are registered as views in the catalog
DataFrame.createOrReplaceTempView(<tablename>)
• Queries:
SparkSession.sql(<query>)
• Input is simply a string
• Output is a dataframe
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html
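A sketch putting the three pieces together, for an assumed SparkSession spark and dataframe df with invented columns:

```python
# Sketch: register an assumed dataframe `df` as a view, inspect the catalog,
# and query it with SQL; the result is again a dataframe.
df.createOrReplaceTempView("people")
print(spark.catalog.listTables())      # the view now appears in the catalog

adults = spark.sql(
    "SELECT city, count(*) AS n FROM people WHERE age >= 18 GROUP BY city")
adults.show()                          # the output is a regular dataframe
```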
Shared Optimization and Execution
[Figure: the Catalyst optimizer is shared by SQL, Dataframe and Dataset programs: the query becomes an unresolved logical plan, is resolved against the catalog into a logical plan, optimized into an optimized logical plan, expanded into candidate physical plans, and a cost model selects the physical plan that is finally executed as RDDs.]
Closing
Summary
• Overcoming MapReduce limitations
• Dataframes
• Comparison
• Differences with Relations
• Differences with Matrices
• Differences between Pandas and Spark
• Operations
• Transformations
• Actions
• Optimizations
• Lazy evaluation
• Parallelism
• Abstractions
References
• H. Karau et al. Learning Spark. O'Reilly, 2015
• D. Petersohn, W. W. Ma, D. Jung Lin Lee, S. Macke, D. Xin, X. Mo, J. Gonzalez, J. M. Hellerstein, A. D. Joseph, A. G. Parameswaran. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13(11), 2020
• A. Hogan. Procesado de Datos Masivos (Universidad de Chile). http://aidanhogan.com/teaching/cc5212-1-2020