4- Spark SQL
Spark SQL
Project Tungsten
Spark SQL
RDD APIs, although richer and more concise than MapReduce, are still considered low-level
We still want to benefit from Spark's in-memory execution model while making it accessible to more people
Similar to Hive and Impala for MapReduce/HDFS, Spark SQL wraps RDD API calls in an SQL-like layer
Spark SQL uses the DataFrame and Dataset abstractions
It has an advanced query optimizer called Catalyst
Source: Armbrust, Michael, et al. “Spark SQL: Relational Data Processing in Spark.” Proceedings of the 2015 ACM SIGMOD
DataFrame
Creating DataFrames
Basic DataFrame Operations
Manipulating Data in DataFrames
DataFrame Query String
You can pass column names and filter conditions as strings:
peopleDF.select("name", "age")
peopleDF.where("age > 21")
Querying DataFrames Using Columns
from pyspark.sql.functions import upper  # PySpark syntax
peopleDF.select(peopleDF["age"] + 10, upper(peopleDF["name"]))
peopleDF.sort(peopleDF.age.desc())
SQL Queries
RDD vs. DataFrames vs. Spark SQL
Major Milestones in Spark SQL
DataFrames vs. Datasets
Spark RDD API Example
Spark DataFrame Example - SQL
Spark DataFrame Example
Spark Dataset Example
Catalyst: Plan Optimization and Execution
Logical Optimization
Trees
Tree Abstractions
Expression
An expression represents a new value, computed based on input values, e.g. 1 + 2 + t1.value
Attribute: a column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v)
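The expression tree for 1 + 2 + t1.value can be sketched in a few lines of Python. This is a toy model and not Catalyst's actual classes: the node names Literal, Attribute and Add, and the evaluate helper, are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Toy expression nodes (hypothetical names, not Catalyst's real classes).
@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class Attribute:
    name: str  # a column reference such as "t1.value"

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def evaluate(expr, row):
    """Compute the value of an expression tree for one input row (a dict)."""
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Attribute):
        return row[expr.name]
    if isinstance(expr, Add):
        return evaluate(expr.left, row) + evaluate(expr.right, row)
    raise TypeError(f"unknown expression node: {expr!r}")

# 1 + 2 + t1.value, read left to right as Add(Add(1, 2), t1.value)
expr = Add(Add(Literal(1), Literal(2)), Attribute("t1.value"))
result = evaluate(expr, {"t1.value": 10})
```

Because the nodes are immutable values, an optimizer can rewrite such a tree by building new nodes rather than mutating existing ones.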
SQL to Logical Plan
Expression
Logical Plan
Physical Plan
Optimization
Transforms
Combining Multiple Rules
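As a rough illustration of combining rules, the sketch below applies two small rewrite rules over a toy expression tree and repeats the batch until the tree stops changing (a fixed point). All names here (transform, constant_fold, drop_zero, optimize) are hypothetical and chosen for this example; Catalyst's real rules operate on its own tree classes.

```python
from dataclasses import dataclass

# Toy expression nodes (hypothetical, for illustration only).
@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class Attribute:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def transform(expr, rule):
    """Apply `rule` bottom-up: rewrite the children first, then the node itself."""
    if isinstance(expr, Add):
        expr = Add(transform(expr.left, rule), transform(expr.right, rule))
    return rule(expr)

def constant_fold(expr):
    # Add(Literal(a), Literal(b)) -> Literal(a + b)
    if (isinstance(expr, Add) and isinstance(expr.left, Literal)
            and isinstance(expr.right, Literal)):
        return Literal(expr.left.value + expr.right.value)
    return expr

def drop_zero(expr):
    # x + 0 -> x and 0 + x -> x
    if isinstance(expr, Add):
        if expr.right == Literal(0):
            return expr.left
        if expr.left == Literal(0):
            return expr.right
    return expr

def optimize(expr, rules, max_iterations=100):
    """Run the whole batch of rules repeatedly until the tree stops changing."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in rules:
            new_expr = transform(new_expr, rule)
        if new_expr == expr:  # fixed point reached
            return expr
        expr = new_expr
    return expr

# (1 + (-1)) + t1.value: drop_zero finds nothing on the first pass, constant_fold
# then produces 0 + t1.value, and only the second pass lets drop_zero remove the zero.
tree = Add(Add(Literal(1), Literal(-1)), Attribute("t1.value"))
optimized = optimize(tree, [drop_zero, constant_fold])
```

Each rule stays small and only handles its own pattern; running the batch to a fixed point lets one rule expose opportunities for another without either rule knowing about the other.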
Project Tungsten
Off-Heap Memory Management
Spark stores its data in a binary format
Spark serializes and deserializes its own data
It exploits the data schema to reduce the overhead
It builds on sun.misc.Unsafe to get C-like memory access
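The idea of a schema-driven binary format can be sketched with Python's struct module. This is only an analogy: the three-column schema is made up for the example, the layout is not Tungsten's actual row format, and a Python bytearray is not off-heap memory.

```python
import struct

# Hypothetical schema: (id: 64-bit int, year: 32-bit int, score: 64-bit float).
# "<" means little-endian with no padding, so every row occupies exactly 20 bytes.
ROW_FORMAT = struct.Struct("<qid")

rows = [(1, 2014, 3.5), (2, 2016, 4.0), (3, 2018, 2.5)]

# Serialize all rows into one contiguous buffer instead of one object per row.
buffer = bytearray(ROW_FORMAT.size * len(rows))
for i, row in enumerate(rows):
    ROW_FORMAT.pack_into(buffer, i * ROW_FORMAT.size, *row)

def read_year(buf, row_index):
    """Read only the `year` field of a row, using the offset implied by the schema."""
    year_offset = 8  # the 8-byte id comes first
    (year,) = struct.unpack_from("<i", buf, row_index * ROW_FORMAT.size + year_offset)
    return year

years = [read_year(buffer, i) for i in range(len(rows))]
```

Because the schema fixes every field's offset and width, a single field can be read without deserializing the whole row, and the buffer carries no per-object headers.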
Cache-aware Computation
Code Generation
Example
Consider the case where we need to filter a DataFrame by the column year:
year > 2015
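To make the contrast concrete, the sketch below evaluates year > 2015 in two ways: by walking an expression tree for every row, and by generating source code for the predicate once and compiling it. The tuple-based tree encoding and the interpret and generate helpers are assumptions made for this example; Spark itself generates Java code rather than Python.

```python
# Interpreted evaluation: walk a small expression tree for every row.
def interpret(expr, row):
    kind = expr[0]
    if kind == "attr":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    if kind == "gt":
        return interpret(expr[1], row) > interpret(expr[2], row)
    raise ValueError(f"unknown node kind: {kind}")

# Code generation: turn the same tree into source text once, then compile it.
def generate(expr):
    kind = expr[0]
    if kind == "attr":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind == "gt":
        return f"({generate(expr[1])} > {generate(expr[2])})"
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical encoding of the predicate year > 2015.
predicate = ("gt", ("attr", "year"), ("lit", 2015))
source = f"lambda row: {generate(predicate)}"
compiled = eval(compile(source, "<generated>", "eval"))

rows = [{"year": 2014}, {"year": 2016}, {"year": 2020}]
interpreted_result = [r for r in rows if interpret(predicate, r)]
generated_result = [r for r in rows if compiled(r)]
```

The interpreted path pays for the tree walk and the branching on node kinds on every row, while the generated predicate does that work once at compile time.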
Whole-stage Code Generation
Volcano Iterator Model
Issues with Volcano Model
What would dedicated code look like?
Dedicated versus Volcano
Why?
Whole-stage Code Generation
The goal is to offer the functionality of a general-purpose execution engine, like the Volcano model, while performing like a hand-built system that does exactly what the user wants
It uses a technique now popular in the database literature: simply fuse the operators together so that the generated code looks hand-optimized
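The difference between the two styles can be sketched for a query that counts the rows with year > 2015. The Scan, Filter and Count classes below are a simplified stand-in for Volcano-style operators (Python generators play the role of the per-row next() calls), and count_fused is what the fused version of the same pipeline might look like. All names are hypothetical.

```python
# Volcano-style operators: each one pulls rows from its child, one at a time.
class Scan:
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            yield row

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row

class Count:
    def __init__(self, child):
        self.child = child

    def execute(self):
        return sum(1 for _ in self.child)

def count_fused(rows):
    """The same query with all operators fused into a single loop."""
    count = 0
    for row in rows:
        if row["year"] > 2015:
            count += 1
    return count

rows = [{"year": y} for y in (2012, 2016, 2019, 2015, 2021)]
volcano_count = Count(Filter(Scan(rows), lambda r: r["year"] > 2015)).execute()
fused_count = count_fused(rows)
```

The operator version composes freely but hands every row across operator boundaries; the fused loop keeps the running state in local variables and makes no per-row calls between operators.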
Vectorized Execution: in-memory columnar storage
After WSCG, we can still speed up the execution of the generated code. How? Vectorization.
The idea is to take advantage of data-level parallelism (DLP) in modern CPUs, that is, to process data in batches of rows rather than one row at a time.
This requires a shift from row-based to columnar storage.
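A minimal sketch of the layout change, assuming a made-up two-column table: the same data is held once as row tuples and once as one typed array per column, and the filter is then evaluated batch by batch over the year column. Python does not emit SIMD instructions here; the example only shows the access pattern, and filter_batches is a hypothetical helper.

```python
from array import array

# Row-based layout: one record (here a tuple of id, year) per row.
rows = [(1, 2014), (2, 2016), (3, 2018), (4, 2012), (5, 2020)]

# Columnar layout: one contiguous, typed array per column.
ids = array("q", (r[0] for r in rows))
years = array("i", (r[1] for r in rows))

def filter_batches(ids, years, threshold, batch_size=2):
    """Evaluate `year > threshold` one batch of column values at a time."""
    selected = []
    for start in range(0, len(years), batch_size):
        year_batch = years[start:start + batch_size]
        # A tight loop over a single typed column; in a JVM or native engine
        # this is the kind of loop a compiler can map to SIMD instructions.
        mask = [y > threshold for y in year_batch]
        selected.extend(ids[start + i] for i, keep in enumerate(mask) if keep)
    return selected

selected_ids = filter_batches(ids, years, 2015)
```

With the columnar layout the predicate touches only the year values, which sit next to each other in memory, instead of stepping over every other field of every row.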
Scalar versus Vector Processing
Data availability for vectorized processing
Ideal execution without CPU stalls
Execution with CPU stalls
Benchmarking Big SQL Systems
Benchmark Scope
TPC-H Results