2018 data warehouse features in spark

Spark Technology Center
Data Warehouse Features in
Apache Spark
Ioana Delaney
Spark Technology Center, IBM

About the Speaker
Ioana Delaney
Spark Technology Center, IBM
DB2 Optimizer developer working in the areas of query semantics, rewrite, and optimizer.
Worked on various releases of DB2 LUW and DB2 with BLU Acceleration
Apache Spark SQL Contributor

IBM Spark Technology Center
Founded in 2015
Location: 505 Howard St., San Francisco
Web: http://spark.tc
Twitter: @apachespark_tc
Mission:
• Contribute intellectual and technical capital
to the Apache Spark community.
• Make the core technology enterprise and
cloud-ready.
• Build data science skills to drive intelligence
into business applications
http://bigdatauniversity.com

Enterprise Data Warehouses and Open Source Analytics
• Data Warehouses provide critical features like security, performance, backup/recovery,
etc. that make them the technology of choice for most enterprise uses
• However, modern workloads like advanced analytics or machine learning require a
more flexible and agile environment, like Apache Spark
• Enabling Spark with Data Warehouse capabilities gives users the best of both worlds:
the enterprise strength and performance of modern Data Warehouses and the flexibility
and power of open source analytics

Easy Steps toward Data Warehouse Technology Integration
into Spark
• Incorporate basic data modeling features into Spark such as Star Schema and
Informational Referential Integrity Constraints
• Enhance push down capabilities from Spark into the underlying Data Warehouse to
exploit the features of the database engine e.g. index access

What is a Data Warehouse?
• Relational database that integrates data from multiple heterogeneous sources e.g.
transactional data, files, other sources
• Designed for data modeling and analysis
• Provides information around a subject of a business e.g. product, customers, suppliers, etc.
• The most important requirements are query performance and data simplicity
• Based on a dimensional, or Star Schema model
• Consists of a fact table referencing a number of dimension tables
• Fact table contains the main data, or measurements, of a business
• Dimension tables, usually smaller tables, describe the different characteristics, or dimensions, of a
business

TPC-DS Benchmark
• Proxy of a real organization data warehouse
• De-facto industry standard benchmark for measuring the performance of decision
support solutions such as RDBMS and Hadoop/Spark based systems
• The underlying business model is a retail product supplier e.g. retail sales, web, catalog
data, inventory, demographics, etc.
• Examines large volumes of data e.g. 1TB to 100TB
• Executes SQL queries of various operational requirements and complexities e.g. ad-hoc,
reporting, data mining
Excerpt from store_sales fact table diagram:

Star Schema Detection in Spark
• Queries against star schema are expected to run fast based on the relationships
among the tables
• SPARK-17791 implements star schema detection based on cardinality
heuristics
• In a query, star schema detection algorithm:
• Finds the tables connected in a star-join
• Lets the Spark Optimizer plan the star-join tables in an optimal way
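
The detection step above can be illustrated with a minimal sketch. It is not the SPARK-17791 implementation; it assumes a simplified heuristic in which the largest table is taken as the fact table and any table joined to it on one of its own unique columns is taken as a dimension. The Table class, the detect_star function, and the row counts are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    rows: int               # estimated cardinality (illustrative numbers only)
    unique_cols: frozenset  # columns assumed to be unique, e.g. primary keys


def detect_star(tables, joins):
    """Return (fact_name, dimension_names) or None if no star-join is found.

    joins is a list of (left_table, left_col, right_table, right_col)
    equi-join predicates. Assumed heuristic: the largest table is the fact
    table; a table joined to it on one of its own unique columns is a dimension.
    """
    by_name = {t.name: t for t in tables}
    fact = max(tables, key=lambda t: t.rows)
    dims = []
    for left, left_col, right, right_col in joins:
        if left == fact.name:
            other, col = by_name[right], right_col
        elif right == fact.name:
            other, col = by_name[left], left_col
        else:
            continue  # this predicate does not involve the fact table
        if col in other.unique_cols and other.name not in dims:
            dims.append(other.name)
    return (fact.name, dims) if dims else None


tables = [
    Table("store_sales", 2_000_000, frozenset()),
    Table("item", 30_000, frozenset({"i_item_sk"})),
    Table("store", 100, frozenset({"s_store_sk"})),
    Table("date_dim", 70_000, frozenset({"d_date_sk"})),
]
joins = [
    ("item", "i_item_sk", "store_sales", "ss_item_sk"),
    ("store", "s_store_sk", "store_sales", "ss_store_sk"),
    ("store_sales", "ss_sold_date_sk", "date_dim", "d_date_sk"),
]
star = detect_star(tables, joins)
print(star)
```

Once the star-join tables are known, the optimizer can plan them together, for example by joining the most selective dimensions to the fact table first.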

Join Reordering using Star Schema
Execution plan transformation:
select i_item_id, s_store_id, avg(ss_net_profit) as store_sales_profit,
avg(sr_net_loss) as store_returns_loss
from
store_sales, store_returns, date_dim d1, date_dim d2, store, item
where
i_item_sk = ss_item_sk and
s_store_sk = ss_store_sk and
ss_customer_sk = sr_customer_sk and
ss_item_sk = sr_item_sk and
ss_ticket_number = sr_ticket_number and
sr_returned_date_sk = d2.d_date_sk and
d1.d_moy = 4 and d1.d_year = 1998 and . . .
group by i_item_id, s_store_id
order by i_item_id, s_store_id
Simplified TPC-DS Query 25
Star schema diagram: item, store, and date_dim around store_sales; a second date_dim on store_returns; store_sales linked to store_returns; edges labelled N:1 or 1:N
Query execution drops from 421 secs to 147 secs
(1TB TPC-DS setup), ~ 3x improvement

Performance Results for Star Schema Queries
• TPC-DS query speedup: 2x – 8x
• By observing relationships among the
tables, Optimizer makes better planning
decisions
• Reduce the data early in the execution plan
• Reduce, or eliminate Sort Merge joins in
favor of more efficient Broadcast Hash
joins
TPC-DS 1TB performance results with star schema detection:

Star Schema to reduce join enumeration search space
• Cost-based Spark Optimizer uses dynamic programming for join enumeration
• One issue is the explosion of the search space when there is a large number of
joins
• Use heuristic search methods to eliminate sub-optimal plans
• SPARK-20233 applies star-schema filters to the dynamic programming
algorithm for join enumeration
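
To give a feel for how a star-schema filter shrinks the search space, here is a small counting sketch. It is an assumption-laden simplification, not the SPARK-20233 code: it counts every way a subset-based dynamic programming enumerator could split a set of tables into two sub-plans, ignores join-graph connectivity, and applies a hypothetical filter that rejects pairings which interleave star and non-star tables.

```python
from itertools import combinations


def count_candidate_joins(tables, star=frozenset()):
    """Count unordered (left, right) sub-plan pairings over all table subsets.

    With a non-empty star set, a pairing is kept only if both sides are
    inside the star-join, both are outside it, or one side contains the
    complete star-join and the other side has only non-star tables.
    """
    tables = frozenset(tables)
    star = frozenset(star)
    non_star = tables - star

    def allowed(left, right):
        if not star:
            return True
        both_star = left <= star and right <= star
        both_non_star = left <= non_star and right <= non_star
        star_complete = (star <= left and right <= non_star) or (
            star <= right and left <= non_star
        )
        return both_star or both_non_star or star_complete

    count = 0
    names = sorted(tables)
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            anchor, rest = subset[0], subset[1:]
            # Fixing the first table on the left side counts each split once.
            for k in range(len(rest)):
                for extra in combinations(rest, k):
                    left = frozenset((anchor, *extra))
                    right = frozenset(subset) - left
                    if allowed(left, right):
                        count += 1
    return count


unfiltered = count_candidate_joins("abcd")
filtered = count_candidate_joins("abcd", star={"a", "b"})
print(unfiltered, filtered)
```

Even with four tables the filter removes most pairings, and the gap widens quickly as the number of joined tables grows.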

Support for Informational Referential Integrity Constraints
• Open up an area of query optimization techniques that rely on referential
integrity (RI) constraints semantics
• Support for informational primary key and foreign key (referential integrity)
constraints
• Not enforced by the Spark SQL engine; rather used by Catalyst to optimize the
query processing
• Targeted to applications that load and analyze data that originated from a Data
Warehouse for which the conditions for a given constraint are known to be true
• SPARK-19842 - umbrella JIRA for RI support

How do Query Optimizers use RI Constraints?
• Implement powerful optimizations based on RI semantics e.g. Join Elimination
• Example using a typical user scenario: queries against views
User view:
create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as
select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity
from store_sales, date_dim, customer, item, store
where d_date_sk = ss_sold_date_sk and
c_customer_sk = ss_customer_sk and
i_item_sk = ss_item_sk and
s_store_sk = ss_store_sk and
d_year = 2002
User query:
select id, first, last, product, quantity
from customer_purchases_2002
where product like 'bicycle%' and
month between 1 and 2
Internal optimizer query processing:
• Selects only a subset of columns from view
• Join between store and store_sales removed based on RI analysis
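
Why the join can be removed is easiest to see on a tiny data set. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine (it is not Spark), with a hypothetical three-row store_sales table. It assumes the foreign key ss_store_sk is non-null and always matches a primary key in store; under that assumption, a query that reads no store columns returns the same rows with or without the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table store (s_store_sk integer primary key, s_store_id text);
    create table store_sales (
        ss_item_sk  integer,
        ss_store_sk integer not null references store (s_store_sk),
        ss_quantity integer
    );
    insert into store values (1, 'S1'), (2, 'S2');
    insert into store_sales values (10, 1, 3), (11, 1, 5), (12, 2, 7);
""")

# Query as written against the view: joins store but uses none of its columns.
with_join = conn.execute("""
    select ss_item_sk, ss_quantity
    from store_sales, store
    where s_store_sk = ss_store_sk
    order by ss_item_sk
""").fetchall()

# Query after join elimination: every sale matches exactly one store,
# so dropping the join neither removes nor duplicates rows.
without_join = conn.execute("""
    select ss_item_sk, ss_quantity
    from store_sales
    order by ss_item_sk
""").fetchall()

print(with_join == without_join)
```

The rewrite is only safe when the constraint really holds in the data, which is why informational constraints are aimed at data that originated in a warehouse where the constraint was enforced.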

Many RI based Optimizations
• Existential subquery to inner joins
• Group by push down through join
• Redundant join elimination
• Distinct elimination
• etc.

Advanced Data Source Push Down Features
• Spark applications often directly query external data sources such as relational databases
or files
• Spark provides Data Sources APIs for accessing structured data through Spark SQL
• These APIs support optimizations such as Filter push down and Column pruning - a subset of the
functionality that can be pushed down to some data sources
• Extend Data Sources APIs with join push down, i.e. go beyond Selection and Projection push down
• Significantly improves query performance by reducing the amount of data transfer and
exploiting the capabilities of the data sources such as index access
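
Filter push down and column pruning can be sketched with a toy data source. The InMemorySource class and its scan method are hypothetical and are not the Spark Data Sources API; they only show the idea that the source, rather than the caller, applies the predicates and returns just the requested columns.

```python
import operator

# Comparison operators this toy source knows how to evaluate.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


class InMemorySource:
    """Toy data source that accepts pushed-down columns and filters."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row

    def scan(self, columns=None, filters=()):
        """Yield rows that pass all filters, restricted to the given columns.

        filters is a sequence of (column, operator, value) triples.
        """
        for row in self.rows:
            if all(OPS[op](row[col], val) for col, op, val in filters):
                yield {c: row[c] for c in (columns or row)}


date_dim = InMemorySource([
    {"d_date_sk": 1, "d_year": 2001, "d_moy": 3},
    {"d_date_sk": 2, "d_year": 2001, "d_moy": 4},
    {"d_date_sk": 3, "d_year": 2002, "d_moy": 4},
])

# Without push down the caller receives every row and every column.
everything = list(date_dim.scan())

# With push down only the matching key leaves the source.
pruned = list(date_dim.scan(
    columns=["d_date_sk"],
    filters=[("d_year", "=", 2001), ("d_moy", "=", 4)],
))
print(len(everything), pruned)
```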

Benefits of Join Push Down
E.g. Selective join with tables in DBMS; DBMS has indexes
• Spark reads the entire tables, applies some predicates at the data source, and executes the
join locally
• Join execution in Spark ~ O(|L|)
• Join execution in DBMS ~ O(|S|)
• Instead, push the join execution to the data source
  • Efficient join execution using index access
  • Reduce the data transfer
  • Query runs 100x faster!
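
The data transfer argument can be reproduced on a small scale. In this sketch sqlite3 plays the role of the remote DBMS and plain Python plays the role of Spark; the table sizes and the 'store-7' predicate are made up for illustration. It compares the number of rows that leave the database when the join runs locally with the number when the join is pushed down.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table store (s_store_sk integer primary key, s_store_name text);
    create table store_sales (ss_store_sk integer, ss_net_profit real);
    create index ss_store_idx on store_sales (ss_store_sk);
""")
db.executemany("insert into store values (?, ?)",
               [(k, f"store-{k}") for k in range(100)])
db.executemany("insert into store_sales values (?, ?)",
               [(i % 100, 1.0) for i in range(10_000)])

# Without join push down: the filter on store is pushed, but the whole
# fact table is transferred and the join is executed locally.
stores = db.execute(
    "select s_store_sk, s_store_name from store where s_store_name = 'store-7'"
).fetchall()
sales = db.execute(
    "select ss_store_sk, ss_net_profit from store_sales"
).fetchall()
names = dict(stores)
local = sorted((names[k], profit) for k, profit in sales if k in names)
rows_local = len(stores) + len(sales)

# With join push down: the database executes the join and can use its index;
# only the join result is transferred.
pushed = db.execute("""
    select s_store_name, ss_net_profit
    from store_sales, store
    where s_store_sk = ss_store_sk and s_store_name = 'store-7'
""").fetchall()
rows_pushed = len(pushed)

print(rows_local, rows_pushed)
```

Both strategies return the same rows; the pushed-down version simply moves far fewer of them across the connection.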

Push down based on Cost vs. Heuristics
• Join push down is not always beneficial e.g. Cartesian product
• Query Optimizer determines the best execution of a SQL query e.g. implementation of
relational operators, order of execution, etc.
• The optimizer must also decide whether the different operations should be done by
Spark, or by the data source
• Needs knowledge of what each data source can do (e.g. file system vs. RDBMS), and how much it
costs (e.g. statistics from data source, network speed, etc.)
• Spark’s Catalyst Optimizer uses a combination of heuristics and cost model
• Cost model is an evolving feature in Spark
• Until cost model is fully implemented, use safe heuristics

Star-Schema Heuristics to Push Down Joins
Filtering joins (e.g. Star-joins), table diagram: store_sales joined to date_dim and store (edges labelled N:1 and 1:N)
Expanding joins, table diagram: store_sales joined to date_dim and store (edges labelled N:1 and 1:N) and to store_returns (edge labelled M:N)
select s_store_id ,s_store_name ,sum(ss_net_profit) as store_sales_profit
from store_sales, date_dim, store
where d_moy = 4 and d_year = 2001 and
d_date_sk = ss_sold_date_sk and
s_store_sk = ss_store_sk
group by s_store_id ,s_store_name
order by s_store_id ,s_store_name
limit 100
select s_store_id ,s_store_name, sum(ss_net_profit) as store_sales_profit, sum(sr_net_loss) as store_loss
from store_sales, date_dim, store, store_returns
where d_moy = 4 and d_year = 2001 and
d_date_sk = ss_sold_date_sk and
s_store_sk = ss_store_sk and
ss_ticket_number = sr_ticket_number and
ss_item_sk = sr_item_sk
group by s_store_id ,s_store_name
order by s_store_id ,s_store_name
limit 100
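
The two cases above suggest a simple safety rule, sketched here under stated assumptions. The Join tuple, is_expanding, and pushable_joins are hypothetical names, not Spark APIs. A join is treated as filtering when the join key is unique on at least one side, and as expanding otherwise; the store_sales to store_returns join is marked M:N because that is how the diagram labels it.

```python
from typing import NamedTuple


class Join(NamedTuple):
    left: str
    right: str
    left_key_unique: bool   # join key is unique on the left table
    right_key_unique: bool  # join key is unique on the right table


def is_expanding(join):
    """An M:N join can return more rows than either input."""
    return not (join.left_key_unique or join.right_key_unique)


def pushable_joins(joins):
    """Keep only the joins that the heuristic considers safe to push down."""
    return [j for j in joins if not is_expanding(j)]


joins = [
    Join("store_sales", "date_dim", False, True),        # N:1
    Join("store_sales", "store", False, True),           # N:1
    Join("store_sales", "store_returns", False, False),  # M:N per the diagram
]
safe = [(j.left, j.right) for j in pushable_joins(joins)]
print(safe)
```

A heuristic like this errs on the side of not pushing a join down, which keeps the worst case no slower than executing the join in Spark.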

Performance Results for Star Join Push Down
• Mix of two data sources: IBM DB2/JDBC and Parquet

TPC-DS Query   spark-2.2 (mins)   spark-2.2-jpd (mins)   Speedup
Q8             32                 4                      8x
Q13            121                5                      25x
Q15            4                  2                      2x
Q17            77                 7                      11x
Q19            42                 4                      11x
Q25            153                7                      21x
Q29            81                 7                      11x
Q42            31                 3                      10x
Q45            14                 3                      4x
Q46            61                 5                      12x
Q48            155                5                      31x
Q52            31                 5                      6x
Q55            31                 4                      8x
Q68            69                 4                      17x
Q74            47                 23                     2x
Q79            63                 4                      15x
Q85            22                 2                      11x
Q89            55                 4                      14x

Cluster: 4-node cluster, each node having:
12 x 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
Apache Hadoop 2.7.3, Apache Spark 2.2 main (August, 2017)
Database info:
Schema: TPCDS
Scale factor: 1TB total space
Mix of Parquet and DB2/JDBC data sources
DB2 DPF info: 4-node cluster, each node having:
10 x 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48

Summary
• Data Warehouse capabilities are important to the next-generation Spark data platform
• Start with data modeling features such as Star Schema and Informational Referential
Integrity Constraints to boost query performance
• Push down processing from Spark into the underlying Data Warehouse to exploit the
features of the underlying database engine

Future work
• Transform Catalyst into a global optimizer
• Global optimizer generates an optimal execution plan across all data sources
• Determines where an operation should be evaluated based on:
1. The cost to execute the operation.
2. The cost to transfer data between Spark and the data sources
• Key factors that affect global optimization:
• Remote table statistics (e.g. number of rows, number of distinct values in each column, etc.)
• Data source characteristics (e.g. CPU speed, I/O rate, network speed, etc.)
• Extend Data Source APIs with data source characteristics
• Retrieve/compute data source table statistics
• Integrate data source cost model into Catalyst
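
The placement decision described above, execution cost plus transfer cost, can be sketched as a toy cost comparison. The choose_site function and every per-row cost below are assumptions made up for illustration; a real cost model would draw on remote table statistics and measured data source characteristics.

```python
def choose_site(rows_in, rows_out, spark_cost_per_row,
                source_cost_per_row, transfer_cost_per_row):
    """Return 'source' or 'spark' for one operation, using a toy linear model.

    In Spark: all input rows are transferred, then processed in Spark.
    At the source: input rows are processed there, then only the result
    rows are transferred.
    """
    cost_spark = rows_in * (transfer_cost_per_row + spark_cost_per_row)
    cost_source = rows_in * source_cost_per_row + rows_out * transfer_cost_per_row
    return "source" if cost_source < cost_spark else "spark"


# Selective join: large input, small result, so pushing down wins.
selective = choose_site(rows_in=1_000_000, rows_out=1_000,
                        spark_cost_per_row=1, source_cost_per_row=2,
                        transfer_cost_per_row=10)

# Cartesian product: small inputs, huge result, so pushing down loses.
cartesian = choose_site(rows_in=2_000, rows_out=1_000_000,
                        spark_cost_per_row=1, source_cost_per_row=2,
                        transfer_cost_per_row=10)

print(selective, cartesian)
```

The same comparison explains why join push down is not always beneficial: when the result is larger than the inputs, transferring the result dominates the cost.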

Thank You.
Ioana Delaney
ursu@us.ibm.com
Visit http://spark.tc
