Performance Analysis of Apache Spark and Presto in Cloud Environments


Víctor Cuevas-Vicenttín,
Barcelona Supercomputing Center
#UnifiedDataAnalytics #SparkAISummit

About BSC
The Barcelona Supercomputing Center (BSC) is the Spanish
national supercomputing facility, and a top EU research institution,
established in 2005 by the Spanish government, the Catalan
government and the UPC/BarcelonaTECH university.
The mission of BSC is to be at the service of the international
scientific community and of industry in need of HPC resources.
BSC's research lines are developed within the framework of
European Union research funding programmes, and the centre
also does basic and applied research in collaboration with
companies like IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.

TPC-DS Benchmark Work
The BSC collaborated with Databricks to run benchmark
comparisons of large-scale analytics computations, using
the TPC-DS Toolkit v2.10.1rc3.
The Transaction Processing Performance Council (TPC)
Benchmark DS (1) has the objective of evaluating decision
support systems, which process large volumes of data in
order to provide answers to real-world business questions.
Our results are not official TPC Benchmark DS results.
Databricks provided BSC an account and credits, which
BSC then independently used for the benchmarking study
with other analytics products on the market.
(1) The TPC is a non-profit corporation focused on developing data-centric
benchmark standards and disseminating objective, verifiable performance data to
the industry.

Context and motivation
• Need to adopt data analytics in a cost-effective
manner
– SQL still very relevant
– Open-source based analytics platforms
– On-demand computing resources from the Cloud
• Evaluate Cloud-based SQL engines

Systems Under Test (SUTs)
• Databricks Unified Analytics Platform
– Based on Apache Spark but with optimized
Databricks Runtime
– Notebooks for interactive development and
production Jobs
– JDBC and custom API access
– Delta storage layer supporting ACID transactions

Systems Under Test (SUTs)
• AWS EMR Presto
– Distributed SQL engine created by Facebook
– Connectors for non-relational and relational sources
– JDBC and CLI access
– Based on in-memory, pipelined parallel execution
• AWS EMR Spark
– Based on open-source Apache Spark

Plan
• TPC Benchmark DS
• Hardware and software configuration
• Benchmarking infrastructure
• Benchmark results and their analysis
• Usability and developer productivity
• Conclusions

TPC Benchmark DS
• Created around 2006 to evaluate decision
support systems
• Based on a retailer with several channels of
distribution
• Process large volumes of data to answer
real-world business questions

TPC Benchmark DS
• Snowflake schema: fact tables associated
with multiple dimension tables
• Data produced by data generator
• 99 queries of various types
– reporting
– ad hoc
– iterative
– data mining

TPC Benchmark DS
• Load Test (1 TB): .dat files loaded into ORC or Parquet tables
• Power Test: Query 1, Query 2, ..., Query 99 in a single stream
• Data Refresh
• Throughput Test: n concurrent streams, with stream i running
Query i,1, Query i,2, ..., Query i,99
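The difference between the Power Test and the Throughput Test can be sketched as a small driver. This is a minimal illustration, not the benchmarking client used in the study: `execute` is a hypothetical callback that submits one query to the system under test, and the shuffled per-stream query order is an assumption made for illustration only.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

NUM_QUERIES = 99  # TPC-DS defines 99 queries


def run_stream(execute, query_ids):
    """Run the given queries one after another; return elapsed seconds."""
    start = time.perf_counter()
    for query_id in query_ids:
        execute(query_id)
    return time.perf_counter() - start


def power_test(execute):
    """Single stream: Query 1, Query 2, ..., Query 99 in sequence."""
    return run_stream(execute, list(range(1, NUM_QUERIES + 1)))


def throughput_test(execute, num_streams, seed=0):
    """Run num_streams concurrent streams, each executing all 99 queries."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_streams):
        order = list(range(1, NUM_QUERIES + 1))
        rng.shuffle(order)  # assumption: illustrative per-stream ordering
        orders.append(order)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        # Consume the iterator so that every stream finishes before timing stops.
        list(pool.map(lambda order: run_stream(execute, order), orders))
    return time.perf_counter() - start
```

Both functions return wall-clock seconds, so the Throughput Test time covers the slowest of the concurrent streams.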

Hardware configuration
Type         vCPUs                               Memory   Local storage
i3.2xlarge   8 (2.3 GHz Intel Xeon E5 2686 v4)   61 GiB   1 x 1,900 GB NVMe SSD
Cluster: 1 master node, 8 worker nodes

Software configuration
System       Versions                               Configuration parameters
Databricks   Runtime 5.5, Spark 2.4.3, Scala 2.11   spark.sql.broadcastTimeout: 7200
                                                    spark.sql.crossJoin.enabled: true
EMR Presto   emr-5.26.0, Presto 0.220               hive.allow-drop-table: true
                                                    hive.compression-codec: SNAPPY
                                                    hive.s3-file-system-type: PRESTO
                                                    query.max-memory: 240 GB
EMR Spark    emr-5.26.0, Spark 2.4.3                spark.sql.broadcastTimeout: 7200
                                                    spark.driver.memory: 5692M

[Diagram: benchmarking infrastructure, with stages: client application,
cluster execution, analysis. Elements shown: SQL, .dat, Parquet, ORC, JAR,
AWS Glue Metastore, .log, .XLSX]

Benchmark execution time (base)

Cost-Based Optimizer (CBO) stats
• Collect table and column-level statistics to
create optimized query evaluation plans
– distinct count, min, max, null count
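As a hedged illustration of how such statistics can be collected in Spark SQL, the sketch below builds `ANALYZE TABLE` statements for one table. The helper name `analyze_statements` is hypothetical, and the table and column names in the usage note are only examples from the TPC-DS schema; the slides do not show the statements that were actually run.

```python
def analyze_statements(table, columns):
    """Build Spark SQL statements that collect table- and column-level stats."""
    # Table-level statistics (row count, size in bytes).
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        # Column-level statistics (distinct count, min, max, null count, ...).
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Spark SQL setting that turns on the cost-based optimizer.
ENABLE_CBO = "SET spark.sql.cbo.enabled=true"
```

For example, `analyze_statements("store_sales", ["ss_sold_date_sk", "ss_item_sk"])` yields one table-level and one column-level statement, which could then be submitted over JDBC or from a notebook.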

Benchmark execution time (stats)
CBO enabled: ↑ 27.11

Speedup with table and column stats
CBO enabled: ↓ 0.60

TPC-DS Power Test – geom. mean

TPC-DS Power Test – arith. mean
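The two Power Test summaries weight long-running queries differently: the arithmetic mean is dominated by the slowest queries, while the geometric mean treats relative differences equally across queries. A minimal sketch, using hypothetical per-query times rather than measured results:

```python
import math


def arithmetic_mean(times):
    """Sum of the query times divided by the number of queries."""
    return sum(times) / len(times)


def geometric_mean(times):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical query times in seconds (not taken from the benchmark).
times = [1.0, 10.0, 100.0]
print(arithmetic_mean(times), geometric_mean(times))
```

With these made-up values the arithmetic mean is 37 seconds, pulled up by the single slow query, whereas the geometric mean is about 10 seconds.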

Additional configuration for Presto
Query-specific configuration parameters
Queries 5, 75, 78, and 80   join_distribution_type: PARTITIONED
Queries 78 and 85           join_reordering_strategy: NONE
Query 67                    task_concurrency: 32
Query 18                    join_reordering_strategy: ELIMINATE_CROSS_JOINS
Session configuration for all queries
query_max_stage_count: 102
join_reordering_strategy: AUTOMATIC
join_distribution_type: AUTOMATIC
Query modifications (carried over to all systems)
Query 72                    manual join re-ordering
Query 95                    add DISTINCT clause
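One way to apply the settings listed above is to issue Presto `SET SESSION` statements before each query, with the query-specific values overriding the session-wide ones. The sketch below is an assumption about how this could be organized; the helper `session_statements` is hypothetical and the slides do not show the mechanism actually used.

```python
# Session configuration applied to all queries (values from the slide).
SESSION_DEFAULTS = {
    "query_max_stage_count": 102,
    "join_reordering_strategy": "AUTOMATIC",
    "join_distribution_type": "AUTOMATIC",
}

# Query-specific overrides (values from the slide), keyed by query number.
QUERY_OVERRIDES = {
    5: {"join_distribution_type": "PARTITIONED"},
    75: {"join_distribution_type": "PARTITIONED"},
    78: {"join_distribution_type": "PARTITIONED",
         "join_reordering_strategy": "NONE"},
    80: {"join_distribution_type": "PARTITIONED"},
    85: {"join_reordering_strategy": "NONE"},
    67: {"task_concurrency": 32},
    18: {"join_reordering_strategy": "ELIMINATE_CROSS_JOINS"},
}


def session_statements(query_id):
    """Return the SET SESSION statements to run before the given query."""
    settings = {**SESSION_DEFAULTS, **QUERY_OVERRIDES.get(query_id, {})}
    statements = []
    for name, value in settings.items():
        # Integers are emitted bare; string values are single-quoted.
        literal = str(value) if isinstance(value, int) else f"'{value}'"
        statements.append(f"SET SESSION {name} = {literal}")
    return statements
```

Keeping the overrides in a single table makes it easy to see how much per-query tuning a system required.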

TPC-DS Power Test – Query 72
• Manually modified join order
catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item
⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
• Databricks optimized join order without stats
Same as modified join order + pushed down selections and projections
• Original benchmark join order
catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈
household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns

TPC-DS Power Test – Query 72
• Databricks optimized join order with stats
(((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item)
⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse))
⟕ promotion) ⟕ catalog_returns) + pushed down selections and projections
• EMR Spark optimized join order with stats
and CBO enabled/disabled
Same as modified join order + pushed down selections and projections
but different physical plans

Dynamic data partitioning
• Splits a table based on the value of a particular
column
– Split only 7 largest tables by date surrogate keys
– One S3 bucket folder for each value
• Databricks and EMR Spark: limit number of files
per partition
• EMR Presto: out of memory error for largest table
– Use Hive with TEZ to load data
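The layout described above, with one folder per partition value, can be sketched as follows. The Hive-style `column=value` folder naming, the bucket name, and the use of `ss_sold_date_sk` as the date surrogate key of `store_sales` are assumptions for illustration; the slides do not list the exact tables, columns, or paths.

```python
from collections import defaultdict


def partition_folder(base_uri, table, column, value):
    """Folder holding all rows of `table` whose `column` equals `value`."""
    # Assumption: Hive-style naming, e.g. .../store_sales/ss_sold_date_sk=2451545/
    return f"{base_uri}/{table}/{column}={value}/"


def split_by_partition(rows, column):
    """Group rows (dicts) by the value of the partitioning column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


# Hypothetical rows and bucket name, for illustration only.
rows = [
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 1},
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 2},
    {"ss_sold_date_sk": 2451546, "ss_item_sk": 3},
]
for value, group in split_by_partition(rows, "ss_sold_date_sk").items():
    folder = partition_folder("s3://example-bucket/tpcds", "store_sales",
                              "ss_sold_date_sk", value)
    print(folder, len(group))
```

A query that filters on the date key then only needs to read the folders for the matching values, which is where the benefit of partitioning comes from.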

Benchmark exec. time (part + stats)
Power Test: 2 failed queries
Throughput Test: 6 failed queries

Speedup with partitioning and stats

TPC Benchmark total execution time

TPC Benchmark DS metric
• The modified primary performance metric is
QphDS@SF = \frac{SF \cdot Q}{\sqrt[3]{T_{LD} \cdot T_{PT} \cdot T_{TT}}}
where
– SF: scale factor
– Q: num. weighted queries = num. streams x 99
– T_{LD}: load factor = 0.1 x num. streams x load time
– T_{PT}, T_{TT}: Power Test and Throughput Test times
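The metric can be evaluated with a few lines of code. This sketch assumes that the denominator is the cube root of the product of the load factor, the Power Test time, and the Throughput Test time, and that all times use the same unit; the input values in the usage note are hypothetical, not benchmark results.

```python
def qphds(scale_factor, num_streams, load_time, power_time, throughput_time):
    """Modified QphDS@SF under the cube-root assumption stated above."""
    weighted_queries = num_streams * 99           # Q
    load_factor = 0.1 * num_streams * load_time   # T_LD
    # Assumption: cube root of the product of the three time terms.
    denominator = (load_factor * power_time * throughput_time) ** (1.0 / 3.0)
    return scale_factor * weighted_queries / denominator
```

For instance, `qphds(1000, 4, 2.0, 2.0, 5.0)` uses Q = 396 and a load factor of 0.8, so the product under the root is 8 and the metric is approximately 198,000. A higher value indicates better performance.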

TPC Benchmark DS metric

System costs
System cost = num. nodes × node cost per hour × exec. time in hours
Node cost per hour = node hardware cost + node software cost
System       Hardware   Software
EMR Presto   $0.624     $0.156
EMR Spark    $0.624     $0.156
Databricks   $0.624     $0.3
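The cost model is simple enough to express directly. In this sketch the per-node hourly prices come from the table above, while the node count and execution time in the usage note are hypothetical inputs; in particular, whether the master node is counted is an assumption left to the caller.

```python
# Hourly cost per node in USD: (hardware, software), from the table above.
HOURLY_NODE_COSTS = {
    "EMR Presto": (0.624, 0.156),
    "EMR Spark": (0.624, 0.156),
    "Databricks": (0.624, 0.3),
}


def node_cost_per_hour(system):
    """Node hardware cost plus node software cost, per hour."""
    hardware, software = HOURLY_NODE_COSTS[system]
    return hardware + software


def system_cost(system, num_nodes, exec_hours):
    """Num. nodes x node cost per hour x execution time in hours."""
    return num_nodes * node_cost_per_hour(system) * exec_hours
```

As an example, `system_cost("EMR Spark", 9, 10.0)` charges 9 nodes (assuming the master is billed like a worker) at $0.78 per hour for a hypothetical 10-hour run, which comes to about $70.20. A system with a higher hourly price can still be cheaper overall if it finishes the benchmark sooner.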

TPC Benchmark DS cost

TPC-DS price-performance

Disk utilization
• Databricks
– Automatically caches hot input data
– Requires machines with NVMe SSDs
• EMR Presto
– Experimental spilling of state to disk
– “we do not configure any of the Facebook
deployments to spill…local disks would increase
hardware costs…”
Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802-1813

[Charts: Databricks and EMR Presto]

Usability and developer productivity
Feature                                            EMR Presto   EMR Spark   Databricks
Easy and flexible cluster creation                 ✓            ✓           ✓
Framework configuration at cluster creation time   ✓            ✓           ✓
Direct distributed file system support             ✗            ✗           ✓
Independent data catalog (metastore)               ✓            ✓           ✓
Support for notebooks                              ✓            ✓           ✓
Integrated Web GUI                                 ✗            ✗           ✓

Usability and developer productivity
Feature                                                       EMR Presto   EMR Spark   Databricks
JDBC access                                                   ✓            ✓           ✓
Programmatic interface                                        ✗            ✓           ✓
Job creation and management infrastructure                    ✗            ✗           ✓
Customized visualization of query plan execution              ✓            ✓           ✓
Resource utilization monitoring with Ganglia and CloudWatch   ✓            ✓           ✓

Conclusions
• Databricks is about 4x faster than EMR Presto
without statistics
– About 3x faster with them
• Difference smaller with EMR Spark
– Databricks still more cost-effective
– More efficient runtime, cache, and CBO
• Databricks and EMR Spark deal better with
concurrency and benefit from data partitioning

Conclusions
• EMR Presto requires significantly more tuning
– Minimal for Databricks and EMR Spark
• Functionality of Databricks and EMR
Presto/Spark for SQL very similar
– Databricks more user friendly in some aspects
