Víctor Cuevas-Vicenttín, Barcelona Supercomputing Center
Performance Analysis of Apache Spark and Presto in Cloud Environments
#UnifiedDataAnalytics #SparkAISummit
About BSC
The Barcelona Supercomputing Center (BSC) is the Spanish national supercomputing facility and a top EU research institution, established in 2005 by the Spanish government, the Catalan government, and the UPC/BarcelonaTECH university.
The mission of BSC is to be at the service of the international scientific community and of industry in need of HPC resources.
BSC's research lines are developed within the framework of European Union research funding programmes, and the centre also does basic and applied research in collaboration with companies like IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.
13.7 Petaflops
TPC-DS Benchmark Work
BSC collaborated with Databricks to run benchmark comparisons of large-scale analytics computations, using the TPC-DS Toolkit v2.10.1rc3.
The Transaction Processing Performance Council (TPC) Benchmark DS (1) has the objective of evaluating decision support systems, which process large volumes of data in order to provide answers to real-world business questions. Our results are not official TPC Benchmark DS results.
Databricks provided BSC with an account and credits, which BSC then used independently for the benchmarking study with other analytics products on the market.
(1) The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable performance data to the industry.
Context and motivation
• Need to adopt data analytics in a cost-effective manner
– SQL still very relevant
– Open-source based analytics platforms
– On-demand computing resources from the Cloud
• Evaluate Cloud-based SQL engines
Systems Under Test (SUTs)
• Databricks Unified Analytics Platform
– Based on Apache Spark but with optimized Databricks Runtime
– Notebooks for interactive development and production Jobs
– JDBC and custom API access
– Delta storage layer supporting ACID transactions
Systems Under Test (SUTs)
• AWS EMR Presto
– Distributed SQL engine created by Facebook
– Connectors for non-relational and relational sources
– JDBC and CLI access
– Based on in-memory, pipelined parallel execution
• AWS EMR Spark
– Based on open-source Apache Spark
Plan
• TPC Benchmark DS
• Hardware and software configuration
• Benchmarking infrastructure
• Benchmark results and their analysis
• Usability and developer productivity
• Conclusions
TPC Benchmark DS
• Created around 2006 to evaluate decision support systems
• Based on a retailer with several channels of distribution
• Process large volumes of data to answer real-world business questions
TPC Benchmark DS
• Snowflake schema: fact tables associated with multiple dimension tables
• Data produced by data generator
• 99 queries of various types
– reporting
– ad hoc
– iterative
– data mining
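To make the fact/dimension layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the TPC-DS schema (store_sales, date_dim, item), but only a few columns are kept and the rows are made up for illustration. The query is a reporting-style example, not one of the 99 benchmark queries.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one fact table and two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_year INTEGER);
        CREATE TABLE item (i_item_sk INTEGER PRIMARY KEY, i_category TEXT);
        -- Fact table: each row points at dimension rows through surrogate keys.
        CREATE TABLE store_sales (
            ss_sold_date_sk INTEGER REFERENCES date_dim(d_date_sk),
            ss_item_sk INTEGER REFERENCES item(i_item_sk),
            ss_ext_sales_price REAL
        );
        -- Made-up rows, for illustration only.
        INSERT INTO date_dim VALUES (1, 2001), (2, 2002);
        INSERT INTO item VALUES (10, 'Books'), (20, 'Music');
        INSERT INTO store_sales VALUES
            (1, 10, 5.0), (1, 20, 7.5), (2, 10, 2.5), (2, 10, 4.0);
    """)
    return conn


def sales_by_year_and_category(conn):
    """Join the fact table to both dimensions and aggregate the sales amount."""
    return conn.execute("""
        SELECT d.d_year, i.i_category, SUM(s.ss_ext_sales_price) AS total
        FROM store_sales s
        JOIN date_dim d ON s.ss_sold_date_sk = d.d_date_sk
        JOIN item i ON s.ss_item_sk = i.i_item_sk
        GROUP BY d.d_year, i.i_category
        ORDER BY d.d_year, i.i_category
    """).fetchall()


if __name__ == "__main__":
    for row in sales_by_year_and_category(build_demo_db()):
        print(row)
```

The fact table carries only keys and measures; everything descriptive lives in the dimension tables, which is why most benchmark queries are dominated by joins.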
TPC Benchmark DS
• Load Test (1 TB): .dat files loaded as ORC or Parquet
• Power Test: Query 1, Query 2, ..., Query 99
• Data Refresh
• Throughput Test: streams 1 to n, each running Query i,1, Query i,2, ..., Query i,99
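The difference between the Power Test and the Throughput Test can be sketched as follows. This is a minimal illustration, not the harness used in the study: run_query is a hypothetical stand-in for submitting a query to the system under test, and the seeded shuffle is an assumption standing in for whatever per-stream query order the benchmark toolkit produces.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

NUM_QUERIES = 99


def run_query(stream_id, query_id):
    """Hypothetical stand-in for submitting one query to the system under test."""
    return (stream_id, query_id)


def run_stream(stream_id, seed=0):
    """Run all 99 queries of one stream; return (elapsed seconds, results)."""
    order = list(range(1, NUM_QUERIES + 1))
    if stream_id > 0:
        # Assumption: a seeded shuffle stands in for the per-stream ordering.
        random.Random(seed + stream_id).shuffle(order)
    start = time.perf_counter()
    results = [run_query(stream_id, q) for q in order]
    return time.perf_counter() - start, results


def power_test():
    """Power Test: a single stream, queries executed one after another."""
    return run_stream(stream_id=0)


def throughput_test(num_streams):
    """Throughput Test: several streams submitted concurrently."""
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        return list(pool.map(run_stream, range(1, num_streams + 1)))
```

The Power Test measures single-user latency, while the Throughput Test shows how the engine behaves when several query streams compete for the same cluster.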
Hardware configuration
Type | vCPUs | Memory | Local storage
i3.2xlarge | 8 (2.3 GHz Intel Xeon E5 2686 v4) | 61 GiB | 1 x 1,900 GB NVMe SSD
Cluster: 1 master node, 8 worker nodes
Software configuration
System | Versions | Configuration parameters
Databricks | Runtime 5.5, Spark 2.4.3, Scala 2.11 | spark.sql.broadcastTimeout: 7200; spark.sql.crossJoin.enabled: true
EMR Presto | emr-5.26.0, Presto 0.220 | hive.allow-drop-table: true; hive.compression-codec: SNAPPY; hive.s3-file-system-type: PRESTO; query.max-memory: 240 GB
EMR Spark | emr-5.26.0, Spark 2.4.3 | spark.sql.broadcastTimeout: 7200; spark.driver.memory: 5692M
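As an illustration of how the Spark parameters on this slide might be supplied, the sketch below renders them as spark-submit style `--conf key=value` arguments. The slides do not say how the settings were actually applied (cluster UI, configuration files, or command line), so the rendering is an assumption, and to_submit_args is a hypothetical helper.

```python
# Parameter values copied from the slide; the --conf rendering is an assumption.
DATABRICKS_SPARK_CONF = {
    "spark.sql.broadcastTimeout": "7200",
    "spark.sql.crossJoin.enabled": "true",
}
EMR_SPARK_CONF = {
    "spark.sql.broadcastTimeout": "7200",
    "spark.driver.memory": "5692M",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit style '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(EMR_SPARK_CONF)))
```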
[Figure: benchmarking infrastructure. Labels: SQL, .dat, parquet, ORC, client application, cluster execution, analysis, JAR, .log, AWS Glue Metastore, .XLSX]
Benchmark execution time (base)
Cost-Based Optimizer (CBO) stats
• Collect table and column-level statistics to create optimized query evaluation plans
– distinct count, min, max, null count
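In Spark SQL, statistics of this kind are collected with ANALYZE TABLE statements. The sketch below only builds the statement strings; analyze_statements is a hypothetical helper, the table and columns are taken from the TPC-DS schema as an example, and the slides do not show the exact commands used in the study.

```python
def analyze_statements(table, columns):
    """Build Spark SQL statements that collect table- and column-level statistics."""
    stmts = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        column_list = ", ".join(columns)
        stmts.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return stmts


# Cost-based optimization in open-source Spark is controlled by this setting;
# whether and how it was toggled per run in the study is not stated here.
ENABLE_CBO = "SET spark.sql.cbo.enabled=true"

if __name__ == "__main__":
    for stmt in analyze_statements("store_sales", ["ss_sold_date_sk", "ss_item_sk"]):
        print(stmt)
```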
Benchmark execution time (stats)
CBO enabled: ↑ 27.11
Speedup with table and column stats
CBO enabled: ↓ 0.60
TPC-DS Power Test – geom. mean
TPC-DS Power Test – arith. mean
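The two means summarize per-query times differently: the arithmetic mean is dominated by the slowest queries, while the geometric mean weighs relative differences equally across queries. A small sketch with hypothetical times (not measurements from the study) shows the gap.

```python
import math


def arithmetic_mean(times):
    """Sum of the values divided by their count."""
    return sum(times) / len(times)


def geometric_mean(times):
    """nth root of the product, computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical per-query times in seconds; one query is far slower than the rest.
TIMES = [2.0, 4.0, 8.0, 1024.0]

if __name__ == "__main__":
    print("arithmetic:", arithmetic_mean(TIMES))
    print("geometric:", geometric_mean(TIMES))
```

With one long-running query in the set, the arithmetic mean is pulled far above the typical query time, which is one reason to look at both summaries.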
Additional configuration for Presto
Query-specific configuration parameters
Queries | Parameter
5, 75, 78, and 80 | join_distribution_type: PARTITIONED
78 and 85 | join_reordering_strategy: NONE
67 | task_concurrency: 32
18 | join_reordering_strategy: ELIMINATE_CROSS_JOINS
Session configuration for all queries
query_max_stage_count: 102
join_reordering_strategy: AUTOMATIC
join_distribution_type: AUTOMATIC
Query modifications (carried over to all systems)
Query | Modification
72 | manual join re-ordering
95 | add distinct clause
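One way to picture this tuning is as a set of session defaults plus per-query overrides, emitted as Presto SET SESSION statements before each query. The property names and values below come from the slide. The helper session_statements is hypothetical, and the slides do not say which mechanism (CLI, JDBC properties, or SET SESSION) was used in the study.

```python
# Session properties applied to every query (values from the slide).
GLOBAL_SESSION = {
    "query_max_stage_count": 102,
    "join_reordering_strategy": "AUTOMATIC",
    "join_distribution_type": "AUTOMATIC",
}

# Per-query overrides (values from the slide), keyed by TPC-DS query number.
QUERY_OVERRIDES = {
    5: {"join_distribution_type": "PARTITIONED"},
    18: {"join_reordering_strategy": "ELIMINATE_CROSS_JOINS"},
    67: {"task_concurrency": 32},
    75: {"join_distribution_type": "PARTITIONED"},
    78: {"join_distribution_type": "PARTITIONED", "join_reordering_strategy": "NONE"},
    80: {"join_distribution_type": "PARTITIONED"},
    85: {"join_reordering_strategy": "NONE"},
}


def session_statements(query_id):
    """Build SET SESSION statements for one query: defaults plus its overrides."""
    props = {**GLOBAL_SESSION, **QUERY_OVERRIDES.get(query_id, {})}
    stmts = []
    for name, value in props.items():
        # Integers are emitted bare, strings as single-quoted literals.
        literal = str(value) if isinstance(value, int) else f"'{value}'"
        stmts.append(f"SET SESSION {name} = {literal}")
    return stmts


if __name__ == "__main__":
    for stmt in session_statements(78):
        print(stmt)
```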
TPC-DS Power Test – Query 72
• Manually modified join order
catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
• Databricks optimized join order, no stats
Same as modified join order + pushed down selections and projections
• Original benchmark join order
catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns
TPC-DS Power Test – Query 72
• Databricks optimized join order with stats
(((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item) ⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse)) ⟕ promotion) ⟕ catalog_returns) + pushed down selections and projections
• EMR Spark optimized join order with stats and CBO enabled/disabled
Same as modified join order + pushed down selections and projections, but different physical plans
Dynamic data partitioning
• Splits a table based on the value of a particular column
– Split only 7 largest tables by date surrogate keys
– One S3 bucket folder for each value
• Databricks and EMR Spark: limit number of files per partition
• EMR Presto: out of memory error for largest table
– Use Hive with TEZ to load data
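The idea of one folder per partition value can be sketched in a few lines. The `column=value` folder naming is the common Hive-style convention and is an assumption here, as are the bucket name and the rows. The engines do this at write time, and the helpers below are hypothetical.

```python
from collections import defaultdict


def partition_folder(base_uri, column, value):
    """Hive-style folder for one partition value (naming is an assumption)."""
    return f"{base_uri.rstrip('/')}/{column}={value}/"


def split_by_partition(rows, column, base_uri):
    """Group rows by the partition column; each distinct value gets one folder."""
    folders = defaultdict(list)
    for row in rows:
        folders[partition_folder(base_uri, column, row[column])].append(row)
    return dict(folders)


# Hypothetical bucket and rows, for illustration only.
ROWS = [
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 1},
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 2},
    {"ss_sold_date_sk": 2451546, "ss_item_sk": 1},
]

if __name__ == "__main__":
    folders = split_by_partition(ROWS, "ss_sold_date_sk", "s3://example-bucket/store_sales")
    for folder, group in folders.items():
        print(folder, len(group))
```

A query that filters on the date key only needs to read the matching folders, which is where the speedup from partitioning comes from.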
Benchmark exec. time (part + stats)
Power Test: 2 failed queries
Throughput Test: 6 failed queries
Speedup with partitioning and stats
TPC Benchmark total execution time
TPC Benchmark DS metric
• The modified primary performance metric is
$$QphDS@SF = \frac{SF \cdot Q}{\sqrt[3]{T_{LD} \cdot T_{PT} \cdot T_{TT}}}$$
– $SF$: scale factor
– $Q$: number of weighted queries, num. streams × 99
– $T_{LD}$: load factor, 0.1 × num. streams × load time
– $T_{PT}$, $T_{TT}$: Power Test and Throughput Test times
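A small calculation makes the metric easier to follow. The sketch assumes the denominator is the cube root of the product of the load factor and the two test times, and that all times are expressed in hours. Both are assumptions about the slide's formula, and the input values are hypothetical.

```python
def qphds(scale_factor, num_streams, load_hours, power_hours, throughput_hours):
    """Modified QphDS@SF, assuming a cube root over the three time terms."""
    q = num_streams * 99                    # number of weighted queries
    t_ld = 0.1 * num_streams * load_hours   # load factor as stated on the slide
    denominator = (t_ld * power_hours * throughput_hours) ** (1.0 / 3.0)
    return scale_factor * q / denominator


if __name__ == "__main__":
    # Hypothetical inputs, not results from the study.
    print(qphds(scale_factor=1000, num_streams=4, load_hours=2.5,
                power_hours=2.0, throughput_hours=4.0))
```

A higher value is better: shorter load, Power Test, or Throughput Test times all shrink the denominator.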
TPC Benchmark DS metric
System costs
Cost = num. nodes × node cost per hour × exec. time in hours
Node cost per hour = node hardware costs + node software costs
System | Hardware | Software
EMR Presto | $0.624 | $0.156
EMR Spark | $0.624 | $0.156
Databricks | $0.624 | $0.3
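The cost formula can be worked through with the per-node prices from the slide. The execution time below is hypothetical, and counting all 9 nodes (1 master plus 8 workers) is an assumption about how the study billed the cluster.

```python
# (hardware, software) cost in USD per node-hour, taken from the slide.
NODE_COST_PER_HOUR = {
    "EMR Presto": (0.624, 0.156),
    "EMR Spark": (0.624, 0.156),
    "Databricks": (0.624, 0.3),
}


def system_cost(system, num_nodes, exec_hours):
    """num. nodes x (hardware + software cost per node-hour) x execution hours."""
    hardware, software = NODE_COST_PER_HOUR[system]
    return num_nodes * (hardware + software) * exec_hours


if __name__ == "__main__":
    # Hypothetical 2-hour run on 9 nodes (assumed: 1 master + 8 workers).
    for name in NODE_COST_PER_HOUR:
        print(name, round(system_cost(name, num_nodes=9, exec_hours=2.0), 3))
```

Because Databricks has the higher hourly software cost, it only comes out cheaper overall when its shorter execution time more than offsets that difference.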
TPC Benchmark DS cost
TPC-DS price-performance
Disk utilization
• Databricks
– Automatically caches hot input data
– Requires machines with NVMe SSDs
• EMR Presto
– Experimental spilling of state to disk
– “we do not configure any of the Facebook deployments to spill…local disks would increase hardware costs…”
Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802-1813
[Two figures, each labeled Databricks and EMR Presto]
Usability and developer productivity
Feature | EMR Presto | EMR Spark | Databricks
Easy and flexible cluster creation | ✓ | ✓ | ✓
Framework configuration at cluster creation time | ✓ | ✓ | ✓
Direct distributed file system support | ✗ | ✗ | ✓
Independent data catalog (metastore) | ✓ | ✓ | ✓
Support for notebooks | ✓ | ✓ | ✓
Integrated Web GUI | ✗ | ✗ | ✓
Usability and developer productivity
Feature | EMR Presto | EMR Spark | Databricks
JDBC access | ✓ | ✓ | ✓
Programmatic interface | ✗ | ✓ | ✓
Job creation and management infrastructure | ✗ | ✗ | ✓
Customized visualization of query plan execution | ✓ | ✓ | ✓
Resource utilization monitoring with Ganglia and CloudWatch | ✓ | ✓ | ✓
Conclusions
• Databricks is about 4x faster than EMR Presto without statistics
– About 3x faster with them
• Difference smaller with EMR Spark
– Databricks still more cost-effective
– More efficient runtime, cache, and CBO
• Databricks and EMR Spark deal better with concurrency and benefit from data partitioning
Conclusions
• EMR Presto requires significantly more tuning
– Minimal for Databricks and EMR Spark
• Functionality of Databricks and EMR Presto/Spark for SQL very similar
– Databricks more user friendly in some aspects
