Enabling Presto to Handle
Massive Scale at Lightning Speed
Fast and Scalable Data Processing
Raunaq Morarka
26/3/2019
About Presenter
● Raunaq Morarka, Staff Engineer at Qubole, Bangalore
● Bio: I currently work on the Presto team at Qubole. My areas of interest are distributed database systems and software performance optimization. At Qubole I have worked on features related to scheduling, autoscaling, and the use of spot nodes for running Presto as a service in the cloud. I have recently started contributing to the PrestoSQL open source project. Before Qubole, I worked on a distributed time-series columnar database supporting real-time ingest and low-latency queries at an internet-scale company.
Agenda
● State of Presto today
○ Background – Introduction, Why Presto
○ Presto Architecture
○ Usage Overview – Recent Growth and Adoption Trends
● Presto in the Cloud
○ Optimizing for Scale
■ Autoscaling
■ Maximizing the Benefits of the Cloud
○ Optimizing for Speed
■ Dynamic Filtering, Join Reordering, Join Strategy Selection
■ RubiX – The next-generation column level optimized caching on Presto
● Future roadmap
State of Presto Today
State of Presto Today - Background
What is Presto?
• Distributed SQL query engine that originated at Facebook in 2013
• ANSI SQL Compliant
• Supports Federated Queries
• Pluggable data sources
• Completely in-memory and pipelined execution model
Why Presto?
• Built for a variety of use cases: low-latency user-facing applications, exploratory analysis through BI tools, batch ETL
• Data source agnostic: HDFS, RDBMSs, NoSQL, stream processing, cloud object stores (S3, ADLS, GCS); see the federated query example after this list
• Zero configuration ideology
• Proven in production at Petabyte Scale: Facebook, Netflix, Airbnb, Uber, LinkedIn, Qubole, and more
• Highly Extensible
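As a quick illustration of federated queries across pluggable data sources, the sketch below joins a Hive table with a MySQL table in a single statement. The catalog, schema, and table names (hive.web.page_views, mysql.crm.customers) are hypothetical; any pair of configured connectors can be combined this way.

-- Join clickstream data in cloud storage (Hive connector) with customer data in MySQL
SELECT c.segment, count(*) AS views
FROM hive.web.page_views v
JOIN mysql.crm.customers c
  ON v.customer_id = c.id
GROUP BY c.segment
ORDER BY views DESC;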
Presto Architecture
Query Lifecycle
● Client submits SQL query to the Coordinator using an HTTP REST API
● ANTLR-based parser converts the query into a syntax tree
● Logical Planner generates a logical plan tree
● Optimizer transforms logical plan into an efficient execution strategy
• RBO (predicate and limit pushdown, column pruning, partition pruning etc.)
• CBO (Join reordering, Join strategy selection)
• Takes advantage of Data layout (partitioning, sorting, grouping and indices)
• Inter-node parallelism by breaking up plan into Stages that can be executed in parallel across
workers
• Intra-node parallelism by running a sequence of operators (pipeline) in multiple threads (see the EXPLAIN sketch below)
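The plan produced by these steps can be inspected with Presto's EXPLAIN statement. A minimal sketch (the orders table is hypothetical):

-- Logical plan after optimizer rules have been applied
EXPLAIN SELECT region, count(*) FROM orders GROUP BY region;

-- Distributed plan, showing the fragments (stages) executed in parallel across workers
EXPLAIN (TYPE DISTRIBUTED) SELECT region, count(*) FROM orders GROUP BY region;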
Scheduling
● Coordinator distributes the plan to workers, starts execution of tasks, and then begins to enumerate splits, which are opaque handles to addressable chunks of data in an external storage system
● Splits are assigned to the tasks responsible for reading this data
Exchange (Shuffles)
● Presto uses in-memory buffered shuffles over HTTP to exchange intermediate results for
different stages of a query
● Tasks produce data into an in-memory output buffer
● Workers consume results from other workers through an exchange client which uses HTTP
long-polling
● Exchange client buffers data before it is processed (input buffer)
● Exchange server retains data until client acknowledges receipt
● Engine tunes parallelism to maintain target utilization rates for output and input buffers
Split Assignment
Presto asks connectors to enumerate small batches of splits, and assigns them to tasks lazily
● Decouples query response time from time taken for listing files
● Avoids enumerating all splits when a query is cancelled or finishes early, e.g. when a LIMIT clause is satisfied
● Workers maintain a queue of splits. The coordinator assigns new splits to tasks with the
shortest queue. Keeping these queues small allows the system to adapt to variance in CPU cost
of processing different splits and performance differences among workers
● Allows queries to execute without having to hold all their metadata in memory
● Lazy split enumeration can make it difficult to accurately estimate and report query progress
State of Presto Today – Usage Overview
Presto grew 420% in terms of compute hours on Qubole’s cloud platform from January 2017 to January 2018.
Customers in aggregate are running 24x more commands per hour in Presto than Apache Spark and 6x
more commands than Apache Hadoop/Hive.
State of Presto Today – Usage Overview
Top Three Industries Using Presto
1. Entertainment
2. Travel Services
3. Gaming
Verticals everywhere are adopting Presto
Presto in the Cloud
Optimizing for Scale – Autoscaling
● Scale clusters in range [min size, max size]
● Scale up for the increased workload
● Scale down when load goes down
● Graceful scale down
● Usually implemented by defining rules on top of CPU/memory/IO metrics exported by the system
● Qubole’s implementation
○ Monitor progress of queries
○ Intelligent decision making to scale up only if it can help to meet a given SLA
○ Handle bursty workloads by avoiding fixed step sizes
○ Finer controls like grouped scale up/down, cool down period, etc.
○ Automatic termination of idle clusters
○ Self start of cluster in response to first query on a shutdown cluster
Required workers
● Non-source stages cannot be redistributed to take advantage of newly added nodes
● Min size of cluster must be large enough to avoid query failures
● Choice between high cost and degraded performance for initial queries
● Required workers is a mechanism to delay query execution until a minimum number of worker nodes join the cluster (see the configuration sketch after this list)
● Integration with Qubole’s autoscaling
○ Scale up cluster to satisfy min workers requirement
○ Avoid scaling up for DDL and monitoring related queries
○ Scale down to 1 node during periods of inactivity
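In open source Presto, this behavior is controlled through coordinator configuration. A minimal sketch, assuming the required-workers properties as documented for Presto of this era (names may vary by version):

# config.properties (coordinator)
# Hold queries until at least this many workers have joined the cluster
query-manager.required-workers=6
# Maximum time a query may wait for the required workers
query-manager.required-workers-max-wait=5m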
                              Config A    Config B    Config C
Total time taken              5h 12m      4h 26m      4h 37m
Total node runtime (seconds)  143137      134664      124351
Min size                      2           6           1 (6 required nodes)
Optimizing for Scale – Maximizing the Benefits of the Cloud
● Spot nodes are generally available at highly discounted prices
● Presto cannot utilize them well out of the box due to its pipelined, in-memory execution architecture
● Spot loss will lead to failure of all queries which had any part of their execution tree running on
that spot node
● Presto is usually run on newer generation, high memory instance types which experience spot
loss more often due to greater demand
● Qubole’s handling of Spot termination notification
○ Proactive addition of nodes to maintain cluster size
○ Avoid scheduling tasks on a spot node after receiving an STN
○ Acquire on-demand nodes quickly
○ Lazily rebalance to achieve the desired spot ratio
○ No query failures if all queries finish within the 2-minute spot termination notice window
Query retries
● Fallback for query failures that cannot be handled by the STN integration
● Query retries should be transparent to clients and work with all Presto clients: Java CLI, JDBC/ODBC drivers, Ruby client, etc.
● The retry should happen only if it is guaranteed that no partial results have yet been sent to
the client
● The retry should happen only if changes (if any) caused by the failed query have been rolled back, e.g. for INSERTs, data written by the failed query has been deleted
● The retry should happen only if there is a chance of successful query execution
● Qubole’s implementation
○ Presto server responsible for retries; clients are redirected to the new query without any changes required on the client side
○ Convert SELECT queries into IOD queries, so clients get results only after the query has finished (see the sketch after this list)
○ Track rollback status of query
○ Retry in bigger cluster if the failure is due to insufficient memory
○ Retry when cluster size stabilizes if the failure is due to node loss
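IOD here presumably stands for INSERT OVERWRITE DIRECTORY: results are staged in cloud storage and handed to the client only once the query succeeds, which keeps retries invisible. A hypothetical sketch (Hive-style syntax shown for illustration; the output path is made up, and the actual internal rewrite may differ):

-- Original client query
SELECT * FROM orders WHERE order_date = DATE '2019-03-26';

-- Internally rewritten so no partial results reach the client before the query finishes
INSERT OVERWRITE DIRECTORY 's3://results-bucket/query-id/'
SELECT * FROM orders WHERE order_date = DATE '2019-03-26';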
Optimizing for Speed – Dynamic Filtering
SELECT (...)
FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk (...)
WHERE d_year = 2000 and d_moy = 12 (...)
(... GROUP BY ... ORDER BY ...)
Currently (assuming the tables are not partitioned), Presto performs a full table scan of both tables. Dynamic filtering allows it to instead:
1. Skip accessing fact table partitions not needed by the query (partition pruning)
2. Filter rows on probe side of join by sending only the subset of rows that match the join keys
across the network
3. If the storage format supports predicate pushdown, use runtime filters to avoid scanning data on the probe side (see the conceptual rewrite below)
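Dynamic filtering does not change the join itself; conceptually, the join keys collected at runtime from the build side make the probe-side scan of store_sales behave as if the query had been written as:

SELECT (...)
FROM store_sales
WHERE ss_sold_date_sk IN (
    SELECT d_date_sk FROM date_dim
    WHERE d_year = 2000 AND d_moy = 12
) (...)

with the IN-list materialized at runtime rather than written by hand.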
[Diagram: Dynamic Filtering concept]
Dynamic Filtering results
• Runtime of 13 queries improved by at least 5x.
• Runtime of 13 queries improved by 3x to 5x.
• Runtime of 22 queries improved by 1.5x to 3x.
• 14 queries that previously failed to run succeeded.
Optimizing for Speed – Join Reordering
• Smaller table on the right for better performance (Presto builds the join hash table from the right-hand side)
• Difficult to ensure this in a multi-join query
• Join Reordering optimizer rule to the rescue, as illustrated below
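For example (hypothetical table names), with the hash table built from the right-hand input, the optimizer rewrites the first form into the second:

-- Before reordering: the large table lands on the build (right) side
SELECT count(*)
FROM small_dim d
JOIN big_fact f ON d.d_key = f.f_key;

-- After reordering: the small table builds the in-memory hash table
SELECT count(*)
FROM big_fact f
JOIN small_dim d ON f.f_key = d.d_key;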
Optimizing for Speed – Join Reordering
[Diagram: the join tree with condition A.a = B.b is rewritten as B.b = A.a, swapping the build and probe sides.]
Join Reordering is made for the case where the build side of a join is expensive.
TPC-DS scale 3000*: 3~6x speedups, geomean 3.1x
Join strategy selection
● Broadcast (map-side join) vs. Repartitioned (shuffle join); a session-property sketch follows this list
● Repartitioned
○ Default
○ Low memory usage
○ Both build and probe side need to be partitioned
○ More efficient for joins between large tables of similar size
● Broadcast
○ High memory usage, build side table must fit in memory
○ Probe side does not need to be partitioned
○ Build side table broadcast to all nodes
○ More efficient for joins where one table is of small size
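In open source Presto, the strategy can be chosen per session; a minimal sketch using the join_distribution_type session property (values as in Presto releases of this era, verify for your version):

-- Let the cost-based optimizer pick, using table statistics when available
SET SESSION join_distribution_type = 'AUTOMATIC';
-- Force a broadcast join when the build side is known to be small
SET SESSION join_distribution_type = 'BROADCAST';
-- Force a repartitioned (shuffle) join for two large tables
SET SESSION join_distribution_type = 'PARTITIONED';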
Optimizing for Speed – RubiX
RubiX is the next-generation column level optimized caching on Presto, for lightning fast big data
analytics on cloud storage
• Caches file chunks
• Shared cache across JVMs
• Engine-independent scheduling logic
[Benchmark chart: avg. ~20% improvement]
Presto open source roadmap
● Coordinator scalability and HA
● Allow connectors to participate in query optimization
● Improvements to spill-to-disk functionality
● Partial recovery support for failure of long running queries
● Ranger plugin
● Qubole contributions
○ Dynamic Filtering
○ Kinesis Connector
Q&A
Thanks for attending!
Please feel free to reach out to me at raunaqm@qubole.com
