Enabling Presto Caching at Uber with Alluxio

The document discusses enabling Presto caching at Uber using Alluxio, highlighting the architecture, data characteristics, and problems with HDFS latency. It details initial testing results, improvements in query performance, and outlines the implementation of a persistent file-level metadata store to manage cache and prevent stale caching. Future work includes performance tuning and enhancements for more efficient data handling.

Data informs every decision at Uber
Marketplace
Pricing
Community
Operations
Growth Marketing Data Science
Compliance
Eats

Presto @ Uber-scale
12K
Monthly Active Users
400K
Queries/day
2
Data Centers
6K
Nodes
14
Clusters
50PB
HDFS data
processed/day

Workloads
Interactive
Ad hoc queries
Batch
Scheduled

Data: From On-Premise to Cloud
● What
○ BI (Application)
○ Analytics (Compute)
○ Storage
● How
○ Feature Compatibility
○ Performance Measurement
○ Security / Compliance
○ Tech Debt?
● Why
○ Cost Efficiency
○ Usability / Scalability / Reliability

Alluxio Local Caching -- High Level Architecture
Runs as a local library inside the Presto worker
Key <-> Value:
HDFS File Path as the key
https://prestodb.io/blog/2020/06/16/alluxio-datacaching

Key Problems -- Data
● Data Characteristics
○ Mostly partitioned by date
○ Hudi incremental updates on files
○ Staging directories / partitions from the ETL framework
● Cache Data Hit Ratio
○ 3+ PB distinct data access per day
○ ~10% frequently accessed data
○ ~3% hot accessed data
● Data Cache Filtering
○ Offline query analytics on table (with partition) access
○ Onboarding hot accessed data
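
The filtering idea above can be read as an allow-list check: only splits from tables (and recent date partitions) that offline analysis identified as hot are marked cacheable. The following is a minimal Python sketch of that idea. The `CacheFilter` class, the table name, and the "number of recent days" rule are hypothetical illustrations, not the configuration format used in the talk.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CacheFilter:
    """Hypothetical allow-list: table name -> number of recent date partitions to cache."""
    hot_tables: dict = field(default_factory=dict)

    def is_cacheable(self, table: str, partition_date: date, today: date) -> bool:
        max_age_days = self.hot_tables.get(table)
        if max_age_days is None:
            return False  # table was never onboarded as hot
        age = (today - partition_date).days
        # Cache only partitions inside the configured recent window.
        return 0 <= age < max_age_days
```

With `CacheFilter({"db.trips": 7})`, a partition from two days ago is cacheable, while a month-old partition or any partition of an unlisted table bypasses the cache.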

Key Problems -- Apache Hadoop® HDFS Latency
● DataNodes can introduce random latency
● In a real production environment, CPU wall time is mostly spent reading data

Key Problems -- HDFS Latency, Cont.
● Reading from the local cache has much better guaranteed latency
● Fixing a bug in NameNode listing (ListLocatedStatus API)

Key Problems -- Presto Soft Affinity Scheduling
● Compute preferred workers
○ Split overrides getPreferredNodes() to return the 2 preferred workers
○ Simple mod-based algorithm
○ Try to assign the split to each preferred worker in turn, checking whether it is busy
○ If both workers are busy, select the least busy worker (with cacheable = false)
● Define busy worker
○ Max splits per node: node-scheduler.max-splits-per-node
○ Max pending splits per task: node-scheduler.max-pending-splits-per-task
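
The scheduling steps above can be sketched roughly as follows. This is an illustrative Python sketch, not Presto's scheduler code: the CRC32 hash, the choice of the next worker in the list as the second candidate, and the single `max_splits_per_node` threshold standing in for the "busy" definition are all assumptions made for the example.

```python
import zlib


def preferred_workers(file_path, workers):
    """Pick two candidate workers for a file with simple mod-based hashing."""
    h = zlib.crc32(file_path.encode("utf-8"))
    first = h % len(workers)
    second = (first + 1) % len(workers)  # assumed rule for the second candidate
    return [workers[first], workers[second]]


def assign_split(file_path, workers, load, max_splits_per_node):
    """Return (worker, cacheable) for one split.

    `load` maps each worker to its current number of assigned splits.
    """
    for worker in preferred_workers(file_path, workers):
        if load.get(worker, 0) < max_splits_per_node:
            return worker, True  # preferred worker is free: read through the cache
    # Both preferred workers are busy: fall back to the least busy worker
    # and skip caching so the data is not duplicated on a non-preferred node.
    least_busy = min(workers, key=lambda w: load.get(w, 0))
    return least_busy, False
```

Because the hash depends only on the file path, repeated reads of the same file land on the same two workers while they have capacity, which is what keeps the local cache warm.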

Key Problems -- Soft Affinity with Consistent Hash
● Change from simple mod-based node selection to consistent hashing
● 10 virtual nodes, original 400-node cluster
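
To illustrate why consistent hashing helps here, the sketch below builds a hash ring in Python. With mod-based selection, adding or removing one worker remaps most files; on a ring, only the files that mapped to the affected worker move. The sketch assumes that "10 virtual nodes" means 10 ring positions per worker; the MD5-based hash and the `ConsistentHashRing` class are illustrative choices, not the Presto implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, workers, virtual_nodes=10):
        # Each worker is placed on the ring at `virtual_nodes` positions.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def preferred_workers(self, file_path, count=2):
        """Walk clockwise from the file's position and collect distinct workers."""
        if not self._ring:
            return []
        start = bisect.bisect_right(self._points, _hash(file_path)) % len(self._ring)
        chosen = []
        for offset in range(len(self._ring)):
            worker = self._ring[(start + offset) % len(self._ring)][1]
            if worker not in chosen:
                chosen.append(worker)
                if len(chosen) == count:
                    break
        return chosen
```

Removing a worker from the ring leaves the first preferred worker unchanged for every file that was not mapped to the removed worker, so most cached data stays useful when the cluster is resized.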

Current Status and next steps
● Initial testing has finished, with great improvement on queries
● TPC-DS testing with sf10k in progress
● Historical table/partition analytics to set up cache filters
● Dashboarding, monitoring, metadata integrations

Persistent File Level Metadata for Local Cache
● Prevent stale caching
○ The underlying data files might be changed by third-party frameworks. (This situation might be rare in Hive tables, but is very common in Hudi tables.)
● Scoped quota management
○ Do you want to set a cache quota for each table?
● Metadata should be recoverable after server restart

File Level Metadata -- High Level Approach
● Implement a file-level metadata store that keeps the last modified time and the scope of each data file we cached.
● The file-level metadata store should be persisted on disk so the data does not disappear after a restart
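
The approach above can be sketched as a small persistent store keyed by file ID. This Python sketch is an assumption-laden illustration: it serializes to JSON for brevity rather than the protobuf format the slides mention, and the `FileMetadataStore` class and its method names are hypothetical.

```python
import json
import os


class FileMetadataStore:
    """Keeps last-modified time and scope per cached file, persisted on disk."""

    def __init__(self, path):
        self._path = path
        self._entries = {}
        if os.path.exists(path):
            # Recover metadata written before a restart.
            with open(path, "r", encoding="utf-8") as f:
                self._entries = json.load(f)

    def is_fresh(self, file_id, last_modified):
        """True only if the cached copy matches the file's current modification time."""
        entry = self._entries.get(file_id)
        return entry is not None and entry["last_modified"] == last_modified

    def put(self, file_id, last_modified, scope):
        self._entries[file_id] = {"last_modified": last_modified, "scope": scope}
        self._flush()

    def _flush(self):
        # Write to a temporary file, then rename, so a crash cannot leave
        # a half-written metadata file behind.
        tmp = self._path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self._entries, f)
        os.replace(tmp, self._path)
```

A reader would call `is_fresh` with the modification time reported by the underlying file system before serving cached pages; a mismatch means the file was rewritten (for example by a Hudi update) and the cached pages should be treated as stale.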

Cache data and Metadata Structure
root_path/page_size(ulong)/bucket(uint)/file_id(str)/
    timestamp1/
        Page_file1 (The filename is a ulong)
        Page_file2
        ….
        Page_fileN
    timestamp2/
        Page_file1 (The filename is a ulong)
        Page_file2
        ….
        Page_fileN
    metadata (stores FileInfo in protobuf format)
        Contains timestamp and scope
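
As a rough illustration of how a page location follows from this layout, the Python sketch below builds the path of one page file. The way the bucket is derived from the file ID (MD5 modulo a bucket count) and the `page_path` helper itself are assumptions for the example; the slides do not state how buckets are assigned.

```python
import hashlib
from pathlib import PurePosixPath


def page_path(root, page_size, num_buckets, file_id, timestamp, page_index):
    """Build root/page_size/bucket/file_id/timestamp/page_index."""
    # Assumed bucketing rule: hash the file ID and take it modulo the bucket count.
    bucket = int(hashlib.md5(file_id.encode("utf-8")).hexdigest(), 16) % num_buckets
    return (
        PurePosixPath(root)
        / str(page_size)
        / str(bucket)
        / file_id
        / str(timestamp)
        / str(page_index)
    )
```

Because the timestamp is part of the path, pages cached from an older version of a file live in a separate directory from pages of the updated file, so a stale version can be dropped by removing one directory.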

Metadata Awareness -- Cache Context (New in 2.6.1)

Per Query Metrics Aggregation on Presto Side

Future Work
● Performance Tuning
● Semantic Cache
● More efficient deserialization
