Maximizing GPU Efficiency:
Optimizing Model Training
with GPUs Anywhere
Bin Fan
Founding Engineer, VP of Technology @ Alluxio
Aug 29 2024
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
○ Founding Engineer, VP of Technology @ Alluxio
○ Email: binfan@alluxio.com
○ Previously worked at Google (Technical Infra); PhD in CS at Carnegie Mellon University
Common ML Platform
Architecture
Serving platform
Model
Files
Training
Dataset
Checkpoints
2
Training Infra
Data Lake
1
3
Explore Efficient, Scalable
I/O for Model Training
Questions:
▪ Possible Architectures
▪ How to design an efficient, scalable,
distributed cache
○ Evolution of Alluxio Architecture
▪ Benchmark and Case Studies
○ FIO Benchmark
○ User Success Stories
Option 1: Connecting to
Cloud Storage Directly
Pros:
Easy to manage – Single source of
truth
Data Lake
Cons:
● Slow or Inconsistent Performance
○ “(Service: Amazon S3; Status Code: 503;
Error Code: SlowDown …)”
● High cost in accessing cloud
storage
○ https://arxiv.org/abs/2311.00156 - Joint case
study by Alluxio, CMU & Uber
Training
Direct Access to
Data Lake
Option 2: Adding a
High-performance Storage
Pros:
High and consistent I/O performance
Cons:
● Costly Infrastructure
● Extra overhead in data migration,
and maintenance
● Does not scale to multi-region/multi-cloud:
infra cost, egress cost, and
bandwidth limits
Data Lake
Fast Access
Migrate Data
Training
HPC Storage
…
Data Lake
Fast Access
Migrate Data
HPC Storage
us-west-1
Training …
us-east-1
Training
HPC Storage
…
Observation: A Classic
Caching Problem
● It's always great to maintain a single source of truth in
your data lake
● Place a data access/caching layer between the compute
engines and data lake storage to meet IOPS demand,
optionally with data virtualization
● Share cache across analytics and AI workloads
Data Lake
Training Compute
Access/Caching Layer
Option 3: Adding a High-performance Cache
Pros:
● High and consistent I/O
performance
● Still keeps a single source of truth:
no extra cost for data migration
and maintenance
● Scalable to extend to
multi-region/cloud
Data Lake
us-east-1
Training
Distributed Cache
…
Fast Access with
Hot Data Cached
Only retrieve
Data on Demand
Distributed Cache
us-west-1
Training
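The "hot data cached, only retrieve data on demand" behavior of Option 3 can be illustrated with a small read-through cache. This is a minimal single-process sketch, not Alluxio's implementation: the `ReadThroughCache` class, the LRU eviction policy, and the `s3://bucket/part-N` paths are all assumptions made for illustration, and a dictionary stands in for the data lake.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Tiny LRU read-through cache: go to the backing store only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch            # hypothetical loader for the data lake
        self._capacity = capacity      # maximum number of cached objects
        self._items = OrderedDict()    # path -> bytes, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._items:
            self._items.move_to_end(path)      # mark as most recently used
            self.hits += 1
            return self._items[path]
        self.misses += 1
        data = self._fetch(path)               # retrieve on demand
        self._items[path] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)    # evict the least recently used entry
        return data


# A dict stands in for the data lake; a real deployment would read object storage.
data_lake = {f"s3://bucket/part-{i}": f"rows-{i}".encode() for i in range(4)}
cache = ReadThroughCache(fetch=data_lake.__getitem__, capacity=2)

# Three training epochs over the same file: one remote fetch, two cache hits.
for _epoch in range(3):
    cache.read("s3://bucket/part-0")
print(f"hits={cache.hits} misses={cache.misses}")
```

The data lake stays the single source of truth: nothing is migrated ahead of time, and repeated epochs over the same files are served from the cache.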
Designing a High-performance,
Scalable, Distributed Cache for
Training Workloads
Alluxio Data Platform
Accelerate data-intensive AI training workloads
Alluxio Technology Journey
Open Source Started From UC Berkeley AMPLab in 2014
1000+
nodes
Largest deployment by
Baidu
Started
from UC
Berkeley
AMPLab
1 Billion
Files
supported by Alluxio
with 2.0 release
2014 2019 2023
7/10 top
Internet Co
powered by Alluxio
AliPay 80%
Model
Training
Zhihu LLM
Model training served by
Alluxio
EXPLOSION OF DATA
rise of big data & analytics
CLOUD ADOPTION
Single to hybrid cloud,
multi-cloud, cross region
GENERATIVE AI
Large-scale model training
and deployment
1000+
Contributors
Open Source
1000+
Attendees
Data Orchestration Summit
100% Presto @
Meta
Fully on-boarded to Alluxio
9/10 top
Internet Co
powered by Alluxio
Powered by Alluxio
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
When Alluxio (Tachyon) was born in
Berkeley
Early Architecture: Modeled after HDFS
Compute Node Under Storage
Primary Master
Alluxio Cluster
MapReduce/
Spark/Trino
Request
1
Client
Standby
Master
Standby
Master
Worker
Worker
Worker
2 Get location
3
Request
worker
4
Cache miss
read from
under storage
When Serving ML Training:
Different Requirements
● Programming interface: HDFS vs POSIX
● Deployment environment: YARN/Bare Metal vs K8s
● Data format: Structured vs unstructured (audio, picture, video, text)
● Metadata performance: critical for CV/multimodal Training (millions to
billions of small files)
● I/O Concurrency: much higher in training
● Training Duration: hours vs days or weeks ⇒ reliability is the key.
● Fast Write (Checkpointing): Essential
Time to revisit key design choices
New Architecture of Alluxio
Training Node Under Storage
Service Registry
Alluxio Cluster
PyTorch
I/O Request
Worker
Worker
Worker
Select
worker
Cache miss
read from
under storage
Consistent hashing
based data partition
I/O Request
Under the hood
● Use consistent hashing to cache both data and metadata on workers.
○ Reduced I/O RPC length, Performance ++
○ No more single point of failure. Reliability ++
○ No more performance bottleneck on masters. Performance ++
● Remove master from critical path: no more journal
● Many other resource/performance optimizations: e.g., applying zero
copy whenever possible
https://www.alluxio.io/blog/introducing-dora-the-next-generation-alluxio-architecture/
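To make the consistent-hashing idea concrete, here is a minimal sketch of a hash ring that lets a client pick the worker for a file path without asking a master. It is an illustration under stated assumptions, not Alluxio's actual code: the `HashRing` class, the use of MD5, the 100 virtual nodes per worker, and the `worker-N` names are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, workers, vnodes=100):
        # Each worker owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # The first ring point clockwise from the path's hash owns the path.
        idx = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


paths = [f"/dataset/img_{i}.jpg" for i in range(1000)]
before = HashRing(["worker-0", "worker-1", "worker-2"])
after = HashRing(["worker-0", "worker-1", "worker-2", "worker-3"])

# Adding a worker only remaps the paths that the new worker takes over.
moved = [p for p in paths if before.worker_for(p) != after.worker_for(p)]
print(f"{len(moved)} of {len(paths)} paths moved to the new worker")
```

Because every client computes the same mapping locally, a read needs no location lookup on a master, which is where the shorter RPC path and the absence of a single point of failure come from.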
By the numbers
● High Scalability
○ One worker node supports 50+ million small files
○ Scales linearly - easy to support 10 billion files
● High Availability
○ 99.99% uptime
○ No single point of failure
● High Performance
○ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management
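As a back-of-the-envelope check on the scalability figures, the sketch below estimates how many workers 10 billion files would need at 50 million files per worker. It assumes perfectly linear scaling and ignores replication and headroom, so treat the result as a lower bound rather than a sizing guide; the function name is hypothetical.

```python
FILES_PER_WORKER = 50_000_000       # "50+ million small files" per worker (from the slide)
TARGET_FILES = 10_000_000_000       # 10 billion files


def workers_needed(total_files: int, files_per_worker: int = FILES_PER_WORKER) -> int:
    """Ceiling division: the smallest worker count that can hold all files."""
    return (total_files + files_per_worker - 1) // files_per_worker


print(workers_needed(TARGET_FILES))  # prints 200
```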
API Option 1: Alluxio FUSE
● Expose the Alluxio file system as a local file system.
● Access cloud storage just like local storage.
○ cat, ls
○ f = open("a.txt", "r")
● Very low impact for end users
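Because a FUSE mount looks like a local directory, data-loading code needs nothing Alluxio-specific. The sketch below walks a directory and reads every file with plain POSIX calls. The `iter_samples` helper is hypothetical, and a temporary directory stands in for the mount point so that the example runs anywhere; in a real setup the root would be wherever Alluxio FUSE is mounted.

```python
import os
import tempfile


def iter_samples(root: str):
    """Yield (relative_path, bytes) for every file under a mounted directory."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:   # ordinary open/read, no special client
                yield os.path.relpath(full, root), f.read()


# A temporary directory plays the role of the FUSE mount point in this demo.
with tempfile.TemporaryDirectory() as mount:
    with open(os.path.join(mount, "a.txt"), "w") as f:
        f.write("hello")
    samples = dict(iter_samples(mount))

print(samples)
```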
API Option 2: Use Python Client (alluxiofs)
Existing Code
With alluxiofs
Can we further minimize the
modification of existing code?
API Option 3 (Experimental):
Use alluxioio package
● Import a Python package called alluxioio
● No need to modify existing code to use alluxiofs
Benchmark & Case Studies
FIO Benchmark:
Sequential Read x Single Client
● Alluxio AI-3.2: Achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single
client, the highest of the three systems tested.
● NAS (J***FS): Recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1%
slower than Alluxio 3.2.
● HPC FS (FSx Lustre): Managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to
51.2% slower than Alluxio 3.2.
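The "slower than Alluxio" percentages can be re-derived from the bandwidth figures as (Alluxio - other) / Alluxio. The sketch below does that arithmetic; it assumes the upper end of each range is compared against Alluxio's upper end, which is what reproduces the quoted numbers. The first NAS value works out to about 9.37%, so the quoted 9.3% appears to be truncated rather than rounded.

```python
def percent_slower(baseline_mib_s: float, other_mib_s: float) -> float:
    """How much slower `other` is than `baseline`, as a percentage of baseline."""
    return (baseline_mib_s - other_mib_s) / baseline_mib_s * 100


# (low end, high end) of each reported bandwidth range, in MiB/s.
ALLUXIO = (2081, 8183)
NAS = (1886, 6207)
FSX_LUSTRE = (185, 3992)

results = {
    "NAS": [percent_slower(a, b) for a, b in zip(ALLUXIO, NAS)],
    "FSx Lustre": [percent_slower(a, b) for a, b in zip(ALLUXIO, FSX_LUSTRE)],
}

for system, (low_end, high_end) in results.items():
    print(f"{system}: {low_end:.2f}% and {high_end:.2f}% slower")
```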
Setup
● Alluxio:
1 Alluxio worker (i3en.metal)
1 Alluxio fuse client
(c5n.metal)
● NAS (J***FS)
● HPC FS (AWS FSx Lustre, 12 TB
capacity)
Note: the Alluxio fuse client co-located
with training servers is responsible for
POSIX API access to Alluxio Workers
which actually cache the data
Alluxio 3.2 shows better performance, particularly in handling concurrent
sequential read operations.
CASE STUDY:
High Data Performance AI Platform for
model training & inference
BUSINESS BENEFIT:
10X faster time-to-production
- Avoid data copy from Cloud Object
store to HDFS
- Start GPU cluster and Alluxio Caching
in any Cloud with Kubernetes in 10
minutes
TECH BENEFIT:
Increase GPU utilization in LLM
training: ~50% to 93%+
Increase GPU utilization in
Search/Recommendation/Ads training: ~20% to 40%+
[Diagram with three panels: "Training - Cloud", "Training - On Prem", and "Online machine learning platform - Cloud". Labels: HDFS; Training Data & Checkpoints; Model Training; Model Deployment; Model Inference; Downstream Applications; 400 Gbps network connections.]
Introducing Rapid Alluxio Deployer (RAD) in AWS!
Get started with a fully deployed Alluxio AI cluster
with just a few clicks in under 40 minutes!
● Explore the potential performance benefits of Alluxio by
running FIO benchmarks
● Simplify the deployment process with preconfigured
template clusters
● Maintain full control of your data with Alluxio deployed
within your AWS account
Blog with sign up link and tutorial
Takeaway
● When Compute Resource Scarcity Becomes the Norm, a Distributed
Caching Layer Works Well to Enable I/O-Intensive Training
○ Having a Single Source of Truth Makes Life Much Easier
● Architectural Changes Are Required to Meet the Requirements for ML
workloads
○ Especially for Metadata Performance and Scalability of Data
Capacity
Thank You
Any Questions? Scan the QR code for a
Linktree including great
learning resources,
exciting meetups & a
community of data & AI
infra experts!
Ad

More Related Content

Similar to AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training with GPUs Anywhere (20)

AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
Alluxio, Inc.
 
Alluxio Product school Webinar - Distributed Caching for Generative AI
Alluxio Product school Webinar - Distributed Caching for Generative AIAlluxio Product school Webinar - Distributed Caching for Generative AI
Alluxio Product school Webinar - Distributed Caching for Generative AI
Alluxio, Inc.
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Alluxio, Inc.
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Alluxio, Inc.
 
Alluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio Webinar - Maximize GPU Utilization for Model TrainingAlluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio, Inc.
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
Alluxio, Inc.
 
Alluxio Monthly Webinar - Accelerate AI Path to Production
Alluxio Monthly Webinar - Accelerate AI Path to ProductionAlluxio Monthly Webinar - Accelerate AI Path to Production
Alluxio Monthly Webinar - Accelerate AI Path to Production
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio, Inc.
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
Alluxio, Inc.
 
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio, Inc.
 
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
Alluxio, Inc.
 
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
Alluxio, Inc.
 
Unify Data at Memory Speed
Unify Data at Memory SpeedUnify Data at Memory Speed
Unify Data at Memory Speed
Alluxio, Inc.
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
Alluxio, Inc.
 
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
Ceph Community
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Alluxio, Inc.
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
Alluxio, Inc.
 
Alluxio Product school Webinar - Distributed Caching for Generative AI
Alluxio Product school Webinar - Distributed Caching for Generative AIAlluxio Product school Webinar - Distributed Caching for Generative AI
Alluxio Product school Webinar - Distributed Caching for Generative AI
Alluxio, Inc.
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Alluxio, Inc.
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Alluxio, Inc.
 
Alluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio Webinar - Maximize GPU Utilization for Model TrainingAlluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio, Inc.
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
Alluxio, Inc.
 
Alluxio Monthly Webinar - Accelerate AI Path to Production
Alluxio Monthly Webinar - Accelerate AI Path to ProductionAlluxio Monthly Webinar - Accelerate AI Path to Production
Alluxio Monthly Webinar - Accelerate AI Path to Production
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio, Inc.
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
Alluxio, Inc.
 
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio, Inc.
 
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...Speeding up I/O for Machine Learning  ft Apple Case Study using TensorFlow, N...
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N...
Alluxio, Inc.
 
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
Alluxio, Inc.
 
Unify Data at Memory Speed
Unify Data at Memory SpeedUnify Data at Memory Speed
Unify Data at Memory Speed
Alluxio, Inc.
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
Alluxio, Inc.
 
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
Ceph Community
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Alluxio, Inc.
 

More from Alluxio, Inc. (20)

How Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model TrainingHow Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
Alluxio, Inc.
 
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
Alluxio, Inc.
 
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and FinetuneAI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
Alluxio, Inc.
 
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber ScaleAI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
Alluxio, Inc.
 
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio, Inc.
 
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference StackAI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
Alluxio, Inc.
 
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
Alluxio, Inc.
 
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio, Inc.
 
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AIAI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
Alluxio, Inc.
 
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom DevelopersAI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
Alluxio, Inc.
 
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
Alluxio, Inc.
 
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMsAI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
Alluxio, Inc.
 
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio, Inc.
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Alluxio, Inc.
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
Alluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
Alluxio, Inc.
 
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model TrainingHow Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
Alluxio, Inc.
 
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
Alluxio, Inc.
 
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and FinetuneAI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
Alluxio, Inc.
 
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber ScaleAI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
Alluxio, Inc.
 
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio, Inc.
 
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference StackAI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
Alluxio, Inc.
 
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
Alluxio, Inc.
 
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio, Inc.
 
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AIAI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
Alluxio, Inc.
 
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom DevelopersAI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
Alluxio, Inc.
 
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Maximizing GPU Efficiency: Optimizing Model Training with GPUs Anywhere

  • 1. Maximizing GPU Efficiency: Optimizing Model Training with GPUs Anywhere. Bin Fan, Founding Engineer, VP of Technology @ Alluxio. Aug 29, 2024
  • 2. About Me: Bin Fan (https://www.linkedin.com/in/bin-fan/)
      ○ Founding Engineer, VP of Technology @ Alluxio
      ○ Email: binfan@alluxio.com
      ○ Previously worked at Google (Technical Infra); PhD in CS at Carnegie Mellon University
  • 3. Common ML Platform Architecture
      [Diagram: Data Lake, Training Infra, and Serving platform, connected by Training Dataset, Checkpoints, and Model Files]
  • 4. Explore Efficient, Scalable I/O for Model Training
      Questions:
      ▪ Possible Architectures
      ▪ How to design an efficient, scalable, distributed cache
        ○ Evolution of Alluxio Architecture
      ▪ Benchmark and Case Studies
        ○ FIO Benchmark
        ○ User Success Stories
  • 5. Option 1: Connecting to Cloud Storage Directly
      [Diagram: Training with direct access to the Data Lake]
      Pros:
      ● Easy to manage: single source of truth
      Cons:
      ● Slow or inconsistent performance
        ○ “(Service: Amazon S3; Status Code: 503; Error Code: SlowDown …)”
      ● High cost in accessing cloud storage
        ○ https://arxiv.org/abs/2311.00156 - Joint case study by Alluxio, CMU & Uber
  • 6. Option 2: Adding a High-performance Storage
      [Diagram: data is migrated from the Data Lake to HPC Storage, which gives Training fast access]
      Pros:
      ● High and consistent I/O performance
      Cons:
      ● Costly infrastructure
      ● Extra overhead in data migration and maintenance
      ● Not scalable to multi-region/cloud: infra cost & egress cost / bandwidth limit
  • 7. Option 2: Adding a High-performance Storage
      [Diagram: the same layout repeated per region, with separate HPC Storage and Training in us-west-1 and us-east-1, each migrating data from the Data Lake]
      Pros and cons as on slide 6.
  • 8. Observation: A Classic Caching Problem
      [Diagram: Training Compute → Access/Caching Layer → Data Lake]
      ● It's always great to maintain a single source of truth in your data lake
      ● A data access/caching layer between the different compute engines and the data lake storage meets the demand for IOPS, with possible data virtualization
      ● Share the cache across analytics and AI workloads
  • 9. Option 3: Adding a High-performance Caching
      [Diagram: a Distributed Cache next to Training in each region (us-east-1, us-west-1); fast access with hot data cached; data is retrieved from the Data Lake only on demand]
      Pros:
      ● High and consistent I/O performance
      ● Still keeps a single source of truth: no extra cost in data migration and maintenance
      ● Scalable to multi-region/cloud
  • 10. Designing a High-performance, Scalable, Distributed Cache for Training Workloads
  • 11. Alluxio Data Platform: Accelerate data-intensive AI training workloads
  • 12. Alluxio Technology Journey
      Timeline: 2014, 2019, 2023
      Eras:
      ● EXPLOSION OF DATA: rise of big data & analytics
      ● CLOUD ADOPTION: single to hybrid cloud, multi-cloud, cross region
      ● GENERATIVE AI: large-scale model training and deployment
      Milestones:
      ● Open source, started from UC Berkeley AMPLab in 2014
      ● 1000+ nodes: largest deployment, by Baidu
      ● 1 billion files supported by Alluxio with the 2.0 release
      ● 7/10 top Internet Co powered by Alluxio
      ● 9/10 top Internet Co powered by Alluxio
      ● 1000+ open source contributors
      ● 1000+ attendees at Data Orchestration Summit
      ● 100% of Presto @ Meta fully on-boarded to Alluxio
      ● AliPay: 80% model training
      ● Zhihu: LLM model training served by Alluxio
  • 13. Powered by Alluxio
      [Logo wall by sector: Internet, Public Cloud Providers, General, E-Commerce, Technology, Financial Services, Telco & Media, Others]
  • 14. When Alluxio (Tachyon) was born in Berkeley
  • 15. Early Architecture: Modeled after HDFS
      [Diagram: a Compute Node (MapReduce/Spark/Trino with an Alluxio Client), an Alluxio Cluster (Primary Master, two Standby Masters, three Workers), and Under Storage]
      Read path:
      (1) Request
      (2) Get location
      (3) Request worker
      (4) Cache miss: read from under storage
  • 16. When Serving ML Training: Different Requirements
      ● Programming interface: HDFS vs POSIX
      ● Deployment environment: YARN/Bare Metal vs K8s
      ● Data format: structured vs unstructured (audio, picture, video, text)
      ● Metadata performance: critical for CV/multimodal training (millions to billions of small files)
      ● I/O concurrency: much higher in training
      ● Training duration: hours vs days or weeks ⇒ reliability is the key
      ● Fast write (checkpointing): essential
      Time to revisit key design choices
  • 17. New Architecture of Alluxio
      [Diagram: a Training Node (PyTorch) sends I/O requests to an Alluxio Cluster of Workers registered in a Service Registry; the client selects a worker through consistent-hashing-based data partitioning; on a cache miss the worker reads from Under Storage]
  • 18. Under the hood
      ● Use consistent hashing to cache both data and metadata on workers
        ○ Reduced I/O RPC length. Performance ++
        ○ No more single point of failure. Reliability ++
        ○ No more performance bottleneck on masters. Performance ++
      ● Remove the master from the critical path: no more journal
      ● Many other resource/performance optimizations, e.g., applying zero copy whenever possible
      https://www.alluxio.io/blog/introducing-dora-the-next-generation-alluxio-architecture/
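A minimal sketch of the idea on slides 17 and 18, assuming a simple hash ring with virtual nodes: the client picks a worker from the file path alone, with no master lookup, and the worker reads from under storage only on a cache miss. The class names, the worker names, the MD5-based hash, and the in-memory dict standing in for under storage are all illustrative assumptions, not Alluxio's actual implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, workers, vnodes=100):
        # Each worker owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # First ring point clockwise from the path's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


class Worker:
    """Hypothetical cache worker: serves from cache, falls back to under storage."""

    def __init__(self, name, under_storage):
        self.name = name
        self.under_storage = under_storage
        self.cache = {}
        self.misses = 0

    def read(self, path):
        if path not in self.cache:
            # Cache miss: read from under storage and keep a copy.
            self.misses += 1
            self.cache[path] = self.under_storage[path]
        return self.cache[path]


# A dict stands in for the data lake / under storage.
under_storage = {f"/data/img_{i}.jpg": f"bytes-{i}".encode() for i in range(1000)}
workers = {n: Worker(n, under_storage) for n in ("worker-0", "worker-1", "worker-2")}
ring = HashRing(workers)


def client_read(path):
    """Client side: select the worker by hashing the path, then read from it."""
    return workers[ring.worker_for(path)].read(path)


first = client_read("/data/img_7.jpg")   # miss: fetched from under storage
second = client_read("/data/img_7.jpg")  # hit: served by the same worker's cache
total_misses = sum(w.misses for w in workers.values())
```

Because existing workers keep their ring positions, adding a worker only reassigns the paths that now fall on the new worker's points; the rest of the cached data stays where it is.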
  • 19. By the numbers
      ● High Scalability
        ○ One worker node supports 50+ million small files
        ○ Scales linearly: easy to support 10 billion files
      ● High Availability
        ○ 99.99% uptime
        ○ No single point of failure
      ● High Performance
        ○ Faster data loading
      ● Cloud-native
        ○ K8s Operator and CSI-FUSE for data access management
  • 20. API Option 1: Alluxio FUSE
      ● Exposes the Alluxio file system as a local file system
      ● Cloud storage can be accessed just like local storage
        ○ cat, ls
        ○ f = open("a.txt", "r")
      ● Very low impact for end users
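A small sketch of what slide 20 implies for training code: once the file system is mounted, reads are ordinary POSIX calls with nothing Alluxio-specific in them. The helper name `read_sample` is hypothetical, and a temporary directory stands in for the mount point so the snippet runs anywhere; in a real deployment `mount_root` would be wherever the FUSE mount lives.

```python
import os
import tempfile


def read_sample(mount_root, relative_path):
    """Read a file with plain POSIX calls; no Alluxio-specific API is involved."""
    full_path = os.path.join(mount_root, relative_path)
    with open(full_path, "rb") as f:
        return f.read()


# A temporary directory plays the role of the FUSE mount point (an assumption
# for illustration; the real mount path depends on the deployment).
with tempfile.TemporaryDirectory() as mount_root:
    with open(os.path.join(mount_root, "a.txt"), "w") as f:
        f.write("hello")

    listing = os.listdir(mount_root)           # equivalent of `ls`
    content = read_sample(mount_root, "a.txt")  # equivalent of `cat a.txt`
```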
  • 21. API Option 2: Use Python Client (alluxiofs)
      [Code comparison: existing code vs. the same code with alluxiofs]
      Can we further minimize the modification of existing code?
  • 22. API Option 3 (Experimental): Use alluxioio package
      ● Import a Python package called alluxioio
      ● No need to modify existing code to use alluxiofs
  • 23. Benchmark & Case Studies
  • 24. FIO Benchmark: Sequential Read x Single Client
      Results:
      ● Alluxio AI-3.2: achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, significantly outperforming competitors.
      ● NAS (J***FS): recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1% slower than Alluxio 3.2.
      ● HPC FS (FSx Lustre): managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to 51.2% slower than Alluxio 3.2.
      Setup:
      ● Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
      ● NAS (J***FS)
      ● HPC FS: AWS FSx Lustre (12TB capacity)
      Note: the Alluxio FUSE client, co-located with the training servers, is responsible for POSIX API access to the Alluxio workers, which actually cache the data.
      Alluxio 3.2 shows better performance, particularly in handling concurrent sequential read operations.
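The "slower than Alluxio" percentages on slide 24 are the relative bandwidth gaps, (Alluxio − other) / Alluxio. The sketch below recomputes them from the bandwidth figures on the slide. It assumes the upper figures for NAS and FSx Lustre were measured at 32 threads, which the slide implies but does not state; the function and variable names are illustrative.

```python
def percent_slower(baseline_mib_s, other_mib_s):
    """Relative gap of `other` versus `baseline`, in percent of the baseline."""
    return (baseline_mib_s - other_mib_s) / baseline_mib_s * 100


# Bandwidths in MiB/s as quoted on the slide.
alluxio = {"1 thread": 2081, "32 threads": 8183}
others = {
    "NAS": {"1 thread": 1886, "32 threads": 6207},     # 32 threads is assumed
    "HPC FS": {"1 thread": 185, "32 threads": 3992},   # 32 threads is assumed
}

results = {
    (system, threads): percent_slower(alluxio[threads], bandwidth)
    for system, per_thread in others.items()
    for threads, bandwidth in per_thread.items()
}

for (system, threads), gap in results.items():
    print(f"{system:7s} {threads:10s} {gap:.2f}% slower than Alluxio")
```

The recomputed gaps agree with the slide's figures to within about a tenth of a percentage point.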
  • 25. CASE STUDY: High Data Performance AI Platform for model training & inference
      Business and tech benefits:
      ● 10X faster time-to-production
        ○ Avoid data copy from cloud object store to HDFS
        ○ Start GPU cluster and Alluxio caching in any cloud with Kubernetes in 10 minutes
      ● Increase GPU utilization in LLM training: from ~50% to 93%+
      ● Increase GPU utilization in Search/Recommendation/Ads training: from ~20% to 40%+
      [Diagram: training data and checkpoints flow between HDFS, on-prem training, cloud training, and a cloud online machine learning platform over 400 Gbps network connections; checkpoints feed Model Training, Model Deployment, Model Inference, and Downstream Applications]
  • 26. Introducing Rapid Alluxio Deployer (RAD) in AWS!
      Get started with a fully deployed Alluxio AI cluster with just a few clicks in under 40 minutes!
      ● Explore the potential performance benefits of Alluxio by running FIO benchmarks
      ● Simplify the deployment process with preconfigured template clusters
      ● Maintain full control of your data with Alluxio deployed within your AWS account
      [Link: blog with sign-up link and tutorial]
  • 27. Takeaway
      ● When compute resource scarcity becomes the norm, a distributed caching layer works well to enable I/O-intensive training
        ○ Having a single source of truth makes life much easier
      ● Architectural changes are required to meet the requirements of ML workloads
        ○ Especially for metadata performance and scalability of data capacity
  • 28. Thank You Any Questions? Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!