AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training with GPUs Anywhere

Maximizing GPU Efficiency:
Optimizing Model Training
with GPUs Anywhere
Bin Fan
Founding Engineer, VP of Technology @ Alluxio
Aug 29 2024

About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
○ Founding Engineer, VP of Technology @
Alluxio
○ Email: binfan@alluxio.com
○ Previously worked at Google (Technical Infra);
PhD in CS from Carnegie Mellon University

Common ML Platform Architecture
[Diagram labels: Data Lake; Training Infra; Serving platform. Numbered data flows (1-3): Training Dataset, Checkpoints, Model Files]

Explore Efficient, Scalable
I/O for Model Training
Questions:
▪ Possible Architectures
▪ How to design an efficient, scalable,
distributed cache
○ Evolution of Alluxio Architecture
▪ Benchmark and Case Studies
○ FIO Benchmark
○ User Success Stories

Option 1: Connecting to
Cloud Storage Directly
Pros:
● Easy to manage: single source of truth
Cons:
● Slow or inconsistent performance
○ “(Service: Amazon S3; Status Code: 503; Error Code: SlowDown …)”
● High cost of accessing cloud storage
○ https://arxiv.org/abs/2311.00156 - Joint case study by Alluxio, CMU & Uber
[Diagram: Training has direct access to the Data Lake]

Option 2: Adding a
High-performance Storage
Pros:
● High and consistent I/O performance
Cons:
● Costly infrastructure
● Extra overhead for data migration and maintenance
● Hard to extend to multiple regions/clouds: infra cost, egress cost, and bandwidth limits
[Diagram: data is migrated from the Data Lake into HPC Storage, which gives Training fast access]

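[Diagram: Option 2 across regions. Data is migrated from the Data Lake into a separate HPC Storage in each region (us-west-1, us-east-1), and Training in each region gets fast access from its local HPC Storage]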

Observation: A Classic
Caching Problem
● It's always great to maintain a single source of truth in your data lake
● Put a data access/caching layer between the different compute engines and the data lake storage to meet the demand for IOPS, possibly with data virtualization
● Share the cache across analytics and AI workloads
[Diagram: Training Compute, then Access/Caching Layer, then Data Lake]

Option 3: Adding a High-performance Cache
Pros:
● High and consistent I/O performance
● Keeps a single source of truth: no extra cost for data migration and maintenance
● Scales to multiple regions/clouds
[Diagram: Training in us-east-1 and us-west-1 each reads through a local Distributed Cache in front of a single Data Lake. Fast access with hot data cached; data is retrieved from the Data Lake only on demand]
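The "hot data cached, only retrieve data on demand" behavior can be illustrated with a minimal read-through cache. This is a sketch under stated assumptions, not Alluxio's implementation: the class name `ReadThroughCache`, the LRU eviction policy, the byte-based capacity, and the in-memory `data_lake` dictionary standing in for cloud storage are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Illustrative LRU read-through cache keyed by object path."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int) -> None:
        self._fetch = fetch                # loads an object from the data lake
        self._capacity = capacity_bytes    # total bytes the cache may hold
        self._used = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._entries:
            # Hot data: serve from the cache and mark as most recently used.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cold data: retrieve on demand from the data lake, then cache it.
        self.misses += 1
        data = self._fetch(path)
        self._admit(path, data)
        return data

    def _admit(self, path: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # object larger than the cache: serve it without storing
        while self._used + len(data) > self._capacity:
            _, evicted = self._entries.popitem(last=False)  # least recently used
            self._used -= len(evicted)
        self._entries[path] = data
        self._used += len(data)


# Hypothetical stand-in for the data lake (the single source of truth).
data_lake = {
    "train/a.bin": b"a" * 4,
    "train/b.bin": b"b" * 4,
    "train/c.bin": b"c" * 4,
}
cache = ReadThroughCache(fetch=data_lake.__getitem__, capacity_bytes=8)

for epoch in range(2):           # the second epoch is served entirely from cache
    cache.read("train/a.bin")
    cache.read("train/b.bin")
hits_after_two_epochs = cache.hits
misses_after_two_epochs = cache.misses
```

Because nothing is copied ahead of time, the data lake stays the only authoritative copy, and each region pays for a transfer only the first time a file is read there.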

Designing a High-performance,
Scalable, Distributed Cache for
Training Workloads

Alluxio Data Platform
Accelerate data-intensive AI training workloads

Alluxio Technology Journey
Open source project started at UC Berkeley AMPLab in 2014
Timeline: 2014, 2019, 2023
Eras:
● EXPLOSION OF DATA: rise of big data & analytics
● CLOUD ADOPTION: single to hybrid cloud, multi-cloud, cross region
● GENERATIVE AI: large-scale model training and deployment
Milestones:
● Started from UC Berkeley AMPLab
● 1000+ nodes: largest deployment, by Baidu
● 1 billion files supported by Alluxio with the 2.0 release
● 7/10 top Internet companies powered by Alluxio
● 9/10 top Internet companies powered by Alluxio
● 100% Presto @ Meta: fully on-boarded to Alluxio
● AliPay: 80% model training
● Zhihu: LLM model training served by Alluxio
● 1000+ open source contributors
● 1000+ attendees at the Data Orchestration Summit

Powered by Alluxio
[Logo wall grouped by sector: Internet; Public Cloud Providers; General; E-commerce; Technology; Financial Services; Telco & Media; Others]

When Alluxio (Tachyon) was born in
Berkeley

Early Architecture: Modeled after HDFS
[Diagram: a client on a compute node (MapReduce/Spark/Trino) talks to an Alluxio cluster made up of a primary master, two standby masters, and three workers, backed by under storage]
Read path:
1. Request arrives at the client
2. Get location (from the master)
3. Request worker
4. Cache miss: read from under storage

When Serving ML Training:
Different Requirements
● Programming interface: HDFS vs POSIX
● Deployment environment: YARN/Bare Metal vs K8s
● Data format: Structured vs unstructured (audio, picture, video, text)
● Metadata performance: critical for CV/multimodal Training (millions to
billions of small files)
● I/O Concurrency: much higher in training
● Training Duration: hours vs days or weeks ⇒ reliability is the key.
● Fast Write (Checkpointing): Essential
Time to revisit key design choices

New Architecture of Alluxio
[Diagram: PyTorch on a training node issues I/O requests to an Alluxio cluster of workers, with a service registry alongside and under storage behind it]
● I/O request: a worker is selected using consistent-hashing-based data partitioning
● Cache miss: the worker reads from under storage

Under the hood
● Use consistent hashing to cache both data and metadata on workers.
○ Reduced I/O RPC length, Performance ++
○ No more single point of failure. Reliability ++
○ No more performance bottleneck on masters. Performance ++
● Remove master from critical path: no more journal
● Many other resource/performance optimizations: e.g., applying zero
copy whenever possible
https://www.alluxio.io/blog/introducing-dora-the-next-generation-alluxio-architecture/
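To make the consistent-hashing idea concrete, here is a minimal hash ring in Python. It is a generic textbook sketch, not Alluxio's actual code: the `HashRing` class, the use of MD5, the 100 virtual nodes per worker, and the worker and bucket names are all assumptions chosen for illustration. It shows the property that matters for a cache: any client can compute the owner of a path locally, with no master lookup, and removing a worker only remaps the paths that worker owned.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, workers: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points: List[int] = []       # sorted ring positions
        self._owner: Dict[int, str] = {}   # ring position -> worker
        for worker in workers:
            self.add(worker)

    def add(self, worker: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{worker}#{i}")
            self._owner[point] = worker
            bisect.insort(self._points, point)

    def remove(self, worker: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != worker]
        self._owner = {p: w for p, w in self._owner.items() if w != worker}

    def lookup(self, path: str) -> str:
        """Return the worker owning the first ring position after the path's hash."""
        idx = bisect.bisect_right(self._points, _hash(path)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical worker names and object paths.
ring = HashRing(["worker-0", "worker-1", "worker-2"])
paths = [f"s3://bucket/train/part-{i:05d}.parquet" for i in range(1000)]

before = {p: ring.lookup(p) for p in paths}
ring.remove("worker-2")                    # simulate a worker leaving the cluster
after = {p: ring.lookup(p) for p in paths}
moved = [p for p in paths if before[p] != after[p]]
```

Only the paths previously owned by `worker-2` appear in `moved`; everything cached on the surviving workers stays where it was.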

By the numbers
● High Scalability
○ One worker node supports 50+ million small files
○ Scales linearly: easy to support 10 billion files
● High Availability
○ 99.99% uptime
○ No single point of failure
● High Performance
○ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management
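As a back-of-the-envelope reading of these figures, the short calculation below works out what they imply. It assumes strictly linear scaling and treats "50+ million" as exactly 50 million files per worker, so the worker count is an upper bound; the variable names are illustrative.

```python
FILES_PER_WORKER = 50_000_000        # "50+ million small files" per worker (lower bound)
TARGET_FILES = 10_000_000_000        # 10 billion files

# Ceiling division: workers needed if capacity scales linearly with worker count.
workers_needed = (TARGET_FILES + FILES_PER_WORKER - 1) // FILES_PER_WORKER

UPTIME = 0.9999                      # 99.99% uptime
MINUTES_PER_YEAR = 365 * 24 * 60
downtime_minutes_per_year = (1 - UPTIME) * MINUTES_PER_YEAR

print(workers_needed)                        # 200
print(round(downtime_minutes_per_year, 1))   # about 52.6 minutes
```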

API Option 1: Alluxio FUSE
● Expose the Alluxio file system as a local file system.
● Access cloud storage just like local storage.
○ cat, ls
○ f = open("a.txt", "r")
● Very low impact for end users
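A short sketch of what "very low impact" means in practice: code that reads through a FUSE mount uses only the standard file APIs. The helper `load_text` is hypothetical, and a temporary directory stands in for the mount point so the example runs without a cluster; on a real deployment the mount point would be wherever Alluxio FUSE is mounted.

```python
import os
import tempfile


def load_text(mount_point: str, relative_path: str) -> str:
    """Read a file with plain POSIX-style I/O; nothing here is Alluxio-specific."""
    with open(os.path.join(mount_point, relative_path), "r", encoding="utf-8") as f:
        return f.read()


# A temporary directory plays the role of the FUSE mount point (assumption).
with tempfile.TemporaryDirectory() as fake_mount:
    with open(os.path.join(fake_mount, "a.txt"), "w", encoding="utf-8") as f:
        f.write("hello from the cache\n")

    listing = os.listdir(fake_mount)           # the equivalent of `ls`
    content = load_text(fake_mount, "a.txt")   # the equivalent of `cat a.txt`
```

The same `load_text` call would work unchanged against a local disk or a mounted file system, which is why existing training scripts need no modification.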

API Option 2: Use Python Client (alluxiofs)
[Code comparison: existing code vs. the same code with alluxiofs]
Can we further minimize the
modification of existing code?

API Option 3 (Experimental):
Use alluxioio package
● Import a Python package called alluxioio
● No need to modify existing code to use alluxiofs

FIO Benchmark:
Sequential Read x Single Client
● Alluxio AI-3.2: 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, outperforming both alternatives below.
● NAS (J***FS): 1886 MiB/s (1 thread) to 6207 MiB/s (32 threads), 9.3% to 24.1% slower than Alluxio 3.2.
● HPC FS (FSx Lustre): 185 MiB/s (1 thread) to 3992 MiB/s (32 threads), 91.1% to 51.2% slower than Alluxio 3.2.
Setup
● Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
● NAS (J***FS)
● HPC FS: AWS FSx Lustre (12 TB capacity)
Note: the Alluxio FUSE client, co-located with the training servers, provides POSIX API access to the Alluxio workers, which actually cache the data.
Alluxio 3.2 shows better performance, particularly for concurrent sequential reads.
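The "slower than" percentages can be reproduced from the bandwidth figures. The sketch below assumes that the upper figure quoted for each competitor was measured at 32 threads, like the Alluxio figure, and that "X% slower" means the bandwidth gap relative to Alluxio; the function and variable names are illustrative.

```python
def percent_slower(baseline_mib_s: float, other_mib_s: float) -> float:
    """Bandwidth gap relative to the baseline, in percent."""
    return (baseline_mib_s - other_mib_s) / baseline_mib_s * 100


# Bandwidth in MiB/s, keyed by thread count (32-thread label assumed for competitors).
alluxio = {1: 2081, 32: 8183}
others = {
    "NAS (J***FS)": {1: 1886, 32: 6207},
    "HPC FS (FSx Lustre)": {1: 185, 32: 3992},
}

gaps = {
    name: {threads: percent_slower(alluxio[threads], bw) for threads, bw in results.items()}
    for name, results in others.items()
}

for name, by_threads in gaps.items():
    for threads, gap in by_threads.items():
        print(f"{name}, {threads} thread(s): {gap:.2f}% slower")
```

The computed gaps agree with the quoted percentages to within 0.1 percentage point; the 1-thread NAS gap comes out at about 9.37%, which the slide reports as 9.3%.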

CASE STUDY:
High Data Performance AI Platform for model training & inference
BUSINESS BENEFIT:
● 10X faster time-to-production
○ Avoid data copy from cloud object store to HDFS
○ Start GPU cluster and Alluxio caching in any cloud with Kubernetes in 10 minutes
TECH BENEFIT:
● Increase GPU utilization in LLM training: ~50% to 93%+
● Increase GPU utilization in Search/Recommendation/Ads training: ~20% to 40%+
[Diagram labels: Training - Cloud; Training - On Prem; Online machine learning platform - Cloud; HDFS; Training Data & checkpoints; Model Training; Model Deployment; Model Inference; Downstream Applications; 400 Gbps network connection]

Introducing Rapid Alluxio Deployer (RAD) in AWS!
Get started with a fully deployed Alluxio AI cluster with just a few clicks in under 40 minutes!
● Explore the potential performance benefits of Alluxio by running FIO benchmarks
● Simplify the deployment process with preconfigured template clusters
● Maintain full control of your data with Alluxio deployed within your AWS account
Blog with sign up link and tutorial

Takeaway
● When Compute Resource Scarcity Becomes the Norm, a Distributed
Caching Layer Works Well to Enable I/O-Intensive Training
○ Having a Single Source of Truth Makes Life Much Easier
● Architectural Changes Are Required to Meet the Requirements for ML
workloads
○ Especially for Metadata Performance and Scalability of Data
Capacity

Thank You
Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!
