SlideShare a Scribd company logo
The power of Ray in the era of
LLM and multi-modality AI
Current: NVIDIA (DGX Cloud)
‘20 ~ ‘24: Anyscale Head of Ray / OSS
Before: Cloudera, LinkedIn, OSS
Hadoop, Spark etc.
Ray is a popular OSS project
Ray
Apache Spark
MLFlow
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
ChatGPT / GPT-4 Trained on Ray!
Fast iterations at Hyper-scale!
This Talk
Adoption Highlights
What is Ray
How is Ray Used?
Future Outlook
Ray: a short history
2016: Started in UC Berkeley (same lab as Spark / vLLM)
2019: Anyscale founded (company behind Ray)
2020: Ray v1.0 release; Ray Serve released
2022: Ray v2.0 release; (KubeRay, Ray Data etc)
2023 / 2024: a lot of focus on LLMs
A Typical ML Pipeline
Ray: Holistically addresses AI/LLM challenges
Unified Framework for Scaling AI Workloads
Ray Core
“Operating System” for heterogeneous distributed computing
Ray AI Libraries
Data Train Reinforcement Learning Serve
Tune
Minimalist API
ray.init() Initialize Ray context.
@ray.remote
Function or class decorator specifying that the function will be executed as a task or
the class as an actor in a different process.
.remote
Postfix to every remote function, remote class declaration, or invocation of a remote
class method. Remote operations are asynchronous.
ray.put()
Store object in object store, and return its ID. This ID can be used to pass object as an
argument to any remote function or method call. This is a synchronous operation.
ray.get()
Return an object or list of objects from the object ID or list of object IDs. This is a
synchronous (i.e., blocking) operation.
def read_array(file):
# read ndarray “a”
# from “file”
return a
def add(a, b):
return np.add(a, b)
a = read_array(file1)
b = read_array(file2)
sum = add(a, b)
Function
class Counter(object):
def __init__(self):
self.value = 0
def inc(self):
self.value += 1
return self.value
c = Counter()
c.inc()
c.inc()
Class
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
a = read_array(file1)
b = read_array(file2)
sum = add(a, b)
@ray.remote
class Counter(object):
def __init__(self):
self.value = 0
def inc(self):
self.value += 1
return self.value
c = Counter()
c.inc()
c.inc()
Function → Task Class → Actor
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
@ray.remote
class Counter(object):
def __init__(self):
self.value = 0
def inc(self):
self.value += 1
return self.value
c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
Function → Task Object → Actor
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
file1 file2
Node 1 Node 2
id1: distributed future (object id)
read_array
id1
Return future id1; before
read_array() finishes
Task API
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
file1 file2
Node 1 Node 2
read_array
id1
read_array
id2
Dynamic task graph:
build at runtime
Task API
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
file1 file2
Node 1 Node 2
read_array
id1
read_array
id2
add
id
Node 3
Every task submitted,
but not finished yet
Task API
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
file1 file2
Node 1 Node 2
read_array
id1
read_array
id2
add
id
Node 3
ray.get() block until
result available
Task API
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
file1
Node 1 Node 2
Node 3
read_array
file2
read_array
add
sum
Task graph executed to
compute sum
Task API
@ray.remote
def read_array(file):
# read ndarray “a”
# from “file”
return a
@ray.remote
def add(a, b):
return np.add(a, b)
id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)
@ray.remote(num_cpus=2, num_gpus=1)
class Counter(object):
def __init__(self):
self.value = 0
def inc(self):
self.value += 1
return self.value
c = Counter.remote()
id1 = c.inc.remote()
id2 = c.inc.remote()
val = ray.get(id2)
Function → Task Class → Actor
can specify
resource demands;
support heterogeneous hardware
The “No B.S” Slide 🤓
At the end of the day, “RPC is all you need”
💻
Run func(x)
Return y
At the end of the day, “RPC is all you need”
. . .
💻
Harry: do this Ron: do that
. . .
💻
SELECT * …
At the end of the day, “RPC is all you need”
. . .
💻
I need someone to do this (w/ 2 CPUs)
I need someone to do that (w/ 1 H100 GPU)
At the end of the day, “RPC is all you need”
“No abstraction; it’s like SSH”
“Do things my way (queries)”
“It’s still Python code”
At the end of the day, “RPC is all you need”
A compelling example
This Talk
Adoption Highlights
What is Ray
How is Ray Used?
Future Outlook
Most Successful Use Cases
Model Training Unstructured Data
(text, image, video)
LLM Inference /
Fine-tune
Reinforcement
Learning
Graph
Computing
…
Model Training – Pinterest
Model Training – Pinterest
Model Training – Pinterest
Model Training – Uber
Model Training – Uber
Unstructured Data – Benchmark
Leading
open-source
framework
Leading
commercial
ML Platform
$0
$20
$40
$60
$3.5
Load
Pre-
processing
Inference Save
Ray Core
CPU CPU
GPU
Batch Inference
● Use most cost-effective hardware for each stage
● Independently scale every stage
$7.3
$57
Cost to process 1M images
https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker
Unstructured Data – Benchmark
Load
Pre-
processing
Inference Save
Ray Core
CPU CPU
GPU
● Use most cost-effective hardware for each stage
● Independently scale every stage
Unstructured Data – Niantic
Unstructured Data – Niantic
Unstructured Data – Niantic
This Talk
Adoption Highlights
What is Ray
How is Ray Used?
Future Outlook
Zhe’s Takes on ML Infra
“No abstraction; it’s like SSH”
“Do things my way (queries)”
“It’s still Python code”
Zhe’s Takes on ML Infra
How structured is the problem?
- ML is a much more unstructured problem than
Data at this point
- It could become structured at some point
- Most people are still settling with the ssh + cmd
approach (e.g. torchrun on Slurm)
Zhe’s Takes on ML Infra
The rise of multi-modality AI
- Much much more data-intensive
- More development / trial-and-error
- It becomes more meaningful to “upgrade your
gear” (try Ray)
Ad

More Related Content

Similar to AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI (20)

Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
OSCON Byrum
 
Node.js Patterns for Discerning Developers
Node.js Patterns for Discerning DevelopersNode.js Patterns for Discerning Developers
Node.js Patterns for Discerning Developers
cacois
 
Distributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with RayDistributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with Ray
Jan Margeta
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
Ajay Ohri
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Framework engineering JCO 2011
Framework engineering JCO 2011Framework engineering JCO 2011
Framework engineering JCO 2011
YoungSu Son
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
Vijay Srinivas Agneeswaran, Ph.D
 
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
 IronPython and Dynamic Languages on .NET by Mahesh Prakriya IronPython and Dynamic Languages on .NET by Mahesh Prakriya
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
codebits
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Provectus
 
Mist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache SparkMist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache Spark
Вадим Челышов
 
Ty bca-sem-v-introduction to vb.net-i-uploaded
Ty bca-sem-v-introduction to vb.net-i-uploadedTy bca-sem-v-introduction to vb.net-i-uploaded
Ty bca-sem-v-introduction to vb.net-i-uploaded
Prachi Sasankar
 
Slickdemo
SlickdemoSlickdemo
Slickdemo
Knoldus Inc.
 
Tackling repetitive tasks with serial or parallel programming in R
Tackling repetitive tasks with serial or parallel programming in RTackling repetitive tasks with serial or parallel programming in R
Tackling repetitive tasks with serial or parallel programming in R
Lun-Hsien Chang
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Using Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App EngineUsing Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App Engine
River of Talent
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
OSCON Byrum
 
Node.js Patterns for Discerning Developers
Node.js Patterns for Discerning DevelopersNode.js Patterns for Discerning Developers
Node.js Patterns for Discerning Developers
cacois
 
Distributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with RayDistributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with Ray
Jan Margeta
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
Ajay Ohri
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
Framework engineering JCO 2011
Framework engineering JCO 2011Framework engineering JCO 2011
Framework engineering JCO 2011
YoungSu Son
 
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
 IronPython and Dynamic Languages on .NET by Mahesh Prakriya IronPython and Dynamic Languages on .NET by Mahesh Prakriya
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
codebits
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Provectus
 
Ty bca-sem-v-introduction to vb.net-i-uploaded
Ty bca-sem-v-introduction to vb.net-i-uploadedTy bca-sem-v-introduction to vb.net-i-uploaded
Ty bca-sem-v-introduction to vb.net-i-uploaded
Prachi Sasankar
 
Tackling repetitive tasks with serial or parallel programming in R
Tackling repetitive tasks with serial or parallel programming in RTackling repetitive tasks with serial or parallel programming in R
Tackling repetitive tasks with serial or parallel programming in R
Lun-Hsien Chang
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Using Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App EngineUsing Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App Engine
River of Talent
 

More from Alluxio, Inc. (20)

How Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model TrainingHow Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
Alluxio, Inc.
 
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
Alluxio, Inc.
 
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and FinetuneAI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
Alluxio, Inc.
 
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
Alluxio, Inc.
 
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber ScaleAI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
Alluxio, Inc.
 
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio, Inc.
 
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference StackAI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
Alluxio, Inc.
 
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
Alluxio, Inc.
 
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
Alluxio, Inc.
 
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio, Inc.
 
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
Alluxio, Inc.
 
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom DevelopersAI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
Alluxio, Inc.
 
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
Alluxio, Inc.
 
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
Alluxio, Inc.
 
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...
Alluxio, Inc.
 
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...
Alluxio, Inc.
 
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMsAI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
Alluxio, Inc.
 
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model TrainingHow Coupang Leverages Distributed Cache to Accelerate ML Model Training
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
Alluxio, Inc.
 
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
Alluxio, Inc.
 
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and FinetuneAI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
Alluxio, Inc.
 
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio:  Preprocessing, ...
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
Alluxio, Inc.
 
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber ScaleAI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
Alluxio, Inc.
 
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio, Inc.
 
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference StackAI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
Alluxio, Inc.
 
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
Alluxio, Inc.
 
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
Alluxio, Inc.
 
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio, Inc.
 
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
Alluxio, Inc.
 
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom DevelopersAI/ML Infra Meetup | Big Data and AI, Zoom Developers
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
Alluxio, Inc.
 
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...
Alluxio, Inc.
 
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...
Alluxio, Inc.
 
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...
Alluxio, Inc.
 
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...
Alluxio, Inc.
 
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...
Alluxio, Inc.
 
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMsAI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs
Alluxio, Inc.
 
Ad

Recently uploaded (20)

Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Ad

AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI

  • 1. The power of Ray in the era of LLM and multi-modality AI
  • 2. Current: NVIDIA (DGX Cloud) ‘20 ~ ‘24: Anyscale Head of Ray / OSS Before: Cloudera, LinkedIn, OSS Hadoop, Spark etc.
  • 3. Ray is a popular OSS project Ray Apache Spark MLFlow
  • 5. ChatGPT / GPT-4 Trained on Ray! Fast iterations at Hyper-scale!
  • 6. This Talk Adoption Highlights What is Ray How is Ray Used? Future Outlook
  • 7. Ray: a short history 2016: Started in UC Berkeley (same lab as Spark / vLLM) 2019: Anyscale founded (company behind Ray) 2020: Ray v1.0 release; Ray Serve released 2022: Ray v2.0 release; (KubeRay, Ray Data etc) 2023 / 2024: a lot of focus on LLMs
  • 8. A Typical ML Pipeline
  • 9. Ray: Holistically addresses AI/LLM challenges Unified Framework for Scaling AI Workloads Ray Core “Operating System” for heterogeneous distributed computing Ray AI Libraries Data Train Reinforcement Learning Serve Tune
  • 10. Minimalist API ray.init() Initialize Ray context. @ray.remote Function or class decorator specifying that the function will be executed as a task or the class as an actor in a different process. .remote Postfix to every remote function, remote class declaration, or invocation of a remote class method. Remote operations are asynchronous. ray.put() Store object in object store, and return its ID. This ID can be used to pass object as an argument to any remote function or method call. This is a synchronous operation. ray.get() Return an object or list of objects from the object ID or list of object IDs. This is a synchronous (i.e., blocking) operation.
  • 11. def read_array(file): # read ndarray “a” # from “file” return a def add(a, b): return np.add(a, b) a = read_array(file1) b = read_array(file2) sum = add(a, b) Function class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value c = Counter() c.inc() c.inc() Class
  • 12. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) a = read_array(file1) b = read_array(file2) sum = add(a, b) @ray.remote class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value c = Counter() c.inc() c.inc() Function → Task Class → Actor
  • 13. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) @ray.remote class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value c = Counter.remote() id4 = c.inc.remote() id5 = c.inc.remote() Function → Task Object → Actor
  • 14. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) file1 file2 Node 1 Node 2 id1: distributed future (object id) read_array id1 Return future id1; before read_array() finishes Task API
  • 15. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) file1 file2 Node 1 Node 2 read_array id1 read_array id2 Dynamic task graph: build at runtime Task API
  • 16. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) file1 file2 Node 1 Node 2 read_array id1 read_array id2 add id Node 3 Every task submitted, but not finished yet Task API
  • 17. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) file1 file2 Node 1 Node 2 read_array id1 read_array id2 add id Node 3 ray.get() block until result available Task API
  • 18. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) file1 Node 1 Node 2 Node 3 read_array file2 read_array add sum Task graph executed to compute sum Task API
  • 19. @ray.remote def read_array(file): # read ndarray “a” # from “file” return a @ray.remote def add(a, b): return np.add(a, b) id1 = read_array.remote(file1) id2 = read_array.remote(file2) id = add.remote(id1, id2) sum = ray.get(id) @ray.remote(num_cpus=2, num_gpus=1) class Counter(object): def __init__(self): self.value = 0 def inc(self): self.value += 1 return self.value c = Counter.remote() id1 = c.inc.remote() id2 = c.inc.remote() val = ray.get(id2) Function → Task Class → Actor can specify resource demands; support heterogeneous hardware
  • 20. The “No B.S” Slide 🤓 At the end of the day, “RPC is all you need” 💻 Run func(x) Return y
  • 21. At the end of the day, “RPC is all you need” . . . 💻 Harry: do this Ron: do that
  • 22. . . . 💻 SELECT * … At the end of the day, “RPC is all you need”
  • 23. . . . 💻 I need someone to do this (w/ 2 CPUs) I need someone to do that (w/ 1 H100 GPU) At the end of the day, “RPC is all you need”
  • 24. “No abstraction; it’s like SSH” “Do things my way (queries)” “It’s still Python code” At the end of the day, “RPC is all you need”
  • 26. This Talk Adoption Highlights What is Ray How is Ray Used? Future Outlook
  • 27. Most Successful Use Cases Model Training Unstructured Data (text, image, video) LLM Inference / Fine-tune Reinforcement Learning Graph Computing …
  • 28. Model Training – Pinterest
  • 29. Model Training – Pinterest
  • 30. Model Training – Pinterest
  • 33. Unstructured Data – Benchmark Leading open-source framework Leading commercial ML Platform $0 $20 $40 $60 $3.5 Load Pre- processing Inference Save Ray Core CPU CPU GPU Batch Inference ● Use most cost-effective hardware for each stage ● Independently scale every stage $7.3 $57 Cost to process 1M images https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker
  • 34. Unstructured Data – Benchmark Load Pre- processing Inference Save Ray Core CPU CPU GPU ● Use most cost-effective hardware for each stage ● Independently scale every stage
  • 38. This Talk Adoption Highlights What is Ray How is Ray Used? Future Outlook
  • 39. Zhe’s Takes on ML Infra “No abstraction; it’s like SSH” “Do things my way (queries)” “It’s still Python code”
  • 40. Zhe’s Takes on ML Infra How structured is the problem? - ML is a much more unstructured problem than Data at this point - It could become structured at some point - Most people are still settling with the ssh + cmd approach (e.g. torchrun on Slurm)
  • 41. Zhe’s Takes on ML Infra The rise of multi-modality AI - Much much more data-intensive - More development / trial-and-error - It becomes more meaningful to “upgrade your gear” (try Ray)