Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing)
Junchen Jiang
The Internet age is punctuated with innovations

Each era brought its defining system innovations:
- Web apps: CDNs
- Video: MPEG-DASH, 5G
- Big Data: Cloud, MapReduce
- Search engines: data centers
- Gen AI: ????

OpenAI alone already has 1.8 billion monthly visitors
(YouTube has 2.7 billion monthly visitors).
What will be the key system innovations for generative AI?
Do we know how to build the Gen AI system yet?

Internet video:
- 1990s: the basics (how to build websites and players)
- 2000s-2020s: building a global distributed system: P2P or CDN, video
  transcoding, scale-out streaming, streaming quality monitoring, DNS
  redirection, video caching, ... (this took us 20 years)

Gen AI (LLMs):
- 2022-2024: the basics (how to build AI apps and servers)
- 2024-????: building a global distributed system: ??? (we are here)

We are still at the very early stage of LLM infrastructure.
This talk: sharing knowledge across LLMs.
LLMs are more powerful when paired with "knowledge"

LLMs need to read a large amount of data in real time:
(looooooooooooog) contexts such as news, business docs, chat/shopping
history, or books, plus a (short) user query, to produce the output text.

The prefill delay will only grow (longer contexts, bigger models),
while users get less patient.
Yet, it takes time to "learn" (prefill) the context

Today, each LLM instance serving queries about the same book must prefill
the same context separately, re-deriving the same LLM-learned knowledge
(the KV cache) every time.

Knowledge Sharing: You Only Learn Once.
Once one LLM learns something, other LLMs will immediately know it.
Vision: Knowledge Sharing

With knowledge sharing, only one LLM instance prefills the book; the
resulting KV cache is shared, so the other instances answer their queries
without prefilling.

Feel the speedup! (Mistral 7B on an A40 GPU, 13K-token context)
- Query 1, without KV cache: 6.5 sec
- Query 2, with efficient KV cache sharing (explained shortly): 0.9 sec (7x faster)
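The sharing idea above boils down to a cache lookup keyed by the context. The snippet below is a toy illustration of that lookup logic only, not the real serving stack: `fake_prefill` and the "KV cache" it returns are stand-ins for the expensive GPU prefill and the real tensors.

```python
import hashlib

kv_store = {}  # maps hash(context) -> "KV cache" learned once by some LLM instance

def fake_prefill(context: str) -> list[float]:
    # Stand-in for the expensive prefill that builds the real KV cache.
    return [float(ord(c)) for c in context]

def serve(context: str, query: str) -> tuple[str, bool]:
    key = hashlib.sha256(context.encode()).hexdigest()
    cache_hit = key in kv_store
    if not cache_hit:
        kv_store[key] = fake_prefill(context)  # "you only learn once"
    kv = kv_store[key]
    # Decoding would consume `kv` plus the short query; here we fake it.
    answer = f"answer({query}) using a {len(kv)}-entry KV cache"
    return answer, cache_hit

book = "some long book text " * 100
_, hit1 = serve(book, "Who is the protagonist?")    # first instance prefills
_, hit2 = serve(book, "What happens in chapter 2?") # later queries reuse the cache
```

The second call skips `fake_prefill` entirely, which is exactly where the 7x speedup in the demo comes from.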
Why will the same knowledge (KV cache) be reused?
20% of your knowledge is used 80% of the time (the 80-20 rule).

Faster (shorter time-to-first-token):
Ex. a 5,000-token document (context) + a 100-token question.
With the document's KV cache, time-to-first-token is at least 50x faster.

Higher throughput:
Without prefill, generation (decoding) is easier to batch.
On an A100 GPU, vLLM running Llama2-7B can process 5x more requests per second.*

Will it be too expensive to store the KV cache?
The KV cache is bigger than the text, but storing it on SSD is 4x cheaper
than re-computing it on GPUs. With longer contexts (or bigger models), the
KV cache size grows more slowly than the prefill delay.
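To put "the KV cache is bigger than the text" in numbers, here is a back-of-the-envelope size estimate. The shape parameters below (32 layers, 32 heads, head dimension 128, fp16) are the published Llama2-7B values; the function itself is plain arithmetic.

```python
def kv_cache_bytes(num_tokens: int, layers: int = 32, heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Factor of 2 for the K and V tensors; each layer stores, per token,
    # heads * head_dim values at `bytes_per_value` bytes (2 for fp16).
    return 2 * layers * heads * head_dim * bytes_per_value * num_tokens

per_token = kv_cache_bytes(1)      # 524288 bytes = 512 KB per token
doc_5k = kv_cache_bytes(5000)      # ~2.4 GB for a 5,000-token document
print(per_token, doc_5k / 2**30)
```

A 5,000-token document is roughly 20 KB of text but ~2.4 GB of fp16 KV cache, which is why cheap storage and aggressive compression both matter.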
Architecting Efficient Knowledge Sharing

A Knowledge-Sharing System sits between the LLM instances and provides
three functions:
- Knowledge synthesis
- Knowledge caching
- Knowledge retrieval
Knowledge caching, in particular, is a perfect fit for storage solutions
like Alluxio.
Challenge #1: The KV cache is 100,000x bigger than the text;
simply loading it remotely is too slow.
Key technique #1: fast KV retrieval via KV encoding
(speeds up KV loading by 3-10x).

Challenge #2: If a text is not at the prefix, its KV cache cannot be reused.
Key technique #2: flexible joining of multiple KV caches.
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang,
Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh
Ananthanarayanan, Junchen Jiang
ACM SIGCOMM 2024
CacheGen: Compressing the KV cache for fast prefill

- Without caching: the LLM prefills the loooong context plus the query,
  then generates the output.
- With a stored KV cache: load the KV cache, prefill only the query, then
  generate.
- With CacheGen: load the compressed KV cache and decompress it, prefill
  only the query, then generate.

Result: faster prefill even if the reused KV cache is loaded remotely.
10,000 ft view of CacheGen

Encoding: the KV cache (K tensor and V tensor) is encoded into compact
binary representations and stored. Decoding: the binary representations
are decoded back into the (decompressed) K and V tensors.

CacheGen: encode the KV cache into a compact binary representation.
Several emerging approaches to KV compression

- Quantizing the KV cache directly
- Dropping less important tokens from the text
- Dropping less important tokens from the KV cache

They all keep the KV tensor's shape, so they are complementary to CacheGen;
CacheGen can improve them too!

Can the KV cache be encoded efficiently?
Analogy with video compression

A video codec encodes a video into a small size with only a small
degradation in video quality (trading off the size of the encoded video
against video quality). CacheGen aims for the same trade-off: the size of
the encoded KV cache against the quality of the generated text. We can
borrow from the 20-year research literature on video compression.
Why can fewer bits represent the KV cache?

Key distributional properties of the KV cache:
- The KV cache is similar between neighboring tokens.
- Some parts of a KV cache are less sensitive to quantization.
- A quantized KV cache can be entropy-encoded with fewer bits.
Opportunity 1: Locality of KV cache values

Within the K tensor at layer j (tokens x channels), the KV values at
nearby tokens are similar. So CacheGen encodes the delta between
neighboring tokens rather than the tokens themselves: for any token i,
instead of K_i and V_i, encode the deltas K_i - K_{i-1} and V_i - V_{i-1}.
Delta values have much smaller variance, which makes them easier to quantize.
[Figure: CDF of absolute values, original vs. delta; the delta
distribution is concentrated at much smaller values.]
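The delta trick is easy to check numerically. Below, a random walk along the token axis stands in for one channel block of a K tensor; real KV values are model activations, so this only illustrates the statistics, not the actual tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "K tensor": 1000 tokens x 64 channels, built as a random walk along
# the token axis so neighboring tokens are similar (the locality property).
k = np.cumsum(rng.normal(scale=0.1, size=(1000, 64)), axis=0)

delta = np.diff(k, axis=0)   # K_i - K_{i-1} along the token axis

orig_std = k.std()
delta_std = delta.std()
print(orig_std, delta_std)   # the deltas have far smaller spread
```

A fixed quantization grid therefore covers the deltas with much finer resolution than it covers the raw values.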
Opportunity 2: Heterogeneous sensitivity to quantization

The LLM's output quality is more sensitive to losses in the KV cache
values of the shallower layers than to those in the deeper layers.
[Figure: LLM output quality (accuracy) by layer group, from layers [0, 3]
through [20, 23].]
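One way to exploit this sensitivity profile is to spend more bits on shallow layers and fewer on deep ones. The 8/6/4-bit schedule and the 24-layer toy tensors below are arbitrary illustrations, not the allocation CacheGen actually uses.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    # Uniform scalar quantization to 2**bits levels over x's value range.
    lo, hi = x.min(), x.max()
    levels = 2**bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

rng = np.random.default_rng(1)
kv_per_layer = [rng.normal(size=(256, 128)) for _ in range(24)]

def bits_for_layer(layer: int) -> int:
    # More bits for shallow (sensitive) layers, fewer for deep ones.
    return 8 if layer < 8 else 6 if layer < 16 else 4

errors = []
for layer, kv in enumerate(kv_per_layer):
    errors.append(np.abs(quantize(kv, bits_for_layer(layer)) - kv).mean())

print(errors[0], errors[-1])  # shallow layers see much smaller error
```

The reconstruction error lands where the model can tolerate it, which is how a smaller bit budget preserves output quality.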
Opportunity 3: Arithmetic coding

Pipeline: KV cache → compute delta → quantize → adaptive arithmetic
coding → a more compact binary representation (001101001110010…) that can
be stored on disk or sent over the network.
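The encode path above can be sketched end to end. Here zlib stands in for the adaptive arithmetic coder (a real arithmetic coder adapts its symbol probabilities, but any entropy coder shows the effect): low-variance quantized deltas have low entropy, so the bitstream shrinks well below the raw fp16 size.

```python
import zlib
import numpy as np

rng = np.random.default_rng(2)
# Toy KV tensor with token locality: a random walk, 2000 tokens x 64 channels.
kv = np.cumsum(rng.normal(scale=0.05, size=(2000, 64)).astype(np.float32), axis=0)

# 1) deltas between neighboring tokens (the first row is the raw first token)
delta = np.diff(kv, axis=0, prepend=np.zeros_like(kv[:1]))

# 2) quantize the deltas to 8-bit integers
scale = np.abs(delta).max() / 127
q = np.round(delta / scale).astype(np.int8)

# 3) entropy-code the quantized symbols (zlib as a stand-in coder)
encoded = zlib.compress(q.tobytes(), level=9)

raw_bytes = kv.astype(np.float16).nbytes   # baseline: the raw fp16 KV cache
print(raw_bytes, len(encoded))
```

Decoding reverses the steps: entropy-decode, rescale, and cumulatively sum the deltas back into the tensor.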
Reducing decoding overhead?

Decoding (binary representations back into the K and V tensors) runs on
the GPU, and it can be pipelined with loading: while one chunk is being
decoded, the next chunk is being loaded.
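Overlapping loading with decoding is a classic producer-consumer pipeline. The sketch below simulates it with a worker thread that prefetches the next chunk while the current one is "decoded"; the chunk count, payloads, and sleep times are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNKS = list(range(8))

def load(chunk_id: int) -> str:
    time.sleep(0.01)               # simulated network/disk transfer
    return f"binary-{chunk_id}"

def decode(blob: str) -> str:
    time.sleep(0.01)               # simulated GPU decode
    return blob.replace("binary", "kv")

def pipelined() -> list[str]:
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load, CHUNKS[0])
        for nxt in CHUNKS[1:]:
            blob = future.result()
            future = pool.submit(load, nxt)   # prefetch the next chunk...
            results.append(decode(blob))      # ...while decoding this one
        results.append(decode(future.result()))
    return results

out = pipelined()
print(out)
```

With loading and decoding overlapped, the end-to-end time approaches max(load, decode) per chunk instead of their sum.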
Evaluation setup

- Network: 3 Gbps (cloud server bandwidth)
- Models: Llama-70B, Llama-34B, Mistral-7B
- Datasets (with varying context length distributions): LongChat,
  TriviaQA, NarrativeQA, WikiText
- Quality metrics: accuracy, F1 score, perplexity
Quality vs. Size & TTFT (time to first token)

[Figure: accuracy vs. KV cache size (MB), and accuracy vs. TTFT (seconds),
comparing CacheGen, uniform quantization, and full prefill (no caching).]

Setup:
- Dataset: LongChat (200 contexts, ~9.6K tokens each)
- Model: Llama-70B
- Link to load the KV cache (1.6 GB at 8-bit): 3 Gbps

Result: a 3x smaller KV cache size → a 3-6x lower time to first token (TTFT).
Impact of context length

[Figure: KV cache size (MB) vs. context length (3K-20K tokens) for
CacheGen and uniform 8-bit quantization.]

Setup: Llama-70B.
The size reduction holds across various context lengths.
Breakdown of Time to First Token (TTFT)

- Full recompute: 6.1 sec (prefill on input + prefill on query)
- Naïve KV cache loading (w/o KV encoding): 7.2 sec load + 0.12 sec
  prefill on query
- CacheGen (KV encoding): 1.8 sec load + decompress + 0.12 sec prefill
  on query

Setup: LongChat (200 contexts, ~9.6K tokens each), Llama-70B, 3 Gbps link
to load the KV cache.
Towards Efficient Knowledge Storing & Sharing

Key technique #1: fast KV cache loading via a KV codec
(speeds up KV loading by 3-10x).
Key technique #2: flexible joining of multiple KV caches
(happy to chat about technique #2 after the talk).
Try it yourself!

- Code repo: https://github.com/uchi-jcl/cachegen
- Research paper: https://arxiv.org/pdf/2310.07240.pdf
Efficient Knowledge Sharing System

[Figure: delay (time to first token) vs. cost (storage, compute,
communication) for GPU prefill and for storing the KV cache in CPU
memory, on SSD, and in S3; efficient knowledge sharing targets low delay
at low cost.]
Contact me if you are a potential user of, or contributor to, our
Knowledge-Sharing System!
junchenj@uchicago.edu
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Ad

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG

  • 1. Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing) Junchen Jiang 1
  • 2. The Internet age is punctuated with innovations 2 Web apps CDN Video MPEG- DASH, 5G Big Data Cloud, MapReduce Search engine Data Centers Gen AI ???? OpenAI alone already has 1.8 billion monthly visitors (YouTube has 2.7 billion monthly visitors) What will be the key system innovations for generative AI? time
  • 3. Do we know how to build the Gen AI system yet? 3 Basics How to build websites & players 1990s 2000s – 2020s Building a global distributed system? P2P or CDN, video transcoding, scale out streaming, streaming quality monitoring, DNS redirection, video caching, … Basics How to build AI apps and servers 2022-2024 2024 - ?????? Building a global distributed system ??? We are still at the very early stage of LLM infrastructure These took us 20 years This talk: Sharing knowledge across LLMs Internet video Gen AI (LLMs) We are here
  • 4. LLMs are more powerful when paired with "knowledge" LLMs need to read a large amount of data in real-time (looooooooooooog) contexts output text LLM (short) query News Business docs Chat/ shopping history Book User
  • 5. The prefill delay will only grow (longer contexts, bigger models), while users get less patient. Yet, it takes time to "learn" (prefill) the context LLM LLM LLM Queries about a book Prefilling Prefilling Prefilling LLM-Learned knowledge (KV cache)
  • 6. Knowledge Sharing You Only Learn Once: Once one LLM learns something, other LLMs will immediately know Vision: Knowledge Sharing LLM LLM LLM Queries about a book Prefilling LLM-Learned knowledge (KV cache)
  • 7. Feel the speedup! Context text (13K tokens) 6.5sec Query 2 0.9sec (7x faster) Mistral 7B on A40 Mistral 7B on A40 Query 1 KV Cache Sharing KV cache w/o KV cache With efficient KV cache sharing (explained shortly)
  • 8. Vision: Knowledge Sharing Why will the same knowledge (KV cache) be reused? 20% of your knowledge is used 80% of the time. (20-80 rule) Faster (shorter time-to-first-token) Ex. 5,000-token document (context) + 100-token question With document's KV cache, time-to-first-token is at least 50x faster Higher throughput Without prefill, generation (decoding) would be easier to batch On an A100 GPU, vLLM running Llama2-7B can process 5x requests per second* Will it be too expensive to store KV cache? KV cache is bigger than text but storing it on SSD is 4x cheaper than re-computing it on GPUs. With longer contexts (or bigger models), KV cache size grows slower than prefill delay.
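The "at least 50x" figure on this slide can be sanity-checked with a back-of-envelope sketch (an illustration of the arithmetic, not a measurement); since attention cost is super-linear in context length, the linear model below is conservative:

```python
# Back-of-envelope check of the "at least 50x" claim, assuming TTFT is
# dominated by prefill and prefill cost grows (at least) linearly with
# the number of tokens prefilled. Names are illustrative.
def prefill_tokens(context_tokens: int, query_tokens: int, kv_cached: bool) -> int:
    """Tokens the LLM must actually prefill for one request."""
    # With the document's KV cache, only the query still needs prefilling.
    return query_tokens if kv_cached else context_tokens + query_tokens

cold = prefill_tokens(5_000, 100, kv_cached=False)  # 5,100 tokens
warm = prefill_tokens(5_000, 100, kv_cached=True)   # 100 tokens
print(cold / warm)  # 51.0 -> "at least 50x" fewer tokens to prefill
```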
  • 9. LLM LLM LLM Architecting Efficient Knowledge Sharing Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System
  • 10. LLM LLM LLM Architecting Efficient Knowledge Sharing Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System Perfect fit for storage solutions, like Alluxio
  • 11. Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Challenge: KV cache is 100,000x bigger than text. Simply loading it remotely is too slow. Key technique #1: Fast KV retrieval via KV encoding (Speed up KV loading by 3-10x)
  • 12. Knowledge synthesis Knowledge caching Knowledge retrieval Knowledge-Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Challenge: If a text is not at the prefix, its KV cache cannot be reused Key technique #2: Flexible join of multiple KV caches
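A toy sketch of the prefix constraint this slide describes (the chunked-prompt representation below is a hypothetical simplification): a standard prefix cache can reuse a KV cache only when the cached text is an exact prefix of the new prompt, so the same document placed after a different system prompt gets zero reuse.

```python
# Hypothetical, simplified prefix-cache lookup: tokens are modeled as
# strings, and reuse is limited to the longest exact common prefix.
def reusable_prefix_tokens(cached: list, prompt: list) -> int:
    """Number of leading tokens whose KV cache can be reused."""
    n = 0
    for c, p in zip(cached, prompt):
        if c != p:
            break
        n += 1
    return n

doc = ["<doc>", "chunk_A", "chunk_B"]
# Same document, but preceded by a system prompt -> no prefix match at all.
print(reusable_prefix_tokens(doc, ["<sys>"] + doc + ["question"]))  # 0
print(reusable_prefix_tokens(doc, doc + ["question"]))              # 3
```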
  • 13. Knowledge- Sharing System LLM LLM LLM Architecting Efficient Knowledge Sharing Key technique #1: Fast KV retrieval via KV encoding (Speed up KV loading by 3-10x) Key technique #2: Flexible join of multiple KV caches
  • 14. CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving 14 Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang ACM SIGCOMM 2024
  • 15. CacheGen: Compressing KV cache for fast prefill 15 loooooo…oooooong context + query output text LLM Prefill on query time Generate output Loading KV cache Prefill on query time Generate output Loading compressed KV cache & decompress Compressed KV cache KV cache Faster prefill even if the reused KV cache is loaded remotely
  • 16. 10,000 ft view of CacheGen 16 K tensor KV cache V tensor Binary representations Storage Encoding Binary representations K tensor Decompressed KV cache V tensor Decoding CacheGen: Encode KV cache to compact binary representation
  • 17. Several emerging approaches to KV compression 17 Quantizing KV cache directly? Dropping less important tokens from the text? Dropping less important tokens from the KV cache? They all keep the KV's tensor shape ⇒ complementary to CacheGen. CacheGen can improve them too! CacheGen: Encode KV cache to compact binary representation
  • 18. Can KV cache be encoded efficiently? Analogy with video compression: a video codec encodes a video into a small size with little degradation in video quality; likewise, we want to encode a KV cache into a small size with little degradation in the quality of the generated text. [Figure: quality vs. encoded size, for video and for KV cache.] We could borrow from the 20-year research literature on video compression.
  • 19. Why can fewer bits represent KV cache? 19 KV cache is similar between neighboring tokens Some parts of a KV cache are less sensitive to quantization Quantized KV cache can be entropy-encoded with fewer bits Key distributional properties of KV cache
  • 20. Opportunity 1: Locality of KV cache values 20 [Figure: the K tensor at layer j, with axes tokens × channels, across layers.] Opportunity: The KV values at nearby tokens have similar values
  • 21. Delta values have much smaller variance 21 For any token i, compare the original values |K_i|, |V_i| with the deltas |K_i − K_{i−1}|, |V_i − V_{i−1}|. [Figure: CDF of absolute values, original vs. delta.] Encode the delta between neighboring tokens, rather than the tokens themselves. Delta values have much smaller variance ⇒ easier to quantize
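The variance gap on this slide is easy to reproduce on a synthetic tensor. The sketch below assumes neighboring tokens' K values are similar (the locality from the previous slide); the tensor is synthetic, not a real model's KV cache.

```python
import numpy as np

# Build a toy "K tensor" whose neighboring tokens are correlated, then
# compare the spread of the raw values against token-to-token deltas.
rng = np.random.default_rng(0)
steps = rng.normal(scale=0.05, size=(256, 64))   # small token-to-token changes
base = rng.normal(size=(1, 64))                  # per-channel offsets
k = base + np.cumsum(steps, axis=0)              # correlated toy K tensor

delta = np.diff(k, axis=0)                       # K_i - K_{i-1}
print(f"original std={k.std():.3f}  delta std={delta.std():.3f}")
```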
  • 22. Opportunity 2: Heterogeneous sensitivity to quantization 22 [Figure: LLM output quality (accuracy) for losses in each layer group, layers 0–23.] The output quality of the LLM is more sensitive to losses in the KV cache values of the shallower layers than to those in the deeper layers.
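One simple way to act on this sensitivity (a sketch of the idea, not CacheGen's actual allocation) is to spend more quantization bits on shallow layers and fewer on deep ones; the 8-to-4-bit budget below is an illustrative assumption.

```python
# Hedged sketch: linearly taper quantization bits from shallow layers
# (more sensitive -> more bits) to deep layers (less sensitive -> fewer).
def bits_for_layer(layer: int, n_layers: int, hi_bits: int = 8, lo_bits: int = 4) -> int:
    """Bit budget for one layer's KV cache, tapering from hi_bits to lo_bits."""
    frac = layer / max(n_layers - 1, 1)
    return round(hi_bits - frac * (hi_bits - lo_bits))

plan = [bits_for_layer(layer, 24) for layer in range(24)]
print(plan)  # shallow layers get 8 bits, the deepest get 4
```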
  • 23. Opportunity 3: Arithmetic coding 23 001101001110010… KV cache Compute delta Quantize Adaptive arithmetic coding More compact binary representation - stored on disk - sent via network
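The three steps on this slide (compute delta → quantize → entropy-code) can be sketched end to end. Here zlib stands in for CacheGen's adaptive arithmetic coder and the tensor is synthetic, so only the raw-vs-delta size comparison is meaningful.

```python
import zlib

import numpy as np

# Pipeline sketch: delta -> uniform quantization -> entropy coding.
rng = np.random.default_rng(0)
k = np.cumsum(rng.normal(scale=0.05, size=(512, 64)), axis=0)  # correlated toy K tensor

def encode(x: np.ndarray, scale: float = 0.01) -> bytes:
    q = np.round(x / scale).astype(np.int16)  # uniform quantization
    return zlib.compress(q.tobytes())         # stand-in for arithmetic coding

raw_size = len(encode(k))                     # quantize the values themselves
delta_size = len(encode(np.diff(k, axis=0)))  # quantize the deltas instead
print(raw_size, delta_size)  # low-variance deltas entropy-code far better
```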
  • 24. Reducing decoding overhead? 24 K tensor KV cache V tensor Binary representations Encoding K tensor Decompressed KV cache V tensor GPU-based Decoding Loading Decoding and loading can be pipelined
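The pipelining on this slide can be sketched with a one-worker prefetch loop, assuming the KV cache is split into per-layer chunks (the chunking granularity is an assumption): layer i+1 is fetched while layer i is decoded.

```python
import concurrent.futures
import time

def load_chunk(i: int) -> str:          # stand-in for a network/disk read
    time.sleep(0.03)
    return f"compressed-layer-{i}"

def decode_chunk(blob: str) -> str:     # stand-in for GPU-based decoding
    time.sleep(0.03)
    return blob.replace("compressed", "decoded")

def pipelined(n_layers: int) -> list:
    out = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(load_chunk, 0)
        for i in range(n_layers):
            blob = nxt.result()
            if i + 1 < n_layers:
                nxt = pool.submit(load_chunk, i + 1)  # prefetch the next layer...
            out.append(decode_chunk(blob))            # ...while this one decodes
    return out

start = time.perf_counter()
layers = pipelined(8)
elapsed = time.perf_counter() - start
print(len(layers), f"{elapsed:.2f}s")  # ~0.27s pipelined vs ~0.48s fully serial
```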
  • 25. Evaluation setup Bandwidth: 3 Gbps (cloud server). Models: Llama-70B, Llama-34B, Mistral-7B. Datasets (each with its own context-length distribution): LongChat, TriviaQA, NarrativeQA, WikiText. Quality metrics: accuracy, F1 score, perplexity.
  • 26. Quality vs. Size & TTFT (time to first token) 26 [Figures: accuracy vs. KV cache size (MB), and accuracy vs. TTFT (s), comparing CacheGen, uniform quantization, and full prefill (no caching).] Setup — Dataset: LongChat (200 contexts, ~9.6K tokens each); Model: Llama-70B; Link to load KV cache (1.6 GB at 8-bit): 3 Gbps. 3x smaller KV cache size ⇒ 3-6x lower time to first token (TTFT)
  • 27. Impact of context length 27 [Figure: KV cache size (MB) vs. context length (3K–20K tokens), CacheGen vs. uniform 8-bit quantization.] Setup — Model: Llama-70B. The size reduction remains under various context lengths
  • 28. Breakdown of Time to First Token (TTFT) 28 Full recompute: 6.1 sec prefill on input + prefill on query. Naïve KV cache loading (w/o KV encoding): 7.2 sec load + 0.12 sec prefill on query. CacheGen (KV encoding): 1.8 sec load + decompress + 0.12 sec prefill on query. Setup — Dataset: LongChat (200 contexts, ~9.6K tokens each); Model: Llama-70B; Link to load KV cache: 3 Gbps
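The load times on this slide can be partially reconstructed from the earlier setup numbers (1.6 GB KV cache at 8-bit, 3 Gbps link). The sketch below computes pure transfer time only, ignoring storage I/O and decompression overhead, which is why it under-estimates the measured 7.2 s.

```python
# Bandwidth arithmetic only; the ~3x size reduction is the figure
# reported on the preceding results slide.
BYTES_PER_GBIT = 1e9 / 8

def load_seconds(size_bytes: float, link_gbps: float) -> float:
    """Transfer time for a KV cache of the given size over the given link."""
    return size_bytes / (link_gbps * BYTES_PER_GBIT)

naive = load_seconds(1.6e9, 3.0)          # raw KV cache over a 3 Gbps link
cachegen = load_seconds(1.6e9 / 3, 3.0)   # ~3x smaller encoded KV cache
print(f"naive={naive:.1f}s  encoded={cachegen:.1f}s")
```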
  • 29. Knowledge Storing & Sharing LLM LLM LLM Towards Efficient Knowledge Storing & Sharing Key technique #1: Fast KV cache loading via KV codec (Speed up KV loading by 3-10x) Key technique #2: Flexible join of multiple KV caches Happy to chat about technique #2 after the talk
  • 31. Efficient Knowledge Sharing System 31 [Figure: delay (time to first token) vs. cost (storage, compute, communication), comparing GPU prefill with storing the KV cache in CPU, SSD, or S3; efficient knowledge sharing sits at the better corner on both axes.] Contact me if you are a potential user or contributor to our Knowledge-Sharing System!! [email protected]