
Exploring CXL-based KV Cache Storage for LLM Serving

Yupeng Tang1∗ Runxiang Cheng2∗ Ping Zhou3 Tongping Liu3 Fei Liu3 Wei Tang3
Kyoungryun Bae3 Jianjun Chen3 Wu Xiang3 Rui Shi3
1 Yale University   2 University of Illinois Urbana-Champaign   3 ByteDance

Abstract

Large language model (LLM) serving systems often store key-value (KV) cache
to reduce redundant inference computations. However, storing KV cache of long-
context requests under high-throughput serving demand can overwhelm GPU and
host memory. Recently, Compute Express Link (CXL) has emerged as a promising
interconnect technology that offers low-latency host-device communication and
expanded memory capacity. In this paper, we explore leveraging CXL memory to
store KV cache in LLM serving. Our results show that the CXL-GPU interconnect
performs comparably to the CPU-GPU interconnect in both data-transfer latency and
bandwidth, enabling a 30% increase in batch size compared to full KV re-compute
under the same SLO. Our production Return on Investment (ROI) modeling further
shows that storing KV cache on CXL memory can reduce GPU requirements by
up to 87%, with 7.5× higher GPU utilization for prefill, compared to full KV
re-compute. Our overall results demonstrate the performance improvement and
resource benefits of using CXL memory to store KV cache in LLM serving.

1 Introduction

Autoregressive large language models (LLMs) generate output tokens sequentially, where the generation
of each token involves attention computation using the key-value (KV) pairs of its preceding
tokens [1]. This sequential dependency makes LLM inference both compute- and memory-intensive.
LLM inference typically includes two stages: the prefill stage, where all input tokens are processed
to generate the initial output token, and the decode stage, where the rest of the output tokens are
generated one by one until the model generates an end-of-sequence token [2, 3, 4].
For applications such as chatbot and coding assistant, LLM serving systems aim to minimize the
time to finish the prefill stage, or time to first token (TTFT). In production, the service-level objective
(SLO) for TTFT is typically 400ms [3]. To meet this SLO, LLM serving systems often cache the
previously-computed KV data of the preceding tokens (i.e., prefix) in GPU memory, to avoid re-
computing them for future requests that have the same prefix [5, 3, 6]. Storing KV cache reduces the
overall computational load and significantly improves throughput by trading memory for computation.
In production chatbot applications that support large context windows, the demand for KV cache
storage grows rapidly with the number of inference requests from users and cannot be fully accommodated
by the limited and expensive GPU memory [7]. Researchers thus developed techniques
to offload KV cache to CPU memory, leveraging the larger CPU memory capacity to reduce GPU
memory pressure [6, 8, 9]. However, as larger LLMs and support for long-context inference requests
continue to emerge, storing KV cache in CPU memory is still insufficient. For
example, in LLaMA-2-7B, the KV cache of a single token in FP32 precision is 1024KB; the KV cache of a single
request with 4096 tokens (maximum context length) is 4GB [10]. The memory demand from serving
many concurrent long-context requests can easily overwhelm even high-end memory servers [5, 11].

∗ Co-first authors. Work done during internship at ByteDance.

Machine Learning for Systems Workshop at NeurIPS 2024.
Practitioners increasingly turn to more scalable memory architectures, such as Compute Express Link
(CXL) memory [12, 13, 14], to address the growing memory demands of large-scale systems. CXL
expands memory capacity by connecting additional DRAM to servers via PCIe, while maintaining
low-latency access. It offers a promising solution to the KV cache storage demand in LLM serving.
In this paper, we propose leveraging CXL memory for KV cache storage to improve throughput,
meet SLO on TTFT, and alleviate memory pressure in LLM serving systems. This paper makes the
following contributions:
• We present the first measurement of the CXL-GPU interconnect and its feasibility for storing large
KV cache. We show that the data-transfer latency and bandwidth of the CXL-GPU interconnect are on
par with those of the CPU-GPU interconnect.
• We present our design of a CXL-based KV cache storage interface and evaluate its performance
benefits for LLM serving on our platform, which is the first to successfully integrate an ASIC-CXL
device and a GPU. Our results show competitive TTFT achieved by CXL-based prefix caching.
• We examine the cost-efficiency of using CXL for KV cache storage in production via Return on
Investment (ROI) modeling. Our estimates show a promising reduction in GPU compute cost when
using CXL for KV cache storage. We also identify promising future research directions.

2 CXL-based KV Cache Storage

We now present the design and implementation of our CXL-based KV cache storage interface for
LLM serving. We also describe the hardware platform used to evaluate our design.
Design and implementation. Our goal is to develop a CXL storage interface that can be integrated
into existing LLM serving systems for saving and loading the KV cache of inference requests. The
interface provides two external APIs to its upper-level serving system: save and load. save
takes a unique identifier of a token chunk as input and copies its KV cache from GPU to CXL
memory. load takes a unique identifier of a token chunk as input, checks whether its KV cache
exists in CXL memory, and, if so, copies the KV cache from CXL memory to GPU. A token chunk can
consist of one or more tokens. The unique identifier of a token chunk ti in a sequence is the hash of
the content of ti and the hash of its prefix ⟨t0 , ..., ti−1 ⟩. If the KV cache of the current request’s prefix has
been computed and saved into CXL, it is loaded from CXL and reused [5].
To avoid calling save and load too frequently and incurring unnecessary overhead on the upper-level
serving system, save is called only when a request finishes, so that the KV cache of all tokens of
that request is saved at once; load is called when the prefill stage of a request begins.
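To make the interface concrete, below is a minimal Python sketch of the save/load API and the hash-based chunk identifier described above. It is illustrative rather than our actual implementation: the class and method names are hypothetical, and it assumes PyTorch tensors and host buffers that the process has placed on the CXL NUMA node (e.g., by running under numactl --membind).

```python
import hashlib
from typing import Optional, Tuple

import torch


class CXLKVStore:
    """Illustrative sketch of the save/load interface (not the actual implementation).

    Host buffers are assumed to reside in CXL memory because the process's host
    allocations are bound to the CXL NUMA node (e.g., via `numactl --membind`).
    """

    def __init__(self) -> None:
        self._store = {}  # chunk_id -> (host K tensor, host V tensor)

    @staticmethod
    def chunk_id(tokens: list, prefix_id: Optional[str]) -> str:
        # Identifier of a token chunk: hash of its content and the hash of its prefix.
        h = hashlib.sha256()
        if prefix_id is not None:
            h.update(prefix_id.encode())
        h.update(str(tokens).encode())
        return h.hexdigest()

    def save(self, chunk_id: str, k_gpu: torch.Tensor, v_gpu: torch.Tensor) -> None:
        # Copy the chunk's KV cache from GPU memory to (CXL-backed) host memory.
        if chunk_id in self._store:
            return
        k_host = torch.empty(k_gpu.shape, dtype=k_gpu.dtype, pin_memory=True)
        v_host = torch.empty(v_gpu.shape, dtype=v_gpu.dtype, pin_memory=True)
        k_host.copy_(k_gpu, non_blocking=True)
        v_host.copy_(v_gpu, non_blocking=True)
        torch.cuda.synchronize()  # ensure the device-to-host copies have completed
        self._store[chunk_id] = (k_host, v_host)

    def load(self, chunk_id: str, device: str = "cuda") -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        # If the chunk's KV cache exists in CXL memory, copy it to GPU; otherwise return None.
        entry = self._store.get(chunk_id)
        if entry is None:
            return None
        k_host, v_host = entry
        return k_host.to(device, non_blocking=True), v_host.to(device, non_blocking=True)
```

In such a design, the serving system would call save once per finished request with the identifiers of its token chunks, and call load at the start of prefill for each prefix chunk it can match.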
We implement our design of the CXL-based KV cache storage interface in gpt-fast [15], a low-latency
text generation system with support for a number of widely used inference optimizations [16, 17, 18]
and open-source LLMs [10, 19]. We modify gpt-fast to support batched inference for our evaluation.
Hardware platform. Our single-socket server is equipped with Intel Xeon Platinum processors [20],
1TB of 4800 MHz DDR5 memory, an NVIDIA H100 GPU with 96GB HBM, and a CXL memory
expansion card with 256 GB of DDR5 memory at 4800 MHz [13]. While prior works [21, 22, 23, 24]
have explored utilizing CXL for accelerators, to our knowledge, we are the first to integrate a real
ASIC-CXL device and a GPU within a single inference server and evaluate it for KV cache storage.

3 Performance Evaluation

In Section 3.1, we measure the latency and bandwidth of CXL-GPU interconnect for data transfer to
assess the feasibility of storing KV cache on CXL devices. In Section 3.2, we compare the TTFT of
KV re-compute, prefix caching with CXL, and prefix caching with GPU, to understand if CXL-based
KV cache storage can achieve similar TTFT as existing approaches for prefill requests under varying
context lengths. In Section 3.3, we compare the maximum batch size achievable under a given
SLO on TTFT for KV re-compute and prefix caching with CXL. In Section 3.4, we model the
ROI of CXL-based KV cache storage in production.

[Figure 1: three panels. (a) CPU-GPU/CXL-GPU interconnect: latency (us) and bandwidth (GB/s) vs. access size (KB). (b) TTFT (ms) vs. conversation length for KV re-compute, PC-GPU, and PC-CXL. (c) TTFT (ms) vs. batch size for KV re-compute and PC-CXL.]
Fig. 1: Experiment results. (a) Latency and bandwidth measurements across different access sizes; the CXL-GPU interconnect performs similarly to the CPU-GPU interconnect. (b) TTFT comparison between KV re-compute and prefix caching with CXL or GPU. (c) Serving throughput comparison under a fixed SLO constraint (400ms).

3.1 Measurements on CXL-GPU interconnect performance

KV cache storage requires low-latency access. Although prior studies [12, 13] show that accessing
CXL memory from the host CPU is over 2× slower than accessing local memory, none of their
measurements involves any interaction with the GPU. In this paper, we evaluate the performance
characteristics of the CXL-GPU interconnect by measuring the latency and bandwidth of copying
data from CXL memory to the GPU. Transferring in the reverse direction yields similar results [25].
Since CXL memory devices are exposed to the system as NUMA nodes without CPUs by default [13],
we allocate a set of host buffers on the CXL NUMA node and use cudaMemcpyAsync to copy data
between the host buffers and GPU device buffers allocated via the CUDA API [26]. While prior work
evaluates data transfer at 64-byte granularity [24], we evaluate access sizes ranging from 1KB to 256MB.
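As an illustration of this methodology, the sketch below approximates it in PyTorch rather than through the raw CUDA API: the host buffer is pinned and, assuming the process is launched under numactl --membind=<cxl_node> (or bound to a local DRAM node for the CPU-GPU baseline), host-to-device copies are timed with CUDA events. Buffer sizes and iteration counts are illustrative.

```python
import torch


def measure_h2d(size_bytes: int, iters: int = 100):
    """Report average latency (us) and bandwidth (GB/s) of host-to-device copies.

    Run under `numactl --membind=<cxl_node>` so the pinned host buffer lands in
    CXL memory; binding to a local DRAM node measures the CPU-GPU baseline.
    """
    host = torch.empty(size_bytes, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(size_bytes, dtype=torch.uint8, device="cuda")

    dev.copy_(host, non_blocking=True)  # warm up the copy path
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dev.copy_(host, non_blocking=True)
    end.record()
    torch.cuda.synchronize()

    avg_ms = start.elapsed_time(end) / iters
    latency_us = avg_ms * 1e3
    bandwidth_gbs = size_bytes / (avg_ms * 1e-3) / 1e9
    return latency_us, bandwidth_gbs


if __name__ == "__main__":
    for kb in (1, 16, 64, 1024, 4096, 65536, 262144):  # 1KB to 256MB
        print(kb, measure_h2d(kb * 1024))
```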
Figure 1(a) shows that the performance of the CXL-GPU interconnect is unexpectedly on par with
traditional CPU-GPU memory transfers, exhibiting no significant slowdown. Latency remains low for
smaller access sizes but increases exponentially once the size exceeds 64KB. Meanwhile, bandwidth
increases almost linearly with data size and saturates around 4MB. This indicates that, while the CPU
oversees the data transfer, the data path actually bypasses the host’s local memory, flowing directly
from CXL memory to GPU buffers via PCIe. Our results demonstrate that the CXL-GPU interconnect
operates efficiently with minimal latency overhead, positioning CXL memory as a promising capacity expansion for KV
cache offloading [9, 27, 28] and swapping [5] beyond CPU memory.

3.2 Evaluation on TTFT under varying input context length

Given that CXL-GPU interconnect performs nearly the same as CPU-GPU interconnect, we further
study if CXL-based KV cache storage can achieve similar TTFT as existing approaches in completing
the prefill stage computation for an inference request. We evaluate three approaches:
• Full KV re-compute: Compute KV data of all input tokens for the request with GPU.
• Prefix caching with CXL: Load KV cache of the prefix tokens for the request from CXL to GPU.
• Prefix caching with GPU: Store and use KV cache in GPU for the prefix tokens for the request.
We measure the TTFT of the aforementioned approaches on conversation requests with input lengths
ranging from 256 to 2048 tokens from the ShareGPT-Vicuna-Unfiltered dataset [29]. We use
LLaMA-2-13B as the underlying model for our evaluation. Figure 1(b) shows the TTFT (y-axis in
log-scale) achieved by the evaluated approaches for requests of varying input context length (x-axis).
Compared to the other approaches, prefix caching with GPU (denoted as “PC-GPU” in Figure 1(b))
achieves the smallest TTFT (0.44ms to 0.56ms) consistently across different input context lengths.
Such performance is expected, as there is no data-transfer latency and computation of KV data is only
needed for tokens after the prefix. This approach is an optimal baseline that is, however, difficult to
achieve in practice due to the limited memory capacity of existing GPUs and the rapidly growing
demand for KV cache storage in LLM serving.
Comparing prefix caching with CXL (denoted as “PC-CXL”) and KV re-compute, prefix caching
with CXL performs at least as well as computing KV data on the GPU from scratch. Prefix caching
with CXL achieves TTFT ranging from 55ms to 336ms, with a slight increase in latency as input
length grows. The small performance gap between storing prefix KV cache in CXL memory and full
KV re-computation indicates a potential opportunity to reduce GPU compute cost by
adopting CXL devices for memory capacity expansion in LLM inference.

3.3 Evaluation on serving throughput while adhering to the SLO

By storing the KV cache of the inference request prefix in CXL memory and thus reducing re-
computation during the prefill stage, we can effectively reduce the computational load on the GPU.
The saved GPU compute can be re-allocated to handle a larger number of concurrent inference
requests. In other words, the LLM serving system can achieve a higher serving throughput, by
handling a larger batch size of inference requests using the saved GPU compute, while maintaining
the same SLO on TTFT [3].
Figure 1(c) shows the TTFT achieved by KV re-compute and prefix caching with CXL under varying
batch sizes. The horizontal red dashed line indicates our SLO limit, the maximum TTFT that can be
tolerated in production; the typical SLO for LLaMA-2 is 400ms [30]. As shown in Figure 1(c),
with KV re-compute, the evaluated serving system (§2) can handle a maximum batch size of 44 before
hitting the SLO limit. On the other hand, when leveraging CXL for storing KV cache, the system
can handle a maximum batch size of 57, which is a 30% increase compared to KV re-compute.
Our initial evaluation on SLO-adhering serving throughput highlights the performance benefits of
utilizing CXL memory for KV cache storage, particularly in scenarios that require efficient scaling
under strict latency requirements.
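To illustrate how such a maximum batch size can be determined, the following sketch sweeps the batch size and records the largest one whose TTFT stays within the SLO. run_prefill and make_batch are hypothetical placeholders for the serving system's batched prefill path and batch construction; the sketch is illustrative, not our measurement harness.

```python
import time

import torch

SLO_MS = 400.0  # TTFT budget used in this evaluation


def ttft_ms(run_prefill, batch) -> float:
    # Wall-clock time of one batched prefill; run_prefill stands in for the
    # serving system's prefill path (e.g., a batched forward in gpt-fast).
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    run_prefill(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3


def max_batch_under_slo(run_prefill, make_batch, max_bs: int = 64) -> int:
    best = 0
    for bs in range(1, max_bs + 1):
        if ttft_ms(run_prefill, make_batch(bs)) <= SLO_MS:
            best = bs
        else:
            break  # TTFT grows with batch size, so stop at the first violation
    return best
```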

3.4 Cost-efficiency modeling

We apply a Return on Investment (ROI) model to estimate the cost-efficiency of CXL-based KV
cache storage, compared to full KV re-compute, in our production chatbot application (§A.1).
• Assumption: A GPU has the computational power of 100 TFLOP/s, a prefill request on average
requires 25 TFLOP of computation, and the SLO on TTFT is 400ms (i.e., 0.4s).
• KV re-compute (baseline): To complete the prefill request within SLO, each request demands
62.5 TFLOP/s (25 TFLOP / 0.4s), meaning a single GPU can serve 1.6 prefill requests per second.
• CXL-based KV cache storage: By spending 0.1s loading the KV cache of the request prefix, we
reduce the computational demand to 2.5 TFLOP (assuming the prefix accounts for 90% of the
input for a request). To meet the same SLO, the remaining computation must be finished within
0.3s, requiring 8.3 TFLOP/s (2.5 TFLOP / 0.3s). In this case, a single GPU can serve 12 prefill
requests per second, which is 7.5× more than the baseline with the same amount of GPU compute.
From this estimate, we can reduce the number of GPUs needed to reach the same throughput, while
adhering to the SLO on TTFT, by 87% compared to KV re-compute. Overall, by replacing computation
with CXL memory access, we reduce the overall compute demand while still meeting the same SLO,
thereby resulting in significant GPU cost savings for LLM inference.
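The arithmetic above can be restated as a short Python sketch; all constants are the assumptions listed in the bullets (100 TFLOP/s GPU, 25 TFLOP per prefill, a 90% cached prefix, 0.1s load time, and a 0.4s SLO).

```python
# Worked ROI example using the assumptions stated above.
P = 100.0           # GPU compute, TFLOP/s
C0 = 25.0           # TFLOP per prefill request with full re-compute
T_slo = 0.4         # TTFT SLO, seconds
T_load = 0.1        # time to load the prefix KV cache from CXL, seconds
prefix_ratio = 0.9  # fraction of the input covered by the cached prefix

# Baseline: full KV re-compute must finish all 25 TFLOP within the SLO.
rps_recompute = P / (C0 / T_slo)        # 100 / 62.5 = 1.6 requests/s per GPU

# CXL-based KV cache: only non-prefix tokens are computed, in the remaining budget.
C1 = C0 * (1 - prefix_ratio)            # 2.5 TFLOP
rps_cxl = P / (C1 / (T_slo - T_load))   # 100 / 8.33 ≈ 12 requests/s per GPU

speedup = rps_cxl / rps_recompute       # ≈ 7.5x
gpu_reduction = 1 - rps_recompute / rps_cxl  # ≈ 0.87, i.e. ~87% fewer GPUs
print(rps_recompute, rps_cxl, speedup, gpu_reduction)
```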

4 Conclusion and Future Work

Storing KV cache in GPU memory for LLM inference can quickly lead to memory saturation, limiting
serving scalability and performance. KV cache storage on CPU memory becomes limited as model
size and request context length increase. To that end, we explore CXL memory for KV cache
offloading, as CXL offers expanded capacity with low-latency access. Our preliminary results
show that CXL-GPU data transfer has latency and bandwidth similar to those of the CPU-GPU counterpart.
In addition, CXL-based KV cache offloading provides similar performance compared to full KV
re-compute on GPUs, while supporting larger workloads. Specifically, using CXL memory for KV
cache storage increased the maximum batch size by 30%, while maintaining the same SLO on TTFT.
Our cost-efficiency analysis further shows the potential for using CXL memory to substantially reduce
the GPU compute cost for high-throughput LLM serving under SLO. Looking ahead, future work
will explore the integration of CXL memory with multi-GPU systems, focusing on maintaining cache
coherence across GPUs, which could further enhance the scalability and efficiency of LLM inference.

Acknowledgments and Disclosure of Funding

We thank Supermicro for the hardware. We thank Ken Hu and Henry Hu from ByteDance for
assisting with the experiment setup, and Wenhui Zhang from ByteDance for insightful suggestions on this work.

References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and
Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve.
In OSDI, 2024.
[3] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao
Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language
model serving. In OSDI, 2024.

[4] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and
Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In ISCA,
2024.
[5] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu,
Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language
model serving with pagedattention. In SOSP, 2023.
[6] Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang,
Sa Wang, Yungang Bao, Ninghui Sun, et al. Memserve: Context caching for disaggregated llm
serving with elastic memory pool, 2024.
[7] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and
Zhihao Jia. Towards efficient generative large language model serving: A survey from algorithms
to systems. arXiv preprint arXiv:2312.15234, 2023.
[8] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du,
Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving for rag with
cached knowledge fusion. In EuroSys, 2024.
[9] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy
Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative
inference of large language models with a single gpu. In ICML, 2023.
[10] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[11] Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael
Maire, Henry Hoffmann, Ari Holtzman, et al. Cachegen: Fast context loading for language
model applications. 2024.
[12] Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji,
Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, et al. Demystifying cxl memory with genuine
cxl-ready systems and devices. In MICRO, 2023.
[13] Yupeng Tang, Ping Zhou, Wenhui Zhang, Henry Hu, Qirui Yang, Hao Xiang, Tongping Liu,
Jiaxin Shan, Ruoyun Huang, Cheng Zhao, et al. Exploring performance and cost optimization
with asic-based cxl memory. In EuroSys, 2024.
[14] Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic,
Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. Pond: Cxl-based memory
pooling systems for cloud platforms. In ASPLOS, 2023.
[15] PyTorch Labs. GPT-fast: Simple and Efficient PyTorch-Native Transformer Text Generation.
https://github.com/pytorch-labs/gpt-fast.git, 2024.

[16] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via
speculative decoding. In ICML, 2023.

[17] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary,
Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro,
et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In SC,
2021.
[18] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard,
Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for
efficient integer-arithmetic-only inference. In CVPR, 2018.
[19] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[20] Intel Corporation. Intel® Xeon® Platinum Processor. https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable/platinum.html, 2024.

[21] Moiz Arif, Avinash Maurya, and M Mustafa Rafique. Accelerating performance of gpu-based
workloads using cxl. In HPDC FlexScience, 2023.
[22] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki,
Yu Nakanishi, Daisuke Taki, Akiyuki Kaneko, and Tatsuo Shiozawa. Gpu graph process-
ing on cxl-based microsecond-latency external memory. In SC-W, 2023.
[23] Donghyun Gouk, Seungkwan Kang, Hanyeoreum Bae, Eojin Ryu, Sangwon Lee, Dongpyung
Kim, Junhyeok Jang, and Myoungsoo Jung. Breaking barriers: Expanding gpu memory with
sub-two digit nanosecond latency cxl controller. In HotStorage, 2024.
[24] Jie Liu, Xi Wang, Jianbo Wu, Shuangyan Yang, Jie Ren, Bhanu Shankar, and Dong Li. Exploring
and evaluating real-world cxl: Use cases and system adoption. arXiv preprint arXiv:2405.14209,
2024.
[25] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R Tallent, and Kevin J
Barker. Evaluating modern gpu interconnect: Pcie, nvlink, nv-sli, nvswitch and gpudirect. IEEE
Transactions on Parallel and Distributed Systems, 2019.
[26] NVIDIA. CUDA Runtime API - Memory Management Functions. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html, 2024.

[27] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton
Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference:
enabling efficient inference of transformer models at unprecedented scale. In SC, 2022.
[28] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative
inference of large language models with dynamic KV cache management. In OSDI, 2024.
[29] anon8231489123. ShareGPT Vicuna Unfiltered Dataset. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered, 2024.

[30] Meta. Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models. https://mlcommons.org/2024/03/mlperf-llama2-70b, 2024.

A Appendix
A.1 Definition of Return on Investment (ROI) Model

Table 1 provides the definition of our ROI model used in §3.4 in the main text.

Table 1: ROI model.

C0: Avg. FLOPs needed by a prefill request in an initial request. Can be estimated as C0 = 2ML, where M is the number of model parameters and L is the avg. sequence length.
C1: FLOPs needed by the new prompt in a follow-up conversation request. Can be estimated as C1 = rC0, where r is the avg. ratio of the new prompt (e.g., 10%).
Tslo: SLO of prefill (e.g., 0.4s).
Tload: Avg. time to load KV cache from memory (e.g., 0.1s), using measurements from §3.1.
P: Computation power (FLOP/s) of the GPU.
P0: FLOP/s needed for the initial request. P0 = C0 / Tslo.
P1: FLOP/s needed for the new prompt. P1 = C1 / (Tslo − Tload).
Rgpu: Requests per second (RPS) a single GPU can support. Rgpu = P / (P0(1 − h) + P1h), where h is the ratio of multi-round requests.
Ncxl: Number of GPUs needed using our CXL memory scheme, where R is the total RPS to serve. Ncxl = ⌈R / Rgpu⌉ = ⌈R / (P / (P0(1 − h) + P1h))⌉.
Nbaseline: Number of GPUs needed without any KV cache stored (i.e., all data discarded after prefill). Nbaseline = ⌈R / (P / P0)⌉.
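For reference, the model in Table 1 can be written as a short Python sketch; the function name is illustrative, and R (total prefill RPS to serve) and h (ratio of multi-round requests) are inputs.

```python
import math


def gpus_needed(R, P, M, L, r, h, T_slo=0.4, T_load=0.1):
    """Number of GPUs under the CXL scheme vs. the no-cache baseline (Table 1).

    R: total prefill requests per second to serve
    P: GPU compute in FLOP/s; M: model parameters; L: avg. sequence length
    r: avg. ratio of new prompt in a follow-up request; h: ratio of multi-round requests
    """
    C0 = 2 * M * L               # FLOPs of an initial prefill request
    C1 = r * C0                  # FLOPs of the new prompt in a follow-up request
    P0 = C0 / T_slo              # FLOP/s needed for an initial request
    P1 = C1 / (T_slo - T_load)   # FLOP/s needed for a follow-up request
    R_gpu = P / (P0 * (1 - h) + P1 * h)   # requests/s a single GPU can support
    N_cxl = math.ceil(R / R_gpu)
    N_baseline = math.ceil(R / (P / P0))  # every request pays the full prefill cost
    return N_cxl, N_baseline
```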
