Characterizing Power Management Opportunities for LLMs in the Cloud

Pratyush Patel*, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini
Microsoft Azure

Keywords: large language models (LLMs); GPUs; power management; cloud computing

ACM Reference Format:
Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing Power Management Opportunities for LLMs in the Cloud. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), April 27-May 1, 2024, La Jolla, CA, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3620666.3651329
1 Introduction

Recent advances in large language models (LLMs) have rapidly increased the demand for datacenter GPUs [8]. OpenAI scaled its Kubernetes clusters to 7,500 nodes to train models such as GPT-3 [48], and Meta's AI Research SuperCluster hosts over 6,000 A100 GPUs [38]. Services built on models such as Bard and GPT-4 [16] continue to grow, and inference is expected to account for the large majority, by some estimates up to 90%, of deployed AI compute [53,59,61]. A key bottleneck for this growth is datacenter power: providers cannot build new GPU capacity fast enough, and existing facilities are limited by their provisioned power [38,48]. Power oversubscription is a well-established technique for increasing effective capacity in CPU-based clouds [15,17,23,73], but it has not been systematically explored for GPU clusters running LLMs. In this paper, we characterize the power usage of LLM training and inference workloads, show that LLM inference leaves substantial power headroom, and use this characterization to design POLCA, a framework for power oversubscription in LLM inference clusters that enables deploying roughly 30% more servers in existing datacenters with minimal performance impact.

*Work done while Pratyush Patel was at Microsoft Azure.

This work is licensed under a Creative Commons Attribution International 4.0 License.
ASPLOS '24, April 27-May 1, 2024, La Jolla, CA, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0386-7/24/04.
https://doi.org/10.1145/3620666.3651329
Figures 1-3: Power characterization of GPT-style LLMs on a server with 8x A100-80GB GPUs.
2 Background

Modern LLMs are built on the transformer architecture [67]. Encoder-style models such as BERT [14] and RoBERTa [37] are comparatively small, while generative models such as GPT [57], BLOOM [70], and FLAN-T5 [12] have grown to tens or hundreds of billions of parameters.
LLM training and inference differ substantially from earlier deep learning workloads. Training runs for weeks on large clusters of GPU servers connected by high-bandwidth interconnects such as InfiniBand; OpenAI, for example, scaled a single cluster to 7,500 nodes for models like GPT-3 [48]. Inference is also resource-intensive: serving a model such as BLOOM-176B [70] (similar in scale to GPT-3 [10]) requires a full server with 8 GPUs, and generative inference keeps those GPUs drawing power for the lifetime of each request. Despite the scale of these deployments at OpenAI, Meta, and others [38,48], existing GPU cluster studies such as Microsoft's Philly traces [25] predate LLMs, and there is little published data on LLM power behavior. This paper fills that gap by characterizing training vs. inference power and shows how the resulting headroom can be exploited: POLCA oversubscribes power in LLM inference clusters and fits roughly 30% more servers into the same power envelope.
3.2 Datacenter Power Management

Servers expose power telemetry through two paths: out-of-band (OOB) via the baseboard management controller and IPMI [26], and in-band (IB) via interfaces such as Intel RAPL for the CPU and DRAM [55]. Cloud providers use this telemetry to oversubscribe power and to enforce budgets when needed [31]. ML platforms such as Singularity [64], Azure OpenAI [6], Google Vertex AI [56], Amazon SageMaker [1], and Azure ML [2] schedule LLM workloads on shared GPU fleets, so any power management mechanism must preserve workload quality of service [34].
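To make the in-band path concrete, the sketch below reads the Linux powercap (RAPL) energy counters for the CPU package. This is an illustrative snippet rather than the paper's tooling, and it assumes a Linux host that exposes /sys/class/powercap/intel-rapl:0.

```python
# Illustrative sketch (not from the paper): sample in-band CPU package power
# via the Linux powercap/RAPL interface. Assumes /sys/class/powercap/intel-rapl:0
# exists and is readable.
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj() -> int:
    with open(RAPL_ENERGY) as f:
        return int(f.read().strip())

def sample_package_power(interval_s: float = 0.1) -> float:
    """Return the average package power (watts) over one sampling interval."""
    e0 = read_energy_uj()
    time.sleep(interval_s)
    e1 = read_energy_uj()
    return (e1 - e0) / 1e6 / interval_s  # microjoules -> joules -> watts

if __name__ == "__main__":
    print(f"CPU package power: {sample_package_power():.1f} W")
```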
Tables 1-3: The LLM workloads, configurations, and frameworks used in our characterization.
3.3 Measuring LLM Power

We measure CPU and server power out-of-band through IPMI [26] and in-band through RAPL [55], and GPU power through NVIDIA's management interfaces (Section 4.3 examines their accuracy). Because production servers run LLM workloads inside VMs, we collect GPU power both from the host and from within the guest, and we cross-check the in-band GPU readings against server-level telemetry.
3.4 LLM Software and Measurement Setup

We run LLM inference and fine-tuning with widely used frameworks: Hugging Face Transformers with Accelerate [20,69], GPT-NeoX [5], DeepSpeed [60], DeepSpeed-MII [4,39], and vLLM [32]. Models that do not fit in GPU memory at full precision are quantized with bitsandbytes [13] when loaded through Hugging Face Transformers [69].
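As an illustration of this setup (not the paper's exact scripts), the following sketch loads a causal LLM with Hugging Face Transformers and Accelerate, sharding it across all visible GPUs, with optional 8-bit quantization via bitsandbytes. The model name is a placeholder.

```python
# Illustrative sketch: load an LLM sharded across the GPUs of one server using
# Hugging Face Transformers + Accelerate, with optional 8-bit quantization via
# bitsandbytes. The model name below is a placeholder, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-neox-20b"  # placeholder model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",         # Accelerate places layers across available GPUs
    torch_dtype=torch.float16,
    load_in_8bit=False,        # set True to quantize with bitsandbytes
)

inputs = tokenizer("Power management for LLMs", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```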
We collect GPU power telemetry at 100 ms granularity using NVIDIA DCGM [43] and nvidia-smi, and server-level power through IPMI. DCGM reports per-GPU power draw, which we aggregate across the GPUs of each server and compare against the IPMI readings.
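The snippet below sketches such a polling loop using nvidia-smi's query interface; it is a simplified stand-in for the DCGM-based collector described above, and the 100 ms interval matches the sampling granularity used in our measurements.

```python
# Illustrative sketch: poll per-GPU power draw every 100 ms using nvidia-smi.
# A production collector would use DCGM instead; this stand-in only requires
# the nvidia-smi CLI to be present.
import csv
import subprocess
import time

def sample_gpu_power() -> list[float]:
    """Return the current power draw (watts) of every GPU on this host."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [float(line) for line in out.strip().splitlines()]

with open("gpu_power_trace.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp_s", "total_power_w"])
    for _ in range(600):                 # ~1 minute of samples
        writer.writerow([time.time(), sum(sample_gpu_power())])
        time.sleep(0.1)                  # 100 ms sampling interval
```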
DCGM and IPMI largely agree: per-GPU power from DCGM tracks the server-level IPMI readings to within 5-10 W, so we rely on DCGM for fine-grained GPU measurements and IPMI for whole-server power. For training, we profile distributed runs that use standard data- and model-parallel frameworks [25,35,63] on servers with 8 GPUs each, including multi-day runs (100+ hours). Across these workloads, the GPUs account for roughly 85% of server power, which is why this paper focuses on GPU-level power behavior.
4 LLM Power Characterization

We first profile inference. We serve each LLM with production-grade inference engines (e.g., FasterTransformer and DeepSpeed-Inference) and measure power across more than 100 GPUs (Section 4.1; Figure 4). Figure 5 compares power draw against the rated GPU TDP for GPT-NeoX, Flan-T5, and RoBERTa.

Insight 1: LLM inference rarely saturates GPU power; even at high load, the GPUs draw well below their rated TDP.

Smaller models such as RoBERTa leave even more headroom: their GPU power stays close to idle because a single request cannot keep all SMs busy (low SM utilization).
Insight 2: Even for large models such as BLOOM, GPU power during inference stays comfortably below the rated TDP, leaving headroom across all GPUs of the server (Figure 7).

We next study GPU power capping. Figure 4 shows performance as we lower the GPU power limit below TDP; capping by up to 20% causes only minor slowdowns.

Insight 4: LLM inference tolerates GPU power caps well below TDP. Flan-T5 and GPT-NeoX tolerate caps roughly 22% and 10% below TDP with negligible performance loss (see also [9]); RoBERTa never approaches TDP, and the behavior is similar for the largest model we evaluate (BLOOM-176B).
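The power caps in these experiments can be applied through NVIDIA's standard management interface. The sketch below is an illustration of that mechanism, not the paper's exact procedure: it sets a per-GPU power limit as a fraction of the default limit (TDP) using nvidia-smi.

```python
# Illustrative sketch: cap every GPU on a server to a fraction of its default
# power limit (TDP) using nvidia-smi. Requires root/admin privileges.
import subprocess

def query_default_limit_w(gpu: int) -> float:
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=power.default_limit", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())

def set_power_cap(gpu: int, fraction_of_tdp: float) -> None:
    """E.g., fraction_of_tdp=0.8 caps the GPU 20% below its default limit."""
    cap_w = int(query_default_limit_w(gpu) * fraction_of_tdp)
    subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", str(cap_w)], check=True)

for gpu_id in range(8):          # one DGX-style server with 8 GPUs
    set_power_cap(gpu_id, 0.8)   # 20% below TDP
```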
Generative LLM inference has two distinct phases: a compute-intensive prompt (prefill) phase that processes all input tokens at once, and a token-generation (decode) phase that produces output tokens one at a time. Figure 8b shows that the prompt phase draws substantially more power than token generation.
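A simple way to observe this phase difference is to time and meter the two phases separately. The sketch below is illustrative only: it runs the prefill as a single forward pass and the decode loop via generate, sampling GPU power around each, and it reuses the hypothetical model, tokenizer, and sample_gpu_power() helpers from the earlier sketches.

```python
# Illustrative sketch: contrast GPU power during the prompt (prefill) phase and
# the token-generation (decode) phase. Reuses model/tokenizer from the loading
# sketch and sample_gpu_power() from the telemetry sketch (both are assumptions).
import threading, time, torch

def run_and_meter(fn, interval_s=0.1):
    """Run fn() while sampling total GPU power; return (elapsed_s, mean_power_w)."""
    samples, stop = [], threading.Event()
    def sampler():
        while not stop.is_set():
            samples.append(sum(sample_gpu_power()))
            time.sleep(interval_s)
    t = threading.Thread(target=sampler)
    t.start()
    start = time.time()
    fn()
    torch.cuda.synchronize()
    elapsed = time.time() - start
    stop.set()
    t.join()
    return elapsed, sum(samples) / max(len(samples), 1)

prompt = "Describe datacenter power oversubscription. " * 64     # long prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    prefill_s, prefill_w = run_and_meter(lambda: model(**inputs, use_cache=True))
    decode_s, decode_w = run_and_meter(lambda: model.generate(**inputs, max_new_tokens=128))

print(f"prefill: {prefill_s:.2f}s at ~{prefill_w:.0f} W")
print(f"decode:  {decode_s:.2f}s at ~{decode_w:.0f} W")
```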
4.2 Impact of Inference Configuration on Power
Figure 6: GPU power draw for different inference configurations (left) and phases (right).

We vary the inference configuration along several axes. Increasing the batch size (from 1 to 16) and the number of input tokens (up to 4096) raises power during the prompt phase (Figure 8c), while the number of output tokens mainly lengthens the lower-power token-generation phase (Figures 8d-8f).

Insight 5: LLM inference power depends strongly on the inference configuration. Longer prompts and larger batches increase power, whereas long outputs mostly grow the KV cache and the GPU memory footprint rather than power.
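To make Insight 5 concrete, the back-of-the-envelope calculation below estimates how the KV cache grows with sequence length and batch size. The layer, head, and dimension values are illustrative assumptions (roughly GPT-NeoX-20B-sized), not numbers reported in this paper.

```python
# Illustrative back-of-the-envelope estimate of KV-cache growth with sequence
# length. The model dimensions below are assumptions for illustration only.
layers    = 44          # transformer layers
kv_heads  = 64          # attention heads storing K/V
head_dim  = 96          # dimension per head
bytes_per = 2           # FP16

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # 2x for keys and values, per layer, per head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len * batch

for seq_len in (512, 2048, 4096):
    gib = kv_cache_bytes(seq_len, batch=16) / 2**30
    print(f"seq_len={seq_len:5d}, batch=16 -> KV cache ~{gib:.1f} GiB")
```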
We also profile newer models such as Llama2-70B run in FP16 across all GPUs of a server and observe the same qualitative behavior.

Insight 7: These power patterns are consistent across model families and precisions, so the headroom we identify is not specific to a single model.

Figure 9: Power draw over time for the BLOOM workload.
4.3 LLM Training Power

Figure 10a shows server power over the course of a training run. Training keeps the GPUs far busier than inference, yet power still sits below the provisioned peak: Figure 11 shows the distribution relative to GPU TDP, with power typically within about 20% of TDP and only brief excursions (about 7%) near the peak. Several factors keep training power below the maximum: GPUs spend a substantial fraction of time (around 60%) waiting on communication and synchronization; GPT-NeoX rarely reaches the 500 W GPU TDP, and BLOOM stays about 13% below it (Figures 1 and 4); some phases draw roughly 5% less power (Figure 10b, BLOOM); power varies across the GPUs of a server; and SM frequencies throttle before TDP is reached (Figure 10c).

Insight 8: Even LLM training, the most power-hungry GPU workload we study, leaves some headroom below the provisioned peak, although far less than inference.
In summary, our characterization in Section 4 shows that LLM workloads on GPUs differ from traditional CPU workloads in three ways that matter for power management: (1) inference runs well below the provisioned peak; (2) both training and inference tolerate moderate power caps, with caps of around 25% costing little and even caps of 37.5% costing only about 9% in performance for some workloads (Figure 3); and (3) training, unlike inference, runs closer to the peak and is a poorer target for oversubscription.
5 Power Management Opportunities for LLM Clusters

5.1 LLM Training

Our characterization (Section 4.1, Insights 2, 4, and 5) suggests several opportunities during training: jobs are long-running and homogeneous, so their power can be capped or shifted in time, and energy-aware training frameworks [71] could incorporate such policies directly.

Insight 9: Existing ML platforms and schedulers [1,2,56,64] do not manage power explicitly; LLM clusters need workload-aware power management to exploit the available headroom safely.

5.2 LLM Inference

GPU servers are provisioned for their nameplate peak; for example, a DGX-A100 is rated at 6,500 W [44], yet the highest draw we measure is about 5,700 W and idle servers draw around 800 W. This gap between provisioned and actual power is the opportunity that POLCA exploits, using both in-band and out-of-band (OOB) mechanisms.
6 POLCA: Power Oversubscription for LLM Inference

Guided by these insights, we build POLCA, a power oversubscription framework for GPU clusters that serve LLMs, such as those behind ChatGPT and Azure OpenAI. These clusters consist of InfiniBand-connected GPU servers [44] and must meet tight latency SLOs; inference optimizations such as phase splitting [49] are complementary to POLCA. POLCA adapts techniques from CPU-based datacenter power management [15,36,55,58] and recent GPU power studies [23,73].

6.2 Design

POLCA builds on prediction-based oversubscription frameworks for cloud platforms [17,19,31]. It (1) monitors server and rack power, (2) predicts near-term power draw, and (3) applies capping only when a budget would otherwise be violated, following the approach of [31] but using GPU-specific mechanisms. Because LLM inference leaves headroom (Insight 9), such events are rare.

6.3 Power Management Actions

When intervention is needed, POLCA can, for example, reduce GPU frequency, cap GPU power, or pause the admission of new inference requests.
We evaluate POLCA using production power traces from LLM inference clusters collected between June 21st and August 2nd, 2023, and by replaying representative workloads on our 8-GPU servers. POLCA predicts cluster power from the token-level request mix with a mean absolute percentage error (MAPE) of about 3% (Figure 16), and it leaves roughly 5% of capacity uncapped as a safety margin.
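For reference, we compute MAPE in the standard way; with P_i the measured and P̂_i the predicted power for interval i (notation introduced here only for exposition):

```latex
\mathrm{MAPE} \;=\; \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{P_i - \hat{P}_i}{P_i} \right|
```

A MAPE of 3% therefore means that, averaged over the trace, the predicted cluster power deviates from the measured power by 3%.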
POLCA uses two power thresholds. When measured power crosses the first threshold (T1), POLCA applies a GPU frequency cap (for example, locking the A100 GPUs to 1275 MHz), which is fast and minimally disruptive; when power keeps rising and crosses the second threshold (T2), POLCA additionally applies power caps through the out-of-band (OOB) path, which takes effect within tens of seconds (about 40 s). We sweep threshold pairs covering 75-85%, 80-89%, and 85-95% of provisioned power, and frequency caps of 1110 MHz and 1305 MHz, and measure the impact on throughput and latency SLOs (Figure 12). Aggressive settings (85-95% with low frequency caps) recover the most headroom (32.5-35% between T1 and T2) but risk SLO violations for high-priority (HP) workloads, whereas conservative settings leave capacity unused. Based on this sensitivity study (and Insight 7), POLCA uses T1 = 80% and T2 = 89% of provisioned power at the PDU level, which keeps the worst-case performance impact small (around 11%, Figure 4) while protecting low-priority (LP) traffic from starvation.
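The sketch below illustrates this two-threshold control loop. It is a simplified rendering of the mechanism described above, not POLCA's production implementation, and it reuses the hypothetical sample_gpu_power() and set_power_cap() helpers from the earlier sketches.

```python
# Illustrative sketch of a two-threshold power controller: lock GPU clocks when
# power crosses T1, add power caps when it crosses T2, and release both once
# power falls back below T1. Not POLCA's production code; reuses the hypothetical
# sample_gpu_power() / set_power_cap() helpers from the earlier sketches.
import subprocess
import time

PROVISIONED_W = 6500 * 4            # e.g., a PDU feeding four 6.5 kW servers
T1, T2 = 0.80, 0.89                 # thresholds as fractions of provisioned power
FREQ_CAP_MHZ = 1275                 # A100 SM clock cap applied at T1

def lock_gpu_clocks(mhz):
    for gpu in range(8):
        if mhz is None:
            subprocess.run(["nvidia-smi", "-i", str(gpu), "-rgc"], check=True)
        else:
            subprocess.run(["nvidia-smi", "-i", str(gpu),
                            "-lgc", f"{mhz},{mhz}"], check=True)

while True:
    power = sum(sample_gpu_power())          # stand-in for rack/PDU telemetry
    if power > T2 * PROVISIONED_W:
        lock_gpu_clocks(FREQ_CAP_MHZ)
        for gpu in range(8):
            set_power_cap(gpu, 0.8)           # stronger action above T2
    elif power > T1 * PROVISIONED_W:
        lock_gpu_clocks(FREQ_CAP_MHZ)         # lighter action above T1
    else:
        lock_gpu_clocks(None)                 # release caps when safe
        for gpu in range(8):
            set_power_cap(gpu, 1.0)
    time.sleep(1.0)
```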
The OOB path applies GPU power caps through the server BMC using the SMBPBI interface [42], so it works even when the host is unresponsive and requires no coordination with the inference software. Figure 13 shows the end-to-end behavior: with POLCA, the cluster hosts 30% more servers while latency SLOs are maintained. Figure 15a breaks down how often the T1 frequency cap (1275 MHz on the A100) is engaged; low-priority (LP) requests absorb most of the slowdown.

6.6 End-to-End Results

We replay two representative workloads from Section 4.2, including BLOOM-176B inference, against the production power traces. Figure 16 shows that POLCA safely accommodates 30% more servers: (1) capping events are rare and short; and (2) even for BLOOM-176B, the largest model we study, performance remains within the SLO.
Figure 12: Sensitivity of recovered power and SLO violations to the (T1, T2) thresholds. Figure 13: Hosting 30% more servers with POLCA. Figure 14: POLCA's impact on LLM inference latency (within 5%). Figure 15: Frequency- and power-capping events over time. Figure 17: P99 latency against the SLO for each configuration.

We also compare POLCA against a single-threshold baseline set at 89% and against a variant that caps only low-priority requests; POLCA meets the P99 latency SLO in all cases (Figure 17).
Figure 16: Predicted versus measured power for the BLOOM workload. Figure 17: POLCA hosts 30% more servers within the same provisioned power.

6.7 Discussion

POLCA's mechanisms generalize beyond our clusters: any GPU fleet with power telemetry and either in-band or OOB capping support can adopt them.

7 Related Work

Power management for deep learning has been studied at the level of individual jobs and accelerators, including DVFS and batching for DNN inference [11,40,41,71], datacenter power and reliability characterization [22], and datacenter power capping [28]. GPU cluster schedulers focus on utilization rather than power [24,29,33], while other work characterizes the energy-performance behavior of ML serving [27,72] and GPU power mechanisms, frequency scaling, and variability [52,54,65]. In contrast, POLCA targets power oversubscription for GPU clusters serving LLMs.
Workload-aware resource management in cloud ML platforms [1,6,56] is complementary to POLCA.

8 Conclusion

LLMs are rapidly becoming the dominant datacenter GPU workload, and power is a binding constraint on their growth: newer systems such as the NVIDIA DGX-A100 (6U, 6.5 kW TDP) and DGX-H100 (8U, over 10 kW) draw even more power per server. We characterized the power behavior of LLM training and inference, showed that inference clusters leave substantial headroom, and presented POLCA, which exploits this headroom to host roughly 30% more servers in existing datacenters with minimal performance impact.

Acknowledgments

We thank our shepherd, Lieven Eeckhout, and the anonymous reviewers for their feedback. Pratyush Patel was supported in part by NSF grant CNS-2104548 and a gift from VMware.
[35] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. In VLDB, 2020.
[36] Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, Daniel D. Heimsoth, Saugata Ghose, and Onur Mutlu. A Scalable Priority-aware Approach to Managing Data Center Server Power. In HPCA, 2019.
[37] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
[38] Meta. Introducing the AI Research SuperCluster, Meta's Cutting-edge AI Supercomputer for AI Research. https://ai.facebook.com/blog/ai-rsc/.
[39] Microsoft. DeepSpeed: Model Implementations for Inference (MII). https://github.com/microsoft/DeepSpeed-MII.
[40] Seyed Morteza Nabavinejad, Sherief Reda, and Masoumeh Ebrahimi. Coordinated Batching and DVFS for DNN Inference on GPU Accelerators. TPDS, 2022.
[41] Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In MASCOTS, 2017.
[42] NVIDIA. Data Center GPU Driver. https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_450_v1.pdf.
[43] NVIDIA. Data Center GPU Manager (DCGM). https://developer.nvidia.com/dcgm.
[44] NVIDIA. DGX A100: The Universal System for AI Infrastructure. https://resources.nvidia.com/en-us-dgx-systems/dgx-ai.
[45] NVIDIA. DGX H100. https://www.nvidia.com/en-us/data-center/dgx-h100/.
[46] NVIDIA. NVIDIA A100 80GB PCIe GPU Product Brief. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf.
[47] NVIDIA. System Management Interface (nvidia-smi). https://developer.nvidia.com/nvidia-system-management-interface.
[48] OpenAI. Scaling Kubernetes to 7,500 Nodes. https://openai.com/research/scaling-kubernetes-to-7500-nodes.
[49] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. arXiv preprint arXiv:2311.18677, 2023.
[50] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. POLCA: Power Oversubscription in LLM Cloud Providers. arXiv preprint arXiv:2308.12908, 2023.
[51] Pratyush Patel, Zibo Gong, Syeda Rizvi, Esha Choukse, Pulkit Misra, Thomas Anderson, and Akshitha Sriraman. Towards Improved Power Management in Cloud GPUs. In CAL, 2023.
[52] Tapasya Patki, Zachary Frye, Harsh Bhatia, Francesco Di Natale, James Glosli, Helgi Ingolfsson, and Barry Rountree. Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow. In WORKS, 2019.
[53] David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer, 2022.
[54] Martin Peres. Reverse Engineering Power Management on NVIDIA GPUs - A Detailed Overview. In XDC, 2013.
[55] Pavlos Petoumenos, Lev Mukhanov, Zheng Wang, Hugh Leather, and Dimitrios S. Nikolopoulos. Power Capping: What Works, What Does Not. In ICPADS, 2015.
[56] Google Cloud Platform. Vertex AI. https://cloud.google.com/vertex-ai, 2023.
[57] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019.
[58] Parthasarathy Ranganathan, Phil Leech, David Irwin, and Jeffrey Chase. Ensemble-level Power Management for Dense Blade Servers. In ISCA, 2006.
[59] Tirias Research. Why Your AI Infrastructure Needs Both Training and Inference, 2019.
[60] Philipp Schmid. Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers. https://www.philschmid.de/fine-tune-flan-t5-deepspeed.
[61] Amazon Web Services. Amazon EC2 Update: Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing. https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/.
[62] Amazon Web Services. AWS Trainium: High-performance Machine Learning Training Accelerator, Purpose Built by AWS. https://aws.amazon.com/machine-learning/trainium/.
[63] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.
[64] Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, and Mark Russinovich. Singularity: Planet-scale, Preemptive and Elastic Scheduling of AI Workloads. arXiv preprint arXiv:2202.07848, 2022.
[65] Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair, and Shivaram Venkataraman. Not All GPUs Are Created Equal: Characterizing Variability in Large-scale, Accelerator-rich Systems. In SC, 2022.
[66] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
[67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In NeurIPS, 2017.
[68] Lan Vu, Hari Sivaraman, and Rishi Bidarkar. GPU Virtualization for High Performance General Purpose Computing on the ESX Hypervisor. In HPC, 2014.
[69] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art Natural Language Processing. In EMNLP, 2020.
[70] BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, and Colin Raffel. BLOOM: A 176B-Parameter Open-access Multilingual Language Model. arXiv preprint arXiv:2211.05100, 2022.
[71] Jie You, Jae-Won Chung, and Mosharaf Chowdhury. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In NSDI, 2023.
[72] Junyeol Yu, Jongseok Kim, and Euiseong Seo. Know Your Enemy To Save Cloud Energy: Energy-Performance Characterization of Machine Learning Serving. In HPCA, 2023.
[73] Chaojie Zhang, Alok Gautam Kumbhare, Ioannis Manousakis, Deli Zhang, Pulkit A. Misra, Rod Assis, Kyle Woolcock, Nithish Mahalingam, Brijesh Warrier, David Gauthier, Lalu Kunnath, Steve Solomon, Osvaldo Morales, Marcus Fontoura, and Ricardo Bianchini. Flex: High-Availability Datacenters With Zero Reserved Power. In ISCA, 2021.
[74] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.