
Characterizing Power Management Opportunities For LLMs in The Cloud

Pratyush Patel, Esha Choukse, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini

Abstract
Recent innovation in large language models (LLMs) has driven their rapid adoption in the cloud, and the GPU clusters that train and serve them are extremely power hungry. In this paper, we characterize the power usage of LLM training and inference workloads and find that inference clusters, in particular, leave a large fraction of their provisioned power unused. Based on this characterization, we propose POLCA, a framework for power oversubscription in LLM inference clouds. POLCA is robust to power spikes and enables deploying approximately 30% more servers in existing datacenters without new power infrastructure.

Keywords: large language models (LLMs), GPUs, power management, cloud

ACM Reference Format:
Pratyush Patel, Esha Choukse, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing Power Management Opportunities for LLMs in the Cloud. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), April 27-May 1, 2024, La Jolla, CA, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3620666.3651329

1 Introduction

Large language models (LLMs) [8] have grown rapidly in size and popularity, and training and serving them requires enormous GPU fleets. OpenAI reportedly used more than 7,500 GPUs to train GPT-3 [48], and Meta's AI Research SuperCluster hosts over 6,000 A100 GPUs [38]. Services such as Bard and GPT-4 [16] have pushed LLM inference to very large scale, and inference is estimated to account for the large majority (up to roughly 90%) of AI infrastructure demand [53,59,61]. GPU servers draw far more power than traditional cloud servers, so datacenter power delivery, which is already routinely oversubscribed, has become a key bottleneck for scaling these workloads [15,17,23,73]. At the same time, providers have little insight into how LLM workloads actually consume the power provisioned for them [38,48].

This work is licensed under a Creative Commons Attribution International 4.0 License.

ASPLOS '24, April 27-May 1, 2024, La Jolla, CA, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0386-7/24/04.
https://doi.org/10.1145/3620666.3651329


[Figures 1-3: GPT-style model structure; experiments use a server with 8 A100-80GB GPUs.]

This paper characterizes the power usage of LLM training and inference on production GPU servers and identifies opportunities for power management. Building on this characterization, we propose POLCA, a framework for power oversubscription in LLM inference clusters: by oversubscribing provisioned power, a provider can deploy roughly 30% more servers in existing datacenters, while resorting to performance-impacting power controls only rarely.

2 Background

LLM architectures. Modern LLMs are transformer-based [67]. Encoder-only models such as BERT [14] and RoBERTa [37] are comparatively small; decoder-only models such as GPT [57] and BLOOM [70], and encoder-decoder models such as FLAN-T5 [12], have grown to hundreds of billions of parameters.

LLM training and inference. Training is enormously resource intensive. It shards the model across many GPUs using techniques such as pipeline parallelism [25] and requires high-bandwidth interconnects such as InfiniBand; OpenAI reports using more than 7,500 GPUs to train GPT-3 [48]. Even inference does not fit on a single GPU for the largest models: BLOOM-176B [70] (comparable in scale to GPT-3 [10]) must be partitioned across all 8 GPUs of a server. Although OpenAI and Meta have described their large LLM deployments [38,48], little is publicly known about how these workloads consume power.


Prior characterizations of cloud ML infrastructure target general DNN training clusters such as Microsoft's Philly [29] and Meta's fleet [21], custom inference accelerators such as AWS Trainium [62] and Inferentia [61], or the energy and carbon footprint of ML at large [53,59,61]; none examines the power behavior of LLMs specifically.

LLM inference phases. Generative inference has two distinct phases: a prompt phase, which processes all input tokens in parallel, and a token phase, which generates output tokens one at a time while reusing intermediate state cached in the KV-cache [57]. As we show later, the two phases stress the GPUs very differently.

Datacenter power delivery. Power flows from the utility through transformers, UPSes, and power distribution units (PDUs) to the racks. Because servers rarely peak simultaneously, providers oversubscribe this hierarchy and rely on power capping as a safety mechanism [73].

3 Power Telemetry and Controls

3.1 Power telemetry. Server power can be read out-of-band (OOB) through the baseboard management controller (BMC) via IPMI [26], or in-band (IB) through OS-visible interfaces [3,43,47]. NVIDIA GPUs expose OOB telemetry over SMBPBI [42] and IB telemetry through nvidia-smi [47] and DCGM [43], covering GPU power draw, SM and memory utilization, and PCIe TX/RX activity; AMD GPUs offer similar counters through ROCm [3].

3.2 Power controls. On the CPU side, power can be capped in-band with RAPL [55] or out-of-band through the BMC [26]; RAPL-style capping covers the CPU and DRAM but not the GPUs, which dominate the power of an ML server [17,31]. Cluster managers and ML platforms such as Singularity [64], Azure OpenAI [6], Vertex AI [56], SageMaker [1], and Azure ML [2] schedule these workloads, and QoS-aware capping systems such as Thunderbolt [34] show how to enforce budgets without hurting high-priority jobs. GPUs provide two in-band knobs: (1) frequency capping, which locks the SM clock and thereby bounds achievable FLOPS [3,47], and (2) power capping, which lets the GPU itself pick frequency and voltage to stay under a board power limit. NVIDIA GPUs also support OOB power capping through SMBPBI [42], but OOB mechanisms are slower and coarser-grained than their IB counterparts [51], which matters when power emergencies must be handled within tens of seconds.
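To make the two in-band knobs concrete, the following is a minimal sketch (not from the paper) that applies a board power cap or a locked SM clock range through the nvidia-smi CLI. The 300 W and 1275 MHz values are illustrative, and both operations assume an NVIDIA driver and administrative privileges.

```python
import subprocess

def set_power_cap(gpu_index: int, watts: int) -> None:
    """In-band board power cap: the GPU firmware then picks clocks/voltages to stay under it."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

def lock_sm_clocks(gpu_index: int, min_mhz: int, max_mhz: int) -> None:
    """In-band frequency cap: pin the SM clock range (min == max emulates a hard cap)."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_index),
                    "-lgc", f"{min_mhz},{max_mhz}"], check=True)

def reset_clocks(gpu_index: int) -> None:
    """Return the SM clocks to driver management."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-rgc"], check=True)

if __name__ == "__main__":
    set_power_cap(0, 300)           # e.g., cap GPU 0 at 300 W
    lock_sm_clocks(0, 1275, 1275)   # or lock its SM clock at 1275 MHz
```

Locking the clock to a single value emulates a hard frequency cap, while `-rgc` restores default behavior; the OOB path would issue equivalent commands through the BMC instead.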


[Tables 1-3: the LLMs, frameworks, and server configurations used in our study.]

3.3 Experimental setup. We combine production telemetry with controlled experiments. In production, we study Azure clusters that train LLMs on servers with A100-40GB GPUs; because the GPUs are assigned to workloads running inside VMs [68], the provider can rely only on OOB telemetry and controls for them. For controlled experiments we use an NVIDIA DGX-A100 server with 8 A100-80GB GPUs [44,46], AMD CPUs, PCIe 4.0, and NVLink 3.0, where we have full in-band access alongside IPMI [26] for server power and RAPL [55] for CPU power.

We study open-source models spanning the three transformer classes [67]: encoder-only (RoBERTa [37]), decoder-only (GPT-NeoX [5], OPT [74], Llama2 [66], and BLOOM [70]), and encoder-decoder (Flan-T5 [12]). We fine-tune the smaller models (e.g., RoBERTa, GPT-NeoX) and run inference with the larger ones (e.g., BLOOM, Llama2).

3.4 Software and measurement. We use Hugging Face Transformers with Accelerate [20,69], the GPT-NeoX codebase [5], and DeepSpeed [60] for fine-tuning, DeepSpeed-Inference/MII [4,39] and vLLM [32] for serving, and bitsandbytes [13] for reduced-precision inference.
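As an illustration of this setup, the following is a minimal sketch of serving one of the studied decoder-only models with Transformers and Accelerate; the model id, dtype, and generation parameters are illustrative rather than the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"   # one of the decoder-only models studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # FP16 inference
    device_map="auto",           # Accelerate shards the checkpoint across visible GPUs
)

prompt = "Datacenter power management for LLM serving"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)   # prompt phase, then token phase
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Using `device_map="auto"` is how multi-GPU inference of the larger checkpoints is typically run; each `generate` call exercises both the prompt and the token phase described in Section 2.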


During each run we sample GPU power, SM utilization, and memory utilization with DCGM [43] and nvidia-smi, and server power with IPMI, at a granularity of roughly 100 ms. Through nvidia-smi we can also lock the GPU SM frequency (we explore the 1.1-1.4 GHz range) and set board power caps (between 300 W and 400 W). To validate the telemetry, we compared the DCGM GPU power readings against IPMI server power and found them consistent to within 5-10 W once the non-GPU components are accounted for; we therefore use DCGM for fine-grained GPU power and IPMI for full-server power.
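A minimal sketch of such a fine-grained sampler is shown below, assuming nvidia-smi is available (DCGM exposes equivalent fields through `dcgmi dmon`); server-level readings would come separately from the BMC, e.g. via `ipmitool dcmi power reading`. The file name and the unbounded loop are illustrative.

```python
import csv, subprocess, time

def gpu_power_draw_watts() -> list[float]:
    """One power.draw sample per visible GPU, in watts."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True)
    return [float(line) for line in out.splitlines() if line.strip()]

with open("gpu_power_trace.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while True:                      # stop with Ctrl-C; a real harness would bound the duration
        # one row per sample: timestamp followed by one column per GPU
        writer.writerow([time.time()] + gpu_power_draw_watts())
        time.sleep(0.1)              # ~100 ms sampling interval, matching the methodology above
```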

4 Power Characterization of LLMs

4.1 LLM training. Training distributes the model with data, tensor, and pipeline parallelism [25,35,63], so per-GPU power depends on how much time each GPU spends computing rather than communicating or waiting. We profile a production LLM training cluster with more than 100 servers, each with 8 GPUs (Figure 5), and observe sustained power around 85% of the aggregate GPU TDP: training keeps the GPUs busy almost continuously and leaves little power slack.

On the testbed we also fine-tune the open models and measure GPU power (Figure 4, left) and GPU utilization (Figure 4, right). The models reach very different power levels even when their reported SM utilization is similarly high: achieved power tracks the arithmetic intensity of the work, not utilization alone.

Insight 1: LLM fine-tuning does not necessarily drive GPUs close to their TDP, and reported GPU utilization is a poor proxy for GPU power.


Figure 4 quantifies the gap: fine-tuning RoBERTa draws about 75% of GPU TDP, GPT-NeoX about 50%, and Flan-T5 only about 20%. The larger models spend a bigger share of each iteration in communication and memory-bound work, so their average power stays well below the peak [9].

Insight 2: The power drawn during LLM training and fine-tuning varies widely across models and fluctuates over time as jobs alternate between compute- and communication-bound phases, producing correlated power swings across the GPUs of a server.

4.2 LLM inference. We next characterize inference using production-grade engines (e.g., FasterTransformer, DeepSpeed-Inference) and sweep the knobs that operators control: batch size, prompt (input) length, and the number of generated tokens. Figure 7 shows GPU power while serving BLOOM; Figure 8a sweeps the prompt size from 256 to 8192 tokens and Figure 8b the number of generated tokens. Power rises during the compute-intensive prompt phase and then drops for the remainder of the request, because the token phase is memory-bandwidth-bound and reuses the KV-cache rather than recomputing over all tokens.

Insight 3: Even at large batch sizes and prompt lengths, LLM inference power remains well below the GPU TDP.

Insight 4: The prompt phase draws substantially more power than the token phase, so requests that generate many tokens spend most of their time at comparatively low power.
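Given the timestamped power samples from Section 3.4 and the phase boundaries logged by the serving harness (an assumption on our part, not part of the paper's artifacts), per-phase energy can be estimated by simple trapezoidal integration, as in the sketch below.

```python
def phase_energy_joules(samples: list[tuple[float, float]],
                        t_start: float, t_end: float) -> float:
    """samples: (timestamp_s, power_w) pairs sorted by time; integrate power over [t_start, t_end]."""
    window = [(t, p) for t, p in samples if t_start <= t <= t_end]
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(window, window[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)   # trapezoid: average power times duration
    return energy

# Example usage for one request (timestamps assumed to be logged by the harness):
# prompt_j = phase_energy_joules(trace, t_request_start, t_first_token)
# decode_j = phase_energy_joules(trace, t_first_token, t_request_end)
```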


Figures 8c-8f sweep the batch size (from 1 to 16) and the number of generated tokens. Larger batches raise power, since more tokens are processed per step, but the long token phase still keeps average power low; only very long prompts (beyond 4096 tokens) push power up noticeably, and longer outputs mostly lengthen the low-power tail of each request. Figure 9 illustrates this for BLOOM with a prompt of 8192 tokens, 128 output tokens, and batch size 1.

Insight 5: Quantization gives providers an additional knob: reduced-precision inference needs fewer GPUs and draws less total power for the same model.

We use bitsandbytes to run Llama2-70B and Llama2-13B [13] at FP32, FP16, and INT8. At FP32, Llama2-70B needs 4 A100-80GB GPUs, while FP16 and INT8 reduce the footprint to 2 and 1 GPUs, respectively; Llama2-13B fits in a single GPU at the lower precisions. Lower precision therefore trades per-GPU compute for fewer GPUs and a smaller KV-cache footprint.
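A minimal sketch of the reduced-precision setup, using the Transformers integration of bitsandbytes; the checkpoint name is illustrative, the Llama 2 weights are gated behind an access request, and the API shown assumes a recent Transformers release.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# INT8 weight quantization via bitsandbytes [13]; an FP16 baseline would instead
# pass torch_dtype=torch.float16 and omit the quantization config.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",                                 # gated checkpoint (access required)
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),   # 8-bit weights
    device_map="auto",                                           # place/shard across available GPUs
)
print(model.get_memory_footprint() / 2**30, "GiB of weights after quantization")
```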


At FP16, Llama2-70B runs on fewer GPUs and draws less total power than at FP32, and INT8 shrinks the footprint further; however, the CUDA kernels currently used by bitsandbytes can increase latency even as memory use decreases [18], so the power savings come with a performance caveat. Newer GPUs such as the NVIDIA H100 add native FP8 support [45], which should make reduced-precision serving more attractive.

Insight 6: Reduced precision lowers both the GPU count and the power draw of LLM inference, but with today's INT8 kernels the latency cost must be weighed against the savings.

Insight 7: A request's power profile is dominated by its phase mix: in Figure 9, the prompt phase accounts for a brief power spike and the token phase for a long, low-power tail.

4.3 Frequency and power capping. We next measure how GPU frequency and power capping affect LLM inference. Figure 10 plots power (normalized to TDP) against the SM frequency cap, and Figure 11 shows the effect of board power caps. Moderate caps have a small performance impact: lowering the SM frequency cap by 100 MHz (about 7%) changes end-to-end inference performance by only about 2%, and power caps down to roughly 60% of TDP are largely absorbed by the slack in the token phase. Aggressive caps, in contrast, directly slow the compute-bound prompt phase.

Insight 8: GPU frequency and power capping are effective enforcement mechanisms for LLM inference: moderate caps trim power with minimal performance loss, which makes oversubscription safe to enact.
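A sketch of the kind of frequency-capping sweep used here: lock the SM clock at each candidate value, run a fixed benchmark, and record latency while the sampler from Section 3.4 runs alongside. The clock values are illustrative, and `run_benchmark` is a stand-in for an inference harness rather than anything from the paper's artifacts.

```python
import subprocess, time
import torch

def run_benchmark() -> float:
    """Stand-in for a fixed inference benchmark: times a large matmul on GPU 0.
    In practice this would issue a fixed batch of LLM requests and return mean latency."""
    x = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(20):
        _ = x @ x
    torch.cuda.synchronize()
    return (time.time() - t0) / 20

for mhz in (1410, 1305, 1275, 1110, 990):                 # candidate SM clock caps (illustrative)
    subprocess.run(["nvidia-smi", "-lgc", f"{mhz},{mhz}"], check=True)   # lock SM clocks
    print(f"{mhz} MHz: mean step latency {run_benchmark() * 1000:.1f} ms")
subprocess.run(["nvidia-smi", "-rgc"], check=True)        # restore driver-managed clocks
```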


In summary, our characterization shows that GPU power dominates server power for LLM workloads and that the knobs of Section 4.3 give providers meaningful control over it.

5 Power Management Opportunities for LLMs

5.1 LLM training. Training power is high and correlated across GPUs (Insights 1 and 2), so training clusters offer little room for power oversubscription, and capping an expensive, long-running training job is undesirable. Training power is, however, predictable: jobs follow regular iteration patterns, and energy-optimization frameworks such as Zeus [71] show that choosing the right frequencies can save energy with little slowdown.

Insight 9: ML platforms and schedulers [1,2,56,64] already track job identity and progress; exposing that information to the power-management layer enables proactive, workload-aware policies rather than purely reactive capping.

5.2 LLM inference. Inference is where the opportunity lies. Because the GPUs are handed to workloads inside customer VMs, the provider must enforce any caps out-of-band, which is slower and coarser than in-band control. Still, the headroom is large: a DGX-A100 server is provisioned for 6500 W [44], yet across all our inference experiments we never measured more than about 5700 W, leaving roughly 800 W of provisioned power per server unused. Aggregated across a cluster, this headroom allows a provider to oversubscribe power and host more servers under the same budget.
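The per-server headroom implied by these numbers is straightforward to quantify:

```python
# Worked arithmetic from the numbers above: rated versus observed peak server power.
rated_w    = 6500   # DGX-A100 nameplate power [44]
observed_w = 5700   # highest server power seen in our inference experiments
headroom_w = rated_w - observed_w
print(headroom_w, f"{headroom_w / rated_w:.1%}")   # 800 W, about 12% of the provisioned budget
```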


Inference clusters also host latency-critical services such as ChatGPT on Azure OpenAI [6], running on high-end GPU servers connected over InfiniBand [44], so any power management must respect service-level objectives (SLOs). Recent work even splits the prompt and token phases across machines [49], which further widens the design space for power-aware serving.

6 POLCA: Power Oversubscription for LLM Inference Clouds

We propose POLCA [50], a framework that oversubscribes the provisioned power of LLM inference clusters. POLCA must satisfy three requirements: (1) it must keep the infrastructure safe, reacting to power spikes before circuit-breaker and UPS ride-through limits are exceeded [73]; (2) it must preserve inference SLOs, capping low-priority work first; and (3) it must work with the mechanisms a cloud provider actually has, which for GPUs inside customer VMs means OOB telemetry and capping (Section 3.2), whose reaction time is on the order of tens of seconds.

6.1 Design. POLCA monitors the aggregate power of each group of servers behind a shared PDU and compares it against the oversubscribed budget. Like CPU-oriented power-capping systems [15,36,55,58] and datacenter oversubscription frameworks [17,19,23,31,73], it acts in stages: when power approaches the budget, it first applies a GPU frequency cap, which is fast and has little impact on inference performance; if power keeps rising, it additionally applies a GPU power cap; and it removes the caps once the group is safely below budget again.

6.2 Implementation. POLCA runs alongside the cluster manager: it aggregates per-server telemetry for each server group (Figure 12), predicts near-term group power from recent history, and issues frequency- and power-cap commands to the servers' GPUs, prioritizing low-priority workloads.
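The staged policy can be summarized with the sketch of a two-threshold control loop below. This is our reading of the design, not the authors' implementation: the threshold values match Section 6.5, while the budget and the four hook functions are hypothetical.

```python
import time

PROVISIONED_W = 100_000     # illustrative oversubscribed budget for one server group
T1, T2 = 0.80, 0.89         # thresholds as fractions of the budget (see Section 6.5)

def control_loop(read_group_power_w, apply_frequency_cap, apply_power_cap, remove_caps):
    """Hypothetical hooks into the cluster manager; each cap acts on the group's GPUs."""
    while True:
        util = read_group_power_w() / PROVISIONED_W
        if util >= T2:
            apply_frequency_cap()   # fast first-stage action with small performance impact
            apply_power_cap()       # stronger second-stage action for the remaining overshoot
        elif util >= T1:
            apply_frequency_cap()
        else:
            remove_caps()           # comfortably below budget: uncap
        time.sleep(1)               # reaction period of the controller (illustrative)
```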


6.3 Workloads. We evaluate POLCA with power traces collected from production LLM inference clusters between June 21 and August 2, 2023, together with the BLOOM-176B inference workloads characterized in Section 4.2 replayed against the measured power profiles of our testbed.

6.4 Evaluation methodology. We replay the production traces at different levels of oversubscription and measure both the group-level power behavior and the end-to-end latency of the inference workloads. POLCA's power predictor tracks the traces with a mean absolute percentage error (MAPE) of about 3%.

6.5 Choosing thresholds and actions. Crossing the first threshold (T1) triggers GPU frequency capping (locking the A100 GPUs at 1275 MHz), which trims power while preserving the SLOs of latency-critical, high-priority (HP) services and mostly affecting low-priority (LP) work. Crossing the second threshold (T2) triggers GPU power capping, applied out-of-band through the BMC using SMBPBI [42]; because OOB actions take on the order of 40 seconds to take effect, T2 must leave enough margin below the breaker limit. We explored threshold pairs in the 75-85%, 80-89%, and 85-95% ranges and settled on T1 = 80% and T2 = 89%, which lets POLCA support roughly 30% more servers (Figure 13) while keeping SLO violations negligible (Figure 15a).

6.6 End-to-end results. We replay the production traces under several policies and server counts, comparing POLCA against the alternatives described next.


We compare POLCA against three baselines: no-cap, which never caps; 1-threshold-all, which caps every server when group power crosses 89% of the budget; and 1-threshold-low-pri, which caps only the low-priority servers at the same 89% threshold (Figures 13-15). With 30% more servers hosted under the same budget, no-cap occasionally exceeds the provisioned power, while the single-threshold policies either cap too aggressively, hurting the P99 latency SLOs of high-priority services, or too narrowly, failing to shave the largest spikes. POLCA's two-stage policy caps rarely and briefly: it keeps P99 and even P100 latency within the SLOs of high-priority workloads, with an overall performance impact of less than 2% (Figure 17), and capping is active for only a small fraction of the time.


6.7 Discussion. GPU server power density keeps rising: an NVIDIA DGX-A100 packs 6.5 kW of TDP into 6U and a DGX-H100 packs 10.2 kW into 8U, compared with roughly 600 W in 1U for a typical CPU server. This trend makes power the binding constraint on how many LLM servers fit in existing datacenters, and it makes frameworks like POLCA more valuable over time. POLCA is complementary to techniques that change the power envelope itself, such as overclocking in immersion-cooled datacenters [28], and to workload-aware ML platforms [1,6,56] that could give the power manager richer information about each job.

7 Related Work. Prior work improves the energy efficiency of DNN training and inference through DVFS, batching, and scheduling [11,22,40,41,71], characterizes large GPU datacenters and their workloads [24,29,33], studies the energy behavior of ML serving [27,72], and examines GPU power management mechanisms themselves [52,54,65]. Unlike these efforts, we characterize the power behavior of LLMs specifically and use that characterization to drive power oversubscription in the cloud.

8 Conclusion. We characterized the power usage of LLM training and inference in the cloud, identified several power management opportunities, and proposed POLCA, which safely oversubscribes power in LLM inference clusters and enables deploying roughly 30% more servers in existing datacenters.

Acknowledgments. We thank Lieven Eeckhout and the anonymous reviewers for their feedback. Pratyush Patel was partially supported by NSF grant CNS-2104548 and VMware.


References

[1] Amazon SageMaker. https://aws.amazon.com/sagemaker, 2023.
[2] Azure Machine Learning - ML as a Service. https://azure.microsoft.com/en-us/products/machine-learning, 2023.
[3] AMD. ROCm Open Software Platform for GPU Compute. https://www.amd.com/en/graphics/servers-solutions-rocm.
[4] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. In SC, 2022.
[5] Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Wang Phil, and Samuel Weinbach. GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch. https://www.github.com/eleutherai/gpt-neox, 2021.
[6] Microsoft Azure. Azure OpenAI Service. https://azure.microsoft.com/en-us/products/ai-services/openai-service, 2022.
[7] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. The Datacenter as a Computer: Designing Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2018.
[8] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In FAccT, 2021.
[9] Srikant Bharadwaj, Shomit Das, Kaushik Mazumdar, Bradford Beckmann, and Stephen Kosonocky. Predict; Do Not React for Enabling Efficient Fine-Grain DVFS in GPUs. In ASPLOS, 2023.
[10] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. Language Models Are Few-Shot Learners. In NeurIPS, 2020.
[11] Sangjin Choi, Inhoe Koo, Jeongseob Ahn, Myeongjae Jeon, and Youngjin Kwon. EnvPipe: Performance-preserving DNN Training Framework for Saving Energy. In USENIX ATC, 2023.
[12] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling Instruction-Finetuned Language Models. arXiv preprint arXiv:2210.11416, 2022.
[13] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In NeurIPS, 2022.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019.
[15] Xiaobo Fan, Wolf-Dietrich Weber, and Luiz André Barroso. Power Provisioning for a Warehouse-sized Computer. In ISCA, 2007.
[16] International Institute for Strategic Studies. Large Language Models: Fast Proliferation and Budding International Competition. Strategic Comments, 2023.
[17] Xing Fu, Xiaorui Wang, and Charles Lefurgy. How Much Power Oversubscription is Safe and Allowed in Data Centers? In ICAC, 2011.
[18] GitHub. bitsandbytes: Memory decreases but latency increases. https://github.com/TimDettmers/bitsandbytes/issues/6, 2022.
[19] Sriram Govindan, Jeonghwan Choi, Bhuvan Urgaonkar, Anand Sivasubramaniam, and Andrea Baldini. Statistical Profiling-based Techniques for Effective Power Provisioning in Data Centers. In EuroSys, 2009.
[20] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, and Sourab Mangrulkar. Accelerate: Training and Inference at Scale Made Simple, Efficient and Adaptable. https://github.com/huggingface/accelerate, 2022.
[21] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA, 2018.
[22] Miro Hodak, Masha Gorkovenko, and Ajay Dholakia. Towards Power Efficiency in Deep Learning on Data Center Hardware. In BigData, 2019.
[23] Chang-Hong Hsu, Qingyuan Deng, Jason Mars, and Lingjia Tang. SmoothOperator: Reducing Power Fragmentation and Improving Power Utilization in Large-Scale Datacenters. In ASPLOS, 2018.
[24] Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang. Characterization and Prediction of Deep Learning Workloads in Large-scale GPU Datacenters. In SC, 2021.
[25] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, et al. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In NeurIPS, 2019.
[26] Intel, Hewlett Packard, NEC, and Dell. Intelligent Platform Management Interface Specification (IPMI). https://www.intel.in/content/www/in/en/products/docs/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html, 2013.
[27] Ali Jahanshahi, Hadi Zamani Sabzi, Chester Lau, and Daniel Wong. GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers. In CAL, 2020.
[28] Majid Jalili, Ioannis Manousakis, Íñigo Goiri, Pulkit A. Misra, Ashish Raniwala, Husam Alissa, Bharath Ramakrishnan, Phillip Tuma, Christian Belady, Marcus Fontoura, and Ricardo Bianchini. Cost-Efficient Overclocking in Immersion-Cooled Datacenters. In ISCA, 2021.
[29] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In USENIX ATC, 2019.
[30] Kashif Nizam Khan, Mikael Hirki, Tapio Niemi, Jukka K. Nurminen, and Zhonghong Ou. RAPL in Action: Experiences in Using RAPL for Power Measurements. 2018.
[31] Alok Gautam Kumbhare, Reza Azimi, Ioannis Manousakis, Anand Bonde, Felipe Frujeri, Nithish Mahalingam, Pulkit A. Misra, Seyyed Ahmad Javadi, Bianca Schroeder, Marcus Fontoura, and Ricardo Bianchini. Prediction-Based Power Oversubscription in Cloud Platforms. In USENIX ATC, 2021.
[32] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.
[33] Baolin Li, Rohin Arora, Siddharth Samsi, Tirthak Patel, et al. AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications. In HPCA, 2022.
[34] Shaohong Li, Xi Wang, Xiao Zhang, Vasileios Kontorinis, Sreekumar Kodakara, David Lo, and Parthasarathy Ranganathan. Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale. In OSDI, 2020.
[35] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. In VLDB, 2020.
[36] Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, Daniel D. Heimsoth, Saugata Ghose, and Onur Mutlu. A Scalable Priority-aware Approach to Managing Data Center Server Power. In HPCA, 2019.
[37] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
[38] Meta. Introducing the AI Research SuperCluster, Meta's Cutting-edge AI Supercomputer for AI Research. https://ai.facebook.com/blog/ai-rsc/.
[39] Microsoft. DeepSpeed: Model Implementations for Inference (MII). https://github.com/microsoft/DeepSpeed-MII.
[40] Seyed Morteza Nabavinejad, Sherief Reda, and Masoumeh Ebrahimi. Coordinated Batching and DVFS for DNN Inference on GPU Accelerators. TPDS, 2022.
[41] Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In MASCOTS, 2017.
[42] NVIDIA. Data Center GPU Driver. https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_450_v1.pdf.
[43] NVIDIA. Data Center GPU Manager (DCGM). https://developer.nvidia.com/dcgm.
[44] NVIDIA. DGX A100: The Universal System for AI Infrastructure. https://resources.nvidia.com/en-us-dgx-systems/dgx-ai.
[45] NVIDIA. DGX H100. https://www.nvidia.com/en-us/data-center/dgx-h100/.
[46] NVIDIA. NVIDIA A100 80GB PCIe GPU Product Brief. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf.
[47] NVIDIA. System Management Interface (nvidia-smi). https://developer.nvidia.com/nvidia-system-management-interface.
[48] OpenAI. Scaling Kubernetes to 7,500 Nodes. https://openai.com/research/scaling-kubernetes-to-7500-nodes.
[49] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. arXiv preprint arXiv:2311.18677, 2023.
[50] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. POLCA: Power Oversubscription in LLM Cloud Providers. arXiv preprint arXiv:2308.12908, 2023.
[51] Pratyush Patel, Zibo Gong, Syeda Rizvi, Esha Choukse, Pulkit Misra, Thomas Anderson, and Akshitha Sriraman. Towards Improved Power Management in Cloud GPUs. In CAL, 2023.
[52] Tapasya Patki, Zachary Frye, Harsh Bhatia, Francesco Di Natale, James Glosli, Helgi Ingolfsson, and Barry Rountree. Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow. In WORKS, 2019.
[53] David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer, 2022.
[54] Martin Peres. Reverse Engineering Power Management on NVIDIA GPUs: A Detailed Overview. In XDC, 2013.
[55] Pavlos Petoumenos, Lev Mukhanov, Zheng Wang, Hugh Leather, and Dimitrios S. Nikolopoulos. Power Capping: What Works, What Does Not. In ICPADS, 2015.
[56] Google Cloud Platform. Vertex AI. https://cloud.google.com/vertex-ai, 2023.
[57] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019.
[58] Parthasarathy Ranganathan, Phil Leech, David Irwin, and Jeffrey Chase. Ensemble-level Power Management for Dense Blade Servers. In ISCA, 2006.
[59] Tirias Research. Why Your AI Infrastructure Needs Both Training and Inference. 2019.
[60] Philipp Schmid. Fine-tune FLAN-T5 XL/XXL using DeepSpeed and Hugging Face Transformers. https://www.philschmid.de/fine-tune-flan-t5-deepspeed.
[61] Amazon Web Services. Amazon EC2 Update: Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing. https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/.
[62] Amazon Web Services. AWS Trainium: High-performance Machine Learning Training Accelerator, Purpose Built by AWS. https://aws.amazon.com/machine-learning/trainium/.
[63] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.
[64] Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, et al. Singularity: Planet-scale, Preemptive and Elastic Scheduling of AI Workloads. arXiv preprint arXiv:2202.07848, 2022.
[65] Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair, and Shivaram Venkataraman. Not All GPUs Are Created Equal: Characterizing Variability in Large-scale, Accelerator-rich Systems. In SC, 2022.
[66] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
[67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In NeurIPS, 2017.
[68] Lan Vu, Hari Sivaraman, and Rishi Bidarkar. GPU Virtualization for High Performance General Purpose Computing on the ESX Hypervisor. In HPC, 2014.
[69] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art Natural Language Processing. In EMNLP, 2020.
[70] BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, et al. BLOOM: A 176B-Parameter Open-access Multilingual Language Model. arXiv preprint arXiv:2211.05100, 2022.
[71] Jie You, Jae-Won Chung, and Mosharaf Chowdhury. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In NSDI, 2023.
[72] Junyeol Yu, Jongseok Kim, and Euiseong Seo. Know Your Enemy To Save Cloud Energy: Energy-Performance Characterization of Machine Learning Serving. In HPCA, 2023.
[73] Chaojie Zhang, Alok Gautam Kumbhare, Ioannis Manousakis, Deli Zhang, Pulkit A. Misra, Rod Assis, Kyle Woolcock, Nithish Mahalingam, Brijesh Warrier, David Gauthier, Lalu Kunnath, Steve Solomon, Osvaldo Morales, Marcus Fontoura, and Ricardo Bianchini. Flex: High-Availability Datacenters With Zero Reserved Power. In ISCA, 2021.
[74] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.
