Characterizing Power Management Opportunities for LLMs in the Cloud

Pratyush Patel*, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini
Microsoft Azure

Keywords: large language models (LLMs); GPUs; power management; cloud computing

ACM Reference Format:
Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing Power Management Opportunities for LLMs in the Cloud. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), April 27-May 1, 2024, La Jolla, CA, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3620666.3651329
1 Introduction

Recent advances in large language models (LLMs) have rapidly increased the demand for datacenter GPUs [8]. OpenAI scaled its Kubernetes clusters to 7,500 nodes to train models such as GPT-3 [48], and Meta's AI Research SuperCluster hosts over 6,000 A100 GPUs [38]. Services built on models such as Bard and GPT-4 [16] continue to grow, and inference is expected to account for the large majority, by some estimates up to 90%, of deployed AI compute [53,59,61]. A key bottleneck for this growth is datacenter power: providers cannot build new GPU capacity fast enough, and existing facilities are limited by their provisioned power [38,48]. Power oversubscription is a well-established technique for increasing effective capacity in CPU-based clouds [15,17,23,73], but it has not been systematically explored for GPU clusters running LLMs. In this paper, we characterize the power usage of LLM training and inference workloads, show that LLM inference leaves substantial power headroom, and use this characterization to design POLCA, a framework for power oversubscription in LLM inference clusters that enables deploying roughly 30% more servers in existing datacenters with minimal performance impact.

*Work done while Pratyush Patel was at Microsoft Azure.

This work is licensed under a Creative Commons Attribution International 4.0 License.
ASPLOS '24, April 27-May 1, 2024, La Jolla, CA, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0386-7/24/04.
https://doi.org/10.1145/3620666.3651329
Figures 1-3: Power characterization of GPT-style LLMs on a server with 8x A100-80GB GPUs.
2 Background

Modern LLMs are built on the transformer architecture [67]. Encoder-style models such as BERT [14] and RoBERTa [37] are comparatively small, while generative models such as GPT [57], BLOOM [70], and FLAN-T5 [12] have grown to tens or hundreds of billions of parameters.
LLM training and inference differ substantially from earlier deep learning workloads. Training runs for weeks on large clusters of GPU servers connected by high-bandwidth interconnects such as InfiniBand; OpenAI, for example, scaled a single cluster to 7,500 nodes for models like GPT-3 [48]. Inference is also resource-intensive: serving a model such as BLOOM-176B [70] (similar in scale to GPT-3 [10]) requires a full server with 8 GPUs, and generative inference keeps those GPUs drawing power for the lifetime of each request. Despite the scale of these deployments at OpenAI, Meta, and others [38,48], existing GPU cluster studies such as Microsoft's Philly traces [25] predate LLMs, and there is little published data on LLM power behavior. This paper fills that gap by characterizing training vs. inference power and shows how the resulting headroom can be exploited: POLCA oversubscribes power in LLM inference clusters and fits roughly 30% more servers into the same power envelope.
3.2 Datacenter Power Management

Servers expose power telemetry through two paths: out-of-band (OOB) via the baseboard management controller and IPMI [26], and in-band (IB) via interfaces such as Intel RAPL for the CPU and DRAM [55]. Cloud providers use this telemetry to oversubscribe power and to enforce budgets when needed [31]. ML platforms such as Singularity [64], Azure OpenAI [6], Google Vertex AI [56], Amazon SageMaker [1], and Azure ML [2] schedule LLM workloads on shared GPU fleets, so any power management mechanism must preserve workload quality of service [34].
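To make the in-band path concrete, the sketch below reads the Linux powercap (RAPL) energy counters for the CPU package. This is an illustrative snippet rather than the paper's tooling, and it assumes a Linux host that exposes /sys/class/powercap/intel-rapl:0.

```python
# Illustrative sketch (not from the paper): sample in-band CPU package power
# via the Linux powercap/RAPL interface. Assumes /sys/class/powercap/intel-rapl:0
# exists and is readable.
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj() -> int:
    with open(RAPL_ENERGY) as f:
        return int(f.read().strip())

def sample_package_power(interval_s: float = 0.1) -> float:
    """Return the average package power (watts) over one sampling interval."""
    e0 = read_energy_uj()
    time.sleep(interval_s)
    e1 = read_energy_uj()
    return (e1 - e0) / 1e6 / interval_s  # microjoules -> joules -> watts

if __name__ == "__main__":
    print(f"CPU package power: {sample_package_power():.1f} W")
```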
Tables 1-3: The LLM workloads, configurations, and frameworks used in our characterization.
3.3 Measuring LLM Power

We measure CPU and server power out-of-band through IPMI [26] and in-band through RAPL [55], and GPU power through NVIDIA's management interfaces (Section 4.3 examines their accuracy). Because production servers run LLM workloads inside VMs, we collect GPU power both from the host and from within the guest, and we cross-check the in-band GPU readings against server-level telemetry.
3.4 LLM Software and Measurement Setup

We run LLM inference and fine-tuning with widely used frameworks: Hugging Face Transformers with Accelerate [20,69], GPT-NeoX [5], DeepSpeed [60], DeepSpeed-MII [4,39], and vLLM [32]. Models that do not fit in GPU memory at full precision are quantized with bitsandbytes [13] when loaded through Hugging Face Transformers [69].
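As an illustration of this setup (not the paper's exact scripts), the following sketch loads a causal LLM with Hugging Face Transformers and Accelerate, sharding it across all visible GPUs, with optional 8-bit quantization via bitsandbytes. The model name is a placeholder.

```python
# Illustrative sketch: load an LLM sharded across the GPUs of one server using
# Hugging Face Transformers + Accelerate, with optional 8-bit quantization via
# bitsandbytes. The model name below is a placeholder, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-neox-20b"  # placeholder model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",         # Accelerate places layers across available GPUs
    torch_dtype=torch.float16,
    load_in_8bit=False,        # set True to quantize with bitsandbytes
)

inputs = tokenizer("Power management for LLMs", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```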
We collect GPU power telemetry at 100 ms granularity using NVIDIA DCGM [43] and nvidia-smi, and server-level power through IPMI. DCGM reports per-GPU power draw, which we aggregate across the GPUs of each server and compare against the IPMI readings.
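The snippet below sketches such a polling loop using nvidia-smi's query interface; it is a simplified stand-in for the DCGM-based collector described above, and the 100 ms interval matches the sampling granularity used in our measurements.

```python
# Illustrative sketch: poll per-GPU power draw every 100 ms using nvidia-smi.
# A production collector would use DCGM instead; this stand-in only requires
# the nvidia-smi CLI to be present.
import csv
import subprocess
import time

def sample_gpu_power() -> list[float]:
    """Return the current power draw (watts) of every GPU on this host."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [float(line) for line in out.strip().splitlines()]

with open("gpu_power_trace.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp_s", "total_power_w"])
    for _ in range(600):                 # ~1 minute of samples
        writer.writerow([time.time(), sum(sample_gpu_power())])
        time.sleep(0.1)                  # 100 ms sampling interval
```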
DCGM and IPMI largely agree: per-GPU power from DCGM tracks the server-level IPMI readings to within 5-10 W, so we rely on DCGM for fine-grained GPU measurements and IPMI for whole-server power. For training, we profile distributed runs that use standard data- and model-parallel frameworks [25,35,63] on servers with 8 GPUs each, including multi-day runs (100+ hours). Across these workloads, the GPUs account for roughly 85% of server power, which is why this paper focuses on GPU-level power behavior.
4 LLM Power Characterization

We first profile inference. We serve each LLM with production-grade inference engines (e.g., FasterTransformer and DeepSpeed-Inference) and measure power across more than 100 GPUs (Section 4.1; Figure 4). Figure 5 compares power draw against the rated GPU TDP for GPT-NeoX, Flan-T5, and RoBERTa.

Insight 1: LLM inference rarely saturates GPU power; even at high load, the GPUs draw well below their rated TDP.

Smaller models such as RoBERTa leave even more headroom: their GPU power stays close to idle because a single request cannot keep all SMs busy (low SM utilization).
Insight 2: Even for large models such as BLOOM, GPU power during inference stays comfortably below the rated TDP, leaving headroom across all GPUs of the server (Figure 7).

We next study GPU power capping. Figure 4 shows performance as we lower the GPU power limit below TDP; capping by up to 20% causes only minor slowdowns.

Insight 4: LLM inference tolerates GPU power caps well below TDP. Flan-T5 and GPT-NeoX tolerate caps roughly 22% and 10% below TDP with negligible performance loss (see also [9]); RoBERTa never approaches TDP, and the behavior is similar for the largest model we evaluate (BLOOM-176B).
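The power caps in these experiments can be applied through NVIDIA's standard management interface. The sketch below is an illustration of that mechanism, not the paper's exact procedure: it sets a per-GPU power limit as a fraction of the default limit (TDP) using nvidia-smi.

```python
# Illustrative sketch: cap every GPU on a server to a fraction of its default
# power limit (TDP) using nvidia-smi. Requires root/admin privileges.
import subprocess

def query_default_limit_w(gpu: int) -> float:
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=power.default_limit", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())

def set_power_cap(gpu: int, fraction_of_tdp: float) -> None:
    """E.g., fraction_of_tdp=0.8 caps the GPU 20% below its default limit."""
    cap_w = int(query_default_limit_w(gpu) * fraction_of_tdp)
    subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", str(cap_w)], check=True)

for gpu_id in range(8):          # one DGX-style server with 8 GPUs
    set_power_cap(gpu_id, 0.8)   # 20% below TDP
```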
Generative LLM inference has two distinct phases: a compute-intensive prompt (prefill) phase that processes all input tokens at once, and a token-generation (decode) phase that produces output tokens one at a time. Figure 8b shows that the prompt phase draws substantially more power than token generation.
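A simple way to observe this phase difference is to time and meter the two phases separately. The sketch below is illustrative only: it runs the prefill as a single forward pass and the decode loop via generate, sampling GPU power around each, and it reuses the hypothetical model, tokenizer, and sample_gpu_power() helpers from the earlier sketches.

```python
# Illustrative sketch: contrast GPU power during the prompt (prefill) phase and
# the token-generation (decode) phase. Reuses model/tokenizer from the loading
# sketch and sample_gpu_power() from the telemetry sketch (both are assumptions).
import threading, time, torch

def run_and_meter(fn, interval_s=0.1):
    """Run fn() while sampling total GPU power; return (elapsed_s, mean_power_w)."""
    samples, stop = [], threading.Event()
    def sampler():
        while not stop.is_set():
            samples.append(sum(sample_gpu_power()))
            time.sleep(interval_s)
    t = threading.Thread(target=sampler)
    t.start()
    start = time.time()
    fn()
    torch.cuda.synchronize()
    elapsed = time.time() - start
    stop.set()
    t.join()
    return elapsed, sum(samples) / max(len(samples), 1)

prompt = "Describe datacenter power oversubscription. " * 64     # long prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    prefill_s, prefill_w = run_and_meter(lambda: model(**inputs, use_cache=True))
    decode_s, decode_w = run_and_meter(lambda: model.generate(**inputs, max_new_tokens=128))

print(f"prefill: {prefill_s:.2f}s at ~{prefill_w:.0f} W")
print(f"decode:  {decode_s:.2f}s at ~{decode_w:.0f} W")
```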
4.2 Impact of Inference Configuration on Power
Figure 6: GPU power draw for different inference configurations (left) and phases (right).

We vary the inference configuration along several axes. Increasing the batch size (from 1 to 16) and the number of input tokens (up to 4096) raises power during the prompt phase (Figure 8c), while the number of output tokens mainly lengthens the lower-power token-generation phase (Figures 8d-8f).

Insight 5: LLM inference power depends strongly on the inference configuration. Longer prompts and larger batches increase power, whereas long outputs mostly grow the KV cache and the GPU memory footprint rather than power.
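To make Insight 5 concrete, the back-of-the-envelope calculation below estimates how the KV cache grows with sequence length and batch size. The layer, head, and dimension values are illustrative assumptions (roughly GPT-NeoX-20B-sized), not numbers reported in this paper.

```python
# Illustrative back-of-the-envelope estimate of KV-cache growth with sequence
# length. The model dimensions below are assumptions for illustration only.
layers    = 44          # transformer layers
kv_heads  = 64          # attention heads storing K/V
head_dim  = 96          # dimension per head
bytes_per = 2           # FP16

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # 2x for keys and values, per layer, per head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len * batch

for seq_len in (512, 2048, 4096):
    gib = kv_cache_bytes(seq_len, batch=16) / 2**30
    print(f"seq_len={seq_len:5d}, batch=16 -> KV cache ~{gib:.1f} GiB")
```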
We also profile newer models such as Llama2-70B run in FP16 across all GPUs of a server and observe the same qualitative behavior.

Insight 7: These power patterns are consistent across model families and precisions, so the headroom we identify is not specific to a single model.

Figure 9: Power draw over time for the BLOOM workload.
4.3 LLM Training Power

Figure 10a shows server power over the course of a training run. Training keeps the GPUs far busier than inference, yet power still sits below the provisioned peak: Figure 11 shows the distribution relative to GPU TDP, with power typically within about 20% of TDP and only brief excursions (about 7%) near the peak. Several factors keep training power below the maximum: GPUs spend a substantial fraction of time (around 60%) waiting on communication and synchronization; GPT-NeoX rarely reaches the 500 W GPU TDP, and BLOOM stays about 13% below it (Figures 1 and 4); some phases draw roughly 5% less power (Figure 10b, BLOOM); power varies across the GPUs of a server; and SM frequencies throttle before TDP is reached (Figure 10c).

Insight 8: Even LLM training, the most power-hungry GPU workload we study, leaves some headroom below the provisioned peak, although far less than inference.
In summary, our characterization in Section 4 shows that LLM workloads on GPUs differ from traditional CPU workloads in three ways that matter for power management: (1) inference runs well below the provisioned peak; (2) both training and inference tolerate moderate power caps, with caps of around 25% costing little and even caps of 37.5% costing only about 9% in performance for some workloads (Figure 3); and (3) training, unlike inference, runs closer to the peak and is a poorer target for oversubscription.
5 Power Management Opportunities for LLM Clusters

5.1 LLM Training

Our characterization (Section 4.1, Insights 2, 4, and 5) suggests several opportunities during training: jobs are long-running and homogeneous, so their power can be capped or shifted in time, and energy-aware training frameworks [71] could incorporate such policies directly.

Insight 9: Existing ML platforms and schedulers [1,2,56,64] do not manage power explicitly; LLM clusters need workload-aware power management to exploit the available headroom safely.

5.2 LLM Inference

GPU servers are provisioned for their nameplate peak; for example, a DGX-A100 is rated at 6,500 W [44], yet the highest draw we measure is about 5,700 W and idle servers draw around 800 W. This gap between provisioned and actual power is the opportunity that POLCA exploits, using both in-band and out-of-band (OOB) mechanisms.
6 POLCA: Power Oversubscription for LLM Inference

Guided by these insights, we build POLCA, a power oversubscription framework for GPU clusters that serve LLMs, such as those behind ChatGPT and Azure OpenAI. These clusters consist of InfiniBand-connected GPU servers [44] and must meet tight latency SLOs; inference optimizations such as phase splitting [49] are complementary to POLCA. POLCA adapts techniques from CPU-based datacenter power management [15,36,55,58] and recent GPU power studies [23,73].

6.2 Design

POLCA builds on prediction-based oversubscription frameworks for cloud platforms [17,19,31]. It (1) monitors server and rack power, (2) predicts near-term power draw, and (3) applies capping only when a budget would otherwise be violated, following the approach of [31] but using GPU-specific mechanisms. Because LLM inference leaves headroom (Insight 9), such events are rare.

6.3 Power Management Actions

When intervention is needed, POLCA can, for example, reduce GPU frequency, cap GPU power, or pause the admission of new inference requests.
We evaluate POLCA using production power traces from LLM inference clusters collected between June 21st and August 2nd, 2023, and by replaying representative workloads on our 8-GPU servers. POLCA predicts cluster power from the token-level request mix with a mean absolute percentage error (MAPE) of about 3% (Figure 16), and it leaves roughly 5% of capacity uncapped as a safety margin.
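For reference, we compute MAPE in the standard way; with P_i the measured and P̂_i the predicted power for interval i (notation introduced here only for exposition):

```latex
\mathrm{MAPE} \;=\; \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{P_i - \hat{P}_i}{P_i} \right|
```

A MAPE of 3% therefore means that, averaged over the trace, the predicted cluster power deviates from the measured power by 3%.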
POLCA uses two power thresholds. When measured power crosses the first threshold (T1), POLCA applies a GPU frequency cap (for example, locking the A100 GPUs to 1275 MHz), which is fast and minimally disruptive; when power keeps rising and crosses the second threshold (T2), POLCA additionally applies power caps through the out-of-band (OOB) path, which takes effect within tens of seconds (about 40 s). We sweep threshold pairs covering 75-85%, 80-89%, and 85-95% of provisioned power, and frequency caps of 1110 MHz and 1305 MHz, and measure the impact on throughput and latency SLOs (Figure 12). Aggressive settings (85-95% with low frequency caps) recover the most headroom (32.5-35% between T1 and T2) but risk SLO violations for high-priority (HP) workloads, whereas conservative settings leave capacity unused. Based on this sensitivity study (and Insight 7), POLCA uses T1 = 80% and T2 = 89% of provisioned power at the PDU level, which keeps the worst-case performance impact small (around 11%, Figure 4) while protecting low-priority (LP) traffic from starvation.
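The sketch below illustrates this two-threshold control loop. It is a simplified rendering of the mechanism described above, not POLCA's production implementation, and it reuses the hypothetical sample_gpu_power() and set_power_cap() helpers from the earlier sketches.

```python
# Illustrative sketch of a two-threshold power controller: lock GPU clocks when
# power crosses T1, add power caps when it crosses T2, and release both once
# power falls back below T1. Not POLCA's production code; reuses the hypothetical
# sample_gpu_power() / set_power_cap() helpers from the earlier sketches.
import subprocess
import time

PROVISIONED_W = 6500 * 4            # e.g., a PDU feeding four 6.5 kW servers
T1, T2 = 0.80, 0.89                 # thresholds as fractions of provisioned power
FREQ_CAP_MHZ = 1275                 # A100 SM clock cap applied at T1

def lock_gpu_clocks(mhz):
    for gpu in range(8):
        if mhz is None:
            subprocess.run(["nvidia-smi", "-i", str(gpu), "-rgc"], check=True)
        else:
            subprocess.run(["nvidia-smi", "-i", str(gpu),
                            "-lgc", f"{mhz},{mhz}"], check=True)

while True:
    power = sum(sample_gpu_power())          # stand-in for rack/PDU telemetry
    if power > T2 * PROVISIONED_W:
        lock_gpu_clocks(FREQ_CAP_MHZ)
        for gpu in range(8):
            set_power_cap(gpu, 0.8)           # stronger action above T2
    elif power > T1 * PROVISIONED_W:
        lock_gpu_clocks(FREQ_CAP_MHZ)         # lighter action above T1
    else:
        lock_gpu_clocks(None)                 # release caps when safe
        for gpu in range(8):
            set_power_cap(gpu, 1.0)
    time.sleep(1.0)
```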
The OOB path applies GPU power caps through the server BMC using the SMBPBI interface [42], so it works even when the host is unresponsive and requires no coordination with the inference software. Figure 13 shows the end-to-end behavior: with POLCA, the cluster hosts 30% more servers while latency SLOs are maintained. Figure 15a breaks down how often the T1 frequency cap (1275 MHz on the A100) is engaged; low-priority (LP) requests absorb most of the slowdown.

6.6 End-to-End Results

We replay two representative workloads from Section 4.2, including BLOOM-176B inference, against the production power traces. Figure 16 shows that POLCA safely accommodates 30% more servers: (1) capping events are rare and short; and (2) even for BLOOM-176B, the largest model we study, performance remains within the SLO.
Figure 12: Sensitivity of recovered power and SLO violations to the (T1, T2) thresholds. Figure 13: Hosting 30% more servers with POLCA. Figure 14: POLCA's impact on LLM inference latency (within 5%). Figure 15: Frequency- and power-capping events over time. Figure 17: P99 latency against the SLO for each configuration.

We also compare POLCA against a single-threshold baseline set at 89% and against a variant that caps only low-priority requests; POLCA meets the P99 latency SLO in all cases (Figure 17).
Figure 16: Predicted versus measured power for the BLOOM workload. Figure 17: POLCA hosts 30% more servers within the same provisioned power.

6.7 Discussion

POLCA's mechanisms generalize beyond our clusters: any GPU fleet with power telemetry and either in-band or OOB capping support can adopt them.

7 Related Work

Power management for deep learning has been studied at the level of individual jobs and accelerators, including DVFS and batching for DNN inference [11,40,41,71], datacenter power and reliability characterization [22], and datacenter power capping [28]. GPU cluster schedulers focus on utilization rather than power [24,29,33], while other work characterizes the energy-performance behavior of ML serving [27,72] and GPU power mechanisms, frequency scaling, and variability [52,54,65]. In contrast, POLCA targets power oversubscription for GPU clusters serving LLMs.
Workload-aware resource management in cloud ML platforms [1,6,56] is complementary to POLCA.

8 Conclusion

LLMs are rapidly becoming the dominant datacenter GPU workload, and power is a binding constraint on their growth: newer systems such as the NVIDIA DGX-A100 (6U, 6.5 kW TDP) and DGX-H100 (8U, over 10 kW) draw even more power per server. We characterized the power behavior of LLM training and inference, showed that inference clusters leave substantial headroom, and presented POLCA, which exploits this headroom to host roughly 30% more servers in existing datacenters with minimal performance impact.

Acknowledgments

We thank our shepherd, Lieven Eeckhout, and the anonymous reviewers for their feedback. Pratyush Patel was supported in part by NSF grant CNS-2104548 and a gift from VMware.
[35] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. In VLDB, 2020.
[36] Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, Daniel D. Heimsoth, Saugata Ghose, and Onur Mutlu. A Scalable Priority-aware Approach to Managing Data Center Server Power. In HPCA, 2019.
[37] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
[38] Meta. Introducing the AI Research SuperCluster, Meta's Cutting-edge AI Supercomputer for AI Research. https://ai.facebook.com/blog/ai-rsc/.
[39] Microsoft. DeepSpeed: Model Implementations for Inference (MII). https://github.com/microsoft/DeepSpeed-MII.
[40] Seyed Morteza Nabavinejad, Sherief Reda, and Masoumeh Ebrahimi. Coordinated Batching and DVFS for DNN Inference on GPU Accelerators. TPDS, 2022.
[41] Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In MASCOTS, 2017.
[42] NVIDIA. Data Center GPU Driver. https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_450_v1.pdf.
[43] NVIDIA. Data Center GPU Manager (DCGM). https://developer.nvidia.com/dcgm.
[44] NVIDIA. DGX A100: The Universal System for AI Infrastructure. https://resources.nvidia.com/en-us-dgx-systems/dgx-ai.
[45] NVIDIA. DGX H100. https://www.nvidia.com/en-us/data-center/dgx-h100/.
[46] NVIDIA. NVIDIA A100 80GB PCIe GPU Product Brief. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf.
[47] NVIDIA. System Management Interface (nvidia-smi). https://developer.nvidia.com/nvidia-system-management-interface.
[48] OpenAI. Scaling Kubernetes to 7,500 Nodes. https://openai.com/research/scaling-kubernetes-to-7500-nodes.
[49] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. arXiv preprint arXiv:2311.18677, 2023.
[50] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. POLCA: Power Oversubscription in LLM Cloud Providers. arXiv preprint arXiv:2308.12908, 2023.
[51] Pratyush Patel, Zibo Gong, Syeda Rizvi, Esha Choukse, Pulkit Misra, Thomas Anderson, and Akshitha Sriraman. Towards Improved Power Management in Cloud GPUs. In CAL, 2023.
[52] Tapasya Patki, Zachary Frye, Harsh Bhatia, Francesco Di Natale, James Glosli, Helgi Ingolfsson, and Barry Rountree. Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow. In WORKS, 2019.
[53] David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer, 2022.
[54] Martin Peres. Reverse Engineering Power Management on NVIDIA GPUs - A Detailed Overview. In XDC, 2013.
[55] Pavlos Petoumenos, Lev Mukhanov, Zheng Wang, Hugh Leather, and Dimitrios S. Nikolopoulos. Power Capping: What Works, What Does Not. In ICPADS, 2015.
[56] Google Cloud Platform. Vertex AI. https://cloud.google.com/vertex-ai, 2023.
[57] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019.
[58] Parthasarathy Ranganathan, Phil Leech, David Irwin, and Jeffrey Chase. Ensemble-level Power Management for Dense Blade Servers. In ISCA, 2006.
[59] Tirias Research. Why Your AI Infrastructure Needs Both Training and Inference, 2019.
[60] Philipp Schmid. Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers. https://www.philschmid.de/fine-tune-flan-t5-deepspeed.
[61] Amazon Web Services. Amazon EC2 Update: Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing. https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/.
[62] Amazon Web Services. AWS Trainium: High-performance Machine Learning Training Accelerator, Purpose Built by AWS. https://aws.amazon.com/machine-learning/trainium/.
[63] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.
[64] Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, and Mark Russinovich. Singularity: Planet-scale, Preemptive and Elastic Scheduling of AI Workloads. arXiv preprint arXiv:2202.07848, 2022.
[65] Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair, and Shivaram Venkataraman. Not All GPUs Are Created Equal: Characterizing Variability in Large-scale, Accelerator-rich Systems. In SC, 2022.
[66] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
[67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In NeurIPS, 2017.
[68] Lan Vu, Hari Sivaraman, and Rishi Bidarkar. GPU Virtualization for High Performance General Purpose Computing on the ESX Hypervisor. In HPC, 2014.
[69] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art Natural Language Processing. In EMNLP, 2020.
[70] BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, and Colin Raffel. BLOOM: A 176B-Parameter Open-access Multilingual Language Model. arXiv preprint arXiv:2211.05100, 2022.
[71] Jie You, Jae-Won Chung, and Mosharaf Chowdhury. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In NSDI, 2023.
[72] Junyeol Yu, Jongseok Kim, and Euiseong Seo. Know Your Enemy To Save Cloud Energy: Energy-Performance Characterization of Machine Learning Serving. In HPCA, 2023.
[73] Chaojie Zhang, Alok Gautam Kumbhare, Ioannis Manousakis, Deli Zhang, Pulkit A. Misra, Rod Assis, Kyle Woolcock, Nithish Mahalingam, Brijesh Warrier, David Gauthier, Lalu Kunnath, Steve Solomon, Osvaldo Morales, Marcus Fontoura, and Ricardo Bianchini. Flex: High-Availability Datacenters With Zero Reserved Power. In ISCA, 2021.
[74] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.