
Experiences and Lessons Learned in

Operating GPU-Based HPC Systems


陳政宇
Senior Technical Manager, TWSC (台智雲)
Apr. 26, 2024

TWNOG 5.0
Outline
● TWSC & TWCC Background

● AI-HPC and Cloud Hybrid Architecture Design

● Experiences and Lessons Learned

● GenAI Solution for Training and Inference


TWSC: OUR STORY
TWSC grew out of Taiwania 2, a supercomputer that the Ministry of Science and Technology spent four years and NT$5 billion building from 2017 onward, and that once ranked among the world's top 20 supercomputers.

TWSC is Taiwan's first cloud service operator to offer high-speed cloud computing and massive storage with in-country residency for sensitive data, national sovereignty management, and forward-looking AI application development.

Through the national TWCC (Taiwan AI Cloud) platform data center, it provides the AI application services and cloud architecture solutions that industries need for their digital development.

Taiwania systems on the Top500:
● 2018: Taiwania 2 (AIHPC) — #20 of Top500 (2018/11 list)
● 2020: Taiwania 3 (HPC) — #181 of Top500 (2021/6 list)
● 2023: Taiwania 4 (HPC) — #222 of Top500 (2023/11 list)
● Next AIHPC: incoming
Taiwan Computing Cloud for AI

HPC – Taiwania 2
● 252 nodes / 2016 V100 GPUs (Ranked 20th of Top500; Ranked 10th)
● 9 Nvidia DGX H100 (new in 2023)
● 10 PB parallel file system
● EDR InfiniBand 100 Gbps
● 1.2 PUE (warm-water cooling)

Software Environment
● Slurm / Kubernetes
● Openstack
● Nvidia NGC Docker images
● Ceph (Object & Block)
● Spectrum Scale (GPFS)

HPC compute node
● Intel Xeon Gold CPU x 2
● 768 GB memory
● 240 GB SSD + 4 TB NVMe
● Nvidia Tesla V100 w/ 32 GB x 8
● EDR InfiniBand 100 Gbps x 4
● Dual-port 10 Gb Ethernet

MPI / AI Framework
● OpenMPI / Intel oneAPI
● TensorFlow / PyTorch
● Nvidia NGC images
● …and more
Current Dev. for Day 0
Cluster Planning

Day 0 Op Day 1 Op Day 2 Op

Cluster Planning
● Taiwan GPU Cloud: 10 Nvidia DGX V100 16GB

● Timeline
○ 2017/12/15 Project initiation
○ 2018/2/1 TWGC v0.3
○ 2018/5/1 TWCC initiation
○ 2018/9/1 TWGC v0.5
○ 2018/11/13 Top500 Ranked 20th
○ 2019/1/3 Beta Test
○ 2019/6/1 Officially Online
Cluster Planning
● A simple AI pipeline

1. Data Input → 2. Cleaning & extraction → 3. Model Training → 4. Fine-tuning → 5. Optimization → 6. Deployment
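The six stages above can be sketched as composable steps; the stage functions below are placeholders for illustration, not TWCC components.

```python
# Minimal sketch of the six-stage AI pipeline as a chain of callables.
# Each stage takes the previous stage's output; the lambdas are stand-ins.
from typing import Any, Callable

def run_pipeline(data: Any, stages: list[Callable[[Any], Any]]) -> Any:
    """Feed data through each stage in order and return the final result."""
    for stage in stages:
        data = stage(data)
    return data

stages = [
    lambda d: d,                     # 1. data input
    lambda d: [x for x in d if x],   # 2. cleaning & extraction (drop falsy rows)
    lambda d: {"model": d},          # 3. model training (placeholder)
    lambda d: d,                     # 4. fine-tuning
    lambda d: d,                     # 5. optimization
    lambda d: d,                     # 6. deployment
]
print(run_pipeline([1, 0, 2], stages))
```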
Cluster Planning
● Taiwan Computing Cloud

Big Data (Cloud, Hadoop, Spark)
● Uses CPUs to run computing jobs
● High I/O throughput (read and write)
● 3Vs (Volume, Velocity, Variety)

(Pipeline: 1. Data Input → 2. Cleaning & extraction → 3. Model Training → 4. Fine-tuning → 5. Optimization → 6. Deployment)

Cloud AI Training (AIHPC / GPU Container)
● On-demand cloud service
● Reliable infrastructure
● Cyber security
● Interactive interfaces
● GPU-accelerated computing
● Write once, read many; many small files (e.g. images)
● Large-scale training jobs
Current Dev. for Day1
The birth of TWCC

Day 0 Op Day 1 Op Day 2 Op


TWCC Software Stack — Cloud

User Portal / Admin Portal / API / CLI
● VCS (Virtual Compute Service)

SaaS
● x2GO, Jupyter, TensorBoard, Caffe, Theano, TensorFlow, …

PaaS / Resource Management
● Monitoring, Quota
● Template / pre-loaded software / ML frameworks
● Storage (S3) / SDN / Security

Infrastructure
● Virtual Machine (Openstack) | Container (Kubernetes) | MPI (Singularity)

OS / Infrastructure Software
● Automation, Bare metal, Monitoring, Intelligent Resource Manager

Hardware
● CPU, GPU (V100/A100/H100)
● Cloud Storage (Object & Block), Parallel FS, Tape Library
● InfiniBand, Ethernet (SDN)

Overarching systems
● Infrastructure Admin, Service Admin, Billing, Reporting, IAM, User Management, Analytics
TWCC Software Stack — Cloud + GPU Container
● Same stack as above, adding CCS (Container Compute Service).
TWCC Software Stack — Cloud + GPU Container + AIHPC
● Same stack as above, adding the HPC (High-Performance Computing) service.
Example Case: Highway speed real-time prediction

✓ Over 1,500 cameras on Taiwan highways
✓ Massive training
✓ Real-time computing
✓ Big Data / AI hybrid system
Lessons learned in Hybrid-Architecture
● Performance considerations
○ HPC
■ NUMA
■ IB Topology
○ Kubernetes
■ NUMA
■ SR-IOV
○ Openstack
■ NUMA
■ Passthrough A100/H100 to Openstack VM
● GPU Resource Management
● Security
HPC – NUMA
● GPU–CPU Affinity

[root@v100 slurm]# nvidia-smi topo -m
      ...  CPU Affinity  NUMA Affinity
GPU0  ...  0-17          0
GPU1  ...  0-17          0
GPU2  ...  0-17          0
GPU3  ...  0-17          0
GPU4  ...  18-35         1
GPU5  ...  18-35         1
GPU6  ...  18-35         1
GPU7  ...  18-35         1

[root@v100 slurm]# cat gres.conf
Name=gpu File=/dev/nvidia0 Cores=[0-17]
Name=gpu File=/dev/nvidia1 Cores=[0-17]
Name=gpu File=/dev/nvidia2 Cores=[0-17]
Name=gpu File=/dev/nvidia3 Cores=[0-17]
Name=gpu File=/dev/nvidia4 Cores=[18-35]
Name=gpu File=/dev/nvidia5 Cores=[18-35]
Name=gpu File=/dev/nvidia6 Cores=[18-35]
Name=gpu File=/dev/nvidia7 Cores=[18-35]

● CPU Isolation
○ Cores reserved for the OS, the parallel file system, monitoring, etc.

[root@v100 slurm]# cat slurm.conf
NodeName=v100 Gres=gpu:8 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 CpuSpecList=17,35 …
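Keeping gres.conf consistent with the NUMA layout reported by `nvidia-smi topo -m` can be scripted. A minimal sketch, assuming a hand-maintained GPU-to-core-range mapping rather than live hardware parsing (check your Slurm version's gres.conf syntax for the exact `Cores=` format):

```python
# Hypothetical helper: emit one gres.conf line per GPU from a GPU-index ->
# NUMA-local core-range mapping mirroring the node layout shown above
# (GPUs 0-3 on NUMA node 0, GPUs 4-7 on NUMA node 1).

def gres_lines(gpu_cores: dict[int, str]) -> list[str]:
    """Render Slurm gres.conf entries, one per GPU device file."""
    return [
        f"Name=gpu File=/dev/nvidia{idx} Cores={cores}"
        for idx, cores in sorted(gpu_cores.items())
    ]

mapping = {i: "0-17" for i in range(4)} | {i: "18-35" for i in range(4, 8)}
for line in gres_lines(mapping):
    print(line)
```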

HPC – InfiniBand Topology
● Defining the leaf switches and their nodes is important.

[root@v100 slurm]# cat topology.conf
…
SwitchName=IBISL12 Switches=IBISL12[1-3]
SwitchName=IBISL121 Switches=IBLF120[1-2],IBLF1207
SwitchName=IBISL122 Switches=IBLF120[3-4],IBLF1207
SwitchName=IBISL123 Switches=IBLF120[5-6],IBLF1207
SwitchName=IBLF1201 Nodes=gn12[01-05]
SwitchName=IBLF1202 Nodes=gn12[05-09]
…

[root@v100 slurm]# cat slurm.conf
…
TopologyPlugin=topology/tree
TopologyParam=TopoOptional
…

[root@v100 slurm]# srun env
…
5: SLURM_TOPOLOGY_ADDR=IBISL12.IBISL121.IBLF1201.gn1201
5: SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
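A topology.conf like the excerpt above is worth sanity-checking before reloading Slurm: each node should normally hang off exactly one leaf switch. A minimal sketch (the parser is an assumption for illustration, not a Slurm tool; the sample lines mirror the excerpt):

```python
# Hypothetical checker: expand Slurm-style bracketed host ranges and flag
# nodes that appear under more than one leaf switch in topology.conf.
import re
from collections import Counter

def expand(spec: str) -> list[str]:
    """Expand a bracketed range like gn12[01-05] into individual hostnames."""
    m = re.fullmatch(r"(.+)\[(\d+)-(\d+)\]", spec)
    if not m:
        return [spec]
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

def leaf_node_counts(topology_lines: list[str]) -> Counter:
    """Count how many leaf switches each node hangs off."""
    counts: Counter = Counter()
    for line in topology_lines:
        m = re.search(r"Nodes=(\S+)", line)
        if m:
            counts.update(expand(m.group(1)))
    return counts

conf = [
    "SwitchName=IBLF1201 Nodes=gn12[01-05]",
    "SwitchName=IBLF1202 Nodes=gn12[05-09]",
]
dup = [n for n, c in leaf_node_counts(conf).items() if c > 1]
print(dup)  # → ['gn1205'] — listed under two leaves, worth double-checking
```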
Kubernetes – NUMA

● GPU–CPU Affinity (kubelet config)
[root@v100 kubernetes]# cat config.yaml

featureGates:
  CPUManager: true
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
topologyManagerPolicy: best-effort

kubeReserved:
  cpu: 500m

Ref: https://kubernetes.io/blog/2020/04/01/kubernetes-1-18-feature-topoloy-manager-beta/
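One operational detail worth noting alongside the kubelet config above: the static CPU manager only grants exclusive cores to pods in the Guaranteed QoS class with integer CPU requests equal to their limits. A hedged sketch of such a pod (name and image tag are assumptions for illustration):

```yaml
# Hypothetical pod: Guaranteed QoS with integer CPU count, so the static
# CPU manager can pin it to dedicated, NUMA-local cores.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-trainer            # assumed name
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.03-py3   # assumed NGC image tag
    resources:
      requests:
        cpu: "16"                 # integer, equals the limit
        memory: 64Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "16"
        memory: 64Gi
        nvidia.com/gpu: 1
```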
K8S Over IB (SR-IOV)

● Multus network attachment (ib-sriov CNI):
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: twcc.ib/mlnx_sriov_ib
spec:
  config: '{
    "type": "ib-sriov",
    "ipam": {
      "type": "whereabouts",
      …
    },
    "rdmaIsolation": true,
    …

● SR-IOV device plugin ConfigMap:
kind: ConfigMap
…
"resourceList": [{
  "resourceName": "mlnx_sriov_ib",
  "selectors": {
    "pfNames": ["ib0", "ib1", "ib2", "ib3"],
    "linkTypes": ["infiniband"],
    "isRdma": true,
    "devices": ["1018"]
  }
}]

● Pod template requesting a VF:
template:
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks: sriov-conf-0@vib0
  …
  resources:
    requests:
      twcc.ib/mlnx_sriov_ib: 1
Openstack – NUMA

● /etc/nova/nova.conf
vcpu_pin_set=0-16,18-34   # cores 17 and 35 reserved for the host (cf. Slurm CpuSpecList)
enabled_filters=<...>,NUMATopologyFilter

● GPU–CPU Affinity
openstack flavor create --disk 100 --vcpus 14 --ram 186368 \
  --property aggregate_instance_extra_specs:pinned='true' \
  --property hw:cpu_policy='dedicated' \
  --property pci_passthrough:alias='V100:2' \
  --property hw:numa_nodes=2 \
  <flavour-name>
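It is easy to fat-finger a `vcpu_pin_set` so that it overlaps the host-reserved cores. A minimal sketch of a consistency check, assuming the same reserved cores (17 and 35) as the Slurm `CpuSpecList` shown earlier; the parser is an illustration, not a nova utility:

```python
# Hypothetical check: expand a nova-style CPU-set string and verify it stays
# disjoint from the cores reserved for the host OS and services.

def parse_cpu_set(spec: str) -> set[int]:
    """Expand a string like '0-16,18-34' into a set of core ids."""
    cores: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cores.update(range(lo, hi + 1))
        else:
            cores.add(int(part))
    return cores

reserved = {17, 35}                       # host-reserved cores (assumption)
pinnable = parse_cpu_set("0-16,18-34")
print(pinnable.isdisjoint(reserved))      # → True: guests never touch 17 or 35
```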
NVIDIA GPU Topology
● V100: GPU interconnect (NVLink) up to 300 GB/s
● A100: GPU interconnect (NVLink) up to 600 GB/s
● H100: GPU interconnect (NVLink) up to 900 GB/s
Passthrough A100/H100 to Openstack VM
● A100 NVSwitch Devices
c4:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c5:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c6:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c7:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c8:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c9:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)

● H100 NVSwitch Devices


07:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
08:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
09:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
0a:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)

● /etc/nova/nova.conf
[pci]
alias = { "vendor_id":"10de", "product_id":"20b0", "device_type":"type-PF", "name":"A100" }
alias = { "vendor_id":"10de", "product_id":"1af1", "device_type":"type-PCI", "name":"NVSWITCH" }
[pci]
alias = { "vendor_id":"10de", "product_id":"2330", "device_type":"type-PF", "name":"H100" }
alias = { "vendor_id":"10de", "product_id":"22a3", "device_type":"type-PCI", "name":"NVSWITCH" }
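When a new GPU generation arrives, the NVSwitch and GPU device IDs must be collected from `lspci -nn` and turned into nova `[pci]` aliases. A hedged sketch (the device-ID-to-alias table mirrors the slide; the parser is an assumption for illustration, `device_type` omitted):

```python
# Hypothetical helper: scan `lspci -nn` output for known NVIDIA (10de) device
# IDs and emit matching nova [pci] alias lines.
import re

DEVICE_ALIASES = {"1af1": "NVSWITCH", "22a3": "NVSWITCH",
                  "20b0": "A100", "2330": "H100"}

def aliases_from_lspci(lspci_nn: str) -> set[str]:
    """Return the set of nova alias lines implied by the lspci -nn text."""
    found = set()
    for m in re.finditer(r"\[10de:([0-9a-f]{4})\]", lspci_nn):
        dev = m.group(1)
        if dev in DEVICE_ALIASES:
            found.add(
                f'alias = {{ "vendor_id":"10de", "product_id":"{dev}", '
                f'"name":"{DEVICE_ALIASES[dev]}" }}'
            )
    return found

sample = "c4:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)"
print(aliases_from_lspci(sample))
```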
Fully Utilizing GPU Resources
TWCC ISO and Security Certifications

TWCC Services (platform overview for developers & users)

SaaS (Software as a Service)
● Application-oriented APIs; ML-based analytics / data visualization; AI cloud platform; multi-tenant space
● Application domains: precision medicine, environmental protection, smart city, Industry 4.0, fundamental science research
● User services: data/AI service portal, account management, virtual space, open/government data, LOD services

PaaS (Platform as a Service)
● HPC function-oriented APIs, AI/BD function-oriented APIs, API management
● Domain workloads: material science, bioinformatics, voice recognition, vision recognition, predictive maintenance, autopilot, text mining, engineering simulation, geoinformatics, face recognition, production automation, security detection, data analytics
● Data Platform: Hadoop, Impala, Spark, MapReduce, SQL processing, Hive/Pig, search engine
● Data Analytics Platform: Spark machine learning, batch/stream/analytic processing, data preparation
● ML/DNN platform: Caffe, TensorFlow, Torch, DIGITS, MXNET, Keras, …
● Shared data / models / modules, de-identification, ETL, model repository, marketplace, Data Hub/API

Cloud Management Platform
● Event & alarm management, dashboard & report management, backup, chargeback, template management, workflow management, HA
● NOC/SOC, resource management & monitoring, quota management, N+1, service-level management over OpenStack/Kubernetes/Slurm/Ceph
● Customer management: user profile, account & billing system

IaaS (Infrastructure as a Service)
● CPU/GPU resource management, SDN/NFV, monitoring, storage tiering/backup
● GPU cluster, CPU cluster, high-speed storage, object storage
● Operated with the NCHC / university computing centers
Current Dev. for Day 2
Generative AI

Day 0 Op Day 1 Op Day 2 Op

AFS: GenAI Solution for Training and Inference
TWSC AFS: AI Foundry Service
IndustrialGPT Solutions

Step 1. Upload training data (data collection)
Step 2. Deploy on premise* / on demand (self-managed deploy, or platform provided by TWSC)

NOTE: *On-premise deployment can be easily operated with the TWSC on-premise AFS Appliance.

01 Full Control   02 No Code
03 Formosa LLM    04 Wide Adoption

Standing on Formosa LLM
● Pursue excellence: AFS ModelSpace
● Utilize CCS with its InfiniBand capability
● Fast deployment of proprietary LLMs

Implementing a GPT-3-level LLM on TAIWANIA 2
With 176 billion parameters, the model cannot be trained on any single GPU; it requires precise model partitioning and efficient distributed training.
■ Training can achieve linear speedup.
■ Training and inference of the 176-billion-parameter model can be run on TWCC.
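A back-of-envelope calculation shows why a single GPU is out of the question and how partitioning helps. The parallelism degrees below (tensor-parallel 8 × pipeline 12) are illustrative assumptions, and the estimate counts fp16 weights only, excluding optimizer state and activations:

```python
# Rough memory estimate for a 176B-parameter model: fp16 weights alone far
# exceed any single GPU's memory, but sharding across 96 GPUs brings the
# per-GPU weight footprint down to a few GiB.

def weights_gib(params: float, bytes_per_param: int = 2) -> float:
    """GiB needed just to hold the weights at the given precision."""
    return params * bytes_per_param / 2**30

total = weights_gib(176e9)      # ~328 GiB of fp16 weights
per_gpu = total / (8 * 12)      # assumed: tensor-parallel 8 x pipeline 12 = 96 shards
print(round(total), round(per_gpu, 1))
```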

Implementing a GPT-3-level LLM on Taiwania 2


Thanks!
Do you have any questions?

[email protected]
https://www.twsc.io
