Speaker -A03- 5704- Experiences and Lessons Learned in Operating GPU-Based HPC Systems
TWNOG 5.0
Outline
• TWSC & TWCC Background
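Taiwan's first cloud service operator to offer cloud high-performance computing and massive storage with in-country residency for sensitive data, national sovereignty management, and forward-looking AI application development.
Through the national-level TWCC (Taiwan AI cloud platform) data center, it provides the AI application services and cloud architecture solutions that industries need for their digital development.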
AIHPC (incoming)
HPC - Taiwania2
• 252 nodes / 2016 V100 GPUs
• 9 Nvidia DGX H100 (New in 2023)
• 10 PB Parallel file system
• EDR InfiniBand 100 Gbps
• 1.2 PUE (Warm Water Cooling)
• Ranked 20th
• Ranked 10th
Software Environment
• Slurm / Kubernetes
• Openstack
• Nvidia NGC Docker Images
• Ceph (Object & Block)
• Spectrum Scale (GPFS)
1. Data Input
2. Cleaning & extraction
3. Model Training
4. Fine-tuning
5. Optimization
6. Deployment
Cluster Planning
• Taiwan Computing Cloud
Big Data (Cloud, Hadoop, Spark)
• Using CPU to run computing jobs
• High I/O Throughput (Read and Write)
• 3Vs (Volume, Velocity, Variety)
API / CLI
VCS (Virtual Compute Service)
SaaS
PaaS / Resource Management
• Template / Pre-load Software / ML Framework
• Virtual Machine, Container
OS / Infrastructure Software
• Openstack, Kubernetes, Singularity
• Automation, Bare metal, Monitoring, Intelligent Resource Manager
Hardware
• Cloud Storage (Object & Block), Parallel FS, Tape Library
• IB, Ethernet (SDN)
Overarching systems
• Monitoring, Quota, Infrastructure Admin, Service Admin, Billing, Reporting, IAM, User Management, Analytics
TWCC Software Stack Cloud GPU Container
CCS (Container Compute Service)
TWCC Software Stack Cloud GPU Container AIHPC
HPC (High-Performance Computing)
Example Case: Real-Time Highway Speed Prediction
● CPU Isolation
○ Reserved for OS, Parallel File System, Monitoring etc.
[root@v100 slurm]# cat slurm.conf
…
NodeName=v100 Gres=gpu:8 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 CpuSpecList=17,35 …
…
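The effect of CpuSpecList can be checked with a quick calculation. The following is a minimal sketch, assuming CPU IDs run from 0 to 35 on this node (the real numbering depends on the hardware and on how Slurm enumerates CPUs). The helper name `usable_cores` is hypothetical and not part of Slurm.

```python
def usable_cores(sockets, cores_per_socket, threads_per_core, cpu_spec_list):
    """Return the CPU IDs left for jobs after removing the reserved ones."""
    total = sockets * cores_per_socket * threads_per_core
    reserved = {int(cpu) for cpu in cpu_spec_list.split(",")}
    return [cpu for cpu in range(total) if cpu not in reserved]


# Values taken from the slurm.conf line above.
job_cpus = usable_cores(sockets=2, cores_per_socket=18, threads_per_core=1,
                        cpu_spec_list="17,35")
print(len(job_cpus))       # 34 CPUs remain for jobs
print(len(job_cpus) // 8)  # 4 whole CPUs per GPU with Gres=gpu:8
```

Under this assumed numbering, the two reserved CPUs stay with the OS, parallel file system, and monitoring daemons, and 34 CPUs remain for jobs.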
HPC - Infiniband Topology
● Defining the leaf switches and the nodes attached to them correctly is important.
[root@v100 slurm]# cat topology.conf
…
SwitchName=IBISL12 Switches=IBISL12[1-3]
SwitchName=IBISL121 Switches=IBLF120[1-2],IBLF1207
SwitchName=IBISL122 Switches=IBLF120[3-4],IBLF1207
SwitchName=IBISL123 Switches=IBLF120[5-6],IBLF1207
SwitchName=IBLF1201 Nodes=gn12[01-05]
SwitchName=IBLF1202 Nodes=gn12[05-09]
…
[root@v100 slurm]# cat slurm.conf
…
TopologyPlugin=topology/tree
TopologyParam=TopoOptional
…
[root@v100 slurm]# srun "env"
…
5: SLURM_TOPOLOGY_ADDR=IBISL12.IBISL121.IBLF1201.gn1201
5: SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
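A job can read SLURM_TOPOLOGY_ADDR to see where it landed in the switch tree. The sketch below splits the address and finds the lowest switch two nodes share. The function names and the second address are hypothetical, chosen only to be consistent with the topology.conf excerpt.

```python
def switch_path(topology_addr):
    """Split a SLURM_TOPOLOGY_ADDR value into (list of switches, node name)."""
    *switches, node = topology_addr.split(".")
    return switches, node


def shared_switch(addr_a, addr_b):
    """Return the lowest-level switch both nodes sit under, or None."""
    path_a, _ = switch_path(addr_a)
    path_b, _ = switch_path(addr_b)
    common = None
    for sw_a, sw_b in zip(path_a, path_b):
        if sw_a != sw_b:
            break
        common = sw_a
    return common


a = "IBISL12.IBISL121.IBLF1201.gn1201"  # from the srun output above
b = "IBISL12.IBISL121.IBLF1202.gn1206"  # hypothetical second node
print(shared_switch(a, a))  # IBLF1201
print(shared_switch(a, b))  # IBISL121
```

Nodes that share a leaf switch are one hop apart. Nodes that only share a higher-level switch send traffic across more hops, which is why the leaf-to-node mapping matters for multi-node GPU jobs.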
Kubernetes - NUMA
● GPU-CPU Affinity
[root@v100 kubernetes]# cat config.yaml
…
featureGates:
  CPUManager: true
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
topologyManagerPolicy: best-effort
…
kubeReserved:
  cpu: 500m
…
Ref: https://kubernetes.io/blog/2020/04/01/kubernetes-1-18-feature-topoloy-manager-beta/
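With the static CPU manager policy, only containers in Guaranteed QoS pods that request a whole number of CPUs receive exclusive cores. Everything else stays in the shared pool. The sketch below is a simplified check for a single-container pod. The function names are hypothetical, memory values are compared as plain strings, and requests and limits are both expected to be written out explicitly.

```python
from fractions import Fraction


def cpu_cores(quantity):
    """Convert a Kubernetes CPU quantity such as "2" or "500m" to cores."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def gets_exclusive_cpus(resources):
    """True if a single-container pod would get pinned cores (simplified)."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for key in ("cpu", "memory"):
        if key not in requests or key not in limits:
            return False
    same_cpu = cpu_cores(requests["cpu"]) == cpu_cores(limits["cpu"])
    same_mem = requests["memory"] == limits["memory"]
    whole_cpus = cpu_cores(requests["cpu"]).denominator == 1
    return same_cpu and same_mem and whole_cpus


pinned = {"requests": {"cpu": "4", "memory": "32Gi"},
          "limits": {"cpu": "4", "memory": "32Gi"}}
shared = {"requests": {"cpu": "500m", "memory": "1Gi"},
          "limits": {"cpu": "500m", "memory": "1Gi"}}
print(gets_exclusive_cpus(pinned))  # True
print(gets_exclusive_cpus(shared))  # False
```

A GPU pod that should benefit from the GPU-CPU affinity settings above therefore needs equal requests and limits and an integer CPU count.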
K8S Over IB (SR-IOV)
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: twcc.ib/mlnx_sriov_ib
spec:
  config: '{
    "type": "ib-sriov",
    "ipam": {
      "type": "whereabouts",
    …
    "rdmaIsolation": true,
…
kind: ConfigMap
…
"resourceList": [{
  "resourceName": "mlnx_sriov_ib",
  "selectors": {
    "pfNames": ["ib0", "ib1", "ib2", "ib3"],
    "LinkTypes": ["infiniband"],
    "isRdma": true,
    "devices": ["1018"]
…
template:
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks: sriov-conf-0@vib0
…
resources:
  requests:
    twcc.ib/mlnx_sriov_ib: 1
Openstack - NUMA
● /etc/nova/nova.conf
vcpu_pin_set=0-16,17-34
enabled_filters=<...>,NUMATopologyFilter
● GPU-CPU Affinity
openstack flavor create --disk 100 --vcpus 14 --ram 186368 \
--property aggregate_instance_extra_specs:pinned='true' \
--property hw:cpu_policy='dedicated' \
--property pci_passthrough:alias='V100:2' \
--property hw:numa_nodes=2 \
<flavour-name>
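When hw:numa_nodes is set without explicit per-node CPU or memory lists, Nova divides the flavor's vCPUs and RAM evenly across the guest NUMA nodes. The arithmetic for the flavor above can be sketched as follows. The helper `per_numa_node` is hypothetical and is not an OpenStack API.

```python
def per_numa_node(vcpus, ram_mb, numa_nodes):
    """Return (vCPUs, RAM in MB) assigned to each guest NUMA node."""
    if vcpus % numa_nodes or ram_mb % numa_nodes:
        raise ValueError("vCPUs and RAM must divide evenly across NUMA nodes")
    return vcpus // numa_nodes, ram_mb // numa_nodes


# Values from the flavor definition above.
cpus, ram = per_numa_node(vcpus=14, ram_mb=186368, numa_nodes=2)
print(cpus, ram)    # 7 93184
print(ram // 1024)  # 91 (GiB per guest NUMA node)
```

Each guest NUMA node therefore gets 7 dedicated vCPUs and 91 GiB of RAM.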
NVIDIA GPU Topology
V100 GPU A100 GPU H100 GPU
● /etc/nova/nova.conf
[pci]
alias = { "vendor_id":"10de", "product_id":"20b0", "device_type":"type-PF", "name":"A100" }
alias = { "vendor_id":"10de", "product_id":"1af1", "device_type":"type-PCI", "name":"NVSWITCH" }
[pci]
alias = { "vendor_id":"10de", "product_id":"2330", "device_type":"type-PF", "name":"H100" }
alias = { "vendor_id":"10de", "product_id":"22a3", "device_type":"type-PCI", "name":"NVSWITCH" }
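Each alias value is a JSON object, so it can be sanity-checked before Nova is restarted. Typographic quotes pasted from slides, for example, would not parse. The sketch below is a hypothetical pre-flight check: `parse_alias` and the `REQUIRED` key set mirror the keys used on this slide, not Nova's own validation rules.

```python
import json

# Keys that appear in the aliases on this slide (an assumption, not Nova's schema).
REQUIRED = {"vendor_id", "product_id", "device_type", "name"}


def parse_alias(value):
    """Parse one alias value and check that the expected keys are present."""
    alias = json.loads(value)  # raises ValueError on malformed JSON
    missing = REQUIRED - alias.keys()
    if missing:
        raise ValueError(f"alias is missing keys: {sorted(missing)}")
    return alias


a100 = parse_alias('{ "vendor_id":"10de", "product_id":"20b0", '
                   '"device_type":"type-PF", "name":"A100" }')
# The alias name is what a flavor refers to, e.g. to request one device:
print(f"pci_passthrough:alias='{a100['name']}:1'")
```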
Fully Utilized GPU Resource
TWCC ISO And Security Certifications
TWCC Services
[Figure: TWCC service architecture]
Developers & Users
• Data Service, AI Service, User Service Portal, Account management
SaaS (Software as a Service): Application oriented APIs
• Precision medical, Env protection science, Smart city, Industry 4.0, Fundamental research
• ML based Analytics/Data Visualization, AI Cloud Platform, Multi-tenant Management
• Open/Government data/LOD Services
• Material Science, Bioinformatics, voice recognition, vision recognition, Predictive maintenance, Autopilot
• Engineering simulation, Geo informatics, face recognition, Production automation, Security Detection, text mining, Data analytics
PaaS (Platform as a Service): PaaS module & ML/DNN
• Model Repository, HPC Function oriented APIs, AI/BD Function oriented APIs, API Management
• Marketplace: Shared data, Shared models, Shared modules
• Data Hub/API: De-identity, ETL
• Data and analytics: Hadoop, MapReduce, Impala, Spark, SQL, Hive/Pig, Search Engine, batch, analytic, and stream processing
• ML/DNN frameworks: Caffe, TensorFlow, Torch, DIGITS, MXNET, Keras, …
• Data Platform, Data Analytics Platform, Machine Learning Platform, Data Preparation Platform
• Self Mgmt. Deploy, Data Collection
NOTE: Premise* deployment can be easily operated in the TWSC on-premise AFS Appliance.
[email protected]
https://www.twsc.io