
Experiences and Lessons Learned in

Operating GPU-Based HPC Systems


陳政宇
Senior Technical Manager, TWSC (台智雲)
Apr. 26, 2024

TWNOG 5.0
Outline
● TWSC & TWCC Background

● AI-HPC and Cloud Hybrid Architecture Design

● Experiences and Lessons Learned

● GenAI Solution for Training and Inference


TWSC: OUR STORY
TWSC grew out of Taiwania 2, a supercomputer that the Ministry of Science and Technology spent four years and NT$5 billion building from 2017 onward, and that once ranked among the world's top 20 supercomputers.

TWSC is Taiwan's first cloud service operator to offer high-speed cloud computing and massive storage with in-country residency for sensitive data, national sovereignty management, and forward-looking AI application development.

Through the national TWCC (Taiwan AI Cloud) platform data center, it provides the AI application services and cloud architecture solutions that industries need for their digital development.

Taiwania systems on the Top500:
● 2018: Taiwania 2 (AIHPC) — #20 of Top500 (2018/11 list)
● 2020: Taiwania 3 (HPC) — #181 of Top500 (2021/6 list)
● 2023: Taiwania 4 (HPC) — #222 of Top500 (2023/11 list)
● Next AIHPC: incoming
Taiwan Computing Cloud for AI

HPC – Taiwania 2
● 252 nodes / 2016 V100 GPUs (Ranked 20th of Top500; Ranked 10th)
● 9 Nvidia DGX H100 (new in 2023)
● 10 PB parallel file system
● EDR InfiniBand 100 Gbps
● 1.2 PUE (warm-water cooling)

Software Environment
● Slurm / Kubernetes
● Openstack
● Nvidia NGC Docker images
● Ceph (Object & Block)
● Spectrum Scale (GPFS)

HPC compute node
● Intel Xeon Gold CPU x 2
● 768 GB memory
● 240 GB SSD + 4 TB NVMe
● Nvidia Tesla V100 w/ 32 GB x 8
● EDR InfiniBand 100 Gbps x 4
● Dual-port 10 Gb Ethernet

MPI / AI Framework
● OpenMPI / Intel oneAPI
● TensorFlow / PyTorch
● Nvidia NGC images
● …and more
Current Dev. for Day 0
Cluster Planning

Day 0 Op Day 1 Op Day 2 Op

Cluster Planning
● Taiwan GPU Cloud: 10 Nvidia DGX V100 16GB

● Timeline
○ 2017/12/15 Project initiation
○ 2018/2/1 TWGC v0.3
○ 2018/5/1 TWCC initiation
○ 2018/9/1 TWGC v0.5
○ 2018/11/13 Top500 Ranked 20th
○ 2019/1/3 Beta Test
○ 2019/6/1 Officially Online
Cluster Planning
● A simple AI pipeline

1. Data Input → 2. Cleaning & extraction → 3. Model Training → 4. Fine-tuning → 5. Optimization → 6. Deployment
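The six stages above can be sketched as composable steps; the stage functions below are placeholders for illustration, not TWCC components.

```python
# Minimal sketch of the six-stage AI pipeline as a chain of callables.
# Each stage takes the previous stage's output; the lambdas are stand-ins.
from typing import Any, Callable

def run_pipeline(data: Any, stages: list[Callable[[Any], Any]]) -> Any:
    """Feed data through each stage in order and return the final result."""
    for stage in stages:
        data = stage(data)
    return data

stages = [
    lambda d: d,                     # 1. data input
    lambda d: [x for x in d if x],   # 2. cleaning & extraction (drop falsy rows)
    lambda d: {"model": d},          # 3. model training (placeholder)
    lambda d: d,                     # 4. fine-tuning
    lambda d: d,                     # 5. optimization
    lambda d: d,                     # 6. deployment
]
print(run_pipeline([1, 0, 2], stages))
```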
Cluster Planning
● Taiwan Computing Cloud

Big Data (Cloud, Hadoop, Spark)
● Uses CPUs to run computing jobs
● High I/O throughput (read and write)
● 3Vs (Volume, Velocity, Variety)

(Pipeline: 1. Data Input → 2. Cleaning & extraction → 3. Model Training → 4. Fine-tuning → 5. Optimization → 6. Deployment)

Cloud AI Training (AIHPC / GPU Container)
● On-demand cloud service
● Reliable infrastructure
● Cyber security
● Interactive interfaces
● GPU-accelerated computing
● Write once, read many; many small files (e.g. images)
● Large-scale training jobs
Current Dev. for Day1
The birth of TWCC

Day 0 Op Day 1 Op Day 2 Op


TWCC Software Stack — Cloud

User Portal / Admin Portal / API / CLI
● VCS (Virtual Compute Service)

SaaS
● x2GO, Jupyter, TensorBoard, Caffe, Theano, TensorFlow, …

PaaS / Resource Management
● Monitoring, Quota
● Template / pre-loaded software / ML frameworks
● Storage (S3) / SDN / Security

Infrastructure
● Virtual Machine (Openstack) | Container (Kubernetes) | MPI (Singularity)

OS / Infrastructure Software
● Automation, Bare metal, Monitoring, Intelligent Resource Manager

Hardware
● CPU, GPU (V100/A100/H100)
● Cloud Storage (Object & Block), Parallel FS, Tape Library
● InfiniBand, Ethernet (SDN)

Overarching systems
● Infrastructure Admin, Service Admin, Billing, Reporting, IAM, User Management, Analytics
TWCC Software Stack — Cloud + GPU Container
● Same stack as above, adding CCS (Container Compute Service).
TWCC Software Stack — Cloud + GPU Container + AIHPC
● Same stack as above, adding the HPC (High-Performance Computing) service.
Example Case: Highway speed real-time prediction

✓ Over 1,500 cameras on Taiwan highways
✓ Massive training
✓ Real-time computing
✓ Big Data / AI hybrid system
Lessons learned in Hybrid-Architecture
● Performance considerations
○ HPC
■ NUMA
■ IB Topology
○ Kubernetes
■ NUMA
■ SR-IOV
○ Openstack
■ NUMA
■ Passthrough A100/H100 to Openstack VM
● GPU Resource Management
● Security
HPC – NUMA
● GPU–CPU Affinity

[root@v100 slurm]# nvidia-smi topo -m
      ...  CPU Affinity  NUMA Affinity
GPU0  ...  0-17          0
GPU1  ...  0-17          0
GPU2  ...  0-17          0
GPU3  ...  0-17          0
GPU4  ...  18-35         1
GPU5  ...  18-35         1
GPU6  ...  18-35         1
GPU7  ...  18-35         1

[root@v100 slurm]# cat gres.conf
Name=gpu File=/dev/nvidia0 Cores=[0-17]
Name=gpu File=/dev/nvidia1 Cores=[0-17]
Name=gpu File=/dev/nvidia2 Cores=[0-17]
Name=gpu File=/dev/nvidia3 Cores=[0-17]
Name=gpu File=/dev/nvidia4 Cores=[18-35]
Name=gpu File=/dev/nvidia5 Cores=[18-35]
Name=gpu File=/dev/nvidia6 Cores=[18-35]
Name=gpu File=/dev/nvidia7 Cores=[18-35]

● CPU Isolation
○ Cores reserved for the OS, the parallel file system, monitoring, etc.

[root@v100 slurm]# cat slurm.conf
NodeName=v100 Gres=gpu:8 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 CpuSpecList=17,35 …
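Keeping gres.conf consistent with the NUMA layout reported by `nvidia-smi topo -m` can be scripted. A minimal sketch, assuming a hand-maintained GPU-to-core-range mapping rather than live hardware parsing (check your Slurm version's gres.conf syntax for the exact `Cores=` format):

```python
# Hypothetical helper: emit one gres.conf line per GPU from a GPU-index ->
# NUMA-local core-range mapping mirroring the node layout shown above
# (GPUs 0-3 on NUMA node 0, GPUs 4-7 on NUMA node 1).

def gres_lines(gpu_cores: dict[int, str]) -> list[str]:
    """Render Slurm gres.conf entries, one per GPU device file."""
    return [
        f"Name=gpu File=/dev/nvidia{idx} Cores={cores}"
        for idx, cores in sorted(gpu_cores.items())
    ]

mapping = {i: "0-17" for i in range(4)} | {i: "18-35" for i in range(4, 8)}
for line in gres_lines(mapping):
    print(line)
```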

HPC – InfiniBand Topology
● Defining the leaf switches and their nodes is important.

[root@v100 slurm]# cat topology.conf
…
SwitchName=IBISL12 Switches=IBISL12[1-3]
SwitchName=IBISL121 Switches=IBLF120[1-2],IBLF1207
SwitchName=IBISL122 Switches=IBLF120[3-4],IBLF1207
SwitchName=IBISL123 Switches=IBLF120[5-6],IBLF1207
SwitchName=IBLF1201 Nodes=gn12[01-05]
SwitchName=IBLF1202 Nodes=gn12[05-09]
…

[root@v100 slurm]# cat slurm.conf
…
TopologyPlugin=topology/tree
TopologyParam=TopoOptional
…

[root@v100 slurm]# srun env
…
5: SLURM_TOPOLOGY_ADDR=IBISL12.IBISL121.IBLF1201.gn1201
5: SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
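A topology.conf like the excerpt above is worth sanity-checking before reloading Slurm: each node should normally hang off exactly one leaf switch. A minimal sketch (the parser is an assumption for illustration, not a Slurm tool; the sample lines mirror the excerpt):

```python
# Hypothetical checker: expand Slurm-style bracketed host ranges and flag
# nodes that appear under more than one leaf switch in topology.conf.
import re
from collections import Counter

def expand(spec: str) -> list[str]:
    """Expand a bracketed range like gn12[01-05] into individual hostnames."""
    m = re.fullmatch(r"(.+)\[(\d+)-(\d+)\]", spec)
    if not m:
        return [spec]
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

def leaf_node_counts(topology_lines: list[str]) -> Counter:
    """Count how many leaf switches each node hangs off."""
    counts: Counter = Counter()
    for line in topology_lines:
        m = re.search(r"Nodes=(\S+)", line)
        if m:
            counts.update(expand(m.group(1)))
    return counts

conf = [
    "SwitchName=IBLF1201 Nodes=gn12[01-05]",
    "SwitchName=IBLF1202 Nodes=gn12[05-09]",
]
dup = [n for n, c in leaf_node_counts(conf).items() if c > 1]
print(dup)  # → ['gn1205'] — listed under two leaves, worth double-checking
```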
Kubernetes – NUMA

● GPU–CPU Affinity (kubelet config)
[root@v100 kubernetes]# cat config.yaml

featureGates:
  CPUManager: true
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
topologyManagerPolicy: best-effort

kubeReserved:
  cpu: 500m

Ref: https://kubernetes.io/blog/2020/04/01/kubernetes-1-18-feature-topoloy-manager-beta/
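One operational detail worth noting alongside the kubelet config above: the static CPU manager only grants exclusive cores to pods in the Guaranteed QoS class with integer CPU requests equal to their limits. A hedged sketch of such a pod (name and image tag are assumptions for illustration):

```yaml
# Hypothetical pod: Guaranteed QoS with integer CPU count, so the static
# CPU manager can pin it to dedicated, NUMA-local cores.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-trainer            # assumed name
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.03-py3   # assumed NGC image tag
    resources:
      requests:
        cpu: "16"                 # integer, equals the limit
        memory: 64Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "16"
        memory: 64Gi
        nvidia.com/gpu: 1
```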
K8S Over IB (SR-IOV)

● Multus network attachment (ib-sriov CNI):
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: twcc.ib/mlnx_sriov_ib
spec:
  config: '{
    "type": "ib-sriov",
    "ipam": {
      "type": "whereabouts",
      …
    },
    "rdmaIsolation": true,
    …

● SR-IOV device plugin ConfigMap:
kind: ConfigMap
…
"resourceList": [{
  "resourceName": "mlnx_sriov_ib",
  "selectors": {
    "pfNames": ["ib0", "ib1", "ib2", "ib3"],
    "linkTypes": ["infiniband"],
    "isRdma": true,
    "devices": ["1018"]
  }
}]

● Pod template requesting a VF:
template:
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks: sriov-conf-0@vib0
  …
  resources:
    requests:
      twcc.ib/mlnx_sriov_ib: 1
Openstack – NUMA

● /etc/nova/nova.conf
vcpu_pin_set=0-16,18-34   # cores 17 and 35 reserved for the host (cf. Slurm CpuSpecList)
enabled_filters=<...>,NUMATopologyFilter

● GPU–CPU Affinity
openstack flavor create --disk 100 --vcpus 14 --ram 186368 \
  --property aggregate_instance_extra_specs:pinned='true' \
  --property hw:cpu_policy='dedicated' \
  --property pci_passthrough:alias='V100:2' \
  --property hw:numa_nodes=2 \
  <flavour-name>
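It is easy to fat-finger a `vcpu_pin_set` so that it overlaps the host-reserved cores. A minimal sketch of a consistency check, assuming the same reserved cores (17 and 35) as the Slurm `CpuSpecList` shown earlier; the parser is an illustration, not a nova utility:

```python
# Hypothetical check: expand a nova-style CPU-set string and verify it stays
# disjoint from the cores reserved for the host OS and services.

def parse_cpu_set(spec: str) -> set[int]:
    """Expand a string like '0-16,18-34' into a set of core ids."""
    cores: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cores.update(range(lo, hi + 1))
        else:
            cores.add(int(part))
    return cores

reserved = {17, 35}                       # host-reserved cores (assumption)
pinnable = parse_cpu_set("0-16,18-34")
print(pinnable.isdisjoint(reserved))      # → True: guests never touch 17 or 35
```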
NVIDIA GPU Topology
● V100: GPU interconnect (NVLink) up to 300 GB/s
● A100: GPU interconnect (NVLink) up to 600 GB/s
● H100: GPU interconnect (NVLink) up to 900 GB/s
Passthrough A100/H100 to Openstack VM
● A100 NVSwitch Devices
c4:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c5:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c6:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c7:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c8:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
c9:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)

● H100 NVSwitch Devices


07:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
08:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
09:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
0a:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)

● /etc/nova/nova.conf
[pci]
alias = { "vendor_id":"10de", "product_id":"20b0", "device_type":"type-PF", "name":"A100" }
alias = { "vendor_id":"10de", "product_id":"1af1", "device_type":"type-PCI", "name":"NVSWITCH" }
[pci]
alias = { "vendor_id":"10de", "product_id":"2330", "device_type":"type-PF", "name":"H100" }
alias = { "vendor_id":"10de", "product_id":"22a3", "device_type":"type-PCI", "name":"NVSWITCH" }
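When a new GPU generation arrives, the NVSwitch and GPU device IDs must be collected from `lspci -nn` and turned into nova `[pci]` aliases. A hedged sketch (the device-ID-to-alias table mirrors the slide; the parser is an assumption for illustration, `device_type` omitted):

```python
# Hypothetical helper: scan `lspci -nn` output for known NVIDIA (10de) device
# IDs and emit matching nova [pci] alias lines.
import re

DEVICE_ALIASES = {"1af1": "NVSWITCH", "22a3": "NVSWITCH",
                  "20b0": "A100", "2330": "H100"}

def aliases_from_lspci(lspci_nn: str) -> set[str]:
    """Return the set of nova alias lines implied by the lspci -nn text."""
    found = set()
    for m in re.finditer(r"\[10de:([0-9a-f]{4})\]", lspci_nn):
        dev = m.group(1)
        if dev in DEVICE_ALIASES:
            found.add(
                f'alias = {{ "vendor_id":"10de", "product_id":"{dev}", '
                f'"name":"{DEVICE_ALIASES[dev]}" }}'
            )
    return found

sample = "c4:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)"
print(aliases_from_lspci(sample))
```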
Fully Utilizing GPU Resources
TWCC ISO and Security Certifications

TWCC Services (platform overview for developers & users)

SaaS (Software as a Service)
● Application-oriented APIs; ML-based analytics / data visualization; AI cloud platform; multi-tenant space
● Application domains: precision medicine, environmental protection, smart city, Industry 4.0, fundamental science research
● User services: data/AI service portal, account management, virtual space, open/government data, LOD services

PaaS (Platform as a Service)
● HPC function-oriented APIs, AI/BD function-oriented APIs, API management
● Domain workloads: material science, bioinformatics, voice recognition, vision recognition, predictive maintenance, autopilot, text mining, engineering simulation, geoinformatics, face recognition, production automation, security detection, data analytics
● Data Platform: Hadoop, Impala, Spark, MapReduce, SQL processing, Hive/Pig, search engine
● Data Analytics Platform: Spark machine learning, batch/stream/analytic processing, data preparation
● ML/DNN platform: Caffe, TensorFlow, Torch, DIGITS, MXNET, Keras, …
● Shared data / models / modules, de-identification, ETL, model repository, marketplace, Data Hub/API

Cloud Management Platform
● Event & alarm management, dashboard & report management, backup, chargeback, template management, workflow management, HA
● NOC/SOC, resource management & monitoring, quota management, N+1, service-level management over OpenStack/Kubernetes/Slurm/Ceph
● Customer management: user profile, account & billing system

IaaS (Infrastructure as a Service)
● CPU/GPU resource management, SDN/NFV, monitoring, storage tiering/backup
● GPU cluster, CPU cluster, high-speed storage, object storage
● Operated with the NCHC / university computing centers
Current Dev. for Day 2
Generative AI

Day 0 Op Day 1 Op Day 2 Op

AFS: GenAI Solution for Training and Inference
TWSC AFS: AI Foundry Service
IndustrialGPT Solutions

Step 1. Upload training data (data collection)
Step 2. Deploy on premise* / on demand (self-managed deploy, or platform provided by TWSC)

NOTE: *On-premise deployment can be easily operated with the TWSC on-premise AFS Appliance.

01 Full Control   02 No Code
03 Formosa LLM    04 Wide Adoption

Standing on Formosa LLM
● Pursue excellence: AFS ModelSpace
● Utilize CCS with its InfiniBand capability
● Fast deployment of proprietary LLMs

Implementing a GPT-3-level LLM on TAIWANIA 2
With 176 billion parameters, the model cannot be trained on any single GPU; it requires precise model partitioning and efficient distributed training.
■ Training can achieve linear speedup.
■ Training and inference of the 176-billion-parameter model can be run on TWCC.
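A back-of-envelope calculation shows why a single GPU is out of the question and how partitioning helps. The parallelism degrees below (tensor-parallel 8 × pipeline 12) are illustrative assumptions, and the estimate counts fp16 weights only, excluding optimizer state and activations:

```python
# Rough memory estimate for a 176B-parameter model: fp16 weights alone far
# exceed any single GPU's memory, but sharding across 96 GPUs brings the
# per-GPU weight footprint down to a few GiB.

def weights_gib(params: float, bytes_per_param: int = 2) -> float:
    """GiB needed just to hold the weights at the given precision."""
    return params * bytes_per_param / 2**30

total = weights_gib(176e9)      # ~328 GiB of fp16 weights
per_gpu = total / (8 * 12)      # assumed: tensor-parallel 8 x pipeline 12 = 96 shards
print(round(total), round(per_gpu, 1))
```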

Implementing a GPT-3-level LLM on Taiwania 2


Thanks!
Do you have any questions?

[email protected]
https://www.twsc.io
