
Accelerating AI with Storage Scale

Storage Scale User Group
May 13th, 2024, ISC, Hamburg, Germany

Ted Hoover
Product Manager, Storage for Data and AI
Disclaimer

IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without
notice at IBM's sole discretion. Information regarding potential future products is intended to outline our
general product direction and it should not be relied on in making a purchasing decision. The information
mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver
any material, code, or functionality. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
IBM reserves the right to change product specifications and offerings at any time without notice. This
publication could include technical inaccuracies or typographical errors. References herein to IBM products
and services do not imply that IBM intends to make them available in all countries.
To unlock the full potential of AI we must overcome the challenges of enterprise infrastructure

Infrastructure limitations & platforms to scale AI
AI is the fastest-growing workload driving spending on compute and storage infrastructure².

Growing resource demands & silos
82% of organizations cite siloed data as a key obstacle to more effective AI development¹.

Operational and physical resource efficiencies
Increasing operational overhead from new AI apps challenges IT budgets and energy efficiency.

Security and data resiliency
Data must be trusted, and securing sensitive information from cyberthreats, loss, or downtime is a high priority.

Data Sources:
1 IDC, Planning for Success with Generative AI
2 https://www.ibm.com/downloads/cas/VKGPNJ3B
What if your organization could accelerate AI workloads with a storage infrastructure designed to accelerate business growth?

Difficult for AI – isolated with silos: workloads split across separate stacks (Cloud 1, Cloud 2; Storage 1, Storage 2) in the client IT environment
• Siloed and slow-to-adopt innovation
• Sub-optimal use of resources
• Hard to align across business
• AI constrained

Optimized for AI – accelerated innovation: workloads share generative AI models and platform, an end-to-end application platform, and data acceleration and sharing with governance, with management and security spanning clouds and storage in the client IT environment
• Continuous and speedy innovation
• Integrated and automated operating model
• Accelerated value of investments
• Generative AI at scale
Customers need an end-to-end data strategy to bring accelerated results for the AI pipeline

Prepare – data preparation: a workflow of steps (e.g. deduplicate, remove hate & profanity, etc.)
• Duration: hours to days
• Resources: 10–2000+ low- to mid-end CPU cores

Build – distributed training & model validation: long-running jobs on massive infrastructure
• Duration: weeks to months
• Resources: 10–500+ high-end GPUs per job (infra: 8xA100, 8xH100, high-performance networking)

Build – model adaptation: model tuning with a custom data set for downstream tasks
• Duration: minutes to hours
• Resources: 1+ mid- to high-end GPU per job (infra: 8xA100, 8xH100)

Deploy – inference: may be sensitive to latency/throughput, always cost-sensitive
• Duration: sub-second API requests
• Resources: a single GPU, or a fraction of a GPU, per fine-tuning task or serving request (infra: L40S, L4)

AI acceleration and collaboration with efficiency and resilience
Storage Requirements for AI

AI tuning/inferencing: efficient GPU support, rapid deployment, simplified Day-2 operations, high density
AI training: maximum performance, linear scaling of performance and capacity

Across all AI workloads, the stack – AI workloads on an AI platform, optimized for AI through storage acceleration and storage abstraction – needs high bandwidth, low latency, HA/DR/backup, metadata catalog integration, and scalability.
Storage: Why It Matters

• AI means a lot of I/O (yes, customers will downplay it)
• Roughly a 2:1 read:write ratio
• Reads and re-reads matter most
• Writes become massive with large-parameter models (175B+)
• Scalable performance really matters
• High-performance parallel file storage (PFS) is a scratch space, not long-term storage
• Tiering to Object/NL-SAS/Tape is common practice

https://docs.nvidia.com/https:/docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf
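To make the read/re-read point concrete, here is a minimal sketch of the training-I/O pattern behind these ratios: every epoch re-reads the full dataset, while writes occur only at checkpoint time. The paths and the checkpoint payload are hypothetical stand-ins, not IBM sample code.

import os

DATASET = "/scale/train"          # hypothetical training-data volume
CKPTS = "/scale/checkpoints"      # hypothetical checkpoint volume

def run(epochs: int) -> None:
    files = [os.path.join(DATASET, f) for f in os.listdir(DATASET)]
    read_bytes = write_bytes = 0
    for epoch in range(epochs):
        for path in files:                      # the whole dataset is re-read
            with open(path, "rb") as f:         # every epoch: reads dominate
                read_bytes += len(f.read())
        state = b"\0" * (1 << 20)               # stand-in for model state;
        with open(f"{CKPTS}/epoch-{epoch}.ckpt", "wb") as f:
            write_bytes += f.write(state)       # real 175B+ checkpoints are huge
    print(f"read {read_bytes:,} B vs wrote {write_bytes:,} B")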
IBM tops NVIDIA GPU data delivery charts
IBM Storage Scale System 6000 is over 2x more performant than ESS 3500

Why IBM Storage for NVIDIA GPUs?
The world's fastest systems need the world's best storage, and IBM has the best storage for NVIDIA GPUs.

Highest-performance platform
• Fastest performance for reads, writes, and density
• Linear scalability for future growth

A robust enterprise platform
• Six 9's availability for all apps: AI, analytics, HPC, backup, archive, cloud
• Cyber-resilient: encryption, WORM, and immutability

Collapse layers & simplify data integration
• Eliminate extra copies and share data globally with all protocols
• Data cataloging and tiering for economics and data flexibility

https://blocksandfiles.com/2023/08/15/ibm-nvidia-gpu-data-delivery/
Why IBM Storage and NVIDIA are better together to accelerate AI innovation
IBM Storage Scale accelerates your infrastructure with a hybrid-cloud-by-design platform for AI

IBM Storage Scale and the Storage Scale System 6000 form a global data platform serving AI workloads on servers with NVIDIA GPUs, NVIDIA DGX BasePOD, NVIDIA DGX SuperPOD, and NVIDIA DGX Grace Hopper.

Accelerate discovery
• Multi-protocol parallel data access with up to 310 GB/s, 13M IOPS, and NVIDIA GPUDirect® support

Increase collaboration
• Data abstraction serves remote data, non-IBM storage (e.g. NetApp, Dell, Pure, VAST, S3 object, tape), and cloud data directly to NVIDIA systems

Support lower cost and green initiatives
• New QLC computational storage with transparent archive optimization

Safeguard data from the unknown
• Cyber-enhanced 99.9999% availability with a data catalog/namespace to enhance trust
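As an aside on GPUDirect: with GDS-capable storage, file data can move straight into GPU memory without a host bounce buffer. Below is a minimal sketch using RAPIDS kvikio (Python bindings for NVIDIA cuFile); the path is hypothetical and assumes a GDS-enabled Scale mount at /scale.

import cupy
import kvikio

# Allocate a GPU buffer, then let cuFile DMA file data directly into it.
buf = cupy.empty(1 << 20, dtype=cupy.uint8)       # 1 MiB buffer on the GPU
f = kvikio.CuFile("/scale/train/shard-000.bin", "r")
nbytes = f.read(buf)                              # storage -> GPU, no host copy
f.close()
print(f"read {nbytes} bytes directly into GPU memory")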
IBM Storage for Data and AI & NVIDIA GPU Solutions
A full spectrum of scalable AI solutions
Start small and scale predictably in response to business demand with the same IBM Storage software

AI Entrant: 1 x DGX A100/H100 or 1 x NVIDIA-Certified server
• 1 x ESS 3500, 12 NVMe half populated – up to 60 GB/s read
• or 1 x SSS 6000 w/ 12 NVMe – up to 80+ GB/s read

AI Medium: 4 x DGX A100/H100 or 4 x NVIDIA-Certified servers
• 1 x ESS 3500 – up to 125 GB/s read
• or 1 x SSS 6000 – up to 310 GB/s read, up to 155 GB/s write

AI Master: 8 x DGX A100/H100 or 8 x NVIDIA-Certified servers
• 2 x ESS 3500 – up to 250 GB/s read
• or 1 x SSS 6000 – up to 310 GB/s read, up to 155 GB/s write

AI Scaler (NVIDIA SuperPOD): 32 x DGX H100 or 32 x NVIDIA-Certified servers
• 2 x ESS 3500 – up to 250 GB/s read
• or 2 x SSS 6000 – up to 620 GB/s read, up to 310 GB/s write

IBM Storage:
• Simple building block – simple, scalable, seamless upgrade path
• Enterprise features – performance, scalability, data protection, and security
• Global Data Platform services – integrate with current storage, multi-site active-active, edge to cloud to core, single namespace across multiple installations
• IBM expertise and services
• Successful deployments across the globe – telco, automobile, banking and finance, healthcare, retail, academic/research, and public sector

A simple, scalable upgrade path
IBM Storage Scale: an integral part of the Vela architecture

• Built completely on IBM Cloud infrastructure
• Dedicated IBM Storage Scale cluster on IBM Cloud instances
• Cloud-Native Scale Access (CNSA) on the GPU compute cluster (200 nodes, 1,600 GPUs), which remote-mounts the fast file system across racks 1–63 via paired top-of-rack (TOR) switches and aggregation switches AGG 1–4
• Shared POSIX file system semantics
• One volume for training data – fits the complete training dataset
• One volume for checkpointing – can accumulate ~10 days of checkpoints
• Large cost-effective data repository using IBM Cloud Object Storage
• Two-tier architecture: Active File Management (AFM) transparently moves data between the cost-effective object storage tier and the fast file system tier

Raw performance improvements:
• 3x write bandwidth compared to COS-only (15 GB/s vs 5 GB/s)
• 40x read bandwidth over NFS (40 GB/s vs 1 GB/s)
• Long-term storage for training checkpoints

Training performance improvements:
• Storage Scale improved training step time variation by 5x
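A minimal sketch of how a dedicated checkpoint volume with a ~10-day window can be used: write each checkpoint atomically, then prune anything older than the retention period. Paths and sizes are hypothetical assumptions, not Vela's actual code.

import os, time

CKPT_DIR = "/scale/checkpoints"   # hypothetical checkpoint-volume mount point
RETENTION_SECS = 10 * 24 * 3600   # ~10 days of checkpoints, as sized above

def save_checkpoint(step: int, state: bytes) -> None:
    final = os.path.join(CKPT_DIR, f"step-{step:09d}.ckpt")
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:    # write-then-rename, so a reader never sees
        f.write(state)            # a partially written checkpoint
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, final)         # atomic within the POSIX file system
    prune()

def prune() -> None:
    now = time.time()
    for name in os.listdir(CKPT_DIR):
        path = os.path.join(CKPT_DIR, name)
        if name.endswith(".ckpt") and now - os.path.getmtime(path) > RETENTION_SECS:
            os.remove(path)       # fell outside the ~10-day window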
IBM Blue Vela
HGX "SuperPOD" storage fabric (IBM Cloud / IBM Research)

Accelerated GPU compute cluster: scalable units SU#1, SU#2, … SU#n of DGX H100 nodes, remote-mounting an IBM Storage Scale System cluster (ESS 3500 #1–#n, SSS 6000 #1–#n, and an EMS) over an InfiniBand NDR storage fabric

• A leading global AI & hybrid cloud company; an AI and data platform to deliver enterprise AI services
• AI supercomputer scalable up to 5,000 H100 HGX systems
• Training LLM models with 100B+ parameters
• 1st phase: 1 SU with 32 HGX nodes
• 2nd phase will have 20 scalable units; 384 HGX nodes
• Faster results – quality & speed of the training models
• ESS 3500 for the initial Phase 1 deployment; 32 x SSS 6000 for Phase 2
• NDR is the network fabric for both compute & storage
IBM Storage Scale on ARM
GA with IBM Storage Scale 5.2.0 on April 26, 2024

Storage Scale User Group
May 13th, 2024, ISC, Hamburg, Germany

Ingo Meents
IT Architect, Storage Scale Development
Why ARM? Increasing demand in AI & HPC

• Advanced RISC Machine
• Processor designs licensed from Arm Limited
• Simple 64-bit (and 32-bit) RISC architecture
• Efficiency: embedded and mobile devices
• Growing into HPC, AI, and ML
https://www.arm.com/markets/computing-infrastructure/high-performance-computing

Examples: the Fugaku supercomputer (TOP500 list), the European Processor Initiative, NVIDIA's Grace CPU and DPUs, and AWS Graviton 2 and 3.
https://www.top500.org/system/179807/
https://www.european-processor-initiative.eu/
https://www.nvidia.com/de-de/data-center/grace-cpu/
https://aws.amazon.com/de/ec2/graviton/


ARM Neoverse Family
A group of 64-bit ARM processor cores

Neoverse N-series – data center usage (scale-out performance)
• N1 (ARMv8.2-A): Ampere Altra (2-socket, 80 cores), AWS Graviton2 (64 cores), Huawei Kunpeng 920
• N2 (ARMv9.0-A): Alibaba Yitian 710

Neoverse E-series – edge computing (efficient throughput)
• E1 (ARMv8.2-A)
• E2 (ARMv9.0-A)

Neoverse V-series – high performance (max performance)
• V1 (ARMv8.4-A): AWS Graviton3 (64 cores), Centre for Development of Advanced Computing (C-DAC) AUM
• V2 (ARMv9.0-A): NVIDIA Grace (144 cores), NVIDIA BlueField-3, AWS Graviton4, Google Axion

Fujitsu A64FX for HPC (Armv8.2-A + SVE): Supercomputer Fugaku

N3 and V3 were presented in February 2024.

**This is a general list of where ARM can be found, how it can be categorized, and some examples. This is not a Scale support list.**
https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/
Hardware Examples

NVIDIA ARM HPC Developer Kit (ARM server with Ampere® Altra® Max) – our development & test platform
• Single-socket Ampere® Altra® Max or Altra® processor
• Up to 2 x NVIDIA® A100 PCIe Gen4 GPU cards
• Up to 2 x NVIDIA® BlueField-2 DPUs
• 8-channel RDIMM/LRDIMM DDR4, 16 x DIMMs

QuantaGrid S74G-2U (Grace Hopper)
• NVIDIA GH200 Grace Hopper Superchip
• NVIDIA Grace with 72 Arm® Neoverse V2 cores, 1 processor
• NVIDIA® NVLink®-C2C at 900 GB/s
• 3 PCIe 5.0 x16 FHFL dual-width slots

BlueField-3 DPU
• Up to 16 Armv8.2+ A78 Hercules cores (64-bit)
• 16 GB on-board DDR5

Fujitsu A64FX CPU and AWS Graviton processors in Amazon EC2 – positive feedback; basic tests successful
https://www.fujitsu.com/global/products/computing/servers/supercomputer/a64fx/
https://aws.amazon.com/de/ec2/graviton/
ARM @ NVIDIA

• Grace CPU Superchip
• Grace Hopper Superchip
• Grace Blackwell Superchip – just announced in March at GTC24

Grace = the ARM CPU where our clients' workloads run
Hopper or Blackwell = the GPU where we can place data with GDS

https://www.nvidia.com
ARM support with Storage Scale 5.2.0

Included:
• SE package / install toolkit / rpm-based install
• NSD client
• Scale base functionality (I/O, policies, remote mounts, snapshots, quotas, etc.)
• Manager roles: file system manager / token manager / cluster manager
• RDMA (InfiniBand or RoCE), including GDS
• Health monitoring
• Target OS: RHEL 9.3 and Ubuntu 22.04 (customers asking for RHEL 8 should open an RFE)
• File audit logging, watch folders
• Call home
• GUI (can display an ARM node, but cannot run on ARM)

Excluded, but planned for future releases:
• NSD servers
• GNR/ECE

Excluded:
• SNC
• Protocols
• BDA / HDFS
• CNSA
• TCT

Where to get the SE package:
• https://www.ibm.com/support/fixcentral
• Data Access and Data Management editions

Supported operating systems and packages (see the naming sketch below):
• RHEL 9.3: gpfs.base-5.2.0-0.aarch64.rpm
• Ubuntu 22.04: gpfs.base_5.2.0-0_arm64.deb
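The base-package naming differs by distro, as listed above. A small illustrative helper (an assumption for illustration, not an IBM tool) that maps OS and CPU architecture to the expected 5.2.0 package file name:

import platform

def scale_base_package(os_id: str, version: str = "5.2.0-0") -> str:
    """Return the expected gpfs.base package name on ARM (per the list above)."""
    arch = platform.machine()            # 'aarch64' on 64-bit ARM Linux
    if arch != "aarch64":
        raise ValueError(f"not an ARM node: {arch}")
    if os_id == "rhel":                  # RHEL 9.3 ships an .aarch64 rpm
        return f"gpfs.base-{version}.aarch64.rpm"
    if os_id == "ubuntu":                # Ubuntu 22.04 ships an _arm64 deb
        return f"gpfs.base_{version}_arm64.deb"
    raise ValueError(f"unsupported OS for Scale 5.2.0 on ARM: {os_id}")

print(scale_base_package("rhel"))        # gpfs.base-5.2.0-0.aarch64.rpm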

