Accelerating AI with Storage Scale
IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without
notice at IBM's sole discretion. Information regarding potential future products is intended to outline our
general product direction and it should not be relied on in making a purchasing decision. The information
mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver
any material, code, or functionality. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
IBM reserves the right to change product specifications and offerings at any time without notice. This
publication could include technical inaccuracies or typographical errors. References herein to IBM products
and services do not imply that IBM intends to make them available in all countries.
To unlock the full potential of AI, we must overcome the challenges of enterprise infrastructure:
• Infrastructure limitations & platforms to scale AI: AI is the fastest-growing workload, driving spending on compute and storage infrastructure2.
• Growing resource demands & silos: 82% of organizations cite siloed data as a key obstacle to more effective AI development1.
• Operational and physical resource efficiencies: increasing operational overhead from new AI apps challenges IT budgets and energy efficiency.
• Security and data resiliency: data must be trusted, and protecting sensitive information from cyberthreats, loss, or downtime is a high priority.
Data Sources:
1 IDC, Planning for Success with Generative AI
2 https://www.ibm.com/downloads/cas/VKGPNJ3B
What if your organization could accelerate AI workloads with a storage infrastructure
designed to accelerate business growth?
[Diagram: an end-to-end application platform for generative AI models, with management and security layers, spanning multiple clouds (Cloud 1, Cloud 2) and storage systems (Storage 1, Storage 2).]
Customers need an end-to-end data strategy to deliver accelerated results across the AI pipeline

The pipeline spans Prepare, Build, and Deploy:
• Prepare – Data preparation: a workflow of steps (e.g. deduplicate, remove hate & profanity, etc.); runs hours to days on 10-2,000+ low- to mid-end CPU cores (a minimal sketch follows this list).
• Build – Distributed training & model validation: a long-running job on massive infrastructure; runs weeks to months on 10-500+ high-end GPUs per job. Infra: 8xA100, 8xH100, high-performance networking.
• Deploy – Model adaptation: model tuning with a custom data set for downstream tasks; runs minutes to hours on 1+ mid- to high-end GPU per job. Infra: 8xA100, 8xH100.
• Deploy – Inference: may be sensitive to latency/throughput and is always cost-sensitive; serves sub-second API requests on a single GPU, or a fraction of a GPU, per fine-tuning task or serving request. Infra: L40S, L4.

Distributed training runs on AI training infrastructure; model adaptation and inference run on AI tuning/inferencing infrastructure.
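To make the Prepare stage concrete, the following is a minimal Python sketch of one such workflow step (exact-hash deduplication plus a blocklist filter). It is illustrative only: the blocklist terms, corpus, and function name are hypothetical and not taken from this material.

# Minimal sketch of a "Prepare" step: deduplicate documents and drop any that
# match a hate/profanity blocklist before training. Blocklist and corpus are
# hypothetical placeholders; real pipelines use much richer filtering.
import hashlib

BLOCKLIST = {"offensive-term-a", "offensive-term-b"}  # placeholder terms

def prepare(docs):
    seen_hashes = set()
    cleaned = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate, skip
        seen_hashes.add(digest)
        if any(term in doc.lower() for term in BLOCKLIST):
            continue  # blocklisted content, skip
        cleaned.append(doc)
    return cleaned

if __name__ == "__main__":
    corpus = ["A sample record.", "a sample record.", "Another record."]
    print(prepare(corpus))  # -> ['A sample record.', 'Another record.']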
https://blocksandfiles.com/2023/08/15/ibm-nvidia-gpu-data-delivery/
Why IBM Storage and NVIDIA are better together to accelerate AI innovation

IBM Storage Scale accelerates your infrastructure with a hybrid-cloud-by-design platform for AI, connecting AI workloads running on servers with NVIDIA GPUs (including NVIDIA DGX BasePOD) to IBM Storage Scale:
• Accelerate discovery: multi-protocol parallel data access with up to 310 GB/s, 13M IOPS, and NVIDIA GPUDirect® support (illustrated in the sketch below).
• Increase collaboration: data abstraction that delivers remote data, non-IBM storage, and cloud data directly to NVIDIA systems.
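To show what a GPUDirect-style data path looks like from application code, here is a minimal sketch using the open-source RAPIDS KvikIO library, which wraps NVIDIA cuFile/GPUDirect Storage. KvikIO is an assumption for illustration (it is not named in this material), and the file path is a hypothetical location on a Storage Scale mount.

# Minimal sketch (illustrative assumption): read a file directly into GPU
# memory with RAPIDS KvikIO, which uses NVIDIA cuFile / GPUDirect Storage when
# the underlying file system (e.g. a Storage Scale mount) supports it and
# falls back to regular POSIX I/O otherwise.
import cupy
import kvikio

path = "/gpfs/fs1/dataset/shard-000.bin"  # hypothetical Storage Scale path

# Write a sample device buffer so the read below has something to load.
src = cupy.arange(1 << 20, dtype=cupy.float32)
f = kvikio.CuFile(path, "w")
f.write(src)   # GPU buffer -> file
f.close()

# Read the file straight into GPU memory, avoiding a host bounce buffer
# when GDS is active.
dst = cupy.empty_like(src)
f = kvikio.CuFile(path, "r")
f.read(dst)    # file -> GPU buffer
f.close()

assert bool((src == dst).all())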
Example client deployment:
• A leading global AI & hybrid cloud company
• AI and data platform to deliver enterprise AI services
• AI supercomputer scalable up to 5,000 H100 HGX systems
• Training LLM models with 100B+ parameters
• 1st phase: 1 scalable unit (SU) with 32 HGX nodes
• 2nd phase: 20 scalable units; 384 HGX nodes
• Faster results – quality & speed of the training models
• ESS3500 for the initial Phase 1 deployment; 32 SSS6000 for Phase 2
• NDR is the network fabric for both compute and storage
IBM Storage Scale on ARM
GA with IBM Storage Scale 5.2.0 on April 26, 2024
QuantaGrid S74G-2U
https://www.nvidia.com
Grace = the ARM CPU where our clients run their workloads
Hopper or Blackwell = the GPU where we can place data with GDS (as in the Grace Hopper and Grace Blackwell superchips)
ARM support with Storage Scale 5.2.0 – IBM Storage for Data and AI

Included:
• SE package / install toolkit / rpm-based install
• NSD client
• Scale base functionality (I/O, policies, remote mounts, snapshots, quotas, etc.)
• Manager roles: file system manager / token manager / cluster manager
• RDMA (IB or RoCE), including GDS
• Health monitoring
• Target OS: RHEL 9.3 and Ubuntu 22.04 (ask to open an RFE for customers asking for RHEL 8)
• File audit logging, watch folders
• Call home
• GUI (can display an ARM node, but cannot run on ARM)

Excluded, but planned for future releases:
• NSD servers
• GNR/ECE

Excluded:
• SNC
• Protocols
• BDA / HDFS
• CNSA
• TCT

Where to get the SE package:
• https://www.ibm.com/support/fixcentral
• Data Access and Data Management editions

Supported operating systems (a small package-selection sketch follows this list):
• RHEL 9.3: gpfs.base-5.2.0-0.aarch64.rpm
• Ubuntu 22.04: gpfs.base_5.2.0-0_arm64.deb
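To tie the supported operating systems to the packages above, here is a small hypothetical Python sketch (not IBM tooling) that picks the Storage Scale 5.2.0 base package name for an ARM node; the OS identifiers and function name are made up for illustration.

# Hypothetical helper (not IBM tooling): choose the Storage Scale 5.2.0 base
# package for an ARM node from the package names listed on this slide.
import platform

SCALE_ARM_PACKAGES = {
    "rhel9.3": "gpfs.base-5.2.0-0.aarch64.rpm",      # RHEL 9.3
    "ubuntu22.04": "gpfs.base_5.2.0-0_arm64.deb",    # Ubuntu 22.04
}

def scale_base_package(os_id: str) -> str:
    # The 5.2.0 ARM packages target aarch64/arm64 nodes only.
    if platform.machine() not in ("aarch64", "arm64"):
        raise RuntimeError("This sketch targets ARM (aarch64) nodes only")
    try:
        return SCALE_ARM_PACKAGES[os_id]
    except KeyError:
        raise ValueError(f"No ARM package listed for {os_id!r} in Scale 5.2.0")

if __name__ == "__main__":
    print(scale_base_package("rhel9.3"))  # -> gpfs.base-5.2.0-0.aarch64.rpm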