Solution Architecture

The document outlines the architecture and design considerations for a high-availability AI solution using Nutanix and Mellanox technologies. Key requirements include availability, recoverability, manageability, performance, security, and network architecture, with specific configurations and features detailed for Nutanix and Mellanox components. It emphasizes the importance of replication factors, block awareness, and load balancing to enhance system resilience and performance.

Uploaded by

Bhushan Rane

Solution Architecture
When architecting this solution, we took into consideration the following AI-specific
requirements:

Availability: The solution must have high availability and maintain availability during upgrades
and failure scenarios.

Recoverability: The solution must have a strategy for recovering AI workloads and restoring
data in case of a disaster, while also minimizing recovery point objectives (RPOs) and
recovery time objectives (RTOs).

Manageability: The solution must reduce administrative effort for day-one and day-two
operations.

Performance and scalability: The solution must gain resilience, performance, and capacity as
it scales, without degrading the performance of existing workloads.

Security: The solution must implement security policies across the full stack.

Storage: The solution must reduce administrative effort and maximize storage performance
regardless of the application workload.

Network: The solution must provide a network architecture that maximizes bandwidth and
decreases latency.

Availability
In this section, we discuss the common failure scenarios in the solution as well as how the
various hardware and software components increase infrastructure availability.

Table. Summary of Availability Design


Configuration Item | Parameter
Nutanix CVM design | 1 per host (4 total); 12 vCPUs, 64 GB of RAM
Nutanix file server VM design | 3 VMs (initial configuration); 4 vCPUs, 12 GB of RAM
Nutanix file server export protocol | NFS v4
Nutanix file server export type | Sharded directories
Nutanix file server DNS settings | Records automatically created in AD during provisioning (round robin)
Cluster redundancy factor | 2
Cluster high availability reservation | Enabled
Cluster virtual IP address | Set
Cluster iSCSI data services IP | Set

Nutanix Availability Features

Nutanix can operate as either a single node or a cluster of nodes in which three or more
nodes share resources and distributed data, which increases application and storage
availability. As represented in the following figure, as AOS ingests data from the NVIDIA
DGX-1 system or application, it creates a local copy on the home node and distributes a
secondary copy to another node in the cluster. Then the system sends an acknowledgment back
to the ingesting application that the write operation is complete. Consequently, as the
application writes data, the system always stores a secondary copy on another node. The
number of copies the system keeps is the replication factor, which is set to 2 by default.
An administrator can set the replication factor to 3, which requires a minimum of five
Nutanix nodes but dramatically increases availability.

Note: We selected replication factor 2 because it provides an acceptable level of availability for this architecture.
However, if customers are evaluating larger clusters, it may be useful to increase the replication factor to 3.
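
The write path described above can be sketched roughly as follows. This is a minimal illustration, not the AOS implementation: the Node class, the write function, and the rule for choosing which peer node receives the secondary copy are all hypothetical. Only the order of operations (local copy, remote copy, then acknowledgment) and the minimum node counts come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    extents: dict = field(default_factory=dict)  # key -> data held on this node


# Minimum cluster sizes stated in the text: 3 nodes for RF2, 5 nodes for RF3.
MIN_NODES = {2: 3, 3: 5}


def write(cluster, home, key, data, rf=2):
    """Store data on the home node plus rf - 1 other nodes, then acknowledge."""
    if rf not in MIN_NODES:
        raise ValueError("replication factor must be 2 or 3")
    if len(cluster) < MIN_NODES[rf]:
        raise ValueError(f"RF{rf} needs at least {MIN_NODES[rf]} nodes")
    # Hypothetical placement rule: the first rf - 1 nodes that are not the home node.
    peers = [n for n in cluster if n is not home][: rf - 1]
    targets = [home] + peers
    for node in targets:
        node.extents[key] = data
    # The acknowledgment is returned only after every copy has been stored.
    return [n.name for n in targets]
```

With four nodes and the default replication factor, a write lands on the home node and exactly one other node; asking for replication factor 3 on the same four-node cluster raises an error because five nodes are required.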

Figure. Nutanix Data Availability


Block Awareness

A block is a rack-mountable enclosure that contains one to four Nutanix nodes. In multinode
blocks, the power supplies and the fans are the only components shared by nodes in a block.
When certain conditions are met, Nutanix cloud clusters are block aware, which means that
redundant copies of any data needed to serve I/O are placed on nodes that aren’t in the
same block, which maximizes the solution’s availability. When you scale your AI infrastructure,
it’s important to note that block awareness is applied automatically when all the following
conditions are met:

The cluster has three or more blocks (or five or more blocks if the cluster was created with
replication factor 3).

Every storage tier in the cluster contains at least one drive on each block.

Every container in the cluster has a replication factor of at least 2.

The storage tiers on each block in the cluster are of comparable size:

  With replication factor 2, the cluster SSD tiers don’t differ in size by more than 33
  percent across blocks.

  With replication factor 3, the cluster SSD tiers don’t differ in size by more than 25
  percent across blocks.
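
As a rough aid when planning a scale-out, the conditions above can be expressed as a single check. This is a sketch under stated assumptions, not a Nutanix tool: the function name, the input layout, and the tier name "SSD" are hypothetical, and "percent different" is assumed to mean (largest - smallest) / largest, which the text does not define.

```python
def block_aware(blocks, container_rfs, cluster_rf=2):
    """Return True if the listed block awareness conditions all hold.

    blocks: dict of block name -> dict of tier name -> list of drive sizes
    container_rfs: replication factor of each storage container
    cluster_rf: replication factor the cluster was created with (2 or 3)
    """
    # Condition 1: three or more blocks, or five or more for replication factor 3.
    min_blocks = 5 if cluster_rf == 3 else 3
    if len(blocks) < min_blocks:
        return False

    # Condition 2: every storage tier has at least one drive on each block.
    tiers = set()
    for drives_by_tier in blocks.values():
        tiers.update(drives_by_tier)
    for drives_by_tier in blocks.values():
        if any(not drives_by_tier.get(tier) for tier in tiers):
            return False

    # Condition 3: every container has a replication factor of at least 2.
    if any(rf < 2 for rf in container_rfs):
        return False

    # Condition 4: SSD tier sizes are comparable across blocks.
    # Assumption: difference is measured as (largest - smallest) / largest.
    limit = 0.25 if cluster_rf == 3 else 0.33
    ssd_sizes = [sum(b.get("SSD", [])) for b in blocks.values()]
    largest = max(ssd_sizes)
    if largest > 0 and (largest - min(ssd_sizes)) / largest > limit:
        return False
    return True
```

Feeding the function a planned layout before adding hardware makes it easy to see which condition a new block would violate, for example an SSD tier that is much smaller than on the existing blocks.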

Figure. Nutanix Data Availability with Block Awareness



Tip: When you scale your AI infrastructure with Nutanix blocks, consider physically distributing blocks across racks to
further increase availability.

Node Availability

Because all Nutanix cloud clusters have at least three nodes, we added a fourth node to our
solution to meet the requirement for n + 1 availability. This extra node allows the cluster
to handle planned or unplanned events without operating in a degraded state. When you enable
Cluster High Availability in Prism, the system maintains this level of availability automatically.
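
The n + 1 reasoning above is simple arithmetic, sketched here for illustration. The function name is hypothetical; the three-node minimum is the figure stated in the text.

```python
MIN_CLUSTER_NODES = 3  # minimum cluster size stated in the text


def meets_n_plus_one(total_nodes, min_nodes=MIN_CLUSTER_NODES):
    """True if the cluster still has min_nodes left after losing one node."""
    return total_nodes - 1 >= min_nodes
```

A four-node cluster passes this check because three nodes remain after one is lost, while a three-node cluster does not.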

Controller VM

Nutanix is a 100 percent software-defined solution that places a software storage controller
(the CVM) on each node in the cluster. The storage controller actively accepts I/O from
applications running locally on that node and participates in cluster-wide operations such as
replicating data, self-healing, and rebalancing data.

Nutanix Files Load Balancing

As a complement to the CVM, Nutanix Files serves NFS and SMB requests from clients and from
internal and external systems. Nutanix Files uses the CVM for reads and writes to
distributed storage, providing resilience (replication factor 2), data integrity, and
scalability, as detailed in the Nutanix Data Availability figure. Nutanix Files doesn’t need
to be present on every node; rather, it starts off with a minimum of three file server VMs
(FSVMs) and automatically scales out when needed. The following figure shows a high-level
representation of the relationship between the Nutanix CVM and FSVMs, specifically the
distribution of NFS exports and directories across multiple FSVMs.
Figure. High-Level Nutanix Files Architecture

Refer to the Nutanix Files tech note and the Nutanix Volumes best practices guide for more
detailed information on load balancing. For AI architects, a DGX system is equivalent to the
NFS client.

Note: In our testing, we implemented Nutanix Files with the Sharded Directory option and configured an iSCSI data
services IP.
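
To make the idea of spreading directories across FSVMs concrete, here is a small illustrative sketch. It does not reproduce how Nutanix Files actually places sharded directories, which this document does not describe. The hash-based rule, the function names, and the FSVM names are assumptions chosen only to show one deterministic way to give each top-level directory a single owner.

```python
import hashlib
from collections import defaultdict


def owner(directory, fsvms):
    """Pick one FSVM for a top-level directory (hypothetical hash-based rule)."""
    digest = hashlib.sha256(directory.encode("utf-8")).digest()
    return fsvms[int.from_bytes(digest[:8], "big") % len(fsvms)]


def distribute(directories, fsvms):
    """Group top-level directories by the FSVM that would serve them."""
    placement = defaultdict(list)
    for directory in directories:
        placement[owner(directory, fsvms)].append(directory)
    return dict(placement)


if __name__ == "__main__":
    fsvms = ["fsvm-1", "fsvm-2", "fsvm-3"]  # hypothetical names
    dirs = [f"dataset-{i:02d}" for i in range(12)]
    for fsvm, owned in sorted(distribute(dirs, fsvms).items()):
        print(fsvm, owned)
```

The point of the sketch is that each directory has exactly one serving FSVM, so an NFS client such as a DGX system reading many directories spreads its requests across several FSVMs.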

Mellanox Availability Features

Mellanox SN2100 series switches are designed for high availability from both a software and
hardware perspective. Key high availability features include:

Color-coded PSUs and fans.


Up to 64x 10 or 25 GbE ports, 32x 50 GbE ports, or 16x 100 GbE ports.

MLAG for active-active L2 multipathing.

64-way equal-cost multipath (ECMP) routing for load balancing and redundancy.

1 + 1 power supplies.

The following table provides a summary of how Mellanox SN2100 switches maintain
availability during certain network failures.

Table. Network Failures Summary


Event | Detection | Action | Effect on Network
Leader down | Three continuous keepalives were lost and the leader isn’t visible on the management network. | Subordinate role changed to standalone. Flush all MLAG MACs. Flush all IPL MACs. | No traffic loss.
Leader up | IPL up and received leader keepalive. | Standalone role changed to subordinate. | No traffic loss.
Subordinate down | Three continuous keepalives are lost and the subordinate isn’t visible on the management network. | Flush any MACs the subordinate has learned. Flush all IPL MACs. | No traffic loss.
Subordinate up | IPL up and received subordinate keepalive. | Sync subordinate with leader tables. | No traffic loss.
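
The failure handling summarized in the table can be read as a small state machine. The sketch below is only a model of the table, not Mellanox switch software: the class, function, and event names are hypothetical, and it assumes that the surviving leader performs the actions listed for the subordinate events, which the table does not state.

```python
class PeerMonitor:
    """Declare the MLAG peer down after three consecutive missed keepalives,
    provided the peer is also not visible on the management network."""

    THRESHOLD = 3

    def __init__(self):
        self.missed = 0

    def tick(self, keepalive_received, visible_on_mgmt):
        if keepalive_received:
            self.missed = 0
            return False
        self.missed += 1
        return self.missed >= self.THRESHOLD and not visible_on_mgmt


# (role of the switch handling the event, event) -> (new role, actions)
TRANSITIONS = {
    ("subordinate", "leader_down"): (
        "standalone", ["flush all MLAG MACs", "flush all IPL MACs"]),
    ("standalone", "leader_up"): ("subordinate", []),
    ("leader", "subordinate_down"): (
        "leader", ["flush MACs learned by subordinate", "flush all IPL MACs"]),
    ("leader", "subordinate_up"): (
        "leader", ["sync subordinate with leader tables"]),
}


def handle(role, event):
    """Return (new_role, actions); unknown combinations leave the role unchanged."""
    return TRANSITIONS.get((role, event), (role, []))
```

For example, a subordinate whose monitor reports the leader as down moves to the standalone role and flushes its MLAG and IPL MAC entries. When the leader's keepalive returns over the IPL, the switch goes back to the subordinate role.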
