Solution Architecture
When architecting this solution, we took into consideration the following AI-specific
requirements:
Availability: The solution must have high availability and maintain availability during upgrades
and failure scenarios.
Recoverability: The solution must have a strategy for recovering AI workloads and restoring
data in case of a disaster, while also minimizing recovery point objectives (RPOs) and
recovery time objectives (RTOs).
Manageability: The solution must reduce administrative effort for day-one and day-two
operations.
Performance and scalability: The solution must scale out to increase resilience, performance,
and capacity without degrading the performance of running workloads.
Security: The solution must implement security policies across the full stack.
Storage: The solution must reduce administrative effort and maximize storage performance
regardless of the application workload.
Network: The solution must provide a network architecture that maximizes bandwidth and
decreases latency.
Availability
In this section, we discuss the common failure scenarios in the solution as well as how the
various hardware and software components increase infrastructure availability.
Nutanix can operate as either a single node or a cluster in which three or more
nodes share resources and distributed data, which increases application and storage
availability. As represented in the following figure, as AOS ingests data from the NVIDIA
DGX-1 system or application, it creates a local copy on the home node and distributes a
secondary copy to another node in the cluster. Then the system sends an acknowledgment
back to the ingesting application that the write operation is complete. Consequently, as the
application writes data, the system always stores a secondary copy on another node. The
number of copies the system keeps is the replication factor, which is set to 2 by default. An
administrator can set the replication factor to 3, which requires a minimum of five Nutanix
nodes but further increases availability.
Note: We selected replication factor 2 because it provides an acceptable level of availability for this architecture.
However, if customers are evaluating larger clusters, it may be useful to increase the replication factor to 3.
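The write path described above can be illustrated with a minimal Python sketch. The function name place_write, the node names, and the random choice of peer nodes are assumptions made for illustration only; they are not the placement logic AOS uses. The minimum node counts come from the text.

```python
import random

# Minimum cluster sizes taken from the text: clusters are three or more
# nodes, and replication factor 3 requires at least five nodes.
MIN_NODES = {2: 3, 3: 5}


def place_write(home_node, nodes, replication_factor=2, rng=random):
    """Return the nodes that hold a copy of one write, home node first.

    The home node keeps the local copy and the remaining copies go to
    other nodes. Picking peers at random is an assumption of this sketch.
    """
    if replication_factor not in MIN_NODES:
        raise ValueError("replication factor must be 2 or 3")
    if len(nodes) < MIN_NODES[replication_factor]:
        raise ValueError(
            f"replication factor {replication_factor} needs at least "
            f"{MIN_NODES[replication_factor]} nodes"
        )
    if home_node not in nodes:
        raise ValueError("home node is not part of the cluster")

    peers = [n for n in nodes if n != home_node]
    replicas = rng.sample(peers, replication_factor - 1)
    # The write is acknowledged only once every copy in this list exists.
    return [home_node] + replicas
```

With four nodes and replication factor 2, the call returns the home node plus one other node. Requesting replication factor 3 on the same four-node cluster raises an error because fewer than five nodes are available.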
A block is a rack-mountable enclosure that contains one to four Nutanix nodes. In multinode
blocks, the power supplies and the fans are the only components shared by nodes in a block.
When certain conditions are met, Nutanix cloud clusters are block aware, which means that
redundant copies of any data needed to serve I/O are placed on nodes that aren’t in the
same block, which maximizes the solution’s availability. When you scale your AI infrastructure,
it’s important to note that block awareness is applied automatically when all the following
conditions are met:
The cluster is three or more blocks (unless the cluster was created with replication factor 3, in
which case the cluster is five or more blocks).
Every storage tier in the cluster contains at least one drive on each block.
The storage tiers on each block in the cluster are of comparable size.
The size of cluster SSD tiers with replication factor 2 isn’t more than 33 percent different
across blocks.
The size of cluster SSD tiers with replication factor 3 isn’t more than 25 percent different
across blocks.
Tip: When you scale your AI infrastructure with Nutanix blocks, consider physically distributing blocks across racks to
further increase availability.
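The block awareness conditions listed above can be expressed as a small check. This is a sketch under stated assumptions: the input layout (a dictionary of blocks, each mapping a tier name to a list of drive sizes), the tier name "ssd", and the definition of "percent different" as the gap between the largest and smallest block relative to the largest are all hypothetical choices, not Nutanix definitions. Only the SSD tier is compared because the text gives thresholds only for SSD.

```python
def percent_difference(sizes):
    """Gap between largest and smallest value, as a percent of the largest.

    This definition is an assumption; the text does not say how the
    difference is measured.
    """
    largest = max(sizes)
    if largest == 0:
        return 0.0
    return (largest - min(sizes)) / largest * 100.0


def is_block_aware(blocks, replication_factor=2):
    """Check the stated conditions for automatic block awareness.

    blocks maps a block name to {tier name: [drive sizes]}.
    """
    # Three or more blocks, or five or more with replication factor 3.
    min_blocks = 5 if replication_factor == 3 else 3
    if len(blocks) < min_blocks:
        return False

    # Every storage tier must have at least one drive on each block.
    tiers = set()
    for drives_by_tier in blocks.values():
        tiers.update(drives_by_tier)
    for tier in tiers:
        for drives_by_tier in blocks.values():
            if not drives_by_tier.get(tier):
                return False

    # SSD tier sizes must stay within 33 percent (RF2) or 25 percent (RF3).
    limit = 25.0 if replication_factor == 3 else 33.0
    if "ssd" in tiers:
        ssd_sizes = [sum(b["ssd"]) for b in blocks.values()]
        if percent_difference(ssd_sizes) > limit:
            return False
    return True
```

A check like this could run before adding a block to confirm that the expanded cluster would still qualify.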
Node Availability
Because all Nutanix cloud clusters have at least three nodes, we used an additional node in
our solution to meet the requirement for n + 1 availability. This extra node allows us to handle
planned or unplanned events without operating in a degraded state. When you enable
Cluster High Availability in Prism, the system maintains this level of availability automatically.
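As a rough illustration of n + 1 sizing, the following Python sketch checks whether the nodes that remain after a failure can still carry the workload. The function name, the single capacity figure per node, and the worst-case choice of losing the largest node are assumptions for this example; this is not how Prism evaluates high availability.

```python
def survives_node_failures(node_capacity, total_demand, failures=1):
    """Return True if the cluster can carry total_demand after losing nodes.

    node_capacity is a list with one capacity value per node, in any
    consistent unit (for example GiB of memory). The worst case is
    assumed: the largest nodes are the ones that fail.
    """
    if failures >= len(node_capacity):
        return False
    # Keep the smallest nodes, which models losing the largest ones.
    remaining = sorted(node_capacity)[: len(node_capacity) - failures]
    return sum(remaining) >= total_demand
```

For example, four nodes of 512 units each can carry a demand of 1,400 units with one node down, while three such nodes cannot.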
Controller VM
Nutanix is a 100 percent software-defined solution that places a software storage controller
(the CVM) on each node in the cluster. The storage controller actively accepts I/O from
applications running locally on that node and participates in cluster-wide operations such as
replicating data, self-healing, and rebalancing data.
As a complement to the CVM, Nutanix Files also serves NFS and SMB requests from clients
and internal and external systems. Nutanix Files uses the CVM for reads and writes to
distributed storage, providing resilience (replication factor 2), data integrity, and scalability, as
detailed in the Nutanix Data Availability figure. Nutanix Files doesn’t need to be present on
every node; rather, it starts off with a minimum of three file server VMs (FSVMs) and
automatically scales out when needed. The following figure shows a high-level
representation of the relationship between the Nutanix CVM and FSVMs—specifically the
distribution of NFS exports and directories across multiple FSVMs.
Figure. High-Level Nutanix Files Architecture
Refer to the Nutanix Files tech note and the Nutanix Volumes best practices guide for more
detailed information on load balancing. For AI architects, a DGX system is equivalent to the
NFS client.
Note: In our testing, we implemented Nutanix Files with the Sharded Directory option and configured an iSCSI data
services IP.
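To make the idea of spreading directories across FSVMs concrete, here is a hypothetical Python sketch that maps each top-level directory to one FSVM with a stable hash. The hashing scheme, the function name assign_fsvm, and the FSVM names are assumptions for illustration; the source does not describe how Nutanix Files decides placement.

```python
import hashlib


def assign_fsvm(directory, fsvms):
    """Map a directory name to one FSVM using a stable hash.

    Hash-modulo placement is an assumption of this sketch, not the
    algorithm Nutanix Files implements.
    """
    if not fsvms:
        raise ValueError("at least one FSVM is required")
    digest = hashlib.sha256(directory.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(fsvms)
    return fsvms[index]
```

Because the hash is deterministic, a client such as a DGX system that asks for the same directory is always directed to the same FSVM, while different directories spread across the available FSVMs.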
Mellanox SN2100 series switches are designed for high availability from both a software and
hardware perspective. Key high availability features include:
64-way equal-cost multipath (ECMP) routing for load balancing and redundancy.
1 + 1 power supplies.
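ECMP generally balances load by hashing each flow's header fields onto one of several equal-cost next hops, so packets of one flow stay on one path. The Python sketch below illustrates that general idea only; the hash function, the five-tuple layout, and the next-hop names are assumptions, and the switch's actual hashing is not described in the source.

```python
import hashlib

MAX_ECMP_PATHS = 64  # "64-way" from the text


def ecmp_next_hop(flow, next_hops):
    """Pick a next hop for a flow by hashing its five-tuple.

    flow is (src_ip, dst_ip, protocol, src_port, dst_port).
    """
    if not next_hops:
        raise ValueError("no next hops available")
    if len(next_hops) > MAX_ECMP_PATHS:
        raise ValueError("more than 64 equal-cost paths")
    key = "|".join(str(field) for field in flow).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]
```

Redundancy follows from the same mechanism: when a path fails, it is removed from the next-hop list and affected flows hash onto the surviving paths.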
The following table provides a summary of how Mellanox SN2100 switches maintain
availability during certain network failures.
Leader down | Three continuous keepalives were lost and the leader isn't visible on the management network. | Subordinate role changed to standalone. Flush all MLAG MACs. | No traffic loss.
Leader up | IPL up and received leader keepalive. | Standalone role changed to subordinate. | No traffic loss.
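The two role transitions in the table above can be modeled as a small state machine. This Python sketch is illustrative only: the class name MlagPeer, its method names, and the flush counter are hypothetical, and it models only the subordinate switch's view as summarized in the table, not the switch firmware.

```python
from dataclasses import dataclass

KEEPALIVE_LOSS_LIMIT = 3  # "three continuous keepalives" from the table


@dataclass
class MlagPeer:
    """Subordinate-side view of the MLAG role transitions in the table."""

    role: str = "subordinate"
    missed_keepalives: int = 0
    mac_flushes: int = 0  # how many times MLAG MACs were flushed

    def keepalive_missed(self, leader_visible_on_mgmt):
        """Record one lost keepalive and apply the leader-down rule."""
        self.missed_keepalives += 1
        if (
            self.role == "subordinate"
            and self.missed_keepalives >= KEEPALIVE_LOSS_LIMIT
            and not leader_visible_on_mgmt
        ):
            self.role = "standalone"
            self.mac_flushes += 1

    def keepalive_received(self, ipl_up):
        """Record a leader keepalive and apply the leader-up rule."""
        self.missed_keepalives = 0
        if self.role == "standalone" and ipl_up:
            self.role = "subordinate"
```

Three missed keepalives with the leader unreachable on the management network move the peer to standalone and flush the MLAG MACs once; a later keepalive with the IPL up returns it to subordinate.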