NVD-2031: Hybrid Cloud 6.5 On-Premises Design
Contents
1. Executive Summary
   Audience
   Purpose
   Software Versions
   Document Version History
5. Ordering
   Substitutions
   Sizing Considerations
   Bill of Materials
6. Test Plan
7. Appendix
   Windows VM Performance Tuning
   Linux VM Performance Tuning
   Design Limits
   References
About Nutanix
List of Figures
Hybrid Cloud: AOS 6.5 with AHV On-Premises Design
1. Executive Summary
Nutanix continues to innovate and engineer solutions that are simple to deploy and
operate. To further improve customer experience and add value for customers, Nutanix
uses robust validation to simplify designing and deploying solutions. This document
discusses the design decisions that support deploying a scalable, resilient, and secure
private cloud solution with two datacenters for high availability and disaster recovery.
Nutanix can deliver this Nutanix Validated Design (NVD), based on the Nutanix Hybrid
Cloud Reference Architecture, as a bundled solution for general server virtualization that
includes hardware, software, and services to accelerate and simplify the deployment
and implementation process. The architecture includes an automation layer, a business
continuity layer (with backup and restore), a management layer (with capacity planning,
access control, monitoring, platform life-cycle management, and real-time analytics),
a security and compliance layer, and a cluster that contains a virtualization layer
(hypervisor, networking, and storage) and a physical layer (compute, networking, and
storage).
This scalable modular design, based on the Nutanix block-and-pod architecture, is well
suited to hybrid cloud use cases of all sizes. Some highlights of the NVD include:
• A full-stack solution for hybrid cloud deployments that integrates multiple products
including AOS, AHV, Prism Pro, Nutanix Cloud Manager (NCM) Self-Service (formerly
Calm), Flow Network Security, Nutanix Disaster Recovery, Mine, and HYCU
• Support for up to 7,500 general-purpose VMs per pod in a block-and-pod architecture,
which offers a repeatable and scalable design
• A multidatacenter design built for failure tolerance and 99.999 percent availability
• Active-active datacenters with two availability zones (AZs) that run at 50 percent
capacity to provide for full AZ failover in either direction
• Testing for both planned and unplanned full-site failover with standardized business
continuity and disaster recovery (BCDR) service levels
• NCM Self-Service automation through the NCM Self-Service marketplace that includes
blueprints for Windows, Linux, LAMP, and WISA as well as standardized VM sizes
Audience
This guide is part of the Nutanix Solutions Library and is intended for architects and
engineers responsible for scoping, designing, installing, and testing server virtualization
solutions. Readers of this document should already be familiar with the Nutanix Hybrid
Cloud Reference Architecture and Nutanix products.
Purpose
This document describes the components, integration, and configuration for the NVD
packaged hybrid cloud solution and covers the following topics:
• Core Nutanix infrastructure and related technology
• Backup and disaster recovery for the Nutanix platform and hosted applications
• NCM Self-Service automation and integration with third-party applications
• Bill of materials
Software Versions
Table: Software Versions Used in Validation Testing
Component | Software Version
Nutanix Prism Central | pc.2022.6.8
Nutanix AOS | 6.5.4
• Monitoring:
› Enable platform fault monitoring and use email to send alerts.
› Monitor performance metrics and store historical data for the past 12 months.
› Keep resource usage under 75 percent; usage over 75 percent generates an email
alert.
› Monitor resources critical to Nutanix AOS operations (for example, CPU, memory,
storage, and network resources); resource usage that exceeds configured limits
generates an alert.
› For resources with high-availability reservations, measure the resource usage
threshold against the usable capacity after subtracting the capacity reserved for
high availability.
› Monitor all network links (including host-switch and switch-switch) for bandwidth
utilization and store historical data for the past 12 months.
› Use email as the primary channel for event monitoring alerts.
› Ensure that event monitoring is resilient. For example, when the management plane
is the primary source of alerts, you need a secondary method for monitoring the
management plane itself. Then, if the management plane fails, an alert from the
secondary source can trigger the action to recover the management plane.
› Facilitate automated issue discovery and remote diagnostics.
• Monitoring:
› IT operations teams can continuously staff the mailbox that receives monitoring
alerts to address critical issues promptly.
› IT operations teams can provide email infrastructure with sufficient resilience to
send, receive, and access emails even during critical outages.
› Network security appliances allow the management plane to transmit telemetry data
to Nutanix.
• Infrastructure: IT operations teams can deploy Active Directory and DNS in a highly
available configuration in each management cluster.
Core infrastructure monitoring component design risk: If Prism Central becomes
unavailable for any reason, the platform can't send alerts. To mitigate this risk, configure
each Prism Element instance to send alerts as well. As this approach results in duplicate
alerts during normal operations, send Prism Element alerts to a different mailbox that you
can monitor when Prism Central is unavailable.
Core infrastructure design constraints by component:
• Clusters: The number of VMs per pod doesn't exceed 7,500 (the limit of Flow Network
Security policies per Prism Central instance).
• Monitoring: SMTP is an available channel in the environment that can receive event
monitoring alerts. Syslog captures logs but doesn't generate alerts on events.
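To make the resource-usage requirements above concrete, the following minimal Python sketch (with hypothetical numbers) measures memory usage against usable capacity after subtracting the n + 1 high-availability reserve, which is how this design expects the 75 percent alert threshold to be evaluated.

# Minimal sketch: evaluate the 75 percent usage threshold against usable capacity
# after subtracting the n + 1 high-availability reserve (hypothetical numbers).
NODES = 16                  # nodes in a workload cluster building block
MEMORY_PER_NODE_GB = 1536   # 24 x 64 GB DIMMs per workload node
HA_RESERVED_NODES = 1       # n + 1: one node reserved for failure or maintenance
ALERT_THRESHOLD = 0.75      # alert when usage exceeds 75 percent of usable capacity

raw_capacity_gb = NODES * MEMORY_PER_NODE_GB
usable_capacity_gb = (NODES - HA_RESERVED_NODES) * MEMORY_PER_NODE_GB
alert_at_gb = usable_capacity_gb * ALERT_THRESHOLD

used_gb = 18000             # example: current memory allocation across the cluster
usage_ratio = used_gb / usable_capacity_gb
print(f"Usable capacity: {usable_capacity_gb} GB (raw {raw_capacity_gb} GB)")
print(f"Alert threshold: {alert_at_gb:.0f} GB")
if usage_ratio > ALERT_THRESHOLD:
    print(f"ALERT: usage {usage_ratio:.0%} exceeds 75 percent of usable capacity")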
Scalability
Scalability is one of the core concepts of the Nutanix platform and refers to the ability to
increase storage and compute capacity to meet current and future workload demands. A
well-designed cluster meets current requirements while providing a path to support future
growth.
Because this NVD supports three general VM sizes, each node's memory is fully
populated to accommodate the resulting mixed memory requirements. Fully populating
the memory channels also provides maximum memory performance, even when you
don't need the full capacity. If memory
pressure increases, add more nodes. The design uses all-flash disks to accommodate
peak workload demands.
The design uses two racks in the on-premises datacenter, with redundant top-of-rack
network switches. One rack holds management and backup clusters, and the other
holds workload clusters. Datacenter power and cooling limitations might introduce further
constraints; for more information, see the Datacenter Infrastructure section.
When you scale VM workloads, cluster design is the biggest constraint.
Table: Scalability Design Decisions
Design Option | Validated Selection
Node memory population | Fully populate node memory.
Node drive type | Use all-flash drives.
Node drive population | Don't fully populate nodes with disk drives.
Dual rack | Use two racks per AZ.
Establish scalability boundaries | Use X-Ray to confirm load per node.
Rack availability | Don't use rack availability.
Configuration maximums also constrain solution scalability. For the limits specific to this
design, see the Design Limits section of this document or the configuration maximums
(maximum system values) on the Nutanix Support Portal (Nutanix Portal credentials
required). You might reach a design constraint before you reach a configuration
maximum. For example, a workload node that contains only Linux LAMP all-in-one VMs
can't hold more than 48 VMs, assuming that you can use 100 percent of the available
memory for VMs.
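To show how this constraint arises, the following sketch backs the 48-VM figure out of the workload node's memory configuration. The implied per-VM memory footprint is an inference from the numbers above, not a stated specification, and it ignores CVM and hypervisor overhead.

# Sketch: derive the per-node VM constraint from node memory
# (ignores CVM and hypervisor memory overhead).
node_memory_gb = 24 * 64           # workload node: 24 x 64 GB DIMMs = 1,536 GB
lamp_all_in_one_vms_per_node = 48  # constraint stated in the text above
implied_vm_memory_gb = node_memory_gb / lamp_all_in_one_vms_per_node
print(f"Implied LAMP all-in-one VM memory: {implied_vm_memory_gb:.0f} GB")   # 32 GB
# Conversely, for a given VM memory size:
vm_memory_gb = 32
max_vms_per_node = node_memory_gb // vm_memory_gb
print(f"Maximum VMs per node at {vm_memory_gb} GB each: {max_vms_per_node}")  # 48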
Resilience
Nutanix provides many resilience features, including storage replication, snapshots,
block awareness, degraded node detection, and self-healing. These capabilities
increase the resilience of all workloads, even if the application itself has limited resilience
options. Nutanix layers these software features on hardware designed to be resilient
(for example, with redundant physical components and power supplies, many of
which are hot-swappable or otherwise easily serviceable). Running workloads in a
virtualized environment adds another kind of resilience, as you can perform many
maintenance operations without application downtime. A resilient network fabric that can
sustain individual link, node, or block failures without significant impact completes the
architecture.
VM Design
As the overall objective is to provide a hybrid cloud environment for general server
virtualization workloads, this NVD establishes three standard VM sizes to facilitate
consistent deployment, automation, sizing, and capacity planning for the environment.
The Cluster Design section specifies the maximums for each VM size to help with
capacity planning, but you can combine VMs of any size in any proportion, up to the
maximums this architecture is designed to support. All VMs deploy with UEFI and
Secure Boot enabled.
VM Names
Nutanix recommends keeping the VM name and the guest OS host name the same. This
approach streamlines operational and support requirements and minimizes confusion
when you identify systems in the environment.
VM Guest Clustering
You can use VM guest clustering to form failover clusters using shared disk devices
with both Windows and Linux guest operating systems. With Nutanix AHV, you can use
a shared volume group between multiple VMs as part of a failover cluster—connect
the shared volume group to the VMs and install the necessary guest software. Nutanix
natively integrates SCSI-based fencing using persistent reservations and doesn't require
any complex configuration.
Note: This design targets an oversubscription ratio of four or fewer virtual CPUs per physical CPU.
Windows VMs
All Windows VMs in this NVD are based on Windows Server 2019 Datacenter Edition.
Windows VMs use the standard blueprints detailed in the following table when
provisioned with NCM Self-Service.
Note: Secure Boot–enabled VMs don't support hot-plug operations.
Note: Flow Network Security policies require the kNormal NIC type to function correctly.
Nutanix VirtIO Driver version 1.1.7 or later is required on all Windows Guest VMs.
The WISA (Windows Server, Internet Information Services, Microsoft SQL Server, and
ASP.NET) all-in-one blueprint installs and configures all necessary web, application, and
database components when deployed through NCM Self-Service.
The WISA distributed blueprint includes at least two load-balanced VMs for web servers,
two load-balanced application servers, and one database server. NCM Self-Service
provisions the individual VMs and installs their specific roles. The WISA distributed
blueprint predefines and automatically applies Prism Central categories and Flow
Network Security policies.
Refer to the appendix for Windows VM performance tuning recommendations.
Linux VMs
All Linux VMs in this NVD are based on Red Hat Enterprise Linux 8.4. Linux VMs use
the standard blueprints detailed in the following table when provisioned with NCM Self-
Service.
Note: Secure Boot–enabled VMs don't support hot-plug operations.
The LAMP (Linux, Apache, MySQL, and PHP) all-in-one blueprint has all necessary web,
application, and database components preinstalled and ready to deploy on demand as a
single VM through NCM Self-Service.
The LAMP distributed blueprint includes at least two load-balanced VMs for web servers,
two load-balanced application servers, and one database server. NCM Self-Service
provisions the individual VMs and installs their specific roles. The LAMP distributed
blueprint predefines and automatically applies Nutanix and Flow Network Security
policies.
For Linux VM performance tuning recommendations, see the appendix.
Cluster Design
This design incorporates three distinct cluster types:
• Management: Critical infrastructure and environment management workloads
• Workload: The building block for all general server virtualization workloads
• Backup: Backup storage for the workload and management components
This section defines the overall high-level cluster design, platform selection, capacity
management, scaling, and resilience. This design follows the block-and-pod architecture
defined in the Nutanix Hybrid Cloud Reference Architecture.
Platform Selection
The following table provides details regarding hardware platform selection.
Table: Platform Selection
Hardware or Service | Management Cluster | Workload Cluster | Backup (Mine) Cluster
Node type | NX-1175S-G8 | NX-3170-G8 | NX-8155-G8
Node count | 4 (increments of 1) | 4–16 per building block (increments of 4, up to 16 maximum) | 4 (increments of 1)
Processor | 1 Intel Xeon Gold 6326 16-core 185 W 2.9 GHz | 2 Intel Xeon Gold 5318Y 24-core 165 W 2.1 GHz (Ice Lake) | 2 Intel Xeon Gold 6326 16-core 185 W 2.9 GHz
Memory | 8 × 64 GB 3,200 MHz DDR4 RDIMM (512 GB total) | 24 × 64 GB 3,200 MHz DDR4 RDIMM (1.5 TB total) | 8 × 32 GB 3,200 MHz DDR4 RDIMM (256 GB total)
SSD | 2 × 1.92 TB | 6 × 3.84 TB | 2 × 3.84 TB
HDD | N/A | N/A | 8 × 18 TB
NIC | 25 GbE Dual SFP+ | 25 GbE Dual SFP+ | 25 GbE Dual SFP+
Form factor | 1RU of single nodes | 1RU of single nodes | 2RU of single nodes
Support | 3Y Production | 3Y Production | 3Y Production
Capacity Management
This NVD sizes the management and backup (Nutanix Mine Integrated Backup) clusters
to host typical workloads as defined in the Management Components and Backup
sections of this document. If those clusters need more resources, you can expand them
one node at a time. NCM Intelligent Operations can help forecast resource demand.
The main unit of expansion for workload clusters is the building block. In this design,
each workload cluster building block has a maximum of 16 nodes, with 15 nodes of
usable capacity and 1 node for failure capacity, and a minimum of 4 nodes with 3
usable (following the n + 1 principle). You can expand a workload cluster building block
in increments of 4 nodes, up to the maximum. Based on the small VM specification,
you can have a maximum of 1,860 VMs per workload cluster building block. When a
workload cluster building block reaches the maximum number of nodes, the administrator
starts a new building block with the 4-node minimum, then can expand the new block in
increments of 4 nodes as needed.
Each pod can support a maximum of 7,440 VMs. When a pod reaches the maximum
number of VMs, start a new pod. This NVD sets the workload cluster building block
maximum at 16 nodes to allow you to complete nondisruptive Nutanix software,
hardware, firmware, and driver maintenance using Nutanix LCM within a 48-hour
maintenance window (using Nutanix NX model hardware). You can split the maintenance
window into shorter segments if needed. You can also use a smaller maximum size per
workload building block to shorten maintenance windows and deploy more small clusters
per pod without changing the maximum number of nodes or VMs each pod supports.
For example, an 8-node workload cluster building block reduces maintenance windows
by half and provides twice the number of clusters per pod without changing the number
of nodes supported. However, the number of usable nodes decreases with the smaller
cluster size, as one node per cluster is logically reserved for maintenance and failure.
Software upgrades of AHV and AOS run for approximately 1 hour per node and firmware
(BMC, BIOS, host bus adapters) upgrades run for approximately 1.5 hours per node.
Note: Nutanix OEM partner hardware platforms might require more or less time depending on the specific
OEM partner recommendations.
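As a rough check on the 48-hour window, the following sketch applies the per-node durations above to a full 16-node building block. It assumes the 1-hour figure covers the combined AOS and AHV software updates for a node, which is the reading that keeps the estimate inside the stated maintenance window.

# Sketch: estimate a rolling LCM maintenance window for a 16-node building block
# (assumes ~1 hour of AOS + AHV software updates and ~1.5 hours of firmware per node).
nodes = 16
software_hours_per_node = 1.0   # AOS and AHV updates (one node at a time)
firmware_hours_per_node = 1.5   # BMC, BIOS, and host bus adapter firmware
total_hours = nodes * (software_hours_per_node + firmware_hours_per_node)
print(f"Estimated rolling maintenance window: {total_hours:.0f} hours")  # 40 hours
print(f"Fits the 48-hour window: {total_hours <= 48}")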
In the following figure, the first pod (Pod 1) reached capacity and the administrator
started a new pod (Pod n). If existing management clusters have enough capacity for the
additional pods, you can reuse them and not implement additional management clusters.
The following table displays the maximum number of VMs per workload cluster building
block and per node.
Table: Maximum Number of VMs
Scalability Consideration | Small VMs | Medium VMs | Large VMs
Maximum running VMs per workload cluster building block | 1,860 | 1,380 | 690
Maximum running VMs per node | 124 | 92 | 46
Maximum deployed VMs per workload cluster building block to ensure disaster recovery capacity | 930 | 690 | 345
Note: The maximum deployed VMs per workload cluster is 50 percent of the running maximum to ensure
disaster recovery capacity.
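The maximums in the preceding table follow directly from the per-node limits and the n + 1 building block sizing. The sketch below reproduces the arithmetic, assuming four 16-node building blocks per pod (which yields the 7,440-VM pod maximum stated earlier).

# Sketch: reproduce the building block and pod VM maximums from the per-node limits.
per_node_max = {"small": 124, "medium": 92, "large": 46}
nodes_per_building_block = 16
usable_nodes = nodes_per_building_block - 1   # n + 1: one node reserved
building_blocks_per_pod = 4                   # assumption: 4 x 1,860 = 7,440 VMs per pod

for size, per_node in per_node_max.items():
    running_max = per_node * usable_nodes     # maximum running VMs per building block
    deployed_max = running_max // 2           # 50 percent reserved for DR failover
    print(f"{size}: {running_max} running / {deployed_max} deployed per building block")

pod_max = per_node_max["small"] * usable_nodes * building_blocks_per_pod
print(f"Pod maximum (small VMs): {pod_max}")  # 7,440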
Cluster Resilience
Replication factor 2 protects against the loss of a single component in case of failure or
maintenance. During a failure or maintenance scenario, Nutanix rebuilds any data that
falls out of compliance much faster than traditional RAID data protection methods, and
rebuild performance increases linearly as the cluster grows.
The Nutanix architecture rapidly recovers in the event of failure and has no single points
of failure. You can configure the cluster to maintain three copies of data; however, for
general server virtualization, Nutanix recommends that you distribute application and VM
components across multiple clusters to provide greater resilience at the application level.
You can achieve rack-aware resilience when you split clusters evenly across at least
three racks, but this NVD doesn't use that approach because it adds configuration and
operational complexity. Nutanix cluster replication factor 2 in this design is sufficient to
exceed five nines of availability (99.999 percent).
Storage Design
Nutanix uses a distributed, shared-nothing architecture for storage. For details on
Nutanix storage constructs, see the Storage Design section in the Nutanix Hybrid
Cloud Reference Architecture. For information on node types, counts, and physical
configurations, see the Cluster Design section in this document.
Creating a cluster automatically creates the following storage containers:
• NutanixManagementShare: Used for Nutanix features like Files and Objects and other
internal storage needs
This storage container doesn't store workload vDisks.
Network Design
A Nutanix cluster can tolerate multiple simultaneous failures because it maintains a set
redundancy factor and offers features such as block awareness and rack awareness.
However, this level of resilience requires a highly available network connecting a cluster's
nodes.
Nutanix clusters send each write to another node in the cluster. As a result, a fully
populated cluster sends storage replication traffic in a full mesh, using network bandwidth
between all Nutanix nodes. Because storage write latency directly correlates to the
network latency between Nutanix nodes, any increase in network latency adds to storage
write latency. Protecting the cluster's read and write storage capabilities requires highly
available connectivity between nodes. Even with intelligent data placement, if network
connectivity between multiple nodes is interrupted or becomes unstable, VMs on the
cluster can experience write failures and enter read-only mode.
Nutanix recommends using datacenter-grade switches designed to handle high-
bandwidth server and storage traffic at low latency. For more information, see the Nutanix
physical networking best practice guide.
Network Microsegmentation
Flow Network Security enables VM- and application-based microsegmentation for traffic
visibility and control. This NVD uses Flow Network Security to protect the environment
from network attacks, create strict traffic controls that segment the network, and gain
visibility into application network behavior for all AHV hosts managed by a single Prism
Central instance. Flow Network Security applies all categories and security policies
uniformly across all clusters and VMs in this Prism Central instance.
When two Prism Central instances exist—for example, for disaster recovery or scalability
—the categories and policies don't replicate between them, so the designer and
administrator need to create a system to either automatically or manually sync categories
and policies between sites. This NVD uses a script to sync Flow Network Security
policies and categories between Prism Central instances in different AZs. To enable
Flow Network Security synchronization between two Prism Central instances, follow the
procedure in KB 12253.
We created a list of requirements and limitations to keep in mind when protecting your
on-premises hybrid cloud environment with Flow Network Security:
• Flow Network Security relies on AHV to enforce policies in the hypervisor virtual
switch, so you can't use it to protect VMs running on ESXi or Hyper-V hypervisors.
• Nutanix Flow Network Security can secure fewer VMs than the maximum number of
VMs that Prism Central can manage. Consider these scalability limits when you design
clusters and Prism Central deployments.
• Consider the maximum number of VMs protected in a single policy when you design
the individual security policies. For detailed requirements and limitations, see the Flow
Network Security guide.
• In this NVD, all applications exist inside the same environment, so you don't need
isolation policies.
• You can achieve good application security by strictly controlling the inbound side of
the policy and allowing all traffic on the outbound side, but your situation might require
stricter outbound traffic regulation. If you don't have a physical north-south firewall
available for this task, the Flow Network Security application policy can perform this
function.
• You must determine whether you need policy hit logs to track allowed and blocked
connections. This NVD enables policy hit logs for all policies.
• This NVD creates application VMs through NCM Self-Service with the appropriate
AppType and AppTier categories assigned. You can also use external automation with
our APIs. When disaster recovery replicates VMs to another site, the categories also
replicate.
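For deployments that use external automation instead of NCM Self-Service, the following hedged Python sketch shows one way to assign AppType and AppTier categories to an existing VM through the Prism Central v3 REST API. The Prism Central address, credentials, VM UUID, and category values are illustrative assumptions, and the categories themselves must already exist in Prism Central.

# Hedged sketch: assign categories to an existing VM through the Prism Central v3 API
# (GET the VM, update metadata.categories, PUT it back). All values are examples.
import requests

PC = "https://prism-central.example.local:9440"   # hypothetical Prism Central FQDN
AUTH = ("svc_automation", "password")             # hypothetical service account
VM_UUID = "00000000-0000-0000-0000-000000000000"  # replace with the target VM UUID

session = requests.Session()
session.auth = AUTH
session.verify = False                            # use the internal CA bundle in production

vm = session.get(f"{PC}/api/nutanix/v3/vms/{VM_UUID}").json()
payload = {
    "api_version": vm["api_version"],
    "metadata": vm["metadata"],                   # includes the required spec_version
    "spec": vm["spec"],
}
payload["metadata"]["categories"] = {
    "AppType": "AZ01-Example-0001",               # example AppType from this design
    "AppTier": "App",
}
resp = session.put(f"{PC}/api/nutanix/v3/vms/{VM_UUID}", json=payload)
resp.raise_for_status()
print("Category update accepted:", resp.status_code)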
Table: Flow Network Security Design Decisions
Design Option | Validated Selection
Number of Prism Central instances | Use two Prism Central instances (one per AZ) and use a script to replicate security policies between Prism Central instances.
VM scale per Prism Central instance | Limit VM scale to 7,500 per Prism Central instance.
Isolation policies | Don't use isolation policies.
Application inbound and outbound traffic | Use inbound security policies and allow all outbound traffic.
Category creation | Create a unique AppType category for each application and reuse AppTier categories.
For each application policy, determine the inbound traffic required to this application
and whether the traffic is from another VM in the Nutanix environment or external. Next,
determine the required traffic between tiers of the application and whether you should
allow traffic within the same tier. Finally, decide whether you should allow outbound traffic
for this application.
The following table provides an example security policy for a single application.
Modify the name and specific addresses or categories based on the application you're
protecting.
Note: Enable hit logs for all policies.
For all inbound traffic rules concerning VMs that exist in the Nutanix environment but
aren't part of an existing AppType, create new top-level categories that you can use to
add the relevant VMs to the policy as sources. For sources and destinations that don't
exist in any Nutanix cluster, create an Addresses entity to group these networks and IP
addresses for easy policy management.
This NVD creates address groups with the following addresses of corporate servers and
clients to allow differentiated access for devices that don't run as AHV VMs. Replace
these placeholders with addresses specific to your deployment.
Table: Address Groups
Name | Addresses | Purpose
AddrCorpAll | 10.0.0.0/8 | Identify all corporate IP addresses.
AddrCorpClient | 10.50.0.0/16 | Identify all IP addresses that belong to corporate client devices.
AddrCorpServer | 10.38.0.0/16 | Identify all IP addresses that belong to corporate server devices.
The following application security policies protect infrastructure VMs that run on AHV.
This infrastructure is unique for each site, so you don't need to replicate the policies
between Prism Central instances. Create these infrastructure policies in each Prism
Central instance.
Table: Active Directory Application Security Policy InfraAD-001
Purpose | Source | Destination | Port / Protocol
Allow all corp to Active Directory | AddrCorpAll | AppType: ActiveDirectory | See Microsoft documentation
Allow Active Directory out | AppType: ActiveDirectory | Allow all | All
For more information on Active Directory, see Microsoft's Active Directory and Active
Directory Domain Services Port Requirements article.
Table: Syslog Application Security Policy InfraSyslog-001
Purpose | Source | Destination | Port / Protocol
Allow corp servers to syslog | AddrCorpServer | AppType: Syslog | UDP 6514, TCP 6514
Allow corp clients to syslog | AddrCorpClient | AppType: Syslog | TCP 9000, UDP 514, TCP 514
Allow syslog out | AppType: Syslog | Allow all | All
This NVD modifies the forensic quarantine policy to allow quarantine of specific VMs
while also allowing access from the security operations team. VMs owned by the security
operations team for the explicit purpose of digital forensics and incident response have
the category Security: DFIR. Update this policy in both Prism Central instances.
Table: Quarantine Security Policy
Management Components
Management components such as Prism Central, Active Directory, DNS, and NTP are
critical services that must be highly available. Prism Central is the global control plane
for Nutanix, responsible for VM management, replication, application orchestration,
microsegmentation, and other monitoring and analytics functions. You can deploy Prism
Central in either a single-VM or scale-out (three-VM) configuration.
When you design your management components, decide how many Prism Central
instances you need. This NVD uses a scale-out Prism Central instance in each AZ, for a
total of two Prism Central instances. This setup provides better scalability and increased
disaster recovery functionality when you use additional Nutanix portfolio products, such
as Flow Network Security, NCM Self-Service, and Objects.
Monitoring
Monitoring in the NVD falls into two categories: event monitoring and performance
monitoring. Each category addresses different needs and issues.
In a highly available environment, you must monitor events to maintain high service
levels. When faults occur, the system must raise alerts promptly so that administrators
can take remediation actions as soon as possible. This NVD configures the Nutanix
platform's built-in capability to generate alerts in case of failure.
In addition to keeping the platform healthy, maintaining resource usage is also essential
to delivering a high-performing environment. Performance monitoring continuously
captures and stores metrics that are essential when you need to troubleshoot application
performance. A comprehensive monitoring approach should track metrics for the
following areas:
• Application and database
• Operating system
• Hyperconverged platform
• Network environment
• Physical environment
By tracking a variety of metrics in these areas, the Nutanix platform can also provide
capacity monitoring across the stack. Most enterprise environments inevitably grow,
so you need to understand resource utilization and the rate of expansion to anticipate
changing capacity demands and avoid any business impact caused by a lack of
resources.
To cover situations where Prism Central might be unavailable, each Nutanix cluster in
this NVD sends out notifications using SMTP as well. The individual Nutanix clusters
send alerts to a different receiving mailbox that's only monitored when Prism Central isn't
available.
By default, Prism Central captures cluster performance in key areas such as CPU,
memory, network, and storage utilization. When a Prism Central instance manages a
cluster, Prism Central transmits all Pulse data, so it doesn't originate from individual
clusters. When you enable Pulse, it detects known issues affecting cluster stability and
automatically opens support cases.
The network switches that connect the cluster also play an important role in cluster
performance. A separate monitoring tool that's compatible with the deployed switches
can capture switch performance metrics. For example, an SNMP-based tool can
regularly poll counters from the switches.
SMTP alerting Prism Element recipient email address | Configure the Prism Element recipient email address to be secondaryalerts@<yourdomain>.com.
Security Domains
Nutanix recommends isolating the management and backup clusters from the rest of
the network with firewalls. Backup clusters are often prime targets for compromise. When
designing your network security architecture, remember that the backup and workload
clusters have significant traffic between them. In addition, Nutanix management IPMI
interfaces must be directly accessible only from the management domain.
AOS Hardening
This NVD enables additional nondefault hardening options in each AOS cluster: AIDE (Advanced Intrusion Detection Environment) and hourly SCMA (Security Configuration Management Automation).
Syslog
For each control plane endpoint, system-level internal logging goes to a centralized third-
party syslog server that runs in the local management cluster in each AZ. The system
sends logs for all available modules when they reach the syslog Error severity level.
TCP transport using TLS is preferred where available. Syslog coverage extends to
microsegmentation event logging from Prism Central with Flow Network Security.
Note: This NVD assumes that the centralized syslog servers in each AZ can replicate log messages
between sites, allowing inspection if the primary log system is unavailable.
Certificates
SSL endpoints serve all Nutanix control plane web pages. This NVD replaces the default
self-signed certificates with certificates signed by an internal certificate authority from a
Microsoft public key infrastructure (PKI). Any client endpoints that interact with the control
plane should have the trusted certificate authority chain preloaded, preventing browser
security errors.
Note: Certificate management is an ongoing activity, and certificates need to be rotated periodically. The
NVD signs all certificates for one year of validity.
Data-at-Rest Encryption
Nutanix AOS can perform data-at-rest encryption (DaRE) at the cluster level; however,
as the NVD doesn't have a stated requirement that warrants enabling it, this design
doesn't use it. If requirements change, you can enable DaRE nondisruptively after cluster
creation and data population. Once you enable DaRE, existing data is encrypted in place
and all new data is written in an encrypted format.
To enable DaRE, you must also deploy an encryption key management solution.
Even if you decide not to use DaRE, you can still use in-guest encryption techniques
such as system-level encryption, database encryption (for example, Microsoft SQL
Server Transparent Data Encryption (TDE)), or encrypted file storage. However, in-guest
encrypted data rarely compresses well; because this design enables compression, using
in-guest encryption might reduce the amount of effective storage available.
Table: Security Design Decisions
Design Option | Validated Selection
DaRE | Disable DaRE; don't deploy a key management server.
SSL endpoints | Sign control plane SSL endpoints with an internal certificate authority (Microsoft PKI).
Certificates | Provision certificates with a yearly expiration date and rotate accordingly.
Authentication | Use Active Directory LDAPS authentication (port 636).
Control plane endpoint administration | Use a common administrative Active Directory group for all control plane endpoints.
Cluster lockdown mode | Don't enable cluster lockdown mode (allow password-driven SSH).
Nondefault hardening options | Enable AIDE and hourly SCMA.
System-level internal logging | Enable error-level logging to an external syslog server for all available modules.
Syslog delivery | Use UDP transport for syslog delivery.
Datacenter Infrastructure
This design assumes that datacenters in the hosting region can sustain two AZs without
intraregional fate-sharing—in other words, that failures in one datacenter's physical
plant or supporting utilities don't affect the other datacenter. To ensure high availability
at the physical level, this NVD addresses the connection between the hardware running
Nutanix software and the datacenter's power and networking components.
Rack Design
This NVD recommends dedicating one rack to the management and backup clusters,
and another rack to the workload clusters. Give each rack a pair of 10 or 25 Gbps
datacenter switches and a 1 Gbps out-of-band management switch. You can add
more racks as needed, depending on top-of-rack network switch density as well as the
datacenter's power, weight, and cooling density capabilities per square foot.
Rack 1 has the following requirements, assumptions, and constraints:
• Two top-of-rack switches
• One management switch
• Minimum power: 4,660 VA
• Minimum thermal: 15,888 BTU per hour
• Minimum weight: 365 lb
• Nutanix Mine cluster (four 2RU NX-8155-G8)
• Management cluster (four 1RU NX-1175-G8)
Rack 2 has the following requirements, assumptions, and constraints:
• Two top-of-rack switches
• One management switch
• Minimum power: 12,672 VA
• Minimum thermal: 43,216 BTU per hour
• Minimum weight: 654 lb
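The power and thermal minimums above are related by the standard conversion of 1 W to roughly 3.412 BTU per hour. The short sketch below checks Rack 2's numbers and can be reused when you add nodes, assuming a power factor near 1 so that VA approximates watts.

# Sketch: relate rack power (VA) to thermal load (BTU/hr), assuming a power factor near 1.
BTU_PER_WATT_HOUR = 3.412

def thermal_btu_per_hour(power_va: float, power_factor: float = 1.0) -> float:
    """Approximate heat output for a given apparent power draw."""
    return power_va * power_factor * BTU_PER_WATT_HOUR

print(f"Rack 2 thermal estimate: {thermal_btu_per_hour(12672):,.0f} BTU/hr")
# ~43,237 BTU/hr, which lines up with the stated minimum of 43,216 BTU per hour.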
When you scale the environment, consider physical rack space, network port availability,
and the datacenter's power and cooling capacity. In most environments the workload
clusters are the most likely to grow, followed by the backup clusters.
For physical rack space, this design reserves 3RU at the top of a generic 42RU rack
for two data switches and one out-of-band switch. Populate the Nutanix nodes starting
at the bottom of the rack and working up.
For network ports, each of the nodes requires two ports on the datacenter switches and
one port on the out-of-band management switches. Each node must use the top-of-rack
switches in the same rack.
For power, cooling, and weight, you need the minimums specified; account for future
expansion. Datacenter selection is beyond the scope of this design; have a conversation
about fully loaded racks with datacenter management before the initial deployment.
Planning to properly support the environment's long-term growth might change where in
the facility you want to set up the equipment.
Disaster Recovery
The following sections provide a logical and detailed overview of this NVD's disaster
recovery solution.
The following two tables provide details on protection policy configuration for 2,000 VMs.
Each protection policy has VMs located in a single AZ.
Table: Protection Policy Configuration for AZ01
Policy Name | Category | No. of VMs | Source Cluster | Target Cluster | RPO
AZ01-AZ02-Bronze-01 | AZ01-DR-Bronze-01 | 375 | AZ01-CLS-01 | AZ02-CLS-01 | 2h
AZ01-AZ02-Bronze-02 | AZ01-DR-Bronze-02 | 375 | AZ01-CLS-02 | AZ02-CLS-02 | 2h
AZ01-AZ02-Gold-01 | AZ01-DR-Gold-01 | 125 | AZ01-CLS-01 | AZ02-CLS-01 | 0 min.
AZ01-AZ02-Gold-02 | AZ01-DR-Gold-02 | 125 | AZ01-CLS-02 | AZ02-CLS-02 | 0 min.
The following two tables provide detailed information about recovery plans. To simplify
failover and failback, the design assigns VMs to a recovery plan based on their AZ. For
example, VMs located in AZ01 are assigned to the recovery plan for AZ01. You can
implement all recovery plans for a specific AZ in parallel because each AZ covers a
maximum of 1,000 VMs and the product maximum for VMs recovered in parallel is 1,000.
For other product maximums, see the appendix.
Table: Details of Recovery Plans for AZ01 VMs
Name | Stage | VM Category | Delay | Source Network | Failover Networks | Test Failover Network | No. of VMs
AZ01-AZ02-Bronze-01 | Stage1 | AZ01-DR-Bronze-01 | 0 | Source-PG | Failover-PG | Test-PG | 375
AZ01-AZ02-Bronze-02 | Stage1 | AZ01-DR-Bronze-02 | 0 | Source-PG | Failover-PG | Test-PG | 375
AZ01-AZ02-Gold-01 | Stage1 | AZ01-DR-Gold-01 | 0 | Source-PG | Failover-PG | Test-PG | 125
AZ01-AZ02-Gold-02 | Stage1 | AZ01-DR-Gold-02 | 0 | Source-PG | Failover-PG | Test-PG | 125
The following two tables provide details about mapping between protection policies,
recovery plans, and categories for 2,000 VMs.
Table: Protection Policy to Recovery Plan Mapping for AZ01
Policy Name | Recovery Plan Name | Category Name | RPO | RTO | No. of VMs
AZ01-AZ02-Bronze-01 | AZ01-RP-Bronze-01 | AZ01-DR-Bronze-01 | 2h | 4h | 375
AZ01-AZ02-Bronze-02 | AZ01-RP-Bronze-02 | AZ01-DR-Bronze-02 | 2h | 4h | 375
AZ01-AZ02-Gold-01 | AZ01-DR-Gold-01 | AZ02-DR-Gold-01 | 0 min. | 2h | 125
AZ01-AZ02-Gold-02 | AZ01-DR-Gold-02 | AZ02-DR-Gold-02 | 0 min. | 2h | 125
Note: An RPO of 2 hours has a snapshot and replication schedule of 1 hour, assuming that the most recent
in-flight snapshot replication might fail part way through.
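The note reflects a simple worst case: at failure time, the in-flight replication can be lost along with everything since the previously completed snapshot. The sketch below shows that reasoning for the Bronze tier, using the 1-hour interval from the note.

# Sketch: worst-case data loss when the most recent in-flight replication fails.
snapshot_interval_hours = 1   # Bronze snapshot and replication schedule from the note
# Worst case: the replication in flight is lost (up to one interval of data) plus up to
# one more interval elapsed since the last successfully replicated snapshot.
worst_case_rpo_hours = 2 * snapshot_interval_hours
print(f"Worst-case data loss: {worst_case_rpo_hours} hours")  # matches the 2-hour RPO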
Backup
The following section provides a logical and detailed overview of this NVD's backup
solution. It also covers the logical design and use of Nutanix Mine Integrated Backup as
an external backup system and Nutanix Objects as the backup target storage.
The backup design involves the following components and networks:
• CVM
• AHV
• Object networks (storage and client)
• HYCU backup VMs
Nutanix Mine is a high-performance backup target that's compatible with S3 storage.
S3-compatible storage provides advanced security features to help protect data against
common security threats.
The object stores on the Nutanix Mine cluster in each AZ that hosts backup data
(AZ01Backup01.domain.local and AZ02Backup01.domain.local) have the following
resources:
• Two load balancer VMs with 2 vCPU and 4 GB of memory each
• Three worker VMs with 10 vCPU and 32 GB of memory each
• Maximum available storage allocated
The Nutanix S3 bucket has versioning turned off and 365 days of WORM. Name buckets
using the AZ##Backup## convention and append -copy to copies. The number of S3
buckets depends on the number of backup VMs. The backup VM, the S3 bucket for
primary backup data, and the S3 bucket for the backup data copy on the remote site
have 1:1 matching.
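A small sketch of the naming convention described above, generating matched primary and copy bucket names per AZ; the counts are placeholders.

# Sketch: generate bucket names that follow the AZ##Backup## convention, with a
# matching -copy bucket on the remote site for each primary backup bucket.
def backup_bucket_names(az: int, index: int) -> tuple[str, str]:
    primary = f"AZ{az:02d}Backup{index:02d}"
    return primary, f"{primary}-copy"

for i in range(1, 3):  # placeholder: one bucket pair per backup VM
    primary, copy = backup_bucket_names(az=1, index=i)
    print(primary, "->", copy)  # e.g., AZ01Backup01 -> AZ01Backup01-copy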
• Integrate with:
› IPAM for configuring VM addresses
› Directory services for authentication
› Backup for VM protection
› Datacenter load balancers for configuring application VIP addresses
• Integration:
› IPAM infrastructure has sufficient resilience for the system to request, register, and
release IP addresses, even during critical outages.
› Directory services infrastructure has sufficient resilience for adding and removing
VMs, even during critical outages.
› Backup services infrastructure has sufficient resilience for backing up and restoring
VMs, even during critical outages.
› Email infrastructure has sufficient resilience to send, receive, and access emails,
even during critical outages.
› Load balancer infrastructure has sufficient resilience for handling API requests,
even during critical outages.
› Blueprints are also available in a source code management system.
NCM Self-Service with automation risks by component:
• NCM Self-Service:
› During NCM Self-Service upgrades, the service is unavailable.
› During NCM Self-Service downtime, the service is unavailable.
› Single-instance NCM Self-Service is a single point of failure.
› In the event of a disaster, applications recovered in another Prism Central instance
are unavailable in NCM Self-Service until you run the Prism Central–to–Prism
Central sync script.
• Integration:
› During IPAM downtime, new NCM Self-Service deployments might fail.
› During directory service downtime, new NCM Self-Service deployments might fail.
› During load balancer downtime, new NCM Self-Service deployments that need load
balancing might fail.
NCM Self-Service with automation constraints by component:
• NCM Self-Service:
› Blueprints must use existing approved VM templates.
› VM names must adhere to existing naming conventions.
› Virtual hardware has a maximum of three sizes.
• Integration:
› The IPAM solution is Infoblox.
› The backup solution is Nutanix Mine with HYCU.
› The network security solution is Nutanix Flow Network Security.
› The BCDR solution is Nutanix Disaster Recovery.
› The directory service is Microsoft Active Directory.
› The load balancer is F5.
Table: NCM Self-Service with Automation Design Decisions
Design Option | Validated Selection
NCM Self-Service deployment model | Use a standalone single virtual appliance.
NCM Self-Service deployment size | Use a large NCM Self-Service deployment.
NCM Self-Service recoverability | Protect NCM Self-Service using a Nutanix Disaster Recovery protection policy and a recovery plan as well as a Mine category for backup and archiving.
NCM Self-Service project structure and role-based access control (RBAC) configuration | Don't use the default NCM Self-Service project; instead, use dedicated NCM Self-Service projects with RBAC based on your Nutanix Services architecture workshop.
Active Directory authentication | Use Active Directory for NCM Self-Service access and project RBAC.
NCM Self-Service policy engine | Enable the NCM Self-Service policy engine; it's required for functionalities like quotas.
NCM Self-Service showback | Enable showback for Nutanix AHV provider accounts.
Figure 15: NCM Self-Service with Automation Conceptual Design for Hybrid Cloud
application target. Each project has blueprints presented to the end users in the Nutanix
marketplace.
Accounts
NCM Self-Service needs at least one provider account so that projects can deploy
workloads. By default, enabling NCM Self-Service in Prism Central creates the
NTNX_LOCAL_AZ account. This account automatically discovers the AHV clusters
registered in Prism Central. Because the NVD uses a standalone NCM Self-Service
instance, no clusters are registered in Prism Central in this case. This NVD adds two
Nutanix accounts that connect to the Prism Central instances that manage the clusters in
each AZ.
Projects
Projects are like tenants, delivering governance and multitenancy. Usually, projects are
aligned with environments (development or production), operating systems (Windows or
Linux), departments (human resources or finance), or applications (Exchange or SAP).
A project must have at least one account, RBAC using Prism Central roles and Active
Directory, and an environment before project users can request workloads from the
marketplace. This NVD has one project for blueprint design and four projects to validate
the security aspects of tenant workloads.
Blueprints
Blueprints are project-specific and define how to automate workload deployment. An
important part of this design is the use of Prism Central categories in a blueprint to drive
DevSecOps and help prevent ransomware. To make a blueprint available for other
projects, you must publish it in the marketplace. This NVD has four blueprints: Windows,
Linux, WISA, and LAMP. All the blueprints integrate with IPAM, Active Directory, and
email. WISA and LAMP also integrate with the load balancer.
Marketplace
When projects submit blueprints for approval, an administrator reviews, categorizes, and
assigns a version number to them. After the administrator publishes a blueprint, they can
assign it to projects for consumption through the marketplace page. This NVD uses two
projects with different blueprint assignments to validate its security.
Integrations
Integrations are part of the blueprint and occur at different stages of the life cycle. In
this NVD, most integrations use NCM Self-Service Escript (a Python library), with some
instances of PowerShell and Shell scripts for integrations where only a CLI is available.
Categories
Security policies with Flow Network Security, protection and recovery policies with
Nutanix Disaster Recovery, backup policies with Nutanix Mine, and HYCU archiving to
Nutanix Objects all use Prism Central categories. Using categories in blueprints helps
prevent ransomware because every deployment is secure by design.
NCM Self-Service
NCM Self-Service is a standalone instance in this NVD.
Table: Self-Service with Automation Deployment Model
Setting | Value
Deployment | One instance of NCM Self-Service on AHV
Resources | 10 vCPU, 52 GB of memory, 581 GB disk
Network | Management (requires IP address)
Protection | Nutanix Disaster Recovery protection policy and recovery plan
This NVD uses the following settings to protect the NCM Self-Service virtual appliance
from a disaster scenario.
Table: Nutanix Categories
Category Name | Value | Assigned | Used By
AppType | CalmAppliance | Calm_on_AHV (VM) | Nutanix Disaster Recovery
AZ01-Backup-01 | RPO24h | Calm_on_AHV (VM) | Nutanix Mine (backup)

Setting | Value
Policy name | AZ01-AZ02-Calm
Category | AppType: CalmAppliance
Source cluster | AZ01-MGMT-01
Target cluster | AZ02-MGMT-01
RPO | 15 min.
For the blueprints used in this design, see the Nutanix Validated Designs GitHub
repository.
The blueprints in the following table all use the IPAM, Active Directory, and email
integrations. All blueprints except the two single-service blueprints also require load
balancing.
Table: NCM Self-Service with Automation Blueprints
Blueprint | Component | Software Version | Component Dependencies
Windows | Single service | Windows Server 2019 | N/A
Linux | Single service | CentOS 8.2 | N/A
WISA | Load balancer | BIG-IP 16.1.0 Build 0.0.19 Final | Scale-out web
WISA | Scale-out web | Windows Server 2019 + IIS 10 | Database
WISA | Database | Windows Server 2019 + SQL Server 2019 | N/A
LAMP | Load balancer | TBD | Scale-out web
LAMP | Scale-out web | CentOS 8 + PHP 8 | Database
LAMP | Database | CentOS 8 + MariaDB 10.6 | N/A
Directory Services
This NVD adds every workload provisioned using NCM Self-Service to Active Directory
and uses Windows Server 2019 for Active Directory integration.
IPAM
Based on user input, Infoblox provides every workload provisioned using NCM Self-
Service with an IP address in the selected network and configured DNS.
Table: NCM Self-Service with Automation IPAM Connection Details
Connection | Details
Infoblox API | X.X.X.X or FQDN
Username | svc_selfservice
Password | xxx
Networks | TBD
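The IPAM integration above can be scripted in several ways; the following hedged Python sketch illustrates one common pattern against the Infoblox WAPI, creating a host record with the next available IP in a network. The WAPI version, URL, credentials, hostname, and network shown here are assumptions to adapt to your Infoblox deployment.

# Hedged sketch: request the next available IP for a new VM by creating an Infoblox
# host record through the WAPI. Endpoint, version, and values are assumptions.
import requests

INFOBLOX = "https://infoblox.example.local/wapi/v2.10"  # hypothetical grid master URL
AUTH = ("svc_selfservice", "password")                  # service account from the table above

def create_host_record(fqdn: str, network_cidr: str) -> str:
    """Create a DNS host record and let Infoblox pick the next available IP."""
    body = {
        "name": fqdn,
        "configure_for_dns": True,
        "ipv4addrs": [{"ipv4addr": f"func:nextavailableip:{network_cidr}"}],
    }
    resp = requests.post(f"{INFOBLOX}/record:host", json=body, auth=AUTH, verify=False)
    resp.raise_for_status()
    ref = resp.json()  # WAPI returns the reference of the created object
    detail = requests.get(f"{INFOBLOX}/{ref}", params={"_return_fields": "ipv4addrs"},
                          auth=AUTH, verify=False).json()
    return detail["ipv4addrs"][0]["ipv4addr"]

print(create_host_record("web01.domain.local", "10.38.10.0/24"))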
Load Balancing
WISA and LAMP workloads integrate with the F5 load balancer for the web server tier.
Table: NCM Self-Service with Automation Load Balancer Connection Details
Connection | Details
F5 API | X.X.X.X or FQDN
Username | svc_selfservice
Password | xxx
This NVD uses the AddrLoadBalancer (X.X.X.X/32) address group in a Flow Network
Security policy to let the F5 load balancer send HTTP and HTTPS traffic to the application
tier. In the AddrLoadBalancer example load-balancing security policy, the destination is
all applications in the AppType: AZ01-Example-0001 and AppTier: App categories, and
the port and protocol are TCP 80 and 443.
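For reference, the following hedged Python sketch shows one way a blueprint task could create the web-tier pool and virtual server through the F5 iControl REST API. The endpoint paths follow the tmsh object model, and the object names, member addresses, VIP, and profiles are placeholder assumptions.

# Hedged sketch: create a web-tier pool and HTTPS virtual server via F5 iControl REST.
# Object names, member addresses, and the VIP are placeholders.
import requests

BIGIP = "https://f5.example.local/mgmt/tm/ltm"   # hypothetical BIG-IP management URL
AUTH = ("svc_selfservice", "password")

session = requests.Session()
session.auth = AUTH
session.verify = False                           # use a trusted certificate in production

# Pool with the two load-balanced web servers provisioned by the blueprint.
pool = {
    "name": "pool_az01_example_web",
    "monitor": "http",
    "members": [{"name": "10.38.20.11:80"}, {"name": "10.38.20.12:80"}],
}
session.post(f"{BIGIP}/pool", json=pool).raise_for_status()

# Virtual server (VIP) that fronts the pool.
virtual = {
    "name": "vs_az01_example_web",
    "destination": "10.38.30.10:443",
    "ipProtocol": "tcp",
    "pool": "pool_az01_example_web",
    "sourceAddressTranslation": {"type": "automap"},
}
session.post(f"{BIGIP}/virtual", json=virtual).raise_for_status()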
Notifications
Every workload provisioned using NCM Self-Service sends an email to the requester.
This NVD uses the following software versions for notification integration.
Table: NCM Self-Service with Automation Notifications Software Versions
OS | Notification | Library
Windows | Email | Send-MailMessage
Linux | Email | smtplib and email.message

Connection | Details
Username | svc_selfservice
Password | xxx
Recipients | NCM Self-Service requester, distribution list, or both
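Because the table lists smtplib and email.message as the Linux notification libraries, here's a minimal sketch of the email step a Linux blueprint task might run; the SMTP host, sender, and recipient values are placeholders.

# Minimal sketch: notify the requester that a workload finished provisioning.
# SMTP host, credentials, and addresses are placeholders.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "NCM Self-Service: web01.domain.local provisioned"
msg["From"] = "selfservice@domain.local"
msg["To"] = "requester@domain.local"
msg.set_content("Your workload web01.domain.local (10.38.10.21) is ready.")

with smtplib.SMTP("smtp.domain.local", 25) as smtp:
    smtp.send_message(msg)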
5. Ordering
This bill of materials reflects the validated and tested hardware, software, and services
that Nutanix recommends to achieve the outcomes described in this document. Consider
the following points when you build your orders:
• All software is based on core licensing whenever possible.
• Nutanix Professional Services or an affiliated partner selected by Nutanix provides all
services.
• Nutanix based the functional testing described in this document on NX series models
with similar configurations to validate the interoperability of software and services.
Note: Because available hardware, software, and services can change without notice, contact Nutanix
Sales when ordering to ensure that you have the correct product codes.
Substitutions
• Nutanix recommends that you purchase the exact hardware configuration reflected
in the bill of materials whenever possible. If a specific hardware configuration is
unavailable, choose a similar option that meets or exceeds the recommended
specification.
• You can make hardware substitutions to suit your preferences; however, such
changes might result in a solution that doesn't follow the recommended Nutanix
configuration.
• Avoid software product code substitutions except when:
› You need different quantities to maintain software licensing compliance.
› You prefer a higher license tier or support level for the same software product code.
• Adding any software or workloads that aren't specified in this design to the
environment (including additional Nutanix products) might affect the validated density
calculations and result in a solution that doesn't follow the recommended Nutanix
configuration.
• Nutanix Professional Services substitutions to accommodate customer preferences
aren't possible.
Sizing Considerations
This NVD is based on a block-and-pod architecture. A block consists of 32 nodes, or
two 16-node workload clusters—one in each datacenter for BCDR. A pod consists of the
following components:
• Two 4-node management clusters
• Enough 32-node blocks (sets of two 16-node workload clusters) to meet the desired
capacity
• Two Nutanix Mine backup clusters
Once the number of nodes, VMs, or clusters exceeds the maximum specified for the
solution, create a new pod with a new management cluster and Prism Central instance.
For smaller environments, you can downsize the workload clusters to 4, 8, or 12 nodes
based on your capacity requirements, but don't change the hardware configuration or
sizing associated with the management clusters or the Nutanix Mine backup clusters.
Note: Contact your HYCU sales team for more information regarding licensing.
Bill of Materials
The following sections show the bills of materials for the primary and secondary
datacenter management clusters, the primary and secondary datacenter workload
clusters, and the primary and secondary datacenter backup clusters.
Item | Specification
Configuration | One node
Type | All flash
Support Level | Production
NRDK Support | No
NR Node Support | No
Use the following software for the primary datacenter management cluster:
• Nutanix Cloud Infrastructure (NCI) Pro
• NCI Advanced Replication
• NCI Flow Network Security
• Nutanix Cloud Manager (NCM) Starter
Nutanix recommends the following Nutanix Professional Services for the primary
datacenter management cluster:
• Infrastructure Deploy - On-Prem NCI Cluster
• Nutanix Unified Storage Deployment
• FastTrack for NCM Intelligent Operations
Use the following software for the primary datacenter workload cluster:
• NCI Pro
• NCI Advanced Replication
• NCI Flow Network Security
• NCM Ultimate
Nutanix recommends the following Nutanix Professional Services for the primary
datacenter workload cluster:
• Infrastructure Deploy - On-Prem NCI Cluster
Use the following software for the primary datacenter backup cluster:
• Nutanix Objects dedicated
• Nutanix Mine software
• HYCU License Bundle for Nutanix Mine, 1Ct for 3YR
Nutanix recommends the following Nutanix Professional Services for the primary
datacenter backup cluster:
• Infrastructure Deploy - On-Prem NCI Cluster
Note: You only need Nutanix Unified Storage Deployment - Objects on the bill of materials so that your
Nutanix representative can quote 18 TB HDD. A quantity of 1 TiB is sufficient.
Use the following software for the secondary datacenter management cluster:
• NCI Pro
• NCI Advanced Replication
• NCI Flow Network Security
• Nutanix Cloud Manager (NCM) Starter
Nutanix recommends the following Nutanix Professional Services for the secondary
datacenter management cluster:
• Infrastructure Deploy - On-Prem NCI Cluster
• Nutanix Unified Storage Deployment
Use the following software for the secondary datacenter workload cluster:
• NCI Pro
• NCI Advanced Replication
• NCI Flow Network Security
• NCM Ultimate
Nutanix recommends the following Nutanix Professional Services for the secondary
datacenter workload cluster:
• Infrastructure Deploy - On-Prem NCI Cluster
• Nutanix Unified Storage Deployment
Use the following software for the secondary datacenter backup cluster:
• Nutanix Mine software
• HYCU License Bundle for Nutanix Mine, 1Ct for 3YR
6. Test Plan
The test plan for the Nutanix Hybrid Cloud: AOS 6.5 with AHV On-Premises Design
validates a successful implementation (spreadsheet automatically downloads when
you click the link). Compare the result for each test plan item with the Expected Result
column, then select the correct response from the dropdown menu in the Result column.
7. Appendix
Linux VM Performance Tuning
Apply the following kernel parameters (for example, through /etc/sysctl.conf or a file in /etc/sysctl.d/) to tune Linux VM performance:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs=100
vm.swappiness = 0
fs.aio-max-nr=3145728
fs.file-max = 6815744
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
# disable ip forwarding
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_syncookies = 1
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 10000
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 5000
#kernel.shmmni = 4096
kernel.shmmax = 5368709120
# Huge pages (for Java workloads and similar): set vm.nr_hugepages to the amount of
# memory to reserve divided by the 2 MB page size, and set vm.hugetlb_shm_group to
# the ID of the group that's allowed to lock the pages in memory. Replace the values in <>.
# vm.nr_hugepages = 2304
# vm.hugetlb_shm_group = 1002
# In /etc/security/limits.conf, set the memlock limit (in KB) to the same amount of memory reserved for huge pages:
@<gid java user> hard memlock 4718592
Design Limits
The following design limits apply, at the time of publishing, to the software versions listed in the Software Versions table.
Table: Hybrid Cloud On-Premises AOS 6.5 Design Limits
Name | Product Maximum | NVD Design Maximum
AHV: Maximum powered-on VMs per host | 128 | 124
AHV: Nodes per cluster | 32 | 16
References
1. Nutanix Hybrid Cloud Reference Architecture
2. Nutanix Cloud Manager Self-Service
3. Flow Network Security Guide
4. Nutanix Disaster Recovery (formerly Leap)
5. Nutanix Mine with HYCU User Guide
6. Nutanix Objects
7. Data Protection and Disaster Recovery
8. Physical Networking
About Nutanix
Nutanix offers a single platform to run all your apps and data across multiple clouds
while simplifying operations and reducing complexity. Trusted by companies worldwide,
Nutanix powers hybrid multicloud environments efficiently and cost effectively. This
enables companies to focus on successful business outcomes and new innovations.
Learn more at Nutanix.com.