NSX-T Reference Design Guide 3-2 - v1.2
VMware NSX® Reference Design Guide
Software Version 3.2
1 Introduction 9
How to Use This Document and Provide Feedback 9
Networking and Security Today 9
NSX Architecture Value and Scope 10
Containers and Cloud Native Application Integrations with NSX 13
Role of NSX in the VMware multi-cloud platform 14
2 NSX Architecture Components 17
Management Plane and Control Plane 17
Management Plane 17
Control Plane 18
NSX Manager Appliance 18
Data Plane 19
NSX Consumption Model 20
NSX Policy API Framework 20
Policy API Usage Example 1- Templatize and Deploy 3-Tier Application Topology 21
Policy API Usage Example 2- Application Security Policy Lifecycle Management 22
When to use Policy vs Manager UI/API 22
NSX Logical Object Naming relationship between manager and policy mode 24
3 NSX Logical Switching 25
The NSX Virtual Switch 25
Segments and Transport Zones 25
Uplink vs. pNIC 28
Teaming Policy 28
Uplink Profile 31
Intended Audience
This document is targeted toward virtualization and network architects interested in deploying
VMware NSX® network virtualization solutions in a variety of on-premises environments.
Revision History
1 Introduction
This document provides guidance and best practices for designing environments that leverage
the capabilities of VMware NSX®. It is targeted at virtualization and network architects
interested in deploying NSX solutions.
allow them to write and provision apps in a fraction of the time required with traditional
methods.
Heterogeneity
Application proliferation has given rise to heterogeneous environments, with application
workloads being run inside VMs, containers, clouds, and bare metal servers. IT departments
must maintain governance, security, and visibility for application workloads regardless of
whether they reside on premises, in public clouds, or in clouds managed by third parties.
Cloud-centric Architectures
Cloud-centric architectures and approaches to building and managing applications are
increasingly common because of their efficient development environments and fast delivery of
applications. These cloud architectures can put pressure on networking and security
infrastructure to integrate with private and public clouds. Logical networking and security must
be highly extensible to adapt and keep pace with ongoing change.
Against this backdrop of increasing application needs, greater heterogeneity, and the
complexity of environments, IT must still protect applications and data while addressing the
reality of an attack surface that is continuously expanding.
The NSX architecture is designed around four fundamental attributes. Figure 1-1: NSX
Anywhere Architecture depicts the universality of those attributes, which span from any site, to any
cloud, and to any endpoint device. This enables greater decoupling, not just at the
infrastructure level (e.g., hardware, hypervisor), but also at the public cloud and container level,
all while maintaining the four key attributes of the platform across those domains. The
architectural value and characteristics of NSX include:
• Policy and Consistency: Allows policy to be defined once and a desired end state to be
realized via RESTful API, addressing the requirements of today's automated environments.
NSX maintains multiple inventories and controls to enumerate desired outcomes across
diverse domains.
• Networking and Connectivity: Allows consistent logical switching and distributed routing
without being tied to a single compute manager/domain (e.g., vCenter Server). The
connectivity is further extended across containers and clouds via domain-specific
implementations while still providing connectivity across heterogeneous endpoints.
• Security and Services: Allows a unified security policy model, as with networking
connectivity. This enables implementation of services such as load balancer, Edge
(Gateway) Firewall, Distributed Firewall, and Network Address Translation across multiple
compute domains. Providing consistent security between VMs and container workloads
in private and public clouds is essential to assuring the integrity of the overall framework
set forth by security operations.
• Visibility: Allows consistent monitoring, metric collection, and flow tracing via a common
toolset across compute domains and clouds. Visibility is essential for operationalizing
mixed VM-centric and container-centric workloads, which typically have drastically
different tools for completing similar tasks.
These attributes enable the heterogeneity, app-alignment, and extensibility required to support
diverse requirements. Additionally, NSX supports DPDK libraries that offer line-rate stateful
services.
Heterogeneity
In order to meet the needs of heterogeneous environments, a fundamental requirement of NSX
is to be compute-manager agnostic. As this approach mandates support for multi-hypervisor
and/or multi-workloads, a single NSX manager’s manageability domain can span multiple
vCenters. When designing the management plane, control plane, and data plane components
of NSX, special considerations were taken to enable flexibility, scalability, and performance.
The management plane was designed to be independent of any compute manager, including
vSphere. The VMware NSX® Manager™ is fully independent; management of the NSX-based
network functions is accessed directly, either programmatically or through the GUI.
The control plane architecture is separated into two components – a centralized cluster and an
endpoint-specific local component. This separation allows the control plane to scale as the
localized implementation – both data plane implementation and security enforcement – is
more efficient and allows for heterogeneous environments.
The data plane was designed to be normalized across various environments. NSX introduces a
host switch that normalizes connectivity among various compute domains, including multiple
VMware vCenter® instances, KVM, containers, bare metal servers, and other off-premises or
cloud implementations. This switch is referred to as the N-VDS. The functionality of the N-VDS
was fully implemented in the ESXi VDS 7.0 and later, which allows ESXi customers to take
advantage of full NSX functionality without having to change the VDS. Regardless of
implementation, data plane connectivity is normalized across all platforms, allowing for a
consistent experience.
App-aligned
NSX was built with the application as the key construct. Regardless of whether the app was
built in a traditional monolithic model or developed in a newer microservices application
framework, NSX treats networking and security consistently. This consistency extends across
containers and multiple hypervisors on premises, and further into the public cloud.
defined infrastructure, Platform-as-a-Service (PaaS) and management stack that can be layered
on top of any physical hardware layer on any cloud or data center. The software stack is based
on VMware Cloud Foundation (VCF) and includes vSphere, VSAN, and NSX as its core
components. It provides a unified approach to building, running and managing traditional and
modern apps on any cloud. This unique architectural approach provides a single platform that
can function across all application types and multiple cloud environments. NSX is a key strategic
asset for the VMware multi-cloud platform. Limited and fragmented public cloud native
network and security services are augmented by rich and uniform enterprise-grade capabilities
across any cloud.
The cloud operating model cuts across the traditional silos: networking, network security, load
balancing, and endpoint protection solutions are designed, deployed, and managed by an
increasingly integrated team or a set of integrated processes in the form of automation and
pipelines. In this world, network and security become software that is defined in advance and is
integrated into the customer CI/CD pipeline. It is accessed programmatically by high-level,
declarative, or intent-based APIs, and it is oriented to serving the needs of the application. The
NSX APIs allow customers to embrace the cloud operating model, where an entire workload,
including all network and security services, is launched with no human touch, while not
sacrificing key enterprise functionalities.
Virtualized networking separates the logical connectivity policies from the physical transport
layer. In a multi-cloud world, the transport layer is increasingly less available because it is
embedded into another cloud or running on the public Internet. Thus, virtual networking is
essential for making the VMware stack run on any provider's hardware, for seamlessly
connecting the heterogeneous clouds of a modern enterprise, and for presenting a uniform
consumption model across different providers.
Organizations are looking for a multi-cloud platform that delivers best-in-class workload
security. Thanks to NSX, VMware clouds can transparently insert services to protect, manage,
and operationalize applications at scale, with an intimate understanding of the end user
and the application context. This allows for unique features such as "virtual patching" via the
NSX distributed IPS, which protects individual vulnerable workloads before the application of a
security patch.
While all the VMware clouds run the same code base, some go a step further in terms of
simplification and application alignment. The NSX that is presented to VMware Cloud
customers is much simpler than the on-premises version because the underlying topology is
controlled by VMware and all that is required is to specify application-level policies. Since the
same software stack is deployed on any VMware cloud, operations such as the vMotion of a
workload and its attached firewall/security policy between private and public clouds are
available.
Management Plane
The management plane provides an entry point to the system for the API as well as the NSX
graphical user interface. It is responsible for maintaining user configuration, handling user queries, and
performing operational tasks on all management, control, and data plane nodes.
The NSX Manager implements the management plane for the NSX ecosystem. It provides an
aggregated system view and is the centralized network management component of NSX. NSX
Manager provides the following functionality:
• Serves as a unique entry point for user configuration via RESTful API (CMP, automation)
or NSX user interface.
• Responsible for storing desired configuration in its database. The NSX Manager stores
the final configuration request by the user for the system. This configuration will be
pushed by the NSX Manager to the control plane to become a realized configuration (i.e.,
a configuration effective in the data plane).
Control Plane
The control plane computes the runtime state of the system based on configuration from the
management plane. It is also responsible for disseminating topology information reported by
the data plane elements and pushing stateless configuration to forwarding engines.
NSX splits the control plane into two parts:
• Central Control Plane (CCP) – The CCP is implemented as a cluster of virtual machines
called CCP nodes. The cluster form factor provides both redundancy and scalability of
resources. The CCP is logically separated from all data plane traffic, meaning any failure
in the control plane does not affect existing data plane operations. User traffic does not
pass through the CCP Cluster.
• Local Control Plane (LCP) – The LCP runs on transport nodes. It is adjacent to the data
plane it controls and is connected to the CCP. The LCP is responsible for programming the
forwarding entries and firewall rules of the data plane.
(CPU, memory, and disk). With the converged manager appliance, one only needs to consider the
appliance sizing once.
Each appliance has a dedicated IP address and its manager process can be accessed directly or
through a load balancer. Optionally, the three appliances can be configured to maintain a
virtual IP address which will be serviced by one appliance selected among the three. The design
considerations for the NSX Manager appliance are further discussed in Chapter 7.
Data Plane
The data plane performs stateless forwarding or transformation of packets based on tables
populated by the control plane. It reports topology information to the control plane and
maintains packet level statistics. The hosts running the local control plane daemons and
forwarding engines implementing the NSX data plane are called transport nodes. Transport
nodes run an instance of the NSX virtual switch called the NSX Virtual Distributed
Switch, or N-VDS.
On ESXi platforms, the N-VDS is built on top of the vSphere Distributed Switch (VDS). In fact,
the N-VDS is so close to the VDS that NSX 3.0 introduced the capability of installing NSX directly
on top of a VDS on ESXi hosts. For all other kinds of transport nodes, the N-VDS is based on
the platform-independent Open vSwitch (OVS) and serves as the foundation for the
implementation of NSX in other environments (e.g., cloud, containers, etc.).
As represented in Figure 2-1: NSX Architecture and Components, there are two main types of
transport nodes in NSX:
• Hypervisor Transport Nodes: Hypervisor transport nodes are hypervisors prepared and
configured for NSX. NSX provides network services to the virtual machines running on
those hypervisors. NSX currently supports VMware ESXi™ and KVM hypervisors.
• Edge Nodes: VMware NSX Edge™ nodes are service appliances dedicated to running
centralized network services that cannot be distributed to the hypervisors. They can be
instantiated as a bare metal appliance or in virtual machine form factor. They are
grouped in one or several clusters, representing a pool of capacity. It is important to
remember that an Edge Node does not represent a service itself but just a pool of
capacity that one or more services can consume.
The NSX API documentation is accessible directly from the NSX Manager UI, under the Policy
section of the API documentation, or from code.vmware.com.
The following examples walk through the Policy API for two customer scenarios:
The desired outcome for deploying the application, as shown in the figure above, can be
defined using JSON. Once the JSON request body is defined to reflect the desired outcome, the
API and JSON request body can be leveraged to automate the following operational workflows
(a sketch follows the list):
• Deploy the entire topology with a single API call and JSON request body.
• Templatize and reuse the same API/JSON to deploy the same application in different
environments (PROD, TEST, and DEV).
• Handle lifecycle management of the entire application topology by toggling the
"marked_for_delete" flag in the JSON body to true or false.
The policy model can define the desired outcome by specifying grouping and micro-
segmentation policies using JSON. A single API call with a JSON request body can automate the
following operational workflows (sketched below):
• Deploy a white-list security policy with a single API call and JSON request body.
• Templatize and reuse the same API/JSON to secure the same application in different
environments (PROD, TEST, and DEV).
• Handle lifecycle management of the entire application topology by toggling the
"marked_for_delete" flag in the JSON body to true or false.
When to use Policy mode vs Manager mode:
Policy mode:
• All new deployments should use Policy mode.
• New deployments which integrate with cloud automation platforms, automation tools, or
the NSX Container plug-in (NCP). All the northbound integrations between VMware
products and NSX now support Policy mode.
Manager mode:
• Deployments which were created using the advanced interface, for example, upgrades
from versions where Policy mode was not available.
• Deployments which in previous versions were integrated with cloud automation
platforms, NCP, or custom orchestration tools via the Manager API. Customers should
plan the transition to Policy mode. Consult the specific guidance for NCP and VMware
Integrated OpenStack (VIO). Migration of Manager API objects managed by vRealize
Automation is not yet supported.
It is recommended that whichever mode is used to create objects (Policy or Manager) be the
only mode used (if Manager mode objects are required, create all objects in Manager
mode). Do not alternate between the two modes, or the results will be unpredictable. Note that the
default mode for the NSX Manager is Policy mode. When working in an installation where all
objects are new and created in Policy mode, the Manager mode option will not be visible in the
UI. For details on switching between modes, please see the NSX documentation.
NSX Logical Object Naming relationship between manager and policy mode
The names of some of the networking and security logical objects in the Manager API/data
model have changed in the new policy model. The table below provides the before and after
naming side by side for those NSX logical objects. This change only affects the name of the
given NSX object; conceptually and functionally it is the same as before.
NSGroup (Manager mode) → Group (Policy mode)
needs to be provisioned in the physical infrastructure for those two VMs to communicate at
layer 2 over a VLAN backed segment.
On the other hand, two VMs on different hosts and attached to the same overlay backed
segment will have their layer 2 traffic carried by a tunnel between their hosts. This IP tunnel is
instantiated and maintained by NSX without the need for any segment-specific configuration in
the physical infrastructure, thus decoupling NSX virtual networking from this physical
infrastructure.
Starting in NSX version 3.2, security-only deployments can also have VMs connected to traditional
vSphere distributed port groups (dvpgs). The security functionalities available are equivalent to those
on NSX VLAN backed segments, but overlay backed segments cannot be deployed on the same ESXi
host where NSX security features have been enabled for vSphere dvpgs.
Note: representation of NSX segments in vCenter
This design document will only use the term “segment” when referring to the NSX virtual Layer 2
broadcast domain. Note however that in the vCenter UI, those segments will appear as
“opaque networks” on hosts configured with an N-VDS, and as NSX dvportgroups on hosts
configured with a VDS. Below is a screenshot showing both possible representations:
Check the following KB article for more information on the impact of this difference in
representation: https://kb.vmware.com/s/article/79872
Segments are created as part of an NSX object called a transport zone. There are VLAN
transport zones and overlay transport zones. A segment created in a VLAN transport zone will
be a VLAN backed segment, while, as you can guess, a segment created in an overlay transport
zone will be an overlay backed segment. NSX transport nodes attach to one or more transport
zones, and as a result, they gain access to the segments created in those transport zones.
Transport zones can thus be seen as objects defining the scope of the virtual network because
they provide access to groups of segments to the hosts that attach to them, as illustrated in
Figure 3-1 below:
In the above diagram, transport node 1 is attached to transport zone “Staging”, while transport
nodes 2-4 are attached to transport zone “Production”. If one creates segment 1 in transport
zone “Production”, each transport node in the “Production” transport zone immediately gains
access to it. However, this segment 1 does not extend to transport node 1. The span of segment
1 is thus defined by the transport zone “Production” it belongs to.
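As a concrete illustration, the sketch below creates one overlay backed and one VLAN backed segment by pointing each at a different transport zone through the Policy API. The manager FQDN, transport zone IDs, VLAN ID, and addressing are hypothetical.

"""Sketch: the transport zone a segment is created in determines whether it is
VLAN backed or overlay backed. All paths and IDs below are illustrative."""
import requests

NSX = "https://nsx-mgr.example.com/policy/api/v1"
AUTH = ("admin", "VMware1!VMware1!")
TZ = "/infra/sites/default/enforcement-points/default/transport-zones"

overlay_segment = {                      # overlay backed: carried in Geneve tunnels
    "display_name": "app-overlay-seg",
    "transport_zone_path": f"{TZ}/prod-overlay-tz",
    "subnets": [{"gateway_address": "172.16.20.1/24"}],
}
vlan_segment = {                         # VLAN backed: VLAN 100 must exist physically
    "display_name": "mgmt-vlan-seg",
    "transport_zone_path": f"{TZ}/prod-vlan-tz",
    "vlan_ids": ["100"],
}
for seg_id, body in [("app-overlay-seg", overlay_segment),
                     ("mgmt-vlan-seg", vlan_segment)]:
    requests.put(f"{NSX}/infra/segments/{seg_id}", json=body,
                 auth=AUTH, verify=False).raise_for_status()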
A few additional points related to transport zones and transport nodes:
• Multiple virtual switches, N-VDS, VDS (with or without NSX) or VSS, can coexist on an ESXi
transport node; however, a given pNIC can only be associated with a single virtual
switch. This behavior is specific to the VMware virtual switch model, not to NSX.
• An NSX virtual switch (N-VDS or VDS with NSX) can attach to a single overlay transport
zone and multiple VLAN transport zones at the same time.
• A transport node can have multiple NSX virtual switches. A transport node can thus
attach to multiple overlay and VLAN transport zones.
• A transport zone can only be attached to a single NSX virtual switch on a given transport
node. In other words, two NSX virtual switches on the same transport node cannot be
attached to the same transport zone.
• Edge transport node-specific points:
o An edge transport node can only have one N-VDS attached to an overlay transport zone.
o If multiple VLAN segments are backed by the same VLAN ID (across all the VLAN
transport zones of an edge N-VDS), only one of those segments will be “realized” (i.e.
working effectively).
Please see additional considerations in the section Running a VDS Prepared for NSX on ESXi Hosts.
In this example, a single virtual switch with two uplinks is defined on the hypervisor transport
node. One of the uplinks is a LAG, bundling physical port p1 and p2, while the other uplink is
only backed by a single physical port p3. Both uplinks look the same from the perspective of the
virtual switch; there is no functional difference between the two.
Note that the example represented in this picture is by no means a design recommendation, it’s
just illustrating the difference between the virtual switch uplinks and the host physical uplinks.
Teaming Policy
The teaming policy defines how the NSX virtual switch uses its uplinks for redundancy and
traffic load balancing. There are two main options for teaming policy configuration:
• Failover Order – An active uplink is specified along with an optional list of standby
uplinks. Should the active uplink fail, the next available uplink in the standby list takes its
place immediately. This policy results in an active/standby use of the uplinks.
• Load Balanced Source Port / Load Balanced Source MAC Address – Traffic is distributed
across a specified list of active/active uplinks.
o The “Load Balanced Source Port” policy maps a VM’s virtual interface to an uplink of the
host. Traffic sent by this virtual interface will leave the host through this uplink only, and
traffic destined to this virtual interface will necessarily enter the host via this uplink.
o The “Load Balanced Source Mac Address” policy goes a little bit further in terms of granularity
for the rare scenario where a virtual interface can source traffic from different MAC
addresses. Here, two frames sent by the same virtual interface could be associated to
different uplinks based on their source MAC address.
The teaming policy only defines how the NSX virtual switch balances traffic across its uplinks.
The uplinks can in turn be individual pNICs or LAGs (as seen in the previous section). Note that a
LAG uplink has its own hashing options; however, those hashing options only define how traffic
is distributed across the physical members of the LAG uplink, whereas the teaming policy defines
how traffic is distributed between NSX virtual switch uplinks.
FIGURE 3-3 presents an example of the failover order and source teaming policy options,
illustrating how the traffic from two different VMs in the same segment is distributed across
uplinks. The uplinks of the virtual switch could be any combination of single pNICs or LAGs;
whether the uplinks are pNICs or LAGs has no impact on the way traffic is balanced between
uplinks. When an uplink is a LAG, it is only considered down when all the physical members of
the LAG are down. When defining a transport node, the user must specify a default teaming
policy that will be applicable by default to the segments available to this transport node.
In the above Figure 3-4, the default failover order teaming policy specifies u1 as the active
uplink and u2 as the standby uplink. By default, all the segments are thus going to send and
receive traffic on u1. However, an additional failover order teaming policy called “Storage” has
been added, where u2 is active and u1 standby. The VLAN segment where VM3 is attached can
be mapped to the “Storage” teaming policy, thus overriding the default teaming policy for the
VLAN traffic consumed by this VM3. Sometimes, it might be desirable to only send overlay
traffic on a limited set of uplinks. This can also be achieved with named teaming policies for
VLAN backed segments, as represented in Figure 3-5 below:
Here, the default teaming policy only includes uplinks u1 and u2. As a result, overlay traffic is
constrained to those uplinks. However, an additional teaming policy named “VLAN-traffic” is
configured for load balancing traffic on uplinks u3 and u4. By mapping VLAN segments to this
teaming policy, overlay and VLAN traffic are segregated.
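A short sketch of this mapping is shown below. It pins a VLAN backed segment to the named teaming policy ("VLAN-traffic" in the figure above) through the segment's advanced configuration in the Policy API; the segment name, transport zone, and VLAN ID are hypothetical.

"""Sketch: pinning a VLAN backed segment to a named teaming policy so its
traffic only uses the uplinks defined by that policy. Names are illustrative."""
import requests

NSX = "https://nsx-mgr.example.com/policy/api/v1"
AUTH = ("admin", "VMware1!VMware1!")

segment = {
    "display_name": "vlan200-seg",
    "transport_zone_path": "/infra/sites/default/enforcement-points/default/"
                           "transport-zones/prod-vlan-tz",
    "vlan_ids": ["200"],
    # Overrides the default teaming policy: this segment uses uplinks u3/u4 only.
    "advanced_config": {"uplink_teaming_policy_name": "VLAN-traffic"},
}
requests.put(f"{NSX}/infra/segments/vlan200-seg", json=segment,
             auth=AUTH, verify=False).raise_for_status()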
Uplink Profile
As mentioned earlier, a transport node includes at least one NSX virtual switch, implementing
the NSX data plane. It is common for multiple transport nodes to share the exact same NSX
virtual switch configuration. It is also very difficult from an operational standpoint to configure
(and maintain) multiple parameters consistently across many devices. For this purpose, NSX
defines a separate object called an uplink profile that acts as a template for the configuration of
a virtual switch. In this way, the administrator can create multiple transport nodes with similar
virtual switches by simply pointing to a common uplink profile. Even better, when the
administrator modifies a parameter in the uplink profile, the change is automatically propagated
to all the transport nodes using this uplink profile.
The following parameters are defined in an uplink profile (a configuration sketch follows the list):
• The transport VLAN used for overlay traffic. Overlay traffic will be tagged with the VLAN
ID specified in this field.
• The MTU of the uplinks. NSX will assume that it can send overlay traffic with this MTU on
the physical uplinks of the transport node without any fragmentation by the physical
infrastructure.
• The name of the uplinks and the LAGs used by the virtual switch. LAGs are optional of
course, but if you want to define some, you can give them a name, specify the number of
links and the hash algorithm they will use.
• The teaming policies applied to the uplinks (default and named teaming policies)
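The sketch below shows an uplink profile capturing these parameters (transport VLAN, MTU, uplink names, default and named teaming policies). It is written against the manager-style /api/v1/host-switch-profiles endpoint for brevity; an equivalent exists in Policy mode, and all field values here are illustrative assumptions to be checked against the API documentation.

"""Sketch: an uplink profile with a default failover-order teaming policy and a
named "Storage" teaming policy, as in the earlier figures. Values illustrative."""
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

uplink_profile = {
    "resource_type": "UplinkHostSwitchProfile",
    "display_name": "UP1",
    "transport_vlan": 150,                 # overlay (TEP) traffic tagged with VLAN 150
    "mtu": 9000,                           # no fragmentation expected in the fabric
    "teaming": {                           # default teaming policy
        "policy": "FAILOVER_ORDER",
        "active_list":  [{"uplink_name": "U1", "uplink_type": "PNIC"}],
        "standby_list": [{"uplink_name": "U2", "uplink_type": "PNIC"}],
    },
    "named_teamings": [{                   # named policy, e.g. for storage VLAN segments
        "name": "Storage",
        "policy": "FAILOVER_ORDER",
        "active_list":  [{"uplink_name": "U2", "uplink_type": "PNIC"}],
        "standby_list": [{"uplink_name": "U1", "uplink_type": "PNIC"}],
    }],
}
requests.post(f"{NSX_MGR}/api/v1/host-switch-profiles", json=uplink_profile,
              auth=AUTH, verify=False).raise_for_status()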
The virtual switch uplinks defined in the uplink profile must be mapped to real, physical uplinks
on the device becoming a transport node.
FIGURE 3-6 shows how a transport node “TN1” is created using the uplink profile “UP1”.
The uplinks U1 and U2 listed in the teaming policy of the uplink profile UP1 are just variable
names. When transport node TN1 is created, some physical uplinks available on the host are
mapped to those variables. Here, we’re mapping vmnic0 to U1 and vmnic1 to U2. If the uplink
profile defined LAGs, physical ports on the host being prepared as a transport node would have
to be mapped to the member ports of the LAGs defined in the uplink profile.
The benefit of this model is that we can create an arbitrary number of transport nodes
following the configuration of the same uplink profile. There might be local differences in the
way virtual switch uplinks are mapped to physical ports. For example, one could create a
transport node TN2 still using the same UP1 uplink profile, but mapping U1 to vmnic3 and U2 to
vmnic0. Then, it is possible to change the teaming policy of UP1 to failover order, setting U1
as active and U2 as standby. On TN1, this would make vmnic0 active and vmnic1 standby,
while TN2 would use vmnic3 as active and vmnic0 as standby.
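The uplink-to-pNIC mapping can be illustrated with a small helper that builds the host switch specification used when preparing a transport node. This is a sketch only; the structure mirrors the transport-node host switch specification in the API documentation, and the profile ID and vmnic assignments are hypothetical.

"""Sketch: mapping the logical uplink names of uplink profile UP1 to physical
vmnics for transport nodes TN1 and TN2 (illustrative structure and IDs)."""

def host_switch_spec(uplink_profile_id, mapping):
    """Build a host switch spec mapping uplink-profile variables to pNICs."""
    return {
        "resource_type": "StandardHostSwitchSpec",
        "host_switches": [{
            "host_switch_profile_ids": [{
                "key": "UplinkHostSwitchProfile", "value": uplink_profile_id}],
            "pnics": [{"device_name": pnic, "uplink_name": uplink}
                      for uplink, pnic in mapping.items()],
        }],
    }

# TN1 and TN2 share profile UP1 but map its uplink variables to different vmnics.
tn1_spec = host_switch_spec("UP1-uuid", {"U1": "vmnic0", "U2": "vmnic1"})
tn2_spec = host_switch_spec("UP1-uuid", {"U1": "vmnic3", "U2": "vmnic0"})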
While uplink profiles allow configuring the virtual switches of multiple transport nodes in a
centralized fashion, they also allow for very granular configuration when needed. Suppose now that
we want to turn a mix of ESXi and KVM hosts into transport nodes. UP1, defined above,
cannot be applied to KVM hosts because those only support the failover order policy. The
administrator can simply create an uplink profile specific to KVM hosts, with a failover order
teaming policy, while keeping an uplink profile with a source teaming policy for ESXi hosts, as
represented in FIGURE 3-7 below:
If NSX had a single centralized configuration for all the hosts, we would have been forced to fall
back to the lowest common denominator failover order teaming policy for all the hosts.
The uplink profile model also allows for different transport VLANs on different hosts. This can
be useful when the same VLAN ID is not available everywhere in the network, for example
during a migration or a reallocation of VLANs based on topology or geo-location changes.
When running NSX on VDS, the LAG definition and the MTU fields of the uplink profile are
directly defined on the VDS, controlled by vCenter. It is still possible to associate transport nodes
based on N-VDS and transport nodes based on VDS with the same uplink profile; the LAG
definition and the MTU will simply be ignored on the VDS-based transport nodes.
configuration changes are kept in sync across all the hosts, leading to easier cluster
management.
Logical Switching
This section on logical switching focuses on overlay backed segments due to their ability to
create isolated logical L2 networks with the same flexibility and agility that exists with virtual
machines. This decoupling of logical switching from the physical network infrastructure is one
of the main benefits of adopting NSX.
In the upper part of the diagram, the logical view consists of five virtual machines that are
attached to the same segment, forming a virtual broadcast domain. The physical
representation, at the bottom, shows that the five virtual machines are running on hypervisors
spread across three racks in a data center. Each hypervisor is an NSX transport node equipped
with a tunnel endpoint (TEP). The TEPs are configured with IP addresses, and the physical
network infrastructure just needs to provide IP connectivity between them. Whether the TEPs
are L2 adjacent in the same subnet or spread across different subnets does not matter. The
VMware® NSX Controller™ (not pictured) distributes the IP addresses of the TEPs across the
transport nodes so they can set up tunnels with their peers. The example shows “VM1” sending
a frame to “VM5”. In the physical representation, this frame is transported via an IP point-to-
point tunnel between transport nodes “HV1” and “HV5”.
The benefit of this NSX overlay model is that it allows direct connectivity between transport
nodes irrespective of the specific underlay inter-rack (or even inter-datacenter) connectivity
(i.e., L2 or L3). Segments can also be created dynamically without any configuration of the
physical network infrastructure.
Flooded Traffic
The NSX segment behaves like a LAN, providing the capability of flooding traffic to all the
devices attached to this segment; this is a cornerstone capability of layer 2. NSX does not
differentiate between the different kinds of frames replicated to multiple destinations.
Broadcast, unknown unicast, or multicast traffic will be flooded in a similar fashion across a
segment. In the overlay model, the replication of a frame to be flooded on a segment is
orchestrated by the different NSX components. NSX provides two different methods for
flooding traffic described in the following sections. They can be selected on a per segment
basis.
The diagram illustrates the flooding process from the hypervisor transport node where “VM1”
is located. “HV1” sends a copy of the frame that needs to be flooded to every peer that is
interested in receiving this traffic. Each green arrow represents the path of a point-to-point
tunnel through which the frame is forwarded. In this example, hypervisor “HV6” does not
receive a copy of the frame. This is because the NSX Controller has determined that there is no
recipient for this frame on that hypervisor.
In this mode, the burden of the replication rests entirely on the source hypervisor. Seven copies of
the tunnel packet carrying the frame are sent over the uplink of “HV1”. This should be
considered when provisioning the bandwidth on this uplink.
Assume that “VM1” on “HV1” needs to send the same broadcast on “S1” as in the previous
section on head-end replication. Instead of sending an encapsulated copy of the frame to each
remote transport node attached to “S1”, the following process occurs:
1. “HV1” sends a copy of the frame to all the transport nodes within its group (i.e., with a
TEP in the same subnet as its TEP). In this case, “HV1” sends a copy of the frame to
“HV2” and “HV3”.
2. “HV1” sends a copy to a single transport node in each of the remote groups. For the
two remote groups (subnet 20.0.0.0 and subnet 30.0.0.0), “HV1” selects an arbitrary
member of those groups and sends a copy of the packet there with a bit set to indicate the
need for local replication. In this example, “HV1” selected “HV5” and “HV7”.
3. Transport nodes in the remote groups perform local replication within their respective
groups. “HV5” relays a copy of the frame to “HV4” while “HV7” sends the frame to
“HV8” and “HV9”. Note that “HV5” does not relay to “HV6” as it is not interested in
traffic from “S1”.
The source hypervisor transport node knows about the groups based on the information it has
received from the NSX Controller. It does not matter which transport node is selected to
perform replication in the remote groups so long as the remote transport node is up and
available. If this were not the case (e.g., “HV7” was down), the NSX Controller would update all
transport nodes attached to “S1”. “HV1” would then choose “HV8” or “HV9” to perform the
replication local to group 30.0.0.0.
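The selection logic can be summarized with the following pure-Python sketch (an illustration of the concept, not NSX code). It groups peer TEPs by subnet, assuming a /24 TEP prefix, and returns the TEPs the source sends a copy to: every member of its own group plus one proxy per remote group.

"""Sketch (illustration only): source-side copy selection in two-tier
hierarchical replication. TEP addresses and the /24 prefix are assumptions."""
import ipaddress
from collections import defaultdict

def replication_targets(source_tep, peer_teps, prefix_len=24):
    """Return the TEPs the source sends a copy to under two-tier replication."""
    groups = defaultdict(list)
    for tep in peer_teps:
        subnet = ipaddress.ip_network(f"{tep}/{prefix_len}", strict=False)
        groups[subnet].append(tep)
    local_subnet = ipaddress.ip_network(f"{source_tep}/{prefix_len}", strict=False)
    targets = []
    for subnet, members in groups.items():
        if subnet == local_subnet:
            targets.extend(members)          # full replication within the local group
        else:
            targets.append(members[0])       # one proxy replicates locally in its group
    return targets

# A source TEP in 10.0.0.0/24 floods to its two local peers and to one proxy
# in each of the remote groups 20.0.0.0/24 and 30.0.0.0/24.
print(replication_targets("10.0.0.1",
      ["10.0.0.2", "10.0.0.3", "20.0.0.4", "20.0.0.5",
       "30.0.0.7", "30.0.0.8", "30.0.0.9"]))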
In this mode, as with the head-end replication example, seven copies of the flooded frame have
been made in software, though the cost of the replication has been spread across several
transport nodes. It is also interesting to understand the traffic pattern on the physical
infrastructure. The benefit of the two-tier hierarchical mode is that only two tunnel packets
(compared to five packets in the head-end mode) were sent between racks, one for each
remote group. This is a significant improvement in the utilization of the inter-rack (or inter-datacenter)
fabric, where available bandwidth is typically less than within a rack. That number
could be higher still if there were more transport nodes interested in flooded traffic for
“S1” on the remote racks. In the case where the TEPs are in another data center, the savings
could be significant. Note also that this benefit in terms of traffic optimization provided by the
two-tier hierarchical mode only applies to environments where TEPs have their IP addresses in
different subnets. In a flat Layer 2 network, where all the TEPs have their IP addresses in the
same subnet, the two-tier hierarchical replication mode would lead to the same traffic pattern
as the source replication mode.
The default two-tier hierarchical flooding mode is recommended as a best practice as it
typically performs better in terms of physical uplink bandwidth utilization.
Unicast Traffic
When a frame is destined to an unknown MAC address, it is flooded in the network. Switches
typically implement a MAC address table, or filtering database (FDB), that associates MAC
addresses to ports in order to prevent flooding. When a frame is destined to a unicast MAC
address known in the MAC address table, it is only forwarded by the switch to the
corresponding port.
The NSX virtual switch maintains such a table for each segment/logical switch it is attached to.
A MAC address can be associated with either a virtual NIC (vNIC) of a locally attached VM or a
remote TEP (when the MAC address is located on a remote transport node reached via the
tunnel identified by that TEP).
Figure 3-11 illustrates virtual machine “Web3” sending a unicast frame to another virtual
machine “Web1” on a remote hypervisor transport node. In this example, the MAC address
tables of the NSX virtual switches on both the source and destination hypervisor transport
nodes are fully populated.
1. “Web3” sends a frame to “Mac1”, the MAC address of the vNIC of “Web1”.
2. “HV3” receives the frame and performs a lookup for the destination MAC address in its
MAC address table. There is a hit. “Mac1” is associated to the “TEP1” on “HV1”.
3. “HV3” encapsulates the frame and sends it to “TEP1”.
4. “HV1” receives the tunnel packet addressed to itself and decapsulates it. TEP1 then
performs a lookup for the destination MAC of the original frame. “Mac1” is also a hit
there, pointing to the vNIC of “Web1”. The frame is then delivered to its final destination.
This mechanism is relatively straightforward because at layer 2 in the overlay network, all the
known MAC addresses are either local or directly reachable through a point-to-point tunnel.
In NSX, the MAC address tables can be populated by the NSX Controller or by learning from the
data plane. The benefit of data plane learning, further described in the next section, is that it is
immediate and does not depend on the availability of the control plane.
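A minimal sketch of this per-segment table is shown below (illustration only, not NSX code): a destination MAC resolves either to a local vNIC, to a remote TEP requiring encapsulation, or to flooding when unknown.

"""Sketch (illustration only): the per-segment MAC table logic described above."""

class SegmentFDB:
    def __init__(self):
        self.local = {}    # MAC -> local vNIC port
        self.remote = {}   # MAC -> remote TEP

    def lookup(self, mac):
        if mac in self.local:
            return ("deliver-local", self.local[mac])
        if mac in self.remote:
            return ("encapsulate-to-tep", self.remote[mac])
        return ("flood", None)             # unknown unicast is flooded

fdb = SegmentFDB()
fdb.local["Mac3"] = "Web3-vnic"            # learned from a locally attached VM
fdb.remote["Mac1"] = "TEP1"                # learned via the controller or data plane
assert fdb.lookup("Mac1") == ("encapsulate-to-tep", "TEP1")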
transport node instead of the transport node that originated the traffic. Figure 3-12 below
illustrates this problem by focusing on the flooding of a frame from VM1 on HV1 using the two-
tier replication model (similar to what was described earlier in Figure 3-10: Two-Tier
Hierarchical Mode). When intermediate transport node HV5 relays the flooded traffic from HV1
to HV4, it actually decapsulates the original tunnel traffic and re-encapsulates it, using its
own TEP IP address as the source.
The problem is thus that, if the NSX virtual switch on “HV4” was using the source tunnel IP
address to identify the origin of the tunneled traffic, it would wrongly associate Mac1 to TEP5.
To solve this problem, upon re-encapsulation, TEP 5 inserts an identifier for the source TEP as
NSX metadata in the tunnel header. Metadata is a piece of information that is carried along
with the payload of the tunnel. FIGURE 3-13 displays the same tunneled frame from “Web1” on
“HV1”, this time carried with a metadata field identifying “TEP1” as the origin.
With this additional piece of information, “HV4” can correctly identify the origin of the tunneled
traffic on replicated traffic.
1. Virtual machine “vmA” has just finished a DHCP request sequence and been assigned IP
address “IPA”. The NSX virtual switch on “HV1” reports the association of the MAC
address of virtual machine “vmA” to “IPA” to the NSX Controller.
2. Next, a new virtual machine “vmB” comes up on “HV2” that must communicate with
“vmA”, but its IP address has not been assigned by DHCP and, as a result, there has been
no DHCP snooping. The virtual switch will be able to learn this IP address by snooping
ARP traffic coming from “vmB”. Either “vmB” will send a gratuitous ARP when coming
up or it will send an ARP request for the MAC address of “vmA”. The virtual switch then
can derive the IP address “IPB” associated to “vmB”. The association (vmB -> IPB) is then
pushed to the NSX Controller.
3. The NSX virtual switch also holds the ARP request initiated by “vmB” and queries the
NSX Controller for the MAC address of “vmA”.
4. Because the MAC address of “vmA” has already been reported to the NSX Controller,
the NSX Controller can answer the request coming from the virtual switch, which can
now send an ARP reply directly to “vmB” on behalf of “vmA”. Thanks to this
mechanism, the expensive flooding of an ARP request has been eliminated. Note that if
the NSX Controller did not know about the MAC address of “vmA” or if the NSX
Controller were down, the ARP request from “vmB” would still be flooded by the virtual
switch.
Overlay Encapsulation
NSX uses Generic Network Virtualization Encapsulation (Geneve) for its overlay model. Geneve,
standardized by the IETF as RFC 8926, builds on top of VXLAN concepts to provide
enhanced flexibility in terms of data plane extensibility.
VXLAN has static fields, while Geneve offers flexible fields that can be used to adapt to the
needs of the workloads and the overlay fabric. Because NSX tunnels are only set up between
NSX transport nodes, the physical infrastructure does not need to understand Geneve; NSX only
needs efficient support for the Geneve encapsulation by the NIC hardware, and most NIC
vendors support the same hardware offload for Geneve as they would for VXLAN.
Network virtualization is all about developing a model of deployment that is applicable to a
variety of physical networks and diversity of compute domains. New networking features are
developed in software and implemented without worry of support on the physical
infrastructure. For example, the data plane learning section described how NSX relies on
metadata inserted in the tunnel header to identify the source TEP of a forwarded frame. This
metadata could not have been added to a VXLAN tunnel without either hijacking existing bits in
the VXLAN header or making a revision to the VXLAN specification. Geneve allows any vendor
to add its own metadata in the tunnel header with a simple Type-Length-Value (TLV) model.
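To make the TLV model concrete, the sketch below packs a base Geneve header and one option following the RFC 8926 layout. The VNI, option class/type values, and the metadata payload are illustrative assumptions; NSX's actual metadata encoding is internal to the product.

"""Sketch: packing a Geneve header (RFC 8926) with one option TLV in Python.
Field values are illustrative only."""
import struct

def geneve_header(vni, options=b""):
    opt_len = len(options) // 4                  # option length in 4-byte words
    ver_optlen = (0 << 6) | opt_len              # version 0
    flags = 0                                    # O and C bits cleared
    proto = 0x6558                               # Transparent Ethernet Bridging
    vni_rsvd = vni << 8                          # 24-bit VNI plus 8 reserved bits
    return struct.pack("!BBHI", ver_optlen, flags, proto, vni_rsvd) + options

def geneve_option(opt_class, opt_type, data):
    """One TLV: 16-bit class, 8-bit type, 5-bit length (in 4-byte words), data."""
    assert len(data) % 4 == 0
    return struct.pack("!HBB", opt_class, opt_type, len(data) // 4) + data

# A 4-byte metadata option carrying a source-TEP identifier (hypothetical values).
opts = geneve_option(0x0104, 0x01, struct.pack("!I", 1))   # "TEP1" encoded as an integer
header = geneve_header(vni=5001, options=opts)
assert len(header) == 8 + len(opts)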
NSX defines a single TLV, with fields for:
As of NSX 2.4, a segment can be connected to only a single Edge Bridge. That means that L2
traffic can enter and leave the NSX overlay in a single location, thus preventing the possibility of
a loop between a VLAN and the overlay. It is however possible to bridge several different
segments to the same VLAN ID, if those different bridging instances are leveraging separate
Edge uplinks.
Starting with NSX 2.5, the same segment can be attached to several bridges on different Edges. This
allows certain bare metal topologies to be connected to an overlay segment, with bridging to
VLANs that can exist in separate racks, without depending on a physical overlay. With NSX 3.0, the
Edge Bridge supports bridging 802.1Q tagged traffic carried in an overlay backed segment
(Guest VLAN Tagging). For more information about this feature, see the bridging white paper
on the VMware Communities website.
Figure 3-19: Bridge Profile, defining a redundant Edge Bridge (primary and backup)
Once a Bridge Profile is created, the user can attach a segment to it. By doing so, an active
Bridge instance is created on the primary Edge, while a standby Bridge is provisioned on the
backup Edge. NSX creates a Bridge Endpoint object, which represents this pair of Bridges. The
attachment of the segment to the Bridge Endpoint is represented by a dedicated Logical Port,
as shown in the diagram below:
Figure 3-20: Primary Edge Bridge forwarding traffic between segment and VLAN
When associating a segment to a Bridge Profile, the user can specify the VLAN ID for the VLAN
traffic as well as the physical port that will be used on the Edge for sending/receiving this VLAN
traffic. At the time of the creation of the Bridge Profile, the user can also select the failover
mode. In the preemptive mode, the Bridge on the primary Edge will always become the active
bridge forwarding traffic between overlay and VLAN as soon as it is available, usurping the
function from an active backup. In the non-preemptive mode, the Bridge on the primary Edge
will remain standby should it become available when the Bridge on the backup Edge is already
active.
The firewall rules can leverage existing NSX grouping constructs, and there is currently a single
firewall section available for those rules.
In the above example, VM1, VM2, and Physical Servers 1 and 2 have IP connectivity. Remarkably,
through the Edge Bridges, Tier-1 or Tier-0 gateways can act as default gateways for physical
devices. Note also that the distributed nature of NSX routing is not affected by the introduction
of an Edge Bridge. ARP requests from physical workloads for the IP address of an NSX router
acting as a default gateway will be answered by the local distributed router on the Edge where
the Bridge is active.
In a data center, traffic is categorized as East-West (E-W) or North-South (N-S) based on the
origin and destination of the flow. When virtual or physical workloads in a data center
communicate with the devices external to the data center (e.g., WAN, Internet), the traffic is
referred to as North-South traffic. The traffic between workloads confined within the data
center is referred to as East-West traffic. In modern data centers, more than 70% of the traffic
is East-West.
Consider a multi-tiered application where the web tier needs to talk to the app tier and the app tier
needs to talk to the database tier, with these different tiers sitting in different subnets. Every time a
routing decision is made, the packet is sent to the router. Traditionally, a centralized router
would provide routing for these different tiers. With VMs that are hosted on the same ESXi or
KVM hypervisor, traffic will leave the hypervisor multiple times to go to the centralized router
for a routing decision, then return to the same hypervisor; this is not optimal.
NSX is uniquely positioned to solve these challenges as it can bring networking closest to the
workload. Configuring a Gateway via NSX Manager instantiates a local distributed gateway on
each hypervisor. For the VMs hosted (e.g., “Web 1”, “App 1”) on the same hypervisor, the E-W
traffic does not need to leave the hypervisor for routing.
3. Once the MAC address of “App1” is learned, the L2 lookup is performed in the local
MAC table to determine how to reach “App1” and the packet is delivered to the App1
VM.
4. The return packet from “App1” follows the same process and routing would happen
again on the local DR.
In this example, neither the initial packet from “Web1” to “App1” nor the return packet from
“App1” to “Web1” left the hypervisor.
East-West Routing - Distributed Routing with Workloads on Different Hypervisors
In this example, the target workload “App2” differs as it resides on a hypervisor named “HV2”. If
“Web1” needs to communicate with “App2”, the traffic would have to leave the hypervisor
“HV1” as these VMs are hosted on two different hypervisors. The figure below shows a logical view of
the topology, highlighting the routing decisions taken by the DR on “HV1” and the DR on “HV2”.
When “Web1” sends traffic to “App2”, routing is done by the DR on “HV1”. The reverse traffic
from “App2” to “Web1” is routed by the DR on “HV2”. Routing is always performed on the hypervisor
attached to the source VM.
The next figure shows the corresponding physical topology and the packet walk from “Web1” to “App2”.
Services Router
East-West routing is completely distributed in the hypervisor, with each hypervisor in the
transport zone running a DR in its kernel. However, some services of NSX are not distributed,
due to their locality or stateful nature, such as:
• Physical infrastructure connectivity (BGP Routing with Address Families – VRF lite)
• NAT
• DHCP server
• VPN
• Gateway Firewall
• Bridging
• Service Interface
• Metadata Proxy for OpenStack
A services router (SR) is instantiated on an edge cluster when a service is enabled that cannot
be distributed on a gateway.
A centralized pool of capacity is required to run these services in a highly available and scaled-
out fashion. The appliances where the centralized services or SR instances are hosted are called
Edge nodes. An Edge node is the appliance that provides connectivity to the physical
infrastructure.
The left side of Figure 4-6 shows the logical view of a Tier-0 Gateway with both DR and SR
components when connected to a physical router. The right side of Figure 4-6 shows how the
components of the Tier-0 Gateway are realized on a compute hypervisor and an Edge node. Note that
the compute host (i.e., HV1) has just the DR component, while the Edge node shown on the right
has both the SR and DR components. The SR/DR forwarding table merge was done to address
future use cases. SR and DR functionality remains the same after the SR/DR merge in the NSX 2.4
release, but with this change the SR has direct visibility into the overlay segments. Notice that all
the overlay segments are attached to the SR as well.
As mentioned previously, connectivity between DR on the compute host and SR on the Edge
node is auto plumbed by the system. Both the DR and SR get an IP address assigned in
169.254.0.0/24 subnet by default. The management plane also configures a default route on
the DR with the next hop IP address of the SR’s intra-tier transit link IP. This allows the DR to
take care of E-W routing while the SR provides N-S connectivity.
FIGURE 4-8 shows a detailed packet walk from data center VM “Web1” to a device on the L3
physical infrastructure. As discussed in the E-W routing section, routing always happens closest
to the source. In this example, eBGP peering has been established between the physical router
interface with the IP address 192.168.240.1 and the Tier-0 Gateway SR component hosted on
the Edge node with an external interface IP address of 192.168.240.3. Tier-0 Gateway SR has a
BGP route for 192.168.100.0/24 prefix with a next hop of 192.168.240.1 and the physical router
has a BGP route for 172.16.10.0/24 with a next hop of 192.168.240.3.
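The eBGP peering described above could be configured through the Policy API along the lines of the sketch below. The Tier-0 and locale-services IDs, AS numbers, and addresses are illustrative assumptions taken from the figure's addressing; the full BGP schema is in the NSX API documentation.

"""Sketch: enabling BGP on a Tier-0 SR and peering with the physical router.
All IDs, AS numbers, and credentials are hypothetical."""
import requests

NSX = "https://nsx-mgr.example.com/policy/api/v1"
AUTH = ("admin", "VMware1!VMware1!")
BGP = f"{NSX}/infra/tier-0s/corp-t0-gw/locale-services/default/bgp"

# Local BGP process on the Tier-0 SR.
requests.patch(BGP, json={"enabled": True, "local_as_num": "65100"},
               auth=AUTH, verify=False).raise_for_status()

# Peering with the physical router at 192.168.240.1 (as in the example above).
neighbor = {"neighbor_address": "192.168.240.1", "remote_as_num": "65001"}
requests.put(f"{BGP}/neighbors/physical-rtr-1", json=neighbor,
             auth=AUTH, verify=False).raise_for_status()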
Two-Tier Routing
In addition to providing optimized distributed and centralized routing functions, NSX supports a
multi-tiered routing model with logical separation between different gateways within the NSX
infrastructure. The top-tier gateway is referred to as a Tier-0 gateway while the bottom-tier
gateway is a Tier-1 gateway. This structure gives complete control and flexibility over services
and policies. Various stateful services can be hosted on the Tier-1 while the Tier-0 can operate
in an active-active manner.
Configuring two-tier routing is not mandatory; routing can be single-tiered, as shown in the previous
section. Figure 4-10 presents an NSX two-tier routing architecture.
Northbound, the Tier-0 gateway connects to one or more physical routers/L3 switches and
serves as an on/off ramp to the physical infrastructure. Southbound, the Tier-0 gateway
connects to one or more Tier-1 gateways or directly to one or more segments as shown in
North-South routing section. Northbound, the Tier-1 gateway connects to a Tier-0 gateway
using a RouterLink port. Southbound, it connects to one or more segments using downlink
interfaces.
The DR/SR concepts discussed in Section 4.1 remain the same for multi-tiered routing. As with the
Tier-0 gateway, when a Tier-1 gateway is created, a distributed component (DR) of the Tier-1
gateway is intelligently instantiated on the hypervisors and Edge nodes. Before enabling a
centralized service on a Tier-0 or Tier-1 gateway, an edge cluster must be configured on this
gateway. Configuring an edge cluster on a Tier-1 gateway instantiates a corresponding Tier-1
service component (SR) on two Edge nodes that are part of this edge cluster. Configuring an Edge
cluster on a Tier-0 gateway does not automatically instantiate a Tier-0 service component (SR);
the service component (SR) will only be created on a specific edge node along with the external
interface creation.
Unlike the Tier-0 gateway, the Tier-1 gateway does not support northbound connectivity to the
physical infrastructure. A Tier-1 gateway can only connect northbound to:
• a Tier-0 gateway
• a service port (service interface), used to connect a one-arm load balancer to a segment. More
details are available in Chapter 6.
Note that connecting a Tier-1 gateway to a Tier-0 gateway is a one-click (or single API call)
configuration, regardless of the components instantiated (DR and SR) for that gateway.
• Router Link Interface/Linked Port: Interface connecting Tier-0 and Tier-1 gateways. Each
Tier-0-to-Tier-1 peer connection is provided a /31 subnet within the 100.64.0.0/16
reserved address space (RFC6598). This link is created automatically when the Tier-0 and
Tier-1 gateways are connected. This subnet can be changed when the Tier-0 gateway is
being created. It is not possible to change it afterward.
• Service Interface: Interface connecting VLAN segments to provide connectivity to VLAN
backed physical or virtual workloads. Service interfaces can also be connected to
overlay/VLAN segments for standalone load balancer use cases, explained in
Chapter 6. Service interfaces support static routing. They are supported on both Tier-0 and
Tier-1 gateways configured in Active/Standby high-availability configuration mode,
explained in section 4.6.2. Note that a Tier-0 or Tier-1 gateway must have an SR
component to realize service interfaces. This interface was referred to as a centralized
service interface in previous releases. Dynamic routing is not supported on service
interfaces.
• Loopback: The Tier-0 gateway supports loopback interfaces. A loopback interface is a
virtual interface, and it can be redistributed into a routing protocol.
• Tier-1 Gateway
o Connected – Connected routes on Tier-1 include segment subnets connected to Tier-
1 and service interface subnets configured on Tier-1 gateway.
o In FIGURE 4-12, 172.16.10.0/24 (Connected segment) and 192.168.10.0/24 (Service
Interface) are connected routes for Tier-1 gateway.
o Static– User configured static routes on Tier-1 gateway.
o NAT IP – NAT IP addresses owned by the Tier-1 gateway discovered from NAT rules
configured on the Tier-1 gateway.
o LB VIP – IP address of load balancing virtual server.
o LB SNAT – IP address or a range of IP addresses used for Source NAT by load balancer.
o IPsec Local IP – Local IPsec endpoint IP address for establishing VPN sessions.
o DNS Forwarder IP – Listener IP for DNS queries from clients. Also used as the source
IP to forward DNS queries to the upstream DNS server.
“Tier-1 Gateway” advertises connected routes to “Tier-0 Gateway”. Figure 4-12 shows an
example of connected routes (172.16.10.0/24 and 192.168.10.0/24). If there are other route
types, like NAT IP, as discussed in section 4.2.2, a user can advertise those route types as
well. As soon as “Tier-1 Gateway” is connected to “Tier-0 Gateway”, the management plane
configures a default route on “Tier-1 Gateway” with the next-hop IP address set to the RouterLink
interface IP of “Tier-0 Gateway”, i.e., 100.64.224.0/31 in the example above.
Tier-0 Gateway sees 172.16.10.0/24 and 192.168.10.0/24 as Tier-1 Connected routes (t1c) with
a next hop of 100.64.224.1/31. Tier-0 Gateway also has its own Tier-0 “Connected” routes
(172.16.20.0/24) in Figure 4-12.
Northbound, “Tier-0 Gateway” redistributes the Tier-0 connected and Tier-1 connected routes
into BGP and advertises these routes to its BGP neighbor, the physical router.
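The route types a Tier-1 gateway advertises to its Tier-0 gateway are controlled on the Tier-1 object itself, as in the sketch below. The route_advertisement_types values follow the Policy API Tier-1 schema; the gateway IDs and the selected types are illustrative.

"""Sketch: attaching a Tier-1 gateway to a Tier-0 gateway and selecting which
route types it advertises northbound. IDs are hypothetical."""
import requests

NSX = "https://nsx-mgr.example.com/policy/api/v1"
AUTH = ("admin", "VMware1!VMware1!")

tier1 = {
    "display_name": "tenant1-t1-gw",
    "tier0_path": "/infra/tier-0s/corp-t0-gw",     # creates the RouterLink connection
    # Advertise connected segments plus NAT and LB VIP addresses to the Tier-0.
    "route_advertisement_types": [
        "TIER1_CONNECTED", "TIER1_NAT", "TIER1_LB_VIP",
    ],
}
requests.put(f"{NSX}/infra/tier-1s/tenant1-t1-gw", json=tier1,
             auth=AUTH, verify=False).raise_for_status()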
FIGURE 4-13 shows both logical and per transport node views of two Tier-1 gateways serving
two different tenants and a Tier-0 gateway. Per transport node view shows that the distributed
component (DR) for Tier-0 and the Tier-1 gateways have been instantiated on two hypervisors.
If “VM1” in tenant 1 needs to communicate with “VM3” in tenant 2, routing happens locally on hypervisor “HV1”. This eliminates the need to send traffic to a centralized location to be routed between different tenants or environments.
Multi-Tier Distributed Routing with Workloads on the same Hypervisor
The following list provides a detailed packet walk between workloads residing in different
tenants but hosted on the same hypervisor.
1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM3” (172.16.201.11) in tenant 2.
The packet is sent to its default gateway interface located on tenant 1, the local Tier-1
DR.
2. Routing lookup happens on the tenant 1 Tier-1 DR and the packet is routed to the Tier-0
DR following the default route. This default route has the RouterLink interface IP
address (100.64.224.0/31) as a next hop.
3. Routing lookup happens on the Tier-0 DR. It determines that the 172.16.201.0/24
subnet is learned from the tenant 2 Tier-1 DR (100.64.224.3/31) and the packet is
routed there.
4. Routing lookup happens on the tenant 2 Tier-1 DR. This determines that the
172.16.201.0/24 subnet is directly connected. L2 lookup is performed in the local MAC
table to determine how to reach “VM3” and the packet is sent.
The reverse traffic from “VM3” follows a similar process. A packet from “VM3” to destination 172.16.10.11 is sent to the tenant 2 Tier-1 DR, then follows the default route to the Tier-0 DR.
The Tier-0 DR routes this packet to the tenant 1 Tier-1 DR and the packet is delivered to “VM1”.
During this process, the packet never left the hypervisor to be routed between tenants.
The following list provides a detailed packet walk between workloads residing in different tenants and hosted on different hypervisors.
1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM2” (172.16.200.11) in tenant 2.
VM1 sends the packet to its default gateway interface located on the local Tier-1 DR in
HV1.
2. Routing lookup happens on the tenant 1 Tier-1 DR and the packet follows the default
route to the Tier-0 DR with a next hop IP of 100.64.224.0/31.
3. Routing lookup happens on the Tier-0 DR which determines that the 172.16.200.0/24
subnet is learned via the tenant 2 Tier-1 DR (100.64.224.3/31) and the packet is routed
accordingly.
4. Routing lookup happens on the tenant 2 Tier-1 DR, which determines that the 172.16.200.0/24 subnet is a directly connected subnet. A lookup is performed in the ARP table to determine the MAC address associated with the “VM2” IP address. This destination MAC is learned via the remote TEP on hypervisor “HV2”.
5. The “HV1” TEP encapsulates the packet and sends it to the “HV2” TEP, finally leaving the
host.
6. The “HV2” TEP decapsulates the packet and recognizes the VNI in the Geneve header. An L2 lookup is performed in the local MAC table associated with the LIF where “VM2” is connected.
7. The packet is delivered to “VM2”.
The return packet follows the same process. A packet from “VM2” gets routed to the local
hypervisor Tier-1 DR and is sent to the Tier-0 DR. The Tier-0 DR routes this packet to tenant 1
Tier-1 DR which performs the L2 lookup to find out that the MAC associated with “VM1” is on
remote hypervisor “HV1”. The packet is encapsulated by “HV2” and sent to “HV1”, where this
packet is decapsulated and delivered to “VM1”. It is important to note that in this use case, routing is performed locally on the hypervisor hosting the VM sourcing the traffic.
Routing Capabilities
NSX supports static routing as well as dynamic routing protocols to provide connectivity to IPv4 and IPv6 workloads.
Static routing
As described previously, in a multi-tier routing architecture NSX automatically creates the RouterLink segment, ports, and static routes to interconnect a Tier-0 gateway with one or several Tier-1 gateways.
By default, the RouterLink segment is in the 100.64.0.0/16 IPv4 address range, which is a shared IPv4 address space compliant with RFC 6598. A default static route pointing to the Tier-0 gateway DR RouterLink port is automatically installed in the Tier-1 gateway routing table.
Northbound, static routes can be configured on Tier-1 gateways with the next hop IP as the
RouterLink IP of the Tier-0 gateway (100.64.0.0/16 range or a range defined by user for
RouterLink interface when the Tier-0 gateway is being created). Southbound, static routes can
also be configured on Tier-1 gateway with a next hop as a layer 3 device reachable via a service
or downlink interface.
Tier-0 gateways can be configured with a static route toward external subnets with a next hop
IP of the physical router. Southbound, static routes can be configured on Tier-0 gateways with a
next hop of a layer 3 device reachable via Service interface.
ECMP is supported with static routes to provide load balancing, increased bandwidth, and fault tolerance for failed paths or Edge nodes. Figure 4-15 shows a Tier-0 gateway with two external interfaces leveraging Edge nodes EN1 and EN2, connected to two physical routers. Two equal-cost static default routes are configured for ECMP on the Tier-0 gateway. Up to eight paths are supported in ECMP.
Tier-0 northbound static routes can be protected via BFD. BFD timers depend on the Edge node
type. Bare metal Edge supports a minimum of 50ms TX/RX BFD keep alive timer while the VM
form factor Edge supports a minimum of 500ms TX/RX BFD keep alive timer.
It is recommended to always implement BFD when configuring static routing between the Tier-0 gateway and the physical network. BFD will detect upstream physical network failures in the absence of a dynamic routing protocol.
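The following sketch shows how the two equal-cost static default routes of Figure 4-15 could be expressed through the Policy API. It is illustrative only: the gateway ID, next-hop addresses, manager address, and credentials are placeholders, and BFD protection for these routes would be configured separately.

```python
# Minimal sketch: two equal-cost static default routes on a Tier-0 gateway
# (one per physical router) to obtain ECMP, as in the Figure 4-15 topology.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

default_route = {
    "display_name": "default-ecmp",
    "network": "0.0.0.0/0",
    "next_hops": [
        {"ip_address": "192.168.240.1", "admin_distance": 1},   # physical router 1
        {"ip_address": "192.168.250.1", "admin_distance": 1},   # physical router 2
    ],
}

r = requests.patch(
    f"{NSX}/policy/api/v1/infra/tier-0s/t0-corp/static-routes/default-ecmp",
    json=default_route, auth=AUTH, verify=False)
r.raise_for_status()
```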
First hop redundancy protocols on the physical network and a virtual IP (VIP) on the Tier-0 gateway can also be implemented to provide high availability, but they are less reliable and slower than BFD, cannot provide ECMP, and a VIP only works with active/standby Tier-0 gateways. They should only be considered if enabling BFD is not an option.
The BGP session will not be GR capable if only one of the peers advertises it in the BGP OPEN message; GR needs to be configured on both ends. GR can be enabled or disabled per Tier-0 gateway. The GR restart timer is 180 seconds by default and should not be changed after a BGP peering adjacency is in the established state; otherwise, the peering needs to be negotiated again.
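A minimal sketch of enabling BGP graceful restart on a Tier-0 gateway through the Policy API is shown below. The gateway and locale-services identifiers, AS number, host, and credentials are placeholders; remember that GR must also be enabled on the physical peer for the session to be GR capable.

```python
# Minimal sketch: enabling BGP graceful restart on a Tier-0 gateway.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

bgp_config = {
    "enabled": True,
    "local_as_num": "65001",
    "graceful_restart_config": {
        "mode": "GR_AND_HELPER",              # or "HELPER_ONLY" / "DISABLE"
        "timer": {"restart_timer": 180},      # default restart timer
    },
}

r = requests.patch(
    f"{NSX}/policy/api/v1/infra/tier-0s/t0-corp/locale-services/default/bgp",
    json=bgp_config, auth=AUTH, verify=False)
r.raise_for_status()
```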
• If a loopback is configured, the RID will be equal to the highest numeric loopback IPv4 address.
NSX does not support manually configuring an OSPF router ID. There is no preemption when calculating the router ID, as preemption would trigger a reset of the OSPF adjacency and disrupt traffic forwarding.
To change the RID, the OSPF process must be restarted either using the UI or an API call. An
NSX edge reboot will not change the RID. If the NSX edge is being redeployed, the OSPF RID will
be recalculated.
Once the OSPF RID is chosen, OSPF hello messages are sent on the OSPF enabled external
uplinks using the multicast address “224.0.0.5” as represented on FIGURE 4-16.
Figure 4-16 - OSPF Adjacency between the Tier-0 SR and the physical network fabric
As explained earlier, hello messages are sent on the OSPF-enabled external uplink to the multicast address 224.0.0.5 (AllSPFRouters). The following parameters are checked before establishing the adjacency:
• Interfaces between the OSPF routers must be in the same subnet.
• Interfaces between the OSPF routers must belong to the same OSPF area.
• Interfaces between the OSPF routers must have the same OSPF area type.
• Router ID must be unique between the OSPF Routers.
• OSPF timers (HelloInterval and RouterDeadInterval) must match between the OSPF
routers.
• Authentication process must be validated.
OSPF Authentication
The Tier-0 gateway supports the following authentication methods:
• None
• Password: the password is sent in clear text over the external uplink interface
• MD5: the password is used to compute a message digest that is sent over the external uplink interface
NSX supports area-wide authentication and does not support authentication per interface.
VMware recommends authenticating every dynamic routing protocol adjacency using the MD5
method to exchange routes in a secure way.
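The sketch below outlines how OSPF with area-wide MD5 authentication might be enabled on a Tier-0 gateway through the Policy API. The endpoint and field names are assumptions based on the Policy schema that accompanies OSPF support and may differ between releases; identifiers, keys, host, and credentials are placeholders.

```python
# Illustrative sketch only: enabling OSPF on a Tier-0 and protecting the area
# with MD5 authentication. Endpoint and field names are assumptions and should
# be validated against the Policy API documentation of the release in use.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
BASE = f"{NSX}/policy/api/v1/infra/tier-0s/t0-corp/locale-services/default/ospf"

# Enable the OSPF process on the Tier-0 (one area per Tier-0).
requests.patch(BASE, json={"enabled": True}, auth=AUTH, verify=False).raise_for_status()

# Define the single area with MD5 authentication (area-wide, not per interface).
area = {
    "area_id": "0.0.0.0",
    "area_type": "NORMAL",                    # or "NSSA"
    "authentication": {"mode": "MD5", "key_id": 1, "secret_key": "S3cr3tK3y"},
}
requests.patch(f"{BASE}/areas/area-0", json=area, auth=AUTH, verify=False).raise_for_status()
```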
• Full:
o Routers have exchanged their LSDb and are fully adjacent.
After both routers have exchanged their LSDb, they run the Dijkstra algorithm. Since their LSDb is identical, they possess the same knowledge of the OSPF topology. The Dijkstra algorithm is run so that each router can calculate its own best path to reach every destination.
Graceful Restart (GR)
Graceful restart in OSPF allows a neighbor to preserve its forwarding table while the control
plane restarts. It is recommended to enable OSPF Graceful restart when the OSPF peer has
multiple supervisors. An OSPF control plane restart could happen due to a supervisor
switchover in a dual supervisor hardware, planned maintenance, or active routing engine crash.
As soon as a GR-enabled router restarts (control plane failure), it preserves its forwarding table,
marks the routes as stale, and sets a grace period restart timer for the OSPF adjacency to
reestablish. If the OSPF adjacency reestablishes during this grace period, route revalidation is
done, and the forwarding table is updated. If the OSPF adjacency does not reestablish within
this grace period, the router flushes the stale routes.
OSPF Graceful restart helper mode is supported.
maximum number of routers fully establishing an OSPF adjacency using the Point-to-Point
network type is two. No DR or BDR election is performed using this OSPF network type. The
Tier-0 SR OSPF Routers are considered “DROther” (OSPF priority of 0).
It is recommended to configure the Tier-0 SR uplink interfaces using a /31 network mask as
depicted on FIGURE 4-20:
In this example, the Tier-0 SR hosted on both the Edge Node 01 and Edge Node 02 will establish
an adjacency with both the physical router 1 and the physical router 2.
From a global Tier-0 perspective, all four adjacencies to the physical routers will be in the “Full”
state.
OSPF Broadcast Network Type:
On an Ethernet segment using the OSPF broadcast network type, more than two OSPF routers can exchange their LSAs. To reduce LSA flooding on that segment, the OSPF process elects a Designated Router (DR) and a Backup Designated Router (BDR). The router with the highest OSPF priority (1-255) is elected DR, and the router with the second highest OSPF priority is elected BDR. All other routers on the segment are considered OSPF DROthers.
If the priorities for the DR or BDR role are equal, the router with the highest OSPF Router ID is elected.
All DROther routers will establish an adjacency with both the DR and the BDR as demonstrated
on FIGURE 4-21. These adjacencies will be in the “Full” state and LSUs can be exchanged. The DR
and BDR also establish an adjacency between themselves.
Figure 4-21: OSPF adjacency in the Full State – Broadcast network type
The DROther routers will see each other in the 2-Way state but will not exchange their LSAs directly between themselves; the DR and BDR perform that function.
Figure 4-22: OSPF adjacency in the “2-Way” State – Broadcast network type
In this example, each router will have 2 adjacencies in the Full State and 2 adjacencies in the “2-
Way” state.
From an NSX perspective, the Tier-0 gateway is always a DROther as its OSPF priority is hard-
coded to 0. It is not possible to change that priority.
FIGURE 4-23 demonstrates the OSPF adjacencies in the “Full” state between the physical networking fabric and the Tier-0 Service Routers.
Figure 4-23: OSPF adjacency in the Full State – Broadcast network type
FIGURE 4-24 demonstrates the OSPF adjacencies in the “2-Way” state between the Tier-0
Service Routers.
When OSPF uses the broadcast network type, the Tier-0 SRs (DROthers) send their link-state updates toward the Designated Router in the networking fabric using the multicast address 224.0.0.6 (AllDRouters multicast address). The DR floods its link-state updates back using the 224.0.0.5 multicast address (AllSPFRouters multicast address).
The number of adjacencies is increased using this network type which can increase the
complexity of the OSPF topology.
The OSPF LSDb is identical throughout the entire area; therefore, there is no need to establish an adjacency between the Tier-0 SRs.
As explained earlier, VMware recommends using the Point-to-Point OSPF network type over
the Broadcast network type to simplify the routing topology.
This can be problematic in a very large-scale environment, where it would be unnecessary for all routers to recompute the Dijkstra algorithm every time something changes in the network.
OSPF routers and links can be compartmented into different areas to limit LSA flooding and reduce the time the SPF algorithm is run.
Routers with links in both the backbone area and a standard area are considered Area Border Routers (ABRs). The Tier-0 gateway cannot perform ABR duties since only one area per Tier-0 is supported.
FIGURE 4-25 represents the OSPF LSAs that are supported in an NSX OSPF architecture.
The OSPF areas supported by NSX are listed in FIGURE 4-26 and represented in FIGURE 4-27.
Since a Tier-0 service router redistributes the routes into the OSPF domain, it can be considered
as an Autonomous System Boundary Router. An ABR in the OSPF topology will inject LSA type 4
in other areas to make sure the OSPF routers know how to reach the ASBR. When a Tier-0 SR
redistributes a prefix into OSPF, it uses an external metric of type 2:
• Total cost to reach the prefix is always the redistributed metric. Internal cost to reach the
ASBR is ignored.
The OSPF External type 1 is not supported by NSX when a prefix is redistributed.
FIGURE 4-28 represents the different external metric types used by the Tier-0 when different
area types are used:
• Standard and Backbone areas: Routes are redistributed using LSAs type 5 with an
external type of E2 and a cost of 20.
• When the Tier-0 gateway is connected to a not so stubby area, it redistributes its routes
using LSAs type 7 with an external type of N2 and a cost of 20.
The OSPF metric cannot be changed in NSX and will have a hard coded value of 20 throughout
the OSPF domain.
Not So Stubby Areas are recommended for very large topologies. Instead of redistributing the
prefixes using LSA type 5 in the OSPF domain, LSA type 7 will be used. This type of LSA is
originated by the ASBR and flooded in the NSSA only. Another ABR in the OSPF domain will
translate that LSA type 7 into an LSA type 5 that will be transmitted throughout the OSPF
domain.
With OSPF support introduced in NSX 3.1.1, it is now possible to choose which protocol routes are redistributed into: prefixes can be redistributed into BGP or OSPF.
OSPF can also be used to learn BGP peer IP addresses in the case of an eBGP multi-hop topology.
FIGURE 4-29 represents an example where OSPF and BGP are used on the same Tier-0 gateway. OSPF is used as an IGP to provide connectivity between the loopback interfaces.
BGP can be set up between the physical routers' loopback interfaces and the Tier-0 loopback interfaces. This solution should only be considered when direct single-hop eBGP peering is not an option. Chapter 7 provides guidance on the recommended routing configuration for BGP peering on directly connected segments.
Figure 4-29: OSPF used to learn loopback reachability to establish an eBGP multi-hop peering
Since the internal transit interface on the standby Tier-0 SR is administratively shutdown, the
physical fabric needs to send the traffic to the NSX domain via the active Tier-0 SR.
The standby Tier-0 SR will redistribute the internal NSX prefixes in OSPF and advertise them
with a cost of 65534 (External Type E2 or N2).
The active Tier-0 SR redistributes the same prefixes with a cost of 20; the networking fabric will logically prefer the prefixes with the lower cost and therefore use the links via the active Tier-0 SR.
In this topology, ECMP with the top of rack switches can still be leveraged as demonstrated on
FIGURE 4-29. There is no ECMP between the Tier-0 DR and the Tier-0 SR.
Figure 4-29: ECMP between the physical fabric and the Active Tier-0 SR
If the IP addressing scheme has been designed properly for summarization (contiguous subnets), this large number of Type 5 LSAs can be advertised as a single summary prefix embedded in a single LSA Type 5, as depicted in Figure 4-32.
When the area connected to the Tier-0 gateway is an NSSA, route summarization is supported
and the type of LSA injected in the area for summarization is an LSA Type 7.
From a dynamic routing perspective, there are two options when it comes to designing a modern data center routing architecture with NSX:
• OSPFv2
• BGP
BGP is well known in the networking industry as the routing protocol of choice for interoperability, as it exchanges prefixes between autonomous systems on the internet. Modern and scalable data center architectures rely on BGP to provide connectivity, as multiprotocol support is not possible with OSPFv2; an OSPFv2 implementation requires a different routing protocol or static routing to route IPv6 packets.
VMware recommends using BGP over OSPF as it provides more flexibility, has a richer feature set, and supports multiple address families.
OSPF should be considered when it is the only routing protocol running in the physical fabric and no feature available only with BGP is required. In such a scenario, OSPF avoids the implementation of route redistribution, leading to a simpler and more scalable design.
A comparison of features between OSPF and BGP is described in Table 4-6.
Figure 4-33: Type of IPv6 addresses supported on Tier-0 and Tier-1 Gateway components
FIGURE 4-34 shows a single tiered routing topology on the left side with a Tier-0 Gateway
supporting dual stack on all interfaces and a multi-tiered routing topology on the right side with
a Tier-0 Gateway and Tier-1 Gateway supporting dual stack on all interfaces. A user can either
assign static IPv6 addresses to the workloads or use a DHCPv6 relay supported on gateway
interfaces to get dynamic IPv6 addresses from an external DHCPv6 server.
For a multi-tier IPv6 routing topology, each Tier-0-to-Tier-1 peer connection is provided a /64 unique local IPv6 prefix from a pool, i.e. fc5f:2b61:bd01::/48. A user has the flexibility to
change this subnet range and use another subnet if desired. Similar to IPv4, this IPv6 address is auto-plumbed by the system in the background.
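As a quick worked example of the addressing math, a /48 unique local pool provides 65,536 possible /64 RouterLink prefixes. The short Python snippet below, which is purely illustrative, enumerates the first few /64s of the pool mentioned above; NSX assigns these prefixes automatically.

```python
# Worked example: how many /64 RouterLink prefixes a /48 ULA pool provides, and
# what the first few allocations look like. Purely illustrative.
import ipaddress
from itertools import islice

pool = ipaddress.ip_network("fc5f:2b61:bd01::/48")
print(pool.num_addresses // 2**64)          # 65536 possible /64 transit prefixes

for subnet in islice(pool.subnets(new_prefix=64), 3):
    print(subnet)                           # fc5f:2b61:bd01::/64, :1::/64, :2::/64
```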
Active/Active
Active/Active – This is a high availability mode where SRs hosted on Edge nodes act as active
forwarders. Stateless services such as layer 3 forwarding are IP based, so it does not matter
which Edge node receives and forwards the traffic. All the SRs configured in active/active
configuration mode are active forwarders. This high availability mode is only available on Tier-0
gateway.
Stateful services typically require tracking of connection state (e.g., sequence number check,
connection state), thus traffic for a given session needs to go through the same Edge node. As
of NSX 3.0, active/active HA mode does not support stateful services such as Gateway Firewall
or stateful NAT. Stateless services, including reflexive NAT and stateless firewall, can leverage
the active/active HA model.
The left side of FIGURE 4-36 shows a Tier-0 gateway (configured in active/active high availability mode) with two external interfaces leveraging two different Edge nodes, EN1 and EN2. The right side of the diagram shows the service router component (SR) of this Tier-0 gateway instantiated on both Edge nodes, EN1 and EN2. A compute host (ESXi) is also shown in the diagram; it only has the distributed component (DR) of the Tier-0 gateway.
Note that the Tier-0 SRs on Edge nodes EN1 and EN2 have different IP addresses northbound toward the physical routers and different IP addresses southbound toward the Tier-0 DR. The management plane configures two default routes on the Tier-0 DR, with the SR on EN1 (169.254.0.2) and the SR on EN2 (169.254.0.3) as next hops, to provide ECMP for overlay traffic coming from the compute hosts.
North-South traffic from overlay workloads hosted on compute hosts will be load balanced and sent to the SR on EN1 or EN2, which will perform a further routing lookup to send the traffic out to the physical infrastructure.
A user does not have to configure these static default routes on the Tier-0 DR; they are automatically plumbed in the background depending on the HA mode configuration.
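A minimal sketch of creating an active/active Tier-0 gateway and binding it to an edge cluster through the Policy API follows. The gateway ID, locale-services ID, edge cluster UUID, host, and credentials are placeholders.

```python
# Minimal sketch: creating a Tier-0 gateway in active/active HA mode and binding
# it to an edge cluster (EN1/EN2) through a locale-services child object.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
T0 = f"{NSX}/policy/api/v1/infra/tier-0s/t0-corp"

requests.patch(T0, json={"display_name": "t0-corp", "ha_mode": "ACTIVE_ACTIVE"},
               auth=AUTH, verify=False).raise_for_status()

locale = {"edge_cluster_path":
          "/infra/sites/default/enforcement-points/default/edge-clusters/<edge-cluster-uuid>"}
requests.patch(f"{T0}/locale-services/default", json=locale,
               auth=AUTH, verify=False).raise_for_status()
```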
Inter-SR Routing
To provide redundancy against a physical router failure, the Tier-0 SRs on both Edge nodes must establish routing adjacencies or exchange routing information with different physical routers or ToRs. These physical routers may or may not have the same routing information. For instance, a route 192.168.100.0/24 may only be available on physical router 1 and not on physical router 2.
For such asymmetric topologies, users can enable Inter-SR routing. This feature is only available on Tier-0 gateways configured in active/active high availability mode. Figure 4-34 shows an asymmetric routing topology with the Tier-0 gateway on Edge nodes EN1 and EN2 peering with physical router 1 and physical router 2, each advertising different routes.
When Inter-SR routing is enabled by the user, an overlay segment is auto-plumbed between the SRs (similar to the transit segment auto-plumbed between DR and SR) and each end gets an IP address assigned in the 169.254.0.128/25 subnet by default. An IBGP session is automatically created between the Tier-0 SRs, and northbound routes (EBGP and static routes) are exchanged on this IBGP session.
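The sketch below shows how Inter-SR routing could be enabled on the Tier-0 BGP configuration through the Policy API. Identifiers, host, and credentials are placeholders; the inter-SR segment and IBGP session themselves are auto-plumbed by NSX as described above.

```python
# Minimal sketch: enabling inter-SR iBGP on an active/active Tier-0 for the
# asymmetric peering scenario described above.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

r = requests.patch(
    f"{NSX}/policy/api/v1/infra/tier-0s/t0-corp/locale-services/default/bgp",
    json={"enabled": True, "inter_sr_ibgp": True},
    auth=AUTH, verify=False)
r.raise_for_status()
```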
As explained in the previous figure, the Tier-0 DR has auto-plumbed default routes with the Tier-0 SRs as next hops, and North-South traffic can go to either SR on EN1 or EN2. In asymmetric routing topologies, a particular Tier-0 SR may or may not have the route to a destination. In that case, traffic can follow the IBGP route to another SR that has the route to the destination.
FIGURE 4-37 shows a topology where the Tier-0 SR on EN1 is learning a default WAN route 0.0.0.0/0 and a corporate prefix 192.168.100.0/24 from physical router 1 and physical router 2, respectively. If the "External 1" interface on the Tier-0 fails and traffic from compute workloads destined to the WAN lands on the Tier-0 SR on EN1, this traffic can follow the default route (0.0.0.0/0) learned via IBGP from the Tier-0 SR on EN2. Traffic is sent to EN2 through the Geneve overlay. After a route lookup on the Tier-0 SR on EN2, this N-S traffic can be sent to physical router 1 using "External interface 3".
Active/Standby
Active/Standby – This is a high availability mode where only one SR acts as the active forwarder. This mode is required when stateful services are enabled. Services like NAT are in a constant state of sync between the active and standby SRs on the Edge nodes. This mode is supported on both Tier-1 and Tier-0 SRs. Preemptive and non-preemptive modes are available for both Tier-0 and Tier-1 SRs. The default mode for gateways configured in active/standby high availability configuration is non-preemptive.
100
VMware NSX Reference Design Guide
A user can select the preferred member (Edge node) when a gateway is configured in active/standby preemptive mode. When enabled, preemptive behavior allows an SR to resume the active role on the preferred Edge node as soon as it recovers from a failure.
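A hedged sketch of configuring preemptive active/standby behavior with a preferred Edge node through the Policy API is shown below. The gateway and locale-services IDs, edge node path, host, and credentials are placeholders, and the preferred-member field shown is an assumption that should be validated against the Policy schema of the release in use.

```python
# Minimal sketch: a Tier-1 gateway in preemptive active/standby mode with a
# preferred edge node. The preferred_edge_paths field and path format are
# assumptions; substitute the real edge cluster and node identifiers.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
T1 = f"{NSX}/policy/api/v1/infra/tier-1s/t1-tenant1"

requests.patch(T1, json={"failover_mode": "PREEMPTIVE"},
               auth=AUTH, verify=False).raise_for_status()

locale = {
    "edge_cluster_path":
        "/infra/sites/default/enforcement-points/default/edge-clusters/<edge-cluster-uuid>",
    # Listing EN1 first makes it the preferred (active) member.
    "preferred_edge_paths": [
        "/infra/sites/default/enforcement-points/default/edge-clusters/"
        "<edge-cluster-uuid>/edge-nodes/<en1-index>",
    ],
}
requests.patch(f"{T1}/locale-services/default", json=locale,
               auth=AUTH, verify=False).raise_for_status()
```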
For a Tier-1 gateway, active/standby SRs have the same IP addresses northbound. Only the active SR replies to ARP requests, while the standby SR interfaces' operational state is set to down so that they automatically drop packets.
For a Tier-0 gateway, active/standby SRs have different IP addresses northbound and both have BGP sessions or OSPF adjacencies established on their uplinks. Both Tier-0 SRs (active and standby) receive routing updates from the physical routers and advertise routes to them; however, the standby Tier-0 SR prepends its local AS three times in the case of BGP, or advertises routes with a higher metric in the case of OSPF, so that traffic from the physical routers prefers the active Tier-0 SR.
Southbound IP addresses on active and standby Tier-0 SRs are the same and the operational
state of standby SR southbound interface is down. Since the operational state of southbound
Tier-0 SR interface is down, the Tier-0 DR does not send any traffic to the standby SR. FIGURE
4-38 shows active and standby Tier-0 SRs on Edge nodes “EN1” and “EN2”.
If the edge node connects to multiple dual-supervisor systems, the network architect will have to choose which type of failover mechanism to prioritize: graceful restart or traffic rerouting over a different link.
Failover events that belong to the first category are (all services and SRs will fail over to the corresponding peer edge node):
1. Dataplane service is not running.
2. All TEP interfaces are down. This condition only applies to bare metal edge nodes as the
interface of an edge node VM should never go “physically” down. If a bare metal edge
node has more than one TEP interface, all need to be down for this condition to take
effect.
3. The edge node enters maintenance mode.
4. Edge nodes run BFD with compute hosts and other edge nodes to monitor the health of
the overlay tunnels. If host transport nodes are present, when all the overlay tunnels
are down to both remote Edges and compute hypervisors, all the SRs on the edge will
be declared down. This condition does not apply if no host transport nodes are present.
5. Edge nodes in an Edge cluster exchange BFD keepalives on two interfaces, the management and TEP interfaces. Each edge monitors the status of its peer edges on those two interfaces; if both are down, it considers the peer down and makes the corresponding SR active.
Failover events that belong to the second category are (only the specific SR is declared down and fails over to the peer edge):
1. Northbound routing on a Tier-0 SR is down. This applies to BGP and OSPF (with or without BFD enabled) and to static routes with BFD enabled. When this situation occurs, only the Tier-0 SR will be declared failed. Other Tier-1 SRs on the same edge will still be
active. If the Tier-0 is configured with static routes with no BFD, northbound routing will
never be considered down.
2. Services are configured on the SR, and the health score of the services on the SR is less than that on the peer SR.
We will now review in more detail the failover triggers that require more careful consideration from a design perspective.
FIGURE 4-40 outlines condition 5. The BFD sessions between two edge nodes are lost on both
the management and the overlay network. Standby Tier-0 and Tier-1 SRs will become active.
This condition addresses the scenario when an edge node fails or is completely isolated. When
designing edge node connectivity, it is important to consider corner-case situations when this
condition may apply, but the edge node is not down or completely isolated. For example, uplink
connectivity could be up while the overlay and management networks could be down if the uplink
traffic used dedicated pNICs. Designing for fate sharing between uplink, management, and
overlay traffic will mitigate the risk of a dual-active scenario. If dedicated pNICs are part of the
design, providing adequate pNIC redundancy to management and overlay traffic will ensure
that condition five is only triggered because of a complete failure of an edge.
Figure 4-40: Failover triggers – BFD on management and overlay network down
FIGURE 4-41 below outlines condition 6. A northbound connectivity issue is detected by the dynamic routing protocol timers or BFD. In this case, only the affected Tier-0 will undergo a failover event. Any other Tier-1 SR on the same edge will remain active. Static routes without BFD do not allow detection of a failure in the uplink network. For this reason, configurations including static routes should always include BFD.
FIGURE 4-42 below outlines condition 4. In this case, edge node one did not fail, but it is not able
to communicate to any other transport node on the overlay network. This condition mitigates
situations when an edge node does not fail, but connectivity issues exist. All the SRs on the
affected edge node, including all Tier-0 and Tier-1 SRs, will failover. This condition only applies
to deployments including host transport nodes, which can help discriminate between the
failure of the peer edge and a network connectivity problem to it.
Figure 4-42: Failover triggers – All tunnels down (with host transport nodes)
FIGURE 4-43 outlines condition four again, but this time no host transport node is present in the
topology. The all overlay tunnel down condition does not put the SRs running on edge node 2 in
a failed state. In the scenario depicted in the diagram, edge node one failed, and the SRs on edge two took over even though all the overlay tunnels on edge two were down. If the all-tunnels-down condition applied to deployments without host transport nodes, the SRs running on edge node two would go into a failed state, blackholing the traffic. This topology applies to VLAN-only deployments, where workloads are connected via service interfaces.
Note: VLAN-Only deployments with service interfaces must use a single TEP. Multi-TEP
configurations will cause the all overlay tunnel down condition to disable the surviving node.
Figure 4-43: Failover triggers – All tunnels down (without host transport nodes)
Special attention should be paid to the implications of deploying multiple edge node VMs on the same host, and/or on hosts with TEP interfaces on the same network (possible starting with NSX 3.1), for the "all tunnels down" failover trigger. A failure of the upstream overlay transport network (pNIC or switch level failure) will not bring down the tunnels local to the host, those between edge VMs on the same host, or those to the host TEP (see FIGURE 4-44). This situation may lead the edge node VMs not to trigger a failover if northbound routing connectivity is up. Traffic that the upstream network forwards to the affected edge node VMs will be blackholed because of the lack of access to the overlay network.
Figure 4-44: Failover Triggers - All tunnels Down - Tunnels local to ESXi host
This condition can be avoided by designing for fate sharing between the overlay and upstream
network. If that is not a possibility based on the design requirements, we should eliminate the
possibility of establishing overlay tunnels within a host with one of these options:
• No more than one edge node VM per host on hosts not part of the same overlay TZ
• No more than one edge node VM per host on hosts part of the same overlay TZ but edge
and host overlay VLAN and subnet are different (BFD keepalives will need to be
forwarded to the physical network)
• With more than one edge node VM per host, edge node VMs should have different
overlay VLAN and subnet (BFD keepalives will need to be forwarded to the physical
network)
VRF Lite
Another supported design is to deploy a separate Tier-0 gateway for each tenant on a
dedicated tenant edge node. This configuration allows for duplicated IP addresses between
tenants, but requires dedicated edge nodes per tenant (no more than a single Tier-0 SR can run
on each edge node). While providing the required separation, this solution may be limited from
a scalability perspective.
FIGURE 4-47 shows a traditional multi-tenant architecture using a dedicated Tier-0 per tenant in NSX 2.x.
Figure 4-47: NSX 2.x multi-tenant architecture. Dedicated Tier0 for each tenant
In traditional networking, VRF instances are hosted on a physical appliance and share resources with the global routing table. Starting with NSX 3.0, Virtual Routing and Forwarding (VRF) instances configured on the physical fabric can be extended to the NSX domain. A VRF Tier-0 gateway must be associated with a traditional Tier-0 gateway identified as the "Parent Tier-0". FIGURE 4-48 diagrams an edge node hosting a traditional Tier-0 gateway with two VRF gateways. The control plane is completely isolated between all the Tier-0 gateway instances.
The parent Tier-0 gateway can be considered as the global routing table and must have connectivity to the physical fabric. A unique Tier-0 gateway instance (DR and SR) will be created
and dedicated to a VRF. FIGURE 4-49 shows a detailed representation of the Tier-0 VRF gateways with their respective Service Router and Distributed Router components.
Figure 4-49: Detailed representation of the SR/DR component for Tier-0 VRF hosted on an edge node
FIGURE 4-50 shows a typical single tier routing architecture with two Tier-0 VRF gateways
connected to their parent Tier-0 gateway. Traditional segments are connected to a Tier-0 VRF
gateway.
Figure 4-50: NSX 3.0 multi-tenant architecture. Dedicated Tier-0 VRF Instance for each VRF
Since the control plane is isolated between the Tier-0 VRF instances and the parent Tier-0 gateway, each Tier-0 VRF needs its own network and routing constructs:
• Segments
• Uplink Interfaces
• BGP configuration with dedicated peers or Static route configuration
NSX 3.0 supports BGP and static routes for the Tier-0 VRF gateway. It offers the flexibility to use
static routes on a particular Tier-0 VRF while another Tier-0 VRF would use BGP. The OSPF
routing protocol is not supported for VRF topologies.
FIGURE 4-51 shows a topology with two Tier-0 VRF instances and their respective BGP peers on
the physical networking fabric. It is important to emphasize that the Parent Tier-0 gateway has
a BGP peering adjacency with the physical routers using their respective global routing table
and BGP process.
From a data plane standpoint, 802.1q VLAN tags are used to differentiate traffic between the VRF instances, as demonstrated in the following figure.
Figure 4-51: BGP peering Tier-0 VRF gateways and VRF on the networking fabric
When a Tier-0 VRF is attached to a parent Tier-0, several parameters are inherited by design and cannot be changed (a configuration sketch follows the lists below):
• Edge Cluster
• High Availability mode (Active/Active – Active/Standby)
• BGP Local AS Number
• Internal Transit Subnet
• Tier-0, Tier-1 Transit Subnet.
All other configuration parameters can be independently managed:
• External Interface IP addresses
• BGP neighbor
• Prefix list, route-map, Redistribution
• Firewall rules
• NAT rules
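The sketch below illustrates these inheritance rules through the Policy API: the VRF Tier-0 only references its parent, while its BGP neighbors are configured independently. Gateway IDs, addresses, AS numbers, host, and credentials are placeholders.

```python
# Minimal sketch: creating a Tier-0 VRF gateway attached to a parent Tier-0 and
# giving it its own BGP neighbor. Edge cluster, HA mode, local AS, and transit
# subnets are inherited from the parent and are therefore not configured here.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
API = f"{NSX}/policy/api/v1/infra/tier-0s"

vrf = {
    "display_name": "t0-vrf-a",
    "vrf_config": {"tier0_path": "/infra/tier-0s/t0-parent"},   # parent Tier-0
}
requests.patch(f"{API}/t0-vrf-a", json=vrf, auth=AUTH, verify=False).raise_for_status()

# Per-VRF BGP neighbor (the local AS is inherited from the parent Tier-0).
neighbor = {"neighbor_address": "192.168.254.1", "remote_as_num": "65000"}
requests.patch(
    f"{API}/t0-vrf-a/locale-services/default/bgp/neighbors/tor-a-vrf-a",
    json=neighbor, auth=AUTH, verify=False).raise_for_status()
```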
VRF Lite HA
As mentioned previously, the Tier-0 VRF is associated with a parent Tier-0 and follows the high availability mode and state of its parent Tier-0.
Both Active/Active and Active/Standby high availability modes are supported on Tier-0 VRF gateways. It is not possible to have an Active/Active Tier-0 VRF associated with an Active/Standby parent Tier-0, and vice versa.
FIGURE 4-53 represents an unsupported VRF Active/Active design where different routes are
learned from different physical routers. Both the Parent Tier-0 and VRF Tier-0 gateways are
learning their default route from a single physical router. At the same time they receive specific
routes from a different single BGP peer. This kind of scenario would be supported for traditional
Tier-0 architecture as Inter-SR would provide a redundant path to the networking fabric, but it
is not supported for VRFs because of the lack of inter-SR peering capability. This VRF
architecture is not supported in NSX 3.2.
FIGURE 4-54 demonstrates the traffic being blackholed as one internet router fails and the Tier-0 VRF gateways cannot leverage another redundant path to reach the destination. Since the Parent Tier-0 gateway has an established BGP peering adjacency (the VRF HA state is inherited from the parent), failover will not be triggered, and traffic will be blackholed on the Tier-0 VRF. This situation would not arise if each VRF SR peered with every ToR.
3. From the Tier-0 SR1 point of view, the traffic is blackholed as there is no inter-SR BGP adjacency between the Tier-0 VRF SRs.
Following the same BGP peering design and principle for Active/Standby topologies is also
mandatory for VRF architectures as the Tier-0 VRF will inherit the behavior of the parent Tier-0
gateway.
FIGURE 4-55 represents another unsupported design with VRF architecture.
Figure 4-55: Unsupported BGP design with different peering to the networking fabric
In this design, traffic will be blackholed on the Tier-0 VRF SR1 if the internet router fails. Since the Tier-0 VRF shares its high availability running mode with the Parent Tier-0, it is important to note that the Tier-0 SR1 will not fail over to the Tier-0 SR2. The reason is that a failover is triggered only if all northbound BGP sessions change to the "down" state on the parent Tier-0 SR.
Since the parent Tier-0 gateway has an active BGP peering with a northbound router on the physical networking fabric, failover will not occur and traffic will be blackholed on the VRFs that have only one active BGP peer. This situation would not arise if each VRF SR peered with every ToR.
Stateful services can run either on a Tier-0 VRF gateway or on a Tier-1 gateway, except for VPN and Load Balancing, as these features are not supported on a Tier-0 VRF. The Tier-0 SR in charge of the stateful services for a particular VRF will be hosted on the same edge nodes as the Parent Tier-0 gateway. It is recommended to run stateful services on Tier-1 gateways and leverage an Active/Active Tier-0 gateway to send the traffic northbound to the physical fabric.
FIGURE 4-57 represents stateful services running on traditional Tier-1 gateways SR.
▪ Admin Distance: 1
▪ Scope: VRF-A
VRF-lite also supports northbound VRF route leaking, as traffic can be exchanged between a virtual workload on a VRF overlay segment and a bare metal server hosted in a different VRF on the physical networking fabric.
NSX VRF route leaking requires that the next hop of the static route does not point to an interface that belongs to a Tier-0 gateway (VRF or Parent). Static routes pointing to the directly connected uplink IP addresses on the ToR switches are not a recommended design, as the static route would fail if an outage occurred on that link or neighbor (multiple static routes would be needed for redundancy).
A loopback or virtual IP address host route (/32) can be advertised in the network in the
destination VRF. Since the host route is advertised by both top of rack switches, two ECMP
routes will be installed in the Tier-0 VRF. The loopback or virtual IP address host route (/32)
does not need to be advertised in the source VRF.
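A minimal sketch of such a leaked static route through the Policy API follows. The prefixes, next-hop address, gateway IDs, host, and credentials are placeholders, and the scope attribute shown is an assumption of how the next hop is resolved in the other VRF; validate it against the release documentation.

```python
# Illustrative sketch: on the VRF-A Tier-0, a static route toward a subnet in
# VRF-B whose next hop is the loopback/VIP host route advertised in VRF-B,
# resolved in the scope of the VRF-B Tier-0. All values are placeholders.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

leak_route = {
    "network": "10.20.30.0/24",                  # bare-metal subnet reachable in VRF-B
    "next_hops": [{
        "ip_address": "192.0.2.250",             # loopback/VIP host route advertised in VRF-B
        "admin_distance": 1,
        "scope": ["/infra/tier-0s/t0-vrf-b"],    # resolve this next hop in the other VRF
    }],
}
r = requests.patch(
    f"{NSX}/policy/api/v1/infra/tier-0s/t0-vrf-a/static-routes/leak-to-vrf-b",
    json=leak_route, auth=AUTH, verify=False)
r.raise_for_status()
```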
FIGURE 4-58 demonstrates the design; the Tier-0 VRF gateways will use all available healthy paths to the networking fabric to reach the server in VRF-B.
Figure 4-59: Different VRFs sharing Internet access via common Internet Tier-0
Directly connecting the different VRFs to a common VLAN is not allowed in NSX 3.2.
The VRFs can be interconnected to the shared T0 via dedicated overlay segments. The shared
Tier-0 gateway is providing internet access using VLAN segments.
Internet-facing NAT can be implemented on the shared Tier-0 if the VRFs are not carrying overlapping IP spaces, or at the VRF level if they are.
eBGP peering between the VRF Tier-0 gateways and the shared Tier-0 gateway is recommended to protect against an edge node failure.
This EVPN route type can advertise either IPv4 or IPv6 prefixes with a variable network mask length (0 to 32 bits for IPv4 and 0 to 128 for IPv6). This route type allows inter-subnet forwarding and overlapping IP address schemes with the help of route distinguishers.
FIGURE 4-62 represents the fields included in an EVPN Route Type 5.
When designing an EVPN architecture based on NSX, the following network constructs must be
taken into account:
• Segments used for the uplink interfaces
From a BGP standpoint, there are two main requirements to set up the basic MP-BGP peering:
• TEP interfaces (loopbacks) need to have reachability (top of rack loopback interfaces to the NSX Tier-0 loopbacks hosted on each edge node)
• L2VPN/EVPN address family negotiated between the Top of Rack Switch and the Tier-0
Service Router.
To achieve connectivity between the loopbacks hosted on the Tier-0 SR and the Top of Rack switches, two designs are available, as demonstrated in FIGURE 4-64. The first design establishes a BGP peering between the NSX Tier-0 SR and the Top of Rack switches using their uplink interfaces. The loopbacks must be redistributed into BGP so that the entire network fabric is aware of these TEP IP addresses to reach the different VRF prefixes hosted on the NSX domain.
The second design uses BGP multi-hop between the loopback interfaces. To achieve reachability between the loopbacks in this case, static routing or OSPF is needed; otherwise, the BGP multi-hop peering will never reach the "established" state.
The first design with loopback redistribution is recommended as it simplifies the overall architecture and reduces the operational load.
Figure 4-64: Peering over uplink interfaces with loopback redistribution into BGP (recommended) vs multi-hop E-BGP and
peering over the loopback interfaces
The Route Distinguisher field in a Type 5 EVPN route has the following functionalities and
properties:
• Allows VRF prefixes to be considered unique. If two tenants advertise the same prefix, the Tier-0 (and the network fabric) must have a way to differentiate these two overlapping prefixes so that they are globally unique among the different isolated routed instances.
• As a result, the Route Distinguisher must be unique per VRF. It is an 8-byte identifier.
• NSX can auto-assign Route Distinguishers, or an administrator can configure them.
• Network engineers traditionally use the format "BGP_AS:VRF" to configure route distinguishers.
The following figures show a multi-tenancy architecture based on a single MP-BGP EVPN session that advertises the same prefix owned by three different tenants, using route distinguishers to make the routes unique.
Figure 4-65: Relationships between Route Distinguisher, Route Target and VXLAN ID
FIGURE 4-66 summarizes the networking constructs needed per VRF to design and implement
MP-BGP EVPN with VXLAN between the Tier-0 gateways and the Top of Rack switches.
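The sketch below gathers the per-VRF EVPN attributes (route distinguisher, route targets, VXLAN VNI) into a single Policy API call on the VRF Tier-0. The field names are assumptions based on the Tier-0 VRF configuration schema and may differ by release; all identifiers and values are placeholders.

```python
# Illustrative sketch: per-VRF EVPN attributes on a Tier-0 VRF gateway. The
# vrf_config field names (route_distinguisher, route_targets, evpn_transit_vni)
# are assumptions and should be checked against the Policy schema in use.
import requests

NSX = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

vrf_a = {
    "display_name": "t0-vrf-a",
    "vrf_config": {
        "tier0_path": "/infra/tier-0s/t0-parent",
        "route_distinguisher": "65001:100",          # unique per VRF
        "evpn_transit_vni": 75001,                   # VXLAN VNI used toward the ToRs
        "route_targets": [{
            "address_family": "L2VPN_EVPN",
            "import_route_targets": ["65001:100"],
            "export_route_targets": ["65001:100"],
        }],
    },
}

r = requests.patch(f"{NSX}/policy/api/v1/infra/tier-0s/t0-vrf-a",
                   json=vrf_a, auth=AUTH, verify=False)
r.raise_for_status()
```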
The NSX edge node uses Geneve TEP interfaces for all NSX overlay traffic. To exchange network traffic with the Top of Rack switches, the Tier-0 must have separate TEP interfaces enabled for VXLAN. Since only one VXLAN TEP is supported on each edge node, loopback interfaces acting as VXLAN TEP interfaces are recommended in an NSX EVPN architecture.
The diagram represents a multicast source located in the physical infrastructure sending to
multiple virtual machines on different ESXi hosts. Two kinds of multicast are involved here:
• multicast between physical devices, represented in dark green,
• multicast handling on the virtual switch inside the ESXi hosts, represented in light green.
More specifically, you can see that within ESXi-3, traffic has been replicated to VM1 and
VM3, but filtered to VM2.
When running multicast in a VLAN-backed environment (NSX VLAN segments, or simple
dvportgroups, if NSX is not present in the picture), the physical infrastructure can directly snoop
the IGMP traffic sent by the virtual machines and take care of the multicast replication and
filtering between the hosts. In that scenario, there is no difference between NSX and regular vSphere networking: the virtual switch identifies the receivers using IGMP snooping and delivers the multicast traffic to the appropriate vNICs. We will not elaborate on this capability in this guide.
For overlay segments, however, the physical infrastructure is unaware of the traffic being tunneled, and it is up to NSX to efficiently handle the distribution of multicast traffic to the appropriate transport nodes. This design guide section focuses on that scenario: the replication/filtering of multicast traffic between transport nodes with an NSX overlay.
FIGURE 4-68 summarizes the kind of multicast connectivity provided by NSX with multicast routing. Sources and receivers can be inside or outside the NSX domain; the Tier0 gateway runs PIM sparse mode toward the physical infrastructure where the RP is located.
Technical overview
NSX routes multicast traffic between overlay segments thanks to its Tier0 and Tier1 gateways. This section focuses on the data plane operations within NSX: how NSX achieves efficient multicast replication between transport nodes. This multicast traffic is carried on an overlay network, and the physical infrastructure cannot look inside the tunnels NSX creates.
4.8.3.2 Data plane operations between Tier0 gateway and physical infrastructure
The Tier0 gateway connecting NSX to the physical infrastructure is implemented as multiple
Tier0-SRs on multiple edge transport nodes. FIGURE 4-69 illustrates a Tier0 made of four
active/active Tier0-SRs, each configured with two physical uplinks running PIM sparse mode.
Thanks to ECMP, unicast traffic can be sent and received on all the uplinks of those Tier0-SRs.
NSX is also spreading multicast traffic across the different Tier0-SRs on a per-multicast group
basis.
For a given destination multicast IP address, NSX uses a hash to determine a unique "active" Tier0-SR that will be responsible for handling traffic in and out of the NSX domain for this group. Different groups are hashed to different Tier0-SRs, thus providing some form of load balancing for multicast traffic. There is no configuration option to manually select the multicast active Tier0-SR for a given multicast group.
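The short, purely illustrative snippet below mimics this behavior with a stand-in hash to show how different groups land on different Tier0-SRs; NSX does not expose or allow tuning of its actual hashing algorithm.

```python
# Illustrative only: the effect of hashing multicast groups onto Tier0-SRs,
# using a simple stand-in hash. Group addresses and SR names are arbitrary.
import hashlib

tier0_srs = ["Tier0-SR1", "Tier0-SR2", "Tier0-SR3", "Tier0-SR4"]

def active_sr_for(group: str) -> str:
    """Map a multicast group to one 'active' SR (stand-in for the real NSX hash)."""
    digest = hashlib.sha256(group.encode()).digest()
    return tier0_srs[digest[0] % len(tier0_srs)]

for group in ["239.1.1.1", "239.1.1.2", "239.10.0.5"]:
    print(group, "->", active_sr_for(group))
```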
All Tier0-SRs must have a configuration and connectivity to the physical infrastructure capable
of handling all the groups. If a Tier0-SR fails, the multicast groups it was handling will be
assigned to an arbitrary Tier0-SR that is still active. No multicast routing state is synchronized
between the Tier0-SRs, and all the affected multicast groups will see their traffic disrupted until
PIM has re-populated its tables. Note that only the groups that were associated to the failed
Tier0-SR are impacted, other groups remain associated to their existing active Tier0-SR.
Figure 4-69: multicast between edges and physical infrastructure, external sources, internal receivers
NSX will only join the source tree of a multicast source in the physical infrastructure on one of
the uplinks of the multicast active Tier0-SR for that group. That means that external multicast
traffic will always enter NSX through the multicast active Tier0-SR. In the above FIGURE 4-69, the
multicast traffic for G1 is entering NSX through Tier0-SR1, the multicast active Tier0-SR for
group G1. Multicast group G2 could be associated to another Tier0-SR, like Tier0-SR3 in the
diagram, thus spreading inbound traffic across multiple Tier0-SRs.
Figure 4-70: multicast between edges and physical infrastructure, internal sources, external receivers.
When the source for the group is located inside the NSX domain, the traffic pattern might be
slightly different. As soon as the IP address of the source is known to the physical
infrastructure, a PIM join will be sent toward the IP address of the source, in order to join the
receiver to the source tree.
This PIM join can be forwarded by the physical infrastructure on any path toward the source. Thanks to unicast ECMP, the IP address of the source is reachable via any Tier0-SR. In FIGURE 4-70 above, the external receiver for group G1 ended up joining the source tree rooted at S1 for G1 through Tier0-SR1, the Tier0-SR active for G1. This is purely by chance. By contrast, the receiver for group G2 ended up trying to join the source tree rooted at S2 for G2 via Tier0-SR4.
This is not the Tier0-SR active for G2. Tier0-SR4 will itself need to join the source tree rooted
inside NSX in order to provide connectivity to the receiver. Because G2 is associated with Tier0-
SR3, the multicast traffic to G2 is always sent to Tier0-SR3. From there, it will be forwarded to
Tier0-SR4 so that it can reach the receiver that joined the source tree through an uplink of
Tier0-SR4. In the “south-north” direction, multicast traffic can thus hairpin to the multicast
active Tier0-SR before reaching its destination.
4.8.3.3 Data plane operations within NSX for segments attached to a Tier0 gateway
We have seen that the multicast traffic going in and out of the NSX domain for group G is always routed through the multicast active Tier0-SR for G. The rest of this section is dedicated to the
multicast traffic flow within NSX. We’re only going to represent traffic using active Tier0-SR1 in
the diagrams, but of course, traffic could be going through any other Tier0-SR, depending on
the group.
The following diagram shows the path for multicast traffic initiated in the physical infrastructure and distributed inside NSX via the multicast active Tier0-SR1. In this scenario, we're going to assume that all the receivers are attached to segments directly connected to the Tier0 gateway (in other words, there is no Tier1 gateway in the picture).
1. Multicast traffic from the external source S reaches the Tier0-SR multicast active uplink.
There are receivers in the NSX domain, so this packet needs to be replicated to one or
multiple ESXi transport nodes.
2. For performance reasons, Tier0-SR1 is not going to replicate the traffic to the interested
transport nodes in the NSX domain. Instead, it’s going to “offload” the replication to an
ESXi host, picked arbitrarily in the Tier0 routing domain (again, the routing domain is the
group of transport nodes where the Tier0 gateway spans.) The host selected for
replication does not need to have a local receiver for the traffic, but the destination
multicast IP address is part of the hash used to select the host. This way, this multicast
offload is distributed across all the hosts in the routing domain, based on multicast
group. A note on the format of the packet that is forwarded to the ESXi-1 transport
node: this is a multicast packet tunneled unicast to ESXi-1. The destination IP address of
the overlay packet is the unicast IP address of a TEP in ESXi-1.
3. In this example, ESXi-1 has been selected to replicate the multicast packet. First, the
local Tier0-DR routes the traffic to any receiver on local VMs.
4. Then the Tier0-DR initiates a "hybrid replication" to all the remote hosts in the Tier0 routing domain that have at least one receiver (and only those hosts.) The next section will detail what this hybrid replication means; at this stage, just assume that it's a way of leveraging the physical infrastructure to assist packet replication between ESXi hosts.
5. At this final step, the remote Tier0-DRs receive the multicast packet and route it to their local receivers.
FIGURE 4-72 details the multicast packet flow when a source is in the NSX domain (on a segment attached to a Tier0) and there are receivers inside and outside NSX.
1. The VM sends multicast traffic on the segment to which it is connected. It reaches the
local Tier0-DR. Note: there is no receiver on ESXi-1. If there were some in the source
segment, they would directly receive the multicast traffic, forwarded at Layer 2 (with
IGMP snooping.) If there were some receivers on other segments in ESXi-1, the T0-DR
would route the traffic directly to them too.
2. Again, we assumed that Tier0-SR1 is multicast active for the group. It is thus considered
as being the multicast router and needs to receive a copy of every multicast packet
generated in the NSX domain for this group. The Tier0-DR on ESXi-1 sends a unicast copy
of the multicast packet to Tier0-SR1 (multicast traffic encapsulated in unicast overlay.)
3. The Tier0-SR receives the traffic and routes it to the outside world on one of its uplinks
(or potentially through another Tier0-SR, as we have seen earlier.)
4. The Tier0-DR of ESXi-1 also starts hybrid replication to forward the multicast traffic to all
other ESXi transport nodes with receivers in the routing domain (the part dedicated to
hybrid replication will detail how this is achieved.)
5. The remote Tier0-DRs forward the multicast traffic to their local receivers.
4.8.3.4 Data plane operations within NSX: Tier0 and Tier1 segments combined
Until NSX 3.0, multicast routing was only possible on Tier0 gateways. NSX 3.1 introduces the
capability of routing multicast on Tier1 gateways. The handling of the Tier1 gateways is an
extension of the model described in the previous part for the Tier0 gateways.
When multicast routing is enabled on a Tier1 gateway, a Tier1-SR must be instantiated on an
edge. This Tier1-SR will be considered as the multicast router for the Tier1 routing domain and
will receive a copy of all multicast traffic initiated on a segment attached to this Tier1 gateway.
Then this Tier1-SR will be attached as a leaf to the existing multicast tree of its Tier0 gateway
(remember that in order to run multicast, a Tier1 gateway must itself be attached to a Tier0
gateway running multicast.)
The following diagram represents the traffic path of some external multicast traffic that needs
to be replicated within NSX to receivers off a Tier0 gateway as well as two Tier1 gateways
(green and purple.) Steps 1-5 are about replicating the traffic to segments attached to the Tier0 gateway and are thus identical to the previous example.
1. The multicast traffic is received by the multicast active Tier0-SR for the group (Tier0-
SR1.)
2. The Tier0-SR1 offloads replication to an arbitrary ESXi transport node, here ESXi-1.
3. The Tier0-DR of the offload transport node (ESXi-1) routes the multicast traffic to local
receivers.
4. The Tier0-DR of ESXi-1 starts hybrid replication to reach all the ESXi hosts that have
receivers on segments attached to the Tier0 gateway.
5. The Tier0-DRs of those remote ESXi transport nodes route the traffic to their local
receivers.
So far, those steps have been exactly like those described in the previous part.
6. This is the first step specific to multicast on Tier1 gateways: the multicast active Tier0-SR1 initiates a hybrid replication targeting all the Tier1-SRs with receivers for this group.
7. The Tier1-SRs (green and purple) offload the multicast replication to an arbitrary ESXi transport node in their routing domain. In this respect, each acts exactly the same way as the multicast active Tier0-SR does for its routing domain.
8. The ESXi transport nodes selected for offloading their Tier1-SR (ESXi-3 for the green
Tier1 and ESXi-4 for the purple Tier1) route the traffic to their local receivers, if any.
9. The ESXi hosts selected for offloading their Tier1-SR initiate a hybrid replication to
reach all the interested Tier1-DRs in their Tier1 routing domain.
10. Finally, the Tier1-DRs on remote transport nodes route the multicast traffic to their local
receivers.
For completeness, FIGURE 4-74 shows the traffic flow when a multicast source
is in the routing domain of a Tier1 gateway and there are receivers off another Tier1 gateway,
off the Tier0 gateway, and external to NSX.
Figure 4-74: internal source, external receiver and receivers on Tier0 and Tier1 segments
1. The source is located on ESXi-3, off a segment attached to the purple Tier1. The
multicast traffic is first received by the local purple Tier1-DR.
2. The purple Tier1-DR initiates a hybrid replication to reach every remote purple Tier1-DR.
3. The remote purple Tier1-DRs route the traffic to their local destinations.
4. The purple Tier1-DR of ESXi-3 also sends a copy of the multicast traffic to the multicast
router of the routing domain, i.e. the purple Tier1-SR on edge3.
5. The purple Tier1-SR forwards the traffic to all SRs with receivers for this group (Tier0-SR
and green Tier1-SR). Note that the diagram simplifies the real process. You don't need
to understand the details, but for accuracy, the purple Tier1-SR is
in fact sending a unicast copy (not represented) to Tier0-SR1, its multicast router. It then
initiates a hybrid replication that reaches all SRs with receivers. So, the Tier0-SR receives
two copies: one is replicated on the uplink toward external receivers (step 6 below),
while the other goes down the tree to internal receivers on segments attached to the
Tier0 (part of step 7 below.)
6. The multicast active Tier0-SR routes the traffic on one of its uplinks toward the external
receiver(s).
7. The Tier0-SR and green Tier1-SR now offload replication to an arbitrary ESXi transport
node in their routing domain.
8. The Tier0-DR on ESXi-1 and green Tier1-DR on ESXi-3 (the host selected to offload the
green Tier1-SR) route the multicast traffic to their local receivers, if any.
9. The Tier0-DR on ESXi-1 and green Tier1-DR on ESXi-3 initiate a hybrid replication to their
peers in their respective routing domain.
10. The remote Tier0-DR and green Tier1-DR route the traffic to their local receivers.
A few notes on the Tier1 multicast model:
• The use of a Tier1-SR for multicast has an impact on unicast traffic. Indeed, unicast traffic
that needs to be routed to other Tier1s or to the outside world will now transit through
this Tier1-SR.
• Multicast traffic for Tier1 attached segments follows a tree that is rooted on the
corresponding Tier1-SR. The consequence is that a host with receivers on segments that
are attached to different Tier0/Tier1s will get multiple copies of the same multicast
traffic, one for each Tier0-DR or Tier1-DR involved. Check ESXi-4 in the above FIGURE
4-74: it received three separate copies of the same multicast packet, one for each
gateway.
Range1 will be used for hybrid replication between ESXi transport nodes, while range2 will be
used for hybrid replication between service routers (Tier0-SR and Tier1-SRs.)
Suppose that a transport node is interested in receiving multicast traffic sent to multicast group
G in the overlay. This group G will be hashed by NSX into two multicast IP addresses, M1 and M2,
in the replication multicast range1 and range2 respectively. Then, if the transport node is an
ESXi host, it will join group M1 in the underlay. This means that a TEP on this ESXi host will send
an IGMPv2 report with the intent of joining group M1. The physical infrastructure (running
IGMP snooping) can act on this IGMP message and record that this ESXi host is a receiver for
M1. If the transport node were a service router on an edge node, it would join the multicast
group M2 in the underlay. That's the reason for the even split of the replication multicast
range: one half is for hybrid replication between ESXi transport nodes, and the other half is for hybrid
replication between edge transport nodes (you'll notice in the previous multicast flow examples
that there is no hybrid replication between edges and ESXi hosts, only unicast communication.)
Now that all the transport nodes interested in G have joined a common multicast group in the
underlay, NSX can send tunnel traffic to multicast address M1 or M2 to efficiently replicate
multicast traffic for group G between ESXi hosts or edges: within the same Layer 2 domain, the
physical infrastructure will be able to take care of the replication. When traffic needs to be
replicated across underlay subnets, NSX will send a single unicast copy of the multicast traffic to
a remote transport node in each remote subnet. Don't worry if this sounds confusing at this
stage; the following diagrams will hopefully make it clear.
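To make the mapping more concrete, the following is a minimal Python sketch of the idea: one overlay group hashes to a pair of underlay replication addresses, one in each half of the reserved range. The CRC32 hash, the range bases, and the range size are purely illustrative assumptions; they are not the actual NSX implementation.

import ipaddress
import zlib

def map_overlay_group(group_ip: str, range1_base: str, range2_base: str, range_size: int):
    """Illustrative mapping of an overlay multicast group G to a pair of
    underlay replication addresses (M1 in range1, M2 in range2).
    Hash and range layout are assumptions for illustration only."""
    h = zlib.crc32(ipaddress.ip_address(group_ip).packed) % range_size
    m1 = ipaddress.ip_address(range1_base) + h   # used between ESXi transport nodes
    m2 = ipaddress.ip_address(range2_base) + h   # used between edge service routers
    return str(m1), str(m2)

# Example: a small replication range split evenly into range1 and range2 (64 addresses each)
print(map_overlay_group("239.1.1.10", "239.100.0.0", "239.100.0.64", 64))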
In the following scenario, let's assume that a virtual machine on transport node TN-1 generates
multicast traffic to multicast group G, and that there are receivers on multiple ESXi transport
nodes. This example is not about repeating the multicast packet flow presented in the previous
part; we're not going to consider the multicast traffic sent to the multicast router, for example.
We're just going to show how hybrid replication can deliver this multicast IP packet to all the
other transport nodes that have a receiver for G. In this diagram those transport nodes with
receivers are TN-2, TN-4, TN-5, TN-7, TN-8, TN-9, TN-10, and TN-11. The transport nodes have
their TEPs in three different subnets. This is a common design: typically, hosts under the same
top of rack switch (TOR) have their IP address in the same subnet. In this example, transport
nodes TN-1 to TN-4 have their TEP IP addresses in subnet 10.0.0.0/24, TN-5 to TN-8 have TEPs
in subnet 20.0.0.0/24, and TN-9 to TN-12 have TEP addresses in 30.0.0.0/24.
Let's assume that NSX has mapped group G to group M in the replication multicast range for
ESXi transport nodes (referred to earlier as "range1".) All the ESXi hosts with a receiver for G in the
overlay join group M in the physical infrastructure (that's what the "M" on the TEPs
represents.)
Transport node TN-1 knows the IP address of all the transport nodes interested in G (as
mentioned earlier, this has been advertised by relaying IGMP packets.) More specifically, TN-1
knows that it has receivers in its local subnet and in two remote subnets.
1. TN-1 encapsulates the multicast traffic to G in an overlay packet with M as a destination
and sends it on its uplink. The physical infrastructure sees a packet to M and knows that
there are two hosts that have joined group M. The top of rack switch takes care of
replicating this packet to TN-2 and TN-4. Note again that we don't expect the physical
infrastructure to route multicast traffic across subnets. The packet sent to M is thus
constrained to its source subnet 10.0.0.0/24 and not routed to the remote subnets
20.0.0.0/24 and 30.0.0.0/24 by the physical infrastructure.
2. TN-1 knows that receivers TN-5, TN-7 and TN-8 are all in the remote underlay subnet
20.0.0.0/24. It picks one of those three transport nodes (let's say TN-7) and forwards the
multicast traffic to G as an overlay unicast to it. TN-1 repeats the operation for the
remote underlay subnet 30.0.0.0/24 and sends the multicast traffic to G as an overlay
unicast to TN-9.
3. TN-7 receives an encapsulated multicast packet from TN-1. In the metadata of this
packet, TN-1 has set a bit indicating that this packet needs to be replicated to the other
transport nodes in the underlay subnet (see section Overlay Encapsulation). TN-
7 delivers the multicast packet to its local receiver VMs and copies the multicast traffic
to G on its uplink, encapsulated with an overlay destination multicast IP address M. The
physical infrastructure then again takes care of replicating this packet to TN-5 and
TN-8. The metadata of this overlay packet created by TN-7 does not indicate this time
that the packet needs to be replicated to the other hosts in the same subnet, so TN-5
and TN-8 only forward the multicast traffic to their local receivers.
4. TN-9 receives the overlay unicast copy of the multicast traffic to G and performs the
same operation as TN-7 in the previous step.
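The source-side decision in steps 1 and 2 can be summarized as: one multicast-encapsulated copy covering the local TEP subnet, plus a single unicast copy per remote TEP subnet, where the chosen receiver re-replicates locally. The Python sketch below illustrates that logic under simplified assumptions (/24 TEP subnets, an arbitrary choice of the remote replicator); it is not the actual NSX datapath code.

import ipaddress
import random

def hybrid_replicate(src_tep, receiver_teps, underlay_group):
    """Illustrative source-side hybrid replication (not the NSX datapath).
    Returns (destination, replicate_locally_flag) tuples."""
    src_subnet = ipaddress.ip_network(f"{src_tep}/24", strict=False)
    remote_by_subnet = {}
    for tep in receiver_teps:
        subnet = ipaddress.ip_network(f"{tep}/24", strict=False)
        if subnet == src_subnet:
            continue                      # covered by the local multicast copy
        remote_by_subnet.setdefault(subnet, []).append(tep)
    sends = [(underlay_group, False)]     # one multicast copy for the source subnet
    for teps in remote_by_subnet.values():
        sends.append((random.choice(teps), True))  # chosen TEP re-replicates locally
    return sends

teps = ["10.0.0.2", "10.0.0.4", "20.0.0.5", "20.0.0.7", "20.0.0.8", "30.0.0.9"]
print(hybrid_replicate("10.0.0.1", teps, "239.100.0.5"))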
they received it, they would drop it.) The result would however be suboptimal, as some
bandwidth would be wasted transmitting traffic that would eventually be discarded.
Reserve the largest replication multicast range possible
There is a mapping between each multicast IP address used in the overlay and a unique pair of
multicast addresses in this replication multicast range.
Suppose that the virtual network uses 1000 different multicast groups, but that the replication
multicast range only reserves 100 multicast addresses (leaving 50 addresses each in range1 and
range2.) Obviously, there is a 20:1 over-subscription, and multiple multicast groups from the
overlay address space will map to the same multicast address in the replication range. In the
following diagram, due to a hashing collision, two overlay multicast groups G1 and G2 are
mapped by NSX to the same underlay multicast group M.
When TN-1 sends traffic to group G1, it's encapsulated in a multicast tunnel packet and sent to
multicast address M. As a result, both TN-2 and TN-4 receive this packet. This is sub-optimal
because TN-4 has no receiver for group G1; TN-4 just discards this packet. Note that, from a
performance standpoint, this issue is not catastrophic. The additional replication work is done
by the physical infrastructure, and the cost of discarding superfluous packets is not that great on
TN-4, but it's still better to minimize this scenario.
Enable IGMP snooping in the physical infrastructure
As mentioned already in this document, NSX only uses the physical infrastructure for multicast
within the same Layer 2 domain. One of the reasons for this implementation choice is that we
did not want NSX to be dependent on configuring multicast routing in the physical
infrastructure (something that many of our customers are not able to do.) It is however
beneficial to have IGMP snooping enabled on the Layer 2 switches of the physical
infrastructure, in order to optimize the multicast replication within the rack.
Most switches have IGMP snooping enabled by default; just check that your infrastructure is
configured that way. Without IGMP snooping, IP multicast is treated like broadcast by Layer 2
switches, which results in sub-optimal replication.
Limit the number of TEP subnets
This is purely a performance recommendation. With the hybrid replication model,
NSX sends a unicast copy for each remote TEP subnet (see FIGURE 4-75: Hybrid Replication.)
This copy consumes CPU on the source transport node and bandwidth on its uplink. The
following diagram presents a relatively extreme example comparing the replication of a single
multicast frame to ten remote ESXi transport nodes. In the case represented at the top of the
diagram, all the receiving transport nodes are in different subnets; in the bottom part of the
diagram, the receiving transport nodes are in the same subnet as the source.
Figure 4-78: replication across multiple subnets vs. within a single subnet
In the top example, the source transport node (TN-1) needs to send 10 unicast copies of the
multicast packet, one for each receiving transport node. In practice, this means that the usable
bandwidth of TN-1's uplink is divided by 10. There is also a CPU cost to this 10-fold
replication.
In the example represented at the bottom, the source transport node just needs to send a
single multicast copy of the multicast packet. It is received by the physical switch local to the
rack and directly replicated to the receivers. The source transport node thus only had to send a
single packet, and the burden of the replication has been entirely put on the hardware switch.
Of course, we’re not recommending putting all your hosts in the same Layer 2 domain. In fact,
one of the benefits of the NSX overlay is precisely that you can design your physical
infrastructure without the need of extending Layer 2. The cost associated to replicating across
subnets is difficult to avoid for ESXi hosts, that are bound to be spread across subnets. It might
however be possible to group VMs joining the same multicast groups on hosts that are spread
across a minimum number of racks.
Hybrid replication is also used between the edges hosting the Tier0-SRs and Tier1-SRs. The edges of
your edge cluster should not be dependent on the TORs of a single rack, but it's perfectly
possible to deploy all the edges of your edge cluster across two racks (and two TEP subnets.)
This keeps the amount of unicast replication that the hybrid model needs to perform to a
minimum.
Using Tier1-SRs in the model has consequences: minimize the number of Tier1 gateways with
multicast
This has already been mentioned in section DATA PLANE OPERATIONS WITHIN NSX: TIER0 AND
TIER1 SEGMENTS COMBINED. An ESXi transport node will receive a copy of the same multicast
packet for each Tier0 or Tier1 gateway with local receivers. The following diagram gives a simplified
representation of this behavior:
Figure 4-79: host transport node receiving multiple copies of the same multicast packet
Here, the ESXi host at the bottom receives three copies of a multicast packet generated outside
NSX, behind a Tier0 gateway. The multicast trees for the Tier1 gateways are rooted in their
Tier1-SRs. You can see in the diagram that the ESXi host receives the multicast packet for the
local receiver under the green Tier1-DR straight from the green Tier1-SR, not from the local
Tier0-DR (as it would for unicast traffic.) This looks sub-optimal (multicast is all about avoiding
having the same links carry the same multicast traffic multiple times), but it is required to allow
the insertion of stateful services on the Tier1 gateways in a future release. Just be aware that
multiplying Tier1 gateways with receivers for the same groups will cause NSX to replicate the
same traffic multiple times.
Using Tier1-SRs in the model has consequences: be aware of the Tier1-SR impact on unicast
traffic
The choice of implementing a Tier1-SR for multicast traffic also has an impact on unicast traffic:
unicast traffic can only enter/exit the Tier1 routing domain through this centralized Tier1-SR, as
represented in the figure below.
The impact of inserting a Tier1-SR is well known, and not specific to multicast. It is still possible
to have unicast ECMP straight from the host transport node to the Tier0-SRs (and the physical
infrastructure) by attaching workloads directly to a segment off the Tier0 gateway.
Edge Node
Edge nodes are service appliances with pools of capacity, dedicated to running network and
security services that cannot be distributed to the hypervisors. Edge node also provides
overlay transport zones, multiple sets of edge nodes must be deployed, one for each
overlay transport zone.
• VLAN Transport Zone: Edge nodes connect to the physical infrastructure using VLANs.
An edge node needs to be configured for a VLAN transport zone to provide external or N-S
connectivity to the physical infrastructure. Depending upon the N-S topology, an edge
node can be configured with one or more VLAN transport zones.
An edge node can have one or more N-VDS to provide the desired connectivity. Each N-VDS on the
edge node uses an uplink profile, which can be shared or unique per N-VDS. The teaming
policies defined in the uplink profile define how the N-VDS balances traffic across its uplinks.
The uplinks can in turn be individual pNICs or LAGs. While supported, implementing LAGs on
edge nodes is discouraged and not included in any of the deployment examples in this guide.
The bare metal Edge resources specified above are the minimum required. For maximum
performance, it is recommended to deploy an edge node on a bare metal server with the
following specifications:
• Memory: 256GB
• CPU Cores: 24
• Disk Space: 200GB
When NSX Edge is installed as a VM, vCPUs are allocated to the Linux IP stack and DPDK. The
number of vCPUs assigned to the Linux IP stack or DPDK depends on the size of the Edge VM. A
medium Edge VM has two vCPUs for the Linux IP stack and two vCPUs dedicated to DPDK. This
changes to four vCPUs for the Linux IP stack and four vCPUs for DPDK in a large size Edge VM.
Starting with NSX 3.0, several AMD CPUs are supported for both the virtualized and bare metal
Edge node form factors. Specifications can be found here.
The in-band management feature is leveraged for management traffic. Overlay traffic is load
balanced using the multi-TEP feature on the Edge, and external traffic is load balanced using a
named teaming policy, as described in the section Teaming Policy in chapter 3.
Figure 4-81: Bare metal Edge -Same N-VDS for overlay and external traffic with Multi-TEP
During a pNIC failure, the Edge performs a TEP failover by migrating the TEP IP and its MAC address
to another uplink. For instance, if pNIC P1 fails, TEP IP1 along with its MAC address is migrated
to Uplink2, which is mapped to pNIC P2. pNIC P2 then carries the traffic for both TEP IP1 and
TEP IP2.
where management traffic can leverage an interface being used for overlay or external (N-S)
traffic.
A bare metal Edge node supports a maximum of 16 datapath physical NICs. For each of these 16
physical NICs on the server, an internal interface is created following the naming scheme “fp-
ethX”. These internal interfaces are assigned to the DPDK Fast Path. There is flexibility in
assigning these Fast Path interfaces (fp-eth) to overlay or external connectivity.
VM Edge Node
The NSX Edge in VM form factor can be installed using an OVA, OVF, or ISO file. The NSX Edge VM is
only supported on ESXi hosts.
Up to NSX 3.2.0, an NSX Edge VM has four internal interfaces: eth0, fp-eth0, fp-eth1, and fp-
eth2. Eth0 is reserved for management, while the rest of the interfaces are assigned to DPDK
Fast Path. Starting with NSX 3.2.1, four datapath interfaces are available: fp-eth0, fp-eth1, fp-
eth2, and fp-eth3.
These interfaces are allocated for external connectivity to TOR switches and for NSX overlay
tunneling. There is complete flexibility in assigning Fast Path interfaces (fp-eth) for overlay or
external connectivity. As an example, fp-eth0 could be assigned for overlay traffic with fp-eth1,
fp-eth2, or both for external traffic.
FIGURE 4-82 shows an edge VM where only two of the fast path interfaces are in use. They are
managed by the same N-VDS and carry both overlay and VLAN traffic. This design is in line with
the edge reference architecture and will be explained in detail in chapter 7, section EDGE
CONNECTIVITY GUIDELINES FOR LAYER 3 PEERING USE CASE.
Edge Cluster
An Edge cluster is a group of Edge transport nodes. It provides scale-out, redundant, and high-
throughput gateway functionality for logical networks. Scale out from the logical networks to
the Edge nodes is achieved using ECMP. There is flexibility in assigning Tier-0 or Tier-1
gateways to Edge nodes and clusters. Tier-0 and Tier-1 gateways can be hosted on either the
same or different Edge clusters.
Depending upon the services hosted on the Edge node and their usage, an Edge cluster could
be dedicated to running centralized services (e.g., NAT). FIGURE 4-84 shows two clusters
of Edge nodes. Edge Cluster 1 is dedicated to Tier-0 gateways only and provides external
connectivity to the physical infrastructure. Edge Cluster 2 is responsible for NAT functionality on
Tier-1 gateways.
Figure 4-84: Multiple Edge Clusters with Dedicated Tier-0 and Tier-1 Services
There can be only one Tier-0 gateway per Edge node; however, multiple Tier-1 gateways can be
hosted on one Edge node.
A Tier-0 gateway supports a maximum of eight equal cost paths per SR or DR component, thus
a maximum of eight Edge nodes are supported for ECMP. Edge nodes in an Edge cluster run
Bidirectional Forwarding Detection (BFD) on both tunnel and management networks to detect
Edge node failure. Edge VMs support BFD with a minimum BFD TX/RX timer of 500ms and three
retries, providing a 1.5 second failure detection time. Bare metal Edges support BFD with a
minimum BFD TX/RX timer of 50ms and three retries, which implies a 150ms failure detection time.
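The failure detection times quoted above follow directly from the BFD timer and the number of retries; a quick sketch of the arithmetic:

def bfd_detection_time_ms(tx_rx_interval_ms: int, multiplier: int = 3) -> int:
    """Detection time = hello interval x number of missed hellos before declaring failure."""
    return tx_rx_interval_ms * multiplier

print(bfd_detection_time_ms(500))  # Edge VM:         1500 ms (1.5 s)
print(bfd_detection_time_ms(50))   # Bare metal edge:  150 ms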
Failure Domain
A failure domain is a logical grouping of Edge nodes within an Edge cluster. This feature can be
enabled at the Edge cluster level via the API.
As discussed in the high availability section, a Tier-1 gateway with centralized services runs on Edge
nodes in active/standby HA configuration mode. When a user assigns a Tier-1 gateway to an
Edge cluster, NSX Manager automatically chooses the Edge nodes in the cluster to run the
active and standby Tier-1 SR. The auto placement of Tier-1 SRs on different Edge nodes
considers several parameters, such as Edge capacity and active/standby HA state.
Failure domains complement the auto placement algorithm and guarantee service availability in
case of a failure affecting multiple edge nodes. The active and standby instances of a Tier-1 SR
always run in different failure domains.
FIGURE 4-85 shows an edge cluster comprising four Edge nodes: EN1, EN2, EN3, and EN4. EN1
and EN2 are connected to two TOR switches in rack 1, and EN3 and EN4 are connected to two TOR
switches in rack 2. Without failure domains, a Tier-1 SR could be auto-placed on EN1 and EN2. If
rack 1 fails, both the active and standby instances of this Tier-1 SR fail as well.
EN1 and EN2 are configured to be part of failure domain 1, while EN3 and EN4 are in failure
domain 2. When a new Tier-1 SR is created, if the active instance of that Tier-1 is hosted on
EN1, then the standby Tier-1 SR will be instantiated in failure domain 2 (EN3 or EN4).
To ensure that all Tier-1 services are active on a given set of edge nodes, a user can also enforce
that all active Tier-1 SRs are placed in one failure domain. This configuration is supported for
Tier-1 gateways in preemptive mode only.
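A minimal sketch of the placement constraint described above: the standby Tier-1 SR is always taken from a different failure domain than the active one. The simple "first available edge" selection is an assumption for illustration; the real algorithm also weighs edge capacity and existing HA state.

def place_tier1_sr(edges_by_domain, preferred_active_domain):
    """Illustrative active/standby placement honoring failure domains."""
    standby_domains = [d for d in edges_by_domain if d != preferred_active_domain]
    if not standby_domains:
        raise ValueError("at least two failure domains are required")
    active = edges_by_domain[preferred_active_domain][0]
    standby = edges_by_domain[standby_domains[0]][0]
    return active, standby

cluster = {"fd-1": ["EN1", "EN2"], "fd-2": ["EN3", "EN4"]}
print(place_tier1_sr(cluster, "fd-1"))   # ('EN1', 'EN3')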
FIGURE 4-86 diagrams a physical network with uRPF enabled on the core router.
1. The core router receives a packet with a source IP address of 10.1.1.1 on interface
ethernet 0/2.
2. The core router has the uRPF feature enabled on all its interfaces and checks in its
routing table whether the source IP address of the packet would be routed through interface
ethernet 0/2. In this case, 10.1.1.1 is the source IP address present in the IP header. The
core router has a longest prefix match for 10.1.1.0/24 via interface ethernet 0/0.
3. Since the packet did not arrive on interface ethernet 0/0, the packet is
discarded.
In NSX, uRPF is enabled by default on external, internal, and service interfaces. From a security
standpoint, it is a best practice to keep uRPF enabled on these interfaces. uRPF is also
recommended in architectures that leverage ECMP. On intra-tier and router-link interfaces, a
simplified anti-spoofing mechanism is implemented: it checks that a packet is never sent
back to the interface the packet was received on.
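A minimal sketch of the strict uRPF check illustrated in the example above: the packet is accepted only if the longest-prefix-match route for its source address points back out the interface the packet arrived on. The routing table representation is an assumption for illustration.

import ipaddress

def urpf_accept(src_ip, in_interface, routing_table):
    """Strict uRPF: accept only if the best route for src_ip uses the ingress interface."""
    src = ipaddress.ip_address(src_ip)
    best = None
    for prefix, interface in routing_table.items():
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return best is not None and best[1] == in_interface

routes = {"10.1.1.0/24": "ethernet0/0", "0.0.0.0/0": "ethernet0/1"}
print(urpf_accept("10.1.1.1", "ethernet0/2", routes))   # False -> packet discarded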
It is possible to disable uRPF in complex routing architectures where the upstream BGP or OSPF
peers do not advertise the same networks.
NAT Rules
Table 4-4 summarizes the use cases and advantages of running NAT on Tier-0 and Tier-1
gateways.
DHCP Services
NSX provides both DHCP relay and DHCP server functionality. DHCP relay can be enabled at the
gateway level and can act as a relay between non-NSX managed environments and DHCP servers.
DHCP server functionality can be enabled to service DHCP requests from VMs connected to
NSX-managed segments. DHCP server functionality is a stateful service and must be bound to
an Edge cluster or a specific pair of Edge nodes, as with NAT functionality.
Since Gateway Firewalling is a centralized service, it needs to run on an Edge cluster or a set of
Edge nodes. This service is described in more detail in chapter 5, the security chapter.
Proxy ARP
Proxy ARP is a method that consists of answering an ARP request on behalf of another host. This
method is performed by a Layer 3 networking device (usually a router). The purpose is to
provide connectivity between two hosts when routing would not be possible for various reasons.
Proxy ARP in an NSX infrastructure can be considered in environments where IP subnets are
limited. Proofs of concept and VMware Enterprise PKS environments usually use proxy ARP
to simplify the network topology.
For production environments, it is recommended to implement proper routing between the
physical fabric and the NSX Tier-0 by using either static routes or Border Gateway Protocol with
BFD. If proper routing is used between the Tier-0 gateway and the physical fabric, BFD with its
sub-second timers will converge faster. In case of a failover with proxy ARP, the convergence
relies on gratuitous ARP (broadcast) to update all hosts on the VLAN segment with the new
MAC address to use. If the Tier-0 gateway has proxy ARP enabled for 100 IP addresses, the
newly active Tier-0 SR needs to send 100 gratuitous ARP packets.
By enabling proxy ARP, hosts on the overlay segments and hosts on a VLAN segment can
exchange network traffic without implementing any change in the physical networking
fabric. Proxy ARP is automatically enabled when a NAT rule or a load balancer VIP uses an IP
address from the subnet of the Tier-0 gateway uplink.
FIGURE 4-87 presents the logical packet flow between a virtual machine connected to an NSX
overlay segment and a virtual machine or physical appliance connected to a VLAN segment
shared with the NSX Tier-0 uplinks.
In this example, the virtual machine connected to the overlay segment initiates networking
traffic toward 20.20.20.100.
3. The Tier-0 SR has proxy ARP enabled on its uplink interface and sends an ARP request
(broadcast) on the VLAN segment to map the IP address 20.20.20.100 to the correct
MAC address.
4. The physical appliance “SRV01” answers that ARP request with an ARP reply.
5. The Tier-0 SR sends the packet to the physical appliance with a source IP of 20.20.20.10 and
a destination IP of 20.20.20.100.
6. The physical appliance “SRV01” receives the packet and sends an ARP broadcast on the
VLAN segment to map the IP address of the virtual machine (20.20.20.10) to the
corresponding MAC address.
7. The Tier-0 SR receives the (broadcast) ARP request for 20.20.20.10 and has the proxy ARP
feature enabled on its uplink interfaces. It replies to the ARP request with an ARP reply
that contains the Tier-0 SR MAC address of its uplink interface.
8. The physical appliance “SRV01” receives the ARP reply and sends a packet on the VLAN
segment with a source IP of 20.20.20.100 and a destination IP of 20.20.20.10.
9. The packet is received by the Tier-0 SR and routed to the Tier-1, which translates
the destination IP of 20.20.20.10 to 172.16.10.10. The packet is then sent
on the overlay segment and the virtual machine receives it.
It is crucial to note that in this case, the traffic is initiated by the virtual machine
connected to the overlay segment on the Tier-1. If the initial traffic were initiated by a server on
the VLAN segment, a destination NAT rule would have been required on the Tier-1/Tier-0 since
the initial traffic would not match the SNAT rule that was configured previously.
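A minimal sketch of the proxy ARP behavior walked through above: the Tier-0 SR answers ARP requests on the VLAN segment for the addresses it proxies (for example, NAT IPs taken from the uplink subnet) with its own uplink MAC address, and stays silent otherwise. The data structures are illustrative assumptions.

def proxy_arp_reply(arp_target_ip, proxied_ips, uplink_mac):
    """Reply on behalf of proxied addresses; otherwise stay silent."""
    if arp_target_ip in proxied_ips:
        return {"op": "arp-reply", "ip": arp_target_ip, "mac": uplink_mac}
    return None  # not proxied: let the real owner answer

proxied = {"20.20.20.10"}                      # e.g., a SNAT IP taken from the uplink subnet
print(proxy_arp_reply("20.20.20.10", proxied, "00:50:56:aa:bb:cc"))
print(proxy_arp_reply("20.20.20.100", proxied, "00:50:56:aa:bb:cc"))  # None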
FIGURE 4-88 represents an outage on an active Tier-0 gateway with proxy ARP enabled. The
newly active Tier-0 gateway will send a gratuitous ARP to announce the new MAC address to be
used by the hosts on the VLAN segment in order to reach the virtual machine connected to the
overlay. It is important to understand that the newly active Tier-0 will send a gratuitous ARP for
each IP address configured for proxy ARP.
Topology Consideration
This section covers a few of the many topologies that customers can build with NSX. NSX
routing components - Tier-1 and Tier-0 gateways - enable flexible deployment of multi-tiered
routing topologies. Topology design also depends on what services are enabled and where
those services are provided at the provider or tenant level.
Supported Topologies
FIGURE 4-89 shows three topologies with Tier-0 gateway providing N-S traffic connectivity via
multiple Edge nodes. The first topology is single-tiered where Tier-0 gateway connects directly
to the segments and provides E-W routing between subnets. Tier-0 gateway provides multiple
active paths for N-S L3 forwarding using ECMP. The second topology shows the multi-tiered
approach where Tier-0 gateway provides multiple active paths for L3 forwarding using ECMP
and Tier-1 gateways as first hops for the segments connected to them. Routing is fully
distributed in this multi-tier topology. The third topology shows a multi-tiered topology with
Tier-0 gateway configured in Active/Standby HA mode to provide some centralized or stateful
services like NAT, VPN etc.
As discussed in the two-tier routing section, centralized services can be enabled on Tier-1 or
Tier-0 gateway level. FIGURE 4-90 shows two multi-tiered topologies.
The first topology shows centralized services like NAT, load balancer on Tier-1 gateways while
Tier-0 gateway provides multiple active paths for L3 forwarding using ECMP.
The second topology shows centralized services configured on a Tier-1 and Tier-0 gateway.
Some centralized services are only available on Tier-1.
Figure 4-90: Stateful and Stateless (ECMP) Services Topologies Choices at Each Tier
FIGURE 4-91 shows a topology with Tier-0 gateways connected back to back. “Tenant-1 Tier-0
Gateway” is configured for a stateful firewall while “Tenant-2 Tier-0 Gateway” has stateful NAT
configured. Since stateful services are configured on both “Tenant-1 Tier-0 Gateway” and
“Tenant-2 Tier-0 Gateway”, they are configured in Active/Standby high availability mode. The
top layer of Tier-0 gateway, "Aggregate Tier-0 Gateway” provides ECMP for North-South traffic.
Note that only external interfaces should be used to connect a Tier-0 gateway to another Tier-0
gateway. Static routing and BGP are supported to exchange routes between two Tier-0
gateways, and full mesh connectivity is recommended for optimal traffic forwarding. This
topology provides high N-S throughput with centralized stateful services running on different
Tier-0 gateways. This topology also provides complete separation of routing tables at the
tenant Tier-0 gateway level and allows services that are only available on Tier-0 gateways (like
VPN with redundant remote peers) to leverage ECMP northbound. Note that VPN is available
on Tier-1 gateways starting with the NSX 2.5 release. NSX 3.0 introduces new multi-tenancy features
such as EVPN and VRF-lite. These features are recommended and suitable for true multi-tenant
architectures where stateful services need to be run on multiple layers of Tier-0 gateways.
Figure 4-91: Multiple Tier-0 Topologies with Stateful and Stateless (ECMP) Services
FIGURE 4-92 shows another topology with Tier-0 gateways connected back-to-back. “Corporate
Tier-0 Gateway” on Edge cluster-1 provides connectivity to the corporate resources
(172.16.0.0/16 subnet) learned via a pair of physical routers on the left. This Tier-0 has the stateful
Gateway Firewall enabled to restrict access to authorized users only.
“WAN Tier-0 Gateway” on Edge-Cluster-2 provides WAN connectivity via WAN routers and is
also configured for stateful NAT.
“Aggregate Tier-0 gateway” on the Edge cluster-3 learns specific routes for corporate subnet
(172.16.0.0/16) from “Corporate Tier-0 Gateway” and a default route from “WAN Tier-0
Gateway”. “Aggregate Tier-0 Gateway” provides ECMP for both corporate and WAN traffic
originating from any segments connected to it or connected to a Tier-1 southbound. Full mesh
connectivity is recommended for optimal traffic forwarding.
Figure 4-92: Multiple Tier-0 Topologies with Stateful and Stateless (ECMP) Services
A Tier-1 gateway usually connects to a Tier-0 gateway, but for some use cases it is possible to
interconnect it to another Tier-1 gateway using a service interface (SI) and a downlink, as depicted
in FIGURE 4-93. Static routing must be configured on both Tier-1 gateways in this case, as
dynamic routing is not supported. The Tier-0 gateway must be aware of all the overlay
segment prefixes to provide connectivity.
Figure 4-93: Supported Topology – T1 gateways interconnected to each other using Service Interface and Downlink
Unsupported Topologies
While the deployment of logical routing components enables customers to deploy flexible
multi-tiered routing topologies, FIGURE 4-94 presents topologies that are not supported. The
topology on the left shows that a tenant Tier-1 gateway cannot be connected directly to
another tenant Tier-1 gateway using downlinks exclusively.
The rightmost topology highlights that a Tier-1 gateway cannot be connected to two different
upstream Tier-0 gateways.
5 NSX Security
In addition to providing network virtualization, NSX also serves as an advanced security
platform, providing a rich set of features to streamline the deployment of security solutions.
This chapter focuses on core NSX security capabilities, architecture, components, and
implementation. Key concepts for examination include:
● NSX distributed firewall (DFW) provides stateful protection of the workload at the vNIC
level. For ESXi, the DFW enforcement occurs in the hypervisor kernel, helping deliver
micro-segmentation. However, the DFW extends to physical servers, KVM hypervisors,
containers, and public clouds providing distributed policy enforcement.
● Uniform security policy model for on-premises and cloud deployment, supporting multi-
hypervisor (i.e., ESXi and KVM) and multi-workload, with a level of granularity down to
VM/containers/bare metal attributes.
● Agnostic to compute domain - supporting hypervisors managed by different compute-
managers while allowing any defined micro-segmentation policy to be applied across
hypervisors spanning multiple vCenter environments.
● Support for Layer 3, Layer 4, Layer 7 App-ID, and identity-based firewall policies provides
security via protocol, port, and/or deeper packet/session intelligence to suit diverse
needs.
● NSX Gateway firewall serves as a centralized stateful firewall service for N-S traffic.
Gateway firewall is implemented per gateway and supported at both Tier-0 and Tier-1.
Gateway firewall is independent of NSX DFW from policy configuration and enforcement
perspective, providing a means for defining perimeter security control in addition to
distributed security control.
● Gateway and distributed firewall service insertion capability to integrate existing security
investments, using integration with partner ecosystem products on a granular basis
without the need to interrupt natural traffic flows.
● Distributed IDS extends IDS capabilities to every host in the environment.
● Dynamic grouping of objects into logical constructs called Groups based on various
criteria including tag, virtual machine name or operating system, subnet, and segments
which automates policy application.
● The scope of policy enforcement can be selective, with application or workload-level
granularity.
● Firewall Flood Protection capability to protect the workload & hypervisor resources.
● IP discovery mechanism dynamically identifies workload addressing.
● SpoofGuard blocks IP spoofing at vNIC level.
● Switch Security provides storm control and security against unauthorized traffic.
NSX 3.2 introduces advanced security features such as security on vCenter DVPGs, distributed
and centralized malware protection, centralized IDS/IPS, URL filtering, extensive next-
generation firewall App Identification support, Network Traffic Anomaly Detection, and
Network Detection and Response (NDR). These new features will be covered in the new version
of the SECURITY DESIGN GUIDE. This document covers the NSX core security features.
Management Plane
The NSX management plane is implemented through NSX Managers. NSX Managers are
deployed as a cluster of three manager nodes. Access to the NSX Manager is available through a
GUI or REST API framework. When a firewall policy rule is configured, the NSX management
plane service validates the configuration and locally stores a persistent copy. Then the NSX
Manager pushes user-published policies to the control plane service within the Manager cluster,
which in turn pushes them to the data plane. A typical DFW policy configuration consists of one
or more sections with a set of rules using objects like Groups, Segments, and application level
gateways (ALGs). For monitoring and troubleshooting, the NSX Manager interacts with a host-
based management plane agent (MPA) to retrieve DFW status along with rule and flow
statistics. The NSX Manager also collects an inventory of all hosted virtualized workloads on
NSX transport nodes. This is dynamically collected and updated from all NSX transport nodes.
Control Plane
The NSX control plane consists of two components - the central control plane (CCP) and the
Local Control Plane (LCP). The CCP is implemented on NSX Manager Cluster, while the LCP
includes the user space module on all of the NSX transport nodes. This module interacts with
the CCP to exchange configuration and state information.
From a DFW policy configuration perspective, the NSX control plane receives policy rules pushed
by the NSX management plane. If the policy contains objects such as segments or Groups, it
converts them into IP addresses using an object-to-IP mapping table. This table is maintained by
the control plane and updated using an IP discovery mechanism. Once the policy is converted
into a set of rules based on IP addresses, the CCP pushes the rules to the LCP on all the NSX
transport nodes.
The CCP utilizes a hierarchy system to distribute the load of CCP-to-LCP communication. The
responsibility for transport node notification is distributed across the managers in the manager
clusters based on an internal hashing mechanism. For example, for 30 transport nodes with
three managers, each manager will be responsible for roughly ten transport nodes.
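A minimal sketch of this kind of hash-based distribution follows; the CRC32 hash is an illustrative stand-in for the internal mechanism, which is not documented here.

import zlib

def responsible_manager(transport_node_id, managers):
    """Illustrative sharding: each transport node is handled by exactly one manager."""
    index = zlib.crc32(transport_node_id.encode()) % len(managers)
    return managers[index]

managers = ["mgr-1", "mgr-2", "mgr-3"]
nodes = [f"tn-{i}" for i in range(30)]
load = {m: sum(1 for n in nodes if responsible_manager(n, managers) == m) for m in managers}
print(load)   # roughly ten transport nodes per manager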
Data Plane
The NSX transport nodes comprise the distributed data plane with DFW enforcement done at
the hypervisor kernel level. Each of the transport nodes, at any given time, connects to only one
of the CCP managers, based on mastership for that node. On each of the transport nodes, once
the local control plane (LCP) has received policy configuration from CCP, it pushes the firewall
policy and rules to the data plane filters (in kernel) for each of the virtual NICs. With the
“Applied To” field in the rule or section which defines scope of enforcement, the LCP makes
sure only relevant DFW rules are programmed on relevant virtual NICs instead of every rule
everywhere, which would be a suboptimal use of hypervisor resources. Additional details on the
data plane components for both ESXi and KVM hosts are explained in the following sections.
NSX uses OVS and its utilities on KVM to provide DFW functionality, so the LCP agent
implementation differs from that of an ESXi host. For KVM, there is an additional component
called the NSX agent in addition to the LCP, with both running as user space agents. When the
LCP receives DFW policy from the CCP, it sends it to the NSX agent. The NSX agent processes and
converts the policy messages received into a format appropriate for the OVS data path. Then, the
NSX agent programs the policy rules onto the OVS data path using OpenFlow messages. For
stateful DFW rules, NSX uses the Linux conntrack utilities to keep track of the state of flow
connections permitted by a stateful firewall rule. For DFW policy rule logging, NSX uses the ovs-
fwd module.
The MPA interacts with the NSX Manager to export status, rules, and flow statistics. The MPA
module gets the rule and flow statistics from data path tables using the stats exporter
module.
Because of this behavior, it is always recommended to put the most granular policies at the top
of the rule table. This will ensure more specific policies are enforced first. The DFW default
policy rule, located at the bottom of the rule table, is a catch-all rule; packets not matching any
other rule will be enforced by the default rule, which is set to “allow” by default. This ensures
that VM-to-VM communication is not broken during staging or migration phases. It is a best
practice to then change this default rule to a “drop” action and enforce access control through
an explicit allow model (i.e., only traffic defined in the firewall policy is allowed onto the
network). FIGURE 5-5: NSX DFW Policy Lookup diagrams the policy rule lookup and packet flow.
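A minimal sketch of the top-down, first-match lookup described above, ending in a catch-all default rule. The rule representation is deliberately simplified; the real DFW also evaluates services, connection state, and the Applied To scope.

import ipaddress

def dfw_lookup(packet, rules, default_action="allow"):
    """First matching rule wins; the default rule catches everything else."""
    src = ipaddress.ip_address(packet["src"])
    dst = ipaddress.ip_address(packet["dst"])
    for rule in rules:                                   # rules are ordered top to bottom
        if any(src in net for net in rule["src"]) and any(dst in net for net in rule["dst"]):
            return rule["action"]
    return default_action        # best practice: switch the default to "drop" after staging

net = ipaddress.ip_network
rules = [
    {"src": [net("10.1.1.0/24")], "dst": [net("10.1.2.20/32")], "action": "allow"},  # specific first
    {"src": [net("10.1.1.0/24")], "dst": [net("0.0.0.0/0")],    "action": "drop"},
]
print(dfw_lookup({"src": "10.1.1.10", "dst": "10.1.2.20"}, rules))   # allow
print(dfw_lookup({"src": "10.1.1.10", "dst": "8.8.8.8"},   rules))   # drop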
5.4.1.1 Ethernet
The Ethernet Section of the policy is a Layer 2 firewalling section. All rules in this section must
use MAC Addresses for their source or destination objects. Any rule defined with any other
object type will be ignored.
5.4.1.2 Application
In an application-centric approach, grouping is based on the application type (e.g., VMs tagged
as “Web-Servers”), application environment (e.g., all resources tagged as “Production-Zone”)
and application security posture. An advantage of this approach is the security posture of the
application is not tied to network constructs or infrastructure. Security policies can move with
the application irrespective of network or infrastructure boundaries, allowing security teams to
focus on the policy rather than the architecture. Policies can be templated and reused across
instances of the same types of applications and workloads while following the application
lifecycle; they will be applied when the application is deployed and is destroyed when the
application is decommissioned. An application-based policy approach will significantly aid in
moving towards a self-service IT model. In an environment where there is strong adherence to
a strict naming convention, the VM substring grouping option allows for simple policy
definition.
An application-centric model does not provide significant benefits in an environment that is
static, lacks mobility, and has infrastructure functions that are properly demarcated.
5.4.1.3 Infrastructure
Infrastructure-centric grouping is based on infrastructure components such as segments or
segment ports, identifying where application VMs are connected. Security teams must work
closely with the network administrators to understand logical and physical boundaries.
If there are no physical or logical boundaries in the environment, then an infrastructure-centric
approach is not suitable.
5.4.1.4 Network
When east-west security is first implemented in a brownfield environment, there are two
common approaches, depending on corporate culture: either an incremental zonal approach
where one application is secured before moving to the next, or a top-down iterative approach
where first prod and non-prod are divided and then each of those areas is further subdivided.
Regardless of the chosen approach, there will likely be a variety of security postures taken
within each zone. A lab zone, for example, may merely be ring-fenced with a policy that allows
any traffic type from lab device to lab device and only allows basic common services such as
LDAP, NTP, and DNS in through the perimeter. On the other end of the spectrum, any zone
containing regulated or sensitive data (such as customer information) will often have tightly
defined traffic between entities, with many traffic types further inspected by partner L7 firewall
offerings using Service Insertion.
The answers to these questions help shape a policy rule model. Policy models should be flexible
enough to address ever-changing deployment scenarios, rather than simply be part of the initial
setup. Concepts such as intelligent grouping, tags and hierarchy provide flexible yet agile
response capability for steady state protection as well as during instantaneous threat response.
The model shown in FIGURE 5-7: Security Rule Model represents an overview of the different
classifications of security rules that can be placed into the NSX DFW rule table. Each of the
classifications shown represents a category in the NSX firewall table layout. The firewall table
categories align with the best practice of organizing rules to help the administrator group
policies based on category. Each firewall category can have one or more policies within it to
organize the firewall rules under that category.
• Define Security Policy – Using the firewall rule table, define the security policy. Have
categories and policies to separate and identify emergency, infrastructure, environment,
and application-specific policy rules based on the rule model.
The methodology and rule model mentioned earlier would influence how to tag and group the
workloads as well as affect policy definition. The following sections offer more details on
grouping and firewall rule table construction with an example of grouping objects and defining
NSX DFW policy.
A Group is a logical construct that allows grouping into a common container of static (e.g.,
IPSet/NSX objects) and dynamic (e.g., VM names/VM tags) elements. This is a generic construct
which can be leveraged across a variety of NSX features where applicable.
Static criteria provide the capability to manually include particular objects in the Group. For
dynamic inclusion, Boolean logic can be used to combine various criteria.
A Group creates a logical grouping of VMs based on static and dynamic criteria. TABLE 5-1: NSX
Objects Used for Groups shows one type of grouping criteria, based on NSX objects.
Segment: All VMs/vNICs connected to this segment/logical switch will be selected.
MAC Address: The selected MAC sets container will be used. MAC sets contain a list of individual MAC addresses.
AD Groups: Grouping based on Active Directory groups, for the Identity Firewall (VDI/RDSH) use case.
TABLE 5-2: VM Properties Used for Groups lists the selection criteria based on VM properties.
VM Name: All VMs whose name contains/equals/starts with/does not equal the specified string.
Tags: All VMs that have the specified NSX security tags applied.
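A minimal sketch of dynamic group membership built from criteria like those in the tables above (VM name substring and tags). The inventory representation is an assumption for illustration, not an NSX data structure.

def group_members(vms, name_contains=None, required_tags=()):
    """Illustrative dynamic group: VM name substring AND all required tags."""
    members = []
    for vm in vms:
        name_ok = name_contains is None or name_contains in vm["name"]
        tags_ok = set(required_tags).issubset(vm["tags"])
        if name_ok and tags_ok:
            members.append(vm["name"])
    return members

inventory = [
    {"name": "web-01", "tags": {"Production-Zone", "Web-Servers"}},
    {"name": "db-01",  "tags": {"Production-Zone"}},
]
print(group_members(inventory, name_contains="web"))                   # ['web-01']
print(group_members(inventory, required_tags={"Production-Zone"}))     # ['web-01', 'db-01']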
The use of Groups gives more flexibility as an environment changes over time. This approach
has three major advantages:
• Rules stay more constant for a given policy model, even as the data center environment
changes. The addition or deletion of workloads will affect group membership alone, not
the rules.
• Publishing a change of group membership to the underlying hosts is more efficient than
publishing a rule change. It is faster to send down to all the affected hosts and cheaper in
terms of memory and CPU utilization.
• As NSX adds more grouping object criteria, the group criteria can be edited to better
reflect the data center environment.
A rule within a policy is composed of the fields shown in Table 5-3: Policy Rule Fields; their
meaning is described below.
Rule Name: User-defined field; supports up to 30 characters.
ID: Unique rule ID auto-generated by the system. The rule ID helps in monitoring and
troubleshooting. The firewall log carries this rule ID when rule logging is enabled.
Source and Destination: Source and destination fields of the packet. Each is a Group,
which could be a static or dynamic group as described in the Group section.
Service: Predefined services, predefined services groups, or raw protocols can be selected.
When selecting raw protocols like TCP or UDP, it is possible to define individual port numbers or
a range. There are four options for the services field:
• Pre-defined Service – A pre-defined Service from the list of available objects.
• Add Custom Services – Define custom services by clicking on the “Create New Service”
option. Custom services can be created based on L4 Port Set, application level gateways
(ALGs), IP protocols, and other criteria. This is done using the “service type” option in the
configuration menu. When selecting an L4 port set with TCP or UDP, it is possible to
define individual destination ports or a range of destination ports. When selecting ALG,
select supported protocols for ALG from the list. ALGs are only supported in stateful
mode; if the section is marked as stateless, the ALGs will not be implemented.
Additionally, some ALGs may be supported only on ESXi hosts, not KVM. Please review
release-specific documentation for supported ALGs and hosts.
• Custom Services Group – Define a custom services group, selecting from single or
multiple services. The workflow is similar to adding custom services, except that you add
multiple service entries.
Profiles: Used to select and define Layer 7 App-ID and FQDN profiles for Layer 7-based
security rules.
Applied To: Defines the scope of rule publishing. The policy rule can be published to all
workloads (the default value) or restricted to a specific Group. When a Group is used in Applied
To, it needs to be based on non-IP members such as VM objects, segments, etc. Not using the
Applied To field can result in very large firewall tables being loaded on vNICs, which
negatively affects performance.
Action: Defines the enforcement method for this policy rule; the available options are listed in
TABLE 5-4: Firewall Rule Table – “Action” Values.
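To tie these fields together, the sketch below represents a rule as simple structured data and shows how an Applied To scope limits where the rule is programmed: only vNICs that belong to the rule's Applied To group receive it, while a rule without Applied To is programmed everywhere. The data model is a simplified assumption for illustration, not the NSX API format.

def program_rules_on_vnics(rules, vnic_groups):
    """Illustrative Applied To scoping: vnic_groups maps vNIC -> set of group memberships."""
    programmed = {vnic: [] for vnic in vnic_groups}
    for rule in rules:
        scope = rule.get("applied_to")          # None means the default: all workloads
        for vnic, groups in vnic_groups.items():
            if scope is None or scope in groups:
                programmed[vnic].append(rule["name"])
    return programmed

vnics = {"web-01.vnic0": {"SEG-Web"}, "db-01.vnic0": {"SEG-DB"}}
rules = [
    {"name": "allow-https-to-web", "source": "any", "destination": "SEG-Web",
     "service": "HTTPS", "action": "ALLOW", "applied_to": "SEG-Web"},
    {"name": "default-rule", "source": "any", "destination": "any",
     "service": "any", "action": "ALLOW"},   # no Applied To -> every vNIC
]
print(program_rules_on_vnics(rules, vnics))
# {'web-01.vnic0': ['allow-https-to-web', 'default-rule'], 'db-01.vnic0': ['default-rule']}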
In order to define a micro-segmentation policy for this application, use the Application category
in the DFW rule table and add a new policy section and rules within it for each application.
The following use cases present policy rules based on the different methodologies
introduced earlier.
This example shows use of the network methodology to define policy rule. Groups in this
example are identified in TABLE 5-5: FIREWALL RULE TABLE - EXAMPLE 1 – GROUP Definition while
the firewall policy configuration is shown in TABLE 5-6: FIREWALL RULE TABLE - EXAMPLE 1- Policy.
The DFW engine is able to enforce network traffic access control based on the provided
information. To use this type of construct, exact IP information is required for the policy rule.
This construct is quite static and does not fully leverage the dynamic capabilities of modern
cloud systems.
Example 2: Using Segment object Group in Security Policy rule.
This example uses the infrastructure methodology to define policy rule. Groups in this example
are identified in TABLE 5-7: FIREWALL RULE TABLE - EXAMPLE 2 – GROUP Definition while the firewall
policy configuration is shown in TABLE 5-8: FIREWALL RULE TABLE - EXAMPLE 2 – Policy.
Reading this policy rule table is easier for all teams in the organization, ranging from
security auditors to architects to operations. Any new VM connected to any segment will be
automatically secured with the corresponding security posture. For instance, a newly installed
web server will be seamlessly protected by the first policy rule with no human intervention,
while a VM disconnected from a segment will no longer have a security policy applied to it. This
type of construct fully leverages the dynamic nature of NSX. It will monitor VM connectivity at
any given point in time, and if a VM is no longer connected to a particular segment, any
associated security policies are removed.
This policy rule also uses the “Applied To” option to apply the policy to only relevant objects
rather than populating the rule everywhere. In this example, the first rule is applied to the vNIC
associated with “SEG-Web”. Use of “Applied To” is recommended to define the enforcement
point for the given rule for better resource usage.
Security policy and IP Discovery
Both the NSX DFW and the Gateway Firewall (GFW) have a dependency on VM-to-IP discovery,
which is used to translate objects to IP addresses before rules are pushed to the data path. This is
mainly required when the policy is defined using grouped objects. This VM-to-IP table is
maintained by the NSX control plane and populated by the IP discovery mechanism. IP discovery
is used as the central mechanism to ascertain the IP address of a VM. By default, this is done
using DHCP and ARP snooping, with VMware Tools available as another mechanism on ESXi
hosts. These discovered VM-to-IP mappings can be overridden by manual input if needed, and
multiple IP addresses are possible on a single vNIC. The IP and MAC addresses learned are added
to the VM-to-IP table. This table is used internally by NSX for SpoofGuard, ARP suppression, and
firewall object-to-IP translation.
Intrusion Detection
Much like distributed firewalling changed the game on firewalling by providing a distributed,
ubiquitous enforcement plane, NSX distributed IDS/IPS changes the game on IDS/IPS by providing
a distributed, ubiquitous enforcement plane. However, there are additional benefits that the
NSX distributed IDS/IPS model brings beyond ubiquity (which in itself is a game changer). NSX
IDS/IPS is distributed across all the hosts. Much like with the DFW, the distributed nature allows the IDS/IPS
capacity to grow linearly with compute capacity. Beyond that, however, there is an added
benefit to distributing IDS/IPS: the added context. Legacy network intrusion detection and
prevention systems are deployed centrally in the network and rely either on traffic being
hairpinned through them or on a copy of the traffic being sent to them via techniques like SPAN
or TAPs. These sensors typically match all traffic against all or a broad set of signatures and have
very little context about the assets they are protecting. Applying all signatures to all traffic is
very inefficient, as IDS/IPS, unlike firewalling, needs to look at the packet payload, not just the
network headers. Each signature that needs to be matched against the traffic adds inspection
overhead and potentially introduces latency. Also, because legacy network IDS/IPS appliances
just see packets without having context about the protected workloads, it's very difficult for
security teams to determine the appropriate priority for each incident. Obviously, a successful
intrusion against a vulnerable database server in production which holds mission-critical data
needs more attention than someone on the IT staff triggering an IDS event by running a
vulnerability scan.
Because the NSX distributed IDS/IPS is applied to the vNIC of every workload,
traffic does not need to be hairpinned to a centralized appliance, and we can be very selective as
to which signatures are applied. Signatures related to a Windows vulnerability don't need to be
applied to Linux workloads, and servers running Apache don't need signatures that detect an
exploit of a database service. Through the Guest Introspection Framework and in-guest drivers,
NSX has access to context about each guest, including the operating system version, logged-in
users, and running processes. This context can be leveraged to selectively apply only the
relevant signatures, not only reducing the processing impact, but more importantly reducing
the noise and quantity of false positives compared to what would be seen if all signatures were
applied to all traffic with a traditional appliance. For a detailed description of IDS configuration,
see the NSX Product Documentation.
Service Insertion
The value of NSX security extends beyond NSX to your pre-existing security infrastructure; NSX
is the mortar that ties your security bricks together to build a stronger wall. Legacy security
strategies were intolerant of pre-existing security infrastructure. Anyone who had a Check Point
firewall and wanted to move to a Palo Alto Networks firewall would run the two managers side
by side until the transition was complete. Troubleshooting during this transition period required
a lot of chair swiveling. NSX brings a new model that complements pre-existing infrastructure.
Service Insertion is the feature which allows NSX firewalls (both Gateway Firewall and DFW) to
send traffic to legacy firewall infrastructure for processing. This can be done as granularly as at
the port level, without any modification to the existing network architecture. Service Insertion
not only sends traffic to other services for processing, it also offers a deep integration which
allows the exchange of NSX Manager objects with SI service managers. So, a group in NSX which
is comprised of VMs whose names contain the substring "web" (for example) would get shared
to the SI service manager. When a new VM is spun up and becomes a member of that group,
the NSX Manager sends the update to the SI service manager so that policy can be consistently
applied across platforms.
NSX Distributed Firewall for Mix of VLAN and Overlay backed workloads
This use case mainly applies to customers who want to apply NSX micro-segmentation policies
to all of their workloads while adopting NSX network virtualization (overlay) for their
application networking needs in phases. This scenario may arise when a customer starts to
either deploy new applications with network virtualization or migrate existing applications in
phases from VLAN- to overlay-backed networking to take advantage of NSX network
virtualization. This scenario is also common where there are applications which prevent
overlay-backed networking from being adopted fully (as described in the bridging section
above). The order of operations in this environment is as follows: on egress, DFW processing
happens first, then overlay network processing happens second. On traffic arrival at a remote
host, overlay network processing happens first, then DFW processing happens before traffic
arrives at the VM.
The following diagram depicts this use case with logical and physical topology.
Figure 5-12: NSX DFW Logical Topology – Mix of VLAN & Overlay Backed Workloads
Figure 5-13: NSX DFW Physical Topology – Mix of VLAN & Overlay Backed Workloads
Gateway Firewall
The NSX Gateway Firewall provides essential perimeter firewall protection which can be used in
addition to a physical perimeter firewall. The Gateway Firewall service is part of the NSX Edge
node in both bare metal and VM form factors. The Gateway Firewall is useful for building PCI
zones, multi-tenant environments, or DevOps-style connectivity without forcing the inter-tenant
or inter-zone traffic onto the physical network. The Gateway Firewall data path uses the DPDK
framework supported on the Edge to provide better throughput.
Optionally, the Gateway Firewall service insertion capability can be leveraged with the partner
ecosystem to provide integrated security which leverages existing security investments. This
enhances the security posture by providing next-generation firewall (NGFW) services on top of
the native firewall capability NSX provides. This is applicable to designs where security
compliance requirements mandate that a zone or group of workloads be secured using an
NGFW, for example, DMZ or PCI zones or multi-tenant environments. Service insertion
leverages existing security infrastructure investments and extends NSX dynamic security groups
to them.
Consumption
The NSX Gateway Firewall is instantiated per gateway and is supported on both Tier-0 and
Tier-1 gateways. The Gateway Firewall works independently of the NSX DFW from a policy
configuration and enforcement perspective, although objects can be shared with the DFW. A
user can consume the Gateway Firewall using either the GUI or the REST API framework
provided by NSX Manager. The Gateway Firewall configuration is similar to a DFW firewall
policy; it is defined as a set of individual rules within a section. Like the DFW, Gateway Firewall
rules can use logical objects, tagging, and grouping constructs (e.g., Groups) to build policies.
Similarly, regarding L4 services in a rule, it is valid to use predefined Services, custom Services,
predefined service groups, custom service groups, or TCP/UDP protocols with the ports. The
NSX Gateway Firewall also supports multiple Application Level Gateways (ALGs). The user can
select an ALG and its supported protocols by using the other setting for the type of service. The
Gateway Firewall supports only FTP and TFTP ALGs. ALGs are only supported in stateful mode; if
the section is marked as stateless, the ALGs will not be implemented.
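As an illustration of this consumption model, the sketch below defines a simple Tier-1 gateway policy through the Policy API. The manager address, credentials, group, and Tier-1 path are hypothetical, and the payload fields and category name should be validated against the NSX Policy API guide for your release.

import requests

NSX = "https://nsx-mgr.example.com"   # hypothetical manager FQDN
AUTH = ("admin", "VMware1!VMware1!")   # placeholder credentials

# Gateway policy with one stateful rule enforced on a Tier-1 gateway.
policy = {
    "display_name": "T1-Tenant1-GFW",
    "category": "LocalGatewayRules",
    "stateful": True,
    "rules": [{
        "display_name": "allow-inbound-https",
        "source_groups": ["ANY"],
        "destination_groups": ["/infra/domains/default/groups/GRP-Tenant1-Web"],
        "services": ["/infra/services/HTTPS"],
        "action": "ALLOW",
        "scope": ["/infra/tier-1s/tenant1-t1"],   # enforcement point: the Tier-1 gateway
    }],
}
requests.patch(
    f"{NSX}/policy/api/v1/infra/domains/default/gateway-policies/T1-Tenant1-GFW",
    json=policy, auth=AUTH, verify=False).raise_for_status()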
When partner services are leveraged through service insertion, the implementation requires
registering the NSX Manager on the partner management console and the registration of the
partner management console in the NSX manager. Once the two managers are integrated,
they will share relevant objects, which will improve security policy consistency across the
board.
Implementation
Gateway firewall is an optional centralized firewall implemented on NSX Tier-0 gateway uplinks
and Tier-1 gateway links. This is implemented on a Tier-0/1 SR component which is hosted on
NSX Edge. The Tier-0 Gateway Firewall supports stateful firewalling only in active/standby HA
mode. It can also be enabled in active/active mode, though it will then operate only in stateless
mode. The Gateway Firewall uses a model similar to the DFW for defining policy, and the NSX
grouping construct can be used as well. Gateway Firewall policy rules are organized using one or
more policy sections in the firewall table for each Tier-0 and Tier-1 gateway. Firewalling at the
perimeter allows for a coarse-grained policy definition which can greatly reduce the size of the
security policy inside.
Deployment Scenarios
This section provides two examples for possible deployment and data path implementation.
Gateway FW as Inter-tenant FW
The Tier-1 Gateway Firewall is used as an inter-tenant firewall within an NSX virtual domain. It is
used to define policies between different tenants who reside within an NSX environment. The
firewall is enforced on traffic leaving the Tier-1 gateway and is applied on the Tier-1 SR
component, which resides on the Edge node, before the traffic is sent to the Tier-0 gateway for
further processing. Intra-tenant traffic continues to leverage the distributed routing and
firewalling capabilities native to NSX.
solutions have the ability to add tags to the workloads based on the result of the AV/AM scan.
This allows for an automated, immediate quarantine policy based on the result of the AV/AM
scan, with DFW security rules defined based on the partner tags.
The Endpoint Protection platform for NSX follows a simple three-step process:
Registration
Registration of the VMware Partner console with NSX and vCenter.
Deployment
Creating a Service Deployment of the VMware Partner SVM and deployment to the ESXi
Clusters. The SVMs require a Management network with which to talk to the Partner
Management Console. This can be handled by IP Pool in NSX or by DHCP from the network.
Management networks must be on a VSS or VDS switch.
Consumption
Consumption of the Endpoint Protection platform consists of creating a Service Profile of which
references the Service Deployment and then creating Service Endpoint Protection Policy with
Endpoint Rule that specifies which Service Profile should be applied to what NSX Group of
Virtual Machines.
● Exclude management components like vCenter Server, and security tools from the DFW
policy to avoid lockout, at least in the early days of DFW use. Once there is a level of
comfort and proficiency, the management components can be added back in with the
appropriate policy. This can be done by adding those VMs to the exclusion list.
● Use the Applied To field in the DFW to limit the rule growth on individual vNICs.
● Choose the policy methodology and rule model to enable optimum groupings and
policies for micro-segmentation.
● Use NSX tagging and grouping constructs to group an application or environment to its
natural boundaries. This will enable simpler policy management.
● Consider the flexibility and simplicity of a policy model for Day-2 operations. It should
address ever-changing deployment scenarios rather than simply be part of the initial
setup.
● Leverage DFW category and policies to group and manage policies based on the chosen
rule model. (e.g., emergency, infrastructure, environment, application...)
● Use an explicit allow model; create explicit rules for allowed traffic and change the DFW
default rule from “allow” to “drop” (a Policy API sketch of this change follows below).
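The following is a minimal, hedged sketch of that last step, flipping the DFW default Layer 3 rule to drop with logging via the Policy API. The default section and rule IDs shown here are typical for NSX-T 3.x but should be confirmed with a GET before patching; the manager address and credentials are placeholders.

import requests

NSX = "https://nsx-mgr.example.com"   # hypothetical manager FQDN
AUTH = ("admin", "VMware1!VMware1!")   # placeholder credentials

# Flip the DFW default rule to DROP with logging enabled.
# Verify the default section/rule IDs with a GET first; they may differ per release.
url = (f"{NSX}/policy/api/v1/infra/domains/default/security-policies/"
       "default-layer3-section/rules/default-layer3-rule")
requests.patch(url, json={"action": "DROP", "logged": True},
               auth=AUTH, verify=False).raise_for_status()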
3. Within the ZONE - application VMs belonging to a certain application should not be
talking to the VMs of other applications.
4. Some applications within a ZONE have common database services which run within that
ZONE.
5. Log all unauthorized communication between workloads for monitoring and for
compliance.
2. Define a policy for common services, like DNS and NTP, as in the figure below.
a) Define this policy under the Infrastructure tab as shown below.
b) Have two rules allowing all workloads to access the common services, using the
GROUPS created in step 1 above.
c) Use Layer 7 context profiles, DNS and NTP, in the rules to further enhance the security
posture.
d) Have a catch-all deny rule to deny any other destination for the common services, with
logging enabled, for compliance and for monitoring any unauthorized communication.
(A Policy API sketch of this common-services policy follows the note below.)
Note: If the management entities are not in an exclusion list, this section would need to
have rules to allow the required protocols between the appropriate entities. See
https://ports.vmware.com/home/vSphere for the ports for all VMware products.
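The sketch below expresses such a common-services policy through the Policy API, assuming that groups for all workloads and for the DNS/NTP servers already exist. All object names, the manager address, and the predefined service and context-profile paths are illustrative assumptions and should be checked against your environment.

import requests

NSX = "https://nsx-mgr.example.com"            # hypothetical manager FQDN
AUTH = ("admin", "VMware1!VMware1!")           # placeholder credentials
DOMAIN = f"{NSX}/policy/api/v1/infra/domains/default"
GPATH = "/infra/domains/default/groups"        # assumed pre-existing groups below

policy = {
    "display_name": "Common-Services",
    "category": "Infrastructure",
    "rules": [
        {"display_name": "allow-dns",
         "source_groups": [f"{GPATH}/GRP-ALL-WORKLOADS"],
         "destination_groups": [f"{GPATH}/GRP-DNS-SERVERS"],
         "services": ["/infra/services/DNS", "/infra/services/DNS-UDP"],
         "profiles": ["/infra/context-profiles/DNS"],   # L7 context profile (verify path)
         "action": "ALLOW"},
        {"display_name": "allow-ntp",
         "source_groups": [f"{GPATH}/GRP-ALL-WORKLOADS"],
         "destination_groups": [f"{GPATH}/GRP-NTP-SERVERS"],
         "services": ["/infra/services/NTP"],
         "profiles": ["/infra/context-profiles/NTP"],   # L7 context profile (verify path)
         "action": "ALLOW"},
        {"display_name": "deny-other-common-services",  # catch-all deny for DNS/NTP to any other destination
         "source_groups": ["ANY"],
         "destination_groups": ["ANY"],
         "services": ["/infra/services/DNS", "/infra/services/DNS-UDP",
                      "/infra/services/NTP"],
         "action": "DROP",
         "logged": True},                               # log unauthorized attempts for compliance
    ],
}
requests.patch(f"{DOMAIN}/security-policies/Common-Services",
               json=policy, auth=AUTH, verify=False).raise_for_status()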
Phase-2: Define Segmentation around ZONES - by having an explicit allow policy between
ZONES
As per the requirement, define a policy between zones to deny any traffic between zones. This
can be done using IP CIDR blocks, as the data center zones have pre-assigned IP CIDR blocks.
Alternatively, this can be done using workload tags or other approaches. However, the IP-group
based approach is simpler (as the admin has a pre-assigned IP CIDR block per zone), requires no
additional workflow to tag workloads, and places less load on the NSX Manager and control
plane compared to the tagged approach. The tagged approach may add an additional burden on
the NSX Manager to compute and update policies in an environment with scale and churn. As a
rule of thumb, the larger the IP block that can be defined in a rule, the more the policy can be
optimized using CIDR blocks. In cases where there is no convenient CIDR block to group
workloads, static groupings may be used to create entities without churn on the NSX Manager.
Here are the suggested steps:
1. Define two NSX Groups, one for each zone (Development and Production), say DC-ZONE-
DEV-IP and DC-ZONE-PROD-IP, with the IP CIDR blocks associated with the respective
zones as members.
2. Define a policy in the Environment category using the IP groups created in step 1 to
restrict all communication between the Development and Production zones.
3. Have logging enabled for this policy to track all unauthorized communication attempts.
(Note: In many industries, it is sufficient to log only the default action for
troubleshooting purposes. In others, there may be a compliance mandate to log every
action. Logging requirements are driven by the balance between storage costs and
compliance requirements.) A Policy API sketch of these steps follows below.
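The following is a possible Policy API rendering of steps 1 through 3. The CIDR blocks, group names, manager address, and credentials are placeholders under the assumption that each zone owns a single pre-assigned CIDR block.

import requests

NSX = "https://nsx-mgr.example.com"            # hypothetical manager FQDN
AUTH = ("admin", "VMware1!VMware1!")           # placeholder credentials
DOMAIN = f"{NSX}/policy/api/v1/infra/domains/default"
GPATH = "/infra/domains/default/groups"

# Step 1: one IP-CIDR based group per zone (CIDR blocks are placeholders).
zones = {"DC-ZONE-DEV-IP": "10.10.0.0/16", "DC-ZONE-PROD-IP": "10.20.0.0/16"}
for name, cidr in zones.items():
    group = {"display_name": name,
             "expression": [{"resource_type": "IPAddressExpression",
                             "ip_addresses": [cidr]}]}
    requests.patch(f"{DOMAIN}/groups/{name}", json=group,
                   auth=AUTH, verify=False).raise_for_status()

# Steps 2 and 3: Environment-category policy denying inter-zone traffic, with logging.
policy = {
    "display_name": "ZONE-Segmentation",
    "category": "Environment",
    "rules": [
        {"display_name": "deny-dev-to-prod",
         "source_groups": [f"{GPATH}/DC-ZONE-DEV-IP"],
         "destination_groups": [f"{GPATH}/DC-ZONE-PROD-IP"],
         "services": ["ANY"], "action": "DROP", "logged": True},
        {"display_name": "deny-prod-to-dev",
         "source_groups": [f"{GPATH}/DC-ZONE-PROD-IP"],
         "destination_groups": [f"{GPATH}/DC-ZONE-DEV-IP"],
         "services": ["ANY"], "action": "DROP", "logged": True},
    ],
}
requests.patch(f"{DOMAIN}/security-policies/ZONE-Segmentation",
               json=policy, auth=AUTH, verify=False).raise_for_status()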
This is a two-step approach to building a policy for the application. The first step is to start with
a fence around the application to build a security boundary. The second step is to profile the
application further to plan and build more granular, port-defined security policies between tiers.
• Start with the DEV zone first and identify an application to be micro-segmented, say DEV-
ZONE-APP-1.
• Identify all VMs associated with the application within the zone.
• Check whether the application has its own dedicated network segments or IP subnets.
o If yes, you can leverage a Segment- or IP-based Group.
o If no, tag the application VMs with unique zone- and application-specific tags, say
ZONE-DEV and APP-1.
• Check whether this application requires any communication other than infra services and
communication within the group. For example, the APP is accessed from outside on HTTPS.
Once you have the above information about DEV-ZONE-APP-1, create segmentation around the
application with the following steps:
1- Apply two tags, ZONE-DEV and APP-1, to all the VMs belonging to APP-1 in the DEV zone.
2- Create a GROUP, say “ZONE-DEV-APP-1”, with criteria to match on tags equal to “ZONE-
DEV” and “APP-1”.
3- Define a policy under the Application category with three rules as in Figure 5-26:
Application Policy Example.
a. Have “Applied To” set to “ZONE-DEV-APP-1” to limit the scope of the policy to only
the application VMs.
b. The first rule allows all internal communication between the application VMs.
Enable logging for this rule to profile the application tiers and protocols. (Each
log entry will contain 5-tuple details about every connection.)
c. The second rule allows access to the front end of the application from outside. Use
the L7 context profile to allow only SSL traffic. The example below uses Exclude
Source from within the zone, so that the application is only accessible from outside,
not from within the zone, except from the APP’s other VMs as per rule one.
d. Default deny all other communication to the “ZONE-DEV-APP-1” VMs. Enable
logging for compliance and for monitoring any unauthorized communication. (A
Policy API sketch of this group and policy follows below.)
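The following sketch shows how the group and the three-rule application policy described above could be created through the Policy API. VM tagging itself is assumed to be done through the UI or the VM tag API, the DEV zone group (GRP-ZONE-DEV) and the SSL context-profile path are assumptions, and the tag value format should be verified against the API guide.

import requests

NSX = "https://nsx-mgr.example.com"            # hypothetical manager FQDN
AUTH = ("admin", "VMware1!VMware1!")           # placeholder credentials
DOMAIN = f"{NSX}/policy/api/v1/infra/domains/default"
APP_GRP = "/infra/domains/default/groups/ZONE-DEV-APP-1"
ZONE_GRP = "/infra/domains/default/groups/GRP-ZONE-DEV"   # assumed group for the whole DEV zone

# Group matching VMs carrying both the ZONE-DEV and APP-1 tags ("scope|tag" format, empty scope here).
group = {"display_name": "ZONE-DEV-APP-1",
         "expression": [
             {"resource_type": "Condition", "member_type": "VirtualMachine",
              "key": "Tag", "operator": "EQUALS", "value": "|ZONE-DEV"},
             {"resource_type": "ConjunctionOperator", "conjunction_operator": "AND"},
             {"resource_type": "Condition", "member_type": "VirtualMachine",
              "key": "Tag", "operator": "EQUALS", "value": "|APP-1"}]}
requests.patch(f"{DOMAIN}/groups/ZONE-DEV-APP-1", json=group,
               auth=AUTH, verify=False).raise_for_status()

policy = {
    "display_name": "APP-ZONE-DEV-APP-1",
    "category": "Application",
    "scope": [APP_GRP],                        # "Applied To": the application VMs only
    "rules": [
        {"display_name": "allow-intra-app",    # rule 1: all internal traffic, logged for profiling
         "source_groups": [APP_GRP], "destination_groups": [APP_GRP],
         "services": ["ANY"], "action": "ALLOW", "logged": True},
        {"display_name": "allow-https-from-outside",   # rule 2: front-end access from outside the zone
         "source_groups": [ZONE_GRP], "sources_excluded": True,
         "destination_groups": [APP_GRP],
         "services": ["/infra/services/HTTPS"],
         "profiles": ["/infra/context-profiles/SSL"],  # system SSL context profile (verify path)
         "action": "ALLOW"},
        {"display_name": "default-deny-app",   # rule 3: deny and log everything else to the app
         "source_groups": ["ANY"], "destination_groups": [APP_GRP],
         "services": ["ANY"], "action": "DROP", "logged": True},
    ],
}
requests.patch(f"{DOMAIN}/security-policies/APP-ZONE-DEV-APP-1",
               json=policy, auth=AUTH, verify=False).raise_for_status()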
Log entries will identify the direction (In/Out) as well as the protocol, source IP address/port,
and destination IP address/port for each flow. If using the log file for policy definition, it is
often advisable to process the log files using Excel to sort traffic. Typically, two sheets are
created, one for IN traffic and one for OUT traffic. Each sheet is then sorted by port first, then
by IP address (in the case of IN traffic by destination IP address and in the case of OUT traffic by
source address). This sorting methodology allows for the grouping of multiple servers
serving/accessing the same traffic. For each of these groupings, a rule can be inserted above
rule 1 for the application. This will prevent the known traffic from appearing in the log. Once
sufficient confidence is gained that the application is completely understood (this is typically
when the logs are empty), the original rule ZONE-DEV-APP-1 can be removed. At this point, the
security model has transitioned from zone-based to micro-segmentation. (Note: Certain
environments, such as labs, may be best served by ring fencing, whereas other environments,
such as those handling sensitive financial information, may wish to add service insertion for
certain traffic types on top of micro-segmentation. The value of NSX is that it provides the
customer the means to implement appropriate security in one environment without impacting
the other.)
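As an illustration of this log-processing workflow, the sketch below parses DFW packet log lines and groups flows by direction, protocol, and port, keying IN traffic on the destination IP and OUT traffic on the source IP, as described above. The regular expression assumes a simplified dfwpktlogs-style line shape and will likely need adjustment to the exact log format in your environment.

import csv
import re
from collections import defaultdict

# Assumed simplified line shape: "... <rule-id> IN ... TCP 10.0.1.5/43210->10.0.2.8/443 ..."
FLOW = re.compile(r"\b(IN|OUT)\b.*?\b(TCP|UDP)\b\s+"
                  r"(\d+\.\d+\.\d+\.\d+)/(\d+)->(\d+\.\d+\.\d+\.\d+)/(\d+)")

def summarize(path):
    flows = defaultdict(set)            # (direction, proto, dest port) -> server IPs
    with open(path) as fh:
        for line in fh:
            m = FLOW.search(line)
            if not m:
                continue
            direction, proto, src, _sport, dst, dport = m.groups()
            # Group IN traffic by destination IP and OUT traffic by source IP,
            # sorted by port first, mirroring the spreadsheet methodology above.
            key_ip = dst if direction == "IN" else src
            flows[(direction, proto, dport)].add(key_ip)
    return flows

def export(flows, path="flow-summary.csv"):
    with open(path, "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["direction", "protocol", "port", "servers"])
        for (direction, proto, port), ips in sorted(flows.items()):
            w.writerow([direction, proto, port, " ".join(sorted(ips))])

# Example usage: export(summarize("dfwpktlogs.log"))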
Phase-5: Repeat Phase-3 for other applications and ZONES.
Repeat the same approach as in Phase-3 for the other applications, to create a security
boundary for every application within ZONE-DEV and ZONE-PROD. Note that the securing of
each of these applications can happen asynchronously, without impact to the others. This
accommodates application-specific maintenance windows, where required.
Phase-6: Define Emergency policy, Kill Switch, in case of Security Event
An emergency policy is mainly leveraged for the following use cases and is enforced at the top
of the firewall table:
1- To quarantine vulnerable or compromised workloads in order to protect other
workloads.
2- To explicitly deny known bad actors by their IP subnet, based on GEO location or
reputation.
This policy is defined in the Emergency category as shown:
1- The first two rules quarantine all traffic from workloads belonging to the group GRP-
QUARANTINE.
a. “GRP-QUARANTINE” is a group which matches all VMs with a tag equal to
“QUARANTINE”. (If guest introspection is implemented, the AV/AM tags can be
used to define different quarantine levels.)
b. In order to enforce this policy on vulnerable VMs, add the tag “QUARANTINE” to
isolate the VMs and allow only the admin to access the hosts to fix the vulnerability.
2- The other two rules use a Group with known bad IPs to stop any communication with
those IPs. (A Policy API sketch of this emergency policy follows below.)
In creating these policies, the iterative addition of rules to the policy is something that can be
done at any time. It is only when the action of the default rule changes from allow to deny/drop
that a maintenance window is advised. As logging has been on throughout the process, it is
highly unusual to see an application break during the window. What is most frequently the case
is that something within the next week or month may emerge as an unforeseen rule that was
missed. For this reason, it is advised that the Deny All rule be set to log even in environments
where compliance does not dictate the collection of logs. Aside from the security value of
understanding the traffic that is being blocked, the Deny All rule logs are very useful when
troubleshooting applications.
At this point you have a basic level of micro-segmentation policy applied to all the workloads to
shrink the attack surface. As a next step, you can further break the application into its tiers and
their communication by profiling application flows using firewall logs or by exporting IPFIX
flows to a Network Insight platform. This will help to group the application workloads based on
their function within the application and to define policy based on the associated ports and
protocols used. Once you have these groupings and protocols identified for a given application,
update the policy for that application by creating additional groups and rules with the right
protocols, arriving at granularly defined rules one application at a time.
With this approach you start with outside-in fencing as the initial micro-segmentation policy
and finally arrive at a granular, port-based micro-segmentation policy for all the applications.
A typical data center has different workloads: VMs, containers, and physical servers, with a mix
of NSX-managed and non-managed workloads. These workloads may also have a combination
of VLAN-based networks and NSX overlay-based networks.
Figure 5-28: NSX Firewall for all Deployment Scenarios summarizes the different data center
deployment scenarios and the associated NSX firewall security controls which best fit each
design. You can use the same NSX Manager as a single pane of glass to define security policies
for all of these different scenarios using the different security controls.
The server pool can include an arbitrary mix of physical servers, VMs, or containers that,
together, allow scaling out the application.
• Application high-availability:
The load balancer also tracks the health of the servers and can transparently remove a
failing server from the pool, redistributing the traffic it was handling to the other
members.
Modern applications are often built around advanced load balancing capabilities which go far
beyond the initial benefits of scale and availability. In the example below, the load balancer
selects different target servers based on the URL of the requests received at the VIP:
Thanks to these native capabilities, modern applications can be deployed on NSX without
requiring any third-party physical or virtual load balancer. The next sections describe the
architecture of the NSX load balancer and its deployment modes.
Load Balancer
The NSX load balancer runs on a Tier-1 gateway. The arrows in the above diagram represent a
dependency: the two load balancers LB1 and LB2 are attached to Tier-1 gateways 1 and 2
respectively. Load balancers can only be attached to Tier-1 gateways (not Tier-0 gateways), and
one Tier-1 gateway can only have one load balancer attached to it.
Virtual Server
On a load balancer, the user can define one or more virtual servers (the maximum number
depends on the load balancer form factor; see the NSX Administrator Guide for load balancer
scale information). As mentioned earlier, a virtual server is defined by a VIP and a TCP/UDP port
number, for example IP 20.20.20.20 and TCP port 80. The diagram represents four virtual
servers: VS1, VS2, VS5 and VS6. A virtual server can have basic or advanced load balancing
options, such as forwarding specific client requests to specific pools (see below), redirecting
them to external sites, or even blocking them.
Pool
A pool is a construct grouping servers hosting the same application. Grouping can be configured
using server IP addresses or for more flexibility using Groups. NSX provides advanced load
balancing rules that allow a virtual server to forward traffic to multiple pools. In the above
diagram for example, virtual server VS2 could load balance image requests to Pool2, while
directing other requests to Pool3.
Monitor
A monitor defines how the load balancer tests application availability. Those tests can range
from basic ICMP requests to matching patterns in complex HTTPS queries. The health of the
individual pool members is then validated according to a simple check (the server replied) or
more advanced ones, like checking whether a web page response contains a specific string.
Monitors are specified per pool: a single pool can use only one monitor, but the same monitor
can be used by different pools.
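To tie these constructs together, the sketch below creates a monitor, a pool, a load balancer service attached to a Tier-1 gateway, and a virtual server through the Policy API. The Tier-1 path, member IPs, VIP, and the default application profile path are assumptions and should be verified against your deployment and the API guide for your release.

import requests

NSX = "https://nsx-mgr.example.com"            # hypothetical manager FQDN
AUTH = ("admin", "VMware1!VMware1!")           # placeholder credentials
BASE = f"{NSX}/policy/api/v1/infra"

def patch(path, body):
    requests.patch(f"{BASE}{path}", json=body, auth=AUTH, verify=False).raise_for_status()

# Monitor: HTTP health check against /healthz on port 80.
patch("/lb-monitor-profiles/web-http-monitor",
      {"resource_type": "LBHttpMonitorProfile", "monitor_port": 80,
       "request_url": "/healthz", "response_status_codes": [200]})

# Pool: two web servers (placeholder IPs) tied to the monitor above.
patch("/lb-pools/web-pool",
      {"display_name": "web-pool",
       "members": [{"ip_address": "192.168.10.11", "port": "80"},
                   {"ip_address": "192.168.10.12", "port": "80"}],
       "active_monitor_paths": ["/infra/lb-monitor-profiles/web-http-monitor"]})

# Load balancer service attached to a Tier-1 gateway (placeholder path).
patch("/lb-services/lb-1",
      {"display_name": "lb-1", "size": "SMALL",
       "connectivity_path": "/infra/tier-1s/tenant1-t1"})

# Virtual server: VIP 20.20.20.20 on TCP/80, using the default HTTP application profile
# (verify the default profile path in your deployment).
patch("/lb-virtual-servers/web-vs",
      {"display_name": "web-vs", "ip_address": "20.20.20.20", "ports": ["80"],
       "pool_path": "/infra/lb-pools/web-pool",
       "lb_service_path": "/infra/lb-services/lb-1",
       "application_profile_path": "/infra/lb-app-profiles/default-http-lb-app-profile"})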
Because the traffic between clients and servers necessarily goes through the load balancer,
there is no need to perform any LB Source-NAT (Load Balancer Network Address Translation at
the virtual server VIP).
The in-line mode is the simplest load-balancer deployment model. Its main benefit is that the
pool members can directly identify the clients from the source IP address, which is passed
unchanged (step 2). The load-balancer being a centralized service, it is instantiated on a Tier-1
gateway SR (Service Router). The drawback of this model is that, because the Tier-1 gateway
now has a centralized component, East-West traffic for segments behind different Tier-1
gateways will be pinned to an Edge node in order to get to the SR. This is the case even for
traffic that does not need to go through the load-balancer.
Figure 6-6: One-Arm Load Balancing with Clients and Servers on the same segment
The need for a Tier-1 SR for the centralized load-balancer service results in East-West traffic
for segments behind different Tier-1 gateways being pinned to an Edge node. This is the same
drawback as for the in-line model described in the previous part.
This design allows for better horizontal scale, as an individual segment can have its own
dedicated load-balancer service appliance(s). This flexibility in the assignment of load-balancing
resources comes at the expense of potentially instantiating several additional Tier-1 SRs on
several Edge nodes. Because the load-balancer service has its own dedicated appliance, East-
West traffic for segments behind a different Tier-1 gateway (the blue Tier-1 gateway in the
above diagram) can still be distributed. The diagram above represents a Tier-1 One-Arm load
balancer attached to an overlay segment.
A Tier-1 One-Arm LB can also be attached to physical VLAN segments as shown in the above
figure, thus offering load balancing services even for applications on VLANs. In this use case,
the Tier-1 interface is also a Service Interface, but this time connected to a VLAN segment
instead of an overlay segment.
Load-balancer high-availability
The load-balancer is a centralized service running on a Tier-1 gateway, meaning that it runs on a
Tier-1 gateway Service Router (SR). The load-balancer will thus run on the Edge node of its
associated Tier-1 SR, and its redundancy model will follow the Edge high-availability design.
The above diagram represents two Edge nodes hosting three redundant Tier-1 SRs, each with a
load-balancer. The Edge High Availability (HA) model is based on periodic keepalive messages
exchanged between each pair of Edges in an Edge Cluster. This keepalive protects against the
loss of an Edge as a whole. In the above diagram, should Edge node 2 go down, the standby
green SR on Edge node 1, along with its associated load-balancer, would become active
immediately.
There is a second messaging protocol between the Edges. This one is event driven (not
periodic), and per-application. This means that if a failure of the load-balancer of the red Tier-1
SR on Edge node 1 is detected, this mechanism can trigger a failover of just this red Tier-1 SR
from Edge node 1 to Edge node 2, without impacting the other services.
The active load balancer service will always synchronize the following state information to the
standby load balancer:
• L4 flow state
• Source-IP persistence state
• Monitor state
This way, in case of failover, the standby load balancer (and its associated Tier-1 SR) can
immediately take over with minimal traffic interruption.
Load-balancer monitor
The pools targeted by the virtual servers configured on a load-balancer have their monitor
services running on the same load-balancer. This ensures that the monitor service cannot fail
without the load-balancer itself failing (fate sharing). The left part of the following diagram
represents the same example of the relationship between the different load-balancer
components as the one used in part 6.3. The right part of the diagram provides an example of
where those components would be physically located in a real-life scenario.
Here, LB1 is a load-balancer attached to Tier-1 Gateway 1 and running two virtual servers, VS1
and VS2. The SR for Tier-1 Gateway 1 is instantiated on Edge 1. Similarly, load-balancer LB2 is
on Tier-1 Gateway 2, running VS5 and VS6.
Monitor1 and Monitor2 protect the server pools Pool1, Pool2 and Pool3 used by LB1. As a
result, both Monitor1 and Monitor2 are implemented on the SR where LB1 resides. Monitor2 is
also polling servers used by LB2, thus it is also implemented on the SR where LB2 is running.
The Monitor2 example highlights the fact that a monitor service can be instantiated in several
physical locations and that a given pool can be monitored from different SRs.
An L7 VIP load balances HTTP or HTTPS connections. The client connection is terminated by the
VIP, and once the client’s HTTP or HTTPS request is received, the load balancer establishes
another connection to one of the pool members. If needed, some specific load balancing
configuration can also be done at the L7 VIP, like the selection of specific pool members based
on the request.
[Figure: an L7 virtual server at 30.30.30.30:80 directing requests for www.mysite.com to pool
“www” and requests for blog.mysite.com to pool “blog”]
For L7 HTTPS VIPs, the NSX LB offers three modes: HTTPS Off-Load, HTTPS End-to-End SSL, and
SSL Passthrough.
HTTPS Off-Load decrypts the HTTPS traffic at the VIP and forwards the traffic in clear HTTP to
the pool members. It is the best balance between security, performance, and LB flexibility:
• Security: traffic is encrypted on the external side.
• Performance: web servers don’t have to run encryption.
• LB flexibility: all advanced configuration on HTTP traffic is available, like URL load
balancing.
HTTPS End-to-End SSL decrypts the HTTPS traffic at the VIP and re-encrypts the traffic in
another HTTPS session to the pool members. It provides the best security and LB flexibility:
• Security: traffic is encrypted end to end.
• Performance: this mode has lower performance as traffic is decrypted/encrypted twice.
• LB flexibility: all advanced configuration on HTTP traffic is available, like URL load
balancing.
HTTPS SSL Passthrough does not decrypt the HTTPS traffic at the VIP; the SSL connection is
terminated on the pool members. It provides the best security and performance, but with
limited LB flexibility:
• Security: traffic is encrypted end to end.
• Performance: highest performance since the LB does not terminate SSL traffic.
• LB flexibility: advanced configuration based on HTTP traffic is not available. Only
advanced configuration based on SSL traffic is available, like SSL SNI load balancing.
Load-balancer IPv6
The NSX LB, like many NSX network and security services, offers its service to both IPv4 and IPv6 clients.
[Figure: LB IPv4: IPv4 clients load balanced to an IPv4 server pool. LB IPv6: IPv6 clients load
balanced to an IPv6 server pool.]
in-line. Clearly, traffic from clients must go through the Tier-1 SR where the load-balancer is
instantiated in order to reach the server and vice versa:
The following diagram represents another scenario that, from a logical standpoint at least,
looks like an in-line load-balancer design. However, source LB-SNAT is required in this design,
even though the traffic between the clients and the servers apparently cannot avoid the Tier-1
gateway where the load-balancer is instantiated.
Figure 6-18: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View
The following expanded view, where the Tier-1 SR and DR are represented as distinct entities
hosted in physically different locations in the network, clarifies the reason why source LB-SNAT
is mandatory:
Figure 6-19: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View
Traffic from the server to the client would be switched directly by the Tier-1 DR, without going
through the load-balancer on the SR, if source LB-SNAT were not configured. This design is not
in fact a true in-line deployment of the load-balancer and does require LB-SNAT.
Figure 6-20: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Logical View
The diagram below offers a possible physical representation of the same network, where the
Tier-1 gateway is broken down between an SR on an Edge Node, and a DR on the host where
both client and servers are instantiated (note that, in order to simplify the representation, the
DR on the Edge was omitted.)
Figure 6-21: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View
This representation makes it clear that, because the VIP is not physically instantiated on the DR,
even if it belongs to the subnet of the downlink of the Tier-1 gateway, some additional
“plumbing” is needed in order to make sure that traffic destined to the load-balancer reaches its
destination. Thus, NSX configures proxy-ARP on the DR to answer local requests for the VIP and
adds a static route for the VIP pointing to the SR (represented in red in the diagram).
The first step in planning an NSX installation is understanding the deployment model we will
adopt. NSX can be adopted in three modes. They are not mutually exclusive, as they can coexist
in the same NSX installation, and even in the same vSphere cluster in some cases. The three
deployment models map to different NSX use cases: virtual workload security, network virtualization, bare
metal server security. NSX can help in a variety of more advanced use cases, such as multi
location connectivity, disaster recovery, and cloud native applications, but here we focus on the
core use cases to create a design framework.
The three deployment models are:
1. Distributed security model for virtualized workloads
2. Network virtualization with distributed security for virtualized workloads
3. Bare metal workloads security via gateway firewall (centralized security) or NSX bare
metal agent
In the distributed security model, NSX provides distributed security services such as distributed
firewall and distributed IDS/IPS. Still, the physical network retains a significant part of the
switching and routing responsibilities. The only NSX component participating in the networking
functionalities is the VMware VDS, responsible for switching the traffic between the VMs
running in the hypervisor and the physical uplinks. Once the packets are delivered to the
upstream physical switch, it is the responsibility of the physical fabric to deliver them to the
appropriate destination. This networking model is the most common in vSphere deployments
that are not running NSX, making the adoption of NSX distributed security entirely transparent
for the physical network in brownfield environments.
The networking and security model provides the full benefit of NSX network and security
virtualization. Distributed security services operate as in the security-only model, being
orthogonal to the adoption of the NSX network virtualization services. NSX network
virtualization decouples the physical network from the topology requirements of the
applications and services that run on top of it. The physical fabric can be designed and operated
based on a simpler and more stable model that does not require following the application's
lifecycle. The role of adapting to the ever-changing application requirements is offloaded to the
NSX network virtualization layer. At the same time, the physical fabric design can be centered
around maximizing throughput and high availability, minimizing latency, and optimizing
operational efficiency.
Rack Availability: with VLAN-backed networking, achieving rack availability via vSphere HA is
not possible without extending VLANs between racks; with NSX overlay, rack availability via
vSphere HA is easy to achieve for compute workloads.
Disaster Recovery: with VLAN-backed networking, disaster recovery requires re-IP; with NSX
overlay, disaster recovery is easy and requires no re-IP.
Support of Advanced Networking (NAT, Multicast, IPv6, etc.): VLAN-backed networking is tied
to physical ASICs and limited by the switch vendor; the NSX software-defined approach allows
the networking and security stack to scale.
The following sections list the physical network requirements for both deployment models and
provide considerations around specific network fabric designs.
The centralized security (or gateway firewall) model is a third, less standard but emerging
deployment model. In this option, NSX Edges act as the perimeter firewall for the data center or
a branch location and protect non-virtualized workloads, operating as a next-generation firewall
appliance. This model has no requirements on the physical network. Physical server security via
the NSX agent is covered later in this document.
resources was available on day 1, the distribution could have been more orderly. Each rack
could have been dedicated to a cluster, or maybe two for rack redundancy, to limit the span of
VM mobility. In the real world, a different line of business can fund their initiatives in different
phases, and IT operators must expand their clusters on demand leveraging the available
network resources.
Figure 7-3: Network Overlays for elastic and non-uniform compute requirements
The role of the physical network is to provide reliable, high-throughput connectivity between
the hosts regardless of their nature. Dedicating a set of physical switches or racks to specific
workloads (vSphere clusters) is not a valid approach in data centers required to reach cloud
scale. It represents a waste of resources when the racks and switches are underutilized, and it
imposes an upper boundary on workload scale. What happens when you need to add a host to a
cluster with a dedicated rack and the ports or the rack space are unavailable? NSX overlays
allow the different vSphere clusters to grow inorganically while preserving connectivity and VM
mobility.
So far, we have explored the benefits and flexibility that overlays bring to private clouds
requiring elastic compute and network connectivity properties. We can extend the model to
north-south bandwidth requirements and network services. NSX Edge nodes are responsible for
scaling the software-defined network solution over those dimensions. FIGURE 7-4 shows how
edge hosts (ESXi hosts running NSX Edge VMs) can be added to the design over time based on
changing requirements. The only additional requirement on the physical network is two
additional subnets/VLANs per rack to establish BGP peering. As for the other infrastructure
VLANs, those are local and do not stretch across racks.
The NSX Edge nodes can be thought of as the “spine” layer of the NSX SDN solution. A single NSX
deployment can horizontally scale up to 160 Edge nodes. They can be distributed over multiple
racks to reduce the physical fabric oversubscription ratio between the leaf and spine layers.
Figure 7-4: Network Overlays for elastic North-South Bandwidth and network services
When planning a distributed security deployment without overlay, two options exist:
1. Prepare the vSphere hosts for NSX Security only, and leave the virtual machines
connected to the virtual distributed port-groups managed by vCenter
2. Prepare the vSphere hosts for NSX Networking & Security, and migrate the virtual
machines to NSX VLAN port-groups
Option 1 does not require migrating the VMs, and the entire networking configuration is
retained by vCenter. All virtual machines connected to dvpgs are automatically protected by
the NSX DFW. This is the most straightforward and recommended approach; however, as of
now (NSX version 3.2), it does not allow the implementation of NSX overlays. If NSX overlay is
expected to be implemented soon, option 2 should be considered. Migrating from a security-only
deployment to a network & security deployment requires uninstalling NSX from the vSphere
cluster. Option 1 is available with NSX 3.2, vSphere 6.5 or later, and VDS 6.6 or later.
Option 2 requires placing the workloads on NSX-managed distributed virtual port-groups. In a
brownfield environment, we can create NSX segments with the same VLAN as the dvpgs where
the VMs reside and then migrate the VMs, reconfiguring the vNIC with no disruption. VMs that
remain on dvpgs are excluded from the NSX distributed firewall. With this option, we can adopt
overlays at a later stage. Option 2 requires NSX 3.0 or later, vSphere 7, and VDS 7.
The two deployment models can coexist within the same installation on a per-cluster basis. A
vSphere cluster can be prepared for security only, another for network & security. This is
valuable for existing deployments where one can deploy NSX for security only while, in parallel,
overlay-based clusters provide a path for a future migration.
ESXi NSX VLAN Segment (connects Guest VMs to VLANs): uses the ESXi VLAN Transport Zone
(different than the one for NSX Edge TNs); the same cluster can also have NSX overlay
segments; VMs connected to an NSX VLAN Segment can have a Service Interface (SI) as their
default gateway.
NSX Edge VM VLAN Segment (routing peering or service interfaces): uses the Edge VLAN
Transport Zone (different than the one for ESXi TNs); DFW does not apply because NSX Edge
VMs are automatically excluded from the DFW; the SI is connected to the NSX Edge VLAN when
it serves as the default gateway for VMs on dvpgs or VLAN segments, the VLAN ID must match,
and dynamic routing over the SI is only supported in EVPN route server mode.
or higher. To improve throughput, one can increase the MTU up to 8800 (a ballpark number to
accommodate bridging and future header expansion). Modern operating systems support a
TCP/IP stack feature called Path MTU (PMTU) discovery. PMTU discovery determines the
underlying network's capability to carry a given message size, measured by the MSS (maximum
segment size), for a given TCP session. Every TCP session will continue to make this discovery
periodically, allowing the maximum efficient payload for a given session.
Once the above requirements are met, an NSX network virtualization deployment is agnostic to
any underlay topology and configurations, specifically:
• Any physical topology – core/aggregation/access, leaf-spine, etc.
• On any switch from any physical switch vendor, including legacy switches.
• With any underlying technology. IP connectivity can be achieved over an end-to-end
layer 2 network as well as across a fully routed environment.
For an optimal design and operation of NSX, well-known baseline standards are applicable.
These standards include:
• Device availability (e.g., host, TOR, rack-level)
• TOR bandwidth - both host-to-TOR and TOR uplinks
• Fault and operational domain consistency (e.g., localized peering of Edge node to the
northbound network, separation of host compute domains, etc.)
The following considerations apply to this topology in an NSX security only deployment (FIGURE
7-6):
• Different racks require different VLANs and network configurations. Rack 1 and 2
(yellow) host the compute clusters, rack 3 (green) hosts management workloads, rack 4
(blue) is dedicated to WAN connectivity, all with different VLAN requirements.
• While WAN and Management blocks are usually static, the compute rack switches may
require frequent configuration updates based on the applications lifecycle.
• Virtual machine mobility is not available across racks. This may limit the agility and
elasticity properties of the design and reduce the consolidation ratio in the data center.
• vSphere clusters cannot be striped across racks, which may limit high availability as the
infrastructure cannot protect against a rack failure.
The following considerations apply to this topology in an NSX network and security
deployment (FIGURE 7-7):
• The ToR VLAN configuration is more consistent than in a security-only deployment, and it
is agnostic to the application lifecycle. Generally, three configuration templates must be
provided for the ToR switches, one for the compute block (yellow), one for the
management block (green), and one for the edge block (blue). Those configurations
templates tend to be static as they do not require modification when new NSX virtual
networks are deployed.
• Virtual machine mobility is available across racks. NSX overlay segments extend layer two
networks across the layer three boundaries of each rack.
• Compute clusters can be stretched across racks for rack level high availability.
• Resource pooling is streamlined as hosts in different racks (or datacenter rooms) can be
added to the same vSphere cluster.
• While some infrastructure management VMs such as the vRealize suite components can
be deployed on NSX overlay segments, NSX managers and vCenter should be deployed
on a VLAN network. This limits the span of the management vSphere cluster to a single
rack. The management VLAN must be extended across racks, to provide rack high
availability. The NSX manager cluster appliances can be deployed in different IP subnets
to avoid extending layer two between racks, but no solution is currently available for the
vCenter server.
FIGURE 7-8 below outlines a sample VLAN and IP design for an NSX networking and security
deployment over a layer three fabric. Because each rack represents a unique layer two domain,
we can use the same VLAN IDs across racks to improve consistency.
The compute racks generally require only four VLANs (ESXi Management, vMotion, Storage, and
NSX Overlay), and they can share the same VDS. Management racks may or may not be
prepared for NSX. When they are not, they do not require the NSX Overlay VLAN.
Figure 7-8: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 3 fabric
The following considerations apply to this topology in an NSX security only deployment (FIGURE
7-9):
• At minimum, VLANs are extended between the switches in the same rack, meeting the IP
and MAC mobility requirement of vSphere networking. VLANs can be extended across
racks, increasing VM mobility and allowing the possibility to stripe vSphere clusters
across racks for increased high availability.
• While possible, extending VLANs across multiple racks increases the size of a layer two
fault domain.
• In a dynamic environment requiring frequent provisioning of new networks to support
the applications lifecycle, the network administrator may need to perform frequent and
widespread changes to the physical network configuration.
• The network architect will be required to balance the above considerations. Maximum
flexibility in VM placement via a larger VLAN span will negatively impact the size of
failure domains and the overall manageability. A more segmented approach will instead
limit VM mobility and the overall elasticity of the design.
The following considerations apply to this topology in an NSX network and security
deployment (FIGURE 7-10 AND FIGURE 7-11):
• The network architect can limit the size of the layer two domains to a single rack by
pruning the VLANs appropriately. VM mobility is not impacted as NSX overlays extend
VM networks across layer three boundaries. This configuration makes the layer two
physical fabric properties very similar to those of a layer three fabric concerning the NSX
design.
• Management VMs VLAN can easily be extended between racks enabling the striping of
the management cluster.
• Small environments may benefit from a simpler VLAN/IP schema where the
infrastructure VLANs (management, vMotion, storage, overlay) are unique across the
fabric.
• The configuration of the physical devices becomes standardized and static. New
networks and topologies to support the applications lifecycle are provisioned in NSX
transparently to the physical fabric.
• Bare metal edge nodes, or the ESXi hosts where edge node VMs reside, should be
connected to layer three capable devices. It usually means the spine or aggregation layer
switches in a layer two fabric. When implementing such a design is not desirable,
dedicated layer 3 ToR switches may be deployed to interconnect the edge nodes.
Figure 7-10: Network and Security Deployment – L2 fabric – Edge Nodes connected to spine switches
Figure 7-11: Network and Security Deployment – L2 fabric – Edge Nodes connected to dedicated L3 ToRs
FIGURE 7-12 below outlines a sample VLAN/IP schema for a layer two fabric where VLANs have
been extended across the whole physical network. This design is appropriate for smaller
deployments. A large implementation may benefit from the segmentation of layer two
domains. In such cases, the VLAN/IP schema may approach the one outlined for the layer three
fabric example.
Figure 7-12: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 2 fabric
From an NSX and vSphere perspective, a layer three fabric with overlay is equivalent to a layer
two fabric. As such, similar considerations apply. For NSX Security only deployments (FIGURE
7-13), the points to emphasize are:
• VLANs can be extended across racks via the physical fabric overlays, increasing the span
of VM mobility and providing the possibility to stripe vSphere clusters across racks for
increased high availability.
• The span of the VLANs is less of a concern regarding fault domains size compared to
Layer 2 fabrics because network overlays are in use.
• In a dynamic environment requiring frequent provisioning of new networks to support
the applications lifecycle, the network administrator may need to perform frequent and
widespread changes to the physical network configuration.
For NSX Network and Security deployments (FIGURE 7-14), the points to emphasize are:
• Larger physical fabrics can adopt a simplified VLAN/IP schema where unique VLANs are
common across the environment.
• NSX Geneve encapsulation will work over the physical fabric overlay encapsulation
without any problem as long as the virtual machines and physical fabric MTU account for
the additional headers.
• The configuration of the physical devices becomes standardized and static. New
networks and topologies to support the application lifecycle are provisioned in NSX
transparently to the physical fabric.
• Edge nodes are connected to edge leaf switches with access to the WAN network.
• Edge node VMs should not be deployed on clusters striped across racks as it would
require extending the L3 peering VLANs via the physical fabric overlay. Edge nodes
should peer to the leaf switches in the same rack on dedicated local VLANs. Edges in
different racks and vSphere clusters can be grouped in the same NSX edge cluster.
FIGURE 7-15 outlines a sample VLAN/IP schema for a layer three fabric with an overlay where
NSX network overlays are also in use. The number of IP segments is minimized as the physical
networks overlay allows each IP network to be available on every rack. Note the four physical to
virtual transit segments (VLAN 106-108) in the edge racks, two per rack. They are not extended
between racks and only provide connectivity to the local edge nodes.
Figure 7-15: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 3 fabric with overlays
Overview
NSX Manager appliances (bundling manager and controller functions) are mandatory NSX
infrastructure components. Their networking requirement is basic IP connectivity with the
other NSX components. The detailed requirements for these communications are listed at
https://ports.vmware.com/home/NSX-Data-Center.
NSX Manager appliances are typically deployed on a hypervisor and connected to a VLAN-
backed port group on a vSS or vDS; there is no need for colocation of the three appliances
in the same subnet or VLAN. There are no MTU or encapsulation requirements, as the NSX
Manager appliances send management and control plane traffic over the management VLAN
only. The NSX Manager appliances can be deployed on both ESXi and KVM hypervisors.
FIGURE 7-16 shows three ESXi hypervisors in the management rack hosting three NSX Manager
appliances.
The ESXi management hypervisors are configured with a VDS/VSS with a management port
group mapped to a management VLAN. The management port group is configured with two
uplinks using physical NICs “P1” and “P2” attached to different top of rack switches. The uplink
teaming policy has no impact on NSX Manager operation, so it can be based on existing
VSS/VDS policy.
FIGURE 7-17 presents the same NSX Manager appliance VMs running on KVM hosts.
The KVM management hypervisors are configured with a Linux bridge with two uplinks using
physical NICs “P1” and “P2”. The traffic is injected into a management VLAN configured in the
physical infrastructure. Either active/active or active/standby is fine as the uplink teaming
strategy for NSX Manager since both provide redundancy; this example uses the simplest
connectivity model with an active/standby configuration.
In a typical deployment, the NSX management components should be deployed on a VLAN;
this is the recommended best practice. The target compute cluster only needs to have the
hypervisor switch: VSS/VDS on ESXi and a Linux bridge on KVM. Deploying NSX management
components on the software-defined overlay requires elaborate considerations and is thus
beyond the scope of this document.
host and may extend to a data store, vSphere cluster, or even cabinet or rack. NSX does not
natively enforce this design practice.
When a single vSphere-based management cluster is available, deploy the NSX Managers in the
same vSphere cluster and leverage the native vSphere Distributed Resource Scheduler (DRS)
anti-affinity rules to avoid instantiating more than one NSX Manager node on the same ESXi
server. For more information on how to create a VM-to-VM anti-affinity rule, refer to the
VMware documentation on VM-to-VM and VM-to-host rules. For a vSphere-based design, it is
recommended to leverage vSphere HA functionality to ensure a single NSX Manager node can
recover after the loss of a hypervisor. Furthermore, NSX Manager should be installed on shared
storage. vSphere HA requires shared storage so that VMs can be restarted on another host if
the original host fails. A similar mechanism is recommended when NSX Manager is deployed in
a KVM hypervisor environment.
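For vSphere-based designs, the anti-affinity rule mentioned above can also be created programmatically. The following pyVmomi sketch assumes a vCenter named vcenter.example.com, a management cluster named mgmt-cluster, and NSX Manager VMs named nsx-mgr-01/02/03, all of which are placeholders; validate the object names and API usage against your environment before running it.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Connect to vCenter (placeholder host/credentials); unverified SSL context for lab use only.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

def find(vimtype, name):
    # Locate a managed object by name using a container view.
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(o for o in view.view if o.name == name)

cluster = find(vim.ClusterComputeResource, "mgmt-cluster")
managers = [find(vim.VirtualMachine, n) for n in
            ("nsx-mgr-01", "nsx-mgr-02", "nsx-mgr-03")]

# VM-VM anti-affinity rule keeping the three NSX Manager nodes on separate hosts.
rule = vim.cluster.AntiAffinityRuleSpec(name="nsx-mgr-anti-affinity",
                                        enabled=True, mandatory=False, vm=managers)
spec = vim.cluster.ConfigSpecEx(
    rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)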
Additional considerations apply for the management cluster with respect to storage availability
and IO consistency. A failure of a datastore should not trigger a loss of Manager node majority,
and IO access must not be oversubscribed such that it causes unpredictable latency where a
Manager node goes into read-only mode due to lack of write access. If a single VSAN datastore
is being used to host the NSX Manager cluster, additional steps should be taken to reduce the
probability of a complete datastore failure, some of which are addressed in the physical
placement considerations for the NSX Manager nodes below. It is also recommended to reserve
CPU and memory resources according to their respective requirements. Please refer to the
following links for details:
NSX Manager Sizing and Requirements:
https://docs.vmware.com/en/VMware-NSX-Data-Center/3.2/installation/GUID-AECA2EE0-
90FC-48C4-8EDB-66517ACFE415.html
NSX Manager Cluster Requirements with HA, Latency and Multi-site:
https://docs.vmware.com/en/VMware-NSX-Data-Center/3.2/installation/GUID-509E40D3-
1CD5-4964-A612-0C9FA32AE3C0.html
7.3.3.1 Single vSphere Cluster for all NSX Manager Nodes in a Manager Cluster
When all NSX Manager Nodes are deployed into a single vSphere cluster, it is important to
design that cluster to meet the needs of the NSX Managers with at least the following:
1. At least four vSphere Hosts. This is to adhere to best practices around vSphere HA and
vSphere Dynamic Resource Scheduling (DRS) and allow all three NSX Manager Nodes to
remain available during proactive maintenance or a failure scenario.
2. The hosts should all have access to the same data stores hosting the NSX Manager
Nodes to enable DRS and vSphere HA.
3. Each NSX Manager Node should be deployed onto a different data store (this is
supported as a VMFS, NFS, or other data store technology supported by vSphere)
4. DRS Anti-Affinity rules should be put in place to prevent, whenever possible, two NSX
Manager VMs from running on the same host.
5. During lifecycle events of this cluster, each node should be independently put into
maintenance mode, moving any running NSX Manager Node off the host prior to
maintenance, or any NSX Manager Node should be manually moved to a different host.
6. If possible, rack-level or critical infrastructure (e.g., power, HVAC, ToR) failures should also be
taken into account to protect this cluster from any single failure event taking down the
entire cluster at once. In many cases, this means spreading this cluster across multiple
cabinets or racks and connecting its hosts to a diverse set of physical switches, etc.
Rack-level redundancy requires extending the management VLAN between the racks, as
the NSX Managers generally reside on a vCenter-managed VLAN dvpg. Properly planned
rack-level redundancy includes a cluster striped across three racks and DRS rules placing
the NSX Managers on hosts in different racks, because the failure of a cabinet with more
than one NSX Manager causes the loss of quorum.
7. NSX Manager backups should be configured and pointed to a location running outside of
the vSphere Cluster that the NSX Manager Nodes are deployed on.
7.3.3.2 Single vSphere Cluster when leveraging VSAN as the storage technology
When all NSX Manager Nodes are deployed into a single vSphere cluster where VSAN is the
storage technology in use, additional steps should be taken to protect the NSX Manager
cluster's availability, since VSAN presents only a single datastore. A few VSAN-specific
parameters, including Primary Level of Failures to Tolerate (PFTT), Secondary Level of Failures
to Tolerate (SFTT), and Failure Tolerance Mode (FTM) (when only a single site is configured in
VSAN, PFTT is set to 0 and SFTT is usually referred to as just FTT), govern the availability of
the resources deployed on the VSAN datastore. The following configuration is recommended to
improve the availability of the NSX Management and Control Planes:
• FTT >= 2
• FTM = Raid1
• Number of hosts in the VSAN cluster >= 5
This dictates that, for each object associated with an NSX Manager node, two or more
copies of the data and a witness are available, so that even if two failures occur the objects
remain in a healthy state. This accommodates a maintenance event and an outage
occurring at the same time without impacting the integrity of the data on the datastore. This
configuration requires at least five (5) hosts in the vSphere cluster.
Cabinet Level Failures can be accommodated as well, by distributing the cluster horizontally
across multiple cabinets. No more hosts than the number of failures that can be tolerated (a
maximum of FTT=3 is supported by VSAN) should exist in each rack. This means a minimum of
five (5) hosts spread across three (3) racks in a 2-2-1 pattern, and DRS rules preferentially
placing the NSX Managers in different racks.
Implementing VSAN Stretched Clusters to provide cabinet level protection is not
recommended. While appealing because of the reduced number of hosts required (2 hosts and
a witness VM in 3 different failure domains are the minimum requirements), the failure of a
rack may cause the loss of the NSX manager cluster quorum because two NSX manager
appliances are placed in the same rack.
It is strongly recommended that, any time hosts are proactively removed from service, they
vacate their storage and the objects are rebuilt on the remaining hosts in the VSAN cluster.
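For illustration, the full data evacuation can be requested when the host enters maintenance mode; the sketch below uses esxcli and is equivalent to selecting "Full data migration" in the vSphere Client.

# Enter maintenance mode and evacuate all vSAN data from this host first.
esxcli system maintenanceMode set --enable true --vsanmode evacuateAllData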
Providing a VSAN cluster with five hosts is the recommended approach, but only fewer hosts are
available in some situations. VSAN 7 Update 2 introduced the Enhanced Data Durability feature,
which may reduce the risk of running the NSX Manager cluster with a storage policy of FTT=1.
With this feature, if a failure occurs while one of the nodes has been taken out of service for
maintenance, VSAN can rebuild the up-to-date data once the host exits maintenance mode.
Additional VSAN settings and best practices that we should consider are:
• Enable Operations Reserve and Host rebuild reserve. This helps monitor the reserve
capacity threshold, generates alerts when the threshold is reached and prevents further
provisioning.
• Ensure that the VM storage policy applied to the NSX Managers is compliant.
• Use the data migration pre-check tool before carrying out host/cluster maintenance.
• Review the vSAN Skyline Health Checks regularly.
The NSX Manager availability has improved compared to the previous option; however, it is
important to clarify the distinction between node availability and load-balancing. In the case of
cluster VIP, all the API and GUI requests go to one node, and it's not possible to achieve load-
balancing of GUI and API sessions. In addition, the sessions that were established on a failed
node will need to be re-authenticated and re-established at the new owner of the cluster VIP.
The cluster VIP model is also designed to address the failure of certain NSX Manager services
but cannot guarantee recovery in certain corner-case service failures.
It is important to emphasize that northbound client connectivity happens via the cluster VIP,
while the communication between the NSX Manager nodes and the NSX transport node
components happens on the individual node IPs.
The cluster VIP is the preferred and recommended option for high availability of the NSX
Manager appliance nodes.
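For reference, the cluster VIP can be assigned with a single API call against any manager node; the sketch below uses placeholder hostnames, credentials, and IP address.

# Assign the cluster virtual IP (placeholder address 10.1.1.10).
curl -k -u admin:'<password>' -X POST \
  "https://nsx-mgr-01.example.com/api/v1/cluster/api-virtual-ip?action=set_virtual_ip&ip_address=10.1.1.10"

# Verify the assignment.
curl -k -u admin:'<password>' "https://nsx-mgr-01.example.com/api/v1/cluster/api-virtual-ip"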
With an external load balancer, access to the NSX Manager nodes is load-balanced, and cluster
management access remains highly available via a single IP address. The external load-balancer
option also makes the deployment of NSX Manager nodes independent of the underlying physical
topology (agnostic to L2 or L3 topology).
Additionally, one can distribute the nodes across more than one rack to achieve rack redundancy (in
an L2 topology it is the same subnet but a different rack, while in an L3 topology it will be a distinct
subnet per rack). The downside of this option is that it requires an external load balancer and
additional configuration complexity that depends on the load-balancer model. The make and model
of the load balancer are left to user preference; one can also use the NSX native load balancer,
included as part of the system and managed by the same cluster of NSX Manager nodes that are
being load-balanced, or the standalone NSX Advanced Load Balancer (formerly Avi Networks)
offered as part of the NSX Data Center portfolio.
FIGURE 7-22 presents a scenario where the NSX manager nodes are load-balanced by an
external load balancer, and session persistency is based on the source IP of the client. With this
configuration, the load balancer will redirect all requests from the same client to the same NSX
manager node.
NSX Manager can authenticate clients in four ways: HTTP basic authentication, client
certificate authentication, vIDM, and session-based authentication. API clients can use all four
forms of authentication, while web browsers use session-based authentication. Session-based
authentication typically requires an LB persistence configuration, while API-based access does not
mandate it. FIGURE 7-22 represents a VIP with LB persistence configured for both browser
(GUI) and API-based access.
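The following sketch illustrates session-based authentication against the VIP (hostname and credentials are placeholders); the session cookie and the X-XSRF-TOKEN header returned at login must accompany subsequent API calls.

# Create a session; the cookie is stored in cookies.txt and the response headers
# (including X-XSRF-TOKEN) in headers.txt.
curl -k -c cookies.txt -D headers.txt -X POST \
  -d 'j_username=admin&j_password=<password>' \
  https://nsx-vip.example.com/api/session/create

# Reuse the session cookie and the XSRF token on subsequent requests.
XSRF=$(grep -i '^x-xsrf-token' headers.txt | awk '{print $2}' | tr -d '\r')
curl -k -b cookies.txt -H "X-XSRF-TOKEN: $XSRF" \
  https://nsx-vip.example.com/policy/api/v1/infra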
While one can conceive an advanced load-balancing scheme with a dedicated VIP with LB
persistence for browser access and another VIP without persistence for API access, this option
has limited value in terms of scale and performance differentiation while complicating access
to the system. For this reason, it is highly recommended to first adopt the basic option of
LB persistence based on source IP with a single VIP for all types of access. The overall
recommendation is to start with a cluster VIP and move to an external LB only if real needs exist.
Load sharing across pNICs – Multiple traffic types share the available pNICs, relying on
traffic management tools such as NIOC to provide adequate bandwidth to critical traffic. One
example is VSAN traffic. The teaming type offered in NSX that enables this behavior is called
"Load Balanced Source Teaming."
Deterministic traffic per pNIC– A certain traffic type is only carried on a specific pNIC, thus
allowing dedicated bandwidth for a given traffic type. Additionally, it allows deterministic
failure as one or more links could be in standby mode. Based on the number of pNICs, one can
design a traffic management schema that avoids the contention of two high bandwidth flows
(e.g., VSAN and vMotion). The teaming type offered in NSX that enables this behavior is called
“Failover Order.” We can build a teaming failover policy with a single pNIC or more. When a
single pNIC is used, its failure is not covered by a standby pNIC. Such a scenario is not common,
but it can be an appropriate choice when other high availability mechanisms like vSphere HA or
application-level redundancy are in place.
A scenario where we may leverage deterministic traffic per pNIC is in the case of multiple
distinct physical networks, e.g., DMZ, Storage, or Backup Networks, where the physical
underlay differs based on pNIC.
Designing a traffic management schema that utilizes both design patterns is possible. This
design guide covers both scenarios based on specific requirements and makes generalized
recommendations.
With the N-VDS, NSX segments appeared as opaque networks in vCenter. Even though opaque
networks have been available in vCenter for almost a decade, well before NSX was developed,
many third-party solutions still do not take this network type into account and, as a result, fail
with NSX. When running NSX on VDS, NSX segments are represented as DVPGs (from now on
called NSX DVPGs) in vCenter. Third-party scripts that had not been retrofitted for opaque
networks can now work natively with NSX.
than two pNICs on the host. The use cases will still cover a four-pNIC design, for those who
plan on keeping the N-VDS for now, and for the cases where multiple virtual switches are a
requirement for other reasons.
7.4.1.4 NSX on VDS and Interaction with vSphere & Other Compute Domains
The relationship and the representative differences between the N-VDS and NSX on VDS are
subtle and require consideration when designing the cluster, as well as operational understanding.
The N-VDS is completely independent of the underlying compute manager (vSphere or any other
compute domain, such as AWS, Azure, or Google Cloud). Unlike the deployment of a VDS via
vCenter, N-VDS deployment is managed via NSX Manager. This decoupling allows consistent
connectivity and security across multiple compute domains.
This consistent connectivity is shown in the first part of FIGURE 7-23, with the N-VDS and its
relation to vCenter. With NSX 3.0, NSX can be enabled on a traditional VDS, allowing the same
level of flexibility; however, there is now a mandatory requirement of having vCenter to
instantiate the VDS. This capability, when enabled, is depicted in the center part of FIGURE 7-23
as an NSX DVPG. The center part of the figure depicts a single VDS with NSX, which is logically
similar to the first one; the only difference is the representation in vCenter - opaque network vs.
NSX DVPG. The third part of the figure represents multiple VDSs. This is possible either with a
single vCenter where each cluster has a dedicated VDS, or with multiple vCenters each with its
own VDS. In this latter case of multiple VDSs, the same NSX segment is represented by a unique
NSX DVPG under each VDS or vCenter. This can represent an operational challenge when
identifying VM connectivity to a VDS and for automation that relies on that underlying
assumption; however, future releases shall make this identification easier with unique names.
Typically, it is good practice to use a single VDS per compute domain and thus keep a consistent
view and operational consistency of VM connectivity. However, there are cases where a single
VDS may not be ideal because of separation of workloads for security, automation, storage
policy, NIOC control, or provisioning boundaries. Thus, it is acceptable to have multiple VDSs per
given vCenter.
The details of these differences and additional considerations are documented in the following
KB article:
https://kb.vmware.com/s/article/79872
[Figure: An NSX segment in an Overlay/VLAN transport zone realized across transport nodes - as "Port NSX-T" on the KVM N-VDS, as an opaque network on the ESXi N-VDS, and as an NSX-owned, read-only-for-vCenter DVPG (nsx-dvpg-100) on hosts running VDS with NSX. Transport zones can extend to other vCenter hosts, KVM, etc.]
The key concept here is that NSX abstracts connectivity and security via the segment. This
segment is realized in various forms: an NSX DVPG on VDS with NSX, an NSX port in KVM, or an
opaque network on the N-VDS. Regardless of the underlying switch, DFW can be applied to either
VLAN or overlay segments, but only via an NSX DVPG and not via a vSphere DVPG.
Figure 7-25: ESXi Compute Rack Failover Order Teaming with One Teaming Policy
In FIGURE 7-25, a single virtual switch (VDS or N-VDS) is used with a two-pNIC design and
carries both infrastructure and VM traffic. Physical NICs "P1" and "P2" are attached to
different top-of-rack switches. The teaming policy selected is failover order active/standby:
"Uplink1" is active while "Uplink2" is standby.
• If the virtual switch is an N-VDS, all traffic on the host is carried by NSX segments (VLAN
or overlay), and the redundancy model can be achieved with a single default NSX
teaming policy.
• If the virtual switch is a VDS, only NSX DVPG traffic will follow the NSX teaming policy.
The infrastructure traffic will follow the teaming policy defined in their respective VDS
DVPGs configured in vCenter.
The top-of-rack switches are configured with a first-hop redundancy protocol (e.g., HSRP or
VRRP), providing the active default gateway for all the VLANs on “ToR-Left .” The VMs are
attached to overlay or VLAN segments defined in NSX, and they will follow the default teaming
policy regardless of the type of segment they are connected to. With the use of a single
teaming policy, the above design allows for a simple configuration of the physical infrastructure
and simple traffic management at the expense of leaving an uplink completely unused.
It is, however, easy to load balance traffic across the two uplinks while maintaining the
deterministic nature of the traffic distribution. The following example shows the same hosts,
this time configured with two separate failover order teaming policies: one with P1 active, P2
standby, and the other with P1 standby, P2 active. Then, individual traffic types can be assigned
a preferred path by mapping it to either teaming policy.
Figure 7-26: ESXi Compute Rack Failover Order Teaming with two Teaming Policies
In FIGURE 7-26, storage and vMotion traffic follow a teaming policy setting P1 as primary active,
while management and VM traffic are following a teaming policy setting P2 as primary active.
• When the virtual switch is an N-VDS, all segments follow the default teaming policy by
default. VLAN segments can, however, be associated with additional teaming policies
(identified by a name and thus called “named” teaming policies). We can thus achieve
the above design with a default teaming policy (P2 active/P1 standby) and an additional
named teaming policy (P1 active/P2 standby) to which NSX VLAN segments for storage
and vMotion traffic are mapped. Overlay networks always follow the default teaming
policy, and it is impossible to associate them with a named teaming policy.
• When the virtual switch is a VDS with NSX, the above design is achieved with a default
teaming policy P2 active & P1 standby. Then, the DVPGs for infrastructure traffic need to
be configured individually: storage and vMotion will have a failover order teaming policy
setting P1 active & P2 standby, while the management DVPG will be configured for P2
active & P1 standby.
The ToR switches are configured with a first-hop redundancy protocol (FHRP), providing an
active default gateway for storage and vMotion traffic on “ToR-Left,” management, and overlay
traffic on “ToR-Right” to limit interlink usage. Multiple teaming policies allow the utilization of
all available pNICs while maintaining deterministic traffic management. This is a better
recommended approach when adopting a deterministic traffic per pNIC design approach.
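For the N-VDS case, the design of FIGURE 7-26 translates into an uplink profile combining a default failover order teaming policy with a named teaming policy. The sketch below uses the policy API with placeholder names and values; the exact object path and field names should be validated against the NSX API guide for the release in use.

curl -k -u admin:'<password>' -X PATCH -H 'Content-Type: application/json' \
  -d '{
        "resource_type": "PolicyUplinkHostSwitchProfile",
        "teaming": {
          "policy": "FAILOVER_ORDER",
          "active_list":  [ { "uplink_name": "Uplink2", "uplink_type": "PNIC" } ],
          "standby_list": [ { "uplink_name": "Uplink1", "uplink_type": "PNIC" } ]
        },
        "named_teamings": [
          { "name": "storage-vmotion",
            "policy": "FAILOVER_ORDER",
            "active_list":  [ { "uplink_name": "Uplink1", "uplink_type": "PNIC" } ],
            "standby_list": [ { "uplink_name": "Uplink2", "uplink_type": "PNIC" } ] }
        ],
        "transport_vlan": 0
      }' \
  https://nsx-vip.example.com/policy/api/v1/infra/host-switch-profiles/compute-uplink-profile

The NSX VLAN segments for storage and vMotion would then reference the "storage-vmotion" named teaming policy, while overlay and the remaining segments follow the default policy.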
Additionally, one can utilize a mix of the different teaming policy types so that infrastructure
traffic (VSAN, vMotion, management) leverages "failover order", enabling deterministic bandwidth
and failover, while VM traffic uses a "load balance source" teaming policy, spreading VM traffic
across both pNICs.
• On a host running NSX with an N-VDS, the default teaming policy will have to be
configured for “load balance source,” as it’s the only policy that overlay traffic follows.
Then, we can map individual VLAN segments to named teaming policies steering the
infrastructure traffic to the desired uplink.
• On a host running VDS with NSX, overlay traffic will follow the default teaming policy
defined in NSX. Infrastructure traffic will follow the individual teaming policies
configured on their DVPGs in vCenter.
Based on the requirement of underlying applications and preference, one can select a type of
traffic management as desired for the infrastructure traffic. However, for the overlay traffic, it
is highly recommended to use one of the “load balanced source” teaming policies as they’re the
only ones allowing overlay traffic on multiple active uplinks and thus provide better throughput
in/out of the host for VM traffic.
The following FIGURE 7-28 presents an example of a host with 4 uplinks and two virtual
switches:
• A VDS is dedicated to infrastructure traffic.
• A second NSX prepared VDS or N-VDS handles the VM traffic.
Figure 7-28: ESXi Compute Rack 4 pNICs – VDS and NSX virtual switch
The VDS is configured with pNICs "P1" and "P2", and each port group is configured with a
different pNIC in active/standby so that both pNICs are used. However, the choice of teaming mode
on the VDS is left to the user based on the considerations outlined in the two-pNIC design section.
The virtual switch running NSX owns pNICs “P3” and “P4”. To leverage both pNICs, the N-VDS is
configured in load balance source teaming mode. Each type of host traffic has dedicated IP
subnets and VLANs.
The model in FIGURE 7-28 still makes sense when running NSX on N-VDS. However, now that NSX
can be deployed directly on a VDS, the same functionality can easily be achieved with a single
VDS running NSX and owning the four uplinks, as represented below.
Figure 7-29: ESXi Compute Rack 4 pNICs – VDS and NSX virtual switch
You can achieve this configuration by simply mapping the four pNICs of the host to 4 uplinks on
the VDS. Then, install NSX using an uplink profile that maps its uplinks (Uplink1/Uplink2 in the
diagram) to the VDS uplinks P3 and P4. This model still benefits from the simple, non-disruptive
installation of NSX. At the same time, the NSX component can only work with the two uplinks
mapped to the uplink profile. This means that VM traffic can never flow over P1/P2.
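On a host running NSX on VDS, this mapping is expressed in the transport node (or transport node profile) host switch configuration. The fragment below is a sketch with placeholder identifiers; field names should be checked against the API guide for the NSX version in use.

"host_switch_spec": {
  "resource_type": "StandardHostSwitchSpec",
  "host_switches": [
    {
      "host_switch_type": "VDS",
      "host_switch_mode": "STANDARD",
      "host_switch_id": "<vds-uuid-from-vcenter>",
      "uplinks": [
        { "uplink_name": "Uplink1", "vds_uplink_name": "P3" },
        { "uplink_name": "Uplink2", "vds_uplink_name": "P4" }
      ]
    }
  ]
}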
The final added benefit of this model is that the administrator can manage a single virtual
switch and has the flexibility of adding additional uplinks for overlay or infrastructure traffic.
The two virtual switches on a given host can be either VDS or N-VDS (note, however, that if both
virtual switches are prepared for NSX, they must either both be VDS with NSX or both N-VDS, not a
mix of N-VDS and VDS with NSX). The segments cannot extend between the two virtual switches,
as each is tied to a different transport zone.
Figure 7-30: ESXi Compute Rack 4 pNICs- Two N-VDS (Or VDS with NSX)
Below, we list a series of use cases and configurations where the dual VDS design is relevant. In
any of those cases, the infrastructure traffic will be carried on the first virtual switch. Here are
some examples:
• The first two pNICs are exclusively used for infrastructure traffic, and the remaining two
pNICs are for overlay VM traffic. This allows dedicated bandwidth for overlay application
traffic. One can select the appropriate teaming mode as discussed in the two-pNIC design
section above (ESXI-BASED COMPUTE HYPERVISOR WITH TWO PNICS).
• The first two pNICs are dedicated to "VLAN only" micro-segmentation, and the second
pair is for overlay traffic.
• Building multiple overlays for separation of traffic. Starting with NSX 3.1, a host can have
virtual switches part of different overlay transport zones and the TEPs on each virtual
switch can be on different VLAN/IP subnets (still, all the TEPs for an individual switch
must be part of the same subnet). When planning such configuration, it is important to
remember that while hosts can be part of different overlay transport zones, edge nodes
cannot. The implication is that when implementing multiple overlay transport zones,
multiple edge clusters are required, one per overlay transport zone at a minimum.
• Building regulatory compliant domains with VLAN only or overlay.
• Building traditional DMZ type isolation.
The two virtual switches running NSX must attach to different transport zones. See detail in
section SEGMENTS AND TRANSPORT ZONES
=====================================================
Note: The second virtual switch could be of enhanced datapath type, for the NFV use case. This
type is beyond the scope of this design guide.
=====================================================
[Figure: Compute-KVM1 and Compute-KVM2, each with an N-VDS attached to pNICs P1 and P2]
In FIGURE 7-31 the design is very similar to ESXi Failover Order Teaming Mode.
A single host switch is used with a two-pNIC design. This host switch manages all traffic – overlay,
management, storage, etc. Physical NICs "P1" and "P2" are attached to different top-of-rack
switches. The teaming option selected is failover order active/standby: "Uplink1" is active while
"Uplink2" is standby. As shown in the logical switching section, host traffic is carried on the active
uplink "Uplink1", while "Uplink2" is purely a backup in case of a port or switch failure. This
teaming policy provides a deterministic and simple design for traffic management.
The top-of-rack switches are configured with a first hop redundancy protocol (e.g. HSRP, VRRP)
providing an active default gateway for all the VLANs on “ToR-Left”. The VMs are attached to
segments/logical switches defined on the N-VDS, with the default gateway set to the logical
interface of the distributed Tier-1 logical router instance.
Note about N-VDS ports and bridge:
NSX host preparation of KVM automatically creates the N-VDS with its "Port NSX" (holding the TEP IP
address) and the "Bridge nsx-managed" (where the VMs are plugged). The other ports, like "Port Mgt" and
"Port Storage", have to be created outside of NSX preparation.
[Figure: Compute-KVM1 N-VDS with uplinks P1/P2 (Uplink1/Uplink2), Port Mgt (Mgt-IP), Port Storage (Stor-IP), Port NSX-T (TEP-IP), and Bridge nsx-managed connecting the Web1 and App1 VMs]

Config created outside of NSX-T:
root@KVM1:~# ovs-vsctl add-port nsx-switch.0 "switch-mgt" tag=22 -- set interface "switch-mgt" type=internal

root@kvm1-ubuntu:~# ovs-vsctl show
1def29bb-ac94-41b3-8474-486c87d96ef1
    Manager "unix:/var/run/vmware/nsx-agent/nsxagent_ovsdb.sock"
        is_connected: true
    Bridge nsx-managed
        Controller "unix:/var/run/vmware/nsx-agent/nsxagent_vswitchd.sock"
            is_connected: true
        fail_mode: secure
        Port nsx-managed
            Interface nsx-managed
                type: internal
        Port hyperbus
            Interface hyperbus
                type: internal
    Bridge "nsx-switch.0"
        Controller "unix:/var/run/vmware/nsx-agent/nsxagent_vswitchd.sock"
            is_connected: true
        fail_mode: secure
        Port "nsx-vtep0.0"
            tag: 25
            Interface "nsx-vtep0.0"
                type: internal
        Port "nsx-switch.0"
            Interface "nsx-switch.0"
                type: internal
        Port switch-mgt
            tag: 22
            Interface switch-mgt
                type: internal
        Port switch-storage
            tag: 23
            Interface switch-storage
                type: internal
        Port "nsx-uplink.0"
            Interface "enp3s0f1"
            Interface "enp3s0f0"
    ovs_version: "2.7.0.6383692"
The IP addresses for those ports also have to be configured outside of NSX preparation.
Config created outside of NSX-T:
root@kvm1-ubuntu:~# vi /etc/network/interfaces
auto enp3s0f0
iface enp3s0f0 inet manual
    mtu 1700
auto enp3s0f1
iface enp3s0f1 inet manual
    mtu 1700
auto switch-mgt
iface switch-mgt inet static
    pre-up ip addr flush dev switch-mgt
    address 10.114.213.86
    netmask 255.255.255.240
    gateway 10.114.213.81
    dns-nameservers 10.113.165.131
    down ifconfig switch-mgt down
    up ifconfig switch-mgt up
• Normalization of N-VDS configuration – All edge node form factor deployments can use a
single N-VDS, like a host hypervisor: a single teaming policy for overlay traffic (Load
Balanced Source) and multiple named teaming policies for North-South peering.
If the deployment is running vSphere 6.5, where MAC learning is not available, the only other way to
run bridging is to enable promiscuous mode. Typically, promiscuous mode should not be
enabled system-wide. Thus, either enable promiscuous mode only on the DVPG associated with the
bridge vNIC, or consider dedicating an Edge VM to the bridged traffic so that
other kinds of traffic to/from the Edge do not suffer from the performance impact related to
promiscuous mode.
In the above scenario, the failure of the uplink of Edge 1 to physical switch S1 would trigger an
Edge Bridge convergence where the Bridge on Edge 2 would become active. However, the
failure of the path between physical switches S1 and S3 (as represented in the diagram) would
have no impact on the Edge Bridge HA and would have to be recovered in the VLAN L2 domain
itself. Here, we need to make sure that the alternate path S1-S2-S3 would become active
thanks to some L2 control protocol in the bridged physical infrastructure.
Edges for Segment/VLAN load balancing. Also note that, up to NSX release 2.5, a failover cannot
be triggered by user intervention. As a result, with this design, one cannot assume that both
Edges will be leveraged for bridged traffic, even when they are both available and several
Bridge Profiles are used for Segment/VLAN load balancing. This is perfectly acceptable if
availability is more important than available aggregate bandwidth.
Figure 7-36: Load-balancing bridged traffic for two Logical Switches over two Edges (Edge Cluster omitted for clarity.)
Further scale out can be achieved with more Edge nodes. The following diagram shows an
example of three Edge Nodes active at the same time for three different Logical Switches.
Figure 7-32: Load-balancing example across three Edge nodes (Bridge Profiles not shown for clarity.)
Note that while several Bridge Profiles can be configured to involve several Edge nodes in the
bridging activity, a given Bridge Profile cannot specify more than two Edge nodes.
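The sketch below shows, under assumed object paths and placeholder names, how a segment references a bridge profile (itself pointing to at most two edge nodes) for a given VLAN; the field names and paths should be verified against the policy API guide for the release in use.

# Attach an existing edge bridge profile to an overlay segment, bridging it to VLAN 100.
curl -k -u admin:'<password>' -X PATCH -H 'Content-Type: application/json' \
  -d '{
        "bridge_profiles": [
          {
            "bridge_profile_path": "/infra/sites/default/enforcement-points/default/edge-bridge-profiles/bridge-profile-1",
            "vlan_transport_zone_path": "/infra/sites/default/enforcement-points/default/transport-zones/<vlan-tz-id>",
            "vlan_ids": [ "100" ]
          }
        ]
      }' \
  https://nsx-vip.example.com/policy/api/v1/infra/segments/overlay-seg-1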
• Overlay traffic is load-balanced across two physical links leveraging multi-tep edge
capability. Multi-tep provides load balancing and better edge node resource utilization
compared to a single TEP.
• Because, for each bridging instance, VLAN traffic can be forwarded over a single uplink at
a time, load balancing is achieved by pinning different VLANs to different pNICs.
• VLAN and overlay traffic are forwarded over different pNICs. While sharing the pNICs
between VLAN and overlay traffic may lead to better performance and higher availability
(e.g., configuring 4 TEPs and connecting the edge to 4 ToRs), we chose a simpler design
with a deterministic traffic pattern.
The design presented in FIGURE 7-38 below includes a single NVDS per edge node. The NVDS
manages both the overlay and VLAN traffic. The diagram presents two bridging instances, for
VLAN and overlay X, and VLAN and overlay Y, but more can be configured. The VLAN traffic is
pinned to pNIC 2 or pNIC 3. Additional VLANs could be pinned to either interface. In this case,
the pinning is achieved via a bridging teaming policy, implying that any bridging instance must
be associated with a named teaming policy. Overlay traffic for any bridging instance is load-
balanced across pNIC0 and pNIC1 via the default teaming policy.
We should pay special attention to the cabling and VLAN to pNIC association, which is not the
same for the two bare metal edge nodes. The reason for this asymmetry is that while overlay
traffic can failover between the two pNICs on the same edge node, the same is not true for
VLAN traffic, which is pinned to a single uplink because of the limitation discussed in chapter 3.
If that single pNIC or the ToR where it is connected fails, the bridge instance should fail over to
the second edge, which must have a viable path for the associated VLANs. We want to maintain
a standard cabling configuration, where two pNICs are connected to one ToR and two to the
other consistently while at the same time allowing for the recovery of the traffic after the
failure of a ToR. For this reason, we crossed the mapping between pNIC2 and pNIC3 and uplink2
and uplink3 on the two bare metal edges.
FIGURE 7-37 shows a sample configuration. If this configuration is not desirable, connecting
pNIC2 to ToR2 and pNIC3 to ToR1 for bare metal edge-2 will have the same effect.
Figure 7-37: Bridge Design – 4 pNIC BM edge – Single NVDS – uplink to pNIC mapping
We can achieve the same design without named teaming policies and deploying multiple NVDSs
on the same bare metal edge. The functionalities provided are the same, and the decision
between the two options may be based on the preferred configuration workflows for new
bridging instances. This configuration option is presented in the diagram below.
So far, we have shown how to achieve optimal high availability, CPU, and link utilization across
a single bare metal edge. The previous diagrams show Bare Metal Edge 2 running standby
bridging instances only. It is possible to spread the active bridging instances across multiple
bare metal edges via bridge profiles. Because the traffic for each bridging instance is always
handled by a single edge node at a time, and VLAN traffic is pinned to a single pNIC, at least
four bridging instances are required to utilize all available links. This configuration is depicted in
FIGURE 7-40.
source MAC addresses (the MACs of all the VMs on the overlay) that can be used by the
upstream VDS for the load-balancing hash.
MAC learning and forged transmit must be enabled on the bridge dvpg carrying VLAN
traffic. It is not required to configure MAC learning on the Overlay1 and Overlay2 dvpgs as
the failures of vNIC1 or vNIC2 are not realistic occurrences and should not be part of a test
plan.
This configuration can be achieved with a single N-VDS and a named teaming policy as
depicted in FIGURE 7-42.
It can also be achieved without named teaming policies by deploying two NVDSs per edge VM. The
resulting designs are equivalent; they only differ in the configuration workflow. The two-NVDS
design may be preferable to avoid the need to associate a teaming policy with each bridging
instance, which lowers the probability of a misconfiguration.
• A deployment incorporating two hosts with 4 edge VMs (2 per host) requires two bridge
profiles for an active/standby configuration (1 host active, 1 host standby), and 4 bridge
profiles for an active/active configuration (both hosts active).
Note: neither hosts nor edge VMs are active or standby; it is the bridging instances running on top
of the edge VMs that are. Designating hosts as active or standby is a design abstraction that is
helpful when building a logical model hiding implementation complexities.
FIGURE 7-44 presents a 4 pNICs ESXi server hosting two edge VMs dedicated to bridging. The
two edge VMs are associated with different bridge profiles so that they will never run an
active/standby pair for any VLAN-Overlay pair. The failure of the host should not cause both the
active and standby instance to fail at the same time. Edge VMs running on a different host
provide high availability for the bridging instances depicted in the diagram. vSphere DRS rules
should be implemented to achieve the desired design and placement.
[Figure: EN1 and EN2 peering over eBGP with Router1 (External1-VLAN 100) and Router2 (External2-VLAN 200); Tier-0 SR/DR and Tier-1 SR/DR with overlay traffic between them]
Figure 7-45: Typical Enterprise Bare Metal Edge Node Logical View with Overlay/External Traffic
is load-balanced across uplinks using a named teaming policy, which pins a VLAN segment to a
specific uplink.
We can also use the same VLAN segment to connect Tier-0 Gateway to TOR-Left and TOR-Right.
However, it is not recommended because of inter-rack VLAN dependencies leading to spanning
tree-related convergence and the inability to load-balance the traffic across different pNICs.
This topology provides redundancy for management, overlay, and external traffic, in the event
of a pNIC failure on Edge node/TOR and a complete TOR Failure.
The right side of the diagram shows a two-pNIC bare metal edge configured with the same N-VDS
("Overlay and External N-VDS") carrying overlay and external traffic as in the example above,
but also leveraging in-band management.
[Figure: left – bare metal Edge with four physical NICs (2x 1G NICs for management plus 2x 10G NICs as Uplink1/Uplink2); right – bare metal Edge with two physical NICs (2x 10G as Uplink1/Uplink2, with in-band management traffic)]
Figure 7-46: Bare metal Edge configured for Multi-TEP - Single N-VDS for overlay and external traffic
Both topologies use the same transport node profile, as shown in FIGURE 7-47. This
configuration shows a default teaming policy that uses both Uplink1 and Uplink2. This default
policy is used for all the overlay segments/logical switches created on this N-VDS.
Two additional teaming policies, "Vlan300-Policy" and "Vlan400-Policy," have been defined to
override the default teaming policy and send traffic to "Uplink1" and "Uplink2" only,
respectively.
"External VLAN segment 300" is configured to use the named teaming policy "Vlan300-Policy"
that sends traffic from this VLAN only on "Uplink1". "External VLAN segment 400" is configured
to use a named teaming policy "Vlan400-Policy" that sends traffic from this VLAN only on
"Uplink2".
Based on these teaming policies, TOR-Left will receive traffic for VLAN 100 (Mgmt.), VLAN 200
(overlay) and VLAN 300 (Traffic from VLAN segment 300). Similarly, TOR-Right will receive
traffic for VLAN 100 (Mgmt.), VLAN 200 (overlay) and VLAN 400 (Traffic from VLAN segment
400). A sample configuration screenshot is shown below.
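For reference, an external VLAN segment is pinned to a specific uplink by referencing the named teaming policy in its advanced configuration; the sketch below uses placeholder names and should be validated against the API guide.

# "External VLAN segment 300" pinned to Uplink1 through the "Vlan300-Policy" named teaming policy.
curl -k -u admin:'<password>' -X PATCH -H 'Content-Type: application/json' \
  -d '{
        "display_name": "External VLAN segment 300",
        "vlan_ids": [ "300" ],
        "transport_zone_path": "/infra/sites/default/enforcement-points/default/transport-zones/<edge-vlan-tz-id>",
        "advanced_config": { "uplink_teaming_policy_name": "Vlan300-Policy" }
      }' \
  https://nsx-vip.example.com/policy/api/v1/infra/segments/external-vlan-300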
FIGURE 7-48 shows a logical and physical topology where a Tier-0 gateway has four external
interfaces. External interfaces 1 and 2 are provided by bare metal Edge node “EN1”, whereas
External interfaces 3 and 4 are provided by bare metal Edge node “EN2”. Both the Edge nodes
are in the same rack and connect to the TOR switches in that rack. Both the Edge nodes are
configured for Multi-TEP and use named teaming policy to send traffic from VLAN 300 to TOR-
Left and traffic from VLAN 400 to TOR-Right. Tier-0 Gateway establishes BGP peering on all four
external interfaces and provides 4-way ECMP.
[Figure: logical and physical topology – Tier-0 Gateway with External-1 (192.168.240.2/24) and External-2 (192.168.250.2/24) on EN1, and External-3 (192.168.240.3/24) and External-4 (192.168.250.3/24) on EN2; each edge node runs a single "Overlay and External N-VDS" with TEP-IP1/TEP-IP2 and a Mgmt-IP, and establishes BGP peering over Uplink1/Uplink2]
For most environments, the performance provided by a bare metal edge is comparable to that of an
edge node VM when the physical server has only two pNICs, if those interfaces are 10 Gbps
or 25 Gbps. The pNIC bandwidth usually represents the bottleneck, and the VM form factor is
generally recommended for its easier lifecycle management. Situations when a bare metal edge
node should be considered are:
• Requirement for line rate services
• Higher bandwidth pNIC (25 or 40 Gbps)
• Traffic profile is characterized by small packet size (e.g., 250 Bytes)
• Sub-second link failure detection between physical network and the edge node
• Network operation team retains responsibility for the NSX edges and has a preference
for an appliance-based model
• In a multiple Tier-0 deployment model where the top Tier-0 is deployed on bare metal
edges and drives the throughput with higher speed (40 Gbps) pNICs.
• Management interface redundancy is not always required but a good practice. In-band
option is most practical deployment model when a limited number of interfaces is
available.
Figure 7-49: Bare metal Edge with six pNICs - Same N-VDS for Overlay and External traffic
The bare metal configuration with more than two pNICs is the most practical and
recommended design, because configurations with four or more pNICs offer substantially
more bandwidth than the equivalent Edge VM configurations. The same reasons for choosing
bare metal apply as in the two-pNIC configuration discussed above.
A named teaming policy is also configured to force external traffic onto specific edge VM vNICs.
FIGURE 7-50 also shows named teaming policy configuration used for this topology. "External
VLAN segment 300" is configured to use a named teaming policy “Vlan300-Policy” that sends
traffic from this VLAN on “Uplink1” (vNIC2 of Edge VM). "External VLAN segment 400" is
configured to use a named teaming policy “Vlan400-Policy” that sends traffic from this VLAN on
“Uplink2” (vNIC3 of Edge VM). Based on this named teaming policy, external traffic from
“External VLAN Segment 300” will always be sent and received on vNIC2 of the Edge VM.
North-South or external traffic from “External VLAN Segment 400” will always be sent and
received on vNIC3 of the Edge VM.
Overlay or external traffic from Edge VM is received by the VDS DVPGs “Trunk1 PG” and
“Trunk2 PG”. The teaming policy used on the VDS port groups defines how this overlay and
external traffic coming from Edge node VM exits the hypervisor. For instance, “Trunk1 PG” is
configured to use active uplink as “VDS-Uplink1” and standby uplink as “VDS-Uplink2”. “Trunk2
PG” is configured to use active uplink as “VDS-Uplink2” and standby uplink as “VDS-Uplink1”.
This configuration ensures that the traffic sent on “External VLAN Segment 300” (i.e., VLAN 300)
always uses vNIC2 of Edge VM to exit the Edge VM. This traffic then uses “VDS-Uplink1” (based
on “Trunk1 PG” configuration) and is sent to the left TOR switch. Similarly, traffic sent on VLAN
400 uses “VDS-Uplink2” and is sent to the TOR switch on the right.
In case of a failure of an ESXi host pNIC, the active/standby teaming policy on the trunk dvpg
will redirect the traffic to the surviving pNIC. Overlay traffic (VLAN 200 in the diagram) will flow
over the operational ESXi pNIC. This failover event is entirely transparent to the edge node VM
N-VDS, which keeps load balancing overlay traffic over the two TEPs and vNICs. Traffic
originated or destined to both TEPs flows over a single ESXi pNIC. External VLAN traffic will not
failover to the surviving ESXi pNIC because, while the overlay VLAN is defined on both switches,
the two external VLANs are defined on TOR-Left or TOR-Right. A routing protocol
reconvergence will redirect the traffic to the surviving peering VLAN and corresponding pNIC.
The TOR switch configuration must not define both peering VLANs on both switches;
otherwise, the failed neighborship could be re-established over the inter-switch
link, which is not desirable because of the spanning tree dependency and the lack of a univocal
mapping between physical links and routing paths.
This design does not consider the failure of the edge vNICs because, the vNIC being a virtual
component, it does not represent a realistic failure; disabling the edge vNICs will black-hole the
traffic. It is not recommended to enable MAC learning and/or forged transmit on the upstream
VDS dvpg to support this failure scenario.
Starting with NSX release 2.5, single N-VDS deployment mode is recommended for both bare
metal and Edge VM. Key benefits of single N-VDS deployment are:
• Consistent deployment model for both Edge VM and bare metal Edge with one N-VDS
carrying both overlay and external traffic.
• Load balancing of overlay traffic with Multi-TEP configuration.
• Ability to distribute external traffic to specific TORs for distinct point to point routing
adjacencies.
• No change in DVPG configuration when new service interfaces (workload VLAN
segments) are added.
• Deterministic North South traffic pattern.
communicate with any other VLAN or overlay segments. Tier-0 SR or Tier-1 SR is always hosted
on Edge node (bare metal or Edge VM).
In the recommended single N-VDS design for layer 3 peering, no change in the DVPGs
configuration is required when new service interfaces (workload VLAN segments) are added.
FIGURE 7-51 shows a VLAN segment “VLAN Seg-500” that is defined to provide connectivity to
the VLAN workloads. “VLAN Seg-500” is configured with a VLAN tag of 500. Tier-0 gateway has
a service interface “Service Interface-1” configured leveraging this VLAN segment and acts as a
gateway for VLAN workloads connected to this VLAN segment. In this example, if the workload
VM, VM1 needs to communicate with any other workload VM on overlay or VLAN segment, the
traffic will be sent from the compute hypervisor (ESXi-2) to the Edge node (hosted on ESXi-1).
This traffic is tagged with VLAN 500 and hence the DVPG receiving this traffic (“Trunk-1 PG” or
“Trunk-2 PG”) must be configured in VST (Virtual Switch Tagging) mode. Adding more service
interfaces on Tier-0 or Tier-1 is just a matter of making sure that the specific VLAN is allowed on
DVPG (“Trunk-1 PG” or “Trunk-2 PG”).
Service interface on a Tier-1 Gateway can also be connected to overlay segments for
standalone load balancer use cases. This is explained in Load balancer CHAPTER 6. Connecting a
service interface to an overlay segment to act as the default gateway for the VMs on that
segment is supported but not recommended. The service interface is residing on the edge node
SR, meaning that any East-West traffic will traverse the edge node and distributed routing
capabilities are not available. Overlay segments should always be connected to downlink
interfaces (default behavior when creating a segment and connecting it to a Gateway).
In some corner-case scenarios, service interfaces belonging to different Gateways (Tier-0 or
Tier-1) must be connected to the same segment. This configuration is supported, with the
caveat that if the segment is a VLAN segment, the edge nodes hosting the different gateways must
belong to different edge clusters. This limitation does not apply to overlay segments.
Figure 7-52: Single N-VDS per Edge VM - Two Edge Node VM on Host
Figure 7-53: Two Edge VMs per host with 2 pNICs – Routing and bridging
From a configuration perspective, the two-pNIC design is replicated "side-by-side" with dedicated
pNICs and dvpgs. We create a separate management dvpg carrying traffic for the same VLAN
and IP space as the first edge VM, but with a different teaming policy that forces the
management traffic for the second edge VM over P3 and P4. We perform this configuration to
enforce fate sharing across the management, overlay, and VLAN peering networks for each edge
VM. Refer to the edge HA section in chapter 4 for the rationale behind this recommendation.
7.5.2.5 Edge node VM connectivity to N-VDS (or VDS prepared for NSX)
In all the previous scenarios, we connected the edge node VMs to a vSphere VDS not prepared
for NSX, which is the most common scenario when a dedicated vSphere edge cluster is
available. When edge node VMs share the ESXi host with regular workload VMs requiring NSX
services (Overlay and/or DFW), it may be desirable to connect the edge VMs and the workload
VMs to the same virtual switch (NVDS or VDS prepared for NSX).
This type of design is the only available option when only two pNICs are available on the host.
When four or more pNICs are available, it is possible to have multiple virtual switches on the
ESXi host and connect edge VMs and workloads VMs to different ones.
Before NSX 3.1, it was required to place the ESXi TEPs and the Edge TEPs in different subnets
and VLANs when leveraging the same two pNICs for both. FIGURE 7-55 below presents such a
design. We can transport the edge overlay traffic over trunk NSX segments or regular vCenter
managed trunk dvpgs. If the ESXi host virtual switch consists of an NVDS, NSX VLAN segments
are the only choice.
Figure 7-55: Edge VM connected to NSX Prepared VDS – Host and Edge TEP on different IP/VLAN
Starting with NSX version 3.1, edge and host TEPs can reside on the same VLAN because the
host now can process Geneve traffic internal to the host itself. We must transport edge VM
overlay traffic over an NSX Segment in this case. If the edge TEPs are connected to a vCenter
managed dvpg, tunnels between the host and the edge will not come up. This design is
presented in FIGURE 7-56 below:
Figure 7-56: Edge VM connected to NSX Prepared VDS – Host and Edge TEP on the same IP/VLAN - NSX 3.1 or later
When connecting edge VMs to an NSX prepared VDS or NVDS, please keep in mind the
following recommendations:
• Host and Edges should be part of different VLAN Transport Zones. This ensures a clear
boundary between the transport segments on the host and those used for the routing
peering on the edges. The edge segment traffic is transported by the host segments,
configured as a trunk.
• When implementing a single TEP VLAN design like in FIGURE 7-56, the VDS trunk port
groups transporting the edge TEP traffic must be NSX-managed segments and cannot be
created in vCenter (a configuration sketch follows this list).
• Follow the canonical recommendations regarding VLAN trunking and teaming policy
configuration for overlay and VLAN peering traffic described in section: 7.5.2.2.
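As an example of such an NSX-managed trunk, a VLAN segment carrying the full VLAN range (including the shared TEP VLAN) can be created in the host VLAN transport zone; the sketch below uses placeholder identifiers and should be validated against the API guide.

# NSX VLAN trunk segment (placeholder names) backing the edge VM vNICs.
curl -k -u admin:'<password>' -X PATCH -H 'Content-Type: application/json' \
  -d '{
        "display_name": "edge-trunk-a",
        "vlan_ids": [ "0-4094" ],
        "transport_zone_path": "/infra/sites/default/enforcement-points/default/transport-zones/<host-vlan-tz-id>"
      }' \
  https://nsx-vip.example.com/policy/api/v1/infra/segments/edge-trunk-a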
The design with different VLANs/IP subnets per TEP is still valid and can be used with any NSX
version, including 3.1 or later. The single TEP VLAN design is preferred for its simplicity; however,
for most deployments, having separate VLANs for Edge and Host TEPs is recommended due to the
following considerations:
• When Edges and hosts share the same TEP VLAN, they also share the span of that VLAN.
While it is usually desirable to limit the host TEP VLAN to a rack, edge VMs may require
mobility across racks or even sites (e.g., in the VCF stretched cluster design). Separate
VLANs allow to manage the span of host and edge TEP networks individually.
• An edge and the host where the edge is running will never lose TEP connectivity to each other if
they share the same TEP network, regardless of a pNIC failure. This means that the edge node
VM will never incur an all-tunnels-down HA condition, limiting its ability to react to
specific failures. Please refer to the EDGE HA SECTION IN CHAPTER 4 for more information.
(Note: a design that matches FIGURE 7-56 should not incur any issue, as management,
overlay, and VLAN peering networks share the same pNICs.)
FIGURE 7-58 shows a detailed example of services only edge node VM connectivity. Again, you
can notice the absence of VLAN connectivity requirements for the services only edges deployed
on the ESXi host on the right.
Specific implementation considerations apply to the edge node VM form factor:
• VLAN tagging is not required on edge NVDS as the upstream VDS dvpg only carries
overlay traffic and could be configured for a specific VLAN rather than as a trunk.
• Tagging overlay traffic on the edge node VM NVDS and configuring the dvpgs as trunks is
recommended for consistency with the other use cases and possibly adopting them at a
later stage on the same edge.
Specific design considerations apply to the edge node VM form factor:
• In a layer 2 physical fabric (or layer 3 with overlays), edge node VMs dedicated to T1
Gateway services can have a larger mobility span than edges providing peering
functionalities and can be deployed on clusters striped across multiple racks.
• While the ESXi pNICs are usually the limiting performance factor for edge node VMs
running T0 Gateway for layer three peering, the host CPU may represent the bottleneck
for service-intensive edges. In such cases connecting multiple edge VMs to the same
pNIC may be appropriate (see the right ESXi in FIGURE 7-58)
• General SLA for the service itself. We could have a single SLA for all the services (e.g., X for all
services to be available), different SLAs for different services (e.g., X for layer 3 peering, Y for
VPN, Z for TLS decryption), or even a specific one per tenant and service (e.g., tenant 1 requires
A for TLS decryption, tenant 1 requires B for Gateway IPS functionality, tenant 2 requires C for
TLS decryption). SLA requirements affect the edge cluster design in terms of the number of edge
nodes deployed, where they are deployed (same rack, different racks, different rooms), and the
availability requirements of the underlying infrastructure (e.g., the vSphere cluster or the
datastore).
• Max recovery time in case of a network failure. While the general SLA considers the
total downtime over an extended period, usually a year, requirements should outline the
expectation for fast recovery in case of common hardware or network failures, such as the
failure of a pNIC, a host, or a top-of-rack switch. The reason for such requirements is that,
while multiple short interruptions may not violate the SLA for the year, they may represent
a problem for applications with short timeouts or those incapable of reconnecting
automatically. This type of requirement affects, among other things, the edge node
form factor and the routing protocol and BFD timer implementation (a configuration sketch
follows this list). These requirements should always be derived from specific application
needs, and not provided as blanket statements.
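For instance, fast failure detection toward the physical fabric is typically implemented with BFD on the Tier-0 BGP neighbors; the sketch below uses placeholder paths and timer values, to be validated against the API guide and the capabilities of the chosen edge form factor.

# Enable BFD on a Tier-0 BGP neighbor (placeholder IDs; bare metal edges support more
# aggressive timers than edge VMs).
curl -k -u admin:'<password>' -X PATCH -H 'Content-Type: application/json' \
  -d '{
        "neighbor_address": "192.168.240.1",
        "remote_as_num": "65001",
        "bfd": { "enabled": true, "interval": 500, "multiple": 3 }
      }' \
  https://nsx-vip.example.com/policy/api/v1/infra/tier-0s/corp-t0/locale-services/default/bgp/neighbors/tor-left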
Manageability:
• Lifecycle. Different edge services should be able to undergo maintenance at different
times. Requirements in this area may lead to deploy edges in different edge clusters,
different vSphere clusters, and even different NSX Manager domains.
• Scalability. How elastic should the service cluster be? This consideration may not apply
to all the services in the same way. For example, I may need to increase the throughput
for the TLS decryption service based on on-boarding new tenants, but the VPN service
will not be affected by this change. Requirements in this area may lead to dedicate edge
clusters to specific services, and to design the underlying infrastructure with different
characteristics depending on the services (e.g., hosts with different pNIC, CPU resources,
memory)
• RBAC. Who is entitled to configure each service? Is it a self-service offering? This area
has impacts on the integration with directory services and user privileges.
• Object based RBAC. Do I need different users to be able to manage different objects of
the same type? For example, each tenant should be able to manage the Gateway
Firewall rules for their dedicated T0 or T1 Gateway. This area may lead to considerations
around separating the NSX deployment in multiple NSX Managers domains, or the
integration with the most appropriate cloud management platform.
• Monitoring. What are the metrics that I need to collect to ensure I am proactive in
addressing performance and scalability issues? Separating services on different edge
VMs may provide granular reporting around the resource consumption of specific
services.
Performance:
• North/South throughput. The amount of traffic that the NSX edges should support
between the physical and the virtual environments is commonly estimated in bits per
second (bps), e.g., 20Gbits. Such requirements tend to be misleading as the throughput
is highly influenced by the packet size. So, the traffic profile should be taken into
consideration when stating the expected North/South throughput for the system. A
more general and accurate characterization of the capability of the provided design
would be to state North/South throughput in packets per second (PPS). Requirements in
this area affect the edge node form factor, the number of edges deployed in ECMP, the host
resources in terms of pNIC and CPU, and others.
• Latency and jitter. Some applications may have stringent latency and jitter requirements.
Those requirements should be evaluated and may lead to the provisioning of dedicated
resources such as T0 Gateways, pNICs, or hosts to avoid resource contention between
mice and elephant flows.
• New connections per second (CPS). This parameter has an impact on the stateful services
design; it may require spreading services across multiple edge nodes.
• Throughput per service. This is different from the North/South throughput, which is
generally an aggregate of all the services' traffic. Different services may be more or less
resource intensive and may require the allocation of dedicated resources. For example,
NAT and Gateway Firewall are in most cases light on resources and can coexist on the
same edge dedicated to layer 3 peering without a noticeable performance impact.
Other services (e.g., VPN and TLS decryption) may have a noticeable impact if deployed at
scale. Requirements in this area will impact the segmentation of services into different
edge clusters and the choice of the edge form factor for some services.
Recoverability:
• Recovery Time Objective (RTO). This is different than the requirements for high
availability in the sense that we want to specify how much time we have to restore the
services after a major outage. The outage in scope depends on the context of the design
and may refer for example to a rack failure, a datacenter room failure, or an entire site
failure. Requirements in this area may affect the striping of an edge cluster across
different availability zones, the protection mechanisms we want to rely on for the
recovery (e.g., edge native HA vs vSphere HA), and the recovery procedure in place (fully
automatic, vs. manual, vs. scripted)
• Recovery Point Objective (RPO). While RPO usually refers to workload data, in the
context of NSX services it may help us formalize how much configuration it is acceptable
to lose because of a failure. While this requirement may not be stringent in a manually
○ With two Edge nodes hosting a Tier-0 configured in active/active mode, traffic will
be spread evenly. If one Edge node is of lower capacity or performance, half of the
traffic may see reduced performance while the other Edge node has excess
capacity.
○ For two Edge nodes hosting a Tier-0 or Tier-1 configured in active/standby mode,
only one Edge node is processing the entire traffic load. If this Edge node fails, the
second Edge node will become active but may not be able to meet production
requirements, leading to slowness or dropped connections.
• Services available on a T1 Gateway and not on a T0 Gateway. Some services are available
on T1 Gateway only and are not available on T0 Gateway. Designs requiring: L7 Gateway
Firewall, Gateway IDS/IPS/Malware Prevention, Gateway Identity Firewall, URL Filtering,
TLS Inspection and NSX Load Balancer, need to include T1 Gateways.
• Services on T1 Gateways imply hairpinning of the traffic between tenants connected to
different T1 Gateways.
• VPN remote peer redundancy. VPNs can be enabled on T0 or T1 Gateways with exactly
the same functionalities except for the ability to establish BGP peering over an IPSEC
routed VPN. If remote peer redundancy is required VPNs on T0 Gateways are the only
option.
• Because Edge nodes can be part of a single overlay transport zone, if the NSX
deployment has been segmented into multiple overlay transport zones, at a minimum a
corresponding number of edge clusters should be deployed to provide physical-to-virtual
connectivity for each overlay transport zone.
CPU), this mode is perfectly acceptable. Typically, one can restore the balanced placement of the
services during off-peak hours or a planned maintenance window.
The shared mode provides simplicity when allocating services in an automated fashion, as NSX
tracks which Edge node is already provisioned with a service and deprioritizes that Edge node as a
potential target for the next service deployment. However, the Edge nodes share CPU, and thus
bandwidth is shared among services. In addition, if an Edge node fails, all the services on that
Edge node fail together. Shared edge mode, if configured with preemption for the services, leads to only
FIGURE 7-60: DEDICATED SERVICE EDGE NODE Cluster describes dedicated modes per service,
ECMP or stateful services. One can further refine the configuration by deploying a specific
service per Edge node; in other words, each of the services in EN3 and EN4 gets deployed on an
independent Edge node. This is the most flexible model, but not the most cost-effective one, as
each Edge node reserves its own CPU. In this mode of deployment, one can choose preemptive or
non-preemptive mode for each service individually if it is deployed on a dedicated Edge VM per
service. In the above figure, if preemptive mode is configured, all the services in EN3 will
experience secondary convergence. However, if each service is segregated to a dedicated Edge
VM, one can control which services are preemptive or non-preemptive. Thus, it is a design
choice between availability and load-balancing of the edge resources. The dedicated edge node,
either per service or grouped for a set of services, allows deploying a specific Edge VM form
factor: ECMP-focused Edge VMs can run the larger form factor (8 vCPU), providing dedicated CPU
for the high bandwidth needs of the NSX domain, while a smaller Edge VM form factor can be
chosen for services that do not require line-rate bandwidth. Thus, if the multi-tenant services do
not require high bandwidth, one can construct very high-density per-tenant Edge node services
with just 2 vCPU per edge node (e.g., VPN services or a load balancer deployed for DevTest/QA).
The load balancer service with container deployments is one clear example where adequate
planning of host CPU and bandwidth is required; a dedicated edge VM or cluster may be
required, as each container service can deploy a load balancer, quickly exhausting the underlying
resources.
Another use case for running dedicated services nodes is a multi-tier Tier-0 design or a Tier-0
level multi-tenancy model, which is only possible by running multiple instances of dedicated
Edge nodes (Tier-0) for each tenant or service; in that case the Edge VM deployment is the most
economical and flexible option. For a startup design, one should adopt the Edge VM form factor;
later, as bandwidth or services demand grows, one can selectively upgrade Edge node VMs to
the bare metal form factor. For an Edge VM host to be convertible to bare metal, it must be
compatible with the BARE METAL REQUIREMENT. If the design choice is to be insulated from most
future capacity and predicted bandwidth considerations, going with bare metal by default is the
right choice (either for ECMP or stateful services). The decision to go with VM versus bare metal
also hinges on the operational model of the organization: if the network team owns the lifecycle,
wants to remain relatively agnostic to workload design, and adopts a cloud model by providing
generalized capacity, then bare metal is also a right choice.
For deployments that require stateful services, the most common mode of deployment is the
shared Edge node mode (see FIGURE 7-59: SHARED SERVICE EDGE NODE Cluster), in which both
ECMP Tier-0 services as well as stateful services at Tier-1 are enabled inside an Edge node, based
on per-workload requirements. FIGURE 7-62: SHARED SERVICES EDGE NODE CLUSTER GROWTH
Patterns below shows a shared Edge node for services at Tier-1: the red Tier-1 is enabled with a
load balancer, while the black Tier-1 is enabled with NAT. In addition, one can enable multiple
active-standby services per Edge node; in other words, one can place services such that two
services run on separate hosts complementing each other (e.g., in the two-host configuration
below, one can enable the Tier-1 NAT active on host 2 and standby on host 1), while in the
four-host configuration dedicated services are enabled per host. Workloads with dedicated
Tier-1 gateways without services are not shown in the figure; those Tier-1 gateways are fully
distributed, so they all get ECMP service from the Tier-0. For the active-standby services, in this
in-rack deployment mode one must ensure that the active and standby service instances are
deployed on two different hosts. This is straightforward with two edge nodes deployed on two
hosts as shown below, as NSX will deploy them on two different hosts automatically. The growth
pattern is simply adding two more hosts, and so on. Note that there is only one Edge node
instance per host, with the assumption of two 10 Gbps pNICs. Adding an additional Edge node
on the same host may oversubscribe the available bandwidth; one must not forget that the Edge
node not only runs ECMP Tier-0 but also serves the other Tier-1s that are distributed.
Adding an additional Edge node per host on top of the above configuration is possible with
higher bandwidth pNICs (25/40 Gbps) or, even better, a higher number of pNICs. In the case of a
four Edge node deployment on two hosts, it is required to ensure that active and standby
instances do not end up on Edge nodes on the same host. One can prevent this condition by
building a horizontal Failure Domain as shown in FIGURE 7-63: TWO EDGE NODES PER HOST
– SHARED SERVICES CLUSTER GROWTH Pattern below. The failure domains in the figure below
ensure that the active and standby instances of any stateful service land on different hosts.
[Figure content: two hosts in the same rack, each running two Edge nodes (EN1–EN4) with Tier-0 and Tier-1 services; without Failure Domains, active/standby service availability is not guaranteed, so Fault Domains FD 1 and FD 2 are used to separate the active and standby instances across hosts.]
Figure 7-63: Two Edge Nodes per Host – Shared Services Cluster Growth Pattern
An edge cluster design with a dedicated Edge node per service is shown in FIGURE 7-64:
DEDICATED SERVICES PER EDGE NODES GROWTH Pattern below. In the dedicated mode, the Tier-0 runs
only ECMP services and belongs to the first edge cluster, while the Tier-1s run active-standby
services on the second edge cluster. Both configurations are shown below.
Notice that each cluster is striped vertically to make sure each service gets deployed on a
separate host. This is especially needed for active/standby services. For the ECMP services, the
vertical striping is needed when the same host is also used for deploying stateful services; this
avoids over-deployment of Edge nodes on the same host. Otherwise, the arrangement shown in
FIGURE 7-61: ECMP BASE EDGE NODE CLUSTER GROWTH Pattern is a sufficient configuration.
The multi-rack Edge node deployment is the best illustration of the Failure Domain capability.
With a two Edge node deployment, each Edge node must clearly be on a separate hypervisor in
a separate rack.
The case described below is the dedicated Edge node per service. The figure below shows the
growth pattern evolving from two to four Edge nodes in tandem across the racks. In the case of
four hosts, assume two Edge VMs (one for ECMP and the other for services) per host, with two
hosts in two different racks. In that configuration, the ECMP Edge nodes are striped across the
two racks with their own Edge cluster; placement and availability are not an issue since each
node is capable of servicing traffic equally. The Edge nodes where services are enabled must use
failure domains vertically striped, as shown in the figure below. If failure domains are not used,
the cluster configuration will mandate a dedicated Edge cluster for each service, as there is no
guarantee that active-standby services will be instantiated on Edge nodes residing in two
different racks. This mandates a minimum of two edge clusters, where each cluster consists of
Edge node VMs from two racks, providing rack availability.
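As a point of reference, failure domains are a small configuration object in NSX. The sketch below is a minimal, hedged example assuming the NSX Manager API endpoint /api/v1/failure-domains and hypothetical names; the exact association of an edge transport node with a failure domain should be verified against the API guide for the NSX version in use.

# Minimal sketch (assumptions: manager FQDN, admin credentials, names).
NSX="https://nsx-mgr.example.local"

# Create one failure domain per rack.
curl -k -u admin -X POST "$NSX/api/v1/failure-domains" \
  -H 'Content-Type: application/json' \
  -d '{"display_name": "FD-Rack-1"}'

curl -k -u admin -X POST "$NSX/api/v1/failure-domains" \
  -H 'Content-Type: application/json' \
  -d '{"display_name": "FD-Rack-2"}'

# Each edge transport node is then associated with the failure domain of its
# rack; the field carrying the failure domain ID on the transport node object
# is an assumption here, so check the transport-node schema for your release.

Once the edge nodes are mapped to failure domains, NSX places the active and standby instances of an auto-allocated stateful service in different failure domains, which is what provides the rack availability discussed above.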
Finally, the standby edge reallocation capability (only available to Tier-1 gateways) makes it
possible to build multiple availability zones such that a standby edge VM can be instantiated
automatically after a minimum of 10 minutes of failure detection. If the Edge node that fails is
running the active logical router, the original standby logical router becomes the active logical
router and a new standby logical router is created. If the Edge node that fails is running the
standby logical router, the new standby logical router replaces it.
Several other combinations of topologies are possible based on the SLA requirements described
at the beginning of this section. The reader can build the necessary models to meet the business
requirements from the above choices.
If deployment of the NSX Edge cluster across multiple VSAN datastores is not feasible, then the
following configurations can be made to minimize the potential of a cascading failure due to the
VSAN datastore:
1. At a minimum, the vSphere Cluster should be configured with an FTT setting of at least 2
and an FTM of RAID-1.
This dictates that, for each object associated with the NSX Edge node VMs, enough replica
and witness components are available for the object to remain in a healthy state even if
two failures occur. This accommodates both a maintenance event and an outage occurring
without impacting the integrity of the data on the datastore. This configuration requires at
least five (5) hosts in the vSphere Cluster.
2. ToR and cabinet-level failures can be accommodated as well. This can be accomplished in
multiple ways: either by leveraging VSAN's PFTT capability, commonly referred to as Failure
Domains or VSAN Stretched Clusters, with a Witness VM running in a third failure domain,
or by distributing the cluster horizontally across multiple cabinets such that no more hosts
exist in each rack or cabinet than the number of failures that can be tolerated (a maximum
of FTT=3 is supported by VSAN).
Figure 7-68: Single NSX Edge Cluster, Each Rack NSX Failure Domain
For a realistic example of edge nodes deployed on a single vSphere cluster striped across
multiple racks, refer to section USE CASE: IMPLEMENTING A REPEATABLE RACK DESIGN FOR A SCALABLE
PRIVATE CLOUD PLATFORM IN A LAYER 3 FABRIC.
While VSAN Stretched Cluster and other metro-storage cluster technologies provide a very
high level of storage availability, NSX Edge nodes provide application-level availability
through horizontal scaling and various networking technologies. If a dedicated vSphere cluster
is planned to host the Edge node VMs, using two independent clusters in diverse locations, as
opposed to a single vSphere cluster stretched across those locations, should be seriously
considered to simplify the design.
Figure 7-69: Single Architecture for Heterogeneous Compute and Cloud Native Application Framework
There are essential factors to consider when evaluating how to best design these workload
domains, as well as how the capabilities and limitations of each component influence the
arrangement of NSX resources. Designing multi-domain compute requires considering the
following key factors:
● Use Cases
○ IaaS
○ CaaS
○ VDI
○ PaaS
○ Other
● Type of Workloads
○ Enterprise applications, QA, DevOps
○ Regulation and compliance
○ Performance requirements
○ Security
● Management
○ RBAC
○ Inventory of objects and attributes controls
○ Lifecycle management
○ Ecosystem support – applications, storage, and staff knowledge
● Scale and Capacity
○ Compute hypervisor scale
○ NSX Platform Scale limits
○ Network services design
○ Bandwidth requirements, either as a whole compute or per compute domains
● Availability and Agility
○ Cross-domain mobility
○ Cross-domain connectivity
• Upgrades motivated by features or bug fixes required by a specific solution (e.g., TKG-S)
now affect the whole infrastructure (e.g., VDI), requiring more planning, more testing, and
an overall higher burden on the operations team.
• Separation of management responsibility. Different teams may be responsible for
different solutions. The organization may be segmented per solution rather than per
technology (e.g., the VDI team has NSX expertise, as does the team managing the Tanzu
platform; no single NSX team manages NSX across the different internal IT offerings).
• Scalability. The NSX platform scalability limits affect all the solutions. Capacity planning
becomes more complex as the growth of all the different environments must be taken
into consideration.
• The mixing of manual and programmatic consumption (potentially by multiple plug-ins,
integrations, or tools) on the same NSX Manager creates the risk of configuration conflicts
and errors.
• The Tanzu environment requires the latest release, and frequent and timely software
upgrades are expected.
• The VDI environment runs an internally certified version of NSX and does not require
frequent updates. Each new NSX version needs to go through an internal certification cycle.
• The IaaS environment includes configuration objects (Gateways, segments, DFW rules,
Groups) dynamically created by vRealize Automation. The frequent manual configuration
changes required by the security policies in the VDI environment should be carefully
planned to avoid conflicts, and manual configuration errors may lead to application
downtime.
Examples of design implications of the NSX domain separation are:
• When the administrator creates security policies in the VDI environment to provide access
to applications in the IaaS environment, groups based on IP sets must be used. Intelligent
grouping is not available for resources outside of the NSX domain. Policies may have to be
duplicated in the destination NSX domain (the duplication may be avoided by creating a
wider allow rule in the destination environment permitting the entire VDI pool range;
granular policies can then be enforced at the source only). A hedged API sketch of such an
IP-based group follows this list.
• Troubleshooting may be more complex, as native tools such as Traceflow or live traffic
analysis have a single NSX domain scope.
• Micro-segmentation planning may be more complex, as NSX Intelligence has a single NSX
domain scope.
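The following is a minimal sketch of what such an IP-set-based group could look like via the NSX Policy API. The manager FQDN, group ID, and IP range are hypothetical placeholders.

# Minimal sketch (assumed names and addresses): an IP-based group in the
# destination NSX domain referencing the VDI pool range of the source domain.
NSX="https://nsx-mgr-iaas.example.local"

curl -k -u admin -X PATCH \
  "$NSX/policy/api/v1/infra/domains/default/groups/vdi-pool-ips" \
  -H 'Content-Type: application/json' \
  -d '{
        "display_name": "vdi-pool-ips",
        "expression": [
          {
            "resource_type": "IPAddressExpression",
            "ip_addresses": ["10.20.0.0/22"]
          }
        ]
      }'

Security policies in the destination domain can then reference this group as a source, while granular, VM-based grouping remains possible only in the source domain.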
What cannot be provided within a single NSX installation is an effective separation of the
management plane in a proper multi-tenant solution. Relying on the native multi-tenancy
capability of an interoperable cloud management solution such as VMware vRealize
Automation or VMware vCloud Director is an option, but such solutions may not expose all the
required NSX functionalities, or direct access to the NSX GUI/API may be preferred. In such
situations, dedicating an NSX Manager per tenant may be an appropriate solution. Typical
examples are VMware Cloud designs such as VMC on AWS.
Examples of drivers for the separation decision are:
• Management plane multi-tenancy
• Independent lifecycle management per tenant
• Self-service IaaS solution
Examples of design implications of the NSX domain separation in this scenario are:
• NSX Manager sprawl, which is hard to manage at scale without custom automation.
• Infrastructure resource overhead in small environments. NSX Manager VMs must be
deployed for each tenant, even if the environment includes few hosts, at times just one in
DRaaS use cases. A singleton NSX Manager deployment may alleviate the issue.
Figure 7-74: NSX Design with Single Overlay Transport Zone Design
Each use case requires a dedicated NSX edge cluster because edge nodes can be part of only a
single overlay transport zone. The edge nodes can be deployed on the vSphere clusters dedicated
to the use case, as depicted in FIGURE 7-75, or they can be deployed on a shared vSphere cluster,
as depicted in FIGURE 7-76.
Figure 7-75: NSX Design with Overlay Transport Zone per use case - Edges on Compute Clusters
Figure 7-76: NSX Design with Overlay Transport Zone per use case - Edges on Shared Cluster
Figure 7-77: Multiple Overlay Transport Zones on the same compute cluster
this stage. We are still planning the design at a more abstract level. The decision to include a
disaster recovery site or multiple active data centers will depend on non-functional
requirements such as the high-availability SLA and RPO/RTO. The design may cater to a single
application with very stringent non-functional requirements; in such a case, the
conceptual/logical design can be very simple, while the complexity resides in implementing
resilient physical components across multiple sites.
Once the conceptual model is in place, we can map the abstract elements of the conceptual
design to actual NSX components. FIGURE 7-80 presents an example of such a mapping, where
each tenant is mapped to an independent NSX domain, projects to dedicated compute managers
(vCenter Servers) and Tier-0 Gateways, and individual applications, VDI pools, and Tanzu
Namespaces to segments and/or Tier-1 Gateways.
Let’s consider the implications of such a mapping in a multi-environment, multi-use-case design
where the first level of tenancy (tenant/org) is mapped to the environment while the second
(project) is mapped to the use case. An example of such a mapping is presented in FIGURE 7-81.
Figure 7-81: Two Tier Multi tenancy design with mapping to NSX constructs
• A dedicated Tier-0 Gateway for each use case: IaaS and VDI. Performance, scalability, and
high availability requirements can be met independently for the two use cases.
• The IaaS Tier-0 gateway is referenced in the cloud management platform (e.g., vRealize
Automation), which dynamically creates Tier-1 Gateways and segments for each
application deployment. The dynamically created Tier-1 gateways are connected to the
IaaS Tier-0 gateway (a hedged API sketch of such an attachment follows this list).
• VDI pools are deployed on dedicated segments rather than on a shared one. Distributed
firewall security policies can be applied to the VDIs at deployment time, based on the
network where they reside, without human intervention or custom automation
(segments can be part of groups, both via direct membership and based on tags).
• VDI does not require edge network services. VDI segments are directly connected to an
Active/Active Tier-0 gateway.
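As a point of reference, attaching a Tier-1 gateway to a Tier-0 is a single Policy API call. The sketch below is a minimal, hedged example; the manager FQDN, gateway IDs, and advertisement types are placeholders, and a CMP such as vRealize Automation would normally issue an equivalent call on the operator's behalf.

# Minimal sketch (assumed names): a Tier-1 gateway for one application
# deployment, attached to the IaaS Tier-0 gateway.
NSX="https://nsx-mgr.example.local"

curl -k -u admin -X PATCH "$NSX/policy/api/v1/infra/tier-1s/app01-t1" \
  -H 'Content-Type: application/json' \
  -d '{
        "display_name": "app01-t1",
        "tier0_path": "/infra/tier-0s/iaas-t0",
        "route_advertisement_types": ["TIER1_CONNECTED", "TIER1_NAT"]
      }'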
FIGURE 7-83 presents an example where the same use cases are implemented, but the chosen
topology is different. The design has the following properties:
• The two use cases share the same Tier-0 gateway. North/South traffic for both goes
through the same edge nodes. The edge design must cater to the requirements of both
use cases at the same time.
• The shared Tier-0 should be deployed in Active/Active (no stateful services) to allow for
scaling out the North/South bandwidth.
• A Tier-1 Gateway is deployed for each use case, but no stateful services are required on
them. The separation is mostly logical. It has no impact on the data plane as the routing
is fully distributed across the entire deployment.
• Each application or VDI pool is deployed on a dedicated segment
• Centralized networking services are not required. Security is provided via distributed
firewall.
• If load balancer services are needed, they can be deployed in one-arm mode.
FIGURE 7-84 shows an example where the entire NSX Domain is dedicated to a single use case
(IaaS). Within the same NSX domain, two environments are present: Prod and Test. The design
has the following properties:
• The two use cases share the same Tier-0 gateway. North/South traffic for both goes
through the same edge nodes. The edge design must cater to the requirements of both
use cases at the same time.
• The shared Tier-0 is deployed in Active/Active mode (no stateful services) to allow for
scaling out the North/South bandwidth.
• Centralized networking services are not required. Security is provided via distributed
firewall.
• If load balancer services are needed, they can be deployed in one-arm mode.
• Each environment has a dedicated segment and corresponding IP range.
• When an environment grows beyond a single segment limit, segments can be added and
connected to the same Tier-0 gateway.
• Segmentation between environments is provided via DFW, based on the IP ranges
associated with each environment. Such policies are manually created by the
administrator (a hedged API sketch of such a policy follows this list).
• The CMP (e.g., vRA) sees the segments as external networks. This means that those
segments must be pre-created by the NSX admin, referenced in the vRA configuration,
and mapped to the two different projects. vRA will place the dynamically created VMs on
those segments based on the environment. Because of the segment placement, the VMs
will inherit the environment segmentation based on IP ranges implemented via DFW.
• The separation between applications within an environment can be achieved via
dynamically created groups and security policies managed by the CMP.
• Larger, shared, and manually configured segments facilitate the integration with disaster
recovery orchestration tools (e.g., VMware Site Recovery Manager) compared to
segments dynamically created by the CMP. Network mappings between sites can be
configured in advance and do not necessitate frequent updates.
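The following is a minimal, hedged sketch of such an administrator-created DFW policy via the Policy API. The manager FQDN, group paths, category, and rule are placeholder assumptions for illustration only.

# Minimal sketch (assumed names): a DFW policy blocking Test-to-Prod traffic
# using IP-range-based groups created in advance by the administrator.
NSX="https://nsx-mgr.example.local"

curl -k -u admin -X PATCH \
  "$NSX/policy/api/v1/infra/domains/default/security-policies/env-segmentation" \
  -H 'Content-Type: application/json' \
  -d '{
        "display_name": "env-segmentation",
        "category": "Environment",
        "rules": [
          {
            "display_name": "block-test-to-prod",
            "sequence_number": 10,
            "source_groups": ["/infra/domains/default/groups/test-ips"],
            "destination_groups": ["/infra/domains/default/groups/prod-ips"],
            "services": ["ANY"],
            "scope": ["ANY"],
            "action": "DROP"
          }
        ]
      }'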
This model is the most scalable and provides less resource contention for the NSX components.
the management cluster is prepared for NSX, overlay networks can be consumed by the
management appliances that are part of the vRealize suite, but not by vCenter and NSX Manager.
Figure 7-86: vSphere Edge Cluster - Dedicated Hosts to P2V and Services edges
Figure 7-87: vSphere Edge Cluster - Shared Hosts to P2V and Services Edges
vSphere edge clusters demand specific configurations on the leaf switches to which they
connect. Specifically, they require VLANs for edge management, edge overlay, and edge routing
peering. If the vSphere edge cluster is striped across different racks, those VLANs must be
extended as well. NSX edge clusters can be deployed across multiple vSphere clusters, and they
do not require layer 2 connectivity between them. In a layer 3 fabric, it is recommended to keep
the vSphere edge cluster confined to a single rack and deploy NSX Edges in different vSphere
clusters if rack availability is required. See FIGURE 7-88.
Figure 7-88: NSX Edge Cluster deployed across 2 different vSphere clusters in different racks
Figure 7-89: vSphere Edge cluster and vSphere HA for P2V Edges
recover the lost capacity and redeploy the standby SRs before the standby relocation timeout
kicks in.
DRS in fully automated mode provides efficient utilization of the edge cluster resources. NSX
places the T1 Gateways with services across different edge node VMs, but it does not consider
the traffic or the services they provide. As a consequence, some edge nodes can run hotter
than others, and DRS can help balance the utilization across the cluster. As mentioned, Edge node
VMs support vMotion, but it is recommended to set the DRS threshold to a conservative value
to avoid overly frequent vMotion operations. The network interruption caused by vMotion, even
if it is extremely brief and hardly disruptive, will affect many workloads when it involves an edge
VM; some applications may not be affected at all, while others may be. DRS rules, in conjunction
with NSX failure domains, should be in place so that Active/Standby Gateways do not reside on
the same ESXi host during normal operations and maintenance windows (anti-affinity rules).
Figure 7-90: vSphere edge cluster - vSphere HA and DRS for Service Edge VMs
With a single vSphere edge cluster hosting independent NSX Edge clusters dedicated to different
use cases (e.g., p2v and services), it is possible to customize the vSphere HA and DRS settings per
edge VM to achieve a different behavior per use case. When the NSX edge cluster is shared (e.g.,
the same edge node VMs provide p2v and services), the design decisions around vSphere HA
and DRS settings are more complex and should be based on which design properties are the most
important.
When limited compute resources are available, or the footprint of the management and edge
components is limited, collapsing those components in the same vSphere cluster is an option.
When dedicated clusters are not an option, collapsing edge and management functions is
usually the preferred option because of the predictable footprint and growth of the
management components compared to those of the application workloads.
Mixing management and edge components in smaller or simpler deployments is not an issue as
long as the requirements for both workload types are taken into consideration when designing
the hardware resources in the cluster. For example, the four-node management/edge cluster
depicted in FIGURE 7-92 can be adequate to host essential infrastructure management
components and edge nodes dedicated to North/South traffic and minimal services. Having
dedicated pNICs for the edge node VMs hosting the Tier-0 Gateway is always a good idea. Host
traffic (vMotion, storage) should share the pNICs used by the management appliances rather
than those used by the edge VMs. Deploying a management/edge cluster with two-pNIC hosts
is supported, but it will most likely represent a compromise in terms of north/south
performance.
Refer to the dedicated sections on management components and edge node connectivity for
detailed guidelines in terms of virtual switch design and teaming policies. The diagrams below
present a detailed view of such configuration settings in two- and four-pNIC designs.
[Figure: collapsed management and edge host with 2 pNICs (P1/P2) – a single VDS (ESXi-with-Mgmt_Edge_Node_VM-VDS) carrying the storage, vMotion, and management VMkernel port groups plus two active/standby trunk DVPGs (Trunk DVPG-1, Trunk DVPG-2) for the edge VM uplinks.]
[Figure: collapsed management and edge host with 4 pNICs (P1–P4) – VDS-1 carrying the storage, vMotion, and management port groups; VDS-2 carrying a management port group and the two trunk DVPGs for the edge VM uplinks.]
Figure 7-94: Collapsed Management and Edge VM on Separate VDSs with 4 pNICs
Some designs incorporate a collapsed edge/compute cluster design. This model may be
appealing from a resource planning perspective and allows for a simple chargeback model. It
allows the operator to dedicate ESXi resources to a use case or a tenant and use those
resources for both workloads and network services.
A dedicated vSphere edge cluster may mean that multiple use cases or tenants will share the
same compute resources. Sharing the compute hosts dedicated to edge services across multiple
use cases or tenants makes resource planning and measuring consumption more complex.
Splitting the edge services in dedicated vSphere clusters may be unfeasible from a cost
perspective. The co-placement of the edge nodes with the associated workloads may become
appealing to preserve the resource segmentation across the use cases. A collapsed
compute/edge deployment model presents the following challenges:
• Resource contention and planning. Application workload demands may vary over time
and may affect edge services performance even when appropriate capacity planning
was performed as part of the initial implementation.
• North/South edge services benefit from dedicated pNICs. Compute clusters are usually
deployed with a two pNIC model. We should not deploy more than one edge per host to
avoid excessive oversubscription.
• Compute hosts require minimal configuration on the physical fabric: four VLANs local to
each rack (management, vMotion, overlay, and storage), even when the cluster is striped
across racks. Edge nodes require additional VLANs that must be available on each host in
the cluster. A layer 2 fabric is therefore required to stripe the compute cluster across racks
if the edge node VMs are co-located with compute workloads. Edge VMs running T1
Services and no T0 can run without problems on a cluster striped across racks in a layer 2
fabric.
• Edge nodes with T0s should always create routing adjacencies with switches directly
connected to the host where the edge node VMs reside. This may be unfeasible in an
edge/compute cluster spanning multiple racks. An edge VM moving between racks may
lose the BGP peering adjacencies unless the network fabric extends the peering VLANs
between racks (not recommended). A vSphere cluster hosting T0 Edge VMs should be
confined to a single rack. Rack availability can be provided via another vSphere cluster
hosting edge node VMs part of the same NSX edge cluster and T0 Gateway. In case the
cluster must be striped across racks, mobility of the T0 Gateway should be limited to a
single rack even during maintenance operations (Placing the edge node in maintenance
mode and powering it off is an option if adequate compute resources are not available in
the same rack).
• vSphere HA is most likely enabled globally for the cluster to support compute workloads.
If vSphere HA for T0 edge VM is not desired, it must be disabled manually (In case the
recovery of the edge VM on a different host will not lead to gains in high availability or
performances). vSphere HA for T1 only service or T0/T1 mixed edges is a good practice in
most situations.
• vSphere HA does not honor preferential rules (should) but respects mandatory rules. This
behavior should be considered when the placement of edge node VMs is critical (T0 Edge
VM with vSphere cluster striped across racks)
• DRS is most likely enabled for the cluster to support compute workloads. VM to Host
Should rules should be implemented to enforce the desired host to T0 edge VM
placement and limit the T0 edge VMs mobility to host maintenance events.
• Limit over reservation of CPU/memory resources for the compute workloads as it may
lead to DRS moving the T0 edge VMs regardless of the preferential rules.
• DRS rules, in conjunction with NSX edge failure domains, should be in place so that
Active/Standby Gateways do not reside on the same ESXi host during normal operations
and maintenance windows (Anti-affinity rules).
• In a two pNIC host design, the edge node VMs connect to an N-VDS or a VDS prepared for
NSX. Connectivity guidelines presented in “FIGURE 7-55: EDGE VM CONNECTED TO NSX
PREPARED VDS – HOST AND EDGE TEP ON DIFFERENT IP/VLAN” or “FIGURE 7-56: EDGE VM
CONNECTED TO NSX PREPARED VDS – HOST AND EDGE TEP ON THE SAME IP/VLAN - NSX 3.1 OR
LATER” should be followed.
In a fully collapsed cluster design, compute workloads, edge node VMs, and management
components all reside in the same vSphere cluster. The NSX EASY ADOPTION DESIGN
GUIDE presents an example of this deployment model in great detail. Please refer to the DC in A
Box section. A collapsed cluster design presents the challenges of both the compute/edge, and
the management/edge collapsed models, but those concerns are usually mitigated by the
smaller size of the deployment, usually confined to a single rack. Key considerations for a fully
collapsed cluster are:
• In a two pNIC host design, the edge node VMs connect to an N-VDS or a VDS prepared for
NSX. Connectivity guidelines presented in “FIGURE 7-55: EDGE VM CONNECTED TO NSX
PREPARED VDS – HOST AND EDGE TEP ON DIFFERENT IP/VLAN” or “FIGURE 7-56: EDGE VM
CONNECTED TO NSX PREPARED VDS – HOST AND EDGE TEP ON THE SAME IP/VLAN - NSX 3.1 OR
LATER” should be followed.
• NSX Manager should not be placed on a VLAN segment but on a vCenter managed dvpg.
The diagrams below outline virtual switch and teaming policies design guidelines for four pNICs
and two pNICs hosts.
Figure 7-98: Collapsed Edge/Management/Compute cluster - 2 pNICs - separate VLANs for Host and Edge TEPs
Figure 7-99: Collapsed Edge/Management/Compute cluster - 2 pNICs - single VLAN for Host and Edge TEPs
7.6.3.5 Use case: implementing a repeatable rack design for a scalable private cloud platform
in a layer 3 fabric
We have seen that vSphere clusters dedicated to management, edge, and compute workloads
have different connectivity and resource requirements. It is possible to cater to such specific
requirements by customizing the hardware design for each class of workloads (e.g., CPU core
count and pNIC design specific to edges hosts) and dedicating racks to a specific cluster function
(e.g., racks dedicated to ESXi hosts running edges with ToR switches configured with the
appropriate VLANs). While such a design may be the most effective at leveraging the available
resources, it may lack the repeatability and consistency required in large-scale deployments.
Large-scale private clouds require a high degree of automation, from the automatic
deployment and configuration of the ToR switches and the provisioning of the ESXi hosts to the
dynamic allocation of resources to different compute pools (e.g., a vSphere cluster or a VCF
workload domain).
This section presents different options to create a repeatable rack layout for compute and
edge workloads. We assume that management workloads can be centralized and do not need
to follow the same repeatable pattern because of their predictable resource requirements.
Figure 7-101: Resource Block - Dedicated vSphere edge cluster with one host per rack
In FIGURE 7-101, all hosts have the same hardware configuration (e.g., 2x25G pNICs, 32 cores,
512G memory). One host in each rack is dedicated to NSX edge services. The host runs two or
more NSX edge node VMs hosting both a T0 Gateway in ECMP and T1 Gateways with services.
While this configuration may not be optimal, it allows for a repeatable deployment model that
enhances the scalability and the manageability of the overall platform. This model can be
extended to a design where T0 and T1 gateways run on different edge nodes that are part of
different edge clusters. Such a model is not discussed in this section.
In FIGURE 7-101, the edge hosts are grouped in a dedicated vSphere edge cluster. The cluster is
striped across racks. We mentioned already that, in general, it is not a good practice to stripe
vSphere edge clusters across racks, but in this case, the repeatability of the deployment model,
which dictates one edge host per rack, takes precedence. The three hosts that are part of the
edge cluster have access to the same VSAN datastore, but they cannot provide VM mobility
across racks for the edge VMs because each rack has specific management, overlay, and layer 3
peering VLANs. If an edge VM migrates to a different rack, it will lose connectivity immediately.
Design guidelines and implications for the model are:
• We should map edge node VMs to NSX failure domains based on the rack where they are
deployed.
• Edge Node VMs cannot move across racks because the appropriate management,
overlay, and L3 peering networks are not available on every rack.
• If the hosts in the edge vSphere cluster are part of the same VDS and the VLAN IDs are
repeated on each rack for consistency, DRS is not aware that the available networks are
different and may move the edge VMs anyway.
• DRS MUST rules should enforce the correct rack placement for each edge node VM to
avoid any edge VM migration.
• If a single edge ESXi host is present in each rack, as depicted in FIGURE 7-101, the host
cannot be evacuated to be placed in maintenance mode. Edge node VMs must be placed
in maintenance mode and then shut down before the host can be placed in maintenance
mode. When the host exits maintenance mode and the edge VM is powered on and
brought back online, it is required to manually check the status of its services (e.g., BGP
neighbors are up) before proceeding with the maintenance of another ESXi host and the
corresponding edge VMs.
Figure 7-102: Resource Block - Dedicated vSphere edge cluster with two hosts per rack
The design in FIGURE 7-102 has the following guidelines and implications:
• Edge hosts can be evacuated by vMotion to a different host in the same rack during
maintenance. Edge nodes are not shut down.
• Edge Node VMs cannot move across racks because the appropriate management,
overlay, and L3 peering networks are not available on every rack.
• If the hosts in the edge vSphere cluster are part of the same VDS and the VLAN IDs are
repeated on each rack for consistency, DRS is not aware that the available networks are
different and may move the edge VMs anyway.
• DRS MUST rules should enforce the correct rack placement for each edge node VM to
avoid any edge VM migration across racks.
• DRS SHOULD rules will dictate the preferred host placement for each edge VM. Because
multiple hosts are now present in the same rack, DRS in fully automated mode could
rebalance the edge VM placement during normal operations; the SHOULD rule will
prevent undesired migrations. A SHOULD rule will be violated when placing the ESXi host
in maintenance mode: the edge VM will be moved to another host in the same rack. It
won’t be moved to a different rack because MUST rules cannot be violated.
• If DRS rules management is considered too much of an overhead, DRS can be disabled or
set to partially automated. In this case, edge VMs must be migrated manually during
maintenance. This choice carries the risk of human error (e.g., an edge node VM migrated
to a different rack by mistake).
• Even with six hosts in the edge cluster spread across three racks, the VSAN storage policy
is limited to PFTT=1 when each rack is mapped to a VSAN failure domain.
• Two edge hosts per rack may represent a waste of computing resources, especially in
deployments with five racks or more.
Design guidelines and implications for the design in FIGURE 7-103 are:
• A complex set of DRS rules is required.
• One VM to host MUST rule per rack is required to avoid edge VM mobility across racks.
• VM to Host SHOULD rules must be in place to avoid edge VM movements across hosts in
the same rack during normal operations.
• A DRS rule (SHOULD or MUST, depending on the design requirements) to avoid or discourage
the placement of compute workloads on ESXi hosts dedicated to edge node VMs. This
DRS rule has the most impactful implications in terms of day-2 operations because any
new workload deployed on the cluster needs to be added to the corresponding VM
group to avoid resource contention with the edge node VMs.
Because of this underlying data center framework, this chapter primarily focuses on
performance in terms of throughput for TCP-based workloads. There are some niche
workloads, such as NFV, where raw packet processing performance is the primary requirement,
and the enhanced version of N-VDS called N-VDS (E) was designed to address these
requirements. Check the last part of this section for more details on N-VDS (E).
With its option field length specified for each packet within the Geneve header, Geneve allows
packing the header with arbitrary information in each packet. This flexibility opens the door for
new use cases, as additional information may be embedded into the packet to help track the
packet's path or for in-depth packet flow analysis.
The above FIGURE 8-2: GENEVE Header shows the location of the Length and Options fields within
the Geneve header, along with the location of the TEP source and destination IPs.
For further insight into this topic, please check out the following blog post:
https://octo.vmware.com/geneve-vxlan-network-virtualization-encapsulations/
Performance Considerations
In this section, we take a look at the factors that influence performance. The performance of a
workload in a virtualized environment depends on many factors within that environment: the
hardware used, drivers, features, and so on. FIGURE 8-3 shows the different areas that may have
an impact on performance.
The following FIGURE 8-5: SOFTWARE/CPU BASED GENEVE OFFLOAD - TSO shows the process where
the hypervisor divides the larger TSO segments into MTU-sized packets.
Look for the tag “VMNET_CAP_Geneve_OFFLOAD”, highlighted in red above. This tag indicates
that Geneve Offload is activated on NIC vmnic3. If the tag is missing, Geneve Offload is not
enabled because either the NIC or its driver does not support it.
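As a hedged illustration only (the exact vsish node and output format may differ between ESXi releases), a capability check of this kind can be run from the ESXi shell roughly as follows; vmnic3 is a placeholder.

# Hypothetical check from the ESXi shell: list the pNIC properties and
# search for the Geneve offload capability flag (node path assumed).
vsish -e get /net/pNics/vmnic3/properties | grep -i geneve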
[Host-1] # vsish
/> get /net/pNics/vmnic0/rxqueues/info
rx queues info {
# queues supported:5
# filters supported:126
# active filters:0
Rx Queue features:features: 0x1a0 -> Dynamic RSS Dynamic Preemptible
}
/>
CLI 2 Check RSS
To overcome this limitation, the latest NICs (SEE COMPATIBILITY GUIDE) support an advanced
feature, known as Rx Filters, which looks at the inner packet headers to hash flows to
different queues on the receive side. In the following FIGURE 8-8: RX FILTERS: FIELDS USED FOR
Hashing, the fields used by Rx Filters are circled in red.
Simply put, Rx Filters look at the inner packet headers for queuing decisions. As driven by NSX,
the queuing decision itself is based on flows and bandwidth utilization. Hence, Rx Filters
provide optimal queuing compared to RSS, which is akin to a hardware-based brute force
method.
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io
The following section shows the steps to check whether a NIC, an Intel E810 in this case, supports
one of the key features, Geneve offload.
1. Access the online tool:
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io
2. Specify the
1. Version of ESX
2. Vendor of the NIC card
3. Model if available
4. Select Network as the IO Device Type
5. Select Geneve Offload and Geneve Rx Filters (more on that in the upcoming
section) in the Features box
6. Select Native
7. Click “Update and View Results”
3. From the results, click on the ESX version for the concerned card. In this example, ESXi
version 7.0 for Intel E810-C with QSFP ports:
4. Click on the [+] symbol to expand and check the features supported.
5. Make sure the concerned driver actually lists “Geneve-Offload and Geneve-Rx
Filters” as supported features.
Follow the above procedure to ensure Geneve Offload and Geneve Rx Filters are available on
any NIC card you are planning to deploy for use with NSX. As mentioned earlier, not having
Geneve offload in hardware will impact performance, with higher CPU cycles spent performing
the Geneve segmentation in software.
Figure 8-15: Throughput with AMD® EPYC™ 7F72 – MTU 1500 Bytes
8.3.1.7 Oversubscription
Oversubscription is a key consideration from a performance perspective. It should be taken
into account both in the host hardware design and in the underlying physical fabric.
This is especially true in cases where the traffic flows go beyond the top-of-rack layer to the
next rack or go north out of the NSX domain through an NSX Edge. In these cases, the
oversubscription ratio should be considered to ensure throughput is not throttled.
Number of pNICs
Adding multiple NICs to an ESXi host helps scale up the network performance of that host.
Multiple NICs help overcome the limitations of a single NIC, both from the port speed
perspective and from the number of queues available across all the pNICs. Considering most
NICs have roughly 4 queues available, a single 100G pNIC may have only 4 available queues,
whereas 4 x 25G pNICs could have 16 queues, 4 per pNIC. Thus, multiple pNICs help overcome
NIC hardware queue related limitations. The total number of available queues is an important
factor to consider when deploying workloads that are primarily focused on packet processing,
such as the NSX VM Edge. In the case of normal data center workloads that can leverage
offloads, multiple pNICs help overcome the pNIC hardware speed limitation. As noted in the RSS
and Rx Filters discussion, each queue is serviced by a core or a thread. While these cores are not
dedicated to the queues, meaning they are available for other activities based on need, at peak
network utilization consider having enough cores to service all the used queues.
The following example shows the benefit of having two pNICs compared to a single pNIC, using a
single TEP vs. dual TEP. Dual TEP helps achieve twice the throughput of a single TEP. The
following image compares throughput with a single TEP vs. dual TEP using Intel® XL710 NICs on
servers running an Intel® Xeon® Gold 6252 CPU @ 2.10GHz.
Figure 8-16: Throughput with Single TEP vs Dual TEP with Different MTUs
Note: Intel® XL710s are PCIe Gen 3 x8 lane NICs. On x8 lane NICs, the max throughput is limited
to 64Gbps. To achieve the above near-line rate of 80Gbps, 2 x Intel® XL710s must be used.
vNIC Queues
Once the lower level details are tuned for optimal performance, the next consideration moves
closer to the workload. At this point the question is whether the workload is able to consume
packets at the rate which ESXi is tuned to send. Note that this is generally not a concern when
a single ESXi host is hosting many VMs; optimally, the VM count should be at least twice the
number of total queues provided by the pNICs. That is, if the total number of queues across all
pNICs is 4, that ESXi host should have at least 8 VMs. In this typical data center use case, while
each VM may not be able to consume at the ESXi host's packet processing rate, many VMs
together may be able to reach that balance.
Queuing at the vNIC only becomes relevant for instances where the ESXi host may be running 1
or 2 VMs and the expectation for the workload is pure packet processing, as in the case of the
Edge VM. In fact, enabling multiple queues is the recommended and default option for Edge
VMs. Similar to the RSS and queuing configuration at the pNIC level, the vNIC also allows
configuration of multiple queues instead of the default single queue. The Edge VM section has
more details on how to set this up.
While vNIC queues can be set for any VM, the workload running on the VM should be able to
leverage the multiple queues in order to see a benefit. The NSX VM Edge is designed to take
advantage of multiple queues, and this capability should be leveraged for optimal performance.
vCPU Count
As discussed in the pNIC queues and vNIC queues topics, each queue needs a core/thread to
service it. However, these cores are not dedicated to just servicing the queues but are engaged
on demand when needed. That is, they may be used for workloads when there is no network
traffic. Hence, a system should not only be tuned for max performance via multiple queues,
but care should be taken that the system actually has access to the required number of cores to
service those queues.
While pNIC queues depend on the system cores (that is, physical cores) for packet processing,
vNIC queues depend on the vCPU cores. This should be considered when allocating vCPUs for
the workload.
this is only true for pure packet processing focused workloads such as NSX Edge VM. For most
common data center TCP workloads, leveraging offloads such as Geneve offload, this is not a
concern.
Why should you care about MTU values? MTU is a key factor in driving high throughput, and
this is true for any NIC which supports 9K MTU, also known as jumbo frames. The following
graph, FIGURE 8-18: MTU AND Throughput, shows the throughput achieved with MTU set to 1500
and 8800:
Our recommendation for optimal throughput is to set the underlying fabric and ESX host’s
pNICs to 9000 and the VM vNIC MTU to 8800.
Notes for FIGURE 8-18: MTU AND Throughput:
• The above graph represents a single pair of VMs running iPerf with 4 sessions.
• For both VM MTU cases of 1500 and 8800, the MTU on the host was 9000 with
demonstrated performance improvements.
For the VM, use the commands specific to the guest operating system to check the MTU. For
example, “ifconfig” is one of the commands in Linux to check the MTU.
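On the ESXi host side, a hedged sketch of the equivalent checks is shown below; the interface names, netstack name, and packet size are assumptions to validate for the specific environment (the -s value must leave room for the ICMP/IP and Geneve headers within the configured MTU).

# Hypothetical MTU verification from the ESXi shell.
# List pNICs and VMkernel interfaces with their configured MTU.
esxcfg-nics -l
esxcfg-vmknic -l

# Test jumbo frames end-to-end across the overlay TEP netstack with
# "do not fragment" set (remote TEP IP and payload size are placeholders).
vmkping ++netstack=vxlan -d -s 8800 192.168.100.12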
[Figure content: throughput and packet-rate results plotted across packet sizes of 78, 128, 256, 512, 1024, 1280, and 1518 bytes.]
Figure 8-19: Bare Metal Edge Performance Test (RFC2544) with IXIA
Note: For the above test, the overlay lines in blue are calculated by adding throughput
reported by IXIA and the Geneve Overlay header size.
SSL Offload
VMware NSX bare metal Edges also support SSL offload. This configuration helps reduce the
CPU cycles spent on SSL processing and has a direct impact on the throughput achieved.
In the following image, Intel® QAT 8960s are used to showcase the throughput achieved with
SSL offload.
=====================================================================
FIGURE 8-21 Note: This test was run with LRO enabled, which is supported in software starting
with ESX version 6.5 on the latest NICs that support the Geneve Rx Filter. Thus, along with Rx
Filters, LRO contributes to the higher throughput depicted here.
=====================================================================
RSS at pNIC
To achieve the best throughput performance, use an RSS-enabled NIC on the host running the
Edge VM, and ensure an appropriate driver which supports RSS is also being used. Use the
VMWARE COMPATIBILITY GUIDE FOR I/O to confirm driver support.
RSS on VM Edge
For best results, enable RSS for Edge VMs. Following is the process to enable RSS on Edge VMs:
1. Shutdown the Edge VM
2. Find the “.vmx” associated with the Edge VM (https://kb.vmware.com/s/article/1714)
3. Change the following two parameters, for the Ethernet devices in use, inside the .vmx
file
a. ethernet3.ctxPerDev = "3"
b. ethernet3.pnicFeatures = "4"
4. Start the Edge VM
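For illustration, the resulting change could look like the hedged sketch below; the datastore path and the ethernetN index are assumptions and depend on which vNICs the Edge actually uses for its datapath.

# Hypothetical example: with the Edge VM shut down (step 1), append the two
# RSS parameters to its .vmx file (path and ethernet index are placeholders).
VMX="/vmfs/volumes/datastore1/edge-01/edge-01.vmx"
cat >> "$VMX" <<'EOF'
ethernet3.ctxPerDev = "3"
ethernet3.pnicFeatures = "4"
EOF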
Alternatively, use the vSphere Client to make the changes:
1. Right click on the appropriate Edge VM and click “Edit Settings”:
3. Add the two configuration parameters by clicking on “Add Configuration” for each item
to add:
The following graph, FIGURE 8-25: RSS FOR VM Edge, shows the comparison of throughput
between a NIC which supports RSS and a NIC which does not. Note: In both cases, even where
the pNIC doesn’t support RSS, RSS was enabled on the VM Edge:
With an RSS-enabled NIC, a single Edge VM may be tuned to drive over 20Gbps throughput. As
the above graph shows, RSS may not be required for 10Gbps NICs as they can achieve close to
~15 Gbps throughput even without enabling RSS.
configuration, considering most modern systems have many cores to spare, the following image
illustrates that the entire host may not be utilized optimally, depending on the number of
available cores.
[Figure: two Edge VMs, each serviced by 12 threads, on dedicated pNIC 1 and pNIC 2.]
In the above image, if the Edge host has a total of 20 HT-enabled cores (40 threads), then a pair
of Edge VMs may be optimal. However, if the system has more cores to spare, the recommended
way to leverage those cores is to add more pNICs dedicated to an additional pair of Edge VMs.
In the following image, 4 Edge VMs are spread across 4 pNICs on a host that has 40 HT-enabled
cores (80 threads).
[Figure: four Edge VMs spread across four pNICs.]
Note: For workloads that are heavy on packet processing, it is better to dedicate a pair
of pNICs to a single edge VM.
In summary, increasing the VM Edge count along with increasing the dedicated pNIC count
could help leverage all the cores on a dedicated Edge host.
Features that matter and benefits, by datapath option:

Compute with standard N-VDS/VDS
• Features that matter: Geneve-Offload, to save on CPU cycles; Geneve-RxFilters, to increase throughput by using more cores and software-based LRO; RSS (if Geneve-RxFilters is not available), to increase throughput by using more cores
• Benefits: High throughput for typical TCP-based DC workloads

N-VDS Enhanced Data Path
• Features that matter: DPDK-like capabilities and memory-related enhancements to help maximize packet processing speed
• Benefits: Maximum PPS for NFV-style workloads

Edge VM
• Features that matter: RSS, to leverage multiple cores; VM tuning plus a NIC with RSS support (add/edit two parameters in the Edge VM’s vmx file and restart)
• Benefits: Maximize throughput for typical TCP-based workloads with the Edge VM

Bare Metal Edge
• Features that matter: DPDK poll mode driver; QATs, for high encrypt/decrypt performance with SSL offload
• Benefits: Maximum PPS; maximum throughput even for small packet sizes; maximum encrypt/decrypt performance with SSL offload; low latency; maximum scale; fast failover
Results
The following section takes a look at the results achievable under various scenarios with
hardware designed for performance. First, here are the test bed specs and methodology:
Compute
• CPU: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
• RAM: 192 GB
• Hyper Threading: Enabled
• MTU: 1700
• NIC: XL710
• NIC Driver: i40e - 1.3.1-18vmw.670.0.0.8169922
• ESXi 6.7

Edge Bare Metal
• CPU: Intel® Xeon® E5-2637 v4 3.5GHz
• RAM: 256 GB
• Hyper Threading: Enabled
• MTU: 1700
• NIC: XL710
• NIC Driver: In-Box

Virtual Machine
• vCPU: 2
• RAM: 2 GB
• Network: VMXNET3
• MTU: 1500

Test Tools
• iPerf 2 (2.0.5) with 4 – 12 VM pairs
• 4 threads per VM pair
• 30 seconds per test
• Average over three iterations

Table 8-3 Specific Configuration of Performance Results
NSX Components
• Segments
• T1 Gateways
• T0 Gateways
• Distributed Firewall: Enabled with default rules
• NAT: 12 rules – one per VM
• Bridging
a. Six Bridge Backed Segments
b. Six Bridge Backed Segments + Routing
The following graph shows that in every scenario above, NSX throughput performance stays
consistently close to line rate on an Intel® XL710 40Gbps NIC.
context switching, unavoidable in the traditional interrupt mode of packet processing, resulting
in higher packet processing performance.
Buffer Management
Buffer management is optimized to represent the packets being processed in a simpler fashion
with a low footprint, assisting with faster memory allocation and processing. Buffer allocation is
also Non-Uniform Memory Access (NUMA) aware. NUMA awareness reduces traffic flows
between the NUMA nodes, thus improving overall throughput.
Instead of requiring regular packet handlers for packets, Enhanced Datapath uses mbuf, a
library to allocate and free buffers, resulting in packet-related info with low overhead. As
traditional packet handlers have heavy overhead for initialization, mbuf simplifies packet
descriptors, decreasing the CPU overhead for packet initialization. To further support
mbuf-based packets, VMXNET3 has also been enhanced. In addition to the above DPDK
enhancements, the ESX TCP stack has also been optimized with features such as Flow Cache.
Flow Cache
Flow Cache is an optimization enhancement which helps reduce CPU cycles spent on known
flows. With the start of a new flow, Flow Cache tables are immediately populated. This
procedure enables follow-up decisions for the rest of packets within a flow to be skipped if the
flow already exists in the flow table. Flow Cache uses two mechanisms to figure out fast path
decisions for packets in a flow:
If the packets from the same flow arrive consecutively, the fast path decision for that packet is
stored in memory and applied directly for the rest of the packets in that cluster of packets.
If packets are from different flows, the decision per flow is saved to a hash table and used to
decide the next hop for packets in each of the flows. Flow Cache helps reduce CPU cycles by as
much as 75%, a substantial improvement.
Figure 8-28: VMware Compatibility Guide for I/O - Selection Step for N-VDS Enhanced Data Path
=====================================================================
Note: N-VDS Enhanced Data Path cannot share the pNIC with N-VDS - they both need a
dedicated pNIC.
=====================================================================
Benchmarking Tools
Compute
On the compute side, our recommendation for testing the software components is to use a
benchmarking tool close to the application layer. Application layer benchmarking tools will
help take advantage of many features and show the true performance characteristics of the
system. While application benchmarking tools are ideal, they may not be very easy to set up
and run. In such cases, iPerf is a great tool to quickly set up and check throughput. Netperf is
another tool to help check both throughput and latency.
Here is a GitHub resource for an example script to run iPerf on multiple VMs simultaneously and
summarize results: https://github.com/vmware-samples/nsx-performance-testing-scripts
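For reference, a single VM-pair run matching the methodology used earlier in this chapter (4 threads, 30 seconds) can be reproduced with plain iPerf 2; the IP address below is a placeholder.

# On the receiver VM: start an iPerf 2 server.
iperf -s

# On the sender VM: 4 parallel streams for 30 seconds toward the receiver.
iperf -c 192.168.10.20 -P 4 -t 30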
Edges
VM Edge
As VM Edges are designed for typical DC workloads, application layer tools are best for testing
VM Edge performance.
Figure 8-29: Example Topology to Use Geneve Overlay with Hardware IXIA or Spirent
Conclusion
To drive enhanced performance, NSX uses a number of features supported in hardware.
On the compute side, these are:
1. Geneve Offload for CPU cycle reduction and marginal performance benefits
2. Geneve Rx Filters, an intelligent queuing mechanism to multiply throughput
3. RSS, an older hardware-based queuing mechanism – an alternative if Rx Filters are missing
4. Jumbo MTU, an age-old trick to enable high throughput on NICs lacking the above features
5. NIC port speed
6. Number of NICs – single vs. dual TEP
7. PCIe Gen 3 with x16 lane NICs, or multiple PCIe Gen 3 x8 NICs
8. PCIe Gen 4
For compute workloads focused on high packet processing rate for primarily small packet sizes,
Enhanced Data Path Enabled NICs provide a performance boost.
For the bare metal Edges, leveraging SSL offload hardware such as Intel® QAT 8960s and
deploying supported hardware from the VMware NSX install guide will result in performance
gains.