Overlay Virtual Networking and SDDC
The information is provided on an “as is” basis. The authors, and ipSpace.net shall have neither
liability nor responsibility to any person or entity with respect to any loss or damages arising from
the information contained in this book.
To the reprobation of those invested in the past, overlay virtual networks will play a major role in
this new future of networking. Overlays allow software to construct a persistent and feature rich
end-to-end networking service from any location, on any device, for any existing or new application.
Like it or not, overlay virtual networking is here to stay. As a networking professional in this new
era of mobile and cloud computing you will be asked to plan, design, troubleshoot, and operate
networks that implement a variety of overlay based networking architectures. In this book you will
find the fundamental technical knowledge to equip your career in the era of overlays, from the
world’s best networking teacher and practitioner, Ivan Pepelnjak.
Brad Hedlund
Office of the CTO at VMware NSBU, and Blogger
http://BradHedlund.com
It took years to debunk some of these misconceptions and prove that the overlay virtual networks
make architectural sense (and even today you can see the raging debates between proponents of
hardware-based network virtualization products and overlay virtual networking products). In these
years I wrote over fifty blog posts explaining the architectural details of overlay virtual networks,
design guidelines, and product details.
This book contains a collection of the most relevant blog posts describing overlay virtual networking
concepts, benefits and drawbacks, architectures, technical details and individual products. I cleaned
up the blog posts and corrected obvious errors and omissions, but also tried to leave most of the
content intact. The commentaries between the individual blog posts will help you understand the
timeline or the context in which a particular blog post was written.
As always, please do feel free to send me any questions you might have – the best way to reach me
is to use the contact form on my web site (www.ipSpace.net).
Happy reading!
Ivan Pepelnjak
August 2014
IN THIS CHAPTER:
The movement that started as a simple hack to bypass the limitations of layer-2 switching (aka
bridging) quickly gained momentum – just a few years later all major virtualization platforms
(vSphere, Hyper-V, KVM, Xen) support overlay virtual networks, and most new products targeting
large-scale environments use this technology.
The architectural benefits of overlay virtual networking are easy to validate: Amazon VPC and
Microsoft Azure are using MAC-over-IP encapsulation to build public clouds spanning hundreds of
thousands of physical servers and running millions of virtual machines.
This chapter focuses on the fundamental principles of overlay virtual networking and its benefits as
compared to more traditional (usually VLAN-based) approaches. The subsequent chapters delve into
the technical details.
When you virtualize the compute capacities, you’re virtualizing RAM (well-known problem for at least
40 years), CPU (same thing) and I/O ports (slightly trickier, but doable at least since Intel rolled out
80286 processors). All of these are isolated resources limited to a single physical server. There’s
zero interaction or tight coupling with other physical servers, there’s no shared state, so it’s a
perfect scale-out architecture – the only limiting factor is the management/orchestration system
(vCenter, System Center …).
So-called storage virtualization is already a fake (in most cases) – hypervisor vendors are not
virtualizing storage, they’re usually using a shared file system on LUNs someone already created for
them (architectures with local disk storage use some variant of a global file system with automatic
replication). I have no problem with that approach, but when someone boasts how easy it is to
create a file on a file system as compared to creating a VLAN (= LUN), I get mightily upset. (Side
note: why do we have to use VLANs? Because the hypervisor vendors had no better idea).
In the virtual networking case, there was extremely tight coupling between virtual switches and physical switches, and there will always be tight coupling between all the hypervisors running VMs
belonging to the same subnet (after all, that’s what networking is all about), be it layer-2 subnet
(VLAN/VXLAN/…) or layer-3 routing domain (Hyper-V).
Because of the tight coupling, the virtual networking is inherently harder to scale than the virtual
compute or storage. Of course, the hypervisor vendors took the easiest possible route, used
simplistic VLAN-based layer-2 switches in the hypervisors and pushed all the complexity to the
network edge/core, while at the same time complaining how rigid the network is compared to their
software switches. Of course it’s easy to scale out totally stupid edge layer-2 switches with no
control plane (that have zero coupling with anything else but the first physical switch) if someone
else does all the hard work.
Summary: Every time someone tells you how network virtualization will get as easy as compute or
storage virtualization, be wary. They probably don’t know what they’re talking about.
On the other hand, you probably remember the days when a SCSI LUN actually referred to a
physical disk connected to a computer, not an extensible virtual entity created through point-and-
click exercise on a storage array.
You might wonder what this ancient history has to do with virtual networking. Don’t worry, we’re getting there in a second ;)
When VMware started creating their first attempt at server virtualization software, they had readily
available storage abstractions (file system) and CPU abstraction (including MS-DOS support under
Windows, but the ideas were going all the way back to VM operating system on IBM mainframes).
Creating virtual storage and CPU environments was thus a no-brainer, as all the hard problems were
already solved. Most server virtualization solutions use the file system recursively (virtual disk = file on a file system).
The “only” problem with using VLANs is that they aren’t the right abstraction. Instead of being like
files on a file system, VLANs are more like LUNs on storage arrays – someone has to provision them.
You could probably imagine how successful server virtualization would have been if you had to ask storage administrators for a new LUN every time you needed a virtual disk for a new VM.
So every time I see how the “Software-Defined Data Center [...] provides unprecedented
automation, flexibility, and efficiency to transform the way you deliver IT” I can’t help but read “it
took us more than a decade to figure out the right abstraction.” Virtual networking is nothing else
but another application riding on top of IP (storage and voice people got there years before).
A common theme in your talks is that L2 does not scale. Do you mean that Transparent
(Learning) Bridging does not scale due to its flooding? Or is there something else that
does not scale?
As is oft the case, I’m not precise enough in my statements, so let’s fix that first:
There are numerous layer-2 protocols, but when I talk about layer-2 (L2) scalability in data center
context, I always talk about Ethernet bridging (also known under its marketing name switching),
more precisely, transparent bridging that uses flooding of broadcast, unknown unicast, and multicast
frames (I love the BUM acronym) to compensate for the lack of host-to-switch and routing (MAC reachability distribution) protocols.
Dismal control plane protocol (Spanning Tree Protocol in its myriad incarnations), combined with
broken implementations of STP kludges. Forward-before-you-think behavior of Cisco’s PortFast and
lack of CPU protection on some of the switches immediately come to mind.
TRILL (or a proprietary TRILL-like implementation like FabricPath) would solve most of the STP-
related issues once implemented properly (ignoring STP does not count as properly scalable
implementation in my personal opinion). However, we still have limited operational experience and
some vendors implementing TRILL might still face a steep learning curve before all the loop
detection/prevention and STP integration features work as expected.
Flooding of BUM frames is an inherent part of transparent bridging and cannot be disabled if you
want to retain its existing properties that are relied upon by broken software implementations.
Every broadcast frame flooded throughout a L2 domain must be processed by every host
participating in that domain (where L2 domain means a transparently bridged Ethernet VLAN or
equivalent). Ethernet NICs do perform some sort of multicast filtering, but it’s usually hash-based
and not ideal (for more information, read multicast-related blog posts written by Chris Marget).
Finally, while Ethernet NICs usually ignore flooded unicast frames (those frames still eat the
bandwidth on every single link in the L2 domain, including host-to-switch links), servers running
hypervisor software are not that fortunate. The hypervisor requirements (number of unicast MAC
addresses within a single physical host) typically exceed the NIC capabilities, forcing hypervisors to
put physical NICs in promiscuous mode. Every hypervisor host thus has to receive, process, and oft
ignore every flooded frame. Some of those frames have to be propagated to one or more VMs
running in that hypervisor and further processed by them (assuming the frame belongs to the
proper VLAN).
You might be able to make bridging scale better if you implemented a fully IP-aware L2 solution. Such
a solution would have to include ARP proxy (or central ARP servers), IGMP snooping and a total ban
on other BUM traffic. TRILL as initially envisioned by Radia Perlman was moving in that direction and
got thoroughly crippled and force-fit into the ECMP bridging rathole by the IETF working group.
Lack of addressing hierarchy is the final stumbling block. Modern data center switches (most of
them using the same hardware) support up to 100K MAC addresses, so other problems will probably
kill you way before you reach this milestone.
Finally, every L2 domain (VLAN) is a single failure domain (primarily due to BUM flooding). There are
numerous knobs you can try to tweak (storm control, for example), but you cannot change two
basic facts:
A software glitch in a switch that causes a forwarding (and thus flooding) loop involving core
links will inevitably cause a network-wide meltdown (due to lack of TTL field in L2 headers);
A software glitch (or virus/malware/you-name-it), or uncontrolled flooding started by any host or
VM attached to a VLAN will impact all other hosts (or VMs) attached to the same VLAN, as well
as all core links. A bug resulting in broadcasts will also impact the CPU of all layer-3 (IP)
switches with IP addresses configured in that VLAN.
You can use storm control to reduce the impact of an individual VM, but even the market leader
might have a problem or two with this feature.
It was important in those days to be able to connect an ESX host to the network with minimum
disruption (even if you had sub-zero networking skills). The decision to avoid STP and implement
split-horizon switching made perfect sense; running STP in an ESX host would get you banned in a
microsecond. vSwitch is also robust enough that you can connect it to a network that was
“designed” by someone who got all his networking skillz through Linksys Web UI … and it would still
work.
vSwitch got some scalability enhancements (distributed vSwitch), but only on the management
plane; apart from a few features that are enabled in vDS and not in vSwitch, the two products use
the same control/data plane. There’s some basic QoS (per-VM policing and 802.1p marking) and
some support for network management and troubleshooting (Netflow, SPAN, remote SPAN). Still no
STP nor LACP.
Lack of LACP is a particularly tough nut. Once you try to do anything a bit more complex, like proper
per-session load balancing, or achieving optimum traffic flow in a MLAG environment, you have to
carefully configure vSwitch and pSwitch just right. You can eventually squeeze the vSwitch into
those spots, and get it to work, but it will definitely be a tight fit, and it won’t be nearly as reliable
as it could have been were vSwitch to support proper control-plane protocols.
IS IT JUST VMWARE?
Definitely not. Other virtual switches fare no better, and the Open vSwitch is no more intelligent
without an external OpenFlow controller. At the moment, VMware’s vSwitch is probably still the most
intelligent vSwitch shipping with a hypervisor.
The only reason XenServer supports LACP is because LACP support is embedded in the underlying
Linux kernel … but even then the LACP-based bonding is not officially supported.
vCDNI and VXLAN (both scale much better and offer wider range of logical networks) are not part of
vSwitch. vCDNI is an add-on module using VMsafe API and VXLAN exists within Nexus 1000V, or as
a loadable kernel module on top of VMware virtual distributed switch (vDS).
On top of all that, vSwitch assumes “friends and family” environment. BPDUs generated by a VM can
easily escape into the wild and trigger BPDU guard on upstream switches; it’s also possible to send
tagged packets from VMs into the network (implementing VLAN hopping would take a few extra
steps and a misconfigured physical network), and there’s no per-VM broadcast storm control. Using
a vSwitch in a potentially hostile cloud environment is a risky proposition.
SCALABILITY? NO THANKS.
There is an easy way to deploy vSwitch in the worst-case “any VM can be started on any hypervisor
host” scenario – configure all VM-supporting VLANs on all switch-to-server access trunks, effectively
turning the whole data center into a single broadcast domain. As hypervisor NICs operate in
promiscuous mode, every hypervisor receives and processes every flooded packet, regardless of its
VLAN and its actual target.
Reliance on bridging, which usually implies reliance on STP. STP is not necessarily a limiting
factor; you can create bridged networks with thousands of ports without having a single blocked link
if you have a well-designed spine & leaf architecture, and large core switches. Alternatively, you
could trust emerging technologies like FabricPath or QFabric.
Single broadcast domain. I don’t want to be the one telling you how many hosts you can have in
a broadcast domain, let’s turn to TRILL Problem and Applicability Statement (RFC 5556). Its section
2.6 (Problems Not Addressed) is very clear: a single bridged LAN supports around 1000 hosts. Due
to physical NICs being in promiscuous mode and all VLANs being enabled on all access trunks, VLAN
segmentation doesn’t help us; effectively we still have a single broadcast domain. We’re thus talking
about ~1000 VMs (regardless of the number of VLANs they reside in).
I’m positive I’ll get comments along the lines of “I’m running 100,000 VMs in a single bridged
domain and they work just fine”. Free soloing (rock climbing with zero protection) also works great
until the first fall. Seriously, I would appreciate all data points you're willing to share in the
comments.
Number of VLANs. Although vSphere supports the full 12-bit VLAN range, many physical switches
don’t. The number of VLANs doesn’t matter in a traditional virtualized data center, with only a few
(or maybe a few tens) security zones, but it’s a major showstopper in a public cloud deployment.
Try telling your boss that your solution supports only around 1000 customers (assuming each
customer wants to have a few virtual subnets) … after replacing all the switches you bought last
year.
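To put a number on that claim, here’s the back-of-envelope arithmetic (the subnets-per-tenant figure is my assumption, matching the “a few virtual subnets” hedge above):

```python
usable_vlans = 4094          # 12-bit VLAN tag space minus the two reserved values
subnets_per_tenant = 4       # assumption: "a few virtual subnets" per customer
print(usable_vlans // subnets_per_tenant)   # ~1023 tenants before the VLAN space runs out
```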
There are also a few things you can do in the physical network to improve the scalability of vSwitch-
based networks; I’ll describe them in the next post.
I have a different view regarding VMware vSwitch. For me it’s the best thing that happened in
my network in years. The vSwitch is so simple, and it’s so hard to break something in it,
that I let the server team do whatever they want (with one small rule: only one vNIC
per guest). I never have to configure a server port again.
As always, the right answer is “it depends” – what kind of vSwitch you need depends primarily on
your requirements.
I’ll try to cover the whole range of virtual networking solutions (from very simple ones to pretty
scalable solutions) in a series of blog posts, but before going there, let’s agree on the variety of
requirements that we might encounter.
We use virtual switches in two fundamentally different environments today: virtualized data centers
on one end of the spectrum, and private and public clouds at the other end (and you’re probably
somewhere between these two extremes).
The number of physical servers is also reasonably low (in tens, maybe low hundreds, but definitely
not thousands) as is the number of virtual machines. The workload is more or less stable, and the
virtual machines are moved around primarily for workload balancing / fault tolerance / maintenance
/ host upgrade reasons.
The unpredictable workload places extra strains on the networking infrastructure due to large-scale
virtual networks needed to support it.
You could limit the scope of the virtual subnets in a more static virtualized data center; after all, it
doesn’t make much sense to have the same virtual subnet spanning more than one HA cluster (or at
most a few of them).
In a cloud environment, you have to be able to spin up a VM whenever a user requests it … and you
usually start the VM within the physical server that happens to have enough compute (CPU+RAM)
resources. That physical server can be sitting anywhere in the data center, and the tenant’s logical network has to be extended to whichever server the VM lands on.
HYBRID ENVIRONMENTS
These data centers can offer you the most fun (or headache) there is – a combination of traditional
hosting (with physical servers owned by the tenants) and IaaS cloud (running on hypervisor-
powered infrastructure) presents some very unique requirements – just ask Kurt (@networkjanitor)
Bales about his DC needs.
In a virtualized enterprise data center you’d commonly experience a lot of live VM migration; the
workload optimizers (like VMware’s DRS) constantly shift VMs around high availability clusters to
optimize the workload on all hypervisor hosts (or even shut down some of the hosts if the overall
load drops below a certain limit). These migration events are usually geographically limited –
A VMware HA cluster can have at most 32 hosts, and while it’s prudent to spread them across two
racks or rows (for HA reasons), that’s the maximum range that makes sense.
The blog post was written in 2011 (before Cisco launched VXLAN) and thus refers to VMware’s then-
popular solution (vCDNI), which used a totally non-scalable MAC-over-MAC approach. I retained the
now-irrelevant parts of the article to give you a historic perspective on what we had to argue in
2011.
Finally, please allow me to point out that virtualization vendors did exactly what I said they should
be doing (because it makes perfect sense, not because I was writing about it).
Once we have a scalable solution that will be able to stand on its own in a large data center, most
smart network admins will be more than happy to get away from provisioning VLANs and focus on
other problems. After all, most companies have other networking problems beyond data center
switching. As for disappearing work, we've seen the demise of DECnet, IPX, SNA, DLSw and multi-
protocol networks (which are coming back with IPv6) without our jobs getting any simpler, so I'm
not worried about the jobless network admin. I am worried, however, about the stability of the networks we’re building.
In 2002 IETF published an interesting RFC: Some Internet Architectural Guidelines and Philosophy
(RFC 3439) that should be mandatory reading for anyone claiming to be an architect of solutions
that involve networking (you know who you are). In the End-to-End Argument and Simplicity section
the RFC clearly states: “In short, the complexity of the Internet belongs at the edges, and the IP
layer of the Internet should remain as simple as possible.”
We should use the same approach when dealing with virtualized networking: the complexity belongs
to the edges (hypervisor switches) with the intervening network providing the minimum set of
required services. I don’t care if the networking infrastructure uses layer-2 (MAC) addresses or
layer-3 (IP) addresses as long as it scales. Bridging does not scale as it emulates a logical thick coax
cable. Either get rid of most bridging properties (like packet flooding) and implement proper MAC-
address-based routing without flooding, or use IP as the transport. I truly don’t care.
Reading RFC 3439 a bit further, the next paragraphs explain the Non-Linearity and Network
Complexity. To quote the RFC: “In particular, the largest networks exhibit, both in theory and in
practice, architecture, design, and engineering non-linearities which are not exhibited at smaller
scale.” Allow me to paraphrase this for some vendors out there: “just because it works in your lab
does not mean it will work at Amazon or Google scale.”
The current state of affairs is just the opposite of what a reasonable architecture would be: VMware
has a barebones layer-2 switch (although it does have a few interesting features) with another non-
scalable layer (vCDNI) on top of (or below) it. The networking vendors are inventing all sorts of
kludges of increasing complexity to cope with that, from VN-Link/port extenders and EVB/VEPA to large-scale L2 fabrics based on TRILL or 802.1aq.
I don’t expect the situation to change on its own. VMware knows server virtualization is just a
stepping stone and is already investing in PaaS solutions; the networking vendors are more than
happy to sell you all the extra proprietary features you need just because VMware never
implemented a more scalable solution, increasing their revenues and lock-in. It almost feels like the
more “network is in my way” complaints we hear, the happier everyone is: virtualization vendors
because the blame is landing somewhere else, the networking industry because these complaints
give them a door opener to sell their next-generation magic (this time using a term borrowed from
the textile industry).
Imagine for a second that VMware or Citrix would actually implement a virtualized networking
solution using IP transport between hypervisor hosts. The need for new fancy boxes supporting
TRILL or 802.1aq would be gone, all you would need in your data center would be high-speed simple
L2/L3 switches. Clearly not a rosy scenario for the flat-fabric-promoting networking vendors, is it?
Is there anything you can do? Probably not much, but at least you can try. Sit down with the
virtualization engineers, discuss the challenges and figure out the best way to solve problems both
teams are facing. Engage the application teams. If you can persuade them to start writing scale-out
applications that can use proper load balancing, most of the issues bothering you will disappear on
their own: there will be no need for large stretched VLANs and no need for L2 data center
interconnects. After all, if you have a scale-out application behind a load balancer, nobody cares if
you have to shut down a VM and start it in a new IP subnet.
VLAN-BASED SOLUTIONS
The simplest possible virtual networking technology (802.1Q-based VLANs) is also the least scalable,
because of its tight coupling between the virtual networking (and VMs) and the physical world.
VLAN-based virtual networking uses bridging (which doesn’t scale), 12-bit VLAN tags (limiting you to
approximately 4000 virtual segments), and expects all switches to know the MAC addresses of all
VMs. You’ll get localized unknown unicast flooding if a ToR switch experiences MAC address table
overflow and a massive core flooding if the same thing happens to a core switch.
In its simplest incarnation (every VLAN enabled on every server port on ToR switches), the VLAN-
based virtual networking also causes massive flooding proportional to the total number of VMs in the
network.
VM-aware networking scales better (depending on the number of VLANs you have and the number
of VMs in each VLAN). The core switches still need to know all VM MAC addresses, but at least the
dynamic VLAN changes on the server-facing ports limit the amount of flooding on the switch-to-server links.
Figure 1-3: Arista EOS VM Tracer uses CDP to detect vSphere hosts connected to ToR switches
Provider Backbone Bridging (PBB) or VPLS implemented in ToR switches fare better. The core
network needs to know the MAC addresses (or IP loopbacks) of the ToR switches; all the other
virtual networking details are hidden.
While PBB or VPLS solves the core network address table issues, the MAC address table size in ToR
switches cannot be reduced without dynamic VPLS/PBB instance creation. If you configure all VLANs
on all ToR switches, the ToR switches have to store the MAC addresses of all VMs in the network (or
risk unicast flooding after the MAC address table experiences thrashing).
MAC-OVER-IP SOLUTIONS
The only proper way to decouple virtual and physical networks is to treat virtual networking like yet
another application (like VoIP, iSCSI or any other “infrastructure” application). Virtual switches that
can encapsulate L2 or L3 payloads in UDP (VXLAN) or GRE (NVGRE/Open vSwitch) envelopes appear
as IP hosts to the network; you can use the time-tested large-scale network design techniques to
build truly scalable data center networks.
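To make the “virtual networking is just another IP application” point more tangible, here’s a minimal sketch of what a VXLAN-style MAC-over-UDP encapsulation does to a layer-2 frame. The 8-byte header layout follows the VXLAN specification (UDP port 4789, 24-bit VNI); everything else (function names, the one-socket-per-packet simplification) is purely illustrative:

```python
import socket
import struct

VXLAN_PORT = 4789           # IANA-assigned VXLAN UDP port
VXLAN_FLAGS = 0x08000000    # "I" flag set: the 24-bit VNI field is valid

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header to a raw Ethernet frame."""
    header = struct.pack("!II", VXLAN_FLAGS, vni << 8)   # VNI occupies the upper 24 bits
    return header + inner_frame

def send_to_vtep(inner_frame: bytes, vni: int, remote_vtep_ip: str) -> None:
    """Ship the encapsulated frame as an ordinary unicast UDP datagram to the remote VTEP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(vxlan_encapsulate(inner_frame, vni), (remote_vtep_ip, VXLAN_PORT))
```

The transport network never sees anything but unicast IP/UDP packets exchanged between hypervisor addresses, which is exactly why the time-tested large-scale IP design techniques apply.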
However, MAC-over-IP encapsulation might not bring you to seventh heaven. VXLAN does not have
a control plane and thus has to rely on IP multicast to perform flooding of virtual MAC frames. All
hypervisor hosts using VXLAN have to join VXLAN-specific IP multicast groups, creating lots of (S,G)
and (*,G) entries in the core network. The virtual network data plane is thus fully decoupled from the physical network, but the control plane isn’t.
A truly scalable virtual networking solution would require no involvement from the transport IP
network. Hypervisor hosts would appear as simple IP hosts to the transport network, and use only
unicast IP traffic to exchange virtual network payloads; such a virtual network would use the same
transport mechanisms as today’s Internet-based applications and could thus run across huge
transport networks. I’m positive Amazon has such a solution, and it seems Nicira’s Network
Virtualization Platform is another one (but I’ll believe that when I see it).
The blog post was written in May 2012 and still refers to Nicira NVP (which later became VMware
NSX).
Before going into more details, you might want to browse through my Cloud Networking Scalability
presentation (or watch its recording) – the crucial slide is this one:
Figure 1-8: Good morning, which VLAN would you like to talk with today? (source: Wikipedia)
The VM-aware networking is an interesting twist in the story – the exchange operator is
listening to the user traffic and trying to figure out who they want to talk with.
Does it make sense? Let’s see – to get a somewhat scalable VLAN-based solution, you’d need at
least the following components:
A signaling protocol between the hypervisors and ToR switches that would tell the ToR switches
which VLANs the hypervisors need. Examples: EVB (802.1Qbg) or VM-FEX.
Large-scale multipath bridging technology. Examples: SPB (802.1aq) or TRILL.
VLAN pruning protocol. Examples: MVRP (802.1ak) or VTP pruning. SPB might also offer
something similar with service instances.
VLAN addressing extension, and automatic mapping of hypervisor VLANs into a wider VLAN
address space used in the network core. Q-in-Q (802.1ad) or MAC-in-MAC (802.1ah) could be
used as the wider address space, and I have yet to see ToR gear performing automatic VLAN
provisioning.
It might be just me, but looking at this list, RFC 1925 comes to mind (“with sufficient thrust, pigs fly
just fine”).
To understand the implications of ever-increased complexity vendors are throwing at us, go through
the phenomenal presentation Randy Bush had @ NANOG26, in which he compared the complexities
of voice switches with those of IP routers. The last slide of the presentation is especially relevant to
the virtual networking environment:
You don’t think VoIP scales better than traditional voice? Just compare the costs of doing a Skype
VoIP transatlantic call with the costs of a traditional voice call from two decades ago (the
international voice calls became way cheaper in the meantime, partly because most carriers started
using VoIP for long-distance trunks). Enough said.
We can watch the same architectural shift happening in the virtual networking world: VXLAN,
NVGRE and STT are solutions that move the virtual networking complexity to the hypervisor, and
rely on proven, simple, cheap and reliable IP transport in the network. No wonder the networking
companies like you more if you use VLAN-based L2 hypervisor switches (like the Alcatels, Lucents
and Nortels of the world preferred you buy stupid phones and costly phone exchanges).
Does that mean that EVB, TRILL, and other similar technologies have no future? Absolutely not.
Networking industry made tons of money deploying RSRB, DLSw and CIPs in SNA environments
years after it was evident TCP/IP-based solutions (mostly based on Unix-based minicomputers) offer
more flexible services for way lower price. Why should it be any different this time?
The blog post was written in August 2011; in the meantime Cisco and VMware implemented vMotion
support for VM-FEX (VM bypassing the hypervisor and accessing a virtualized physical NIC directly).
You’ll also notice that I wasn’t explicitly arguing for the overlay virtual networking approach, but just
for more intelligence in the virtual switches.
A virtual switch embedded in a typical hypervisor OS serves two purposes: it does (usually abysmal)
layer-2 forwarding and (more importantly) hides the details of the physical hardware from the VM.
Virtual machines think they work with a typical Ethernet NIC – usually based on a well-known
chipset like Intel’s 82545 controller or AMD Lance controller – or you could use special drivers that
allow the VM to interact with the hypervisor more effectively (for example, VMware’s VMXNET
driver).
In both cases, the details of the physical hardware are hidden from the VM, allowing you to deploy
the same VM image on any hypervisor host in your data center (or cloudburst it if you believe in that
particular mythical beast), regardless of the host’s physical Ethernet NIC. The hardware abstraction
also makes the vMotion process run smoothly – the VM does not need to re-initialize the physical
hardware once it’s moved to another physical host. VMware (and probably most other hypervisors)
solves the dilemma in a brute force way – it doesn’t allow you to vMotion a VM that’s communicating
directly with the hardware using VMDirectPath.
The hardware abstraction functionality is probably way more resource-consuming than the simple L2
forwarding performed by the virtual switches; after all, how hard could it be to do a hash table
lookup, token bucket accounting, and switch a few ring pointers?
I am positive there are potential technical solutions to all the problems I’ve mentioned, but they are
simply not available on any server infrastructure virtualization platform I’m familiar with. The
vendors deploying new approaches to virtual networking thus have to rely on a forwarding element
embedded in the hypervisor kernel, like the passthrough VEM module Cisco is using in its VM-FEX
implementation.
In my opinion, it would make way more sense to develop a technology that tightly integrates
hypervisor hosts with the network (EVB/VDP parts of the 802.1Qbg standard) than to try to push a
square peg into a round hole using VEPA or VM-FEX, but we all know that’s not going to happen.
Hypervisor vendors don’t seem to care and the networking vendors want you to buy more of their
gear.
All of these complaints have merits ... and I’ve heard them at least three or four times:
When we started encapsulating SNA in TCP/IP using RSRB and later DLSw;
When we started replacing voice switches with VoIP and transporting voice over IP networks;
When we replaced Frame Relay and ATM switches with MPLS/VPN.
Interestingly I don’t remember a huge outcry when we started using IPsec to build private networks
over the Internet ... maybe the immediate cost savings made everyone forget we were actually
building tunnels with no QoS.
Assuming one could design the whole protocol stack from scratch, one could do a proper job of
eliminating all the redundancies. Given the fact that the only ubiquitous transport we have today is
IP, and that you can’t expect the equipment vendors to invest into anything else but Ethernet+IP in
the foreseeable future, the only logical conclusion is to use IP as the transport for your virtual
networking data ... like any other application is doing these days. It obviously works well enough for
Amazon.
You have to use transport over IP if you want the solution to scale ... or a completely
revamped layer-2 forwarding paradigm, which is not impossible, merely impractical in a
reasonable timeframe ... but of course OpenFlow will bring us there ;)
I’m not saying Nicira’s solution is the right one. I’m not saying GRE or VXLAN or NVGRE or
something else is the right tunneling protocol. I’m not saying transporting Ethernet frames in IP
tunnels is a good decision – I would prefer to have full IP routing in the hypervisors and transport IP
datagrams, not L2 frames, between hypervisor hosts. I’m also not saying IP is the right transport
protocol, it’s just the only scalable one we have today.
Split hypervisor host addressing (which is visible in the core) from VM addressing (which is only
visible to hypervisors);
Use simple routed core transport which allows the edge (hypervisor) addresses to be aggregated
for scalability;
Remove all VM-related state from the transport core;
Use proper control plane that will minimize the impact of stupidities we have to deal with if we
have to build L2 virtual networks.
But, as always, this is just my personal opinion, and I'm known to be biased.
If I understand you correctly, you think that VXLAN will win over EVB?
I wouldn’t say they are competing directly from the technology perspective. There are two ways you
can design your virtual networks: (a) smart core with simple edge (see also: voice and Frame Relay
switches) or (b) smart edge with simple core (see also: Internet). EVB makes option (a) more
viable, VXLAN is an early attempt at implementing option (b).
When discussing virtualized networks I consider the virtual switches in the hypervisors the
network edge and the physical switches (including top-of-rack switches) the network core.
Historically, option (b) (smart edge with simple core) has been proven to scale better ... the largest
example of such architecture is allowing you to read my blog posts.
Actually it is – IBM has just launched its own virtual switch for VMware ESX (a competitor to Nexus
1000V) that has limited EVB support (the way I understand the documentation, it seems to support
VDP, but not the S-component).
But VXLAN has its limitations – for example, only VXLAN-enabled VMs will be able to
speak to each other.
Almost correct. VMs are not aware of VXLAN (they are thus not VXLAN-enabled). From VM NIC
perspective the VM is connected to an Ethernet segment, which could be (within the vSwitch)
implemented with VLANs, VXLAN, NVGRE, STT or something else.
At the moment, the only implemented VXLAN termination point is Nexus 1000V, which means that
only VMs residing within ESX hosts with Nexus 1000V can communicate over VXLAN-implemented
Ethernet segments. Some vendors are hinting they will implement VXLAN in hardware (switches),
and Cisco already has the required hardware in Nexus 7000 (because VXLAN has the same header
format as OTV).
Update August 2014: Several vendors offer software and hardware VXLAN gateways. Refer
to the Gateways to Overlay Virtual Networks chapter for more details.
VXLAN encapsulation will also take some CPU cycles (thus impacting your VM performance).
While VXLAN encapsulation will not impact VM performance per se, it will eat CPU cycles that could
be used by VMs. If your hypervisor host has spare CPU cycles, the VXLAN overhead shouldn’t matter; if you’re pushing it to the limits, you might experience a performance impact.
If your VMs are CPU-bound you might not notice; if they generate lots of user-facing data, lack of
TCP offload might be a killer.
I personally see VXLAN as an end-to-end solution where we can’t interact with the network
infrastructure anymore. For example, how would these VMs be able to connect to the
first-hop gateway?
Today you can use VXLAN to implement “closed” virtual segments that can interact with the outside
world only through VMs with multiple NICs (a VXLAN-backed NIC and a VLAN-backed NIC), which
makes it perfect for environments where firewalls and load balancers are implemented with VMs
(example: VMware’s vCloud with vShield Edge and vShield App). As said above, VXLAN termination
points might appear in physical switches.
With EVB we would still have full control and could do the same things we’re doing today
on the network infrastructure, and the network will be able to automatically provide the
correct VLAN's on the correct ports.
That’s a perfect summary. EVB enhances today’s VLAN-backed virtual networking infrastructure,
while overlay virtual networks completely change the landscape.
Is then the only advantage of VXLAN that you can scale better because you don't have
the VLAN limitation?
With the explosion of overlay virtual networking solutions (with every single reasonably-serious
vendor having at least one) one might get the feeling that it doesn't make sense to build greenfield
IaaS cloud networks with VLANs. As usual, there's significant difference between theory and
practice.
You should always consider the business requirements before launching on a technology
crusade. IaaS networking solutions are no exception.
If you plan to sell your services to customers with complex application stacks, overlay virtual
networks make perfect sense. These customers usually need multiple internal networks and an
appliance between their internal networks and the outside world. If you decide to implement the
Internet-facing appliance with a VM-based solution, and all subnets behind the appliance with
overlay virtual networks, you're almost done.
Customers buying a single VM, and maybe access to central MySQL or SQL Server database, are a
totally different story. Having a subnet and a VM-based appliance for each customer paying for a
single VM makes absolutely no sense. We need something similar to PVLANs, and the only overlay
virtual networking product with a reasonably simple PVLAN implementation is VMware NSX for multiple hypervisors. Alternatively, you could use a single subnet (VLAN- or overlay-based) and protect individual customer VMs with a VM NIC firewall (or iptables).
PREREQUISITES
As always, it makes sense to start with the prerequisites.
If you’re fortunate enough to run Hyper-V 3.0 R2, you already have all you need – Hyper-V Network
Virtualization is included in Hyper-V 3.0, and configurable through the latest version of System
Center (I doubt you’d want to write PowerShell scripts to get your first pilot project off the ground).
vSphere users are having a slightly harder time. VXLAN is part of the free version of Nexus 1000V,
but you still need Enterprise Plus vSphere license to get distributed virtual switch functionality
needed by Nexus 1000V, and you have to configure VXLAN segments through the Nexus 1000V CLI
(or write your own NETCONF scripts).
In Linux environments use GRE tunneling available in Open vSwitch. OpenStack’s default Neutron
plugin can configure inter-hypervisor tunnels automatically (just don’t push it too far).
Ideally, you’d find a development group (or a developer) willing to play with new concepts, set up
development environment for them (including virtual segments and network services), and help
them move their project all the way to production, creating staging and testing virtual segments and
services on the fly (warning: some programming required; also check out Cloudify).
Needless to say, when engineers not familiar with the networking intricacies create point-and-click
application stacks without firewalls and load balancers, you get some interesting designs.
The following one seems to be particularly popular. Assuming your application stack has three layers
(web servers, app servers and database servers), this is how you are supposed to connect the VMs:
When I heard about this “design” being discussed in VMware training, I politely smiled (and one of
our CCIEs attending that particular class totally wrecked it). When I saw the same design on a slide
with Cisco’s logo on it, my brains wanted to explode.
Let’s see if we can list all things that are wrong with this design:
It’s a security joke. Anyone penetrating your web servers gets a free and unlimited pass to try
and hack the app servers. Repeat recursively through the whole application stack.
How will you manage the servers? Usually we’d use SSH to access the servers. How will you
manage the app servers that are totally isolated from the rest of the network? Virtual console? Fine
with me.
How will you download operating system patches? Pretty interesting one if you happen to
download them from the Internet. Will the database servers go through the app servers and through
the web servers to access the Internet? Will you configure proxy web servers on every layer?
IP routing in vShield Edge (that you're supposed to be using as the firewall, router and load
balancer) is another joke. It supports static routes only. Even if you decide to go through multiple
layers of VMs to get to the outside world, trying to get the return packet forwarding to work will fry
your brains.
VMware NSX Edge Services Router supports routing protocols, but that doesn’t make this
particularly broken design any more valid.
How will you access common services? Let’s say you use company-wide LDAP services. How will
the isolated VMs access them? Will you create yet another segment and connect all VMs to it ...
exposing your crown jewels to the first intruder that manages to penetrate the web servers? How
about database mirroring or log shipping?
It doesn’t matter if you use VLANs or some other virtual networking technology. It doesn’t matter if
you use physical firewalls and load balancers or virtual appliances – if you want to build a proper
application stack, you need the same functional components you’d use in the physical world, wired
in approximately the same topology.
IN THIS CHAPTER:
MORE INFORMATION
Watch the Overlay Virtual Networking webinar (and the Following Packets across Overlay Virtual
Networks addendum).
Check out cloud computing and networking webinars and webinar subscription.
Use the ExpertExpress service if you need a short online consulting session, a technology discussion or a design review.
Examples: traditional VLANs, VXLAN on Nexus 1000v, VXLAN on VMware vCNS, VMware NSX,
Nuage Networks Virtual Services Platform, OpenStack Open vSwitch Neutron plugin.
Other solutions perform layer-3 forwarding at the first hop (vNIC-to-vSwitch boundary),
implementing a pure layer-3 network.
Some virtual networking solutions provide centralized built-in layer-3 gateways (routers) that you
can use to connect layer-2 segments.
Other layer-2 solutions provide distributed routing – the same default gateway IP and MAC address
are present in every first-hop switch, resulting in optimal end-to-end traffic flow.
Examples: Cisco DFA, Arista VARP, Juniper QFabric, VMware NSX, Nuage VSP, Distributed layer-3
forwarding in OpenStack Icehouse release.
Other layer-3 virtual networking solutions allow dynamic IP addresses (example: customer DHCP
server) or IP address migration between cluster members.
Examples: Hyper-V network virtualization in Windows Server 2012 R2, Juniper Contrail
The blog post was written in August 2013 and updated to reflect the changes introduced in recent
release of overlay virtual networking products.
Figure 2-1: Sample network with two hypervisor hosts connected to an overlay virtual networking segment
The TCP/IP stack (or any other network-related software working with the VM NIC driver) is totally
oblivious to its virtual environment – it looks like the VM NIC is connected to a real Ethernet
segment, and so when the TCP/IP stack needs to send a packet, it sends a full-fledged L2 frame
(including source and destination VM MAC address) to the VM NIC.
The first obvious question you should ask is: how does the VM know the MAC address of the other
VM? Since the VM TCP/IP stack thinks the VM NIC connects it to an Ethernet segment, it uses ARP to
get the MAC address of the other VM.
Layer-3-only products like Hyper-V network virtualization in Windows Server 2012 R2,
Amazon VPC or Juniper Contrail use different mechanisms. The Hyper-V network
virtualization behavior is described in the Overlay Virtual Networking Product Details
chapter.
Now let’s focus on what happens with the layer-2 frame sent through the VM NIC once it hits the
soft switch. If the destination MAC address belongs to a VM residing in the same hypervisor, the
frame gets delivered to the destination VM (even Hyper-V does layer-2 forwarding within the
hypervisor, as does Nicira’s NVP unless you’ve configured private VLANs).
If the destination MAC address doesn’t belong to a local VM, the layer-2 forwarding code sends the
layer-2 frame toward the physical NIC ... and the frame gets intercepted on its way toward the real
world by an overlay virtual networking module (VXLAN, NVGRE, GRE or STT
encapsulation/decapsulation module).
The overlay virtual networking module uses the destination MAC address to find the IP address of
the target hypervisor, encapsulates the virtual layer-2 frame into a VXLAN/(NV)GRE/STT envelope
and sends the resulting IP packet toward the physical NIC (with the added complexity of vKernel
NICs in vSphere environments).
Glad you asked the third question: how does the overlay networking module know the IP address of
the target hypervisor? That’s the crux of the problem and the main difference between VXLAN and
Hyper-V/NVP. It’s clearly a topic for yet another blog post (and here’s what I wrote about this
problem a while ago). For the moment, let’s just assume it does know what to do.
The physical network (which has to provide nothing more than simple IP transport) eventually
delivers the encapsulated layer-2 frame to the target hypervisor, which uses standard TCP/IP
mechanisms (match on IP protocol for GRE, destination UDP port for VXLAN and destination TCP
port for STT) to deliver the encapsulated layer-2 frame to the target overlay networking module.
Things are a bit more complex: in most cases you’d want to catch the encapsulated traffic
somewhere within the hypervisor kernel to minimize the performance hit (each trip through
the userland costs you extra CPU cycles), but you get the idea.
Last step: the target overlay networking module strips the envelope and delivers the raw layer-2
frame to the layer-2 hypervisor switch which then uses the destination MAC address to send the
frame to the target VM-NIC.
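If you prefer code to prose, the whole walkthrough condenses into a handful of table lookups. The sketch below is a conceptual illustration (the names and data structures are mine, not any product’s); it assumes the MAC-to-VTEP table is already populated by whatever mechanism the product uses:

```python
# Conceptual per-hypervisor state (illustrative, not a product API)
local_vms = {"00:50:56:aa:00:01": "vNIC of a local VM"}   # dest MAC -> local vNIC
mac_to_vtep = {"00:50:56:bb:00:02": "192.0.2.22"}         # dest MAC -> remote hypervisor IP

def deliver_locally(vnic, frame):
    print(f"delivering frame to {vnic}")                  # same-host VM: no encapsulation

def send_encapsulated(frame, vni, remote_vtep_ip):
    print(f"encapsulating frame for VNI {vni}, sending to VTEP {remote_vtep_ip}")

def forward_from_vm(dst_mac, frame, vni):
    """Layer-2 forwarding decision for a frame received from a local VM."""
    if dst_mac in local_vms:
        deliver_locally(local_vms[dst_mac], frame)
    elif dst_mac in mac_to_vtep:
        send_encapsulated(frame, vni, mac_to_vtep[dst_mac])
    else:
        # Unknown destination: flood (IP multicast / head-end replication) or ask the
        # controller -- this is exactly where the individual products differ.
        print("unknown destination MAC: flood or query the control plane")

def receive_encapsulated(inner_dst_mac, inner_frame):
    """Receive side: the envelope has already been stripped by the UDP/GRE/STT demux."""
    if inner_dst_mac in local_vms:
        deliver_locally(local_vms[inner_dst_mac], inner_frame)

forward_from_vm("00:50:56:bb:00:02", b"\x00" * 64, vni=5001)
```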
Summary: major overlay virtual networking implementations are essentially identical when it
comes to frame forwarding mechanisms. The encapsulation wars are thus stupid, with the sole
exception of TCP/IP offload, and some vendors have already started talking about multi-
encapsulation support.
The key differentiator between scalable and not-so-very-scalable architectures and technologies is
the control plane – the mechanism that maps (at the very minimum) remote VM MAC address into a
transport network IP address of the target hypervisor (see A Day in a Life of an Overlaid Virtual
Packet for more details).
Overlay virtual networking vendors chose a plethora of solutions, ranging from Ethernet-like
dynamic MAC address learning to complex protocols like MP-BGP. Here’s an overview of what they’re
doing:
The original VXLAN as implemented by Cisco’s Nexus 1000V, VMware’s vCNS release 5.1, Arista
EOS, and F5 BIG-IP TMOS release 11.4 has no control plane. It relies on transport network IP
multicast to flood BUM traffic and uses Ethernet-like MAC address learning to build mapping between
virtual network MAC address and transport network IP addresses.
Unicast VXLAN as implemented in Cisco’s Nexus 1000V release 4.2(1)SV2(2.1) has something that
resembles a control plane. VSM distributes segment-to-VTEP mappings to VEMs to replace IP multicast flooding with head-end (unicast) replication.
VXLAN MAC distribution mode is a proper control plane implementation in which the VSM
distributes VM-MAC-to-VTEP-IP information to VEMs. Unfortunately it seems to be based on a
proprietary protocol, so it won’t work with hardware gateways from Arista or F5.
Nicira NVP (part of VMware NSX) uses OpenFlow to install forwarding entries in the hypervisor
switches and Open vSwitch Database Management Protocol to configure the hypervisor switches.
NVP uses OpenFlow to implement L2 forwarding and VM NIC reflexive ACLs (L3 forwarding uses
another agent in every hypervisor host).
Midokura Midonet doesn’t have a central controller or control-plane protocols. Midonet agents
residing in individual hypervisors use shared database to store control- and data-plane state.
Contrail (now Juniper JunosV Contrail) uses MP-BGP to pass MPLS/VPN information between
controllers and XMPP to connect hypervisor switches to the controllers.
IBM SDN Virtual Edition uses a hierarchy of controllers and appliances to implement NVP-like
control plane for L2 and L3 forwarding using VXLAN encapsulation. I wasn’t able to figure out what
protocols they use from their whitepapers and user guides.
Nuage Networks is using MP-BGP to exchange L3VPN or EVPN prefixes with the external devices,
and OpenFlow with extensions between the controller and hypervisor switches.
As it happens, the “alternate approach” pretty accurately described what Nicira launched a few
months later in its NVP product (and the ARP handling discussion is still relevant in 2014).
In both cases, the vSwitch takes L2 frames generated by VMs attached to it, wraps them in
protocol-dependent envelopes (VXLAN-over-UDP or GRE), attaches an IP header in front of those
envelopes ... and faces a crucial question: what should the destination IP address (Virtual Tunnel End Point – VTEP – in VXLAN terms) be? Like any other overlay technology, a MAC-over-IP vSwitch needs a virtual-to-physical mapping table (in this particular case, a VM-MAC-to-host-IP mapping table).
Solve the problem within your architecture, using whatever control-plane protocol comes handy
(either reusing existing ones or inventing a new protocol);
Nicira’s Network Virtualization Platform (NVP) seems to be solving the problem using OpenFlow as
the control-plane protocol; VXLAN offloads the problem to the network.
VXLAN is a very simple technology and uses existing layer-2 mechanisms (flooding and dynamic
MAC learning) to discover remote MAC addresses and MAC-to-VTEP mappings, and IP multicast to
reduce the scope of the L2-over-UDP flooding to those hosts that expressed explicit interest in the
VXLAN frames.
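In the flood-and-learn case, the mapping table is filled in exactly the way a transparent bridge learns MAC addresses, just one level removed: whenever a VTEP decapsulates a frame, it remembers which transport IP address the inner source MAC arrived from. A minimal sketch (illustrative, not vendor code):

```python
mac_to_vtep = {}   # inner source MAC -> transport IP of the owning VTEP (per VNI in reality)

def learn_on_decapsulation(outer_src_ip, inner_src_mac):
    """Ethernet-style dynamic learning applied to VXLAN envelopes."""
    mac_to_vtep[inner_src_mac] = outer_src_ip

def outer_destination(inner_dst_mac, vni_multicast_group):
    """Known MACs go straight to the owning VTEP; unknown ones are flooded to the
    VNI's IP multicast group, just like unknown-unicast flooding in a bridge."""
    return mac_to_vtep.get(inner_dst_mac, vni_multicast_group)

learn_on_decapsulation("192.0.2.22", "00:50:56:bb:00:02")
print(outer_destination("00:50:56:bb:00:02", "239.1.1.5"))   # learned: unicast to the VTEP
print(outer_destination("00:50:56:cc:00:03", "239.1.1.5"))   # unknown: flood to the group
```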
Ideally, you’d map every VXLAN segment (or VNI – VXLAN Network Identifier) into a separate IP
multicast address, limiting the L2 flooding to those hosts that have VMs participating in the same
VXLAN segment. In a large-scale reality, you’ll probably have to map multiple VXLAN segments into
a single IP multicast address due to low number of IP multicast entries supported by typical data
center switches.
According to the VXLAN draft, the VNI-to-IPMC mapping remains a management plane decision.
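Here’s a rough sketch of what such a management-plane decision could look like when the core switches support far fewer multicast groups than you have VNIs; the pool size, address range and wraparound policy are my assumptions, loosely modeled on the vShield Manager behavior mentioned later in this chapter:

```python
from ipaddress import IPv4Address

GROUP_POOL_BASE = IPv4Address("239.1.1.0")   # assumed administratively-scoped range
GROUP_POOL_SIZE = 3000                       # e.g. ~75% of a 4000-entry multicast table

def vni_to_group(vni: int) -> IPv4Address:
    """Map a 24-bit VNI onto a limited pool of multicast groups (wraparound allocation).
    Several VNIs inevitably share a group, so hosts receive some flooded traffic for
    segments they don't participate in."""
    return GROUP_POOL_BASE + (vni % GROUP_POOL_SIZE)

print(vni_to_group(5000), vni_to_group(5000 + GROUP_POOL_SIZE))   # two VNIs, same group
```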
The IP multicast tables in the core switches will probably explode if you decide to go from shared
trees to source-based trees in a large-scale VXLAN deployment.
It’s perfectly possible to distribute the MAC-to-VTEP mappings with a control-plane protocol. You
could use a new BGP address family (I’m not saying it would be fast), L2 extensions for IS-IS (I’m
not saying it would scale), a custom-developed protocol, or an existing network- or server-
programming solution like OpenFlow or XMPP.
Nicira seems to be going down the OpenFlow path. Open vSwitch uses P2P GRE tunnels between
hypervisor hosts with GRE tunnel key used to indicate virtual segments (similar to NVGRE draft).
You could use OVS without OpenFlow – create P2P GRE tunnels, and use VLAN encapsulation and
dynamic MAC learning over them for a truly nightmarish non-scalable solution.
Once an OpenFlow controller enters the picture, you’re limited only by your imagination (and the
amount of work you’re willing to invest):
You could intercept all ARP packets and implement an ARP proxy in the OpenFlow controller (see the sketch after this list);
After implementing ARP proxy you could stop all other flooding in the layer-2 segments for a
truly scalable Amazon-like solution;
You could intercept IGMP joins and install L2 multicast or IP multicast forwarding tables in OVS.
The multicast forwarding would still be suboptimal due to P2P GRE tunnels – head-end host
would do packet replication.
You could go a step further and implement full L3 switching in OVS based on destination IP
address matching rules.
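To expand on the ARP proxy idea from the list above: the controller only has to craft a standard ARP reply from its IP-to-MAC database instead of flooding the request. Here’s a bare-bones sketch of building such a reply in plain Python (no OpenFlow library, so the packet-in/packet-out plumbing is left out; all names and addresses are made up):

```python
import struct

ip_to_mac = {"10.0.0.5": "00:50:56:aa:00:05"}   # the controller's authoritative ARP database

def mac_bytes(mac):
    return bytes(int(b, 16) for b in mac.split(":"))

def ip_bytes(ip):
    return bytes(int(o) for o in ip.split("."))

def build_arp_reply(requested_ip, requester_mac, requester_ip):
    """Answer a 'who-has requested_ip' ARP request on behalf of the target VM."""
    target_mac = ip_to_mac[requested_ip]
    eth = mac_bytes(requester_mac) + mac_bytes(target_mac) + struct.pack("!H", 0x0806)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)           # Ethernet/IPv4, opcode 2 = reply
    arp += mac_bytes(target_mac) + ip_bytes(requested_ip)      # sender = the VM being asked for
    arp += mac_bytes(requester_mac) + ip_bytes(requester_ip)   # target = the asking VM
    return eth + arp

frame = build_arp_reply("10.0.0.5", "00:50:56:bb:00:02", "10.0.0.7")
print(len(frame))   # 42 bytes: 14-byte Ethernet header + 28-byte ARP payload
```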
IMPLEMENTATION ISSUES
VXLAN was first implemented in Nexus 1000V, which presents itself as a Virtual Distributed Switch
(vDS) to VMware vCenter. A single Nexus 1000V instance cannot have more than 64 VEMs (vSphere
kernel modules), limiting the Nexus 1000V domain to 64 hosts (or approximately two racks of UCS
blade servers).
It’s definitely possible to configure the same VXLAN VNI and IP multicast address on different Nexus
1000V switches (either manually or using vShield Manager), but you cannot vMotion a VM out of the
vDS (that Nexus 1000V presents to vCenter).
VXLAN on Nexus 1000V is thus a great technology if you want to implement HA/DRS clusters spread
across multiple racks or rows (you can do it without configuring end-to-end bridging), but falls way
short of the “deploy any VM anywhere in the data center” holy grail.
The number of IP multicast groups (together with the size of the network) obviously influences the
overall VXLAN scalability. Here are a few examples:
One or few multicast groups for a single Nexus 1000V instance. Acceptable if you don’t need
more than 64 hosts. Flooding wouldn’t be too bad (not many people would put more than a few
thousand VMs on 64 hosts) and the core network would have a reasonably small number of (S/*,G)
entries (even with source-based trees the number of entries would be linearly proportional to the
number of vSphere hosts).
Use per-VNI multicast group. This approach would result in minimal excessive flooding but
generate large amounts of (S,G) entries in the network.
The size of the multicast routing table would obviously depend on the number of hosts, number of
VXLAN segments, and PIM configuration – do you use shared trees or switch to source tree as soon
as possible … and keep in mind that Nexus 7000 doesn’t support more than 32000 multicast entries
and Arista’s 7500 cannot have more than 4000 multicast routes on a linecard.
RULES-OF-THUMB
VXLAN has no flooding reduction/suppression mechanisms, so the rules-of-thumb from RFC 5556
still apply: a single broadcast domain should have around 1000 end-hosts. In VXLAN terms, that’s
around 1000 VMs per IP multicast address.
However, it might be simpler to take another approach: use shared multicast trees (and hope the
amount of flooded traffic is negligible), and assign anywhere between 75% and 90% of (lowest) IP
multicast table size on your data center switches to VXLAN. Due to vShield Manager’s wraparound
multicast address allocation policy, the multicast traffic should be well-distributed across the whole allocated address range.
Obviously you need IGMP and PIM in multicast environments only (vCNS 5.x, Nexus 1000V
in multicast mode).
IGMP is used by the ESXi hosts to tell the first-hop routers (in case you run VXLAN across multiple
subnets) that they want to participate in a particular multicast group, so the subnet in which they
reside gets added to the distribution tree.
PIM is used between routers to figure out what the IP multicast flooding tree should look like.
IP multicast datagrams are sent as MAC frames with multicast destination MAC addresses. These frames are flooded by dumb L2 switches, resulting in wasted network bandwidth and server CPU cycles.
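The multicast MAC addresses mentioned above are derived mechanically from the IP group address: the low-order 23 bits are copied into the 01:00:5e prefix, which also means 32 different IP groups collide on every multicast MAC (one reason NIC-level filtering is imperfect). A quick illustration:

```python
from ipaddress import IPv4Address

def multicast_mac(group: str) -> str:
    """Standard IPv4-multicast-to-Ethernet mapping: 01:00:5e + low 23 bits of the group."""
    low23 = int(IPv4Address(group)) & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E, low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)

print(multicast_mac("239.1.1.1"))     # 01:00:5e:01:01:01
print(multicast_mac("238.129.1.1"))   # same MAC: the high 5 bits of the group are lost
```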
IGMP snooping gives some L3 smarts to L2 switches - they don't flood IP multicast frames out all
ports, but only on ports from which they've received corresponding IGMP joins. You might decide
not to care in a small data center network with tens of servers; IGMP snooping will definitely help in
large (hundreds of servers) deployments.
Finally, if you want to use IGMP snooping in an L2-only environment (all VXLAN hosts in the same IP
subnet), you need a node that pretends it's a router and sends out IGMP queries, or the L2 switches
have nothing to snoop.
As far as I understand, VXLAN, NVGRE and any tunneling protocol that uses a global ID in
the data plane cannot support PVLAN functionality.
He’s absolutely right, but you shouldn’t try to shoehorn VXLAN into existing deployment models. To
understand why that doesn’t make sense, we have to focus on the typical cloud application
architectures.
To be more precise, any tunneling protocol that uses global ID in the data plane and uses
flooding to compensate for lack of control plane cannot support PVLAN. VMware NSX
for multiple hypervisors has port isolation (which is equivalent to a simple PVLAN); they
could do it because the NSX controller(s) download all MAC-to-IP mappings and MAC
forwarding entries into the hypervisor switches.
PVLAN is the perfect infrastructure solution for this environment – deploy a PVLAN in each compute
pod (whatever that might be – usually a few racks), and use IP routing between pods. You can still
use vMotion and HA/DRS within a pod, so you can move the customer VMs when you want to
perform maintenance on individual pod components.
Evacuating a whole pod is a bit more complex, but then (hopefully) you won’t be doing that every
other day. If you really want to have this capability (because restarting customer VMs every now
and then is not an option), develop a migration process where you temporarily provision a PVLAN
between two pods, move the VMs and shut down the temporary inter-pod L2 connection, thus
minimizing the risk of having a large-scale VLAN across multiple pods.
Summary: you don’t need VXLAN if you’re selling individual VMs. PVLANs work just fine.
In short – you need multiple isolated virtual network segments with firewalls and load balancers
sitting between the segments and between the web server(s) and the outside world.
SUMMARY
MAC-over-IP virtual networking solutions are not a panacea. They cannot replace some of the
traditional isolation constructs (PVLAN), but then they were not designed to do that. Their primary
use case is an Amazon VPC-like environment with numerous isolated virtual networks per tenant.
I always get confused when thinking about IP multicast traffic over VXLAN tunnels. Since
VXLAN already uses a Multicast Group for layer-2 flooding, I guess all VTEPs would have
to receive the multicast traffic from a VM, as it appears as L2 multicast. Am I missing
something?
Short answer: no, you’re absolutely right. IP multicast over VXLAN is clearly suboptimal.
In the good old days when the hypervisor switches were truly dumb and used simple VLAN-based
layer-2 switching, you could control the propagation of IP multicast traffic by deploying IGMP
snooping on layer-2 switches (or, if you had Nexus 1000V, you could configure IGMP snooping
directly on the hypervisor switch).
Those days are gone (finally), but the brave new world still lacks a few features. No ToR switches
are currently capable of digging into the VXLAN payload to find IGMP queries and joins, and it’s
questionable whether Nexus 1000V can do IGMP snooping over VXLAN (IGMP snooping on Nexus
1000V is configured on VLANs).
Hyper-V network virtualization can map individual customer multicast groups to provider
(transport) multicast groups, resulting in way more optimal behavior.
VXLAN uses UDP for its encapsulation. What about dropped packets, lack of sequencing,
etc., that is possible with UDP? What impact is that going to have on the “inner protocol”
that’s wrapped inside the VXLAN UDP packets? Or is this not an issue in modern
networks any longer?
Somewhat longer one: VXLAN emulates an Ethernet broadcast domain, which is not reliable anyway.
Any layer-2 device (usually known as a switch although a bridge would be more correct) can drop
frames due to buffer overflows or other forwarding problems, or the frames could become corrupted
in transit (although the drops in switches are way more common in modern networks).
UDP packet reordering is usually not a problem – packet/frame reordering is a well-known challenge
and all forwarding devices take care not to reorder packets within a layer-4 (TCP or UDP) session.
The only way to introduce packet reordering is to configure per-packet load balancing somewhere in
the path (hint: don’t do that).
Using UDP to transport Ethernet frames thus doesn’t break the expected behavior. Things might get
hairy if you’d extend VXLAN across truly unreliable links with high error rate, but even then VXLAN-
over-UDP wouldn’t perform any worse than other L2 extensions (for example, VPLS or OTV) or any
other tunneling techniques. None of them uses a reliable transport mechanism.
GETTING ACADEMIC
Running TCP over TCP (which is what you’d end up with if you wanted to run VXLAN over TCP) is
a really bad idea. This paper describes some of the nitty-gritty details, or you could just google for
TCP-over-TCP.
Some history: The last protocol stacks that had reliable layer-2 transport were SNA and X.25. SDLC
or LAPB (for WAN links) and LLC2 (for LAN connections) were truly reliable – LLC2 peers
acknowledged every L2 packet ... but even LLC2 was running across Token Ring or Ethernet bridges
that were never truly reliable. We used reliable SNA-over-TCP/IP WAN transport (RSRB and later
DLSW+) simply because the higher error rates experienced on WAN links (transmission errors and
packet drops) caused LLC2 performance problems if we used plain source-route bridging.
The blog post also describes why it makes more sense to transport virtual network traffic over UDP
than over GRE.
If you’re still wondering why we need VXLAN and NVGRE, read my VXLAN post (and the one
describing how VXLAN, OTV and LISP fit together), watch the Introduction to Virtual
Networking webinar recording or read the Introduction section of the NVGRE draft.
It’s obvious the NVGRE draft was a rushed affair; its only significant and original contribution to
knowledge is the idea of using the lower 24 bits of the GRE key field to indicate the Tenant Network
Identifier (but then, lesser ideas have been patented time and again). Like with VXLAN, most of the
real problems are handwaved away and deferred to other or future drafts.
The NVGRE approach is actually more scalable than the VXLAN one because it does not mandate the
use of flooding-based MAC address learning. Even more, NVGRE acknowledges that there might be
virtual L2 networks that will not use flooding at all (like Amazon EC2).
Mapping between TNIs (virtual segments) and IP multicast addresses will be specified in a
future version of this draft. VXLAN “solves” the problem by delegating it to the management layer.
VXLAN ignores the MTU problem and relies on jumbo frames. This might be a reasonable approach
assuming VXLAN would stay within a Data Center (keep dreaming, vendors involved in VXLAN are
already peddling long-distance VXLAN snake oil).
ECMP-based load balancing is the only difference between NVGRE and VXLAN worth mentioning.
VXLAN uses UDP encapsulation and pseudo-random values in UDP source port (computed by
hashing parts of the inner MAC frame), resulting in automatic equal-cost load balancing in every
device that uses 5-tuple to load balance.
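Here is a rough sketch of the idea in Python (the hash function and the header fields being hashed are illustrative; individual VTEP implementations pick their own):

import zlib

def vxlan_source_port(inner_frame: bytes) -> int:
    # Hash the inner MAC header (some implementations also hash the inner 5-tuple)
    digest = zlib.crc32(inner_frame[:14])
    # Map the hash into the ephemeral port range so every inner flow gets
    # a different outer UDP source port - and thus a different ECMP path
    return 49152 + (digest % 16384)

inner = bytes.fromhex("005056000001005056000002" + "0800")   # made-up inner MAC header
print(vxlan_source_port(inner))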
GRE is harder to load balance, so the NVGRE draft proposes an interim solution using multiple IP
addresses per endpoint (hypervisor host) with no details on the inter-VM-flow-to-endpoint-IP-
address mapping. The final solution?
OK, might even work. But do the switches support it? Oh, don’t worry...
A diverse ecosystem play is expected to emerge as more and more devices become
multitenancy aware.
I know they had to do something different from VXLAN (creating another UDP-based scheme and
swapping two fields in the header would be a too-obvious me-too attempt), but wishful thinking like
this usually belongs to another type of RFCs.
SUMMARY
Two (or more) standards solving a single problem seems to be the industry norm these days. I’m
sick and tired of the obvious me-too/I’m-different/look-who’s-innovating ploys. Making matters
worse, VXLAN and NVGRE are half-baked affairs today.
VXLAN has no control plane and relies on IP multicast and flooding to solve MAC address learning
issues, making it less suitable for very large scale or inter-DC deployments.
NVGRE has the potential to be a truly scalable solution: it acknowledges there might be need for
networks not using flooding, and at least mentions the MTU issues, but it has a long way to go
before getting there. In its current state, it’s worse than VXLAN because it’s way more
underspecified.
The three drafts (VXLAN, NVGRE and STT) have the same goal: provide emulated layer-2 Ethernet
networks over scalable IP infrastructure. The main difference between them is the encapsulation
format and their approach to the control plane:
VXLAN ignores the control plane problem and relies on flooding emulated with IP multicast;
NVGRE authors handwave over the control plane issue (“the way to obtain [MAC-to-IP mapping]
information is not covered in this document”);
STT authors claim the draft describes just the encapsulation format.
Everything else being equal, why does STT make sense at all? The answer lies in the capabilities of
modern server NICs.
Modern NICs allow the TCP stacks to offload some of the heavy lifting to the hardware – most
commonly the segmentation and reassembly (retransmissions are still performed in software).
LSO (Large Send Offload) significantly increases TCP performance. If you don't believe me (and you shouldn't), run
iperf tests on your server with TCP offload turned on and off (and report your results in a comment).
The reality behind the scenes is a bit more complex: what gets handed to the NIC is an oversized
TCP-IP-MAC frame (up to 64K long) with STT-IP-MAC header. The “TCP” segments produced by the
NIC are thus not the actual TCP segments, but segments of STT frame passed to the NIC.
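A simplified sketch of what the NIC’s LSO engine does with that oversized frame (made-up constants; real STT processing obviously includes proper TCP-like headers and checksums):

def lso_segment(stt_frame: bytes, mss: int = 1460):
    # Split the oversized STT payload into MSS-sized wire segments; every segment
    # reuses the same port pair and advances the sequence number, so the receiving
    # NIC (or software) can stitch the original STT frame back together
    segments = []
    offset = 0
    while offset < len(stt_frame):
        header = {"sport": 7471, "dport": 7471, "seq": offset}   # illustrative values
        segments.append((header, stt_frame[offset:offset + mss]))
        offset += mss
    return segments

print(len(lso_segment(bytes(64 * 1024))))   # a 64K STT frame becomes ~45 wire segments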
Randy Bush called this approach to standard development “throwing spaghetti at the wall to see
what sticks”, which is definitely an amusing pastime… unless you happen to be the wall.
RFC 4023 specifies two methods of MPLS-in-IP encapsulation: MPLS label stack on top of IP (using
IP protocol 137) and MPLS label stack on top of GRE (using MPLS protocol type in GRE header). We
could use either one of these and use either the traditional MPLS semantics or misuse MPLS label as
virtual network identifier (VNI). Let’s analyze both options.
It’s also questionable whether the existing hardware would be able to process MAC-in-MPLS-in-GRE-
in-IP packets, which would be the only potential benefit of this approach. I know that some
(expensive) linecards in Catalyst 6500 can process IP-in-MPLS-in-GRE packets (as do some switches
from Juniper and HP), but can they process MAC-in-MPLS-in-GRE? Who knows.
Finally, like NVGRE, MPLS-over-GRE or MPLS-over-IP framing with MPLS label being used as the VNI
lacks entropy that could be used for load balancing purposes; existing switches would not be able to
load balance traffic between two hypervisor hosts unless each hypervisor host used multiple IP
addresses.
No hypervisor vendor is willing to stop supporting L2 virtual networks because they just might be
required for “mission-critical” craplications running over Microsoft’s Network Load Balancing, so
we can’t use L3 MPLS VPN.
Summer 2014: EVPN is becoming a viable standard, and is used by Juniper Contrail and
Nuage VSP to integrate overlay layer-2 segments with external layer-2 gateways.
MPLS-based VPNs require a decent control plane, including control-plane protocols like BGP, and
that would require some real work on hypervisor soft switches. Implementing an ad-hoc solution
like VXLAN based on doing-more-with-less approach (= let’s push the problem into someone
else’s lap and require IP multicast in network core) is cheaper and faster.
Juniper Contrail and Nuage VSP implemented MPLS/VPN control plane (including MP-BGP) in
their controller, and use simpler protocols (Contrail: XMPP, VSP: OpenFlow) to distribute the
forwarding information to the hypervisor virtual switches.
There are two shipping implementations: Juniper Contrail and Nuage VSP. Both are coming
from traditional networking vendors that already had field-tested MPLS protocol stack. Cisco
is talking about adding EVPN support to Nexus 1000V.
Not surprisingly, some ToR switch vendors abuse this fear to the point where they look downright
stupid (but I guess that’s their privilege), so let’s set the record straight.
The tunnels mentioned above are point-to-point GRE (or STT or VXLAN) tunnel interfaces between
Linux-based hypervisors. VXLAN implementations on Cisco Nexus 1000V, VMware vCNS or
(probably) VMware NSX for vSphere don’t use tunnel interfaces (or at least we can’t see them from
the outside).
The P2P overlay tunnels are an artifact of OpenFlow-based forwarding implementation in Open
vSwitch. OpenFlow forwarding model assumes point-to-point interfaces (switch-to-switch or switch-
to-host links) and cannot deal with multipoint interfaces (mGRE tunnels in Cisco IOS parlance).
OpenFlow controller (Nicira NVP) thus cannot set the transport network next hop (VTEP in VXLAN)
on a multi-access tunnel interface in a forwarding rule; the only feasible workaround is to create
numerous P2P tunnel interfaces, associating one (or more) of them with every potential destination
VTEP.
Absolutely not. They are auto-provisioned by one of the Open vSwitch daemons (using ovsdb-
proto), exist only on Linux hosts, and add no additional state to the transport network (apart from
the MAC and ARP entries for the hypervisor host which the transport network has to have anyway).
Short summary: Yes. The real scalability bottleneck is the controller and the number of hypervisor
hosts it can manage.
Every hypervisor host has only the tunnels it needs. If a hypervisor host runs 50 VMs and every VM
belongs to a different logical subnet with another 50 VMs in the same subnet (scattered across 50
other hypervisor hosts), the host needs 2500 tunnel interfaces going to 2500 destination VTEPs.
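The worst-case arithmetic, as a quick sketch using the numbers from the previous paragraph:

vms_per_host = 50
remote_hosts_per_subnet = 50   # every VM's subnet peers live on 50 other hypervisors

# Worst case: the peer sets don't overlap at all
tunnels_per_host = vms_per_host * remote_hosts_per_subnet
print(tunnels_per_host)        # 2500 point-to-point tunnels - all of them software constructs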
In recent releases of Open vSwitch, the tunnel interfaces remain a pure software construct
within Open vSwitch implementation – they are not Linux kernel interfaces.
As I wrote in the introductory paragraph – it’s pure FUD created by hardware vendors. Now that you
know what’s going on behind the scenes, lean back and enjoy every time someone mentions it (and you
might want to ask a few pointed questions ;).
Usually a single vendor delivers an inconsistent set of QoS features that vary from platform to
platform (based on the ASIC or merchant silicon used) or even from linecard to linecard (don’t even
mention Catalyst 6500). Sometimes you need different commands or command syntax to configure
QoS on different platforms from the same hardware vendor.
DO WE NEED QOS?
Maybe not. Maybe it’s cheaper to build a leaf-and-spine fabric with more bandwidth than your
servers can consume. Learn from the global Internet - everyone talks about QoS, but the emperor is
still naked.
In reality, the classification is usually done on the ingress network device, because we prefer playing
MacGyvers instead of telling our customers (= applications) “what you mark is what you get”.
Finally, there are the poor souls that do QoS classification and marking in the network core because
someone bought them edge switches that are too stupid to do it.
vDS in vSphere 5.1 has minimal QoS support: per-pool 802.1p marking and queuing;
Nexus 1000V has a full suite of classification, marking, policing and queuing tools. It also copies
inner DSCP and CoS values into VXLAN+MAC envelope;
VMware NSX (the currently shipping NVP 3.1 release) uses a typical service provider model: you
can define minimal (affecting queuing) and maximal (triggering policing) bandwidth per VM,
accept or overwrite DSCP settings, and copy DSCP bits from virtual traffic into the transport
envelopes;
vDS in vSphere 5.5 has a full 5-tuple classifier and CoS/DSCP marking. NSX for vSphere uses
vDS and probably relies on its QoS functionality.
In my opinion, you get pretty much par for the course with the features of Nexus 1000V, VMware
NSX or vSphere 5.5 vDS, and you get DSCP-based classification of overlay traffic with VMware NSX
and Nexus 1000V.
It is true that you won’t be able to do per-TCP-port classification and marking of overlay virtual
traffic in your ToR switch any time soon (but I’m positive there are at least a few vendors working
on it).
It’s also true that someone will have to configure classification and marking on the new network
edge (in virtual switches) using a different toolset, but if that’s an insurmountable problem, you
might want to start looking for a new job anyway.
The solution is also well known – color the elephants pink (aka DSCP marking) and sort them into a
different queue – until the reality intervenes.
It seems oh-so-impossible to figure out which applications might generate elephant flows and mark
them accordingly on the originating server; there’s no other way to explain the need for traffic
classification and marking on the ingress switch, and other MacGyver contraptions the networking
team uses to make sure “it’s not the network’s fault” instead of saying “we’re a utility – you’re
getting exactly what you’ve asked for.”
Matching TCP and UDP port numbers on the server (because FTP sessions tend to be more
elephantine than DNS requests) and setting DSCP values of outbound packets is also obviously a
mission-impossible for some people; it’s way easier to pretend the problem doesn’t exist and blame
the network for lack of proper traffic classification.
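For the record, marking traffic at the source really is a one-liner; here is a hedged Linux-centric Python sketch (the DSCP value and the destination are arbitrary examples, not a recommendation):

import socket

DSCP_BULK = 10                        # example DSCP value for bulk/backup traffic

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# The IP TOS byte carries DSCP in its upper six bits - set it before sending data
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_BULK << 2)
sock.connect(("192.0.2.10", 873))     # hypothetical backup server
sock.sendall(b"bulk data ...")
sock.close()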
Anyway, the situation gets worse in environments with truly unclassifiable traffic (as the ultimate
abomination imagine a solution doing backups over HTTP) where it’s impossible to separate elephants
from mice based on their TCP/UDP port numbers.
If, however, one would have insight into the operating system TCP buffers, or measure per-flow
rate, one might be able to figure out which flows exhibit overweight tendencies – and that’s exactly
what the Open vSwitch (OVS) team did.
Additionally, OVS appears as a TCP-offload-capable NIC to the virtual machines, and the bulk
applications happily dump megabyte-sized TCP segments straight into the output queue of the VM
NIC, where it’s easy for the underlying hypervisor software (OVS) to spot them and mark them with
a different DSCP value (this idea is marked as pending in Martin Casado’s presentation).
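A hedged sketch of that heuristic (thresholds and DSCP values are invented; the actual OVS logic is more elaborate):

ELEPHANT_SEGMENT_SIZE = 32 * 1024     # a VM handing down 32K+ TSO segments is probably a bulk flow
ELEPHANT_DSCP = 8                     # example: remark elephants into a scavenger class

def outer_dscp(segment_len: int, vm_dscp: int) -> int:
    # DSCP value the vSwitch copies into the transport (outer) header
    if segment_len >= ELEPHANT_SEGMENT_SIZE:
        return ELEPHANT_DSCP          # pink elephant: different queue in the fabric
    return vm_dscp                    # mice keep whatever the VM asked for

print(outer_dscp(65000, vm_dscp=0))   # -> 8
print(outer_dscp(1460, vm_dscp=46))   # -> 46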
The results (documented in a presentation) shouldn’t be surprising – we know ping isn’t affected by
an ongoing FTP transfer if they happen to be in different queues since the days Fred Baker proudly
presented the first measurement results of the then-revolutionary Weighted Fair Queuing
mechanism (this is the only presentation I could find, but WFQ already existed in late 1995) at some
mid- ‘90s incarnation of Cisco Live (probably even before the days Cisco Live was called
Networkers).
The OVS-based elephant identification is a cool idea, although one has to wonder how well it works
in practice if it measures the flow rate (see also OVS scaling woes).
Now let’s step back and ask a fundamental question: how much bandwidth do we need?
Disclaimer: If you’re running a large public cloud or anything similarly sized, this is not the
post you’re looking for.
Let’s assume:
We have a mid-sized workload of 10,000 VMs (that’s probably more than most private clouds see,
but let’s err on the high side);
The average long-term sustained network traffic generated by every VM is around 100 Mbps (I
would love to see a single VM that’s not doing video streaming or network services doing that,
but that’s another story).
If that’s not enough (or you think you should take into account traffic peaks), take a pair of Nexus
6000s or build a leaf-and-spine fabric.
In many cases VMs have to touch storage to deliver data to their clients, and that’s where the real
bottleneck is. Assuming only 10% of the VM-generated data comes from the spinning rust (or SSDs)
I’d love to see the storage delivering sustained average throughput of 100 Gbps.
Based on these figures, the total bandwidth needed in the data center is 200 Gbps. Adjust the
calculation for your specific case, but I don’t think many of you will get above 1-2 Tbps.
... but that has nothing to do with overlay virtual networks – if anything of the above is true you
have a problem regardless of what you run in your data center.
Brad has some other interesting ideas, for example “L2 doesn’t matter anymore” (absolutely agree),
so make sure you read the whole article.
Let’s take a step back. Brad started his reasoning by comparing data center fabrics with physical
switches, saying “We don’t need to engineer the switch” and “We don’t worry too much about how
this internal fabric topology or forwarding logic is constructed, or the consequential number of
hops.”
Well, we don’t until we stumble across an oversubscribed linecard, or a linecard design that allows
us to send a total of 10Gbps through four 10GE ports. The situation gets worse when we have to
deal with stackable switches, where it matters a lot whether the traffic has to traverse the stacking
cables or not (not to mention designs where switches in a stack are connected with one or a few
regular links).
Then there’s also the question of costs. Given infinite budget, it’s easy to build very large fabrics
that give the location doesn’t matter illusion to as many servers or virtual machines as needed.
Some of us are not that lucky; we have to live with fixed budgets, and we’re usually caught in a
catch-22 situation. Wasting bandwidth to support spaghetti-like traffic flows costs real CapEx money
(not to mention not-so-very-cheap maintenance contracts); trying to straighten the cooked spaghetti
continuously being created by virtualized workloads generates OpEx costs – you have to figure out
which one costs you less in the long run.
Last but not least, very large fabrics are more expensive (per port) than smaller ones due to
increased number of Clos stages, so you have to stop somewhere – supporting constant
bandwidth/latency across the whole data center is simply too expensive.
I’m positive Brad knows all that, as do a lot of very smart people doing large-scale data center
designs. Unfortunately, not everyone will get the right message, and a lot of people will subscribe
to the “traffic flows don’t matter anymore” mantra without understanding the underlying
assumptions (like they did to the “stretched clusters make perfect sense” one), and get burnt in the
process because they’ll deploy workloads across uneven fabrics or even across lower-speed links.
There’s the “minor” annoyance of CoS or DSCP packet marking, but let’s ignore that detail.
Conclusions:
These solutions SHOULD decrement TTL like any other router (or layer-3 switch) would. However, if
they wish to stay as close to the emulated Ethernet behavior as possible, they SHOULD decrement TTL
if and only if the packet crosses a subnet boundary (otherwise you might get crazy problems with
application software that sends packets with TTL = 1).
For example, Hyper-V Network Virtualization SHOULD NOT decrement TTL if the source and
destination VM belong to the same subnet (even though the HNV module actually performs L3
lookup to figure out where to send the packet) but SHOULD decrement TTL if the destination VM
belongs to another IPv4 or IPv6 subnet.
Like in the layer-2 case, the transport TTL has nothing in common with the VM-generated TTL –
hypervisors should use whatever TTL they need to get the encapsulated traffic across the data
center fabric.
Conclusions:
MPLS-based L3VPN (the “original” MPLS/VPN) is a totally different story: it’s not supposed to
emulate a single virtual router, but a whole WAN. Copying customer TTL into provider TTL (and vice
versa) is the most natural thing to do under those circumstances (unless the provider wants to hide
the internal network details).
If you design an L3 IP transport network, i.e. L3 from access to aggregation, but you wanted
to use vMotion, then how could you do that unless you used an overlay technology such
as VXLAN to extend the VLAN across the underlying IP network?
My somewhat imprecise claims often get me in trouble (this wouldn’t be the first time), let me try to
straighten things out.
vMotion requires
L2 adjacency between the source and target hypervisor hosts for the port group in which the VM
resides. Without the L2 adjacency you cannot move a live IP address and retain all sessions
(solutions like Enterasys’ host routing are an alternative if you don’t mind longer traffic
interruptions caused by routing protocol convergence time);
In other words, when you move a VM, it must reside in the same L2 segment after the move (the
source and target hypervisor hosts can be in different subnets). You can implement that
requirement with VLANs (which require end-to-end L2 connectivity) or VXLAN (which can emulate L2
segments across L3 infrastructure).
IN THIS CHAPTER:
VMware NSX;
VXLAN on Cisco Nexus 1000V;
Hyper-V Network Virtualization;
OpenStack Neutron;
Amazon VPC;
Midokura Midonet.
You’ll find even more details in the Overlay Virtual Networking webinar.
Figure 3-1: Overlay virtual networking solution overview (from Cloud Computing Networking webinar)
Answer#2: A merger of Nicira NVP and VMware vCNS (a product formerly known as vShield).
Oh, and did I mention it’s actually two products, not one?
VMware NSX for multi-hypervisor environment is Nicira NVP with ESXi and VXLAN
enhancements:
OVS-in-VM approach has been replaced with an NSX vSwitch within the ESXi kernel;
VMware NSX supports GRE, STT and VXLAN encapsulation, with VXLAN operating in unicast
mode with either source node or service node packet replication. The unicast mode is not
compatible with Nexus 1000V VXLAN unicast mode;
NSX unicast VXLAN implementation will eventually work with third-party VTEPs (there’s usually a
slight time gap between a press release and a shipping product) using ovsdb-proto as the control
plane.
Use cases: OpenStack and CloudStack deployments using Xen, KVM or ESXi hypervisors.
In NSX for vSphere, the overall architecture looks similar to Nicira NVP, but it seems there’s no OVS
or OpenFlow under the hood.
Hypervisor virtual switches are based on vDS switches; VXLAN encapsulation, distributed
firewalls and distributed layer-3 forwarding are implemented as loadable ESXi kernel module.
NVP controllers run in virtual machines and are tightly integrated with vCenter through NSX
manager (which replaces vShield Manager);
Distributed layer-3 forwarding uses a central control plane implemented in NSX Edge Distributed
Router, which can run BGP or OSPF with the outside (physical) world;
Another variant of NSX Edge (Services Router) provides centralized L3 forwarding, N/S firewall,
load balancing, NAT, and VPN termination;
Most components support IPv6 (hooray, finally!).
The Nicira NVP roots of NSX are evident. It’s also pretty easy to see how individual NSX
components map into vCNS/vShield Edge: NSX Edge Services Router definitely looks like vShield
Edge on steroids and the distributed firewall is probably based on vShield App.
Unfortunately, it seems that the goodies from vSphere version of NSX (routing protocols, in-kernel
firewall) won’t make it to vCNS 5.5 (but let’s wait and see how the packaging/licensing looks when
the products launch).
The only problem I see is the breadth of the offering. VMware has three semi-competing partially
overlapping products implementing overlay virtual networks:
NSX for multi-hypervisor environment using NVP controllers, NVP gateways and OVS (for Linux
and ESXi environment);
NSX for vSphere using NVP controllers, vSphere kernel modules and NSX edge gateways;
vCNS with vShield App firewall and vShield Edge firewall/load balancer/router.
It will be fun to see how the three products evolve in the future and how the diverging code base
will impact feature parity.
To learn more about NSX architecture, watch the videos from the free VMware NSX Architecture
webinar sponsored by VMware.
COMPONENTS
VMware NSX for multiple hypervisors release 4.0 (NSX for the rest of this blog post) relies on Open
vSwitch (OVS) to implement hypervisor soft switching. OVS could use dynamic MAC learning (and it
does when used with OpenStack OVS Quantum plugin) or an external OpenFlow controller.
NSX uses a cluster of controllers (currently 3 or 5) to communicate with OVS switches (OVS
switches can connect to one or more controllers with automatic failover). It uses two protocols to
communicate with the switches: OpenFlow to download forwarding entries into the OVS and ovsdb-
proto to configure bridges (datapaths) and interfaces in OVS.
NSX OpenFlow controller has to download just a few OpenFlow entries into the two Open vSwitches
to enable the communication between the two VMs (for the moment, we’re ignoring BUM flooding).
NVP controller must tell the ovsdb-daemon on all three hosts to create new tunnel interfaces and
connect them to the correct OVS datapath;
NVP controller downloads new flow entries to OVS switches on all three hosts.
BUM FLOODING
NVP supports two flooding mechanisms within a virtual layer-2 segment:
Flooding through a service node – all hypervisors send the BUM traffic to a service node (an
extra server that can serve numerous virtual segments) which replicates the traffic and sends it to
all hosts within the same segment. We would need a few extra tunnels and a handful of OpenFlow
entries to implement the service node-based flooding in our network:
If the above description causes you heartburn caused by ATM LANE flashbacks, you’re not
the only one ... but obviously the number of solutions to a certain networking problem isn’t
infinite.
These are the flow entries that an NVP controller would configure in our network when using source
node replication:
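As a rough approximation (OpenFlow-style pseudo-entries with invented MAC addresses, VTEP addresses and port names – not actual NVP output), the entries on one host boil down to “known unicast goes into one tunnel, BUM traffic gets copied into every tunnel toward a VTEP that has VMs in this segment”:

# Hypothetical flow entries on host A for one logical segment spanning three hosts
flows_host_a = [
    # Known unicast MAC addresses: forward into the tunnel toward the owning VTEP
    {"match": {"dl_dst": "52:54:00:00:00:0b"}, "actions": ["output:stt-to-10.0.1.2"]},
    {"match": {"dl_dst": "52:54:00:00:00:0c"}, "actions": ["output:stt-to-10.0.1.3"]},
    # Broadcast (and similarly multicast): replicate into local VIFs and every remote tunnel
    {"match": {"dl_dst": "ff:ff:ff:ff:ff:ff"},
     "actions": ["output:local-vifs",
                 "output:stt-to-10.0.1.2",
                 "output:stt-to-10.0.1.3"]},
]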
The implementation details (usually hidden behind the scenes) vary widely, and I’ll try to document
at least some of them in a series of blog posts, starting with VMware NSX.
LAYER-2 FORWARDING
VMware NSX supports traditional layer-2 segments with proper flooding of BUM (Broadcast,
Unknown unicast, Multicast) frames. NSX controller downloads forwarding entries to individual
virtual switches, either through OpenFlow (NSX for multiple hypervisors) or a proprietary protocol
(NSX for vSphere). The forwarding entries map destination VM MAC addresses into destination
hypervisor (or gateway) IP addresses.
On top of static forwarding entries downloaded from the controller, virtual switches perform dynamic
MAC learning for MAC addresses reachable through layer-2 gateways.
Layer-3 lookup is always performed by the ingress node (hypervisor host or gateway); packet
forwarding from ingress node to egress node and destination host uses layer-2 forwarding. Every
ingress node thus needs (for every tenant):
IP routing table;
ARP entries for all tenant’s hosts;
MAC-to-underlay-IP mappings for all tenant’s hosts (see layer-2 forwarding above).
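Putting the three tables together, a minimal sketch of the lookups an ingress node performs (addresses and data structures are invented for illustration; they are not the actual NSX internals):

import ipaddress

routes   = {"10.1.1.0/24": "direct", "0.0.0.0/0": "10.1.1.254"}   # tenant routing table
arp      = {"10.1.1.7": "52:54:00:aa:00:07",                      # tenant ARP entries
            "10.1.1.254": "52:54:00:aa:00:fe"}
mac2vtep = {"52:54:00:aa:00:07": "172.16.3.11",                   # MAC-to-underlay-IP mappings
            "52:54:00:aa:00:fe": "172.16.3.20"}

def ingress_lookup(dst_ip):
    # Checked in order for brevity; a real implementation does longest-prefix matching
    for prefix, next_hop in routes.items():
        if ipaddress.ip_address(dst_ip) in ipaddress.ip_network(prefix):
            nh  = dst_ip if next_hop == "direct" else next_hop
            mac = arp[nh]                       # layer-3 part: next-hop MAC address
            return mac, mac2vtep[mac]           # layer-2 part: destination hypervisor
    raise ValueError("no route")

print(ingress_lookup("10.1.1.7"))    # local subnet -> VTEP 172.16.3.11
print(ingress_lookup("192.0.2.1"))   # via default gateway -> VTEP 172.16.3.20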
NSX for vSphere implements layer-3 forwarding in a separate vSphere kernel module. The User
World Agent (UWA) running within the vSphere host uses proprietary protocol (mentioned above) to
get layer-3 forwarding information (routing tables) and ARP entries from the controller cluster. ARP
entries are cached in the layer-3 forwarding kernel module, and cache misses are propagated to the
controller.
NSX for multiple hypervisors implements layer-3 forwarding data plane in OVS kernel module,
but does not use OpenFlow to install forwarding entries.
A separate layer-3 daemon (running in user mode on the hypervisor host) receives forwarding
information from NSX controller cluster through OVSDB protocol, and handles all ARP processing
(sending ARP requests, caching responses …) locally.
You can use a VMware NSX Edge Services Router (ESR) to connect multiple VXLAN-backed layer-2
segments within an application stack. You would configure the services router through NSX Manager
(improved vShield Manager), and you’d get a VM connected to multiple VXLAN-based port groups
(and probably one or more VLAN-based port groups) behind the scenes.
In this scenario, VXLAN kernel modules resident in individual vSphere hosts perform layer-2
forwarding, sending packets between VM and ESR NICs. ESR performs layer-3 forwarding within the
VM context.
In those environments you might collapse multiple subnets into a single layer-2 segment (assuming
your security engineers approve the change in security paradigm introduced with VM NIC firewalls)
or use distributed routing functionality of VMware NSX.
Note: Open vSwitch 1.11 added support for megaflows (OpenFlow flow entries copied directly into
the kernel packet forwarding module), replacing the per-flow (microflow) kernel entries with
per-OpenFlow-flow-entry entries. The following text has been updated to reflect the new functionality.
Does that mean there’s no packet punting in the NSX/Open vSwitch world? Not so fast.
First, to set the record straight, NVP OpenFlow controller (NSX controller cluster) does not touch
actual packets. There’s no switch-to-controller punting; NVP has enough topology information to
proactively download OpenFlow flow entries to Open vSwitch (OVS).
However, Open vSwitch has two components: the user-mode daemon (process switching in Cisco
IOS terms) and the kernel forwarding module, which implements a simple flow cache (flow matching,
forwarding and corresponding actions) but not the full complement of OpenFlow matching rules.
Whenever the first packet of a new flow passes through the Open vSwitch kernel module, it’s sent to
the Open vSwitch daemon, which evaluates the OpenFlow rules downloaded from the OpenFlow
controller, accepts or drops the packet, and installs the corresponding forwarding rule into the kernel
module.
Open vSwitch 2.x user mode daemon copies OpenFlow matching rules to the kernel module
instead of creating per-flow entries. The initial packet matching a new OpenFlow rule is still
forwarded through the user-mode daemon.
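A condensed Python sketch of that fast-path/slow-path split (pure illustration; the real code lives in the OVS kernel datapath and ovs-vswitchd):

kernel_cache = {}                     # (mega)flow cache consulted for every packet

def kernel_receive(packet, slow_path):
    key = (packet["in_port"], packet["dl_dst"])
    if key in kernel_cache:
        return kernel_cache[key]      # forwarded entirely within the kernel module
    actions = slow_path(packet)       # miss: upcall to the user-mode daemon
    kernel_cache[key] = actions       # daemon installs the entry; later packets stay in-kernel
    return actions

def vswitchd_lookup(packet):
    # Evaluate the OpenFlow rules previously downloaded from the controller
    rules = [{"match": {"dl_dst": "52:54:00:00:00:0b"}, "actions": ["output:2"]},
             {"match": {}, "actions": ["drop"]}]           # table-miss rule
    for rule in rules:
        if all(packet.get(k) == v for k, v in rule["match"].items()):
            return rule["actions"]

pkt = {"in_port": 1, "dl_dst": "52:54:00:00:00:0b"}
print(kernel_receive(pkt, vswitchd_lookup))   # first packet: slow path
print(kernel_receive(pkt, vswitchd_lookup))   # subsequent packets: kernel cache hit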
Does this sound similar to Multi-Layer Switching or the way Cisco’s VSG and Nexus 1000V VEM
work? It’s exactly the same concept, implemented in kernel/user space of a single hypervisor host.
There really is nothing new under the sun.
I would strongly recommend you read the well written developer documentation if you want
to know the dirty details.
This approach keeps the kernel module simple and tidy, and allows the Open vSwitch architecture to
support other flow programming paradigms, not just OpenFlow – you can use OVS as a simple
learning bridge supporting VLANs, sFlow and NetFlow (not hard once you’ve implemented per-flow
forwarding), or you could implement your own forwarding paradigm while leveraging the stability of
the Open vSwitch kernel module, which has been included in the Linux kernel since version 3.3 and
has already made its way into standard Linux distributions.
I won’t bore you with the configuration process. Let’s just say that I got mightily annoyed with the
mandatory mouse chasing skills, confirmed every single CLI-versus-GUI prejudice I ever had, but
nonetheless managed to get OSPF and BGP running on an NSX Edge appliance. Here’s what I
configured:
OSPF routing process with area 0 on the external interface and route redistribution of connected
routes into OSPF;
BGP routing process with an IBGP neighbor and route redistribution of connected routes into
BGP.
As you can see, they still have plenty of work to do (example: the subnet length is missing in the
BGP table printout), but the code is still a few months from being shipped, so I’m positive they’ll fix
the obvious gotchas in the meantime.
Time to deploy the second appliance to see whether all this fun stuff actually works. It does.
NSX Edge OSPF process inserts some funky stuff into the OSPF database (you might want to check
how that impacts other OSPF gear before deploying NSX Edge in production environment) and it
seems type-5 LSAs are not displayed (probably a bug).
...and the routing and forwarding tables look OK. The whole thing just might work outside of a lab
environment.
Finally, does it make sense to run routing protocols on L4-7 appliances? If you ever spent hours
debugging a static route pointing in a wrong direction you know the answer.
UNICAST-ONLY VXLAN
The initial VXLAN design and implementation took the traditional doing-more-with-less approach:
VXLANs behave exactly like VLANs (including most of the scalability challenges VLANs have) and rely
on a third-party tool (IP multicast) to solve the hard problems (MAC address learning) that both Nicira
and Microsoft solved with control-plane solutions.
Unicast-only VXLAN comes closer to what other overlay virtual networking vendors are doing: the
VSM knows which VEMs have VMs attached to a particular VXLAN segment and distributes that
information to all VEMs – each VEM receives a per-VXLAN list of destination IP addresses to use for
flooding purposes.
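A minimal sketch of what the distributed information amounts to (VNIs, VTEP addresses and the UDP port are illustrative):

import socket

# Distributed by the VSM: which VTEPs have VMs in which VXLAN segment
flood_lists = {5001: ["172.16.1.11", "172.16.1.12", "172.16.1.23"],
               5002: ["172.16.1.12"]}

def flood_bum_frame(vni, encapsulated_frame, local_vtep):
    # Head-end (source node) replication: unicast one copy to every remote VTEP
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for vtep in flood_lists.get(vni, []):
        if vtep != local_vtep:
            sock.sendto(encapsulated_frame, (vtep, 4789))   # 4789 = IANA-assigned VXLAN port
    sock.close()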
OTHER GOODIES
Cisco also increased the maximum number of VEMs a single VSM can control to 128, and the
maximum number of virtual ports per VSM (DVS) to 4096.
DOES IT MATTER?
Sure it does. The requirement to use IP multicast to implement VXLAN flooding was a major
showstopper in data centers that have no other need for IP multicast (almost everyone apart from
financial institutions dealing with multicast-based market feeds). Unicast-only VXLAN will definitely
simplify VXLAN deployments and increase its adoption.
THE CAVEATS
The original VXLAN proposal was a data-plane-only solution – boxes from different vendors (not that
there would be that many of them) could freely interoperate as long as you configured the same IP
multicast group everywhere.
Unicast-only VXLAN needs a signaling protocol between VSM (or other control/orchestration entity)
and individual VTEPs. The current protocol used between VSM and VEMs is probably proprietary;
Cisco plans to use VXLAN-over-EVPN for inter-VSM connectivity, but who knows when the
Nexus 1000V code will ship. In the meantime, you cannot connect a VXLAN segment using unicast-
only VXLAN to a third-party gateway (example: Arista 7150).
Due to the lack of inter-VSM protocol, you cannot scale a single VXLAN domain beyond 128 vSphere
hosts, probably limiting the size of your vCloud Director deployment. In multicast VXLAN
environments the vShield Manager automatically extends VXLAN segments across multiple
distributed switches (or so my VMware friends are telling me); it cannot do the same trick in
unicast-only VXLAN environments.
Unicast-based flooding. The first HNV release did not need flooding – all the necessary information
was provided by the orchestration system through HNV policies. Support of dynamic address
learning and customer-owned DHCP servers obviously requires flooding of DHCP requests and ARP
requests/replies.
Performance improvements. Lack of TCP offload is the biggest hurdle in overlay network
deployments (that’s why Nicira decided to use STT). HNV will include NVGRE Task Offload in WS
2012 R2 and Emulex and Mellanox have already announced NVGRE-capable NICs. Mellanox
performance numbers mentioned in the Deep Dive video claim 10GE linerate forwarding (2 x
improvement) while reducing CPU overhead by a factor of 6.
HNV will also be able to do smarter NIC teaming and load balancing, resulting in better utilization of
all server NICs.
Built-in gateways. WS 2012 R2 distribution will include simple NVGRE-to-VLAN gateway similar to
early vShield Edge (VPN concentrator, NAT, basic L3 forwarding). F5 has announced NVGRE
gateways support, but as always I’ll believe it when the product documentation appears on their
web site.
Improved diagnostics. Next release of HNV will include several interesting troubleshooting tools:
Ability to ping provider network IP address from customer VM, ability to insert or intercept traffic in
customer network (example: emulate pings to external destinations), and cloud administrator
access to customer VM traffic statistics.
Stateful VM NIC firewalls. Windows Server 2012 included some basic VM NIC filtering
functionality. Release 2 has built-in stateful firewall. It’s similar to vShield App or Juniper’s VGW – it
can create per-flow ACL entries for return traffic, but does not inspect TCP session validity or
perform IP/TCP reassembly.
Dynamic NIC teaming can spread a single TCP flow across multiple outbound NICs – a great
solution for I/O intensive applications that need more than 10GE per single flow (obviously it only
works with ToR switches that have 40GE uplinks; 10GE port channel uplinks on ToR switches would
quickly push all traffic of the same flow onto the same 10GE uplink).
Hyper-V Network Virtualization is now part of extensible switch. The initial release of HNV
was implemented as a device filter sitting between a physical NIC and the extensible switch. Switch
extensions had no access to HNV (just to customer VM traffic) as all the encap/decap operations
happened after the traffic has already left the extensible switch on its way toward the physical NIC.
Virtual RSS (vRSS) uses VMQ to extend Receive Side Scaling into VMs – traffic received by a VM
can spread across multiple vCPUs. Ideal for high-performance appliances (firewalls, load balancers).
Remote live monitoring similar to SPAN and ERSPAN, including traffic captures for offline analysis.
Network switch management. Microsoft is trying to extend their existing OMI network
management solutions into physical switches because we desperately need yet another switch
management platform ;)
MORE INFORMATION
Hyper-V Network Virtualization: Simply Amazing
What’s New in Windows Server 2012 R2 Networking (Microsoft blog post)
What’s New in Windows Server 2012 R2 Networking (TechEd video)
Deep Dive on Hyper-V Network Virtualization in Windows Server 2012 R2 (TechEd video)
How to Design and Configure Networking in Microsoft System Center - Virtual Machine
Manager and Hyper-V Part 1 (TechEd video)
How to Design and Configure Networking in Microsoft System Center - Virtual Machine Manager
and Hyper-V Part 2 (TechEd video)
Everything You Need to Know about the Software Defined Networking Solution from Microsoft
(TechEd video)
HNV ARCHITECTURE
Hyper-V Network Virtualization started as an add-on module (NDIS lightweight filter) for Hyper-V
3.0 extensible switch (it is fully integrated with the extensible switch in Windows Server 2012
R2).
Hyper-V extensible switch is a layer-2-only switch; Hyper-V network virtualization module is a layer-
3-only solution – an interesting mix with some unexpected side effects.
A distributed layer-3 forwarding architecture could use a single IP routing table to forward traffic
between IP hosts. Similar to traditional IP routing solutions, the end-user would configure directly
connected IP subnets and prefix routes (with IP next hops), and the virtual networking controller (or
the orchestration system) would add host routes for every reachable host. Forwarding within the
virtual domain would use host routes; forwarding toward external gateways would use configured IP
next hops (which would be recursively resolved from host routes).
Hyper-V network virtualization cannot use a pure layer-3 solution due to layer-2 forwarding within
the extensible switch – two VMs connected to the same VLAN within the same hypervisor would
communicate directly (without HNV involvement) and would exchange MAC addresses through ARP.
A traditional combined layer-2 + layer-3 forwarding solution would thus need three lookup structures:
IP routing table;
ARP table (mapping of IP addresses into MAC addresses);
MAC reachability information – outbound ports in a pure layer-2 world or destination transport IP
addresses in overlay virtual networks.
The Hyper-V extensible switch performs traditional layer-2 forwarding:
If the destination MAC address exists within the same segment, send the packet to the
destination VM;
Flood multicast or broadcast frames to all VMs and the uplink interface;
Send frames with unknown destination MAC addresses to the uplink interface.
Hyper-V network virtualization module intercepts packets forwarded by the extensible switch toward
the uplink interface and performs layer-3 forwarding and local ARP processing:
All ARP requests are answered locally using the information installed with the New-
NetVirtualizationLookupRecord cmdlet;
IP packets are forwarded to the destination hypervisor based on their destination IP address (not
destination MAC address);
Flooded frames, frames sent to unknown MAC addresses, and non-IP frames are dropped.
The true difference between intra-subnet and inter-subnet layer-3 forwarding is thus the destination
MAC address:
Intra-subnet IP packets are sent to the MAC address of the destination VM, intercepted by HNV
module, and forwarded based on destination IP address;
Inter-subnet IP packets are sent to the MAC address of the default gateway (virtual MAC address
shared by all HNV modules), also intercepted by HNV module, and forwarded based on
destination IP address (when the HNV module has a New-NetVirtualizationLookupRecord for the
destination IP address).
Summary: Even though it looks like Hyper-V Network Virtualization in Windows Server 2012 works
like any other L2+L3 solution, it’s a layer-3-only solution between hypervisors and a layer-2+layer-3
solution within a hypervisor.
HNV ARCHITECTURE
In Windows Server 2012 R2, Hyper-V Network Virtualization became an integral part of Hyper-V
Extensible Switch. It intercepts all packets traversing the switch and thus behaves exactly like a
virtual switch forwarding extension while still allowing another forwarding extension (Cisco Nexus
1000V or NEC PF1000) to work within the same virtual switch.
Hyper-V Network Virtualization module effectively transforms the Hyper-V layer-2 virtual switch into
a pure layer-3 switch.
The PowerShell scripts used to configure the HNV haven’t changed. The IP routing table is installed in
the Hyper-V hosts with the New-NetVirtualizationCustomerRoute PowerShell cmdlet. Mappings between
customer IP addresses and provider (transport) IP addresses are installed with the
New-NetVirtualizationLookupRecord cmdlet.
ARP PROCESSING
Hyper-V Network Virtualization module is an ARP proxy: it replies to all broadcast ARP requests and
multicast ND requests assuming it has the NetVirtualizationLookupRecord for the destination IP
address.
ARP requests for unknown IPv4 destinations and ND requests for unknown IPv6 destinations are
flooded if the virtual network contains layer-2-only NetVirtualizationLookupRecord entries (used to
implement dynamic IP addresses).
Unicast ARP/ND requests are forwarded to the destination VM to support IPv6 Network
Unreachability Detection (NUD) functionality.
PACKET FORWARDING
Hyper-V Network Virtualization module intercepts all packets received by the Hyper-V extensible
switch and drops all non-IP/ARP packets. ARP/ND packets are intercepted (see above), and the
forwarding of IP datagrams relies solely on the destination IP address:
Routing table lookup is performed to find the next-hop customer IP address. Every HNV module
has full routing table of the tenant virtual network – next-hop customer IP address is the
destination IP address for all destinations within the virtual network;
A lookup in NetVirtualizationLookupRecord table transforms next-hop customer IP address into
transport IP address;
When the destination transport IP address equals the local IP address (destination customer IP
address resides within the local host), HNV sends the packet back to the Hyper-V Extensible Switch,
which delivers the packet to the destination VM.
Extra lookup steps within the transport network are performed for non-local destinations:
1. Routing table lookup in the global IP routing table transforms transport destination IP address
into transport next-hop IP address;
2. ARP/ND table lookup transforms transport next-hop IP address into transport MAC address.
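Condensing the lookup chain above into a short sketch (made-up addresses; the real implementation obviously lives inside the HNV kernel module, not in Python):

lookup_records    = {"10.0.1.5": "192.168.10.11",    # customer IP -> provider (transport) IP
                     "10.0.2.9": "192.168.10.12"}
provider_arp      = {"192.168.10.12": "00:15:5d:00:00:12"}
local_provider_ip = "192.168.10.11"

def hnv_forward(dst_customer_ip):
    # Step 1: customer routing table lookup - within the virtual network the
    # next-hop customer IP simply equals the destination customer IP
    next_hop_ca = dst_customer_ip
    # Step 2: NetVirtualizationLookupRecord maps the customer IP into a provider IP
    provider_ip = lookup_records[next_hop_ca]
    if provider_ip == local_provider_ip:
        return "hand the packet back to the extensible switch (local VM)"
    # Remote destination: provider routing + ARP lookup, then encapsulate
    return "encapsulate toward %s (%s)" % (provider_ip, provider_arp[provider_ip])

print(hnv_forward("10.0.1.5"))   # destination VM lives on this host
print(hnv_forward("10.0.2.9"))   # destination VM lives on another host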
HNV has full support and feature parity for IPv4 and IPv6. Whenever this blog post
mentions IP, the behavior described applies equally well to IPv4 and IPv6.
Multicast and broadcast IP traffic is flooded. Flooding mechanism uses IP multicast in the transport
network if there’s a NetVirtualizationLookupRecord mapping destination multicast/broadcast
customer IP address into a transport multicast IP address, and source node packet replication in all
other cases.
As always, it helps to take a few steps back, focus on the principles, and the “unexpected” behavior
becomes crystal clear.
SAMPLE NETWORK
Let’s start with a virtual network that’s a bit more complex than a single VLAN:
VM A must use the gateway (GW) to communicate with the outside world;
VM B and VM X must communicate.
It’s obvious we need a virtual router to link the two segments (otherwise B and X cannot
communicate). In a traditional VLAN-based network you’d use a layer-3 switch somewhere in the network.
Figure 3-22: Connecting the virtual segments with a distributed virtual router
There are at least two ways to achieve the desired connectivity in a traditional layer-2 world:
Set the default gateway to router (.1) on B and X, and set the default gateway to .250 (GW) on
A. Pretty bad design if I’ve ever seen one.
Set the default gateway to router (.1) on all VMs and configure a static default route pointing to
.250 (GW) on the router.
After redrawing the diagram it’s obvious what needs to be configured to get the desired
connectivity:
However, keep in mind that security through obscurity is never a good idea (told you it was a bad
design), and there’s a good reason layer-3 switches have ACLs. Speaking of ACLs, you can configure
them in Hyper-V with the New-NetFirewallRule cmdlet, and since all VM ports are equivalent (there are no
layer-2 and layer-3 ports) you get consistent results regardless of whether the source and
destination VM are in the same or different subnets.
According to this explanation, the IP FIB contains the prefixes copied from the IP routing table.
However, this is not how most layer-3 switches work.
This is the routing table I had on the router (static route and default route were set through DHCP).
The CEF table closely reflects the IP routing table, but there are already a few extra entries:
R1#ping 10.11.12.4
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.11.12.4, timeout is 2 seconds:
.!!!!
WAIT, WHAT?
Does that mean that the ping command created an extra entry in the CEF table? Of course not –
but it did trigger the ARP process, which indirectly created a new glean adjacency in the CEF table
(these adjacencies don’t expire due to repeated ARPing done by Cisco IOS). The glean adjacency
looks exactly like any other host route (although you can see from various fields in the detailed CEF
printout that it’s an adjacency route):
Some switch vendors still talk about IP routing entries and ARP entries, others (for example, Nexus
3000) already use IP prefix and IP host entry terminology. I chose Nexus 3000 for a reason – many
data center switches use the same chipset and thus probably use the same forwarding techniques.
Intel is way more forthcoming than Broadcom – the FM4000 data sheet contains plenty of details
about its forwarding architecture, and if I understand it correctly, the IP lookup must result in an
ARP entry (which means that the IP lookup table must contain host routes).
Looks great in PowerPoint, but whoever designed the network (Quantum, now Neutron) plugin must
have been either a vendor or a server-focused engineer using NIC device driver concepts.
You see, the major problem the Quantum plug-in architecture has is that there can only be one
Quantum plugin in a given OpenStack deployment, and that plugin has to implement all the
networking functionality: layer-2 subnets are mandatory, and there are extensions for layer-3
forwarding, security groups (firewalls) and load balancing.
This approach worked well in early OpenStack days when the Quantum plugin configured virtual
switches (similar to what VMware’s vCenter does) and ignored the physical world. You could choose
to work with Linux bridge or Open vSwitch and use VLANs or GRE tunnels (OVS only).
However, once the networking vendors started tying their own awesomesauce into OpenStack, they
had to replace the original Quantum plugin with their own. No problem there, if the vendor controls
end-to-end forwarding path like NEC does with its ProgrammableFlow controller, or if the vendor
implements end-to-end virtual networks like VMware does with NSX or Midokura does with Midonet
… but what most hardware vendors want to do is to control their physical switches, not the
hypervisor virtual switches.
Remember that OpenStack supports a single plugin. Yeah, you got it right – if you want to use the
above architecture, you’re locked into a single networking vendor. Perfect vendor lock-in within an
open-source architecture. Brilliant. Also, do remember that your vendor has to update the plugin to
reflect potential changes to Quantum/Neutron API.
Alas, wherever there’s a problem, there’s a solution – in this case a Quantum plugin that ties
OpenStack to a network services orchestration platform (Tail-f NCS or Anuta nCloudX). These
platforms can definitely configure multi-vendor network environments, but if you’re willing to go this
far down the vendor lock-in path, you just might drop the whole OpenStack idea and use VMware or
Hyper-V.
This approach removes the vendor lock-in of the monolithic vendor-supplied Quantum plugins, but
limits you to the lowest common denominator – VLANs (or equivalent). Not necessarily something
I’d want to have in my greenfield revolutionary forward-looking OpenStack-based data center, even
though Arista’s engineers are quick to point out you can implement VXLAN gateway on ToR switches
and use VLANs in the hypervisors and IP forwarding in the data center fabric. No thanks, I prefer
Skype over a fancier PBX.
Cynical summary: Reinventing the wheel while ensuring a comfortable level of lock-in seems to be
a popular pastime of the networking industry. Let’s see how this particular saga evolves … and do
keep in mind that some people remain deeply skeptical of OpenStack’s future.
Chiradeep Vittal ran a number of tests between virtual machines in an Amazon VPC network and
shared the results in a blog post and extensive comments on one of my posts. Here’s a short
summary:
Virtual switches in Amazon VPC perform layer-3-only unicast IPv4 forwarding (similar to recent
Hyper-V Network Virtualization behavior). All non-IPv4 traffic and multicast/broadcast IPv4
traffic is dropped.
Layer-3 forwarding in the hypervisor virtual switch does not decrement TTL – it’s like all virtual
machines reside in the same subnet;
Hypervisor proxies all ARP requests and replies with the expected MAC addresses of target VMs
or first-hop gateway (early implementations of Amazon VPC used the same destination MAC
address in all ARP replies);
Virtual switch implements limited router-like functionality. For example, the default gateway IP
address replies to pings, but a VM cannot ping the default gateway of another subnet.
You could, for example, use the default route toward the Internet for web server subnet, default
route toward your data center for database server subnet, and no default routing (local connectivity
only) for your application server subnet. Pretty cool stuff if you’re an MPLS/VPN geek used to
schizophrenic routing tables, and quite a tough nut to crack for people who want to migrate their
existing layer-2 networks into the cloud. Massimo Re Ferre made a perfect summary: everyone else
is virtualizing the network, Amazon VPC is abstracting it.
The architecture is definitely intriguing, but we have yet to see how well it copes with fast state
changes in large-scale cloud deployments.
The blog post was written in 2012; it was updated in summer 2014 to reflect improvements in other
network virtualization products mentioned in the blog post.
Figure 3-29: Virtual networks implemented in the hypervisor switches on top of an IP fabric
VMware NSX and Hyper-V are way better – they rely on a central controller to distribute MAC-to-IP
mapping information to individual hypervisors.
All the vendors mentioned above are dancing around that requirement claiming you can always
implement whatever L4-7 functionality you need with software appliances running as virtual
machines on top of virtual networks. A typical example of this approach is vShield Edge, a VM with
baseline load balancing, NAT and DHCP functionality.
VMware, Cisco, Juniper and a few others offer hypervisor-level firewalls; traffic going between
security zones doesn’t have to go through an external appliance (although it still goes through a
VM if you’re using VMware’s vShield Zones/App);
VMware vDS (in vSphere 5.5), VMware NSX, Cisco Nexus 1000V and Hyper-V provide ACL-like
functionality.
As expected, they decided to implement virtual networks with GRE tunnels between hypervisor
hosts. A typical virtual network topology mapped onto the underlying IP transport fabric would thus
look like this:
Their virtual networks solution has layer-2 virtual networks that you can link together with
layer-3 virtual routers.
Each virtual port (including VM virtual interface) has ingress and egress firewall rules and
chains (inspired by Linux iptables).
Virtual routers support baseline load balancing and NAT functionality.
Virtual routers are not implemented as virtual machines – they are an abstract concept used
by hypervisor switches to calculate the underlay IP next hop.
As one would expect in a L3 solution, hypervisors are answering ARP and DHCP requests
locally.
The edge nodes run EBGP with the outside world, appearing as a single router to external
BGP speakers.
Interestingly, they decided to go against the current centralized control plane religion, and
implemented most of the intelligence in the hypervisors. They use Open vSwitch (OVS) kernel
module as the switching platform (proving my claim that OVS provides all you need to implement
L2-4 functionality), but replaced the OpenFlow agents and centralized controller with their own
distributed software.
Their forwarding agents (running in user space on all hypervisor hosts) intercept traffic belonging to
unknown flows (much like the ovs-vswitchd), but process the unknown packets locally instead of
sending them to central OpenFlow controller.
The forwarding agent receiving an unknown packet would check the security rules, consult the
virtual network configuration, calculate the required flow transformation(s) and egress next hop,
install the flow in the local OVS kernel module, insert flow data in a central database for stateful
firewall filtering of return traffic, and send the packet toward egress node encapsulated in a GRE
envelope with the GRE key indicating the egress port on the egress node.
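The whole sequence is easier to grasp as pseudocode. Here's a rough Python sketch of what the
forwarding agent might do on a flow miss – my own approximation based on the description above,
not MidoNet source code; all names and data structures are made up:

kernel_flows = {}     # microflows installed into the local OVS kernel module
flow_state_db = {}    # shared flow state used for stateful filtering of return traffic
EGRESS_BY_MAC = {"02:aa:bb:cc:dd:01": ("hypervisor-b", 17)}   # hypothetical topology data

def security_rules_permit(flow_key):
    return True       # placeholder for the iptables-inspired per-port chains

def handle_flow_miss(flow_key, dst_mac):
    # 1. evaluate the ingress/egress security chains
    if not security_rules_permit(flow_key):
        return ("drop", None)
    # 2. consult the virtual network topology: egress node and port for this flow
    egress_node, egress_port = EGRESS_BY_MAC[dst_mac]
    # 3. install the microflow locally so subsequent packets stay in the kernel
    kernel_flows[flow_key] = ("gre-tunnel", egress_node, egress_port)
    # 4. record flow state centrally for the return traffic
    flow_state_db[flow_key] = egress_node
    # 5. send the packet in a GRE envelope; the GRE key identifies the egress port,
    #    so the egress node needs nothing beyond a preinstalled key-to-port mapping
    return ("gre-encap", {"tunnel_dst": egress_node, "gre_key": egress_port})

print(handle_flow_miss(("10.0.0.1", "10.0.0.2", 6, 32768, 80), "02:aa:bb:cc:dd:01"))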
According to Midokura, the forwarding agents generate the most-generic flow specification they can
– load balancing obviously requires microflows, simple L2 or L3 forwarding doesn’t. While the OVS
kernel module supports only microflow-based forwarding, the forwarding agent doesn’t have to
recalculate the virtual network topology for each new flow.
The egress OVS switch has pre-installed flows that map GRE keys to output ports. The packet is
thus forwarded straight to the destination port without going through the forwarding agent on the
egress node. Like in MPLS/VPN or QFabric, the ingress node performs all forwarding decisions, the
“only” difference being that MidoNet runs as a cluster of distributed software switches on commodity
hardware.
The end result: MidoNet (Midokura’s overlay virtual networking solution) performs simple L2-4
operations within the hypervisor, and forwards packets of established flows within the kernel OVS.
Midokura claims they achieved linerate (10GE) performance on commodity x86 hardware ... but of
course you shouldn’t blindly trust me or them. Get in touch with Ben and test-drive their solution.
For more details and a longer (and more comprehensive) analysis, read Brad Hedlund's blog
post.
THE BACKGROUND
Big Switch is building a platform that would allow you to create virtual networks out of any
combination of physical or virtual devices. Their API or CLI allows you to specify which VLANs or
MAC/IP addresses belong to a single virtual network, and their OpenFlow controller does the rest of
the job. To see their software in action, watch the demo they had during the OpenFlow webinar.
THE SHIFT
When I had a brief chat with Kyle during the OpenFlow symposium, I mentioned that (in my
opinion) they were trying to reinvent MPLS ... and he replied along the lines of “if only DC switches
would have MPLS support, our life would be much easier”.
Most switches installed in today’s data centers don’t support OpenFlow in GA software (HP Procurve
series and IBM’s G8264 are obvious exceptions with a tiny market share); on top of that, some
customers are understandably reluctant to deploy OpenFlow-enabled switches in their production
environment. Time to take a step back and refocus on the piece of the puzzle that is easiest to
change – the hypervisor, combined with L2 or L3 tunneling across the network core.
Not surprisingly, they decided to use Open vSwitch with their own OpenFlow controller in Linux-
based hypervisors (KVM, Xen) and they claim they have a solution for VMware environment, but
Kyle was a bit tight-lipped about that one.
THE DIFFERENCE?
Based on the previous two paragraphs, it does seem that Big Switch is following Nicira’s steps ...
only a year or two later. However, they claim there are significant technical differences between the
two approaches:
Using Big Switch’s OpenFlow controller, you can mix-and-match physical and virtual switches.
Kyle claimed we’ll see switches supporting OpenFlow-controlled tunneling encap/decap in Q3/Q4
of this year. That would be a real game-changer, but I’ll believe it when I see this particular
unicorn (update: in summer 2014 I’m still waiting for them).
And finally, as expected, there’s a positioning game going on. According to Big Switch, all
alternatives (VXLAN from Cisco, NVGRE from Microsoft and STT/GRE/OpenFlow from Nicira) expect
you to embrace a fully virtually integrated stack, whereas Big Switch’s controller creates a platform
for integration partners. If they manage to pull this off, they just might become another 6WIND or
Tail-F – not exactly a bad position to be in, but also not particularly exciting to the investors.
Whatever the case might be, we will definitely live in interesting times in the next few years, and
I’m anxiously waiting for the moment when Big Switch decides to make its product a bit more public
(and I’m still waiting for publicly available product documentation in summer 2014).
IN THIS CHAPTER:
You could implement the gateways with network services devices (load balancers or firewalls), or
with dedicated layer-3 (routers) or layer-2 (bridges) gateways.
Low-bandwidth (a few Gbps) environments are easily served by VM-based solutions. Bare-metal
servers or in-kernel gateways provide at least 10 Gbps of throughput. Environments that need
higher throughput between the physical and the virtual world require dedicated hardware solutions.
This chapter describes several aspects of overlay virtual networking gateways, from design
considerations to an overview of hardware gateway products.
The only product supporting VXLAN Tunnel End Point (VTEP) in the near future is the Nexus 1000V
virtual switch; the only devices you can connect to a VXLAN segment are thus Ethernet interface
cards in virtual machines. If you want to use a router, firewall or load balancer (sometimes lovingly
called application delivery controller) between two VXLAN segments or between a VXLAN segment
and the outside world (for example, a VLAN), you have to use a VM version of the layer-3 device.
That’s not necessarily a good idea; virtual networking appliances have numerous performance
drawbacks and consume way more CPU cycles than needed ... but if you’re a cloud provider billing
your customers by VM instances or CPU cycles, you might not care too much.
The performance of VM-based network services products has increased to the point where it
became a non-issue. See the Virtual Appliances chapter in the second volume of Software
Defined Data Centers book for more details.
The virtual networking appliances also introduce extra hops and unpredictable traffic flows into your
network, as they can freely move around the data center at the whim of the workload balancers.
I’ve totally changed my opinion in the meantime. It doesn’t matter whether you use virtual
or physical network services appliances within a single leaf-and-spine fabric.
Cisco doesn’t have any L3 VM-based product, and the only thing you can get from VMware is vShield
Edge – a dumbed-down Linux VM with a fancy GUI. If you’re absolutely keen on deploying VXLAN, that’s
what you’ll have to use.
NEXT STEPS?
Someone will have to implement VXLAN on physical devices sooner or later; running networking
functions in VMs is simply too slow and too expensive. While I don’t have any firm information (not
even roadmaps), do keep in mind Ken Duda’s enthusiasm during the VXLAN Packet Pushers podcast
(and remember that both Arista and Broadcom appear in the author list of VXLAN and NVGRE
drafts).
Arista was the first data center switching vendor to ship a working VXLAN implementation in
2012.
VMs attached to a VXLAN segment are configured with the default gateway’s IP address (the logical
IP address of the physical termination device within the VXLAN subnet);
No broadcast or flooding is involved in the layer-3 termination, so you could easily use the same
physical IP address and the same VXLAN MAC address on multiple routers (anycast) and achieve
instant redundancy without first hop redundancy protocols like HSRP or VRRP.
As of August 2014, no data center switching vendor has a shipping layer-3 VXLAN gateway
due to the limitations of the Broadcom Trident-2 chipset everyone is using. The hardware of
Cisco Nexus 9300 is capable of layer-3 gateway functionality, but it hasn’t been
implemented in the software yet.
Layer-2 extension of VXLAN segments into VLANs (that you might need to connect VXLAN-based
hosts to an external firewall) is a bit tougher. As you’re bridging between VXLAN and an 802.1Q
VLAN, you have to ensure that you don’t create a forwarding loop.
You could configure the VXLAN layer-2 extension (bridging) on multiple physical switches and run
STP over VXLAN ... but I hope we’ll never see that implemented. It would be way better to use IP
functionality to select the VXLAN-to-VLAN forwarder. You could, for example, run VRRP between
redundant VXLAN-to-VLAN bridges and use VRRP IP address as the VXLAN physical IP address of the
bridge (all off-VXLAN MAC addresses would appear as being reachable via that IP address to other
VTEPs).
Brocade is the first vendor shipping redundant VTEP implementation (see another blog post
at the end of this chapter). Arista supposedly has an MLAG-based solution, but that code still
hasn’t shipped in August 2014.
SUMMARY
VXLAN is a great concept that gives you clean separation between virtual networks and physical IP-
based transport infrastructure, but we need VXLAN termination in physical devices (switches,
potentially also firewalls and load balancers) before we can start considering large-scale
deployments. Till then, it will remain an interesting proof-of-concept tool or a niche product used by
infrastructure cloud providers.
GATEWAY TYPES
You can connect an overlay virtual network and a physical subnet with a network services device
(firewall or load balancer), a dedicated layer-3 gateway (router), or a layer-2 gateway (bridge).
A network services device is the best choice if you have to connect a wholly virtualized application
stack to the outside world, or if you’re connecting components that have to be isolated by a firewall
or load balancer anyway.
Some overlay virtual networking solutions (example: unicast VXLAN on Cisco Nexus 1000V)
don’t work with any existing hardware gateway anyway.
x86-based gateways can provide at least 10Gbps of throughput. If you need more than that across a
single VLAN or tenant you should be looking at dedicated hardware. If you need more than 10Gbps
aggregate throughput, but not more than a Gbps or two per tenant, you might be better served with
a scale-out farm of x86-based gateways – after all, you might be able to reuse them if your needs
change (and there’s no hardware lock-in).
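If you want that rule of thumb in executable form, here's a quick decision helper – the thresholds
are my own assumptions, so adjust them to the actual products you're evaluating:

def pick_gateway(per_tenant_gbps: float, aggregate_gbps: float) -> str:
    if per_tenant_gbps > 10:
        return "dedicated hardware gateway"
    if aggregate_gbps > 10 and per_tenant_gbps <= 2:
        return "scale-out farm of x86-based gateways"
    if aggregate_gbps > 3:
        return "bare-metal x86 or in-kernel gateway"
    return "VM-based gateway (network services appliance)"

print(pick_gateway(per_tenant_gbps=1, aggregate_gbps=40))    # scale-out x86 farm
print(pick_gateway(per_tenant_gbps=20, aggregate_gbps=60))   # dedicated hardware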
In the ideal world, you’d have just two gateways (for redundancy purposes) connecting the legacy
servers to the cloud infrastructure using overlay virtual networking; you might need more than that
in high-bandwidth environments if you decide to use VM-based or x86-based gateways (see above).
The gateways would run in either active/backup configuration (example: Cisco VXLAN gateway, VM-
based or x86-based VMware NSX gateways) or in MLAG-type deployment where two physical
switches present themselves as a single VTEP (IP address) to the overlay virtual networking fabric
(example: Arista VXLAN gateways, NSX VTEP on Brocade Logical Chassis, Cisco Nexus 9300).
Assuming an average virtualized server needs 8 GB of RAM (usually they need less than that) you
can pack over 60 virtualized servers into a single hypervisor host. The 800 virtualized servers thus
need less than 15 physical servers (for example, four Nutanix appliances), or 30 10GE ports – less
than half a ToR switch.
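A back-of-the-envelope check of those numbers (assuming 512 GB of RAM per hypervisor host and
two 10GE uplinks per host – both my assumptions, not data from any specific product):

import math

ram_per_vm_gb, ram_per_host_gb, vms = 8, 512, 800

vms_per_host = ram_per_host_gb // ram_per_vm_gb        # 64 -> "over 60"
hosts_needed = math.ceil(vms / vms_per_host)           # 13 -> "less than 15"
ports_needed = hosts_needed * 2                        # 26 -> ~30 10GE ports with spares

print(vms_per_host, hosts_needed, ports_needed)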
Don’t do that. When you’re building a new data center network or refreshing an old one, start with
its customers – the servers: buy new high-end servers with plenty of RAM and CPU cores, virtualize
as much as you can, and don’t mix the old and the new world.
This does require synchronizing your activities with the server and virtualization teams,
which might be a scary and revolutionary thought in some organizations; we’ll simply have
to get used to talking with other people.
Use one or two switches as L2/L3 gateways, and don’t even think about connecting the old servers
to the new infrastructure. Make it abundantly clear that the old gear will not get any upgrades (the
server team should play along) and that the only way forward is through server virtualization… and
let the legacy gear slowly fade into obsolescence.
DON’T OVERCOMPLICATE
Let’s eliminate the trivial options first.
If your public cloud offers hosting of individual VMs with no per-customer virtual segments,
use one of the mechanisms I described in the Does It Make Sense to Build New Clouds with
Overlay Networks? post and ask the customers to establish a VPN from their VM to their home
network.
If your public cloud offers virtual private networks, but you don’t plan to integrate the cloud
infrastructure with a multi-tenant transport network (using, for example, MPLS/VPN as the WAN
transport technology), establish VPN tunnels between the virtual network edge appliance
(example: vShield Edge) and customer’s VPN concentrator.
VM-LEVEL INTEGRATION
If you don’t want to use one of the MPLS/VPN-based overlay virtual networking solutions (they both
require Linux-based hypervisors and provide off-the-shelf integration with OpenStack and
CloudStack), use VM-based PE-routers. You could deploy Cisco’s Cloud Services Router (CSR) as a
PE-router, connect one of its interfaces to a VLAN-based network and all other interfaces to
customer overlay virtual networks.
The number of customer interfaces (each in a separate VRF) on the CSR router is limited by
the hypervisor, not by CSR (VMware maximum: 10).
A) Do you need a virtual gateway for each VXLAN segment or can a gateway be the entry/exit
point across multiple VXLAN segments?
B) Can you setup multiple gateways and specify which VXLAN segments use each gateway?
C) Can you cluster gateways together (Active/Active) or do you setup them up as Active/Standby?
The answers obviously depend on whether you’re deploying NSX for multiple hypervisors or NSX for
vSphere. Let’s start with the former.
Each gateway node can run multiple instances of L2 or L3 gateway services (but not both). Each L2
gateway service can bridge between numerous overlay networks and VLANs (there must be a 1:1
mapping between an overlay network segment and an outside VLAN), each L3 gateway service can
route between numerous logical networks and a single uplink.
In NSX for vSphere, each L2 gateway instance (an NSX Edge VM running as L2 gateway) can bridge a single VXLAN
segment to a VLAN segment. Multiple L2 gateway instances can run on the same vSphere host.
NSX Edge router (running just the control plane) can have up to eight uplinks and up to 1000
internal (VXLAN-based) interfaces. NSX Edge Services Router (with data plane implemented within
the VM) can have up to ten interfaces (the well-known vSphere limit on the number of interfaces of
a single VM). Multiple NSX Edge routers or NSX Edge Services Routers can run on the same vSphere
host.
In theory you might have more than one NSX Edge instance connecting a VXLAN segment with the
outside world, but even if the NSX Manager software allows you to configure that, I wouldn’t push
my luck.
In August 2014 the shipping EOS code supports multicast-based VXLAN or an Arista-specific
implementation of unicast VXLAN. There’s no support for redundant hypervisors or VMware NSX-
controlled VTEP (Arista claims both will become available soon).
Also expected from Arista: unexpected creativity. Instead of providing a 40GE port on the switch
that can be split into four 10GE ports with a breakout cable (like everyone else is doing), these
switches group four physical 10GE SFP+ ports into a native 40GE (not 4x10GE LAG) interface.
The 7150 switches are also the first devices that offer VXLAN termination in hardware. Broadcom’s
upcoming Trident-2 chipset supports VXLAN and NVGRE, so when Arista demonstrated VXLAN
termination at the recent VMworld 2012, everyone expected the product to be available next spring
... but according to Arista it’s orderable now and shipping in Q4. Turns out Arista decided to use
different switching silicon instead of waiting for Trident-2.
Another goodie: you can run IEEE 1588 (Precision Time Protocol) on these devices to establish an
extremely precise time base in your network, drifting only a few nanoseconds per day (precision
clock module seems to be optional). Such precision might not make sense at first glance
(unless you’re working in high-frequency trading), until you discover you can timestamp mirrored
(Arista’s name for SPAN) or sFlow packets. Imagine being able to collect packets across the whole
network and having an (almost) totally reliable timestamp attached to all of them.
Finally (and my friend Tom Hollingsworth will love this part), 7150 switches can do NAT in hardware.
Yeah, you got that right – they do NAT in silicon (don’t even try to ask me whether it’s NAT44,
NAT64, or NAT66 ;) with less than one microsecond latency.
Brocade decided to skip the multicast VXLAN support and implemented only VMware NSX gateway
functionality. Obviously they don’t believe in the viability of Cisco’s Nexus 1000V (or VMware’s vCNS).
ANYONE ELSE?
Arista has a shipping L2 VTEP that uses IP multicast. They might have an OVSDB agent (which is
needed to work with the NSX controller), but it’s not yet documented in the public EOS
documentation.
IN THIS CHAPTER:
The few blog posts collected in this chapter try to bring some realism into the picture: VXLAN (or
any other virtual networking solution) is not a data center interconnect technology, but it could be
used as the technology-of-last-resort to minimize the impact of suboptimal and/or unrealistic
requirements.
MORE INFORMATION
Cloud Computing Networking webinar describes several over-the-top solutions that you can use
to connect a private cloud with a public cloud;
Check out other cloud computing and networking webinars;
Use the ExpertExpress service if you need a short online consulting session, a technology discussion or a
design review.
This blog post (written in early 2013) describes the differences between the two flavors.
The usual question: “And why would you want to vMotion a VM between data centers?” with a
refreshing answer: “Oh, no, that would not work for us.”
THE CONFUSION
There are two different mechanisms we can use to move VMs around a virtualized environment: hot
VM mobility where a running VM is moved from one hypervisor host to another and cold VM mobility
where a VM is shut down, and its configuration moved to another hypervisor, where the VM is
restarted.
Some virtualization vendors might offer a third option: warm VM mobility where you pause
a VM (saving its memory to a disk file), and resume its operation on another hypervisor.
You’ll find cold VM mobility in almost every high-availability (ex: VMware HA restarts a VM after the
server failure) and disaster recovery solution (ex: VMware’s SRM). It’s also the only viable
technology for VM migration into the brave new cloudy world (aka cloudbursting).
HOT VM MOVE
VMware’s vMotion is probably the best-known example of hot VM mobility technology. vMotion
copies memory pages of a running VM to another hypervisor, repeating the process for pages that
have been modified while the memory was being transferred. After most of the VM memory has been
successfully transferred, vMotion freezes the VM on source hypervisor, moves its state to another
hypervisor, and restarts it there.
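Here's a highly simplified sketch of that iterative pre-copy loop (my own illustration – the real
vMotion page tracking and convergence heuristics are considerably more involved):

def hot_move(total_pages=1_000_000, dirty_ratio=0.05, stop_copy_threshold=1_000, max_rounds=30):
    """Copy memory while the VM keeps running, re-copy pages dirtied in the meantime,
    then freeze the VM and transfer the small remaining working set plus CPU state."""
    to_copy = total_pages
    for round_no in range(1, max_rounds + 1):
        copied = to_copy
        to_copy = int(copied * dirty_ratio)   # pages modified during this round
        print(f"round {round_no}: copied {copied} pages, {to_copy} dirtied meanwhile")
        if to_copy <= stop_copy_threshold:
            break
    print(f"freeze VM, transfer final {to_copy} pages and device state, resume on target host")

hot_move()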
The only mechanisms we can use today to meet all these requirements are:
Stretched layer-2 domains are not the best idea ever invented (server/OS engineers that
understand networking usually agree with that).
Layer-2 subnet with BUM flooding represents a single failure domain and a scalability roadblock.
IP address of the first-hop router is usually manually configured in the VM (yeah, I’m yearning for
the ideal world where people use DHCP to get network-related parameters) and thus cannot be
changed, but nothing stops us from configuring the same IP address on multiple routers (a trick
used by first-hop localization kludges).
We can also use routing tricks (ex: host routes generated by load balancers) or overlay networks
(ex: LISP) to make the moved VM reachable by the outside world – a major use case promoted by
LISP enthusiasts.
The last time I was explaining how cold VM mobility works with LISP in an ExpertExpress WebEx
session, I got a nice question from the engineer on the other end: “And how exactly is that different
from host routes?” The best summary I’ve ever heard.
However, there’s a gotcha: even though the VM has moved to a different location, it left residual
traces of its presence in the original subnet: entries in ARP caches of adjacent hosts and routers.
Routers are usually updated with new forwarding information (be it a routing protocol or LISP
update), adjacent hosts aren’t. These hosts would try to reach the moved VM using its old MAC
address … and fail unless there’s a L2 subnet between the old and the new location.
The notes in the blog post describe the additional implementation options that became available
between the time the blog post was written and summer of 2014.
VXLAN, OTV and LISP are point solutions targeting different markets. VXLAN is an IaaS
infrastructure solution, OTV is an enterprise L2 DCI solution and LISP is ... whatever you want it to
be.
VXLAN tries to solve a very specific IaaS infrastructure problem: replace VLANs with something that
might scale better. In a massive multi-tenant data center having thousands of customers, each one
asking for multiple isolated IP subnets, you quickly run out of VLANs. VMware tried to solve the
problem with MAC-in-MAC encapsulation (vCDNI), and you could potentially do the same with the
right combination of EVB (802.1Qbg) and PBB (802.1ah), very clever tricks a-la Network Janitor, or
even with MPLS.
Reading the VXLAN draft, you might notice that all the control-plane aspects are solved with
handwaving. Segment ID values just happen, IP multicast addresses are defined at the
management layer and the hypervisors hosting the same VXLAN segment don’t even talk to each
other, but rely on layer-2 mechanisms (flooding and dynamic MAC address learning) to establish
inter-VM communication. VXLAN is obviously a QDS (Quick-and-Dirty-Solution) addressing a specific
need – increasing the scalability of IaaS networking infrastructure.
In the meantime, Cisco and VMware shipped unicast VXLAN implementations – Cisco on
Nexus 1000V, VMware with the VMware NSX.
VXLAN will indeed scale way better than a VLAN-based solution, as it provides total separation
between the virtualized segments and the physical network (no need to provision VLANs on the
physical switches), it will scale somewhat better than MAC-in-MAC encapsulation because it relies on
L3 transport (and can thus work well in existing networks), but it’s still a very far cry from Amazon
EC2. People with extensive (bad) IP multicast experience are also questioning the wisdom of using
IP multicast instead of source-based unicast replication ... but if you want to remain control-plane
ignorant, you have to rely on third parties (read: IP multicast) to help you find your way around.
It seems there have already been claims that VXLAN solves inter-DC VM mobility (I sincerely hope
I’ve got a wrong impression from Duncan Epping’s summary of Steve Herrod’s general session @ VMworld).
Here’s where OTV kicks in: if you do become tempted to implement long-distance bridging, OTV is
the least horrendous option (BGP MPLS-based MAC VPN will be even better, but it still seems to be
working primarily in PowerPoint). It replaces dynamic MAC address learning with deterministic
routing-like behavior, provides proxy ARP services, and stops unicast flooding. Until we’re willing to
change the fundamentals of transparent bridging, that’s almost as good as it gets.
EVPN is the standardized BGP MPLS-based MAC VPN solution. Some hardware vendors
already have EVPN-capable products; it’s also used in several overlay virtual networking
solutions.
As you can see, it makes no sense to compare OTV and VXLAN; it’s like comparing a racing car to a
downhill mountain bike. Unfortunately, you can’t combine them to get the best of both worlds; at
the moment, OTV and VXLAN live in two parallel universes. OTV provides long-distance bridging-like
behavior for individual VLANs, and VXLAN cannot even be transformed into a VLAN.
LISP is yet another story. It provides a very rudimentary approximation of IP address mobility across
layer-3 subnets, and it might be able to do it better once everyone realizes the hypervisor is the only
place to do it properly. However, it’s a layer-3 solution running on top of layer-2 subnets, which
means you might run LISP in combination with OTV (not sure it makes sense, but nonetheless) and
you could be able to run LISP in combination with VXLAN once you can terminate VXLAN on a LISP-
capable L3 device.
So, with the introduction of VXLAN, the networking world hasn’t changed a bit: the vendors are still
serving us all isolated incompatible technologies ... and all we’re asking for is tightly integrated and
well-architected designs.
VXLAN is a layer-2 technology. If you plan to use VXLAN to implement a data center interconnect,
you’ll be stretching a single L2 segment across two data centers.
You probably know my opinion about the usability of L2 DCI, but even ignoring the obvious
problems, current VXLAN implementations don’t have the features one would want to see in a L2
DCI solution.
Per-VLAN flooding control at data center edge. Broadcasts/multicasts are usually not rate-
limited within the data center, but should be tightly controlled at the data center edge
(bandwidth between data centers is usually orders of magnitude lower than bandwidth within a
data center). Ideally, you’d be able to control them per VLAN to reduce the noisy neighbor
problems.
Broadcast reduction at data center edge. Devices linking DC fabric to WAN core should
implement features like ARP proxy.
Controlled unicast flooding. It should be possible to disable flooding of unknown unicasts at
DC-WAN boundary.
It’s also nice to have the following features to reduce the traffic trombones going across the DCI
link:
First hop router localization. Inter-subnet traffic should not traverse the DCI link to reach the
first-hop router.
Ingress traffic optimization. Traffic sent to a server in one data center should not arrive to
the other data center first.
OTV in combination with FHRP localization and LISP (or load balancers with Route Health Injection)
gives you a solution that meets these criteria. VXLAN with hypervisor VTEPs has none of the above-
mentioned features.
Conclusion: The current VXLAN implementations (as of November 2012) are a far cry from what I
would like to see if being forced to implement a L2 DCI solution. Stick with OTV (it’s now available
on ASR 1K).
Overlay virtual networks just might be a solution if you have to solve a similar problem:
Build the cloud portion of the customer’s layer-2 network with an overlay virtual networking
technology;
Install an extra NIC in one (or more) physical host and run a VXLAN-to-VLAN gateway in a VM on
that host – the customer’s VLAN is thus completely isolated from the data center network core;
Connect the extra NIC to WAN edge router or switch on which the customer’s link is terminated.
Whatever stupidity the customer does in its part of the stretched layer-2 network won’t spill
further than the gateway VM and the overlay network (and you could easily limit the damage by
reducing the CPU cycles available to the gateway VM).
You could use Nexus 1000V with VXLAN or OVS/GRE/OpenStack combo at no additional cost
(combining VLANs with GRE-encapsulated subnets might be an interesting challenge in current
OpenStack Quantum release);
VMware’s version of VXLAN comes with vCNS (a product formerly known as vShield), so you’ll
need a vCNS license;
You could also use Nicira NVP (part of VMware NSX) with a layer-2 gateway (included in NVP
platform).
Hyper-V Network Virtualization might have a problem dealing with dynamic MAC addresses
coming from the customer’s data center – this is one of the rare use cases where dynamic
MAC learning works better than a proper control plane.
VXLAN-to-VLAN gateway linking the cloud portion of the customer’s network with the customer’s
VLAN could be implemented with Cisco’s VXLAN gateway or a simple Linux or Windows VM on which
you bridge the overlay and VLAN interfaces (yet again, one of those rare cases where VM-based
bridging makes sense). Arista’s 7150 or F5 BIG-IP is probably overkill.
And now for a bit of totally unrelated trivia: once we solved the interesting part of the problem, I
asked about the details of the customer interconnect link – they planned to have a single 100 Mbps
link and thus a single point of failure. I can only wish them luck and hope they’ll try to run stretched
clusters over that link.
VXLAN hasn’t changed much since the time I explained why it’s not the right technology for long-
distance VLANs:
I haven’t seen integration with OTV or LISP that was promised years ago (or maybe I missed
something – please write a comment);
VXLAN-to-VLAN gateways are still limited to a single gateway (or MLAG cluster) per VXLAN
segment, generating traffic trombones with long-distance VLANs;
Traffic trombones generated by stateful appliances (inter-subnet firewalls or load balancers) are
obviously impossible to solve.
Then there’s the obvious problem of data having gravity (or applications being used to being close to
data) – if you move a VM away from the data, the performance quickly drops way below acceptable
levels.
However, if you’re forced to implement a stretched VLAN (because the application team cannot
possibly deploy their latest gizmo without it, or because the server team claims they need it for
It turns out nobody took the time to analyze an OTV packet trace with Wireshark; everyone
believed whatever IETF drafts were telling us. Here’s the packet format from draft-hasmit-otv-03:
And here’s the packet format from draft-mahalingam-dutt-dcops-vxlan. Apart from a different UDP
port number, the two match perfectly.
VXLAN Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R| Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| VXLAN Network Identifier (VNI) | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Payload:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ethertype of Original Payload | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Original Ethernet Payload |
| |
|(Note that the original Ethernet Frame's FCS is not included) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
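For the hands-on readers, here's a small Python sketch that builds and parses the 8-byte VXLAN
header shown above (I flag set, 24-bit VNI, reserved fields zeroed); the VNI value is just an example:

import struct

def build_vxlan_header(vni: int) -> bytes:
    word1 = 0x08 << 24                 # flags byte with the I bit set, then 24 reserved bits
    word2 = (vni & 0xFFFFFF) << 8      # 24-bit VNI followed by 8 reserved bits
    return struct.pack("!II", word1, word2)

def parse_vxlan_header(hdr: bytes) -> int:
    word1, word2 = struct.unpack("!II", hdr)
    assert (word1 >> 24) & 0x08, "I bit not set - VNI is not valid"
    return word2 >> 8                  # extract the VNI

hdr = build_vxlan_header(5001)
print(hdr.hex(), parse_vxlan_header(hdr))   # 0800000000138900 5001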
However, it turns out the OTV draft four Cisco’s engineers published in 2011 has nothing to do with
the actual implementation and encapsulation format used by Nexus 7000. It seems Brian McGahan
was the first one to actually do the OTV packet capture and analysis and published his findings. He
discovered that OTV is nothing other than the very familiar EoMPLSoGREoIP. No wonder the first
VXLAN gateway device Cisco announced at Cisco Live is not the Nexus 7000 but a Nexus 1000V-
based solution (at least that’s the way I understood this whitepaper).
IN THIS CHAPTER:
Hardware networking vendors are trying to stem the shift by offering new platforms and
architectures that keep the traditional VLAN-based virtual switches and implement the network
virtualization functionality in the (hardware) network edge. Needless to say, these approaches
usually make as much sense as trying to keep X.25 a viable alternative to TCP/IP.
This chapter contains a few rants I wrote in the last three years. You won’t find many technical
arguments in this chapter – all the high-level arguments are listed in the Overlay Virtual Networking
101 chapter, and I’ve seen little value in repeating them in every blog post.
X.25 gurus telling me how Telnet will never take off because TCP/IP header has much larger
overhead than X.29 PAD service. X.25 is dead (OK, maybe a zombie) and nobody complains
about TCP/IP header overhead anymore. BTW, we solved the header overhead problem decades
ago with TCP/IP header compression.
ATM gurus telling me how each application needs its own dedicated QoS settings and how the
only way to implement that is to run ATM to the desktop. ATM to the desktop never took off, and
global QoS remains a providers’ dream and vendors’ bonanza. In the meantime, we’re watching
Netflix videos and talking over Skype … with absolutely no QoS guarantees from our ISPs.
IBM sales engineers telling my customers how it would be totally irresponsible to transport bank
teller application data over a TCP/IP network (the data would still be in an SNA session, but
transported across unreliable routed network). SNA is another zombie, and everyone is using
TCP/IP protocol stack ... oh, and we’re running e-banking over the Internet.
All the above-mentioned technologies and architectures went extinct for a simple reason: whenever
there’s a clash between competing solutions, the ones that move the complexity as far out to the
edge as possible usually win, and those that tried to keep the complexity and micro-state in the core
failed (X.25, ATM, traditional voice circuits) because keeping state is too expensive at scale. Draw
your own conclusions, and remember that in a server virtualization environment the edge is in the
hypervisor, not in the ToR switch.
Does that mean the overlay virtual networks are a perfect solution? Far from it – we’re probably
where TCP/IP and Internet were in the early nineties, and virtualization vendors don’t have a perfect
track record when it comes to network virtualization, but they’re catching up fast.
Finally, if you’re wondering why I mentioned the spaghetti wall in the title – that’s how I feel when
being faced with a barrage of competing (and incompatible) ToR-based network virtualization
solutions.
They say a picture is worth a thousand words – here are a few slides from my Interop 2013 Overlay
Virtual Networking Explained presentation.
This is how most enterprise data centers provision virtual networks these days (if you’re working for
a cloud provider and still doing something similar, run away as fast as you can).
The networking industry would love to keep the complexity (and related margins) in the network,
keeping the edge (hypervisors) approximately as smart as the following device:
With the edge being mostly stupid (and 802.1Qbg playing the role of rotary dialing), you need loads
of technologies in the network to compensate for the edge stupidity, just like the voice exchanges
needed more and more complex technologies and protocols to establish voice circuits.
The details have changed a bit (Cisco seems to be embracing L3 forwarding at the ToR switches),
but the architectural options haven’t – you have to have the complex stuff somewhere and it will be
either in the end systems (hypervisors) or in the network.
We all know how the voice saga ended – you can’t sell a mobile phone if it doesn’t support Skype,
and while there are still plenty of loose ends when you have to connect the old and the new worlds,
more or less everyone essentially gave up and started using VoIP for new deployments. Yes, it took
us more than a decade to get there, and the road was bumpy, but I don’t think you could persuade
anyone to invest money in a PBX-with-SS7 startup these days.
We’ll probably see the same game played out twenty years later in the virtual networking space
(one can only hope the remains of the past won’t hinder us as long as they are in the VoIP world) –
the established networking vendors selling us smarter and smarter exchanges (switches) and the
virtualization vendors and startups selling us end-system solutions running on top of IP. It’s easy to
predict the final outcome; it’s just the question of how long it will take to get there (and don’t forget
that Alcatel, Lucent and Nortel made plenty of money selling PBXes to legacy enterprises while Cisco
and others tried to boost low VoIP adoption).
After talking to networking vendors I'm inclined to think they are going to focus on a
mesh of overlays from the ToR, with possible use of overlays between vswitch and ToR
too if desired - drawing analogies to MPLS with ToR a PE and vSwitch a CE. Aside from
selling more hardware for this, I'm not drawn towards a solution like this bc it doesn't
help with full network virtualization and a network abstraction for VMs.
The whole situation reminds me of the good old SNA and APPN days with networking vendors
playing the IBM part of the comedy.
I apologize to the younglings in the audience – the rest of the blog post will sound like total
gibberish to you – but I do hope the grumpy old timers will get a laugh or two out of it.
Once upon a time, there were mainframes (and nobody called them clouds), and all you could do
was to connect your lowly terminal (80 x 24 fluorescent green characters) to a mainframe. Not
surprisingly, the networking engineers were building hub-and-spoke networks with the mainframes
at the hub.
Years later, seeds of evil started appearing in the hub-and-spoke wonderland. There were rumors of
coax cables being drilled and vampire taps being installed onto said cables. Workstations were able
to communicate without the involvement of the central controller ... and there was a new protocol
called Internet Protocol that powered all these evil ideas.
At the same time IBM tried to persuade us 4Mbps Token Ring works faster than 10Mbps
switched Ethernet. Brocade recently tried a similar stunt, trying to tell us how Gen 5 Fibre
Channel (also known as 16 Gbps FC) is better than anything else (including 40GE FCoE) –
another proof the marketers never learn from past blunders.
François Roy provided the necessary IP-over-APPN detail in his comment to my blog post:
IBM implemented it in 2217 Nways Multiprotocol Concentrator. Straight from the
documentation: "TCP/IP data is routed over SNA using IBM's multiprotocol transport
networking (MPTN) formats."
Regardless of IBM’s huge marketing budget, the real world took a different turn. First we started
transporting SNA over IP (remember DLSw?), then deployed Telnet 3270 (TN3270) gateways to
give PCs TCP/IP-based access to mainframe applications. Oh, and IBM seems to have APPN over IP.
A few years later, IBM was happily selling Fast Ethernet mainframe attachments and running TCP/IP
stack with TN3270 on the mainframes (you see, they never really cared about networking – their
core businesses are services, software and mainframes) ... and one of the first overlay virtual
network implementations was VXLAN in Nexus 1000V.
And so I finally managed to mention overlay virtual networking ... but don’t rush to conclusions;
before drawing analogies keep in mind that most organizations couldn’t get rid of the mainframes:
there were millions of lines of COBOL code written for an environment that could not be easily
replicated anywhere else. Migrating those applications to any other platform was mission impossible.