HowTo Guides: Release 20.08.0
CHAPTER ONE: LIVE MIGRATION OF VM WITH SR-IOV VF
1.1 Overview
It is not possible to migrate a Virtual Machine which has an SR-IOV Virtual Function (VF).
To get around this problem the bonding PMD is used.
The following sections show an example of how to do this.
A bonded device is created in the VM. The virtio and VF PMDs are added as slaves to the bonded
device. The VF is set as the primary slave of the bonded device.
A bridge must be set up on the host connecting the tap device, which is the backend of the virtio device,
and the Physical Function (PF) device.
To test the live migration two servers with identical operating systems installed are used. KVM and
QEMU 2.3 are also required on the servers.
In this example, the servers have Niantic and/or Fortville NICs installed. The NICs on both servers are
connected to a switch which is also connected to the traffic generator.
The switch is configured to broadcast traffic on all the NIC ports. A sample switch configuration can be
found later in this chapter.
The host is running the Kernel PF driver (ixgbe or i40e).
The IP address of host_server_1 is 10.237.212.46.
The IP address of host_server_2 is 10.237.212.131.
The sample scripts mentioned in the steps below can be found in the Sample host scripts and Sample
VM scripts sections.
On host_server_1:
cd /root/dpdk/host_scripts
./setup_vf_on_212_46.sh
[Figure: Live migration test setup. Server 1 and Server 2 each run Linux, KVM, QEMU and the kernel PF driver, with a SW bridge connecting the tap and PF, a DPDK testpmd application in the VM, 10 Gb NICs, a 10 Gb migration link between the servers, an NFS server holding the VM disk image, and a 10 Gb traffic generator attached through a switch.]
cd /root/dpdk/host_scripts
./setup_bridge_on_212_46.sh
./connect_to_qemu_mon_on_host.sh
(qemu)
In VM on host_server_1:
cd /root/dpdk/vm_scripts
./setup_dpdk_in_vm.sh
./run_testpmd_bonding_in_vm.sh
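For reference, the interactive testpmd commands that create the bonded device in the VM could look like the following sketch, consistent with the port numbering used in the rest of this chapter (virtio is port 0, the VF is port 1, and the bonded device becomes port 2; bonding mode 1 is active-backup):

testpmd> create bonded device 1 0
testpmd> add bonding slave 0 2
testpmd> add bonding slave 1 2
testpmd> set bonding primary 1 2
testpmd> port start 2
testpmd> show bonding config 2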
Note that closing the port should remove the VF MAC address; it does not remove the perm_addr.
The mac_addr command only works with the kernel PF for Niantic.
testpmd> mac_addr remove 1 AA:BB:CC:DD:EE:FF
testpmd> port detach 1
Port '0000:00:04.0' is detached. Now total ports is 2
testpmd> show port stats all
In VM on host_server_1:
testpmd> show bonding config 2
On host_server_2:
cd /root/dpdk/host_scripts
./setup_vf_on_212_131.sh
./vm_virtio_one_migrate.sh
./setup_bridge_on_212_131.sh
./connect_to_qemu_mon_on_host.sh
(qemu) info status
VM status: paused (inmigrate)
(qemu)
In VM on host_server_2:
Hit the Enter key. This brings the user to the testpmd prompt.
testpmd>
In VM on host_server_2:
testpmd> show port info all
testpmd> show port stats all
testpmd> show bonding config 2
testpmd> port attach 0000:00:04.0
Port 1 is attached.
Now total ports is 3
Done
The mac_addr command only works with the Kernel PF for Niantic.
testpmd> mac_addr add port 1 vf 0 AA:BB:CC:DD:EE:FF
testpmd> show port stats all
testpmd> show config fwd
testpmd> show bonding config 2
testpmd> add bonding slave 1 2
testpmd> set bonding primary 1 2
testpmd> show bonding config 2
testpmd> show port stats all
1.4.1 setup_vf_on_212_46.sh
#!/bin/sh
# This script is run on the host 10.237.212.46 to set up the VF
# set up Niantic VF
cat /sys/bus/pci/devices/0000\:09\:00.0/sriov_numvfs
echo 1 > /sys/bus/pci/devices/0000\:09\:00.0/sriov_numvfs
cat /sys/bus/pci/devices/0000\:09\:00.0/sriov_numvfs
rmmod ixgbevf
# set up Fortville VF
cat /sys/bus/pci/devices/0000\:02\:00.0/sriov_numvfs
echo 1 > /sys/bus/pci/devices/0000\:02\:00.0/sriov_numvfs
cat /sys/bus/pci/devices/0000\:02\:00.0/sriov_numvfs
rmmod i40evf
1.4.2 vm_virtio_vf_one_212_46.sh
VCPUS_NR="4"
# Memory
MEM=1536
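Only the CPU and memory settings of this script survive here. A minimal sketch of the kind of QEMU invocation it performs is given below; the disk image path, tap name, MAC address and VF PCI address (0000:09:10.0) are placeholders, and the VF could equally be assigned with a different passthrough mechanism than vfio-pci.

#!/bin/sh
# Hypothetical sketch: start a VM with one virtio-net port backed by tap1
# (bridged to the PF) and the Niantic VF passed through.
DISK_IMG="/images/vm1.img"
VCPUS_NR="4"
MEM=1536

qemu-system-x86_64 \
  -enable-kvm -cpu host -m $MEM -smp $VCPUS_NR -name VM1 \
  -hda $DISK_IMG \
  -netdev tap,id=net1,ifname=tap1,script=no,downscript=no \
  -device virtio-net-pci,netdev=net1,mac=DE:AD:DE:01:02:03 \
  -device vfio-pci,host=0000:09:10.0 \
  -monitor telnet::3333,server,nowait \
  -vnc none -nographic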
1.4.3 setup_bridge_on_212_46.sh
#!/bin/sh
# This script is run on the host 10.237.212.46 to bring up the bridge interfaces.
ifconfig ens3f0 up
ifconfig tap1 up
ifconfig ens6f0 up
ifconfig virbr0 up
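The ifconfig commands above assume that the PF and tap interfaces are already members of the bridge. If they are not, they can be added first, for example (hypothetical, using the interface and bridge names from this script):

brctl addif virbr0 ens3f0
brctl addif virbr0 ens6f0
brctl addif virbr0 tap1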
1.4.4 connect_to_qemu_mon_on_host.sh
#!/bin/sh
# This script is run on both hosts when the VM is up,
# to connect to the Qemu Monitor.
telnet 0 3333
1.4.5 setup_vf_on_212_131.sh
#!/bin/sh
# This script is run on the host 10.237.212.131 to setup the VF
# set up Niantic VF
cat /sys/bus/pci/devices/0000\:06\:00.0/sriov_numvfs
echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/sriov_numvfs
cat /sys/bus/pci/devices/0000\:06\:00.0/sriov_numvfs
rmmod ixgbevf
# set up Fortville VF
cat /sys/bus/pci/devices/0000\:03\:00.0/sriov_numvfs
echo 1 > /sys/bus/pci/devices/0000\:03\:00.0/sriov_numvfs
cat /sys/bus/pci/devices/0000\:03\:00.0/sriov_numvfs
rmmod i40evf
1.4.6 vm_virtio_one_migrate.sh
# Memory
MEM=1536
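Only the memory setting of this script survives here. The important difference from vm_virtio_vf_one_212_46.sh is that the destination QEMU is started waiting for the incoming migration stream, which is why the monitor reports "paused (inmigrate)" earlier in this chapter. A sketch of the relevant tail of the command (the port number mirrors the vhost_user variant later in this guide and is an assumption):

  ... -incoming tcp:0:5555 \
      -monitor telnet::3333,server,nowait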
1.4.7 setup_bridge_on_212_131.sh
#!/bin/sh
# This script is run on the host 10.237.212.131 to bring up the bridge interfaces.
ifconfig ens4f0 up
ifconfig tap1 up
ifconfig ens5f0 up
ifconfig virbr0 up
1.5.1 setup_dpdk_in_vm.sh
#!/bin/sh
# This script is run in the VM to set up hugepages and load igb_uio.
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
ifconfig -a
/root/dpdk/usertools/dpdk-devbind.py --status
modprobe uio
insmod /root/dpdk/x86_64-default-linux-gcc/kmod/igb_uio.ko
/root/dpdk/usertools/dpdk-devbind.py --status
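The script above loads igb_uio but does not show the bind step. Binding the two VM ports would look like the following sketch; the PCI addresses are examples (0000:00:04.0 is the VF port seen earlier in this chapter):

/root/dpdk/usertools/dpdk-devbind.py -b igb_uio 0000:00:03.0 0000:00:04.0
/root/dpdk/usertools/dpdk-devbind.py --status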
1.5.2 run_testpmd_bonding_in_vm.sh
# The test system has 8 cpus (0-7), use cpus 2-7 for VM
# Use taskset -pc <core number> <thread_id>
/root/dpdk/x86_64-default-linux-gcc/app/testpmd \
-l 0-3 -n 4 --socket-mem 350 -- -i --port-topology=chained
The Intel switch is used to connect the traffic generator to the NICs on host_server_1 and host_server_2.
In order to run the switch configuration two console windows are required.
Log in as root in both windows.
TestPointShared, run_switch.sh and load /root/switch_config must be executed in the sequence below.
In the first window, run TestPointShared:
/usr/bin/TestPointShared
In the second window, execute run_switch.sh:
/root/run_switch.sh
Then, at the TestPoint prompt in the first window, load the switch configuration:
load /root/switch_config
CHAPTER TWO: LIVE MIGRATION OF VM WITH VIRTIO ON HOST RUNNING VHOST_USER
2.1 Overview
This chapter covers live migration of a VM with the DPDK virtio PMD on a host which is running the vhost
sample application (vhost-switch) and using the DPDK PF PMD (ixgbe or i40e).
The vhost sample application uses VMDQ, so SR-IOV must be disabled on the NICs.
The following sections show an example of how to do this migration.
To test the live migration two servers with identical operating systems installed are used. KVM and
QEMU are also required on the servers.
QEMU 2.5 is required for live migration of a VM with vhost_user running on the hosts.
In this example, the servers have Niantic and/or Fortville NICs installed. The NICs on both servers are
connected to a switch which is also connected to the traffic generator.
The switch is configured to broadcast traffic on all the NIC ports.
The IP address of host_server_1 is 10.237.212.46.
The IP address of host_server_2 is 10.237.212.131.
The sample scripts mentioned in the steps below can be found in the Sample host scripts and Sample
VM scripts sections.
[Figure: Live migration test setup for vhost_user. Server 1 and Server 2 each run Linux, KVM and QEMU 2.5 with the DPDK PF PMD and vhost_user, a DPDK testpmd application in the VM, 10 Gb NICs, a 10 Gb migration link between the servers, an NFS server holding the VM disk image, and a 10 Gb traffic generator attached through a switch.]
For Fortville and Niantic NICs reset SR-IOV and run the vhost_user sample application (vhost-switch)
on host_server_1.
cd /root/dpdk/host_scripts
./reset_vf_on_212_46.sh
./run_vhost_switch_on_host.sh
In VM on host_server_1:
Setup DPDK in the VM and run testpmd in the VM.
cd /root/dpdk/vm_scripts
./setup_dpdk_in_vm.sh
./run_testpmd_in_vm.sh
For Fortville and Niantic NICs reset SR-IOV, and run the vhost_user sample application on
host_server_2.
cd /root/dpdk/host_scripts
./reset_vf_on_212_131.sh
./run_vhost_switch_on_host.sh
In VM on host_server_2:
Hit the Enter key. This brings the user to the testpmd prompt.
testpmd>
In VM on host_server_2:
testpmd> show port info all
testpmd> show port stats all
2.4.1 reset_vf_on_212_46.sh
#!/bin/sh
# This script is run on the host 10.237.212.46 to reset SRIOV (PCI addresses as in the setup script for this host)
echo 0 > /sys/bus/pci/devices/0000\:09\:00.0/sriov_numvfs
echo 0 > /sys/bus/pci/devices/0000\:02\:00.0/sriov_numvfs
2.4.2 vm_virtio_vhost_user.sh
#!/bin/sh
# Script for use with vhost_user sample application
# The host system has 8 cpu's (0-7)
# Memory
MEM=1024
VIRTIO_OPTIONS="csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off"
# Socket Path
SOCKET_PATH="/root/dpdk/host_scripts/usvhost"
-vnc none \
-nographic \
-hda $DISK_IMG \
-chardev socket,id=chr0,path=$SOCKET_PATH \
-netdev type=vhost-user,id=net1,chardev=chr0,vhostforce \
-device virtio-net-pci,netdev=net1,mac=CC:BB:BB:BB:BB:BB,$VIRTIO_OPTIONS \
-chardev socket,id=chr1,path=$SOCKET_PATH \
-netdev type=vhost-user,id=net2,chardev=chr1,vhostforce \
-device virtio-net-pci,netdev=net2,mac=DD:BB:BB:BB:BB:BB,$VIRTIO_OPTIONS \
-monitor telnet::3333,server,nowait
2.4.3 connect_to_qemu_mon_on_host.sh
#!/bin/sh
# This script is run on both hosts when the VM is up,
# to connect to the Qemu Monitor.
telnet 0 3333
2.4.4 reset_vf_on_212_131.sh
#!/bin/sh
# This script is run on the host 10.237.212.131 to reset SRIOV (PCI addresses as in the setup script for this host)
echo 0 > /sys/bus/pci/devices/0000\:06\:00.0/sriov_numvfs
echo 0 > /sys/bus/pci/devices/0000\:03\:00.0/sriov_numvfs
2.4.5 vm_virtio_vhost_user_migrate.sh
#!/bin/sh
# Script for use with vhost user sample application
# The host system has 8 cpu's (0-7)
# Memory
MEM=1024
VIRTIO_OPTIONS="csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off"
# Socket Path
SOCKET_PATH="/root/dpdk/host_scripts/usvhost"
-m $MEM \
-smp $VCPUS_NR \
-object memory-backend-file,id=mem,size=1024M,mem-path=/mnt/huge,share=on \
-numa node,memdev=mem,nodeid=0 \
-cpu host \
-name VM1 \
-no-reboot \
-net none \
-vnc none \
-nographic \
-hda $DISK_IMG \
-chardev socket,id=chr0,path=$SOCKET_PATH \
-netdev type=vhost-user,id=net1,chardev=chr0,vhostforce \
-device virtio-net-pci,netdev=net1,mac=CC:BB:BB:BB:BB:BB,$VIRTIO_OPTIONS \
-chardev socket,id=chr1,path=$SOCKET_PATH \
-netdev type=vhost-user,id=net2,chardev=chr1,vhostforce \
-device virtio-net-pci,netdev=net2,mac=DD:BB:BB:BB:BB:BB,$VIRTIO_OPTIONS \
-incoming tcp:0:5555 \
-monitor telnet::3333,server,nowait
2.5.1 setup_dpdk_virtio_in_vm.sh
#!/bin/sh
# this script matches the vm_virtio_vhost_user script
# virtio port is 03
# virtio port is 04
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
ifconfig -a
/root/dpdk/usertools/dpdk-devbind.py --status
rmmod virtio-pci
modprobe uio
insmod /root/dpdk/x86_64-default-linux-gcc/kmod/igb_uio.ko
/root/dpdk/usertools/dpdk-devbind.py -b igb_uio 0000:00:03.0
/root/dpdk/usertools/dpdk-devbind.py -b igb_uio 0000:00:04.0
/root/dpdk/usertools/dpdk-devbind.py --status
2.5.2 run_testpmd_in_vm.sh
#!/bin/sh
# Run testpmd for use with vhost_user sample app.
# test system has 8 cpus (0-7), use cpus 2-7 for VM
/root/dpdk/x86_64-default-linux-gcc/app/testpmd \
-l 0-5 -n 4 --socket-mem 350 -- --burst=64 -i
CHAPTER THREE: FLOW BIFURCATION HOW-TO GUIDE
Flow Bifurcation is a mechanism which uses hardware-capable Ethernet devices to split traffic between
Linux user space and kernel space. Since it is a hardware-assisted feature this approach can provide line
rate processing capability. Unlike KNI, the software is only required to enable device configuration;
there is no need to take care of the packet movement during the traffic split. This can yield better
performance with less CPU overhead.
The Flow Bifurcation splits the incoming data traffic to user space applications (such as DPDK
applications) and/or kernel space programs (such as the Linux kernel stack). It can direct some traffic, for
example data plane traffic, to DPDK, while directing some other traffic, for example control plane traffic,
to the traditional Linux networking stack.
There are a number of technical options to achieve this. A typical example is to combine the technology
of SR-IOV and packet classification filtering.
SR-IOV is a PCI standard that allows the same physical adapter to be split into multiple virtual functions.
Each virtual function (VF) has its own set of queues, separate from those of the physical function (PF).
The network adapter will direct traffic to a virtual function with a matching destination MAC address.
In a sense, SR-IOV has the capability for queue division.
Packet classification filtering is a hardware capability available on most network adapters. Filters can
be configured to direct specific flows to a given receive queue in hardware. Different NICs may have
different filter types to direct flows to a Virtual Function or to a queue that belongs to it.
In this way the Linux networking stack can receive specific traffic through the kernel driver while a
DPDK application can receive specific traffic bypassing the Linux kernel by using drivers like VFIO or
the DPDK igb_uio module.
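As an illustration only (not part of the original setup): on a NIC whose kernel PF driver exposes flow steering through ethtool's ntuple interface, VFs can be created and a flow steered to a specific queue roughly as follows. The interface name, addresses and action index are placeholders, and whether an action index can address a VF queue is device specific.

# create two VFs on the PF and steer one UDP flow to queue 2
echo 2 > /sys/class/net/eth1/device/sriov_numvfs
ethtool -N eth1 flow-type udp4 src-ip 192.0.2.2 dst-ip 198.51.100.2 action 2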
The Mellanox devices are natively bifurcated, so there is no need to split into SR-IOV PF/VF in order
to get the flow bifurcation mechanism. The full device is already shared with the kernel driver.
The DPDK application can set up some flow steering rules, and let the rest go to the kernel stack. In
order to define the filters strictly with flow rules, the flow_isolated_mode can be configured.
There are no specific instructions to follow. The recommended reading is the ../prog_guide/rte_flow
guide. Below is an example of testpmd commands for receiving VXLAN VNI 42 in 4 queues of DPDK
port 0, while all other packets go to the kernel:
testpmd> flow isolate 0 true
testpmd> flow create 0 ingress pattern eth / ipv4 / udp / vxlan vni is 42 / end \
actions rss queues 0 1 2 3 end / end
[Figure: Flow bifurcation overview. The NIC applies filters that steer selected flows to the Rx queues (0-M) of a specified VF used by a DPDK application, while the remaining traffic goes to the PF Rx queues (0-N) handled by the kernel PF driver; tools on the Linux side are used to program the filters.]
CHAPTER FOUR: GENERIC FLOW API (rte_flow) EXAMPLES
This document demonstrates some concrete examples for programming flow rules with the rte_flow
APIs.
• Details of the rte_flow APIs can be found in the following link: ../prog_guide/rte_flow.
• Details of the TestPMD commands to set the flow rules can be found in the following link:
TestPMD Flow rules
4.1.1 Description
In this example we will create a simple rule that drops packets whose IPv4 destination equals
192.168.3.2. This code is equivalent to the following testpmd command (wrapped for clarity):
testpmd> flow create 0 ingress pattern eth / vlan /
ipv4 dst is 192.168.3.2 / end actions drop / end
4.1.2 Code
pattern[3].type = RTE_FLOW_ITEM_TYPE_END;
4.1.3 Output
4.2.1 Description
In this example we will create a simple rule that drops packets whose IPv4 destination is in the range
192.168.3.0 to 192.168.3.255. This is done using a mask.
This code is equivalent to the following testpmd command (wrapped for clarity):
testpmd> flow create 0 ingress pattern eth / vlan /
ipv4 dst spec 192.168.3.0 dst mask 255.255.255.0 /
end actions drop / end
4.2.2 Code
4.2.3 Output
./filter-program enabled
[waiting for packets]
4.3.1 Description
In this example we will create a rule that routes all packets with VLAN id 123 to queue 3.
This code is equivalent to the following testpmd command (wrapped for clarity):
testpmd> flow create 0 ingress pattern eth / vlan vid spec 123 /
end actions queue index 3 / end
4.3.2 Code
4.3.3 Output
CHAPTER FIVE: PVP REFERENCE BENCHMARK SETUP USING TESTPMD
This guide lists the steps required to setup a PVP benchmark using testpmd as a simple forwarder
between NICs and Vhost interfaces. The goal of this setup is to have a reference PVP benchmark
without using external vSwitches (OVS, VPP, ...) to make it easier to obtain reproducible results and to
facilitate continuous integration testing.
The guide covers two ways of launching the VM, either by directly calling the QEMU command line,
or by relying on libvirt. It has been tested with DPDK v16.11 using RHEL7 for both host and guest.
In this diagram, each red arrow represents one logical core. This use-case requires 6 dedicated logical
cores. A forwarding configuration with a single NIC is also possible, requiring 3 logical cores.
In this setup, we isolate 6 cores (from CPU2 to CPU7) on the same NUMA node. Two cores are assigned
to the VM vCPUs running testpmd and four are assigned to testpmd on the host.
4. Disable NMIs:
echo 0 > /proc/sys/kernel/nmi_watchdog
[Fig. 5.1: PVP setup using 2 NICs. The Moongen traffic generator feeds two 10G NICs on the DUT; testpmd on the host forwards in io mode between the NICs and the vhost interfaces, and testpmd in the VM forwards in macswap mode.]
Build Qemu:
git clone git://git.qemu.org/qemu.git
cd qemu
mkdir bin
cd bin
../configure --target-list=x86_64-softmmu
make
Build DPDK:
git clone git://dpdk.org/dpdk
cd dpdk
export RTE_SDK=$PWD
make install T=x86_64-native-linux-gcc DESTDIR=install
Note: The Sandy Bridge family seems to have some IOMMU limitations giving poor performance
results. To achieve good performance on these machines consider using UIO instead.
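For reference, a host testpmd invocation consistent with this description (CPUs 2 to 5 as PMD lcores and two vhost-user sockets matching the QEMU command below; the memory size and exact core list are assumptions) could look like:

$RTE_SDK/install/bin/testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
    --vdev 'net_vhost0,iface=/tmp/vhost-user1' \
    --vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
    --portmask=f -i --rxq=1 --txq=1 \
    --nb-cores=4 --forward-mode=io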
With this command, isolated CPUs 2 to 5 will be used as lcores for PMD threads.
3. In testpmd interactive mode, set the portlist to obtain the correct port chaining:
set portlist 0,2,1,3
start
5.2.5 VM launch
Qemu way
Launch QEMU with two Virtio-net devices paired to the vhost-user sockets created by testpmd. The
example below uses default Virtio-net options, but options may be specified, for example to disable
mergeable buffers or indirect descriptors.
<QEMU path>/bin/x86_64-softmmu/qemu-system-x86_64 \
-enable-kvm -cpu host -m 3072 -smp 3 \
-chardev socket,id=char0,path=/tmp/vhost-user1 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:02:d9:01,addr=0x10 \
-chardev socket,id=char1,path=/tmp/vhost-user2 \
-netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \
-device virtio-net-pci,netdev=mynet2,mac=52:54:00:02:d9:02,addr=0x11 \
-object memory-backend-file,id=mem,size=3072M,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem -mem-prealloc \
-net user,hostfwd=tcp::1002$1-:22 -net nic \
-qmp unix:/tmp/qmp.socket,server,nowait \
-monitor stdio <vm_image>.qcow2
Libvirt way
Some initial steps are required for libvirt to be able to connect to testpmd’s sockets.
First, SELinux policy needs to be set to permissive, since testpmd is generally run as root (note, a reboot
is required):
cat /etc/selinux/config
Once the domain is created, the following snippet is an extract of the most important information
(hugepages, vCPU pinning, Virtio PCI devices):
<domain type='kvm'>
<memory unit='KiB'>3145728</memory>
<currentMemory unit='KiB'>3145728</currentMemory>
<memoryBacking>
<hugepages>
<page size='1048576' unit='KiB' nodeset='0'/>
</hugepages>
<locked/>
</memoryBacking>
<vcpu placement='static'>3</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='6'/>
<vcpupin vcpu='2' cpuset='7'/>
<emulatorpin cpuset='0'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
<os>
<type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>
<boot dev='hd'/>
</os>
<cpu mode='host-passthrough'>
<topology sockets='1' cores='3' threads='1'/>
<numa>
<cell id='0' cpus='0-2' memory='3145728' unit='KiB' memAccess='shared'/>
</numa>
</cpu>
<devices>
<interface type='vhostuser'>
<mac address='56:48:4f:53:54:01'/>
<source type='unix' path='/tmp/vhost-user1' mode='client'/>
<model type='virtio'/>
<driver name='vhost' rx_queue_size='256' />
<address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0'/>
</interface>
<interface type='vhostuser'>
<mac address='56:48:4f:53:54:02'/>
<source type='unix' path='/tmp/vhost-user2' mode='client'/>
<model type='virtio'/>
<driver name='vhost' rx_queue_size='256' />
<address type='pci' domain='0x0000' bus='0x00' slot='0x11' function='0x0'/>
</interface>
</devices>
</domain>
2. Disable NMIs:
echo 0 > /proc/sys/kernel/nmi_watchdog
Build DPDK:
git clone git://dpdk.org/dpdk
cd dpdk
export RTE_SDK=$PWD
make install T=x86_64-native-linux-gcc DESTDIR=install
Start testpmd:
$RTE_SDK/install/bin/testpmd -l 0,1,2 --socket-mem 1024 -n 4 \
--proc-type auto --file-prefix pg -- \
--portmask=3 --forward-mode=macswap --port-topology=chained \
--disable-rss -i --rxq=1 --txq=1 \
--rxd=256 --txd=256 --nb-cores=2 --auto-start
CHAPTER SIX: VF DAEMON (VFD)
VFd (the VF daemon) is a mechanism which can be used to configure features on a VF (SR-IOV Virtual
Function) without direct access to the PF (SR-IOV Physical Function). VFd is an EXPERIMENTAL
feature which can only be used in the scenario of DPDK PF with a DPDK VF. If the PF port is driven
by the Linux kernel driver then the VFd feature will not work. Currently VFd is only supported by the
ixgbe and i40e drivers.
In general VF features cannot be configured directly by an end user application since they are under the
control of the PF. The normal approach to configuring a feature on a VF is that an application would
call the APIs provided by the VF driver. If the required feature cannot be configured by the VF directly
(the most common case) the VF sends a message to the PF through the mailbox on ixgbe and i40e.
This means that the availability of the feature depends on whether the appropriate mailbox messages are
defined.
DPDK leverages the mailbox interface defined by the Linux kernel driver so that compatibility with
the kernel driver can be guaranteed. The downside of this approach is that the availability of messages
supported by the kernel becomes a limitation when the user wants to configure features on the VF.
VFd is a new method of controlling the features on a VF. The VF driver doesn’t talk directly to the PF
driver when configuring a feature on the VF. When a VF application (i.e., an application using the VF
ports) wants to enable a VF feature, it can send a message to the PF application (i.e., the application
using the PF port, which can be the same as the VF application). The PF application will configure the
feature for the VF. Obviously, the PF application can also configure the VF features without a request
from the VF application.
Compared with the traditional approach, VFd moves the negotiation between VF and PF from the
driver level to the application level. So the application should define how the negotiation between the VF
and PF works, or even whether the control should be limited to the PF.
It is the application's responsibility to use VFd. Consider, for example, a KVM migration: the VF
application may be transferred from one VM to another. It is recommended in this case that the PF control the
VF features without participation from the VF. Then the VF application has no capability to configure
the features, so the user doesn't need to define an interface between the VF application and the PF
application. The service provider should take control of all the features.
The following sections describe the VFd functionality.
Note: Although VFd is supported by both ixgbe and i40e, please be aware that since the hardware
capability is different, the functions supported by ixgbe and i40e are not the same.
[Figure: VFd model. On the host, a PF application runs on DPDK through the ethdev layer and the PF driver; in the VM, a VF application runs on DPDK through a virtual ethdev and the VF driver; VF feature configuration is negotiated at the application level.]
6.1 Preparing
VFd can only be used in the scenario of DPDK PF + DPDK VF. Users should bind the PF port to
igb_uio, then create the VFs based on the DPDK PF host.
The typical procedure to achieve this is as follows:
1. Boot the system without iommu, or with iommu=pt.
2. Bind the PF port to igb_uio, for example:
dpdk-devbind.py -b igb_uio 01:00.0
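The VFs are then created from the igb_uio-bound PF and testpmd is started on the PF port. A sketch of these steps (the VF count, core list and testpmd path are examples):

# create 2 VFs on the igb_uio-bound PF (01:00.0 as bound above)
echo 2 > /sys/bus/pci/devices/0000:01:00.0/max_vfs
# start testpmd on the host using the PF port
./app/testpmd -l 0-3 -n 4 -- -i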
The following sections show how to enable PF/VF functionality based on the above testpmd setup.
6.2.1 TX loopback
This sets whether the PF port and all the VF ports that belong to it are allowed to send the packets to
other virtual ports.
Although it is a VFd function, it is a global setting for the whole physical port. When using this
function, TX loopback is enabled/disabled for the PF and all of its VFs.
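The corresponding testpmd runtime command on the PF has the same form as the other commands in this section; for example:

set tx loopback 0 on|off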
Run a testpmd runtime command on the PF to set the MAC address for a VF port:
set vf mac addr 0 0 A0:36:9F:7B:C3:51
This testpmd runtime command will change the MAC address of the VF port to this new address. If any
other addresses are set before, they will be overwritten.
Run a testpmd runtime command on the PF to enable/disable the MAC anti-spoofing for a VF port:
set vf mac antispoof 0 0 on|off
When MAC anti-spoofing is enabled, the port will not forward packets whose source MAC address is
not the same as that of the port.
Run a testpmd runtime command on the PF to enable/disable the VLAN anti-spoofing for a VF port:
set vf vlan antispoof 0 0 on|off
When VLAN anti-spoofing is enabled, the port will not send packets whose VLAN ID does not belong
to the VLAN IDs that this port can receive.
Run a testpmd runtime command on the PF to set the VLAN insertion for a VF port:
set vf vlan insert 0 0 1
When using this testpmd runtime command, an assigned VLAN ID can be inserted into the transmitted
packets by the hardware.
The assigned VLAN ID can be 0, which means disabling the VLAN insertion.
Run a testpmd runtime command on the PF to enable/disable the VLAN stripping for a VF port:
set vf vlan stripq 0 0 on|off
This testpmd runtime command is used to enable/disable the RX VLAN stripping for a specific VF port.
Run a testpmd runtime command on the PF to set the VLAN filtering for a VF port:
rx_vlan add 1 port 0 vf 1
rx_vlan rm 1 port 0 vf 1
These two testpmd runtime commands can be used to add or remove the VLAN filter for several VF
ports. When the VLAN filters are added only the packets that have the assigned VLAN IDs can be
received. Other packets will be dropped by hardware.
Run a testpmd runtime command on the PF to enable/disable the all queues drop:
set all queues drop on|off
This is a global setting for the PF and all the VF ports of the physical port.
Enabling the all queues drop feature means that when there is no available descriptor for the
received packets they are dropped. The all queues drop feature should be enabled in SR-IOV
mode to avoid one queue blocking others.
Run a testpmd runtime command on the PF to enable/disable the packet drop for a specific VF:
set vf split drop 0 0 on|off
This is a function similar to all queues drop. The difference is that this function is a per-VF setting
while the previous function is a global setting.
Run a testpmd runtime command on the PF to set all queues' rate limit for a specific VF:
set port 0 vf 0 rate 10 queue_mask 1
This is a function to set the rate limit for all the queues in the queue_mask bitmap. It does not set an
aggregate rate limit; the rate limit of every queue will be set equally to the assigned rate limit.
6.3.4 VF RX enabling
Run a testpmd runtime command on the PF to enable/disable packet receiving for a specific VF:
set port 0 vf 0 rx on|off
6.3.5 VF TX enabling
Run a testpmd runtime command on the PF to enable/disable packet transmitting for a specific VF:
set port 0 vf 0 tx on|off
Run a testpmd runtime command on the PF to set the RX mode for a specific VF:
set port 0 vf 0 rxmode AUPE|ROPE|BAM|MPE on|off
This function can be used to enable/disable some RX modes on the VF, including:
• Whether it accepts untagged packets.
• Whether it accepts packets matching the MAC filters.
• Whether it accepts MAC broadcast packets.
• Whether it enables MAC multicast promiscuous mode.
6.4.1 VF statistics
This provides an API to get a specific VF's statistics from the PF.
This provides an API to reset a specific VF's statistics from the PF.
This provides an API to let a specific VF know if the physical link status has changed.
Normally if a VF receives this notification, the driver should notify the application to reset the VF port.
Run a testpmd runtime command on the PF to enable/disable MAC broadcast packet receiving for a
specific VF:
set vf broadcast 0 0 on|off
Run a testpmd runtime command on the PF to enable/disable MAC multicast promiscuous mode for a
specific VF:
set vf allmulti 0 0 on|off
Run a testpmd runtime command on the PF to enable/disable MAC unicast promiscuous mode for a
specific VF:
set vf promisc 0 0 on|off
Run a testpmd runtime command on the PF to set the TX maximum bandwidth for a specific VF:
set vf tx max-bandwidth 0 0 2000
Run a testpmd runtime command on the PF to set the TCs (traffic class) TX bandwidth allocation for a
specific VF:
set vf tc tx min-bandwidth 0 0 (20,20,20,40)
The allocated bandwidth should be set for all the TCs. The allocated bandwidth is a relative value
expressed as a percentage. The sum of all the bandwidths should be 100.
Run a testpmd runtime command on the PF to set the TCs TX maximum bandwidth for a specific VF:
set vf tc tx max-bandwidth 0 0 0 10000
Run a testpmd runtime command on the PF to enable/disable several TCs TX strict priority scheduling:
set tx strict-link-priority 0 0x3
A 0 in the TC bitmap means disabling the strict priority scheduling for that TC; to enable it, use a value
of 1.
CHAPTER SEVEN: VIRTIO_USER FOR CONTAINER NETWORKING
Containers are becoming more and more popular thanks to strengths such as low overhead, fast boot-up
time, and ease of deployment. How to use DPDK to accelerate container networking has become a common
question for users. There are two use models of running DPDK inside containers, as shown in Fig. 7.1.
[Fig. 7.1: Two use models of running DPDK inside containers, using either the host kernel PF driver or NIC SR-IOV (PF with multiple VFs).]
7.1 Overview
The virtual device, virtio-user, with unmodified vhost-user backend, is designed for high performance
user space container networking or inter-process communication (IPC).
The overview of accelerating container networking by virtio-user is shown in Fig. 7.2.
Different from the virtio PCI device we usually use as a para-virtualized I/O device in the context of
QEMU/VM, the basic idea here is to present a kind of virtual device which can be attached and
initialized by DPDK. The device emulation layer provided by QEMU in the VM's context is avoided by
just registering a new kind of virtual device in DPDK's ether layer. And to minimize the changes, we
reuse the already-existing virtio PMD code (drivers/net/virtio/).
Virtio, in essence, is a shared-memory based solution to transmit/receive packets. How is the memory
shared? In the VM's case, QEMU always shares the whole physical memory layout of the VM with the
vhost backend. But it is not feasible for a container, as a process, to share all of its virtual memory
regions with the backend. So only those virtual memory regions (i.e., the hugepages initialized in DPDK)
are sent to the backend. This restricts usage so that only addresses in these areas can be used to transmit
or receive packets.

[Fig. 7.2: Accelerating container networking with virtio-user: the container/app runs DPDK with the virtio PMD on top of a virtio-user virtual device, whose vhost-user adapter connects through a Unix socket file to the vhost backend of a vSwitch or vRouter attached to the NIC.]
Here we use Docker as the container engine. The approach also applies to LXC and Rocket with some minor changes.
1. Write a Dockerfile like below.
cat <<EOT >> Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
ENV PATH "$PATH:/usr/src/dpdk/x86_64-native-linux-gcc/app/"
EOT
--file-prefix=container \
-- -i
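For reference, a sketch of building the image from the Dockerfile above and launching a virtio-user testpmd container is shown below; the socket path, hugepage mount, core list and memory size are assumptions:

docker build -t dpdk-app-testpmd .

docker run -i -t \
    -v /path/to/usvhost:/var/run/usvhost \
    -v /dev/hugepages:/dev/hugepages \
    dpdk-app-testpmd testpmd -l 6-7 -n 4 -m 1024 --no-pci \
    --vdev=virtio_user0,path=/var/run/usvhost \
    --file-prefix=container \
    -- -i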
Note: If we run all of the above setup on the host, it is a shared-memory based IPC.
7.3 Limitations
CHAPTER EIGHT: VIRTIO_USER AS EXCEPTIONAL PATH
The virtual device, virtio-user, was originally introduced with the vhost-user backend, as a high performance
solution for IPC (Inter-Process Communication) and user space container networking.
Virtio_user with the vhost-kernel backend is a solution for the exceptional path, like KNI, which exchanges
packets with the kernel networking stack. This solution is very promising in:
• Maintenance
All kernel modules needed by this solution, vhost and vhost-net (kernel), are upstream and
extensively used kernel modules.
• Features
vhost-net was designed as a networking solution and has many networking related features, like
multi-queue, TSO, multi-segment mbufs, etc.
• Performance
Similar to KNI, this solution uses one or more kthreads to send/receive packets to/from user
space DPDK applications, which has little impact on the user space polling thread (except that it
might enter kernel space to wake up those kthreads if necessary).
The overview of an application using virtio-user as exceptional path is shown in Fig. 8.1.
As a prerequisite, the vhost/vhost-net kernel config options should be enabled before compiling the kernel,
and the corresponding kernel modules should be loaded.
1. Compile DPDK and bind a physical NIC to igb_uio/uio_pci_generic/vfio-pci.
This physical NIC is for communicating with outside.
2. Run testpmd.
$(testpmd) -l 2-3 -n 4 \
--vdev=virtio_user0,path=/dev/vhost-net,queue_size=1024 \
-- -i --tx-offloads=0x0000002c --enable-lro \
--txd=1024 --rxd=1024
This command runs testpmd with two ports, one physical NIC to communicate with outside, and
one virtio-user to communicate with kernel.
• --enable-lro
[Fig. 8.1: Virtio-user as exceptional path: the application's ethdev layer drives a virtio-user device whose vhost adapter talks to the vhost kernel module (vhost ko), alongside a physical NIC port.]
2. Start testpmd:
(testpmd) start
Note: The tap device will be named tap0, tap1, etc., by the kernel.
Then, all traffic from the physical NIC can be forwarded into the kernel stack, and all traffic on tap0 can
be sent out through the physical NIC.
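For example, the tap interface can then be configured on the kernel side like any other netdev (the address is a placeholder):

ip link set tap0 up
ip addr add 192.0.2.1/24 dev tap0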
8.2 Limitations
CHAPTER NINE: THE DPDK PACKET CAPTURE FRAMEWORK (PDUMP)
This document describes how the Data Plane Development Kit (DPDK) Packet Capture Framework is
used for capturing packets on DPDK ports. It is intended for users of DPDK who want to know more
about the Packet Capture feature and for those who want to monitor traffic on DPDK-controlled devices.
The DPDK packet capture framework was introduced in DPDK v16.07. The DPDK packet capture
framework consists of the DPDK pdump library and DPDK pdump tool.
9.1 Introduction
The librte_pdump library provides the APIs required to allow users to initialize the packet capture
framework and to enable or disable packet capture. The library works on a client/server model and its usage
is recommended for debugging purposes.
The dpdk-pdump tool is developed based on the librte_pdump library. It runs as a DPDK secondary
process and is capable of enabling or disabling packet capture on DPDK ports. The dpdk-pdump tool
provides command-line options with which users can request enabling or disabling of the packet capture
on DPDK ports.
The application which initializes the packet capture framework will act as a server and the application
that enables or disables the packet capture will act as a client. The server sends the Rx and Tx packets
from the DPDK ports to the client.
In DPDK the testpmd application can be used to initialize the packet capture framework and act as
a server, and the dpdk-pdump tool acts as a client. To view Rx or Tx packets of testpmd, the
application should be launched first, and then the dpdk-pdump tool. Packets from testpmd will be
sent to the tool, which then sends them on to the Pcap PMD device and that device writes them to the
Pcap file or to an external interface depending on the command-line option used.
Some things to note:
• The dpdk-pdump tool can only be used in conjunction with a primary application which has
the packet capture framework initialized already. In DPDK, only the testpmd application is modified
to initialize the packet capture framework; other applications remain untouched. So, if the dpdk-pdump
tool has to be used with any application other than testpmd, the user needs to explicitly
modify that application to call the packet capture framework initialization code. Refer to the
app/test-pmd/testpmd.c code and look for the pdump keyword to see how this is done.
• The dpdk-pdump tool depends on the libpcap based PMD which is disabled by default in the
build configuration files, owing to an external dependency on the libpcap development files. Once
the libpcap development files are installed, the libpcap based PMD can be enabled by setting
CONFIG_RTE_LIBRTE_PMD_PCAP=y and recompiling the DPDK.
An overview of using the Packet Capture Framework and the dpdk-pdump tool for packet capturing
on a DPDK port is shown in Fig. 9.1.
9.3 Configuration
Modify the DPDK primary application to initialize the packet capture framework as mentioned in the
above notes and enable the following config options and build DPDK:
CONFIG_RTE_LIBRTE_PMD_PCAP=y
CONFIG_RTE_LIBRTE_PDUMP=y
The following steps demonstrate how to run the dpdk-pdump tool to capture Rx side packets on
dpdk_port0 in Fig. 9.1 and inspect them using tcpdump.
1. Launch testpmd as the primary application:
sudo ./app/testpmd -c 0xf0 -n 4 -- -i --port-topology=chained
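2. Launch the dpdk-pdump tool as a secondary process. A typical invocation for capturing the Rx packets of port 0 into /tmp/capture.pcap (matching the tcpdump step below) is sketched here; the binary path depends on the build directory:

sudo ./build/app/dpdk-pdump -- \
    --pdump 'port=0,queue=*,rx-dev=/tmp/capture.pcap'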
3. Send traffic to dpdk_port0 from traffic generator. Inspect packets captured in the file capture.pcap
using a tool that can interpret Pcap files, for example tcpdump:
$tcpdump -nr /tmp/capture.pcap
reading from file /tmp/capture.pcap, link-type EN10MB (Ethernet)
11:11:36.891404 IP 4.4.4.4.whois++ > 3.3.3.3.whois++: UDP, length 18
11:11:36.891442 IP 4.4.4.4.whois++ > 3.3.3.3.whois++: UDP, length 18
11:11:36.891445 IP 4.4.4.4.whois++ > 3.3.3.3.whois++: UDP, length 18
Fig. 9.1: Packet capturing on a DPDK port using the dpdk-pdump tool (the primary application exposes packets through librte_pdump, and the tool writes them out via the PCAP PMD).
CHAPTER TEN: DPDK TELEMETRY
The Telemetry library provides users with the ability to query DPDK for telemetry information, currently
including information such as ethdev stats, ethdev port list, and eal parameters.
Note: This library is experimental and the output format may change in the future.
The library is enabled by default, however an EAL flag to enable the library exists, to provide backward
compatibility for the previous telemetry library interface:
--telemetry
The following steps show how to run an application with telemetry support, and query information using
the telemetry client python script.
1. Launch testpmd as the primary application with telemetry:
./app/dpdk-testpmd
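2. Launch the telemetry client script shipped with DPDK (path relative to the DPDK source tree):

python3 usertools/dpdk-telemetry.py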
3. When connected, the script displays the following, waiting for user input:
Connecting to /var/run/dpdk/rte/dpdk_telemetry.v2
{"version": "DPDK 20.05.0-rc2", "pid": 60285, "max_output_len": 16384}
-->
4. The user can now input commands to send across the socket, and receive the response. Some
available commands are shown below.
• List all commands:
--> /
{"/": ["/", "/eal/app_params", "/eal/params", "/ethdev/list",
"/ethdev/link_status", "/ethdev/xstats", "/help", "/info"]}
Note: For commands that expect a parameter, use ”,” to separate the command and parameter.
See examples below.
• Get the help text for a command. This will indicate what parameters are required. Pass the
command as a parameter:
--> /help,/ethdev/xstats
{"/help": {"/ethdev/xstats": "Returns the extended stats for a port.
Parameters: int port_id"}}
CHAPTER ELEVEN: DEBUG AND TROUBLESHOOT GUIDE
DPDK applications can be designed to have simple or complex pipeline processing stages making use of
single or multiple threads. Applications can also use poll mode drivers for hardware devices, which helps
offload CPU cycles. It is common to find solutions designed with:
• single or multiple primary processes
• single primary and single secondary
• single primary and multiple secondaries
In all the above cases, it is tedious to isolate, debug, and understand various behaviors which occur
randomly or periodically. The goal of this guide is to consolidate a few commonly seen issues for
reference, and then to show how to isolate and identify the root cause through step-by-step debugging at
various stages.
Note: It is difficult to cover all possible issues in a single attempt. With feedback and suggestions from
the community, more cases can be covered.
By making use of the application model as a reference, we can discuss multiple causes of issues in this
guide. Let us assume the sample makes use of a single primary process, with various processing stages
running on multiple cores. The application may also make use of poll mode drivers and libraries like
service cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.
The overview of an application modeled using PMD is shown in Fig. 11.1.
[Fig. 11.1: Sample application model built with PMDs: RX and packet classify/distribute stages, worker and crypto stages, QoS and TX stages spread across cores 0-6, with a stats collector and health check on core 7, and NICs on both the RX and TX sides.]
A couple of factors that drive the design decisions are the platform, the scale factor, and the target. These
distinct preferences lead to multiple combinations built using PMDs and DPDK libraries.
The compiler, library mode, and optimization flags are components expected to remain constant, but they
affect the application too.
Fig. 11.2: RX packet rate compared against received rate.
Fig. 11.3: RX-TX drops.
1. At RX
• Identify if there are multiple RX queues configured for the port, by checking nb_rx_queues using
rte_eth_dev_info_get.
• If rte_eth_dev_stats shows drops in q_errors, check whether the RX thread is configured to
fetch packets from the port's queue pair.
• If rte_eth_dev_stats shows drops in rx_nombuf, check whether the RX thread has
enough cycles to consume the packets from the queue.
2. At TX
• If the TX rate is falling behind the application fill rate, identify whether there are enough descriptors
with rte_eth_dev_info_get for TX.
• Check that nb_pkts in rte_eth_tx_burst is set to send multiple packets at a time.
• Check whether rte_eth_tx_burst invokes the vector function call for the PMD.
• If oerrors are getting incremented, TX packet validations are failing. Check whether there are queue
specific offload failures.
• If the drops occur for large size packets, check the MTU and multi-segment support configured
for the NIC.
11.2.3 Are there object drops at the producer point for the ring library?
• Extreme stalls in the dequeue stage of the pipeline will cause rte_ring_full to be true.
11.2.4 Are there object drops at the consumer point for the ring library?
Fig. 11.6: Memory objects have to be close to the device per NUMA.
1. Stalls in the processing pipeline can be attributed to MBUF release delays. These can be narrowed
down to:
• Heavy processing cycles at single or multiple processing stages.
• Cache being spread thin due to the increased number of stages in the pipeline.
• The CPU thread responsible for TX not being able to keep up with the burst of traffic.
• Extra cycles to linearize multi-segment buffers and software offloads like checksum, TSO, and
VLAN strip.
• Packet buffer copies in the fast path, which also result in stalls in MBUF release if not done selectively.
• Application logic setting rte_pktmbuf_refcnt_set to a higher value than desired
and frequently using rte_pktmbuf_prefree_seg without releasing the MBUF back
to the mempool.
2. Lower performance between the pipeline processing stages can be due to:
• The NUMA instance for packets or objects from the NIC, mempool, and ring not being the
same.
• Drops on a specific socket due to insufficient objects in the pool. Use
rte_mempool_get_count or rte_mempool_avail_count to monitor when the
drops occur.
• Missing prefetches; try prefetching the content in the processing pipeline logic to minimize the stalls.
3. Performance issues can be due to special cases:
• Check if the MBUF is contiguous with rte_pktmbuf_is_contiguous, as certain offloads
require it.
• Use rte_mempool_cache_create for user threads that require access to mempool objects.
• If the variance is absent for larger huge pages, then try rte_mem_lock_page on the objects,
packets, and lookup tables to isolate the issue.
Fig. 11.7: CRYPTO and interaction with PMD device.
Fig. 11.8: Custom worker function performance drops.
11.2.8 Are the execution cycles for dynamic service functions not frequent enough?
Fig. 11.9: Functions running on service cores (stats collector and health check).
• Check the in-flight events for the desired queue for enqueue and dequeue.
1. Identify whether the cause of a variance from the expected behavior is insufficient CPU cycles. Use
rte_tm_capabilities_get to fetch the features for hierarchies, WRED and priority schedulers
that can be offloaded to hardware.
2. Undesired flow drops can be narrowed down to WRED, priority, and rate limiters.
3. Isolate the flow in which the undesired drops occur. Use
rte_tm_get_number_of_leaf_nodes and the flow table to pin down the leaf where the
drops occur.
4. Check the stats using rte_tm_stats_update and rte_tm_node_stats_read for drops
for hierarchy, scheduler and WRED configurations.
[Figure: Primary process RX/TX queues mirrored to a secondary process through ring buffers for capture.]
1. To isolate possible packet corruption in the processing pipeline, carefully staged packet capture
has to be implemented.
• First, isolate at the NIC entry and exit.
Use pdump in the primary process to allow a secondary process to access the port-queue pair. The
packets get copied over in the RX|TX callback by the secondary process using ring buffers.
1. For an application that runs as the primary process only, debug functionality is added in the same
process. It can be invoked by a timer call-back, a service core or a signal handler.
2. For an application that runs as multiple processes, place the debug functionality in a standalone
secondary process.
CHAPTER TWELVE: ENABLE DPDK ON OPENWRT
This document describes how to enable Data Plane Development Kit (DPDK) on OpenWrt in both a
virtual and physical x86 environment.
12.1 Introduction
The OpenWrt project is a well-known source-based router OS which provides a fully writable filesystem
with package management.
• CONFIG_VFIO_PCI_MMAP=y
• CONFIG_HUGETLBFS=y
• CONFIG_HUGETLB_PAGE=y
• CONFIG_PROC_PAGE_MONITOR=y
For detailed OpenWrt build steps and prerequisites, please refer to the OpenWrt build guide.
After the build is completed, you can find the images and SDK in <OpenWrt
Root>/bin/targets/x86/64-glibc/.
12.3.1 Pre-requisites
Note: For compiling the NUMA lib, run libtool --version to ensure the libtool version >= 2.2,
otherwise the compilation will fail with errors.
The NUMA header files and library files are generated in the include and lib folders respectively under
<OpenWrt SDK toolchain dir>.
To cross compile with meson build, you need to write a customized cross file first.
[binaries]
c = 'x86_64-openwrt-linux-gcc'
cpp = 'x86_64-openwrt-linux-cpp'
ar = 'x86_64-openwrt-linux-ar'
strip = 'x86_64-openwrt-linux-strip'
Note: For compiling the igb_uio with the kernel version used in target machine, you need to explicitly
specify kernel_dir in meson_options.txt.
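DPDK can then be cross-compiled with meson and ninja; a sketch, assuming the cross file above is saved as openwrt-cross and the build directory is builddir:

meson builddir --cross-file openwrt-cross
ninja -C builddir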
• Launch Qemu
qemu-system-x86_64 \
-cpu host \
-smp 8 \
-enable-kvm \
-M q35 \
-m 2048M \
-object memory-backend-file,id=mem,size=2048M,mem-path=/tmp/hugepages,share=on \
-drive file=<Your OpenWrt images folder>/openwrt-x86-64-combined-ext4.img,id=d0,if=none \
-device ide-hd,drive=d0,bus=ide.0 \
-net nic,vlan=0 \
-net nic,vlan=1 \
-net user,vlan=1 \
-display none \
You can use the dd tool to write the OpenWrt image to the drive you want to install it on.
dd if=openwrt-18.06.1-x86-64-combined-squashfs.img of=/dev/sdX
Where sdX is the name of the drive. (You can find it through fdisk -l.)
For more detailed information about how to run a DPDK application, please refer to the Running DPDK
Applications section of the DPDK documentation.
Note: You need to install pre-built NUMA libraries (including soft link) to /usr/lib64 in OpenWrt.