Mellanox OFED Linux User Manual
Rev 2.3-1.0.1
Last Updated: November 06, 2014
www.mellanox.com
Table of Contents

List of Figures
List of Tables
Chapter 1 Introduction
1.1 Overview
1.2 Stack Architecture
1.2.1 mlx4 VPI Driver
1.2.2 mlx5 Driver
1.2.3 Mid-layer Core
1.2.4 ULPs
1.2.5 MPI
1.2.6 InfiniBand Subnet Manager
1.2.7 Diagnostic Utilities
1.2.8 Mellanox Firmware Tools
1.3 Mellanox OFED Package
1.3.1 ISO Image
1.3.2 Software Components
1.3.3 Firmware
1.3.4 Directory Structure
1.4 Module Parameters
1.4.1 mlx4 Module Parameters
1.4.2 mlx5 Module Parameters
1.5 Device Capabilities
Chapter 2 Installation
2.1 Hardware and Software Requirements
2.2 Downloading Mellanox OFED
2.3 Installing Mellanox OFED
2.3.1 Installation Procedure
2.3.2 Installation Results
2.3.3 Installation Logging
2.3.4 openibd Script
2.3.5 mlnxofedinstall Return Codes
2.4 Uninstalling Mellanox OFED
2.5 Installing MLNX_OFED using YUM
2.5.1 Setting up MLNX_OFED YUM Repository
2.5.2 Installing MLNX_OFED using the YUM Tool
2.5.3 Uninstalling Mellanox OFED using the YUM Tool
2.6 Installing MLNX_OFED using apt-get
2.6.1 Setting up MLNX_OFED apt-get Repository
2.6.2 Installing MLNX_OFED using the apt-get Tool
2.6.3 Uninstalling Mellanox OFED using the apt-get Tool
2.7 Updating Firmware After Installation
List of Figures

Figure 1: Mellanox OFED Stack for ConnectX® Family Adapter Cards
Figure 2: RoCE and RoCE v2 Protocol Stack
Figure 3: QoS Manager
Figure 4: Example QoS Deployment on InfiniBand Subnet
Figure 5: I/O Consolidation Over InfiniBand
Figure 6: An Example of a Virtual Network
List of Tables
Revision History

Rev 2.3-1.0.1, November 06, 2014:
• Added Section 3.2.5.1.6, "Dynamic PKey Change"

September, 2014:
• Major restructuring of the User Manual
• Added the following sections:
  • Section 2.6, "Installing MLNX_OFED using apt-get"
  • Section 2.6.1, "Setting up MLNX_OFED apt-get Repository"
  • Section 2.6.2, "Installing MLNX_OFED using the apt-get Tool"
  • Section 2.6.3, "Uninstalling Mellanox OFED using the apt-get Tool"
  • Section 2.8.2, "Removing Signature from kernel Modules"
  • Section 3.1.5, "Checksum Offload"
  • Section 3.1.6.1, "IP Routable RoCE"
  • Section 3.1.6.2, "RoCE Configuration"
  • Section 3.1.6.2.2, "Configuring SwitchX® Based Switch System"
  • Section 3.1.6.2.3, "Configuring the RoCE Mode"
  • Section 3.1.7, "Explicit Congestion Notification (ECN)"
  • Section 3.2.4, "Secure Host"
  • Section 3.2.7.3, "Memory Region Re-registration"
  • Section 3.2.7.5, "User-Mode Memory Registration (UMR)"
  • Section 3.2.7.6, "On-Demand-Paging (ODP)"
  • Section 3.4.1.8.1, "Configuring VGT+"
  • Section 3.5.1, "Reset Flow"
  • Section 5.1, "General Related Issues"
  • Section 5.2, "Ethernet Related Issues"
  • Section 5.3, "InfiniBand Related Issues"
  • Section 5.4, "InfiniBand/Ethernet Related Issues"
  • Section 5.5, "Installation Related Issues"
  • Section 5.6, "Performance Related Issues"
  • Section 5.7, "SR-IOV Related Issues"
  • Section 5.8, "PXE (FlexBoot) Related Issues"
  • Section 5.9, "RDMA Related Issues"
• Updated the following sections:
  • Section 3.1.11, "Flow Steering"
Intended Audience
This manual is intended for system administrators responsible for the installation, configuration,
management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet)
adapter cards. It is also intended for application developers.
Glossary
The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.

Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.

Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to forward received multicast packets. The table is organized by MLID.

Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.

Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.

Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.

Subnet Manager (SM): One of several entities involved in the configuration and control of an IB fabric.

Unicast Linear Forwarding Tables (LFT): A table that exists in every switch providing the port through which packets should be sent to each LID.

Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX®) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
Related Documentation
Table 4 - Reference Documents
1 Introduction
1.1 Overview
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates
across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB);
10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0
uplinks to servers.
All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols
and software, and are supported with major operating system distributions.
Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA™) software: Socket acceleration library that
performs OS bypass for standard socket based applications.
• Mellanox Unified Fabric Manager (UFM®) software: Powerful platform for managing
demanding scale-out computing fabric environments, built on top of the OpenSM
industry standard routing engine.
• Fabric Collective Accelerator (FCA) - FCA is a Mellanox MPI-integrated software
package that utilizes CORE-Direct technology for implementing the MPI collectives
communications.
The following sub-sections briefly describe the various components of the Mellanox OFED
stack.
1.2.1 mlx4 VPI Driver
mlx4_core
Handles low-level functions like device initialization and firmware commands processing. Also
controls resource allocation so that the InfiniBand and Ethernet functions can share the device
without interfering with each other.
mlx4_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand mid-layer.
mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.
1.2.2 mlx5 Driver
mlx5_core
Acts as a library of common functions (e.g. initializing the device after reset) required by the
Connect-IB® adapter card.
mlx5_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
libmlx5
libmlx5 is the provider library that implements hardware-specific user-space functionality. If there is no compatibility between the firmware and the driver, the driver will not load, and a message will be printed in dmesg.
The following are the libmlx5 environment variables:
• MLX5_FREEZE_ON_ERROR_CQE
• If set, causes the process to hang in a loop when a completion with error occurs that is not a flush-with-error or a retry-exceeded error.
• Otherwise disabled.
• MLX5_POST_SEND_PREFER_BF
• If set, every work request that can use Blue Flame will use Blue Flame.
• Otherwise, Blue Flame depends on the size of the message and the inline indication in the packet.
• MLX5_SHUT_UP_BF
• If set, disables the Blue Flame feature.
• Otherwise, Blue Flame is not disabled.
• MLX5_SINGLE_THREADED
• If set, all spinlocks are disabled.
• Otherwise, spinlocks are enabled.
• Used by applications that are single-threaded and would like to save the overhead of taking spinlocks.
• MLX5_CQE_SIZE
• 64 - completion queue entry size is 64 bytes (default)
• 128 - completion queue entry size is 128 bytes
• MLX5_SCATTER_TO_CQE
• Small buffers are scattered to the completion queue entry and manipulated by the driver. Valid for RC transport.
• Default is 1 (enabled); otherwise disabled.
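For illustration only (this sketch is not part of the original manual): these variables must be present in the process environment before libmlx5 initializes, for example before the first device is opened through the standard libibverbs API.

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
	/* libmlx5 reads its environment variables when the user-space
	 * driver initializes, so set them before opening a device. */
	setenv("MLX5_SINGLE_THREADED", "1", 1); /* skip spinlocks */
	setenv("MLX5_CQE_SIZE", "128", 1);      /* 128-byte CQEs */

	int num;
	struct ibv_device **list = ibv_get_device_list(&num);
	if (!list || num == 0) {
		fprintf(stderr, "no RDMA devices found\n");
		return 1;
	}

	struct ibv_context *ctx = ibv_open_device(list[0]);
	if (!ctx) {
		fprintf(stderr, "failed to open device\n");
		ibv_free_device_list(list);
		return 1;
	}

	/* ... create PD/CQ/QP and run the application ... */

	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}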
1.2.4 ULPs
IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to the IP datagrams and sends the outcome over the InfiniBand transport service. The transport service is Unreliable Datagram (UD) by default, but it may also be configured to be Reliable Connected (RC). The interface supports unicast, multicast and broadcast. For details, see Chapter 3.2.5.1, "IP over InfiniBand (IPoIB)".
iSER
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be
transferred directly into and out of SCSI buffers without intermediate data copies. For further
information, please refer to Chapter 3.3.2, “iSCSI Extensions for RDMA (iSER)”.
SRP
SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and
RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI soft-
ware to be readily used on InfiniBand architecture. The SRP driver—known as the SRP Initia-
tor—differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control
a local HBA; instead, it controls a connection to an I/O controller—known as the SRP Target—to
provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in
an I/O unit and provides storage services. See Chapter 3.3.1, “SCSI RDMA Protocol (SRP)”.
uDAPL
User Direct Access Programming Library (uDAPL) is a standard API that promotes data center
application data messaging performance, scalability, and reliability over RDMA interconnects:
InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative.
This release of the uDAPL reference implementation package, for both the DAT 1.2 and 2.0 specifications, is timed to coincide with the OFED release of the OpenFabrics (www.openfabrics.org) software stack.
For more information about the DAT collaborative, go to the following site:
http://www.datcollaborative.org
1.2.5 MPI
Message Passing Interface (MPI) is a library specification that enables the development of paral-
lel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mella-
nox OFED includes the following MPI implementations over InfiniBand:
• Open MPI – an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH – an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Bench-
mark, and Presta.
1. OpenSM is disabled by default. See Chapter 3.2.2, “OpenSM” for details on enabling it.
1.3.3 Firmware
The ISO image includes the following firmware items:
• Firmware images (.mlx format) for ConnectX®-3/ConnectX®-3 Pro/Connect-IB® network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
1.4 Module Parameters

1.4.1 mlx4 Module Parameters
probe_vf: Either a single value (e.g. '3') to indicate that the Hypervi-
sor driver itself should activate this number of VFs for each
HCA on the host, or a string to map device function numbers to
their probe_vf values (e.g. '0000:04:00.0-3,002b:1c:0b.a-13').
Hexadecimal digits for the device function (e.g. 002b:1c:0b.a)
and decimal for probe_vf value (e.g. 13). (string)
log_num_mgm_entry_size: log mgm size, which defines the number of QPs per MCG; for example, 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate device-managed flow steering when available, set to -1. (int)
high_rate_steer: Enable steering mode for higher packet rate (default off)
(int)
fast_drop: Enable fast packet drop when no receive WQEs are posted (int)
enable_64b_cqe_eqe: Enable 64 byte CQEs/EQEs when the FW supports this if non-zero
(default: 1) (int)
log_num_mac: Log2 max number of MACs per ETH port (1-7) (int)
log_num_vlan: (Obsolete) Log2 max number of VLANs per ETH port (0-7) (int)
log_mtts_per_seg: Log2 number of MTT entries per segment (0-7) (default: 0) (int)
port_type_array: Either a pair of values (e.g. '1,2') to define a uniform port1/port2 types configuration for all device functions, or a string to map device function numbers to their pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1'). Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A. If only a single port is available, use the N/A port type for port2 (e.g. '1,4').
log_num_qp: log maximum number of QPs per HCA (default: 19) (int)
log_num_srq: log maximum number of SRQs per HCA (default: 16) (int)
log_rdmarc_per_qp: log number of RDMARC buffers per QP (default: 4) (int)
log_num_cq: log maximum number of CQs per HCA (default: 16) (int)
log_num_mcg: log maximum number of multicast groups per HCA (default: 13)
(int)
log_num_mpt: log maximum number of memory protection table entries per HCA
(default: 19) (int)
log_num_mtt: log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs needed to register all of the host memory), limited to 30) (int)
enable_qos: Enable Quality of Service support in the HCA (default: off)
(bool)
internal_err_reset: Reset device on internal errors if non-zero (default is 1)
(int)
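As an illustrative sketch (the file name below is an assumption; any file under /etc/modprobe.d/ works), module parameters persist across driver reloads when set in a modprobe configuration file. For example, adding the line "options mlx4_core port_type_array=2,2 log_num_mgm_entry_size=-1" to /etc/modprobe.d/mlx4.conf sets both ports of every device function to Ethernet and enables device-managed flow steering when available; the driver must then be restarted (e.g. via /etc/init.d/openibd restart) for the new values to take effect.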
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single
host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.
Requirements:

• Platforms: A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
  • MT25408 ConnectX®-2 (VPI, IB, EN) (firmware: fw-ConnectX2)
  • MT27508 ConnectX®-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
  • MT4103 ConnectX®-3 Pro (VPI, IB, EN) (firmware: fw-ConnectX3Pro)
  • MT4113 Connect-IB® (IB) (firmware: fw-Connect-IB)
  For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.
• Required Disk Space for Installation: 1GB
• Device ID: For the latest list of device IDs, please visit the Mellanox website.
• Operating System: Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
• Installer Privileges: The installation requires administrator privileges on the target machine.
22 Mellanox Technologies
Rev 2.3-1.0.1
On RedHat and SLES distributions with an errata kernel installed, there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.
1. The firmware will not be updated if you run the install script with the ‘--without-fw-update’ option.
Example
The following command will create a MLNX_OFED_LINUX TGZ for RedHat 6.3 under the /tmp directory:
# ./MLNX_OFED_LINUX-x.x-x-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-x.x-x-rhel6.3-x86_64/ --make-tgz
Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.21642.log
In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:
Device #1:
----------
Device Type: ConnectX3Pro
Part Number: MCX354A-FCC_Ax
Description: ConnectX-3 Pro VPI adapter card; dual-port
QSFP; FDR IB (56Gb/s) and 40GigE;PCIe3.0 x8 8GT/s;RoHS R6
PSID: MT_1090111019
PCI Device Name: 0000:05:00.0
Versions: Current Available
FW 2.31.5000 2.31.5000
PXE 3.4.0224 3.4.0224
Status: Up to date
In case your machine has an unsupported network adapter device, no firmware update
will occur and the error message below will be printed. Please contact your hardware
vendor for help on firmware updates.
Error message:
Device #1:
----------
Device: 0000:05:00.0
Part Number:
Description:
PSID: MT_0DB0110010
Versions: Current Available
FW 2.9.1000 N/A
Status: No matching image found
Step 4. Reboot the machine if the installation script performed firmware updates to your network
adapter hardware. Otherwise, restart the driver by running: "/etc/init.d/openibd
restart"
Step 5. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the
InfiniBand link is up. The utility also checks for and displays additional information such as
• HCA firmware version
• Kernel architecture
• Driver version
• Number of active HCA ports along with their states
• Node GUID
For more details on hca_self_test.ofed, see the file hca_self_test.readme
under docs/.
After the installer completes, information about the Mellanox OFED installation such as prefix, kernel version, and installation parameters can be retrieved by running the command /etc/infiniband/info.
Most of the Mellanox OFED components can be configured or reconfigured after the installation
by modifying the relevant configuration files. See the relevant chapters in this manual for details.
The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.
Software:
• Most of the MLNX_OFED packages are installed under the "/usr" directory, except for the following packages, which are installed under the "/opt" directory: openshmem, bupc, fca and ibutils.
• The kernel modules are installed under:
  • /lib/modules/`uname -r`/updates on SLES and Fedora distributions
  • /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat-like distributions
  • /lib/modules/`uname -r`/updates/dkms/ on Ubuntu
Firmware:
• The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
  • The installation script is run in default mode, that is, without the option '--without-fw-update'
  • The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image
  Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.
• In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed:
  -I- Querying device ...
  -E- Can't auto detect fw configuration file: ...
  Please contact your hardware vendor for help on firmware updates.
http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
# wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
--2014-04-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Resolving www.mellanox.com... 72.3.194.0
Connecting to www.mellanox.com|72.3.194.0|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1354 (1.3K) [text/plain]
Saving to: 'RPM-GPG-KEY-Mellanox'
MST devices:
------------
/dev/mst/mt25418_pciconf0 - PCI configuration cycles access.
bus:dev.fn=02:00.0 addr.reg=88
data.reg=92
Chip revision is: A0
/dev/mst/mt25418_pci_cr0 - PCI direct access.
bus:dev.fn=02:00.0 bar=0xdef00000
size=0x100000
Chip revision is: A0
........
2. Your InfiniBand device is the one with the postfix “_pci_cr0”. In the example listed
above, this will be /dev/mst/mt25418_pci_cr0.
Step 3. Burn firmware.
• Burn a firmware image from a .mlx file using the mlxburn utility (that is already installed on your
machine)
The following command burns firmware onto the ConnectX® device with the device
name obtained in the example of Step 2.
> flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-2_31_5050-MCX353A-FCA_A1.bin burn
Step 4. Reboot your machine after the firmware burning is completed.
Prior to adding Mellanox's x.509 public key to your system, please make sure:
• the 'mokutil' package is installed on your system
• the system is booted in UEFI mode
To see what keys have been added to the system key ring on the current boot, install the 'keyutils' package and run:
# keyctl list %:.system_keyring
3 Features Overview and Configuration
3.1.1 Interface
Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet).
For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically configure the first port as InfiniBand and the second as Ethernet.
3.1.1.2 Counters
Counters are used to provide information about how well an operating system, an application, a
service, or a driver is performing. The counter data helps determine system bottlenecks and fine-
tune the system and application performance. The operating system, network, and devices pro-
vide counter data that an application can consume to provide users with a graphical view of how
well the system is performing.
The counter index is a QP attribute given in the QP context. Multiple QPs may be associated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.
• ConnectX®-3 supports 127 different counters, which are allocated as follows:
• 4 counters reserved for PF - 2 counters for each port
• 2 counters reserved for VF - 1 counter for each port
• All other counters, if they exist, are allocated on demand
• RoCE counters are available only through sysfs, located under:
• # /sys/class/infiniband/mlx4_*/ports/*/counters/
• # /sys/class/infiniband/mlx4_*/ports/*/counters_ext/
• The Physical Function can also read Virtual Functions' port counters through sysfs, located under:
• # /sys/class/net/eth*/vf*_statistics/
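As an illustrative sketch (not from the manual; the device, port and counter names below are example values), such a counter can be read like any other sysfs file:

#include <stdio.h>

/* Read a single numeric counter from sysfs; returns -1 on error. */
static long long read_counter(const char *path)
{
	long long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

/* Example (hypothetical device/counter):
 * read_counter("/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data");
 */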
To display the network device Ethernet statistics, you can run:
ethtool -S <devname>
Counter Description
rx_packets Total packets successfully received.
rx_bytes Total bytes in successfully received packets.
rx_multicast_packets Total multicast packets successfully received.
rx_broadcast_packets Total broadcast packets successfully received.
rx_errors Number of receive packets that contained errors preventing them
from being deliverable to a higher-layer protocol.
rx_dropped Number of receive packets which were chosen to be discarded
even though no errors had been detected to prevent their being
deliverable to a higher-layer protocol.
rx_length_errors Number of received frames that were dropped due to an error in
frame length
rx_over_errors Number of received frames that were dropped due to overflow
rx_crc_errors Number of received frames with a bad CRC that are not runts,
jabbers, or alignment errors
rx_jabbers Number of received frames with a length greater than MTU octets
and a bad CRC
rx_in_range_length_error Number of received frames with a length/type field value in the
(decimal) range [1500:46] (42 is also counted for VLAN-tagged
frames)
rx_out_range_length_error Number of received frames with a length/type field value in the
(decimal) range [1535:1501]
rx_lt_64_bytes_packets Number of received 64-or-less-octet frames
rx_127_bytes_packets Number of received 65-to-127-octet frames
rx_255_bytes_packets Number of received 128-to-255-octet frames
rx_511_bytes_packets Number of received 256-to-511-octet frames
rx_1023_bytes_packets Number of received 512-to-1023-octet frames
rx_1518_bytes_packets Number of received 1024-to-1518-octet frames
rx_1522_bytes_packets Number of received 1519-to-1522-octet frames
rx_1548_bytes_packets Number of received 1523-to-1548-octet frames
rx_gt_1548_bytes_packets Number of received 1549-or-greater-octet frames
Counter Description
tx_packets Total packets successfully transmitted.
tx_bytes Total bytes in successfully transmitted packets.
tx_multicast_packets Total multicast packets successfully transmitted.
tx_broadcast_packets Total broadcast packets successfully transmitted.
tx_errors Number of frames that failed to transmit
tx_dropped Number of transmitted frames that were dropped
tx_lt_64_bytes_packets Number of transmitted 64-or-less-octet frames
tx_127_bytes_packets Number of transmitted 65-to-127-octet frames
tx_255_bytes_packets Number of transmitted 128-to-255-octet frames
tx_511_bytes_packets Number of transmitted 256-to-511-octet frames
tx_1023_bytes_packets Number of transmitted 512-to-1023-octet frames
tx_1518_bytes_packets Number of transmitted 1024-to-1518-octet frames
tx_1522_bytes_packets Number of transmitted 1519-to-1522-octet frames
tx_1548_bytes_packets Number of transmitted 1523-to-1548-octet frames
tx_gt_1548_bytes_packets Number of transmitted 1549-or-greater-octet frames
Counter Description
rx_prio_<i>_packets Total packets successfully received with priority i.
rx_prio_<i>_bytes Total bytes in successfully received packets with priority i.
rx_novlan_packets Total packets successfully received with no VLAN priority.
rx_novlan_bytes Total bytes in successfully received packets with no VLAN pri-
ority.
tx_prio_<i>_packets Total packets successfully transmitted with priority i.
tx_prio_<i>_bytes Total bytes in successfully transmitted packets with priority i.
tx_novlan_packets Total packets successfully transmitted with no VLAN priority.
tx_novlan_bytes Total bytes in successfully transmitted packets with no VLAN
priority.
Counter Description
rx_pause_prio_<i> The total number of PAUSE frames received from the far-end port
rx_pause_duration_prio_<i> The total time in microseconds that the far-end port was requested to pause transmission of packets
rx_pause_transition_prio_<i> The number of receiver transitions from XON state (non-paused) to XOFF state (paused)
tx_pause_prio_<i> The total number of PAUSE frames sent to the far-end port
tx_pause_duration_prio_<i> The total time in microseconds that transmission of packets has been paused
tx_pause_transition_prio_<i> The number of transmitter transitions from XON state (non-paused) to XOFF state (paused)
Counter Description
vport<i>_rx_unicast_packets Unicast packets received successfully
vport<i>_rx_unicast_bytes Unicast packet bytes received successfully
vport<i>_rx_multicast_packets Multicast packets received successfully
vport<i>_rx_multicast_bytes Multicast packet bytes received successfully
vport<i>_rx_broadcast_packets Broadcast packets received successfully
vport<i>_rx_broadcast_bytes Broadcast packet bytes received successfully
vport<i>_rx_dropped Received packets discarded due to out-of-buffer condition
vport<i>_rx_errors Received packets discarded due to receive error condition
vport<i>_tx_unicast_packets Unicast packets sent successfully
vport<i>_tx_unicast_bytes Unicast packet bytes sent successfully
vport<i>_tx_multicast_packets Multicast packets sent successfully
vport<i>_tx_multicast_bytes Multicast packet bytes sent successfully
vport<i>_tx_broadcast_packets Broadcast packets sent successfully
vport<i>_tx_broadcast_bytes Broadcast packet bytes sent successfully
vport<i>_tx_errors Packets dropped due to transmit errors
Counter Description
rx_lro_aggregated Number of packets aggregated by LRO
rx_lro_flushed Number of LRO flushes to the stack
rx_lro_no_desc Number of times an LRO descriptor was not found
rx_alloc_failed Number of times receive descriptor preparation failed
rx_csum_good Number of packets received with a good checksum
rx_csum_none Number of packets received with no checksum indication
tx_chksum_offload Number of packets transmitted with checksum offload
tx_queue_stopped Number of times the transmit queue was suspended
tx_wake_queue Number of times the transmit queue was resumed
tx_timeout Number of transmit timeouts
tx_tso_packets Number of packets that were aggregated (TSO)
Counter Description
rx<i>_packets Total packets successfully received on ring i
rx<i>_bytes Total bytes in successfully received packets on ring i.
tx<i>_packets Total packets successfully transmitted on ring i.
tx<i>_bytes Total bytes in successfully transmitted packets on ring i.
• If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio-to-UP mapping.
Mapping the sk_prio to the UP is done by using: tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7
4. The UP is mapped to the TC as configured by the mlnx_qos tool, or by the lldpad daemon if DCBX is used.
Socket applications can use setsockopt(SK_PRIO, value) to directly set the sk_prio of the socket. In this case, the ToS-to-sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS.
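As a sketch of the socket call above (SO_PRIORITY is the standard Linux socket option that sets sk_prio; error handling is minimal):

#include <stdio.h>
#include <sys/socket.h>

/* Set the socket's sk_prio directly, bypassing the ToS-based mapping;
 * prio may take more than the 4 values reachable via ToS. */
static int set_sk_prio(int sockfd, int prio)
{
	if (setsockopt(sockfd, SOL_SOCKET, SO_PRIORITY,
		       &prio, sizeof(prio)) < 0) {
		perror("setsockopt(SO_PRIORITY)");
		return -1;
	}
	return 0;
}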
In case of VLAN interface, the UP obtained according to the above mapping is also used
in the VLAN tag of the traffic
With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.
40 Mellanox Technologies
Rev 2.3-1.0.1
1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of
the QP:
• Sets qp_attrs.ah_attrs.sl = up
• Calls modify_qp with IB_QP_AV set in the mask
2. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if
DCBX is used
Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP. If packets with a VLAN tag are transmitted, the UP in the VLAN tag will be overwritten with the given UP.
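A minimal sketch of the transition described above, using the standard verbs API (the qp, port number and up values are assumed to come from the caller; real code would also set the remaining RTR attributes required for its QP type):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Force a Raw Ethernet QP to transmit with the given UP by setting
 * the SL in the address vector during the INIT-to-RTR transition. */
static int set_raw_qp_up(struct ibv_qp *qp, uint8_t port_num, uint8_t up)
{
	struct ibv_qp_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.qp_state = IBV_QPS_RTR;
	attr.ah_attr.port_num = port_num;
	attr.ah_attr.sl = up; /* the SL carries the UP */

	return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV);
}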
Strict Priority
When a TC's transmission algorithm is set to 'strict', that TC has absolute (strict) priority over all strict TCs with lower TC numbers (TC 7 is the highest priority, TC 0 is the lowest). It also has absolute priority over non-strict TCs (ETS).
This property needs to be used with care, as it may easily cause starvation of other TCs.
A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered.
Non-strict priority TCs will be considered last to transmit.
This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.
Rate Limit
Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from
the requested values is considered acceptable.
mlnx_qos
mlnx_qos is a centralized tool used to configure the QoS features of the local host. It communicates directly with the driver, and thus does not require setting up a DCBX daemon on the system.
The mlnx_qos tool enables the administrator of the system to:
• Inspect the current QoS mappings and configuration
The tool will also display maps configured by the tc and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
• Set UP to TC mapping
• Assign a transmission algorithm to each TC (strict or ETS)
• Set a minimal BW guarantee to ETS TCs
• Set a rate limit to TCs
Usage:
Options:
Set ratelimit: 3Gbps for tc0, 4Gbps for tc1, and 2Gbps for tc2:
tc: 0 ratelimit: 3 Gbps, tsa: strict
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 1
up: 2
up: 3
up: 4
up: 5
up: 6
up: 7
Configure QoS: map UP 0 and 7 to tc0, UP 1, 2 and 3 to tc1, and UP 4, 5 and 6 to tc2. Set tc0 and tc1 to ETS and tc2 to strict. Divide the ETS bandwidth 30% for tc0 and 70% for tc1:
mlnx_qos -i eth3 -s ets,ets,strict -p 0,1,1,1,2,2,2 -t 30,70
tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 7
tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
up: 1
up: 2
up: 3
tc: 2 ratelimit: 2 Gbps, tsa: strict
up: 4
up: 5
up: 6
tc and tc_wrap.py
The 'tc' tool is used to set up the sk_prio-to-UP mapping, using the mqprio queue discipline.
In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs.
The 'tc_wrap.py' tool will use either the sysfs or the 'tc' tool to configure the sk_prio-to-UP mapping.
Usage:
Options:
Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1 on eth4
UP 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
UP 1
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
UP 2
UP 3
UP 4
UP 5
UP 6
UP 7
Additional Tools
The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. It is part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
• mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
• tc_wrap.py (package: ofed-scripts) requires python >= 2.5
Usage:
Options:
priority 0:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
priority 1:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
.............................
.............................
priority 7:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
3.1.4 Ethtool
ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for
wired Ethernet devices. It can be used to:
• Get identification and diagnostic information
• Get extended device statistics
• Control speed, duplex, autonegotiation and flow control for Ethernet devices
• Control checksum offload and other hardware offload features
• Control DMA ring sizes and interrupt moderation
The following are the ethtool supported options:
Table 3 - ethtool Supported Options

ethtool -i eth<x>
  Checks driver and device information. For example:
  #> ethtool -i eth2
  driver: mlx4_en (MT_0DD0120009_CX3)
  version: 2.1.6 (Aug 2013)
  firmware-version: 2.30.3000
  bus-info: 0000:1a:00.0

ethtool -k eth<x>
  Queries the stateless offload status.

ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off] [rxvlan on|off]
  Sets the stateless offload status.
  TCP Segmentation Offload (TSO), Generic Segmentation Offload (GSO): increase outbound throughput by reducing CPU overhead. They work by queuing up large buffers and letting the network interface card split them into separate packets.
  Large Receive Offload (LRO): increases inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions < 3.1 for untagged traffic.
  Note: LRO will be done whenever possible. Otherwise GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels.
  Hardware VLAN Stripping Offload (rxvlan): When enabled, the VLAN tag of received VLAN traffic will be stripped by the hardware.

ethtool -c eth<x>
  Queries interrupt coalescing settings.

ethtool -C eth<x> adaptive-rx on|off
  Enables/disables adaptive interrupt moderation. By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.

ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
  Sets the values for packet rate limits and for moderation time high and low values.
  • Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value.
  • Below a lower limit of packet rate, the moderation time will be set to its lowest value.

ethtool -C eth<x> [rx-usecs N] [rx-frames N]
  Sets the interrupt coalescing settings when adaptive moderation is disabled.
  Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.

ethtool -a eth<x>
  Queries the pause frame settings.

ethtool -A eth<x> [rx on|off] [tx on|off]
  Sets the pause frame settings.

ethtool -g eth<x>
  Queries the ring size values.

ethtool -G eth<x> [rx <N>] [tx <N>]
  Modifies the ring sizes.

ethtool -S eth<x>
  Obtains additional device statistics.

ethtool -t eth<x>
  Performs a self-diagnostics test.

ethtool -s eth<x> msglvl [N]
  Changes the current driver message level.

ethtool -T eth<x>
  Shows time stamping capabilities.

ethtool -l eth<x>
  Shows the number of channels.

ethtool -L eth<x> [rx <N>] [tx <N>]
  Sets the number of channels.

ethtool -m|--dump-module-eeprom eth<x> [raw on|off] [hex on|off] [offset N] [length N]
  Queries/decodes the cable module EEPROM information.

ethtool --show-priv-flags eth<x>
  Shows driver private flags and their states (on/off). The private flags are:
  • pm_qos_request_low_latency
  • mlx4_rss_xor_hash_function
  • qcn_disable_32_14_4_e

ethtool --set-priv-flags eth<x> <priv flag> <on/off>
  Enables/disables the driver feature matching the given private flag.
• Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port
and is displayed as zero when querying the port
• With RoCE, the alternate path is not set for RC QP and therefore APM is not supported
• Since the SM is not present, querying a path is impossible. Therefore, the path record
structure must be filled with the relevant values before establishing a connection. Hence,
it is recommended working with RDMA-CM to establish a connection as it takes care of
filling the path record structure
• The GID table for each port is populated with N+1 entries where N is the number of IP
addresses that are assigned to all network devices associated with the port including
VLAN devices, alias devices and bonding masters. The only exception to this rule is a
bonding master of a slave in a DOWN state. In that case, a matching GID to the IP
address of the master will not be present in the GID table of the slave's port
• The first entry in the GID table (at index 0) for each port is always present and equal to
the link local IPv6 address of the net device that is associated with the port. Note that
even if the link local IPv6 address is not set, index 0 is still populated.
• GID format can be of 2 types, IPv4 and IPv6. An IPv4 GID is an IPv4-mapped IPv6 address [1], while an IPv6 GID is the IPv6 address itself
• VLAN tagged Ethernet frames carry a 3-bit priority field. The value of this field is
derived from the IB SL field by taking the 3 least significant bits of the SL field
• RoCE traffic is not shown in the associated Ethernet device's counters since it is
offloaded by the hardware and does not go through Ethernet network driver. RoCE traf-
fic is counted in the same place where InfiniBand traffic is counted; /sys/class/infini-
band/<device>/ports/<port number>/counters/
A straightforward extension of the RoCE protocol enables traffic to operate in IP layer 3 environ-
ments. This capability is obtained via a simple modification of the RoCE packet format. Instead
of the GRH used in RoCE, IP routable RoCE packets carry an IP header which allows traversal
of IP L3 Routers and a UDP header (RoCEv2 only) that serves as a stateless encapsulation layer
for the RDMA Transport Protocol Packets over IP.
The proposed RoCEv2 packets use a well-known UDP destination port value that unequivocally
distinguishes the datagram. Similar to other protocols that use UDP encapsulation, the UDP
source port field is used to carry an opaque flow-identifier that allows network devices to imple-
ment packet forwarding optimizations (e.g. ECMP) while staying agnostic to the specifics of the
protocol header format.
Furthermore, since this change exclusively affects the packet format on the wire, and due to the fact that with RDMA semantics packets are generated and consumed below the API, applications can seamlessly operate over any form of RDMA service (including the routable version of RoCE, as shown in Figure 2), in a completely transparent way.
[1] For the IPv4 address A.B.C.D, the corresponding IPv4-mapped IPv6 address is ::ffff:A.B.C.D
3.1.6.2.1 Prerequisites
The following are the driver’s prerequisites in order to set or configure RoCE:
• ConnectX®-3 firmware version 2.32.5000 or higher
• ConnectX®-3 Pro firmware version 2.32.5000 or higher
• All InfiniBand verbs applications which run over InfiniBand verbs should work on
RoCE links if they use GRH headers.
• Set HCA to use Ethernet protocol:
Display the Device Manager and expand “System Devices”. Please refer to
Section 3.1.1.1, “Port Type Management”, on page 34.
• Ports facing the network should be configured as trunk ports, and use Port Control Proto-
col (PCP) for priority flow control
For further information on how to configure SwitchX, please refer to SwitchX User Manual.
The next section provides information on how to use InfiniBand over Ethernet (RoCE).
The list of the modules that will be loaded automatically upon boot can be found in the
configuration file /etc/infiniband/openib.conf.
The fw_ver parameter shows that the firmware version is 2.31.5050. The firmware version can also be obtained by running the following command:
# cat /sys/class/infiniband/mlx4_0/fw_ver
2.31.5050
#
Although the InfiniBand over Ethernet Port MTU is 2K bytes at maximum, the actual MTU cannot exceed the mlx4_en interface's MTU. Since the mlx4_en interface's MTU is typically 1560, port 2 will run with an MTU of 1K.
Please note that RoCE's MTU is subject to InfiniBand MTU restrictions. The RoCE MTU values are 256 bytes, 512 bytes, 1024 bytes and 2K. In general, the RoCE MTU is the largest power of 2 that is still lower than the mlx4_en interface MTU.
3.1.6.2.10Adding VLANs
To add VLANs:
Step 1. Make sure that the 8021q module is loaded.
# modprobe 8021q
Step 2. Add VLAN.
# vconfig add eth2 7
Added VLAN with VID == 7 to IF -:eth2:-
#
Step 3. Configure an IP address.
# ifconfig eth2.7 7.4.3.220
SL affects the PCP only when the traffic goes over tagged VLAN frames.
ECN and QCN are not compatible. When using ECN, QCN (and all its related daemons/utilities that could enable it, e.g. lldpad) should be turned OFF.
The pm_qos feature is both global and static: once a request is issued, it is enforced on all CPUs and does not change over time.
MLNX_OFED provides an option to trigger a request when required and to remove it when no longer required. It is disabled by default and can be set/unset through the ethtool priv-flags.
For further information on how to enable/disable this feature, please refer to Table 3, "ethtool Supported Options".
3.1.10 Time-Stamping
Time stamping is the process of keeping track of the creation time of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed over the PCI bus, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.
enum hwtstamp_tx_types {
	/*
	 * No outgoing packet will need hardware time stamping;
	 * should a packet arrive which asks for it, no hardware
	 * time stamping will be done.
	 */
	HWTSTAMP_TX_OFF,

	/*
	 * Enables hardware time stamping for outgoing packets;
	 * the sender of the packet decides which are to be
	 * time stamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
	 * before sending the packet.
	 */
	HWTSTAMP_TX_ON,

	/*
	 * Enables time stamping for outgoing packets just as
	 * HWTSTAMP_TX_ON does, but also enables time stamp insertion
	 * directly into Sync packets. In this case, transmitted Sync
	 * packets will not receive a time stamp via the socket error
	 * queue.
	 */
	HWTSTAMP_TX_ONESTEP_SYNC,
};
Note: for send side time stamping currently only HWTSTAMP_TX_OFF and
HWTSTAMP_TX_ON are supported.
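Hardware time stamping is enabled per interface through the standard Linux SIOCSHWTSTAMP ioctl, which takes the structures above. A minimal sketch (the interface name is supplied by the caller; values are illustrative):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/net_tstamp.h>
#include <linux/sockios.h>

/* Enable hardware TX and RX time stamping on one interface. */
static int enable_hw_tstamp(const char *ifname)
{
	struct hwtstamp_config cfg;
	struct ifreq ifr;
	int fd, ret;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return -1;
	}

	memset(&cfg, 0, sizeof(cfg));
	cfg.tx_type = HWTSTAMP_TX_ON;        /* stamp outgoing packets */
	cfg.rx_filter = HWTSTAMP_FILTER_ALL; /* stamp all incoming packets */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&cfg;

	ret = ioctl(fd, SIOCSHWTSTAMP, &ifr);
	if (ret < 0)
		perror("SIOCSHWTSTAMP");

	close(fd);
	return ret;
}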
a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing
packet has to be fragmented, then only the first fragment is time stamped and returned to the
sending socket.
When time-stamping is enabled, VLAN stripping is disabled. For more info please
refer to Documentation/networking/timestamping.txt in kernel.org
RoCE Time Stamping allows you to stamp packets when they are sent to the wire or received from the wire. The time stamp is given in raw hardware cycles, but can easily be converted into hardware-referenced nanoseconds-based time. Additionally, it enables you to query the hardware for the hardware time, and thus to time-stamp other applications' events and compare times.
Query Capabilities
Time stamping is available if and only if the hardware reports that it is capable of time stamping. To verify whether RoCE Time Stamping is available, run ibv_exp_query_device.
For example:
struct ibv_exp_device_attr attr;
ibv_exp_query_device(context, &attr);
if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) {
if (attr.timestamp_mask) {
/* Time stamping is supported with mask attr.timestamp_mask */
}
}
if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_HCA_CORE_CLOCK) {
if (attr.hca_core_clock) {
/* reporting the device's clock is supported. */
/* attr.hca_core_clock is the frequency in MHZ */
}
}
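As a sketch of the conversion implied above (assuming, per the comment in the example, that attr.hca_core_clock is the clock frequency in MHz, i.e. cycles per microsecond):

#include <stdint.h>

/* Convert a raw hardware-cycle time stamp to nanoseconds. */
static uint64_t cycles_to_ns(uint64_t raw_cycles, uint64_t hca_core_clock_mhz)
{
	return raw_cycles * 1000ULL / hca_core_clock_mhz;
}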
This CQ cannot report SL or SLID information. The values of the sl and sl_id fields in struct ibv_exp_wc are invalid. Only the fields indicated by the exp_wc_flags field in struct ibv_exp_wc contain valid and usable values.
When using Time Stamping, several fields of struct ibv_exp_wc are not available, resulting in failure of RoCE UD / RoCE traffic with VLANs.
CQs that are opened with the ibv_exp_create_cq verb should always be polled with the ibv_exp_poll_cq verb.
Querying the Hardware Time is available only on physical functions / native machines.
3.1.11 Flow Steering
Flow steering is a new model which steers network flows based on flow specifications to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of a flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules may be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses a different terminology from the flow attribute (ibv_exp_flow_attr), defined by a combination of specifications (struct ibv_exp_flow_spec_*).
Be advised that as of MLNX_OFED v2.0-3.0.0, the parameters (both the value and the mask) should be set in big-endian format.
Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are:
• All-ones mask - include the parameter value in the attached rule
Note: Since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used: flow_spec_eth.mask.vlan_tag = htons(0x0fff).
• All-zeros mask - ignore the parameter value in the attached rule
When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number.
For further information, please refer to the relevant man pages.
• ibv_exp_destroy_flow
int ibv_exp_destroy_flow(struct ibv_exp_flow *flow_id)
Input parameters:
ibv_exp_destroy_flow requires struct ibv_exp_flow which is the return value of
ibv_exp_create_flow in case of success.
Output parameters:
Returns 0 on success, or the value of errno on failure.
For further information, please refer to the ibv_exp_destroy_flow man page.
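To tie the two verbs together, below is a minimal sketch (not taken from the manual) that attaches a NORMAL-type rule steering packets with a given destination MAC on a given port to a QP, and later detaches it. The header name, struct packing and enum values are assumptions based on the legacy ibv_exp_* API described above:

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs_exp.h>

/* Steer packets with destination MAC 'dmac' arriving on 'port' to 'qp'. */
static int steer_dmac_to_qp(struct ibv_qp *qp, uint8_t port,
			    const uint8_t dmac[6])
{
	struct {
		struct ibv_exp_flow_attr     attr;
		struct ibv_exp_flow_spec_eth eth;
	} __attribute__((packed)) rule;
	struct ibv_exp_flow *flow;

	memset(&rule, 0, sizeof(rule));
	rule.attr.type         = IBV_EXP_FLOW_ATTR_NORMAL;
	rule.attr.size         = sizeof(rule);
	rule.attr.num_of_specs = 1;
	rule.attr.port         = port;

	rule.eth.type = IBV_EXP_FLOW_SPEC_ETH;
	rule.eth.size = sizeof(rule.eth);
	memcpy(rule.eth.val.dst_mac, dmac, 6);
	memset(rule.eth.mask.dst_mac, 0xff, 6); /* all-ones: match the value */

	flow = ibv_exp_create_flow(qp, &rule.attr);
	if (!flow)
		return -1;

	/* ... receive traffic on the QP ... */

	return ibv_exp_destroy_flow(flow);
}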
• Ethtool
The ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow.
Please refer to the most recent ethtool manpage for all the ways to specify a flow.
Examples:
• ethtool –U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2
All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its
underlying QP), with priority 5 (within the ethtool domain)
• ethtool –U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
All packets that contain the above destination IP address and source port are to be steered into rx-
ring 2. When destination MAC is not given, the user's destination MAC is filled automatically.
• ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 vlan 45 m 0xf000 loc 5 action 2
All packets that contain the above destination MAC address and specific VLAN are steered into
ring 2. Please pay attention to the VLAN's mask 0xf000. It is required in order to add such a rule.
• ethtool –u eth5
Shows all of ethtool's steering rules
When configuring two rules with the same priority, the second rule will overwrite the first one,
so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel
requires support from both user-space ethtool and the kernel (v2.6.28).
MLX4 Driver Support
The mlx4 driver supports only a subset of the flow specifications the ethtool API defines. Asking
for an unsupported flow specification will result in an "invalid value" failure.
The following flow types, each with its own set of flow-specific parameters, are supported:
ether, tcp4/udp4, ip4
RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.
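For example, to disable LRO on an interface (eth5 is an illustrative name):
host1# ethtool -K eth5 lro off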
3.2.1 Interface
Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based
on the link partner and load the appropriate driver stack (InfiniBand or Ethernet).
For example, if the first port is connected to an InfiniBand switch and the second port to an
Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.
3.2.2 OpenSM
OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow execut-
able called “opensm”, accompanied by a testing application called “osmtest”. OpenSM imple-
ments an InfiniBand compliant SM according to the InfiniBand Architecture Specification
chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).
3.2.2.1 opensm
opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of
the Mellanox OFED stack. opensm performs the InfiniBand specification’s required tasks for ini-
tializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.
opensm also provides an experimental version of a performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hun-
dred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep
occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures only the fabric con-
nected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to
those other ports). If no port is specified, opensm will select the first “best” available port.
opensm can also present the available ports and prompt for a port number to attach to.
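For example, to attach opensm to the port with a specific GUID, use the '-g' option (the GUID below is a placeholder):
host1# opensm -g 0x0002c9030002fb01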
By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log.
The first file registers only general major events, whereas the second file includes details of
reported errors. All errors reported in the second file should be treated as indicators of IB fabric
health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log
files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
Syntax
opensm [OPTIONS]
For the complete list of opensm options, please run:
opensm --help / -h / -?
3.2.2.1.2 Signaling
When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received
or a topology change has been found.
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate pur-
poses.
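For example, assuming a single opensm process is running:
host1# kill -HUP $(pidof opensm) # start a new heavy sweep
host1# kill -USR1 $(pidof opensm) # reopen /var/log/opensm.log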
3.2.2.2 osmtest
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administra-
tor. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes,
ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all
the object fields, and match it against a pre-saved one. See Section 3.2.2.2.1.
osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test
Syntax
osmtest [OPTIONS]
For the complete list of osmtest options, please run:
osmtest --help / -h / -?
The default mode runs all the flows except for the Quality of Service flow (see Section 3.2.2.6).
After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the fol-
lowing command in order to generate the inventory file:
host1# osmtest -f c
Immediately afterwards, run the following command to test opensm:
host1# osmtest -f a
Finally, it is recommended to occasionally run “osmtest -v” (with verbosity) to verify that noth-
ing in the fabric has changed.
3.2.2.3 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default,
OpenSM searches for the partitions configuration file under the name
/usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags.
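For example:
host1# opensm -P /etc/opensm/partitions.conf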
The default partition is created by OpenSM unconditionally, even when a partition configuration
file does not exist or cannot be accessed.
The default partition has a P_Key value of 0x7fff. The port on which OpenSM runs is assigned
full membership in the default partition. All other end-ports are assigned partial membership.
• <Partition Properties>:
[<Port list>|<MCast Group>]* | <Port list>
• <Port List>:
<Port Specifier>[,<Port Specifier>]
• <Port Specifier>:
<PortGUID>[=[full|limited|both]]
where
PortGUID GUID of partition member EndPort. Hexadecimal numbers should
start from 0x, decimal numbers are accepted too.
full, limited, both Indicates full, limited, or both full and limited membership
for this port. When omitted (or unrecognized), limited
membership is assumed.
• <MCast Group>:
mgid=gid[,mgroup_flag]*<newline>
where
mgid=gid The gid specified is verified to be a multicast address. IP groups are
verified to match the rate and mtu of the broadcast group. The P_Key bits
of the mgid for IP groups are verified to either match the P_Key specified
by the partition definition or, if they are 0x0000, the P_Key
will be copied into those bits.
mgroup_flag:
rate=<val> Specifies the rate for this MC group (default is 3 (10 Gbps))
mtu=<val> Specifies the MTU for this MC group (default is 4 (2048))
sl=<val> Specifies the SL for this MC group (default is 0)
scope=<val> Specifies the scope for this MC group (default is 2 (link local)).
Multiple scope settings are permitted for a partition.
NOTE: This overwrites the scope nibble of the specified mgid.
Furthermore, specifying multiple scope settings will result in
multiple MC groups being created.
qkey=<val> Specifies the Q_Key for this MC group
(default: 0x0b1b for IP groups, 0 for other groups)
tclass=<val> Specifies the tclass for this MC group (default is 0)
FlowLabel=<val> Specifies the FlowLabel for this MC group (default is 0)
Note that values for rate, MTU, and scope should be specified as defined in the IBTA specifica-
tion (for example, mtu=4 for 2048). To use 4K MTU, edit that entry to "mtu=5" (5 indicates 4K
MTU to that specific partition).
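For example, the following line defines a multicast group with 4K MTU, 10 Gbps rate, and SL 0, using the syntax described above (the mgid value is illustrative):
mgid=ff12:401b:ffff::1,mtu=5,rate=3,sl=0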
PortGUIDs list:
PortGUID GUID of partition member EndPort. Hexadecimal numbers should start
from 0x, decimal numbers are accepted too.
full or limited indicates full or limited membership for this port. When omitted (or
unrecognized) limited membership is assumed.
There are several useful keywords for PortGUID definition:
• 'ALL' means all end ports in this subnet.
• 'ALL_CAS' means all Channel Adapter end ports in this subnet.
• 'ALL_SWITCHES' means all Switch end ports in this subnet.
• 'ALL_ROUTERS' means all Router end ports in this subnet.
• 'SELF' means the subnet manager's port.
An empty list means that there are no ports in this partition.
Notes:
• White space is permitted between delimiters ('=', ',', ':', ';').
• PartitionName does not need to be unique; PKey does need to be unique. If a PKey is
repeated, those partition configurations will be merged and the first PartitionName will
be used (see also the next note).
• It is possible to split partition configuration in more than one definition, but then PKey
should be explicitly specified (otherwise different PKey values will be generated for
those definitions).
Examples
Default=0x7fff : ALL, SELF=full ;
Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
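A partition definition may also carry flags; for example (the PKey, flags, and membership below are illustrative):
Storage=0x8001, ipoib, sl=1, defmember=full : ALL;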
-i <equalize-ignore-guids-file>
-ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by guids)
that will be ignored by the link load equalization algorithm.
LMC awareness routes on a (remote) system or switch basis.
connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is
not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high
pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes - based on the CA hop length from any switch in the subnet, a statisti-
cal histogram is built for each switch (hop num vs number of occurrences). If the histogram
reflects a specific column (higher than others) for a certain node, then it is marked as a root
node. Since the algorithm is statistical, it may not find any root nodes. The list of the root
nodes found by this auto-detect stage is used by the ranking process stage.
If this stage cannot find any root nodes, and the user did not specify a guid list file,
OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process - All root switch nodes (found in stage 1) are assigned a rank of 0. Using the
BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This rank-
ing aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from each (CA or
switch) node in the subnet. During the BFS process, the FDB table of each switch node tra-
versed by BFS is updated, in reference to the starting node, based on the ranking rules and
guid values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Up/Down routing does not allow LID routing communication between switches that
are located inside spine “switch systems”. The reason is that there is no way to allow a
LID route between them that does not break the Up/Down rule. One ramification of
this is that you cannot run SM on switches other than the leaf switches of the fabric.
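For example, to activate UPDN with a user-provided root GUID list file (the path is illustrative):
host1# opensm -R updn -a /etc/opensm/root_guids.conf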
ary-N-Trees, by handling non-constant K, cases where not all leafs (CAs) are present, and any
Constant Bisectional Bandwidth (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop
deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be
pure fat-tree that complies with the following rules:
• Tree rank should be between two and eight (inclusively)
• Switches of the same rank should have the same number of UP-going port groups1,
unless they are root switches, in which case they shouldn't have UP-going ports at all.
• Switches of the same rank should have the same number of DOWN-going port groups,
unless they are leaf switches.
• Switches of the same rank should have the same number of ports in each UP-going port
group.
• Switches of the same rank should have the same number of ports in each DOWN-going
port group.
• All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology does not have to be pure fat-tree, and it should only
comply with the following rules:
• Tree rank should be between two and eight (inclusively)
• All the Compute Nodes2 have to be at the same tree level (rank). Note that non-compute
node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min hop routing. Note that this can also occur
on link failures which cause the topology to no longer be a “pure” fat-tree.
Note that although fat-tree algorithm supports trees with non-integer CBB ratio, the routing will
not be as balanced as in case of integer CBB ratio. In addition to this, although the algorithm
allows leaf switches to have any number of CAs, the closer the tree is to be fully populated, the
more effective the "shift" communication pattern will be. In general, even if the root list is pro-
vided, the closer the topology to a pure and symmetrical fat-tree, the more optimal the routing
will be.
The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in
the same directory where the OpenSM log resides. This ordering file provides the CN order that
may be used to create an efficient communication pattern that matches the routing tables.
1. Ports that are connected to the same remote switch are referenced as ‘port group’
2. List of compute nodes (CNs) can be specified by ‘-u’ or ‘--cn_guid_file’ OpenSM options.
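For example, to activate fat-tree routing with the root and compute-node GUID files mentioned above (the paths are illustrative):
host1# opensm -R ftree -a /etc/opensm/root_guids.conf -u /etc/opensm/cn_guids.conf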
(Figure: a fat-tree scheme in which switches fan down to compute nodes; referenced below as "the scheme above".)
To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file'
option. These nodes will be allowed to use switches the wrong way around a specific number of
times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and
io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a
max_reverse_hops of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a
max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.
Using max_reverse_hops creates routes that use the switch in a counter-stream way.
This option should never be used to connect nodes with high bandwidth traffic
between them! It should only be used to allow connectivity for HA purposes or similar.
Also having routes the other way around can cause credit loops.
LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing
algorithm is invoked instead.
LASH analyzes routes and ensures deadlock freedom between switch pairs. The link
between an HCA and a switch does not need virtual layers, as deadlock will not arise
between a switch and an HCA.
Note that the implementation of LASH in opensm attempts to use as few layers as possible. This
number can be less than the number of actual layers available.
In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order
Routing in certain topologies; it is topology-agnostic and fares well in the face of faults.
It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down.
The reason for this is that LASH distributes the traffic more evenly through a network, avoid-
ing the bottleneck issues related to a root node and always routes shortest-path.
The algorithm was developed by Simula Research Laboratory.
Use the '-R lash -Q' option to activate the LASH algorithm.
QoS support has to be turned on so that SL/VL mappings are used.
LMC > 0 is not supported by the LASH routing. If this is specified, the default routing
algorithm is invoked instead.
For open regular cartesian meshes the DOR algorithm is the ideal routing algorithm. For toroidal
meshes on the other hand there are routing loops that can cause deadlocks. LASH can be used to
route these cases. The performance of LASH can be improved by preconditioning the mesh in
cases where there are multiple links connecting switches and also in cases where the switches are
not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an
additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it
determines that the mesh looks like an open or closed cartesian mesh it reorders the ports in
dimension order before the rest of the LASH algorithm runs.
sl = 0;
for (d = 0; d < torus_dimensions; d++)
/* path_crosses_dateline(d) returns 0 or 1 */
sl |= path_crosses_dateline(d) << d;
For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels.
Torus-2 QoS also makes use of the output port dependence of switch SL2VL maps to encode into
one VL bit the information encoded in three SL bits. It computes in which torus coordinate direc-
tion each inter-switch link "points", and writes SL2VL maps for such ports as follows:
For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or
n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D.
Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has
been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction
dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of
this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is
broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has
been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues,
reports if they are found, and refuses to route such fabrics.
Note that in the case where there are multiple parallel links between a pair of switches, torus-
2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path
destination switch that are active and not used for inter-switch links. Should a link that is one of
several such parallel links fail, routes are redistributed across the remaining links. When the last
of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would
be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as
close as possible to the failed switch in order to route around it. In the above example, suppose
switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-
D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal
DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving
from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which
would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under
DOR. This causes the first hop after any such turn to use a separate set of VL values, and pre-
vents deadlock in the presence of a single failed switch. For any given path, only the hops after a
turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the
example above with failed switch T, the location of the illegal turn at I in the path from S to D
requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the
second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop
because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so
it cannot contribute to a credit loop encircling T. Extending this argument shows that in addition
to being capable of routing around a single switch failure without introducing deadlock, torus-
2QoS can also route around multiple failed switches on the condition they are adjacent in the last
dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will gen-
erate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1
set. As a further example, consider a case that torus-2QoS cannot route without deadlock: two
failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the
failed switches are O and T:
In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed
switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with illegal turn at switch I, and
with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after
the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.
For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR
turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not
legal DOR turns. However, to construct a credit loop, the union of multicast routing on this span-
ning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop. In addi-
tion, if none of the above spanning tree branches crosses a dateline used for unicast credit loop
avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS
uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring"
credit loops that are otherwise possible in a torus. Torus-2QoS uses these ideas to create a master
spanning tree. Every multicast group spanning tree will be constructed as a subset of the master
tree, with the same root as the master tree. Such multicast group spanning trees will in general
not be optimal for groups which are a subset of the full fabric. However, this compromise must
be made to enable support for two QoS levels on a torus while preventing credit loops. In the
presence of link or switch failures that result in a fabric for which torus-2QoS can generate
credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast
that retains the required properties. For example, consider that same 2D 6x5 torus, with the link
from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:
Two things are notable about this master spanning tree. First, assuming the x dateline was
between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as
for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure
cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for
multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must
be made to retain the other desirable properties of torus-2QoS routing. In the event that a single
switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appro-
priately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at
(3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master
spanning tree for that case:
Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a
dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring
for x=3) that is broken by a failure, as in the above example.
and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of
any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast
QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.
Since SL to VL map configuration must be under the complete control of torus-2QoS, any con-
figuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be
generated. Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL
values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not
delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbi-
tration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the
default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS
users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.
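For example, illustrative opensm.opts lines that give equal weight to every VL within each of the two ranges (a sketch, not a tuning recommendation) might look like this:
qos_vlarb_high 0:32,1:32,2:32,3:32,4:32,5:32,6:32,7:32
qos_vlarb_low 0:32,1:32,2:32,3:32,4:32,5:32,6:32,7:32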
Operational Considerations
Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As
a result, all applications run over such fabrics must perform a path record query to obtain the cor-
rect path SL for connection setup. Applications that use rdma_cm for connection setup will auto-
matically meet this requirement.
If a change in fabric topology causes changes in path SL values required to route without credit
loops, in general all applications would need to repath to avoid message deadlock. Since torus-
2QoS has the ability to reroute after a single switch failure without changing path SL values,
repathing by running applications is not required when the fabric is routed with torus-2QoS.
Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover
provided that all OpenSM instances have the same idea of dateline location. See torus-
2QoS.conf(5) for details. Torus-2QoS will detect configurations of failed switches and links that
prevent routing that is free of credit loops, and will log warnings and refuse to route. If
"no_fallback" was configured in the list of OpenSM routing engines, then no other routing
engine will attempt to route the fabric. In that case all paths that do not transit the failed compo-
nents will continue to work, and the subset of paths that are still operational will continue to
remain free of credit loops. OpenSM will continue to attempt to route the fabric after every
sweep interval, and after any change (such as a link up) in the fabric topology. When the fabric
components are repaired, full functionality will be restored. In the event OpenSM was config-
ured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and mes-
sage deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the
other engine is capable of routing a torus without credit loops, applications that built connections
with path SL values granted under torus-2QoS will likely experience message deadlock under
routing generated by a different engine, unless they repath. To verify that a torus fabric is routed
free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.
(looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and
"torus 3 4M 5M" both specify the same topology.
Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components
is severely compromised on such fabrics. A failed fabric component is very likely to cause a
disjoint ring; see UNICAST ROUTING in torus-2QoS(8).
x_dateline position
y_dateline position
z_dateline position
In order for torus-2QoS to provide the guarantee that path SL values do not change under any
conditions for which it can still route the fabric, its idea of dateline position must not change rel-
ative to physical switch locations. The dateline keywords provide the means to configure such
behavior.
The dateline for a torus dimension is always between the switch with coordinate 0 and the switch
with coordinate radix-1 for that dimension. By default, the common switch in a torus seed is
taken as the origin of the coordinate system used to describe switch location. The position param-
eter for a dateline keyword moves the origin (and hence the dateline) the specified amount rela-
tive to the common switch in a torus seed.
next_seed
If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete
topology discovery successfully. The next_seed keyword specifies that the following link and
dateline keywords apply to a new seed specification.
For maximum resiliency, no seed specification should share a switch with any other seed specifi-
cation. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS
can grant path SL values that are constant, regardless of which seed was used to initiate topology
discovery.
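A minimal sketch of a torus-2QoS.conf fragment using the keywords described above (the radix values and dateline position are illustrative, and the seed link definitions are omitted; see torus-2QoS.conf(5) for the authoritative syntax):
torus 6 5 4
x_dateline 1
next_seed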
4. Define routing engine chains over previously defined topologies and configuration files.
Rule Qualifier
subtract-rule
One can define a rule that subtracts one port group from another. The given rule, for example,
will cause all the ports which are part of grp1, but not included in grp2, to be chosen.
In subtraction (unlike union), the order does matter, since the purpose is to subtract the
second group from the first one. There is no option to define more than two groups for
union/subtraction. However, one can unite/subtract groups which are themselves a union or
a subtraction, as shown in the port groups policy file example.
Example: subtract-rule: grp1, grp2
port-group
name: grp1
port-guid: 0x281, 0x282, 0x283
end-port-group
port-group
name: grp2
port-guid-range: 0x282-0x286
end-port-group
port-group
name: grp4
port-name: hostname=kika port=1 hca_idx=1
end-port-group
port-group
name: grp3
union-rule: grp3, grp4
end-port-group
Topology Qualifiers
Unlike topology and end-topology, which do not require a colon, all qualifiers must end
with a colon (':'). A colon is also a predefined mark that must not be used inside qualifier
values; including a colon in a qualifier value will cause the policy to fail.
All topology qualifiers are mandatory. Absence of any of the below qualifiers will cause the pol-
icy parsing to fail.
sl2vl and mcfdbs files are dumped only once for the entire fabric and NOT by every rout-
ing engine.
• Each engine concatenates its ID and routing algorithm name in its dump files names, as
follows:
• opensm-lid-matrix.2.minhop.dump
• opensm.fdbs.3.ftree
• opensm-subnet.4.updn.lst
• In case that a fallback routing engine is used, both the routing engine that failed and the
fallback engine that replaces it, dump their data.
If, for example, engine 2 runs ftree and it has a fallback engine with 3 as its id that runs minhop, one
should expect to find 2 sets of dump files, one for each engine:
• opensm-lid-matrix.2.ftree.dump
• opensm-lid-matrix.3.minhop.dump
• opensm.fdbs.2.ftree
• opensm.fdbs.3.minhop
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to a partition with a given
PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to a partition with
a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF
(SM's port).
II) QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the
fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured
in the OpenSM options file (default location - /var/cache/opensm/opensm.opts).
qos-levels
qos-level
name: DEFAULT
sl: 0
end-qos-level
end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups
are not referenced anywhere and there is no need to define them. And since this policy file does not
have any match rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default
QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy file and their
syntax:
#
# See the comments in the following example.
# They explain different keywords and their meaning.
#
port-groups
port-group
name: Storage
# "use" is just a description that is used for logging
# Other than that, it is just a comment
use: SRP Targets
port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
port-guid: 0x1000000000FFFF
end-port-group
port-group
name: Virtual Servers
# The syntax of the port name is as follows:
# "node_description/Pnum".
# node_description is compared to the NodeDescription of the node,
# and "Pnum" is a port number on that node.
port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
end-port-group
# using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
# or ALL (for all the nodes in the subnet)
port-group
name: CAs and SM
node-type: CA, SELF
end-port-group
end-port-groups
qos-setup
# This section of the policy file describes how to set up SL2VL and VL
# Arbitration tables on various nodes in the fabric.
# However, this is not supported in OFED - the section is parsed
# and ignored. SL2VL and VLArb tables should be configured in the
# OpenSM options file (by default - /var/cache/opensm/opensm.opts).
end-qos-setup
qos-levels
qos-level
name: DEFAULT
use: default QoS Level
sl: 0
end-qos-level
end-qos-levels
# Match rules are scanned in the order of their appearance in the policy file.
# The first matched rule takes precedence.
qos-match-rules
qos-match-rule
source: Storage
use: match by source group only
qos-level-name: DEFAULT
end-qos-match-rule
qos-match-rule
use: match by all parameters
qos-class: 7-9,11
end-qos-match-rule
end-qos-match-rules
qos-ulps
default : 0 #default SL
end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not
having a policy file at all. Below is an example of a simple QoS policy with all the possible
keywords:
qos-ulps
default : 0 # default SL
sdp, port-num 30000 : 0 # SL for application running on
# top of SDP when a destination
# TCP/IP port is 30000
sdp, port-num 10000-20000 : 0
3.2.2.6.1 IPoIB
IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the mul-
ticast group that OpenSM creates for each IPoIB partition.
Default PKey for IPoIB partition is 0x7fff, so the following three match rules are equivalent:
ipoib :<SL>
ipoib, pkey 0x7fff : <SL>
any, pkey 0x7fff : <SL>
3.2.2.6.2 SDP
SDP PR query is matched by Service ID. The Service-ID for SDP is 0x000000000001PPPP,
where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The follow-
ing two match rules are equivalent:
sdp :<SL>
any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
3.2.2.6.3 RDS
Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS is
0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to
connect to. Default port number for RDS is 0x48CA, which makes a default Service-ID
0x00000000010648CA. The following two match rules are equivalent:
rds :<SL>
any, service-id 0x00000000010648CA : <SL>
3.2.2.6.4 SRP
Service ID for SRP varies from storage vendor to vendor, thus SRP query is matched by the tar-
get IB port GUID. The following two match rules are equivalent:
3.2.2.6.5 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on MPI
traffic, which is why MPI is the only ULP that does not appear in the qos-ulps section.
qos_ca_max_vls 15
qos_ca_high_limit 0
qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
qos_swe_max_vls 15
qos_swe_high_limit 0
qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a
VL number (values from 0-14), and a weighting value (values 0-255), indicating the number of
64 byte units (credits) which may be transmitted from that VL when its turn in the arbitration
occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for
VL15 or for a VL that is not supported or is not currently configured by the port, the port may
either skip that entry or send from any supported VL for that entry.
Note that the same VL may be listed multiple times in the high- or low-priority arbitration
table and, further, can be listed in both tables. The limit of the high-priority VLArb table
(qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted
without an opportunity to send a low-priority packet. Specifically, the number of bytes that can
be sent is high_limit times 4K bytes.
A high_limit value of 255 indicates that the byte limit is unbounded.
If the 255 value is used, the low priority VLs may be starved.
A value of 0 indicates that only a single packet from the high-priority table may be sent before an
opportunity is given to the low-priority table.
Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB
MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for
packets of 4KB MTU, the weighting values for each VL should be multiples of 64.
Below is an example of SL2VL and VL Arbitration configuration on subnet:
qos_ca_max_vls 15
qos_ca_high_limit 6
qos_ca_vlarb_high 0:4
qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
qos_swe_max_vls 15
qos_swe_high_limit 6
qos_swe_vlarb_high 0:4
qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is defined as a high-priority
VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration
would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the
VLs are defined as low-priority VLs with different weights, while VL4 is effectively turned off.
Deployment Example
Figure 4 shows an example of an InfiniBand subnet that has been configured by a QoS manager
to provide different service levels for various ULPs.
Administration
• MPI is assigned an SL via the command line
host1# mpirun -sl 0
In the following policy file example, replace OST* and MDS* with the real port
GUIDs.
qos-ulps
default : 0 # default SL (for MPI)
any, target-port-guid OST1,OST2,OST3,OST4 : 1 # SL for Lustre OST
any, target-port-guid MDS1,MDS2 : 2 # SL for Lustre MDS
end-qos-ulps
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low 0:96,1:224
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
QoS Levels
• Application traffic
• IPoIB (UD and CM) and SDP
• Isolated from storage
• Min BW of 50%
• SRP
• Min BW 50%
• Bottleneck at storage nodes
Administration
• OpenSM QoS policy file
In the following policy file example, replace SRPT* with the real SRP Target port
GUIDs.
qos-ulps
default :0
ipoib :1
sdp :1
srp, target-port-guid SRPT1,SRPT2,SRPT3 :2
end-qos-ulps
• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:32
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
QoS Levels
• Management traffic (ssh)
• IPoIB management VLAN (partition A)
• Min BW 10%
• Application traffic
• IPoIB application VLAN (partition B)
• Isolated from storage and database
• Min BW of 30%
• Database Cluster traffic
• RDS
• Min BW of 30%
• SRP
• Min BW 30%
• Bottleneck at storage nodes
Administration
• OpenSM QoS policy file
In the following policy file example, replace SRPT* with the real SRP Target port
GUIDs.
qos-ulps
default :0
ipoib, pkey 0x8001 :1
ipoib, pkey 0x8002 :2
rds :3
srp, target-port-guid SRPT1, SRPT2, SRPT3 : 4
end-qos-ulps
• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:96,3:96,4:96
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
• Partition configuration file
Adaptive Routing
Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR
supports two routing modes:
• Free AR: No constraints on output port selection.
• Bounded AR: The switch does not change the output port during the same transmission
burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures Adaptive Routing mechanism on fabric
switches. It scans all the fabric switches, deduces which switches support Adaptive Routing and
configures the AR functionality on these switches.
Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing
Manager configures the AR mechanism to allow switches to select an output port out of all the ports
that are linked to the same remote switch. This algorithm suits any topology with several links
between switches. In particular, it suits a 3D torus/mesh, where there are several links in each
direction of the X/Y/Z axes.
If some switches do not support AR, they will slow down the AR Manager as it may get
timeouts on the AR-related queries to these switches.
1. Create the OpenSM options file:
opensm -c <options-file-name>
2. Add 'armgr' to the 'event_plugin_name' option in the file.
3. Run Subnet Manager with the new options file:
opensm -F <options-file-name>
Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune
the AR mechanism and the AR Manager behavior. The default location of the AR Manager options file
is /etc/opensm/ar_mgr.conf.
To provide an alternative location, please perform the following:
1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options'
option in the file:
# Options string that would be passed to the plugin(s)
event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file:
opensm -F <options-file-name>
See an example of AR Manager options file with all the default values in “Example of Adaptive
Routing Manager Options File” on page 116.
Therefore, no action is required to clear Adaptive Routing configuration on the switches if you
do not wish to use Adaptive Routing.
others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the
'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its
usage details, run 'smparquery -h'.
opensm -F <options-file-name>
AR Manager options file contains two types of parameters:
1. General options - Options which describe the AR Manager behavior and the AR parameters
that will be applied to all the switches in the fabric.
2. Per-switch options - Options which describe specific switch behavior.
Note the following:
• Adaptive Routing configuration file is case sensitive.
• You can specify options for a nonexistent switch GUID. These options will be ignored
until a switch with a matching GUID is added to the fabric.
• Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is
executed at every heavy sweep of the Subnet Manager.
• If the AR Manager fails to parse the options file, default settings for all the options will
be used.
LOG_SIZE: <size in MB>
This option defines the maximal AR Manager log file size in MB. The log file will be
truncated and restarted upon reaching this limit. This option cannot be changed on-the-fly.
Values: 0: unlimited log file size. Default: 5.
Per-switch AR Options
A user can provide per-switch configuration options with the following syntax:
SWITCH <GUID> {
<switch option 1>;
<switch option 2>;
...
}
The following are the per-switch options:
ENABLE: true;
LOG_FILE: /tmp/ar_mgr.log;
LOG_SIZE: 100;
MAX_ERRORS: 10;
ERROR_WINDOW: 5;
SWITCH 0x12345 {
ENABLE: true;
AGEING_TIME: 77;
}
SWITCH 0x0002c902004050f8 {
AGEING_TIME: 44;
}
SWITCH 0xabcde {
ENABLE: false;
}
1. Create the OpenSM options file:
opensm -c <options-file-name>
2. Find the 'event_plugin_name' option in the file, and add 'ccmgr' to it.
Once Congestion Control is enabled on the fabric nodes, to completely disable
Congestion Control, you will need to actively turn it off. Running the SM without the CC
Manager is not sufficient, as the hardware still continues to function in accordance with
the previous CC configuration.
For further information on how to turn OFF CC, please refer to "Configuring Congestion
Control Manager" on page 117.
Add 'ccmgr --conf_file <cc-mgr-options-file-name>' to the 'event_plugin_options' option in the file:
# Options string that would be passed to the plugin(s)
event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration
file, and run OpenSM once with this configuration.
For the full list of CC Manager options with all the default values, See “Configuring Congestion
Control Manager” on page 117.
For further details on the list of CC Manager options, please refer to the IB spec.
enable
• The values are: <TRUE | FALSE>.
num_hosts
• The values are: [0-48K].
• The default is: 0 (based on the CCT calculation for the current subnet size)
• The smaller the value of this parameter, the faster HCAs will respond to congestion
and throttle the traffic. Note that if the number is too low, it will result in
suboptimal bandwidth. To change the mean number of packets between marking eligible
packets with a FECN, set the following parameter:
marking_rate
• The values are: [0-0xffff].
• The default is: 0xa
• You can set the minimal packet size that can be marked with FECN. Any packet less
than this size [bytes] will not be marked with FECN. To do so, set the following parame-
ter:
packet_size
• The values are: [0-0x3fc0].
max_errors
error_window
• The values are:
max_errors = 0: zero tolerance - abort configuration on first error
error_window = 0: mechanism disabled - no error checking
• The default is: 5
3.2.2.8.1 Congestion Control Manager Options File
The basic need is to differentiate the service levels provided to different traffic flows, such that a
policy can be enforced and can control each flow utilization of fabric resources.
The InfiniBand Architecture Specification defines several hardware features and management
interfaces for supporting QoS:
• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
• Arbitration between traffic of different VLs is performed by a two-priority-level
weighted round robin arbiter. The arbiter is programmable with a sequence of (VL,
weight) pairs and a maximal number of high priority credits to be processed before low
priority is served
• Packets carry class of service marking in the range 0 to 15 in their header SL field
• Each switch can map the incoming packet by its SL to a particular output VL, based on a
programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
• The Subnet Administrator controls the parameters of each communication flow by pro-
viding them as a response to Path Record (PR) or MultiPathRecord (MPR) queries
DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The
following subsections provide the functional definition of the various software elements that
enable a DiffServ-like architecture over the Mellanox OFED software stack.
the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also
set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS
attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the
multicast group which forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using
hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use CM interface (like SRP) have their own pre-assigned Service-ID and use it
while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The
SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR
including SL, MTU, RATE and Lifetime.
6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide CMA with the
target IP and port number. ULPs might also provide a QoS-Class. The CMA then creates a
Service-ID for the ULP and passes this ID and the optional QoS-Class in the PR/MPR request. The
resulting PR/MPR is used for configuring the connection QP.
Port Group
A set of CAs, Routers or Switches that share the same settings. A port group might be a partition
defined by the partition manager policy, a list of GUIDs, or a list of port names based on
NodeDescription.
Fabric Setup
Defines how the SL2VL and VLArb tables should be set up.
In OFED this part of the policy is ignored. SL2VL and VLArb tables should be config-
ured in the OpenSM options file (opensm.opts).
QoS-Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to.
Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits.
Matching Rules
A list of rules that match an incoming PR/MPR request to a QoS-Level. The rules are processed
in order, such that the first match is applied. Each rule is built out of a set of match expressions
which should all match for the rule to apply. The matching expressions are defined for the fol-
lowing fields:
• SRC and DST to lists of port groups
• Service-ID to a list of Service-ID values or ranges
• QoS-Class to a list of QoS-Class values or ranges
IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and
Packet Lifetime available on the multicast group which forms this broadcast group.
SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID
in the PR/MPR by itself and uses that information in setting up the QP.
SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-
ID rules). The Service-ID is reported by the I/O Controller in the ServiceEntries DM attribute
and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the
discovered fabric elements.
The temporary enable does not affect the SMP firewall. This remains active even if the "cr-space"
is temporarily permitted.
If you do not explicitly restore hardware access when the maintenance operation is completed, the
driver restart will NOT do so. The driver will come back after restart with hardware access dis-
abled. Note, though, that the SMP firewall will still be active.
A host reboot will restore hardware access (with SMP firewall active). Thus, when you disable
hardware access, you should restore it immediately after maintenance has been completed, either
by using the flint command above or by rebooting the host (or both).
Changing the IPoIB mode (CM vs UD) requires the interface to be in ‘down’ state.
If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine.
A patch for DHCP may be required for supporting IPoIB. For further information,
please see the README which is available under the docs/dhcp/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hard-
ware address. To overcome this problem, DHCP over InfiniBand messages convey a client iden-
tifier field used to identify the DHCP session. This client identifier field can be used to associate
an IP address with a client identifier value, such that the DHCP server will grant the same IP
address to any client that conveys this client identifier.
The length of the client identifier field is not fixed in the specification. For the Mellanox OFED
for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for
this client identifier.
A DHCP client can be used if you need to prepare a diskless machine with an
IB driver.
In order to use a DHCP client identifier, you need to first create a configuration file that defines
the DHCP client identifier.
Then run the DHCP client with this file using the following command:
dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 26428), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier
ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier
20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}
In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1
ration. The IPoIB configuration file can specify either or both of the following data for an IPoIB
interface:
• A static IPoIB configuration
• An IPoIB configuration based on an Ethernet configuration
See your Linux distribution documentation for additional information about configuring IP
addresses.
The following code lines are an excerpt from a sample IPoIB configuration file:
# Static settings; all values provided by this file
IPADDR_ib0=11.4.3.175
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
# Based on eth0; each '*' will be replaced with a corresponding octet
# from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
This manual configuration persists only until the next reboot or driver restart.
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
Step 1. To configure the interface, enter the ifconfig command with the following items:
• The appropriate IB interface (ib0, ib1, etc.)
• The IP address that you want to assign to the interface
• The netmask keyword
• The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate
interface identifier ib# argument.
The following example shows how to verify the configuration:
host1$ ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:65520 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).
3.2.5.1.3 Subinterfaces
You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such
subinterface (also called a child interface) has a different IP and network addresses from the pri-
mary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent)
interface.
This section describes how to
• Create a subinterface (Section 3.2.5.1.3.1)
• Remove a subinterface (Section 3.2.5.1.3.2)
• The only meaningful bonding policy in IPoIB is High-Availability (bonding mode num-
ber 1, or active-backup)
• Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence, the only
supported value is the default: 0 (or "none" in SLES11)
For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts
semantics, with the following exceptions/additions:
• In the bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding
semantics, use the following parameter: MTU=65520
65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see
Section 3.2.5.1.1, “IPoIB Mode Setting”, on page 126) and are configured with
the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you
do not set the correct MTU or do not set the MTU at all, performance of the interface
might decrease.
• In the bonding slave configuration file (e.g. ifcfg-ib0), use the same Linux Network
Scripts semantics. In particular: DEVICE=ib0
• In the bonding slave configuration file (e.g. ifcfg-ib0.8003), the line TYPE=InfiniBand
is necessary when using bonding over devices configured with partitions (p_key)
• For RHEL users:
In /etc/modprobe.d/bond.conf add the following line:
alias bond0 bonding
• For SLES users:
It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/
network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise,
the bonding master may be created before the IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and
Ethernet bonding masters. However, it is not possible to mix Ethernet and IPoIB slaves under
the same bonding master.
Restarting openibd does not keep the bonding configuration via Network Scripts. You
have to restart the network service in order to bring up the bonding master. After the
configuration is saved, restart the network service by running: /etc/init.d/network
restart.
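For example, a minimal pair of RHEL-style configuration files might look as follows (all
addresses and values are examples; adapt them to your fabric):
/etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
IPADDR=11.4.3.175
NETMASK=255.255.0.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"
MTU=65520
/etc/sysconfig/network-scripts/ifcfg-ib0:
DEVICE=ib0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none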
The diagram shows how traffic from the Virtual Machine goes to the virtual bridge in the
Hypervisor and from the bridge to the eIPoIB interface. The eIPoIB interface is the Ethernet
interface that enslaves the IPoIB interfaces in order to carry packets between the Ethernet
interface in the Virtual Machine and the IB fabric beneath.
In a RHEL KVM environment, there are other methods to create/configure your virtual net-
work (e.g. macvtap). For additional information, please refer to the Red Hat User Manual.
The IPoIB daemon (ipoibd) detects each new virtual interface that is attached to the same bridge
as the eIPoIB interface and creates a new IPoIB instance for it in order to send/receive data. As a
result, a number of IPoIB interfaces (ibX.Y) are shown as being created/destroyed and enslaved
to the corresponding ethX interface to serve any active VIF in the system according to the set
configuration. This process is done automatically by the ipoibd service.
To see the list of IPoIB interfaces enslaved under an eth_ipoib interface, run:
# cat /sys/class/net/ethX/eth/vifs
For example:
# cat /sys/class/net/eth5/eth/vifs
SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
SLAVE=ib0.3 MAC=52:54:00:60:55:89 VLAN=N/A
Each ethX interface has at least one ibX.Y slave to serve the PIF itself. In the VIFs list of ethX
you will notice that ibX.1 is always created to serve applications running from the Hypervisor
directly on top of the ethX interface.
For InfiniBand applications that require native IPoIB interfaces (e.g. CMA), the original IPoIB
interfaces ibX can still be used. For example, CMA and the ethX driver can co-exist and make use
of IPoIB ports: CMA can use ib0, while the eth0.ipoib interface will use the ibX.Y interfaces.
To see the list of eIPoIB interfaces, run:
# cat /sys/class/net/eth_ipoib_interfaces
For example:
# cat /sys/class/net/eth_ipoib_interfaces
eth4 over IB port: ib0
eth5 over IB port: ib1
The example above shows two eIPoIB interfaces, where eth4 runs traffic over ib0, and eth5 runs
traffic over ib1.
The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual
Machines.
To display the services provided to the Virtual Machine interfaces:
# cat /sys/class/net/eth0/eth/vifs
Example:
# cat /sys/class/net/eth0/eth/vifs
SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
In the example above, the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no
VLAN tag for that interface.
Step 3. Attach the new VLAN interface to the same bridge that the virtual machine interface is
already attached to.
# brctl addif <br-name> <interface-name>
For example, to create the VLAN tag 3 with PKey 0x8003 over that port in the eIPoIB interface
eth4, run:
# vconfig add eth4 3
# brctl addif br2 eth4.3
Ports on the BridgeX's external side are called external ports, or eports. Every BridgeX that is in
use with EoIB needs to have one or more eports connected.
The mlx4_vnic_confd service is used to read these configuration files and pass the relevant data
to the mlx4_vnic module. EoIB Host Administered vNic supports two forms of configuration
files:
• “Central Configuration File - /etc/infiniband/mlx4_vnic.conf”
• “vNic Specific Configuration Files - ifcfg-ethX”
Both forms of configuration supply the same functionality. If both forms of configuration files
exist, the central configuration file has precedence and only this file will be used.
Field        Description
name         The name of the interface that is displayed when running ifconfig.
mac          The MAC address to assign to the vNic.
ib_port      The device name and port number in the form [device name]:[port
             number]. The device name can be retrieved by running ibv_devinfo
             and using the output of the hca_id field. The port number can have a
             value of 1 or 2.
vid          [Optional field] If a VLAN ID exists, the vNic will be assigned the
             specified VLAN ID. This value must be between 0 and 4095.
             • If the vid is set to 'all', the ALL-VLAN mode will be enabled and the vNic
             will support multiple VLAN tags.
             • If no vid is specified or the value -1 is set, the vNic will be assigned to the
             default vHub associated with the GW.
DEVICE=eth2
HWADDR=00:30:48:7d:de:e4
BOOTPROTO=dhcp
ONBOOT=yes
BXADDR=BX001
BXEPORT=A10
VNICIBPORT=mlx4_0:1
VNICVLAN=3 (Optional field)
GW_PKEY=0xfff1
The fields used in the file for vNic configuration have the following meaning:
Table 6 - Red Hat Linux mlx4_vnic.conf file format
Field        Description
DEVICE       An optional field. The name of the interface that is displayed when
             running ifconfig. If it is not present, the trailer of the configuration
             file name (e.g. ifcfg-eth47 => "eth47") is used instead.
HWADDR       The MAC address to assign to the vNic.
BXADDR       The BridgeX box system GUID or system name string.
BXEPORT      The string describing the eport name.
VNICVLAN     [Optional field] If it exists, the vNic will be assigned the VLAN ID
             specified. This value must be between 0 and 4095, or 'all' for the
             ALL-VLAN feature.
VNICIBPORT   The device name and port number in the form [device name]:[port
             number]. The device name can be retrieved by running ibv_devinfo
             and using the output of the hca_id field. The port number can have a
             value of 1 or 2.
GW_PKEY      [Optional field] If the discovery_pkey module parameter is set, this
             value controls which partition is used to discover the gateways.
             For more information about discovery PKeys, please refer to
             Section 3.2.5.3.3.5, “Discovery Partitions Configuration”, on
             page 144
Other fields available for regular eth interfaces in the ifcfg-ethX files may also be used.
3.2.5.3.2.4 mlx4_vnic_confd
Once the configuration files are updated, the host administered vNics can be created. To manage
the host administered vNics, run the following script:
Usage: /etc/init.d/mlx4_vnic_confd {start|stop|restart|reload|status}
This script manages host administered vNics only. To retrieve general information about the
vNics on the system, including network administered vNics, refer to Section 3.2.5.3.3.1,
“mlx4_vnic_info”, on page 142.
When using the BXADDR/bx field, all vNics' BX address configuration should be consis-
tent: either all of them use the GUID format, or all use the name format.
The MAC and VLAN values are set using the configuration files only; other tools, such
as vconfig for VLAN modification or ifconfig for MAC modification, are not sup-
ported.
Using a VLAN tag value of 0 is not recommended because the traffic using it would
not be separated from non-VLAN traffic.
For Host administered vNics, VLAN entry must be set in the BridgeX first. For further
information, please refer to BridgeX® documentation.
If EoIB configuration files are included, ifcfg-eth<n> files will be installed under /etc/
sysconfig/network-scripts/ on a RedHat machine and under /etc/sysconfig/network/ on
a SuSE machine.
3.2.5.3.3.1 mlx4_vnic_info
To retrieve information regarding EoIB interfaces, use the script mlx4_vnic_info. This script pro-
vides detailed information about a specific vNic or all EoIB vNic interfaces, such as: BX info,
IOA info, SL, PKEY, Link state and interface features. If network administered vNics are
enabled, this script can also be used to discover the available BridgeX® boxes from the host side.
• To discover the available gateway, run:
mlx4_vnic_info -g
• To receive the full vNic information of eth10, run:
mlx4_vnic_info -i eth10
• To receive a shorter information report on eth10, run:
mlx4_vnic_info -s eth10
• To get help and usage information, run:
mlx4_vnic_info --help
reported as the vNic interface link state. If the connection between the vNic and the
BridgeX is broken (hence the external port state is unknown), the link will be reported as
down.
• the link state of the external port associated with the vNic interface
If the link state is down on a host administered vNic while the BridgeX is connected and
the InfiniBand fabric appears to be functional, the issue might result from a
misconfiguration of the BXADDR and/or BXEPORT fields in the configuration file.
To query the link state run the following command and look for "Link detected":
ethtool <interface name>
Example:
ethtool eth10
Settings for eth10:
Supported ports: [ ]
Supported link modes:
Supports auto-negotiation: No
Advertised link modes: Not reported
Advertised auto-negotiation: No
Speed: Unknown! (10000)
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000000 (0)
Link detected: yes
Due to EoIB protocol overhead, the maximum MTU value that can be set for the vNic
interface is 4038 bytes. If the vNic is configured to use VLANs, the maximum MTU is
4034 bytes (due to VLAN header insertion).
When using a non-default partition, the GW partitions should also be configured on the
GW in the BridgeX. Additionally, the Subnet Manager must be configured accordingly.
the same InfiniBand address. The same behavior can be expected from the host EoIB driver,
which also sends packets to the relevant InfiniBand addresses while disregarding the VLAN. In
both scenarios, the Ethernet packet that is embedded in the EoIB packet includes the VLAN
header, enabling VLAN enforcement either in the Ethernet fabric or at the receiving EoIB host.
ALL VLAN must be supported by both the BridgeX® and the host side.
When enabling ALL VLAN, all gateways (LAG or legacy) that have eports belonging
to a gateway group (GWG) must be configured to the same behavior.
For example, it is impossible to have gateway A2 configured to all-vlan mode and A3
to regular mode, because both belong to GWG A.
A gateway that is configured to work in ALL VLAN mode cannot accept login
requests from:
• vNics that do not support this mode
• host admin vNics that were not configured to work in ALL VLAN mode by setting
the vlan-id value to 'all', as described in Section 3.2.5.3.3.8, “Creating
vNICs that Support ALL VLAN Mode”, on page 145.
Example:
# mlx4_vnic_info -g A2
IOA_PORT mlx4_0:1
BX_NAME bridge-119c64
BX_GUID 00:02:c9:03:00:11:61:67
EPORT_NAME A2
EPORT_ID 63
STATE connected
GW_TYPE LEGACY
PKEY 0xffff
ALL_VLAN yes
• vNic Support
To verify that the vNic is configured to ALL-VLAN mode, run:
mlx4_vnic_info -i <interface>
Example:
# mlx4_vnic_info -i eth204
NETDEV_NAME eth204
NETDEV_LINK up
NETDEV_OPEN yes
GW_TYPE LEGACY
ALL_VLAN yes
For further information on mlx4_vnic_info script, please see Section 3.2.5.3.3.1,
“mlx4_vnic_info”, on page 142.
The driver detects virtual interfaces' MAC addresses based on their outgoing packets, so
you may notice that a virtual MAC address is detected by the EoIB driver only
after the first packet is sent out by the Guest OS. Virtual resources' MAC address
cleanup is managed by the mlx4_vnic daemon, as explained in Section 3.2.5.3.4.9,
“Resources Cleanup”, on page 149.
3.2.5.3.4.7 VLANs
Virtual LANs are supported at the EoIB vNic level, where VLAN tagging/untagging is done by
the EoIB driver.
• To enable VLANs on top of an EoIB vNic:
a. Create a new vNic interface with the corresponding VLAN ID.
b. Enslave it to a virtual bridge to be used by the Guest OS. The VLAN tagging/untagging is
transparent to the Guest and managed at the EoIB driver level.
The vconfig utility is not supported by the EoIB driver; a new vNic instance must be
created instead. For further information, see Section 3.2.5.3.2.6, “VLAN Configura-
tion”, on page 140.
Virtual Guest Tagging (VGT) is not supported. The model explained above applies to
Virtual Switch Tagging (VST) only.
3.2.5.3.4.8 Migration
Some Hypervisors provide the ability to migrate a virtual machine from one physical server to
another; this feature is seamlessly supported by PV-EoIB. Any network connectivity over EoIB
will automatically be resumed on the new physical server. The downtime that may occur during
this process is minor.
Some Hypervisors may not have enough memory for the driver domain; as a result, the
mlx4_vnic driver may fail to initialize or to create more vNics, causing the machine to become
unresponsive.
• To avoid this behavior, you can:
a. Allocate more memory for the driver domain.
Usage:
The application calls the ibv_exp_reg_mr API, which turns on the
IBV_EXP_ACCESS_ALLOCATE_MR bit and sets the input address to NULL. Upon success, the
address field of the struct ibv_mr will hold the address of the allocated memory block. This block
will be freed implicitly when ibv_dereg_mr() is called.
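The following is a minimal sketch of this flow (not taken from the product sources; it assumes the
experimental verbs header infiniband/verbs_exp.h from this release and abbreviates error
handling):
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs_exp.h>

/* Ask the driver to allocate and register a (preferably contiguous) block.
 * 'pd' is an already-allocated protection domain. */
static struct ibv_mr *alloc_driver_mr(struct ibv_pd *pd, size_t length)
{
        struct ibv_exp_reg_mr_in in;

        memset(&in, 0, sizeof(in));
        in.pd         = pd;
        in.addr       = NULL;                       /* NULL => driver allocates */
        in.length     = length;
        in.exp_access = IBV_EXP_ACCESS_LOCAL_WRITE |
                        IBV_EXP_ACCESS_ALLOCATE_MR; /* driver-side allocation */

        struct ibv_mr *mr = ibv_exp_reg_mr(&in);
        if (mr)
                printf("block allocated at %p\n", mr->addr);
        return mr;  /* the block is freed implicitly by ibv_dereg_mr(mr) */
}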
The following are environment variables that can be used to control error cases / contiguity:
Parameter                        Description
MLX_MR_ALLOC_TYPE                Configures the allocator type.
                                 • ALL (Default) - Tries all possible allocators and
                                 selects the most efficient one.
                                 • ANON - Enables the usage of anonymous pages and
                                 disables the allocator.
                                 • CONTIG - Forces the usage of the contiguous pages
                                 allocator. If contiguous pages are not available, the
                                 allocation fails.
MLX_MR_MAX_LOG2_CONTIG_BSIZE     Sets the maximum contiguous block size order.
                                 • Values: 12-23
                                 • Default: 23
MLX_MR_MIN_LOG2_CONTIG_BSIZE     Sets the minimum contiguous block size order.
                                 • Values: 12-23
                                 • Default: 12
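For example, to force the contiguous-pages allocator and cap the maximum block size order for a
single run of an application (the binary name is a placeholder):
MLX_MR_ALLOC_TYPE=CONTIG MLX_MR_MAX_LOG2_CONTIG_BSIZE=21 ./my_rdma_app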
The Shared Memory Region (MR) feature enables sharing an MR among applications by
implementing the "Register Shared MR" verb, which is part of the IB spec.
Sharing MR involves the following steps:
Step 1. Request to create a shared MR
The application sends a request via the ibv_exp_reg_mr API to create a shared MR. The
application supplies the allowed sharing access to that MR. If the MR was created suc-
cessfully, a unique MR ID is returned as part of the struct ibv_mr which can be used by
other applications to register with that MR.
The underlying physical pages must not be Least Recently Used (LRU) or Anonymous.
To avoid such pages, you need to turn on the IBV_EXP_ACCESS_ALLOCATE_MR bit as part of
the sharing bits.
Usage:
• Turn on, via ibv_exp_reg_mr, one or more of the sharing access bits. The sharing bits are listed in
the ibv_exp_reg_mr man page.
• Turn on the IBV_EXP_ACCESS_ALLOCATE_MR bit.
Step 2. Request to register to a shared MR
A new verb called ibv_exp_reg_shared_mr is added to enable sharing an MR. To use this
verb, the application supplies the MR ID that it wants to register for and the desired access
mode to that MR. The desired access is validated against its given permissions and upon
successful creation, the physical pages of the original MR are shared by the new MR.
Once the MR is shared, it can be used even if the original MR was destroyed.
The request to share the MR can be repeated multiple times and an arbitrary number of
Memory Regions can potentially share the same physical memory locations.
Usage:
• Use the “handle” field that was returned from ibv_exp_reg_mr as the mr_handle.
• Supply the desired “access mode” for that MR.
• Supply the address field, which can be either NULL or any hint for the required output. The address and
its length are returned as part of the ibv_mr struct.
To achieve high performance, it is highly recommended to supply an address with the same alignment as
the original memory region address. Generally, alignment to a 4MB address suffices.
For further information on how to use the ibv_exp_reg_shared_mr verb, please refer to the
ibv_exp_reg_shared_mr man page and/or to the ibv_shared_mr sample program which demon-
strates a basic usage of this verb.
Further information on the ibv_shared_mr sample program can be found in the ibv_shared_mr
man page.
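As an illustration only (the struct and field names follow the experimental verbs API assumed
above; consult the man pages for the authoritative definitions), the two steps may look like:
/* Step 1 (process A): create a shareable MR. */
struct ibv_exp_reg_mr_in in;
memset(&in, 0, sizeof(in));
in.pd         = pd;
in.addr       = NULL;
in.length     = len;
in.exp_access = IBV_EXP_ACCESS_LOCAL_WRITE |
                IBV_EXP_ACCESS_ALLOCATE_MR |          /* non-LRU, non-anonymous pages */
                IBV_EXP_ACCESS_SHARED_MR_USER_READ |  /* allowed sharing bits */
                IBV_EXP_ACCESS_SHARED_MR_USER_WRITE;
struct ibv_mr *orig_mr = ibv_exp_reg_mr(&in);
/* orig_mr->handle is the MR ID that other applications register against. */

/* Step 2 (process B): register to the shared MR by its ID. */
struct ibv_exp_reg_shared_mr_in sin;
memset(&sin, 0, sizeof(sin));
sin.mr_handle  = mr_id;    /* the handle published by process A */
sin.pd         = pd;
sin.addr       = NULL;     /* or a hint, ideally with the original alignment */
sin.exp_access = IBV_EXP_ACCESS_LOCAL_WRITE;
struct ibv_mr *shared_mr = ibv_exp_reg_shared_mr(&sin);
/* shared_mr->addr and shared_mr->length describe the new mapping. */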
int ibv_exp_rereg_mr(struct ibv_mr *mr, int flags, struct ibv_pd *pd, void *addr, size_t
length, uint64_t access, struct ibv_exp_rereg_mr_attr *attr);
Memory Windows API cannot co-work with peer memory clients (Mellanox PeerDi-
rect™).
User-mode Memory Registration (UMR) is a fast registration mode which uses the send queue.
UMR support enables the usage of RDMA operations that scatter data at the remote side
through the definition of appropriate memory keys on the remote side.
UMR enables the user to:
• Create indirect memory keys from previously registered memory regions, including
creation of KLMs from previous KLMs. There are no data alignment or length restrictions
associated with the memory regions used to define the new KLMs.
• Create memory regions which support the definition of regular, non-contiguous memory
regions.
To check which operations are supported for a given transport, the capabilities field needs to be
masked with one of the following masks:
enum ibv_odp_transport_cap_bits {
IBV_EXP_ODP_SUPPORT_SEND = 1 << 0,
IBV_EXP_ODP_SUPPORT_RECV = 1 << 1,
IBV_EXP_ODP_SUPPORT_WRITE = 1 << 2,
IBV_EXP_ODP_SUPPORT_READ = 1 << 3,
IBV_EXP_ODP_SUPPORT_ATOMIC = 1 << 4,
IBV_EXP_ODP_SUPPORT_SRQ_RECV = 1 << 5,
};
For example, to check if RC supports send:
if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND)
        /* RC supports send operations with ODP MR */
For further information, please refer to the ibv_exp_query_device manual page.
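Putting the query together, a sketch of the full flow might look as follows (it assumes the
comp_mask bit IBV_EXP_DEVICE_ATTR_ODP requests the ODP capabilities, and that 'context'
is an opened device context as in the other examples in this section):
struct ibv_exp_device_attr dattr;

memset(&dattr, 0, sizeof(dattr));
dattr.comp_mask = IBV_EXP_DEVICE_ATTR_ODP;   /* request the ODP capabilities */
if (ibv_exp_query_device(context, &dattr))
        return;                               /* query failed */
if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND)
        printf("RC supports send operations with ODP MRs\n");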
Example:
struct ibv_exp_prefetch_attr prefetch_attr;
prefetch_attr.flags = IBV_EXP_PREFETCH_WRITE_ACCESS;
prefetch_attr.addr = addr;
prefetch_attr.length = length;
prefetch_attr.comp_mask = 0;
ibv_exp_prefetch_mr(mr, &prefetch_attr);
For further information, please refer to the ibv_exp_prefetch_mr manual page.
3.2.7.7 Inline-Receive
When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE.
Using Inline-Receive saves a PCIe read transaction since the HCA does not need to read the
scatter list; therefore, it improves performance for short receive messages.
On poll CQ, the driver copies the received data from the WQE/CQE to the user's buffers. There-
fore, apart from querying the Inline-Receive capability and activating Inline-Receive, the feature
is transparent to the user application.
When Inline-Receive is active, the user application must provide valid virtual addresses
for the receive buffers to allow the driver to move the inline-received message to these
buffers. The validity of these addresses is not checked; therefore, the result of providing
invalid virtual addresses is unexpected.
Connect-IB® supports Inline-Receive on both the requestor and the responder sides. Since data
is copied at the poll CQ verb, Inline-Receive on the requestor side is possible only if the user
chooses IB(V)_SIGNAL_ALL_WR.
For example:
struct ibv_exp_device_attr device_attr = {.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED -
1};
ibv_exp_query_device(context, &device_attr);
if (device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MEM_WINDOW ||
device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MW_TYPE_2B) {
        /* Memory window is supported */
}
When loading the ib_srp module, it is possible to set the module parameter
srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O
(default: 12).
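For example, to set it persistently (32 is an example value), add the following line to a file under
/etc/modprobe.d/:
options ib_srp srp_sg_tablesize=32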
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that
target under /dev, use the following command:
echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\
pkey=ffff,service_id=[service[0] value] > \
/sys/class/infiniband_srp/srp-mlx[hca number]-[port number]/add_target
See Section , “SRP Tools - ibsrpdm, srp_daemon and srpd Service Script”, on page 164 for
instructions on how the parameters in this echo command may be obtained.
Notes:
• Execution of the above “echo” command may take some time
• The SM must be running while the command executes
• It is possible to include additional parameters in the echo command:
• max_cmd_per_lun - Default: 62
• max_sect (short for max_sectors) - sets the request size of a command
• io_class - Default: 0x100 as in rev 16A of the specification (In rev 10 the default was 0xff00)
• tl_retry_count - a number in the range 2..7 specifying the IB RC retry count. Default: 2
• comp_vector, a number in the range 0..n-1 specifying the MSI-X completion vector. Some HCAs
allocate multiple (n) MSI-X vectors per HCA port. If the IRQ affinity masks of these interrupts have
been configured such that each MSI-X interrupt is handled by a different CPU, then the comp_vector
parameter can be used to spread the SRP completion workload over multiple CPUs.
• cmd_sg_entries, a number in the range 1..255 that specifies the maximum number of data buffer
descriptors stored in the SRP_CMD information unit itself. With allow_ext_sg=0 the parameter
cmd_sg_entries defines the maximum S/G list length for a single SRP_CMD, and commands whose
S/G list length exceeds this limit after S/G list collapsing will fail.
• initiator_ext - Please refer to Section 9 (Multiple Connections...)
• To list the new SCSI devices that have been added by the echo command, you may use
either of the following two methods:
• Execute “fdisk -l”. This command lists all devices; the new devices are included in this
listing.
• Execute “dmesg” or look at /var/log/messages to find messages with the names of the
new devices.
3.3.1.1.4 ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default umad device (/sys/class/infiniband_mad/
umad0), execute the following command:
ibsrpdm
This command will output information on each SRP Target detected, in human-readable
form.
Sample output:
IO Unit Info:
port LID: 0103
port GID: fe800000000000000002c90200402bd5
change ID: 0002
max controllers: 0x10
controller[ 1]
GUID: 0002c90200402bd4
vendor ID: 0002c9
device ID: 005a44
IO class : 0100
ID: LSI Storage Systems SRP Driver 200400a0b81146a1
service entries: 1
service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following
command:
ibsrpdm -d <umad device>
2. Assistance in creating an SRP connection
a. To generate output suitable for utilization in the “echo” command of Section , “Manually Establishing an
SRP Connection”, on page 162, add the
‘-c’ option to ibsrpdm:
ibsrpdm -c
Sample output:
id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
b. To establish a connection with an SRP Target using the output from the ‘ibsrpdm -c’ example above,
execute the following command:
echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/
class/infiniband_srp/srp-mlx4_0-1/add_target
The SRP connection should now be up; the newly created SCSI devices should appear in the listing
obtained from the ‘fdisk -l’ command.
3. Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than by just
using /sys/class/infiniband_mad/umad<N>, where <N> is a digit
3.3.1.1.5 srpd
The srpd service script allows automatic activation and termination of the srp_daemon utility on
all of the system's live InfiniBand ports.
3.3.1.1.6 srp_daemon
The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsr-
pdm functionality described above, srp_daemon can also
• Establish an SRP connection by itself (without the need to issue the “echo” command
described in Section , “Manually Establishing an SRP Connection”, on page 162)
• Continue running in the background, detecting new targets and establishing SRP connec-
tions with them (daemon mode)
• Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than
just by
/dev/umad<N>, where <N> is a digit
• Enable High Availability operation (together with Device-Mapper Multipath)
• Have a configuration file that determines the targets to connect to
1. srp_daemon commands equivalent to ibsrpdm:
"srp_daemon -a -o" is equivalent to "ibsrpdm"
"srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or
run ‘ls /sys/class/infiniband’.
• To both discover the SRP Targets and establish connections with them, just add the -e
option to the above command.
• Executing srp_daemon over a port without the -a option will display only the targets that are
reachable via the port and to which the initiator is not connected. If executing with the -e
option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connection
string. (See Section for more details).
• srp_daemon has a configuration file, whose default is /etc/
srp_daemon.conf. Use the -f option to supply a different configuration file that configures the
targets srp_daemon is allowed to connect to. The configuration file can also be used to set
values for additional parameters (e.g., max_cmd_per_lun, max_sect).
• A continuous background (daemon) operation, providing an automatic ongoing detection
and connection capability. See Section .
• To connect to all the existing Targets in the fabric and to connect to new targets that will
join the fabric, execute srp_daemon -e. This utility continues to execute until it is either
killed by the user or encounters connection errors (such as no SM in the fabric).
• To execute SRP daemon as a daemon on all the ports:
• srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/
srp_daemon.log.
• Start the srpd service script, run service srpd start
• It is possible to configure this script to execute automatically when the InfiniBand driver
starts by changing the value of SRP_DAEMON_ENABLE in /etc/infiniband/
openib.conf to “yes”. However, this option also enables SRP High Availability, which has
some more features (see Section , “High Availability (HA)”, on page 168).
For the changes in openib.conf to take effect, run:
/etc/init.d/openibd restart
each path. The convention is to use the Target port GUID as the initiator_ext value for the rele-
vant path.
If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to
this convention. For example:
id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,\
dgid=fe800000000000000002c90200402bed,pkey=ffff,\
service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automat-
ically at startup by setting SRP_DAEMON_ENABLE to yes).
3.3.1.1.7 Operation
When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If
this process gets to the reset_host stage and there is no path to the target from this port, ib_srp
will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to
this target (from another port/HCA).
When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then
request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host
for this target. Multipath will be executed on the devices of this host, returning to the original
state (prior to the failed path).
It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be
identified by their names. You can configure the /etc/multipath.conf file to change
multipath behavior.
It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can
occur if the SRP LUNs are in the black-list of multipath. Edit the ‘blacklist’ section in
/etc/multipath.conf and make sure the SRP LUNs are not black-listed.
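For example, a fragment of /etc/multipath.conf that exempts a specific SRP LUN from the
blacklist (the WWID is a placeholder; use the value reported for your LUN):
blacklist_exceptions {
        wwid "<WWID of the SRP LUN>"
}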
Setting up the iSER target is outside the scope of this manual. For guidelines on how to do so,
please refer to the relevant target documentation (e.g. stgt, clitarget).
The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils
package.
Make sure iSCSI is enabled and properly configured on your system before proceeding with
iSER. Additionally, make sure you have RDMA connectivity between the initiator and the target;
the rping utility can be used to verify it:
rping -s [-vVd] [-S size] [-C count] [-a addr] [-p port]
Target settings such as timeouts and retries are set the same as for any other iSCSI target.
If targets are set to auto-connect on boot and are unreachable, it may take a long
time to continue the boot process if timeouts and max retries are set too high.
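For example, a typical iSER login sequence with iscsiadm looks as follows (the target portal
and IQN are placeholders):
# Discover the targets exposed by the portal
iscsiadm -m discovery -t sendtargets -p <target IP>
# Switch the node's transport from TCP to iSER
iscsiadm -m node -T <target IQN> -p <target IP> -o update -n iface.transport_name -v iser
# Log in to the target
iscsiadm -m node -T <target IQN> -p <target IP> --login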
3.3.3 Lustre
Lustre Compilation for MLNX_OFED:
This procedure applies to RHEL/SLES OSs supported by Lustre. For further information,
please refer to Lustre Release Notes.
3.4 Virtualization
Step 5. Install the MLNX_OFED driver for Linux that supports SR-IOV.
SR-IOV can be enabled and managed by using one of the following methods:
• Run the mlxconfig tool and set the SRIOV_EN parameter to “1” without re-burning the firmware
(an example invocation is shown after this list):
SRIOV_EN = 1
For further information, please refer to section “mlxconfig - Changing Device Configuration Tool”
in the MFT User Manual (www.mellanox.com > Products > Software > Firmware Tools).
• Burn firmware with SR-IOV support where the number of virtual functions (VFs) will be set to 16
--enable-sriov
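For example, SR-IOV with 16 VFs can be enabled via mlxconfig as follows (the MST device
name is an example value; the new configuration takes effect after a reboot):
mst start
mlxconfig -d /dev/mst/mt4099_pci_cr0 set SRIOV_EN=1 NUM_OF_VFS=16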
Step 6. Verify the HCA is configured to support SR-IOV.
[root@selene ~]# mstflint -dev <PCI Device> dc
num_pfs      1
             Note: This field is optional and might not always appear.
total_vfs    • When using firmware version 2.31.5000 and above, the
             recommended value is 126.
             • When using firmware version 2.30.8000 and below, the
             recommended value is 63.
Notes:
1. If SR-IOV is supported, to enable SR-IOV (if it is not enabled), it is sufficient to set “sriov_en = true” in the INI.
2. If the HCA does not support SR-IOV, please contact Mellanox Support: [email protected]
num_vfs      Notes:
             • PFs not included in the above list will not have SR-
             IOV enabled.
             • Triplets and single port VFs are only valid when all
             ports are configured as Ethernet. When an InfiniBand
             port exists, only the num_vfs=a syntax is valid, where
             “a” is a single value that represents the number of VFs.
             • The second parameter in a triplet is valid only when
             there is more than one physical port.
             • In a triplet x,y,z, x+z<=63 and y+z<=63; the maximum
             number of VFs on each physical port must be 63.
port_type_array   Specifies the protocol type of the ports. It is either one array
             of 2 port types 't1,t2' for all devices, or a list of BDF to
             port_type_array 'bb:dd.f-t1;t2,...'. (string)
             Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A
             If only a single port is available, use the N/A port type for
             port 2 (e.g. '1,4').
probe_vf     • If absent or zero: no VF interfaces will be loaded in
             the Hypervisor/host.
             • If it is a number in the range of 1-63, the driver
             running on the Hypervisor will itself activate that
             number of VFs. All these VFs will run on the Hyper-
             visor. This number will apply to all ConnectX® HCAs
             on that host.
             • If it is a triplet x,y,z (applies only if all ports are con-
             figured as Ethernet), the driver probes:
             • x single port VFs on physical port 1
             • y single port VFs on physical port 2 (applies only if such a
             port exists)
             • z n-port VFs (where n is the number of physical ports on
             the device). Those VFs are attached to the hypervisor.
             • If its format is a string: the string specifies the
             probe_vf parameter separately per installed HCA.
             The string format is: "bb:dd.f-v,bb:dd.f-v,…"
             • bb:dd.f = bus:device.function of the PF of the HCA
             • v = number of VFs to use in the PF driver for that HCA,
             which is either a single value or a triplet, as described above
             For example:
             • probe_vf=5 - The PF driver will activate 5 VFs on
             the HCA, and this will be applied to all ConnectX®
             HCAs on the host
             • probe_vf=00:04.0-5,00:07.0-8 - The PF driver
             will activate 5 VFs on the HCA positioned in BDF
             00:04.0 and 8 for the one in 00:07.0
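For example, a (hypothetical) /etc/modprobe.d/mlx4_core.conf entry that activates 8 VFs, probes
one of them in the Hypervisor and configures both ports as Ethernet:
options mlx4_core num_vfs=8 probe_vf=1 port_type_array=2,2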
If the SR-IOV is not supported by the server, the machine might not come out of boot/
load.
Step 10. Load the driver and verify the SR-IOV is supported. Run:
lspci | grep Mellanox
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR
/ 10GigE] (rev b0)
03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
(rev b0)
03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
(rev b0)
03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
(rev b0)
03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
(rev b0)
03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
(rev b0)
Where:
• “03:00" represents the Physical Function
• “03:00.X" represents the Virtual Function connected to the Physical Function
Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
Step 1. Run the virt-manager.
Step 2. Double click on the virtual machine and open its Properties.
Step 3. Go to Details->Add hardware ->PCI host device.
Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1)
Step 5. If the Virtual Machine is up reboot it, otherwise start it.
Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
lspci | grep Mellanox
Only the PFs are set via this mechanism. The VFs inherit their port types from their asso-
ciated PF.
• <pci id> directories - one for Dom0 and one per guest. Here, you may see the map-
ping between virtual and physical pkey indices, and the virtual to physical gid 0.
Currently, the GID mapping cannot be modified, but the pkey virtual-to-physical mapping can.
These directories have the structure:
• <pci_id>/port/<m>/gid_idx/0 where m = 1..2 (this is read-only)
and
• <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126
For instructions on configuring pkey_idx, please see below.
Port Up/Down
When moving to multi-guid mode, the port is assumed to be up when the base GUID index per
VF/PF (entry 0) has a valid value. Setting other GUID entries does not affect the port status.
However, any change in a GUID will cause a GUID change event for its VF/PF even if it is not
the base one.
Persistency Support
Once an admin request is rejected by the SM, a retry mechanism is activated. The retry time is set
to 1 second, and for each retry it is multiplied by 2 until reaching the maximum value of 60
seconds. Additionally, when looking for the next record to be updated, the record having the
lowest time to be executed is chosen.
Any value reset via the admin_guid interface is immediately executed and it resets this entry
timer.
The ",ipoib" suffix causes OpenSM to pre-create the IPoIB broadcast group for the indicated
PKeys.
Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
Step a. Check the PCI ID for the Physical Function and the Virtual Functions.
lspci | grep Mel
Step b. Assuming that on Host1, the physical function displayed by lspci is "0000:02:00.0", and that
on Host2 it is "0000:03:00.0"
On Host1 do the following.
cd /sys/class/infiniband/mlx4_0/iov
0000:02:00.0 0000:02:00.1 0000:02:00.2 ...
Note: 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical
function, while 0000:02:00.X contain the virtual-to-physical mapping tables for the
virtual functions.
Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only
tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF
0000:02:00.1 and vm2 uses VF 0000:02:00.2
Step c. Configure the virtual-to-physical PKey mapping for the VMs.
echo 0 > 0000:02:00.1/ports/1/pkey_idx/1
echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
echo 2 > 0000:02:00.2/ports/1/pkey_idx/0
vm1 pkey index 0 will be mapped to physical pkey-index 1, and vm2 pkey index
0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their
pkey index 1 mapped to the default pkey.
Step d. On Host2 do the following.
cd /sys/class/infiniband/mlx4_0/iov
echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
echo 2 > 0000:03:00.2/ports/1/pkey_idx/0
Step e. Once the VMs are running, you can check a VM's virtualized PKey table by running (on the
VM):
cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]
Step 3. Start up the VMs (and bind VFs to them).
Step 4. Configure IP addresses for ib0 on the host and on the guests.
3.4.1.6.5 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
When running ETH ports on VGT, the ports may be configured to simply pass through packets
as-is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to
silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging).
In the latter case, untagged or priority-tagged outgoing packets from the guest will have the
VLAN tag inserted, and incoming packets will have the VLAN tag removed. Any VLAN-tagged
packets sent by the VF are silently dropped. The default behavior is VGT.
The feature may be controlled on the Hypervisor from userspace via iproute2 / netlink:
ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
...
[ vf NUM [ mac LLADDR ]
[ vlan VLANID [ qos VLAN-QOS ] ]
...
[ spoofchk { on | off} ] ]
...
use:
ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]
• where NUM = 0..max-vf-num
• vlan_id = 0..4095 (4095 means "set VGT")
• qos = 0..7
For example:
• ip link set dev eth2 vf 2 vlan 10 qos 3 - sets VST mode for VF #2 belonging to PF
eth2, with vlan_id = 10 and qos = 3
• ip link set dev eth2 vf 2 vlan 4095 - sets mode for VF 2 back to VGT
When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed
under. In the example above, vf 38 is not assigned to the same port as p1p1, in contrast to vf0.
However, even VFs that are not assigned to the net device, could be used to set and change its
settings. For example, the following is a valid command to change the spoof check:
ip link set dev p1p1 vf 38 spoofchk on
This command will affect only the vf 38. The changes can be seen in ip link on the net device
that this device is assigned to.
The first entry, enable_smi_admin, is used to enable SMI on a VF. By default, the value of this
entry is zero (disabled). When set to “1”, the SMI will be enabled for the VF on the next rebind
or openibd restart on the VM that the VF is bound to. If the VF is currently bound, it must be
unbound and then re-bound.
The second sysfs entry, smi_enabled, indicates the current enablement state of the SMI. 0 indi-
cates disabled, and 1 indicates enabled. This entry is read-only.
When a VF is initialized (bound), during the initialization sequence the driver copies the
requested SMI state (enable_smi_admin) for that VF/port to the operational SMI state
(smi_enabled) for that VF/port, and operates according to the operational state.
Thus, the sequence of operations on the hypervisor is:
Step 1. Enable SMI for any VF/port that you wish.
Step 2. Restart the VM that the VF is bound to (or just run /etc/init.d/openibd restart on that
VM).
The SMI will be enabled for the VF/port combinations that you set in Step 1 above. You will then
be able to run network diagnostics from that VF.
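For example (a sketch which assumes the entries reside under the PF's iov directory, alongside
the mapping tables shown earlier; the PCI ID is an example value):
# Request SMI for the VF at 0000:02:00.1, port 1
echo 1 > /sys/class/infiniband/mlx4_0/iov/0000:02:00.1/ports/1/enable_smi_admin
# After rebinding the VF or restarting the VM, verify the operational state
cat /sys/class/infiniband/mlx4_0/iov/0000:02:00.1/ports/1/smi_enabled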
If you set the vlan_set parameter with more than 10 VLAN IDs, the driver chooses the
first 10 VLAN IDs provided and ignores all the rest.
3.4.2 VXLAN
3.4.2.1 Prerequisites
• HCA: ConnectX-3 Pro
• Firmware 2.32.5100 or higher
• RHEL7, Ubuntu 14.04 or upstream kernel 3.12.10 (or higher)
• DMFS enabled
3.5 Resiliency
1. A “fatal device” error can be a timeout from a firmware command, an error on a firmware closing command, a communication channel not being
responsive in a VF, etc.
3.5.1.3 SR-IOV
If the Physical Function recognizes the error, it notifies all the VFs about it by marking their
communication channel with that information; consequently, all the VFs and the PF are reset.
If the VF encounters an error, only that VF is reset, whereas the PF and other VFs continue to
work unaffected.
3.6 HPC-X™
For further information, please refer to HPC-X™ User Manual (www.mellanox.com --> Prod-
ucts --> HPC-X --> HPC-X Toolkit)
4.3 Addressing
This section applies to the ibdiagpath tool only. A tool command may require defining
the destination device or port to which it applies.
Utility Description
dump_fts Dumps tables for every switch found in an ibnetdiscover scan of the subnet.
The dump file format is compatible with loading into OpenSM using the -R
file -U /path/to/dump-file syntax.
For further information, please refer to the tool’s man page.
ibaddr Can be used to show the LID and GID addresses of the specified port or the
local port by default. This utility can be used as simple address resolver.
For further information, please refer to the tool’s man page.
ibcacheedit Allows users to edit an ibnetdiscover cache created through the --cache option
in ibnetdiscover(8).
For further information, please refer to the tool’s man page.
ibccconfig Supports the configuration of congestion control settings on switches and
HCAs.
For further information, please refer to the tool’s man page.
ibccquery Supports the querying of settings and other information related to congestion
control.
For further information, please refer to the tool’s man page.
ibcongest Provides static congestion analysis. It calculates routing for a given topology
(topo-mode) or uses extracted lst/fdb files (lst-mode). Additionally, it ana-
lyzes congestion for a traffic schedule provided in a "schedule-file" or uses an
automatically generated schedule of all-to-all-shift.
To display a help message which details the tool's options, please run "/opt/
ibutils2/bin/ibcongest -h".
For further information, please refer to the tool’s man page.
ibdev2netdev Enables association between IB devices and ports and the associated net
device. Additionally it reports the state of the net device link.
For further information, please refer to the tool’s man page.
ibdiagnet This version of ibdiagnet is included in the ibutils package, and it is not run by
(of ibutils) default after installing Mellanox OFED.
To use this ibdiagnet version and not that of the ibutils2 package, you need to
specify the full path: /opt/ibutils/bin
Note: ibdiagnet of ibutils is an obsolete package. We recommend using ibdiagnet
from ibutils2.
For further information, please refer to the tool’s man page.
ibdiagnet Scans the fabric using directed route packets and extracts all the available
(of ibutils2) information regarding its connectivity and devices. An ibdiagnet run performs
the following stages:
• Fabric discovery
• Duplicated GUIDs detection
• Links in INIT state and unresponsive links detection
• Counters fetch
• Error counters check
• Routing checks
• Link width and speed checks
• Alias GUIDs check
• Subnet Manager check
• Partition keys check
• Nodes information
Note: This version of ibdiagnet is included in the ibutils2 package, and it is
run by default after installing Mellanox OFED. To use this ibdiagnet version,
run: ibdiagnet.
For further information, please refer to the tool’s man page.
ibdiagpath Traces a path between two end-points and provides information regarding the
nodes and ports traversed along the path. It utilizes device specific health que-
ries for the different devices along the path.
The way ibdiagpath operates depends on the addressing mode used on the
command line. If directed route addressing is used (-d flag), the local node is
the source node and the route to the destination port is known a priori. On the
other hand, if LID-route (or by-name) addressing is employed, then the source
and destination ports of a route are specified by their LIDs (or by the names
defined in the topology file). In this case, the actual path from the local port to
the source port, and from the source port to the destination port, is defined by
means of Subnet Management Linear Forwarding Table queries of the switch
nodes along that path. Therefore, the path cannot be predicted as it may
change.
ibdiagpath should not be supplied with contradicting local ports by the -p and
-d flags (see synopsis descriptions below). In other words, when ibdiagpath is
provided with the options -p and -d together, the first port in the direct route
must be equal to the one specified in the “-p” option. Otherwise, an error is
reported.
Moreover, the tool allows omitting the source node in LID-route addressing,
in which case the local port on the machine running the tool is assumed to be
the source.
Note: When ibdiagpath queries for the performance counters along the path
between the source and destination ports, it always traverses the LID route,
even if a directed route is specified. If along the LID route one or more links
are not in the ACTIVE state, ibdiagpath reports an error.
ibdiagpath is located at: /opt/ibutils/bin.
For further information, please refer to the tool’s man page.
ibdump Dumps InfiniBand traffic that flows to and from Mellanox Technologies
ConnectX® family adapters' InfiniBand ports. The dump file can be loaded by
the Wireshark tool for graphical traffic analysis.
The following describes a work flow for local HCA (adapter) sniffing:
1. Run ibdump with the desired options
2. Run the application whose traffic you wish to analyze
3. Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode)
4. Open Wireshark and load the generated file
To download Wireshark for a Linux or Windows environment go to www.wire-
shark.org.
Note: Although ibdump is a Linux application, the generated .pcap file may
be analyzed on either operating system.
In order for ibdump to function with RoCE, Flow Steering must be enabled.
To do so:
1. Add the following to /etc/modprobe.d/mlnx.conf file:
options mlx4_core log_num_mgm_entry_size=-1
2. Restart the drivers.
For further information, please refer to the tool’s man page.
iblinkinfo Reports link info for each port in an InfiniBand fabric, node by node. Option-
ally, iblinkinfo can do partial scans and limit its output to parts of a fabric.
For further information, please refer to the tool’s man page.
ibnetdiscover Performs InfiniBand subnet discovery and outputs a human readable topology
file. GUIDs, node types, and port numbers are displayed as well as port LIDs
and node descriptions. All nodes (and links) are displayed (full topology).
This utility can also be used to list the current connected nodes. The output is
printed to the standard output unless a topology file is specified.
For further information, please refer to the tool’s man page.
ibnetsplit Automatically groups hosts and creates scripts that can be run in order to split
the network into sub-networks containing one group of hosts.
For further information, please refer to the tool’s man page.
ibnodes Uses the current InfiniBand subnet topology or an already saved topology file
and extracts the InfiniBand nodes (CAs and switches).
For further information, please refer to the tool’s man page.
ibping Uses vendor mads to validate connectivity between InfiniBand nodes. On
exit, (IP) ping-like output is shown. ibping is run as client/server. The default is
to run as client. Note also that a default ping server is implemented within the
kernel.
For further information, please refer to the tool’s man page.
ibportstate Enables querying the logical (link) and physical port states of an InfiniBand
port. It also allows adjusting the link speed that is enabled on any InfiniBand
port.
If the queried port is a switch port, then ibportstate can be used to:
• disable, enable or reset the port
• validate the port’s link width and speed against the peer port
In case of multiple channel adapters (CAs) or multiple ports without a CA/
port being specified, a port is chosen by the utility according to the following
criteria:
• The first ACTIVE port that is found.
• If not found, the first port that is UP (physical link state is LinkUp).
For further information, please refer to the tool’s man page.
ibqueryerrors The default behavior is to report the port error counters which exceed a
threshold for each port in the fabric. The default threshold is zero (0). Error
fields can also be suppressed entirely.
In addition to reporting errors on every port, ibqueryerrors can report the port
transmit and receive data as well as report full link information to the remote
port if available.
For further information, please refer to the tool’s man page.
ibroute Uses SMPs to display the forwarding tables—unicast (LinearForwardingTable
or LFT) or multicast (MulticastForwardingTable or MFT)—for the
specified switch LID and the optional lid (mlid) range. The default range is all
valid entries in the range 1 to FDBTop.
For further information, please refer to the tool’s man page.
ibstat ibstat is a binary which displays basic information obtained from the local IB
driver. Output includes LID, SMLID, port state, link width active, and port
physical state.
For further information, please refer to the tool’s man page.
ibstatus Displays basic information obtained from the local InfiniBand driver. Output
includes LID, SMLID, port state, port physical state, port width and port rate.
For further information, please refer to the tool’s man page.
ibswitches Traces the InfiniBand subnet topology or uses an already saved topology file
to extract the InfiniBand switches.
For further information, please refer to the tool’s man page.
ibsysstat Uses vendor mads to validate connectivity between InfiniBand nodes and
obtain other information about the InfiniBand node. ibsysstat is run as client/
server. The default is to run as client.
For further information, please refer to the tool’s man page.
ibtopodiff Compares a topology file and a discovered listing of subnet.lst/ibdiagnet.lst
and reports mismatches.
Two different algorithms are provided:
• Using the -e option is more suitable for many mismatches; it applies fewer
heuristics and provides details about the match
• Providing the -s, -p and -g options starts detailed heuristics that should be used
when only a small number of changes is expected
For further information, please refer to the tool’s man page.
ibtracert Uses SMPs to trace the path from a source GID/LID to a destination GID/
LID. Each hop along the path is displayed until the destination is reached or a
hop does not respond. By using the -m option, multicast path tracing can be
performed between source and destination nodes.
For further information, please refer to the tool’s man page.
ibv_asyncwatch Display asynchronous events forwarded to userspace for an InfiniBand
device.
For further information, please refer to the tool’s man page.
ibv_devices Lists InfiniBand devices available for use from userspace, including node
GUIDs.
For further information, please refer to the tool’s man page.
ibv_devinfo Queries InfiniBand devices and prints information about them that is
available from userspace.
For further information, please refer to the tool’s man page.
mstflint Queries and burns a binary firmware-image file on non-volatile (Flash) mem-
ories of Mellanox InfiniBand and Ethernet network adapters. The tool
requires root privileges for Flash access.
To run mstflint, you must know the device location on the PCI bus.
Note: If you purchased a standard Mellanox Technologies network adapter
card, please download the firmware image from www.mellanox.com > Sup-
port > Firmware Download. If you purchased a non-standard card from a
vendor other than Mellanox Technologies, please contact your vendor.
For further information, please refer to the tool’s man page.
perfquery Queries InfiniBand ports’ performance and error counters. Optionally, it dis-
plays aggregated counters for all ports of a node. It can also reset counters
after reading them or simply reset them.
For further information, please refer to the tool’s man page.
saquery Issues the selected SA query. Node records are queried by default.
For further information, please refer to the tool’s man page.
sminfo Issues and dumps the output of an sminfo query in human readable format.
The target SM is the one listed in the local port info or the SM specified by the
optional SM LID or by the SM direct routed path.
Note: Using sminfo for any purpose other than a simple query might result in
a malfunction of the target SM.
For further information, please refer to the tool’s man page.
smparquery Sends SMP query for adaptive routing and private LFT features.
For further information, please refer to the tool’s man page.
smpdump A general purpose SMP utility which gets SM attributes from a specified
SMA. The result is dumped in hex by default.
For further information, please refer to the tool’s man page.
smpquery Provides a basic subset of standard SMP queries to query Subnet management
attributes such as node info, node description, switch info, and port info.
For further information, please refer to the tool’s man page.
• Bandwidth of links can be reduced if cable performance degrades and LLR retransmis-
sions become too numerous. Traditional IB bandwidth performance utilities can be used
to monitor any bandwidth impact.
Due to these factors, an LLR retransmission rate counter has been added to the ibdiagnet utility
that can give end users an indication of the link health.
To monitor LLR retransmission rate:
1. Run ibdiagnet; no special flags are required.
2. If the LLR retransmission rate limit is exceeded, it will be printed to the screen.
3. The default limit is set to 500 and requires further investigation if exceeded.
4. The LLR retransmission rate is reflected in the results file /var/tmp/ibdiagnet2/ibdiagnet2.pm.
The default value of 500 retransmissions/sec has been determined by Mellanox based on
extensive simulations and testing. Links exhibiting a lower LLR retransmission rate should not
raise special concern.
Utility Description
ib_atomic_bw Calculates the BW of RDMA Atomic transactions between a pair of
machines. One acts as a server and the other as a client. The client sends
RDMA atomic operations to the server and calculates the BW by sampling
the CPU each time it receives a successful completion. The test supports
features such as bidirectional operation, in which both sides send RDMA
atomics to each other at the same time, change of MTU size, tx size, number
of iterations, message size and more. Using the "-a" flag provides results
for all message sizes.
For further information, please refer to the tool’s man page.
ib_atomic_lat Calculates the latency of an RDMA Atomic transaction of message_size
between a pair of machines. One acts as a server and the other as a client.
The client sends an RDMA atomic operation and samples the CPU clock
when it receives a successful completion, in order to calculate latency.
For further information, please refer to the tool’s man page.
ib_read_bw Calculates the BW of RDMA read between a pair of machines. One acts as
a server and the other as a client. The client RDMA reads the server's mem-
ory and calculates the BW by sampling the CPU each time it receives a suc-
cessful completion. The test supports features such as bidirectional
operation, in which both sides RDMA read from each other's memory at
the same time, change of MTU size, tx size, number of iterations, message
size and more. Read is available only in RC connection mode (as specified
in the IB spec).
For further information, please refer to the tool’s man page.
ib_read_lat Calculates the latency of an RDMA read operation of message_size between
a pair of machines. One acts as a server and the other as a client. They per-
form a ping-pong benchmark in which one side RDMA reads the memory
of the other side only after the other side has read its memory. Each side
samples the CPU clock each time it reads the other side's memory, in order
to calculate latency. Read is available only in RC connection mode (as
specified in the IB spec).
For further information, please refer to the tool’s man page.
ib_send_bw Calculates the BW of SEND between a pair of machines. One acts as a
server and the other as a client. The server receives packets from the client
and they both calculate the throughput of the operation. The test supports
features such as bidirectional operation, in which both sides send and
receive at the same time, change of MTU size, tx size, number of iterations,
message size and more. Using the "-a" flag provides results for all message
sizes.
For further information, please refer to the tool’s man page.
ib_send_lat Calculates the latency of sending a packet in message_size between a pair
of machines. One acts as a server and the other as a client. They perform a
ping pong benchmark on which you send packet only if you receive one.
Each of the sides samples the CPU each time they receive a packet in order
to calculate the latency. Using the "-a" provides results for all message
sizes.
For further information, please refer to the tool’s man page.
ib_write_bw Calculates the BW of RDMA write between a pair of machines. One acts as
a server and the other as a client. The client RDMA writes to the server
memory and calculate the BW by sampling the CPU each time it receive a
successful completion. The test supports features such as Bidirectional, in
which they both RDMA write to each other at the same time, change of mtu
size, tx size, number of iteration, message size and more. Using the "-a" flag
provides results for all message sizes.
For further information, please refer to the tool’s man page.
ib_write_lat Calculates the latency of RDMA write operation of message_size between a
pair of machines. One acts as a server and the other as a client. They per-
form a ping pong benchmark on which one side RDMA writes to the other
side memory only after the other side wrote on his memory. Each of the
sides samples the CPU clock each time they write to the other side memory,
in order to calculate latency.
For further information, please refer to the tool’s man page.
raw_ethernet_bw Calculates the BW of SEND between a pair of machines. One acts as a
server and the other as a client. The server receive packets from the client
and they both calculate the throughput of the operation. The test supports
features such as Bidirectional, on which they both send and receive at the
same time, change of mtu size, tx size, number of iteration, message size
and more. Using the "-a" provides results for all message sizes.
For further information, please refer to the tool’s man page.
Utility Description
raw_ethernet_lat Calculates the latency of sending a packet in message_size between a pair
of machines. One acts as a server and the other as a client. They perform a
ping pong benchmark on which you send packet only if you receive one.
Each of the sides samples the CPU each time they receive a packet in order
to calculate the latency. Using the "-a" provides results for all message
sizes.
For further information, please refer to the tool’s man page.
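All of these utilities follow the same server/client pattern. As a brief illustration using ib_write_bw (the IP address is a placeholder for the server's address):
    # On the server side, start the test and wait for a connection:
    ib_write_bw -a
    # On the client side, connect to the server; "-a" runs all message sizes:
    ib_write_bw -a 192.168.1.10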
5 Troubleshooting
You may be able to easily resolve the issues described in this section. If a problem persists and you are unable to resolve it yourself, please contact your Mellanox representative or Mellanox Support at [email protected].
Each entry below lists an issue together with its probable cause and a suggested solution.
Issue: The system panics when it is booted with a failed adapter installed.
Cause: Malfunctioning hardware component.
Solution:
1. Remove the failed adapter.
2. Reboot the system.
Issue: Mellanox adapter is not identified as a PCI device.
Cause: PCI slot or adapter PCI connector dysfunctionality.
Solution:
1. Run lspci.
2. Reseat the adapter in its PCI slot or insert the adapter into a different PCI slot. If the PCI slot is confirmed to be functional, the adapter should be replaced.
Issue: Mellanox adapters are not installed in the system.
Cause: Misidentification of the installed Mellanox adapter.
Solution: Run 'lspci | grep Mellanox' or 'lspci -d 15b3:' and check the MAC address to identify the installed Mellanox adapter. Mellanox MACs start with 00:02:C9:xx:xx:xx, 00:25:8B:xx:xx:xx or F4:52:14:xx:xx:xx.
Issue: Degraded performance is measured in a mixed-rate environment (10GbE, 40GbE and 56GbE).
Cause: Sending traffic from a node with a higher rate to a node with a lower rate.
Solution: Enable Flow Control on both the switch ports and the nodes:
• On the server side, run: ethtool -A <interface> rx on tx on
• On the switch side, run the following commands on the relevant interface: send on force and receive on force
Issue: No link with break-out cable.
Cause: Misuse of the break-out cable or misconfiguration of the switch's split ports.
Solution:
• Use supported ports on the switch with the proper configuration. For further information, please refer to the MLNX_OS User Manual.
• Make sure the QSFP break-out cable side is connected to the SwitchX.
Issue: The following message is logged after loading the driver: multicast join failed with status -22
Cause: Trying to join a multicast group that does not exist, or exceeding the number of multicast groups supported by the SM.
Solution: If this message is logged often, check the multicast group's join requirements, as the node might not meet them.
Note: If this message is logged after driver load, it may safely be ignored.
Issue: Unable to stop the driver; the following message appears on screen: ERROR: Module <module> is in use
Cause: An external application is using the reported module.
Solution: Manually unload the module using the 'modprobe -r' command, as illustrated below.
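For example, checking the module's use count before unloading it (mlx4_ib is used here only as an illustration):
    lsmod | grep mlx4_ib     # the "Used by" count must drop to 0 first
    modprobe -r mlx4_ib      # then unload the module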
Issue: Logical link fails to come up while the port logical state is Initializing.
Cause: The logical port state remains Initializing while waiting for the SM to assign a LID.
Solution:
1. Verify that an SM is running in the fabric. Run 'sminfo' from any host connected to the fabric.
2. If no SM is running, activate an SM on a node or on a managed switch (see the example below).
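For example (opensmd is the OpenSM init script; shown here as an illustration only):
    sminfo                        # query the master SM
    /etc/init.d/opensmd start     # start OpenSM locally if no SM responds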
Issue: Physical link fails to negotiate to the maximum supported rate.
Cause: The adapter is running outdated firmware.
Solution: Install the latest firmware on the adapter.
Issue: Physical link fails to come up while the port physical state is Polling.
Cause: The cable is not connected to the port, or the port on the other end of the cable is disabled.
Solution:
• Ensure that the cable is connected on both ends, or use a known working cable.
• Check the status of the connected port using the ibportstate command (see the example below) and enable it if necessary.
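For example (the LID and port number are placeholders):
    ibportstate 3 1          # query the state of port 1 on the node with LID 3
    ibportstate 3 1 enable   # enable the port if it is disabled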
Issue: Physical link fails to come up while the port physical state is Disabled.
Cause: The port was manually disabled.
Solution: Restart the driver: /etc/init.d/openibd restart
Issue: InfiniBand utility commands fail to find devices on the system. For example, the 'ibv_devinfo' command fails with the following output: Failed to get IB devices list: Function not implemented
Cause: The InfiniBand utility commands are invoked when the driver is not loaded.
Solution: Load the driver: /etc/init.d/openibd start
Issue: Driver installation fails.
Cause: The install script may fail for the following reasons:
• An unsupported installation option was used.
• Uninstalling the previous installation failed due to dependencies being in use.
• The operating system is not supported.
• The kernel is not supported. You can run mlnx_add_kernel_support.sh in order to generate a MLNX_OFED package with drivers for the kernel.
• Required packages for installing the driver are missing.
• Kernel backport support for the non-supported kernel is missing.
Solution:
• Use only supported installation options. The full list of installation options can be displayed on screen by using: mlnxofedinstall --h
• Manually uninstall the previously installed RPMs using 'rpm -e' if the preliminary uninstallation failed due to dependencies being in use.
• Use a supported operating system and kernel.
• Manually install the missing packages listed on screen by the installation script if the installation failed due to missing prerequisites.
Issue: After driver installation, the openibd service fails to start. The driver logs this message: Unknown symbol
Cause: The driver was installed on top of an existing in-box driver.
Solution:
1. Uninstall the MLNX_OFED driver: ofed_uninstall.sh
2. Reboot the server.
3. Search for any remaining installed driver modules:
   locate mlx4_ib.ko
   locate mlx4_en.ko
   locate mlx4_core
   If found, move them from their current directory to the /tmp directory.
4. Re-install the MLNX_OFED driver.
5. Restart the openibd service.
Issue: The driver works but the transmit and/or receive data rates are not optimal.
Solution: These recommendations may assist in gaining an immediate improvement:
1. Confirm that the negotiated PCI link uses its maximum capability (see the example below).
2. Stop the IRQ Balancer service: /etc/init.d/irq_balancer stop
3. Start the mlnx_affinity service: mlnx_affinity start
For best performance practices, please refer to the "Performance Tuning Guide for Mellanox Network Adapters" (www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers).
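For example, the negotiated PCI link can be compared against the adapter's capability (the bus address is a placeholder):
    # LnkSta shows the negotiated speed/width; LnkCap shows the maximum.
    lspci -s 0000:xx:xx.0 -vv | grep -E "LnkCap|LnkSta"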
Issue: Failed to enable SR-IOV. The following message is reported in dmesg: mlx4_core 0000:xx:xx.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -22)
Cause: The number of VFs configured in the driver is higher than the number configured in the firmware.
Solution:
1. Check the firmware SR-IOV configuration by running the mlxconfig tool (see the example below).
2. Set the same number of VFs for the driver.
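For example (the MST device path and the VF count are illustrative placeholders):
    mlxconfig -d /dev/mst/mt4099_pci_cr0 query                        # show current firmware settings
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set SRIOV_EN=1 NUM_OF_VFS=8  # match the driver's VF count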
Issue: Failed to enable SR-IOV. The following message is reported in dmesg: mlx4_core 0000:xx:xx.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -12)
Cause: SR-IOV is disabled in the BIOS.
Solution: Check that SR-IOV is enabled in the BIOS (see Section 3.4.1.2, “Setting Up SR-IOV”, on page 171).
Issue: When assigning a VF to a VM, the following message is reported on the screen: "PCI-assgine: error: requires KVM support"
Cause: SR-IOV and virtualization are not enabled in the BIOS.
Solution:
1. Verify that they are both enabled in the BIOS.
2. Add the following kernel parameter to the GRUB configuration file: "intel_iommu=on" (see Section 3.4.1.2, “Setting Up SR-IOV”, on page 171, and the example below).
Issue: Mellanox adapter is not identified as a boot device.
Cause: The expansion ROM image is not installed on the adapter, or the server's BIOS is not configured to work in Legacy mode.
Solution:
1. Run a flint query to display the expansion ROM information. For example, run "flint -d /dev/mst/mt4099_pci_cr0 q" and look for the "Rom info:" line. For further information on how to burn the ROM, please refer to the MFT User Manual.
2. Make sure the BIOS is configured to work in Legacy mode if the adapter's firmware does not include a UEFI image.
Issue: InfiniBand performance tests, such as 'ib_write_bw', fail between systems with different driver releases.
Cause: The test was run between two systems in the fabric that have different perftest packages installed.
Solution: Run the test using the same perftest RPM on both systems (see the example below).
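For example, the installed perftest version can be compared on both hosts:
    rpm -q perftest    # run on each system; the versions should match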