Kernel in The Way: Bypass and Offload Technologies
End User Summit 2012, New York
Christoph Lameter
Posix Sockets API
[Diagram: the conventional transmit path. The application calls the POSIX sockets API; system calls force a context switch into the kernel; the data stream is copied into kernel socket structures and buffers; the driver queues the packet on the device TX ring via dev_queue_xmit(skb); the device maps the buffers.]
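For reference, a minimal user-space sketch of the conventional path in the diagram (the destination address and port are arbitrary examples): every send() is a system call, and the kernel copies the data out of the user buffer before dev_queue_xmit() ever sees it.

    /* Conventional POSIX socket send: each send() causes a context
     * switch, and the kernel copies the user buffer into kernel
     * socket buffers before the driver queues the packet. */
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int send_via_kernel(const char *ip, const void *buf, size_t len)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7000);            /* arbitrary example port */
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            send(fd, buf, len, 0) < 0) {        /* copy into kernel buffers */
            close(fd);
            return -1;
        }
        return close(fd);
    }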
Kernel Bypass
[Diagram: the bypass path. The application links against the verbs library and uses the verbs API; the kernel IB driver is entered only through control-path system calls (context switch for device control); metadata structures and the TX ring are driven directly from user space; the device maps user-space buffers: bypass.]
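To make the control/data split concrete, a heavily abridged sketch of the verbs data path. It assumes a protection domain pd, a connected queue pair qp and a completion queue cq have already been created through the usual control-path calls, which do go through the kernel:

    /* Verbs data path: register the buffer once (control path), then
     * post sends and poll for completions entirely from user space.
     * Assumes pd, qp and cq were set up via the normal ibv_* calls. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int post_and_wait(struct ibv_pd *pd, struct ibv_qp *qp,
                      struct ibv_cq *cq, void *buf, uint32_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey
        };
        struct ibv_send_wr wr = {
            .wr_id = 1, .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED
        };
        struct ibv_send_wr *bad_wr;
        struct ibv_wc wc;
        int n;

        if (ibv_post_send(qp, &wr, &bad_wr)) {  /* fast path, typically no
                                                   system call */
            ibv_dereg_mr(mr);
            return -1;
        }
        do {                                    /* busy-poll for completion */
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        ibv_dereg_mr(mr);
        return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }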
Rsockets by Sean Hefty (Intel)
● Presented at the Open Fabrics Developer meeting at the end
of March.
● Socket Emulation layer in user space.
● LD_PRELOAD library to override socket system calls (sketched below)
● Performance comparison shows that kernel processing is
detrimental to performance. Bypass is essential.
● IPOIB = Kernel Sockets layer using IP emulation on
Infiniband.
● SDP = Kernel Sockets layer using Infiniband native
connection.
● IB = Native Infiniband connection. User space → User Space
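A hedged sketch of what using rsockets looks like: librdmacm exposes rsocket()/rconnect()/rsend() calls that mirror the BSD socket API, so an application can either code against them directly, as below, or run unmodified under the LD_PRELOAD interposer.

    /* rsockets mirrors the BSD socket calls: rsocket/rconnect/rsend/
     * rclose stand in for socket/connect/send/close, with the data
     * path going over RDMA instead of the kernel network stack. */
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <rdma/rsocket.h>

    int rsockets_send(const char *ip, const void *buf, size_t len)
    {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7000);            /* arbitrary example port */
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            rsend(fd, buf, len, 0) < 0) {
            rclose(fd);
            return -1;
        }
        return rclose(fd);
    }

Unmodified socket applications can get the same effect with the preload library that ships with librdmacm (commonly installed as librspreload.so; the exact name and path depend on the distribution).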
[Chart: Bandwidth (Gbps) versus message size, 64 bytes to 1 MB, for IPoIB, SDP, RSOCKET and IB.]
[Chart: 64-byte ping-pong latency (µs) for IPoIB, SDP, RSOCKET and IB.]
State of the Art NIC
characteristics
● 56 Gigabits per second (Unix network stack
was designed for 10Mbits)
● 4-7 Gigabytes per second (Unix: 1 MB/s)
● >8 million packets per second (Unix: ~1000 packets per second).
● Less than a microsecond per packet available for processing (see the budget below).
● Windows 8 will fully support these speeds
through offload/bypass APIs.
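For a sense of scale (a back-of-the-envelope figure, not from the slides), the per-packet budget at that rate is

    \[ \frac{1\,\mathrm{s}}{8 \times 10^{6}\ \mathrm{packets}} = 125\,\mathrm{ns\ per\ packet}, \]

which is on the order of one or two DRAM accesses and far below the cost of taking a system call for every packet.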
Bypass is
● Not using a kernel subsystem although it is
there and could provide similar services.
● Implementing what a kernel subsystem does.
● Direct access and control of memory mapped
devices from user space.
● Some people call bypass "zero copy" since it avoids the kernel's copy pass.
Offload is
● Replacement of what could be done in
software with dedicated hardware.
● Overlaps with bypass because direct device interaction replaces software action in the kernel with the actions of a hardware device.
● Typical case of hardware offload: DMA
engines, GPUs, Rendering screens,
cryptography, TCP (TOE), FPGAs.
Why bypass the kernel?
● Kernel is too slow and inefficient at high packet rates.
Problems already begin at 10G.
● Contemporary devices can map user-space memory and transfer data directly to and from user space.
● Kernel must copy data between kernel buffers and
userspace.
● Kernel is continually regressing in terms of the
overhead of basic system calls and operations. Only
new hardware compensates.
10G woes
● Packet rate too high to receive all packets on one
processor.
● All sorts of optimizations: interrupt mitigation, multiqueue support, receive flow steering, etc.
● Compromises that do not allow full use of the capabilities
of the hardware.
● 10G technology goes beyond the boundary of what
socket based technology can cleanly support.
● 40G and higher speed technologies are available and
introduce different APIs that allow effective operations.
CPU problems
● The processing capacity of an individual hardware
thread is mostly stagnating. No major improvements
in upcoming processor generations.
● I/O bandwidth, memory capacity and bandwidth are
growing fast.
● Multithreaded handling of I/O requires synchronization, which creates additional overhead that in turn limits the processing capability.
Bypass technologies
● Kernel: Zero copy (e.g. using sendfile(); sketched below).
● Kernel: Sockets Direct Protocol (SDP)
● Kernel: RDMA or “Verbs”
● Linux: u/kDAPL
● Solarflare: OpenOnload
● Myricom: DBL
● Numerous user-space implementations, mostly based on the kernel RDMA APIs.
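For the first item in the list, a minimal sketch of in-kernel zero copy with sendfile(2), which moves file data to a socket without staging it in a user-space buffer:

    /* In-kernel zero copy: sendfile() transfers file data to the
     * socket without copying it through a user-space buffer. */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int send_file(int sock_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock_fd, fd, &off, st.st_size - off);
            if (n <= 0)
                break;                  /* error or unexpected end */
        }
        close(fd);
        return off == st.st_size ? 0 : -1;
    }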
Expanding role of Kernel RDMA
Verbs API
✗ Initially only Infiniband packets supported
✗ Chelsio added iWarp via verbs API to get around TOE
restrictions in the Linux network stack. Recently added
UDP support via verbs.
✗ Mellanox added ROCEE to allow Infiniband RDMA type
traffic on Ethernet networks.
✗ Mellanox added RAW_ETH_QP to allow another form of
kernel bypass for Ethernet devices.
✗ Manufacturers come up with new devices for RDMA
verbs.
Trouble with offload APIs
● Complexity
● Manage RX / TX rings and timing issues in
user space.
● Difficult to code for.
● 4k page size is still an issue.
● No unified API.
● Proprietary vendor solutions.
“Stateless Offload”
● Artificial name created to refer to the ability of the network device to transfer a limited number of packets in a single system call. Mostly used for TCP.
● Does reduce kernel overhead but does not copy directly into user space.
● No use of RDMA.
● For the purposes of this talk this is not offload as understood here; by offload we mean bypassing the kernel in critical paths. This only batches packets (see the sketch below).
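As one concrete example of this kind of batching (an illustration, not necessarily the mechanism the slide has in mind), recvmmsg(2) receives several datagrams per system call; the call overhead is amortized, but every packet is still copied out of kernel buffers:

    /* Batching without bypass: recvmmsg() pulls up to BATCH datagrams
     * per system call, but each one is still copied from kernel
     * buffers into the user-space iovecs. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 32
    #define PKT_SIZE 2048

    int recv_batch(int sock_fd)
    {
        static char bufs[BATCH][PKT_SIZE];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len = PKT_SIZE;
            msgs[i].msg_hdr.msg_iov = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* One context switch for up to BATCH packets. */
        return recvmmsg(sock_fd, msgs, BATCH, 0, NULL);
    }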
Bypass API characteristics
● Direct operation with user space memory
bypassing kernel buffers.
● Kernel manages the connection but the
involvement in actual read and write
operations is minimal.
● I/O is asynchronous.
● Problem of notification about newly available packets or completed writes (see the completion-event sketch below).
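One way the notification problem is handled in the verbs world, as a minimal sketch (it assumes the completion queue cq was created on the completion channel chan): arm the CQ, block on the channel instead of busy-polling, then drain the completions.

    /* Completion notification instead of busy polling: arm the CQ,
     * block on the completion channel, then drain completed work.
     * Assumes cq was created on the completion channel chan. */
    #include <infiniband/verbs.h>

    int wait_for_completions(struct ibv_comp_channel *chan, struct ibv_cq *cq)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;
        int n;

        if (ibv_req_notify_cq(cq, 0))                 /* arm: next completion
                                                         raises an event */
            return -1;
        if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))  /* blocks until event */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);
        if (ibv_req_notify_cq(ev_cq, 0))              /* re-arm before draining */
            return -1;

        while ((n = ibv_poll_cq(ev_cq, 1, &wc)) > 0)
            if (wc.status != IBV_WC_SUCCESS)
                return -1;
        return n;                                     /* 0 when fully drained */
    }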
Storage
● Similar issues, particularly evident for remote filesystem clients since they have traditionally used socket-based APIs or in-kernel APIs for network connectivity.
● Paper “Sockets vs RDMA Interface over 10-Gigabit Networks”.
● Storage issues are not as severe for local disk access since APIs exist that already do a kind of bypass: direct I/O (O_DIRECT) and mmapped I/O (sketched below). Infrastructure for control is missing though.
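A minimal sketch of the local-disk bypass mentioned above, assuming the filesystem accepts 4096-byte alignment for O_DIRECT (the required granularity varies by device and filesystem):

    /* Direct I/O: O_DIRECT reads bypass the page cache and DMA
     * straight into a suitably aligned user buffer. Both the buffer
     * address and len must honour the device/filesystem alignment;
     * 4096 bytes is assumed here. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    ssize_t read_direct(const char *path, size_t len)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, len))
            return -1;

        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) {
            free(buf);
            return -1;
        }

        ssize_t n = read(fd, buf, len);     /* DMA into user memory */
        close(fd);
        free(buf);
        return n;
    }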
Storage Challenges
● High-end SSDs max out PCI-E speeds
● >1 million IOPS
● 6.5 GBytes/sec
● Linux uses 4K pages => the kernel has to manage (repeatedly updating!) >1 million page descriptors per second (see the arithmetic below).
● The design was for 500 IOPS and tens or maybe hundreds of MBytes per second.
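The page-descriptor figure follows directly from the numbers above (a rough calculation assuming the data streams through distinct 4 KiB pages):

    \[ \frac{6.5\ \mathrm{GB/s}}{4\ \mathrm{KiB\ per\ page}} \approx 1.6 \times 10^{6}\ \mathrm{pages\ per\ second}. \]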
PCI-E “Bypass”
● I/O bus hardware is insufficient for contemporary state-of-the-art devices.
● 40Gb networks cannot be handled by current Westmere processors (there is a limit of 32Gb/s for 8 lanes).
● 56Gb/s transfer rates can barely be handled by the upcoming generation (Sandy Bridge).
● I/O devices with much higher transfer rates are on the
horizon.
● PCI-E bypass is being explored by various companies.
● Hardware development is not keeping up
Linux Memory Management
● 4K page size. The kernel handles memory in 4K chunks (yes, 2M huge pages are available but they have serious limitations; a huge-page sketch follows this list).
● A standard 64G server has 16 million page structs to manage. Memory reclaim can become adventurous.
● With terabyte servers on the horizon we are looking at billions of 4K chunks to manage.
● If I just want to transfer a 4G file from disk, the hardware and the memory management system need to handle 1 million 4K pages.
● 4G can be transferred in ~1 second on state of the art
networks (but the kernel overhead will ensure that such a
transfer takes much longer).
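A minimal sketch of the huge-page alternative mentioned above, assuming huge pages have been reserved by the administrator beforehand (e.g. via /proc/sys/vm/nr_hugepages): backing a 4G buffer with 2M pages drops the descriptor count from roughly a million to 2048.

    /* Huge-page backing: a 4 GiB buffer needs 2^20 descriptors with
     * 4 KiB pages but only 2048 with 2 MiB huge pages. Assumes huge
     * pages have already been reserved on the system. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define GiB (1024UL * 1024 * 1024)

    void *alloc_huge_buffer(void)
    {
        void *buf = mmap(NULL, 4 * GiB,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                         -1, 0);
        return buf == MAP_FAILED ? NULL : buf;
    }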
Storage system bypass by
Network Filesystems
● Network filesystems intercept I/O requests in
user space and forward to the server
bypassing the OS on the client.
● Lustre, Gluster, GPFS
● Special protocols for block offload (SRP,iSER)
● Network performance is much higher than
local disk performance(!).
● Situation may change with new SSD/Flash
technologies. But that is very pricey.
Approaches to fix things
● Memory management needs to handle larger chunks of
memory (2M, 1G page sizes). Large physically
contiguous chunks of memory need to be managed.
● A processor should not touch the data transferred (zero
copy requirement). This implies that POSIX style socket
I/O cannot be used. There is a need for new APIs.
● I/O subsystems need to be able to handle large chunks of
memory and avoid creating large scatter gather lists.
Devices struggle to support those.
A New Standard RDMA system
call API
● Socket based
● Reuse as much as possible of the POSIX
functionality.
● Buffer management in user space
● Raw hardware interface elements
● There are a couple of academic projects in this area but none of them will be viable without kernel support. (A purely illustrative sketch of what such an API might look like follows.)
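Purely as illustration of the bullet points above, a hypothetical header sketch; every name here is invented for this summary and does not correspond to any existing or proposed kernel interface:

    /* Hypothetical, invented names only: a socket-style RDMA system
     * call API as described on this slide. Not an existing or
     * proposed kernel interface. */
    #include <sys/socket.h>
    #include <sys/types.h>

    /* A user-space buffer region registered with the kernel/device,
     * so buffer management stays in user space. */
    struct rio_region {
        void   *addr;
        size_t  len;
        int     region_id;      /* handle filled in by rio_register() */
    };

    /* Control path: POSIX-like setup, kernel involved. */
    int rio_socket(int domain, int type, int protocol);
    int rio_register(int fd, struct rio_region *region);

    /* Data path: asynchronous, zero copy, minimal kernel involvement. */
    int rio_send(int fd, const struct rio_region *region,
                 off_t offset, size_t len, unsigned long cookie);
    int rio_poll_completion(int fd, unsigned long *cookie, int *status);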
Conclusion Q&A
● We are heading for a need to offload key components of the OS.
● OS needs to manage larger chunks of
memory effectively.
● A need to replace/augment the POSIX APIs?
● RDMA API standardization and generalization
is necessary.
Coming in 2013
● 100 Gbit/sec networking
● >100 GB/sec SSD / Flash devices
● More cores in Intel processors.
● GPUs already support thousands of hardware
threads. Newer models will offer more.