
Kernel in the Way

Bypass and Offload Technologies

End User Summit 2012


New York
Christoph Lameter
<[email protected]>
Introduction
● Example of socket-based send vs. bypass send
● Why?
● How?
● What is bypass and offload
● Issues in storage
● Potential solutions
Sending a message via the sockets API

[Diagram: send path Application → C Library → Kernel → Driver → Device. A
context switch occurs at the system call boundary between the POSIX API and
the sockets API. Data flows from the application's data stream into the
socket structure and socket buffers, then into the TX ring; the device maps
the kernel buffers.]
fwrite(data, size, 1, FILE);

write(fd, data, size);

dev_queue_xmit(skb);
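
For reference, a minimal sketch of this path in C (hypothetical host and
port, error handling omitted): every write() crosses into the kernel, which
copies the data into socket buffers before it reaches the TX ring.

    /* Minimal sketch of a socket-based send. Host and port are
       placeholders; error handling omitted. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int send_message(const char *data, size_t size)
    {
        struct sockaddr_in addr = { 0 };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        addr.sin_family = AF_INET;
        addr.sin_port = htons(7777);                      /* placeholder port */
        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* placeholder host */

        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        write(fd, data, size);   /* data is copied user space -> kernel */
        close(fd);
        return 0;
    }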
Kernel Bypass

[Diagram: bypass path. The application talks to a verbs library and an IB
driver library in user space. Control operations (connection setup) cross
the context switch via verbs system calls into the kernel driver; data-path
operations go directly from the application to the device. Bypass: the
device maps user space buffers, so metadata structures and the TX ring are
driven from user space.]
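
To make the contrast concrete, a rough sketch of a verbs data-path send
(assumes a protection domain and a connected queue pair were already set up
via the kernel-mediated control path; an illustration, not code from the talk):

    /* Sketch: posting a send on an already-connected queue pair.
       ibv_reg_mr() pins the buffer and makes it visible to the device;
       ibv_post_send() places the work request on the TX ring from user
       space -- no system call on the fast path. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int bypass_send(struct ibv_pd *pd, struct ibv_qp *qp, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = { 0 }, *bad_wr;

        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;

        return ibv_post_send(qp, &wr, &bad_wr);
    }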
Rsockets by Sean Hefty (Intel)
● Presented at the Open Fabrics Developer meeting at the end
of March.
● Socket Emulation layer in user space.
● LD_PRELOAD library to override socket system calls
● Performance comparisons show that kernel processing is
detrimental; bypass is essential.
● IPoIB = kernel sockets layer using IP emulation over
Infiniband.
● SDP = kernel sockets layer using a native Infiniband
connection.
● IB = native Infiniband connection, user space → user space.
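● Usage is transparent: an unmodified binary can be run as
LD_PRELOAD=librspreload.so ./app (library name as shipped with
librdmacm; the path on a given system may vary).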
[Chart: bandwidth (Gbps) vs. message size from 64 bytes to 1 MB, comparing
IPoIB, SDP, RSOCKET and native IB.]

[Chart: 64-byte ping-pong latency (µs), comparing IPoIB, SDP, RSOCKET and
native IB.]
State of the Art NIC
characteristics
● 56 Gigabits per second (Unix network stack
was designed for 10Mbits)
● 4-7 Gigabytes per second (Unix: 1 MB/s)
● >8 million packets per second (Unix: ~1000
packets per second).
● Less than a microsecond per packet for
processing.
● Windows 8 will fully support these speeds
through offload/bypass APIs.
Bypass is
● Not using a kernel subsystem although it is
there and could provide similar services.
● Implementing what a kernel subsystem does.
● Direct access and control of memory mapped
devices from user space.
● Some people call bypass “zero copy” since it avoids the
kernel's copy pass.
Offload is
● Replacement of what could be done in
software with dedicated hardware.
● Overlaps with bypass because direct device interaction
replaces software actions in the kernel with actions of the
hardware device.
● Typical cases of hardware offload: DMA engines, GPUs,
screen rendering, cryptography, TCP (TOE), FPGAs.
Why bypass the kernel?
● Kernel is too slow and inefficient at high packet rates.
Problems already begin at 10G.
● Contemporary devices can map user space memory
and transfer data directly to user space.
● Kernel must copy data between kernel buffers and
userspace.
● Kernel is continually regressing in terms of the
overhead of basic system calls and operations. Only
new hardware compensates.
10G woes
● Packet rate too high to receive all packets on one
processor.
● All sorts of optimizations: interrupt mitigation, multiqueue
support, receive flow steering, etc.
● Compromises that do not allow full use of the capabilities
of the hardware.
● 10G technology goes beyond the boundary of what
socket based technology can cleanly support.
● 40G and higher speed technologies are available and
introduce different APIs that allow effective operations.
CPU problems
● The processing capacity of an individual hardware
thread is mostly stagnating. No major improvements
in upcoming processor generations.
● I/O bandwidth, memory capacity and bandwidth are
growing fast.
● Multithreaded handling of I/O requires synchronization,
which creates additional overhead and in turn limits
processing capacity.
Bypass technologies
● Kernel: zero copy (e.g. using sendfile(); see the sketch after this list).
● Kernel: Sockets Direct Protocol (SDP)
● Kernel: RDMA or “Verbs”
● Linux: u/kDAPL
● Solarflare: OpenOnload
● Myricom: DBL
● Numerous user space implementations, mostly
based on the kernel RDMA APIs.
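
As an illustration of the in-kernel zero-copy path, a minimal sendfile()
sketch (assumes in_fd is an open regular file and out_fd a connected
socket; error handling omitted):

    /* Sketch: zero-copy transmit with sendfile(2). The kernel moves
       file pages to the socket without copying them through user space. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    ssize_t send_file_zero_copy(int out_fd, int in_fd)
    {
        struct stat st;
        off_t offset = 0;

        fstat(in_fd, &st);
        return sendfile(out_fd, in_fd, &offset, st.st_size);
    }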
Expanding role of Kernel RDMA
Verbs API
● Initially only Infiniband packets supported.
● Chelsio added iWarp via the verbs API to get around TOE
restrictions in the Linux network stack. Recently added
UDP support via verbs.
● Mellanox added ROCEE to allow Infiniband RDMA type
traffic on Ethernet networks.
● Mellanox added RAW_ETH_QP to allow another form of
kernel bypass for Ethernet devices.
● Manufacturers come up with new devices for RDMA
verbs.
Trouble with offload APIs
● Complexity
● Manage RX / TX rings and timing issues in
user space.
● Difficult to code for.
● 4k page size is still an issue.
● No unified API.
● Proprietary vendor solutions.
“Stateless Offload”
● Artificial name referring to the ability of the network
device to transfer a limited number of packets in a single
system call. Mostly used for TCP.
● Does reduce kernel overhead but does not copy
directly into user space.
● No use of RDMA
● For the purposes of this talk this is not offload as
understood here: by offload we mean bypassing the kernel in
critical paths, whereas this only batches packets.
Bypass API characteristics
● Direct operation with user space memory
bypassing kernel buffers.
● Kernel manages the connection but the
involvement in actual read and write
operations is minimal.
● I/O is asynchronous.
● Notifying the application of newly available packets or
completed writes is a problem (see the completion-queue
sketch below).
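
With verbs, for example, completions are typically discovered by polling a
completion queue mapped into user space (a minimal sketch; cq is assumed
to have been created during connection setup):

    /* Sketch: busy-poll a verbs completion queue for one completion.
       No system call is needed; the CQ lives in user space memory. */
    #include <infiniband/verbs.h>

    int wait_for_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n;

        do {
            n = ibv_poll_cq(cq, 1, &wc);   /* returns 0 while queue is empty */
        } while (n == 0);

        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }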
Storage
● Similar issues are particularly evident for remote
filesystem clients, since they have traditionally used
socket-based APIs or in-kernel APIs for network
connectivity.
● Paper “Sockets vs RDMA Interface over 10-
Gigabit Networks”
● Storage issues are not as severe for local disk access,
since APIs that already do a kind of bypass exist: direct
I/O and mmapped I/O (see the sketch below). Infrastructure
for control is missing though.
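
A minimal direct I/O sketch (hypothetical path; O_DIRECT requires aligned
buffers and sizes, assumed to be 4096 bytes here, though the real alignment
is device dependent):

    /* Sketch: read a file with O_DIRECT, bypassing the page cache.
       The device DMAs straight into the user buffer. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int read_direct(const char *path)      /* path is a placeholder */
    {
        void *buf;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
            return -1;
        ssize_t n = read(fd, buf, 4096);   /* no page cache copy */
        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }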
Storage Challenges
● High end SSDs max out PCI-E speeds:
● >1 million IOPS
● 6.5 GBytes/sec
● Linux uses 4K pages => the kernel has to manage
(repeatedly updating!) more than 1 million page descriptors
per second (6.5 GB/s at 4 KB per page is ~1.7 million
pages/sec).
● The design point was 500 IOPS and tens or maybe
hundreds of MBytes per second.
PCI-E “Bypass”
● I/O bus hardware is insufficient for contemporary state of
the art devices.
● 40Gbps networks cannot be handled by current Westmere
processors (there is a limit of 32Gbps for 8 lanes).
● 56Gbps transfer rates can barely be handled by the
upcoming generation (Sandy Bridge).
● I/O devices with much higher transfer rates are on the
horizon.
● PCI-E bypass is being explored by various companies.
● Hardware development is not keeping up.
Linux Memory Management
● 4K page size. The kernel handles memory in 4K chunks (yes,
2M huge pages are available, but they have serious
limitations; a mapping sketch follows below).
● A standard 64G server has 16 million page structs to manage.
Memory reclaim can become adventurous.
● With terabyte servers on the horizon we are looking at
billions of 4K chunks to manage.
● If I just want to transfer a 4G file from disk, the hardware
and the memory management system need to handle 1 million
4K pages.
● 4G can be transferred in ~1 second on state of the art
networks (but the kernel overhead will ensure that such a
transfer takes much longer).
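
One of the existing (limited) mechanisms for larger chunks is an explicit
2M huge page mapping, sketched below; it fails unless the administrator has
reserved huge pages beforehand, which is one of the limitations mentioned
above:

    /* Sketch: map 2M huge pages with mmap(MAP_HUGETLB). Requires
       pre-reserved huge pages (e.g. via /proc/sys/vm/nr_hugepages). */
    #include <sys/mman.h>

    #define SZ_2M (2UL * 1024 * 1024)

    void *alloc_huge(size_t size)
    {
        /* round up to a multiple of the 2M huge page size */
        size = (size + SZ_2M - 1) & ~(SZ_2M - 1);
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }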
Storage system bypass by
Network Filesystems
● Network filesystems intercept I/O requests in user space
and forward them to the server, bypassing the OS on the
client.
● Lustre, Gluster, GPFS
● Special protocols for block offload (SRP, iSER)
● Network performance is much higher than
local disk performance(!).
● Situation may change with new SSD/Flash
technologies. But that is very pricey.
Approaches to fix things
● Memory management needs to handle larger chunks of
memory (2M, 1G page sizes). Large physically
contiguous chunks of memory need to be managed.
● A processor should not touch the data transferred (zero
copy requirement). This implies that POSIX style socket
I/O cannot be used. There is a need for new APIs.
● I/O subsystems need to be able to handle large chunks of
memory and avoid creating large scatter-gather lists.
Devices struggle to support those.
A New Standard RDMA system
call API
● Socket based
● Reuse as much as possible of the POSIX
functionality.
● Buffer management in user space
● Raw hardware interface elements
● There are a couple of academic projects in this area, but
none of them will be viable without kernel support.
Conclusion / Q&A
● We are heading toward needing offload for key
components of the OS.
● OS needs to manage larger chunks of
memory effectively.
● A need to replace/augment the POSIX APIs?
● RDMA API standardization and generalization
is necessary.
Coming in 2013
● 100Gb/sec networking
● >100Gb/sec SSD / Flash devices
● More cores in Intel processors.
● GPUs already support thousands of hardware
threads. Newer models will offer more.
