Kernel in The Way: Bypass and Offload Technologies
End User Summit 2012, New York
Christoph Lameter
Posix Sockets API
[Diagram: the conventional transmit path. The application calls the POSIX sockets API; system calls force a context switch into the kernel; the data stream is copied into kernel socket structures and buffers; the driver queues the packet on the device TX ring via dev_queue_xmit(skb); the device maps the buffers.]
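For reference, a minimal user-space sketch of the conventional path in the diagram (the destination address and port are arbitrary examples): every send() is a system call, and the kernel copies the data out of the user buffer before dev_queue_xmit() ever sees it.

    /* Conventional POSIX socket send: each send() causes a context
     * switch, and the kernel copies the user buffer into kernel
     * socket buffers before the driver queues the packet. */
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int send_via_kernel(const char *ip, const void *buf, size_t len)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7000);            /* arbitrary example port */
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            send(fd, buf, len, 0) < 0) {        /* copy into kernel buffers */
            close(fd);
            return -1;
        }
        return close(fd);
    }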
Kernel Bypass
[Diagram: the bypass path. The application links against the verbs library and uses the verbs API; the kernel IB driver is entered only through control-path system calls (context switch for device control); metadata structures and the TX ring are driven directly from user space; the device maps user-space buffers: bypass.]
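To make the control/data split concrete, a heavily abridged sketch of the verbs data path. It assumes a protection domain pd, a connected queue pair qp and a completion queue cq have already been created through the usual control-path calls, which do go through the kernel:

    /* Verbs data path: register the buffer once (control path), then
     * post sends and poll for completions entirely from user space.
     * Assumes pd, qp and cq were set up via the normal ibv_* calls. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int post_and_wait(struct ibv_pd *pd, struct ibv_qp *qp,
                      struct ibv_cq *cq, void *buf, uint32_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey
        };
        struct ibv_send_wr wr = {
            .wr_id = 1, .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED
        };
        struct ibv_send_wr *bad_wr;
        struct ibv_wc wc;
        int n;

        if (ibv_post_send(qp, &wr, &bad_wr)) {  /* fast path, typically no
                                                   system call */
            ibv_dereg_mr(mr);
            return -1;
        }
        do {                                    /* busy-poll for completion */
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        ibv_dereg_mr(mr);
        return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }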
Rsockets by Sean Hefty (Intel)
● Presented at the Open Fabrics Developer meeting at the end
of March.
● Socket Emulation layer in user space.
● LD_PRELOAD library to override socket system calls (sketched below)
● Performance comparison shows that kernel processing is
detrimental to performance. Bypass is essential.
● IPOIB = Kernel Sockets layer using IP emulation on
Infiniband.
● SDP = Kernel Sockets layer using Infiniband native
connection.
● IB = Native Infiniband connection. User space → User Space
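A hedged sketch of what using rsockets looks like: librdmacm exposes rsocket()/rconnect()/rsend() calls that mirror the BSD socket API, so an application can either code against them directly, as below, or run unmodified under the LD_PRELOAD interposer.

    /* rsockets mirrors the BSD socket calls: rsocket/rconnect/rsend/
     * rclose stand in for socket/connect/send/close, with the data
     * path going over RDMA instead of the kernel network stack. */
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <rdma/rsocket.h>

    int rsockets_send(const char *ip, const void *buf, size_t len)
    {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7000);            /* arbitrary example port */
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            rsend(fd, buf, len, 0) < 0) {
            rclose(fd);
            return -1;
        }
        return rclose(fd);
    }

Unmodified socket applications can get the same effect with the preload library that ships with librdmacm (commonly installed as librspreload.so; the exact name and path depend on the distribution).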
[Chart: Bandwidth (Gbps) versus message size, 64 bytes to 1 MB, for IPoIB, SDP, RSOCKET and IB.]
[Chart: 64-byte ping-pong latency (µs) for IPoIB, SDP, RSOCKET and IB.]
State of the Art NIC
characteristics
● 56 Gigabits per second (Unix network stack
was designed for 10Mbits)
● 4-7 Gigabytes per second (Unix: 1 MB/s)
● >8 million packets per second (Unix: ~1000 packets per second).
● Less than a microsecond per packet available for processing (see the budget below).
● Windows 8 will fully support these speeds
through offload/bypass APIs.
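For a sense of scale (a back-of-the-envelope figure, not from the slides), the per-packet budget at that rate is

    \[ \frac{1\,\mathrm{s}}{8 \times 10^{6}\ \mathrm{packets}} = 125\,\mathrm{ns\ per\ packet}, \]

which is on the order of one or two DRAM accesses and far below the cost of taking a system call for every packet.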
Bypass is
● Not using a kernel subsystem although it is
there and could provide similar services.
● Implementing what a kernel subsystem does.
● Direct access and control of memory mapped
devices from user space.
● Some people call bypass "zero copy" since it avoids the kernel's copy pass.
Offload is
● Replacement of what could be done in
software with dedicated hardware.
● Overlaps with bypass because direct device interaction replaces software action in the kernel with the actions of a hardware device.
● Typical case of hardware offload: DMA
engines, GPUs, Rendering screens,
cryptography, TCP (TOE), FPGAs.
Why bypass the kernel?
● Kernel is too slow and inefficient at high packet rates.
Problems already begin at 10G.
● Contemporary devices can map user-space memory and transfer data directly to and from user space.
● Kernel must copy data between kernel buffers and
userspace.
● Kernel is continually regressing in terms of the
overhead of basic system calls and operations. Only
new hardware compensates.
10G woes
● Packet rate too high to receive all packets on one
processor.
● All sorts of optimizations: interrupt mitigation, multiqueue support, receive flow steering, etc.
● Compromises that do not allow full use of the capabilities
of the hardware.
● 10G technology goes beyond the boundary of what
socket based technology can cleanly support.
● 40G and higher speed technologies are available and
introduce different APIs that allow effective operations.
CPU problems
● The processing capacity of an individual hardware
thread is mostly stagnating. No major improvements
in upcoming processor generations.
● I/O bandwidth, memory capacity and bandwidth are
growing fast.
● Multithreaded handling of I/O requires synchronization, which creates additional overhead that in turn limits the processing capability.
Bypass technologies
● Kernel: Zero copy (e.g. using sendfile(); sketched below).
● Kernel: Sockets Direct Protocol (SDP)
● Kernel: RDMA or “Verbs”
● Linux: u/kDAPL
● Solarflare: OpenOnload
● Myricom: DBL
● Numerous user-space implementations, mostly based on the kernel RDMA APIs.
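For the first item in the list, a minimal sketch of in-kernel zero copy with sendfile(2), which moves file data to a socket without staging it in a user-space buffer:

    /* In-kernel zero copy: sendfile() transfers file data to the
     * socket without copying it through a user-space buffer. */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int send_file(int sock_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock_fd, fd, &off, st.st_size - off);
            if (n <= 0)
                break;                  /* error or unexpected end */
        }
        close(fd);
        return off == st.st_size ? 0 : -1;
    }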
Expanding role of Kernel RDMA
Verbs API
✗ Initially only Infiniband packets supported
✗ Chelsio added iWarp via verbs API to get around TOE
restrictions in the Linux network stack. Recently added
UDP support via verbs.
✗ Mellanox added ROCEE to allow Infiniband RDMA type
traffic on Ethernet networks.
✗ Mellanox added RAW_ETH_QP to allow another form of
kernel bypass for Ethernet devices.
✗ Manufacturers come up with new devices for RDMA
verbs.
Trouble with offload APIs
● Complexity
● Manage RX / TX rings and timing issues in
user space.
● Difficult to code for.
● 4k page size is still an issue.
● No unified API.
● Proprietary vendor solutions.
“Stateless Offload”
● Artificial name created to refer to the ability of the network device to transfer a limited number of packets in a single system call. Mostly used for TCP.
● Does reduce kernel overhead but does not copy directly into user space.
● No use of RDMA.
● For the purposes of this talk this is not offload as understood here; by offload we mean bypassing the kernel in critical paths. This only batches packets (see the sketch below).
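As one concrete example of this kind of batching (an illustration, not necessarily the mechanism the slide has in mind), recvmmsg(2) receives several datagrams per system call; the call overhead is amortized, but every packet is still copied out of kernel buffers:

    /* Batching without bypass: recvmmsg() pulls up to BATCH datagrams
     * per system call, but each one is still copied from kernel
     * buffers into the user-space iovecs. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 32
    #define PKT_SIZE 2048

    int recv_batch(int sock_fd)
    {
        static char bufs[BATCH][PKT_SIZE];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len = PKT_SIZE;
            msgs[i].msg_hdr.msg_iov = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* One context switch for up to BATCH packets. */
        return recvmmsg(sock_fd, msgs, BATCH, 0, NULL);
    }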
Bypass API characteristics
● Direct operation with user space memory
bypassing kernel buffers.
● Kernel manages the connection but the
involvement in actual read and write
operations is minimal.
● I/O is asynchronous.
● Problem of notification about newly available packets or completed writes (see the completion-event sketch below).
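One way the notification problem is handled in the verbs world, as a minimal sketch (it assumes the completion queue cq was created on the completion channel chan): arm the CQ, block on the channel instead of busy-polling, then drain the completions.

    /* Completion notification instead of busy polling: arm the CQ,
     * block on the completion channel, then drain completed work.
     * Assumes cq was created on the completion channel chan. */
    #include <infiniband/verbs.h>

    int wait_for_completions(struct ibv_comp_channel *chan, struct ibv_cq *cq)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;
        int n;

        if (ibv_req_notify_cq(cq, 0))                 /* arm: next completion
                                                         raises an event */
            return -1;
        if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))  /* blocks until event */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);
        if (ibv_req_notify_cq(ev_cq, 0))              /* re-arm before draining */
            return -1;

        while ((n = ibv_poll_cq(ev_cq, 1, &wc)) > 0)
            if (wc.status != IBV_WC_SUCCESS)
                return -1;
        return n;                                     /* 0 when fully drained */
    }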
Storage
● Similar issues, particularly evident for remote filesystem clients since they have traditionally used socket-based APIs or in-kernel APIs for network connectivity.
● Paper “Sockets vs RDMA Interface over 10-Gigabit Networks”.
● Storage issues are not as severe for local disk access since APIs exist that already do a kind of bypass: direct I/O (O_DIRECT) and mmapped I/O (sketched below). Infrastructure for control is missing though.
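A minimal sketch of the local-disk bypass mentioned above, assuming the filesystem accepts 4096-byte alignment for O_DIRECT (the required granularity varies by device and filesystem):

    /* Direct I/O: O_DIRECT reads bypass the page cache and DMA
     * straight into a suitably aligned user buffer. Both the buffer
     * address and len must honour the device/filesystem alignment;
     * 4096 bytes is assumed here. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    ssize_t read_direct(const char *path, size_t len)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, len))
            return -1;

        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) {
            free(buf);
            return -1;
        }

        ssize_t n = read(fd, buf, len);     /* DMA into user memory */
        close(fd);
        free(buf);
        return n;
    }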
Storage Challenges
● High-end SSDs max out PCI-E speeds
● >1 million IOPS
● 6.5 GBytes/sec
● Linux uses 4K pages => the kernel has to manage (repeatedly updating!) >1 million page descriptors per second (see the arithmetic below).
● The design was for 500 IOPS and tens or maybe hundreds of MBytes per second.
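The page-descriptor figure follows directly from the numbers above (a rough calculation assuming the data streams through distinct 4 KiB pages):

    \[ \frac{6.5\ \mathrm{GB/s}}{4\ \mathrm{KiB\ per\ page}} \approx 1.6 \times 10^{6}\ \mathrm{pages\ per\ second}. \]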
PCI-E “Bypass”
● I/O bus hardware is insufficient for contemporary state-of-the-art devices.
● 40Gb networks cannot be handled by current Westmere processors (there is a limit of 32Gb/s for 8 lanes).
● 56Gb/s transfer rates can barely be handled by the upcoming generation (Sandy Bridge).
● I/O devices with much higher transfer rates are on the
horizon.
● PCI-E bypass is being explored by various companies.
● Hardware development is not keeping up
Linux Memory Management
● 4K page size. The kernel handles memory in 4K chunks (yes, 2M huge pages are available but they have serious limitations; a huge-page sketch follows this list).
● A standard 64G server has 16 million page structs to manage. Memory reclaim can become adventurous.
● With terabyte servers on the horizon we are looking at billions of 4K chunks to manage.
● If I just want to transfer a 4G file from disk, the hardware and the memory management system need to handle 1 million 4K pages.
● 4G can be transferred in ~1 second on state of the art
networks (but the kernel overhead will ensure that such a
transfer takes much longer).
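A minimal sketch of the huge-page alternative mentioned above, assuming huge pages have been reserved by the administrator beforehand (e.g. via /proc/sys/vm/nr_hugepages): backing a 4G buffer with 2M pages drops the descriptor count from roughly a million to 2048.

    /* Huge-page backing: a 4 GiB buffer needs 2^20 descriptors with
     * 4 KiB pages but only 2048 with 2 MiB huge pages. Assumes huge
     * pages have already been reserved on the system. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define GiB (1024UL * 1024 * 1024)

    void *alloc_huge_buffer(void)
    {
        void *buf = mmap(NULL, 4 * GiB,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                         -1, 0);
        return buf == MAP_FAILED ? NULL : buf;
    }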
Storage system bypass by
Network Filesystems
● Network filesystems intercept I/O requests in
user space and forward to the server
bypassing the OS on the client.
● Lustre, Gluster, GPFS
● Special protocols for block offload (SRP,iSER)
● Network performance is much higher than
local disk performance(!).
● Situation may change with new SSD/Flash
technologies. But that is very pricey.
Approaches to fix things
● Memory management needs to handle larger chunks of
memory (2M, 1G page sizes). Large physically
contiguous chunks of memory need to be managed.
● A processor should not touch the data transferred (zero
copy requirement). This implies that POSIX style socket
I/O cannot be used. There is a need for new APIs.
● I/O subsystems need to be able to handle large chunks of
memory and avoid creating large scatter gather lists.
Devices struggle to support those.
A New Standard RDMA system
call API
● Socket based
● Reuse as much as possible of the POSIX
functionality.
● Buffer management in user space
● Raw hardware interface elements
● There are a couple of academic projects in this area but none of them will be viable without kernel support. (A purely illustrative sketch of what such an API might look like follows.)
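Purely as illustration of the bullet points above, a hypothetical header sketch; every name here is invented for this summary and does not correspond to any existing or proposed kernel interface:

    /* Hypothetical, invented names only: a socket-style RDMA system
     * call API as described on this slide. Not an existing or
     * proposed kernel interface. */
    #include <sys/socket.h>
    #include <sys/types.h>

    /* A user-space buffer region registered with the kernel/device,
     * so buffer management stays in user space. */
    struct rio_region {
        void   *addr;
        size_t  len;
        int     region_id;      /* handle filled in by rio_register() */
    };

    /* Control path: POSIX-like setup, kernel involved. */
    int rio_socket(int domain, int type, int protocol);
    int rio_register(int fd, struct rio_region *region);

    /* Data path: asynchronous, zero copy, minimal kernel involvement. */
    int rio_send(int fd, const struct rio_region *region,
                 off_t offset, size_t len, unsigned long cookie);
    int rio_poll_completion(int fd, unsigned long *cookie, int *status);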
Conclusion Q&A
● We are heading for a need to offload key components of the OS.
● OS needs to manage larger chunks of
memory effectively.
● A need to replace/augment the POSIX APIs?
● RDMA API standardization and generalization
is necessary.
Coming in 2013
● 100 Gbit/sec networking
● >100 GB/sec SSD / Flash devices
● More cores in Intel processors.
● GPUs already support thousands of hardware
threads. Newer models will offer more.