Multi Threading

The document discusses the concept of zero-copy networking and an implementation for Linux. Zero-copy aims to eliminate unnecessary data copying between memory areas by the CPU. The implementation modifies the Linux network driver to split outgoing packets between kernel and user memory, allowing data to be directly transmitted via DMA without copying. It uses virtual-to-physical address translation via the CPU page tables to determine the physical address of the user buffer for the transmission descriptor.

Uploaded by

Sandeep Alajangi

The ‘zero-copy’ initiative

A look at the ‘zero-copy’ concept

and an x86 Linux implementation
for the case of outgoing packets
From Wikipedia, the free encyclopedia:

Zero-copy is an adjective that refers to computer operations in which the
CPU does not perform the task of copying data from one area of memory
to another.

The availability of zero-copy versions of operating system elements such
as device drivers, file systems and network protocol stacks greatly increases
the performance of many applications, since using a CPU that is capable of
complex operations just to make copies of data can be a great waste of
resources. Zero-copy also reduces the number of context-switches from
user space to kernel space and vice-versa. Several operating systems, such
as Linux, support zero-copy transfer of files through specific APIs such as
sendfile and sendfile64.

Techniques for creating zero-copy software include the use of DMA-based
copying, and memory-mapping through an MMU. These features require
specific hardware support and usually involve particular memory alignment
requirements.

Zero-copy protocols are especially important for high-speed networks, as
memory copies would cause a serious workload for the host CPU. Still, such
protocols have some initial overhead, so avoiding programmed I/O (PIO)
only makes sense for large messages.
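Application source-code

#include <stdio.h>      // for printf(), perror()
#include <stdlib.h>     // for exit()
#include <string.h>     // for strlen()
#include <fcntl.h>      // for open()
#include <unistd.h>     // for write()

char message[] = "This is a test of network-packet transmission \n";

int main( void )
{
    // open the network-interface device-file for reading and writing
    int fd = open( "/dev/nic", O_RDWR );
    if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

    // hand the message to the driver for transmission
    int msglen = strlen( message );
    int nbytes = write( fd, message, msglen );
    if ( nbytes < 0 ) { perror( "write" ); exit(1); }

    printf( "Transmitted %d bytes \n", nbytes );
}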
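Transmit operation

[Diagram: in user space, the application program calls write() through the runtime library. In kernel space, the Linux OS kernel's file subsystem passes the call to the nic device-driver's my_write(), which uses copy_from_user() to copy the user data-buffer into the kernel's packet buffer. The hardware then fetches the packet buffer using DMA.]

We want to eliminate this copying-operation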
Our driver’s packet-layout

packet-buffer in kernel-space:
  destn-address | source-address | TYPE/LENGTH | count | -- data -- ...

Format for Legacy Transmit-Descriptor (16 bytes):
  base-address (64-bits) | Packet-length | CSO | cmd | status | CSS | special
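The 16-byte descriptor format can be written down as a C structure. This is only a sketch: the type name `tx_descriptor` and the field names are hypothetical, and the widths of every field other than the 64-bit base-address are assumed from the usual legacy descriptor layout (16-bit length, 8-bit CSO, cmd, status and CSS, 16-bit special) rather than stated on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of one legacy transmit descriptor.
 * Only the 64-bit base-address and the 16-byte total come from the
 * slide; the remaining field widths are assumptions. */
typedef struct {
    uint64_t base_address;   /* physical address of the buffer      */
    uint16_t packet_length;  /* number of bytes in that buffer      */
    uint8_t  cso;            /* checksum offset                     */
    uint8_t  cmd;            /* command byte (EOP, RS, VLE, ...)    */
    uint8_t  status;         /* status reported by the controller   */
    uint8_t  css;            /* checksum start                      */
    uint16_t special;        /* special field                       */
} tx_descriptor;

/* Compile-time checks that the layout really occupies 16 bytes and
 * that the command byte sits where the assumed layout puts it. */
static_assert(sizeof(tx_descriptor) == 16, "descriptor must be 16 bytes");
static_assert(offsetof(tx_descriptor, cmd) == 11, "cmd is byte 11");
```

With natural alignment the fields pack with no padding, so an array of these structures can stand in for the descriptor ring in the discussion that follows.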

Can zero-copy be transparent?
• We would like to implement the zero-copy
concept in our ‘nic2.c’ character driver in
such a manner that no changes would be
required to an ‘application’ program’s code
• We will show how to do this for ‘outgoing’
packets (i.e., by modifying ‘my_write()’),
but achieving zero-copy with ‘incoming’
packets would be a lot more complicated!
TX Descriptor’s CMD byte
Command-Byte Format

  bit:    7     6     5     4     3     2     1      0
         IDE   VLE    0     0    RS    IC   IFCS   EOP

EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable

Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
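The bit positions in the command byte can be captured as C constants. The names `CMD_*` and the helper `make_cmd()` are hypothetical, chosen here for illustration; the bit numbers follow the Command-Byte Format layout, with bits 5 and 4 left at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks for the TX descriptor's command byte (names are assumptions). */
enum {
    CMD_EOP  = 1 << 0,   /* End-Of-Packet  */
    CMD_IFCS = 1 << 1,
    CMD_IC   = 1 << 2,
    CMD_RS   = 1 << 3,   /* Report Status  */
    CMD_VLE  = 1 << 6,   /* VLAN-tag Enable */
    CMD_IDE  = 1 << 7
};

/* Hypothetical helper: build the command byte for one descriptor.
 * is_last says whether this descriptor ends the packet. */
static uint8_t make_cmd(int is_last, int report_status)
{
    uint8_t cmd = 0;
    if (is_last)       cmd |= CMD_EOP;
    if (report_status) cmd |= CMD_RS;
    assert((cmd & 0x30) == 0);   /* bits 5 and 4 must stay zero */
    return cmd;
}
```

A descriptor that does not end the packet would get `make_cmd(0, 0)`, which leaves the EOP-bit clear.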
Splitting our packet-layout

packet-buffer in kernel-space:
  destn-address | source-address | TYPE/LENGTH | count      (HDR bytes)
  -- data -- ...                                            (LEN bytes)

Format for Legacy Transmit-Descriptor Pair:
  1st: base-address (64-bits) | Packet-length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
  2nd: base-address (64-bits) | Packet-length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special


Splitting our packet-buffer

packet-buffer in kernel-space:
  destn-address | source-address | TYPE/LENGTH | count      (HDR bytes)

packet-buffer in user-space:
  -- data -- ...                                            (LEN bytes)

Format for Legacy Transmit-Descriptor Pair:
  1st: base-address (64-bits) | Packet-length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
  2nd: base-address (64-bits) | Packet-length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special

Two physical packet-buffers comprise one logical packet that gets transmitted!
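A minimal sketch of how a driver might fill such a descriptor pair is shown here. The structure, the `CMD_*` masks and the helper `fill_split_packet()` are hypothetical names introduced for illustration, and setting the RS-bit on the final descriptor is an assumption, not something the slides specify.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CMD_EOP 0x01   /* End-Of-Packet */
#define CMD_RS  0x08   /* Report Status */

/* Hypothetical 16-byte legacy descriptor (field widths assumed). */
typedef struct {
    uint64_t base_address;
    uint16_t packet_length;
    uint8_t  cso, cmd, status, css;
    uint16_t special;
} tx_descriptor;

/* Describe ONE logical packet with TWO descriptors: the header
 * buffer (EOP=0) followed by the data buffer (EOP=1). The two
 * addresses are physical addresses supplied by the caller. */
static void fill_split_packet(tx_descriptor pair[2],
                              uint64_t hdr_phys,  uint16_t hdr_len,
                              uint64_t data_phys, uint16_t data_len)
{
    assert(hdr_len > 0 && data_len > 0);
    memset(pair, 0, 2 * sizeof pair[0]);

    pair[0].base_address  = hdr_phys;
    pair[0].packet_length = hdr_len;      /* = HDR */
    pair[0].cmd           = 0;            /* EOP=0: more to come */

    pair[1].base_address  = data_phys;
    pair[1].packet_length = data_len;     /* = LEN */
    pair[1].cmd           = CMD_EOP | CMD_RS;  /* EOP=1: packet ends here */
}
```

The only difference from a single-buffer transmit is that the first descriptor leaves EOP clear, so the two lengths add up to the length of the packet on the wire.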
Transmitting a ‘split-packet’

The 82573L controller ‘merges’ the contents of these separate buffers
into just a single ethernet-packet

[Diagram: the application-program's packet-data buffer in user-space and the device-driver module's packet-header buffer in kernel-space are each fetched by the NIC hardware using DMA.]
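The merging behavior can be modeled in ordinary C to make it concrete. This is a user-space simulation only, not a description of how the controller is built: `sim_descriptor` uses a plain pointer where the real descriptor holds a physical base-address, and `gather_frame()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMD_EOP 0x01

/* Simulation-only descriptor: 'buf' stands in for the base-address. */
typedef struct {
    const void *buf;
    uint16_t    length;
    uint8_t     cmd;
} sim_descriptor;

/* Concatenate the buffers of successive descriptors into 'frame'
 * until one with EOP=1 is reached. Returns the frame length, or 0
 * if no EOP was found or the frame buffer is too small. */
static size_t gather_frame(const sim_descriptor *ring, size_t count,
                           uint8_t *frame, size_t frame_max)
{
    size_t used = 0;
    assert(ring != NULL && frame != NULL);
    for (size_t i = 0; i < count; i++) {
        if (ring[i].length > frame_max - used)
            return 0;                      /* would overflow the frame */
        memcpy(frame + used, ring[i].buf, ring[i].length);
        used += ring[i].length;
        if (ring[i].cmd & CMD_EOP)
            return used;                   /* packet is complete */
    }
    return 0;                              /* ran out of descriptors */
}
```

In this model a descriptor with EOP=0 simply contributes its bytes and hands over to the next descriptor, which is the behavior the split-packet scheme relies on.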
The ‘virt_to_phys()’ macro
• Linux provides a convenient macro which
kernel-module code can employ to obtain
the physical-address for a memory-region
from its virtual-address – but it only works
for addresses that aren’t in ‘high’ memory
• For ‘normal’ memory-regions, conversion
between ‘virtual’ and ‘physical’ addresses
amounts to a simple addition/subtraction
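The addition/subtraction can be illustrated with a small user-space model. The constants are assumptions: `SIM_PAGE_OFFSET` is the 0xC0000000 base commonly used by 32-bit x86 Linux, and `SIM_LOWMEM_SIZE` is an 896 MB limit for directly mapped memory. The function names are hypothetical and are not the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a 32-bit x86 kernel with the common 3GB/1GB split. */
#define SIM_PAGE_OFFSET  0xC0000000u
#define SIM_LOWMEM_SIZE  (896u << 20)     /* 896 MB of 'normal' memory */

/* Model of virtual-to-physical conversion for 'normal' memory:
 * a plain subtraction, valid only inside the directly mapped range. */
static uint32_t sim_virt_to_phys(uint32_t vaddr)
{
    assert(vaddr >= SIM_PAGE_OFFSET &&
           vaddr - SIM_PAGE_OFFSET < SIM_LOWMEM_SIZE);
    return vaddr - SIM_PAGE_OFFSET;
}

/* The inverse conversion: a plain addition. */
static uint32_t sim_phys_to_virt(uint32_t paddr)
{
    assert(paddr < SIM_LOWMEM_SIZE);
    return paddr + SIM_PAGE_OFFSET;
}
```

The assertions mark the limitation described above: an address outside the directly mapped range (a user-space buffer, or 'high' memory) cannot be converted this way.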
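Linux memory-mapping

[Diagram: physical RAM is drawn beside the CPU's virtual address-space, which is divided into user space and kernel space. The first 896-MB of physical RAM has a persistent mapping into kernel space; the HMA above it is reached only through transient mappings.]

There is more physical RAM in our classroom’s systems than can be ‘mapped’ into
the available address-range for kernel virtual addresses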
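Two-Level Translation Scheme

[Diagram: CR3 locates the PAGE DIRECTORY, whose entries locate the PAGE TABLES, whose entries locate the PAGE FRAMES.]

Linear to Physical

[Diagram: a linear address is split into dir-index, table-index and offset. CR3 points to the page directory; dir-index selects a page table, table-index selects a page frame in the physical address-space, and offset selects the byte within that page frame.]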
Address-translation
• The CPU examines any virtual address it
encounters, subdividing it into three fields

bits 31-22 (10-bits): index into page-directory
  This field selects one of the 1024 array-entries in the Page-Directory
bits 21-12 (10-bits): index into page-table
  This field selects one of the 1024 array-entries in that Page-Table
bits 11-0 (12-bits): offset into page-frame
  This field provides the offset to one of the 4096 bytes in that Page-Frame
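The three-field subdivision is a matter of shifts and masks. The sketch below uses a hypothetical `va_fields` structure and `split_address()` helper to show the arithmetic for a 32-bit address.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 32-bit virtual address (names are assumptions). */
typedef struct {
    uint32_t dindex;   /* bits 31-22: index into page-directory */
    uint32_t pindex;   /* bits 21-12: index into page-table     */
    uint32_t offset;   /* bits 11-0:  offset into page-frame    */
} va_fields;

static va_fields split_address(uint32_t vaddr)
{
    va_fields f;
    f.dindex = (vaddr >> 22) & 0x3FF;   /* 10 bits: 0..1023 */
    f.pindex = (vaddr >> 12) & 0x3FF;   /* 10 bits: 0..1023 */
    f.offset =  vaddr        & 0xFFF;   /* 12 bits: 0..4095 */

    /* The fields must reassemble into the original address. */
    assert(((f.dindex << 22) | (f.pindex << 12) | f.offset) == vaddr);
    return f;
}
```

For example, the address 0x08049F20 splits into directory index 32, table index 73 and offset 0xF20.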
Format of a Page-Table entry

  bits 31-12   PAGE-FRAME BASE ADDRESS
  bits 11-9    AVAIL
  bits 8-7     0 0
  bit 6        D
  bit 5        A
  bit 4        PCD
  bit 3        PWT
  bit 2        U
  bit 1        W
  bit 0        P

LEGEND
P = Present (1=yes, 0=no)
W = Writable (1 = yes, 0 = no)
U = User (1 = yes, 0 = no)
A = Accessed (1 = yes, 0 = no)
D = Dirty (1 = yes, 0 = no)
PWT = Page Write-Through (1=yes, 0 = no)
PCD = Page Cache-Disable (1 = yes, 0 = no)
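The entry format translates directly into bit masks. In this sketch the `PTE_*` names and the two helper functions are hypothetical; the bit positions follow the legend.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry (names are assumptions). */
enum {
    PTE_P   = 1 << 0,   /* Present            */
    PTE_W   = 1 << 1,   /* Writable           */
    PTE_U   = 1 << 2,   /* User               */
    PTE_PWT = 1 << 3,   /* Page Write-Through */
    PTE_PCD = 1 << 4,   /* Page Cache-Disable */
    PTE_A   = 1 << 5,   /* Accessed           */
    PTE_D   = 1 << 6    /* Dirty              */
};

/* Page-Frame Number held in bits 31-12 of a present entry. */
static uint32_t pte_pfn(uint32_t pte)
{
    assert(pte & PTE_P);      /* the PFN is meaningful only if present */
    return pte >> 12;
}

/* True if the page is present and writable from user mode. */
static int pte_user_writable(uint32_t pte)
{
    const uint32_t need = PTE_P | PTE_W | PTE_U;
    return (pte & need) == need;
}
```

An entry such as 0x01234067 therefore describes page-frame 0x1234 as present, writable, user-accessible, accessed and dirty.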
Finding the user-buffer’s PFN
• To program the ‘base-address’ field in the
second TX-Descriptor, our driver’s ‘write()’
function will need to know which physical
Page-Frame the application’s buffer lies in
• And its PFN (Page-Frame Number) can be
found from its virtual address by ‘walking-
the-cpu-page-tables’ – even when Linux
puts some page-tables in ‘high’ memory
Performing ‘virt_to_phys()’
ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    unsigned int  _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
    unsigned int  dindex, pindex, offset;

    // take apart the virtual-address of the user's 'buf' variable
    dindex = ((unsigned long)buf >> 22) & 0x3FF;   // pgdir-index (10-bits)
    pindex = ((unsigned long)buf >> 12) & 0x3FF;   // pgtbl-index (10-bits)
    offset = ((unsigned long)buf >>  0) & 0xFFF;   // frame-offset (12-bits)

    // then walk the CPU's paging-tables to get buf's physical-address
    // (this assumes both table-entries are marked 'present')
    asm(" mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
    pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
    pfn_pgtbl = (pgdir[ dindex ] >> 12);
    pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );   // page-table may be in 'high' memory
    pfn_frame = (pgtbl[ pindex ] >> 12);
    kunmap( &mem_map[ pfn_pgtbl ] );

    // the second TX-Descriptor points directly at the user's buffer
    txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
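The same two-level walk can be tried out in user space against a simulated memory. Everything here is an assumption made for illustration: `sim_ram` is a tiny array of 4 KB frames standing in for physical memory, and `sim_walk()` is a hypothetical function that, unlike the driver excerpt, also checks the Present bit of each entry.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_FRAMES   4        /* tiny simulated physical memory      */
#define SIM_ENTRIES  1024     /* 32-bit entries per 4 KB frame       */
#define SIM_PRESENT  0x1u     /* Present bit of a directory/table entry */

/* sim_ram[n] plays the role of physical page-frame number n. */
static uint32_t sim_ram[SIM_FRAMES][SIM_ENTRIES];

/* Walk the simulated page directory and page table for 'vaddr'.
 * Returns 0 and stores the physical address, or -1 if unmapped. */
static int sim_walk(uint32_t cr3, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t dindex = (vaddr >> 22) & 0x3FF;
    uint32_t pindex = (vaddr >> 12) & 0x3FF;
    uint32_t offset =  vaddr        & 0xFFF;

    assert((cr3 >> 12) < SIM_FRAMES);
    uint32_t pde = sim_ram[cr3 >> 12][dindex];       /* directory entry */
    if (!(pde & SIM_PRESENT) || (pde >> 12) >= SIM_FRAMES)
        return -1;

    uint32_t pte = sim_ram[pde >> 12][pindex];       /* table entry */
    if (!(pte & SIM_PRESENT))
        return -1;

    *paddr = (pte & ~0xFFFu) | offset;               /* frame base + offset */
    return 0;
}
```

Placing the directory in frame 1 and a page table in frame 2, then mapping table entry 73 to frame 0x1234, makes `sim_walk()` translate 0x08049F20 to 0x01234F20, which mirrors the `(pfn_frame << 12) + offset` computation in the driver.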
Can’t cross a ‘page-boundary’
• In order for the NIC to fetch the user’s data
using its Bus-Master DMA capability, the
buffer needs to reside in a physically
contiguous memory-region
• But we can’t be sure Linux will have set up
the CPU’s page-tables that way, unless
the ‘buf’ is confined to a single page-frame
Truncate ‘len’ if necessary
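ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    // keep the transfer inside the page-frame where 'buf' begins
    if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;

[Diagram: 'buf' begins 'offset' bytes into one PAGE_SIZE frame and extends for 'len' bytes; any part that would spill into the next PAGE_SIZE frame is cut off.]
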
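The truncation rule can be checked with a few numbers. This sketch assumes a 4096-byte page; `SIM_PAGE_SIZE` and `clamp_to_page()` are hypothetical names, and the buffer address is passed as an integer so the arithmetic can be exercised outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u   /* assumed page size */

/* Return how many of 'len' bytes starting at address 'buf' lie in
 * the same page-frame as the first byte. */
static size_t clamp_to_page(uintptr_t buf, size_t len)
{
    size_t offset = (size_t)(buf & (SIM_PAGE_SIZE - 1));
    size_t room   = SIM_PAGE_SIZE - offset;   /* bytes left in this page */
    assert(room >= 1 && room <= SIM_PAGE_SIZE);
    return (len > room) ? room : len;
}
```

A buffer at 0x08049F20 has offset 0xF20 (3872), so only 224 bytes remain in its page and a 1000-byte request would be cut to 224. Because write() is allowed to return fewer bytes than were requested, an application could call it again for the remainder.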

‘zerocopy.c’
• We created this modification of our ‘nic2.c’
device-driver so its ‘my_write()’ function
lets an application perform transmissions
without performing a memory-to-memory
copy-operation (i.e., ‘copy_from_user()’)
• It is not so easy to implement ‘zero-copy’
for receiving packets – can you say why?
Website article
• We’ve posted a link on our CS686 website
to a frequently cited research-article about
the various issues that arise when trying to
implement the ‘zero-copy’ concept for the
case of ‘incoming’ network-packets:

The Need for Asynchronous, Zero-Copy Network I/O,

by Ulrich Drepper, Red Hat, Inc.
