Multi Threading

The document discusses the concept of zero-copy networking and an implementation for Linux. Zero-copy aims to eliminate unnecessary data copying between memory areas by the CPU. The implementation modifies the Linux network driver to split outgoing packets between kernel and user memory, allowing data to be directly transmitted via DMA without copying. It uses virtual-to-physical address translation via the CPU page tables to determine the physical address of the user buffer for the transmission descriptor.

Uploaded by

Sandeep Alajangi

The ‘zero-copy’ initiative

A look at the ‘zero-copy’ concept
and an x86 Linux implementation
for the case of outgoing packets
From Wikipedia, the free encyclopedia:

Zero-copy is an adjective that refers to computer operations in which the
CPU does not perform the task of copying data from one area of memory
to another.
The availability of zero-copy versions of operating system elements such
as device drivers, file systems and network protocol stacks greatly increases
the performance of many applications, since using a CPU that is capable of
complex operations just to make copies of data can be a great waste of
resources. Zero-copy also reduces the number of context-switches from
User space to Kernel space and vice-versa. Several OS like Linux support
zero copying of files through specific API's like sendfile, sendfile64, etc.
Techniques for creating zero-copy software include the use of DMA-based
copying, and memory-mapping through an MMU. These features require
specific hardware support and usually involve particular memory alignment
requirements.
Zero-copy protocols are especially important for high-speed networks, as
memory copies would cause a serious workload for the host cpu. Still, such
protocols have some initial overhead so that avoiding programmed IO (PIO)
there only makes sense for large messages.
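The sendfile API mentioned in the quotation above can be sketched as follows. This is a minimal illustration, assuming a Linux system where sendfile() accepts the given pair of descriptors; the helper name send_whole_file() is hypothetical and is not part of this course's driver code.

```c
#include <sys/sendfile.h>   // sendfile() (Linux-specific)
#include <sys/stat.h>       // fstat()
#include <sys/types.h>      // off_t, ssize_t
#include <unistd.h>

// Hypothetical helper: ask the kernel to move every byte of in_fd to out_fd
// without passing the data through a user-space buffer.
// Returns the number of bytes transferred, or -1 on error.
ssize_t send_whole_file( int out_fd, int in_fd )
{
    struct stat st;
    if ( fstat( in_fd, &st ) < 0 ) return -1;

    off_t offset = 0;       // sendfile() advances this as it transfers data
    while ( offset < st.st_size ) {
        ssize_t n = sendfile( out_fd, in_fd, &offset,
                              (size_t)(st.st_size - offset) );
        if ( n < 0 ) return -1;
        if ( n == 0 ) break;    // nothing more could be transferred
    }
    return (ssize_t)offset;
}
```

The loop is there because sendfile() may transfer fewer bytes than requested in one call.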
Application source-code

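#include <fcntl.h>      // open(), O_RDWR
#include <stdio.h>      // printf(), perror()
#include <stdlib.h>     // exit()
#include <string.h>     // strlen()
#include <unistd.h>     // write()

char message[] = "This is a test of network-packet transmission \n";

int main( void )
{
    // open the character device-file for our nic driver
    int fd = open( "/dev/nic", O_RDWR );
    if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

    int msglen = strlen( message );
    int nbytes = write( fd, message, msglen );
    if ( nbytes < 0 ) { perror( "write" ); exit(1); }

    printf( "Transmitted %d bytes \n", nbytes );
    return 0;
}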
Transmit operation
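[Figure: the transmit path. In user space, the application program calls write() in the runtime library. In kernel space, the Linux file subsystem passes the call to my_write() in the nic device-driver, which uses copy_from_user() to copy the user data-buffer into the kernel's packet buffer. The hardware then fetches the packet buffer using DMA.]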
We want to eliminate this copying-operation
Our driver’s packet-layout
packet-buffer in kernel-space:

    destn-address | source-address | TYPE/LENGTH | count | -- data --

Format for Legacy Transmit-Descriptor (16 bytes):

    base-address (64-bits) | Packet-length | CSO | cmd | status | CSS | special


Can zero-copy be transparent?
• We would like to implement the zero-copy
concept in our ‘nic2.c’ character driver in
such a manner that no changes would be
required to an ‘application’ program’s code
• We will show how to do this for ‘outgoing’
packets (i.e., by modifying ‘my_write()’),
but achieving zero-copy with ‘incoming’
packets would be a lot more complicated!
TX Descriptor’s CMD byte
Command-Byte Format

    bit:    7     6     5     4     3     2     1      0
          IDE | VLE |  0  |  0  | RS  | IC  | IFCS | EOP

EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable

Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
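As a small sketch, the three command bits named in the legend could be written as C constants. The bit positions assume the command-byte diagram reads from bit 7 on the left to bit 0 on the right, and the helper tx_cmd() is a hypothetical name used only for illustration.

```c
#include <stdint.h>

// Bit positions assumed from the command-byte diagram (bit 0 is rightmost).
enum {
    CMD_EOP = 1 << 0,   // End-Of-Packet
    CMD_RS  = 1 << 3,   // Report Status
    CMD_VLE = 1 << 6    // VLAN-tag Enable
};

// Hypothetical helper: build the command byte for one descriptor.
// Only the descriptor for the final buffer of a packet gets EOP.
uint8_t tx_cmd( int is_last_buffer, int report_status )
{
    uint8_t cmd = 0;
    if ( is_last_buffer ) cmd |= CMD_EOP;
    if ( report_status )  cmd |= CMD_RS;
    return cmd;
}
```

For example, tx_cmd( 0, 0 ) gives a descriptor that leaves the packet open, while tx_cmd( 1, 1 ) closes it and requests a status report.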
Splitting our packet-layout
packet-buffer in kernel-space (one buffer, described in two parts):

    HDR bytes: destn-address | source-address | TYPE/LENGTH | count
    LEN bytes: -- data --

Format for Legacy Transmit-Descriptor Pair:

    1st: base-address (64-bits) | Packet-length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
    2nd: base-address (64-bits) | Packet-length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special


Splitting our packet-buffer
packet-buffer in kernel-space:

    HDR bytes: destn-address | source-address | TYPE/LENGTH | count

packet-buffer in user-space:

    LEN bytes: -- data --

Format for Legacy Transmit-Descriptor Pair:

    1st: base-address (64-bits) | Packet-length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
    2nd: base-address (64-bits) | Packet-length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special


Two physical packet-buffers comprise one logical packet that gets transmitted!
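A descriptor pair of this kind might be filled in as sketched below. The structure's field widths, the ring-wraparound step and the helper name fill_split_packet() are all assumptions made for illustration; they are not taken from ‘zerocopy.c’.

```c
#include <stdint.h>

// Assumed 16-byte legacy transmit-descriptor layout.
struct tx_descriptor {
    uint64_t base_address;
    uint16_t packet_length;
    uint8_t  cso, cmd, status, css;
    uint16_t special;
};

#define CMD_EOP 0x01    // End-Of-Packet bit of the command byte (assumed bit 0)

// Hypothetical helper: describe one logical packet with two descriptors,
// the first for the header buffer and the second for the data buffer.
void fill_split_packet( struct tx_descriptor *ring,
                        unsigned int tail, unsigned int ring_size,
                        uint64_t hdr_phys, uint16_t hdr_len,
                        uint64_t data_phys, uint16_t data_len )
{
    struct tx_descriptor *first  = &ring[ tail ];
    struct tx_descriptor *second = &ring[ (tail + 1) % ring_size ];

    first->base_address  = hdr_phys;     // header buffer (kernel-space)
    first->packet_length = hdr_len;      // = HDR
    first->cmd           = 0;            // EOP=0: the packet continues

    second->base_address  = data_phys;   // data buffer (user-space)
    second->packet_length = data_len;    // = LEN
    second->cmd           = CMD_EOP;     // EOP=1: the packet ends here

    first->cso = first->status = first->css = 0;
    first->special = 0;
    second->cso = second->status = second->css = 0;
    second->special = 0;
}
```

After filling the pair, a driver would presumably advance its tail index by two descriptors so the controller sees both.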
Transmitting a ‘split-packet’
The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet.

[Figure: the application program's packet-data buffer lies in user-space and the device-driver module's packet-header buffer lies in kernel-space; the NIC hardware fetches each buffer using DMA.]
The ‘virt_to_phys()’ macro
• Linux provides a convenient macro which
kernel-module code can employ to obtain
the physical-address for a memory-region
from its virtual-address – but it only works
for addresses that aren’t in ‘high’ memory
• For ‘normal’ memory-regions, conversion
between ‘virtual’ and ‘physical’ addresses
amounts to a simple addition/subtraction
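The ‘simple addition/subtraction’ can be sketched in ordinary C. The constant below assumes the common 32-bit x86 configuration in which kernel virtual addresses begin at 0xC0000000; the function names are illustrative stand-ins, not the kernel's own macros.

```c
#include <stdint.h>

// Assumed start of the kernel's persistent mapping on 32-bit x86.
#define SKETCH_PAGE_OFFSET 0xC0000000u

// For 'normal' memory the two address-spaces differ by a fixed offset.
uint32_t sketch_virt_to_phys( uint32_t vaddr )
{
    return vaddr - SKETCH_PAGE_OFFSET;
}

uint32_t sketch_phys_to_virt( uint32_t paddr )
{
    return paddr + SKETCH_PAGE_OFFSET;
}
```

Under this assumption, kernel virtual address 0xC0100000 corresponds to physical address 0x00100000. The arithmetic says nothing useful about user-space or ‘high’ memory addresses, which is why the page-tables have to be consulted for those.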
Linux memory-mapping
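[Figure: physical RAM shown beside the CPU’s virtual address-space, which is divided into user space and kernel space. A legend distinguishes the persistent mapping from transient mappings; the figure marks the 896-MB point in physical RAM and labels the region above it HMA.]

There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses.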
Two-Level Translation Scheme

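[Figure: CR3 locates the PAGE DIRECTORY, whose entries locate PAGE TABLES, whose entries locate PAGE FRAMES.]
Linear to Physical
[Figure: a linear address is split into dir-index, table-index and offset. CR3 points to the page directory; the dir-index selects the entry that points to a page table; the table-index selects the entry that points to a page frame in the physical address-space; the offset selects a byte within that page frame.]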
Address-translation
• The CPU examines any virtual address it
encounters, subdividing it into three fields

    bits 31-22 (10-bits): index into page-directory
        This field selects one of the 1024 array-entries in the Page-Directory
    bits 21-12 (10-bits): index into page-table
        This field selects one of the 1024 array-entries in that Page-Table
    bits 11-0 (12-bits): offset into page-frame
        This field provides the offset to one of the 4096 bytes in that Page-Frame
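The three-field subdivision can be tried out in user space with a few shifts and masks. This is only a sketch; the structure and function names are made up for illustration.

```c
#include <stdint.h>

// The three fields of a 32-bit virtual address under two-level paging.
struct linear_fields {
    uint32_t dindex;    // bits 31-22: index into the page-directory
    uint32_t pindex;    // bits 21-12: index into the page-table
    uint32_t offset;    // bits 11-0:  offset into the page-frame
};

struct linear_fields split_linear( uint32_t vaddr )
{
    struct linear_fields f;
    f.dindex = (vaddr >> 22) & 0x3FF;
    f.pindex = (vaddr >> 12) & 0x3FF;
    f.offset = vaddr & 0xFFF;
    return f;
}
```

For the address 0x08049F2C this yields dindex 32, pindex 73 and offset 0xF2C.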
Format of a Page-Table entry
    bits 31-12: PAGE-FRAME BASE ADDRESS
    bits 11-9:  AVAIL
    bits 8-7:   0 0
    bit 6:      D
    bit 5:      A
    bit 4:      PCD
    bit 3:      PWT
    bit 2:      U
    bit 1:      W
    bit 0:      P

LEGEND
P = Present (1=yes, 0=no)
W = Writable (1 = yes, 0 = no)
U = User (1 = yes, 0 = no)
A = Accessed (1 = yes, 0 = no)
D = Dirty (1 = yes, 0 = no)
PWT = Page Write-Through (1=yes, 0 = no)
PCD = Page Cache-Disable (1 = yes, 0 = no)
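A Page-Table entry laid out this way can be taken apart with masks, as in the sketch below. The constant and function names are illustrative choices, not identifiers from the kernel headers.

```c
#include <stdint.h>

// Flag bits of a 32-bit Page-Table entry, following the legend above.
#define PTE_P    0x001u     // Present
#define PTE_W    0x002u     // Writable
#define PTE_U    0x004u     // User
#define PTE_PWT  0x008u     // Page Write-Through
#define PTE_PCD  0x010u     // Page Cache-Disable
#define PTE_A    0x020u     // Accessed
#define PTE_D    0x040u     // Dirty

// Bits 31-12 hold the page-frame base address; shifting gives the PFN.
uint32_t pte_pfn( uint32_t pte )
{
    return pte >> 12;
}

// Returns nonzero when every bit in 'flags' is set in the entry.
int pte_has( uint32_t pte, uint32_t flags )
{
    return (pte & flags) == flags;
}
```

For the entry value 0x12345067, pte_pfn() gives 0x12345 and the P, W, U, A and D bits are set while PWT and PCD are clear.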
Finding the user-buffer’s PFN
• To program the ‘base-address’ field in the
second TX-Descriptor, our driver’s ‘write()’
function will need to know which physical
Page-Frame the application’s buffer lies in
• And its PFN (Page-Frame Number) can be
found from its virtual address by ‘walking-
the-cpu-page-tables’ – even when Linux
puts some page-tables in ‘high’ memory
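The table-walk itself can be rehearsed in user space against a simulated memory. The sketch below assumes physical memory is modeled as an array of 4096-byte frames indexed by PFN and that only 4KB pages are in use; the function name and the Present-bit checks are additions for illustration, not code from the driver.

```c
#include <stdint.h>

#define ENTRIES_PER_TABLE 1024
#define ENTRY_PRESENT     0x1u

// 'frames' models physical memory: frames[pfn] is one 4096-byte page-frame
// viewed as 1024 32-bit entries.  Returns 0 and stores the physical address
// on success, or -1 if a directory or table entry is not Present.
int walk_page_tables( uint32_t (*frames)[ENTRIES_PER_TABLE],
                      uint32_t cr3, uint32_t vaddr, uint32_t *paddr )
{
    // CR3 holds the physical address of the page-directory
    uint32_t *pgdir = frames[ cr3 >> 12 ];
    uint32_t pde = pgdir[ (vaddr >> 22) & 0x3FF ];
    if ( !(pde & ENTRY_PRESENT) ) return -1;

    // the directory entry holds the PFN of a page-table
    uint32_t *pgtbl = frames[ pde >> 12 ];
    uint32_t pte = pgtbl[ (vaddr >> 12) & 0x3FF ];
    if ( !(pte & ENTRY_PRESENT) ) return -1;

    // the table entry holds the PFN of the page-frame itself
    *paddr = (pte & ~0xFFFu) | (vaddr & 0xFFF);
    return 0;
}
```

In a real kernel the two array lookups become mappings of physical frames into kernel virtual addresses, which is where ‘high’ memory complicates matters.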
Performing ‘virt_to_phys()’
ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    unsigned int _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
    unsigned int dindex, pindex, offset;

    // take apart the virtual-address of the user's 'buf' variable
    dindex = ((unsigned long)buf >> 22) & 0x3FF;   // pgdir-index (10-bits)
    pindex = ((unsigned long)buf >> 12) & 0x3FF;   // pgtbl-index (10-bits)
    offset = ((unsigned long)buf >> 0) & 0xFFF;    // frame-offset (12-bits)

    // then walk the CPU's paging-tables to get buf's physical-address
    asm(" mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
    pgdir = (unsigned int*)phys_to_virt( _cr3 & ~0xFFF );
    pfn_pgtbl = (pgdir[ dindex ] >> 12);
    // the page-table may be in 'high' memory, so map it temporarily
    pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );
    pfn_frame = (pgtbl[ pindex ] >> 12);
    kunmap( &mem_map[ pfn_pgtbl ] );
    txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
Can’t cross a ‘page-boundary’
• In order for the NIC to fetch the user’s data
using its Bus-Master DMA capability, the
buffer needs to reside in a physically
contiguous memory-region
• But we can’t be sure Linux will have set up
the CPU’s page-tables that way – unless
the ‘buf’ is confined to a single page-frame
Truncate ‘len’ if necessary
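ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
{
    // keep the transfer inside the page-frame where 'buf' begins
    if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;

[Figure: ‘buf’ begins ‘offset’ bytes into one PAGE_SIZE frame and extends for ‘len’ bytes, possibly past the end of that frame into the next one.]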
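The truncation rule can be checked in isolation with a small user-space function. This is a sketch that assumes 4096-byte pages; the names SKETCH_PAGE_SIZE and clamp_to_page() are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u      // assumed page-frame size

// Returns how many of 'len' bytes starting at virtual address 'vaddr'
// lie within the page-frame that contains 'vaddr'.
size_t clamp_to_page( unsigned long vaddr, size_t len )
{
    size_t offset = vaddr & (SKETCH_PAGE_SIZE - 1);
    if ( offset + len > SKETCH_PAGE_SIZE ) len = SKETCH_PAGE_SIZE - offset;
    return len;
}
```

A buffer at 0x08049F2C has offset 3884, so a 1000-byte request is cut to 212 bytes while a 100-byte request is left alone. An application that needs the whole buffer sent would then have to call write() again for the remainder.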


‘zerocopy.c’
• We created this modification of our ‘nic2.c’
device-driver so its ‘my_write()’ function
lets an application perform transmissions
without performing a memory-to-memory
copy-operation (i.e., ‘copy_from_user()’)
• It is not so easy to implement ‘zero-copy’
for receiving packets – can you say why?
Website article
• We’ve posted a link on our CS686 website
to a frequently cited research-article about
the various issues that arise when trying to
implement the ‘zero-copy’ concept for the
case of ‘incoming’ network-packets:

The Need for Asynchronous, Zero-Copy Network I/O,
by Ulrich Drepper, Red Hat, Inc.
