HOTI 2017
DEVELOPING TO THE OPENFABRICS
INTERFACES LIBFABRIC
Sean Hefty, OFIWG Co-Chair Jim Swaro, GNI Maintainer
August, 2017
Intel Corporation Cray Incorporated
OVERVIEW
Design Guidelines – 10 minutes
Architecture – 20 minutes
API Bootstrap – 15 minutes
GASNet Usage – 20 minutes
MPICH Usage – 15 minutes
2
Implementation highlights
The original code
has been modified
to fit this screen
Lightning fast
introduction
OVERVIEW
Tutorial covers libfabric version 1.5 (unless noted)
•ABI version 1.1
•Minor changes from v1.0.x – 1.4.x (ABI 1.0)
Developer guide, source code, man pages, test
programs, and presentations available at:
www.libfabric.org
3
ACKNOWLEDGEMENTS
Sung Choi
Paul Grun, Cray
Howard Pritchard, Los Alamos National Lab
Bob Russell, University of New Hampshire
Jeff Squyres, Cisco
Sayantan Sur, Intel
The entire OFIWG community
4
DESIGN GUIDELINES
5
Charter: develop interfaces
aligned with user needs
Optimized SW path to HW
•Minimize cache and memory footprint
•Reduce instruction count
•Minimize memory accesses
Scalable
Implementation
Agnostic
Software interfaces aligned with
user requirements
•Careful requirement analysis
Inclusive development effort
•App and HW developers
Good impedance match with
multiple fabric hardware
•InfiniBand, iWARP, RoCE, raw Ethernet,
UDP offload, Omni-Path, GNI, BGQ, …
Open Source User-Centric
libfabric
User-centric interfaces will help foster fabric
innovation and accelerate their adoption
6
OFI USER REQUIREMENTS
7
Give us a high-level interface!
Give us a low-level interface!
MPI developers
OFI strives to meet
both requirements
Middleware is primary user
Looking for input on
expanding beyond HPC
OFI SOFTWARE DEVELOPMENT STRATEGIES
One Size Does Not Fit All
8
Fabric Services
User
OFI
Provider
User
OFI
Provider
Provider optimizes for
OFI features
Common optimization
for all apps/providers
Client uses OFI features
User
OFI
Provider
Client optimizes based
on supported features
Provider supports low-level features only
Linux, FreeBSD,
OS X, Windows
TCP and UDP
development support
ARCHITECTURE
9
ARCHITECTURE
10
Modes
Capabilities
OBJECT-MODEL
11
OFI only defines
semantic requirements
Example mappings shown in the figure:
- fabric → network
- domain → NIC
- address vector → peer address table
- passive endpoint → listener
- endpoint → command queues
- memory region → RDMA buffers
- poll set → manage multiple CQs and counters
- wait set → share wait objects (fd's)
ENDPOINT TYPES
12
Unconnected: FI_EP_DGRAM (unreliable datagram), FI_EP_RDM (reliable datagram)
Connected: FI_EP_MSG (reliable, connection-oriented)
ENDPOINT CONTEXTS
13
Default
Scalable Endpoints
Shared Contexts
Tx/Rx completions
may go to the same
or different CQs
Tx/Rx command
‘queues’
Share underlying
command queues
Targets multi-thread
access to hardware
ADDRESS VECTORS
14
Converts portable addressing (e.g. hostname or sockaddr) to a
fabric-specific address. Possible to share an AV between processes.

Example mappings:

  User Address (IP:Port)   AV Table: fi_addr -> Fabric Address   AV Map: fi_addr
  10.0.0.1:7000            0 -> 100:3:50                         100003050
  10.0.0.1:7001            1 -> 100:3:51                         100003051
  10.0.0.2:7000            2 -> 101:3:83                         101003083
  10.0.0.3:7003            3 -> 102:3:64                         102003064
  ...                      ...                                   ...

Table: addresses are referenced by an index
- No application storage required
- O(n) memory in provider
- Lookup required on transfers

Map: OFI returns a 64-bit value for the address
- n x 8 bytes of application memory
- No provider storage required
- Direct addressing possible
DATA TRANSFER TYPES
15
(figure: msg and tag entries queued at sender and receiver)
MSG – maintains message boundaries, FIFO ordering
TAGGED – messages carry a user 'tag' or id; the receiver selects
which tag goes with each buffer
DATA TRANSFER TYPES
16
(figure: byte stream vs. multicast dgram fan-out)
MSG Stream – data sent and received as a 'stream' (no message
boundaries); uses the 'MSG' APIs but different endpoint capabilities;
synchronous completion semantics (application always owns the buffer)
Multicast MSG – send to or receive from a multicast group
DATA TRANSFER TYPES
17
(figure: writes landing directly in remote memory; atomic f(x,y)
applied to data at the target)
RMA – RDMA semantics; direct reads or writes of remote memory from
the user's perspective
ATOMIC – specify the operation f(x,y) to perform on a selected
datatype; format of data at the target is known to the fabric services
API BOOTSTRAP
18
FI_GETINFO
struct fi_info *fi_allocinfo(void);
int fi_getinfo(
uint32_t version,
const char *node,
const char *service,
uint64_t flags,
struct fi_info *hints,
struct fi_info **info);
void fi_freeinfo(
struct fi_info *info);
19
struct fi_info {
struct fi_info *next;
uint64_t caps;
uint64_t mode;
uint32_t addr_format;
size_t src_addrlen;
size_t dest_addrlen;
void *src_addr;
void *dest_addr;
fid_t handle;
struct fi_tx_attr *tx_attr;
struct fi_rx_attr *rx_attr;
struct fi_ep_attr *ep_attr;
struct fi_domain_attr *domain_attr;
struct fi_fabric_attr *fabric_attr;
};
API version
~getaddrinfo
app needs
API semantics needed, and provider
requirements for using them
Detailed object attributes
CAPABILITY AND MODE BITS
20
• Desired services requested by app
• Primary – app must request to use
• E.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC
• Secondary – provider can indicate
availability
• E.g. FI_SOURCE, FI_MULTI_RECV
Capabilities
• Requirements placed on the app
• Improves performance when implemented by
application
• App indicates which modes it supports
• Provider clears modes not needed
• Sample:
FI_CONTEXT, FI_LOCAL_MR
Modes
ATTRIBUTES
struct fi_fabric_attr {
struct fid_fabric *fabric;
char *name;
char *prov_name;
uint32_t prov_version;
uint32_t api_version;
};
21
struct fi_domain_attr {
struct fid_domain *domain;
char *name;
enum fi_threading threading;
enum fi_progress control_progress;
enum fi_progress data_progress;
enum fi_resource_mgmt resource_mgmt;
enum fi_av_type av_type;
int mr_mode;
/* provider limits – fields omitted */
...
uint64_t caps;
uint64_t mode;
uint8_t *auth_key;
...
};
Provider details
Can also use env var to filter
Already opened resource
(if available)
How resources are
allocated among threads
for lockless access
Provider protects
against queue overruns
Do app threads
drive transfers
Secure communication
(job key)
ATTRIBUTES
struct fi_ep_attr {
enum fi_ep_type type;
uint32_t protocol;
uint32_t protocol_version;
size_t max_msg_size;
size_t msg_prefix_size;
size_t max_order_raw_size;
size_t max_order_war_size;
size_t max_order_waw_size;
uint64_t mem_tag_format;
size_t tx_ctx_cnt;
size_t rx_ctx_cnt;
size_t auth_key_size;
uint8_t *auth_key;
};
22
Indicates interoperability
Order of data placement
between two messages
Default, shared, or scalable
ATTRIBUTES
struct fi_tx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t inject_size;
size_t size;
size_t iov_limit;
size_t rma_iov_limit;
};
23
struct fi_rx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t total_buffered_recv;
size_t size;
size_t iov_limit;
};
Are completions
reported in order
“Fast”
message size
Can messages be sent
and received out of order
GASNET USAGE
24
GASNET AT A GLANCE
Networking API to enable PGAS languages
•Language-independent
•Interface intended as a compilation target
•Based on active messages
UPC, UPC++, Co-Array Fortran, Legion, Chapel
•Some of these have native ports to OFI
25
OFI OBJECTS
struct fid_fabric* gasnetc_ofi_fabricfd;
struct fid_domain* gasnetc_ofi_domainfd;
struct fid_av* gasnetc_ofi_avfd;
struct fid_cq* gasnetc_ofi_tx_cqfd;
struct fid_ep* gasnetc_ofi_rdma_epfd;
struct fid_mr* gasnetc_ofi_rdma_mrfd;
struct fid_ep* gasnetc_ofi_request_epfd;
struct fid_ep* gasnetc_ofi_reply_epfd;
struct fid_cq* gasnetc_ofi_request_cqfd;
struct fid_cq* gasnetc_ofi_reply_cqfd;
26
Separated active message
request-reply traffic
Get/put traffic on own
endpoint
INITIALIZATION
hints = fi_allocinfo();
hints->caps = FI_RMA | FI_MSG | FI_MULTI_RECV;
hints->mode = FI_CONTEXT;
hints->addr_format = FI_FORMAT_UNSPEC;
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;
hints->rx_attr->op_flags = FI_MULTI_RECV | FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;
hints->domain_attr->threading = FI_THREAD_SAFE;
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
hints->domain_attr->av_type = FI_AV_TABLE;
ret = fi_getinfo(OFI_CONDUIT_VERSION, NULL, NULL, 0ULL, hints, &info);
if (!strcmp(info->fabric_attr->prov_name, "psm") || ...) {
...
27
Generate completions
by default
Checks to enable provider
specific optimizations
Need thread safety
for most providers
Use a single buffer to
receive multiple messages
ADDRESS EXCHANGE
av_attr.type = FI_AV_TABLE;
av_attr.size = gasneti_nodes * NUM_OFI_ENDPOINTS;
fi_av_open(gasnetc_ofi_domainfd, &av_attr, &gasnetc_ofi_avfd, NULL);
fi_ep_bind(gasnetc_ofi_rdma_epfd, &gasnetc_ofi_avfd->fid, 0);
...
fi_getname(&gasnetc_ofi_rdma_epfd->fid, NULL, &rdmanamelen);
...
size_t total_len = reqnamelen + repnamelen + rdmanamelen;
on_node_addresses = gasneti_malloc(total_len);
char* alladdrs = gasneti_malloc(gasneti_nodes * total_len);
fi_getname(&gasnetc_ofi_rdma_epfd->fid, on_node_addresses, &rdmanamelen);
...
gasneti_bootstrapExchange(on_node_addresses, total_len, alladdrs);
fi_av_insert(gasnetc_ofi_avfd, alladdrs, gasneti_nodes * NUM_OFI_ENDPOINTS,
NULL, 0ULL, NULL);
28
Bind all local EPs
to same AV
3 EPs per node
Allocate buffer to store
addresses from all nodes
Get each EP address
All to all address exchange
MULTI-RECV BUFFERS
buff_size = gasneti_getenv_int_withdefault(... 1024 * 1024 ...);
region_start = gasneti_malloc_aligned(... buff_size * num);
metadata_array = gasneti_malloc(meta_size * num);
for(i = 0; i < num; i++) {
metadata = metadata_array + i;
setup_recv_msg(metadata, i);
setup_metadata(metadata, i);
fi_recvmsg((i % 2 == 0) ?
gasnetc_ofi_request_epfd :
gasnetc_ofi_reply_epfd,
&metadata->am_buff_msg, FI_MULTI_RECV);
}
29
Use a small (~4) pool of
large (~1MB) buffers to
receive messages
Metadata tracks the status
of each large buffer
Post receive buffers to the
request and reply active
message endpoints
MULTI-RECV BUFFERS METADATA
setup_recv_msg(metadata, i)
metadata->iov.iov_base = region_start + buff_size * i;
metadata->iov.iov_len = buff_size;
metadata->am_buff_msg.msg_iov = &metadata->iov;
metadata->am_buff_msg.iov_count = 1;
metadata->am_buff_msg.addr = FI_ADDR_UNSPEC;
metadata->am_buff_msg.desc = NULL;
metadata->am_buff_msg.context = &metadata->am_buff_ctxt.ctxt;
metadata->am_buff_msg.data = 0;
setup_metadata(metadata, i);
metadata->am_buff_ctxt.index = i;
metadata->am_buff_ctxt.final_cntr = 0;
metadata->am_buff_ctxt.event_cntr = 0;
gasnetc_paratomic_set(&metadata->am_buff_ctxt.consumed_cntr, 0, 0);
metadata->am_buff_ctxt.metadata = metadata;
30
Setup and save descriptor
used to (re-)post receive buffer
Reference counting used to track
when messages received into
the buffer are being processed
FI_CONTEXT requires we
give the provider some
storage space for their context
RECEIVING MESSAGE HANDLING
void gasnetc_ofi_am_recv_poll(
struct fid_ep *ep, struct fid_cq *cq)
{
struct fi_cq_data_entry re;
if (TRYLOCK(lock_p) == EBUSY)
return;
ret = fi_cq_read(cq, &re, 1);
if (ret == -FI_EAGAIN) {
UNLOCK(lock_p);
return;
}
gasnetc_ofi_ctxt_t *header = re.op_context;
header->event_cntr++;
if (re.flags & FI_MULTI_RECV)
header->final_cntr = header->event_cntr;
UNLOCK(lock_p);
31
if (re.flags & FI_RECV)
gasnetc_ofi_handle_am(re.buf,
is_request, re.len, re.data);
if (++header->consumed_cntr ==
header->final_cntr) {
metadata = header->metadata;
LOCK(&gasnetc_ofi_locks.am_rx);
fi_recvmsg(ep, &metadata->am_buff_msg,
FI_MULTI_RECV);
UNLOCK(&gasnetc_ofi_locks.am_rx);
}
}
Check CQ for a completion
1 event per message
FI_MULTI_RECV flag indicates
buffer has been released
Process request
or reply message
Repost receive buffer once
all callbacks complete
NONBLOCKING PUT MAPPING
gasnete_put_nb(gasnet_node_t node, void *dest,
void *src, size_t nbytes...)
{
if (nbytes <= max_buffered_send) {
use_inject(...)
} else if (nbytes <= gasnetc_ofi_bbuf_threshold) {
use_bounce_buffer(...)
} else {
put_and_wait(...)
}
}
32
General outline of
gasnet_put_nb flow
‘Inject’ feature allows
immediate re-use of buffer
Optimized for lowest latency
Copy into bounce buffer
Must wait for completion
put_and_wait
fi_write(gasnetc_ofi_rdma_epfd, src, nbytes,
NULL, GET_RDMA_DEST(node),
GET_REMOTEADDR(node, dest), 0, ctxt_ptr);
pending_rdma++;
while (pending_rdma)
GASNETC_OFI_POLL_EVERYTHING();
NONBLOCKING PUT MAPPING
33
use_inject
fi_writemsg(gasnetc_ofi_rdma_epfd, &msg,
FI_INJECT | FI_DELIVERY_COMPLETE);
pending_rdma++;
use_bounce_buffer
get_bounce_buf(nbytes, bbuf, bbuf_ctxt);
memcpy(bbuf, src, nbytes);
fi_write(gasnetc_ofi_rdma_epfd, bbuf,
nbytes, NULL, GET_RDMA_DEST(node),
GET_REMOTEADDR(node, dest), 0,
bbuf_ctxt);
pending_rdma++;
Copy source buffer into bounce
buffer. Source may be reused.
FI_INJECT – buffer is re-usable
after call returns, but still
generate a completion
Transfer data and wait for the
completion to indicate buffer is
no longer needed
MPICH USAGE
OFI CH4 NETMOD
34
MPICH AT A GLANCE
Reference MPI implementation
•MPI standards 1, 2, and 3
High-performance and scalable
•Used on most Top 500, including Top 10 supercomputers
Highly portable
•Platforms, OS, networks, CPUs…
Base for most vendor MPI derivatives
35
SELECT A MAPPING
36
• Present one possible mapping
• Designed for providers with a close semantic match to MPI
• MPICH ch4 / OpenMPI MTL
MPICH optimizes based on provider features
OFI building blocks used: MPI 2-sided → FI_TAGGED; MPI-3 RMA →
FI_MSG, FI_RMA; MPI progress → CQs and counters
(figure: the three development strategies from slide 8, repeated)
SELECT CAPABILITIES AND MODE BITS
hints = fi_allocinfo();
hints->mode = FI_CONTEXT | FI_ASYNC_IOV;
hints->caps |= FI_RMA;
hints->caps |= FI_ATOMICS;
if (do_tagged)
hints->caps |= FI_TAGGED;
if (do_data)
hints->caps |= FI_DIRECTED_RECV;
37
Provide the provider
context space
Request support for tagged
messages, RDMA read and
write, and atomic operations
The source address is used
as part of message matching
SPECIFY DOMAIN ATTRIBUTES
fi_version = FI_VERSION(MPIDI_OFI_MAJOR_VERSION,
MPIDI_OFI_MINOR_VERSION);
hints->addr_format = FI_FORMAT_UNSPEC;
hints->domain_attr->threading = FI_THREAD_DOMAIN;
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
hints->domain_attr->av_type = do_av_table ?
FI_AV_TABLE : FI_AV_MAP;
hints->domain_attr->mr_mode = do_mr_scalable ?
FI_MR_SCALABLE : FI_MR_BASIC;
38
Specify API version
MPI is coded to
Support common memory
registration modes (ABI v1.0)
MPI handles all synchronization
(lock-free provider)
MPI is capable of driving
progress (thread-less provider)
Provider will protect against local
and remote queue overruns
Support either address vector
format
SPECIFY ENDPOINT ATTRIBUTES
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE |
FI_COMPLETION;
hints->tx_attr->msg_order = FI_ORDER_SAS;
hints->tx_attr->comp_order = FI_ORDER_NONE;
hints->rx_attr->op_flags = FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;
hints->fabric_attr->prov_name = provname;
fi_getinfo(fi_version, NULL, NULL, 0ULL, hints, &prov);
39
Generate a completion after
transfer processed by peer
Optionally select a provider
and away we go
Sends must be
processed in order
Reliable-datagram endpoint
Transfers can
complete in any order
MAPPING MPI COMMUNICATORS
40
Communicators are groups of peer processes
• MPI_COMM_WORLD – all peers
• Referenced by rank (index)
• Map logical rank to physical network address
(figure: per-communicator rank tables mapping into a shared AV map,
an AV table, or an AV map — fi_addr_t values vs. integer indices)
All options are supported
OPENING ADDRESS VECTOR
char av_name[128];
snprintf(av_name, 127, "FI_NAMED_AV_%d\n", appnum);
av_attr.name = av_name;
av_attr.flags = FI_READ;
av_attr.map_addr = 0;
unsigned do_av_insert = 1;
if (0 == fi_av_open(MPIDI_Global.domain, &av_attr, &MPIDI_Global.av, NULL)) {
do_av_insert = 0;
mapped_table = (fi_addr_t *) av_attr.map_addr;
for (i = 0; i < size; i++)
MPIDI_OFI_AV(&MPIDIU_get_av(0, i)).dest = mapped_table[i];
41
Copying fi_addr_t
address table
Try opening a shared
(named) AV
TAGGED PROTOCOL
#ifdef USE_OFI_IMMEDIATE_DATA
/*
 * 64-bit match bits:
 * | protocol (4 bits) | unused | context id | message tag |
 */
#else
/*
 * 64-bit match bits:
 * | protocol (4 bits) | context id | source | message tag |
 */
#endif
42
Selection based on support
for FI_DIRECTED_RECV
Source rank carried in remote CQ data (immediate data)
Need source rank for matching
MPI SEND (TAGGED MESSAGES)
MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_send_handler(...)
{
if (is_inject) {
if (do_data)
fi_tinjectdata(ep, buf, len, dest, dest_addr, tag);
else
fi_tinject(ep, buf, len, dest_addr, tag);
} else {
if (do_data)
fi_tsenddata(ep, buf, len, desc, dest, dest_addr, tag, context);
else
fi_tsend(ep, buf, len, desc, dest_addr, tag, context);
}
}
43
Optimized transfers under
certain conditions
• No completions
• Immediate buffer reuse
• Limited transfer size
Protocol selection
Direct API mapping
Base endpoint or tx_tag
transmit context
fi_addr_t from
mapped_table[ ]
tag format based
on protocol
MPI RECV
MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_mpi_recv (...)
{
if (is_tagged) {
fi_trecv(ep, buf, len, desc, src_addr, tag, ignored, context);
} else {
fi_recv(ep, buf, len, desc, src_addr, context);
}
}
44
Protocol selection
Direct API mapping
Base endpoint or rx_tag
receive context
fi_addr_t from
mapped_table[ ]
tag format based
on protocol
MPI PROGRESS ENGINE
MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_progress(...)
{
struct fi_cq_tagged_entry wc = {0};
struct fi_cq_err_entry error = {0};
ompi_mtl_ofi_request_t *ofi_req = NULL;
while (true) {
ret = fi_cq_read(ompi_mtl_ofi.cq, (void *) &wc, 1);
if (ret > 0) {
// process good completion
...
} else if (ret == -FI_EAVAIL) {
ret = fi_cq_readerr(ompi_mtl_ofi.cq, &error, 0);
}
}
}
45
Direct API mapping
HOTI 2017
THANK YOU
Sean Hefty, OFIWG Co-Chair

More Related Content

What's hot (20)

PDF
Cisco's journey from Verbs to Libfabric
Jeff Squyres
 
PDF
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
Thomas Graf
 
PDF
OpenWrt From Top to Bottom
Kernel TLV
 
PDF
The State of libfabric in Open MPI
Jeff Squyres
 
PDF
Kvm performance optimization for ubuntu
Sim Janghoon
 
PPTX
eBPF Workshop
Michael Kehoe
 
PDF
EBPF and Linux Networking
PLUMgrid
 
PPTX
Linux Inter Process Communication
Abhishek Sagar
 
PDF
eBPF - Observability In Deep
Mydbops
 
PPTX
Linux Initialization Process (2)
shimosawa
 
PPT
Basic Linux Internals
mukul bhardwaj
 
ODP
Introduction To Makefile
Waqqas Jabbar
 
PDF
Jagan Teki - U-boot from scratch
linuxlab_conf
 
PDF
Embedded Linux Kernel - Build your custom kernel
Emertxe Information Technologies Pvt Ltd
 
PDF
UM2019 Extended BPF: A New Type of Software
Brendan Gregg
 
PDF
eBPF in the view of a storage developer
Richárd Kovács
 
PDF
z16 zOS Support - March 2023 - SHARE in Atlanta.pdf
Marna Walle
 
PPTX
Understanding eBPF in a Hurry!
Ray Jenkins
 
PDF
Introduction to eBPF
RogerColl2
 
PPTX
eBPF Basics
Michael Kehoe
 
Cisco's journey from Verbs to Libfabric
Jeff Squyres
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
Thomas Graf
 
OpenWrt From Top to Bottom
Kernel TLV
 
The State of libfabric in Open MPI
Jeff Squyres
 
Kvm performance optimization for ubuntu
Sim Janghoon
 
eBPF Workshop
Michael Kehoe
 
EBPF and Linux Networking
PLUMgrid
 
Linux Inter Process Communication
Abhishek Sagar
 
eBPF - Observability In Deep
Mydbops
 
Linux Initialization Process (2)
shimosawa
 
Basic Linux Internals
mukul bhardwaj
 
Introduction To Makefile
Waqqas Jabbar
 
Jagan Teki - U-boot from scratch
linuxlab_conf
 
Embedded Linux Kernel - Build your custom kernel
Emertxe Information Technologies Pvt Ltd
 
UM2019 Extended BPF: A New Type of Software
Brendan Gregg
 
eBPF in the view of a storage developer
Richárd Kovács
 
z16 zOS Support - March 2023 - SHARE in Atlanta.pdf
Marna Walle
 
Understanding eBPF in a Hurry!
Ray Jenkins
 
Introduction to eBPF
RogerColl2
 
eBPF Basics
Michael Kehoe
 

Similar to 2017 ofi-hoti-tutorial (20)

PDF
OpenFabrics Interfaces introduction
ofiwg
 
PDF
Advancing OpenFabrics Interfaces
inside-BigData.com
 
PDF
Hoti ofi 2015.doc
seanhefty
 
PPTX
A Taste of Open Fabrics Interfaces
seanhefty
 
PDF
Intel the-latest-on-ofi
Tracy Johnson
 
PDF
Intel the-latest-on-ofi
Intel® Software
 
PDF
OGF Cloud Standards: Current status and ongoing interoperability efforts wi...
Florian Feldhaus
 
PPT
Cisco crs1
wjunjmt
 
PPTX
FlowER Erlang Openflow Controller
Holger Winkelmann
 
PPT
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
Tal Lavian Ph.D.
 
PPTX
UNIT II DIS.pptx
Premkumar R
 
PDF
2003 scalable networking - unknown
George Ang
 
PDF
Design patternsforiot
Michael Koster
 
PDF
Userspace networking
Stephen Hemminger
 
PDF
Scalable Networking
l xf
 
PDF
Recent advance in netmap/VALE(mSwitch)
micchie
 
PPT
Tcp ip
mailalamin
 
PDF
Rlite software-architecture (1)
ARCFIRE ICT
 
PDF
Reliable Distributed Systems Technologies Web Services And Applications Kenne...
tilusdukettk
 
PPTX
Nfv compute domain
sidneel
 
OpenFabrics Interfaces introduction
ofiwg
 
Advancing OpenFabrics Interfaces
inside-BigData.com
 
Hoti ofi 2015.doc
seanhefty
 
A Taste of Open Fabrics Interfaces
seanhefty
 
Intel the-latest-on-ofi
Tracy Johnson
 
Intel the-latest-on-ofi
Intel® Software
 
OGF Cloud Standards: Current status and ongoing interoperability efforts wi...
Florian Feldhaus
 
Cisco crs1
wjunjmt
 
FlowER Erlang Openflow Controller
Holger Winkelmann
 
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
Tal Lavian Ph.D.
 
UNIT II DIS.pptx
Premkumar R
 
2003 scalable networking - unknown
George Ang
 
Design patternsforiot
Michael Koster
 
Userspace networking
Stephen Hemminger
 
Scalable Networking
l xf
 
Recent advance in netmap/VALE(mSwitch)
micchie
 
Tcp ip
mailalamin
 
Rlite software-architecture (1)
ARCFIRE ICT
 
Reliable Distributed Systems Technologies Web Services And Applications Kenne...
tilusdukettk
 
Nfv compute domain
sidneel
 
Ad

Recently uploaded (20)

PDF
Salesforce Experience Cloud Consultant.pdf
VALiNTRY360
 
PDF
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
PDF
Introduction to Apache Iceberg™ & Tableflow
Alluxio, Inc.
 
PDF
AI Prompts Cheat Code prompt engineering
Avijit Kumar Roy
 
PPTX
Smart Doctor Appointment Booking option in odoo.pptx
AxisTechnolabs
 
PDF
Code and No-Code Journeys: The Maintenance Shortcut
Applitools
 
PPTX
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
PDF
Ready Layer One: Intro to the Model Context Protocol
mmckenna1
 
PPTX
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
PDF
AOMEI Partition Assistant Crack 10.8.2 + WinPE Free Downlaod New Version 2025
bashirkhan333g
 
PDF
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
PDF
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PDF
Empower Your Tech Vision- Why Businesses Prefer to Hire Remote Developers fro...
logixshapers59
 
PPTX
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
PPTX
Get Started with Maestro: Agent, Robot, and Human in Action – Session 5 of 5
klpathrudu
 
PPTX
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
PDF
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
PDF
Latest Capcut Pro 5.9.0 Crack Version For PC {Fully 2025
utfefguu
 
PDF
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
Salesforce Experience Cloud Consultant.pdf
VALiNTRY360
 
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
Introduction to Apache Iceberg™ & Tableflow
Alluxio, Inc.
 
AI Prompts Cheat Code prompt engineering
Avijit Kumar Roy
 
Smart Doctor Appointment Booking option in odoo.pptx
AxisTechnolabs
 
Code and No-Code Journeys: The Maintenance Shortcut
Applitools
 
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
Ready Layer One: Intro to the Model Context Protocol
mmckenna1
 
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
AOMEI Partition Assistant Crack 10.8.2 + WinPE Free Downlaod New Version 2025
bashirkhan333g
 
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
Empower Your Tech Vision- Why Businesses Prefer to Hire Remote Developers fro...
logixshapers59
 
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
Get Started with Maestro: Agent, Robot, and Human in Action – Session 5 of 5
klpathrudu
 
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
Latest Capcut Pro 5.9.0 Crack Version For PC {Fully 2025
utfefguu
 
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
Ad

2017 ofi-hoti-tutorial

  • 1. HOTI 2017 DEVELOPING TO THE OPENFABRICS INTERFACES LIBFABRIC Sean Hefty, OFIWG Co-Chair Jim Swaro, GNI Maintainer August, 2017 Intel Corporation Cray Incorporated
  • 2. OVERVIEW Design Guidelines – 10 minutes Architecture – 20 minutes API Bootstrap – 15 minutes GASNet Usage – 20 minutes MPICH Usage – 15 minutes 2 Implementation highlights The original code has been modified to fit this screen Lightning fast introduction
  • 3. OVERVIEW Tutorial covers libfabric version 1.5 (unless noted) •ABI version 1.1 •Minor changes from v1.0.x – 1.4.x (ABI 1.0) Developer guide, source code, man pages, test programs, and presentations available at: www.libfabric.org 3
  • 4. ACKNOWLEDGEMENTS Sung Choi Paul Grun, Cray Howard Pritchard, Los Alamos National Lab Bob Russell, University of New Hampshire Jeff Squyres, Cisco Sayantan Sur, Intel The entire OFIWG community 4
  • 5. DESIGN GUIDELINES 5 Charter: develop interfaces aligned with user needs
  • 6. Optimized SW path to HW •Minimize cache and memory footprint •Reduce instruction count •Minimize memory accesses Scalable Implementation Agnostic Software interfaces aligned with user requirements •Careful requirement analysis Inclusive development effort •App and HW developers Good impedance match with multiple fabric hardware •InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, BGQ, … Open Source User-Centric libfabric User-centric interfaces will help foster fabric innovation and accelerate their adoption 6
  • 7. OFI USER REQUIREMENTS 7 Give us a high- level interface! Give us a low- level interface! MPI developers OFI strives to meet both requirements Middleware is primary user Looking for input on expanding beyond HPC
  • 8. OFI SOFTWARE DEVELOPMENT STRATEGIES One Size Does Not Fit All 8 Fabric Services User OFI Provider User OFI Provider Provider optimizes for OFI features Common optimization for all apps/providers Client uses OFI features User OFI Provider Client optimizes based on supported features Provider supports low-level features only Linux, FreeBSD, OS X, Windows TCP and UDP development support
  • 11. OBJECT-MODEL 11 OFI only defines semantic requirements NIC Network Peer address table Listener Command queues RDMA buffers Manage multiple CQs and counters Share wait objects (fd’s) example mappings
  • 13. ENDPOINT CONTEXTS 13 Default Scalable Endpoints Shared Contexts Tx/Rx completions may go to the same or different CQs Tx/Rx command ‘queues’ Share underlying command queues Targets multi-thread access to hardware
  • 14. ADDRESS VECTORS 14 Address Vector (Table) fi_addr Fabric Address 0 100:3:50 1 100:3:51 2 101:3:83 3 102:3:64 … … Address Vector (Map) fi_addr 100003050 100003051 101003083 102003064 … Addresses are referenced by an index - No application storage required - O(n) memory in provider - Lookup required on transfers OFI returns 64-bit value for address - n x 8 bytes application memory - No provider storage required - Direct addressing possible Converts portable addressing (e.g. hostname or sockaddr) to fabric specific address Possible to share AV between processes User Address IP:Port 10.0.0.1:7000 10.0.0.1:7001 10.0.0.2:7000 10.0.0.3:7003 … example mappings
  • 15. DATA TRANSFER TYPES 15 msg 2 msg 1msg 3 msg 2 msg 1msg 3 tag 2 tag 1tag 3 tag 2 tag 1 tag 3 TAGGED MSG Maintain message boundaries, FIFO Messages carry user ‘tag’ or id Receiver selects which tag goes with each buffer
  • 16. DATA TRANSFER TYPES 16 data 2 data 1data 3 data 1 data 2 data 3 MSG Stream dgram dgram dgram dgram dgram Multicast MSG Send to or receive from multicast group Data sent and received as ‘stream’ (no message boundaries) Uses ‘MSG’ APIs but different endpoint capabilities Synchronous completion semantics (application always owns buffer)
  • 17. DATA TRANSFER TYPES 17 write 2 write 1write 3 write 1 write 3 write 2 RMA RDMA semantics Direct reads or writes of remote memory from user perspective Specify operation to perform on selected datatype f(x,y) ATOMIC f(x,y) f(x,y) f(x,y) f() y g() y x g(x,y) g(x,y) g(x,y) x x Format of data at target is known to fabric services
  • 19. FI_GETINFO struct fi_info *fi_allocinfo(void); int fi_getinfo( uint32_t version, const char *node, const char *service, uint64_t flags, struct fi_info *hints, struct fi_info **info); void fi_freeinfo( struct fi_info *info); 19 struct fi_info { struct fi_info *next; uint64_t caps; uint64_t mode; uint32_t addr_format; size_t src_addrlen; size_t dest_addrlen; void *src_addr; void *dest_addr; fid_t handle; struct fi_tx_attr *tx_attr; struct fi_rx_attr *rx_attr; struct fi_ep_attr *ep_attr; struct fi_domain_attr*domain_attr; struct fi_fabric_attr*fabric_attr; }; API version ~getaddrinfo app needs API semantics needed, and provider requirements for using them Detailed object attributes
  • 20. CAPABILITY AND MODE BITS 20 • Desired services requested by app • Primary – app must request to use • E.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC • Secondary – provider can indicate availability • E.g. FI_SOURCE, FI_MULTI_RECV Capabilities • Requirements placed on the app • Improves performance when implemented by application • App indicates which modes it supports • Provider clears modes not needed • Sample: FI_CONTEXT, FI_LOCAL_MR Modes
  • 21. ATTRIBUTES struct fi_fabric_attr { struct fid_fabric *fabric; char *name; char *prov_name; uint32_t prov_version; uint32_t api_version; }; 21 struct fi_domain_attr { struct fid_domain *domain; char *name; enum fi_threading threading; enum fi_progress control_progress; enum fi_progress data_progress; enum fi_resource_mgmt resource_mgmt; enum fi_av_type av_type; int mr_mode; /* provider limits – fields omitted */ ... uint64_t caps; uint64_t mode; uint8_t *auth_key; ... }; Provider details Can also use env var to filter Already opened resource (if available) How resources are allocated among threads for lockless access Provider protects against queue overruns Do app threads drive transfers Secure communication (job key)
  • 22. ATTRIBUTES struct fi_ep_attr { enum fi_ep_type type; uint32_t protocol; uint32_t protocol_version; size_t max_msg_size; size_t msg_prefix_size; size_t max_order_raw_size; size_t max_order_war_size; size_t max_order_waw_size; uint64_t mem_tag_format; size_t tx_ctx_cnt; size_t rx_ctx_cnt; size_t auth_key_size; uint8_t *auth_key; }; 22 Indicates interoperability Order of data placement between two messages Default, shared, or scalable
  • 23. ATTRIBUTES struct fi_tx_attr { uint64_t caps; uint64_t mode; uint64_t op_flags; uint64_t msg_order; uint64_t comp_order; size_t inject_size; size_t size; size_t iov_limit; size_t rma_iov_limit; }; 23 struct fi_rx_attr { uint64_t caps; uint64_t mode; uint64_t op_flags; uint64_t msg_order; uint64_t comp_order; size_t total_buffered_recv; size_t size; size_t iov_limit; }; Are completions reported in order “Fast” message size Can messages be sent and received out of order
GASNET AT A GLANCE
Networking API to enable PGAS languages
•Language-independent
•Interface intended as a compilation target
•Based on active messages
Used by UPC, UPC++, Co-Array Fortran, Legion, Chapel
•Some of these have native ports to OFI
25
OFI OBJECTS

struct fid_fabric* gasnetc_ofi_fabricfd;
struct fid_domain* gasnetc_ofi_domainfd;
struct fid_av*     gasnetc_ofi_avfd;

/* get/put traffic on its own endpoint */
struct fid_cq* gasnetc_ofi_tx_cqfd;
struct fid_ep* gasnetc_ofi_rdma_epfd;
struct fid_mr* gasnetc_ofi_rdma_mrfd;

/* separate active message request-reply traffic */
struct fid_ep* gasnetc_ofi_request_epfd;
struct fid_ep* gasnetc_ofi_reply_epfd;
struct fid_cq* gasnetc_ofi_request_cqfd;
struct fid_cq* gasnetc_ofi_reply_cqfd;
26
INITIALIZATION

hints = fi_allocinfo();
hints->caps = FI_RMA | FI_MSG | FI_MULTI_RECV;
hints->mode = FI_CONTEXT;
hints->addr_format = FI_FORMAT_UNSPEC;
/* generate completions by default */
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;
/* use a single buffer to receive multiple messages */
hints->rx_attr->op_flags = FI_MULTI_RECV | FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;
/* need thread safety for most providers */
hints->domain_attr->threading = FI_THREAD_SAFE;
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
hints->domain_attr->av_type = FI_AV_TABLE;

ret = fi_getinfo(OFI_CONDUIT_VERSION, NULL, NULL, 0ULL, hints, &info);
/* check to enable provider specific optimizations */
if (!strcmp(info->fabric_attr->prov_name, "psm") || ...) {
    ...
27
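The info returned by fi_getinfo is then used to open the object hierarchy shown on slide 26. A minimal sketch of that sequence, with error handling, attribute setup, and cleanup elided for brevity; the `open_resources` helper name is for illustration only:

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Sketch: open fabric -> domain -> CQ/AV -> endpoint, then enable. */
static int open_resources(struct fi_info *info)
{
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_cq *cq;
    struct fid_av *av;
    struct fid_ep *ep;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_DATA };
    struct fi_av_attr av_attr = { .type = FI_AV_TABLE };

    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_cq_open(domain, &cq_attr, &cq, NULL);
    fi_av_open(domain, &av_attr, &av, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    /* endpoints must be bound to an AV and CQ before being enabled */
    fi_ep_bind(ep, &av->fid, 0);
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    return fi_enable(ep);
}
```

This mirrors the fabric/domain/AV/CQ/EP objects GASNet declares on slide 26.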
ADDRESS EXCHANGE

av_attr.type = FI_AV_TABLE;
av_attr.size = gasneti_nodes * NUM_OFI_ENDPOINTS;   /* 3 EPs per node */
fi_av_open(gasnetc_ofi_domainfd, &av_attr, &gasnetc_ofi_avfd, NULL);

/* bind all local EPs to the same AV */
fi_ep_bind(gasnetc_ofi_rdma_epfd, &gasnetc_ofi_avfd->fid, 0);
...
fi_getname(&gasnetc_ofi_rdma_epfd->fid, NULL, &rdmanamelen);
...
size_t total_len = reqnamelen + repnamelen + rdmanamelen;
on_node_addresses = gasneti_malloc(total_len);
/* allocate buffer to store addresses from all nodes */
char* alladdrs = gasneti_malloc(gasneti_nodes * total_len);

/* get each EP address */
fi_getname(&gasnetc_ofi_rdma_epfd->fid, on_node_addresses, &rdmanamelen);
...
/* all-to-all address exchange */
gasneti_bootstrapExchange(on_node_addresses, total_len, alladdrs);
fi_av_insert(gasnetc_ofi_avfd, alladdrs, gasneti_nodes * NUM_OFI_ENDPOINTS,
             NULL, 0ULL, NULL);
28
MULTI-RECV BUFFERS

/* use a small (~4) pool of large (~1MB) buffers to receive messages */
buff_size = gasneti_getenv_int_withdefault(... 1024 * 1024 ...);
region_start = gasneti_malloc_aligned(... buff_size * num);
/* metadata tracks the status of each large buffer */
metadata_array = gasneti_malloc(meta_size * num);

for (i = 0; i < num; i++) {
    metadata = metadata_array + i;
    setup_recv_msg(metadata, i);
    setup_metadata(metadata, i);
    /* post receive buffers to the request-reply active message endpoints */
    fi_recvmsg((i % 2 == 0) ? gasnetc_ofi_request_epfd : gasnetc_ofi_reply_epfd,
               &metadata->am_buff_msg, FI_MULTI_RECV);
}
29
MULTI-RECV BUFFER METADATA

setup_recv_msg(metadata, i):
    /* setup and save the descriptor used to (re-)post the receive buffer */
    metadata->iov.iov_base = region_start + buff_size * i;
    metadata->iov.iov_len = buff_size;
    metadata->am_buff_msg.msg_iov = &metadata->iov;
    metadata->am_buff_msg.iov_count = 1;
    metadata->am_buff_msg.addr = FI_ADDR_UNSPEC;
    metadata->am_buff_msg.desc = NULL;
    /* FI_CONTEXT requires giving the provider some storage space */
    metadata->am_buff_msg.context = &metadata->am_buff_ctxt.ctxt;
    metadata->am_buff_msg.data = 0;

setup_metadata(metadata, i):
    /* reference counting tracks when messages received into the buffer
     * are still being processed */
    metadata->am_buff_ctxt.index = i;
    metadata->am_buff_ctxt.final_cntr = 0;
    metadata->am_buff_ctxt.event_cntr = 0;
    gasnetc_paratomic_set(&metadata->am_buff_ctxt.consumed_cntr, 0, 0);
    metadata->am_buff_ctxt.metadata = metadata;
30
RECEIVE MESSAGE HANDLING

void gasnetc_ofi_am_recv_poll(struct fid_ep *ep, struct fid_cq *cq)
{
    struct fi_cq_data_entry re;

    if (TRYLOCK(lock_p) == EBUSY)
        return;

    /* check CQ for a completion */
    ret = fi_cq_read(cq, &re, 1);
    if (ret == -FI_EAGAIN) {
        UNLOCK(lock_p);
        return;
    }

    gasnetc_ofi_ctxt_t *header = re.op_context;
    header->event_cntr++;              /* 1 event per message */
    /* FI_MULTI_RECV flag indicates the buffer has been released */
    if (re.flags & FI_MULTI_RECV)
        header->final_cntr = header->event_cntr;
    UNLOCK(lock_p);

    /* process request or reply message */
    if (re.flags & FI_RECV)
        gasnetc_ofi_handle_am(re.buf, is_request, re.len, re.data);

    /* repost receive buffer once all callbacks complete */
    if (++header->consumed_cntr == header->final_cntr) {
        metadata = header->metadata;
        LOCK(&gasnetc_ofi_locks.am_rx);
        fi_recvmsg(ep, &metadata->am_buff_msg, FI_MULTI_RECV);
        UNLOCK(&gasnetc_ofi_locks.am_rx);
    }
}
31
NONBLOCKING PUT MAPPING

General outline of the gasnet_put_nb flow:

gasnete_put_nb(gasnet_node_t node, void *dest, void *src, size_t nbytes ...)
{
    if (nbytes <= max_buffered_send) {
        /* 'inject' allows immediate re-use of the buffer;
         * optimized for lowest latency */
        use_inject(...);
    } else if (nbytes <= gasnetc_ofi_bbuf_threshold) {
        use_bounce_buffer(...);   /* copy into bounce buffer */
    } else {
        put_and_wait(...);        /* must wait for completion */
    }
}
32
NONBLOCKING PUT MAPPING

use_inject:
    /* FI_INJECT – buffer is re-usable after the call returns,
     * but still generates a completion */
    fi_writemsg(gasnetc_ofi_rdma_epfd, &msg, FI_INJECT | FI_DELIVERY_COMPLETE);
    pending_rdma++;

use_bounce_buffer:
    /* copy source buffer into bounce buffer; source may be reused */
    get_bounce_buf(nbytes, bbuf, bbuf_ctxt);
    memcpy(bbuf, src, nbytes);
    fi_write(gasnetc_ofi_rdma_epfd, bbuf, nbytes, NULL, GET_RDMA_DEST(node),
             GET_REMOTEADDR(node, dest), 0, bbuf_ctxt);
    pending_rdma++;

put_and_wait:
    /* transfer data and wait for the completion indicating
     * the buffer is no longer needed */
    fi_write(gasnetc_ofi_rdma_epfd, src, nbytes, NULL, GET_RDMA_DEST(node),
             GET_REMOTEADDR(node, dest), 0, ctxt_ptr);
    pending_rdma++;
    while (pending_rdma)
        GASNETC_OFI_POLL_EVERYTHING();
33
MPICH USAGE
OFI CH4 netmod
34
MPICH AT A GLANCE
Reference MPI implementation
•MPI standards 1, 2, and 3
High-performance and scalable
•Used on most of the Top 500, including top-10 supercomputers
Highly portable
•Platforms, OSes, networks, CPUs…
Base for most vendor MPI derivatives
35
SELECT A MAPPING
MPICH optimizes based on provider features
•Presents one possible mapping
•Designed for providers with a close semantic match to MPI
•Used by MPICH ch4 / Open MPI MTL
OFI building blocks used:
•MPI two-sided → FI_TAGGED
•MPI-3 RMA → FI_MSG, FI_RMA
•MPI progress → counters, CQs
36
SELECT CAPABILITIES AND MODE BITS

hints = fi_allocinfo();
/* provide the provider context space */
hints->mode = FI_CONTEXT | FI_ASYNC_IOV;
/* request support for RDMA read/write and atomic operations */
hints->caps |= FI_RMA;
hints->caps |= FI_ATOMICS;
if (do_tagged)
    hints->caps |= FI_TAGGED;         /* tagged messages */
if (do_data)
    hints->caps |= FI_DIRECTED_RECV;  /* source address used as part of message matching */
37
SPECIFY DOMAIN ATTRIBUTES

/* specify the API version MPI is coded to */
fi_version = FI_VERSION(MPIDI_OFI_MAJOR_VERSION, MPIDI_OFI_MINOR_VERSION);
hints->addr_format = FI_FORMAT_UNSPEC;
/* MPI handles all synchronization (lock-free provider) */
hints->domain_attr->threading = FI_THREAD_DOMAIN;
/* MPI is capable of driving progress (thread-less provider) */
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
/* provider protects against local and remote queue overruns */
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
/* support either address vector format */
hints->domain_attr->av_type = do_av_table ? FI_AV_TABLE : FI_AV_MAP;
/* support common memory registration modes (ABI v1.0) */
hints->domain_attr->mr_mode = do_mr_scalable ? FI_MR_SCALABLE : FI_MR_BASIC;
38
SPECIFY ENDPOINT ATTRIBUTES

/* generate a completion after transfer is processed by the peer */
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;
hints->tx_attr->msg_order = FI_ORDER_SAS;    /* sends must be processed in order */
hints->tx_attr->comp_order = FI_ORDER_NONE;  /* transfers can complete in any order */
hints->rx_attr->op_flags = FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;            /* reliable-datagram endpoint */
hints->fabric_attr->prov_name = provname;    /* optionally select a provider */

/* and away we go */
fi_getinfo(fi_version, NULL, NULL, 0ULL, hints, &prov);
39
MAPPING MPI COMMUNICATORS
Communicators are groups of peer processes
•MPI_COMM_WORLD – all peers
•Peers referenced by rank (index)
•Map logical rank to physical network address
Options for the rank-to-address mapping (all are supported):
•Per-communicator map table of fi_addr_t integers over a shared AV map
•Communicators sharing an AV table
•Communicators sharing an AV map
40
OPENING ADDRESS VECTOR

char av_name[128];
snprintf(av_name, 127, "FI_NAMED_AV_%d\n", appnum);
av_attr.name = av_name;
av_attr.flags = FI_READ;
av_attr.map_addr = 0;
unsigned do_av_insert = 1;

/* try opening a shared (named) AV */
if (0 == fi_av_open(MPIDI_Global.domain, &av_attr, &MPIDI_Global.av, NULL)) {
    do_av_insert = 0;
    /* copy the fi_addr_t address table */
    mapped_table = (fi_addr_t *) av_attr.map_addr;
    for (i = 0; i < size; i++)
        MPIDI_OFI_AV(&MPIDIU_get_av(0, i)).dest = mapped_table[i];
41
TAGGED PROTOCOL

Selection is based on support for FI_DIRECTED_RECV. The source rank is
needed for matching; with immediate data it travels in the remote CQ data,
otherwise it is carried in the tag itself.

#ifdef USE_OFI_IMMEDIATE_DATA
/* source rank carried in remote CQ data:
 *
 * | protocol (4 bits) | unused | context id | message tag |
 */
#else
/* source rank carried in the tag:
 *
 * | protocol (4 bits) | context id | source | message tag |
 */
#endif
42
MPI SEND (TAGGED MESSAGES)

MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_send_handler(...)
{
    /* inject: optimized transfers under certain conditions –
     * no completions, immediate buffer reuse, limited transfer size */
    if (is_inject) {
        if (do_data)   /* protocol selection */
            fi_tinjectdata(ep, buf, len, dest, dest_addr, tag);
        else
            fi_tinject(ep, buf, len, dest_addr, tag);
    } else {
        if (do_data)
            fi_tsenddata(ep, buf, len, desc, dest, dest_addr, tag, context);
        else
            fi_tsend(ep, buf, len, desc, dest_addr, tag, context);
    }
}

Direct API mapping: ep is the base endpoint or a tagged transmit context,
dest_addr comes from mapped_table[ ], and the tag format is based on the protocol.
43
MPI RECV

MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_mpi_recv(...)
{
    if (is_tagged) {   /* protocol selection */
        fi_trecv(ep, buf, len, desc, src_addr, tag, ignore, context);
    } else {
        fi_recv(ep, buf, len, desc, src_addr, context);
    }
}

Direct API mapping: ep is the base endpoint or a receive context,
src_addr comes from mapped_table[ ], and the tag format is based on the protocol.
44
MPI PROGRESS ENGINE

MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_progress(...)
{
    struct fi_cq_tagged_entry wc = {0};
    struct fi_cq_err_entry error = {0};
    ompi_mtl_ofi_request_t *ofi_req = NULL;

    while (true) {
        /* direct API mapping */
        ret = fi_cq_read(ompi_mtl_ofi.cq, (void *) &wc, 1);
        if (ret > 0) {
            /* process good completion */
            ...
        } else if (ret == -FI_EAVAIL) {
            ret = fi_cq_readerr(ompi_mtl_ofi.cq, &error, 0);
        }
    }
}
45
HOTI 2017
THANK YOU
Sean Hefty, OFIWG Co-Chair