HOTI 2017
DEVELOPING TO THE OPENFABRICS
INTERFACES LIBFABRIC
Sean Hefty, OFIWG Co-Chair Jim Swaro, GNI Maintainer
August, 2017
Intel Corporation Cray Incorporated
OVERVIEW
Design Guidelines – 10 minutes
Architecture – 20 minutes
API Bootstrap – 15 minutes
GASNet Usage – 20 minutes
MPICH Usage – 15 minutes
2
Implementation highlights
The original code
has been modified
to fit this screen
Lightning fast
introduction
OVERVIEW
Tutorial covers libfabric version 1.5 (unless noted)
•ABI version 1.1
•Minor changes from v1.0.x – 1.4.x (ABI 1.0)
Developer guide, source code, man pages, test
programs, and presentations available at:
www.libfabric.org
3
ACKNOWLEDGEMENTS
Sung Choi
Paul Grun, Cray
Howard Pritchard, Los Alamos National Lab
Bob Russell, University of New Hampshire
Jeff Squyres, Cisco
Sayantan Sur, Intel
The entire OFIWG community
4
DESIGN GUIDELINES
5
Charter: develop interfaces
aligned with user needs
Optimized SW path to HW
•Minimize cache and memory footprint
•Reduce instruction count
•Minimize memory accesses
Scalable
Implementation
Agnostic
Software interfaces aligned with
user requirements
•Careful requirement analysis
Inclusive development effort
•App and HW developers
Good impedance match with
multiple fabric hardware
•InfiniBand, iWARP, RoCE, raw Ethernet,
UDP offload, Omni-Path, GNI, BGQ, …
Open Source User-Centric
libfabric
User-centric interfaces will help foster fabric
innovation and accelerate their adoption
6
OFI USER REQUIREMENTS
7
Give us a high-level interface!
Give us a low-level interface!
MPI developers
OFI strives to meet
both requirements
Middleware is primary user
Looking for input on
expanding beyond HPC
OFI SOFTWARE DEVELOPMENT STRATEGIES
One Size Does Not Fit All
8
Fabric Services
User
OFI
Provider
User
OFI
Provider
Provider optimizes for
OFI features
Common optimization
for all apps/providers
Client uses OFI features
User
OFI
Provider
Client optimizes based
on supported features
Provider supports low-level features only
Linux, FreeBSD,
OS X, Windows
TCP and UDP
development support
ARCHITECTURE
9
ARCHITECTURE
10
Modes
Capabilities
OBJECT-MODEL
11
OFI only defines
semantic requirements
Example mappings shown in the figure:
- fabric → network
- domain → NIC
- address vector → peer address table
- passive endpoint → listener
- endpoint → command queues
- memory region → RDMA buffers
- poll set → manage multiple CQs and counters
- wait set → share wait objects (fd's)
ENDPOINT TYPES
12
Unconnected: FI_EP_DGRAM (unreliable datagram), FI_EP_RDM (reliable datagram)
Connected: FI_EP_MSG (reliable, connection-oriented)
ENDPOINT CONTEXTS
13
Default
Scalable Endpoints
Shared Contexts
Tx/Rx completions
may go to the same
or different CQs
Tx/Rx command
‘queues’
Share underlying
command queues
Targets multi-thread
access to hardware
ADDRESS VECTORS
14
Converts portable addressing (e.g. hostname or sockaddr) to a
fabric-specific address. Possible to share an AV between processes.

Example mappings:

  User Address (IP:Port)   AV Table: fi_addr -> Fabric Address   AV Map: fi_addr
  10.0.0.1:7000            0 -> 100:3:50                         100003050
  10.0.0.1:7001            1 -> 100:3:51                         100003051
  10.0.0.2:7000            2 -> 101:3:83                         101003083
  10.0.0.3:7003            3 -> 102:3:64                         102003064
  ...                      ...                                   ...

Table: addresses are referenced by an index
- No application storage required
- O(n) memory in provider
- Lookup required on transfers

Map: OFI returns a 64-bit value for the address
- n x 8 bytes of application memory
- No provider storage required
- Direct addressing possible
DATA TRANSFER TYPES
15
(figure: msg and tag entries queued at sender and receiver)
MSG – maintains message boundaries, FIFO ordering
TAGGED – messages carry a user 'tag' or id; the receiver selects
which tag goes with each buffer
DATA TRANSFER TYPES
16
(figure: byte stream vs. multicast dgram fan-out)
MSG Stream – data sent and received as a 'stream' (no message
boundaries); uses the 'MSG' APIs but different endpoint capabilities;
synchronous completion semantics (application always owns the buffer)
Multicast MSG – send to or receive from a multicast group
DATA TRANSFER TYPES
17
(figure: writes landing directly in remote memory; atomic f(x,y)
applied to data at the target)
RMA – RDMA semantics; direct reads or writes of remote memory from
the user's perspective
ATOMIC – specify the operation f(x,y) to perform on a selected
datatype; format of data at the target is known to the fabric services
API BOOTSTRAP
18
FI_GETINFO
struct fi_info *fi_allocinfo(void);
int fi_getinfo(
uint32_t version,
const char *node,
const char *service,
uint64_t flags,
struct fi_info *hints,
struct fi_info **info);
void fi_freeinfo(
struct fi_info *info);
19
struct fi_info {
struct fi_info *next;
uint64_t caps;
uint64_t mode;
uint32_t addr_format;
size_t src_addrlen;
size_t dest_addrlen;
void *src_addr;
void *dest_addr;
fid_t handle;
struct fi_tx_attr *tx_attr;
struct fi_rx_attr *rx_attr;
struct fi_ep_attr *ep_attr;
struct fi_domain_attr *domain_attr;
struct fi_fabric_attr *fabric_attr;
};
API version
~getaddrinfo
app needs
API semantics needed, and provider
requirements for using them
Detailed object attributes
CAPABILITY AND MODE BITS
20
• Desired services requested by app
• Primary – app must request to use
• E.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC
• Secondary – provider can indicate
availability
• E.g. FI_SOURCE, FI_MULTI_RECV
Capabilities
• Requirements placed on the app
• Improves performance when implemented by
application
• App indicates which modes it supports
• Provider clears modes not needed
• Sample:
FI_CONTEXT, FI_LOCAL_MR
Modes
ATTRIBUTES
struct fi_fabric_attr {
struct fid_fabric *fabric;
char *name;
char *prov_name;
uint32_t prov_version;
uint32_t api_version;
};
21
struct fi_domain_attr {
struct fid_domain *domain;
char *name;
enum fi_threading threading;
enum fi_progress control_progress;
enum fi_progress data_progress;
enum fi_resource_mgmt resource_mgmt;
enum fi_av_type av_type;
int mr_mode;
/* provider limits – fields omitted */
...
uint64_t caps;
uint64_t mode;
uint8_t *auth_key;
...
};
Provider details
Can also use env var to filter
Already opened resource
(if available)
How resources are
allocated among threads
for lockless access
Provider protects
against queue overruns
Do app threads
drive transfers
Secure communication
(job key)
ATTRIBUTES
struct fi_ep_attr {
enum fi_ep_type type;
uint32_t protocol;
uint32_t protocol_version;
size_t max_msg_size;
size_t msg_prefix_size;
size_t max_order_raw_size;
size_t max_order_war_size;
size_t max_order_waw_size;
uint64_t mem_tag_format;
size_t tx_ctx_cnt;
size_t rx_ctx_cnt;
size_t auth_key_size;
uint8_t *auth_key;
};
22
Indicates interoperability
Order of data placement
between two messages
Default, shared, or scalable
ATTRIBUTES
struct fi_tx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t inject_size;
size_t size;
size_t iov_limit;
size_t rma_iov_limit;
};
23
struct fi_rx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t total_buffered_recv;
size_t size;
size_t iov_limit;
};
Are completions
reported in order
“Fast”
message size
Can messages be sent
and received out of order
GASNET USAGE
24
GASNET AT A GLANCE
Networking API to enable PGAS languages
•Language-independent
•Interface intended as a compilation target
•Based on active messages
UPC, UPC++, Co-Array Fortran, Legion, Chapel
•Some of these have native ports to OFI
25
OFI OBJECTS
struct fid_fabric* gasnetc_ofi_fabricfd;
struct fid_domain* gasnetc_ofi_domainfd;
struct fid_av* gasnetc_ofi_avfd;
struct fid_cq* gasnetc_ofi_tx_cqfd;
struct fid_ep* gasnetc_ofi_rdma_epfd;
struct fid_mr* gasnetc_ofi_rdma_mrfd;
struct fid_ep* gasnetc_ofi_request_epfd;
struct fid_ep* gasnetc_ofi_reply_epfd;
struct fid_cq* gasnetc_ofi_request_cqfd;
struct fid_cq* gasnetc_ofi_reply_cqfd;
26
Separated active message
request-reply traffic
Get/put traffic on own
endpoint
INITIALIZATION
hints = fi_allocinfo();
hints->caps = FI_RMA | FI_MSG | FI_MULTI_RECV;
hints->mode = FI_CONTEXT;
hints->addr_format = FI_FORMAT_UNSPEC;
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;
hints->rx_attr->op_flags = FI_MULTI_RECV | FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;
hints->domain_attr->threading = FI_THREAD_SAFE;
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
hints->domain_attr->av_type = FI_AV_TABLE;
ret = fi_getinfo(OFI_CONDUIT_VERSION, NULL, NULL, 0ULL, hints, &info);
if (!strcmp(info->fabric_attr->prov_name, "psm") || ...) {
...
27
Generate completions
by default
Checks to enable provider
specific optimizations
Need thread safety
for most providers
Use a single buffer to
receive multiple messages
ADDRESS EXCHANGE
av_attr.type = FI_AV_TABLE;
av_attr.size = gasneti_nodes * NUM_OFI_ENDPOINTS;
fi_av_open(gasnetc_ofi_domainfd, &av_attr, &gasnetc_ofi_avfd, NULL);
fi_ep_bind(gasnetc_ofi_rdma_epfd, &gasnetc_ofi_avfd->fid, 0);
...
fi_getname(&gasnetc_ofi_rdma_epfd->fid, NULL, &rdmanamelen);
...
size_t total_len = reqnamelen + repnamelen + rdmanamelen;
on_node_addresses = gasneti_malloc(total_len);
char* alladdrs = gasneti_malloc(gasneti_nodes * total_len);
fi_getname(&gasnetc_ofi_rdma_epfd->fid, on_node_addresses, &rdmanamelen);
...
gasneti_bootstrapExchange(on_node_addresses, total_len, alladdrs);
fi_av_insert(gasnetc_ofi_avfd, alladdrs, gasneti_nodes * NUM_OFI_ENDPOINTS,
NULL, 0ULL, NULL);
28
Bind all local EPs
to same AV
3 EPs per node
Allocate buffer to store
addresses from all nodes
Get each EP address
All to all address exchange
MULTI-RECV BUFFERS
buff_size = gasneti_getenv_int_withdefault(... 1024 * 1024 ...);
region_start = gasneti_malloc_aligned(... buff_size * num);
metadata_array = gasneti_malloc(meta_size * num);
for(i = 0; i < num; i++) {
metadata = metadata_array + i;
setup_recv_msg(metadata, i);
setup_metadata(metadata, i);
fi_recvmsg((i % 2 == 0) ?
gasnetc_ofi_request_epfd :
gasnetc_ofi_reply_epfd,
&metadata->am_buff_msg, FI_MULTI_RECV);
}
29
Use a small (~4) pool of
large (~1MB) buffers to
receive messages
Metadata tracks the status
of each large buffer
Post receive buffers to the
request and reply active
message endpoints
MULTI-RECV BUFFERS METADATA
setup_recv_msg(metadata, i)
metadata->iov.iov_base = region_start + buff_size * i;
metadata->iov.iov_len = buff_size;
metadata->am_buff_msg.msg_iov = &metadata->iov;
metadata->am_buff_msg.iov_count = 1;
metadata->am_buff_msg.addr = FI_ADDR_UNSPEC;
metadata->am_buff_msg.desc = NULL;
metadata->am_buff_msg.context = &metadata->am_buff_ctxt.ctxt;
metadata->am_buff_msg.data = 0;
setup_metadata(metadata, i);
metadata->am_buff_ctxt.index = i;
metadata->am_buff_ctxt.final_cntr = 0;
metadata->am_buff_ctxt.event_cntr = 0;
gasnetc_paratomic_set(&metadata->am_buff_ctxt.consumed_cntr, 0, 0);
metadata->am_buff_ctxt.metadata = metadata;
30
Setup and save descriptor
used to (re-)post receive buffer
Reference counting used to track
when messages received into
the buffer are being processed
FI_CONTEXT requires we
give the provider some
storage space for their context
RECEIVING MESSAGE HANDLING
void gasnetc_ofi_am_recv_poll(
struct fid_ep *ep, struct fid_cq *cq)
{
struct fi_cq_data_entry re;
if (TRYLOCK(lock_p) == EBUSY)
return;
ret = fi_cq_read(cq, &re, 1);
if (ret == -FI_EAGAIN) {
UNLOCK(lock_p);
return;
}
gasnetc_ofi_ctxt_t *header = re.op_context;
header->event_cntr++;
if (re.flags & FI_MULTI_RECV)
header->final_cntr = header->event_cntr;
UNLOCK(lock_p);
31
if (re.flags & FI_RECV)
gasnetc_ofi_handle_am(re.buf,
is_request, re.len, re.data);
if (++header->consumed_cntr ==
header->final_cntr) {
metadata = header->metadata;
LOCK(&gasnetc_ofi_locks.am_rx);
fi_recvmsg(ep, &metadata->am_buff_msg,
FI_MULTI_RECV);
UNLOCK(&gasnetc_ofi_locks.am_rx);
}
}
Check CQ for a completion
1 event per message
FI_MULTI_RECV flag indicates
buffer has been released
Process request
or reply message
Repost receive buffer once
all callbacks complete
NONBLOCKING PUT MAPPING
gasnete_put_nb(gasnet_node_t node, void *dest,
void *src, size_t nbytes...)
{
if (nbytes <= max_buffered_send) {
use_inject(...)
} else if (nbytes <= gasnetc_ofi_bbuf_threshold) {
use_bounce_buffer(...)
} else {
put_and_wait(...)
}
}
32
General outline of
gasnet_put_nb flow
‘Inject’ feature allows
immediate re-use of buffer
Optimized for lowest latency
Copy into bounce buffer
Must wait for completion
put_and_wait
fi_write(gasnetc_ofi_rdma_epfd, src, nbytes,
NULL, GET_RDMA_DEST(node),
GET_REMOTEADDR(node, dest), 0, ctxt_ptr);
pending_rdma++;
while (pending_rdma)
GASNETC_OFI_POLL_EVERYTHING();
NONBLOCKING PUT MAPPING
33
use_inject
fi_writemsg(gasnetc_ofi_rdma_epfd, &msg,
FI_INJECT | FI_DELIVERY_COMPLETE);
pending_rdma++;
use_bounce_buffer
get_bounce_buf(nbytes, bbuf, bbuf_ctxt);
memcpy(bbuf, src, nbytes);
fi_write(gasnetc_ofi_rdma_epfd, bbuf,
nbytes, NULL, GET_RDMA_DEST(node),
GET_REMOTEADDR(node, dest), 0,
bbuf_ctxt);
pending_rdma++;
Copy source buffer into bounce
buffer. Source may be reused.
FI_INJECT – buffer is re-usable
after call returns, but still
generate a completion
Transfer data and wait for the
completion to indicate buffer is
no longer needed
MPICH USAGE
OFI CH4 NETMOD
34
MPICH AT A GLANCE
Reference MPI implementation
•MPI standards 1, 2, and 3
High-performance and scalable
•Used on most Top 500, including Top 10 supercomputers
Highly portable
•Platforms, OS, networks, CPUs…
Base for most vendor MPI derivatives
35
SELECT A MAPPING
36
• Present one possible mapping
• Designed for providers with a close semantic match to MPI
• MPICH ch4 / OpenMPI MTL
MPICH optimizes based on provider features
OFI building blocks used: MPI 2-sided → FI_TAGGED; MPI-3 RMA →
FI_MSG, FI_RMA; MPI progress → CQs and counters
(figure: the three development strategies from slide 8, repeated)
SELECT CAPABILITIES AND MODE BITS
hints = fi_allocinfo();
hints->mode = FI_CONTEXT | FI_ASYNC_IOV;
hints->caps |= FI_RMA;
hints->caps |= FI_ATOMICS;
if (do_tagged)
hints->caps |= FI_TAGGED;
if (do_data)
hints->caps |= FI_DIRECTED_RECV;
37
Provide the provider
context space
Request support for tagged
messages, RDMA read and
write, and atomic operations
The source address is used
as part of message matching
SPECIFY DOMAIN ATTRIBUTES
fi_version = FI_VERSION(MPIDI_OFI_MAJOR_VERSION,
MPIDI_OFI_MINOR_VERSION);
hints->addr_format = FI_FORMAT_UNSPEC;
hints->domain_attr->threading = FI_THREAD_DOMAIN;
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
hints->domain_attr->av_type = do_av_table ?
FI_AV_TABLE : FI_AV_MAP;
hints->domain_attr->mr_mode = do_mr_scalable ?
FI_MR_SCALABLE : FI_MR_BASIC;
38
Specify API version
MPI is coded to
Support common memory
registration modes (ABI v1.0)
MPI handles all synchronization
(lock-free provider)
MPI is capable of driving
progress (thread-less provider)
Provider will protect against local
and remote queue overruns
Support either address vector
format
SPECIFY ENDPOINT ATTRIBUTES
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE |
FI_COMPLETION;
hints->tx_attr->msg_order = FI_ORDER_SAS;
hints->tx_attr->comp_order = FI_ORDER_NONE;
hints->rx_attr->op_flags = FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;
hints->fabric_attr->prov_name = provname;
fi_getinfo(fi_version, NULL, NULL, 0ULL, hints, &prov);
39
Generate a completion after
transfer processed by peer
Optionally select a provider
and away we go
Sends must be
processed in order
Reliable-datagram endpoint
Transfers can
complete in any order
MAPPING MPI COMMUNICATORS
40
Communicators are groups of peer processes
• MPI_COMM_WORLD – all peers
• Referenced by rank (index)
• Map logical rank to physical network address
(figure: per-communicator rank tables mapping into a shared AV map,
an AV table, or an AV map — fi_addr_t values vs. integer indices)
All options are supported
OPENING ADDRESS VECTOR
char av_name[128];
snprintf(av_name, 127, "FI_NAMED_AV_%d\n", appnum);
av_attr.name = av_name;
av_attr.flags = FI_READ;
av_attr.map_addr = 0;
unsigned do_av_insert = 1;
if (0 == fi_av_open(MPIDI_Global.domain, &av_attr, &MPIDI_Global.av, NULL)) {
do_av_insert = 0;
mapped_table = (fi_addr_t *) av_attr.map_addr;
for (i = 0; i < size; i++)
MPIDI_OFI_AV(&MPIDIU_get_av(0, i)).dest = mapped_table[i];
41
Copying fi_addr_t
address table
Try opening a shared
(named) AV
TAGGED PROTOCOL
#ifdef USE_OFI_IMMEDIATE_DATA
/*
 * 64-bit match bits:
 * | protocol (4 bits) | unused | context id | message tag |
 */
#else
/*
 * 64-bit match bits:
 * | protocol (4 bits) | context id | source | message tag |
 */
#endif
42
Selection based on support
for FI_DIRECTED_RECV
Source rank carried in remote CQ data (immediate data)
Need source rank for matching
MPI SEND (TAGGED MESSAGES)
MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_send_handler(...)
{
if (is_inject) {
if (do_data)
fi_tinjectdata(ep, buf, len, dest, dest_addr, tag);
else
fi_tinject(ep, buf, len, dest_addr, tag);
} else {
if (do_data)
fi_tsenddata(ep, buf, len, desc, dest, dest_addr, tag, context);
else
fi_tsend(ep, buf, len, desc, dest_addr, tag, context);
}
}
43
Optimized transfers under
certain conditions
• No completions
• Immediate buffer reuse
• Limited transfer size
Protocol selection
Direct API mapping
Base endpoint or tx_tag
transmit context
fi_addr_t from
mapped_table[ ]
tag format based
on protocol
MPI RECV
MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_mpi_recv (...)
{
if (is_tagged) {
fi_trecv(ep, buf, len, desc, src_addr, tag, ignored, context);
} else {
fi_recv(ep, buf, len, desc, src_addr, context);
}
}
44
Protocol selection
Direct API mapping
Base endpoint or rx_tag
receive context
fi_addr_t from
mapped_table[ ]
tag format based
on protocol
MPI PROGRESS ENGINE
MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_progress(...)
{
struct fi_cq_tagged_entry wc = {0};
struct fi_cq_err_entry error = {0};
ompi_mtl_ofi_request_t *ofi_req = NULL;
while (true) {
ret = fi_cq_read(ompi_mtl_ofi.cq, (void *) &wc, 1);
if (ret > 0) {
// process good completion
...
} else if (ret == -FI_EAVAIL) {
ret = fi_cq_readerr(ompi_mtl_ofi.cq, &error, 0);
}
}
}
45
Direct API mapping
HOTI 2017
THANK YOU
Sean Hefty, OFIWG Co-Chair

More Related Content

What's hot (20)

PDF
Cisco's journey from Verbs to Libfabric
Jeff Squyres
 
PDF
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
Thomas Graf
 
PDF
OpenWrt From Top to Bottom
Kernel TLV
 
PDF
The State of libfabric in Open MPI
Jeff Squyres
 
PDF
Kvm performance optimization for ubuntu
Sim Janghoon
 
PPTX
eBPF Workshop
Michael Kehoe
 
PDF
EBPF and Linux Networking
PLUMgrid
 
PPTX
Linux Inter Process Communication
Abhishek Sagar
 
PDF
eBPF - Observability In Deep
Mydbops
 
PPTX
Linux Initialization Process (2)
shimosawa
 
PPT
Basic Linux Internals
mukul bhardwaj
 
ODP
Introduction To Makefile
Waqqas Jabbar
 
PDF
Jagan Teki - U-boot from scratch
linuxlab_conf
 
PDF
Embedded Linux Kernel - Build your custom kernel
Emertxe Information Technologies Pvt Ltd
 
PDF
UM2019 Extended BPF: A New Type of Software
Brendan Gregg
 
PDF
eBPF in the view of a storage developer
Richárd Kovács
 
PDF
z16 zOS Support - March 2023 - SHARE in Atlanta.pdf
Marna Walle
 
PPTX
Understanding eBPF in a Hurry!
Ray Jenkins
 
PDF
Introduction to eBPF
RogerColl2
 
PPTX
eBPF Basics
Michael Kehoe
 
Cisco's journey from Verbs to Libfabric
Jeff Squyres
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
Thomas Graf
 
OpenWrt From Top to Bottom
Kernel TLV
 
The State of libfabric in Open MPI
Jeff Squyres
 
Kvm performance optimization for ubuntu
Sim Janghoon
 
eBPF Workshop
Michael Kehoe
 
EBPF and Linux Networking
PLUMgrid
 
Linux Inter Process Communication
Abhishek Sagar
 
eBPF - Observability In Deep
Mydbops
 
Linux Initialization Process (2)
shimosawa
 
Basic Linux Internals
mukul bhardwaj
 
Introduction To Makefile
Waqqas Jabbar
 
Jagan Teki - U-boot from scratch
linuxlab_conf
 
Embedded Linux Kernel - Build your custom kernel
Emertxe Information Technologies Pvt Ltd
 
UM2019 Extended BPF: A New Type of Software
Brendan Gregg
 
eBPF in the view of a storage developer
Richárd Kovács
 
z16 zOS Support - March 2023 - SHARE in Atlanta.pdf
Marna Walle
 
Understanding eBPF in a Hurry!
Ray Jenkins
 
Introduction to eBPF
RogerColl2
 
eBPF Basics
Michael Kehoe
 

Similar to 2017 ofi-hoti-tutorial (20)

PDF
OpenFabrics Interfaces introduction
ofiwg
 
PDF
Advancing OpenFabrics Interfaces
inside-BigData.com
 
PDF
Hoti ofi 2015.doc
seanhefty
 
PPTX
A Taste of Open Fabrics Interfaces
seanhefty
 
PDF
Intel the-latest-on-ofi
Tracy Johnson
 
PDF
Intel the-latest-on-ofi
Intel® Software
 
PDF
OGF Cloud Standards: Current status and ongoing interoperability efforts wi...
Florian Feldhaus
 
PPT
Cisco crs1
wjunjmt
 
PPTX
FlowER Erlang Openflow Controller
Holger Winkelmann
 
PPT
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
Tal Lavian Ph.D.
 
PPTX
UNIT II DIS.pptx
Premkumar R
 
PDF
2003 scalable networking - unknown
George Ang
 
PDF
Design patternsforiot
Michael Koster
 
PDF
Userspace networking
Stephen Hemminger
 
PDF
Scalable Networking
l xf
 
PDF
Recent advance in netmap/VALE(mSwitch)
micchie
 
PPT
Tcp ip
mailalamin
 
PDF
Rlite software-architecture (1)
ARCFIRE ICT
 
PDF
Reliable Distributed Systems Technologies Web Services And Applications Kenne...
tilusdukettk
 
PPTX
Nfv compute domain
sidneel
 
OpenFabrics Interfaces introduction
ofiwg
 
Advancing OpenFabrics Interfaces
inside-BigData.com
 
Hoti ofi 2015.doc
seanhefty
 
A Taste of Open Fabrics Interfaces
seanhefty
 
Intel the-latest-on-ofi
Tracy Johnson
 
Intel the-latest-on-ofi
Intel® Software
 
OGF Cloud Standards: Current status and ongoing interoperability efforts wi...
Florian Feldhaus
 
Cisco crs1
wjunjmt
 
FlowER Erlang Openflow Controller
Holger Winkelmann
 
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
Tal Lavian Ph.D.
 
UNIT II DIS.pptx
Premkumar R
 
2003 scalable networking - unknown
George Ang
 
Design patternsforiot
Michael Koster
 
Userspace networking
Stephen Hemminger
 
Scalable Networking
l xf
 
Recent advance in netmap/VALE(mSwitch)
micchie
 
Tcp ip
mailalamin
 
Rlite software-architecture (1)
ARCFIRE ICT
 
Reliable Distributed Systems Technologies Web Services And Applications Kenne...
tilusdukettk
 
Nfv compute domain
sidneel
 
Ad

Recently uploaded (20)

PDF
Salesforce Experience Cloud Consultant.pdf
VALiNTRY360
 
PDF
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
PDF
Introduction to Apache Iceberg™ & Tableflow
Alluxio, Inc.
 
PDF
AI Prompts Cheat Code prompt engineering
Avijit Kumar Roy
 
PPTX
Smart Doctor Appointment Booking option in odoo.pptx
AxisTechnolabs
 
PDF
Code and No-Code Journeys: The Maintenance Shortcut
Applitools
 
PPTX
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
PDF
Ready Layer One: Intro to the Model Context Protocol
mmckenna1
 
PPTX
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
PDF
AOMEI Partition Assistant Crack 10.8.2 + WinPE Free Downlaod New Version 2025
bashirkhan333g
 
PDF
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
PDF
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PDF
Empower Your Tech Vision- Why Businesses Prefer to Hire Remote Developers fro...
logixshapers59
 
PPTX
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
PPTX
Get Started with Maestro: Agent, Robot, and Human in Action – Session 5 of 5
klpathrudu
 
PPTX
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
PDF
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
PDF
Latest Capcut Pro 5.9.0 Crack Version For PC {Fully 2025
utfefguu
 
PDF
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
Salesforce Experience Cloud Consultant.pdf
VALiNTRY360
 
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
Introduction to Apache Iceberg™ & Tableflow
Alluxio, Inc.
 
AI Prompts Cheat Code prompt engineering
Avijit Kumar Roy
 
Smart Doctor Appointment Booking option in odoo.pptx
AxisTechnolabs
 
Code and No-Code Journeys: The Maintenance Shortcut
Applitools
 
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
Ready Layer One: Intro to the Model Context Protocol
mmckenna1
 
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
AOMEI Partition Assistant Crack 10.8.2 + WinPE Free Downlaod New Version 2025
bashirkhan333g
 
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
Empower Your Tech Vision- Why Businesses Prefer to Hire Remote Developers fro...
logixshapers59
 
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
Get Started with Maestro: Agent, Robot, and Human in Action – Session 5 of 5
klpathrudu
 
Function & Procedure: Function Vs Procedure in PL/SQL
Shani Tiwari
 
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
Latest Capcut Pro 5.9.0 Crack Version For PC {Fully 2025
utfefguu
 
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
Ad

2017 ofi-hoti-tutorial

  • 1. HOTI 2017 DEVELOPING TO THE OPENFABRICS INTERFACES LIBFABRIC Sean Hefty, OFIWG Co-Chair Jim Swaro, GNI Maintainer August, 2017 Intel Corporation Cray Incorporated
  • 2. OVERVIEW Design Guidelines – 10 minutes Architecture – 20 minutes API Bootstrap – 15 minutes GASNet Usage – 20 minutes MPICH Usage – 15 minutes 2 Implementation highlights The original code has been modified to fit this screen Lightning fast introduction
  • 3. OVERVIEW Tutorial covers libfabric version 1.5 (unless noted) •ABI version 1.1 •Minor changes from v1.0.x – 1.4.x (ABI 1.0) Developer guide, source code, man pages, test programs, and presentations available at: www.libfabric.org 3
  • 4. ACKNOWLEDGEMENTS Sung Choi Paul Grun, Cray Howard Pritchard, Los Alamos National Lab Bob Russell, University of New Hampshire Jeff Squyres, Cisco Sayantan Sur, Intel The entire OFIWG community 4
  • 5. DESIGN GUIDELINES 5 Charter: develop interfaces aligned with user needs
  • 6. Optimized SW path to HW •Minimize cache and memory footprint •Reduce instruction count •Minimize memory accesses Scalable Implementation Agnostic Software interfaces aligned with user requirements •Careful requirement analysis Inclusive development effort •App and HW developers Good impedance match with multiple fabric hardware •InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, BGQ, … Open Source User-Centric libfabric User-centric interfaces will help foster fabric innovation and accelerate their adoption 6
  • 7. OFI USER REQUIREMENTS 7 Give us a high- level interface! Give us a low- level interface! MPI developers OFI strives to meet both requirements Middleware is primary user Looking for input on expanding beyond HPC
  • 8. OFI SOFTWARE DEVELOPMENT STRATEGIES One Size Does Not Fit All 8 Fabric Services User OFI Provider User OFI Provider Provider optimizes for OFI features Common optimization for all apps/providers Client uses OFI features User OFI Provider Client optimizes based on supported features Provider supports low-level features only Linux, FreeBSD, OS X, Windows TCP and UDP development support
  • 11. OBJECT-MODEL 11 OFI only defines semantic requirements NIC Network Peer address table Listener Command queues RDMA buffers Manage multiple CQs and counters Share wait objects (fd’s) example mappings
  • 13. ENDPOINT CONTEXTS 13 Default Scalable Endpoints Shared Contexts Tx/Rx completions may go to the same or different CQs Tx/Rx command ‘queues’ Share underlying command queues Targets multi-thread access to hardware
  • 14. ADDRESS VECTORS 14 Address Vector (Table) fi_addr Fabric Address 0 100:3:50 1 100:3:51 2 101:3:83 3 102:3:64 … … Address Vector (Map) fi_addr 100003050 100003051 101003083 102003064 … Addresses are referenced by an index - No application storage required - O(n) memory in provider - Lookup required on transfers OFI returns 64-bit value for address - n x 8 bytes application memory - No provider storage required - Direct addressing possible Converts portable addressing (e.g. hostname or sockaddr) to fabric specific address Possible to share AV between processes User Address IP:Port 10.0.0.1:7000 10.0.0.1:7001 10.0.0.2:7000 10.0.0.3:7003 … example mappings
  • 15. DATA TRANSFER TYPES 15 msg 2 msg 1msg 3 msg 2 msg 1msg 3 tag 2 tag 1tag 3 tag 2 tag 1 tag 3 TAGGED MSG Maintain message boundaries, FIFO Messages carry user ‘tag’ or id Receiver selects which tag goes with each buffer
  • 16. DATA TRANSFER TYPES 16 data 2 data 1data 3 data 1 data 2 data 3 MSG Stream dgram dgram dgram dgram dgram Multicast MSG Send to or receive from multicast group Data sent and received as ‘stream’ (no message boundaries) Uses ‘MSG’ APIs but different endpoint capabilities Synchronous completion semantics (application always owns buffer)
  • 17. DATA TRANSFER TYPES 17 write 2 write 1write 3 write 1 write 3 write 2 RMA RDMA semantics Direct reads or writes of remote memory from user perspective Specify operation to perform on selected datatype f(x,y) ATOMIC f(x,y) f(x,y) f(x,y) f() y g() y x g(x,y) g(x,y) g(x,y) x x Format of data at target is known to fabric services
  • 19. FI_GETINFO struct fi_info *fi_allocinfo(void); int fi_getinfo( uint32_t version, const char *node, const char *service, uint64_t flags, struct fi_info *hints, struct fi_info **info); void fi_freeinfo( struct fi_info *info); 19 struct fi_info { struct fi_info *next; uint64_t caps; uint64_t mode; uint32_t addr_format; size_t src_addrlen; size_t dest_addrlen; void *src_addr; void *dest_addr; fid_t handle; struct fi_tx_attr *tx_attr; struct fi_rx_attr *rx_attr; struct fi_ep_attr *ep_attr; struct fi_domain_attr*domain_attr; struct fi_fabric_attr*fabric_attr; }; API version ~getaddrinfo app needs API semantics needed, and provider requirements for using them Detailed object attributes
  • 20. CAPABILITY AND MODE BITS 20 • Desired services requested by app • Primary – app must request to use • E.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC • Secondary – provider can indicate availability • E.g. FI_SOURCE, FI_MULTI_RECV Capabilities • Requirements placed on the app • Improves performance when implemented by application • App indicates which modes it supports • Provider clears modes not needed • Sample: FI_CONTEXT, FI_LOCAL_MR Modes
  • 21. ATTRIBUTES struct fi_fabric_attr { struct fid_fabric *fabric; char *name; char *prov_name; uint32_t prov_version; uint32_t api_version; }; 21 struct fi_domain_attr { struct fid_domain *domain; char *name; enum fi_threading threading; enum fi_progress control_progress; enum fi_progress data_progress; enum fi_resource_mgmt resource_mgmt; enum fi_av_type av_type; int mr_mode; /* provider limits – fields omitted */ ... uint64_t caps; uint64_t mode; uint8_t *auth_key; ... }; Provider details Can also use env var to filter Already opened resource (if available) How resources are allocated among threads for lockless access Provider protects against queue overruns Do app threads drive transfers Secure communication (job key)
  • 22. ATTRIBUTES struct fi_ep_attr { enum fi_ep_type type; uint32_t protocol; uint32_t protocol_version; size_t max_msg_size; size_t msg_prefix_size; size_t max_order_raw_size; size_t max_order_war_size; size_t max_order_waw_size; uint64_t mem_tag_format; size_t tx_ctx_cnt; size_t rx_ctx_cnt; size_t auth_key_size; uint8_t *auth_key; }; 22 Indicates interoperability Order of data placement between two messages Default, shared, or scalable
  • 23. ATTRIBUTES struct fi_tx_attr { uint64_t caps; uint64_t mode; uint64_t op_flags; uint64_t msg_order; uint64_t comp_order; size_t inject_size; size_t size; size_t iov_limit; size_t rma_iov_limit; }; 23 struct fi_rx_attr { uint64_t caps; uint64_t mode; uint64_t op_flags; uint64_t msg_order; uint64_t comp_order; size_t total_buffered_recv; size_t size; size_t iov_limit; }; Are completions reported in order “Fast” message size Can messages be sent and received out of order
GASNET AT A GLANCE
Networking API to enable PGAS languages
•Language-independent
•Interface intended as a compilation target
•Based on active messages
Used by UPC, UPC++, Co-Array Fortran, Legion, Chapel
•Some of these have native ports to OFI
25
OFI OBJECTS

struct fid_fabric* gasnetc_ofi_fabricfd;
struct fid_domain* gasnetc_ofi_domainfd;
struct fid_av*     gasnetc_ofi_avfd;

/* get/put traffic on its own endpoint */
struct fid_cq* gasnetc_ofi_tx_cqfd;
struct fid_ep* gasnetc_ofi_rdma_epfd;
struct fid_mr* gasnetc_ofi_rdma_mrfd;

/* separate active message request-reply traffic */
struct fid_ep* gasnetc_ofi_request_epfd;
struct fid_ep* gasnetc_ofi_reply_epfd;
struct fid_cq* gasnetc_ofi_request_cqfd;
struct fid_cq* gasnetc_ofi_reply_cqfd;
26
INITIALIZATION

hints = fi_allocinfo();
hints->caps = FI_RMA | FI_MSG | FI_MULTI_RECV;
hints->mode = FI_CONTEXT;
hints->addr_format = FI_FORMAT_UNSPEC;
/* generate completions by default */
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;
/* use a single buffer to receive multiple messages */
hints->rx_attr->op_flags = FI_MULTI_RECV | FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;
/* need thread safety for most providers */
hints->domain_attr->threading = FI_THREAD_SAFE;
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
hints->domain_attr->av_type = FI_AV_TABLE;

ret = fi_getinfo(OFI_CONDUIT_VERSION, NULL, NULL, 0ULL, hints, &info);
/* check to enable provider specific optimizations */
if (!strcmp(info->fabric_attr->prov_name, "psm") || ...) {
    ...
27
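The info returned by fi_getinfo is then used to open the object hierarchy shown on slide 26. A minimal sketch of that sequence, with error handling, attribute setup, and cleanup elided for brevity; the `open_resources` helper name is for illustration only:

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Sketch: open fabric -> domain -> CQ/AV -> endpoint, then enable. */
static int open_resources(struct fi_info *info)
{
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_cq *cq;
    struct fid_av *av;
    struct fid_ep *ep;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_DATA };
    struct fi_av_attr av_attr = { .type = FI_AV_TABLE };

    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_cq_open(domain, &cq_attr, &cq, NULL);
    fi_av_open(domain, &av_attr, &av, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    /* endpoints must be bound to an AV and CQ before being enabled */
    fi_ep_bind(ep, &av->fid, 0);
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    return fi_enable(ep);
}
```

This mirrors the fabric/domain/AV/CQ/EP objects GASNet declares on slide 26.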
ADDRESS EXCHANGE

av_attr.type = FI_AV_TABLE;
av_attr.size = gasneti_nodes * NUM_OFI_ENDPOINTS;   /* 3 EPs per node */
fi_av_open(gasnetc_ofi_domainfd, &av_attr, &gasnetc_ofi_avfd, NULL);

/* bind all local EPs to the same AV */
fi_ep_bind(gasnetc_ofi_rdma_epfd, &gasnetc_ofi_avfd->fid, 0);
...
fi_getname(&gasnetc_ofi_rdma_epfd->fid, NULL, &rdmanamelen);
...
size_t total_len = reqnamelen + repnamelen + rdmanamelen;
on_node_addresses = gasneti_malloc(total_len);
/* allocate buffer to store addresses from all nodes */
char* alladdrs = gasneti_malloc(gasneti_nodes * total_len);

/* get each EP address */
fi_getname(&gasnetc_ofi_rdma_epfd->fid, on_node_addresses, &rdmanamelen);
...
/* all-to-all address exchange */
gasneti_bootstrapExchange(on_node_addresses, total_len, alladdrs);
fi_av_insert(gasnetc_ofi_avfd, alladdrs, gasneti_nodes * NUM_OFI_ENDPOINTS,
             NULL, 0ULL, NULL);
28
MULTI-RECV BUFFERS

/* use a small (~4) pool of large (~1MB) buffers to receive messages */
buff_size = gasneti_getenv_int_withdefault(... 1024 * 1024 ...);
region_start = gasneti_malloc_aligned(... buff_size * num);
/* metadata tracks the status of each large buffer */
metadata_array = gasneti_malloc(meta_size * num);

for (i = 0; i < num; i++) {
    metadata = metadata_array + i;
    setup_recv_msg(metadata, i);
    setup_metadata(metadata, i);
    /* post receive buffers to the request-reply active message endpoints */
    fi_recvmsg((i % 2 == 0) ? gasnetc_ofi_request_epfd : gasnetc_ofi_reply_epfd,
               &metadata->am_buff_msg, FI_MULTI_RECV);
}
29
MULTI-RECV BUFFER METADATA

setup_recv_msg(metadata, i):
    /* setup and save the descriptor used to (re-)post the receive buffer */
    metadata->iov.iov_base = region_start + buff_size * i;
    metadata->iov.iov_len = buff_size;
    metadata->am_buff_msg.msg_iov = &metadata->iov;
    metadata->am_buff_msg.iov_count = 1;
    metadata->am_buff_msg.addr = FI_ADDR_UNSPEC;
    metadata->am_buff_msg.desc = NULL;
    /* FI_CONTEXT requires giving the provider some storage space */
    metadata->am_buff_msg.context = &metadata->am_buff_ctxt.ctxt;
    metadata->am_buff_msg.data = 0;

setup_metadata(metadata, i):
    /* reference counting tracks when messages received into the buffer
     * are still being processed */
    metadata->am_buff_ctxt.index = i;
    metadata->am_buff_ctxt.final_cntr = 0;
    metadata->am_buff_ctxt.event_cntr = 0;
    gasnetc_paratomic_set(&metadata->am_buff_ctxt.consumed_cntr, 0, 0);
    metadata->am_buff_ctxt.metadata = metadata;
30
RECEIVE MESSAGE HANDLING

void gasnetc_ofi_am_recv_poll(struct fid_ep *ep, struct fid_cq *cq)
{
    struct fi_cq_data_entry re;

    if (TRYLOCK(lock_p) == EBUSY)
        return;

    /* check CQ for a completion */
    ret = fi_cq_read(cq, &re, 1);
    if (ret == -FI_EAGAIN) {
        UNLOCK(lock_p);
        return;
    }

    gasnetc_ofi_ctxt_t *header = re.op_context;
    header->event_cntr++;              /* 1 event per message */
    /* FI_MULTI_RECV flag indicates the buffer has been released */
    if (re.flags & FI_MULTI_RECV)
        header->final_cntr = header->event_cntr;
    UNLOCK(lock_p);

    /* process request or reply message */
    if (re.flags & FI_RECV)
        gasnetc_ofi_handle_am(re.buf, is_request, re.len, re.data);

    /* repost receive buffer once all callbacks complete */
    if (++header->consumed_cntr == header->final_cntr) {
        metadata = header->metadata;
        LOCK(&gasnetc_ofi_locks.am_rx);
        fi_recvmsg(ep, &metadata->am_buff_msg, FI_MULTI_RECV);
        UNLOCK(&gasnetc_ofi_locks.am_rx);
    }
}
31
NONBLOCKING PUT MAPPING

General outline of the gasnet_put_nb flow:

gasnete_put_nb(gasnet_node_t node, void *dest, void *src, size_t nbytes ...)
{
    if (nbytes <= max_buffered_send) {
        /* 'inject' allows immediate re-use of the buffer;
         * optimized for lowest latency */
        use_inject(...);
    } else if (nbytes <= gasnetc_ofi_bbuf_threshold) {
        use_bounce_buffer(...);   /* copy into bounce buffer */
    } else {
        put_and_wait(...);        /* must wait for completion */
    }
}
32
NONBLOCKING PUT MAPPING

use_inject:
    /* FI_INJECT – buffer is re-usable after the call returns,
     * but still generates a completion */
    fi_writemsg(gasnetc_ofi_rdma_epfd, &msg, FI_INJECT | FI_DELIVERY_COMPLETE);
    pending_rdma++;

use_bounce_buffer:
    /* copy source buffer into bounce buffer; source may be reused */
    get_bounce_buf(nbytes, bbuf, bbuf_ctxt);
    memcpy(bbuf, src, nbytes);
    fi_write(gasnetc_ofi_rdma_epfd, bbuf, nbytes, NULL, GET_RDMA_DEST(node),
             GET_REMOTEADDR(node, dest), 0, bbuf_ctxt);
    pending_rdma++;

put_and_wait:
    /* transfer data and wait for the completion indicating
     * the buffer is no longer needed */
    fi_write(gasnetc_ofi_rdma_epfd, src, nbytes, NULL, GET_RDMA_DEST(node),
             GET_REMOTEADDR(node, dest), 0, ctxt_ptr);
    pending_rdma++;
    while (pending_rdma)
        GASNETC_OFI_POLL_EVERYTHING();
33
MPICH USAGE
OFI CH4 netmod
34
MPICH AT A GLANCE
Reference MPI implementation
•MPI standards 1, 2, and 3
High-performance and scalable
•Used on most of the Top 500, including top-10 supercomputers
Highly portable
•Platforms, OSes, networks, CPUs…
Base for most vendor MPI derivatives
35
SELECT A MAPPING
MPICH optimizes based on provider features
•Presents one possible mapping
•Designed for providers with a close semantic match to MPI
•Used by MPICH ch4 / Open MPI MTL
OFI building blocks used:
•MPI two-sided → FI_TAGGED
•MPI-3 RMA → FI_MSG, FI_RMA
•MPI progress → counters, CQs
36
SELECT CAPABILITIES AND MODE BITS

hints = fi_allocinfo();
/* provide the provider context space */
hints->mode = FI_CONTEXT | FI_ASYNC_IOV;
/* request support for RDMA read/write and atomic operations */
hints->caps |= FI_RMA;
hints->caps |= FI_ATOMICS;
if (do_tagged)
    hints->caps |= FI_TAGGED;         /* tagged messages */
if (do_data)
    hints->caps |= FI_DIRECTED_RECV;  /* source address used as part of message matching */
37
SPECIFY DOMAIN ATTRIBUTES

/* specify the API version MPI is coded to */
fi_version = FI_VERSION(MPIDI_OFI_MAJOR_VERSION, MPIDI_OFI_MINOR_VERSION);
hints->addr_format = FI_FORMAT_UNSPEC;
/* MPI handles all synchronization (lock-free provider) */
hints->domain_attr->threading = FI_THREAD_DOMAIN;
/* MPI is capable of driving progress (thread-less provider) */
hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
/* provider protects against local and remote queue overruns */
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;
/* support either address vector format */
hints->domain_attr->av_type = do_av_table ? FI_AV_TABLE : FI_AV_MAP;
/* support common memory registration modes (ABI v1.0) */
hints->domain_attr->mr_mode = do_mr_scalable ? FI_MR_SCALABLE : FI_MR_BASIC;
38
SPECIFY ENDPOINT ATTRIBUTES

/* generate a completion after transfer is processed by the peer */
hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;
hints->tx_attr->msg_order = FI_ORDER_SAS;    /* sends must be processed in order */
hints->tx_attr->comp_order = FI_ORDER_NONE;  /* transfers can complete in any order */
hints->rx_attr->op_flags = FI_COMPLETION;
hints->ep_attr->type = FI_EP_RDM;            /* reliable-datagram endpoint */
hints->fabric_attr->prov_name = provname;    /* optionally select a provider */

/* and away we go */
fi_getinfo(fi_version, NULL, NULL, 0ULL, hints, &prov);
39
MAPPING MPI COMMUNICATORS
Communicators are groups of peer processes
•MPI_COMM_WORLD – all peers
•Peers referenced by rank (index)
•Map logical rank to physical network address
Options for the rank-to-address mapping (all are supported):
•Per-communicator map table of fi_addr_t integers over a shared AV map
•Communicators sharing an AV table
•Communicators sharing an AV map
40
OPENING ADDRESS VECTOR

char av_name[128];
snprintf(av_name, 127, "FI_NAMED_AV_%d\n", appnum);
av_attr.name = av_name;
av_attr.flags = FI_READ;
av_attr.map_addr = 0;
unsigned do_av_insert = 1;

/* try opening a shared (named) AV */
if (0 == fi_av_open(MPIDI_Global.domain, &av_attr, &MPIDI_Global.av, NULL)) {
    do_av_insert = 0;
    /* copy the fi_addr_t address table */
    mapped_table = (fi_addr_t *) av_attr.map_addr;
    for (i = 0; i < size; i++)
        MPIDI_OFI_AV(&MPIDIU_get_av(0, i)).dest = mapped_table[i];
41
TAGGED PROTOCOL

Selection is based on support for FI_DIRECTED_RECV. The source rank is
needed for matching; with immediate data it travels in the remote CQ data,
otherwise it is carried in the tag itself.

#ifdef USE_OFI_IMMEDIATE_DATA
/* source rank carried in remote CQ data:
 *
 * | protocol (4 bits) | unused | context id | message tag |
 */
#else
/* source rank carried in the tag:
 *
 * | protocol (4 bits) | context id | source | message tag |
 */
#endif
42
MPI SEND (TAGGED MESSAGES)

MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_send_handler(...)
{
    /* inject: optimized transfers under certain conditions –
     * no completions, immediate buffer reuse, limited transfer size */
    if (is_inject) {
        if (do_data)   /* protocol selection */
            fi_tinjectdata(ep, buf, len, dest, dest_addr, tag);
        else
            fi_tinject(ep, buf, len, dest_addr, tag);
    } else {
        if (do_data)
            fi_tsenddata(ep, buf, len, desc, dest, dest_addr, tag, context);
        else
            fi_tsend(ep, buf, len, desc, dest_addr, tag, context);
    }
}

Direct API mapping: ep is the base endpoint or a tagged transmit context,
dest_addr comes from mapped_table[ ], and the tag format is based on the protocol.
43
MPI RECV

MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_mpi_recv(...)
{
    if (is_tagged) {   /* protocol selection */
        fi_trecv(ep, buf, len, desc, src_addr, tag, ignore, context);
    } else {
        fi_recv(ep, buf, len, desc, src_addr, context);
    }
}

Direct API mapping: ep is the base endpoint or a receive context,
src_addr comes from mapped_table[ ], and the tag format is based on the protocol.
44
MPI PROGRESS ENGINE

MPL_STATIC_INLINE_PREFIX int MPIDI_OFI_progress(...)
{
    struct fi_cq_tagged_entry wc = {0};
    struct fi_cq_err_entry error = {0};
    ompi_mtl_ofi_request_t *ofi_req = NULL;

    while (true) {
        /* direct API mapping */
        ret = fi_cq_read(ompi_mtl_ofi.cq, (void *) &wc, 1);
        if (ret > 0) {
            /* process good completion */
            ...
        } else if (ret == -FI_EAVAIL) {
            ret = fi_cq_readerr(ompi_mtl_ofi.cq, &error, 0);
        }
    }
}
45
HOTI 2017
THANK YOU
Sean Hefty, OFIWG Co-Chair