Direct Communication and Synchronization
Stamatis Kavvadias
December 2010
University of Crete
Department of Computer Science
This work has been conducted at the Computer Architecture and VLSI Systems (CARV) labo-
ratory of the Institute of Computer Science (ICS) of the Foundation for Research and Technology –
Hellas (FORTH), and has been financially supported by a FORTH-ICS scholarship, including fund-
ing by the European Commission.
Acknowledgements
The work reported in this thesis has been conducted at the Computer Architecture
and VLSI Systems (CARV) laboratory of the Institute of Computer Science (ICS) of
the Foundation for Research and Technology – Hellas (FORTH), and has been finan-
cially supported by a FORTH-ICS scholarship, including funding by the European
Commission through projects SIVSS (FP6 STREP #002075), Usenix (Marie-Curie
#509595), SARC (FP6 IP #27648), ICores (Marie-Curie #224759), and the HiPEAC
Network of Excellence (NoE #004408).
I owe a great deal to my advisor, Manolis Katevenis. He was very patient with my difficult character, and persistently positive in his guidance. He taught me to think and write, seeking precision, and more importantly to look for the bigger picture. I am in his debt for all that and a lot more. I am especially grateful to Dionisios Pnevmatikatos from my advisory committee, who followed my work on an almost weekly basis throughout these six years, was always constructive in apposite ways, and really helped me regain my courage and self-confidence at some crucial points of my studies.
I am grateful to my thesis committee, Dimitris Nikolopoulos, Angelos Bilas, Panagiota Fatourou, Evangelos Markatos, and Alex Ramirez. They have all helped me at different times and with different aspects of the work. Without their suggestions, comments, and guidance, this work would have much less to contribute. They all helped me broaden my perception of computer architecture and I thank them.
This thesis would not have been possible without the continuous and manifold support and the love of my parents. They have followed my ups and downs in these six years and sensed my difficulties. Although they were living in a town far from my university, I always felt I was in their thoughts. Despite their distance from academia and the process of research, they always found ways to be by my side – thank you.
I owe my deepest gratitude to my friends Nikoleta Koromila, Manolis Mauromanolakis, Antonis Hazirakis, Sofia Kouraki, Elena Sousta, and Ioanna Leonidaki, for always being there for me, for setting up islands of lightheartedness and warm interaction with parties, gatherings, and excursions, and for the numerous times they served my insatiable appetite for discussion on general subjects. I especially want to thank Nikoleta Koromila and Manolis Mauromanolakis for the countless times they have endured my worries, comforted and sheltered me, and lodged and fed me. I also owe special thanks to Antonis Hazirakis, whose clear and precise judgment helped me stop thinking about quitting after the fifth year, when I and others had started to second-guess the course of my studies.
There are several colleagues who supported and helped me in different ways.
Most of all I thank my friend, Vassilis Papaefstathiou, who was always available for deep research discussions that helped me move forward in my work. George
Nikiforos helped me with some of the results, and along with George Kalokeri-
nos implemented and debugged the hardware prototype. Michail Zampetakis also
helped me with some of the results. Michael Ligerakis was always prompt in his
help with lab-related issues. Thank you all for everything.
Many other friends deserve to be acknowledged. Vaggelis, Thanasis, Nikos,
Manolis, Nikos, Alexandros, Minas, Sofia, I am in your debt. There are still other
friends and colleagues to thank –please let me thank you all at once.
Abstract
We also design and implement synchronization mechanisms in the network interface (counters and queues) that take advantage of event responses and exploit the cache tag and data arrays for synchronization state. We propose novel queues that efficiently support multiple readers, providing hardware lock and job-dispatching services, and counters that enable selective fences for explicit transfers and can be synthesized to implement barriers in the memory system.
Evaluation of the cache-integrated NI on the hardware prototype demonstrates the flexibility of exploiting both cacheable and explicitly-managed data, and potential advantages of NI transfer mechanism alternatives. Simulations of up to 128-core CMPs show that our synchronization primitives provide significant benefits for contended locks and barriers, and can improve task scheduling efficiency in the Cilk [1] run-time system, for executions within the scalability limits of our benchmarks.
Contents

Acknowledgements                                                      i
Abstract                                                            iii
Table of Contents                                                     v
List of Figures                                                      ix
List of Tables                                                       xi

1 Introduction                                                        1
  1.1 The Challenge                                                   2
  1.2 Towards Many-core Processors                                    3
      1.2.1 Sharing NI Memory for Computation at User-level           4
      1.2.2 Communication and Synchronization Avoiding Unnecessary Indirection   7
      1.2.3 Providing Enhanced Flexibility and Scalability            9
  1.3 Architectural Support for Many-core Processors                 12
      1.3.1 Cache Integration of a Network Interface                 13
      1.3.2 Direct Synchronization Support                           15
  1.4 Contributions                                                  16

4 Evaluation                                                        107
  4.1 Evaluation on the Hardware Prototype                          109
      4.1.1 Software Platform and Benchmarks                        109
      4.1.2 Results                                                 111
      4.1.3 Summary                                                 115
  4.2 Evaluation of Lock and Barrier Support of the SARC Network Interface   115
      4.2.1 Microbenchmarks and Qualitative Comparison              115

5 Conclusions                                                       137
  5.1 Conclusions                                                   138
  5.2 Future Perspective                                            139

Bibliography                                                        141
List of Figures

1.1  Partitioned (a) versus mutually shared (b) processor and network interface memory.   5
1.2  From small-scale CMPs to a scalable CMP architecture. On-chip memories can be used either as cache or as scratchpad.   10
1.3  Direct versus indirect communication.   13
1.4  Similar cache and network interface mechanisms for writing (a) and reading (b) remotely.   14
2.1  One-to-One Communication using RDMA: single writer per receive buffer.   23
2.2  Communication patterns in space and time.   28
2.3  Consumer-initiated communication patterns in closer examination.   30
2.4  Producer-initiated communication patterns in closer examination.   32
2.5  NI directly interfaced to processor registers.   37
2.6  Microarchitecture of NI integrated at top memory hierarchy levels.   40
2.7  Three alternatives of handling transfers with arbitrary alignment change.   50
2.8  A counter synchronization primitive atomically adds stored values to the contents of the counter.   61
2.9  A single-reader queue (sr-Q) is multiplexing write messages, atomically advancing its tail pointer.   62
2.10 Conceptual operation of a multiple-reader queue buffering either read or write packets.   63
2.11 Remote access to scratchpad regions and generation of explicit acknowledgements.   66
2.12 RDMA-copy operation completion notification.   69
2.13 Hierarchical barrier constructed with counters.   70
2.14 Multiple-reader queue provides a lock service by queueing incoming requests (a), and a job dispatching service by queueing data (b).   71
3.1  Address Region Tables (ART): address check (left); possible contents, depending on location (right).   78
3.2  Memory access flow: identify scratchpad regions via the ART instead of tag matching – the tags of the scratchpad areas are left unused.   80
3.3  State bits mark lines with varying access semantics in the SARC cache-integrated network interface.   81
3.4  Event response mechanism integration in the normal cache access flow.   83
3.5  Deadlock scenario for job list serving multiple subnetworks.   87
3.6  FPGA prototype system block diagram.   89
3.7  FPGA prototype system block diagram.   90
3.8  Complete datapath of SARC cache-integrated NI.   92
3.9  Pipelining of processor requests.   95
3.10 Command buffer ESL tag and data-block formats.   98
3.11 Counter ESL tag and data-block formats.   99
3.12 Single-reader queue with double-word item granularity.   100
3.13 Single- and multiple-reader queue tag formats.   101
3.14 Modified command buffer and NoC packet formats for design optimization.   103
3.15 Area optimization as a percentage of a cache-only design.   104
1 Introduction
of multiple independent jobs, or can provide performance gains from single appli-
cation parallelization that achieves more than half of linear speedup on four cores.
Because the two chips have the same area, they will have roughly the same capac-
ity. Assuming the same voltage as for the uniprocessor, the quad-core chip con-
sumes about half the dynamic power, potentially leaving room for integration of
more cores.
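As a rough back-of-the-envelope check of this claim, the usual dynamic power relation can be applied; the equal total switched capacitance (same chip area) and the halved clock frequency for the quad-core chip used below are our illustrative assumptions, not figures stated in the text.

```latex
% Illustrative only: same total area => roughly the same switched capacitance C,
% same supply voltage V, and an assumed clock of f/2 for the quad-core chip.
\[
  P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f
  \qquad\Longrightarrow\qquad
  P_{\mathrm{quad}} \approx \alpha\, C\, V^{2}\, \frac{f}{2}
  \;=\; \tfrac{1}{2}\, P_{\mathrm{uni}}
\]
```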
In terms of performance, chip multiprocessors have been shown to be more ef-
fective in increasing the throughput of server workloads than monolithic CPUs [4],
and CMP based supercomputers have raised the Top500 record of performance to
1.75 petaflop/s at the time of this writing. Performance scaling of a single applica-
tion in a CMP environment depends on the parallelism available in the algorithm,
but also on the ability of software and hardware to exploit it efficiently, which can
be a complex task. In many cases, considerable effort is required in software to use
efficiently the multiple cores for a single application, and hardware must provide
primitive mechanisms that allow software to exploit different forms and granulari-
ties of parallelism. In effect, chip multiprocessors provide a path for hardware and
software to join their forces to increase a computation’s efficiency with counter-
balanced efforts. Referring to the industry’s turn to CMPs, computer architects informally say that “the free lunch is over,” and this is true for both the software and the hardware disciplines.
An essential part of CMP design is the communication architecture that al-
lows cores to interact with each other and with main memory. Only a very limited
set of computations can benefit from many cores without frequently requiring such
interactions. This dissertation focuses on the communication architecture of chip
multiprocessors, and especially the interface of cores to the on-chip network, in
consideration of tens or hundreds of cores per chip.
Chip multiprocessors are also called multi-core microprocessors. Lately, the term
many-core processors is also used to refer to CMPs with relatively large numbers
of cores (at least several tens) that have become feasible as demonstrated by a few
experimental chips [5, 6] and actual products [7]. For large numbers of cores, CMPs
are likely to utilize a scalable on-chip interconnection network, since a simple bus
would introduce large delays for global serialization of interactions. With several
tens or hundreds of cores, a multi-stage network is likely to be the dominant choice, rather than a single switch, to reduce the number of global wires and to favor localized transfers.
The increasingly distributed environment of many-core processors requires a
communication architecture that can support efficiency (i.e. low execution over-
head) and programmability in a unified framework. To provide such support, this
study targets:
Past network interfaces, which were usually placed on the I/O bus, often provided large private memory, used for intermediate processing of packets and for buffering traffic categories until receiver software could handle them. This partitioned arrangement, illustrated in figure 1.1(a), has the disadvantage that both processor and network interface memory may be underutilized: a memory-intensive workload would utilize processor memory but not NI memory, and the reverse is possible for a communication-intensive one.
Figure 1.1: Partitioned (a) versus mutually shared (b) processor and network interface
memory.
network interface to a “first class citizen” among the distributed resources of the
chip, coupling it with the processor and the fast memory used for computation. Pro-
cessor proximity can provide low transfer initiation latency in support of efficient
communication mechanisms. Associating the network interface with on-chip memory allows it to flexibly handle transfers from a few bytes up to several kilobytes. It also allows for processor-decoupled (or asynchronous) network interface operation that can overlap bulk transfers with computation to inexpensively hide latencies, without the need for non-blocking caches. A simple DMA engine can support bulk transfers from and into scratchpad memory, without necessitating adaptation of the processor architecture to data transfer requirements, as in the case of vector and out-of-order processors.
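As an illustration of how such a decoupled bulk transfer might be posted from software, the sketch below writes a command to hypothetical memory-mapped NI registers and returns immediately; the register layout and names (ni_cmd_t, the doorbell field) are assumptions made for this example, not the interface specified later in this thesis.

```c
#include <stdint.h>

/* Hypothetical memory-mapped NI command block; field names, widths, and the
 * single doorbell register are illustrative assumptions. */
typedef struct {
    volatile uint64_t src;     /* local scratchpad address of the source data */
    volatile uint64_t dst;     /* global address in a remote scratchpad       */
    volatile uint32_t size;    /* transfer size in bytes                      */
    volatile uint32_t start;   /* writing here launches the transfer          */
} ni_cmd_t;

/* Post an asynchronous put and return at once: the (assumed) DMA engine
 * streams the data out of scratchpad while the processor keeps computing. */
static inline void scratchpad_put(ni_cmd_t *ni, const void *src,
                                  uint64_t dst_global, uint32_t size)
{
    ni->src   = (uint64_t)(uintptr_t)src;
    ni->dst   = dst_global;
    ni->size  = size;
    ni->start = 1;             /* doorbell: no further processor involvement  */
}
```

Completion of such a transfer is detected separately, for example with the counter-based mechanisms discussed in chapter 2.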
Extending a cache hierarchy to support multiple processor nodes requires that pri-
vate parts of the hierarchy are kept coherent. Coherent caching over a scalable
network-on-chip (NoC) will usually employ a directory-based protocol, which may
introduce overheads to inter-processor communication because of the directory in-
direction required and because of the round-trip nature of caching that employs the
usual write-allocate policy.
To the contrary, this thesis advocates the use of direct communication mech-
anisms, to avoid these overheads and exploit the full potential of future many-core
processors. Direct communication mechanisms implement direct transfers, which
are communication operations that utilize network interface controllers only at the
source and the destination of a transfer. Since remote read communication opera-
tions inherently require a round-trip, they are also direct transfers as long as the only
NI controllers involved are at the requestor and at the data source. As an exception,
in this dissertation, hardware-assisted copy operations (also referred to as RDMA-copy), which move data from a source to a destination, are considered direct transfers, even when neither the source nor the destination is local to the node initiating the transfer1.
By definition, direct transfers involve the minimum possible number of de-
vice controllers and network traversals for a data transfer. In addition, direct read
and write transfers between two nodes of a system allow the most optimized route
for each network crossing, which will normally involve the minimum number of
network switches. Note that controller access and NoC switch utilization are the
primary contributors to both the latency and the energy consumed for an on-chip
transfer.
The minimum possible number of involved controllers and the optimized routes for each network crossing of direct transfers can be contrasted to the case of coherent transfers that require directory indirection. The directory cannot lie on the shortest path between every pair of nodes that need to communicate, which entails overheads, but, more importantly, the directory access itself incurs energy and latency penalties to
coherent transfers. Furthermore, in the case of processor-to-processor signals or
1
Such “triangular” RDMA-copy operations are considered direct transfers because of their simi-
larity to remote read operations: data are transferred from the source to the destination after an initial
request. In addition, the involvement of the third controller (the transfer’s initiator) is not superfluous,
and any non-speculative communication mechanism implementing such a transfer would involve all
three controllers.
operation, this operation can complete in the network interface without involving
the processor, which allows for a simpler implementation. Second, when synchro-
nization is associated with notification of a thread about an event, the network inter-
face can automatically initiate a notification signal (a memory write) when a con-
dition is fulfilled. In addition, atomic operations and notifications can be combined
to complete complex synchronization operations in the memory system. Because
they avoid processor indirection, such mechanisms are called direct synchronization
mechanisms.
This dissertation extensively studies the design and the implementation of a
network interface closely-coupled to the processor, supporting direct communica-
tion and synchronization mechanisms intended for large scale CMPs.
Although cache hierarchies have been the dominant approach to locality manage-
ment in uniprocessors, cache-design research for up to 16-core CMPs provides two
important findings: (i) a shared static NUCA L2 cache has comparable performance
to a shared L2/L3 multilevel hierarchy, while a shared dynamic NUCA L2 outper-
forms the multilevel design [9], and (ii) for best performance, the optimal number
of cores sharing L2 banks of a NUCA cache (corresponding to a partitioning of the
total L2 cache pool) is small for many applications (2 or 4), and dynamic placement
mechanisms in NUCA caches are complex and power consuming [10]. Other techniques that identify the appropriate degree of L2 cache sharing based on independent data classes do not address the requirement for communication [11].
These findings indicate that cache sharing becomes increasingly inappropriate
as the number of cores increases, which limits the usefulness of a NUCA cache
framework as an alternative to multilevel cache hierarchy coherence. The directory
state required for coherence is proportional to the number of processors (unless
partial multicasts are used) and the amount of on-chip memory. In addition, cache-
coherence is not expected to scale performance-wise beyond some point. Perhaps
more importantly, the structure of a coherent cache hierarchy limits the flexibility of
tiling and design reuse, complicating a low effort approach to incremental scaling of
the number of cores. It is thus very likely, that coherence in large scale CMPs will
not be supported throughout the chip, but rather only in “coherent-island” portions
of the chip complicating programming.
The use of local stores and direct transfers has challenged the general purpose
CMP domain –traditionally based entirely on caches and coherence– with the Cell
Figure 1.2: From small-scale CMPs to a scalable CMP architecture. On-chip memories
can be used either as cache or as scratchpad.
processor [12, 13]. Efficient software communication control has been demonstrated
for several codes [14, 15, 16] leveraging scratchpad memories on the Cell processor,
and even automated with runtime support in programming models [17, 18, 19].
Figure 1.2 depicts, on the left side, the two baseline contemporary CMP archi-
tectures, based either on caches and coherence or on per processor local stores, and,
on the right, a possible large-scale CMP architecture2 . The baseline CMP architec-
tures utilize an elementary NoC, with the cache-based architecture also providing
coherence support. The large-scale CMP architecture features a scalable NoC, able
to support transfers among the numerous processors, as well as between the proces-
sors and main memory. A last-level cache is also shown, which is likely to be used to
filter off-chip memory accesses and reduce the effects of the memory wall.
A central issue, in exploiting the baseline contemporary architectures of fig-
ure 1.2 in large-scale CMPs, is whether individual processors or processor-clusters
2
The illustrated architecture follows a dancehall organization regarding the placement of main
memory. It is also possible to adopt a distributed memory organization that provides to each proces-
sor cluster direct connectivity with a portion of off-chip memory. For the purposes of the discussion
in this section, both organizations have similar properties. Moreover, because of the divergence of
on- and off-chip access times, the distributed organization is not expected to provide a significant
latency benefit for main memory access, while the dancehall organization can enhance bandwidth
via fine-grain access interleaving.
should utilize only cache or only scratchpad memory and the corresponding communication mechanisms, or whether both should be combined in a general-purpose way. The former increases diversity and thus will complicate programming. In addition, current experience with the Cell processor indicates that programming using only scratchpad memories is very difficult, as is evident from the numerous scientific publications reporting just reasonable efficiency with kernels or applications on Cell3.
Providing to each processor the flexibility of both cache and scratchpad, with
their associated communication mechanisms, could mitigate these programmability
difficulties, or others that may emerge for the cache-based architecture, as discussed above. The potential problem of this approach is its hardware cost. Flexible division of the same on-chip memory between cache and scratchpad use, together with cache integration of a network interface, can moderate this cost and exploit the similarities of cache and network interface communication mechanisms.
The on-chip network is a primary vehicle in realizing additive performance
gains from large-scale CMPs, for all but those computations that largely incorporate
independent tasks. As already discussed, many-core processors will possibly utilize
scalable on-chip networks, as illustrated in figure 1.2. The basic property of scalable
networks, which allows bandwidth scaling with system size, is that they provide
multiple paths to each destination. Multipath routing (e.g. adaptive routing [20]
or inverse multiplexing [21]) is used to exploit this property and to balance the network load. It can greatly improve network performance, as well as enhance robustness
and scalability.
As resources become more distributed with large numbers of cores, locality
management can benefit from coarse granularity communication and mechanisms
for bulk transfers, to avoid the additional overhead of many small transfers and
to hide latency when possible. Such bulk transfers can benefit from the improved
network utilization and the achievable bandwidth provided by multipath routing,
to improve the computation’s efficiency. In addition, to hide the latency of bulk
transfers and of access to off-chip main memory, the NI can support many transfers
in parallel. To achieve this goal, the state for outstanding transfers required in the
implementation must scale reasonably in terms of the required space, access time,
and the energy consumed.
This study targets many-core processors with scalable NoCs that support mul-
tipath routing, and a network interface with scalable, low-overhead mechanisms, suitable for efficient synchronization as well as fine- and coarse-grain transfers, that flexibly shares its resources with the processor’s cache.
3
To be fair, one should attribute some part of Cell’s inefficiency to the poor scalar performance
of its vector synergistic processors.
Figure 1.3: Use of the address space by coherence-based indirect communication providing
local block copies on demand (a), and extension of shared address space for scratchpad
memories that enables direct transfers (b).
stores, messages, and remote DMA (RDMA) with copy semantics to and from
non-coherent parts of the address space (RDMA-copy). These may include on-chip
scratchpad memories and non-cacheable off-chip memory. Although the interaction
between coherent and non-coherent parts of the address space is an important part
of the design space exploration, it is not addressed in this thesis. The coherence
extensions required for this interaction are studied elsewhere [22].
The integration of a network interface at the top levels of the cache hierarchy is pro-
posed. In this context, the term “network interface” refers to direct communication
and synchronization operations that are exposed to software. The introduced cache-
integrated NI supports efficient sharing of on-chip memory resources between cache
and scratchpad use, at cache-line granularity and according to the needs of the appli-
cation. This allows both cacheable coherent access to parts of the application’s data and direct non-coherent access to other parts. Scratchpad memory can be used as an NI staging area for direct data transfers that is directly accessible from the processor at user level, and thus does not call for data copying.
Consolidation of the two contemporary baseline architectures, with cache-
Figure 1.4: Similar cache and network interface mechanisms for writing (a) and reading
(b) remotely.
meta-data.
Cache-integrated network interfaces do not restrict the quest for efficiency to
utilize only caches or only scratchpad memories. Instead, they enable software ex-
ploration of the tradeoffs between coherent and direct transfers, without dismissing
either. From an implementation perspective, cache integration of the network interface effectively shares on-chip resources, avoiding the inefficiency of partitioned cache and scratchpad storage with independent controllers.
Supporting both implicit and explicit communication allows software (i.e. the programmer, compiler, or run-time system) to optimize data transfer and placement using efficient explicit communication mechanisms in a best-effort way (i.e. when
it is feasible or does not require excessive work). In addition, cache-integrated NIs
share on-chip memory for cache as well as scratchpad and NI use, and leverage
cache control and datapath for the network interface.
reader queues (mr-Q). Queues can efficiently multiplex multiple asynchronous senders,
providing atomic increment of the tail pointer on writes. In single-reader queues, the
head pointer is controlled by the local processor. In multiple-reader queues, the head
pointer is handled like the tail but reacts to read requests. Multiple-reader queues
implement a pair of “overlapped” queues, one for reads and one for writes. The
response of a multiple-reader queue to a read request can be delayed, because reads
are buffered until a matching write arrives. Multiple-reader queues provide sharing
of the queue space by multiple receivers (dequeuers) and can enable efficient lock
and job-dispatching services.
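As a usage sketch of the lock service this enables, assume hypothetical software wrappers mrq_read()/mrq_write() around the NI’s remote read and write operations on a multiple-reader queue; the wrapper names and the token value are ours, while the actual primitives are described in later chapters.

```c
#include <stdint.h>

/* Assumed wrappers over the NI primitives: mrq_read() issues a read request
 * to a multiple-reader queue and returns when a matching write is dequeued
 * for this requestor; mrq_write() enqueues a value into the same queue. */
extern uint64_t mrq_read(uint64_t mrq_global_addr);
extern void     mrq_write(uint64_t mrq_global_addr, uint64_t value);

#define LOCK_TOKEN 1ULL

/* Acquire: the read is buffered by the queue until a token write arrives,
 * so contenders wait in the memory system instead of spinning remotely. */
static void mrq_lock(uint64_t lock_q)   { (void)mrq_read(lock_q); }

/* Release: writing the token back lets the queue answer the oldest waiter. */
static void mrq_unlock(uint64_t lock_q) { mrq_write(lock_q, LOCK_TOKEN); }
```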
1.4 Contributions
This thesis deals to a large extent with the design of NI-cache integration and the
resulting communication architecture, as well as its efficient implementation. In ad-
dition, the design is evaluated both on an actual hardware prototype, implemented
in the Computer Architecture and VLSI Systems (CARV) laboratory of the Insti-
tute of Computer Science of the Foundation for Research and Technology –Hellas
(FORTH-ICS), and with simulations. The contributions of this dissertation are:
2 Communication and Synchronization Mechanisms
2.1 Background
Henry and Joerg [23] have categorized network interfaces in four broad groups:
Although DMA had only been used in network interfaces that required OS
intervention at the time, subsequent research efforts (e.g. [24, 25]) demonstrated
its feasibility at user-level. NIs accessible only via the operating system are cur-
rently inappropriate for efficient on-chip communication, because of OS invocation
overheads. Furthermore, if OS invocation overheads are mitigated, there is no sig-
nificance in the distinction of this type of NI. In any case, this type of NI does not constitute a separate category in an on-chip environment. At the other extreme, hardwired NIs, which are transparent to software and do not allow program control over sending or receiving communication or over its semantics, are utilized for coherent shared-memory interactions and in some dataflow machines (e.g. Monsoon [26]).
The two types of user-level NIs, as adopted for on-chip inter-processor com-
munication, will be discussed in detail in section 2.3. Since the place where explicit
communication mechanisms are implemented is traditionally called “network inter-
face”, in the following, when not stated more specifically, the term will refer to NIs
that provide support for explicit transfers.
In the following subsections, first we discuss implicit and explicit communica-
tion, which cache-integrated network interfaces enable for interprocessor communi-
cation. Then we review RDMA and its relation to networks with multipath routing,
including previous research of the author [27, 28]. Finally, we refer to the lessons
from a connection-oriented approach to designing the network interface [29], within
the same project and during the early stage of this dissertation.
software to express communication: implicit and explicit. This twist of the original
shared memory–message passing opposition in the context of CMPs emphasizes the importance of communication for the multiple-instruction-stream parallelism
they provide.
Explicit communication refers to data movement stated explicitly in software,
including its source and destination. This depends upon a naming mechanism for
communication end-points, and can be supported directly in hardware either via
processor instructions (e.g. for messaging), or by providing programmable commu-
nication engines per compute node.
Explicit communication forces the user program to specify communication and
manage the placement of data. In CMPs, hardware naming mechanisms for commu-
nication endpoints can be provided, by identifying the target thread, to support short
message or operand exchange via registers or per processor queues, or by specify-
ing parts of a global address space as local to a thread1 , using scratchpad memories.
Explicit communication mechanisms can provide direct transfers that each require
minimum latency and energy, including transfers from and to main memory, as dis-
cussed in subsection 1.2.2.
In contrast, implicit communication does not provide a mechanism to
identify the source and destination of communication. In effect, all communica-
tion required is done through shared memory, without the software providing any
location information. Software can only indirectly represent communication with
consumer copy operations, or with some data “partitioning” and different threads
accessing separate partitions in each computation phase (i.e. coordinated data ac-
cess). Hardware support for implicit communication is commonly provided through
the use of coherent caches and hardware prefetchers, that serve as communication
assistants, transparent to software. In addition, hardware architectures that favor
uniform access of shared memory, like dancehall architectures or fine-grain inter-
leaving of shared memory addresses, can render explicit communication useless,
making implicit communication the de facto choice.
Implicit communication relies on hardware for data movement and placement2 .
In many-core processors, the use of caches will necessitate directory-based coher-
1
The notion of a thread is left intentionally ambiguous. Whether it refers to a hardware thread or
a software thread has to do with the virtualization of the naming mechanism and not the mechanism
itself. Virtualization is addressed in subsection 2.3.3.
2
The case of software shared memory, implemented on top of a message passing system, can be
viewed as a hybrid, which provides implicit communication at the application level, using explicit
mechanisms at the OS level. In this case, data placement is managed in OS software. Other hybrid
schemes include software-only coherence protocols [30] and, in general, software-extended directory
architectures [31].
Figure 2.1: One-to-One Communication using RDMA: single writer per receive buffer.
(P3), the receiver must have set up separate memory areas where each transfer is tak-
ing place. To deliver a large block of data to a remote destination, RDMA segments the transfer into multiple packets, each carrying its own destination address. Owing to this, RDMA works well even if the network uses adaptive or other multipath routing, also shown in figure 2.1, which may cause packets to arrive out of order. Because packets are delivered in place, RDMA obviates the need for packet resequencing and provides the opportunity to exploit multipath routing, which drastically improves network performance.
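The following sketch shows this segmentation in software form; the packet format, the 64-byte payload limit, and noc_send() are assumptions made for illustration (in hardware the NI performs the equivalent splitting itself).

```c
#include <stdint.h>
#include <string.h>

#define NOC_MAX_PAYLOAD 64u            /* assumed maximum payload per packet */

typedef struct {
    uint64_t dst_addr;                 /* each packet names its own target   */
    uint32_t len;
    uint8_t  payload[NOC_MAX_PAYLOAD];
} noc_packet_t;

extern void noc_send(const noc_packet_t *pkt);   /* assumed injection hook   */

/* Segment an RDMA write: because every packet carries its absolute
 * destination address, packets may follow different paths and arrive out of
 * order, yet each still lands in place with no receive-side resequencing.   */
static void rdma_write(uint64_t dst, const uint8_t *src, uint32_t size)
{
    for (uint32_t off = 0; off < size; off += NOC_MAX_PAYLOAD) {
        noc_packet_t p;
        uint32_t chunk = size - off;
        p.len = (chunk < NOC_MAX_PAYLOAD) ? chunk : NOC_MAX_PAYLOAD;
        p.dst_addr = dst + off;
        memcpy(p.payload, src + off, p.len);
        noc_send(&p);
    }
}
```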
Resequencing [32, 33] is difficult to implement across a network because, in
the general case, it requires non-scalable receive-side buffering. Specifically, previ-
ous research of the author and others [27, 28] in the CARV laboratory of FORTH-
ICS indicates that, in the general case, the buffer space required for resequencing is
proportional to the number of senders, the number of possible paths in the network,
and the amount of buffering in a network path. It should be possible to reduce this
amount by the last factor with inverse multiplexing [21], but balanced per path and
per destination traffic splitting is required at the senders in this case. To the best of
the author’s knowledge, the only low-cost solution for resequencing is to use end-
to-end flow control, which throttles the possible outstanding packets, and causes
bandwidth to diminish with increasing round-trip time.
The compatibility of RDMA with out-of-order arrivals over an unordered network does not solve the problem of detecting transfer completion at the receiver. Completion detection cannot be based on when the “last” word of an RDMA transfer has been written, as it could be with an ordered interconnect. In [27, 28], resequencing of packet headers was implemented in combination with RDMA to detect transfer completion, and proved non-scalable. Instead, assuming the network never creates duplicates of packets, one can count the number of arriving bytes for each transfer and compare it to the number of expected bytes. In the following subsections we propose an elegant mechanism that exploits this idea and provides transfer completion notification in the absence of network ordering.
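A software rendering of that idea is sketched below; in the design discussed later this bookkeeping is done by an NI counter rather than by code, and the structure and function names here are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-transfer completion state at the receiver: only the byte total matters,
 * not the arrival order, so this works over a multipath (unordered) network
 * as long as the network never duplicates packets. */
typedef struct {
    atomic_uint_fast32_t arrived;    /* bytes received (or acknowledged) so far */
    uint32_t             expected;   /* total bytes of the transfer             */
} xfer_counter_t;

/* Invoked per arriving packet (or per packet acknowledgement). */
static void on_packet(xfer_counter_t *c, uint32_t payload_bytes)
{
    atomic_fetch_add(&c->arrived, payload_bytes);
}

/* The transfer is complete exactly when the arrived-byte count matches. */
static bool transfer_complete(xfer_counter_t *c)
{
    return atomic_load(&c->arrived) == c->expected;
}
```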
connections forced an additional process group ID for protection and were also cre-
ating a problem with destination routing information that was not addressed. In-
stead, the process group ID and the connection ID of the destination were placed in
the network packet (this was probably a first notion of progressive address transla-
tion discussed in chapter 3). Finally, the user-level access scheme used, required a
whole page to be mapped to the virtual space of a process, to access a few entries of
the connection table, which was considered wasteful.
In conclusion, accessing the network interface through a table creates problems
with table space and indexing. In addition, there is inefficiency in accessing such
NI control state at user-level, over virtual memory, because of the reasonably large
protection granularity provided by access control hardware like TLBs.
With a scalable number of destinations such connection state can easily in-
crease to the point that it cannot be managed efficiently. In fact, it seems that the
increase is because of the state required for routing to the scalable number of desti-
nations. One-to-one connections, that would encapsulate destination routing infor-
mation, require non-scalable per node state, and with connection grouping, routing
information lookup at transfer initiation is difficult to avoid.
Finally, migration is hard to handle in the presence of connections, because connections represent the potential for communication, and migration represents the movement of sources or destinations4. Routing to a virtualized destination, like a peer process or software thread, in the presence of migrations is difficult. It requires that setting up and dismissing a path to some destination be as lightweight as possible. This can be supported by a mechanism that gracefully completes the task (e.g. lazily or on demand), or by providing an infrastructure that can adapt to the movement of migrating destinations. The important question here is how often migrations occur, which, generally speaking, should depend on the granularity of the moving entity, and is related to locality.
For implicit mechanisms, we consider normal coherent loads or stores and prefetching. For ex-
plicit mechanisms, we consider coherence-based write-send [40], remote scratchpad
stores, short messages, and simple RDMA. Write-send is representative of the few data-forwarding mechanisms that have been proposed in the literature [41, 42, 40, 43]. Although these are explicit communication mechanisms, they are not implemented with the direct transfers advocated in this thesis5. A short message, or simply a message
in the following, refers to a small transfer that is delivered as a single atomic unit
at its destination (usually implementations limit a message to a single NoC packet),
and for which software directly provides the source data to the NI (as values and not
in memory).
Several variants of producer-consumer communication can result, depending
on a number of factors:
Figure 2.2: Communication patterns in space and time. (a) pull, lazy; (b) pull, scheduled (prefetch); (c) push, eager; (d) push, batch (send/RDMA).
head of first writing the data in local scratchpad memory, but can remedy inefficien-
cies caused by “inconvenient” write address order for a write-combining buffer, or
lack of such a buffer. RDMA-write creates copies at addresses A2 different from the
source addresses A1 , and is efficient when a significant amount of the transferred
data has been modified. Part (d) is the producer-initiated transfer corresponding to
consumer-initiated prefetch of part (b) of the figure. In general, producer initiated
transfers, as shown in parts (c) and (d), can occur immediately at or after production
time. Synchronization takes place after the data transfer, opposite to the case of
consumer-initiated communication.
Figure 2.3: Consumer-initiated communication patterns in closer examination.
with the fetch of subsequent lines that can then be accessed with much smaller
latencies.
Part (b) of figure 2.3, which corresponds to figure 2.2(b), shows closely a pat-
tern utilizing coherent prefetching. The scenario presented refers to a programmable
prefetcher, and not automated hardware prefetching. Write time is the same as in
case (a), but the consumer will usually initiate the prefetch sufficient time after the
producer completes, and thus synchronization will be broken in two parts, as illus-
trated: one after the data are produced, and one before the consumer initiates the
prefetch.
The prefetch is a batch of overlapped requests (provided adequate non-
blocking capability is supported in both the cache and the directory), followed by
a batch of responses. There is no way for software to determine when the prefetch
Similarly, part (c) of figure 2.3 shows a pattern for an RDMA-read transfer, which also corresponds to figure 2.2(b). In this case, write time is shorter because the
write is local, there is no wait time to complete the writes, and a remote store can
be used to signal the consumer, writing the synchronization variable with a direct
transfer that minimizes synchronization time. The consumer does experience the
round-trip for the consumer initiated transfer, but there is no directory indirection
as for prefetching, and a batch of responses follows. With an ordered network, the
consumer can poll on the last word of the specific transfer to detect completion, or
a more complex mechanism is needed to implement a fence or transfer completion
status update at the consumer. After data fetch is complete, read time is significantly
reduced since the consumer works locally.
RDMA-read can be faster than coherent loads and stores with larger transfer
sizes and with shorter completion times. Detecting RDMA completion with an un-
ordered network, or with arbitrary size RDMA-copy operations, can be done using
a hardware counter and partial transfer acknowledgements, as will be discussed in
subsection 2.4.3. Similarly, completion notification latency will depend on whether
the counter is located at the consumer or at the producer, the former being better
since it only requires local acknowledgements for arriving DMA packets.
Figure 2.4: Producer-initiated communication patterns in closer examination.
using a direct remote store. After that, read time is minimized by accessing local
data. Alternatively, using messages that each include a validity flag would reduce write time and obviate the transfer overhead for synchronization. Although this may require additional read time for polling, such read time may be fully or partially overlapped with write time.
Short message transfers provide the ability to optimize transfer completion by including a synchronization flag in the message. A similar optimization of the required synchronization has been proposed for coherent remote writes (similar to write-send, using a write-combining buffer at the sender) in [45]. In addition to sup-
porting this optimization, messages preserve the advantage of being direct transfers,
compared to write-send, and thus can be more efficient for fine-grain communica-
tion with increases in the distance among communicating parties and the directory.
There are two potential disadvantages in using messages: (i) there is an overhead
in constructing them, to provide the appropriate arguments to the NI, and (ii) their
maximum size is limited. The latter is also true for cache line size.
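A minimal consumer-side sketch of the flag-in-message idea follows; the slot layout and the 7-word payload are assumptions for illustration, and the property relied on is the one stated above, namely that a short message is delivered atomically at its destination.

```c
#include <stdint.h>

#define MSG_WORDS 7   /* assumed payload size; one more word carries the flag */

/* Message slot as it appears in the consumer's scratchpad. Because the NI
 * delivers a short message as a single atomic unit, the flag only becomes
 * visible together with the data words it guards. */
typedef struct {
    volatile uint64_t data[MSG_WORDS];
    volatile uint64_t valid;          /* 0 = empty, nonzero = message present */
} msg_slot_t;

/* Consumer: poll locally until the flag flips, then use the data; no separate
 * synchronization transfer is needed, and polling overlaps with the send. */
static void consume(msg_slot_t *slot, uint64_t out[MSG_WORDS])
{
    while (slot->valid == 0)
        ;                             /* local spin in scratchpad             */
    for (int i = 0; i < MSG_WORDS; i++)
        out[i] = slot->data[i];
    slot->valid = 0;                  /* mark the slot empty for reuse        */
}
```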
Although figures 2.3 and 2.4 do not show all the packets required for coherent
transfers, or all the cases of such transfers, it is apparent that coherence requires
significantly more packets than explicit mechanisms supporting direct transfers, like
the ones advocated here. This is true even if we consider acknowledgements for
each RDMA packet, as recommended in subsection 2.4.2. For example, inspecting parts (b) and (c) of figure 2.4 and adding the acknowledgements not shown in (c) results in about the same number of packets for both, which is 2/3 of the number of packets required for write-send in part (a) of the same figure.
of the author and others that also provides qualitative results [46], calculates that
the number of on-chip packets used for coherent transfers is two to five times larger
than the number of packets required for direct transfers.
Scalable on-chip networks define a new environment for network interfaces (NI)
and communication of the interconnected devices. On-chip communicating nodes
are usually involved in a computation, whose efficiency critically depends on the
organization of resources as well as the communication architecture performance.
Off-chip network architectures, in the LAN/WAN or PC cluster worlds, have been
influenced by different factors and have evolved in different directions, with complex protocols and robustness in mind, while latency was not a primary concern.
From the perspective of the network interface, the on-chip setting has a lot more in
common with supercomputers and multiprocessors targeting parallel computations.
scale well in terms of latency and energy efficiency, and the hardware cost of coher-
ence directories is not negligible, as discussed in subsections 1.2.2 and 1.2.3.
Two main alternatives have been considered in the literature and implemented
in CMPs for the network interface, differentiated by the naming scheme used to
identify communication targets. The first is to directly name the destination node
of communication, possibly also identifying one of a few queues or registers of a
specific processor, but not any memory address. This results in a network interface
tightly coupled to the processor, and may provide advantageous end-to-end com-
munication latency, but usually necessitates receiver software processing for every
message arrival.
The second approach is to place NI communication memory inside the proces-
sor’s normal address space. In this case, communication destinations are identified
via memory addresses, which include node ID information, and the routing mecha-
nism may be augmented or combined with address translation. The resulting NI has
the flexibility of larger send and receive on-chip space, that is directly accessible
from software and allows the use of more advanced communication mechanisms.
Challenges raised in the design of such a network interface include the possible in-
teractions with the address translation mechanism, as well as managing the costs of
full virtualization of NI control registers and on-chip memory, placed in a user-level
context, including potential context swaps.
Both types of NoC interfaces will be described, with an emphasis on the second type, which can support communication and synchronization mechanisms beyond simple messaging, and presents a more challenging design target.
Figure 2.5: NI directly interfaced to processor registers.
fering with the performance of other threads. There are two possible approaches to
this problem: automatic off-chip buffering and end-to-end flow control. Providing
an automated facility to copy messages off-chip, may require costly rate-matching
on-chip buffers to inhibit affecting other threads, and requires an independent or
high priority channel for the transfer to off-chip storage.
Alternatively, the NI may provide some form of end-to-end flow control that
can keep the buffering requirements low. Providing some guaranteed receive side
buffering per sender and acknowledged transfers for end-to-end flow control, would
be viable only in small systems. One possible solution is provided by limited buffer-
ing at the receiver for all senders, and sender buffering for retransmission on a nega-
tive acknowledgment (NACK). Such sender-side buffers should also be flushed on a
context switch, if message ordering is required. Hardware acknowledgments (ACKs
and NACKs) for end-to-end flow control require a network channel independent to
the one used for messages, to avoid deadlock.
Because software usually requires point-to-point ordering of messages, if the
NoC does not provide ordered delivery of packets (e.g. because of using an adaptive
or other multipath routing scheme), the NI must provide reordering at receivers.
Keeping the cost of reordering hardware low in this case, probably requires that
the amount of in-flight messages per sender is kept very low. Even if the network
supports in-order packet delivery, the NI may still need to provide some support
for point-to-point ordering when the end-to-end flow control mechanism involves
NACKs and retransmissions [51].
Figure 2.6: Microarchitecture of NI integrated at top memory hierarchy levels.
caching is allowed for local scratchpad memory, the L1 may need to be invalidated
on remote scratchpad accesses.
A less aggressive placement of the network interface, further from the processor, at some level of the memory hierarchy shared by a group of cores, is also conceivable. In this case, processor access to the NI can be uncached (as in most network interfaces for off-chip communication), or may exploit coherence mechanisms [35]. Such an NI would mostly be utilized for explicit transfers among parts of the hierarchy belonging to different processor groups. This organization, though, is not studied here.
With a network interface integrated at top memory hierarchy levels, communi-
cation operations always have their source, destination, or both in memory. Scratch-
pad memories provide sufficient buffering for bulk transfers, and thus the NI may
support RDMA (get and put) or copy operations. It is natural to provide access to
the local scratchpad with processor loads and stores. When the processor supports
global addresses (i.e. processor addresses are wide enough), or if remote scratch-
pad memory can be mapped in its address space, direct load/store access to remote
scratchpads can also be provided. In any case, atomic multi-word messages can be
supported as well.
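For illustration, assuming remote scratchpad regions are mapped into the initiating core’s global address space, a remote store is then just an ordinary store to such an address; the address encoding and the helper below are hypothetical.

```c
#include <stdint.h>

#define NODE_SHIFT 32   /* assumed position of the node ID in a global address */

/* Hypothetical global-address construction: upper bits select the node whose
 * scratchpad is addressed, lower bits give the offset inside it. */
static inline volatile uint64_t *remote_scratchpad(unsigned node, uint64_t off)
{
    uint64_t global = ((uint64_t)node << NODE_SHIFT) | off;
    return (volatile uint64_t *)(uintptr_t)global;
}

/* A remote store: one NoC write packet from source to destination,
 * with no directory involved on the way. */
static inline void remote_store(unsigned node, uint64_t off, uint64_t value)
{
    *remote_scratchpad(node, off) = value;
}
```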
At least one of the source and destination operands of communication opera-
tions always resides in memory shared by processors other than the one initiating a
transfer. For this reason, synchronization is required, both before the transfer and
for its completion, with the processor(s) that may be accessing the shared operands
and NI memory resources, on-chip memory is shared and better utilized, obviating
the redundant copy of communication operands that was common in traditional off-
chip network interfaces with dedicated memory. With memory hierarchy integration
there are no dedicated buffering resources managed exclusively in hardware. As a
result, producer-consumer decoupling and flow control, supported in hardware with
processor-integrated NIs, may need to be arranged under software control. For ex-
ample, single and multiple reader queues supported in the SARC network interface
are implemented in memory shared by the NI and the processor, and are managed
by hardware and software in coordination –i.e. the NI only supports ACKs and flow
control is left to the application.
Load and store accesses to local or remote scratchpad memory and to NI con-
trol registers must utilize the processor TLB for protection and to identify their ac-
tual target. The destination of these operations can be determined from the physical
address or by enhancing the TLB with explicit locality information –i.e. with extra
bits identifying local and remote scratchpad regions. In the case of a scratchpad po-
sitioned at the L1-cache level (figure 2.6(a)), TLB access may necessitate deferred
execution of local scratchpad stores for “tag” matching, as is usual for L1 cache
store-hits.
Although scratchpad memory can be virtualized in the same way as any other
memory region, utilizing access control via the TLB, the system may need to mi-
grate its contents more often to better utilize on-chip resources. For example,
scheduling a new context on a processor may need to move application data from
scratchpad to off-chip storage, in order to free on-chip space for the newly scheduled
thread. Migration of application scratchpad data is a complicated and potentially
slow process, because any ongoing transfers destined to this application memory
need to be handled somehow, for the migration process to proceed. Furthermore,
mappings of this memory in TLBs throughout the system need to be invalidated.
To facilitate TLB invalidation, the NI may provide a mechanism that allows the
completion of in-progress transfers without initiating new ones.
In a parallel computation, consumers need to know the availability of input
data in order to initiate processing that uses them. Conversely, producers may need to
know when data are consumed in order to replenish them, essentially managing flow
control in software. For these purposes the NI should provide a mechanism for pro-
ducers and consumers to determine transfer completion. Such a mechanism is also
required –and becomes more complicated– if the on-chip network does not support
point-to-point ordering. Write transfers that require only a single NoC packet can be
handled with simple acknowledgments, but multi-packet RDMA is more challeng-
ing. Furthermore, generalized copy operations, if supported, may be initiated by one
NI and performed by another –e.g. node A may initiate a copy from the scratchpad
of node B to the scratchpad of node C. Chapter 2 in subsection 2.4.3 shows how all
these cases can be handled using acknowledgements and hardware counters.
Since NI communication memory coincides with application memory, soft-
ware delivery of write transfer operands is implied, as long as NoC packets reach
their destinations. Nevertheless, the NI needs to guarantee that read requests can
always be delivered to the node that will source the response data without risking
network deadlock. Networks guarantee deadlock-free operation as long as desti-
nations sink arriving packets. The end-nodes should be able to eventually remove
packets from the network, regardless of whether backpressure prevents the injection of their
own packets. Because nodes sinking read requests need to send one or more write
packets in reply, reads “tie” together the incoming and the outgoing network, effec-
tively not sinking the read, which may lead to what is called protocol deadlock.
To remedy this situation the NI may use for responses (writes in this case)
a network channel (virtual or physical) whose progress is independent of that of the channel for requests (reads). As long as responses are always sunk by NIs, the request channel will eventually make progress without deadlock. Alternatively, reads may use the same network channel as writes, but they need to be buffered
at the node that will then source the data. The reply of the read will then be posted
by that node, similar to a locally initiated write transfer, when the network channel
is available. Finally, for transfer completion detection, the NI may need to generate
acknowledgments for all write packets. These acknowledgments must also use an
independent network channel and NIs must always be able to sink such packets.
Since bulk transfers are offloaded to the NI while the processor can continue
computation and memory access, a weak memory consistency model is implied.
When posting a bulk transfer to the NI, subsequent memory accesses by the pro-
cessor must be considered as concurrent to NI operation (i.e. no ordering can be
assumed for NI and processor operations) until completion information for the bulk
transfer is conveyed to the processor. To provide such synchronization, the NI must
support fences9 or other mechanisms, to inform software of individual transfer com-
pletion or the completion of transfer groups.
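To make the ordering issue concrete, the following C sketch shows the intended software discipline; it assumes a hypothetical user-level interface (ni_post_rdma, ni_transfer_done) rather than any specific NI design discussed here.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical user-level NI calls; names and signatures are illustrative only. */
    extern void ni_post_rdma(void *dst, const void *src, size_t len,
                             volatile uint32_t *completion);
    extern int  ni_transfer_done(volatile uint32_t *completion);

    static char src_buf[4096], dst_buf[4096];
    static volatile uint32_t done;        /* written by the NI when the transfer completes */

    void send_then_reuse(void)
    {
        done = 0;
        ni_post_rdma(dst_buf, src_buf, sizeof src_buf, &done);

        /* From here on the NI works concurrently with the processor: touching
         * src_buf or dst_buf now would race with the ongoing bulk transfer.   */

        while (!ni_transfer_done(&done))  /* fence-like completion check exposed by the NI */
            ;                             /* spin (or block) until completion is reported  */

        src_buf[0] = 0;                   /* only now is it safe to reuse the buffers */
    }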
9 As discussed in chapter 2, in the context of the processor, fences or memory barriers are instructions that postpone the initiation of subsequent (in program order) memory operations, until previous ones have been acknowledged by the memory system and thus are completed. Similarly, the network interface can provide operations that expose to software the completion of previously issued transfers.

When remote scratchpad loads and stores are supported, these accesses may need to comply with a memory consistency model [58]. For example, for sequential consistency [59] it may be necessary to only issue a load or store (remote or lo-
cal) after all previous processor accesses have completed, sacrificing performance.
With a more relaxed consistency model like weak ordering of events [44], fences and special synchronizing accesses are required. In addition, the NI may need to
implicitly support ordering for load and store accesses to the same address, so that
the processor's expected order of operations is preserved and read-after-write, write-
after-read, or write-after-write hazards are avoided.
Additional complexities arise because the network interface resides outside of
the processor environment, in the memory system. The network interface may need
to tolerate potential reordering of load and store operations to NI control registers
by the compiler or an out-of-order processor. For example, the NI should be able to
handle a situation where the explicit initiation of a message send operation and the
operands of the message arrive in an unexpected order. Finally, out-of-order proces-
sors may issue multiple remote scratchpad loads, which may require NI buffering
of the outstanding operations, in a structure similar to cache miss status holding
registers (MSHR).
A final thing to note is that in the presence of cacheable memory regions, all NI
communication mechanisms may need to interact with coherence directories, which,
in turn, will need to support a number of protocol extensions. To avoid complicating
directory processing, it may be preferable that explicit transfers to coherent memory
regions are segmented at destination offsets that are multiples of cache-line size and
are aligned according to their destination address (i.e. use destination alignment
explained later in this chapter).
Virtualization of the network interface allows the OS to hide the physical device
from a user thread, while providing controlled access to a virtual one. The virtual-
ized device must be accessible in a protected manner, to prevent threads that belong
to different protection domains from interfering with each other. Protection is man-
aged by the OS, and can allow control on how sharing of the physical device is
enforced. For higher performance, the virtual device should provide direct access to
the physical one, without requiring OS intervention in the common case, which is
referred to as user-level access, and is necessary for the utility of NI communication
and synchronization mechanisms in the on-chip environment of CMPs. NIs acces-
sible at user-level allow parallel access to the physical device by multiple threads,
without the need for synchronization.
allow programmable control “registers” inside the scratchpad address space. This
is feasible by providing a few tag bits per scratchpad line (block). The tag bits are
used to designate the varying memory access semantics required for the different
line types (i.e. control “registers” and normal scratchpad memory). The resulting
memory organization is similar to that of a cache and can support virtualization via
address translation and protection for accesses to the scratchpad memory range.
This design allows a large number of control “registers” to be allocated inside
the (virtual) address space of a process. The NI can keep track of outstanding oper-
ations by means of a linked list of communication control “registers”, formed inside
scratchpad memory. Alternatively, the total number of scratchpad lines configured
as communication “registers” may be restricted to the number of outstanding jobs
that can be handled by a fixed storage NI job list. Such a job list processes in FIFO
order transfer descriptions provided in NI control “registers” –potentially recycling in-progress RDMA transfers that are segmented into multiple packets. Scaling the size of such a job list structure, and thus the number of supported outstanding transfers, results in only a small increase in hardware complexity, in contrast to outstanding transfers for a cache, which require a fully associative transaction buffer or miss status holding register (MSHR) structure.
Virtualization of explicit inter-processor communication also requires han-
dling of virtual destinations. In the case of processor-integrated NIs, a translation
mechanism similar to address translation, i.e. a hardware thread translation table
in the NI, can map threads to processors, filled by the OS on misses. In addition,
the NI must be aware of the thread currently scheduled on the processor, for which
it can accept messages, and a mechanism is required to handle messages destined
to threads other than the scheduled one. For example, Sanchez et al. [51] have
proposed the use of NACKs and lazy invalidation of the corresponding entry in the
thread translation table of the source processor.
The case of NIs integrated at top levels of the cache hierarchy that support
RDMA or copy operations requires that virtual address arguments can be passed to NI control registers, albeit in a protected way that is compatible with page mi-
gration. A few solutions have been proposed to the NI address translation problem
[60, 61, 62], but processor proximity of the NI may simplify the situation. A TLB
(or MMU) structure can be implemented in the NI to support the required functions.
Updates of address mappings may use memory-mapped operations by a potentially remote processor handling NI translation misses (e.g. in the Cell BE, SPE translation misses are handled by the PPE). Alternatively, access to a second port of the
local processor’s TLB may be provided.
Furthermore, transfers destined to a node that arrive after the node’s migration
properties. Processor integrated NIs provide very low latency interactions, based
on send-receive style transfers that are interfaced directly to the processor. NI inte-
gration at top memory hierarchy levels provides adequate space for more advanced
communication functions, and allows the read-write communication style as well.
The former NI type provides register-to-register transfers and dedicated, hard-
ware managed resources, while the latter supports memory-based communication
and shares NI resources and their management with the application. As will be
described below, in the first case, the possibility of protocol deadlock involves the
receiving processor and interrupts, or potentially expensive support for automatic
off-chip buffering. In the second case, read requests may require resources for an
independent subnetwork, or receiver buffering resources for a limited number of
read requests, that can be delegated to the application. Both NI types require an
independent subnetwork for acknowledgements to provide flow control and other
functions.
Messaging-like mechanisms can be used for scalar operand exchange, atomic
multi-word control information transfers, or, combined with a user-level interrupt
mechanism at the receiver, for remote handler invocation. In processor-integrated
NIs, message transmission directly uses register values. An explicit send instruction
or an instruction that identifies setting of the final register operand is used to initiate
the transfer. At the receiver the message is delivered directly to processor registers.
In the case of a network interface integrated at the top memory hierarchy lev-
els, the message must be posted to NI registers and the transfer can be initiated
either explicitly via an additional control register access or implicitly by NI moni-
toring of the transfer size and the number of posted operands. At the receiver the
message can be delivered in scratchpad memory or in NI control registers used for
synchronization purposes.
Loads and stores to local or remote scratchpad regions can be started after
TLB access, depending on the consistency model. Remote stores can use write-
combining to economize on NoC bandwidth and energy. In this case, the processor
interface to the NI would include an additional path from the combining buffer (not
shown in figure 2.6). Although it is possible to acknowledge stores to scratchpad to
the processor immediately after TLB access, scratchpad accesses must adhere to the
program dependences, as discussed in the previous section. Remote loads can be
treated as read DMA requests, or exploit an independent network channel for their
single-packet reply, avoiding the possibility of protocol-level deadlock. In support
of vector accelerators, the NI may also provide multi-word scratchpad accesses.
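As a small illustration (not tied to a particular design), the loop below performs remote scratchpad stores through a mapped global address; the base address is a made-up placeholder. Sequential word stores like these are exactly what a write-combining buffer can merge into a few NoC packets.

    #include <stdint.h>

    /* Placeholder: assume the peer core's scratchpad is mapped here. */
    #define REMOTE_SCRATCHPAD ((volatile uint32_t *)0x40000000u)

    void push_operands(const uint32_t *vals, int n)
    {
        volatile uint32_t *dst = REMOTE_SCRATCHPAD;
        for (int i = 0; i < n; i++)
            dst[i] = vals[i];   /* each store is forwarded over the NoC; adjacent
                                   stores can be write-combined into one packet  */
    }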
When the NI resides below the L1 cache level (see figure 2.6(b) and (c)),
Figure 2.7: Three alternatives of handling transfers with arbitrary alignment change: (a) source alignment; (b) destination alignment; (c) minimizing the number of transferred NoC flits, with shifters at both the source and the destination (see text).
A complementary approach for a read request exceeding the available buffer space at
the NI hosting the read’s source region, would be to employ some kind of negative
acknowledgement that manifests the error condition at the initiating node.
When scratchpads can only be accessed via a global address space, it is more
natural not to limit the sources and destinations of communication mechanisms to
the local scratchpad. This goes beyond simple get and put operations (RDMA): the congruent mechanism provided by the NI is a copy operation. Copy operations
are more general than RDMA because they allow the specification of transfers that
have both remote source and destination addresses. Notification of the initiating
node for transfer completion in hardware is more complex in this case.
RDMA and copy operations that allow arbitrary changes of data alignment may
be dealt with in three different ways, as illustrated in figure 2.7. First, it is possible
to send write data in network packets keeping their source alignment, as shown in
Figure 2.7(a). This implies that packets may have “padding” both at the beginning
(before useful data) and at the end. In addition, a barrel shifter is required at the receiver to provide the alignment requested by the operation. Second, if we choose to send
packets with their requested destination alignment, as shown in Figure 2.7(b), then
“padding” may also be required both at the beginning and at the end, and a barrel
shifter must be placed at the source node.
Third, we may want to minimize the number of NoC flits transferred, as shown
in figure 2.7(c), in which case we need two barrel shifters: one at the source node to
align the transmitted data to the NoC flit size boundary, and another barrel shifter at
the destination node to fix the requested destination alignment. In this case, packets
can only have “padding” after useful data. This third approach is more expensive,
requiring two barrel shifters per node, and cannot reduce the amount of transferred
data by more than a single NoC flit. Restricting the supported alignment granularity
reduces the cost of the barrel shifter circuit, but complicates software use of RDMA
or copy operations.
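The arithmetic behind these choices is simple; the sketch below computes, for the destination-alignment scheme of figure 2.7(b), the barrel-shifter rotate amount and the head/tail padding, assuming an illustrative flit width of 8 bytes.

    #include <stddef.h>

    #define FLIT_BYTES 8   /* assumed NoC flit width, for illustration */

    struct dst_align_plan {
        size_t shift;      /* rotate amount applied by the source barrel shifter  */
        size_t head_pad;   /* padding bytes before the data in the first flit     */
        size_t tail_pad;   /* padding bytes after the data in the last flit       */
    };

    struct dst_align_plan plan_dst_aligned(size_t src_off, size_t dst_off, size_t len)
    {
        struct dst_align_plan p;
        p.shift    = (dst_off - src_off) % FLIT_BYTES;   /* change of alignment     */
        p.head_pad = dst_off % FLIT_BYTES;               /* destination byte offset */
        p.tail_pad = (FLIT_BYTES - (dst_off + len) % FLIT_BYTES) % FLIT_BYTES;
        return p;
    }

For example, copying 5 bytes from source offset 3 to destination offset 9 gives a shift of 6, one leading pad byte and two trailing ones, so the packet body still fits in a single flit.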
Transfer pipelining is important for both latency and bandwidth efficiency of
transfers. For instance, NoC request scheduling can overlap with NoC injection of a
previous packet. To keep the latency of communication mechanisms to a minimum,
it is important that the network interface implements cut-through for both outgoing
and incoming packets. When the maximum packet size does not exceed the width of
the receiver’s memory, cut-through implementation for the incoming path can spec-
ulatively advance the tail pointer of the NoC interfacing FIFO for packet reception,
until the correct CRC is computed at the end. Alternatively, for NIs integrated at the
top of the memory hierarchy, providing separate CRC for the packet’s header and
body allows writing incoming packets if the destination address is correct before
checking if the data CRC is correct.
When using explicit transfers, software needs to actively manage the ordering of op-
erations. Because of this, explicit communication can become cumbersome in the case of
processor-asynchronous handling of bulk transfers by the NI, or when the on-chip
network does not support ordering. To simplify the handling of operation ordering
the NI should support both efficient and straightforward transfer completion detec-
tion. In addition, in the context of producer-consumer communication, detection
of data arrival by the consumer should be both flexible and fast to allow efficient
fine-grain interactions. This special case of transfer completion detection by the
consumer is addressed as data synchronization in the following.
Detection of data reception by a consumer (i.e. data synchronization) is nec-
essary before a computation on communicated data can start. For this purpose NI
designers usually optimize data synchronization combining it with data reception.
This is true for send-receive style communication and consumer-initiated transfers.
Conversely, in the case of producer-initiated transfers, individual transfer completion
information enables the initiating node to synchronize with the consumer as re-
quired, and enforce point-to-point ordering when it is not supported by the NoC
or the NIs. The common element of these mechanisms, that enables producer-
consumer interactions and allows exploitation of NI-initiated transfers, is transfer
completion detection. Providing this type of information, without compromising NI
scalability and communication performance, may be a challenging goal for network
interface design.
There are three basic mechanisms for application software to detect message
reception:
all associated packets have been delivered to their destination. The counter synchro-
nization primitive is introduced for this purpose in section 2.4.1, and can provide
the flexibility of selective fences that identify the transfer of software task operands.
For producer-consumer interactions, synchronization declares the event of new
(produced) data to the consumer. This event “intervenes” between the producer’s
write or send and the consumer’s read or receive. Some transfer mechanisms can
combine this event with the data transfer, and in such cases synchronization time is
minimized. Such mechanisms are:
the producer is first notified of transfer completion and then writes a flag at the con-
sumer. Cases (ii), (iii), and (iv) above correspond to producer-initiated messages, and RDMA or copy operations. If the flag optimization is not used and transfer completion is detected only at the producer, the producer must write a separate flag at the consumer after the transfer is complete.
(1) Statically or in advance arrange consumer buffer space for all producers.
In some cases this can be done at the beginning of a computation, by providing sufficient consumer memory for every possible producer requirement. In other cases, some form of flow control can be exploited as well, requiring per-producer buffers (a minimal credit-based sketch is given after this list). The most rewarding approach is to organize the computation in phases, so that buffers corresponding to specific producers during one phase have a different assignment in another phase, and this assignment is explicit in the algorithm. Unfortunately, this too is not always possible.
(2) Query the consumer, before the transfer, with a message that generates a user-
level interrupt (i.e. revert back to a consumer-initiated transfer, with an extra
network crossing). The consumer can try to find some available buffer to
service the producer's request. In case there is no space available, and it is too costly or not possible to move data and free space, the consumer can delegate resolution of the situation to the future by sending a negative response.
This negative response constitutes a form of backpressure to the producer.
(3) Handle a consumer’s memory as a resource equally shared among all threads,
and have all interested parties coordinate dynamically regarding its use. This
is the usual approach taken by shared memory algorithms, treating all mem-
ory, including a consumer’s memory, as “equally” shared. One reason for this
is that the shared address space creates the impression of equidistant memory.
As a result algorithms tend to describe independent producer and consumer
operation, with on-demand search for the opportunity to access the equally
shared resources. This on-demand search for opportunity requires a global
interaction of the interested parties to coordinate such accesses. The usual
means of coordination are locks. Depending on the structure of the shared
resources, coordination can be fine-grain, or even wait-free.
(4) Employ a mechanism for automated spilling of on-chip NI buffers or memory
to off-chip main memory. This approach is elegant and can solve the problem
transparently, in hardware. It may require, though, an independent path or
(sub)network to main memory, and even then, rate matching buffers should
be provided in that path to avoid hotspots or interference with other traffic.
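For approaches (1) and (2), a minimal sketch of per-producer flow control is shown below; the slot size, the ni_rdma_write call and the credit variable are all assumptions made for illustration, not part of a specific interface.

    #include <stddef.h>
    #include <stdint.h>

    #define SLOT_BYTES 1024   /* consumer scratchpad reserved for this producer (assumed) */

    /* Hypothetical NI call: RDMA-write into the consumer's scratchpad slot. */
    extern void ni_rdma_write(int consumer, size_t slot_offset, const void *src, size_t len);

    /* One credit per producer; the consumer raises it (e.g. with a remote store)
     * once the previous contents of the slot have been consumed. */
    static volatile int credit = 1;

    void produce(int consumer, int my_slot, const void *data, size_t len)
    {
        while (!credit)                 /* back-pressure: wait until our slot is free */
            ;
        credit = 0;                     /* consume the credit before sending          */
        ni_rdma_write(consumer, (size_t)my_slot * SLOT_BYTES, data, len);
    }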
The first approach is efficient, roughly, as long as the working set of producers
and consumers can fit in scratchpads. Beyond that point, a mechanism is required
to take action dynamically, and spill data at other hierarchy levels or to off-chip
memory. This problem underlines the need for locality management mechanisms in
A subtle point to observe is the following: when using a shared address space,
departure from a computation confined to a set of pairwise interactions that are bounded by synchronizations results in the need for a model of global memory
consistency. When only pairwise interactions and correlated synchronizations are
enforced, synchronization and flow control suffice for memory consistency. Such
a computation model would associate synchronizations with a “transfer of owner-
ship” of the related data, implicit in the program. This requires only point-to-point
ordering, and can be provided for writes with an ordered network, or with acknowl-
edgements and counters over an unordered NoC. In the general case, it also requires
that reads wait for all previous node writes (local and remote) and for synchroniza-
tion.
The model of computation described includes, but is not limited to, the case
of computing only on local data. When restricted to computation only on local data,
remote loads are of very limited use if any, and RDMA-read and read messages
must be associated with synchronization. In this restricted case, local loads need
only wait for local writes or synchronization. This fits better with simple RDMA
and the model of the Cell processor than with a shared address space with general
RDMA-copy support and remote loads.
With the general shared memory model, data races are allowed; that is, multiple concurrent accesses to shared data, at least one of which is a write. This is in con-
trast to the case of computations limited to pairwise interactions that are bounded
by synchronizations; the “memory model” for computations of the latter type does
not allow any competing accesses (i.e. data races) other than for synchronization.
Because of data races, a memory model that provides a form of global consistency
is required with shared memory, in the general case, which can introduce additional
overheads to producers, as seen in section 2.2. This has similarities with the case
of producer-consumer interactions bounded by synchronizations over an unordered
network, but in the latter case the only requirement for synchronization operations
at transfer boundaries is transfer completion notification.
The equally-shared-memory approach (3) above does not prevent pairwise interactions, but it allows races, which may introduce overheads to
producer-initiated transfers. The consumer memory management problem is solved
in this case by assuming “infinite” equidistant memory. More importantly, producer-
consumer communication is difficult to identify and associate with synchronization,
because of shared data structures. Such data structures must be read and written in a consistent fashion, which requires coordination of the many and potentially concurrent
producers and consumers sharing them. The latter is crucial, because it involves two
ordering problems: (i) the order of a producer’s updates to shared state, and (ii) the
Figure 2.8: A counter synchronization primitive atomically adds stored values to the contents of the counter. If the counter becomes zero, a number of notifications are sent by the network interface. The actual counter value is stored in network interface metadata, and a small portion of NI memory is associated with each counter to provide a software interface for accessing the counter, and configuring its operation.
Two types of hardware primitives for synchronization are proposed, in support of ex-
plicit communication in a shared address space: counters and queues. For queues,
two variants are introduced, one for a single reader and one for multiple readers.
Counters are intended for the management of sequences of unordered events. Single-reader queues provide support for efficient many-to-one communication, and
novel multiple-reader queues are designed for the pairwise matching of producers
with consumers, when speed is important and specific correlation is not.
Counters support the anticipation of a number of events that fulfill a condition.
Reaching the expected number of events triggers automated reset of the counter and notification of possible “actors” waiting for the condition to be satisfied. Figure 2.8
shows the operation of a counter. The illustration assumes that the counter value is
kept in network interface metadata, shown in light mauve and green, and a small
block of NI memory is associated with the counter to provide an interface for software manipulation of the counter, which is shown in light orange. Software uses different offsets inside that block to configure notification data and addresses, and a reset value for the counter. One of the offsets provides indirect read access to the counter and the ability to modify its value.

Figure 2.9: A single-reader queue (sr-Q) is multiplexing write messages, atomically advancing its tail pointer. Both the head and the tail of the queue are kept in NI metadata, while the queue body is formed in NI memory. The head pointer is managed by the local processor accessing it through a control block in NI memory.
Counter modification is done by writing a value, and results in increment of
the counter by that value. If after this modification the counter becomes zero, no-
tifications are sent to all the addresses specified in the block associated with the
counter. Notification packets, shown in light blue in the figure, write the notification
data specified (a single word) to the notification addresses, and suppress returned
acknowledgements using a NULL acknowledgement address11.
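A possible software view of such a counter is sketched below; the word offsets inside the counter's control block are placeholders (the actual layout is defined by the implementation in chapter 3), and ni_word_t stands for a word of the counter's NI memory block.

    #include <stdint.h>

    typedef volatile uint32_t ni_word_t;   /* a word in the counter's NI memory block */

    /* Placeholder word offsets inside the counter's control block. */
    enum { CNT_ADD = 0, CNT_RESET_VAL = 1, CNT_NOTIF_ADDR0 = 2, CNT_NOTIF_DATA0 = 3 };

    /* Arm a counter to expect 'total' acknowledged bytes and to write 'flag_val'
     * to 'flag' when the accumulated values bring the counter back to zero.
     * (Assumes 32-bit global addresses, purely for illustration.)              */
    void counter_arm(ni_word_t *cnt, uint32_t total,
                     volatile uint32_t *flag, uint32_t flag_val)
    {
        cnt[CNT_RESET_VAL]   = (uint32_t)-(int32_t)total;   /* value reloaded after triggering */
        cnt[CNT_NOTIF_ADDR0] = (uint32_t)(uintptr_t)flag;   /* where the notification is sent  */
        cnt[CNT_NOTIF_DATA0] = flag_val;                    /* the single word it carries      */
        cnt[CNT_ADD]         = (uint32_t)-(int32_t)total;   /* start this episode at -total    */
    }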
Single-reader queues support the automated multiplexing of data from multiple
senders in hardware, minimizing receiver effort to track down new data arrival. Data
arriving at the single-reader queue (sr-Q) are written at the queue offset pointed to by the queue tail pointer, atomically incrementing it, as depicted in figure 2.9. A control
block is provided in NI memory as an interface to the queue, which is formed in
adjacent NI memory. Different offsets of the control block allow writing to the
queue, reading the tail pointer, as well as reading and writing the head pointer. The
latter two are intended for the local processor. The actual head and tail pointers of
the queue are kept in NI metadata.
To access the sr-Q the local processor keeps shadow copies of the head and
tail pointers, reading data from where the former points. To dequeue an item, the processor advances the shadow head and stores it to the appropriate offset of the control block. When the shadow head pointer becomes equal to the shadow tail, the processor starts polling12 the actual tail pointer until the latter advances. At that point new data have arrived, so the processor updates its shadow tail and resumes processing queue data. For flow control, the single reader must signal each writer after that writer's data are dequeued. The space required is proportional to the number of writers, and multiple queue-item granularities can be supported for access via messages.

11 We assume read- and write-type packets for explicit communication, which carry an explicit acknowledgment address.

Figure 2.10: Conceptual operation of a multiple-reader queue buffering either read or write packets.
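A minimal sketch of the shadow-pointer dequeue protocol described above (before figure 2.10) follows; the control-block offsets, item type and queue size are assumptions made for illustration.

    #include <stdint.h>

    #define QUEUE_ITEMS 64                          /* assumed sr-Q body size, in items */

    /* Placeholder offsets in the sr-Q control block. */
    enum { SRQ_TAIL_RD = 0, SRQ_HEAD_WR = 1 };

    typedef volatile uint32_t ni_word_t;

    uint32_t srq_dequeue(ni_word_t *ctrl, ni_word_t *body,
                         uint32_t *shadow_head, uint32_t *shadow_tail)
    {
        if (*shadow_head == *shadow_tail)           /* our shadow copies say "empty"       */
            while ((*shadow_tail = ctrl[SRQ_TAIL_RD]) == *shadow_head)
                ;                                   /* poll the actual tail until it moves */

        uint32_t item = body[*shadow_head % QUEUE_ITEMS];  /* read where the head points   */
        *shadow_head += 1;                          /* advance the shadow head             */
        ctrl[SRQ_HEAD_WR] = *shadow_head;           /* publish the dequeue to the NI       */
        return item;
    }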
Multiple-reader queues combine multiplexing of short message data from mul-
tiple producers and multiplexing of requests from multiple consumers, buffering either kind in the same queue. They signal the availability of both by serving a re-
quest with matched message data, even when the request comes first, acting as a
binary semaphore [68] for exclusive acquisition of message data. Conceptually, the
multiple-reader queue (mr-Q) buffers either data or requests, as illustrated in fig-
ure 2.10. When data are stored and a request arrives, shown in the upper part of
the figure, the data on the top of the queue are sent to the response address for the
request. When requests are buffered in the queue and data arrive, shown in the lower
part of the figure, the data are forwarded to the response address for the top request
in the queue. In either case, when a new item arrives (request or data) that is of the same type as those already buffered in the queue, it is also buffered at the tail of the queue.

12 Single-reader queues can also provide a mechanism that blocks the processor until the queue becomes non-empty. It is also desirable to provide a way for the blocked processor to transition to a low-power state, and possibly to associate such a processor state with more than a single queue that can become non-empty and cause the processor to recover its normal operation mode. Low-power states are not provided on FPGAs, and thus our prototype does not implement such mechanisms.
From the perspective of software, a multiple-reader queue can be viewed as
a novel FIFO with dequeuer buffering. In addition, it can be accessed with non-
blocking read and write messages, disengaging the processor. Because dequeue
(read) requests constitute queue items exactly as enqueued (write) data, the mr-Q
can amortize the synchronization that would be required to check if the queue is
empty for a dequeue, using non-blocking messages. Viewed differently, the mr-
Q never returns empty, but instead supports delayed dequeues, which are buffered
until corresponding data are enqueued.
Processor access to the mr-Q’s head and tail pointers (not shown in figure 2.10)
is avoided, and can hardly be of any use since multiple concurrent enqueues and de-
queues can be in progress. Enqueuers and dequeuers can utilize a locally managed
window of enqueues and dequeues respectively, to guarantee that the queue never
becomes full by either enqueues or dequeues. For dequeuers, this is done by assur-
ing that the count of outstanding dequeues (the reads minus the responses) does not
exceed their window (a fixed space granted in the queue). A dequeuer getting an
enqueuer’s data from the queue needs to notify the enqueuer of the space released in
the queue, and the latter must ensure that the count of enqueues minus the notifications it receives does not exceed its window (a corresponding queue space granted in the queue). The space required for the queue in this case is proportional to the maximum of the number of readers and the number of writers. Other, more complex flow control schemes
may also be possible.
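The dequeuer-side bookkeeping can be sketched as follows; the window size, the mrq_post_dequeue wrapper and the zero-means-empty response encoding are assumptions made only for illustration.

    #include <stdint.h>

    #define DEQ_WINDOW 8    /* mr-Q slots granted to this dequeuer (assumed) */

    /* Hypothetical wrappers: post a non-blocking dequeue (read) whose response
     * will be written to 'resp', and return a consumed slot to its enqueuer. */
    extern void mrq_post_dequeue(volatile uint32_t *mrq, volatile uint32_t *resp);
    extern void notify_enqueuer(uint32_t item);

    static int outstanding;   /* dequeues posted minus responses already consumed */

    void consume_one(volatile uint32_t *mrq, volatile uint32_t *resp)
    {
        if (outstanding < DEQ_WINDOW) {   /* never exceed the granted window        */
            *resp = 0;                    /* 0 means "no response yet" (assumption) */
            mrq_post_dequeue(mrq, resp);
            outstanding++;
        }
        if (*resp != 0) {                 /* a response (an enqueued item) arrived  */
            notify_enqueuer(*resp);       /* free the slot back to its producer     */
            outstanding--;
        }
    }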
The approach described for flow control above provides concurrent access to the mr-Q, exploiting direct transfers to allow synchronization among the contenders with only local management. An inefficiency of this scheme appears when dequeuers do not immediately inspect a dequeue's response so as to notify the enqueuer. This inefficiency, though, is amortized over the total size of an enqueuer's locally managed window of enqueues. Hardware support for negative acknowledge-
locally managed window of enqueues. Hardware support for negative acknowledge-
ments, discussed in subsection 2.4.2, can enable single- and multiple-reader queue
space management that is independent of the number of potential senders, as will be
discussed in subsection 2.4.3.
The synchronization primitives of the previous section and the associated control
space for counters and single-reader queues should be accessible in NI scratchpad memory, via a scratchpad part of the address space. Figure 2.11 illustrates
the resulting remote access semantics to different types of NI memory and how ac-
knowledgements are generated for each NoC packet. Three types of NI memory
are depicted in the middle, with read and write request packets arriving from the
left, and the corresponding generated reply packets on the right. The different ac-
cess semantics are combined with an extended message passing protocol, in which
read packets generate write packets and write packets generate acknowledgements.
Acknowledgements are designed to support completion notification for single- and
multi-packet transfers. Because of this, read and write packets must allow an ac-
knowledgement address which can be provided by user software.
As shown in the figure, a read packet arriving to normal scratchpad memory
has the usual read semantics and generates a write reply packet carrying the destination and acknowledgement addresses specified in the read. In the case of an RDMA-copy
operation generating the read packet, multiple write packets will be generated to
consecutive destination segments. Read packets to counters also generate a write
packet in reply –not shown. A read packet arriving at a multiple-reader queue has
dequeue semantics and the write reply may be delayed.
Writes arriving at normal scratchpad memory have the usual write semantics,
unlike writes to counters that have atomic increment semantics and writes arriving
to queues that have enqueue semantics. Nevertheless, in all cases, a write packet
generates an acknowledgement (ACK) toward the acknowledgement address in the
write, carrying the size of the written data as the data of the acknowledgment. This
size can be accumulated in counters for completion notification of the initial trans-
fer request (read or write). This is because acknowledgements arriving at any type
of NI memory act as writes (not shown), but do not generate further acknowledge-
ments. Note that the acknowledgement generated for a write to a multiple-reader
queue is sent immediately after data are written in the queue and not when they are
dequeued13.
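Putting the pieces together, software can detect completion of a multi-packet operation as sketched below; ni_counter_add and ni_copy are hypothetical wrappers, and the counter is assumed to have been configured (for instance as in the earlier counter sketch) to write done_flag when it reaches zero.

    #include <stdint.h>

    /* Hypothetical user-level wrappers over NI control-"register" accesses. */
    extern void ni_counter_add(volatile uint32_t *cnt, int32_t value);
    extern void ni_copy(uint64_t dst, uint64_t src, uint32_t len,
                        volatile uint32_t *ack_addr);

    static volatile uint32_t xfer_cnt;    /* a counter ESL                          */
    static volatile uint32_t done_flag;   /* written by the counter's notification  */

    void copy_with_completion(uint64_t dst, uint64_t src, uint32_t len)
    {
        done_flag = 0;
        ni_counter_add(&xfer_cnt, -(int32_t)len); /* expect 'len' acknowledged bytes   */
        ni_copy(dst, src, len, &xfer_cnt);        /* each write packet's ACK carries   */
                                                  /* its payload size and adds it here */
        while (done_flag == 0)                    /* zero reached -> notification sent */
            ;
    }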
Negative acknowledgements (NACKs) can be interesting for a number of pur-
poses. Our prototype, described in section 3.2, does not implement NACKs for
simplicity reasons, but we partially evaluate their use in simulations (see subsec-
tion 4.3.2). NACKs can be exploited in the case of a remote read to a single-reader
queue, which is something we would like to forbid. They can also be used for
writes to both queue types, as well as reads to multiple-reader queues, that cause
queue overflow.
In addition, they can support the semantics of a special unbuffered read, which
can be useful in the case of multiple-reader queues –for other types of NI memory it
13 Although both alternatives could be provided under software configuration of the mr-Q, we choose to preserve the same semantics for acknowledgements of writes to any type of NI memory.
Figure 2.11: Remote access to scratchpad regions and generation of explicit acknowledgements.
an advantage compared to RDMA: they require constant time for locating arrived
input. As discussed earlier (subsection 2.1.2), RDMA requires per sender buffering
at the receiver, and thus locating arrived input requires polling time proportional to
the number of possible senders.
In addition to the above –although we do not implement or evaluate such an optimization– single-reader queues can support NACKs when there is insufficient space for arriving messages, to economize on receiver buffer space, irrespective of the number of possible senders. Such an optimization enables receiver queue space that is proportional to the number of senders expected in a usual time interval,
instead of space proportional to the maximum possible number of senders that is
required for RDMA.
Single-reader queues are very closely related to producer-consumer communi-
cation, in any pattern (i.e. one-to-one, many-to-one, or many-to-many). Philosoph-
ically, one can view counters and multiple-reader queues as abstractions of differ-
ent attributes of single-reader queues; counters abstract away the data buffered in
a single-reader queue and only store “arrival events”, while multiple-reader queues
abstract the reception order imposed by queue implementation, making them suit-
able for multiple concurrent readers. Next we discuss the relation of counters and
multiple-reader queues to synchronization and some of their applications.
Synchronization between computing entities (i.e. threads) is in general re-
quired in two different situations: (i) when resolving data dependencies, and (ii)
when modifying the state of shared resources using more than a single operation.
In the first case, the dependent computation must wait for the write or read of its
input or output arguments before it can proceed. This includes all types of depen-
dencies: true data dependencies, anti-dependencies, and output dependencies. The
synchronization in this case involves one or more pairs of computing entities.
The second case usually appears for complex shared data structures or devices
(or complex use of such resources), and involves many (potentially all) computing
entities sharing the resource. It occurs because the required processing consists of
multiple, inseparable operations on the resource, which we cannot or do not want to
separate. This is usually required to preserve the usage semantics of an interface or
implement the definition of a complex operator.
For example, usually we would not want to separate the operations required to
insert an element in a shared balanced tree and re-balance the tree. In this situation,
we would need to preclude other operations from accessing some part of the tree.
Precluding others from accessing the resource is needed to keep the intended oper-
ations inseparable (i.e. without intervention), and is usually addressed with the use
[Figure: counter-based completion notification for a copy of 640 bytes, in 128-byte packets, from the scratchpad of node B to that of node C, initiated by node A. Node A writes -640 to a counter, the acknowledgement of each packet adds +128, and when the counter reaches zero, notifications are sent (here, to nodes A and B).]
For each packet of the copy operation that arrives at its destination, an acknowl-
edgment is generated, writing the value 128 to the counter (remember that the data
of the acknowledgement is the size of the written data). The counter accumulates
the values of these acknowledgments, plus the opposite of the total transfer size sent
by node A, and will become zero only once, when all of them have arrived. When
that happens, the counter sends notifications to the pre-configured addresses, which
in the example reside at nodes A and B.
Several points are evident in this example. First, observe that this completion
notification mechanism does not require network ordering –the order in which the
values are accumulated in the counter is not important. Second, the RDMA-copy
operation of the example could be initiated equivalently by node B, or C, or by a
fourth node instead of node A. The only difference would be the additional read
request packet sent from the initiating node to node A before the illustrated commu-
nication is triggered15 . This means that the completion notification mechanism is
suitable for the general semantics of the RDMA-copy communication mechanism,
and is not restricted to simpler get and put operations, that necessarily involve either
source or destination data residing at the initiating node. Finally, observe that com-
pletion notification for an arbitrary number of user-selected copy operations can use
a single counter, provided that the opposite of the aggregate size of all transfers is written to the counter with a single operation, to prevent notification triggering for subsets of the transferred data.

15 The write of the opposite of the total transfer size (-640) can actually be done by any node.

Figure 2.13: A hierarchical barrier for eight threads built from two trees of counters: an arrival tree that accumulates +1 arrivals (counters initialized to -4 and -2) and a broadcast tree that propagates the completion signal (counters initialized to -1), with notifications sent when a counter reaches zero.
Counters can also be used to construct scalable and efficient hierarchical bar-
riers for global synchronization. Figure 2.13 shows how counters are combined to
form two trees, with a single counter at their common root, in a barrier for eight
threads. The tree on the left accumulates arrivals, and the tree on the right broad-
casts the barrier completion signal. In the figure, threads are shown entering the barrier by writing the value one to counters at the leaves of the arrival tree. The counters
of both trees are initialized to the opposite of the number of expected inputs. For
the arrival tree, when counters become zero they send a single notification with the
value one, that is propagated similarly towards the root of the tree. When the root
counter triggers, the barrier has been reached. Counters in the broadcast tree, other
than the root one, expect only a single input and generate multiple notifications,
propagating the barrier completion event to the next tree level, until the final notifi-
cations are delivered to all the threads. Counters that trigger, sending notification(s),
are automatically re-initialized and are ready for the next barrier episode.
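Assuming the counters have been configured as just described (reset values equal to minus the fan-in, notification addresses pointing to the parent counter or to per-thread flags, and a non-zero notification word), each thread's part of the barrier reduces to the sketch below; ni_counter_add is again a hypothetical wrapper.

    #include <stdint.h>

    /* Hypothetical wrapper: write 'value' to the counter's add offset. */
    extern void ni_counter_add(volatile uint32_t *cnt, int32_t value);

    void barrier_enter(volatile uint32_t *my_leaf_counter, volatile uint32_t *my_flag)
    {
        *my_flag = 0;                        /* re-arm our completion flag              */
        ni_counter_add(my_leaf_counter, 1);  /* signal arrival: +1 up the arrival tree  */
        while (*my_flag == 0)                /* the broadcast tree writes non-zero here */
            ;
    }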
When synchronization is required to control the access of shared resources, the
multiple-reader queue can provide such coordination of competing threads, reducing the cost of synchronization. This is shown in the following examples, by demon-
strating efficient lock and job-dispatching services using a multiple-reader queue.
In figure 2.14(a) a multiple-reader queue (mr-Q) is used to provide a traditional
lock. Initially, a single lock token is enqueued in the mr-Q as shown in the figure.
The first dequeue (read) request acquires the lock, which is automatically forwarded to the requester. Further dequeue requests are buffered in the mr-Q in FIFO order.

Figure 2.14: Multiple-reader queue provides a lock service by queueing incoming requests (a), and a job dispatching service by queueing data (b).
The lock should then be returned (write) to the queue after each use, to be forwarded
to other requesters either pending or to arrive. A possible generalization is to intro-
duce n > 1 tokens in this system. In this case, at most n tasks are ever allowed
simultaneously in what could be called a semi-critical section, providing a way to
reduce contention for the resources accessed therein. The lock function exploits
a multiple-reader queue as a hardware binary semaphore [68] with a FIFO queue
of pending requesters, and requires scratchpad memory space on the order of the
maximum possible number of simultaneous contenders.
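In software, the resulting lock interface can be as thin as the sketch below; mrq_read and mrq_write are hypothetical wrappers around read (dequeue) and write (enqueue) messages to the mr-Q.

    #include <stdint.h>

    #define LOCK_TOKEN 1u

    /* Hypothetical wrappers: a read message dequeues from the mr-Q (buffered until
     * a token is available), a write message enqueues into it. */
    extern uint32_t mrq_read(volatile uint32_t *mrq);
    extern void     mrq_write(volatile uint32_t *mrq, uint32_t value);

    void lock_init(volatile uint32_t *mrq)    { mrq_write(mrq, LOCK_TOKEN); } /* one token  */
    void lock_acquire(volatile uint32_t *mrq) { (void)mrq_read(mrq); }        /* take token */
    void lock_release(volatile uint32_t *mrq) { mrq_write(mrq, LOCK_TOKEN); } /* return it  */

Initializing with n > 1 tokens yields the semi-critical section mentioned above, with at most n concurrent holders.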
Related Work
those of today, and far less than future many-core CMPs. In addition, the processor-memory speed gap was smaller in these systems. Cache-integrated NIs enable on-chip communication in large-scale CMPs, and the exploration of programming models other than message passing. During the last decade, the abundance of available transistors has transformed on-chip integration into an advantage,
as demonstrated by chips targeting scalable multiprocessors [80, 81, 82], and more
recent server chips like Opteron-based AMD CMPs and SUN Niagara I and II pro-
cessors, which integrated the network interface on-chip. For manycores, further
integration is justified, to reduce communication overheads, and to increase flexibil-
ity and programmability. Cache-integration of the NI moderates the associated area
overhead.
NI Address Translation and User-level Access. Two other subjects, extensively
researched in the 90’s, concerned NI user-level access to overcome operating system
overheads to communication [83, 84], and address translation in network interfaces
to leverage their use for DMA directly from the application [60, 61, 62]. Despite
extensive study and awareness of these performance barriers, the solutions proposed
have not been adopted by operating systems and have been used only in custom
high performance system area networks (e.g. [65, 67, 82]). Cache-integrated NIs
leverage past research, and exploit close coupling with the processor to enable reuse
of its address translation and protection mechanisms.
Configurable Memory and Event Handling Support. Ranganathan et al. [85]
propose associativity-based partitioning and overlapped wide-tag partitioning of
caches for software-managed partitions (among other uses). Associativity-based
partitioning allows independent, per way addressing, while overlapped wide-tag par-
titioning adds configurable associativity. PowerPC processors allow cache locking (misses do not allocate a line) (e.g. [86]). Intel's XScale microarchitecture allows per-line lock-
ing for virtual address regions either backed by main memory or not [87].
In smart memories [88] the first level of the hierarchy is reconfigurable and
composed of ’mats’ of memory blocks and programmable control logic. This en-
ables several memory organizations, ranging from caches that may support coher-
ence or transactions, to streaming register files and scratchpads. Their design ex-
ploits the throughput targeted processor tiles to hide increased latencies because of
reconfigurability. It should be possible to support coherent cache and scratchpad or-
ganizations simultaneously and microcode-program smart memories for software-
configurable actions. The SiCortex ICE9 chip [82] features microcode-programmable
tasks in a coherent DMA engine side-by-side with a shared L2 cache, but does not
support scratchpad memory.
Completion queues have been proposed in the VIA [39]. Fine-grain access
control [89] demonstrates how lookup mechanisms leverage local or remote han-
dling of coherence events. Exposing this functionality in Typhoon [90] allowed
application-specific coherence protocols [91]. The work on fine-grain access con-
trol has influenced our approach to cache-integration of event responses (see sub-
section 3.1.3).
Our design generalizes the use of line state to support configurable communi-
cation initiation and atomic operations, in addition to fine-grain cache line locking
that prevents scratchpad line replacement. With the cache integrated NI architec-
ture, event responses utilize existing cache access control functionality, to enable
modified memory access semantics that support atomic operation off-loading and
automatic event notifications. Instead of a programmable controller or microcode,
we provide run-time configurable hardware that can be exploited by libraries, com-
pilers, optimizers, or the application programmer. Configurable memory organiza-
tions, such as smart memories, should incur higher area overhead compared to our
integrated approach, although a direct comparison would require porting our FPGA
prototype to an ASIC flow (the work on smart memories only provides estimates of
silicon area for an ASIC process). Smart memories and ICE9 DMA engine are the
closest to our cache-integrated NI, but our work focuses on keeping the NI simple
enough to integrate with a high performance cache.
Hardware Support for Streaming. Recently with the appearance of the IBM
Cell processor [12] design, which is based on separately addressable scratchpad
memories for its synergistic processing elements, there was renewed interest in the
streaming programming paradigm. Streaming support for general purpose systems
exploiting caches for streaming data was considered in [92, 93, 45, 94]. These studies exploit implementations side-by-side with caches, or in-cache implementations that use the side effects of cache control bits in existing commercial processors. In [45] communi-
cation initiation and synchronization are considered important for high frequency
streaming, whereas transfer delay can be overlapped given sufficient buffering.
Coherence-based producer-initiated transfers that deliver data in L2 caches are augmented with a write-combining buffer that provides a configurable granularity for
automatic flushing. Addition of synchronization counters in L2 caches (which they
do not describe), and dedicated receive-side storage in a separate address space, in-
creases performance to that of heavyweight hardware support at all system levels.
In [92] a scatter-gather controller in the L2 cache accesses off-chip memory and exploits in-cache control bits for best-effort avoidance of replacements. A large
number of miss status holding registers enables exploitation of the full memory sys-
tem bandwidth. Streamware [93] exploits the compiler to avoid replacements of
streaming data mapped to processor caches, for codes amenable to stream process-
ing.
Stream processors like Imagine [50] and the FT64 scientific stream accelerator
provide stream register files (SRF) that replace caches, for efficient mapping of stat-
ically identified access streams. A SIMD computation engine pipelines accesses to
the SRF, and bulk transfers between the SRF and main memory utilize independent
hardware and are orchestrated under compiler control. Syncretic adaptive memory
(SAM) [95] for the FT64 integrates a stream register file with a cache and uses cache
tags to identify segments of generalized streams. It also integrates a compiler man-
aged “translation” mechanism to map program stream addresses to cache and main
memory locations.
Leverich et al. [96] provide a detailed comparison of caching-only versus par-
titioned cache-scratchpad on-chip memory systems for medium-scale CMPs, up to
16 cores. They find that hardware prefetching and non-allocating store optimizations in the caching-only system eliminate any advantages of the mixed environment. We believe their results are due to the focus on transfers to and from off-chip main memory. By contrast, for on-chip core-to-core communication, RDMA provides
significant traffic reduction, which together with event responses and NI cache inte-
gration are the focus of our work.
The Cell processor and [45] do not exploit dynamic sharing of cache and
streaming on-chip storage, enabled by cache-integrated NIs. Support proposed
in [92] can only serve streaming from and to main memory, although our archi-
tecture is likely to benefit from scatter-gather extensions for RDMA, possibly at the
cost of more complex hardware. Both [45] and [92] forgo the advantages of di-
rect transfers, provided by our design, and evaluate bus-based systems. In addition, the bandwidth of cache-based transfers, used in these studies, is limited by the number of supported miss status holding registers (MSHRs) and the round-trip latency of per-transfer acknowledgments, unlike the large number of arbitrary-size RDMA transfers that can be supported, with lower-complexity hardware, in cache-integrated NIs.
Compiler support like that of Streamware is much easier to exploit with configurable
cache-scratchpad. Stream register files and SAM, compared to our cache-integrated
NI, require a specialized compiler and target only data-level and single instruction
stream (SIMD) parallelism.
Queue Support for Producer-Consumer Communication Efficiency. Queues
tightly-coupled to the processor have been proposed, among others, in the multi-
ALU processor (MAP) [97], and in Tilera’s TILE64 chip [7]. MAP’s multithreaded
clusters (processors) support direct message transmission and reception from regis-
ters. In addition, [97] demonstrates how multiple hardware threads in a MAP cluster
can handle in software asynchronous events (like message arrival or memory sys-
tem exceptions) posted in a dedicated hardware queue. The TILE64 chip [7] allows
operand exchange via registers. A small set of queues are associated with registers,
and provide settable tags that are matched against sender-supplied message tags.
A catch-all queue is also provided for unmatched messages. These queues can be
drained and refilled to and from off-chip memory.
Hardware support for software-exposed queues has been proposed, among oth-
ers, for the Cray T3E [37], and in the remote queues [34] proposed by Brewer et al.
The Cray T3E [37] provides queues of arbitrary size and number in memory, accessible
at user-level, that provide automatic multiplexing in hardware and a threshold for
interrupt-based overflow handling. Remote queues [34] demonstrate three different implementations (on the Intel Paragon, MIT Alewife and Cray T3D), exploiting in
different degrees polling and selective interrupts. Remote queues demonstrate an
abstraction that virtualizes the incoming network interface queue which may trigger
a context switch on message arrival to drain the network as in active messages [53],
or alternatively exploit buffering to postpone its handling. Two-case delivery [52] in FUGU (a modified Alewife) first demonstrated buffering in the application's virtual memory and enabled virtualization of remote queues.
Virtualized single-reader queues in scratchpad memory (see subsection 2.4.1)
enable direct transfer reception that shares the fast and usual path through the pro-
cessor’s cache with coherent shared memory traffic, without occupying processor
registers. Cyclic buffering is enabled by updating a hardware head pointer and
multiple item granularities are supported, for efficient use of the limited on-chip
space. Multiple single-reader queues can be allocated to demultiplex independent
traffic categories, limited only by the available on-chip space. Flow control may be
needed only at the user-level, and uses efficient on-chip explicit acknowledgements
instead of interrupts. Selective user-level interrupts can be combined with single-
reader queues to increase the efficiency of irregular or unexpected event handling,
like operand or short message processing, although we do not implement or evaluate
such support.
3 The SARC Network Interface
SARC Network Interface Architecture
Figure 3.1: Address Region Tables (ART): address check (left); possible contents, depending on location (right), such as access permissions, caching policy, and local presence.
Figure 3.2: Memory access flow: identify scratchpad regions via the ART instead of tag matching –the tags of the scratchpad areas are left unused.
specific cache way number for local scratchpad regions. Otherwise, the scratchpad
virtual address must provide these bits, too3. By discriminating among cacheable,
local scratchpad, and remote scratchpad regions, ART allows the cache to select
the appropriate addressing strategy and avoid miss status holding register (MSHR)
and relevant space allocation for remote scratchpad accesses. Because scratchpad
regions are identified in the ART, tag matching is not required in the cache. The
tags of scratchpad lines are “freed” and are used for network interface metadata, as
discussed below.
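A behavioral C model of such a lookup is sketched below; the entry fields follow figure 3.1 loosely (base/bound match plus per-region attributes) and are not the exact hardware format.

    #include <stddef.h>
    #include <stdint.h>

    struct art_entry {
        uint64_t base, bound;   /* region limits checked against the address      */
        uint8_t  permissions;   /* access rights                                  */
        uint8_t  cacheable;     /* 1 = cacheable region, 0 = scratchpad region    */
        uint8_t  local;         /* 1 = local scratchpad, 0 = remote scratchpad    */
        uint8_t  way;           /* cache way holding a local scratchpad region    */
    };

    const struct art_entry *art_lookup(const struct art_entry *art, size_t n, uint64_t addr)
    {
        for (size_t i = 0; i < n; i++)      /* the hardware checks all entries in parallel */
            if (addr >= art[i].base && addr < art[i].bound)
                return &art[i];
        return NULL;                        /* no matching region: fault or default policy */
    }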
The cache-integrated network interface exploits cache state bits for two pur-
poses: (a) to allow scratchpad region allocation at cache-line granularity, and (b) to
provide alternative access semantics inside scratchpad regions. The use of per-line state bits to designate scratchpad areas inside the cache allows flexible management
of on-chip memory usage for different purposes circumventing the coarse protec-
tion granularity. Furthermore, the cache-integrated NI uses the state bits to support
communication and synchronization primitives implemented inside the scratchpad
address space, as depicted in figure 3.3. In addition to the above, either ART iden-
tification of directly addressable regions, or cache state bits can be used to prevent
cache tag matching for scratchpad memory and other special scratchpad lines dis-
cussed below. State bits are preferable, since they obviate ART lookup at a remote
NI, or the need for NoC packet overhead.
3 Alternatively, ART could provide low-order bits of the physical address, e.g. at page granularity, for intra-node addressing (i.e. in local regions).

Figure 3.3: State bits mark lines with varying access semantics in the SARC cache-integrated network interface. Cacheable and scratchpad address spaces are identified via ART access.

Five different line types are shown: i) normal cached data lines marked (Ch); ii) normal scratchpad memory lines marked (LM); iii) command buffers (communication control and status) marked (Cm); iv) counter synchronization primitives marked (Cn); v) single- and multiple-reader queue descriptor synchronization prim-
itives marked (Q). Counters and queues provide the direct synchronization mech-
anisms presented in chapter 2 (subsection 2.4.1). Queue descriptors correspond to
the queue control block discussed there. Details on the functionality of the commu-
nication and synchronization primitives are discussed in the next subsection. The
full description of their implementation, as well as the format and use of the tag and
data block for each will be provided in subsection 3.2.5.
Along with command buffers, counter and queue descriptor lines are the equiv-
alent of NI registers and are called event sensitive lines (ESLs). Multiple ESLs can
be allocated in the same or in different software threads. Since ESLs are placed
inside the scratchpad address space, ART provides access protection for them. Each
software thread can directly access ESLs (and scratchpad memory) at user level, independently of and asynchronously to other threads. For protection checks of
communication address arguments, an additional ART must be provided in the net-
work interface. Migration and swap out of a scratchpad region must be handled by
the OS, as discussed in the previous chapter (section 2.3.3).
The network interface keeps track of multiple outstanding transfers, initiated
via the communication and synchronization primitives, using a job-list. Although a
linked list could be implemented to allow an almost unlimited number of outstand-
ing transfers, this would entail multiple next pointer accesses in the cache tags (or
data), and would consume cache bandwidth and energy. In our implementation we
use a fixed-size external FIFO for the job-list (see subsection 3.1.4), which allows
job-description recycling, so that implicit and explicit transfers, as well as multiple
independent bulk transfers, can overlap. The fixed job-list size restricts the total
number of ESLs and concurrent transfers that can be supported, but it can be scaled
with only a small increase in hardware complexity.
The use of state bits obviates the need for a network interface table for ESLs
and circumvents protection granularity, solving two of the problems faced with
the connection-oriented approach of subsection 2.1.3. Avoiding a table that keeps
state for NI outstanding transfers also requires the explicit acknowledgements of
subsection 2.4.2, which are also part of the SARC architecture.
Event responses are a technique exploiting tagged memory to enable software-
configurable functions, extending the usual transparent cache operation flow, in which
line state and tag lookup guide miss handling with coherence actions. Local or
remote accesses to ESLs (NI events) are monitored, to atomically update associ-
ated NI metadata. Conditions can be evaluated on NI metadata, depending on the
event type (e.g. read or write), to associate the effect of groups of accesses with
communication initiation, or to manage access buffering in custom ways. These
conditional NI actions are called event responses, and allow the implementation of
different memory access semantics. The mechanisms designed utilize appropriate
ESL and scratchpad region internal organization to buffer concurrent accesses and
automatically initiate communication.
Event responses provide a framework for hardware synchronization mecha-
nisms configurable by software, which allows the implementation of the direct syn-
chronization mechanisms introduced in chapter 2 (subsection 2.4.1). In addition,
they allow the description of multi-word communication arguments to the NI and
automatic transfer initiation, enabling the NI communication functions. The dif-
ferent operations are designated by ESL state corresponding to the intended access
semantics.
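As an illustration of this flow, the following C sketch models, purely in software and with hypothetical names (esl_tag_t, enqueue_job), how a store to an ESL could be handled: the NI metadata kept in the tag is read and updated, a condition depending on the ESL kind is evaluated, and a job is scheduled when the condition triggers. It is a sketch of the concept, not of the actual hardware.

#include <stdint.h>

typedef enum { ESL_CMD_BUFFER, ESL_COUNTER } esl_kind_t;

typedef struct {
    esl_kind_t kind;       /* access semantics selected by the line's state bits */
    int32_t    metadata;   /* NI metadata kept in the "freed" tag of the ESL     */
} esl_tag_t;

/* Stand-in for scheduling a transfer: push a job description for NI-out.       */
static void enqueue_job(esl_tag_t *esl) { (void)esl; }

/* Model of one NI event: a (local or remote) store of 'value' at 'word_offset'
 * of an ESL.  Metadata is read, a condition is evaluated depending on the ESL
 * kind, and communication is triggered when the condition is met.              */
void esl_store_event(esl_tag_t *esl, unsigned word_offset, int32_t value)
{
    switch (esl->kind) {
    case ESL_COUNTER:
        /* Counter semantics: atomic add; notifications when the count hits 0.  */
        esl->metadata += value;
        if (esl->metadata == 0)
            enqueue_job(esl);
        break;
    case ESL_CMD_BUFFER:
        /* Command-buffer semantics: mark the written word in the completion
         * bitmask; trigger the transfer once the description is complete
         * (simplified here to a fixed four-word description).                  */
        esl->metadata |= (int32_t)(1u << word_offset);
        if ((esl->metadata & 0xF) == 0xF)
            enqueue_job(esl);
        break;
    }
}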
On every access to the cache (local or remote), normal cache operation checks
the state and tag bits of the addressed line. The NI monitors ESL access, reads
and updates associated metadata stored in the ESL tag, and checks whether relevant
conditions are met, as illustrated in figure 3.4. The state accumulated in the ESL tag
can be used to customize the location of buffering, forming for example queues in
scratchpad memory. When communication-triggering conditions are met, commu-
nication is scheduled by enqueueing a job description in the network interface job-
list.
Figure 3.4: Event response mechanism integration in the normal cache access flow. ESL
accesses are monitored and check of NI metadata in the ESL tag can be used for custom
access buffering in scratchpad memory (e.g. managing some lines as a queue), or to schedule
response transfer(s). When, later, the outgoing traffic NI controller initiates such transfer(s),
it can find related software arguments in the ESL data block.
Figure 3.5: Deadlock scenario for job list serving multiple subnetworks.
In the scenario of figure 3.5, the RDMA write operation at each node cannot make progress because other write packets (independent or of the
same operation) have filled all buffers in the path to the other node, which cannot
sink the writes. Writes cannot proceed in either node because the corresponding
acknowledgement cannot be enqueued in the job list which is full. This inability
to sink a class of packets can arise periodically in any protocol, and does not cause
deadlock when the “response” (the acknowledgement in this case) uses an indepen-
dent (sub)network and is guaranteed to (eventually) make progress. This is not the
case in the scenario shown, because the job list does not provide different FIFOs to
support the independent (sub)networks. As a result, the job lists are filled with other
jobs waiting behind the RDMA operation, and there is no space for acknowledge-
ment jobs in response to the writes. Since the RDMA cannot make progress, the
system is deadlocked.
Similarly, an invalidation request cannot be serviced in either node, since the
corresponding acknowledgement job cannot be enqueued in the job list, which can
involve other nodes in the deadlock. Note that the jobs behind the RDMA that are
filling the job list may be previous write acknowledgements, and cannot be limited
so that they fit the size of the job list. To prevent this situation, separate FIFOs for
each independent subnetwork must be used. This would allow acknowledgements
to proceed independently, and thus the destination would eventually sink all writes.
In effect, the job list should recycle RDMA operation descriptions only after minimal
service, should separate traffic classes, and should possibly serve the different traffic
classes in round-robin.
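A minimal software sketch of a job list organized this way, assuming three traffic classes and a fixed partition of one array per class (all names and sizes are illustrative, not those of the prototype):

#include <stdbool.h>
#include <stdint.h>

#define NUM_CLASSES     3   /* one FIFO per independent subnetwork/priority     */
#define SLOTS_PER_CLASS 8   /* fixed partition of a single memory array         */

typedef struct { uint32_t esl_addr; uint8_t type; } job_t;

typedef struct {
    job_t    slots[NUM_CLASSES][SLOTS_PER_CLASS];
    unsigned head[NUM_CLASSES], tail[NUM_CLASSES], count[NUM_CLASSES];
    unsigned rr_next;                    /* round-robin service pointer          */
} job_list_t;

bool job_enqueue(job_list_t *jl, unsigned cls, job_t j)
{
    if (jl->count[cls] == SLOTS_PER_CLASS)
        return false;                    /* back-pressure this class only        */
    jl->slots[cls][jl->tail[cls]] = j;
    jl->tail[cls] = (jl->tail[cls] + 1) % SLOTS_PER_CLASS;
    jl->count[cls]++;
    return true;
}

/* Serve classes in round-robin so acknowledgements on one subnetwork are never
 * blocked behind jobs (e.g. a long RDMA) belonging to another class.            */
bool job_dequeue(job_list_t *jl, job_t *out)
{
    for (unsigned i = 0; i < NUM_CLASSES; i++) {
        unsigned cls = (jl->rr_next + i) % NUM_CLASSES;
        if (jl->count[cls] != 0) {
            *out = jl->slots[cls][jl->head[cls]];
            jl->head[cls] = (jl->head[cls] + 1) % SLOTS_PER_CLASS;
            jl->count[cls]--;
            jl->rr_next = (cls + 1) % NUM_CLASSES;
            return true;
        }
    }
    return false;
}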
Caches include queues with similarities to the job list, in order to manage the
initiation of write-back operations. The queues used in the case of write-backs how-
ever, do not require operation recycling and may not need to arbitrate for the cache-
scratchpad if data have moved to a victim buffer [98] at miss time. On the flip side,
the job list can replace output packet buffers with shorter job descriptions, directly ar-
bitrating for the network when a packet can be constructed. This area optimization is
straightforward when the local NoC switch operates in the same clock domain with
the processors it connects, as implemented in our prototype. Otherwise, it may be
complex to hide the synchronization cycles lost for accessing the cache-scratchpad
allocator from the outgoing traffic NI controller and getting back the data in the net-
work clock domain. In this case, small packet buffers per subnetwork will probably
be beneficial.
Finally, it should be noted that scaling the size of the job list results in only a small
increase in hardware complexity, because memory decoders and multiplexers have
logarithmic time complexity and the latter can be overlapped with memory array
access. Fixed partitioning of a single memory array can be used for queues per
subnetwork, as implemented in our prototype.
The SARC architecture, described in the previous section, was fully deployed with
a hardware FPGA-based prototype implemented in the CARV laboratory of ICS-
FORTH. The prototype utilizes a Xilinx Virtex-5 FPGA and development employed
Xilinx EDK, ISE, and Chipscope tools. A previous version of the prototype was pre-
sented in [56]. The current version is a major rewrite of the code, optimized for logic
reuse, including several features not present in the version of [56]. The prototype
implements all the different memory access semantics of subsection 2.4.2, exploits
event responses for the communication and synchronization primitives presented,
and provides three levels of NoC priority to cope with protocol deadlock issues for
both implicit and explicit communication mechanisms, that are also leveraged for
explicit acknowledgements.
The block diagram of the FPGA system is presented in figure 3.6. There are
four Xilinx Microblaze IP cores, each with 4KB L1 instruction and data caches and
a 64KB L2 data cache where our network interface mechanisms are integrated. An
on-chip crossbar connects the 4 processor nodes through their L2 caches and NIs,
the DRAM controller –to which we added a DMA engine– and the interface (L2 NI)
to a future second-level, 3-plane, 16 × 16 interconnect, over 3 high speed RocketIO
transceivers, to make a 64-processor system. All processors are directly connected
over a bus (OPB) to a hardware lock-box supporting a limited number of locks.
Figure 3.6: FPGA prototype system block diagram: four Microblaze processors (P0–P3),
each with L1 instruction and data caches and an L2 data cache/NI, connected by the NoC;
a Lock Box, UART, and JTAG & Debug logic attached over OPB/PLB bridges; and a DRAM
controller with DMA engine and off-chip links.
The lock-box is used in the evaluation of the hardware prototype for comparison
purposes (see section 4.1).
The block diagram does not show the processor ART, accessed in parallel with
the L1 data cache, and the store combining buffer, accessed in the subsequent cycle
for remote stores. The prototype implements a network interface integrated at the L2
cache level of a private L2 cache. Cache integration at the L1 level could affect the
tight L1 cache timing, because of the arbitration and multiplexing required among
processor, network incoming, and network outgoing accesses to NI and cache mem-
ory. On the contrary, L2 cache integration mitigates timing constraints, provides ad-
equate space for application data, and allows a sufficient number of time-overlapped
transfers to hide latencies.
Integration with a private L2 has the advantage of independent, parallel access
by each processor to NI mechanisms. It also allows the selective L1 caching of
L2 scratchpad regions with minimal coherence support, using a write-through L1
cache and local L1 invalidation on remote writes to scratchpad; the SARC prototype
implements this optimization. Coherence among the L2 caches of the four nodes to
support implicit communication is under development at this time.
Figure 3.7: SARC prototype NoC packet formats (fields listed from bit 63 down to bit 0).
Read packet: header word 1 holds RI[7:0], the fixed opcode/size 16'h8008, and DstAddr[39:0];
header word 2 holds HdrCrc[15:0], 8 bits of zero padding, and AckAddr[39:0]; the single
payload word holds RespAddr[39:0], 8 bits of padding, and DmaSize[15:0]; the trailer holds
DataCrc[31:0] and 32 bits of padding. Write packet: header word 1 holds RI[7:0], OpSize[15:0],
and DstAddr[39:0]; header word 2 holds HdrCrc[15:0], padding, and AckAddr[39:0].
Finally, the L2 cache supports a single outstanding miss, which is useful only
for cacheable stores that can be immediately acknowledged to the in-order IP core.
The L1 does not implement a miss status holding register (MSHR), because non-
blocking cache support is only used for a single store access that is immediately
acknowledged to the processor. Extending the processor pipeline to include L1 and
L2 data cache access and providing multiple MSHRs can support limited cache
miss pipelining, as will be discussed in the next subsection. The hardware prototype
does not support these optimizations of cacheable accesses.
Figure 3.7 shows the NoC packet formats used by the prototype. The first header word includes 8 bits of routing information (RI), 16 bits of opcode and packet size, and 40 bits for the destination address. All addresses are virtual, as
required for progressive address translation. Although processors in the SARC pro-
totype support only 32-bit addresses, 40-bit addresses are provisioned in the packet
format for future extensions. The read packet format shows the exact opcode and
size of read type packets for direct communication. In the second header word there
are 16 bits of header CRC, 8 bits of zero padding, and 40 bits for the acknowledge-
ment address.
In read type packets the header is followed by a single 64-bit word of payload,
which comprises 40 bits for a response address, 8 bits of zero padding, and 16 bits
of request size, labeled DmaSize in the figure. In the case of write type packets, the
header is followed by up to 256 bytes of payload (i.e. 32 words of 64 bits). Finally,
for both packet types the payload is followed by one trailing word that includes a
32-bit CRC for the payload and 32 bits of zero padding. Subsection 3.2.6 describes
an optimization of these packet formats that adapts them to other features of the
implementation, in order to reduce the area increase from network interface cache integration.
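For illustration, the 64-bit words just described can be assembled as in the following C sketch. The bit positions assume the fields are laid out from bit 63 down to bit 0 as in figure 3.7; the helper names are hypothetical and the sketch is a description aid, not the RTL.

#include <stdint.h>

/* First header word: | RI[7:0] | opcode/size[15:0] | DstAddr[39:0] |            */
static inline uint64_t noc_hdr_word0(uint8_t ri, uint16_t op_size,
                                     uint64_t dst_addr40)
{
    return ((uint64_t)ri << 56) | ((uint64_t)op_size << 40)
         | (dst_addr40 & 0xFFFFFFFFFFull);
}

/* Second header word: | HdrCrc[15:0] | 8'b0 | AckAddr[39:0] |                   */
static inline uint64_t noc_hdr_word1(uint16_t hdr_crc, uint64_t ack_addr40)
{
    return ((uint64_t)hdr_crc << 48) | (ack_addr40 & 0xFFFFFFFFFFull);
}

/* Read-packet payload word: | RespAddr[39:0] | 8'b0 | DmaSize[15:0] |           */
static inline uint64_t noc_read_payload(uint64_t resp_addr40, uint16_t dma_size)
{
    return ((resp_addr40 & 0xFFFFFFFFFFull) << 24) | dma_size;
}

/* Trailer word: | DataCrc[31:0] | 32'b0 |                                        */
static inline uint64_t noc_trailer(uint32_t data_crc)
{
    return (uint64_t)data_crc << 32;
}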
Figure 3.8 shows the complete datapath of the SARC L2 cache-integrated network
interface implementation. The datapath is separated in three controller domains: (a)
the L2 cache and scratchpad controller domain, (b) the outgoing traffic controller
domain, and (c) the incoming traffic controller domain. Four agents compete for
access to the cache and scratchpad, arbitrated by the allocator module, which imple-
ments most of the control of domain (a). The processor interface has independent
agents (i) for store combining buffer accesses, and (ii) for all L1 data cache misses
and scratchpad region accesses except remote stores. The incoming and outgoing
traffic controllers are the other two agents interfaced to the allocator.
The store combining buffer receives remote scratchpad stores after accessing
the processor ART, which identifies remote destinations. Successive pipelined oper-
ation stages are shown in domain (a), but not in the other domains which require
multiple arbitration cycles, and reuse their datapath for multiple network packet flits
(words of NoC channel width). In addition, figure 3.8 hides some implementation
details for presentation purposes. These include the single cache miss status holding
register (MSHR) and registered data for a store miss. In addition, in the incoming
traffic controller domain, the register shown is actually implemented with per-32-bit-
word load enables, and multiplexing of replicated data ahead of it is required to
support sub-block-granularity items for single-reader queues.
(Figure 3.8: the cache/scratchpad allocator arbitrates among the processor/L1 interface, the
store combining buffer, and the NI-in and NI-out controllers for access to the tags and the
256-bit wide L2 data arrays (BRAM); NI-out feeds the NoC through a 64-bit barrel shifter,
and NI-in buffers incoming packets in per-priority FIFOs (VC0–VC2) with Xoff flow control.)
For common cacheable and local scratchpad accesses, the allocator is accessed
after processor ART and L1 cache access but in the same cycle. After a grant from
the allocator, the selected agent reads the tags and utilizes the network interface
ART (separate from the processor ART). The current implementation provides only
a minimal ART (indicating whether a region is local or remote, without protection),
so the ART is shown with a dotted outline. Event response conditions and tag
matching are also evaluated in this cycle.
In the next cycle, depending on operation type, hit or miss calculation, and
event response, data arrays may be accessed, the tags can be updated, and a job
description can be enqueued in the network interface job-list. Not all operations
write the tags (hence tag write is shown with a dash-and-dot outline), but for those
that do, a pipeline bubble prevents subsequent operations from reading the tags and
resolves the structural hazard. Domain (a) has priority in enqueueing jobs in the
job-list over RDMA operation recycling, which may be delayed. For processor
local scratchpad loads or cacheable L2 load hits, data are returned to the processor
and the L1 cache6 from an intermediate register, which adds a cycle of latency.
The outgoing traffic controller domain (or NI-out for short) processes jobs ap-
pearing at the top of the job-list. The NI-out controller requests the appropriate NoC port
and priority level based on routing table lookup and job type. The routing table is
trivial in the current implementation, since only physical addresses are used (pro-
gressive address translation is not implemented). When a NoC request is granted,
NI-out controller proceeds to retrieve data for the transfer, either from the store
combining buffer or by requesting memory access from the allocator.
When a multi-word transfer is requested (message or RDMA), the associated
command buffer is read from memory. For RDMA-write (i.e. an RDMA-copy op-
eration with a local source), the appropriate data are requested subsequently. Com-
mand buffer information, described in subsection 3.2.5, allows NI-out controller to
construct the packet header. The data are then sent to the NoC in successive cycles.
Destination alignment is supported with a barrel shifter. RDMA write pack-
ets are segmented to a maximum of four (4) cache lines of payload, depending on
source and destination alignment. If the operation is not finished, it is recycled in
the job-list, updating the command buffer appropriately. For read type packets the
NI-out controller provides the single word of payload. The trail flit with the payload
CRC is added at the end of the packet.
The incoming network interface controller domain (NI-in for short) processes
packets arriving from the network. The packets are buffered in independent FIFOs
6
The L1 is not filled for ESL accesses.
per network priority, implemented in a single SRAM block (BRAM). If the CRC
of a packet’s header is not correct, the packet is dropped. When a priority’s FIFO
is almost full, an xoff signal is raised to the NoC switch scheduler, to block the
scheduling of further packets with this destination and priority until the condition
is resolved. Packets are dequeued from the other end of FIFOs in priority order,
as specified by the implicit and explicit communication protocols used 7 (this order
matches the NoC priority order). In order to implement cut-through for incoming
traffic, a write packet can be delivered to memory if its header CRC is correct, even
though the data CRC can be wrong. In the latter case, an error is recorded.
For correct packets, header information is processed by the NI-in controller.
When working on a write type packet, up to one cache line of data is placed in a
register, as shown in the figure. For read packets the whole packet is stored in the
register. In parallel with the dequeue of the last word of payload, NI-in controller
issues a request to the allocator. When granted, the data in the register are appropri-
ately buffered in the data arrays.
Virtual cut-through switching implemented for the prototype’s NoC requires
uninterrupted packet transfer. In order to support multiple cache lines of data in
RDMA write packets, NI-out needs to have higher priority in the allocator. In addi-
tion, the allocator allows the outgoing traffic controller to bypass the tags read stage
of the pipeline when desired, and directly read the data arrays. This may cause a
pipeline bubble for an in-progress operation that would access the data in the cycle
after NI-out is granted access8 . If the in-progress operation does not need to access
the data array (e.g. a cache miss), the bubble is not necessary.
The processor interface can utilize on average two out of every four cycles of
the allocator, because NI-in and NI-out do not issue requests faster than one every four
cycles. There is only one exception to this rule about the frequency of NI-in and
NI-out requests to the allocator, which will be discussed in the following. The reason
for the rule is that the data arrays have 4 times the width of the NoC (i.e. 256 bits
versus 64 bits). As a result, selecting NI-in or NI-out in the allocator consumes and
provides data respectively with four times the bit rate of the datapaths of NI-in and
NI-out. Giving higher priority to NI-out over other agents in the allocator does not
affect NI-in, which also has available one out of every four cycles using the second
highest priority in the allocator.
7
This is not necessary. VCs could also be processed in round-robin.
8
In the actual implementation there are two arbiters: one for the tags over two successive cycles,
as well as for the job-list in the second of them, and one for the data. Thus, not only the current
operation at the data stage suffers the bubble, but also any succeeding operation. The arbitration
described here is equivalent, because other operations cannot be selected when NI-out is selected.
In the Decision stage, event response and miss control decide whether customized buffering
is dictated by the accessed memory semantics and whether an access will trigger
communication by accessing the job-list. In the Access stage, the data arrays are accessed, the tags may
be written, and a job may be enqueued in the job-list. Finally, loads that hit in the
L2 utilize an L1 Fill stage to update the L1 and deliver data to the processor.
Since local scratchpad memory is cacheable in the L1, operations in the first
category (loads) may hit in the L1 cache, which completes their processing in the
stage labeled L1, returning the data to the processor in the same cycle. Loads that
miss in the L1 cache, issue a request to the L2 cache allocator at the end of the L1
stage, which will stall the pipeline if not granted immediately. After a grant from
the allocator, a load proceeds to the Decision stage. Load miss and remote load
requests are prepared in this stage, or a hit is diagnosed. In the Access stage, misses
and remote loads only access the job-list and hits access the data arrays. For load
hits an L1 Fill stage follows.
Other than remote stores, operations in the second category may also result in
a hit during the L1 stage, but, because of the write-through policy of the L1, an L2
allocator request is also issued. Remote stores are identified during this cycle by the
processor ART and are written to the store combining buffer in a subsequent Store
Buffer stage. On a combining buffer hit (i.e. an access that finds data for the same
destination line already in the buffer), processing of the remote store completes in this
cycle, unless the combining buffer becomes full. In the latter case a request is issued to
the L2 allocator in this stage11. An allocator request is also issued on a combining
buffer miss (i.e. an access that finds data for a different destination or line in the buffer).
Loads that miss in the L1 cache issue a request to the L2 cache allocator as well. In
all cases the pipeline is stalled unless the allocator grants the request immediately.
After a grant from the allocator, stores proceed to the Decision stage. If the
single miss status holding register (MSHR) is full and a cacheable store miss is
detected, the processor pipeline is stalled, but without blocking the Decision stage.
Otherwise, cacheable store misses and remote store requests prepare a communica-
tion job description, or an L2 store hit is identified. In the Access stage, cacheable
store misses and remote stores enqueue a job description in the job-list, while local
scratchpad stores and cacheable store hits access the data arrays. In addition, store
misses write their data in the MSHR.
The third category, referring to ESL accesses, is different in two ways. First,
ESLs are not L1 cacheable –loads only acknowledge the L1 and return data directly
11
The store combining buffer has a configurable timeout that allows it to also initiate the L2 allo-
cator request independently of a processor store.
to the processor. Thus, both ESL loads and stores always result in a request to the
L2 allocator during the L1 stage. Second, local ESL accesses may trigger an event
response in the Decision stage. This is utilized for stores to counter synchroniza-
tion primitives, which may dispatch notifications if the counter becomes zero. For
communication initiation, a corresponding description is enqueued in the job-list in
the Access stage. In addition, operation side effects can be implemented for these
accesses. For example, as will be discussed in the next subsection, counter loads
return the value of the counter, which is actually stored in the cache tags, and stores
to a specific offset in a single-reader queue ESL update the head pointer, also actually
stored in the tags. Tag reads and writes, in the Decision and Access stages respectively,
are utilized to implement such operation side effects.
The command buffer communication primitive allows software to pass transfer de-
scriptions to the NI, with a series of store operations. The order of the stores is
immaterial. Figure 3.10 shows how the tag and data block of a command buffer
ESL are formatted. The tag includes bit fields for a completion bitmask, a read and
a write access rights mask, the operation code, and a bit-flag indicating a remote
source address useful in RDMA operations. The completion bitmask provides a
bit-flag for each word in the command buffer data block. When a word in the data
block is written, the corresponding flag is set.
Figure 3.10: Command buffer ESL tag and data-block formats. The data-block format
corresponds to software transfer descriptions for RDMA and message operation initiation.
Read and write access rights mask fields are intended for recording whether the currently
running thread has the appropriate access rights for the transfer address arguments it
requests with the transfer description. Their operation is not currently implemented because the ad-
dress region table (ART) does not support protection yet. The operation code is used
to designate the use of ESL data block words (differentiated for write messages).
RDMA-copy operations and read-messages use only the first four words of the
ESL for a transfer description. In these operations, the first word of the command
buffer data block (at offset 0) expects the transfer size in the two least significant
bytes, the command’s size (i.e. 4 words) in the next more significant byte, and the
transfer command operation code in the most significant byte. When this word is
written, the NI updates the ESL tag by complementing the completion mask ac-
cording to the command size, writing the operation code field, and setting the un-
necessary permissions in read and write access rights mask. The second word of
the command buffer data block (at offset 1) expects the destination address for the
transfer, the third (at offset 2) the source address, and the fourth (at offset 3) the
acknowledgement address. When all words are stored in the command buffer, a
transfer is initiated by enqueueing a job description in the job-list, pointing to the
command buffer.
Write messages do not require a source address, so the third word of the com-
mand buffer is used for the acknowledgement address, leaving words from offset 3
to offset 7 of the command buffer data block available for message data. In addi-
tion, write messages do not use the transfer size field of the first data block word,
since the command size field suffices. All stores to a command buffer read metadata
during the Decision stage of the memory access pipeline, described in the previous
section, and update them in the Access stage. In addition, a job description can be
enqueued in the job-list during the Access stage, when completion of the transfer
description is detected in the Decision stage. As a result, RDMA and minimum size
write message operation initiation requires only four processor stores.
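Based on this layout, software could initiate an RDMA-copy with four ordinary word stores to the command buffer, roughly as in the sketch below. The cmd_buf mapping, the opcode constant, and the helper name are hypothetical, standing in for the prototype's nilib interface.

#include <stdint.h>

#define OP_RDMA_COPY 0x01u         /* hypothetical opcode encoding              */

/* Issue an RDMA-copy through a command-buffer ESL mapped at 'cmd_buf'.
 * The four stores may be issued in any order; the NI tracks them with the
 * completion bitmask and starts the transfer once all four have arrived.        */
static inline void rdma_copy(volatile uint32_t *cmd_buf,
                             uint32_t dst, uint32_t src,
                             uint32_t ack, uint16_t size_bytes)
{
    /* word 0: transfer size (2 LSBytes), command size in words, opcode (MSByte) */
    cmd_buf[0] = (uint32_t)size_bytes | (4u << 16) | (OP_RDMA_COPY << 24);
    cmd_buf[1] = dst;               /* word 1: destination address               */
    cmd_buf[2] = src;               /* word 2: source address                    */
    cmd_buf[3] = ack;               /* word 3: acknowledgement address           */
}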
Figure 3.11: Counter ESL tag and data-block formats. The tag stores the counter value,
updated atomically by the NI. The data-block is used for software configuration of noti-
fications and counter operation. Counter read access, and atomic add of stored values, is
provided at word offset zero of the data-block. (Data-block layout: word 2 holds the
notification data, word 3 is unused, and words 4–7 hold up to four notification addresses.)
3.2.5.2 Counters
Indirect software access to the counter value is provided via offset 0 of the
counter ESL data-block. Processor loads at this offset read the counter value from
the tags during the Decision stage, which is sent back through a dedicated path12 .
Local or remote stores at word offset 0 read the counter also in the Decision stage,
perform the add operation, and check for a zero result. The counter is updated in
the subsequent Access stage. In order to initiate the notifications, in case the counter
value becomes zero after a store, the counter ESL data-block is also read during this
stage. Four notification jobs can be enqueued in the job list in the four subsequent
cycles, stalling the L2 pipeline. Notification address and data are placed in the job
list, and the counter is updated again with the reset value.
12
Tags are also mapped in the processor address space to support ESL configuration, and thus
support a path for load operations. This path is not shown in figure 3.8.
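As an illustration of how software might use such a counter for a barrier, consider the sketch below. The function names, the spin-wait flag, and the use of -1 per arrival are assumptions made for the sketch; the actual interface is that of nilib and the notification configuration of figure 3.11.

#include <stdint.h>

/* Word offsets within a counter ESL data-block, as described above.             */
#define CNT_ADD_OFFSET   0   /* loads read the counter, stores add to it         */
#define CNT_NOTIF_DATA   2   /* value written to the notification addresses      */
#define CNT_NOTIF_ADDR0  4   /* up to four notification addresses (words 4-7)    */

/* Hypothetical barrier arrival: each participant atomically adds -1 to a counter
 * initialized to the number of participants; when it reaches zero the NI resets
 * it and sends the configured notifications, which participants spin on locally. */
static inline void barrier_arrive(volatile int32_t *counter_esl)
{
    counter_esl[CNT_ADD_OFFSET] = -1;   /* store = atomic add performed by the NI */
}

static inline void barrier_wait(volatile uint32_t *local_flag, uint32_t epoch)
{
    while (*local_flag != epoch)        /* flag written by an NI notification     */
        ;                               /* spin in local scratchpad               */
}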
Figure 3.12: Formation of a single-reader queue in scratchpad memory of the SARC cache-
integrated NI. Double-word queue item granularity is depicted.
3.2.5.3 Queues
The SARC prototype implements single- and multiple-reader queues. The support
required is similar and so is the tag format. Single-reader queues (sr-Qs for short)
additionally provide software access to the head and tail queue pointers. Beyond the
single-reader queue functionality presented in subsection 2.4.1, sr-Qs support dif-
ferent item granularities that are submultiples of the cache-line size (i.e. 1, 2, 4, and 8
words), under software configuration. An example with double-word item granu-
larity is illustrated in figure 3.12. Multi-word item enqueue operations must use a
write message. The queue is formed in normal scratchpad memory lines consecu-
tive to the queue descriptor ESL. As shown in the figure, writes to the “official” word
offset (offset 0) of the single-reader queue ESL data-block are steered to the appropriate
queue position (enqueue operations). Word offset 1 allows indirect loads and stores to the queue
head pointer and offset 2 allows load access to the tail pointer –stores are ignored so
that the tail is only modified by the NI.
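A consumer-side sketch of these accesses follows; all names are hypothetical, and pointer wrap-around as well as the size-bit encoding described below are omitted for brevity.

#include <stdint.h>

/* Word offsets of a single-reader queue descriptor ESL, as described above.     */
#define SRQ_ENQUEUE  0   /* "official" offset: writes are steered to the tail    */
#define SRQ_HEAD     1   /* indirect load/store access to the head pointer       */
#define SRQ_TAIL     2   /* load-only access to the tail pointer                 */

/* Hypothetical dequeue of one single-word item.  The queue body lies in the
 * scratchpad lines that follow the descriptor ESL.                              */
static inline int srq_try_dequeue(volatile uint32_t *q_esl,
                                  volatile uint32_t *q_body,
                                  uint32_t *item)
{
    uint32_t head = q_esl[SRQ_HEAD];
    uint32_t tail = q_esl[SRQ_TAIL];    /* advanced only by the NI                */
    if (head == tail)
        return 0;                       /* queue empty                            */
    *item = q_body[head];               /* read the item from scratchpad          */
    q_esl[SRQ_HEAD] = head + 1;         /* free the slot: store to head pointer   */
    return 1;
}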
In order to support queue pointers in the limited tag space, some restrictions
are imposed on queues: (i) the number of scratchpad lines forming the queue body
plus 1 (the ESL) must be a power of two, (ii) the whole sequence of scratchpad lines
including the ESL must be aligned to its natural boundary (i.e. the ESL must start
at an address that is a multiple of the total bytes of scratchpad and ESL data storage
for the queue). Thus, queues can be 2, 4, 8, etc. cache lines (including the ESL),
which with the 32-byte line size of the prototype corresponds to 64, 128, and 256 bytes
respectively; the queue descriptor ESL must be allocated at an address that is a multiple of that size.
Figure 3.13 shows the tags of single- and multi-reader queues which are simi-
lar. The single-reader queue tag includes a head and a tail pointer, an item size field,
an extra bit to indicate a maximum size queue (i.e. 128 items) and a full bit. For
queues of smaller size, the queue size is encoded in each queue pointer as its most signif-
icant set bit. Because the queue size is a power of two, its binary representation
has a single 1. This 1 is of higher significance than the bits used for pointers in the
range 0-queue_size. For example, a queue of maximum size of 16 words with one
word item size uses head and tail pointer values of 0 to 15 (0001111 in binary) so
the bit indicating the queue size of 16 (0010000 in binary) can be overlapped with
each pointer without changing any of the possible values, as long as the bit indicat-
ing the size and other more significant bits are ignored (this is trivial to implement
in hardware). The case of a maximum size queue that uses all pointer bits requires
an additional bit, and the extra bit is used for this purpose. Bits of lower significance
than the bit indicating the queue size are the useful pointer bits.
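A small sketch of this decoding, assuming the tag stores a pointer with the size bit overlaid as described (names are illustrative):

#include <stdint.h>

/* Extract the useful bits of a head or tail pointer from its tag encoding.
 * queue_items is a power of two (e.g. 16); the bit of weight queue_items is the
 * overlaid size indication, and all more significant bits are ignored.          */
static inline uint32_t q_ptr_index(uint32_t tag_ptr, uint32_t queue_items)
{
    return tag_ptr & (queue_items - 1);
}

/* Example: for a 16-item queue, pointer values 0..15 fit below the size bit
 * (value 16 = 0x10), so q_ptr_index(0x10 | 5, 16) == 5.                         */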
The multiple-reader queue tag has almost the same fields, but it uses two tail
pointers, one for reads and one for writes, and two full bits. In addition, each
item consumes one cache-line in order to reserve sufficient storage for read request
buffering in each queue entry. This is guaranteed by dropping any data of writes
over 7 words; the intended use is via messages that support up to 5 words of data.
Buffering write packets arriving from the NoC in a queue (and read packets in
the case of multiple-reader queues) exploits the usual L2 access flow: in Decision
stage, after the tags are read, some least significant bits of the accessed address (the
“official” queue offset) are replaced with the useful bits of a tail or head pointer. In
addition, during the Decision stage a possible match of a read and a write is detected
for multiple-reader queues. The data arrays are written in the Access stage, and, if a
match was detected, a job description is enqueued in the job-list.
Figure 3.13: Single-reader queue tag and multiple-reader queue tag formats (24 bits each).
The hardware cost results of the initial implementation indicated that there was room for improvement of the NI design. We studied the optimization
of the prototype’s hardware cost to improve cache-integration quality and reduce the
design’s footprint.
We based the optimization approach on the fact that RDMA and message de-
scriptions formed in command buffers, provide most of the information that ap-
pears in NoC packet headers. This information also appears, for the most part,
inside multiple-reader queues as response and acknowledgement addresses of read
requests. We modified the format of the command buffer data-block, used to de-
scribe transfers, and the NoC packet formats, so that they favor each other, as shown
in figure 3.14. The optimized formats can be compared against the original formats
in figures 3.7 and 3.10. The modifications respect alignment of fields in correspon-
dence, to avoid redundant multiplexing for a 64-bit wide NoC.
The result was the reduction of the NI-out logic to 1206 LUTs and 513 flip-
flops (33.2% reduction) and also the reduction of the NI-in logic to 834 LUTs and
458 flip-flops (23.4% reduction), with only a minor increase in the memory con-
troller and its datapath. Assuming that 1 LUT or 1 flip-flop is equivalent to 8 gates,
the hardware cost of NI-out was reduced to 13.7K gates (from 20.6K), and that of
NI-in to 10.3K gates (from 13.5K).
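As a quick check of this conversion: (1206 + 513) × 8 = 13,752 ≈ 13.7K gates for NI-out, and (834 + 458) × 8 = 10,336 ≈ 10.3K gates for NI-in.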
Figure 3.15 shows the results of this optimization, as a percentage of
the gate count for a plain cache. Bars of the unoptimized design from [99] are also
shown on the right side of the chart, separated from other bars with a vertical line,
for comparison. Bars on the left of the vertical line correspond to optimized designs.
The simple integrated design (third bar from the left) is less than 20% larger than the
plain cache. The results of [99] compare a partitioned cache-and-scratchpad design
with simple RDMA support against a corresponding simple integrated design. In that
study, the integrated design (second bar from the right) is 35% smaller than the
partitioned one (first bar from the right), whereas after optimization the simple
integrated design is 42.2% smaller than the corresponding partitioned design (fourth
bar from the left). This improvement holds in spite of the reduction of the partitioned
design to 63.4K gates after optimization, versus 72K gates before optimization.
In addition, figure 3.15 presents the cost of adding more advanced functionality
to the simple integrated design in bars for “Integrated advanced” and “Full function-
ality” designs. The addition of counters and notifications requires logic equivalent to
13.6% of the plain cache design, or 4.2K gates. Adding then multiple-reader queues
requires 15.2% or 4.7K gates, single-reader queues 10.6% or 3.3K gates, barrel
shifter for RDMA arbitrary alignment 9.6% or 3K gates, and combining buffer for
remote stores 23.8% or 7.3K gates.
Figure 3.14: Modified command buffer and NoC packet formats for design optimization.
(a) Modified command buffer data-block format: word 0 holds the destination address;
word 1 the command size, opcode, and transfer size; word 2 the source address (unused
for write messages); word 3 the acknowledgement address; words 4–7 write-message data.
(b) Modified NoC write packet format. (c) Modified NoC read packet format.
Figure 3.15: Hardware cost of the evaluated designs as a percentage of the gate count of a
plain cache (y-axis: % of cache-only gates). Bars: Scratchpad-only, Cache-only, Integrated
simple, Partitioned, Integrated advanced, Full functionality, and the non-optimized Integrated
simple and Partitioned designs of [99]; the per-bar breakdown distinguishes control & datapath,
incoming IF, outgoing IF, RDMA (8B aligned), counters, mr-Qs, sr-Qs, barrel shifter, and the
remote-store combining buffer.
Putting these results into perspective for cache integration in future processors,
one should keep in mind that the base cache-only design of figure 3.15 only supports
a single outstanding miss and miss status holding register (MSHR). In addition,
these results only include the control and datapath of the cache and the integrated
NI designs, and thus do not reflect the area cost of the SRAM arrays for tags and
data of the cache and scratchpad, that will occupy most of the area in the integrated
design.
Nevertheless, changing the packet format requires changes in all system mod-
ules of the prototype (NoC, DDR controller and its DMA engine, L2 NI and off-chip
switch) and associated debugging for robustness. In addition, changes in software
library code are required because of the modification of transfer descriptions using
the old command buffer data-block format. Because of these reasons and because
only 65% of the FPGA logic resources are utilized with the non-optimized design,
the SARC prototype sticks, at the moment, to the stable but unoptimized version.
Software was compiled with a version of gcc (mb-gcc) targeting the Microblaze
processors. The Xilinx Embedded Development Kit (EDK) allowed code mapping on
“bare metal”, and the Xilinx Microprocessor Debug (XMD) engine was employed
for run-time debugging. The UART module of the prototype (see figure 3.6) was
also utilized for a single I/O terminal.
An SHMEM-like library was used to circumvent the need for a special com-
piler targeting a global address space, and for a method to exchange addresses
among threads in the absence of cache coherence1 . SHMEM [100] provides a hybrid
shared memory/message passing programming model. It uses explicit communi-
cation, and memory management and object replication are handled in software, which is
similar to message passing, but it does not use two-sided send-receive style operations.
Instead, it is based on one-sided, non-blocking operations that map directly to the
explicit communication mechanisms provided in the SARC architecture. Remote
memory address specification in SHMEM would use a thread identifier and a local
address. Such pairs were mapped to global addresses with dedicated library calls
for our platform.
Three libraries were implemented, syslib, scrlib, and nilib. The syslib im-
plements system management functions for hardware components other than the
cache-integrated NI. It provides a lock implementation using the hardware lock-box
1
Alternatively, exchanging addresses among threads could be achieved maintaining cache coher-
ence in software and per-thread signaling scratchpad areas, or via non-cacheable, off-chip, shared
memory.
module accessible over the platform’s shared OPB bus. It also provides a central-
ized barrier implementation which also uses the lock-box for locking. In addition,
syslib supplies thread-safe main memory allocation and I/O facilities, and access to
hardware registers for thread identifiers and cycle accurate global time.
The scrlib library allows manipulation of NI memory, to allocate parts of
L2 cache as scratchpad regions at runtime, and designate the use of cache lines
for communication and synchronization primitives (command buffers, counters and
queues). It also provides functions to convert local addresses to remote ones, and
check if an address is local or remote. Last, nilib contains functions for prepar-
ing and issuing DMAs and messages via command buffers, managing counters and
notifications, and accessing single-reader queues.
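To illustrate the address-conversion role of these functions, a minimal sketch follows; the window-based global-address layout and all names are assumptions for the sketch, not the actual library interface.

#include <stdint.h>

/* Hypothetical layout: each node's scratchpad is mapped at a fixed-size window
 * inside a global (non-coherent) address range.                                 */
#define SCRATCH_GLOBAL_BASE 0x80000000u
#define SCRATCH_WINDOW_SIZE 0x00010000u   /* 64 KB of L2 per node                */

/* Convert a local scratchpad address to the address another thread would use
 * to reach it, in the spirit of the scrlib conversion functions.                */
static inline uint32_t scr_local_to_remote(uint32_t local_addr,
                                           uint32_t local_base,
                                           unsigned node_id)
{
    uint32_t offset = local_addr - local_base;
    return SCRATCH_GLOBAL_BASE + node_id * SCRATCH_WINDOW_SIZE + offset;
}

static inline int scr_is_remote(uint32_t addr, uint32_t local_base)
{
    return !(addr >= local_base && addr < local_base + SCRATCH_WINDOW_SIZE);
}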
The lock microbenchmark measures the average time for a lock-protected,
empty critical section, when 1, 2, or 4 cores contend for the lock. Lock acquisi-
tions start after a barrier, and 10^6 repetitions per processor are measured (the time
of loop overhead code is subtracted). Two lock implementations are compared:
the one in the syslib using the shared hardware lock-box, and an implementation
using a multiple-reader queue with a lock-token as presented in chapter 2 (see fig-
ure 2.14(a)). The barriers benchmark similarly measures the average time for a
barrier in 10^6 invocations, after an initial barrier. The syslib centralized barrier2 is
compared to a barrier implementation using a single counter synchronization prim-
itive and notifications.
The STREAM triad benchmark [101] is designed to stress bandwidth at dif-
ferent layers of the memory hierarchy. The benchmark copies three arrays from a
remote to a local memory, conducts a simple calculation on the array elements, and
sends the results back to the original remote memory. Two configurations of STREAM
were developed for stressing on-chip and off-chip memory bandwidth respectively.
In the on-chip configuration, the data are streamed from scratchpad memories to
scratchpad memories and backwards, whereas in the off-chip configuration, data are
streamed between DRAM and scratchpad memories. In each configuration, multi-
buffering was applied to overlap the latency of computation with communication
from and to “remote” memory, varying the number of buffers and their size. In
addition, different versions of the benchmark utilize RDMA and remote stores for
producer-initiated transfers (transfers from remote memory are always consumer-
initiated to preserve benchmark semantics).
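The multi-buffering scheme can be sketched as below, with hypothetical dma_read/dma_write/dma_wait calls standing in for the nilib RDMA primitives and a per-buffer completion tag as the synchronization assumption (the stubs simply copy data so the sketch is self-contained).

#include <stdint.h>
#include <string.h>

#define NBUF  3              /* scratchpad buffers per array                     */
#define BSIZE 256            /* elements per buffer (tunable)                    */

/* Stubs for illustration: a real implementation would program the NI.           */
void dma_read (void *dst, const void *src, uint32_t bytes, int tag)
{ (void)tag; memcpy(dst, src, bytes); }
void dma_write(void *dst, const void *src, uint32_t bytes, int tag)
{ (void)tag; memcpy(dst, src, bytes); }
void dma_wait (int tag) { (void)tag; }

/* Triad a[i] = b[i] + s*c[i] streamed through NBUF scratchpad buffers, so the
 * transfer of one block overlaps the computation on another (n divides BSIZE).  */
void stream_triad(float *ra, const float *rb, const float *rc,
                  uint32_t n, float s,
                  float la[NBUF][BSIZE], float lb[NBUF][BSIZE],
                  float lc[NBUF][BSIZE])
{
    uint32_t nblk = n / BSIZE;

    /* Prime the pipeline: fetch the inputs of the first NBUF blocks.            */
    for (uint32_t blk = 0; blk < NBUF && blk < nblk; blk++) {
        dma_read(lb[blk], rb + blk * BSIZE, BSIZE * sizeof(float), (int)blk);
        dma_read(lc[blk], rc + blk * BSIZE, BSIZE * sizeof(float), (int)blk);
    }
    for (uint32_t blk = 0; blk < nblk; blk++) {
        unsigned buf = blk % NBUF;
        dma_wait((int)buf);            /* inputs ready, old write-back drained    */
        for (uint32_t i = 0; i < BSIZE; i++)
            la[buf][i] = lb[buf][i] + s * lc[buf][i];
        dma_write(ra + blk * BSIZE, la[buf], BSIZE * sizeof(float), (int)buf);
        if (blk + NBUF < nblk) {       /* prefetch a later block into this buffer */
            dma_read(lb[buf], rb + (blk + NBUF) * BSIZE,
                     BSIZE * sizeof(float), (int)buf);
            dma_read(lc[buf], rc + (blk + NBUF) * BSIZE,
                     BSIZE * sizeof(float), (int)buf);
        }
    }
    for (int b = 0; b < NBUF; b++)
        dma_wait(b);                   /* drain remaining write-backs             */
}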
The FFT benchmark originates from the coarse-grained reference FFT imple-
mentation in the StreamIt language benchmarks [102] and does a number of all-to-
2
The syslib implementation integrates the counter of processors arriving at the barrier and a global
sense in a variable protected by the lock-box.
all data exchanges between the processors. RDMA and remote store alternatives
are used for the data transfers, to explore tradeoffs between the two communi-
cation mechanisms. The evaluation varies the input data-set size and the number of cores
used. The bitonic sort benchmark also originates from the reference implementa-
tion of the StreamIt language benchmarks [102]. Bitonic-sort has a low communi-
cation to computation ratio and we use it to measure the minimum granularity of
exploitable parallelism on the prototype. The benchmark measurements exclude the
initial fetching of data in scratchpad memories. The trade-off between RDMA and
remote stores is also evaluated in this case, varying the data-set size.
4.1.2 Results
Table 4.1(a) shows acquire, release, and average latencies for contended lock-unlock
operations using the lock-box or a multiple-reader queue (mr-Q). The lock that lever-
ages the multiple-reader queue executes an acquire-release pair for an empty critical
section under full contention between 4 cores in 214 cycles. Although the lock-box
provides comparable performance for this small number of contending processors,
multiple-reader queue access via the crossbar, including controller times, is faster
regardless of the number of processors. Table 4.1(b) shows centralized and counter-
based barrier times with one, two, and four participating processors. The counter-
based barrier uses on-chip signaling for barrier arrival and automated notifications,
and between 4 cores takes 117 cycles, compared to 618 for a centralized barrier over the
shared OPB bus of the prototype.
To put these numbers in perspective, consider that the one-way latency of a
remote store (used for signaling) is 18 cycles, and of a read message (used for lock
acquisition) is 21 cycles in hardware simulation (for timing breakdown of hardware
communication mechanisms see [99]). Although direct comparisons with other
hardware designs are not possible because of differences in CPI and the limited size
of resources in the prototype, note that on leading commercial multicore processors,
lock acquire-release pairs cost in the order of thousands of cycles, and barriers cost
in the order of tens of thousands of cycles [103]. Considering processors with multi-
GHz clocks, lock-unlock time corresponds to hundreds of nanoseconds and barrier
corresponds to a few microseconds3 . These times reflect library and OS overhead
that are difficult to avoid without hardware support.
Figure 4.1 illustrates the results of the STREAM benchmark on the FPGA pro-
totype, as the achieved bandwidth varying the size of the transferred buffer. The
maximum feasible bandwidth with each communication mechanism is plotted with
a horizontal line. Bandwidth for remote stores and RDMA is shown in separate
charts with groups of three curves colored in red, green, and blue, that correspond
to benchmark runs with different numbers of cores. The three curves of each group
3
Because the SARC prototype runs only at 75 MHz, the wall clock time for acquire-release of
an mr-Q based lock is 2.782 us, and for counter barriers 1.521 us.
represent use of one, two, and three buffers for computation and overlapping trans-
fers.
Off-chip bandwidth is depicted in the upper charts, for remote stores and
RDMA, measured by streaming data from three large arrays (more than scratch-
pads can fit) in DRAM. For measuring the on-chip realizable bandwidth, data are
laid out in scratchpad memories of one or two cores and the remaining cores stream
data from these scratchpads. As expected, the achievable maximum bandwidth with
remote stores is lower (about 7× - 8×) than the maximum achievable bandwidth
with RDMA, despite use of the store combining buffer, because remote stores incur
the overhead of one instruction per word transferred whereas RDMA can transfer
up to 64 KB (L2 size) worth of data, with overhead of only 4 instructions. In all
cases, the architecture can maximize bandwidth and overlap memory latency using
small scratchpad buffer space (3KB - 4KB), when four cores are used.
For the FFT and bitonic sort benchmarks, producer-initiated transfers are used
and the time the consumers wait for data is measured. In the remainder of this eval-
uation, consumer wait time is referred to as communication time, and the remaining
execution time, including the time to initiate communication, is referred to as com-
putation time. Note that communication time does not reflect the actual time for
communication, but rather the amount of communication time that benchmark ex-
ecution could not hide, which is the actual communication overhead for the bench-
mark. The left side of figure 4.2 plots the speedup of on-chip FFT for various input
sizes, using remote stores and RDMA. The right side of the figure illustrates remote
store and RDMA benchmark version execution time, broken down in computation
and communication times, for selected input sizes and normalized to the time of the
corresponding remote store execution.
Figure 4.2: FFT speedup on 1, 2, and 4 processors (left) and execution time breakdown into
communication wait time and computation time (right), for DMA and remote-store (RemSt)
versions with inputs of 4, 16, 64, and 4K elements.
The results exhibit a trade-off between communication with RDMA and communication
with remote stores. For input sizes corresponding to 8 scratchpad lines
of data or less (N≤128 elements) remote stores reduce communication time by 5-
48% and increase overall performance by 0.4-4.5%. However, FFT does not profit
from parallelization on small input sizes (N≤128), because execution time is dom-
inated by overhead for locating the receivers of messages during the global data
exchange phase and loop control code that is inefficient on the Microblaze proces-
sor. For larger input sizes, the RDMA benchmark version outperforms the remote
store one (by 0.6-20% in terms of overall performance) due to lower communication
initiation overhead and better overhead amortization. The performance advantage
of RDMA is consistently amplified as the input size increases. Parallel efficiency
with RDMA reaches 95% on 2 cores and 81% on 4 cores, while with remote stores
it is capped at 88% on 2 cores and 68% on 4 cores, for data-sets that fit on-chip.
Figure 4.3 shows the corresponding data for bitonic sort, which also expose the
trade-off between RDMA and remote stores. In addition, the speedup chart shows
that bitonic sort execution provides a speedup of more than 1 even with a data-set
of N=4 elements. This reflects the profitable parallelization of tasks as small as 470
clock cycles, achieved by cache-integration of explicit communication mechanisms.
In input sizes corresponding to 4 scratchpad lines of data or less (N≤64 ele-
ments), communication time with remote stores is 5-41% less than communication
time with RDMA. With the same small input sizes, overall performance with remote
stores exceeds performance with RDMA by 0.2-14%. For larger input sizes, com-
munication time with DMAs is 13-32% less than communication time with remote
stores, however overall performance with DMAs exceeds performance with remote
stores only marginally (by no more than 0.2%) due to the low communication-to-
computation ratio of the benchmark. Parallel efficiency with RDMAs reaches 89%
on 2 cores and 67% on 4 cores, while with remote stores it is slightly lower (87%
on 2 cores, 61% on 4 cores), for data sets that fit on-chip, i.e. do not exceed the
64 KB of available scratchpad space. Overall, the presence of both communication
mechanisms enables more effective parallelization depending on task input size.
4.1.3 Summary
In order to exploit many-core processors with scalable on-chip networks, scalable
communication mechanisms will be required that can effectively hide remote mem-
ory access latency and utilize the available memory bandwidth. Block transfer facilities
from and to local memories, like RDMA provided in the SARC architecture, achieve
these targets by means of fast transfer initiation with only four store instructions, and
by amortizing NoC header and CRC overhead over larger packets (4 scratchpad lines
in the SARC prototype). Running the STREAM benchmark, transfers summing up
to 3KB-4KB total size from 4 cores, suffice to saturate the prototype’s bandwidth.
Locality will be very important on many-core processors, and its effective ex-
ploitation will benefit from the use of low latency mechanisms that enable fine-grain
parallelization. On the SARC prototype, remote store communication allows prof-
itable parallelization of bitonic sort tasks of less than 500 clock cycles length. Tai-
loring communication mechanisms for low latency and for high bandwidth trans-
fers can enable software trade-offs in exploiting locality when possible, versus high
bandwidth overlapped transfers that naturally hide latency, when locality is not a
choice.
The use of network interface synchronization primitives for lock and barrier
operations provides promising performance on the prototype. The limited number of
cores makes the evaluation inconclusive, so these functions are further investigated
in the next section.
line 12).
In total, all nodes will suffer six misses (the root-node only 5) in the process
of the three barrier steps: (i) receive signals from children (4 misses), (ii) signal
my parent (1 miss), and (iii) receive –or send– the episode end-signal (1 miss).
The latency of misses for childready flags at a parent node, is partially overlapped
with the latency of misses for updates via the parentflag pointer at the children.
Similarly, the latency of non-root node misses for receiving the episode end-signal,
is partially overlapped with the latency of the root-node miss for sending the signal.
Nevertheless, the childready flag misses at each level of the tree are serialized by the
while of line 7. In other words, the latency of overlapped misses is multiplied by the
number of levels in the tree minus one. Each of these misses entails a round-trip to
the child or the parent updating the relevant flag via the directory. The final miss on
the variable pointed to by rootflag is also added to the total for each node, but incurs
different latency in each case, since the directory can only reply to one load request
at a time. The requests are overlapped with each other and with the invalidations the
directory sends when the root node updates the flag. With rough calculations, for
T threads there are log4(T) + 1 round-trip transfers through the directory, plus T
serialized packet injections in the NoC.
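As a concrete instance of this estimate, for T = 64 threads this amounts to log4(64) + 1 = 4 round-trip transfers through the directory, plus 64 serialized packet injections; for comparison, the counter-based barrier discussed next totals 2 × log4(64) = 6 one-way transfers.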
The same three signaling steps are also utilized in the counter-based barrier
with the following differences: (a) signals do not require a round-trip and do not go
through the directory, but go directly to the receiver, (b) since no chain of misses
is involved, a broadcast signaling tree is exploited as well, to avoid sending all the
signals from a single node. The total in this case is 2 × log4(T) one-way transfers.
Half of them are suffered by all nodes, and the other half incur different latency at
Figure 4.5: MCS lock pseudocode from [2] using cacheable variables.
subnetworks, allowing the release to bypass queued requests at the node hosting the
multiple-reader queue. This is not the case with coherence and atomic operations,
where requests of multiple SWAP operations on the L pointer to acquire the lock,
can be queued in front of a CAS operation on the L pointer to release the lock, in the
case of contention. Nevertheless, one round-trip of the lock-token per contending
thread is suffered, totaling T round-trips plus T local message compositions plus T
local misses latency per critical section, after (roughly) the first critical section of
each thread.
When the time for the average critical section increases, under high contention
(successive critical sections by multiple threads), time proportional to the average
critical section time and the number of contenders will be added in the wait time of
each thread. Reducing contention by adding an interval between successive critical
sections of a thread, will probably have little effect if the interval does not exceed
the average critical section time multiplied by the number of contenders. When it
does, though, the MCS lock will become almost as fast as the multiple-reader queue
based lock, because of incurring a single miss and directory overhead will no longer
be amplified by multiple misses per acquire-release pair. If the same thread’s critical
section is executed multiple times before another thread requests the lock, the MCS
lock should be faster, incurring no misses, compared to the necessary lock-token
round-trip with the multiple-reader queue.
(Each node comprises a processor P with L1 instruction and data caches and an NI & L2
cache attached to a NoC switch S; directory controllers sit in the middle of the topology;
16-, 32-, 64-, and 128-node configurations are shown.)
Figure 4.6: NoC topology for 16, 32, 64 and 128 core CMPs.
Figure 4.7: Average per processor acquire-release latency across the number of critical
sections versus the number of cores, for simulated MCS locks using coherently cached vari-
ables and multiple-reader queue (mr-Q) based locks accessed over a non-coherent address
space portion.
of the figure, and blue squares do not. The concentrated mesh was chosen because
of the more uniform hop-distance of nodes to directory controllers placed in the
middle of the topology.
The combination of GEMS and Simics behaves as an in-order, sequentially
consistent system. A set of libraries, similar to scrlib and nilib of subsection 4.1.1,
were developed to run over Simics. For our measurements we use the Simics
support for light weight instrumentation, using simulation break instructions, to se-
lectively measure synchronization primitive invocation intervals excluding the sur-
rounding loop code.
4.2.3 Results
Figure 4.7 shows the average latency of contended lock-unlock pairs of opera-
tions. In both implementations requests are queued until the lock is available. The
multiple-reader queue (mr-Q) based implementation is about 3.6-3.9 times faster
than the MCS lock implementation. The MCS average time for lock-unlock opera-
tion pairs exceeds the 2× to 3× of the average mr-Q lock-unlock time expected from
the calculations of subsection 4.2.1. The additional time should be attributed to the
per-miss directory indirections, and to the limited ability of the directory to process
accesses in parallel.
Figure 4.8: Average per processor latency across the number of barrier episodes versus
the number of processors, for simulated tree barriers using coherently cached variables and
counter-based barriers with automated notification signals.
Figure 4.8 shows the performance of the two barrier implementations. The
counter-based barrier is from 4.1× faster for 16 cores to 5.3× faster for 128 cores.
The former (4.1×) is excessive compared to the expected 2× and is probably due to
the serialized injection of responses for the barrier end-signal, and to the limited
ability of the directory to process accesses in parallel. The latter (5.3×) reflects the
additive effect of the serialized injection of the barrier end-signal.
For both MCS barriers and locks, one can expect that aggressive non-blocking
coherence protocols [4, 106] can reduce the latency of contended flag update-reclaim
interactions (atomic or not), but the communication operations in these algorithms
depend on each other and will introduce serialization of miss overhead. The explicit
communication advocated here can significantly reduce such overheads. In addition,
counters and queues can further reduce synchronization overhead by implementing
the required atomicity in cache-integrated NIs, thus decoupling the processor from
the synchronization operation.
4.2.4 Summary
The evaluation of this section, although very promising, is only partial. The
performance gain from the proposed hardware-assisted locks and barriers needs to
be studied in larger benchmarks and applications, to assess how effective they can
be. The benefits could increase further if other uses of these primitives are taken
into account, such as those of chapter 2 (subsection 2.4.3). The next section
evaluates the use of multiple-reader queues for task scheduling.
Figure 4.9: The Cilk model of multithreaded computation. Each procedure, shown as a
rounded rectangle, is broken into sequences of threads, shown as circles. A downward edge
indicates the spawning of a subprocedure. A horizontal edge indicates the continuation to a
successor thread. An upward edge indicates the returning of a value to a parent procedure.
All three types of edges are dependencies which constrain the order in which threads may
be scheduled. (This figure is reconstructed from the Cilk language manual.)
The three basic Cilk keywords are cilk, spawn, and sync. The keyword cilk
identifies a Cilk procedure, which is the parallel version of a C function –i.e. a
function that can be run as a task on another processor. A Cilk procedure may
spawn subprocedures in parallel and synchronize upon their completion. A Cilk
procedure definition is identical to that of a C function, except that it begins with
the keyword cilk.
Most of the work in a Cilk procedure is executed serially, just like C, but par-
allelism can be created when the invocation of a Cilk procedure is immediately pre-
ceded by the keyword spawn. A spawn is the parallel analog of a C function call,
and like a C function call, when a Cilk procedure is spawned, execution proceeds to
the child. Unlike a C function call, however, where the parent is not resumed until
after its child returns, in the case of a Cilk spawn, the parent can continue to execute
in parallel with the child. Indeed, the parent can continue to spawn off children,
producing a high degree of parallelism.
A Cilk procedure cannot safely use the return values of the children it has
spawned until it executes a sync statement. If any of its previously spawned children
have not completed when it executes a sync, the procedure suspends and does not
resume until all of those children have completed. The sync statement is a local
“barrier”, not a global one: sync waits only for the previously spawned children of
the procedure to complete, and not for all procedures currently executing. As an aid
to programmers, Cilk inserts an implicit sync before every return, if it is not present
already. As a consequence, a procedure never terminates while it has outstanding
children.
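A minimal example, in the spirit of the fib procedure from the Cilk manual (not one of the
benchmarks evaluated later), shows all three keywords together:

    cilk int fib(int n)
    {
        if (n < 2)
            return n;
        else {
            int x, y;
            x = spawn fib(n - 1);   /* child may run in parallel            */
            y = spawn fib(n - 2);   /* parent continues past each spawn     */
            sync;                   /* wait for both children to complete   */
            return x + y;
        }
    }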
parallel smaller amounts of work, and (ii) increase parallelism using a larger data-
set.
During the execution of a Cilk program, when a processor runs out of work, it
“asks” another processor chosen at random for work to do. Locally, a processor
executes procedures in ordinary serial order (just like C), exploring the spawn tree in
a depth-first manner. When a child procedure is spawned, the processor saves local
variables of the parent on the bottom of a stack and commences work on the child
(the convention used is that the stack grows downward, and that items are pushed
and popped from the bottom of the stack.) When the child returns, the bottom of the
stack is popped (just like C) and the parent resumes. When another processor needs
work, however, it steals from the top of the stack, that is, from the end opposite to
the one normally used by the victim processor.
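Schematically, and ignoring the synchronization between worker and thieves discussed later in
this section, the two ends of this stack/deque are used as follows (all names are illustrative,
and empty/full checks as well as the downward-growth convention are omitted):

    /* Illustrative sketch of the worker's deque discipline. */
    #define MAX_DEPTH 1024
    typedef struct CilkFrame CilkFrame;     /* opaque run-time frame        */
    typedef struct {
        CilkFrame *slot[MAX_DEPTH];
        int top;                            /* end thieves steal from       */
        int bottom;                         /* end the local worker uses    */
    } Deque;

    void push_bottom(Deque *d, CilkFrame *f) { d->slot[d->bottom++] = f; }    /* on spawn   */
    CilkFrame *pop_bottom(Deque *d)          { return d->slot[--d->bottom]; } /* on return  */
    CilkFrame *steal_top(Deque *d)           { return d->slot[d->top++]; }    /* by a thief */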
Two levels of scheduling are implemented in Cilk: nanoscheduling, which
defines how a Cilk program is scheduled on one processor, and microscheduling,
which schedules procedures across a fixed set of processors. The nanoscheduler
guarantees that on one processor, when no microscheduling is needed, the Cilk code
executes in the same order as the C code. This schedule is very fast and easy to
implement. The cilk2c source-to-source compiler translates each Cilk procedure
into a C procedure with the same arguments and return value. Each spawn is trans-
lated into its equivalent C function call. Each sync is translated into a no-operation,
because, with nanoscheduler-managed serial execution, all children would have al-
ready completed by the time the sync point is reached.
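Conceptually, the nanoscheduled translation of the fib example above therefore degenerates to
its C elision (the frame-bookkeeping MACROs, described next, are omitted here):

    /* Conceptual result of the nanoscheduler translation for serial
     * execution: spawn becomes a plain call, sync becomes a no-operation. */
    int fib(int n)
    {
        if (n < 2)
            return n;
        else {
            int x, y;
            x = fib(n - 1);   /* spawn -> ordinary C function call */
            y = fib(n - 2);
            ;                 /* sync  -> no-operation             */
            return x + y;
        }
    }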
In order to enable the use of the microscheduler, some code must be added to
the nanoscheduled version of the code to keep track of the run-time state of a Cilk
procedure instance. The nanoscheduler uses a double-ended queue of “frames” for
this purpose. A Cilk frame is a data structure which can hold the run-time state of
a procedure, and is analogous to a C activation frame. A double-ended queue, or
deque in Cilk terminology, can be thought of as a stack from the nanoscheduler's
perspective. There is a one-to-one correspondence between Cilk frames on the deque
and the activation frames on the C stack, which can be used in the future to merge
the frame deque and the native C stack.
To keep track of the execution state at run-time, the nanoscheduler inserts three
MACROs in every Cilk procedure. The first allocates a Cilk frame at the beginning
an optimization is exploited, based on the observation that the local worker does not
need to lock its deque or the closure, unless it tries to move the bottom pointer to
the last stack frame while a thief simultaneously tries the same with the top pointer.
An additional exception pointer, which indicates a thief's intention to steal before the
actual steal, is used to signal this situation [107]. As a result, a worker's deque has
the whole frame stack at its top at all times.
In order to optimize the locking process, a multiple-reader queue (mr-Q), which
under normal circumstances buffers only a single closure pointer⁶, replaces the deque.
This has the effect of identifying the deque lock-token with the locked value (the
closure pointer) of the original code. Because of this, two thieves trying to steal from
each other could deadlock. To prevent this, thieves do not use the multiple-reader
queue as a lock, but only as a queue providing atomicity of enqueue and dequeue
operations (i.e. atomic update of head and tail pointers). This is done using unbuffered
read messages to steal from the mr-Q (see subsection 2.4.2), which are NACK'ed if the
mr-Q is empty. The local worker, though, uses normal blocking read messages to access
the multiple-reader queue when removing a closure, and as a result takes precedence
over thieves in removing a closure from the mr-Q.
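The intended access pattern can be sketched as follows; the wrapper names, types, and NACK
return convention are hypothetical stand-ins for the NI operations of subsection 2.4.2, not
the runtime's actual interface.

    /* Hypothetical NI wrappers: a blocking read stalls in the NI until data
     * is available; an unbuffered read returns MRQ_NACK when the queue is empty. */
    typedef struct MrQ MrQ;
    typedef struct Closure Closure;
    extern long mrq_read_blocking(MrQ *q);
    extern long mrq_read_unbuffered(MrQ *q);
    #define MRQ_NACK (-1L)

    Closure *local_dequeue(MrQ *q)
    {
        /* The local worker's blocking read is queued in the NI, so it takes
         * precedence over thieves when a closure is enqueued. */
        return (Closure *)mrq_read_blocking(q);
    }

    Closure *try_steal(MrQ *q)
    {
        /* Thieves never block on the mr-Q, so two thieves targeting each
         * other cannot deadlock waiting for a lock-token. */
        long v = mrq_read_unbuffered(q);
        return (v == MRQ_NACK) ? NULL : (Closure *)v;
    }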
The augmentation of the Cilk runtime to use multiple-reader queues changes a
property of the original implementation: thieves do not "queue up" in front of a
victim's deque when simultaneously targeting the same victim. This property was
used in the proof of the execution time bound for the work-stealing schedule [108].
Nevertheless, subsequent research generalizes the execution time bound to non-blocking
deques [110] that do not have this property. In addition, in the multiple-reader queue
augmented scheduler, because thieves use non-blocking read messages and because only
a single closure is placed in the mr-Q, the buffering required for each multiple-reader
queue is restricted to one cache line plus the queue descriptor ESL.
execution minus overhead time. Overhead is the time accounted to Cilk instrumentation
of a benchmark's code with MACROs⁷, described in subsection 4.3.2. Steal time is the
time a worker's scheduler spends searching for a victim, until a successful steal is
performed. Whenever a steal occurs, a (sub)procedure's execution continues on two
processors. Because the microscheduler is involved and two closures are created for
the (sub)procedure, when the stolen computation finishes it updates a list of
descendants and any return value of a spawned child in the original closure. The time
for these tasks is measured as returning time. The remaining time of the scheduler
loop execution is labeled other in the charts of this subsection. In addition, several
events are counted on a per-processor basis.
Figure 4.10 shows the results of running the Cilk implementations of the FFT,
Cilksort, Cholesky, and LU benchmarks (provided with the Cilk distribution), using
the original Cilk runtime (SW de-Q) and the multiple-reader queue augmented version
(HW mr-Q). Cilk's recursive implementation of the FFT kernel, with a medium data
set size of 1M complex numbers, provides scaling up to 64 processors. The mr-Q
augmented runtime is about 20%-24% faster than the original for fewer than 128
processors, and 8% faster for 128 processors. Steal time is reduced in the mr-Q
version by about 26%, 16%, 20% and 8% for 16, 32, 64 and 128 processors respectively.
In addition, true work is reduced by 20% for 128 processors and by 23-24% for the
other configurations. This means that the multiple-reader queue based scheduler
exhibits better locality properties for FFT. Cilk overhead and returning time increase
with more processors, and with 128 processors account in total for 18% and 26% of
the execution time in the SW de-Q and HW mr-Q versions respectively. With 128
processors both versions slow down compared to the 64-processor execution.
Cilksort implements, in Cilk's recursive style, a variant of mergesort based on
an algorithm that first appeared in [111]. The executions shown in figure 4.10 sort
an array of 3 · 10⁶ integers and scale up to 64 processors. Although steal time is
reduced with the HW mr-Q version in all configurations, the performance of the two
versions is always within 1% of each other. The Cholesky executions, which use a
1000 × 1000 sparse matrix with 10000 non-zero elements, also do not scale beyond
64 processors with either version. The HW mr-Q version is 3-8% faster across
processor numbers, and steal time is reduced by 19-24%. The total of overhead and
returning times on 16 and 32 processors remains below 2.5% of execution time for
both scheduler versions, but for larger configurations it reaches 5-6% for the SW
de-Q version and 10-14% for the HW mr-Q.
The LU kernel, with a medium data set of a 1024 × 1024 matrix, also scales
⁷ The code saving values in a Cilk frame before a spawn was not accounted as overhead, and thus
appears in the true work time.
[Figure: four stacked-bar panels (FFT, Cilksort, Cholesky, LU); each bar breaks execution time in million cycles into True Work, Overhead, Work Stealing, Returning, and Other, for the SW de-Q and HW mr-Q versions on 16, 32, 64, and 128 processors.]
Figure 4.10: FFT, Cilksort, Cholesky, and LU benchmark execution with multiple-reader
queue augmentation of the Cilk scheduler (HW mr-Q) and without (SW de-Q).
[Figure: stacked-bar panels for Cholesky (fine-grain) and LU (fine-grain) with the same time breakdown in million cycles, for the SW de-Q and HW mr-Q versions on 16, 32, 64, and 128 processors; off-scale values of 745 and 682 million cycles are annotated.]
Figure 4.11: Cholesky and LU benchmark execution, allowing fine-granularity work partitioning.
Note that for LU the performance of the original Cilk scheduler appears to scale.
In addition, steal, overhead, and returning times are multiplied in all configurations
and for both versions of the scheduler. The performance of Cholesky does not scale
beyond 16 processors, and in both benchmarks and almost all configurations the
original Cilk scheduler performs better than the multiple-reader queue augmented one
(in fact, the HW mr-Q simulations for 128 processors had not finished after more than
a month). These results indicate that Cilk's overhead in work stealing makes it
inappropriate for fine-grain processing.
Figure 4.11 shows a considerable increase in true work compared to the executions
of figure 4.10. This increase indicates that locality is an issue for these CMP sizes
and for randomized work stealing of fine-grain tasks. Our event counts indicate that
the resulting average true work interval is more than 2 · 10⁴ cycles in length, although
the minimum is not available from these runs. The reason for the increase in true work
is increased communication among thieves and victims with a large number of processors.
Executing part of a computation in parallel requires communication and synchronization
with both the producer of the computation's input and the consumer of the computation's
output. If these times are relatively large, serial
execution may be faster. In the context of Cilk, the producer is the victim of a steal
attempt, the thief acts as a computation offload processing engine, and the consumer
is either the victim or the thief, whichever finishes its computation last.
Figure 4.12: Illustration of the minimum granularity problem for a usual Cilk computation.
[Timelines of processors PA and PB executing child A, child B, and the parent computation, for two task sizes.]
Consider a typical Cilk procedure spawning two subprocedures, followed by some
additional computation after a sync. In the upper part of figure 4.12, two alternative
executions of such a Cilk procedure are illustrated. In the first, processor PA executes
the whole computation serially. In the second, PA executes locally subprocedure A
(child A) while PB steals the parent procedure frame and starts execution of
subprocedure B (child B). When PB returns to the parent frame, it queries the Cilk
runtime to find out whether child A has finished. In the scenario shown child A has
finished, so PB copies its computation result into the parent frame and continues with
the execution of the parent computation. The time gained by parallel execution of the
two subprocedures is indicated by the red arrow on the right. The lower part of the
figure shows the same execution, but this time assuming that procedures A and B, as
well as the parent computation, take half the time, while the times for PB to steal
the parent frame, query the Cilk runtime, and copy the results of child A remain the
same. In this case, parallel execution of the two subprocedures takes more time than
serial execution on PA, as indicated by the red arrow on the right.
In both cases, the time overlap of processors PA and PB starts when PB initiates
its steal attempt, and ends when PA is done with the execution of its part of the
computation (i.e. child A). If this overlapped time is more than the time required
by PB to complete the steal, plus the time to synchronize with PA through the Cilk
runtime and get the result of child A, then there is a positive time gain from parallel
execution. Otherwise, the time required for these communication and synchronization
actions with PA in excess of the overlapped time is time lost compared to serial
execution on PA.
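Stated compactly (the symbols T_overlap, T_steal, and T_sync are introduced here only for this
discussion and do not appear in the original analysis), parallel execution of the two
subprocedures pays off only when

    T_overlap > T_steal + T_sync,

where T_overlap is the interval during which PA and PB execute concurrently, T_steal is the
time PB needs to complete the steal, and T_sync is the time to synchronize with PA through the
Cilk runtime and obtain the result of child A; otherwise, the shortfall T_steal + T_sync − T_overlap
is lost relative to serial execution on PA.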
4.3.5 Summary
Evaluation of the use of multiple-reader queues to augment the Cilk runtime system,
simulating 16-128 processors, provides three results. First, for parallelization in
this range of processors in a single-CMP environment, the minimum task granularity is
very important, in spite of scalable behavior for up to 64 processors. In fact, with
the benchmarks used, selecting an inappropriately small minimum task granularity
results in a larger execution time for the "best-performing" 64-processor execution
than that of only 16 processors with a better minimum task size selection. This is
exacerbated by hardware support that slices work into pieces faster, and results in
bad scaling behavior as more processors are used.
Second, regardless of the choice of minimum task granularity, the benchmarks used
(FFT, LU, Cholesky and Cilksort) do not scale over 64 processors. With appropriate
selection of minimum task granularity and 16-64 processors, hardware support for
scheduling provides 20.9%-23.9% better performance for regular codes than the original
Cilk without hardware support. For irregular parallelism, though, the performance gain
is reduced to 0.2%-9.5% and the version exploiting hardware support may scale to fewer
processors than the software-only version (though this does not necessarily imply a
reduction of peak performance). Third, in 64-processor executions, the average true
work interval (i.e. task size without overheads) is always more than 30 thousand clock
cycles in the simulations of this section, and overheads range across benchmarks from
about 20% to 60% of the average task size. This indicates that Cilk cannot be effective
for fine-grain parallelization in this range of processors, even if the minimum task
granularity were chosen correctly and localized work stealing were implemented in the
runtime.
Exploiting hardware support for fine-grain job scheduling on a large number of
processors in the context of Cilk is not trivial. The problem of selecting the minimum
task granularity, which Cilk leaves to the application, may result in increased
execution times and reversed performance scaling when more processors are used.
5.1 Conclusions
This dissertation demonstrates the integration of a network interface with a cache
controller, which allows configurable use of processor-local memory, partly as cache
and partly as scratchpad. Chapter 3 shows that NI integration can be done efficiently,
requiring less than a 20% logic increase over a simple cache controller and one or two
additional state bits per cache line. The novel technique of event responses is
introduced, which enables this efficient cache integration of NI mechanisms. However,
the hardware cost assessment indicates that support for synchronization primitives may
not be cost-effective for general-purpose deployment, at least on a per-core basis.
Albeit only partially evaluated here, this thesis presents the design and imple-
mentation of a general RDMA-copy mechanism for a shared address space, and
synchronization counter support for selective fences of arbitrary groups of explicit
transfers or barriers. Similarly, it demonstrates the design and implementation of
single-reader queues, aiming to enhance many-to-one interaction efficiency.
In addition, multiple-reader queues are introduced as a novel synchronization
primitive that enables the "blind" rendez-vous of requests with replies, and their
matching in the memory system, without processor involvement. Chapter 4 shows that
the explicit transfers and the synchronization primitives provided by the cache-integrated
NI offer flexibility, via latency- and bandwidth-efficient communication mechanisms and
mixed use of implicit and explicit communication, and also benefit latency-critical
tasks like synchronization and scheduling.
Looking back for lessons learned from the course of the research for this
dissertation, there are two specific things to mention, related to the evaluation of
scheduling support in the SARC NI. First, the Cilk-based study of section 4.3 indicates
that flat randomized work stealing is probably not a good idea with several tens of
processors. Some form of hierarchical extension of this scheduling scheme is required,
in favor of locality.
Second, although this is not new, the results of subsection 4.3.4 underline the
fact that scalability by itself is not an appropriate metric for parallel execution
optimizations. Changes in the problem size or in the granularity of the parallelism
exploited may reveal that a scalable behavior is very far from optimal. Thus, some
method is required to determine the optimal behavior, especially when using
single-processor execution as a reference point is not convenient or possible.
Above other lessons, though, I must underline my view, deduced from the course of
this study, on setting a concrete motivation for research early. Novel ideas that may
seem promising, but only in an abstract sense, must be preliminarily evaluated within
a limited time frame, possibly before completing their design or implementation, so
that they are placed in the framework of solving a perceived actual problem. This is
most important –and more likely to occur– for basic research that may give rise to
radical changes. Moreover, it means that methods to achieve such preliminary evaluation
should not themselves be under interdependent research. Conversely, it may be even
better to have the ground of solving a specific actual problem as the starting point
for innovation. In the field of computer architecture, confining research motivation
to improvement upon a tangible problem is established and defended by the international
culture developed around quantitative research assessment.
Research efforts during this study, and in the CARV laboratory of FORTH-ICS more
broadly, identified at least three important issues that, in the author's view, deserve
further research. The first pertains generally to run-time systems, and the other two
are mostly oriented toward the exploitation of explicit communication in CMPs.
The first issue concerns the need for locality-aware scheduling of computations in
large-scale CMPs, and the associated need for management, possibly at run-time, of the
minimum profitable task size. It is becoming increasingly apparent that, with tens of
processors on a single chip, locality exploitation will play a critical role. In
addition, the discussion of the minimum task granularity problem in subsection 4.3.4
shows the importance of determining, for a given computation, the amount of parallelism
and the number of cores that can be exploited profitably on the given hardware.
The second issue relates to the need for methodologies and mechanisms, associated
with software-managed on-chip memory, that enable the manipulation of working sets
that do not fit in on-chip storage, and that automate the spilling and re-fetching of
scratchpad data. In a broader context, novel hardware support for flexible and
software-configurable locality management can be of interest to both cache- and
scratchpad-based parallel processing.
The third issue refers to the broad exploitation of producer-initiated explicit
transfers, which could maximize the benefits of explicit communication. There are at
least two potentially problematic aspects in the general use of such transfers: the
need to know in advance where the consumer of the produced data is, or will be,
scheduled, and the management of consumer local memory space, discussed in
subsection 2.3.6.
Although almost ten years have gone by, the exploitation of multiple cores for a
general computation remains an open problem, one that seems to affect the course of
the general-purpose processor industry. At the beginning of this study, software
targeting such computations was very scarce. The lack of software benchmark sets for
the evaluation of new architectures has been an obstacle to the development of
appropriate hardware support, and will remain so in the near future.
At this time, many research efforts propose extended [17, 18, 19] or novel [112, 93]
programming model semantics, but they either evaluate their success on fewer than 16
cores, or specifically target the processing of regular streams of data and data-parallel
applications. In any case, there seems to be a consensus toward the breakdown of a
computation into lightweight tasks or kernels, with their parallel processing managed
under the orchestration of a compiler-generated run-time system and with the aid of
higher-level or library constructs.
Because mapping a general computation onto specific hardware resources and dynamic
scheduling are too complex for the end-programmer, run-time systems are essential and
should provide a central software platform for the evaluation of multicore hardware
support. For example, run-time systems and libraries can exploit hardware support for
synchronization and explicit communication, like that advocated here, as in the case
of the Cilk-based evaluation of section 4.3.
However, it is the author's opinion that, to enable new parallel software, the
programmer should not be confined to the serial semantics of uniprocessor programming
languages. New, higher-level programming model semantics are required to program
groups of processors, semantics that will urge the programmer to expose the maximum
possible parallelism in an application. Exposing language-defined tasks and data
structures, as well as providing syntactic sugar for their recursive and repetitive
declaration, can be examples of such higher-level semantics in the programming model.
Parallel processing was abandoned in the past because of the difficulty of writing
parallel programs. Contemporary efforts for the general exploitation of multicore
processors are also largely dependent on the progress of parallel software technology.
It is doubtful whether the 65nm and 45nm integration processes are selling as many
processor chips as previous ones. This time it seems that computer engineering, and
more generally computer science, will either outgrow the uniprocessing childhood of
its history, or it will shrink in usefulness and importance. This may mean that
interesting times still lie ahead in the future of computers.
[3] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, “Clock rate versus IPC:
the end of the road for conventional microarchitectures,” SIGARCH Comput. Archit.
News, vol. 28, no. 2, pp. 248–259, 2000.
[6] Intel, “Intel Unveils 32-core Server Chip at International Supercomputing Confer-
ence,” http://www.intel.com/pressroom/archive/releases/2010/20100531comp.htm,
May 31 2010.
[8] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. San Francisco,
CA, USA: Morgan Kaufmann Publishers Inc., 2008.
[9] C. Kim, D. Burger, and S. W. Keckler, “An adaptive, non-uniform cache structure for
wire-delay dominated on-chip caches,” SIGPLAN Not., vol. 37, no. 10, pp. 211–222,
2002.
[10] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler, “A NUCA substrate
for flexible CMP cache sharing,” IEEE Transactions on Parallel and Distributed Sys-
tems, vol. 18, pp. 1028–1040, 2007.
[12] D. Pham et al., “The design and implementation of a first-generation CELL proces-
sor,” in Proc. IEEE Int. Solid-State Circuits Conference (ISSCC), Feb. 2005.
[15] B. Gedik, R. R. Bordawekar, and P. S. Yu, “Cellsort: high performance sorting on the
cell processor,” in VLDB ’07: Proceedings of the 33rd international conference on
Very large data bases. VLDB Endowment, 2007, pp. 1286–1297.
[22] WP5 partners, under M. Katevenis coordination, “D5.6: Network Interface Evalua-
tion, Prototyping, and Optimizations,” Institute of Computer Science, FORTH, Her-
aklion, Greece, Tech. Rep. Confidential SARC Deliverable D5.6, May 2010.
[23] D. S. Henry and C. F. Joerg, “A tightly-coupled processor-network interface,” SIG-
PLAN Not., vol. 27, no. 9, pp. 111–122, 1992.
[24] M. A. Blumrich, C. Dubnicki, E. W. Felten, and K. Li, “Protected, user-level DMA for
the SHRIMP network interface,” in HPCA ’96: Proceedings of the 2nd IEEE Sympo-
sium on High-Performance Computer Architecture. Washington, DC, USA: IEEE
Computer Society, 1996, p. 154.
[25] E. Markatos and M. Katevenis, “User-level DMA without operating system ker-
nel modification,” in HPCA’97: Proceedings of the 3rd IEEE Symposium on High-
Performance Computer Architecture, San Antonio, TX USA, Feb. 1997, pp. 322–331.
[26] G. M. Papadopoulos and D. E. Culler, “Monsoon: an explicit token-store architec-
ture,” in ISCA ’98: 25 years of the international symposia on Computer architecture
(selected papers). New York, NY, USA: ACM, 1998, pp. 398–407.
[27] V. Papaefstathiou, G. Kalokairinos, A. Ioannou, M. Papamichael, G. Mihelogian-
nakis, S. Kavadias, E. Vlachos, D. Pnevmatikatos, and M. Katevenis, “An FPGA-based
prototyping platform for research in high-speed interprocessor communication,” 2nd
Industrial Workshop of the European Network of Excellence on High-Performance
Embedded Architecture and Compilation (HiPEAC), October 17 2006.
[28] V. Papaefstathiou, D. Pnevmatikatos, M. Marazakis, G. Kalokairinos, A. Ioannou,
M. Papamichael, S. Kavadias, G. Mihelogiannakis, and M. Katevenis, “Prototyping
efficient interprocessor communication mechanisms,” in Proc. IEEE International
Symposium on Systems, Architectures, Modeling and Simulation (SAMOS2007), July
16-19 2007.
[29] M. Papamichael, “Network Interface Architecture and Prototyping for Chip and Clus-
ter Multiprocessors,” Master’s thesis, University of Crete, Heraklion, Greece, 2007,
also available as ICS-FORTH Technical Report 392.
[30] H. Grahn and P. Stenstrom, “Efficient strategies for software-only directory protocols
in shared-memory multiprocessors,” in Computer Architecture, 1995. Proceedings.
22nd Annual International Symposium on, June 1995, pp. 38–47.
[31] D. Chaiken and A. Agarwal, “Software-extended coherent shared memory: perfor-
mance and cost,” SIGARCH Comput. Archit. News, vol. 22, no. 2, pp. 314–324, 1994.
[32] F. C. Denis, D. A. Khotimsky, and S. Krishnan, “Generalized inverse multiplexing
of switched ATM connections,” in Proceedings of the IEEE Conference on Global
Communications (GlobeCom ’98), 1998, pp. 3134–3140.
[34] E. Brewer, F. Chong, L. Liu, S. Sharma, and J. Kubiatowicz, “Remote queues: Expos-
ing message queues for optimization and atomicity,” in Proc. 7th ACM Symposium
on Parallel Algorithms and Architectures (SPAA’95), Santa Barbara, CA USA, Jun.
1995, pp. 42–53.
[35] S. Mukherjee, B. Falsafi, M. Hill, and D. Wood, “Coherent network interfaces for
fine-grain communication,” in Proc. 23rd Int. Symposium on Computer Architecture
(ISCA’96), Philadelphia, PA USA, May 1996, pp. 247–258.
[36] T. von Eicken, A. Basu, V. Buch, and W. Vogels, “U-net: a user-level network in-
terface for parallel and distributed computing,” in SOSP ’95: Proceedings of the
fifteenth ACM symposium on Operating systems principles. New York, NY, USA:
ACM, 1995, pp. 40–53.
[43] D. Koufaty and J. Torrellas, “Comparing data forwarding and prefetching for
communication-induced misses in shared-memory MPs,” in ICS ’98: Proceedings of the
1998 International Conference on Supercomputing.
[47] S. Borkar, R. Cohn, G. Cox, S. Gleason, and T. Gross, “iWarp: an integrated solution
to high-speed parallel computing,” in Supercomputing ’88: Proceedings of the
1988 ACM/IEEE conference on Supercomputing. Los Alamitos, CA, USA: IEEE
Computer Society Press, 1988, pp. 330–339.
[50] U. Kapasi, W. J. Dally, S. Rixner, J. D. Owens, and B. Khailany, “The Imagine stream
processor,” in Proceedings 2002 IEEE International Conference on Computer De-
sign, Sep. 2002, pp. 282–288.
[51] D. Sanchez, R. M. Yoo, and C. Kozyrakis, “Flexible architectural support for fine-
grain scheduling,” in ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS
on Architectural support for programming languages and operating systems. New
York, NY, USA: ACM, 2010, pp. 311–322.
[80] P. Bannon, “Alpha 21364: A scalable single-chip SMP,” in Eleventh Ann. Micropro-
cessor Forum, MicroDesign Resources, Sebastopol, California, 1998.
[83] S. Mukherjee and M. Hill, “A survey of user-level network interfaces for system area
networks,” Univ. of Wisconsin, Madison, USA, Computer Sci. Dept. Tech. Report
1340, 1997.
[85] P. Ranganathan, S. Adve, and N. P. Jouppi, “Reconfigurable caches and their applica-
tion to media processing,” in ISCA ’00: Proceedings of the 27th annual international
symposium on Computer architecture. New York, NY, USA: ACM, 2000, pp. 214–
224.
[86] IBM, PowerPC 750GX/FX Cache Programming, Dec 2004. [Online]. Available:
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/0DD2C54EDDF7EB9287256F3F00592C64
[87] Intel, Intel XScale Microarchitecture Programmers Reference Manual, Feb 2001.
[Online]. Available: http://download.intel.com/design/intelxscale/27343601.pdf
[95] M. Wen, N. Wu, C. Zhang, Q. Yang, J. Ren, Y. He, W. Wu, J. Chai, M. Guan, and
C. Xun, “On-chip memory system optimization design for the FT64 scientific stream
accelerator,” IEEE Micro, vol. 28, no. 4, pp. 51–70, 2008.
[111] S. G. Akl and N. Santoro, “Optimal parallel merging and sorting without memory
conflicts,” IEEE Trans. Comput., vol. 36, no. 11, pp. 1367–1369, 1987.