
CELL MULTIPROCESSOR COMMUNICATION NETWORK: BUILT FOR SPEED

MULTICORE DESIGNS PROMISE VARIOUS POWER-PERFORMANCE AND AREA-PERFORMANCE BENEFITS. BUT INADEQUATE DESIGN OF THE ON-CHIP COMMUNICATION NETWORK CAN DEPRIVE APPLICATIONS OF THESE BENEFITS. TO ILLUMINATE THIS IMPORTANT POINT IN MULTICORE PROCESSOR DESIGN, THE AUTHORS ANALYZE THE CELL PROCESSOR’S COMMUNICATION NETWORK, USING A SERIES OF BENCHMARKS INVOLVING VARIOUS DMA TRAFFIC PATTERNS AND SYNCHRONIZATION PROTOCOLS.

Michael Kistler, IBM Austin Research Laboratory
Michael Perrone, IBM TJ Watson Research Center
Fabrizio Petrini, Pacific Northwest National Laboratory

Over the past decade, high-performance computing has ridden the wave of commodity computing, building cluster-based parallel computers that leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows, processor designers face complex problems in their efforts to increase gate density, reduce power consumption, and design efficient memory hierarchies. Processor developers are looking for solutions that can keep up with the scientific and industrial communities’ insatiable demand for computing capability and that also have a sustainable market outside science and industry.

A major trend in computer architecture is integrating system components onto the processor chip. This trend is driving the development of processors that can perform functions typically associated with entire systems. Building modular processors with multiple cores is far more cost-effective than building monolithic processors, which are prohibitively expensive to develop, have high power consumption, and give limited return on investment. Multicore system-on-chip (SoC) processors integrate several identical, independent processing units on the same die, together with network interfaces, acceleration units, and other specialized units.

Researchers have explored several design avenues in both academia and industry. Examples include MIT’s Raw multiprocessor, the University of Texas’s Trips multiprocessor, AMD’s Opteron, IBM’s Power5, Sun’s Niagara, and Intel’s Montecito, among many others. (For details on many of these processors, see the March/April 2005 issue of IEEE Micro.)

In all multicore processors, a major technological challenge is designing the internal, on-chip communication network. To realize the unprecedented computational power of the many available processing units, the network must provide very high performance in latency and in bandwidth.
It must also resolve contention under heavy loads, provide fairness, and hide the processing units’ physical distribution as completely as possible.

Another important dimension is the nature and semantics of the communication primitives available for interactions between the various processing units. Pinkston and Shin have recently compiled a comprehensive survey of multicore processor design challenges, with particular emphasis on internal communication mechanisms.1

The Cell Broadband Engine processor (known simply as the Cell processor), jointly developed by IBM, Sony, and Toshiba, uses an elegant and natural approach to on-chip communication. Relying on four slotted rings coordinated by a central arbiter, it borrows a mainstream communication model from high-performance networks in which processing units cooperate through remote direct memory accesses (DMAs).2 From functional and performance viewpoints, the on-chip network is strikingly similar to high-performance networks commonly used for remote communication in commodity computing clusters and custom supercomputers.

In this article, we explore the design of the Cell processor’s on-chip network and provide insight into its communication and synchronization protocols. We describe the various steps of these protocols, the algorithms involved, and their basic costs. Our performance evaluation uses a collection of benchmarks of increasing complexity, ranging from basic communication patterns to more demanding collective patterns that expose network behavior under congestion.

Design rationale
The Cell processor’s design addresses at least three issues that limit processor performance: memory latency, bandwidth, and power.

Historically, processor performance improvements came mainly from higher processor clock frequencies, deeper pipelines, and wider issue designs. However, memory access speed has not kept pace with these improvements, leading to increased effective memory latencies and complex logic to hide them. Also, because complex cores don’t allow a large number of concurrent memory accesses, they underutilize execution pipelines and memory bandwidth, resulting in poor chip area use and increased power dissipation without commensurate performance gains.3

For example, larger memory latencies increase the amount of speculative execution required to maintain high processor utilization. Thus, they reduce the likelihood that useful work is being accomplished and increase administrative overhead and bandwidth requirements. All of these problems lead to reduced power efficiency.

Power use in CMOS processors is approaching the limits of air cooling and might soon begin to require sophisticated cooling techniques.4 These cooling requirements can significantly increase overall system cost and complexity. Decreasing transistor size and correspondingly increasing subthreshold leakage currents further increase power consumption.5

Performance improvements from further increasing processor frequencies and pipeline depths are also reaching their limits.6 Deeper pipelines increase the number of stalls from data dependencies and increase branch misprediction penalties.

The Cell processor addresses these issues by attempting to minimize pipeline depth, increase memory bandwidth, allow more simultaneous, in-flight memory transactions, and improve power efficiency and performance.7 These design goals led to the use of flexible yet simple cores that use area and power efficiently.

Processor overview
The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), which is a fully compatible extension of the 64-bit PowerPC Architecture. Its initial target is the PlayStation 3 game console, but its capabilities also make it well suited for other applications such as visualization, image and signal processing, and various scientific and technical workloads.

Figure 1 shows the Cell processor’s main functional units. The processor is a heterogeneous, multicore chip capable of massive floating-point processing optimized for computation-intensive workloads and rich broadband media applications. It consists of one 64-bit power processor element (PPE), eight specialized coprocessors called synergistic processor elements (SPEs), a high-speed memory controller, and a high-bandwidth bus interface, all integrated on-chip.
Figure 1. Main functional units of the Cell processor. (BEI: broadband engine interface; EIB: element interconnect bus; FlexIO: high-speed I/O interface; L2: level 2 cache; MBL: MIC bus logic; MIC: memory interface controller; PPE: power processor element; SPE: synergistic processor element; XIO: extreme data rate I/O cell; also shown: test control unit/pervasive logic.)

The PPE and SPEs communicate through an internal high-speed element interconnect bus (EIB).

With a clock speed of 3.2 GHz, the Cell processor has a theoretical peak performance of 204.8 Gflop/s (single precision) and 14.6 Gflop/s (double precision). The EIB supports a peak bandwidth of 204.8 Gbytes/s for intrachip data transfers among the PPE, the SPEs, and the memory and I/O interface controllers. The memory interface controller (MIC) provides a peak bandwidth of 25.6 Gbytes/s to main memory. The I/O controller provides peak bandwidths of 25 Gbytes/s inbound and 35 Gbytes/s outbound.

The PPE, the Cell’s main processor, runs the operating system and coordinates the SPEs. It is a traditional 64-bit PowerPC processor core with a vector multimedia extension (VMX) unit, 32-Kbyte level 1 instruction and data caches, and a 512-Kbyte level 2 cache. The PPE is a dual-issue, in-order-execution design, with two-way simultaneous multithreading.

Each SPE consists of a synergistic processor unit (SPU) and a memory flow controller (MFC). The MFC includes a DMA controller, a memory management unit (MMU), a bus interface unit, and an atomic unit for synchronization with other SPUs and the PPE. The SPU is a RISC-style processor with an instruction set and a microarchitecture designed for high-performance data streaming and data-intensive computation. The SPU includes a 256-Kbyte local-store memory to hold an SPU program’s instructions and data. The SPU cannot access main memory directly, but it can issue DMA commands to the MFC to bring data into local store or write computation results back to main memory. The SPU can continue program execution while the MFC independently performs these DMA transactions. No hardware data-load prediction structures exist for local store management, and each local store must be managed by software.

The MFC performs DMA operations to transfer data between local store and system memory. DMA operations specify system memory locations using fully compliant PowerPC virtual addresses. DMA operations can transfer data between local store and any resources connected via the on-chip interconnect (main memory, another SPE’s local store, or an I/O device). Parallel SPE-to-SPE and SPE-to-main-memory transfers are sustainable at a rate of 16 bytes of data every bus cycle, with a peak bandwidth of 25.6 Gbytes/s.

Each SPU has 128 128-bit single-instruction, multiple-data (SIMD) registers. The large number of architected registers facilitates highly efficient instruction scheduling and enables important optimization techniques such as loop unrolling. All SPU instructions are inherently SIMD operations that the pipeline can run at four granularities: 16-way 8-bit integers, eight-way 16-bit integers, four-way 32-bit integers or single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers.
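As an illustration of the four-way single-precision granularity, the short fragment below multiply-adds four packed floats per instruction using the spu_madd intrinsic. It is a minimal sketch that assumes the spu_intrinsics.h header from the SPU language extensions; the function name and array layout are purely illustrative.

    #include <spu_intrinsics.h>

    /* Accumulate a[i]*b[i] into acc[i], four single-precision elements at a time,
       for n 128-bit vectors resident in the SPU's local store. */
    void vec_fma(vector float *acc, const vector float *a,
                 const vector float *b, unsigned int n)
    {
        for (unsigned int i = 0; i < n; i++)
            acc[i] = spu_madd(a[i], b[i], acc[i]);  /* 4-way SIMD fused multiply-add */
    }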
The SPU is an in-order processor with two instruction pipelines, referred to as the even and odd pipelines. The floating- and fixed-point units are on the even pipeline, and the rest of the functional units are on the odd pipeline. Each SPU can issue and complete up to two instructions per cycle—one per pipeline. The SPU can approach this theoretical limit for a wide variety of applications. All single-precision operations (8-bit, 16-bit, or 32-bit integers or 32-bit floats) are fully pipelined and can be issued at the full SPU clock rate (for example, four 32-bit floating-point operations per SPU clock cycle).
The two-way double-precision floating-point operation is partially pipelined, so its instructions issue at a lower rate (two double-precision flops every seven SPU clock cycles). When using single-precision floating-point fused multiply-add instructions (which count as two operations), the eight SPUs can perform a total of 64 operations per cycle.

Communication architecture
To take advantage of all the computation power available on the Cell processor, work must be distributed and coordinated across the PPE and the SPEs. The processor’s specialized communication mechanisms allow efficient data collection and distribution as well as coordination of concurrent activities across the computation elements. Because the SPU can act directly only on programs and data in its own local store, each SPE has a DMA controller that performs high-bandwidth data transfer between local store and main memory. These DMA engines also allow direct transfers between the local stores of two SPUs for pipeline or producer-consumer-style parallel applications.

At the other end of the spectrum, the SPU can use either signals or mailboxes to perform simple low-latency signaling to the PPE or other SPEs. Supporting more-complex synchronization mechanisms is a set of atomic operations available to the SPU, which operate in a similar manner as the PowerPC architecture’s lwarx/stwcx atomic instructions. In fact, the SPU’s atomic operations interoperate with PPE atomic instructions to build locks and other synchronization mechanisms that work across the SPEs and the PPE. Finally, the Cell allows memory-mapped access to nearly all SPE resources, including the entire local store. This provides a convenient and consistent mechanism for special communications needs not met by the other techniques.

The rich set of communications mechanisms in the Cell architecture enables programmers to efficiently implement widely used programming models for parallel and distributed applications. These models include the function-offload, device-extension, computational-acceleration, streaming, shared-memory-multiprocessor, and asymmetric-thread-runtime models.8

Figure 2. Element interconnect bus (EIB). (BIF: broadband interface; IOIF: I/O interface. The bus connects the PPE, the MIC, the eight SPEs, and the two I/O interfaces through a data network coordinated by the data bus arbiter.)

Element interconnect bus
Figure 2 shows the EIB, the heart of the Cell processor’s communication architecture, which enables communication among the PPE, the SPEs, main system memory, and external I/O. The EIB has separate communication paths for commands (requests to transfer data to or from another element on the bus) and data.
Each bus element is connected through a point-to-point link to the address concentrator, which receives and orders commands from bus elements, broadcasts the commands in order to all bus elements (for snooping), and then aggregates and broadcasts the command response. The command response is the signal to the appropriate bus elements to start the data transfer.

The EIB data network consists of four 16-byte-wide data rings: two running clockwise, and the other two counterclockwise. Each ring potentially allows up to three concurrent data transfers, as long as their paths don’t overlap. To initiate a data transfer, bus elements must request data bus access. The EIB data bus arbiter processes these requests and decides which ring should handle each request. The arbiter always selects one of the two rings that travel in the direction of the shortest transfer, thus ensuring that the data won’t need to travel more than halfway around the ring to its destination. The arbiter also schedules the transfer to ensure that it won’t interfere with other in-flight transactions. To minimize stalling on reads, the arbiter gives priority to requests coming from the memory controller. It treats all others equally in round-robin fashion. Thus, certain communication patterns will be more efficient than others.
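The shortest-path rule can be made concrete with a small sketch. The following helper is not the hardware arbitration algorithm, only an illustration of the rule just described; the element numbering and ring size are hypothetical.

    /* Pick a ring direction so that a transfer on a ring of ring_size elements never
       travels more than halfway around: returns +1 to use one of the clockwise
       rings, -1 to use one of the counterclockwise rings. */
    int shortest_direction(int src, int dst, int ring_size)
    {
        int cw_hops = (dst - src + ring_size) % ring_size;  /* clockwise distance */
        return (cw_hops <= ring_size / 2) ? +1 : -1;
    }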
The EIB operates at half the processor-clock speed. Each EIB unit can simultaneously send and receive 16 bytes of data every bus cycle. The EIB’s maximum data bandwidth is limited by the rate at which addresses are snooped across all units in the system, which is one address per bus cycle. Each snooped address request can potentially transfer up to 128 bytes, so in a 3.2-GHz Cell processor, the theoretical peak data bandwidth on the EIB is 128 bytes × 1.6 GHz = 204.8 Gbytes/s.

The on-chip I/O interfaces allow two Cell processors to be connected using a coherent protocol called the broadband interface (BIF), which effectively extends the multiprocessor network to connect both PPEs and all 16 SPEs in a single coherent network. The BIF protocol operates over IOIF0, one of the two available on-chip I/O interfaces; the other interface, IOIF1, operates only in noncoherent mode. The IOIF0 bandwidth is configurable, with a peak of 30 Gbytes/s outbound and 25 Gbytes/s inbound.

The actual data bandwidth achieved on the EIB depends on several factors: the destination and source’s relative locations, the chance of a new transfer’s interfering with transfers in progress, the number of Cell chips in the system, whether data transfers are to/from memory or between local stores in the SPEs, and the data arbiter’s efficiency. Reduced bus bandwidths can result in the following cases:

• All requestors access the same destination, such as the same local store, at the same time.
• All transfers are in the same direction and cause idling on two of the four data rings.
• A large number of partial cache line transfers lowers bus efficiency.
• All transfers must travel halfway around the ring to reach their destinations, inhibiting units on the way from using the same ring.

Memory flow controller
Each SPE contains an MFC that connects the SPE to the EIB and manages the various communication paths between the SPE and the other Cell elements. The MFC runs at the EIB’s frequency—that is, at half the processor’s speed. The SPU interacts with the MFC through the SPU channel interface. Channels are unidirectional communication paths that act much like first-in first-out fixed-capacity queues. This means that each channel is defined as either read-only or write-only from the SPU’s perspective. In addition, some channels are defined with blocking semantics, meaning that a read of an empty read-only channel or a write to a full write-only channel causes the SPU to block until the operation completes. Each channel has an associated count that indicates the number of available elements in the channel. The SPU uses the read channel (rdch), write channel (wrch), and read channel count (rchcnt) assembly instructions to access the SPU channels.

DMA. The MFC accepts and processes DMA commands that the SPU or the PPE issued using the SPU channel interface or memory-mapped I/O (MMIO) registers. DMA commands queue in the MFC, and the SPU or PPE (whichever issued the command) can continue execution in parallel with the data transfer, using either polling or blocking interfaces to determine when the transfer is complete.
This autonomous execution of MFC DMA commands allows convenient scheduling of DMA transfers to hide memory latency.
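The common blocking pattern from the SPU side looks like the sketch below: queue a get, then wait on its tag group. It assumes the mfc_get, mfc_write_tag_mask, and mfc_read_tag_status_all convenience macros that the Cell SDK’s spu_mfcio.h layers over the channel instructions described earlier; the buffer name, size, and tag value are illustrative.

    #include <spu_mfcio.h>

    #define GET_TAG 1
    static volatile char buf[16384] __attribute__((aligned(128)));  /* local-store buffer */

    /* Fetch 'size' bytes from effective address 'ea' into local store and wait
       for completion before returning. */
    void get_blocking(unsigned long long ea, unsigned int size)
    {
        mfc_get((void *)buf, ea, size, GET_TAG, 0, 0);  /* queue the DMA get on tag group GET_TAG */
        mfc_write_tag_mask(1 << GET_TAG);               /* select the tag group(s) to wait on */
        mfc_read_tag_status_all();                      /* block until every DMA in the group completes */
    }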
The MFC supports naturally aligned transfers of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes to a maximum of 16 Kbytes. DMA list commands can request a list of up to 2,048 DMA transfers using a single MFC DMA command. However, only the MFC’s associated SPU can issue DMA list commands. A DMA list is an array of DMA source/destination addresses and lengths in the SPU’s local storage. When an SPU issues a DMA list command, the SPU specifies the address and length of the DMA list in the local store.9 Peak performance is achievable for transfers when both the effective address and the local storage address are 128-byte aligned and the transfer size is an even multiple of 128 bytes.
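A gather with a DMA list can be sketched as follows. Each list element is an 8-byte pair holding a transfer size and the low 32 bits of the effective address (the high 32 bits come from the command itself); the structure below spells out that layout as an assumption rather than relying on a header type, and mfc_getl is taken to be the SDK’s list-form get from spu_mfcio.h.

    #include <spu_mfcio.h>

    #define LIST_TAG 2
    #define NBLOCKS  8

    /* One DMA list element, mirroring the 8-byte layout described in the CBEA;
       the stall-and-notify feature is not used here. Field names are illustrative. */
    typedef struct {
        unsigned int size;   /* transfer size in bytes */
        unsigned int eal;    /* low 32 bits of the effective address */
    } dma_list_elem;

    static dma_list_elem list[NBLOCKS] __attribute__((aligned(8)));
    static volatile char buf[NBLOCKS * 4096] __attribute__((aligned(128)));

    /* Gather NBLOCKS scattered 4-Kbyte blocks into one contiguous local-store
       buffer with a single MFC command. */
    void gather(unsigned long long ea_base, const unsigned int offset[NBLOCKS])
    {
        for (int i = 0; i < NBLOCKS; i++) {
            list[i].size = 4096;
            list[i].eal  = (unsigned int)(ea_base + offset[i]);  /* low 32 bits only */
        }
        mfc_getl((void *)buf, ea_base, list, sizeof(list), LIST_TAG, 0, 0);
        mfc_write_tag_mask(1 << LIST_TAG);
        mfc_read_tag_status_all();
    }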
Signal notification and mailboxes. The signal notification facility supports two signaling channels: Sig_Notify_1 and Sig_Notify_2. The SPU can read its own signal channels using the read-blocking SPU channels SPU_RdSigNotify1 and SPU_RdSigNotify2. The PPE or an SPU can write to these channels using memory-mapped addresses. A special feature of the signaling channels is that they can be configured to treat writes as logical OR operations, allowing simple but powerful collective communication across processors.

Each SPU also has a set of mailboxes that can function as a narrow (32-bit) communication channel to the PPE or another SPE. The SPU has a four-entry, read-blocking inbound mailbox and two single-entry, write-blocking outbound mailboxes, one of which will also generate an interrupt to the PPE when the SPE writes to it. The PPE uses memory-mapped addresses to write to the SPU’s inbound mailbox and read from either of the SPU’s outbound mailboxes. In contrast to the signal notification channels, mailboxes are much better suited for one-to-one communication patterns such as master-slave or producer-consumer models. A typical round-trip communication using mailboxes between two SPUs takes approximately 300 nanoseconds (ns).
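A minimal SPU-side sketch of the mailbox interface, assuming the spu_read_in_mbox and spu_write_out_mbox helpers from the SDK’s spu_mfcio.h (they wrap the blocking inbound and outbound mailbox channels):

    #include <spu_mfcio.h>

    /* Echo one 32-bit message: wait for a word from the PPE in the inbound
       mailbox, then post a reply in the outbound mailbox. Both calls block,
       matching the channel semantics described above. */
    void mbox_echo(void)
    {
        unsigned int msg = spu_read_in_mbox();  /* blocks until the PPE writes the mailbox */
        spu_write_out_mbox(msg + 1);            /* blocks while the outbound mailbox is full */
    }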
Atomic operations. To support more complex synchronization mechanisms, the SPU can use special DMA operations to atomically update a lock line in main memory. These operations, called get-lock-line-and-reserve (getllar) and put-lock-line-conditional (putllc), are conceptually equivalent to the PowerPC load-and-reserve (lwarx) and store-conditional (stwcx) instructions.

The getllar operation reads the value of a synchronization variable in main memory and sets a reservation on this location. If the PPE or another SPE subsequently modifies the synchronization variable, the SPE loses its reservation. The putllc operation updates the synchronization variable only if the SPE still holds a reservation on its location. If putllc fails, the SPE must reissue getllar to obtain the synchronization variable’s new value and then retry the attempt to update it with another putllc. The MFC’s atomic unit performs the atomic DMA operations and manages reservations held by the SPE.

Using atomic updates, the SPU can participate with the PPE and other SPUs in locking protocols, barriers, or other synchronization mechanisms. The atomic operations available to the SPU also have some special features, such as notification through an interrupt when a reservation is lost, that enable more efficient and powerful synchronization than traditional approaches.
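The fetch-and-add used by the benchmarks later in this article can be built from these primitives roughly as sketched below. The sketch assumes the mfc_getllar, mfc_putllc, and mfc_read_atomic_status macros and the MFC_PUTLLC_STATUS bit exposed by the SDK’s spu_mfcio.h; the counter is assumed to live inside a 128-byte lock line in main memory.

    #include <spu_mfcio.h>

    /* getllar/putllc always operate on a full 128-byte lock line. */
    static volatile unsigned int line[32] __attribute__((aligned(128)));

    /* Atomically add 'inc' to the 32-bit counter at effective address 'ea',
       retrying until the conditional put succeeds (the lwarx/stwcx-style loop
       described above). Returns the previous value. */
    unsigned int fetch_and_add(unsigned long long ea, unsigned int inc)
    {
        unsigned long long line_ea = ea & ~127ULL;
        unsigned int idx = (unsigned int)(ea & 127) / sizeof(unsigned int);
        unsigned int old;

        do {
            mfc_getllar((void *)line, line_ea, 0, 0);  /* load the lock line and set a reservation */
            mfc_read_atomic_status();                  /* wait for the getllar to complete */
            old = line[idx];
            line[idx] = old + inc;
            mfc_putllc((void *)line, line_ea, 0, 0);   /* conditionally store the lock line */
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);  /* reservation lost: retry */

        return old;
    }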
Memory-mapped I/O (MMIO) resources. Memory-mapped resources play a role in many of the communication mechanisms already discussed, but these are really just special cases of the Cell architecture’s general practice of making all SPE resources available through MMIO. These resources fall into four broad classes:

• Local storage. All of an SPU’s local storage can be mapped into the effective-address space. This allows the PPE to access the SPU’s local storage with simple loads and stores, though doing so is far less efficient than using DMA. MMIO access to local storage is not synchronized with SPU execution, so programmers must ensure that the SPU program is designed to allow unsynchronized access to its data (for example, by using volatile variables) when exploiting this feature.
• Problem state memory map. Resources in this class, intended for use directly by application programs, include access to the SPE’s DMA engine, mailbox channels, and signal notification channels.
• Privilege 1 memory map. These resources are available to privileged programs such as the operating system or authorized subsystems to monitor and control the execution of SPU applications.
• Privilege 2 memory map. The operating system uses these resources to control the resources available to the SPE.

DMA flow
The SPE’s DMA engine handles most communications between the SPU and other Cell elements and executes DMA commands issued by either the SPU or the PPE. A DMA command’s data transfer direction is always referenced from the SPE’s perspective. Therefore, commands that transfer data into an SPE (from main storage to local store) are considered get commands (gets), and transfers of data out of an SPE (from local store to main storage) are considered put commands (puts).

DMA transfers are coherent with respect to main storage. Programmers should be aware that the MFC might process the commands in the queue in a different order from that in which they entered the queue. When order is important, programmers must use special forms of the get and put commands to enforce either barrier or fence semantics against other commands in the queue.
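For example, when a result buffer must be globally visible before a later put writes a completion flag, the fenced form of put orders the two commands within one tag group. The sketch below assumes the mfc_put and mfc_putf (put-with-fence) macros from the SDK’s spu_mfcio.h; names and sizes are illustrative.

    #include <spu_mfcio.h>

    #define ORDER_TAG 3

    /* Write a data block, then a 128-byte "ready" flag line. The fenced put is not
       started until all previously queued commands in tag group ORDER_TAG (here,
       the data put) have completed, which enforces the required ordering. */
    void put_then_flag(void *ls_data, unsigned long long ea_data, unsigned int size,
                       void *ls_flag, unsigned long long ea_flag)
    {
        mfc_put(ls_data, ea_data, size, ORDER_TAG, 0, 0);  /* unordered put of the payload */
        mfc_putf(ls_flag, ea_flag, 128, ORDER_TAG, 0, 0);  /* fenced within the tag group */
        mfc_write_tag_mask(1 << ORDER_TAG);
        mfc_read_tag_status_all();                          /* wait for both transfers */
    }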
The MFC’s MMU handles address translation and protection checking of DMA accesses to main storage, using information from page and segment tables defined in the PowerPC architecture. The MMU has a built-in translation look-aside buffer (TLB) for caching the results of recently performed translations.

The MFC’s DMA controller (DMAC) processes DMA commands queued in the MFC. The MFC contains two separate DMA command queues:

• MFC SPU command queue, for commands issued by the associated SPU using the channel interface; and
• MFC proxy command queue, for commands issued by the PPE or other devices using MMIO registers.

The Cell architecture doesn’t dictate the size of these queues and recommends that software not assume a particular size. This is important to ensure functional correctness across Cell architecture implementations. However, programs should use DMA queue entries efficiently because attempts to issue DMA commands when the queue is full will lead to performance degradation. In the Cell processor, the MFC SPU command queue contains 16 entries, and the MFC proxy command queue contains eight entries.

Figure 3 illustrates the basic flow of a DMA transfer to main storage initiated by an SPU. The process consists of the following steps:

1. The SPU uses the channel interface to place the DMA command in the MFC SPU command queue.
2. The DMAC selects a command for processing. The set of rules for selecting the command for processing is complex, but, in general, a) commands in the SPU command queue take priority over commands in the proxy command queue, b) the DMAC alternates between get and put commands, and c) the command must be ready (not waiting for address resolution or list element fetch or dependent on another command).
3. If the command is a DMA list command and requires a list element fetch, the DMAC queues a request for the list element to the local-store interface. When the list element is returned, the DMAC updates the DMA entry and must reselect it to continue processing.
4. If the command requires address translation, the DMAC queues it to the MMU for processing. When the translation is available in the TLB, processing proceeds to the next step (unrolling). On a TLB miss, the MMU performs the translation, using the page tables stored in main memory, and updates the TLB. The DMA entry is updated and must be reselected for processing to continue.
5. Next, the DMAC unrolls the command—that is, creates a bus request to transfer the next block of data for the command. This bus request can transfer up to 128 bytes of data but can transfer less, depending on alignment issues or the amount of data the DMA command requests. The DMAC then queues this bus request to the bus interface unit (BIU).
6. The BIU selects the request from its queue and issues the command to the EIB. The EIB orders the command with other outstanding requests and then broadcasts the command to all bus elements. For transfers involving main memory, the MIC acknowledges the command to the EIB which then informs the BIU that the command was accepted and data transfer can begin.
7. The BIU in the MFC performs the reads to local store required for the data transfer. The EIB transfers the data for this request between the BIU and the MIC. The MIC transfers the data to or from the off-chip memory.
8. The unrolling process produces a sequence of bus requests for the DMA command that pipeline through the communication network. The DMA command remains in the MFC SPU command queue until all its bus requests have completed. However, the DMAC can continue to process other DMA commands. When all bus requests for a command have completed, the DMAC signals command completion to the SPU and removes the command from the queue.

Figure 3. Basic flow of a DMA transfer. (BIU: bus interface unit; DMAC: direct memory access controller; EIB: element interconnect bus; LS: local store; MFC: memory flow controller; MIC: memory interface controller; MMIO: memory-mapped I/O; MMU: memory management unit; SPU: synergistic processor unit; TLB: translation lookaside buffer.)

In the absence of congestion, a thread running on the SPU can issue a DMA request in as little as 10 clock cycles—the time needed to write to the five SPU channels that describe the source and destination addresses, the DMA size, the DMA tag, and the DMA command. At that point, the DMAC can process the DMA request without SPU intervention.
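Because issue is cheap and the DMAC then proceeds autonomously, the usual way to exploit this is double buffering: compute on one local-store buffer while the MFC fills the other. A minimal sketch, using the same spu_mfcio.h convenience macros as the earlier examples (the chunk size, tag assignment, and process() kernel are illustrative):

    #include <spu_mfcio.h>

    #define CHUNK 16384  /* maximum size of a single DMA transfer */
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    void process(volatile char *data, unsigned int size);  /* hypothetical compute kernel */

    /* Stream 'nchunks' consecutive blocks from main storage, overlapping each
       transfer with computation on the previously fetched block. */
    void stream(unsigned long long ea, unsigned int nchunks)
    {
        unsigned int cur = 0;

        mfc_get((void *)buf[cur], ea, CHUNK, cur, 0, 0);  /* prefetch the first block on tag 0 */
        for (unsigned int i = 0; i < nchunks; i++) {
            unsigned int next = cur ^ 1;
            if (i + 1 < nchunks)  /* start the next transfer before computing */
                mfc_get((void *)buf[next],
                        ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();     /* wait only for the current buffer */
            process(buf[cur], CHUNK);      /* compute while the other DMA is in flight */
            cur = next;
        }
    }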
The overall latency of generating the DMA command, initially selecting the command, and unrolling the first bus request to the BIU—or, in simpler terms, the flow-through latency from SPU issue to injection of the bus request into the EIB—is roughly 30 SPU cycles when all resources are available.
If list element fetch is required, it can add roughly 20 SPU cycles. MMU translation exceptions by the SPE are very expensive and should be avoided if possible. If the queue in the BIU becomes full, the DMAC is blocked from issuing further requests until resources become available again.

A transfer’s command phase involves snooping operations for all bus elements to ensure coherence and typically requires some 50 bus cycles (100 SPU cycles) to complete. For gets, the remaining latency is attributable to the data transfer from off-chip memory to the memory controller and then across the bus to the SPE, which writes it to local store. For puts, DMA latency doesn’t include transferring data all the way to off-chip memory because the SPE considers the put complete once all data have been transferred to the memory controller.

Experimental results
We conducted a series of experiments to explore the major performance aspects of the Cell’s on-chip communication network, its protocols, and the pipelined communication’s impact. We developed a suite of microbenchmarks to analyze the internal interconnection network’s architectural features. Following the research path of previous work on traditional high-performance networks,2 we adopted an incremental approach to gain insight into several aspects of the network. We started our analysis with simple pairwise, congestion-free DMAs, and then we increased the benchmarks’ complexity by including several patterns that expose the contention resolution properties of the network under heavy load.

Methodology
Because of the limited availability of Cell boards at the time of our experiments, we performed most of the software development on IBM’s Full-System Simulator for the Cell Broadband Engine Processor (available at http://www-128.ibm.com/developerworks/power/cell/).10,11 We collected the results presented here using an experimental evaluation board at the IBM Research Center in Austin. The Cell processor on this board was running at 3.2 GHz. We also obtained results from an internal version of the simulator that includes performance models for the MFC, EIB, and memory subsystems. Performance simulation for the Cell processor is still under development, but we found good correlation between the simulator and hardware in our experiments. The simulator let us observe aspects of system behavior that would be difficult or practically impossible to observe on actual hardware.

We developed the benchmarks in C, using several Cell-specific libraries to orchestrate activities between the PPE and various SPEs. In all tests, the DMA operations are issued by the SPUs. We wrote DMA operations in C language intrinsics,12 which in most cases produce inline assembly instructions to specify commands to the MFC through the SPU channel interface. We measured elapsed time in the SPE using a special register, the SPU decrementer, which ticks every 40 ns (or 128 processor clock cycles).
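A sketch of this timing method is shown below; it assumes the spu_write_decrementer and spu_read_decrementer helpers from the SDK’s spu_mfcio.h, and it relies on the decrementer counting down, so elapsed ticks are the start value minus the end value.

    #include <spu_mfcio.h>

    #define NS_PER_TICK 40U  /* one decrementer tick = 40 ns at 3.2 GHz */

    /* Return the elapsed time of fn() in nanoseconds, measured with the SPU decrementer. */
    unsigned long long time_ns(void (*fn)(void))
    {
        spu_write_decrementer(0xFFFFFFFFU);          /* start the count-down from the maximum value */
        unsigned int start = spu_read_decrementer();
        fn();
        unsigned int end = spu_read_decrementer();
        return (unsigned long long)(start - end) * NS_PER_TICK;
    }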
PPE and SPE interactions were performed through mailboxes, input and output SPE registers that can send and receive messages in as little as 150 ns. We implemented a simple synchronization mechanism to start the SPUs in a coordinated way. SPUs notify the completion of benchmark phases through an atomic fetch-and-add operation on a main memory location that is polled infrequently and unintrusively by the PPE.

Basic DMA performance
The first step of our analysis measures the latency and bandwidth of simple blocking puts and gets, when the target is in main memory or in another SPE’s local store. Table 1 breaks down the latency of a DMA operation into its components.

Figure 4a shows DMA operation latency for a range of sizes. In these results, transfers between two local stores were always performed between SPEs 0 and 1, but in all our experiments, we found no performance difference attributable to SPE location. The results show that puts to main memory and gets and puts to local store had a latency of only 91 ns for transfers of up to 512 bytes (four cache lines). There was little difference
between puts and gets to local store because local-store access latency was remarkably low—only 8 ns (about 24 processor clock cycles). Main memory gets required less than 100 ns to fetch information, which is remarkably fast.

Table 1. DMA latency components for a clock frequency of 3.2 GHz.

Latency component                 Cycles   Nanoseconds
DMA issue                             10         3.125
DMA to EIB                            30         9.375
List element fetch                    10         3.125
Coherence protocol                   100        31.25
Data transfer for inter-SPE put      140        43.75
Total                                290        90.61

Figure 4. Latency (a) and bandwidth (b) as a function of DMA message size for blocking gets and puts in the absence of contention.

Figure 4b presents the same results in terms of bandwidth achieved by each DMA operation. As we expected, the largest transfers achieved the highest bandwidth, which we measured as 22.5 Gbytes/s for gets and puts to local store and puts to main memory and 15 Gbytes/s for gets from main memory.

Next, we considered the impact of nonblocking DMA operations. In the Cell processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in on-chip communication. Applications that rely heavily on random scatter or gather accesses to main memory can take advantage of these communication features seamlessly. Our benchmarks use a batched communication model, in which the SPU issues a fixed number (the batch size) of DMAs before blocking for notification of request completion. By using a very large batch size (16,384 in our experiments), we effectively converted the benchmark to use a nonblocking communication model.
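The batched model amounts to queuing every DMA in a batch on one tag group and blocking only once per batch, as sketched below with the same spu_mfcio.h macros as before. A batch size of one degenerates to the blocking case, and a batch as large as the whole sequence behaves like the nonblocking case. (A production version would also keep the number of outstanding commands within the 16-entry MFC queue.)

    #include <spu_mfcio.h>

    #define BATCH_TAG 4

    /* Issue 'count' gets of 'size' bytes each from consecutive main-storage
       addresses, waiting for completion only after every 'batch' commands. */
    void batched_gets(volatile char *ls, unsigned long long ea,
                      unsigned int size, unsigned int count, unsigned int batch)
    {
        for (unsigned int i = 0; i < count; i++) {
            mfc_get((void *)(ls + i * size),
                    ea + (unsigned long long)i * size,
                    size, BATCH_TAG, 0, 0);
            if ((i + 1) % batch == 0 || i + 1 == count) {
                mfc_write_tag_mask(1 << BATCH_TAG);
                mfc_read_tag_status_all();  /* one wait per batch, not one per DMA */
            }
        }
    }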
Figure 5 shows the results of these experiments, including aggregate latency and bandwidth for the set of DMAs in a batch, by batch size and data transfer size. The results show a form of performance continuity between blocking—the most constrained case—and nonblocking operations, with different degrees of freedom expressed by the increasing batch size. In accessing main memory and local storage, nonblocking puts achieved the asymptotic bandwidth of 25.6 Gbytes/s, determined by the EIB capacity at the endpoints, with 2-Kbyte DMAs (Figure 5b and 5f). Accessing local store, nonblocking puts achieved the optimal value with even smaller packets (Figure 5f).

Gets are also very efficient when accessing local memories, and the main memory latency penalty slightly affects them, as Figure 5c shows. Overall, even a limited amount of batching is very effective for intermediate DMA sizes, between 256 bytes and 4 Kbytes, with a factor of two or even three of bandwidth increase compared with the blocking case (for example, 256-byte DMAs in Figure 5h).
Figure 5. DMA performance dimensions: how latency and bandwidth are affected by the choice of DMA target (main memory or local store), DMA direction (put or get), and DMA synchronization (blocking after each DMA, blocking after a constant number of DMAs, ranging from two to 32, and nonblocking with a final fence). Panels: (a) put latency, main memory; (b) put bandwidth, main memory; (c) get latency, main memory; (d) get bandwidth, main memory; (e) put latency, local store; (f) put bandwidth, local store; (g) get latency, local store; (h) get bandwidth, local store.

Collective DMA performance
Parallelization of scientific applications generates far more sophisticated collective communication patterns than the single pairwise DMAs discussed so far.
We also analyzed how the system performs for several common patterns of collective communications under heavy load, in terms of both local performance—what each SPE achieves—and aggregate behavior—the performance level of all SPEs involved in the communication.

All results reported in this section are for nonblocking communication. Each SPE issues a sequence of 16,384-byte DMA commands. The aggregate bandwidth reported is the sum of the communication bandwidths reached by each SPE.

The first important case is the hot spot, in which many or, in the worst case, all SPEs are accessing main memory or a specific local store.13 This is a very demanding pattern that exposes how the on-chip network and the communication protocols behave under stress. It is representative of the most straightforward code parallelizations, which distribute computation to a collection of threads that fetch data from main memory, perform the desired computation, and store the results without SPE interaction. Figure 6a shows that the Cell processor resolves hot spots in accesses to local storage optimally, reaching the asymptotic performance with two or more SPEs. (For the SPE hot spot tests, the number of SPEs includes the hot node; x SPEs include 1 hot SPE plus x − 1 communication partners.)

Figure 6. Aggregate communication performance: hot spots (a) and collective communication patterns (b).

Counterintuitively, get commands outperform puts under load. In fact, with two or more SPEs, two or more get sources saturate the bandwidth either in main or local store. The put protocol, on the other hand, suffers from a minor performance degradation, approximately 1.5 Gbytes/s less than the optimal value.

The second case is collective communication patterns, in which all the SPEs are both source and target of the communication. Figure 6b summarizes the performance aspects of the most common patterns that arise from typical parallel applications. In the two static patterns, complement and pairwise, each SPE executes a sequence of DMAs to a fixed target SPE. In the complement pattern, each SPE selects the target SPE by complementing the bit string that identifies the source. In the pairwise pattern, the SPEs are logically organized in pairs < i, i + 1 >, where i is an even number, and each SPE communicates with its partner. Note that SPEs with numerically consecutive numbers might not be physically adjacent on the Cell hardware layout.
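For concreteness, the target selection for the two static patterns can be written as the small helpers below (purely illustrative; num_spes is 8 on the Cell processor and is assumed to be a power of two).

    /* Complement pattern: flip the bit string that identifies the source SPE. */
    unsigned int complement_target(unsigned int my_id, unsigned int num_spes)
    {
        return my_id ^ (num_spes - 1);
    }

    /* Pairwise pattern: SPEs form pairs <i, i+1> with i even, so each SPE talks
       to the neighbor that differs only in the lowest bit of its number. */
    unsigned int pairwise_target(unsigned int my_id)
    {
        return my_id ^ 1;
    }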
The first static pattern, complement, is resolved optimally by the network, and can be mapped to the four rings with an aggregate performance slightly below 200 Gbytes/s (98 percent of aggregate peak bandwidth).

The direction of data transfer affects the pairwise pattern’s performance. As the hot spot experiments show, gets have better contention resolution properties under heavy load, and Figure 6b further confirms this, showing a gap of 40 Gbytes/s in aggregate bandwidth between put- and get-based patterns.

The most difficult communication pattern, arguably the worst case for this type of on-chip network, is uniform traffic, in which each SPE
randomly chooses a DMA target across SPEs’ local memories. In this case, aggregate bandwidth is only 80 Gbytes/s. We also explored other static communication patterns and found results in the same aggregate-performance range.

Finally, Figure 7 shows the distribution of DMA latencies for all SPEs during the main-memory hot-spot pattern execution. One important result is that the distributions show no measurable differences across the SPEs, evidence of a fair and efficient algorithm for network resource allocation. The peak of both distributions shows a sevenfold latency increase, at about 5.6 μs. In both cases, the worst-case latency is only 13 μs, twice the average. This is another remarkable result, demonstrating that applications can rely on a responsive and fair network even with the most demanding traffic patterns.

Figure 7. Latency distribution with a main-memory hot spot: puts (a) and gets (b).

Major obstacles in the traditional path to processor performance improvement have led chip manufacturers to consider multicore designs. These architectural solutions promise various power-performance and area-performance benefits. But designers must take care to ensure that these benefits are not lost because of the on-chip communication network’s inadequate design. Overall, our experimental results demonstrate that the Cell processor’s communications subsystem is well matched to the processor’s computational capacity. The communications network provides the speed and bandwidth that applications need to exploit the processor’s computational power. MICRO

References
1. T.M. Pinkston and J. Shin, “Trends toward On-Chip Networked Microsystems,” Int’l J. High Performance Computing and Networking, vol. 3, no. 1, 2005, pp. 3-18.
2. J. Beecroft et al., “QsNetII: Defining High-Performance Network Design,” IEEE Micro, vol. 25, no. 4, July/Aug. 2005, pp. 34-47.
3. W.A. Wulf and S.A. McKee, “Hitting the Memory Wall: Implications of the Obvious,” ACM SIGARCH Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20-24.
4. U. Ghoshal and R. Schmidt, “Refrigeration Technologies for Sub-Ambient Temperature Operation of Computing Systems,” Proc. Int’l Solid-State Circuits Conf. (ISSCC 2000), IEEE Press, 2000, pp. 216-217, 458.
5. R.D. Isaac, “The Future of CMOS Technology,” IBM J. Research and Development, vol. 44, no. 3, May 2000, pp. 369-378.
6. V. Srinivasan et al., “Optimizing Pipelines for Power and Performance,” Proc. 35th Ann. Int’l Symp. Microarchitecture (Micro 35), IEEE Press, 2002, pp. 333-344.
7. H.P. Hofstee, “Power Efficient Processor Architecture and the Cell Processor,” Proc. 11th Int’l Symp. High-Performance Computer Architecture (HPCA-11), IEEE Press, 2005, pp. 258-262.
8. J.A. Kahle et al., “Introduction to the Cell Multiprocessor,” IBM J. Research and Development, vol. 49, no. 4/5, 2005, pp. 589-604.
9. IBM Cell Broadband Engine Architecture 1.0, http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
http://www-128.ibm.com/developerworks/ Michael Perrone is the manager of the IBM
power/cell/downloads_doc.html. TJ Watson Research Center’s Cell Solutions
10. IBM Full-System Simulator for the Cell Department. His research interests include
Broadband Engine Processor, IBM Alpha- algorithmic optimization for the Cell proces-
Works Project, http://www.alphaworks.ibm. sor, parallel computing, and statistical
com/tech/cellsystemsim. machine learning. Perrone has a PhD in
11. J.L. Peterson et al., “Application of Full-Sys- physics from Brown University.
tem Simulation in Exploratory System
Design and Development,” IBM J. Research Fabrizio Petrini is a laboratory fellow in the
and Development, vol. 50, no. 2/3, 2006, pp. Applied Computer Science Group of the Com-
321-332. putational Sciences and Mathematics Division
12. IBM SPU C/C++ Language Extensions 2.1, at Pacific Northwest National Laboratory. His
http://www-128.ibm.com/developerworks/ research interests include various aspects of
power/cell/downloads_doc.html. supercomputers, such as high-performance
13. G.F. Pfister and V.A. Norton, “Hot Spot Con- interconnection networks and network inter-
tention and Combining in Multistage Inter- faces, multicore processors, job-scheduling
connection Networks,” IEEE Trans. algorithms, parallel architectures, operating sys-
Computers, vol. 34, no. 10, Oct. 1985, pp. tems, and parallel-programming languages.
943-948. Petrini has a Laurea and a PhD in computer
science from the University of Pisa, Italy.
Michael Kistler is a senior software engineer
in the IBM Austin Research Laboratory. His
research interests include parallel and cluster Direct questions and comments about this
computing, fault tolerance, and full-system article to Fabrizio Petrini, Applied Computer
simulation of high-performance computing Science Group, MS K7-90, Pacific Northwest
systems. Kistler has an MS in computer sci- National Laboratory, Richland, WA 99352;
ence from Syracuse University. [email protected].
