CELL MULTIPROCESSOR COMMUNICATION NETWORK:
BUILT FOR SPEED
MULTICORE DESIGNS PROMISE VARIOUS POWER-PERFORMANCE AND AREA- …

Michael Kistler
IBM Austin Research Laboratory

Michael Perrone
IBM TJ Watson Research Center

Fabrizio Petrini
Pacific Northwest National Laboratory

Over the past decade, high-performance computing has ridden the wave of commodity computing, building cluster-based parallel computers that leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows, processor designers face complex problems in their efforts to increase gate density, reduce power consumption, and design efficient memory hierarchies. Processor developers are looking for solutions that can keep up with the scientific and industrial communities' insatiable demand for computing capability and that also have a sustainable market outside science and industry.

A major trend in computer architecture is integrating system components onto the processor chip. This trend is driving the development of processors that can perform functions typically associated with entire systems. Building modular processors with multiple cores is far more cost-effective than building monolithic processors, which are prohibitively expensive to develop, have high power consumption, and give limited return on investment. Multicore system-on-chip (SoC) processors integrate several identical, independent processing units on the same die, together with network interfaces, acceleration units, and other specialized units.

Researchers have explored several design avenues in both academia and industry. Examples include MIT's Raw multiprocessor, the University of Texas's Trips multiprocessor, AMD's Opteron, IBM's Power5, Sun's Niagara, and Intel's Montecito, among many others. (For details on many of these processors, see the March/April 2005 issue of IEEE Micro.)

In all multicore processors, a major technological challenge is designing the internal, on-chip communication network. To realize the unprecedented computational power of the many available processing units, the network must provide very high performance in
latency and in bandwidth. It must also resolve contention under heavy loads, provide fairness, and hide the processing units' physical distribution as completely as possible.

Another important dimension is the nature and semantics of the communication primitives available for interactions between the various processing units. Pinkston and Shin have recently compiled a comprehensive survey of multicore processor design challenges, with particular emphasis on internal communication mechanisms.1

The Cell Broadband Engine processor (known simply as the Cell processor), jointly developed by IBM, Sony, and Toshiba, uses an elegant and natural approach to on-chip communication. Relying on four slotted rings coordinated by a central arbiter, it borrows a mainstream communication model from high-performance networks in which processing units cooperate through remote direct memory accesses (DMAs).2 From functional and performance viewpoints, the on-chip network is strikingly similar to high-performance networks commonly used for remote communication in commodity computing clusters and custom supercomputers.

In this article, we explore the design of the Cell processor's on-chip network and provide insight into its communication and synchronization protocols. We describe the various steps of these protocols, the algorithms involved, and their basic costs. Our performance evaluation uses a collection of benchmarks of increasing complexity, ranging from basic communication patterns to more demanding collective patterns that expose network behavior under congestion.

Design rationale
The Cell processor's design addresses at least three issues that limit processor performance: memory latency, bandwidth, and power.

Historically, processor performance improvements came mainly from higher processor clock frequencies, deeper pipelines, and wider issue designs. However, memory access speed has not kept pace with these improvements, leading to increased effective memory latencies and complex logic to hide them. Also, because complex cores don't allow a large number of concurrent memory accesses, they underutilize execution pipelines and memory bandwidth, resulting in poor chip area use and increased power dissipation without commensurate performance gains.3

For example, larger memory latencies increase the amount of speculative execution required to maintain high processor utilization. Thus, they reduce the likelihood that useful work is being accomplished and increase administrative overhead and bandwidth requirements. All of these problems lead to reduced power efficiency.

Power use in CMOS processors is approaching the limits of air cooling and might soon begin to require sophisticated cooling techniques.4 These cooling requirements can significantly increase overall system cost and complexity. Decreasing transistor size and correspondingly increasing subthreshold leakage currents further increase power consumption.5

Performance improvements from further increasing processor frequencies and pipeline depths are also reaching their limits.6 Deeper pipelines increase the number of stalls from data dependencies and increase branch misprediction penalties.

The Cell processor addresses these issues by attempting to minimize pipeline depth, increase memory bandwidth, allow more simultaneous in-flight memory transactions, and improve power efficiency and performance.7 These design goals led to the use of flexible yet simple cores that use area and power efficiently.

Processor overview
The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), which is a fully compatible extension of the 64-bit PowerPC Architecture. Its initial target is the PlayStation 3 game console, but its capabilities also make it well suited for other applications such as visualization, image and signal processing, and various scientific and technical workloads.

Figure 1 shows the Cell processor's main functional units. The processor is a heterogeneous, multicore chip capable of massive floating-point processing optimized for computation-intensive workloads and rich broadband media applications. It consists of one 64-bit power processor element (PPE), eight specialized coprocessors called synergistic processor elements (SPEs), a high-speed memory controller, and a high-bandwidth bus interface, all integrated on a single chip.
[Figure 1. The Cell processor's main functional units: the PPE, SPE0 through SPE7, the MIC (memory interface controller), and the IOIF0, IOIF1, and BIF interfaces.]
After issuing a DMA command to its MFC, the SPU can continue execution in parallel with the data transfer, using either polling or blocking interfaces to determine when the transfer is complete. This autonomous execution of MFC DMA commands allows convenient scheduling of DMA transfers to hide memory latency.
The MFC supports naturally aligned transfers of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes to a maximum of 16 Kbytes. DMA list commands can request a list of up to 2,048 DMA transfers using a single MFC DMA command. However, only the MFC's associated SPU can issue DMA list commands. A DMA list is an array of DMA source/destination addresses and lengths in the SPU's local storage. When an SPU issues a DMA list command, the SPU specifies the address and length of the DMA list in the local store.9 Peak performance is achievable for transfers when both the effective address and the local storage address are 128-byte aligned and the transfer size is an even multiple of 128 bytes.
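As an illustration, the sketch below builds and issues a small DMA list; it assumes the mfc_list_element_t type and the mfc_getl intrinsic declared in spu_mfcio.h, and the base address, stride, and element count are placeholders.

    #include <spu_mfcio.h>

    #define N_PIECES 8

    static volatile char dst[N_PIECES * 4096] __attribute__((aligned(128)));
    static mfc_list_element_t list[N_PIECES] __attribute__((aligned(8)));

    void gather(unsigned long long base_ea, unsigned int tag)
    {
        int i;

        /* each element carries a transfer size and the low 32 bits of a main-storage
           address; successive elements fill successive local-store locations */
        for (i = 0; i < N_PIECES; i++) {
            list[i].notify = 0;
            list[i].size   = 4096;
            list[i].eal    = (unsigned int)(base_ea + (unsigned long long)i * 65536);
        }

        /* one list command queues all N_PIECES transfers under a single tag */
        mfc_getl(dst, base_ea, list, sizeof(list), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }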
Signal notification and mailboxes. The signal notification facility supports two signaling channels: Sig_Notify_1 and Sig_Notify_2. The SPU can read its own signal channels using the read-blocking SPU channels SPU_RdSigNotify1 and SPU_RdSigNotify2. The PPE or an SPU can write to these channels using memory-mapped addresses. A special feature of the signaling channels is that they can be configured to treat writes as logical OR operations, allowing simple but powerful collective communication across processors.
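The following sketch illustrates the OR mode from the receiving SPU's side, assuming the spu_read_signal1 intrinsic from spu_mfcio.h; the bit mask of expected senders is a placeholder.

    #include <spu_mfcio.h>

    #define ALL_PARTNERS 0x7f          /* hypothetical: one bit per expected sender */

    void wait_for_partners(void)
    {
        unsigned int arrived = 0;

        /* in OR mode, concurrent writes accumulate in the register; each read
           blocks until at least one bit is set, returns the bits, and clears them */
        while (arrived != ALL_PARTNERS)
            arrived |= spu_read_signal1();
    }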
Each SPU also has a set of mailboxes that can function as a narrow (32-bit) communication channel to the PPE or another SPE. The SPU has a four-entry, read-blocking inbound mailbox and two single-entry, write-blocking outbound mailboxes, one of which also generates an interrupt to the PPE when the SPE writes to it. The PPE uses memory-mapped addresses to write to the SPU's inbound mailbox and read from either of the SPU's outbound mailboxes. In contrast to the signal notification channels, mailboxes are much better suited to one-to-one communication patterns such as master-slave or producer-consumer models. A typical round-trip communication using mailboxes between two SPUs takes approximately 300 nanoseconds (ns).
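A minimal SPU-side sketch of a master-worker exchange over the mailboxes, assuming the spu_read_in_mbox and spu_write_out_mbox intrinsics from spu_mfcio.h (do_work is a placeholder for the application's processing routine):

    #include <spu_mfcio.h>

    extern unsigned int do_work(unsigned int cmd);   /* placeholder */

    void worker_loop(void)
    {
        for (;;) {
            /* blocks until the PPE (or another SPE) writes our inbound mailbox */
            unsigned int cmd = spu_read_in_mbox();
            if (cmd == 0)
                break;
            /* blocks if the previous result has not yet been drained */
            spu_write_out_mbox(do_work(cmd));
        }
    }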
Atomic operations. To support more complex synchronization mechanisms, the SPU can use special DMA operations to atomically update a lock line in main memory. These operations, called get-lock-line-and-reserve (getllar) and put-lock-line-conditional (putllc), are conceptually equivalent to the PowerPC load-and-reserve (lwarx) and store-conditional (stwcx) instructions.

The getllar operation reads the value of a synchronization variable in main memory and sets a reservation on this location. If the PPE or another SPE subsequently modifies the synchronization variable, the SPE loses its reservation. The putllc operation updates the synchronization variable only if the SPE still holds a reservation on its location. If putllc fails, the SPE must reissue getllar to obtain the synchronization variable's new value and then retry the update with another putllc. The MFC's atomic unit performs the atomic DMA operations and manages reservations held by the SPE.

Using atomic updates, the SPU can participate with the PPE and other SPUs in locking protocols, barriers, or other synchronization mechanisms. The atomic operations available to the SPU also have some special features, such as notification through an interrupt when a reservation is lost, that enable more efficient and powerful synchronization than traditional approaches.
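One common use of this pair is an atomic fetch-and-add on a shared counter. The sketch below assumes the mfc_getllar, mfc_putllc, and mfc_read_atomic_status intrinsics and the MFC_PUTLLC_STATUS flag from spu_mfcio.h; counter_ea is a placeholder for a 128-byte-aligned main-memory location whose first word holds the counter.

    #include <spu_mfcio.h>

    /* atomic commands always operate on a full 128-byte lock line */
    static volatile unsigned int line[32] __attribute__((aligned(128)));

    unsigned int fetch_and_add(unsigned long long counter_ea, unsigned int delta)
    {
        unsigned int old;

        do {
            mfc_getllar(line, counter_ea, 0, 0);   /* load the line and set a reservation */
            mfc_read_atomic_status();              /* wait for the getllar to complete */
            old = line[0];
            line[0] = old + delta;
            mfc_putllc(line, counter_ea, 0, 0);    /* conditional store of the updated line */
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);   /* set => reservation lost, retry */

        return old;
    }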
Memory-mapped I/O (MMIO) resources. Memory-mapped resources play a role in many of the communication mechanisms already discussed, but these are really just special cases of the Cell architecture's general practice of making all SPE resources available through MMIO. These resources fall into four broad classes:

• Local storage. All of an SPU's local storage can be mapped into the effective-address space. This allows the PPE to access the SPU's local storage with simple loads and stores, though doing so is far less efficient than using DMA. MMIO access to local storage is not synchronized with SPU execution, so programmers must ensure that the SPU program is designed to allow unsynchronized access to its data (for example, by using "volatile" variables) when exploiting this feature.
• Problem state memory map. Resources in this class, intended for use directly by application programs, include access to the SPE's DMA engine, mailbox channels, and signal notification channels.
• Privilege 1 memory map. These resources are available to privileged programs such as the operating system or authorized subsystems to monitor and control the execution of SPU applications.
• Privilege 2 memory map. The operating system uses these resources to control the resources available to the SPE.

DMA flow
The SPE's DMA engine handles most communications between the SPU and other Cell elements and executes DMA commands issued by either the SPU or the PPE. A DMA command's data transfer direction is always referenced from the SPE's perspective. Therefore, commands that transfer data into an SPE (from main storage to local store) are considered get commands (gets), and transfers of data out of an SPE (from local store to main storage) are considered put commands (puts).

DMA transfers are coherent with respect to main storage. Programmers should be aware that the MFC might process the commands in the queue in a different order from that in which they entered the queue. When order is important, programmers must use special forms of the get and put commands to enforce either barrier or fence semantics against other commands in the queue.
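For example, a common idiom writes a data block and then a completion flag with a fenced put in the same tag group, so the flag can never reach memory before the data. The sketch below assumes the mfc_put and mfc_putf (put-with-fence) intrinsics from spu_mfcio.h; the effective addresses are placeholders and are assumed to be suitably aligned.

    #include <spu_mfcio.h>

    static volatile char payload[4096] __attribute__((aligned(128)));
    static volatile unsigned int flag __attribute__((aligned(16)));

    void put_with_flag(unsigned long long data_ea, unsigned long long flag_ea,
                       unsigned int tag)
    {
        flag = 1;
        mfc_put(payload, data_ea, sizeof(payload), tag, 0, 0);

        /* fenced put: ordered after all earlier commands in the same tag group;
           mfc_putb would additionally order it before any later commands (barrier) */
        mfc_putf(&flag, flag_ea, sizeof(flag), tag, 0, 0);
    }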
The MFC’s MMU handles address transla- updates the DMA entry and must rese-
tion and protection checking of DMA access- lect it to continue processing.
es to main storage, using information from 4. If the command requires address trans-
page and segment tables defined in the Pow- lation, the DMAC queues it to the
erPC architecture. The MMU has a built-in MMU for processing. When the transla-
translation look-aside buffer (TLB) for caching tion is available in the TLB, processing
the results of recently performed translations. proceeds to the next step (unrolling). On
The MFC’s DMA controller (DMAC) a TLB miss, the MMU performs the
processes DMA commands queued in the translation, using the page tables stored
MFC. The MFC contains two separate DMA in main memory, and updates the TLB.
command queues: The DMA entry is updated and must be
reselected for processing to continue.
• MFC SPU command queue, for com- 5. Next, the DMAC unrolls the command—
mands issued by the associated SPU that is, creates a bus request to transfer the
using the channel interface; and next block of data for the command. This
• MFC proxy command queue, for com- bus request can transfer up to 128 bytes of
mands issued by the PPE or other devices data but can transfer less, depending on
using MMIO registers. alignment issues or the amount of data the
[Figure 3. Basic flow of a DMA transfer initiated by an SPU. The SPU channel interface and the MMIO proxy interface feed the MFC's DMA queues; the DMAC, MMU/TLB, and BIU move requests onto the EIB (1.6 GHz, 204.8 Gbytes/s), which connects to the MIC and off-chip memory (25.6 Gbytes/s in and out). The numbered labels correspond to the steps described below.]
Figure 3 illustrates the basic flow of a DMA transfer to main storage initiated by an SPU. The process consists of the following steps:

1. The SPU uses the channel interface to place the DMA command in the MFC SPU command queue.
2. The DMAC selects a command for processing. The selection rules are complex, but, in general, a) commands in the SPU command queue take priority over commands in the proxy command queue, b) the DMAC alternates between get and put commands, and c) the command must be ready (not waiting for address resolution or a list element fetch, and not dependent on another command).
3. If the command is a DMA list command and requires a list element fetch, the DMAC queues a request for the list element to the local-store interface. When the list element is returned, the DMAC updates the DMA entry and must reselect it to continue processing.
4. If the command requires address translation, the DMAC queues it to the MMU for processing. When the translation is available in the TLB, processing proceeds to the next step (unrolling). On a TLB miss, the MMU performs the translation, using the page tables stored in main memory, and updates the TLB. The DMA entry is updated and must be reselected for processing to continue.
5. Next, the DMAC unrolls the command—that is, creates a bus request to transfer the next block of data for the command. This bus request can transfer up to 128 bytes of data but can transfer less, depending on alignment issues or the amount of data the DMA command requests. The DMAC then queues this bus request to the bus interface unit (BIU).
6. The BIU selects the request from its queue and issues the command to the EIB. The EIB orders the command with other outstanding requests and then broadcasts the command to all bus elements. For transfers involving main memory, the MIC acknowledges the command to the EIB, which then informs the BIU that the command was accepted and data transfer can begin.
7. The BIU in the MFC performs the reads to local store required for the data transfer. The EIB transfers the data for this request between the BIU and the MIC. The MIC transfers the data to or from the off-chip memory.
8. The unrolling process produces a sequence of bus requests for the DMA command that pipeline through the communication network. The DMA command remains in the MFC SPU command queue until all its bus requests have completed. However, the DMAC can continue to process other DMA commands. When all bus requests for a command have completed, the DMAC signals command completion to the SPU and removes the command from the queue.

In the absence of congestion, a thread running on the SPU can issue a DMA request in as little as 10 clock cycles—the time needed to write to the five SPU channels that describe the source and destination addresses, the DMA size, the DMA tag, and the DMA command. At that point, the DMAC can process the DMA request without SPU intervention. The overall latency of generating the DMA command, initially selecting the command,
and unrolling the first bus request to the BIU—or, in simpler terms, the flow-through latency from SPU issue to injection of the bus request into the EIB—is roughly 30 SPU cycles when all resources are available. If a list element fetch is required, it can add roughly 20 SPU cycles. MMU translation exceptions by the SPE are very expensive and should be avoided if possible. If the queue in the BIU becomes full, the DMAC is blocked from issuing further requests until resources become available again.

A transfer's command phase involves snooping operations by all bus elements to ensure coherence and typically requires some 50 bus cycles (100 SPU cycles) to complete. For gets, the remaining latency is attributable to the data transfer from off-chip memory to the memory controller and then across the bus to the SPE, which writes it to local store. For puts, DMA latency doesn't include transferring data all the way to off-chip memory because the SPE considers the put complete once all data have been transferred to the memory controller.

Experimental results
We conducted a series of experiments to explore the major performance aspects of the Cell's on-chip communication network, its protocols, and the impact of pipelined communication. We developed a suite of microbenchmarks to analyze the internal interconnection network's architectural features. Following the research path of previous work on traditional high-performance networks,2 we adopted an incremental approach to gain insight into several aspects of the network. We started our analysis with simple pairwise, congestion-free DMAs, and then we increased the benchmarks' complexity by including several patterns that expose the contention resolution properties of the network under heavy load.

Methodology
Because of the limited availability of Cell boards at the time of our experiments, we performed most of the software development on IBM's Full-System Simulator for the Cell Broadband Engine Processor (available at http://www-128.ibm.com/developerworks/power/cell/).10,11 We collected the results presented here using an experimental evaluation board at the IBM Research Center in Austin. The Cell processor on this board was running at 3.2 GHz. We also obtained results from an internal version of the simulator that includes performance models for the MFC, EIB, and memory subsystems. Performance simulation for the Cell processor is still under development, but we found good correlation between the simulator and hardware in our experiments. The simulator let us observe aspects of system behavior that would be difficult or practically impossible to observe on actual hardware.

We developed the benchmarks in C, using several Cell-specific libraries to orchestrate activities between the PPE and the various SPEs. In all tests, the DMA operations are issued by the SPUs. We wrote DMA operations with C language intrinsics,12 which in most cases produce inline assembly instructions to specify commands to the MFC through the SPU channel interface. We measured elapsed time in the SPE using a special register, the SPU decrementer, which ticks every 40 ns (every 128 processor clock cycles).
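A sketch of this timing idiom, assuming the spu_write_decrementer and spu_read_decrementer intrinsics from spu_mfcio.h (the decrementer counts down, so elapsed time is the difference between the two samples):

    #include <spu_mfcio.h>

    unsigned int time_get_ns(volatile void *ls, unsigned long long ea,
                             unsigned int size, unsigned int tag)
    {
        unsigned int t0, t1;

        spu_write_decrementer(0x7fffffff);     /* start from a large value */
        t0 = spu_read_decrementer();

        mfc_get(ls, ea, size, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();             /* block until the DMA completes */

        t1 = spu_read_decrementer();
        return (t0 - t1) * 40;                 /* one decrementer tick = 40 ns */
    }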
PPE and SPE interactions were performed through mailboxes, input and output SPE registers that can send and receive messages in as little as 150 ns. We implemented a simple synchronization mechanism to start the SPUs in a coordinated way. SPUs notify the completion of benchmark phases through an atomic fetch-and-add operation on a main memory location that is polled infrequently and unintrusively by the PPE.

Basic DMA performance
The first step of our analysis measures the latency and bandwidth of simple blocking puts and gets, when the target is in main memory or in another SPE's local store. Table 1 breaks down the latency of a DMA operation into its components.

Figure 4a shows DMA operation latency for a range of sizes. In these results, transfers between two local stores were always performed between SPEs 0 and 1, but in all our experiments, we found no performance difference attributable to SPE location. The results show that puts to main memory and gets and puts to local store had a latency of only 91 ns for transfers of up to 512 bytes (four cache lines). There was little difference
between puts and gets to local store because local-store access latency was remarkably low—only 8 ns (about 24 processor clock cycles). Main memory gets required less than 100 ns to fetch information, which is remarkably fast.

Table 1. DMA latency components for a clock frequency of 3.2 GHz.

    Latency component                  Cycles    Nanoseconds
    DMA issue                              10          3.125
    DMA to EIB                             30          9.375
    List element fetch                     10          3.125
    Coherence protocol                    100         31.25
    Data transfer for inter-SPE put       140         43.75
    Total                                 290         90.625

Figure 4b presents the same results in terms of bandwidth achieved by each DMA operation. As we expected, the largest transfers achieved the highest bandwidth, which we measured as 22.5 Gbytes/s for gets and puts to local store and puts to main memory, and 15 Gbytes/s for gets from main memory.

[Figure 4. Latency (a) and bandwidth (b) as a function of DMA message size for blocking gets and puts in the absence of contention.]

Next, we considered the impact of nonblocking DMA operations. In the Cell processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in on-chip communication. Applications that rely heavily on random scatter or gather accesses to main memory can take advantage of these communication features seamlessly. Our benchmarks use a batched communication model, in which the SPU issues a fixed number (the batch size) of DMAs before blocking for notification of request completion. By using a very large batch size (16,384 in our experiments), we effectively converted the benchmark to use a nonblocking communication model.
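The batched model itself is simple; the sketch below, using the same spu_mfcio.h intrinsics as before (buffer, address, and size parameters are placeholders), issues a batch of puts under one tag and blocks once for the whole group.

    #include <spu_mfcio.h>

    void batched_puts(volatile char *src, unsigned long long dst_ea,
                      unsigned int msg_size, unsigned int batch, unsigned int tag)
    {
        unsigned int i;

        for (i = 0; i < batch; i++)            /* up to 16 DMAs can be outstanding per SPE */
            mfc_put(src + i * msg_size, dst_ea + i * msg_size, msg_size, tag, 0, 0);

        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();             /* a single notification covers the whole batch */
    }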
Figure 5 shows the results of these experiments, including aggregate latency and bandwidth for the set of DMAs in a batch, by batch size and data transfer size. The results show a form of performance continuity between blocking—the most constrained case—and nonblocking operations, with different degrees of freedom expressed by the increasing batch size. In accessing main memory and local storage, nonblocking puts achieved the asymptotic bandwidth of 25.6 Gbytes/s, determined by the EIB capacity at the endpoints, with 2-Kbyte DMAs (Figures 5b and 5f). Accessing local store, nonblocking puts achieved the optimal value with even smaller packets (Figure 5f).

Gets are also very efficient when accessing local memories, and the main memory latency penalty slightly affects them, as Figure 5c shows. Overall, even a limited amount of batching is very effective for intermediate DMA sizes, between 256 bytes and 4 Kbytes, with a factor of two or even three of bandwidth increase compared with the blocking case (for example, 256-byte DMAs in Figure 5h).
[Figure 5. DMA performance dimensions: how latency and bandwidth are affected by the choice of DMA target (main memory or local store), DMA direction (put or get), and DMA synchronization (blocking after each DMA, blocking after a constant number of DMAs ranging from two to 32, and nonblocking with a final fence). Panels: (a) put latency, main memory; (b) put bandwidth, main memory; (c) get latency, main memory; (d) get bandwidth, main memory; (e) put latency, local store; (f) put bandwidth, local store; (g) get latency, local store; (h) get bandwidth, local store.]
Collective DMA performance
Parallelization of scientific applications generates far more sophisticated collective communication patterns than the single pairwise transfers examined so far. The first case we consider is the hot spot, in which the communicating SPEs all target the same location, either in main memory or in a single SPE's local store.13 This is a very demanding pattern that exposes how the on-chip network and the communication protocols behave under stress. It is representative of the most straightforward code parallelizations, which distribute computation to a collection of threads that fetch data from main memory, perform the desired computation, and store the results without SPE interaction. Figure 6a shows that the Cell processor resolves hot spots in accesses to local storage optimally, reaching the asymptotic performance with two or more SPEs. (For the SPE hot spot tests, the number of SPEs includes the hot node; x SPEs include 1 hot SPE plus x − 1 communication partners.)

Counterintuitively, get commands outperform puts under load. In fact, with two or more SPEs, two or more get sources saturate the bandwidth either in main or local store. The put protocol, on the other hand, suffers from a minor performance degradation, approximately 1.5 Gbytes/s less than the optimal value.

The second case is collective communication patterns, in which all the SPEs are both source and target of the communication. Figure 6b summarizes the performance aspects of the most common patterns that arise from typical parallel applications. In the two static patterns, complement and pairwise, each SPE executes a sequence of DMAs to a fixed target SPE. In the complement pattern, each SPE selects the target SPE by complementing the bit string that identifies the source. In the pairwise pattern, the SPEs are logically organized in pairs <i, i + 1>, where i is an even number, and each SPE communicates with its partner. Note that SPEs with numerically consecutive numbers might not be physically adjacent on the Cell hardware layout.
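In code, the two static patterns reduce to a one-line target computation per SPE; the sketch below assumes eight SPEs with IDs 0 through 7, and my_id is a placeholder for the rank assigned to each SPE by the application.

    /* my_id is the SPE's rank, 0-7 (assigned by the application) */
    unsigned int complement_target(unsigned int my_id) { return my_id ^ 0x7; }  /* 0<->7, 1<->6, ... */
    unsigned int pairwise_target(unsigned int my_id)   { return my_id ^ 0x1; }  /* <i, i+1>, i even  */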
[Figure 6. Aggregate communication performance: hot spots (a) and collective communication patterns (b). Panel (b) compares uniform, complement, and pairwise (put and get) traffic for one to eight SPEs.]

The first static pattern, complement, is resolved optimally by the network and can be mapped to the four rings with an aggregate performance slightly below 200 Gbytes/s (98 percent of aggregate peak bandwidth).

The direction of data transfer affects the pairwise pattern's performance. As the hot spot experiments show, gets have better contention resolution properties under heavy load, and Figure 6b further confirms this, showing a gap of 40 Gbytes/s in aggregate bandwidth between put- and get-based patterns.

The most difficult communication pattern, arguably the worst case for this type of on-chip network, is uniform traffic, in which each SPE chooses a new destination at random for every DMA.
References
1. T.M. Pinkston and J. Shin, "Trends toward …
9. … http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
10. IBM Full-System Simulator for the Cell Broadband Engine Processor, IBM AlphaWorks Project, http://www.alphaworks.ibm.com/tech/cellsystemsim.
11. J.L. Peterson et al., "Application of Full-System Simulation in Exploratory System Design and Development," IBM J. Research and Development, vol. 50, no. 2/3, 2006, pp. 321-332.
12. IBM SPU C/C++ Language Extensions 2.1, http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
13. G.F. Pfister and V.A. Norton, "Hot Spot Contention and Combining in Multistage Interconnection Networks," IEEE Trans. Computers, vol. 34, no. 10, Oct. 1985, pp. 943-948.

Michael Kistler is a senior software engineer in the IBM Austin Research Laboratory. His research interests include parallel and cluster computing, fault tolerance, and full-system simulation of high-performance computing systems. Kistler has an MS in computer science from Syracuse University.

Michael Perrone is the manager of the IBM TJ Watson Research Center's Cell Solutions Department. His research interests include algorithmic optimization for the Cell processor, parallel computing, and statistical machine learning. Perrone has a PhD in physics from Brown University.

Fabrizio Petrini is a laboratory fellow in the Applied Computer Science Group of the Computational Sciences and Mathematics Division at Pacific Northwest National Laboratory. His research interests include various aspects of supercomputers, such as high-performance interconnection networks and network interfaces, multicore processors, job-scheduling algorithms, parallel architectures, operating systems, and parallel-programming languages. Petrini has a Laurea and a PhD in computer science from the University of Pisa, Italy.

Direct questions and comments about this article to Fabrizio Petrini, Applied Computer Science Group, MS K7-90, Pacific Northwest National Laboratory, Richland, WA 99352; [email protected].