The Landscape of Parallel Computing Research: A View From Berkeley
Abstract
The recent switch to parallel microprocessors is a milestone in the history of computing.
Industry has laid out a roadmap for multicore designs that preserves the programming
paradigm of the past via binary compatibility and cache coherence. Conventional wisdom
is now to double the number of cores on a chip with each silicon generation.
A multidisciplinary group of Berkeley researchers met nearly two years to discuss this
change. Our view is that this evolutionary approach to parallel hardware and software
may work for 2- to 8-processor systems, but is likely to face diminishing returns as 16- and 32-processor systems are realized, just as returns fell with greater instruction-level
parallelism.
We believe that much can be learned by examining the success of parallelism at the
extremes of the computing spectrum, namely embedded computing and high performance
computing. This led us to frame the parallel landscape with seven questions, and to
recommend the following:
• The overarching goal should be to make it easy to write programs that execute
efficiently on highly parallel computing systems
• The target should be 1000s of cores per chip, as these chips are built from
processing elements that are the most efficient in MIPS (Million Instructions per
Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
• Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel
programming models and architectures. (A dwarf is an algorithmic method that
captures a pattern of computation and communication.)
• “Autotuners” should play a larger role than conventional compilers in translating
parallel programs.
• To maximize programmer productivity, future programming models must be
more human-centric than the conventional focus on hardware or applications.
• To be successful, programming models should be independent of the number of
processors.
• To maximize application efficiency, programming models should support a wide
range of data types and successful models of parallelism: task-level parallelism,
word-level parallelism, and bit-level parallelism.
Since real world applications are naturally parallel and hardware is naturally parallel,
what we need is a programming model, system software, and a supporting architecture
that are naturally parallel. Researchers have the rare opportunity to re-invent these
cornerstones of computing, provided they simplify the efficient programming of highly
parallel systems.
1.0 Introduction
The computing industry changed course in 2005 when Intel followed the lead of IBM’s
Power 4 and Sun Microsystems’ Niagara processor in announcing that its high
performance microprocessors would henceforth rely on multiple processors or cores. The
new industry buzzword “multicore” captures the plan of doubling the number of standard
cores per die with every semiconductor process generation starting with a single
processor. Multicore will obviously help multiprogrammed workloads, which contain a
mix of independent sequential tasks, but how will individual tasks become faster?
Switching from sequential to modestly parallel computing will make programming much
more difficult without rewarding this greater effort with a dramatic improvement in
power-performance. Hence, multicore is unlikely to be the ideal answer.
[Figure 1. A view from Berkeley: seven critical questions for 21st Century parallel computing. The figure shows an Applications tower and a Hardware tower, with a tension between embedded and server computing, bridged by programming models and evaluation. The questions are: (1) What are the applications? (2) What are common kernels of the applications? (3) What are the hardware building blocks? (4) How to connect them? (5) How to describe applications and kernels? (6) How to program the hardware? (7) How to measure success? (This figure is inspired by a view of the Golden Gate Bridge from Berkeley.)]
Although compatibility with old binaries and C programs is valuable to industry, and
some researchers are trying to help multicore product plans succeed, we have been
thinking bolder thoughts. Our aim is to realize thousands of processors on a chip for new
applications, and we welcome new programming models and new architectures if they
simplify the efficient programming of such highly parallel systems. Rather than
multicore, we are focused on “manycore”. Successful manycore architectures and
supporting software technologies could reset microprocessor hardware and software
roadmaps for the next 30 years.
Figure 1 shows the seven critical questions we used to frame the landscape of parallel
computing research. We do not claim to have the answers in this report, but we do offer
non-conventional and provocative perspectives on some questions and state seemingly
obvious but sometimes-neglected perspectives on others.
Note that there is a tension between embedded and high performance computing, which
surfaced in many of our discussions. We argue that these two ends of the computing
spectrum have more in common looking forward than they did in the past. First, both are
concerned with power, whether it is battery life for cell phones or the cost of electricity
and cooling in a data center. Second, both are concerned with hardware utilization.
Embedded systems are always sensitive to cost, but efficient use of hardware is also
required when you spend $10M to $100M for high-end servers. Third, as the size of
embedded software increases over time, the fraction of hand tuning must be limited and
so the importance of software reuse must increase. Fourth, since both embedded and
high-end servers now connect to networks, both need to prevent unwanted accesses and
viruses. Thus, the need is increasing for some form of operating system for protection in
embedded systems, as well as for resource sharing and scheduling.
Perhaps the biggest difference between the two targets is the traditional emphasis on real-
time computing in embedded, where the computer and the program need to be just fast
enough to meet the deadlines, and there is no benefit to running faster. Running faster is
usually valuable in server computing. As server applications become more media-
oriented, real time may become more important for server computing as well. This report
borrows many ideas from both embedded and high performance computing.
The organization of the report follows the seven questions of Figure 1. Section 2
documents the reasons for the switch to parallel computing by providing a number of
guiding principles. Section 3 reviews the left tower in Figure 1, which represents the new
applications for parallelism. It describes the original “Seven Dwarfs”, which we believe
will be the computational kernels of many future applications. Section 4 reviews the right
tower, which is hardware for parallelism, and we separate the discussion into the classical
categories of processor, memory, and switch. Section 5 covers programming models and
Section 6 covers systems software; they form the bridge that connects the two towers in
Figure 1. Section 7 discusses measures of success and describes a new hardware vehicle
for exploring parallel computing. We conclude with a summary of our perspectives.
Given the breadth of topics we address in the report, we provide 134 references for
readers interested in learning more.
In addition to this report, we also started a web site and blog to continue the conversation
about the views expressed in this report. See view.eecs.berkeley.edu.
2.0 Motivation
The promise of parallelism has fascinated researchers for at least three decades. In the
past, parallel computing efforts have shown promise and gathered investment, but in the
end, uniprocessor computing always prevailed. Nevertheless, we argue general-purpose
computing is taking an irreversible step toward parallel architectures. What’s different
this time? This shift toward increasing parallelism is not a triumphant stride forward
based on breakthroughs in novel software and architectures for parallelism; instead, this
plunge into parallelism is actually a retreat from even greater challenges that thwart
efficient silicon implementation of traditional uniprocessor architectures.
In the following, we capture a number of guiding principles that illustrate precisely how
everything is changing in computing. Following the style of Newsweek, they are listed as
pairs of outdated conventional wisdoms and their new replacements. We later refer to
these pairs as CW #n.
1. Old CW: Power is free, but transistors are expensive.
• New CW is the “Power wall”: Power is expensive, but transistors are “free”. That
is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
• New CW: For desktops and servers, static power due to leakage can be 40% of
total power. (See Section 4.1.)
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors
occurring only at the pins.
• New CW: As chips drop below 65 nm feature sizes, they will have high soft and
hard error rates. [Borkar 2005] [Mukherjee et al 2005]
4. Old CW: By building upon prior successes, we can continue to raise the level of
abstraction and hence the size of hardware designs.
• New CW: Wire delay, noise, cross coupling (capacitive and inductive),
manufacturing variability, reliability (see above), clock jitter, design validation,
and so on conspire to stretch the development time and cost of large designs at 65
nm or smaller feature sizes. (See Section 4.1.)
5. Old CW: Researchers demonstrate new architecture ideas by building chips.
• New CW: The cost of masks at 65 nm feature size, the cost of Electronic
Computer Aided Design software to design such chips, and the cost of design for
GHz clock rates means researchers can no longer build believable prototypes.
Thus, an alternative approach to evaluating architectures must be developed. (See
Section 7.3.)
6. Old CW: Performance improvements yield both lower latency and higher
bandwidth.
• New CW: Across many technologies, bandwidth improves by at least the square
of the improvement in latency. [Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast.
• New CW is the “Memory wall” [Wulf and McKee 1995]: Load and store is slow,
but multiply is fast. Modern microprocessors can take 200 clocks to access
Dynamic Random Access Memory (DRAM), but even floating-point multiplies
may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers
and architecture innovation. Examples from the past include branch prediction,
out-of-order execution, speculation, and Very Long Instruction Word systems.
• New CW is the “ILP wall”: There are diminishing returns on finding more ILP.
[Hennessy and Patterson 2007]
9. Old CW: Uniprocessor performance doubles every 18 months.
• New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 plots
processor performance for almost 30 years. In 2006, performance is a factor of
three below the traditional doubling every 18 months that we enjoyed between
1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little
while and run it on a much faster sequential computer.
• New CW: It will be a very long wait for a faster sequential computer (see above).
11. Old CW: Increasing clock frequency is the primary method of improving
processor performance.
• New CW: Increasing parallelism is the primary method of improving processor
performance. (See Section 4.1.)
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
• New CW: Given the switch to parallel computing, any speedup via parallelism is a
success.
[Figure 2 plots processor performance relative to the VAX-11/780 on a log scale from 1978 to 2006, with growth of about 25% per year before 1986, 52% per year from 1986 to 2002, and an uncertain (??%/year) rate thereafter.]
Figure 2. Processor performance improvement between 1978 and 2006 using integer SPEC [SPEC 2006]
programs. RISCs helped inspire performance to improve by 52% per year between 1986 and 2002, which
was much faster than the VAX minicomputer improved between 1978 and 1986. Since 2002, performance
has improved less than 20% per year. By 2006, processors will be a factor of three slower than if progress
had continued at 52% per year. This figure is Figure 1.1 in [Hennessy and Patterson 2007].
Although the CW pairs above paint a negative picture about the state of hardware, there
are compensating positives as well. First, Moore’s Law continues, so we will soon be
able to put thousands of simple processors on a single, economical chip (see Section
4.1.2). For example, Cisco is shipping a product with 188 Reduced Instruction Set
Computer (RISC) cores on a single chip in a 130nm process [Eatherton 2005]. Second,
communication between these processors within a chip can have very low latency and
very high bandwidth. These monolithic manycore microprocessors represent a very
different design point from traditional multichip multiprocessors, and so provide promise
for the development of new architectures and programming models. Third, the open
source software movement means that the software stack can evolve much more quickly
than in the past. As an example, note the widespread use of Ruby on Rails, whose version 1.0 appeared only in December 2005.
Our goal is to delineate application requirements in a manner that is not overly specific to
individual applications or the optimizations used for certain hardware platforms, so that
we can draw broader conclusions about hardware requirements. Our approach, described
below, is to define a number of “dwarfs”, which each capture a pattern of computation
and communication common to a class of important applications.
be implemented differently and the underlying numerical methods may change over time,
but the claim is that the underlying patterns have persisted through generations of
changes and will remain important into the future.
Some evidence for the existence of this particular set of “equivalence classes” can be
found in the numerical libraries that have been built around these equivalence classes: for
example, FFTW for spectral methods [Frigo and Johnson 1998], LAPACK/ScaLAPACK
for dense linear algebra [Blackford et al 1996], and OSKI for sparse linear algebra
[Vuduc et al 2006]. We list these in Figure 3, together with the computer architectures
that have been purpose-built for particular dwarfs: for example, GRAPE for N-body
methods [Tokyo 2006], vector architectures for linear algebra [Russell 1976], and FFT
accelerators [Zarlink 2006]. Figure 3 also shows the inter-processor communication
patterns exhibited by members of a dwarf when running on a parallel machine [Vetter
and McCracken 2001] [Vetter and Yoo 2002] [Vetter and Meuller 2002] [Kamil et al
2005]. The communication pattern is closely related to the memory access pattern that
takes place locally on each processor.
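To make the notion of a dwarf concrete, the following is an illustrative sketch of our own (not code from the libraries above) of the sparse matrix-vector multiply kernel at the heart of the sparse linear algebra dwarf, using the common compressed sparse row (CSR) layout; the indirect accesses through col_idx exemplify the irregular memory access and communication pattern that characterizes this dwarf.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A*x for a sparse matrix A stored in compressed sparse row form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect, irregular access to x
        y[i] = acc
    return y

# 2x2 example: A = [[10, 0], [3, 4]]
print(spmv_csr([10.0, 3.0, 4.0], [0, 0, 1], [0, 1, 3], [1.0, 2.0]))  # [10.0, 11.0]
```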
Dwarfs are specified at a high level of abstraction that can group related but quite
different computational methods. Over time, a single dwarf can expand to cover such a
disparate variety of methods that it should be viewed as multiple distinct dwarfs. As long
as we do not end up with too many dwarfs, it seems wiser to err on the side of embracing
new dwarfs. For example, unstructured grids could be interpreted as a sparse matrix
problem, but this would both limit the problem to a single level of indirection and
disregard too much additional information about the problem.
To investigate the general applicability of the Seven Dwarfs, we compared the list against
other collections of benchmarks: EEMBC from embedded computing and SPEC2006 from desktop and server computing. These collections were independent of our
study, so they act as validation for whether our small set of computational kernels are
good targets for applications of the future. We will describe the final list in detail in
Section 3.5, but from our examination of the 41 EEMBC kernels and the 26 SPEC2006
programs, we found four more dwarfs to add to the list:
o Combinational Logic generally involves performing simple operations on
very large amounts of data, often exploiting bit-level parallelism. For example, computing Cyclic Redundancy Codes (CRC) is critical to ensure data integrity, and RSA encryption is critical for data security.
o Graph Traversal applications must traverse a number of objects and examine
characteristics of those objects such as would be used for search. It typically
involves indirect table lookups and little computation.
o Graphical Models applications involve graphs that represent random
variables as nodes and conditional dependencies as edges. Examples include
Bayesian networks and Hidden Markov Models.
o Finite State Machines represent an interconnected set of states, such as
would be used for parsing. Some state machines can decompose into multiple
simultaneously active state machines that can act in parallel.
Michael Jordan and Dan Klein, our local experts in machine learning, found two dwarfs
that should be added to support machine learning:
o Dynamic programming is an algorithmic technique that computes solutions by solving simpler overlapping subproblems. It is particularly applicable to optimization problems where the optimal result for a problem is built up from the optimal results for its subproblems. (A brief sketch follows this list.)
o Backtrack and Branch-and-Bound: These involve solving various search
and global optimization problems for intractably large spaces. Some implicit
method is required in order to rule out regions of the search space that contain
no interesting solutions. Branch and bound algorithms work by the divide and
conquer principle: the search space is subdivided into smaller subregions
(“branching”), and bounds are found on all the solutions contained in each
subregion under consideration.
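As noted in the dynamic programming item above, here is a brief illustrative sketch of our own (with made-up data) of the dynamic programming dwarf, using the classic 0/1 knapsack problem: the optimal value for each capacity is built up from the optimal values of smaller subproblems.

```python
def knapsack(weights, values, capacity):
    """0/1 knapsack via dynamic programming.

    best[c] holds the optimal value achievable with capacity c using the
    items considered so far; each entry is built from smaller subproblems.
    """
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([2, 3, 4], [3, 4, 5], 5))  # 7: take the items of weight 2 and 3
```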
Many other well-known machine-learning algorithms fit into the existing dwarfs.
Joe Hellerstein, our local expert in databases, said the future of databases was large data
collections typically found on the Internet. A key primitive to explore such collections is
MapReduce, developed and widely used at Google. [Dean and Ghemawat 2004] The first
phase maps a user-supplied function onto thousands of computers, processing key/value pairs to generate a set of intermediate key/value pairs. The second phase reduces the
returned values from all those thousands of instances into a single result by merging all
intermediate values associated with the same intermediate key. Note that these two
phases are highly parallel yet simple to understand. Borrowing the name from a similar
function in Lisp, they call this primitive “MapReduce”.
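As a toy illustration only (a sequential sketch of ours, not Google's implementation), the two phases can be expressed as follows for counting word occurrences; the function names are our own.

```python
from collections import defaultdict

def word_count_map(_key, text):
    # Phase 1: emit intermediate key/value pairs for one input record.
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    # Phase 2: merge all intermediate values that share the same key.
    return word, sum(counts)

def mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:            # map phase: independent per record
        for k, v in mapper(key, value):
            groups[k].append(v)
    return [reducer(k, vs) for k, vs in groups.items()]   # reduce phase

print(mapreduce([("doc1", "to be or not to be")],
                word_count_map, word_count_reduce))
```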
MapReduce is a more general version of the pattern we had previously called “Monte
Carlo”: the essence is a single function that executes in parallel on independent data sets,
with outputs that are eventually combined to form a single or small number of results. In
order to reflect this broader scope, we changed the name of the dwarf to “MapReduce”.
A second thrust for the future of databases was in genetics, exemplified by the widely
popular BLAST (Basic Local Alignment Search Tool) code. [Altschul et al 1990]
BLAST is a heuristic method used to find regions of DNA or protein sequences in a database that are similar to a query sequence. There are three main steps:
1. Compile a list of high-scoring words from the sequence
2. Scan database for hits from this list
3. Extend the hits to optimize the match
Although clearly important, BLAST did not extend our list of dwarfs.
For instance, modeling of liquids and liquid behavior for special effects in movies is typically done using particle methods such as Smoothed Particle Hydrodynamics (SPH)
[Monaghan 1982]. The rendering of the physical model is still done in OpenGL using
GPUs or software renderers, but the underlying model of the flowing shape of the liquid
requires the particle-based fluid model. There are several other examples where the desire to model physical properties in games and graphics maps onto the other dwarfs:
o Reverse kinematics requires a combination of sparse matrix computations and
graph traversal methods.
o Spring models, used to model objects that deform in response to pressure or impact, such as bouncing balls or Jell-O, use either sparse matrix or
finite-element models.
o Collision detection is a graph traversal operation, as are the octrees and k-d
trees employed for depth sorting and hidden surface removal.
o Response to collisions is typically implemented as a finite-state machine.
Hence, the surprising conclusion is that games and graphics did not extend the dwarfs
beyond the 13 identified above.
One encouraging lesson to learn from the GPUs and graphics software is that the APIs do
not directly expose the programmer to concurrency. OpenGL, for instance, allows the
programmer to describe a set of “vertex shader” operations in Cg (a specialized language
for describing such operations) that are applied to every polygon in the scene without
having to consider how many hardware fragment processors or vertex processors are
available in the hardware implementation of the GPU.
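As a loose analogy (our own sketch in Python, not OpenGL or Cg), the same idea can be expressed as an API that applies a per-element function to a collection without ever exposing how many workers execute it; the shade_vertex function and the data are made up for illustration.

```python
from multiprocessing import Pool

def shade_vertex(v):
    # Per-vertex work: the programmer describes only this pure function.
    x, y, z = v
    return (2.0 * x, 2.0 * y, 2.0 * z)

def shade_all(vertices):
    # The library decides how many processes to use; the caller never
    # specifies (or sees) the degree of hardware parallelism.
    with Pool() as pool:
        return pool.map(shade_vertex, vertices)

if __name__ == "__main__":
    print(shade_all([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]))
```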
8. Combinational Logic (e.g., encryption): Functions that are implemented with logical functions and stored state.
9. Graph Traversal (e.g., Quicksort): Visits many nodes in a graph by following successive edges. These applications typically involve many levels of indirection, and a relatively small amount of computation.
10. Dynamic Programming: Computes a solution by solving simpler overlapping subproblems. Particularly useful in optimization problems with a large set of feasible solutions.
11. Backtrack and Branch-and-Bound: Finds an optimal solution by recursively dividing the feasible region into subdomains, and then pruning subproblems that are suboptimal.
12. Construct Graphical Models: Constructs graphs that represent random variables as nodes and conditional dependencies as edges. Examples include Bayesian networks and Hidden Markov Models.
13. Finite State Machine: A system whose behavior is defined by states, transitions defined by inputs and the current state, and events associated with transitions or states.
Figure 4. Extensions to the original Seven Dwarfs.
Although 12 of the 13 Dwarfs possess some form of parallelism, finite state machines
(FSMs) look to be a challenge, which is why we made them the last dwarf. Perhaps FSMs
will prove to be embarrassingly sequential just as MapReduce is embarrassingly parallel.
If it is still important and does not yield to innovation in parallelism, that will be
disappointing, but perhaps the right long-term solution is to change the algorithmic
approach. In the era of multicore and manycore, popular algorithms from the sequential
computing era may fade in popularity. For example, if Huffman decoding proves to be
embarrassingly sequential, perhaps we should use a different compression algorithm that
is amenable to parallelism.
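To see why FSMs resist parallelization, consider this small table-driven sketch (ours, with an illustrative parity recognizer): each transition consumes the state produced by the previous one, so the loop carries a serial dependence over the entire input stream.

```python
def run_fsm(transitions, start_state, inputs):
    """Table-driven finite state machine.

    Each step needs the state computed by the previous step, so the loop
    carries a serial dependence over the whole input stream.
    """
    state = start_state
    for symbol in inputs:
        state = transitions[(state, symbol)]
    return state

# Recognize whether a bit string has an even or odd number of 1s.
transitions = {("even", "0"): "even", ("even", "1"): "odd",
               ("odd", "0"): "odd",  ("odd", "1"): "even"}
print(run_fsm(transitions, "even", "1101"))  # "odd"
```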
In any case, the point of the 13 Dwarfs is not to identify the low hanging fruit that are
highly parallel. The point is to identify the kernels that are the core computation and
communication for important applications in the upcoming decade, independent of the
amount of parallelism. To develop programming systems and architectures that will run
applications of the future as efficiently as possible, we must learn the limitations as well
as the opportunities. We note, however, that inefficiency on embarrassingly parallel code
could be just as plausible a reason for the failure of a future architecture as weakness on
embarrassingly sequential code.
More dwarfs may need to be added to the list. Nevertheless, we were surprised that we
only needed to add six dwarfs to cover such a broad range of important applications.
The two forms of distribution can be applied hierarchically. For example, a dwarf may be
implemented as a pipeline, where the computation for an input is divided into stages with
each stage running on its own spatial division of the processors. Each stage is time
multiplexed across successive inputs, but processing for a single input flows through the
spatial distribution of pipeline stages.
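A minimal sketch of the hierarchical idea, using Python threads and queues purely for illustration (not a proposal for a manycore runtime): each stage occupies its own worker, successive inputs are time-multiplexed through the stages, and in a manycore mapping each stage could in turn own its own spatial group of processors.

```python
import threading, queue

def stage(fn, inbox, outbox):
    # One worker per stage (its own spatial partition in a manycore mapping);
    # the stage is time-multiplexed across successive inputs.
    while True:
        item = inbox.get()
        if item is None:              # sentinel shuts the pipeline down
            outbox.put(None)
            return
        outbox.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
workers = [threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)),
           threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2))]
for w in workers:
    w.start()
for item in [1, 2, 3, None]:          # successive inputs flow through both stages
    q0.put(item)
results = []
while (r := q2.get()) is not None:
    results.append(r)
for w in workers:
    w.join()
print(results)                        # [4, 6, 8]
```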
These issues are pieces of a larger puzzle. What are effective ways to describe
composable parallel-code libraries? Can we write a library such that it encodes
knowledge about its ideal mapping when composed with others in a complete parallel
application? What if the ideal mapping is heavily dependent on input data that cannot be
known at compile time?
The common computing theme of RMS is “multimodal recognition and synthesis over
large and complex data sets” [Dubey 2005]. Intel believes RMS will find important
applications in medicine, investment, business, gaming, and in the home. Intel’s efforts in
Figure 5 show that Berkeley is not alone in trying to map the new frontier of computation onto underlying computational kernels in order to guide architectural research.
[Figure 5. Intel's RMS applications decomposed into more primitive kernels, including PDE solvers, LCP, NLP, FIMI, SVM classification and SVM training, IPM (LP, QP), K-Means, Level Set, Fast Marching Method, particle filtering, filter/transform, Monte Carlo, text indexing, and collision detection.]
1. Dense Linear Algebra (e.g., BLAS or MATLAB)
   - Embedded Computing: EEMBC Automotive: iDCT, FIR, IIR, Matrix Arith; EEMBC Consumer: JPEG, RGB to CMYK, RGB to YIQ; EEMBC Digital Entertainment: RSA, MP3 Decode, MPEG-2 Decode, MPEG-2 Encode, MPEG-4 Decode, MPEG-4 Encode; EEMBC Networking: IP Packet; EEMBC Office Automation: Image Rotation; EEMBC Telecom: Convolution Encode; EEMBC Java: PNG
   - General Purpose Computing: SPEC Integer: quantum computer simulation (libquantum), video compression (h264avc); SPEC Fl. Pt.: hidden Markov models (sphinx3)
   - Machine Learning: Support vector machines, principal component analysis, independent component analysis
   - Databases: Database hash accesses large contiguous sections of memory
   - Intel RMS: Body tracking, media synthesis, linear programming, K-means, support vector machines, quadratic programming, PDE: Face, PDE: Cloth*
2. Sparse Linear Algebra (e.g., SpMV, OSKI, or SuperLU)
   - Embedded Computing: EEMBC Automotive: Basic Int + FP, Bit Manip, CAN Remote Data, Table Lookup, Tooth to Spark; EEMBC Telecom: Bit Allocation; EEMBC Java: PNG
   - General Purpose Computing: SPEC Fl. Pt.: fluid dynamics (bwaves), quantum chemistry (gamess; tonto), linear program solver (soplex)
   - Machine Learning: Support vector machines, principal component analysis, independent component analysis
   - Graphics/Games: Reverse kinematics; spring models
   - Intel RMS: Support vector machines, quadratic programming, PDE: Face, PDE: Cloth*, PDE: Computational fluid dynamics
3. Spectral Methods (e.g., FFT)
   - Embedded Computing: EEMBC Automotive: FFT, iFFT, iDCT; EEMBC Consumer: JPEG; EEMBC Entertainment: MP3 Decode
   - Machine Learning: Spectral clustering
   - Graphics/Games: Texture maps
   - Intel RMS: PDE: Computational fluid dynamics, PDE: Cloth
4. N-Body Methods (e.g., Barnes-Hut, Fast Multipole Method)
   - General Purpose Computing: SPEC Fl. Pt.: molecular dynamics (gromacs, 32-bit; namd, 64-bit)
5. Structured Grids (e.g., Cactus or Lattice-Boltzmann Magnetohydrodynamics)
   - Embedded Computing: EEMBC Automotive: FIR, IIR; EEMBC Consumer: HP Gray-Scale, JPEG; EEMBC Digital Entertainment: MP3 Decode, MPEG-2 Decode, MPEG-2 Encode, MPEG-4 Decode, MPEG-4 Encode; EEMBC Office Automation: Dithering; EEMBC Telecom: Autocorrelation
   - General Purpose Computing: SPEC Fl. Pt.: quantum chromodynamics (milc), magnetohydrodynamics (zeusmp), general relativity (cactusADM), fluid dynamics (leslie3d-AMR; lbm), finite element methods (dealII-AMR; calculix), Maxwell's E&M equations solver (GemsFDTD), quantum crystallography (tonto), weather modeling (wrf2-AMR)
   - Graphics/Games: Smoothing; interpolation
6. Unstructured Grids (e.g., ABAQUS or FIDAP)
   - Machine Learning: Belief propagation
   - Graphics/Games: Global illumination
7. MapReduce (e.g., Monte Carlo)
   - General Purpose Computing: SPEC Fl. Pt.: ray tracer (povray)
   - Machine Learning: Expectation maximization
   - Intel RMS: MapReduce
12. Graphical Models
   - Embedded Computing: EEMBC Telecom: Viterbi Decode
   - General Purpose Computing: SPEC Integer: hidden Markov models (hmmer)
   - Machine Learning: Hidden Markov models
13. Finite State Machine
   - Embedded Computing: EEMBC Automotive: Angle To Time, Cache "Buster", CAN Remote Data, PWM, Road Speed, Tooth to Spark; EEMBC Consumer: JPEG; EEMBC Digital Entertainment: Huffman Decode, MP3 Decode, MPEG-2 Decode, MPEG-2 Encode, MPEG-4 Decode, MPEG-4 Encode; EEMBC Networking: QoS, TCP; EEMBC Office Automation: Text Processing; EEMBC Telecom: Bit Allocation; EEMBC Java: PNG
   - General Purpose Computing: SPEC Integer: text processing (perlbench), compression (bzip2), compiler (gcc), video compression (h264avc), network discrete event simulation (omnetpp), XML transformation (xalancbmk)
   - Graphics/Games: Response to collisions
Figure 6. Mapping of EEMBC, SPEC2006, Machine Learning, Graphics/Games, Databases, and Intel's RMS to the 13 Dwarfs. *Note that SVM, QP, PDE:Face, and PDE:Cloth may use either dense or sparse matrices, depending on the application.
4.0 Hardware
Now that we have given our views of applications and dwarfs for parallel computing in
the left tower of Figure 1, we are ready for examination of the right tower: hardware.
Section 2 above describes the constraints of present and future semiconductor processes,
but they also present many opportunities.
We organize our observations on hardware around the three components first used to
describe computers more than 30 years ago: processor, memory, and switch [Bell and
Newell 1970].
New Conventional Wisdom #4 in Section 2 states that the size of module that we can
successfully design and fabricate is shrinking. New Conventional Wisdoms #1 and #2 in
Section 2 state that power is proving to be the dominant constraint for present and future
generations of processing elements. To support these assertions we note that several of
the next generation processors, such as the Tejas Pentium 4 processor from Intel, were
canceled or redefined due to power consumption issues [Wolfe 2004]. Even
representatives from Intel, a company generally associated with the “higher clock-speed
is better” position, warned that traditional approaches to maximizing performance
through maximizing clock speed have been pushed to their limit [Borkar 1999]
[Gelsinger 2001]. In this section, we look past the inflection point to ask: What processor
is the best building block with which to build future multiprocessor systems?
There are numerous advantages to building future microprocessor systems out of smaller
processor building blocks:
• Parallelism is an energy-efficient way to achieve performance [Chandrakasan et al
1992].
• Many small cores give the highest performance per unit area for parallel codes.
• A larger number of smaller processing elements allows a finer-grained ability to
perform dynamic voltage scaling and power down.
• A small processing element is an economical element that is easy to shut down in
the face of catastrophic defects and easier to reconfigure in the face of large
parametric variation. The Cisco Metro chip [Eatherton 2005] adds four redundant
processors to each die, and Sun sells 4-processor, 6-processor, or 8-processor
versions of Niagara based on the yield of a single 8-processor design. Graphics
processors are also reported to be using redundant processors in this way, as is the
IBM Cell microprocessor for which only 7 out of 8 synergistic processors are
used in the Sony Playstation 3 game machine.
• A small processing element with a simple architecture is easier to design and
functionally verify. In particular, it is more amenable to formal verification
techniques than complex architectures with out-of-order execution.
• Smaller hardware modules are individually more power efficient and their
performance and power characteristics are easier to predict within existing
electronic design-automation design flows [Sylvester and Keutzer 1998]
[Sylvester and Keutzer 2001] [Sylvester et al 1999].
While the above arguments indicate that we should look to smaller processor
architectures for our basic building block, they do not indicate precisely what circuit size
or processor architecture will serve us the best. We argued that we must move away from
a simplistic “bigger is better” approach; however, that does not immediately imply that
“smallest is best”.
Different applications will present different tradeoffs between performance and energy
consumption. For example, many real-time tasks (e.g., viewing a DVD movie on a
laptop) have a fixed performance requirement for which we seek the lowest energy
implementation. Desktop processors usually seek the highest performance under a
maximum power constraint. Note that the design with the lowest energy per operation
might not give the highest performance under a power constraint, if the design cannot
complete tasks fast enough to exhaust the available power budget.
If all tasks were highly parallelizable and silicon area was free, we would favor cores
with the lowest energy per instruction (SPEC/Watt). However, we also require good
performance on less parallel codes, and high throughput per-unit-area to reduce die costs.
The challenge is to increase performance without adversely increasing energy per
operation or silicon area.
The effect of microarchitecture on energy and delay was studied in [Gonzalez and
Horowitz 1996]. Using energy-delay product (SPEC2/W) as a metric, the authors
determined that simple pipelining is significantly beneficial to delay while increasing
energy only moderately. In contrast, superscalar features adversely affected the energy-
delay product: the performance benefits did not outweigh the power overhead of the additional hardware. Instruction-level parallelism is limited, so microarchitectures
attempting to gain performance from techniques such as wide issue and speculative
execution achieved modest increases in performance at the cost of significant area and
energy overhead.
There is strong empirical evidence that we can achieve 1000 cores on a die when 30nm
technology is available. (As Intel has taped out a 45-nm technology chip, 30 nm is not so
distant in the future.) Cisco today embeds in its routers a network processor with 188
cores implemented in 130 nm technology. [Eatherton 2005] This chip is 18mm by 18mm,
dissipates 35W at a 250MHz clock rate, and produces an aggregate 50 billion instructions
per second. The individual processor cores are 5-stage Tensilica processors with very
small caches, and the size of each core is 0.5 mm2. About a third of the die is DRAM and
special purpose functions. Simply following scaling from Moore's Law would arrive at
752 processors in 45nm and 1504 in 30nm. Unfortunately, power may not scale down
with size, but we have ample room before we push the 150W limit of desktop or server
applications.
As Amdahl observed 40 years ago, the less parallel portion of a program can limit
performance on a parallel computer [Amdahl 1967]. Hence, one reason to have different
“sized” processors in a manycore architecture is to improve parallel speedup by reducing
the time it takes to run the less parallel code. For example, assume 10% of the time a
program gets no speedup on a 100-processor computer. Suppose that, to run the sequential code twice as fast, a single processor would need 10 times as many resources as a simple core, due to a bigger power budget, larger caches, a bigger multiplier, and so on. Could
it be worthwhile? Using Amdahl’s Law [Hennessy and Patterson 2007], the comparative
speedups of a homogeneous 100 simple processor design and a heterogeneous 91-
processor design relative to a single simple processor are:
Speedup_Homogeneous = 1 / (0.1 + 0.9/100) = 9.2 times faster
Speedup_Heterogeneous = 1 / (0.1/2 + 0.9/90) = 16.7 times faster
In this example, even if a single larger processor needed 10 times as many resources to
run twice as fast, it would be much more valuable than the 10 smaller processors it replaces.
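The arithmetic above is easy to check with a few lines (a sketch of Amdahl's Law with the numbers used in this example):

```python
def speedup(serial_fraction, serial_speedup, parallel_cores):
    """Amdahl's Law: time = serial/serial_speedup + parallel/parallel_cores."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction / serial_speedup +
                  parallel_fraction / parallel_cores)

print(round(speedup(0.1, 1, 100), 1))  # 9.2  : 100 simple cores
print(round(speedup(0.1, 2, 90), 1))   # 16.7 : one "2x" core plus 90 simple cores
```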
In addition to helping with Amdahl’s Law, heterogeneous processor solutions can show
significant advantages in power, delay, and area. Processor instruction-set configurability
[Killian et al 2001] is one approach to realizing the benefits of processor heterogeneity
while minimizing the costs of software development and silicon implementation, but this
requires custom fabrication of each new design to realize the performance benefit, and
this is only economically justifiable for large markets.
On the other hand, a single replicated processing element has many advantages; in
particular, it offers ease of silicon implementation and a regular software environment.
Managing heterogeneity in an environment with thousands of threads may make a
difficult problem impossible.
Will the possible power and area advantages of heterogeneous multicores win out versus
the flexibility and software advantages of homogeneous multicores? Alternatively, will
the processor of the future be like a transistor: a single building block that can be woven
into arbitrarily complex circuits? Alternatively, will a processor be more like a NAND
gate in a standard-cell library: one instance of a family of hundreds of closely related but
unique devices? In this section, we do not claim to have resolved these questions. Rather
our point is that resolution of these questions is certain to require significant research and
experimentation, and the need for this research is more imminent than industry’s
multicore multiprocessor roadmap would otherwise indicate.
The good news is that if we look inside a DRAM chip, we see many independent, wide
memory blocks. [Patterson et al 1997] For example, a 512 Mbit DRAM is composed of
hundreds of banks, each thousands of bits wide. Clearly, there is potentially tremendous
bandwidth inside a DRAM chip waiting to be tapped, and the memory latency inside a
DRAM chip is obviously much better than from separate chips across an interconnect.
In creating a new hardware foundation for parallel computing hardware, we should not
limit innovation by assuming main memory must be in separate DRAM chips connected
by standard interfaces. New packaging techniques, such as 3D stacking, might allow
vastly increased bandwidth and reduced latency and power between processors and
DRAM. Although we cannot avoid global communication in the general case with
thousands of processors and hundreds of DRAM banks, some important classes of
computation have almost entirely local memory accesses and hence can benefit from
innovative memory designs.
Whereas DRAM capacity kept pace with Moore’s Law by quadrupling capacity every
three years between 1980 and 1992, it slowed to doubling every two years between 1996
and 2002. Today we still use the 512 Mbit DRAM that was introduced in 2002.
Manycore designs will unleash a much higher number of MIPS in a single chip. Given
the current slow increase in memory capacity, this MIPS explosion suggests a much
larger fraction of total system silicon in the future will be dedicated to memory.
Although there has been research into statistical traffic models to help refine the design of
Networks-on-Chip [Soteriou et al 2006], we believe the 13 Dwarfs can provide even
more insight into communication topology and resource requirements for a broad array
of applications. Based on studies of the communication requirements of existing
massively concurrent scientific applications that cover the full range of dwarfs [Vetter
and McCracken 2001] [Vetter and Yoo 2002] [Vetter and Meuller 2002] [Kamil et al
2005], we make the following observations about the communication requirements in
order to develop a more efficient and custom-tailored solution:
• The collective communication requirements are strongly differentiated from point-to-
point requirements. Collective communication, requiring global communication,
tended to involve very small messages that are primarily latency bound. As the
number of cores increases, the importance of these fine-grained, smaller-than-cache-
line-sized, collective synchronization constructs will likely increase. Since latency is
likely to improve much more slowly than bandwidth (see CW #6 in Section 2), the
separation of concerns suggests adding a separate latency-oriented network dedicated
to the collectives; such networks have already appeared in prior MPPs. [Hillis and Tucker 1993]
[Scott 1996] As a recent example at large scale, the IBM BlueGene/L has a “Tree”
network for collectives in addition to a higher-bandwidth “Torus” interconnect for
point-to-point messages. Such an approach may be beneficial for chip interconnect
implementations that employ 1000s of cores.
• The sizes of most point-to-point messages are typically large enough that they remain
strongly bandwidth-bound, even for on-chip interconnects. Therefore, each point-to-point message would prefer a dedicated point-to-point pathway through the interconnect to minimize the chance of contention within the network.
The communication patterns observed thus far are closely related to the underlying
communication/computation patterns. Given just 13 dwarfs, the interconnect may need to target only a relatively limited set of communication patterns. This also suggests that the programming model should provide higher-level abstractions for describing those patterns.
One can use less complex circuit switches to provision dedicated wires that enable the
interconnect to adapt to the communication pattern of the application at runtime. A hybrid
design that combined packet switches with an optical circuit switch was proposed as a
possible solution to the problem at a macro scale. [Kamil et al 2005] [Shalf et al 2005].
However, at a micro-scale, hybrid switch designs that incorporate electrical circuit
switches to adapt the communication topology may be able to meet all of the needs of
future parallel applications. A hybrid circuit-switched approach can result in much
simpler and area-efficient on-chip interconnects for manycore processors by eliminating
unused circuit paths and switching capacity through custom runtime reconfiguration of
the interconnect topology.
4.4.1 Coherency
Conventional SMPs use cache-coherence protocols to provide communication between
cores, and mutual exclusion locks built on top of the coherency scheme to provide
synchronization. It is well known that standard coherence protocols are inefficient for
certain data communication patterns (e.g., producer-consumer traffic), but these
inefficiencies will be magnified by the increased core count and the vast increase in
potential core bandwidth and reduced latency of CMPs. More flexible or even
reconfigurable data coherency schemes will be needed to leverage the improved
bandwidth and reduced latency. An example might be large, on-chip, caches that can
flexibly adapt between private or shared configurations. In addition, real-time embedded
applications prefer more direct control over the memory hierarchy, and so could benefit
from on-chip storage configured as software-managed scratchpad memory.
should be allowed to update some shared mutable state, but typically, the order does not
matter. For producer-consumer synchronization, a consumer must wait until the producer
has generated a required value. Conventional systems implement both types of
synchronization using locks. (Barriers, which synchronize many consumers with many
producers, are also typically built using locks on conventional SMPs).
These locking schemes are notoriously difficult to program, as the programmer has to
remember to associate a lock with every critical data structure and to access only these
locks using a deadlock-proof locking scheme. Locking schemes are inherently non-
composable and thus cannot form the basis of a general parallel programming model.
Worse, these locking schemes are implemented using spin waits, which cause excessive
coherence traffic and waste processor power. Although spin waits can be avoided by
using interrupts, the hardware inter-processor interrupt and context switch overhead of
current operating systems makes this impractical in most cases.
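For illustration only (a sketch of ours using blocking condition variables, not a proposal from this report), producer-consumer synchronization can put the consumer to sleep until the value is ready rather than spin waiting:

```python
import threading

class Slot:
    """Producer-consumer handoff without spin waiting."""
    def __init__(self):
        self.cond = threading.Condition()
        self.value = None
        self.full = False

    def put(self, value):
        with self.cond:
            self.value, self.full = value, True
            self.cond.notify()            # wake a waiting consumer

    def get(self):
        with self.cond:
            while not self.full:          # block (no busy-wait) until produced
                self.cond.wait()
            self.full = False
            return self.value

slot = Slot()
threading.Thread(target=lambda: slot.put(42)).start()
print(slot.get())  # 42
```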
The Transactional Coherence & Consistency (TCC) scheme [Kozyrakis and Olukotun
2005] proposes to apply transactions globally to replace conventional cache-coherence
protocols, and to support producer-consumer synchronization through speculative
rollback when consumers arrive before producers.
Transactional memory is a promising but still active research area. Current software-only
schemes have high execution time overheads, while hardware-only schemes either lack
facilities required for general language support or require very complex hardware. Some
form of hybrid hardware-software scheme is likely to emerge, though more practical
experience with the use of transactional memory is required before even the functional
requirements for such a scheme are well understood.
In particular, recent work by Jon Berry et al. [Berry et al 2006] has demonstrated
that graph processing algorithms executing on a modest 4 processor MTA, which offers
hardware support for full-empty bits, can outperform the fastest system on the 2006
Top500 list – the 64k processor BG/L system.
4.5 Dependability
CW #3 in Section 2 states that the next generation of microprocessors will face higher
soft and hard error rates. Redundancy in space or in time is the way to make dependable
systems from undependable components. Since redundancy in space implies higher
hardware costs and higher power, we must use redundancy judiciously in manycore
designs. The obvious suggestion is to use single error correcting, double error detecting
(SEC/DED) encoding for any memory that has the only copy of data, and use parity
protection on any memory that just has a copy of data that can be retrieved from
elsewhere. Servers that have violated those guidelines have suffered dependability
problems [Hennessy and Patterson 2007].
For example, if the L1 data cache uses write through to an L2 cache with write back, then
the L1 data cache needs only parity while the L2 cache needs SEC/DED. The cost for
SEC/DED is a function of the logarithm of the word width, with 8 bits of SEC/DED for
64 bits of data being a popular size. Parity needs just one bit per word. Hence, the cost in
energy and hardware is modest.
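The logarithmic cost can be verified with a short calculation, assuming a standard Hamming single-error-correcting code extended with one extra parity bit for double-error detection (our sketch):

```python
def secded_check_bits(data_bits):
    """Smallest k with 2**k >= data_bits + k + 1 (Hamming SEC), plus 1 for DED."""
    k = 1
    while 2 ** k < data_bits + k + 1:
        k += 1
    return k + 1

for m in (8, 16, 32, 64, 128):
    print(m, secded_check_bits(m))   # 64 data bits -> 8 check bits
```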
Mainframes are the gold standard of dependable hardware design, and among the
techniques they use is repeated retransmission to try to overcome soft errors. For
example, they would retry a transmission 10 times before giving up and declaring to the
operating system that it uncovered an error. While it might be expensive to include such a
mechanism on every bus, there are a few places where it might be economical and
effective. For example, we expect a common design framework for manycore will be
globally asynchronous but locally synchronous per module, with unidirectional links and
queues connecting together these larger function blocks. It would be relatively easy to
include a parity checking and limited retransmission scheme into such a framework.
Virtual machines can also help make systems resilient to failures by running different programs
in different virtual machines (see Section 6.2). Virtual machines can move applications
from a failing processor to a working processor in a manycore chip before the hardware
stops. Virtual machines can help cope with software failures as well due to the strong
isolation they provide, making an application crash much less likely to affect others.
In addition to these seemingly obvious points, there are open questions for dependability
in the manycore era:
• What is the right granularity to check for errors? Whole processors, or even down
to registers?
• What is the proper response to an error? Retry, or decline to use the faulty
component in the future?
• How serious are errors? Do we need redundant threads to have confidence in the
results, or is a modest amount of hardware redundancy sufficient?
The combination of Moore’s Law and the Memory Wall led architects to design
increasingly complicated mechanisms to try to deliver performance via instruction level
parallelism and caching. Since the goal was to run standard programs faster without
change, architects were not aware of the increasing importance of performance counters
to compiler writers and programmers in understanding how to make their programs run
faster. Hence, the historically cavalier attitude towards performance counters became a
liability for delivering performance even on sequential processors.
The switch to parallel programming, where the compiler and the programmer are
explicitly responsible for performance, means that performance counters must become
first-class citizens. In addition to monitoring traditional sequential processor performance
features, new counters must help with the challenge of efficient parallel programming.
Section 7.2 below lists efficiency metrics to evaluate parallel programs, which suggests
performance counters to help manycore architectures succeed:
- To minimize remote accesses, identify and count the number of remote accesses
and amount of data moved in addition to local accesses and local bytes
transferred.
- To balance load, identify and measure idle time vs. active time per processor.
- To reduce synchronization overhead, identify and measure time spent in
synchronization per processor.
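As a small illustration of how such counters might be consumed by a tool, the sketch below derives the three metrics from hypothetical per-processor counter values (the counter names are ours, not a proposed hardware interface):

```python
def efficiency_report(counters):
    """Derive the metrics above from hypothetical per-processor counters."""
    remote = counters["remote_accesses"]
    local = counters["local_accesses"]
    active = counters["active_time"]          # one entry per processor
    idle = counters["idle_time"]
    sync = counters["sync_time"]
    return {
        "remote_access_fraction": remote / (remote + local),
        "load_imbalance": max(active) / (sum(active) / len(active)),
        "sync_overhead": sum(sync) / (sum(active) + sum(idle)),
    }

print(efficiency_report({
    "remote_accesses": 1_000, "local_accesses": 9_000,
    "active_time": [90, 70, 80, 60], "idle_time": [10, 30, 20, 40],
    "sync_time": [5, 8, 6, 9],
}))
```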
As power and energy are increasingly important, they need to be measured as well.
Circuit designers can create Joule counters for the significant modules from an energy
and power perspective. On a desktop computer, the leading energy consumers are
processors, main memory, caches, the memory controller, the network controller, and the
network interface card.
Given Joules and time, we can calculate Watts. Unfortunately, measuring time is getting
more complicated. Processors traditionally counted processor clock cycles, since the
clock rate was fixed. To save energy and power, some processors have adjustable
threshold voltages and clock frequencies. Thus, to measure time accurately, we now need
a “picosecond counter” in addition to a clock cycle counter.
While performance and energy counters are vital to the success of parallel processing, the
good news is that they are relatively easy to include. Our main point is to raise their
priority: do not include features that significantly affect performance or energy if
programmers cannot accurately measure their impact.
Figure 7 shows the current lack of agreement on the opacity/visibility tradeoff. It lists 10
examples of programming models for five critical parallel tasks that go from requiring
the programmer to make explicit decisions for all tasks for efficiency to models that make
all the decisions for the programmer for productivity. In between these extremes, the
programmer does some tasks and leaves the rest to the system.
The struggle is delivering performance while raising the level of abstraction. Going too
low may achieve performance, but at the cost of exacerbating the software productivity
problem, which is already a major hurdle for the information technology industry. Going
too high can reduce productivity as well, for the programmer is then forced to waste time
trying to overcome the abstraction to achieve performance.
All three goals are obviously important: efficiency, productivity, and correctness. It is
striking, however, that research from psychology has had almost no impact, despite the
obvious fact that the success of these models will be strongly affected by the human
beings who use them. Testing methods derived from the psychology research community
have been used to great effect for HCI, but are sorely lacking in language design and
software engineering. For example, there is a rich theory investigating the causes of
human errors, which is well known in the human-computer interface community, but
apparently it has not penetrated the programming model and language design community.
[Kantowitz and Sorkin 1983] [Reason 1990] There have been some initial attempts to
identify the systematic barriers to collaboration between the Software Engineering (SE)
and HCI community and propose necessary changes to the CS curriculum to bring these
fields in line, but there has been no substantial progress to date on these proposals.
[Seffah 2003] [Pyla et al 2004] We believe that integrating research on human
psychology and problem solving into the broad problem of designing, programming,
debugging, and maintaining complex parallel systems will be critical to developing
broadly successful parallel programming models and environments.
Not only do we ignore insights about human cognition in the design of our programming
models, we do not follow their experimental method to resolve controversies about how
people use them. That method is human-subject experiments, which is so widespread that
most campuses have committees that must be consulted before you can perform such
experiments. Subjecting our assumptions about the process of programming to formal
testing often yields unexpected results that challenge our intuition. [Mattson 1999]
A small example is a study comparing programming using shared memory vs. message
passing. These alternatives have been the subject of hot debates for decades, and there is
no consensus on which is better and when. A recent paper compared efficiency and
productivity of small programs written both ways for small parallel processors by novice
programmers. [Hochstein et al 2005] While this is not the final word on the debate, it
does indicate a path to try to resolve important programming issues. Fortunately, there
are a growing number of examples of groups that have embraced user studies to evaluate
the productivity of computer languages. [Kuo et al 2005] [Solar-Lezama et al 2005]
[Ebcioglu et al 2006]
Recent efforts in programming languages have focused on this problem and their
offerings have provided models where the number of processors is not exposed [Deitz
2005] [Allen et al 2006] [Callahan et al 2004] [Charles et al 2005]. While attractive, these
models have the opposite problem—delivering performance. In many cases, hints can be
provided to co-locate data and computation in particular memory domains. In addition,
because the program is not over-specified, the system has quite a bit of freedom in
mapping and scheduling that in theory can be used to optimize performance. Delivering
on this promise is, however, still an open research question.
5.3 Models must support a rich set of data sizes and types
Although the algorithms were often the same in embedded and server benchmarks in
Section 3, the data types were not. SPEC relies on single- and double-precision floating
point and large integer data, while EEMBC uses integer and fixed-point data that varies
from 1 to 32 bits. [EEMBC 2006] [SPEC 2006] Note that most programming languages
only support the subset of data types found originally in the IBM 360 announced 40 years
ago: 8-bit characters, 16- and 32-bit integers, and 32- and 64-bit floating-point numbers.
This leads to a relatively obvious observation: if the parallel research agenda inspires
new languages and compilers, they should allow programmers to specify at least the
following sizes (and types), as illustrated in the sketch after the list:
• 1 bit (Boolean)
• 8 bits (Integer, ASCII)
• 16 bits (Integer, DSP fixed point, Unicode)
• 32 bits (Integer, Single-precision floating point, Unicode)
• 64 bits (Integer, Double-precision floating point)
• 128 bits (Integer, Quad-precision floating point)
• Large integer (>128 bits) (Crypto)
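
To make the list concrete, the following is a minimal C sketch of ours (not taken from the benchmarks discussed above) showing how far standard languages reach into this range; the Boolean and exact-width integer types come from the C99 headers, and the 128-bit integer shown is a GCC/Clang extension rather than ISO C.

    #include <stdbool.h>   /* 1-bit logical values (stored in at least a byte)  */
    #include <stdint.h>    /* exact-width integer types                         */

    bool    flag        = true;          /* 1 bit of information                */
    int8_t  sample      = -5;            /* 8-bit integer / ASCII character     */
    int16_t fixed_q15   = 0x7FFF;        /* 16-bit DSP fixed point (Q15)        */
    int32_t count       = 100000;        /* 32-bit integer                      */
    float   position    = 1.5f;          /* 32-bit single-precision float       */
    int64_t nanoseconds = 123456789LL;   /* 64-bit integer                      */
    double  energy      = 2.5e10;        /* 64-bit double-precision float       */
    __int128 big        = 1;             /* 128-bit integer: GCC/Clang          */
                                         /* extension, not ISO C                */
    /* quad-precision floats and >128-bit "crypto" integers typically require   */
    /* libraries rather than built-in language support                          */

Everything beyond 64 bits already falls outside what most mainstream languages specify, which is precisely the gap the list above highlights.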
Beyond data sizes and types, rather than putting all our eggs in one basket, we think
programming models and architectures should support a variety of styles of parallelism
so that programmers can use the best choice when the opportunity arises. We believe that
list includes at least the following (illustrated in the sketch after the list):
1. Independent task parallelism is an easy-to-use, orthogonal style of parallelism
that should be supported in any new architecture. As a counterexample, older
vector computers could not take advantage of task-level parallelism despite
having many parallel functional units. Indeed, this was one of the key arguments
used against vector computers in the switch to massively parallel processors.
2. Word-level parallelism is a clean, natural match to some dwarfs, such as sparse
and dense linear algebra and unstructured grids. Examples of successful support
include array operations in programming languages, vectorizing compilers, and
vector architectures. Vector compilers would give hints at compile time about
why a loop did not vectorize, and non-computer scientists could then vectorize the
code because they understood the model of parallelism. It has been many years
since that could be said about a new parallel language, compiler, and architecture.
3. Bit-level parallelism may be exploited within a processor more efficiently in
power, area, and time than between processors. For example, the Secure Hash
Algorithm (SHA) for cryptography has significant parallelism, but in a form that
requires very low latency communication between operations on small fields.
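
The following minimal C sketch of ours (with hypothetical function names) illustrates the three styles: independent tasks that could be farmed out to separate cores, a word-level loop whose independent iterations a vectorizing compiler can map onto vector or SIMD hardware, and a bit-level operation that combines many Boolean values in a single instruction.

    #include <stddef.h>
    #include <stdint.h>

    /* Word-level parallelism: every iteration is independent, so a
       vectorizing compiler can execute many iterations per instruction. */
    void saxpy(size_t n, float alpha, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];   /* no loop-carried dependence */
    }

    /* Independent task parallelism: the two calls touch disjoint data
       and could run on different cores with no synchronization. */
    void process_both(size_t n,
                      const float *xa, float *ya,
                      const float *xb, float *yb)
    {
        saxpy(n, 0.5f, xa, ya);   /* task 1 */
        saxpy(n, 2.0f, xb, yb);   /* task 2, independent of task 1 */
    }

    /* Bit-level parallelism: one 64-bit XOR combines 64 Boolean values,
       the kind of parallelism exploited inside a single processor. */
    uint64_t combine_bits(uint64_t a, uint64_t b) { return a ^ b; }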
In addition to the styles of parallelism, we also have the issue of the memory model.
Because parallel systems usually contain memory distributed throughout the machine, the
question arises of the programmer’s view of this memory. Systems providing the illusion
of a uniform shared address space have been very popular with programmers. However,
scaling these to large systems remains a challenge. Memory consistency issues (relating
to the visibility and ordering of local and remote memory operations) also arise when
multiple processors can update the same locations, each likely having a cache. Explicitly
partitioned systems (such as MPI) sidestep many of these issues, but programmers must
deal with the low-level details of performing remote updates themselves.
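
As a small illustration of that low-level bookkeeping, the sketch below is ours (a hypothetical one-dimensional ghost-cell exchange) and shows how an MPI programmer explicitly orchestrates what a shared-address-space program could express as a simple read of a neighbor's data.

    #include <mpi.h>

    /* Exchange boundary ("ghost") cells with left and right neighbors in a
       1-D domain decomposition. local[1..n] are owned cells; local[0] and
       local[n+1] hold copies of the neighbors' edge cells. Ranks at the ends
       of the domain can pass MPI_PROC_NULL for the missing neighbor. */
    void exchange_boundary(double *local, int n, int left, int right)
    {
        /* send our first owned cell left; receive the right neighbor's
           first owned cell into our right ghost cell */
        MPI_Sendrecv(&local[1],     1, MPI_DOUBLE, left,  0,
                     &local[n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send our last owned cell right; receive the left neighbor's
           last owned cell into our left ghost cell */
        MPI_Sendrecv(&local[n],     1, MPI_DOUBLE, right, 1,
                     &local[0],     1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }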
Due to the limitations of existing compilers, peak performance may still require
handcrafting the program in languages like C, FORTRAN, or even assembly code.
Indeed, most scalable parallel codes have all data layout, data movement, and processor
synchronization manually orchestrated by the programmer. Such low-level coding is
labor intensive, and usually not portable to different hardware platforms or even to later
implementations of the same instruction set architecture.
In recent years, “Autotuners” [Bilmes et al 1997] [Frigo and Johnson 1998] [Frigo and
Johnson 2005] [Granlund et al 2006] [Im et al 2005] [Whaley and Dongarra 1998] have
gained popularity as an effective approach to producing high-quality portable scientific code.
Autotuners optimize a set of library kernels by generating many variants of a given kernel
and benchmarking each variant by running it on the target platform. The search process
effectively tries many or all optimization switches and hence may take hours to complete
on the target platform. The search needs to be performed only once, however, when the
library is installed. The resulting code is often several times faster than naive
implementations, and a single autotuner can be used to generate high-quality code for a
wide variety of machines. In many cases, the autotuned code is faster than vendor
libraries that were specifically hand-tuned for the target machine! This surprising result is
partly explained by the way the autotuner tirelessly tries many unusual variants of a
particular routine, often finding non-intuitive loop unrolling or register blocking factors
that lead to better performance.
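
The sketch below is ours and greatly simplified; it assumes just two hand-written variants of a matrix-vector kernel, whereas a real autotuner generates hundreds of variants mechanically (varying unrolling, blocking, and prefetching) and records the winner when the library is installed.

    #include <stdio.h>
    #include <time.h>

    #define N 1024
    static double a[N][N], x[N], y[N];   /* toy data, zero-initialized */

    /* Two variants of the same kernel with different unrolling factors. */
    static void matvec_unroll1(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += a[i][j] * x[j];
    }

    static void matvec_unroll4(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j += 4)        /* N is divisible by 4 */
                y[i] += a[i][j]   * x[j]   + a[i][j+1] * x[j+1]
                      + a[i][j+2] * x[j+2] + a[i][j+3] * x[j+3];
    }

    static double time_variant(void (*kernel)(void)) {
        clock_t start = clock();                  /* benchmark one variant */
        kernel();
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        void (*variants[])(void) = { matvec_unroll1, matvec_unroll4 };
        const char *names[] = { "unroll1", "unroll4" };
        int best = 0;
        double best_t = 1e30;
        for (int v = 0; v < 2; v++) {             /* the "search" loop */
            double t = time_variant(variants[v]);
            printf("%s: %.4f s\n", names[v], t);
            if (t < best_t) { best_t = t; best = v; }
        }
        printf("selected variant: %s\n", names[best]);  /* install-time pick */
        return 0;
    }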
For example, Figure 8 shows how performance varies by a factor of four with blocking
options on Itanium 2. The lesson from autotuning is that by searching many possible
combinations of optimization parameters, we can sidestep the problem of creating an
effective heuristic for optimization policy.
Figure 8. Sparse matrix performance on Itanium 2 for a finite element problem using block compressed
sparse row (BCSR) format [Im et al 2005]. Performance (color-coded, relative to the 1x1 baseline) is
shown for all block sizes that divide 8x8—16 implementations in all. These implementations fully unroll
the innermost loop and use scalar replacement for the source and destination vectors. You might reasonably
expect performance to increase relatively smoothly as r and c increase, but this is clearly not the case.
Platform: 900 MHz Itanium-2, 3.6 Gflop/s peak speed. Intel v8.0 compiler.
To reduce the search space, it may be possible to decouple the search for good data
layout and communication patterns from the search for a good compute kernel, especially
with the judicious use of performance models. The network and memory performance may
be characterized relatively quickly using test patterns, and the results then plugged into
performance models for the network to derive suitable code loops for the search over
compute kernels [Vadhiyar et al 2000].
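
As a sketch of the kind of model we have in mind (ours; the 5-microsecond latency and 1 GB/s bandwidth figures are placeholders, not measurements), a simple latency-bandwidth model calibrated once with test patterns can predict whether a candidate communication pattern is even worth benchmarking.

    #include <stdio.h>

    typedef struct {
        double alpha;   /* per-message latency, in seconds       */
        double beta;    /* per-byte transfer time, in seconds    */
    } NetModel;

    /* predicted time to move `bytes` bytes split into `nmsgs` messages */
    static double predict(NetModel m, double bytes, int nmsgs)
    {
        return nmsgs * m.alpha + bytes * m.beta;
    }

    int main(void)
    {
        NetModel net = { 5e-6, 1.0 / 1e9 };  /* placeholder: 5 us, 1 GB/s */
        double bytes = 64 * 1024;
        printf("1 message:   %.2f us\n", predict(net, bytes, 1)  * 1e6);
        printf("64 messages: %.2f us\n", predict(net, bytes, 64) * 1e6);
        return 0;
    }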
We believe the embedded and server computing worlds are colliding and merging as
embedded systems increase in functionality. For example, cell phones and game machines
now support multi-gigabyte file systems and complex Web browsers. In particular, cell
phone manufacturers, who previously resisted the installation of third-party software due
to reliability concerns, now realize that a standard API must be provided to allow user
extensibility, and this will require much more sophisticated and stable operating systems
and the hardware support these require.
Since embedded computers are increasingly connected to networks, we think they will be
increasingly vulnerable to viruses and other attacks. Indeed, the first personal computer
operating systems dropped protection because developers assumed a PC had only a single
user, which worked acceptably until PCs were connected to the Internet. Imagine how
much better our lives would be if security had been a PC OS priority before PCs joined
the Internet.
Virtual machines appear to be the future of server operating systems. For example, AMD,
Intel, and Sun have all modified their instruction set architectures to support virtual
machines. VMs have become popular in server computing for a few reasons: [Hennessy
and Patterson 2007]
• To provide a greater degree of protection against viruses and attacks;
• To cope with software failures by isolating a program inside a single VM so as
not to damage other programs; and
• To cope with hardware failures by migrating a virtual machine from one computer
to another without stopping the programs.
VMMs provide an elegant solution to the failure of conventional OSes to provide such
features. VMMs are also a great match to manycore systems, in that space sharing will be
increasingly important when running multiple applications on 1000s of processors.
While this vision is compelling, it is not binding: an application can run either a very thin
or a very thick OS on top of the VMM, or even multiple OSes simultaneously, to
accommodate different task needs. For example, a real-time OS and a best-effort OS
might run on different cores, or a minimal data-plane OS might run on multiple
high-density cores while a complex control-plane OS runs on a large general-purpose core.
Another area that deserves consideration is the addition of hardware structures that assist
language productivity features. For example, supporting transactional memory entirely in
software may be too slow to be useful, but can be made efficient with hardware support.
Other examples of this include support for garbage collection, fine-grained
synchronization (the Cray MTA), one-sided messaging, trace collection for debugging
[Xu et al 2003], and performance and energy counters to aid program optimization (see
Section 4.5).
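
For instance, the sketch below is ours (using the C11 atomics that postdate this report) and shows the kind of word-granularity synchronization that such hardware structures accelerate: a single atomic update and a small compare-and-swap retry loop that plays the role of a tiny transaction.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Fine-grained synchronization on a single word, the kind of operation
       that hardware support can make much cheaper than a coarse lock. */
    static atomic_int histogram[256];

    void add_sample(unsigned char bucket)
    {
        /* one atomic read-modify-write instead of locking the whole table */
        atomic_fetch_add_explicit(&histogram[bucket], 1, memory_order_relaxed);
    }

    /* A compare-and-swap loop: a software analogue of a tiny transaction
       that retries if another processor updated the word concurrently. */
    bool update_max(atomic_int *max_so_far, int candidate)
    {
        int observed = atomic_load(max_so_far);
        while (candidate > observed) {
            if (atomic_compare_exchange_weak(max_so_far, &observed, candidate))
                return true;    /* our update "committed" */
            /* observed now holds the newer value; loop and retry */
        }
        return false;
    }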
Moreover, since the power wall has forced us to concede the battle for maximum
performance of individual processing elements, we must aim at winning the war for
application efficiency through optimizing total system performance. This will require new
efficiency metrics for evaluating parallel architectures. As in the sequential world, there
are many “observables” from program execution (such as cache misses) that provide hints
about the overall efficiency of a running program. In addition to serial performance
issues, the evaluation of parallel system architectures will focus on:
• Minimizing remote accesses. When data is accessed by computational tasks that are
spread over different processing elements, we need to optimize its placement so that
communication is minimized.
• Load balance. The mapping of computational tasks to processing elements must be
performed in such a way that the elements are idle (waiting for data or synchronization)
as little as possible.
• Granularity of data movement and synchronization. Most modern networks perform
best for large data transfers. In addition, the latency of synchronization is high, so it is
advantageous to synchronize as little as possible (see the sketch after this list).
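
To make the granularity point concrete, the sketch below is ours (a hypothetical MPI example): it contrasts sending a block of contiguous matrix rows as many small messages with sending them as one large message, where the per-message latency in the first version is paid nrows times.

    #include <mpi.h>

    /* Fine-grained version: nrows separate messages, each paying the full
       per-message latency and software overhead. */
    void send_rows_naive(double *rows, int nrows, int ncols, int dest)
    {
        for (int i = 0; i < nrows; i++)
            MPI_Send(rows + i * ncols, ncols, MPI_DOUBLE,
                     dest, 0, MPI_COMM_WORLD);
    }

    /* Coarse-grained version: the rows are contiguous, so one large message
       amortizes the latency over the whole payload. */
    void send_rows_aggregated(double *rows, int nrows, int ncols, int dest)
    {
        MPI_Send(rows, nrows * ncols, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }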
Software design environments for embedded systems such as those described in [Rowen
and Leibson 2005] lend greater support to making these types of system-level decisions.
To help programmers progress towards these goals, we recommend hardware
counters that can measure these performance issues (see Section 4.6).
The conventional path for exploring new architectures for the last decade has been
simulation. We are skeptical that software simulation alone will provide sufficient
throughput for thorough evaluation of manycore systems architectures. Nor will per-
project hardware prototypes that require long development cycles be sufficient. The
development of these ad hoc prototypes will be far too slow to influence the decisions
that industry will need to make regarding future manycore system architectures. We need
a platform where feedback from software experiments on novel manycore architectures
running real applications with representative workloads will lead to new system
architectures within days, not years.
One such platform is RAMP (Research Accelerator for Multiple Processors), an FPGA-based
emulation system being developed by a multi-university group [Arvind et al 2005]
[Wawrzynek et al 2006]. While the idea for RAMP is just 18 months old, the group has made
rapid progress. It has financial support from NSF and several companies, and it has working
hardware based on an older generation of FPGA chips. Although RAMP will run perhaps 20
times more slowly than real hardware, it will emulate the relative speeds of the different
components accurately, so it can report correct performance as measured in the emulated
clock rate.
The group plans to develop three versions of RAMP to demonstrate what can be done:
• Cluster RAMP (“RAMP Blue”): Led by the Berkeley contingent, this version will
demonstrate a large-scale example using MPI for high-performance applications like the
NAS parallel benchmarks [Van der Wijngaart 2002] or TCP/IP for Internet applications
like search. An 8-board version will run the NAS benchmarks on 256 processors.
• Transactional Memory RAMP (“RAMP Red”): Led by the Stanford contingent,
this version will implement cache coherency using the TCC version of
transactional memory [Hammond et al 2004]. A single-board system runs 100
times faster than the Transactional Memory simulator.
• Cache-Coherent RAMP (“RAMP White”): Led by the CMU and Texas
contingents, this version will implement either ring-based or snoop-based cache
coherence.
All will share the same “gateware”—processors, memory controllers, switches, and so
on—as well as CAD tools, including co-simulation. [Chung et al 2006]
The goal is to make the “gateware” and software freely available on a web site, to
redesign the boards to use the recently announced Virtex 5 FPGAs, and finally to find a
manufacturer to sell them at low margin. The cost is estimated to be about $100 per
processor and the power about 1 watt per processor, yielding a 1000 processor system
that costs about $100,000, that consumes about one kilowatt, and that takes about one
quarter of a standard rack of space.
Such a system could become a standard platform for parallel research for many types of
researchers. If it creates a “watering hole effect” by bringing many disciplines together,
it could lead to innovation that will more rapidly develop successful answers to the seven
questions of Figure 1.
8.0 Conclusion
CWs # 1, 7, 8, and 9 in Section 2 say the triple whammy of the Power, Memory, and
Instruction Level Parallelism Walls has forced microprocessor manufacturers to bet their
futures on parallel microprocessors. This is no sure thing, as parallel software has an
uneven track record.
As a test case of the usefulness of these observations, one of the authors was invited
to a workshop that posed the question: what could you do if you had infinite memory
bandwidth? We approached the problem using the dwarfs, asking which were
computationally limited and which were limited by memory. Figure 9 below gives the
results of our quick study, which found that memory latency was a bigger problem than
memory bandwidth, and that some dwarfs were limited by neither.
Whether our answer was correct or not, it was exciting to have a principled framework to
rely upon in trying to answer such open and difficult questions.
This report is intended to be the start of a conversation about these perspectives. There is
an open, exciting, and urgent research agenda to flesh out the concepts represented by the
two towers and span of Figure 1. We invite you to participate in this important discussion
by visiting view.eecs.berkeley.edu.
Acknowledgments
During the writing of this paper, Krste Asanovic was visiting U.C. Berkeley, on
sabbatical from MIT. We’d like to thank the following who participated in at least some
of these meetings: Jim Demmel, Jike Chong, Armando Fox, Joe Hellerstein, Mike
Jordan, Dan Klein, Bill Kramer, Rose Liu, Lenny Oliker, Heidi Pan, and John
Wawrzynek. We’d also like to thank those who gave feedback on the first draft that we
used to improve this report: Shekhar Borkar, Yen-Kuang Chen, David Chinnery, Carole
Dulong, James Demmel, Srinivas Devadas, Armando Fox, Ricardo Gonzalez, Jim Gray,
Mark Horowitz, Wen-Mei Hwu, Anthony Joseph, Christos Kozyrakis, Jim Larus, Sharad
Malik, Grant Martin, Tim Mattson, Heinrich Meyr, Greg Morrisett, Shubhu Mukherjee,
Chris Rowen, and David Wood. Revising the report in response to their extensive
comments meant the final draft took 4 more months!
References
[ABAQUS 2006] ABAQUS finite element analysis home page. http://www.hks.com
[Adiletta et al 2002] M. Adiletta, M. Rosenbluth, D. Bernstein, G. Wolrich, and H. Wilkinson, “The Next
Generation of the Intel IXP Network Processors,” Intel Technology Journal, vol. 6, no. 3, pp. 6–18, Aug.
15, 2002.
[Allen et al 2006] E. Allen, V. Luchangco, J.-W. Maessen, S. Ryu, G. Steele, and S. Tobin-Hochstadt, The
Fortress Language Specification, 2006. Available at http://research.sun.com/projects/plrg/
[Altschul et al 1990] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic local
alignment search tool,” Journal Of Molecular Biology, vol. 215, no. 3, 1990, pp. 403–410.
[Alverson et al 1999] G.A. Alverson, C.D. Callahan, II, S.H. Kahan, B.D. Koblenz, A. Porterfield, B.J.
Smith, “Synchronization Techniques in a Multithreaded Environment,” US patent 6862635.
[Arnold 2005] J. Arnold, “S5: the architecture and development flow of a software configurable processor,”
in Proceedings of the IEEE International Conference on Field-Programmable Technology, Dec. 2005, pp.
121–128.
[Arvind et al 2005] Arvind, K. Asanovic, D. Chiou, J.C. Hoe, C. Kozyrakis, S. Lu, M. Oskin, D. Patterson,
J. Rabaey, and J. Wawrzynek, “RAMP: Research Accelerator for Multiple Processors - A Community
Vision for a Shared Experimental Parallel HW/SW Platform,” U.C. Berkeley technical report, UCB/CSD-
05-1412, 2005.
[Bader and Madduri 2006] D.A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-
First Search and st-connectivity on the Cray MTA-2,” in Proceedings of the 35th International Conference
on Parallel Processing (ICPP), Aug. 2006, pp. 523–530.
[Barnes and Hut 1986] J. Barnes and P. Hut, “A Hierarchical O(n log n) force calculation algorithm,”
Nature, vol. 324, 1986.
[Bell and Newell 1970] G. Bell and A. Newell, “The PMS and ISP descriptive systems for computer
structures,” in Proceedings of the Spring Joint Computer Conference, AFIPS Press, 1970, pp. 351–374.
[Bernholdt et al 2002] D.E. Bernholdt, W.R. Elsasif, J.A. Kohl, and T.G.W. Epperly, “A Component
Architecture for High-Performance Computing,” in Proceedings of the Workshop on Performance
Optimization via High-Level Languages and Libraries (POHLL-02), Jun. 2002.
[Berry et al 2006] J.W. Berry, B.A. Hendrickson, S. Kahan, P. Konecny, “Graph Software Development
and Performance on the MTA-2 and Eldorado,” presented at the 48th Cray Users Group Meeting,
Switzerland, May 2006.
[Bilmes et al 1997] J. Bilmes, K. Asanovic, C.W. Chin, J. Demmel, “Optimizing matrix multiply using
PHiPAC: a Portable, High-Performance, ANSI C coding methodology,” in Proceedings of the 11th
International Conference on Supercomputing, Vienna, Austria, Jul. 1997, pp. 340–347.
[Blackford et al 1996] L.S. Blackford, J. Choi, A. Cleary, A. Petitet, R.C. Whaley, J. Demmel, I. Dhillon,
K. Stanley, J. Dongarra, S. Hammarling, G. Henry, and D. Walker, “ScaLAPACK: a portable linear algebra
library for distributed memory computers - design issues and performance,” in Proceedings of the 1996
ACM/IEEE conference on Supercomputing, Nov. 1996.
[Borkar 1999] S. Borkar, “Design challenges of technology scaling,” IEEE Micro, vol. 19, no. 4, Jul.–Aug.
1999, pp. 23–29.
[Borkar 2005] S. Borkar, “Designing Reliable Systems from Unreliable Components: The Challenges of
Transistor Variability and Degradation,” IEEE Micro, Nov.–Dec. 2005, pp. 10–16.
[Brunel et al 2000] J.-Y. Brunel, K.A. Vissers, P. Lieverse, P. van der Wolf, W.M. Kruijtzer, W.J.M.
Smiths, G. Essink, E.A. de Kock, “YAPI: Application Modeling for Signal Processing Systems,” in
Proceedings of the 37th Conference on Design Automation (DAC ’00), 2000, pp. 402–405.
[Callahan et al 2004] D. Callahan, B.L. Chamberlain, and H.P. Zima. “The Cascade High Productivity
Language,” in Proceedings of the 9th International Workshop on High-Level Parallel Programming
Models and Supportive Environments (HIPS 2004), IEEE Computer Society, Apr. 2004, pp. 52–60.
[Chandrakasan et al 1992] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, “Low-power CMOS digital
design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, 1992, pp. 473–484.
[Chinnery 2006] D. Chinnery, Low Power Design Automation, Ph.D. dissertation, Department of Electrical
Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, 2006.
[Chung et al 2006] E.S. Chung, J.C. Hoe, and B. Falsafi, “ProtoFlex: Co-Simulation for Component-wise
FPGA Emulator Development,” in the 2nd Workshop on Architecture Research using FPGA Platforms
(WARFP 2006), Feb. 2006.
[Colella 2004] P. Colella, “Defining Software Requirements for Scientific Computing,” presentation, 2004.
[Cooley and Tukey 1965] J. Cooley and J. Tukey, “An algorithm for the machine computation of the
complex Fourier series,” Mathematics of Computation, vol. 19, 1965, pp. 297–301.
[Cristianini and Shawe-Taylor 2000] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector
Machines, Cambridge University Press, Cambridge, 2000.
[Dally and Towles 2001] W.J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection
Networks,” in Proceedings of the 38th Conference on Design Automation (DAC ’01), 2001, pp. 684–689.
[Dean and Ghemawat 2004] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters,” in Proceedings of OSDI ’04: 6th Symposium on Operating System Design and Implemention,
San Francisco, CA, Dec. 2004.
[Deitz 2005] S.J. Deitz, High-Level Programming Language Abstractions for Advanced and Dynamic
Parallel Computations, PhD thesis, University of Washington, Feb. 2005.
[Demmel et al 1999] J. Demmel, S. Eisenstat, J. Gilbert, X. Li, and J. Liu, “A supernodal approach to
sparse partial pivoting,” SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 3, 1999, pp. 720–755.
[Demmel et al 2002] J. Demmel, D. Bailey, G. Henry, Y. Hida, J. Iskandar, X. Li, W. Kahan, S. Kang, A.
Kapur, M. Martin, B. Thompson, T. Tung, and D. Yoo, “Design, Implementation and Testing of Extended
and Mixed Precision BLAS,” ACM Transactions on Mathematical Software, vol. 28, no. 2, Jun. 2002, pp.
152–205.
[Dubey 2005] P. Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,”
Technology@Intel Magazine, Feb. 2005.
[Duda and Hart 1973] R. Duda and P. Hart, Pattern Classification and Scene Analysis, New York: Wiley,
1973.
[Eatherton 2005] W. Eatherton, “The Push of Network Processing to the Top of the Pyramid,” keynote
address at Symposium on Architectures for Networking and Communications Systems, Oct. 26–28, 2005.
Slides available at: http://www.cesr.ncsu.edu/ancs/slides/eathertonKeynote.pdf
[Ebcioglu et al 2006] K. Ebcioglu, V. Sarkar, T. El-Ghazawi, J. Urbanic, “An Experiment in Measuring the
Productivity of Three Parallel Programming Languages,” in Proceedings of the Second Workshop on
Productivity and Performance in High-End Computing (P-PHEC 2005), Feb. 2005.
[FLUENT 2006] FIDAP finite element for computational fluid dynamics analysis home page.
http://www.fluent.com/software/fidap/index.htm
[Frigo and Johnson 1998] M. Frigo and S.G. Johnson, “FFTW: An adaptive software architecture for the
FFT,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’98), Seattle, WA, May 1998, vol. 3, pp. 1381–1384.
[Frigo and Johnson 2005] M. Frigo and S.G. Johnson, “The Design and Implementation of FFTW3,”
Proceedings of the IEEE, vol. 93, no. 2, 2005, pp. 216–231.
[Gelsinger 2001] P.P. Gelsinger, “Microprocessors for the new millennium: Challenges, opportunities, and
new frontiers,” in Proceedings of the International Solid State Circuits Conference (ISSCC), 2001, pp. 22–
25.
[Gonzalez and Horowitz 1996] R. Gonzalez and M. Horowitz, “Energy dissipation in general purpose
microprocessors,” IEEE Journal of Solid-State Circuits, vol. 31, no. 9, 1996, pp. 1277–1284.
[Goodale et al 2003] T. Goodale, G. Allen, G. Lanfermann, J. Masso, T. Radke, E. Seidel, and J. Shalf,
“The cactus framework and toolkit: Design and applications,” in Vector and Parallel Processing
(VECPAR’2002), 5th International Conference, Springer, 2003.
[Gordon et al 2002] M.I. Gordon, W. Thies, M. Karczmarek, J. Lin, A.S. Meli, A.A. Lamb, C. Leger, J.
Wong, H. Hoffmann, D. Maze, and S. Amarasinghe, “A Stream Compiler for Communication-Exposed
Architectures,” MIT Technology Memo TM-627, Cambridge, MA, Mar. 2002.
[Gries 2004] M. Gries, “Methods for Evaluating and Covering the Design Space during Early Design
Development,” Integration, the VLSI Journal, Elsevier, vol. 38, no. 2, Dec. 2004, pp. 131–183.
[Gries and Keutzer 2005] M. Gries and K. Keutzer (editors), Building ASIPs: The MESCAL Methodology,
Springer, 2005.
[Gursoy and Kale 2004] A. Gursoy and L.V. Kale, “Performance and Modularity Benefits of Message-
Driven Execution,” Journal of Parallel and Distributed Computing, vol. 64, no. 4, Apr. 2004, pp. 461–480.
[Harstein and Puzak 2003] A. Harstein and T. Puzak, “Optimum Power/Performance Pipeline Depth,” in
Proceedings of the 36th IEEE/ACM International Symposium on Microarchitecture (MICRO-36), Dec.
2003, pp. 117–126.
[Hauser and Wawrzynek 1997] J.R. Hauser and J. Wawrzynek, “GARP: A MIPS processor with a
reconfigurable coprocessor,” in Proceedings of the IEEE Symposium on FPGAs for Custom Computing
Machines, Apr. 1997, pp. 12–21.
[Hennessy and Patterson 2007] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative
Approach, 4th edition, Morgan Kauffman, San Francisco, 2007.
[Heo and Asanovic 2004] S. Heo and K. Asanovic, “Power-Optimal Pipelining in Deep Submicron
Technology,” in Proceedings of the International Symposium on Low Power Electronics and Design, 2004,
pp. 218–223.
[Herlihy and Moss 1993] M. Herlihy and J.E.B. Moss, “Transactional Memory: Architectural Support for
Lock-Free Data Structures,” in Proceedings of the 20th Annual International Symposium on Computer
Architecture (ISCA ’93), 1993, pp. 289–300.
[Hewitt et al 1973] C. Hewitt, P. Bishop, and R. Steiger, “A Universal Modular Actor Formalism for
Artificial Intelligence,” in Proceedings of the 1973 International Joint Conference on Artificial
Intelligence, 1973, pp. 235–246.
[Hillis and Tucker 1993] W.D. Hillis and L.W. Tucker, “The CM-5 Connection Machine: A Scalable
Supercomputer,” Communications of the ACM, vol. 36, no. 11, Nov. 1993, pp. 31–40.
[Hochstein et al 2005] L. Hochstein, J. Carver, F. Shull, S. Asgari, V.R. Basili, J.K. Hollingsworth, M.
Zelkowitz. “Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers,”
International Conference for High Performance Computing, Networking and Storage (SC'05). Nov. 2005.
[Hrishikesh et al 2002] M.S. Hrishikesh, D. Burger, N.P. Jouppi, S.W. Keckler, K.I. Farkas, and P.
Shivakumar, “The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays,” in Proceedings
of the 29th Annual International Symposium on Computer Architecture (ISCA ’02), May 2002, pp. 14–24.
[HPCS 2006] DARPA High Productivity Computer Systems home page. http://www.highproductivity.org/
[Im et al 2005] E.J. Im, K. Yelick, and R. Vuduc, “Sparsity: Optimization framework for sparse matrix
kernels,” International Journal of High Performance Computing Applications, vol. 18, no. 1, Spr. 2004, pp.
135–158.
[Kamil et al 2005] S.A. Kamil, J. Shalf, L. Oliker, and D. Skinner, “Understanding Ultra-Scale Application
Communication Requirements,” in Proceedings of the 2005 IEEE International Symposium on Workload
Characterization (IISWC), Austin, TX, Oct. 6–8, 2005, pp. 178–187. (LBNL-58059)
[Kantowitz and Sorkin 1983] B.H. Kantowitz and R.D. Sorkin, Human Factors: Understanding People-
System Relationships, New York, NY, John Wiley & Sons, 1983.
[Killian et al 2001] E. Killian, C. Rowen, D. Maydan, and A. Wang, “Hardware/Software Instruction set
Configurability for System-on-Chip Processors,” in Proceedings of the 38th Conference on Design
Automation (DAC '01), 2001, pp. 184–188.
[Koelbel et al 1993] C.H. Koelbel, D.B. Loveman, R.S. Schreiber, G.L. Steele Jr., and M.E. Zosel, The
High Performance Fortran Handbook, The MIT Press, 1993. ISBN 0262610949.
[Kozyrakis and Olukotun 2005] C. Kozyrakis and K. Olukotun, “ATLAS: A Scalable Emulator for
Transactional Parallel Systems,” in Workshop on Architecture Research using FPGA Platforms, 11th
International Symposium on High-Performance Computer Architecture (HPCA-11 2005), San Francisco,
CA, Feb. 13, 2005.
[Kumar et al 2003] R. Kumar, K.I. Farkas, N.P. Jouppi, P. Ranganathan, and D.M. Tullsen, “Single-ISA
Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction,” in Proceedings of
the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), Dec. 2003.
[Kuo et al 2005] K. Kuo, R.M. Rabbah, and S. Amarasinghe, “A Productive Programming Environment for
Stream Computing,” in Proceedings of the Second Workshop on Productivity and Performance in High-
End Computing (P-PHEC 2005), Feb. 2005.
[Kuon and Rose 2006] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” in
Proceedings of the International Symposium on Field Programmable Gate Arrays (FPGA ’06), Monterey,
California, USA, ACM Press, New York, NY, Feb. 22–24, 2006, pp. 21–30.
[Massalin 1987] H. Massalin, “Superoptimizer: a look at the smallest program,” in Proceedings of the
Second International Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS II), Palo Alto, CA, 1987, pp. 122–126.
[Mattson 1999] T. Mattson, “A Cognitive Model for Programming,” U. Florida whitepaper, 1999.
Available at
http://www.cise.ufl.edu/research/ParallelPatterns/PatternLanguage/Background/Psychology/CognitiveMod
el.htm
[Monaghan 1982] J.J. Monaghan, “Shock Simulation by the Particle Method SPH,” Journal of
Computational Physics, vol. 52, 1982, pp. 374–389.
[Mukherjee et al 2005] S.S. Mukherjee, J. Emer, and S.K. Reinhardt, “The Soft Error Problem: An
Architectural Perspective,” in Proceedings of the 11th International Symposium on High-Performance
Computer Architecture (HPCA-11 2005), Feb. 2005, pp. 243–247.
[Nyberg et al 2004] C. Nyberg, J. Gray, C. Koester, “A Minute with Nsort on a 32P NEC Windows
Itanium2 Server”, http://www.ordinal.com/NsortMinute.pdf, 2004.
[Pancake and Bergmark 1990] C.M. Pancake and D. Bergmark, “Do Parallel Languages Respond to the
Needs of Scientific Programmers?” IEEE Computer, vol. 23, no. 12, Dec. 1990, pp. 13–23.
[Patterson 2004] D. Patterson, “Latency Lags Bandwidth,” Communications of the ACM, vol. 47, no. 10,
Oct. 2004, pp. 71–75.
[Plishker et al 2004] W. Plishker, K. Ravindran, N. Shah, K. Keutzer, “Automated Task Allocation for
Network Processors,” in Network System Design Conference Proceedings, Oct. 2004, pp. 235–245.
[Pthreads 2004] IEEE Std 1003.1-2004, The Open Group Base Specifications Issue 6, section 2.9, IEEE
and The Open Group, 2004.
[Pyla et al 2004] P.S. Pyla, M.A. Perez-Quinones, J.D. Arthur, H.R. Hartson, “What we should teach, but
don’t: Proposal for cross pollinated HCI-SE Curriculum,” in Proceedings of ASEE/IEEE Frontiers in
Education Conference, Oct. 2004, pp. S1H/17–S1H/22.
[Rajwar and Goodman 2002] R. Rajwar and J. R. Goodman, “Transactional lock-free execution of lock-
based programs,” in Proceedings of the 10th International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS X), ACM Press, New York, NY, USA, Oct.
2002, pp. 5–17.
[Reason 1990] J. Reason, Human error, New York, Cambridge University Press, 1990.
[Rosenblum 2006] M. Rosenblum, “The Impact of Virtualization on Computer Architecture and Operating
Systems,” Keynote Address, 12th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS XII), San Jose, California, Oct. 23, 2006.
[Rowen and Leibson 2005] C. Rowen and S. Leibson, Engineering the Complex SOC : Fast, Flexible
Design with Configurable Processors, Prentice Hall, 2nd edition, 2005.
[Scott 1996] S.L. Scott. “Synchronization and communication in the T3E multiprocessor.” In Proceedings
of the Seventh International Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS VII), Cambridge, MA, Oct. 1996.
[Seffah 2003] A. Seffah, “Learning the Ropes: Human-Centered Design Skills and Patterns for Software
Engineers’ Education,” Interactions, vol. 10, 2003, pp. 36–45.
[Shah et al 2004a] N. Shah, W. Plishker, K. Ravindran, and K. Keutzer, “NP-Click: A Productive Software
Development Approach for Network Processors,” IEEE Micro, vol. 24, no. 5, Sep. 2004, pp. 45–54.
[Shah et al 2004b] N. Shah, W. Plishker, and K. Keutzer, “Comparing Network Processor Programming
Environments: A Case Study,” 2004 Workshop on Productivity and Performance in High-End Computing
(P-PHEC), Feb. 2004.
[Shalf et al 2005] J. Shalf, S.A. Kamil, L. Oliker, and D. Skinner, “Analyzing Ultra-Scale Application
Communication Requirements for a Reconfigurable Hybrid Interconnect,” in Proceedings of the 2005
ACM/IEEE Conference on Supercomputing (SC ’05), Seattle, WA, Nov. 12–18, 2005. (LBNL-58052)
[Singh et al 1992] J.P. Singh, W.-D. Weber, and A. Gupta, “SPLASH: Stanford Parallel Applications for
Shared-Memory,” in Computer Architecture News, Mar. 1992, vol. 20, no. 1, pages 5–44.
[Snir et al 1998] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete
Reference (Vol. 1). The MIT Press, 1998. ISBN 0262692155.
[Soteriou et al 2006] V. Soteriou, H. Wang, L.-S. Peh, “A Statistical Traffic Model for On-Chip
Interconnection Networks,” in Proceedings of the 14th IEEE International Symposium on Modeling,
Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS ’06), Sep. 2006, pp.
104–116.
[Srinivasan et al 2002] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P.N. Strenski, and
P.G. Emma, “Optimizing pipelines for power and performance,” in Proceedings of the 35th International
Symposium on Microarchitecture (MICRO-35), 2002, pp. 333–344.
[Sterling 2006] T. Sterling, “Multi-Core for HPC: Breakthrough or Breakdown?” Panel discussion at the
International Conference for High Performance Computing, Networking and Storage (SC'06), Nov.
2006. Slides available at http://www.cct.lsu.edu/~tron/SC06.html
[Sylvester et al 1999] D. Sylvester, W. Jiang, and K. Keutzer, “Berkeley Advanced Chip Performance
Calculator,” http://www.eecs.umich.edu/~dennis/bacpac/index.html
[Sylvester and Keutzer 1998] D. Sylvester and K. Keutzer, “Getting to the Bottom of Deep Submicron,” in
Proceedings of the International Conference on Computer-Aided Design, Nov. 1998, pp. 203–211.
[Sylvester and Keutzer 2001] D. Sylvester and K. Keutzer, “Microarchitectures for systems on a chip in
small process geometries,” Proceedings of the IEEE, Apr. 2001, pp. 467–489.
[Vahala et al 2005] G. Vahala, J. Yepez, L. Vahala, M. Soe, and J. Carter, “3D entropic lattice Boltzmann
simulations of 3D Navier-Stokes turbulence,” in Proceedings of the 47th Annual Meeting of the APS
Division of Plasma Physics, 2005.
[Vetter and McCracken 2001] J.S. Vetter and M.O. McCracken, “Statistical Scalability Analysis of
Communication Operations in Distributed Applications,” in Proceedings of the Eighth ACM SIGPLAN
Symposium on Principles and Practices of Parallel Programming (PPOPP), 2001, pp. 123–132.
[Vetter and Mueller 2002] J.S. Vetter and F. Mueller, “Communication Characteristics of Large-Scale
Scientific Applications for Contemporary Cluster Architectures,” in Proceedings of the 16th International
Parallel and Distributed Processing Symposium (IPDPS), 2002, pp. 272–281.
[Vetter and Yoo 2002] J.S. Vetter and A. Yoo, “An Empirical Performance Evaluation of Scalable
Scientific Applications,” in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, 2002.
[Vuduc et al 2006] R. Vuduc, J. Demmel, and K. Yelick, “OSKI: Optimized Sparse Kernel Interface,”
http://bebop.cs.berkeley.edu/oski/.
[Wawrzynek et al 2006] J. Wawrzynek, D. Patterson, M. Oskin, S.-L. Lu, C. Kozyrakis, J.C. Hoe, D. Chiou,
and K. Asanovic, “RAMP: A Research Accelerator for Multiple Processors,” U.C. Berkeley technical
report, 2006.
[Weinburg 2004] B. Weinberg, “Linux is on the NPU control plane,” EE Times, Feb. 9, 2004.
[Whaley and Dongarra 1998] R.C. Whaley and J.J. Dongarra, “Automatically tuned linear algebra
software,” in Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, 1998.
[Van der Wijngaart 2002] R.F. Van der Wijngaart, “NAS Parallel Benchmarks Version 2.4,” NAS
technical report, NAS-02-007, Oct. 2002.
[Wolfe 2004] A. Wolfe, “Intel Clears Up Post-Tejas Confusion,” VARBusiness, May 17, 2004.
http://www.varbusiness.com/sections/news/breakingnews.jhtml?articleId=18842588
[Woo et al 1995] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, “The SPLASH-2 Programs:
Characterization and Methodological Considerations,” in Proceedings of the 22nd International Symposium
on Computer Architecture (ISCA ’95), Santa Margherita Ligure, Italy, Jun. 1995, pp. 24–36.
[Wulf and McKee 1995] W.A. Wulf and S.A. McKee, “Hitting the Memory Wall: Implications of the
Obvious,” Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20–24.
[Xu et al 2003] M. Xu, R. Bodik, and M.D. Hill, “A ‘Flight Data Recorder’ for Enabling Full-System
Multiprocessor Deterministic Replay,” in Proceedings of the 30th Annual International Symposium on
Computer Architecture (ISCA ’03), 2003.