


COMPUTER ARCHITECTURE: CHALLENGES AND OPPORTUNITIES FOR THE NEXT DECADE

Tilak Agerwala and Siddhartha Chatterjee, IBM Research

IEEE Micro, May–June 2005

In an updated version of Agerwala's July 2004 keynote address at the International Symposium on Computer Architecture, the authors urge the computer architecture community to devise innovative ways of delivering continuing improvement in system performance and price-performance, while simultaneously solving the power problem.

Computer architecture forms the bridge between application needs and the capabilities of the underlying technologies. As application demands change and technologies cross various thresholds, computer architects must continue innovating to produce systems that can deliver needed performance and cost effectiveness. Our challenge as computer architects is to deliver end-to-end performance growth at historical levels in the presence of technology discontinuities. We can address this challenge by focusing on power optimization at all levels. Key levers are the development of power-optimized building blocks, deployment of chip-level multiprocessors, increasing use of accelerators and offload engines, widespread use of scale-out systems, and system-level power optimization.

Applications

To design leadership computer systems, we must thoroughly understand the nature of the workloads that such systems are intended to support. It is, therefore, worthwhile to begin with some observations on the evolving nature of workloads.

The computational and storage demands of technical, scientific, digital media, and business applications continue to grow rapidly, driven by finer degrees of spatial and temporal resolution, the growth of physical simulation, and the desire to perform real-time optimization of scientific and business problems. The following are some examples of such applications:

• A computational fluid dynamics (CFD) calculation on an airplane wing of a 512 × 64 × 256 grid, with 5,000 floating-point operations per grid point and 5,000 time steps, requires 2.1 × 10¹⁴ floating-point operations. Such a computation would take 3.5 minutes on a machine sustaining 1 trillion floating-point operations per second (1 Tflops). A similar CFD simulation of a full aircraft, on the other hand, would involve 3.5 × 10¹⁷ grid points, for a total of 8.7 × 10²⁴


floating-point operations. On the same 1-Tflops machine, this computation would require more than 275,000 years to complete.1

• Materials scientists currently simulate magnetic materials at the level of 2,000-atom systems, which require 2.64 Tflops of computational power and 512 Gbytes of storage. In the future, simulation of a full hard-disk drive will require about 30 Tflops of computational power and 2 Tbytes of storage (http://www.zurich.ibm.com/deepcomputing/parallel/projects_cpmd.html). Current investigation of electronic structures is limited to about 1,000 atoms, requiring 0.5 Tflops of computational power and 250 Gbytes of storage (http://www.zurich.ibm.com/deepcomputing/). Future investigations involving some 10,000 atoms will require 100 Tflops of computational power and 2.5 Tbytes of storage.

• Digital movies and special effects are yet another source of growing demand for computation. At around 10¹⁴ floating-point operations per frame and 50 frames per second, a 90-minute movie represents 2.7 × 10¹⁹ floating-point operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete this computation.

• Large amounts of computation are no longer the sole province of classical high-performance computing. There is an industry trend toward continual optimization—rapid and frequent modeling for timely business decision support in domains as diverse as inventory planning, risk analysis, workforce scheduling, and chip design. Such applications also contribute to the drive for improved performance and more cost-effective numerical computing.
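The arithmetic behind the estimates above is easy to check. The following C sketch (ours, added for illustration; the input figures come from the bullets, only the arithmetic is ours) recomputes the totals and run times quoted:

#include <stdio.h>

/* Back-of-the-envelope check of the workload estimates quoted above. */
int main(void)
{
    double tflops = 1e12;  /* a machine sustaining 1 Tflops */

    /* Wing CFD: 512 x 64 x 256 grid, 5,000 flops per grid point,
       5,000 time steps. */
    double wing = 512.0 * 64.0 * 256.0 * 5e3 * 5e3;
    printf("wing CFD:     %.2g flops, %.1f minutes at 1 Tflops\n",
           wing, wing / tflops / 60.0);

    /* Full aircraft: 3.5e17 grid points at the same per-point cost. */
    double aircraft = 3.5e17 * 5e3 * 5e3;
    printf("aircraft CFD: %.2g flops, %.0f years at 1 Tflops\n",
           aircraft, aircraft / tflops / (365.0 * 24 * 3600));

    /* Movie: 1e14 flops per frame, 50 frames/s, 90 minutes, rendered
       on 2,000 CPUs sustaining 1 Gflops each. */
    double movie = 1e14 * 50.0 * 90.0 * 60.0;
    printf("movie:        %.2g flops, %.0f days on 2,000 1-Gflops CPUs\n",
           movie, movie / (2000.0 * 1e9) / (24.0 * 3600));
    return 0;
}

The program reproduces the 3.5 minutes, roughly 275,000 years, and roughly 150 days cited in the examples.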
Applications continue to drive the growth of absolute performance and cost-performance at the historical level of an 80 percent compound annual growth rate (CAGR). This rate shows no foreseeable slowdown. If anything, application demands will grow even faster—perhaps a 90 to 100 percent CAGR—over the next few years. New workloads, such as delivery and processing of streaming and rich media, massively multiplayer online gaming, business intelligence, semantic search, and national security, are increasing the demand for numerical- and data-intensive computing.

Another growing workload characteristic is variability of demand for system resources, both across different workloads and within different temporal phases of a single workload. Figure 1 shows an example of variable and periodic behavior of instructions per cycle (IPC) in the SPEC2000 benchmarks bzip2 and art.2 Important business and scientific applications demonstrate similar variability. Designing computer architectures to adequately handle such variability is essential.

A third important characteristic of many workloads is that they are amenable to scaling out. A scale-out architecture is a collection of interconnected, modular, low-cost computers that work as a single entity to cooperatively provide applications, systems resources, and data to users. Scale-out platforms include clusters; high-density, rack-mounted blade systems; and massively parallel systems. On the other hand, conventional symmetric multiprocessor (SMP) systems are scale-up platforms.

Many important workloads are scaling out. Enterprise resource planning, customer relationship management, streaming media, Web serving, and science/engineering computations are prime examples of scale-out workloads. However, some commercially important workloads, such as online transaction processing, are difficult to scale out and continue to require the highest possible single-thread performance and symmetric multiprocessing. We will discuss later how different workload characteristics can drive computer systems to different design points.

As a community, computer architects must make a concerted effort to better characterize applications and environments to drive the design of future computing platforms. This effort should include developing a detailed understanding of applications' scale-out characteristics, developing opportunities for optimizing applications across all system stack levels, and developing tools to aid the migration of existing applications to future platforms.

Technology

Even as application demands for computational power continue to grow, silicon

technology is running into some major discontinuities as it scales to smaller feature sizes. When we study operating frequencies of microprocessors introduced over the last 10 years and projected frequencies for the next two to three years, it is clear that frequency will grow in the future at half the rate of the past decade. Although technology scaling delivers devices with ever-finer feature sizes, power dissipation is limiting chip-level performance, making it more difficult to ramp up operating frequency at historical rates. In the near future, therefore, chip-level performance must result from on-chip functional integration rather than continued frequency scaling.

[Figure 1. Variability of instructions per cycle (IPC) in SPEC2000: IPC over the entire execution for benchmark bzip2 (a) and a 1-second interval from 31 to 32 seconds for the art benchmark (b).2 (Copyright IEEE Press, 2003)]

CMOS device scaling rules, as initially stated by Dennard et al., predict that scaling of device geometry, process, and operating-environment parameters by a factor of α will result in higher density (~α²), higher speed (~α), lower switching power per circuit (~1/α²), and constant active-power density.3 In the past several years, however, in our pursuit of higher operating frequency, we have not scaled operating voltage as required by this scaling theory. As a result, power densities have grown with every CMOS technology generation.
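To make the scaling rules concrete, the following C sketch (our illustration, using the standard P ≈ C·V²·f model for active power and a representative α) contrasts ideal constant-field scaling with the fixed-voltage practice just described:

#include <stdio.h>

/* Constant-field (Dennard) scaling by a factor alpha: density grows
   as alpha^2, speed as alpha, and switching power per circuit falls
   as 1/alpha^2, so active-power density stays flat. If voltage is
   held fixed instead, power per circuit stays roughly constant and
   power density grows as alpha^2. Numbers are illustrative only. */
int main(void)
{
    double alpha = 1.4;                /* one technology generation */
    double c = 1.0 / alpha;            /* capacitance per circuit */
    double f = alpha;                  /* operating frequency */
    double density = alpha * alpha;    /* circuits per unit area */

    double p_ideal = c * (1.0 / alpha) * (1.0 / alpha) * f; /* V = 1/alpha */
    double p_fixed = c * 1.0 * 1.0 * f;                     /* V held at 1 */

    printf("power density, ideal scaling: %.2f (flat)\n",
           p_ideal * density);
    printf("power density, fixed voltage: %.2f (grows as alpha^2)\n",
           p_fixed * density);
    return 0;
}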
Dennard et al.'s scaling theory is based on considerations of active (or switching) power, the dominant source of power dissipation when CMOS device features were large relative to atomic dimensions. As CMOS device features shrink, additional sources of passive (or leakage) power dissipation are increasing in importance. There are two distinct forms of passive power:

• Gate leakage is a quantum tunneling effect in which electrons tunnel through the thin gate dielectric. This effect is exponential in gate voltage and oxide thickness.

• Subthreshold leakage is a thermodynamic phenomenon in which charge leaks between a MOSFET's source and drain. This effect increases as device channel lengths decrease and is also exponential in turn-off voltage, the difference between the device's power supply and threshold voltages.
The implication of the growth of passive power at the chip level is profound. Although scaling allows us to grow the number of devices on a chip, these devices are no longer "free"—that is, they leak significant amounts of passive power even when they are not performing useful computation or storing useful data.4

Chip-level power is already at the limits of air cooling. Liquid cooling is an option being increasingly explored, as are improvements in air cooling. But, in the end, all heat extraction and removal processes are inherently subexponential. They will thus limit the exponential growth of power density and total chip-level power that CMOS technology scaling is driving.

We faced a similar situation two decades ago, when the heat flux of bipolar technology was similarly exploding beyond the effective air-cooling limits of the day. However, there was a significant difference between that situation and the current one: We had CMOS available as a mature, low-power, high-volume technology then. We have no other technology with similar characteristics waiting in the wings today. Technologists are making many advances in materials and processes, but computer architects must find alternate designs within the confines of CMOS, the basic silicon technology.

CMOS scaling results in another dimension of complexity—it affects variability. The critical dimensions in our designs are scaling faster than our ability to control them, and manufacturing and environmental variations are becoming critical. Such variations affect both operating frequency and chip yield, and ultimately they adversely affect system cost and cost-performance. The implications of such variability are twofold: We can either use chip area to obtain performance, or we can design for variability. The industry is beginning to use both approaches to counteract the increasing variability of deep-submicron CMOS.

Challenge

We face a gap. We need 80-plus percent compound growth in system-level performance, while frequency growth has dropped to 15 to 20 percent because of power limitations. The computer architecture community's challenge, therefore, is to devise innovative ways of delivering continuing growth in system performance and price-performance while simultaneously solving the power problem. Rather than riding on the steady frequency growth of the past decade, system performance improvements will increasingly be driven by integration at all levels, together with hardware-software optimization. The shift in focus implied by this challenge requires us to optimize performance at all system stack levels (both hardware and software), constrained by power dissipation and reliability issues. Opportunities for optimization exist at both the chip and system levels.

Microprocessors and chip-level integration

Chip-level design space includes two major options: how we trade power and performance within a single processor pipeline (core), and how we integrate multiple cores, accelerators, and off-load engines on chip to boost total chip-level performance. The investigation of these issues requires appropriate methodologies for evaluating design choices. The following discussion illustrates such a methodology; readers should focus less on the specific numerical values of the results and more on how the results are derived.

The term power is often used loosely in discussions like this one. Depending on context, the term can be a proxy for various quantities, including energy, instantaneous power, maximum power, average power, power density, and temperature. These quantities are not interrelated in a simple manner, and the associated physical processes often have vastly different time constants. The evaluation methodology must accommodate the subtleties of the context.

Power-performance optimization in a single core

Let us consider an instruction set architecture (ISA) and a family of pipelined implementations of that ISA parameterized by the number of pipeline stages or, equivalently, the depth in fan-out of four (FO4) of each pipeline stage. (FO4 delay is the delay of one inverter driving four copies of an equal-sized inverter. The amount of logic and latch overhead per pipeline stage is often measured in terms of FO4 delay. This implies that deeper pipelines have smaller FO4 delays.) The following discussion also fixes the circuit family and assumes it to be one of the standard static CMOS circuit families.

Now consider the implementation family's behavior for some agreed-upon workload and metric of goodness. Figure 2 shows plots of such behavior. The number of pipeline stages increases from left to right along the x-axis, and the y-axis shows normalized behavior; the pipeline organization with the best value is defined as 1. The y-axis numbers came from detailed simulation.

[Figure 2. Power-performance trade-off in a single-processor pipeline: bips, bips/W, bips²/W, bips³/W, and IPC versus total FO4 per stage, each normalized to its optimal value.5 (Copyright IEEE Press, 2002.)]

The curve labeled "bips" (billions of instructions per second) plots performance for the SPEC2000 benchmark suite as a function of pipeline stages and shows an optimal design point of 10 FO4 per pipeline stage. Performance drops off for deeper pipelines as the effects of pipeline hazards, branch misprediction penalties, and cache and translation look-aside buffer misses play an increasing role.

The curve labeled "bips³/W" measures power-performance as a function of pipeline stages, again for SPEC2000. The term bips³ per watt is a proxy for (energy × delay²)⁻¹, a metric commonly used to quantify the power-performance efficiency of high-performance processors. There are two key differences between this curve and the performance-only curve:

• The optimal design point for the power-performance metric is at 18 FO4 per pipeline stage, corresponding to a shallower pipeline.

• The falloff past this optimal point is much steeper than in the case of the performance-only curve, demonstrating the fundamental superlinear trade-off between performance and power.

The power model for these curves incorporates active power only. If we added passive power to the model, the optimal power-performance design point would shift somewhat to the right of the 18 FO4 bips³/W design point (because combined active and passive power increases less rapidly with increasing pipeline depth).

Figure 3 plots the same information in a different manner, making the trade-off between power and performance visually obvious. Here, a family of pipeline designs shows up as a single curve, with performance decreasing from left to right on the x-axis and power increasing from bottom to top on the y-axis. FO4 numbers of individual design points appear on the curve.

We now focus on two example design points: the 12 FO4 design, which delivers high performance (at a high power cost), and the 18 FO4 design, which is optimal for the power-performance metric. Once these designs are committed to silicon and fabricated, it is possible to determine whether they meet the chip-level power budget, shown as the horizontal dashed line in the figure. Suppose that the 12 FO4 design exceeds the power budget, as the figure shows. Options exist, even at this stage of the process, to trade performance and power by reducing either the operating voltage (shown in the "Varying VDD and η" curve) or the operating frequency (the "Reducing f" curve). Either choice could return this design to an acceptable power budget, but at a significantly reduced level of single-core performance, once again emphasizing the superlinear trade-off between performance and power. On the other hand, suppose that the less-aggressive 18 FO4 design comes in slightly below the power budget. Applying VDD scaling would boost its performance, while staying within the power budget.

[Figure 3. Effect of pipeline depth on a single-core design: relative power P/P0 versus relative delay D/D0 for designs from 12 to 23 FO4, with curves for varying depth (fixed VDD and η), varying VDD and η (fixed depth), and reducing f (fixed depth, VDD, and η), plotted against a maximum power budget.6 (Copyright IEEE Press, 2004)]
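Returning to the bips³/W metric used above: the claim that it tracks (energy × delay²)⁻¹ takes one line to verify. In the short derivation below (ours, not from the cited papers), R is instruction throughput, D ∝ 1/R is average delay per instruction, and E is average energy per instruction, so average power is P = E·R:

\[
\frac{\mathrm{bips}^3}{W} = \frac{R^3}{P} = \frac{R^3}{E\,R} = \frac{R^2}{E} \propto \frac{1}{E\,D^2} = \left(\mathrm{energy} \times \mathrm{delay}^2\right)^{-1}
\]

By the same algebra, bips/W tracks 1/E and bips²/W tracks (E·D)⁻¹, so the bips-cubed form weights performance most heavily among the curves plotted in Figure 2.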

The preceding example illustrates the importance of incorporating power as an optimization target early in the design process along with the traditional performance metric. Although voltage- and frequency-scaling techniques can certainly correct small mismatches, selecting a pipeline structure on the basis of both performance and power is critical because a fundamental error here could lead to an irrecoverable post-silicon power-performance (hence, cost-performance) deficiency.

In addition to fixing and scaling pipeline depth appropriately to match technology trends, additional enhancements to increase power efficiency at the microarchitecture level are possible and desirable. The computer architecture research community has worked for several years on power-aware microarchitectures, developing various techniques for reducing active and passive power in cores.7-13 Table 1 shows some of these techniques.

Table 1. Power-aware microarchitectural techniques.

Active-power reduction:
• Clock gating
• Bandwidth gating
• Register port gating
• Asynchronously clocked pipelined units and globally asynchronous, locally synchronous architectures
• Power-efficient thread prioritization (simultaneous multithreading)

Active- and passive-power reduction:
• Simpler cores
• Voltage gating of unused functional units and cache lines
• Adaptive resizing of computing and storage resources
• Dynamic voltage and frequency scaling

Microarchitects are using an increasing number of these techniques in commercial microprocessors. However, many difficult problems remain open. For example:

• determining the proper logic-level granularity of applying clock-gating techniques to maximize power savings,

• reconciling pervasive clock gating's effect on cycle time,

• building in predictive support for voltage gating at the microarchitectural and compiler levels to minimize switching-unit overhead, and

• addressing increased design verification complexity in the presence of these techniques.
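As a rough illustration of how such techniques compose, the following C sketch (ours, with purely illustrative constants) models clock gating as a reduction of the activity factor in P_active ≈ a·C·V²·f, voltage gating as removal of idle units' leakage, and dynamic voltage and frequency scaling as a joint reduction of V and f:

#include <stdio.h>

/* First-order composition of power-aware levers from Table 1:
   active power ~ a * C * V^2 * f, where a is the activity factor that
   clock gating reduces; leakage ~ V * I_leak, which voltage gating
   removes for idle units; DVFS scales V and f together. */
static double core_power(double activity, double v, double f,
                         double leak_gated_fraction)
{
    const double cap = 1.0;      /* normalized switched capacitance */
    const double i_leak = 0.3;   /* normalized leakage current */
    double active = activity * cap * v * v * f;
    double leakage = (1.0 - leak_gated_fraction) * v * i_leak;
    return active + leakage;
}

int main(void)
{
    printf("baseline:               %.2f\n", core_power(1.0, 1.0, 1.0, 0.0));
    printf("clock gating (a = 0.5): %.2f\n", core_power(0.5, 1.0, 1.0, 0.0));
    printf("+ voltage-gated units:  %.2f\n", core_power(0.5, 1.0, 1.0, 0.5));
    printf("+ DVFS (V, f x 0.8):    %.2f\n", core_power(0.5, 0.8, 0.8, 0.5));
    return 0;
}

Note how the V² term makes DVFS the strongest single lever in this toy model, at the cost of frequency; the gating techniques save power without touching the operating point.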

Integrating multiple cores on a chip

With single-core performance improvements slowing, multiple cores per chip can help continue the exponential growth of chip-level performance. This solution exploits performance through higher chip, module, and system integration levels and optimizes for performance through technology, system, software, and application synergies.

IBM is a trailblazer in this space. The Power4 microprocessor, introduced in 2001 in 180-nm technology, comprised two cores per chip.14 The Power4+ microprocessor, introduced in 2003, was a remapping of Power4 to 130-nm technology. The Power5, introduced in 2004 in 130-nm technology, augments the two cores per chip with two-way simultaneous multithreading per core.15 The 389-mm² Power5 chip contains 276 million transistors, and the resulting systems lead in 34 industry-standard benchmarks. Increasingly, CPU manufacturers are moving to the multiple-cores-per-chip design.

Let's examine the trade-offs that arise in putting multiple cores on a chip. What types of cores should we integrate on a chip, and how many of them should we integrate? Of course, we'll leverage what we learned in our discussion of power-performance trade-offs for a single core. Figure 4 presents two extreme designs that illustrate the methodology: a complex, wide-issue, out-of-order core and a simple, narrow-issue, in-order core. Given the relative difference in size between these two organizations, we assume that we could integrate up to four of the complex cores or up to eight of the simple cores on a single chip. The curves show the power-performance trade-offs possible for each of these designs through variation of the pipeline depth, as discussed earlier.

[Figure 4. Power-performance trade-offs in integrating multiple cores on a chip: relative power versus relative chip throughput for one, two, or four wide-issue, out-of-order cores and for one, two, four, or eight narrow-issue, in-order cores. (Courtesy of V. Zyuban, "Power-Performance Optimizations across Microarchitectural and Circuit Domains," invited course at Swedish Intelect Summer School on Low-Power Systems on Chip, 23 to 25 Aug. 2004.)]

Several conclusions follow from the curves in Figure 4:
• For a given power budget (consider a horizontal line at 1.5), multiple simple cores produce higher throughput (aggregate chip-level performance). The simulations used to derive the curves show that this conclusion holds for both SMP workloads and independent threads.

• A complex core provides much higher single-thread performance than a simple core (compare the curves "1 wide-issue out-of-order core" and "1 narrow-issue in-order core"). Scaling up a simple core by reducing FO4 and/or raising VDD does not achieve this level of performance.

• Integrating a heterogeneous mixture of simple and complex cores on a chip might provide acceptable performance over a wider variety of workloads. As discussed later, such a solution has significant implications on programming models and software support.

These conclusions show that no single design for chip-level integration is optimal for all workloads. We can choose the appropriate design only by weighing the relative importance of single-thread performance and chip throughput for workloads that the systems are expected to run.
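The first conclusion already follows from the superlinear trade-off by itself. In the toy C model below (ours; it assumes core power grows roughly as the cube of single-thread performance, in the spirit of the bips³/W discussion), dividing a fixed chip power budget among more, slower cores raises aggregate throughput even as single-thread performance falls:

#include <math.h>
#include <stdio.h>

/* Toy model: within a pipeline family, take core power ~ perf^3
   (the superlinear trade-off), and split a fixed chip power budget
   across n identical cores. All quantities are relative and purely
   illustrative. Build with -lm. */
int main(void)
{
    double budget = 1.5;  /* relative chip power budget */
    for (int n = 1; n <= 8; n *= 2) {
        double per_core = budget / n;
        double perf = cbrt(per_core);  /* invert power ~ perf^3 */
        printf("%d core(s): single-thread perf %.2f, chip throughput %.2f\n",
               n, perf, n * perf);
    }
    return 0;
}

At a budget of 1.5, this model gives chip throughputs of roughly 1.1, 1.8, 2.9, and 4.6 for one, two, four, and eight cores, while single-thread performance falls from about 1.1 to about 0.6, echoing the qualitative ordering of the curves in Figure 4.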
Software issues

The systems just described depend on exploiting greater levels of locality and concurrency to gain performance within an acceptable power budget. Appropriate support from compilers, runtime systems, operating systems, and libraries is essential to delivering the hardware's potential at the application level. The fundamental technology discontinuities discussed earlier, which slow the rate of frequency growth, make such enablement, integration, and optimization even more important. Increasing software componentization, combined with vastly increased hardware system complexity, requires the development of higher-level abstractions,16,17 innovative compiler optimizations,17,18 and high-performance libraries19,20 to sustain the performance growth levels that applications demand.

Processor issues involved in exploiting instruction-level parallelism, such as code generation, instruction scheduling, and register allocation, are generally well understood.21 However, memory issues, such as latency hiding and locality enhancement, need further examination.22 A fundamental issue in exploiting thread-level parallelism is identifying the threads in a computation. Explicitly parallel languages such as Java make the programmer responsible for this determination. Sequential languages require either automatic parallelization techniques21,23 or OpenMP-like compiler directives (http://www.openmp.org). In addition, for more effective exploitation of shared resources, the operating system must provide richer functionality in terms of coscheduling and switching threads to cores.
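As a concrete illustration of the directive-based approach, the following minimal C sketch (ours, not from the article) lets one OpenMP pragma parallelize a sequential loop; the compiler and runtime handle thread creation, work partitioning, and the reduction:

#include <omp.h>
#include <stdio.h>

/* One OpenMP directive parallelizes the loop; the runtime partitions
   iterations across cores and combines the per-thread partial sums.
   Build with, for example, gcc -fopenmp. */
int main(void)
{
    enum { N = 1000000 };
    static double a[N], b[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        b[i] = 2.0;
        sum += a[i] * b[i];
    }
    printf("dot product %g using up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}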
Accelerators and offload engines

Special-purpose accelerators and offload engines offer an alternative means of increasing performance and reducing power. Systems will increasingly rely on accelerators for improved performance and cost-performance. Such engines help exploit concurrency and data formats in a specialized domain for which we have a thorough understanding of the bottlenecks and expected end-to-end gains. A lack of compilers, libraries, and software tools to enable acceleration is the primary bottleneck to more pervasive deployment of these engines.

Accelerators are not new, but in recent years several conditions have changed, making wider deployment feasible:

• Functionality that merits acceleration has become clearer. Examples include Transmission Control Protocol/Internet Protocol (TCP/IP) offloading, security, streaming and rich media, and collective communications in high-performance computing.

• In the past, accelerators had to compete against the increasing frequency, performance, and flexibility of general-purpose processors. The slowing of frequency growth makes accelerators more attractive.

• Increasing density allows the integration of accelerators on chips along with the CPU. This results in tighter coupling and finer-grained integration of the CPU and the accelerator, and allows the accelerator to benefit from the same technology advances as the CPU.

• Domain-specific programmable and reconfigurable accelerators have emerged,
replacing fixed-function, dedicated units. Examples include SIMD instruction set architecture extensions and FPGA-based accelerators.

Given the power issues discussed earlier, accelerators are not free. It is extremely important to achieve high utilization of an accelerator or to clock gate and power gate it effectively. Programming models, compilers, and tool chains for exploiting accelerators must continue to mature to make such specialized functions easier for application developers to use productively. The end-to-end benefit of deploying an accelerator critically depends on the workload and the ease of accessing the accelerator functionality from application code. Much work remains in this area; for example, deciding what functions to accelerate, understanding the system-level implications of integrating accelerators, developing the right tools (including libraries, profilers, and both link-time and dynamic compiler optimizations) for software enablement of accelerators, and developing industry-standard software interfaces and practices that support accelerator use. Given the potential for improvement, the judicious use of accelerators will remain an important part of system design methodology in the foreseeable future.

Scale-out

Scale-out provides the opportunity to meet performance demands beyond the levels that chip-level integration can provide. Moreover, given that the power-performance trade-off is superlinear, scale-out can provide the same computational performance for far less power. In other words, if an application is amenable to scale-out, we can execute it on a large enough collection of lower-power, lower-performance cores to satisfy the application's overall computational requirement with much less power dissipation at the system level.

An effective scale-out solution requires a balanced building block, which integrates high-bandwidth, low-latency memory and interconnects on chip to balance data transfer and computational capabilities. Figure 5 shows an example of such a building block, the chip used in the Blue Gene/L machine that IBM Research is building in collaboration with Lawrence Livermore National Laboratory.24 The relatively modest-sized chip (121 mm² in 130-nm technology) integrates two PowerPC 440 cores (PU0 and PU1) running at 700 MHz, two enhanced floating-point units (FPU0 and FPU1), L2 and L3 caches, communication interfaces (Torus, Tree, Eth, and JTAG) tightly coupled to the processors, and performance counters. This chip provides 5.6 Gflops of peak computation power for approximately 5 W of power dissipation. On top of this balanced hardware platform, an innovative hierarchically structured system software environment, standard programming models (Message-Passing Interface), and APIs for file systems, job scheduling, and system management result in a scalable, power-efficient system. Sixteen racks (32,768 processors) of the system sustained a Linpack performance of 70.72 Tflops on a problem size of 933,887, securing the top spot on the 24th Top500 list of supercomputers (http://www.top500.org).

[Figure 5. Integrated functionality on IBM's Blue Gene/L computer chip. It uses two enhanced floating-point units (FPUs) per chip; each FPU is two-way SIMD, and each SIMD FPU unit performs one fused multiply-add operation (equivalent to two floating-point operations) per cycle. This structure produces a peak computational rate of 8 floating-point operations per cycle, or 5.6 Gflops for a 700-MHz clock rate.]
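The Message-Passing Interface style mentioned above follows a simple pattern: each process computes a local piece, and collective operations combine the results across however many nodes the job spans. A minimal C sketch (ours; generic MPI, with nothing Blue Gene/L-specific) follows:

#include <mpi.h>
#include <stdio.h>

/* Each process computes a local piece; MPI_Allreduce combines the
   results across all processes in the job. Run with, for example,
   mpirun -np 8 ./a.out. */
int main(int argc, char **argv)
{
    int rank, size;
    double local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double)rank + 1.0;  /* stand-in for real local work */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d processes, combined result %.0f\n", size, total);
    MPI_Finalize();
    return 0;
}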

System-level power management

Power is clearly a limiting factor at the system level. It is now a principal design constraint across the computing spectrum. Although the preceding discussion has concentrated primarily on the CPU, the power densities of all computing components at all scales are increasing exponentially. Microprocessors, caches, dual in-line memory modules, and buses are each capable of trading power for performance. For example, today's DRAM designs have different power states, and both microprocessors and bus frequencies can be dynamically voltage- and frequency-scaled.

Table 2. Power distribution across system components.

Data center (power dissipation, percentage):
• Servers: 46
• Tape drives: 28
• Direct-access storage devices: 17
• Network: 7
• Other: 2

Midrange server (power dissipation, percentage):
• DRAM system: 30
• Processors: 28
• Fans: 23
• Level-three cache: 11
• I/O fans: 5
• I/O and miscellaneous: 3

The power distributions in Table 2 make it clear that we can ignore none of the power components. To effectively manage the range of components that use power, we must have a holistic, system-level view. Each level in the hardware/software stack needs to be aware of power consumption and must cooperate in an overall strategy for intelligent power management. To do this in real time, power-usage information must be available at all levels of the stack and managed via a global systems view. Dynamically rebalancing total power across system components is key to improving system-level performance. Achieving dynamic power balancing requires three enablers:

• System components must support multiple power-performance operating points. Sleep modes in disks are a mature example of this feature.

• The system's design must exploit the fact that it is extremely unlikely that all components will simultaneously operate at their maximum power dissipation points (while providing a safe fallback position for the rare occasion when this might actually happen).

• Researchers must develop algorithms, most likely at the operating system or workload manager level, to monitor and/or predict workloads' power-performance trade-offs over time. These algorithms must also dynamically rebalance maximum available power across components to achieve the required quality of service, while maintaining the health of the system and its components.
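The following C sketch (our schematic illustration, not any shipping algorithm) shows the shape of such a rebalancing policy: a manager periodically redistributes a fixed system power budget across components in proportion to predicted demand while guaranteeing each component a safe minimum:

#include <stdio.h>

/* Schematic power rebalancer: grant each component a safe minimum,
   then distribute the remaining system budget in proportion to
   predicted demand. A real manager would run this loop periodically
   and feed the caps to per-component actuators (DVFS, sleep states). */
#define NCOMP 4

int main(void)
{
    const char *name[NCOMP] = { "CPUs", "DRAM", "disks", "fans" };
    double demand[NCOMP]    = { 0.60, 0.30, 0.05, 0.05 }; /* predicted share */
    double budget  = 100.0;  /* total watts available (illustrative) */
    double floor_w = 5.0;    /* safe minimum per component */

    double spare = budget - NCOMP * floor_w;
    for (int i = 0; i < NCOMP; i++)
        printf("%-5s cap: %5.1f W\n", name[i], floor_w + spare * demand[i]);
    return 0;
}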
The inexorable growth in applications' requirements for performance and cost-performance improvements will continue at historical rates. At the same time, we face a technology discontinuity: the exponential growth in device and chip-level power dissipation and the consequent slowdown in frequency growth. As computer architects, our challenge over the next decade is to deliver end-to-end performance growth at historical levels in the presence of this discontinuity. We will need a maniacal focus on power at all architecture and design levels to bridge this gap, together with tight hardware-software integration across the system stack to optimize performance. The right building blocks (cores), chip-level integration (chip multiprocessors, systems on chips, and accelerators), scale-out and parallel computing, and system-level power management are key levers. The discontinuity is stimulating renewed interest in architecture and microarchitecture, and opportunities abound for innovative work to meet the challenge. MICRO

Acknowledgments

The work cited here came from multiple individuals and groups at IBM Research. We thank Pradip Bose, Evelyn Duesterwald, Philip Emma, Michael Gschwind, Hendrik Hamann, Lorraine Herger, Rajiv Joshi, Tom Keller, Bruce Knaack, Eric Kronstadt, Jaime Moreno, Pratap Pattnaik, William Pulleyblank,


Michael Rosenfield, Leon Stok, Ellen Yoffa, Victor Zyuban, and the entire Blue Gene/L team for the technical results and for helping us to coherently formulate the views discussed in this article. The Blue Gene/L project was developed in part through a partnership with the Department of Energy, National Nuclear Security Administration Advanced Simulation and Computing Program to develop computing systems suited to scientific and programmatic missions.

References

1. A. Jameson, L. Martinelli, and J.C. Vassberg, "Using Computational Fluid Dynamics for Aerodynamics: A Critical Assessment," Proc. 23rd Int'l Congress Aeronautical Sciences (ICAS 02), Int'l Council of Aeronautical Sciences, 2002.

2. E. Duesterwald, C. Cascaval, and S. Dwarkadas, "Characterizing and Predicting Program Behavior and Its Variability," Proc. 12th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 03), IEEE Press, 2003, pp. 220-231.

3. R.H. Dennard et al., "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, Oct. 1974, pp. 256-268.

4. International Technology Roadmap for Semiconductors, 2003 ed., http://public.itrs.net/Files/2003ITRS/Home2003.htm.

5. V. Srinivasan et al., "Optimizing Pipelines for Power and Performance," Proc. 35th ACM/IEEE Int'l Symp. Microarchitecture (MICRO-35), IEEE CS Press, 2002, pp. 333-344.

6. V. Zyuban et al., "Integrated Analysis of Power and Performance for Pipelined Microprocessors," IEEE Trans. Computers, vol. 53, no. 8, Aug. 2004, pp. 1004-1016.

7. P. Bose, "Architectures for Low Power," Computer Engineering Handbook, V. Oklobdzija, ed., CRC Press, 2001.

8. D. Brooks and M. Martonosi, "Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance," ACM Trans. Computer Systems, vol. 18, no. 2, May 2000, pp. 89-126.

9. A. Buyuktosunoglu et al., "Power Efficient Issue Queue Design," Power-Aware Computing, R. Melhem and R. Graybill, eds., Kluwer Academic, 2001.

10. D.M. Brooks et al., "Power-Aware Microarchitectures: Design and Challenges for Next-Generation Microprocessors," IEEE Micro, vol. 20, no. 6, Nov.-Dec. 2000, pp. 26-44.

11. K. Skadron et al., "Temperature-Aware Computer Systems: Opportunities and Challenges," IEEE Micro, vol. 23, no. 6, Nov.-Dec. 2003, pp. 52-61.

12. Z. Hu et al., "Microarchitectural Techniques for Power Gating of Execution Units," Proc. Int'l Symp. Low Power Electronics and Design (ISLPED 04), IEEE Press, 2004, pp. 32-37.

13. Z. Hu, S. Kaxiras, and M. Martonosi, "Let Caches Decay: Reducing Leakage Energy via Exploitation of Cache Generational Behavior," ACM Trans. Computer Systems, vol. 20, no. 2, May 2002, pp. 161-190.

14. J. Tendler et al., "Power4 System Microarchitecture," IBM J. Research & Development, vol. 46, no. 1, Jan. 2002, pp. 5-26.

15. R. Kalla, B. Sinharoy, and J. Tendler, "IBM Power5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, vol. 24, no. 2, Mar.-Apr. 2004, pp. 40-47.

16. W.W. Carlson et al., Introduction to UPC and Language Specification, tech. report CCS-TR-99-157, Lawrence Livermore Nat'l Lab., 1999.

17. Y. Dotsenko, C. Coarfa, and J. Mellor-Crummey, "A Multi-Platform Co-Array Fortran Compiler," Proc. 13th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 04), IEEE CS Press, 2004, pp. 29-40.

18. A.E. Eichenberger, P. Wu, and K. O'Brien, "Vectorization for SIMD Architectures with Alignment Constraints," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 04), ACM Press, 2004, pp. 82-93.

19. R.C. Whaley, A. Petitet, and J.J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, vol. 27, no. 1-2, Jan. 2001, pp. 3-35.

20. K. Yotov et al., "A Comparison of Empirical and Model-Driven Optimization," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 03), ACM Press, 2003, pp. 63-76.

21. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002.

22. X. Fang, J. Lee, and S.P. Midkiff, "Automatic Fence Insertion for Shared Memory Multiprocessing," Proc. 17th Ann. Int'l Conf. Supercomputing (ICS 03), ACM Press, 2003, pp. 285-294.

23. W. Blume et al., "Parallel Programming with Polaris," Computer, vol. 29, no. 12, Dec. 1996, pp. 78-82.

24. G. Almasi et al., "Unlocking the Performance of the BlueGene/L Supercomputer," Proc. Supercomputing 2004, IEEE Press, 2004.

Tilak Agerwala is vice president, systems, at IBM Research. His primary research area is high-performance computing systems. He is responsible for all of IBM's advanced systems research programs in servers and supercomputers. Agerwala has a PhD in electrical engineering from The Johns Hopkins University. He is a fellow of the IEEE and a member of ACM.

Siddhartha Chatterjee is a research staff member and manager at IBM Research. His research interests include all aspects of high-performance systems and software quality. Chatterjee has a PhD in computer science from Carnegie Mellon University. He is a senior member of IEEE and a member of ACM and SIAM.

Direct questions and comments about this article to Tilak Agerwala, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598; tilak@us.ibm.com.

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
