
The Future Evolution of High-Performance Microprocessors

Norm Jouppi
HP Labs

© 2005 Hewlett-Packard Development Company, L.P.


The information contained herein is subject to change without notice
Keynote Overview

• What, Why, How, and When of Evolution
• Microprocessor Environmental Constraints
• The Power Wall
• Power: From the Transistor to the Data Center
• The Evolution of Computer Architecture Ideas
• Summary

Disclaimer
• These views are mine, not necessarily HP’s
• “Never make forecasts, especially about the
future” – Samuel Goldwyn

What is Evolution
• Definition:
− a process of continuous change from a lower, simpler, or worse to a higher, more complex, or better state
− unfolding
Why Evolution
• Evolution is a Very Efficient Means of Building New Things
− Reuse, recycle
− Minimum of new stuff
− Much easier than revolution
When Evolution
• Can be categorized into Eras, Periods, etc.
Technology
• Usually evolution, not revolution
• Many revolutionary technologies have a bad history:
− Bubble memories
− Josephson junctions
− Anything but Ethernet
− Etc.
• Moore’s Law has been a key force driving evolution of the technology
Moore’s Law
• Originally presented in 1965
• Number of transistors per chip is 1.59^(year−1959) (originally 2^(year−1959))
• Classical scaling theory (Dennard, 1974)
− With every feature size scaling of n
• You get O(n²) transistors
• They run O(n) times faster

• Subsequently proposed:
− “Moore’s Design Law” (Law #2)
− “Moore’s Fab Law” (Law #3)

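As a rough illustration of the growth law quoted above, here is a small Python sketch (the function name and sample years are mine, not from the slides):

```python
# Transistors per chip per the slide's formula: 1.59**(year - 1959).
def transistors_per_chip(year, rate=1.59):
    return rate ** (year - 1959)

for year in (1965, 1985, 2005):
    print(year, f"{transistors_per_chip(year):.2e}")
```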
Microprocessor Efficiency Eras (Jouppi)
• Moore’s Law says number of transistors scaling as O(n²) and speed as O(n)
• Microprocessor performance should scale as O(n³)

[Figure: (log) performance versus number of transistors, with bands marking the N³, N², N¹, N⁰, and N⁻¹ efficiency eras.]
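A sketch of the era model behind the figure, assuming (as the following slides do) that device speed contributes a factor of n and the O(n²) extra transistors contribute the rest, depending on how efficiently they are used:

```python
# Era model: performance gain = device speed (n) times the useful work
# extracted from the O(n^2) extra transistors (exponent 2, 1, or 0).
def performance_gain(n, transistor_exponent):
    return n * n ** transistor_exponent

for exponent, era in ((2, "N^3"), (1, "N^2"), (0, "N^1")):
    print(f"{era} era: {performance_gain(2.0, exponent):.0f}x per 2x feature scaling")
```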
N³ Era
• n from device speed, n² from transistor count
• 4004 to 386
• Expansion of data path widths from 4 to 32 bits
• Basic pipelining
• Hardware support for complex ops (FP mul)
• Memory range and virtual memory
• Hard to measure performance
− Measured in MIPS
− But how many 4-bit ops = 64-bit FP mul?
• More than 1500!
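A back-of-envelope sketch of why the ratio runs past 1500; the breakdown and per-step costs below are my assumptions, the slide gives only the final figure:

```python
# Rough count of 4-bit ALU ops to emulate one 64-bit FP multiply.
LIMB_BITS = 4
SIG_BITS = 53                       # IEEE double-precision significand
limbs = -(-SIG_BITS // LIMB_BITS)   # 14 four-bit limbs
partials = limbs * limbs            # 196 limb-by-limb multiplies
ops_per_partial = 6                 # shift/add steps per 4x4 multiply (assumed)
accumulation = 2 * partials         # adds plus carry propagation (assumed)
exponent_round = 50                 # align, normalize, round (assumed)
print(partials * ops_per_partial + accumulation + exponent_round)  # ~1600
```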
N² Era
• n from device speed, only n from n² transistors
• 486 through Pentium III/IV
• Era of large on-chip caches
− Miss rate halves per quadrupling of cache size
• Superscalar issue
− 2X performance from quad issue?
• Superpipelining
− Diminishing returns

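The cache bullet above is the classic square-root rule of thumb; a minimal sketch (the base miss rate and sizes are illustrative, not data):

```python
# "Miss rate halves per quadrupling of cache size" == miss_rate ~ size**-0.5.
def miss_rate(size_kb, base_rate=0.10, base_size_kb=8):
    return base_rate * (size_kb / base_size_kb) ** -0.5

for size_kb in (8, 32, 128, 512):
    print(f"{size_kb:4d} KB: {miss_rate(size_kb):.4f}")  # 0.10, 0.05, 0.025, ...
```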
N Era
• n from device frequency, 1 from n² transistor count
• Very wide issue machines
− Little help to many apps
− Need SMT to justify
• Increase complexity and size too much → slowdown
− Long global wires
− Structure access times go up
− Time to market

Environmental Constraints on Microprocessor Evolution
• Several categories:
• Technology scaling
− Economics
− Devices
− Voltage scaling
• System-level constraints
− Power

Supply Economics: Moore’s Fab Law
• Fab cost is scaling as 1/feature size
• 90nm fabs currently cost 1-2 billion dollars
• Few can afford one by themselves (except Intel)
− Fabless startups
− Fab partnerships
• IBM/Toshiba/etc.
• Large foundries

• But number of transistors scales as 1/(feature size)²
− Transistors still getting cheaper
− Transistors still getting faster
Supply Economics: Moore’s Design Law
• The number of designers goes as O(1/feature)
• The 4004 had 3 designers, 10µm
• Recent 90nm microprocessors have ~300 designers
• Implication: design cost becomes very large
• Consolidation in # of viable microprocessors
• Microprocessor cores often reused
− Too much work to design from scratch
− But “shrinks and tweaks” becoming difficult in deep
submicron technologies

Devices
• Transistors historically get faster ∝ feature size
• But transistors getting much leakier
− Gate leakage (fix with high-K gate dielectrics)
− Channel leakage (dual-gate or vertical transistors?)
• Even CMOS has significant static power
− Power is roughly proportional to # transistors
− Static power approaching dynamic power
− Static power increases with chip temperature
• Positive feedback is bad

Voltage Scaling
• High performance MOS started out with 12V
• Max voltage scaling roughly as sqrt(feature)
• Power is ∝ CV²f
− Lower voltage can reduce the power as the square
− But speed goes down with lower voltage
• Current high-performance microprocessors have 1.1V supplies
• Reduced power (12/1.1)² = 119X over 24 years!

Limits of Voltage Scaling
• Beyond a certain voltage transistors don’t turn off
• ITRS projects minimum voltage of 0.7V in 2018
− Limited by threshold voltage variation, etc.
− But high-performance microprocessors are now 1.1V
• Only (1.1/0.7)² = 2.5X reduction left in the next 14 years!
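The arithmetic behind both voltage slides follows from P ∝ CV²f with C and f held fixed:

```python
# Dynamic power scales as V**2 (from P ~ C * V**2 * f, with C and f fixed).
def power_reduction(v_from, v_to):
    return (v_from / v_to) ** 2

print(f"12V -> 1.1V: {power_reduction(12.0, 1.1):.0f}x achieved so far")    # ~119x
print(f"1.1V -> 0.7V: {power_reduction(1.1, 0.7):.1f}x headroom remaining") # ~2.5x
```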
System-Level Power
• Per-chip power envelope is peaking
− Itanium 2 @ 130W => Montecito @ 100W
• 1U servers and blades reduce heat sink height
• Cost of power in and heat out over several years can equal the original system cost
• First class design constraint

Current Microprocessor Power Trends

[Figure: microprocessor power trends. Source: Shekhar Borkar, “Low Power Design Challenges for the Decade”, Proceedings of the 2001 Conference on Asia South Pacific Design Automation, IEEE.]
The Power Wall in Perspective

[Figure: two panels contrasting the Memory Wall with the Power Wall.]
Pushing Out the Power Wall
• “Waste Not Want Not” (Ben Franklin) for circuits
• Power-efficient microarchitecture
• Single threaded vs. throughput tradeoff

“Waste Not Want Not” for Circuits
• Lots of circuit ways to save power already in use
− Clock gating
− Multiple transistor thresholds
− Sleeper transistors
− Etc.
• Thus circuit-level power dissipation is already fairly
efficient
• What about architectural savings?

Power-Efficient Microarchitecture
• Off-chip memory reference costs a lot of power
− Thus drive to more highly-associative on-chip caches
− Limits to how far this can go
• Lots of other similar examples
• PACS workshop/conference proceedings
− No silver bullet
− Limited benefits from each technique

Single-Threaded / Throughput Tradeoff
• Reducing transistors/core can yield higher MIPS/W
• Move back towards N3 scaling efficiency
• Thus, expect trend to simpler processors
− Narrower issue width
− Shallower pipelines
− More in-order processors or smaller OOO windows
• “Back to the Future”
• But this gives lower single-thread performance
− Can’t simplify core too quickly
• Tradeoffs on what to eliminate not always obvious
− Examples: Speculation, Multithreading
Speculation
• Is speculation uniformly bad?
− No
• Example: branch prediction
− Executing down wrong path wastes performance & power
− Stalling at every branch would hurt performance & power
• Circuits leak when not switching
• Predicting a branch can save power
− Plus predictor memory takes less power/area than logic
• But current amount of speculation seems excessive
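To make the branch-prediction example concrete, here is a minimal two-bit saturating-counter predictor; an illustrative sketch, not a description of any shipping design:

```python
# Two-bit saturating counters: a cheap table lookup replaces both
# wrong-path execution and stalling (idle circuits still leak).
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.counters = [2] * entries  # 0-3; values >= 2 predict taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        delta = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + delta))

bp = TwoBitPredictor()
for _ in range(3):
    bp.update(0x400, taken=True)  # a loop branch trains its counter
print(bp.predict(0x400))          # True
```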
Multithreading
• SMT is very useful in wide-issue OOO machines
− Good news: increases power efficiency
− Bad news: Wide issue still power inefficient
• Multithreading useful even in simple machines
− During cache misses transistors still leak
• Not enough time to gate power
− May only need 2- or 3-thread non-simultaneous MT
Microarchitectural Research Implications
• Processors need to get simpler (not more complicated) to become more efficient
• More complicated microarchitecture mechanisms with little benefit are not needed
Recap: Possible Future Evolution in Terms of P = CV²f
• Formula doesn’t change, but terms do:
− Power
• Already at system limits, better if lower
− Voltage
• Only a factor of 2.5 left over the next 14 years
− Clock frequency
• Not scaling with device speed
• Fewer pipestages, higher efficiency
• Move from 30 to 10 stages from 90nm to 32nm
− Capacitance (~area) needs to be repartitioned for higher system performance
Number of Cores Per Die
• Scale processor complexity as 1/feature
− Number of cores will go up dramatically as n³
− From 1 core at 90nm to 27 per die at 30nm!
− But can we efficiently use that many cores?

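The arithmetic behind the 27x claim; a sketch, since real core counts also depend on what else shares the die:

```python
# Cores per die ~ n**3: transistor density grows as n**2 while each
# core's complexity is scaled down as 1/n (n = old_feature / new_feature).
def cores_per_die(old_nm, new_nm, base_cores=1):
    n = old_nm / new_nm
    return base_cores * n ** 3

print(cores_per_die(90, 30))  # 27.0
```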
The Coming Golden Age
• We are on the cusp of the golden age of parallel
programming
− Every major application needs to become parallel
− “Necessity is the mother of invention”
• How to make use of many processors effectively?
− Some apps inherently parallel (Web, CFD, TPC-C)
− Some applications are very hard to parallelize
− Parallel programming is a hard problem
• Can’t use large amounts of speculation in software
− Just moves power inefficiency to a higher level

Important Architecture Research Problems
Not an exhaustive list:
• How to wire up CMPs:
− Memory hierarchy?
− Transactional memory?
− Interconnection networks?
• How to build cores:
− Heterogeneous CMP design
− Conjoined core CMPs
• Power: From the transistor through the datacenter

Important Architecture Research #2
Many system level problems:
• Cluster interconnects
• Manageability
• Availability
• Security
• Hardware/software tradeoffs (ASPLOS) increasingly important
Power: From the Transistor through the Data Center
• CACTI 4.0
• Heterogeneous CMP
• Conjoined cores
• Data center power management

CACTI 4.0
• Collaboration with David Tarjan of UVa
• Now with leakage power model
• Also includes:
− Much improved technology scaling
− Circuit updates
− Scaling to much larger cache sizes
− Options: serial tag/data, high speed access, etc.
− Parameterized data and tag widths (including no tags)
− Beta version web interface:
http://analog.cs.virginia.edu:81/cacti/index.y
− Full release coming soon ([email protected])

Heterogeneous Chip Multiprocessors

• A.k.a. asymmetric, non-homogeneous, synergistic, …
• Single ISA vs. Multiple ISA
• Many benefits:
− Power
− Throughput
− Mitigating Amdahl’s Law
• Open questions
− Best mix of heterogeneity

Potential Power Benefits
• Grochowski et al., ICCD 2004:
− Asymmetric CMP => 4-6X
− Further voltage scaling => 2-4X
− More gating => 2X
− Controlling speculation => 1.6X

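Treating those factors as independent and multiplicative (my assumption, for illustration only) gives the combined range:

```python
# Compound the quoted ranges, assuming the techniques stack multiplicatively.
factors = [(4, 6), (2, 4), (2, 2), (1.6, 1.6)]  # CMP, voltage, gating, speculation
low = high = 1.0
for lo_f, hi_f in factors:
    low *= lo_f
    high *= hi_f
print(f"combined potential: {low:.0f}x to {high:.0f}x")  # ~26x to ~77x
```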
Clock Gating

[Figure: chip floorplan under clock gating. Transistors still leak, and busses are still long, slow, and waste power. Key: yellow blocks are inactive, red are active, orange are global busses.]
Animation for CPU scale-down
• Combine the most frequently active blocks into a small heterogeneous core instead
• Power down the large core when not needed
Performance benefits from heterogeneity

[Figure: chart showing up to a 63% performance benefit.]
Mitigating Amdahl’s Law
• Amdahl’s law: Parallel speedups limited by serial
portions
• Annavaram et al., ISCA 2005:
− Basic idea
• Use big core for serial portions
• Use many small cores for parallel portions
− Prototype built from discrete 4-way SMP
• Ran one socket at regular voltage, other 3 at low voltage
• 38% wall clock speedup using fixed power budget

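Amdahl’s Law in one line; the 10% serial fraction below is an illustrative value, not a number from the paper:

```python
# Amdahl's Law: speedup = 1 / (s + (1 - s) / n) for serial fraction s.
def amdahl_speedup(serial_fraction, n_cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

for n in (4, 16, 64, 1024):
    print(f"{n:4d} cores: {amdahl_speedup(0.10, n):.2f}x")  # caps near 10x
```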
Conjoined Cores

• Ideally, provide for the peak needs of any single thread without multiplying the cost by the number of cores
[Figure: baseline usage resource versus peak usage resource.]
Conjoined-core Architecture Benefits
[Figure: per-core area savings (core area plus per-core crossbar area) versus percent performance degradation, for INT and FP workloads.]
• Performance degradation even less if < 8 threads!
Datacenter Power Optimizations
• A rack of blades can dissipate 30kW or more!

Data Center Power is Front Page News

Data center thermal management at HP
• Power density also becoming a significant reliability issue
• Use 3D modeling and measurement to understand thermal
characteristics of data centers
− Saving 25% today
• Exploit this for dynamic resource allocation and provisioning
• Chandrakant Patel et al.

A Theory on the Evolution of Computer Architecture Ideas
• Conjecture: There is no such thing as a bad or
discredited idea in computer architecture, only
ideas at the wrong level, place, or time
• Reuse, Recycle
• Evolution vs. Revolution
• Examples:
− SIMD
− Dataflow
− HLL architectures
− Capabilities
− Vectors
SIMD
• Efficient way of computing proposed in late ’60s
− 64-PE Illiac-IV operational in 1972
− Difficult to program

• Intel MMX introduced in 1996
− Efficient use of larger word sizes for small parallel data
− Used in libraries or specialized code
− Only small increase in hardware (<1%)
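A sketch of the MMX idea in Python: one 64-bit word carries eight 8-bit lanes, and a single operation updates all lanes at once (this mimics MMX’s PADDB; the operand values are illustrative):

```python
# MMX-style packed add: treat a 64-bit word as eight independent 8-bit lanes.
def paddb(a, b):
    out = 0
    for lane in range(8):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        out |= ((x + y) & 0xFF) << (8 * lane)  # per-lane wraparound add
    return out

print(hex(paddb(0x0102030405060708, 0x1010101010101010)))
# 0x1112131415161718: all eight lanes added in one "operation"
```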
Dataflow
• Widely researched in late ’80s
• Lots of problems:
− Complicated machines
− New programming model
• Out-of-order (OOO) execution machines developed in the ’90s are a limited form of dataflow
− Issue queues issue instructions when operands ready
− But keeps same instruction set architecture
− Keeps same programming model
− Still complex internally

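A miniature of the dataflow idea as it survives in OOO issue queues, an illustrative sketch: instructions fire as soon as their operands are ready, not in program order.

```python
# Issue-when-ready: each "instruction" is (destination, source operands).
instructions = [
    ("r1", ()),            # load, no register inputs
    ("r2", ("r1",)),       # depends on r1
    ("r3", ()),            # independent load
    ("r4", ("r2", "r3")),  # joins both chains
]
ready, pending, cycle = set(), list(instructions), 0
while pending:
    issued = [inst for inst in pending if all(s in ready for s in inst[1])]
    print(f"cycle {cycle}: issue {[d for d, _ in issued]}")
    ready.update(d for d, _ in issued)
    pending = [inst for inst in pending if inst not in issued]
    cycle += 1
# cycle 0 issues r1 and r3 together, although r2 precedes r3 in program order
```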
High-Level Language Architectures
• Popular research in 1970’s and early 1980’s
• “Closing the semantic gap”
• A few attempts to implement in hardware failed
− Machines interpreted HLLs in hardware
• Now we have Java interpreted in software
− JIT compilers
− Portability
− Modest performance loss doesn’t matter for some apps

Capabilities
• Popular research topic in 1970’s
• Intel 432 implemented capabilities in hardware
− Every memory reference
− Poor performance
• Died with 432
• Security increasingly important
• Capabilities at the file-system level combined with standard memory protection models
− Much less overhead
• Virtual machine support
Lessons from “Idea Evolution Theory”
• Don’t be afraid to look at past ideas that didn’t work out as a source of inspiration
• Some ideas may be successful if reinterpreted at a different level, place, or time, when they can be made more evolutionary than revolutionary

Conclusions
• The Power Wall is here
− It is the wall to worry about
− It has dramatic implications for the industry
• From the transistor through the data center

• We need to reclaim past efficiencies
− Microarchitectural complexity needs to be reduced
• The power wall will usher in the “Golden Age of Parallel Programming”
• Much open research in architecture
• It may be time to reexamine some previously discarded architecture ideas
Thanks
