The Future Evolution of High-Performance Microprocessors
Norm Jouppi
HP Labs
Disclaimer
• These views are mine, not necessarily HP's
• “Never make forecasts, especially about the
future” – Samuel Goldwyn
What is Evolution
• Definition:
− a process of continuous change from a lower, simpler, or worse state to a higher, more complex, or better state
− unfolding
Why Evolution
• Evolution is a very efficient means of building new things
− Reuse, recycle
− Minimum of new stuff
− Much easier than revolution
When Evolution
• Can be categorized into Eras, Periods, etc.
Technology
• Usually evolution, not revolution
• Many revolutionary technologies have a bad history:
− Bubble memories
− Josephson junctions
− Anything but Ethernet
− Etc.
• Moore’s Law has been a key force driving evolution
of the technology
Moore’s Law
• Originally presented in 1965
• Number of transistors per chip is 1.59^(year-1959) (originally 2^(year-1959))
• Classical scaling theory (Dennard, 1974)
− With every feature size scaling of n
• You get O(n²) transistors
• They run O(n) times faster
• Subsequently proposed:
− “Moore’s Design Law” (Law #2)
− “Moore’s Fab Law” (Law #3)
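To get a feel for how differently the two quoted growth rates compound, here is a minimal sketch (my own illustration, normalized to 1 in 1959, not figures from the talk):

```python
# Transistor count predicted by the two growth rates quoted above,
# normalized to 1 in 1959. Illustrative only.
def predicted_growth(year, rate):
    return rate ** (year - 1959)

for year in (1965, 1975, 1985, 1995, 2005):
    print(f"{year}: 2^n -> {predicted_growth(year, 2.0):.2g}, "
          f"1.59^n -> {predicted_growth(year, 1.59):.2g}")
```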
Microprocessor Efficiency Eras (Jouppi)
• Moore's Law says the number of transistors scales as O(n²) and speed as O(n)
• Microprocessor performance should therefore scale as O(n³)
[Figure: (log) performance vs. number of transistors, divided into efficiency eras labeled N³, N², N¹, N⁰, and N⁻¹]
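A minimal sketch of the eras in the figure, assuming a 2x linear feature shrink per generation (my framing of the argument, not the original chart):

```python
# Assumed linear shrink factor s per generation; transistors grow as s^2, speed as s.
s = 2.0
transistors = s ** 2
frequency = s
eras = {
    "N^3 era": transistors * frequency,  # every extra transistor converts into performance
    "N^2 era": s * frequency,            # only ~n of the n^2 extra transistors help
    "N   era": frequency,                # extra transistors add little; device speed only
}
for era, speedup in eras.items():
    print(f"{era}: ~{speedup:.0f}x performance per {s:.0f}x feature shrink")
```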
N³ Era
• n from device speed, n² from transistor count
• 4004 to 386
• Expansion of data path widths from 4 to 32 bits
• Basic pipelining
• Hardware support for complex ops (FP mul)
• Memory range and virtual memory
• Hard to measure performance
− Measured in MIPS
− But how many 4-bit ops = 64-bit FP mul?
• More than 1500!
N² Era
• n from device speed, only n from n² transistors
• 486 through Pentium III/IV
• Era of large on-chip caches
− Miss rate halves per quadrupling of cache size
• Superscalar issue
− 2X performance from quad issue?
• Superpipelining
− Diminishing returns
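A small sketch of the miss-rate rule of thumb in the cache bullet above (the anchor values are assumptions of mine, not measurements from the talk):

```python
import math

def miss_rate(cache_kb, base_rate=0.10, base_kb=8):
    """Miss rate halves per quadrupling of cache size, i.e. miss rate ~ 1/sqrt(size).
    base_rate and base_kb are made-up anchor values for illustration."""
    return base_rate * math.sqrt(base_kb / cache_kb)

for kb in (8, 32, 128, 512):
    print(f"{kb:4d} KB cache -> ~{miss_rate(kb) * 100:.2f}% miss rate")
```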
N Era
• n from device frequency, only 1 from the n² transistor count
• Very wide issue machines
− Little help to many apps
− Need SMT to justify
• Increase complexity and size too much → slowdown
− Long global wires
− Structure access times go up
− Time to market
Environmental Constraints on
Microprocessor Evolution
• Several categories:
• Technology scaling
− Economics
− Devices
− Voltage scaling
• System-level constraints
− Power
Supply Economics: Moore’s Fab Law
• Fab cost is scaling as 1/feature size
• 90nm fabs currently cost 1-2 billion dollars
• Few can afford one by themselves (except Intel)
− Fabless startups
− Fab partnerships
• IBM/Toshiba/etc.
• Large foundries
Supply Economics: Moore’s Design Law
• The number of designers goes as O(1/feature)
• The 4004 had 3 designers (10 µm process)
• Recent 90nm microprocessors have ~300 designers
• Implication: design cost becomes very large
• Consolidation in # of viable microprocessors
• Microprocessor cores often reused
− Too much work to design from scratch
− But “shrinks and tweaks” becoming difficult in deep
submicron technologies
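The two data points above line up reasonably well with the O(1/feature) claim; a quick check of the arithmetic (same numbers as the slide):

```python
# Designers ~ O(1/feature size): the 4004 at 10 um with 3 designers
# vs. ~300 designers for a recent 90 nm microprocessor.
feature_ratio = 10_000 / 90     # 10 um expressed in nm, divided by 90 nm: ~111x
designer_ratio = 300 / 3        # ~100x
print(f"feature shrink ~{feature_ratio:.0f}x, design-team growth ~{designer_ratio:.0f}x")
```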
Devices
• Transistor speed has historically improved in proportion to feature-size scaling
• But transistors getting much leakier
− Gate leakage (fix with high-K gate dielectrics)
− Channel leakage (dual-gate or vertical transistors?)
• Even CMOS has significant static power
− Power is roughly proportional to # transistors
− Static power approaching dynamic power
− Static power increases with chip temperature
• Positive feedback is bad
Voltage Scaling
• High-performance MOS started out with 12V supplies
• Max voltage has scaled roughly as sqrt(feature size)
• Power ∝ CV²f
− Lower voltage reduces power quadratically
− But speed goes down with lower voltage
• Current high-performance microprocessors have 1.1V supplies
• Power reduced by (12/1.1)² ≈ 119X over 24 years!
Limits of Voltage Scaling
• Below a certain supply voltage, transistors don't turn off cleanly
• ITRS projects minimum voltage of 0.7V in 2018
− Limited by threshold voltage variation, etc.
− But high-performance microprocessors are now 1.1V
• Only (1.1/0.7)² ≈ 2.5X reduction left in the next 14 years!
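The headline numbers on this slide and the previous one fall straight out of the V² term in P = CV²f; a quick check:

```python
# Dynamic power scales with V^2 (P = C * V^2 * f), so supply-voltage ratios
# square into power ratios. Same arithmetic as on the two slides above.
historical = (12.0 / 1.1) ** 2   # 12 V early MOS -> 1.1 V today: ~119x already banked
remaining = (1.1 / 0.7) ** 2     # 1.1 V today -> 0.7 V ITRS floor: only ~2.5x left
print(f"banked so far: ~{historical:.0f}x, remaining headroom: ~{remaining:.1f}x")
```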
System-Level Power
• Per-chip power envelope is peaking
− Itanium 2 @ 130W => Montecito @ 100W
• 1U servers and blades reduce heat-sink height
• The cost of power in and heat out over several years can equal the original system cost
• A first-class design constraint
Current Microprocessor Power Trends
• Figure source: Shekhar Borkar, “Low Power Design Challenges for the Decade”,
Proceedings of the 2001 Conference on Asia South Pacific Design Automation, IEEE.
The Power Wall in Perspective
“Waste Not, Want Not” for Circuits
• Lots of circuit ways to save power already in use
− Clock gating
− Multiple transistor thresholds
− Sleeper transistors
− Etc.
• Thus circuit-level power management is already fairly efficient
• What about architectural savings?
Power-Efficient Microarchitecture
• Off-chip memory reference costs a lot of power
− Thus drive to more highly-associative on-chip caches
− Limits to how far this can go
• Lots of other similar examples
• PACS workshop/conference proceedings
− No silver bullet
− Limited benefits from each technique
Single-Threaded / Throughput Tradeoff
• Reducing transistors/core can yield higher MIPS/W
• Move back towards N³ scaling efficiency
• Thus, expect trend to simpler processors
− Narrower issue width
− Shallower pipelines
− More in-order processors or smaller OOO windows
• “Back to the Future”
• But this gives lower single-thread performance
− Can’t simplify core too quickly
• Tradeoffs on what to eliminate not always obvious
− Examples: Speculation, Multithreading
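A minimal model of the tradeoff described above, under the assumed rule of thumb that single-thread performance grows only as the square root of core area (often attributed to Pollack; an assumption of mine, not a number from the talk):

```python
def simple_core_tradeoff(core_area, total_area=16.0):
    """Assume single-thread performance ~ sqrt(core area) and power ~ area,
    with a fixed die/power budget split across identical cores."""
    n_cores = total_area / core_area
    single_thread = core_area ** 0.5
    throughput = n_cores * single_thread
    power = total_area                  # constant die budget => roughly constant power
    return single_thread, throughput, throughput / power

for area in (16.0, 4.0, 1.0):
    st, tp, eff = simple_core_tradeoff(area)
    print(f"core area {area:4.1f}: single-thread {st:.1f}, "
          f"throughput {tp:.1f}, perf/W {eff:.2f}")
```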
Speculation
• Is speculation uniformly bad?
− No
• Example: branch prediction
− Executing down wrong path wastes performance & power
− Stalling at every branch would hurt performance & power
• Circuits leak when not switching
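A quick expected-cost sketch of the branch-prediction example above (the accuracy and penalty numbers are assumptions for illustration); the stalled cycles also burn leakage power, per the last bullet:

```python
def extra_cycles_per_branch(predict_accuracy, mispredict_penalty=15, resolve_latency=15):
    """Expected wasted cycles per branch: predict and pay the penalty only on
    mispredictions, vs. stall for the full resolve latency on every branch."""
    predicted = (1.0 - predict_accuracy) * mispredict_penalty
    stalled = resolve_latency
    return predicted, stalled

p, s = extra_cycles_per_branch(0.95)
print(f"predicting (95% accurate): ~{p:.2f} cycles/branch wasted")
print(f"stalling at every branch:  ~{s:.2f} cycles/branch wasted")
```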
Multithreading
• SMT is very useful in wide-issue OOO machines
− Good news: increases power efficiency
− Bad news: Wide issue still power inefficient
• Multithreading useful even in simple machines
− During cache misses transistors still leak
• Not enough time to gate power
− May only need 2- or 3-thread non-simultaneous MT
Microarchitectural Research Implications
• Processors need to get simpler (not more complicated) to become more efficient
• More complicated microarchitectural mechanisms with little benefit are not needed
Recap: Possible Future Evolution in Terms of P = CV²f
• Formula doesn’t change, but terms do:
− Power
• Already at system limits, better if lower
− Voltage
• Only a factor of 2.5 left over the next 14 years
− Clock frequency
• Not scaling with device speed
• Fewer pipestages, higher efficiency
• Move from 30 to 10 stages from 90nm to 32 nm
− Capacitance (~Area) needs to be repartitioned for
higher system performance
Number of Cores Per Die
• Scale processor complexity as 1/feature
− Number of cores will go up dramatically as n³
− From 1 core at 90nm to 27 per die at 30nm!
− But can we efficiently use that many cores?
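The 1-to-27 jump is the same n³ argument as in the efficiency-era slides, applied to core count; a one-line check of the arithmetic:

```python
# Cores per die ~ n^3 when transistors per die grow as n^2 and per-core
# complexity shrinks as 1/n. Feature sizes in nm, as on the slide.
old_feature, new_feature = 90, 30
n = old_feature / new_feature          # linear scaling factor: 3x
print(f"~{n ** 3:.0f} cores per die")  # ~27
```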
The Coming Golden Age
• We are on the cusp of the golden age of parallel
programming
− Every major application needs to become parallel
− “Necessity is the mother of invention”
• How to make use of many processors effectively?
− Some apps are inherently parallel (Web, CFD, TPC-C)
− Some applications are very hard to parallelize
− Parallel programming is a hard problem
• Can’t use large amounts of speculation in software
− Just moves power inefficiency to a higher level
Important Architecture Research Problems
Not an exhaustive list:
• How to wire up CMPs:
− Memory hierarchy?
− Transactional memory?
− Interconnection networks?
• How to build cores:
− Heterogeneous CMP design
− Conjoined core CMPs
• Power: From the transistor through the datacenter
Important Architecture Research #2
Many system level problems:
• Cluster interconnects
• Manageability
• Availability
• Security
Power:
From the Transistor through the Data Center
• CACTI 4.0
• Heterogeneous CMP
• Conjoined cores
• Data center power management
CACTI 4.0
• Collaboration with David Tarjan of UVa
• Now with leakage power model
• Also includes:
− Much improved technology scaling
− Circuit updates
− Scaling to much larger cache sizes
− Options: serial tag/data, high speed access, etc.
− Parameterized data and tag widths (including no tags)
− Beta version web interface:
http://analog.cs.virginia.edu:81/cacti/index.y
− Full release coming soon ([email protected])
Heterogeneous Chip Multiprocessors
• A.k.a. asymmetric, non-homogeneous, synergistic, …
• Single ISA vs. Multiple ISA
• Many benefits:
− Power
− Throughput
− Mitigating Amdahl’s Law
• Open questions
− Best mix of heterogeneity
Potential Power Benefits
• Grochowski et al., ICCD 2004:
− Asymmetric CMP => 4-6X
− Further voltage scaling => 2-4X
− More gating => 2X
− Controlling speculation => 1.6X
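For scale only: if these factors were independent and simply multiplicative (an assumption of mine, not something the cited paper claims), they would compound roughly as follows:

```python
# Compounding the individual factors quoted above, assuming (optimistically)
# that they are independent and multiply cleanly. Back-of-the-envelope only.
low = 4 * 2 * 2 * 1.6
high = 6 * 4 * 2 * 1.6
print(f"rough combined potential: ~{low:.0f}x to ~{high:.0f}x")
```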
Clock Gating
• Transistors still leak
• And busses are still long, slow, & waste power
[Figure key: yellow blocks are inactive, red are active, orange are global busses]
[Animation: CPU scale-down]
• Combine the most frequently active blocks into a small heterogeneous core instead
• Power down the large core when not needed
Performance benefits from heterogeneity
[Figure; chart callout: 63%]
Mitigating Amdahl’s Law
• Amdahl’s law: Parallel speedups limited by serial
portions
• Annavaram et al., ISCA 2005:
− Basic idea
• Use big core for serial portions
• Use many small cores for parallel portions
− Prototype built from discrete 4-way SMP
• Ran one socket at regular voltage, other 3 at low voltage
• 38% wall clock speedup using fixed power budget
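A small sketch of why a single fast core for the serial portion pays off under Amdahl's law (the parameters are illustrative assumptions of mine, not the Annavaram et al. configuration):

```python
def speedup(serial_frac, n_small, big_speed=1.0):
    """Amdahl-style speedup relative to one small core.
    Serial code runs on one big core that is big_speed times a small core;
    parallel code is spread across n_small small cores."""
    return 1.0 / (serial_frac / big_speed + (1.0 - serial_frac) / n_small)

# 10% serial code, 16 small cores: homogeneous vs. one 2x-faster big core.
print(f"all small cores:   {speedup(0.10, 16, big_speed=1.0):.1f}x")
print(f"plus one big core: {speedup(0.10, 16, big_speed=2.0):.1f}x")
```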
Conjoined Cores
Conjoined-core Architecture Benefits
[Figure: core-area savings (core area + crossbar area per core) for INT and FP workloads]
Data Center Power is Front Page News
Data center thermal management at HP
• Power density also becoming a significant reliability issue
• Use 3D modeling and measurement to understand thermal
characteristics of data centers
− Saving 25% today
• Exploit this for dynamic resource allocation and provisioning
• Chandrakant Patel et al.
A Theory on the Evolution of Computer
Architecture Ideas
• Conjecture: There is no such thing as a bad or
discredited idea in computer architecture, only
ideas at the wrong level, place, or time
• Reuse, Recycle
• Evolution vs. Revolution
• Examples:
− SIMD
− Dataflow
− HLL architectures
− Capabilities
− Vectors
SIMD
• Efficient way of computing proposed in late ’60s
− 64-PE Illiac-IV operational in 1972
− Difficult to program
Dataflow
• Widely researched in late ’80s
• Lots of problems:
− Complicated machines
− New programming model
• Out-of-order (OOO) execution machines developed in the '90s are a limited form of dataflow
− Issue queues issue instructions when operands ready
− But keeps same instruction set architecture
− Keeps same programming model
− Still complex internally
High-Level Language Architectures
• Popular research in 1970’s and early 1980’s
• “Closing the semantic gap”
• A few attempts to implement in hardware failed
− Machines interpreted HLLs in hardware
• Now we have Java interpreted in software
− JIT compilers
− Portability
− Modest performance loss doesn’t matter for some apps
Capabilities
• Popular research topic in 1970’s
• Intel 432 implemented capabilities in hardware
− Every memory reference
− Poor performance
• Died with 432
• Security increasingly important
• Capabilities at the file-system level combined with standard memory protection models
− Much less overhead
• Virtual machine support
Lessons from “Idea Evolution Theory”
• Don't be afraid to look at past ideas that didn't work out as a source of inspiration
• Some ideas may be successful if reinterpreted at a different level, place, or time, when they can be made more evolutionary than revolutionary
Conclusions
• The Power Wall is here
− It is the wall to worry about
− It has dramatic implications for the industry
• From the transistor through the data center