Aug 08
Stefan Rusu
Senior Principal Engineer
Intel Corporation
Server Processor Power Trends
[Figure: total, active and leakage power in W (log scale, 0.001 to 1000) versus year, 1990 to 2006]
Power Components
• Total power includes switching, short-circuit and leakage:
$$P = P_{sw} + P_{short} + P_{leakage}$$
$$P_{sw} = f \cdot V_{cc}^{2} \cdot \sum_{i=1}^{n} AF_i \cdot C_i$$
$$AF_i = AF_i^{0\text{-}delay} + AF_i^{glitch}$$
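As a rough illustration of the switching-power sum above, the following Python sketch adds up the per-node activity factor (zero-delay plus glitch) times capacitance and scales by frequency and supply voltage squared. The node list, frequency and voltage are hypothetical values chosen only to make the arithmetic easy to follow; they are not from the slides.

```python
def switching_power(freq_hz, vcc, nodes):
    """Return f * Vcc^2 * sum(AF_i * C_i), with AF_i = zero-delay + glitch."""
    total = 0.0
    for af_zero_delay, af_glitch, cap_farads in nodes:
        total += (af_zero_delay + af_glitch) * cap_farads
    return freq_hz * vcc ** 2 * total


# Hypothetical nodes: (zero-delay activity, glitch activity, capacitance in F)
example_nodes = [
    (1.0, 0.0, 2e-12),     # clock-like node that toggles every cycle
    (0.1, 0.05, 10e-12),   # data node with some glitching
]

p_sw = switching_power(2e9, 1.0, example_nodes)  # watts
print(f"P_sw = {p_sw * 1e3:.1f} mW")
```

Splitting the activity factor into its two parts makes it visible how much of the total comes from glitching, which is the part a designer can attack by balancing path delays.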
Short Circuit Power
• Short circuit power is a function of $(V_{cc} - 2V_t)^3$
• Linearly increases with input slope ► Avoid large slopes
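The two dependencies above can be combined into a relative (unitless) estimate. This is only a sketch: the proportionality constant is omitted, the voltages and slopes are hypothetical, and the clamp at zero reflects the assumption that no short-circuit current flows once Vcc drops below 2Vt.

```python
def relative_short_circuit_power(vcc, vt, input_slope):
    """Unitless estimate proportional to input_slope * (Vcc - 2*Vt)^3."""
    headroom = max(vcc - 2.0 * vt, 0.0)  # assumed zero when Vcc <= 2*Vt
    return input_slope * headroom ** 3


# Hypothetical operating points, not taken from the slides
baseline = relative_short_circuit_power(vcc=1.2, vt=0.3, input_slope=1.0)
lower_vcc = relative_short_circuit_power(vcc=1.0, vt=0.3, input_slope=1.0)
slow_edge = relative_short_circuit_power(vcc=1.2, vt=0.3, input_slope=2.0)

print(f"lower Vcc / baseline: {lower_vcc / baseline:.3f}")
print(f"2x slope  / baseline: {slow_edge / baseline:.1f}")
```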
Optimal Active / Leakage Power Ratio
[Figure: production data in literature and silicon data; axis ticks 1.E-14 to 1.E-08 and 10 to 1000; process nodes 350nm, 250nm, 180nm, 130nm, 90nm, 65nm]
High-K Dielectric
– Increases the gate field effect
– Allows use of thicker dielectric layer to reduce gate leakage
HK + MG Combined
– Drive current increased >20%
– Or source-drain leakage reduced >5x
– Gate oxide leakage reduced
http://download.intel.com/pressroom/kits/45nm/Press%2045nm%20107_FINAL.pdf
HK+MG Gate Leakage Reduction
• Gate leakage is reduced >25X for NMOS and 1000X for PMOS
[Figure: normalized gate leakage (log scale, 0.00001 to 100) vs. VGS (V), -1.2 to 1.2, for PMOS and NMOS; SiON/Poly 65nm compared with HiK+MG 45nm]
65nm: Bai, 2004 IEDM
45nm: Mistry, 2007 IEDM
Leakage Dependency on Voltage
[Figure: normalized leakage (0 to 100%) vs. voltage (0 to 1.5 V) for a 130nm process, showing sub-threshold leakage and gate leakage]
[Figure: leakage vs. temperature (20 to 100 deg. C), including junction leakage]
[Mukhopadhyay, et al., VLSI Symposium 2003]
Outline
• Power components and trends
• Active power reduction techniques
– Clock gating
– Reduce clock loading
– Multiple cores
– Multiple voltage domains
• Leakage reduction techniques
• Power management methods
• Summary
Active Power Reduction
$$P = \alpha \, C_L \, V^{2} \, f_{CLK}$$
Reduce switched capacitance:
• Minimize diffusion, wire and gate loading, particularly in high activity factor nodes (clocks, domino)
• Use more efficient layout techniques
Technology scaling:
• Dynamic voltage scaling
• Supply voltage scaling is slowing down
• Thresholds don’t scale
Reduce switching activity:
• Conditional execution
• Conditional clocking
• Conditional precharge
• Turn off inactive blocks
• Reduce toggling of high capacitance nodes/busses
Reduce clock frequency:
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
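To see how the frequency and voltage knobs interact, here is a small sketch comparing one unit at full frequency with two parallel units at half frequency and a lower supply. All numbers (activity factor, capacitance, the 1.2 V and 0.9 V supplies) are hypothetical, and the sketch ignores the extra capacitance and leakage that duplicating hardware would add.

```python
def dynamic_power(alpha, c_load, vdd, f_clk):
    """Dynamic power: alpha * C_L * V^2 * f_CLK (watts)."""
    return alpha * c_load * vdd ** 2 * f_clk


# Hypothetical single unit at full frequency and nominal voltage
single = dynamic_power(alpha=0.2, c_load=1e-9, vdd=1.2, f_clk=2e9)

# Same throughput from two units at half frequency; the relaxed timing is
# assumed (not shown in the slides) to permit a 0.9 V supply
parallel = 2 * dynamic_power(alpha=0.2, c_load=1e-9, vdd=0.9, f_clk=1e9)

ratio = parallel / single
print(f"single: {single:.3f} W, parallel: {parallel:.3f} W, ratio: {ratio:.4f}")
```

The saving comes entirely from the squared voltage term: at equal voltage the two configurations would dissipate the same dynamic power.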
Clock Gating
[Figure: flip-flop schematics with signals D, Q, En and Clk, including a 0/1 multiplexer with select S]
Conditional Clocking Flip-Flop
[Figure: bar chart (0 to 50) per clock net, c1_0 to c1_12 and c2_0 to c2_12]
[Figure: clock distribution showing the un-core pre-global ZCLK spine, un-core sparse SCLK grid, un-core pre-global MCLK spine, horizontal clock spines and de-skew buffer]
Match the clock grid to the underlying circuits to reduce clock loading
S. Tam, ISSCC 2006
Multiple Voltage Domains
[Figure: die floorplan with FSB TOP, Core 1, TAG, 1MB L2, 16MB L3, Control Logic, 1MB L2, TAG, Core 0 and FSB BOT; voltage profile along a cut line showing 1.25V and 1.10V]
[Figure: flip-flop arrays with FF_A and FF_B on critical paths; alternating VDDH and VDDL rows]
Long-Le Transistors
• All transistors can be either nominal or long-Le
• Most library cells are available in both flavors
• Long-Le transistors are ~10% slower, but have 3x lower leakage
• All paths with timing slack use long-Le transistors
• Initial design uses only long channel devices
[Figure: nominal Le device vs. long Le device (Nom+10%)]
Rusu, et al., ISSCC 2006
Long-Le Transistors Usage
Long channel device average usage summary:
Cores    54%
Uncore   76%
Cache    100%
[Figure: long-Le usage (%), 0% to 100%, across Core 1, Control, L3 Cache and Core 0]
Rusu, et al., ISSCC 2006
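A back-of-the-envelope sketch of what the usage figures (Cores 54%, Uncore 76%, Cache 100%) could mean for leakage, using the 3x lower leakage quoted for long-Le devices. It assumes, as a simplification not stated in the slides, that every device contributes equally to leakage, so the usage percentage can be applied directly as a leakage weight.

```python
LONG_LE_LEAKAGE_RATIO = 1.0 / 3.0  # long-Le leaks 3x less than nominal

# Long-Le usage per region, from the usage summary
usage = {"Cores": 0.54, "Uncore": 0.76, "Cache": 1.00}


def relative_leakage(long_le_fraction, ratio=LONG_LE_LEAKAGE_RATIO):
    """Leakage relative to an all-nominal design (equal-weight assumption)."""
    return (1.0 - long_le_fraction) + long_le_fraction * ratio


estimates = {region: relative_leakage(frac) for region, frac in usage.items()}
for region, value in estimates.items():
    print(f"{region}: {value:.2f}x of all-nominal leakage")
```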
High-Vt Transistors
[Figure: logic block supplied from Vdd with body bias connections Vbp (+Ve) and Vbn (-Ve); label: equal loading]
Body Bias Leakage Reduction
[Figure: normalized delay under iso-input load vs. normalized Ioff (1e-5 to 1e+0) for two-stack devices, high-Vt and low-Vt, with equal loading (wu + wl = w); annotations: leakage reduction, performance loss]
Natural Stacks
• Leakage reduced significantly when two transistors are off in a stack
• Educate circuit designers, monitor average stacking factor
[Figure: four-transistor stacks at 1.5V with gate inputs at 0V or 1.5V; leakage vs. number of OFF transistors in stack]
Number of OFF transistors in stack    Leakage [nA]
1                                     9.96
2                                     1.71
3                                     0.98
4                                     0.72
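The four Ileak values on this slide (9.96, 1.71, 0.98 and 0.72 nA) can be turned into reduction factors with a few lines of Python. The sketch assumes the values correspond to 1, 2, 3 and 4 OFF transistors in the stack, in that order, which is how the plot axis reads.

```python
# Leakage in nA keyed by number of OFF transistors in the stack
# (assumed mapping of the slide's four Ileak values)
leakage_na = {1: 9.96, 2: 1.71, 3: 0.98, 4: 0.72}

baseline = leakage_na[1]
reduction = {n_off: baseline / i_leak for n_off, i_leak in leakage_na.items()}

for n_off, factor in reduction.items():
    print(f"{n_off} OFF: {leakage_na[n_off]:.2f} nA, {factor:.1f}x lower than one OFF")
```

Most of the benefit arrives with the second OFF transistor; the third and fourth add comparatively little.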
Cache Leakage Reduction Techniques
[Figure: cache sub-array with sleep bias, block select and shut-off controls on virtual VSS and virtual VCC. States: Active, Sleep (3x lower leakage), Shut-off (10x lower leakage). Voltage levels shown: 0V, 150mV, 250mV, 520mV, 850mV, 1.1V]
[Figure: cache floorplan with data and tag arrays and controllers; power off in white area]
Voltage / Frequency Scaling
[Figure: power vs. frequency curve with $Power \propto V^3$. Labels: max performance; power scaling range ~3–4; increasing performance; increasing efficiency (Freq/Power); minimum operating voltage; most efficient operating point; deep sleep / quick start]
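The cubic relationship on this slide follows if frequency is taken to scale linearly with voltage, so that P ∝ V²·f ∝ V³. The sketch below works under that assumption and ignores leakage; the 0.7 scaling point is a hypothetical example, not a value from the slide.

```python
def relative_power(voltage_ratio):
    """Power relative to the max-performance point, assuming P ~ V^3."""
    return voltage_ratio ** 3


def relative_efficiency(voltage_ratio):
    """Freq/Power relative to max performance, with frequency assumed ~ V."""
    return voltage_ratio / relative_power(voltage_ratio)


def voltage_ratio_for_power(power_ratio):
    """Voltage (and frequency) ratio that yields the given power ratio."""
    return power_ratio ** (1.0 / 3.0)


p = relative_power(0.7)
eff = relative_efficiency(0.7)
v_low = voltage_ratio_for_power(1.0 / 4.0)   # 4x power reduction
v_high = voltage_ratio_for_power(1.0 / 3.0)  # 3x power reduction
print(f"V = 0.7: power {p:.3f}, efficiency {eff:.2f}x")
print(f"3x to 4x power range needs V ratio {v_high:.2f} down to {v_low:.2f}")
```

Under these assumptions a 3x to 4x power range costs only about 30 to 37 percent of frequency, which is why running near the minimum operating voltage is the most efficient point.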
Itanium® Processor V/F Scaling
[Figure: power (W, 0 to 100) and frequency & voltage vs. frequency (50% to 100% of Fmax); power components: core switching, core leakage, other]
R. McGowan, ICCAD 2005
V / F Control System
[Figure: control loop in which a power sensor and a thermal sensor feed a micro-controller that drives the supply VRM; time scales shown: 10s of µs and 100s of ps]
R. McGowan, ICCAD 2005
Power Measurement
On Die Measurement
[Figure: VConn and VDie are each sampled through a VCO and counter, with VTest and VCal inputs; RPkg is the package resistance between VConn and VDie]
$$Power = V_{Die} \cdot \frac{V_{Conn} - V_{Die}}{R_{Pkg}}$$
– Uses package resistance to measure power
• Widely variable, changes with temperature
– VCO speed changes with process, temperature
– Uses a lookup table created with reference V
• Unique to each part / operating condition
• Linear interpolation for entries not in the table
– On die microcontroller software generates table, calibration and computes final power measurement
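The measurement flow described above (VCO counts mapped to voltage through a calibration table with linear interpolation, then power computed from the drop across the package resistance) can be sketched as follows. The table entries, counts and resistance are hypothetical, and clamping outside the table range is an assumption of this sketch rather than something the slides specify.

```python
from bisect import bisect_left


def count_to_voltage(count, table):
    """Map a VCO count to volts using sorted (count, volts) calibration points.

    Linear interpolation between entries; clamps outside the table (assumed).
    """
    counts = [c for c, _ in table]
    if count <= counts[0]:
        return table[0][1]
    if count >= counts[-1]:
        return table[-1][1]
    hi = bisect_left(counts, count)
    (c0, v0), (c1, v1) = table[hi - 1], table[hi]
    return v0 + (v1 - v0) * (count - c0) / (c1 - c0)


def die_power(v_conn, v_die, r_pkg):
    """Power = VDie * (VConn - VDie) / RPkg."""
    return v_die * (v_conn - v_die) / r_pkg


# Hypothetical calibration table built from reference voltages
cal_table = [(1000, 1.00), (1100, 1.10), (1200, 1.20)]

v_die = count_to_voltage(1050, cal_table)
v_conn = count_to_voltage(1150, cal_table)
power_w = die_power(v_conn, v_die, r_pkg=0.001)  # 1 milliohm, hypothetical
print(f"VDie = {v_die:.3f} V, VConn = {v_conn:.3f} V, power = {power_w:.1f} W")
```

Because the VCO speed and package resistance vary per part and with temperature, a real table would need to be regenerated per part and operating condition, as the bullets note.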
[Figure: Core 0 and Core 1 frequency (84% to 96%) vs. power limit, 100W down to 60W]
[Figure: core voltage and idle power across idle states, with partial flush and L2 cache off; * rough approximation]
STATE DEFINITION:
– What to include?
– Criteria: “Software seamless”
– Inclusions:
• All architectural state
• Most micro-architectural state
– Exclusions:
• Temp registers used by ucode
• Some others on a case by case basis
MICROCODE:
– State save and restore
– Core synchronization
Power Management Unit:
– Manages the DPD power-up sequence
– Manages entry/exit protocol with platform
[Figure label: VccP]
Entry sequence:
– OS: decides to idle the processors
– OS: execute MWAIT instruction
– CPU: shrink L2; save uarch state
– CPU: signal chipset to enter DPD
– C/S: stops CPU clk, blk I/O; CPU: VRM dn
Deep power down state
Exit sequence:
– Interrupt break event
– C/S: signals CPU wakeup, VRM ramp, bclk
– CPU: internal RESET, PLL relock etc
– CPU: restore arch/uarch state, expand L2
– CPU: continue execution to next instr.
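The entry and exit flow can be modelled as a toy state machine to show what the "software seamless" criterion implies: whatever state is saved before power-down must be identical after restore. Everything in this sketch (the `ToyCore` class, its fields and the example state) is hypothetical and only mirrors the step names from the flow.

```python
import copy

ENTRY_STEPS = [
    "OS: decides to idle the processors",
    "OS: execute MWAIT instruction",
    "CPU: shrink L2; save uarch state",
    "CPU: signal chipset to enter DPD",
    "C/S: stops CPU clk, blk I/O; CPU: VRM dn",
]
EXIT_STEPS = [
    "Interrupt break event",
    "C/S: signals CPU wakeup, VRM ramp, bclk",
    "CPU: internal RESET, PLL relock etc",
    "CPU: restore arch/uarch state, expand L2",
    "CPU: continue execution to next instr.",
]


class ToyCore:
    """Hypothetical core model; not the actual hardware or microcode."""

    def __init__(self):
        self.state = {"arch": {"next_instr": 0x1000}, "uarch": {"l2_enabled": True}}
        self.saved = None
        self.powered = True
        self.log = []

    def enter_dpd(self):
        self.log.extend(ENTRY_STEPS)
        self.saved = copy.deepcopy(self.state)  # state save before power-down
        self.state = None                       # core state is lost in DPD
        self.powered = False

    def exit_dpd(self):
        self.log.extend(EXIT_STEPS)
        self.powered = True
        self.state = self.saved                 # restore makes DPD invisible to software
        self.saved = None


core = ToyCore()
before = copy.deepcopy(core.state)
core.enter_dpd()
in_dpd_powered = core.powered
core.exit_dpd()
print("state preserved:", core.state == before)
```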
[Figure: plot against leakage current; 27% label]
– 27% to 44% average power reduction (depending on the leakage of the part) from the DPD feature, as measured by the Mobile Mark Office Productivity benchmark
– Significant improvement compared to previous generation (Merom)
– Measured exit latency for DPD state: ~150 to 200 µs, within the expected range
[Figure: EDAT bin; Core #2 at C3-DPD; frequency]
[Figure: relative performance (0.96 to 1.08), SPECint_base2000 (estimated) and SPECfp_base2000 (estimated), for four cases: baseline with EDAT OFF; EDAT ON, low interrupt rate; EDAT ON, high interrupt rate; EDAT ON with hysteresis, high interrupt rate]
[Figure: power breakdown pie chart; legible segments include Sparc cores 31.6%, leakage 21.1% and L2Dat 8.6%, plus SOC, top level, crossbar, L2 buffer, L2T, Gclk and CCU]
• Clock gating used at cluster and local clock-header level.
• 'GATE-BIAS' cells used to reduce leakage.
– ~10 % increase in channel length gives ~40 % leakage reduction.
• Interconnect W/S combinations optimized for power-delay product to
[Figure axis label: Degree of Throttling]
U. Nawathe (Sun Micro), ISSCC 2007
Future Directions
[Figure: power states Pwr=1, Pwr=¼ and Pwr=0 (0V; TPT = 0, Power = 0)]
Adapted from S. Borkar, DAC 2007
Summary
• Low power design is essential for modern computing
from hand-held all the way to servers
• Major low-power technology directions:
– Advanced process technology features:
High-K + metal gate, strained silicon
– Multiple clock and voltage domains
– Advanced voltage / frequency scaling
– Operate at the lowest possible voltage
– Turn off blocks that are not in use (clock and power gating)