
Aug 08

This document discusses techniques for reducing power and leakage in integrated circuits at the nanoscale. It begins by outlining power components and trends, noting that total power includes switching, short-circuit, and leakage power. It then discusses various active power reduction techniques like clock gating, reducing clock loading, using multiple cores/voltage domains, and conditional clocking. The document also covers leakage reduction techniques and power management methods. It provides examples and data on topics like gate leakage trends, the optimal active to leakage power ratio, and the temperature and voltage dependency of leakage.


Power and Leakage Reduction in

the Nanoscale Era

Stefan Rusu
Senior Principal Engineer
Intel Corporation

August 21st, 2008

Copyright © 2008, Intel Corporation. All rights reserved.


*Other names and brands may be claimed as the property of others.
Outline

• Power components and trends


• Active power reduction techniques
• Leakage reduction techniques
• Power management methods
• Summary

Server Processor Power Trends
[Figure: total, active and leakage power in W (log scale, 0.001 to 1000) versus year, 1990 to 2006.]
Power Components
• Total power includes switching, short-circuit and leakage:
$P = P_{sw} + P_{short} + P_{leakage}$
$P_{sw} = f \cdot V_{cc}^{2} \cdot \sum_{i=1}^{n} AF_i \cdot C_i$
$AF_i = AF_i^{0\text{-}delay} + AF_i^{glitch}$
• Glitches are a significant contributor to power, as illustrated in the NOR gate example below
[Figure: NOR gate timing example showing a glitch.]
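The switching-power sum can be evaluated directly. The following is a minimal sketch, assuming made-up node capacitances and activity factors; none of the numbers come from the presentation, and the names `Node` and `switching_power` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cap_farads: float      # C_i, switched capacitance of node i
    af_zero_delay: float   # functional (zero-delay) activity factor
    af_glitch: float       # extra activity caused by glitches

    @property
    def activity(self) -> float:
        # AF_i = AF_i(0-delay) + AF_i(glitch)
        return self.af_zero_delay + self.af_glitch


def switching_power(freq_hz: float, vcc: float, nodes: list[Node]) -> float:
    """P_sw = f * Vcc^2 * sum(AF_i * C_i), in watts."""
    return freq_hz * vcc ** 2 * sum(n.activity * n.cap_farads for n in nodes)


# Illustrative values only (assumptions, not data from the slides).
nodes = [
    Node(cap_farads=10e-15, af_zero_delay=0.10, af_glitch=0.05),
    Node(cap_farads=20e-15, af_zero_delay=1.00, af_glitch=0.00),  # clock-like node
]
p_total = switching_power(freq_hz=2e9, vcc=1.0, nodes=nodes)

# Same nodes with the glitch component removed, to isolate its contribution.
glitch_free = [Node(n.cap_farads, n.af_zero_delay, 0.0) for n in nodes]
p_no_glitch = switching_power(2e9, 1.0, glitch_free)

print(f"P_sw = {p_total * 1e6:.2f} uW, glitch share = {1 - p_no_glitch / p_total:.1%}")
```

Splitting the activity factor this way makes it easy to see how much of the total is spent on glitches rather than on useful transitions.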
Short Circuit Power
• Short-circuit power is a function of (Vcc − 2Vt)³
• Increases linearly with input slope (transition time) ► Avoid slow input edges

H. Veendrick (NXP), JSSC, 1984
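For reference, the cubic dependence is commonly written in the closed form below for an unloaded, symmetric CMOS inverter. The symbols β (transistor gain factor), τ (input rise/fall time) and T (switching period) do not appear on the slide; they are introduced here as assumptions, so treat this as a sketch of the relation rather than the slide's own notation.

```latex
P_{short} = \frac{\beta}{12}\,\left(V_{cc} - 2V_{t}\right)^{3}\,\frac{\tau}{T}
```

The τ/T factor is the linear dependence on input slope: halving the input transition time halves the short-circuit power in this model.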
Voltage Scaling Trends
• Vcc scaling has been driven by power and oxide reliability
• Gate overdrive is decreasing with each technology generation
• VT is scaling very slowly
• Vcc scaling is slowing down due to performance concerns
Optimal Active / Leakage Power Ratio

Optimal active/leakage power ratio is 70/30
Kuroda (Keio Univ.), ICCAD 2002
Source/Drain Leakage (Ioff)
[Figure: Ioff (A/µm, log scale, 1e-14 to 1e-4) versus physical gate length (10 to 1000 nm), with separate markers for research data and production data in the literature.]
Gate Leakage Trends
[Figure: electrical (inversion) Tox in nm and relative gate leakage for poly/SiON gate stacks across the 350nm, 250nm, 180nm, 130nm, 90nm and 65nm nodes.]
• SiON scaling running out of atoms
• Poly depletion limits inversion Tox scaling
K. Mistry, et al., IEDM 2007
45nm High-K + Metal Gate Transistors
Metal Gate
– Increases the gate field effect

High-K Dielectric
– Increases the gate field effect
– Allows use of thicker dielectric
layer to reduce gate leakage

HK + MG Combined
– Drive current increased >20%
– Or source-drain leakage
reduced >5x
– Gate oxide leakage reduced

http://download.intel.com/pressroom/kits/45nm/Press%2045nm%20107_FINAL.pdf
HK+MG Gate Leakage Reduction
• Gate leakage is reduced >25X for NMOS and 1000X for PMOS

[Figure: normalized gate leakage (log scale, 0.00001 to 100) versus VGS (−1.2 V to 1.2 V) for PMOS and NMOS, comparing SiON/Poly 65nm with HiK+MG 45nm.]
65nm: Bai, IEDM 2004
45nm: Mistry, IEDM 2007
Leakage Dependency on Voltage
[Figure: normalized sub-threshold leakage and gate leakage (0 to 100%) versus voltage (0 to 1.5 V), 130nm process.]
Krishnamurthy, et al., ASICON 2005
… And Temperature
[Figure: relative sub-threshold, gate and junction leakage versus temperature (20 to 100 deg. C).]
Mukhopadhyay, et al., VLSI Symposium 2003
Outline
• Power components and trends
• Active power reduction techniques
– Clock gating
– Reduce clock loading
– Multiple cores
– Multiple voltage domains
• Leakage reduction techniques
• Power management methods
• Summary

Active Power Reduction
$P = \alpha \cdot C_L \cdot V^{2} \cdot f_{CLK}$

Reduce switched capacitance:
• Minimize diffusion, wire and gate loading, particularly in high activity factor nodes (clocks, domino)
• Use more efficient layout techniques

Technology scaling:
• Dynamic voltage scaling
• Supply voltage scaling is slowing down
• Thresholds don’t scale

Reduce switching activity:
• Conditional execution
• Conditional clocking
• Conditional precharge
• Turn off inactive blocks
• Reduce toggling of high capacitance nodes/busses

Reduce clock frequency:
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Clock Gating

[Figure: two flip-flop schematics, one with an En-selected multiplexer on the D input and one with En gating Clk (signals D, Q, En, Clk).]
• Save power by gating the clock when data activity is low
• Most widely used switching power reduction technique
• Requires early En signal arrival, as well as detailed timing and logic validation
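The saving from clock gating can be illustrated with a small behavioral model. This is a sketch under stated assumptions: the 20% enable rate is arbitrary, the function names are hypothetical, and clock pulses reaching the flip-flop stand in for clock power. It also checks that the gated register behaves the same as an enable-multiplexer register.

```python
import random


def mux_register(data, enables):
    """Clock runs every cycle; a mux recirculates Q when En is low."""
    q, trace, clock_pulses = 0, [], 0
    for d, en in zip(data, enables):
        clock_pulses += 1            # flip-flop is clocked unconditionally
        q = d if en else q
        trace.append(q)
    return trace, clock_pulses


def gated_register(data, enables):
    """Clock is gated by En; the flip-flop only sees enabled cycles."""
    q, trace, clock_pulses = 0, [], 0
    for d, en in zip(data, enables):
        if en:
            clock_pulses += 1        # pulse reaches the flip-flop
            q = d
        trace.append(q)
    return trace, clock_pulses


random.seed(0)  # fixed seed so the illustrative run is repeatable
cycles = 1000
enables = [random.random() < 0.2 for _ in range(cycles)]  # assumed 20% activity
data = [random.randint(0, 1) for _ in range(cycles)]

mux_trace, mux_pulses = mux_register(data, enables)
gated_trace, gated_pulses = gated_register(data, enables)

print(f"same outputs: {mux_trace == gated_trace}")
print(f"clock pulses at the flip-flop: {mux_pulses} ungated, {gated_pulses} gated")
```

The model ignores the timing constraint on En noted in the bullets; in hardware a late En can glitch the gated clock, which is why the technique needs timing and logic validation.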
Conditional Clocking Flip-Flop

• FF does not consume active power when the data input does not change its state
M. Hamada (Toshiba), CICC, 2005
Conditional Clocking Flip-Flop (2)

• Taking into account the overhead of the auxiliary circuits, the flip-flop consumes less power than the conventional flip-flop when the data transition probability is less than 55%
• Issues: leakage, setup time
M. Hamada (Toshiba), CICC, 2005
Latch Clustering
• Minimize the capacitive loading on local clock buffers by
clustering latches around them
– Tradeoff between latch placement flexibility and clock power savings
– Reduction in clock skew between capturing and launching latch
compensates for loss in latch placement flexibility

R. Puri (IBM), DAC, 2005


Clock Power Savings
[Figure: % capacitance savings (0 to 70), wire and total, for clock nets c1_0 to c1_12 and c2_0 to c2_12.]

Latch clustering reduces local clock net capacitance by 25%

R. Puri (IBM), DAC, 2005
Multiple Clock Grid Types
[Figure: die clock distribution with labels: PLL (clock generator), vertical clock spines, horizontal clock spines, core dense MCLK grid, un-core ZCLK grid, un-core pre-global ZCLK spine, un-core sparse SCLK grid, un-core pre-global MCLK spine, de-skew buffer.]

Match the clock grid to the underlying circuits to reduce clock loading
S. Tam, ISSCC 2006
Multiple Voltage Domains
[Figure: die floorplan with FSB top and FSB bottom, Core 0 and Core 1, two 1MB L2 caches, tag arrays, control logic and 16MB L3. Legend: Core, PLL, Uncore, I/O.]
S. Rusu, ISSCC 2006
Voltage Profile

[Figure: voltage profile along a cut line through the cores, control + tag, and the 16MB array; levels marked are 1.25V, 1.10V, 0.25V (virtual VSS) and 0V.]

Operate each block at the lowest possible voltage

S. Rusu, ISSCC 2006
Cell-Level Dual-VDD Approach
• Use reduced voltage VDDL in non-critical paths
• Apply original voltage VDDH to timing critical paths

[Figure: logic between flip-flops in a VDDH-only design versus a design partitioned into a VDDH cluster and a VDDL cluster with level converters; critical paths are marked at FF_A and FF_B.]

• Challenges: minimize # of level converters by clustering
K. Usami (Toshiba), DAC 1998
Cell-Level Dual-VDD (cont)
Row-by-Row layout architecture with Dual-VDD
[Figure: interleaved VDDH rows and VDDL rows with VDDH, VDDL and VSS rails.]

• P&R tool determines which rows should be VDDL
• Clock tree synthesis using VDDL clock buffers
• 25% power reduction demonstrated on MPEG4 video codec core
K. Usami (Toshiba), DAC 1998
Outline
• Power components and trends
• Active power reduction techniques
• Leakage reduction techniques
– Long channel devices
– High-Vt transistors
– Body bias
– Transistor stacking
– Cache leakage reduction
– Power gating and multiple supplies
• Power management methods
• Summary

Long-Le Transistors
[Figure: transistor layouts with nominal Le and long Le (Nom+10%).]
• All transistors can be either nominal or long-Le
• Most library cells are available in both flavors
• Long-Le transistors are ~10% slower, but have 3x lower leakage
• All paths with timing slack use long-Le transistors
• Initial design uses only long channel devices
Rusu, et al., ISSCC 2006
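The long-Le flow (start with long channel devices everywhere, then fall back to nominal devices only where timing fails) can be sketched as a simple per-path check. This is an illustrative model, not the tool flow used in the cited design: the path delays, leakage values and cycle time are hypothetical, and it ignores cells shared between paths. Only the ~10% slowdown and 3x leakage figures come from the slide.

```python
from dataclasses import dataclass

SLOWDOWN = 1.10        # long-Le cells are ~10% slower (from the slide)
LEAKAGE_RATIO = 1 / 3  # long-Le cells have ~3x lower leakage (from the slide)


@dataclass
class Path:
    name: str
    delay_ps: float     # delay with nominal-Le cells (hypothetical)
    leakage_nw: float   # leakage with nominal-Le cells (hypothetical)


def assign_long_le(paths, cycle_ps):
    """Keep long-Le on every path that still meets timing when 10% slower.

    Returns ({path name: uses_long_le}, total leakage after assignment).
    """
    choice, total_leakage = {}, 0.0
    for p in paths:
        meets_timing = p.delay_ps * SLOWDOWN <= cycle_ps
        choice[p.name] = meets_timing
        total_leakage += p.leakage_nw * (LEAKAGE_RATIO if meets_timing else 1.0)
    return choice, total_leakage


# Hypothetical paths and a 1000 ps cycle time.
paths = [
    Path("critical", delay_ps=950, leakage_nw=300),
    Path("relaxed", delay_ps=700, leakage_nw=300),
    Path("slack", delay_ps=400, leakage_nw=400),
]
choice, leakage_after = assign_long_le(paths, cycle_ps=1000)
leakage_before = sum(p.leakage_nw for p in paths)

print(choice)
print(f"leakage: {leakage_before:.0f} nW nominal-only, {leakage_after:.0f} nW after assignment")
```

Only the path that would violate timing when slowed by 10% reverts to nominal devices; the others keep the leakage benefit.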
Long-Le Transistors Usage
Long channel device average usage summary:
Cores 54%
Uncore 76%
Cache 100%
[Figure: long-Le usage (0 to 100%) by block: Core 0, Core 1, Control, L3 Cache.]
Rusu, et al., ISSCC 2006
High-Vt Transistors

IBM’s POWER processors leverage a triple-Vt process option

Clabes, et al. (IBM), ISSCC 2004
Leakage Reduction Circuit Techniques
[Figure: three techniques side by side: body bias (Vbp at +Ve, Vbn at −Ve), stack effect (equal loading), and sleep transistor (Vdd, logic block).]
Body Bias Leakage Reduction

Keshavarzi, et al., D&TC 2002


Stack Forcing
[Figure: normalized delay under iso-input load versus normalized Ioff (1e-5 to 1) for low-VT and high-VT devices and their two-stack versions, marking the leakage reduction and the performance loss. For equal loading the stack is sized wu + wl = w with wu = wl.]
• Force one transistor into a two transistor stack with the same input load
• Can be applied to gates with timing slack
• Trade-off between transistor leakage and speed
Narendra, et al., ISLPED 2001
Natural Stacks
• Leakage reduced significantly when two transistors are off in a stack
• Educate circuit designers, monitor average stacking factor
[Figure: four-transistor stack at 1.5V with gate inputs at 1.5V or 0V, and leakage (nA) versus number of OFF transistors in stack.]
Ileak by number of OFF transistors in the stack:
1 OFF: 9.96nA
2 OFF: 1.71nA
3 OFF: 0.98nA
4 OFF: 0.72nA
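The size of the stack effect is easier to judge as a ratio. The short calculation below uses the four Ileak readings from the slide; pairing them with one to four OFF transistors is an assumption based on the order in which they appear.

```python
# Ileak readings from the slide (nA), keyed by the assumed number of
# OFF transistors in the stack.
ileak_na = {1: 9.96, 2: 1.71, 3: 0.98, 4: 0.72}

baseline = ileak_na[1]
reduction = {n_off: baseline / leak for n_off, leak in ileak_na.items()}

for n_off, factor in reduction.items():
    print(f"{n_off} OFF: {ileak_na[n_off]:.2f} nA, {factor:.1f}x lower than a single OFF device")
```

Most of the benefit comes from the second OFF device; further OFF devices give diminishing returns, which matches the slide's emphasis on two transistors being off.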
Cache Leakage Reduction Techniques

Kim, et al., IEEE Trans. VLSI Sys., 2005


Cache Sleep and Shut-off Modes
[Figure: two sub-array power gating schemes, each shown in Active, Sleep and Shut-off modes with sleep bias and shut-off devices controlled by block select. Virtual VSS scheme: 1.1V supply, virtual VSS levels of 0V, 250mV and 520mV marked, 2x lower leakage noted for sleep and for shut-off. Virtual VCC scheme: virtual VCC levels of 1.1V, 850mV and 150mV marked, 3x lower leakage in sleep and 10x lower leakage in shut-off.]

PMOS reduces junction leakage and has better shut-off

S. Rusu, et al., US Pat. App. 20070005999, 6/2005
Leakage Shut-off Infrared Images
[Figure: infrared images of three parts. 16MB part: 16MB in sleep mode. 8MB part: 8MB sleep, 8MB shut-off. 4MB part: 4MB sleep, 12MB shut-off.]

Leakage reduction ► 3W (8MB), 5W (4MB)

Rusu, et al., ISSCC 2006
Cache Dynamic Shut-off
[Figure: 16-way cache (ways 15 to 0, tag and data arrays, controller) shown in normal operation and in cache-by-demand operation.]

• Normal operation: in the full-load state, all 16 ways are enabled (green)
• Cache-by-demand operation: under idle or low-load states, cache ways are dynamically flushed out and put in shut-off mode (red)
Sakran, et al., ISSCC 2007
Multiple Power Domains

Kanno, et al. (Hitachi + Renesas), ISSCC 2006
Power Domains Activation Examples

[Figure: power domain activation examples; power is off in the white areas.]

Kanno, et al. (Hitachi + Renesas), ISSCC 2006
IBM POWER6 Voltage Domains

J. Friedrich (IBM), ISSCC 2007


Split vs. Connected Power Grid

• Chips are roughly the same process speed
• Droop reduced from 17% to 7% by connecting power grids
N. James (IBM), ISSCC 2007
Split vs. Connected Core Supplies

• Normalized to Process Sensitive Ring Oscillator (PSRO), the Fmax is ~5-10% higher on chips with connected core power grids
N. James (IBM), ISSCC 2007
Outline
• Power components and trends
• Active power reduction techniques
• Leakage reduction techniques
• Power management methods
– Voltage / Frequency Scaling
– Deep Power Down Technology
– Enhanced Dynamic Acceleration Technology
– Power Throttling
– Future Directions
• Summary

Voltage / Frequency Scaling
[Figure: power versus frequency, running from the deep sleep / quick start state through the minimum operating voltage (most efficient operating point) up to max performance. Annotations: Power ∝ V³, power scaling range ~3–4, performance increasing toward higher frequency, efficiency (Freq/Power) increasing toward the minimum operating voltage.]

• Voltage-frequency scaling with active thermal feedback
• Multi-operating states from high performance to deep sleep
• Power management reduces average and peak power
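The cubic relation follows from the dynamic power equation if clock frequency is assumed to scale roughly linearly with supply voltage. That proportionality is an assumption made here for illustration, and leakage is ignored.

```latex
P = \alpha\, C_L\, V^{2} f_{CLK}, \qquad f_{CLK} \propto V
\;\Longrightarrow\; P \propto V^{3}

\frac{P(0.9\,V)}{P(V)} = 0.9^{3} \approx 0.73
```

Under this model, lowering voltage and frequency together by 10% cuts dynamic power by roughly 27%, which is why combined voltage/frequency scaling is so much more effective than frequency scaling alone.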
Itanium® Processor V/F Scaling
[Figure: power (W, 0 to 100) versus frequency (50 to 100% of Fmax), broken down into core switching, core leakage and other power, with curves for frequency-only scaling and for frequency and voltage scaling.]
R. McGowan, ICCAD 2005
V / F Control System
[Figure: V/F control system with power sensor, thermal sensor, voltage sensor, micro-controller, supply VRM, voltage-to-frequency converter and clock; two response times are marked, 10s of µs and 100s of ps.]
R. McGowan, ICCAD 2005
Power Measurement
On Die Measurement
[Figure: VConn, VDie, VTest and VCal inputs to a VCO driving a counter; RPkg sits between VConn and VDie.]
$Power = V_{Die} \cdot \frac{V_{Conn} - V_{Die}}{R_{Pkg}}$
– Uses package resistance to measure power
• Widely variable, changes with temperature
– VCO speed changes with process, temperature
– Uses a lookup table created with reference V
• Unique to each part / operating condition
• Linear interpolation for entries not in the table
– On die microcontroller software generates table, calibration and computes final power measurement

E. Fetzer, ISSCC 2007 Short Course
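The measurement chain described above (VCO count, lookup table with linear interpolation, then the package-resistance power formula) can be sketched as follows. The calibration points, counts and RPkg value are hypothetical, and the function names are illustrative; the actual on-die microcontroller software is not described in the slides.

```python
from bisect import bisect_left


def counts_to_voltage(count, table):
    """Convert a VCO count to volts using a calibration table.

    `table` is a list of (count, volts) pairs sorted by count, built by
    measuring known reference voltages. Counts between entries are
    linearly interpolated; counts outside the table are rejected.
    """
    counts = [c for c, _ in table]
    if not counts[0] <= count <= counts[-1]:
        raise ValueError("count outside calibrated range")
    i = bisect_left(counts, count)
    if counts[i] == count:
        return table[i][1]                      # exact table entry
    (c0, v0), (c1, v1) = table[i - 1], table[i]
    return v0 + (v1 - v0) * (count - c0) / (c1 - c0)


def die_power(v_conn, v_die, r_pkg):
    """Power = VDie * (VConn - VDie) / RPkg."""
    return v_die * (v_conn - v_die) / r_pkg


# Hypothetical calibration points and package resistance (not from the slides).
table = [(1000, 1.00), (1100, 1.10), (1200, 1.20)]
v_conn = counts_to_voltage(1150, table)   # falls between two table entries
v_die = counts_to_voltage(1100, table)    # exact table entry
power = die_power(v_conn, v_die, r_pkg=0.0005)

print(f"VConn = {v_conn:.3f} V, VDie = {v_die:.3f} V, power = {power:.1f} W")
```

Because both RPkg and the VCO transfer curve drift with temperature and process, the table would need to be regenerated per part and operating condition, as the bullets note.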
Temperature Measurement

• Two thermal sensors per core
• Mux thermal diodes into VCOs to measure temperature
R. McGowan, ICCAD 2005
Frequency vs. Power Limit

[Figure: Core 0, Core 1 and average frequency (84% to 100%) versus power limit (100W down to 60W).]

31% power reduction for only 10% frequency drop

R. McGowan, ICCAD 2005
Deep Power Down Technology
States compared: C0 (active), C1, C3, C4 and the Deep Power Down technology state (new for Penryn)
Core clock: off in C1, C3, C4 and Deep Power Down
PLL: off in C3, C4 and Deep Power Down
L1 caches: flushed in C3 and C4, off in Deep Power Down
L2 cache: partial flush in C4, off in Deep Power Down
Core voltage*, wakeup time* and idle power*: shown graphically for each state
* Rough approximation

DPD enables reaching lower limit of CPU idle power of 0 W

V. George, et al., Hot Chips 2007
Penryn DPD Implementation
STATE STORAGE:
– 8KB per core, ECC protected
– Powered from I/O Vcc (VccP)
[Figure: two 8KB SRAMs supplied from VccP.]

STATE DEFINITION:
– What to include?
– Criteria: “Software seamless”
– Inclusions:
• All architectural state
• Most micro-architectural state
– Exclusions:
• Temp registers used by ucode
• Some others on a case by case basis

MICROCODE:
– State save and restore
– Core synchronization

Power Management Unit:
– Manages the DPD power-up sequence
– Manages entry/exit protocol with platform

V. George, et al., Hot Chips 2007
DPD Technology Entry/Exit
• S/W instruction initiates processor DPD entry
• CPU does rest of sequencing with platform
• Protocol with chipset to block snoops (no CPU wakeup required) while in DPD state
• Exit initiated by a break event (int) in platform
• CPU drives VID to VRM, internal hardware reset, state restore and execution resumption
[Figure: CPU supplied with Vcc by the VRM under VID control, with a VccP-powered FSB interface carrying I/O requests to the chipset.]

Entry sequence, in order:
– OS: decides to idle the processors
– OS: execute MWAIT instruction
– CPU: shrink L2; save uarch state
– CPU: signal chipset to enter DPD
– C/S: stops CPU clk, blk I/O; CPU: VRM dn
– Deep power down state

Exit sequence, in order:
– Interrupt break event
– C/S: signals CPU wakeup, VRM ramp, bclk
– CPU: internal RESET, PLL relock etc.
– CPU: restore arch/uarch state, expand L2
– CPU: continue execution to next instr.

V. George, et al., Hot Chips 2007
DPD Results (Average Power)
[Figure: average power (MM05* - Office Productivity) versus leakage current for Merom (Core(TM) 2 Duo processor), Penryn with DPD disabled and Penryn with DPD enabled; the DPD reduction is marked as 27% to 44%.]

– 27% to 44% average power reduction (depending on the leakage of the part) due to the DPD feature, as measured by the Mobile Mark Office Productivity benchmark
– Significant improvement compared to previous generation (Merom)
– Measured exit latency for DPD state: ~150-200 µs, in the expected range

V. George, et al., Hot Chips 2007
Enhanced Dynamic Acceleration Technology (EDAT)
Concept: In multi-core CPUs, use the power headroom of the idle core to boost performance of the active core
Two cores active: marked frequency, P = P0 + P1 = TDP Spec
Single core active: EDAT freq, P = P0 + P1 <= P (TDP Spec)
[Figure: frequency bars for Core 0 and Core 1 between Min Vcc and TDP top frequency. With two cores active, both run up to the TDP top frequency. With a single core active, the active core runs one EDAT bin above the TDP top frequency while the other core is in C3-DPD (leakage only).]

EDAT provides single-threaded performance boost

V. George, et al., Hot Chips 2007
EDAT Implementation Overview
[Figure: EDAT logic block diagram. Inputs: OS P-state request “P[0]”, core CC-state and EDAT disable. Blocks: EDAT control logic, hysteresis mechanism, and F/V clipping logic selecting among EDAT F/V, max F/V and guaranteed F/V.]

Microarchitecture
• Entry on OS request AND other core idle
• Idle core defined as “CC3” or deeper C-state
• EDAT freq pre-programmed in chip based on power, reliability and other constraints
• Exit EDAT mode when idle core wakes up

Hysteresis mechanism
• Allows short durations where 2 cores active
• Reduces perf loss for low activity wake-ups
• Implemented using a few counters
• Voltage regulator needs to provide for this
• Benefits most at high timer tick rates

OS interface
• OS requests P[0] state if perf demand exists
• EDAT logic grants it if power headroom exists

V. George, et al., Hot Chips 2007
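The entry, exit and hysteresis rules above can be captured in a toy state machine. This is a sketch under assumptions: the class name, the tick-based counter and the `hysteresis_ticks` parameter are hypothetical, and the real implementation is only described on the slide as "a few counters".

```python
class EdatController:
    """Toy model of EDAT entry/exit with a hysteresis counter."""

    def __init__(self, hysteresis_ticks: int):
        self.hysteresis_ticks = hysteresis_ticks  # hypothetical tuning knob
        self.active = False
        self._awake_ticks = 0  # consecutive ticks the other core has been awake

    def step(self, os_requests_p0: bool, other_core_idle: bool) -> bool:
        """Advance one tick; return True if the EDAT frequency is granted."""
        if other_core_idle:
            self._awake_ticks = 0
        else:
            self._awake_ticks += 1

        if not os_requests_p0:
            self.active = False      # no performance demand
        elif other_core_idle:
            self.active = True       # entry: OS request AND other core idle
        elif self._awake_ticks > self.hysteresis_ticks:
            self.active = False      # other core stayed awake: exit EDAT
        # Otherwise the wake-up is still short, so keep the current state.
        return self.active


# Other core idle pattern: a 2-tick wake-up, then a 3-tick wake-up.
idle_pattern = [True, True, False, False, True, False, False, False, True]

with_hysteresis = EdatController(hysteresis_ticks=2)
trace_hyst = [with_hysteresis.step(True, idle) for idle in idle_pattern]

no_hysteresis = EdatController(hysteresis_ticks=0)
trace_plain = [no_hysteresis.step(True, idle) for idle in idle_pattern]

print("with hysteresis:   ", trace_hyst)
print("without hysteresis:", trace_plain)
```

With hysteresis the short wake-up does not interrupt the boost, at the cost of brief periods where both cores are active at the EDAT frequency, which is the condition the voltage regulator has to be able to supply.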
EDAT Performance Results
[Figure: EDAT performance on SPEC CPU2000 (estimated). Relative SPECint_base2000 and SPECfp_base2000 performance (0.96 to 1.08) for four cases: baseline with EDAT OFF; EDAT ON, low interrupt rate; EDAT ON, high interrupt rate; EDAT ON with hysteresis, high interrupt rate.]

Performance gains of about 5% on SPECfp_base2000 and 7% on SPECint_base2000 due to EDAT within the same TDP power envelope
V. George, et al., Hot Chips 2007
Sun’s Niagara 2 Power
Niagara2 Worst Case Power = 84 W @ 1.1V, 1.4 GHz
[Figure: power breakdown pie chart with segments for Sparc cores, leakage, L2 data, L2 tag, L2 buffer, crossbar, SOC, IOs, Gclk/CCU, top level and misc units. Legible values include Sparc cores 31.6%, leakage 21.1%, L2 data 8.6% and misc units 1.2%.]
• CMT approach used to optimize the design for performance/watt.
• Clock gating used at cluster and local clock-header level.
• 'GATE-BIAS' cells used to reduce leakage.
– ~10% increase in channel length gives ~40% leakage reduction.
• Interconnect W/S combinations optimized for power-delay product to reduce interconnect power.

U. Nawathe (Sun Micro), ISSCC 2007
Niagara2 Power Management
[Figure: effect of throttling on dynamic power. % of 'No Throttling' power (70 to 100) versus degree of throttling (none, minimum, medium, maximum) for high and low workloads.]
• Software can turn threads on/off.
• 'Power Throttling' mode controls instruction issue rates to manage power consumption.
• On-chip thermal diodes monitor die temperature.
– Helps ensure reliable operation in case of cooling system failure.
• Memory controllers enable DRAM power-down modes and/or control DRAM access rates to control memory power.

U. Nawathe (Sun Micro), ISSCC 2007
Future Directions

• A sample 2D mesh network with three Voltage / Frequency Islands
• Communication across different islands is achieved through mixed clock / mixed voltage FIFOs
U. Ogras (CMU), DAC 2007
Fine Grain Power Management
[Figure: 4x4 array of cores, each set to f (high activity), f/2 (low activity) or 0 (shut-off), with the matching supply levels VDD, 0.7*VDD and 0V.]
Core frequency settings, by row:
f f/2 0 f
f/2 0 f f/2
0 f f/2 0
f f/2 0 f

Cores with critical tasks: Freq = f, at Vdd; TPT = 1, Power = 1
Non-critical cores: Freq = f/2, at 0.7xVdd; TPT = 0.5, Power = 0.25
Cores shut down: TPT = 0, Power = 0

Adapted from S. Borkar, DAC 2007
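The per-core numbers on this slide follow from dynamic power scaling with V² · f. The sketch below recomputes them and totals throughput and power over the 4x4 grid; it assumes throughput is proportional to frequency and ignores leakage, and the helper name `relative_power` is hypothetical.

```python
def relative_power(freq_ratio: float, vdd_ratio: float) -> float:
    """Dynamic power relative to a core at (f, Vdd), using P ~ V^2 * f."""
    return vdd_ratio ** 2 * freq_ratio


# Core grid as read from the slide: 1 = f, 0.5 = f/2, 0 = shut off.
grid = [
    [1, 0.5, 0, 1],
    [0.5, 0, 1, 0.5],
    [0, 1, 0.5, 0],
    [1, 0.5, 0, 1],
]
# Supply voltage (relative to Vdd) for each frequency setting, per the slide.
vdd_for = {1: 1.0, 0.5: 0.7, 0: 0.0}

cores = [f for row in grid for f in row]
throughput = sum(cores)  # assumes throughput scales with frequency
power = sum(relative_power(f, vdd_for[f]) for f in cores)

print(f"half-speed core power: {relative_power(0.5, 0.7):.3f} (slide rounds to 0.25)")
print(f"grid throughput = {throughput} of 16, grid power = {power:.3f} of 16")
```

A half-speed core delivers half the throughput for about a quarter of the power, so in this model the grid as a whole retains a larger share of its peak throughput than of its peak power.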
Summary
• Low power design is essential for modern computing
from hand-held all the way to servers
• Major low-power technology directions:
– Advanced process technology features:
High-K + metal gate, strained silicon
– Multiple clock and voltage domains
– Advanced voltage / frequency scaling
– Operate at the lowest possible voltage
– Turn off blocks that are not in use (clock and power gating)

• Low-power design techniques are becoming a way of life at all levels of chip and platform design!
