eytu_lecture2-3
Power Consumption and Thermal Management 2:
Low Power Digital Design and Management
Pablo Ituero and Rubén San Segundo
Outline
• 4 Relationship between Energy and Delay
• 5 Circuit-level Strategies
• 6 Gate-level Strategies
• 7 Architecture-level Strategies
• 8 Software-level Strategies
• 9 System-level Strategies
• 10 Power Management Examples
▪ 10.1 Arduino
▪ 10.2 Raspberry Pi
Slide Credits
• Low Power Design Essentials. Jan Rabaey. Springer
• Low Power VLSI Design. Dr.-Ing. Frank Sill.
Department of Electrical Engineering, Federal
University of Minas Gerais, Brazil.
• www.arduino.cc
• www.raspberrypi.org
• Own Material
Lecture Recap 1
• In current electronic circuits, power is mainly
consumed by charging and discharging, through
resistances (CMOS gates), the capacitors that
store the information the circuit processes.
• In each charge cycle, half of the energy provided
by the power supply is stored in the capacitance
and the other half is dissipated in the pull-up
resistance (turned into heat), regardless of the
value of the resistance.
• In the discharge cycle, the stored half of the
energy is dissipated in the pull-down resistance.
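As a rough numerical check of the energy split described above, the following sketch computes the energy drawn from the supply, the energy stored on the capacitor, and the energy dissipated in the pull-up path for one full charge. The load capacitance and supply voltage are illustrative assumptions, not figures from the lecture.

```python
def charge_cycle_energy(c_load, vdd):
    """Energy bookkeeping for charging c_load from 0 V to vdd through a resistor."""
    drawn = c_load * vdd ** 2          # energy delivered by the supply: C * VDD^2
    stored = 0.5 * c_load * vdd ** 2   # energy left on the capacitor: C * VDD^2 / 2
    dissipated = drawn - stored        # heat in the pull-up resistance, independent of R
    return drawn, stored, dissipated

# Assumed example values: 10 fF load, 1.2 V supply.
drawn, stored, dissipated = charge_cycle_energy(10e-15, 1.2)
print(f"drawn={drawn:.3e} J, stored={stored:.3e} J, dissipated={dissipated:.3e} J")
```

The resistance does not appear in the function at all, which mirrors the statement that the split is independent of its value.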
Lecture Recap 2
The total power consumption of a circuit is given by
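A common way to write this total, assuming the usual split into dynamic switching, short-circuit, and leakage components (the symbols $\alpha$, $C_L$, $f$, $I_{sc}$ and $I_{leak}$ are introduced here for illustration and may differ from the lecture's notation):

```latex
P_{total} = \alpha \, C_L \, V_{DD}^2 \, f \;+\; V_{DD} \, I_{sc} \;+\; V_{DD} \, I_{leak}
```

Here $\alpha$ is the switching activity, $C_L$ the switched load capacitance, $f$ the clock frequency, $I_{sc}$ the average short-circuit current, and $I_{leak}$ the leakage current.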
Threshold voltage
Sub-threshold Leakage
The dominant component of the leakage currents
The Traditional Design Philosophy
• Maximum performance is primary goal
▪ Minimum delay at circuit level
• Architecture implements the required function with
target throughput, latency
• Performance achieved through optimum sizing, logic
mapping, architectural transformations.
• Supplies, thresholds set to achieve maximum
performance, subject to reliability constraints
Trend: Power
The New Design Philosophy
• Maximum performance (in terms of propagation delay)
is too power-hungry, and/or not even practically
achievable
• Many (if not most) applications either can tolerate
larger latency, or can live with lower than maximum
clock-speeds
• Excess performance (as offered by technology) to be
used for energy/power reduction
Lowering VDD
• One of the most straightforward ways to reduce
power is lowering VDD
• However, lowering VDD also affects an important
metric of the circuit: speed.
Threshold voltage
Energy-Delay Interaction
Relationship Between Power and Delay
[Figure: two surface plots over VDD (1 to 4) and VTH (-0.4 to 0.8): Power (W, scale ×10^-4) and Delay (s, scale ×10^-10). Operating points A and B are marked on both plots.]
For a given activity level, power is reduced while delay is unchanged if both VDD
and VTH are lowered, such as from A to B.
$R_{ON} = k_2 \frac{V_{DD}}{(V_{DD} - V_T)^2}$
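The move from A to B can be sketched numerically with the on-resistance expression above as a proxy for delay. The operating points and the constant k2 = 1 are hypothetical values chosen for illustration; only dynamic power (proportional to VDD squared) is compared, and the extra leakage caused by a lower threshold is not modeled.

```python
def on_resistance(vdd, vt, k2=1.0):
    """R_ON = k2 * VDD / (VDD - VT)^2, used here as a proxy for gate delay."""
    return k2 * vdd / (vdd - vt) ** 2

def vt_for_same_delay(vdd_a, vt_a, vdd_b):
    """Threshold at the new supply vdd_b that keeps R_ON (and so delay) unchanged."""
    target = on_resistance(vdd_a, vt_a)
    # Solve vdd_b / (vdd_b - vt)^2 = target for vt, taking the root with vt < vdd_b.
    return vdd_b - (vdd_b / target) ** 0.5

vdd_a, vt_a = 2.5, 0.7   # hypothetical point A
vdd_b = 1.5              # hypothetical lower supply for point B
vt_b = vt_for_same_delay(vdd_a, vt_a, vdd_b)
dynamic_power_ratio = (vdd_b / vdd_a) ** 2

print(f"VT at B: {vt_b:.3f} V, dynamic power at B / A: {dynamic_power_ratio:.2f}")
```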
4.5 Reducing power:
global overview
Exploring the Energy-Delay Space
[Figure: energy versus delay. An unoptimized design lies off the curve of Pareto-optimal designs; the axes are marked Emin, Emax and Dmin, Dmax.]
• Software: amount of concurrency
• (Micro-)Architecture: parallel versus pipelined, general purpose versus application specific
• Logic/RT: logic family, standard cell versus custom
5 Circuit-Level Strategies
Transistor Sizing for Power Minimization
[Diagram: large W's mean higher capacitance but, to keep performance, allow a lower voltage.]
Gate-Level Strategies for Low-Power
6.1 Algebraic transformations
6.2 Restructuring
6.3 Input Ordering
6.4 Dealing with glitches
6.5 Multiple VDD
Algebraic Transformations
Idea: Modify network to reduce capacitance
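[Figure: two equivalent gate networks computing f from inputs a, b, c, annotated with signal probabilities p1 = 0.05, p2 = 0.05, p3 = 0.075, p4 = 0.75, p5 = 0.075.]
Source: Timmermann, 2007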
Glitching
[Figure: gate circuit with inputs A, B, C, internal node X and output Z, analyzed with a unit-delay model.]
Example 1: Chain of NAND Gates
[Figure: chain of NAND gates with outputs out1, out2, ..., out8, and their waveforms, V (Volt) from 0.0 to 6.0 versus t (nsec) from 0 to 3, with the VDD / 2 level marked.]
Example 2: Adder Circuit
[Figure: adder circuit with carry input Cin and sum outputs S0 to S15, and the sum output voltages (V) versus time (ps), with the VDD / 2 level marked.]
• At design phase:
▪ Determine critical path(s)
▪ High VDD for gates on those paths
▪ Lower VDD on the other gates (in non-critical paths)
▪ For low VDD: prefer gates that drive large capacitances (yields the
largest energy benefits)
• Usually two different VDD (but more are possible)
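[Figure: logic paths between rows of flip-flops (FF).]
[Figure: gate G1 (inputs A, B, output Y) driving gate G2 (inputs Y, C), with a timing diagram marking when all inputs of G1 have arrived, when G1 is ready with its evaluation, and when all inputs of G2 have arrived.]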
Multiple VDD in Data Paths
• Minimum energy consumption when all logic paths are critical (same delay)
• Possible Algorithm: clustered voltage-scaling
▪ Each path starts with VDDH and switches to VDDL (blue gates) when slack
is available
▪ Level conversion in flip-flops at end of paths
11. The following section represents a segment of a pipelined architecture.
11.a Signals 𝑆1 , 𝑆2 and 𝑆3 have a ‘1’ probability of 0.5. Find the ‘1’ probability of the rest of the
signals of the circuit.
11.c The circuit operates at 100 MHz, from a 1.2 V supply voltage and the average load
capacitance is 10 fF. Find the dynamic power consumption of the circuit. Do not consider the
effect of glitches in your analysis.
12. Considering the circuit in the previous exercise,
12.a Draw the timing diagram when the input signals change from 𝑆1 = 0, 𝑆2 = 0, 𝑆3 = 1 to 𝑆1 =
1, 𝑆2 = 1, 𝑆3 = 0.
12.b Is there any glitch in the circuit? What can you say about the power consumption results
of the previous exercise?
14. You need to implement a 3-OR function with two 2-OR gates. Find the input ordering that
minimizes power consumption, knowing that PA = 0.7, PB = 0.5, PC = 0.2.
7 Architecture-Level Strategies
Strategies
• 7.1 Review of architectural metrics and design
techniques
• 7.2 Reducing supply voltage while maintaining
performance
• 7.3 Clock Gating
• 7.4 Bus Power Reduction
Design Layer: Architecture Level
• Also known as register transfer level (RTL)
• Base elements:
▪ Register structures
▪ Arithmetic logic units (ALU)
▪ Memory elements
• Only behavior is described (no inner structure)
• No pipeline: 1 operation every 1 ns (one 1 ns stage)
• Pipeline: 1 operation every 200 ps (five 200 ps stages)
Basic Concepts: Parallelism
• Parallel implementation: 5 operations every 1 ns (five 1 ns units in parallel)
Motivation for Power Reduction
• Optimizations at the architecture or system level can enable
more effective power minimization at the circuit level (while
maintaining performance), such as
▪ Enabling a reduction in supply voltage
▪ Reducing the effective switching capacitance for a given function
(physical capacitance, activity)
▪ Reducing the switching rates
▪ Reducing leakage
[Figure: energy (E) versus delay (D) curves.]
Architecture and system transformations and optimizations reshape the E-D curves.
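[Figure: reference datapath with registers (R) and combinational blocks F1 and F2, clocked at fref.]
R: register
F1, F2: combinational logic blocks (adders, ALUs, etc.)
Cref: average switching capacitance
[Figures: parallel version clocked at fref / 2 (annotation: "Almost cancels") and pipelined version clocked at fref, assuming ovpipe = 10%.]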
Parallel Architecture: Example
• The clock rate can be halved at the same throughput: $f_{par} = f_{ref} / 2$
• $V_{par} = V_{ref} / 1.7$, $C_{par} = 2.15\,C_{ref}$
• $P_{par} = (2.15\,C_{ref})\,(V_{ref} / 1.7)^2\,(f_{ref} / 2) \approx 0.36\,P_{ref}$
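The arithmetic of this example can be reproduced with a short sketch that evaluates P = C · V² · f in ratios relative to the reference design. The helper name `relative_power` is hypothetical. With the rounded voltage ratio of 1.7 the result comes out at about 0.37; the 0.36 on the slide presumably reflects an unrounded voltage ratio.

```python
def relative_power(cap_ratio, voltage_ratio, freq_ratio):
    """P / Pref for P = C * V^2 * f, with every argument given relative to the reference."""
    return cap_ratio * voltage_ratio ** 2 * freq_ratio

# Ratios from the example: Cpar = 2.15 Cref, Vpar = Vref / 1.7, fpar = fref / 2.
p_par = relative_power(2.15, 1 / 1.7, 0.5)
print(f"Ppar / Pref = {p_par:.3f}")
```

Although the switched capacitance more than doubles, the quadratic dependence on voltage dominates, so the parallel design still uses well under half of the reference power.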
16. Repeat problem 15, using parallelism instead of pipelining. Assume that a 2-to-1
multiplexer has a delay of 4 ns at 2.5 V and switches 0.3 pF. Try parallelism levels of 2 and 4.
Which one is preferred?
Increasing use of Concurrency Saturates
▪ Can combine parallelism and pipelining to drive VDD down
▪ But close to the process threshold voltage, the overhead of excessive concurrency starts to dominate
[Figure: power (0.1 to 1) versus concurrency (2 to 16), assuming constant % overhead.]
Increasing use of Concurrency Saturates
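[Figure: power P versus VDD at fixed throughput. Starting from the nominal design (no concurrency), added concurrency lowers power down to a minimum Pmin, beyond which overhead and leakage dominate.]
[Figure: energy versus delay (= 1/throughput) for an increasing level of parallelism at fixed throughput, with the optimum energy-delay point marked.]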
[Figure: logarithmic trends from 1960 to 2010 for memory, microprocessor/DSP, normalized processor speed, and computational efficiency (mA/MIP).]
[Figure: processor performance for a constant power envelope, 2000 to 2008+: about 3x for single core versus about 10x for dual/many core. Examples shown: ARM, AMD DualCore.]
[Figure: speedup versus number of cores (0 to 30) for a serial fraction of 20%.]
Amdahl’s Law:
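Amdahl's Law bounds the speedup on N cores by 1 / (s + (1 - s) / N), where s is the fraction of the work that must run serially. The sketch below evaluates it for the 20% serial fraction quoted with the plot; the function name and the sampled core counts are illustrative choices.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup on `cores` processors when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Serial fraction of 20%; the core counts are arbitrary sample points.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(0.2, n), 2))
```

With s = 0.2 the speedup can never reach 1 / 0.2 = 5, however many cores are added, which is why the curve flattens.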
Clock Gating
• Most popular method for power reduction of clock signals and
functional units
• Gate off clock to idle functional units
• Logic for generation of disable signal necessary
▪ Higher complexity of control logic
▪ Higher power consumption
▪ Timing is critical to avoid clock glitches at the OR gate output
▪ Additional gate delay on clock signal
[Figure: register (Reg) feeding a functional unit; the register clock is the OR of clock and disable.]
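A toy model of the OR-based gating described above, using sampled waveforms. The waveforms and helper names are hypothetical; the sketch only counts the rising clock edges that reach the register, and it assumes the disable signal changes only while the clock is high, which is the timing condition needed to avoid glitches at the OR output.

```python
def gated_clock(clk, disable):
    """OR-based gating: the register sees clk OR disable, so disable = 1 holds it high."""
    return [c | d for c, d in zip(clk, disable)]

def rising_edges(signal):
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if (prev, cur) == (0, 1))

# Hypothetical waveforms, one sample per half clock period.
clk = [0, 1] * 8
# disable changes only while clk is high (samples 5 and 11), so no glitch is created.
disable = [0] * 5 + [1] * 6 + [0] * 5

gated = gated_clock(clk, disable)
print("edges without gating:", rising_edges(clk))
print("edges with gating:   ", rising_edges(gated))
```

Every suppressed edge is a clock transition that the idle register and its functional unit no longer have to respond to.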
[Figure: chip power comparison, 30.6 mW versus 8.5 mW (axis: Power [mW], 0 to 25); blocks shown: DEU, VDE, MIF, DSP/HIF, 896Kb SRAM.]
▪ 90% of flip-flops clock-gated
Bus Power
• Buses are a significant source of power dissipation
▪ 50% of dynamic power for interconnect switching (Magen, SLIP 04)
▪ MIT Raw processor’s on-chip network consumes 36% of total chip power
(Wang et al. 2003)
• Caused by:
▪ High switching activities
▪ Large capacitive loading
Bus Power Reduction
• For an n-bit bus: $P_{bus} = n \cdot \alpha \cdot f_{Clk} \cdot C_{load} \cdot V_{DD}^2$
• Alternative bus structures
▪ Segmented buses (lower Cload)
▪ Charge recovery buses
▪ Bus multiplexing (lower fClk possible)
• Minimizing bus traffic (n)
▪ Code compression
▪ Instruction loop buffers
• Minimization of bit switching activity (α) by data encoding
• Minimize voltage swing (VDD²) using differential signaling
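The bus power expression above can be used to see how much each lever is worth. The following sketch plugs in assumed values (32 lines, activity 0.25, 100 MHz, 1 pF per line, 1.2 V); none of these numbers come from the lecture, and the factor of two for segmentation and for the reduced swing is likewise an assumption for illustration.

```python
def bus_power(n_bits, alpha, f_clk, c_load, vdd):
    """Pbus = n * alpha * fClk * Cload * VDD^2, in watts."""
    return n_bits * alpha * f_clk * c_load * vdd ** 2

# Assumed illustrative values: 32-bit bus, activity 0.25, 100 MHz, 1 pF per line, 1.2 V.
baseline = bus_power(32, 0.25, 100e6, 1e-12, 1.2)
segmented = bus_power(32, 0.25, 100e6, 0.5e-12, 1.2)   # segmentation halves Cload (assumed)
low_swing = bus_power(32, 0.25, 100e6, 1e-12, 0.6)     # half the swing

print(f"baseline  {baseline * 1e3:.3f} mW")
print(f"segmented {segmented * 1e3:.3f} mW")
print(f"low swing {low_swing * 1e3:.3f} mW")
```

Capacitance, activity, and frequency act linearly, while the voltage swing acts quadratically, so halving the swing saves twice as much as halving the load.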
Reducing Shared Resources
• Shared resources incur switching overhead
• Local bus structures reduce overhead
Reducing Shared Resources cont’d
• Bus segmentation
▪ Another way to reduce shared buses
▪ Control of bus segment by controller blocks (B)
[Figure: a shared bus compared with a segmented bus whose segments are joined by controller blocks (B).]
8 Software-Level Strategies
Design Layer: Algorithm Level
• Base elements:
▪ Functions
▪ Procedures
▪ Processes
▪ Control structures
• Description of design behavior
▪ Communication
Original:
  for (c = 1..N)
    receive (A)
    B = c*A
Optimized:
  receive (A)
  for (c = 1..N)
    B = c*A
▪ Storage
Original:
  for (c = 1..N)
    B[c] = A[c]*D[c]
  for (c = 1..N)
    F[c] = B[c]-1
Optimized:
  for (c = 1..N)
    F[c] = A[c]*D[c]-1
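[Figure: processor speed versus time between T1 and T2. In one case the task runs at full speed and the processor then idles; in the other the task runs at lower speed over the whole interval. Same work, lower energy.]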
9 System-Level Strategies
Design Layer: System Level
• Basic Elements:
▪ Complex modules
▪ Processors
▪ Calculation and control units
▪ Sensors
[Figure: system-level block diagram with ALU, MEM, and MP3 modules.]
[Figure: bar chart, y-axis 0 to 80, versus frequency (MHz) from 300 to 1000, marked with a typical operating region and a peak performance region.]
Frequency: 300 MHz  433 MHz  533 MHz  667 MHz  800 MHz  900 MHz  1000 MHz
Voltage:   0.80 V   0.87 V   0.95 V   1.05 V   1.15 V   1.25 V   1.30 V
Source: Transmeta
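To get a feel for what these frequency and voltage pairs imply, the sketch below estimates the dynamic power of each operating point relative to the fastest one. It assumes dynamic power scales as f · V² with constant switched capacitance and ignores leakage; that model is an assumption added here, not data from the source.

```python
# Operating points from the table: (frequency in MHz, supply voltage in V).
operating_points = [(300, 0.80), (433, 0.87), (533, 0.95), (667, 1.05),
                    (800, 1.15), (900, 1.25), (1000, 1.30)]

def relative_dynamic_power(points):
    """Dynamic power of each point relative to the fastest one, assuming P ~ f * V^2."""
    f_max, v_max = points[-1]
    return [(f * v ** 2) / (f_max * v_max ** 2) for f, v in points]

for (f, v), p in zip(operating_points, relative_dynamic_power(operating_points)):
    print(f"{f:5d} MHz @ {v:.2f} V -> {p:.2f} of peak dynamic power")
```

Under this model the slowest point delivers 30% of the peak frequency for roughly a ninth of the peak dynamic power.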
10.1 Reducing power consumption with Arduino
Standard situation
Sleep modes
#include <avr/sleep.h>

void setup () {
  set_sleep_mode (SLEEP_MODE_PWR_DOWN);  // select the deepest sleep mode
  sleep_enable ();                        // allow the CPU to enter sleep
  sleep_cpu ();                           // go to sleep now
} // end of setup

void loop () { }
Sleep modes
SLEEP_MODE_IDLE: 50 mA
SLEEP_MODE_ADC: 42 mA
SLEEP_MODE_PWR_SAVE: 36 mA
SLEEP_MODE_EXT_STANDBY: 36 mA
SLEEP_MODE_STANDBY : 35 mA
SLEEP_MODE_PWR_DOWN : 34.5 mA
Sleep modes
Power Reduction Mode
In addition to putting the whole chip to sleep, you can turn off
individual parts of it through the chip's Power Reduction Register (PRR).
Down-clock
// clock_div_t and clock_prescale_set() come from <avr/power.h>
typedef enum
{
  clock_div_1 = 1, clock_div_2 = 2, clock_div_4 = 4, clock_div_8 = 8,
  clock_div_16 = 16, clock_div_32 = 32, clock_div_64 = 64, clock_div_128 = 128
} clock_div_t;

void clock_prescale_set (clock_div_t x);
Down-voltage
At 8 MHz:
1. 5.0 V: 11.67 mA
2. 4.5 V: 7.74 mA
3. 4.0 V: 5.60 mA
4. 3.5 V: 4.10 mA
5. 3.3 V: 3.70 mA
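Since both the voltage and the current fall, the saving in power is larger than the current figures alone suggest. A minimal sketch that converts the measurements above into power with P = V · I (the variable and function names are illustrative):

```python
# Measurements from the list above at 8 MHz: (supply voltage in V, current in mA).
measurements = [(5.0, 11.67), (4.5, 7.74), (4.0, 5.60), (3.5, 4.10), (3.3, 3.70)]

def power_mw(voltage, current_ma):
    """Electrical power in milliwatts: P = V * I."""
    return voltage * current_ma

powers = [power_mw(v, i) for v, i in measurements]
for (v, i), p in zip(measurements, powers):
    print(f"{v:.1f} V, {i:5.2f} mA -> {p:6.2f} mW")
print(f"3.3 V draws {powers[-1] / powers[0]:.0%} of the 5.0 V power")
```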
Max frequency vs. Voltage
10.2 Reducing power consumption with Raspberry Pi
General comparison
Power Management in Linux
Suspend(suspend.c)/Resume
Hibernation (hibernate.c)
Restore
Disconnect Unnecessary Peripherals
Shut down the USB Hub
#!/bin/bash
# Code to stop: bring networking down, then cut USB bus power
/etc/init.d/networking stop
echo 0 > /sys/devices/platform/bcm2708_usb/buspower
echo "Bus power stopping"

#!/bin/bash
# Code to start: restore USB bus power, wait, then bring networking up
echo 1 > /sys/devices/platform/bcm2708_usb/buspower
echo "Bus power starting"
sleep 2
/etc/init.d/networking start
Shut down the USB Hub
To locate buspower
Down-clock the Core
Example
Multithread
GPU Module
If the GPU frequency increases by 20%, the power consumption increases by a factor of (1.2)^3 ≈ 1.73
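The cubic factor follows if the supply voltage is assumed to scale in proportion to the frequency, so that dynamic power, proportional to f · V², becomes proportional to f³. This proportional-voltage assumption is not stated on the slide and is introduced here only to show where the exponent comes from:

```latex
P \propto f \, V_{DD}^2, \quad V_{DD} \propto f
\;\Rightarrow\;
\frac{P_{new}}{P_{old}} = \left(\frac{f_{new}}{f_{old}}\right)^3 = 1.2^3 \approx 1.73
```

If the voltage were held constant instead, the same 20% frequency increase would raise dynamic power by only 20%.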