

A Dynamic Voltage Scaled Microprocessor System

Article in IEEE Journal of Solid-State Circuits · December 2000


DOI: 10.1109/4.881202 · Source: IEEE Xplore



IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 35, NO. 11, NOVEMBER 2000 1571



Thomas D. Burd, Student Member, IEEE, Trevor A. Pering, Anthony J. Stratakos, and
Robert W. Brodersen, Fellow, IEEE

Abstract—A microprocessor system is presented in which the supply voltage and clock frequency can be dynamically varied so that the system can deliver high throughput when required while significantly extending battery life during the low speed periods. The system consists of a dc-dc switching regulator, an ARM V4 microprocessor with a 16-kB cache, a bank of 64-kB SRAM ICs, and an I/O interface IC. The four custom chips were fabricated in a standard 0.6-µm 3-metal CMOS process. The system can dynamically vary the supply voltage from 1.2 to 3.8 V in less than 70 µs. This provides a throughput range of 6–85 MIPS with an energy consumption of 0.54–5.6 mW/MIP yielding an effective energy efficiency as high as 26 200 MIPS/W.

Index Terms—Adaptive processor, energy efficient, low power, variable voltage.

Fig. 1. Processor usage model.
I. INTRODUCTION

MICROPROCESSOR systems are found in a variety of portable electronic devices which span a broad range of performance with respect to throughput and energy consumption. To lower energy consumption, existing low-power design techniques generally sacrifice throughput [1]–[4]. For example, personal digital assistants (PDAs) have a much longer battery life than notebook computers, but deliver proportionally less throughput to achieve this goal. Reducing the supply voltage is an effective technique to decrease energy consumption, as it is a quadratic function of voltage; however, the delay of CMOS gates scales inversely with voltage, so this technique reduces throughput as well. This paper will describe a new design technique that dynamically varies the supply voltage to only provide high throughput when required. This technique can decrease the system’s average energy consumption by up to 10x, without sacrificing perceived throughput, by exploiting the time-varying computational load that is commonly found in portable electronic devices.

Shown in Fig. 1 is an example of the microprocessor system’s desired throughput (e.g., million instructions per second, or MIPS) as a function of time. The computational requirements can be considered to fall into one of three categories: compute-intensive, low-speed, and idle. Compute-intensive and short-latency tasks (e.g., video decompression, speech recognition, complex spreadsheet operations, etc.) utilize the full throughput of the processor. Low speed and long-latency tasks (e.g., text entry, address book browsing, playing music, etc.) only require a fraction of the full processor throughput to adequately run. Executing these tasks faster than the desired throughput rate has no discernible benefit. In addition, there are system idle periods because single-user systems are often not actively computing. The key design objective for the processor systems in these applications is to provide the highest possible peak throughput for the compute-intensive tasks while maximizing the battery life for the remaining low speed and idle periods.

A common power-saving technique is to reduce the clock frequency during non-compute-intensive activity [5]. This reduces power, but does not affect the total energy consumed per task, since energy consumption is independent of clock frequency to a first order approximation [6]. Conversely, reducing the voltage of the processor improves its energy efficiency, but compromises its peak throughput. If, however, both clock frequency and supply voltage are dynamically varied in response to computational load demands, then the energy consumed per task can be reduced for the low computational periods, while retaining peak throughput when required. When a majority of the computation does not require maximum throughput, then the average energy consumption can be significantly reduced, thereby increasing the computation that can be done with the limited energy supply of a battery. This strategy, which achieves the highest possible energy efficiency for time-varying computational loads, is called dynamic voltage scaling (DVS). This paper describes a prototype DVS-enabled chip-set which contains a voltage converter, a microprocessor, SRAM memory chips, and an interface chip for connecting to commercial I/O peripherals.
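The energy argument above can be illustrated with a first-order model. The following is only a sketch under stated assumptions: switching energy per cycle is taken as C_eff·V_DD², the effective capacitance is a normalized, hypothetical value, the 1.2-V and 3.8-V endpoints come from the abstract, and the 5-MHz and 80-MHz clock rates are the endpoints reported later in the paper. The function names are illustrative and do not come from the paper.

```python
# First-order CMOS energy model (an assumption made for illustration):
# energy per cycle = C_eff * V_DD^2, independent of the clock frequency.


def energy_per_task(cycles, vdd, c_eff=1.0):
    """Energy to execute `cycles` processor cycles at supply voltage `vdd`.

    c_eff is a hypothetical, normalized effective switched capacitance.
    """
    return cycles * c_eff * vdd ** 2


def average_power(cycles, vdd, f_clk_hz, c_eff=1.0):
    """Average power while the task runs at clock frequency f_clk_hz."""
    run_time_s = cycles / f_clk_hz
    return energy_per_task(cycles, vdd, c_eff) / run_time_s


CYCLES = 1_000_000          # hypothetical task length in cycles
V_HIGH, V_LOW = 3.8, 1.2    # supply range quoted in the abstract

# Frequency scaling alone: power drops, energy per task does not change.
p_fast = average_power(CYCLES, V_HIGH, 80e6)
p_slow = average_power(CYCLES, V_HIGH, 5e6)
e_freq_only = energy_per_task(CYCLES, V_HIGH)

# DVS: lowering the voltage as well reduces the energy per task quadratically.
e_dvs = energy_per_task(CYCLES, V_LOW)
saving = e_freq_only / e_dvs

print(f"power ratio, frequency scaling only: {p_fast / p_slow:.1f}x")
print(f"energy ratio, 3.8 V versus 1.2 V:    {saving:.2f}x")
```

Under these assumptions the energy ratio comes out close to 10, which is in line with the up-to-10x saving stated above, while lowering the clock alone leaves the energy per task unchanged.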
Manuscript received March 21, 2000; revised June 17, 2000. This work was supported by Defense Advanced Research Projects Agency and ARM Ltd.
T. D. Burd and R. W. Brodersen are with the Berkeley Wireless Research Center, University of California, Berkeley, CA 94704 USA (e-mail: [email protected]).
T. A. Pering is with Intel Corporation, Hillsboro, OR 97124 USA.
A. J. Stratakos is with Volterra, Fremont, CA 94538 USA.
Publisher Item Identifier S 0018-9200(00)09434-8.
0018–9200/00$10.00 © 2000 IEEE

A technique for minimizing the supply voltage to reduce energy consumption utilizing a voltage regulator was proposed for digital circuits at fixed throughput [7], and subsequently demonstrated on a microprocessor core [8]. This idea was enhanced to dynamically scale the supply voltage for variable-rate digital signal processing [9], [10], and for variable-rate I/O interfaces
[11]. This work extends these efforts to dynamic voltage supply scaling of a general-purpose microprocessor, under direct operating system control, and over a complete system chip-set.

II. DVS OVERVIEW

There are three key components for implementing DVS in a general-purpose microprocessor system: an operating system that can intelligently vary the processor speed, a regulation loop that can generate the minimum voltage required for the desired speed, and a microprocessor that can operate over a wide voltage range.

A critical characteristic of CMOS circuits is shown in Fig. 2, which plots simulated maximum clock frequency versus supply voltage for various circuits in our 0.6-µm CMOS process. Whether the circuits are simple (NAND gate, ring oscillator) or complex (register file, SRAM), their circuit delays track extremely well over a broad range of supply voltage. Thus, as the processor’s supply voltage varies, all of the circuit delays scale proportionally, making CMOS processor implementations amenable to DVS. However, subtle variations of circuit delay with voltage do exist and primarily affect circuit timing, as discussed in Section VI.

Fig. 2. Simulated maximum clock frequency for four circuits in 0.6-µm CMOS.

Control of the processor speed must be under software control, as the hardware alone cannot distinguish whether the currently executing instruction is part of a compute-intensive task or a non-speed-critical task. The application programs cannot set the processor speed because they are unaware of other programs running in a multitasking system. Thus, the operating system must control processor speed, as it is aware of the computational requirements of all the active tasks. Applications may provide useful information regarding their load requirements, but should not be given direct control of the processor speed.

As processor speed varies, so too must the supply voltage in order to optimize the energy consumption. However, the software is not aware of the minimum required supply voltage for a desired clock frequency since it is a function of the underlying hardware implementation, process variation, and operating temperature. A ring oscillator, which attempts to scale over voltage with the critical paths of the processor, provides the translation from supply voltage to operating frequency.

A conventional voltage regulation system samples the output voltage and compares it to an input reference voltage within a negative feedback loop. In this DVS system, the regulation loop was modified so that the output voltage drives the ring oscillator whose output clock signal can be readily converted to a digitally measured clock frequency. The operating system controls the loop by providing a desired clock frequency from which the measured clock frequency is subtracted to calculate the feedback error. This approach allows the software to directly set the operating frequency, and lets the hardware loop determine the minimum required supply voltage to meet this desired frequency.

DVS introduces two new performance parameters, transition time and transition energy. Transition time is defined as the duration required to alter the clock frequency and supply voltage. This time impacts both interrupt latency and wake-up latency when the system is in its lowest-energy sleep state. Transition energy is the extra energy consumption due to switching losses that is required to change the system supply voltage.

III. SYSTEM ARCHITECTURE

The complete microprocessor system is comprised of four custom chips, as shown in Fig. 3, all of which were designed for DVS to maximize system energy efficiency. The regulator chip, discussed in detail in Section V, converts the battery voltage V_BAT to the variable supply voltage V_DD, which powers the microprocessor, the processor bus, the external memory bank, the I/O interface chip, and the front-end of the regulator. The four chips have been fabricated in a 0.6-µm 3-metal CMOS process.

Fig. 3. System architecture—four custom chips.

The CPU chip, shown in Fig. 4, contains a custom implementation of an ARM8 processor core [12]. The core, which implements the ARM V4 instruction set architecture, contains a five-stage scalar integer pipeline and an eight-word prefetch unit that performs static branch prediction. A 16-kB 32-way set-associative unified cache operates at the core clock rate. The cache contains 16 physical blocks in which a CAM tag array provides the line decoding for an SRAM data array which is logically organized into 32 lines with 8 words per line. A 12-element write buffer multiplexes address and data into a single register file and supports a variable number of data words per address to accommodate the store multiple registers (STM) instruction. A bus interface unit connects the CPU to the processor bus and contains a memory controller that provides all the signal generation for the external memory system. The bus side of the interface, the external bus, and the external memory system all operate at one-half the internal clock rate. At the center of the chip is the voltage-controlled oscillator (VCO) which drives an approximate H-tree clock network that has a maximum clock skew of 80 ps (simulated). The chip also contains a system coprocessor consisting of the desired clock frequency register, the regulator interface, a real-time counter, dynamic performance counters, and other system control state.

Fig. 4. CPU die photo (7.5 × 9.0 mm, 1.3 M transistors).

The SRAM chip, shown in Fig. 5, contains 64 kB of memory organized into two levels (8 × 8 × 1 kB). The data width of the chip is 32 bits so that only one external memory chip is activated per access, thereby minimizing the energy consumption of main memory. To prevent an excessive pin count, the address pins are multiplexed onto the same bus as the data. This reduces the bandwidth of single-word accesses by a factor of 2, but since external memory accesses are predominantly cache-line reloads, the average bandwidth reduction is closer to 11% (1 address word per 8 data words). The controller block on the SRAM chip supports these burst-mode accesses.

Fig. 5. SRAM die photo (9.6 × 10.4 mm, 3.4 M transistors).

The primary function of the I/O chip is to convert the variable voltage processor bus to a fixed 3.3 V bus for communication with commercial peripheral devices. In addition, the I/O chip performs simple data flow control and supports direct memory accesses (DMA) from an I/O device. To facilitate testing, the entire I/O subsystem was emulated in hardware using a processor system board and an FPGA which connected directly to the I/O chip. This virtual I/O subsystem can stream data in regular intervals modeling real input devices, as well as collect and verify data destined for output peripheral devices. This emulation system allowed the execution of benchmark programs, typical of those run on PDA-like devices, to adequately demonstrate DVS.

IV. VOLTAGE SCHEDULER

The voltage scheduler is a new operating system component for use in a DVS system. It controls the processor speed by writing the desired clock frequency to a system control register. The register’s value is used by the regulation loop to adjust the CPU clock frequency and regulated voltage. By optimally adjusting the processor speed, the voltage scheduler always operates the processor at the minimum throughput level required by the currently executing tasks and thereby minimizes system energy consumption.

The implemented voltage scheduler runs as part of a simple real-time operating system. Since the jobs of determining the optimal frequency and the optimal task ordering are independent of each other, the voltage scheduler can be separate from the temporal scheduler. Thus, existing operating systems can be straightforwardly retrofitted to support DVS by adding in this new, modular component. The overhead of the scheduler is small enough that it requires a negligible amount of throughput and energy consumption [13].

The basic voltage scheduler algorithm determines the optimal clock frequency by combining the computation requirements of all the active tasks in the system, and ensuring that all latency requirements are met given the task ordering of the temporal scheduler. Individual tasks supply either a completion deadline (e.g., video frame rate), or a desired rate of execution in MHz. The task’s workload (e.g., processing an MPEG frame), measured in processor cycles, is automatically estimated by the voltage scheduler. While the optimal clock frequency in a single-tasking system is simply workload divided by the deadline time, a more sophisticated voltage scheduler is necessary to determine the optimal frequency for multiple tasks. Workload predictions are empirically calculated using an exponential moving average, and are updated by the voltage scheduler at the end of each task. Other features of the algorithm are a graceful degradation when deadlines are missed, the reservation of cycles for future high-priority tasks, and the filtering of tasks that cannot possibly be completed by a given deadline [14].

Fig. 6 plots V_DD for two seconds of a user-interface task, which generally has long-latency requirements. Since clock frequency increases with V_DD, processor speed can be inferred from this scope trace. The top trace demonstrates the microprocessor running in the typical full-speed/idle operation. A high voltage indicates the processor is actively running at full speed, and low voltage indicates system idle. This trace shows that the user-interface task has bursts of computation, which can be exploited with DVS. The lower trace shows the same task running
with the voltage scheduler enabled. In this mode, low voltage indicates both system idle and low-speed/low-energy operation. The voltage spikes indicate when the voltage scheduler has to increase the processor speed in order to meet required deadlines. This comparison demonstrates that much of the computation for this application can be done at low voltage, greatly improving the system’s energy efficiency.

Fig. 6. DVS improvement for UI process.

V. VOLTAGE AND FREQUENCY REGULATION FEEDBACK LOOP

The two-chip regulation feedback loop is shown in Fig. 7. The ring oscillator on the CPU chip outputs a clock signal whose frequency is a function of the supply voltage V_DD. The clock signal is sent to the regulator chip and drives a counter which is latched and reset at 1 MHz intervals to quantize the frequency into a 7-bit word. This value is subtracted from the desired frequency (in MHz) as set by the operating system, to create an 8-bit frequency error. The loop filter circuit implements a hybrid pulse-width/pulse-frequency modulation (PWM/PFM) algorithm which generates the signals that enable the power FETs. The buck circuit, consisting of the pMOS and nMOS power FETs and the LC tank, down-converts the battery voltage (3.3–6.0 V) to the regulated voltage V_DD, which is fed back to the CPU chip, thus closing the loop.

Fig. 7. Frequency to voltage feedback loop.

The only external components required are a 4.7-µH inductor placed next to the regulator, and 5.5-µF of capacitance distributed near the chips’ V_DD pins. The ring oscillator is placed on the CPU chip, and is designed to track the critical paths of the microprocessor over voltage. A beneficial side effect is that the ring oscillator will also track the critical paths over process and temperature variations. The rest of the loop is integrated onto the regulator die as shown in Fig. 8.

Fig. 8. Regulator die photo (1.6 × 3.4 mm).

The regulation loop operates discontinuously to improve its stability and decrease the voltage transition time by pulsing the inductor current to transfer discrete quantities of charge to, or from, the capacitor. This is demonstrated in Fig. 9 which plots the buck circuit waveforms when the converter is regulating a constant V_DD, in which case it is adding charge to the capacitor. The pMOS power FET is enabled first, which begins ramping up the inductor current for a duration specified by the loop filter. At the end of the duration, the pMOS is turned off, and the nMOS is turned on, which ramps the inductor current down until it returns to zero. When the converter is ramping up V_DD, the current pulses will be larger and more frequent, and when it is ramping down V_DD, the inductor current will be reversed in polarity and the timing of the power FETs will switch so that the nMOS is enabled before the pMOS. Because the inductor current returns to zero at the end of each pulse, the inductor’s current flow is not continuous, and the two-pole LC filter reduces to a single dominant pole which is set by the capacitor and the effective load resistance. Loop stability is then ensured by setting the gain-bandwidth of the loop to be well below the sampling frequency of 1 MHz over the operating range of the loop.

Fig. 9. Converter waveforms in regulation mode (V_DD = 1.47 V, I = 22 mA).

The regulator chip contains two additional components to support this discontinuous mode of operation. Current limiting circuits restrict the maximum output current to 1 A to protect the power FETs and external filter elements. These circuits consist
of two offset-cancelled comparators, one for each of the power FETs. Zero-detection circuits, implemented as offset-cancelled comparators, are required to detect when the rectifying power FET’s current crosses zero, so that the FET can be turned off at the end of the charge pulse [15]. Additionally, to minimize power dissipation due to detector inaccuracy, an integral feedback loop, similar to adaptive dead-time control [16], is used to null the comparator, logic, and power FET gate-drive delays. This loop can detect the zero current crossing to within 2 mA.
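The front end of the regulation loop described in this section (a counter latched at 1 MHz producing a 7-bit frequency word, which is subtracted from the desired frequency to give an 8-bit error) can be sketched behaviorally as follows. This is an illustrative model, not the authors' circuit: the function names are hypothetical, and saturating the count at the 7-bit maximum is an assumption, since the text does not state how overflow is handled.

```python
SAMPLE_RATE_HZ = 1_000_000   # the counter is latched and reset at 1 MHz
FREQ_BITS = 7                # width of the measured frequency word
FREQ_MAX = (1 << FREQ_BITS) - 1


def measure_frequency_mhz(f_ring_hz):
    """Count ring-oscillator cycles in one 1-us window.

    With a 1-MHz sampling rate the count equals the frequency in MHz.
    Saturation at the 7-bit maximum is an assumption of this sketch.
    """
    count = int(f_ring_hz // SAMPLE_RATE_HZ)
    return min(count, FREQ_MAX)


def frequency_error(f_desired_mhz, f_measured_mhz):
    """Signed error = desired - measured.

    Two 7-bit operands give a difference in [-127, 127], which fits an
    8-bit two's-complement word.
    """
    for value in (f_desired_mhz, f_measured_mhz):
        if not 0 <= value <= FREQ_MAX:
            raise ValueError("frequency word must fit in 7 bits")
    return f_desired_mhz - f_measured_mhz


# Example: the operating system requests 80 MHz while the ring oscillator
# is still running near 5 MHz.
measured = measure_frequency_mhz(5.3e6)
error = frequency_error(80, measured)
print(measured, error)
```

A positive error indicates that the measured clock is slower than requested, and a negative error indicates that it is faster.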

A. Tracking Performance

The voltage converter required for DVS is fundamentally different from a standard voltage regulator because in addition to regulating voltage for a given speed, it must also change the voltage when a new speed is requested. A large regulator output capacitance reduces the dominant pole frequency, thereby reducing supply ripple, and increases low-voltage conversion efficiency, making the loop a better voltage regulator. A small capacitance reduces transition time and energy, making the loop a better voltage tracking system. Hence, the fundamental trade-off in DVS system design is to make the processor more tolerant of supply ripple so that the capacitance can be reduced to minimize transition time and energy [17]. The peak-to-peak ripple constraint for this system was relaxed to 5%, with a maximum measured value of 3.8%.

To further improve transition speed, the loop filter utilizes feed-forward control. The frequency error is first multiplied by a gain term, then a feed-forward value is added to it which is solely a function of the desired clock frequency. A 16 × 16-bit SRAM contains the look-up table for the feed-forward value as well as the frequency-dependent gain term, and is indexed by the upper four frequency bits. The feed-forward provides quick, but approximate loop adaptation, while the feedback loop locks onto the desired clock frequency.

B. Optimizing Conversion Efficiency

The key design challenge of this loop was to maintain good conversion efficiency with over 100x variation in load power dissipation, while keeping the output capacitance sufficiently small to maintain the loop’s tracking performance. A hybrid PWM/PFM algorithm is utilized which combines the high efficiency that PWM can provide at high loads with the high efficiency that PFM can provide at low loads [15].

The converter operates in one of two modes, tracking and regulation. Tracking mode is initiated by a new frequency request. Charge is either delivered to, or removed from the capacitor depending upon the sign of the frequency error, and is delivered via a variable-width pulse which has 4 bits of control. When the error magnitude is less than 4 MHz, the converter switches to regulation mode in which the converter will still deliver energy to the output when the frequency error is greater than zero, but only the microprocessor system can remove charge. When the frequency error is less than zero, the loop filter is disabled to suppress the charge pulse that would otherwise remove charge and drive the conversion efficiency to zero in a strictly PWM system. Thus, the only part of the loop that is continuously running is the front-end which calculates the frequency error. To improve low-voltage conversion efficiency, these circuits are powered by the variable V_DD, while the rest of the chip is powered by V_BAT.

To minimize the sum of on-state and conduction losses, there is an optimum power FET gate width for fixed load current [16]. Since the load current varies by 50x, the power FETs are dynamically sized to minimize losses over a broad range of load current and maximize conversion efficiency. The filter’s SRAM look-up table also contains two bits for each power FET for independent, binary-weighted, sizing control. The gate-width of the nMOS and pMOS least-significant bits (LSB) are 10 and 20 mm, respectively.

C. Transient Loop Response

Fig. 10 shows a scope trace for the system’s maximum low-to-high and high-to-low speed transitions. The V_DD signal transitions from 1.2 to 3.8 V, then back down to 1.2 V. The Track Status signal indicates whether the loop is operating in the tracking or regulation mode. This signal demonstrates that the maximum transition time is 70 µs for the 5–80 MHz transition under full system load, while smaller voltage transitions are executed in less time. During this entire transition period, the processor system can continue to execute instructions.

Fig. 10. Transient response of the regulation loop.

The battery-current trace is measured going into the regulator, but after the battery’s bypass capacitor. There is a current spike on a low-to-high transition which is required to charge up the loop’s output capacitor to the required voltage. The negative current spike on the high-to-low transition occurs because the power pMOS is removing charge from the capacitor and placing it back onto the battery’s bypass capacitor. The conversion loss of the regulator while charging and discharging the output capacitor becomes the transition energy, and is proportional to the size of the capacitor. This transition energy is a maximum of 4 µJ for a 5–80 MHz transition, which equals the energy consumed by the processor running at 80 MHz for 712 full-load cycles.

VI. DIGITAL CIRCUIT DESIGN FOR DVS

One approach to designing a processor system that switches voltage dynamically is to halt processor operation during the
switching transient. The drawback to this approach is that interrupt latency increases and potentially useful processor cycles are discarded. Since static CMOS gates are quite tolerant of a varying voltage supply, there is no fundamental need to halt operation during the transient. When the gate’s output is low, it will remain low independent of V_DD. However, when the output is high, it will track V_DD via the pMOS device(s). Simulation demonstrated that for a minimum-sized pMOS device in our 0.6-µm process, the RC time constant of the pMOS drain-source resistance and the load capacitance is a maximum of 5 ns, at low voltage. Thus, static CMOS gates track V_DD quite well for a dV_DD/dt in excess of 100 V/µs. Because all logic high nodes will track V_DD very closely, the circuit delay will instantaneously adapt to the varying supply voltage. Since the processor clock is derived from a ring oscillator also powered by V_DD, its output frequency will dynamically adapt as well, as demonstrated in Fig. 11.

Fig. 11. Ring oscillator adapting to varying V_DD (simulated).

Yet, there are constraints when using a design style other than static CMOS as well as limits on allowable dV_DD/dt. The processor system design contains a variety of different styles, including not only static CMOS logic, but dynamic logic, CMOS pass-gate logic, memory cells, sense-amps, bus drivers, and I/O drivers. As will be shown, the maximum dV_DD/dt that the circuits in this design can tolerate is approximately 5 V/µs. The converter loop has a maximum dV_DD/dt of only 0.2 V/µs, providing sufficient design margin. These design constraints sacrifice a small amount of energy-efficiency in the circuit design, but return much larger gains at the system level via DVS.

A. Pass Gate Logic

NMOS pass gates are often used in low-power design due to their small area and input capacitance. However, they are limited by not being able to pass a voltage greater than V_DD − V_Tn, such that a minimum V_DD of 2V_T is required for proper operation. Since throughput and energy consumption vary approximately by 4x over the voltage range V_T to 2V_T, using nMOS pass gates restricts the range of operation by a significant amount, and is not worth the moderate improvement in energy efficiency. Instead, CMOS pass gates, or an alternate logic style, should be utilized to maximize the voltage range for DVS.

B. Dynamic Logic

Dynamic logic styles are often preferable over static CMOS as they are more efficient for implementing complex logic functions. They can be used with a varying supply voltage, but require some additional design considerations. One failure mode can occur while the circuit is in the evaluation state and the gate inputs are low such that the output node is undriven at a value V_DD. If V_DD ramps down by more than a diode drop V_BE by the end of the evaluation state, the drain-well diode will become forward biased. Current may be injected into the parasitic p-n-p transistor of the pMOS device and induce latchup [18]. This condition occurs when

\frac{dV_{DD}}{dt} \le -\frac{V_{BE}}{\tau_{CLK,AVE}/2} \qquad (1)

where \tau_{CLK,AVE} is the average clock period as V_DD varies by a diode voltage drop V_BE. Since the clock period is longest at lowest voltage, this is evaluated over the lowest diode drop of the supply range, as V_DD ranges from V_MIN + V_BE down to the minimum operating voltage V_MIN. For our 0.6-µm process, the limit is a ramp-down rate of 20 V/µs. Another failure mode occurs if V_DD ramps up by more than a pMOS threshold |V_Tp| by the end of the evaluation state, and the output drives a pMOS device resulting in a false logic low, giving a functional error. This condition occurs when

\frac{dV_{DD}}{dt} \ge \frac{|V_{Tp}|}{\tau_{CLK,AVE}/2} \qquad (2)

and is evaluated as V_DD varies from V_MIN to V_MIN + |V_Tp|, since this condition is also most severe at low voltage. For our 0.6-µm process, the limit is 24 V/µs.

These limits assume that the circuit is in the evaluation state for no longer than half the clock period. If the clock is gated, leaving the circuit in the evaluation state for consecutive cycles, these limits drop proportionally. Hence, the clock should only be gated when the circuit is in the precharge state. These limits may be increased to that of static CMOS logic using a small bleeder pMOS device to hold the output at V_DD while it remains undriven. The bleeder device also removes the constraint on gating the clock, and since the bleeder device can be made quite small, there can be insignificant degradation of circuit delay due to the pMOS bleeder fighting the nMOS pull-down devices. The charge-redistribution problem of dynamic logic will be magnified by a varying supply voltage such that the internal nodes of nMOS stacks should be properly precharged [18].

C. Tri-State Busses

Tri-state busses that are not constantly driven for any given cycle suffer from the same two failure modes as seen in dynamic logic circuits due to their floating capacitance. The resulting dV_DD/dt limit can be much lower if the number of consecutive undriven cycles is unbounded. Tri-state busses can only be used if one of two design methods is followed.

The first method is to ensure by design that the bus will always be driven. While this is done easily on a tri-state bus with only two drivers, this may become expensive to ensure by design for a large number of drivers N, which requires routing N, or log2 N, enable signals.

The second method is to use weak, cross-coupled inverters which continually drive the bus. This is preferable to just a bleeder pMOS as it will also maintain a low voltage on the floating bus. Otherwise, leakage current may drive the bus high while it is floating for an indefinite number of cycles. The size of these inverters can be quite small, even for a large bus. For our 0.6-µm process, the inverters could be designed to tolerate a
dV_DD/dt in excess of 75 V/µs with negligible increase in delay, while increasing by only 10% the energy consumed driving the bus.

Fig. 12. Basic sense amplifier topology.

Fig. 13. Relative CMOS circuit delay variation (simulated).
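The slew-rate figures quoted so far in this section can be collected to check the design margin. The sketch below is only a bookkeeping aid: the numbers are the magnitudes reported in the text, while the dictionary keys and function names are illustrative. The `gated_limit` helper encodes the statement that the dynamic-logic limits drop in proportion to how long a node stays undriven, counted here in half clock periods, which is an interpretation made for this sketch.

```python
# Supply slew-rate tolerances reported in this section, in V/us (magnitudes).
REPORTED_LIMITS_V_PER_US = {
    "static CMOS": 100.0,
    "dynamic logic, supply ramping down": 20.0,
    "dynamic logic, supply ramping up": 24.0,
    "tri-state bus with cross-coupled inverters": 75.0,
}

DESIGN_TOLERANCE_V_PER_US = 5.0   # overall figure quoted for this design
CONVERTER_MAX_V_PER_US = 0.2      # maximum slew rate of the converter loop


def design_margin(tolerance, converter_max):
    """Ratio between what the circuits tolerate and what the loop produces."""
    return tolerance / converter_max


def gated_limit(limit, half_periods_undriven):
    """Slew limit when a node stays undriven for several half clock periods.

    The quoted limits assume one half period; per the text they drop
    proportionally as the undriven time grows.
    """
    if half_periods_undriven < 1:
        raise ValueError("expected at least one half period")
    return limit / half_periods_undriven


margin = design_margin(DESIGN_TOLERANCE_V_PER_US, CONVERTER_MAX_V_PER_US)
print(f"design margin: {margin:.0f}x")

# A dynamic node left in evaluation for four half periods (clock gated for
# two cycles) would see its 20 V/us limit fall to 5 V/us.
print(gated_limit(REPORTED_LIMITS_V_PER_US["dynamic logic, supply ramping down"], 4))
```

With the reported values the converter's maximum slew rate sits about 25 times below the tolerance of the design, which is the margin the text refers to.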

D. Sense Amps

The basic sense-amplifier topology, shown in Fig. 12, responds to the varying V_DD in a desirable manner. When V_DD increases, the cell current drive pulling down the bit line increases because the cell’s internal voltage increases, and the trip point of the sense amplifier shifts up. Likewise, when V_DD decreases, the cell current drive decreases, and the trip-point shifts down. The net effect is that the decrease/increase in response time of the sense amplifier with V_DD is relatively similar to the decrease/increase in clock period. Thus, the basic sense amplifier is very suitable for DVS, though second-order delay variation limits dV_DD/dt to the order of 5 V/µs, which ultimately determines the maximum slew rate allowed on the supply voltage.

E. Circuit Delay Variation

While circuit delays track well over voltage, subtle delay variations do exist which impact circuit timing. To demonstrate this, three chains of inverters were simulated whose loads were dominated by gate, interconnect, and diffusion capacitance respectively. To model paths dominated by stacked devices, a fourth chain was simulated consisting of four pMOS and four nMOS transistors in series. The relative delay variation of these circuits is shown in Fig. 13 for which the baseline reference is an inverter chain with a balanced load capacitance similar to the ring oscillator.

The relative delay of all four circuits is a maximum at only the lowest or highest operating voltages. This is true even including the effect of the interconnect’s RC delay. Since the gate

nificantly more expensive in area and/or power (e.g., memory address decoder).

VII. ARCHITECTURAL ENHANCEMENTS FOR DVS

A. Desired Frequency Register

The primary architectural support for DVS is the addition of the desired frequency register, which has been added to the system coprocessor. Writes to this register send a new frequency request to the regulator, and reads report the current measured clock frequency. This allows the operating system to actively monitor the operating frequency. To reduce the pin count on the CPU-regulator interface, the 7-bit frequency value is serialized by the CPU and transmitted to the regulator upon writing to the register. The regulator then converts the serial data back to a 7-bit word. The interface requires just three pins to transmit the new frequency value, and one pin to transmit the clock signal from the ring oscillator.

B. Dynamic Performance Monitors

The system coprocessor also contains several read-only registers that monitor system run-time performance. Four registers track processor performance by counting the number of cycles the processor spends in each of its states: active, idle, sleep, and stalled. A separate register counts the number of instructions executed. Another four registers track cache system performance by counting the cache hits, misses, cache-line write-backs, and uncached accesses. These nine registers provide dynamic feed-
dominant curve is convex, combining it with one or more of the back to the operating system on processor utilization, which can
other effects’ curves may lead to a relative delay maxima some- be used to vary the processor speed accordingly.
where between the two voltage extremes. However, all the other
curves are concave and roughly mirror the gate dominant curve C. Ring Oscillator
such that this maxima will be at most a few percent higher than To accommodate process variation over the die as well as sim-
at either the lowest or highest voltage, and therefore insignifi- ulation error, the oscillator was designed to be programmable
cant. Thus, timing analysis is only required at the two voltage from 50% to 150% of nominal frequency with 5 bits of con-
extremes, and not at all the intermediate voltage values. trol. The frequency control is designed to be glitch-free so that
As demonstrated by the series dominant curve, the relative it can be programmed via software through another register in
delay of four stacked devices rapidly increases at low voltage, the system coprocessor.
and larger stacks will further increase the relative delay [17]. The basic oscillator architecture, shown in Fig. 14, consists
Thus, to improve the tracking of circuit delay over voltage, a of five binary weighted delay blocks, plus a return path to close
general design guideline is to limit the number of stacked de- the loop. Each of the delay blocks has both a slow and fast
vices, except for circuits whose alternative design would be sig- path which is selected by the ctrl signal. A new value for this
signal can be loaded in when the trig1 signal transitions low-to-high. By ensuring that the pass gates in the basic block have switched by the time trig2 transitions low-to-high, the oscillator will change frequency glitch-free.

Fig. 14. VCO architecture.

The hardware was stepped from 5 to 80 MHz in 5-MHz increments, and at each step, the ring oscillator's control bits were decreased until processor failure. Decreasing the control bits had the effect of decreasing supply voltage since the converter loop maintains constant clock frequency. The minimum control setting was exactly the setting for nominal frequency at all frequency values, with the exception of 5 MHz, at which speed the control could be decreased by 1 LSB from nominal. This demonstrates that the critical paths of a CMOS processor do track extremely well over a wide range of voltage.

VIII. MEASURED RESULTS

A. Range of Operation
A plot of throughput versus energy consumption is shown in Fig. 15. The upper curve is for the processor system running off of a fixed voltage source while the lower curve is for the entire system with the regulator powered by a battery voltage. The curves are generated by running the system at constant frequency and supply voltage to demonstrate the full operating range of the system. The throughput ranges from 6–85 Dhrystone 2.1 MIPS, and the total system energy consumption ranges from 0.54–5.6 mW/MIP. The efficiency of the dc-dc converter, which is the ratio of the regulated power (measured at fixed voltage) to the power drawn from the battery, ranges from 90% at high voltage to 80% at low voltage.

Fig. 15. Measured throughput versus energy consumption.

With DVS, peak throughput can be delivered on demand. Thus, the true operating point for the system lies somewhere along the dotted line because 85 MIPS can always be delivered when required. In the optimum case when only a small fraction of the computation requires peak throughput, the microprocessor system can deliver 85 MIPS while consuming on average as little as 0.54 mW/MIP.
A common energy-efficiency metric is MIPS/W. The equivalent for this system would be the ratio of peak MIPS to average power dissipation because the throughput and power dissipation can be dynamically varied. In the optimal case when peak throughput is required only a small fraction of the time, the system's average power dissipation can be as low as 3.24 mW, yielding 26 200 MIPS/W. When the system is operated at constant voltage, the energy efficiency is a maximum of 1850 MIPS/W at 1.2 V.

B. Idle Energy Consumption
Because a microprocessor in portable systems idles a significant amount of time, a sleep mode has been implemented to minimize idle energy consumption. When the halt instruction is executed, which was implemented via a write to a system control register, the processor stops the clock to the processor core. This effectively stops all activity by clock gating the rest of the components in the system. The bus interface continues clocking a small finite-state machine to grant any DMA request that may come in while the processor is in sleep mode.
If the processor speed is set to 5 MHz before entering sleep, the entire system will dissipate only 800 µW of power, with a one-cycle start-up from sleep. This is possible because the VCO and regulation loop are continually operating, albeit at their lowest-energy operating points. To achieve low-power sleep modes, other processors require powering down the voltage supply and/or PLL [1]. A high-speed frequency change can be immediately initiated upon detection of an interrupt to minimize interrupt latency via a separate interrupt-frequency register. The latency to ramp back to full speed is set by the regulation loop to be 70 µs, but the processor can continue operating during this ramp-up period and begin immediate execution of the interrupt handler.

C. Benchmark Programs
To evaluate DVS, benchmarks were chosen that represented software applications that are typically run on notebook computers or PDA devices. Existing benchmarks (e.g., SPEC, MIPS, etc.) are not useful because they were constructed to only measure the peak throughput of a processor. New benchmarks were selected which combine computational requirements with realistic latency constraints. The three benchmarks that were executed on the system are:
1) MPEG: MPEG-2 decoding of an 80-frame 192 × 144 video at 5 frames/s, requiring an average 50-MHz clock rate in a single-task environment.
2) AUDIO: IDEA decryption of a 10-s 11-kHz mono audio stream, divided into 1-kB frames with a 93-ms deadline, requiring an average 17-MHz clock rate.
3) UI: A simple address-book user interface allowing simple searching, selection, and database selection. 432 frames are processed, each defined as a user-triggered event, such as pen-down, which ends when the corresponding action has been completed. Most frames require less than a 10-MHz clock rate, while some frames are very compute intensive.

The key parameter to measure the energy-efficiency improvement of DVS is the system energy consumption. Energy consumption was measured by charging up a large (3.5 F) capacitor to the battery voltage, and measuring the voltage drop on it over the duration of the benchmark. The benchmarks were first run at constant maximum throughput to measure the baseline energy consumption. They were then run with the voltage scheduler and the energy consumption was measured again.

TABLE I
MEASURED BENCHMARK ENERGY CONSUMPTION (NORMALIZED)

Table I shows the measured system energy consumption for the three benchmarks, normalized to the case in which the system is running at maximum throughput, which is the typical operating mode of a processor system that operates from a fixed supply voltage. The row labeled Optimal is the energy reduction when all the computational requirements are known a priori, and is an estimated value derived from simulation. The optimal values represent the maximum achievable energy reduction for these benchmarks. The last row is the measured energy consumption with the voltage scheduler enabled. As expected, the compute-intensive MPEG benchmark has only an 11% energy reduction from DVS. However, DVS demonstrates significant improvement for the less compute-intensive AUDIO and UI benchmarks, which have a 4.5× and 3.5× energy reduction, respectively. Comparing the DVS results against the optimal results demonstrates that while the voltage scheduler's heuristic algorithm has a difficult time optimizing for compute-intensive code, it performs extremely well on non-speed-critical applications.

TABLE II
MEASURED POWER DISSIPATION WITH THE VOLTAGE SCHEDULER

Table II shows the average power dissipation of the three benchmarks with the voltage scheduler operating. The effective MIPS/W is calculated as the ratio of peak throughput (85 MIPS) to average power dissipation, and demonstrates the achievable increase in energy efficiency when the system is running real programs. Both the UI and AUDIO benchmarks have an average power dissipation on the order of 10 mW, yielding an energy efficiency on the order of 10 000 MIPS/W.

IX. CONCLUSION
The prototype processor system demonstrates that DVS can improve the energy efficiency of battery-powered processor systems by up to a factor of 10x without sacrificing peak throughput. DVS is amenable to standard digital CMOS processes, with a few additional circuit design constraints. Existing operating systems can be retrofitted to support DVS with little modification, as the voltage scheduler can be added to the operating system in a modular fashion. Finally, the prototype system demonstrated that when running real programs, typical of those run on notebook computers and PDAs, DVS provides a significant reduction in measured system energy consumption, thus significantly extending battery life.

ACKNOWLEDGMENT
The authors would like to thank P. Laramie, O. Rowhani, C. Chang, R. Davis, and J. C. Rudell for their contributions.

REFERENCES
[1] J. Montanaro, et al., “A 160-MHz, 32-b 0.5-W CMOS RISC processor,” IEEE J. Solid-State Circuits, vol. 31, pp. 1703–1714, Nov. 1996.
[2] E. Vittoz, “Micropower IC,” in Proc. IEEE ESSCC, Sept. 1980, pp. 174–189.
[3] A. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473–484, Apr. 1992.
[4] B. Davari, R. Dennard, and G. Shahidi, “CMOS scaling for high performance and low power—The next ten years,” Proc. IEEE, pp. 595–606, Apr. 1995.
[5] Advanced Configuration and Power Interface Specification, Revision 1.0b, Intel/Microsoft/Toshiba, Feb. 1999, pp. 67–69.
[6] J. Rabaey, Digital Integrated Circuits—A Design Perspective. Englewood Cliffs, NJ: Prentice Hall, 1996.
[7] V. von Kaenel, P. Macken, and M. Degrauwe, “A voltage reduction technique for battery-operated systems,” IEEE J. Solid-State Circuits, vol. 25, pp. 1136–1140, Oct. 1990.
[8] T. Kuroda, et al., “Variable supply-voltage scheme for low-power high-speed CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, pp. 454–462, Mar. 1998.
[9] L. Nielsen, C. Niessen, J. Sparso, and K. van Berkel, “Low-power operation using self-timed circuits and adaptive scaling of the supply voltage,” IEEE Trans. VLSI Syst., vol. 2, pp. 391–397, Dec. 1994.
[10] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, “Data driven signal processing: An approach for energy efficient computing,” in IEEE ISLPED Dig. Tech. Papers, Aug. 1996, pp. 347–352.
[11] G. Wei, J. Kim, D. Liu, S. Sidiropoulos, and M. Horowitz, “A variable-frequency parallel I/O interface with adaptive power supply regulation,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2000, pp. 298–299.
[12] “ARM 8 Data-Sheet,” ARM Ltd., Doc. No. ARM-DDI-0080C, July 1996.
[13] T. Pering, T. Burd, and R. W. Brodersen, “Voltage scheduling in the lpARM microprocessor system,” in IEEE ISLPED Dig. Tech. Papers, July 2000, pp. 96–101.
[14] T. Pering, “Energy-efficient operating system techniques,” Ph.D. dissertation, Univ. California, Berkeley, CA, 2000.
[15] A. Stratakos, “High-efficiency, low-voltage dc-dc conversion for portable applications,” Ph.D. dissertation, Univ. California, Berkeley, CA, 1999.
[16] A. Stratakos, R. W. Brodersen, and S. Sanders, “High-efficiency low-voltage dc-dc conversion for portable applications,” in Int. Workshop Low-Power Design, Apr. 1994, pp. 619–626.
[17] T. Burd and R. W. Brodersen, “Design issues for dynamic voltage scaling,” in IEEE ISLPED Dig. Tech. Papers, July 2000, pp. 9–14.
[18] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison Wesley, 1993.

Thomas D. Burd (S’94) received the B.S. and M.S. degrees in electrical engineering from the University of California, Berkeley, in 1992 and 1994, respectively, where he is currently working toward the Ph.D. degree in the area of energy-efficient processor system design.
For the InfoPad research project at Berkeley, he developed a low-power CMOS cell library for custom DSP ASICs, which was used in the design of several custom chips. He has since worked on energy-efficient system, architecture, and circuit design for general-purpose processors, energy-efficient low-swing bus transceivers, CAD methodology to automate IC verification, and DVS converter loop architecture design.
Mr. Burd is the recipient of the 1998 Analog Devices Outstanding Student Award for recognition in IC design. He is a member of Tau Beta Pi and Eta Kappa Nu.

Trevor Pering received the B.S. degree in computer science and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1993 and 2000, respectively. In September, 1999, he joined the Microprocessor Research Lab, Intel Corporation, near Portland, OR.
At Berkeley, he worked on the InfoPad project, where he was responsible for the design and implementation of hardware-based wireless transmission protocols, as well as system-level integration and debugging of the InfoPad portable terminal. His Ph.D. work focused on energy-efficient software techniques for portable computers, including the design of a real-time voltage-scaling operating system. With Intel, he is currently engaged in user interfaces and system-level design issues for ubiquitous computing environments.

Anthony J. Stratakos received the B.S. and M.S. degrees in electrical engineering from The Johns Hopkins University, Baltimore, MD, in 1992, and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1998.
While at Berkeley, he worked on the InfoPad research project in the area of low-power mixed-signal design, with particular emphasis on low-voltage dc-dc conversion. In 1996, he co-founded Volterra, where he presently serves as Vice President of Advanced Research and Development and Chief Technical Officer. At Volterra, he has worked on low-voltage dc-dc converters for low-power portable/wireless and high-performance microprocessor applications.

Robert W. Brodersen (M’76–SM’81–F’82) received the Bachelor of Science degrees in Electrical Engineering and Mathematics from California State Polytechnic University, Pomona, CA in 1966; the Engineering and Master of Science degrees from the Massachusetts Institute of Technology (MIT), Cambridge in 1968; and the Ph.D. degree in Engineering from MIT in 1972. On May 28, 1999, Professor Brodersen was formally declared Technologiae Doctor Honoris Causa (Honorary Doctor of Technology) by the University of Lund, Sweden.
He is a Professor in the Department of Electrical Engineering and Computer Sciences (EECS) at the University of California, Berkeley. He joined the EECS Department faculty in 1976. From 1972–1976, he was a member of the Technical Staff, Central Research Laboratory at Texas Instruments. In addition to teaching, his present research focus is the application of integrated circuits as applied to personal communication systems with emphasis on wireless communications and low power design. He was appointed the first holder of the John R. Whinnery Chair in the Department of Electrical Engineering and Computer Science, University of California, Berkeley, in September, 1995. He was the National Chair of Information Science and Technology (ISAT) Study Group sponsored by the Institute for Defense Analysis, Washington, D.C., from 1992 to 1994. He currently serves on several committees associated with the National Academy of Sciences, Washington, D.C. He is the author or co-author of over 60 journal publications and 120 published conference papers, and author, co-author, editor, or contributor to 14 books, including Anatomy of a Silicon Compiler (Norwood, MA: Kluwer, 1992), and Low Power Digital CMOS Design (Norwood, MA: Kluwer, 1995). He is the holder of three patents. He has served on the editorial board or as reviewer for numerous scholarly journals and publications, including the IEEE JOURNAL OF SOLID-STATE CIRCUITS, IEEE TRANSACTIONS ON VLSI SYSTEMS, IEEE PERSONAL COMMUNICATIONS MAGAZINE, and Wireless Personal Communications (Kluwer Press).
He won conference best paper awards at Eascon in 1973, the International Solid-State Circuits Conference in 1975, and the European Solid-State Circuits Conference in 1978. In 1979, he received the W.G. Baker Award for the outstanding paper in the IEEE Journals and Transactions. He was co-recipient of the Morris Libermann Award of the IEEE in 1983 for “outstanding contributions to an emerging technology.” He received the best paper award in the Transactions on CAD in 1985 and the best tutorial paper of the IEEE Communications Society in 1992. In 1997, he received the distinguished IEEE Solid-State Circuits Award “for contributions to the design of integrated circuits for signal processing systems.” He is a member of the National Academy of Engineering.
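
As a worked check on the energy-efficiency figures quoted in Section VIII, the following Python sketch recomputes the effective MIPS/W metric (peak MIPS divided by average power dissipation) and shows how energy could be derived from the voltage drop on the measurement capacitor. The function names and the ideal-capacitor relation E = C(V_start^2 - V_end^2)/2 are assumptions made for illustration; the paper does not give the formula used in the actual measurement.

```python
def effective_mips_per_watt(peak_mips: float, avg_power_w: float) -> float:
    """Effective MIPS/W: peak throughput divided by average power dissipation."""
    if avg_power_w <= 0:
        raise ValueError("average power must be positive")
    return peak_mips / avg_power_w


def capacitor_energy_drawn_j(capacitance_f: float, v_start: float, v_end: float) -> float:
    """Energy in joules removed from an ideal capacitor as its voltage drops
    from v_start to v_end (assumed model, not taken from the paper)."""
    return 0.5 * capacitance_f * (v_start ** 2 - v_end ** 2)


# Low end of the measured range: 0.54 mW/MIP at 6 MIPS, consistent with the
# 3.24 mW minimum average power quoted in Section VIII-A.
LOW_END_POWER_W = 0.54e-3 * 6.0

# Peak throughput of 85 MIPS over that average power: about 26 200 MIPS/W.
BEST_CASE_MIPS_PER_W = effective_mips_per_watt(85.0, LOW_END_POWER_W)
```

With an average power on the order of 10 mW, as reported for the UI and AUDIO benchmarks, the same function gives 8500 MIPS/W, which matches the stated order of 10 000 MIPS/W.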

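
The serialized write to the desired frequency register described in Section VII-A can be sketched in Python as follows. The paper states only that the 7-bit frequency value is serialized by the CPU and converted back to a 7-bit word by the regulator; the MSB-first bit order and the function names below are assumptions made for illustration.

```python
from typing import List

FREQ_BITS = 7  # width of the desired-frequency value, per Section VII-A


def serialize_frequency(value: int) -> List[int]:
    """CPU side: turn a 7-bit frequency value into a bit stream.
    MSB-first ordering is an assumption, not specified in the paper."""
    if not 0 <= value < (1 << FREQ_BITS):
        raise ValueError("frequency value must fit in 7 bits")
    return [(value >> shift) & 1 for shift in range(FREQ_BITS - 1, -1, -1)]


def deserialize_frequency(bits: List[int]) -> int:
    """Regulator side: rebuild the 7-bit word from the received bit stream."""
    if len(bits) != FREQ_BITS:
        raise ValueError("expected exactly 7 bits")
    word = 0
    for bit in bits:
        word = (word << 1) | (bit & 1)
    return word
```

Round-tripping any value from 0 to 127 through both functions returns the original value, which is the property the CPU-regulator interface relies on.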