paper6

This paper presents energy-efficient approximate multiplication techniques for digital signal processing (DSP) and classification applications that tolerate computational errors. The proposed architectures can reduce energy consumption by up to 58% compared to precise multipliers while maintaining acceptable accuracy levels, leveraging methods such as aggressive voltage scaling and bit-width truncation. The results demonstrate that small computational errors do not significantly impact the quality of DSP and classification tasks, making these techniques suitable for energy-constrained devices.

Uploaded by

Akash Dey

Energy-Efficient Approximate Multiplication for Digital Signal Processing and Classification Applications
Srinivasan Narayanamoorthy, Hadi Asghari Moghaddam, Zhenhong Liu, Taejoon Park, Member, IEEE,
and Nam Sung Kim, Senior Member, IEEE

Abstract— The need to support various digital signal processing (DSP) and classification applications on energy-constrained devices has steadily grown. Such applications often extensively perform matrix multiplications using fixed-point arithmetic while exhibiting tolerance for some computational errors. Hence, improving the energy efficiency of multiplications is critical. In this paper, we propose multiplier architectures that can trade off computational accuracy with energy consumption at design time. Compared to a precise multiplier, the proposed multiplier can consume 58% less energy/op with average computational error of ∼1%. Finally, we demonstrate that such small computational error does not notably impact the quality of DSP and the accuracy of classification applications.

I. INTRODUCTION

Achieving high energy efficiency has become a key design objective for embedded and mobile computing devices due to their limited battery capacity and power budget. To improve energy efficiency of such computing devices, significant effort has already been devoted at various levels, from software to architecture, and all the way down to circuit and technology levels.

Embedded and mobile computing devices are frequently required to execute some key digital signal processing (DSP) and classification applications. To further improve energy efficiency of executing such applications, first, dedicated specialized processors are often integrated in computing devices. It has been reported that the use of such specialized processors can improve energy efficiency by 10∼100× compared to general-purpose processors at the same voltage and technology generation [1].

Second, many DSP and classification applications heavily rely on complex probabilistic mathematical models and are designed to process information that typically contains noise. Thus, for some computational error, they exhibit graceful degradation in overall DSP quality and classification accuracy instead of a catastrophic failure. Such computational error tolerance has been exploited by trading accuracy with energy consumption (e.g., [2]).

Finally, these algorithms are initially designed and trained with floating-point (FP) arithmetic, but they are often converted to fixed-point (FxP) arithmetic due to the area and power cost of supporting FP units in embedded computing devices [3]. Although this conversion process leads to some loss of computational accuracy, it does not notably affect the quality of DSP and the accuracy of classification applications due to computational error tolerance.

Most of such algorithms extensively perform matrix multiplications as their fundamental operation, while a multiplier is typically an inherently energy-hungry component. To improve energy efficiency of multipliers, previous studies have explored various techniques exploiting computational error tolerance. They can be classified into three categories: (i) aggressive voltage scaling [4,5], (ii) truncation of bit-width [4,6], and (iii) use of inaccurate building blocks [7]. Chippa et al. proposed scalable effort hardware design and explored algorithm-, architecture-, and circuit-level scaling to minimize energy consumption while offering acceptable classification quality through aggressively scaling voltage and truncating least-significant bits [4]. Kulkarni et al. proposed an under-designed 16 × 16 multiplier using inaccurate 2 × 2 partial product generators (PPG) while guaranteeing the minimum and maximum accuracy fixed at design time. Each PPG has fewer transistors compared to the accurate 2 × 2 one, reducing both dynamic and leakage energy at the cost of some accuracy loss. Babić et al. proposed a novel iterative log approximate multiplier using leading one detectors (LODs) to support variable accuracy [8].

In this paper, we propose an approximate multiplication technique that takes m consecutive bits (i.e., an m-bit segment) from each n-bit operand, where m is equal to or greater than n/2. An m-bit segment can start only from one of two or three fixed bit positions depending on where the leading one bit is located for a positive number. This approach can provide much higher accuracy than one simply truncating the LSBs, because it can more effectively capture more noteworthy bits. Although we can capture m-bit segments starting from the exact leading one bit position, such an approach requires expensive LODs and shifters to take m-bit segments starting from the leading one position, steer them to an m × m multiplier, and expand 2m bits to 2n bits. In contrast, our approach is more scalable than one that captures m-bit segments starting from the leading one bits, since it limits the possible starting bit positions of an m-bit segment to two or three regardless of m and n chosen at design time, eliminating LODs and replacing shifters with multiplexers. Finally, we also observe that one of two operands in each multiplication for DSP and classification algorithms is often stored in memory (e.g., coefficients in filter algorithms and trained weight values in artificial neural networks (ANN)) and repeatedly used. We exploit it to further improve the energy efficiency of our approximate multiplier.
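As a rough illustration of the two alternatives that the introduction contrasts with the proposed technique, the following Python sketch models LSB truncation and the leading-one-based approach (the one that needs LODs and shifters) for non-negative integers. The function names are hypothetical and the code is only a behavioral model under these assumptions, not the hardware design evaluated in the paper.

```python
def leading_one_segment(x: int, m: int) -> tuple[int, int]:
    """Return (segment, shift): the m bits of x that start at its leading one,
    and the right-shift that was needed to align them to bit 0."""
    # bit_length() plays the role of the leading one detector (LOD);
    # the right shift plays the role of the n-bit shifter.
    shift = max(x.bit_length() - m, 0)
    return x >> shift, shift


def leading_one_multiply(a: int, b: int, m: int) -> int:
    """Approximate a * b with one m x m multiplication (behavioral model)."""
    seg_a, shift_a = leading_one_segment(a, m)
    seg_b, shift_b = leading_one_segment(b, m)
    # The final left shift models expanding the 2m-bit product to 2n bits.
    return (seg_a * seg_b) << (shift_a + shift_b)


def truncated_multiply(a: int, b: int, n: int, m: int) -> int:
    """Baseline: drop the (n - m) LSBs of each n-bit operand before multiplying."""
    drop = n - m
    return ((a >> drop) * (b >> drop)) << (2 * drop)


if __name__ == "__main__":
    a, b = 0x0123, 0x0045  # small positive 16-bit operands
    print(a * b)                              # exact product
    print(leading_one_multiply(a, b, 8))      # leading-one-based 8 x 8
    print(truncated_multiply(a, b, 16, 8))    # LSB truncation to 8 x 8
```

For these small operands the truncated version returns 0 because all significant bits sit in the discarded lower half, while the leading-one-based version stays within about 1% of the exact product. This is the accuracy gap that motivates capturing the segment around the leading one.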
Digital Object Identifier: 10.1109/TVLSI.2017.2333366

1557-9999 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The remainder of the paper is organized as follows. Section 2 details the proposed multiplier architecture. Section 3 analyzes energy consumption and computational accuracy of various approximate multipliers and the impact of such multipliers on quality of DSP and accuracy of classification algorithms. Section 4 concludes this study.

II. APPROXIMATE MULTIPLIER EXPLOITING SIGNIFICANT SEGMENTS OF OPERANDS

In order to motivate and describe our proposed multiplier, we define an m-bit segment as m contiguous bits starting with the leading one in an n-bit positive operand. We dub this method dynamic segment method (DSM) in contrast to static segment method (SSM) that will be discussed later in this section. With two m-bit segments from two n-bit operands, we can perform a multiplication using an m × m multiplier. Figure 1 illustrates an example of a multiplication after taking 8-bit segments from 16-bit operands. In this example, we can achieve 99.4% accuracy for a 16 × 16 multiplication even with an 8 × 8 multiplier.

Fig. 1. An example of a multiplication with 8-bit segments of two 16-bit operands; bold-font bits comprise the segments.

Such a multiplication approach has little negative impact on computational accuracy, because it eliminates redundant bits (i.e., sign-extension bits) while feeding the most useful m significant bits to the multiplier; we will provide detailed evaluations of computational accuracy for various m in Section 3. Furthermore, an m × m multiplier consumes much less energy than an n × n multiplier, because the complexity (and thus energy consumption) of multipliers quadratically increases with n. For example, the 4 × 4 and 8 × 8 multipliers consume almost 20× and 5× less energy than a 16 × 16 multiplier per operation on average. However, a DSM requires: (i) two LODs, (ii) two n-bit shifters to align the leading one position of each n-bit operand to the MSB position of each m-bit segment to apply their m-bit segments to the m × m multiplier, and (iii) one 2n-bit shifter to expand a 2m-bit result to 2n bits. (i), (ii), and (iii) incur considerable area and energy penalties, completely negating the energy benefit of using the m × m multiplier; we provide detailed evaluations for two m values in Section 3.

The area and energy penalties associated with (i), (ii), and (iii) in DSM arise because the DSM must capture an m-bit segment starting from an arbitrary bit position in an n-bit operand, since the leading one bit can be anywhere. Thus, we propose to limit the possible starting bit positions for extracting an m-bit segment from an n-bit operand to two or three at most in SSM, where Figure 2 shows examples of extracting 8- and 10-bit segments from a 16-bit operand. Regardless of m and n, we have four possible combinations of taking two m-bit segments from two n-bit operands for a multiplication using the m-bit SSM.

Fig. 2. Possible starting bit positions of 8- and 10-bit segments indicated by arrows; the dotted arrow is the case for supporting three possible starting bit positions.

For a multiplication, we choose the m-bit segment that contains the leading one bit of each operand and apply the chosen segments from both operands to the m × m multiplier. The SSM greatly simplifies the circuit that chooses m-bit segments and steers them to the m × m multiplier by replacing two n-bit LODs and shifters for the DSM with two (n − m)-input OR gates and m-bit 2-to-1 multiplexers; if the first (n − m) bits starting from the MSB are all zeros, the lower m-bit segment must contain the leading one. Furthermore, the SSM also allows us to replace the 2n-bit shifter used for the DSM with a 2n-bit 3-to-1 multiplexer. Since the segment for each operand is taken from one of two possible segments in an n-bit operand, a 2m-bit result can be expanded to a 2n-bit result by left-shifting the 2m-bit result by one of three possible shift amounts: (i) no shift when both segments are from the lower m-bit segments; (ii) (n − m) shift when two segments are from the upper and lower ones, respectively; and (iii) 2 × (n − m) shift when both segments are from the upper ones, as shown in Figure 3.

Fig. 3. Examples of 16 × 16 multiplications based on 8-bit segments with two possible starting bit positions for 8-bit segments. The shaded cells represent 8-bit segments and the aligned position of 8 × 8 multiplication results.

Note that the accuracy of an SSM with m = n/2 can be significantly low for operands such as those shown in Figure 4, where many MSBs of the m-bit segments containing the leading one bit are filled with zeros. On the other hand, such a problem becomes less severe when m is larger than n/2, because there is an overlap in the range of bits covered by the two possible m-bit segments, as shown for m = 10 in Figure 2. Thus, for an SSM with m = n/2 we propose to support one more bit position that allows us to extract an m-bit segment indicated by the dotted arrow in Figure 2. This will be able to effectively capture operand pairs similar to the one shown in Figure 4.

Fig. 4. An example of low accuracy for SSM16 × 16.

Fig. 5. Probability distribution of compute accuracy of AM2 × 2, DSM8 × 8, DSM6 × 6, SSM8 × 8, ESSM8 × 8, and 8 × 8 (truncated) for random vectors, audio/image processing, and recognition applications.
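The static segment selection and result expansion described in this section can be sketched behaviorally in Python as follows. This is a minimal model for non-negative operands, with hypothetical function names. The text does not state the exact location of the third starting position, so placing it midway between the other two (bit 4 for m = 8, n = 16) is an assumption, as is the rule of picking the lowest allowed position whose segment still contains the leading one.

```python
def select_segment(x: int, m: int, starts: list[int]) -> tuple[int, int]:
    """Return (segment, start) for the lowest allowed start position whose
    m-bit window still contains the leading one of x."""
    mask = (1 << m) - 1
    for start in starts:  # starts must be in ascending order
        # Models the OR gate: are all bits above this window zero?
        if x >> (start + m) == 0:
            return (x >> start) & mask, start
    raise ValueError("operand is wider than the highest segment allows")


def segment_multiply(a: int, b: int, m: int, starts: list[int]) -> int:
    """Approximate a * b using one m x m multiplication."""
    seg_a, start_a = select_segment(a, m, starts)
    seg_b, start_b = select_segment(b, m, starts)
    # With two start positions the shift is 0, (n - m), or 2 * (n - m).
    return (seg_a * seg_b) << (start_a + start_b)


def ssm_starts(n: int, m: int) -> list[int]:
    """Two fixed positions: lower segment and upper segment."""
    return [0, n - m]


def essm_starts(n: int, m: int) -> list[int]:
    """Three positions for m = n/2; the middle one is an assumed placement."""
    return [0, (n - m) // 2, n - m]


if __name__ == "__main__":
    a, b = 0x0123, 0x0145  # leading ones just above the lower byte
    print(a * b)                                         # exact
    print(segment_multiply(a, b, 8, ssm_starts(16, 8)))    # SSM 8 x 8
    print(segment_multiply(a, b, 8, essm_starts(16, 8)))   # ESSM 8 x 8
    print(segment_multiply(a, b, 10, ssm_starts(16, 10)))  # SSM 10 x 10
```

For this operand pair, which resembles the low-accuracy case discussed above, the two-position 8-bit model recovers roughly 69% of the exact product, the three-position model roughly 97%, and the 10-bit model is exact because both operands fit in the lower 10-bit segment.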
Figure 6 illustrates an SSM that takes an m-bit segment from one of two possible bit positions of an n-bit operand. The key advantage is its scalability for various m and n, because the complexity (i.e., area and energy consumption) of the auxiliary circuits for choosing/steering m-bit segments and expanding a 2m-bit result to a 2n-bit result scales linearly with m.
For applications where one of the operands of each multiplication is often a fixed coefficient, we propose to pre-compute the bit-wise OR value of B[n − 1:m] and pre-select between the two possible m-bit segments (i.e., B[n − 1:n − m] and B[m − 1:0]) in Figure 6, and store them instead of the native B value in memory. This allows us to remove the (n − m)-input OR gate and the m-bit 2-to-1 multiplexer denoted by the dotted lines in Figure 6.

Fig. 6. Proposed approximate multiplier architecture; the logic and wires denoted by the dotted lines are not needed if B is pre-processed as proposed.
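The coefficient pre-processing described above can be sketched as an offline step plus a simplified multiply. The Python below is a behavioral model for non-negative operands with two segment positions; the function names and the (segment, flag) storage layout are assumptions made for illustration.

```python
def preprocess_coefficient(b: int, n: int, m: int) -> tuple[int, bool]:
    """Offline step: return (segment, upper) to store instead of the native b.

    upper is the bit-wise OR of B[n-1:m]; segment is B[n-1:n-m] when upper
    is set and B[m-1:0] otherwise.
    """
    mask = (1 << m) - 1
    upper = (b >> m) != 0
    shift = n - m if upper else 0
    return (b >> shift) & mask, upper


def multiply_with_preprocessed(a: int, seg_b: int, b_upper: bool,
                               n: int, m: int) -> int:
    """Run-time step: only operand a still needs the OR gate and multiplexer."""
    mask = (1 << m) - 1
    a_upper = (a >> m) != 0
    shift_a = n - m if a_upper else 0
    seg_a = (a >> shift_a) & mask
    shift_b = n - m if b_upper else 0
    # Expand the 2m-bit product to 2n bits using the stored flag for b.
    return (seg_a * seg_b) << (shift_a + shift_b)


if __name__ == "__main__":
    n, m = 16, 8
    seg_b, b_upper = preprocess_coefficient(0x1234, n, m)  # fixed coefficient
    print(seg_b, b_upper)
    print(multiply_with_preprocessed(100, seg_b, b_upper, n, m))
    print(100 * 0x1234)  # exact product for comparison
```

In this model the stored form takes m bits plus one select bit per coefficient rather than n bits, and the selection work for B is done once rather than on every multiplication.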
Finally, to support three possible starting bit positions for picking an m-bit segment where m = n/2, the two 2-to-1 multiplexers at the input stage and the one 3-to-1 multiplexer at the output stage are replaced with 3-to-1 and 5-to-1 multiplexers, respectively, along with some minor changes in the logic functions generating multiplexer control signals; we will show later that this enhanced SSM design for m = 8 and n = 16 (denoted by ESSM8 × 8) can provide as good accuracy as SSM10 × 10 at notably lower energy consumption.

III. EVALUATION

Evaluation Methodology: In this section we describe the methodology for evaluating computational accuracy and energy consumption of precise and various approximate multipliers. All the multipliers are described to support two 16-bit inputs and a 32-bit output with Verilog HDL and synthesized using Synopsys Design Compiler® and a TSMC 45nm standard cell library at the typical process corner. We repeatedly synthesize each multiplier until it achieves the highest operating frequency. Then we choose the frequency of the slowest one (i.e., 2 GHz) to re-synthesize all other multipliers.

To evaluate computational accuracy, we take four sets of 16-bit operand pairs from: all possible pairs of 16-bit values (denoted by “random”), noise cancelling algorithm [9] (denoted by “audio”), 2-dimensional optical coherence tomography (2D OCT) [10] (denoted by “image”), and isolated spoken digit recognition [11] (denoted by “recognition”), where each set is comprised of billions of operand pairs. To evaluate energy consumption, we use Synopsys PrimeTime-PX®, which can estimate energy consumption of a synthesized design based on annotated switching activities from gate-level simulation. The input vectors for energy estimation are directly taken from the execution of multiplication-intensive kernels in each application. We observe that the extracted input vectors exhibit inherent periodicity in the operand values applied to the multiplier. Thus, we take many such periods such that the number of vectors is at least 10,000.

Computational Accuracy: Figure 5 plots the probability distribution of computational accuracy of AM, DSM8 × 8, DSM6 × 6, SSM8 × 8, SSM10 × 10, ESSM8 × 8, and TRUN8 × 8 for four sets of operand pairs. We observe that the average computational accuracy of all these approximate multipliers is very high. For “random,” AM, DSM8 × 8, DSM6 × 6, SSM8 × 8, SSM10 × 10, ESSM8 × 8, and TRUN8 × 8 exhibit average compute accuracy of 96.7%, 99.7%, 97.8%, 98.0%, 99.6%,
99.0%, and 97.1%, respectively. However, for three classes of applications, AM, SSM8 × 8, and TRUN8 × 8 show notably deteriorating accuracy compared to SSM10 × 10 and ESSM8 × 8. For example, AM, SSM8 × 8, and TRUN8 × 8 can achieve computational accuracy higher than 95% only for 45%, 64%, and 61% of operand pairs from “image.” In contrast, other approximate multipliers such as DSM8 × 8, SSM10 × 10, and ESSM8 × 8 can offer computational accuracy higher than 95% for 100%, 98%, and 98% of operand pairs, respectively. We expect that the high computational accuracy of SSM10 × 10 and ESSM8 × 8 for such a high fraction of operand pairs will barely impact quality of computing (QoC). The computational accuracy trend of “audio” and “recognition” is similar to that of “image” as shown in Figure 5.

QoC: Table 1 tabulates the QoC obtained using different approximate multipliers relative to PM. To measure QoC, we use perceptual evaluation of speech quality (PESQ) [12] and structural similarity (SSIM) [13] for “audio” and “image,” respectively. In “audio” and “image,” PESQ and SSIM that are higher than 99% do not incur notable perceptual difference. Thus, our SSM10 × 10 and ESSM8 × 8 are sufficient for QoC while TRUN8 × 8 shows considerable QoC degradation for all three applications.

TABLE I. QoC OF APPLICATIONS USING APPROXIMATE MULTIPLIERS RELATIVE TO THE PRECISE MULTIPLIER.

Energy and Area Analysis: Figure 7 shows the average energy/op of AM, DSM8 × 8, DSM6 × 6, SSM8 × 8, SSM10 × 10, ESSM8 × 8, and TRUN8 × 8 for each of four sets of operand pairs, normalized to that of PM. The average energy/op of AM, DSM8 × 8, and DSM6 × 6 across all four sets of operand pairs is 13%, 3%, and 28% lower than that of PM, respectively. In contrast, the average energy/op of SSM10 × 10 and ESSM8 × 8, which can offer sufficient computational accuracy and QoC, is 35% and 58% lower than that of PM. ESSM8 × 8 that is simplified to accept pre-processed fixed coefficients consumes 6% less energy than the original ESSM8 × 8. We do not provide detailed energy/op analysis for TRUN8 × 8, SSM8 × 8, and SSM12 × 12, because (i) TRUN8 × 8 and SSM8 × 8 may not provide sufficient computational accuracy and QoC regardless of very low energy/op and (ii) SSM12 × 12 does not exhibit notably higher computational accuracy and QoC than SSM10 × 10 while consuming much higher energy/op.

Fig. 7. Energy/op of AM, DSM8 × 8, DSM6 × 6, SSM8 × 8, SSM10 × 10, and ESSM8 × 8, relative to that of PM for four sets of operand pairs.

Analyzing the energy/op reduction of the evaluated multipliers, we first note that the energy/op of AM can be even lower than that of PM at a higher target synthesis frequency [7]. However, the target synthesis frequency is limited by DSM8 × 8, while the energy/op of the multipliers should be compared at the same target frequency. Even if we remove DSM8 × 8 from the comparison, which allows us to increase the target synthesis frequency for AM, SSM10 × 10, ESSM8 × 8, and PM, we see that the relative energy/op difference between AM and SSM10 × 10 (or ESSM8 × 8) does not notably change. In other words, SSM10 × 10 and ESSM8 × 8 also benefit from a higher target synthesis frequency, exhibiting lower energy/op.

Second, we observe that the power overhead of extra logic such as LODs and shifters is considerably larger than that of the 8 × 8 and 6 × 6 multipliers themselves for DSM8 × 8 and DSM6 × 6, although the bit width of the multiplier is half that of PM. This significantly reduces the overall benefit of DSM8 × 8 and DSM6 × 6. Furthermore, the fraction of logic gates switching in the m × m multipliers for DSM8 × 8 and DSM6 × 6 can be much higher than in the 16 × 16 multiplier for PM. This is because DSM8 × 8 and DSM6 × 6 remove redundant sign-extension bits, which may incur no switching in many logic gates corresponding to the MSB portion of PM before each multiplication.

Fig. 8. Breakdown of area and energy/op of 16 × 16.

Figure 8 depicts the breakdown of area and energy/op of 16 × 16 multipliers using DSM, SSM, and ESSM for various m, normalized to those of the 16 × 16 PM. In the plot, each bar is comprised of area (or energy/op) of the base m × m multiplier (denoted by “base m × m mult”) and the remaining
components (denoted by “rest”) such as the segment selection logic and multiplexers in SSMm × m, and LODs and shifters in DSMm × m, respectively. The average energy/op in the plot is based on “random.”

First, the area of SSMm × m and DSMm × m is very closely correlated with the energy/op. For example, SSM10 × 10 consumes only 62% and 58% of PM’s area and energy/op, respectively. Second, the base m × m multiplier contributes considerable area and energy/op for SSMm × m, while the remaining components dominate those for DSMm × m. For example, the 10 × 10 multiplier is responsible for 67% and 71% of total area and energy/op of SSM10 × 10, respectively. In other words, SSM is much more efficient than DSM because the overhead of the extra circuits to support SSM is small.

Third, we observe that increasing the number of possible starting bit positions to take an m-bit segment from two to three does not notably increase the area and energy/op because both the area and energy/op are dominated by the base m × m multiplier. Finally, DSM6 × 6 does not significantly reduce area and energy/op compared to DSM8 × 8 because its area and energy/op are dominated by the peripheral gates as discussed earlier.

IV. CONCLUSION

In this paper, we propose an approximate multiplier that can trade off accuracy and energy/op at design time for DSP and recognition applications. Our proposed approximate multiplier takes m consecutive bits (i.e., an m-bit segment) of an n-bit operand either starting from the MSB or ending at the LSB and applies the two segments that include the leading ones from two operands (i.e., SSM) to an m × m multiplier. Compared to PM and to an approach that identifies the exact leading one positions of two operands and applies two m-bit segments starting from the leading one positions (i.e., DSM), ours consumes much less energy and area. This improved energy and area efficiency comes at the cost of slightly lower compute accuracy than PM and DSM. However, we demonstrate that the small loss of compute accuracy using SSM does not notably impact QoC of the image, audio, and recognition applications we evaluated. On average, 16 × 16 ESSM8 × 8 can achieve 99% computational accuracy with negligible degradation in QoC for audio, image, and recognition applications, while consuming only 42% of the energy/op of PM.

ACKNOWLEDGEMENT

This work was supported in part by generous grants from NSF (CCF-0953603) and DARPA (HR0011-12-2-0019). Nam Sung Kim has a financial interest in AMD and Samsung Electronics. Nam Sung Kim and Taejoon Park equally contributed to this work.

REFERENCES

[1] R.K. Krishnamurthy and H. Kaul, "Ultra-low Voltage Technologies for Energy-efficient Special-Purpose Hardware Accelerators," Intel Technology J., vol. 13, no. 4, pp. 102-117, 2009.
[2] R. Hegde and N.R. Shanbhag, "Energy-efficient signal processing via algorithmic noise-tolerance," in IEEE/ACM Int. Symp. Low Power Electronics and Design (ISLPED), 1999, pp. 30-35.
[3] D. Menard, D. Chillet, C. Charot, and O. Sentieys, "Automatic floating-point to fixed-point conversion for DSP code generation," in ACM Int. Conf. Compilers, Arch., and Syn. for Embedded Syst. (CASES), 2002, pp. 270-276.
[4] V.K. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S.T. Chakradhar, "Scalable effort hardware design: Exploiting algorithmic resilience for energy efficiency," in IEEE/ACM Design Automation Conf., 2010, pp. 555-560.
[5] D. Mohapatra, G. Karakonstantis, and K. Roy, "Significance driven computation: a voltage-scalable, variation-aware, quality-tuning motion estimator," in IEEE/ACM Int. Symp. Low Power Electronics and Design (ISLPED), 2009, pp. 195-200.
[6] C.H. Chang and R.K. Satzoda, "A low error and high performance multiplexer-based truncated multiplier," IEEE T. on Very Large Scale Integration (VLSI) Syst., vol. 18, no. 12, pp. 1767-1771, Dec 2010.
[7] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading Accuracy for Power with an Underdesigned Multiplier Architecture," in IEEE Int. Conf. VLSI Design (VLSID), 2011, pp. 346-351.
[8] Z. Babić, A. Avramović, and P. Bulić, "An iterative logarithmic multiplier," Microprocessors and Microsyst., vol. 35, no. 1, pp. 23-33, 2011.
[9] B. Widrow et al., "Adaptive noise cancelling: Principles and applications," Proceedings of the IEEE, vol. 63, no. 12, pp. 1692-1716, Dec 1975.
[10] K. Zhang and J. U. Kang, "Graphics Processing Unit-Based Ultrahigh Speed Real-Time Fourier Domain Optical Coherence Tomography," IEEE J. Selected Topics in Quantum Electronics, vol. 18, no. 4, pp. 1270-1279, Jul-Aug 2012.
[11] D. Verstraeten, B. Schrauwen, D. Stroobandt, and J. Van Campenhout, "Isolated word recognition with the Liquid State Machine: a case study," Inf. Process. Lett., vol. 95, no. 6, pp. 521-528, Sep 2005.
[12] Y. Hu and P.C. Loizou, "Evaluation of Objective Quality Measures for Speech Enhancement," IEEE T. Audio, Speech, and Lang. Process, vol. 16, no. 1, pp. 229-238, Jan 2008.
[13] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE T. Image Processing, vol. 13, no. 4, pp. 600-612, Apr 2004.
