paper6
Abstract— The need to support various digital signal processing (DSP) and classification applications on energy-constrained devices has steadily grown. Such applications often extensively perform matrix multiplications using fixed-point arithmetic while exhibiting tolerance for some computational errors. Hence, improving the energy efficiency of multiplications is critical. In this paper, we propose multiplier architectures that can trade off computational accuracy with energy consumption at design time. Compared to a precise multiplier, the proposed multiplier can consume 58% less energy/op with average computational error of ∼1%. Finally, we demonstrate that such small computational error does not notably impact the quality of DSP and the accuracy of classification applications.

I. INTRODUCTION

Achieving high energy efficiency has become a key design objective for embedded and mobile computing devices due to their limited battery capacity and power budget. To improve energy efficiency of such computing devices, significant effort has already been devoted at various levels, from software to architecture, and all the way down to circuit and technology levels.

Embedded and mobile computing devices are frequently required to execute some key digital signal processing (DSP) and classification applications. To further improve energy efficiency of executing such applications, first, dedicated specialized processors are often integrated in computing devices. It has been reported that the use of such specialized processors can improve energy efficiency by 10∼100× compared to general-purpose processors at the same voltage and technology generation [1].

Second, many DSP and classification applications heavily rely on complex probabilistic mathematical models and are designed to process information that typically contains noise. Thus, for some computational error, they exhibit graceful degradation in overall DSP quality and classification accuracy instead of a catastrophic failure. Such computational error tolerance has been exploited by trading accuracy for energy consumption (e.g., [2]).

Finally, these algorithms are initially designed and trained with floating-point (FP) arithmetic, but they are often converted to fixed-point (FxP) arithmetic due to the area and power cost of supporting FP units in embedded computing devices [3]. Although this conversion process leads to some loss of computational accuracy, it does not notably affect the quality of DSP and the accuracy of classification applications due to computational error tolerance.

Most such algorithms extensively perform matrix multiplications as their fundamental operation, while a multiplier is typically an inherently energy-hungry component. To improve energy efficiency of multipliers, previous studies have explored various techniques exploiting computational error tolerance. They can be classified into three categories: (i) aggressive voltage scaling [4,5], (ii) truncation of bit-width [4,6], and (iii) use of inaccurate building blocks [7]. Chippa et al. proposed scalable effort hardware design and explored algorithm-, architecture-, and circuit-level scaling to minimize energy consumption while offering acceptable classification quality through aggressive voltage scaling and truncation of least-significant bits [4]. Kulkarni et al. proposed an underdesigned 16 × 16 multiplier using inaccurate 2 × 2 partial product generators (PPG) while guaranteeing the minimum and maximum accuracy fixed at design time. Each PPG has fewer transistors compared to the accurate 2 × 2 one, reducing both dynamic and leakage energy at the cost of some accuracy loss [7]. Babić et al. proposed a novel iterative log approximate multiplier using leading one detectors (LODs) to support variable accuracy [8].

In this paper, we propose an approximate multiplication technique that takes m consecutive bits (i.e., an m-bit segment) from each n-bit operand, where m is equal to or greater than n/2. An m-bit segment can start only from one of two or three fixed bit positions, depending on where the leading one bit is located for a positive number. This approach can provide much higher accuracy than one simply truncating the LSBs, because it can more effectively capture more noteworthy bits. Although we can capture m-bit segments starting from the exact leading one bit position, such an approach requires expensive LODs and shifters to take m-bit segments starting from the leading one position, steer them to an m × m multiplier, and expand 2m bits to 2n bits. In contrast, our approach is more scalable than one that captures m-bit segments starting from the leading one bits, since it limits the possible starting bit positions of an m-bit segment to two or three regardless of m and n chosen at design time, eliminating LODs and replacing shifters with multiplexers. Finally, we also observe that one of the two operands in each multiplication for DSP and classification algorithms is often stored in memory (e.g., coefficients in filter algorithms and trained weight values in artificial neural networks (ANN)) and repeatedly used. We exploit this observation to further improve the energy efficiency of our approximate multiplier.
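
The segment selection described above can be illustrated with a minimal software sketch. It is a behavioral model only, not the hardware design, and it assumes unsigned n-bit operands and the variant with two fixed starting positions (a segment either starts at the MSB or ends at the LSB). The function names `ssm_multiply`, `dsm_multiply`, and `truncated_multiply` are hypothetical and introduced here for illustration; the latter two model the leading-one-based and plain-truncation alternatives for comparison.

```python
from typing import Tuple


def ssm_segment(x: int, n: int, m: int) -> Tuple[int, int]:
    """Pick an m-bit segment of an unsigned n-bit value x from one of two
    fixed positions. Returns (segment, shift) with segment << shift ~= x."""
    assert 0 <= x < (1 << n) and 2 * m >= n and m <= n
    if x >> m:                 # leading one lies above the low m bits
        shift = n - m          # take the segment that starts at the MSB
        return x >> shift, shift
    return x, 0                # low m bits hold x exactly


def ssm_multiply(a: int, b: int, n: int, m: int) -> int:
    """Approximate a * b with one m x m multiplication (two fixed positions)."""
    seg_a, sh_a = ssm_segment(a, n, m)
    seg_b, sh_b = ssm_segment(b, n, m)
    return (seg_a * seg_b) << (sh_a + sh_b)


def dsm_multiply(a: int, b: int, n: int, m: int) -> int:
    """Comparison model: segments start at the exact leading one position."""
    def segment(x: int) -> Tuple[int, int]:
        shift = max(x.bit_length() - m, 0)   # software stand-in for an LOD
        return x >> shift, shift
    seg_a, sh_a = segment(a)
    seg_b, sh_b = segment(b)
    return (seg_a * seg_b) << (sh_a + sh_b)


def truncated_multiply(a: int, b: int, n: int, m: int) -> int:
    """Comparison model: always keep the upper m bits, dropping the LSBs."""
    shift = n - m
    return ((a >> shift) * (b >> shift)) << (2 * shift)


if __name__ == "__main__":
    a, b, n, m = 0x1234, 0x0010, 16, 8
    print(a * b)                           # exact product: 74560
    print(ssm_multiply(a, b, n, m))        # 73728 (about 1.1% low)
    print(dsm_multiply(a, b, n, m))        # 74240
    print(truncated_multiply(a, b, n, m))  # 0, small operand is lost entirely
```

In this example the small operand (0x0010) fits entirely in the low segment, so the two-position scheme keeps it exactly, whereas plain truncation discards it. The leading-one-based model is slightly more accurate but needs the position search that the fixed-position scheme avoids.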
Digital Object Identifier: 10.1109/TVLSI.2017.2333366
1557-9999 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 5. Probability distribution of compute accuracy of AM2 × 2, DSM8 × 8, DSM6 × 6, SSM8 × 8, ESSM8 × 8, and 8 × 8 (truncated) for random vectors,
audio/image processing, and recognition applications.
components (denoted by “rest”) such as the segment selection logic and multiplexers in SSMm × m, and LODs and shifters in DSMm × m, respectively. The average energy/op in the plot is based on “random.”

First, the areas of SSMm × m and DSMm × m are very closely correlated with the energy/op. For example, SSM10 × 10 consumes only 62% and 58% of PM’s area and energy/op, respectively. Second, the base m × m multiplier contributes considerable area and energy/op for SSMm × m, while the remaining components dominate those for DSMm × m. For example, the 10 × 10 multiplier is responsible for 67% and 71% of total area and energy/op of SSM10 × 10, respectively. In other words, SSM is much more efficient than DSM because the overhead of the extra circuits to support SSM is small. Third, we observe that increasing the number of possible starting bit positions to take an m-bit segment from two to three does not notably increase the area and energy/op because both the area and energy/op are dominated by the base m × m multiplier. Finally, DSM6 × 6 does not significantly reduce area and energy/op compared to DSM8 × 8 because its area and energy/op are dominated by the peripheral gates as discussed earlier.

IV. CONCLUSION

In this paper, we propose an approximate multiplier that can trade off accuracy and energy/op at design time for DSP and recognition applications. Our proposed approximate multiplier takes m consecutive bits (i.e., an m-bit segment) of an n-bit operand, either starting from the MSB or ending at the LSB, and applies the two segments that include the leading ones of the two operands (i.e., SSM) to an m × m multiplier. Our approach consumes much less energy and area than both PM and an approach that identifies the exact leading one positions of two operands and applies two m-bit segments starting from the leading one positions (i.e., DSM). This improved energy and area efficiency comes at the cost of slightly lower compute accuracy than PM and DSM. However, we demonstrate that the small loss of compute accuracy using SSM does not notably impact QoC of the image, audio, and recognition applications we evaluated. On average, 16 × 16 ESSM8 × 8 can achieve 99% computational accuracy with negligible degradation in QoC for audio, image, and recognition applications. At the same time, 16 × 16 ESSM8 × 8 consumes only 42% of the energy/op of PM.

ACKNOWLEDGEMENT

This work was supported in part by generous grants from NSF (CCF-0953603) and DARPA (HR0011-12-2-0019). Nam Sung Kim has a financial interest in AMD and Samsung Electronics. Nam Sung Kim and Taejoon Park equally contributed to this work.

REFERENCES

[1] R.K. Krishnamurthy and H. Kaul, "Ultra-low Voltage Technologies for Energy-efficient Special-Purpose Hardware Accelerators," Intel Technology J., vol. 13, no. 4, pp. 102-117, 2009.
[2] R. Hegde and N.R. Shanbhag, "Energy-efficient signal processing via algorithmic noise-tolerance," in IEEE/ACM Int. Symp. Low Power Electronics and Design (ISLPED), 1999, pp. 30-35.
[3] D. Menard, D. Chillet, C. Charot, and O. Sentieys, "Automatic floating-point to fixed-point conversion for DSP code generation," in ACM Int. Conf. Compilers, Arch., and Syn. for Embedded Syst. (CASES), 2002, pp. 270-276.
[4] V.K. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S.T. Chakradhar, "Scalable effort hardware design: Exploiting algorithmic resilience for energy efficiency," in IEEE/ACM Design Automation Conf., 2010, pp. 555-560.
[5] D. Mohapatra, G. Karakonstantis, and K. Roy, "Significance driven computation: a voltage-scalable, variation-aware, quality-tuning motion estimator," in IEEE/ACM Int. Symp. Low Power Electronics and Design (ISLPED), 2009, pp. 195-200.
[6] C.H. Chang and R.K. Satzoda, "A low error and high performance multiplexer-based truncated multiplier," IEEE T. on Very Large Scale Integration (VLSI) Syst., vol. 18, no. 12, pp. 1767-1771, Dec 2010.
[7] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading Accuracy for Power with an Underdesigned Multiplier Architecture," in IEEE Int. Conf. VLSI Design (VLSID), 2011, pp. 346-351.
[8] Z. Babić, A. Avramović, and P. Bulić, "An iterative logarithmic multiplier," Microprocessors and Microsyst., vol. 35, no. 1, pp. 23-33, 2011.
[9] B. Widrow et al., "Adaptive noise cancelling: Principles and applications," Proceedings of the IEEE, vol. 63, no. 12, pp. 1692-1716, Dec 1975.
[10] K. Zhang and J.U. Kang, "Graphics Processing Unit-Based Ultrahigh Speed Real-Time Fourier Domain Optical Coherence Tomography," IEEE J. Selected Topics in Quantum Electronics, vol. 18, no. 4, pp. 1270-1279, Jul-Aug 2012.
[11] D. Verstraeten, B. Schrauwen, D. Stroobandt, and J. Van Campenhout, "Isolated word recognition with the Liquid State Machine: a case study," Inf. Process. Lett., vol. 95, no. 6, pp. 521-528, Sep 2005.
[12] Y. Hu and P.C. Loizou, "Evaluation of Objective Quality Measures for Speech Enhancement," IEEE T. Audio, Speech, and Lang. Process., vol. 16, no. 1, pp. 229-238, Jan 2008.
[13] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE T. Image Processing, vol. 13, no. 4, pp. 600-612, Apr 2004.