
Received 11 October 2020; revised 15 June 2021; accepted 7 July 2021.

Date of publication 13 July 2021; date of current version 7 June 2022.


Digital Object Identifier 10.1109/TETC.2021.3096515

AxRMs: Approximate Recursive Multipliers Using High-Performance Building Blocks

HAROON WARIS (Student Member, IEEE), CHENGHUA WANG, CHENYU XU, AND WEIQIANG LIU (Senior Member, IEEE)
The authors are with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
CORRESPONDING AUTHOR: WEIQIANG LIU ([email protected])

ABSTRACT Recursive multipliers (RMs) have been classified as a class of low-power multipliers because they provide a wide range of power-quality configuration options. 2×2 multipliers are the constituent building blocks of this recursive topology; however, most state-of-the-art approximate recursive designs are based on 4×4 building blocks. Therefore, the design space exploration of AxRMs using 2×2 multipliers is still an open research problem. To add configurability and flexibility to the design of AxRMs, 2-bit multipliers are required that exhibit high performance and low area. In this article, two approximate 2×2 multipliers are proposed that exhibit a double-sided error distribution. Compared to the existing best approximate 2×2 multiplier, the proposed design achieves a reduction of 52 percent in area and an improvement of 25 percent in delay while having a bounded error behavior. Then, three 8×8 multipliers of variable accuracy are designed using different configurations of the approximate 2×2 multipliers. AxRM1 is the most accurate design; an improvement of 50 percent in terms of mean relative error distance (MRED) is achieved compared to the existing best MRED-optimized design. AxRM3 has a similar MRED to the previous best 2×2-based AxRM (called MACISH); however, AxRM3 exhibits a 13 percent better power-delay product (PDP) than MACISH due to the use of low-power and high-performance 2×2 multipliers in building larger multipliers. The proposed approximate multipliers are applied to a cutting-edge error-tolerant application, i.e., convolutional neural networks. AxRM2 provides the best quality-power trade-off: 32.64 percent power savings are achieved with 1.10 percent better classification accuracy.
INDEX TERMS Low-power, accuracy-energy trade-off, error compensation, building blocks, Pareto front

I. INTRODUCTION
Modern applications such as artificial intelligence and cloud computing require a huge amount of data processing. To enable the design exploration of these applications, there is a high demand for high-performance and power-efficient embedded platforms. However, with the approaching end of technology node scaling, it is hard to realize such hardware accelerators. Meanwhile, it has been reported in the technical literature that stringent accuracy is not required in these applications. Therefore, approximate computing (AC) has emerged as a design alternative to address this exceptional challenge [1]. Approximation can be introduced at both the hardware and software layers. At the hardware layer, arithmetic units are the most power-hungry modules; thus, the design of approximate arithmetic units has been a promising area of research.

Multiplication is a basic arithmetic operation commonly used in many signal processing applications [2], [3]. In the technical literature, the design of approximate recursive multipliers (AxRMs) has been investigated using both 4×4 and 2×2 multipliers. In [4], three approximate 4:2 compressors are proposed and used in the design of 4-bit multipliers. However, both the carry and the sum are approximated in the proposed compressors; because the carry has the higher binary weight, this deteriorates the results to a greater extent. A 4×4 multiplier based on inexact half adder and full adder cells has been proposed in [5]. A reduced hardware complexity has been achieved; however, the design exhibits a higher normalized mean error distance (NMED) because a two-stage approximation has been utilized. Recently, Waris et al. [6] have proposed hybrid partial product-based 4-bit multipliers by considering the probability distribution of the input operands. The proposed building blocks achieved an

2168-6750 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
VOLUME 10, NO. 2, APRIL-JUNE 2022
Waris et al.: AxRMs: Approximate Recursive Multipliers Using High-Performance Building Blocks

TABLE 1. Truth table and error distance of the proposed approximate 2×2 multipliers with error compensation feature.

Inputs         Exact          Mul2a          Mul2b          Output (Dec.)     ED1   ED2
a1 a0 b1 b0    c3 c2 c1 c0    c3 c2 c1 c0    c3 c2 c1 c0    Mul2a   Mul2b
0  0  0  0     0  0  0  0     0  0  0  0     0  0  0  0     0       0          0     0
0  0  0  1     0  0  0  0     0  0  0  1     0  0  0  0     1       0         +1     0
0  0  1  0     0  0  0  0     0  0  0  0     0  0  0  0     0       0          0     0
0  0  1  1     0  0  0  0     0  0  0  1     0  0  0  0     1       0         +1     0
0  1  0  0     0  0  0  0     0  0  0  0     0  0  0  0     0       0          0     0
0  1  0  1     0  0  0  1     0  0  0  1     0  0  0  1     1       1          0     0
0  1  1  0     0  0  1  0     0  0  1  0     0  0  0  0     2       0          0    −2
0  1  1  1     0  0  1  1     0  0  1  1     0  0  0  1     3       1          0    −2
1  0  0  0     0  0  0  0     0  0  0  0     0  0  1  0     0       2          0    +2
1  0  0  1     0  0  1  0     0  0  1  1     0  0  1  0     3       2         +1     0
1  0  1  0     0  1  0  0     0  1  0  0     0  1  1  0     4       6          0    +2
1  0  1  1     0  1  1  0     0  1  1  1     0  1  1  0     7       6         +1     0
1  1  0  0     0  0  0  0     0  0  0  0     0  0  1  0     0       2          0    +2
1  1  0  1     0  0  1  1     0  0  1  1     0  0  1  1     3       3          0     0
1  1  1  0     0  1  1  0     0  1  1  0     0  1  1  0     6       6          0     0
1  1  1  1     1  0  0  1     0  1  1  1     0  1  1  1     7       7         −2    −2
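
The truth table can be captured in a few lines of behavioral code. The sketch below is a minimal Python model, not the authors' Verilog: the Boolean expressions are inferred from the decimal outputs in Table 1 (the gate-level diagram is not reproduced here), chosen so that Mul2a uses three AND gates and one OR gate and Mul2b uses two AND gates, as the text states. All function names are illustrative.

```python
def mul2a(a, b):
    """Behavioral model of Mul2a (expressions inferred from Table 1)."""
    a1, a0 = (a >> 1) & 1, a & 1
    b1, b0 = (b >> 1) & 1, b & 1
    c0 = b0                      # wire only, no gate
    c1 = (a0 & b1) | (a1 & b0)   # two AND gates and one OR gate
    c2 = a1 & b1                 # third AND gate
    return (c2 << 2) | (c1 << 1) | c0   # c3 is always 0


def mul2b(a, b):
    """Behavioral model of Mul2b (expressions inferred from Table 1)."""
    a1, a0 = (a >> 1) & 1, a & 1
    b1, b0 = (b >> 1) & 1, b & 1
    c0 = a0 & b0                 # first AND gate
    c1 = a1                      # wire only, no gate
    c2 = a1 & b1                 # second AND gate
    return (c2 << 2) | (c1 << 1) | c0   # c3 is always 0


def error_distances(mul2):
    """Signed error (approximate minus exact) for all 16 input pairs,
    in the row order of Table 1 (a is the outer index, b the inner)."""
    return [mul2(a, b) - a * b for a in range(4) for b in range(4)]


ed1 = error_distances(mul2a)
ed2 = error_distances(mul2b)
print("ED1:", ed1)  # four +1 entries and one -2 entry
print("ED2:", ed2)  # three +2 entries and three -2 entries
print("mean error:", sum(ed1) / 16, sum(ed2) / 16)
```

Summing each list gives +2 for Mul2a and 0 for Mul2b, i.e., mean errors of 0.125 and 0 over the 16 input pairs, which agrees with the mean error distances reported in Table 2.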

improvement of up to 61 percent compared to the exact 4-bit multiplier.

The very first approximate 2×2 multiplier to achieve a reduction in critical path delay and area is referred to as UDM [7]. The 2-bit multiplier proposed by Rehman et al. [8] has a reduced maximum error magnitude compared to [7], with no improvement in delay or area. [9] has proposed a 2-bit multiplier that is a mirror of [7], i.e., it produces an error case (e = +2) that is the additive inverse of that of [7] (e = −2). However, this has been achieved at an increased hardware cost compared to [7]. Another 2-bit multiplier with a large negative error (e = −4) has been proposed in [10]. It exhibits a large delay compared to [7] because an XOR-based design is proposed. The existing 2-bit multipliers [8]–[10] exhibit no improvement in the critical path delay compared to the state-of-the-art approximate multiplier [7]. Moreover, with reference to [7], they have a large area and power consumption. To explore the power-quality trade-off in the design of AxRMs, there is a need for a wide range of power-quality configuration options for approximate arithmetic modules. The main contributions of our work are summarized as follows:
1) Two 2×2 multipliers (called Mul2a and Mul2b) of variable accuracy/power are proposed. Mul2b exhibits an improvement of 52 percent in terms of area compared to the existing state-of-the-art approximate 2-bit multiplier.
2) Large-size multipliers (AxRMs) are then designed using the proposed 2-bit multipliers. A comprehensive error analysis is presented by evaluating the AxRMs against different input distributions.
3) The proposed AxRM1 multiplier outperforms the prior error-energy Pareto front due to its internal error compensation feature, achieving 61 percent better error characteristics for comparable energy dissipation.
The rest of the paper is organized as follows. The proposed approximate 2×2 multipliers are presented in Section II. Section III describes the design of 8×8 multipliers using 2-bit multipliers. Section IV presents the comparative analysis of the proposed and existing state-of-the-art low-power approximate multipliers. The evaluation of the proposed designs for convolutional neural networks is presented in Section V. Section VI concludes the paper.

II. THE PROPOSED 2×2 MULTIPLIER
The double-sided error distribution property has shown promising results in compensating the error [11]. This is because positive and negative errors complement each other, resulting in internal self-healing of the error. In this section, approximate 2×2 multipliers (Mul2a/Mul2b) with an error compensation/cancellation feature are investigated. Table 1 shows the truth table of the proposed 2×2 multipliers (called Mul2a and Mul2b). All the approximated outputs of Mul2b have the same error distance (ED) magnitude of two, whereas Mul2a has only one output with an ED magnitude of two. In Mul2a five outputs are approximated, while Mul2b contains six inaccurate outputs. Among the five approximated outputs of Mul2a, four generate a positive error while one shows a negative error. Similarly, in Mul2b, three positive and three negative errors are introduced.

FIGURE 1. Logic diagrams of the proposed approximate 2×2 multipliers (a) Mul2a and (b) Mul2b.

Figure 1 shows the gate-level logic of the proposed 2×2 multipliers. Mul2a consists of three AND gates and one OR gate, whereas Mul2b is further simplified to two AND gates. As the approximated outputs (in both


FIGURE 2. Utilization of 2-bit multipliers against inputs A = (1110)_2 and B = (1011)_2: (a) exact output, (b) UDM [7] output, (c) Mul2a output, and (d) Mul2b output.
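
The numbers behind Figure 2 can be checked with a short behavioral model. The following Python sketch composes a 4×4 multiplier from four 2×2 multipliers using the weights 1, 4, 4, and 16. The 2×2 models are transcribed from the decimal outputs in Table 1, and UDM is modeled, as an assumption consistent with the text, as exact except that 3×3 yields 7. Function names are illustrative and not taken from the authors' code.

```python
def mul2_exact(a, b):
    return a * b


def mul2_udm(a, b):
    # UDM [7] as described in the text: only 3 x 3 is approximated (to 7).
    return 7 if (a, b) == (3, 3) else a * b


def mul2a(a, b):
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    return ((a1 & b1) << 2) | (((a0 & b1) | (a1 & b0)) << 1) | b0


def mul2b(a, b):
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    return ((a1 & b1) << 2) | (a1 << 1) | (a0 & b0)


def mul4(a, b, mul2):
    """4x4 product from four 2x2 products with weights 1, 4, 4 and 16."""
    a_h, a_l = a >> 2, a & 0b11
    b_h, b_l = b >> 2, b & 0b11
    return (mul2(a_l, b_l)
            + 4 * mul2(a_l, b_h)
            + 4 * mul2(a_h, b_l)
            + 16 * mul2(a_h, b_h))


A, B = 0b1110, 0b1011  # 14 and 11, the operands of Figure 2
models = [("exact", mul2_exact), ("UDM", mul2_udm),
          ("Mul2a", mul2a), ("Mul2b", mul2b)]
results = {name: mul4(A, B, mul2) for name, mul2 in models}
for name, value in results.items():
    print(f"{name:6s} {value:4d}  error {value - A * B:+d}")
```

With these models the script gives 154 (exact), 146 (UDM), 147 (Mul2a), and 154 (Mul2b). Only the exact value and the error-free Mul2b result are stated in the text; the other two values are outputs of this sketch rather than figures quoted from the paper.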

the proposed designs) produce both positive and negative errors, the errors can complement each other in the partial product reduction process. This is further elaborated with a numerical example. Consider a 4×4 multiplier designed using 2×2 multipliers; four 2-bit multipliers are utilized (Figure 2(a)). For example, let A = (1110)_2 and B = (1011)_2. This implies A_H = (11)_2 = 3, A_L = (10)_2 = 2, B_H = (10)_2 = 2, B_L = (11)_2 = 3. Using approximate multipliers affects the result. The output can be expressed as

$$C_{4\times4} = A_L \times B_L + 4(A_L \times B_H) + 4(A_H \times B_L) + 16(A_H \times B_H), \tag{1}$$

where the shift factors are represented by the constants 4 and 16. The multiplication carried out using the existing state-of-the-art approximate 2×2 multiplier [7] is shown in Figure 2(b). In [7], the output for the input (3 × 3) is approximated to 7; therefore, negative errors are always introduced. The same inputs have been passed to the approximate 4×4 multiplier designed using Mul2a. The first two partial products are approximated as shown in Figure 2(c). Approximation of the first partial product introduces an error of +1, while an error of −2 is introduced for the second partial product. These positive and negative errors compensate each other in the partial product reduction step. However, as not all the errors are canceled out, the resultant output has an error compared to the exact result. In contrast, using Mul2b to design a 4-bit multiplier shows a zero-mean error behavior for the specified inputs (Figure 2(d)). The second and third partial products introduce errors of −2 and +2, respectively. As both these partial products have equal numerical weights, perfect error cancellation is performed and the final result is the same as the exact result (Figure 2(a)).

Table 2 shows the hardware savings and the output accuracy (quantified as mean error distance) of the proposed designs compared to the exact and existing state-of-the-art approximate 2-bit multiplier designs. Compared to the exact design, improvements of up to 65 and 72 percent in area and power are achieved, respectively. Mul2b is the fastest and the most power-efficient design, with a near-to-zero mean error profile. It has already been shown through a numerical example that for certain inputs this multiplier can produce zero-mean error. Therefore, using this approximate block in the design of AxRMs can achieve significant hardware savings at the cost of a bounded error. Moreover, Mul2a has a similar mean error distance to the state-of-the-art UDM [7]; however, due to its reduced power consumption it exhibits a 59 percent better PDP than [7].

III. THE PROPOSED 8×8 APPROXIMATE RECURSIVE MULTIPLIERS USING 2×2 MULTIPLIER
An n-bit recursive multiplier can be designed using (n/2)^2 elementary multipliers. In this work, the 8-bit multiplier is under investigation, which is designed using sixteen 2-bit

TABLE 2. Hardware resource analysis and quality characterization of 2×2 multipliers.

Design    Area (μm²)   Delay (ns)   Power (nW)   PDP (aJ)   Mean Error Distance
Exact     3.65         0.09         132.17       11.89      -
Mul2a     1.51         0.04         41.23        1.64       0.1250
Mul2b     1.26         0.03         36.91        1.10       0.0000
UDM       2.64         0.04         100.92       4.03       0.1250
Approx    3.15         0.09         113.58       10.22      0.0625
SquASH    3.15         0.05         131.53       6.57       0.1250
MACISH    2.89         0.06         105.07       6.30       0.2500

FIGURE 3. 8-bit multiplier using 2-bit multiplier (Mul2b) against inputs A = (1110 1110)_2 and B = (1011 1011)_2.
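
The example of Figure 3 and the configurations of Table 3 can be explored with a behavioral model. The Python sketch below nests the recursive decomposition twice (2×2 into 4×4, then 4×4 into 8×8) and evaluates error rate (ER), mean absolute error (MAE), and MRED exhaustively over all 65,536 input pairs, i.e., under a uniform input distribution. It rests on several assumptions: the 2×2 models are transcribed from the decimal outputs in Table 1; an approximated 4×4 block uses the named 2×2 multiplier in all four positions; O_b is taken as A_L × B_H and O_c as A_H × B_L; and the metrics follow their commonly used definitions, with MRED averaged over nonzero exact products only. The values it prints come from this model and need not match the figures reported for the synthesized designs.

```python
def mul2_exact(a, b):
    return a * b


def mul2a(a, b):
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    return ((a1 & b1) << 2) | (((a0 & b1) | (a1 & b0)) << 1) | b0


def mul2b(a, b):
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    return ((a1 & b1) << 2) | (a1 << 1) | (a0 & b0)


def mul4(a, b, mul2):
    """4x4 block built from four 2x2 multipliers of the same type."""
    a_h, a_l, b_h, b_l = a >> 2, a & 3, b >> 2, b & 3
    return (mul2(a_l, b_l) + 4 * mul2(a_l, b_h)
            + 4 * mul2(a_h, b_l) + 16 * mul2(a_h, b_h))


def mul8(a, b, o_d, o_c, o_b, o_a):
    """8x8 product from four 4x4 blocks with weights 1, 16, 16 and 256.
    o_d .. o_a name the 2x2 multiplier used inside each block."""
    a_h, a_l, b_h, b_l = a >> 4, a & 15, b >> 4, b & 15
    return (mul4(a_l, b_l, o_a) + 16 * mul4(a_l, b_h, o_b)
            + 16 * mul4(a_h, b_l, o_c) + 256 * mul4(a_h, b_h, o_d))


E = mul2_exact
CONFIGS = {  # column order of Table 3: Od, Oc, Ob, Oa
    "AxRM1": (E, E, E, mul2b),
    "AxRM2": (E, E, mul2b, mul2b),
    "AxRM3": (E, mul2a, mul2b, mul2b),
}


def metrics(config):
    """Return (ER, MAE, MRED in percent) over all 8-bit operand pairs."""
    wrong = abs_sum = nonzero = 0
    rel_sum = 0.0
    for a in range(256):
        for b in range(256):
            exact = a * b
            err = abs(mul8(a, b, *config) - exact)
            wrong += err != 0
            abs_sum += err
            if exact:  # relative error is undefined for a zero product
                rel_sum += err / exact
                nonzero += 1
    total = 256 * 256
    return wrong / total, abs_sum / total, 100 * rel_sum / nonzero


# Figure 3 example: every block uses Mul2b.
fig3 = mul8(0b11101110, 0b10111011, mul2b, mul2b, mul2b, mul2b)
print(fig3, fig3 == 238 * 187)

for name, config in CONFIGS.items():
    er, mae, mred = metrics(config)
    print(f"{name}: ER={er:.4f}  MAE={mae:.1f}  MRED={mred:.2f}%")
```

For the Figure 3 operands the model returns 44506, which equals 238 × 187, so the errors of the individual 2×2 blocks cancel completely for this input pair.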


TABLE 3. 8-bit multipliers using approximate 2-bit multipliers.

Design   Od         Oc         Ob         Oa
AxRM1    accurate   accurate   accurate   Mul2b
AxRM2    accurate   accurate   Mul2b      Mul2b
AxRM3    accurate   Mul2a      Mul2b      Mul2b

TABLE 4. Error analysis of approximate multipliers.

Multiplier   Input Distribution            ER       MAE   NoEB
AxRM1        Uniform                       0.3247   160   9.54
AxRM1        Geometric (p = 0.005)         0.2458   271   9.23
AxRM1        Gaussian (μ = 124, σ = 35)    0.2961   312   9.39
AxRM2        Uniform                       0.5384   226   7.69
AxRM2        Geometric (p = 0.005)         0.4217   358   7.25
AxRM2        Gaussian (μ = 124, σ = 35)    0.4793   492   7.41
AxRM3        Uniform                       0.7645   374   5.89
AxRM3        Geometric (p = 0.005)         0.6829   451   5.36
AxRM3        Gaussian (μ = 124, σ = 35)    0.7371   592   5.71

multipliers (Figure 3). Approximation is only considered for the 2-bit multipliers, while summation of the bit-shifted partial products is carried out using an exact Wallace tree. Eq. (2) represents the output of the 8×8 multiplier

$$C_{8\times8} = O_a + 16(O_b) + 16(O_c) + 256(O_d), \tag{2}$$

where the shift factors are represented by the constants 16 and 256. Note that O_a and O_d are the least and most significant blocks, respectively. The proposed 2-bit multipliers, when used to design 8-bit multipliers, can achieve a near-to-zero mean error profile due to the double-sided error distribution feature. For example, let A = (1110 1110)_2 and B = (1011 1011)_2. This implies {A_HH, A_LH} = (11)_2 = 3, {A_HL, A_LL} = (10)_2 = 2, {B_HH, B_LH} = (10)_2 = 2, {B_HL, B_LL} = (11)_2 = 3. Within each of the four blocks, the use of Mul2b cancels positive and negative errors in partial products of equal numerical weight. For instance, in the first block O_a, A_LL × B_LH (2×2 → 6) generates a +2 error while A_LH × B_LL (3×3 → 7) produces a −2 error. Similarly, in the most significant block O_d, A_HH × B_HL (3×3 → 7) generates a −2 error while A_HL × B_HH (2×2 → 6) produces a +2 error. For these inputs, the output produced by the approximate 8×8 multiplier is the same as that of the exact multiplier. The numerical calculation of the 8-bit result for the considered inputs is presented in Eq. (3)

$$\left.\begin{aligned} O_{a/b/c/d} &= 3 \times 2 + 4(6) + 4(7) + 16(2 \times 3) = 154 \\ O_{8\times8} &= 154 + 16(154) + 16(154) + 256(154) \\ &= 44506 = (1010\ 1101\ 1101\ 1010)_2 \end{aligned}\right\} \tag{3}$$

Given that PP1 is the least and PP4 is the most significant partial product, whereas PP2 and PP3 are equally significant, multipliers with different inaccuracies can be designed with different configurations. However, to speed up the design space exploration process and to avoid inefficient design points for large-size multipliers, only Pareto-optimal points are beneficial. Compared to Mul2a, Mul2b is the preferred candidate for a Pareto point because it exhibits a near-to-zero mean error profile with reduced power consumption and area. Therefore, the following three designs (Table 3) are considered to build large-size multipliers: (1) the most accurate scaled-up variant using Mul2b, referred to as AxRM1; (2) the most hardware-efficient scaled-up variant using Mul2a and Mul2b, referred to as AxRM3; (3) AxRM2, which offers a good trade-off between accuracy and hardware complexity. The approximation of the O_d block is not considered in the proposed AxRM variants because approximation in the MSB block results in large errors. The error compensation feature of the proposed AxRMs is evaluated for uniform, geometric and Gaussian input distributions. Error rate (ER), mean absolute error (MAE) and number of effective bits (NoEB) are used as error metrics to quantify the quality of the proposed designs. The results in Table 4 show that the proposed AxRMs exhibit the least MAE and the highest NoEB for the uniform input distribution. This is because in that case both positive and negative errors are equally probable, resulting in cancellation/compensation of the error. With respect to the ER metric, the proposed AxRMs show a high ER for the uniform input distribution due to the large number of erroneous outputs. It is pertinent to mention that error compensation is still performed for the other two distributions; however, the high MAE and low NoEB are due to the reduced probability of generating both positive and negative errors.

IV. COMPARATIVE EVALUATION
The approximate multipliers have been synthesized using Synopsys Design Compiler ("ultra compile") for the 28 nm CMOS technology node with a supply voltage of 1 V and a temperature of 25 °C. It is pertinent to mention that the reported results (delay, area, power and power-delay product in Table 5) are obtained when optimizing for area. Power analysis is performed using the Synopsys PrimeTime tool; default settings are used for the input drive strength and the output load. Random input vectors over the entire input space of 8×8 multipliers (65,536) are applied to generate the back-annotated switching activity file. Behavioral models of the proposed AxRMs are developed for accuracy assessment. ER and ED are the basic error metrics; however, they cannot be used for a fair comparison of two approximate designs with different numbers of bits. Therefore, the mean relative error distance (MRED) has been used to overcome this limitation.

Compared multipliers include AxRMs designed using 4×4 building blocks [4]–[6], AxRMs proposed using 2×2 building blocks [7], [8], [10] and existing state-of-the-art low-power approximate multipliers [12]–[16]. It is worth mentioning that in [12] approximate 4-2 compressors are used only in the n less significant columns of the partial-product matrix and in [14], {t,h} are the truncation parameters which


TABLE 5. Performance analysis of approximate multipliers.

Design              Delay (ns)   Area (μm²)   Power (μW)   PDP (μW·ns)   MRED (%)
Wallace             1.09         185.42       160.23       174.65        -
AxRM1               0.81         171.26       150.33       121.76        0.05
AxRM2               0.78         158.73       136.20       106.23        0.74
AxRM3               0.75         143.58       126.39       94.79         1.34
M8-5 [4]            0.82         178.05       156.24       128.11        0.13
LOAM [5]            0.85         155.67       133.69       113.63        1.81
Ax8-1 [6]           0.81         177.81       156.29       126.59        0.10
UDM1 [7]            0.81         178.62       156.37       126.65        0.36
UDM2 [7]            0.80         165.38       147.91       118.32        2.14
UDM3 [7]            0.78         148.65       128.49       100.22        3.28
Approx1 [8]         1.09         180.59       157.42       171.58        0.03
Approx2 [8]         1.09         168.24       149.63       163.09        0.63
Approx3 [8]         1.09         153.47       136.87       149.18        1.51
MACISH1 [10]        0.86         179.63       157.19       135.18        0.11
MACISH2 [10]        0.84         166.78       148.26       124.53        0.86
MACISH3 [10]        0.82         150.62       133.47       109.44        1.34
Hybrid [12]         0.75         163.00       152.00       114.00        0.42
LOBO [13]           0.77         159.42       150.73       116.06        0.29
TOSAM(1,5) [14]     0.83         147.25       134.73       111.82        0.51
mul8u_14VP [15]     0.94         172.49       156.66       147.26        0.14
mul8u_ZFB [15]      0.79         148.96       139.27       110.02        0.80
DRUM4 [16]          0.88         136.53       127.49       112.19        0.80

FIGURE 4. MRED-PDP analysis of approximate multipliers.

dictate the amount of partial product reduction. The considered EvoApprox designs {mul8u_14VP, mul8u_ZFB} are from the Pareto-optimal subsets of MRED versus power. They are selected because they have an MRED comparable to that of our proposed AxRMs. The parameter m in [16] denotes the bit length of the rounded input operands. Moreover, for a fair comparative analysis the existing 2×2 multipliers [7], [8], [10] have also been implemented using the same setup as used for the proposed multipliers, i.e., as in Table 3.

Hardware performance (power, area, delay and power-delay product) of the considered approximate multipliers is shown in Table 5. [4] and [6] have used the propagate-and-generate (PG) approach to reduce the probability of a particular partial product bit being one. The PG methodology has an inherent hardware cost: one PG pair utilizes three AND gates and one OR gate (in total, four gates). For instance, in [6] four PG pairs are generated for a 4×4 multiplier; thus, sixteen extra gates are required. Moreover, as more PG pairs are required for larger multipliers, the hardware resources increase. This is the reason the most accurate designs of [4], [6] (M8-5 and Ax8-1) have a large area. LOAM has divided the 8-bit multiplier into lower and upper parts. OR-based approximation is used in the lower part, while inexact half and full adders (FAs) have been used in the upper part. This approach achieves the least area among all the 4-bit based AxRMs. However, as the proposed inexact FAs have the same critical path delay as the exact FA, the design exhibits a high delay. Among the existing 2×2-based AxRMs, UDM3 has a power consumption comparable to that of AxRM3; however, UDM3 exhibits a high MRED as error accumulates with no internal error compensation. In contrast to UDM, the Approx multiplier [8] has a maximum ED of one, resulting in a better MRED, but has a high PDP due to no improvement in the delay of its 2×2 multiplier. MACISH has proposed a 2-bit multiplier with a large error distance (ED = −4). To compensate this ED for large AxRMs, MACISH has combined the designs of [7]–[9] with [10]. However, the approximate 2×2 multiplier of [10] has a large delay and power; therefore, the large multipliers (MACISH-1/2/3) designed using this methodology show low performance.

Among the considered state-of-the-art low-power multipliers, the hybrid multiplier [12] exhibits the least delay. The lower delay has been achieved through the use of two compressor types to reduce the partial product matrix (PPM). A low-power/low-error compressor has been used in the rightmost/leftmost columns of the PPM, respectively. The LOBO multiplier [13] uses approximate logarithmic partial product generation (LPPG) for the LSBs, while exact radix-4 Booth encoding has been used for the MSBs. The data path pruning technique has been used in the LPPG stage, which reduces the length of the barrel shifters and in turn achieves lower power consumption. Truncation and rounding methods have been combined in TOSAM [14] to achieve large improvements in area utilization. The number of partial products has been reduced by truncating the input operands according to their leading one-bit position. EvoApprox designs [15] have an MRED comparable to that of the proposed AxRMs; however, they exhibit a large delay. The DRUM4 [16] approach reduces a large multiplier to a significantly smaller multiplier by exploiting the fact that not all operand bits of the multiplier are equally important. Therefore, it achieves noticeable improvements in area utilization and power consumption. The MRED-PDP Pareto diagram of all the multipliers is shown in Figure 4. This error-energy comparison gives the Pareto front formed by the most efficient designs. The proposed AxRMs outperform the existing designs and provide new Pareto points. AxRM1 is the most accurate design, achieving a 50 percent improvement in MRED compared to the previous best low-MRED design (Ax8-1). AxRM2 provides an accuracy-energy trade-off: it exhibits a slightly higher MRED with 39.17 percent energy savings compared to the Wallace tree multiplier. Among the high-MRED designs, AxRM3 and MACISH3 have similar MREDs; however, AxRM3 shows an


TABLE 6. Evaluation of approximate multipliers using AlexNet CNN.

Multiplier     Top-1 (%)   Top-5 (%)   Power Reduction (%)
Exact          56.80       79.90       -
AxRM1          58.20       81.30       20.93
AxRM2          57.90       81.00       32.64
AxRM3          57.10       80.20       44.97
Hybrid [12]    55.75       78.86       28.41
LOBO [13]      56.12       79.25       26.35
TOSAM [14]     55.06       78.15       29.79

improvement of 13 percent in terms of PDP. This is due to the use of high-performance 2×2 multipliers for building large multipliers.

The scaling of the proposed AxRMs in terms of error and hardware gains is examined. We consider the accurate 8-bit multiplier designed using 2-bit multiplier blocks as the base architecture for comparison. An n × n recursive multiplier is constructed using (n/2)^2 elementary (2×2) multipliers. Therefore, 8/16/32-bit multipliers consist of 16/64/256 2×2 blocks, respectively. Area reduction is used as a metric to demonstrate the scalability (in terms of hardware gains) of the proposed multipliers. The area utilized by an exact 2×2 block is 3.65 μm², while an approximate 2×2 block Mul2a/Mul2b utilizes 1.51/1.26 μm². Figure 5(a) shows that the area reduction scales up to 52 percent with increasing multiplier size. The error scalability cannot be theoretically calculated for the proposed approximate recursive multipliers (AxRMs). This is because the proposed multipliers exhibit the double-sided error distribution property, which is highly dependent on the input data set. Therefore, 16/32-bit AxRMs are designed to examine the error scaling, and the results are shown in Figure 5(b). Note that the MRED reported is for the uniform input distribution. The error scaling behavior is bounded due to the error compensation feature and is thus consistent with the error analysis presented in Section III.

FIGURE 5. (a) Hardware and (b) error gains for scaled-up variants of the proposed AxRMs.

V. CASE STUDY: CONVOLUTIONAL NEURAL NETWORKS (CNNS)
The performance of the proposed approximate multipliers is evaluated for the AlexNet CNN, which classifies the ImageNet ILSVRC2012 dataset [17]. Caffe provides a pre-trained AlexNet model that has been quantized to 8 bits using the method proposed in [18]. The exact pre-trained model is used because gradient descent in backpropagation does not converge with the approximate multipliers. However, in the inference stage, the exact multiplications are replaced by the approximate ones. The entire ILSVRC2012 validation set (50,000 images) has been used to measure the Top-1/Top-5 inference accuracy along with the power reductions.

Table 6 presents the classification accuracy and the power reduction achieved by the approximate multipliers that lie on the MRED-PDP Pareto-optimal curve. AxRMs exhibit the double-sided error distribution along with lower error magnitudes, which effectively introduces noise and helps to mitigate over-fitting in AlexNet. It has already been reported in the technical literature that adding noise is often an effective way of improving the performance of CNNs [19]. Therefore, AxRM1 achieves a 1.4 percent higher classification accuracy with 20.93 percent power savings compared to the exact multiplier. Compared with LOBO (which exhibits the previous best classification accuracy), AxRM2 provides 1.75 and 6.29 percent better accuracy and power reduction, respectively. AxRM3, being the most approximated design, has a lower recognition rate (consistent with the error analysis presented in Section IV); however, it achieves a 44.97 percent power reduction. The results show that AxRMs can effectively increase the classification accuracy with lower hardware resource consumption.

VI. CONCLUSION
This paper has presented two 2×2 multipliers (Mul2a and Mul2b) with an error compensation/cancellation feature. Mul2b exhibits a 58 percent reduction in area and 50 percent better delay compared to the existing state-of-the-art 2×2 multiplier. Design exploration of 8-bit AxRMs is performed using different configurations of the proposed Mul2a/Mul2b, and a comprehensive error analysis is presented. AxRM1 shows an improvement of 50 percent in MRED compared to the previous best low-MRED design (Ax8-1). AxRM3 exhibits the least area among all designs; an improvement of 13 percent in PDP is achieved compared to MACISH, while both designs have a similar MRED. The better MRED metric for larger designs is achieved because the proposed


Mul2a and Mul2b multipliers cancel errors in the partial product reduction step. The proposed approximate multipliers have been evaluated using the AlexNet CNN that classifies the ImageNet ILSVRC2012 dataset. AxRM1 exhibits an improvement of 1.4 percent in classification accuracy with 20.93 percent power savings. To encourage and help further research in this direction, the synthesizable Verilog files are provided as open-source libraries at https://sourceforge.net/projects/approxrecursivemul/.

ACKNOWLEDGMENTS
This work was supported by grants from the National Natural Science Foundation of China (62022041 and 61871216), and the Six Talent Peaks Project in Jiangsu Province (2018-XYDXX-009).

REFERENCES
[1] W. Liu, F. Lombardi, and M. Schulte, “A retrospective and prospective view of approximate computing,” Proc. IEEE, vol. 108, no. 3, pp. 394–399, Mar. 2020.
[2] C. Chen et al., “Optimally approximated and unbiased floating-point multiplier with runtime configurability,” in Proc. Int. Conf. Comput. Aided Des., 2020, pp. 1–9.
[3] V. Leon, K. Asimakopoulos, S. Xydis, D. Soudris, and K. Pekmestzi, “Cooperative arithmetic-aware approximation techniques for energy-efficient multipliers,” in Proc. 56th ACM/IEEE Des. Automat. Conf., 2019, pp. 1–6.
[4] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, “Low-power approximate multipliers using encoded partial products and approximate compressors,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 404–416, Sep. 2018.
[5] Y. Guo, H. Sun, and S. Kimura, “Design of power and area efficient lower-part-OR approximate multiplier,” in Proc. IEEE Region 10 Conf., 2018, pp. 2110–2115.
[6] H. Waris, C. Wang, W. Liu, J. Han, and F. Lombardi, “Hybrid partial product-based high-performance approximate recursive multipliers,” IEEE Trans. Emerg. Topics Comput., early access, Aug. 04, 2020, doi: 10.1109/TETC.2020.3013977.
[7] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an underdesigned multiplier architecture,” in Proc. Int. Conf. VLSI Des., 2011, pp. 346–351.
[8] S. Rehman et al., “Architectural-space exploration of approximate multipliers,” in Proc. Int. Conf. Comput.-Aided Des., 2016, pp. 1–8.
[9] G. A. Gillani, M. A. Hanif, M. Krone, S. H. Gerez, M. Shafique, and A. B. J. Kokkeler, “SquASH: Approximate square-accumulate with self-healing,” IEEE Access, vol. 6, pp. 49112–49128, 2018.
[10] G. A. Gillani, M. A. Hanif, B. Verstoep, S. H. Gerez, M. Shafique, and A. B. J. Kokkeler, “MACISH: Designing approximate MAC accelerators with internal-self-healing,” IEEE Access, vol. 7, pp. 77142–77160, 2019.
[11] H. Waris, C. Wang, W. Liu, and F. Lombardi, “AxBMs: Approximate Radix-8 booth multipliers for high-performance FPGA-based accelerators,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 5, pp. 1566–1570, May 2021.
[12] A. G. M. Strollo, E. Napoli, D. De Caro, N. Petra, and G. D. Meo, “Comparison and extension of approximate 4–2 compressors for low-power approximate multipliers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3021–3034, Sep. 2020.
[13] R. Pilipovic and P. Bulic, “On the design of logarithmic multiplier using radix-4 booth encoding,” IEEE Access, vol. 8, pp. 64578–64590, 2020.
[14] S. Vahdat, M. Kamal, A. Afzali-Kusha, and M. Pedram, “TOSAM: An energy-efficient truncation-and rounding-based scalable approximate multiplier,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 5, pp. 1161–1173, May 2019.
[15] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, “EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods,” in Proc. Des. Automat. Test Eur. Conf. Exhib., 2017, pp. 258–261.
[16] S. Hashemi et al., “DRUM: A dynamic range unbiased multiplier for approximate applications,” in Proc. Int. Conf. Comput.-Aided Des., 2015, pp. 418–425.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[18] Z.-G. Tasoulas, G. Zervakis, I. Anagnostopoulos, H. Amrouch, and J. Henkel, “Weight-oriented approximation for energy-efficient neural network inference accelerators,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4670–4683, Dec. 2020.
[19] M. S. Ansari, V. Mrazek, B. F. Cockburn, L. Sekanina, Z. Vasicek, and J. Han, “Improving the accuracy and hardware efficiency of neural networks using approximate multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 2, pp. 317–328, Feb. 2020.
