Computers and Electrical Engineering 104 (2022) 108407

Contents lists available at ScienceDirect

Computers and Electrical Engineering


journal homepage: www.elsevier.com/locate/compeleceng

Compressor based hybrid approximate multiplier architectures with efficient error correction logic✩
Anil Kumar Uppugunduru a ,∗, S. Vignesh Bharadwaj b , Syed Ershad Ahmed b
a Department of ECE, B V Raju Institute of Technology, Vishupur, Narsapur, Telangana, India
b Department of Electrical Engineering, BITS, Pilani, Hyderabad Campus, Hyderabad, India

ARTICLE INFO

Keywords: Approximate computing; Approximate multiplier; Partial product reduction; Compressor

ABSTRACT

A new approximate unsigned multiplier architecture is proposed in this paper, which aims to minimize the area utilized and power consumed while maintaining high accuracy. The proposed architecture is segmented into the least significant region (LSR), the approximate region, and the accurate region (most significant region). In the LSR, the partial products (PPs) are reduced using four methods. In the approximate region, two new approximate compressors are used to reduce the PPs, and the error arising from the approximate compressors is neutralized using an efficient error-correcting module. For the 8-bit multipliers, the results indicate that the proposed designs achieve an improvement of 26.5% and 32.2% in power and power-delay product (PDP), respectively, compared with the exact design, and an improvement of 18.4% and 26.4%, respectively, compared with other approximate designs. Finally, the proposed designs are evaluated using image processing and neural network applications.

1. Introduction

Approximate computing has increasingly become a standard approach for achieving compact, efficient, and high-speed designs, especially in architectures that are error-tolerant to a certain extent. Since neural network, image, and video processing applications produce outputs that are perceived by the human eye, slight deviations in the outputs go unnoticed due to the perceptual limitations of the human sensory system [1]. Hence, such applications provide a suitable environment for approximate computing.
The multiplication operation is the fundamental module in neural network, image, and video processing applications [2–5]. In the literature, many approximate multipliers have been proposed to reduce area, power, and latency. The two main methods for implementing approximate multipliers are (i) approximation during partial product generation (PPG) and (ii) approximation during partial product reduction (PPR). In (i), the number of partial product rows is reduced by using approximate sub-blocks. Existing designs based on (i) strive to minimize error, with the MSB section being constructed using precise sub-blocks. On the other hand, in (ii), the majority of the designs truncate the partial products (PPs) of the least significant portion and then use larger input approximate and exact compressors to reduce the remaining part of the PPs. Because there are still an equal number of partial product rows with this strategy, resource usage is reduced only minimally. In contrast to existing systems centered on (ii), this work aims to reduce the PPs using two effective approximate 4:2 compressors and four different techniques. As a result, the proposed designs are more accurate than existing designs while using less area and power.
The highlights of this paper are as follows:

• Four different methods are proposed, which are used in the least significant portion of the PPR structure to decrease the area and power with a trade-off in accuracy.
• Two new approximate 4:2 compressors with high accuracy and low power consumption are proposed and used in the approximate portion of the PPR structure to reduce the hardware complexity.
• A simple yet efficient error-correcting module is proposed to make the approximate compressor behave like an exact compressor.
• Finally, the proposed designs are validated using benchmarking applications.

✩ This paper is for regular issues of CAEE. Reviews processed and recommended for publication to the Editor-in-Chief by Dr. Fabrizio Messina.
∗ Corresponding author.
E-mail address: [email protected] (A.K. Uppugunduru).

https://doi.org/10.1016/j.compeleceng.2022.108407
Received 21 May 2022; Received in revised form 22 September 2022; Accepted 26 September 2022
Available online 19 October 2022
0045-7906/© 2022 Elsevier Ltd. All rights reserved.

Table 1
Comparison between existing multiplier designs.
| Design | Approximation in PPG | Approximation in PPR | No. of regions | Scheme at LSR | New compressor | Error recovery circuit | Various multiplier designs |
|---|---|---|---|---|---|---|---|
| Momeni [11] | N | Y | 2 | Approximation | Y | N | Y |
| Yang [12] | N | Y | 3 | Truncation | Y | N | Y |
| Minho [13] | N | Y | 3 | Truncation | Y | Y | N |
| Xilin [14] | N | Y | 2 | Approximation | Y | N | N |
| Suganthi [15] | N | Y | 2 | Approximation | Y | N | Y |
| Haroon [9] | Y | N | – | – | N | N | Y |
| Saeed [8] | Y | N | – | – | Y | N | Y |
| Haoran [16] | N | Y | 2 | Approximation | Y | Y | Y |
| Strollo [17] | N | Y | 2 | Approximation | Y | N | Y |
| AxRMs [10] | Y | N | – | – | N | N | Y |
| Kong [18] | N | Y | 2 | Approximation | Y | N | Y |
| Shaghayegh [19] | N | Y | 2 | Approximation | N | N | Y |
| Proposed | N | Y | 3 | Four different techniques | Y | Y | Y |

'Y' means Yes and 'N' means No.


The rest of the paper is organized as follows. Section 2 describes the existing multiplier architectures, whereas the proposed multiplier
architecture, the two new approximate compressors and the error-correcting module are presented in Section 3. Section 4 includes
comprehensive hardware and error analysis of the multipliers. The existing and proposed designs are validated using image
processing and neural network applications in Section 5. Finally, conclusions are drawn in Section 6.

2. Related work

A wide variety of approximate multipliers have been proposed to lower the computational complexity using methods (i) and (ii). Based on (i), Kulkarni et al. [6] presented an under-designed recursive multiplier (UDM) built from smaller sub-multipliers. The sub-multipliers are simplified by altering a particular K-map entry: for instance, one output bit is saved by changing the output from "1001" to "111". The modified sub-multipliers therefore achieve lower area utilization with a tradeoff in accuracy. By reducing the number of carry-save adders used at the accumulation step, Mahdiani et al. [7] proposed a broken-array multiplier. Using approximate 4:2 compressors and encoded partial products (PPs), Saeed et al. [8] proposed two 4 × 4 sub-multiplier architectures. These sub-multipliers are used to create multipliers with larger bit widths; however, these designs suffer from low accuracy. Hybrid partial product array-based 4 × 4 sub-multipliers were proposed by Waris et al. [9] and are used to design larger multipliers; however, these designs require more power. The authors in [10] proposed 2 × 2 multipliers with variable power and accuracy, which are used to design larger multipliers. These multipliers have better accuracy but consume more power.
Momeni et al. [11] proposed two approximate compressors to reduce area and power; however, they suffer from low accuracy. Yang et al. [12] proposed three approximate compressors with a low error rate, which are used in a new PPR structure, but the resulting design suffers from high power dissipation. Minho et al. [13] modified one of the compressors proposed by Yang and used an error-correcting module in the PPR structure to improve the accuracy, but doing so resulted in higher power consumption. Venkatachalam et al. [15] proposed two types of multiplier architectures using approximate arithmetic modules with the intent of reducing hardware complexity, but these architectures have low accuracy. A low-power compressor-based multiplier architecture was proposed by Xilin et al. [14], with accuracy being the main tradeoff. Haoran et al. [16] put forward four multiplier designs, consisting of three approximate compressors and an error-correcting module, with the intent of maximizing savings in power and area. Strollo et al. [17] proposed a compressor with high accuracy, which is used in the reduction structure of two multiplier architectures. Shaghayegh et al. [19] proposed efficient approximate multipliers based on truncation and rounding for high accuracy. Kong et al. [18] investigated the existing approximate 4:2 compressors and analyzed the delay of their architectures.
Numerous multiplier architectures have been put forward in the literature, but most of them do not include an error-correcting module. Furthermore, none of them offers four distinct methods to reduce the LSR, as shown in Table 1.


Fig. 1. Proposed partial product reduction structure with error correction logic.

3. Proposed work

Designs [6–10], [16] have lower power consumption; however, they suffer from low accuracy. The multipliers proposed by Minho [13], Yang [12], and Strollo [17] have high accuracy due to their highly efficient compressors, but these designs tend to have higher power dissipation. The existing multiplier architectures therefore trade off power against precision. To mitigate this, we propose four different multiplier architectures with low power and high accuracy.
The partial product structure of the proposed 8-bit unsigned multiplier is divided into three parts: the 4-bit LSR, the 4-bit approximate region, and the 7-bit exact region. The LSR partial products contribute least to the final product, so they are approximated using four different methods, which are shown in Fig. 1 and discussed in Section 3.5. The partial products in the approximate region are reduced using the proposed approximate compressors. To compensate for the error introduced by the proposed compressor in the approximate region, we use the error-correcting module at the most significant column of the approximate region, as highlighted in Fig. 1. The PPs in the accurate region are reduced using exact full adders and 4:2 compressors. The product tree is reduced to two rows using adders and compressors, and these rows are then added using a ripple carry adder (RCA) to produce the final product.
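
The three-region split can be illustrated with a small behavioural sketch. It is not the authors' Verilog: it assumes that the 4-bit LSR, the 4-bit approximate region and the 7-bit exact region correspond to partial-product columns 0 to 3, 4 to 7 and 8 to 14 of the 8 × 8 array, and the helper names (`partial_product_columns`, `region_value`) are hypothetical. Every region is summed exactly here, so the sketch only shows how the partial products are distributed over the regions.

```python
# Behavioural sketch (assumed column numbering): split the AND-gate partial
# products of an 8x8 unsigned multiplier into the three regions.
N = 8
LSR_COLS = range(0, 4)      # assumed: least significant region, columns 0-3
APPROX_COLS = range(4, 8)   # assumed: approximate region, columns 4-7
EXACT_COLS = range(8, 15)   # assumed: exact region, columns 8-14


def partial_product_columns(a, b, n=N):
    """Return 2n-1 columns; column k holds the bits a_i & b_j with i + j == k."""
    cols = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return cols


def region_value(cols, region):
    """Exact weighted sum of the partial-product bits inside one region."""
    return sum(sum(cols[k]) << k for k in region)


a, b = 0b10100111, 0b11101101           # 167 and 237
cols = partial_product_columns(a, b)
heights = [len(col) for col in cols]    # number of bits in each column
lsr = region_value(cols, LSR_COLS)
approx = region_value(cols, APPROX_COLS)
exact = region_value(cols, EXACT_COLS)
print(heights)
print(lsr, approx, exact, lsr + approx + exact == a * b)
```

The column heights grow from 1 to 8 and shrink back to 1, which is why the LSR holds only 10 of the 64 partial-product bits while the approximate region holds 26.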

3.1. Proposed approximate compressors in the approximate region

The PPs in the approximate region shown in Fig. 1 are reduced using two approximate compressors. Approximate compressor
1 (AC1) in the most significant column (MSC) of the approximate region contributes to a large error. To overcome this problem,
approximate compressor 2 (AC2) is proposed, which can be converted to the exact compressor with the help of an error-correcting
module. Details of AC1 and AC2 are discussed below.

3.2. Proposed approximate 4:2 compressor 1 (AC1)

We propose approximate 4:2 compressors with reduced hardware complexity. The proposed approximate compressor 1 (AC1) has four inputs and two outputs, and its truth table is presented in Table 2. AC1 is designed in such a way that it produces an error in only two cases, which appear in Table 2 as the rows with a non-zero AC1 difference (highlighted in red in the original table). The Boolean equations of AC1 are given in Eqs. (1) and (2), and the corresponding logic diagram is shown in Fig. 2(a).

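$$Sum = (p_1 \oplus p_2) + Q_3 Q_4 \tag{1}$$

$$Carry = p_1 p_2 + Q_1 Q_2 \tag{2}$$

where

$$p_1 = Q_1 \oplus Q_2, \qquad p_2 = Q_3 + Q_4$$

Here $+$ denotes logical OR and juxtaposition denotes logical AND.

Table 2
Truth table of the proposed 4:2 approximate compressors.
| Q1 | Q2 | Q3 | Q4 | Value | Prob. | AC1 Sum | AC1 Carry | Appr. value of AC1 | Difference (AC1) | AC2 Sum | AC2 Carry | Appr. value of AC2 | Difference (AC2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 81/256 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 1 | 27/256 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 | 27/256 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 1 | 2 | 9/256 | 1 | 0 | 1 | −1 | 0 | 0 | 0 | −2 |
| 0 | 1 | 0 | 0 | 1 | 27/256 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 2 | 9/256 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 0 |
| 0 | 1 | 1 | 0 | 2 | 9/256 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 0 |
| 0 | 1 | 1 | 1 | 3 | 3/256 | 1 | 1 | 3 | 0 | 1 | 0 | 1 | −2 |
| 1 | 0 | 0 | 0 | 1 | 27/256 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 2 | 9/256 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 0 |
| 1 | 0 | 1 | 0 | 2 | 9/256 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 0 |
| 1 | 0 | 1 | 1 | 3 | 3/256 | 1 | 1 | 3 | 0 | 1 | 0 | 1 | −2 |
| 1 | 1 | 0 | 0 | 2 | 9/256 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 0 |
| 1 | 1 | 0 | 1 | 3 | 3/256 | 1 | 1 | 3 | 0 | 1 | 1 | 3 | 0 |
| 1 | 1 | 1 | 0 | 3 | 3/256 | 1 | 1 | 3 | 0 | 1 | 1 | 3 | 0 |
| 1 | 1 | 1 | 1 | 4 | 1/256 | 1 | 1 | 3 | −1 | 0 | 1 | 2 | −2 |

Fig. 2. (a) Proposed approximate compressor 1 (AC1) (b) Proposed approximate compressor 2 (AC2).
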
Table 2 shows that AC1 generates an error in two cases, "0011" and "1111", each with an error distance of −1. The inputs Q1, Q2, Q3, and Q4 are partial products produced by AND gates. The probability of a multiplier or multiplicand bit being '1' is 0.5, so the probability of any one compressor input being '1' is 0.5 × 0.5 = 0.25. From Table 2, the error probability of AC1 is therefore 9/256 + 1/256 = 10/256, i.e., approximately 0.039.
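
The AC1 equations and the error probability can be checked with a short bit-level model. This is a minimal sketch, assuming Eq. (1) is read as (p1 XOR p2) OR (Q3 AND Q4), which is the reading that reproduces Table 2; the function names are illustrative and not taken from the paper.

```python
# Bit-level model of AC1 from Eqs. (1) and (2), checked against the exact sum.
from fractions import Fraction
from itertools import product


def ac1(q1, q2, q3, q4):
    """Return (sum, carry) of approximate compressor 1."""
    p1 = q1 ^ q2
    p2 = q3 | q4
    s = (p1 ^ p2) | (q3 & q4)
    c = (p1 & p2) | (q1 & q2)
    return s, c


def input_probability(bits):
    """Each input is an AND of two uniform bits, so P(1) = 1/4 and P(0) = 3/4."""
    p = Fraction(1)
    for bit in bits:
        p *= Fraction(1, 4) if bit else Fraction(3, 4)
    return p


ac1_errors = {}                  # input pattern -> error distance
ac1_error_prob = Fraction(0)
for bits in product((0, 1), repeat=4):
    s, c = ac1(*bits)
    diff = (2 * c + s) - sum(bits)
    if diff != 0:
        ac1_errors[bits] = diff
        ac1_error_prob += input_probability(bits)

print(ac1_errors)
print(ac1_error_prob, float(ac1_error_prob))
```

The loop visits all sixteen input patterns, so it doubles as a regeneration of the AC1 columns of Table 2.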

3.3. Proposed approximate 4:2 compressor 2 (AC2)

The MSC of the approximate region contributes a large error to the final product. To overcome this problem, we have designed another approximate compressor that can easily be converted into an exact compressor using a simple error-correcting module.
Proposed approximate compressor 2 (AC2) was developed by altering the truth table of AC1 so that it generates an error in four input cases, each with an error distance of −2, as listed in Table 2 (highlighted in yellow in the original table). The Boolean equations of AC2 are given in Eqs. (3) and (4), and the corresponding logic diagram is shown in Fig. 2(b).

$$Sum = p_2 \oplus p_1 \tag{3}$$

$$Carry = Q_1 Q_2 + p_2 p_1 \tag{4}$$

where

$$p_1 = Q_3 \oplus Q_4, \qquad p_2 = Q_1 \oplus Q_2$$

Table 3
Methods used in the proposed multiplier designs.
| S.No | Design | Technique used in LSP |
|---|---|---|
| 1 | D1 | Truncation |
| 2 | D2 | Constant Correction Term |
| 3 | D3 | Bypass the first row |
| 4 | D4 | OR operation |

From Table 2, AC2 generates an error in four cases: "0011", "0111", "1011", and "1111". The error probability of AC2 is therefore 9/256 + 3/256 + 3/256 + 1/256 = 16/256, i.e., 0.0625.
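
A similar sketch for AC2, built directly from Eqs. (3) and (4), confirms the four erroneous patterns and the 16/256 error probability. The names are illustrative. The last check records an observation drawn from the truth table rather than a statement in the text: the erroneous patterns are exactly those with Q3 = Q4 = 1.

```python
# Bit-level model of AC2 from Eqs. (3) and (4).
from fractions import Fraction
from itertools import product


def ac2(q1, q2, q3, q4):
    """Return (sum, carry) of approximate compressor 2."""
    p1 = q3 ^ q4
    p2 = q1 ^ q2
    s = p2 ^ p1
    c = (q1 & q2) | (p2 & p1)
    return s, c


ac2_errors = {}                  # input pattern -> error distance
ac2_error_prob = Fraction(0)
for bits in product((0, 1), repeat=4):
    s, c = ac2(*bits)
    diff = (2 * c + s) - sum(bits)
    if diff != 0:
        ac2_errors[bits] = diff
        # P(input = 1) = 1/4 because every input is an AND of two uniform bits.
        ones = sum(bits)
        ac2_error_prob += Fraction(1, 4) ** ones * Fraction(3, 4) ** (4 - ones)

# Observation from the truth table: an error occurs exactly when Q3 = Q4 = 1.
errors_match_q3_and_q4 = all(
    ((bits[2] & bits[3]) == 1) == (bits in ac2_errors)
    for bits in product((0, 1), repeat=4)
)
print(ac2_errors, ac2_error_prob, errors_match_q3_and_q4)
```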

3.4. Error correcting module

Minho [13] proposed a compressor that generates an error of distance −1 in four cases, and also introduced a single error-correcting term derived from two compressors at the MSC of the approximate region to reduce the error distance. This correcting term is valid only when both compressors generate an error. If only one of the compressors generates an error, the error is not recovered correctly, because the overall error distance of the corresponding column is −1 while the error-correcting term adds a value of 2. Similarly, the error-correcting module in MUL2 [16] does not recover the error completely. We propose a new error-correcting module to overcome the error introduced by AC2.
The compressor AC2 generates an error of ’-2’ in four out of sixteen cases. This compressor is used at the MSC of the approximate
region in the proposed partial product structure and it is highlighted in Fig. 1. To correct this error, an error-correcting module has
been incorporated which generates a ‘1’ if there is an error that needs to be compensated, and generates a ‘0’ otherwise. This
correction bit is given as a carry in to the next column, which is the least significant column of the exact portion as shown in Fig. 1,
hence the effective binary weight of the correction term is 2. For the erroneous cases ‘‘0011’’, ‘‘0111’’, ‘‘1011’’ and ‘‘1111’’, the −2
error is compensated by adding a carry in to the next position, which effectively means adding +2 to negate the error completely.
Therefore, AC2 essentially behaves like an exact compressor. The hardware complexity of AC2 along with error correcting module
is less compared to the exact compressor.
As shown in Fig. 1, the error-correcting module in level 1 comprises two AND gates that analyze the PPs at the MSC of the approximate region. The outputs of these two AND gates are given as carry-ins to the exact compressors in the least significant column of the exact region. This reduces the error distance and helps improve the overall accuracy of the multiplier. Similarly, an AND gate is utilized at the MSC of the approximate region in level 2.
To incorporate a similar error-correcting module in a 16-bit multiplier, a total of seven AND gates would be required. As the error-correcting module is used only in the MSC of the approximate region, the hardware overhead is small relative to the gain in accuracy.
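
The effect of the correction bit can be sketched at the level of a single compressor. The text states that the module is built from AND gates acting on the PPs at the MSC, but it does not spell out the gate inputs. The sketch below assumes the correction bit is Q3 AND Q4, which is inferred from Table 2 (all four erroneous AC2 patterns have Q3 = Q4 = 1); `correction_bit` and `corrected_value` are hypothetical names.

```python
# AC2 plus a one-AND-gate correction bit, compared with the exact 4-input sum.
from itertools import product


def ac2(q1, q2, q3, q4):
    """AC2 from Eqs. (3) and (4); returns (sum, carry)."""
    p1 = q3 ^ q4
    p2 = q1 ^ q2
    return p2 ^ p1, (q1 & q2) | (p2 & p1)


def correction_bit(q3, q4):
    # Assumption inferred from Table 2, not stated explicitly in the paper.
    return q3 & q4


def corrected_value(q1, q2, q3, q4):
    """Sum has weight 1; carry and correction bit both feed the next column (weight 2)."""
    s, c = ac2(q1, q2, q3, q4)
    return s + 2 * c + 2 * correction_bit(q3, q4)


mismatches = [
    bits for bits in product((0, 1), repeat=4)
    if corrected_value(*bits) != sum(bits)
]
print(mismatches)
```

Under this assumption the corrected compressor matches the exact count of ones for all sixteen inputs, which is the sense in which AC2 "behaves like an exact compressor".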

3.5. Variants of multiplier design

By varying the technique used in the LSR (the approximate and accurate regions remain the same), different variants of the 8-bit and 16-bit multiplier architectures are proposed, namely D1, D2, D3, and D4, as tabulated in Table 3. In D1, the LSR is truncated, which saves area but introduces error. In D2, the LSR value is set to a constant calculated by averaging the exact value obtained for all possible input combinations. In D3, the LSR value is directly assigned one partial product term. In D2 and D3, replacing the LSR with a constant term or a partial product term reduces the error without incurring additional hardware cost. Finally, in D4, the LSR value is calculated using OR gates, which reduces the error significantly but increases the hardware utilization. This is illustrated using a numerical example.
Consider A = (10100111)₂ and B = (11101101)₂; the exact output is shown in Fig. 3(a). The impact of the four approximation methods on the output can be observed in Fig. 3(b–e). In design D1, the lower four bits are truncated, and the remaining approximate and exact portions are reduced using the proposed and exact compressors, respectively. The resulting output is (39552)₁₀, which is close to the exact output. In design D2, the lower four bits are replaced with a constant term, and the resulting output is (39558)₁₀. In D3, the lower four bits are replaced directly with the partial products of the first row, and the output is (39559)₁₀, which is closer to the exact output than the D1 and D2 outputs. In design D4, the output obtained by reducing the PP bits using OR gates is (39567)₁₀, which is the most accurate of the four outputs. Fig. 3(f) shows design D1 without the error-correcting module, for which the output is (39296)₁₀. Comparing the two outputs in Fig. 3(b) and 3(f), it is clear that the error-correcting module is indispensable for the proposed designs.
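
The numerical example can be reproduced with a behavioural sketch of the four LSR techniques. Several points are assumptions rather than statements from the paper: the LSR is taken to be columns 0 to 3; the upper 11 columns are modelled as error-free, which matches the reported outputs for this input pair but is not guaranteed in general; the "first row" in D3 is read as the low four bits of A gated by b0; and the D2 constant of 6 is back-derived from the reported output (39558 − 39552), since its value is not given in the text.

```python
# Behavioural sketch of the LSR techniques D1-D4 for the 8-bit example.
N = 8
LSR_WIDTH = 4                 # assumed: LSR = columns 0-3
D2_CONSTANT = 6               # back-derived from the example, not stated in the paper


def lsr_columns(a, b):
    """Partial-product bits a_i & b_j that fall into the LSR columns."""
    cols = [[] for _ in range(LSR_WIDTH)]
    for i in range(N):
        for j in range(N):
            if i + j < LSR_WIDTH:
                cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return cols


def lsr_exact(a, b):
    return sum(sum(col) << k for k, col in enumerate(lsr_columns(a, b)))


def lsr_d1(a, b):             # truncation
    return 0


def lsr_d2(a, b):             # constant correction term
    return D2_CONSTANT


def lsr_d3(a, b):             # bypass the first partial-product row (assumed a[3:0] & b0)
    return (a & 0xF) if (b & 1) else 0


def lsr_d4(a, b):             # OR of the bits in each LSR column
    return sum(int(any(col)) << k for k, col in enumerate(lsr_columns(a, b)))


def approximate_product(a, b, lsr_method):
    upper = a * b - lsr_exact(a, b)     # assumption: upper columns are error-free
    return upper + lsr_method(a, b)


a, b = 0b10100111, 0b11101101           # 167 and 237
outputs = {
    name: approximate_product(a, b, method)
    for name, method in (("D1", lsr_d1), ("D2", lsr_d2), ("D3", lsr_d3), ("D4", lsr_d4))
}
error_distance = {name: a * b - value for name, value in outputs.items()}
ecm_gain = 39552 - 39296                # reported D1 output with and without correction
print(outputs, error_distance, ecm_gain)
```

With these assumptions the sketch returns the four outputs quoted above. The reported gap between D1 with and without the error-correcting module is 256 = 2⁸, which is consistent with a single correction bit entering column 8, the least significant column of the exact region.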
The summary of the proposed work is mentioned below:

• Four distinct techniques (D1, D2, D3 and D4) are proposed to lower the area and power with a compromise on accuracy in
the least significant part of the PPR structure.
• To simplify the hardware, two new approximate 4:2 compressors with high accuracy and low power consumption
are deployed in the approximate portion of the PPR structure.
• The approximate compressor is designed to perform like an exact compressor using a simple but robust error-correcting module.


Fig. 3. Inputs A=(167)10 and B= (237)10 (a) Exact output (b) D1 o/p (c) D2 o/p (d) D3 o/p (e) D4 o/p (f) D1 without error correcting module.

Table 4
Error analysis of various 8-bit and 16-bit approximate multipliers.
| Multiplier | 8-bit MRED (%) | 8-bit NMED (10⁻³) | 8-bit EDmax | 16-bit MRED (%) | 16-bit NMED (10⁻³) | 16-bit WCRE |
|---|---|---|---|---|---|---|
| D1 | 0.68 | 0.28 | 213 | 0.01 | 0.0032 | 0.42 |
| D2 | 0.49 | 0.23 | 207 | 0.01 | 0.0032 | 0.40 |
| D3 | 0.48 | 0.229 | 202 | 0.01 | 0.0029 | 0.31 |
| D4 | 0.21 | 0.14 | 198 | 0.009 | 0.0031 | 0.21 |
| Ax8-1 [9] | 0.1 | 0.46 | 96 | 0.007 | 0.001 | 0.61 |
| Ax8-2 [9] | 1.26 | 1.64 | 2034 | 0.09 | 0.04 | 0.97 |
| Ax8-3 [9] | 2.83 | 6.12 | 3954 | 0.18 | 0.08 | 0.98 |
| Multiplier2 [15] | 1.36 | 1.56 | 1020 | 0.12 | 0.44 | 0.99 |
| Yang [12] | 0.77 | 0.49 | 641 | 0.02 | 0.057 | 0.65 |
| Minho [13] | 0.78 | 0.43 | 385 | 0.01 | 0.05 | 0.31 |
| Xilin [14] | 1.40 | 1.33 | 560 | 0.05 | 0.18 | 0.42 |
| MUL2 [16] | 2.43 | 1.06 | 264 | 0.09 | 0.021 | 0.99 |
| M3 [8] | 1.7 | 2.1 | 1188 | 1.6 | 1.2 | 0.43 |
| M4 [8] | 2.2 | 3.2 | 2244 | 2.2 | 1.9 | 0.44 |
| AxRM1 [10] | 0.765 | 0.25 | 50 | 0.01 | 0.001 | 34.1 |
| AxRM2 [10] | 5.6 | 4.28 | 850 | 0.27 | 0.025 | 191.5 |
| AxRM3 [10] | 7.566 | 5.22 | 1650 | 0.49 | 0.03 | 885.1 |
| C-N [17] | 0.10 | 0.13 | 520 | 0.004 | 0.002 | 0.19 |
4. Results and discussions

See Table 4.

4.1. Error analysis

Error analysis is performed in MATLAB on the existing and proposed 8-bit and 16-bit multiplier architectures using all 65,536 input combinations and 1 million random input samples, respectively. The results are tabulated in Table 4. Metrics such as maximum error distance (EDmax), normalized mean error distance (NMED), worst-case relative error distance (WCRE), and mean relative error distance (MRED) are used to quantify the efficacy of the multiplier designs. The error distance (ED) is the difference between the accurate and approximate results, and MRED is the mean of all relative error distances (REDs).
It can be observed from Table 4 that D1, D2, D3 and D4 have lower EDmax, NMED, and MRED compared to all existing multiplier designs except Ax8-1 and AxRM1, owing to the efficient proposed approximate compressors, the error-correcting module, and the techniques used in the least significant portion of the PPR structure. Similarly, the NMED, MRED, and WCRE values of the proposed 16-bit designs are lower than those of the existing ones.

Table 5
Area and power results of various approximate compressors.
| S.No | Design | Area (μm²) | Power (nW) | Delay (ps) | Error rate |
|---|---|---|---|---|---|
| 1 | Exact | 35.28 | 1447.31 | 414 | – |
| 2 | AC1 | 23.28 | 484.67 | 289 | 10/256 |
| 3 | AC2 | 22.57 | 612.95 | 278 | 16/256 |
| 4 | Multiplier2 [15] | 23.28 | 440.02 | 188 | 37/256 |
| 5 | Xilin [14] | 26.10 | 416.44 | 209 | 37/256 |
| 6 | Yang [12] | 26.10 | 775.92 | 451 | 16/256 |
| 7 | Minho [13] | 21.16 | 654.44 | 289 | 16/256 |
| 8 | MUL2 [16] | 14.11 | 245.81 | 193 | 175/256 |
| 9 | Strollo [17] | 25.40 | 500.06 | 451 | 4/256 |

Table 6
Power and area results of various 8-bit and 16-bit approximate multipliers.
| Multiplier | 8-bit Area (μm²) | 8-bit Power (μW) | 8-bit Delay (ns) | 8-bit PDP (fJ) | 16-bit Area (μm²) | 16-bit Power (μW) | 16-bit Delay (ns) | 16-bit PDP (fJ) |
|---|---|---|---|---|---|---|---|---|
| Exact | 1666.62 | 49.8 | 2.66 | 132.46 | 6728.6 | 286.71 | 4.112 | 1178.95 |
| D1 | 1222.8 | 37.1 | 2.42 | 89.78 | 4886.98 | 220.12 | 2.895 | 637.25 |
| D2 | 1222.8 | 37.1 | 2.42 | 89.78 | 4886.98 | 220.12 | 2.895 | 637.25 |
| D3 | 1236.917 | 37.31 | 2.42 | 90.29 | 4923 | 222.12 | 2.895 | 643.04 |
| D4 | 1251.73 | 37.62 | 2.42 | 90.99 | 4989 | 224.21 | 2.895 | 649.09 |
| Ax8-1 [9] | 1388.62 | 39.38 | 2.60 | 102.63 | 6027 | 237.45 | 3.812 | 905.16 |
| Ax8-2 [9] | 1248.2 | 37.47 | 2.59 | 97.04 | 5578.47 | 222.17 | 3.724 | 827.36 |
| Ax8-3 [9] | 1139.544 | 35.68 | 2.33 | 83.42 | 5163.58 | 194.1 | 3.724 | 722.83 |
| Multiplier2 [15] | 1557.96 | 40.07 | 2.52 | 101.05 | 5251.78 | 234.65 | 2.905 | 681.66 |
| Yang [12] | 1240.92 | 38.76 | 2.50 | 97.13 | 5487.45 | 257.22 | 3.02 | 776.8 |
| Minho [13] | 1235.8 | 37.7 | 2.49 | 94.17 | 5282.82 | 248.17 | 2.97 | 737.06 |
| Xilin [14] | 1501.51 | 37.56 | 2.45 | 92.02 | 5756 | 221.06 | 2.88 | 636.65 |
| MUL2 [16] | 1236.23 | 36.97 | 2.47 | 91.31 | 5509.32 | 195.79 | 2.698 | 528.24 |
| M3 [8] | 1093.68 | 37.27 | 2.14 | 79.75 | 4767.73 | 216.01 | 2.85 | 615.63 |
| M4 [8] | 1006.89 | 32.51 | 2.03 | 66.25 | 4471.38 | 197.11 | 2.708 | 533.77 |
| AxRM-1 [10] | 1572.07 | 42.81 | 2.85 | 122.08 | 6035.7 | 237.67 | 3.916 | 930.72 |
| AxRM-2 [10] | 1423.901 | 39.86 | 2.81 | 112.2 | 5684.31 | 225.75 | 3.902 | 880.88 |
| AxRM-3 [10] | 1284.19 | 37.42 | 2.45 | 91.68 | 5358.32 | 219.72 | 2.968 | 652.13 |
| C-N [17] | 1473.29 | 45.5 | 2.52 | 109.91 | 5911.51 | 247.05 | 3.504 | 865.66 |
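
The error metrics of Section 4.1 can be computed with a short exhaustive loop. The sketch below is not the authors' MATLAB script. It assumes the commonly used definitions (NMED is the mean error distance divided by the largest exact product, and RED is skipped when the exact product is zero), and it uses a hypothetical stand-in multiplier that simply clears the four low product bits, which is not one of the paper's designs.

```python
# Exhaustive error metrics for an n-bit approximate multiplier.
def truncating_mul(a, b):
    """Hypothetical stand-in: exact product with its four low bits cleared."""
    return (a * b) & ~0xF


def error_metrics(approx_mul, n_bits=8):
    max_operand = (1 << n_bits) - 1
    max_product = max_operand ** 2
    samples = 0
    total_ed = 0
    ed_max = 0
    total_red = 0.0
    red_count = 0
    for a in range(max_operand + 1):
        for b in range(max_operand + 1):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))   # error distance
            samples += 1
            total_ed += ed
            ed_max = max(ed_max, ed)
            if exact != 0:                       # assumption: RED undefined for 0
                total_red += ed / exact
                red_count += 1
    return {
        "samples": samples,
        "EDmax": ed_max,
        "NMED": total_ed / samples / max_product,
        "MRED": total_red / red_count,
    }


metrics = error_metrics(truncating_mul)
print(metrics)
```

Replacing `truncating_mul` with a bit-accurate model of a given design would yield figures comparable to the 8-bit columns of Table 4, provided the same metric definitions are used.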

4.2. Synthesis results

For a fair comparison, the approximate compressors and the 8-bit and 16-bit versions of all the existing and proposed designs are modeled in the Verilog hardware description language. Cadence RTL Compiler v7.1 is used to synthesize all the approximate compressors and approximate multiplier designs at the TSMC 90 nm process node (slow-normal library).
Synthesis results of the approximate compressors are tabulated in Table 5, which shows that AC1 and AC2 consume less power than Minho [13] and Yang [12]. The designs that consume less power than AC1 and AC2 suffer from low accuracy.
The area, power, and delay characteristics of the existing and proposed 8-bit and 16-bit multipliers are listed in Table 6. For the 8-bit multipliers, it is evident from Table 6 that the proposed designs D1 and D2 consume less power than the existing designs except for Ax8-3, MUL2, M4, and AxRM3, owing to the efficient approximate compressors and the approximation in the LSR. The designs that consume less power than D1 and D2 suffer from low accuracy. D1 and D2 achieve power improvements of up to 26.5% and 18.4% compared to the exact and existing designs [8,9], [11–13,15], respectively. Likewise, D3 consumes less power than the existing designs, with the exception of Ax8-3, MUL2, M3, M4, and AxRM3, because it directly assigns one partial product as the output of the LSR. D3 has power savings of up to 25.1% and 18% compared to the exact and existing designs [9], [11–13,15], respectively. D4 consumes less power than Minho, Yang, C-N, AxRM1, Multiplier2, Ax8-1, and Ax8-2. The power improvement of D4 is up to 24.4% and 17.3% compared to the exact and existing designs [9], [11–13], respectively.
Similarly, the area occupied by D1 and D2 is lower than that of the existing designs except for Ax8-3, M3 and M4, owing to the simple techniques used in the LSR and the efficient approximate compressors. D3 and D4 occupy less area than the existing designs except for Ax8-3, M3, Minho, MUL2, M4 and AxRM3. The proposed designs have lower latency than Multiplier2, Xilin, Minho, Yang, AxRM1, AxRM2, AxRM3, C-N and MUL2. Although the MUL2 compressor has a lower delay than the existing compressors, the delay of the multiplier in which it is embedded is poor, because the authors of that work approximated up to seven columns in the multiplier reduction structure. Finally, the power-delay products (PDPs) of D1, D2, and D3 are lower than those of the existing designs except M3, M4 and Ax8-3. The designs that have a lower PDP than D1 and D2 suffer from low accuracy. D4 has a lower PDP than the existing designs except for M3, M4, Minho, and Ax8-3.

Table 7
PSNR comparison between various multiplier designs (the Lenet-5 row reports classification accuracy).
| Application | Image | D1 | D2 | D3 | D4 | Multiplier2 [15] | Xilin [14] | Yang [12] | Minho [13] | MUL2 [16] | M3 [8] | M4 [8] | Ax8-1 [9] | Ax8-2 [9] | Ax8-3 [9] | AxRM1 [10] | AxRM2 [10] | AxRM3 [10] | C-N [17] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sharpening | Lena | 51.07 | 51.62 | 53.85 | 55.51 | 40.99 | 37.47 | 48.68 | 48.77 | 34.32 | 46.99 | 38.43 | 53.68 | 52.54 | 23.54 | 48.62 | 32.31 | 22.51 | 55.85 |
| Sharpening | Peppers | 51.15 | 51.85 | 54.18 | 55.83 | 41.23 | 41.12 | 49.43 | 49.06 | 35.07 | 43.71 | 33.53 | 53.47 | 52.20 | 28.10 | 49.39 | 34.23 | 25.10 | 55.98 |
| JIC | Cameraman | 31.41 | 33.87 | 34.97 | 42.84 | 31.13 | 29.75 | 33.27 | 33.87 | 22.34 | 29.98 | 28.75 | 28.33 | 26.91 | 23.95 | 33.18 | 22.21 | 21.02 | 42.98 |
| JIC | Lena | 29.40 | 29.53 | 33.56 | 42.23 | 28.89 | 25.65 | 26.33 | 29.56 | 21.69 | 24.25 | 23.90 | 24.51 | 21.44 | 20.63 | 32.15 | 21.62 | 21.02 | 42.48 |
| Image Multiplication | Image1 & 2 | 68.51 | 69.10 | 69.49 | 70.26 | 54.69 | 57.38 | 60.12 | 61.43 | 61.47 | 53.19 | 47.48 | 70.65 | 44.72 | 40.36 | 60.18 | 58.12 | 49.13 | 71.36 |
| Lenet-5 | MNIST Dataset | 98.53 | 98.53 | 98.56 | 98.56 | 83.14 | 98.51 | 98.43 | 98.46 | 98.49 | 98.52 | 98.49 | 98.54 | 93.24 | 88.28 | 98.44 | 97.14 | 96.00 | 98.54 |

Similarly, for a 16-bit multiplier, the proposed designs consume less power than Multiplier2, Yang, and Minho. The proposed
designs have power savings up to 23.2% and 14.41%, and the PDP improvement up to 45.94% and 31.53% compared to exact and
existing designs.
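
The PDP column of Table 6 and two of the quoted savings can be reproduced with simple arithmetic, since PDP is power multiplied by delay and μW × ns gives fJ. The following sketch uses only the Exact and D1 figures from Table 6; the helper names are illustrative.

```python
# PDP and relative savings recomputed from Table 6 entries.
def pdp_fj(power_uw, delay_ns):
    """Power-delay product: 1e-6 W * 1e-9 s = 1e-15 J, i.e. femtojoules."""
    return power_uw * delay_ns


def saving_percent(reference, value):
    return 100.0 * (reference - value) / reference


# 8-bit entries (power in uW, delay in ns) taken from Table 6.
pdp_exact_8 = pdp_fj(49.8, 2.66)
pdp_d1_8 = pdp_fj(37.1, 2.42)
pdp_saving_8 = saving_percent(pdp_exact_8, pdp_d1_8)

# 16-bit power entries (uW) taken from Table 6.
power_saving_16 = saving_percent(286.71, 220.12)

print(round(pdp_exact_8, 2), round(pdp_d1_8, 2))
print(round(pdp_saving_8, 1), round(power_saving_16, 1))
```

The recomputed values agree with the tabulated 8-bit PDPs to within rounding, with the 32.2% PDP improvement quoted in the abstract, and with the 23.2% power saving quoted for the 16-bit designs.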

5. Benchmarking applications

5.1. Image processing applications

The efficacy of the proposed designs is validated using the peak signal-to-noise ratio (PSNR) in applications such as image sharpening, JPEG image compression and image multiplication. The image sharpening application processes the input image with a 5 × 5 kernel to produce a high-quality image. Let K be the input image; the output image Z can then be expressed as follows.

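$$Z(x, y) = 2K(x, y) - R \tag{5}$$

where $R = \frac{1}{273} \sum_{l=-2}^{2} \sum_{n=-2}^{2} H(l+3, n+3)\, K(x-l, y-n)$ and $H$ is a matrix defined as

$$H = \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 7 & 26 & 41 & 26 & 7 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \end{bmatrix}$$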
Since image sharpening entails a substantial number of multiplication operations, it is well suited for evaluating the performance of approximate multipliers. All the multiplication operations are carried out using the existing and proposed multipliers, while the other operations are still performed using exact modules.
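
The sharpening flow of Eq. (5) with a pluggable multiplier can be sketched as follows. Several details are assumptions, since the text does not specify them: H is taken to be the standard 5 × 5 Gaussian mask whose weights sum to 273, borders are handled by clamping coordinates, the division by 273 is an integer division, and outputs are clipped to [0, 255]. The `truncating_mul` function is a hypothetical stand-in for an approximate multiplier and is not one of the paper's designs.

```python
# Image sharpening per Eq. (5) with the multiplier passed in as a function.
import math

H = [
    [1, 4, 7, 4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1, 4, 7, 4, 1],
]
KERNEL_SUM = 273


def exact_mul(a, b):
    return a * b


def truncating_mul(a, b):
    """Hypothetical approximate multiplier: clears the four low product bits."""
    return (a * b) & ~0xF


def sharpen(image, mul=exact_mul):
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            acc = 0
            for l in range(-2, 3):
                for n in range(-2, 3):
                    xx = min(max(x - l, 0), rows - 1)   # assumed border clamping
                    yy = min(max(y - n, 0), cols - 1)
                    # H(l+3, n+3) in the 1-based notation of the paper
                    acc += mul(H[l + 2][n + 2], image[xx][yy])
            r = acc // KERNEL_SUM                       # only mul() is approximated
            out[x][y] = min(max(2 * image[x][y] - r, 0), 255)
    return out


def psnr(reference, test):
    count = len(reference) * len(reference[0])
    mse = sum(
        (p - q) ** 2
        for row_r, row_t in zip(reference, test)
        for p, q in zip(row_r, row_t)
    ) / count
    return math.inf if mse == 0 else 10 * math.log10(255 ** 2 / mse)


image = [[(17 * x + 31 * y) % 256 for y in range(8)] for x in range(8)]  # synthetic test image
reference = sharpen(image)
approximate = sharpen(image, truncating_mul)
quality_db = psnr(reference, approximate)
print(quality_db)
```

Passing the multiplier as a function mirrors the evaluation setup described above: swapping `mul` changes only the multiplications, while the accumulation, division and subtraction remain exact.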
In JPEG image compression, the efficacy of the proposed designs is validated with a quality factor (QF) of 70. Further, the proposed designs are evaluated on image multiplication by multiplying two input images. The PSNR values of the existing and proposed designs for these applications are shown in Table 7. Table 7 shows that the proposed design D4 has better PSNR than the existing designs because it has lower NMED and MRED. D1, D2, and D3 have better PSNR than the existing designs except for Ax8-1 due to lower NMED and MRED. However, Ax8-1 occupies more area and consumes more power than the proposed designs.
As is evident from Fig. 4, the images obtained by using exact and proposed designs look identical. Fig. 5 shows the comparison
of PDP against MRED, NMED and PSNR. It can be observed from Fig. 5(a) that Ax8-1, AxRM1, C-N, Yang, Minho, and the proposed
designs have low MRED. All the multipliers except for Minho and the proposed multiplier tend to have high PDP. Designs Ax8-2,
Multiplier2, MUL2, Xilin, M3, M4, and Ax8-3 achieve moderate MRED though with a great variation in PDP. Multipliers AxRM2
and AxRM3 have high MRED and high PDP values. The same can be observed for NMED and PSNR in Fig. 5(b) & (c). Finally, we
can conclude that the proposed designs have better accuracy with less PDP.

5.2. Convolutional neural networks (CNNs)

The Lenet-5 CNN model is used to gauge the characteristics of the proposed multipliers. The pre-trained Lenet-5 model has been
quantized to 8-bit integers as suggested in [20]. The proposed and existing multipliers have been used in the convolution layer to
perform the multiplications in the inference phase. The MNIST dataset [21], a collection of ten thousand images of handwritten
digits, has been used to compute the model’s accuracy for all the multipliers. The comparison in accuracy can be found in Table 7,
from which we can observe that the proposed designs, D3 and D4, have higher accuracy than all the other multipliers. Designs D1
and D2 have better accuracy than existing designs except for Ax8-1 and C-N due to low MRED and NMED.


Fig. 4. (a–e) Output images of exact and proposed multipliers.

Fig. 5. Accuracy and hardware comparison between various existing multipliers (a) PDP vs MRED (b) PDP vs NMED (c) PDP vs PSNR.

6. Conclusions and future work

Approximate computing seeks to take advantage of the error tolerance of applications. This work presented an unsigned multiplier
that approximates the middle and least significant portions of the PPR structure to reduce computing complexity. Furthermore, two
approximate compressors and error-correcting modules are proposed for reducing the partial products in the approximate portion.
To investigate the circuit parameters, the proposed designs are implemented in Cadence using the TSMC 90 nm technology node. The
experimental results show that the proposed designs achieve power improvements of 26.5% and 18.5% over the exact and existing designs,
respectively, while also providing better accuracy. When validated using image processing and CNN applications, the proposed designs establish a
superior quality-effort tradeoff.
The improvement brought about by utilizing several approximate compressors for each column in Level 2, selected on the basis of the probability
analysis, would be interesting to study in further research. We also want to investigate how well the proposed designs operate on a
reconfigurable platform.

CRediT authorship contribution statement

Anil Kumar Uppugunduru: Conceptualization, Methodology, Software, Writing – original draft. S. Vignesh Bharadwaj:
Software, Visualization, Investigation. Syed Ershad Ahmed: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] Liu Weiqiang, Lombardi Fabrizio, Schulte Michael. A retrospective and prospective view of approximate computing [point of view]. Proc IEEE
2020;108(3):394–9.
[2] Afzali-Kusha Hassan, Vaeztourshizi Marzieh, Kamal Mehdi, Pedram Massoud. Design exploration of energy-efficient accuracy-configurable dadda multipliers
with improved lifetime based on voltage overscaling. IEEE Trans Very Large Scale Integr (VLSI) Syst 2020;28(5):1207–20.
[3] Ansari Mohammad Saeed, Mrazek Vojtech, Cockburn Bruce F, Sekanina Lukas, Vasicek Zdenek, Han Jie. Improving the accuracy and hardware efficiency
of neural networks using approximate multipliers. IEEE Trans Very Large Scale Integr (VLSI) Syst 2019;28(2):317–28.
[4] Mrazek Vojtech, Vasicek Zdenek, Sekanina Lukas, Jiang Honglan, Han Jie. Scalable construction of approximate multipliers with formally guaranteed worst
case error. IEEE Trans Very Large Scale Integr (VLSI) Syst 2018;26(11):2572–6.
[5] Leon Vasileios, Zervakis Georgios, Soudris Dimitrios, Pekmestzi Kiamal. Approximate hybrid high radix encoding for energy-efficient inexact multipliers.
IEEE Trans Very Large Scale Integr (VLSI) Syst 2017;26(3):421–30.
[6] Kulkarni Parag, Gupta Puneet, Ercegovac Milos. Trading accuracy for power with an underdesigned multiplier architecture. In: 2011 24th International
Conference on VLSI Design. IEEE; 2011, p. 346–51.
[7] Jiang Honglan, Liu Cong, Liu Leibo, Lombardi Fabrizio, Han Jie. A review, classification, and comparative evaluation of approximate arithmetic circuits.
ACM J Emerg Technol Comput Syst (JETC) 2017;13(4):1–34.
[8] Ansari Mohammad Saeed, Jiang Honglan, Cockburn Bruce F, Han Jie. Low-power approximate multipliers using encoded partial products and approximate
compressors. IEEE J Emerg Sel Top Circuits Syst 2018;8(3):404–16.
[9] Waris Haroon, Wang Chenghua, Liu Weiqiang, Han Jie, Lombardi Fabrizio. Hybrid partial product-based high-performance approximate recursive
multipliers. IEEE Trans Emerg Top Comput 2020.
[10] Waris Haroon, Wang Chenghua, Xu Chenyu, Liu Weiqiang. AxRMs: Approximate recursive multipliers using high-performance building blocks. IEEE Trans
Emerg Top Comput 2021.
[11] Momeni Amir, Han Jie, Montuschi Paolo, Lombardi Fabrizio. Design and analysis of approximate compressors for multiplication. IEEE Trans Comput
2014;64(4):984–94.
[12] Yang Zhixi, Han Jie, Lombardi Fabrizio. Approximate compressors for error-resilient multiplier design. In: 2015 IEEE international symposium on defect
and fault tolerance in VLSI and nanotechnology systems. IEEE; 2015, p. 183–6.
[13] Ha Minho, Lee Sunggu. Multipliers with approximate 4–2 compressors and error recovery modules. IEEE Embed Syst Lett 2017;10(1):6–9.
[14] Yi Xilin, Pei Haoran, Zhang Ziji, Zhou Hang, He Yajuan. Design of an energy-efficient approximate compressor for error-resilient multiplications. In: 2019
IEEE international symposium on circuits and systems. IEEE; 2019, p. 1–5.
[15] Venkatachalam Suganthi, Ko Seok-Bum. Design of power and area efficient approximate multipliers. IEEE Trans Very Large Scale Integr (VLSI) Syst
2017;25(5):1782–6.
[16] Pei Haoran, Yi Xilin, Zhou Hang, He Yajuan. Design of ultra-low power consumption approximate 4–2 compressors based on the compensation characteristic.
IEEE Trans Circuits Syst II: Express Briefs 2020;68(1):461–5.
[17] Strollo Antonio Giuseppe Maria, Napoli Ettore, Caro Davide De, Petra Nicola, Meo Gennaro Di. Comparison and extension of approximate 4-2 compressors
for low-power approximate multipliers. IEEE Trans Circuits Syst I Regul Pap 2020;67(9):3021–34.
[18] Kong Tianqi, Li Shuguo. Design and analysis of approximate 4–2 compressors for high-accuracy multipliers. IEEE Trans Very Large Scale Integr (VLSI)
Syst 2021;29(10):1771–81.
[19] Vahdat Shaghayegh, Kamal Mehdi, Afzali-Kusha Ali, Pedram Massoud. TOSAM: An energy-efficient truncation-and rounding-based scalable approximate
multiplier. IEEE Trans Very Large Scale Integr (VLSI) Syst 2019;27(5):1161–73.
[20] Tasoulas Zois-Gerasimos, Zervakis Georgios, Anagnostopoulos Iraklis, Amrouch Hussam, Henkel Jörg. Weight-oriented approximation for energy-efficient
neural network inference accelerators. IEEE Trans Circuits Syst I Regul Pap 2020;67(12):4670–83.
[21] LeCun Yann, Bottou Leon, Bengio Yoshua, Haffner Patrick. Gradient-based learning applied to document recognition. Proc IEEE 1998;86(11):2278–324.
