PAxC A Probabilistic-Oriented Approximate Computing Methodology For ANNs
PAxC A Probabilistic-Oriented Approximate Computing Methodology For ANNs
College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
E-mail: {pfhuang, chwang, chen.ke and liuweiqiang}@nuaa.edu.cn
Abstract—In spite of the rapidly increasing number of approx- the probabilistic energy benefit equation into the loss function
imate designs in circuit logic stack for Artificial Neural Networks to reflect the impact of noise introduced by approximated
(ANNs) learning. A principled and systematic approximate hard- hardware on quality estimation, a more aggressive and adaptive
ware incorporating domain knowledge is still lacking. As the layer
of ANN becomes deeper, the errors introduced by approximate approximation strategy can be found in a great number of
hardware will be accumulated quickly, which can result in optional datapaths. The main contributions of this paper are
unexpected results. In this paper, we propose a probabilistic- summarized as follows:
oriented approximate computing (PAxC) methodology based on • We propose a probabilistic-oriented approximate comput-
the notion of approximate probability to overcome the conceptual
and computational difficulties inherent to probabilistic ANN ing methodology for ANNs.
learning. The PAxC makes use of minimum likelihood error • In ANN inference, we propose two maximum likelihood
in both circuit and application level to maintain the aggressive approximate computing algorithms using weight stationary
approximate datapaths to boost the benefits from the trade- in ANNs.
off between accuracy and energy. Compared with a baseline • By considering the accumulation and counterbalance of
design, the proposed method significantly reduces the power-delay
product (PDP) with a negligible accuracy loss. Simulation and a error along from the circuit to application, approximation
case study of image processing validate the effectiveness of the friendly multiply-accumulate (MAC) designs are proposed
proposed methodology. for building the processing elements (PEs) in ANN learn-
Index Terms—Approximate computing, probabilistic-oriented, ing.
ANN learning, hybrid approximate circuits
II. R ELATED W ORK
I. I NTRODUCTION This section first introduces the related works on error
analysis of both ANN learning and approximate hardware
As emerging computation-intensive applications, the ever-
design for this work.
increasing performance requirements will soon exceed the
growth of resource budgets. One of the most promising A. ANN Learning
paradigms is the approximate computing (AxC) which has ANNs are based on a set of connected units or nodes called
gained a increasingly significant attention. By considering such artificial neurons, which loosely model the neurons in the
error tolerance, AxC pursues the imperfect designs to trade off biological brain. A multilayer perceptron (MLP) is a typical
accuracy for better energy efficiency. Specifically, researchers type of feedforward ANNs. MLPs are very useful in research
have developed a number of approximate circuit units (such because they can solve problems stochastically, which usually
as approximate multipliers [1] and adders [2]). Many recent allows approximate solutions to extremely complex problems,
work using approximate schemes at the hardware level shows such as fitness approximation.
that it can play a vital role in improving the overall system
performance [3].
In fact, there is no universal approximate design that can B. Comprehensive Metrics for Approximate Hardware Design
fit all error-resilient applications. A systematic methodology is More acceleration leads to more accuracy deterioration. In
proposed to explore joint design space and find energy–area order to obtain the optimal trade-off, error analysis and effective
optimal solutions together with hardware parameters in [4]. A metrics should be considered in advance. To make a trade-off
larger scope of applications can benefit from neural approxi- between preciseness and energy efficiency, illustrative figure
mate computing which involves a great number of multiplica- of merit (FOM) and FOM1 are suggested in [6] and [7]
tions [5]. The fundamental computation unit in NN is a neuron, respectively.
which performs a multiply-accumulate (MAC) operation. The
most power consuming operation among NN computation is III. T HE P ROPOSED PA X C M ETHODOLOGY
multiplications [1]. In this paper, we propose a PAxC method- Multiple approximate units usually interact in a datapath;
ology that allows the exploration of approximate multipliers moreover, many applications often require complex datapaths
(AMs) efficiently coordinating with ANN learning. By injecting rather than just a single operation (such as multiplication), so
978-3-9819263-6-1/DATE22/2022
c EDAA 1165
Authorized licensed use limited to: Southeast University. Downloaded on August 17,2022 at 16:07:40 UTC from IEEE Xplore. Restrictions apply.
retraining leads to over-fitting of the noises. In Fig. 2 (a),
blue and yellow dots (training data) are perfectly classified. It
indicates that a high accuracy for training data is not necessarily
a good indicator, as it may also imply that the model is
suffering from overfitting. Such a data set may perform well
in the test scenario, but it may fail in specific applications
Fig. 1: Overview of the proposed PAxC. since it lacks the generalization ability as shown in Fig. 2 (b).
Hence, it is reasonable to find the optimal combinations for
approximate multipliers to inference on the model trained by
exact computation.
1) Approximation Friendly Multiply-accumulate (MAC) De-
signs: In many ANNs, the computation of score is actually
dot product of the feature (x) and the weight (w)
the (i.e.,
i wi xi ). In the inference process, the MAC operations of the
feature extraction (convolutional layer) and the classification
Fig. 2: The generalization result of the model from (a) approx- can be easily parallelized. To increase the design flexibility and
imate and (b) exact training. facilitate approximation, we propose a block by block neural
network architecture design, which includes the following two
basic computing blocks.
the optimal hybrid scheme is quite essential to calibrate the A) Processing Element (PE) Block: In convolution process,
accumulative error. Fig. 1 gives an overview of the proposed the core operation is multiplying the sliding filter and
PAxC methodology. Although machine learning applications the corresponding sub-block of the input feature. These
have error resilience capabilities, AxC should be applied in a operations can be conveniently parallelized in different
principled manner to ensure that the impact on output quality channels and filters. Hence, operations in each filter can
is negligible (or acceptable). be constructed as one PE, and operations of channels can
A. PAxC for ANN Inference be divided into separated PE blocks. As shown in Fig. 3,
the PE in one block maintains the same filters which can
Inference applies knowledge from a trained neural network
simplify the complexity of PE design.
model and uses it to infer a result. When a new unknown data
B) Multiply-Accumulate (MAC) Block: The elementary unit
set is input through a trained neural network, it outputs a predic-
of ANN is MAC which only contains an addition and
tion based on predictive accuracy of the neural network. Infer-
multiplication. Each individual MAC can use a unique
ence comes after training as it requires a trained neural network
approximate strategy. As shown in Fig. 4, MACs in one
model. An application supports lots of approximate settings
PE use different approximate multipliers. The output of
with multiple approximate modules. This makes computation-
the PE is the accumulation of n MACs, hence, the optimal
intensive retraining impossible to pick up a ”best” strategy
group of the approximate multipliers should compensate
among all the approximate strategies. Meanwhile, excessive
each other to achieve the overall fine result.
2) PAxC Algorithms: Since one operand of the multipli-
cation is the weight wi which is predefined by the training
model. Intuitively, it is easy to estimate the error rate Pe (x|wi )
introduced by approximate multiplier based on the conditional
probability of wi with random input x. In order to search the
optimal approximate multiplier Mk for the static wi , perfor-
mance P (M|x, w) and related FOM (RF OM ) are defined as
follows:
N
1 1
P (Mk |wi , x) = (1)
N j=1 RF OM (Mk |wi , xj )
Fig. 3: Approximation friendly PE blocks. α × P DP × area
RF OM = (2)
β × (1 − N RED)
M̂k = arg max P̂ (Mk |wi , x) (3)
Mk ∈Θ
Authorized licensed use limited to: Southeast University. Downloaded on August 17,2022 at 16:07:40 UTC from IEEE Xplore. Restrictions apply.
maximum likelihood estimation is to find the optimal M̂k that Algorithm 2 Fine-tuned PAxC2 for ANNs
maximize the likelihood performance function over the poten- Require:
tial approximate multiplier space Θ. The overall probabilistic- The set of weight W ;
oriented approximate computing algorithm (PAxC1) is summa- The set of Data X ;
rized in Algorithm 1 which finds out the maximum likelihood The set of approximate operators M;
of approximate operators. To better facilitate approximation The setting of the coefficients (α and β).
and reduce the data movement between the AI engine and the The set of maximum likelihood performance M̂ for each
memory, the parameters of the involved filter will be reused in wi ∈ W
the PE block with proposed approximation friendly MACs. In Ensure:
order to neutralize the error on final summation of MACs for The set of fine-tuned weight W
the output feature, a neighborhood sensitive performance P̂ is 1: Set i := 0;
defined as follows: 2: repeat
1 3: Set i ← i + 1;
P̂ (Mk , Mk |wi , x) =
RF OM (Mk |wi , wi−1 , x) 4: repeat
(4) if P (M̂|wj , x) > P (M̂|wi , x) then
1 5:
+ Update wi := wj ;
RF OM (Mk |wi , wi+1 , x) 6:
7: end if
The effective pair-wise accelerators can be determined via the
8: until The other weights (wj ) are traversed.
optimal combination performance P̂ . Similarly, the maximum
9: until All weights are fine-tuned under the maximum
likelihood ŵi of weight wi can be obtained under given
likelihood performance.
approximate operator M̂i . Suppose C is the correct operator.
|M̂(w, x) − C(w, x)| = ε (5)
A. Simulation Setup
arg min |M̂(ŵ, x) − C(w, x)| = ε̂
ŵ (6) Although the proposed method is applicable to most re-
(∃ŵ) s.t. ε̂ ≤ ε configurable multipliers and adders, the method in [1] is
From Eqs. (5) and (6), weight wi can be fine-tuned as ŵi to considered as multipliers are more complex and have higher
formulate a more approximation friendly model of the neural energy consumption than adders. Digits dataset in the standard
network in a regression way. In the extreme case (ε̂ == ε), UCI datasets are used. Logic synthesis and simulation tools i.e.
ŵi is equal to wi . Algorithm 2 presents the steps of the fine- the Synopsys Design Compiler and VCS are utilized; the target
tuned probabilistic-oriented approximate computing algorithm. synthesis technology is given by the Nangate 45-nm open cell
library.
Algorithm 1 PAxC1 for ANNs
B. Simulation Results
Require:
The set of weight W ; A typical MLP with 3 layers is built to classify the digits
The set of Data X ; data-set. For convenience, both α and β are set as 1 in
The set of approximate operators M; PAxC algorithms. In order to analysis the superiority of our
The setting of the coefficients (α and β). proposed method, the pre-processing and post-processing of an
Ensure: application are omitted. The total number of weights is 7620
A maximum likelihood performance M̂ for each wi ∈ W and the hardware consumption result is the average values
1: Set i := 0; of overall approximate multipliers embedded in the network.
2: repeat Table I summarizes the accuracy loss, power, delay, area and
3: Set i ← i + 1, M̂i ← M1 ; power-delay product. Both PAxC1 and PAxC2 maintain better
4: repeat energy performance than the fixed AMs (p=2,6), and certainly
5: if P (Mk |wi , x) > P (Mk−1 |wi , x) and k > 1 then more accuracy than fixed AM (p=10). After weights fine-tuned,
6: Update M̂i := Mk ; PAxC2 achieves higher accuracy than PAxC1 with the exact
7: end if same hardware design.
8: until Each potential approximate operator Mk ∈ M is
C. Case Study: Image Processing
traversed.
9: until Probabilistic-oriented approximate operator M̂ for An autoencoder is a type of ANNs used to learn efficient
all weights are found. codings of unlabeled data. The autoencoder learns a repre-
sentation for a set of data, by training the network to ignore
insignificant data (”noise”). Exact, PAxC and the moderate
accelerator (p=6) in [1] are applied to auto-encode the images
IV. E VALUATION AND A NALYSIS in Fig. 5. The models is trained by exact computing. Since the
Hardware metrics, such as power consumption, area, critical autoender involves many multiplications, the accumulated error
path delay and PDP are considered in this section. has a large impact. From the results, all decoding images of the
Authorized licensed use limited to: Southeast University. Downloaded on August 17,2022 at 16:07:40 UTC from IEEE Xplore. Restrictions apply.
(a) (b)
(c) (d)
(e) (f)
(g) (h)
Fig. 5: Image processing: (a) original images, (b) original images with noise, (c), (e), (g) are the reconstructions of original
images by auto-encoders with exact computing, PAxC and AM(p=6) [1]; (d), (f), (h) are the reconstructions of noisy images
with exact computing, PAxC and AM(p=6) [1] receptively.
TABLE I: Accuracy and Energy Evaluation tors. The proposed PAxCs aggressively employ approximate ac-
celerators under the compressive metric RFOM, thus achieving
Approximate Accuracy Power Delay Area Energy better performance, while ensuring that the result quality meets
Multipliers loss (%) (mW ) (ps) ( mm2 ) (nJ)
the application’s requirements. Our evaluation showed that the
PAxC1 0.35 33.396 0.63 0.1327 21.039 proposed PAxCs reduce power consumption and maintain high
PAxC2 0.13 33.396 0.63 0.1241 21.039
AM (p=2) [1] 0.28 44.385 0.63 0.1592 27.963 quality.
AM (p=6) [1] 2.33 36.855 0.61 0.1373 22.482
AM (p=10) [1] 10.20 25.319 0.58 0.1009 14.685 ACKNOWLEDGMENT
Exact multiplier 0 48.179 0.71 0.2011 34.207 The authors would like to acknowledge support from Na-
tional Natural Science Foundation of China under grant No.
62022041 and 61871216.
accelerator (p=6) in [1] suffer too much deviation caused by the R EFERENCES
approximation. our proposed method can achieve almost exact
[1] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, “Design of
in a dedicated approximate strategy with maximum likelihood approximate radix-4 Booth multipliers for error-tolerant computing,” IEEE
performance and compensation group. Input noises can be Trans. Computers, vol. 66, no. 8, pp. 1435–1441, 2017.
denoised through the learning filters which are not aware of the [2] M. A. Hanif, R. Hafiz, O. Hasan, and M. Shafique, “Pemacx: A proba-
bilistic error analysis methodology for adders with cascaded approximate
noise introduced by the hardware approximators. Our method units,” in Proc. DAC, 2020, pp. 1–6.
can provide the accelerators which are adapted to weights of [3] F. Ebrahimi-Azandaryani, O. Akbari, M. Kamal, A. Afzali-Kusha, and
filters with injected errors. Hence, it can achieve the desired M. Pedram, “Block-based carry speculative approximate adder for energy-
efficient applications,” IEEE Trans. Circuits and Systems II: Express Briefs,
performance with acceptable accuracy loss. vol. 67, no. 1, pp. 137–141, 2020.
[4] T. Alan, A. Gerstlauer, and J. Henkel, “Cross-layer approximate hardware
V. C ONCLUSION synthesis for runtime configurable accuracy,” IEEE Trans. VLSI Systems,
vol. 29, no. 6, pp. 1231–1243, 2021.
By allowing the almost exact computation in error-tolerance [5] H. R. Mahdiani, M. Haji Seyed Javadi, and S. M. Fakhraie, “Efficient
applications, approximate computing can gain both perfor- utilization of imprecise computational blocks for hardware implementation
mance and energy efficiency. In this work, we propose a of imprecision tolerant applications,” Microelectronics Journal, vol. 61, pp.
57–66, 2017.
probabilistic-oriented approximate computing methodology and [6] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, “Dual-quality 4:2
an approximate friendly hardware architecture for ANN. Based compressors for utilizing in dynamic accuracy configurable multipliers,”
on the notion of approximate probability, we designed two IEEE Trans. VLSI Systems, vol. 25, no. 4, pp. 1352–1361, 2017.
[7] F. Sabetzadeh, M. H. Moaiyeri, and M. Ahmadinejad, “A majority-based
algorithms to search the optimal combinations of approximate imprecise multiplier for ultra-efficient approximate image multiplication,”
operators and fine-tuned weights. Approximation friendly MAC IEEE Trans. Circuits and Systems I: Regular Papers, vol. 66, no. 11, pp.
and PE block are designed to better apply the hybrid accelera- 4200–4208, 2019.
Authorized licensed use limited to: Southeast University. Downloaded on August 17,2022 at 16:07:40 UTC from IEEE Xplore. Restrictions apply.