
FTT-NAS: Discovering Fault-tolerant Convolutional

Neural Architecture
XUEFEI NING, GUANGJUN GE, WENSHUO LI, and ZHENHUA ZHU, Department of
Electronic Engineering, Tsinghua University, China
YIN ZHENG, Weixin Group, Tencent, China
XIAOMING CHEN, State Key Laboratory of Computer Architecture, Institute of Computing Technology,
Chinese Academy of Sciences, China
ZHEN GAO, School of Electrical and Information Engineering, Tianjin University, China
YU WANG and HUAZHONG YANG, Department of Electronic Engineering, Tsinghua University,
China

With the fast evolution of embedded deep-learning computing systems, applications powered by deep learning
are moving from the cloud to the edge. When deploying neural networks (NNs) onto devices in
complex environments, there are various types of possible faults: soft errors caused by cosmic radiation and
radioactive impurities, voltage instability, aging, temperature variations, malicious attackers, and so on. Thus,
the safety risk of deploying NNs is now drawing much attention. In this article, after analyzing the possible
faults in various types of NN accelerators, we formalize and implement various fault models from the
algorithmic perspective. We propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically
discover convolutional neural network (CNN) architectures that are resilient to various faults in today's
devices. Then, we incorporate fault-tolerant training (FTT) in the search process to achieve better results, which
is referred to as FTT-NAS. Experiments on CIFAR-10 show that the discovered architectures outperform other
manually designed baseline architectures significantly, with comparable or fewer floating-point operations
(FLOPs) and parameters. Specifically, with the same fault settings, F-FTT-Net discovered under the feature
fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered
under the weight fault model achieves an accuracy of 69.6% (vs. 60.8% achieved by ResNet-18). By inspecting
the discovered architectures, we find that the operation primitives, the weight quantization range, the capacity
of the model, and the connection pattern influence the fault resilience capability of NN models.

CCS Concepts: • Hardware → Fault tolerance; • Computing methodologies → Computer vision;

This work was supported by National Natural Science Foundation of China (No. U19B2019, 61832007, 61621091), National
Key R&D Program of China (No. 2017YFA02077600); Beijing National Research Center for Information Science and Tech-
nology (BNRist); Beijing Innovation Center for Future Chips; the project of Tsinghua University and Toyota Joint Research
Center for AI Technology of Automated Vehicle (TT2020-01); Beijing Academy of Artificial Intelligence.
Authors’ addresses: X. Ning, G. Ge, W. Li, Z. Zhu, Y. Wang, and H. Yang, Tsinghua University, Department of Electronic En-
gineering, Rohm Building, Beijing, China, 100084; emails: [email protected], [email protected],
[email protected], [email protected], [email protected], [email protected]; Y.
Zheng, Weixin group, Tencent, Beijing, China, 100080; email: [email protected]; X. Chen, State Key Laboratory
of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China, 100190; email:
[email protected]; Z. Gao, School of Electrical and Information Engineering, Tianjin University, China, 300072; email:
[email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
1084-4309/2021/08-ART44 $15.00
https://doi.org/10.1145/3460288


Additional Key Words and Phrases: Neural architecture search, fault tolerance, neural networks
ACM Reference format:
Xuefei Ning, Guangjun Ge, Wenshuo Li, Zhenhua Zhu, Yin Zheng, Xiaoming Chen, Zhen Gao, Yu Wang,
and Huazhong Yang. 2021. FTT-NAS: Discovering Fault-tolerant Convolutional Neural Architecture. ACM
Trans. Des. Autom. Electron. Syst. 26, 6, Article 44 (August 2021), 24 pages.
https://doi.org/10.1145/3460288

1 INTRODUCTION
Convolutional Neural Networks (CNNs) have achieved breakthroughs in various tasks, includ-
ing classification [14], detection [31], segmentation [32], and so on. Due to their promising perfor-
mance, CNNs have been utilized in various safety-critical applications, such as autonomous driving,
intelligent surveillance, and identification. Meanwhile, driven by the recent academic and indus-
trial efforts, the neural network accelerators based on various hardware platforms (e.g., Applica-
tion Specific Integrated Circuits (ASIC) [9], Field Programmable Gate Array (FPGA) [37],
Resistive Random-Access Memory (RRAM) [10]) have been rapidly evolving.
The robustness and reliability related issues of deploying neural networks onto various embed-
ded devices for safety-critical applications are attracting more and more attention. There is a large
stream of algorithmic studies on various robustness-related characteristics of NNs, e.g., adversarial
robustness [44], data poisoning [41], interpretability [53], and so on. However, no hardware mod-
els are taken into consideration in these studies. Besides the issues from the purely algorithmic
perspective, there exist hardware-related reliability issues when deploying NNs onto nowadays
embedded devices. With the down-scaling of CMOS technology, circuits become more sensitive to
cosmic radiation and radioactive impurities [16]. Voltage instability, aging, and temperature vari-
ations are also common effects that could lead to errors. As for the emerging metal-oxide RRAM
devices, due to the immature technology, they suffer from many types of device faults [7], among
which hard faults such as Stuck-at-Faults (SAFs) damage the computing accuracy severely and
could not be easily mitigated [49]. Moreover, malicious attackers can attack the edge devices by
embedding hardware Trojans, manipulating back-doors, and doing memory injection [54].
Recently, some studies [28, 40, 46] analyzed the sensitivity of NN models. They proposed to
predict whether a layer or a neuron is sensitive to faults and protect the sensitive ones. For fault
tolerance, a straightforward way is to introduce redundancy in the hardware. Triple Modular
Redundancy (TMR) is a commonly used but expensive method to tolerate a single fault [4, 42,
55]. References [28, 49] proposed various redundancy schemes for Stuck-at-Faults tolerance in
the RRAM-based Computing Systems. For increasing the algorithmic fault resilience capability,
References [12, 15] proposed to use fault-tolerant training (FTT), in which random faults are
injected in the training process.
Although redesigning the hardware for reliability is effective, it is not flexible and inevitably
introduces a large overhead. It would be better if the issues could be mitigated as far as possible
from the algorithmic perspective. Existing methods are mainly concerned with designing training
methods and analyzing the weight distribution [12, 15, 40]. Intuitively, the neural architecture
might also be important for the fault tolerance characteristics [1, 25], since it determines the “path”
of fault propagation. To verify these intuitions, the accuracies of baselines under a random bit-
bias feature fault model1 are shown in Table 1, and the results under SAF weight fault model2 are
shown in Table 2. These preliminary experiments on the CIFAR-10 dataset show that the fault
tolerance characteristics vary among neural architectures, which motivates the employment of
neural architecture search (NAS) techniques in designing fault-tolerant neural architectures.
1 The random bit-bias feature fault model is formalized in Section 3.4.
2 The SAF weight fault model is formalized in Section 3.5.


Table 1. Performance of the Baseline Models with Random Bit-bias Feature Faults

Model          Acc (0 / 10^-5 / 10^-4)   #Params   #FLOPs
ResNet-18      94.7 / 63.4 / 10.0        11.2M     1110M
VGG-16†        93.1 / 21.4 / 10.0        14.7M     626M
MobileNet-V2   92.3 / 10.0 / 10.0        2.3M      182M

0 / 10^-5 / 10^-4 denotes the per-MAC fault rate.
†: For simplicity, we only keep one fully connected layer of VGG-16.

Table 2. Performance of the Baseline Models with SAF Weight Faults

Model          Acc (0 / 4% / 8%)     #Params   #FLOPs
ResNet-18      94.7 / 64.8 / 17.8    11.2M     1110M
VGG-16         93.1 / 45.7 / 14.3    14.7M     626M
MobileNet-V2   92.3 / 26.2 / 11.7    2.3M      182M

0 / 4% / 8% denotes the sum of the SAF1 and SAF0 rates.

We emphasize that our work is orthogonal to most of the previous methods based on hardware or
mapping strategy design. To the best of our knowledge, our work is the first to increase the algorithmic
fault resilience capability by optimizing the NN architecture.
In this article, we employ NAS to discover fault-tolerant neural network architectures against
feature faults and weight faults, and demonstrate the effectiveness by experiments. The main con-
tributions of this article are as follows:
• We analyze the possible faults in various types of NN accelerators (ASIC-based, FPGA-based,
and RRAM-based), and formalize the statistical fault models from the algorithmic perspec-
tive. After the analysis, we adopt the Multiply-Accumulate (MAC)-i.i.d. Bit-Bias (MiBB)
model and the arbitrary-distributed Stuck-at-Fault (adSAF) model in the neural archi-
tecture search for tolerating feature faults and weight faults, respectively.
• We establish a multi-objective neural architecture search framework. On top of this frame-
work, we propose two methods to discover neural architectures with better reliability: FT-
NAS (NAS with a fault-tolerant multi-objective), and FTT-NAS (NAS with a fault-tolerant
multi-objective and fault-tolerant training (FTT)).
• We employ FT-NAS and FTT-NAS to discover architectures for tolerating feature faults and
weight faults. The discovered architectures, F-FTT-Net and W-FTT-Net, have comparable
or fewer floating-point operations (FLOPs) and parameters, and achieve better fault re-
silience capabilities than the baselines. With the same fault settings, F-FTT-Net discovered
under the feature fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-
V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6%
(vs. 60.8% achieved by ResNet-18). The ability of W-FTT-Net to defend against several other
types of weight faults is also illustrated by experiments.
• We analyze the discovered architectures and discuss how the weight quantization range, the
capacity of the model, and the connection pattern influence the fault resilience capability of
a neural network.
The rest of this article is organized as follows: The related studies and the preliminaries are in-
troduced in Section 2. In Section 3, we conduct a comprehensive analysis of the possible faults
and formalize the fault models. In Section 4, we elaborate on the design of the fault-tolerant
NAS system. Then in Section 5, the effectiveness of our method is illustrated by experiments,
and the insights are also presented. Finally, we discuss and conclude our work in Section 6 and
Section 7.

2 RELATED WORK AND PRELIMINARY


2.1 Convolutional Neural Network
Usually, a convolutional neural network is constructed by stacking multiple convolution layers
and optional pooling layers, followed by fully connected layers. Denoting the input feature map
(IFM), before-activation output feature map, output feature map (OFM, i.e., activations), weights
and bias of the i-th convolution layer as x^(i), f^(i), y^(i), W^(i), b^(i), the computation can be written as:

f^(i) = W^(i) ⊛ x^(i) + b^(i),    y^(i) = g(f^(i)),    (1)

where ⊛ is the convolution operator and g(·) is the activation function, for which the ReLU function
(g(x) = max(x, 0)) is the most common choice. From now on, we omit the (i) superscript for
simplicity.

2.2 NN Accelerators and Fixed-point Arithmetic


With dedicated data flow design for efficient neural network processing, FPGA-based NN accel-
erators could achieve at least 10× better energy efficiency than GPUs [11, 37]. And ASIC-based
accelerators could achieve even higher efficiency [9]. Besides, RRAM-based computing systems
are promising solutions for energy-efficient brain-inspired computing [10], due to their capability
of performing matrix-vector-multiplications (MVMs) in memory. Existing studies have shown
RRAM-based Processing-In-Memory (PIM) architectures can enhance the energy efficiency by
over 100× compared with both GPU and ASIC solutions, as they can eliminate the large data
movements of bandwidth-bounded NN applications [10]. For the detailed and formal hardware
architecture descriptions, we refer the readers to the references listed above.
Currently, most NN accelerators implement fixed-point arithmetic units, as (1) they consume far
fewer resources and are much more efficient than floating-point ones [11]; and (2) NN models have
been shown to be relatively insensitive to quantization [19, 37]. Consequently, quantization is
usually applied before a neural network model is deployed onto the edge devices. To keep con-
sistent with the actual deploying scenario, our simulation incorporates 8-bit dynamic fixed-point
quantization for the weights and activations. More specifically, independent step sizes are used
for the weights and activations of different layers. Denoting the fraction length and bit-width of a
tensor as l and Q, the step size (resolution) of the representation is 2^-l. For common CMOS platforms,
in which two's complement representation is used for numbers, the representation range of both
weights and features is

[−2^(Q−l), 2^-l (2^Q − 1)].    (2)
As for RRAM-based NN platforms, two separate crossbars are usually used for storing positive
and negative weights [10]. Thus, the representation range of the weights (denoted by the w super-
script) is
[−R^w, R^w] = [−2^-l (2^(Q+1) − 1), 2^-l (2^(Q+1) − 1)].    (3)
For the feature representation in RRAM-based platforms, by assuming that the Analog to Digi-
tal Converters (ADCs) and Digital to Analog Converters (DACs) have enough precision, and
the CMOS bit-width is Q-bit, the representation range of features (denoted by the f superscript)
in CMOS circuits is
[−R^f, R^f] = [−2^(Q−l), 2^-l (2^Q − 1)].    (4)
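For concreteness, the following PyTorch sketch shows one way to simulate such dynamic fixed-point quantization with a per-tensor fraction length chosen by a minimal-overflow rule; the function name and the exact rule for picking l are illustrative assumptions, not the implementation used in this work.

```python
import torch

def dynamic_fixed_point(x: torch.Tensor, bitwidth: int = 8):
    # Choose the fraction length l by a minimal-overflow rule: use the largest
    # l (finest resolution 2^-l) such that max|x| still fits the signed range.
    qmax = 2 ** (bitwidth - 1) - 1
    max_abs = float(x.abs().max())
    frac_len = bitwidth - 1
    while frac_len > -bitwidth and max_abs > qmax * 2.0 ** (-frac_len):
        frac_len -= 1
    step = 2.0 ** (-frac_len)                            # resolution 2^-l
    x_q = torch.clamp(torch.round(x / step), -(qmax + 1), qmax) * step
    return x_q, frac_len
```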

2.3 Fault Resilience for CMOS-based Accelerators


Borkar [5], Henkel et al. [16], Slayman [43] revealed that advanced nanotechnology makes circuits
more vulnerable to soft errors. Unlike hard errors, soft errors do not damage the underlying cir-
cuits, but instead trigger an upset of the logic state. The dominant cause of soft errors in CMOS
circuits is the radioactive events, in which a single particle strikes an electronic device. Arechiga
and Michaels [1], Libano et al. [26] explored how the Single-Event Upset (SEU) faults impact the
FPGA-based CNN computation system.
TMR is a commonly used approach to mitigate SEUs [4, 42, 55]. Traditional TMR methods are
agnostic of the NN applications and introduce large overhead. To exploit the NN applications’
characteristics to reduce the overhead, one should understand the behavior of NN models with
computational faults. Vialatte and Leduc-Primeau [46] analyzed the layer-wise sensitivity of NN
models under two hypothetical feature fault models. Libano et al. [26] proposed to only triplicate
the vulnerable layers after layer-wise sensitivity analysis and reduced the LUTs overhead for an
NN model on Iris Flower from about 200% (TMR) to 50%. Schorn et al. [40] conducted sensitivity
analysis on the individual neuron level. Li et al. [25] found that the impacts and propagation of
computational faults in an NN computation system depend on the hardware data path, the model
topology, and the type of layers. These methods analyzed the sensitivity of existing NN models
at different granularities and exploited the resilience characteristics to reduce the hardware over-
head for reliability. Our methods are complementary and discover NN architectures with better
algorithmic resilience capability.
To avoid the accumulation of the persistent soft errors in FPGA configuration registers, the
scrubbing technique is applied by checking and partially reloading the configuration bits [4, 6].
From the algorithmic perspective, Hacene et al. [12] demonstrated the effectiveness of fault-
tolerant training in the presence of SRAM bit failures.

2.4 Fault Resilience for RRAM-based Accelerators


RRAM devices suffer from lots of device faults [7], among which the commonly occurring SAFs
are shown to cause severe degradation in the performance of mapped neural networks [49]. RRAM
cells containing SAF faults get stuck at high-resistance state (SAF0) or low-resistance state (SAF1),
thereby causing the weight to be stuck at the lowest or highest magnitudes of the representa-
tion range, respectively. Besides the hard errors, resistance programming variation [24] is another
source of faults for NN applications [27].
For the detection of SAFs, Kannan et al. [20, 21] proposed fault detection methods that can
provide high fault coverage; Xia et al. [50] proposed an on-line fault detection method that can peri-
odically detect the current distribution of faults.
Most of the existing studies on improving the fault resilience ability of RRAM-based neural
computation systems focus on designing the mapping and retraining methods. Chen et al. [8], Liu
et al. [28], Xia et al. [49, 50] proposed different mapping strategies and the corresponding hardware
redundancy design. After the distribution detection of the faults and variations, they proposed to
retrain (i.e., fine-tune) the NN model for tolerating the detected faults, which is exploiting the
intrinsic fault resilience capability of NN models. To overcome the programming variations, Liu
et al. [27] calculated the calibrated programming target weights with the log-normal resistance
variation model and proposed to map sensitive synapses onto cells with small variations. From the
algorithmic perspective, Liu et al. [30] proposed to use error-correcting output codes (ECOC)
to improve the NN’s resilience capability for tolerating resistance variations and SAFs.

2.5 Neural Architecture Search


Neural Architecture Search, as an automatic neural network architecture design method, has
been recently applied to design model architectures for image classification and language mod-
els [29, 36, 57]. The architectures discovered by NAS techniques have demonstrated performance
surpassing that of manually designed ones. NASNet [57] used a recurrent neural network
(RNN) controller to sample architectures, trained them, and used the final validation accuracy to in-
struct the learning of the controller. Instead of using reinforcement learning (RL)-learned RNN
as the controller, Liu et al. [29] used a relaxed differentiable formulation of the neural architecture
search problem and applied gradient-based optimizer for optimizing the architecture parameters;
Real et al. [39] used evolutionary-based methods for sampling new architectures by mutating the
architectures in the population; Recent predictor-based search strategies [34, 35] sample architec-
tures with promising performance predictions by gradient-based method [34] or discrete inner
search methods [35]. Besides the improvements on the search strategies, a lot of methods are pro-
posed to speed up the performance evaluation in NAS. Baker et al. [3] incorporated learning curve
extrapolation to predict the final performance after a few epochs of training; Real et al. [39] sam-
pled architectures using mutation on existing models and initialized the weights of the sampled
architectures by inheriting from the parent model; Pham et al. [36] shared the weights among dif-
ferent sampled architectures and used the shared weights to evaluate each sampled architecture.
The goal of the NAS problem is to discover the architecture that maximizes some predefined ob-
jectives. The process of the original NAS algorithm goes as follows: At each iteration, α is sampled
from the architecture search space A. This architecture is then assembled as a candidate network
Net(α, w ), where w is the weights to be trained. After training the weights w on the training data
split D t , the evaluated reward of the candidate network on the validation data split Dv will be used
to instruct the sampling process. In its purest form, the NAS problem can be formalized as:

max_{α∈A}  E_{x_v∼D_v} [ R(x_v, Net(α, w*(α))) ]    (5)
s.t.  w*(α) = argmin_w  E_{x_t∼D_t} [ L(x_t, Net(α, w)) ],

where A is the architecture search space, ∼ is the sampling operator, and x t , xv denote the data
sampled from the training and validation data splits D t , Dv , respectively. E x ∼D [·] denotes the ex-
pectation with respect to the data distribution D, R denotes the evaluated reward used to instruct
the sampling process, and L denotes the loss criterion for back propagation during the training of
the weights w.
Originally, for the performance evaluation of each sampled architecture α, one needs to find
the corresponding w ∗ (α ) by fully training the candidate network from scratch. This process is
extremely slow, and shared weights evaluation is commonly used for accelerating the evaluation.
In shared weights evaluation, each candidate architecture α is a subgraph of a super network and
is evaluated using a subset of the super network weights. The shared weights of the super network
are updated along the search process.

3 FAULT MODELS
In Section 3.1, we motivate and discuss the formalization of application-level statistical fault mod-
els. Platform-specific analyses are conducted in Section 3.2 and Section 3.3. Finally, the MAC-i.i.d.
Bit-Bias (MiBB) feature fault model and the arbitrary-distributed Stuck-at-Fault (adSAF)
weight fault model are described in Section 3.4 and Section 3.5, which are used in the
neural architecture search process. The analyses in this part are summarized in Figure 4(a) and
Table 3.


Table 3. Summary of the NN Application-level Statistical Fault Models, Due to Various Types of Errors on
Different Platforms

Platform        | Error source    | Error position       | Logic component | H/S | P/T | Common mitigation        | F/W | Simplified statistical model
RRAM            | SAF             | SB-cell              | Crossbar        | H   | P   | detection+3R [8, 28, 50] | W   | w ∼ 1bit-adSAF(w_0; p_0, p_1)
RRAM            | SAF             | MB-cell              | Crossbar        | H   | P   | detection+3R [8, 28, 50] | W   | w ∼ Qbit-adSAF(w_0; p_0, p_1)
RRAM            | variations      | MB-cell              | Crossbar        | S   | P   | PS loop [17, 27]         | W   | w ∼ LogNormal(w_0; σ), w ∼ ReciprocalNormal(w_0; σ)
FPGA/ASIC       | SEE, overstress | SRAM                 | Weight buffer   | H   | P   | ECC                      | W   | w ∼ iBF(w_0; r_h × M_p(t))
FPGA/ASIC       | SEE, VS         | SRAM                 | Weight buffer   | S   | T   | ECC                      | W   | w ∼ iBF(w_0; r_s)
FPGA            | SEE, overstress | LUTs                 | PE              | H   | P   | TMR [4, 42, 55]          | F   | f ∼ iBB(f_0; r_h × M_l × M_p(t))
FPGA            | SEE, VS         | LUTs                 | PE              | S   | P   | TMR, Scrubbing [6]       | F   | f ∼ iBB(f_0; r_s × M_l × M_p(t))
FPGA/ASIC/RRAM  | SEE, overstress | SRAM                 | Feature buffer  | H   | P   | ECC                      | F   | y ∼ iBF(y_0; r_h × M_p(t))
FPGA/ASIC/RRAM  | SEE, VS         | SRAM                 | Feature buffer  | S   | T   | ECC                      | F   | y ∼ iBF(y_0; r_s)
ASIC            | SEE, overstress | CL gates, flip-flops | PE              | H   | P   | TMR, DICE [13]           | F   | f ∼ iBB(f_0; r_hl × M_l × M_p(t))
ASIC            | SEE, VS         | CL gates, flip-flops | PE              | S   | T   | TMR, DICE [13]           | F   | f ∼ iBB(f_0; r_sl × M_l)

Headers: H/S refers to Hard/Soft errors; P/T refers to Persistent/Transient influences; F/W refers to Feature/Weight
faults.
Notations: w, f, y refer to the weights, before-activation features, and after-activation features of a convolution; p_0, p_1
refer to the SAF0 and SAF1 rates of RRAM cells; σ refers to the standard deviation of RRAM programming variations;
r_s, r_h refer to the soft and hard error rates of memory elements, respectively; r_sl, r_hl refer to the soft and hard error
rates of logic elements, respectively; M_l is an amplifying coefficient for the feature error rate due to multiple involved
computational components; M_p(t) > 1 is a coefficient that abstracts the error accumulation effects over time.
Abbreviations: SEE refers to Single-Event Errors, including Single-Event Burnout (SEB), Single-Event Upset (SEU),
and so on; "overstress" includes conditions such as high temperature, voltage, or physical stress; VS refers to voltage
(down)scaling that is used for energy efficiency; SB-cell and MB-cell refer to single-bit and multi-bit memristor cells,
respectively; CL gates refer to combinational logic gates; 3R refers to various Redundancy schemes and corresponding
Remapping/Retraining techniques; PS loop refers to the programming-sensing loop during memristor programming;
TMR refers to Triple Modular Redundancy; DICE refers to Dual Interlocked Cell.

3.1 Application-level Modeling of Computational Faults


Computational faults do not necessarily result in functional errors [16, 25]. For example, a neu-
ral network for classification tasks usually outputs a class probability vector, and our work only
regards it as a functional error if and only if the top-1 decision becomes different from the golden result.
Due to the complexity of the NN computations and different functional error definitions, it is very
inefficient to incorporate gate-level fault injection or propagation analysis into the training or ar-
chitecture search process. Therefore, to evaluate and further boost the algorithmic resilience of
neural networks to computational faults, the application-level fault models should be formalized.
From the algorithmic perspective, the faults fall into two categories: weight faults and feature
faults. In this section, we analyze the possible faults in various types of NN accelerators and for-
malize the statistical feature and weight fault models. A summary of these fault models is shown
in Table 3.
Note that we focus on the computational faults along the datapath inside the NN accelerator that
could be modeled and mitigated from the algorithmic perspective. Faults in the control units and
other chips in the system are not considered. See more discussion in the “limitation of application-
level fault models” section in Section 6.4.

3.2 Analysis of CMOS-based Platforms: ASIC and FPGA


The possible errors in CMOS-based platforms are illustrated in Figure 1. Soft errors that happen
in the memory elements or the logic elements could lead to transient faulty outputs in ASICs.
Compared with logic elements (e.g., combinational logic gates, flip-flops), memory elements are
more susceptible to soft errors [43]. An unprotected SRAM cell usually has a larger bit soft error
rate (SER) than flip-flops. Since the probability of hard errors occurring is much smaller than that
of soft errors, we focus on the analysis of soft errors, even though hard errors lead to permanent
failures.

Fig. 1. The possible error positions in CMOS-based platforms.
The soft errors in the weight buffer could be modeled as i.i.d. weight random bit-flips. Given
the original value as x_0, the distribution of a faulty value x' under the random bit-flip (BF) model
could be written as

x' ∼ BF(x_0; p) indicates x' = 2^-l (2^l x_0 ⊕ e),  e = Σ_{q=1}^{Q} e_q 2^(q−1),    (6)
e_q ∼ Bernoulli(p), q = 1, ..., Q,

where e_q denotes whether a bit-flip occurs at bit position q, and ⊕ is the XOR operator.
By assuming that errors occur at each bit with an i.i.d. bit SER of r_s, we know that each Q-bit
weight has an i.i.d. probability p_w of encountering an error, with p_w = 1 − (1 − r_s)^Q ≈ r_s × Q, as r_s × Q ≪ 1.
It is worth noting that throughout the analysis, we assume that the SERs of all components are ≪ 1,
hence the error rate at each level is approximated as the sum of the error rates of the independent
sub-components. As each weight encounters error independently, a weight tensor is distributed
as i.i.d. random bit-flip (iBF): w ∼ iBF(w 0 ; r s ), where w 0 is the golden weights. Reagen et al. [38]
showed that the iBF model could capture the bit error behavior exhibited by real SRAM hardware.
The soft errors in the feature buffer are modeled similarly as i.i.d. random bit-flips, with a fault
probability of approximately r_s × Q for Q-bit feature values. The distribution of the output feature
map (OFM) values could be written as y ∼ iBF(y_0; r_s), where y_0 is the golden result.
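A minimal PyTorch sketch of injecting such i.i.d. random bit-flips (iBF, Equation (6)) into an already-quantized tensor is given below; the helper name and the two's-complement handling are our own illustrative choices, not the released fault-injection code.

```python
import torch

def inject_ibf(x: torch.Tensor, frac_len: int, p_bit: float, bitwidth: int = 8):
    # x holds fixed-point values (multiples of 2^-frac_len); flip each of the
    # Q = bitwidth bits of every value independently with probability p_bit.
    scale = 2.0 ** frac_len
    mask_all = (1 << bitwidth) - 1
    x_int = torch.round(x * scale).long() & mask_all       # two's-complement view
    flip = torch.zeros_like(x_int)
    for q in range(bitwidth):
        bit_flips = (torch.rand(x.shape, device=x.device) < p_bit).long()
        flip |= bit_flips << q
    x_flipped = (x_int ^ flip) & mask_all
    sign_bit = 1 << (bitwidth - 1)
    x_signed = x_flipped - 2 * (x_flipped & sign_bit)       # back to signed integers
    return x_signed.float() / scale
```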
Actually, FPGA-based implementations are usually more vulnerable to soft errors than their
ASIC counterparts [2]. Since the majority space of an FPGA chip is filled with memory cells, the
overall SER rate is much higher. Moreover, the soft errors occurring in logic configuration bits
would lead to persistent faulty computation, rather than transient faults as in ASIC logic. Persis-
tent errors cannot be mitigated by simple retry methods and would lead to statistically significant
performance degradation. Moreover, since the persistent errors would be accumulated if no cor-
rection is made, the equivalent error rate would keep increasing as time goes on. We abstract this
effect with a monotonic increasing function Mp (t ) ≥ 1, where the subscript p denotes “persistent,”
and t denotes the time. For example, if the FPGA weight buffer or LUTs are reloaded for every
T period in the radioactive environment [4, 6], then a multiplier of M_p(T) would be the worst
bounding case. Note that the exact choice of t is not important in our experiments, since our work
mainly aims at comparing different neural architectures using certain fault insertion pattern and
ratio, and the temporal effect modeled by Mp (t ) does not influence the architectural preference.
Let us recap how one convolution is mapped onto the FPGA-based accelerator to see what the
configuration bit errors could cause on the OFM values. If the dimension of the convolution kernel
is (c, k, k) (channel, kernel height, and kernel width, respectively), then there are ck^2 − 1 ≈ ck^2 additions
needed for the computation of a feature value. We assume that the add operations are spatially
expanded onto adder trees constructed by LUTs, i.e., no temporal reusing of adders is used for
computing one feature value. That is to say, the add operations are mapped onto different hardware
adders3 and encounter errors independently. The per-feature error rate could be approximated by
the adder-wise SER times M_l, where M_l ≈ ck^2. Now, let us dive into the adder-level computation.
In a 1-bit adder with scale s, a bit-flip in one LUT bit would add a bias ±2^s to the output value,
if the input bit signals match the address of this LUT bit. If each LUT cell has an i.i.d. SER of r_s, then in
a Q'-bit adder, denoting the fraction length of the operands and result as l', the distribution
of the faulty output x' with random bit-bias (BB) faults could be written as

x' ∼ BB(x_0; p, Q', l') indicates x' = x_0 + e,  e = 2^-l' Σ_{q=1}^{Q'} (−1)^(β_q) 2^(q−1) e_q,    (7)
e_q ∼ Bernoulli(p),  β_q ∼ Bernoulli(0.5),  q = 1, ..., Q'.

As for the result of the adder tree constructed by multiple LUT-based adders, since the proba-
bility that multiple bit-bias errors co-occur is orders of magnitude smaller, we ignore the accumu-
lation of the biases that are smaller than the OFM quantization resolution 2^-l. Consequently, the
OFM feature values before the activation function follow the i.i.d. random bit-bias distribution
f ∼ iBB(f_0; r_s × M_l × M_p(t), Q, l), where Q and l are the bit-width and fraction length of the OFM
values, respectively.
We can make an intuitive comparison between the equivalent feature error rates induced by
LUTs soft errors and feature buffer soft errors. As the majority of FPGAs is SRAM-based, consid-
ering the bit SER r s of LUTs cell and BRAM cell to be close, we can see that the feature error rate
induced by LUTs errors is amplified by M_l × M_p(t). As we have discussed, M_p(t) ≥ 1 and M_l = ck^2 > 1, so
the performance degradation induced by LUTs errors could be significantly larger than that in-
duced by feature buffer errors.

3.3 Analysis of PIM-based Platforms: RRAM as an Example


In an RRAM-based computing system, compared with the accompanying CMOS circuits, the
RRAM crossbar is much more vulnerable to various non-ideal factors. In multi-bit RRAM cells,
studies have shown that the distribution of the resistance due to programming variance is either
Gaussian or Log-Normal [24]. As each weight is programmed as the conductance of the memristor
cell, the weight could be seen as being distributed as Reciprocal-Normal or Log-Normal. Besides the
soft errors, common hard errors such as SAFs, caused by fabrication defects or limited endurance,
could result in severe performance degradation [49]. SAFs occur frequently in nowadays RRAM
crossbar: As reported by Reference [7], the overall SAF ratio could be larger than 10% (p_1 = 9.04%
for SAF1 and p_0 = 1.75% for SAF0) in a fabricated RRAM device. The statistical model of SAFs in
single-bit and multi-bit RRAM devices is formalized in Section 3.5.
3 See more discussion in Section 6.4.

Fig. 2. An example of injecting feature faults under the iBB fault model (soft errors in FPGA LUTs).
As the RRAM crossbars also serve as the computation units, some non-ideal factors (e.g., IR-
drop, wire resistance) could be abstracted as feature faults. They are not considered in this work,
since the modeling of these effects highly depends on the implementation (e.g., crossbar dimension,
mapping strategy) and hardware-in-the-loop testing [15].

3.4 Feature Fault Model


As analyzed in Section 3.2, the soft errors in LUTs are relatively the more pernicious source of
feature faults, as (1) the SER is usually much higher than the hard error rate: r_s ≫ r_h, (2) these errors are
persistent if no correction is made, (3) the per-feature equivalent error rate is amplified as multiple
adders are involved. Therefore, we use the iBB fault model in our exploration of mitigating feature
faults.
We have f ∼ iBB(f_0; r_s M_l M_p(t)), where M_l = ck^2, and the probability of an error occurring at
every position in the OFM is p = r_s M_l M_p(t) Q = p_m M_l, where p_m = r_s Q M_p(t) is defined as
the per-MAC error rate. Denoting the dimension of the OFM as (C_o, H, W) (channel, height, and
width, respectively) and the dimension of each convolution kernel as (c, k, k ), the computation of
a convolution layer under this fault model could be written as
y = g(W ⊛ x + b + θ · 2^(α−l) · (−1)^β)    (8)
s.t.  θ ∼ Bernoulli(p)^(C_o×H×W),  α ∼ U{0, ..., Q − 1}^(C_o×H×W),  β ∼ U{0, 1}^(C_o×H×W),
where θ is the mask indicating whether an error occurs at each feature map position, α represents
the bit position of the bias, β represents the bias sign. Note that this formulation is not equivalent to
the random bit-bias formalization in Equation (7) and is adopted for efficient simulation. These two
formulations are close when the odds that two errors take effect simultaneously are small (p_m/Q ≪
1). This fault model is referred to as the MAC-i.i.d. Bit-Bias model (abbreviated as MiBB). An
example of injecting this type of feature faults is illustrated in Figure 2.
Intuitively, convolution computation that needs fewer MACs might be more immune to the
faults, as the equivalent error rate at each OFM location is lower.
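The following PyTorch sketch illustrates one way to simulate Equation (8) for a single convolution layer; the function signature and the default quantization parameters are assumptions for illustration rather than the implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def conv2d_with_mibb(x, weight, bias, p_mac, Q=8, frac_len=4, stride=1, padding=1):
    # Per-feature error rate p = p_mac * M_l, with M_l = c * k^2 (Equation (8)).
    f = F.conv2d(x, weight, bias, stride=stride, padding=padding)
    c, k = weight.shape[1], weight.shape[2]
    p = min(1.0, p_mac * c * k * k)
    theta = (torch.rand_like(f) < p).float()                       # error mask
    alpha = torch.randint(0, Q, f.shape, device=f.device).float()  # bit position of the bias
    beta = torch.randint(0, 2, f.shape, device=f.device).float()   # bias sign selector
    bias_term = theta * 2.0 ** (alpha - frac_len) * (1.0 - 2.0 * beta)
    return torch.relu(f + bias_term)
```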


Fig. 3. An example of injecting weight faults under the adSAF fault model (SAF errors in RRAM cells).

3.5 Weight Fault Model


As RRAM-based accelerators suffer from a much higher weight error rate than the CMOS-based
ones, the Stuck-at-Faults in RRAM crossbars are mainly considered in the setup of the weight
fault model. We assume the underlying platform is RRAM with multi-bit cells and adopt the com-
monly used mapping scheme, in which separate crossbars are used for storing positive and neg-
ative weights [10]. That is to say, when a SAF0 fault causes a cell to be stuck at HRS, the corre-
sponding logical weight would be stuck at 0. When a SAF1 fault causes a cell to be stuck at LRS,
the weight would be stuck at −Rw if it is negative, or Rw otherwise.
The computation of a convolution layer under the SAF weight fault model could be written as

y = д(W   x + b)
s.t. W  = (1 − θ ) · W + θ · e
θ ∼ Bernoulli(p0 + p1 )Co ×c×k×k
  Co ×c×k×k (9)
p1
m ∼ Bernoulli
p0 + p1
e = Rw sgn(W ) · m,

where R^w refers to the representation bound in Equation (3), θ is the mask indicating whether
fault occurs at each weight position, m is the mask representing the SAF types (SAF0 or SAF1) at
faulty weight positions, e is the mask representing the faulty target values (0 or ±Rw ). Every single
weight has an i.i.d. probability of p0 to be stuck at 0, and p1 to be stuck at the positive or negative
bounds of the representation range, for positive and negative weights, respectively. An example
of injecting this type of weight faults is illustrated in Figure 3.
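A minimal sketch of injecting such multi-bit-cell adSAF faults into a weight tensor, following Equation (9), is given below; the function name is ours, and R^w would come from the quantization settings of the corresponding layer.

```python
import torch

def inject_adsaf(weight: torch.Tensor, p0: float, p1: float, r_w: float):
    # theta: which weights are faulty; m: SAF1 (1) vs. SAF0 (0); e: stuck-at targets.
    theta = (torch.rand_like(weight) < (p0 + p1)).float()
    m = (torch.rand_like(weight) < p1 / (p0 + p1)).float()
    e = r_w * torch.sign(weight) * m
    return (1.0 - theta) * weight + theta * e
```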
Note that the weight fault model, referred to as arbitrary-distributed Stuck-at-Fault model
(adSAF), is much harder to defend against than SAF faults with a specific known defect map.
A neural network model that behaves well under the adSAF model is expected to achieve high
reliability across different specific SAF defect maps.
The above adSAF fault model assumes the underlying hardware is multi-bit RRAM devices;
adSAFs in single-bit RRAM devices are also of interest. In single-bit RRAM devices, multiple bits
of one weight value are mapped onto different crossbars, of which the results would be shifted and
added together [56]. In this case, a SAF fault that occurs in a cell would cause the corresponding
bit of the corresponding weight to be stuck at 0 or 1. The effects of adSAF faults on a weight value
in single-bit RRAM devices can be formulated as


w' = sgn(w) · 2^-l · (((¬θ) ∧ 2^l |w|) ∨ (θ ∧ e))    (10)
θ = Σ_{q=1}^{Q} θ_q 2^(q−1),  e = Σ_{q=1}^{Q} m_q 2^(q−1)
θ_q ∼ i.i.d. Bernoulli(p_0 + p_1),  q = 1, ..., Q
m_q ∼ i.i.d. Bernoulli(p_1/(p_0 + p_1)),  q = 1, ..., Q,
where the binary representation of θ indicates whether fault occurs at each bit position, the binary
representation of e represents the target faulty values (0 or 1) at each bit position if fault occurs.
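A sketch of the corresponding bit-wise injection for single-bit cells (Equation (10)) could look as follows; again, the helper is illustrative and assumes the magnitude bits of each weight are available after quantization.

```python
import torch

def inject_adsaf_1bit(w: torch.Tensor, frac_len: int, p0: float, p1: float, Q: int = 8):
    # Each of the Q magnitude bits is stuck with probability p0 + p1; a stuck bit
    # reads 1 with probability p1/(p0+p1) (SAF1) and 0 otherwise (SAF0).  The sign
    # is preserved, as positive and negative weights sit in separate crossbars.
    scale = 2.0 ** frac_len
    mag = torch.round(w.abs() * scale).long()          # 2^l * |w|
    theta = torch.zeros_like(mag)
    e = torch.zeros_like(mag)
    for q in range(Q):
        stuck = (torch.rand_like(w) < (p0 + p1)).long()
        stuck_at_1 = (torch.rand_like(w) < p1 / (p0 + p1)).long()
        theta |= stuck << q
        e |= (stuck & stuck_at_1) << q
    faulty_mag = ((~theta) & mag) | (theta & e)        # Equation (10)
    return torch.sign(w) * faulty_mag.float() / scale
```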
We will demonstrate that the architecture discovered under the multi-bits adSAF fault model can
also defend against single-bit adSAF faults and iBF weight faults caused by errors in the weight
buffers of CMOS-based accelerators.

4 FAULT-TOLERANT NAS
In this section, we present the FTT-NAS framework. We first give out the problem formalization
and framework overview in Section 4.1. Then, the search space, sampling, and assembling process
are described in Section 4.2 and Section 4.3, respectively. Finally, the search process is elaborated
in Section 4.4.

4.1 Framework Overview


Denoting the fault distribution characterized by the fault models as F , the neural network search
for fault tolerance can be formalized as
max_{α∈A}  E_{x_v∼D_v} [ E_{f∼F} [ R(x_v, Net(α, w*(α)), f) ] ]    (11)
s.t.  w*(α) = argmin_w  E_{x_t∼D_t} [ E_{f∼F} [ L(x_t, Net(α, w), f) ] ],

where A is the architecture search space, and D t , Dv denote the training and validation data
split, respectively. R and L denote the reward and loss criterion, respectively. The major differ-
ence of Equation (11) to the vanilla NAS problem Equation (5) lies in the introduction of the fault
model F .
As the cost of finding the best weights w* for each architecture α is prohibitive, we
use the shared-weights based evaluator, in which shared weights are directly used to evaluate
sampled architectures. The resulting method, FTT-NAS, is the method to solve this NAS problem
approximately. FT-NAS can be viewed as a degraded special case of FTT-NAS, in which no
fault is injected in the inner optimization of finding w ∗ (α ).
The overall neural architecture search framework is illustrated in Figure 4(b). There are multiple
components in the framework: a controller that samples different architecture rollouts from
the search space; a candidate network assembled by taking out the corresponding subset of
weights from the super-net; and a shared-weights-based evaluator that evaluates the performance of
different rollouts on the CIFAR-10 dataset using fault-tolerant objectives.

4.2 Search Space


The design of the search space is as follows: We use a cell-based macro architecture, similar to
the one used in References [29, 36]. There are two types of cells: normal cell and reduction cell
with stride 2. All normal cells share the same connection topology, while all reduction cells share
another connection topology. The layout and connections between cells are illustrated in Figure 5.


Fig. 4. Illustration of the overall workflow. (a) The setup of the application-level statistical fault models.
(b) The FTT-NAS framework. (c) The final fault-tolerant training stage.

In every cell, there are B nodes, and node 1 and node 2 are treated as the cell’s inputs, which are
the outputs of the two previous cells. For each of the other B − 2 nodes, two incoming connections
will be selected and element-wise added. For each connection, the 11 possible operations are: none;
skip connect; 3 × 3 average (avg.) pool; 3 × 3 max pool; 1 × 1 Conv; 3 × 3 ReLU-Conv-BN block; 5
× 5 ReLU-Conv-BN block; 3 × 3 SepConv block; 5 × 5 SepConv block; 3 × 3 DilConv block; 5 × 5
DilConv block.
The complexity of the search space can be estimated. For each cell type, there are (11^(B−2) × (B − 1)!)^2
possible choices. As there are two independent cell types, there are (11^(B−2) × (B − 1)!)^4 possible
architectures in the search space, which is roughly 9.5 × 10^24 with B = 6 in our experiments.
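As a quick sanity check of this count, a few lines of Python reproduce the reported number:

```python
import math

B = 6
base = 11 ** (B - 2) * math.factorial(B - 1)
per_cell_type = base ** 2        # (11^(B-2) * (B-1)!)^2 choices per cell type
total = per_cell_type ** 2       # normal and reduction cells are independent
print(f"{total:.2e}")            # prints ~9.53e+24
```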

4.3 Sampling and Assembling Architectures


In our experiments, the controller is an RNN, and the performance evaluation is based on a super
network with shared weights, as used by Reference [36].
An example of the sampled cell architecture is illustrated in Figure 6. Specifically, to sample a
cell architecture, the controller RNN samples B − 2 blocks of decisions, one for each node 3, . . . , B.
In the decision block for node i, M = 2 input nodes are sampled from 1, . . . , i − 1, to be connected
with node i. Then M operations are sampled from the 11 basic operation primitives, one for each
of the M connections. Note that the two sampled input nodes can be the same node j, which will
result in two independent connections from node j to node i.
During the search process, the architecture assembling process using the shared-weights super
network is straightforward [36]: just take out the weights from the super network corresponding
to the connections and operation types of the sampled architecture.

Fig. 5. Illustration of the search space design. Left: The layout and connections between cells. Right: The
possible connections in each cell and the possible operation types on every connection.

Fig. 6. An example of the sampled cell architecture.

4.4 Searching for Fault-tolerant Architecture


The FTT-NAS algorithm is illustrated in Algorithm 1. To search for a fault-tolerant architecture,
we use a weighted sum of the clean accuracy acc_c and the accuracy with fault injection acc_f as
the reward to instruct the training of the controller:

R = (1 − α_r) · acc_c + α_r · acc_f,    (12)

where acc_f is calculated by injecting faults following the fault distribution described in Section 3.
For the optimization of the controller, we employ the Adam optimizer [22] to optimize the REIN-
FORCE [47] objective, together with an entropy encouraging regularization.
In every epoch of the search process, we alternately train the shared weights and the controller
on separate data splits D_t and D_v, respectively.


ALGORITHM 1: FTT-NAS
1: EPOCH: the total search epochs
2: w: shared weights in the super network
3: θ : the parameters of the controller π
4: epoch = 0
5: while epoch < EPOCH do
6: for all x t , yt ∼ D t do
7: a ∼ π (a; θ ) # sample an architecture from the controller
8: f ∼ F (f ) # sample faults from the fault model
9: Lc = CE(Net(a; w )(x t ), yt ) # clean cross entropy
10: L f = CE(Net(a; w )(x t , f ), yt ) # faulty cross entropy
11: L(x t , yt , Net(a; w ), f ) = (1 − αl )Lc + αl L f
12: w = w − ηw ∇w L # for clarity, we omit momentum calculation here
13: end for
14: for all xv , yv ∼ Dv do
15: a ∼ π (a; θ ) # sample an architecture from the controller
16: f ∼ F (f ) # sample faults from the fault model
17: Rc = Acc(Net(a; w )(xv ), yv ) # clean accuracy
18: R f = Acc(Net(a; w )(xv , f ), yv ) # faulty accuracy
19: R(xv , yv , Net(a; w ), f ) = (1 − α r ) ∗ Rc + α r ∗ R f
20: θ = θ + ηθ (R − b)∇θ log π (a; θ )
21: end for
22: epoch = epoch + 1
23: schedule ηw , ηθ
24: end while
25: return a ∼ π (a; θ )

For the training of the shared weights, we carried out experiments under two different settings: without and with FTT. When training with FTT,
a weighted sum of the clean cross-entropy loss CE_c and the cross-entropy loss with fault injection
CE_f is used to train the shared weights. The FTT loss can be written as

L = (1 − α_l) · CE_c + α_l · CE_f.    (13)
As shown in lines 7–12 in Algorithm 1, in each step of training the shared weights, we sample
architecture α using the current controller, then backpropagate using the FTT loss to update the
parameters of the candidate network. Training without FTT (in FT-NAS) is a special case with
αl = 0.
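A simplified PyTorch sketch of this shared-weights update is given below; `forward_arch`, `controller.sample`, and `fault_model.sample` are assumed interfaces used only for illustration.

```python
import torch.nn.functional as F

def ftt_weight_step(super_net, controller, fault_model, batch, w_optimizer, alpha_l=0.5):
    # One shared-weights update (lines 7-12 of Algorithm 1).
    x_t, y_t = batch
    arch = controller.sample()                                    # a ~ pi(a; theta)
    faults = fault_model.sample(arch)                             # f ~ F
    loss_c = F.cross_entropy(super_net.forward_arch(arch, x_t), y_t)
    loss_f = F.cross_entropy(super_net.forward_arch(arch, x_t, faults=faults), y_t)
    loss = (1 - alpha_l) * loss_c + alpha_l * loss_f              # Equation (13)
    w_optimizer.zero_grad()
    loss.backward()
    w_optimizer.step()
    return float(loss)
```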
As shown in lines 15–20 in Algorithm 1, in each step of training the controller, we sample
an architecture from the controller, assemble this architecture using the shared weights, and then
get the reward R on one data batch in Dv . Finally, the reward is used to update the controller by
applying the REINFORCE technique [47], with the reward baseline denoted as b.
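A corresponding sketch of the controller update, again with assumed interfaces, could look as follows:

```python
import torch

def controller_step(super_net, controller, fault_model, batch, c_optimizer,
                    baseline, alpha_r=0.5, momentum=0.99):
    # One controller update (lines 15-20 of Algorithm 1).
    x_v, y_v = batch
    arch, log_prob = controller.sample_with_log_prob()
    faults = fault_model.sample(arch)
    with torch.no_grad():                                    # the reward is not differentiated
        acc_c = (super_net.forward_arch(arch, x_v).argmax(1) == y_v).float().mean()
        acc_f = (super_net.forward_arch(arch, x_v, faults=faults).argmax(1) == y_v).float().mean()
    reward = (1 - alpha_r) * acc_c + alpha_r * acc_f         # Equation (12)
    loss = -(reward - baseline) * log_prob                   # REINFORCE with a moving baseline
    c_optimizer.zero_grad()
    loss.backward()
    c_optimizer.step()
    baseline = momentum * baseline + (1 - momentum) * float(reward)
    return float(reward), baseline
```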

5 EXPERIMENTS
In this section, we demonstrate the effectiveness of the FTT-NAS framework and analyze the dis-
covered architectures under different fault models. First, we introduce the experiment setup in
Section 5.1. Then, the effectiveness under the feature and weight fault models are shown in Sec-
tion 5.2 and Section 5.3, respectively. The effectiveness of the learned controller is illustrated in
Section 5.4. Finally, the analyses and illustrative experiments are presented in Section 5.5.


5.1 Setup
Our experiments are carried out on the CIFAR-10 [23] dataset. CIFAR-10 is one of the most com-
monly used computer vision datasets and contains 60,000 32 × 32 RGB images. Three manually
designed architectures VGG-16, ResNet-18, and MobileNet-V2 are chosen as the baselines. 8-bit
dynamic fixed-point quantization is used throughout the search and training process, and the frac-
tion length is found following the minimal-overflow principle.
In the neural architecture search process, we split the training dataset into two subsets. 80% of
the training data is used to train the shared weights, and the remaining 20% is used to train the
controller. The super network is an 8-cell network, with all the possible connections and opera-
tions. The channel number of the first cell is set to 20 during the search process, and the channel
number increases by 2 upon every reduction cell. The controller network is an RNN with one hid-
den layer of size 100. The learning rate for training the controller is 1e-3. The reward baseline b
is updated using a moving average with momentum 0.99. To encourage exploration, we add an
entropy encouraging regularization to the controller’s REINFORCE objective, with a coefficient of
0.01. For training the shared weights, we use an SGD optimizer with momentum 0.9 and weight
decay 1e-4, and the learning rate is scheduled by a cosine annealing scheduler [33] starting from
0.05. Each architecture search process is run for 100 epochs. Note that all these are typical settings
that are similar to Reference [36].
To conduct the final training of the architectures (Figure 4(c)), we run fault-tolerant training
for 100 epochs. The learning rate is set to 0.1 initially and decayed by a factor of 10 at epochs 40 and 80.
We have experimented with a fault-tolerant training choice: whether to mask out the error po-
sitions in feature/weights during the backpropagation process. If the error positions are masked
out, then no gradient would be backpropagated through the erroneous feature positions, and no
gradient would be calculated w.r.t. the erroneous weight positions. We find that this choice does
not affect the fault-tolerant training result, thus, we do not use the masking operation in our final
experiments.
We build the neural architecture search framework and fault injection framework upon the
PyTorch framework, and all the codes are available at https://github.com/walkerning/aw_nas.

5.2 Defend against MiBB Feature Faults


As described in Section 4, we conduct neural architecture searching without/with fault-tolerant
training (i.e., FT-NAS and FTT-NAS, correspondingly). The per-MAC injection probability pm used
in the search process is 1e-4. The reward coefficient α_r in Equation (12) is set to 0.5. In FTT-NAS,
the loss coefficient αl in Equation (13) is also set to 0.5. As the baselines for FT-NAS and FTT-NAS,
we train ResNet-18, VGG-16, MobileNet-V2 with both normal training and FTT. For each model
trained with FTT, we successively try per-MAC fault injection probability pm in {3e-4, 1e-4, 3e-
5} and use the largest injection probability with which the model could achieve a clean accuracy
above 50%. Consequently, the ResNet-18 and VGG-16 are trained with a per-MAC fault injection
probability of 1e-4 and 3e-5, respectively.
The discovered cell architectures are shown in Figure 7, and the evaluation results are shown in
Table 4. The discovered architecture F-FTT-Net outperforms the baselines significantly at various
fault ratios. In the meantime, compared with the most efficient baseline MobileNet-V2, the FLOPs
number of F-FTT-Net is comparable, and the parameter number is only 28.3% (0.65M versus 2.30M).
If we require that the accuracy should be kept above 70%, then MobileNet-V2 could function with
a per-MAC error rate of 3e-6, and F-FTT-Net could function with a per-MAC error rate larger than
1e-4. That is to say, while meeting the same accuracy requirements, F-FTT-Net could function in
an environment with a much higher SER.


Table 4. Comparison of Different Architectures under the MiBB Feature Fault Model

Arch           Training†   Clean     Accuracy (%) with feature faults        #FLOPs   #Params
                           acc (%)   3e-6    1e-5    3e-5    1e-4    3e-4
ResNet-18      clean       94.7      89.1    63.4    11.5    10.0    10.0    1110M    11.16M
VGG-16         clean       93.1      78.2    21.4    10.0    10.0    10.0    626M     14.65M
MobileNet-V2   clean       92.3      10.0    10.0    10.0    10.0    10.0    182M     2.30M
F-FT-Net       clean       91.0      71.3    22.8    10.0    10.0    10.0    234M     0.61M
ResNet-18      pm=1e-4     79.2      79.1    79.6    78.9    60.6    11.3    1110M    11.16M
VGG-16         pm=3e-5     83.5      82.4    77.9    50.7    11.1    10.0    626M     14.65M
MobileNet-V2   pm=3e-4     71.2      70.3    69.0    68.7    68.1    47.8    182M     2.30M
F-FTT-Net      pm=3e-4     88.6      88.7    88.5    88.0    86.2    51.0    245M     0.65M

†: As also noted in the main text, for all the FTT-trained models, we successively try per-MAC fault injection probability
pm in {3e-4, 1e-4, 3e-5} and use the largest injection probability with which the model could achieve a clean accuracy
above 50%.

Fig. 7. The discovered cell architectures under the MiBB feature fault model. (a) Normal cell. (b) Reduction
cell.

We can see that FTT-NAS is much more effective than its degraded variant, FT-NAS. We con-
clude that, generally, NAS should be used in conjunction with FTT, as suggested by Equation (11).
Another interesting fact is that, under the MiBB fault model, the relative rankings of the resilience
capabilities of different architectures change after FTT: After FTT, MobileNet-V2 suffers from the
smallest accuracy degradation among three baselines, whereas it is the most vulnerable one with-
out FTT.

5.3 Defend against adSAF Weight Faults


We conduct FT-NAS and FTT-NAS under the adSAF model. The overall SAF ratio p = p0 + p1
is set to 8%, in which the proportion of SAF0 and SAF1 is 83.7% and 16.3%, respectively (SAF0
ratio p0 =6.7%, SAF1 ratio p1 =1.3%). The reward coefficient α r is set to 0.2. The loss coefficient αl in
FTT-NAS is set to 0.7. After the fault-tolerant training of the discovered architecture, we test the
model accuracy with various SAF ratios (from 4% to 12%), while the relative ratio p0 /p1 remains
unchanged according to the numbers reported by Chen et al. [7].
The discovered cell architectures are shown in Figure 8. As shown in Table 5, the discovered W-
FTT-Net outperforms the baselines significantly at various test SAF ratios, with comparable FLOPs
and less parameter number. We then apply channel augmentation to the discovered architecture
to explore the performance of the model at different scales. We can see that models with larger
capacity have better reliability under the adSAF weight fault model, e.g., 54.2% (W-FTT-Net-40) vs.
38.4% (W-FTT-Net-20) with 10% adSAF faults.
To investigate whether the model FTT-trained under the adSAF fault model can tolerate other
types of weight faults, we evaluate the reliability of W-FTT-Net under the 1bit-adSAF model and the iBF model.


Table 5. Comparison of Different Architectures under the adSAF Weight Fault Model

Arch            Training   Clean     Accuracy (%) with weight faults         #FLOPs   #Params
                           acc (%)   0.04    0.06    0.08    0.10    0.12
ResNet-18       clean      94.7      64.8    34.9    17.8    12.4    11.0    1110M    11.16M
VGG-16          clean      93.1      45.7    21.7    14.3    12.6    10.6    626M     14.65M
MobileNet-V2    clean      92.3      26.2    14.3    11.7    10.3    10.5    182M     2.30M
W-FT-Net-20     clean      91.7      54.2    30.7    19.6    15.5    11.9    1020M    3.05M
ResNet-18       p=0.08     92.0      86.4    77.9    60.8    41.6    25.6    1110M    11.16M
VGG-16          p=0.08     91.1      82.6    73.3    58.5    41.7    28.1    626M     14.65M
MobileNet-V2    p=0.08     86.3      76.6    55.9    35.7    18.7    15.1    182M     2.30M
W-FTT-Net-20†   p=0.08     90.8      86.2    79.5    69.6    53.5    38.4    919M     2.71M
W-FTT-Net-40    p=0.08     92.1      88.8    85.5    79.3    69.2    54.2    3655M    10.78M

†: The "-N" suffix means that the base of the channel number is N.

Fig. 8. The discovered cell architectures under the adSAF weight fault model. (a) Normal cell. (b) Reduction
cell.

Fig. 9. Accuracy curves under different weight fault models. (a) W-FTT-Net under 8bit-adSAF model.
(b) W-FTT-Net under 1bit-adSAF model. (c) W-FTT-Net under iBF model.

iBF model. As shown in Figure 9(b) and (c), under the 1bit-adSAF and iBF weight fault models, W-FTT-Net consistently outperforms all the baselines at different noise levels.

5.4 The Effectiveness of the Learned Controller


To demonstrate the effectiveness of the learned controller, we compare the performance of the
architectures sampled by the controller with that of architectures randomly sampled
from the search space. For both the MiBB feature fault model and the adSAF weight fault model, we
randomly sample five architectures from the search space and train them with FTT for 100 epochs.


Table 6. RNN Controller vs. Random Samples under the MiBB Feature Fault Model

Model      Clean acc. (%)  Acc. (%) @ pm=3e-4  #FLOPs  #Params
sample1    60.2            19.5                281M    0.81M
sample2    79.7            29.7                206M    0.58M
sample3    25.0            32.2                340M    1.09M
sample4    32.9            25.8                387M    1.23M
sample5    17.4            10.8                253M    0.77M
F-FTT-Net  88.6            51.0                245M    0.65M

Table 7. RNN Controller vs. Random Samples under the adSAF Weight Fault Model

Model      Clean acc. (%)  Acc. (%) @ p=8%  #FLOPs  #Params
sample1    90.7            63.6             705M    1.89M
sample2    84.7            36.7             591M    1.54M
sample3    90.3            60.3             799M    2.33M
sample4    90.5            64.0             874M    2.55M
sample5    85.2            45.6             665M    1.83M
W-FTT-Net  90.7            68.5             919M    2.71M

A per-MAC fault injection probability of 3e-4 is used for feature faults, and a SAF ratio of 8% (p0=6.7%, p1=1.3%) is used for weight faults.
As shown in Table 6 and Table 7, the performance of different architectures in the search space
varies considerably, and the architectures sampled by the learned controllers, F-FTT-Net and W-FTT-Net,
outperform all the randomly sampled architectures. Note that, as we use different preprocessing operations
for feature faults and weight faults (ReLU-Conv-BN 3 × 3 and SepConv 3 × 3, respectively),
there are differences in FLOPs and parameter counts even for the same cell architecture.

5.5 Inspection of the Discovered Architectures


Feature faults: From the discovered cell architectures shown in Figure 7, we can observe that
the controller clearly prefers SepConv and DilConv blocks over ReLU-Conv-BN blocks. This
observation is consistent with our expectation: under the MiBB feature fault model, operations
with fewer FLOPs result in a lower equivalent fault rate in the OFM.
Under the MiBB feature fault model, there is a tradeoff between the capacity of the model and the
feature error rate. As the number of channels increases, the operations become more expressive,
but the equivalent error rates in the OFMs also get higher. Thus, there exists a tradeoff point c∗
for the number of channels. Intuitively, c∗ depends on the per-MAC error rate pm: the larger
pm is, the smaller c∗ is.
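As a back-of-the-envelope illustration of this effect, the sketch below estimates the equivalent OFM error rate from the per-MAC error rate, under the simplifying assumption that the errors injected by the MACs producing one output element are independent; the function and its arguments are illustrative.

```python
def ofm_error_rate(p_mac, c_in, kernel_size):
    # Equivalent error rate of one OFM element, assuming each of the
    # c_in * k * k MACs that produce it independently injects an error
    # with per-MAC probability p_mac.
    macs_per_output = c_in * kernel_size * kernel_size
    return 1.0 - (1.0 - p_mac) ** macs_per_output

# e.g., a dense 3x3 conv with 64 input channels vs. a 3x3 depthwise conv
print(ofm_error_rate(3e-4, 64, 3))  # ~0.16
print(ofm_error_rate(3e-4, 1, 3))   # ~0.0027
```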
Besides the choice of primitives, the connection pattern and the combination of different primitives
also play a role in making an architecture fault-tolerant. To verify this, we first conduct a simple
experiment to confirm the preference among primitives: For each of the four primitives
(SepConv 3 × 3, SepConv 5 × 5, DilConv 3 × 3, DilConv 5 × 5), we stack five layers of that
primitive and FTT-train the stacked NN with pm=3e-4. The stacked
NNs achieve accuracies of 60.0%, 65.1%, 50.0%, and 56.3% at pm=1e-4, respectively. The
stack of SepConv 5 × 5 blocks achieves the best performance, which is not surprising, since
the most frequent block in F-FTT-Net is SepConv 5 × 5. Then, we construct six architectures:
Five are obtained by randomly sampling architectures that use only SepConv 5 × 5 connections, and one is obtained by replacing all
the primitives in F-FTT-Net with SepConv 5 × 5 blocks. The best result achieved by these six
architectures is 77.5% at pm=1e-4 (versus 86.2% achieved by F-FTT-Net). These illustrative
experiments indicate that both the connection pattern and the combination of different primitives
contribute to the fault resilience capability of a neural network architecture.

Weight faults: Under the adSAF fault model, the controller prefers ReLU-Conv-BN blocks over
SepConv and DilConv blocks. This preference is not as easy to anticipate. We hypothesize that the
weight distributions of different primitives might lead to different behaviors when encountering
SAF faults. For example, if the quantization range of a weight value is larger, then the value devi-
ation caused by a SAF1 fault would be larger, and we know that a large increase in the magnitude
of weights can damage the performance severely [12]. We conduct a simple experiment to ver-
ify this hypothesis: We stack several blocks to construct a network, and in each block, one of
three operations (a SepConv 3 × 3 block, a ReLU-Conv-BN 3 × 3 block, or a ReLU-Conv-BN 1 ×
1 block) is randomly picked in every training step. The SepConv 3 × 3 block is constructed with a
DepthwiseConv 3 × 3 and two Conv 1 × 1 layers, and the ReLU-Conv-BN 3 × 3 and ReLU-Conv-BN 1 ×
1 blocks contain a Conv 3 × 3 and a Conv 1 × 1, respectively. After training, the weight magnitude ranges
of Conv 3 × 3, Conv 1 × 1, and DepthwiseConv 3 × 3 are 0.036±0.043, 0.112±0.121, and 0.140±0.094,
respectively. Since the magnitude of the weights in the 3 × 3 convolutions is smaller than that of the 1 × 1
convolutions and the depthwise convolutions, SAF weight faults would cause larger weight deviations
in a SepConv or DilConv block than in a ReLU-Conv-BN 3 × 3 block.
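The per-primitive weight statistics above can be gathered with a simple sketch like the following, assuming a PyTorch model; grouping convolutions by kernel size and by whether they are depthwise is an illustrative choice, not a requirement of our method.

```python
import torch.nn as nn

def weight_magnitude_stats(model):
    # Sketch: mean and std of |W| for every convolution in the model,
    # grouped by kernel size and by whether the convolution is depthwise.
    stats = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            depthwise = m.groups > 1 and m.groups == m.in_channels
            kind = "depthwise" if depthwise else "conv{}x{}".format(*m.kernel_size)
            w = m.weight.detach().abs()
            stats.setdefault(kind, []).append((name, w.mean().item(), w.std().item()))
    return stats
```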

6 DISCUSSION
6.1 Orthogonality
Most previous methods exploit the inherent fault resilience capability of existing NN
architectures to tolerate different types of hardware faults. In contrast, our methods improve the
inherent fault resilience capability of NN models, thus effectively increasing the algorithmic fault
resilience “budget” that can be utilized by hardware-specific methods. Our methods are orthogonal to
existing fault-tolerance methods and can be easily integrated with them, e.g., to largely reduce the
overhead of hardware-based methods.

6.2 Data Representation


In our work, an 8-bit dynamic fixed-point representation is used for the weights and features. As
pointed out in Section 5.5, the dynamic range has an impact on the resilience characteristics against
weight faults, and the data format itself determines or affects this range. Yan et al. [52]
found that errors in the exponent bits of 32-bit floating-point weights have a large impact
on performance. Li et al. [25] investigated the resilience characteristics of several floating-point
and non-dynamic fixed-point representations.
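For illustration, the following is a minimal sketch of one possible per-tensor dynamic fixed-point quantizer, in which the fractional length is chosen so that the largest magnitude fits into the signed 8-bit range; the exact quantization scheme used in our implementation may differ in its details.

```python
import torch

def quantize_dynamic_fixed_point(x, n_bits=8):
    # Sketch: choose the fractional length per tensor so that the largest
    # magnitude roughly fits into the signed n_bits integer range, then
    # round and saturate.
    max_val = x.abs().max().clamp(min=1e-8)
    frac_len = (n_bits - 1) - int(torch.ceil(torch.log2(max_val)).item())
    scale = 2.0 ** frac_len
    q = torch.clamp(torch.round(x * scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q / scale, frac_len
```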

6.3 Other Search Strategies (Controllers)


A NAS system mainly consists of three components that work together: the search space, which
defines the architectural decisions to make; the evaluator, which assesses the performance of an
architecture; and the search strategy (controller), which explores the search space using the rewards pro-
vided by the evaluator. Apart from a few exceptions such as differentiable NAS methods [29],
a NAS system can be built in a modularized and decoupled way, in which the controller and
the evaluator are designed independently. Our major modifications to the NAS framework are as
follows: (1) We analyze the different categories of faults existing in various devices and conduct fault
injection with the formulated fault model when evaluating the reward. (2) We incorporate fault-tolerant
training when training the supernet weights, since FTT is a common technique for training a fault-
tolerant NN model and this reduces the gap between the search and the final training stages. Note that
both amendments concern the evaluator component. Although we choose the popular
reinforcement learning-based controller, other controllers such as evolutionary [39] and
predictor-based ones [35] could easily be incorporated into the FTT-NAS framework.
The application of other controllers is outside the scope of this work.
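To make this decoupling concrete, the sketch below shows a fault-aware evaluator interface in which the controller only consumes a scalar reward; the component names, the callables, and the specific mixing of clean and fault-injected accuracy with the coefficient αr are illustrative assumptions rather than the exact interface of our implementation.

```python
class FaultTolerantEvaluator:
    # Sketch of the evaluator-side changes: the controller only consumes a
    # scalar reward, so RL, evolutionary, or predictor-based controllers can
    # be plugged in unchanged. All callables are illustrative assumptions.
    def __init__(self, extract_subnet, evaluate_acc, inject_faults, alpha_r=0.2):
        self.extract_subnet = extract_subnet  # arch -> candidate network
        self.evaluate_acc = evaluate_acc      # network -> accuracy on held-out data
        self.inject_faults = inject_faults    # network -> fault-injected network
        self.alpha_r = alpha_r

    def reward(self, arch):
        subnet = self.extract_subnet(arch)
        acc_clean = self.evaluate_acc(subnet)
        acc_faulty = self.evaluate_acc(self.inject_faults(subnet))
        # the reward is assumed to mix clean and fault-injected accuracy
        return (1.0 - self.alpha_r) * acc_clean + self.alpha_r * acc_faulty
```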

6.4 Fault Model


Limitation of the application-level fault model: There are faults that are hard or impractical to
model and mitigate with our methods, e.g., timing errors and routing/DSP errors in FPGAs.
A hardware-in-the-loop framework could be established for a thorough evaluation of
system-level fault hazards. In any case, since the correspondence between these faults and
application-level elements is subtle, it is more suitable to mitigate them at a lower
abstraction layer.

FPGA platform: In the MiBB feature fault model, we assume that the add operations are spatially
expanded onto independent hardware adders, which applies to template-based designs [45].
For ISA (Instruction Set Architecture)-based accelerators [37], the NN computations are
orchestrated by instructions and time-multiplexed onto hardware units. In this case, the accu-
mulation of faults follows a different model and might lead to different preferences among
architectures. Nevertheless, the FTT-NAS framework is general and can be used with different fault
models; we leave the exploration of and experiments with other fault models for future work.

RRAM platform: As for the RRAM platform, this article mainly focuses on discovering fault-
tolerant neural architectures to mitigate SAFs, which have a significant impact on computing
accuracy. In addition to SAFs, device variation is another typical RRAM non-ideal factor that may
lead to inaccurate computation. There exist various circuit-level optimizations that can mitigate
the computation error caused by RRAM variation. First, with the development of RRAM
device technology, a large on/off ratio of RRAM devices (i.e., the resistance ratio of the high
resistance state and the low resistance state) can be obtained (e.g., 10³ [48]). A large on/off
ratio makes the bit-line current differences among different computation results more distinguishable and
thus improves the fault tolerance capability against variation. Second, in existing RRAM-based
accelerators, the number of RRAM rows activated at one time is limited; for example, only
four rows are activated in each cycle, which provides a sufficient signal margin against process
variation [51]. In contrast, compared with process variation, it is more costly to mitigate SAFs
by circuit-level optimization (e.g., existing work utilizes costly redundant hardware to tolerate
SAFs [18]). Thus, we aim at tolerating SAFs from the algorithmic perspective. Nevertheless,
simulating the variation is a meaningful extension of the general FTT-NAS framework for the
RRAM platform, and we leave it for future work.

Combining multiple fault models: We experiment with one fault model at a time and do not com-
bine different fault models. Our experimental results show that the architectural preferences
under the adSAF weight fault model and the MiBB feature fault model are distinct (see the discussion in Section 5.5). Fortunately,
the two types of faults that we experiment with would not co-exist in the same part of an NN
model: MiBB feature faults are caused by FPGA LUT errors, whereas adSAF weight faults occur in the RRAM
crossbar. Nevertheless, there are indeed scenarios in which weight and feature errors can
happen simultaneously on one platform; for example, iBF errors in the feature buffer and SAF errors in the
crossbar can occur simultaneously in an RRAM-based accelerator. However, on the same platform
in the same environment, the influences of different types of errors usually differ vastly.
For example, compared with the accuracy degradation caused by SAF errors in the RRAM
cells, the influence of iBF errors in the feature buffer can usually be ignored on the same device.
As a future direction, it might be interesting to combine these fault models to search for a neural
architecture that is partitioned and deployed onto a heterogeneous hardware system. In that case,
the fault patterns, along with the computation and memory access patterns of the multiple platforms,
should be considered jointly.

7 CONCLUSION
In this article, we analyze the possible faults in various types of NN accelerators and formalize the
statistical fault models from the algorithmic perspective. After the analysis, the MAC-i.i.d. Bit-Bias
(MiBB) model and the arbitrary-distributed Stuck-at-Fault (adSAF) model are adopted in the neural
architecture search for tolerating feature faults and weight faults, respectively. To search for
fault-tolerant neural network architectures, we propose the multi-objective Fault-Tolerant NAS
(FT-NAS) and Fault-Tolerant Training NAS (FTT-NAS) methods. In FTT-NAS, the NAS technique
is employed in conjunction with Fault-Tolerant Training (FTT). The discovered architectures,
F-FTT-Net and W-FTT-Net, outperform multiple manually designed architecture baselines in fault
resilience, with comparable or fewer FLOPs and parameters, and W-FTT-Net trained under the
8bit-adSAF model can also defend against other types of weight faults. Generally, compared with
FT-NAS, FTT-NAS is more effective and should be preferred. In addition, through the inspection of
the discovered architectures, we find that, since operation primitives differ in their MACs,
expressiveness, and weight distributions, they exhibit different resilience capabilities under
different fault models. The connection pattern is also shown to influence the fault resilience
capability of NN models.

REFERENCES
[1] Austin P. Arechiga and Alan J. Michaels. 2018. The robustness of modern deep learning architectures against single
event upset errors. In IEEE High Performance Extreme Computing Conference (HPEC’18). 1–6.
[2] Hossein Asadi and Mehdi B. Tahoori. 2007. Analytical techniques for soft error rate modeling and mitigation of FPGA-
based designs. IEEE Trans. Very Large Scale Integ. Syst. 15, 12 (Dec. 2007), 1320–1331.
[3] Bowen Baker, Otkrist Gupta, R. Raskar, and N. Naik. 2017. Accelerating neural architecture search using performance
prediction. arXiv preprint arXiv:1705.10823 (2017).
[4] Cristiana Bolchini, Antonio Miele, and Marco D. Santambrogio. 2007. TMR and partial dynamic reconfiguration to
mitigate SEU faults in FPGAs. In IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’07).
87–95.
[5] Shekhar Borkar. 2005. Designing reliable systems from unreliable components: The challenges of transistor variability
and degradation. IEEE Micro 25, 6 (2005), 10–16.
[6] Carl Carmichael, Michael Caffrey, and Anthony Salazar. 2000. Correcting single-event upsets through Virtex partial
configuration. Xilinx Application Notes 216 (2000), v1.
[7] Ching-Yi Chen, Hsiu-Chuan Shih, Cheng-Wen Wu, C. Lin, Pi-Feng Chiu, S. Sheu, and F. Chen. 2015. RRAM defect
modeling and failure analysis based on march test and a novel squeeze-search scheme. IEEE Trans. Comput. 64, 1 (Jan.
2015), 180–190.
[8] Lerong Chen, Jiawen Li, Yiran Chen, Qiuping Deng, Jiyuan Shen, X. Liang, and L. Jiang. 2017. Accelerator-friendly
neural-network training: Learning variations and defects in RRAM crossbar. In IEEE/ACM Design, Automation and
Test in Europe Conference (DATE’17). 19–24.
[9] Tianshi Chen, Zidong Du, Ninghui Sun, J. Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao:
A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM International Conference on
Architectural Support for Programming Languages and Operating Systems (ASPLOS’14).
[10] Ping Chi, Shuangchen Li, C. Xu, Tao Zhang, J. Zhao, Yongpan Liu, Y. Wang, and Yuan Xie. 2016. PRIME: A novel
processing-in-memory architecture for neural network computation in ReRAM-based main memory. In IEEE/ACM
International Symposium on Computer Architecture (ISCA’16). IEEE Press, 27–39.


[11] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. 2019. A survey of FPGA-based neural network
inference accelerators. ACM Trans. Reconfig. Technol. Syst. 12, 1 (Mar. 2019).
[12] Ghouthi Boukli Hacene, François Leduc-Primeau, Amal Ben Soussia, Vincent Gripon, and F. Gagnon. 2019. Training
modern deep neural networks for memory-fault robustness. In IEEE International Symposium on Circuits and Systems
(ISCAS’19). 1–5.
[13] Mahta Haghi and Jeff Draper. 2009. The 90 nm double-DICE storage element to reduce single-event upsets. In IEEE
International Midwest Symposium on Circuits and Systems (MWSCAS’09). IEEE, 463–466.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR’16). 770–778.
[15] Zhezhi He, Jie Lin, Rickard Ewetz, J. Yuan, and Deliang Fan. 2019. Noise injection adaption: End-to-end ReRAM
crossbar non-ideal effect adaption for neural network mapping. In ACM/IEEE Design Automation Conference (DAC’19).
[16] Jörg Henkel, Lars Bauer, Nikil Dutt, Puneet Gupta, Sani Nassif, Muhammad Shafique, Mehdi Tahoori, and Norbert
Wehn. 2013. Reliable on-chip systems in the nano-era: Lessons learnt and future trends. In ACM/IEEE Design Automa-
tion Conference (DAC’13). IEEE, 1–10.
[17] Miao Hu, Hai Li, Yiran Chen, Q. Wu, and G. Rose. 2013. BSB training scheme implementation on memristor-based
circuit. In IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA’13). IEEE, 80–87.
[18] Wenqin Huangfu, Lixue Xia, Ming Cheng, Xiling Yin, Tianqi Tang, Boxun Li, Krishnendu Chakrabarty, Yuan Xie, Yu
Wang, and Huazhong Yang. 2017. Computation-oriented fault-tolerance schemes for RRAM computing systems. In
IEEE/ACM Asia and South Pacific Design Automation Conference (ASPDAC’17). IEEE, 794–799.
[19] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural net-
works: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18 (2017).
[20] Sachhidh Kannan, Naghmeh Karimi, Ramesh Karri, and Ozgur Sinanoglu. 2015. Modeling, detection, and diagnosis
of faults in multilevel memristor memories. IEEE Trans. Comput.-aided Des. Integ. Circ. Syst. 34 (2015), 822–834.
[21] Sachhidh Kannan, Jeyavijayan Rajendran, Ramesh Karri, and Ozgur Sinanoglu. 2013. Sneak-path testing of memristor-
based memories. In 26th International Conference on VLSI Design and 12th International Conference on Embedded Sys-
tems. 386–391.
[22] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on
Learning Representations (ICLR’15).
[23] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. http://www.cs.toronto.edu/~kriz/cifar.
html.
[24] Binh Q. Le, Alessandro Grossi, Elisa Vianello, Tony Wu, Giusy Lama, Edith Beigne, H.-S. Philip Wong, and Subhasish
Mitra. 2018. Resistive RAM with multiple bits per cell: Array-level demonstration of 3 bits per cell. IEEE Trans.
Electron Dev. 66, 1 (2018), 641–646.
[25] Guanpeng Li, S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and Stephen W. Keckler. 2017. Understanding
error propagation in deep learning neural network (DNN) accelerators and applications. In ACM/IEEE Supercomputing
Conference (SC’17). ACM, 8.
[26] F. Libano, B. Wilson, J. Anderson, M. Wirthlin, C. Cazzaniga, C. Frost, and P. Rech. 2019. Selective hardening for neural
networks in FPGAs. IEEE Trans. Nucl. Sci. 66 (2019), 216–222.
[27] Beiye Liu, Hai Li, Yiran Chen, Xin Li, Qing Wu, and Tingwen Huang. 2015. Vortex: Variation-aware training for
memristor X-bar. In ACM/IEEE Design Automation Conference (DAC’15). 1–6.
[28] Chenchen Liu, Miao Hu, John Paul Strachan, and Hai Li. 2017. Rescuing memristor-based neuromorphic design with
high defects. In 54th ACM/EDAC/IEEE Design Automation Conference (DAC’17). IEEE, 1–6.
[29] Hanxiao Liu, K. Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In International Con-
ference on Learning Representations (ICLR’19).
[30] Tao Liu, Wujie Wen, Lei Jiang, Yanzhi Wang, Chengmo Yang, and Gang Quan. 2019. A fault-tolerant neural network
architecture. In ACM/IEEE Design Automation Conference (DAC’19). 55:1–55:6.
[31] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg.
2016. SSD: Single shot MultiBox detector. In European Conference on Computer Vision (ECCV’16). Springer, 21–37.
[32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 3431–3440.
[33] Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In International Con-
ference on Learning Representations (ICLR’17).
[34] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2018. Neural architecture optimization. In Conference
on Neural Information Processing Systems (NIPS’18). 7816–7827.
[35] Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. 2020. A generic graph-based neural architec-
ture encoding scheme for predictor-based NAS. In European Conference on Computer Vision (ECCV’20).


[36] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via
parameter sharing. In International Conference on Machine Learning (ICML’18).
[37] Jiantao Qiu, J. Wang, Song Yao, K. Guo, Boxun Li, Erjin Zhou, J. Yu, T. Tang, N. Xu, S. Song, Yu Wang, and H. Yang. 2016.
Going deeper with embedded FPGA platform for convolutional neural network. In ACM International Symposium on
Field-Programmable Gate Arrays (FPGA’16). ACM, 26–35.
[38] Brandon Reagen, Udit Gupta, L. Pentecost, P. Whatmough, S. Lee, Niamh Mulholland, D. Brooks, and Gu-Yeon Wei.
2018. Ares: A framework for quantifying the resilience of deep neural networks. In ACM/IEEE Design Automation
Conference (DAC’18).
[39] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier archi-
tecture search. In AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
[40] Christoph Schorn, Andre Guntoro, and Gerd Ascheid. 2018. Accurate neuron resilience prediction for a flexible reli-
ability management in neural network accelerators. In IEEE/ACM Design, Automation and Test in Europe Conference
(DATE’18).
[41] Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein.
2018. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Conference on Neural Information
Processing Systems (NIPS’18). 6103–6113.
[42] Xiaoxuan She and N. Li. 2017. Reducing critical configuration bits via partial TMR for SEU mitigation in FPGAs. IEEE
Trans. Nucl. Sci. 64 (2017), 2626–2632.
[43] Charles Slayman. 2011. Soft error trends and mitigation techniques in memory devices. In Reliability and Maintain-
ability Symposium. 1–5.
[44] Christian Szegedy, W. Zaremba, Ilya Sutskever, Joan Bruna, D. Erhan, Ian J. Goodfellow, and R. Fergus. 2013. Intriguing
properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
[45] Stylianos I. Venieris and C. Bouganis. 2019. fpgaConvNet: Mapping regular and irregular convolutional neural net-
works on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 30 (2019), 326–342.
[46] Jean-Charles Vialatte and François Leduc-Primeau. 2017. A study of deep learning robustness against computation
failures. arXiv:1704.05396 (2017).
[47] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Mach. Learn. 8, 3-4 (1992), 229–256.
[48] Jiyong Woo, Tien Van Nguyen, Jeong-Hun Kim, J. Im, Solyee Im, Yeriaron Kim, Kyeong-Sik Min, and S. Moon. 2020.
Exploiting defective RRAM array as synapses of HTM spatial pooler with boost-factor adjustment scheme for defect-
tolerant neuromorphic systems. Sci. Rep. 10 (2020).
[49] Lixue Xia, Wenqin Huangfu, Tianqi Tang, Xiling Yin, K. Chakrabarty, Yuan Xie, Y. Wang, and H. Yang. 2018. Stuck-at
fault tolerance in RRAM computing systems. IEEE J. Emerg. Select. Topics Circ. Syst. 8 (2018), 102–115.
[50] Lixue Xia, Mengyun Liu, Xuefei Ning, K. Chakrabarty, and Yu Wang. 2017. Fault-tolerant training with on-line fault
detection for RRAM-based neural computing systems. In ACM/IEEE Design Automation Conference (DAC’17). 1–6.
[51] Cheng-Xin Xue, J.-M. Hung, H.-Y. Kao, Y.-H. Huang, S.-P. Huang, F.-C. Chang, P. Chen, T.-W. Liu, C.-J. Jhang, C.-I.
Su, W.-S. Khwa, C.-C. Lo, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, Y.-D. Chih, T.-Y. J. Chang, and M.-F. Chang. 2021. A 22nm
4Mb 8b-precision ReRAM computing-in-memory macro with 11.91 to 195.7 TOPS/W for tiny AI edge devices. In IEEE
International Solid-State Circuits Conference (ISSCC’21).
[52] Zheyu Yan, Yiyu Shi, Wang Liao, M. Hashimoto, Xichuan Zhou, and Cheng Zhuo. 2020. When single event upset
meets deep neural networks: Observations, explorations, and remedies. In IEEE/ACM Asia and South Pacific Design
Automation Conference (ASPDAC’20). 163–168.
[53] Quanshi Zhang, Ruiming Cao, Feng Shi, Ying Nian Wu, and Song-Chun Zhu. 2018. Interpreting CNN knowledge via
an explanatory graph. In AAAI Conference on Artificial Intelligence.
[54] Yang Zhao, X. Hu, Shuangchen Li, Jing Ye, Lei Deng, Y. Ji, Jianyu Xu, Dong Wu, and Yuan Xie. 2019. Memory Trojan
attack on neural network accelerators. IEEE/ACM Design, Automation and Test in Europe Conference (DATE’19). 1415–
1420.
[55] Zhuoran Zhao, D. Agiakatsikas, N. H. Nguyen, E. Cetin, and O. Diessel. 2018. Fine-grained module-based error recov-
ery in FPGA-based TMR systems. ACM Trans. Reconfig. Technol. Syst. 11, 1 (2018), 4.
[56] Zhenhua Zhu, Hanbo Sun, Yujun Lin, Guohao Dai, L. Xia, Song Han, Yu Wang, and H. Yang. 2019. A configurable
multi-precision CNN computing framework based on single bit RRAM. In ACM/IEEE Design Automation Conference
(DAC’19). 1–6.
[57] Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In International Conference
on Learning Representations (ICLR’17).

Received December 2020; revised March 2021; accepted April 2021
