Reliability Model for Multiple Error Pro
DOI 10.1007/s10836-017-5649-x
Abstract The problem of multi-cell upset (MCU) has become a major issue in nanometer SRAM chips. Various types of multi-bit error correction codes (MECC) have been developed to mitigate this problem. Proper selection of the different parameters of each MECC scheme can lead to an efficient encoder/decoder design. In this paper a semi-analytical model is presented which can estimate the memory failure probability (as well as the mean time between failures (MTBF) and the reliability) and can guide system designers in selecting the proper protection scheme for the system memories. The model was validated by comparison to a simulation method (less than 3.1% estimation error) and to a state-of-the-art model (less than 2.9% estimation error). The impact of various parameters of four types of correction schemes is analyzed, and a comparison of these coding schemes with respect to their capabilities to enhance memory reliability is performed.

Keywords SRAM · MCU · Multi-bit error correction codes · MTBF · Bit interleaving

Responsible Editor: M. Goessel
* Hadi Jahanirad, [email protected]
1 Department of Electrical Engineering, University of Kurdistan, Pasdaran Street, Sanandaj, Iran

1 Introduction

As technology scales down, the effect of cosmic radiation (neutrons and alpha particles) on SRAM chips becomes significant. The single bit upset (SBU) is the common event when a particle hits the memory; however, shrinking the size of the memory cells leads to multiple cell upset (MCU) occurrence at each particle hit. Furthermore, the corrupted bits accumulate and remain in the memory until a scrubbing operation corrects all corrupted bits in the memory.

To mitigate the SBU and MCU effects, various schemes of error correction codes have been used in SRAM chips. The basic architecture of the coding process in SRAM chips is shown in Fig. 1. A block of data bits enters the encoder and some parity bits are added to them. The resulting data and parity bits (the encoded data) are stored in a memory block. The decoder extracts the data bits from an SRAM memory block and also corrects the corrupted bits in that block. Each coding scheme has some parameters which should be tuned properly. For example, when MECC and interleaving (a technique in which adjacent cells in a row belong to different words and two consecutive bits of a word are put ID cells apart (Fig. 3)) are used simultaneously for memory protection, the correction capability of the MECC (L) and the interleaving distance (ID) are the key features. Selecting a greater L leads to more protection at the cost of more complex encoder-decoder circuits, and using a larger ID changes the aspect ratio of the memory chip. On the other hand, each coding scheme performs better in a specific aspect of dealing with MCU. For example, the Reed-Solomon (RS) code [18] handles burst errors more efficiently than MECC-Interleaving schemes.

Fig. 1 The basic coding architecture in an SRAM chip (data in, encoder, encoded data, memory block, corrupted data, decoder, data out)

To select a proper scheme and its parameters we should be able to compute the memory failure probability accurately and quickly for different codes and various coding parameters. To evaluate the efficiency of MCU protection methods, the memory failure probability and the mean time between failures (MTBF) are well-known metrics. In this paper a semi-analytical model is presented which allows a two-step memory failure (as well as MTBF) estimation. The first step is based on probabilistic distributions to model the event arrival time and the size of the MCU. The second step is related to the block failure probability calculation.
By a block of memory we mean a portion of memory which is protected by one MECC scheme. The models for four major categories of MECC schemes are derived and their performances are compared by simulation. This study presents an accurate approach to MCU behavior modeling and provides semi-analytical modeling of block-based protection schemes, which have not been investigated in previous studies.

This paper is organized as follows. The related background work is presented in Section 2. Our approach to memory failure probability and MTBF evaluation is described in Section 3. The block memory failure probability analysis for different coding schemes is presented in Section 4. The simulation results and discussion are given in Section 5, which is followed by the conclusion in Section 6.

2 Related Works

2.1 MCU Characterization

The main features of MCU have been investigated in detail in the literature. Environmental conditions (temperature, operating voltage, altitude, …), memory design factors (layout, architecture, …) and test conditions (test pattern, memory bit patterns, …) determine the characteristics of MCU. The size and shape of MCU patterns were experimentally considered in [11, 13, 19, 20, 28, 30] for different sub-micron SRAM chips. In [8–10, 15, 31, 44] the MCU features have been described at the transistor level using simulation and experimental approaches. A different method was introduced in [12], where a statistical approach was the basis for MCU and SEU characterization. In [41] a novel experimental setup was constructed for an SRAM chip to detect heavy-ion radiation flux.

2.2 MCU Mitigation

The first type of MECC is the bit-interleaving scheme, which is used in an SRAM chip when protection is done using Single Error Correction codes. In this technique the encoder and decoder are kept simple and correct only single cell upsets. In this type of memory coding each row is divided into some words containing N bits and each of them is protected by ECC. Adjacent bits are distributed to different words, so if an MCU occurs in adjacent positions in a row, the corrupted bits will belong to different words; therefore, the error correction capability improves. For example, in Fig. 2 the first bits (Bit 1) of 4 words are put in the first 4 cells of a memory row and then this pattern is repeated for the N−1 remaining bits of each word (in this figure each color represents a word).
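As a minimal illustration of this interleaving idea (a sketch with assumed parameters, not the paper's implementation), the snippet below maps word/bit positions to physical cells of one row so that consecutive bits of a word lie ID cells apart; ID and N_BITS are example values.

# Sketch of bit interleaving in one memory row (assumed parameters).
# Bit b of word w is stored in physical cell b*ID + w, so consecutive bits
# of a word are ID cells apart and a burst narrower than ID hits distinct words.

ID = 4        # interleaving distance = number of words per row (assumed)
N_BITS = 8    # bits per word (assumed)

def physical_cell(word: int, bit: int) -> int:
    """Physical column of bit `bit` of word `word` in an interleaved row."""
    return bit * ID + word

def words_hit_by_burst(start: int, size: int) -> set:
    """Words touched by a burst of `size` adjacent corrupted cells."""
    return {cell % ID for cell in range(start, start + size)}

if __name__ == "__main__":
    print(physical_cell(word=2, bit=3))          # cell 14 for this mapping
    # A 3-cell burst corrupts one bit in each of 3 different words,
    # so per-word SEC can still correct every affected word.
    print(words_hit_by_burst(start=5, size=3))   # word indices {1, 2, 3}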
This coding type has been used in many error-prone digital systems. A good instance of such a protection method was introduced in [46]: a Hamming code and bit interleaving, along with placing the tap-well in a location which helps the Single-bit Error Correction (SEC) to correct larger MCUs, made this coding scheme very efficient. The number of parity bits needed for the words in a memory row was reduced dramatically in [42] by combining the words of a row; on the other hand, a virtual ground was added to the multiword so that the leakage current of the memory was reduced drastically. In [43] a novel approach was introduced which first detects the more vulnerable bits of a microprocessor embedded memory, so that using this information the correction capability of the ECC is improved. In [47] a method introduced to model adjacent upsets in dense SRAM memories was validated by experimental data regarding SRAM-based FPGAs; this model illustrated that bit interleaving reduces the impact of MCU on 7-series FPGAs. In [14] a novel bit-interleaved Hamming code was developed in which the configuration frame is divided into sub-frames. In a sub-frame, the bits associated with the user design are essential and the other bits are non-essential. The non-essential bits are used to embed the Hamming parity bits. At runtime, using a readback operation, the sub-frames are decoded using the embedded Hamming code and the corrected frame is written back to the design. This protection scheme has no area overhead and can correct more than 90% of MCUs in the FPGA configuration memory.

The second type of MECC is the frame-based codes. In this type of coding scheme each block of N bits (including data bits and code bits) is divided into Nsym symbols each containing S bits. Figure 3 shows a frame with Nsym symbols each containing S = 8 bits. If there are u upsets in a block, then a block failure occurs when more than L symbols have corrupted bits. Reed-Solomon (RS) and BCH are common coding schemes belonging to this category. This type of coding has better performance in the detection and correction of burst errors. An important drawback of this coding scheme is its complicated encoding and decoding processes.
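The failure rule just described is easy to state in code; the sketch below (assumed parameters, decoder internals not modeled) only checks whether the corrupted positions fall into more than L symbols.

# Sketch of the frame/symbol failure rule for an RS/BCH-like code
# (assumed S and L; the real encoder/decoder is not modeled).

def frame_fails(upset_positions, S=8, L=1):
    """True if the corrupted bits fall into more than L symbols of S bits each."""
    corrupted_symbols = {pos // S for pos in upset_positions}
    return len(corrupted_symbols) > L

if __name__ == "__main__":
    print(frame_fails([3, 5, 6]))     # one 8-bit symbol corrupted: no failure
    print(frame_fails([3, 5, 21]))    # two symbols corrupted with L = 1: failure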
These coding schemes have been used in various studies. In [1, 27] double error correction BCH codes were developed to mitigate three- and four-bit MCU burst upsets in SRAM memories.
In [29] an efficient error correction code has been presented for the embedded memories of digital signal processors (DSP), in which the more important bits of the memory are further protected. In [2] a novel technique was described which combines Hamming and Reed-Solomon codes to protect the memory against multiple bit errors. In [18] an optimized-cost Reed-Solomon encoding and decoding scheme has been presented for SRAM memories.

The third category is based on the block (matrix) codes. In such codes k bits of data feed into an encoder and n coded bits are stored in the memory. In this type of protection code, W words of memory with B bits in each word are considered as a block. This means that if up to L errors happen in the W × B bits of one block, then the MECC can correct these upsets (Fig. 4).

Fig. 4 The structure of a block coding scheme (the memory is divided into blocks, each containing W words of B bits)

This coding scheme is the most common type in MCU mitigation techniques. In [32] a new correction code was introduced in which, by redesigning the Hamming matrix, the adjacent errors (MCUs) generate syndromes that are not similar to SEU syndromes. In [3] a majority logic decoding scheme was used with a difference-set cyclic code which can correct large burst errors in the memory, along with a modification in the decoding scheme which resulted in lower area and time overhead. In [24] a modified version of the decimal matrix code (DMC) was used to detect and correct MCU in the SRAM configuration memory of an FPGA with a very low cost architecture. In [26, 33] new error correction schemes have been introduced which can correct three error bits in a word, in addition to single error correction and double adjacent error correction capabilities; these codes rank in complexity between SEC-DED and the more sophisticated BCH DEC codes. In [17, 36, 40, 48] a matrix-based ECC code belonging to the block coding scheme has been investigated; this type of code can handle adjacent error bits more efficiently. In [4, 25] a modified version of the Hamming code, using the decimal form of the code words in the encoding mechanism, has been developed.

The fourth category is the product coding scheme. In this memory protection scheme each row of a block containing W words is protected by an MECC which can correct LH and detect L′H upsets. If there are B bits in a word, then each block can be considered as another block with B rows and W columns, and another coding scheme can be applied to such a block (LV and L′V). In Fig. 5 a block which contains eight 8-bit words is illustrated: the blue cells are the data bits of the block, the green cells are the parity bits of each block row and the yellow cells are the parity bits of each column. In this type, the memory is partitioned into some blocks, each having R rows and C columns. Each row is coded first and then each column of the memory block is coded independently. For example, in [45] SEC is applied to encode the rows and a parity code is used in the column coding.
In [35] a cost-efficient structure based on erasure codes was introduced to correct MCU in the configuration memory of an FPGA. In [22] a built-in 2-D Hamming-based correction code has been developed to improve the reliability and availability of FPGA chips in space mission applications. In [21] two error correction codes, along with the readback scrubbing property of the FPGA configuration memory, have been developed to solve the MCU problem in FPGAs. In [45] a harmonious row and column coding scheme has been developed which can improve the reliability of the SRAM.

A hybrid block code can be obtained by combining the above coding categories. In this type each block of the memory is divided into m parts and then each part is encoded using a separate encoder. The encoder of each part may use a different coding scheme. For example, [39] used RS and Hamming coding for the two parts of each memory block.

In [34] the Hamming decoder and the One-Step Majority Logic Decoder (OS-MLD) have been compared according to the susceptibility of the FPGA configuration memory to accumulated SEU errors. In [5] an example of hybrid codes has been developed in which the Hamming code bits are put among the DTI (doubly transitive invariant) code bits; this structure shows lower overhead and better performance than DEC codes. In [5] a hybrid code based on a sub-group of low density parity check (LDPC) codes has been presented which can correct a large number of erroneous bits in a block of the memory.

A different approach to MCU mitigation is the structural hardening of SRAM memory cells against radiation. For example, in [7] a novel method to mitigate MCU in SRAM memories has been presented: by changing the internal architecture of the cells and of the read and write lines, more MCUs can be tolerated. However, this approach is out of the scope of this paper.
2.3 SER Estimation

Soft Error Rate (SER) estimation is the goal of all studies related to the reliability analysis of radiation-prone digital systems. SRAM chip SER estimation methods regarding MCU can be divided into two main categories. The analytical models belong to the first category, in which the behavior of an N-cell MCU event is estimated as N independent 1-cell (SEU) events and, using mathematical elaboration, an upper bound for the SER of the SRAM is deduced. We point to two examples of this type of modeling, presented in [37] and [38]. In [37] an analytical reliability model has been developed for an SRAM architecture in the presence of the SEC and bit-interleaving protection scheme: a general expression has been derived for the MTBF regarding MCU and then, using some approximations, the expression was simplified so that the designer could compare it with the expression previously derived for SEU. In [38] a procedure has been presented which can be used to select the best ID in SEC and bit-interleaving protected memories. The effects of the memory size on the value of the optimum ID and on the area and power overhead of the memory have been investigated in [38].

The second category of SER estimation methods is the semi-analytical models. One major problem of the analytical methods is that the row failure probability due to MCU cannot be accurately formulated: they use accumulated SEUs instead of MCU, which is not a precise assumption regarding the MCU clustering trend. To overcome this shortcoming, semi-analytical methods have been developed. These methods include two main efforts: one is the event arrival time and MCU behavior modeling, and the other is the computation of the row failure probability due to MCU. The event arrival time and the MCU size are modeled using Poisson and geometric distributions, respectively [6, 23, 37, 50]. In [49] two technology parameters, "effective size" and "line fitting", are defined to account for the size and the shape of the MCU. In this model the cells affected by an MCU event are distributed over some rows and columns of the memory according to its size, and the adjacent cells in each row are assumed to be corrupted by the MCU. In [6] the row failure probability is computed assuming that all corrupted cells are distributed completely randomly among the memory cells; hence the failure probability of each memory word can be computed by considering all mappings of MCU cells which generate a memory failure. This model was used to select a proper interleaving distance for a SEC protected memory. In [23] the row failure probability has been computed by random fault injection into the row cells. In this model the accumulation of soft errors is included by some modification of the geometric distribution parameter, and the distribution of the corrupted bits in a memory row is assumed to be uniformly random. The analysis of this model has been done for an SRAM which was protected by SEC and interleaving schemes. In [50] the SER of an SRAM chip has been computed using a software tool at the transistor level. Two different 6T-SRAM cell layouts were presented in [50] and then SER estimates for SEU, horizontal MCU and vertical MCU were derived using the proposed software tool. These estimates were 26–41% lower than that for the conventional 6T-SRAM cell layout.
2.4 Motivation

There are two main motivators for this study. The first is the presentation of a more accurate model to calculate the row failure probability. All three mentioned models [6, 23, 49] fail to capture all aspects of the MCU features. The first model ([49]) assumes that if there is more than one upset in a row, all of them are adjacent (one cluster of corrupted cells in a row); this ignores all the cases in which the corrupted cells are not adjacent or are divided into multiple clusters. In the second model [6] no clustering effect is considered, and this model overestimates the failure probability due to MCU. In the third model [23] the clustering effect is included by defining the row grouping number as the probability of the existence of a row containing g upsets when an MCU occurs. However, the distribution of these g upsets within a row is still assumed to be fully random and no clustering is assumed within the row, which is a significant deviation from MCU behavior. In this paper I propose a model which includes the clustering effect of MCU in both the row and column directions.

The second motivator is the lack of a semi-analytical model for SER estimation of block-based memory protection schemes. All the above SER estimation methods have been developed for SRAM chips with SEC-interleaving protection, and to my knowledge there is no block-based coding scheme for SER estimation in the literature. As the MCU ratio increases with the decreasing feature size of transistors, SEC-interleaving cannot provide an acceptable reliability level; as mentioned in the previous sub-section, more powerful coding schemes (such as RS, block-based, product, …) should be considered. In this paper a semi-analytical approach is presented for each category of protection schemes.

3 MTBF and Memory Failure Probability Evaluation

In this section the first part of the semi-analytical model (event arrival time and MCU behavioral modeling) is presented; the second part of the modeling (block failure probability computation for each protection scheme) is described in Section 4.

3.1 MCU Behavior

Each cosmic particle which hits the chip can affect one cell (Single Cell Upset (SCU)) or more cells (MCU) in an SRAM chip. Shrinking the feature size of the transistors in memory chips increases the ratio of MCU to SEU. Ibe et al. [20] reported that the MCU ratio increases from 0% to more than 45% when the technology goes from 250 nm to 22 nm. In [20] MCU patterns were classified into three categories: a single line along a column (BL), a single line along a row (WL) and a cluster (an MCU that has more than one upset along both the row and column directions). The clustered MCUs have various shapes; e.g., Fig. 6 shows some examples of MCU patterns which include 9 cells and have a multiplicity equal to 3 upsets (the black squares are corrupted bits). The ratio of these categories is different for each memory chip due to technology and environmental conditions. According to the data collected from [20], I plotted the percentages of the three categories of MCU (WL_ratio(m), BL_ratio(m) and C_ratio(m)) in Fig. 7.

Fig. 6 MCU patterns with size = 9 and multiplicity = 3
Fig. 7 Ratios of MCU categories

Generally, the third category (clustered MCU) has the greatest participation, and with shrinking feature size this ratio increases significantly. The multiplicity is an important factor that influences the MCU behavior. The ratio of MCU versus multiplicity for the 22 nm technology is indicated in Fig. 8, using data from [20]. An MCU with a large multiplicity corresponds to large-size upsets, which have a low occurrence probability, so the ratio of MCU decreases when the multiplicity increases. I derived the following equation for the ratio versus the multiplicity (m) using the curve fitting tool of MATLAB, in which a = 212.2, b = −1.646, c = 19.9 and d = −0.3444.

ratio_multiplicity(m) = a·e^(b·m) + c·e^(d·m)    (1)
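Equation (1) can be evaluated directly; the short sketch below only plugs in the fitted constants quoted above (it is a check of the fitted curve, not the MATLAB fitting procedure itself).

import math

# Fitted constants of eq. (1), taken from the text above.
a, b, c, d = 212.2, -1.646, 19.9, -0.3444

def ratio_multiplicity(m: float) -> float:
    """Percentage of MCU events with multiplicity m (double-exponential fit)."""
    return a * math.exp(b * m) + c * math.exp(d * m)

if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(m, round(ratio_multiplicity(m), 2))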
Scrubbing is an efficient method to remove all accumulated upsets in the memory. This method scans all words in the memory to correct all corrupted bits. Between two scrubbing operations the upsets accumulate in the memory and the failure probability increases, so the scrubbing interval must be selected so that the memory error correction code scheme can overcome the corrupted bits conveniently [23, 37, 38, 50]. In this paper the scrubbing interval is selected so that no MCU accumulation occurs. If, for example, the event arrival rate of soft errors is assumed to be 0.01 events per second (which is a rather high arrival rate in comparison to real situations), then there will be one event per 100 s, while the time interval between scrubbing operations is less than 100 ms in real systems. Thus the no-accumulation assumption is realistic for real memory-based digital applications.

Suppose that there is a memory with 256 rows and 64 columns (i.e., 64 bits in a row). Furthermore, the scrubbing …

occu_p(g, i) = Π_c clus_p(c, m)^n(c)    (3)

In this equation i is the index of the pattern and n(c) is the number of repetitions of a c-size cluster in the i'th pattern.

Another feature of the events that should be considered is how often the energetic particles hit the chip. The experimental results illustrate that the event occurrence can be described using a Poisson process [6, 23, 37, 50]. The λ parameter of the Poisson distribution implicitly gives the average time between two events. The total number of upsets which develop from a number of events can be described using a Compound Poisson (CP) distribution, as reported in [6, 23, 37, 49, 50].

CP(X, t) = Σ_{Y=1}^{X} [((λt)^Y · e^(−λt)) / Y!] · C(X−1, Y−1) · r^(X−Y) · (1−r)^Y    (4)

In this equation X is the total number of upsets in the memory and λ is the Poisson distribution parameter.
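A direct transcription of eq. (4) is given below as a sketch; the Poisson/geometric assumptions are those stated above, and X, t, λ and r in the example are arbitrary illustrative values.

import math

def compound_poisson(X: int, t: float, lam: float, r: float) -> float:
    """Eq. (4): probability of X total upsets in time t when Y events arrive
    as Poisson(lam*t) and each event contributes a geometrically distributed
    number of upsets that sum to X."""
    total = 0.0
    for Y in range(1, X + 1):
        poisson_term = (lam * t) ** Y * math.exp(-lam * t) / math.factorial(Y)
        compositions = math.comb(X - 1, Y - 1)   # ways to split X upsets over Y events
        total += poisson_term * compositions * r ** (X - Y) * (1 - r) ** Y
    return total

if __name__ == "__main__":
    # Example: 3 accumulated upsets after 100 s with lam = 0.01 and r = 0.5.
    print(compound_poisson(X=3, t=100.0, lam=0.01, r=0.5))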
After each event occurrence, the fail or pass status of
In this eq. X is the total number of upsets in the memory and SRAM chip is tested. If there is no failure, then the affected
λ is the Poisson distribution parameter. bits are corrected (scrubbed) and the next MCU will be con-
sidered. This process continues until a failure occurs. The time
of failure occurrence is registered and the corrupted bits are
3.2 MTBF Evaluation scrubbed. This process is repeated for other scrubbing time
intervals. The average of time differences between successive
To compute MTBF, the failure probability of each block failures is reported as MTBF in current simulation run
should be first computed. A block is a part of memory with (flowchart of Fig. 11). The above steps are repeated until stop-
single coding scheme. For example, in the SEC protected ping criterion is met. In our simulation the stopping criterion is
memory each row should be considered as a block but in RS a specified number of simulation runs. Finally, the average of
coding the bits that belong to a collection of symbols should MTBF of all runs is reported as the memory MTBF.
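Given a reliability curve R(t) over one scrubbing interval, eq. (8) can be evaluated numerically. The sketch below uses a toy exponential R(t) purely to make the example runnable; it is not the model's reliability function.

# Numerical evaluation of eq. (8): MTBF = (integral of R(t) over [0, T_scr]) / (1 - R(T_scr)).

def mtbf_from_reliability(R, T_scr: float, steps: int = 1000) -> float:
    dt = T_scr / steps
    # trapezoidal rule for the integral of R(t) on [0, T_scr]
    integral = sum((R(i * dt) + R((i + 1) * dt)) * dt / 2 for i in range(steps))
    return integral / (1.0 - R(T_scr))

if __name__ == "__main__":
    import math
    toy_R = lambda t: math.exp(-1e-4 * t)   # placeholder reliability curve (assumed)
    print(mtbf_from_reliability(toy_R, T_scr=100.0))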
3.3 Simulation for MTBF Evaluation

To simulate the MCU behavior, the time of event occurrence should be determined using an exponential distribution. Because of the scrubbing capability of the memory, the simulation should be carried out for an adequate number of scrubbing intervals (TN). For each event, the injection of the MCU/SEU is done using the algorithm of Fig. 11.

After each event occurrence, the fail or pass status of the SRAM chip is tested. If there is no failure, then the affected bits are corrected (scrubbed) and the next MCU is considered. This process continues until a failure occurs. The time of the failure occurrence is registered and the corrupted bits are scrubbed. This process is repeated for the other scrubbing time intervals. The average of the time differences between successive failures is reported as the MTBF of the current simulation run (flowchart of Fig. 11). The above steps are repeated until the stopping criterion is met; in our simulation the stopping criterion is a specified number of simulation runs. Finally, the average of the MTBFs of all runs is reported as the memory MTBF.
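A compressed sketch of this Monte Carlo loop is given below. It assumes exponential inter-arrival times and a geometric per-event upset count, and uses a placeholder pass/fail test; the real injection algorithm of Fig. 11 depends on the protection scheme and the injected pattern.

import random

def mcu_size(r=0.5):
    """Geometric number of upsets per event (assumed parameterisation)."""
    size = 1
    while random.random() < r:
        size += 1
    return size

def event_causes_failure(size, L=1):
    """Placeholder pass/fail test: more than L upsets in one word means failure."""
    return size > L

def simulate_mtbf(lam=0.01, runs=1000):
    """Average time to failure over many runs (simplified Section 3.3 loop:
    every non-failing event is assumed to be scrubbed before the next one)."""
    failure_times = []
    for _ in range(runs):
        t = 0.0
        while True:
            t += random.expovariate(lam)          # exponential inter-arrival time
            if event_causes_failure(mcu_size()):
                failure_times.append(t)           # time of the failure in this run
                break
            # otherwise the upsets are scrubbed and the next event is considered
    return sum(failure_times) / len(failure_times)

if __name__ == "__main__":
    print(simulate_mtbf())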
To compute the row failure probability for the clustered type MCU, we assume that there are u upsets in a row. The total number of possible states for distributing these corrupted bits in a row is

tot_state(u) = C(N·ID, u)    (10)

Each solution which is found by recur_func represents some states. For example, if the number of upsets in word w is k(w), then these k(w) bits can be selected from the N bits in C(N, k(w)) ways. If the solution assigns the upsets to W words out of the ID′ words in total, then these words can be selected in C(ID′, W) ways. So the total number of states of each recur_func solution (with index s) can be computed using the following equation.

nfail_state(s) = C(ID′, W) · Π_{w=1}^{W} C(N, k(w))    (12)

Altogether we find the total number of states and the number of non-failing states for the distribution of u upsets in a row of the memory. The i'th pattern of u upsets has its share of tot_state(u) and nfail_state(s) according to patt_share(u, i, B), as shown in eqs. (13) and (14) below.

tot_share(u, i) = patt_share(u, i, B) · tot_state(u)    (13)

nfail_share(u, s, i) = patt_share(u, i, B) · nfail_state(s)    (14)
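These counts are plain binomial products; the sketch below evaluates eqs. (10) and (12) with math.comb, where k_w is one hypothetical recur_func solution (an assumed assignment of upsets to words) supplied only for illustration.

from math import comb, prod

def tot_state(u: int, N: int, ID: int) -> int:
    """Eq. (10): ways to place u upsets among the N*ID cells of a row."""
    return comb(N * ID, u)

def nfail_state(k_w, N: int, ID_prime: int) -> int:
    """Eq. (12): states represented by one recur_func solution, where k_w[w]
    is the number of upsets assigned to word w (a non-failing assignment)."""
    W = len(k_w)
    return comb(ID_prime, W) * prod(comb(N, k) for k in k_w)

if __name__ == "__main__":
    # Example: 3 upsets in a row of 8 words of 8 bits, one upset in each of 3 words.
    print(tot_state(u=3, N=8, ID=8))
    print(nfail_state(k_w=[1, 1, 1], N=8, ID_prime=8))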
If S solutions are derived by recur_func, then the row failure probability for the clustered MCU is computed using the following equation.

P_row−C(u) = [1 − (Σ_{s=1}^{S} Σ_{i=1}^{no_patt} nfail_share(u, s, i) · occu_p(u, i)) / (Σ_{i=1}^{no_patt} tot_share(u, i) · occu_p(u, i))] · ratio_multiplicity(u) · C_ratio    (15)

In the numerator, the inner summation computes the total occurrence probability of the non-failing states for each solution of recur_func. According to P_row−WL(u) and P_row−C(u), the block failure probability is computed as

F_block−IntLV(u) = P_row−WL(u) + P_row−C(u)    (16)

To compute the memory failure probability using eq. (5), p_block(u) must also be determined. In this type of memory protection scheme each block is defined as one row, so to compute p_block(u) the probability of a row having u upsets must be considered. According to the definition of p_RE(u, m) as the probability of a row having u upsets when there are m upsets in the memory, the number of rows containing u upsets is computed as follows:
N_R(u, m) = p_RE(u, m) · m / u    (17)

The numerator contains the total number of corrupted cells in all rows with u upsets, and when this number is divided by u the number of such rows is deduced. Because of the existence of W rows in the memory, p_block(u) can be calculated as follows:

Third, if the length of the MCU is not a multiple of the symbol length, then there are two different situations. When the remainder of the division u/S is 1, the MCU affects K + 1 symbols. If L is greater than K + 1 there will be no failure; otherwise the number of failing states is computed as

N_state(u) = (N_sym − K) · S    (21)

When the remainder is greater than 1, the MCU affects K + 2 symbols. If K + 1 is greater than L, the number of failing states is computed using eq. (22); in this equation Rem is the remainder of u/S.

N_state(u) = (N_sym − K) · (S − (Rem − 1)) + (N_sym − (K + 1)) · (Rem − 1)    (22)

Otherwise, if K + 2 is greater than L, then the number of failing states is computed according to

N_state(u) = (N_sym − (K + 1)) · (Rem − 1)    (23)
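The case analysis above simply counts how many S-bit symbols a burst of u adjacent upsets can touch for each starting offset (taking K as the integer quotient of u/S, as the surrounding text implies). A small enumeration sketch with assumed parameters makes the K + 1 / K + 2 cases concrete.

def affected_symbols(start: int, u: int, S: int = 8) -> int:
    """Number of S-bit symbols touched by a burst of u adjacent upsets
    starting at cell `start` (K = u // S; the remainder decides K+1 or K+2)."""
    first_symbol = start // S
    last_symbol = (start + u - 1) // S
    return last_symbol - first_symbol + 1

if __name__ == "__main__":
    u, S = 10, 8                      # K = 1, remainder 2
    counts = [affected_symbols(s, u, S) for s in range(S)]
    print(counts)                     # each starting offset yields K+1 or K+2 symbols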
no_patt(j, i) = patt_share(up(i), j, S) · n(i)    (27)

The occurrence probability of each pattern can be deduced from the definition of occu_p(up(i), j). These steps are repeated for all symbols; then the failure probabilities of each combination of symbol patterns are computed using eq. (28). In this equation a(i) is the index of the pattern of the i'th symbol in the k'th combination.
p_comb(k) = Π_i no_patt(a(i), i) · occu_p(up(i), a(i))    (28)

The failure probabilities of all combinations are summed up to deduce the total non-failing probability for the current solution:

nfail_prob(s) = Σ_k p_comb(k)    (29)

The total failure probability of u upsets in a row is also computed as

tot_prob(u) = Σ_i no_patt(u, i) · occu_p(u, i)    (30)

The summation is over all different patterns of u upsets located on a memory row, and no_patt(u, i) is the number of the i'th pattern, which is calculated using eq. (31).

no_patt(i) = patt_share(u, i, B) · u    (31)

The block failure probability regarding the clustered type MCU is computed as

p_block−C(u) = [1 − (Σ_{s=1}^{S} nfail_prob(s)) / tot_prob(u)] · ratio_multiplicity(u) · C_ratio    (32)

According to P_block−WL(u) and P_block−C(u), the block failure probability is computed as

F_block−RS(u) = P_block−WL(u) + P_block−C(u)    (33)

In this type of memory protection scheme the structure of each block is similar to the blocks of the MECC with interleaving scheme. Thus eq. (18) can be used to compute p_block(u, X).

… because of its length, the corrupted cells in at least one of them will be greater than L. Second, the MCU length is between L and 2L. In this case the BL-type MCU may disturb two adjacent blocks such that s states have fewer than L corrupted cells in these two blocks. The failure probability for this case can be computed using eq. (35); in this equation N_blocks is the total number of blocks in the memory.

p_block−BL(u) = [1 − (N_blocks − 1) · s / (B − (u − 1))] · ratio_multiplicity(u) · BL_ratio    (35)

To deal with the clustered type MCU whose multiplicity equals m, we should consider the MCU extending over the memory length. This is done using p_CE(r, m), that is, the probability of a clustered MCU extending through r rows of the memory. If the multiplicity of the MCU is greater than L then there may be a memory failure. We divide the situation into two cases: first, the length of the MCU (l) is less than a block length. Because the multiplicity is greater than L, there may be non-failing states when the clustered MCU hits two adjacent blocks simultaneously. To derive the probability of this case, two features must be considered for each state: one is the length of the MCU in each block (l1 + l2 = l) and the other is the number of corrupted cells in each block (X1 + X2 = u). For each combination of l1(k) and l2(k), the non-failing states (X1, X2 ≤ L) are derived and the occurrence probability of the k'th combination is computed using eq. (36). Then the total occurrence probability of the K combinations is deduced by summation (eq. (37)).

p_comb(k) = Σ_{s=1}^{S} p_RE(l1(k), X1(s)) · p_RE(l2(k), X2(s))    (36)

nf_p(u) = Σ_{k=1}^{K} p_comb(k)    (37)
Fig. 14 Comparison of proposed model and models of [49] and [23]

4.4 Product Coding

As mentioned in the related work section, this coding scheme uses a 2-D protection approach. In a W × B memory block, the row-oriented code can correct and detect LH and L′H upsets, respectively. On the other hand, in the column direction another code can correct and detect LV and L′V upsets, respectively (Fig. 5). To use the correction capability of this coding scheme in a block containing u upsets, first of all the rows and columns which have fewer than LH + 1 and LV + 1 upsets are corrected. Then the rows containing between LH and L′H errors, and the columns containing between LV and L′V errors, are determined. When there is an error detection in a row and in a column, the intersection cell contains an erroneous bit, so this bit can be corrected.
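A toy sketch of the iterated row/column correction pass is given below; it is not the paper's decoder, only shows the line-by-line clean-up under the assumption that the error positions are known, and LH/LV are example values.

def product_correct(errors, LH=1, LV=1):
    """Toy row/column correction pass over a set of (row, col) error positions.
    Any row with at most LH errors, or column with at most LV errors, is
    corrected; the pass repeats until nothing changes."""
    errors = set(errors)
    changed = True
    while changed:
        changed = False
        for axis, limit in ((0, LH), (1, LV)):
            lines = {}
            for e in errors:
                lines.setdefault(e[axis], []).append(e)
            for cells in lines.values():
                if 0 < len(cells) <= limit:        # correctable by the line code
                    errors.difference_update(cells)
                    changed = True
    return errors                                   # remaining uncorrectable errors

if __name__ == "__main__":
    # A 2x2 square of errors defeats SEC in every row and column (LH = LV = 1),
    # while an L-shaped triple is cleaned up by the iterated passes.
    print(product_correct({(0, 0), (0, 1), (1, 0), (1, 1)}))   # still 4 errors
    print(product_correct({(0, 0), (0, 1), (1, 0)}))           # empty set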
For a WL-type MCU, if LV > 0 then all row errors will be corrected, but if LV equals 0 and the length of the MCU is greater than LH then a failure will occur; therefore,

P_PRO−C(u) = C_ratio · Σ_{r=L′H+1}^{B} N_R(r, m) / W    (44)

The rows containing LH ≤ r ≤ L′H upsets may cause a memory failure if the column MECC cannot correct them. The probability of the existence of such rows can be computed as,

For all g (L′V ≤ g ≤ W·B), this process is repeated and the total failure probability is calculated as the following sum:

P_2(u) = Σ_{g=L′V+1}^{W·B} p_all(g)    (48)

Table 1 (columns: W, B, Block length, Number of symbols, Correction capability, Diff, Run time ratio)
Fig. 15 MECC-Interleaving scheme accuracy simulation
Fig. 17 Block coding scheme accuracy simulation
Fig. 18 Product coding scheme accuracy simulation
Fig. 20 ID variation effect on failure probability of MECC-interleaved memory

In this type of error correction coding scheme the memory is divided into P blocks, each protected by one type of the aforementioned ECC schemes. Because of the independent encoding and decoding mechanisms of each part, we can compute the block failure probability for each part and then sum up the failure probabilities of all parts to calculate the memory failure probability.

Suppose that two types of ECC are used in a memory: one is SEC-Interleaving and the other is RS coding. So the memory (W rows and B columns) is divided into two parts, each having W/2 rows and B columns. Using the procedures of Sections 4.1 and 4.2, the failure probability of each part is computed according to eq. (51) and eq. (52).

F_part−int(X_P1) = F_mem−int(X_P1, W/2, B)    (51)

F_part−RS(X_P2) = F_mem−RS(X_P2, W/2, B)    (52)

In these equations each part of the memory is considered as a memory with W/2 rows. If there are X upsets in the memory, then X_P1 upsets are in the first part and the remaining X − X_P1 upsets are in the second part. The memory failure probability is calculated as follows:

F_mem−hybrid(X) = Σ_{X_P1=0}^{X} [ F_part−int(X_P1) + F_part−RS(X − X_P1) ]    (53)
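A sketch of eq. (53) as written is shown below; F_part_int and F_part_RS are placeholder callables standing in for the Section 4.1 and 4.2 models, and the toy curves exist only to exercise the summation.

def hybrid_failure(X, F_part_int, F_part_RS):
    """Eq. (53) as written: sum the two part failure probabilities over all
    splits of the X upsets between the two halves of the memory."""
    total = 0.0
    for x1 in range(X + 1):
        total += F_part_int(x1) + F_part_RS(X - x1)
    return total

if __name__ == "__main__":
    toy_int = lambda u: min(1.0, 0.01 * u)   # placeholder part failure curves
    toy_rs = lambda u: min(1.0, 0.02 * u)
    print(hybrid_failure(4, toy_int, toy_rs))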
5 Simulation Results

For the simulations the λ and r parameters are set to 0.01 and 0.5, respectively.

5.1 Comparison to Previous Methods

One of the goals of this study is the derivation of an accurate failure probability estimation model for the first coding scheme (MECC-Interleaving). As mentioned in the related works section, previous methods do not consider all aspects of MCU behavior for this scheme. In [49] all corrupted cells in a row are considered as a single cluster, and in [23] the upsets are injected into the rows completely randomly and no clustering trend is assumed. A simulation has been carried out to compare the proposed model with these two previous methods for a memory with length and width of M = 256 and B = 64, respectively. I used SEC (L = 1) as the correction code and set the interleaving distance equal to 8 (ID = 8). Thus each row of the memory contains 8 words of 8 bits each, and the physical distance between consecutive bits of a word is 8 memory cells. The simulation result is illustrated in Fig. 14. The failure probability estimation of [49] is much lower than those of the proposed model and of the model of [23]. This underestimation can be attributed to the single-cluster assumption in that method: many states which cause block failure are related to patterns in which two or more clusters with different sizes are involved.

On the other hand, the model of [23] overestimates the failure probability. This model, unlike the model of [49], considers all possible patterns that occur when u upsets lie on a row, but the occurrence probabilities of all patterns are assumed to be equal. As mentioned in Section 4.1, patterns have different occurrence probabilities due to their shapes (the number and types of clusters which construct the pattern). When the soft errors tend to produce more MCUs this overestimation is greater, because more clustered patterns have a higher occurrence probability in this situation.

Fig. 19 Hybrid coding scheme accuracy simulation
Fig. 21 L variation effect on failure probability of MECC-interleaved memory
5.2 Model Accuracy

To investigate the accuracy of the proposed model, the memory specifications for each scheme have been selected according to Table 1. Figures 15, 16, 17, 18 and 19 illustrate the comparison between our model and simulation for these coding schemes (the simulation procedure has been introduced in Fig. 11). In these figures the memory failure probabilities versus time over a number of scrubbing intervals have been plotted. The scrubbing interval is the time interval at the end of which the upsets of the memory are cleared and the memory becomes error free. In all cases there is acceptable matching between our model and the simulation. The average differences between model and simulation for all memory schemes are indicated in the seventh column of Table 1; in all cases the average difference is below 3.1%.

The ratio of the run times of the proposed model and of the simulation method is reported in the eighth column of Table 1. The proposed model can produce the results much faster than simulation; the speedup of our model is approximately 10^6.

5.3 Parameter Variation

The next experiment shows the impact of the parameters of each coding scheme on the memory failure probability. According to such simulations we can select the proper coding parameters for each protection scheme.

Two parameters of the first scheme (MECC-Interleaving) are the interleaving distance (ID), which indicates the physical distance between two consecutive bits of a word in a memory row, and the error correction capability of the MECC (L). In Fig. 20 L is set to a fixed value and ID takes different values. According to this figure, a bigger ID value results in a smaller memory failure probability. For a row containing B cells, if we use ID as the interleaving distance then each word (MECC protected) will contain B/ID cells and ID adjacent cells in a row belong to different words. Only an MCU with a size greater than ID can corrupt two cells of one word, so a bigger value of ID leads to more protection against upsets. If the width of each word is set to a fixed value, then a bigger ID value means a wider memory, and this affects the physical shape (aspect ratio) of the memory.

Figure 21 shows the case of fixed ID and variable L. In this case increasing the L parameter leads to more protection in each word, so the memory failure probability decreases. Note that an increment of L results in more parity bits in each word and higher complexity in the encoding and decoding processes.

For the RS protection scheme there are two parameters: the length of the symbols (S) and the number of symbols (L) which can be corrected in a frame (row) of the memory. If there are B cells in a memory row, then each row contains B/S symbols. If the correction capability (L) remains fixed, increasing the length of the symbols decreases the memory failure probability, according to Fig. 22. An RS coding with a correction capability of 1 symbol in a row can correct up to S upsets located in one symbol, so increasing the length of the symbols increases the probability that the upsets fall inside a correctable symbol. On the other hand, if the length of the symbols remains fixed but the correction capability increases, then the memory failure probability decreases (Fig. 23); to describe such a situation, suppose that each symbol is a super-cell and each row contains B/S super-cells. If there are up to L erroneous super-cells in a row, the RS code can correct them.

Fig. 22 L variation effect on failure probability of RS coding
Fig. 23 S variation effect on failure probability of RS coding
Fig. 24 L variation effect on failure probability of block coding
Fig. 25 Block length variation effect on failure probability of block coding
Fig. 26 Horizontal coding variation effect on failure probability of product coding
Fig. 27 Vertical coding variation effect on failure probability of product coding

In the third protection scheme (block coding) there are two important parameters: the length of each block (B-Len) and the correction capability (L). According to Fig. 24, increasing the block length leads to a higher memory failure probability: when the block length increases, the occurrence of upsets in a block increases, so the failure probability of a block increases for a fixed L.

If the length of the blocks remains fixed but the correction capability increases, then the memory failure probability decreases. For each of the three cases illustrated in Fig. 25 the length of the blocks is 8 rows of the memory, so the probability of occurrence of u upsets in each block of the memory is equal in these cases. Therefore the memory with a bigger L can correct more upsets in a block and reduce the block failure probability.

The fourth scheme (product block coding) has two main options to be tuned: the correction and detection capabilities of the row coding (Lh and Lh′) and the correction and detection capabilities of the column coding (Lv and Lv′).

In the first experiment the correction capability of the row code is increased (Lh and Lh′) with a fixed column code correction capability. As depicted in Fig. 26, this approach can decrease the memory failure probability drastically. Increasing Lh results in the correction of more erroneous rows, so the number of errors which can exist in each column of a block decreases and the block failure probability is reduced. In this experiment the parity code (Lv = 0, Lv′ = 1) is used as the column code, and the parity (Lh = 0, Lh′ = 1), SEC-DED (Lh = 1, Lh′ = 2) and DEC-TED (Lh = 2, Lh′ = 3) codes are used as row codes.

If the horizontal code keeps fixed correction and detection capabilities (Lh and Lh′), then the block failure probability can be reduced using a more powerful vertical code (Lv and Lv′). In this experiment (Fig. 27) the parity code (Lh = 0, Lh′ = 3) is used as the row code, and the parity (Lv = 0, Lv′ = 1) and SEC-DED (Lv = 1, Lv′ = 2) codes are used as column codes.

5.4 Comparison of Various Protection Schemes

To compare the aforementioned protection schemes, I should allocate the same protection capability per bit to all schemes. To this end I select the protection parameters according to Table 2. In this table M (= 16,384) cells of memory have been arranged in a W × B array layout.

In this table, for all schemes the ratio of correction capability per bit in a block (RCpb) is identical. For the first scheme, L = 1 in each 8-bit word, so RCpb is 1/8. For RS coding each frame contains 64 bits partitioned into 8 symbols of 8 bits each; if this coding scheme can correct only one symbol, then the RCpb is 8/64 or 1/8. In the third scheme, each block is an 8 × 8 array of bits in which the block code can correct up to 8 bits, so in this scheme the RCpb is 1/8. The product coding scheme can detect 1 erroneous cell in each row and column of a block (8 × 8 bits); so for each row/column this scheme can correct only one bit and the RCpb is 8/64 or 1/8.
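This RCpb bookkeeping is simple arithmetic; a minimal sketch (block sizes and capabilities taken from the text above) just confirms that all four schemes land on 1/8.

from fractions import Fraction

# Correction capability per bit in a block (RCpb) for the four compared schemes.
rcpb = {
    "MECC-Interleaving": Fraction(1, 8),    # L = 1 per 8-bit word
    "RS":                Fraction(8, 64),   # 1 symbol (8 bits) per 64-bit frame
    "Block":             Fraction(8, 64),   # 8 bits per 8 x 8 block
    "Product":           Fraction(8, 64),   # 1 bit per row/column of an 8 x 8 block
}

if __name__ == "__main__":
    for scheme, value in rcpb.items():
        print(f"{scheme}: {value}")          # every scheme reduces to 1/8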
The memory failure probabilities for these four types of protection schemes are illustrated in Fig. 28. The block coding scheme has the lowest failure probability, and the interleaving-MECC, RS and product schemes have increasingly higher failure probabilities, respectively. The main reason for the good performance of block coding is its ability to correct up to 8 erroneous bits which are distributed randomly in a 64-bit block.

In the other schemes the 8 erroneous bits cannot be distributed randomly in the block. For example, in the RS scheme the 8 erroneous bits should all be in one specific symbol, and in the interleaving-MECC scheme all 8 bits must occur in different interleaved words. The main drawback of the block coding scheme is its more complicated encoder and decoder structure, which increases the cost of protection.

Among the other three schemes the product coding has the worst performance. This is because of its most restricted condition for correcting 8 erroneous bits in a memory block: all rows in a block must contain only one erroneous bit and the columns of the erroneous bits must be different. According to Fig. 28 this condition is slightly more restrictive than the condition of the RS scheme, in which the 8 erroneous bits must be in one symbol.

In the interleaving-MECC scheme, the 8-bit correction capability in 64 bits is realized as a 1-bit correction capability in each 8 bits. So this scheme can correct more of the upset states occurring in a 64-bit block of the memory. In Fig. 29 the MTBF for these schemes is illustrated.

Fig. 28 Comparison among various coding schemes
Fig. 29 MTBF for various coding schemes

6 Conclusion

Considering the increased importance of MCU in SRAM memory chips, the development of analytical models has become essential. In this paper an analytical model has been presented to determine the reliability of an SRAM memory with various schemes of error correction codes. The memory is divided into blocks and each block has an error correction code to overcome the effects of corrupted bits. In comparison to previous models for analyzing the behavior of MCU using the Poisson process and geometric distribution, our model can compute the failure probability of each block of memory more efficiently.

Experiments have been performed to verify the model for different coding schemes. Our model shows less than 3.1% difference from the simulation.

The impact of the coding parameters on the memory failure probability has been considered in another experiment. From the results one can select proper values for the effective parameters of the error correction scheme. Finally, an experiment compares the ability of the various coding schemes to construct a more reliable SRAM chip.
… SRAM memories used as particle detectors, Proc. Euro. Conf. Radiation and its Effects on Components and Syst., pp 10–14
42. She X, Li N, Jensen DW (2012) SEU tolerant memory using error correction code. IEEE Trans. on Nucl. Sci. 59(1):205–210
43. Tang H, Park J (2016) Unequal-error-protection ECC for the embedded memories in DSPs. IEEE Trans. on VLSI Syst. 24(6):2327–2401
44. Tianqi W, Xiao L, Huo M, Zhou B, Chunhua Q, Shanshan L, Xeubing C, Romgsheng Z, Jing G (2015) SEU prediction in SRAMs accounting for on-transistor sensitive volume. IEEE Trans. on Nucl. Sci. 62(6):3207–3215
45. Uemura T, Tanabe R, Matusyama H (2012) Mitigation technique against MBU without area, power and performance overhead. Proc. IEEE Int. Rel. Phys. Symp., pp 5B.4.1–5B.4.6
46. Venkataraman S, Santos R, Maheshwari S, Kumar A (2014a) Multi-directional error correction schemes for SRAM-based FPGAs. Proc. Int. Conf. on FPL, pp 1–8
47. Venkataraman S, Santos R, Das A, Kumar A (2014b) A bit-interleaved embedded Hamming scheme to correct SBU and MBU for SRAM-based FPGAs. Proc. Int. Conf. on FPL, pp 2–6
48. Wirthlin M, Lee D, Swift G, Quin H (2014) A method and case study on identifying physically adjacent MCU using 28 nm interleaved and SEC-DED protected arrays. IEEE Trans. on Nucl. Sci. 61(6):3080–3087
49. Wu W, Seifert S (2015) MBU-calc: a compact model for MBU SER estimation. Proc. Int. Rel. Phys. Symp., pp SE.2.1–SE.2.6
50. Yoshimoto S, Amashita T, Yoshimura M, Matsunaga Y, Yasuura H, Izumi S, Kawaguchi H, Yoshimoto M (2012) Neutron-induced soft error rate estimation for SRAM using PHITS. Proc. IEEE Int. On-Line Test. Symp., pp 138–141

H. Jahanirad received his Ph.D. degree from Iran University of Science and Technology (IUST) in electrical engineering. He is currently an assistant professor in the Department of Electrical Engineering, University of Kurdistan, Sanandaj, Iran. His research interests include VLSI design, fault-tolerant systems and digital circuit testing.