
J Electron Test

DOI 10.1007/s10836-017-5649-x

Reliability Model for Multiple-Error Protected Static Memories


Hadi Jahanirad 1

Received: 8 September 2016 / Accepted: 6 February 2017


© Springer Science+Business Media New York 2017

Responsible Editor: M. Goessel

* Hadi Jahanirad
[email protected]

1 Department of Electrical Engineering, University of Kurdistan, Pasdaran street, Sanandaj, Iran

Abstract The problem of multi-cell upset (MCU) has become a major issue in nanometer SRAM chips. Various types of multi-bit error correction codes (MECC) have been developed to mitigate this problem. Proper selection of the parameters of each MECC scheme can lead to an efficient encoder/decoder design. In this paper a semi-analytical model is presented which can estimate the memory failure probability (as well as the mean time between failures (MTBF) and reliability) and can guide system designers in selecting a proper protection scheme for the system memories. The model was validated by comparison to a simulation method (less than 3.1% estimation error) and a state-of-the-art model (less than 2.9% estimation error). The impact of various parameters of four types of correction schemes is analyzed, and these coding schemes are compared with respect to their capability to enhance memory reliability.

Keywords SRAM . MCU . Multi-bit error correction codes . MTBF . Bit interleaving

1 Introduction

As technology scales down, the effect of cosmic radiation (neutrons and alpha particles) on SRAM chips becomes significant. The single bit upset (SBU) is the common event when a particle hits the memory; however, shrinking the size of memory cells leads to multiple cell upset (MCU) occurrence in each particle hit. Furthermore, the corrupted bits accumulate and remain in the memory until a scrubbing operation corrects all corrupted bits in the memory.

To mitigate the SBU and MCU effects, various error correction coding schemes have been used in SRAM chips. The basic architecture of the coding process in SRAM chips is shown in Fig. 1. A block of data bits enters the encoder and some parity bits are added to them. The resulting data and parity bits (encoded data) are stored in a memory block. The decoder extracts the data bits from an SRAM memory block and corrects the corrupted bits in that block. Each coding scheme has some parameters which should be tuned properly. For example, when MECC and interleaving (a technique in which adjacent cells in a row belong to different words and two consecutive bits of a word are placed ID cells apart (Fig. 3)) are used simultaneously for memory protection, the correction capability of the MECC (L) and the interleaving distance (ID) are the key features. Selecting a greater L leads to more protection at the cost of more complex encoder-decoder circuits, and using a larger ID changes the aspect ratio of the memory chip. On the other hand, each coding scheme performs better in a specific aspect of dealing with MCU. For example, the Reed-Solomon (RS) code [18] handles burst errors more efficiently than MECC-interleaving schemes.

To select a proper scheme and its parameters we should be able to compute the memory failure probability accurately and quickly for different codes and various coding parameters. To evaluate the efficiency of MCU protection methods, memory failure probability and mean time between failures (MTBF) are well-known metrics. In this paper a semi-analytical model is presented which allows two-step memory failure (as well as MTBF) estimation. The first step is based on probabilistic distributions to model the event arrival time and the size of MCU. The second step is related to block failure probability calculation. By a block of memory, we mean a

portion of memory which is protected by one MECC scheme. The models for four major categories of MECC schemes are derived and their performances are compared by simulation. This study presents an accurate approach to MCU behavior modeling and provides semi-analytical modeling of block-based protection schemes, which have not been investigated in previous studies.

This paper is organized as follows. The related background work is presented in Section 2. Our approach to the memory failure probability and MTBF evaluation is described in Section 3. Block memory failure probability analysis for different coding schemes is presented in Section 4. The simulation results and discussions are given in Section 5, which is followed by a conclusion in Section 6.

Fig. 1 A general SRAM protection scheme (block diagram: data in, encoder, encoded data, storage cells, corrupted data, decoder, data out)

2 Related Works

2.1 MCU Characterization

The main features of MCU have been investigated in detail in the literature. Environmental conditions (temperature, operating voltage, altitude, …), memory design factors (layout, architecture, …) and test conditions (test pattern, memory bit patterns, …) determine the characteristics of MCU. The size and shape of MCU patterns were experimentally considered in [11, 13, 19, 20, 28, 30] for different sub-micron SRAM chips. In [8–10, 15, 31, 44] the MCU features have been described at transistor level using simulation and experimental approaches. A different method was introduced in [12], where a statistical approach was the basis for MCU and SEU characterization. In [41] a novel experimental setup was constructed for an SRAM chip to detect heavy ion radiation flux.

2.2 MCU Mitigation

Plenty of multi-error correction codes have been developed in the literature to mitigate the MCU effects. These methods can be partitioned into four categories. The first category is the combination of interleaving with single or multiple correction codes. Interleaving is an efficient method to tolerate MCU in an SRAM chip when protection is done using Single Error Correction codes. In this technique the encoder and decoder are kept simple to correct only single cell upsets.

In this type of memory coding each row is divided into some words containing N bits, and each of them is protected by ECC. Adjacent bits are distributed to different words, so if an MCU occurs in adjacent positions in a row, the corrupted bits will belong to different words. Therefore, the error correction capability improves. For example, in Fig. 2, the first bits (Bit 1) of 4 words are put in the first 4 cells of the memory row and then this pattern is repeated for the N-1 remaining bits of each word (in this figure each color represents a word).

Fig. 2 Bit interleaving scheme for 4 words (cell groups labeled Bit 1, Bit 2, …, Bit N-1)

This coding type has been used in many error-prone digital systems. A good instance of such a protection method was introduced in [46]. Hamming code and bit interleaving, along with putting the tap-well in a location which could help the Single-bit Error Correction (SEC) to correct bigger size MCU, made this coding scheme very efficient. The amount of necessary parity bits for words in a memory row has been reduced dramatically in [42] by combining words of a row. In addition, a virtual ground was added to the multiword so that the leakage current of the memory was reduced drastically. In [43] a novel approach was introduced which first detects the more vulnerable bits of microprocessor embedded memory so that, using this information, the correction capability of ECC is improved. In [47] a method introduced to model adjacent upsets in dense SRAM memories was validated by experimental data regarding SRAM-based FPGAs. This model illustrated that bit interleaving reduces the impact of MCU on the 7-series of FPGA. In [14] a novel bit interleaved Hamming code was developed in which the configuration frame was divided into sub-frames. In a sub-frame, bits associated with the user design are essential and the other bits are non-essential. The non-essential bits are used to embed Hamming parity bits. In runtime using readback operation the sub-frames are decoded

using embedded Hamming code and then the corrected frame is written back to the design. This protection scheme has no area overhead and can correct more than 90% of MCUs in FPGA configuration memory.

The second type of MECC is the frame based codes. In this type of coding scheme each block of N bits (including data bits and code bits) is divided into Nsym symbols each containing S bits. Figure 3 shows a frame with Nsym symbols each containing S = 8 bits. If there are u upsets in a block, then block failure occurs when more than L symbols have corrupted bits. Reed-Solomon (RS) and BCH are common coding schemes belonging to this category. This type of coding has better performance in detection and correction of burst errors. An important drawback of this coding scheme is its complicated encoding and decoding processes.

Fig. 3 One frame containing N symbols (symbols labeled Symb 1, Symb 2, …, Symb N)

These coding schemes were used in various studies. In [1, 27] double error correction BCH codes were developed to mitigate three- and four-bit MCU burst upsets in SRAM memories. In [29] an efficient error correction code has been presented for embedded memories of digital signal processors (DSP) in which the more important bits of the memory were further protected. In [2] a novel technique was described which combines Hamming and Reed-Solomon codes to protect memory against multiple bit errors. In [18] a cost-optimized Reed-Solomon encoding and decoding scheme has been presented for SRAM memories.

The third category is based on block (matrix) codes. In such codes k bits of data are fed into an encoder and n coded bits are stored in the memory. In this type of protection code, W words of memory with B bits in each word are considered as a block. This means that if up to L errors happen in the W × B bits of one block then the MECC can correct these upsets (Fig. 4).

Fig. 4 The structure of a block coding scheme (N blocks, Block 1 to Block N, each of W words with B bits)

This coding scheme is the most common type in MCU mitigation techniques. In [32] a new correction code was introduced in which, by redesigning the Hamming matrix, the adjacent errors (MCUs) generate syndromes that are not similar to SEU syndromes. In [3] a majority logic decoding scheme was used in a difference-set cyclic code which can correct large burst errors in the memory, along with a modification of the decoding scheme which resulted in lower area and time overhead. In [24] a modified version of the decimal matrix code (DMC) was used to detect and correct MCU in the SRAM configuration memory of an FPGA with a very low cost architecture. In [26, 33] new error correction schemes have been introduced which can correct three error bits in a word in addition to single error correction and double adjacent error correction capabilities. These codes rank in complexity between SEC-DEC and more sophisticated BCH DEC codes. In [17, 36, 40, 48] a matrix based ECC code which belongs to the block coding scheme has been investigated. This type of code could handle adjacent error bits more efficiently. In [4, 25] a modified version of Hamming code has been developed by using the decimal form of code words in the encoding mechanism.

The fourth category is the product coding scheme. In this memory protection scheme each row of a block containing W words is protected by an MECC which can correct L_H and detect L′_H upsets. If there are B bits in a word, then each block can be considered as another block with B rows and W columns, and another coding scheme can be applied to such a block (L_V and L′_V). In Fig. 5 a block which contains eight 8-bit words is illustrated. The blue cells are the data bits of the block, the green cells are the parity bits of each block row and the yellow cells are the parity bits of each column. In this type, the memory is partitioned into some blocks each having R rows and C columns. Each row is coded first and then each column of the memory block is coded independently. For example, in [45] SEC is applied to encode the row and parity is used in column coding.

Fig. 5 The structure of a block for product coding scheme

In [35] a cost-efficient structure based on erasure codes was introduced to correct MCU in configuration memory of an

FPGA. In [22] a built-in 2-D Hamming based correction code has been developed to improve the reliability and availability of FPGA chips in space mission applications. In [21] two error correction codes along with the readback scrubbing property of FPGA configuration memory have been developed to solve the MCU problem in FPGA. In [45] a harmonious row and column coding scheme has been developed which can improve the reliability of the SRAM.

A hybrid block code can be obtained by combining the above coding categories. In this type each block of the memory is divided into m parts and then each part is encoded using a separate encoder. The encoder of each part may use a different coding scheme. For example, [39] used RS and Hamming coding for two parts of each memory block.

In [34] the Hamming decoder and the One-Step Majority Logic Decoder (OS-MLD) have been compared according to the susceptibility of FPGA configuration memory to accumulated SEU errors. In [5] an example of hybrid codes has been developed in which the Hamming code bits are put among the DTI (doubly transitive invariant) code bits. This structure shows lower overhead and better performance than DEC codes. In [5] a hybrid code based on a sub-group of low density parity checks (LDPC) has been presented which can correct a large number of erroneous bits in a block of the memory.

A different approach to MCU mitigation is structural hardening of SRAM memory cells against radiation. For example, in [7] a novel method to mitigate MCU in SRAM memories has been presented. In such SRAM chips, by changing the internal architecture of cells and the read and write lines, more MCU could be tolerated. However, this approach is out of the scope of this paper.

2.3 SER Estimation

Soft Error Rate (SER) estimation is the goal of all studies related to reliability analysis of radiation-prone digital systems. SRAM chip SER estimation methods regarding MCU can be divided into two main categories. The analytical models belong to the first category, in which the behavior of an N-cell MCU event is estimated as N independent 1-cell (SEU) events and, using mathematical elaboration, an upper bound for the SER of the SRAM is deduced. We point to two examples of this type of modeling, presented in [37] and [38]. In [37] an analytical reliability model has been developed for an SRAM architecture in the presence of the SEC and bit interleaving protection scheme. In that paper ([37]) a general expression has been derived for MTBF regarding MCU, and then, using some approximation, the expression is simplified so that the designer could compare it with the expression previously derived for SEU. In [38] a procedure has been presented which can be used to select the best ID in SEC and bit interleaving protected memories. The effects of memory size on the value of the optimum ID and the area and power overhead of the memory have been investigated in [38].

The second category of SER estimation methods is semi-analytical models. One major problem of analytical methods is that the row failure probability of MCU cannot be accurately formulated. They use accumulated SEUs instead of MCU, which is not a precise assumption regarding the MCU clustering trend. To overcome this shortcoming semi-analytical methods have been developed. These methods include two main efforts: one is the event arrival time and MCU behavior modeling and the other is the computation of the row failure probability due to MCU. The event arrival time and MCU size are modeled using Poisson and geometric distributions, respectively [6, 23, 37, 50]. In [49] two technology parameters, "effective size" and "line fitting", are defined to account for the size and the shape of MCU. In this model the cells affected by an MCU event are distributed in some rows and columns of memory according to its size. The adjacent cells in each row are assumed to be corrupted by MCU. In [6] the row failure probability is computed assuming that all corrupted cells are distributed completely randomly among memory cells. Hence the failure probability of each memory word can be computed by considering all mappings of MCU cells which generate memory failure. This model was used to select a proper interleaving distance for an SEC protected memory. In [23] the row failure probability has been computed by random fault injection in the row cells. In this model accumulation of soft errors is included by some modification of the geometric distribution parameter, and the distribution of corrupted bits in a memory row is assumed to be uniformly random. The analysis of this model has been done for an SRAM which was protected by SEC and interleaving schemes. In [50] the SER of an SRAM chip has been computed using a software tool at the transistor level. Two different 6T-SRAM cell layouts were presented in [50] and then SER estimates for SEU, horizontal MCU and vertical MCU were derived using the proposed software tool. These estimates were 26–41% lower than that for the conventional 6T-SRAM cell layout.

2.4 Motivation

Two main motivators for this study are as follows. The first is the presentation of a more accurate model to calculate row failure probability. All three mentioned models [6, 23, 49] do not capture all aspects of MCU features. The first model ([49]) assumes that if there is more than one upset in a row, all of them are adjacent (one cluster of corrupted cells in a row). This ignores all the cases in which corrupted cells are not adjacent or are divided into multiple clusters. In the second model [6] no clustering effect is considered, and this model overestimates the failure probability of MCU. In the third model [23] the clustering effect is included by defining the row grouping number as the probability of existence of a row

containing g upsets when MCU occurs. However, the distribution of these g upsets in a row is still assumed to be fully random and no clustering is assumed in a row, which is a significant deviation from MCU behavior. In this paper I propose a model which includes the clustering effect of MCU in the row and column directions.

The second motivator is the lack of a semi-analytical model for SER estimation of block based memory protection schemes. All the above SER estimation methods have been developed for SRAM chips with the SEC-interleaving protection scheme, and to my knowledge there is no SER estimation method for block-based coding schemes in the literature. When the MCU ratio increases with decreasing feature size of transistors, SEC-interleaving cannot produce an acceptable reliability level. As mentioned in the previous sub-section, more powerful coding schemes (such as RS, block-based, product, …) should be developed. In this paper a semi-analytical approach is presented for each category of protection schemes.

3 MTBF and Memory Failure Probability Evaluation

In this section first a semi-analytical model (event arrival time and MCU behavioral modeling) is presented; the second part of the modeling (block failure probability computation for each protection scheme) is described in Section 4.

3.1 MCU Behavior

Each cosmic particle which hits the chip can affect one (Single Cell Upset (SCU)) or more cells (MCU) in an SRAM chip. Shrinking the feature size of transistors in memory chips increases the ratio of MCU to SEU. Ibe et al. [20] reported that the MCU ratio increases from 0% to more than 45% when technology goes from 250 nm to 22 nm. In [20] MCU patterns were classified into three categories: a single line along a column (BL), a single line along a row (WL) and cluster (an MCU that has more than one upset along row and column directions). The clustered MCUs have various shapes, e.g., Fig. 6 shows some examples of MCU patterns which include 9 cells and a multiplicity equal to 3 upsets (the black squares are corrupted bits). The ratio of these categories is different for each memory chip due to technology and environmental conditions. According to data collected from [20], I plotted the percentages of the three categories of MCU (WL_ratio(m), BL_ratio(m) and C_ratio(m)) in Fig. 7. Generally, the third category (clustered MCU) has the largest share, and this share increases significantly as feature size shrinks. The multiplicity is an important factor that influences the MCU behavior. The MCU ratio (in percent) versus multiplicity for 22 nm technology is indicated in Fig. 8 using data from [20]. An MCU with large multiplicity is related to large size upsets, which have a low occurrence probability. So the ratio of MCU decreases when multiplicity is increased. I derived the following equation for ratio vs. multiplicity (m) using the curve fitting tool of MATLAB, in which a = 212.2, b = −1.646, c = 19.9 and d = −0.3444.

$$\mathrm{ratio\_multiplicity}(m) = a\,e^{bm} + c\,e^{dm} \tag{1}$$

Fig. 6 MCU patterns with size = 9 and multiplicity = 3

Fig. 7 Ratios of MCU categories

Scrubbing is an efficient method to remove all accumulated upsets in the memory. This method scans all words in the memory to correct all corrupted bits. Between two scrubbing operations, the upsets accumulate in the memory and the failure probability will increase, so the scrubbing interval must be selected so that the memory error correction code scheme can overcome the corrupted bits conveniently [23, 37, 38, 50]. In this paper the scrubbing interval is selected so that no MCU accumulation occurs. If, for example, the event arrival rate of soft errors is assumed to be 0.01 events per second (which is a rather high arrival rate in comparison to real situations) then there will be one event in 100 s, while the time interval between scrubbing operations is less than 100 ms in real systems. Thus the no-accumulation assumption is required for real memory-based digital applications.

Suppose that there is a memory with 256 rows and 64 columns (i.e., 64 bits in a row). Furthermore, the scrubbing

interval is selected so that there is not more than one MCU occurrence in the memory between two scrubbing operations, and each event can generate up to 30 upsets. I set up a simulation to study the distribution of these corrupted bits in the memory. The flowchart of this simulation is given in Fig. 9.

Fig. 8 Ratio of MCU multiplicity (log-log plot of Ratio (%) against Multiplicity; series: Data, Fitted)
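The injection loop outlined in Fig. 9 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the simulator used for the reported results: the rectangle construction (a uniformly chosen width and the smallest height that covers the sampled area), the clipping of the normal draw to the range from m to area-bound, the normalisation of the three features, and all function names are assumptions introduced here.

```python
import random
from collections import Counter

AREA_BOUND = 100   # upper limit on the affected area in cells (value quoted in the text)
ROW_BITS = 64      # row width of the example memory (64 bits per row)

def sample_area(m, rng):
    """Draw an area in [m, AREA_BOUND] from a normal law (mean m, std 10% of the bound)."""
    while True:
        area = round(rng.gauss(m, 0.1 * AREA_BOUND))
        if m <= area <= AREA_BOUND:
            return area

def inject_mcu(m, rng):
    """Return m distinct (row, col) upsets inside a randomly shaped rectangle."""
    area = sample_area(m, rng)
    width = rng.randint(1, min(area, ROW_BITS))   # assumed: uniformly chosen side length
    height = -(-area // width)                    # smallest height that covers the area
    cells = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(cells, m)

def row_features(upsets):
    """Cluster sizes, upsets-per-row counts and number of rows hit, for one MCU."""
    rows = {}
    for r, c in upsets:
        rows.setdefault(r, []).append(c)
    clusters, per_row = Counter(), Counter()
    for cols in rows.values():
        cols.sort()
        per_row[len(cols)] += 1
        run = 1
        for left, right in zip(cols, cols[1:]):
            if right == left + 1:
                run += 1              # still inside a run of adjacent cells
            else:
                clusters[run] += 1    # run ended: record its length
                run = 1
        clusters[run] += 1
    return clusters, per_row, len(rows)

def estimate_features(m, iterations, seed=0):
    """Empirical analogues of clus-p(c,m), pRE(g,m) and pCE(r,m) for multiplicity m."""
    rng = random.Random(seed)
    clusters, per_row, rows_hit = Counter(), Counter(), Counter()
    for _ in range(iterations):
        c, g, r = row_features(inject_mcu(m, rng))
        clusters.update(c)
        per_row.update(g)
        rows_hit[r] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in sorted(counter.items())}

    return normalise(clusters), normalise(per_row), normalise(rows_hit)

clus_p, p_re, p_ce = estimate_features(m=4, iterations=2000)
```

Here each feature is normalised over its own population (all clusters, all affected rows, all iterations); a different normalisation would change the numbers but not the structure of the loop.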
The WL and BL types of MCU have straightforward behaviors. They hit one row or column of memory, and the probability of the number of corrupted cells can be deduced using eq. (1): the probabilities of WL and BL type occurrences when there are m corrupted bits in memory are equal to WL_ratio(m) × ratio_multiplicity(m) and BL_ratio(m) × ratio_multiplicity(m), respectively. For the cluster type of MCU an area of memory is affected. To characterize this MCU type three features are defined. The c-size cluster occurrence probability (clus-p(c,m)) expresses the probability of occurrence of a cluster with c corrupted cells in a row (c consecutive corrupted cells in a row) when a total of m cells are corrupted by MCU. The row error probability (pRE(g,m)) indicates the probability of existence of g upsets in a row (these g corrupted cells are distributed among row cells randomly) when the MCU generates m corrupted cells in the memory. The column error probability (pCE(r,m)) represents the probability that an MCU with a multiplicity of m cells extends over r rows in the memory.

Fig. 9 Flowchart of upset injection into memory (flowchart labels: IT = 1000000, Total_upsets = 30; starting from m = 1 and it = 1: select area size (normal dist.), determine length and width of area (uniform dist.), determine disturbed bits (uniform dist.), compute clus-p, pRE and pCE, it = it + 1 until it == IT, then m = m + 1 until m == Total_upsets, make average, finish)

As indicated in Fig. 9, for each multiplicity of MCU (m), IT = 10^5 iterations run sequentially. In each iteration an area between m and area-bound is selected using a normal distribution, hence the m corrupted cells can be distributed among up to area-bound memory cells. In the simulation area-bound is set to 100 cells, and for each m value the mean and deviation of the normal distribution are set to m and 10% of area-bound. These values give a greater chance for a smaller area to be selected, which is essential to model real clustered MCUs [20]. The sizes of the horizontal and vertical sides of the affected area are selected randomly using a uniform distribution. When m corrupted cells are randomly selected in the MCU area, the three clustered MCU features (clus-p(c,m), pRE(g,m) and pCE(r,m)) are calculated in the current iteration. This process is repeated in the remaining iterations, and the averages of the three features are used as the final values in the rest of this paper.

To formulate pRE(g,m), in [6, 23] a geometric distribution was used to define the probability of the row grouping number (which indicates the probability of existence of g upsets in a row when the MCU generates X corrupted cells in the memory) according to eq. (2).

$$p_{rgn}(g, X) = r(X)\,\bigl(1 - r(X)\bigr)^{g-1} \tag{2}$$

The r parameter indicates how many events produce clustered upsets. The range of this parameter is between 0 and 1; when r increases from 0 to 1 the events tend to produce SBU events more than MCUs. For my simulations I used r(X) = 0.57, and prgn(g,X) and pRE(g,X) are plotted in Fig. 10. As this figure shows, there is an acceptable match between simulation and analytical results.

On the other hand, when a row of memory contains g corrupted cells, these g upsets can be distributed in various patterns over a memory row. For example, for g = 2 there are

two different patterns: two adjacent cells or two non-adjacent cells. In a memory with 8-cell width, two corrupted cells can take 28 different states, in which the two-adjacent-cell pattern covers 7 states and the other pattern covers 21 states. In this paper, for various values of g the share of each pattern (i'th pattern) in a B-cell width row (patt-share(g,i,B)) is computed by random error injection simulation. Furthermore, different patterns have unequal occurrence probabilities. The i'th pattern occurrence probability (occu-p(g,i)) can be computed using clus-p(c,m) according to eq. (3).

$$\text{occu-p}(g, i) = \prod_{c} \text{clus-p}(c, m)^{\,n(c)} \tag{3}$$

In this equation i is the index of the pattern and n(c) is the number of repetitions of the c-size cluster in the i'th pattern.

Fig. 10 Row error probabilities comparison

Another feature of events that should be considered is how often the energetic particles hit the chip. The experimental results illustrate that the event occurrence can be described using a Poisson process [6, 23, 37, 50]. The λ parameter of the Poisson distribution implicitly shows the average time between two events. The total number of upsets which are developed from some events can be described using a Compound Poisson (CP) distribution as reported in [6, 23, 37, 49, 50].

$$CP(X, t) = \sum_{Y=1}^{X} \frac{(\lambda t)^{Y} e^{-\lambda t}}{Y!} \binom{X-1}{Y-1}\, r^{Y} (1 - r)^{X-Y} \tag{4}$$

In this equation X is the total number of upsets in the memory and λ is the Poisson distribution parameter.

3.2 MTBF Evaluation

To compute MTBF, the failure probability of each block should be computed first. A block is a part of memory with a single coding scheme. For example, in the SEC protected memory each row should be considered as a block, but in RS coding the bits that belong to a collection of symbols should be considered. Details of this step of MTBF evaluation will be described in the next section.

The next step in the MTBF evaluation is computation of the failure probability of the entire memory. By memory failure we mean that at least one of the words in the memory is in the failure state. A word failure occurs when the number of its corrupted bits is beyond the correction capability of the ECC. To compute the memory failure probability the following equation can be used.

$$F_{mem}(X) = \sum_{g=1}^{x_{max}} p_{block}(g) \times F_{block}(g) \tag{5}$$

In this equation Fmem(X) is the memory failure probability when there are X upsets in the memory. Fblock(g) is the block failure probability in the presence of g upsets in the block. The first term of the right hand side is the probability of existence of blocks with g upsets, and xmax is the upper limit on the number of upsets in a block.

We compute the memory failure probability in the presence of x upsets. Using the CP derived in the previous section (eq. (4)), this probability at time t can be obtained as follows.

$$F(t) = \sum_{x=1}^{X_{max}} CP(x, t) \times F_{mem}(x) \tag{6}$$

The reliability function for the memory is derived as follows:

$$R(t) = 1 - F(t) \tag{7}$$

Finally, the mean time between failures is computed as

$$MTBF = \frac{\int_{0}^{T_{scr}} R(t)\,dt}{1 - R(T_{scr})} \tag{8}$$

3.3 Simulation for MTBF Evaluation

To simulate MCU behavior, the time of event occurrence should be determined using an exponential distribution. Because of the scrubbing capability of the memory, the simulation should be carried out for an adequate number of scrubbing intervals (TN). For an event the injection of MCU/SEU is done using the algorithm of Fig. 11.

After each event occurrence, the fail or pass status of the SRAM chip is tested. If there is no failure, then the affected bits are corrected (scrubbed) and the next MCU will be considered. This process continues until a failure occurs. The time of failure occurrence is registered and the corrupted bits are scrubbed. This process is repeated for other scrubbing time intervals. The average of time differences between successive failures is reported as the MTBF in the current simulation run (flowchart of Fig. 11). The above steps are repeated until the stopping criterion is met. In our simulation the stopping criterion is a specified number of simulation runs. Finally, the average of the MTBF of all runs is reported as the memory MTBF.
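The analytical chain of eqs. (4) and (6) to (8) can be evaluated numerically as sketched below. This is a minimal sketch under stated assumptions rather than the implementation behind the reported results: eq. (4) is read as a Poisson number of events, each producing a geometrically distributed number of upsets with the parameter r of eq. (2); the integral of eq. (8) is approximated by the trapezoidal rule; and `toy_f_mem`, which declares a failure as soon as more than one upset is present, is a hypothetical stand-in for the Fmem(x) of eq. (5).

```python
import math

def compound_poisson(x, t, lam, r):
    """Eq. (4): probability that x upsets in total have occurred by time t."""
    total = 0.0
    for y in range(1, x + 1):  # y = number of particle events
        p_events = (lam * t) ** y * math.exp(-lam * t) / math.factorial(y)
        # assumed exponent placement: geometric upsets per event, as in eq. (2)
        p_split = math.comb(x - 1, y - 1) * r ** y * (1 - r) ** (x - y)
        total += p_events * p_split
    return total

def failure_probability(t, lam, r, f_mem, x_max):
    """Eq. (6): F(t) = sum over x of CP(x, t) * F_mem(x)."""
    return sum(compound_poisson(x, t, lam, r) * f_mem(x) for x in range(1, x_max + 1))

def reliability(t, lam, r, f_mem, x_max):
    """Eq. (7): R(t) = 1 - F(t)."""
    return 1.0 - failure_probability(t, lam, r, f_mem, x_max)

def mtbf(t_scr, lam, r, f_mem, x_max, steps=1000):
    """Eq. (8), with the integral of R(t) over [0, T_scr] done by the trapezoidal rule."""
    h = t_scr / steps
    values = [reliability(i * h, lam, r, f_mem, x_max) for i in range(steps + 1)]
    integral = h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return integral / (1.0 - values[-1])

def toy_f_mem(x, correctable=1):
    """Hypothetical placeholder for F_mem(x): fail once x exceeds the correctable count."""
    return 1.0 if x > correctable else 0.0

# Example values from the text: 0.01 events/s, 100 ms scrubbing interval, r = 0.57
print(mtbf(t_scr=0.1, lam=0.01, r=0.57, f_mem=toy_f_mem, x_max=30))
```

A useful sanity check on such a sketch: if every upset count is treated as a failure (Fmem(x) = 1 for all x), F(t) reduces to 1 − exp(−λt) and eq. (8) evaluates to 1/λ regardless of the scrubbing interval.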

To compute the row failure probability for clustered type MCU, we assume that there are u upsets in a row. The total number of possible states to distribute these corrupted bits in a row is:

$$\text{tot-state}(u) = \binom{N \times ID}{u} \tag{10}$$

In this equation N is the number of bits in a word. If the number of upsets is greater than ID × L then, according to the pigeonhole principle [16], there are more than L erroneous bits in a word, so the memory will have a failure.
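As a small worked illustration of eq. (10) and of the pigeonhole bound, the two quantities can be computed directly. The function names and the row geometry used in the example (16-bit words with ID = 4, giving the 64-bit row of the example memory, and L = 1) are hypothetical choices made here for illustration.

```python
from math import comb

def tot_state(u, word_bits, interleave):
    """Eq. (10): ways to place u upsets among the word_bits * interleave cells of a row."""
    return comb(word_bits * interleave, u)

def failure_is_certain(u, interleave, correctable):
    """Pigeonhole bound: more than ID * L upsets force some word above L errors."""
    return u > interleave * correctable

# Hypothetical row: N = 16 bits per word, ID = 4 interleaved words, L = 1
print(tot_state(2, 16, 4))           # C(64, 2) = 2016
print(failure_is_certain(5, 4, 1))   # True: 5 upsets cannot fit in 4 words with at most 1 each
```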
The number of non-failing states of a row containing u upsets can be calculated by finding the solutions of the following equation,

$$X_1 + X_2 + \dots + X_{ID} = u, \qquad X_i \le L \tag{11}$$

where Xi is the number of corrupted bits in the i'th word, which should not be greater than L for non-failing states. To compute the number of solutions we use a recursive approach. In this approach a function recur-func is called repeatedly. This function has three input arguments: up-no is the number of upsets, var-no is the number of words and index is the index of the word for which the number of upsets will be determined.
The pseudo-code of this approach is shown in Fig. 12. The
Fig. 11 Flowchart of one run simulation
solution array which lists the solutions has two indices: the
first one is the index of current solution (sol-index) and the
other is the index of words. The index and sol-index are ini-
tialized as 1 at the start (Lines 1 and 2 of pseudo-code). The
4 Block Failure Probability Computation reg-sol is an array to temporarily store the number of upsets in
each word. The bound is the correction capability of MECC
In this section the failure probability analysis (second effort in that in this memory protection scheme is L.
semi-analytical modeling) is described for four memory cod- In the first function call, up-no, var-no and bound are
ing schemes that were mentioned in the introduction section. substituted with u, ID and L, respectively. Each word can
contain up to L upsets in the non-failing state but if the total
4.1 MECC with Interleaving number of upsets in a row (u) is less than L then the upper
bound will be limited to u rather than L. This issue is ad-
As shown in Fig. 2 each word is protected by the MECC with dressed in line 4 of the pseudo-code. The loop in line 5 deter-
L-bit correction capability and ID interleaving distance in a row. mines the number of current word upsets (index’th word)
Interleaving technique leads to standing the same bits of different using the equation in line 7. Then total number of upsets is
words adjacently, thus a cluster of corrupted bits has less chance decreased by the amount of current word upsets and var-no
to generate failed word. For three MCU types we should com- decreases one unit (Lines 8 and 9). The stopping criteria is
pute the block failure probabilities. First of all, BL type cannot checked, if var-no = 0 but up-no ≠ 0, i.e., no acceptable solu-
produce any failure; this MCU type disturbs only one cell in each tion was found so the function goes to a previous state by
row and the MECC will correct all of them. In WL type all canceling the last upset assignment (line 16) and deleting the
corrupted cells belong to a row thus if the length of MCU (u) last word from the reg-sol array (line 14 and 15). On the other
is greater than L × ID then the memory failure will occur. The hand, if up-no = 0, i.e., an acceptable solution was found and if
probability of such a situation can be computed as follows: var-no = 0 all words have at least one upset, or else ID- (var-
no) of words are not corrupted by any upsets. In this situation
 the discovered solution is registered in solution array and the
WL ratioðuÞ  ratio multiplicityðuÞ if u > L  ID sol-index is incremented (lines 21 and 22). If stopping criteria
Prow−WL ðuÞ ¼
0 otherwise
are not met, then recur_func with new arguments will be
ð9Þ called (line 29).
Fig. 12 Pseudo-code to find solutions of eq. (11)

1 sol-index = 1; // index of found solution.
2 index = 1; // index of the word under upset injection.
3 recur_func (var-no, up-no, index)
4 limit = min(up-no, bound); // determination of how many upsets can be injected.
5 for (i = 1 to limit) // injection of i upsets in index'th word.
6 {
7   reg-sol (index) = i; // put i upsets in index'th word.
8   up-no = up-no – i; // remaining upsets calculation.
9   var-no --; // remaining words calculation.
10   index ++; // determination of next candidate word to put upsets on.
11   if (var-no = 0 but up-no ≠ 0) // there are some upsets not located on a word
12   {
13     there is no solution;
14     index --; // cancel the upset assignment to last word.
15     var-no ++; // increase the remaining word.
16     up-no = up-no + i; // retrieve the placed upsets in last iteration.
17   }
18   elseif ((var-no = 0 and up-no = 0) or (var-no ≠ 0 and up-no = 0))
19   {
20     one solution has been found;
21     solution (sol-index , 1 to index) = reg-sol (1 to index); // register the found solution.
22     sol-index++; // increase the number of found solutions.
23     index -- ; // cancel the upset assignment to last word.
24     var-no ++; // increase the remaining word.
25     up-no = up-no + i; // retrieve the placed upsets in last iteration.
26   }
27   else
28   {
29     recur_func(var-no, up-no, bound, index, reg-sol, sol-index );
30   }
31 }
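As a cross-check of the recursion in Fig. 12, the following Python sketch enumerates the same kind of solutions and counts the states each one represents. It is only an illustration under stated assumptions: the names `bounded_solutions` and `non_failing_states` are hypothetical, each solution is taken to be an ordered list of positive upset counts for the corrupted words (as the loop from 1 to limit suggests), and the state count follows eq. (12) as read here, a choice of the corrupted words multiplied by the bit choices inside each of them.

```python
from math import comb


def bounded_solutions(upsets, words, bound):
    """Yield tuples (k_1, ..., k_W) with W <= words, 1 <= k_w <= bound, sum == upsets."""
    def extend(prefix, remaining):
        if remaining == 0:
            yield tuple(prefix)          # all upsets placed: one solution found
            return
        if len(prefix) == words:
            return                       # no word left but upsets remain: dead end
        for k in range(1, min(remaining, bound) + 1):
            yield from extend(prefix + [k], remaining - k)
    yield from extend([], upsets)


def non_failing_states(upsets, words, bits_per_word, bound):
    """Sum, over all solutions, the number of bit-level states they represent."""
    total = 0
    for solution in bounded_solutions(upsets, words, bound):
        states = comb(words, len(solution))       # which words are corrupted
        for k in solution:
            states *= comb(bits_per_word, k)      # which bits inside each word
        total += states
    return total


if __name__ == "__main__":
    # Example: u = 2 upsets, ID = 4 words of N = 8 bits, SEC code (L = 1).
    print(non_failing_states(2, 4, 8, 1), "of", comb(8 * 4, 2), "states do not fail")
```

When the bound is at least u, no state can fail, so the count must equal the total number of states of eq. (10); this gives a simple sanity check on the enumeration.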

Each solution found by recur_func represents a number of states. For example, if the number of upsets in word w is k(w), then these k(w) bits can be selected from N bits in $\binom{N}{k(w)}$ ways. If the solution assigns the upsets to W words of the total ID′ words, then these words can be selected in $\binom{ID'}{W}$ ways. So the total number of states of each recur_func solution (with index s) can be computed using the following equation.

$$\mathrm{nfail\_state}(s)=\binom{ID'}{W}\prod_{w=1}^{W}\binom{N}{k(w)}\tag{12}$$

Altogether we find the total number of states and the number of non-failing states for the distribution of u upsets in a row of memory. The i'th pattern of u upsets has its share in tot-state(u) and nfail-state(s) according to patt-share(u,i,B), as shown in eqs. (13) and (14).

$$\mathrm{tot\_share}(u,i)=\mathrm{patt\_share}(u,i,B)\times\mathrm{tot\_state}(u)\tag{13}$$

$$\mathrm{nfail\_share}(u,s,i)=\mathrm{patt\_share}(u,i,B)\times\mathrm{nfail\_state}(s)\tag{14}$$

If S solutions were derived by recur_func then the row failure probability for clustered MCU is computed using the following equation.

$$P_{row-C}(u)=\left(1-\frac{\sum_{s=1}^{S}\sum_{i=1}^{no\_patt}\mathrm{nfail\_share}(u,s,i)\times\mathrm{occu\_p}(u,i)}{\sum_{i=1}^{no\_patt}\mathrm{tot\_share}(u,i)\times\mathrm{occu\_p}(u,i)}\right)\times\mathrm{ratio\_multiplicity}(u)\times\mathrm{C\_ratio}\tag{15}$$

In the numerator, the inner summation computes the total occurrence probability of non-failing states for each solution of recur_func. According to P_row-WL(u) and P_row-C(u) the block failure probability is computed as,

$$F_{block-IntLV}(u)=P_{row-WL}(u)+P_{row-C}(u)\tag{16}$$

To compute the memory failure probability using eq. (5), p_block(u) must also be determined. In this type of memory protection scheme each block is defined as one row, so to compute p_block(u) the probability of a row having u upsets must be considered. According to definition of pRE(u,m) as the
probability of a row having u upsets when there are m upsets in the memory, the number of rows containing u upsets is computed as follows:

$$NR(u,m)=\frac{p_{RE}(u,m)\times m}{u}\tag{17}$$

The numerator is the total number of corrupted cells in all rows with u upsets; dividing this number by u gives the number of such rows. Because there are W rows in the memory, p_block(u) can be calculated as follows:

$$p_{block}(u,m)=\frac{NR(u,m)}{W}\tag{18}$$

As F_block−IntLv(u) and p_block(u, m) have been computed, F(t), R(t) and MTBF can be calculated using eqs. (6), (7) and (8), respectively.

4.2 RS Codes

In this type of coding scheme each block (called a frame) contains n bits divided into Nsym symbols. If there are any upsets in a symbol, that symbol is considered erroneous. A block failure occurs when more than L erroneous symbols exist in that block (Fig. 3).

Similar to the previous scheme, BL type MCU cannot cause any memory failure, but WL type MCU needs attention. If the length of a WL type MCU is more than L × S, memory failure will occur; otherwise there are three cases. First, the length of the MCU is less than one symbol, in which case the MCU can disturb one or two symbols. If L is greater than 1, this case cannot cause memory failure. However, if L equals 1 the number of failing states is computed as follows:

$$N_{state}(u)=(N_{sym}-1)\times(u-1)\tag{19}$$

Second, if the length of the MCU is a multiple of the symbol length (K = u/S is an integer), the MCU can disturb K symbols. If L is greater than K there will be no failure, otherwise the number of failing states is computed as follows:

$$N_{state}(u)=(N_{sym}-K)\times(S-1)\tag{20}$$

Third, if the length of the MCU is not a multiple of the symbol length, there are two different situations. When the remainder of the division u/S is 1, the MCU affects K + 1 symbols. If L is greater than K + 1 there will be no failure, otherwise the number of failing states is computed as,

$$N_{state}(u)=(N_{sym}-K)\times S\tag{21}$$

When the remainder is greater than 1, the MCU can affect up to K + 2 symbols. If K + 1 is greater than L, the number of failing states is computed using eq. (22). In this equation Rem is the remainder of u/S.

$$N_{state}(u)=(N_{sym}-K)\times(S-(Rem-1))+(N_{sym}-(K+1))\times(Rem-1)\tag{22}$$

Otherwise, if K + 2 is greater than L, the number of failing states is computed according to,

$$N_{state}(u)=(N_{sym}-(K+1))\times(Rem-1)\tag{23}$$

Finally, the block failure probability regarding WL type MCU is computed by the following equation:

$$P_{block-WL}(u)=\frac{N_{state}(u)}{B-(u-1)}\times\mathrm{ratio\_multiplicity}(u)\times\mathrm{WL\_ratio}\tag{24}$$

To calculate the block failure probability for clustered type MCU we proceed as follows. If more than L × S erroneous cells exist in a block, then from the pigeonhole principle failure occurs with certainty.

$$P_{block-C}(u)=1,\qquad u>L\times S\tag{25}$$

For smaller u, all possible states for the distribution of u upsets in Nsym symbols are determined using eq. (26). In this equation, Xi is the number of upsets in the i'th symbol, which should not be greater than S.

$$X_{1}+X_{2}+\dots+X_{N_{sym}}=u,\qquad X_{i}\le S\tag{26}$$

To find the solutions we call the rec_func in Fig. 12 with var-no = Nsym and up-no = u. For each solution, if the number of erroneous symbols is more than L then the block will fail; therefore, for such a solution, the number of non-failing states is zero. Otherwise, the number of non-failing states of each solution can be computed using the following procedure.

For each non-failing solution, first the number of corrupted cells in each symbol is determined; up(i) is the number of corrupted cells in the i'th symbol. For each symbol, up(i) upsets can be distributed in n(i) states. These states include various patterns. The number of states of the j'th pattern is computed as,

$$\mathrm{no\_patt}(j,i)=\mathrm{patt\_share}(up(i),j,S)\times n(i)\tag{27}$$

The occurrence probability of each pattern can be deduced from the occu-p(up(i),j) definition. These steps are repeated for all symbols, then the probabilities of each combination of patterns of symbols are computed using eq. (28). In this equation a(i) is the index of pattern of i'th symbol in k'th
combination.

$$\mathrm{p\_comb}(k)=\prod_{i}\mathrm{no\_patt}(a(i),i)\times\mathrm{occu\_p}(up(i),a(i))\tag{28}$$

The probabilities of all combinations are summed to obtain the total non-failing probability for the current solution:

$$\mathrm{nfail\_prob}(s)=\sum_{k}\mathrm{p\_comb}(k)\tag{29}$$

The total probability of u upsets in a row is computed as,

$$\mathrm{tot\_prob}(u)=\sum_{i}\mathrm{no\_patt}(u,i)\times\mathrm{occu\_p}(u,i)\tag{30}$$

The summation is over all different patterns of u upsets located on a memory row, and no-patt(u,i) is the number of states of the i'th pattern, which is calculated using eq. (31).

$$\mathrm{no\_patt}(u,i)=\mathrm{patt\_share}(u,i,B)\times u\tag{31}$$

The block failure probability regarding clustered type MCU is computed as,

$$p_{block-C}(u)=\left(1-\frac{\sum_{s=1}^{S}\mathrm{nfail\_prob}(s)}{\mathrm{tot\_prob}(u)}\right)\times\mathrm{ratio\_multiplicity}(u)\times\mathrm{C\_ratio}\tag{32}$$

According to P_block-WL(u) and P_block-C(u) the block failure probability is computed as,

$$F_{block-RS}(u)=P_{block-WL}(u)+P_{block-C}(u)\tag{33}$$

In this type of memory protection scheme the structure of each block is similar to the blocks of the MECC with interleaving scheme. Thus eq. (18) can be used to compute p_block(u,X).

4.3 Block Based Coding

In this type of protection code, W words of memory are considered as a block. This means that if up to L errors occur in the W × B bits of one block, the MECC can correct these upsets (Fig. 4).

The WL type MCU can cause memory failure if the length of the MCU is greater than L. The probability of such a situation is computed as,

$$p_{block-WL}(u)=\mathrm{ratio\_multiplicity}(u)\times\mathrm{WL\_ratio}\tag{34}$$

The BL type cannot cause a memory failure if its length is less than L. Furthermore, if the length of each block is less than L there will be no failure. Otherwise, if the length of the MCU is greater than a block length, the MCU covers at least one block and there will be a memory failure. If the length of the BL type MCU is less than a block length, two cases arise. First, its length is more than 2L and memory failure is inevitable: this type of MCU affects two adjacent blocks and, because of its length, the number of corrupted cells in at least one of them will be greater than L. Second, the MCU length is between L and 2L. In this case the BL type MCU may disturb two adjacent blocks such that s states have fewer than L corrupted cells in these two blocks. The failure probability for this case can be computed using eq. (35). In this equation Nblocks is the total number of blocks in the memory.

$$p_{block-BL}(u)=\left(1-\frac{(N_{blocks}-1)\times s}{B-(u-1)}\right)\times\mathrm{ratio\_multiplicity}(u)\times\mathrm{BL\_ratio}\tag{35}$$

To deal with clustered type MCU whose multiplicity equals m, we should consider the MCU extending over the memory length. This is done using pCE(r,m), the probability of a clustered MCU extending through r rows in the memory. If the multiplicity of the MCU is greater than L there may be a memory failure. We divide the situation into two cases. First, the length of the MCU (l) is less than a block length. Because the multiplicity is greater than L, there may be non-failing states when the clustered MCU hits two adjacent blocks simultaneously. To derive the probability of this case, two features must be considered for each state. One is the length of the MCU in each block (l1 + l2 = l) and the other is the number of corrupted cells in each block (X1 + X2 = u). For each combination of l1(k) and l2(k), the non-failing states (X1, X2 ≤ L) are derived and the occurrence probability of the k'th combination is computed using eq. (36). Then the total occurrence probability of the K combinations is obtained by summation (eq. (37)).

$$p_{comb}(k)=\sum_{s=1}^{S}p_{RE}(l_{1}(k),X_{1}(s))\times p_{RE}(l_{2}(k),X_{2}(s))\tag{36}$$

$$\mathrm{nf\_p}(u)=\sum_{k=1}^{K}p_{comb}(k)\tag{37}$$

This process is repeated for all states (all combinations of X1 and X2) to deduce tot-p(u), and the failure probability of this case is computed according to eq. (38). The last term is necessary because of the various states of two adjacent blocks in the memory.

$$p(u,l)=\left(1-\frac{\mathrm{nf\_p}(u)}{\mathrm{tot\_p}(u)}\right)\times\frac{N_{blocks}-1}{W-(l-1)}\tag{38}$$

Fig. 13 Pseudo-code to find states of product coding solutions

1 p_sol(i) = 1;
2 for g = LH+1 to LH'
3 {
4   if Xg is non-zero then
5   {
6     ( ) = ( . ) ×
7     _ ( ) = _ ( ) × ( )
8   }
9 }

In the second case the length of the MCU is greater than the length of a block. In this case the MCU disturbs D blocks (D ≥ 2) and
there are D variables (l1, l2, …, lD) to be selected. For each set of variables, we must select all combinations of X1, X2, …, XD. The overall process is similar to that described in the first case and the failure probability of this case is calculated using eq. (39). In this equation Ndist is the number of corrupted blocks in the memory.

$$p(u,l)=\left(1-\frac{\mathrm{nf\_p}(u)}{\mathrm{tot\_p}(u)}\right)\times\frac{N_{blocks}-(N_{dist}-1)}{W-(l-1)}\tag{39}$$

The block failure probability over all l values and the memory failure probability are computed using eqs. (40) and (41), respectively.

$$p_{block-C}(u)=\sum_{l=1}^{W}p(u,l)\times\mathrm{ratio\_multiplicity}(u)\times\mathrm{C\_ratio}\tag{40}$$

$$F_{mem-block-code}(u)=P_{block-WL}(u)+P_{block-BL}(u)+P_{block-C}(u)\tag{41}$$

Note that in this memory protection scheme the whole memory is considered as a single block, so the memory failure probability is computed directly and eq. (4) is not used.

Fig. 14 Comparison of proposed model and models of [49] and [23] (failure probability versus time, ×100 sec; curves: ref 23, ref 49, proposed model)

4.4 Product Coding

As mentioned in the related work section, this coding scheme uses a 2-D protection approach. In a W × B memory block, the row oriented code can correct and detect LH and L′H upsets, respectively. On the other hand, in the column direction another code can correct and detect LV and L′V upsets, respectively (Fig. 5). To use the correction capability of this coding scheme in a block containing u upsets, first the rows and columns which have fewer than LH + 1 and LV + 1 upsets are corrected. Then the rows containing between LH and L′H errors, and the columns containing between LV and L′V errors, are determined. When an error is detected in both a row and a column, the intersection cell contains an erroneous bit, so this bit can be corrected.

For WL type MCU, if LV > 0 then all row errors will be corrected, but if LV equals 0 and the length of the MCU is greater than LH then failure will occur; therefore,

$$P_{PRO-WL}(u)=\mathrm{ratio\_multiplicity}(u)\times\mathrm{WL\_ratio}\tag{42}$$

For BL type MCU, all corrupted rows have one upset, so if LH > 0 then all errors will be corrected. If LH equals 0 there are two cases. First, the length of the MCU is less than L′V and the horizontal and vertical ECCs can correct all upsets. On the other hand, if the length of the MCU is greater than L′V but less than the length of a block, there will be some non-failing states. In each non-failing state the BL type MCU affects two adjacent blocks so that the number of corrupted cells in each block is less than L′V. Suppose that there are s non-failing states; then the block failure probability is computed as,

$$P_{PRO-BL}(u)=\left(1-\frac{s}{W-(l-1)}\right)\times\mathrm{ratio\_multiplicity}(u)\times\mathrm{BL\_ratio}\tag{43}$$

To investigate clustered type MCU, when there are u upsets in the memory, NR(r,u) is used to determine the number of rows with r upsets. If there is a row with more than L′H upsets the memory fails. The probability of this case is computed according to eq. (44). In this equation B is the number of bits in a row of the memory.

$$P_{PRO-C}(u)=\mathrm{C\_ratio}\times\sum_{r=L'_{H}+1}^{B}\frac{NR(r,u)}{W}\tag{44}$$

The rows containing LH ≤ r ≤ L′H upsets may cause a memory failure if the column MECC cannot correct
Table 1 The coding scheme parameters to validate the model

Scheme        W    B    Block length  Number of symbols  Correction capability             Diff  Run time ratio
Int_MECC      256  128  1             1                  ID = 4, L = 1                     3.5%  8.40E-07
RS_based      256  128  1             8                  L = 4                             3.8%  8.90E-06
Block_coding  256  16   8             1                  L = 8                             2.0%  1.53E-05
Product       256  16   8             1                  Lh = 1, Lh′ = 2; Lv = 0, Lv′ = 1  2.8%  5.80E-06
them. The probability of existence of such rows can be computed as,

$$P_{1}(u)=\sum_{r=L_{H}+1}^{L'_{H}}\frac{NR(r,u)}{W}\tag{45}$$

If more than L′V cells of a block column are erroneous, the column MECC cannot correct them and failure will occur. If the number of rows in a block is WB, then columns which have L′V ≤ g′ ≤ WB erroneous cells can cause a block failure. Each erroneous cell may belong to a row containing LH + 1 ≤ g′ ≤ L′H upsets, so to count all block failure states the solutions of eq. (46) are found using rec-func. The no-var and no-up arguments are initialized with L′H − (LH + 1) and g, respectively. The bound on the variables of this equation is set to u, since in each state all erroneous rows of a block have an equal number of upsets. For example, if there are 3 erroneous rows in a block all of them may have 2 upsets.

$$X_{L_{H}+1}+X_{L_{H}+2}+\dots+X_{L'_{H}}=g\tag{46}$$

For each solution the probability of block failure can be computed according to the pseudo-code in Fig. 13. In each solution, for the rows which contain Xg upsets the occurrence probability is calculated using the equation in line 6 of the pseudo-code. The first term on the right hand side of this equation is the probability of existence of a row with g upsets in the memory and the second term is the probability of a specific bit becoming erroneous in that row.

The block failure for all solutions is computed using eq. (47). A solution of eq. (46) may occur in each of the B bits of the block rows, so the probability of each solution (p_sol(i)) is multiplied by B in eq. (47).

$$\mathrm{p\_all}(g)=\sum_{i=1}^{no\_solutions}\mathrm{p\_sol}(i)\times B\tag{47}$$

For all g (L′V ≤ g ≤ WB), this process is repeated and the total failure probability is calculated as the following sum:

$$P_{2}(u)=\sum_{g=L'_{V}+1}^{W_{B}}\mathrm{p\_all}(g)\tag{48}$$

When P2(u) is multiplied by Nblock (the total number of blocks in the memory), the memory failure probability caused by uncorrectable errors in the columns of memory blocks is obtained. When this value is added to P1(u), the total memory failure probability for clustered MCU is deduced (eq. (49)). Finally, the memory failure probability is computed using eq. (50).

$$P_{PRO-C}(u)=\mathrm{C\_ratio}\times\left(P_{1}(u)+N_{block}\times P_{2}(u)\right)\tag{49}$$

$$F_{mem-PRO-code}(u)=P_{PRO-WL}(u)+P_{PRO-BL}(u)+P_{PRO-C}(u)\tag{50}$$

The computation of F(t), R(t) and MTBF will be done using eqs. (6), (7) and (8), respectively.

Fig. 15 MECC-Interleaving scheme accuracy simulation (memory failure probability versus time, ×100 sec; curves: model, simulation)

Fig. 16 RS coding scheme accuracy simulation (memory failure probability versus time, ×100 sec; curves: model, simulation)

Fig. 17 Block coding scheme accuracy simulation (memory failure probability versus time, ×100 sec; curves: model, simulation)

4.5 Hybrid Block Coding

In this type of error correction coding scheme the memory is divided into P blocks, each protected by one of the aforementioned ECC schemes. Because the encoding and decoding mechanisms of each part are independent, we can compute block
failure probability for each part and then sum up the failure probabilities of all parts to calculate the memory failure probability.

Suppose that two types of ECC are used in a memory: one is SEC-Interleaving and the other is RS coding. The memory (W rows and B columns) is divided into two parts, each having W/2 rows and B columns. Using the procedures of Sections 4.1 and 4.2, the failure probability of each part is computed according to eq. (51) and eq. (52).

$$F_{part-int}(X_{P1})=F_{mem-int}\left(X_{P1},\frac{W}{2},B\right)\tag{51}$$

$$F_{part-RS}(X_{P2})=F_{mem-RS}\left(X_{P2},\frac{W}{2},B\right)\tag{52}$$

In these equations each part of the memory is considered as a memory with W/2 rows. If there are X upsets in the memory, then XP1 upsets are in the first part and the remaining X − XP1 upsets are in the second part. The memory failure probability is calculated as follows:

$$F_{mem-hybrid}(X)=\sum_{X_{P1}=0}^{X}\left[F_{part-int}(X_{P1})+F_{part-RS}(X-X_{P1})\right]\tag{53}$$

Fig. 18 Product coding scheme accuracy simulation (memory failure probability versus time, ×100 sec; curves: model, simulation)

Fig. 19 Hybrid coding scheme accuracy simulation (memory failure probability versus time, ×100 sec; curves: model, simulation)

Fig. 20 ID variation effect on failure probability of MECC-interleaved (failure probability versus time, ×1000 sec; curves for ID = 2, 4, 8, 16)

Fig. 21 L variation effect on failure probability of MECC-interleaved (failure probability versus time, ×1000 sec; curves for L = 1, 2, 3, 4)

5 Simulation Results

For the simulations the λ and r parameters are set to 0.01 and 0.5, respectively.

5.1 Comparison to Previous Methods

One of the goals of this study is the derivation of an accurate model for failure probability estimation for the first coding scheme (MECC-Interleaving). As mentioned in the related works section, previous methods do not consider all aspects of MCU behavior for this scheme. In [49] all corrupted cells in a row have been considered as a single cluster, and in [23] the upsets were injected into the rows completely randomly and no clustering trend was assumed. A simulation has been carried out to compare the proposed model with these two previous methods for a memory with length and width of M = 256 and B = 64, respectively. I used SEC (L = 1) for the correction code and set the interleaving distance equal to 8 (ID = 8). Thus each row of memory contains 8 words of 8 bits each and the physical distance between consecutive
bits of a word is 8 memory cells. The simulation result is illustrated in Fig. 14. The failure probability estimation of [49] is much lower than those of the proposed model and the model of [23]. This underestimation can be attributed to the single cluster assumption in that method. Many states which cause block failure are related to patterns in which two or more clusters with different sizes are involved.

On the other hand, the model of [23] overestimates the failure probability. This model, unlike the model of [49], considers all possible patterns that occur when u upsets lie on a row, but the occurrence probabilities of all patterns are assumed to be equal. As mentioned in Section 4.1, patterns have different occurrence probabilities due to their shapes (the number of clusters and the type of clusters which construct the pattern). When the soft errors tend to produce more MCU this overestimation is greater. This is because more clustered patterns have higher occurrence probability in this situation.

Fig. 22 L variation effect on failure probability of RS coding (failure probability versus time, ×1000 sec; curves for L = 1, 2, 3, 4)

Fig. 23 S variation effect on failure probability of RS coding (failure probability versus time, ×1000 sec; curves for S = 8, 16, 32)

Fig. 24 L variation effect on failure probability of block coding (failure probability versus time, ×1000 sec; curves for L = 2, 3, 4)

Fig. 25 Block length variation effect on failure probability of block coding (failure probability versus time, ×1000 sec; curves for Len_Block = 4, 8, 16)

5.2 Model Accuracy

To investigate the accuracy of the proposed model, memory specifications for each scheme have been selected according to Table 1. Figures 15, 16, 17, 18 and 19 illustrate the comparison between our model and simulation for these coding schemes (the simulation procedure has been introduced in Fig. 11). In these figures the memory failure probabilities versus the time of some scrubbing intervals have been plotted. The scrubbing interval is the time interval at the end of which the upsets of the memory are cleared and the memory becomes error free. In all cases there is acceptable matching between our model and simulation. The average differences between model and simulation for all memory schemes are indicated in the seventh column of Table 1. In all cases the average difference is below 3.1%.

The ratios of the run times of the proposed model and the simulation method are reported in the eighth column of Table 1. The proposed model can produce the results much faster than simulation. The speed up of our model is approximately 10^6.

5.3 Parameter Variation

The next experiment shows the impact of the parameters of each coding scheme on the memory failure probability. According to such simulation we can select the proper coding parameters in each protection scheme.

Two parameters of the first scheme (MECC-Interleaving) are the interleaving distance (ID), which indicates the physical distance between two consecutive bits of a word in a memory row, and the error correction capability of the MECC (L). In Fig. 20 L is set to a fixed value and ID takes different values.
According to this figure, a bigger ID value results in a smaller memory failure probability. For a row containing B cells, if we use ID for the interleaving distance then each word (MECC protected) will contain B/ID cells and ID adjacent cells in a row belong to different words. An MCU with a size greater than ID can corrupt two cells of a word, so a bigger value for ID leads to more protection against upsets. If the width of each word is set to a fixed value, then a bigger ID value means a wider memory and this affects the physical shape (aspect ratio) of the memory.

Figure 21 shows the case of fixed ID and variable L. In this case increasing the L parameter leads to more protection in each word, so the memory failure probability decreases. Note that an increment of L results in more parity bits in each word and higher complexity in the encoding and decoding processes.

For the RS protection scheme there are two parameters, the length of symbols (S) and the number of symbols (L) which can be corrected in a frame (row) of the memory. If there are B cells in a memory row, then each row contains B/S symbols. If the correction capability (L) remains fixed, increasing the length of the symbols will decrease the memory failure probability according to Fig. 23.

An RS coding with correction capability of 1 symbol in a row can correct up to S upsets in a specified symbol, so increasing the length of symbols increases the probability of correctable upsets being located in a symbol. On the other hand, if the length of symbols remains fixed but the correction capability increases then the memory failure probability will decrease (Fig. 22). To describe such a situation, suppose that each symbol is a super cell and each row contains B/S super cells. If there are up to L erroneous super cells in a row the RS code can correct them.

Fig. 26 Horizontal coding variation effect on failure probability of product coding (failure probability versus time, ×1000 sec; curves for Lh = 0, Lh′ = 1; Lh = 1, Lh′ = 2; Lh = 2, Lh′ = 3)

Fig. 27 Vertical coding variation effect on failure probability of product coding (failure probability versus time, ×1000 sec; curves for Lv = 0, Lv′ = 1; Lv = 1, Lv′ = 2)

In the third protection scheme (block coding), there are two important parameters: the length of each block (B-Len) and the correction capability (L). According to Fig. 25, increasing the block length leads to higher memory failure probability. When the block length increases the occurrence of upsets in a block will increase, so the failure probability of a block increases for a fixed L.

If the length of blocks remains fixed but the correction capability increases, then the memory failure probability will decrease. For each of the three cases which are illustrated in Fig. 24, the length of blocks is 8 rows of the memory. So the probability of occurrence of u upsets in each block of memory is equal in these cases. Therefore, a memory with bigger L can correct more upsets in a block and reduce the block failure probability.

The fourth scheme (product block coding) has two main options to be tuned: the correction and detection capabilities of row coding (Lh and Lh′) and the correction and detection capabilities of column coding (Lv and Lv′).

In the first experiment the correction capability of the row code (Lh and Lh′) increases while the column code correction capability is fixed. As depicted in Fig. 26 this approach can decrease the memory failure probability drastically. Increasing Lh results in correction of more erroneous rows, so the number of errors which can exist in each column of a block decreases and the block failure probability reduces. In this experiment the parity code (Lv = 0, Lv′ = 1) is used as the column code and the parity (Lh = 0, Lh′ = 1), SEC-DED (Lh = 1, Lh′ = 2) and DEC-TED (Lh = 2, Lh′ = 3) codes are used as row codes.

If the horizontal code keeps fixed correction and detection capabilities (Lh and Lh′), then the block failure probability can be reduced using a more powerful vertical code (Lv and Lv′). In this
Table 2 The coding scheme parameters for comparison of schemes

      IntLv          RS            Block             Product
W     256            256           2048              2048
B     64             64            8                 8
ECC   ID = 8, L = 1  S = 8, L = 1  Len_B = 8, L = 8  Len_B = 8, Lh = 0, Lv = 0, Lh′ = 1, Lv′ = 1
J Electron Test

250000
experiment (Fig. 27) parity code (Lh = 0, Lh’ = 3) is used as the
row code and the parity (Lv = 0, Lv′ = 1) and SEC-DED
200000
(Lv = 1, Lv′ = 2) are used as column codes.

150000

MTBF
5.4 Comparison of Various Protection Schemes
100000
To compare aforementioned protection schemes, I should al-
locate same protection capability per bits for all schemes. To 50000
this end I select protection parameters according to Table 2. In
this Table M (=16,384) cells of memory have been arranged in 0
a W × B array layout. MECCIntLv RS Block Pro
Coding Scheme
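As a quick sanity check on Table 2, the following minimal sketch verifies that every layout holds M = 16,384 cells and that the ratio of correction capability per bit (RCpb) comes to 1/8 for each scheme. The dictionary layout and the names `schemes`, `unit_bits`, `correctable_bits`, `layout_cells` and `rcpb` are illustrative assumptions, not part of the paper's model; the per-unit bit counts are taken from the RCpb reasoning in this section (1 bit per 8-bit word for interleaving, 8 bits per 64-bit frame or block for the others).

```python
from fractions import Fraction

M = 16_384  # total number of memory cells stated for Table 2

# Hypothetical encoding of Table 2. "unit_bits" is the size of the protected
# unit (word, frame, or block) and "correctable_bits" is how many bits the
# text counts as correctable inside that unit.
schemes = {
    "IntLv":   {"W": 256,  "B": 64, "unit_bits": 8,  "correctable_bits": 1},
    "RS":      {"W": 256,  "B": 64, "unit_bits": 64, "correctable_bits": 8},
    "Block":   {"W": 2048, "B": 8,  "unit_bits": 64, "correctable_bits": 8},
    "Product": {"W": 2048, "B": 8,  "unit_bits": 64, "correctable_bits": 8},
}


def layout_cells(params):
    """Number of cells in a W x B array layout."""
    return params["W"] * params["B"]


def rcpb(params):
    """Ratio of correction capability per bit, kept as an exact fraction."""
    return Fraction(params["correctable_bits"], params["unit_bits"])


for name, params in schemes.items():
    # Each line shows: scheme name, total cells, RCpb.
    print(name, layout_cells(params), rcpb(params))
```

Using exact fractions rather than floats keeps the comparison between schemes free of rounding effects, so equal ratios compare as equal.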
In this table the ratio of correction capability per bit in a block (RCpb) is identical for all schemes. For the first scheme, L = 1 in each 8-bit word, so RCpb is 1/8. For RS coding each frame contains 64 bits partitioned into 8 symbols of 8 bits each. If this coding scheme can correct only one symbol, then RCpb is 8/64 or 1/8. In the third scheme, each block is an 8 × 8 array of bits in which the block code can correct up to 8 bits, so RCpb is again 1/8. The product coding scheme can detect 1 erroneous cell in each row and column of a block (8 × 8 bits). So for each row/column this scheme can correct only one bit, and RCpb is 8/64 or 1/8.

The memory failure probabilities for these four protection schemes are illustrated in Fig. 28. The block coding scheme has the lowest failure probability, and the interleaving-MECC, RS and product schemes have increasingly higher failure probabilities, in that order. The main reason for the good performance of block coding is its ability to correct up to 8 erroneous bits distributed randomly in a 64-bit block.

In the other schemes the 8 erroneous bits cannot be distributed arbitrarily in the block. For example, in the RS scheme the 8 erroneous bits must all lie in a single symbol, and in the interleaving-MECC scheme all 8 bits must occur in different interleaved words. The main drawback of the block coding scheme is its more complicated encoder and decoder structure, which increases the cost of protection.

Among the other three schemes the product coding has the worst performance. This is because it imposes the most restrictive condition for correcting 8 erroneous bits in a memory block: every row of the block must contain only one erroneous bit, and the erroneous bits must lie in different columns. According to Fig. 28 this condition is slightly more restrictive than that of the RS scheme, in which the 8 erroneous bits must be in one symbol.

In the interleaving-MECC scheme, the 8-bit correction capability in 64 bits is realized as a 1-bit correction capability in each 8 bits. So this scheme can correct more upset states occurring in a 64-bit block of the memory. In Fig. 29 the MTBF for these schemes is illustrated.

Fig. 28 Comparison among various coding schemes (failure probability versus time, ×1000 sec, for the IntLv, RS, Block and Pro schemes)

Fig. 29 MTBF for various coding schemes (MTBF for the IntLv, RS, Block and Pro schemes)

6 Conclusion

Considering the increased importance of MCU in SRAM memory chips, the development of analytical models has become essential. In this paper an analytical model has been presented to determine the reliability of SRAM memory with various schemes of error correction codes. The memory is divided into blocks and each block has an error correction code to overcome the effects of corrupted bits. In comparison to previous models for analyzing the behavior of MCU using Poisson process and geometric distribution, our model can compute the failure probability of each block of memory more efficiently.

Experiments have been performed to verify the model for different coding schemes. Our model shows less than 3.1% difference from the simulation.

The impact of coding parameters on the memory failure probability has been considered in another experiment. From the results one can select proper values for the effective parameters of the error correction scheme. Finally, an experiment compares the ability of various coding schemes to construct a more reliable SRAM chip.
H. Jahanirad received his Ph.D. degree in electrical engineering from Iran University of Science and Technology (IUST). He is currently an assistant professor in the Department of Electrical Engineering, University of Kurdistan, Sanandaj, Iran. His research interests include VLSI design, fault tolerant systems and digital circuit testing.
