10329903
Abstract—Generalized integrated interleaved (GII) codes can nest BCH sub-codewords to form stronger BCH codewords. They are among the best candidates for error correction in the new storage class memories (SCMs). However, SCMs require short codeword length and low redundancy. In this case, the nested key equation solver (KES), which is a key step in GII decoding, has a small number of iterations. The initialization and/or scalar pre-computation in previous nested KES designs occupies a large area and may take even longer than the iterations themselves. This paper proposes an efficient nested KES design for short GII-BCH codes. The polynomial updating is decomposed into two steps to reduce the critical path without requiring scalar pre-computation. Besides, the KES is reformulated to reduce the number of clock cycles without incurring any area overhead. For an example code over $GF(2^{10})$ that protects 2560 bits with 10% redundancy, the proposed design achieves at least 25% area reduction and 37% reduction on the area-time product averaged over the nested decoding rounds compared to prior efforts.

This work was supported in part by Kioxia Corporation and by the National Science Foundation under Award No. 2011785.

I. INTRODUCTION

The new fast storage class memories (SCMs) may bring paradigm shifts to many systems, such as computer memory architecture, machine learning, and big data analytics. Error-correcting codes with hyper-speed decoding and excellent correction capability are essential to realizing the speed potential of SCMs. Generalized integrated interleaved (GII) codes [1], [2] that generate stronger BCH codewords by nesting short BCH sub-codewords are among the best candidates for SCMs.

GII decoding has two stages [2]. The first is the traditional BCH decoding on individual sub-words. A key equation solver (KES), such as the Berlekamp-Massey (BM) algorithm, takes the syndromes and computes the error locator polynomial. If some sub-words failed the decoding or were miscorrected, the multi-round second-stage nested decoding is activated to correct extra errors. In each round, higher-order sub-word syndromes are derived from those of the nested words. Then the nested KES is carried out to update the error locator polynomial and accordingly correct more errors.

Since the higher-order syndromes are not available in the beginning to initialize the discrepancy polynomial, the conventional reformulated inversionless (ri-)BM architecture [3] cannot be used to reduce the critical path of the nested KES. Re-initializing the discrepancy polynomial as in [4] before the nested KES requires many multiplier-adder trees. The critical path is reduced to two multipliers by incorporating higher-order syndromes iteratively into existing KES results in [5]–[7]. Although it can be further reduced by half using the slow-down and re-timing techniques, the decoding of two sub-words needs to be interleaved. By pre-computing the combined polynomial scalars, the critical path is truly reduced to one multiplier in [8], [9], although 5 clock cycles are needed for the pre-computation.

Since SCMs require short codeword length and low redundancy, the error correction capabilities of the BCH codes involved in the GII codes and their differences are small. This means that the KES of each nested decoding round has a small number of iterations and the involved polynomials are short. In this case, the scalar pre-computations in [7], [9], which target longer codes, require not only more clock cycles but also larger area compared to the nested KES iterations themselves.

This paper proposes an efficient nested KES design for short GII-BCH codes. The polynomial updating with long data path is decomposed into two steps so that the critical path is reduced to one multiplier without any scalar pre-computation. Shareable substructures are identified to reduce the area requirement. Additionally, by reformulating the nested KES algorithm, the order of the error locator and discrepancy polynomial updating is switched so that one more clock cycle is eliminated in the end. Efficient implementation architectures are also developed for the proposed nested KES algorithm. Although the proposed design needs two clock cycles for each nested KES iteration, it does not require expensive scalar pre-computation. For an example code over $GF(2^{10})$ that protects 2560 bits with 10% redundancy considered for SCMs, the proposed architecture achieves at least 25% area reduction and 37% improvement on the area-time product (ATP) averaged over the nested decoding rounds compared to prior work.

II. GII-BCH DECODING AND NESTED KES ALGORITHMS

Let $\mathcal{C}_v \subseteq \cdots \subseteq \mathcal{C}_1 \subset \mathcal{C}_0$ be $v+1$ BCH codes defined over $GF(2^q)$ with error-correction capabilities $t_v \ge \cdots \ge t_1 > t_0$. A GII-BCH $[m,v]$ code with $m$ length-$n$ sub-codewords can be constructed from these BCH codes as [2]:

$$\mathcal{C} \triangleq \Big\{ [c_0(x), c_1(x), \cdots, c_{m-1}(x)] : c_i(x) \in \mathcal{C}_0,\ \tilde{c}_l(x) = \sum_{i=0}^{m-1} \alpha^{il}(x)\, c_i(x) \in \mathcal{C}_{v-l},\ 0 \le l < v \Big\}, \tag{1}$$
where $c_i(x)$ is a sub-codeword, $\tilde{c}_l(x)$ is a nested codeword, $\alpha$ is a primitive element of $GF(2^q)$, and $\alpha^{il}(x)$ is the polynomial form of the standard basis representation of $\alpha^{il}$.

GII-BCH decoding includes two stages. The first is the traditional BCH decoding that corrects $\le t_0$ errors in each received sub-word. $2t_0$ syndromes are computed from each received sub-word, $y(x)$, as $S_j = y(\alpha^{j+1})$ ($0 \le j < 2t_0$). Then a KES, such as the BM algorithm [10], uses the syndromes to iteratively compute the error locator polynomial $\Lambda(x)$. In iteration $r$, a discrepancy coefficient $\delta^{(r)} = \sum \Lambda_i^{(r)} S_{r-i}$ is calculated to update the polynomials. The riBM algorithm utilizes a discrepancy polynomial $\hat{\Delta}^{(r)}(x) = \Lambda(x)S(x)/x^r$, whose constant coefficient, $\hat{\Delta}_0^{(r)}$, equals $\delta^{(r)}$ [3]. $\hat{\Delta}(x)$ is initialized as $S(x)$ and updated in parallel with $\Lambda(x)$ so that the critical path is reduced to one multiplier and one adder.

The second-stage nested decoding can correct more errors. $2t$ higher-order syndromes are needed to correct $t$ extra errors. Let the indices of the sub-words with extra errors be $i_0, \cdots, i_{b-1}$ ($b \le v$). Since the nested codewords are at least $t_1$-error-correcting, $2(t_1 - t_0)$ higher-order syndromes are computed as $\tilde{S}_j^{(l)} = \tilde{y}_l(\alpha^{j+1})$ ($0 \le l < b$, $2t_0 \le j < 2t_1$), where $\tilde{y}_l(x) = \sum_{i=0}^{m-1} \alpha^{il}(x)\, y_i(x)$. From (1), higher-order syndromes for the sub-words can be derived as

$$\left[ S_j^{(i_0)}, S_j^{(i_1)}, \cdots, S_j^{(i_{b-1})} \right]^T = A^{-1} \left[ \tilde{S}_j^{(0)}, \tilde{S}_j^{(1)}, \cdots, \tilde{S}_j^{(b-1)} \right]^T, \tag{2}$$

where $A_{u,w} = \alpha^{i_w u (j+1)}$ ($0 \le u, w < b$). Then the BM algorithm takes the $2(t_1 - t_0)$ higher-order syndromes to correct $\le t_1$ errors in each of the $b$ sub-words. If there are $b' \le v-1$ sub-words with more errors, $2(t_2 - t_1)$ higher-order syndromes are derived in the next round and this process is repeated for up to $v$ rounds.

The KES in the nested decoding cannot continue from the result of the sub-word decoding and use the riBM algorithm to shorten the critical path since the higher-order syndromes are not available in the beginning to initialize $\hat{\Delta}(x)$. In [4], $\hat{\Delta}(x)$ is re-initialized according to all the higher-order syndromes using expensive multiplier-adder trees. Algorithm 1 [7] incorporates two higher-order syndromes, $S_{r+1}$ and $S_{r+2}$, to update $\hat{\Delta}(x)$ in iteration $r$ so that the correct $\delta^{(r+2)}$ is always ready before iteration $r+2$ and the odd iterations are skipped for GII-BCH decoding. In this algorithm, 'even' and 'odd' denote the even and odd coefficients, respectively, of the polynomials. $B(x)$ and $\hat{\Theta}(x)$ are auxiliary polynomials assisting the updating of $\Lambda(x)$ and $\hat{\Delta}(x)$, respectively. A scaled nested KES (SNK) algorithm was also developed in [7] to reduce the area requirement. The critical paths of these algorithms have two multipliers due to the computations in Line 3. Although they are reduced to one multiplier by applying slow-down and re-timing [11], two sub-words are interleaved to increase the efficiency. The critical path is truly shortened to one multiplier in the scaled fast nested KES (SFNK) algorithm [9] by combining and pre-computing the scalars in Line 3, which introduces 5 clock cycles at initialization. Besides, both the SNK and SFNK designs require a number of multipliers for scalar pre-computation.

Algorithm 1: Nested GII-BCH KES Algorithm
Input: $\Lambda_{\mathrm{even}}^{(u)}(x)$, $\Lambda_{\mathrm{odd}}^{(u)}(x)$, $B_{\mathrm{even}}^{(u)}(x)$, $B_{\mathrm{odd}}^{(u)}(x)$, $\hat{\Delta}_{\mathrm{even}}^{(u)}(x)$, $\hat{\Theta}_{\mathrm{even}}^{(u)}(x)$, $\gamma^{(u)}$, $k^{(u)}$ from previous KES; $S_i$ ($u \le i < w$); $S_w = 0$
Initialization:
   $\hat{\Delta}_{\mathrm{even}}^{(u)}(x) = \hat{\Delta}_{\mathrm{even}}^{(u)}(x) + S_u \Lambda_{\mathrm{even}}^{(u)}(x)$
   $\hat{\Theta}_{\mathrm{even}}^{(u)}(x) = \hat{\Theta}_{\mathrm{even}}^{(u)}(x) + S_u B_{\mathrm{even}}^{(u)}(x)$
Iterations: for $r = u, u+2, \cdots, w-2$
1) $\Lambda_{\mathrm{even}}^{(r+2)}(x) = \gamma^{(r)} \Lambda_{\mathrm{even}}^{(r)}(x) + \hat{\Delta}_0^{(r)} x^2 B_{\mathrm{even}}^{(r)}(x)$
2) $\Lambda_{\mathrm{odd}}^{(r+2)}(x) = \gamma^{(r)} \Lambda_{\mathrm{odd}}^{(r)}(x) + \hat{\Delta}_0^{(r)} x^2 B_{\mathrm{odd}}^{(r)}(x)$
3) $\hat{\Delta}_{\mathrm{even}}^{(r+2)}(x) = \gamma^{(r)} \big( \hat{\Delta}_{\mathrm{even}}^{(r)}(x)/x^2 + S_{r+1} \Lambda_{\mathrm{odd}}^{(r)}(x)/x + S_{r+2} \Lambda_{\mathrm{even}}^{(r)}(x) \big) + \hat{\Delta}_0^{(r)} \big( \hat{\Theta}_{\mathrm{even}}^{(r)}(x) + S_{r+1}\, x B_{\mathrm{odd}}^{(r)}(x) + S_{r+2}\, x^2 B_{\mathrm{even}}^{(r)}(x) \big)$
   if ($\hat{\Delta}_0^{(r)} \ne 0$ and $k^{(r)} \ge -1$)
4)    $B_{\mathrm{even}}^{(r+2)}(x) = \Lambda_{\mathrm{even}}^{(r)}(x)$; $B_{\mathrm{odd}}^{(r+2)}(x) = \Lambda_{\mathrm{odd}}^{(r)}(x)$
5)    $\hat{\Theta}_{\mathrm{even}}^{(r+2)}(x) = \hat{\Delta}_{\mathrm{even}}^{(r)}(x)/x^2 + S_{r+1} \Lambda_{\mathrm{odd}}^{(r)}(x)/x + S_{r+2} \Lambda_{\mathrm{even}}^{(r)}(x)$
6)    $\gamma^{(r+2)} = \hat{\Delta}_0^{(r)}$; $k^{(r+2)} = -k^{(r)} - 2$
   else
7)    $B_{\mathrm{even}}^{(r+2)}(x) = x^2 B_{\mathrm{even}}^{(r)}(x)$; $B_{\mathrm{odd}}^{(r+2)}(x) = x^2 B_{\mathrm{odd}}^{(r)}(x)$
8)    $\hat{\Theta}_{\mathrm{even}}^{(r+2)}(x) = \hat{\Theta}_{\mathrm{even}}^{(r)}(x) + S_{r+1}\, x B_{\mathrm{odd}}^{(r)}(x) + S_{r+2}\, x^2 B_{\mathrm{even}}^{(r)}(x)$
9)    $\gamma^{(r+2)} = \gamma^{(r)}$; $k^{(r+2)} = k^{(r)} + 2$

A major application of GII codes is SCMs, which require short codes with low redundancy, such as 2560 data bits protected by 10% redundancy. An example GII-BCH code for these code parameters is a $[4,3]$ code over $GF(2^{10})$ with sub-codeword length $n = 704$ and $[t_0, t_1, t_2, t_3] = [3, 5, 6, 11]$. For such short codes, the 5 clock cycles for scalar pre-computation [9] are more than the number of clock cycles for the nested KES iterations themselves, which is $t_i - t_{i-1}$ for nested decoding round $i$. Also, when $t_v$ is smaller, the scalar pre-computation in [7] and [9] may dominate the overall nested KES area.

III. EFFICIENT KES FOR GII-BCH NESTED DECODING

An alternative approach is proposed in this section to reduce the critical path of the nested KES to one multiplier. Instead of combining and pre-computing the scalars, the computations in Line 3 of Algorithm 1 are broken down and implemented in two clock cycles. Besides, reformulations are carried out to eliminate one clock cycle in the end. As a result, the proposed nested KES for nested decoding round $i$ requires $2(t_i - t_{i-1})$ clock cycles with no extra latency for scalar pre-computation or initialization. For small $t_i - t_{i-1}$ as in short GII codes, the proposed design achieves significant latency reduction compared to previous approaches. Additionally, multipliers are shared among the computations and no complicated scalar pre-calculation is needed. Hence, the proposed design also requires much smaller area than prior architectures.

Each term in the parentheses in Line 3 of Algorithm 1 has up to one scalar, and the second scalar is outside of the parentheses. Denote the sums in the first and second parentheses by $\hat{\Delta}_{\mathrm{even}}^{\prime(r)}(x)/x^2$ and $\hat{\Theta}_{\mathrm{even}}^{\prime(r)}(x)$, respectively. To reduce the critical path to one multiplier, they are computed in the first clock cycle and then multiplied with $\gamma^{(r)}$ and $\hat{\Delta}_0^{(r)}$ in the second clock cycle. They are also used to update $\hat{\Theta}_{\mathrm{even}}^{(r+2)}(x)$ as in Lines 5 and 8 of Algorithm 1. The coefficients of the same degree in these polynomials can be calculated using four multipliers. In the second clock cycle, each coefficient
[Figure: array of processing elements (PE1, PE0) with a CTRL unit; caption not recovered]

Algorithm 2: Nested GII-BCH KES w. Switched Updating
Input: $\Lambda_{\mathrm{even}}^{(u)}(x)$, $\Lambda_{\mathrm{odd}}^{(u)}(x)$, $B_{\mathrm{even}}^{(u)}(x)$, $B_{\mathrm{odd}}^{(u)}(x)$, $\bar{\Delta}_{\mathrm{even}}^{(u)}(x)$, $\bar{\Theta}_{\mathrm{even}}^{(u)}(x)$, $\gamma^{(u)}$, $k^{(u)}$ from previous KES;
   syndromes $S_i$ ($u-1 \le i \le w-2$)
   set $S_{u-1} = 0$ for the first nested decoding round
Iterations: for $r = u, u+2, \cdots, w-2$
1) $\hat{\Delta}_{\mathrm{even}}^{(r)}(x) = \bar{\Delta}_{\mathrm{even}}^{(r)}(x) + S_{r-1} \Lambda_{\mathrm{odd}}^{(r)}(x)/x + S_r \Lambda_{\mathrm{even}}^{(r)}(x)$
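The two-clock-cycle evaluation of Line 3 of Algorithm 1 described in Section III can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the authors' architecture: the field polynomial $x^{10}+x^3+1$, the function names, and the convention that all coefficient lists are already aligned by degree (so the $x$ shifts are treated as wiring) are choices made here and do not come from the text. The reference function distributes the outer scalars over every term and is used only to check that the two-step schedule gives the same result.

```python
Q = 10               # field is GF(2^10), as in the example code of the text
FIELD_POLY = 0x409   # x^10 + x^3 + 1: an assumed choice, not given in the source


def gf_mul(a, b):
    """Multiply two GF(2^Q) elements held as integers in polynomial basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> Q:          # degree reached Q: reduce by the field polynomial
            a ^= FIELD_POLY
    return result


def cycle1(s1, s2, d, lam_odd, lam_even, theta, b_odd, b_even):
    """First clock cycle: the two parenthesized sums of Line 3.

    s1 and s2 stand for S_{r+1} and S_{r+2}. Every list holds coefficients
    that are assumed to be aligned by degree already. Four multiplications
    are used per coefficient position.
    """
    d_sum = [dc ^ gf_mul(s1, lo) ^ gf_mul(s2, le)
             for dc, lo, le in zip(d, lam_odd, lam_even)]
    t_sum = [tc ^ gf_mul(s1, bo) ^ gf_mul(s2, be)
             for tc, bo, be in zip(theta, b_odd, b_even)]
    return d_sum, t_sum


def cycle2(gamma, delta0, k, d_sum, t_sum):
    """Second clock cycle: apply the outer scalars and select the next Theta."""
    delta_next = [gf_mul(gamma, p) ^ gf_mul(delta0, t)
                  for p, t in zip(d_sum, t_sum)]
    # Lines 5 and 8 of Algorithm 1 reuse the sums from the first cycle.
    theta_next = d_sum if (delta0 != 0 and k >= -1) else t_sum
    return delta_next, theta_next


def line3_distributed(gamma, delta0, s1, s2,
                      d, lam_odd, lam_even, theta, b_odd, b_even):
    """Reference form of Line 3 with the scalars multiplied out term by term."""
    gs1, gs2 = gf_mul(gamma, s1), gf_mul(gamma, s2)
    ds1, ds2 = gf_mul(delta0, s1), gf_mul(delta0, s2)
    return [gf_mul(gamma, dc) ^ gf_mul(gs1, lo) ^ gf_mul(gs2, le)
            ^ gf_mul(delta0, tc) ^ gf_mul(ds1, bo) ^ gf_mul(ds2, be)
            for dc, lo, le, tc, bo, be
            in zip(d, lam_odd, lam_even, theta, b_odd, b_even)]
```

Calling `cycle1` and then `cycle2` on the same inputs should return the same discrepancy coefficients as `line3_distributed`, since the two forms differ only in where the multiplications are placed.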
| Design | Mult. | Add | Reg. | Mux | Mux with const. input | Inv. | Total # XORs | Crit. path (# gates) | # clks/sub-word of KES in nested round 1, 2, 3 | Normalized ATP in nested round 1, 2, 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| re-init. [4] | 86 | 58 | 36 | 28 | 0 | 0 | 16904 | 10 | 3, 2, 6 ($1+(t_i - t_{i-1})$) | 1.60, 2.13, 1.28 |
| SNK [7] | 46 | 42 | 98 | 18 | 39 | 0 | 11739 | 9 | 5, 3, 11 ($1+2(t_i - t_{i-1})$)∗ | 1.67, 2.00, 1.47 |
| SFNK [9] | 78 | 57 | 85 | 37 | 40 | 2 | 18042 | 10 | 7, 6, 10 ($5+(t_i - t_{i-1})$) | 3.98, 6.83, 2.28 |
| proposed | 38 | 43 | 48 | 28 | 9 | 0 | 8807 | 9 | 4, 2, 10 ($2(t_i - t_{i-1})$) | 1.00, 1.00, 1.00 |
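The clock-cycle and ATP columns of the comparison above can be checked with a short calculation for the example code with $[t_0, t_1, t_2, t_3] = [3, 5, 6, 11]$. The cycle formulas are the ones listed in the table. The ATP model used here (total XOR count times critical-path gate count times clock cycles, normalized to the proposed design in each round) is an assumption made for this sketch; the text does not spell it out, although this model happens to reproduce the tabulated values after rounding. The dictionary and variable names are illustrative.

```python
# Error-correction capabilities of the example [4,3] GII-BCH code.
t = [3, 5, 6, 11]
diffs = [t[i] - t[i - 1] for i in range(1, len(t))]  # t_i - t_{i-1} per round

# Per design: (total # XORs, critical path in gates, cycles as a function of t_i - t_{i-1}).
designs = {
    "re-init. [4]": (16904, 10, lambda d: 1 + d),
    "SNK [7]":      (11739, 9,  lambda d: 1 + 2 * d),
    "SFNK [9]":     (18042, 10, lambda d: 5 + d),
    "proposed":     (8807,  9,  lambda d: 2 * d),
}

# Clock cycles per sub-word in nested rounds 1, 2, 3.
clks = {name: [f(d) for d in diffs] for name, (_, _, f) in designs.items()}

# Assumed ATP model: area (XORs) x critical path (gates) x clock cycles.
atp = {name: [xors * gates * c for c in clks[name]]
       for name, (xors, gates, _) in designs.items()}

# Normalize each round to the proposed design.
normalized_atp = {
    name: [round(a / ref, 2) for a, ref in zip(atp[name], atp["proposed"])]
    for name in designs
}

for name in designs:
    print(name, clks[name], normalized_atp[name])
```

With these inputs the proposed design needs 4, 2, and 10 cycles in the three rounds, and the 5-cycle pre-computation of SFNK exceeds its own iteration count of 2 and 1 in the first two rounds, which is the effect the text attributes to short codes.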