This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS

Design of High-Performance and Area-Efficient Decoder for 5G LDPC Codes

Hangxuan Cui, Fakhreddine Ghaffari, Member, IEEE, Khoa Le, Member, IEEE, David Declercq, Senior Member, IEEE, Jun Lin, Senior Member, IEEE, and Zhongfeng Wang, Fellow, IEEE

Abstract— Low-density parity-check (LDPC) codes, a very promising class of error-correction codes, have been adopted as the channel coding scheme in the fifth-generation (5G) new radio. However, it is very challenging to design a high-performance decoder for 5G LDPC codes because their numerous degree-1 variable-nodes are very prone to errors. In this article, the problem is solved by developing a low-complexity check-node update function that greatly improves the reliability of check-to-variable messages. By further incorporating the proposed column degree adaptation strategy, our decoder offers a 0.4dB performance gain over existing ones. In addition, this article presents an efficient 5G LDPC decoder architecture. Benefiting from the specific structure of 5G LDPC codes, layer merging, a split storage method, and a selective-shift structure are introduced to significantly reduce decoding delay and area consumption. Implementation results on 90-nm CMOS technology demonstrate that the proposed decoder architecture improves the throughput-to-area ratio by up to 173.3% compared to the conventional design.

Index Terms— Low-density parity-check codes, 5G LDPC decoder, high-performance, VLSI implementation.

Manuscript received July 28, 2020; revised October 20, 2020; accepted November 12, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61604068, in part by the Fundamental Research Funds for the Central Universities under Grant 021014380065, in part by the Key Research Plan of Jiangsu Province of China under Grant BE2019003-4, and in part by the French ANR under Grant ANR-15-CE25-0006-01. This article was recommended by Associate Editor S. Gupta. (Corresponding author: Zhongfeng Wang.)
Hangxuan Cui, Jun Lin, and Zhongfeng Wang are with the School of Electronic Science and Engineering, Nanjing University, Nanjing 210008, China (e-mail: [email protected]; [email protected]; [email protected]).
Fakhreddine Ghaffari, Khoa Le, and David Declercq are with ETIS UMR 8051, CY Cergy Paris Université, ENSEA, CNRS, F-95000 Cergy, France (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2020.3038887.
Digital Object Identifier 10.1109/TCSI.2020.3038887
1549-8328 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIVERSITE DE CERGY PONTOISE. Downloaded on December 03,2020 at 17:12:01 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

LOW-DENSITY parity-check (LDPC) codes [1] have attracted considerable attention over the past several decades because of their remarkable error-correction performance and inherent parallelism for hardware implementation. LDPC codes have also been adopted in several industrial standards, including IEEE 802.11 [2], the second generation satellite digital video broadcast (DVB-S2) [3], and advanced television system committee (ATSC) [4]. Recently, LDPC codes have been chosen as the 5G new radio (NR) channel coding scheme in the enhanced mobile broadband (eMBB) scenario [5]. LDPC codes can perform close to the Shannon limit when paired with the belief propagation (BP) decoding algorithm [6]. However, the BP algorithm involves complex non-linear functions in check-node (CN) processing, leading to large implementation complexity. As an alternative, the min-sum (MS) algorithm [7] was proposed and became the primary solution in practical applications. By approximating the non-linear functions with simple summation and comparison operations, the MS algorithm achieves a significant complexity reduction at the cost of an obvious performance loss. By introducing a correction factor into decoding, the normalized MS (NMS) and offset MS (OMS) algorithms offer a better balance between decoding complexity and performance [8].

This article targets the design of an area-efficient and high-performance 5G LDPC decoder. In general, 5G LDPC codes are built from a concatenation of a high-rate LDPC code and a low-density generator matrix (LDGM) code [9]. Since the variable-nodes (VNs) in the LDGM part are degree-1 VNs, which receive only one check-to-variable (CTV) message in each iteration, they are very sensitive to the reliability of the received CTV messages, and so to the choice of the correction factor. Therefore, in fixed-point implementations with few quantization bits, where the precision of the correction factor is limited, the OMS decoder suffers from severe performance degradation [10].

Many algorithms have been proposed in recent years [10]–[14] to improve the error-correction performance of 5G LDPC codes. By taking into account the approximate-min* algorithm [15], the adjusted MS and generalized approximate-min* algorithms were proposed in [11] and [12], respectively. However, like BP decoding, they suffer from relatively large implementation complexity due to the non-linear functions involved. In [13], the authors proposed a hybrid decoding algorithm in which the non-linear functions in the BP algorithm are simplified using a linear approximation. In [14], both offset and normalized factors are introduced into decoding for better calculation precision, and their values, which vary across iterations, are optimized by machine learning. Despite the performance improvement, the main problem with these two methods is the large number of parameters, which makes the algorithms impractical in applications. Moreover, since all of the above algorithms are designed based on floating-point decoding, their performance cannot be guaranteed after being quantized.
Recently, the adapted MS (AMS) decoder [10], which targets fixed-point decoding, was proposed. In 5G LDPC codes, the CNs connected to the degree-1 VNs are called extension checks and the others are referred to as core checks. Considering that the degree-1 VNs are more likely to be erroneous when an imprecise offset factor is adopted, the AMS decoder applies the offset factor only to core checks. Consequently, with few quantization bits, the AMS decoder offers better performance than the MS and OMS decoders on 5G LDPC codes.

To further improve the performance of 5G LDPC decoders, this article introduces an improved AMS (IAMS) algorithm. Starting from reducing the error probability of degree-1 VNs, a modified CN-update function is designed which considerably improves the reliability of CTV messages while maintaining low complexity. Moreover, considering that 5G LDPC codes are extremely irregular, a column degree adaptation strategy is proposed to manage the influence of the high-degree VNs on the decoding process. Simulation results on several 5G LDPC codes with different code rates and code lengths demonstrate that the proposed IAMS algorithm offers an obvious performance improvement compared to existing ones, especially for codes with low to moderate code rates.

The implementation of LDPC decoders has been fully investigated [16]–[20]. In [17], the authors introduced a fully-parallel bit-parallel architecture with detailed optimizations for high-throughput applications. Since the complexity of the fully-parallel decoder is relatively high, partially-parallel schedules have become the most popular. One example is the layered schedule, which uses the up-to-date information from the current iteration, thereby doubling the speed of decoding convergence. When quasi-cyclic LDPC (QC-LDPC) codes are adopted, the CNs in the same block row of the base matrix are usually grouped into a single layer. In [18], an efficient reordered layered schedule was proposed to minimize the memory consumption. Moreover, to reduce the required number of iterations, the authors of [21] introduced a modified layered schedule for 5G LDPC codes, in which the processing order of layers is not sequential, but depends on the number of punctured edges and check-node degrees.

Though many works focus on the implementation of LDPC decoders, to the best of our knowledge, there is no prior work presenting the design of a whole 5G LDPC decoder. In this article, we introduce for the first time an efficient 5G LDPC decoder architecture. In the proposed architecture, first, a layer merging technique benefiting from the orthogonal structure of 5G LDPC codes is proposed, which reduces the number of clock cycles by 28.3%. By further incorporating the proposed split storage method, the CTV memory consumption is reduced by 39.6%. To alleviate the interconnection network overhead, we present the selective-shift structure and the message reordering methods, leading to an obvious area and latency reduction. ASIC implementation results on 90-nm CMOS technology show that the proposed IAMS decoder improves the throughput-to-area ratio (TAR) by 173.3% compared to the conventional design.

The remainder of this article is organized as follows. Section II gives some notations, as well as the preliminaries for 5G LDPC codes and fixed-point LDPC decodings. The proposed IAMS decoding algorithm is introduced in Section III. Numerical results and related discussions are provided in Section IV. Section V describes the proposed hardware architecture and Section VI presents the implementation results. Finally, Section VII concludes the paper.

II. NOTATIONS AND PRELIMINARIES

A. Notations

An LDPC code is specified by a sparse $M \times N$ parity check matrix $H$, where $M$ denotes the number of parity checks and $N$ represents the number of code bits. The code rate is $R = K/N = (N - M)/N$. LDPC codes can also be defined by bipartite Tanner graphs [22] which comprise a set of VNs and a set of CNs, corresponding to code bits and parity checks, respectively. Let $\mathcal{N}(m)$ denote the set of VNs that participate in the $m$th check. Similarly, the neighbor set of the $n$th VN is denoted by $\mathcal{M}(n)$. The number of neighbors connected to a VN is called the column degree and the number connected to a CN is called the row degree, denoted by $d_v$ and $d_c$, respectively. An LDPC code is regular if the degrees of each set of nodes are the same, while the degrees of an irregular LDPC code vary according to some degree distributions. QC-LDPC codes have a structured $H$ matrix that can be generated from an $M_b \times N_b$ base matrix $H_B$. Each nonzero entry of $H_B$ is expanded by circularly shifting a $Z \times Z$ identity matrix and each zero entry represents a $Z \times Z$ all-zero matrix, where $Z$ denotes the expansion factor.

B. 5G LDPC Codes

To support a broad range of code lengths and rates, two rate-compatible base graphs, BG1 and BG2, are designed for 5G LDPC codes. These two base graphs have a similar structure, while BG1 is targeted at larger information lengths ($500 \le K \le 8448$) and higher rates ($1/3 \le R \le 8/9$) and BG2 is targeted at smaller information lengths ($40 \le K \le 2560$) and lower rates ($1/5 \le R \le 2/3$). Fig. 1 shows the structure of base matrix BG2, which has 42 rows and 52 columns. The sub-matrix $H_{core}$ is called the core part and the other three sub-matrices form the extension part. In both BG1 and BG2 matrices, $H_{core}$ consists of the first four rows of the base matrix and adopts a dual diagonal structure for parity bits to simplify the encoding process. The extension part has an equal number of VNs and CNs, and all extension VNs are degree-1 nodes. $O$ denotes an all-zero matrix and $I$ denotes an identity matrix. The core checks usually have a higher row degree than the extension checks. The leftmost two columns of the base matrix correspond to the punctured bits, also known as the state bits. One important feature of 5G LDPC codes is that they are extremely irregular, which means there exists a significant difference in row degrees and column degrees. For instance, in base matrix BG2, $d_v$ varies from 1 to 23 and $d_c$ varies from 3 to 10.

Fig. 1. The structure of base matrix BG2.

C. Fixed-Point LDPC Decodings

Assume an LDPC codeword $c = \{c_0, c_1, \ldots, c_{N-1}\}$ is transmitted over the additive white Gaussian noise (AWGN) channel using binary phase shift keying (BPSK) modulation. The received vector $y$ is

$$y_i = x_i + n_i, \quad n_i \sim \mathcal{N}(0, \sigma^2), \quad i = 0, 1, \ldots, N-1, \tag{1}$$

where $x_i = 1 - 2c_i$ and $n_i$ is a Gaussian random variable with zero mean and variance $\sigma^2$. In fixed-point implementations, the quantized version of $y$, denoted by $\gamma$, is typically input to the decoders. Let the input alphabet consist of integers; then $\gamma_i = [\mu \cdot y_i]$, where $\mu > 0$ is a constant referred to as the gain factor and $[x]$ returns the closest integer to $x$ that belongs to the input alphabet. Assuming the input messages are expressed by $q$ bits, the input alphabet is $\{-Q, \ldots, -1, 0, 1, \ldots, +Q\}$ where $Q = 2^{q-1} - 1$. Actually, $\mu = 2^q$ means that all channel LLR values are shifted $q$ bits to the left and then rounded to integers, which is the same as the usual quantization method when $q$ fraction bits are preserved. Moreover, the introduced quantization method is more flexible because $\mu$ can be optimized to values other than $2^q$ for better decoding performance [23].

Let $\alpha_{m,n}$ and $\beta_{n,m}$ denote the messages passed from the $m$th CN to the $n$th VN and from the $n$th VN to the $m$th CN, respectively. $\tilde{\gamma}$ denotes the a-posteriori-probability (APP) vector. The exchanged messages $\alpha_{m,n}$ and $\beta_{n,m}$ are quantized to $q$ bits. Since the APP messages are generally larger than the input and exchanged messages, $\tilde{\gamma}_n$ is quantized to $\tilde{q}$ bits where $\tilde{q} > q$ to avoid clipping. $\mathcal{A} = \{-\tilde{Q}, \ldots, -1, 0, 1, \ldots, +\tilde{Q}\}$ denotes the alphabet for $\tilde{\gamma}$ where $\tilde{Q} = 2^{\tilde{q}-1} - 1$.

The decoding process of the layered schedule is described as follows.

1) Initialization:
Assign the values of the input vector $\gamma$ to the APP vector $\tilde{\gamma}$. Moreover, all CTV messages $\alpha_{m,n}$ are initialized with zeros.

2) Iterative Process:
In the layered schedule, each iteration comprises several decoding layers. The decoding is executed layer by layer and each layer has three steps.

Step 1 (VN update): In the $t$th iteration, the variable-to-check (VTC) message $\beta_{n,m}^{(t)}$ is calculated by

$$\beta_{n,m}^{(t)} = [\tilde{\beta}_{n,m}^{(t)}] = [\tilde{\gamma}_n^{(t)} - \alpha_{m,n}^{(t-1)}]. \tag{2}$$

Step 2 (CN update): In the BP decoding, the CTV message is given by

$$\alpha_{m,n}^{(t)} = \tau_{m,n}^{(t)} \cdot \phi\Big( \sum_{n' \in \mathcal{N}(m) \setminus n} \phi(|\beta_{n',m}^{(t)}|) \Big), \tag{3}$$

where $\tau_{m,n}^{(t)} = \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sgn}(\beta_{n',m}^{(t)})$ and $\phi(x) = -\log[\tanh(x/2)]$. Considering that $\phi^{-1}(x) = \phi(x)$ and that the magnitude of $\alpha_{m,n}^{(t)}$ is dominated by the minimum input $|\beta_{n',m}^{(t)}|$ [24], the MS algorithm simplifies (3) according to

$$\alpha_{m,n}^{(t)} \approx \tau_{m,n}^{(t)} \cdot \phi\Big( \phi\big( \min_{n' \in \mathcal{N}(m) \setminus n} |\beta_{n',m}^{(t)}| \big) \Big) = \tau_{m,n}^{(t)} \cdot \min_{n' \in \mathcal{N}(m) \setminus n} |\beta_{n',m}^{(t)}|. \tag{4}$$

Since $\phi(|\beta_{n',m}^{(t)}|) > 0$, we have $\phi\big( \min_{n' \in \mathcal{N}(m) \setminus n} |\beta_{n',m}^{(t)}| \big) < \sum_{n' \in \mathcal{N}(m) \setminus n} \phi(|\beta_{n',m}^{(t)}|)$. Moreover, because $\phi(x)$ is a decreasing function, it can be deduced from (3) and (4) that the MS decoding overestimates the magnitudes of CTV messages compared to the BP decoding, leading to performance degradation [11]. To alleviate the overestimation, an offset factor is included in the OMS decoding, as shown in (5).

$$\alpha_{m,n}^{(t)} = \tau_{m,n}^{(t)} \cdot \max\Big( \min_{n' \in \mathcal{N}(m) \setminus n} |\beta_{n',m}^{(t)}| - \lambda, \; 0 \Big), \tag{5}$$

where $\lambda$ denotes the offset factor. In fixed-point implementations, $\lambda$ is typically fixed to 1, which is the least significant bit (LSB) under the integer representation [20].

The only difference between the AMS and OMS algorithms is the CN processing procedure. To reduce the error probability of degree-1 VNs, the AMS decoder processes the core checks and extension checks with different offset factors [10], as shown in (6). For the core checks, $\lambda$ is set to 1 to obtain the gain from the offset principle, while $\lambda$ is set to 0 for the extension checks to reduce the offset effect on these VNs.

$$\alpha_{m,n}^{(t)} = \begin{cases} \text{applying (5)}, & \text{for the core checks} \\ \text{applying (4)}, & \text{for the extension checks.} \end{cases} \tag{6}$$

Step 3 (APP update): In order to achieve better precision, $\tilde{\beta}_{n,m}^{(t)}$ in (2) is used to update APP values according to

$$\tilde{\gamma}_n^{(t)} = [\alpha_{m,n}^{(t)} + \tilde{\beta}_{n,m}^{(t)}]_{\mathcal{A}}, \tag{7}$$

where function $[\cdot]_{\mathcal{A}}$ is applied to ensure the updated APP values are taken from alphabet $\mathcal{A}$.

After all layers have been processed, the tentative codeword $\hat{c}^{(t)}$ can be obtained by applying the hard decision to vector $\tilde{\gamma}^{(t)}$ according to

$$\hat{c}_n^{(t)} = HD(\tilde{\gamma}_n^{(t)}) = \begin{cases} 0, & \tilde{\gamma}_n^{(t)} \ge 0 \\ 1, & \tilde{\gamma}_n^{(t)} < 0. \end{cases}$$

The decoding stops when all parity check equations are satisfied or the maximum number of iterations $It_{max}$ is reached.
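As a rough numerical illustration of the CN-update rules in (3) to (6), the following Python sketch computes the CTV message toward one VN from its extrinsic VTC messages. The function names are hypothetical and chosen only for this example; the BP variant works in floating point and assumes nonzero inputs, since $\phi(0)$ is undefined.

```python
import math

def phi(x):
    """phi(x) = -log(tanh(x/2)); only defined for x > 0."""
    return -math.log(math.tanh(x / 2.0))

def sign_product(values):
    """Product of signs; zero is treated as positive (an assumption of this sketch)."""
    s = 1
    for v in values:
        if v < 0:
            s = -s
    return s

def cn_update_bp(extrinsic):
    """Eq. (3): BP check-node output from the extrinsic VTC messages."""
    return sign_product(extrinsic) * phi(sum(phi(abs(b)) for b in extrinsic))

def cn_update_ms(extrinsic):
    """Eq. (4): min-sum keeps only the smallest extrinsic magnitude."""
    return sign_product(extrinsic) * min(abs(b) for b in extrinsic)

def cn_update_oms(extrinsic, lam=1):
    """Eq. (5): offset min-sum subtracts lam and clamps the magnitude at zero."""
    return sign_product(extrinsic) * max(min(abs(b) for b in extrinsic) - lam, 0)

def cn_update_ams(extrinsic, is_core_check):
    """Eq. (6): offset on core checks, plain min-sum on extension checks."""
    return cn_update_oms(extrinsic) if is_core_check else cn_update_ms(extrinsic)

if __name__ == "__main__":
    msgs = [3, -2, 5]  # extrinsic VTC messages seen by one outgoing edge
    print("BP :", cn_update_bp(msgs))
    print("MS :", cn_update_ms(msgs))
    print("OMS:", cn_update_oms(msgs))
```

For the sample inputs, the MS magnitude (2) exceeds the BP magnitude, which is the overestimation that the offset in (5) is meant to counteract.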

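The layered schedule described in Section II-C (initialization, Steps 1 to 3, hard decision, and the stopping rule) can be sketched on a toy code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the plain MS rule (4), treats each row of a small dense parity-check matrix as one layer, and the names `saturate` and `layered_ms_decode` are hypothetical.

```python
def saturate(value, limit):
    """Clip an integer to the symmetric alphabet {-limit, ..., +limit}."""
    return max(-limit, min(limit, value))

def layered_ms_decode(H, gamma, q=4, q_app=6, max_iter=15):
    """Layered fixed-point min-sum decoding; gamma holds quantized channel values."""
    Q = 2 ** (q - 1) - 1           # range of exchanged messages
    Q_app = 2 ** (q_app - 1) - 1   # range of APP values
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]
    app = [saturate(g, Q_app) for g in gamma]   # initialization of the APP vector
    alpha = [[0] * N for _ in range(M)]         # CTV messages start at zero
    c_hat = [0 if v >= 0 else 1 for v in app]
    for _ in range(max_iter):
        for m in range(M):                      # one row per layer (toy assumption)
            nbrs = rows[m]
            # Step 1 (VN update), eq. (2): remove the old CTV message.
            beta_tilde = {n: app[n] - alpha[m][n] for n in nbrs}
            beta = {n: saturate(beta_tilde[n], Q) for n in nbrs}
            # Step 2 (CN update), eq. (4): extrinsic sign and minimum magnitude.
            for n in nbrs:
                others = [beta[k] for k in nbrs if k != n]
                sign = -1 if sum(b < 0 for b in others) % 2 else 1
                alpha[m][n] = sign * min(abs(b) for b in others)
            # Step 3 (APP update), eq. (7): use the unsaturated beta_tilde.
            for n in nbrs:
                app[n] = saturate(alpha[m][n] + beta_tilde[n], Q_app)
        c_hat = [0 if v >= 0 else 1 for v in app]   # hard decision
        if all(sum(c_hat[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break                                   # all parity checks satisfied
    return c_hat

if __name__ == "__main__":
    # Toy (7,4) Hamming parity-check matrix, used only to exercise the schedule.
    H = [[1, 1, 0, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 1]]
    print(layered_ms_decode(H, [-1, 3, 3, 3, 3, 3, 3]))
```

Because each layer reads APP values already refreshed by the preceding layers, the weak erroneous input in the example is corrected within the first iteration.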

III. THE PROPOSED IAMS DECODING ALGORITHM

A. The Modified CN-Update Function

As mentioned above, all extension VNs in 5G LDPC codes have degree 1 and each is connected to a unique CN. Consequently, these VNs receive only one CTV message in each iteration, so they are sensitive to the reliability of CTV messages and the choice of offset factor. In fixed-point implementations, the offset factor is generally not optimal, so the reliability of CTV messages is limited by the limited bit representation of messages, which is the main reason for the severe performance degradation of the fixed-point OMS decoder. To improve the performance of 5G LDPC decoders, we propose in this subsection a new CN-update function that improves the reliability of CTV messages.

Denote the first and second minimum magnitudes of the input VTC messages in a CN by $min_1$ and $min_2$, respectively. To maintain low computation complexity, we use only these two values, which are already available in a conventional MS decoder, to design the new CN-update function. Let $idx_1$ and $idx_2$ be the indices of the VNs corresponding to $min_1$ and $min_2$, respectively. $\mathcal{I}(m)$ is defined as $\mathcal{I}(m) = \{idx_1, idx_2\}$ and $\bar{\mathcal{I}}(m) = \mathcal{N}(m) \setminus \mathcal{I}(m)$. Observing (3) we notice that, for $n \in \bar{\mathcal{I}}(m)$, both $min_1$ and $min_2$ are extrinsic VTC messages that are used to calculate the CTV message $\alpha_{m,n}^{(t)}$. Since the magnitude of $\alpha_{m,n}^{(t)}$ is dominated by the minimum magnitude of the extrinsic VTC messages, sufficient precision can be achieved if the first and second minimum magnitudes of the extrinsic VTC messages are both employed to approximate the CN-update function of the BP algorithm. Therefore, for $n \in \bar{\mathcal{I}}(m)$, we approximate the CN-update function shown in (3) by

$$\alpha_{m,n}^{(t)} = \tau_{m,n}^{(t)} \cdot \phi\big( \phi(min_1) + \phi(min_2) \big). \tag{8}$$

It can be seen that the overestimation of the CTV messages appearing in the MS algorithm is alleviated by using (8), since more extrinsic VTC messages are included. Based on the approximate-min* algorithm proposed in [15], (8) can also be written as

$$\alpha_{m,n}^{(t)} = \tau_{m,n}^{(t)} \cdot (min_1 \boxplus min_2), \tag{9}$$

where $x \boxplus y = \min(|x|, |y|) - \log\frac{1 + e^{-|x-y|}}{1 + e^{-|x+y|}}$. In fact, (9) can be viewed as the MS decoding with an offset factor which is inherently optimized by the BP decoding. For simplicity, let $a$ represent $min_1$ so that $min_2 = a + \Delta$. Therefore, the offset factor $\lambda$ in (9) is

$$\lambda = \log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}}.$$

Since $min_1$ and $min_2$ are both non-negative integers in fixed-point implementations, $a$ and $\Delta$ are also non-negative integers. Therefore, we can conclude that $\lambda \ge 0$, so the quantized version of $\lambda$ is

$$\lambda = \left\lfloor \log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} + \frac{1}{2} \right\rfloor.$$

Property: The offset factor $\lambda$ will be 1 only when $min_1$ and $min_2$ are both strictly positive and equal. Otherwise, $\lambda = 0$.

Proof: To prove this property, we consider three cases.

Case 1: $min_1 = 0$. In this case, $a = 0$. Then,

$$\lambda = \left\lfloor \log\frac{1 + e^{-\Delta}}{1 + e^{-\Delta}} + \frac{1}{2} \right\rfloor = 0.$$

Case 2: $min_1 > 0$ and $min_1 \neq min_2$. In this case, $a \ge 1$ and $\Delta \ge 1$. Therefore,

$$\log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} \le \log\frac{1 + e^{-1}}{1 + e^{-(2a+1)}} < \log(1 + e^{-1})$$
$$\Rightarrow \log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} + \frac{1}{2} < \log(1 + e^{-1}) + \frac{1}{2} < 0.8133$$
$$\Rightarrow \lambda = 0.$$

Case 3: $min_1 > 0$ and $min_1 = min_2$. In this case, $a \ge 1$ and $\Delta = 0$. Substituting $\Delta = 0$, we have

$$\log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} = \log\frac{2}{1 + e^{-2a}} < \log 2$$
$$\Rightarrow \log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} + \frac{1}{2} < 1.1932$$
$$\Rightarrow \lambda < 2.$$

Also,

$$\log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} = \log\frac{2}{1 + e^{-2a}} \ge \log\frac{2}{1 + e^{-2}}$$
$$\Rightarrow \log\frac{1 + e^{-\Delta}}{1 + e^{-(2a+\Delta)}} + \frac{1}{2} \ge 1.0662$$
$$\Rightarrow \lambda \ge 1.$$

Therefore, we have $\lambda = 1$.

Based on this property, the offset factor for $n \in \bar{\mathcal{I}}(m)$ can be determined according to $min_1$ and $min_2$. For $n \in \mathcal{I}(m)$, we cannot obtain a more precise correction factor based only on $min_1$ and $min_2$. Since the MS decoder performs better than the OMS decoder on 5G LDPC codes in fixed-point implementations [10], $\lambda$ is set to 0 for $n \in \mathcal{I}(m)$. The proposed CN-update function is shown in (10), which retains the low-complexity property.

$$\alpha_{m,n}^{(t)} = \begin{cases} \tau_{m,n}^{(t)} \cdot min_2, & n = idx_1 \\ \tau_{m,n}^{(t)} \cdot min_1, & n = idx_2 \\ \tau_{m,n}^{(t)} \cdot \max(min_1 - 1, 0), & n \in \bar{\mathcal{I}}(m) \;\&\; \Delta = 0 \\ \tau_{m,n}^{(t)} \cdot min_1, & n \in \bar{\mathcal{I}}(m) \;\&\; \Delta \neq 0. \end{cases} \tag{10}$$

To demonstrate the effectiveness of the proposed CN-update function, the mismatch probabilities of different CN-update functions are shown in Fig. 2, where the exchanged messages are quantized to 4 bits, i.e., $q = 4$. Therefore, the values of $|\beta_{n,m}|$ can only be 0 to 7, so the total number of combinations of the received messages in a degree-$d_c$ CN is $8^{d_c}$ (that is, $2^{(q-1) \cdot d_c}$). For each case, if the CTV value calculated by the tested decoder is not equal to the CTV value calculated by the 4-bit quantized BP decoder, we consider this case a mismatch case. The mismatch probability is obtained by testing all $8^{d_c}$ cases and then calculating the proportion of mismatch cases.
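The CN-update function in (10) can be sketched in Python as below. The function name `iams_cn_update` is hypothetical, ties between equal magnitudes are broken by taking the first index (an assumption, since the text does not specify tie-breaking), and a zero message is treated as having a positive sign.

```python
def iams_cn_update(beta):
    """Return the CTV messages of one check node according to eq. (10).

    beta: list of integer VTC messages entering the check node (length >= 2).
    """
    mags = [abs(b) for b in beta]
    # Indices of the first and second minimum magnitudes (first index wins ties).
    idx1 = min(range(len(beta)), key=lambda i: mags[i])
    idx2 = min((i for i in range(len(beta)) if i != idx1), key=lambda i: mags[i])
    min1, min2 = mags[idx1], mags[idx2]
    total_sign = -1 if sum(b < 0 for b in beta) % 2 else 1
    alpha = []
    for i, b in enumerate(beta):
        # Extrinsic sign tau: total sign with the edge's own sign removed.
        tau = total_sign * (-1 if b < 0 else 1)
        if i == idx1:
            mag = min2                    # n = idx1
        elif i == idx2:
            mag = min1                    # n = idx2
        elif min1 == min2:
            mag = max(min1 - 1, 0)        # Delta = 0: apply the offset of 1
        else:
            mag = min1                    # Delta != 0: no offset
        alpha.append(tau * mag)
    return alpha

if __name__ == "__main__":
    print(iams_cn_update([3, -3, 5, 7]))  # equal minima, offset applied elsewhere
    print(iams_cn_update([2, -4, 6]))     # distinct minima, no offset
```

Only the two minima, their indices, and one equality comparison are needed, which matches the information a conventional MS check node already tracks.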

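The Property above can also be checked numerically by enumerating every pair of minima that a 4-bit decoder can produce. The sketch below is only a sanity check under the assumptions that the logarithm is natural and that quantization rounds to the nearest integer via $\lfloor \cdot + 1/2 \rfloor$; the helper names are hypothetical.

```python
import math

def quantized_offset(a, delta):
    """Rounded offset factor for min1 = a and min2 = a + delta."""
    lam = math.log((1 + math.exp(-delta)) / (1 + math.exp(-(2 * a + delta))))
    return math.floor(lam + 0.5)

def count_property_violations(max_mag=7):
    """Count (a, delta) pairs where the rounded offset differs from the Property."""
    violations = 0
    for a in range(max_mag + 1):
        for delta in range(max_mag + 1 - a):   # keep min2 = a + delta <= max_mag
            expected = 1 if (a > 0 and delta == 0) else 0
            if quantized_offset(a, delta) != expected:
                violations += 1
    return violations

if __name__ == "__main__":
    print("violations:", count_property_violations())
```

The enumeration covers magnitudes 0 to 7, which is the range of $|\beta_{n,m}|$ for $q = 4$.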

Fig. 2. The mismatch probabilities of different CN-update functions under the row degree region for BG2 codes.

From Fig. 2 we can see that, compared to the MS, OMS, and AMS decodings, the proposed CN-update function shows a much lower mismatch probability in the simulated row degree region, which is also the row degree region for BG2 codes. Therefore, the reliability of CTV messages is significantly improved, especially for the extension checks. It can also be seen that, compared with the MS decoder, the OMS decoder shows a higher mismatch probability for the extension checks but a lower probability for the core checks. The AMS decoder [10] combines the advantages of the MS and OMS decoders, which explains its performance improvement.

B. Column Degree Adaptation

As stated before, 5G LDPC codes are extremely irregular and there exists a wide variation in column degrees. The column degree varies from 1 to 23 in base matrix BG2 and from 1 to 30 in BG1. With more neighboring CNs, the high-degree VNs usually have larger APP magnitudes, which are called strong messages. These strong messages can be helpful or harmful to the decoding process, depending on whether they are correct or not. In the waterfall region, where many bits are received incorrectly, the incorrect strong messages tend to negatively influence the correction of the received bits. In the error-floor region, where the channel conditions are good and trapping-sets dominate the decoding performance [25], the correct strong messages can overcome the incorrect messages in trapping-sets and thus contribute to improving the decoding performance [26]. Therefore, the requirement on strong messages differs across SNR regions.

To manage the influence of strong messages on the decoding process, we propose a column degree adaptation strategy in which the CTV messages passed to different VNs from a CN are computed non-uniformly. Observing (5) and (10) we can conclude that the magnitudes of CTV messages computed by the OMS decoding are generally smaller than those computed by the proposed CN-update function. To limit the magnitude growth of strong messages, the CTV messages transmitted to the VNs whose degrees are larger than threshold $D$ are computed using the CN-update function of the OMS decoding rather than the proposed CN-update function. To avoid over-correcting strong messages, the column degree adaptation is applied only to core checks, whose CTV messages show a lower mismatch probability than those of extension checks when applying the OMS decoding, as shown in Fig. 2. Consequently, the influence of strong messages on the decoding process can be managed to some extent by adjusting parameter $D$, and the decoding performance can achieve a better balance between the waterfall and error-floor regions. Moreover, one can select a proper $D$ to get the best performance in the required SNR region.

Fig. 3. The error-rate of each group when $E_b/N_0 = 2.0$dB.

The effectiveness of the column degree adaptation is illustrated in Fig. 3. In this work, we divide the codeword into $N_b$ groups and each group corresponds to a column in base matrix $H_B$. Consequently, a group consists of $Z$ bits and the bits in each group have the same column degree. In simulations, a group is considered erroneous if it contains an error bit. Since 5G LDPC codes are extremely irregular, the degrees of bits are very different, so bits with different column degrees may perform differently under the same decoding algorithm. Considering that the bits in different groups may have different column degrees, Fig. 3 shows the error-rate of each group when $E_b/N_0 = 2.0$dB. The $R = 1/5$, $Z = 52$, $N = 2600$ 5G LDPC code defined by BG2 is applied and all decodings are quantized with parameters $(q, \tilde{q}) = (4, 6)$. For each decoding, at least 1000 error frames are collected. We denote the decoding where only the proposed CN-update function is applied as M1 and the decoding where both the proposed CN-update function and column degree adaptation are applied as M2, namely the IAMS algorithm. The parameter $D$ is selected by traversing all row degrees of the code to find the value which shows the optimal performance through Monte-Carlo simulations. For the selected code, $D = 6$. Because the degrees of bits in the first two groups are much larger than the others, these bits have more chances to be corrected, so the first two groups show the best performance, especially for the OMS and M2 decoders. Since Fig. 3 shows the simulation results in the low SNR region where many bits are received incorrectly, the propagation of incorrect strong messages has a larger negative influence on decoding than the imprecise offset factor. Therefore, the OMS decoder performs better than the MS decoder. However, they both perform worse than the AMS decoder [10], which is the state-of-the-art one for 5G LDPC codes in fixed-point. Moreover, it can be seen that for all groups, the M1 and M2 decoders exhibit better performance than the AMS decoder. With the help of the proposed column degree adaptation, M2 significantly improves the decoding performance of the M1 decoder, proving the effectiveness of the proposed column degree adaptation.
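The per-edge selection between the OMS rule (5) and the proposed rule (10) can be sketched as follows. All function names are hypothetical, and the comparison uses $d_v \ge D$ as written in Alg. 1, whereas the text above says "larger than threshold $D$"; which of the two the authors intend is an assumption here.

```python
def extrinsic_sign(beta, i):
    """Sign product of all VTC messages except the one on edge i."""
    negatives = sum(1 for k, b in enumerate(beta) if k != i and b < 0)
    return -1 if negatives % 2 else 1

def oms_magnitude(beta, i, lam=1):
    """Eq. (5): minimum extrinsic magnitude minus the offset, clamped at zero."""
    return max(min(abs(b) for k, b in enumerate(beta) if k != i) - lam, 0)

def proposed_magnitude(beta, i):
    """Eq. (10): magnitude chosen from min1, min2 and their equality."""
    mags = [abs(b) for b in beta]
    order = sorted(range(len(beta)), key=lambda k: mags[k])  # stable sort
    idx1, idx2 = order[0], order[1]
    min1, min2 = mags[idx1], mags[idx2]
    if i == idx1:
        return min2
    if i == idx2:
        return min1
    return max(min1 - 1, 0) if min1 == min2 else min1

def ctv_with_column_degree_adaptation(beta, col_degrees, is_core_check, D):
    """CTV messages of one check node with column degree adaptation.

    col_degrees[i] is the column degree of the VN on edge i; the OMS rule is
    used only on core checks and only toward VNs with degree >= D.
    """
    alpha = []
    for i in range(len(beta)):
        use_oms = is_core_check and col_degrees[i] >= D
        mag = oms_magnitude(beta, i) if use_oms else proposed_magnitude(beta, i)
        alpha.append(extrinsic_sign(beta, i) * mag)
    return alpha

if __name__ == "__main__":
    beta = [2, -4, 6, 5]
    degrees = [23, 9, 3, 1]   # illustrative column degrees, not taken from BG2
    print(ctv_with_column_degree_adaptation(beta, degrees, True, D=6))
    print(ctv_with_column_degree_adaptation(beta, degrees, False, D=6))
```

In the example, the two high-degree VNs on the core check receive smaller magnitudes than they would on an extension check, which is how the strategy limits the growth of strong messages.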
The detailed decoding process of the proposed IAMS algo-
rithm is shown in Alg. 1, where the layered schedule is adopted
and each layer corresponds to one row of the base matrix.
When l < 4, the core checks are processed with the proposed
column degree adaption applied. The number of layers is
denoted by L and Ll denotes the indices set of rows in the
l-th layer.
Fig. 4. Simulation results on the R = 1/5, Z = 52, N = 2600 BG2 code
Algorithm 1: The Proposed IAMS Decoding Algorithm when I tmax = 15.
input : γ = (γ0 , γ1 , · · · , γ N−1 )
(0)
initialize: ∀m ∈ [0, M), n ∈ [0, N) : αm,n = 0,
(0)
∀n ∈ [0, N) : γ̃n = γn
1 for t = 1 to I tmax do
2   for $l = 0$ to $L - 1$ do
3     for $m \in L_l$ and $n \in N_m$ do
4       $\beta_{n,m}^{(t)} = [\tilde{\beta}_{n,m}^{(t)}] = [\tilde{\gamma}_n^{(t-1)} - \alpha_{m,n}^{(t-1)}]$
5     for $m \in L_l$ and $n \in N_m$ do
6       if $l < 4$ and $d_{v_n} \geq D$ then
7         Calculate $\alpha_{m,n}^{(t)}$ using (5)
8       else
9         Calculate $\alpha_{m,n}^{(t)}$ using (10)
10    for $m \in L_l$ and $n \in N_m$ do
11      $\tilde{\gamma}_n^{(t)} = [\alpha_{m,n}^{(t)} + \tilde{\beta}_{n,m}^{(t)}]_A$
12  for $n = 0$ to $N - 1$ do
13    $\hat{c}_n^{(t)} = HD(\tilde{\gamma}_n^{(t)})$
14  if $\hat{c}^{(t)} \cdot H^T = 0$ then
15    break

output: $\hat{c}^{(t)}$

IV. NUMERICAL RESULTS

In this section, the decoding performance of the proposed IAMS algorithm is illustrated and compared to that of the MS, OMS, and AMS decodings. All decodings use the layered schedule. In practical applications, the number of quantization bits used in LDPC decoders is usually no more than 6 in order to reduce the area and power consumption. Therefore, the quantization parameters are set to $(q, \tilde{q}) = (4, 6)$ in this work. Moreover, the performance of the floating-point MS and OMS algorithms, which also use the layered schedule, is shown for reference. The offset value for the floating-point OMS decoding is set to 0.2. The simulation results are obtained through Monte-Carlo simulations that generate at least 100 error frames for each plotted point. Because the fraction of degree-1 bits is very small in high-rate 5G LDPC codes, while our approach targets the performance of the degree-1 bits, the proposed decoder is more suitable for 5G LDPC codes with low to moderate code rates. Therefore, we consider two 5G LDPC codes with different rates and lengths: an R = 1/5, Z = 52, N = 2600 BG2 code and an R = 2/3, Z = 104, N = 3432 BG1 code. For simplicity, it is assumed that the codeword is sent only once, without using any hybrid automatic repeat request (HARQ) scheme.

Fig. 5. Simulation results on the R = 1/5, Z = 52, N = 2600 BG2 code when $It_{\max} = 100$.

A. Performance Comparisons

The maximum number of iterations is typically less than 20 in practical implementations because of the throughput requirement, whereas the decoders need about 100 iterations to saturate. Fig. 4 and Fig. 5 therefore show the simulation results on the R = 1/5, Z = 52, N = 2600 BG2 code when $It_{\max} = 15$ and $It_{\max} = 100$, respectively. For a fair comparison, the channel gain factor of each decoding is fixed and optimized by simulation to find the value that performs best at FER = $10^{-7}$, where the test step is set to 0.05. The optimal values for the OMS, MS, AMS, and IAMS decoders are 1.3, 1.1, 0.85, and 0.8, respectively. Due to the imprecise offset factor, the OMS decoder suffers from severe performance degradation under (4, 6) quantization, which could be compensated by adding one bit to the quantization length. Compared to the AMS decoding, the proposed IAMS decoding shows a much better performance. When the threshold D is well selected, the performance gain can be as large as 0.4 dB.


With a well-selected threshold D, the performance gain of the IAMS decoding over the AMS decoding is 0.4 dB in the waterfall region and 0.2 dB in the error-floor region. However, a limitation of the IAMS decoding is that the error floor starts around FER = $10^{-5}$. This can be explained by the fact that, due to the quantization, the dynamic range of the messages is limited, so it is hard for them to escape from trapping-sets [26]. By increasing the number of quantization bits, the error-floor phenomenon can be overcome to some extent. It should be noted that although the IAMS decoding suffers from the error-floor phenomenon, its performance is still better than that of the other fixed-point decodings in the high-SNR region.

To further verify the comparison results, Fig. 6 and Fig. 7 show the simulation results on the R = 2/3, Z = 104, N = 3432 BG1 code, where the threshold D is set to 5. The optimal values of the channel gain factor for the OMS, MS, AMS, and IAMS decoders are 3.1, 2.6, 2.8, and 2.55, respectively. As can be seen, the IAMS algorithm shows the best error-correction performance among all fixed-point decodings. Compared to the AMS decoding, the IAMS decoding could offer a 0.4 dB to 0.6 dB performance gain. Therefore, we can conclude that for 5G LDPC codes with low to moderate rates, there are many extension bits that could benefit from the proposed CN-update function, so the proposed IAMS decoding could offer a much-improved error-correction performance compared to the existing fixed-point decodings.

Fig. 6. Simulation results on the R = 2/3, Z = 104, N = 3432 BG1 code when $It_{\max} = 15$.

Fig. 7. Simulation results on the R = 2/3, Z = 104, N = 3432 BG1 code when $It_{\max} = 100$.

B. Decoding Performance Analysis

In Fig. 8, we explore the effect of limiting the maximum number of iterations on different decodings, where $It_{\max}$ increases from 10 to 10K. The R = 1/5, Z = 52, N = 2600 BG2 code is adopted and two $E_b/N_0$ points, 2.0 dB and 2.6 dB, are considered. Fig. 8(a) shows the performance in the waterfall region, where random-like errors are the main causes of decoding failures [26]. As shown in Fig. 8(a), by increasing the maximum number of iterations, most of these errors can be corrected and the decoding performance is improved. Since a smaller D is better able to limit the magnitude growth of strong random-like errors, when $It_{\max}$ is not sufficient, the IAMS decoding paired with D = 6 performs better than with D = 10. Also, because the overestimation of CTV messages encourages the magnitude growth of errors, the MS decoding shows poor performance and converges slowly at this point. When $E_b/N_0$ is sufficiently large, the decoding performance is dominated by trapping-sets, which are the main reasons for the error-floor phenomenon [27]. Fig. 8(b) shows the performance under different iteration limits in the error-floor region. As can be seen, except for the MS decoding, almost no further performance gain can be obtained by increasing the maximum number of iterations when $It_{\max} > 30$. As for the MS decoding, the saturation starts from $It_{\max} = 500$. Moreover, the IAMS decoding paired with D = 10 can surpass that with D = 6 in performance within a smaller number of iterations.

Fig. 8. FER performance under different iteration limits at (a): $E_b/N_0$ = 2.0 dB, (b): $E_b/N_0$ = 2.6 dB.

Since the degree-1 VNs in 5G LDPC codes are prone to be erroneous, these VNs can easily form a trapping-set. To analyze the performance behavior of the IAMS decoding in the error-floor region, we collect a typical set of trapping-sets for the selected code.


The collected trapping-set is shown in Fig. 9. In order to facilitate the analysis, we assume that only eight received bits are erroneous and that all of them fall into this trapping-set. In this case, $v_2$ to $v_8$ are falsely estimated, so $v_1$ receives seven wrong messages from $c_3$ to $c_9$, which are all extension checks. Moreover, $v_1$ receives two correct messages from $c_1$ and $c_2$, which belong to the core checks. When the sum of the two correct messages is smaller than that of the seven wrong messages, $v_1$ will be erroneous and the remaining seven bits cannot be corrected. Consequently, the decoder will be trapped in the trapping-set.

Fig. 9. Subgraph induced by the collected (8,2) trapping-set.

The trapping process can be shown with a practical example. Fig. 10 shows the APP value evolution when the IAMS algorithm with D = 6 is applied. The bits belonging to the collected trapping-set are marked with red squares and the others with black circles. Assume the all-zero codeword is transmitted using BPSK modulation. Accordingly, non-negative APP values are interpreted as correct and negative values denote faults. As can be seen, the decoder cannot escape from the trapping-set once it is captured. By increasing the value of D, more core checks could be processed by the proposed CN-update function rather than the OMS decoding. Therefore, the magnitudes of the CTV messages generated from the core checks ($c_1$ and $c_2$) can be increased to some extent. In that case, the probability that $v_1$ is corrected is increased, so the decoder has a larger probability of escaping from trapping-sets.

Fig. 10. Soft-decision values evolution along with iterations.

To illustrate the above discussion intuitively, Table I shows the values of the received CTV messages of a VN that belongs to the trapping-set in the 6th iteration. The channel message for this VN equals −1. Compared to the case when D = 6, the second connected CN $c_2$ sends a slightly larger correct CTV message to this VN when D = 10 and thus the corresponding bit can be correctly recovered. Consequently, this codeword can be successfully decoded when D = 10. This explains why the IAMS decoding could perform better when paired with D = 10 in the error-floor region. However, in order to balance the decoding performance in the waterfall region, a larger D is not always better. Also, the correction of random-like errors in early iterations is impaired by an excessively large D.

TABLE I
THE VALUES OF RECEIVED CTV MESSAGES FOR A VN BELONGING TO TRAPPING-SET

V. HARDWARE ARCHITECTURE FOR 5G LDPC DECODERS

In this section, we propose an efficient architecture to implement 5G LDPC decoders. In order to design a high-throughput and area-efficient decoder, several optimization methods are developed, as shown in the following subsections.

A. Top-Level Architecture

The overall architecture of the proposed 5G LDPC decoder is shown in Fig. 11, which is implemented using the layered schedule. For convenience, we assume that the quantized version of the channel LLR messages is available at the input port of the decoder. However, it should be noted that the method used to quantize the channel LLR messages should be compatible with the quantization method used in the decoder. The proposed architecture is not limited to a specific quantization method.

Fig. 11. Top-level architecture of the proposed 5G LDPC decoder.
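As an illustration of what a quantized channel input could look like, the sketch below scales a channel value by a gain factor, rounds it, and saturates it to a signed $q$-bit integer. This is an assumption made for illustration only: the text does not specify the quantizer, and the uniform rule, the way the channel gain factor enters it, and the name `quantize_llr` are all hypothetical.

```python
def quantize_llr(y, gain, q=4):
    """Scale a channel value by `gain`, round, and saturate to a signed q-bit integer.

    Hypothetical uniform quantizer; the paper only requires that the input
    quantization be compatible with the one used inside the decoder.
    """
    lim = 2 ** (q - 1) - 1            # symmetric range, e.g. [-7, 7] for q = 4
    v = int(round(gain * y))
    return max(-lim, min(lim, v))

samples = [0.3, -1.2, 2.5, 9.0, -9.0]
quantized = [quantize_llr(y, gain=2.0) for y in samples]
```

With q = 4 the output range is [-7, 7], so large channel values such as 9.0 saturate at the boundary while small ones keep their resolution.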


Since the architecture is not tied to a specific quantization method, one can easily modify the proposed decoder architecture to support different quantization schemes as long as the number of quantization bits remains unchanged.

For a QC-LDPC code defined by an $M_b \times N_b$ base matrix $H_B$, the number of decoding layers $L$ usually equals $M_b$. Therefore, the parallelism degree of the decoder equals the expansion factor $Z$. As shown in Fig. 11, the input and exchanged messages are quantized to $q$ bits and the APP values are quantized to $\tilde{q}$ bits. All control signals are generated by the Controller. Two memory blocks, namely the APP memory and the CTV memory, are used to store the APP and CTV messages, respectively. The CTV memory is implemented with dual-port random access memory (DP-RAM) to support simultaneous read and write operations. In order to allow massively parallel read, write, and initialization operations, the APP memory is implemented with registers. In the proposed architecture, the APP memory is divided into three parts and the CTV memory consists of two parts. The reason for this configuration will be presented in the following subsections.

In each decoding layer, APP messages are first read from the APP memory and then passed to the Read Network, which rearranges and selects these messages according to the current processing layer to ensure that they are processed by the proper left barrel shifters (LBSs) and VN units (VNUs). Similarly, the Write Network is used to rearrange the updated APP messages to ensure that they are stored in the correct addresses of the APP memory. Let $d_{c\max}$ denote the maximum row degree of the code. In the proposed architecture, $d_{c\max}$ pairs of LBSs and VNUs are applied. Messages output from the Read Network are first left rotated by the LBSs according to the corresponding shift factors and then passed to the VNUs to calculate the VTC messages. By adopting the method to generate the shift factor presented in [20], the data write-back barrel shifters can be eliminated.

After being saturated to $q$ bits, the VTC messages are sent to the CN unit (CNU), which generates the CTV messages. The CNU is implemented using the area-efficient architecture proposed in [28]. In the IAMS decoder, $idx_2$ should also be calculated and stored, which is the main difference from other decoders. As shown in Fig. 12, CTV messages are stored in a compressed format to reduce the memory cost. Therefore, the width of the CTV memory is $z \times (d_{c\max} + 2 \cdot (q - 1 + \log_2 d_{c\max}))$. Since the CTV messages in all layers need to be stored, the depth of the CTV memory is $L$. In order to convert the CTV messages from the compressed format to the uncompressed format, two De-compressors are inserted into the decoder, which generate the final CTV messages for the following calculations. Then, the APP values can be updated. After writing them back to the APP memory, one decoding layer is finished.

Fig. 12. Compressed format of CTV messages.

To minimize the number of clock cycles, no pipeline is inserted into the proposed architecture. Therefore, one decoding layer takes one clock cycle and the total number of clock cycles is $L \times It_{\max}$. The throughput $\theta$ is computed as

$$\theta = \frac{f \times N}{L \times It_{\max}}, \tag{11}$$

where $f$ denotes the frequency of the decoder.

B. Memory and Clock Cycles Reduction

By observing the structure of the BG1 and BG2 matrices, it can be found that part of each matrix has the orthogonality property, meaning that no VN is connected to two consecutive layers. For instance, the 21st to 46th rows of the BG1 matrix are orthogonal. Similarly, the 21st to 42nd rows of the BG2 matrix are orthogonal. In two orthogonal layers, the APP messages updated in the previous layer will not be used in the next layer. Therefore, the decoding processes in two such layers are independent. Based on this feature of 5G LDPC codes, we propose a layer merging method to reduce the number of clock cycles. A similar idea was also applied in [29] to optimize a pipelined decoder for the IEEE 802.11ad standard. However, the configurations in these two architectures are different.

In the proposed architecture, two consecutive layers in the orthogonal part are processed simultaneously. Therefore, the number of decoding layers in the orthogonal part is reduced by half, which leads to fewer clock cycles. For BG1 and BG2 codes, the number of clock cycles could be reduced by 28.3% and 26.2%, respectively. Because the row degrees in the orthogonal part are all less than $d_{c\max}/2$, no additional LBS or VNU is needed and the APP memory remains unchanged. However, since two layers are processed in one clock cycle, the CNU and the CTV memory should be modified so that two sets of CTV messages can be generated and stored at the same time. Fig. 13 shows the architecture of the CNU, which is divided into two subunits. When two orthogonal layers are processed simultaneously, two sets of VTC messages are input to CNU$_{1st}$ and CNU$_{2nd}$, respectively. In this case, the Compare & Select unit is disabled, so two sets of CTV messages are output from the CNU. Let $d_{co}$ denote the maximum row degree in the orthogonal part. In order to store two sets of CTV messages in the same address, the width of the CTV memory is set to

$$W = \max\{\, z \times (d_{c\max} + 2 \cdot (q - 1 + \log_2 d_{c\max})),\; 2 \times z \times (d_{co} + 2 \cdot (q - 1 + \log_2 d_{co})) \,\}. \tag{12}$$

Table II shows the size of the CTV memory when $q = 4$. As can be seen, though the width of the CTV memory is slightly increased after applying the layer merging, the depth is reduced due to the smaller number of layers. Therefore, besides reducing the number of clock cycles, the proposed layer merging method could reduce the size of the CTV memory by 26.2% and 13.9% for BG1 and BG2 codes, respectively.

Considering that the 5G LDPC codes are extremely irregular and the degrees of some layers are relatively small, setting the width of the CTV memory according to (12) will lead to a great waste of memory resources. To further reduce the memory cost, we present a split storage method.
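The layer counts behind the quoted cycle reductions, the throughput expression (11), and the memory width (12) can be checked with a short sketch. Some inputs are assumptions rather than values taken from this section: the full base graphs are taken to have 46 rows (BG1) and 42 rows (BG2), $\log_2$ is rounded up to a whole number of index bits, and the function names and the 100 MHz clock in the usage lines are hypothetical.

```python
import math

def merged_layers(total_layers, first_orth_row, last_orth_row):
    """Layers left after pairing consecutive rows of the orthogonal part."""
    orth = last_orth_row - first_orth_row + 1
    return total_layers - orth + math.ceil(orth / 2)

def throughput(f_hz, n_bits, layers, it_max):
    """Equation (11): one clock cycle per layer, layers * it_max cycles per frame."""
    return f_hz * n_bits / (layers * it_max)

def ctv_width(z, q, dc_max, dc_o):
    """Equation (12), assuming log2 is rounded up to an integer bit count."""
    def per_cn(dc):
        # dc sign bits + two (q-1)-bit magnitudes + two position indices
        return dc + 2 * (q - 1 + math.ceil(math.log2(dc)))
    return max(z * per_cn(dc_max), 2 * z * per_cn(dc_o))

bg1_layers = merged_layers(46, 21, 46)       # rows 21..46 of BG1 are orthogonal
bg2_layers = merged_layers(42, 21, 42)       # rows 21..42 of BG2 are orthogonal
saving_bg1 = 100 * (46 - bg1_layers) / 46
saving_bg2 = 100 * (42 - bg2_layers) / 42

# Hypothetical operating point: only the ratio between the two values matters.
before = throughput(100e6, 2600, 42, 10)
after = throughput(100e6, 2600, bg2_layers, 10)
```

Under these assumptions the merged layer counts are 33 and 31, which reproduces the 28.3% and 26.2% cycle reductions quoted above; by (11), the throughput at a fixed clock frequency grows by the inverse of the same ratio.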


As mentioned in Section V-A, the CTV memory is divided into two parts. Fig. 14 shows the detailed structure of these two sub-memories, where $d_{cn}$ denotes the maximum row degree of the layers outside the core and orthogonal parts and $W_1 = z \times (d_{cn} + 2 \cdot (q - 1 + \log_2 d_{cn}))$. Since the width of the CTV messages generated in the core and orthogonal parts is larger than $W_1$, only the first $W_1$ bits of these messages are stored in CTV memory 1 while the remaining bits are stored in CTV memory 2. For the other layers, the CTV messages are stored entirely in CTV memory 1. Because CTV memory 2 is specifically used for the layers in the core and orthogonal parts, its depth $L'$ is less than $L$. Thanks to the split storage method, the size of the CTV memory can be further reduced by 16.6% for BG1 codes and 18.4% for BG2 codes. Combined with the layer merging method, 39.6% of the CTV memory can be saved for BG1 codes and 29.8% can be saved for BG2 codes in total. Since the CTV memory occupies a large proportion of the decoder area, these modifications greatly benefit the total area reduction.

Fig. 13. The architecture of CNU.

TABLE II
THE SIZE OF CTV MEMORY

Fig. 14. The detailed structure of CTV memory.

C. Interconnection Network Optimizations

Besides the memory block, the interconnection block is another important part that dictates the overall hardware overhead. For a QC-LDPC code generated from an $M_b \times N_b$ base matrix, $N_b$ sets of APP messages are fed into the Read Network, which outputs $d_{c\max}$ sets of messages. Similarly, the Write Network selects $d_{c\max}$ sets of APP messages from $N_b$ sets and replaces them by the updated APP messages. Thus, two complex interconnection networks are required, which induce a long critical path and large area consumption. To minimize the hardware overhead of the interconnection block, two optimization methods are introduced.

First, the selective-shift structure is applied in the APP memory to minimize $N_{io}$, which represents the number of inputs to the Read Network and also the number of outputs from the Write Network. Considering the diagonal property of the identity matrix $I$ in the base matrix, only one set of APP messages corresponding to the extension bits is needed in each decoding layer. Therefore, rather than sending all APP messages corresponding to the extension bits to the Read Network, in this work these APP messages are fed into the Read Network one set at a time, in sequence. Consequently, $N_{io}$, and hence the area and critical path of the interconnection block, can be significantly reduced. For this reason, we store the APP messages corresponding to the extension bits in an individual memory which is implemented by applying the selective-shift structure, as shown in Fig. 15. The control signal sel decides whether the data should be rotated or not, and is only enabled when the extension part is processed. By cyclically shifting the APP messages, the address of the required APP messages is fixed during decoding, so the required APP messages can be easily obtained.

Fig. 15. The structure of the APP memory for extension bits.

Since the layer merging method is applied, two sets of APP messages corresponding to the orthogonal part are needed in the same clock cycle. In order to obtain them simultaneously, two memories are applied to store the APP values corresponding to the extension bits. Therefore, all APP messages of the core bits and two sets of APP messages of the extension bits are input to the Read Network in each clock cycle. Hence, $N_{io}$ could be reduced from 68 to 28 for BG1 codes and from 52 to 16 for BG2 codes. The APP memory for extension bits can also be implemented with DP-RAM, which consumes less area and power. However, to support massively parallel initialization, the shift register set is adopted in the proposed design.

To further reduce the hardware overhead of the interconnection networks, the APP messages processed in the Read Network are reordered. For simplicity, we take the R = 1/5, Z = 52, N = 2600 BG2 code as an example. As stated before, 16 sets of messages are fed into the Read Network in each iteration, which are denoted as $s_1, s_2, \ldots, s_{16}$. Therefore, each output is selected from 16 inputs, leading to large hardware overhead. In the proposed design, the APP messages are reordered. Fig. 16 shows the mapping between the input and output messages in the Read Network, in which the last row indicates the number of inputs needed to generate each output.
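The selective-shift storage for the extension bits described above can be modelled behaviourally as a rotating register bank that is always read at the same address. The sketch below is a simplified software analogy, not the register-level design: the class name, the `sel` argument, and the string placeholders for APP sets are hypothetical, and the core-set counts 26 (BG1) and 14 (BG2) are inferred from the quoted $N_{io}$ values.

```python
from collections import deque

class SelectiveShiftMemory:
    """Register bank holding one APP set per extension column."""

    def __init__(self, app_sets):
        self.regs = deque(app_sets)

    def read(self):
        # The required set is always found at the same (fixed) address.
        return self.regs[0]

    def write(self, updated, sel):
        # Store the updated set; rotate only when sel is enabled,
        # i.e. while the extension part is being processed.
        self.regs[0] = updated
        if sel:
            self.regs.rotate(-1)

def n_io(core_sets, extension_memories=2):
    """Read Network inputs: all core sets plus one set per extension memory."""
    return core_sets + extension_memories

mem = SelectiveShiftMemory(["e0", "e1", "e2"])
seen = []
for _ in range(3):
    current = mem.read()
    seen.append(current)
    mem.write(current + "'", sel=True)   # write back an "updated" set
```

After one full pass every set has been visited once, in order, through the single fixed read address, and the bank is back in its original order with updated contents. This is why the Read Network needs only one input per extension memory instead of one per extension column.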


As Fig. 16 shows, by applying the message reordering, all outputs can be generated by no more than four inputs. Therefore, the critical path of the Read Network is only two multiplexers and the number of required multiplexers is significantly reduced. Since the Write Network has a similar structure to the Read Network, the mapping for the Write Network can be easily deduced from Fig. 16.

Fig. 16. Mapping relationship between the input and output messages in the Read Network.

VI. IMPLEMENTATION RESULTS

The implementation results, as well as the corresponding comparisons, are reported in this section. The decoder architecture is implemented in RTL and synthesized under the TSMC 90-nm CMOS technology using the Synopsys Design Compiler. The Synopsys PrimeTime PX tool is used for power estimation. We first generate the VCD file to read the switching activity and then estimate the power consumption of the decoder using time-based power analysis. The R = 1/5, Z = 52, N = 2600 BG2 code is used to implement the proposed architecture. By applying the methods presented in Section V, we can conclude that for the selected code, $L$ could be reduced from 40 to 30, so the number of clock cycles could be reduced by 25%. Moreover, 29.2% of the CTV memory can be saved and the hardware overhead of the interconnection networks could be significantly reduced.

Since the AMS and OMS decoders have similar hardware complexity, we compare the implementation results of the proposed IAMS decoder to those of the OMS decoder. The IAMS and OMS decoders are implemented according to the structure presented in Section V-A. Table III shows the ASIC synthesis results on 90-nm CMOS technology of the OMS and IAMS decoders, which are quantized with parameters $(q, \tilde{q}) = (4, 6)$. Due to the additional storage and calculations for applying $idx_2$, the IAMS decoder has a slightly larger area and lower throughput than the OMS decoder. However, this overhead is negligible considering its much-improved decoding performance.

TABLE III
ASIC SYNTHESIS RESULTS ON 90-nm CMOS TECHNOLOGY

Since there are no published architectures or synthesis results of 5G LDPC decoders, in order to evaluate the effectiveness of our optimization methods, Table III lists the synthesis results of decoders with and without the optimizations proposed in Section V-B and Section V-C. It can be seen that after being modified, the area of the IAMS decoder is reduced by 32.3% and the frequency is improved by 38.9%. Considering that the number of decoding cycles is decreased by 25%, the throughput could be improved by up to 84.1%, reaching 914 Mbps, which could meet the 5G requirements in terms of throughput on rate-1/5 codes [9]. In order to make comparisons with other works that use different technologies easier, the area in gate equivalents of each decoder is also reported, which is computed by dividing the total area by that of an XOR gate. Moreover, the power consumption results are reported. As can be seen, though the modified decoders are synthesized at a higher frequency, they consume nearly the same power as the original ones due to lower resource usage. To keep the throughput comparison on an equal basis, we further define the TAR metric and the normalized TAR (NTAR) metric: TAR = throughput/area and NTAR = TAR × Iterations. Table III shows that the TAR of the IAMS decoder is increased from 247.2 Mbps/mm² to 675.5 Mbps/mm², an increase of 173.3%. A similar conclusion can be drawn for the OMS decoder, for which the TAR is improved by 168.7%. Therefore, the effectiveness of the proposed optimizations is demonstrated.

TABLE IV
THE AREA OF EACH BLOCK

Table IV lists the area of each block in the whole decoder, including the interconnection blocks (Read Network, Write Network, and LBSs) of the IAMS decoder.
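The TAR and NTAR metrics defined above, and the relative changes quoted from Table III, can be reproduced with a few lines of arithmetic. The sketch uses only figures stated in the text (the two TAR values and the layer counts 40 and 30); the helper names are hypothetical, and the iteration count needed for NTAR is left as a parameter because it is not given in this section.

```python
def tar(throughput_mbps, area_mm2):
    """Throughput-to-area ratio in Mbps/mm^2."""
    return throughput_mbps / area_mm2

def ntar(throughput_mbps, area_mm2, iterations):
    """Normalized TAR: TAR multiplied by the number of iterations."""
    return tar(throughput_mbps, area_mm2) * iterations

def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

tar_gain = percent_change(247.2, 675.5)   # TAR of the IAMS decoder, from the text
cycle_change = percent_change(40, 30)     # layers per iteration for the selected code
```

Rounded to one decimal, `tar_gain` matches the 173.3% increase reported for the IAMS decoder, and `cycle_change` matches the 25% reduction in clock cycles.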


After applying the proposed modifications, the area of the interconnection blocks of the IAMS decoder decreases from 0.848 mm² to 0.343 mm², a reduction of 59.6%. Moreover, the area of the CTV memory is reduced by 25.1%, which is less than the theoretical analysis (29.2%). This is mainly because the area of the DP-RAM is not fully determined by the data size, so the reduction of the total area is not strictly equal to that of the data. We also notice that the area of the APP memory is slightly increased, which comes from applying the selective-shift structure. However, considering that it greatly benefits the interconnection blocks, this overhead is acceptable.

VII. CONCLUSION

In this article, we propose a high-performance decoding algorithm, named the improved adapted min-sum algorithm, for fixed-point decoding of 5G LDPC codes. To reduce the error probability of degree-1 VNs, a new CN-update function is designed, and the column degree adaptation is proposed to alleviate the excessive growth of posterior probability in high-degree VNs. As a result, the proposed decoder could outperform the state-of-the-art AMS decoder by 0.4 dB in FER performance. We also present an efficient architecture for 5G LDPC decoders. First, the layer merging technique is applied based on the orthogonality property of the base matrix. Then, the split storage method is adopted to further reduce the CTV memory cost. Finally, the interconnection blocks are optimized by using the selective-shift structure and the message reordering method. Implementation results demonstrate that the proposed architecture can improve the throughput-to-area ratio by 173.3%.

REFERENCES

[1] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21–28, Jan. 1962.
[2] IEEE 802.11n Wireless LAN Medium Access Control MAC and Physical Layer PHY Specifications, Standard IEEE 802.11n-D2.0, 2007.
[3] Second Generation Framing Structure, Channel Coding and Modulation Systems for Broadcasting, Interactive Services, News Gathering and Other Broadband Satellite Applications (DVB-S2), ETSI, Sophia Antipolis, France, 2009.
[4] Standard: Synchronization Standard for Distributed Transmission, ATSC, Boston, MA, USA, 2007.
[5] Multiplexing and Channel Coding, document TS 38.212 V15.0.0, 3GPP, Dec. 2017.
[6] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001.
[7] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative decoding of low-density parity check codes based on belief propagation,” IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, May 1999.
[8] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X.-Y. Hu, “Reduced-complexity decoding of LDPC codes,” IEEE Trans. Commun., vol. 53, no. 8, pp. 1288–1299, Aug. 2005.
[9] T. Richardson and S. Kudekar, “Design of low-density parity check codes for 5G new radio,” IEEE Commun. Mag., vol. 56, no. 3, pp. 28–34, Mar. 2018.
[10] K. Le Trung, F. Ghaffari, and D. Declercq, “An adaptation of min-sum decoder for 5G low-density parity-check codes,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Sapporo, Japan, May 2019, pp. 1–5.
[11] LDPC Decoding With Adjusted Min-Sum, document R1-1610140, TSG RAN WG1 #86bis, 3GPP, Qualcomm Incorporated, Lisbon, Portugal, Oct. 2016.
[12] W. Zhou and M. Lentmaier, “Generalized two-magnitude check node updating with self correction for 5G LDPC codes decoding,” in Proc. 12th Int. ITG Conf. Syst., Commun. Coding, Rostock, Germany, Mar. 2019, pp. 1–6.
[13] K. Sun and M. Jiang, “A hybrid decoding algorithm for low-rate LDPC codes in 5G,” in Proc. 10th Int. Conf. Wireless Commun. Signal Process. (WCSP), Hangzhou, China, Oct. 2018, pp. 1–5.
[14] X. Wu, M. Jiang, and C. Zhao, “Decoding optimization for 5G LDPC codes by machine learning,” IEEE Access, vol. 6, pp. 50179–50186, 2018.
[15] C. Jones, E. Valles, M. Smith, and J. Villasenor, “Approximate-MIN∗ constraint node updating for LDPC code decoding,” in Proc. IEEE Mil. Commun. Conf. (MILCOM), Boston, MA, USA, Oct. 2003, pp. 157–162.
[16] K. Zhang, X. Huang, and Z. Wang, “A high-throughput LDPC decoder architecture with rate compatibility,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 4, pp. 839–847, Apr. 2011.
[17] C.-C. Cheng, J.-D. Yang, H.-C. Lee, C.-H. Yang, and Y.-L. Ueng, “A fully parallel LDPC decoder architecture using probabilistic min-sum algorithm for high-throughput applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 9, pp. 2738–2746, Sep. 2014.
[18] H.-C. Lee, M.-R. Li, J.-K. Hu, P.-C. Chou, and Y.-L. Ueng, “Optimization techniques for the efficient implementation of high-rate layered QC-LDPC decoders,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 2, pp. 457–470, Feb. 2017.
[19] I. Tsatsaragkos and V. Paliouras, “A reconfigurable LDPC decoder optimized for 802.11n/AC applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 1, pp. 182–195, Jan. 2018.
[20] T. T. Nguyen-Ly, V. Savin, K. Le, D. Declercq, F. Ghaffari, and O. Boncalo, “Analysis and design of cost-effective, high-throughput LDPC decoders,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 508–521, Mar. 2018.
[21] C.-Y. Liang, M.-R. Li, H.-C. Lee, H.-Y. Lee, and Y.-L. Ueng, “Hardware-friendly LDPC decoding scheduling for 5G HARQ applications,” in Proc. ICASSP-IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Brighton, U.K., May 2019, pp. 1418–1422.
[22] R. Tanner, “A recursive approach to low complexity codes,” IEEE Trans. Inf. Theory, vol. IT-27, no. 5, pp. 533–547, Sep. 1981.
[23] Z. Mheich, T.-T. Nguyen-Ly, V. Savin, and D. Declercq, “Code-aware quantizer design for finite-precision min-sum decoders,” in Proc. IEEE Int. Black Sea Conf. Commun. Netw. (BlackSeaCom), Varna, Bulgaria, Jun. 2016, pp. 1–5.
[24] W. E. Ryan, “An introduction to LDPC codes,” in CRC Handbook for Coding and Signal Processing for Magnetic Recording Systems, B. Vasic, Ed. Boca Raton, FL, USA: CRC Press, 2004, ch. 36.
[25] T. Richardson, “Error floors of LDPC codes,” in Proc. 41st Annu. Allerton Conf. Commun., Control, Comput., Oct. 2003, pp. 1426–1435.
[26] X. Zhang and P. H. Siegel, “Quantized iterative message passing decoders with low error floor for LDPC codes,” IEEE Trans. Commun., vol. 62, no. 1, pp. 1–14, Jan. 2014.
[27] F. Angarita, J. Valls, V. Almenar, and V. Torres, “Reduced-complexity min-sum algorithm for decoding LDPC codes with low error-floor,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 7, pp. 2150–2158, Jul. 2014.
[28] C. Zhang, S. Weng, X. You, and Z. Wang, “Area-efficient check node unit architecture for single block-row quasi-cyclic LDPC codes,” in Proc. IEEE Asia Pacific Conf. Circuits Syst. (APCCAS), Ishigaki, Japan, Nov. 2014, pp. 17–20.
[29] M. Weiner, B. Nikolic, and Z. Zhang, “LDPC decoder architecture for high-data rate personal-area networks,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Rio de Janeiro, Brazil, May 2011, pp. 1784–1787.

Hangxuan Cui received the B.S. degree in underwater acoustic engineering from Northwestern Polytechnical University, Xi’an, China, in 2017. He is currently pursuing the Ph.D. degree with Nanjing University.
His research interests include channel coding algorithms and low-power and high-throughput VLSI systems for digital signal processing.


Fakhreddine Ghaffari (Member, IEEE) received the degree in electrical engineering and master’s degree from the National School of Electrical Engineering (ENIS), Tunisia, in 2001 and 2002, respectively, and the Ph.D. degree in electronics and electrical engineering from the University of Sophia Antipolis, France, in 2006.
He is currently an Associate Professor with the Université de Cergy-Pontoise, France. His research interests include VLSI design and implementation of reliable digital architectures for wireless communication applications in ASIC/FPGA platform and the study of mitigating transient faults from algorithmic and implementation perspectives for high-throughput applications.

Khoa Le (Member, IEEE) received the bachelor’s and M.Sc. degrees in electronics engineering from the Ho Chi Minh City University of Technology (HCMUT), Vietnam, in 2010 and 2012, respectively, and the Ph.D. degree from the Université de Cergy-Pontoise, France, in 2017. He is currently a Post-Doctoral Researcher with the ETIS Laboratory, ENSEA, France. His research interest includes error correcting code algorithms, analysis, and their implementations in FPGA/ASIC.

David Declercq (Senior Member, IEEE) was born in June 1971. He received the Ph.D. degree in statistical signal processing from the Université de Cergy-Pontoise, France, in 1998. From 2009 to 2014, he held the junior position with the Institut Universitaire de France. He is currently a Full Professor with ENSEA, Cergy. He is also the General Secretary of the National GRETSI Association. He worked several years on the particular family of LDPC codes, both from the code and decoder design aspects. Since 2003, he has been developing a strong expertise on non-binary LDPC codes and decoders in high order Galois fields GF(q). A large part of his research projects are related to non-binary LDPC codes. He mainly investigated two aspects: the design of GF(q) LDPC codes for short and moderate lengths and the simplification of the iterative decoders for GF(q) LDPC codes with complexity/performance tradeoff constraints. He published more than 40 articles in major journals [the IEEE Transactions on Communications, the IEEE Transactions on Information Theory, the IEEE Communications Letters, and EURASIP Journal on Wireless Communications and Networking (JWCN)], and more than 120 articles in major conferences in information theory and signal processing. His research interests include digital communications and error-correction coding theory.

Jun Lin (Senior Member, IEEE) received the B.S. degree in physics and the M.S. degree in microelectronics from Nanjing University, Nanjing, China, in 2007 and 2010, respectively, and the Ph.D. degree in electrical engineering from Lehigh University, Bethlehem, in 2015. From 2010 to 2011, he was an ASIC Design Engineer with AMD. In summer 2013, he was an Intern with Qualcomm Research, Bridgewater, NJ, USA. In June 2015, he joined the School of Electronic Science and Engineering, Nanjing University, where he is currently an Associate Professor. He was a member of the Design and Implementation of Signal Processing Systems (DISPS) Technical Committee of the IEEE Signal Processing Society. His current research interests include low-power high-speed VLSI design for digital signal processing and deep learning, hardware acceleration for big data processing, and emerging computer architectures. He was a co-recipient of the Merit Student Paper Award at the IEEE Asia Pacific Conference on Circuits and Systems in 2008, the Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI (ISVLSI) in 2019, and the Best Paper Award (The First Place) at the IEEE International Signal Processing Systems (SiPS) in 2019. He was a recipient of the 2014 IEEE Circuits & Systems Society (CAS) Student Travel Award.

Zhongfeng Wang (Fellow, IEEE) received the B.E. and M.S. degrees from the Department of Automation, Tsinghua University, Beijing, China, in 1988 and 1990, respectively, and the Ph.D. degree from the University of Minnesota, Minneapolis, in 2000. He was with Oregon State University and National Semiconductor Corporation. From 2007 to 2016, he was a Leading VLSI Architect with Broadcom Corporation, CA, USA. Since 2016, he has been a Distinguished Professor with Nanjing University, China.
He is a world-recognized expert on Low-Power High-Speed VLSI Design for Signal Processing Systems. He has published more than 200 technical articles with multiple best paper awards received from the IEEE technical societies, among which is the VLSI Transactions Best Paper Award of 2007. He has edited one book VLSI and held more than 20 U.S. and China patents. In the current record, he has had many articles ranking among top 25 most (annually) downloaded manuscripts in the IEEE Transactions on Very Large Scale Integration (VLSI) Systems. His current research interests include optimized VLSI design for digital communications and deep learning. He has also served as a TPC member and various chairs for tens of international conferences. Moreover, he has contributed significantly to the industrial standards. So far, his technical proposals have been adopted by more than 15 international networking standards. In 2015, he was elevated to the Fellow of IEEE for contributions to VLSI design and implementation of FEC coding. In the past, he has served as an Associate Editor for the IEEE Transactions on Circuits and Systems I: Regular Papers, the IEEE Transactions on Circuits and Systems II: Regular Papers, and the IEEE Transactions on Very Large Scale Integration (VLSI) Systems for many terms.

