Toward High-Performance Implementation of Algorithms
2019.
Digital Object Identifier 10.1109/ACCESS.2019.2891597
ABSTRACT The recent evolution of mobile communication systems toward a 5G network is associated
with the search for new types of non-orthogonal modulations such as sparse code multiple access (SCMA).
Such modulations are proposed in response to demands for increasing the number of connected users.
SCMA is a non-orthogonal multiple access technique that offers improved bit error rate performance and
higher spectral efficiency than other comparable techniques, but these improvements come at the cost of
complex decoders. There are many challenges in designing near-optimum high throughput SCMA decoders.
This paper explores means to enhance the performance of SCMA decoders. To achieve this goal, various
improvements to the MPA algorithms are proposed. They notably aim at adapting SCMA decoding to
the single instruction multiple data paradigm. Approximate modeling of noise is performed to reduce the
complexity of floating-point calculations. The effects of forward error correction codes such as polar, turbo,
and LDPC codes, as well as different ways of accessing memory and improving power efficiency of modified
MPAs are investigated. The results show that the throughput of an SCMA decoder can be increased by
3.1 to 21 times when compared to the original MPA on different computing platforms using the suggested
improvements.
INDEX TERMS 5G, BER, exponential estimations, Intel advanced vector extensions (AVX), iterative
multi-user detection, knights corner instruction (KNCI), log-MPA, maximum likelihood (ML), message
passing algorithm (MPA), power efficiency, SCMA, single instruction multiple data (SIMD), streaming
SIMD extension (SSE).
probability density function (PDF). A classical improvement to this bottleneck is the computation of extrinsic information in the logarithm domain, which led to the development of the log-MPA decoder. In [7], fixed-point and floating-point implementations of the MPA and log-MPA on FPGA are studied. The bit error rate performance and complexity of the MPA and log-MPA are compared and it is concluded that using log-MPA with 4 message passing iterations achieves a good tradeoff between performance and complexity. In [8], several complexity reduction techniques are proposed to increase the system throughput. These techniques are 1) SCMA codebook design with minimum number of projections, 2) clustered MPA (CMPA) which defines sub-graphs in MPA and runs MPA on them, and 3) selected exponential computations. In [9], an adaptive Gaussian approximation is used to unselect the edges of the graph with smaller modulus. In addition, mean and variance feedbacks are employed to compensate information loss caused by unselected edges. Users' codebooks play an important role for fast convergence of the MPA or log-MPA. As investigated in [10]–[12], revisiting codebook design can help to reduce the number of iterations needed for MPA decoding of SCMA. In [13], an improved MPA is proposed which eliminates determined user codewords after a certain number of iterations and continues the iterations for undetermined users' codewords. Similarly, in [14], a belief threshold is set to choose the most reliable edge probabilities and continue the iterations for the others. A Shuffled MPA (S-MPA) is introduced in [15]. S-MPA is based on shuffling the messages between function nodes and variable nodes. As a result, the convergence rate is accelerated. A Monte Carlo Markov Chain Method is proposed in [16] to decode SCMA signals and sphere decoding is also explored in [17] and [18] for SCMA receiver design.

The main difference between this work and previously cited works is that the present paper combines an analytic view of MPA complexity with hardware and resource aware programming, exploiting hardware features available in general purpose processors (GPPs). The SCMA decoding algorithms are revised in light of the needs of Cloud Radio Access Networks (C-RANs) and to take full advantage of key hardware features available in GPPs such as their SIMD engine. In the early 2000s, the performance of many processors improved significantly due to clock rate increases. This increase of performance needed very minimal if any programming effort; however, the drawbacks of high clock rates were more power and energy consumption, overheating of processors, leakage currents and signal integrity problems. These disadvantages led designers to follow new paradigms such as thread-level and data-level parallelisms that provide good performance at lower clock speeds. Another challenge was data access efficiency in cache and RAM for performance critical algorithms. Higher performance also came from improved cache access efficiency of heterogeneous processors and parallel access to the L1 cache through vectorized instructions. Therefore, complicated and control heavy algorithms such as MPA have to be adapted for efficient execution on heterogeneous architectures and their exploitable parallelism must be expressed at every level of the code, whether in arithmetic or memory access instructions. Particularly, various Single Instruction Multiple Data (SIMD) extensions and thread-level parallelism are used to increase the throughput of MPA decoding on various platforms.

This paper reports on two contributions that can be useful for any variation of the aforementioned MPA. First, MPA and log-MPA have been adapted to use SIMD extensions leveraging the available data-level parallelism. The algorithms are revised to have aligned and contiguous access to memory, which is crucial to achieve high memory throughput. Various SIMD instruction set architectures (ISAs) such as Advanced Vector Extensions (AVX), Streaming SIMD Extension (SSE), Knights Corner Instruction (KNCI) and ARM NEON are used to enhance the performance of various parts of the algorithm. Multi-threaded programming techniques and power efficiency are also studied in this paper.

Second, efforts have been made to reduce the high dynamic ranges and high storage burden that are induced by the numerous calculations of the exponential function embedded in MPA, which is one of its main points of congestion. To eliminate calculations of the exponentials in the MPA, this paper uses approximate modeling of noise. Indeed, a Gaussian Probability Density Function (PDF) is estimated with sub-optimal, bell shaped, polynomial PDFs. Using polynomial PDFs enables a significant throughput improvement with a very small degradation on the bit error rate performance. In addition, this technique enables the use of vectorized instructions for the calculation of the probabilities, as opposed to log-MPA. Details will be explained in the sequel. The impacts of turbo codes [19], polar codes [20] and LDPC codes [21] are investigated.

FIGURE 1. a) SCMA encoder with 6 users (layers) and 4 physical resources, b) SCMA uplink chain with channel coding, c) Factor graph representation of a decoder, d) Message Passing Algorithm based on Bayesian factor graph: (I) Resource to user message, (II) Guess swap at each user and user to resource message, (III) Final guess at each user.

In this paper, symbols $\mathbb{B}$, $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$ and $\mathbb{C}$ denote binary, natural, integer, real and complex numbers. Scalar, vector and matrix are presented as $x$, $\mathbf{x}$, $\mathbf{X}$ respectively. The n-th element of a vector is denoted by $x_n$, and $X_{n,m}$ is the element in the n-th row and m-th column of matrix $\mathbf{X}$. Notation $\mathrm{diag}(\mathbf{x})$ shows a diagonal matrix where its n-th diagonal element is $x_n$. In addition, the transpose of a matrix is expressed as $\mathbf{X}^T$. The paper is organized as follows. Section II introduces the SCMA algorithm. Maximum Likelihood, MPA and log-MPA decoding methods are explained in this section as a background to this research. Section III elaborates on proposed improvements such as vectorizing the algorithm, exponential estimations, contiguous access to memory and other hardware oriented techniques. Section IV explores the bit error rate performance as well as the throughput, the latency, the power consumption, and the energy efficiency of the proposed MPA and log-MPA implementations. Some profiling metrics are given to better understand the results. Section V is dedicated to study
the effects of suggested techniques on block error rate after channel coding. Finally, the main findings of this research are summarized in Section VI.

II. BACKGROUND
A. OVERVIEW OF THE SCMA SYSTEM MODEL
An SCMA encoder with J users (layers) and K physical resources is a function that maps a binary stream of data to K-dimensional complex constellations $f : \mathbb{B}^{\log_2(M)} \rightarrow \mathcal{X}$, $\mathbf{x} = f(\mathbf{b})$, where $\mathcal{X} \subset \mathbb{C}^{K}$. The K-dimensional complex codeword $\mathbf{x}$ is a sparse vector with N < K non-zero entries. Each layer j = 1, ..., J has its own codebook to generate the desired codeword according to the binary input stream. Fig. 1 shows the SCMA uplink chain with J = 6, K = 4 and N = 2. SCMA codewords are spread over K physical resources, such as OFDMA tones. Fig. 1a shows that in the multiplexed scheme of SCMA, all chosen codewords of the J layers are added together after being multiplied by the channel coefficient $\mathbf{h}_j$. Then, the entire uplink chain is shown in Fig. 1b. The output of the SCMA encoder is affected by the white additive noise $\mathbf{n}$:

$$\mathbf{y} = \sum_{j=1}^{J} \mathrm{diag}(\mathbf{h}_j)\,\mathbf{x}_j + \mathbf{n}, \tag{1}$$

where $\mathbf{x}_j = (x_1, \ldots, x_{Kj})^T$ and $\mathbf{h}_j = (h_1, \ldots, h_{Kj})^T$ are respectively the codeword and channel coefficients of layer j.

B. SCMA DETECTION SCHEMES
1) MAXIMUM LIKELIHOOD
For an arbitrary codeword, the optimum decision, i.e. the one that minimizes the likelihood of transmission errors after decoding, is the one resulting from the Maximum Likelihood (ML) estimation, which can be described as:

$$\hat{\mathbf{x}}_{ML} = \arg\min_{\mathbf{c} \in \mathcal{X}} \|\mathbf{y} - \mathbf{c}\|^2, \tag{2}$$

given the received codeword. In (2), the soft outputs $\hat{\mathbf{x}}$ are also called Log-Likelihood Ratios (LLRs) that can be calculated with the following equation:

$$LLR_x = \ln\left(\frac{\sum_{\mathbf{c} \in L_x^0} P(\mathbf{y}|\mathbf{c})}{\sum_{\mathbf{c} \in L_x^1} P(\mathbf{y}|\mathbf{c})}\right), \tag{3}$$

where $LLR_x$ is the log likelihood ratio of bit x obtained from codeword $\hat{\mathbf{x}}$. This codeword comes from $L_x^1$, the set of codewords in which bit x is 1, and $L_x^0$, the set of codewords in which bit x is 0. The probability function $P(\mathbf{y}|\mathbf{c})$ can be expressed as in (4) when a signal is transmitted over an additive white Gaussian channel with $\sigma^2$ variance:

$$P(\mathbf{y}|\mathbf{c}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\|\mathbf{y} - \mathbf{c}\|^2}{2\sigma^2}\right). \tag{4}$$

Although the ML method provides the best guess for $\hat{\mathbf{x}}_{ML}$, performing the computation with this method requires unacceptable complexity in real applications. In the case of six users and codebook matrices of size 4 × 4 as in Fig. 1a, the calculation of the soft output for each bit in (4) needs 4096 exponential function computations, which is unacceptable. Nevertheless, in this article the result of this method is used as a reference against which practical methods are compared, in order to characterize the BER performance degradation of MPA and log-MPA.

2) MESSAGE PASSING ALGORITHM (MPA)
Fig. 1c shows a Bayesian factor graph representation of an MPA decoder with six users and four physical resources. Thanks to the sparsity of the codebooks, exactly three users collide in each physical resource. There are four possible codewords in each of the three connected users' codebooks, which gives 64 possible combined codewords in each physical resource. In the first step of the MPA, the 64 distances between each possible combined codeword and the actual received codeword are calculated.

$$d_{RES\beta}(\mathbf{m}, \mathbf{H}) = \Big\| y_\beta - \sum_{l \subset \zeta,\; m_u \in \{1,\ldots,K\}} h_{l,m_u}\, x_{l,m_u} \Big\|, \tag{5}$$

As shown in Fig. 1d(II), there are only two resources connected to each user. A message from a user to a resource is a normalized guess swap at the user node:

$$\mu_{UE3 \rightarrow RES1}(i) = \frac{\mu_{RES3 \rightarrow UE3}(i)}{\sum_{i} \mu_{RES3 \rightarrow UE3}(i)}. \tag{9}$$

Message passing between users and resources (see (8) and (9)) is repeated three to eight times to reach the desired decoding performance. The final belief at each user $B(i)$ is the multiplication of all incoming messages, as illustrated in Fig. 1d(III) and (10) for UE3 and codeword i. Finally, (11) is used to calculate soft outputs for bit $b_x$:

$$B_3(i) = \mu_{RES1 \rightarrow UE3}(i) \times \mu_{RES3 \rightarrow UE3}(i), \tag{10}$$

$$LLR_x = \ln\left(\frac{P(\mathbf{y}|b_x = 0)}{P(\mathbf{y}|b_x = 1)}\right) = \ln\left(\frac{\sum_{m} B_m(i) \text{ when } b_x = 0}{\sum_{m} B_m(i) \text{ when } b_x = 1}\right). \tag{11}$$

3) LOG-MPA
Since the calculation of exponentials in (7) requires relatively high computational effort, changing the algorithm to the log domain using the Jacobi formula (12) is a classical improvement of MPA:

$$\ln\left(\sum_{i=1}^{N} \exp(f_i)\right) \approx \max\{f_1, f_2, \ldots, f_N\}. \tag{12}$$
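The approximation in (12) can be checked numerically. The following is a minimal Python sketch, not the decoder implementation discussed in this paper; the function names and the sample values of f_i are hypothetical and chosen only for illustration. It compares the exact log-sum-exp with the max-log shortcut and shows that the shortcut never exceeds the exact value and underestimates it by at most ln N.

```python
import math

def log_sum_exp(values):
    """Exact ln(sum_i exp(f_i)), computed stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def max_log(values):
    """Max-log shortcut of (12): keep only the dominant term of the sum."""
    return max(values)

# Hypothetical log-domain metrics (illustrative values only).
f = [-0.5, -3.0, -7.5, -12.0]

exact = log_sum_exp(f)
approx = max_log(f)
error = exact - approx  # always in [0, ln N]

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
print(f"error  = {error:.4f} (bound ln N = {math.log(len(f)):.4f})")
```

The error is small when one term dominates the sum and reaches ln N only when all N terms are equal, which is why the shortcut trades a bounded loss of accuracy for the removal of every exponential.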
calculations and to drastically reduce the number of instructions for SSE, NEON, AVX and KNCI ISAs.

in (5) and (6), 2) calculate the exponentials in (7), 3) perform users to resources messaging and final guesses at each user.

where $N_j$ is the size of the $j$-th dimension of the array and $n_i$ is the location of a target element in that dimension. Improving data locality with a stride of a single floating-point number in each element makes it easier for the processor to have aligned and contiguous accesses to the memory through SIMD ISA.
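The indexing expression that this sentence refers to is not reproduced above. Purely as an illustration, the Python sketch below assumes a conventional row-major layout, which may differ from the layout actually used; the function name `flat_index` and the array shape are hypothetical. It shows that neighbouring elements of the last dimension are then one element apart in memory, which is the unit stride that contiguous SIMD loads rely on.

```python
def flat_index(indices, shape):
    """Offset of a multi-dimensional index in a flattened array.

    Assumes row-major order (last dimension varies fastest); this layout is
    an assumption made for illustration, not taken from the text above.
    """
    if len(indices) != len(shape):
        raise ValueError("indices and shape must have the same length")
    offset = 0
    for n, size in zip(indices, shape):
        if not 0 <= n < size:
            raise IndexError(f"index {n} out of range for dimension of size {size}")
        offset = offset * size + n  # Horner-style accumulation over dimensions
    return offset

# Hypothetical 3-D array: 4 x 3 x 64 single-precision values.
shape = (4, 3, 64)
a = flat_index((1, 2, 10), shape)
b = flat_index((1, 2, 11), shape)
print(a, b, b - a)  # consecutive offsets along the last dimension
```

With this layout, a loop over the last dimension touches consecutive addresses, so a whole register of elements can be fetched with a single load.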
Utilizing SIMD instructions helps to reduce the total number of mispredicted branches in the algorithm. Contiguous accesses to the L1 cache are performed in chunks of 128, 256 or 512 bits. This reduces the number of iterations in the for-loops and consequently the number of branches. For example, for a vector of sixty-four 32-bit floating-point numbers, 64 iterations are needed in the scalar mode, while only 16, 8 or 4 iterations are required in the vectorized modes using SSE (or NEON), AVX or KNCI ISAs, respectively.
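The iteration counts quoted above follow directly from the register widths. The short Python sketch below reproduces that arithmetic; the helper name `loop_iterations` is hypothetical, and the calculation ignores loop overhead and any scalar tail handling.

```python
FLOAT_BITS = 32  # width of one single-precision element

# Register widths quoted in the text; "scalar" handles one float per iteration.
REGISTER_BITS = {"scalar": 32, "SSE/NEON": 128, "AVX": 256, "KNCI": 512}

def loop_iterations(num_floats, register_bits, float_bits=FLOAT_BITS):
    """Iterations needed when each pass processes one full register of floats."""
    lanes = register_bits // float_bits
    return -(-num_floats // lanes)  # ceiling division for non-multiple lengths

counts = {isa: loop_iterations(64, bits) for isa, bits in REGISTER_BITS.items()}
for isa, n in counts.items():
    print(f"{isa:9s} {n:3d} iterations")
```

Each halving of the iteration count also halves the number of loop-closing branches, which is the source of the reduction in mispredicted branches mentioned above.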
TABLE 1. Throughput, latency, power and energy characteristics.

throughput and latency are improved by much larger factors. This means that the overall energy consumption has been decreased with AVX.
2) ARM CORTEX-A57
On this platform [27], the throughput difference caused by
the fast math libraries of the GNU compiler is still visible for
MPA and log-MPA algorithms. With level three optimization
(-O3), MPA and log-MPA run at 1.60 Mbps and 3.01 Mbps
respectively. When using fast math libraries (-Ofast), the
throughputs increase to 4.07 Mbps and 4.70 Mbps, respectively. It should be
noted that the four physical cores of the ARM platform were
utilized for those tests. Power consumption and energy used
per decoded bit are lower on the ARM platform than on the
Intel processors. The low power consumption of the ARM
platform notably comes at the cost of less powerful floating-
point arithmetic units (cf. MPA+NEON and E-MPA+NEON
in Table 1). Eliminating the exponential computations almost
doubled the performance in E-MPA (15.30 Mbps) as com-
pared to MPA+NEON (8.40 Mbps), which shows the limits of
low power processors when calculating many exponentials.
Nevertheless, by using E-MPA, the ARM low power proces-
sors can be a good candidate for implementation of SCMA
decoders on C-RAN servers as it allows significant energy
savings.
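The relative gains behind these figures can be recomputed from the throughputs quoted in this subsection. The Python sketch below only restates that arithmetic; the dictionary keys and the helper `speedup` are hypothetical labels, and the -Ofast values are assumed to be listed in the same order (MPA, then log-MPA) as the -O3 values.

```python
# Throughputs in Mbps as quoted in the text for the ARM Cortex-A57 platform.
throughput_mbps = {
    "MPA -O3": 1.60,
    "log-MPA -O3": 3.01,
    "MPA -Ofast": 4.07,
    "log-MPA -Ofast": 4.70,
    "MPA+NEON": 8.40,
    "E-MPA": 15.30,
}

def speedup(new, old):
    """Ratio of two quoted throughputs (values above 1 mean 'new' is faster)."""
    return throughput_mbps[new] / throughput_mbps[old]

fast_math_mpa = speedup("MPA -Ofast", "MPA -O3")
fast_math_log = speedup("log-MPA -Ofast", "log-MPA -O3")
no_exponentials = speedup("E-MPA", "MPA+NEON")

print(f"-Ofast vs -O3, MPA:     x{fast_math_mpa:.2f}")
print(f"-Ofast vs -O3, log-MPA: x{fast_math_log:.2f}")
print(f"E-MPA vs MPA+NEON:      x{no_exponentials:.2f}")
```

The last ratio, about 1.8, is what the text describes as almost doubling the performance.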
does not claim any novelty in channel coding, however, we found it crucial to validate our proposed SCMA optimizations in a sufficiently complete communication chain.

B. CHANNEL CODING CONFIGURATIONS
1) TURBO CODES
In a first validation, the turbo code from the LTE standard is used. In the decoder, 6 iterations are done. The two sub-decoders implement the max-log Maximum A Posteriori algorithm (max-log-MAP) [31] with a 0.75 scaling factor [32]. In Fig. 7a, the rate is R ≈ 1/3, no puncturing is performed, the number of information bits K is 1024 and the codeword length N is 3084. In Fig. 7b, R ≈ 1/2 with the puncturing of half of the parity bits, K = 2048, and N = 4108.

2) LDPC CODES
In a second set of validations, the LDPC codes used in this paper are based on MacKay matrices that have been taken from [33]. In Fig. 7a, the matrix used is (K = 272, N = 816), and in Fig. 7b the matrix is (K = 2000, N = 4000). In both figures, the decoder used is a Belief Propagation (BP) decoder with a Horizontal Layered scheduling [34]. For the update rules, the Sum-Product Algorithm (SPA) has been used [35]. The number of iterations is 100.

3) POLAR CODES
In the final validation, polar codes are built by suitably selecting the frozen bits. We used the Gaussian Approximation (GA) technique of [36]. The input SNR for the code construction with the GA is 1 dB, which is apparently very low considering that the SNRs are 4 to 5 dB in the convergence zone. This is motivated by the fact that the GA algorithm is designed to work with the BPSK modulation. Using SCMA completely modifies the histogram of the LLR values for a given SNR. Therefore, a shift on the input SNR of the GA algorithm must be applied in order to efficiently select the frozen bits. If this shift is not applied, the decoding performance of the polar code degrades drastically. The number of information bits and the codeword length are (K = 682, N = 2048) in Fig. 7a and (K = 2048, N = 4096) in Fig. 7b. The decoder is a Successive Cancellation List (SCL) decoder with L = 32 and a 32-bit GZIP CRC that was proposed in [37].

C. EFFECTS OF E-MPA ON ERROR CORRECTION
In Fig. 7, the number of iterations of the SCMA demodulator is 5. The objective of simulating multiple channel codes is not to compare them with each other. A fair comparison of the different channel codes would indeed require using the same code lengths and, more importantly, comparing their computational complexity, which is not the case here. Our goal here is to study the impact of using E-MPA on the BER and FER performances when the channel codes are included in the communication chain. For each channel code, two curves are plotted: one for the E-MPA and the other for the MPA. Only 0.2 to 0.4 dB separate the two versions of the algorithm for all the considered channel codes. These results show the extent to which uncertainty of estimations affects channel coding. The decoding speed improvement brought by the E-MPA algorithm has a cost in terms of decoding performance. This trade-off should be considered in order to meet the system constraints.

VI. CONCLUSIONS
In this paper, in consideration of the potential of Cloud-RAN that would support 5G communication, we focused on improving the efficiency of 5G SCMA receivers on the type of multiprocessors that can be found in such servers. We provided test results using different platforms such as ARM Cortex, Xeon-Phi and Core-i7. The benefits of using SIMD and various algorithmic simplifications have been studied and test results were presented. Among the platforms, the ARM Cortex-A57 was shown to offer the lowest energy consumption per decoded bit, while many-core platforms such as Xeon-Phi Knight’s Corner 7120P had the best throughput. In addition, an estimation of conditional probabilities using polynomial distributions instead of Gaussian distribution was proposed to increase throughput. This estimation has been shown to offer throughput improvements of 15 to 90 percent depending on the platform used, while it causes a very small degradation of BLER after channel decoding. To support this claim, the error performance of telecommunication chains combining MPA and E-MPA with channel coding with LDPC, polar codes and turbo codes with code rates R = 1/3 and R = 1/2 were tested.

REFERENCES
[1] S. M. R. Islam, N. Avazov, O. A. Dobre, and K.-S. Kwak, “Power-domain non-orthogonal multiple access (NOMA) in 5G systems: Potentials and challenges,” IEEE Commun. Surveys Tuts., vol. 19, no. 2, pp. 721–742, 2nd Quart., 2017.
[2] H. Nikopour and H. Baligh, “Sparse code multiple access,” in Proc. IEEE Int. Symp. Pers. Indoor Mobile Radio Commun., London, U.K., Sep. 2013, pp. 332–336.
[3] 5G Algorithm Innovation Competition. (2015). Altera Innovate Asia FPGA Design Contest. [Online]. Available: http://www.innovateasia.com/5g/en/gp2.html
[4] NGMN Alliance, “5G white paper,” Next Gener. Mobile Netw., Frankfurt, Germany, White Paper, 2015, pp. 1–125.
[5] L. Lei, C. Yan, G. Wenting, Y. Huilian, W. Yiqun, and X. Shuangshuang, “Prototype for 5G new air interface technology SCMA and performance evaluation,” China Commun., vol. 12, no. 9, pp. 38–48, Sep. 2015.
[6] S. Zhang, X. Xu, L. Lu, Y. Wu, G. He, and Y. Chen, “Sparse code multiple access: An energy efficient uplink approach for 5G wireless systems,” in Proc. IEEE Global Commun. Conf., Dec. 2014, pp. 4782–4787.
[7] J. Liu, G. Wu, S. Li, and O. Tirkkonen, “On fixed-point implementation of Log-MPA for SCMA signals,” IEEE Wireless Commun. Lett., vol. 5, no. 3, pp. 324–327, Jun. 2016.
[8] A. Bayesteh, H. Nikopour, M. Taherzadeh, H. Baligh, and J. Ma, “Low complexity techniques for SCMA detection,” in Proc. IEEE Globecom Workshops, San Diego, CA, USA, Dec. 2015, pp. 1–6.
[9] Y. Du, B. Dong, Z. Chen, J. Fang, P. Gao, and Z. Liu, “Low-complexity detector in sparse code multiple access systems,” IEEE Commun. Lett., vol. 20, no. 9, pp. 1812–1815, Sep. 2016.
[10] M. Taherzadeh, H. Nikopour, A. Bayesteh, and H. Baligh, “SCMA codebook design,” in Proc. IEEE Veh. Technol. Conf., Las Vegas, NV, USA, Sep. 2014, pp. 1–5.
[11] J. Peng, W. Chen, B. Bai, X. Guo, and C. Sun, “Joint optimization of constellation with mapping matrix for SCMA codebook design,” IEEE Signal Process. Lett., vol. 24, no. 3, pp. 264–268, Mar. 2017.
[12] C. Yan, G. Kang, and N. Zhang, “A dimension distance-based SCMA codebook design,” IEEE Access, vol. 5, pp. 5471–5479, 2017.
[13] M. Jia, L. Wang, Q. Guo, X. Gu, and W. Xiang, “A low complexity detection algorithm for fixed up-link SCMA system in mission critical scenario,” IEEE Internet Things J., vol. 5, no. 5, pp. 3289–3297, Oct. 2018.
[14] L. Yang, Y. Liu, and Y. Siu, “Low complexity message passing algorithm for SCMA system,” IEEE Commun. Lett., vol. 20, no. 12, pp. 2466–2469, Dec. 2016.
[15] Y. Du, B. H. Dong, Z. Chen, J. Fang, and L. Yang, “Shuffled multiuser detection schemes for uplink sparse code multiple access systems,” IEEE Commun. Lett., vol. 20, no. 6, pp. 1231–1234, Jun. 2016.
[16] J. Chen, Z. Zhang, S. He, J. Hu, and G. E. Sobelman, “Sparse code multiple access decoding based on a Monte Carlo Markov chain method,” IEEE Signal Process. Lett., vol. 23, no. 5, pp. 639–643, May 2016.
[17] L. Yang, X. Ma, and Y. Siu, “Low complexity MPA detector based on sphere decoding for SCMA,” IEEE Commun. Lett., vol. 21, no. 8, pp. 1855–1858, Aug. 2017.
[18] F. Wei and W. Chen, “Low complexity iterative receiver design for sparse code multiple access,” IEEE Trans. Commun., vol. 65, no. 2, pp. 621–634, Feb. 2017.
[19] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1,” in Proc. IEEE Int. Conf. Commun. (ICC), vol. 2, Geneva, Switzerland, May 1993, pp. 1064–1070.
[20] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[21] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21–28, Jan. 1962.
[22] Intel. (2018). Intel C++ Compiler 18.0 Developer Guide and Reference. [Online]. Available: https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-fxsave64
[23] A. Cassagne, O. Aumage, D. Barthou, C. Leroux, and C. Jégo, “MIPP: A portable C++ SIMD wrapper and its use for error correction coding in 5G standard,” in Proc. Workshop Program. Models SIMD/Vector Process. (WPMVP), Vösendorf, Austria, Feb. 2018, pp. 1–8.
[24] A. Ghaffari, M. Léonardon, Y. Savaria, C. Jégo, and C. Leroux, “Improving performance of SCMA MPA decoders using estimation of conditional probabilities,” in Proc. 15th IEEE Int. New Circuits Syst. Conf. (NEWCAS), Jun. 2017, pp. 21–24.
[25] GCC. (2018). Semantics of Floating Point Math in GCC. [Online]. Available: https://gcc.gnu.org/wiki/FloatingPointMath
[26] L. Torvalds. (2018). Turbostat. [Online]. Available: https://github.com/torvalds/linux/tree/master/tools/power/x86/turbostat
[27] NVIDIA. (2018). Jetson TX1. [Online]. Available: https://www.nvidia.com/fr-fr/autonomous-machines/embedded-systems-dev-kits-modules/
[28] G. Chrysos, “Intel Xeon Phi coprocessor (codename Knights Corner),” in Proc. IEEE Hot Chips 24 Symp. (HCS), Cupertino, CA, USA, Sep. 2012, pp. 1–31.
[29] Multiplexing and Channel Coding (Release 15), document TS 38.212, 3GPP, Sep. 2017.
[30] Multiplexing and Channel Coding (Release 11), document TS 136.212, 3GPP, Feb. 2013.
[31] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain,” in Proc. IEEE Int. Conf. Commun. (ICC), vol. 2, Jun. 1995, pp. 1009–1013.
[32] J. Vogt and A. Finger, “Improving the max-log-MAP turbo decoder,” Electron. Lett., vol. 36, no. 23, pp. 1937–1939, Nov. 2000.
[33] D. J. MacKay. (2018). Encyclopedia of Sparse Graph Codes. [Online]. Available: http://www.inference.org.uk/mackay/codes/data.html
[34] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, “High throughput low-density parity-check decoder architectures,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), vol. 5, Nov. 2001, pp. 3019–3024.
[35] D. J. C. MacKay, “Good error-correcting codes based on very sparse matrices,” IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999.
[36] P. Trifonov, “Efficient design and decoding of polar codes,” IEEE Trans. Commun., vol. 60, no. 11, pp. 3221–3227, Nov. 2012.
[37] M. Léonardon, A. Cassagne, C. Leroux, C. Jégo, L. Hamelin, and Y. Savaria, “Fast and flexible software polar list decoders,” CoRR, vol. abs/1710.08314, pp. 1–11, Oct. 2017. [Online]. Available: http://arxiv.org/abs/1710.08314

ALIREZA GHAFFARI received the B.Sc. degree from the Amirkabir University of Technology (Tehran Polytechnic), Iran, and the M.Sc. degree from Laval University, Canada. He has a wide range of industrial and research experience in firmware design, FPGA, hardware acceleration, and the IoT. Recently, he has focused more on hardware/software acceleration in heterogeneous platforms and clouds. He is currently an Enthusiast in electrical engineering and a Research Associate with the Electrical Engineering Department, École Polytechnique de Montréal.

MATHIEU LÉONARDON received the degree in engineering from Bordeaux INP, Bordeaux, France, in 2015. He is currently pursuing the Ph.D. degree with the École Polytechnique de Montréal and the University of Bordeaux under a co-directorship between both institutions. His research interests include the design of efficient and flexible implementations, both hardware and software, for decoding error-correcting codes, in particular polar codes.

ADRIEN CASSAGNE received the M.Sc. degree in computer science from the University of Bordeaux, France, in 2013, where he is currently pursuing the Ph.D. degree. His research interests include the design of efficient and flexible software implementations for modern decoding error-correcting codes such as LDPC, turbo, and polar codes. More precisely, he looks at different aspects of parallelism such as multi-node, multi-threading, or vectorization.

CAMILLE LEROUX received the M.Sc. degree in electronics engineering from the University of South Brittany, Lorient, France, in 2005, and the Ph.D. degree in electronics engineering from TELECOM Bretagne, Brest, France, in 2008. He was a Visiting Student with the Electrical and Computer Engineering Department, Aalborg University, Denmark, in 2004, and also with the University of Alberta, AB, Canada, in 2005. From 2008 to 2011, he was a Postdoctoral Research Associate with the Electrical and Computer Engineering Department, McGill University, Montreal, QC, Canada. He has been an Associate Professor with Bordeaux INP, since 2011. His research interests encompass algorithmic and architectural aspects of channel decoder implementation. More generally, he is interested in the hardware and software implementation of computationally intensive algorithms in a real-time environment.
YVON SAVARIA (S’77–M’86–F’08) received the B.Ing. and M.Sc.A. degrees from the École Polytechnique Montreal, in 1980 and 1982, respectively, and the Ph.D. degree from McGill University, in 1985, all in electrical engineering. Since 1985, he has been with the École Polytechnique de Montréal, where he is currently a Professor with the Department of Electrical Engineering. He has carried out work in several areas related to microelectronic circuits and microsystems, such as testing, verification, validation, clocking methods, defect and fault tolerance, the effects of radiation on electronics, high-speed interconnects and circuit design techniques, CAD methods, reconfigurable computing and the applications of microelectronics to telecommunications, aerospace, image processing, video processing, radar signal processing, and digital signal processing acceleration. He is currently involved in several projects that relate to aircraft embedded systems, radiation effects on electronics, asynchronous circuits design and test, green IT, wireless sensor networks, virtual networks, machine learning, computational efficiency, and application specific architecture design. He holds 16 patents. He has published 140 journal papers and 440 conference papers. He was the thesis advisor of 160 graduate students who completed their studies.

He has been a Consultant or was sponsored for carrying research by Bombardier, Inc., CNRC, Design Workshop, DREO, Ericsson, Genesis, Gennum, Huawei, Hyperchip, ISR, Kaloom, LTRIM, Miranda, MiroTech, Nortel, Octasic, PMC-Sierra, Technocap, Thales, Tundra, and VXP. He is a Fellow of IEEE. He has been a member of CMC Microsystems Board, since 1999. He is a member of the Regroupement Stratégique en Microélectronique du Québec and of the Ordre des Ingénieurs du Québec. He was a Tier 1 Canada Research Chair on design and architectures of advanced microelectronic systems, from 2001 to 2015. He received the Synergy Award of the Natural Sciences and Engineering Research Council of Canada, in 2006. He was a Program Co-Chairman of ASAP’2006 and NEWCAS’2018, and the General Co-Chair of ASAP’2007. He was a Chairman of CMC Microsystems Board, from 2008 to 2010.