
4648 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 11, NOVEMBER 2021

Instruction-Set Accelerated Implementation of CRYSTALS-Kyber
Mojtaba Bisheh-Niasar, Student Member, IEEE, Reza Azarderakhsh, Member, IEEE,
and Mehran Mozaffari-Kermani, Senior Member, IEEE

Abstract— Large-scale quantum computers will break classical public-key cryptography protocols using quantum algorithms such as Shor's algorithm. Hence, designing quantum-safe cryptosystems to replace current classical algorithms is crucial. Fortunately, there are some post-quantum candidates that are assumed to be resistant against future attacks from quantum computers, and NIST is considering standardizing them. Among these candidates, lattice-based cryptography is particularly attractive because of its performance results as well as the confidence in its security. Few works in the literature evaluate the performance of lattice-based cryptography in hardware. In this paper, we focus on the Cryptographic Suite for Algebraic Lattices (CRYSTALS) key exchange mechanism known as Kyber, provide an instruction-set hardware architecture, and implement it on a Xilinx Artix-7 FPGA for performance evaluation and testing. Our proposed architecture provides an efficient and high-performance set of components to perform polynomial sampling, number-theoretic transform (NTT), and point-wise multiplication to speed up lattice-based post-quantum cryptography (PQC). This architecture implemented on ASIC outperforms state-of-the-art implementations.

Index Terms— ASIC, FPGA, hardware architecture, Kyber, lattice-based cryptography, post-quantum cryptography.

Manuscript received December 28, 2020; revised April 13, 2021 and June 10, 2021; accepted August 11, 2021. Date of publication August 30, 2021; date of current version November 9, 2021. This work was supported by NSF under Grant 1801341. This article was recommended by Associate Editor S. Yin. (Corresponding author: Mojtaba Bisheh-Niasar.)
Mojtaba Bisheh-Niasar is with the Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431 USA, and also with I-SENSE, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]).
Reza Azarderakhsh is with the Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431 USA, also with I-SENSE, Florida Atlantic University, Boca Raton, FL 33431 USA, and also with PQSecure Technologies LLC, Boca Raton, FL 33431 USA (e-mail: [email protected]).
Mehran Mozaffari-Kermani is with the Department of Computer Engineering and Science, University of South Florida, Tampa, FL 33620 USA (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2021.3106639.
Digital Object Identifier 10.1109/TCSI.2021.3106639
1549-8328 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

Quantum computing development constitutes a significant threat to classical public-key cryptography protocols because of Shor's algorithm [1]. Most current cryptosystems, e.g., RSA and Elliptic Curve Cryptography (ECC), are expected to be broken once large quantum computers are built. Thus, designing lattice-based cryptosystems, one of the most promising families of algorithms in Post-Quantum Cryptography (PQC) based on alternative mathematical features, has become a fundamental research topic.

Recently, the National Institute of Standards and Technology (NIST) announced the third-round finalists, which include 4 key encapsulation mechanisms (KEMs) and 3 signature schemes [2]. Among these KEM schemes, CRYSTALS-Kyber shares a common framework with the CRYSTALS-Dilithium signature scheme [2]. This scheme also supports efficient matrix-vector and vector-vector multiplication over a polynomial ring using the fast number-theoretic transform (NTT) [3]. Although the optimization of NTT-based multiplication is not a new idea and is used in countless applications, particularly in signal processing, it is still a performance bottleneck in lattice-based cryptography implementations. Thus, several works have optimized NTT from different perspectives, such as resource utilization, performance, efficiency, and energy consumption.

Recently, implementations of lattice-based cryptography have been investigated on various platforms. While software (SW) implementations offer programming capabilities, flexibility, and a shorter design cycle, hardware (HW) platforms accelerate the computations and result in significantly higher throughput. Recently, there have been considerable efforts to implement cryptosystems using hardware-software (HW/SW) co-design. This method makes the design smaller, slower, and more controllable/programmable compared to pure HW schemes at the cost of implementing a software-based processor. Furthermore, a HW/SW co-design requires a shorter design period; nevertheless, this method may not lead to the best performance. On the other hand, pure hardware implementations can be significantly accelerated using well-known optimization strategies, including register balancing, parallelization, and resource sharing, to increase the overall throughput of the hardware architectures. The main difficulty of this strategy is that its hand-optimized design requires a longer time and may come at the cost of losing flexibility.

To transition to PQC, we must develop hybrid cryptosystems to maintain industry or government regulations, while PQC updates will be applied thoroughly. Therefore, classical cryptosystems, e.g., ECC, cannot be eliminated even if PQC is significantly developed. The instruction-set processor builds an appropriate platform for accelerated implementation compared to SW and HW/SW, while the architecture remains flexible compared to highly optimized HW. Specifically, the flexible HW architecture is a promising solution for
integrating classic cryptosystems and PQC to move towards hybrid systems.

Kyber is notable for high-speed and constant-time implementations. It has to be implemented on various platforms subject to the performance requirement. However, Kyber has not received sufficient study in the field of hardware implementation. Therefore, investigation of the hardware implementation is required, considering the advantages of FPGA-based architectural designs in exploiting parallelism, which leads to improvements in the efficiency of the overall system. In this paper, we implement a pure hardware design since it is faster and could be integrated into any HW/SW co-design solution.

A. Related Work

Software implementation of Kyber has been studied by Botros et al. in [3], proposing a memory-efficient high-speed implementation on Cortex-M4. Recently, several PQC schemes have been implemented, targeting HW/SW co-design. The work of [4] was one of the first initiatives of post-quantum acceleration using high-level synthesis (HLS). Furthermore, Banerjee et al. in [5] proposed a flexible ASIC crypto-processor to support several lattice-based algorithms in a RISC-V architecture, including Frodo, NewHope, qTESLA, and CRYSTALS-Kyber/Dilithium. This work is extended in [6] to show FPGA validation results. Their design strategy targets reducing power consumption. The authors in [7] employ the RISC-V processor integrated with a finite field multiplier to accelerate polynomial multiplications in a lightweight architecture of NewHope and Kyber. In [8], vectorized modular arithmetic and NTT computations are proposed employing RISC-V for NewHope, Kyber, and Saber. The vector processor architecture based on the extensible RISC-V architecture has been studied in [9], which shows a remarkable speedup while occupying 979k gate equivalents (GE) in ASIC implementations.

Pure hardware architectures of Kyber are proposed in [10]–[13]. The work of [10] heavily relies on BlockRAM primitives between components to perform arithmetic tasks and store intermediate results. We addressed the high-performance implementation of Kyber in our previous work [13] as the fastest Kyber design in the literature. The authors in [14] proposed a Kyber processor for computing NTT and point-wise multiplication. An instruction-set coprocessor for Saber is presented in [15] to design a flexible hardware architecture using the quadratic-complexity schoolbook polynomial multiplication algorithm. Schoolbook polynomial multiplication is also employed in [16].

Since NTT plays a central role in lattice-based cryptography, several hardware implementations focus on NTT from performance, efficiency, and flexibility perspectives. The work of [17], [18] introduced a scalable NTT architecture that can be used for various lattice-based schemes. Furthermore, the authors in [19] proposed a RISC-V architecture to increase efficiency and flexibility for NTT computation used in NewHope, qTESLA, CRYSTALS-Kyber, CRYSTALS-Dilithium, and Falcon. Additionally, Fritzmann and Sepúlveda [20] proposed an efficient and low-power NTT, which reduces the number of clock cycles to n log(n) cycles. The authors in [21] proposed a low-complexity NTT/INTT in the architecture of NewHope-NIST.

The proposed architecture combines the NTT, INTT, and point-wise multiplication architectures in an efficient way to utilize significantly fewer resources and improve the overall performance. To do so, using the Cooley-Tukey (CT) butterfly for NTT and the Gentleman-Sande (GS) butterfly for INTT [22], [23] is a well-known trick in the literature. Moreover, the resource sharing technique from [5], [24] is extended by using compact storage for pre-computed twiddle factors from [25] and the doubled bandwidth scheme from [14], [21] to obtain a high-performance architecture.

B. Our Contributions

To the best of our knowledge, there appear to be very few pure hardware implementations that focus only on the Kyber cryptosystem and make the best of all its features. This paper proposes an efficient hardware implementation of the module lattice-based post-quantum KEM CRYSTALS-Kyber on a Xilinx Artix-7 FPGA (as recommended by NIST) and the application specific integrated circuit (ASIC) platform. Our proposed architecture provides an efficient and high-performance set of components, including polynomial sampling, NTT, and point-wise multiplication, to accelerate lattice-based PQC exploiting fewer resources. The contributions of this paper are itemized in the following:
1) We propose a new approach for implementing a resource-efficient reconfigurable butterfly core on FPGA. We reduce the execution time for Kyber NTT computation from $\frac{N}{2}\log_2\frac{N}{2} + 2N$ to $\frac{N}{2}\log_2\frac{N}{4}$ by doubling the transform throughput and merging the pre-processing into the NTT algorithm. We also customize a memory addressing strategy to implement a high-speed polynomial multiplier on the target platform.
2) We highly parallelize the operations in the polynomial sampling cores by tightly coupling them with the Keccak core to decrease the required cycles. The performance of the proposed parallel scheduling for the binomial sampler indicates a significant improvement, while our rejection sampler latency can be completely absorbed by the Keccak core.
3) Our fast and scalable architecture provides a constant-time implementation over three different quantum security levels. To enhance our HW accelerator from a flexibility point of view, we design a set of customized high-level instruction codes to run the protocol. Hence, this set identifies the control flow of the proposed components and provides flexibility for integration with host processors.
4) We employ various optimization techniques to achieve an overall optimization in terms of efficiency, including parallelization, resource sharing, and utilizing distributed RAM and ROM blocks, which significantly improve the area-time product. The proposed implementation is constant-time and is resistant to known timing attacks.

The rest of the paper is organized as follows. In Sec. II, we discuss the preliminaries. In Sec. III, our proposed algorithms


and architectures are discussed. We discuss our results and compare them to the counterparts in Sec. IV. Finally, we conclude the paper in Sec. V.

TABLE I
THE LIST OF SYMBOLS AND NOTATIONS USED IN THIS PAPER

TABLE II
PARAMETER SETS FOR KYBER IMPLEMENTATION [26]

Fig. 1. 8-point NTT butterfly dataflow [28].

II. PRELIMINARIES

A. Symbol Definition

To make the paper more readable, Table I provides the list of notations used in this paper. The polynomial ring $R_q = \mathbb{Z}_q[X]/(X^n + 1)$ is defined over the field $\mathbb{Z}_q = \mathbb{Z}/q\mathbb{Z}$, in which $n = 2^{n'-1}$ is the dimension and $q$ is the prime modulus.

B. Kyber Algorithms

Kyber [26] is an IND-CCA secure KEM based on hardness assumptions over module learning with errors (Module-LWE) [27]. NIST has recently announced the 3rd round PQC standardization candidates, and Kyber was among the chosen algorithms as a finalist [2]. Kyber provides three post-quantum security levels, and its parameter sets are reported in Table II. The Kyber cryptosystem uses a uniformly random ring element ρ. The Kyber KEM is defined as follows, where sk stands for secret key, pk for public key, and ct for ciphertext:
• KeyGen(): This function returns (sk, pk) by choosing s and e from a binomial sampling, and Â from a uniform distribution. pk = (ρ, t̂) and sk = ŝ, where $\hat{t} = \hat{A} \circ \hat{s} + \hat{e}$.
• Enc(pk, m, μ): Using the seed μ, a binomial sampling is employed to choose r, e1, and e2. Furthermore, $\hat{A}^T$ is sampled from a uniform distribution. Computing $u = \mathrm{INTT}(\hat{A}^T \circ \hat{r}) + e_1$ and $v = \mathrm{INTT}(\hat{t}^T \circ \hat{r}) + e_2 + m$ constructs the ciphertext such that ct = (Compress(u), Compress(v)).
• Dec(sk, ct): Message m is computed such that $m = \mathrm{Compress}(v - \mathrm{INTT}(\hat{s}^T \circ \hat{u}))$, while u and v are extracted from ct.

1) Keccak: The most performance-critical part of the software implementation is the Keccak core, based on the profiled cycle counts presented in [3], [7]. In fact, more than half of the reported clock cycles in SW and HW/SW benchmarking are used to compute Keccak. However, this core can be accelerated in a pure hardware architecture since Keccak is a hardware-friendly design of SHA.

2) Sampling Units: The rejection sampling generates a matrix from the uniform distribution, while the accepted samples are smaller than q. The public matrix Â is sampled directly in the NTT domain. In the updated Kyber v3 specification, the rejection probability, calculated as $1 - q/2^{\lceil \log_2 q \rceil}$, is increased from 3.48% to 18.7%.

Noise sampling is performed from a centered binomial distribution (CBD) based on the subtraction of the Hamming weights of two η-bit chunks. Let β be the Keccak output; the coefficients are computed as follows:

$$e_i = \sum_{j=0}^{\eta-1} \beta_{2i\eta+j} - \sum_{j=0}^{\eta-1} \beta_{2i\eta+\eta+j} \quad (1)$$

which turns uniformly distributed samples into a binomial distribution. According to Table II, in the Kyber-512 architecture, two different samplers are implemented, i.e., η = 2 and η = 3, while the binomial sampling units in Kyber-768 and Kyber-1024 work only with η = 2.

3) NTT and Multiplication: The centerpiece of the KEM is the NTT, which is a fast Fourier transform (FFT) applied in a finite field. Fig. 1 illustrates the butterfly diagram for an 8-point NTT. Let a be a polynomial as follows:

$$a(x) = (a_0, a_1, \ldots, a_{255}) \in R_q \quad (2)$$

NTT(a) is defined as $\hat{a} = (\hat{a}_0 + \hat{a}_1 X, \hat{a}_2 + \hat{a}_3 X, \ldots, \hat{a}_{254} + \hat{a}_{255} X)$ such that $\hat{a}_{2i} = \sum_{j=0}^{127} a_{2j}\,\zeta^{(2\mathrm{br}_7(i)+1)j}$ and $\hat{a}_{2i+1} = \sum_{j=0}^{127} a_{2j+1}\,\zeta^{(2\mathrm{br}_7(i)+1)j}$, where ζ = 17 is the first primitive 256-th root of unity modulo q, and $\mathrm{br}_7$ is the bit reversal function. The pseudo-code of the iterative NTT is shown in Algorithm 1. The INTT is similar to the NTT, while $\omega_n^{-1}$ is used instead of $\omega_n$, and the resulting coefficients of a(x) are divided by n.

However, the original computing of NTT and INTT needs pre-processing and post-processing, respectively. A point-wise multiplication includes 128 multiplications of polynomials modulo the degree-2 polynomials $X^2 - \zeta^{2\mathrm{br}_7(i)+1}$.
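The centered binomial sampling of (1) in the Sampling Units paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the sampler of the proposed hardware: the helper names (`bytes_to_bits`, `cbd`, `sample_noise`) are hypothetical, the bit order inside each byte is assumed to be least-significant first, and plain SHAKE-256 over a seed stands in for the Keccak output β.

```python
import hashlib

N = 256  # number of coefficients per polynomial


def bytes_to_bits(data: bytes) -> list[int]:
    # Assumption: bits are consumed least-significant first within each byte.
    return [(byte >> k) & 1 for byte in data for k in range(8)]


def cbd(beta: list[int], eta: int) -> list[int]:
    """Apply Eq. (1): e_i = (weight of first eta bits) - (weight of next eta bits)."""
    assert len(beta) >= 2 * eta * N
    coeffs = []
    for i in range(N):
        base = 2 * i * eta
        first = sum(beta[base + j] for j in range(eta))
        second = sum(beta[base + eta + j] for j in range(eta))
        coeffs.append(first - second)
    return coeffs


def sample_noise(seed: bytes, eta: int) -> list[int]:
    # SHAKE-256 is a stand-in for beta; 64 * eta bytes give exactly 2 * eta * N bits.
    beta = bytes_to_bits(hashlib.shake_256(seed).digest(64 * eta))
    return cbd(beta, eta)
```

Every coefficient produced this way lies in [−η, η], which is why the hardware sampler described later needs only a few bits per sample.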


Algorithm 1 Iterative In-Place NTT Algorithm Based on Cooley-Tukey Butterfly [25]
Input: a polynomial a(x) ∈ Z_q[X]/(X^n + 1), n-th primitive root of unity ω_n ∈ Z_q, n = 2^l
Output: â(x) = NTT_ω_n(a) ∈ Z_q[X]/(X^n + 1)
â ← bit-reverse(a)
for i from 1 to l do
    m ← 2^(l−i)
    for j from 0 to 2^(i−1) − 1 do
        W ← ω_n^(1+j)
        for k from 0 to m − 1 do
            T ← W · â[2·j·m + k + m] mod q
            U ← â[2·j·m + k]
            â[2·j·m + k] ← U + T mod q
            â[2·j·m + k + m] ← U − T mod q
        end for
    end for
end for
return â(x)
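As a hedged companion to Algorithm 1, the sketch below runs seven layers of Cooley-Tukey butterflies for q = 3329 and ζ = 17 and compares the result with a direct evaluation of the NTT definition given in Sec. II-B. It is not a line-by-line transcription of Algorithm 1: the layer ordering and the twiddle table in bit-reversed order (`ZETAS`) follow the layout commonly used in Kyber software, which is an assumption here, and all function names are illustrative.

```python
Q = 3329    # Kyber prime modulus
ZETA = 17   # primitive 256-th root of unity modulo Q
N = 256


def br7(i: int) -> int:
    """Reverse the 7-bit binary representation of i."""
    return int(format(i, "07b")[::-1], 2)


# Twiddle factors stored in bit-reversed order (assumed layout).
ZETAS = [pow(ZETA, br7(k), Q) for k in range(128)]


def ntt(a: list[int]) -> list[int]:
    """Seven layers of in-place Cooley-Tukey butterflies on a copy of a."""
    r = list(a)
    k = 1
    length = 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * r[j + length] % Q      # T <- W * a[j + len]
                r[j + length] = (r[j] - t) % Q    # U - T
                r[j] = (r[j] + t) % Q             # U + T
        length //= 2
    return r


def ntt_direct(a: list[int]) -> list[int]:
    """Evaluate the definition: sums over even and odd coefficients."""
    out = [0] * N
    for i in range(128):
        g = pow(ZETA, 2 * br7(i) + 1, Q)
        out[2 * i] = sum(a[2 * j] * pow(g, j, Q) for j in range(128)) % Q
        out[2 * i + 1] = sum(a[2 * j + 1] * pow(g, j, Q) for j in range(128)) % Q
    return out
```

Each of the seven layers performs 128 butterflies, which is the 7 × 128 term that appears in the multiplication counts of Sec. III-E.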

Algorithm 2 Barrett Reduction Modulus q = 3,329 [29]
Input: q = 3,329, m = ⌊2^24 / q⌋ = 5,039, x ∈ [0, q^2)
Output: z = x mod q
u ← x · m
u ← u ≫ 24
u ← x − u · q
v ← u − q
if v ≥ 0 then
    z ← v
else
    z ← u
end if
return z

Fig. 2. Top-level architecture of Kyber KEM. The CBD core with η = 3 is implemented only in Kyber-512.

The matrix-vector multiplication $\hat{A} \circ \hat{s}$ in the NTT domain for Kyber-512 is shown in (3), while a point-wise multiplication $\hat{A}_{j,i} \circ \hat{s}_i$ can be performed as shown in (4).

$$\hat{A} \circ \hat{s} = \begin{pmatrix} \hat{A}_{00} & \hat{A}_{01} \\ \hat{A}_{10} & \hat{A}_{11} \end{pmatrix} \circ \begin{pmatrix} \hat{s}_0 \\ \hat{s}_1 \end{pmatrix} = \begin{pmatrix} \hat{A}_{00} \circ \hat{s}_0 + \hat{A}_{01} \circ \hat{s}_1 \\ \hat{A}_{10} \circ \hat{s}_0 + \hat{A}_{11} \circ \hat{s}_1 \end{pmatrix} \quad (3)$$

$$(\hat{a}_{j,2i} + \hat{a}_{j,2i+1} X) \cdot (\hat{s}_{2i} + \hat{s}_{2i+1} X) = (\hat{a}_{j,2i}\hat{s}_{2i} + \hat{a}_{j,2i+1}\hat{s}_{2i+1}\zeta^{2\mathrm{br}_7(i)+1}) + (\hat{a}_{j,2i}\hat{s}_{2i+1} + \hat{a}_{j,2i+1}\hat{s}_{2i})X \quad (4)$$

Operations in polynomial multiplication should be reduced with respect to the prime q. Although both Montgomery and Barrett reduction are employed in the C reference implementation, from a resource sharing optimization point of view, we focus on Barrett reduction as described in Algorithm 2 to avoid the cost of Montgomery domain conversion.

To conclude, we outlined the most time-consuming operations that are performed during KEM. These operations are composed of several basic computations, including hashing, polynomial generation, addition, subtraction, and multiplication. Dedicated architecture can be implemented to accelerate the corresponding operations in hardware.

III. HIGH-SPEED KYBER ARCHITECTURE

The top-level architecture of Kyber is designed and presented in Fig. 2.

A. High-Level Architecture

A full HW methodology enhances the performance of the architecture over a HW/SW co-design scheme at the cost of a longer design cycle and reduced flexibility, and it demands customized data paths for different protocol-level operations. However, using an instruction-set processor makes the design smaller, simpler, slower, and more controllable/programmable. A customized instruction set can be a plausible option to achieve fine-tuned hardware acceleration with a low to moderate logic overhead. In order to implement a full HW architecture, cascading computation units in a customized data flow reduces the required latency significantly, while the design becomes inflexible. In this paper, we implement all computation blocks in hardware; meanwhile, our implementation remains flexible enough to be extended, in contrast to existing HW architectures, which is vital for a fast-evolving field like PQC.
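Since the butterfly unit of the following subsections relies on the Barrett reduction of Algorithm 2, a small software model may help when checking its constants. The sketch below mirrors the steps of Algorithm 2 in Python; the function name `barrett_reduce` is illustrative, and the branch at the end only models the selection that hardware would perform with a multiplexer, so it makes no constant-time claim of its own.

```python
Q = 3329
SHIFT = 24
M = (1 << SHIFT) // Q  # floor(2^24 / q) = 5039


def barrett_reduce(x: int) -> int:
    """Reduce x in [0, q^2) modulo q following the steps of Algorithm 2."""
    assert 0 <= x < Q * Q
    u = (x * M) >> SHIFT   # estimate of the quotient floor(x / q)
    u = x - u * Q          # candidate remainder, at most one q too large
    v = u - Q
    return v if v >= 0 else u  # single conditional subtraction
```

For inputs below q^2, the quotient estimate is off by at most one, which is why one conditional subtraction is enough in this range.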


To enhance the proposed architecture from a flexibility point of view, we design 20 different customized high-level instruction codes to perform the protocol. In particular, each line of the program ROM is 25 bits wide: 5 bits for the instruction code and two 10-bit fields for operand addresses. The instruction memory is located within the controller and stores instructions for all required operations, including arithmetic, Keccak, and various memory operations. For example, Table III summarizes our proposed hashing instructions for different hash types. As one can see, our instructions can be easily used for integration with classic cryptosystems, e.g., the Ed448 digital signature scheme [30], in a hybrid architecture, which is beyond the scope of this work. The data memory can share data with other modules through a data bus handled by the controller. To perform KEM, the required parameters should be pre-loaded into the memory.

TABLE III
PROPOSED INSTRUCTION FOR HASHING

B. Keccak Core

The Keccak unit is configured to perform four functions during KEM: SHA3-256, SHA3-512, SHAKE-128, and SHAKE-256. To design a high-performance core, we modify the high-speed core implementation of Keccak provided by the Keccak team [31]. We develop a dedicated buffer for interfacing with the Keccak core. This dedicated buffer reads/writes data in 64-bit width from/to the memory unit. The buffer length is adjusted to the longest required data, i.e., 1344 bits for SHAKE-128. Therefore, the buffer interfacing needs a maximum of 21 cycles, which can be handled during the Keccak sponge function computation, i.e., 24 cycles.

C. Rejection Sampling

Since the 64-bit data path can be matched with the Keccak core, the rejection data path is set to 64 bits. To design a high-performance rejection core, we implement six parallel cores in this module fed by Keccak results. Therefore, a buffer should be added to store the accepted samples. When the number of buffered samples is more than three, 64 bits of the buffer, i.e., four accepted samples, are stored in the RAM.

As shown in Fig. 2, a 64-bit word is read from memory. Since a 64-bit input is not a multiple of a 12-bit integer, the input buffer is extended to 80 bits to store some parts of the input for the next cycles. In the first cycle, only four samples are generated in parallel, and 16 bits of the input are postponed to the next cycle. In the second cycle, all six cores work on 72 bits of the buffer, of which 16 bits are kept from the first iteration and 56 bits are extracted from the second input. Hence, 8 bits of the input are postponed to be concatenated with the 64 bits of the third cycle and processed with the six rejection cores. A specific flag for each core shows whether the input is valid or not.

In our optimized architecture, this unit works in parallel with the Keccak core. Therefore, the latency for rejection sampling is completely absorbed within the latency of the Keccak core.

D. Binomial Sampling

Fig. 2 illustrates the datapath of the binomial sampler. Since this module is inherently lightweight, we implement 16 parallel combinational cores. Then, 16 consecutive samples are generated in parallel and stored in a buffer register. Although the resulting samples, which are in [−η, η], can be represented in 3 bits, we use a 4-bit representation to simplify the addressing. The main difference in implementing the CBD core with η = 3 is an input buffer to keep data for concatenating with the input in the next cycles. In this mode, three consecutive 64-bit words are read to generate 32 samples in two words.

E. Butterfly Unit

The main configurations of our butterfly unit are detailed in Fig. 3. We employ hand-crafted resource sharing techniques to implement this core with optimized resources. There is only one modular multiplier in our butterfly architecture. In addition, we use only one reduction unit in the middle of the butterfly operation and employ a modular adder/subtractor in the proposed configurations. By contrast, implementing Montgomery reduction would require more resources, due to converting back from that domain, and would demand more clock cycles. Moreover, our proposed modular reduction is constant-time and takes two cycles, as illustrated in Fig. 3. As one can see, the architecture is pipelined to avoid any delay in the butterfly operation.

Fig. 3. Reconfigurable Butterfly Architecture.

1) Speeding up the NTT/INTT: An n-point NTT requires n/2 independent butterfly operations per stage. As a result, the naive implementation of polynomial multiplication requires 4,352 modular multiplications, of which 2 × (7 × 128 + 256) = 2,304 modular multiplications are for performing the NTT twice, 5 × 128 = 640 modular multiplications for point-wise multiplication, and 7 × 128 + 2 × 256 = 1,408


modular multiplications for INTT are required. To avoid the bit-reverse permutation in Algorithm 1, two different butterfly configurations, i.e., CT and GS, are required for NTT and INTT, respectively, as follows:

$$f \cdot g = \mathrm{INTT}_{GS}(\mathrm{NTT}_{CT}(f) \circ \mathrm{NTT}_{CT}(g)). \quad (5)$$

To be consistent with the standard software implementation, the input polynomials in normal order are transformed to the NTT domain in bit-reversed order employing the CT configuration, while twiddle factors are absorbed in bit-reversed order. The point-wise multiplication is performed in bit-reversed order, and the result is transformed back using the GS configuration to normal order. However, the required twiddle factors are absorbed in bit-reversed order.

We observe that an efficient implementation of polynomial multiplication requires 3,584 modular multiplications, reducing the complexity by 18% compared to the naive implementation. According to Fig. 3, for the NTT operation, the butterfly is arranged based on the CT configuration, while in INTT, it is reconfigured to match the GS configuration. In NTT/INTT, when the pipeline is full, the butterfly unit can read two inputs and write two outputs in each clock cycle.

The most crucial bottleneck in implementing the NTT core is memory access because memory access patterns change during each operation stage [15], [32]. Therefore, designing efficient memory management is critical to avoid memory conflicts and achieve high throughput. On the other hand, memory bandwidth limits the efficiency of the butterfly operation. Hence, we use two memory units to provide double bandwidth during the NTT operation to reduce latency. In the first round, the results are stored in NTT RAM 0. After completing the first round, the input coefficients are read from NTT RAM 0, and the butterfly outputs are stored in NTT RAM 1. This scenario is repeated for seven rounds until the NTT is computed.

In this method, two coefficients are fetched from the first RAM block at a time and fed into a butterfly unit. Then, the butterfly output is prepared and written into the second RAM block after the pipelined stages, i.e., five cycles. Employing the ping-pong strategy, after 128 cycles, all coefficients are fed into the butterfly core, and five additional cycles are required to complete a round of NTT/INTT computation. In the next round, the input coefficients are fetched from the second RAM block, and the outputs are stored in the first RAM block. This computation continues until all seven required rounds of NTT are complete. To optimize the memory utilization in this method, different vectors are stored in the same RAM block. For example, s0 and s1 are located in the same memory, where in each address the lower column stores s0 and the higher column stores s1 coefficients. In each clock cycle, two addresses of memory (e.g., i and j) are read, which contain four coefficients, i.e., $s_{0,i}$ and $s_{1,i}$ from address i, and $s_{0,j}$ and $s_{1,j}$ from address j. Then, $s_{0,i}$ and $s_{0,j}$ are fed into the first butterfly, while $s_{1,i}$ and $s_{1,j}$ are used by the second core. The results of these cores are stored in the same fashion in the second RAM. Fig. 4 shows the address flow of our proposed NTT architecture using RAM0 and RAM1.

Fig. 4. The proposed address flow of our NTT memory architecture in the first two stages. (Butterfly inputs are in white and outputs are in black).

To implement a highly parallel architecture, we implement multiple butterfly units matched with the number of polynomial vectors in s, i.e., two, three, and four units for Kyber-512, Kyber-768, and Kyber-1024, respectively.

Our first method reduces the NTT execution time from $\frac{N}{2}\log_2\frac{N}{2} + 2N$ to $\frac{N}{2}\log_2\frac{N}{2}$ compared with the naive implementation. In our second method, we take advantage of the NTT definition in the Kyber scheme to perform two independent NTT computations for odd and even coefficients. Hence, we employ two butterfly cores in parallel to compute the NTT, which halves the execution time to $\frac{N}{2}\log_2\frac{N}{4}$. In this method, each address of memory stores two consecutive coefficients, i.e., $s_{i,2j}$ and $s_{i,2j+1}$. Then, two addresses of memory, which contain four coefficients, are fed into two butterfly cores, i.e., $s_{i,2j}$ and $s_{i,2j+1}$ from address j, and $s_{i,2k}$ and $s_{i,2k+1}$ from address k of memory. So, $s_{i,2j}$ and $s_{i,2k}$ are used for the first butterfly and are processed independently from $s_{i,2j+1}$ and $s_{i,2k+1}$ in the second core. Similar to the previous method, the results are stored in the same fashion in the second RAM. Although this method does not improve the efficiency, because it doubles the resources to halve the latency, it can accelerate the computations to target high-performance architectures.

2) Optimizing Point-Wise Multiplication: To implement an optimized high-throughput point-wise multiplication core, we use a specific memory pattern for the matrix Â coefficients. In our proposed memory pattern for Â, four consecutive coefficients are stored together at each address, i.e., (Â00(3), Â00(2), Â00(1), Â00(0)), . . . , (Â11(255), Â11(254), Â11(253), Â11(252)). Further, two parallel butterfly cores are employed to accelerate the polynomial multiplication. The number of pipelined stages is set to five to design a high-throughput architecture for point-wise multiplication, i.e., 4 coefficients per 5 cycles. In other words, based on detailed scheduling and our proposed memory scheme, this design results in higher throughput while limiting the maximum operating frequency. It is observed that the path from the reduction output to the multiplier is the critical path. Nevertheless, increasing the pipeline latency improves the critical path


delay at the cost of decreasing the point-wise multiplication TABLE IV


throughput. FPGA I MPLEMENTATION R ESULTS FOR O UR K ECCAK , B INOMIAL , AND
R EJECTION C ORES AND C OMPARISON W ITH S TATE - OF - THE -A RT
Let R̂00 = Â00 ◦ ŝ0 ; hence, based on (4), the R̂00 coefficients
can be computed as follows:
\[ \hat{R}_{00}(2i) = \zeta_i \hat{A}_{00}(2i+1)\,\hat{s}_0(2i+1) + \hat{A}_{00}(2i)\,\hat{s}_0(2i) \tag{6} \]
\[ \hat{R}_{00}(2i+1) = \hat{A}_{00}(2i+1)\,\hat{s}_0(2i) + \hat{A}_{00}(2i)\,\hat{s}_0(2i+1) \tag{7} \]
Hence, we use the first core for the R̂00 (4i ) and R̂00 (4i +1),
and the second core works on R̂00 (4i + 2) and R̂00 (4i + 3).
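Equations (6) and (7), together with the split of the work between the two cores, can be sketched in software as follows. This is a minimal functional model, not the hardware datapath: the modulus 3329 comes from the Kyber specification rather than from this section, the function names are hypothetical, and the ζ values in the example are placeholders rather than the real twiddle-factor table.

```python
Q = 3329  # Kyber modulus, taken from the Kyber specification (assumption: not restated here)


def basemul(a0, a1, s0, s1, zeta, q=Q):
    """Product of (a0 + a1*X) and (s0 + s1*X) modulo (X^2 - zeta), per Eqs. (6) and (7)."""
    r_even = (zeta * a1 * s1 + a0 * s0) % q  # Eq. (6)
    r_odd = (a1 * s0 + a0 * s1) % q          # Eq. (7)
    return r_even, r_odd


def pointwise(a_hat, s_hat, zetas, q=Q):
    """Point-wise product; also records which core would handle each coefficient pair."""
    out = [0] * len(a_hat)
    core_of_pair = []
    for i, zeta in enumerate(zetas):
        # Even pairs produce outputs 4j, 4j+1 (first core); odd pairs produce 4j+2, 4j+3 (second core).
        core_of_pair.append(i % 2)
        out[2 * i], out[2 * i + 1] = basemul(
            a_hat[2 * i], a_hat[2 * i + 1], s_hat[2 * i], s_hat[2 * i + 1], zeta, q
        )
    return out, core_of_pair


# Toy input with two coefficient pairs; the zeta values are placeholders.
result, cores = pointwise([1, 2, 3, 4], [5, 6, 7, 8], zetas=[17, Q - 17])
print(result)  # [209, 16, 2806, 52]
print(cores)   # [0, 1]
```

The software model performs full reductions with `%`; the hardware instead shares one modular reduction unit across the pipelined multiplications.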
Operations in each step are described for a core as follows:
Step 1: ŝ0(2i+1) and ζi are read from the NTT memory and the twiddle factor ROM, respectively, to perform modular multiplication.
Step 2: ŝ0(2i) is multiplied by Â00(2i). Furthermore, the previous multiplication result is passed into the modular reduction unit.
Step 3: ŝ0(2i) is multiplied by Â00(2i+1).
Step 4: The first step result after reduction is multiplied by Â00(2i+1).
Step 5: The factors of the second term of R̂00(2i+1), i.e., Â00(2i) and ŝ0(2i+1), are multiplied. The reduced result of Step 2, i.e., Â00(2i)ŝ0(2i), is entered into the pipeline stages.
Steps 6-7: The reduction outputs, i.e., Â00(2i+1)ŝ0(2i) and ζi Â00(2i+1)ŝ0(2i+1), are entered sequentially into the pipeline stages. Moreover, the next coefficients are read from the memories to start from Step 1.
Step 8: The modular addition computes ζi Â00(2i+1)ŝ0(2i+1) + Â00(2i)ŝ0(2i). Furthermore, Â00(2i)ŝ0(2i+1) is passed from the reduction unit into the pipeline stages.
Step 9: The previous addition result, i.e., R̂00(2i), is buffered in the next register, while the modular addition computes Â00(2i+1)ŝ0(2i) + Â00(2i)ŝ0(2i+1).
Step 10: R̂00(2i) and R̂00(2i+1), which are already buffered in the output registers, are stored in the memory.
Since the memory Â includes four coefficients per address, the addition between Â00 ◦ ŝ0 and Â01 ◦ ŝ1 can be performed by a 64-bit addition. In the described scenario, one port of the memory is always in read mode to feed the cores. The second port is used for accumulating the results.

F. Scalability
The proposed architecture for NTT computation employing two butterfly cores for Kyber-512 achieves high-performance results with reasonable resource utilization. However, different hardware resource utilization can be explored to achieve a desirable area-time trade-off from various optimization perspectives. For example, to reduce the required cycles, the number of butterfly cores can be increased to four. Conversely, resources can be saved if only one butterfly core is implemented, at the cost of increasing the total latency. It should be noted that increasing the number of butterfly cores changes the memory access pattern, and some modifications are needed to feed all cores. Hence, a high-performance design requires complex memory access management to reduce the access overhead.
Besides, a high-performance Keccak core occupies almost 25% of the total area. We can implement different architectures of this core and achieve scalability through area versus latency trade-offs.
This architecture can be easily scaled to match a higher or lower security level. To scale up the architecture, the same structure can be applied, while the number of butterfly cores should increase. Moreover, the depth of Data RAM and RAM(A) needs to be increased. The main difference between these architectures is that Kyber-512 uses two separate CBD circuits, which requires more resources to provide a dedicated sampler. Hence, a general core utilizing the resources of the highest security level, with an additional CBD core for η = 3, can be used to provide a scalable Kyber cryptosystem.

TABLE V
ASIC Implementation Results for Our Keccak, Binomial, and Rejection Cores and Comparison With State-of-the-Art

IV. EXPERIMENTAL RESULTS AND COMPARISON
In this section, we provide implementation results and compare them to the counterparts available in the open literature. Because the implementations employ different platforms, a fair and meaningful comparison of different designs and implementations with previous work is not straightforward. Nevertheless, we would like to put our results in context with existing implementations to give the reader a quick overview of other designs and architectures.

A. Results for Keccak and Polynomial Sampling
Tables IV and V report the required FPGA and ASIC resources and latency specifications for the Keccak, the CBD,

and rejection sampling cores in our design and other state-of-the-art implementations. As one can see, the software implementation of Keccak runs in thousands of clock cycles, which can be significantly accelerated when implemented in hardware. A lightweight Keccak core presented in [34] uses 359 LUTs to perform a round of Keccak-f[1600] in 1,665 cycles, while in [35], the authors proposed an architecture performing in 12 clock cycles at the cost of almost 10k LUTs. In our proposed design, a Keccak-f[1600] is performed in 24 cycles at the cost of 4.4k LUTs or 24k GEs. Additionally, decreasing the latency of the Keccak core does not considerably improve the performance due to the interfacing cost, which requires 21 clock cycles for a 1,344-bit output.
The reported results show that our binomial and rejection samplers outperform the sampling units of previous works [6], [8], [9]. Our proposed implementation takes advantage of parallel computations between our sampling units and the Keccak core. Our binomial sampler requires 68 clock cycles for generating four polynomials of degree 256, i.e., 1,024 samples. The rejection sampler in our proposed scheme works simultaneously with the Keccak core. Therefore, its required latency for generating matrix Â with 1,024 samples, i.e., 432 clock cycles, is totally absorbed. This unit, with 16 parallel cores, occupies almost 2k LUTs on FPGA or 13k GEs on the ASIC platform.
Fig. 5 shows the proposed scheduling for sampler units in Kyber-512. The rejection sampler works in parallel with the Keccak core, and therefore its latency, i.e., 108 cycles, is absorbed completely. The accepted samples are stored in RAM(A), shown in Fig. 2. For binomial sampling of a polynomial of degree 256 with η = 3, two rounds of Keccak are required. Each round of Keccak result is processed in 17 cycles by the binomial sampler. However, processing the second round result cannot be parallelized with the next CBD due to the memory bandwidth limitation.

Fig. 5. Proposed Scheduling for Sampling Units in Kyber-512.

B. Results for the Butterfly Core
Tables VI and VII report the required FPGA and ASIC hardware resources and latency specifications for our proposed butterfly unit in different configurations, i.e., NTT, INTT, and point-wise multiplication, together with other state-of-the-art implementations. We remark that a more technology-independent measurement is the required cycle count. Thus, for efficiency comparison between different proposed NTT architectures, efficiency can be computed as the required clock cycles×area.

TABLE VI
FPGA Implementation Results for Our NTT Core and Comparison With State-of-the-Art (n = 256)

TABLE VII
ASIC Results for NTT and Comparison With State-of-the-Art

Our pipelined architecture employing our first method requires 133 cycles for performing one round of 256-point NTT; hence, a full NTT with seven rounds requires 940 cycles. Computing INTT requires 263 additional clock cycles for post-processing. Moreover, point-wise multiplication between two polynomials of degree 256 requires 1,289 clock cycles. Our proposed architecture occupies 360 LUTs, 145 FFs, 187 Slices, 3 DSPs, and 2 BRAMs, and is significantly smaller than the previous best works.
Besides, our second proposed method requires 474 cycles for performing a full NTT employing two parallel butterfly cores. Hence, this method results in a significant speedup by halving the cycle count compared to other NTT implementations for Kyber. Although the efficiency of both methods is the same, a trade-off between area and time can be achieved.
The authors in [19] presented a flexible NTT architecture over RISC-V, which consumes significantly more cycles. In [7], a 3-layer merged NTT for NewHope was proposed. The works of [24] and [13] implemented a 2-layer merged NTT using the KRED algorithm, although this reduction algorithm needs a special prime form. In [10], Montgomery reduction was employed. From a resource sharing perspective, we use a general reduction method that can be configured for different prime values in a hybrid cryptosystem. Moreover, in [37] and [21], the bandwidth doubling technique is used for feeding the processing units. Particularly, in this work, we propose a compact reconfigurable architecture to accelerate the polynomial multiplication, which is enhanced by borrowing the compact memory implementation [25], resource sharing technique [5], [24], and doubled bandwidth scheme [14], [21].

TABLE VIII
FPGA Implementation Results and Comparison With State-of-the-Art

TABLE IX
ASIC Results and Comparison With State-of-the-Art
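As a rough check on the butterfly-core figures in this subsection, the sketch below evaluates the clock cycles × area metric. The LUT count and cycle counts for the first method are the ones quoted in the text; the doubling of LUTs for the two-core method is an assumption made here purely for illustration, since this section does not report that number. The function name is hypothetical.

```python
def area_cycles(area_luts, cycles):
    """Efficiency metric described in the text: required clock cycles times area (lower is better)."""
    return area_luts * cycles


# First method: one butterfly core, figures quoted in the text.
method1 = area_cycles(360, 940)

# Second method: 474 cycles; the 2x LUT cost is an illustrative assumption, not a reported value.
method2 = area_cycles(2 * 360, 474)

# Seven rounds of 133 cycles account for 931 of the 940 reported cycles;
# the remaining cycles are not broken down in the text.
seven_rounds = 7 * 133

print(method1)       # 338400
print(method2)       # 341280
print(seven_rounds)  # 931
```

Under that assumption the two products differ by less than 1%, which is consistent with the statement that both methods have the same efficiency while trading area for time.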
In comparison to the SW implementations, our first method
achieves a speedup factor of 8.2×, 7.7×, and 21.6× for NTT,
INTT, and point-wise multiplication, respectively. Our second
proposed method accelerates NTT, INTT, and point-wise
multiplication by 16.3×, 15.2×, and 21.6×, respectively.
However, our proposed architecture reduces the performance
by 26% (15%) compared to [13] ([38]) on the HW platform,
although the NTT core designed in [13] employs 4 butterfly units.
It should be noted that although the design presented in [9] is faster, implementing a vectorized butterfly unit, it consumes 512k GE logic gates, which is several times bigger than our proposed design. Hence, our design outperforms state-of-the-art ASIC implementations with at least 11.6× better Area×Cycles.
Note that the NTT can also run in parallel with the sampling unit to reduce the total latency; however, applying this parallelization in this work would diminish the flexibility and increase the required memory units. To achieve both high speed and instruction-level flexibility, we do not follow this methodology, so that the design remains flexible enough to add or modify instructions.

C. FPGA Implementations
Our proposed architecture for different NIST security levels is synthesized with Xilinx Vivado 2019.2 and implemented on a Xilinx Artix XC7A100T-3 FPGA. All given results are obtained after place-and-route (PAR). We report the area, timing, and area-time trade-off (number of LUT×time in μs) results of the design in Table VIII. In some previous works, each DSP is considered equivalent to 100 Slices [39]. However, no single element of an FPGA can be accurately expressed in terms of other elements; hence, DSP and BRAM are not considered in the area A. To have a fair comparison, we evaluate the

performance of the proposed design on the state-of-the-art targeted platforms, which changes performance by a factor of 1.35×, 1.4×, and 0.68× on Zynq-7000, Virtex-7, and Virtex-6 compared to Artix-7.
We compare our architecture results to the best SW design on the ARM Cortex-M4 chip, as well as the HW implementations and the HW/SW co-design. The total latency is the summation of key encapsulation and key decapsulation (Encaps + Decaps), as the key generation can be done offline. As one can see, for NIST level 1 security, our proposed scheme occupies 18k LUTs, 5k FFs, 6 DSPs, and 15 BRAMs. It also runs at 115 MHz and performs the whole Kyber protocol in 148 μs. Our design achieves a speedup factor of 83.9× and 74.1× compared to the leading counterparts in SW and HW/SW designs. Furthermore, our architecture employing the various optimization techniques is highly efficient, with the area-time trade-off improved by about 98% compared to [6]. It is to be noted that the HW/SW co-design [6]–[9] is a complete design for all Kyber security levels. The same improvement can be observed in the remaining security levels. Compared to HW architectures, our proposed design takes 5× longer than our previous work [13], resulting in an A × T greater by a factor of 7. Our design is also 2× slower and 2.5× larger compared to [12]. However, this overhead is the price of keeping the customized instruction-set design flexible compared to highly parallel [13] or highly compact architectures [12]. Hardware specially designed to cater to one scheme may lack flexibility; therefore, this work aims to achieve both high speed and flexibility for Kyber to support extension for building a hybrid cryptosystem.
Although our implementations are constant-time, investigating side-channel analysis attacks will be part of our future work.

D. ASIC Results
The ASIC implementation results of our architectures based on the 65-nm TSMC cell library using Synopsys Design Compiler are presented in this section. All the designs are synthesized with a 5 ns clock period. Table IX reports the maximum clock frequency and the amount of logic cells for our proposed designs and state-of-the-art implementations. As one can see, the placed-and-routed design of our proposed Kyber-1024 consists of 104 kGE for logic and 190 KB SRAM for memory, which shows a significant speedup compared to previous works.

E. Comparison With Other Implementations
In Table X, the comparison between our proposed architecture and some existing PQC hardware implementations targeting NIST security level 5 is reported. It should be noted that due to the varying technologies of different FPGA generations, an accurate and fair comparison is not actually possible.

TABLE X
Comparisons With Existing FPGA-Based PQC Implementations of CCA-Secure KEM Schemes in NIST Security Level 5

In [15], a fast architecture of Saber is proposed using the high-speed instruction-set coprocessor on a Xilinx ZCU102 board. In this work, a non-NTT-based approach is used, taking advantage of the power-of-2 modulus in the Saber scheme, which results in an execution time of 153 μs. Employing multiply-and-accumulate units provides the required trade-off between area and time for different applications. However, this design needs more hardware resources compared to ours, which results in a 1.1× area-time product relative to ours.
We also compare our work with FrodoKEM-1344, based on the standard learning with errors problem. To the best of our knowledge, there is no pure HW work for FrodoKEM targeting security level 5; hence, the results in [6], which used a HW/SW approach, are reported. As one can see, the FrodoKEM scheme requires considerably more cycles than other PQC schemes due to performing expensive matrix-vector multiplications. Our implementation of Kyber-1024 is almost 26,000 times faster, occupying almost the same resources compared to [6].
SIKE [40], as an isogeny-based PQC scheme, requires significantly more DSP resources to design a parallel Montgomery multiplier architecture over a large prime. Although this scheme outperforms the FrodoKEM implementation, our Kyber-1024 design shows a 155 times better area-time product compared to this scheme.
It should be noted that there is a large body of work on optimizing PQC schemes on a variety of platforms. For example, the works of [21] and [28] implement NewHope on a Xilinx XC7Z020 and Zynq-7000, respectively. The architecture of NewHope is very similar to that of Kyber; however, this scheme has not been selected to continue into the third round of NIST. In [21], a low-complexity architecture of NewHope is introduced, having a competitive performance compared to our design. Hence, taking advantage of this architecture to improve the total performance of Kyber is left for future work.
Although one of the drawbacks of various post-quantum cryptosystems is requiring larger key sizes and more computational power than the current pre-quantum algorithms, our proposed implementation already reaches performance levels comparable to or even significantly better than pre-quantum algorithms [30], [41], [42].

V. CONCLUSION
The threat from large-scale quantum computers is real, and we need to act now as the deployment, integration, and migration to quantum-safe security systems take several

years. In this paper, we have presented an instruction-set post-quantum cryptosystem for CRYSTALS-Kyber. Our proposed architecture is synthesized for a Xilinx Artix-7 FPGA prototype (a NIST-recommended prototyping platform) and an ASIC. Implementing efficient components, including sampling cores, NTT, and point-wise multiplication architectures, increases the performance compared to the state-of-the-art SW and HW/SW implementations. More specifically, our proposed architecture performs Kyber-512, Kyber-768, and Kyber-1024 protocols in only 148, 209, and 286 μs on an Artix-7 FPGA, respectively. Our future work will focus on the side-channel resistance and the development of countermeasures against such attacks.

ACKNOWLEDGMENT
The authors would like to thank the reviewers for their comments.

REFERENCES
[1] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. 35th Annu. Symp. Found. Comput. Sci., Santa Fe, NM, USA, Nov. 1994, pp. 124–134.
[2] Status Report on the Second Round of the NIST Post-Quantum Cryptography Standardization Process, Nat. Inst. Standards Technol., Gaithersburg, MD, USA, 2020.
[3] L. Botros, M. J. Kannwischer, and P. Schwabe, “Memory-efficient high-speed implementation of Kyber on Cortex-M4,” in Proc. 11th Int. Conf. Cryptol., Rabat, Morocco, Jul. 2019, pp. 209–228.
[4] K. Basu, D. Soni, M. Nabeel, and R. Karri, “NIST post-quantum cryptography a hardware evaluation study,” in Proc. IACR, 2019, p. 47.
[5] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols,” in Proc. IACR, vol. 4, 2019, pp. 17–61.
[6] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols (extended version),” in Proc. IACR, 2019, p. 1140.
[7] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA extensions for finite field arithmetic accelerating Kyber and NewHope on RISC-V,” in Proc. IACR, vol. 3, 2020, pp. 219–242.
[8] T. Fritzmann, G. Sigl, and J. Sepúlveda, “RISQ-V: Tightly coupled RISC-V accelerators for post-quantum cryptography,” in Proc. IACR, Aug. 2020, pp. 239–280.
[9] G. Xin et al., “VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2672–2684, Aug. 2020.
[10] Y. Huang, M. Huang, Z. Lei, and J. Wu, “A pure hardware implementation of CRYSTALS-KYBER PQC algorithm through resource reuse,” IEICE Electron. Exp., vol. 17, no. 17, 2020, Art. no. 20200234.
[11] V. B. Dang, F. Farahmand, M. Andrzejczak, K. Mohajerani, D. T. Nguyen, and K. Gaj, “Implementation and benchmarking of round 2 candidates in the NIST post-quantum cryptography standardization process using hardware and software/hardware co-design approaches,” in Proc. IACR Cryptol. Arch., 2020, p. 795.
[12] Y. Xing and S. Li, “A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-KYBER on FPGA,” in Proc. IACR, Feb. 2021, pp. 328–356.
[13] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “High-speed NTT-based polynomial multiplication accelerator for CRYSTALS-Kyber post-quantum cryptography,” Proc. IACR, 2021, p. 563.
[14] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, “Towards efficient kyber on FPGAs: A processor for vector of polynomials,” in Proc. 25th Asia South Pacific Design Autom. Conf. (ASP-DAC), Beijing, China, Jan. 2020, pp. 247–252.
[15] S. Sinha Roy and A. Basso, “High-speed instruction-set coprocessor for lattice-based key encapsulation mechanism: Saber in hardware,” in Proc. IACR Trans. Cryptograph. Hardw. Embedded Syst., Aug. 2020, pp. 443–466.
[16] Y. Zhang, C. Wang, D. E. S. Kundi, A. Khalid, M. O’Neill, and W. Liu, “An efficient and parallel R-LWE cryptoprocessor,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 5, pp. 886–890, May 2020.
[17] A. C. Mert, E. Karabulut, E. Ozturk, E. Savas, M. Becchi, and A. Aysu, “A flexible and scalable NTT hardware: Applications from homomorphically encrypted deep learning to post-quantum cryptography,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Grenoble, France, Mar. 2020, pp. 346–351.
[18] A. C. Mert, E. Karabulut, E. Ozturk, E. Savas, and A. Aysu, “An extensive study of flexible design methods for the number theoretic transform,” IEEE Trans. Comput., early access, Aug. 19, 2020, doi: 10.1109/TC.2020.3017930.
[19] E. Karabulut and A. Aysu, “RANTT: A RISC-V architecture extension for the number theoretic transform,” in Proc. 30th Int. Conf. Field-Program. Log. Appl. (FPL), Aug. 2020, pp. 26–32.
[20] T. Fritzmann and J. Sepulveda, “Efficient and flexible low-power NTT for lattice-based cryptography,” in Proc. IEEE Int. Symp. Hardw. Oriented Secur. Trust (HOST), McLean, VA, USA, May 2019, pp. 141–150.
[21] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT,” in Proc. IACR, Mar. 2020, pp. 49–72.
[22] T. Pöppelmann, T. Oder, and T. Güneysu, “High-performance ideal lattice-based cryptography on 8-bit ATxmega microcontrollers,” in Proc. LATINCRYPT, Guadalajara, Mexico, Aug. 2015, pp. 346–365.
[23] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for faster ideal lattice-based cryptography,” in Proc. 15th Int. Conf., Milan, Italy, Nov. 2016, pp. 124–139.
[24] P.-C. Kuo et al., “High performance post-quantum key exchange on FPGAs,” in Proc. IACR, 2017, p. 690.
[25] C. Du and G. Bai, “Towards efficient polynomial multiplication for lattice-based cryptography,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Montréal, QC, Canada, May 2016, pp. 1178–1181.
[26] R. Avanzi et al., “CRYSTALS-Kyber: Algorithm specification and supporting documentation (version 3.0). submission to the NIST post-quantum cryptography standardization project,” NIST Post-Quantum Cryptogr. Standardization Project, Tech. Rep., 2020.
[27] J. Bos et al., “CRYSTALS-Kyber: A CCA-secure module-lattice-based KEM,” in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), London, U.K., Apr. 2018, pp. 353–367.
[28] Y. Xing and S. Li, “An efficient implementation of the NewHope key exchange on FPGAs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 3, pp. 866–878, Mar. 2020.
[29] P. Barrett, “Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor,” in Proc. CRYPTO, Santa Barbara, CA, USA, 1986, pp. 311–323.
[30] M. Bisheh-Niasar, R. Azarderakhsh, and M. M. Kermani, “Area-time efficient hardware architecture for signature based on Ed448,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 8, pp. 2942–2946, Aug. 2021.
[31] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche, “Keccak in VHDL,” Keccak Team, Tech. Rep., 2020.
[32] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede, “Compact ring-LWE cryptoprocessor,” in Proc. Cryptograph. Hardw. Embedded Syst., Busan, South Korea, Sep. 2014, pp. 371–391.
[33] K. Stoffelen, “Efficient cryptography on the RISC-V architecture,” in Proc. LATINCRYPT, Santiago de Chile, Chile, Oct. 2019, pp. 323–340.
[34] B. Jungk and M. Stottinger, “Hobbit—Smaller but faster than a dwarf: Revisiting lightweight SHA-3 FPGA implementations,” in Proc. Int. Conf. ReConFigurable Comput., Cancun, Mexico, Nov. 2016, pp. 1–7.
[35] T. Fritzmann, U. Sharif, D. Muller-Gritschneder, C. Reinbrecht, U. Schlichtmann, and J. Sepulveda, “Towards reliable and secure post-quantum co-processors based on RISC-V,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Florence, Italy, Mar. 2019, pp. 1148–1153.
[36] M. J. Kannwischer, J. Rijneveld, P. Schwabe, and K. Stoffelen, “PQM4: Post-quantum crypto library for the ARM Cortex-M4,” PQM4, Tech. Rep., 2018.
[37] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, “High-performance area-efficient polynomial ring processor for CRYSTALS-kyber on FPGAs,” Integration, vol. 78, pp. 25–35, May 2021.
[38] C. Zhang et al., “Towards efficient hardware implementation of NTT for kyber on FPGAs,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
[39] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “Cryptographic accelerators for digital signature based on Ed25519,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 7, pp. 1297–1305, Jul. 2021.

[40] R. Elkhatib, R. Azarderakhsh, and M. Mozaffari-Kermani, “Highly optimized Montgomery multiplier for SIKE primes on FPGA,” in Proc. IEEE 27th Symp. Comput. Arithmetic (ARITH), Portland, OR, USA, Jun. 2020, pp. 64–71.
[41] M. B. Niasar, R. El Khatib, R. Azarderakhsh, and M. Mozaffari-Kermani, “Fast, small, and area-time efficient architectures for key-exchange on Curve25519,” in Proc. IEEE 27th Symp. Comput. Arithmetic (ARITH), Portland, OR, USA, Jun. 2020, pp. 72–79.
[42] M. B. Niasar, R. Azarderakhsh, and M. M. Kermani, “Efficient hardware implementations for elliptic curve cryptography over Curve448,” in Proc. 21st Int. Conf. Cryptol., Indocrypt, India, Dec. 2020, pp. 228–247.

Mojtaba Bisheh-Niasar (Student Member, IEEE) received the B.Sc. degree from Amirkabir University of Technology in 2011 and the M.Sc. degree in electrical engineering from Iran University of Science and Technology in 2015. He is currently pursuing the Ph.D. degree in computer engineering with Florida Atlantic University under the supervision of Dr. Azarderakhsh. He is also a Research Assistant with I-SENSE Lab. He is a Research Intern in azure hardware security architecture (AHSA) at Microsoft, Redmond, Washington. His research interests include applied cryptography, post-quantum cryptography, and efficient implementation of cryptographic algorithms.

Reza Azarderakhsh (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from Western University in 2011. He has worked at the Center for Applied Cryptographic Research and the Department of Combinatorics and Optimization, University of Waterloo. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Florida Atlantic University. His current research interests include finite field and its application, elliptic curve cryptography, isogenies on elliptic curves, and lattice-based post-quantum cryptography. He was a recipient of the NSERC Post-Doctoral Research Fellowship. He is serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS.

Mehran Mozaffari-Kermani (Senior Member, IEEE) received the B.Sc. degree from the University of Tehran, Iran, and the M.E.Sc. and Ph.D. degrees from the University of Western Ontario, London, Canada, in 2007 and 2011, respectively. In 2012, he joined the Department of Electrical Engineering, Princeton University, NJ, USA, as an NSERC Post-Doctoral Research Fellow. From 2013 to 2017, he was an Assistant Professor with Rochester Institute of Technology and has joined the Department of Computer Science and Engineering, University of South Florida, in 2017, where he is currently an Associate Professor. He has been the TPC Member for a number of conferences, including HOST (publications chair), CCS (publications chair), DAC, DATE, RFIDSec, LightSec, WAIFI, FDTC, and DFT. He is serving as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, the Transactions on Embedded Computing Systems (ACM), and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS. He has been a Guest Editor of the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, the IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, and the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING for special issues on security.