Instruction-Set Accelerated Implementation of CRYSTALS-Kyber
Abstract— Large-scale quantum computers will break classical public-key cryptography protocols by quantum algorithms such as Shor’s algorithm. Hence, designing quantum-safe cryptosystems to replace current classical algorithms is crucial. Luckily, there are some post-quantum candidates that are assumed to be resistant against future attacks from quantum computers, and NIST is considering standardizing them. Among these candidates, lattice-based cryptography is of particular interest due to its performance results as well as the confidence in its security. There are few works in the literature evaluating the performance of lattice-based cryptography in hardware. In this paper, we focus on the Cryptographic Suite for Algebraic Lattices (CRYSTALS) key exchange mechanism known as Kyber, provide an instruction-set hardware architecture, and implement it on a Xilinx Artix-7 FPGA for performance evaluation and testing. Our proposed architecture provides an efficient and high-performance set of components to perform polynomial sampling, number-theoretic transform (NTT), and point-wise multiplication to speed up lattice-based post-quantum cryptography (PQC). This architecture implemented on ASIC outperforms state-of-the-art implementations.
Index Terms— ASIC, FPGA, hardware architecture, Kyber, lattice-based cryptography, post-quantum cryptography.
Manuscript received December 28, 2020; revised April 13, 2021 and June 10, 2021; accepted August 11, 2021. Date of publication August 30, 2021; date of current version November 9, 2021. This work was supported by NSF under Grant 1801341. This article was recommended by Associate Editor S. Yin. (Corresponding author: Mojtaba Bisheh-Niasar.)
Mojtaba Bisheh-Niasar is with the Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431 USA, and also with I-SENSE, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]).
Reza Azarderakhsh is with the Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431 USA, also with I-SENSE, Florida Atlantic University, Boca Raton, FL 33431 USA, and also with PQSecure Technologies LLC, Boca Raton, FL 33431 USA (e-mail: [email protected]).
Mehran Mozaffari-Kermani is with the Department of Computer Engineering and Science, University of South Florida, Tampa, FL 33620 USA (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2021.3106639.
Digital Object Identifier 10.1109/TCSI.2021.3106639
1549-8328 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: National Institute of Technology - Agartala. Downloaded on January 03,2025 at 05:45:58 UTC from IEEE Xplore. Restrictions apply.
BISHEH-NIASAR et al.: INSTRUCTION-SET ACCELERATED IMPLEMENTATION OF CRYSTALS-KYBER 4649
I. INTRODUCTION
QUANTUM computing development constitutes a significant threat to classical public-key cryptography protocols based on Shor’s algorithm [1]. Most current cryptosystems, e.g., RSA and Elliptic Curve Cryptography (ECC), are envisioned to be broken once large quantum computers are built. Thus, designing the lattice-based cryptosystem as one of the most promising algorithms in Post-Quantum Cryptography (PQC) based on alternative mathematical features has become a fundamental research topic.
Recently, the National Institute of Standards and Technology (NIST) announced the third-round finalists, which include 4 key encapsulation mechanisms (KEMs) and 3 signature schemes [2]. Among these KEM schemes, CRYSTALS-Kyber shares a common framework with the CRYSTALS-Dilithium signature scheme [2]. This scheme also supports efficient matrix-vector and vector-vector multiplication over a polynomial ring using the fast number-theoretic transform (NTT) [3]. Although the optimization of NTT-based multiplication is not a new idea and is used in countless applications, particularly in signal processing, it is still a performance bottleneck in lattice-based cryptography implementations. Thus, several works have optimized NTT from different perspectives, such as resource utilization, performance, efficiency, and energy consumption.
Recently, implementations of lattice-based cryptography have been investigated on various platforms. While software (SW) implementations offer programming capabilities, flexibility, and a shorter design cycle, the hardware (HW) platforms accelerate the computations and result in significantly higher throughput. Recently, there have been considerable efforts to implement cryptosystems using hardware-software (HW/SW) co-design. This method makes the design smaller, slower, and more controllable/programmable compared to pure HW schemes at the cost of implementing a software-based processor. Furthermore, a HW/SW co-design requires a shorter design period; nevertheless, this method may not lead to the best performance. On the other hand, pure hardware implementations can be significantly accelerated using well-known optimization strategies, including register balancing, parallelization, and resource sharing, to increase the overall throughput of the hardware architectures. The main difficulty of this strategy is that its hand-optimized design requires a longer time and may come at the cost of losing flexibility.
To transition to PQC, we must develop hybrid cryptosystems to maintain industry or government regulations, while PQC updates will be applied thoroughly. Therefore, classical cryptosystems, e.g., ECC, cannot be eliminated even if PQC is significantly developed. The instruction-set processor builds an appropriate platform for accelerated implementation compared to SW and HW/SW, while the architecture remains flexible compared to highly optimized HW. Specifically, the flexible HW architecture is a promising solution for
integrating classic cryptosystems and PQC to move towards hybrid systems.
Kyber is notable for high speed and constant-time implementations. It has to be implemented on various platforms subject to the performance requirement. However, Kyber has not been sufficiently studied in the field of hardware implementation. Therefore, investigation of the hardware implementation is required considering the advantages of FPGA-based architectural designs to exploit parallelism, which leads to improvements in the efficiency of the overall system. In this paper, we implement a pure hardware design since it is faster and could be integrated into any HW/SW co-design solution.
A. Related Work
Software implementation of Kyber has been studied by Botros et al. in [3], proposing a memory-efficient high-speed implementation on Cortex-M4. Recently, several PQC schemes have been implemented, targeting HW/SW co-design. The work of [4] was one of the first initiatives of post-quantum acceleration using high-level synthesis (HLS). Furthermore, Banerjee et al. in [5] proposed a flexible ASIC crypto-processor to support several lattice-based algorithms in a RISC-V architecture, including Frodo, NewHope, qTESLA, and CRYSTALS-Kyber/Dilithium. This work is extended in [6] to show FPGA validation results. Their design strategy targets reducing power consumption. The authors in [7] employ the RISC-V processor integrated with a finite field multiplier to accelerate polynomial multiplications in a lightweight architecture of NewHope and Kyber. In [8], vectorized modular arithmetic and NTT computations employing RISC-V are proposed for NewHope, Kyber, and Saber. The vector processor architecture based on the extensible RISC-V architecture has been studied in [9], which shows a remarkable speedup while occupying 979k gate equivalent (GE) in ASIC implementations.
The pure hardware architectures of Kyber are proposed in [10]–[13]. The work of [10] heavily relies on BlockRAM primitives between components to perform arithmetic tasks and store intermediate results. We addressed the high-performance implementation of Kyber in our previous work [13] as the fastest Kyber design in the literature. The authors in [14] proposed a Kyber processor for computing NTT and point-wise multiplication. An instruction-set coprocessor for Saber is presented in [15] to design a flexible hardware architecture using the quadratic-complexity schoolbook polynomial multiplication algorithm. Schoolbook polynomial multiplication is also employed in [16].
Since NTT plays a central role in lattice-based cryptography, several hardware implementations focus on NTT from performance, efficiency, and flexibility perspectives. The work of [17], [18] introduced a scalable NTT architecture that can be used for various lattice-based schemes. Furthermore, the authors in [19] proposed a RISC-V architecture to increase efficiency and flexibility for NTT computation used in NewHope, qTESLA, CRYSTALS-Kyber, CRYSTALS-Dilithium, and Falcon. Additionally, Fritzmann and Sepúlveda [20] proposed an efficient and low-power NTT, which reduces the number of clock cycles to $n\log(n)$ cycles. The authors in [21] proposed a low-complexity NTT/INTT in the architecture of NewHope-NIST.
The proposed architecture combines the NTT, INTT, and point-wise multiplication architectures in an efficient way to utilize significantly fewer resources and improve the overall performance. To do so, using the Cooley-Tukey (CT) butterfly for NTT and the Gentleman-Sande (GS) butterfly for INTT [22], [23] is a well-known trick in the literature. Moreover, the resource sharing technique from [5], [24] is extended by using compact storage for pre-computed twiddle factors from [25] and the doubled bandwidth scheme from [14], [21] to obtain a high-performance architecture.
B. Our Contributions
To the best of our knowledge, there appear to be very few pure hardware implementations that focus only on the Kyber cryptosystem and make the best of all its features. This paper proposes an efficient hardware implementation of the module lattice-based post-quantum KEM CRYSTALS-Kyber on a Xilinx Artix-7 FPGA (as recommended by NIST) and the application-specific integrated circuit (ASIC) platform.
Our proposed architecture provides an efficient and high-performance set of components, including polynomial sampling, NTT, and point-wise multiplication, to accelerate lattice-based PQC exploiting fewer resources. The contributions of this paper are itemized in the following:
1) We propose a new approach for implementing a resource-efficient reconfigurable butterfly core on FPGA. We reduce the execution time for Kyber NTT computation from $\frac{N}{2}\log_2\frac{N}{2} + 2N$ to $\frac{N}{2}\log_2\frac{N}{4}$ by doubling the transform throughput and merging the pre-processing into the NTT algorithm. We also customize a memory addressing strategy to implement a high-speed polynomial multiplier on the target platform.
2) We highly parallelize the operations in the polynomial sampling cores through tight coupling with the Keccak core to decrease the required cycles. The proposed parallel scheduling for the binomial sampler yields a significant performance improvement, while our rejection sampler latency can be completely absorbed by the Keccak core.
3) Our fast and scalable architecture provides a constant-time implementation over three different quantum security levels. To enhance our HW accelerator from a flexibility point of view, we design a set of customized high-level instruction codes to run the protocol. Hence, this set identifies the control flow of the proposed components and provides flexibility for integration with host processors.
4) We employ various optimization techniques to achieve an overall optimization in terms of efficiency, including parallelization, resource sharing, and utilizing distributed RAM and ROM blocks, which significantly improve the area-time product. The proposed implementation is constant-time and is resistant to known timing attacks.
The rest of the paper is organized as follows. In Sec. II, we discuss the preliminaries. In Sec. III, our proposed algorithms
4650 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 11, NOVEMBER 2021
TABLE I
THE LIST OF SYMBOLS AND NOTATIONS USED IN THIS PAPER
TABLE II
PARAMETER SETS FOR KYBER IMPLEMENTATION [26]
ct = (Compress(u), Compress(v)).
• Dec(sk, ct): Message m is computed such that $m = \mathrm{Compress}(v - \mathrm{INTT}(\hat{s}^T \circ \hat{u}))$, while u and v are extracted from ct.
1) Keccak: The most performance-critical part of the software implementation is the Keccak core based on the profiled cycle counts presented in [3], [7]. In fact, more than half of the reported clock cycles in SW and HW/SW benchmarking are used to compute Keccak. However, this core can be accelerated in a pure hardware architecture since Keccak is a hardware-friendly design of SHA.
$\sum_{j=0}^{127} a_{2j+1}\zeta^{(2\mathrm{br}_7(i)+1)j}$, where $\zeta = 17$ is the first primitive 256-th root of unity modulo $q$, and $\mathrm{br}_7$ is the bit reversal function. The pseudo-code of the iterative NTT is shown in Algorithm 1. The INTT is similar to NTT, while $\omega_n^{-1}$ is used instead of $\omega_n$, and the resulting coefficients of $a(x)$ are divided by $n$.
However, the original computation of NTT and INTT needs pre-processing and post-processing, respectively. A point-wise multiplication includes 128 multiplications of two-coefficient (degree-1) polynomials modulo $X^2 - \zeta^{2\mathrm{br}_7(i)+1}$.
TABLE III
PROPOSED INSTRUCTION FOR HASHING
Fig. 4. The proposed address flow of our NTT memory architecture in the first two stages. (Butterfly inputs are in white and outputs are in black).
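A rough cycle model of a ping-pong memory schedule of the kind this section describes is sketched below in Python. It is an illustration under stated assumptions, not the authors' controller: it assumes a single butterfly unit that consumes one coefficient pair per cycle (128 cycles per round for n = 256), a five-cycle pipeline drain at the end of each round, seven rounds, and no overlap between consecutive rounds. Which RAM block holds the input of the first round is also an assumption, and the function name is hypothetical.

```python
def ping_pong_schedule(n=256, pipeline_depth=5):
    """Return one (round, source RAM, destination RAM, cycles) tuple per round."""
    rounds = (n // 2).bit_length() - 1  # log2(n/2), i.e. 7 rounds for n = 256
    issue_cycles = n // 2               # one butterfly (two coefficients) per cycle
    schedule = []
    src, dst = 0, 1                     # assumption: round 0 reads RAM 0, writes RAM 1
    for r in range(rounds):
        schedule.append((r, src, dst, issue_cycles + pipeline_depth))
        src, dst = dst, src             # the two RAM blocks swap roles every round
    return schedule


schedule = ping_pong_schedule()
total_cycles = sum(cycles for _, _, _, cycles in schedule)
```

Under these assumptions the model gives 7 × (128 + 5) = 931 cycles for one transform. A real design may overlap the drain of one round with the start of the next, so this figure is only an upper-level estimate.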
modular multiplications for INTT are required. To avoid the bit-reverse permutation in Algorithm 1, two different butterfly configurations, i.e., CT and GS, are required for NTT and INTT, respectively, as follows:
$f \cdot g = \mathrm{INTT}_{GS}(\mathrm{NTT}_{CT}(f) \circ \mathrm{NTT}_{CT}(g)). \quad (5)$
To be consistent with the standard software implementation, the input polynomials in normal order are transformed to the NTT domain in bit-reverse order employing the CT configuration, while twiddle factors are absorbed in bit-reversed order. The point-wise multiplication is performed in bit-reverse order and transformed back using the GS configuration in normal order. However, the required twiddle factors are absorbed in the bit-reversed order.
We observe that an efficient implementation of point multiplication requires 3,584 modular multiplications, reducing complexity by 18% compared to the naive implementation. According to Fig. 3, for the NTT operation, the butterfly is arranged based on the CT configuration, while in INTT, it is reconfigured to match the GS configuration. In NTT/INTT, when the pipeline is full, the butterfly unit can read two data inputs and write two outputs in each clock cycle.
The most crucial bottleneck in implementing the NTT core is memory access because memory access patterns change during each operation stage [15], [32]. Therefore, designing efficient memory management is critical to avoid memory conflicts and achieve high throughput. On the other hand, memory bandwidth limits the efficiency of the butterfly operation. Hence, we use two memory units to provide double bandwidth during the NTT operation to reduce latency. In the first round, the results are stored in NTT RAM 0. After completing the first round, the input coefficients are read from NTT RAM 0, and the butterfly outputs are stored in NTT RAM 1. This scenario is repeated for seven rounds until NTT is computed.
In this method, two coefficients are fetched from the first RAM block at a time and fed into a butterfly unit. Then, the butterfly output will be prepared and written into the second RAM block after the pipelined stages, i.e., five cycles. Employing the ping-pong strategy, after 128 cycles, all coefficients are fed into the butterfly core, and five additional cycles are required to complete a round of NTT/INTT computation. In the next round, the input coefficients are fetched from the second RAM block, and the outputs are stored in the first RAM block. This computation continues until all seven required rounds of NTT are complete. To optimize the memory utilization in this method, different vectors are stored in the same RAM block. For example, $s_0$ and $s_1$ are located in the same memory, where in each address the lower column stores $s_0$ and the higher column stores $s_1$ coefficients. In each clock cycle, two addresses of memory (e.g., $i$ and $j$) are read, which contain four coefficients, i.e., $s_{0,i}$ and $s_{1,i}$ from address $i$, and $s_{0,j}$ and $s_{1,j}$ from address $j$. Then, $s_{0,i}$ and $s_{0,j}$ are fed into the first butterfly, while $s_{1,i}$ and $s_{1,j}$ are used by the second core. The results of these cores will be stored in the same fashion in the second RAM. Fig. 4 shows the address flow of our proposed NTT architecture using RAM0 and RAM1.
To implement a highly parallel architecture, we implement multiple butterfly units matched with the number of polynomial vectors in $s$, i.e., two, three, and four units for Kyber-512, Kyber-768, and Kyber-1024, respectively.
Our first method reduces the NTT execution time from $\frac{N}{2}\log_2\frac{N}{2} + 2N$ to $\frac{N}{2}\log_2\frac{N}{2}$ compared with the naive implementation. In our second method, we take advantage of the NTT definition in the Kyber scheme to perform two independent NTT computations for odd and even coefficients. Hence, we employ two butterfly cores in parallel to compute NTT, which halves the execution time to $\frac{N}{2}\log_2\frac{N}{4}$. In this method, each address of memory stores two consecutive coefficients, i.e., $s_{i,2j}$ and $s_{i,2j+1}$. Then, two addresses of memory, which contain four coefficients, are fed into two butterfly cores, i.e., $s_{i,2j}$ and $s_{i,2j+1}$ from address $j$, and $s_{i,2k}$ and $s_{i,2k+1}$ from address $k$ of memory. So, $s_{i,2j}$ and $s_{i,2k}$ are used for the first butterfly, and are processed independently from $s_{i,2j+1}$ and $s_{i,2k+1}$ in the second core. Similar to the previous method, the results are stored in the same fashion in the second RAM. Although this method does not improve the efficiency, since it doubles the resources to halve the latency, it can accelerate the computations to target high-performance architectures.
2) Optimizing Point-Wise Multiplication: To implement an optimized high-throughput point-wise multiplication core, we use a specific memory pattern for the matrix $\hat{A}$ coefficients. In our proposed memory pattern for $\hat{A}$, four consecutive coefficients are stored in pairs, i.e., $(\hat{A}_{00}(3), \hat{A}_{00}(2), \hat{A}_{00}(1), \hat{A}_{00}(0)), \ldots, (\hat{A}_{11}(255), \hat{A}_{11}(254), \hat{A}_{11}(253), \hat{A}_{11}(252))$. Further, two parallel butterfly cores are employed to accelerate the polynomial multiplication. The number of pipelined stages is set to five to design a high-throughput architecture for point-wise multiplication, i.e., 4 coefficients per 5 cycles. In other words, based on detailed scheduling and our proposed memory scheme, this design results in higher throughput while limiting the maximum operating frequency. It is observed that the path from the reduction output to the multiplier is the critical path. Nevertheless, increasing the pipeline latency improves the critical path
TABLE VI
FPGA IMPLEMENTATION RESULTS FOR OUR NTT CORE AND COMPARISON WITH STATE-OF-THE-ART (n = 256)
TABLE VII
ASIC RESULTS FOR NTT AND COMPARISON WITH STATE-OF-THE-ART
TABLE VIII
FPGA IMPLEMENTATION RESULTS AND COMPARISON WITH STATE-OF-THE-ART
TABLE X
COMPARISONS WITH EXISTING FPGA-BASED PQC IMPLEMENTATIONS OF CCA-SECURE KEM SCHEMES IN NIST SECURITY LEVEL 5
performance of the proposed design on the state-of-the-art targeted platforms, which changes performance by a factor of 1.35×, 1.4×, and 0.68× on Zynq-7000, Virtex-7, and Virtex-6 compared to Artix-7.
We compare our architecture results to the best SW design on the ARM Cortex-M4 chip, as well as the HW implementations and the HW/SW co-design. The total latency is the summation of key encapsulation and key decapsulation (Encaps + Decaps), as the key generation can be done offline. As one can see, for NIST level 1 security, our proposed scheme occupies 18k LUTs, 5k FFs, 6 DSPs, and 15 BRAMs. It also runs at 115 MHz and performs the whole Kyber protocol in 148 μs. Our design achieves a speedup factor of 83.9× and 74.1× compared to the leading counterpart in SW and HW/SW designs. Furthermore, our architecture employing the various optimization techniques is highly efficient, with the area-time trade-off being about 98% improved compared to [6]. It is to be noted that the HW/SW co-design [6]–[9] is a complete design for all Kyber security levels. The same improvement can be observed in the remaining security levels. Compared to HW architectures, our proposed design takes 5× longer than our previous work [13], resulting in a greater A × T by a factor of 7. Our design is also 2× slower and 2.5× larger compared to [12]. However, this overhead is the price of keeping the customized instruction-set design flexible compared to highly parallel [13] or highly compact architectures [12]. Hardware specially designed to cater to one scheme may lack flexibility; thereby, this work aims to achieve both high speed and flexibility for Kyber to support extension for building a hybrid cryptosystem.
Although our implementations are constant-time, investigating side-channel analysis attacks will be part of our future work.
D. ASIC Results
The ASIC implementation results of our architectures based on the 65-nm TSMC cell library using Synopsys Design Compiler are presented in this section. All the designs are synthesized with a 5 ns clock period. Table IX reports the maximum clock frequency and the amount of logic cells for our proposed designs and state-of-the-art implementations. As one can see, the placed-and-routed design of our proposed Kyber-1024 consists of 104 kGE for logic and 190 KB SRAM for memory, which shows a significant speedup compared to previous works.
E. Comparison With Other Implementations
In Table X, the comparison between our proposed architecture and some existing PQC hardware implementations targeting NIST security level 5 is reported. It should be noted that due to the varying technologies of different FPGA generations, a fully fair comparison is not possible.
In [15], a fast architecture of Saber is proposed using the high-speed instruction-set coprocessor on a Xilinx ZCU102 board. In this work, a non-NTT-based approach is used, taking advantage of the power-of-2 modulus in the Saber scheme, which results in an execution time of 153 μs. Employing multiply-and-accumulate units provides the required trade-off between area and time for different applications. However, this design needs more hardware resources compared to ours, which results in a 1.1× area-time product.
We also compare our work with FrodoKEM-1344 based on the standard learning with errors problem. To the best of our knowledge, there is no pure HW work for FrodoKEM targeting security level 5; hence, the results in [6], which uses a HW/SW approach, are reported. As one can see, the FrodoKEM scheme requires a considerable number of cycles compared to other PQC schemes due to performing expensive matrix-vector multiplications. Our implementation of Kyber-1024 is almost 26,000 times faster, occupying almost the same resources compared to [6].
SIKE [40] as an isogeny-based PQC scheme requires significantly more DSP resources to design a parallel Montgomery multiplier architecture over a large prime. Although this scheme outperforms the FrodoKEM implementation, our Kyber-1024 design shows a 155 times better area-time product compared to this scheme.
It should be noted that there is a large body of work on optimizing PQC schemes on a variety of platforms. For example, the works of [21] and [28] implement NewHope on a Xilinx XC7Z020 and Zynq-7000, respectively. The architecture of NewHope is very similar to that of Kyber; however, this scheme has not been selected to continue into the third round of NIST. In [21], a low-complexity architecture of NewHope is introduced, having a competitive performance compared to our design. Hence, taking advantage of this architecture to improve the total performance of Kyber is kept for future work.
Although one of the drawbacks of various post-quantum cryptosystems is requiring larger key sizes and more computational power than the current pre-quantum algorithms, our proposed implementation already reaches performance levels comparable to or even significantly better than pre-quantum algorithms [30], [41], [42].
V. CONCLUSION
The threat from large-scale quantum computers is real, and we need to act now as the deployment, integration, and migration to quantum-safe security systems take several
years. In this paper, we have presented an instruction-set post-quantum cryptosystem for CRYSTALS-Kyber. Our proposed architecture is synthesized for a Xilinx Artix-7 FPGA prototype (the platform recommended by NIST for prototyping) and an ASIC. Implementing efficient components, including sampling cores, NTT, and point-wise multiplication architectures, increases the performance compared to the state-of-the-art SW and HW/SW implementations. More specifically, our proposed architecture performs Kyber-512, Kyber-768, and Kyber-1024 protocols in only 148, 209, and 286 μs on an Artix-7 FPGA, respectively. Our future work will focus on the side-channel resistance and the development of countermeasures against such attacks.
ACKNOWLEDGMENT
The authors would like to thank the reviewers for their comments.
REFERENCES
[1] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. 35th Annu. Symp. Found. Comput. Sci., Santa Fe, NM, USA, Nov. 1994, pp. 124–134.
[2] Status Report on the Second Round of the NIST Post-Quantum Cryptography Standardization Process, Nat. Inst. Standards Technol., Gaithersburg, MD, USA, 2020.
[3] L. Botros, M. J. Kannwischer, and P. Schwabe, “Memory-efficient high-speed implementation of Kyber on Cortex-M4,” in Proc. 11th Int. Conf. Cryptol., Rabat, Morocco, Jul. 2019, pp. 209–228.
[4] K. Basu, D. Soni, M. Nabeel, and R. Karri, “NIST post-quantum cryptography a hardware evaluation study,” in Proc. IACR, 2019, p. 47.
[5] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols,” in Proc. IACR, vol. 4, 2019, pp. 17–61.
[6] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols (extended version),” in Proc. IACR, 2019, p. 1140.
[7] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA extensions for finite field arithmetic accelerating Kyber and NewHope on RISC-V,” in Proc. IACR, vol. 3, 2020, pp. 219–242.
[8] T. Fritzmann, G. Sigl, and J. Sepúlveda, “RISQ-V: Tightly coupled RISC-V accelerators for post-quantum cryptography,” in Proc. IACR, Aug. 2020, pp. 239–280.
[9] G. Xin et al., “VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2672–2684, Aug. 2020.
[10] Y. Huang, M. Huang, Z. Lei, and J. Wu, “A pure hardware implementation of CRYSTALS-KYBER PQC algorithm through resource reuse,” IEICE Electron. Exp., vol. 17, no. 17, 2020, Art. no. 20200234.
[11] V. B. Dang, F. Farahmand, M. Andrzejczak, K. Mohajerani, D. T. Nguyen, and K. Gaj, “Implementation and benchmarking of round 2 candidates in the NIST post-quantum cryptography standardization process using hardware and software/hardware co-design approaches,” in Proc. IACR Cryptol. Arch., 2020, p. 795.
[12] Y. Xing and S. Li, “A compact hardware implementation of CCA-
[16] Y. Zhang, C. Wang, D. E. S. Kundi, A. Khalid, M. O’Neill, and W. Liu, “An efficient and parallel R-LWE cryptoprocessor,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 5, pp. 886–890, May 2020.
[17] A. C. Mert, E. Karabulut, E. Ozturk, E. Savas, M. Becchi, and A. Aysu, “A flexible and scalable NTT hardware: Applications from homomorphically encrypted deep learning to post-quantum cryptography,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Grenoble, France, Mar. 2020, pp. 346–351.
[18] A. C. Mert, E. Karabulut, E. Ozturk, E. Savas, and A. Aysu, “An extensive study of flexible design methods for the number theoretic transform,” IEEE Trans. Comput., early access, Aug. 19, 2020, doi: 10.1109/TC.2020.3017930.
[19] E. Karabulut and A. Aysu, “RANTT: A RISC-V architecture extension for the number theoretic transform,” in Proc. 30th Int. Conf. Field-Program. Log. Appl. (FPL), Aug. 2020, pp. 26–32.
[20] T. Fritzmann and J. Sepulveda, “Efficient and flexible low-power NTT for lattice-based cryptography,” in Proc. IEEE Int. Symp. Hardw. Oriented Secur. Trust (HOST), McLean, VA, USA, May 2019, pp. 141–150.
[21] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT,” in Proc. IACR, Mar. 2020, pp. 49–72.
[22] T. Pöppelmann, T. Oder, and T. Güneysu, “High-performance ideal lattice-based cryptography on 8-bit ATxmega microcontrollers,” in Proc. LATINCRYPT, Guadalajara, Mexico, Aug. 2015, pp. 346–365.
[23] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for faster ideal lattice-based cryptography,” in Proc. 15th Int. Conf., Milan, Italy, Nov. 2016, pp. 124–139.
[24] P.-C. Kuo et al., “High performance post-quantum key exchange on FPGAs,” in Proc. IACR, 2017, p. 690.
[25] C. Du and G. Bai, “Towards efficient polynomial multiplication for lattice-based cryptography,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Montréal, QC, Canada, May 2016, pp. 1178–1181.
[26] R. Avanzi et al., “CRYSTALS-Kyber: Algorithm specification and supporting documentation (version 3.0). Submission to the NIST post-quantum cryptography standardization project,” NIST Post-Quantum Cryptogr. Standardization Project, Tech. Rep., 2020.
[27] J. Bos et al., “CRYSTALS-Kyber: A CCA-secure module-lattice-based KEM,” in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), London, U.K., Apr. 2018, pp. 353–367.
[28] Y. Xing and S. Li, “An efficient implementation of the NewHope key exchange on FPGAs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 3, pp. 866–878, Mar. 2020.
[29] P. Barrett, “Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor,” in Proc. CRYPTO, Santa Barbara, CA, USA, 1986, pp. 311–323.
[30] M. Bisheh-Niasar, R. Azarderakhsh, and M. M. Kermani, “Area-time efficient hardware architecture for signature based on Ed448,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 8, pp. 2942–2946, Aug. 2021.
[31] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche, “Keccak in VHDL,” Keccak Team, Tech. Rep., 2020.
[32] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede, “Compact ring-LWE cryptoprocessor,” in Proc. Cryptograph. Hardw. Embedded Syst., Busan, South Korea, Sep. 2014, pp. 371–391.
[33] K. Stoffelen, “Efficient cryptography on the RISC-V architecture,” in Proc. LATINCRYPT, Santiago de Chile, Chile, Oct. 2019, pp. 323–340.
[34] B. Jungk and M. Stottinger, “Hobbit—Smaller but faster than a dwarf: Revisiting lightweight SHA-3 FPGA implementations,” in Proc. Int. Conf. ReConFigurable Comput., Cancun, Mexico, Nov. 2016, pp. 1–7.
[35] T. Fritzmann, U. Sharif, D. Muller-Gritschneder, C. Reinbrecht, U. Schlichtmann, and J. Sepulveda, “Towards reliable and secure post-quantum co-processors based on RISC-V,” in Proc. Design, Autom. Test
secure key exchange mechanism CRYSTALS-KYBER on FPGA,” in Eur. Conf. Exhib. (DATE), Florence, Italy, Mar. 2019, pp. 1148–1153.
Proc. IACR, Feb. 2021, pp. 328–356. [36] M. J. Kannwischer, J. Rijneveld, P. Schwabe, and K. Stoffelen,
[13] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “PQM4: Post-quantum crypto library for the ARM Cortex-M4,” PQM4,
“High-speed NTT-based polynomial multiplication accelerator for Tech. Rep., 2018.
CRYSTALS-Kyber post-quantum cryptography,” Proc. IACR, 2021, [37] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, “High-performance area-
p. 563. efficient polynomial ring processor for CRYSTALS-kyber on FPGAs,”
[14] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, “Towards efficient kyber on Integration, vol. 78, pp. 25–35, May 2021.
FPGAs: A processor for vector of polynomials,” in Proc. 25th Asia South [38] C. Zhang et al., “Towards efficient hardware implementation of NTT
Pacific Design Autom. Conf. (ASP-DAC), Beijing, China, Jan. 2020, for kyber on FPGAs,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS),
pp. 247–252. May 2021, pp. 1–5.
[15] S. Sinha Roy and A. Basso, “High-speed instruction-set coprocessor [39] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani,
for lattice-based key encapsulation mechanism: Saber in hardware,” in “Cryptographic accelerators for digital signature based on Ed25519,”
Proc. IACR Trans. Cryptograph. Hardw. Embedded Syst., Aug. 2020, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 7,
pp. 443–466. pp. 1297–1305, Jul. 2021.
[40] R. Elkhatib, R. Azarderakhsh, and M. Mozaffari-Kermani, “Highly optimized Montgomery multiplier for SIKE primes on FPGA,” in Proc. IEEE 27th Symp. Comput. Arithmetic (ARITH), Portland, OR, USA, Jun. 2020, pp. 64–71.
[41] M. B. Niasar, R. El Khatib, R. Azarderakhsh, and M. Mozaffari-Kermani, “Fast, small, and area-time efficient architectures for key-exchange on Curve25519,” in Proc. IEEE 27th Symp. Comput. Arithmetic (ARITH), Portland, OR, USA, Jun. 2020, pp. 72–79.
[42] M. B. Niasar, R. Azarderakhsh, and M. M. Kermani, “Efficient hardware implementations for elliptic curve cryptography over Curve448,” in Proc. 21st Int. Conf. Cryptol., Indocrypt, India, Dec. 2020, pp. 228–247.

Reza Azarderakhsh (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from Western University in 2011. He has worked at the Center for Applied Cryptographic Research and the Department of Combinatorics and Optimization, University of Waterloo. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Florida Atlantic University. His current research interests include finite field and its application, elliptic curve cryptography, isogenies on elliptic curves, and lattice-based post-quantum cryptography. He was a recipient of the NSERC Post-Doctoral Research Fellowship. He is serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS.