IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 72, NO. 1, JANUARY 2025

Efficient Number Theoretic Transform Architecture for CRYSTALS-Kyber

Khalid Javeed, Senior Member, IEEE, and David Gregg

Abstract—The Number Theoretic Transform (NTT) is a central primitive to compute polynomial multiplication in a finite ring for both post-quantum cryptography (PQC) and fully homomorphic encryption (FHE) schemes. This brief presents a novel, efficient NTT hardware architecture suitable for CRYSTALS-Kyber, one of the NIST PQC standards. It is based on a novel unified butterfly unit (UBU) developed by combining interleaved multiplication, radix-4, and resource-sharing strategies. This unit computes all butterfly operations for any generic prime modulus value and is reconfigurable to any modulus length. In the proposed NTT architecture, multiple UBUs are deployed, demonstrating an area-time tradeoff. The UBU and NTT architectures are synthesized and implemented on the Xilinx Artix-7 FPGA platform, and results are reported for different performance evaluation metrics. The implementation results show that our lightweight and high-speed designs achieve up to 5.6× and 7× improvements in resource consumption and efficiency, respectively. To the authors' knowledge, it is the first generic NTT architecture based on interleaved multiplication approaches.

Index Terms—Number theoretic transform, FPGA, interleaved multiplication, post-quantum cryptography.

Received 17 June 2024; revised 7 August 2024; accepted 17 September 2024. Date of publication 20 September 2024; date of current version 27 December 2024. This brief was recommended by Associate Editor S.-B. Ko. (Corresponding author: Khalid Javeed.)
Khalid Javeed is with the Department of Computer Engineering, University of Sharjah, Sharjah, UAE (e-mail: [email protected]).
David Gregg is with the Lero-the Science Foundation Ireland Research Centre for Software, Trinity College Dublin, Dublin 2, D02 PN40 Ireland (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2024.3465273.
Digital Object Identifier 10.1109/TCSII.2024.3465273
1549-7747 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: National Institute of Technology - Agartala. Downloaded on January 03,2025 at 05:42:58 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

WITH the advent of quantum computers (QCs), currently deployed public key schemes such as RSA [1] and elliptic curve cryptography (ECC) [2] can no longer ensure their security. This is because QCs running Shor's algorithm [3] can solve the underlying hard mathematical problems: integer factorization in RSA and the discrete logarithm in ECC. This led the National Institute of Standards and Technology (NIST) to start initiatives for standardizing various post-quantum cryptography (PQC) schemes. In this regard, CRYSTALS-Kyber [4], a lattice-based scheme, is popular for key exchange and encryption/decryption tasks, while Dilithium is standardized as the digital signature scheme [5]. Fully homomorphic encryption (FHE) schemes, in turn, enable operations over encrypted data and have many applications in cloud computing security [6]. Lattice-based cryptography (LBC) is a popular tool for designing PQC and FHE schemes. Kyber security relies on the hardness of the module learning with errors (module-LWE) problem [7]. The fundamental operation in a module lattice is denoted as As + e, where A is a k × k polynomial matrix while s and e are k-dimensional polynomial vectors having their elements in the polynomial ring R_q. Note that the Kyber security level can be adjusted by tuning different parameters.

Several researchers have proposed hardware accelerators to demonstrate the practical feasibility of PQC schemes. Polynomial multiplication over a ring is the central operation in almost all LBC schemes. The number theoretic transform (NTT) is a vital tool to accelerate polynomial multiplication and enable the practical deployment of LBC-based cryptosystems. It reduces the computational complexity of a polynomial multiplication from quadratic O(n^2) to quasi-linear O(n log n).

The butterfly unit (BU) is the fundamental primitive in the design of an NTT hardware architecture. It consists of one modular multiplication (MM), one modular addition (MA), one modular subtraction (MS), and two modular division by 2 (div-by-2) operations. The MM primitive is the most computationally intensive part and is critical to the overall performance of the NTT computation. The div-by-2 operations are required only in the inverse NTT (INTT = NTT^-1). Numerous hardware accelerators for NTT [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] are available, where the efforts have mainly been to optimize the MM primitive in the respective BU. In this regard, three approaches have been explored: Barrett reduction [18], Montgomery reduction [19], and exploiting the special characteristics of a chosen prime modulus q. Montgomery reduction has seen limited deployment due to the need for domain conversion, whereas the Barrett and special modulus-based reductions have been extensively used. In [8], [9], [10], NTT architectures for LBC are proposed using Montgomery reduction, while [11], [12] are based on Barrett reduction and [15], [16], [20] proposed new designs by exploiting the special structure of the selected modulus q [11]. Designs over a special q can deliver higher performance but are tied to the selected q, so the gain in speed is offset by a lack of flexibility. In this brief, we propose an efficient NTT architecture to support any generic q and evaluate its performance on the FPGA platform. The design uses only LUTs, which makes it easier to port to any other FPGA and even to standard ASIC cell technologies. Our main contributions in this brief are given below:

• We propose an efficient NTT architecture that can support any generic modulus q based on a novel unified butterfly unit (UBU).
• In the UBU, we merge MM, MA, MS, and two div-by-2 operations using the interleaved multiplication (IM) [21].

  We modify the IM algorithm to execute all BU operations required in NTT and INTT with a negligible increase in the critical path delay and hardware resource consumption. To reduce the critical path delay, we use a laddering approach to remove the data dependency between the partial product generation and accumulation parts, so that these can be executed concurrently. To the best of the authors' knowledge, this is the first implementation of NTT using IM methods.
• We evaluate our UBU for common q sizes (12 bits) and the NTT/INTT architecture for the Kyber scheme (n = 256) on the Xilinx Artix-7 FPGA platform. The performance evaluation confirms that these deliver better efficiency results in comparison to the state-of-the-art.

This brief is structured as follows: Section II introduces the NTT. Section III presents our novel UBU algorithm and its mapping to the proposed hardware architecture within the overall NTT design. Section IV presents the results.

Fig. 1. CTBU and GSBU internal architectures.

Algorithm 1: Proposed Unified Butterfly Unit (UBU)
Input: V, U, ω
Output: V_next, U_next
1   Initialization: t = 0, t1 = V, t2 = 2 × V mod q
2   α = log2 q + 2 if log2 q is even; α = log2 q + 1 if log2 q is odd
    // OP-1: MM of V and ω (V ⊗ ω)
    // MM started
3   for (i = 0; i ≤ α − 2; i ← i + 2) do
4       switch (ω_(i+2:i)) do
5           when 000 | 111 ⇒ d ← 0
6           when 001 | 010 | 101 | 110 ⇒ d ← t1
7           else ⇒ d ← t2
8       end
9       t1 = β ⊙ t1 mod q
10      t2 = β ⊙ t2 mod q
11      t = t (⊕|⊖) d mod q
12  end
    // OP-1 (MM) completed
    // OP-2: MA of t and U (t ⊕ U)
13  U_next = t ⊕ U
    // OP-3: MS of t and U (t ⊖ U)
14  V_next = t ⊖ V
15  return V_next, U_next
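As a software illustration of Algorithm 1, the following minimal Python sketch models the radix-4, LSB-first interleaved modular multiplication and wraps it in the two butterfly types. It is a behavioral sketch only, not the authors' hardware description. The function names (`booth4_mulmod`, `half_mod`, `ct_butterfly`, `gs_butterfly`) are hypothetical, the iteration count is derived from the bit length of ω rather than from α, and the butterflies assume the textbook Cooley-Tukey form (U + Vω, U − Vω) and Gentleman-Sande form with the halving folded in.

```python
def booth4_mulmod(v, w, q):
    """Return (v * w) mod q by LSB-first radix-4 Booth interleaved multiplication."""
    t = 0                  # accumulator (step 1 of Algorithm 1)
    t1 = v % q             # candidate partial product 1 * 4^k * v
    t2 = (2 * v) % q       # candidate partial product 2 * 4^k * v
    # Assumption: use enough radix-4 digits that the top window starts
    # with a zero bit, so the final Booth digit is never negative.
    digits = w.bit_length() // 2 + 1
    w_ext = w << 1         # implicit zero below the LSB for the first window
    for k in range(digits):
        group = (w_ext >> (2 * k)) & 0b111   # overlapping three-bit window
        if group in (0b000, 0b111):
            d = 0
        elif group in (0b001, 0b010, 0b101, 0b110):
            d = t1
        else:                                # 0b011 or 0b100
            d = t2
        # The MSB of the window selects subtraction (negative Booth digit).
        t = (t - d) % q if group & 0b100 else (t + d) % q
        t1 = (4 * t1) % q                    # steps 9 and 10 with beta = 4
        t2 = (4 * t2) % q
    return t


def half_mod(x, q):
    """Return x / 2 mod q for odd q and 0 <= x < q."""
    return (x >> 1) + ((q + 1) // 2 if x & 1 else 0)


def ct_butterfly(u, v, w, q):
    """Cooley-Tukey butterfly (assumed standard form): (u + v*w, u - v*w)."""
    t = booth4_mulmod(v, w, q)
    return (u + t) % q, (u - t) % q


def gs_butterfly(u, v, w_inv, q):
    """Gentleman-Sande butterfly with div-by-2: ((u + v)/2, (u - v)/2 * w_inv)."""
    s = half_mod((u + v) % q, q)
    d = half_mod((u - v) % q, q)
    return s, booth4_mulmod(d, w_inv, q)
```

With q = 3329 and a 12-bit ω, the loop in `booth4_mulmod` runs seven times, each pass consuming two multiplier bits. Feeding the output of `ct_butterfly` into `gs_butterfly` with the inverse twiddle factor recovers the original pair, which is a convenient sanity check for the operation ordering.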

II. BACKGROUND

The NTT is defined over a finite field F_q with integer roots, whereas the fast Fourier transform (FFT) works in the complex field C with complex roots. Consider a polynomial p(x) in the ring R_q = Z_q[X]/(X^N + 1), where coefficients are reduced modulo q and polynomials are reduced modulo (X^N + 1). The NTT and INTT of polynomial p(x) are represented as p̂(x) ← NTT(p(x)) and p(x) ← INTT(p̂(x)), respectively. Similar to FFT computations, the forward NTT of a given polynomial can be performed using the Cooley-Tukey BU (CTBU), whereas the coefficient reversing operation can be avoided by using the Gentleman-Sande BU (GSBU) for the INTT computation. The internal architectures of these BUs are shown in Fig. 1. It is evident from the figure that CTBU and GSBU each require one MM, one MA, and one MS operation, in different orders. In addition, the GSBU requires two div-by-2 operations. MM is the core operation in both CTBU and GSBU computations and greatly influences the overall performance of the NTT/INTT primitive. Therefore, its optimization is crucial in accelerating LBC-based cryptosystems.

III. PROPOSED UBU MODULE

The proposed UBU module is based on the IM method [21]. The IM method works iteratively: partial products are generated, added, and reduced in each iteration. Numerous modifications [22], [23] have been proposed to design efficient MM modules for ECC. However, ECC works on large operands (256-512 bits), so straightforward adoption of these proposals for the MM operation in CTBU and GSBU is not possible. Moreover, MA, MS, and div-by-2 operations are also required in these BUs.

We modified the IM algorithm so that it can perform all required BU operations. Our proposed UBU algorithm is given in Algorithm 1. Our modifications eliminate data dependencies in critical operations so that these can be executed concurrently with a low hardware footprint. We represent the MM, MA, and MS operations with ⊗, ⊕, and ⊖, respectively, while ⊙ represents a modular multiply-by-4 operation. The ⊙ operation is required in each iteration of the modified algorithm, in the partial product generation part. The algorithm takes four inputs: two polynomial coefficients V and U, a twiddle factor ω, and a modulus q. It produces two new coefficients V_next and U_next after completing the required operations. Three operations, OP-1 (⊗), OP-2 (⊕), and OP-3 (⊖), are specified in steps 3, 13, and 14, respectively. The specified order of these operations is for computing CTBU, where the result of OP-1 (V ⊗ ω) is combined via ⊕ and ⊖ with U and V, respectively. However, these operations can be re-sequenced to facilitate the execution of GSBU. OP-1 is the most time-critical operation and comprises steps 1 to 11, where steps 3 to 11 run iteratively whereas steps 1 and 2 are required only once, for variable initialization and to determine the bit length α of the modulus q. In most PQC schemes, log2 q is an even number, so we need to append two zeros to the left of the MSB of the multiplier ω, i.e., α = log2 q + 2.

In step 1, a modular multiply-by-2 of the multiplicand V is pre-computed, in addition to loading t and t1 with zero and V, respectively. To execute OP-1, our algorithm scans the multiplier (ω) from the least significant bit (LSB) and forms groups of three bits (step 3), where the MSB of the current group acts as the LSB of the subsequent group. Therefore, due to radix-4, each iteration processes two bits of the multiplier. Steps 9, 10, and 11 are the main execution steps in performing OP-1. These three steps


Fig. 2. Proposed Unified Butterfly Architecture.

have no data dependencies and can operate in parallel. The possible partial products t1 and t2 are left-shifted and reduced in each iteration, where the amount of shift is log2 β. Due to radix-4, we select β = 4 in the given algorithm. Step 11 adds (⊕) the respective partial product to, or subtracts (⊖) it from, the contents of the accumulator t. Because two bits of the multiplier are processed in each iteration, the total number of iterations required to complete OP-1 is α/2 + 1. Once OP-1 is complete, the two remaining operations (⊕ and ⊖) of the BU can be performed. These are specified in steps 13 and 14. The div-by-2 operation required in INTT is merged into the ⊕ and ⊖ operations. The internal working details of the ⊗, ⊙, ⊕, ⊖, and div-by-2 operations are given in the following section.

A. UBU Hardware Design

A compact hardware architecture to execute the proposed UBU method is presented in Fig. 2. The proposed UBU hardware architecture consists of several computational blocks: two partial product shifters PS1 and PS2, one modular addition subtraction (MAS) unit, one shift register (SR), three data storage registers t, t1, and t2, several multiplexers, carry generation logic (CGL), and a controller. PS1, PS2, and MAS are the main computational modules, whereas the multiplexers and the controller select appropriate inputs and generate the required control signals, respectively. We reduce the critical path delay by running PS1, PS2, and MAS in parallel. In addition, to minimize the hardware footprint, the MAS unit is shared among different UBU operations. Details of these units are as follows:

1) PS1 and PS2: These units are responsible for executing steps 9 and 10 of the algorithm, where the possible partial products are multiplied by β and reduced by the given modulus q. Thus, these modules are used only in the computation of OP-1 (⊗). We adopted radix-4, so β = 2^2 and the partial products are left-shifted by two bits and reduced. As both steps are identical, the PS1 and PS2 blocks are the same. Initially, registers t and SR are loaded with zero and ω, respectively. However, at the first clock cycle, we use the PS1 block to compute 2 × V mod q by loading register t1 with V. This way, we integrate the pre-computation step required in OP-1 into its normal execution. The output is available after one clock cycle and is stored in register t2. Thus, after two clock cycles, registers t1 and t2 are loaded with the required values mentioned in step 1 of Algorithm 1. Internally, each PS unit comprises two identical sub-modules: SM1 and SM2 for PS1, SM3 and SM4 for PS2. Each SMi unit computes a single-bit left-shift reduction: the input is shifted left by one bit and q is subtracted from the shifted value. These two results are then multiplexed and fed into the next sub-module. Thus, each PS unit behaves as a two-bit shift reducer. These blocks remain idle during the OP-2 and OP-3 execution and are hence dedicated to the OP-1 primitive.

2) MAS: The MAS block performs MA or MS operations on the two input operands. Note that our MM primitive is based on IM, where these operations are required internally in OP-1 to add the shifted partial products to, or subtract them from, the contents of the accumulator t (⊕|⊖), as specified in step 11 of Algorithm 1. To reduce the hardware footprint of the proposed UBU algorithm, we reuse this block to perform OP-2 and OP-3 of the BU instead of deploying a dedicated unit. Based on the control signals given in Table I, it either performs OP-1 or executes the OP-2 and OP-3 operations in a single clock cycle. OP-1 is completed in α/2 cycles, where α is the value determined in step 2 of Algorithm 1. OP-2 and OP-3 can then be mapped to MAS for their execution in sequential order. The addition/subtraction of the partial products is controlled by the CGL, which takes three bits of the multiplier ω. As MAS takes one cycle, (α/2 + 2) clock cycles are required to complete one BU operation in CTBU and GSBU. To execute INTT, the execution order of these operations is changed, which can be easily achieved by configuring our UBU. We adopt the same strategy as [20] to integrate one div-by-2 operation in our MAS unit while eliminating the second one. This is done by checking even and odd values of polynomial coefficients and adding/subtracting (q + 1)/2, denoted as q in Fig. 2.

Complete execution details of NTT and INTT BUs on the proposed UBU hardware are elaborated in Table I. The NTT/INTT operation is selected by the mode signal. Based on this signal, the BU operations are executed in the specified order. Table I lists the execution and data flow for all the internal BU operations for CTBU and GSBU required in NTT and INTT computations. A control word (CW) indicates the control signals required to execute internal butterfly operations.

B. NTT Architecture

NTT is an essential tool widely deployed to speed up polynomial multiplication in LBC systems. The proposed UBU has the potential to work for any modulus type and


TABLE I
EXECUTION FLOW OF BUTTERFLY OPERATIONS ON THE PROPOSED UBU ARCHITECTURE

TABLE II
THE NTT DESIGN (N = 256 AND log2 q = 12-BIT) ON ARTIX-7 FPGA WITH COMPARISON TO PRIOR DESIGNS

length and thus can be used in the design of any NTT architecture. However, for evaluation purposes, we used it in the NTT architecture for CRYSTALS-Kyber. In Kyber, N = 256 and q = 3329, where q − 1 = 2^8 · 13. The NTT and INTT formulas for Kyber are given as follows:
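\hat{p}(x) : \mathrm{NTT}(p(x)) = \sum_{j=0}^{\frac{N}{2}-1} p[j] \cdot \omega^{(2i+1) \cdot j} \bmod q    (1)

p(x) : \mathrm{INTT}(\hat{p}(x)) = \frac{2}{N} \sum_{j=0}^{\frac{N}{2}-1} \hat{p}[j] \cdot \omega^{-i \cdot (2j+1)} \bmod q    (2)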
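To make (1) and (2) concrete, the short Python sketch below evaluates both sums directly, in quadratic time, on a vector of N/2 = 128 coefficients and checks that INTT undoes NTT. It is a reference model for checking values, not the butterfly-based datapath described in this brief. The names `ntt_direct` and `intt_direct` are hypothetical, and the choice ω = 17 is an assumption taken from the Kyber specification, where 17 is a primitive 256th root of unity modulo 3329.

```python
Q = 3329      # Kyber modulus, q - 1 = 2^8 * 13
N = 256       # Kyber polynomial degree
OMEGA = 17    # assumption: primitive 256th root of unity mod Q (Kyber spec)


def ntt_direct(p):
    """Evaluate (1) term by term for a vector of N/2 coefficients."""
    half = N // 2
    return [
        sum(p[j] * pow(OMEGA, (2 * i + 1) * j, Q) for j in range(half)) % Q
        for i in range(half)
    ]


def intt_direct(p_hat):
    """Evaluate (2) term by term; the factor 2/N is the inverse of N/2 mod Q."""
    half = N // 2
    scale = pow(half, -1, Q)          # 2/N mod Q (needs Python 3.8+)
    omega_inv = pow(OMEGA, -1, Q)     # inverse twiddle factor
    return [
        scale * sum(p_hat[j] * pow(omega_inv, i * (2 * j + 1), Q)
                    for j in range(half)) % Q
        for i in range(half)
    ]


if __name__ == "__main__":
    coeffs = [(7 * k + 3) % Q for k in range(N // 2)]
    assert intt_direct(ntt_direct(coeffs)) == coeffs
    print("round trip ok")
```

The direct sums cost O(n^2) multiplications; the staged butterfly structure replaces them with log-depth layers, which is what the hardware exploits.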
CTBU is utilized for NTT, while GSBU is used in the computation of INTT to avoid coefficient reversing. Our proposed UBU can be configured for both BU types and performs their internal operations in the same cycle count. The NTT and INTT are computed recursively by dividing the given polynomial into smaller polynomials. There are log N − 1 stages in total, and each stage requires N/2 butterfly operations. At each stage, a polynomial is divided into two equal parts whose coefficients are processed in a pre-defined fixed order. We deploy several copies of the UBU to exploit the available parallelism at each stage. The architecture in Fig. 3 comprises K UBUs, a register module (RM), read and write logic (rlogic, wlogic), and a control unit (CU). We choose K as a power of two to process odd and even numbers of coefficients. To start the NTT operation, the RM modules are loaded with the polynomial coefficients, the twiddle factors (ω), and their inverses ω^-1. Then, 2K coefficients of a given polynomial are read from the RM and fed into their respective UBU_1 to UBU_K modules. These UBUs operate in parallel and output their results in (α/2 + 2) clock cycles. The rlogic and wlogic are responsible for avoiding data hazards by reading and writing the correct registers. In the case of NTT, each BU is configured to execute CTBU operations on the input coefficients and ω, whereas in the case of INTT, the GSBU operations require ω^-1.

Fig. 3. Proposed NTT/INTT architecture.

IV. IMPLEMENTATION AND RESULTS

The proposed NTT architecture is implemented using the Xilinx Vivado tool targeting a Xilinx Artix-7 FPGA (XC7A350T), a popular implementation platform for PQC algorithms. We evaluate the design for three K sizes (2^2, 2^3, and 2^4) to explore tradeoffs between resource consumption and computational speed. Note that our design is DSP and BRAM free, in contrast to almost all listed designs in


Table II. We therefore consider the equivalent slice count (ESC), calculated as #slices + (#DSPs × 100 + #BRAMs × 200) [11]. Area-delay product (ADP) and efficiency (E) are calculated as ESC × time (μs) and throughput (TP)/ESC, respectively. Test vectors for functional verification are generated and captured using a customized Python implementation.

Table II also shows the performance of other similar designs on the same FPGA platform. Our lightweight design (K = 2^2), balanced design (K = 2^3), and low latency design (K = 2^4) consume 314, 575, and 911 slices, run at 307.5, 306.4, and 304.7 MHz, and compute an NTT/INTT operation in 6.55, 3.28, and 1.65 μs, respectively. These designs have lower ESC values than other designs. The lowest ESC is delivered by our lightweight design (K = 2^2). It has 5.6×, 23.67×, 1.90×, 15.41×, 36×, 9.77×, 2.6×, and 3.18× lower ESC values in comparison to [11], [12], [13], [15], [16], [17], [20], and [14], respectively. The highest throughput design (K = 2^4) delivers 5.52×, 7.08×, 1.66×, 3.01×, 1.28×, 1.05×, and 1.05× higher efficiency (E) values as compared to [12], [13], [15], [16], [17], [20], and [14], respectively. It produces the same efficiency and ADP values as [11], the best prior design regarding ADP and efficiency. Note that a lower ADP and a higher E indicate a better design. However, the proposed design consumes 5.31× fewer FPGA slices than [11]. Moreover, [11] is a fixed design developed by exploiting the special structure of the Kyber prime, so it cannot be utilized for any other prime. In addition, the proposed design with K = 2^4 delivers better ADP and efficiency results than the designs with K = 2^2 and K = 2^3. The proposed UBU is also implemented on the Artix-7 FPGA as a standalone unit. It consumes 49 slices (114 LUTs + 80 FFs), runs at 312 MHz, and completes one 12-bit BU operation in 9 clock cycles.

In future work, the proposed UBU module is to be used in the development of NTT architectures for other LBC-based cryptosystems such as CRYSTALS-Dilithium [5], new signatures [24], and FHE schemes [6]. Furthermore, resistance against side-channel attacks, power and energy consumption, and error detection are to be evaluated.

V. CONCLUSION

Lattice-based cryptosystems are gaining popularity and will soon be deployed widely. A design that can support a generic prime offers significant reconfigurability. This brief presents an NTT architecture using a novel unified butterfly unit (UBU) for CRYSTALS-Kyber. It combines interleaved multiplication, critical path splitting, and resource-sharing strategies. It delivers better area-delay product and efficiency results on the FPGA platform. Thus, it has great potential to be deployed to accelerate polynomial multiplication in LBC-based cryptosystems.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2, pp. 120–126, 1978.
[2] V. S. Miller, "Use of elliptic curves in cryptography," in Proc. Conf. Theory Appl. Cryptogr. Techn., 1985, pp. 417–426.
[3] P. W. Shor, "Algorithms for quantum computation: Discrete logarithms and factoring," in Proc. 35th Annu. Symp. Found. Comput. Sci., 1994, pp. 124–134.
[4] R. Avanzi et al., "CRYSTALS-Kyber algorithm specifications and supporting documentation, version 3.01," NIST PQC Round, vol. 2, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, document NIST PQC Round-3-20210131, 2019.
[5] D. Moody, NIST PQC Standardization Update, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, 2021.
[6] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proc. 41st Annu. ACM Symp. Theory Comput., 2009, pp. 169–178.
[7] A. Langlois and D. Stehlé, "Worst-case to average-case reductions for module lattices," Designs, Codes Cryptogr., vol. 75, no. 3, pp. 565–599, 2015.
[8] Y. Huang, M. Huang, Z. Lei, and J. Wu, "A pure hardware implementation of CRYSTALS-Kyber PQC algorithm through resource reuse," IEICE Electron. Express, vol. 17, no. 17, pp. 20200234–20200234, 2020.
[9] R. Paludo and L. Sousa, "Number theoretic transform architecture suitable to lattice-based fully-homomorphic encryption," in Proc. IEEE 32nd Int. Conf. Appl.-Specif. Syst., Archit. Process. (ASAP), 2021, pp. 163–170.
[10] A. C. Mert, E. Karabulut, E. Öztürk, E. Savaş, and A. Aysu, "An extensive study of flexible design methods for the number theoretic transform," IEEE Trans. Comput., vol. 71, no. 11, pp. 2829–2843, Nov. 2022.
[11] M. Li, J. Tian, X. Hu, and Z. Wang, "Reconfigurable and high-efficiency polynomial multiplication accelerator for CRYSTALS-Kyber," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 8, pp. 2540–2551, Aug. 2023.
[12] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, "High-performance area-efficient polynomial ring processor for CRYSTALS-Kyber on FPGAs," Integration, vol. 78, pp. 25–35, 2021.
[13] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, "Instruction-set accelerated implementation of CRYSTALS-Kyber," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 11, pp. 4648–4659, Nov. 2021.
[14] M. B. Niasar, R. Azarderakhsh, and M. M. Kermani, "High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography," in Proc. IEEE 28th Symp. Comput. Arithmetic (ARITH), 2021, pp. 94–101.
[15] C. Zhang et al., "Towards efficient hardware implementation of NTT for Kyber on FPGAs," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2021, pp. 1–5.
[16] F. Yaman, A. C. Mert, E. Öztürk, and E. Savaş, "A hardware accelerator for polynomial multiplication operation of CRYSTALS-Kyber PQC scheme," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2021, pp. 1020–1025.
[17] L. Ma, X. Wu, and G. Bai, "Parallel polynomial multiplication optimized scheme for CRYSTALS-Kyber post-quantum cryptosystem based on FPGA," in Proc. Int. Conf. Commun., Inf. Syst. Comput. Eng. (CISCE), 2021, pp. 361–365.
[18] P. Barrett, "Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor," in Proc. Conf. Theory Appl. Cryptogr. Techn., 1986, pp. 311–323.
[19] P. L. Montgomery, "Modular multiplication without trial division," Math. Comput., vol. 44, no. 170, pp. 519–521, 1985.
[20] W. Guo, S. Li, and L. Kong, "An efficient implementation of KYBER," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 69, no. 3, pp. 1562–1566, Mar. 2022.
[21] G. R. Blakely, "A computer algorithm for calculating the product AB modulo M," IEEE Trans. Comput., vol. 100, no. 5, pp. 497–500, May 1983.
[22] K. Javeed and D. Gregg, "Point multiplication accelerator for arbitrary Montgomery curves," IEEE Embed. Syst. Lett., early access, May 9, 2024, doi: 10.1109/LES.2024.3399071.
[23] K. Javeed, "FPGA implementation of area-time aware ECC scalar multiplication core," in Proc. 30th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), 2023, pp. 1–4.
[24] Post Quantum Cryptography (Digital Signatures)—Round 1 Additional Signatures, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, 2023.
