IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 72, NO. 1, JANUARY 2025

Efficient Number Theoretic Transform Architecture for CRYSTALS-Kyber

Khalid Javeed, Senior Member, IEEE, and David Gregg

Abstract—The Number Theoretic Transform (NTT) is a central primitive to compute polynomial multiplication in a finite ring for both post-quantum cryptography (PQC) and fully homomorphic encryption (FHE) schemes. This brief presents a novel, efficient NTT hardware architecture suitable for CRYSTALS-Kyber, one of the NIST PQC standards. It is based on a novel unified butterfly unit (UBU) developed by combining interleaved multiplication, radix-4, and resource-sharing strategies. This unit computes all butterfly operations for any generic prime modulus value and is reconfigurable to any modulus length. In the proposed NTT architecture, multiple UBUs are deployed, demonstrating an area-time tradeoff. The UBU and NTT architectures are synthesized and implemented on the Xilinx Artix-7 FPGA platform, and results are shown for different performance evaluation metrics. The implementation results show that our lightweight and high-speed designs achieve up to 5.6× and 7× improvements in resource consumption and efficiency, respectively. To the authors' knowledge, it is the first generic NTT architecture based on interleaved multiplication approaches.

Index Terms—Number theoretic transform, FPGA, interleaved multiplication, post-quantum cryptography.

Received 17 June 2024; revised 7 August 2024; accepted 17 September 2024. Date of publication 20 September 2024; date of current version 27 December 2024. This brief was recommended by Associate Editor S.-B. Ko. (Corresponding author: Khalid Javeed.)
Khalid Javeed is with the Department of Computer Engineering, University of Sharjah, Sharjah, UAE (e-mail: [email protected]).
David Gregg is with the Lero-the Science Foundation Ireland Research Centre for Software, Trinity College Dublin, Dublin 2, D02 PN40 Ireland (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2024.3465273.
Digital Object Identifier 10.1109/TCSII.2024.3465273
1549-7747 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: National Institute of Technology - Agartala. Downloaded on January 03,2025 at 05:42:58 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

With the advent of quantum computers (QCs), currently deployed public-key schemes such as RSA [1] and elliptic curve cryptography (ECC) [2] can no longer ensure their security. This is due to the computational power of QCs, which can solve the underlying hard mathematical problems (integer factorization in RSA and the discrete logarithm in ECC) using Shor's algorithm [3]. This led to the National Institute of Standards and Technology (NIST) initiatives for standardizing various post-quantum cryptography (PQC) schemes. In this regard, CRYSTALS-Kyber [4], a lattice-based crypto scheme, is popular for key exchange and encryption/decryption tasks. At the same time, Dilithium is standardized as the digital signature scheme [5]. On the other hand, fully homomorphic encryption (FHE) schemes enable operations over encrypted data and have many applications in cloud computing security [6]. Lattice-based cryptography (LBC) is a popular tool for designing PQC and FHE schemes. Kyber's security relies on the hardness of the module learning with errors (module-LWE) problem [7]. The fundamental operation in a module lattice is denoted as As + e, where A is a k × k polynomial matrix while s and e are k-dimensional polynomial vectors having their elements in the polynomial ring R_q. Note that the Kyber security level can be adjusted by tuning different parameters.

Several researchers have proposed hardware accelerators to demonstrate the practical feasibility of PQC schemes. Polynomial multiplication over a ring is the central operation in almost all LBC schemes. The number theoretic transform (NTT) is a vital tool to accelerate polynomial multiplication and enable the practical deployment of LBC-based cryptosystems. It reduces the computational complexity of a polynomial multiplication from quadratic O(n^2) to quasi-linear O(n log n).

The butterfly unit (BU) is the fundamental primitive in the design of an NTT hardware architecture. It consists of one modular multiplication (MM), one modular addition (MA), one modular subtraction (MS), and two modular division by 2 (div-by-2) operations. The MM primitive is the most computationally intensive part and is critical to the overall performance of the NTT computation. The div-by-2 operations are required only in the inverse NTT (INTT = NTT^−1). Numerous hardware accelerators for NTT [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] are available, where the efforts have mainly been to optimize the MM primitive in the respective BU. In this regard, three approaches have been explored: Barrett reduction [18], Montgomery reduction [19], and exploiting the special characteristics of a chosen prime modulus q. Montgomery reduction has seen limited deployment due to the need for domain conversion, whereas the Barrett and special modulus-based reductions have been extensively used. In [8], [9], [10], NTT architectures for LBC are proposed using Montgomery reduction, while [11], [12] are based on Barrett reduction and [15], [16], [20] proposed new designs by exploiting the special structure of the selected modulus q [11]. Designs over a special q can deliver higher performance but are tied to the selected q, so the gain in speed is offset by the lack of flexibility. In this brief, we propose an efficient NTT architecture to support any generic q and evaluate its performance on the FPGA platform. The design uses LUTs, making it easier to port to any other FPGA and even to standard ASIC cell technologies. Our main contributions in this brief are given below:

• We proposed an efficient NTT architecture that can support any generic modulus q based on a novel unified butterfly unit (UBU).
• In the UBU, we merge MM, MA, MS, and two div-by-2 operations using the interleaved multiplication (IM) [21].
We modify the IM algorithm to execute all BU operations required in NTT and INTT with an insignificant increase in the critical path delay and hardware resource consumption overhead. To reduce the critical path delay, we used a laddering approach to remove the data dependency between the partial product generation and accumulation parts. Thus, these can be executed concurrently. To the best of the authors' knowledge, this is the first implementation of NTT using IM methods.
• We evaluated our UBU for common q sizes (12 bits) and the NTT/INTT architecture for the Kyber scheme (n = 256) on the Xilinx Artix-7 FPGA platform. The performance evaluation confirms that these deliver better efficiency results in comparison to the state-of-the-art.

This brief is structured as follows: Section II introduces the NTT. Section III presents our novel UBU algorithm and its mapping to the proposed hardware architecture within the overall NTT design. Section IV presents the results.

Fig. 1. CTBU and GSBU internal architectures.

Algorithm 1: Proposed Unified Butterfly Unit (UBU)
Input: V, U, ω
Output: V_next, U_next
 1  Initialization: t = 0, t1 = V, t2 = 2 × V mod q
 2  α = log2 q + 2 if log2 q is even; α = log2 q + 1 if log2 q is odd
    // OP-1: MM of V and ω (V ⊗ ω)
    // MM started
 3  for (i = 0; i ≤ α − 2; i ← i + 2) do
 4      switch (ω(i+2:i)) do
 5          when 000 | 111 ⟹ d ← 0
 6          when 001 | 010 | 101 | 110 ⟹ d ← t1
 7          else ⟹ d ← t2
 8      end
 9      t1 = β × t1 mod q
10      t2 = β × t2 mod q
11      t = t (⊕|⊖) d mod q
12  end
    // OP-1 (MM) completed
    // OP-2: MA of t and U (t ⊕ U)
13  U_next = t ⊕ U
    // OP-3: MS of t and U (t ⊖ U)
14  V_next = t ⊖ V
15  return V_next, U_next
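
As a rough software model of Algorithm 1, the sketch below follows the radix-4 interleaved multiplication loop and then applies the two remaining butterfly operations. It is an illustration under stated assumptions, not the authors' RTL: the names `ubu_ct` and `half_mod` are hypothetical, the 3-bit groups are formed by placing an implicit 0 below the LSB of ω (the usual radix-4 Booth convention), the add/subtract choice in step 11 is taken from the MSB of each group, and the butterfly outputs use the standard Cooley-Tukey form U + V·ω and U − V·ω.

```python
def ubu_ct(V, U, omega, q):
    """Software model of Algorithm 1 in Cooley-Tukey order (a sketch, not the RTL)."""
    nbits = q.bit_length()                       # plays the role of log2 q in step 2
    alpha = nbits + 2 if nbits % 2 == 0 else nbits + 1
    t, t1, t2 = 0, V % q, (2 * V) % q            # step 1
    w = omega << 1                               # assumption: implicit 0 below the LSB
    for i in range(0, alpha - 1, 2):             # step 3: i = 0, 2, ..., alpha - 2
        group = (w >> i) & 0b111                 # step 4: three multiplier bits
        if group in (0b000, 0b111):
            d = 0
        elif group in (0b001, 0b010, 0b101, 0b110):
            d = t1
        else:                                    # 0b011 or 0b100
            d = t2
        # step 11: add when the group's MSB is 0, subtract when it is 1 (assumed)
        t = (t - d) % q if group & 0b100 else (t + d) % q
        t1 = (4 * t1) % q                        # step 9 with beta = 4
        t2 = (4 * t2) % q                        # step 10 with beta = 4
    u_next = (U + t) % q                         # OP-2: modular addition
    v_next = (U - t) % q                         # OP-3: assumed operand order U - V*omega
    return u_next, v_next


def half_mod(x, q):
    """Modular division by 2 for odd q and 0 <= x < q, adding (q + 1) / 2 when x is odd."""
    return (x >> 1) if x % 2 == 0 else (x >> 1) + (q + 1) // 2
```

For Kyber's q = 3329 the loop runs seven times (α = 14), and `ubu_ct(V, U, w, 3329)` returns the same pair as computing `(U + V * w) % q` and `(U - V * w) % q` directly. The `half_mod` helper shows the even/odd check with (q + 1)/2 that a div-by-2 step relies on.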

II. BACKGROUND

The NTT is defined over a finite field F_q with integer roots, whereas the fast Fourier transform (FFT) works in the complex field C with complex roots. Consider a polynomial p(x) in the ring R_q = Z_q[X]/(X^N + 1), where polynomials are reduced modulo (X^N + 1). The NTT and INTT of polynomial p(x) are represented as p̂(x) ← NTT(p(x)) and p(x) ← INTT(p̂(x)), respectively. Similar to FFT computations, the forward NTT of a given polynomial can be performed using the Cooley-Tukey BU (CTBU), whereas the coefficient-reversing operation can be avoided by using the Gentleman-Sande BU (GSBU) for the INTT computation. The internal architectures of these BUs are shown in Fig. 1. It is evident from the figure that CTBU and GSBU each require one MM, one MA, and one MS operation, in different orders. In addition, the GSBU requires two div-by-2 operations. MM is the core operation in both CTBU and GSBU computations and greatly influences the overall performance of the NTT/INTT primitive. Therefore, its optimization is crucial in accelerating LBC-based cryptosystems.

III. PROPOSED UBU MODULE

The proposed novel UBU module is based on the IM [21] method. The IM method works in an iterative manner where partial products are generated, added, and reduced in each iteration. Numerous modifications [22], [23] have been proposed to design efficient MM modules for ECC. However, ECC works on large operands (256-512 bits), so straightforward adoption of these proposals for the MM operation in CTBU and GSBU is not possible. Moreover, MA, MS, and div-by-2 operations are also required in these BU units.

We modified the IM algorithm so that it can perform all required BU operations. Our proposed UBU algorithm is given in Algorithm 1. Our modifications eliminate data dependencies in critical operations so that these can be executed concurrently with a low hardware footprint. We represent MM, MA, and MS operations with ⊗, ⊕, and ⊖, respectively; a further operator denotes the modular multiply-by-4 operation. It is worth mentioning that the multiply-by-4 operation is required in each iteration of the modified algorithm in the partial product generation part. The given algorithm takes four inputs: two coefficients of the polynomial V, U, a twiddle factor ω, and a modulus q. It produces two new coefficients V_next, U_next after completing the required operations. Three operations OP-1 (⊗), OP-2 (⊕), and OP-3 (⊖) are specified in steps 3, 13, and 14, respectively. The specified order of these operations is for computing CTBU, where the result of OP-1 (V ⊗ ω) is combined by ⊕ and ⊖ with U and V, respectively. However, these operations can be re-sequenced to facilitate the execution of GSBU. OP-1 is the most time-critical operation and comprises steps 1 to 11, where steps 3 to 11 run iteratively whereas steps 1 and 2 are required only once for variable initialization and to determine the bit length α of a modulus q. In most of the PQC schemes, log2 q is an even number, so we need to append two zeros to the left of the MSB of the multiplier ω, i.e., α = log2 q + 2.

In step 1, the modular multiply-by-2 of the multiplicand V is pre-computed, in addition to loading t and t1 with zero and V, respectively. To execute OP-1, our algorithm starts scanning the multiplier (ω) from the least-significant bit (LSB) and forms groups of three bits (step 3), where the MSB of the current group acts as the LSB of the subsequent group. Therefore, due to radix-4, each iteration processes two bits of the multiplier. Steps 9, 10, and 11 are the main execution steps in performing OP-1. These three steps
have no data dependencies and can operate in parallel. The possible partial products t1, t2 are left-shifted and reduced in each iteration, where the amount of shift is represented by log2 β. Due to radix-4, we select β = 4 in the given algorithm. Step 11 adds (⊕) or subtracts (⊖) the respective partial product to or from the contents of the accumulator t. Due to the processing of two bits of the multiplier in each iteration, the total number of iterations required to complete OP-1 is α/2 + 1. Once OP-1 is complete, the two remaining operations (⊕ and ⊖) of the BU can be performed. These are specified in steps 13 and 14. The div-by-2 operation required in INTT is merged into the ⊕ and ⊖ operations. The internal working details of the ⊗, multiply-by-4, ⊕, ⊖, and div-by-2 operations are given in the following section.

A. UBU Hardware Design

A compact hardware architecture to execute the proposed UBU method is presented in Fig. 2. The proposed UBU hardware architecture consists of several computational blocks: two partial product shifters PS1 and PS2, one modular addition subtraction (MAS) unit, one shift register (SR), three data storage registers t, t1, t2, several multiplexers, carry generation logic (CGL), and a controller. PS1, PS2, and MAS are the main computational modules, whereas the multiplexers and the controller select appropriate inputs and generate the required control signals, respectively. We reduce the critical path delay by running PS1, PS2, and MAS in parallel. In addition, to minimize the hardware footprint, the MAS unit is shared among different UBU operations. Details of these units are as follows:

Fig. 2. Proposed Unified Butterfly Architecture.

1) PS1 and PS2: These units are responsible for executing steps 9 and 10 of the algorithm, where possible partial products are multiplied by β and reduced by the given modulus q. Thus these modules are used only in the computation of OP-1 (⊗). We adopted radix-4, so β = 2^2; hence partial products are left-shifted by two bits and reduced. As both steps are identical, the PS1 and PS2 blocks are the same. Initially, registers t and SR are loaded with zero and ω, respectively. However, at the first clock cycle, we use the PS1 block to compute 2 × V mod q by loading register t1 with V. This way, we integrated the pre-computation step required in OP-1 into its normal execution. The output is available after one clock cycle and is stored in register t2. Thus, after two clock cycles, registers t1 and t2 are loaded with the required values mentioned in step 1 of Algorithm 1. The internal architecture of each PS unit is comprised of two identical sub-modules: SM1 and SM2 for PS1, SM3 and SM4 for PS2. Each SMi unit computes a single-bit left-shift reduction: the input is shifted left by one bit and q is subtracted from the shifted value. These two results are then multiplexed and fed into the next sub-module. Thus each PS unit behaves as a two-bit shift reducer. These blocks remain idle during the OP-2 and OP-3 execution and are hence dedicated to the OP-1 primitive.

2) MAS: The MAS block performs MA or MS operations on the two input operands. Note that our MM primitive is based on IM, where these operations are required internally in OP-1 to add or subtract (⊕|⊖) the shifted partial products to or from the contents of the accumulator t, as specified in step 11 of Algorithm 1. To reduce the hardware footprint of the proposed UBU algorithm, we reuse this block to perform OP-2 and OP-3 of the BU instead of deploying a dedicated unit. Based on the given control signals in Table I, it either performs OP-1 or executes the OP-2 and OP-3 operations, each in a single clock cycle. OP-1 is completed in α/2 cycles, where α is the padded bit length defined in step 2 of Algorithm 1. OP-2 and OP-3 can then be mapped to MAS for their execution in sequential order. The addition/subtraction of the partial products is controlled by the CGL, which takes three bits of the multiplier ω. As MAS takes one cycle per operation, (α/2 + 2) clock cycles are required to complete one BU operation in CTBU and GSBU. To execute INTT, the execution order of these operations is changed, which can be easily achieved by configuring our UBU. We adopt the same strategy as [20] to integrate one div-by-2 operation in our MAS unit while eliminating the second one. This is done by checking even and odd values of polynomial coefficients and adding/subtracting (q + 1)/2 denoted as q in Fig. 2.

Complete execution details of NTT and INTT BUs on the proposed UBU hardware are elaborated in Table I. The NTT/INTT operation is selected by the mode signal. Based on this signal, the BU operations are executed in the specified order. Table I lists the execution and data flow for all the internal BU operations for CTBU and GSBU required in NTT and INTT computations. A control word (CW) indicates the control signals required to execute internal butterfly operations.

TABLE I
EXECUTION FLOW OF BUTTERFLY OPERATIONS ON THE PROPOSED UBU ARCHITECTURE

TABLE II
THE NTT DESIGN (N = 256 AND log2 q = 12-BIT) ON ARTIX-7 FPGA WITH COMPARISON TO PRIOR DESIGNS

B. NTT Architecture

NTT is an essential tool widely deployed to speed up polynomial multiplication in LBC systems. The proposed UBU has the potential to work for any modulus type and
length and thus can be used in the design of any NTT architecture. However, for evaluation purposes, we used it in the NTT architecture for CRYSTALS-Kyber. In Kyber, N = 256 and q = 3329, where q − 1 = 2^8 · 13. The NTT and INTT formulas for Kyber are given as follows:
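\[ \hat{p}(x) : \mathrm{NTT}(p(x)) = \sum_{j=0}^{\frac{N}{2}-1} p[j] \cdot \omega^{(2i+1) \cdot j} \bmod q \tag{1} \]

\[ p(x) : \mathrm{INTT}(\hat{p}(x)) = \frac{2}{N} \sum_{j=0}^{\frac{N}{2}-1} \hat{p}[j] \cdot \omega^{-i \cdot (2j+1)} \bmod q \tag{2} \]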
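
To make Equations (1) and (2) concrete, the following sketch evaluates both sums directly for the Kyber parameters. It is a quadratic-time reference rather than a fast transform, and it rests on assumptions that the text does not spell out: i is read as the output index running over 0 … N/2 − 1, the input is a single length-N/2 coefficient sequence, and ω = 17 is taken as the primitive 256th root of unity modulo 3329 (the value used in the Kyber specification). The names `ntt_direct` and `intt_direct` are illustrative.

```python
Q = 3329          # Kyber modulus q
N = 256           # polynomial length N
OMEGA = 17        # assumption: primitive 256th root of unity mod Q


def ntt_direct(p):
    """Equation (1): p_hat[i] = sum_j p[j] * omega^((2i+1)*j) mod q."""
    half = N // 2
    return [sum(p[j] * pow(OMEGA, (2 * i + 1) * j, Q) for j in range(half)) % Q
            for i in range(half)]


def intt_direct(p_hat):
    """Equation (2): p[i] = (2/N) * sum_j p_hat[j] * omega^(-i*(2j+1)) mod q."""
    half = N // 2
    scale = pow(half, -1, Q)            # 2/N = 1/128 as a modular inverse (Python 3.8+)
    omega_inv = pow(OMEGA, -1, Q)
    return [scale * sum(p_hat[j] * pow(omega_inv, i * (2 * j + 1), Q)
                        for j in range(half)) % Q
            for i in range(half)]
```

Because ω^128 ≡ −1 (mod q), applying `intt_direct` to the output of `ntt_direct` recovers the original coefficients. Each output costs N/2 modular multiplications here, which is the cost that the butterfly-based decomposition avoids.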
CTBU is utilized for NTT, while GSBU is used in the computation of INTT to avoid the coefficient-reversing step. Our proposed UBU can be configured for both BU types and performs their internal operations in the same cycle count.

Fig. 3. Proposed NTT/INTT architecture.
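
The two butterfly types can be seen side by side in a software NTT. The sketch below is modelled on the layered in-place transform of the Kyber specification (FIPS 203), with Cooley-Tukey butterflies in the forward direction and Gentleman-Sande butterflies in the inverse; it does not describe the proposed hardware. Assumptions: ζ = 17 as the root of unity, twiddle factors indexed in 7-bit bit-reversed order, and a single final multiplication by 128^−1 mod q in place of per-butterfly div-by-2 operations. The function names are illustrative.

```python
Q = 3329
ZETA = 17   # assumption: primitive 256th root of unity mod Q (Kyber's choice)


def bitrev7(x):
    """Reverse the 7-bit binary representation of x."""
    return int(format(x, "07b")[::-1], 2)


ZETAS = [pow(ZETA, bitrev7(k), Q) for k in range(128)]


def ntt(f):
    """Forward transform: 7 layers of Cooley-Tukey butterflies."""
    f = list(f)
    k, length = 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q            # MM
                f[j + length] = (f[j] - t) % Q          # MS
                f[j] = (f[j] + t) % Q                   # MA
        length //= 2
    return f


def intt(f_hat):
    """Inverse transform: 7 layers of Gentleman-Sande butterflies."""
    f = list(f_hat)
    k, length = 127, 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q                  # MA
                f[j + length] = zeta * (f[j + length] - t) % Q  # MS, then MM
        length *= 2
    inv128 = pow(128, -1, Q)            # accumulated factor of 2 per layer
    return [x * inv128 % Q for x in f]
```

Each of the seven layers performs 128 butterflies, matching the stage and butterfly counts quoted for N = 256, and `intt(ntt(f))` returns `f` for any list of 256 coefficients in [0, q).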
The NTT and INTT are computed recursively by dividing the given polynomial into smaller polynomials. There are log2 N − 1 stages in total, and each stage requires N/2 butterfly operations. At each stage, a polynomial is divided into two equal parts whose coefficients are processed in a pre-defined fixed order. We deploy several copies of the UBU unit to exploit the available parallelism at each stage. The architecture in Fig. 3 is comprised of K UBUs, a register module (RM), read and write logic (rlogic, wlogic), and a control unit (CU). We choose K as a power of two to process odd and even numbers of coefficients. To start the NTT operation, the RM modules are loaded with the polynomial coefficients, the twiddle factors (ω), and their inverses ω^−1. Then, 2K coefficients of a given polynomial are read from the RM and fed into their respective modules UBU_1 to UBU_K. These UBUs operate in parallel and output their results in (α/2 + 2) clock cycles. The rlogic and wlogic are responsible for avoiding data hazards by reading and writing the correct registers. In the case of NTT, each BU is configured to execute CTBU operations on the input coefficients and ω, whereas in the case of INTT, the GSBU operations require ω^−1.

IV. IMPLEMENTATION AND RESULTS

The proposed NTT architecture is implemented using the Xilinx Vivado tool targeting a Xilinx Artix-7 FPGA (XC7A350T), a popular implementation platform for PQC algorithms. We evaluate the design for three K sizes (2^2, 2^3, and 2^4) to explore tradeoffs between resource consumption and computational speed. Note that our design is free of DSPs and BRAMs, in contrast to almost all listed designs in
Table II. We therefore consider the equivalent slice count (ESC), calculated as #slices + (#DSPs × 100 + #BRAMs × 200) [11]. Area-delay product (ADP) and efficiency (E) are calculated as ESC × time (μs) and throughput (TP)/ESC, respectively. Test vectors for functional verification are generated and captured using a customized Python implementation.

Table II also shows the performance of other similar designs on the same FPGA platform. Our lightweight design (K = 2^2), balanced design (K = 2^3), and low-latency design (K = 2^4) consume 314, 575, and 911 slices, run at 307.5, 306.4, and 304.7 MHz, and compute an NTT/INTT operation in 6.55, 3.28, and 1.65 μs, respectively. These designs have lower ESC values than the other designs. The lowest ESC is delivered by our lightweight design (K = 2^2). It has 5.6×, 23.67×, 1.90×, 15.41×, 36×, 9.77×, 2.6×, and 3.18× lower ESC values in comparison to [11], [12], [13], [15], [16], [17], [20], and [14], respectively. The highest-throughput design (K = 2^4) delivers 5.52×, 7.08×, 1.66×, 3.01×, 1.28×, 1.05×, and 1.05× higher efficiency (E) values as compared to [12], [13], [15], [16], [17], [20], and [14], respectively. It produces the same efficiency and ADP values as [11], the best design regarding ADP and efficiency values. Note that a design with a low ADP and a high E value is the superior one. However, the proposed design consumes 5.31× fewer FPGA slices than [11]. Moreover, [11] is a fixed design developed by exploiting the special structure of the Kyber prime, so it cannot be utilized for any other prime. In addition, the proposed design with K = 2^4 delivers better ADP and efficiency results than the designs with K = 2^2 and K = 2^3. The proposed novel UBU is also implemented on the Artix-7 FPGA as a standalone unit. It consumes 49 slices (114 LUTs + 80 FFs), runs at 312 MHz, and completes one 12-bit BU operation in 9 clock cycles.

In future work, the proposed UBU module is to be used in the development of NTT architectures for other LBC-based cryptosystems such as CRYSTALS-Dilithium [5], new signatures [24], and FHE [6] schemes. Furthermore, resistance against side-channel attacks, power and energy consumption, and error detection are to be evaluated.

V. CONCLUSION

Lattice-based cryptosystems are gaining popularity and will soon be deployed widely. A design that can support a generic prime offers significant reconfigurability. This brief presents an NTT architecture using a novel unified butterfly unit (UBU) for CRYSTALS-Kyber. It combines interleaved multiplication, critical path splitting, and resource-sharing strategies. It delivers better area-delay product and efficiency results on the FPGA platform. Thus, it has great potential to be deployed to accelerate polynomial multiplication in LBC-based cryptosystems.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, 1978.
[2] V. S. Miller, “Use of elliptic curves in cryptography,” in Proc. Conf. Theory Appl. Cryptogr. Techn., 1985, pp. 417–426.
[3] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. 35th Annu. Symp. Found. Comput. Sci., 1994, pp. 124–134.
[4] R. Avanzi et al., “CRYSTALS-Kyber algorithm specifications and supporting documentation, version 3.01” NIST PQC Round, vol. 2, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, document NIST PQC Round-3-20210131, 2019.
[5] D. Moody, NIST PQC Standardization Update, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, 2021.
[6] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proc. 41st Annu. ACM Symp. Theory Comput., 2009, pp. 169–178.
[7] A. Langlois and D. Stehlé, “Worst-case to average-case reductions for module lattices,” Designs, Codes Cryptogr., vol. 75, no. 3, pp. 565–599, 2015.
[8] Y. Huang, M. Huang, Z. Lei, and J. Wu, “A pure hardware implementation of CRYSTALS-Kyber PQC algorithm through resource reuse,” IEICE Electron. Express, vol. 17, no. 17, pp. 20200234–20200234, 2020.
[9] R. Paludo and L. Sousa, “Number theoretic transform architecture suitable to lattice-based fully-homomorphic encryption,” in Proc. IEEE 32nd Int. Conf. Appl.-Specif. Syst., Archit. Process. (ASAP), 2021, pp. 163–170.
[10] A. C. Mert, E. Karabulut, E. Öztürk, E. Savaş, and A. Aysu, “An extensive study of flexible design methods for the number theoretic transform,” IEEE Trans. Comput., vol. 71, no. 11, pp. 2829–2843, Nov. 2022.
[11] M. Li, J. Tian, X. Hu, and Z. Wang, “Reconfigurable and high-efficiency polynomial multiplication accelerator for CRYSTALS-Kyber,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 8, pp. 2540–2551, Aug. 2023.
[12] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, “High-performance area-efficient polynomial ring processor for CRYSTALS-Kyber on FPGAs,” Integration, vol. 78, pp. 25–35, 2021.
[13] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “Instruction-set accelerated implementation of CRYSTALS-Kyber,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 11, pp. 4648–4659, Nov. 2021.
[14] M. B. Niasar, R. Azarderakhsh, and M. M. Kermani, “High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography,” in Proc. IEEE 28th Symp. Comput. Arithmetic (ARITH), 2021, pp. 94–101.
[15] C. Zhang et al., “Towards efficient hardware implementation of NTT for Kyber on FPGAs,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2021, pp. 1–5.
[16] F. Yaman, A. C. Mert, E. Öztürk, and E. Savaş, “A hardware accelerator for polynomial multiplication operation of CRYSTALS-Kyber PQC scheme,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2021, pp. 1020–1025.
[17] L. Ma, X. Wu, and G. Bai, “Parallel polynomial multiplication optimized scheme for CRYSTALS-Kyber post-quantum cryptosystem based on FPGA,” in Proc. Int. Conf. Commun., Inf. Syst. Comput. Eng. (CISCE), 2021, pp. 361–365.
[18] P. Barrett, “Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor,” in Proc. Conf. Theory Appl. Cryptogr. Techn., 1986, pp. 311–323.
[19] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170, pp. 519–521, 1985.
[20] W. Guo, S. Li, and L. Kong, “An efficient implementation of KYBER,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 69, no. 3, pp. 1562–1566, Mar. 2022.
[21] G. R. Blakely, “A computer algorithm for calculating the product AB modulo M,” IEEE Trans. Comput., vol. 100, no. 5, pp. 497–500, May 1983.
[22] K. Javeed and D. Gregg, “Point multiplication accelerator for arbitrary Montgomery curves,” IEEE Embed. Syst. Lett., early access, May 9, 2024, doi: 10.1109/LES.2024.3399071.
[23] K. Javeed, “FPGA implementation of area-time aware ECC scalar multiplication core,” in Proc. 30th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), 2023, pp. 1–4.
[24] Post Quantum Cryptography (Digital Signatures)—Round 1 Additional Signatures, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, 2023.

17 pages
Thesis
No ratings yet
Thesis
65 pages
VHDL Implementation of ECC Processor Over GF (2 163)
No ratings yet
VHDL Implementation of ECC Processor Over GF (2 163)
7 pages
Practical_Lattice-Based_Cryptography
No ratings yet
Practical_Lattice-Based_Cryptography
18 pages
ECO-CRYSTALS_Efficient_Cryptography_CRYSTALS_on_Standard_RISC-V_ISA
No ratings yet
ECO-CRYSTALS_Efficient_Cryptography_CRYSTALS_on_Standard_RISC-V_ISA
13 pages
Digital Signal Processing With Field Programmable Gate Arrays
No ratings yet
Digital Signal Processing With Field Programmable Gate Arrays
42 pages
A Survey of Polynomial Multiplications for lattice based cryptography
No ratings yet
A Survey of Polynomial Multiplications for lattice based cryptography
63 pages
S S 32-B M C D: Imulation and Ynthesis of IT Ultiplier Using Onfigurable Evices
No ratings yet
S S 32-B M C D: Imulation and Ynthesis of IT Ultiplier Using Onfigurable Evices
8 pages
Proteus a Pipelined NTT Architecture Generator
No ratings yet
Proteus a Pipelined NTT Architecture Generator
11 pages
Thesis Final App of RNS in Comm SP
No ratings yet
Thesis Final App of RNS in Comm SP
77 pages
A_Fast_and_Efficient_191-bit_Elliptic_Curve_Cryptographic_Processor_Using_a_Hybrid_Karatsuba_Multiplier_for_IoT_Applications
No ratings yet
A_Fast_and_Efficient_191-bit_Elliptic_Curve_Cryptographic_Processor_Using_a_Hybrid_Karatsuba_Multiplier_for_IoT_Applications
12 pages
Introduction to Quantum Computing & Machine Learning Technologies: 1, #1
From Everand
Introduction to Quantum Computing & Machine Learning Technologies: 1, #1
M. Sreedevi
No ratings yet
Implementation of Double Precision Floating Point Radix-2 FFT Using VHDL
No ratings yet
Implementation of Double Precision Floating Point Radix-2 FFT Using VHDL
7 pages
Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier
No ratings yet
Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier
5 pages
3
No ratings yet
3
30 pages
Analysis of Chaos in Double Pendulum
No ratings yet
Analysis of Chaos in Double Pendulum
6 pages
Stick
No ratings yet
Stick
16 pages
Signals and Systems With Solutions
100% (2)
Signals and Systems With Solutions
64 pages
Ramanujan's Notebooks (Part 1 of 5) - B. Berndt (Springer, 1985) WW
100% (1)
Ramanujan's Notebooks (Part 1 of 5) - B. Berndt (Springer, 1985) WW
368 pages
Chest Radiography
No ratings yet
Chest Radiography
39 pages
LCT2
No ratings yet
LCT2
1 page
Mallows Theorem
No ratings yet
Mallows Theorem
21 pages
Microwave Test Bench
100% (2)
Microwave Test Bench
99 pages
Xtra
No ratings yet
Xtra
1 page
Linearization Assignment W-Solution Sa
0% (1)
Linearization Assignment W-Solution Sa
7 pages
Mathematics 11 02997
No ratings yet
Mathematics 11 02997
11 pages
Cryptanalysis of The Stream Cipher ZUC in The 3GPP Confidentiality & Integrity Algorithms 128-EEA3 & 128-EIA3
No ratings yet
Cryptanalysis of The Stream Cipher ZUC in The 3GPP Confidentiality & Integrity Algorithms 128-EEA3 & 128-EIA3
15 pages
Image Encryption For Secure Data Transfer and Image Based Cryptography IJERTCONV2IS03017
No ratings yet
Image Encryption For Secure Data Transfer and Image Based Cryptography IJERTCONV2IS03017
4 pages
ME 226 - Advanced Math For ME (Gauss Elimination)
No ratings yet
ME 226 - Advanced Math For ME (Gauss Elimination)
22 pages
Mathematics For Information Science 6
No ratings yet
Mathematics For Information Science 6
5 pages
(A) Simplify The Following Expressions. I) : Task 1 (P1)
No ratings yet
(A) Simplify The Following Expressions. I) : Task 1 (P1)
6 pages
Answer: (A) : 15.87% Probability of The Cup Having More Than 18 Ounces
No ratings yet
Answer: (A) : 15.87% Probability of The Cup Having More Than 18 Ounces
5 pages
Digital Signal Processing Ppt-1
100% (1)
Digital Signal Processing Ppt-1
12 pages
Games Puzzles and Computation PDF
No ratings yet
Games Puzzles and Computation PDF
226 pages
Vu Solved
No ratings yet
Vu Solved
60 pages
Dynamic Games of Complete and Perfect Information
No ratings yet
Dynamic Games of Complete and Perfect Information
26 pages
DSP-UNIT-5 Objective
No ratings yet
DSP-UNIT-5 Objective
5 pages
Module Chapter II - Linear Programming
No ratings yet
Module Chapter II - Linear Programming
17 pages
Mathematical Modeling and Simulation Introduction for Scientists and Engineers second edition Kai Velten - The full ebook version is just one click away
No ratings yet
Mathematical Modeling and Simulation Introduction for Scientists and Engineers second edition Kai Velten - The full ebook version is just one click away
63 pages
Chapter 2. Graph Theory and Concepts: Figure 2-1
No ratings yet
Chapter 2. Graph Theory and Concepts: Figure 2-1
18 pages
Linear and Digital Control Systems
No ratings yet
Linear and Digital Control Systems
2 pages
CH 9
No ratings yet
CH 9
27 pages
Deep Learning
No ratings yet
Deep Learning
23 pages
Engg - Mathematics I Unit 1 Unit IV and Unit V Assignment Questions
No ratings yet
Engg - Mathematics I Unit 1 Unit IV and Unit V Assignment Questions
3 pages
#Quantitative Analysis Excel
No ratings yet
#Quantitative Analysis Excel
5 pages
Supplemental Aggregate Planning
No ratings yet
Supplemental Aggregate Planning
18 pages
Book Reviews: Regulation
No ratings yet
Book Reviews: Regulation
2 pages
Relational Data Modal 11 lacture-WPS Office
No ratings yet
Relational Data Modal 11 lacture-WPS Office
24 pages
Priority Queue Implementation
No ratings yet
Priority Queue Implementation
24 pages
DLD Assignment 2
No ratings yet
DLD Assignment 2
3 pages
Fake News Detection Using Machine Learning Report PDF
No ratings yet
Fake News Detection Using Machine Learning Report PDF
52 pages
Nonogram Solving Algorithms Analysis and Implementation For Augmented Reality System
No ratings yet
Nonogram Solving Algorithms Analysis and Implementation For Augmented Reality System
54 pages
Cit 756 Answers
100% (1)
Cit 756 Answers
25 pages
Pda Npda Mobile PDF
No ratings yet
Pda Npda Mobile PDF
20 pages

