IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 29, NO. 2, FEBRUARY 2021 359

Fast Modular Multipliers for Supersingular Isogeny-Based Post-Quantum Cryptography

Jing Tian, Student Member, IEEE, Jun Lin, Senior Member, IEEE, and Zhongfeng Wang, Fellow, IEEE

Abstract: As one of the post-quantum protocol candidates, the supersingular isogeny key encapsulation (SIKE) protocol delivers promising public and secret key sizes over other candidates. Nevertheless, the considerable computations form the bottleneck and limit its practical applications. The modular multiplication operations occupy a large proportion of the overall computations required by the SIKE protocol. The VLSI implementation of the high-speed modular multiplier remains a big challenge. In this article, we propose three improved modular multiplication algorithms based on an unconventional radix for this protocol, all of which cost about 20% fewer computations than the prior art. Besides, a multiprecision scheme is also introduced for the proposed algorithms to improve the scalability in hardware implementation, resulting in three new algorithms. We then present very efficient high-speed constant-time modular multiplier architectures for the six algorithms. It is shown that these new architectures can be extensively pipelined and highly optimized to obtain high throughput and low latency. The field-programmable gate array (FPGA) implementation results show that all proposed multipliers achieve much higher throughput than previous designs, but the increase in resources is relatively small. In addition, the multipliers without the multiprecision scheme have very low latency, which is very friendly to high-speed applications of the SIKE protocol.

Index Terms: Field-programmable gate array (FPGA), modular multiplication, post-quantum cryptography (PQC), supersingular isogeny Diffie–Hellman (SIDH) key exchange, VLSI.

Manuscript received March 9, 2020; revised July 12, 2020 and October 12, 2020; accepted November 28, 2020. Date of publication December 21, 2020; date of current version January 28, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61774082, in part by the Fundamental Research Funds for the Central Universities under Grant 021014380065, and in part by the Key Research Plan of Jiangsu Province of China under Grant BE2019003-4. (Corresponding author: Zhongfeng Wang.)
The authors are with the School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2020.3041786.
Digital Object Identifier 10.1109/TVLSI.2020.3041786
1063-8210 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

Recent improvements in quantum system control make it seem feasible to finally build a powerful quantum computer in the near future [1]. Many companies, such as IBM, Google, and Intel, have enthusiastically joined this field. A company named IonQ reported in December 2018 that its machine could be built as large as 160 qubits [2]. These achievements have brought a flurry of research in public-key cryptography since most of the popular public-key ciphers, such as the RSA [3] and ECC [4] schemes based on the difficulty of factoring integers or the discrete logarithm problem, can be broken by Shor's algorithm [5] with quantum computers. Meanwhile, they have also accelerated the development of the post-quantum cryptography (PQC) protocols. For example, the call for proposals for PQC standards hosted by the National Institute of Standards and Technology (NIST) [6] is driven by this demand.

The supersingular isogeny key encapsulation (SIKE) protocol [7] advanced to the third round of the PQC standardization process in July 2020 after being submitted to the NIST in November 2017. The possible reason is that it is the only one that is similar to the classical ECC, having very small public and secret keys and offering perfect forward secrecy. Recently, it has been shown in [8] and [9] that the security estimations in the SIKE proposal were extremely conservative in both quantum and classical settings. In other words, smaller key sizes can be used for the security levels presented in [7].

The SIKE is a variant of the supersingular isogeny Diffie–Hellman (SIDH) key exchange protocol, applied with the key encapsulation mechanism [10] to obtain indistinguishability under chosen-ciphertext attack (IND-CCA) [11]. The SIDH was first introduced by Jao and De Feo [12] in 2011 to resist quantum attacks based on the difficulty of finding isogenies between supersingular elliptic curves. A zero-knowledge identification scheme was proposed based on this protocol in [13]. Jao and Soukharev [14] presented undeniable signatures based on the SIDH. Azarderakhsh et al. [15] provided a key compression method that greatly reduces the cost of transmitting and storing public keys and per-party public information without any impact on security. However, the computations of these algorithms are still huge and encounter difficulties in practical applications.

To alleviate this problem, many researchers have focused on speeding up the SIDH key exchange in hardware, e.g., on field-programmable gate array (FPGA) [16]–[18] or on ARM [19]–[21]. When the computations are broken down, modular multiplication is one of the fundamental operations and the main concern in these designs. The first FPGA implementation for SIDH key exchange was proposed by Koziel et al. [16] by parallelizing the modular multipliers based on the high-radix Montgomery multiplication algorithm [22]. They further sped up this protocol by adding more modular multipliers in [17]. Liu et al. [18] presented two fast modular multipliers, the FFM1 and FFM2, for SIDH based on an unconventional radix inspired by the efficient finite field multiplication (EFFM) algorithm proposed in [23]. In addition, the SIDH is implemented on ARM-embedded systems as well. Seo et al. [19] proposed a unified ARM/NEON multiprecision modular multiplication architecture based on the specialized Montgomery reduction and integrated it into the SIKE library [11] to accelerate the original ARM design. Jalali et al. implemented the optimized field arithmetic operations on ARM for SIKE in [20] and for commutative SIDH (CSIDH) in [21]. Indeed, much progress has been made to speed up the SIKE protocol and make it more practical. However, these implementations for the SIKE are still more than one order of magnitude slower than those for most of the other candidates.

Notice that the smooth isogeny modulus for SIDH has the form of $p = f \cdot a^x b^y \pm 1$, where $a$ and $b$ are small primes, $x$ and $y$ are positive integers, and $f$ is a small cofactor to make $p$ prime. To simplify the modular operations, especially for the modular multiplication, the parameter $a$ is usually set to 2. The form of $p$ then becomes $f \cdot 2^x b^y \pm 1$. The EFFM in [23] constrains $p$ to the form $2 \cdot 2^x b^y - 1 = 2R^2 - 1$, where $x$ and $y$ must be even, and $R = 2^{x/2} b^{y/2}$. The input operands are transformed into quadratic polynomials based on the unconventional radix $R$. Compared with the conventional Montgomery modular reduction algorithm [24], the new method can reduce about half of the complexity of the multiplications at the cost of some additions. The FFM1 in [18] reduces the coefficients of EFFM from three to two by using an extra mapping function for the input and output, which could efficiently discard the precomputed constant without any increase in complexity. The FFM2 in [18] extends the searching scope of the prime with the form of $f \cdot 2^x b^y \pm 1$ at the expense of more computations. It should be pointed out that a good prime is more likely to be found with a larger searching scope, which could also help increase the efficiency of the algorithm. Therefore, it is important to develop efficient modular multiplication algorithms with loose constraints for the prime.

In this article, we propose three new modular multiplication algorithms for different forms of prime based on an unconventional radix adopted in [18], [23], and [25], and all of them have lower computational complexity than previous algorithms. We aim to extend the previous prime used in [23] into primes of the form $f \cdot 2^x b^y \pm 1$, where $f \in \{1, 2\}$ and $x$ and $y$ are even. The prime can be split into three forms: $2 \cdot 2^x b^y - 1$, $2 \cdot 2^x b^y + 1$, and $2^x b^y + 1$, with $x$ and $y$ being even. Accordingly, the corresponding new algorithms are developed, named IFFM−, IFFMo+, and IFFMe+, respectively. We use $R = 2^{x/2} b^{y/2}$ as the unconventional radix for the proposed algorithms. For the IFFM− algorithm, the usage of the radix is almost the same as before, which has been preliminarily presented in our conference paper [26]. For the IFFMo+ and IFFMe+, we use the radix $R = 2^{x/2} b^{y/2}$ to reduce the complexity by expanding the range of the constant coefficient of a quadratic polynomial for the first time. A detailed discussion can be found in Section III-B. The reduction and multiplication of the three proposed algorithms are optimized, reducing the computational complexity by about 20%. It should be pointed out that, although the new modular multiplication algorithms are very efficient, they are not applicable to all the primes presented in the SIKE protocol. In fact, they can only be used by the SIKEp610. We have made a brute-force search in Appendix B to find good replacements for the currently considered SIKE primes. Meanwhile, we also develop their multiprecision versions with a clever interleaving scheme and present three other new algorithms to improve the scalability and reduce the resource consumption in hardware implementation.

Moreover, we have devised new constant-time architectures for the proposed algorithms with fully parallelizing and interleaving schedules that enable us to mostly reduce the required clock cycles and highly optimize each submodule. We have also coded the proposed architectures with the Verilog language and implemented them on FPGA. The implementation results show that, compared with the new-type modular multipliers based on the unconventional radix, the designs without the multiprecision scheme have about 60 times faster throughput than the fastest design by introducing a relatively small portion of extra hardware resources. When applying the multiprecision scheme, these designs achieve significant reductions in total resource consumption at the cost of slower throughput compared with their original versions while still being much better than the prior art. Compared with the eightfold high-radix Montgomery modular multipliers [11], the redesigned IFFM− multiplier with comparable frequency still achieves more than one order of magnitude faster throughput.

The rest of this article is organized as follows. Section II gives a brief review of general modular multiplication algorithms and efficient modular multiplication algorithms for a specific form of prime. The three new algorithms and the complexity comparisons with previous works are presented in Section III, and their multiprecision versions are also proposed in this section. In Section IV, the architectures for all the proposed algorithms are devised. The results of FPGA implementations are introduced in Section V. Section VI concludes this article. The proofs of parameters used in the proposed algorithms and the prime searching are appended following the conclusion.

II. BACKGROUND

The multiplication for elliptic curve cryptography is based on finite fields, called modular multiplication, requiring the modular reduction after the multiplication operation. In the following, we will first introduce the Montgomery reduction, the Barrett reduction (BR), and the efficient BR for the modulus of $2^x b^y$. Then, several efficient modular multiplications for the SIDH will be presented.

A. Modular Reduction Algorithms

1) Montgomery Reduction: The main idea of the Montgomery reduction [24] is to replace the ordinary modulus by a power of two so that the modular reduction operation is inexpensive to handle in hardware implementation. The detailed process is shown in Algorithm 1. The modulus $p$ is an arbitrary odd number, which is less than $R$ (equal to $2^N$). The parameter $(-p^{-1}) \bmod R$ is precomputed and saved. As taking integers modulo $R$ is very easy, we will not take this kind of computation into consideration in the following evaluation. It can be found that the complexity is only related to the bit width of the modulus $p$. This algorithm requires two $N \times N$ multiplications, one $2N + 2N$ addition, and one $N + N$ addition. Note that the output remainder is not $c \bmod p$ but $cR^{-1} \bmod p$. Luckily, if the operands are converted into Montgomery representations by multiplying by $R$, all of the arithmetic operations can be processed normally. When an algorithm contains many modular multiplications and divisions, this conversion overhead becomes negligible. Since the large-number multiplications and additions are unfriendly to the high-speed hardware implementation, many efficient high-radix Montgomery-based multipliers [22], [27]–[30] have been proposed. The one using systolic array in [22] is adopted in the SIKE/SIDH protocol [11], [16], [17] to balance the resource consumption and the efficiency.

Algorithm 1: Montgomery Reduction [24]
Algorithm 2: BR [31]
Algorithm 3: BR for Modulus $R = 2^x b^y$ [23]
Algorithm 4: EFFM Modular Multiplication Proposed in [23]

2) Barrett Reduction: Another hardware-friendly modular reduction algorithm is the BR, proposed by Barrett [31] in 1986. The key idea is also to transfer the complex division to an easier one by introducing an extra parameter. The flow is described in Algorithm 2. It should be noticed that the complexity changes with the input width $\alpha$. When $\alpha = 2N$, this algorithm costs one $2N \times (N + 1)$ and one $N \times (N + 1)$ multiplications and one $2N + 2N$ and one $N + N$ additions. The multiplication complexity is almost 1.5 times that of the Montgomery reduction. The benefit is that this algorithm can directly compute the quotient and the remainder.
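As a hedged illustration of the two reductions above, the following Python sketch implements textbook forms of the Montgomery and Barrett reductions. The listings of Algorithms 1 and 2 are not reproduced in this text, so the step ordering, the function names `montgomery_reduce` and `barrett_reduce`, and the Barrett constant `mu` are assumptions rather than the exact steps of [24] and [31]. Python 3.8 or later is assumed for the modular inverse `pow(p, -1, R)`.

```python
def montgomery_reduce(c, p, n_bits):
    """Return c * R^{-1} mod p for R = 2**n_bits.

    Assumes p is odd, p < R, and 0 <= c < p * R (textbook REDC).
    """
    R = 1 << n_bits
    mask = R - 1
    # Precomputed parameter (-p^{-1}) mod R; recomputed here only for brevity.
    p_neg_inv = (-pow(p, -1, R)) % R
    # First N x N multiplication; "mod R" is just a bit mask.
    m = ((c & mask) * p_neg_inv) & mask
    # Second N x N multiplication and the 2N + 2N addition; the division
    # by R is exact, so it is a plain right shift.
    t = (c + m * p) >> n_bits
    # Final N + N correction keeps the result in [0, p).
    return t - p if t >= p else t


def barrett_reduce(c, p, alpha):
    """Return (q, r) with c = q * p + r and 0 <= r < p, for 0 <= c < 2**alpha.

    Textbook Barrett estimate; mu = floor(2**alpha / p) would be precomputed.
    """
    n = p.bit_length()
    mu = (1 << alpha) // p
    # The estimated quotient never exceeds the true quotient.
    q = ((c >> (n - 1)) * mu) >> (alpha - n + 1)
    r = c - q * p
    # A small number of corrective subtractions fixes the estimate.
    while r >= p:
        r -= p
        q += 1
    return q, r
```

With a toy modulus such as `p = 1000003` and `n_bits = 20`, `montgomery_reduce(c, p, 20)` returns $cR^{-1} \bmod p$ rather than $c \bmod p$, which is why operands are kept in Montgomery representation across a long chain of multiplications, while `barrett_reduce` returns both the quotient and the remainder directly.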
3) Efficient Barrett Reduction for the Modulus of $2^x b^y$: As introduced earlier, the form of modulus for the SIDH is $f \cdot a^x b^y \pm 1$. Karmakar et al. [23] have constrained both $f$ and $a$ to 2. Meanwhile, $x$ and $y$ must be even. Therefore, the unconventional radix $R$ has the form of $2^x b^y$ ($x$ and $y$ are arbitrary positive integers here). The BR is used for such a kind of moduli. The modulus $R$ is split into two parts, $2^x$ and $b^y$, and the reduction is computed in two steps, as introduced in [23]. This algorithm is summarized in Algorithm 3.

Apparently, the main cost is for modulo $b^y$. Assume that $\alpha = 2N$, $N_1 = x$, $N_2 = \log_2(b^y)$, and $N_1 \approx N_2 \approx N/2$. This algorithm costs one $3N/2 \times (N + 1)$ and one $N/2 \times (N + 1)$ multiplications and one $3N/2 + 3N/2$ and one $N + N$ additions. The complexity changes with $\alpha$. Since the considered modulus is not prime, this reduction algorithm cannot be directly used as an independent modular reduction algorithm for the SIDH protocol. As presented in [23], it is used as a subfunction for the EFFM to compute the quotients and remainders of the coefficients dividing $R$. More details will be introduced in Section II-B.

B. Efficient Finite Field Multiplications for SIDH

1) EFFM: The modular multiplication named EFFM algorithm proposed in [23] is summarized in Algorithm 4. It works in an interleaved way with an unconventional radix to compute the multiplication and reduction. The key idea is to replace a large modulus $p$ with a relatively small modulus $R$. The two parameters satisfy $p = 2 \cdot 2^{2x} b^{2y} - 1$ and $R = 2^x b^y$. An input $A$, which is a field element in $F_p$, is expressed as a quadratic polynomial

$$A = a_2 R^2 + a_1 R + a_0 \tag{1}$$

where $a_2 \in \{0, 1\}$ and $0 \le a_1, a_0 < R$. We divide the process of this algorithm into three steps: 1) the first tentative computing; 2) the second tentative computing; and 3) postprocessing. In the first step, the higher order (larger than two orders) terms are reduced and merged with the lower order terms according to the rules deduced in [23]. The second step is to further reduce the coefficients by adopting two BR functions, as presented in Algorithm 3. The bit width $\alpha$ of the BR function is approximately equal to $N$ (half of the original input) and $N_1 \approx N_2 \approx N/4$. In the postprocessing step, the while loop is executed at most twice as introduced in [23]. Since adding or multiplying one number by a single-bit number is very easy, these kinds of operations are not taken into account in this article. Thus, this algorithm takes four $N/2 \times N/2$, two $3N/4 \times (N/2 + 1)$, and two $N/4 \times (N/2 + 1)$ multiplications and six $N/2 + N/2$, two $3N/4 + 3N/4$, three $N/2 + N$, and three $N + N$ additions.

2) FFM1: Recently, Liu et al. [18] have proposed the FFM1 algorithm to remove the coefficients $a_2$ and $b_2$ of the inputs of Algorithm 4. Taking input $A$ as an example, the coefficients are transformed as

$$a_i = \begin{cases} a_i, & a_2 = 0 \\ R - a_i - 1, & a_2 = 1 \end{cases} \qquad i \in \{1, 0\} \tag{2}$$

based on the fact that

$$AB \equiv (p - A)(p - B) \equiv p - (p - A)B \bmod p \tag{3}$$

and

$$p - A = 2R^2 - 1 - (a_2 R^2 + a_1 R + a_0) = (1 - a_2)R^2 + (R - a_1 - 1)R + (R - a_0 - 1). \tag{4}$$

Equation (2) is denoted as the map function in the rest of this article. The inverse transformation is also needed for the output, and $a_2$ is replaced by $a_2 \oplus b_2$. This modification can remove the precomputed parameter $(2^{-2} \bmod p)$. The multiplicative complexity of the FFM1 is the same as the EFFM. It takes ten $N/2 + N/2$, two $3N/4 + 3N/4$, one $N/2 + N$, and two $N + N$ additions. As the transformation for the input and output requires more extra additions, the complexity reduction is limited.

3) FFM2: The FFM2 algorithm is another efficient modular multiplication algorithm proposed in [18] to extend the searching space of the modulus $p$ with the form of $f \cdot 2^x b^y \pm 1$ ($x$ and $y$ are arbitrary positive integers here) at a cost of more multiplications. It costs one $N \times N$, one $3N/2 \times (N + 1)$, and one $N/2 \times (N + 1)$ multiplications and two $N + N$ and one $3N/2 + 3N/2$ additions.

III. PROPOSED MODULAR MULTIPLICATION ALGORITHMS

According to the abovementioned analysis, the complexity of modular multiplication algorithms in [18] and [23] is mainly dominated by the multiplications used in the first tentative computing and the two BR functions. We will detail our improvements from the two aspects in this section. Meanwhile, we will extend the searching space of the modulus with a form of $p = 2^x b^y \pm 1$ with an even $y$.

A. Improved Barrett Reduction

Algorithm 5: IBR for Hardware Efficiency

The improved BR (IBR) is presented, as shown in Algorithm 5. We assume that $\alpha = 2N$ and $N_1 \approx N_2 \approx N/2$, the same as the BR algorithm in Section II-A3. A simple improvement is to move the combination step to the end, which reduces the size of the subtraction in Step 6 of Algorithm 3 from $N + N$ to $N_2 + N_2$. The other improvement is that the subtraction and multiplication operations in Step 3 of Algorithm 3 are simplified (shown in Steps 3 and 4 of Algorithm 5). Since the tentative remainder $r$ is smaller than $2^{N_2+1}$, the difference between the $(N - 1)$ MSBs of $t$ and those of $q \cdot R$ is no more than one, which can be distinguished from their corresponding $(N_2 + 1)$th bits. Accordingly, we can reduce the sizes of the subtraction and multiplication from $(3/2)N + (3/2)N$ and $(3/2)N \times (N + 1)$ to $(N/2) + (N/2)$ and $((N/2) + 1) \times (N/2)$, respectively. If their $(N_2 + 1)$ MSBs are not equal, the remainder $r$ would be adjusted by adding the parameter $2^{N_2}$. Therefore, the IBR algorithm only requires one $3N/2 \times (N + 1)$ and one $N/2 \times (N/2 + 1)$ multiplications and three $N/2 + N/2$ additions. Compared with the BR algorithm, the complexities of multiplication and addition are reduced by about 12.5% and 40%, respectively.

As the output $r$ is expected to have the range of $[0, R)$, this function cannot obtain the required results if the input integer $c$ is a negative number. To address this issue, we apply the IBR function to the absolute value of $c$ and then correct the remainder with $R - r$ and the quotient with $-(q + 1)$ when $c$ is negative. This modified reduction algorithm is defined as IBR+.
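The two ideas in this subsection, splitting the modulus $R = 2^x b^y$ so that only the $b^y$ part needs a real division and handling a negative input through its absolute value, can be sketched in Python as follows. This is a minimal functional model, not Algorithm 5 itself: the built-in `divmod` stands in for the Barrett-style quotient estimate, the names `reduce_2x_by` and `reduce_signed` are hypothetical, and the special case for a zero remainder is an added assumption so that the result stays in $[0, R)$.

```python
def reduce_2x_by(c, x, b, y):
    """Return (q, r) with c = q * R + r and 0 <= r < R, where R = 2**x * b**y.

    Assumes c >= 0. Only the high part of c is divided by b**y.
    """
    by = b ** y
    c_low = c & ((1 << x) - 1)       # the x LSBs pass through unchanged
    c_high = c >> x                  # only this part needs a real division
    q, r_high = divmod(c_high, by)   # stand-in for the Barrett step mod b**y
    r = (r_high << x) | c_low        # combination step, performed last
    return q, r


def reduce_signed(c, x, b, y):
    """Signed variant in the spirit of IBR+: reduce |c|, then correct."""
    R = (1 << x) * b ** y
    q, r = reduce_2x_by(abs(c), x, b, y)
    if c >= 0:
        return q, r
    if r == 0:
        # Assumption: an exact multiple of R needs no remainder flip.
        return -q, 0
    # c = -(q*R + r) = -(q + 1)*R + (R - r)
    return -(q + 1), R - r
```

For example, with the toy parameters `x = 4`, `b = 3`, `y = 2` (so $R = 144$), `reduce_signed(-5, 4, 3, 2)` yields the quotient $-1$ and the remainder $139$, which matches floor division by $R$.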


B. Modular Multiplication Algorithms for Modulus $p = 2^x b^y \pm 1$ With an Even $y$

According to the introduction in [18], [23], and [25], the modular multiplication algorithms based on the unconventional radix for the SIDH show more efficiency than conventional methods. This concept is first proposed in [23] for the smooth isogeny prime modulus $p$ with the form of $p = 2 \cdot 2^x b^y - 1$, where $x$ and $y$ must be even. In this section, we will extend this prime with the form of $p = 2^x b^y \pm 1$ for an even $y$. For convenience, we reformulate the prime as $p = f \cdot 2^{2x} b^{2y} \pm 1$, where $f \in \{1, 2\}$ to make the transformation hold. Since the modulus $p$ is prime, we have three forms for $p$: 1) $p = 2 \cdot 2^{2x} b^{2y} - 1$; 2) $p = 2 \cdot 2^{2x} b^{2y} + 1$; and 3) $p = 2^{2x} b^{2y} + 1$. For the three different $p$ forms, three different modular multiplication algorithms are developed: IFFM−, IFFMo+, and IFFMe+. Here, the latter two algorithms are proposed with unconventional radix for the first time.

There are two issues required to be addressed for this kind of modulus $p = f \cdot 2^{2x} b^{2y} + 1$: 1) how to construct an unconventional radix and 2) how to reduce the coefficients with such a modulus.

For the first issue, we still keep the field elements in $F_p$ with the form of quadratic polynomials based on the unconventional radix $R = 2^x b^y$ as in (1). This form is a one-to-one mapping for the modulus $p$ with minus sign, where all elements in $F_p$ are exactly expressed and the expression $p - A$ is still in this field. Back to the original motivation, the target is to replace the large modulus $p$ with a small modulus $R$, not to construct an onto mapping. Thus, we try to build a mapping that may not be so exact but can involve all elements in $F_p$. With this clue, we find that, if we extend the range of the coefficient of the constant term to $R + 1$, namely, $0 \le a_0 \le R + 1$, this goal can be achieved. We only need to add this constraint for $c_0$ in the postprocessing step, and most of the processing steps are almost the same as the EFFM or FFM1. Note that, in some cases, two polynomials would equal the same value in $F_p$, which, however, does not affect the calculations and the final results.

For the second issue, we review that the basic idea of EFFM [23] is first to resolve the quadratic and higher order terms modulo $p$ and then to reduce the coefficients by using modulo or subtracting $R$. We suppose that the quadratic and higher order terms are combined as $tR^2$. The formula $tR^2$ modulo $p$ ($p = f \cdot R^2 \pm 1$) can be computed as

$$tR^2 \bmod p \equiv \left( tR^2 \bmod (fR^2) \mp \left\lfloor \frac{tR^2}{fR^2} \right\rfloor \right) \bmod p = \left( (t \bmod f) \cdot R^2 \mp \left\lfloor \frac{t}{f} \right\rfloor \right) \bmod p. \tag{5}$$

When $p = 2 \cdot R^2 - 1$, the plus sign is taken, and this equation is equivalent to the deduced equation in [23]. For $p = f \cdot R^2 + 1$, this equation can also be used.

In the following, we will discuss more details about the proposed three algorithms.

1) IFFM− for Modulus $p = 2 \cdot 2^{2x} b^{2y} - 1$: For $p = 2R^2 - 1$, the optimization methods have been fully discussed in [18] and [23]. Equation (5) for this modulus turns into

$$tR^2 \bmod p \equiv \left( (t \bmod 2) \cdot R^2 + \left\lfloor \frac{t}{2} \right\rfloor \right) \bmod p. \tag{6}$$

In the proposed IFFM−, besides applying the IBR function, we also use the map function proposed in [18], as shown in (2). Meanwhile, the number of multiplications is further reduced by using the formula

$$a_1 b_0 + a_0 b_1 = (a_1 + a_0)(b_1 + b_0) - a_0 b_0 - a_1 b_1. \tag{7}$$

The proposed algorithm is shown in Algorithm 6. The post_proc function is given in Algorithm 8, where the range of the input c0o is deduced in Section A1. Therefore, the proposed IFFM− only needs two $N/2 \times N/2$, one $(N/2 + 1) \times (N/2 + 1)$, two $3N/4 \times N/2$, and two $N/4 \times N/4$ multiplications and six $N/4 + N/4$, ten $N/2 + N/2$, one $N/2 + N$, and three $N + N$ additions.

Algorithm 6: Proposed IFFM− for $p = 2R^2 - 1$
Algorithm 7: Proposed IFFMe+ for $p = R^2 + 1$
Algorithm 8: post_proc Functions for the Proposed Algorithms

2) IFFMo+ for Modulus $p = 2 \cdot 2^{2x} b^{2y} + 1$: For $p = 2R^2 + 1$, it means that $f$ is set to 2 and the minus sign is adopted. Thus, (5) becomes

$$tR^2 \bmod p \equiv \left( (t \bmod 2) \cdot R^2 - \left\lfloor \frac{t}{2} \right\rfloor \right) \bmod p. \tag{8}$$

The IFFMo+ is very similar to the IFFM−. We will mainly analyze the different operations of this algorithm. First, the negation operation, $p - A$, becomes

$$p - A = 2R^2 + 1 - (a_2 R^2 + a_1 R + a_0) = (1 - a_2)R^2 + (R - a_1 - 1)R + (R - a_0 + 1) \tag{9}$$

which is still a standard expression. Therefore, when $a_2 = 1$, for $a_1$, the map function is still $R - a_1 - 1$, and we define the map+ function as $R - a_0 + 1$ for $a_0$. The transformations are the same for $b_1$ and $b_0$. Second, when updating the constant term $c_0$, the subtraction needs to be taken, including Steps 5 and 9 in Algorithm 6. Third, the IBR function is replaced by the IBR+ function. Finally, the post_proc function shown in Algorithm 8 is updated differently according to the range of c0o deduced in Section A2. The complexity of this algorithm is almost the same as that of the IFFM−, with two extra $N/2 + N/2$ additions.
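The folding rule in (5), with its special cases (6) and (8), and the three-multiplication identity (7) can be checked numerically with a short Python sketch. The helper names `fold_high_terms` and `cross_terms` are hypothetical, and the toy radix in the usage note is chosen only for illustration; the resulting moduli are not claimed to be prime, since the congruence in (5) holds for any modulus of the form $fR^2 \pm 1$.

```python
def fold_high_terms(t, f, sign):
    """Fold t * R**2 modulo p = f * R**2 + sign, following (5).

    sign is +1 or -1. Returns (c2, c0) such that
    t * R**2 is congruent to c2 * R**2 + c0 modulo p.
    """
    c2 = t % f               # coefficient left on R**2 (0 or 1 when f = 2)
    # f * R**2 is congruent to -sign modulo p, so floor(t / f) is added
    # for p = f*R**2 - 1 and subtracted for p = f*R**2 + 1.
    c0 = -sign * (t // f)
    return c2, c0


def cross_terms(a1, a0, b1, b0):
    """Return (a1*b1, a1*b0 + a0*b1, a0*b0) using three multiplications, as in (7)."""
    hi = a1 * b1
    lo = a0 * b0
    mid = (a1 + a0) * (b1 + b0) - lo - hi
    return hi, mid, lo
```

With the toy radix $R = 2^4 \cdot 3^2 = 144$, calling `fold_high_terms(t, 2, -1)` reproduces (6) and `fold_high_terms(t, 2, 1)` reproduces (8). The saving in `cross_terms` is one multiplication per product of two-coefficient operands, paid for with extra additions.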
3) IFFMe+ for Modulus p = 22x b2y + 1: For the IFFMe+
algorithm, f is equal to 1, so (5) becomes
normalized the numbers of additions and multiplications to
t R 2 mod p = −t mod p. (10) those of N + N additions and N × N multiplications for the
Meanwhile, for modulus p = R 2 + 1, the coefficient of the previous and the proposed modular multiplication algorithms,
quadratic term is equal to zero. Therefore, multiplying two as listed in Table I. The Montgomery and BR algorithms com-
elements A, B ∈ F p turns into bined with the multiplication part are abbreviated as MontM
and BarM modular multiplication algorithms, respectively.
A × B mod p Since N is usually as large as several hundred for public-key
≡ (a1 R + a0 ) × (b1 R + b0 ) cryptosystems, the bit width N + 1 is approximated to N.
The N + N/2 addition, which can be split as one N/2 + N/2
≡ ((a0 + a1 )(b0 + b1 ) − a0 b0 − a1 b1 )R + a0 b0 − a1 b1 .
and one N/2 + 1 additions, is approximately computed as
(11) one N/2 + N/2 addition. It can be seen that the proposed
It can be seen that there is no need to transform the inputs algorithms have the fewest number of multiplications. Note
and output with (2), which can efficiently reduce the addition that the performance is mainly constrained by the computation
operations. The detailed process is shown in Algorithm 7. c0 in of multiplications in these algorithms. We have achieved a
Step 8 has a range of [−2R, R], the proof for which is attached nearly 20% reduction in computations compared with the
in Section A3. Thus, at most two additions are required to state-of-the-art algorithms.
correct the final results. The number of multiplications is also
the same as those of the other two algorithms. This algorithm D. Multiprecision Scheme for Proposed Algorithms
costs six N/4 + N/4, seven N/2 + N/2, one N/2 + N, and
To improve the scalability in hardware design, we apply a
three N + N additions.
multiprecision scheme to the proposed algorithms. Assume a
number A with multiprecision format in radix of 2k as
C. Complexity Analysis and Comparison n−1
Assume that the bit width of the input field elements A and A= A j · (2k ) j (12)
B is N and the modulus p satisfies 2 N −1 < p < 2 N . We have j =0

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY MADRAS. Downloaded on November 14,2024 at 09:57:47 UTC from IEEE Xplore. Restrictions apply.
TIAN et al.: FAST MODULAR MULTIPLIERS FOR SUPERSINGULAR ISOGENY-BASED POST-QUANTUM CRYPTOGRAPHY 365

TABLE I
E STIMATIONS OF THE N ORMALIZED N UMBERS OF N + N A DDITIONS AND N × N M ULTIPLICATIONS FOR D IFFERENT
M ODULAR M ULTIPLICATION A LGORITHMS

Algorithm 9: Proposed Multiprecision IFFMe+ Modular


Multiplication Algorithm

Fig. 1. Top-level architecture for the proposed algorithms including


the IFFM− , IFFMo+ , IFFMe+ , Multi-IFFM− , Multi-IFFMo+ , and Multi-
IFFMe+ .

where A j is the j th k-bit digit of A and n is the number


of digits. To reduce the bit width used in each iteration,
the interleaving of multiplication and reduction is adopted as
follows:

AB mod p = . . . ((0 · 2k + An−1 B mod p) · 2k

+ An−2 B mod p) · 2k +· · ·+ A1 B mod p · 2k
+ A0 B mod p. (13)
It can be observed that a recursive equation can be concluded
as
C ( j ) = C ( j +1) · 2k + A j B mod p (14)
where $C^{(n)} = 0$ and $C^{(j)}$ are the intermediate values for $0 < j < n$, and $C^{(0)}$ is the final result. In our proposed algorithms, we can transfer the modulus $p$ to $R$. We represent the coefficients a0 and a1 of the field element $A$ with the form of (12). For the quadratic polynomial, a0 and a1 are the results after mapping. Take the IFFMe+ as an example. The recursive process starts from Step 1 and finishes at Step 5 in Algorithm 7. The postprocessing step can be finally executed after the iterative step to reduce computation consumption. If $n > 1$, the postprocessing can be further simplified. The multiprecision IFFMe+ (Multi-IFFMe+) algorithm is shown in Algorithm 9. This scheme can be also applied to other proposed algorithms in the same way.

Fig. 2. Proposed top-level architecture for the IFFM−.

IV. HARDWARE ARCHITECTURE

A. Top-Level Architecture

The top-level architecture is shown in Fig. 1, where all the proposed algorithms are covered, and the variables and the data flow are labeled. The connection lines and diagrams of the common parts are shown in black. The parts in shaded area are used for the algorithms with $f = 2$, including the IFFM−, IFFMo+, and their multiprecision algorithms. The subtracters in the green dotted box are adopted for the algorithms with $p = f \cdot 2^x b^y + 1$. When $f = 1$, the green dotted line is connected. Modules in light gray, including the Memory, FeedB, and two adders, are used for multiprecision designs. Totally, this figure covers six designs: the IFFM−, IFFMo+, IFFMe+, Multi-IFFM−, Multi-IFFMo+, and Multi-IFFMe+ modular multipliers.
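The digit-serial recurrence in (13) and (14) can be checked numerically with a short software model. The following is a minimal sketch, assuming a plain reduction modulo $p$ in every iteration; the function name `interleaved_modmul` and the example modulus are illustrative choices and not part of the proposed algorithms, which reduce coefficients with respect to $R$ instead.

```python
def interleaved_modmul(a: int, b: int, p: int, k: int, n: int) -> int:
    """Compute a*b mod p by scanning the n k-bit digits of a, most significant first.

    Hypothetical software model of recurrence (14); it assumes a < 2**(k*n).
    """
    mask = (1 << k) - 1
    c = 0  # C^(n) = 0
    for j in range(n - 1, -1, -1):
        a_j = (a >> (j * k)) & mask          # j-th k-bit digit of a
        c = (c * (1 << k) + a_j * b) % p     # C^(j) = C^(j+1) * 2^k + A_j * B mod p
    return c                                  # C^(0) is the final result


if __name__ == "__main__":
    p = (1 << 127) - 1  # example modulus chosen only for the demo
    a = 0x0123456789ABCDEF0FEDCBA987654321
    b = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
    assert interleaved_modmul(a, b, p, k=32, n=4) == (a * b) % p
```

Each iteration only handles a $k$-bit digit of one operand, which is the reason the intermediate bit width stays bounded as $n$ grows.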


Fig. 3. Proposed top-level architecture for the Multi-IFFMe+.

To make Fig. 1 easier to read, we split it and give two of them as examples shown in Figs. 2 and 3. Fig. 2 shows the top-level architecture for the IFFM− algorithm, which is a completely feed-forward design. Fig. 3 shows the top-level architecture for the Multi-IFFMe+ algorithm, which is iteratively processed. In these designs with the multiprecision scheme, the bit width of each module can be easily adjusted by the parameter $n$ to make a good tradeoff between the speed and the resource consumption. Since the input coefficients cannot be pushed into the computation modules at one time, the memory is necessary to store them, and it is also used to buffer the other groups of inputs to accelerate the design. If the bit width of the digits $k$ is determined, the number of iterations $n$ would be fixed. It is clear that all the six proposed modular multipliers have constant execution time.

From the top-level architectures, we can see that, besides the explicit adders, seven submodules are used: 1) Map; 2) Mul; 3) IBR/IBR+; 4) Post_Paral; 5) Post_Proc; 6) Demap; and 7) FeedB. They will be detailed in the following. Since the modules except the FeedB module for the multiprecision algorithms are almost the same as their primitives, they will not be separately discussed in the following for brevity.

B. Proposed Submodules

1) Map and Demap Modules: The mapping and demapping are used for the IFFM− and IFFMo+ algorithms with $f = 2$ to transfer the three coefficients into two coefficients and the inverse process, respectively. As the subtraction operation is implemented in hardware by using the addition operation with the two's complement format of the subtrahend, we can reformulate (2) for the $a_2 = 1$ part to reduce the complexity

$$a_i^{p} = R - a_i - 1 = R + (a_i)_{\mathrm{comp}} - 1 = R + ((a_i)_{\mathrm{inv}} + 1) - 1 = R + (a_i)_{\mathrm{inv}}, \quad i = \{1, 0\} \tag{15}$$

where the indexes comp and inv denote the two's and ones' complements of $a_i$, respectively. For the map+ function, the constant addend is replaced by $R + 2$. By using this method, the critical path can be effectively reduced. The proposed architecture of the Map module is presented in Fig. 4(a), where the constant $R$ in blue is for the map function, and the $R + 2$ in green is for the map+ function. The Demap module is to make the final results back to normal. The architecture of this module shown in Fig. 4 is almost the same as that of the Map module except the control signals and the additional output, c2.

Fig. 4. (a) Map and (b) Demap modules.

2) IBR/IBR+ and Post_Paral Modules: The IBR module computes the reduction using the method presented in Algorithm 5. The architecture is shown in Fig. 5, where the bit widths of the data flow are marked, and the parameter α is assumed as 2N. After the positive integer $c$ is input, its N1 LSBs are saved in $c_L$, and the other 2N − N1 bits are saved in $c_H$ and sent into the following steps (Step 1 of Algorithm 5). First, $c_H$ is multiplied by the constant parameter λ using a constant multiplier cMul 0, and the tentative quotient q0 is obtained by taking the (N + 1) MSBs of the product (Step 2 of Algorithm 5). Second, the (N2 + 1) LSBs of q0 are multiplied by the parameter $R$ in the second constant multiplier cMul 1, and the (N2 + 1) LSBs of the output are denoted as $c_a$ (Step 3 of Algorithm 5). Third, $c_a$ is subtracted from the (N2 + 1) LSBs of $c_H$ to get the tentative remainder r0. It should be pointed out that this subtraction has covered the computations from Steps 4–7 of Algorithm 5, as these steps are exactly the definition of two's complement. Fourth, the quotient $q$ and the partial remainder $r_H$ are selected from the corresponding tentative values and their corrections based on whether r0 is larger than $R$ (Steps 8–10 of Algorithm 5). Note that the comparator is omitted, and the sign bit of r0 is used as the control signal. Finally, the final remainder $r$ is obtained by directly assembling $r_L$ and $r_H$ together. For the IBR+ function, two multiplexers controlled by the sign bit of $c$ and two adders are needed for the output of the IBR module.

Fig. 5. IBR module.

As introduced in [23], the two IBR functions can be processed in parallel with some extra computations to reduce the overall latency. We follow this idea and further modify it for our proposed algorithms. For the IFFM−, as q0 + r10 ranges in $[0, (5/2)R - 4]$, subtracting $R$ is needed at most twice for r1. For the IFFMo+ and IFFMe+, the ranges of q0 + r10 are $[-(R/2), 2R + 1]$ and $[-R + 1, 2R + 1]$, respectively. Note that, though they are different, the required operations, one addition and two subtractions, are the same. The architectures for the three algorithms are presented in Fig. 6. To reduce the data path, we arrange the adders in parallel as much as possible. It can be seen that the comparators are removed by using the sign bits to select the right results. The critical path is only two adders and two multiplexers for IFFM− and two adders and three multiplexers for IFFMo+ and IFFMe+.

Fig. 6. Post_Paral module.
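The quotient-estimate-then-correct pattern described for the IBR module, and the replacement of comparators by sign bits, can be illustrated with a generic textbook Barrett-style reduction. This is a sketch under stated assumptions and not a transcription of Algorithm 5: the names `barrett_setup` and `barrett_divmod`, the full-width constant `lam`, and the single correction step are choices made here for illustration only.

```python
def barrett_setup(R: int, n_bits: int) -> int:
    """Precompute lam = floor(2**(2*n_bits) / R) for a modulus R < 2**n_bits."""
    return (1 << (2 * n_bits)) // R


def barrett_divmod(c: int, R: int, lam: int, n_bits: int) -> tuple:
    """Return (q, r) with c = q*R + r and 0 <= r < R, for 0 <= c < 2**(2*n_bits)."""
    q0 = (c * lam) >> (2 * n_bits)  # tentative quotient: at most one below the true one
    r0 = c - q0 * R                 # tentative remainder, lies in [0, 2R)
    d = r0 - R                      # correction candidate, computed unconditionally
    if d < 0:                       # the sign of d acts as the select signal
        return q0, r0
    return q0 + 1, d


if __name__ == "__main__":
    n_bits = 16
    R = 40503  # arbitrary example modulus below 2**16
    lam = barrett_setup(R, n_bits)
    for c in (0, 1, R - 1, R, 123456789, (1 << 32) - 1):
        assert barrett_divmod(c, R, lam, n_bits) == divmod(c, R)
```

With this choice of `lam` a single conditional subtraction is enough. The Post_Paral module described above needs up to two corrections because its inputs cover the wider ranges listed in the text.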
3) Post_Proc Module: This module is to satisfy the constraints of the output for the proposed algorithms. According to Algorithm 8 for post_proc functions, the corresponding architectures can be devised, as shown in Fig. 7. The data paths are optimized by parallelizing the adders and precomputing the constant parameters. The critical paths are two adders and one multiplexer for the IFFM−, two adders and two multiplexers for the IFFMo+, and one adder and four multiplexers for the IFFMe+.

Fig. 7. Post_Proc module.

4) Multipliers: Multipliers, including the three Mul modules in the top level and the two cMul modules in the IBR, occupy the most hardware resources, and they are usually located in the critical path. If we multiply two integers directly, the critical path will be too long to be accepted because of the large input bit width. Two strategies can be adopted: serial computing and parallel computing. Both of them can efficiently reduce the bit width of a multiplier. The former consumes small resources but suffers from a huge delay. For example, if a 512 × 512 multiplication is devised by a 16 × 16 multiplier and a 32 + 32 adder, the latency would be 1024 cycles. Meanwhile, the control logic to choose the operands in each cycle is complicated, and the number of registers to save the intermediates is large. On the contrary, the latter can be designed with small latency but requires relatively large resources. The control logic is not needed, and there are no intermediates to be saved. Moreover, the optimization methods, such as the Karatsuba [32], can be easily adopted. Since the latency comes first in our designs, we adopt the latter strategy with Karatsuba optimization to design the multipliers.

Fig. 8 shows an example to design a multiplier with the two-level Karatsuba method, where the bit width of the operands is assumed as $N$ and a multiple of 4. The two $N$-bit operands $A$ and $B$ are decomposed and combined twice recursively before and after the multiplying step, respectively. The multiplying step is made up of five $(N/4) \times (N/4)$, three $((N/4) + 1) \times ((N/4) + 1)$, and one $((N/4) + 2) \times ((N/4) + 2)$ multipliers, reducing the number of multipliers from 16 to 9. The decomposition or combination consists of adders, subtractors, and shifters. It can be seen that the data path includes one $(N/2) + (N/2)$ and one $((N/4) + 1) + (N/4)$ adders, one $((N/4) + 2) \times ((N/4) + 2)$ multiplier, and the two stages of the combination. Pipelines can be easily inserted to increase the clock frequency. For the combination, the pipeline is inserted between the partial sums.

Fig. 8. Example: a two-level Karatsuba decomposition for a multiplier, where the bit width of the operands is assumed as N and a multiple of 4.

In a general case, when an $N$-bit operand is divided into $n$ digits and $(N/n) = w$, the number of the Karatsuba levels is equal to $k = \log_2 n$. There is an interesting fact: no matter what $n$ ($n > 2$) is, the sizes of those small multipliers always are $w \times w$ (or $w$-bit), $(w + 1) \times (w + 1)$ (or $(w + 1)$-bit), and $(w + 2) \times (w + 2)$ (or $(w + 2)$-bit). Theoretically, the Karatsuba optimization can reduce the number of digit multipliers from $n^2$ to $3^k$ with an increase in adders, subtractors, and shifters. We have verified through experiment that, when a multiplier with large $N$ (e.g., 100∼1000) is designed with a higher order Karatsuba decomposition, the resource and latency both can be reduced. The levels of such decomposition are specifically devised for each multiplier in our designs, usually three or four levels. Several stages of the pipeline are good enough to meet the frequency. We have carefully arranged them and made a good tradeoff between the clock frequency and latency.

5) FeedB Module for Multiprecision-Based Algorithms: According to Steps 4 and 5 of Algorithm 9, two $k$-bit left shifters are required for the outputs of last iteration. Therefore, for the Multi-IFFMe+, the FeedB module is composed of two $k$-bit left shifters. For the Multi-IFFM− and Multi-IFFMo+, the FeedB modules are a little different from that of the Multi-IFFMe+ as three instead of two inputs are put into those modules. We can deduce the formulas as

$$\begin{aligned}
c_2 &= \bigl((m_1 \bmod 2) + c_2 \cdot 2^k\bigr) \bmod 2 = m_1 \bmod 2,\\
c_1 &= (m_3 - m_1 - m_2) + c_1 \cdot 2^k,\\
c_0 &= m_2 \pm \frac{m_1}{2} + c_0 \cdot 2^k \pm \frac{c_2}{2} \cdot 2^k\\
    &= m_2 \pm \frac{m_1}{2} + (2c_0 \pm c_2) \cdot 2^{k-1}
\end{aligned}$$

where the plus sign is for Multi-IFFM− and minus sign for Multi-IFFMo+. It can be seen from the terms $c_1 \cdot 2^k$ and $(2c_0 \pm c_2) \cdot 2^{k-1}$ that the FeedB module for them is made up of one 1-bit left shifter, one adder, one $(k - 1)$-bit left shifter, and one $k$-bit left shifter.

TABLE II
CALCULATION OF INSERTED PIPELINE STAGES m, DIGITS n, AND LATENCY FOR THE PROPOSED ALGORITHMS ON FPGA
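The multiplier counts quoted for the two-level Karatsuba decomposition of Section IV-B4 (five $(N/4)$-bit, three $((N/4)+1)$-bit, and one $((N/4)+2)$-bit multipliers) can be reproduced with a small software model. The sketch below is a hedged illustration: the function `karatsuba`, the `leaf_widths` bookkeeping, and the rule of splitting off the low half of the nominal width are assumptions made here and are not taken from the hardware description.

```python
from collections import Counter


def karatsuba(a: int, b: int, width: int, levels: int, leaf_widths: Counter) -> int:
    """Multiply a*b (both below 2**width) with `levels` Karatsuba levels.

    leaf_widths records the nominal operand width of every base multiplier.
    """
    if levels == 0:
        leaf_widths[width] += 1      # one width x width base multiplier
        return a * b
    half = width // 2                # bits kept in the low part
    hi_w = width - half              # bits in the high part
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba(a_lo, b_lo, half, levels - 1, leaf_widths)
    hi = karatsuba(a_hi, b_hi, hi_w, levels - 1, leaf_widths)
    # The operand sums need one extra bit, hence the (hi_w + 1)-bit multiplier.
    mid = karatsuba(a_lo + a_hi, b_lo + b_hi, hi_w + 1, levels - 1, leaf_widths)
    return (hi << (2 * half)) + ((mid - hi - lo) << half) + lo


if __name__ == "__main__":
    sizes = Counter()
    a, b = 0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF
    assert karatsuba(a, b, 64, 2, sizes) == a * b
    # N = 64: five 16-bit, three 17-bit, and one 18-bit base multipliers
    assert sizes == Counter({16: 5, 17: 3, 18: 1})
    print(sum(sizes.values()), "base multipliers instead of 16")
```

Raising `levels` to $k$ gives $3^k$ base multipliers in this model, in line with the $n^2$ to $3^k$ reduction stated in the text.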

V. FPGA IMPLEMENTATION RESULTS

To compare with conventional implementations, the proposed designs are coded in Verilog language and implemented on FPGA. Since the complexity of IFFMo+ is almost the same as that of IFFM−, the implementations are mainly focused on the IFFM− and IFFMe+ and their multiprecision versions. The adopted SIDH-friendly prime moduli are $p_{771} = 2 \cdot 2^{386} 3^{242} - 1$ provided in [13] for IFFM− and $p_{752} = 2^{394} 5^{154} + 1$ in [25] for IFFMe+, targeting at the Level-5 NIST security, the same as the SIKEp751 provided in [11]. We will show more implementation details for them in the following.

The Xilinx Vivado 2016.4 EDA platform is adopted. We select the Virtex-7 xc7vx690tffg1157-3 board, which is used by Koziel et al. [16], [17] for SIDH protocol implementation, and implement all the proposed algorithms on it for comparisons.

For high-speed applications, we should increase the clock frequency and reduce the required clock cycles (CCs). Parallel processing and pipelining are both adopted in our designs to increase the speed. Many efforts have been made to optimize the data path of each module, as shown earlier. Several pipeline stages are inserted such that the proposed designs can achieve a reasonable clock frequency. To further enhance the throughput, the interleaving scheme is used to make the hardware resources be fully utilized. Assuming that the number of pipeline stages is $m$, $m$ pairs of inputs can simultaneously be computed with a latency of $mn$ CCs, where $n$ is the number of digits of the operands. Though the latency for one pair of inputs is $mn$ CCs, the average number of cost cycles for it is only $n$ CCs. For the algorithms without the multiprecision scheme, $n$ is equal to 1. The maximum throughput can be computed as

$$\text{throughput} = \frac{m \cdot N \cdot f_{\mathrm{clk}}}{mn} = \frac{N \cdot f_{\mathrm{clk}}}{n} \tag{16}$$

where $f_{\mathrm{clk}}$ is the clock frequency and $N$ is the output bit width. Table II shows the numbers of inserted pipelines, digits, and the total latency of the proposed designs on FPGA.

The comparisons with prior arts based on unconventional radix are provided in Table III. It should be pointed out that the throughput in the table denotes the maximum throughput. Since our multipliers can be fully interleaved and have a higher clock frequency than the conventional ones, it is clear that each of the proposed designs obtains higher throughput than the prior arts. Our designs without and with the multiprecision scheme achieve more than 60 times and about ten times higher throughput than the state-of-the-art FFM2 design, respectively. However, the introduced extra hardware resources are relatively small. For example, compared with the best design FFM2, the multiprecision designs consume some extra BRAMs but fewer FFs, LUTs, and DSPs, and meanwhile, for the designs without the multiprecision scheme, except the extra BRAMs, the increase ratio in other resources is only about two to three times, far smaller than the increase in speed.

TABLE III
COMPARISONS OF MODULAR MULTIPLIERS BASED ON UNCONVENTIONAL RADIX FOR LEVEL-5 NIST SECURITY IMPLEMENTING ON FPGA

The comparisons with conventional high-radix Montgomery-based multipliers on FPGA platforms are shown in Table IV. Since the used EDA platform does not contain the option for the Virtex-5 device, we only provided the results for the Virtex-7 device.

TABLE IV
COMPARISONS WITH THE HIGH-RADIX MONTGOMERY-BASED MODULAR MULTIPLIERS ON FPGA

The modular multiplier used in [11] is an efficient high-radix Montgomery-based multiplier, whose radix is set to $2^{16}$. Note that the core architectures of the used Montgomery modular multipliers in [16] and [17] are the same but neither of them provided the individual results for the multipliers. Luckily, we have found the VHDL source code of the adopted multipliers applied to SIKEp751 at the SIKE library [11] and used it for comparison. In order to make a fair comparison with the Montgomery modular multiplier used in [11], [16], and [17], we have picked out one of the proposed algorithms, the IFFM−, and redesigned it by inserting more pipelines to achieve higher frequency. We implemented the two designs on the Virtex-7 xc7vx690tffg1157-3 board based on the Xilinx Vivado 2016.4 EDA platform. The implementation results are shown in Table IV. The design in the SIKE library has four replicated multipliers, each of which has two branches. The whole design obtains a frequency of 183.8 MHz in the selected platform. The latency and interleaved latency are taken from [17, Table 4], for which we have also checked through simulation. For the redesigned IFFM−, we achieved 156.3 MHz by inserting 23 extra CCs, about 2.6× faster speed, and 2.3× more CCs compared with the original IFFM− multiplier. It should be noted that the number of DSPs in our new design is reduced by using the left shifters and adders for the constant multipliers in the multiplying step (as shown in Fig. 8), with further growth in LUTs and Slices. It can be seen that our design achieves about 11× higher throughput with only 28% latency compared with that of [11]. The increases in FFs, LUTs, Slices, and DSPs are only about 3.3×, 5.4×, 4.5×, and 1.3×, respectively. In this way, we can claim that our design is more promising than the previous one in [11], [16], and [17].

Considering the other two efficient high-radix Montgomery-based multipliers, the radix of [27] is fixed, equal to $2^4$, and that used in [28] is a variable, whose multiplicand is converted from the binary representation to its canonical representation [33], to make several (1 ∼ 3) bits be processed in one CC. Since the additions form the bottleneck in the small-radix multipliers in [27] and [28], both critical paths are fully optimized with carry-save adders, which enable achieving higher frequencies than ours. However, the adopted partitioning strategies in [27] and [28] also lead to long latency, which basically grows linearly with the bit width of the modulus. From the results of the 1024- and 512-bit multipliers in [27] or [28], we can see that, when the frequency is fixed, the throughput is almost unchanged. Based on (16), we can conclude that the throughputs of the two works are unchanged when the modulus is equal or close to the one used by us. In addition, the resource consumption also grows almost linearly with the bit width. Therefore, our design can obtain more than 100× higher throughput than both of them, while the increase in slices is only about ten times. If the designs in [27] and [28] are implemented on a Virtex 7 device, the clock frequency will be increased roughly in proportion to the scaling factor of the CMOS technology. It means that the throughput will be almost doubled by using the Virtex 7 device. Apparently, our design still greatly outperforms the previous two works in terms of efficiency.

VI. CONCLUSION

In this article, we have proposed three low-complexity modular multiplication algorithms called IFFM−, IFFMo+, IFFMe+, and their corresponding multiprecision versions for the SIKE protocol. Six new constant-time architectures were presented based on these algorithms. By incorporating the smart formula transformation, novel architectural optimization, and maximum interleaving processing, the proposed designs demonstrate significant advantages over conventional ones. For high-speed applications, the primitive multipliers are very suitable thanks to the low latency. For embedded applications, the multiprecision versions are good alternatives. We believe that these achievements will greatly contribute to the practicability of this protocol.

APPENDIX

A. Deduction for the Range of $c_0^o$ for the Proposed Algorithms

1) For the IFFM−: The deduction is based on the assumption that the input numbers $A$ and $B$ are the field elements in $F_p$, where $p = 2R^2 - 1$. According to Algorithm 6, after mapping, the coefficients of $A$ and $B$ are still in the normal range. After the first tentative computing in Step 5, we can compute the ranges of the coefficients of $C$ with the upper and lower limits as

$$c_2 \in \{0, 1\}, \quad 0 \le c_1 \le 2(R - 1)^2, \quad 0 \le c_0 \le \frac{3}{2}(R - 1)^2.$$

With the first IBR function, c0 is decomposed as q0 and r0 in Step 6, whose ranges are $[0, (3/2)R - 3]$ and $[0, R - 1]$,


respectively, and thus, we have

$$0 \le c_1 + q_0 \le 2R^2 - \frac{5}{2}R - 1 = 2R^2 - 3R + \left(\frac{1}{2}R - 1\right).$$

After the second IBR function in Step 7, we have

$$0 \le q_1 \le 2R - 3, \quad 0 \le r_1 \le R - 1.$$

Since $t$ ranges in $[0, 2R - 2]$, the term $(t/2)$ has the range of $[0, R - 1]$. Therefore, we can deduce the range of $c_0^o$ for the IFFM− as $[0, 2R - 2]$.

2) For the IFFMo+: Similar to the IFFM−, we can begin our deduction after the mapping. The first tentative results have the ranges as

$$c_2 \in \{0, 1\}, \quad 0 \le c_1 \le 2(R^2 - 1), \quad -\frac{(R - 1)^2}{2} \le c_0 \le (R + 1)^2.$$

By using the first IBR+ function, r0 is in $[0, R - 1]$, and q0 is in $[-(R/2), R + 2]$. Note that the arithmetic operations mentioned so far are monotonic, so we can directly operate on the maximum and minimum values to obtain the upper and lower limits. However, for the formula c1 + q0, the monotonicity is broken by the subtraction, and the assumptions for the coefficients of $A$ and $B$ are not the same when computing the two variables' upper and lower limits. When calculating the lower limit of this formula, we should take the minimums of c1 and q0 into consideration. The lower limit of q0 is achieved by assuming a0 or b0 equal to zero and a1 = b1 = R − 1. When a0 = b0 = 0, the lower limit of c1 can also be obtained. Thus, we can get the lower limit for c1 + q0 as $-(R/2)$ when a0 = b0 = 0 and a1 = b1 = R − 1. However, the upper limit of q0 is achieved by setting a0 and b0 to R + 1 and a1 or b1 to zero, while that of c1 is obtained when a0 = b0 = R + 1 and a1 = b1 = R − 1. Since q0 is the quotient of c0 divided by $R$, obviously, satisfying the assumptions for c1 can obtain the largest value of c1 + q0. Hence, the upper limit of c1 + q0 is $2R^2 + (R/2) + 1$. After computing by the second IBR+ function, we have

$$-1 \le q_1 \le 2R, \quad 0 \le r_1 \le R - 1.$$

Then, the formula (q1 + c0/2) ranges in $[-1, R]$. Therefore, we can obtain the range of $c_0^o$ for the IFFMo+ as $[-R, R]$.

3) For the IFFMe+: The IFFMe+ does not need the mapping and demapping process. We can refer to the deduction for the IFFMo+. In Step 2 of Algorithm 7, we have the constraints as

$$c_2 \in \{0, 1\}, \quad 0 \le c_1 \le 2(R^2 - 1), \quad -(R - 1)^2 \le c_0 \le (R + 1)^2.$$

In Step 3, the ranges for q0 and r0 are

$$-R + 1 \le q_0 \le R + 2, \quad 0 \le r_0 \le R - 1.$$

According to the abovementioned analysis, the range of c1 + q0 becomes $[-R + 1, 2R^2 + 2]$. After the second IBR+ function in Step 4, we obtain

$$-1 \le q_1 \le 2R, \quad 0 \le r_1 \le R - 1.$$

Therefore, we can have the range of $c_0^o$ for the IFFMe+ as $[-2R, R]$.

TABLE V
BRUTE-FORCE SEARCH RESULTS FOR THE PRIME $p = 2^x 3^y + k$ TARGETING AT THE NIST SECURITY LEVEL 5

B. Prime Searching

As the documentation covered in the SIKE library [11] states, those parameters are still conservative lower bounds on the true classical gate count. It can be concluded that the four SIKE primes can be replaced with smaller ones to reduce the resource. By analyzing those four primes, we find that a good prime has two features: meeting the security requirement and the two large-degree isogenies ($2^x$ and $3^y$) being equal or close to each other as much as possible. In the four primes, the differences between the two isogenies are smaller than 10 bits. With these clues, we designed a brute-force search for the prime $p = 2^x 3^y + k$ suitable for our algorithms targeting at the NIST security level 5, by setting $x \in [300, 400)$, the even $y \in [200, 300)$, $|x - \log_2 3^y| \le 10$, and $k \in \{-1, 1\}$. The search results are shown in Table V. The SIKEp751 (No. 4) is also listed for a clear comparison. Five primes are obtained with these settings. The last one (No. 6) was first proposed in [12] and has been widely used as an example for the unconventional-radix modular multiplication algorithms, such as in [18], [23], and [25]. It is also considered in our work. The other four primes all are with $k = +1$, which also demonstrates the effectiveness of our extension. The first three are more promising than the SIKEp751 parameter. One of them may be adopted as a replacement after a more strict security proof. The source code for the prime search is available at the GitHub: https://github.com/PQC229/find_prime. More explorations for other security levels can be done by adjusting the parameters according to the comments.

REFERENCES

[1] B. Bauer, D. Wecker, A. J. Millis, M. B. Hastings, and M. Troyer, “Hybrid quantum-classical approach to correlated materials,” Phys. Rev. X, vol. 6, no. 3, Sep. 2016, Art. no. 031045.
[2] R. F. Mandelbaum. This Could be the Best Quantum Computer Yet. [Online]. Available: https://gizmodo.com/this-could-be-the-best-quantum-computer-yet-1831085617
[3] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[4] V. S. Miller, “Use of elliptic curves in cryptography,” in Proc. Conf. Theory Appl. Cryptograph. Techn. Berlin, Germany: Springer, 1985, pp. 417–426.
[5] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM Rev., vol. 41, no. 2, pp. 303–332, Jan. 1999.
[6] L. Chen et al., Report on Post-Quantum Cryptography. US Department of Commerce, National Institute of Standards and Technology, 2016.


[7] R. Azarderakhsh et al. (2017). Supersingular Isogeny Key Encapsulation. Submission to the NIST Post-Quantum Standardization Project. [Online]. Available: https://sike.org/
[8] G. Adj, D. Cervantes-Vázquez, J.-J. Chi-Domínguez, A. Menezes, and F. Rodríguez-Henríquez, “On the cost of computing isogenies between supersingular elliptic curves,” in Proc. Int. Conf. Sel. Areas Cryptogr. Cham, Switzerland: Springer, 2018, pp. 322–343.
[9] S. Jaques and J. M. Schanck, “Quantum cryptanalysis in the RAM model: Claw-finding attacks on SIKE,” IACR Cryptol. ePrint Arch., vol. 2019, p. 103, Aug. 2019.
[10] D. Hofheinz, K. Hövelmanns, and E. Kiltz, “A modular analysis of the Fujisaki-Okamoto transformation,” in Proc. Theory Cryptogr. Conf. Cham, Switzerland: Springer, 2017, pp. 341–371.
[11] D. Jao et al. (2020). PQCrypto-SIDH. Submission to the NIST Post-Quantum Standardization Project. [Online]. Available: https://github.com/Microsoft/PQCrypto-SIDH
[12] D. Jao and L. De Feo, “Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies,” in Proc. Int. Workshop Post-Quantum Cryptogr. Berlin, Germany: Springer, 2011, pp. 19–34.
[13] L. De Feo, D. Jao, and J. Plût, “Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies,” J. Math. Cryptol., vol. 8, no. 3, pp. 209–247, Jan. 2014.
[14] D. Jao and V. Soukharev, “Isogeny-based quantum-resistant undeniable signatures,” in Proc. Int. Workshop Post-Quantum Cryptogr. Cham, Switzerland: Springer, 2014, pp. 160–179.
[15] R. Azarderakhsh, D. Jao, K. Kalach, B. Koziel, and C. Leonardi, “Key compression for isogeny-based cryptosystems,” in Proc. 3rd ACM Int. Workshop ASIA Public-Key Cryptogr., 2016, pp. 1–10.
[16] B. Koziel, R. Azarderakhsh, M. Mozaffari Kermani, and D. Jao, “Post-quantum cryptography on FPGA based on isogenies on elliptic curves,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 1, pp. 86–99, Jan. 2017.
[17] B. Koziel, R. Azarderakhsh, and M. M. Kermani, “A high-performance and scalable hardware architecture for isogeny-based cryptography,” IEEE Trans. Comput., vol. 67, no. 11, pp. 1594–1609, Nov. 2018.
[18] W. Liu, J. Ni, Z. Liu, C. Liu, and M. O’Neill, “Optimized modular multiplication for supersingular isogeny Diffie-Hellman,” IEEE Trans. Comput., vol. 68, no. 8, pp. 1249–1255, Aug. 2019.
[19] H. Seo, Z. Liu, P. Longa, and Z. Hu, “SIDH on ARM: Faster modular multiplications for faster post-quantum supersingular isogeny key exchange,” IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2018, pp. 1–20, Aug. 2018.
[20] A. Jalali, R. Azarderakhsh, and M. M. Kermani, “NEON SIKE: Supersingular isogeny key encapsulation on ARMv7,” in Proc. Int. Conf. Secur., Privacy, Appl. Cryptogr. Eng. Cham, Switzerland: Springer, 2018, pp. 37–51.
[21] A. Jalali, R. Azarderakhsh, M. M. Kermani, and D. Jao, “Towards optimized and constant-time CSIDH on embedded devices,” in Proc. Int. Workshop Constructive Side-Channel Anal. Secure Design. Cham, Switzerland: Springer, 2019, pp. 215–231.
[22] T. Blum and C. Paar, “High-radix Montgomery modular exponentiation on reconfigurable hardware,” IEEE Trans. Comput., vol. 50, no. 7, pp. 759–764, Jul. 2001.
[23] A. Karmakar, S. S. Roy, F. Vercauteren, and I. Verbauwhede, “Efficient finite field multiplication for isogeny based post quantum cryptography,” in Proc. Int. Workshop Arithmetic Finite Fields. Cham, Switzerland: Springer, 2016, pp. 193–207.
[24] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170, pp. 519–521, Apr. 1985.
[25] J. W. Bos and S. J. Friedberger, “Arithmetic considerations for isogeny-based cryptography,” IEEE Trans. Comput., vol. 68, no. 7, pp. 979–990, Jul. 2019.
[26] J. Tian, J. Lin, and Z. Wang, “Ultra-fast modular multiplication implementation for isogeny-based post-quantum cryptography,” in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), Oct. 2019, pp. 97–102.
[27] G. D. Sutter, J.-P. Deschamps, and J. L. Imana, “Modular multiplication and exponentiation architectures for fast RSA cryptosystem based on digit serial computation,” IEEE Trans. Ind. Electron., vol. 58, no. 7, pp. 3101–3109, Jul. 2011.
[28] A. Rezai and P. Keshavarzi, “High-throughput modular multiplication and exponentiation algorithms using multibit-scan–multibit-shift technique,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1710–1719, Sep. 2015.
[29] A. Rezai and P. Keshavarzi, “High-performance scalable architecture for modular multiplication using a new digit-serial computation,” Microelectron. J., vol. 55, pp. 169–178, Sep. 2016.
[30] A. Rezai and P. Keshavarzi, “Compact SD: A new encoding algorithm and its application in multiplication,” Int. J. Comput. Math., vol. 94, no. 3, pp. 554–569, Mar. 2017.
[31] P. Barrett, “Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor,” in Proc. Conf. Theory Appl. Cryptograph. Techn. Berlin, Germany: Springer, 1986, pp. 311–323.
[32] A. A. Karatsuba and Y. P. Ofman, “Multiplication of many-digital numbers by automatic computers,” in Doklady Akademii Nauk, vol. 145, no. 2. Moscow, Russia: Russian Academy of Sciences, 1962, pp. 293–294.
[33] G. W. Reitwiesner, “Binary arithmetic,” in Advances in Computers, vol. 1. Amsterdam, The Netherlands: Elsevier, 1960, pp. 231–308.

Jing Tian (Student Member, IEEE) received the B.S. degree in microelectronics and the Ph.D. degree in information and communication engineering from Nanjing University, Nanjing, China, in 2015 and 2020, respectively.
She is currently an Associate Researcher with Nanjing University. Her research interests include VLSI design for digital signal processing and cryptographic engineering.

Jun Lin (Senior Member, IEEE) received the B.S. degree in physics and the M.S. degree in microelectronics from Nanjing University, Nanjing, China, in 2007 and 2010, respectively, and the Ph.D. degree in electrical engineering from Lehigh University, Bethlehem, PA, USA, in 2015.
From 2010 to 2011, he was an ASIC Design Engineer with AMD, Shanghai, China. In summer 2013, he was an Intern with Qualcomm Research, Bridgewater, NJ, USA. In June 2015, he joined the School of Electronic Science and Engineering, Nanjing University, where he is currently an Associate Professor. His current research interests include low-power high-speed VLSI design, specifically VLSI design for digital signal processing and cryptography.
Dr. Lin is also a member of the Design and Implementation of Signal Processing Systems (DISPS) Technical Committee of the IEEE Signal Processing Society. He was a co-recipient of the Merit Student Paper Award at the IEEE Asia Pacific Conference on Circuits and Systems in 2008. He was a recipient of the 2014 IEEE Circuits & Systems Society (CAS) Student Travel Award.

Zhongfeng Wang (Fellow, IEEE) received the B.S. and M.S. degrees from Tsinghua University, Beijing, China, in 1988 and 1990, respectively, and the Ph.D. degree from the University of Minnesota, Minneapolis, MN, USA, in 2000.
He was a Leading VLSI Architect with Broadcom Corporation, Irvine, CA, USA, from 2007 to 2016. He was with Oregon State University, Corvallis, OR, USA, and National Semiconductor Corporation, Santa Clara, CA, USA. He has been a Distinguished Professor with Nanjing University, Nanjing, China, since 2016. He is a world-recognized expert on low-power high-speed VLSI design for signal processing systems. He has published over 200 technical articles with multiple best paper awards received from the IEEE technical societies, among which is the VLSI Transactions Best Paper Award of 2007. He has edited one book VLSI. He holds more than 20 U.S. and China patents. His current research interests are in the areas of optimized VLSI design for digital communications and deep learning.
Dr. Wang was elevated as a Fellow of IEEE for contributions to VLSI design and implementation of forward error correction (FEC) coding in 2015. He has served as a TPC member and various chairs for tens of international conferences. In the current record, he has had many papers ranking among the top-25 most (annually) downloaded manuscripts in the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He has served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS (CAS) I, T-CAS-II, and TVLSI for many terms. Moreover, he has contributed significantly to the industrial standards. So far, his technical proposals have been adopted by more than 15 international networking standards.
