
A GMP-based Implementation of Schönhage-Strassen’s

Large Integer Multiplication Algorithm

Pierrick Gaudry, Alexander Kruppa, Paul Zimmermann


LORIA, CACAO project-team, Campus scientifique, 54506 Vandœuvre-lès-Nancy

ABSTRACT

Schönhage-Strassen's algorithm is one of the best known algorithms for multiplying large integers. Implementing it efficiently is of utmost importance, since many other algorithms rely on it as a subroutine. We present here an improved implementation, based on the one distributed within the GMP library. The following ideas and techniques were used or tried: faster arithmetic modulo 2^n + 1, improved cache locality, Mersenne transforms, Chinese Remainder Reconstruction, the √2 trick, Harley's and Granlund's tricks, improved tuning.

Categories and Subject Descriptors

I.1.2 [Computing methodologies]: Algorithms—Symbolic and algebraic manipulation

General Terms

Algorithms, Performance

Keywords

Integer multiplication, multiprecision arithmetic

INTRODUCTION

Since Schönhage and Strassen presented in 1971 a method to multiply two N-bit integers in O(N log N log log N) time [19], several authors have shown how to reduce other operations — inverse, division, square root, gcd, base conversion, elementary functions — to multiplication, possibly with log N multiplicative factors [5, 7, 15, 16, 18, 21]. It has now become common practice to express complexities in terms of the cost M(N) to multiply two N-bit numbers, and many researchers tried hard to get the best possible constants in front of M(N) for the above-mentioned operations (see for example [6, 14]).

Strangely, much less effort was made for decreasing the implicit constant in M(N) itself, although any gain on that constant will give a similar gain on all multiplication-based operations. Some authors reported on implementations of large integer arithmetic for specific hardware or as part of a number-theoretic project [2, 10]. In this article we concentrate on the question of an optimized implementation of Schönhage-Strassen's algorithm on a classical workstation.

In the last few years, the multiplication of large integers has found several new applications in "real life", and not only in computing billions of digits of π. One such application is the segmentation method (called Kronecker substitution in [23]) to reduce the multiplication of polynomials with integer coefficients to one huge integer multiplication; this is used for example in the GMP-ECM software [25]. Another example is the multiplication or factorization of multivariate polynomials [21, 22].

In this article we detail several ideas or techniques that may be used to implement Schönhage-Strassen's algorithm (SSA) efficiently. As a consequence, we obtain what we believe is the best existing implementation of SSA on current processors; this implementation might be used as a reference to compare with other algorithms based on the Fast Fourier Transform, in particular those using complex floating-point numbers.

The paper is organized as follows: §1 revisits the original SSA and defines the notation used in the rest of the paper; §2 describes the different ideas and techniques we tried; finally §3 provides timing figures and graphs obtained with our new GMP implementation, and compares it to other implementations.

1. THE ALGORITHM OF SCHÖNHAGE AND STRASSEN

Throughout the paper we use w for the computer word size in bits — usually 32 or 64 — and denote by N the number of bits of the numbers we want to multiply. Several descriptions of SSA can be found in the literature, see [11, 19] for example. We recall it here to establish the notations.

Let R_N^+ — or simply R_N — be the ring of integers modulo 2^N + 1. SSA reduces integer multiplication to multiplication in R_N, which reduces to polynomial multiplication in Z[x] mod (x^K + 1), which in turn reduces to polynomial multiplication in R_n[x] mod (x^K + 1), which finally reduces to multiplication in R_n. The reason for choosing R_N as the ring to map the input integers to is that the multiplications of elements of R_n can use SSA recursively, skipping the first step of mapping from integers to R_N again.
[Figure: the chain of reductions in SSA — Z ⇒ R_N ⇒ Z[x] mod (x^K + 1) ⇒ R_n[x] mod (x^K + 1) ⇒ R_n; if n is not yet small enough, one recurses on R_n through the same chain, otherwise one multiplies directly.]

The first reduction — from Z to R_N — is simple: to multiply two non-negative integers of u and v bits, it suffices to compute their product mod 2^N + 1 for N ≥ u + v.

The second step — a map from R_N to Z[x] mod (x^K + 1) — works as follows. Assume N = 2^k M for integers k and M, and define K := 2^k. An integer a ∈ [0, 2^N] can be uniquely written a = Σ_{i=0}^{K−1} a_i 2^{iM}, with 0 ≤ a_i < 2^M for i < K − 1, and 0 ≤ a_{K−1} ≤ 2^M; that is, we cut a into K pieces of M bits each, except the last piece can be equal to 2^M. Now the integer a is the value at x = 2^M of the polynomial A(x) = Σ_{i=0}^{K−1} a_i x^i. Assume we decompose an integer b ∈ R_N in the same manner, and let C(x) be the product A(x)B(x) over Z[x]: C(x) = Σ_{i=0}^{2K−2} c_i x^i. One now has ab = A(2^M)B(2^M) = C(2^M), thus ab = Σ_{i=0}^{2K−2} c_i 2^{iM}. Now what we really want is ab mod (2^N + 1), i.e.,

    (c_0 − c_K) + · · · + (c_{K−2} − c_{2K−2}) 2^{(K−2)M} + c_{K−1} 2^{(K−1)M},    (1)

which comes from C^+(x) := Σ_{i=0}^{K−1} c̄_i x^i = A(x)B(x) mod (x^K + 1), since x = 2^M and N = KM. To determine C^+(x), one uses a negacyclic convolution over the ring R_n, i.e., modulo 2^n + 1, where n is taken large enough so that the c̄_i can be recovered exactly. For 0 ≤ i ≤ K − 1, one has 0 ≤ c_i = Σ_{j=0}^{i} a_j b_{i−j} < (i + 1) 2^{2M}. Similarly for K ≤ i ≤ 2K − 3, one has 0 ≤ c_i < (2K − 1 − i) 2^{2M}, and finally 0 ≤ c_{2K−2} ≤ 2^{2M}. With the convention that c_{2K−1} = 0, according to (1), we have

    ((i + 1) − K) 2^{2M} ≤ c̄_i = c_i − c_{i+K} < (i + 1) 2^{2M}    (2)

for 0 ≤ i < K. Hence each coefficient of C(x) mod (x^K + 1) is confined to an interval of length K 2^{2M}, and so it suffices to have 2^n + 1 ≥ K 2^{2M}, i.e., n ≥ 2M + k.¹

¹ One might use n ≥ 2M + k + 1 to get a lifting algorithm from R_n to Z which is independent of i.
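The cut-into-pieces map is straightforward to express at the mpz level. The following sketch is ours, not GMP code (the real implementation works on limb arrays and avoids copies); it assumes a non-negative a ≤ 2^N with N = KM:

    #include <gmp.h>

    /* Cut a (0 <= a <= 2^N, N = K*M) into K pieces of M bits each:
       a = sum_i a_i 2^{iM}.  Illustrative sketch only. */
    void decompose (mpz_t *ai, const mpz_t a, unsigned long K, unsigned long M)
    {
        mpz_t t;
        unsigned long i;
        mpz_init_set (t, a);
        for (i = 0; i + 1 < K; i++)
          {
            mpz_tdiv_r_2exp (ai[i], t, M);  /* low M bits */
            mpz_tdiv_q_2exp (t, t, M);      /* shift right by M bits */
          }
        mpz_set (ai[K - 1], t);             /* last piece may equal 2^M */
        mpz_clear (t);
    }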
The negacyclic convolution A(x)B(x) mod (x^K + 1) can be performed efficiently using the Fast Fourier Transform (FFT). More precisely, SSA uses here a simple case of the Discrete Weighted Transform (DWT) [10]. Assume ω = θ² is a primitive Kth root of unity in R_n. (All operations in this paragraph are in R_n.) Given (a_i)_{0≤i<K}, the weight signal is (a′_i := θ^i a_i)_{0≤i<K}. The forward transform computes (â_i := Σ_{j=0}^{K−1} ω^{ij} a′_j)_{0≤i<K}, and similarly for (b̂_i). One then multiplies â_i and b̂_i together in R_n (pointwise products): let ĉ_i = â_i b̂_i. The backward transform computes (c′_i := Σ_{j=0}^{K−1} ω^{−ij} ĉ_j)_{0≤i<K}:

    c′_i = Σ_{j=0}^{K−1} ω^{−ij} (Σ_{ℓ=0}^{K−1} ω^{jℓ} a′_ℓ) (Σ_{m=0}^{K−1} ω^{jm} b′_m) = Σ_{ℓ,m=0}^{K−1} a_ℓ b_m θ^{ℓ+m} Σ_{j=0}^{K−1} ω^{j(ℓ+m−i)}.

Since ω is a primitive Kth root of unity, Σ_{j=0}^{K−1} ω^{j(ℓ+m−i)} is zero unless ℓ + m − i ≡ 0 mod K, which holds for ℓ + m = i or ℓ + m = i + K. Since θ^K ≡ −1 mod (2^n + 1), it follows:

    c′_i = K θ^i Σ_{ℓ=0}^{K−1} (a_ℓ b_{i−ℓ} − a_ℓ b_{i+K−ℓ}) = K θ^i (c_i − c_{i+K}),

where b_m is assumed zero for m outside the range [0, K − 1].

SSA thus consists of five consecutive steps, where all computations in steps (2) to (4) are done modulo 2^n + 1:

(1) the "decompose" step extracts from a the M-bit parts a_i, and multiplies them by the weight signal θ^i, obtaining a′_i (similarly for b_i);

(2) the "forward transform" computes (â_0, . . . , â_{K−1}) from (a′_0, . . . , a′_{K−1}) (similarly for b̂_i);

(3) the "pointwise product" step computes ĉ_i = â_i b̂_i, for 0 ≤ i < K;

(4) the "backward transform" computes (c′_0, . . . , c′_{K−1}) from (ĉ_0, . . . , ĉ_{K−1});

(5) the "recompose" step divides c′_i by 2^k θ^i, and constructs the final result as c̄_0 + c̄_1 2^M + · · · + c̄_{K−1} 2^{(K−1)M}. Some c̄_i, defined in Eq. (2), may be negative, but the sum is necessarily non-negative.

For a given input bit-size N, several choices of the FFT length K may be possible. SSA is thus a whole family of algorithms: we call FFT-K — or FFT-2^k — the algorithm splitting the inputs into K = 2^k parts. For a given input size N, one of the main practical problems is how to choose the best value of the FFT length K, and thus of the bit-size n of the smaller multiplies (see §2.6).

1.1 Choice of n and Efficiency

SSA takes for n a multiple of K, so that ω = 2^{2n/K} is a primitive Kth root of unity, and θ = 2^{n/K} is used for the weight signal. This ensures that all FFT butterflies only involve additions/subtractions and shifts on a radix 2 computer (see §2.1).

In practice one may additionally require n to be a multiple of the word size w, to make the arithmetic modulo 2^n + 1 simpler. Indeed, a number from R_n is then represented by n/w machine words, plus one additional bit of weight 2^n. We call this a semi-normalized representation, since values up to 2^{n+1} − 1 can be represented.

For a given bit size N divisible by K = 2^k, we define the efficiency of the FFT-K scheme as

    (2N/K + k) / n,

where n is the smallest multiple of K larger than or equal to 2N/K + k. For example for N = 1,000,448 and K = 2^10, we have 2N/K + k = 1964, and the next multiple of K is n = 2048, therefore the efficiency is 1964/2048 ≈ 96%. For N = 1,044,480 with the same value of K, we have 2N/K + k = 2050, and the next multiple of K is n = 3072, with an efficiency of about 67%. The FFT scheme is close to optimal when its efficiency is near 100%.
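As a concrete reference for the transforms just defined, here is a deliberately naive, quadratic-time version of the forward weighted transform, written with mpz for clarity (our sketch; the real code works on limb arrays). It uses only shifts, since ω^{ij} θ^j = 2^{t(2i+1)j} with θ = 2^t, t = n/K, and exponents may be reduced mod 2n because 2^{2n} ≡ 1 (mod 2^n + 1); it is suitable for moderate K only, e.g. as a check against a fast implementation:

    #include <gmp.h>

    /* Naive forward DWT: ahat[i] = sum_j omega^{ij} theta^j a[j] mod Fn,
       with theta = 2^{n/K}, omega = theta^2, Fn = 2^n + 1, K | n.
       ahat and a must be distinct arrays.  Our reference code. */
    void dwt_forward_ref (mpz_t *ahat, mpz_t *a, unsigned long K,
                          unsigned long n, const mpz_t Fn)
    {
        unsigned long i, j, t = n / K;  /* theta = 2^t */
        mpz_t tmp;
        mpz_init (tmp);
        for (i = 0; i < K; i++)
          {
            mpz_set_ui (ahat[i], 0);
            for (j = 0; j < K; j++)
              {
                /* omega^{ij} theta^j = 2^{t((2i+1)j mod 2K)}, as 2^{2n} = 1 */
                unsigned long e = ((2 * i + 1) * j % (2 * K)) * t;
                mpz_mul_2exp (tmp, a[j], e);
                mpz_add (ahat[i], ahat[i], tmp);
              }
            mpz_mod (ahat[i], ahat[i], Fn);  /* one reduction per output */
          }
        mpz_clear (tmp);
    }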
Note that a scheme with efficiency below 50% does not need to be considered. Indeed, this means that 2N/K + k ≤ n/2, which necessarily implies that n = K (remember n has to be divisible by K). Then the FFT scheme of length K/2 can be performed with the same value of n, since 2(N/(K/2)) + (k − 1) < 4N/K + 2k ≤ n, and n is a multiple of K/2.

From this last remark, we can assume 2N/K ≥ n/2 — neglecting the small k term — which together with n ≥ K gives:

    K ≤ 2√N.    (3)

2. OUR IMPROVEMENTS

We describe in this section the ideas and techniques we have tried to improve the GMP implementation of SSA. We started from the GMP 4.2.1 implementation, and used the graph of the multiplication time up to 1,000,000 words on an Opteron as benchmark. After encoding each idea, if the new graph was better than the old one, the new idea was validated, otherwise it was discarded. Each technique saved only 5% up to 20%, but all techniques together saved a factor of about 2 with respect to GMP 4.2.1.

2.1 Arithmetic Modulo 2^n + 1

Arithmetic operations modulo 2^n + 1 have to be performed during the forward and backward transforms, when applying the weight signal, and when unapplying it. Thanks to the fact that the primitive roots of unity are powers of two, the only needed operations are additions, subtractions, and multiplications by a power of two. Divisions by 2^k can be reduced to multiplications by 2^{2n−k}.

We recall that we desire n to be a multiple of the number w of bits per word. Since n must also be a multiple of K = 2^k, this is not a real constraint, unless k < 5 on a 32-bit computer, or k < 6 on a 64-bit computer. Let m = n/w be the number of computer words corresponding to an n-bit number. A residue mod 2^n + 1 has a semi-normalized representation with m full words and one carry of weight 2^n:

    a = (a_m, a_{m−1}, . . . , a_0),

with 0 ≤ a_i < 2^w for 0 ≤ i < m, and 0 ≤ a_m ≤ 1.

The addition of two such representations is done as follows (we give here the GMP code):

    c = a[m] + b[m] + mpn_add_n (r, a, b, m);
    r[m] = (r[0] < c);
    MPN_DECR_U (r, m + 1, c - r[m]);

The first line adds (a_{m−1}, . . . , a_0) with (b_{m−1}, . . . , b_0), puts the low m words of the result in (r_{m−1}, . . . , r_0), and adds the out carry to a_m + b_m; we thus have 0 ≤ c ≤ 3. The second line yields r_m = 0 if r_0 ≥ c, in which case we simply subtract c from r_0 at the third line. Otherwise r_m = 1, and we subtract c − 1 from r_0: a borrow may propagate, but at most to r_m. In all cases r = a + b mod (2^n + 1), and r is semi-normalized. The subtraction is done in a similar manner; a sketch is given below.
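For completeness, here is our sketch of that subtraction, mirroring the addition code above (it reuses the same internal GMP macros; the routine actually used in GMP may differ). With d = a[m] − b[m] − borrow ∈ {−2, . . . , 1}, the exact difference is d·2^n + r, hence congruent to r − d modulo 2^n + 1:

    mp_limb_t bw = mpn_sub_n (r, a, b, m);        /* borrow in {0, 1} */
    mp_limb_signed_t d = (mp_limb_signed_t) a[m]
                       - (mp_limb_signed_t) b[m]
                       - (mp_limb_signed_t) bw;   /* d in {-2, ..., 1} */
    if (d <= 0)
      {
        r[m] = 0;                /* r - d = r + |d|, may carry into r[m] */
        MPN_INCR_U (r, m + 1, (mp_limb_t) -d);
      }
    else
      {
        r[m] = (r[0] == 0);      /* d = 1: subtract 1, as in the addition */
        MPN_DECR_U (r, m + 1, 1 - r[m]);
      }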
The multiplication by 2^e is more tricky to implement. However this operation mainly appears in the butterflies [a, t] ← [a + b, (a − b) 2^e] of the forward and backward transforms, which may be performed as follows:

    Bfy(a, b, t, e)
    1. Write e = d*w + s with 0 <= s < w,
       where w is the number of bits per word
    2. Decompose a = (ah, al),
       where ah contains the upper d words
       Idem for b
    3. t <- (al - bl, bh - ah)
    4. a <- a + b
    5. t <- t * 2^s

Step 3 means that the most significant words from t are formed with al - bl, and the least significant words with bh - ah, where we assume that borrows are propagated, so that t is semi-normalized. Thus the only real multiplication by a power of two is that of step 5, which may be efficiently performed with GMP's mpn_lshift routine.

If one has a combined addsub routine which computes simultaneously x + y and x − y faster than two separate calls, then step 4 can be written a <- (bh + ah, al + bl), which shows that t and a may be computed with two addsub calls.

2.2 Cache Locality During the Transforms

When multiplying large integers with SSA, the time spent in accessing data for performing the Fourier transforms is non-negligible. The literature is rich with papers dealing with the organization of the computations in order to improve the locality. However most of these papers are concerned with contexts which are different from ours: usually the coefficients are small and most often they are complex numbers represented as a pair of double's. Also there is a variety of target platforms, from embedded hardware implementations to super-scalar computers.

We have tried to apply several of these approaches in our context, where the coefficients are modular integers that fill at least a few cache lines and the target platform is a standard PC workstation.

In this work, we concentrate on multiplying large, but not huge integers. By this we mean that we consider only 3 levels of memory for our data: L1 cache, L2 cache, and standard RAM. In the future we might consider also the case where we have to use the hard disk as a 4th level.

Here are the orders of magnitude for these memories, to fix ideas: on a typical Opteron, a cache line is 64 bytes; the L1 data cache is 64 kB; the L2 cache is 1 MB; the RAM is 8 GB. The smallest coefficient size (i.e., n-bit residues) we consider is about 50 machine words, that is 400 bytes. For very large integers, a single coefficient hardly fits in the L1 cache.

The very first FFT algorithm is the iterative one. In our context this is a really bad idea. The main advantage of it is that the data is accessed in a sequential way. In the case where the coefficients are small enough so that several of them fit in a cache line, this saves many cache misses. But in our case, contiguity is irrelevant due to the size of the coefficients compared to cache lines.

The next very classical FFT algorithm is the recursive one (a skeleton of it is sketched at the end of this subsection). In this algorithm, at a certain level of recursion, we work on a small set of coefficients, so that they must fit in the cache. This version (or a variant of it) was implemented in GMP up to version 4.2.1. This behaves well for moderate sizes, but when multiplying large numbers, everything fits in the cache only at the tail of the recursion, so that most of the transform is already done when we are at last in the cache. The problem is that before getting to the appropriate recursion level, the accesses are very cache unfriendly.

In order to improve the locality for large transforms, we have tried three strategies found in the literature: the Belgian approach, the radix-2^k transform, and Bailey's 4-step algorithm.
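The classical recursive transform just mentioned looks schematically as follows. This is our sketch, with bfy the butterfly of §2.1 (a hypothetical helper here) and A a table of pointers to residues; outputs land in bit-reversed order, which is harmless since the pointwise products do not depend on the order and the backward transform can be made to match it:

    void bfy (mp_ptr a, mp_ptr b, unsigned long e, mp_size_t m);
    /* hypothetical: [a, b] <- [a + b, (a - b) * 2^e] mod 2^n + 1 */

    /* Recursive radix-2 decimation-in-frequency transform of length
       K = 2^k over R_n; e2 is the exponent of omega = 2^{e2}. */
    void fft_dif (mp_ptr *A, unsigned long k, unsigned long e2, mp_size_t m)
    {
        unsigned long i, h;
        if (k == 0)
            return;
        h = 1UL << (k - 1);                   /* half length K/2 */
        for (i = 0; i < h; i++)
            bfy (A[i], A[i + h], i * e2, m);  /* root omega^i = 2^{i*e2} */
        fft_dif (A, k - 1, 2 * e2, m);        /* first half  */
        fft_dif (A + h, k - 1, 2 * e2, m);    /* second half */
    }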
Figure 1: FFT circuit of length 8 and butterfly tree of depth 3. [The diagram — inputs a_0, . . . , a_7 flowing through butterfly Steps 1 to 3 — is not reproduced here.]
2.2.1 The Belgian Transform

In [9], Brockmeyer et al. propose a way of organizing the transform that reduces cache misses. In order to explain it, let us first define a tree of butterflies as follows (we don't mention the root of unity for simplicity):

    TreeBfy(A, index, depth, stride)
      Bfy(A[index], A[index+stride])
      if depth > 1
        TreeBfy(A, index-stride/2, depth-1, stride/2)
        TreeBfy(A, index+stride/2, depth-1, stride/2)

An example of a tree of depth 3 is given on the right of Figure 1. Now, the depth of a butterfly tree is bounded by a value that is not the same for every tree. For instance, on Figure 1, the butterfly tree that starts with the butterfly between a_0 and a_4 has depth 1: one can not continue the tree on step 2. Similarly, the tree starting with the butterfly between a_1 and a_5 has depth 1, the tree starting between a_2 and a_6 has depth 2 and the tree starting between a_3 and a_7 has depth 3. More generally, the depth can be computed by a simple formula.

One can check that by considering all the trees of butterflies starting with an operation at step 1, we cover the complete FFT circuit. It remains to find the right ordering for computing those trees of butterflies. For instance, in the example of Figure 1, it is important to do the tree that starts between a_3 and a_7 in the end, since it requires data from all the other trees.

One solution is to perform the trees of butterflies following the BitReverse order. For an integer i whose binary representation fits in at most k bits, the value BitReverse(i,k) is the integer one obtains by reading the k bits (maybe padded with zeros) in the opposite order. One obtains the following algorithm, where ord_2 stands for the number of trailing zeros in the binary representation of an integer (together with the 4-line TreeBfy routine, this is a recursive description of the 36-line routine from [9, Code 6.1]); C renderings of the two index helpers are given after the pseudocode:

    BelgianFFT(A, k)
      K = 2^{k-1}
      for i := 0 to K-1
        TreeBfy(A, BitReverse(i, k-1), 1+ord_2(i+1), K)
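The two index functions can be written in C as follows (our rendering; any implementation with the same behavior will do):

    /* BitReverse(i, k): read the k low bits of i in the opposite order. */
    unsigned long bit_reverse (unsigned long i, unsigned long k)
    {
        unsigned long r = 0, j;
        for (j = 0; j < k; j++)
            r |= ((i >> j) & 1UL) << (k - 1 - j);
        return r;
    }

    /* ord_2(i): number of trailing zero bits of i, for i > 0. */
    unsigned long ord_2 (unsigned long i)
    {
        unsigned long t = 0;
        while ((i & 1UL) == 0)
          {
            i >>= 1;
            t++;
          }
        return t;
    }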
Inside a tree of butterflies, we see that most of the time, the butterfly operation will involve a coefficient that has been used just before, so that it should still be in the cache. Therefore an approximate 50% cache-hit is provided by construction, and we can hope for more if the data is not too large compared to the cache size.

We have implemented this in GMP, and this saved a few percent for large sizes, thus confirming the fact that this approach is better than the classical recursive transform.

2.2.2 Higher Radix Transforms

The principle of higher radix transforms is to use an atomic operation which groups several butterflies. In the book [1] the reader will find a description of several variants in this spirit. The classical FFT can be viewed as a radix-2 transform. The next step is a radix-4 transform, where the atomic operation has 4 inputs and 4 outputs (without counting roots of unity) and groups 4 butterflies of 2 consecutive steps of the FFT.

We can then build a recursive algorithm upon this atomic operation. Of course, since we perform 2 steps at a time, the number of steps in the recursion is reduced by a factor of 2, and we have to handle separately the last step when the FFT level k is odd.

In the literature, the main interest for higher radix transforms comes from the fact that the number of operations is reduced for a transform of complex numbers (this is done by exhibiting a free multiplication by i). In our case, the number of operations remains the same. However, in the atomic block each input is used in two butterflies, so that the number of cache misses is less than 50%, just as for the Belgian approach. Furthermore, with the recursive structure, just as for the classical recursive FFT, at some point we deal with a number of inputs which is small enough so that everything fits in the cache.

We have tested this approach, and this was faster than the Belgian transform by a few percent.

The next step after radix 4 is radix 8 which works in the same spirit, but grouping 3 levels at a time. We have also implemented it, but this saved nothing, and was even sometimes slower than the radix 4 approach. Our explanation is that for small numbers, radix 4 is close to optimal with respect to cache locality, and for large numbers, the number of coefficients that fit in the cache is rather small and we have misses inside the atomic block of 12 butterflies. Further investigation is needed to validate this explanation.

More generally, radix 2^t groups t levels together, with a total of t 2^{t−1} butterflies, over 2^t residues. If all those residues fit in the cache, the cache miss rate is less than 1/t. Thus the optimal strategy seems to choose for t the largest integer such that 2^t n bits fit in the cache (either L1 or L2, in fact the smallest cache where a single radix 2 butterfly fits); a sketch of this heuristic follows.
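That heuristic might be coded as follows (a sketch under our assumptions; in practice the radix would rather be found by tuning than computed):

    /* Largest t such that the 2^t residues of n bits touched by one
       radix-2^t block (2^t * n bits in total) fit in a cache of
       cache_bytes; t = 2 corresponds to the radix-4 transform. */
    unsigned long choose_radix_log (unsigned long n_bits,
                                    unsigned long cache_bytes)
    {
        unsigned long t = 1;
        while ((n_bits << (t + 1)) <= 8 * cache_bytes)
            t++;
        return t;
    }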
2.2.3 Bailey's 4-step Algorithm

The algorithm we describe in this section can be found in a paper by Bailey [3]. In there, the reader will find earlier references tracing back the original idea. For simplicity we stick to the "Bailey's algorithm" denomination.

A way of seeing Bailey's 4-step transform algorithm is as a radix-√K transform, where K = 2^k is the length of the input sequence. In other words, instead of grouping 2 steps as in radix-4, we group k/2 steps. To be more general, let us write k = k_1 + k_2, where k_1 and k_2 are to be thought as close to k/2, but this is not really necessary. Then Bailey's 4-step algorithm consists in the following phases:

    1. Perform 2^{k_2} transforms of length 2^{k_1};
    2. Multiply the data by weights;
    3. Perform 2^{k_1} transforms of length 2^{k_2}.

There are only three phases in this description. The fourth phase is usually some matrix transposition, but this is irrelevant in our case: the coefficients are large so that we keep a table of pointers to them, and this transposition is just pointer exchanges which are basically for free, and fit very well in the cache.

The second step involving weights is due to the fact that in the usual description of Bailey's 4-step algorithm, the transforms of length 2^{k_1} are exactly Fourier transforms, whereas the needed operation is a twisted Fourier transform where the roots of unity involved in the butterflies are different (since they involve a 2^k-th root of unity, whereas the classical transform of length 2^{k_1} involves a 2^{k_1}-th root of unity). In the classical FFT setting this is very interesting, since we can then reuse some small-dimension implementation that has been very well optimized. In our case, we have found it better to write separate code for this twisted FFT, so that we merge the first and second phases.

The interest of this way of organizing the computation is again not due to a reduction of the number of operations, since they are exactly the same as with the other FFT approaches mentioned above. The goal is to help locality. Indeed, assume that √K coefficients fit in the cache, then the number of cache misses is at most 2K, since each call to the internal FFT or twisted FFT operates on √K coefficients.

Of course we are interested in numbers for which √K coefficients do not fit in the L1 cache, but for all numbers we might want to multiply, they do fit in the L2 cache. Therefore the structure of the code follows the memory hierarchy: at the top level of Bailey's algorithm, we deal with the RAM vs L2 cache locality question, then in each internal FFT or twisted FFT, we can take care of the L2 vs L1 cache locality question. This is done by using the radix-4 variant inside our Bailey-algorithm implementation.

We have implemented this approach (with a threshold for activating Bailey's algorithm only for large sizes), and combined with radix-4, this gave us our best timings. We have also tried a higher dimensional transform, in particular 3 steps of size ∛K. This did not help for the sizes we considered.

2.2.4 Mixing Several Phases

Another way to improve locality is to mix different phases of the algorithm in order to do as much work as possible on the data while they are in the cache. An easy improvement in this spirit is to mix the pointwise multiplication and the backward transform, in particular when Bailey's algorithm is used. Indeed, after the two forward transforms have been computed, one can load the data corresponding to the first column, do the pointwise multiplication of its elements, and readily perform the small transform of this column. Then the data corresponding to the second column is loaded, multiplied and transformed, and so on. In this way, one saves one full pass on the data. Taking the idea one step further, assuming that the forward transform for the first input number has been done already (or that we are squaring one number), after performing the column-wise forward transform on the second number we can immediately do the point-wise multiply and the backward transform on the column, so saving another pass over memory.

Following this idea, we can also merge the "decompose" and "recompose" steps with the transforms, again to save a pass on the data. In the case of the "decompose" step, there is more to it since one can also save unnecessary copies by merging it with the first step of the forward transform.

The "decompose" step consists of cutting parts of M bits from the input numbers, then multiplying each part a_i by θ^i modulo 2^n + 1, giving a′_i. If one closely looks at the first FFT level, it will perform a butterfly between a′_i and a′_{i+K/2} with θ^{2i} as multiplier. This will compute a′_i + a′_{i+K/2} and a′_i − a′_{i+K/2}, and multiply the latter by θ^{2i}. It can be seen that the M non-zero bits from a′_i and a′_{i+K/2} do not overlap, thus no real addition or subtraction is required: the results a′_i + a′_{i+K/2} and a′_i − a′_{i+K/2} can be obtained with just copies and ones' complements. As a consequence, it should be possible to completely avoid the "decompose" step and the first FFT level, by directly starting from the second FFT level, which for instance will add a′_i + a′_{i+K/2} to (a′_j − a′_{j+K/2}) θ^{2j}; here the four operands a_i, a_{i+K/2}, a_j, a_{j+K/2} will be directly taken from the input integer a, and the implicit multiplier θ^{2j} will be used to know where to add or subtract a_j and a_{j+K/2}. This example illustrates the kind of savings obtained by avoiding trivial operations like copies and ones' complements, and furthermore improving the locality. This idea was not used in the results from §3.

2.3 Fermat and Mersenne Transforms

The reason why SSA uses negacyclic convolutions is because the algorithm can be used recursively: the "pointwise products" modulo 2^n + 1 can in turn be performed using the same algorithm, each one giving rise to K′ smaller pointwise products modulo 2^{n′} + 1. (In that case, n must satisfy an additional divisibility condition related to K′.) A drawback of this approach is that it requires a weighted transform, i.e., additional operations before the forward transforms and after the backward transform. However, if one looks carefully, power-of-two roots of unity are needed only at the "lower level", i.e., in R_n^+. Therefore one can replace R_N^+ by R_N^− — i.e., the ring of integers modulo 2^N − 1 — in the original algorithm, and replace the weighted transform by a classical cyclic convolution, to compute a product mod 2^N − 1. This works only at the top level of the algorithm, and not recursively. We call this a "Mersenne transform", whereas the original SSA performs a "Fermat transform".² This idea of using a Mersenne transform is already present in [4] where it is called "cyclic Schönhage-Strassen trick".

Despite the fact that it can be used at the top level only, the Mersenne transform is nevertheless very interesting for the following reasons:

• a Mersenne transform modulo 2^N − 1, combined with a Fermat transform modulo 2^N + 1 and CRT reconstruction (sketched after this section), can be used to compute a product of two N-bit integers;

• a Mersenne transform can use a larger FFT length K = 2^k than the corresponding Fermat transform. Indeed, while K must divide N for the Fermat transform, so that θ = 2^{N/K} is a power of two, it only needs to divide 2N for the Mersenne transform, so that ω = 2^{2N/K} is a power of two. This improves the efficiency for K near √N, and enables one to use a value of K close to optimal. (The constraint on the FFT length can still be decreased by using the "√2 trick", see §2.4.)

The above idea can be generalized to a Fermat transform mod 2^{aN} + 1 and a Mersenne transform mod 2^{bN} − 1 for small integers a, b.

² In the whole paper, a Fermat transform, product, or scheme is meant modulo 2^N + 1, without N being necessarily a power of two as in Fermat numbers.
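The CRT reconstruction of the first bullet above is cheap: since 2^N + 1 ≡ 2 (mod 2^N − 1), dividing by 2^N + 1 modulo 2^N − 1 is just a multiplication by 2^{N−1}. Here is a sketch at the mpz level (our code and naming; the implementation works on limb arrays):

    #include <gmp.h>

    /* Recover x < 2^{2N} - 1 from r1 = x mod (2^N - 1) and
       r2 = x mod (2^N + 1):  x = r2 + (2^N + 1) t, with
       t = (r1 - r2) * 2^{N-1} mod (2^N - 1). */
    void crt_mersenne_fermat (mpz_t x, const mpz_t r1, const mpz_t r2,
                              unsigned long N)
    {
        mpz_t mN, t;
        mpz_init (mN);
        mpz_init (t);
        mpz_set_ui (mN, 1);
        mpz_mul_2exp (mN, mN, N);
        mpz_sub_ui (mN, mN, 1);       /* mN = 2^N - 1 */
        mpz_sub (t, r1, r2);
        mpz_mul_2exp (t, t, N - 1);   /* dividing by 2 = multiplying by 2^{N-1} */
        mpz_mod (t, t, mN);           /* t = (r1 - r2)/(2^N + 1) mod mN */
        mpz_mul_2exp (x, t, N);
        mpz_add (x, x, t);            /* x = (2^N + 1) t */
        mpz_add (x, x, r2);           /* x = r2 + (2^N + 1) t */
        mpz_clear (mN);
        mpz_clear (t);
    }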
Lemma 1. Let a, b be two positive integers. Then at least one of gcd(2^a + 1, 2^b − 1) and gcd(2^a − 1, 2^b + 1) is 1.

Proof. Let g = gcd(a, b), r = 2^g, a′ = a/g, b′ = b/g. Denote by ord_p(r) the multiplicative order of r (mod p). In the case of b′ odd, p | r^{b′} − 1 ⇒ ord_p(r) | b′ ⇒ 2 ∤ ord_p(r), and p | r^{a′} + 1 ⇒ ord_p(r) | 2a′ and ord_p(r) ∤ a′ ⇒ 2 | ord_p(r), hence no prime can divide both r^{b′} − 1 and r^{a′} + 1. In the other case of b′ even, a′ must be odd, and the same argument holds with the roles of a′ and b′ exchanged, so no prime can divide both r^{a′} − 1 and r^{b′} + 1.

It follows from Lemma 1 that we can use one Fermat transform of size aN (respectively bN) and one Mersenne transform of size bN (respectively aN). However this does not imply that the reconstruction is easy: in practice we used b = 1 and made only a vary (see §2.6.2).

2.4 The √2 Trick

Since all prime factors of 2^n + 1 are p ≡ 1 (mod 8) if 4 | n, 2 is a quadratic residue (mod 2^n + 1), and it turns out that √2 is of a simple enough form to make it useful as a root of unity with power-of-two order. Specifically, (2^{3n/4} − 2^{n/4})² ≡ 2 (mod 2^n + 1), which is easily checked by expanding the square. Hence we can use √2 = 2^{3n/4} − 2^{n/4} as a root of unity of order 2^{k+2} in the transform to double the possible transform length for a given n. In the case of the negacyclic transform, this allows a length 2^{k+1} transform, and √2 is used only in the weight signal. For a cyclic transform, √2 is used normally as a root of unity during the transform, allowing a transform length of 2^{k+2}. This idea is mentioned in [4, §9] where it is credited without reference to Schönhage, but we have been unable to track down the original source. In our implementation, this √2 trick saved roughly 10% on the total time of integer multiplication; a sketch of the multiplication by √2 is given at the end of this subsection.

Unfortunately, using higher roots of unity for the transform is not feasible, as prime divisors of 2^n + 1 are not necessarily congruent to 1 (mod 2^{k+3}); deciding whether they are or not requires factoring 2^n + 1, and even if they are, as in the case of the eighth Fermat number F_8 = 2^{256} + 1 [8], there does not seem to be a simple form for ⁴√2 which would make it useful as a root of unity in the transform.
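The multiplication by √2 itself costs only two shifts and a subtraction modulo 2^n + 1. A sketch at the mpz level (ours; the real code stays at the mpn level and merges the reduction):

    #include <gmp.h>

    /* r <- x * sqrt(2) mod Fn, with Fn = 2^n + 1, 4 | n, and
       sqrt(2) = 2^{3n/4} - 2^{n/4}.  Indeed (2^{3n/4} - 2^{n/4})^2 =
       2^{3n/2} - 2^{n+1} + 2^{n/2} ≡ -2^{n/2} + 2 + 2^{n/2} = 2 (mod Fn). */
    void mul_sqrt2 (mpz_t r, const mpz_t x, unsigned long n, const mpz_t Fn)
    {
        mpz_t t;
        mpz_init (t);
        mpz_mul_2exp (t, x, n / 4);        /* t = x * 2^{n/4}  */
        mpz_mul_2exp (r, x, 3 * (n / 4));  /* r = x * 2^{3n/4} */
        mpz_sub (r, r, t);
        mpz_mod (r, r, Fn);
        mpz_clear (t);
    }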
2.5 Harley's and Granlund's Tricks

Rob Harley [13] suggested the following trick³ to improve the efficiency of a given FFT scheme. Assume 2M + k is just above an integer multiple of K, say λK. Then we have to use n = (λ + 1)K, which gives an efficiency of only about λ/(λ + 1). Harley's idea is to use n = λK instead, and recover the missing information from a CRT-reconstruction with an additional computation modulo the machine word 2^w.

A drawback of Harley's trick is that when only a few bits are missing, the K² word products may become relatively expensive. In that case, we can multiply A(x) and B(x) over Z[x] modulo a small power of 2 using the segmentation method. That way, if h ≤ w bits are missing, one trades K² word products for the product of two large integers of (2h + k)K/w words, which can in turn use fast multiplication.⁴

Torbjörn Granlund [12] found that this idea — combining computations mod 2^n + 1 with computations mod 2^h — can also be used at the top-level for the plain integer multiplication, and not only at the lower-level as in Harley's trick. Assume one wants to multiply two integers u and v whose product has m bits, where m is just above an "optimal" Fermat scheme (2^N + 1, K), say m = N + h. Then first compute uv mod (2^N + 1), and second compute uv mod 2^h, by simply computing the plain integer product (u mod 2^h)(v mod 2^h), again possibly in turn with fast multiplication. The exact value of uv can be efficiently reconstructed by CRT from both values. We call this idea "Granlund's trick"; its CRT step is sketched at the end of this subsection.

Let us denote M(N), M⁺(N) and M⁻(N) the cost of the multiplication of two N-bit integers, multiplication modulo 2^N + 1 and multiplication modulo 2^N − 1 respectively. Granlund's trick can be written M(N + h) = M⁺(N) + M(h), or M(N + h) = M⁺(N) + M⁺(2h) if one reduces the plain product modulo 2^h to a modular product modulo 2^{2h} + 1. Marco Bodrato (personal communication) discovered that Granlund's trick can be applied simultaneously to the low and high ends of the product, giving M(N + h) = M⁺(N) + 2M(h/2).

We use neither Harley's nor Granlund's trick in our current implementation. We believe Granlund's trick is less efficient than the generalized Fermat-Mersenne scheme we propose (§2.3), which yields M(a + b) = M⁺(a) + M⁻(b), with M⁻(N) the cost of a multiplication modulo 2^N − 1, if a good efficiency is possible for M⁺(a) and M⁻(b). As for Harley's trick, we tried it only at the word level, i.e., for λK < 2M + k ≤ λK + w, which happens in rare cases only, and it made little difference. However, when multiplying two numbers modulo a Fermat number 2^{2^n} + 1, Harley's trick becomes very attractive, since 2M is a power of two in that case.

³ Bernstein attributes a similar idea to Karp in [4, §9].
⁴ Harley's trick extends to h > w: together with the segmentation method, the exact same reasoning holds.
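To illustrate the reconstruction in Granlund's trick: given r1 = uv mod (2^N + 1) and r2 = uv mod 2^h with uv < 2^{N+h}, one CRT step recovers uv. A sketch with mpz (our code; the inverse exists since 2^N + 1 is odd, and x must be distinct from r1 and r2):

    #include <gmp.h>

    /* x = u*v < 2^{N+h} from r1 = x mod (2^N + 1) and r2 = x mod 2^h. */
    void crt_granlund (mpz_t x, const mpz_t r1, const mpz_t r2,
                       unsigned long N, unsigned long h)
    {
        mpz_t F, Mh, t;
        mpz_init (F);
        mpz_init (Mh);
        mpz_init (t);
        mpz_set_ui (F, 1);
        mpz_mul_2exp (F, F, N);
        mpz_add_ui (F, F, 1);         /* F = 2^N + 1  */
        mpz_set_ui (Mh, 1);
        mpz_mul_2exp (Mh, Mh, h);     /* Mh = 2^h     */
        mpz_invert (t, F, Mh);        /* t = F^{-1} mod 2^h */
        mpz_sub (x, r2, r1);
        mpz_mul (t, t, x);
        mpz_mod (t, t, Mh);           /* t = (r2 - r1)/F mod 2^h */
        mpz_mul (x, F, t);
        mpz_add (x, x, r1);           /* x = r1 + F t */
        mpz_clear (F);
        mpz_clear (Mh);
        mpz_clear (t);
    }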
2.6 Improved Tuning

We found that significant speedups could be obtained with better tuning schemes, which we describe here. All examples given in this section are related to an Opteron.

2.6.1 Tuning the Fermat and Mersenne Transforms

Until version 4.2.1, GMP used a naive tuning scheme for the FFT multiplication. For the Fermat transforms modulo 2^N + 1, an FFT of length 2^k was used for t_k ≤ N < t_{k+1}, where t_k is the smallest bit-size for which FFT-2^k is faster than FFT-2^{k−1}. For example on an Opteron, the default gmp-mparam.h file uses k = 4 for a size less than 528 machine words, then k = 5 for less than 1184 words, and so on:

    #define MUL_FFT_TABLE { 528, 1184, 2880, 5376, 11264,
      36864, 114688, 327680, 1310720, 3145728, 12582912, 0 }

A special rule is used for the last entry: here k = 14 is used for less than m = 12582912 words, k = 15 is used for less than 4m = 50331648 words, and then k = 16 is used. An additional single threshold determines from which size upward — still in words — a Fermat transform mod 2^n + 1 is faster than a full product of two n-bit integers:

    #define MUL_FFT_MODF_THRESHOLD 544

For a product mod 2^n + 1 of at least 544 words, GMP 4.2.1 therefore uses a Fermat transform, with k = 5 until 1183 words according to the above MUL_FFT_TABLE. Below the 544 words threshold, the algorithm used is the 3-way Toom-Cook algorithm, followed by a reduction mod 2^n + 1.

This scheme is clearly not optimal since the FFT-2^k curves intersect several times, as shown by Figure 2.

Figure 2: Time in milliseconds needed to multiply numbers modulo 2^n + 1 with an FFT of length 2^k for k = 5, 6, 7. On the right, the zoom (with only k = 5, 6) illustrates that two curves can intersect several times. [The plots are not reproduced here.]

To take into account those multiple crossings, the new tuning scheme determines word-intervals [m_1, m_2] where the FFT of length 2^k is preferred for Fermat transforms:

    #define MUL_FFT_TABLE2 {{1, 4 /*66*/}, {401, 5 /*96*/},
      {417, 4 /*98*/}, {433, 5 /*96*/}, {865, 6 /*96*/}, ...

The entry {433, 5 /*96*/} means that from 433 words — and up to the next size of 865 words — FFT-2^5 is preferred, with an efficiency of 96%. A similar table is used for Mersenne transforms; the lookup this table induces is sketched below.
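The lookup is a simple scan for the last entry at or below the operand size. A sketch using the table values quoted above (our rendering, not GMP's actual data structure):

    #include <gmp.h>

    struct fft_entry { mp_size_t size; int k; };  /* {first size, k} */
    static const struct fft_entry mul_fft_table2[] =
      { {1, 4}, {401, 5}, {417, 4}, {433, 5}, {865, 6} /* , ... */ };

    /* FFT length 2^k preferred for a Fermat transform of pl words. */
    int fft_k_lookup (mp_size_t pl)
    {
        int i = sizeof (mul_fft_table2) / sizeof (mul_fft_table2[0]) - 1;
        while (i > 0 && mul_fft_table2[i].size > pl)
            i--;
        return mul_fft_table2[i].k;
    }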
2.6.2 Tuning the Plain Integer Multiplication

Up to GMP 4.2.1, a single threshold controls the plain integer multiplication:

    #define MUL_FFT_THRESHOLD 7680

This means that SSA is used for a product of two integers of at least 7680 words, which corresponds to about 148,000 decimal digits, and the Toom-Cook 3-way algorithm is used below that threshold.

We now use the generalized Fermat-Mersenne scheme described in §2.3 with b = 1 (in our implementation we found 1 ≤ a ≤ 7 was enough). Again, for each size, the best value of a is determined by our tuning program:

    #define MUL_FFT_FULL_TABLE2 {{16, 1}, {4224, 2},
      {4416, 6}, {4480, 2}, {4608, 4}, {4640, 2}, ...

For example, the entry {4608, 4} means that to multiply two numbers of 4608 words — or whose product has 2×4608 words — the new algorithm uses one Mersenne transform modulo 2^N − 1 and one Fermat transform modulo 2^{4N} + 1. Reconstruction is easy since 2^{aN} + 1 ≡ 2 mod (2^N − 1); the corresponding CRT is written out below.
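Indeed, the same two-line CRT as in §2.3 applies in this generalized setting, because 2^{aN} + 1 ≡ 2 (mod 2^N − 1). With r_1 = x mod (2^N − 1) and r_2 = x mod (2^{aN} + 1), one may write (our formulation):

    t \equiv (r_1 - r_2)\, 2^{N-1} \pmod{2^N - 1}, \qquad
    x = r_2 + (2^{aN} + 1)\, t .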
3. RESULTS AND CONCLUSION

On July 1st, 2005, Allan Steel wrote a web page [20] entitled "Magma V2.12-1 is up to 2.3 times faster than GMP 4.1.4 for large integer multiplication". This was actually our first motivation for improving GMP's implementation.

Magma V2.13-6 takes 2.22s to multiply two numbers of 784141 words, whereas our GMP development code takes only 0.96s. Thus our GMP-based code is clearly faster than Magma by a factor of 2.3. Note that this does not mean that we have gained a factor 2.3² = 5.29 over GMP 4.1.4. In both cases, 2.3 is the maximal ratio between Magma V2.12-1 and GMP 4.1.4, and between our code and Magma V2.13-6 respectively, following the well known "benchmarketing" strategy⁵ (both versions of Magma give very similar timings).

Figure 3: Comparison of GMP 4.1.4, GMP 4.2.1, Magma V2.13-6 and our new code for the plain integer multiplication on a 2.4 GHz Opteron (horizontal axis in 64-bit words, vertical axis in seconds). [The plot is not reproduced here.]

We have tested other freely available packages providing an implementation for large integer arithmetic. Among them, some (OpenSSL/BN, LiDiA/libI) do not go beyond Karatsuba algorithm; some do have some kind of FFT, but are not really made for really large integers: arprec, Miracl. Two useful implementations we have tested are apfloat and CLN. They take about 4 to 5 seconds on our test machine to multiply one million-word integers, whereas we need about 1 second. Bernstein mentions some partial implementation Zmult of Schönhage-Strassen's algorithm, with good timings, but right now, only very few sizes are handled, so that the comparison with our software is not really possible.

A program that implements a complex floating-point FFT for integer multiplication is George Woltman's Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a·2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^{2wn} − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our implementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron. Some differences between Prime95 and our implementation need to be pointed out: due to the floating point nature of Prime95's FFT, rounding errors can build up for particular input data to the point where the results are incorrectly rounded to integers. The floating point FFT can be made provably correct, see again [17], but at the cost of using larger FFT lengths. For example, for a length 2^25 FFT, [17] allows 9 bits per double, whereas Prime95 uses up to 17.76. To eliminate any chance of fatal round-off error, the transform length and hence run-time would need to be about doubled. Also, the implementation of the FFT in Prime95 is done in hand-optimized assembly for the x86 family of processors, and will not run on other architectures.

⁵ The word "benchmarketing" has been suggested to us by Torbjörn Granlund.
Another implementation of complex floating point FFT is Guillermo Ballester Valor's Glucas. The algorithm it uses is similar to that in Prime95, but it is written portably in C. This makes it slower than Prime95, but still faster than our code on both the Pentium 4 and the Opteron, as shown in Figure 4.

Figure 4: Time in seconds for multiplication of different word lengths with our implementation, Prime95 and Glucas on a 3.2 GHz Pentium 4 and a 2.4 GHz Opteron. [The plot is not reproduced here.]

Acknowledgments.
This work was done in collaboration with Torbjörn Granlund, during his visits as invited professor at INRIA Lorraine in March-April and November-December 2006; we also thank him for proof-reading this paper. This work would probably not have been achieved without the initial stimulation from Allan Steel; the authors are very grateful to him. Thanks to Markus Hegland for pointing to the Belgian paper, and to William Hart for finding a typo in a preliminary version. We also thank the anonymous referees for many valuable remarks on the manuscript.

4. REFERENCES

[1] Arndt, J. Algorithms for programmers (working title). Draft version of 2007-January-05, http://www.jjj.de/fxt/.
[2] Bailey, D. The computation of π to 29,360,000 decimal digits using Borwein's quartically convergent algorithm. Math. Comp. 50 (1988), 283–296.
[3] Bailey, D. FFTs in external or hierarchical memory. J. Supercomputing 4 (1990), 23–35.
[4] Bernstein, D. J. Multidigit multiplication for mathematicians. http://cr.yp.to/papers.html#m3, 2001.
[5] Bernstein, D. J. Fast multiplication and its applications. http://cr.yp.to/papers.html#multapps, 2004.
[6] Bernstein, D. J. Removing redundancy in high-precision Newton iteration. http://cr.yp.to/fastnewton.html, 2004.
[7] Brent, R. P. Multiple-precision zero-finding methods and the complexity of elementary function evaluation. In Analytic Computational Complexity (1975), J. F. Traub, Ed., Academic Press, pp. 151–176.
[8] Brent, R. P., and Pollard, J. M. Factorization of the eighth Fermat number. Math. Comp. 36 (1981), 627–630.
[9] Brockmeyer, E., Ghez, C., D'Eer, J., Catthoor, F., and Man, H. D. Parametrizable behavioral IP module for a data-localized low-power FFT. In Proc. IEEE Workshop on Signal Processing Systems (SIPS) (1999), IEEE Press, pp. 635–644.
[10] Crandall, R., and Fagin, B. Discrete weighted transforms and large-integer arithmetic. Math. Comp. 62, 205 (1994), 305–324.
[11] Crandall, R., and Pomerance, C. Prime Numbers: A Computational Perspective. Springer-Verlag, 2000.
[12] Granlund, T. Personal communication, Dec. 2006.
[13] Harley, R. Personal communication, Jan. 2000.
[14] Karp, A. H., and Markstein, P. High-precision division and square root. ACM Trans. Math. Softw. 23, 4 (1997), 561–589.
[15] Knuth, D. The analysis of algorithms. In Actes du Congrès International des Mathématiciens de 1970 (1971), vol. 3, Gauthiers-Villars, pp. 269–274.
[16] Moenck, R., and Borodin, A. Fast modular transforms via division. In Proceedings of the 13th Annual IEEE Symposium on Switching and Automata Theory (Oct. 1972), pp. 90–96.
[17] Percival, C. Rapid multiplication modulo the sum and difference of highly composite numbers. Math. Comp. 72, 241 (2003), 387–395.
[18] Schönhage, A. Schnelle Berechnung von Kettenbruchentwicklungen. Acta Inform. 1 (1971), 139–144.
[19] Schönhage, A., and Strassen, V. Schnelle Multiplikation großer Zahlen. Computing 7 (1971), 281–292.
[20] Steel, A. Magma V2.12-1 is up to 2.3 times faster than GMP 4.1.4 for large integer multiplication. http://magma.maths.usyd.edu.au/users/allan/intmult.html, July 2005.
[21] Steel, A. Reduce everything to multiplication. Computing by the Numbers: Algorithms, Precision, and Complexity, Workshop for Richard Brent's 60th birthday, Berlin, July 2006. http://www.mathematik.hu-berlin.de/~gaggle/EVENTS/2006/BRENT60/.
[22] van der Hoeven, J. The truncated Fourier transform and applications. In Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation (ISSAC) (2004), J. Gutierrez, Ed., pp. 290–296.
[23] von zur Gathen, J., and Gerhard, J. Modern Computer Algebra. Cambridge University Press, 1999.
[24] Woltman, G., and Kurowski, S. The Great Internet Mersenne Prime Search. http://www.gimps.org/.
[25] Zimmermann, P., and Dodson, B. 20 years of ECM. In Proceedings of the 7th Algorithmic Number Theory Symposium (ANTS VII) (2006), F. Hess, S. Pauli, and M. Pohst, Eds., vol. 4076 of Lecture Notes in Comput. Sci., Springer-Verlag, pp. 525–542.
