Even Faster Integer Multiplication: July 11, 2014
Even Faster Integer Multiplication: July 11, 2014
Even Faster Integer Multiplication: July 11, 2014
David Harvey
School of Mathematics and Statistics
University of New South Wales
Sydney NSW 2052
Australia
Email: [email protected]
Joris van der Hoeven
a
, Grgoire Lecerf
b
CNRS, Laboratoire dinformatique
cole polytechnique
91128 Palaiseau Cedex
France
a. Email: [email protected]
b. Email: [email protected]
July 11, 2014
We give a new proof of Frers bound for the cost of multiplying n-bit integers in the bit
complexity model. Unlike Frer, our method does not require constructing special coecient
rings withfast roots of unity. Moreover, we prove the more explicit bound O(n logn K
log
n
)
with K =8. We show that an optimised variant of Frers algorithm achieves only K =16,
suggesting that the newalgorithmis faster than Frers by a factor of 2
log
n
. Assuming stan-
dard conjectures about the distribution of Mersenne primes, we give yet another algorithm
that achieves K=4.
Keywords: Integer multiplication, algorithm, complexity bound, FFT
A.C.M. subject classification: G.1.0 Computer-arithmetic, F.2.1 Number-theoretic com-
putations
A.M.S. subject classification: 68W30, 68Q17, 68W40
1. Introduction
Let I(n) denote the cost of multiplying two n-bit integers in the deterministic multitape Turing
model [38] (commonly called bit complexity). Previously, the best known asymptotic bound for
I(n) was due to Frer [18, 19]. He proved that there is a constant K >1 such that
I(n) = O(nlog nK
log
n
), (1.1)
where log
x := min k N: log
k
x 61, (1.2)
log
k
:= log
k
log.
The main contribution of this paper is a new algorithm that yields the following improvement.
Theorem 1.1. For n we have
I(n) = O(nlog n8
log
n
).
Frer suggested several methods to minimise the value of K in his algorithm, but did not give
an explicit bound for K. In section 7 of this paper, we outline an optimised variant of Frers
algorithm that achieves K = 16. We do not know how to obtain K <16 using Frers approach.
This suggests that the new algorithm is faster than Frers by a factor of 2
log
n
.
The idea of the new algorithm is remarkably simple. Given two n-bit integers, we split them
into chunks of exponentially smaller size, say around log n bits, and thus reduce to the problem
of multiplying integer polynomials of degree O(n/log n) with coecients of bit size O(log n). We
multiply the polynomials using discrete Fourier transforms (DFTs) over C, with a working precision
of O(logn) bits. To compute the DFTs, we decompose them into short transforms of exponentially
smaller length, say length around log n, using the CooleyTukey method. We then use Bluesteins
chirp transform to convert each short transform into a polynomial multiplication problem over C,
and nally convert back to integer multiplication via Kronecker substitution. These much smaller
integer multiplications are handled recursively.
1
The algorithm just sketched leads immediately to a bound of the form (1.1). A detailed proof
is given in section 4. We emphasise that the new method works directly over C, and does not need
special coecient rings with fast roots of unity, of the type constructed by Frer. Optimising
parameters and keeping careful track of constants leads to Theorem 1.1, which is proved in sec-
tion 6. We also prove the following conditional result in section 9.
Theorem 1.2. Assume Conjecture 9.1. Then
I(n) = O(nlog n4
log
n
).
Conjecture 9.1 is a slight weakening of the LenstraPomeranceWagsta conjecture on the
distribution of Mersenne primes, i.e., primes of the form p =2
q
1. The idea of the algorithm is
to replace the coecient ring C by the nite eld F
p
[i]; we are then able to exploit fast algorithms
for multiplication modulo numbers of the form 2
q
1.
An important feature of the new algorithms is that the same techniques are applicable in other
contexts, such as polynomial multiplication over nite elds. Previously, no Frer-type complexity
bounds were known for the latter problem. The details are presented in the companion paper [24].
In the remainder of this section, we present a brief history of complexity bounds for integer
multiplication, and we give an overview of the paper and of our contribution. More historical details
can be found in books such as [21, Chapter 8].
1.1. Brief history and related work
Multiplication algorithms of complexity O(n
2
) in the number of digits n were already known in
ancient civilisations. The Egyptians used an algorithm based on repeated doublings and additions.
The Babylonians invented the positional numbering system, while performing their computations
in base 60 instead of 10. Precise descriptions of multiplication methods close to the ones that we
learn at school appeared in Europe during the late Middle Ages. For historical references, we refer
to [49, Section II.5] and [37, 5].
The rst subquadratic algorithm for integer multiplication, with complexity O(n
log 3/log 2
), was
discovered by Karatsuba [30, 31]. From a modern viewpoint, Karatsubas algorithm utilises an
evaluation-interpolation scheme. The input integers are cut into smaller chunks, which are taken to
be the coecients of two integer polynomials; the polynomials are evaluated at several well-chosen
points; their values at those points are (recursively) multiplied; interpolating the results at those
points yields the product polynomial; nally, the integer product is recovered by pasting together
the coecients of the product polynomial. This cutting-and-pasting procedure is sometimes known
as Kronecker segmentation (see section 2.6).
Shortly after the discovery of Karatsubas algorithm, which uses three evaluation points, Toom
generalised it so as to use 2 r 1 evaluation points instead [51, 50], for any r > 2. This leads to
the bound I(n) =O(n
log (2r1)/log r
) for xed r. Letting r grow slowly with n, he also showed that
I(n) =O(n 2
5 log n/log 2
p
). The algorithm was adapted to the Turing model by Cook [10] and is now
known as ToomCook multiplication. Schnhage obtained a slightly better bound [45] by working
modulo several numbers of the form 2
k
1 instead of using several polynomial evaluation points.
Knuth proved that an even better complexity bound could be achieved by suitably adapting Tooms
method [32].
The next step towards even faster integer multiplication was the rediscovery of the fast Fourier
transform (FFT) by Cooley and Tukey [11] (essentially the same algorithm was already known
to Gauss [27]). The FFT yields particularly ecient algorithms for evaluating and interpolating
polynomials on certain special sets of evaluation points. For example, if R is a ring in which 2 is
invertible, and if R is a principal 2
k
-th root of unity (see section 2.2 for detailed denitions),
then the FFT permits evaluation and interpolation at the points 1, , ...,
2
k
1
using only O(k 2
k
)
ring operations in R. Consequently, if P and Q are polynomials in R[X] whose product has degree
less than 2
k
, then the product PQ can be computed using O(k 2
k
) ring operations as well.
2 Even faster integer multiplication
In [47], Schnhage and Strassen presented two FFT-based algorithms for integer multiplication.
In both algorithms, they rst use Kronecker segmentation to convert the problem to multiplication
of integer polynomials. They then embed these polynomials into R[X] for a suitable ring R and
multiply the polynomials by using FFTs over R. The rst algorithm takes R = C and =
exp(2 p i /2
k
), and works with nite-precision approximations to elements of C. Multiplications
in C itself are handled recursively, by treating them as integer multiplications (after appropriate
scaling). The second algorithm, popularly known as the SchnhageStrassen algorithm, takes
R=Z/mZ where m=2
2
k
+1 is a Fermat number. This algorithm is the faster of the two, achieving
the bound I(n) = O(n log n log log n). It benets from the fact that = 2 is a principal 2
k+1
-th
root of unity in R, and that multiplications by powers of can be carried out eciently, as they
correspond to simple shifts and negations. At around the same time, Pollard pointed out that one
can also work with R=Z/mZ where m is a prime of the form m=a 2
k
+1, since then R
contains
primitive 2
k
-th roots of unity [39] (although he did not give a bound for I(n)).
Schnhage and Strassens algorithm remained the champion for more than thirty years, but was
recently superseded by Frers algorithm [18]. In short, Frer managed to combine the advantages
of the two algorithms from [47], to achieve the bound I(n) =O(n log n 2
O(log
n)
). Frers algorithm
is based on the ingenious observation that the ring R=C[X] /(X
2
r1
+1) contains a small number
of fast principal 2
r
-th roots of unity, namely the powers of X, but also a large supply of much
higher-order roots of unity inherited from C. To evaluate an FFT over R, he decomposes it into
many short transforms of length at most 2
r
, using the CooleyTukey method. He evaluates the
short transforms with the fast roots of unity, pausing occasionally to perform slow multiplications
by higher-order roots of unity (twiddle factors). A slightly subtle point of the construction is that
we really need, for large k, a principal 2
k
-th root of unity R such that
2
kr
=X.
In [15] it was shown that the technique from [39] to compute modulo suitable prime numbers
of the form m = a 2
k
+ 1 can be adapted to Frers algorithm. Although the complexity of this
algorithm is essentially the same as that of Frers algorithm, this method has the advantage that
it does not require any error analysis for approximate numerical operations in C.
Date Authors Time complexity
<3000 BC Unknown [37] O(n
2
)
1962 Karatsuba [30, 31] O(n
log 3/log 2
)
1963 Toom [51, 50] O(n2
5 log n/log 2
p
)
1966 Schnhage [45] O(n2
2log n/log 2
p
(log n)
3/2
)
1969 Knuth [32] O(n2
2log n/log 2
p
log n)
1971 SchnhageStrassen [47] O(nlog nlog log n)
2007 Frer [18] O
_
nlog n2
O(log
n)
_
2014 This paper O(nlog n8
log
n
)
Table 1.1. Historical overview of known complexity bounds for n-bit integer multiplication.
1.2. Our contributions and outline of the paper
Throughout the paper, integers are assumed to be handled in the standard binary representation.
For our computational complexity results, we assume that we work on a Turing machine with
a nite but suciently large number of tapes [38]. The Turing machine model is very conservative
with respect to the cost of memory access, which is pertinent from a practical point of view for
implementations of FFT algorithms. Nevertheless, other models for sequential computations could
be considered [46, 20]. For practical purposes, parallel models might be more appropriate, but we
will not consider these in this paper. Occasionally, for polynomial arithmetic over abstract rings,
we will also consider algebraic complexity measures [8, Chapter 4].
David Harvey, Joris van der Hoeven, Grgoire Lecerf 3
In section 2, we start by recalling several classical techniques for completeness and later use:
sorting and array transposition algorithms, discrete Fourier transforms (DFTs), the CooleyTukey
algorithm, FFT multiplication and convolution, Bluesteins chirp transform, and Kronecker sub-
stitution and segmentation. In section 3, we also provide the necessary tools for the error analysis
of complex Fourier transforms. Most of these tools are standard, although our presentation is
somewhat ad hoc, being based on xed point arithmetic.
In section 4, we describe a simplied version of the new integer multiplication algorithm,
without any attempt to minimise the aforementioned constant K. As mentioned in the sketch
above, the key idea is to reduce a given DFT over C to a collection of short transforms, and then
to convert these short transforms back to integer multiplication by a combination of Bluesteins
chirp transform and Kronecker substitution.
The complexity analysis of Frers algorithm and the algorithm from section 4 involves func-
tional inequalities which contain post-compositions with logarithms and other slowly growing
functions. In section 5, we present a few systematic tools for analysing these types of inequalities.
For more information on this quite particular kind of asymptotic analysis, we refer the reader
to [44, 16].
In section 6, we present an optimised version of the algorithm from section 4, proving in
particular the bound I(n) = O(n log n 8
log
n
) (Theorem 1.1), which constitutes the main result
of this paper. In section 7, we outline a similar complexity analysis for Frers algorithm. Even
after several optimisations of the original algorithm, we were unable to attain a bound better than
I(n) =O(nlog n16
log
n
). This suggests that the new algorithm outperforms Frers algorithm by
a factor of 2
log
n
.
This speedup is surprising, given that the short transforms in Frers algorithm involve only
shifts, additions and subtractions. The solution to the paradox is that Frer has made the short
transforms too fast . Indeed, they are so fast that they make a negligible contribution to the overall
complexity, and his computation is dominated by the slow twiddle factor multiplications. In the
new algorithm, we push more work into the short transforms, allowing them to get slightly slower;
the quid pro quo is that we avoid the factor of two in zero-padding caused by Frers introduction
of articial fast roots of unity. The optimal strategy is actually to let the short transforms
dominate the computation, by increasing the short transform length relative to the coecient size.
Frer is unable to do this, because in his algorithm these two parameters are too closely linked.
To underscore just how far the situation has been inverted relative to Frers algorithm, we point
out that in our presentation we can get away with using SchnhageStrassen for the twiddle factor
multiplications, without any detrimental eect on the overall complexity.
We have chosen to base most of our algorithms on approximate complex arithmetic. Instead,
following [39] and [15], we might have chosen to use modular arithmetic. In section 8, we will briey
indicate how our main algorithm can be adapted to this setting. This variant of our algorithm
presents several analogies with its adaptation to polynomial multiplication over nite elds [24].
The question remains whether there exists an even faster algorithm than the algorithm of
section 6. In an earlier paper [17], Frer gave another algorithm of complexity O(n log n 2
O(log
n)
)
under the assumption that there exist suciently many Fermat primes, i.e., primes of the form
F
m
= 2
2
m
+ 1. It can be shown that a careful optimisation of this algorithm yields the bound
I(n) = O(n log n 4
log
n
). Unfortunately, odds are high that F
4
is the largest Fermat prime. In
section 9, we present an algorithm that achieves the bound I(n) =O(n log n 4
log
n
) under the more
plausible conjecture that there exist suciently many Mersenne primes (Theorem 1.2). The main
technical ingredient is a variant of an algorithm of Crandall and Fagin [12] that permits ecient
multiplication modulo 2
q
1, despite q not being divisible by a large power of two.
It would be interesting to know whether the new algorithms could be useful in practice. We
have implemented an unoptimised version of the algorithm from section 8 in the Mathemagix
system [29] and found our implementation to be an order of magnitude slower than the Gmp
library [23]. There is certainly room for improvement, but we doubt that even a highly optimised
implementation of the new algorithm will be competitive in the near future. Nevertheless, the
variant for polynomial multiplication over nite elds presented in [24] seems to be a promising
avenue for achieving speedups in practical computations. This will be investigated in a forthcoming
paper.
4 Even faster integer multiplication
Notations. We use Hardys notations f g for f =o(g), and f g for f =O(g) and g =O(f).
The symbol R
>
denotes the set of non-negative real numbers, and N denotes 0, 1, 2, .... We will
write lg n:=,log n/log 2|.
2. Survey of classical tools
This section recalls basic facts on Fourier transforms and related techniques used in subsequent
sections. For more details and historical references we refer the reader to standard books on the
subject such as [2, 8, 21, 42].
2.1. Arrays and sorting
In the Turing model, we have available a xed number of linear tapes. An n
1
n
d
array
M
i1,...,id
of b-bit elements is stored as a linear array of n
1
n
d
b bits. We generally assume that the
elements are ordered lexicographically by (i
1
, ..., i
d
), though this is just an implementation detail.
What is signicant from a complexity point of view is that occasionally we must switch repre-
sentations, to access an array (say 2-dimensional) by rows or by columns. In the Turing model,
we may transpose an n
1
n
2
matrix of b-bit elements in time O(b n
1
n
2
lg min (n
1
, n
2
)), using the
algorithm of [4, Appendix]. Briey, the idea is to split the matrix into two halves along the short
dimension, and transpose each half recursively.
We will also require more complex rearrangements of data, for which we resort to sorting.
Suppose that X is a totally ordered set, whose elements are represented by bit strings of length b,
and suppose that we can compare elements of X in time O(b). Then an array of n elements of X
may be sorted in time O(b n lg n) using merge sort [33], which can be implemented eciently on
a Turing machine.
2.2. Discrete Fourier transforms
Let R be a commutative ring with identity and let n>1. An element R is said to be a principal
n-th root of unity if
n
=1 and
k=0
n1
(
i
)
k
=0 (2.1)
for all i 1, ..., n1. In this case, we dene the discrete Fourier transform (or DFT) of an n-tuple
a =(a
0
, ..., a
n1
) R
n
with respect to to be DFT
(a) =a=(a
0
, ..., a
n1
) R
n
where
a
i
:= a
0
+a
1
i
+ +a
n1
(n1)i
.
That is, a
i
is the evaluation of the polynomial A(X) :=a
0
+a
1
X + +a
n1
X
n1
at
i
.
If is a principal n-th root of unity, then so is its inverse
1
=
n1
, and we have
DFT
1(DFT
(a)) = na.
Indeed, writing b :=DFT
1(DFT
j=0
n1
a
j
ji
=
j=0
n1
k=0
n1
a
k
j(ki)
=
k=0
n1
a
k
j=0
n1
j(ki)
=
k=0
n1
a
k
(n
i,k
) =na
i
,
where
i,k
=1 if i =k and
i,k
=0 otherwise.
Remark 2.1. In all of the new algorithms introduced in this paper, we actually work over a eld,
whose characteristic does not divide n. In this setting, the concept of principal root of unity
coincides with the more familiar primitive root of unity. The more general principal root concept
is only needed for discussions of other algorithms, such as the SchnhageStrassen algorithm or
Frers algorithm.
David Harvey, Joris van der Hoeven, Grgoire Lecerf 5
2.3. The CooleyTukey FFT
Let be a principal n-th root of unity and let n=n
1
n
2
where 1<n
1
<n. Then
n1
is a principal
n
2
-th root of unity and
n2
is a principal n
1
-th root of unity. Moreover, for any i
1
0, ..., n
1
1
and i
2
0, ..., n
2
1, we have
a
i1n2+i2
=
k
1
=0
n11
k
2
=0
n21
a
k2n1+k1
(k2n1+k1)(i1n2+i2)
=
k
1
=0
n11
k1i2
_
k
2
=0
n21
a
k2n1+k1
(
n1
)
k2i2
_
(
n2
)
k1i1
. (2.2)
If /
1
and /
2
are algorithms for computing DFTs of length n
1
and n
2
, we may use (2.2) to construct
an algorithm /
1
/
2
for computing DFTs of length n as follows.
For each k
1
0, ..., n
1
1, the sum inside the brackets corresponds to the i
2
-th coecient
of a DFT of the n
2
-tuple (a
0n
1
+k
1
, ..., a
(n21)n1+k1
) R
n2
with respect to
n1
. Evaluating these
inner DFTs requires n
1
calls to /
2
. Next, we multiply by the twiddle factors
k1i2
, at a cost of n
operations in R. (Actually, fewer than n multiplications are required, as some of the twiddle factors
are equal to 1. This optimisation, while important in practice, has no asymptotic eect on the
algorithms discussed in this paper.) Finally, for each i
2
0, ..., n
2
1, the outer sum corresponds
to the i
1
-th coecient of a DFT of an n
1
-tuple in R
n1
with respect to
n2
. These outer DFTs
require n
2
calls to /
1
.
Denoting by F
R
(n) the number of ring operations needed to compute a DFT of length n, and
assuming that we have available a precomputed table of twiddle factors, we obtain
F
R
(n
1
n
2
) 6 n
1
F
R
(n
2
) +n
2
F
R
(n
1
) +n.
For a factorisation n=n
1
n
d
, this yields recursively
F
R
(n) 6
i=1
d
n
n
i
F
R
(n
i
) +(d 1) n. (2.3)
The corresponding algorithm is denoted /
1
/
d
. The operation is neither commutative
nor associative; the above expression will always be taken to mean (((/
1
/
2
) /
3
) ) /
d
.
Let B be the buttery algorithm that computes a DFT of length 2 by the formula (a
0
, a
1
)
(a
0
+a
1
, a
0
a
1
). Then B
k
:=B B computes a DFT of length n:=2
k
in time F
R
(2
k
) =O(k n).
Algorithms of this type are called fast Fourier transforms (or FFTs).
The above discussion requires several modications in the Turing model. Assume that elements
of R are represented by b bits.
First, for /
1
/
2
, we must add a rearrangement cost of O(b n lg min (n
1
, n
2
)) to eciently
access the rows and columns for the recursive subtransforms (see section 2.1). For the general case
/
1
/
d
, the total rearrangement cost is bounded by O(
i
b nlg n
i
) =O(b nlg n).
Second, we will sometimes use non-algebraic algorithms to compute the subtransforms, so it
may not make sense to express their cost in terms of F
R
. The relation (2.3) therefore becomes
F(n) 6
i=1
d
n
n
i
F(n
i
) +(d 1) nm
R
+O(b nlg n), (2.4)
where F(n) is the (Turing) cost of a transform of length n over R, and where m
R
is the cost of
a single multiplication in R.
Finally, we point out that /
1
/
2
requires access to a table of twiddle factors
i1i2
, ordered
lexicographically by (i
1
, i
2
), for 0 6 i
1
< n
1
, 0 6 i
2
< n
2
. Assuming that we are given as input
a precomputed table of the form 1, , ...,
n1
, we must show how to extract the required twiddle
factor table in the correct order. We rst construct a list of triples (i
1
, i
2
, i
1
i
2
), ordered by (i
1
, i
2
),
in time O(nlg n); then sort by i
1
i
2
in time O(nlg
2
n) (see section 2.1); then merge with the given
root table to obtain a table (i
1
, i
2
,
i1i2
), ordered by i
1
i
2
, in time O(n (b +lg n)); and nally sort
by (i
1
, i
2
) in time O(nlg n(b +lg n)). The total cost of the extraction is thus O(nlg n(b +lg n)).
6 Even faster integer multiplication
The corresponding cost for /
1
/
d
is determined as follows. Assuming that the table
1, , ...,
n1
is given as input, we rst extract the subtables of (n
1
n
i
)-th roots of unity for
i =d 1, ..., 2 in time O((n
1
n
d
+ +n
1
n
2
) (b +lg n)) =O(n(b +lg n)). Extracting the twiddle
factor table for the decomposition (n
1
n
i1
) n
i
then costs O(n
1
n
i
lg n (b +lg n)); the total
over all i is again O(nlg n(b +lg n)).
Remark 2.2. An alternative approach is to compute the twiddle factors directly in the correct
order. When working over C, as in section 3, this requires a slight increase in the working precision.
Similar comments apply to the root tables used in Bluesteins algorithm in section 2.5.
2.4. Fast Fourier multiplication
Let be a principal n-th root of unity in R and assume that n is invertible in R. Consider
two polynomials A = a
0
+ + a
n1
X
n1
and B = b
0
+ + b
n1
X
n1
in R[X]. Let
C =c
0
+ +c
n1
X
n1
be the polynomial dened by
c :=
1
n
DFT
1(DFT
(a) DFT
(b)),
where the product of the DFTs is taken pointwise. By construction, we have c = a b
, which
means that C(
i
) = A(
i
) B(
i
) for all i 0, ..., n 1. The product S = s
0
+ + s
n1
X
n1
of A and B modulo X
n
1 also satises S(
i
) = A(
i
) B(
i
) for all i. Consequently, s = a b
,
s =DFT
j=0
n1
a
j
ij
=f
i
j=0
n1
(a
j
f
j
) g
ij
. (2.5)
Also, since n is even,
g
i+n
=
(i+n)
2
=
i
2
n
2
2ni
=
i
2
n
2
+i
n
=g
i
.
Now let F := f
0
a
0
+ + f
n1
a
n1
X
n1
, G := g
0
+ + g
n1
X
n1
and C := c
0
+ +
c
n1
X
n1
F G modulo X
n
1. Then (2.5) implies that a
i
=f
i
c
i
for all i 0, ..., n1. In other
words, the computation of a DFT of even length n reduces to a cyclic convolution product of the
same length, together with O(n) additional operations in R. Notice that the polynomial G is xed
and independent of a in this product.
The only complication in the Turing model is the cost of extracting the f
i
in the correct order,
i.e., in the order 1, ,
4
,
9
, ...,
(n1)
2
, given as input a precomputed table 1, ,
2
, ...,
2n1
.
We may do this in time O(nlg n(b +lg n)) by applying the strategy from section 2.3 to the pairs
(i, i
2
mod 2 n) for 0 6i <n. Similar remarks apply to the g
i
.
David Harvey, Joris van der Hoeven, Grgoire Lecerf 7
Remark 2.3. It is also possible to give variants of the new multiplication algorithms in which
Bluesteins transform is replaced by a dierent method for converting DFTs to convolutions, such
as Raders algorithm [41].
2.6. Kronecker substitution and segmentation
Multiplication in Z[X] may be reduced to multiplication in Z using the classical technique of
Kronecker substitution [21, Corollary 8.27]. More precisely, let d>0 and n>0, and suppose that we
are given two polynomials A, BZ[X] of degree less than d, with coecients A
i
and B
i
satisfying
[A
i
[ 62
n
and [B
i
[ 62
n
. Then for the product C =AB we have [C
i
[ 62
2n+lg d
. Consequently, the
coecients of C may be read o the integer product C(2
N
)=A(2
N
) B(2
N
) where N:=2 n+lgd+2.
Notice that the integers [A(2
N
)[ and [B(2
N
)[ have bit length at most d N, and the encoding and
decoding processes have complexity O(d N).
The inverse procedure is Kronecker segmentation. Given n > 0 and d > 0, and non-negative
integers a < 2
n
and b < 2
n
, we may reduce the computation of c := a b to the computation of
a product C := A B of two polynomials A, B Z[X] of degree less than d, and with [A
i
[ < 2
k
and [B
i
[ < 2
k
where k := ,n/d|. Indeed, we may cut the integers into chunks of k bits each, so
that a =A(2
k
), b =B(2
k
) and c =C(2
k
). Notice that we may recover c from C using an overlap-
add procedure in time O(d (k + lg d)) = O(n + d lg d). In our applications, we will always have
d =O(n/lg n), so that O(n+d lg d) =O(n).
Kronecker substitution and segmentation can also be used to handle Gaussian integers (and
Gaussian integer polynomials), and to compute cyclic convolutions. For example, given polynomials
A, B Z[i][X] /(X
d
1) with [A
i
[, [B
i
[ 62
n
, then for C =AB we have [C
i
[ 62
2n+lg d
, so we may
recover C from the cyclic Gaussian integer product C(2
N
) = A(2
N
) B(2
N
) (Z/(2
dN
1) Z)[i],
where N :=2 n+lg d+2. In the other direction, suppose that we wish to compute a b for some a,
b (Z/(2
dn
1) Z)[i]. We may assume that the real and imaginary parts of a and b are non-
negative, and so reduce to the problem of multiplying A, B Z[i][X] /(X
d
1), where a =A(2
n
)
and b =B(2
n
), and where the real and imaginary parts of A
i
, B
i
Z[i] are non-negative and have
at most n bits.
3. Fixed point computations and error bounds
In this section, we consider the computation of DFTs over C in the Turing model. Elements of C
can only be represented approximately on a Turing machine. We describe algorithms that compute
DFTs approximately, using a xed-point representation for C, and we give complexity bounds and
a detailed error analysis for these algorithms. We refer the reader to [7] for more details about
multiple precision arithmetic.
For our complexity estimates we will freely use the standard observation that I(O(n)) =O(I(n)),
since the multiplication of two integers of bit length 6k n reduces to k
2
multiplications of integers
of bit length 6n, for any xed k >1.
3.1. Fixed point numbers
We will represent xed point numbers by a signed mantissa and a xed exponent. More precisely,
given a precision parameter p > 4, we denote by C
p
the set of complex numbers of the form
z =m
z
2
p
, where m
z
=u +v i for integers u and v satisfying u
2
+v
2
62
2p
, i.e., [z[ 61. We write
C
p
2
e
for the set of complex numbers of the form u 2
e
, where u C
p
and e Z; in particular,
for z C
p
2
e
we always have [z[ 6 2
e
. At every stage of our algorithms, the exponent e will be
determined implicitly by context, and in particular, the exponents do not have to be explicitly
stored or manipulated.
In our error analysis of numerical algorithms, each z C
p
2
e
is really the approximation of some
genuine complex number z C. Each such z comes with an implicit error bound
z
> 0; this is
a real number for which we can guarantee that [z z[ 6
z
. We also dene the relative error bound
for z by
z
:=
z
/2
e
. We nally denote by :=2
1p
61/8 the machine accuracy.
Remark 3.1. Interval arithmetic [36] (or ball arithmetic [28, Chapter 3]) provides a systematic
method for tracking error bounds by storing the bounds along with z. We will use similar formulas
for the computation of
z
and
z
, but we will not actually store the bounds during computations.
8 Even faster integer multiplication
3.2. Basic arithmetic
In this section we give error bounds and complexity estimates for xed point addition, subtraction
and multiplication, under certain simplifying assumptions. In particular, in our DFTs, we only ever
need to add and subtract numbers with the same exponent. We also give error bounds for xed
point convolution of vectors; the complexity of this important operation is considered later.
For xR, we dene the round towards zero function x| by x|:=x| if x>0 and x| :=,x|
if x 60. For x, y R, we dene x+y i| :=x| +y| i. Notice that [z|[ 6[z[ and [z| z[ 6 2
for any z C.
Proposition 3.2. Let z, uC
p
2
e
. Dene the xed point sum and dierence z uu, z
uC
p
2
e+1
by m
z
u
:=(m
z
m
u
)/2|. Then z uu and z
u
6
z
+
u
2
+.
Proof. We have
[(z
u) (z u)[
2
e+1
=
_
m
z
m
u
2
_
m
z
m
u
2
2
p
6 2
2
p
6
and
[(z u) (zu)[
2
e+1
6
z
+
u
2
e+1
=
z
+
u
2
,
whence [(z
u) (zu)[/2
e+1
6(
z
+
u
)/2 +.
Proposition 3.3. Let z C
p
2
ez
and u C
p
2
eu
. Dene the xed point product z
u C
p
2
ez+eu
by m
z
u
:=2
p
m
z
m
u
|. Then z
u
6 (1 +
z
) (1 +
u
) (1 +).
Proof. We have
[z
u z u[/2
ez+eu
=[2
p
m
z
m
u
| 2
p
m
z
m
u
[ 2
p
6 2
2
p
6
and
[z u zu[ 6 [z[ [u u[ +[z z[ ([u[ +[u u[)
6 2
ez
u
+2
eu
z
+
z
u
= (
u
+
z
+
z
u
) 2
ez+eu
.
Consequently, [z
u zu[/2
ez+eu
6
z
+
u
+
z
u
+ 6(1 +
z
) (1 +
u
) (1 +) 1.
Proposition 3.3 may be generalised to numerical cyclic convolution of vectors as follows.
Proposition 3.4. Let k >1 and n:=2
k
. Let z (C
p
2
ez
)
n
and u(C
p
2
eu
)
n
. Dene the xed point
convolution z<dotast>u (C
p
2
ez+eu+k
)
n
by
m
(z<dotast>u)i
:=
_
2
pk
i1+i2=i (mod n)
m
z
i
1
m
u
i
2
_
, 0 6i <n.
Then
max
i
(1 +
(z<dotast>u)
i
) 6 max
i
(1 +
zi
) max
i
(1 +
ui
) (1 +).
Proof. Let denote the exact convolution, and write
z
:=max
j
z
j
and
u
:=max
j
u
j
. As in the
proof of Proposition 3.3, we obtain [(z<dotast>u)
i
(z u)
i
[/2
ez+eu+k
6 2
2
p
6 and
[(z u)
i
(z u)
i
[ 6
i1+i2=i (mod n)
[z
i
1
u
i
2
z
i
1
u
i
2
[
6 (
z
+
u
+
z
u
) 2
ez+eu+k
.
The proof is concluded in the same way as Proposition 3.3.
David Harvey, Joris van der Hoeven, Grgoire Lecerf 9
3.3. Precomputing roots of unity
Let H:=x +y i C: y >0 and H
p
:=x + y i C
p
: y >0. Let
: HH be the branch of the
square root function such that e
i
:=e
i/2
for 0 6 6. Using Newtons method [7, Section 3.5]
and SchnhageStrassen multiplication [47], we may construct a xed point square root function
: H
p
H
p
, which may be evaluated in time O(p log p log log p), such that [ z
[ 6 for all
z H
p
. For example, we may rst compute some u H such that [u z
:=2
p
u| 2
p
; the desired bound follows since /4 + 2
2
p
6.
Lemma 3.5. Let z H
p
, and assume that [z[ =1 and
z
63/8. Then
z
6
z
+.
Proof. The mean value theorem implies that
6
z
max
wD
[1 / (2 w
)[ where
D:=wH: [wz[ 6
z
. For wD we have [w[ >[z[ [zz[ [z w[ >1 3/8 3/8 >1/4;
hence
6
z
=
z
. By construction [ z
[ 6. We conclude that
6
z
+.
Proposition 3.6. Let k N and p > k, and let := e
2i/2
k
. We may compute 1, ,
2
, ...,
2
k
1
C
p
, with
i 6k
0
6/4. This also shows
that the hypothesis
2
p
6.
3.4. Error analysis for fast Fourier transforms
A tight algorithm for computing DFTs of length n=2
k
>2 is a numerical algorithm that takes as
input an n-tuple a(C
p
2
e
)
n
and computes an approximation a(C
p
2
e+k
)
n
to the DFT of a with
respect to =e
2pi/n
(or =e
2pi/n
in the case of an inverse transform), such that
max
i
(1 +
ai
) 6 max
i
(1 +
ai
) (1 +)
3k2
.
We assume for the moment that any such algorithm has at its disposal all necessary root tables
with relative error not exceeding . Propositions 3.2 and 3.3 directly imply the following:
Proposition 3.7. The buttery algorithm B that computes a DFT of length 2 using the formula
(a
0
, a
1
) (a
0
ua
1
, a
0
a
1
) is tight.
Proof. We have
a
i
6(
a
0
+
a
1
)/2 + 6max
i
a
i
+ 6(1 +max
i
a
i
) (1 +) 1.
Proposition 3.8. Let k
1
, k
2
> 1, and let /
1
and /
2
be tight algorithms for computing DFTs of
lengths 2
k1
and 2
k2
. Then /
1
/
2
is a tight algorithm for computing DFTs of length 2
k1+k2
.
Proof. The inner and outer DFTs contribute factors of (1 + )
3k12
and (1 + )
3k22
, and by
Proposition 3.3 the twiddle factor multiplications contribute a factor of (1 +)
2
. Thus
max
i
(1 +
ai
) 6max
i
(1 +
ai
) (1 +)
(3k12)+2+(3k22)
6max
i
(1 +
ai
) (1 +)
3(k1+k2)2
.
Corollary 3.9. Let k > 1. Then B
k
is a tight algorithm for computing DFTs of length 2
k
over C
p
, whose complexity is bounded by O(2
k
k I(p)).
4. A simple and fast multiplication algorithm
In this section we give the simplest version of the new integer multiplication algorithm. The key
innovation is an alternative method for computing DFTs of small length. This new method uses
a combination of Bluesteins chirp transform and Kronecker substitution (see sections 2.5 and 2.6)
to convert the DFT to a cyclic integer product in (Z/(2
n
0
1) Z)[i] for suitable n
0
.
10 Even faster integer multiplication
Proposition 4.1. Let 16r 6p. There exists a tight algorithm (
r
for computing DFTs of length 2
r
over C
p
, whose complexity is bounded by O(I(2
r
p) +2
r
I(p)).
Proof. Let n := 2
r
, and suppose that we wish to compute the DFT of a (C
p
2
e
)
n
. Using
Bluesteins chirp transform (notation as in section 2.5), this reduces to computing a cyclic convolu-
tion of suitable F (C
p
2
e
)[X]/(X
n
1) and GC
p
[X] /(X
n
1). We assume that the f
i
and g
i
have been precomputed with
f
i
,
g
i
6.
We may regard F
0
:= 2
pe
F and G
0
:= 2
p
G as cyclic polynomials with complex integer
coecients, i.e., as elements of Z[i][X] /(X
n
1). Write F
0
=
i=0
n1
F
i
0
X
i
and G
0
=
i=0
n1
G
i
0
X
i
,
where F
i
0
, G
i
0
Z[i] with [F
i
0
[ 6 2
p
and [G
i
0
[ 6 2
p
. Now we compute the exact product H
0
:=
F
0
G
0
Z[i][X] /(X
n
1) using Kronecker substitution. More precisely, we have [H
i
0
[ 62
2p+r
, so
it suces to compute the cyclic integer product H
0
(2
b
) =F
0
(2
b
) G
0
(2
b
) (Z/(2
nb
1) Z)[i], where
b :=2 p +r +2 =O(p). Then H :=H
0
2
e2p
is the exact convolution of F and G, and rounding H
to precision p yields F<dotast>G(C
p
2
e+r
)[X] /(X
n
1) in the sense of Proposition 3.4. A nal
multiplication by f
i
yields the Fourier coecients a
i
C
p
2
e+r
.
To establish tightness, observe that 1 +
F
i
6(1 +
a
i
) (1 +)
2
and
G
i
6, so Proposition 3.4
yields 1 +
(F<dotast>G)i
6 (1 +
a
) (1 + )
4
where
a
:= max
i
ai
; we conclude that 1 +
ai
6
(1+
a
) (1+)
6
. For r >3, this means that the algorithm is tight; for r 62, we may take (
r
:=B
r
.
For the complexity, observe that the product in (Z/ (2
nb
1) Z)[i] reduces to three integer
products of size O(n p). These have cost O(I(n p)), and the algorithm also performs O(n) multi-
plications in C
p
, contributing the O(nI(p)) term.
Remark 4.2. A crucial observation is that, for suitable parameters, the DFT algorithm in Propo-
sition 4.1 is actually faster than the conventional CooleyTukey algorithm of Corollary 3.9. For
example, if we assume that I(m) =m(logm)
1+o(1)
, then to compute a transform of length n over C
p
with n p, the CooleyTukey approach has complexity n
2
(log n)
2+o(1)
, whereas Proposition 4.1
yields n
2
(log n)
1+o(1)
, an improvement by a factor of roughly log n.
Theorem 4.3. For n, we have
I(n)
nlg n
= O
_
I(lg
2
n)
lg
2
nlg lg n
+
I(lg n)
lg nlg lg n
+1
_
. (4.1)
Proof. We rst reduce our integer product to a polynomial product using Kronecker segmentation
(section 2.6). Splitting the two n-bit inputs into chunks of b := lg n bits, we need to compute
a product of polynomials u, v Z[X] with non-negative b-bit coecients and degrees less than
m := ,n/b| = O(n/lg n). The coecients of h := u v have O(lg n) bits, and we may deduce the
desired integer product h(2
b
) in time O(n).
Let k := lg (2 m). To compute u v, we will use DFTs of length 2
k
=O(n/lg n) over C
p
, where
p:=2 b+2 k+lgk+8=O(lgn). Zero-padding u to obtain a sequence (u
0
, ..., u
2
k
1
) (C
p
2
b
)
2
k
, and
similarly for v, we compute the transforms u, v(C
p
2
b+k
)
2
k
with respect to :=e
2pi/2
k
as follows.
Let r :=lg lg n and d :=,k/r| =O(lg n/lg lg n). Write k =r
1
+ +r
d
with r
i
:=r for i 6d 1
and r
d
:= k (d 1) r 6 r. We use the algorithm / := /
1
/
d
(see section 2.3), where
for 1 6 i 6 d 1 we take /
i
to be the tight algorithm (
r
for DFTs of length 2
r
lg n given by
Proposition 4.1, and where /
d
is B
rd
as in Corollary 3.9. In other words, we split the k usual
radix-2 layers of the FFT into groups of r layers, handling the transforms in each group with the
BluesteinKronecker reduction, and then using ordinary CooleyTukey for the remaining r
d
layers.
We next compute the pointwise products h
i
:= u
i
v
i
C
p
2
2b+2k
, and then apply an inverse
transform /
0
dened analogously to /. A nal division by 2
k
(which is really just an implicit
adjustment of exponents) yields approximations h
i
C
p
2
2b+2k
.
Since /and /
0
are tight by Propositions 3.8, 4.1 and Corollary 3.9, we have 1+
ui
6(1+)
3k2
,
and similarly for v. Thus 1 +
h
i
6 (1 + )
6k3
, so 1 +
h
i
6 (1 + )
9k5
6 exp(9 k ) 6
exp(2
5+lg kp
) 6 1 + 2
6+lg kp
after the inverse transform (since exp x 6 1 + 2 x for x 6 1). In
particular,
hi
=2
2b+2k
hi
62
2b+2k+lg kp+6
61/4, so we obtain the exact value of h
i
by rounding
to the nearest integer.
David Harvey, Joris van der Hoeven, Grgoire Lecerf 11
Now we analyse the complexity. Using Proposition 3.6, we rst compute a table of roots 1,
, ...,
2
k
1
in time O(2
k
p log p log log p) =O(nlg n), and then extract the required twiddle factor
tables in time O(2
k
k (p +k)) =O(n lg n) (see section 2.3). For the Bluestein reductions, we may
extract a table of 2
r+1
-th roots in time O(2
k
p) = O(n), and then rearrange them as required in
time O(2
r
r (p+r)) =O(lg
2
n lglg n) (see section 2.5). These precomputations are then all repeated
for the inverse transforms.
By Corollary 3.9, Proposition 4.1 and (2.4), each invocation of / (or /
0
) has cost
O((d 1) 2
kr
(I(2
r
p) +2
r
I(p)) +2
krd
2
rd
r
d
I(p) +(d 1) 2
k
I(p) +p 2
k
k)
= O((d 1) 2
kr
I(2
r
p) +(d +r
d
) 2
k
I(p) +p 2
k
k)
= O
_
n
lg nlg lg n
I(lg
2
n) +
n
lg lg n
I(lg n) +nlg n
_
.
The cost of the O(2
k
) pointwise multiplications is subsumed within this bound.
It is now a straightforward matter to recover Frers bound.
Theorem 4.4. For some constant K >1, we have
I(n) = O(nlg nK
log
n
).
Proof. Let T(n) :=I(n)/(n lgn) for n>2. By Theorem 4.3, there exists x
0
>2 and C>1 such that
T(n) 6 C (T(lg
2
n) +T(lg n) +1)
for all n > x
0
. Let (x) := 4 log
2
x for x R, x > 1. Increasing x
0
if necessary, we may assume
that (x) 6x 1 for x>x
0
, so that the function
(x) :=minj N:
j
(x) 6x
0
is well-dened.
Increasing C if necessary, we may also assume that T(n) 63 C for all n6x
0
.
We prove by induction on
(n)+1
for all n. If
(lgn) 6
(lg
2
n) 6
((n)) =
(n)
+C (3 C)
(n)
+C 6(3 C)
(n)+1
.
Finally, since ((x)) log x, we have
(x) 6 2 log
n
) for
K :=(3 C)
2
.
5. Logarithmically slow recurrence inequalities
This section is devoted to developing a framework for handling recurrence inequalities, similar
to (4.1), that appear in subsequent sections.
Let : (x
0
, ) R be a smooth increasing function, for some x
0
R. We say that
: (x
0
, ) R
>
is an iterator of if
is increasing and if
(x) =
((x)) +1 (5.1)
for all suciently large x.
For instance, the standard iterated logarithm log
(x) := min k N:
k
(x) 6
is well-dened and satises (5.1) for all x > . It will sometimes be convenient to increase x
0
so
that (x) 6x 1 is satised on the whole domain of .
We say that is logarithmically slow if there exists an ` N such that
(log
`
exp
`
)(x) = log x +O(1) (5.2)
for x. For example, the functions log(2 x), 2 logx, (logx)
2
and (logx)
log log x
are logarithmically
slow, with ` =0, 1, 2, 3 respectively.
12 Even faster integer multiplication
Lemma 5.1. Let : (x
0
, ) R be a logarithmically slow function. Then there exists >x
0
such
that (x) 6x 1 for all x >. Consequently all logarithmically slow functions admit iterators.
Proof. The case ` = 0 is clear. For ` > 1, let := log exp. By induction (x) 6 x 1 for
large x, so (x) 6exp(log x 1) =x/e 6x 1 for large x.
In this paper, the main role played by logarithmically slowfunctions is to measure size reduction
in multiplication algorithms. In other words, multiplication of objects of size n will be reduced to
multiplication of objects of size n
0
, where n
0
6(n) for some logarithmically slow function (x).
The following result asserts that, from the point of view of iterators, such functions are more or
less interchangeable with log x.
Lemma 5.2. For any iterator
(x) = log
x+O(1).
Proof. First consider the case where ` = 0 in (5.2), i.e., assume that [(x) log x[ 6 C for
some constant C > 0 and all x > x
0
. Increasing x
0
and C if necessary, we may assume that
(x) =
(x) =
(
k
(x)) + k. Since
k
(x) 62 log
k
x68 e
2C
and k =log
x +O(1), we obtain
(x) =log
x +O(1).
Now consider the general case ` > 0. Let := log
`
exp
`
, so that
:=
exp
`
is
an iterator of . By the above argument
(x) = log
x + O(1), and so
(x) =
(log
`
x) =
log
(log
`
x) +O(1) =log
x ` +O(1) =log
x+O(1).
The next result, which generalises and renes the argument of Theorem 4.4, is our main tool
for converting recurrence inequalities into actual asymptotic bounds for solutions. We state it
in a slightly more general form than is necessary for the present paper, anticipating the more
complicated situation that arises in [24].
Proposition 5.3. Let K > 1, B > 0 and ` N. Let x
0
> exp
`
(1), and let : (x
0
, ) R be
a logarithmically slow function such that (x) 6x 1 for all x >x
0
. Then there exists a positive
constant C (depending on x
0
, , K, B and `) with the following property.
Let >x
0
and L > 0. Let o R, and let T: o R
>
be any function satisfying the following
recurrence. First, T(y) 6L for all y o, y 6. Second, for all y o, y >, there exist y
1
, ..., y
d
o
with y
i
6(y), and weights
1
, ...,
d
>0 with
i
i
=1, such that
T(y) 6 K
_
1 +
B
log
`
y
_
i=1
d
i
T(y
i
) +L.
Then we have T(y) 6CLK
log
ylog
(x) :=mink N:
k
(x) 6 for x>x
0
. We
claim that there exists r N, depending only on x
0
and , such that
(x) 6 log
x log
+r (5.4)
David Harvey, Joris van der Hoeven, Grgoire Lecerf 13
for all x>. Indeed, let
(x) :=minj N:
j
(x) 6x
0
. First suppose >x
0
, so that
() >1.
For any x >, we have
(
(x)1)
(x) >, so
(x)1+
()1)
(x) >
(
()1)
() >x
0
,
and hence
(x) >
(x) +
(x) 6
(x)
() +O(1) =log
x log
+O(1).
Dene a sequence of real numbers E
1
, E
2
, ... by the formula
E
j
:=
_
1 +B if j 6r +`,
1 +B/exp
(jr`1)
(1) if j >r +`.
We claim that
1 +B/log
`
x 6 E
(x)
(5.5)
for all x>. Indeed, let j :=
(y) that
T(y) 6 E
1
E
j
L(K
j
+ +K +1)
for all y >x
0
. The base case j := 0, i.e., y 6, holds by assumption. Now assume that j >1, so
y >. By hypothesis there exist y
1
, ..., y
d
o, y
i
6(y), and
1
, ...,
d
>0 with
i
i
=1, such that
T(y) 6 KE
j
i
T(y
i
) +L.
Since
(y
i
) 6
((y)) =
(y) 1, we obtain
T(y) 6 KE
j
i
(E
1
E
j 1
L(K
j 1
+ +K +1)) +L
= E
1
E
j
L(K
j
+ +K
2
+K) +L
6 E
1
E
j
L(K
j
+ +K
2
+K +1).
Finally, the innite product
E :=
j>1
E
j
6(1 +B)
r+`
k>0
_
1 +
B
exp
k
(1)
_
certainly converges, so we have T(y)6EL K
j+1
/(K1) for y >x
0
. Setting C:=EK
r+1
/(K1),
by (5.4) we obtain T(y) 6CLK
log
ylog
2
b
<2
b+1
, and for any p >b +1 we may encode w
as a polynomial W (C
p
2
b+1
)[X]/(X
2
k
1).
We will multiply the desired (cyclic) polynomials by using DFTs of length 2
k
over C
p
where
p := 2 b + 2 k + lg k + 10 = O(lg
2
n). We construct the DFTs in a similar way to section 4. Let
r := (lg lg n)
2
and d :=,k/r| =O(lg n/(lg lg n)
2
). Write k =r
1
+ +r
d
with r
i
:=r for i 6d 1
and r
d
:=k (d 1) r 6r. We use the tight algorithm /:=/
1
/
d
, where for 16i 6d 1 we
take /
i
to be the tight algorithm (
r
0
for DFTs of length 2
r
given by Proposition 6.3, and where /
d
is B
rd
as in Corollary 3.9. Thus, for the rst d 1 groups of r layers, we use BluesteinKronecker
to reduce to complex integer convolution of size n
0
:=((2 p +r +2) 2
r
), and the remaining layers
are handled using ordinary CooleyTukey. We write /
0
for the analogous inverse transform.
To check the hypothesis of Proposition 6.3, we observe that 2
r
[ n
0
for suciently large n, as n
0
is divisible by 2
k
0
where k
0
:=lg n
0
lg(lg
2
n
0
) +1, and
2
k
0
n
0
lg
2
n
0
(2 p +r +2) 2
r
lg
2
((2 p +r +2) 2
r
)
b 2
r
(lg b +r)
2
(lg n)
2
(lg lg n)
4
2
r
~2
r
.
Denote by D the cost of a single invocation of / (or /
0
). By Corollary 3.9 and (2.4), we have
D 6 (d 1) B
p,2
kr(2
r
) +O(2
krd
2
rd
r
d
I(p)) +O(d 2
k
I(p)) +O(2
k
k b).
The last term is the rearrangement cost, and simplies to O(n lg n). The second term covers the
invocations of /
d
, and simplies to O(r 2
k
I(p)), so is absorbed by the d 2
k
I(p) term. The rst
term covers the invocations of (
r
0
. By denition B
p,2
kr(2
r
) 6 (2 2
kr
+ 1) B
p
(2
r
), and since
2
kr
~lg lg n, Proposition 6.3 yields
B
p,2
kr(2
r
) 6 (2 +O(1/lg lg n)) 2
kr
C(n
0
) +O(2
k
I(p)).
Thus
D 6 (2 +O(1/lg lg n)) d 2
kr
C(n
0
) +O(d 2
k
I(p)) +O(nlg n).
We will use SchnhageStrassens algorithm for xed point multiplications in C
p
. Since p =
O(lg
2
n), we may take I(p) =O(lg
2
nlg lg nlg lg lg n). Thus the d 2
k
I(p) term becomes
O
_
lg n
(lg lg n)
2
n
lg
2
n
lg
2
nlg lg nlg lg lg n
_
=O
_
nlg n
lg lg lg n
lg lg n
_
=O(nlg n).
(We could of course use our algorithm recursively for these multiplications; however, it turns out
that SchnhageStrassen is fast enough, and leads to simpler recurrences. In fact, the algorithm
asymptotically spends more time rearranging data than multiplying in C
p
!)
Since (2 p + r + 2) 2
r
= (4 b + O(lg n)) 2
r
= (4 + O(1 / lg lg n)) b 2
r
, and since lg(b 2
r
) =
r +O(lg lg n) =(1 +O(1/lg lg n)) r ~lg lg n, by Lemma 6.1 we have
n
0
= (4 +O(1/lg lg n)) b 2
r
,
lg n
0
= (1 +O(1/lg lg n)) r.
We also have k =lg n+O(lg lg n) and d =k/r +O(1), so
lg n = (1 +O(1/lg lg n)) k,
d = (1 +O(1/lg lg n)) k/r.
Thus
d 2
kr
=
4 (2
k
b) d
(4 b 2
r
)
=
_
4 +O
_
1
lg lg n
__
nlg n
n
0
lg n
0
,
and consequently
D 6
_
8 +O
_
1
lg lg n
__
nlg n
n
0
lg n
0
C(n
0
) +O(nlg n).
16 Even faster integer multiplication
To compute the desired t products, we must execute t + 1 forward transforms and t inverse
transforms. For each product, we must also perform O(2
k
) pointwise multiplications in C
p
, at
cost O(2
k
I(p)) = O(n lg n). As in the proof of Theorem 4.3, the cost of all necessary root table
precomputations is also bounded by O(2
k
I(p)) =O(nlg n). Thus we obtain
C
t
(n) 6 (2 t +1) D+O(t nlg n).
Dividing by (2 t +1) nlg n and taking suprema yields the bound (6.2).
The error analysis is almost identical to the proof of Theorem 4.3, the only dierence being
that b is replaced by b +1. Denoting one of the t products by h(C
p
2
2b+2k+2
)[X]/(X
2
k
1), we
have
h
i
62
6+lg kp
exactly as in Theorem 4.3. Thus
h
i
62
2b+2k+lg kp+8
61/4, and again we
obtain h
i
by rounding to the nearest integer.
Finally we show how to dene (x). We already observed that lg n
0
r (lglg n)
2
. Thus there
exists a constant C >0 such that log log log n
0
6log log log log n +C for large n, so we may take
(x) :=exp
3
(log
4
x +C).
Now we may prove the main theorem announced in the introduction.
Proof of Theorem 1.1. Let x
0
and (x) be as in Theorem 6.4. Increasing x
0
if necessary, by
Lemma 5.1 we may assume that (x) 6x 1 for x >x
0
, and that x
0
>exp(exp(1)).
Let T(n) :=C(n)/(nlg n) for admissible n>3. By the theorem, there exist constants B, L>0
such that for all admissible n>x
0
, there exists an admissible n
0
6(n) with
T(n) 6 8
_
1 +
B
log log n
_
T(n
0
) +L.
Increasing L if necessary, we may also assume that T(n) 6 L for all admissible n 6 x
0
. Taking
o to be the set of admissible integers, we apply Proposition 5.3 with K := 8, := x
0
, ` := 2, and
for each admissible n >x
0
setting d := 1,
1
:= 1, y := n and y
1
:= n
0
as above. We conclude that
T(n) =O(8
log
n
), and hence C(n) =O(nlg n 8
log
n
) as n runs over admissible integers. We already
pointed out that I(k) 63 C(O(k)) +O(k).
7. An optimised variant of Frers algorithm
As pointed out in the introduction, Frer proved that I(n) =O(n log n K
log
n
) for some K>1, but
did not give an explicit bound for K. In this section we sketch an argument showing that one may
achieve K =16 in Frers algorithm, by reusing tools from previous sections, especially section 6.
At the core of Frers algorithm is the ring R=C[X] /(X
2
r1
+1), which contains the principal
2
r
-th root of unity X. Note that R is a direct sum of 2
r1
copies of C, and hence not a eld (for
r >2). A crucial observation is that X is a fast root of unity, in the sense that multiplication by
X and its powers can be achieved in linear time, as in SchnhageStrassens algorithm. For any
k >r, we need to construct a 2
kr
-th root of X, which is itself a 2
k
-th principal root of unity.
We recall Frers construction of as follows.
Lemma 7.1. With R as above, let % =exp
2 p i
2
k
and =exp
2 p i
2
r
. Then
:=
i=0
2
r1
1
%
2i+1
j=/ i
(X
2j+1
)
j=/ i
(
2i+1
2j+1
)
R
is a principal 2
k
-th root of unity with
2
kr
=X. The coecients of have absolute value 61.
Proof. See [19, Section 4].
As our basic recursive problem, we will consider multiplication in (Z/(2
n
+1) Z)[i], where n is
divisible by a high power of two. We will refer to the last property as admissibility, but we will
not dene it precisely. We write C
t
(n) for the cost of t >1 such products with one xed argument,
and C(n) :=sup
t>1
C
t
(n)/(2 t +1) for the normalised cost, exactly as in section 6.
David Harvey, Joris van der Hoeven, Grgoire Lecerf 17
Frer worked with Z/(2
n
+1) Z rather than (Z/(2
n
+1) Z)[i], but, since we are interested in
constant factors, and since the recursive multiplication step involves multiplication of complex
quantities, it simplies the exposition to work systematically with complexied objects everywhere.
For suitable parameters r and k, we will encode elements of (Z/(2
n
+ 1) Z)[i] as (nega)cyclic
polynomials in R[Y ] /
_
Y
2
k
+1
_
, where R:=C[X] /(X
2
r1
+1) as above. We choose the parameters
later; for now we require only that 2
k+r2
divides n and that b := n/2
k+r2
>lg n (so that the
coecients are not too small).
The encoding proceeds as follows. Given aZ/(2
n
+1) Z, we split a into 2
k
parts a
0
, ..., a
2
k
1
of n/2
k
bits. Each a
i
is cut into 2
r2
even smaller pieces a
i,0
, ..., a
i,2
r2
1
of b bits. Then a is
encoded as
a :=
i=0
2
k
1
j=0
2
r2
1
a
i, j
X
j
Y
i
,
and an element u=x+y i (Z/(2
n
+1) Z)[i] is encoded as u:=x+yi. (Notice that the coecients
of X
j
are zero for 2
r2
6j <2
r1
; this zero-padding is the price Frer pays for introducing articial
roots of unity.)
We represent complex coecients by elements of C
p
2
e
for a suitable precision parameter p. The
exponent e varies during the algorithm, as explained in [19]; nevertheless, additions and subtrac-
tions only occur for numbers with the same exponent, as in the algorithms from sections 4 and 6.
Given u, v (Z/(2
n
+ 1) Z)[i], to successfully recover the product u v from the polynomial
product u v R[Y ] /
_
Y
2
k
+ 1
_
, we must choose p > 2 b + k + r + h, where h is an allowance for
numerical error. Certainly r 6k 6lg n, and, as shown by Frer, we may also take h=O(lg n) (an
analogous conclusion is reached in sections 4 and 6). Thus we may assume that p =2 b +O(lg n).
We must now show how to compute a product u v, for u, v R[Y ] /
_
Y
2
k
+ 1
_
. Frer handles
these types of multiplications using half-DFTs, i.e., DFTs that evaluate at odd powers of ,
where R is a principal 2
k+1
-th root of unity such that
2
k+1r
= X (Lemma 7.1). To keep
terminology and notation consistent with previous sections, we prefer to make the substitution
U(X, Y ) :=u(X, Y ), i.e., writing u =
i=0
2
k
1
u
i
(X) Y
i
, we put U :=
i
(u
i
i
) Y
i
, and similarly
for v and V . This reduces the problem to computing the product U V in R[Y ] /(Y
2
k
1). The
change of variable imposes a cost of O(2
k
m
R
), where m
R
is the cost of a multiplication in R.
So now consider a product UV , where U, V R[Y ] /(Y
2
k
1). Let :=
2
, so that
2
kr
=X.
Let d :=,k/r|, and write k =r
1
+ +r
d
with r
i
:=r for i 6d 1 and r
d
:=k (d 1) r 6r. For
each i, let /
i
be the algorithm for DFTs of length 2
ri
that applies the usual CooleyTukey method,
taking advantage of the fast 2
ri
-th root of unity X
2
rr
i
. The complexity of /
i
is O(2
ri+r
r
i
p), since
it performs O(2
ri
r
i
) linear-time operations on objects of bit size O(2
r
p). Let D be the complexity
of the algorithm /:=/
1
/
d
for DFTs of length 2
k
over R. Then (2.4) yields
D 6 O
_
i=1
d
2
kri
2
ri+r
r
i
p
_
+
_
k
r
_
2
k
m
R
+O(nlg n),
The rst term is bounded by O(d 2
k
2
r
r p) =O((2
k+r
p) k) =O(nlg n), since p =O(b).
Let us now consider the second term,k/r| 2
k
m
R
, which describes the cost of the twiddle factor
multiplications. This term turns out to be the dominant one. Both Kronecker substitution and
FFT multiplication may be considered for multiplication in R, but it turns out that Kronecker
substitution is faster (a similar phenomenon was noted in Remark 4.2). So we reduce multiplication
in R to multiplication in (Z/(2
n
0
+1) Z)[i] where n
0
>2
r1
(2 p+r +2) is admissible and divisible
by 2
r1
. For any reasonable denition of admissibility we then have n
0
=(1 +o(1)) 2
r
p, provided
that r is somewhat smaller than p. (In the interests of brevity, we will not specify the o(1) terms
for the remainder of the argument. They can all be controlled along the lines of section 6.) Most
of the twiddle factors are reused many times, so we will assume that m
R
=(2 +o(1)) C(n
0
), where
the factor 2 counts the two (rather than three) DFTs needed for each multiplication of size n
0
. The
term of interest then becomes
_
k
r
_
2
k
m
R
= (2 +o(1))
r +lg p
r
2
k+r
p k
n
0
lg n
0
C(n
0
).
18 Even faster integer multiplication
Since p =2 b +O(lg n) =
_
2 +O
_
lg n
b
__
b and 2
k+r
b =4 n, this yields
D 6 (16 +o(1))
_
1 +O
_
lg n
b
__
r +lg p
r
nlg n
n
0
lg n
0
C(n
0
) +O(nlg n).
To minimise the leading constant, we must choose b to grow faster than lg n, and r to grow faster
than lg p. For example, taking r :=(lglgn)
2
and k :=lgnr lg(lg
2
n) leads to b=4 n/2
k+r
lg
2
n
and lg p lg b lg lg n. The function mapping n to n
0
is then bounded by a logarithmically slow
function, and a similar argument to section 6 shows that I(n) =O(nlog n16
log
n
).
8. Fast multiplication using modular arithmetic
Shortly after Frers algorithm appeared, De et al [15] presented a variant based on modular
arithmetic that also achieves the complexity bound I(n) = O(n log n K
log
n
) for some K > 1.
Roughly speaking, they replace the coecient ring C with the eld Q
p
of p-adic numbers, for
a suitable prime p. In this context, working to nite precision means performing computations
in Z/p
Z
is a ring (unlike our C
p
), and arithmetic operations never lead to precision loss (unless one divides
by p, which never happens in these algorithms). The main disadvantage is that there are certain
technical diculties associated with nding an appropriate p; this is discussed in section 8.2 below.
The aim of this section is to sketch an analogue of the algorithm of section 6 that achieves
I(n) =O(nlog n8
log
n
) using modular arithmetic instead of C. We assume familiarity with p-adic
numbers, referring the reader to [22] for an elementary introduction.
8.1. Sketch of the algorithm
For the basic problem, we take multiplication in Z/(2
n
1) Z, where n is admissible (in the sense
of section 6) and where one of the arguments is xed over t >1 multiplications. As before, we take
k:=(n), and cut the inputs into chunks of b:=n/2
k
=O(lg
2
n) bits. Thus we reduce to multiplying
polynomials in Z[X] /(X
2
k
1) with coecients of at most b bits. The coecients of the product
have at most 2 b +k bits.
Let p be a prime such that p=1(mod2
k
), so that Q
p
contains a primitive 2
k
-th root of unity .
The problem of nding such p and is discussed in the next section; for now we assume only
that lg p = O(lg n). We may then embed the multiplication problem into Q
p
[X] /(X
2
k
1), and
use DFTs with respect to to compute the product. On a Turing machine, we cannot represent
elements of Q
p
exactly, so we perform all computations in Z/p
Z where
:=
_
2 b +k
(lg p) 1
_
.
This choice ensures that lg(p
Z)[X] /(X
2
k
1)
determines it unambiguously in Z[X]/(X
2
k
1).
To compute each DFT, we rst use the CooleyTukey algorithm to decompose it into short
transforms of length 2
r
, where r := (lg lg n)
2
. (As in section 6, there are also residual transforms
of length 2
rd
for some r
d
6r, whose contribution to the complexity is negligible.) Multiplications
in Z/p
n
).
On the other hand, we note that the problem only really occurs at the top recursion level.
Indeed, at deeper recursion levels, there is exponentially more time available at the previous level
to compute p. So one possible workaround is to use a dierent, suciently fast algorithm at the
top level, such as Frers algorithm, and then switch to the algorithm sketched in section 8.1 for
the remaining levels. In this way one still obtains the bound O(nlog n8
log
n
), and asymptotically
almost all of the computation is done using the algorithm of section 8.1.
If one insists on avoiding C entirely, there are still many choices: one could use the algorithm
of De et al at the top level, or use a multivariate version of the algorithm of section 8.1. One
could even use the SchnhageStrassen algorithm, whose main recursive step yields the bound
I(n) =O(n
1/2
I(n
1/2
) +n log n); applying this three times gives I(n) =O(n
7/8
I(n
1/8
) +n log n), and
then to multiply integers with n
1/8
bits, one can nd a suitable prime using Lemma 8.2 in time
O(n
3/4+o(1)
) =O(n).
Another way to work around the problem is to assume the generalised Riemann hypothesis
(GRH). De et al pointed out that under GRH, it is possible to nd a suitable prime eciently using
a randomised algorithm. Here we show that, under GRH, we can even use deterministic algorithms.
Lemma 8.3. Assume GRH. Then p
0
(k) = O(2
2k
k
2
), and we may compute p
0
(k) in time
O(2
k
k
O(1)
).
Proof. The rst bound is given in [26], and the complexity bound follows similarly to the proof
of Lemma 8.2.
To use this result, we must modify the algorithm of section 8.1 slightly. Choose a constant C>3
so that we can compute p
0
(k) in time O(2
k
k
C
), as in Lemma 8.3. Increase the coecient size from
(lg n)
2
to (lg n)
C1
, and change the denition of admissibility accordingly. The transform length
then decreases to 2
k
=O(n/(lgn)
C1
), and the cost of computing p decreases to only O(n lgn). The
rest of the complexity analysis is essentially unchanged; the result is an algorithm with complexity
O(nlog n8
log
n
), working entirely with modular arithmetic, in which the top recursion level does
not need any special treatment.
20 Even faster integer multiplication
Finally, we consider the computation of a suitable approximation to a 2
k
-th root of unity in Q
p
.
Lemma 8.4. Given k, > 1 and a prime p = 1 (mod 2
k
), we may nd Z/ p
Z such that
= (mod p
0
= g
(p1)/2
k
is a primitive 2
k
-th root of unity in Z/ p Z, and there is a unique primitive 2
k
-th
root of unity Q
p
congruent to
0
modulo p. Given
0
, we may compute (mod p
) using fast
Newton lifting in time O((k log p)
1+
) [9, Section 12.3].
In the context of section 8.1, we may assume that = O((lg n)
O(1)
) and k = O(lg n), so the
cost of nding is O(p
1/4+
). This is certainly less than the cost of nding p itself, using either
Lemma 8.2 or Lemma 8.3.
9. Conjecturally faster multiplication
It is natural to ask whether the approaches from sections 6, 7 or 8 can be further optimised, to
obtain a complexity bound I(n) =O(nlog nK
log
n
) with K <8.
In Frers algorithm, the complexity is dominated by the cost of multiplications in R=C[X] /
(X
2
r1
+ 1). If we could use a similar algorithm for a much simpler R, then we might achieve
a better bound. Such an algorithm was actually given by Frer [17], under the assumption that
there exist suciently many Fermat primes, i.e., primes of the form F
m
=2
2
m
+1. More precisely,
his algorithm requires that there exists a positive integer k such that for every mN, the sequence
F
m+1
, ..., F
2
m+k contains a prime number. The DFTs are then computed directly over R=F
Fm
for
suitable m, taking advantage of the fact that F
F
m
contains a fast 2
m+1
-th primitive root of unity
(namely the element 2) as well as a 2
2
m
-th primitive root of unity. It can be shown that a suitably
optimised version of this hypothetical algorithm achieves K=4: we still pay a factor of two due to
the fact that we compute both forward and inverse transforms, and we pay another factor of two
for the zero-padding in the recursive reduction. Unfortunately, it is likely that F
4
= 65537 is the
last Fermat prime [13].
In the K =8 algorithm of section 6, a potential bottleneck arises during the short transforms,
when we use Kronecker substitution to multiply polynomials in C
p
[X]/(X
2
r
1). We really only
need the high p bits of each coecient of the product (i.e., of the real and imaginary parts), but
we are forced to allocate roughly 2 p bits per coecient in the Kronecker substitution, and then
we discard roughly half of the output. This problem is similar to the well-known obstruction that
prevents us from using FFT methods to compute a short product, i.e., the high n bits or low n
bits of the product of two n-bit integers, any faster than computing the full 2 n bits.
In this section, we present a variant of the algorithm of section 6, in which the coecient ring C
is replaced by a nite eld F
p
[i], where p =2
q
1 is a Mersenne prime. Thus short products are
replaced by cyclic products, namely by multiplications modulo 2
q
1. This saves a factor of two
at each recursion level, and consequently reduces K from 8 to 4.
This change of coecient ring introduces several technical complications. First, it is of course
unknown if there are innitely many Mersenne primes. Thus we are forced to rely on unproved
conjectures about the distribution of Mersenne primes.
Second, q is always prime (except possibly at the top recursion level). Thus we cannot cut up
an element of Z/p Z into equal-sized chunks with an integral number of bits, and still expect to
take advantage of cyclic products. In other words, q is very far from being admissible in the sense
of section 6. To work around this, we deploy a variant of an algorithm of Crandall and Fagin [12],
which allows us to work with chunks of varying size. The CrandallFagin algorithm was originally
presented over C, and depended crucially on the fact that R contains suitable roots of 2. In our
setting, we work over F
p
0[i]
=
F
(p
0
)
2, where p
0
=2
q
0
1 is a Mersenne prime exponentially smaller
than p. Happily, F
p
0 contains suitable roots of 2, and this enables us to adapt their algorithm to
our setting. Moreover, since (p
0
)
2
1 =2
q
0
+1
(2
q
0
1
1), the eld F
p
0[i] contains roots of unity of
high power-of-two order, namely of order 2
q
0
+1
, so we can perform FFTs over F
p
0[i] very eciently.
David Harvey, Joris van der Hoeven, Grgoire Lecerf 21
Finally, we can no longer use Kronecker substitution, as this would reintroduce the very zero-
padding we are trying to avoid. Instead, we take our basic problem to be polynomial multiplication
over (Z/p Z)[i] (where p=2
q
1 is not necessarily prime). After the CrandallFagin splitting step,
we have a bivariate multiplication problem over F
p
0[i], which is solved using 2-dimensional FFTs
over F
p
0[i]. These FFTs are in turn reduced to 1-dimensional FFTs using standard methods; this
dimension reduction is, roughly speaking, the analogue of Kronecker substitution in this algorithm.
(Indeed, it is also possible to give an algorithm along these lines that works over C but avoids Kro-
necker substitution entirely; this still yields K=8 because of the short product problem mentioned
above.) For the 1-dimensional transforms, we use the same technique as in previous sections: we
use CooleyTukeys algorithm to decompose them into short transforms of exponentially shorter
length, then use Bluesteins method to convert them to (univariate) polynomial products, and
nally evaluate these products recursively.
9.1. Mersenne primes
Let
m
(x) denote the number of Mersenne primes less than x. Based on probabilistic arguments
and numerical evidence, Lenstra, Pomerance and Wagsta have conjectured that
m
(x)
e
g
log 2
log log x
as x, where g=0.5772... is the Euler constant [52, 40]. Our fast multiplication algorithm relies
on the following slightly weaker conjecture.
Conjecture 9.1. There exist constants 0 <a <b such that for all x >3,
a log log x <
m
(x) <b log log x.
Proposition 9.2. Assume Conjecture 9.1 and let c := b /a. For any integer n > 2, there exists
a Mersenne prime p =2
q
1 in the interval 2
n
<p <2
n
c
. Given n, we may compute the smallest
such p, and nd a primitive 2
q+1
-th root of unity in F
p
[i], in time O(n
(3+o(1))c
).
Proof. The required prime exists since for n>2 we have
m
(2
n
c
) >a log log (2
n
c
) =a c log n+a log log 2 >b log n+b log log 2 =b log log(2
n
) >
m
(2
n
).
An integer of the form 2
q
1 may be tested for primality in time q
2+o(1)
using the LucasLehmer
primality test [13]. A simple way to compute p is to apply this test successively for all q n+1, ...,
n
c
|; this takes time O(n
(3+o(1))c
). A primitive 2
q+1
-th root of unity may be computed by the
formula :=2
2
q2
+(3)
2
q2
i F
p
[i] in time O(q
2+o(1)
); see [43] or [14, Corollary 5].
9.2. Crandall and Fagins algorithm revisited
Let p = 2
q
1 be a Mersenne number (not necessarily prime). The main integer multiplication
algorithm depends on a variant of Crandall and Fagins algorithm that reduces multiplication in
(Z/p Z)[i][X]/(X
M
1) to multiplication in F
p
0[i][X, Y ]/(X
M
1, Y
N
1), where p
0
=2
q
0
1 is
a suitably smaller Mersenne prime (assuming that such a prime exists).
To explain the idea of this reduction, we rst consider the simpler univariate case, in which
we reduce multiplication in (Z/ p Z)[i] to multiplication in F
p
0[i][Y ] /(Y
N
1). Here we require
that N 6 q, that gcd(N, q
0
) = 1 and that q
0
>2 ,q /N| + lg N + 3. For any k N, we will write
N
k
=0, ..., k 1 and Z
k
=(k 1), ..., k 1.
Assume that we wish to compute the product of u, v (Z/ p Z)[i]. Considering u and v as
elements of N
p
[i] modulo p, we decompose them as
u =
i=0
N1
u
i
2
ei
, v =
i=0
N1
v
i
2
ei
, (9.1)
where
e
i
:= ,q i/N|,
u
i
, v
i
N
2
e
i+1
e
i
[i].
22 Even faster integer multiplication
We regard u
i
and v
i
as complex digits of u and v, where the base 2
ei+1ei
varies with the position i.
Notice that e
i+1
e
i
takes only two possible values: q/N| or ,q/N|.
For 0 6i <N, let
c
i
:= Ne
i
q i, (9.2)
so that 0 6 c
i
< N. For any 0 6 i
1
, i
2
< N, dene
i1,i2
Z as follows. Choose 0, 1 so that
i :=i
1
+i
2
N lies in the interval 0 6i <N, and put
i1,i2
:= e
i1
+e
i2
e
i
q.
From (9.2), we have
c
i1
+c
i2
c
i
=N (e
i1
+e
i2
e
i
) q (i
1
+i
2
i) =N
i1,i2
.
Since the left hand side lies in the interval (N, 2 N), this shows that
i
1
,i
2
0, 1. Now, since
2
q
=1 (mod p) and e
i
1
+e
i
2
=e
i
+
i
1
,i
2
(mod q), we have
uv =
i1=0
N1
i2=0
N1
u
i1
v
i2
2
ei
1
+ei
2
=
i=0
N1
w
i
2
ei
(mod p),
where
w
i
:=
i1+i2=i (mod N)
2
i
1
,i
2
u
i
1
v
i
2
.
Since [u
i
1
[ < 2
2
dq/Ne
and similarly for v
i
2
, we have w
i
Z
4
dq/Ne+1
N
[i]. Note that we may
recover u v from w
0
, ..., w
N1
in time O(q), by a standard overlap-add procedure (provided that
N =O(q/lg q)).
Let h be the inverse of q
0
modulo N; this inverse exists since we assumed gcd(N, q
0
) =1. Let
:=2
h
F
p
0, so that
N
=2
hN
=2,
since 2 has order q
0
in F
p
0. The quantity plays the same role as the real N-th root of 2 appearing
in CrandallFagins algorithm.
Now dene polynomials U, V F
p
0[i][Y ]/(Y
N
1) by U
i
:=
ci
u
i
and V
i
:=
ci
v
i
for 0 6i <N,
and let W =W
0
+ +W
N1
Y
N1
:=UV be their (cyclic) product. Then
w
i
:=
ci
W
i
=
i
1
+i
2
=i (mod N)
ci
U
i
1
V
i
2
=
ci
1
+ci
2
ci
u
i
1
v
i
2
=
2
i
1
,i
2
u
i
1
v
i
2
coincides with the reinterpretation of w
i
as an element of F
p
0[i]. Moreover, we may recover w
i
unambiguously from w
i
, as q
0
>2 ,q/N| +lg N +3 and w
i
Z
4
dq/Ne+1
N
[i]. Altogether, this shows
how to reduce multiplication in (Z/p Z)[i] to multiplication in F
p
0[i][Y ] /(Y
N
1).
Remark 9.3. The pair (e
i+1
, c
i+1
) can be computed from (e
i
, c
i
) in O(lg q) bit operations, so
we may compute the sequences e
0
, ..., e
N1
and c
0
, ..., c
N1
in time O(N lg q). Moreover, since
c
i+1
c
i
takes on only two possible values, we may compute the sequence
c0
, ...,
cN1
using O(N)
multiplications in F
p
0[i].
9.3. Bivariate CrandallFagin reduction
Generalising the discussion of the previous section, we now show how to reduce multiplication in
(Z/p Z)[i][X] /(X
M
1), for a given M>1, to multiplication in F
p
0[i][X, Y ] /(X
M
1, Y
N
1).
For this, we require that N 6q, that gcd(N, q
0
) =1 and that q
0
>2 ,q/N| +lg (MN) +3.
Indeed, consider two cyclic polynomials u=u
0
+ +u
M1
X
M1
and v =v
0
+ +v
M1
X
M1
in (Z/p Z)[i][X] /(X
M
1). We cut each of the coecients u
i
, v
i
(Z/p Z)[i] into N chunks u
i, j
and v
i, j
of bit size at most ,q /N|, using the same varying base strategy as above. With
N
= 2
and c
j
as before, we next form the bivariate cyclic polynomials
U :=
i, j
u
i, j
cj
X
i
Y
j
, V :=
i, j
v
i, j
cj
X
i
Y
j
David Harvey, Joris van der Hoeven, Grgoire Lecerf 23
in F
p
0[i][X, Y ]/(X
M
1, Y
N
1). Setting
W := UV =
i, j
w
i, j
cj
X
i
Y
j
,
the same arguments as in the previous section yield
w
i, j
=
i1+i2=i (mod M)
j1+j2=j (mod N)
2
j
1
,j
2
u
i1,j1
v
i2,j2
.
Using the assumption that q
0
>2 ,q/N| +lg(MN) +3, we recover the coecients w
i, j
, and hence
the product uv, from the bivariate cyclic convolution product W =UV .
9.4. Conjecturally faster multiplication
Let q >2 and p := 2
q
1 (not necessarily prime). We will take our basic recursive problem to be
multiplication in (Z/ p Z)[i][X] /(X
M
1) for suitable M. We need M somewhat larger than q;
this is analogous to the situation in section 6, where we chose a short transform length somewhat
larger than the coecient size. Thus we set M =M(q) :=2
(q)
where (q) is dened as follows.
Lemma 9.4. There exists an increasing function : NN such that
0 6(q) (log
2
q) (log
2
log
2
q) 62 (9.3)
for all q >2, and such that we may compute (q) in time (log q)
1+o(1)
.
Proof. Let f(q) := (log
2
q) (log
2
log
2
q). Using [6], we may construct a function g(q) such that
[g(q) f(q)[ 61/ q for all q >2, and which may be computed in time (log q)
1+o(1)
. One checks that
f(q +1) f(q) >2/q for all q >2, so g(q +1) >f(q +1)
1
q +1
>f(q +1)
1
q
>f(q) +
1
q
> g(q)
for q >2. Thus g(q) is increasing, and (q) :=g(q) +3/2| has the desired properties.
We say that an integer n>2 is admissible if it is of the form n=q M where M :=M(q) for some
q >2. (This should not be confused with the notion of admissibility of section 6.) An element of
(Z/ p Z)[i][X] /(X
M
1) is then represented by 2 n bits. Note that q q M(q) is strictly increasing,
so there is a one-to-one correspondence between integers q > 2 and admissible n. For x > 2 we
dene (x) to be the smallest admissible integer n>x.
Lemma 9.5. We have (n) = O(n) as n . Given n > 2, we may compute (n), and the
corresponding q, in time o(n).
Proof. From (9.3) we have (q +1) M(q +1)/(q M(q)) =O
_
2
(q+1)(q)
_
=O(1); this immediately
implies that (n) =O(n).
Suppose that we wish to compute (n) for some n. We assume that n is large enough that
the denition q
0
:= 2
dlg n/(lg lg nlg lg lg n1)e
makes sense and so that q
0
> 2. One checks that
(log
2
q
0
) (log
2
log
2
q
0
) >lgn, so (q
0
) >lgn and hence q
0
M(q
0
) >n. To nd the smallest suitable q,
we may simply compute q M(q) for each q = 2, 3, ..., q
0
, and compare with n. This takes time
O
_
q
0
(log q
0
)
1+o(1)
_
=o(n).
Now let q >2, p :=2
q
1 and M :=M(q). Consider the problem of computing t >1 products
u
1
v, ..., u
t
v with u
1
, ..., u
t
, v (Z/p Z)[i][X] /(X
M
1). We denote by C
t
(n) the complexity of this
problem, where n:=q M(q) is the admissible integer corresponding to q. As in section 6, we dene
C(n) :=sup
t>1
C
t
(n)/(2 t +1).
Notice that multiplication of two integers of bit size 6k reduces to the above problem, for t =1,
via a suitable Kronecker segmentation. Indeed, let n := (8 k) = q M(q) for some q, and encode
the integers as integer polynomials of degree less than M / 2 with coecients of bit size m :=
,k/(M/2)|. The desired product may be recovered from the product in (Z/p Z)[i][X] /(X
M
1),
as
2 m+lg (M/2) 6
4 k
M
+(q) 6
q
2
+(q) 6q 1
for large q. Thus, as in section 6, we have I(k) 63 C(O(k)) +O(k), and it suces to obtain a good
bound for C(n).
24 Even faster integer multiplication
Now suppose additionally that p=2
q
1 is prime. In this case (Z/p Z)[i] =F
p
[i] is a eld, and
as noted above, it contains 2
q+1
-th roots of unity, so we may dene DFTs of length 2
r
over F
p
[i]
for any r 6 q + 1. In particular, for r 6 q we may use Bluesteins algorithm to compute DFTs of
length 2
r
. Denote by B
q,t
(2
r
) the cost of evaluating t independent DFTs of length 2
r
over F
p
[i],
and put B
q
(2
r
) :=sup
t>1
B
q,t
(2
r
)/(2 t +1). Here we assume as usual that a 2
r+1
-th root of unity
is known, and that the corresponding Bluestein root table has been precomputed.
Let us apply these denitions in the case r :=lgM; this is permissible, as lgM6q for suciently
large q. Since convolution of length M over F
p
[i] is exactly the basic recursive problem, and since
one of the operands is xed, we have B
q,t
(M) 6C
t
(n) +O(t M I(q)), where n:= q M, and hence
B
q
(M) 6 C(n) +O(M I(q)). (9.4)
Theorem 9.6. Assume Conjecture 9.1. Then there exists x
0
>2 and a logarithmically slow function
: (x
0
, ) R with the following property. For all admissible n >x
0
, there exists an admissible
n
0
6(n) such that
C(n)
nlg n
6
_
4 +O
_
1
lg lg lg n
__
C(n
0
)
n
0
lg n
0
+O(1). (9.5)
Proof. Let n := q M with M =M(q). Assume that we wish to compute t >1 products with one
xed operand. Our goal is to reduce to a problem of the same form, but for exponentially smaller n.
Choose parameters. Let p
0
= 2
q
0
1 be the smallest Mersenne prime larger than 2
(lg M)
2
. By
Proposition 9.2, we have 2
(lg M)
2
<p
0
<2
(lg M)
2c
, whence (lg M)
2
6q
0
6(lg M)
2c
, for some absolute
constant c >1. Moreover, we may compute p
0
, together with a primitive 2
q
0
+1
-th root of unity
in F
p
0[i], in time O((lg M)
(6+o(1))c
) =O(nlg n). We dene M
0
:=M(q
0
) and n
0
:= q
0
M
0
.
The algorithm must perform various multiplications in F
p
0[i], at cost O(I(q
0
)). For simplicity
we will use SchnhageStrassens algorithm for these multiplications, i.e., we will take I(q
0
) =
O(q
0
lg q
0
lg lg q
0
). Since lg q
0
=O(lg lg M) =O(lg lg n), we have
I(q
0
) = O(q
0
lg lg nlg lg lg n).
CrandallFagin reduction. We use the framework of section 9.3 to reduce the basic multipli-
cation problem in (Z/p Z)[i][X] /(X
M
1) to multiplication in F
p
0[i][X, Y ] /(X
M
1, Y
N
1) for
suitable N. We take N :=2
`
s where
` := lg
_
2 q
q
0
lg lg q
_
,
s := 2
_
q
2
`
(q
0
lg
2
q)
_
+1.
We also write L:=2
`
. The denition of s makes sense for large q since q
0
>(lg M)
2
(lg q lglg q)
2
.
Let us check that the hypotheses of section 9.3 are satised for large q. We have Lq/(q
0
lg lg q)
and hence s lg lg q; in particular, s =/ q
0
, so gcd(N, q
0
) = 1, and also N q / q
0
q / lg q.
Since N = L s > 2 q / (q
0
lg
2
q), we also have 2 ,q / N| 6 q
0
lg
2
q + O(1), and thus
2 ,q/N| +lg (MN) +3 6q
0
since lg(MN) =O(lg q lg lg q).
We also note for later use the estimate
MNq
0
=
_
2 +O
_
1
lg lg n
__
n.
Indeed, since s lg lg q we have
s =
_
2 +O
_
1
lg lg q
__
q
2
`
(q
0
lg
2
q)
,
and we already noticed earlier that (lg
2
q)/q
0
=O(1/(lg lg q)
2
) =O(1/lg lg q).
To assess the cost of the CrandallFagin reduction, we note that computing the e
i
and c
i
costs O(N lg q) = O(n lg n) (see Remark 9.3), the splitting itself and nal overlap-add phase
require time O(t n), and the various multiplications by ,
ci
and
ci
have cost O(t MN I(q
0
)) =
O(t nI(q
0
)/q
0
) =O(t nlg n).
David Harvey, Joris van der Hoeven, Grgoire Lecerf 25
Reduction to power-of-two lengths. Next we reduce multiplication in F
p
0[i][X, Y ]/(X
M
1,
Y
N
1) to multiplication in ![X, Z] /(X
M
1, Z
L
1), where !:=F
p
0[i][U] /(U
s
1). In fact,
since gcd(L, s) = 1, these rings are isomorphic, via the map that sends X to X and Y to Z U.
Evaluating this isomorphism corresponds to rearranging the coecients according to the rule
i (i
0
, i
1
), where i 0, ..., N 1 is the exponent of Y and where i
0
:=i mod L and i
1
:=i mod s
are the exponents of Z and U. This may be achieved in time O(t MN lgN (q
0
+lgN)) =O(t n lgn)
using the same sorting strategy as in section 2.3. The inverse rearrangement is handled similarly.
Reduction to univariate transforms. For multiplication in ![X, Z] /(X
M
1, Z
L
1), we will
use bivariate DFTs over !. This is possible because F
p
0[i] contains both M-th and L-th primitive
roots of unity, namely
2
q
0
+1
/M
and
2
q
0
+1
/L
, since q
0
~ lg M and q
0
~ lg L. More precisely, we
must perform t +1 forward bivariate DFTs and t inverse bivariate DFTs of length M L over !,
and t M L multiplications in !. Each bivariate DFT reduces further to s M univariate DFTs of
length L over F
p
0[i] (with respect to Z) and s L univariate DFTs of length M over F
p
0[i] (with
respect to X). Interspersed between these steps are various matrix transpose operations of total
cost O(t s MLlg(s ML) q
0
) =O(t n lg n), to enable ecient access to the rows and columns (see
section 2.1).
Multiplications in ! are handled by zero-padding, i.e., we rst use CooleyTukey to multiply
in F
p
0[i][U] /(U
2
dlg se+1
1), and then reduce modulo U
s
1. The total cost of these multiplications
is O(t MLs lg s I(q
0
)) =O(t nlg s I(q
0
)/q
0
) =O(t nlg lg n(lg lg lg n)
2
) =O(t nlg n).
Reduction to short transforms. Consider one of the long univariate DFTs of length 2
k
M,
L over F
p
0[i]. We decompose the DFT into short DFTs of length M
0
as follows. Let r :=
lgM
0
=O(lglgn lglglgn) and d:=,k/r|=O(lgn/(lglgn lglglgn)), and write k=r
1
+ +r
d
where
r
i
:=r for 1 6i 6d 1 and r
d
:=k (d 1) r 6r. We use the algorithm /:=/
1
/
d
, where
for 16i 6d1 we take /
i
to be the algorithm based on Bluesteins method (discussed immediately
before (9.4)), and where /
d
is the usual CooleyTukey algorithm over F
p
0[i]. Let D
k
be the cost
of a single invocation of / (or of the corresponding inverse transform /
0
). By (2.4) we have
D
k
6 (d 1) B
q
0
,2
kr(2
r
) +O(2
krd
2
rd
r
d
I(q
0
)) +O(d 2
k
I(q
0
)) +O(2
k
q
0
lg n).
The cost of precomputing the necessary root tables is only O(2
k
I(q
0
)). By denition B
q
0
,2
kr(2
r
) 6
(2 2
kr
+1) B
q
0(M
0
). From (9.4) and the estimate 2
kr
~lg lg n, the rst term becomes
(d 1) B
q
0
,2
kr(2
r
) 6 (2 +O(1/lg lg n)) (d 1) 2
kr
C(n
0
) +O(d 2
kr
M
0
I(q
0
)).
The contribution to D
k
from all terms involving I(q
0
) is
O(2
k
(r
d
+d) I(q
0
)) =O
_
2
k
lg n
lg lg nlg lg lg n
q
0
lg lg nlg lg lg n
_
=O(2
k
q
0
lg n),
so
D
k
6 (2 +O(1/lg lg n)) (d 1) 2
kr
C(n
0
) +O(2
k
q
0
lg n).
Denoting by D the cost of a bivariate DFT of length M L over !, we thus have (ignoring the
transposition costs, which were included earlier)
D = s LD
lg M
+s M D
lg L
6
_
2 +O
_
1
lg lg n
___
s L
_
lg M
lg M
0
_
M
M
0
+s M
_
lg L
lg M
0
_
L
M
0
_
C(n
0
) +O(s LMq
0
lg n)
6
_
2 +O
_
1
lg lg n
__
s LM
lg (LM)
M
0
lg M
0
C(n
0
) +O(s LMq
0
lg n)
6
_
4 +O
_
1
lg lg n
__
nlg n
n
0
lg M
0
C(n
0
) +O(nlg n).
Moreover, since
lg n
0
lg M
0
=1 +
lg q
0
lg M
0
=1 +O
_
1
lg lg q
0
_
=1 +O
_
1
lg lg lg n
_
,
we get
D 6
_
4 +O
_
1
lg lg lg n
__
nlg n
n
0
lg n
0
C(n
0
) +O(nlg n).
26 Even faster integer multiplication
We must perform 2 t +1 bivariate DFTs; the bound (9.5) then follows exactly as in the proof
of Theorem 6.4.
For large n, we have log q
0
= O(log log M) = O(log log n), so log n
0
= log q
0
+ O((q
0
)) =
O(log q
0
log log q
0
) = O(log log n log log log n). Thus there exists a constant C > 0 such that
log log log n
0
6log log log log n+C for large n, and we may take (x) :=exp
3
(log
4
x+C).
Proof of Theorem 1.2. Follows from Theorem 9.6 and Proposition 5.3, analogously to the proof
of Theorem 1.1.
Bibliography
[1] M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Annals of Math., 160(2):781793, 2004.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. Addison-
Wesley, 1974.
[3] L. I. Bluestein. A linear ltering approach to the computation of discrete Fourier transform. IEEE Transac-
tions on Audio and Electroacoustics, 18(4):451455, 1970.
[4] A. Bostan, P. Gaudry, and . Schost. Linear recurrences with polynomial coecients and application to
integer factorization and Cartier-Manin operator. SIAM J. Comput., 36:17771806, 2007.
[5] C. B. Boyer. A History of Mathematics. Princeton Univ. Press, rst paperback edition, 1985.
[6] R. P. Brent. Fast multiple-precision evaluation of elementary functions. J. Assoc. Comput. Mach., 23(2):242
251, 1976.
[7] R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.
[8] P. Brgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, 1997.
[9] H. Cohen, G. Frey, R. Avanzi, Ch. Doche, T. Lange, K. Nguyen, and F. Vercauteren, editors. Handbook
of elliptic and hyperelliptic curve cryptography. Discrete Mathematics and its Applications. Chapman &
Hall/CRC, Boca Raton, FL, 2006.
[10] S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
[11] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math.
Computat., 19:297301, 1965.
[12] R. Crandall and B. Fagin. Discrete weighted transforms and large-integer arithmetic. Math. Comp.,
62(205):305324, 1994.
[13] R. Crandall and C. Pomerance. Prime numbers. A computational perspective. Springer, New York, 2nd
edition, 2005.
[14] R. Creutzburg and M. Tasche. Parameter determination for complex number-theoretic transforms using cyclo-
tomic polynomials. Math. Comp., 52(185):189200, 1989.
[15] A. De, P. P. Kurur, C. Saha, and R. Saptharishi. Fast integer multiplication using modular arithmetic. SIAM
J. Comput., 42(2):685699, 2013.
[16] J. calle. Introduction aux fonctions analysables et preuve constructive de la conjecture de Dulac. Hermann,
collection: Actualits mathmatiques, 1992.
[17] M. Frer. On the complexity of integer multiplication (extended abstract). Technical Report CS-89-17, Penn-
sylvania State University, 1989.
[18] M. Frer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of
Computing, STOC 2007, pages 5766, New York, NY, USA, 2007. ACM Press.
[19] M. Frer. Faster integer multiplication. SIAM J. Comp., 39(3):9791005, 2009.
[20] M. Frer. How fast can we multiply large integers on an actual computer? Technical report,
http://arxiv.org/abs/1402.1811, 2014.
[21] J. vonzur Gathenand J. Gerhard. Modern Computer Algebra. Cambridge University Press, 2nd edition, 2002.
[22] F. Q. Gouva. p-adic numbers. An introduction. Universitext. Springer-Verlag, Berlin, 1993.
[23] T. Granlund et al. GMP, the GNU multiple precision arithmetic library. http://gmplib.org, 1991. Latest
version 6.0.0 released in 2014.
[24] D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over nite elds. Technical
report, HAL, 2014. http://hal.archives-ouvertes.fr.
[25] D. R. Heath-Brown. Almost-primes in arithmetic progressions and short intervals. Math. Proc. Cambridge
Philos. Soc., 83(3):357375, 1978.
[26] D. R. Heath-Brown. Zero-free regions for Dirichlet L-functions, and the least prime in an arithmetic progres-
sion. Proc. London Math. Soc. (3), 64(2):265338, 1992.
[27] M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch.
Hist. Exact Sci., 34(3):265277, 1985.
[28] J. van der Hoeven. Journes Nationales de Calcul Formel (2011), volume 2 of Les
cours du CIRM, chapter Calcul analytique. CEDRAM, 2011. Exp. No. 4, 85 pages,
http://ccirm.cedram.org/ccirm-bin/fitem?id=CCIRM_2011__2_1_A4_0.
[29] J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002. http://www.mathemagix.org.
[30] A. Karatsuba and J. Ofman. . Doklady Akad. Nauk SSSR,
7:293294, 1962. English translation in [31].
David Harvey, Joris van der Hoeven, Grgoire Lecerf 27
[31] A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady,
7:595596, 1963.
[32] D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley,
1969.
[33] D. E. Knuth. The art of computer programming, volume 3: Sorting and Searching. Addison-Wesley, Reading,
MA, 1998.
[34] Yu. V. Linnik. On the least prime in an arithmetic progression I. The basic theorem. Rec. Math. (Mat.
Sbornik) N.S., 15(57):139178, 1944.
[35] Yu. V. Linnik. On the least prime in an arithmetic progression II. The Deuring-Heilbronn phenomenon. Rec.
Math. (Mat. Sbornik) N.S., 15(57):347168, 1944.
[36] R. E. Moore. Interval Analysis. Prentice Hall, Englewood Clis, N.J., 1966.
[37] O. Neugebauer. The Exact Sciences in Antiquity. Brown Univ. Press, Providence, R.I., 1957.
[38] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[39] J. M. Pollard. The fast Fourier transform in a nite eld. Math. Comp., 25(114):365374, 1971.
[40] C. Pomerance. Recent developments in primality testing. Math. Intelligencer, 3(3):97105, 1980/81.
[41] C. M. Rader. Discrete Fourier transforms whenthe number of data samples is prime. Proc. IEEE, 56(6):1107
1108, June 1968.
[42] K. R. Rao, D. N. Kim, and J. J. Hwang. Fast Fourier Transform - Algorithms and Applications. Signals and
Communication Technology. Springer-Verlag, 2010.
[43] I. S. Reed and T. K. Truong. The use of nite elds to compute convolutions. IEEE Trans. Inform. Theory,
IT-21:208213, 1975.
[44] M. C. Schmeling. Corps de transsries. PhD thesis, Universit Paris-VII, France, 2001.
[45] A. Schnhage. Multiplikation groer Zahlen. Computing, 1(3):182196, 1966.
[46] A. Schnhage. Storage modication machines. SIAM J. on Comp., 9:490508, 1980.
[47] A. Schnhage and V. Strassen. Schnelle Multiplikation groer Zahlen. Computing, 7:281292, 1971.
[48] I. Shparlinski. On nding primitive roots in nite elds. Theoret. Comput. Sci., 157(2):273275, 1996.
[49] D. E. Smith. History of Mathematics, volume 2. Dover, 1958.
[50] A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet
Mathematics, 4(2):714716, 1963.
[51] A. L. Toom. , .
Doklady Akad. Nauk SSSR, 150:496498, 1963. English translation in [50].
[52] S. Wagsta. Divisors of Mersenne numbers. Math. Comp., 40(161):385397, 1983.
[53] T. Xylouris. On the least prime in an arithmetic progression and estimates for the zeros of Dirichlet L-
functions. Acta Arith., 1:6591, 2011.
28 Even faster integer multiplication