We give a new proof of Fürer's bound for the cost of
multiplying
Let $M(n)$ denote the cost of multiplying two $n$-bit integers in the deterministic multitape Turing model [38] (commonly called “bit complexity”). Previously, the best known asymptotic bound for $M(n)$ was due to Fürer [18, 19]. He proved that there is a constant $K > 1$ such that
$$M(n) = O\bigl(n \log n \, K^{\log^* n}\bigr), \qquad (1.1)$$
where $\log^* x$, for $x \in \mathbb{R}$, denotes the iterated logarithm, i.e.,
$$\log^* x := \min\{k \in \mathbb{N} : \log^{\circ k} x \leq 1\}, \qquad \log^{\circ k} := \log \circ \cdots \circ \log \quad (k \text{ times}). \qquad (1.2)$$
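For concreteness, here is a minimal Python sketch of the iterated logarithm as just defined (using the natural logarithm; the function name `log_star` is ours):

```python
import math

def log_star(x: float) -> int:
    """Iterated logarithm: how many times log must be applied to x
    before the result drops to at most 1."""
    k = 0
    while x > 1:
        x = math.log(x)
        k += 1
    return k

# The function grows extremely slowly, e.g. log*(10^100) = 4.
assert log_star(10 ** 100) == 4
```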
The main contribution of this paper is a new algorithm that yields the following improvement.

Theorem 1.1. For $n \to \infty$ we have
$$M(n) = O\bigl(n \log n \, 8^{\log^* n}\bigr).$$
Fürer suggested several methods to minimise the value of $K$ in his algorithm, but did not give an explicit bound for $K$. In section 7 of this paper, we outline an optimised variant of Fürer's algorithm that achieves $K = 16$. We do not know how to obtain $K < 16$ using Fürer's approach. This suggests that the new algorithm is faster than Fürer's by a factor of $2^{\log^* n}$.
The idea of the new algorithm is remarkably simple. Given two $n$-bit integers, we split them into chunks of exponentially smaller size, say around $\log n$ bits, and thus reduce to the problem of multiplying integer polynomials of degree $O(n / \log n)$ with coefficients of bit size $O(\log n)$. We multiply the polynomials using discrete Fourier transforms (DFTs) over $\mathbb{C}$, with a working precision of $O(\log n)$ bits. To compute the DFTs, we decompose them into “short transforms” of exponentially smaller length, say length around $\log n$, using the Cooley–Tukey method. We then use Bluestein's chirp transform to convert each short transform into a polynomial multiplication problem over $\mathbb{C}$, and finally convert back to integer multiplication via Kronecker substitution. These much smaller integer multiplications are handled recursively.
The algorithm just sketched leads immediately to a bound of the form (1.1). A detailed proof is given in section 4. We emphasise that the new method works directly over $\mathbb{C}$, and does not need special coefficient rings with “fast” roots of unity, of the type constructed by Fürer. Optimising parameters and keeping careful track of constants leads to Theorem 1.1, which is proved in section 6.
We also prove the following conditional result in section 9.

Theorem 1.2. Assuming Conjecture 9.1, we have
$$M(n) = O\bigl(n \log n \, 4^{\log^* n}\bigr).$$

Conjecture 9.1 is a slight weakening of the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes, i.e., primes of the form $2^q - 1$. The idea of the algorithm is to replace the coefficient ring $\mathbb{C}$ by the finite field $\mathbb{F}_p[\mathrm{i}]$ with $p$ a Mersenne prime; we are then able to exploit fast algorithms for multiplication modulo numbers of the form $2^q - 1$.
An important feature of the new algorithms is that the same techniques are applicable in other contexts, such as polynomial multiplication over finite fields. Previously, no Fürer-type complexity bounds were known for the latter problem. The details are presented in the companion paper [24].
In the remainder of this section, we present a brief history of complexity bounds for integer multiplication, and we give an overview of the paper and of our contribution. More historical details can be found in books such as [21, Chapter 8].
Multiplication algorithms of complexity $O(n^2)$ in the number of digits $n$ were already known in ancient civilisations. The Egyptians used an algorithm based on repeated doublings and additions. The Babylonians invented the positional numbering system, while performing their computations in base $60$ instead of $10$.
Precise descriptions of multiplication methods close to the ones that we
learn at school appeared in Europe during the late Middle Ages. For
historical references, we refer to [49, Section II.5] and
[37, 5].
The first subquadratic algorithm for integer multiplication, with complexity $O(n^{\log 3 / \log 2})$, was discovered by Karatsuba [30, 31]. From a modern
viewpoint, Karatsuba's algorithm utilises an evaluation-interpolation
scheme. The input integers are cut into smaller chunks, which are taken
to be the coefficients of two integer polynomials; the polynomials are
evaluated at several well-chosen points; their values at those points
are (recursively) multiplied; interpolating the results at those points
yields the product polynomial; finally, the integer product is recovered
by pasting together the coefficients of the product polynomial. This
cutting-and-pasting procedure is sometimes known as Kronecker
segmentation (see section 2.6).
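To make the evaluation–interpolation viewpoint concrete, here is a minimal Python sketch of Karatsuba multiplication of non-negative integers (the splitting threshold and function name are ours; the three recursive products correspond to evaluating at $0$, $1$ and $\infty$):

```python
def karatsuba(u: int, v: int, cutoff: int = 32) -> int:
    """Multiply non-negative integers by one level of evaluation-interpolation:
    write u = u1*B + u0, v = v1*B + v0 and use three half-size products."""
    if u < (1 << cutoff) or v < (1 << cutoff):
        return u * v                                  # small operands: multiply directly
    m = max(u.bit_length(), v.bit_length()) // 2
    B = 1 << m
    u1, u0 = divmod(u, B)
    v1, v0 = divmod(v, B)
    low = karatsuba(u0, v0)                           # "value at 0"
    high = karatsuba(u1, v1)                          # "value at infinity"
    mid = karatsuba(u0 + u1, v0 + v1) - low - high    # "value at 1" minus the others
    return high * B * B + mid * B + low

x, y = 12345678901234567890, 98765432109876543210
assert karatsuba(x, y) == x * y
```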
Shortly after the discovery of Karatsuba's algorithm, which uses three evaluation points, Toom generalised it so as to use $2r - 1$ evaluation points instead [51, 50], for any $r \geq 2$. This leads to the bound $M(n) = O(n^{\log(2r-1)/\log r})$ for fixed $r$. Letting $r$ grow slowly with $n$, he also showed that $M(n) = O(n \, 2^{O(\sqrt{\log n})})$.
The algorithm was adapted to the Turing model by Cook [10]
and is now known as Toom–Cook multiplication. Schönhage
obtained a slightly better bound [45] by working modulo
several numbers of the form $2^\ell - 1$ instead of using
several polynomial evaluation points. Knuth proved that an even better
complexity bound could be achieved by suitably adapting Toom's method
[32].
The next step towards even faster integer multiplication was the
rediscovery of the fast Fourier transform (FFT) by Cooley and Tukey [11] (essentially the same algorithm was already known to Gauss
[27]). The FFT yields particularly efficient algorithms for
evaluating and interpolating polynomials on certain special sets of
evaluation points. For example, if $R$ is a ring in which $2$ is invertible, and if $\omega \in R$ is a principal $2^k$-th root of unity (see section 2.2 for detailed definitions), then the FFT permits evaluation and interpolation at the points $1, \omega, \ldots, \omega^{2^k - 1}$ using only $O(2^k k)$ ring operations in $R$. Consequently, if $P$ and $Q$ are polynomials in $R[X]$ whose product has degree less than $2^k$, then the product $P Q$ can be computed using $O(2^k k)$ ring operations as well.
In [47], Schönhage and Strassen presented two FFT-based algorithms for integer multiplication. In both algorithms, they first use Kronecker segmentation to convert the problem to multiplication of integer polynomials. They then embed these polynomials into $R[X]$ for a suitable ring $R$ and multiply the polynomials by using FFTs over $R$. The first algorithm takes $R = \mathbb{C}$ and $\omega = \exp(2 \pi \mathrm{i} / 2^k)$, and works with finite-precision approximations to elements of $\mathbb{C}$. Multiplications in $\mathbb{C}$ itself are handled recursively, by treating them as integer multiplications (after appropriate scaling). The second algorithm, popularly known as the Schönhage–Strassen algorithm, takes $R = \mathbb{Z}/m\mathbb{Z}$ where $m = 2^{2^k} + 1$ is a Fermat number. This algorithm is the faster of the two, achieving the bound $M(n) = O(n \log n \log\log n)$. It benefits from the fact that $\omega = 2$ is a principal $2^{k+1}$-th root of unity in $\mathbb{Z}/m\mathbb{Z}$, and that multiplications by powers of $\omega$ can be carried out efficiently, as they correspond to simple shifts and negations. At around the same time, Pollard pointed out that one can also work with $R = \mathbb{Z}/m\mathbb{Z}$, where $m$ is a prime of the form $a \cdot 2^k + 1$, since then $\mathbb{Z}/m\mathbb{Z}$ contains primitive $2^k$-th roots of unity [39] (although he did not give a bound for $M(n)$).
Schönhage and Strassen's algorithm remained the champion for more than thirty years, but was recently superseded by Fürer's algorithm [18]. In short, Fürer managed to combine the advantages of the two algorithms from [47], to achieve the bound $M(n) = O(n \log n \, 2^{O(\log^* n)})$. Fürer's algorithm is based on the ingenious observation that the ring $R = \mathbb{C}[X]/(X^{2^{r-1}} + 1)$ contains a small number of “fast” principal $2^r$-th roots of unity, namely the powers of $X$, but also a large supply of much higher-order roots of unity inherited from $\mathbb{C}$. To evaluate an FFT over $R$, he decomposes it into many “short” transforms of length at most $2^r$, using the Cooley–Tukey method. He evaluates the short transforms with the fast roots of unity, pausing occasionally to perform “slow” multiplications by higher-order roots of unity (“twiddle factors”). A slightly subtle point of the construction is that we really need, for large $k$, a principal $2^k$-th root of unity $\omega \in R$ such that $\omega^{2^{k-r}} = X$.
In [15] it was shown that the technique from [39] of computing modulo suitable prime numbers of the form $a \cdot 2^k + 1$ can be adapted to Fürer's algorithm. Although the complexity of this algorithm is essentially the same as that of Fürer's algorithm, this method has the advantage that it does not require any error analysis for approximate numerical operations in $\mathbb{C}$.
Throughout the paper, integers are assumed to be handled in the standard binary representation. For our computational complexity results, we assume that we work on a Turing machine with a finite but sufficiently large number of tapes [38]. The Turing machine model is very conservative with respect to the cost of memory access, which is pertinent from a practical point of view for implementations of FFT algorithms. Nevertheless, other models for sequential computations could be considered [46, 20]. For practical purposes, parallel models might be more appropriate, but we will not consider these in this paper. Occasionally, for polynomial arithmetic over abstract rings, we will also consider algebraic complexity measures [8, Chapter 4].
In section 2, we start by recalling several classical techniques for completeness and later use: sorting and array transposition algorithms, discrete Fourier transforms (DFTs), the Cooley–Tukey algorithm, FFT multiplication and convolution, Bluestein's chirp transform, and Kronecker substitution and segmentation. In section 3, we also provide the necessary tools for the error analysis of complex Fourier transforms. Most of these tools are standard, although our presentation is somewhat ad hoc, being based on fixed point arithmetic.
In section 4, we describe a simplified version of the new integer multiplication algorithm, without any attempt to minimise the aforementioned constant $K$. As mentioned in the sketch above, the key idea is to reduce a given DFT over $\mathbb{C}$ to a collection of “short” transforms, and then to convert these short transforms back to integer multiplication by a combination of Bluestein's chirp transform and Kronecker substitution.
The complexity analysis of Fürer's algorithm and the algorithm from section 4 involves functional inequalities which contain post-compositions with logarithms and other slowly growing functions. In section 5, we present a few systematic tools for analysing these types of inequalities. For more information on this quite particular kind of asymptotic analysis, we refer the reader to [44, 16].
In section 6, we present an optimised version of the algorithm from section 4, proving in particular the bound $M(n) = O(n \log n \, 8^{\log^* n})$ (Theorem 1.1), which constitutes the main result of this paper. In section 7, we outline a similar complexity analysis for Fürer's algorithm. Even after several optimisations of the original algorithm, we were unable to attain a bound better than $O(n \log n \, 16^{\log^* n})$. This suggests that the new algorithm outperforms Fürer's algorithm by a factor of $2^{\log^* n}$.
This speedup is surprising, given that the short transforms in Fürer's algorithm involve only shifts, additions and subtractions. The solution to the paradox is that Fürer has made the short transforms too fast. Indeed, they are so fast that they make a negligible contribution to the overall complexity, and his computation is dominated by the “slow” twiddle factor multiplications. In the new algorithm, we push more work into the short transforms, allowing them to get slightly slower; the quid pro quo is that we avoid the factor of two in zero-padding caused by Fürer's introduction of artificial “fast” roots of unity. The optimal strategy is actually to let the short transforms dominate the computation, by increasing the short transform length relative to the coefficient size. Fürer is unable to do this, because in his algorithm these two parameters are too closely linked. To underscore just how far the situation has been inverted relative to Fürer's algorithm, we point out that in our presentation we can get away with using Schönhage–Strassen for the twiddle factor multiplications, without any detrimental effect on the overall complexity.
We have chosen to base most of our algorithms on approximate complex arithmetic. Instead, following [39] and [15], we might have chosen to use modular arithmetic. In section 8, we will briefly indicate how our main algorithm can be adapted to this setting. This variant of our algorithm presents several analogies with its adaptation to polynomial multiplication over finite fields [24].
The question remains whether there exists an even faster algorithm than the algorithm of section 6. In an earlier paper [17], Fürer gave another algorithm of complexity $O(n \log n \, 2^{O(\log^* n)})$ under the assumption that there exist sufficiently many Fermat primes, i.e., primes of the form $F_m = 2^{2^m} + 1$. It can be shown that a careful optimisation of this algorithm yields the bound $M(n) = O(n \log n \, 4^{\log^* n})$. Unfortunately, odds are high that $F_4 = 65537$ is the largest Fermat prime. In section 9, we present an algorithm that achieves the bound $M(n) = O(n \log n \, 4^{\log^* n})$ under the more plausible conjecture that there exist sufficiently many Mersenne primes (Theorem 1.2). The main technical ingredient is a variant of an algorithm of Crandall and Fagin [12] that permits efficient multiplication modulo $2^q - 1$, despite $q$ not being divisible by a large power of two.
It would be interesting to know whether the new algorithms could be
useful in practice. We have implemented an unoptimised version of the
algorithm from section 8 in the
Notations. We use Hardy's notations $f \prec g$ for $f = o(g)$, and $f \asymp g$ for $f = O(g)$ and $g = O(f)$. The symbol $\mathbb{R}^{\geqslant}$ denotes the set of non-negative real numbers, and $\mathbb{N}$ denotes $\{0, 1, 2, \ldots\}$. We will write $\lg n := \lceil \log n / \log 2 \rceil$.
This section recalls basic facts on Fourier transforms and related techniques used in subsequent sections. For more details and historical references we refer the reader to standard books on the subject such as [2, 8, 21, 42].
In the Turing model, we have available a fixed number of linear tapes.
An array
of
-bit elements is stored as a linear array of
bits. We generally assume that the elements are
ordered lexicographically by
,
though this is just an implementation detail.
What is significant from a complexity point of view is that occasionally
we must switch representations, to access an array (say 2-dimensional)
by “rows” or by “columns”. In the Turing model,
we may transpose an matrix of
-bit elements in time
, using the algorithm of [4, Appendix].
Briefly, the idea is to split the matrix into two halves along the
“short” dimension, and transpose each half recursively.
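A minimal Python sketch of this recursion, acting on a flat row-major array in memory (the function name and representation are ours, and Turing-machine tape mechanics are ignored), looks as follows:

```python
def transpose(a, n1, n2):
    """Transpose an n1 x n2 matrix stored row-major in the flat list a,
    by recursively splitting along the shorter dimension."""
    if n1 == 1 or n2 == 1:
        return list(a)                        # a single row or column: nothing to do
    if n1 <= n2:                              # split the columns into two halves
        k = n2 // 2
        left = [a[i * n2 + j] for i in range(n1) for j in range(k)]
        right = [a[i * n2 + j] for i in range(n1) for j in range(k, n2)]
        return transpose(left, n1, k) + transpose(right, n1, n2 - k)
    k = n1 // 2                               # otherwise split the rows into two halves
    top = transpose(a[: k * n2], k, n2)
    bottom = transpose(a[k * n2:], n1 - k, n2)
    out = []
    for j in range(n2):                       # row j of the result: column j of both halves
        out += top[j * k : (j + 1) * k] + bottom[j * (n1 - k) : (j + 1) * (n1 - k)]
    return out

# [[0,1,2],[3,4,5]] transposes to [[0,3],[1,4],[2,5]]
assert transpose(list(range(6)), 2, 3) == [0, 3, 1, 4, 2, 5]
```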
We will also require more complex rearrangements of data, for which we
resort to sorting. Suppose that is a totally
ordered set, whose elements are represented by bit strings of length
, and suppose that we can
compare elements of
in time
. Then an array of
elements of
may be sorted in time
using merge sort [33], which can be
implemented efficiently on a Turing machine.
Let $R$ be a commutative ring with identity and let $n \geq 1$. An element $\omega \in R$ is said to be a principal $n$-th root of unity if $\omega^n = 1$ and
$$\sum_{k=0}^{n-1} (\omega^i)^k = 0 \qquad (2.1)$$
for all $i \in \{1, \ldots, n-1\}$. In this case, we define the discrete Fourier transform (or DFT) of an $n$-tuple $a = (a_0, \ldots, a_{n-1}) \in R^n$ with respect to $\omega$ to be the $n$-tuple $\hat{a} = \mathrm{DFT}_\omega(a) = (\hat{a}_0, \ldots, \hat{a}_{n-1}) \in R^n$ where
$$\hat{a}_i := a_0 + a_1 \omega^i + \cdots + a_{n-1} \omega^{(n-1)i}.$$
That is, $\hat{a}_i$ is the evaluation of the polynomial $A(X) := a_0 + a_1 X + \cdots + a_{n-1} X^{n-1}$ at $\omega^i$.
If $\omega$ is a principal $n$-th root of unity, then so is its inverse $\omega^{-1} = \omega^{n-1}$, and we have
$$\mathrm{DFT}_{\omega^{-1}}(\mathrm{DFT}_\omega(a)) = n a.$$
Indeed, writing $b := \mathrm{DFT}_{\omega^{-1}}(\mathrm{DFT}_\omega(a))$, the relation (2.1) implies that
$$b_i = \sum_{j=0}^{n-1} \hat{a}_j \omega^{-ij} = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} a_k \omega^{j(k-i)} = \sum_{k=0}^{n-1} a_k \Bigl( \sum_{j=0}^{n-1} \omega^{j(k-i)} \Bigr) = n a_i,$$
since the inner sum equals $n \delta_{i,k}$, where $\delta_{i,k} = 1$ if $i = k$ and $\delta_{i,k} = 0$ otherwise.
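A direct Python transcription of these formulas over $\mathbb{C}$, with $\omega = e^{2\pi\mathrm{i}/n}$, fixes the conventions (it is quadratic-time and purely illustrative):

```python
import cmath

def dft(a, omega):
    """DFT of the tuple a with respect to omega: hat(a)_i = sum_j a_j * omega^(i*j)."""
    n = len(a)
    return [sum(a[j] * omega ** (i * j) for j in range(n)) for i in range(n)]

n = 4
omega = cmath.exp(2j * cmath.pi / n)        # a principal n-th root of unity in C
a = [1, 2, 3, 4]
b = dft(dft(a, omega), 1 / omega)           # transform, then transform back with omega^(-1)
assert all(abs(b[i] - n * a[i]) < 1e-9 for i in range(n))   # recovers n * a
```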
Remark .
In this setting, the concept of principal root of unity coincides with
the more familiar primitive root of unity. The more general
“principal root” concept is only needed for discussions of
other algorithms, such as the Schönhage–Strassen algorithm or
Fürer's algorithm.
Let be a principal
-th
root of unity and let
where
. Then
is a principal
-th root of unity and
is a principal
-th
root of unity. Moreover, for any
and
, we have
If and
are algorithms
for computing DFTs of length
and
, we may use (2.2) to construct an
algorithm
for computing DFTs of length
as follows.
For each , the sum inside the
brackets corresponds to the
-th
coefficient of a DFT of the
-tuple
with respect to
.
Evaluating these inner DFTs requires
calls to
. Next, we multiply
by the twiddle factors
,
at a cost of
operations in
. (Actually, fewer than
multiplications are required, as some of the twiddle factors are equal
to
. This optimisation, while
important in practice, has no asymptotic effect on the algorithms
discussed in this paper.) Finally, for each
, the outer sum corresponds to the
-th coefficient of a DFT of an
-tuple in
with respect
to
. These outer DFTs
require
calls to
.
Denoting by the number of ring operations needed
to compute a DFT of length
,
and assuming that we have available a precomputed table of twiddle
factors, we obtain
For a factorisation , this
yields recursively
The corresponding algorithm is denoted .
The
operation is neither commutative nor
associative; the above expression will always be taken to mean
.
Let be the butterfly algorithm that computes a
DFT of length 2 by the formula
.
Then
computes a DFT of length
in time
. Algorithms of this
type are called fast Fourier transforms (or FFTs).
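For concreteness, a textbook radix-2 Cooley–Tukey FFT over $\mathbb{C}$ along these lines can be written as follows in Python; it illustrates the decomposition $n = 2 \cdot (n/2)$ with length-2 butterflies and twiddle factors, not the Turing-model version analysed in this paper:

```python
import cmath

def fft(a, omega):
    """FFT of a (length a power of two) with respect to omega, via the
    Cooley-Tukey decomposition into two half-length transforms."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], omega * omega)      # inner DFTs of length n/2
    odd = fft(a[1::2], omega * omega)
    out = [0j] * n
    for i in range(n // 2):
        t = omega ** i * odd[i]             # twiddle factor multiplication
        out[i] = even[i] + t                # length-2 butterfly
        out[i + n // 2] = even[i] - t
    return out

n = 8
omega = cmath.exp(2j * cmath.pi / n)
a = [1, 2, 3, 4, 5, 6, 7, 8]
naive = [sum(a[j] * omega ** (i * j) for j in range(n)) for i in range(n)]
assert all(abs(x - y) < 1e-9 for x, y in zip(fft(a, omega), naive))
```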
The above discussion requires several modifications in the Turing model.
Assume that elements of are represented by
bits.
First, for , we must add a
rearrangement cost of
to efficiently access the
rows and columns for the recursive subtransforms (see section 2.1).
For the general case
, the
total rearrangement cost is bounded by
.
Second, we will sometimes use non-algebraic algorithms to compute
the subtransforms, so it may not make sense to express their cost in
terms of . The relation (2.3) therefore becomes
where is the (Turing) cost of a transform of
length
over
,
and where
is the cost of a single multiplication
in
.
Finally, we point out that requires access to a
table of twiddle factors
,
ordered lexicographically by
,
for
,
. Assuming that we are given as input a precomputed
table of the form
, we must
show how to extract the required twiddle factor table in the correct
order. We first construct a list of triples
, ordered by
,
in time
; then sort by
in time
(see section 2.1);
then merge with the given root table to obtain a table
, ordered by
,
in time
; and finally sort by
in time
.
The total cost of the extraction is thus
.
The corresponding cost for is determined as
follows. Assuming that the table
is given as
input, we first extract the subtables of (
)-th
roots of unity for
in time
. Extracting the twiddle factor table for the
decomposition
then costs
; the total over all
is
again
.
Remark , as
in section 3, this requires a slight increase in the
working precision. Similar comments apply to the root tables used in
Bluestein's algorithm in section 2.5.
Let be a principal
-th
root of unity in
and assume that
is invertible in
.
Consider two polynomials
and
in
. Let
be the polynomial defined by
where the product of the DFTs is taken pointwise. By construction, we
have , which means that
for all
.
The product
of
and
modulo
also satisfies
for all
.
Consequently,
,
, whence
.
For polynomials with
and
, we thus obtain an algorithm
for the computation of
modulo
using at most
operations in
. Modular products of this type are also called
cyclic convolutions. If
,
then we may recover the product
from its
reduction modulo
. This
multiplication method is called FFT multiplication.
If one of the arguments (say )
is fixed and we want to compute many products
(or cyclic convolutions) for different
,
then we may precompute
,
after which each new product
can be computed
using only
operations in
.
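The following Python sketch spells out this FFT multiplication scheme, including the reuse of the transform of a fixed operand; a naive quadratic DFT stands in for a genuine FFT, and the final rounding assumes integer inputs:

```python
import cmath

def dft(a, omega):
    n = len(a)
    return [sum(a[j] * omega ** (i * j) for j in range(n)) for i in range(n)]

def cyclic_convolution(a, b_hat, omega):
    """Product of a and b modulo X^n - 1, given the precomputed transform b_hat of b."""
    n = len(a)
    c_hat = [x * y for x, y in zip(dft(a, omega), b_hat)]   # pointwise products
    c = dft(c_hat, 1 / omega)                               # inverse transform
    return [round((x / n).real) for x in c]                 # divide by n, round back to Z

n = 4
omega = cmath.exp(2j * cmath.pi / n)
b_hat = dft([1, 1, 0, 0], omega)            # fixed operand 1 + X, transformed once
# (2 + 3X)(1 + X) = 2 + 5X + 3X^2, with no wrap-around since the degrees are small
assert cyclic_convolution([2, 3, 0, 0], b_hat, omega) == [2, 5, 3, 0]
```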
We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [3], as follows.
Let $\omega$ be a principal $n$-th root of unity. For simplicity we assume that $n$ is even, and that there exists some $\eta \in R$ with $\eta^2 = \omega$. Consider the sequences
$$f_i := \eta^{i^2}, \qquad g_i := \eta^{-i^2}.$$
Then $\omega^{ij} = f_i f_j g_{i-j}$, so for any $i \in \{0, \ldots, n-1\}$ we have
$$\hat{a}_i = \sum_{0 \leq j < n} \omega^{ij} a_j = f_i \sum_{0 \leq j < n} (a_j f_j) \, g_{i-j}. \qquad (2.5)$$
Also, since $n$ is even,
$$g_{i+n} = \eta^{-(i+n)^2} = \eta^{-i^2} \, \omega^{-in} \, \eta^{-n^2} = g_i.$$
Now let $F := \sum_i f_i a_i X^i$, $G := \sum_i g_i X^i$ and $C := F G$ modulo $X^n - 1$. Then (2.5) implies that $\hat{a}_i = f_i C_i$ for all $i \in \{0, \ldots, n-1\}$, where $C_i$ denotes the coefficient of $X^i$ in $C$. In other words, the computation of a DFT of even length $n$ reduces to a cyclic convolution product of the same length, together with $O(n)$ additional operations in $R$. Notice that the polynomial $G$ is fixed and independent of $a$ in this product.
The only complication in the Turing model is the cost of extracting the
in the correct order, i.e., in the order
, given as input a precomputed
table
. We may do this in
time
by applying the strategy from section 2.3 to the pairs
for
. Similar remarks apply to the
.
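In the notation above ($\omega = \eta^2$, $f_i = \eta^{i^2}$, $g_i = \eta^{-i^2}$), a small Python sketch of Bluestein's reduction reads as follows; the cyclic convolution is computed naively here, whereas the algorithms of this paper convert it to integer multiplication:

```python
import cmath

def dft_bluestein(a, eta):
    """DFT of a with respect to omega = eta^2, via the cyclic convolution (2.5)."""
    n = len(a)
    f = [eta ** (i * i) for i in range(n)]
    g = [eta ** (-(i * i)) for i in range(n)]           # periodic: g_(i+n) = g_i for even n
    # c = (sum_j f_j a_j X^j) * (sum_j g_j X^j) modulo X^n - 1, computed naively
    c = [sum(f[j] * a[j] * g[(i - j) % n] for j in range(n)) for i in range(n)]
    return [f[i] * c[i] for i in range(n)]

n = 6                                                    # even, but not a power of two
eta = cmath.exp(2j * cmath.pi / (2 * n))                 # eta^2 = omega, a principal n-th root
omega = eta * eta
a = [1, 2, 3, 4, 5, 6]
naive = [sum(a[j] * omega ** (i * j) for j in range(n)) for i in range(n)]
assert all(abs(x - y) < 1e-9 for x, y in zip(dft_bluestein(a, eta), naive))
```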
Remark
Multiplication in may be reduced to
multiplication in
using the classical technique
of Kronecker substitution [21, Corollary 8.27].
More precisely, let
and
, and suppose that we are given two polynomials
of degree less than
,
with coefficients
and
satisfying
and
.
Then for the product
we have
. Consequently, the coefficients of
may be read off the integer product
where
. Notice that the
integers
and
have bit
length at most
, and the
encoding and decoding processes have complexity
.
The inverse procedure is Kronecker segmentation. Given and
, and
non-negative integers
and
, we may reduce the computation of
to the computation of a product
of two
polynomials
of degree less than
, and with
and
where
.
Indeed, we may cut the integers into chunks of
bits each, so that
,
and
.
Notice that we may recover
from
using an overlap-add procedure in time
.
In our applications, we will always have
,
so that
.
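Conversely, here is a small Python sketch of Kronecker segmentation with overlap-add recovery; the chunk size and the schoolbook polynomial product (standing in for the FFT-based product) are illustrative choices:

```python
def segment(u: int, b: int, m: int):
    """Cut the non-negative integer u into m chunks of b bits, so that u = U(2^b)."""
    mask = (1 << b) - 1
    return [(u >> (b * i)) & mask for i in range(m)]

def overlap_add(coeffs, b: int) -> int:
    """Evaluate a coefficient list at 2^b, pasting overlapping chunks back together."""
    return sum(c << (b * i) for i, c in enumerate(coeffs))

def multiply_via_segmentation(u: int, v: int, b: int = 8) -> int:
    m = max(u.bit_length(), v.bit_length(), 1) // b + 1
    U, V = segment(u, b, m), segment(v, b, m)
    W = [0] * (2 * m)                           # schoolbook product of the two polynomials
    for i, ui in enumerate(U):
        for j, vj in enumerate(V):
            W[i + j] += ui * vj
    return overlap_add(W, b)

assert multiply_via_segmentation(123456789, 987654321) == 123456789 * 987654321
```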
Kronecker substitution and segmentation can also be used to handle
Gaussian integers (and Gaussian integer polynomials), and to compute
cyclic convolutions. For example, given polynomials
with
, then for
we have
, so we
may recover
from the cyclic Gaussian integer
product
, where
. In the other direction, suppose that we wish
to compute
for some
. We may assume that the “real” and
“imaginary” parts of
and
are non-negative, and so reduce to the problem of
multiplying
, where
and
, and where
the real and imaginary parts of
are non-negative
and have at most
bits.
In this section, we consider the computation of DFTs over in the Turing model. Elements of
can only be represented approximately on a Turing machine. We describe
algorithms that compute DFTs approximately, using a fixed-point
representation for
, and we
give complexity bounds and a detailed error analysis for these
algorithms. We refer the reader to [7] for more details
about multiple precision arithmetic.
For our complexity estimates we will freely use the standard observation
that , since the
multiplication of two integers of bit length
reduces to
multiplications of integers of bit
length
, for any fixed
.
We will represent fixed point numbers by a signed mantissa and a fixed
exponent. More precisely, given a precision parameter , we denote by
the set
of complex numbers of the form
,
where
for integers
and
satisfying
,
i.e.,
. We write
for the set of complex numbers of the form
, where
and
; in particular, for
we always have
.
At every stage of our algorithms, the exponent
will be determined implicitly by context, and in particular, the
exponents do not have to be explicitly stored or manipulated.
In our error analysis of numerical algorithms, each
is really the approximation of some genuine complex number
. Each such
comes with
an implicit error bound
;
this is a real number for which we can guarantee that
. We also define the relative error bound for
by
.
We finally denote by
the “machine
accuracy”.
Remark .
We will use similar formulas for the computation of
and
, but we will not
actually store the bounds during computations.
In this section we give error bounds and complexity estimates for fixed point addition, subtraction and multiplication, under certain simplifying assumptions. In particular, in our DFTs, we only ever need to add and subtract numbers with the same exponent. We also give error bounds for fixed point convolution of vectors; the complexity of this important operation is considered later.
For , we define the
“round towards zero” function
by
if
and
if
. For
, we define
.
Notice that
and
for any
.
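The following Python fragment illustrates the flavour of this fixed-point arithmetic: truncation towards zero to $p$ fractional bits, one rounding per operation, and an absolute error of at most a few multiples of $2^{-p}$. The precision and the error bound checked here are illustrative, not the precise constants of the propositions below:

```python
from fractions import Fraction

P = 16                                          # working precision in bits (illustrative)

def fix(x: Fraction) -> Fraction:
    """Round x towards zero to a multiple of 2^-P."""
    return Fraction(int(x * 2 ** P), 2 ** P)    # int() truncates towards zero

def fix_mul(x: Fraction, y: Fraction) -> Fraction:
    """Fixed point product: exact product followed by one rounding."""
    return fix(x * y)

x, y = Fraction(1, 3), Fraction(2, 7)
z = fix_mul(fix(x), fix(y))                     # round the inputs, multiply, round again
assert abs(z - x * y) <= 3 * Fraction(1, 2 ** P)
```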
.
Define the fixed point sum and difference
by
. Then
and
can be computed in time
, and
Proof. We have
and
whence .
Proof. We have
and
Consequently, .
Proposition 3.3 may be generalised to numerical cyclic convolution of vectors as follows.
Proof. Let denote the
exact convolution, and write
and
. As in the proof of Proposition 3.3,
we obtain
and
The proof is concluded in the same way as Proposition 3.3.
Let and
.
Let
be the branch of the square root function
such that
for
.
Using Newton's method [7, Section 3.5] and
Schönhage–Strassen multiplication [47], we may
construct a fixed point square root function
, which may be evaluated in time
, such that
for all
. For example, we may first
compute some
such that
and
, and then take
; the desired bound follows since
.
Proof. The mean value theorem implies that where
.
For
we have
;
hence
. By construction
. We conclude that
.
Proof. It suffices to compute . Starting from
and
, for each
, we compute
for
using
if
and
otherwise. Performing all computations with
temporarily increased precision
and
corresponding
, Lemma 3.5 yields
. This
also shows that the hypothesis
is always
satisfied, since
. After
rounding to
bits, the relative error is at most
.
A tight algorithm for computing DFTs of length
is a numerical algorithm that takes as input an
-tuple
and computes an
approximation
to the DFT of
with respect to
(or
in
the case of an inverse transform), such that
We assume for the moment that any such algorithm has at its disposal all
necessary root tables with relative error not exceeding . Propositions 3.2 and 3.3
directly imply the following:
that
computes a DFT of length
using the formula
is tight.
Proof. We have .
,
and let
and
be tight
algorithms for computing DFTs of lengths
and
. Then
is a tight algorithm for computing DFTs of length
.
Proof. The inner and outer DFTs contribute
factors of and
,
and by Proposition 3.3 the twiddle factor multiplications
contribute a factor of
. Thus
.
Then
is a tight algorithm for computing DFTs
of length
over
,
whose complexity is bounded by
.
In this section we give the simplest version of the new integer
multiplication algorithm. The key innovation is an alternative method
for computing DFTs of small length. This new method uses a combination
of Bluestein's chirp transform and Kronecker substitution (see sections
2.5 and 2.6) to convert the DFT to a cyclic
integer product in for suitable
.
.
There exists a tight algorithm
for computing
DFTs of length
over
, whose complexity is bounded by
.
Proof. Let ,
and suppose that we wish to compute the DFT of
. Using Bluestein's chirp transform (notation as in
section 2.5), this reduces to computing a cyclic
convolution of suitable
and
. We assume that the
and
have been precomputed with
.
We may regard and
as
cyclic polynomials with complex integer coefficients, i.e., as
elements of
. Write
and
, where
with
and
. Now we compute the exact product
using Kronecker substitution. More precisely, we have
, so it suffices to compute
the cyclic integer product
,
where
. Then
is the exact convolution of
and
, and rounding
to
precision
yields
in the
sense of Proposition 3.4. A final multiplication by
yields the Fourier coefficients
.
To establish tightness, observe that and
, so Proposition 3.4
yields
where
;
we conclude that
. For
, this means that the algorithm is
tight; for
, we may take
.
For the complexity, observe that the product in
reduces to three integer products of size
.
These have cost
, and the
algorithm also performs
multiplications in
, contributing the
term.
Remark , then to
compute a transform of length
over
with
, the
Cooley–Tukey approach has complexity
,
whereas Proposition 4.1 yields
, an improvement by a factor of roughly
.
Proof. We first reduce our integer product to a
polynomial product using Kronecker segmentation (section 2.6).
Splitting the two -bit inputs
into chunks of
bits, we need to compute a
product of polynomials
with non-negative
-bit coefficients and degrees less
than
. The coefficients of
have
bits, and we may
deduce the desired integer product
in time
.
Let . To compute
, we will use DFTs of length
over
, where
. Zero-padding
to
obtain a sequence
, and
similarly for
, we compute
the transforms
with respect to
as follows.
Let and
.
Write
with
for
and
. We use
the algorithm
(see section 2.3),
where for
we take
to be
the tight algorithm
for DFTs of length
given by Proposition 4.1, and where
is
as in Corollary 3.9.
In other words, we split the
usual radix-2
layers of the FFT into groups of
layers,
handling the transforms in each group with the Bluestein–Kronecker
reduction, and then using ordinary Cooley–Tukey for the remaining
layers.
We next compute the pointwise products ,
and then apply an inverse transform
defined
analogously to
. A final
division by
(which is really just an implicit
adjustment of exponents) yields approximations
.
Since and
are tight by
Propositions 3.8, 4.1 and Corollary 3.9,
we have
, and similarly for
. Thus
, so
after the inverse
transform (since
for
). In particular,
,
so we obtain the exact value of
by rounding to
the nearest integer.
Now we analyse the complexity. Using Proposition 3.6, we
first compute a table of roots in time
, and then extract the required
twiddle factor tables in time
(see section 2.3). For the Bluestein reductions, we may extract a table of
-th roots in time
, and then rearrange them as
required in time
(see section 2.5).
These precomputations are then all repeated for the inverse transforms.
By Corollary 3.9, Proposition 4.1 and (2.4), each invocation of (or
) has cost
The cost of the pointwise multiplications is
subsumed within this bound.
It is now a straightforward matter to recover Fürer's bound.
Proof. Let for
. By Theorem 4.3,
there exists
and
such
that
for all . Let
for
,
. Increasing
if necessary, we may assume that
for
, so that the function
is well-defined. Increasing
if
necessary, we may also assume that
for all
.
We prove by induction on that
for all
. If
, then
,
so the bound holds. Now suppose that
.
Since
, we have
, so by induction
.
Finally, since , we have
, so
for
.
This section is devoted to developing a framework for handling recurrence inequalities, similar to (4.1), that appear in subsequent sections.
Let be a smooth increasing function, for some
. We say that
is an iterator of
if
is increasing and if
for all sufficiently large .
For instance, the standard iterated logarithm
defined in (1.2) is an iterator of
. An analogous iterator may be defined for any
smooth increasing function
for which there
exists some
such that
for all
. Indeed, in that
case,
is well-defined and satisfies (5.1) for all . It will sometimes be convenient to increase
so that
is satisfied on
the whole domain of
.
We say that is logarithmically slow if
there exists an
such that
for . For example, the
functions
,
,
and
are logarithmically slow, with
respectively.
be a logarithmically slow
function. Then there exists
such that
for all
.
Consequently all logarithmically slow functions admit iterators.
Proof. The case is
clear. For
, let
. By induction
for
large
, so
for large
.
In this paper, the main role played by logarithmically slow functions is
to measure size reduction in multiplication algorithms. In other
words, multiplication of objects of size will be
reduced to multiplication of objects of size
, where
for some
logarithmically slow function
.
The following result asserts that, from the point of view of iterators,
such functions are more or less interchangeable with
.
Proof. First consider the case where in (5.2), i.e., assume that
for some constant
and all
. Increasing
and
if necessary, we may assume that
for all
, and that
.
We claim that
for all . Indeed, if
, then
Now, given any , let
, so
.
For any
we have
,
so
-fold iteration of (5.3), starting with
,
yields
Moreover this shows that for
, so
.
Since
and
,
we obtain
.
Now consider the general case .
Let
, so that
is an iterator of
.
By the above argument
, and
so
.
The next result, which generalises and refines the argument of Theorem 4.4, is our main tool for converting recurrence inequalities into actual asymptotic bounds for solutions. We state it in a slightly more general form than is necessary for the present paper, anticipating the more complicated situation that arises in [24].
,
and
.
Let
, and let
be a logarithmically slow function such that
for all
.
Then there exists a positive constant
(depending on
,
,
,
and
)
with the following property.
Let and
.
Let
, and let
be any function satisfying the following recurrence.
First,
for all
,
. Second, for all
,
,
there exist
with
,
and weights
with
,
such that
Then we have for all
,
.
Proof. Let ,
,
and
be as above. Define
for
. We claim that there exists
, depending only on
and
, such that
for all . Indeed, let
. First suppose
, so that
.
For any
, we have
, so
and hence . This last
inequality also clearly holds if
(since
). By Lemma 5.2 we
obtain
.
Define a sequence of real numbers by the formula
We claim that
for all . Indeed, let
. If
then
(5.5) holds as
.
If
then
by (5.4),
so
and hence
.
Now let . We will prove by
induction on
that
for all . The base case
, i.e.,
, holds by assumption. Now assume that
, so
.
By hypothesis there exist
,
, and
with
, such that
Since , we obtain
Finally, the infinite product
certainly converges, so we have for
. Setting
,
by (5.4) we obtain
for all
.
In this section, we present an optimised version of the new integer multiplication algorithm. The basic outline is the same as in section 4, but our goal is now to minimise the “expansion factor” at each recursion level. The necessary modifications may be summarised as follows.
Since Bluestein's chirp transform reduces a DFT to a complex cyclic
convolution, we take the basic recursive problem to be complex
cyclic integer convolution, i.e., multiplication in , rather than ordinary integer
multiplication.
In multiplications involving one fixed operand, we reuse the transform of the fixed operand.
In a convolution of length with input
coefficients of bit size
,
the size of the output coefficients is
,
so the ratio of output to input size is
. We increase
from
to
,
so as to reduce the inflation ratio from
to
.
We increase the “short transform length” from to
. The
complexity then becomes dominated by the Bluestein–Kronecker
multiplications, while the contribution from ordinary arithmetic in
becomes asymptotically negligible. (As noted
in section 1, this is precisely the opposite of what
occurs in Fürer's algorithm.)
We begin with a technical preliminary. To perform multiplication in efficiently using FFT multiplication, we need
to be divisible by a high power of two. We say that
an integer
is admissible if
, where
(note that
for all
).
We will need a function that rounds a given
up
to an admissible integer. For this purpose we define
for
. Note that
may be computed in time
.
Proof. We have ,
which implies (6.1). Since
and
, we have
and thus
, i.e.,
. In particular
,
so
is admissible. (In fact, one easily checks
that
is the smallest admissible integer
).
Remark be
divisible by a high power of two, by using the Crandall–Fagin
method (see section 9). We prefer to avoid this approach in
this section, as it adds an unnecessary layer of complexity to the
presentation.
Now let be admissible, and consider the problem
of computing
products
with
, i.e.,
products with one fixed operand. Denote the cost of this operation by
. Our algorithm for this
problem will perform
forward DFTs and
inverse DFTs, so it is convenient to introduce the
normalisation
This is well-defined since clearly .
Roughly speaking,
may be thought of as the
notional cost of a single DFT.
The problem of multiplying -bit
integers may be reduced to the above problem by using zero-padding,
i.e., by taking
and
. Since
and
, we obtain
.
Thus it suffices to obtain a good bound for
.
The recursive step in the main multiplication algorithm involves
computing “short” DFTs via the Bluestein–Kronecker
device. As pointed out in section 2.5, this leads to a
cyclic convolution with one fixed operand. To take advantage of the
fixed operand, let denote the cost of computing
independent DFTs of length
over
, and let
. Then we have the following refinement of
Proposition 4.1. As usual we assume that the necessary
Bluestein root table has been precomputed.
,
and assume that
divides
. Then there exists a tight algorithm
for computing DFTs of length
over
, with
Proof. We use the same notation and algorithm as
in the proof of Proposition 4.1, except that in the
Kronecker substitution we take ,
so that the resulting integer multiplication takes place in
. The proof of tightness is identical to that
of Proposition 4.1 (this is where we use the assumption
). For the complexity bound,
note that
is admissible by construction, so for
any
we have
Here we have
used the fact that
is fixed over all these
multiplications. Dividing by
and taking suprema
over
yields the result.
The next result gives the main recurrence satisfied by
(compare with Theorem 4.3).
and a logarithmically
slow function
with the following property. For
all admissible
, there
exists an admissible
such that
Proof. Let be admissible
and sufficiently large, and consider the problem of computing
products
, for
. Let
, so that
,
and let
.
We cut the inputs into chunks of size
, i.e., if
is one of the
inputs, we write
, where
,
and where the real and imaginary parts of
have
absolute value at most
. Thus
, and for any
we may encode
as a polynomial
.
We will multiply the desired (cyclic) polynomials by using DFTs of
length over
where
. We construct the DFTs in a
similar way to section 4. Let
and
. Write
with
for
and
. We use the tight algorithm
, where for
we take
to be the tight algorithm
for DFTs of length
given by
Proposition 6.3, and where
is
as in Corollary 3.9. Thus, for the first
groups of
layers, we use
Bluestein–Kronecker to reduce to complex integer convolution of
size
, and the remaining
layers are handled using ordinary Cooley–Tukey. We write
for the analogous inverse transform.
To check the hypothesis of Proposition 6.3, we observe that
for sufficiently large
, as
is divisible by
where
,
and
Denote by the cost of a single invocation of
(or
).
By Corollary 3.9 and (2.4), we have
The last term is the rearrangement cost, and simplifies to . The second term covers the invocations of
, and simplifies to
, so is absorbed by the
term. The first term covers the invocations of
. By definition
, and since
,
Proposition 6.3 yields
Thus
We will use Schönhage–Strassen's algorithm for fixed point
multiplications in . Since
, we may take
. Thus the
term becomes
(We could of course use our algorithm recursively for these
multiplications; however, it turns out that
Schönhage–Strassen is fast enough, and leads to simpler
recurrences. In fact, the algorithm asymptotically spends more time
rearranging data than multiplying in !)
Since , and since
, by Lemma 6.1 we have
We also have and
,
so
Thus
and consequently
To compute the desired products, we must execute
forward transforms and
inverse transforms. For each product, we must also perform
pointwise multiplications in
,
at cost
. As in the proof of
Theorem 4.3, the cost of all necessary root table
precomputations is also bounded by
.
Thus we obtain
Dividing by and taking suprema yields the bound
(6.2).
The error analysis is almost identical to the proof of Theorem 4.3,
the only difference being that is replaced by
. Denoting one of the
products by
,
we have
exactly as in Theorem 4.3.
Thus
, and again we obtain
by rounding to the nearest integer.
Finally we show how to define .
We already observed that
.
Thus there exists a constant
such that
for large
, so
we may take
.
Now we may prove the main theorem announced in the introduction.
Proof of Theorem 1.1. Let and
be as in Theorem 6.4.
Increasing
if necessary, by Lemma 5.1
we may assume that
for
, and that
.
Let for admissible
.
By the theorem, there exist constants
such that
for all admissible
, there
exists an admissible
with
Increasing if necessary, we may also assume that
for all admissible
.
Taking
to be the set of admissible integers, we
apply Proposition 5.3 with
,
,
, and for each admissible
setting
,
,
and
as above. We conclude that
,
and hence
as
runs over
admissible integers. We already pointed out that
.
As pointed out in the introduction, Fürer proved that $M(n) = O(n \log n \, K^{\log^* n})$ for some constant $K > 1$, but did not give an explicit bound for $K$. In this section we sketch an argument showing that one may achieve $K = 16$ in Fürer's algorithm, by reusing tools from previous sections, especially section 6.
At the core of Fürer's algorithm is the ring , which contains the principal
-th root of unity
.
Note that
is a direct sum of
copies of
, and hence not a
field (for
). A crucial
observation is that
is a “fast” root
of unity, in the sense that multiplication by
and its powers can be achieved in linear time, as in
Schönhage–Strassen's algorithm. For any
, we need to construct a
-th root
of
, which is itself a
-th principal root of unity. We recall Fürer's
construction of
as follows.
as above, let
and
. Then
is a principal -th root
of unity with
. The
coefficients of
have absolute value
.
Proof. See [19, Section 4].
As our basic recursive problem, we will consider multiplication in , where
is
divisible by a high power of two. We will refer to the last property as
“admissibility”, but we will not define it precisely. We
write
for the cost of
such products with one fixed argument, and
for
the normalised cost, exactly as in section 6.
Fürer worked with rather than
, but, since we are interested in constant
factors, and since the recursive multiplication step involves
multiplication of complex quantities, it simplifies the exposition to
work systematically with complexified objects everywhere.
For suitable parameters and
, we will encode elements of
as (nega)cyclic polynomials in
,
where
as above. We choose the parameters later;
for now we require only that
divides
and that
(so that the coefficients
are not too small).
The encoding proceeds as follows. Given ,
we split
into
parts
of
bits. Each
is cut into
even smaller pieces
of
bits. Then
is encoded as
and an element is encoded as
. (Notice that the coefficients of
are zero for
;
this zero-padding is the price Fürer pays for introducing
artificial roots of unity.)
We represent complex coefficients by elements of
for a suitable precision parameter
.
The exponent
varies during the algorithm, as
explained in [19]; nevertheless, additions and subtractions
only occur for numbers with the same exponent, as in the algorithms from
sections 4 and 6.
Given , to successfully
recover the product
from the polynomial product
, we must choose
, where
is an allowance
for numerical error. Certainly
,
and, as shown by Fürer, we may also take
(an analogous conclusion is reached in sections 4 and 6). Thus we may assume that
.
We must now show how to compute a product ,
for
. Fürer handles
these types of multiplications using “half-DFTs”, i.e., DFTs
that evaluate at odd powers of
,
where
is a principal
-th root of unity such that
(Lemma 7.1). To keep terminology and notation consistent
with previous sections, we prefer to make the substitution
, i.e., writing
,
we put
, and similarly for
and
.
This reduces the problem to computing the product
in
. The change of variable
imposes a cost of
, where
is the cost of a multiplication in
.
So now consider a product ,
where
. Let
, so that
.
Let
, and write
with
for
and
. For each
, let
be the algorithm
for DFTs of length
that applies the usual
Cooley–Tukey method, taking advantage of the fast
-th root of unity
The
complexity of
is
,
since it performs
linear-time operations on
objects of bit size
. Let
be the complexity of the algorithm
for DFTs of length
over
. Then (2.4) yields
The first term is bounded by ,
since
.
Let us now consider the second term ,
which describes the cost of the twiddle factor multiplications. This
term turns out to be the dominant one. Both Kronecker substitution and
FFT multiplication may be considered for multiplication in
, but it turns out that Kronecker substitution
is faster (a similar phenomenon was noted in Remark 4.2).
So we reduce multiplication in
to multiplication
in
where
is admissible
and divisible by
. For any
reasonable definition of admissibility we then have
, provided that
is
somewhat smaller than
. (In
the interests of brevity, we will not specify the
terms for the remainder of the argument. They can all be controlled
along the lines of section 6.) Most of the twiddle factors
are reused many times, so we will assume that
, where the factor
counts
the two (rather than three) DFTs needed for each multiplication of size
. The term of interest then
becomes
Since and
,
this yields
To minimise the leading constant, we must choose
to grow faster than
, and
to grow faster than
. For example, taking
and
leads to
and
. The function mapping
to
is then bounded by a
logarithmically slow function, and a similar argument to section 6 shows that
.
Shortly after Fürer's algorithm appeared, De et al [15]
presented a variant based on modular arithmetic that also achieves the
complexity bound for some
. Roughly speaking, they replace the coefficient
ring
with the field
of
-adic numbers, for a suitable
prime
. In this context,
working to “finite precision” means performing computations
in
, where
is a precision parameter.
The main advantage of this approach is that the error analysis becomes
trivial; indeed is a ring (unlike our
), and arithmetic operations never
lead to precision loss (unless one divides by
, which never happens in these algorithms). The main
disadvantage is that there are certain technical difficulties associated
with finding an appropriate
;
this is discussed in section 8.2 below.
The aim of this section is to sketch an analogue of the algorithm of
section 6 that achieves using
modular arithmetic instead of
.
We assume familiarity with
-adic
numbers, referring the reader to [22] for an elementary
introduction.
For the basic problem, we take multiplication in , where
is admissible (in
the sense of section 6) and where one of the arguments is
fixed over
multiplications. As before, we take
, and cut the inputs into
chunks of
bits. Thus we reduce to multiplying
polynomials in
with coefficients of at most
bits. The coefficients of the product have at most
bits.
Let be a prime such that
, so that
contains a
primitive
-th root of unity
. The problem of finding such
and
is discussed in the
next section; for now we assume only that
.
We may then embed the multiplication problem into
, and use DFTs with respect to
to compute the product. On a Turing machine, we cannot represent
elements of
exactly, so we perform all
computations in
where
This choice ensures that , so
knowledge of the product in
determines it
unambiguously in
.
To compute each DFT, we first use the Cooley–Tukey algorithm to
decompose it into “short transforms” of length , where
.
(As in section 6, there are also residual transforms of
length
for some
,
whose contribution to the complexity is negligible.) Multiplications in
, such as the multiplications
by twiddle factors, are handled using Schönhage–Strassen's
algorithm, with the divisions by
being reduced
to multiplication via Newton's method. We then use Bluestein's algorithm
to convert each short transform to a cyclic convolution of length
over
, and
apply Kronecker substitution to convert this to multiplication in
, where
is
the smallest admissible integer exceeding
.
This multiplication is then handled recursively.
Now, since ,
,
and
, we have
,
and hence
, just as in
section 6. The rest of the complexity analysis follows
exactly as in the proof of Theorem 6.4, except for the
computation of
and
,
which is considered below.
Remark is to give some extra
flexibility regarding the choice of
.
If there was an efficient way to find a prime
larger than
(but not too much larger), and an
efficient way to find a suitable
-th
root of unity modulo
, then
we could always take
and obtain an algorithm
working directly over the finite field
.
Given a transform length for
, our aim is to find a prime
such that
, i.e., such that
divides
.
Denote by
the smallest such prime.
Heath-Brown has conjectured that [25],
but given the current state of knowledge in number theory, we are only
able to prove a result of the following type.
Proof. This is a special case of Linnik's
theorem [34, 35], which states that there
exist constants and
such
that for any
with
,
there exists a prime number
with
. The best currently known estimate
for
is due to Xylouris [53].
Applying this result for
and
, we get the bound
for
large enough
. The complexity
bound follows by testing
for primality until we
find
, using a polynomial
time primality test [1].
The difficulty with this result — already noted in [15]
— is that the time required to find
greatly exceeds the time bound we are trying to prove for
!
To avoid this problem, De et al suggested using a multivariate
splitting, i.e., by encoding each integer as a polynomial in for suitable
,
say
. One then uses
-dimensional DFTs to multiply the
polynomials. Since the transform length is shorter, one can get away
with a smaller
.
Unfortunately, this introduces further zero-padding and leads to a
larger value of
, ruining our
attempt to achieve the bound
.
On the other hand, we note that the problem only really occurs at the
top recursion level. Indeed, at deeper recursion levels, there is
exponentially more time available at the previous level to
compute . So one possible
workaround is to use a different, sufficiently fast algorithm at the top
level, such as Fürer's algorithm, and then switch to the algorithm
sketched in section 8.1 for the remaining levels. In this
way one still obtains the bound
,
and asymptotically almost all of the computation is done using the
algorithm of section 8.1.
If one insists on avoiding entirely, there are
still many choices: one could use the algorithm of De et al at the top
level, or use a multivariate version of the algorithm of section 8.1. One could even use the Schönhage–Strassen
algorithm, whose main recursive step yields the bound
; applying this three times gives
, and then to multiply integers with
bits, one can find a suitable prime using Lemma 8.2
in time
.
Another way to work around the problem is to assume the generalised Riemann hypothesis (GRH). De et al pointed out that under GRH, it is possible to find a suitable prime efficiently using a randomised algorithm. Here we show that, under GRH, we can even use deterministic algorithms.
Proof. The first bound is given in [26],
and the complexity bound follows similarly to the proof of Lemma 8.2.
To use this result, we must modify the algorithm of section 8.1
slightly. Choose a constant so that we can
compute
in time
,
as in Lemma 8.3. Increase the coefficient size from
to
, and
change the definition of admissibility accordingly. The transform length
then decreases to
, and the
cost of computing
decreases to only
. The rest of the complexity analysis is
essentially unchanged; the result is an algorithm with complexity
, working entirely with modular
arithmetic, in which the top recursion level does not need any special
treatment.
Finally, we consider the computation of a suitable approximation to a
-th root of unity in
.
and a prime
,
we may find
such that
for some primitive
-th root
of unity
, in time
for any
.
Proof. We may find a generator
of
deterministically in time
[48]. Then
is a primitive
-th root of unity in
, and there is a unique primitive
-th root of unity
congruent to
modulo
. Given
,
we may compute
using fast Newton lifting in time
[9, Section 12.3].
In the context of section 8.1, we may assume that and
, so the
cost of finding
is
.
This is certainly less than the cost of finding
itself, using either Lemma 8.2 or Lemma 8.3.
It is natural to ask whether the approaches from sections 6,
7 or 8 can be further optimised, to obtain a
complexity bound with
.
In Fürer's algorithm, the complexity is dominated by the cost of
multiplications in . If we
could use a similar algorithm for a much simpler
, then we might achieve a better bound. Such an
algorithm was actually given by Fürer [17], under the
assumption that there exist sufficiently many Fermat primes, i.e.,
primes of the form
. More
precisely, his algorithm requires that there exists a positive integer
such that for every
, the sequence
contains a
prime number. The DFTs are then computed directly over
for suitable
, taking
advantage of the fact that
contains a fast
-th primitive root of unity (namely
the element 2) as well as a
-th
primitive root of unity. It can be shown that a suitably optimised
version of this hypothetical algorithm achieves
: we still pay a factor of two due to the fact that
we compute both forward and inverse transforms, and we pay another
factor of two for the zero-padding in the recursive reduction.
Unfortunately, it is likely that
is the last
Fermat prime [13].
In the algorithm of section 6, a
potential bottleneck arises during the short transforms, when we use
Kronecker substitution to multiply polynomials in
. We really only need the high
bits of each coefficient of the product (i.e., of the real and imaginary
parts), but we are forced to allocate roughly
bits per coefficient in the Kronecker substitution, and then we discard
roughly half of the output. This problem is similar to the well-known
obstruction that prevents us from using FFT methods to compute a
“short product”, i.e., the high
bits
or low
bits of the product of two
-bit integers, any faster than computing the
full
bits.
In this section, we present a variant of the algorithm of section 6, in which the coefficient ring $\mathbb{C}$ is replaced by a finite field $\mathbb{F}_p[\mathrm{i}]$, where $p = 2^q - 1$ is a Mersenne prime. Thus “short products” are replaced by “cyclic products”, namely by multiplications modulo $2^q - 1$. This saves a factor of two at each recursion level, and consequently reduces $K$ from $8$ to $4$.
This change of coefficient ring introduces several technical complications. First, it is of course unknown if there are infinitely many Mersenne primes. Thus we are forced to rely on unproved conjectures about the distribution of Mersenne primes.
Second, is always prime (except possibly at the
top recursion level). Thus we cannot cut up an element of
into equal-sized chunks with an integral number of bits,
and still expect to take advantage of cyclic products. In other words,
is very far from being admissible in the sense
of section 6. To work around this, we deploy a variant of an algorithm
of Crandall and Fagin [12], which allows us to work with
chunks of varying size. The Crandall–Fagin algorithm was
originally presented over
,
and depended crucially on the fact that
contains
suitable roots of
. In our
setting, we work over
, where
is a Mersenne prime exponentially smaller than
. Happily,
contains suitable roots of
,
and this enables us to adapt their algorithm to our setting. Moreover,
since
, the field
contains roots of unity of high power-of-two order, namely
of order
, so we can perform
FFTs over
very efficiently.
Finally, we can no longer use Kronecker substitution, as this would
reintroduce the very zero-padding we are trying to avoid. Instead, we
take our basic problem to be polynomial multiplication over (where
is not necessarily
prime). After the Crandall–Fagin splitting step, we have a
bivariate multiplication problem over
, which is solved using 2-dimensional FFTs over
. These FFTs are in turn
reduced to 1-dimensional FFTs using standard methods; this dimension
reduction is, roughly speaking, the analogue of Kronecker substitution
in this algorithm. (Indeed, it is also possible to give an algorithm
along these lines that works over
but avoids
Kronecker substitution entirely; this still yields
because of the “short product” problem mentioned above.) For
the 1-dimensional transforms, we use the same technique as in previous
sections: we use Cooley–Tukey's algorithm to decompose them into
“short transforms” of exponentially shorter length, then use
Bluestein's method to convert them to (univariate) polynomial products,
and finally evaluate these products recursively.
Let denote the number of Mersenne primes less
than
. Based on probabilistic
arguments and numerical evidence, Lenstra, Pomerance and Wagstaff have
conjectured that
as , where
is the Euler constant [52, 40]. Our fast
multiplication algorithm relies on the following slightly weaker
conjecture.
. For any integer
, there exists a Mersenne prime
in the interval
.
Given
, we may compute the
smallest such
, and find a
primitive
-th root of unity
in
, in time
.
Proof. The required prime exists since for we have
An integer of the form may be tested for
primality in time
using the Lucas–Lehmer
primality test [13]. A simple way to compute
is to apply this test successively for all
; this takes time
.
A primitive
-th root of unity
may be computed by the formula
in time
; see [43]
or [14, Corollary 5].
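For reference, the Lucas–Lehmer test used in this proof is easy to state in Python; the sketch below uses schoolbook arithmetic, so it does not reflect the quoted complexity bound:

```python
def is_mersenne_prime(q: int) -> bool:
    """Lucas-Lehmer test: for an odd prime q, the Mersenne number 2^q - 1 is prime
    iff s == 0 after q - 2 iterations of s -> s^2 - 2 modulo 2^q - 1, starting from s = 4."""
    if q == 2:
        return True                              # 2^2 - 1 = 3 is prime
    m = (1 << q) - 1
    s = 4
    for _ in range(q - 2):
        s = (s * s - 2) % m
    return s == 0

# Mersenne exponents among the primes below 40: 2, 3, 5, 7, 13, 17, 19, 31
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
assert [q for q in primes if is_mersenne_prime(q)] == [2, 3, 5, 7, 13, 17, 19, 31]
```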
Let be a Mersenne number (not necessarily
prime). The main integer multiplication algorithm depends on a variant
of Crandall and Fagin's algorithm that reduces multiplication in
to multiplication in
,
where
is a suitably smaller Mersenne prime
(assuming that such a prime exists).
To explain the idea of this reduction, we first consider the simpler
univariate case, in which we reduce multiplication in
to multiplication in
. Here
we require that
, that
and that
.
For any
, we will write
and
.
Assume that we wish to compute the product of . Considering
and
as elements of
modulo
, we decompose them as
\[ a = \sum_{0 \le i < r} a_i\, 2^{e_i}, \qquad b = \sum_{0 \le i < r} b_i\, 2^{e_i}, \tag{9.1} \]
where
We regard and
as complex
“digits” of
and
, where the base
varies
with the position
. Notice
that
takes only two possible values:
or
.
For , let
so that . For any
, define
as
follows. Choose
so that
lies in the interval
, and
put
From (9.2), we have
Since the left hand side lies in the interval , this shows that
.
Now, since
and
,
we have
where
Since and similarly for
, we have
.
Note that we may recover
from
in time
, by a standard
overlap-add procedure (provided that
).
Let $\lambda$ be the inverse of $r$ modulo $q$; this inverse exists since we assumed $\gcd(r, q) = 1$. Let $\theta := 2^{\lambda} \in \mathbb{F}_p$, so that $\theta^{r} = 2$, since $2$ has order $q$ in $\mathbb{F}_p^{*}$. The quantity $\theta$ plays the same role as the real $r$-th root of $2$ appearing in Crandall–Fagin's algorithm.
Now define polynomials by
and
for
,
and let
be their (cyclic) product. Then
coincides with the reinterpretation of as an
element of
. Moreover, we may
recover
unambiguously from
, as
and
. Altogether, this shows how to reduce
multiplication in
to multiplication in
.
Remark 9.3. Each exponent $e_i$ in the above decomposition can be computed from its predecessor in $O(\log N)$ bit operations, so we may compute the sequences of exponents and of chunk sizes in time $O(r \log N)$. Moreover, since the chunk size takes on only two possible values, we may compute the sequence of required powers of $\theta$ using $O(r)$ multiplications in $\mathbb{F}_p$.
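As an illustration of the reduction just described, here is a minimal Python sketch: it performs the variable-base splitting, multiplies the chunks by powers of the $r$-th root $\theta$ of $2$, takes a cyclic convolution over $\mathbb{F}_p$, removes the weights, and reassembles the result by overlap-add. For simplicity the sketch multiplies two ordinary integers modulo $2^N - 1$; all identifiers and the toy parameters ($N = 23$, $q = 17$, $r = 4$) are ours, and the convolution is computed naively, whereas the algorithm of this section evaluates it via fast transforms.

```python
def crandall_fagin_multiply(a, b, N, q, r):
    """Multiply a, b modulo 2^N - 1 via a length-r cyclic convolution over
    F_p with p = 2^q - 1, using chunks of varying size and weights that are
    powers of an r-th root theta of 2.  Illustrative sketch only."""
    p = (1 << q) - 1
    e = [-(-N * i // r) for i in range(r + 1)]        # e_i = ceil(N*i/r)
    n = [e[i + 1] - e[i] for i in range(r)]           # chunk i holds n_i bits
    c = [r * e[i] - N * i for i in range(r)]          # weight exponents, 0 <= c_i < r
    lam = pow(r, -1, q)                               # r * lam = 1 (mod q)
    theta = pow(2, lam, p)                            # theta^r = 2 in F_p
    w = [pow(theta, c[i], p) for i in range(r)]       # weights theta^{c_i}
    A = [((a >> e[i]) & ((1 << n[i]) - 1)) * w[i] % p for i in range(r)]
    B = [((b >> e[i]) & ((1 << n[i]) - 1)) * w[i] % p for i in range(r)]
    C = [0] * r                                       # cyclic convolution over F_p
    for i in range(r):
        for j in range(r):
            C[(i + j) % r] = (C[(i + j) % r] + A[i] * B[j]) % p
    result = 0                                        # unweight and reassemble (overlap-add)
    for k in range(r):
        result += (C[k] * pow(theta, -c[k], p) % p) << e[k]
    return result % ((1 << N) - 1)

# toy check: p = 2^17 - 1 is a Mersenne prime, N = 23, r = 4 chunks of 5 or 6 bits
a, b = 6022370, 8112963
assert crandall_fagin_multiply(a, b, 23, 17, 4) == (a * b) % ((1 << 23) - 1)
```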
Generalising the discussion of the previous section, we now show how to
reduce multiplication in ,
for a given
, to
multiplication in
. For this,
we require that
, that
and that
.
Indeed, consider two cyclic polynomials and
in
. We
cut each of the coefficients
into
chunks
and
of bit size at most
, using
the same varying base strategy as above. With
and
as before, we next form the bivariate cyclic
polynomials
in . Setting
the same arguments as in the previous section yield
Using the assumption that ,
we recover the coefficients
,
and hence the product
, from
the bivariate cyclic convolution product
.
Let and
(not necessarily
prime). We will take our basic recursive problem to be multiplication in
for suitable
.
We need
somewhat larger than
; this is analogous to the situation in section
6, where we chose a “short transform length”
somewhat larger than the coefficient size. Thus we set
where
is defined as follows.
such that
(9.3)
for all , and such that
we may compute
in time
.
Proof. Let .
Using [6], we may construct a function
such that
for all
,
and which may be computed in time
.
One checks that
for all
, so
for
. Thus
is increasing,
and
has the desired properties.
We say that an integer is admissible if
it is of the form
where
for some
. (This should not
be confused with the notion of admissibility of section 6.)
An element of
is then represented by
bits. Note that
is strictly
increasing, so there is a one-to-one correspondence between integers
and admissible
.
For
we define
to be the
smallest admissible integer
.
as
.
Given
, we may compute
, and the corresponding
, in time
.
Proof. From (9.3) we have ; this immediately implies that
.
Suppose that we wish to compute for some
. We assume that
is large enough that the definition
makes sense
and so that
. One checks that
, so
and hence
. To find the
smallest suitable
, we may
simply compute
for each
, and compare with
.
This takes time
.
Now let ,
and
. Consider the problem of
computing
products
with
. We denote by
the complexity of this problem, where
is the admissible integer corresponding to
.
As in section 6, we define
.
Notice that multiplication of two integers of bit size
reduces to the above problem, for
,
via a suitable Kronecker segmentation. Indeed, let
for some
, and encode the
integers as integer polynomials of degree less than
with coefficients of bit size
.
The desired product may be recovered from the product in
, as
for large . Thus, as in
section 6, we have
,
and it suffices to obtain a good bound for
.
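The following Python sketch illustrates the generic Kronecker segmentation step: the integers are cut into chunks, multiplied as polynomials, and the product is recovered by evaluating at a power of two. The chunk parameters and names are ours and chosen only for illustration; in the text the polynomial product is a cyclic product with carefully balanced parameters.

```python
def kronecker_segmentation_multiply(u, v, m, b):
    """Multiply integers u, v (each fitting in m*b bits) by cutting them into
    m chunks of b bits, multiplying the resulting polynomials, and evaluating
    the product back at 2^b.  Toy parameters, naive polynomial product."""
    mask = (1 << b) - 1
    U = [(u >> (b * i)) & mask for i in range(m)]
    V = [(v >> (b * i)) & mask for i in range(m)]
    W = [0] * (2 * m - 1)                       # plain (non-cyclic) product
    for i in range(m):
        for j in range(m):
            W[i + j] += U[i] * V[j]
    return sum(w << (b * k) for k, w in enumerate(W))

u, v = 123456789, 987654321
assert kronecker_segmentation_multiply(u, v, m=4, b=8) == u * v
```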
Now suppose additionally that is prime.
In this case
is a field, and as noted above, it
contains
-th roots of unity,
so we may define DFTs of length
over
for any
. In
particular, for
we may use Bluestein's algorithm
to compute DFTs of length
.
Denote by
the cost of evaluating
independent DFTs of length
over
, and put
. Here we assume as usual that a
-th root of unity is known, and that the
corresponding Bluestein root table has been precomputed.
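To illustrate how Bluestein's method converts a DFT into a cyclic convolution (and hence into a polynomial product that can be handled recursively), here is a small Python sketch over a toy prime field. The prime 97, the length 8 and all identifiers are ours; the sketch assumes the transform length is even, as it is for the power-of-two lengths used in the text, and the convolution is computed naively.

```python
def bluestein_dft(x, omega, p):
    """Length-m DFT over F_p via Bluestein's chirp transform:
    X_k = sum_j x_j omega^{jk}.  Requires a square root eta of omega, i.e. a
    primitive 2m-th root of unity; m is assumed even.  Illustrative sketch."""
    m = len(x)
    eta = next(e for e in range(2, p)
               if pow(e, 2, p) == omega % p and pow(e, m, p) == p - 1)
    a = [x[j] * pow(eta, j * j, p) % p for j in range(m)]     # chirped input
    b = [pow(eta, -(j * j), p) for j in range(m)]             # chirp kernel
    conv = [sum(a[j] * b[(k - j) % m] for j in range(m)) % p for k in range(m)]
    return [pow(eta, k * k, p) * conv[k] % p for k in range(m)]

# toy check against the definition of the DFT, over F_97 with omega of order 8
p, m, omega = 97, 8, 64
x = [3, 1, 4, 1, 5, 9, 2, 6]
naive = [sum(x[j] * pow(omega, j * k, p) for j in range(m)) % p for k in range(m)]
assert bluestein_dft(x, omega, p) == naive
```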
Let us apply these definitions in the case ;
this is permissible, as
for sufficiently large
. Since convolution of length
over
is exactly the
basic recursive problem, and since one of the operands is fixed, we have
, where
, and hence
and a logarithmically slow function
with the following property. For all admissible
, there exists an admissible
such that
Proof. Let with
. Assume that we wish to compute
products with one fixed operand. Our goal is to
reduce to a problem of the same form, but for exponentially smaller
.
Choose parameters. Let be the
smallest Mersenne prime larger than
.
By Proposition 9.2, we have
,
whence
, for some absolute
constant
. Moreover, we may
compute
, together with a
primitive
-th root of unity
in
,
in time
. We define
and
.
The algorithm must perform various multiplications in , at cost
.
For simplicity we will use Schönhage–Strassen's algorithm for
these multiplications, i.e., we will take
.
Since
, we have
Crandall–Fagin reduction. We use the framework of
section 9.3 to reduce the basic multiplication problem in
to multiplication in
for
suitable
. We take
where
We also write . The
definition of
makes sense for large
since
. Let us
check that the hypotheses of section 9.3 are satisfied for
large
. We have
and hence
; in
particular,
, so
, and also
.
Since
, we also have
, and thus
since
.
We also note for later use the estimate
Indeed, since we have
and we already noticed earlier that .
To assess the cost of the Crandall–Fagin reduction, we note that
computing the and
costs
(see Remark 9.3), the splitting
itself and final overlap-add phase require time
, and the various multiplications by
,
and
have cost
.
Reduction to power-of-two lengths. Next we reduce
multiplication in to multiplication in
, where
. In fact, since
,
these rings are isomorphic, via the map that sends
to
and
to
. Evaluating this isomorphism corresponds to
rearranging the coefficients according to the rule
, where
is the exponent
of
and where
and
are the exponents of
and
. This may be achieved in
time
using the same sorting strategy as in
section 2.3. The inverse rearrangement is handled
similarly.
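The rearrangement underlying this isomorphism is the classical observation that, for coprime lengths $a$ and $b$, the ring $R[X]/(X^{ab} - 1)$ is isomorphic to $R[Y, Z]/(Y^{a} - 1, Z^{b} - 1)$ via $X \mapsto YZ$, with the coefficient of $X^{i}$ moving to position $(i \bmod a, i \bmod b)$. The Python sketch below checks this on toy parameters (an odd length and a power-of-two length); the identifiers and the naive cyclic multiplications are ours, and the sorting strategy of section 2.3 used to realise the permutation efficiently is not reproduced here.

```python
from random import randrange

def to_bivariate(f, a, b):
    """Send the coefficient of X^i to position (i mod a, i mod b); this
    realises X -> YZ, valid because gcd(a, b) = 1."""
    g = [[0] * b for _ in range(a)]
    for i, coeff in enumerate(f):
        g[i % a][i % b] = coeff
    return g

def from_bivariate(g, a, b):
    """Inverse rearrangement (by CRT, i is determined by (i mod a, i mod b))."""
    return [g[i % a][i % b] for i in range(a * b)]

def cyclic_mul_1d(f, g, n, p):
    h = [0] * n
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] = (h[(i + j) % n] + f[i] * g[j]) % p
    return h

def cyclic_mul_2d(F, G, a, b, p):
    H = [[0] * b for _ in range(a)]
    for i1 in range(a):
        for j1 in range(b):
            for i2 in range(a):
                for j2 in range(b):
                    k1, k2 = (i1 + i2) % a, (j1 + j2) % b
                    H[k1][k2] = (H[k1][k2] + F[i1][j1] * G[i2][j2]) % p
    return H

# toy check: a = 5 (odd), b = 8 (power of two), coefficients modulo 8191
a, b, p = 5, 8, 8191
f = [randrange(p) for _ in range(a * b)]
g = [randrange(p) for _ in range(a * b)]
lhs = cyclic_mul_1d(f, g, a * b, p)
rhs = from_bivariate(cyclic_mul_2d(to_bivariate(f, a, b),
                                   to_bivariate(g, a, b), a, b, p), a, b)
assert lhs == rhs
```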
Reduction to univariate transforms. For multiplication
in , we will use bivariate
DFTs over
. This is possible
because
contains both
-th and
-th
primitive roots of unity, namely
and
, since
and
. More precisely, we must
perform
forward bivariate DFTs and
inverse bivariate DFTs of length
over
, and
multiplications in
. Each
bivariate DFT reduces further to
univariate DFTs
of length
over
(with
respect to
) and
univariate DFTs of length
over
(with respect to
).
Interspersed between these steps are various matrix transpose operations
of total cost
, to enable
efficient access to the “rows” and “columns”
(see section 2.1).
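The row–column scheme referred to here is standard: a bivariate DFT is evaluated by one-dimensional DFTs along the rows, a transpose, one-dimensional DFTs along the columns, and a final transpose. The following Python sketch verifies this over a toy prime field; the prime 97, the lengths 4 and 8, and the naive dft helper are ours and serve only as an illustration.

```python
def dft(vec, omega, p):
    n = len(vec)
    return [sum(vec[j] * pow(omega, j * k, p) for j in range(n)) % p for k in range(n)]

def bivariate_dft(A, omega_r, omega_s, p):
    """2-D DFT of an r x s array: a length-s DFT on every row, a transpose,
    a length-r DFT on every column, and a transpose back."""
    rows = [dft(row, omega_s, p) for row in A]                 # s-point DFTs
    cols = [list(col) for col in zip(*rows)]                   # transpose
    cols = [dft(col, omega_r, p) for col in cols]              # r-point DFTs
    return [list(row) for row in zip(*cols)]                   # transpose back

# toy check against the definition, over F_97: 22 has order 4, 64 has order 8
p, r, s = 97, 4, 8
omega_r, omega_s = 22, 64
A = [[(3 * i + 5 * j + 1) % p for j in range(s)] for i in range(r)]
B = bivariate_dft(A, omega_r, omega_s, p)
for k1 in range(r):
    for k2 in range(s):
        direct = sum(A[i][j] * pow(omega_r, i * k1, p) * pow(omega_s, j * k2, p)
                     for i in range(r) for j in range(s)) % p
        assert B[k1][k2] == direct
```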
Multiplications in are handled by zero-padding,
i.e., we first use Cooley–Tukey to multiply in
, and then reduce modulo
. The total cost of these multiplications is
.
Reduction to short transforms. Consider one of the
“long” univariate DFTs of length
over
. We decompose the DFT
into “short” DFTs of length
as
follows. Let
and
,
and write
where
for
and
. We
use the algorithm
, where for
we take
to be the
algorithm based on Bluestein's method (discussed immediately before (9.4)), and where
is the usual
Cooley–Tukey algorithm over
.
Let
be the cost of a single invocation of
(or of the corresponding inverse transform
). By (2.4) we have
The cost of precomputing the necessary root tables is only . By definition
.
From (9.4) and the estimate
,
the first term becomes
The contribution to from all terms involving
is
so
Denoting by the cost of a bivariate DFT of
length
over
,
we thus have (ignoring the transposition costs, which were included
earlier)
Moreover, since
we get
We must perform bivariate DFTs; the bound (9.5) then follows exactly as in the proof of Theorem 6.4.
For large , we have
, so
.
Thus there exists a constant
such that
for large
, and
we may take
.
Proof of Theorem 1.2. Follows from
Theorem 9.6 and Proposition 5.3, analogously
to the proof of Theorem 1.1.
M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Annals of Math., 160(2):781–793, 2004.
A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. Addison-Wesley, 1974.
L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and Cartier-Manin operator. SIAM J. Comput., 36:1777–1806, 2007.
C. B. Boyer. A History of Mathematics. Princeton Univ. Press, first paperback edition, 1985.
R. P. Brent. Fast multiple-precision evaluation of elementary functions. J. Assoc. Comput. Mach., 23(2):242–251, 1976.
R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.
P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, 1997.
H. Cohen, G. Frey, R. Avanzi, Ch. Doche, T. Lange, K. Nguyen, and F. Vercauteren, editors. Handbook of elliptic and hyperelliptic curve cryptography. Discrete Mathematics and its Applications. Chapman & Hall/CRC, Boca Raton, FL, 2006.
S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
R. Crandall and B. Fagin. Discrete weighted transforms and large-integer arithmetic. Math. Comp., 62(205):305–324, 1994.
R. Crandall and C. Pomerance. Prime numbers. A computational perspective. Springer, New York, 2nd edition, 2005.
R. Creutzburg and M. Tasche. Parameter determination for complex number-theoretic transforms using cyclotomic polynomials. Math. Comp., 52(185):189–200, 1989.
A. De, P. P. Kurur, C. Saha, and R. Saptharishi. Fast integer multiplication using modular arithmetic. SIAM J. Comput., 42(2):685–699, 2013.
J. Écalle. Introduction aux fonctions analysables et preuve constructive de la conjecture de Dulac. Hermann, collection: Actualités mathématiques, 1992.
M. Fürer. On the complexity of integer multiplication (extended abstract). Technical Report CS-89-17, Pennsylvania State University, 1989.
M. Fürer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of Computing, STOC 2007, pages 57–66, New York, NY, USA, 2007. ACM Press.
M. Fürer. Faster integer multiplication. SIAM J. Comp., 39(3):979–1005, 2009.
M. Fürer. How fast can we multiply large integers on an actual computer? Technical report, http://arxiv.org/abs/1402.1811, 2014.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, 2nd edition, 2002.
F. Q. Gouvêa. $p$-adic numbers. An introduction. Universitext. Springer-Verlag, Berlin, 1993.
T. Granlund et al. GMP, the GNU multiple precision arithmetic library. http://gmplib.org, 1991. Latest version 6.0.0 released in 2014.
D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. Technical report, HAL, 2014. http://hal.archives-ouvertes.fr.
D. R. Heath-Brown. Almost-primes in arithmetic progressions and short intervals. Math. Proc. Cambridge Philos. Soc., 83(3):357–375, 1978.
D. R. Heath-Brown. Zero-free regions for Dirichlet $L$-functions, and the least prime in an arithmetic progression. Proc. London Math. Soc. (3), 64(2):265–338, 1992.
M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch. Hist. Exact Sci., 34(3):265–277, 1985.
J. van der Hoeven. Journées Nationales de Calcul Formel (2011), volume 2 of Les cours du CIRM, chapter Calcul analytique. CEDRAM, 2011. Exp. No. 4, 85 pages, http://ccirm.cedram.org/ccirm-bin/fitem?id=CCIRM_2011__2_1_A4_0.
J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002. http://www.mathemagix.org.
A. Karatsuba and J. Ofman. Умножение многозначных чисел на автоматах. Doklady Akad. Nauk SSSR, 145:293–294, 1962. English translation in [31].
A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 1969.
D. E. Knuth. The art of computer programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1998.
Yu. V. Linnik. On the least prime in an arithmetic progression I. The basic theorem. Rec. Math. (Mat. Sbornik) N.S., 15(57):139–178, 1944.
Yu. V. Linnik. On the least prime in an arithmetic progression II. The Deuring-Heilbronn phenomenon. Rec. Math. (Mat. Sbornik) N.S., 15(57):347–368, 1944.
R. E. Moore. Interval Analysis. Prentice Hall, Englewood Cliffs, N.J., 1966.
O. Neugebauer. The Exact Sciences in Antiquity. Brown Univ. Press, Providence, R.I., 1957.
C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
J. M. Pollard. The fast Fourier transform in a finite field. Math. Comp., 25(114):365–374, 1971.
C. Pomerance. Recent developments in primality testing. Math. Intelligencer, 3(3):97–105, 1980/81.
C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56(6):1107–1108, June 1968.
K. R. Rao, D. N. Kim, and J. J. Hwang. Fast Fourier Transform - Algorithms and Applications. Signals and Communication Technology. Springer-Verlag, 2010.
I. S. Reed and T. K. Truong. The use of finite fields to compute convolutions. IEEE Trans. Inform. Theory, IT-21:208–213, 1975.
M. C. Schmeling. Corps de transséries. PhD thesis, Université Paris-VII, France, 2001.
A. Schönhage. Multiplikation großer Zahlen. Computing, 1(3):182–196, 1966.
A. Schönhage. Storage modification machines. SIAM J. on Comp., 9:490–508, 1980.
A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
I. Shparlinski. On finding primitive roots in finite fields. Theoret. Comput. Sci., 157(2):273–275, 1996.
A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.
A. L. Toom. О сложности схемы из функциональных элементов, реализующей умножение целых чисел. Doklady Akad. Nauk SSSR, 150:496–498, 1963. English translation in [50].
S. Wagstaff. Divisors of Mersenne numbers. Math. Comp., 40(161):385–397, 1983.
T. Xylouris. On the least prime in an arithmetic progression and estimates for the zeros of Dirichlet L-functions. Acta Arith., 150(1):65–91, 2011.