In a previous paper [vdH04], we introduced a truncated version of the classical Fast Fourier Transform. When applied to polynomial multiplication, this algorithm has the nice property of eliminating the “jumps” in the complexity at powers of two. When applied to the multiplication of multivariate polynomials or truncated multivariate power series, a non-trivial asymptotic factor was gained with respect to the best previously known algorithms. In the present note, we correct two errors which slipped into the previous paper and we give a new application to the multiplication of polynomials with real coefficients. We also give some further hints on how to implement the TFT in practice.
Let $R$ be an effective ring of constants (i.e. the usual arithmetic operations $+$, $-$ and $\times$ can be carried out by algorithm). If $R$ has a primitive $n$-th root of unity $\omega$ with $n = 2^p$, then the product of two polynomials $A, B \in R[x]$ with $\deg AB < n$ can be computed in time $O(n \log n)$ using the Fast Fourier Transform or FFT [CT65]. If $R$ does not admit a primitive $2^p$-th root of unity, then one needs an additional overhead of $O(\log \log n)$ in order to carry out the multiplication, by artificially adding new roots of unity [SS71, CK91].
Besides the fact that the asymptotic complexity of the FFT involves a large constant factor, another classical drawback is that the complexity function admits important jumps at each power of two. These jumps can be reduced by using roots of unity of small composite orders. They can also be smoothed by decomposing one large multiplication into several smaller ones. However, these tricks are not very elegant, they are cumbersome to implement, and they do not completely eliminate the jump problem. The jump phenomenon becomes even more important for $d$-dimensional FFTs, since the complexity is multiplied by $2^d$ whenever one of the degrees traverses a power of two.
In [vdH04], the author introduced a new kind of "Truncated Fourier Transform" (TFT), which allows for the fast evaluation of a polynomial in any number $l$ of well-chosen roots of unity. This algorithm coincides with the usual FFT if $l$ is a power of two, but it behaves smoothly for intermediate values. Moreover, the inverse TFT can be carried out with the same complexity, and the approach generalizes to higher dimensions. Unfortunately, two errors slipped into the final version of [vdH04]: in the multivariate TFT, we forgot certain crossings. As a consequence, the stated complexity bound for multiplying polynomials (and power series) in $d$ variables of total degree $l$ only holds in a restricted range of $l$ and $d$. Moreover, the inverse TFT does not generalize to arbitrary "unions of intervals".
The present note has several purposes: correcting the above mistakes in [vdH04], providing further details on how to implement the FFT and TFT in a sufficiently generic way (suitable for univariate as well as multivariate computations), and presenting a new extension to the case of TFTs with real coefficients. Furthermore, a generic implementation of the FFT and TFT is in progress as part of the standard C++ library Mmxlib of Mathemagix.
Let $R$ be an effective ring of constants, with $n = 2^p$ for some $p \in \mathbb{N}$, and let $\omega \in R$ be a primitive $n$-th root of unity (i.e. $\omega^{n/2} = -1$). The discrete Fast Fourier Transform (FFT) of an $n$-tuple $a = (a_0, \ldots, a_{n-1}) \in R^n$ (with respect to $\omega$) is the $n$-tuple $\hat{a} = (\hat{a}_0, \ldots, \hat{a}_{n-1}) \in R^n$ with
$$\hat{a}_i = \sum_{j=0}^{n-1} a_j \omega^{i j}.$$
In other words, $\hat{a}_i = A(\omega^i)$, where $A$ denotes the polynomial $A = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$. We also say that $\hat{a}$ is the FFT of $A$.

The FFT can be computed efficiently using binary splitting: writing
$$(a_0, \ldots, a_{n-1}) = (b_0, c_0, \ldots, b_{n/2-1}, c_{n/2-1}),$$
we recursively compute the Fourier transforms of $(b_0, \ldots, b_{n/2-1})$ and $(c_0, \ldots, c_{n/2-1})$ with respect to $\omega^2$. Then we have
$$\hat{a}_i = \hat{b}_{i \bmod n/2} + \omega^i \hat{c}_{i \bmod n/2}.$$
This algorithm requires $\frac{1}{2} n p$ multiplications with powers of $\omega$ and $n p$ additions (or subtractions).
In practice, it is most efficient to implement an in-place variant of the above algorithm. We will denote by $[i]_p$ the bitwise mirror of $i$ at length $p$ (for instance, $[3]_5 = 24$ and $[11]_5 = 26$).
At step $0$, we start with the vector
$$x_0 = (x_{0,0}, \ldots, x_{0,n-1}) = (a_0, \ldots, a_{n-1}).$$
At step $s \in \{1, \ldots, p\}$, we set
$$\begin{pmatrix} x_{s, i m_s + j} \\ x_{s, (i+1) m_s + j} \end{pmatrix} := \begin{pmatrix} 1 & \omega^{[i]_s m_s} \\ 1 & -\omega^{[i]_s m_s} \end{pmatrix} \begin{pmatrix} x_{s-1, i m_s + j} \\ x_{s-1, (i+1) m_s + j} \end{pmatrix} \qquad (1)$$
for all even $i \in \{0, 2, \ldots, 2^s - 2\}$ and all $j \in \{0, \ldots, m_s - 1\}$, where $m_s = 2^{p-s}$. Using induction over $s$, it can easily be seen that $x_{s, i m_s + j}$ only depends on the entries $a_{k m_s + j}$ with $0 \leq k < 2^s$. In particular,
$$x_{p, i} = \hat{a}_{[i]_p}$$
for all $i \in \{0, \ldots, n-1\}$. This algorithm of "repeated crossings" is illustrated in figure 1.
A classical application of the FFT is the multiplication of polynomials $A = a_0 + \cdots + a_{n-1} x^{n-1}$ and $B = b_0 + \cdots + b_{n-1} x^{n-1}$. Assuming that $\deg AB < n$, we first evaluate $A$ and $B$ in $1, \omega, \ldots, \omega^{n-1}$ using the FFT:
$$\hat{a} = \mathrm{FFT}_\omega(a), \qquad \hat{b} = \mathrm{FFT}_\omega(b).$$
We next compute the evaluations $(\hat{a}_0 \hat{b}_0, \ldots, \hat{a}_{n-1} \hat{b}_{n-1})$ of $AB$ at $1, \omega, \ldots, \omega^{n-1}$. We finally have to recover $AB$ from these values using the inverse FFT. But the inverse FFT with respect to $\omega$ is nothing else than $1/n$ times the direct FFT with respect to $\omega^{-1}$. Indeed, for all $a \in R^n$ and all $k \in \{0, \ldots, n-1\}$, we have
$$\frac{1}{n} \sum_{i=0}^{n-1} \hat{a}_i \omega^{-ik} = \frac{1}{n} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_j \omega^{i(j-k)} = a_k, \qquad (2)$$
since $\sum_{i=0}^{n-1} \omega^{i(j-k)} = 0$ whenever $j \neq k$. This yields a multiplication algorithm of time complexity $O(n \log n)$ in $R[x]$, when assuming that $R$ admits enough primitive $2^p$-th roots of unity. In the case that $R$ does not, new roots of unity can be added artificially [SS71, CK91, vdH02] so as to yield an algorithm of time complexity $O(n \log n \log \log n)$.
Given a multivariate polynomial $A \in R[x_1, \ldots, x_d]$ with $\deg_{x_k} A < n_k = 2^{p_k}$ for all $k$, we may also consider its FFT with respect to each of its variables:
$$\hat{a}_{i_1, \ldots, i_d} = A(\omega_1^{i_1}, \ldots, \omega_d^{i_d}),$$
where $\omega_k$ is a primitive $n_k$-th root of unity for each $k$. Usually the $\omega_k$ are chosen as suitable powers of a single primitive root of unity of maximal order. The multivariate FFT is simply computed by taking the univariate FFT with respect to each variable.
Instead of using multi-indices $(i_1, \ldots, i_d)$ for multivariate FFTs, it is more efficient to encode the input and output using a single array of size $n_1 \cdots n_d$. This can be done in several ways. Assume that we have ordered the variables so that $p_1 \geq \cdots \geq p_d$. The lexicographical encoding of the multi-index $(i_1, \ldots, i_d)$ is given by
$$\varphi_{\mathrm{lex}}(i_1, \ldots, i_d) = i_1 2^{p_2 + \cdots + p_d} + i_2 2^{p_3 + \cdots + p_d} + \cdots + i_d.$$
The simultaneous encoding $\varphi_{\mathrm{sim}}(i_1, \ldots, i_d)$ is defined recursively, by interleaving the binary digits of $i_1, \ldots, i_d$ in such a way that the variables with the largest degree bounds contribute the extra high-order digits. For each stage $s$, we denote by $d_s$ the number of variables which still contribute binary digits at that stage.
Using one of these encodings, one may use the above in-place algorithm for the computation of the multivariate FFT, when replacing the cross relation (1) by its analogue
$$\begin{pmatrix} x' \\ y' \end{pmatrix} := \begin{pmatrix} 1 & \rho \\ 1 & -\rho \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad (3)$$
in which $\rho$ is the appropriate mirrored power of the root $\omega_k$ associated to the variable being transformed at the current stage. With the lexicographical encoding, the variables are transformed one after another; with the simultaneous encoding, the stages interleave the variables. Translated back into terms of monomials, crossings at stage $s$ involve pairs of monomials which differ in the exponent of a single variable.
A challenge for concrete implementations of the FFT in computer algebra systems is to achieve both optimal speed and genericity. Also, we would like to allow for univariate as well as multivariate FFTs. A first attempt in this direction is now part of Mmxlib. Roots of unity may be represented in several models:

The Schönhage-Strassen model.

The nice prime number model.

The floating-point model.

In our implementation, we have created a special template class fft_root<C> for roots of unity in C, which may be specialized to any of the above models. The class fft_root<C> essentially implements methods for multiplication and for multiplication with elements of C.
Of course, the precomputation of the powers of $\omega$ requires some additional space. Nevertheless, this does no harm if this space is significantly smaller than the space occupied by the data, or when the encoding of roots of unity requires significantly less space than storing elements of C (for instance, in the Schönhage-Strassen model, it suffices to store the shifts). In the remaining case, some space may be saved by precomputing only part of the table and by computing the remaining values only on demand during the two last steps.
It should also be noticed that we really precompute the array
$$\Omega = (\omega^{[0]_p}, \omega^{[1]_p}, \ldots, \omega^{[n-1]_p}). \qquad (4)$$
There are several reasons for doing so. First of all, we really need the binary mirrored powers in the cross relation (3). Secondly, each entry of $\Omega$ is easily deduced from the previous ones. Finally, precomputation of the $\omega^{[i]_p}$ is useful for the inverse transform.
At present, this strategy has not yet been implemented in Mmxlib. In the nice prime number model, it may even be that the cost of cache misses is negligible w.r.t. the cost of multiplications and divisions modulo the prime.
On the other hand, the nice prime number and floating-point models admit many roots of unity, but genuine multiplications (and divisions) have to be carried out during the FFT step. In the nice prime number model, one computes modulo a prime number of the form $k \cdot 2^q + 1$, which has many $2^q$-th roots of unity, yet still fits into a single machine word. One may also use several such prime numbers, in combination with Chinese remaindering. In the floating-point model, one uses floating-point approximations of complex numbers. This is particularly useful if fast floating-point arithmetic is available, but one always has to deal carefully with rounding errors.
In practice, it follows that the choice of an optimal model for the FFT depends very much on the application. When using the FFT for multiplication, a general rule of thumb is to balance the costs of the FFT step and the inner multiplication step. If the inner multiplication step is carried out just once, then Schönhage-Strassen's model is usually optimal. This is for instance the case when multiplying two large integers. If one FFT corresponds to several inner multiplications, as in the case of matrix multiplication with large integer coefficients, one should opt for the nice prime or floating-point model, depending on the efficiency of division, floating-point operations and the available amount of space.
In between, it may sometimes be useful to use Schönhage-Strassen's method with cube roots (instead of square roots) of the number of digits (or the degree) [vdH02, Section 6.1.3]. Similarly, one may force an extra iteration (i.e. immediately use fourth roots), at the price of an additional constant factor.
Finally, in the case of multivariate FFTs (we consider that large integer coefficients account for an additional dimension), one may artificially generate additional roots of unity. For instance, if $R$ has a $2^p$-th root of unity $\omega$, then one may multiply bivariate polynomials in $x$ and $y$, with $\deg_y < 2^q$, by using $y$ as a $2^q$-th root of $\omega$ (working modulo $y^{2^q} - \omega$), which provides a $2^{p+q}$-th root of unity during the FFT w.r.t. $x$.
Let $n = 2^p$, $l \leq n$, and let $\omega$ be a primitive $n$-th root of unity. The Truncated Fourier Transform (TFT) from [vdH04, Section 3] takes a tuple $(a_0, \ldots, a_{l-1})$ on input and produces $(b_0, \ldots, b_{l-1})$ with $b_i = A(\omega^{[i]_p})$ for all $i < l$, where $A = a_0 + \cdots + a_{l-1} x^{l-1}$. More generally, if $S$ is a subset of $\{0, \ldots, n-1\}$ and $a \in R^S$, then we define $\mathrm{TFT}_S(a) \in R^S$ by
$$\mathrm{TFT}_S(a)_i = \sum_{j \in S} a_j \omega^{[i]_p j} \qquad (i \in S).$$
Even more generally, one may select a target subset $T$ of $\{0, \ldots, n-1\}$ which is different from $S$, and define $\mathrm{TFT}_{S,T}(a) \in R^T$ by
$$\mathrm{TFT}_{S,T}(a)_i = \sum_{j \in S} a_j \omega^{[i]_p j} \qquad (i \in T).$$
In order to compute the TFT, we simply consider the computation graph of a complete FFT and throw out all computations which do not contribute to the result (see figures 2, 3 and 4). If the set $S$ is "sufficiently connected", then the cost of the computation of $\mathrm{TFT}_S$ is $O(|S| \log n)$. For instance, for $S = \{0, \ldots, l-1\}$, we have proved [vdH04]:
Theorem 1. Let $n = 2^p$, $l \leq n$, and let $\omega$ be a primitive $n$-th root of unity in $R$. Then the TFT of an $l$-tuple $(a_0, \ldots, a_{l-1})$ w.r.t. $\omega$ can be computed using at most $l p + n$ additions (or subtractions) and $\lceil (l p + n)/2 \rceil$ multiplications with powers of $\omega$.
Inverting the Truncated Fourier Transform
In order to use the TFT for the multiplication of numbers, polynomials or power series, we also need an algorithm to compute the inverse transform. In this section, we give such an algorithm in the case when $T = S$ and when $S$ is an initial segment for the (partial) "bit ordering" $\preccurlyeq$ on $\{0, \ldots, n-1\}$: given numbers $i$ and $j$ with binary expansions $i = i_{p-1} \cdots i_0$ and $j = j_{p-1} \cdots j_0$ (with $i_k, j_k \in \{0, 1\}$), we set $i \preccurlyeq j$ if $i_k \leq j_k$ for all $k$. We recall that an initial segment of $\{0, \ldots, n-1\}$ is a set $S$ such that $i \in S$ whenever $i \preccurlyeq j$ and $j \in S$.
The inverse TFT is based on the key observation that the cross relation can also be applied to compute "one diagonal from the other diagonal". More precisely, given a relation
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & \alpha \\ 1 & -\alpha \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad (5)$$
where $\alpha = \omega^k$ for some $k$, we may clearly compute $(x', y')$ as a function of $(x, y)$ and vice versa, but we also have
$$x' = 2x - y', \qquad y = \alpha^{-1}(x - y') \qquad (6)$$
and
$$x = x' - \alpha y, \qquad y' = x' - 2 \alpha y. \qquad (7)$$
Moreover, these relations only involve shifting (multiplication and division by $2$), additions, subtractions and multiplications by roots of unity.
In order to state our in-place algorithm for computing the inverse TFT, we will need some more notations. At the very start, we have the values $\hat{a} = \mathrm{TFT}_S(a)$ on $S$ and the zero coefficients on the complement $S^{\mathrm{c}}$ of $S$ in $\{0, \ldots, n-1\}$ on input. Roughly speaking, the algorithm will replace the known values of the transform by the unknown coefficients, and the known zero coefficients by the remaining values of the transform. In the actual algorithm, the array will be stored in a special way (see the next section) and only a part of the latter computation will really be carried out.
Our algorithm proceeds by recursion. In the recursive step, $\{0, \ldots, n-1\}$ is replaced by a subset of the form
$$U = \{u 2^{p-s}, \ldots, (u+1) 2^{p-s} - 1\},$$
where $s$ is the current step and $u < 2^s$. The sets $S \cap U$ and $T \cap U$ will again be subsets of $U$, and $S \cap U$ is recursively assumed to be an initial segment for the bit ordering relative to $U$. Given a subset $V$ of $U$, we will denote $V_0 = V \cap U_0$ and $V_1 = V \cap U_1$, where $U_0$ and $U_1$ are the lower and the upper half of $U$. Similarly, given $f: V \to R$, we will denote by $f_0$ and $f_1$ the restrictions of $f$ to $V_0$ resp. $V_1$. We now have the following recursive in-place algorithm for computing the inverse TFT (see figure 5 for an illustration of the different steps).
Algorithm
If $S = \varnothing$ or $T = \varnothing$, then return.
If $U \subseteq S$, then apply the partial inverse FFT on $U$ and return.
For each index in the first group, cross the value at that index with its partner using (5).
For each index in the second group, cross the value at that index with its partner using (7).
For each index in the third group, cross the value at that index with its partner using (6).
Applying the algorithm for $S = T = \{0, \ldots, l-1\}$ with $U = \{0, \ldots, n-1\}$, we have:

Theorem 2. Let $n = 2^p$, $l \leq n$, and let $\omega$ be a primitive $n$-th root of unity in $R$. Then the $l$-tuple $(a_0, \ldots, a_{l-1})$ can be recovered from its TFT w.r.t. $\omega$ using at most $l p + n$ shifted additions (or subtractions) and $\lceil (l p + n)/2 \rceil$ multiplications with powers of $\omega$.
Remark. Although the cross matrix in (5) has a non-zero determinant for any $\alpha$, an algorithm of the above kind does not always work; the simplest examples where our method fails already occur for sets $S$ of small size. Nevertheless, the condition that $S$ is an initial segment for the bit ordering on $\{0, \ldots, n-1\}$ is not a necessary condition. For instance, a slight modification of the algorithm also works for final segments and for certain more general sets. We will observe in the next section that the bit ordering condition on $S$ is naturally satisfied when we take multivariate TFTs on initial segments of $\mathbb{N}^d$.
Let $n_k = 2^{p_k}$ and let $\omega_k$ be a primitive $n_k$-th root of unity for each $k \in \{1, \ldots, d\}$. Given subsets $S$ and $T$ of $\{0, \ldots, n_1 - 1\} \times \cdots \times \{0, \ldots, n_d - 1\}$ and a mapping $a: S \to R$, the multivariate TFT of $a$ is a mapping $\hat{a} = \mathrm{TFT}_{S,T}(a): T \to R$ defined by
$$\hat{a}_{i_1, \ldots, i_d} = \sum_{(j_1, \ldots, j_d) \in S} a_{j_1, \ldots, j_d}\, \omega_1^{[i_1]_{p_1} j_1} \cdots \omega_d^{[i_d]_{p_d} j_d}.$$
See figure 6 for an example in dimension $2$. The multivariate TFT and its inverse are computed in a similar way as in the univariate case, using one of the encodings from section 3 of tuples by integers in $\{0, \ldots, n_1 \cdots n_d - 1\}$, and using the corresponding FFT-profile determined by the encoding instead of the univariate one.
We generalize the bit-ordering to sets of indices of the form $\{0, \ldots, n_1 - 1\} \times \cdots \times \{0, \ldots, n_d - 1\}$ by setting
$$(i_1, \ldots, i_d) \preccurlyeq (j_1, \ldots, j_d) \iff i_1 \preccurlyeq j_1 \wedge \cdots \wedge i_d \preccurlyeq j_d.$$
If $\varphi$ is one of the encodings $\varphi_{\mathrm{lex}}$ or $\varphi_{\mathrm{sim}}$ from section 3, then it can be checked that
$$i \preccurlyeq j \iff \varphi(i) \preccurlyeq \varphi(j).$$
If $S$ is an initial segment for the bit ordering, this ensures that the inverse TFT also generalizes to the multivariate case.
From the complexity analysis point of view, two special cases are particularly interesting. The first "block" case is when $S = T = \{0, \ldots, l_1 - 1\} \times \cdots \times \{0, \ldots, l_d - 1\}$ with $l_k \leq n_k$ for all $k$. Then, using the lexicographical encoding, the multivariate TFT can be seen as a succession of $d$ univariate TFTs with vectors of coefficients instead of scalar ones. Setting $l = l_1 \cdots l_d$, $n = n_1 \cdots n_d$ and $p = p_1 + \cdots + p_d$, theorems 1 and 2 therefore generalize to:

Theorem. With the above notations, the multivariate TFT of $a: S \to R$ and its inverse can be computed using at most $l p + \frac{l}{l_1} n_1 + \cdots + \frac{l}{l_d} n_d$ shifted additions (or subtractions) and half as many (rounded up) multiplications with powers of the $\omega_k$.
Remark. The case when the sizes involved are not all powers of two is also interesting. The easiest instance of this problem is when all sizes coincide. In that case, we may introduce a new variable and compute in a cyclic extension whose degree is the smallest power of two above the required size. This gives rise to yet another FFT-profile and a corresponding complexity bound. The trick generalizes to the case when the sizes are all powers of two, but the general case seems to require non-binary FFT steps.
The other important "simplicial" case is when $T = S$ and
$$S = \{(i_1, \ldots, i_d) : i_1 + \cdots + i_d < l\}$$
for some fixed $l$ with $l \leq \min(n_1, \ldots, n_d)$ (and more generally, one may consider weighted degrees). In this case, we choose the simultaneous encoding for the multivariate TFT (see figure 7 for an example). Now the size of $S$ is $\binom{l-1+d}{d}$. At each stage, we notice that the number of "active nodes" for the TFT remains proportional to $|S|$. When $l$ is sufficiently large with respect to $d$, this proves:

Theorem. For $l$ sufficiently large with respect to $d$, the direct and inverse TFT of a mapping $a: S \to R$ can be computed using $O(|S| \log |S|)$ shifted additions (or subtractions) and multiplications with powers of the $\omega_k$.
Remark. In [vdH04], it was claimed that this complexity bound does not depend on $d$, which is incorrect.
In order to implement the TFT and its inverse, one first has to decide how to represent the sets $S$ and $T$, and mappings $a: S \to R$. A convenient approach is to represent $S$, $T$ and $a$ (for the direct TFT) or $S$, $T$ and $\hat{a}$ (for the inverse TFT) by a single binary "TFT-tree". Such a binary tree corresponds to what happens on a subset $U$ of $\{0, \ldots, n-1\}$ of the form $U = \{u 2^q, \ldots, (u+1) 2^q - 1\}$ with $u < 2^{p-q}$. The binary tree is of one of the following forms:
When the tree is reduced to a leaf, then it explicitly contains a map $f: U \to R$, which is represented by an array of length $2^q$ in memory. By convention, if the leaf is reduced to a null-pointer, then $f$ represents the zero map. Moreover, the node contains a flag in order to indicate whether the interval belongs to $S$ or not (for leafs, it is assumed that the interval lies entirely inside or entirely outside $S$, and similarly for $T$).
When the tree consists of a binary node with subtrees $t_0$ and $t_1$, then $t_0$ encodes what happens on the lower half of the interval $U$, whereas $t_1$ encodes what happens on the upper half. For convenience, the tree still contains a flag to indicate whether the interval meets $S$ or $T$. Notice that the bounds of the interval $U$ need not be explicitly present in the representation of the tree; when needed, they can be passed as parameters to recursive functions on the trees.
Only the direct TFT has currently been implemented in Mmxlib. In tables 1-4, we give a few benchmarks on a 2.4GHz AMD architecture with 512MB of memory. We tested the simplicial multivariate TFT for total degree $l$ and dimension $d$ over a fixed coefficient ring. Our tables show the size $|S|$ of the input, the real and average running times, the theoretical number of crossings to be computed, its average, and the ratio between the measured running time and the running time of the univariate TFT for the same input size. The tables show both the most favourable cases, when $l$ is a power of two, and the worst cases, when $l$ is a power of two plus one.
[Tables 1-4: benchmark data.]
A few conclusions can be drawn from tables 1-4. First of all, in the univariate case, the TFT behaves more or less as theoretically predicted. The non-trivial ratios in higher dimensions indicate that the set $S$ is quite disconnected; hence, optimizations in the implementation of TFT-trees may cause non-trivial gains. The differences between the running times for the best and worst cases in higher dimensions show that the multivariate TFT is still quite sensitive to the jump phenomenon (even though less so than a full FFT). Finally, in the best case, the overhead with respect to the theoretical prediction becomes reasonably small; indeed, it should remain bounded in order to gain something w.r.t. a full FFT. However, the overhead can become quite large in the worst case. These various observations suggest the following future improvements.
A first improvement would treat several stages of crossings at once. It can be hoped that this acceleration approximately reduces the current running times for the worst case to the current running times for the best case. For high dimensions, it may be useful to apply the trick a second time for the next stages; however, the corresponding formulas become more complicated. We also hope that further study will enable the replacement of the condition in the theorem for the simplicial case by a better one.
Assume now that $R$ is a ring such that $R[\mathrm{i}]$ (with $\mathrm{i}^2 = -1$) has many $2^p$-th roots of unity. Typically, this is the case when $R$ is the "field" of floating-point numbers. A disadvantage of the standard FFT or TFT in this context is that the size of the input is doubled in order to represent numbers in $R[\mathrm{i}]$. This doubling of the size actually corresponds to the fact that the FFT or TFT contains redundant information in this case. Indeed, given $P \in R[x]$ with real coefficients, we have
$$\overline{P(z)} = P(\bar{z})$$
for any $2^p$-th root of unity $z$. This suggests to evaluate $P$ either in $z$ or in $\bar{z}$, but not in both.
With the notations from section 5, let $S$ be an initial segment for the bit ordering $\preccurlyeq$. As our target set $T$, we take a set which contains exactly one element of each pair of indices corresponding to conjugate evaluation points. The "real TFT" now consists of applying the usual TFT with source $S$ and target $T$. It can be checked that crossings of "real nodes" introduce "complex nodes" precisely when one of the two nodes disappears at the next stage (see figure 8). It can also be checked that the inverse TFT naturally adapts so as to compute the inverse real TFT. For "sufficiently connected" sets $S$, like initial segments, the real TFT is asymptotically twice as fast as the complex TFT.
[CK91] D.G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693-701, 1991.

[CT65] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297-301, 1965.

[SS71] A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281-292, 1971.

[vdH02] J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34(6):479-542, 2002.

[vdH04] J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290-296, Santander, Spain, July 2004.

[vdH05] J. van der Hoeven et al. Mmxlib: the standard library for Mathemagix, 2002-2005. ftp://ftp.mathemagix.org/pub/mathemagix/targz.