In a previous paper [vdH04], we introduced a truncated version of the classical Fast Fourier Transform. When applied to polynomial multiplication, this algorithm has the nice property of eliminating the “jumps” in the complexity at powers of two. When applied to the multiplication of multivariate polynomials or truncated multivariate power series, a non-trivial asymptotic factor was gained with respect to the best previously known algorithms. In the present note, we correct two errors which slipped into the previous paper and we give a new application to the multiplication of polynomials with real coefficients. We also give some further hints on how to implement the TFT in practice.
Let $R$ be an effective ring of constants (i.e. the usual arithmetic operations $+$, $-$ and $\times$ can be carried out by algorithm). If $R$ has a primitive $n$-th root of unity with $n = 2^p$ and $\deg PQ < n$, then the product of two polynomials $P, Q \in R[X]$ can be computed in time $O(n \log n)$ using the Fast Fourier Transform or FFT [CT65]. If $R$ does not admit a primitive $2^p$-th root of unity, then one needs an additional overhead of $O(\log \log n)$ in order to carry out the multiplication, by artificially adding new roots of unity [SS71, CK91].
Besides the fact that the asymptotic complexity of the FFT involves a large constant factor, another classical drawback is that the complexity function admits important jumps at each power of two. These jumps can be reduced by using roots of unity of orders $2^p r$ for small odd $r$. They can also be smoothed by decomposing an $n$-multiplication into several smaller multiplications. However, these tricks are not very elegant, are cumbersome to implement, and they do not allow one to eliminate the jump problem completely. The jump phenomenon becomes even more important for $d$-dimensional FFTs, since the complexity is multiplied by $2^d$ whenever one of the degrees traverses a power of two.
In [vdH04], the author introduced a new kind of “Truncated Fourier Transform” (TFT), which allows for the fast evaluation of a polynomial in any number $n$ of well-chosen roots of unity. This algorithm coincides with the usual FFT if $n$ is a power of two, but it behaves smoothly for intermediate values. Moreover, the inverse TFT can be carried out with the same complexity, and the approach generalizes to higher dimensions.
Unfortunately, two errors slipped into the final version of [vdH04]: in the multivariate TFT, we forgot certain crossings. As a consequence, the complexity bounds for multiplying multivariate polynomials (and power series) of given total degree only hold under an additional condition (cf. theorem 6 below). Moreover, the inverse TFT does not generalize to arbitrary “unions of intervals”.
The present note has several purposes: correcting the above mistakes in [vdH04], providing further details on how to implement the FFT and TFT in a sufficiently generic way (suitable for univariate as well as multivariate computations), and presenting a new extension to the case of TFTs with real coefficients. Furthermore, a generic implementation of the FFT and TFT is in progress as part of Mmxlib, the standard C++ library for Mathemagix [vdH05].
Let $R$ be an effective ring of constants, with $n = 2^p$ ($p \in \mathbb{N}$) and a primitive $n$-th root of unity $\omega \in R$ (i.e. $\omega^{n/2} = -1$). The discrete Fast Fourier Transform (FFT) of an $n$-tuple $a = (a_0, \ldots, a_{n-1}) \in R^n$ (with respect to $\omega$) is the $n$-tuple $\hat{a} = (\hat{a}_0, \ldots, \hat{a}_{n-1}) = \mathrm{FFT}_\omega(a) \in R^n$ with
\[ \hat{a}_i = \sum_{j=0}^{n-1} a_j \omega^{i j}. \]
In other words, $\hat{a}_i = A(\omega^i)$, where $A \in R[X]$ denotes the polynomial $A = a_0 + a_1 X + \cdots + a_{n-1} X^{n-1}$. We also say that $\hat{a}$ is the FFT of $A$.
The FFT can be computed efficiently using binary splitting: writing
\[ (a_0, \ldots, a_{n-1}) = (b_0, c_0, \ldots, b_{n/2-1}, c_{n/2-1}), \]
we recursively compute the Fourier transforms of $(b_0, \ldots, b_{n/2-1})$ and $(c_0, \ldots, c_{n/2-1})$ with respect to $\omega^2$:
\[ \mathrm{FFT}_{\omega^2}(b) = (\hat{b}_0, \ldots, \hat{b}_{n/2-1}), \qquad \mathrm{FFT}_{\omega^2}(c) = (\hat{c}_0, \ldots, \hat{c}_{n/2-1}). \]
Then we have
\[ \mathrm{FFT}_\omega(a) = (\hat{b}_0 + \hat{c}_0, \ldots, \hat{b}_{n/2-1} + \omega^{n/2-1} \hat{c}_{n/2-1},\ \hat{b}_0 - \hat{c}_0, \ldots, \hat{b}_{n/2-1} - \omega^{n/2-1} \hat{c}_{n/2-1}). \]
This algorithm requires $O(n \log n)$ multiplications by powers of $\omega$ and $O(n \log n)$ additions (or subtractions).
In practice, it is most efficient to implement an in-place variant of the above algorithm. We will denote by $[i]_p$ the bitwise mirror of $i$ at length $p$ (for instance, $[3]_5 = 24$ and $[11]_5 = 26$). At step $0$, we start with the vector
\[ x_0 = (x_{0,0}, \ldots, x_{0,n-1}) = (a_0, \ldots, a_{n-1}). \]
At step $s \in \{1, \ldots, p\}$, we set
\[ (x_{s, j m_s + i},\ x_{s, j m_s + m_s/2 + i}) := (x_{s-1, j m_s + i} + \omega^{[j]_{p-1}}\, x_{s-1, j m_s + m_s/2 + i},\ x_{s-1, j m_s + i} - \omega^{[j]_{p-1}}\, x_{s-1, j m_s + m_s/2 + i}) \tag{1} \]
for all $j \in \{0, \ldots, 2^{s-1} - 1\}$ and $i \in \{0, \ldots, m_s/2 - 1\}$, where $m_s = 2^{p-s+1}$ (note that $[j]_{p-1} = [j]_{s-1}\, 2^{p-s}$ for $j < 2^{s-1}$). Using induction over $s$, it can easily be seen that
\[ x_{s, j 2^{p-s} + i} = \sum_{0 \leq k < 2^s} \omega^{[j]_s\, k\, 2^{p-s}}\, a_{k 2^{p-s} + i} \]
for all $j \in \{0, \ldots, 2^s - 1\}$ and $i \in \{0, \ldots, 2^{p-s} - 1\}$. In particular,
\[ x_{p, i} = \hat{a}_{[i]_p} \]
for all $i \in \{0, \ldots, n-1\}$. This algorithm of “repeated crossings” is illustrated in figure 1.
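In code, the in-place scheme takes only a few lines. The following is a minimal C++ sketch under the assumption of complex floating-point coefficients; the names fft_inplace and mirror are ours, and the bit-mirrored powers of $\omega$ are computed on the fly instead of being read from a precomputed table.

#include <complex>
#include <cstddef>
#include <vector>

// Bitwise mirror [j]_bits of j at length bits.
std::size_t mirror(std::size_t j, unsigned bits) {
    std::size_t r = 0;
    for (unsigned s = 0; s < bits; ++s) { r = (r << 1) | (j & 1); j >>= 1; }
    return r;
}

// In-place FFT by repeated crossings (1); x.size() must be a power of two 2^p.
// On exit, x[i] holds the value of the input polynomial at omega^{[i]_p},
// where omega = exp(-2*pi*i/n).
void fft_inplace(std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    unsigned p = 0;
    while ((std::size_t(1) << p) < n) ++p;
    const double two_pi = 6.28318530717958647692;
    for (std::size_t half = n / 2; half >= 1; half /= 2)       // steps s = 1, ..., p
        for (std::size_t j = 0; 2 * half * j < n; ++j) {       // blocks of size 2 * half
            // bit-mirrored power omega^{[j]_{p-1}} shared by all crossings of this block
            std::complex<double> w =
                std::polar(1.0, -two_pi * double(mirror(j, p - 1)) / double(n));
            std::size_t start = 2 * half * j;
            for (std::size_t k = 0; k < half; ++k) {
                std::complex<double> u = x[start + k];
                std::complex<double> v = w * x[start + half + k];
                x[start + k]        = u + v;                    // the crossing (1)
                x[start + half + k] = u - v;
            }
        }
}

In an actual implementation, the calls to std::polar would be replaced by lookups in a precomputed table of the bit-mirrored powers of $\omega$ (cf. the array (4) below).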
A classical application of the FFT is the multiplication of polynomials $A = a_0 + \cdots + a_{n-1} X^{n-1}$ and $B = b_0 + \cdots + b_{n-1} X^{n-1}$. Assuming that $\deg AB < n$, we first evaluate $A$ and $B$ in $1, \omega, \ldots, \omega^{n-1}$ using the FFT:
\[ \mathrm{FFT}_\omega(a) = (A(1), A(\omega), \ldots, A(\omega^{n-1})), \qquad \mathrm{FFT}_\omega(b) = (B(1), B(\omega), \ldots, B(\omega^{n-1})). \]
We next compute the evaluations $(A(1) B(1), \ldots, A(\omega^{n-1}) B(\omega^{n-1}))$ of $AB$ at $1, \omega, \ldots, \omega^{n-1}$. We finally have to recover $AB$ from these values using the inverse FFT. But the inverse FFT with respect to $\omega$ is nothing else than $1/n$ times the direct FFT with respect to $\omega^{-1}$. Indeed, for all $a \in R^n$ and all $i \in \{0, \ldots, n-1\}$, we have
\[ \mathrm{FFT}_{\omega^{-1}}(\mathrm{FFT}_\omega(a))_i = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} a_k\, \omega^{j(k-i)} = n\, a_i, \tag{2} \]
since
\[ \sum_{j=0}^{n-1} \omega^{j(k-i)} = 0 \]
whenever $k \neq i$. This yields a multiplication algorithm of time complexity $O(n \log n)$, when assuming that $R$ admits enough primitive $2^p$-th roots of unity. In the case when $R$ does not, new roots of unity can be added artificially [SS71, CK91, vdH02] so as to yield an algorithm of time complexity $O(n \log n \log \log n)$.
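As an illustration of the multiplication algorithm, here is a hedged sketch that reuses fft_inplace and mirror from the previous fragment. The inverse transform follows (2): it is the direct transform with $\omega$ replaced by $\omega^{-1}$, divided by $n$; running the crossings in the opposite order also brings the data back from bit-mirrored to natural order, so no explicit reordering is needed.

#include <complex>
#include <cstddef>
#include <vector>

// Inverse of fft_inplace, based on (2): same crossings, opposite order,
// inverse roots of unity, and a final division by n.
void inv_fft_inplace(std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    unsigned p = 0;
    while ((std::size_t(1) << p) < n) ++p;
    const double two_pi = 6.28318530717958647692;
    for (std::size_t half = 1; half <= n / 2; half *= 2)
        for (std::size_t j = 0; 2 * half * j < n; ++j) {
            std::complex<double> w_inv =
                std::polar(1.0, two_pi * double(mirror(j, p - 1)) / double(n));
            std::size_t start = 2 * half * j;
            for (std::size_t k = 0; k < half; ++k) {
                std::complex<double> a = x[start + k];
                std::complex<double> b = x[start + half + k];
                x[start + k]        = a + b;             // 2 u
                x[start + half + k] = (a - b) * w_inv;   // 2 v
            }
        }
    for (auto& c : x) c /= double(n);
}

// Product of two polynomials given by their coefficient vectors.
std::vector<std::complex<double>> fft_multiply(std::vector<std::complex<double>> a,
                                               std::vector<std::complex<double>> b) {
    std::size_t m = a.size() + b.size() - 1, n = 1;
    while (n < m) n *= 2;                                // smallest power of two >= deg(AB) + 1
    a.resize(n); b.resize(n);
    fft_inplace(a); fft_inplace(b);                      // evaluate A and B
    for (std::size_t i = 0; i < n; ++i) a[i] *= b[i];    // values of AB
    inv_fft_inplace(a);                                  // interpolate AB
    a.resize(m);
    return a;
}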
Given a multivariate polynomial $A \in R[X_1, \ldots, X_d]$ with $\deg_{X_k} A < n_k = 2^{p_k}$ for all $k$, we may also consider its FFT with respect to each of its variables:
\[ \mathrm{FFT}(A)_{i_1, \ldots, i_d} = A(\omega_1^{i_1}, \ldots, \omega_d^{i_d}), \]
where $\omega_k$ is a primitive $n_k$-th root of unity for each $k$. Usually the $\omega_k$ are chosen as suitable powers of a single primitive root of unity of order $\max_k n_k$. The multivariate FFT is simply computed by taking the univariate FFT with respect to each variable.
Instead of using multi-indices for multivariate FFTs, it is more efficient to encode the coefficients using a single array of size $n = n_1 \cdots n_d$. This can be done in several ways. Assume that we have ordered the variables such that $n_1 \geq \cdots \geq n_d$. The lexicographical encoding of the multi-index $(i_1, \ldots, i_d)$ is given by
\[ i_1 n_2 \cdots n_d + i_2 n_3 \cdots n_d + \cdots + i_d. \]
The simultaneous encoding of is recursively defined by
where , and is maximal with . For each , we denote by the maximal number with .
Using one of these encodings, one may use the above in-place algorithm for the computation of the multivariate FFT, when replacing the crossing relation by
(3)
Here
in case of the lexicographical encoding and
in case of the simultaneous encoding. Translated back into terms of monomials, crossings at stage involve monomials and , where
in case of the lexicographical encoding and
in case of the simultaneous encoding.
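As a small illustration of the lexicographical encoding (the simultaneous encoding, which interleaves blocks of bits, is not shown), a multi-index can be packed into a single machine word by concatenating the bit strings of its entries. The function name and the use of 64-bit words are our own assumptions.

#include <cstddef>
#include <cstdint>
#include <vector>

// Lexicographical encoding of (i_1, ..., i_d) with 0 <= i_k < 2^{p_k};
// assumes p_1 + ... + p_d <= 64.
std::uint64_t lex_encode(const std::vector<std::uint64_t>& i,
                         const std::vector<unsigned>& p) {
    std::uint64_t code = 0;
    for (std::size_t k = 0; k < i.size(); ++k)
        code = (code << p[k]) | i[k];        // append the p_k bits of i_k
    return code;
}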
A challenge for concrete implementations of the FFT in computer algebra systems is to achieve both optimal speed and genericity. Also, we would like to allow for univariate as well as multivariate FFTs. A first attempt in this direction is now part of Mmxlib [vdH05]. Essentially, three models for the roots of unity are supported:
The Schönhage-Strassen model.
The nice prime number model.
The floating-point model.
In our implementation, we have created a special template class fft_root<C> for roots of unity in C, which may be specialized to any of the above models. The class fft_root<C> essentially implements methods for multiplying roots of unity with each other and with elements of C. The powers of $\omega$ which are needed during the FFT are precomputed once and for all.
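The following fragment indicates what such a class might look like; it is only a naive model in which the root is stored explicitly, and the interface is an assumption on our part rather than the actual Mmxlib interface.

#include <complex>

template <typename C>
class fft_root {
    C value;                                             // naive representation: the root itself
public:
    explicit fft_root(const C& z) : value(z) {}
    fft_root operator*(const fft_root& other) const {    // product of two roots of unity
        return fft_root(value * other.value);
    }
    C operator*(const C& coefficient) const {            // root times element of C
        return value * coefficient;
    }
};

// Possible use in the floating-point model:
//   fft_root<std::complex<double>> w(std::polar(1.0, -6.2831853071795864 / 1024));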
Of course, the precomputation of the powers of $\omega$ requires some additional space. Nevertheless, this does no harm if the precomputed table is small compared to the transformed data, or when the encoding of roots of unity requires significantly less space than storing elements of C (for instance, in the Schönhage-Strassen model, it suffices to store the shifts). In the remaining case, some space may be saved by precomputing only part of the table and by computing the remaining values on demand during the last two steps.
It should also be noticed that we really precompute the array
(4)
There are several reasons for doing so. First of all, we really need the binary mirrored power in the cross relation (3). Secondly, is easily deduced from . Finally, precomputation of the is useful for the inverse transform.
At present, this strategy has not yet been implemented in Mmxlib.
On the other hand, the nice prime number and floating-point models admit many roots of unity, but genuine multiplications (and divisions) have to be carried out during the FFT step. In the nice prime number model, one computes modulo a prime number which has many $2^k$-th roots of unity, yet still fits into a single machine word. One may also use several such prime numbers, in combination with Chinese remaindering. In the floating-point model, one uses floating-point approximations of complex numbers. This is particularly useful if fast floating-point arithmetic is available, but one always has to deal carefully with rounding errors.
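For concreteness, here is a sketch of the arithmetic underlying the nice prime number model. The prime 998244353 = 119·2^23 + 1 (with primitive root 3) is a common choice that fits into a machine word; it is our example, not necessarily the prime used in the actual implementation.

#include <cstdint>

constexpr std::uint64_t P = 998244353;         // 119 * 2^23 + 1

std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return a * b % P;                           // operands < P < 2^30, so no overflow
}

std::uint64_t pow_mod(std::uint64_t a, std::uint64_t e) {
    std::uint64_t r = 1;
    for (; e != 0; e >>= 1, a = mul_mod(a, a))
        if (e & 1) r = mul_mod(r, a);
    return r;
}

// A primitive 2^k-th root of unity modulo P, valid for k <= 23.
std::uint64_t root_of_unity(unsigned k) {
    return pow_mod(3, (P - 1) >> k);
}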
In practice, it follows that the choice of an optimal model for the FFT depends very much on the application. When using the FFT for multiplication, a general rule of thumb is to balance the costs of the FFT step and the inner multiplication step. If the inner multiplication step is carried out just once, then Schönhage-Strassen's model is usually optimal. This is for instance the case when multiplying two large integers. If one FFT corresponds to several inner multiplications, as in the case of matrix multiplication with large integer coefficients, one should opt for the nice prime or floating-point model, depending on the efficiency of division, floating-point operations and the available amount of space.
In between, it may sometimes be useful to use Schönhage-Strassen's method with cube roots (instead of square roots) of the number of digits (or the degree) [vdH02, Section 6.1.3]. Similarly, one may force an extra iteration (i.e. immediately use fourth roots, but lose an additional constant factor). Finally, in the case of multivariate FFTs (we consider that large integer coefficients account for an additional dimension), one may artificially generate additional roots of unity. For instance, if has a -th root of unity , then one may multiply bivariate polynomials in , by using as a -th root of during the FFT w.r.t. .
Let $n \leq 2^p$, and let $\omega$ be a primitive $2^p$-th root of unity. The Truncated Fourier Transform (TFT) from [vdH04, Section 3] takes an $n$-tuple $(a_0, \ldots, a_{n-1})$ on input and produces the $n$-tuple $(\hat{a}_0, \ldots, \hat{a}_{n-1})$ with
\[ \hat{a}_i = A(\omega^{[i]_p}) \]
for all $i$, where $A = a_0 + a_1 X + \cdots + a_{n-1} X^{n-1}$. More generally, if $S$ is a subset of $\{0, \ldots, 2^p - 1\}$ and $(a_i)_{i \in S}$ a tuple indexed by $S$, then we define
\[ \mathrm{TFT}_S((a_i)_{i \in S}) = (A(\omega^{[i]_p}))_{i \in S}, \qquad A = \sum_{i \in S} a_i X^i. \]
Even more generally, one may select a target subset $T$ of $\{0, \ldots, 2^p - 1\}$ which is different from $S$, and define
\[ \mathrm{TFT}_{S \to T}((a_i)_{i \in S}) = (A(\omega^{[i]_p}))_{i \in T}. \]
In order to compute the TFT, we simply consider the computation graph of a complete FFT and throw out all computations which do not contribute to the result (see figures 2, 3 and 4). If the set $S$ is “sufficiently connected”, then the cost of the computation is significantly smaller than the cost of a complete FFT; for the case when $S = \{0, \ldots, n-1\}$ is an initial segment, precise bounds were proved in [vdH04].
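In code, the pruning is a one-line change with respect to fft_inplace from section 2: a block of crossings is only processed when it still contributes to one of the first n results. This simplified sketch (which reuses mirror and processes the last partial block in full, at the price of O(2^p) extra operations) only covers the initial segment S = {0, ..., n − 1}.

#include <complex>
#include <cstddef>
#include <vector>

// Truncated Fourier transform for the initial segment {0, ..., n-1}.
// x.size() = 2^p >= n; on entry x holds the coefficients padded with zeros,
// on exit x[i] holds the value at omega^{[i]_p} for all i < n.
void tft(std::vector<std::complex<double>>& x, std::size_t n) {
    const std::size_t N = x.size();
    unsigned p = 0;
    while ((std::size_t(1) << p) < N) ++p;
    const double two_pi = 6.28318530717958647692;
    for (std::size_t half = N / 2; half >= 1; half /= 2)
        for (std::size_t j = 0; 2 * half * j < n; ++j) {        // skip blocks past n
            std::complex<double> w =
                std::polar(1.0, -two_pi * double(mirror(j, p - 1)) / double(N));
            std::size_t start = 2 * half * j;
            for (std::size_t k = 0; k < half; ++k) {
                std::complex<double> u = x[start + k];
                std::complex<double> v = w * x[start + half + k];
                x[start + k]        = u + v;
                x[start + half + k] = u - v;
            }
        }
}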
Inverting the Truncated Fourier Transform
In order to use the TFT for the multiplication of numbers, polynomials or power series, we also need an algorithm to compute the inverse transform. In this section, we give such an algorithm in the case when $T = S$ and when $S$ is an initial segment for the (partial) “bit ordering” $\preccurlyeq$ on $\{0, \ldots, 2^p - 1\}$: given numbers $i = i_{p-1} \cdots i_0$ and $j = j_{p-1} \cdots j_0$ (written in binary, with $i_k, j_k \in \{0, 1\}$), we set $i \preccurlyeq j$ if $i_k \leq j_k$ for all $k$. We recall that an initial segment of $\{0, \ldots, 2^p - 1\}$ is a subset $I$ such that $i \preccurlyeq j$ and $j \in I$ imply $i \in I$.
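With the bitwise reading of $\preccurlyeq$ given above, the ordering can be tested with a single machine operation (the function name is ours):

// i precedes j for the bit ordering iff every bit set in i is also set in j.
bool bit_leq(unsigned i, unsigned j) { return (i & ~j) == 0u; }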
The inverse TFT is based on the key observation that the cross relation can also be applied to compute “one diagonal from the other diagonal”. More precisely, given a relation
\[ \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} u + \omega^k v \\ u - \omega^k v \end{pmatrix}, \tag{5} \]
where $\omega^k$ is a root of unity for some $k$, we may clearly compute $(a, b)$ as a function of $(u, v)$ and vice versa,
\[ \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} (a + b)/2 \\ (a - b)\, \omega^{-k}/2 \end{pmatrix}, \tag{6} \]
but we also have
\[ \begin{pmatrix} a \\ v \end{pmatrix} = \begin{pmatrix} 2u - b \\ (u - b)\, \omega^{-k} \end{pmatrix}, \qquad \begin{pmatrix} u \\ b \end{pmatrix} = \begin{pmatrix} a - \omega^k v \\ a - 2\, \omega^k v \end{pmatrix}. \tag{7} \]
Moreover, these relations only involve shifting (multiplication and division by $2$), additions, subtractions and multiplications by roots of unity.
In order to state our in-place algorithm for computing the inverse TFT, we will need some more notations. At the very start, we have and on input, where is the complement of in and is the zero function. Roughly speaking, the algorithm will replace by its inverse TFT and by its direct TFT . In the actual algorithm, the array will be stored in a special way (see the next section) and only a part of the computation of will really be carried out.
Our algorithm proceeds by recursion. In the recursive step, is replaced by a subset of of the form , where is the current step and . The sets and will again be subsets of with , and is recursively assumed to be an initial segment for the bit ordering. Given a subset of , we will denote and , where
Similarly, given , we will denote by and the restrictions of to resp. . We now have the following recursive in-place algorithm for computing the inverse TFT (see figure 5 for an illustration of the different steps).
Algorithm
If or then return
If then apply the partial inverse FFT on and return
For , cross with using (5)
For , cross with using (7)
For , cross with using (6)
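The following self-contained C++ sketch implements this strategy for the most common special case S = {0, ..., n − 1}, in the same conventions as the sketches of section 2 (it reuses mirror). It is a recursive rather than an in-place formulation, and the names and the global variables are ours; it is meant to illustrate how the relations (5)–(7) are combined, not to reproduce the algorithm above literally.

#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

static std::size_t tft_N;   // full transform length 2^p (set by inv_tft below)
static unsigned    tft_P;   // p

// omega^{+e} or omega^{-e} for omega = exp(-2*pi*i / tft_N).
static cd omega_pow(std::size_t e, int sign) {
    const double two_pi = 6.28318530717958647692;
    return std::polar(1.0, -sign * two_pi * double(e) / double(tft_N));
}

// Undo all crossings of the block with index j at the level where blocks have
// length len (a full inverse FFT of that block).
static void inv_fft_block(cd* x, std::size_t len, std::size_t j) {
    if (len == 1) return;
    std::size_t L = len / 2;
    inv_fft_block(x, L, 2 * j);                      // undo the deeper crossings first
    inv_fft_block(x + L, L, 2 * j + 1);
    cd winv = omega_pow(mirror(j, tft_P - 1), -1);   // omega^{-[j]_{p-1}}
    for (std::size_t i = 0; i < L; ++i) {            // relation (6)
        cd a = x[i], b = x[L + i];
        x[i]     = 0.5 * (a + b);
        x[L + i] = 0.5 * (a - b) * winv;
    }
}

// Recover the first n coefficients of the block (length len, index j), given
// transformed values at positions < n and plain coefficients at positions >= n.
static void inv_tft_rec(cd* x, std::size_t len, std::size_t n, std::size_t j) {
    if (n == 0 || len == 1) return;
    std::size_t L = len / 2;
    cd w    = omega_pow(mirror(j, tft_P - 1), +1);
    cd winv = omega_pow(mirror(j, tft_P - 1), -1);
    if (n <= L) {                                    // boundary in the left half
        for (std::size_t i = n; i < L; ++i)
            x[i] += w * x[L + i];                    // first half of the crossing (5)
        inv_tft_rec(x, L, n, 2 * j);                 // recover the sums for i < n
        for (std::size_t i = 0; i < n; ++i)
            x[i] -= w * x[L + i];                    // diagonal relation (7)
    } else {                                         // boundary in the right half
        inv_fft_block(x, L, 2 * j);                  // left half now holds u_i + w v_i
        for (std::size_t i = n - L; i < L; ++i) {    // cross with the known v_i
            cd v = x[L + i];
            x[i]     -= w * v;                       // u_i, by (7)
            x[L + i]  = x[i] - w * v;                // u_i - w v_i, by (5)
        }
        inv_tft_rec(x + L, L, n - L, 2 * j + 1);     // recover the differences
        for (std::size_t i = 0; i < n - L; ++i) {    // finish with relation (6)
            cd s = x[i], d = x[L + i];
            x[i]     = 0.5 * (s + d);
            x[L + i] = 0.5 * (s - d) * winv;
        }
    }
}

// x[0 .. n-1]: TFT values; x[n .. 2^p - 1]: the (zero) coefficients of the
// padding.  On exit, x[0 .. n-1] holds the recovered coefficients.
void inv_tft(std::vector<cd>& x, std::size_t n) {
    tft_N = x.size();
    tft_P = 0;
    while ((std::size_t(1) << tft_P) < tft_N) ++tft_P;
    inv_tft_rec(x.data(), tft_N, n, 0);
}

For polynomial multiplication one would call tft on both zero-padded inputs, multiply the first n values pointwise, reset the positions n, ..., 2^p − 1 to zero and call inv_tft.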
Applying the algorithm for with , we have
Remark
Let and let be a primitive -th root of unity for each . Given subsets of and a mapping , the multivariate TFT of is a mapping defined by
See figure 6 for an example with in dimension . The multivariate TFT and its inverse are computed in a similar way as in the univariate case, using one of the encodings from section 3 of tuples in by integers in and using the corresponding FFT-profile determined by instead of .
We generalize the bit-ordering on sets of the form to sets of indices by
If is one of the encodings or from section , then it can be checked that
If is an initial segment for the bit ordering, this ensures that the inverse TFT also generalizes to the multivariate case.
From the complexity analysis point of view, two special cases are particularly interesting. The first block case is when with for all . Then using the lexicographical encoding, the multivariate TFT can be seen as a succession of univariate TFTs whose coefficients are mappings , with
Setting , and , theorems 1 and 2 therefore generalize to:
Remark
The other important simplicial case is when and for some fixed with (and more generally, one may consider weighted degrees). In this case, we choose the simultaneous encoding for the multivariate TFT (see figure 7 for an example). Now the size of is . At stages , we notice that the number of “active nodes” for the TFT is bounded by . When , we have
which proves:
Remark
Remark
In order to implement the TFT and its inverse, one first has to decide how to represent the sets and , and mappings . A convenient approach is to represent , and (for the direct TFT) or , and (for the inverse TFT) by a single binary “TFT-tree”. Such a binary tree corresponds to what happens on a subset of the form with . The binary tree is of one of the following forms:
When the tree is reduced to a leaf, it explicitly contains a map , which is represented by an array of length in memory. By convention, if is reduced to a null-pointer, then represents the zero map. Moreover, the node contains a flag in order to indicate whether or (for leaves, it is assumed that we either have or and similarly for ).
When the tree consists of a binary node with subtrees and , then encodes what happens on the interval , whereas encodes what happens on the interval . For convenience, the tree still contains a flag to indicate whether or .
Notice that the bounds of the interval need not be explicitly present in the representation of the tree; when needed, they can be passed as parameters to recursive functions on the trees.
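In code, a TFT-tree might be represented as follows; the field names and the use of std::unique_ptr and std::vector are our own choices, and the two flags merely stand in for the conventions described above.

#include <complex>
#include <memory>
#include <vector>

struct tft_tree {
    bool source_flag;                          // flag describing how the interval meets the source set
    bool target_flag;                          // same, for the target set
    std::unique_ptr<tft_tree> left, right;     // set for binary nodes only
    std::vector<std::complex<double>> data;    // set for leaves; empty array = zero map
    bool is_leaf() const { return !left && !right; }
};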
Only the direct TFT has currently been implemented in Mmxlib.
In tables 1–4, we have given a few benchmarks on a 2.4 GHz AMD architecture with 512 MB of memory. We use as our coefficient ring and we tested the simplicial multivariate TFT for total degree and dimension . The tables show the size of the input , the real and average running times and , the theoretical number of crossings to be computed , its average and the ratio . The number corresponds to the running time divided by the running time of the univariate TFT for the same input size. The tables show both the most favourable cases, when is a power of two, and the worst cases, when is a power of two plus one.
[Tables 1–4: benchmarks for the univariate and simplicial multivariate TFT, covering both the best cases (powers of two) and the worst cases (powers of two plus one).]
A few conclusions can be drawn from tables 1–4. First of all, in the univariate case, the TFT behaves more or less as theoretically predicted. The non-trivial ratios in higher dimensions indicate that the set is quite disconnected; hence, optimizations in the implementation of TFT-trees may lead to non-trivial gains. The differences between the running times for the best and worst cases in higher dimensions show that the multivariate TFT is still quite sensitive to the jump phenomenon (even though less so than a full FFT). Finally, in the best case, the ratio becomes reasonably small. Indeed, this ratio should theoretically tend to and should be bounded by $d!$ in order to gain something w.r.t. a full FFT. However, it can become quite large in the worst case. The various observations suggest the following future improvements.
for all and
for each . It can be hoped that this acceleration approximately reduces the current running times for the worst case to the current running times for the best case. For high dimensions, it may be useful to apply the trick a second time for the next stages ; however, the corresponding formulas become more complicated. We hope that further study will enable the replacement of the condition in theorem 6 by a better condition, like .
Assume now that $R$ is a ring such that $R[\mathrm{i}]$ (with $\mathrm{i}^2 = -1$) has many $2^p$-th roots of unity. Typically, this is the case when $R$ is the “field” of floating-point numbers. A disadvantage of the standard FFT or TFT in this context is that the size of the input is doubled in order to represent numbers in $R[\mathrm{i}]$. This doubling of the size actually corresponds to the fact that the FFT or TFT contains redundant information in this case. Indeed, given a polynomial $A$ with coefficients in $R$, we have $\overline{A(\omega)} = A(\bar{\omega}) = A(\omega^{-1})$ for any $2^p$-th root of unity $\omega$. This suggests evaluating $A$ either in $\omega$ or in $\omega^{-1}$, but not in both.
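The same redundancy underlies the classical trick of computing the transforms of two real sequences with a single complex FFT. The sketch below is that classical trick, not the real TFT of this section; it reuses fft_inplace from section 2 and applies a final bit-mirror permutation in order to pair the indices k and n − k.

#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// Natural-order FFT: fft_inplace followed by the bit-mirror permutation,
// so that x[k] holds the value at omega^k.
void fft_natural(std::vector<cd>& x) {
    fft_inplace(x);
    std::size_t n = x.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {     // standard bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) std::swap(x[i], x[j]);
    }
}

// FFTs of two real sequences a and b (of length n = 2^p) with a single
// complex FFT, using conj(A(omega^k)) = A(omega^{n-k}) for real input.
void two_real_ffts(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<cd>& A, std::vector<cd>& B) {
    std::size_t n = a.size();
    std::vector<cd> z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = cd(a[i], b[i]);   // pack a + i b
    fft_natural(z);
    A.assign(n, cd(0.0)); B.assign(n, cd(0.0));
    for (std::size_t k = 0; k < n; ++k) {
        cd zk = z[k], zc = std::conj(z[(n - k) % n]);
        A[k] = 0.5 * (zk + zc);                      // transform of a
        B[k] = cd(0.0, -0.5) * (zk - zc);            // transform of b
    }
}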
With the notations from section 5, let be an initial segment for the bit ordering . As our target set , we take
The “real TFT” now consists of applying the usual TFT with source and target . It can be checked that crossings of “real nodes” introduce “complex nodes” precisely when one of the two nodes disappears at the next stage (see figure 8). It can also be checked that the inverse TFT naturally adapts so as to compute the inverse real TFT. For “sufficiently connected” sets , like , the real TFT is asymptotically twice as fast as the complex TFT.
[CK91] D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
[CT65] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.
[SS71] A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281–292, 1971.
[vdH02] J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34(6):479–542, 2002.
[vdH04] J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proceedings of ISSAC 2004, pages 290–296, Santander, Spain, July 2004.
[vdH05] J. van der Hoeven et al. Mmxlib: the standard library for Mathemagix, 2002–2005. ftp://ftp.mathemagix.org/pub/mathemagix/targz.