In a previous paper [vdH04], we introduced a truncated version of the classical Fast Fourier Transform. When applied to polynomial multiplication, this algorithm has the nice property of eliminating the “jumps” in the complexity at powers of two. When applied to the multiplication of multivariate polynomials or truncated multivariate power series, a non-trivial asymptotic factor was gained with respect to the best previously known algorithms. In the present note, we correct two errors which slipped into the previous paper and we give a new application to the multiplication of polynomials with real coefficients. We also give some further hints on how to implement the TFT in practice.
Let $R$ be an effective ring of constants (i.e. the usual arithmetic operations $+$, $-$ and $\times$ can be carried out by algorithm). If $R$ has a primitive $n$-th root of unity with $n = 2^p$ and $\deg PQ < n$, then the product of two polynomials $P, Q \in R[X]$ can be computed in time $O(n \log n)$ using the Fast Fourier Transform or FFT [CT65]. If $R$ does not admit a primitive $2^p$-th root of unity, then one needs an additional overhead of $O(\log \log n)$ in order to carry out the multiplication, by artificially adding new roots of unity [SS71, CK91].
Besides the fact that the asymptotic complexity of the FFT involves a large constant factor, another classical drawback is that the complexity function admits important jumps at each power of two. These jumps can be reduced by using roots of unity of orders $2^p r$ for small odd $r$. They can also be smoothed by decomposing an $n$-multiplication into several smaller multiplications. However, these tricks are not very elegant, are cumbersome to implement, and they do not allow one to eliminate the jump problem completely. The jump phenomenon becomes even more important for $d$-dimensional FFTs, since the complexity is multiplied by $2^d$ whenever one of the degrees traverses a power of two.
In [vdH04], the author introduced a new kind of “Truncated Fourier Transform” (TFT), which allows for the fast evaluation of a polynomial in any number $n$ of well-chosen roots of unity. This algorithm coincides with the usual FFT if $n$ is a power of two, but it behaves smoothly for intermediate values. Moreover, the inverse TFT can be carried out with the same complexity, and the approach generalizes to higher dimensions.
Unfortunately, two errors slipped into the final version of [vdH04]: in the multivariate TFT, we forgot certain crossings. As a consequence, the complexity bounds for multiplying multivariate polynomials (and power series) of given total degree only hold under an additional condition (cf. theorem 6 below). Moreover, the inverse TFT does not generalize to arbitrary “unions of intervals”.
The present note has several purposes: correcting the above mistakes in [vdH04], providing further details on how to implement the FFT and TFT in a sufficiently generic way (suitable for univariate as well as multivariate computations), and presenting a new extension to the case of TFTs with real coefficients. Furthermore, a generic implementation of the FFT and TFT is in progress as part of Mmxlib, the standard C++ library for Mathemagix [vdH05].
Let $R$ be an effective ring of constants, with $n = 2^p$ ($p \in \mathbb{N}$) and a primitive $n$-th root of unity $\omega \in R$ (i.e. $\omega^{n/2} = -1$). The discrete Fast Fourier Transform (FFT) of an $n$-tuple $a = (a_0, \ldots, a_{n-1}) \in R^n$ (with respect to $\omega$) is the $n$-tuple $\hat{a} = (\hat{a}_0, \ldots, \hat{a}_{n-1}) = \mathrm{FFT}_\omega(a) \in R^n$ with
\[ \hat{a}_i = \sum_{j=0}^{n-1} a_j \omega^{i j}. \]
In other words, $\hat{a}_i = A(\omega^i)$, where $A \in R[X]$ denotes the polynomial $A = a_0 + a_1 X + \cdots + a_{n-1} X^{n-1}$. We also say that $\hat{a}$ is the FFT of $A$.
The FFT can be computed efficiently using binary splitting: writing
\[ (a_0, \ldots, a_{n-1}) = (b_0, c_0, \ldots, b_{n/2-1}, c_{n/2-1}), \]
we recursively compute the Fourier transforms of $(b_0, \ldots, b_{n/2-1})$ and $(c_0, \ldots, c_{n/2-1})$ with respect to $\omega^2$:
\[ \mathrm{FFT}_{\omega^2}(b) = (\hat{b}_0, \ldots, \hat{b}_{n/2-1}), \qquad \mathrm{FFT}_{\omega^2}(c) = (\hat{c}_0, \ldots, \hat{c}_{n/2-1}). \]
Then we have
\[ \mathrm{FFT}_\omega(a) = (\hat{b}_0 + \hat{c}_0, \ldots, \hat{b}_{n/2-1} + \omega^{n/2-1} \hat{c}_{n/2-1},\ \hat{b}_0 - \hat{c}_0, \ldots, \hat{b}_{n/2-1} - \omega^{n/2-1} \hat{c}_{n/2-1}). \]
This algorithm requires $O(n \log n)$ multiplications by powers of $\omega$ and $O(n \log n)$ additions (or subtractions).
In practice, it is most efficient to implement an in-place variant of the above algorithm. We will denote by $[i]_p$ the bitwise mirror of $i$ at length $p$ (for instance, $[3]_5 = 24$ and $[11]_5 = 26$). At step $0$, we start with the vector
\[ x_0 = (x_{0,0}, \ldots, x_{0,n-1}) = (a_0, \ldots, a_{n-1}). \]
At step $s \in \{1, \ldots, p\}$, we set
\[ (x_{s, j m_s + i},\ x_{s, j m_s + m_s/2 + i}) := (x_{s-1, j m_s + i} + \omega^{[j]_{p-1}}\, x_{s-1, j m_s + m_s/2 + i},\ x_{s-1, j m_s + i} - \omega^{[j]_{p-1}}\, x_{s-1, j m_s + m_s/2 + i}) \tag{1} \]
for all $j \in \{0, \ldots, 2^{s-1} - 1\}$ and $i \in \{0, \ldots, m_s/2 - 1\}$, where $m_s = 2^{p-s+1}$ (note that $[j]_{p-1} = [j]_{s-1}\, 2^{p-s}$ for $j < 2^{s-1}$). Using induction over $s$, it can easily be seen that
\[ x_{s, j 2^{p-s} + i} = \sum_{0 \leq k < 2^s} \omega^{[j]_s\, k\, 2^{p-s}}\, a_{k 2^{p-s} + i} \]
for all $j \in \{0, \ldots, 2^s - 1\}$ and $i \in \{0, \ldots, 2^{p-s} - 1\}$. In particular,
\[ x_{p, i} = \hat{a}_{[i]_p} \]
for all $i \in \{0, \ldots, n-1\}$. This algorithm of “repeated crossings” is illustrated in figure 1.
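In code, the in-place scheme takes only a few lines. The following is a minimal C++ sketch under the assumption of complex floating-point coefficients; the names fft_inplace and mirror are ours, and the bit-mirrored powers of $\omega$ are computed on the fly instead of being read from a precomputed table.

#include <complex>
#include <cstddef>
#include <vector>

// Bitwise mirror [j]_bits of j at length bits.
std::size_t mirror(std::size_t j, unsigned bits) {
    std::size_t r = 0;
    for (unsigned s = 0; s < bits; ++s) { r = (r << 1) | (j & 1); j >>= 1; }
    return r;
}

// In-place FFT by repeated crossings (1); x.size() must be a power of two 2^p.
// On exit, x[i] holds the value of the input polynomial at omega^{[i]_p},
// where omega = exp(-2*pi*i/n).
void fft_inplace(std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    unsigned p = 0;
    while ((std::size_t(1) << p) < n) ++p;
    const double two_pi = 6.28318530717958647692;
    for (std::size_t half = n / 2; half >= 1; half /= 2)       // steps s = 1, ..., p
        for (std::size_t j = 0; 2 * half * j < n; ++j) {       // blocks of size 2 * half
            // bit-mirrored power omega^{[j]_{p-1}} shared by all crossings of this block
            std::complex<double> w =
                std::polar(1.0, -two_pi * double(mirror(j, p - 1)) / double(n));
            std::size_t start = 2 * half * j;
            for (std::size_t k = 0; k < half; ++k) {
                std::complex<double> u = x[start + k];
                std::complex<double> v = w * x[start + half + k];
                x[start + k]        = u + v;                    // the crossing (1)
                x[start + half + k] = u - v;
            }
        }
}

In an actual implementation, the calls to std::polar would be replaced by lookups in a precomputed table of the bit-mirrored powers of $\omega$ (cf. the array (4) below).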
A classical application of the FFT is the multiplication of polynomials $A = a_0 + \cdots + a_{n-1} X^{n-1}$ and $B = b_0 + \cdots + b_{n-1} X^{n-1}$. Assuming that $\deg AB < n$, we first evaluate $A$ and $B$ in $1, \omega, \ldots, \omega^{n-1}$ using the FFT:
\[ \mathrm{FFT}_\omega(a) = (A(1), A(\omega), \ldots, A(\omega^{n-1})), \qquad \mathrm{FFT}_\omega(b) = (B(1), B(\omega), \ldots, B(\omega^{n-1})). \]
We next compute the evaluations $(A(1) B(1), \ldots, A(\omega^{n-1}) B(\omega^{n-1}))$ of $AB$ at $1, \omega, \ldots, \omega^{n-1}$. We finally have to recover $AB$ from these values using the inverse FFT. But the inverse FFT with respect to $\omega$ is nothing else than $1/n$ times the direct FFT with respect to $\omega^{-1}$. Indeed, for all $a \in R^n$ and all $i \in \{0, \ldots, n-1\}$, we have
\[ \mathrm{FFT}_{\omega^{-1}}(\mathrm{FFT}_\omega(a))_i = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} a_k\, \omega^{j(k-i)} = n\, a_i, \tag{2} \]
since
\[ \sum_{j=0}^{n-1} \omega^{j(k-i)} = 0 \]
whenever $k \neq i$. This yields a multiplication algorithm of time complexity $O(n \log n)$, when assuming that $R$ admits enough primitive $2^p$-th roots of unity. In the case when $R$ does not, new roots of unity can be added artificially [SS71, CK91, vdH02] so as to yield an algorithm of time complexity $O(n \log n \log \log n)$.
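As an illustration of the multiplication algorithm, here is a hedged sketch that reuses fft_inplace and mirror from the previous fragment. The inverse transform follows (2): it is the direct transform with $\omega$ replaced by $\omega^{-1}$, divided by $n$; running the crossings in the opposite order also brings the data back from bit-mirrored to natural order, so no explicit reordering is needed.

#include <complex>
#include <cstddef>
#include <vector>

// Inverse of fft_inplace, based on (2): same crossings, opposite order,
// inverse roots of unity, and a final division by n.
void inv_fft_inplace(std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    unsigned p = 0;
    while ((std::size_t(1) << p) < n) ++p;
    const double two_pi = 6.28318530717958647692;
    for (std::size_t half = 1; half <= n / 2; half *= 2)
        for (std::size_t j = 0; 2 * half * j < n; ++j) {
            std::complex<double> w_inv =
                std::polar(1.0, two_pi * double(mirror(j, p - 1)) / double(n));
            std::size_t start = 2 * half * j;
            for (std::size_t k = 0; k < half; ++k) {
                std::complex<double> a = x[start + k];
                std::complex<double> b = x[start + half + k];
                x[start + k]        = a + b;             // 2 u
                x[start + half + k] = (a - b) * w_inv;   // 2 v
            }
        }
    for (auto& c : x) c /= double(n);
}

// Product of two polynomials given by their coefficient vectors.
std::vector<std::complex<double>> fft_multiply(std::vector<std::complex<double>> a,
                                               std::vector<std::complex<double>> b) {
    std::size_t m = a.size() + b.size() - 1, n = 1;
    while (n < m) n *= 2;                                // smallest power of two >= deg(AB) + 1
    a.resize(n); b.resize(n);
    fft_inplace(a); fft_inplace(b);                      // evaluate A and B
    for (std::size_t i = 0; i < n; ++i) a[i] *= b[i];    // values of AB
    inv_fft_inplace(a);                                  // interpolate AB
    a.resize(m);
    return a;
}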
Given a multivariate polynomial $A \in R[X_1, \ldots, X_d]$ with $\deg_{X_k} A < n_k = 2^{p_k}$ for all $k$, we may also consider its FFT with respect to each of its variables:
\[ \mathrm{FFT}(A)_{i_1, \ldots, i_d} = A(\omega_1^{i_1}, \ldots, \omega_d^{i_d}), \]
where $\omega_k$ is a primitive $n_k$-th root of unity for each $k$. Usually the $\omega_k$ are chosen as suitable powers of a single primitive root of unity of order $\max_k n_k$. The multivariate FFT is simply computed by taking the univariate FFT with respect to each variable.
Instead of using multi-indices for multivariate FFTs, it is more efficient to encode the coefficients using a single array of size $n = n_1 \cdots n_d$. This can be done in several ways. Assume that we have ordered the variables such that $n_1 \geq \cdots \geq n_d$. The lexicographical encoding of the multi-index $(i_1, \ldots, i_d)$ is given by
\[ i_1 n_2 \cdots n_d + i_2 n_3 \cdots n_d + \cdots + i_d. \]
The simultaneous encoding of is recursively defined by
where , and is maximal with . For each , we denote by the maximal number with .
Using one of these encodings, one may use the above in-place algorithm for the computation of the multivariate FFT, when replacing the crossing relation by
(3)
Here
in case of the lexicographical encoding and
in case of the simultaneous encoding. Translated back into terms of monomials, crossings at stage involve monomials and , where
in case of the lexicographical encoding and
in case of the simultaneous encoding.
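As a small illustration of the lexicographical encoding (the simultaneous encoding, which interleaves blocks of bits, is not shown), a multi-index can be packed into a single machine word by concatenating the bit strings of its entries. The function name and the use of 64-bit words are our own assumptions.

#include <cstddef>
#include <cstdint>
#include <vector>

// Lexicographical encoding of (i_1, ..., i_d) with 0 <= i_k < 2^{p_k};
// assumes p_1 + ... + p_d <= 64.
std::uint64_t lex_encode(const std::vector<std::uint64_t>& i,
                         const std::vector<unsigned>& p) {
    std::uint64_t code = 0;
    for (std::size_t k = 0; k < i.size(); ++k)
        code = (code << p[k]) | i[k];        // append the p_k bits of i_k
    return code;
}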
A challenge for concrete implementations of the FFT in computer algebra systems is to achieve both optimal speed and genericity. Also, we would like to allow for univariate as well as multivariate FFTs. A first attempt in this direction is now part of Mmxlib [vdH05]. Essentially, three models for the roots of unity are supported:
The Schönhage-Strassen model.
The nice prime number model.
The floating-point model.
In our implementation, we have created a special template class fft_root<C> for roots of unity in C, which may be specialized to any of the above models. The class fft_root<C> essentially implements methods for multiplying roots of unity with each other and with elements of C. The powers of $\omega$ which are needed during the FFT are precomputed once and for all.
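The following fragment indicates what such a class might look like; it is only a naive model in which the root is stored explicitly, and the interface is an assumption on our part rather than the actual Mmxlib interface.

#include <complex>

template <typename C>
class fft_root {
    C value;                                             // naive representation: the root itself
public:
    explicit fft_root(const C& z) : value(z) {}
    fft_root operator*(const fft_root& other) const {    // product of two roots of unity
        return fft_root(value * other.value);
    }
    C operator*(const C& coefficient) const {            // root times element of C
        return value * coefficient;
    }
};

// Possible use in the floating-point model:
//   fft_root<std::complex<double>> w(std::polar(1.0, -6.2831853071795864 / 1024));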
Of course, the precomputation of the powers of $\omega$ requires some additional space. Nevertheless, this does no harm if the precomputed table is small compared to the transformed data, or when the encoding of roots of unity requires significantly less space than storing elements of C (for instance, in the Schönhage-Strassen model, it suffices to store the shifts). In the remaining case, some space may be saved by precomputing only part of the table and by computing the remaining values on demand during the last two steps.
It should also be noticed that we really precompute the array
(4)
There are several reasons for doing so. First of all, we really need the binary mirrored power in the cross relation (3). Secondly, is easily deduced from . Finally, precomputation of the is useful for the inverse transform.
At present, this strategy has not yet been implemented in Mmxlib.
On the other hand, the nice prime number and floating-point models admit many roots of unity, but genuine multiplications (and divisions) have to be carried out during the FFT step. In the nice prime number model, one computes modulo a prime number which has many $2^k$-th roots of unity, yet still fits into a single machine word. One may also use several such prime numbers, in combination with Chinese remaindering. In the floating-point model, one uses floating-point approximations of complex numbers. This is particularly useful if fast floating-point arithmetic is available, but one always has to deal carefully with rounding errors.
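For concreteness, here is a sketch of the arithmetic underlying the nice prime number model. The prime 998244353 = 119·2^23 + 1 (with primitive root 3) is a common choice that fits into a machine word; it is our example, not necessarily the prime used in the actual implementation.

#include <cstdint>

constexpr std::uint64_t P = 998244353;         // 119 * 2^23 + 1

std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return a * b % P;                           // operands < P < 2^30, so no overflow
}

std::uint64_t pow_mod(std::uint64_t a, std::uint64_t e) {
    std::uint64_t r = 1;
    for (; e != 0; e >>= 1, a = mul_mod(a, a))
        if (e & 1) r = mul_mod(r, a);
    return r;
}

// A primitive 2^k-th root of unity modulo P, valid for k <= 23.
std::uint64_t root_of_unity(unsigned k) {
    return pow_mod(3, (P - 1) >> k);
}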
In practice, it follows that the choice of an optimal model for the FFT depends very much on the application. When using the FFT for multiplication, a general rule of thumb is to balance the costs of the FFT step and the inner multiplication step. If the inner multiplication step is carried out just once, then Schönhage-Strassen's model is usually optimal. This is for instance the case when multiplying two large integers. If one FFT corresponds to several inner multiplications, as in the case of matrix multiplication with large integer coefficients, one should opt for the nice prime or floating-point model, depending on the efficiency of division, floating-point operations and the available amount of space.
In between, it may sometimes be useful to use Schönhage-Strassen's method with cube roots (instead of square roots) of the number of digits (or the degree) [vdH02, Section 6.1.3]. Similarly, one may force an extra iteration (i.e. immediately use fourth roots, but lose an additional constant factor). Finally, in the case of multivariate FFTs (we consider that large integer coefficients account for an additional dimension), one may artificially generate additional roots of unity. For instance, if has a -th root of unity , then one may multiply bivariate polynomials in , by using as a -th root of during the FFT w.r.t. .
Let $n \leq 2^p$, and let $\omega$ be a primitive $2^p$-th root of unity. The Truncated Fourier Transform (TFT) from [vdH04, Section 3] takes an $n$-tuple $(a_0, \ldots, a_{n-1})$ on input and produces the $n$-tuple $(\hat{a}_0, \ldots, \hat{a}_{n-1})$ with
\[ \hat{a}_i = A(\omega^{[i]_p}) \]
for all $i$, where $A = a_0 + a_1 X + \cdots + a_{n-1} X^{n-1}$. More generally, if $S$ is a subset of $\{0, \ldots, 2^p - 1\}$ and $(a_i)_{i \in S}$ a tuple indexed by $S$, then we define
\[ \mathrm{TFT}_S((a_i)_{i \in S}) = (A(\omega^{[i]_p}))_{i \in S}, \qquad A = \sum_{i \in S} a_i X^i. \]
Even more generally, one may select a target subset $T$ of $\{0, \ldots, 2^p - 1\}$ which is different from $S$, and define
\[ \mathrm{TFT}_{S \to T}((a_i)_{i \in S}) = (A(\omega^{[i]_p}))_{i \in T}. \]
In order to compute the TFT, we simply consider the computation graph of a complete FFT and throw out all computations which do not contribute to the result (see figures 2, 3 and 4). If the set $S$ is “sufficiently connected”, then the cost of the computation is significantly smaller than the cost of a complete FFT; for the case when $S = \{0, \ldots, n-1\}$ is an initial segment, precise bounds were proved in [vdH04].
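In code, the pruning is a one-line change with respect to fft_inplace from section 2: a block of crossings is only processed when it still contributes to one of the first n results. This simplified sketch (which reuses mirror and processes the last partial block in full, at the price of O(2^p) extra operations) only covers the initial segment S = {0, ..., n − 1}.

#include <complex>
#include <cstddef>
#include <vector>

// Truncated Fourier transform for the initial segment {0, ..., n-1}.
// x.size() = 2^p >= n; on entry x holds the coefficients padded with zeros,
// on exit x[i] holds the value at omega^{[i]_p} for all i < n.
void tft(std::vector<std::complex<double>>& x, std::size_t n) {
    const std::size_t N = x.size();
    unsigned p = 0;
    while ((std::size_t(1) << p) < N) ++p;
    const double two_pi = 6.28318530717958647692;
    for (std::size_t half = N / 2; half >= 1; half /= 2)
        for (std::size_t j = 0; 2 * half * j < n; ++j) {        // skip blocks past n
            std::complex<double> w =
                std::polar(1.0, -two_pi * double(mirror(j, p - 1)) / double(N));
            std::size_t start = 2 * half * j;
            for (std::size_t k = 0; k < half; ++k) {
                std::complex<double> u = x[start + k];
                std::complex<double> v = w * x[start + half + k];
                x[start + k]        = u + v;
                x[start + half + k] = u - v;
            }
        }
}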
Inverting the Truncated Fourier Transform
In order to use the TFT for the multiplication of numbers, polynomials or power series, we also need an algorithm to compute the inverse transform. In this section, we give such an algorithm in the case when $T = S$ and when $S$ is an initial segment for the (partial) “bit ordering” $\preccurlyeq$ on $\{0, \ldots, 2^p - 1\}$: given numbers $i = i_{p-1} \cdots i_0$ and $j = j_{p-1} \cdots j_0$ (written in binary, with $i_k, j_k \in \{0, 1\}$), we set $i \preccurlyeq j$ if $i_k \leq j_k$ for all $k$. We recall that an initial segment of $\{0, \ldots, 2^p - 1\}$ is a subset $I$ such that $i \preccurlyeq j$ and $j \in I$ imply $i \in I$.
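With the bitwise reading of $\preccurlyeq$ given above, the ordering can be tested with a single machine operation (the function name is ours):

// i precedes j for the bit ordering iff every bit set in i is also set in j.
bool bit_leq(unsigned i, unsigned j) { return (i & ~j) == 0u; }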
The inverse TFT is based on the key observation that the cross relation can also be applied to compute “one diagonal from the other diagonal”. More precisely, given a relation
\[ \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} u + \omega^k v \\ u - \omega^k v \end{pmatrix}, \tag{5} \]
where $\omega^k$ is a root of unity for some $k$, we may clearly compute $(a, b)$ as a function of $(u, v)$ and vice versa,
\[ \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} (a + b)/2 \\ (a - b)\, \omega^{-k}/2 \end{pmatrix}, \tag{6} \]
but we also have
\[ \begin{pmatrix} a \\ v \end{pmatrix} = \begin{pmatrix} 2u - b \\ (u - b)\, \omega^{-k} \end{pmatrix}, \qquad \begin{pmatrix} u \\ b \end{pmatrix} = \begin{pmatrix} a - \omega^k v \\ a - 2\, \omega^k v \end{pmatrix}. \tag{7} \]
Moreover, these relations only involve shifting (multiplication and division by $2$), additions, subtractions and multiplications by roots of unity.
In order to state our in-place algorithm for computing the inverse TFT, we will need some more notations. At the very start, we have and on input, where is the complement of in and is the zero function. Roughly speaking, the algorithm will replace by its inverse TFT and by its direct TFT . In the actual algorithm, the array will be stored in a special way (see the next section) and only a part of the computation of will really be carried out.
Our algorithm proceeds by recursion. In the recursive step, is replaced by a subset of of the form , where is the current step and . The sets and will again be subsets of with , and is recursively assumed to be an initial segment for the bit ordering. Given a subset of , we will denote and , where
Similarly, given , we will denote by and the restrictions of to resp. . We now have the following recursive in-place algorithm for computing the inverse TFT (see figure 5 for an illustration of the different steps).
Algorithm
If or then return
If then apply the partial inverse FFT on and return
For , cross with using (5)
For , cross with using (7)
For , cross with using (6)
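The following self-contained C++ sketch implements this strategy for the most common special case S = {0, ..., n − 1}, in the same conventions as the sketches of section 2 (it reuses mirror). It is a recursive rather than an in-place formulation, and the names and the global variables are ours; it is meant to illustrate how the relations (5)–(7) are combined, not to reproduce the algorithm above literally.

#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

static std::size_t tft_N;   // full transform length 2^p (set by inv_tft below)
static unsigned    tft_P;   // p

// omega^{+e} or omega^{-e} for omega = exp(-2*pi*i / tft_N).
static cd omega_pow(std::size_t e, int sign) {
    const double two_pi = 6.28318530717958647692;
    return std::polar(1.0, -sign * two_pi * double(e) / double(tft_N));
}

// Undo all crossings of the block with index j at the level where blocks have
// length len (a full inverse FFT of that block).
static void inv_fft_block(cd* x, std::size_t len, std::size_t j) {
    if (len == 1) return;
    std::size_t L = len / 2;
    inv_fft_block(x, L, 2 * j);                      // undo the deeper crossings first
    inv_fft_block(x + L, L, 2 * j + 1);
    cd winv = omega_pow(mirror(j, tft_P - 1), -1);   // omega^{-[j]_{p-1}}
    for (std::size_t i = 0; i < L; ++i) {            // relation (6)
        cd a = x[i], b = x[L + i];
        x[i]     = 0.5 * (a + b);
        x[L + i] = 0.5 * (a - b) * winv;
    }
}

// Recover the first n coefficients of the block (length len, index j), given
// transformed values at positions < n and plain coefficients at positions >= n.
static void inv_tft_rec(cd* x, std::size_t len, std::size_t n, std::size_t j) {
    if (n == 0 || len == 1) return;
    std::size_t L = len / 2;
    cd w    = omega_pow(mirror(j, tft_P - 1), +1);
    cd winv = omega_pow(mirror(j, tft_P - 1), -1);
    if (n <= L) {                                    // boundary in the left half
        for (std::size_t i = n; i < L; ++i)
            x[i] += w * x[L + i];                    // first half of the crossing (5)
        inv_tft_rec(x, L, n, 2 * j);                 // recover the sums for i < n
        for (std::size_t i = 0; i < n; ++i)
            x[i] -= w * x[L + i];                    // diagonal relation (7)
    } else {                                         // boundary in the right half
        inv_fft_block(x, L, 2 * j);                  // left half now holds u_i + w v_i
        for (std::size_t i = n - L; i < L; ++i) {    // cross with the known v_i
            cd v = x[L + i];
            x[i]     -= w * v;                       // u_i, by (7)
            x[L + i]  = x[i] - w * v;                // u_i - w v_i, by (5)
        }
        inv_tft_rec(x + L, L, n - L, 2 * j + 1);     // recover the differences
        for (std::size_t i = 0; i < n - L; ++i) {    // finish with relation (6)
            cd s = x[i], d = x[L + i];
            x[i]     = 0.5 * (s + d);
            x[L + i] = 0.5 * (s - d) * winv;
        }
    }
}

// x[0 .. n-1]: TFT values; x[n .. 2^p - 1]: the (zero) coefficients of the
// padding.  On exit, x[0 .. n-1] holds the recovered coefficients.
void inv_tft(std::vector<cd>& x, std::size_t n) {
    tft_N = x.size();
    tft_P = 0;
    while ((std::size_t(1) << tft_P) < tft_N) ++tft_P;
    inv_tft_rec(x.data(), tft_N, n, 0);
}

For polynomial multiplication one would call tft on both zero-padded inputs, multiply the first n values pointwise, reset the positions n, ..., 2^p − 1 to zero and call inv_tft.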
Applying the algorithm for with , we have
Remark
Let and let be a primitive -th root of unity for each . Given subsets of and a mapping , the multivariate TFT of is a mapping defined by
See figure 6 for an example with in dimension . The multivariate TFT and its inverse are computed in a similar way as in the univariate case, using one of the encodings from section 3 of tuples in by integers in and using the corresponding FFT-profile determined by instead of .
We generalize the bit-ordering on sets of the form to sets of indices by
If is one of the encodings or from section , then it can be checked that
If is an initial segment for the bit ordering, this ensures that the inverse TFT also generalizes to the multivariate case.
From the complexity analysis point of view, two special cases are particularly interesting. The first block case is when with for all . Then using the lexicographical encoding, the multivariate TFT can be seen as a succession of univariate TFTs whose coefficients are mappings , with
Setting , and , theorems 1 and 2 therefore generalize to:
Remark
The other important simplicial case is when and for some fixed with (and more generally, one may consider weighted degrees). In this case, we choose the simultaneous encoding for the multivariate TFT (see figure 7 for an example). Now the size of is . At stages , we notice that the number of “active nodes” for the TFT is bounded by . When , we have
which proves:
Remark
Remark
In order to implement the TFT and its inverse, one first has to decide how to represent the sets and , and mappings . A convenient approach is to represent , and (for the direct TFT) or , and (for the inverse TFT) by a single binary “TFT-tree”. Such a binary tree corresponds to what happens on a subset of the form with . The binary tree is of one of the following forms:
When the tree is reduced to a leaf, it explicitly contains a map , which is represented by an array of length in memory. By convention, if is reduced to a null-pointer, then represents the zero map. Moreover, the node contains a flag in order to indicate whether or (for leaves, it is assumed that we either have or and similarly for ).
When the tree consists of a binary node with subtrees and , then encodes what happens on the interval , whereas encodes what happens on the interval . For convenience, the tree still contains a flag to indicate whether or .
Notice that the bounds of the interval need not be explicitly present in the representation of the tree; when needed, they can be passed as parameters to recursive functions on the trees.
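In code, a TFT-tree might be represented as follows; the field names and the use of std::unique_ptr and std::vector are our own choices, and the two flags merely stand in for the conventions described above.

#include <complex>
#include <memory>
#include <vector>

struct tft_tree {
    bool source_flag;                          // flag describing how the interval meets the source set
    bool target_flag;                          // same, for the target set
    std::unique_ptr<tft_tree> left, right;     // set for binary nodes only
    std::vector<std::complex<double>> data;    // set for leaves; empty array = zero map
    bool is_leaf() const { return !left && !right; }
};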
Only the direct TFT has currently been implemented in Mmxlib.
In tables 1–4, we have given a few benchmarks on a 2.4 GHz AMD architecture with 512 MB of memory. We use as our coefficient ring and we tested the simplicial multivariate TFT for total degree and dimension . The tables show the size of the input , the real and average running times and , the theoretical number of crossings to be computed , its average and the ratio . The number corresponds to the running time divided by the running time of the univariate TFT for the same input size. The tables show both the most favourable cases, when is a power of two, and the worst cases, when is a power of two plus one.
[Tables 1–4: benchmarks for the univariate and simplicial multivariate TFT, covering both the best cases (powers of two) and the worst cases (powers of two plus one).]
A few conclusions can be drawn from tables 1–4. First of all, in the univariate case, the TFT behaves more or less as theoretically predicted. The non-trivial ratios in higher dimensions indicate that the set is quite disconnected; hence, optimizations in the implementation of TFT-trees may lead to non-trivial gains. The differences between the running times for the best and worst cases in higher dimensions show that the multivariate TFT is still quite sensitive to the jump phenomenon (even though less so than a full FFT). Finally, in the best case, the ratio becomes reasonably small. Indeed, this ratio should theoretically tend to and should be bounded by $d!$ in order to gain something w.r.t. a full FFT. However, it can become quite large in the worst case. The various observations suggest the following future improvements.
for all and
for each . It can be hoped that this acceleration approximately reduces the current running times for the worst case to the current running times for the best case. For high dimensions, it may be useful to apply the trick a second time for the next stages ; however, the corresponding formulas become more complicated. We hope that further study will enable the replacement of the condition in theorem 6 by a better condition, like .
Assume now that $R$ is a ring such that $R[\mathrm{i}]$ (with $\mathrm{i}^2 = -1$) has many $2^p$-th roots of unity. Typically, this is the case when $R$ is the “field” of floating-point numbers. A disadvantage of the standard FFT or TFT in this context is that the size of the input is doubled in order to represent numbers in $R[\mathrm{i}]$. This doubling of the size actually corresponds to the fact that the FFT or TFT contains redundant information in this case. Indeed, given a polynomial $A$ with coefficients in $R$, we have $\overline{A(\omega)} = A(\bar{\omega}) = A(\omega^{-1})$ for any $2^p$-th root of unity $\omega$. This suggests evaluating $A$ either in $\omega$ or in $\omega^{-1}$, but not in both.
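The same redundancy underlies the classical trick of computing the transforms of two real sequences with a single complex FFT. The sketch below is that classical trick, not the real TFT of this section; it reuses fft_inplace from section 2 and applies a final bit-mirror permutation in order to pair the indices k and n − k.

#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// Natural-order FFT: fft_inplace followed by the bit-mirror permutation,
// so that x[k] holds the value at omega^k.
void fft_natural(std::vector<cd>& x) {
    fft_inplace(x);
    std::size_t n = x.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {     // standard bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) std::swap(x[i], x[j]);
    }
}

// FFTs of two real sequences a and b (of length n = 2^p) with a single
// complex FFT, using conj(A(omega^k)) = A(omega^{n-k}) for real input.
void two_real_ffts(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<cd>& A, std::vector<cd>& B) {
    std::size_t n = a.size();
    std::vector<cd> z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = cd(a[i], b[i]);   // pack a + i b
    fft_natural(z);
    A.assign(n, cd(0.0)); B.assign(n, cd(0.0));
    for (std::size_t k = 0; k < n; ++k) {
        cd zk = z[k], zc = std::conj(z[(n - k) % n]);
        A[k] = 0.5 * (zk + zc);                      // transform of a
        B[k] = cd(0.0, -0.5) * (zk - zc);            // transform of b
    }
}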
With the notations from section 5, let be an initial segment for the bit ordering . As our target set , we take
The “real TFT” now consists of applying the usual TFT with source and target . It can be checked that crossings of “real nodes” introduce “complex nodes” precisely when one of the two nodes disappears at the next stage (see figure 8). It can also be checked that the inverse TFT naturally adapts so as to compute the inverse real TFT. For “sufficiently connected” sets , like , the real TFT is asymptotically twice as fast as the complex TFT.
[CK91] D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
[CT65] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.
[SS71] A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281–292, 1971.
[vdH02] J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34(6):479–542, 2002.
[vdH04] J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proceedings of ISSAC 2004, pages 290–296, Santander, Spain, July 2004.
[vdH05] J. van der Hoeven et al. Mmxlib: the standard library for Mathemagix, 2002–2005. ftp://ftp.mathemagix.org/pub/mathemagix/targz.