This paper is part of a project that has received funding from the French “Agence de l'innovation de défense”. This article has been written using GNU TeXmacs [28].

Let p be a smooth Fourier prime, that is, a prime of the form p = σ·2^k + 1 with σ small, and let P be a polynomial of degree d in F_p[x] with d distinct roots in F_p. We used the tangent Graeffe root finding algorithm from [19, 27], which is asymptotically faster than the Cantor–Zassenhaus algorithm, to compute the roots of P. In this article we discuss the steps of the tangent Graeffe algorithm, the sub-algorithms that we used, how we parallelized them, and how we organized the memory so we could factor a polynomial of degree one billion. Polynomial root finding over smooth finite fields is a key ingredient for algorithms for sparse polynomial interpolation that are based on geometric sequences. This application was also one of our main motivations for the present work.
Consider a polynomial function f over a field K, given through a black box capable of evaluating f at points in K^n. The problem of sparse interpolation is to recover the representation of f in its usual form, as a linear combination
  f = c_1 x^{e_1} + ⋯ + c_t x^{e_t}   (1)
of monomials x^{e_i}. One popular approach to sparse interpolation is to evaluate f at points in a geometric progression. This approach goes back to work of Prony in the eighteenth century [41] and became well known after Ben-Or and Tiwari's seminal paper [2]. It has been widely used in computer algebra, both in theory [7, 30, 32-35, 39] and in practice [11, 14, 24, 25, 29, 31]; see [43] for a nice survey.
More precisely, if a bound for the number of
terms
is known, then we first evaluate
at
pairwise distinct points
, where
and
for all
.
The generating function of the evaluations at
satisfies the identity
where and
is of degree
. The rational function
can be recovered from
using
fast Padé approximation [6, 37]. For
well chosen points
, it is
often possible to recover the exponents
from the
values
. If the exponents
are known, then the coefficients
can also be recovered using a transposed version of fast
multipoint interpolation [5, 7]. This leaves
us with the question of how to compute the roots
of
in an efficient way.
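For reference, here is one standard way to write the identity in question; the notation is of our own choosing (α denotes the chosen point of the geometric progression and α^{e_i} the value of the i-th monomial at α), so the details may differ slightly from the formula intended above:

  ∑_{k ≥ 0} f(α^k) z^k = ∑_{i=1}^{t} c_i / (1 − α^{e_i} z) = N(z) / Λ(z),   Λ(z) = ∏_{i=1}^{t} (1 − α^{e_i} z),   deg N < t.

The values α^{e_i} can then be read off from the roots of Λ (equivalently, from the roots of its reverse polynomial).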
For practical applications in computer algebra, we usually have K = Q, in which case it is most efficient to use a multi-modular strategy. This means that we prefer to work with coefficients in a finite field
,
where
is a prime number that we are free to
choose. It is well known that polynomial arithmetic over
can be implemented most efficiently using FFTs when the
order
of the multiplicative group is smooth. In
practice, this prompts us to choose p of the form p = σ·2^k + 1 for some small σ and such that p fits into a machine word.
The traditional way to compute roots of polynomials over finite fields
is to use Cantor and Zassenhaus' method [8]. In [19,
20], alternative algorithms were proposed for our case of
interest when is smooth. The fastest algorithm
was based on the tangent Graeffe transform and it gains a
factor
with respect to Cantor–Zassenhaus'
method. The aim of the present paper is to report on a parallel
implementation of this new algorithm and on a few improvements that
allow for a further constant speed-up. A preliminary version of this
paper was published at ICMS 2020 [27]. The new
contributions in this paper include the theoretical contributions in
sections 3.4, 3.5, 3.6 and 3.7, the implementation details in section 4 and
the (parallel) timing results in section 5.
In section 2, we start by recalling generalities about the Graeffe transform and the heuristic root finding method based on the tangent Graeffe transform from [19]. In section 3, we present the main new theoretical improvements, which all rely on optimizations in the FFT-model for fast polynomial arithmetic. Our contributions are threefold:
In the FFT-model, one backward transform out of four can be saved for Graeffe transforms of order two (see section 3.2).
When composing a large number of Graeffe transforms of order two,
“FFT caching” [3] can be used to gain
another constant factor (see section 3.3).
All optimizations still apply in the TFT model, which can save a factor between one and two, depending on the number of roots (see section 3.5).
In section 3.4 we also indicate how to generalize our
methods to Graeffe transforms of general orders. In section 3.7
we determine how much faster the tangent Graeffe algorithm is than the
Cantor–Zassenhaus algorithm. To do this, we determine the constant
factors in the complexities of both algorithms, under the assumption
that arithmetic in is done using FFT-based
algorithms.
We first implemented a sequential version of the tangent Graeffe method
in C, with the optimizations from sections 3.2 and 3.3;
see [27]. Section 4 is devoted to a more
elaborate parallel implementation in Cilk C. We detail how we
parallelized the different steps of the algorithm, and how we organized
the memory so we could factor a polynomial of degree
over
, for
. We believe we are the first to factor such
large polynomials. Our code is available here
In the last section 5, we give timings. We used a
multi-core computer with two 10 core Intel Xeon E5 2680 v2 CPUs and 128
gigabytes of RAM. The above factorization of a polynomial of degree one billion then takes just under 4,000 seconds on 10 cores and
uses 121 gigabytes of RAM.
Let be an effective field. Throughout the paper,
time complexities count arithmetic operations in the field
and space complexities are for elements of
. We use
to denote the
time for multiplying two polynomials in
of
degree
. We make the
customary assumption that
is a non-decreasing
function that tends to infinity.
We will chiefly be interested in the case when K = F_p, where p is a prime of the form p = σ·2^k + 1 with σ small. We call primes of this form smooth Fourier primes. Some examples that we use and their bit lengths are p = 7·2^26 + 1 (28.8 bits) and p = 3·29·2^56 + 1 (62.4 bits). For K = F_p with p of this form and 2d ≤ 2^k, we have M(d) = O(d log d), by using FFT-multiplication.
Let be a polynomial of degree
in
which has
distinct
roots in
. The tangent
Graeffe algorithm computes the roots of
.
The cost of the algorithm depends on a parameter
with
. The parameter
determines what proportion of the roots are found in
each iteration of the algorithm. The space complexity of the algorithm
is
and the average time complexity is
Theoretically, choosing yields the best time
complexity. However, because we want to factor polynomials with very
large
, our implementation
chooses a smaller
in the interval
to save space. For
the time
complexity is
In this section, we recall the tangent Graeffe algorithm from [19]. In the next section, we will analyze its complexity in the FFT-model and present several improvements.
Let K be a general field. The traditional Graeffe transform of a monic polynomial P ∈ K[x] of degree d is the unique polynomial G(P) ∈ K[x] of degree d such that

  G(P)(x²) = (−1)^d P(x) P(−x).   (2)

If P is monic, then so is G(P). If P splits over K into linear factors, P = (x − α_1) ⋯ (x − α_d), then one has

  G(P) = (x − α_1²) ⋯ (x − α_d²).
More generally, given , we
define the Graeffe transform of order
to be the polynomial
of degree
given by
If , then
If ω is a primitive r-th root of unity in K, then we also have

  G_r(P)(x^r) = (−1)^{(r−1) d} P(x) P(ω x) ⋯ P(ω^{r−1} x).   (3)

If r, s ∈ N, then we finally have

  G_{r s}(P) = G_r(G_s(P)) = G_s(G_r(P)).   (4)
Let ε be a formal indeterminate with ε² = 0. Elements in K[ε] are called tangent numbers. They are of the form a + b ε with a, b ∈ K and basic arithmetic operations go as follows:
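Spelled out (this is just the standard arithmetic in K[ε] with ε² = 0, stated here for completeness; whether the original display also covered division we leave open):

  (a + b ε) ± (c + d ε) = (a ± c) + (b ± d) ε
  (a + b ε) (c + d ε) = a c + (a d + b c) ε
  (a + b ε) / (c + d ε) = a/c + ((b c − a d)/c²) ε   (c ≠ 0)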
Now let P ∈ K[x] be a monic polynomial of degree d that splits over K: P = (x − α_1) ⋯ (x − α_d), where the α_i are pairwise distinct. Then the tangent deformation P̃(x) := P(x + ε) satisfies

  P̃ = (x − (α_1 − ε)) ⋯ (x − (α_d − ε)) = P + P′ ε.
The definitions from the previous subsection readily extend to
coefficients in instead of
. Given
,
we call
the tangent Graeffe transform
of
of order
.
We have
where
Now assume that we have an efficient way to determine the roots of
. For
some polynomial
, we may
decompose
For any root of
,
we then have
Whenever happens to be a single root of
, it follows that
If r is invertible in K, this finally allows us to recover α, as spelled out below.
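To make the recovery step explicit (in notation of our own choosing, consistent with the definitions above): write G_r(P̃) = G_r(P + P′ ε) = A + B ε with A, B ∈ K[x], so that A = G_r(P). Since the roots of P̃ are the α_i − ε and a Graeffe transform of order r raises roots to the r-th power, we get

  A = ∏_i (x − α_i^r),   B = ∑_i r α_i^{r−1} ∏_{j ≠ i} (x − α_j^r).

Hence, if β = α_i^r is a single root of A, then A′(β) = ∏_{j ≠ i} (β − α_j^r) ≠ 0 and B(β) = r α_i^{r−1} A′(β), whence

  α_i = r β A′(β) / B(β),

provided r is invertible in K.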
Assume now that K = F_p is a finite field, where p is a prime number of the form p = σ·2^k + 1 for some small σ. Assume also
that
is a primitive element of order
for the multiplicative group of
.
Let be as in the previous subsection. The
tangent Graeffe method can be used to efficiently compute those
of
for which
is a single root of
. In
order to guarantee that there are a sufficient number of such roots, we
first replace
by
for a
random shift
, and use the
following heuristic:
For any subset of cardinality
and any
,
there exist at least
elements
such that
contains at least
elements.
For a random shift and any
, the assumption ensures with probability at
least
that
has at least
single roots.
Now take to be the largest power of two such
that
and let
.
By construction, note that
.
The roots
of
are all
-th roots of unity in the set
, where
. We may thus determine them by evaluating
at
for
. Since
,
this can be done efficiently using a discrete Fourier transform.
Combined with the tangent Graeffe method from the previous subsection,
this leads to the following probabilistic algorithm for root finding:
Algorithm 1
Remark. For step 5, we may use the defining relation of the (tangent) Graeffe transform of order two directly, which requires three polynomial multiplications in F_p[x] of degree at most d. In total, step 5
therefore performs
such multiplications. We
discuss how to perform step 5 efficiently in the FFT model
in section 3.
Remark for
and the resulting threshold
for
. For larger values of
, the computations of the
DFTs in step 6 get more expensive, but the proportion of
single roots goes up, so more roots are determined at each iteration.
From an asymptotic complexity perspective, it would be best to take
. In practice, we actually
preferred to take the lower threshold
, because the constant factor of our implementation
of step 6 (based on Bluestein's algorithm [4])
is significant with respect to our highly optimized implementation of
the tangent Graeffe method. A second reason we prefer
of size
instead of
is
that the total space used by the algorithm is linear in
.
Remark
terms, the idea is to take
and
of order
instead of
for
some constant
with
.
This reduces
(and the cost of the top-level
iteration) by a factor of
.
For the recursive calls, we still need to work with a primitive root of
unity
of order
and
random shifts.
Assume that is invertible in
and let
be a primitive
-th root of unity. Consider a polynomial
. Then the discrete Fourier
transform (DFT) of order
of the sequence
is defined by
We will write for the cost of one discrete
Fourier transform in terms of the number of operations in
. We assume that
.
For any a ∈ K^n, we have

  DFT_{ω^{−1}}(DFT_ω(a)) = n a.   (5)

If n is invertible in K, then it follows that DFT_ω^{−1} = n^{−1} DFT_{ω^{−1}}.
The costs of direct and inverse transforms therefore coincide up to a
factor
.
If is composite,
,
and
, then we have
This shows that a DFT of length reduces to
transforms of length
plus
transforms of length
plus
multiplications in
:
In particular, if , then
.
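For reference, the splitting just described can be written as follows; this is the standard Cooley–Tukey decomposition, and the indexing convention here is of our own choosing, so it may differ from the one intended above. For n = n_1 n_2, 0 ≤ i_1, j_1 < n_1 and 0 ≤ i_2, j_2 < n_2,

  DFT_ω(a)_{i_1 + n_1 i_2} = ∑_{j_2} (ω^{n_1})^{i_2 j_2} · ω^{i_1 j_2} · ∑_{j_1} a_{j_2 + n_2 j_1} (ω^{n_2})^{i_1 j_1},

that is, n_2 inner DFTs of length n_1 (with respect to ω^{n_2}), n twiddle multiplications by the powers ω^{i_1 j_2}, and n_1 outer DFTs of length n_2 (with respect to ω^{n_1}).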
It is sometimes convenient to apply DFTs directly to polynomials as
well; for this reason, we also define .
Given two polynomials
with
, we may then compute the product
using
In particular, we obtain ,
where we recall that
stands for the cost of
multiplying two polynomials of degree
.
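In formulas, and in notation of our own choosing (this is the standard FFT multiplication scheme): if deg(P Q) < n, then

  P Q = DFT_ω^{−1}(DFT_ω(P) · DFT_ω(Q)),

where the product of the two transforms is taken pointwise. Counting two forward transforms, one inverse transform and n pointwise products, this yields M(d) ≤ 3 DFT(n) + O(n) for polynomials of degree d and any transform length n > 2d.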
Remark .
Since
is a power of two, this length is of the
form
for some
.
In view of (6), we may therefore reduce step 6
to
DFTs of length
plus
DFTs of length
.
If
is very small, then we may use a naive
implementation for DFTs of length
.
In general, one may use Bluestein's algorithm [4] to reduce
the computation of a DFT of length
into the
computation of a product in
,
which can in turn be computed using FFT-multiplication and three DFTs of
length a larger power of two.
Let K be a field with a primitive n-th root of unity ω, where n is even. Let P ∈ K[x] be a polynomial of degree d with 2d < n. Then the relation (2) yields

  G(P)(ω^{2 i}) = (−1)^d P(ω^i) P(−ω^i),   0 ≤ i < n.   (7)

For any 0 ≤ i < n/2, we further note that

  ω^{i + n/2} = −ω^i,   (8)

so the values P(−ω^i) can be obtained from DFT_ω(P) = (P(ω^i))_{0 ≤ i < n} using transpositions of elements in K. Concerning the inverse transform, we also note that ω² is a primitive (n/2)-th root of unity and that

  DFT_{ω²}(G(P))_i = G(P)(ω^{2 i})

for 0 ≤ i < n/2. Plugging this into (7), we conclude that

  DFT_{ω²}(G(P))_i = (−1)^d DFT_ω(P)_i · DFT_ω(P)_{i + n/2},   0 ≤ i < n/2.

This leads to the following algorithm for the computation of G(P):
Proposition. Let ω be a primitive n-th root of unity in K and assume that n is invertible in K. Given a monic polynomial P ∈ K[x] with 2 deg P < n, we can compute G(P) in time DFT(n) + DFT(n/2) + O(n).
Proof. We have already explained the correctness of Algorithm 2. Step 1 requires one forward DFT of length n and cost DFT(n). Step 2 can be done in linear time O(n). Step 3 requires one inverse DFT of length n/2 and cost DFT(n/2) + O(n). The total cost of Algorithm 2 is therefore DFT(n) + DFT(n/2) + O(n).
Remark This
gives a
improvement over the previously best
known bound
that was used in [19].
Note that the best known algorithm for computing squares of polynomials
of degree
is
.
It would be interesting to know whether squares can also be computed in
time
.
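As an illustration, here is a minimal C sketch of Algorithm 2 along these lines. It is not the code of our implementation: dft, idft and mulmod are assumed helper routines (an in-order forward and inverse number theoretic transform of the given length at the given primitive root, and multiplication modulo p), and no attention is paid to in-place operation or to the bit-reversed orderings used later in section 4.

typedef long long LONG;

/* Assumed helpers (not from this paper): in-order DFT and inverse DFT of
   length n at a primitive n-th root of unity w modulo p, and a*b mod p.   */
void dft (LONG *A, LONG n, LONG w, LONG p);
void idft(LONG *A, LONG n, LONG w, LONG p);
LONG mulmod(LONG a, LONG b, LONG p);

/* Graeffe transform of order 2 of a monic polynomial P of degree d, using
   G(P)(w^(2i)) = (-1)^d P(w^i) P(w^(i+n/2)).  P[0..d] holds the input,
   G must have size >= n/2, A is scratch of size n, n is a power of two
   with n > 2d, and w is a primitive n-th root of unity modulo p.
   On return G[0..d] holds the coefficients of G(P).                       */
void graeffe2(LONG *P, LONG d, LONG *G, LONG *A, LONG n, LONG w, LONG p)
{
    LONG i, half = n/2;
    for (i = 0; i <= d; i++) A[i] = P[i];      /* zero-padded copy of P     */
    for (i = d+1; i < n; i++) A[i] = 0;
    dft(A, n, w, p);                           /* one forward DFT: P(w^i)   */
    for (i = 0; i < half; i++) {               /* n/2 pointwise products    */
        G[i] = mulmod(A[i], A[i+half], p);     /* P(w^i) * P(-w^i)          */
        if (d & 1) G[i] = (p - G[i]) % p;      /* sign (-1)^d               */
    }
    idft(G, half, mulmod(w, w, p), p);         /* inverse DFT of length n/2 */
}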
In view of (4), Graeffe transforms of power of two orders can be computed using

  G_{2^m}(P) = (G ∘ ⋯ ∘ G)(P)   (m-fold composition).   (9)
Now assume that we computed the first Graeffe transform G(P) using Algorithm 2 and that we wish to apply a second Graeffe transform to the result. Then we note that

  DFT_{ω²}(G(P))_i = (−1)^d DFT_ω(P)_i · DFT_ω(P)_{i + n/2}   (10)

is already known for 0 ≤ i < n/2. We can use this to accelerate step 1 of the second application of Algorithm 2. Indeed, in view of (6) for n_1 = 2 and n_2 = n/2, we have

  DFT_ω(G(P))_{2 i} = DFT_{ω²}(G(P))_i   (11)

for 0 ≤ i < n/2. In order to exploit this idea in a recursive fashion, it is useful to modify Algorithm 2 so as to include DFT_{ω²}(P) in the input and DFT_{ω²}(G(P)) in the output. This leads to the following algorithm:
Algorithm 3
be a primitive
-th root of unity in
and assume that
is invertible in
. Given a monic polynomial
with
and
, we can compute
in
time
Proof. It suffices to compute
and then to apply Algorithm 3 recursively,
times. Every application of Algorithm 3 now takes
operations in
,
whence the claimed complexity bound.
Remark. Previously, the iterated Graeffe transforms were directly computed using the formula (9), using roughly two full DFTs of length n per transform, i.e. about 2 m DFT(n) operations in K. The new algorithm is twice as fast for large m.
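To make the gain explicit (this accounting is ours, under the conventions above): once DFT_{ω²}(G(P)) is known from the previous step, relation (11) provides the even-indexed entries of DFT_ω(G(P)) for free, so a further Graeffe step only needs the odd-indexed entries (a twisted DFT of length n/2), the n/2 pointwise products, and one inverse DFT of length n/2. This amounts to

  (1/2) DFT(n) + O(n) + DFT(n/2) ≈ DFT(n) + O(n)

per iteration, compared to roughly 2 DFT(n) + O(n) per iteration when every transform is computed independently via (9).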
The algorithms from subsections 3.2 and 3.3
readily generalize to Graeffe transforms of order
for arbitrary
, provided that
we have an
-th root of unity
. For convenience of the
reader, we specified the generalization of Algorithm 3
below, together with the resulting complexity bounds.
Algorithm 4
be a primitive
-th root of unity in
, where
is invertible in
. Given a monic polynomial
with
and
, we can compute
in
time
Proof. Straightforward generalization of
Proposition 7.
be a primitive
-th root of unity in
, where
are invertible in
. Given a monic polynomial
with
and
, we can compute
in
time
Proof. Direct consequence of (4).
Remark. In terms of the order r, it is instructive to observe that the “average cost” of a Graeffe transform (its cost relative to log r) is minimal for r = 3. This
suggests that it might be interesting to use Graeffe transforms of order
three whenever possible. In the application of Algorithm 1,
this would lead us to take primes of the form
, with
small and
close to
. This
still allows us to use radix 2 FFTs, while at the same time benefitting
from radix 3 Graeffe transforms.
If is a fixed finite field, then DFTs are most
efficient for sizes
that divide
. For our root finding application, it is often
convenient to take
, in which
case
should be a power of two or three times a
power of two. The truncated Fourier transform was developed for the
multiplication of polynomials such that the degree of the product does
not have a nice size
of this type. It turns out
that we may also use it for the efficient computation of Graeffe
transforms of polynomials of arbitrary degrees. Moreover, the
optimizations from the previous subsections still apply.
Let us briefly describe how the truncated Fourier transform can be used
for the computation of Graeffe transforms of power of two orders. With
the notations from subsections 3.2 and 3.3, we
assume that is a power of two as well and that
we wish to compute the Graeffe transform of a polynomial
of degree
with
. Let
denote the
reversal of a binary number
of
bits. For instance,
and
. Then the truncated Fourier transform of
at order
is defined by
It has been shown in [22] that and
can both be computed in time
. More generally, for direct transforms, one
may compute
in time , whenever
. For generalizations to arbitrary
radices, we refer to [36].
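As a small illustration, here is a helper of our own for the bit reversal used in these orderings (it is not taken from the code described in section 4):

/* Reversal [i]_l of an l-bit index i: the bits of i are written in reverse
   order; for example, with l = 4 the index 3 = 0011 in binary becomes
   1100 in binary, i.e. 12.                                                */
long long bitrev(long long i, long long l)
{
    long long j, r = 0;
    for (j = 0; j < l; j++) {
        r = (r << 1) | (i & 1);   /* shift the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}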
Taking , we note that
for . This provides us with
the required counterpart of (8) for retrieving
efficiently from
.
The relation (10) also has a natural counterpart:
for . This leads to the
following refinement of Algorithm 3:
Algorithm 5
be a primitive
-th root of unity in
, where
,
and assume that
is invertible in
. Given a monic polynomial
with
and
,
we can compute
in time
Proof. Straightforward adaptation of the proof
of Proposition 7, while using [22].
In step 3 of Algorithm 1, we still need an
algorithm to compute the Taylor shift .
If the characteristic of
exceeds
, then it is (not so) well known [1,
Lemma 3] that this can be done in time
,
using the following reduction to a single polynomial multiplication of
degree
:
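One standard way to spell out this reduction (following [1]; the notation p_i for the coefficients of P and c for the shift is ours) is

  P(x + c) = ∑_{k=0}^{d} (1/k!) ( ∑_{i=k}^{d} (i! p_i) c^{i−k} / (i−k)! ) x^k,

so that the inner sums are, up to reversing the coefficient order, the coefficients of the single product of ∑_i (i! p_i) x^{d−i} and ∑_j (c^j / j!) x^j. The divisions by the factorials are what requires the characteristic of K to exceed d.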
It is interesting to observe that Taylor shifts can still be computed in
time in small characteristic, as follows:
We give a theoretical comparison of Algorithm 1 with the
Cantor–Zassenhaus algorithm [8], where both
algorithms have been optimized in the “FFT model” [44].
For this comparison, it is convenient to replace the “worst
case” heuristic H by a more empirical assumption.
More precisely, if we take for
, then the expected proportion of single roots
is
(see e.g. [23]).
This expected proportion is indeed observed in practice: see Table 3. In Algorithm 1, we took
.
instead of
for some fixed
.
Then the expected cost of Algorithm 1 is bounded by
(12)
Proof. We first analyze the cost of step 9. Let be the cost of the polynomial
multiplications in the product tree algorithm, where
is the size of
. Exercise
10.3 of [17] shows that
.
The recurrence for
is
for
and
.
Solving this recurrence while using the assumption that
is non-decreasing yields
.
Now let
be the successive sizes of
for the recursive calls of the algorithm, so that
. Thus the cost of all executions
of step 9 is
.
Using again that
is non-decreasing, we have
.
Let us next analyze the cost of the other steps until step 10
inclusive, but without the recursive calls in step 11. Set
.
Step 4 takes to compute
.
Altogether, the cost of these steps is .
We now need to account for the cost of the recursive calls in step 11. The tangent Graeffe algorithm will, under our hypothesis, on average obtain at least
of the roots in each recursive call, leaving slightly less than
of them to be found. Since
and
is non-decreasing, the total cost of
Algorithm 1 is therefore as claimed.
Remark ,
the bound (12) reduces to
.
Since we may pick
as large as we want, the
complexity of Algorithm 1 is then bounded by
for any
.
Assume from now on that we are in the asymptotic region where . Then Remark 14 shows
that the cost of the tangent Graeffe method is
for any
. The bottleneck of
Cantor–Zassenhaus' algorithm in this region is modular
exponentiation, which accounts for a total cost of
[17]. We wish to determine the constant factor in order to
give an accurate comparison between the two algorithms. Given
polynomials
with
,
one modular exponentiation does
modular
squarings
in a loop. However, in the first few iterations, the degree is less than
so no division is
needed. The number of steps with divisions is
. Using fast division [17], the
remainder
can be computed with two
multiplications of size
assuming the required
inverse of
, the reciprocal
of
, is precomputed. In the
FFT model, one forward transform of
may be saved
and the forward transform of the inverse of
may
be cached [44]. This costs
,
i.e. at least
,
since
. The cost of the
top-level modular exponentiation is therefore equivalent to
. Using a similar recursion as for
in the proof of Proposition 13, the total
cost of all modular compositions in CZ is equivalent to
(strictly speaking, this is really an upper bound; but we indeed obtain
an equivalence for common values of
,
such as
or
).
Altogether, this means that the tangent Graeffe algorithm is about
times faster than Cantor–Zassenhaus.
Remark , and the
complexity of Cantor–Zassenhaus also involves another
term due to gcd computations. The corresponding constant
factor is fairly high, which makes the tangent Graeffe algorithm even
more favorable in this asymptotic region.
For our implementation, we have been pursuing three goals. We first
wanted to reimplement the tangent Graeffe algorithm from scratch using
FFT-based polynomial arithmetic and compare it with a good
implementation of the Cantor–Zassenhaus algorithm. Second, we
wanted to factor a polynomial of degree .
For this, it is important to consider space efficiency. Third, we wanted
to parallelize the tangent Graeffe algorithm for a multi-core computer.
Unlike the Cantor–Zassenhaus algorithm, the tangent Graeffe
algorithm is gcd-free which makes it easier to parallelize. We first
tried parallelizing only the FFTs that appear in steps 3,
5, 6, 9, and 10 of
Algorithm 1 to see if this alone is sufficient to obtain
good parallel speedup. It isn't.
Our main practical challenge is, however, space, not time. For a 64 bit prime p and an input polynomial of degree 10^9, the input polynomial and output roots alone need 16 gigabytes of storage.
input polynomial and output roots alone need 16 gigabytes of storage.
Our first implementation exceeded the virtual memory of our machine
which has 128 gigabytes of RAM plus 240 gigabytes of SSD swap space.
To reduce space we use in-place algorithms. In-place algorithms do not allocate new space (memory). They work in the input and output space and we allow them to be given additional working space. To parallelize a recursive in-place algorithm, we must partition the input, output and the temporary memory (if any) for recursive parallel calls. Our design is such that the space used does not increase with more cores. We refer the reader to Giorgi, Grenet and Roche [18] for examples of in-place serial algorithms in computer algebra and a bibliography.
Number theoretic FFTs over were first introduced
by Pollard [40]. Let us first explain how we parallelized
such FFTs using Cilk C. The strategy is classical, but it is convenient
to detail it here, since we use the same strategy for parallelizing the
other steps in Algorithm 1. Cilk [16] was
designed to support the parallelization of recursive divide and conquer
algorithms like the FFT for multi-core computers. To parallelize
recursive algorithms using Cilk we first modify the code so that the
memory locations that are updated in the recursive calls do not overlap.
Figure 1 shows our sequential C code for two radix 2 FFTs that we use. In the vernacular fft1 is known as the “decimation-in-time” FFT and fft2 as the “decimation-in-frequency” FFT. We refer the reader to Chu and George [9] for a discussion of the two FFTs. We note that Geddes, Czapor, and Labahn present only fft1 in their book on computer algebra [10], whereas von zur Gathen and Gerhard present only fft2 [17].
In the code in Figure 1, LONG is a macro
for long long int, a 64 bit signed integer. The C
functions addmod, submod, mulmod
implement in
respectively for
. For
multiplication in
we use Roman Pearce's
implementation of Möller and Granlund [38]. The input
array A of size n is the main input to and output from the FFT. The input W is an array of size n containing powers of ω, a primitive n-th root of unity in F_p. We precompute the powers of ω needed at each level of the recursion and store them consecutively in W: first the powers of ω, then the powers of ω², and so on. Precomputing W saves asymptotically half of the multiplications in the FFT. For n a power of two, fft1 and fft2 both do O(n log n) multiplications. Duplicating the powers of ω in this way means all recursive calls in fft1 and fft2 access W sequentially.
Figure 1. Sequential C code for the two radix 2 FFTs fft1 (decimation-in-time) and fft2 (decimation-in-frequency).
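For orientation, here is a minimal sketch of a recursive radix 2 FFT with the same recursive structure and calling convention as fft1cilk in Figure 2; it is not the routine of Figure 1, which may differ in details such as its ordering conventions.

/* Sketch only: with this structure the routine computes the DFT of the
   bit-reverse permuted input; pairing it with a decimation-in-frequency
   companion (fft2) makes the two permutations cancel in a convolution.
   addmod, submod, mulmod and the layout of W are as described above.     */
void fft1(LONG *A, LONG n, LONG *W, LONG p)
{
    LONG i, n2, s, t;
    if (n == 1) return;
    n2 = n/2;
    fft1(A,    n2, W + n2, p);           /* transform the two halves       */
    fft1(A+n2, n2, W + n2, p);
    for (i = 0; i < n2; i++) {           /* butterflies, cf. block1 below  */
        s = A[i];
        t = mulmod(W[i], A[n2+i], p);
        A[i]    = addmod(s, t, p);
        A[n2+i] = submod(s, t, p);
    }
}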
Codes like these where the recursive calls update separate parts of the array A are easy to parallelize using Cilk. To parallelize fft1 in Cilk C we first make a new subroutine fft1cilk which executes the two recursive calls in parallel. In Figure 2 the two Cilk spawn directives do this. For small n we do not want to start two new processes because of the cost of process management overhead. Thus for n ≤ FFTCUTOFF, fft1cilk calls fft1, which runs sequentially.
cilk void block1( LONG n2, LONG *A, LONG b, LONG *W, LONG p )
{   LONG i,s,t;
    for( i=0; i<b; i++ ) {
        s = A[i];
        t = mulmod(W[i],A[n2+i],p);
        A[   i] = addmod(s,t,p);
        A[n2+i] = submod(s,t,p);
    }
    return;
}

#define B 65536
#define FFTCUTOFF 32768

cilk void fft1cilk( LONG *A, LONG n, LONG *W, LONG p )
{   LONG n2,q,r,i;
    if( n<=FFTCUTOFF ) { fft1(A,n,W,p); return; }
    n2 = n/2;
    spawn fft1cilk( A,    n2, W+n2, p );
    spawn fft1cilk( A+n2, n2, W+n2, p );
    sync;  // wait for the two fft1cilk calls to finish
    q = n2/B;  r = n2-q*B;
    for( i=0; i<q; i++ ) spawn block1( n2, A+i*B, B, W+i*B, p );
    if( r>0 ) spawn block1( n2, A+q*B, r, W+q*B, p );
    sync;  // wait for all blocks to complete
    return;
}

Figure 2. Cilk C code for the parallel FFT fft1cilk and its helper block1.
To obtain good parallel speedup we also need to parallelize the for loop
in fft1 because it does
multiplications out of a total of
multiplications. We do this by executing large contiguous blocks of the
loop in parallel. We could use Cilk's parallel for loop construct cilk_for and let Cilk determine what block size to use but
this does not allow us to tune the block size for best performance.
Instead we choose a blocksize B and parallelize the loop
explicitly as shown in Figure 2. To get the best performance, we reduced B until the Cilk process overhead takes more than, say, 1% of the total time for the FFT. This yields good parallel speedup for many cores when n is large.
An alternative way to parallelize an FFT is to use a recursive FFT for
creating parallel tasks for large as we have but
use a non-recursive FFT for smaller
.
These and other strategies for parallelizing the FFT for multi-core
computers are discussed in [9, 13].
Steps 1, 2, 3, 5, and 6 of Algorithm 6 require O(d) time. Steps 3 and 6 involve inverses in F_p which, because they require many divisions, are relatively expensive. All the inverses can be replaced with multiplications as follows. After computing d! in step 1, we compute the single inverse finv = 1/d! mod p. For steps 3 and 6, we run the loop backwards and multiply by i to get the next inverse as follows.
for( i=d; i>1; i-- ) {
    Pi[i] = mulmod(Pi[i],finv,p);   // multiply entry i by finv = 1/i! mod p
    finv = mulmod(finv,i,p);        // finv becomes 1/(i-1)! for the next pass
}
We parallelized only the polynomial multiplication
in step 4. To multiply two polynomials of degree at most
our FFT multiplication needs three arrays
,
,
and
of size
where
satisfies
.
The data in Table 4 shows that the best parallel speedup for step 3 is 4.12. A factor of 4.12 on 10 cores is not very good and is due to the fact that we did not parallelize steps 1, 3 and 6, which each involve O(d) multiplications in F_p. We did not do this because computing the Taylor shift P(x + c) is not a bottleneck of Algorithm 1.
Of course, it is possible to parallelize steps 1, 3 and 6. For blocks of size
we would need to
compute the factorials
in parallel. We could do
this by first computing
partial products
in parallel and then compute
. Now we can execute step 1 in
blocks in parallel. Next we would compute the inverses of
so we can execute steps 3 and 6 in parallel.
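A minimal sketch of the blocked factorial computation just described (code of our own, not from our implementation; mulmod is the modular multiplication used throughout and LONG is the 64 bit integer type of Figure 1): each block first computes its local products, after which block j is fixed up by one prefix product taken from block j−1. In a Cilk version, the loops over the blocks would be spawned in parallel.

/* On return F[i] = i! mod p for 0 <= i <= d, computed in nb blocks.
   Phase 1 (the first loop over j) is independent per block; phase 2 fixes
   up block j using the already completed value F[j*b] = (j*b)! mod p.     */
void factorials_blocked(LONG *F, LONG d, LONG nb, LONG p)
{
    LONG i, j, b, lo, hi, prod, off;
    b = (d + nb) / nb;                       /* block size                  */
    F[0] = 1;
    for (j = 0; j < nb; j++) {               /* phase 1: local products     */
        lo = j*b + 1;
        hi = (j+1)*b < d ? (j+1)*b : d;
        prod = 1;
        for (i = lo; i <= hi; i++) { prod = mulmod(prod, i, p); F[i] = prod; }
    }
    for (j = 1; j < nb; j++) {               /* phase 2: prefix fix-up      */
        if (j*b > d) break;
        off = F[j*b];                        /* equals (j*b)! at this point */
        hi = (j+1)*b < d ? (j+1)*b : d;
        for (i = j*b + 1; i <= hi; i++) F[i] = mulmod(F[i], off, p);
    }
}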
Step 5 of Algorithm 1 is the main computation
in Algorithm 1. It has complexity . We successfully parallelized the FFTs and all
loops in blocks except the following data movement:
for( i=0; i<n/2; i++ ) A[i] = A[2*i]; // compress for( i=0; i<n/2; i++ ) A[n/2+i] = A[i]; // duplicate
We tried parallelizing the above code by moving the even entries of A into a temporary array, in blocks, in parallel, then copying them back to A, twice, in blocks, in parallel. On 6 cores, the parallel code is slower than the serial code, so we have left this pure data movement sequential.
Each of the three evaluations ,
and
for
in step 6 of Algorithm 1 can be done using the
Bluestein transform [4] in
time.
Although step 6 does not dominate the time complexity it
needs the most space. We need at least
units of
storage to store the values of
,
and
.
For
and
,
the requirement
implies
. This is
gigabytes. We
chose to save half of this by choosing
instead.
This increases the time a little (see Table 3) because we
obtain a smaller portion of the roots at each recursive call of
Algorithm 1.
The Bluestein transform does one multiplication of two polynomials of
degree plus
work. For a
multiplication of degree
we need an FFT of size
and our FFT uses
units
of temporary storage. For
,
and
we have
so we need 192 gigabytes. We do not perform the three evaluations in parallel as that would triple the 192 gigabytes to 576 gigabytes.
For with
small, because
for some
,
we can save time and space by applying (6). We do
DFTs of size
,
in blocks, in parallel, then
multiplications by
powers of
, in blocks in
parallel, followed by
FFTs of size
which we do in parallel. This reduces the time for an
evaluation by a factor between 6 and 12 since we replace 3 FFTs of size
where
by
FFTs of size
which is equivalent
to one FFT of size
.
We need an additional units of storage for
storing the
inputs of size
for the
FFTs and an additional
units for the required
array. Thus applying (6) reduces the temporary storage needed from
units to
units. For
and
, with
, this is
gigabytes of
temporary storage instead of 192 gigabytes.
We can now state the total storage needed by our implementation of
Algorithm 1. We need to store the input
of degree
and an array of size
for the output roots
. We
need space for three temporary polynomials of degree
. Including the memory for step 6,
our implementation uses
units, which, for
,
and
is 121 gigabytes.
In step 9 we need to expand the polynomial . Let
and suppose
is stored in an array of size
and
in an array of size
. We can compute
in
using fast multiplication and divide and conquer. The
obvious way to do this is to split the roots in
into two equal size groups, of size
and
, then recursively compute
and finally multiply by
using fast multiplication for large
.
In the literature this approach is called the product tree
multiplication algorithm; see [17, Algorithm 10.3]. In
Appendix A, the function mult implements the product
tree algorithm in Magma. We use it to create the input polynomial
to get Magma timings. However, mult
is inefficient because all recursive calls allocate new memory for the
inputs and products in the product tree.
For large products we use our FFT based polynomial multiplication FFTpolmul64s which needs a temporary array
of size
units of storage where
. We do not want to allocate space for
in each recursive call. Fortunately,
is a power of 2 so we can divide
into two halves
and execute the two recursive calls in parallel using the two halves of
. But the size of
plus
is
. They don't fit in
.
The remedy is to note that
and
are both monic, so we do not need to store their leading coefficients.
Figure 3 presents C code for an in-place product tree
algorithm which uses the input space of size
, the output space
of size
and a temporary array
of size
for all polynomial
multiplications in the product tree. Our solution splits
asymmetrically into a “large” set of size
where
and a
“small” set of size
.
For example, for
we use
and
. To multiply
we first recursively compute
in
and
in
. We copy the contents of
to the input array
and we compute the product
in
.
For small
, we use polmul64s
an in-place multiplication with quadratic complexity. Then we add
and
to
so that
contains
.
Thus we have the following result.
of size
and
be an array of size
and
an array of size
where
and
.
Assuming
, Algorithm treemul computes
in
arithmetic operations in
using
only the memory of
and
.
In Figure
3
, because the memory for
,
and
in the two recursive calls is separated, we may execute the two
recursive calls to
treemul
in parallel. This also holds recursively, so we can keep executing
recursive calls in parallel, which is ideal for Cilk. In our parallel
implementation of
treemul
we parallelize the two recursive calls if
.
We also parallelize
FFTmul64s
and the FFTs (both the subalgorithms and their calls inside
treemul
). We do not parallelize the two polynomial additions.
#include <stdlib.h> #define LONG long long int // Input polynomials a and b mod p of degree da and db respectively. // Output array c is space for the sum a + b mod p (product a b mod p) LONG poladd64s( LONG *a, LONG *b, LONG *c, LONG da, LONG db, LONG p ); LONG polmul64s( LONG *a, LONG *b, LONG *c, LONG da, LONG db, LONG p ); LONG FFTmul64s( LONG *a, LONG *b, LONG *c, LONG da, LONG db, LONG *T, LONG p ); void treemul( LONG *S, LONG n, LONG *Q, LONG *T, LONG p ) { LONG i,m,d; if( n==1 ) { Q[0] = submodp(0,S[0],p); return; } for( m=1; 2*m<n; m=2*m ); d = n-m; treemul(S,m,Q,T,p); // compute a(x)-x^m in Q[0..m-1] using S[0..m-1] and T[0..3m-1] treemul(S+m,d,Q+m,T+3*m,p); // and b(x)-x^d in Q[m..n-1] using S[m..n-1] and T[3m..] for( i=0; i<n; i++ ) S[i] = Q[i]; // S = [a0,a1,…,am-1,b0,b1,…,bd-1] if( d<10 ) polmul64s(S,S+m,Q,m-1,d-1,p); // in-place classical multiplication else FFTmul64s(S,S+m,Q,m-1,d-1,T,p); // FFT multiplication using T[0..6m-1] Q[n-1] = 0; poladd64s(Q+d,S,Q+d,m-1,m-1,p); // add x^m (b-x^d) to Q[d..n-1] poladd64s(Q+m,S+m,Q+m,d-1,d-1,p); // add x^d (a-x^m) to Q[m..n-1] // Q[n] = 1; is okay for this sequential code but not for a parallel code return; } LONG *array64s(LONG n); // return an array of size n void treeproduct( LONG *S, LONG n, LONG *Q, LONG p ) { // S is an array of size n and Q of size n+1 LONG N,*T; for( N=1; N<n; N=2*N ); T = array64s(3*N); treemul(S,n,Q,T,p); free(T); Q[n] = 1; return; } |
Figure 3. C code for computing the product ∏_{s ∈ S} (x − s) with an in-place product tree.
Remark. By splitting S asymmetrically so that the FFTs are fully utilized, we gain up to a factor of 2 in speed when d is much smaller than m. Thus we get the benefit of using a truncated FFT [22] for the polynomial multiplications without using it.
Remark
We want to compare our new tangent Graeffe implementation with an
implementation of the Cantor–Zassenhaus algorithm [8]
which uses fast polynomial arithmetic. One of the best implementations
that we know of is Magma's Factorize command. Magma uses an
implementation of Shoup's work [44] by Allan Steel [45]. For of degree
with
distinct roots in
, Magma uses the Cantor–Zassenhaus algorithm
which has time complexity
.
The timings in Tables 1 and 2 were obtained on an 8 core Intel Xeon E5-2660 CPU which runs at 2.2 GHz base and 3.0 GHz turbo. The input polynomials in Tables 1 and 2 of degree d were created with d random distinct roots in F_p. Table 1 is for the 28.8 bit prime p = 7·2^26 + 1 and Table 2 is for the 62.4 bit prime p = 3·29·2^56 + 1.
The timings in column NewTG are for our C implementation of Algorithm 1 with the parameter chosen in
. The timings in column Magma are
for the Factorize command for Magma version V2.25-5. Magma code
for creating
and factoring it is given in
Appendix A. We note that older versions of Magma have a
quadratic sub-algorithm; this problem was fixed by Steel for version
V2.25-5. We also note that Magma has faster arithmetic for primes
.
The columns in Tables 1 and 2 labeled Step5, Step6, Step9, and Step10 report the time spent in steps 5, 6, 9, and 10, respectively, in the top level call of Algorithm 1. They show that initially step 5 dominates the cost of steps 6 and 9. As the degree increases, the number of iterations of step 5 in Algorithm 1 drops; ultimately, the computation of the product in step 9 dominates.
Table 1. Timings for the 28.8 bit prime p = 7·2^26 + 1.

Table 2. Timings for the 62.4 bit prime p = 3·29·2^56 + 1.
The column labeled %roots reports the percentage of the roots found in
the first tangent Graeffe application. Column is
the value of
in step 5. Step 5 does
FFTs of size
. Column FFT reports the time for one of those
FFTs. For example, in Table 2, for
, 13.9 seconds was spent in step 5. The
code did
FFTs of size
which took
seconds.
The timings in Table 1 show that the tangent Graeffe method
is 100 times faster than Magma's implementation of
Cantor–Zassenhaus at degree .
For the larger prime in Table 2 the tangent Graeffe method
is 166 times faster at degree
.
Table 3 shows the effect of different choices for the
parameter in Algorithm 1. If the
roots of
under the Graeffe transform
are uniformly distributed among the
-th roots of unity, then the proportion of them that
remain distinct is
. Thus
doubling
increases the proportion of roots found
at each recursive level of Algorithm 1, which saves time,
but it also doubles the cost of the evaluations in step 6.
Recall that the theoretically optimal performance is obtained by taking
, where we note that
and
for
.
One column gives the expected proportion of roots that will be found; column %roots is the actual percentage of roots found by Algorithm 1. Columns Step5, Step6, and Total are the time in step 5, step 6, and the total time. The data in Table 3 suggests choosing the larger threshold to minimize time, for which the algorithm is expected to obtain a larger proportion of the roots. Our actual, smaller choice increases the time by about 10% but saves a factor of 2 in space.
Table 3. The effect of different choices of the parameter in Algorithm 1 on the timings and the proportion of roots found.
Table 4 shows the results for our parallel implementation, including timings for inputs of degree up to one billion. The timings in Table 4 were obtained on a server with a 10 core Intel E5 2680 v2 CPU with 128 gigabytes of RAM running at 3.0 GHz base, 3.6 GHz turbo. The input polynomial is created as before by choosing random distinct values from F_p.
Table 4. Timings for the parallel implementation on the 10 core machine.
Timings in column are for computing the input
polynomial
using our in-place parallel product
tree multiplication from section 3.4. The total time for computing the
roots of
is in column Total. Roughly speaking,
our software takes 10 times longer to compute the roots of
(less for larger
)
than it does to create
!
Timings in columns Step4, Step5, Step6, Step9, and
are the times for those steps for the top-level call of Algorithm 1. Speedups are given for each computation. The reader can see
that the parallel speedup is not as good for Step4 and the division
. This is because we have
only parallelized the FFTs in them; none of the steps of linear cost are
parallelized.
The motivation for our work is the problem of sparse polynomial
interpolation where one seeks to recover the coefficients and monomials
of a polynomial with
terms from values of
. The
popular approach by Prony and Ben-Or–Tiwari needs only
values of
but it needs to factor a
polynomial
of degree
which has
distinct roots in
. Using the Cantor–Zassenhaus (CZ)
algorithm, computing the roots of
takes
time. Because this is the most expensive step in
Ben-Or/Tiwari sparse interpolation, and because CZ does gcd computations
which are difficult to parallelize, research in sparse interpolation has
tried different approaches that do not require root finding.
If we choose the prime p of the form p = σ·2^k + 1 with σ small, the new tangent
Graeffe (TG) algorithm factors
in
time. This is
faster than CZ, but
the constants actually matter in practice. In this work we improved the
main step of TG by a factor of 2 and we showed that for large
, TG is faster than CZ by a factor
of
. Our new C implementation
of TG is over 100 times faster than Magma's C implementation of CZ for
on the tests we made. So TG really is a lot
faster than CZ.
Another goal was to see if we could parallelize TG for multi-core
computers. We found that it was not sufficient to only parallelize the
underlying FFT. We also had to parallelize many sub-algorithms to get
good parallel speedup. Here we contributed a new parallel in-place
product tree algorithm. We were also successful in reducing the space
needed so that we could compute the roots of of
degree one billion on a 10 core computer with 128 gigabytes of RAM in
about one hour. It should be possible to reduce the time further by
using vectorization [12, 26]. The sequential
implementation of the FFTs could also be further improved using
techniques from [15, 42] and Harvey's approach
for number theoretic FFTs [21].
p := 3*29*2^56+1; p := 7*2^26+1; Fp := FiniteField(p); R<x> := PolynomialRing(Fp); mult := function( L ) n := #L; if n eq 1 then return x-L[1]; else m := n div 2; f := $$( L[1..m] ) * $$( L[m+1..n] ); return f; end if; end function; d := 2^12-1; S := { Random(Fp) : i in [1..d] }; while #S lt d do S := S join { Random(Fp) : i in [1..d-#S] }; end while; #S; L := [ x : x in S ]; time f := mult( L ); time g := Factorization(f);
A. V. Aho, K. Steiglitz, and J. D. Ullman. Evaluating polynomials on a fixed set of points. SIAM Journ. of Comp., 4:533–539, 1975.
M. Ben-Or and P. Tiwari. A deterministic algorithm for sparse multivariate polynomial interpolation. In STOC '88: Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 301–309. New York, NY, USA, 1988. ACM Press.
D.J. Bernstein. Fast multiplication and its applications, pages 325–384. Mathematical Sciences Research Institute Publications. Cambridge University Press, United Kingdom, 2008.
Leo I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
A. Bostan, G. Lecerf, and É. Schost. Tellegen's principle into practice. In ISSAC '03: Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, pages 37–44. ACM Press, 2003.
R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun. Fast solution of Toeplitz systems of equations and computation of Padé approximants. J. Algorithms, 1(3):259–295, 1980.
J. Canny, E. Kaltofen, and Y. Lakshman. Solving systems of non-linear polynomial equations faster. In Proceedings of the ACM-SIGSAM 1989 International Symposium on Symbolic and Algebraic Computation, pages 121–128. ACM Press, 1989.
D. G. Cantor and H. Zassenhaus. A new algorithm for factoring polynomials over finite fields. Math. Comp., 36(154):587–592, 1981.
E. Chu and A. George. Inside the FFT Black Box. Computational Mathematics Series. CRC Press, 2000.
S. Czapor, K. Geddes, and G. Labahn. Algorithms for Computer Algebra. Kluwer Academic Publishers, 1992.
A. Díaz and E. Kaltofen. FOXBOX: a system for manipulating symbolic objects in black box representation. In ISSAC '98: Proceedings of the 1998 International Symposium on Symbolic and Algebraic Computation, pages 30–37. ACM Press, 1998.
P. Fortin, A. Fleury, F. Lemaire, and M. Monagan. High performance SIMD modular arithmetic for polynomial evaluation. ArXiv:2004.11571, 2020.
F. Franchetti, M. Püschel, Y. Voronenko, S. Chellappa, and J. Moura. Discrete Fourier transform on multicores. IEEE Signal Processing Magazine, 26(6):90–102, 2009.
T. S. Freeman, G. M. Imirzian, E. Kaltofen, and Y. Lakshman. DAGWOOD: a system for manipulating polynomials given by straight-line programs. ACM Trans. Math. Software, 14:218–240, 1988.
M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on “Program Generation, Optimization, and Platform Adaptation”.
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of PLDI 1998, pages 212–223. ACM, 1998.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.
P. Giorgi, B. Grenet, and D. S. Roche. Fast in-place algorithms for polynomial operations: division, evaluation, interpolation. Technical Report http://arxiv.org/abs/2002.10304, Arxiv, 2020.
B. Grenet, J. van der Hoeven, and G. Lecerf. Randomized root finding over finite fields using tangent Graeffe transforms. In Proc. ISSAC '15, pages 197–204. New York, NY, USA, 2015. ACM.
B. Grenet, J. van der Hoeven, and G. Lecerf. Deterministic root finding over finite fields using Graeffe transforms. AAECC, 27(3):237–257, 2016.
D. Harvey. Faster arithmetic for number-theoretic transforms. J. Symbolic Comput., 60:113–119, 2014.
J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290–296. Univ. of Cantabria, Santander, Spain, July 4–7 2004.
J. van der Hoeven. Probably faster multiplication of sparse polynomials. Technical Report, HAL, 2020. http://hal.archives-ouvertes.fr/hal-02473830.
J. van der Hoeven and G. Lecerf. On the bit-complexity of sparse polynomial and series multiplication. J. Symbolic Comput., 50:227–254, 2013.
J. van der Hoeven and G. Lecerf. Sparse polynomial interpolation in practice. ACM Commun. Comput. Algebra, 48(3/4):187–191, 2015.
J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1):5–1, 2016.
J. van der Hoeven and M. Monagan. Implementing the tangent Graeffe root finding method. In A. M. Bigatti, J. Carette, J. H. Davenport, M. Joswig, and T. de Wolff, editors, Mathematical Software – ICMS 2020, pages 482–492. Cham, 2020. Springer International Publishing.
J. van der Hoeven et al. GNU TeXmacs. https://www.texmacs.org, 1998.
J. Hu and M. B. Monagan. A fast parallel sparse polynomial GCD algorithm. In S. A. Abramov, E. V. Zima, and X.-S. Gao, editors, Proc. ISSAC '16, pages 271–278. ACM, 2016.
M. A. Huang and A. J. Rao. Interpolation of sparse multivariate polynomials over large finite fields with applications. In SODA '96: Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, pages 508–517. Philadelphia, PA, USA, 1996. Society for Industrial and Applied Mathematics.
M. Javadi and M. Monagan. Parallel sparse polynomial interpolation over finite fields. In Proceedings of PASCO 2010, pages 160–168. ACM Press, 2010.
E. Kaltofen. Computing with polynomials given by straight-line programs I: greatest common divisors. In STOC '85: Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, pages 131–142. ACM Press, 1985.
E. Kaltofen, Y. N. Lakshman, and J.-M. Wiley. Modular rational sparse multivariate polynomial interpolation. In ISSAC '90: Proceedings of the International Symposium on Symbolic and Algebraic Computation, pages 135–139. New York, NY, USA, 1990. ACM Press.
E. Kaltofen and B. M. Trager. Computing with polynomials given by black boxes for their evaluations: greatest common divisors, factorization, separation of numerators and denominators. JSC, 9(3):301–320, 1990.
E. Kaltofen and L. Yagati. Improved sparse multivariate polynomial interpolation algorithms. In ISSAC '88: Proceedings of the International Symposium on Symbolic and Algebraic Computation, pages 467–474. Springer Verlag, 1988.
R. Larrieu. The truncated Fourier transform for mixed radices. In Proc. ISSAC '17, pages 261–268. New York, NY, USA, 2017. ACM.
R. Moenck. Fast computation of GCDs. In Proc. of the 5th ACM Annual Symposium on Theory of Computing, pages 142–171. New York, 1973. ACM Press.
N. Möller and T. Granlund. Improved division by invariant integers. IEEE Transactions on Computers, 60:165–175, 2011.
H. Murao and T. Fujise. Modular algorithm for sparse multivariate polynomial interpolation and its parallel implementation. JSC, 21:377–396, 1996.
J. M. Pollard. The fast Fourier transform in a finite field. Mathematics of Computation, 25(114):365–374, 1971.
R. Prony. Essai expérimental et analytique sur les lois de la dilatabilité des fluides élastiques et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures. J. de l'École Polytechnique Floréal et Prairial, an III, 1(cahier 22):24–76, 1795.
M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and Adaptation”, 93(2):232–275, 2005.
D. S. Roche. What can (and can't) we do with sparse polynomials? In Proc. ISSAC '18, pages 25–30. New York, NY, USA, 2018. ACM.
V. Shoup. A new polynomial factorization and its implementation. J. Symbolic Computation, 20(4):363–397, 1995.