This article has been written using GNU TeXmacs [18].

In this paper, we propose a carefully optimized "half-gcd" algorithm for polynomials. We achieve a constant speed-up with respect to previous work for the asymptotic time complexity. We also discuss special optimizations that are possible when polynomial multiplication is done using radix two FFTs.
The computation of greatest common divisors is the key operation to be optimized when implementing a package for rational number arithmetic. For integers of small bit-length $n$, one may use Euclid's algorithm, which has a quadratic time complexity $O(n^2)$. Asymptotically faster algorithms were first proposed by Knuth [20] and Schönhage [29], based on earlier ideas by Lehmer [22]. Schönhage's algorithm has a logarithmic overhead with respect to integer multiplication, which is believed to be asymptotically optimal. Many variants and improvements have been proposed since, mainly aiming at faster practical implementations [36, 32, 23, 26, 4, 27]. All these subquadratic algorithms are based on a recursive reduction to half of the precision; for this reason, it is convenient to regroup them under the name "half-gcd algorithms".
An analogous story can be told about the history of polynomial gcd computations. The polynomial counterpart of Euclid's algorithm was first described by Stevin [33, p. 241]. The first algorithms for polynomial half-gcds are due to Moenck [25] and Brent–Gustavson–Yun [5]; several variants have been developed since [35, 36, 4].
We refer to [10, chapter 11] for a gentle exposition of the polynomial half-gcd algorithm. This exposition also contains a careful analysis of the constant factor in the asymptotic time complexity. More precisely, let $\mathsf{M}(n)$ be the complexity to multiply two polynomials of degree $<n$ over an abstract effective field $\mathbb{K}$. Then the gcd of two polynomials of degree $\leqslant n$ can be computed using $O(\mathsf{M}(n) \log n)$ operations in $\mathbb{K}$, with an explicit constant factor in front of $\mathsf{M}(n) \log n$. This constant further drops if the corresponding Euclidean remainder sequence is "normal" (all quotients in the ordinary Euclidean algorithm are of degree one). The authors declare that they made no particular efforts to optimize these two constants; in [10, Research problem 11.11], they ask how far these constants can be lowered. The main goal of the present paper is to make progress on this question.
One major motivation for optimizing constant factors in time complexity bounds is the development of faster implementations. Now practical algorithms for multiplying large polynomials usually rely on fast Fourier techniques. We will also investigate optimizations that are only possible when multiplications are done in this way. Indeed, this so-called FFT model allows for various specific optimizations that typically help to reduce constant factors. We refer to [3] for a nice survey of such tricks and mention NTL [31] as a software library that was early to exploit them in a systematic way.
A particularly favorable case is when the ground field $\mathbb{K}$ has primitive $2^k$-th roots of unity for all sufficiently large $k$. We will call this the "binary FFT model". A good example is the finite field $\mathbb{F}_p$ for a prime $p$ of the form $p = a \cdot 2^k + 1$, where $k$ is large. Such finite fields arise naturally after modular reduction, provided that $p$ can be chosen freely.
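As a concrete illustration (ours, not taken from the paper), the following Python sketch verifies that the commonly used prime $p = 7 \cdot 2^{26} + 1$ fits this model and extracts a primitive $2^{26}$-th root of unity from the generator 3; both the prime and the generator are choices made for this example.

# Sketch (our own example): check that p = 7*2^26 + 1 admits primitive
# 2^k-th roots of unity for all k <= 26, as required by the binary FFT model.
p = 7 * 2**26 + 1                # an "FFT prime"; p - 1 = 7 * 2^26
assert all(p % q for q in range(2, int(p**0.5) + 1))   # p is prime (slow but simple check)

g = 3                            # 3 happens to be a primitive root modulo this p
w = pow(g, (p - 1) // 2**26, p)  # primitive 2^26-th root of unity in F_p
assert pow(w, 2**26, p) == 1 and pow(w, 2**25, p) != 1

# Compatible lower-order roots: w_m = w^(2^26 / m) whenever m divides 2^26.
w_1024 = pow(w, 2**26 // 1024, p)
assert pow(w_1024, 1024, p) == 1 and pow(w_1024, 512, p) != 1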
The binary FFT model allows for additional tricks and is particularly
useful for dichotomic algorithms such as the half-gcd. We refer to
section 2 for those results that will be required in this
paper. Note that large polynomial multiplications over general finite
fields are fastest using FFT-based techniques, although one may not
necessarily use radix two FFTs; see [14, 15]
and references therein for the fastest current algorithms.
For convenience of the reader, we will first describe our new half-gcd algorithms in the normal case (see section 4). This will allow us to avoid various technical complications and focus on the new optimizations. The first idea behind our new half-gcd is to make the update step after the first recursive call particularly efficient, by using a matrix variant of the middle product algorithm [12]. This leads to improved complexity bounds, both in general and in the FFT model (Proposition 6). The second idea is to combine this with known optimizations for the (binary) FFT model, such as FFT doubling, FFT caching, and Harvey's variant of the middle product [13]. In the binary FFT model, this allows us to compute half-gcds even faster (Proposition 7 and Corollary 9).
When dropping the normality assumption, the algorithm becomes more technical, but it turns out that the middle product trick can still be generalized. This is explained in section 5 and leads to improved bounds for general gcds, both for arbitrary multiplication schemes and in the FFT model (Theorem 15). In the binary FFT model, special efforts are needed in order to enforce the use of DFTs of power-of-two lengths. We prove a similar bound in this last case (Theorem 17). See Table 1 for a summary of the new constant factors.

Table 1. Summary of the new constant factors.
It is well known that polynomial gcds have many applications: fast
arithmetic on rational functions, fast reconstruction of rational
functions [10, section 5.7], polynomial factorization [7], fast decoding of error-correcting codes [2,
24, 9], sparse interpolation [1,
28], etc. Personally, we were mainly motivated by the last
application to sparse interpolation. After the introduction of the
tangent Graeffe algorithm [11, 17], gcd
computations have often become one of the main practical bottlenecks.
For this application, it is typically possible to work modulo a prime of the form $p = a \cdot 2^k + 1$, which allows us to exploit our optimizations for the binary FFT model.
We made a first implementation of the new algorithm and also programmed the customary extension to the computation of subresultants (see [10, section 11.2] and also [21] for this extension). Work is in progress on HPC implementations of our algorithms and all required subalgorithms.
A few questions remain for future work. It is likely that the ideas in
this paper can be applied to integer gcds as well, except for the
optimizations that are specific to the binary FFT model. It would also
be nice to have counterparts for the “ternary FFT model”,
which could be used for fields of characteristic
two (see [17, section 3.4] for an algorithm of this kind in
the case of iterated Graeffe transforms). Finally, in view of Bernstein
and Yang's recent ideas from [4], as well as Remark 16, one may wonder whether the bounds for the normal case can
actually be extended to the general case.
Acknowledgments. We are grateful to Michael
The best bounds in this paper will be proved for the so-called "binary FFT model", which requires special assumptions on the ground field $\mathbb{K}$. First of all, this model requires $2$ to be invertible in $\mathbb{K}$. Secondly, for any polynomial degree $n$ that occurs during computations, we assume that $\mathbb{K}$ has a primitive $2^k$-th root of unity $\omega_{2^k}$ with $2^k \geqslant n$. For convenience, we also assume that these roots of unity are chosen compatibly, i.e. $\omega_{2^l} = \omega_{2^k}^{2^{k - l}}$ whenever $2^l$ divides $2^k$.
We measure computational complexity in terms of the number of required field operations¹.

1. Using an easy refinement of the analysis, it turns out that the main half-gcd algorithms in this paper involve only a linear number of divisions, so the bulk of the operations are actually ring operations in $\mathbb{K}$.
Let $\mathbb{K}[x]_n$ denote the space of polynomials of degree $<n$. Given $f \in \mathbb{K}[x]_n$ with $n = 2^k$, we define its discrete Fourier transform $\mathrm{DFT}_\omega(f) \in \mathbb{K}^n$ with respect to $\omega = \omega_n$ by
$$\mathrm{DFT}_\omega(f) = (f(1), f(\omega), \ldots, f(\omega^{n-1})).$$
We will write $\mathsf{F}(n)$ for the maximum cost of computing a discrete Fourier transform of order $n$ or the inverse transformation $\mathrm{DFT}_\omega^{-1}$. It is well known [8] that $\mathsf{F}(n) = O(n \log n)$.
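To make the definition concrete, here is a small Python sketch (our own illustration, using the FFT prime $p = 7 \cdot 2^{26} + 1$ from the earlier example) of the discrete Fourier transform and its inverse over $\mathbb{F}_p$; a real implementation would of course use a radix-two FFT of cost $O(n \log n)$ rather than these quadratic loops.

# Naive DFT and inverse DFT over F_p (illustration only; quadratic instead of O(n log n)).
p = 469762049                       # 7*2^26 + 1, as in the earlier example

def dft(f, w, n):                   # f: at most n coefficients, w: primitive n-th root of unity
    f = f + [0] * (n - len(f))
    return [sum(f[j] * pow(w, i * j, p) for j in range(n)) % p for i in range(n)]

def idft(v, w, n):                  # inverse transform: DFT at w^(-1), then divide by n
    inv_n = pow(n, p - 2, p)
    return [c * inv_n % p for c in dft(v, pow(w, p - 2, p), n)]

n = 8
w = pow(3, (p - 1) // n, p)         # primitive 8-th root of unity
f = [5, 1, 4, 1, 5, 9, 2, 6]
assert idft(dft(f, w, n), w, n) == f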
In what follows, we will always assume that $\mathsf{F}(n)/n$ and $\mathsf{M}(n)/n$ are non-decreasing (as functions of $n$). Therefore, we might just as well express all complexity bounds in terms of $\mathsf{M}$ alone, but it is instructive to keep the notation $\mathsf{F}$ to indicate where we use Fourier transforms.
If $f \in \mathbb{K}[x]_n$ and the discrete Fourier transform of $f$ at order $n$ with respect to $\omega^2$ is known (where $\omega$ is a primitive $2n$-th root of unity), then it costs $\mathsf{F}(n) + O(n)$ to compute $\mathrm{DFT}_\omega(f)$ at order $2n$. Indeed, we already know the values $f(\omega^{2i})$ for $i = 0, \ldots, n-1$, so it remains to compute the values $f(\omega^{2i+1})$. But this is the DFT of $f(\omega x)$ at order $n$ (with respect to $\omega^2$), and it takes a linear number of operations to compute $f(\omega x)$ in terms of $f$. We call this trick FFT doubling.
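The following Python sketch (ours, reusing dft and p from the previous sketch) spells the trick out: the order-$2n$ transform of $f$ is obtained from the known order-$n$ transform, one extra order-$n$ transform of the rescaled polynomial $f(\omega x)$, and a linear amount of extra work.

# FFT doubling (illustration): extend DFT_{w^2}(f) of order n to DFT_w(f) of order 2n.
def fft_double(f, dft_f_order_n, w, n):
    # f has at most n coefficients; dft_f_order_n = [f(w^0), f(w^2), ..., f(w^(2n-2))].
    g = [c * pow(w, i, p) % p for i, c in enumerate(f)]   # g(x) = f(w*x), linear work
    odd = dft(g, pow(w, 2, p), n)                         # one extra transform of order n
    out = [0] * (2 * n)
    out[0::2] = dft_f_order_n                             # values at even powers of w
    out[1::2] = odd                                       # values at odd powers of w
    return out

w2n = pow(3, (p - 1) // 16, p)       # primitive 16-th root of unity
f = [3, 1, 4, 1, 5]                  # fewer than 8 coefficients
assert fft_double(f, dft(f, pow(w2n, 2, p), 8), w2n, 8) == dft(f, w2n, 16)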
Given $f, g \in \mathbb{K}[x]_n$ with $2n = 2^k$, we may use the discrete Fourier transform to compute the product $fg \in \mathbb{K}[x]_{2n}$ using
$$fg = \mathrm{DFT}_\omega^{-1}(\mathrm{DFT}_\omega(f)\,\mathrm{DFT}_\omega(g)), \qquad (1)$$
where $\omega$ is a primitive $2n$-th root of unity and the product inside the parentheses is computed componentwise. This is called FFT multiplication. If we fix $g \in \mathbb{K}[x]_n$, then we may consider the multiplication map with $g$ as a linear map $M_g \colon \mathbb{K}[x]_n \rightarrow \mathbb{K}[x]_{2n}$. We have
$$M_g = \mathrm{DFT}_\omega^{-1} \circ \Delta_{\mathrm{DFT}_\omega(g)} \circ \mathrm{DFT}_\omega \circ \iota,$$
where $\Delta_v$ stands for componentwise multiplication by $v$ in $\mathbb{K}^{2n}$ and $\iota$ for the injection of $\mathbb{K}[x]_n$ into $\mathbb{K}[x]_{2n}$.
Given of degree
and
, we define their middle
product
by
Let us recall how can be computed efficiently
using FFT multiplication.
Let be of degree
,
let
and
.
If
, then we observe that
The matrix in the middle has rows and
columns and we may think of it as the matrix of the map
. If
, then we note that
In other words , where
and
stands for the transpose
of
. Since
where , it follows that
Taking DFTs with respect to instead of
and using that
,
we also obtain
This is the alternative formula from [13] that we will use.
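As a concrete illustration of why a middle product costs about as much as one FFT product, here is a Python sketch (our own, reusing dft, idft, and p from the sketch in the previous subsection, and assuming the common convention that the middle product consists of coefficients $n, \ldots, 2n-1$ of $fg$ when $f$ has at most $n+1$ coefficients and $g$ has $2n$ coefficients): these coefficients receive no wrap-around contributions, so a single cyclic convolution of length $2n$, i.e. three transforms, suffices.

# Middle product via one cyclic convolution of length 2n (conventions assumed, see above).
def cyclic_mul(f, g, m, w):                        # length-m cyclic convolution via DFTs
    F, G = dft(f, w, m), dft(g, w, m)
    return idft([a * b % p for a, b in zip(F, G)], w, m)

def middle_product(f, g, n, w):                    # len(f) <= n + 1, len(g) == 2 * n
    return cyclic_mul(f, g, 2 * n, w)[n:2 * n]     # coefficients n .. 2n-1 of f*g

def naive_middle_product(f, g, n):                 # reference: slice of the full product
    full = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            full[i + j] = (full[i + j] + a * b) % p
    return full[n:2 * n]

n = 4
w = pow(3, (p - 1) // (2 * n), p)                  # primitive 8-th root of unity
f = [1, 2, 3, 4, 5]                                # degree n = 4
g = [6, 7, 8, 9, 1, 2, 3, 4]                       # 2n = 8 coefficients
assert middle_product(f, g, n, w) == naive_middle_product(f, g, n)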
From the complexity perspective, let $\mathsf{M}(n)$ be the cost to multiply two polynomials in $\mathbb{K}[x]_n$. Using FFT multiplication, we see that $\mathsf{M}(n) \leqslant 3\,\mathsf{F}(2n) + O(n)$ in the binary FFT model. More generally, any two polynomials whose product lies in $\mathbb{K}[x]_n$, for an order $n$ that is a power of two, can be multiplied in time $3\,\mathsf{F}(n) + O(n)$ using this method.
Similarly, the cost
to compute a middle product
(2) with
and
satisfies
. If
is even and
is general, then
can be computed in time
.
It is important to note that the above results all generalize to the case when the coefficient field $\mathbb{K}$ is replaced by the algebra $\mathbb{K}^{2 \times 2}$ of $2 \times 2$ matrices with entries in $\mathbb{K}$. For instance, assume that $F, G \in \mathbb{K}^{2 \times 2}[x]_n$ with $2n = 2^k$. Then (1) allows us to compute $FG$ using $8$ direct DFTs, $4$ inverse DFTs, and $2n$ multiplications of $2 \times 2$ matrices in $\mathbb{K}^{2 \times 2}$, for a total cost of $12\,\mathsf{F}(2n) + O(n)$.
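To illustrate the matrix case, the following Python sketch (again ours, reusing dft, idft, and p from the sketches above) multiplies two $2 \times 2$ matrices of polynomials by transforming each entry, multiplying the transformed matrices pointwise, and transforming the four entries of the result back, for 8 direct and 4 inverse transforms in total.

# 2x2 matrix polynomial product via DFTs: 8 direct transforms, 4 inverse transforms,
# and one 2x2 matrix product over F_p per evaluation point (illustration).
def mat_poly_mul(F, G, m, w):                      # F, G: 2x2 matrices of coefficient lists
    Fd = [[dft(F[i][j], w, m) for j in range(2)] for i in range(2)]
    Gd = [[dft(G[i][j], w, m) for j in range(2)] for i in range(2)]
    Hd = [[[(Fd[i][0][k] * Gd[0][j][k] + Fd[i][1][k] * Gd[1][j][k]) % p for k in range(m)]
           for j in range(2)] for i in range(2)]
    return [[idft(Hd[i][j], w, m) for j in range(2)] for i in range(2)]

def pmul(a, b):                                    # naive polynomial product, for checking
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] = (c[i + j] + x * y) % p
    return c

def padd(a, b):
    return [((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % p
            for i in range(max(len(a), len(b)))]

m = 8                                              # DFT length; product entries have degree < 8
w = pow(3, (p - 1) // m, p)
F = [[[1, 2, 3], [0, 1]], [[4], [2, 2]]]           # entries of degree < 3
G = [[[1, 1], [5]], [[0, 3], [1, 0, 2]]]
H = mat_poly_mul(F, G, m, w)
expected = padd(pmul(F[0][0], G[0][0]), pmul(F[0][1], G[1][0]))
assert H[0][0][:len(expected)] == expected and not any(H[0][0][len(expected):])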
Although the main focus in this paper is on the binary FFT model, we
will also consider other types of polynomial multiplication. In that
case, we will still denote by and
the complexities of multiplying two polynomials in
and to compute the middle product
for
of degree
and
. Moreover, we make the following
assumptions on these cost functions:
We have .
The product of any with
can be computed in time
and similarly for
the middle product
of
with
and
.
The functions ,
, and
are non-decreasing (for
,
, and
, respectively).
We also make the same assumptions for the analogous cost functions obtained when taking coefficients in $\mathbb{K}^{2 \times 2}$ instead of $\mathbb{K}$.
In the binary FFT model, we have seen that we may take $\mathsf{M}(n)$ and the corresponding middle product cost to be of the form $c\,\mathsf{F}(2n) + O(n)$ for suitable constants $c$, after which the above assumptions are satisfied (if $n$ is not a power of two, then one may use the truncated Fourier transform [16], for instance). They are also verified for all other commonly used types of multiplication, such as Karatsuba's and Toom's methods [19, 37], or FFT-based methods for arbitrary radices [30, 6, 15].
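As a reminder of how such a scheme works, here is a compact Python sketch (our own illustration, over the integers) of Karatsuba's method for polynomials, which trades one of the four half-size products for a handful of additions and subtractions.

# Karatsuba multiplication of coefficient lists (illustration; three half-size products).
def karatsuba(a, b):
    n = max(len(a), len(b))
    if n <= 1:
        return [a[0] * b[0]] if a and b else []    # empty list = zero polynomial
    m = n // 2
    a0, a1 = a[:m], a[m:]
    b0, b1 = b[:m], b[m:]
    lo, hi = karatsuba(a0, b0), karatsuba(a1, b1)
    mid = karatsuba(add(a0, a1), add(b0, b1))
    mid = sub(sub(mid, lo), hi)                    # equals a0*b1 + a1*b0
    out = [0] * (len(a) + len(b) - 1)
    for i, c in enumerate(lo):  out[i] += c
    for i, c in enumerate(mid): out[i + m] += c
    for i, c in enumerate(hi):  out[i + 2 * m] += c
    return out

def add(a, b):
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(max(len(a), len(b)))]
def sub(a, b):
    return [(a[i] if i < len(a) else 0) - (b[i] if i < len(b) else 0) for i in range(max(len(a), len(b)))]

assert karatsuba([1, 2, 3], [4, 5]) == [4, 13, 22, 15]   # (1+2x+3x^2)(4+5x)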
In addition, we define to be a constant such
that
. We have seen that we
may take
in the binary FFT model; this
generalizes to arbitrary FFT models. Using Strassen's method for
matrix multiplication [34], we may
always take
. From a
practical point of view, we usually have
.
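For reference (this is a standard scheme, recalled here for convenience), Strassen's method computes the product $C = AB$ of two $2 \times 2$ matrices using the seven products
$$\begin{aligned}
 m_1 &= (a_{11} + a_{22})(b_{11} + b_{22}), & m_2 &= (a_{21} + a_{22})\, b_{11}, & m_3 &= a_{11}(b_{12} - b_{22}),\\
 m_4 &= a_{22}(b_{21} - b_{11}), & m_5 &= (a_{11} + a_{12})\, b_{22}, & m_6 &= (a_{21} - a_{11})(b_{11} + b_{12}),\\
 m_7 &= (a_{12} - a_{22})(b_{21} + b_{22}), & & & &
\end{aligned}$$
followed by the recombinations
$$c_{11} = m_1 + m_4 - m_5 + m_7, \quad c_{12} = m_3 + m_5, \quad c_{21} = m_2 + m_4, \quad c_{22} = m_1 - m_2 + m_3 + m_6.$$
Applied with polynomial entries, this replaces eight polynomial multiplications by seven, at the price of a linear number of additions and subtractions.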
Let us examine the constant a bit more closely
in the case of Karatsuba multiplication. The complexity of this method
satisfies
for certain constants and
, which yields
.
Similarly, when using Karatsuba's method to multiply
matrices, the complexity
satisfies
which leads to and
This analysis can be generalized to general lengths
and to the case when we only use Karatsuba's method for degrees above a
certain threshold. In the latter case, it is usually favorable to choose
the threshold over
to be slightly lower than the
threshold over
.
Let be such that
.
The Euclidean remainder sequence
is
defined by
The length of the sequence is the smallest index for which
. For
, we set
We also define the sequence of Bezout matrices
by
so that
for . We have
.
We regard as a matrix polynomial in
and say that
and
are normal if
for all
. This is the case if and only if
and
for all
. For
,
we also define
so that
(We understand that if
.) In particular,
so an extended gcd computation essentially boils down to the computation
of the matrix product . In
the case of a normal remainder sequence, this is done most efficiently
using binary splitting:
In essence, this is also how the half-gcd algorithm works.
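To make these objects concrete, here is a Python sketch (our own illustration, over the example prime from section 2; the helper names and conventions are ours) that computes a Euclidean remainder sequence together with its quotients and checks that the product of the elementary quotient matrices maps the pair $(R_0, R_1)$ to the last two remainders, which is precisely what the binary splitting above computes.

# Remainder sequence R_0, R_1, ... and Bezout matrices over F_p (illustration only).
p = 469762049

def pmul(a, b):
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] = (c[i + j] + x * y) % p
    return c

def padd(a, b):
    n = max(len(a), len(b))
    return [((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % p for i in range(n)]

def trim(a):                                         # canonical form: drop leading zeros
    while len(a) > 1 and a[-1] == 0:
        a = a[:-1]
    return a

def pdivmod(a, b):                                   # low-degree-first coefficient lists
    a, q = a[:], [0] * max(1, len(a) - len(b) + 1)
    inv = pow(b[-1], p - 2, p)
    for i in range(len(a) - len(b), -1, -1):
        q[i] = a[i + len(b) - 1] * inv % p
        for j, c in enumerate(b):
            a[i + j] = (a[i + j] - q[i] * c) % p
    return trim(q), trim(a)

R = [[1, 0, 2, 0, 3, 1], [4, 1, 0, 2, 1]]            # R_0, R_1 with deg R_0 > deg R_1
Q = []                                               # elementary quotient matrices
while len(R[-1]) > 1 or R[-1][0] != 0:               # until the remainder vanishes
    q, r = pdivmod(R[-2], R[-1])
    Q.append([[[0], [1]], [[1], [(-c) % p for c in q]]])   # ((0, 1), (1, -q))
    R.append(r)

# The product B = Q_k ... Q_1 maps (R_0, R_1) to (R_k, R_{k+1}); check it.
B = [[[1], [0]], [[0], [1]]]
for Qi in Q:
    B = [[trim(padd(pmul(Qi[i][0], B[0][j]), pmul(Qi[i][1], B[1][j]))) for j in range(2)] for i in range(2)]
v = [trim(padd(pmul(B[i][0], R[0]), pmul(B[i][1], R[1]))) for i in range(2)]
assert v[0] == R[-2] and v[1] == R[-1]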
, we have
In particular, if is normal, then
and
.
Proof. Let us show by induction on that
for any . This is clear if
, so assume that
and let
. Then
so the induction hypothesis yields
whence . In a similar way,
the induction hypothesis yields
for all
.
As to the second relation, we note that
whence
Example and
, we
obtain the following normal remainder sequence:
In the abnormal case, it is convenient to work with an alternative indexation of remainder sequences that behaves as in the normal case. Note that we will not consider abnormal remainder sequences until section 5 below, so the reader may safely skip this subsection until there.
Let us now explain our reindexation in detail. For any , let
We also take
Then we set
Moreover, for any , we define
For , we also define
so that we still have
By construction, for , we
have
As before, we will sometimes write instead of
in order to emphasize the dependence on
and
, and
similarly for
, etc.
Occasionally, when
is not clear from the
context, we also write
,
, etc.
Example and
, we
have
After reindexation ,
,
,
, and
, we obtain
Let be as in the previous section with remainder
sequence
. We will write
for the corresponding
-th
Bezout matrix
in case we wish to make the
dependency on
and
clear.
Similarly, we define
and
. Given a polynomial
and
indices
, it will also be
convenient to define
Here we understand that whenever
.
Proof. We have for all
. For
, the relation
thus shows that the coefficient of degree
in
only depends on
coefficients
and
of
and
with
. In particular,
and
since and
.
This shows that
By induction on , we also
obtain
. We conclude by
taking
.
The lemma leads to the following recursive algorithm for the computation
of :
Algorithm 1
Proof. If ,
then the result follows from Lemma 4. If
, then Lemma 4 implies
, whence
and
. For
, we have
.
This allows us to apply Lemma 4 once more, and obtain
. We conclude by noting that
.
Let us now show how to compute and
efficiently in step 4 using a middle product
of
matrix polynomials. In order to simplify the
exposition, we assume that
is a power of two; in
subsection 4.4 we will consider arbitrary lengths. We first
decompose all polynomials into blocks of degree
. More precisely, let
so that
and
Then we observe that
where the left hand matrix has degree
,
where
has degree
,
and where the right hand matrix has degree
;
see also Figure
1
.
Before we discuss further optimizations that are specific to the binary FFT model, let us first consider a general multiplication scheme (that satisfies the assumptions from section 2.2), and analyze the complexity of Algorithm 1 with the middle product optimization.
Moreover, for a multiplication with for
some
, the cost is
Proof. Recall that we assumed
to be a power of two. Then the running time of the algorithm satisfies
the recurrence inequality
Unrolling this relation while using the assumption that
is non-decreasing yields the first bound. If
for
some
and
,
then the relation (6) yields
.
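To make the unrolling explicit (under our assumption, for the sake of illustration, that the recurrence has the shape $T(n) \leqslant 2\,T(n/2) + c\,\mathsf{M}(n) + O(n)$ for powers of two $n$), the monotonicity of $\mathsf{M}(n)/n$ gives $2^k\,\mathsf{M}(n/2^k) \leqslant \mathsf{M}(n)$, so that
$$T(n) \;\leqslant\; \sum_{k=0}^{\log_2 n - 1} 2^k \bigl(c\,\mathsf{M}(n/2^k) + O(n/2^k)\bigr) + n\,T(1) \;\leqslant\; c\,\mathsf{M}(n)\,\log_2 n + O(n \log n).$$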
In order to efficiently implement Algorithm 1 with the
optimizations from the previous subsection in the binary FFT model, let
us again assume that is a power of two. We
essentially have to compute a
matrix middle
product in step 4 and a
matrix
product in step 6. We will do all these computations using
DFTs of length
.
Let us first consider the middle product (4). We have in (4) and the right hand side matrix
has degree
. Consequently, we
may apply (3) and compute the middle product using FFT
multiplication:
We also recall that the individual term can be
recovered separately using (5).
As to the matrix product ,
its degree is
, which is just
one too large to use FFT multiplication directly. Nevertheless, FFT
multiplication still allows us to compute
modulo
. Then we may simply recover
by computing the leading coefficient
separately.
Now the FFT model also allows for “FFT caching” when doing
the recursive calls. In addition to ,
we return its DFT transform at the end of the algorithm. When computing
the DFT transforms of
and
at length
, this means that
we already know their DFT transforms at length
, and thereby save half of the work.
Summarizing, this leads to the following algorithm:
Proof. Since the algorithm is simply an adaptation of Algorithm 1 to the binary FFT model, it is correct by Proposition 5. Let us analyze the costs of the various steps without the recursive calls.
The cost of the DFTs in step 3 is bounded by .
The cost of step 4 is bounded by .
The cost of the DFTs in step 5 is again bounded by
.
The cost of step 6 is bounded by .
The total cost of the top-level of the algorithm is therefore bounded by
. Consequently, the cost of
our algorithm satisfies the recurrence inequality
Unrolling this relation while using the assumption that
is non-decreasing yields the desired complexity bound.
Let us now generalize Algorithm 2 to the case when $n$ is not necessarily a power of two.
Algorithm 3
Proof. The correctness is proved in the same way
as for Algorithm 1. For some universal constant , the cost
of the algorithm satisfies the recurrence relation
Writing with
and
, it follows that
where we used our assumption that is
non-decreasing.
Proof. Modulo multiplication of
and
with
,
we may assume without loss of generality that
.
Remark
Algorithms 1 and 2 generalize to the abnormal case, modulo several technical adjustments. In this section we describe how to do this.
Let us first show how to adapt Algorithm 1. Lemma 4 now becomes:
Proof.
We have
for
,
whence the relation
shows that the coefficient
of degree
in
only depends on coefficients
and
of
and
with
.
We next proceed in a similar way as in the proof of Lemma
4
.
Proof. If ,
then the result is obvious. If
and
, then the result follows from Lemma 11.
Assume from now on that
and
. Then Lemma 11 implies
, whence
and
.
Let be largest with
. If
in step 6,
then
, so
and our algorithm returns the correct answer. Assume from now on that
.
We call the degeneracy after
steps. Let
still be largest with
. Then we have
and
. In
particular, we see that
,
, and
. Furthermore,
implies that
, so
. Therefore, we computed the
leading terms of
as part of
. Now
,
so we only need the
leading coefficients of
and
in order to compute
. This proves our claim that
.
After step 7, we thus have ,
, and
, where
.
Moreover,
and
.
In particular, we may indeed retrieve the new value of
from
and the old values of
and
at the end of step 7, since
. Furthermore,
, which allows to apply Lemma 11,
and obtain
. We conclude that
Remark. It is interesting to note that the above correctness proof actually works for any admissible choice of the splitting parameter in step 3, not just the particular one made there. We will use this property for our FFT version in section 5.3 below, where we will take this parameter to be a power of two.
Contrary to what we did in section 4.2, we do not require $n$ to be a power of two in this subsection. In fact, it is possible to efficiently implement step 5 in general, using middle products. This time, we break up our input and output polynomials as follows:
Then we have (see Figure 1):
As before, the coefficient is computed
separately using
Remark with
.
Let be as in section 4.2. For the
multiplication method that we selected, assume that the middle product
(8) and the final product
can each
be computed in time
. Then we
have the following generalization of Proposition 6:
be as in Proposition 6.
Then Algorithm 4 with the middle product optimization
runs in time at most
Moreover, for a multiplication with for
some
, the cost is
Proof. The cost of steps 5 and 9 is bounded by ,
according to the assumption just above this proposition. In step 7, the computation of
can be done in
time
with
,
using Newton's method for fast power series inversion; see,
e.g. [3, section 6]. The update of
amounts to two products of degrees
by
, which can certainly be
computed in time
. The update
of
can be computed efficiently using a middle
product that takes
additional operations.
Altogether, the cost
of the algorithm satisfies
for some constant . Note also
that
and that
is a non-decreasing function. Let and
be such that
for all . Let us show by
induction on
that (10) holds for
all
as well. Indeed,
This completes the proof of the first bound. We skip the proof of the
second one, which is based on similar arguments.
Remark. If, on the contrary, some of the quotients have exceptionally large degrees, then many of the recursive calls will terminate early at step 6. This will actually make the constants decrease instead of increase. The only problematic case therefore seems to be when many of the quotients have moderately large degrees; it would be interesting to know more about the precise worst-case scenario.
One important feature of Algorithm 2 is that we use the same length for all our DFTs. If $n$ is a power of two, then we may preserve this property in
the abnormal case. Indeed, the final product
causes no problem since its degree is still
. As to the middle product (8), we now
have
and the degree of the right hand side is
still
. This allows us to
apply (3), which yields
However, there is no reason why the corresponding length should be a power of two for the second recursive call. In order to remedy this problem, we introduce a new parameter that we assume to be a power of two and that we will use for the lengths of our DFTs. This requires a minor adjustment of (11):
We are now in a position to adapt Algorithm 4 to the binary FFT model: see Algorithm 5.

Algorithm 5
For the complexity analysis, it will be convenient to assume that satisfies the properties from section 2.2.
In particular, the assumption that
is
non-decreasing implies that
for all
. Conversely, in the binary FFT model, it will
be convenient to also assume that
for all
,
,
and some fixed constant
.
Proof. The correctness is proved in a similar way as the correctness of Proposition 12, while using Remarks 13 and 14.
The total cost of steps 5, 6, 9,
and 10 is , as
in the proof of Proposition 7. As to the update step 8, the computation of
requires
operations. The computation of
and
costs
,
whereas the multiplication
takes linear time.
Now let and
before the
updates, so that
with
and
. Since
, the lowest
coefficients of
do not matter in step 9.
During the update, this means that we essentially need to compute
. Now, using FFT multiplication,
the computation of
takes one direct and one inverse DFT of length
of total cost
, since we
already know
. The remaining
coefficients
can be computed in time
.
If , then
and the above analysis shows that the time complexity
of the algorithm satisfies
for suitable constants and
. If
,
then the FFT doubling in step 3 can be done in time
, so
by increasing if necessary. In the bound for
when
,
the term
pollutes the complexity analysis. Our
next objective is to reduce to the case when the sum
is replaced by
.
We start with the definition of an upper bound
for
as follows. For
, we take
.
For
with
,
we define
For with
,
we take
Using an easy induction, we note that is
increasing in
for fixed
. If
and
, then there exist
and
with
such that
Given with
,
it follows that
More generally, for any , we
claim that
where is the constant from before the statement
of this theorem.
We prove our claim by induction on the smallest
with
. We already dealt with
the case when
, so assume
that
. If
, then (13) and the induction
hypothesis with
in the role of
yield
In particular,
If , then we have shown above
(with
in the role of
) that
whence
as claimed.
Now consider with
and
let
and
be such that
. Then our claim implies
Plugging this into (14), while setting , we obtain
for all and
.
Unrolling this inequality for
,
we deduce that there exists a constant
with
for all . For general
with
and
, combining this bound with (13)
and (15) yields
Under the assumption that ,
we have
, whence
.
M. Ben-Or and P. Tiwari. A deterministic algorithm for sparse multivariate polynomial interpolation. In Proc. ACM STOC '88, pages 301–309. New York, NY, USA, 1988.
E. R. Berlekamp. Algebraic coding theory. McGraw-Hill, 1968.
D. J. Bernstein. Fast multiplication and its applications, pages 325–384. Mathematical Sciences Research Institute Publications. Cambridge University Press, United Kingdom, 2008.
D. J. Bernstein and B.-Y. Yang. Fast constant-time gcd computation and modular inversion. IACR Trans. Cryptogr. Hardw. Embed. Syst., 3:340–398, 2019.
R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun. Fast solution of Toeplitz systems of equations and computation of Padé approximants. J. Algorithms, 1(3):259–295, 1980.
D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
D. G. Cantor and H. Zassenhaus. A new algorithm for factoring polynomials over finite fields. Math. Comp., 36(154):587–592, 1981.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
J. L. Dornstetter. On the equivalence between Berlekamp's and Euclid's algorithms. IEEE Transactions on Information Theory, 33:428–431, 1987.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.
B. Grenet, J. van der Hoeven, and G. Lecerf. Randomized root finding over finite fields using tangent Graeffe transforms. In Proc. ISSAC '15, pages 197–204. New York, NY, USA, 2015. ACM.
G. Hanrot, M. Quercia, and P. Zimmermann. The middle product algorithm I. speeding up the division and square root of power series. AAECC, 14:415–438, 2004.
D. Harvey. Faster algorithms for the square root and reciprocal of power series. Math. Comp., 80:387–394, 2011.
D. Harvey and J. van der Hoeven. Faster polynomial multiplication over finite fields using cyclotomic coefficient rings. J. of Complexity, 54, 2019. Article ID 101404, 18 pages.
D. Harvey and J. van der Hoeven. Polynomial multiplication over finite fields in time $O(n \log n)$. Technical Report, HAL, 2019. http://hal.archives-ouvertes.fr/hal-02070816.
J. van der Hoeven. The truncated Fourier transform and applications. In Proc. ISSAC 2004, pages 290–296. Univ. of Cantabria, Santander, Spain, July 2004.
J. van der Hoeven and M. Monagan. Computing one billion roots using the tangent Graeffe method. ACM SIGSAM Commun. Comput. Algebra, 54(3):65–85, 2021.
J. van der Hoeven et al. GNU TeXmacs. https://www.texmacs.org, 1998.
A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
D. E. Knuth. The analysis of algorithms. In Actes du congrès international des mathématiciens 1970, volume 3, pages 269–274. Gauthier-Villars, 1971.
G. Lecerf. On the complexity of the Lickteig–Roy subresultant algorithm. J. Symbolic Comput., 2018. https://doi.org/10.1016/j.jsc.2018.04.017.
D. H. Lehmer. Euclid's algorithm for large numbers. Amer. Math. Monthly, pages 227–233, 1938.
D. Lichtblau. Half-GCD and fast rational recovery. In Proc. ISSAC '05, pages 231–236. 2005.
J. Massey. Shift-register synthesis and BCH decoding. IEEE Transactions on Information Theory, 15:122–127, 1969.
R. Moenck. Fast computation of GCDs. In Proc. of the 5th ACM Annual Symposium on Theory of Computing, pages 142–171. New York, 1973. ACM Press.
N. Möller. On Schönhage's algorithm and subquadratic integer gcd computation. Math. Comp., 77(261):589–607, 2008.
F. Morain. Implementing the Thull–Yap algorithm for computing Euclidean remainder sequences. In Proc. ISSAC '22, pages 197–205. 2022.
R. Prony. Essai expérimental et analytique sur les lois de la dilatabilité des fluides élastiques et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures. J. de l'École Polytechnique Floréal et Plairial, an III, 1:24–76, 1795. Cahier 22.
A. Schönhage. Schnelle Berechnung von Kettenbruchentwicklungen. Acta Informatica, 1(2):139–144, 1971.
A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Informatica, 7:395–398, 1977.
V. Shoup. NTL: a library for doing number theory. 1996. www.shoup.net/ntl.
D. Stehlé and P. Zimmermann. A binary recursive gcd algorithm. In D. Buell, editor, Algorithmic Number Theory, pages 411–425. Springer Berlin Heidelberg, 2004.
S. Stevin. L'arithmétique. Imprimerie de Christophle Plantin, 1585.
V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:352–356, 1969.
V. Strassen. The computational complexity of continued fractions. In Proc. of the Fourth ACM Symp. on Symbolic and Algebraic Computation, pages 51–67. 1981.
K. Thull and C. K. Yap. A unified approach to HGCD algorithms for polynomials and integers. https://cs.nyu.edu/yap/papers/SYNOP.htm#hgcd.
A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.