Faster relaxed multiplication

In previous work, we have introduced several fast algorithms for relaxed power series multiplication (also known under the name on-line multiplication) up to a given order

. The fastest currently known algorithm works over an effective base field

with sufficiently many

-th roots of unity and has algebraic time complexity

. In this paper, we will generalize this algorithm to the cases when

is replaced by an effective ring of positive characteristic or by an effective ring of characteristic zero, which is also torsion-free as a

-module and comes with an additional algorithm for partial division by integers. In particular, we may take

to be any effective field. We will also present an asymptotically faster algorithm for relaxed multiplication of

-adic numbers.

Keywords: power series, multiplication, on-line algorithm, FFT, computer algebra

1Introduction

1.1Relaxed resolution of recursive equations

Let

be an effective (possibly non-commutative) ring; i.e., we assume data structures for representing the elements of

and algorithms for performing the ring operations

and

. The aim of algebraic complexity theory is to study the cost of basic or more complex algebraic operations over

(such as the multiplication of polynomials or matrices) in terms of the number of operations in

The algebraic complexity usually does not coincide with the bit complexity, which also takes into account the potential growth of the actual coefficients in

. Nevertheless, understanding the algebraic complexity usually constitutes a first useful step towards understanding the bit complexity. Of course, in the special case when

is a finite field, both complexities coincide up to a constant factor.

One of the most central operations is polynomial multiplication. We will denote by

the number of operations required to multiply two polynomials of degrees

. If

admits primitive

-th roots of unity for all

, then we have

using FFT multiplication, which is based on the fast Fourier transform [12]. In general, it has been shown [28, 10] that

. The complexities of most other operations (division, Taylor shift, extended g.c.d., multipoint evaluation, interpolation, etc.) can be expressed in terms of

. Often, the cost of such other operations is simply

, where

stands for

; see [2, 6, 15] for some classical results along these lines.

The complexity of polynomial multiplication is fundamental for studying the cost of operations on formal power series in

up to a given truncation order

. Clearly, it is possible to perform the multiplication up to order

in time

: it suffices to multiply the truncated power series at order

and truncate the result. Using Newton's method, and assuming that

, it is also possible to compute

, etc. up to order

in time

. More generally, it has been shown in [9, 17, 29, 23] that the power series solutions of algebraic differential equations with coefficients in

can be computed up to order

in time

. However, in this case, the “

” hides a non-trivial constant factor which depends on the expression size of the equation that one wants to solve.

The relaxed approach for computations with formal power series makes it possible to solve equations in quasi-optimal time with respect to the sparse expression size of the equations. The idea is to consider power series

as streams of coefficients

and to require that all operations are performed “without delay” on these streams. For instance, for a multiplication

of two power series

, we require that

is computed as soon as

are known. Any algorithm which has this property will be called a relaxed or on-line algorithm for multiplication.

Given a relaxed algorithm for multiplication, it is possible to let the later coefficients

of the input depend on the known coefficients

of the output. For instance, given a power series

with

, we may compute

using the formula

to be recursive, if

only depends on

. Replacing

, we notice that the same terminology applies to systems of

equations. In the case of an implicit equation, special rewriting techniques can be implied in order to transform the input equation into a recursive equation [24, 22, 5].

Let

denote the cost of performing a relaxed multiplication up to order

. If

is an expression which involves

multiplications and

other “linear time” operations (additions, integrations, etc.), then it follows that (2) can be solved up to order

in time

. If we had

, then this would yield an optimal algorithm for solving (2) in the sense that the computation of the solution would essentially require the same time as its verification.

1.2Known algorithms for relaxed multiplication

is clearly relaxed. Unfortunately, FFT multiplication is not relaxed, since

are computed simultaneously as a function of

, in this case.

In [16, 17] it was remarked that Karatsuba's

algorithm [25] for multiplying polynomials can be rewritten in a relaxed manner. Karatsuba multiplication and its relaxed version thus require exactly the same number of operations. In [16, 17], an additional fast relaxed algorithm was presented with time complexity

We were recently made aware of the fact that a similar algorithm was first published in [13]. However, this early paper was presented in a different context of on-line (relaxed) multiplication of integers (instead of power series), and without the application to the resolution of recursive equations (which is quite crucial from our perspective).

An interesting question remained: can the bound (3) be lowered further, be it by a constant factor? In [18], it was first noticed that an approximate factor of two can be gained if one of the multiplicands is known beforehand. For instance, if we want to compute

for a known series

with

, then the coefficients of

are already known in the product

in (1), so only one of the inputs depends on the output. An algorithm for the computation of

is said to be semi-relaxed, if

is written to the output as soon as

are known, but all coefficients of

are known beforehand. We will denote by

the complexity of semi-relaxed multiplication. We recall from [21] (see also Section 3) that relaxed multiplication reduces to semi-relaxed multiplication:

It has recently been pointed out [26] that the factor

that was gained in the case of semi-relaxed multiplication can also be gained in the general case.

The first reduction of (3) by a non-constant factor was published in [21], and uses the technique of FFT blocking (which has also been used for the multiplication of multivariate polynomials and power series in [17, Section 6.3], and for speeding up Newton iterations in [3, 23]). Under the assumption that

admits primitive

-th roots of unity for all

(or at least for all

with

), we showed that

The function

has slower growth than any strictly positive power of

. It is convenient to write

whenever

for all

. In particular, it follows that

In Section 3, we will recall the main ideas from [21] which lead to the complexity bound (4).

In fact, in Section 4, we will see that the complexity bound from [21] can be further reduced to

Since such complexity bounds involve rather complex expressions, it will be convenient to use the following abbreviations:

1.3Improved complexity bounds

We recall that the characteristic of a ring

is the integer

such that the canonical ring homomorphism

has kernel

. If

is torsion-free as a

-module (i.e.

for any

and

), then we will say that

admits an effective partial division by integers if there exists an algorithm which takes

and

on input and which returns the unique

with

on output. We will count a unit cost for such divisions. The main result of this paper is:

We notice that the theorem holds in particular if

is an effective field. In Section 8, we will also consider the relaxed multiplication of

-adic numbers, with

and

. If we denote by

the bit complexity of multiplying two

-bit integers, then [13] essentially provided an algorithm of bit complexity

for the relaxed multiplication of

-adic numbers at order

. Various algorithms and benchmarks for more general

were presented in [4]. It is also well known [33, 11, 28, 14] that

. Let

denote the bit complexity of the relaxed multiplication of two

-adic numbers modulo

. In Section 8, we will prove the following new result:

For comparison, the best previously known bound for relaxed multiplication in

was

We thus improved the previous bound by a factor

at least, up to sublogarithmic terms.

The main idea which allows for the present generalizations is quite straightforward. In our original algorithm from [21], the presence of sufficiently many primitive

-th roots of unity in

gives rise to a quasi-optimal evaluation-interpolation strategy for the multiplication of polynomials. More precisely, given two polynomials of degrees

, their FFT-multiplication only requires

evaluation and interpolation points, and both the evaluation and interpolation steps can be performed efficiently, using only

operations. Now it has recently been shown [8] that quasi-optimal evaluation-interpolation strategies still exist if we evaluate and interpolate at points in geometric progressions instead of roots of unity. This result is the key to our new complexity bounds, although further technical details have to be dealt with in order to make things work for various types of effective rings

. We also notice that the main novelty of [8] concerns the interpolation step. Fast evaluation at geometric progressions was possible before using the so called chirp transform [27, 7]. For effective rings

of positive characteristic, this would actually have been sufficient for the proving the bound (5).

Our paper is structured as follows. Since the algorithms of [8] were presented in the case when

is an effective field, Section 2 starts with their generalization to more general effective rings

. These generalizations are purely formal and contain no essentially new ideas. In Section 3, we give a short survey of the algorithm from [21], but we recommend reading the original paper for full technical details. In Section 4, we sharpen the complexity analysis for the algorithm from [21]. In Section 5, we prove Theorem 1 in the case when

has characteristic zero. In Section 6, we turn our attention to the case when

has prime characteristic

. If the characteristic is sufficiently large, then we may find sufficiently large geometric progressions in

in order to generalize the results from Section 5. Otherwise, we have to work over

for some sufficiently large

. In Section 7, we complete the proof of Theorem 1; the case when the characteristic is a prime power is a refinement of the result from Section 6. The remaining case is done via Chinese remaindering. In Section 8, we will prove Theorem 2.

Acknowledgments. I am grateful to the referees for their detailed comments and suggestions. A special thanks goes to the referee who made a crucial suggestion for the improved complexity analysis in Section 4.

2Multipoint evaluation and interpolation

Let

be a (commutative) effective integral domain and let

be its quotient field. Assume that

has characteristic zero. Let

be an effective torsion-free

-module and

. Elements of

and

are fractions

with

(resp.

) and

, and the operations

on such fractions are as usual:

For

, we also have

. It follows that

is an effetive field and

an effective

-vector space. Moreover, all field operations in

(and all vector space operations in

) can be performed using only

operations in

(resp.

We will say that

admits an effective partial division, if for every

and

, we can compute the unique

with

. In that case, and we will count any division of the above kind as one operation in

. Similarly, given a fixed

, we say that

admits an effective partial division by

, we can compute the unique

with

. Given

, we define

Given

and

, we will denote by

the number of operations in

and

which are needed in order to compute the product

Lemma 3. Let be an effective integral domain and an effective torsion-free -module. There exists a constant , such that the following holds: for any , and such that are pairwise distinct, and such that admits effective divisions by and , we have:

Proof. In the case when

is a field and

, this result was first proven in [8]. More precisely, the conversions can be done using the algorithms NewtonEvalGeom, NewtonInterpGeom, NewtonToMonomialGeom and MonomialToNewtonGeom in that paper. Examining these algorithms, we observe that general elements in

are only multiplied with elements in

and divided by elements of the set

. In particular, the algorithms can still be applied in the more general case when

is a field and

a vector space.

is only an effective integral domain and

an effective torsion-free

-module with an effective partial division, then we define the effective field

and the effective vector space

as above, and we may still apply the generalized algorithms for multipoint evaluation and interpolation in

. In particular, both multipoint evaluation and interpolation can still be done using

operations in

and

, whence

operations in

and

. If we know that the end-results of these algorithms are really in the subspace

(or in the submodule

), then we use the partial division in

to replace their representations in

(or

) by representations in

(or

3Survey of blockwise relaxed multiplication

The fast relaxed algorithms from [21] are all based on two main changes of representation: “blocking” and the fast Fourier transform. Let us briefly recall these transformations and how to use them for the design of fast algorithms for relaxed multiplication.

Blocking and unblocking.

Discrete Fourier transforms.

and it is classical [12] that both

and

can be computed using

operations in

. The operations

and

extend naturally to

via

where

is a pointwise product in

. The first

coefficients of

can be computed using at most

operations in

Relaxed multiplication.

operations in

. For a given expansion order

, one may take

, and use the above formula in a recursive manner. This yields [21, Theorem 11]

Remark 4. Since the block size

is chosen as a function of

, the above method really describes a relaxed algorithm for computing the product up to an order

which is specified in advance. In fact, such an algorithm automatically yields a genuine relaxed algorithm with the same complexity (up to a constant factor), by doubling the order

each time when needed.

Reduction to semi-relaxed multiplication.

shows that a relaxed product of two polynomials

and

of degrees

reduces to a relaxed product

of half the size, two semi-relaxed products

, and one non-relaxed product

. Under the assumptions that

and

are increasing, a routine calculation thus yields

Multiple block sizes.

4Improved complexity analysis

One of the referees suggested to take

instead of

in (8). As a matter of fact, it is even better to take

. In this section, we will show that this leads to the bound

which further improves on (9). We will prove a slightly more general result, which is useful for the analysis of algorithms which satisfy complexity bounds similar to (8).

and let

and

be as in Lemma 5. Without loss of generality, we may pick

sufficiently large such that

and such that (12) holds for

. Let

be such that

and denote

and

. For certain

, we have

and

. Now (12) implies

Let

and

. Let us prove by induction that

for all

. This is clear for

. Assume now that

and

for all

. Then, with the above notations,

implies

and

, whence

5Relaxed multiplication in characteristic zero

Let us now consider the less favourable case when

is an effective ring which does not necessarily contain primitive

-th roots of unity for arbitrarily high

. In this section, we will first consider the case when

is torsion-free as a

-module and also admits a partial algorithm for division by integers.

Given a block size

and

(say

), we will replace the discrete Fourier transform

at a

-th primitive root of unity by multipoint evaluation at

. More precisely, we define

and the inverse transform

. By Lemma 3, these transforms can both be computed using

operations in

. In a similar way as for

and

, we extend

and

to power series in

Theorem 7. Let both be an effective ring and an effective torsion-free -module with an effective partial division by elements in . Then

Proof. It suffices to prove the complexity bound for semi-relaxed multiplication. Instead of (7), we now compute

using

6Relaxed multiplication in prime characteristic

Let

now be an effective ring of prime characteristic

. For expansion orders

, the ring

does not necessarily contain

distinct points in geometric progression. Therefore, in order to apply Lemma 3, we will first replace

by a suitable extension, in which we can find sufficiently large geometric progressions.

Given

, let

be even with

. Let

be such that the finite field

is isomorphic to

. Then the ring

The ring

is an effective ring and one addition or subtraction in

corresponds to

additions or subtractions in

. Similarly, one multiplication in

can be done using

operations in

In order to multiply two series

up to order

, the idea is now to rewrite

and

as series in

with

. If we want to compute the relaxed product, then we also have to treat the first

coefficients apart, as we did before for the blocking strategy. More precisely, we will compute the semi-relaxed product

using the formula

Since

contains a copy of

, it also contains at least

points in geometric progression. For the multiplication up to order

of two series with coefficients in

, we may thus use the blocking strategy combined with multipoint evaluation and interpolation.

Proof. With the notations from above, we may find a primitive

-th root of unity

. We may thus use formula (13) for the semi-relaxed multiplication of two series in

up to order

. In a similar way as in the proof of Theorem 7, we thus get

Remark 10. In our complexity analysis, we have not taken into account the computation of the polynomial

with

. Using a randomized algorithm, such a polynomial can be computed in time

; see [15, Corollary 14.44]. If

, then this is really a precomputation of negligible cost

If we insist on computing

in a deterministic way, then one may use [30, Theorem 3.2], which provides us with an algorithm of time complexity

Similarly, there both exist randomized and deterministic algorithms [32, 31] for the efficient computation of primitive

-th roots of unity in

. In particular, thanks to [31, Theorem 1] such a primitive root of unity can always be computed in time

, when using a naive algorithm for factoring

7Relaxed multiplication in positive characteristic

Let us now show that the technique from the previous section actually extends to the case when

is an arbitrary effective ring of positive characteristic. We first show that the algorithm still applies when the characteristic of

is a prime power. We then conclude by showing how to apply Chinese remaindering in our setting.

Proof. Taking

, let

of degree

be as in the previous section and pick a monic polynomial

of degree

such that the reduction

modulo

yields

. Then we get a natural commutative diagram

Now let

be an element in

of order

. Then any lift

with

has order at least

. Moreover,

and

are all invertible. Consequently,

and

do not lie in

, whence they are invertible as well. It follows that we may still apply multipoint evaluation and interpolation in

at the sequence

, whence Theorem 8 generalizes to the present case.

Remark 12. For a fixed prime number

, we notice that the complexity bound is uniform in the following sense: there exists a constant

such that for all effective rings of characteristic

with

, we have

. Indeed, the choice of

only depends on

and

, and any operation in

in the case

corresponds to exactly one lifted operation in

in the general case.

Remark 13. Similarly as in Remark 9, the hypothesis

leads to the improved bound

Proof. We will prove the theorem by induction on the number of prime divisors of

. If

is a prime power, then we are done. So assume that

, where

and

are relatively prime, and let

be such that

These rings are effective, when representing their elements by elements of

and transporting the operations from

. Of course, the representation of an element

(or

) is not unique, since we may replace it by

for any

(or

). But this is not a problem, since our definition of effective ring did not require unique representability or the existence of an equality test.

Now let

and let

be their projections in

, for

. Consider the relaxed products

, for

. These products are represented by relaxed series

via

, for

. By the induction hypotheses, we may compute

and

at order

using

operations in

. The linear combination

can still be expanded up to order

with the same complexity. We claim that

. Indeed,

Remark 15. A uniform bound interms of

can be given along similar lines as in Remark 12. This time, such a bound depends linearly on the number of prime factors of

8Relaxed multiplication of

-adic numbers

Let

be an integer, not necessarily a prime number, and denote

. We will regard

-adic numbers

as series

with

, and such that the basic ring operations

and

require an additional carry treatment.

In order to multiply two relaxed

-adic numbers

, we may rewrite them as series

, multiply these series

, and recover the product

from the result. Of course, the coefficients of

may exceed

, so some relaxed carry handling is required in order to recover

from

. We refer to [4, Section 2.7] for details. In particular, we prove there that

can be computed up to order

using

ring operations in

of bit size

Given

, let

, and consider two power series

. We will denote by

(resp.

) the bit complexity of multiplying

and

up to order

using a relaxed (resp. semi-relaxed) algorithm.

Proof. Let

be a prime number with

. Such a prime

can be computed in time

using the polynomial time polynomial primality test from [1], together with the Bertrand-Chebychev theorem.

Let

and consider

such that

. Let

be the reductions of

modulo

. Then

may be reconstructed up to order

from the product

. We thus get

For the above strategy to be efficient, it is important that

. This can be achieved by combining it with the technique of

-adic blocking. More precisely, given a

-adic block size

, then any

-adic number in

can naturally be considered as a

-adic number in

, and vice versa. Assuming that numbers in

are written in base

, the conversion is trivial if

is a power of two. Otherwise, the conversion involves base conversions and we refer to [4, Section 4] for more details. In particular, the conversions in both directions up to order

can be done in time

Let

(resp.

) the complexity of relaxed (resp. semi-relaxed) multiplication in

up to order

Using the strategy of

-adic blocking, a semi-relaxed product in

may then be reduced to one semi-relaxed product in

and one relaxed multiplication with an integer in

. In other words,

where

stands for the cost of semi-relaxed multiplication of two

-adic numbers in

up to order

. By [4, Proposition 4], we have

9Final remarks

For the moment, we have not implemented any of the new algorithms in practice. Nevertheless, our old implementation of the algorithm from [21] allowed us to gain some insight on the practical usefulness of blockwise relaxed multiplication. Let us briefly discuss the potential impact of the new results for practical purposes.

Characteristic zero.

In the case of integer coefficients, it is best to re-encode the integers as polynomials in

for a prime number which fits into a machine word and such that

admits many

-th roots of unity (it is also possible to take several primes

and use Chinese remaindering). After that, one may again use the old algorithm from [21]. Also, integer coefficients usually grow in size with

, so one really should see the power series as a bivariate power series in

with a triangular support. One may then want to use TFT-style multiplication [19, 20] in order to gain another constant factor.

Finite fields.

Semi-relaxed multiplication.

Cache friendliness.

Bilinear maps.

Skew series.

denotes the cost of multiplying two polynomials

and

with

, then the classical fast relaxed multiplication algorithm from [17, 13] generalizes and still admits the time complexity

. However, the blockwise algorithm from this paper does not generalize to this setting, at least not in a straightforward way.

Bibliography

[1]: M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Annals of Mathematics, 160(2):781–793, 2004.
[2]: A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts, 1974.
[3]: D. Bernstein. Removing redundancy in high precision Newton iteration. Available from http://cr.yp.to/fastnewton.html, 2000.
[4]: J. Berthomieu, J. van der Hoeven, and G. Lecerf. Relaxed algorithms for -adic numbers. Journal de Théorie des Nombres de Bordeaux, 23(3):541–577, 2011.
[5]: J. Berthomieu and R. Lebreton. Relaxed -adic hensel lifting for algebraic systems. In J. van der Hoeven and M. van Hoeij, editors, Proc. ISSAC '12, pages 59–66, Grenoble, France, July 2012.
[6]: D. Bini and V.Y. Pan. Polynomial and matrix computations. Vol. 1. Birkhäuser Boston Inc., Boston, MA, 1994. Fundamental algorithms.
[7]: Leo I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
[8]: A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. Journal of Complexity, 21(4):420–446, August 2005. Festschrift for the 70th Birthday of Arnold Schönhage.
[9]: R.P. Brent and H.T. Kung. Fast algorithms for manipulating formal power series. Journal of the ACM, 25:581–595, 1978.
[10]: D.G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
[11]: S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
[12]: J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
[13]: M. J. Fischer and L. J. Stockmeyer. Fast on-line integer multiplication. Proc. 5th ACM Symposium on Theory of Computing, 9:67–72, 1974.
[14]: M. Fürer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of Computing (STOC 2007), pages 57–66, San Diego, California, 2007.
[15]: J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, first edition, 1999.
[16]: J. van der Hoeven. Lazy multiplication of formal power series. In W. W. Küchlin, editor, Proc. ISSAC '97, pages 17–20, Maui, Hawaii, July 1997.
[17]: J. van der Hoeven. Relax, but don't be too lazy. JSC, 34:479–542, 2002.
[18]: J. van der Hoeven. Relaxed multiplication using the middle product. In Manuel Bronstein, editor, Proc. ISSAC '03, pages 143–147, Philadelphia, USA, August 2003.
[19]: J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290–296, Univ. of Cantabria, Santander, Spain, July 4–7 2004.
[20]: J. van der Hoeven. Notes on the Truncated Fourier Transform. Technical Report 2005-5, Université Paris-Sud, Orsay, France, 2005.
[21]: J. van der Hoeven. New algorithms for relaxed multiplication. JSC, 42(8):792–802, 2007.
[22]: J. van der Hoeven. Relaxed resolution of implicit equations. Technical report, HAL, 2009. http://hal.archives-ouvertes.fr/hal-00441977.
[23]: J. van der Hoeven. Newton's method and FFT trading. JSC, 45(8):857–878, 2010.
[24]: J. van der Hoeven. From implicit to recursive equations. Technical report, HAL, 2011. http://hal.archives-ouvertes.fr/hal-00583125.
[25]: A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
[26]: R. Lebreton and É. Schost. A simple and fast online power series multiplication and its analysis. Technical report, LIRMM, 2013. http://hal-lirmm.ccsd.cnrs.fr/lirmm-00867279.
[27]: L.R. Rabiner, R.W. Schafer, and C.M. Rader. The chirp z-transform algorithm and its application. Bell System Tech. J., 48:1249–1292, 1969.
[28]: A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[29]: Alexandre Sedoglavic. Méthodes seminumériques en algèbre différentielle ; applications à l'étude des propriétés structurelles de systèmes différentiels algébriques en automatique. PhD thesis, École polytechnique, 2001.
[30]: V. Shoup. New algorithms for finding irreducible polynomials over finite fields. Math. Comp., 54:535–447, 1990.
[31]: V. Shoup. Searching for primitive roots in finite fields. Math. Comp., 58:369–380, 1992.
[32]: I. Shparlinski. On primitive elements in finite fields and on elliptic curves. Mat. Sbornik, 181(9):1196–1206, 1990.
[33]: A.L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.

			(7)


			(9)