In 2008, Kedlaya and Umans designed the first multivariate multi-point evaluation algorithm over finite fields with an asymptotic complexity that can be made arbitrarily close to linear. However, it remains a major challenge to make their algorithm efficient for practical input sizes. In this paper, we revisit and improve their algorithm, while keeping this ultimate goal in mind. In addition we sharpen the known complexity bounds for modular composition of univariate polynomials over finite fields.
Let be a commutative ring with unity, and let be a multivariate polynomial. The multi-point evaluation problem consists in evaluating at several given points in . Let be polynomials in of degrees and let be a monic polynomial in of degree . The modular composition problem consists in computing modulo . This is equivalent to the computation of the remainder of the Euclidean division of by . It turns out that these two problems are related and that they form important building blocks in computer algebra. Theoretically speaking, Kedlaya and Umans have given efficient solutions to both problems when is a finite ring of the form where is a monic polynomial [34]. The design of practically efficient algorithms remains an important challenge. The purpose of this paper is to revisit the algorithms by Kedlaya and Umans in detail, to sharpen their theoretical complexity bounds, and get some insight into the required data size for which this approach outperforms asymptotically slower algorithms.
Let denote a complexity function that bounds the number of operations in required to multiply two polynomials of degree in . We will often use the soft-Oh notation: means that ; see [13, chapter 25, section 7] for technical details. The least integer larger than or equal to is written . The largest integer smaller than or equal to is written . The -module of polynomials of degree is denoted by .
In the univariate case when , the evaluation of at points in can be achieved with operations in . We refer the reader to [13, chapter 10] for the description of the well known algorithm based on remainder trees. Algorithms with the smallest constant hidden in the “” may be found in [6]. By allowing precomputations that only depend on the set of points, this evaluation complexity even drops to as shown in [23]. For specific sets of points, such as geometric progressions or TFT points, multi-point evaluation requires only operations in ; see [4, 7, 21].
The univariate situation does not extend to several variables, unless the set of evaluation points has good properties. For instance if has the form with , then fast univariate evaluations may be applied coordinate by coordinate. Fast algorithms also exist for suitable initial segments of such Cartesian products [29]. Other specific families of sets of points are used for fast evaluation and interpolation of multivariate polynomials in sparse representation; see [1, 24] for some recent results.
In the bivariate case when , a smart use of the univariate case leads to a cost , where bounds the partial degrees of [36, Theorem 3]. In 2004, Nüsken and Ziegler improved this bound to [36, Result 4]—here the constant is such that a matrix over may be multiplied with another rectangular matrix with operations in . When is a field the best currently known bound is due to Huang and Pan [30, Theorem 10.1].
In 2008, Kedlaya and Umans achieved a major breakthrough for the general case [33]. In [34, Corollary 4.3] they showed the following statement (simplified here for conciseness): let be a fixed rational value, given in with partial degrees in any at most , and evaluation points in , then can be computed with bit operations, provided that and where is a constant independent of . This result was stated for random access memory machines. In fact, some of the underlying arguments (such as the use of lookup tables) need to be adapted to make them work properly on Turing machines. This is one of our contributions in this paper. In a nutshell the Kedlaya–Umans algorithm proceeds as follows (see section 3.1):
If is “sufficiently small”, then we exhaustively evaluate at all points in , using fast univariate multi-point evaluation with respect to each coordinate.
Otherwise, the evaluation of at is reduced to the evaluation of the preimage of in at the preimages of in . Through the Chinese remaindering theorem, the latter evaluation over further reduces to several independent multi-point evaluation problems modulo “many” prime numbers that are “much smaller” than . These evaluations of mod at are handled recursively, for .
Let us first discuss the standard modular composition problem when . Let and be polynomials in of respective degrees , and , with monic. The naive modular composition algorithm takes operations in . In 1978, Brent and Kung [9] gave an algorithm with cost , which uses the baby-step giant-step technique [39]. Their algorithm even yields a sub-quadratic cost when using fast linear algebra; see [32, p. 185].
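For readers who wish to experiment, the baby-step giant-step idea may be sketched in a few lines of Python over a prime field; the helper names (pol_mul, pol_rem, pol_add, brent_kung_compose) and the quadratic arithmetic are our own choices for illustration, and the sub-quadratic variant replaces the inner evaluations by a rectangular matrix product.

    import math

    def pol_mul(a, b, p):
        """Naive product of two coefficient lists (lowest degree first) modulo p."""
        c = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                c[i + j] = (c[i + j] + ai * bj) % p
        return c

    def pol_rem(a, f, p):
        """Remainder of a modulo the monic polynomial f (deg f >= 1), coefficients mod p."""
        a, n = a[:], len(f) - 1
        for i in range(len(a) - 1, n - 1, -1):
            q = a[i]                                  # f is monic: no inversion needed
            if q:
                for j in range(n + 1):
                    a[i - n + j] = (a[i - n + j] - q * f[j]) % p
        return (a + [0] * n)[:n]

    def pol_add(a, b, p):
        """Coefficient-wise sum modulo p."""
        if len(a) < len(b):
            a, b = b, a
        return [(x + (b[i] if i < len(b) else 0)) % p for i, x in enumerate(a)]

    def brent_kung_compose(g, h, f, p):
        """Return g(h) rem f over Z/pZ by baby steps / giant steps."""
        m = math.isqrt(max(len(g) - 1, 0)) + 1
        baby = [[1]]                                  # baby steps h^0, ..., h^(m-1) rem f
        for _ in range(m - 1):
            baby.append(pol_rem(pol_mul(baby[-1], h, p), f, p))
        hm = pol_rem(pol_mul(baby[-1], h, p), f, p)   # giant step h^m rem f
        # write g = sum_j G_j(x) x^(m j) with deg G_j < m and apply Horner in h^m
        blocks = [g[i:i + m] for i in range(0, len(g), m)]
        result = [0]
        for block in reversed(blocks):
            inner = [0]                               # G_j(h) rem f, from the baby steps
            for c, hj in zip(block, baby):
                inner = pol_add(inner, [(c * x) % p for x in hj], p)
            result = pol_add(pol_rem(pol_mul(result, hm, p), f, p), inner, p)
        return pol_rem(result, f, p)

For instance, brent_kung_compose([1, 2, 3], [0, 1, 1], [1, 0, 0, 1], 7) returns [2, 6, 5], namely 1 + 2 h + 3 h^2 rem (x^3 + 1) with h = x + x^2 over Z/7Z.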
The major breakthrough for this problem is again due to Kedlaya and Umans [33, 34] in the case when is a finite field (and even more generally a finite ring of the form for any integer and monic). For any fixed real value , they have shown that the composition could be computed modulo using bit operations.
The special case of power series composition corresponds to . The best known complexity bound in the algebraic model when is a field, written for convenience, is still due to Brent and Kung: in [9], they achieved operations in , under the condition that is invertible and that the characteristic of is at least , where . The variant proposed by van der Hoeven [20, section 3.4.3] raises the condition on . For fields with small characteristic , Bernstein [3] proposed an algorithm that is softly linear in but linear in . These algorithms have been generalized to moduli of the form in [25]; it is shown therein that such a composition reduces to one power series composition at order over , plus compositions modulo , and one characteristic polynomial computation modulo . Let us finally mention that an optimized variant, in terms of the constant hidden in the “”, of the Brent–Kung algorithm has been proposed recently by Johansson in [31], and that series with integer, rational or floating point coefficients can often be composed in quasi-linear time in suitable bit complexity models, as shown by Ritzmann [40]; see also [22].
Multi-point evaluation and modular composition are instances of evaluation problems at points lying in different extensions of . The former case involves several points with coordinates in . The latter case implies one point in the extension . In the next paragraphs we summarize known conversions between evaluation problems.
When , several algorithms are known for converting evaluations at any set of points to specific sets of points. For instance evaluating at roots of unity can be done fast thanks to the seminal FFT algorithm, so we usually build fast algorithms upon FFTs. Typically fast polynomial products are reduced to FFTs over synthetic roots of unity lying in suitable extensions of by means of the Schönhage–Strassen algorithm. And since fast multi-point evaluation reduces to polynomial products, they thus reduce to FFTs. Such reductions to FFTs are omnipresent in computer algebra.
Let us still assume that . Let , let be given evaluation points in a field , and let be pairwise distinct evaluation points in . Let and let be such that for . Setting we have . So the evaluations of at reduce to evaluations and interpolations in degree at the chosen points plus one modular composition. Conversely given a modulus , one may benefit from factorizations of to compose modulo . We have studied this approach when has factors with large multiplicities in [25], when it splits into linear factors over in [26], and also when it factors over an algebraic extension of in [27].
The key idea of Nüsken and Ziegler to speed up multi-point evaluation is a reduction to modular composition; then their aforementioned complexity bound follows from a variant of the Brent–Kung algorithm. Assume . In order to evaluate at points they first compute and interpolate such that for (assuming the being pairwise distinct, which is not restrictive). Then they compute and deduce as .
Over finite fields, Kedlaya and Umans showed an equivalence between multi-point evaluation and modular composition. Using Kronecker segmentation, Theorem 3.1 from [34] reduces such a composition to multi-point evaluation for an increased number of variables. Kedlaya and Umans' reduction in the opposite direction is close to the one of Nüsken and Ziegler. Let be pairwise distinct points in . For each , they interpolate such that for . Then they compute , so that .
On machines with random access memory, arbitrary memory accesses admit a constant cost. This does not reflect the actual behavior of real computers, on which memory is organized into different levels, with efficient hardware support for copying contiguous blocks of memory from one level to another. In this paper, we opted for the standard Turing machine model with a finite number of tapes [38], which charges a “maximal penalty” for non contiguous memory accesses. This means in particular that complexity bounds established for this model are likely to hold for any more or less realistic alternative model. Our first contribution in the present paper is to show that Kedlaya and Umans' complexity bounds hold in the Turing machine model.
Our second contribution concerns sharper and more precise bit complexity bounds. For multi-point evaluation over , we achieve softly linear time in the bit size of and obtain more general explicit bounds in terms of , , the partial and total degrees of , without the assumption as in [34, Corollary 4.3]. We also put into evidence the advantage of taking much larger than the dense size of the support of . In particular, we analyze the threshold for which the average cost per evaluation point stabilizes. Our algorithm closely follows the main ideas of Kedlaya and Umans, but with two notable changes. On the one hand, using precise estimates for the first Chebyshev function, we obtain sharper bounds for the prime numbers to be used during the multi-modular stage of the algorithm; see section 3.3. On the other hand, when becomes very large, we fall back to the naive evaluation algorithm, and thereby achieve a softly linear dependence in .
Let us now turn to multi-point evaluation over an extension ring of the form , where is monic of degree . Kedlaya and Umans proposed a reduction to multi-point evaluation over , with large, based on Kronecker substitution. In section 4, we propose an alternative approach, based on univariate polynomial evaluation, interpolation, and Chinese remaindering, to directly reduce to several compositions over suitable finite prime fields.
Our detailed analysis of multi-point evaluation is used in section 6 in order to obtain refined bounds for univariate modular composition. In [34, Corollary 7.2] it is shown that univariate modular composition in degree over a finite field can be done in time . Making use of the same reduction to multi-point evaluation, the exponent in can be made more explicit: in Theorem 6.2 we prove the bound
The new complexity bounds for multi-point evaluation are also crucial for our new bit complexity bounds for multivariate modular composition and the application to polynomial system solving in [28].
Section 7 addresses the special case when is a field of small positive characteristic . We closely revisit the method proposed in [34, section 6], and again make the complexity bound more explicit. Again we quantify the number of evaluation points from which the average cost per point stabilizes, and we deduce a sharpened complexity bound for modular composition.
We consider Turing machines with sufficiently many tapes. In fact seven tapes are usually sufficient to implement all useful complexity bounds for the elementary operations on polynomials, series and matrices involved in the present paper (standard algorithms may be found in [42]). The number of symbols used by the machine is not of the utmost importance, since it only impacts complexity bounds by constant factors. In the sequel, Turing machines will always have two symbols “” and “”, as well as a few specific additional ones dedicated to data representation.
Some algebraic structures involve a natural bit size for representing their elements (e.g. modular integers, finite fields); others involve a variable size (e.g. arbitrarily large integers, arrays, polynomials). In both cases, elements are seen as sequences of symbols on tapes ended by a specific symbol, written “” in the sequel. Because heads of the machine can just move one cell left or right at a time, algorithms must take care of consuming data in the most contiguous way possible. In particular, we notice that loop counters must be used with care: for instance, the naive implementation of a loop “for from 1 to ” involves a non-constant number of bit operations at each iteration: to increment by and to test whether is less than . In this section we gather standard data types and elementary operations needed in the next sections. We freely use well known complexity bounds for polynomials and matrices from [13] and refer to [42] for more details on Turing machine implementations.
We use binary representation for integers, so that has bit size . A modular integer in is represented by its natural representative in . Integers may be added in linear time. The expression will represent a nondecreasing cost function for multiplying two integers of bit sizes , which satisfies for all . At present, the best known complexity bound is , where ; see [16-18] and historical references therein. The integer division in bit sizes takes time (see Lemma 2.15 below for instance), and the extended gcd costs by [41]. Overall, all arithmetic operations in take softly linear time.
One dimensional arrays are sequences of elements ended with the symbol “”.
Example
For bidimensional arrays we use column-major representation. Precisely an array of size ( rows and columns), is stored as the vector of its columns, that is . Such arrays are allowed to contain elements of different types and sizes.
Example
In the Turing machine model, it is not known how to perform transpositions of bidimensional arrays in linear time. The following lemma shows how to do transpositions with a logarithmic overhead. The special case when all entries have the same bit size was treated before in [5, appendix]. Notice that transpositions do not preserve the total bit size for non square matrices, due to changes in the number of “#” symbols.
Proof. We first handle the case using the following “divide and conquer” algorithm. If , then the array is encoded as and we write its transpose on the output tape using one linear traversal. Otherwise, we split into two matrices and on separate tapes, where is made of the first rows of , and of the remaining ones. We recursively transpose and and glue the results together on the output tape.
Clearly, the case when can be handled in time , as well as the algorithm for splitting into and , and the algorithm for gluing the transposes of and together into the transpose of . Let be a constant such that each of these algorithms takes time at most . Let and . Let us show by induction over that the transposition algorithm takes time . This is clear for . For , the computation time is bounded by
The case when is handled in an essentially similar way, by reverting the steps of the algorithm: if , then is rewritten into using one linear pass. If , then we recursively apply the algorithm on the first and the last columns, and merge the results in a linear pass. The entire computation can be done in time , by a similar complexity analysis as above.
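As an illustration only (not a Turing machine program), the row-splitting recursion of the proof can be transcribed in Python as follows; the split, the two recursive calls, and the gluing phase correspond to the linear tape traversals described above.

    def transpose(rows):
        """Transpose a list of rows by splitting the rows into two halves,
        transposing each half recursively, and gluing the results."""
        r = len(rows)
        if r == 0:
            return []
        if r == 1:
            return [[x] for x in rows[0]]   # a single row becomes a single column
        top, bottom = rows[:r // 2], rows[r // 2:]
        t_top, t_bottom = transpose(top), transpose(bottom)
        # gluing: each output row concatenates the corresponding rows
        # of the two partial transposes
        return [a + b for a, b in zip(t_top, t_bottom)]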
For univariate polynomials we use dense representation, which means that a polynomial of degree is stored as the vector of its coefficients from degrees to . Additions and subtractions take linear time in . Let denote a cost function that yields an upper bound for the number of operations in needed to multiply two polynomials in . For a general ring one may take thanks to [10]. For finite fields better bounds exist, and we write for the time taken by a Turing machine to multiply two polynomials in .
In what follows, any finite field with and prime is always assumed to be given as with monic and irreducible of degree . Elements of are stored as their natural representatives in . Additions and subtractions in take linear time, one product takes time and one inversion : see [13, part II], for instance. In [15, 19], it was shown that .
For a polynomial in a given number of variables , we use the recursive dense representation, by viewing as an element of . In particular, admits the same representation as its expansion as a univariate polynomial in . In our algorithms, the number of variables is not part of the representation of , so it must be supplied as a separate parameter.
Example
The support of is defined as the set of monomials with nonzero coefficients and we write for its cardinality. Assuming that, apart from the mandatory trailing “#” symbol, the representations of coefficients in do not involve the “” symbol (this can always be achieved through suitable renaming ), we denote the number of “” symbols involved in the representation of by . We notice that .
Proof. This follows by an easy induction over : for , we have nothing to do. If and , then we get
which concludes the proof.
If all the equal , then is the constant polynomial and its representation is with symbols “”. If for all , then the number of becomes .
Proof. We use a similar induction as in the proof of Lemma 2.5:
If , then we already observed that . If , then . For the remainder of this subsection, we assume that the size of the elements in is bounded by a constant . In particular, the total size of a multivariate polynomial is bounded by .
Proof. Recall that both and are considered to be the inputs. We use the following recursive algorithm: if , then we have nothing to do. If , then we write and recursively compute partial degree bounds for the coefficients. We next return . The lemma clearly holds for . By induction, the recursive computations can be done in time
The computation of the maxima can be done using one linear pass in time
Determining requires an additional time . Altogether, the computation takes time .
Proof. We use the following recursive algorithm: if , then we have nothing to do. If , then we write and recursively compute total degree bounds for the coefficients. We next return if and otherwise. The complexity bound follows using a similar induction argument as in the previous lemma.
In the following paragraphs we recall the costs of integer multi-remaindering and univariate multi-point evaluation together with the inverse problems. The underlying techniques are well known. They are recalled in the context of Turing machines for convenience.
Proof. We first compute the bidimensional array
in time by using fast multi-remaindering [13, chapter 10]. Then we appeal to Lemma 2.3 to obtain in time .
Proof. We first extract the vector of all the nonzero coefficients of in time . We use the latter lemma on , which incurs time . Then we recover from and , for , in time .
Proof. First we transpose the bidimensional array of size and then use Chinese remaindering times. By Lemma 2.3 and [13, chapter 10], this can be done in time .
Proof. We first compute the bidimensional array
in time , using fast univariate multi-point evaluation [13, chapter 12]. We next transpose the array in time , using Lemma 2.3.
where is the cardinality of the support of in the variables .
Proof. We first extract the vector of all the nonzero coefficients of in time together with of the same support as . We use the latter lemma on , which incurs time
Then for we recover the evaluation of at from and in time .
Proof. We first transpose the array made of to obtain
in time by Lemma 2.3. We next obtain the result through interpolations, in time by [13, chapter 12].
We will have to use the lexicographic order on , written , defined by
Notice that in the dense polynomial representation used here, coefficients are stored according to the lexicographic order on the exponent vectors; this corresponds to the usual lexicographic monomial order induced by .
We use fixed point representation to approximate real numbers. The number , with and , is represented by , where “.” is a specific symbol of the machine. A negative number starts with the symbol “-”. The bit size of is . The integer is called the (absolute) precision. Additions, subtractions and reducing the precision take linear time. The product of two fixed point numbers of bit size takes time . We gather well known results from [8].
Proof. Without loss of generality we may assume . We perform the following long integer division in time :
Then we have
Consequently we may take .
Proof. We write with and so . By using [8, Theorem 6.1], and may be computed to precisions and respectively in time . Let us write and the respective approximations, which satisfy
and . Then .
Proof. We write with and , so we have . By using [8, Lemma 2.3], may be approximated to precision in time .
In this section is an integer with and is a polynomial in with partial degree in for , and total degree . We assume that is given in recursive dense representation for , as described in the previous section, so the size of its representation is where and represents the number of in the representation of . The upper bounds and for the partial and total degrees are not necessarily sharp. Throughout this section, we assume that and that these bounds satisfy
(3.1)
We wish to evaluate at points in , written .
In order to evaluate at a point the initial idea consists in performing the evaluation over , that is to say by discarding the modulus . We write for the natural preimage of with coefficients in , and for the natural preimage of with entries in . In order to compute we construct an ad hoc sequence of primes such that . In this way, may be recovered by Chinese remaindering from .
Minimizing the bit size of is of the utmost importance for efficiency. For this purpose we introduce the following quantity which bounds the cardinality of the support of both in terms of the partial and total degrees:
On the other hand the quantity
(3.2)
is used as a bound for . It satisfies the following inequality to be used several times in the proofs when :
(3.3)
Proof. We first prove that
(3.4)
This inequality trivially holds when . Suppose by induction that it holds up to variables. Then again holds for variables.
On the other hand we have
The conclusion simply follows from the assumption .
Given and evaluation points in , the multi-point evaluation algorithm of this section works as follows:
If is “sufficiently small”, then we evaluate at all points in and read off the needed values. This task is detailed in the next subsection.
Otherwise we compute prime numbers such that . This is addressed in sections 3.3 and 3.4.
We evaluate at all modulo for by calling the algorithm recursively.
We reconstruct the values of at all by Chinese remaindering and perform the final divisions by .
We will be able to take all the of the order of . Therefore, the bit size of the modulus when entering the first recursive call is of the order , then at depth , then at depth , etc. The total bit size of all recursive problems to be solved at depth turns out to grow with . In section 3.5 we study the complexity of the algorithm in terms of the depth . Section 3.6 is devoted to finding a suitable value for to end the recursive calls. Roughly speaking, the property “sufficiently small” from step 1 becomes “ is of the order ”, so the time spent in the exhaustive evaluation of step 1 is of the order .
We begin by studying the base case of the multi-modular algorithm, i.e. the exhaustive evaluation at all points of . We recall that this algorithm is used for sufficiently small values of . We regard the evaluation of at all points in as a vector in .
Proof. Detecting if is the constant polynomial takes time . If so, then it suffices to copy onto the destination tapes times. This costs . From now on we assume that is not a constant, whence , and .
We interpret as a univariate polynomial and recursively evaluate the coefficients at all points in . After one matrix transposition of cost , this yields a vector of univariate polynomials
where ranges over . Using multi-point evaluations of these polynomials at all of cost , we finally obtain the vector of all with . Denoting by the cost of the algorithm, we thus obtain
By induction over , it follows that
which implies the claimed bound.
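This recursion can be transcribed directly; the following Python sketch is our own illustration, with a naive Horner loop in place of fast univariate multi-point evaluation and a dictionary in place of the lexicographically ordered tape.

    def eval_all(P, n, r):
        """Evaluate a recursive-dense polynomial P in n variables at every point
        of (Z/rZ)^n.  P is a nested list: a univariate polynomial in the last
        variable whose coefficients are recursive-dense polynomials in the first
        n - 1 variables; for n == 0 it is an integer.  Returns {point: value}."""
        if n == 0:
            return {(): P % r}
        # recursively evaluate every coefficient at all points of (Z/rZ)^(n-1)
        coeff_values = [eval_all(c, n - 1, r) for c in P]
        values = {}
        for point in coeff_values[0]:
            univ = [cv[point] for cv in coeff_values]   # specialized univariate polynomial
            for xn in range(r):                         # naive multi-point evaluation
                acc = 0
                for c in reversed(univ):
                    acc = (acc * xn + c) % r
                values[point + (xn,)] = acc
        return values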
In order to evaluate at a specific sequence of points in , we next wrap the latter lemma in the following algorithm that simply reads off the requested values once the exhaustive evaluation is done. This task is immediate in the RAM model, but induces a logarithmic overhead on Turing machines.
Algorithm
Evaluate at all points of sorted lexicographically.
For from to do
Set and .
Sort the vectors of pairs into according to the second coordinate (where denotes a permutation of ).
Deduce the vector by using the exhaustive evaluation of step 1.
Sort the latter vector according to the first coordinate in order to obtain .
Return .
Proof. The cost of step 1 is given in Lemma 3.2. In the Turing machine model the loop counter and the bounds and do not need to be explicitly computed in step 2. Instead it suffices to allocate an array of bits once on an auxiliary tape and use it to split the sequence of evaluation points into subsequences of elements—except the last one which has cardinality .
With this point of view in mind, each step 2.b and 2.d requires time
so the sum over all the values of is
Each step 2.c takes time . The total cost of all steps 2.c is therefore bounded by .
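In a random access model, step 2.c is a mere table lookup; the following Python fragment (with names of our choosing) illustrates the sort/scan/sort-back pattern that replaces it on a Turing machine, where the accesses to the table then occur at nondecreasing positions.

    def read_off_values(table, points, r):
        """`table` lists the values of P at all points of (Z/rZ)^n in lexicographic
        order; recover the values at the requested `points` by sorting them,
        scanning the table once, and sorting back."""
        def rank(pt):                               # lexicographic rank of a point
            k = 0
            for c in pt:
                k = k * r + c
            return k
        by_point = sorted(enumerate(points), key=lambda ip: rank(ip[1]))
        answers = [(i, table[rank(pt)]) for i, pt in by_point]   # nondecreasing positions
        answers.sort(key=lambda iv: iv[0])          # restore the input order
        return [v for _, v in answers]

Combined with the previous sketch, table = [v for _, v in sorted(eval_all(P, n, r).items())] yields the lexicographically ordered table of step 1.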
Multi-modular techniques classically involve bounds on the first Chebyshev function
For many applications, crude estimates on suffice. For our refined complexity analysis of the Kedlaya–Umans algorithm, we rely on the somewhat better estimate
(3.6)
More precisely, it was shown by Barkley Rosser and Schoenfeld [2, Theorem 31] that, for all ,
Even sharper bounds may be found in [12], but they will not be necessary here. From now on represents a constant in such that
(3.7)
Proof. For fixed and large , one has
Taking and sufficiently large (say ), it follows that
Then it suffices to further increase so that the implication also holds on the interval .
In the rest of this section the constant of the lemma will be used via the following function:
(3.8)
It is also convenient to introduce the function
that will bound the inflation of the modulus at successive recursive calls of our main algorithm. We will write for the -th iterate of this function.
Generating prime numbers is a standard task. In the next paragraphs we include the needed results for completeness.
Proof. We use the well known Eratosthenes sieve. On the same tape we generate all the integer multiples of not larger than , followed by all the multiples of 3 not larger than , then all the multiples of not larger than , etc. The total number of multiples generated in this way is . These multiples can all be generated in time . Then we sort these integers in increasing order and remove duplicates in time . The integers which are not listed in this way are exactly the requested prime numbers, which can thus be deduced with further bit operations.
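A short Python version of this sieve (our own, iterating over every k up to the square root of the bound, which suffices to mark all composites) reads as follows.

    def primes_up_to(m):
        """All primes <= m: generate the proper multiples k*j <= m for 2 <= k <= sqrt(m),
        deduplicate them (the tape version sorts and removes duplicates), and return
        the complement in {2, ..., m}."""
        multiples, k = [], 2
        while k * k <= m:
            multiples.extend(range(2 * k, m + 1, k))
            k += 1
        composite = set(multiples)
        return [x for x in range(2, m + 1) if x not in composite]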
The following algorithm computes consecutive prime numbers larger than a given integer , such that their product exceeds a given threshold .
Algorithm
Set .
Compute all the consecutive prime numbers less or equal to .
If , then increase by and go back to step 2.
Compute the first index such that , and set and .
If , then return .
While do
Compute .
If , then , else .
Return .
Proof. After step 3 we have , so the rest of the algorithm corresponds to a binary search to obtain the minimal index such that . During the “while” loop of step 6 we have and . So when the minimal sequence is actually found. Since at each step of the “while” loop, either increases or decreases strictly. Consequently the algorithm returns the correct result.
We exit step 3 once . Thanks to (3.6), this condition is met for , after time , by Lemma 3.5. The binary search also takes time .
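The procedure may be sketched in Python as follows (our own naming; trial division replaces the sieve of Lemma 3.5, and the products are computed naively).

    import math

    def prime_sequence(B, T):
        """Shortest sequence of consecutive primes p_1 < p_2 < ..., all > B, whose
        product exceeds T >= 1: enlarge the search bound until the product is big
        enough, then binary-search the minimal prefix, as in Algorithm 3.2."""
        def is_prime(k):
            if k < 2:
                return False
            d = 2
            while d * d <= k:
                if k % d == 0:
                    return False
                d += 1
            return True

        m = 2 * max(B, 2)
        while True:
            primes = [q for q in range(B + 1, m + 1) if is_prime(q)]
            if math.prod(primes) > T:
                break
            m *= 2                          # product still too small: enlarge the bound
        lo, hi = 1, len(primes)             # minimal l with p_1 * ... * p_l > T
        while lo < hi:
            mid = (lo + hi) // 2
            if math.prod(primes[:mid]) > T:
                hi = mid
            else:
                lo = mid + 1
        return primes[:lo]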
We are now ready to present the multi-modular evaluation algorithm. The parameter indicates the allowed depth for the recursive calls.
Algorithm
.
.
If or if is a constant polynomial, then use Algorithm 3.1.
Call Algorithm 3.2 with and to compute the shortest sequence of consecutive prime numbers with and .
Let be the canonical preimage of in , with coefficients in . Let represent the canonical preimage of in , for .
Compute and , for .
For from to , compute modulo by calling the algorithm recursively with .
Use Chinese remaindering to recover .
Compute the remainders by of these values to obtain and return .
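The following Python skeleton (ours) mirrors this structure; for simplicity the base case uses naive evaluation instead of Algorithm 3.1, the primes are simply the smallest ones returned by prime_sequence above, and Chinese remaindering is done by incremental lifting.

    def multimodular_eval(P, points, r, n, depth):
        """Evaluate the recursive-dense polynomial P over Z/rZ (n variables) at the
        given points of (Z/rZ)^n, with at most `depth` levels of recursion."""
        def horner(Q, pt):                  # evaluation over the integers
            if not pt:
                return Q
            acc = 0
            for c in reversed(Q):
                acc = acc * pt[-1] + horner(c, pt[:-1])
            return acc

        def saturate(Q):                    # replace every coefficient by r - 1
            return (r - 1) if not isinstance(Q, list) else [saturate(c) for c in Q]

        def reduce_mod(Q, p):
            return Q % p if not isinstance(Q, list) else [reduce_mod(c, p) for c in Q]

        if depth == 0 or r <= 3:            # base case: evaluate naively modulo r
            return [horner(P, pt) % r for pt in points]

        B = horner(saturate(P), (r - 1,) * n)       # bound on the lifted values
        primes = prime_sequence(1, B)               # their product exceeds B
        residues = [multimodular_eval(reduce_mod(P, p),
                                      [tuple(c % p for c in pt) for pt in points],
                                      p, n, depth - 1)
                    for p in primes]
        answers = []
        for j in range(len(points)):                # Chinese remaindering, then mod r
            val, mod = 0, 1
            for p, res in zip(primes, residues):
                t = ((res[j] - val) * pow(mod, -1, p)) % p
                val, mod = val + mod * t, mod * p
            answers.append(val % r)
        return answers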
For a fixed value of the algorithm takes time
where is defined in
Proof. Lemma 3.1 ensures that the multi-modular approach works well, whence the correctness of the algorithm. From now on, assume and . Inequality (3.1), combined with the definition of , implies . If is bounded, then so are , , and . Consequently we may freely assume that is sufficiently large in the cost analysis. From (3.7), for all we obtain
Since is increasing for all , we deduce
The condition is therefore satisfied whenever
(3.10)
By Lemma 3.4, there exists such that (3.10) is satisfied when is larger than
It follows that
(3.11)
From , we deduce that
By Proposition 3.6, step 2 takes time . The number of “#” in the representation of is by Lemma 2.5. Consequently the multi-remaindering of in step 3 takes time by Lemma 2.10. By Lemma 2.9 the multi-remaindering of takes time . In total step 3 contributes to
Step 5 costs by Lemma 2.11. The cost of step 6 is also dominated by .
Let denote the cost function of Algorithm 3.3, for being fixed. In other words, the constants hidden in the “” of the rest of the proof do not depend on but on . Since by (3.3), we have . Proposition 3.3 yields the cost of step 1:
By summing the costs of steps 2 to 6, we deduce that
(3.13)
Consequently, if , using the bounds (3.11), (3.12), and
the cost of Algorithm 3.3 simplifies as follows:
Using (3.3) again gives , whence
Now assume by induction that holds for some . Combining (3.11) and (3.13) we deduce that:
We claim that for all which implies and concludes the proof. The latter claim is proved by induction on . It clearly holds for . Assume it holds for . Then, using , we verify that
which concludes the proof of the claim.
In order to complete Algorithm 3.3, we still have to specify how to set the parameter in terms of . It is natural to let increase as a function of . Yet we cannot take arbitrarily large because the complexity in Proposition 3.7 involves a factor . The key idea here is to observe that, if is very large, namely when , then we may use the naive algorithm to evaluate independently at each . This leads to the following complexity bound.
Proof. If , then is constant and we just copy the input. If , then we expand as a univariate polynomial in and recursively evaluate at the point . This yields a univariate polynomial in that we evaluate at using Horner's method. Using induction over , it is not hard to show that the algorithm essentially performs ring operations in , which can be done in time . We finally recall that , by Lemmas 2.5 and 2.6.
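On the recursive dense representation, this naive procedure is simply a nested Horner scheme; in Python (our own transcription):

    def naive_eval_mod(P, point, r):
        """Naive evaluation of a recursive-dense polynomial: view P as univariate in
        the last variable, recursively evaluate its coefficients at the first
        coordinates, and finish with Horner's rule modulo r."""
        if not point:                       # zero variables: P is a constant
            return P % r
        acc = 0
        for c in reversed(P):
            acc = (acc * point[-1] + naive_eval_mod(c, point[:-1], r)) % r
        return acc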
We are now ready to present the top level multi-point evaluation procedure. Recall that the bit size of an integer is given by .
Algorithm
If , then compute the output using the univariate multi-point evaluation algorithm.
Compute .
Compute the bit size of and then the bit size of .
If , then compute the output using the naive algorithm.
Compute the bit size of .
If or , then compute the output using Algorithm 3.1.
Compute the output using Algorithm 3.3 with parameter .
where is a function which tends to when tends to infinity.
Proof. When we arrive at step 5 with and , the inequality holds, whence . Consequently the assumption of Algorithm 3.3 is satisfied. This proves the correctness of the algorithm thanks to Propositions 3.3 and 3.7.
If , then multi-point evaluation costs
which is below the bound of the proposition. From now on we assume that . Recall that by (3.3).
The quantities may be obtained in time
by combining Lemmas 2.5, 2.7 and 2.8.
By the subproduct tree technique, we may compute in time , and in time . The cost for summing is . We may also compute in time and then easily deduce as the bit size of . Overall the cost of step 2 is negligible.
Let denote the logarithm function in base . The bit size of and then may be obtained in time . We have and , whence
The naive evaluation in step 3 costs by Lemma 3.8. So when this cost drops to .
From now on we may assume that . If is bounded, then so are all other parameters by (3.3) and , whence the execution takes time . If and , then we have
so we may use Proposition 3.3 to bound the time of step 4 by
Let us now consider step 5, where we have and , whence , as previously mentioned. For our complexity analysis, we may freely assume that is sufficiently large. In particular, by using (3.14), inequality implies
On the one hand implies . On the other hand implies . Consequently we have and deduce:
Therefore the cost of Algorithm 3.3 with parameter is
by Proposition 3.7 and equation (3.8).
By gathering costs of each step we thus obtain that Algorithm 3.4 takes time
for some universal constant . Finally we set
(3.15)
to conclude the proof.
Proof. We may compute with and in time thanks to the lemmas from the end of section 2. If , then and the result directly follows from Proposition 3.9. Otherwise we apply the evaluation algorithm several times with sets of evaluation points of cardinality at most .
In this section and represent integers and we study multi-point multivariate evaluation over , where is a monic polynomial in of degree . The approach developed in the next paragraphs lifts this problem to an evaluation problem over , so that fast univariate evaluation/interpolation in may be used.
We endow with the usual norm :
Proof. The proof is done by induction on . The inequality is an equality for . Then we verify that holds for .
Proof. The degree bound is clear. Now consider the polynomials
From Lemma 4.1 we obtain
The conclusion follows by applying (3.4) and (3.5) to the polynomial .
Algorithm
Compute , the bit size of , and the bit size .
If , then use the naive evaluation algorithm to compute and return .
Call Algorithm 3.2 with and to compute the minimal sequence of the smallest prime numbers such that and .
Let be the canonical preimage of in with degrees in and with integer coefficients in .
Let represent the similar preimage of in , for .
Compute and , for .
For from to do
Specialize for .
Specialize for .
For , evaluate at by using the algorithm from Theorem 3.10.
Interpolate .
Use Chinese remaindering to recover .
Compute the remainders by and of these values to obtain and return .
Proof. The correctness follows from Lemma 4.2. The quantities may be obtained in time
by Lemmas 2.5, 2.7 and 2.8. As in the beginning of the proof of Proposition 3.9, the cost for deducing and is negligible. The degree of may be obtained in time . Then, computing requires time . To obtain we first evaluate in time and then in time . Overall step 1 takes negligible time
(4.1)
If , then the naive algorithm in step 2 runs in time
(4.2)
by adapting Lemma 3.8. Proposition 3.6 implies that step 3 takes time
(4.3)
and we have whence , for a universal constant .
The cost of step 4 is obtained by adapting Lemmas 2.9 and 2.10 to :
(4.4)
Let us now examine the cost of step 5. For fixed , the specializations of and the for in steps 5.a and 5.b require time
by Lemmas 2.12 and 2.13. By Theorem 3.10, the evaluations in step 5.c take time
where (3.15) implies . The cost of step 5.d is by Lemma 2.14. It follows that the total cost of step 5 is
The cost of step 6 is provided by Lemma 2.11, that is
(4.6)
Finally the cost of step 7 is
(4.7)
Summing all costs from (4.1)–(4.7), we obtain the total cost of the algorithm
We conclude that the function
satisfies the requirement of the proposition.
Proof. If , then we use fast univariate multi-point evaluation. Otherwise we use Proposition 4.3 in combination with .
The first corollary is a complexity bound in terms of the partial degrees, while the second one concerns the total degree.
Proof. We apply Theorem 4.4 with .
Proof. We apply Theorem 4.4 with and make use of the well known inequality . If , then
Otherwise,
In both cases we thus have .
If the partial degrees are large with respect to the number of the variables, then we may use Kronecker segmentation on in order to decrease the dependency in in the complexity bounds from the two previous sections. We first analyze the cost of Kronecker segmentation on Turing machines and then show how to reduce the complexity of multi-point evaluation. Throughout this section, is an effective ring whose elements occupy at most cells on tapes and whose arithmetic operations take softly linear time.
Let be integers . The Kronecker substitution map is the unique -algebra morphism determined by
When restricted to the space of polynomials of partial degree in , it becomes an -linear isomorphism onto the space of polynomials in of degree . The Kronecker segmentation associated to transforms the univariate polynomial of degree into the multivariate polynomial , so that
Algorithm
If , then return .
For , call the algorithm recursively on and integers to obtain .
Return .
Proof. The correctness is clear. Let represent the cost function of the algorithm. Step 1 takes linear time in the size of , that is . Step 2 requires one linear traversal and recursive calls, whence
By induction over , it follows that .
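With the usual convention that the i-th variable is mapped to x raised to the power d_1 ⋯ d_{i-1}, the segmentation and the substitution admit the following short Python transcriptions (ours), acting on coefficient lists and on recursive-dense nested lists.

    def kronecker_segment(A, degs):
        """Cut the coefficient list A (degree < d_1 * ... * d_n) into a recursive-dense
        polynomial in len(degs) variables with partial degree < degs[i] in the
        (i+1)-st variable."""
        if len(degs) == 1:
            return A + [0] * (degs[0] - len(A))       # pad to length d_1
        chunk = 1
        for d in degs[:-1]:
            chunk *= d                                # d_1 * ... * d_{n-1}
        blocks = [A[i:i + chunk] for i in range(0, chunk * degs[-1], chunk)]
        return [kronecker_segment(b, degs[:-1]) for b in blocks]

    def kronecker_substitute(P, degs):
        """Inverse map: coefficient list of P(x, x^(d_1), x^(d_1 d_2), ...)."""
        if len(degs) == 1:
            return P + [0] * (degs[0] - len(P))
        out = []
        for coeff in P:                               # coefficient with respect to the last variable
            out.extend(kronecker_substitute(coeff, degs[:-1]))
        return out

For instance, kronecker_segment([1, 2, 3, 4, 5], [2, 3]) returns [[1, 2], [3, 4], [5, 0]], and kronecker_substitute applied to this output recovers [1, 2, 3, 4, 5, 0].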
The cost of the Kronecker substitution, stated in the next proposition, will be needed in section 7 only.
Proof. The proof is done by induction on . We will require in addition that is zero padded up to degree (at the end, we may clearly remove trailing zeros in linear time). The case corresponds to a simple zero padding up to degree , which takes time . If , then we write , we recursively compute for , and concatenate the results onto the output tape. The complexity bound is straightforward.
Only the univariate Kronecker segmentation is actually needed for the modular composition algorithm of the next section. In the rest of this section we introduce the multivariate segmentation and make use of it in order to speed up multi-point evaluation.
Now consider a multivariate polynomial of partial degree in for . For , let be integers such that
(5.1)
We introduce the multivariate Kronecker substitution map
This map restricts to an -linear isomorphism between polynomials of partial degree in and polynomials in of partial degree in . In this context, the Kronecker segmentation of is defined as .
where and .
Proof. If , then we may use Proposition 5.1. Otherwise we consider in the form . Let represent the cost of the segmentation of . By Lemma 2.5 the representation size of each is . If , then Proposition 5.1 gives
Otherwise, we simply have . It follows that .
In the rest of this section we explain how to decrease the cost of multi-point evaluation using Kronecker segmentation of .
Recall that . Let be an integer. For with , let
We thus have
which implies
whence
(5.3)
If , then we set so that inequality (5.3) still holds. In addition, if , then , whence , and
(5.4)
Notice that we have . For all we introduce
so and hold. From
(5.5)
for , and
(5.6)
we deduce that
(5.7)
In a dual manner to the Kronecker substitution map (5.2) associated to the we introduce the map
where . Letting we thus have
In this way we reduce the multi-point evaluation in variables and partial degree bounds to evaluation in variables and partial degree bounds . Notice that this segmentation generically builds a polynomial of total degree close to the sum of its partial degrees. The cardinality of the support of is the same as that of the support of , but its number of “#” symbols in the representation is larger. From (5.7) we deduce that
(5.8)
The latter may be replaced by a smaller value with arbitrarily close to whenever is sufficiently large. The reduction process is summarized in the following algorithm.
Algorithm
Compute and then and for and .
Build .
For all compute .
Return and .
Proof. The correctness is clear from the definitions. The quantities may be obtained in time by Lemmas 2.5 and 2.7. Then we use a binary search to determine as the first integer such that in time . By Proposition 5.3 the segmentation in step 2 takes time . Using binary powering, step 3 involves operations in for each point .
where .
Proof. We may freely assume that is sufficiently large, so that the cost of multi-modular evaluation is
by Theorem 4.4. If , then we deduce the complexity bound
so we are done.
From now on we assume that . If is bounded, then so is , and we may appeal to the naive evaluation algorithm; the conclusion follows by adapting Lemma 3.8 to . We may thus further assume that is sufficiently large to satisfy
If , then . Otherwise, (5.3), (5.4), (5.5), and (5.6) imply
whence
For all we thus have . It follows that
and
Using Proposition 5.4, we compute and the by Algorithm 5.2 in time
Then Theorem 4.4 ensures that the evaluation of at all the takes time
(5.9)
where . Now we further assume that is sufficiently large such that
Then the cost (5.9) rewrites into
In the univariate case it is well known that a polynomial of degree may be evaluated at points in softly linear time. In the multivariate setting we wish to reach softly linear time for the evaluation in degree , with variables, at points. Although such a complexity bound seems out of reach for the present techniques, the aim of this section is to prove a slightly weaker bound. We start with a simple lemma.
Proof. The bound is proved as follows:
Proof. If , then Lemma 3.8 ensures evaluation time . From now on we may assume .
First we examine the case when for a constant to be fixed. Lemma 5.6 combined with the fact that the function is nondecreasing yields
We fix the constant sufficiently small such that . By Lemma 3.8, the evaluations can be achieved in time .
From now on we may assume that holds. If is bounded, then so is and Lemma 3.8 again leads to an evaluation time . Consequently we may further assume that is sufficiently large to satisfy
(5.10)
Setting , Theorem 5.5 provides us with the time complexity bound
For a universal constant , this bound simplifies to
Whenever the latter cost further simplifies to
We now consider the other case when , so we have , hence . In this case, Corollary 4.6 gives the cost
which is bounded from above by
by (5.10).
In all cases with , we have thus proved the complexity bound
Since , we have , so the total running time is bounded by . We conclude by applying this for instead of .
In this section, is an effective field, is a monic polynomial in of degree , and are two polynomials in . We want to compute . We first describe and analyze the algorithm in the algebraic complexity model and then return to Turing machines in the case when is a finite field.
Let . This precise choice of will be motivated below. We let be an integer such that
(6.1)
In particular, we have
and
so that
tends to for large values of . Now we define
(1 ⩽ i ⩽ n - 1)
so that . If is sufficiently large, then for all . In addition, from
and
we deduce
(6.3)
We are now ready to state the modular composition algorithm.
Algorithm
.
, , , , and , where is as in (6.1).
Compute .
If one of the is less than , then use the naive modular composition algorithm.
Build .
Compute , ,…, .
Compute for .
Evaluate at for .
Interpolate such that for .
Return .
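The structure of Algorithm 6.1 can be sketched in Python over a prime field Z/pZ; the sketch below reuses pol_mul and pol_rem from the Brent–Kung sketch of the introduction, kronecker_segment from section 5 and naive_eval_mod from section 3, takes 0, 1, ..., N - 1 as evaluation points (so p must be at least N), and replaces fast interpolation by naive Lagrange interpolation. All names are ours.

    def horner_mod(h, a, p):
        """Value of the univariate polynomial h at a, modulo p."""
        acc = 0
        for c in reversed(h):
            acc = (acc * a + c) % p
        return acc

    def pol_pow_mod(h, e, f, p):
        """h^e rem f over Z/pZ by binary powering."""
        result = [1]
        while e:
            if e & 1:
                result = pol_rem(pol_mul(result, h, p), f, p)
            h = pol_rem(pol_mul(h, h, p), f, p)
            e >>= 1
        return result

    def lagrange_interpolate(xs, ys, p):
        """Naive Lagrange interpolation over Z/pZ (p prime, xs pairwise distinct)."""
        coeffs = [0] * len(xs)
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            num, denom = [1], 1
            for k, xk in enumerate(xs):
                if k != j:
                    num = pol_mul(num, [(-xk) % p, 1], p)
                    denom = denom * (xj - xk) % p
            scale = yj * pow(denom, -1, p) % p
            for i, c in enumerate(num):
                coeffs[i] = (coeffs[i] + scale * c) % p
        return coeffs

    def modular_compose(A, g, f, p, degs):
        """A(g) rem f over Z/pZ in the spirit of Algorithm 6.1: Kronecker-segment A
        with partial degree bounds `degs` (their product must exceed deg A), replace
        the i-th variable by g^(d_1...d_{i-1}) rem f, evaluate at N points,
        interpolate, and reduce modulo f."""
        nf = len(f) - 1
        A_seg = kronecker_segment([c % p for c in A], degs)
        gs = [pol_rem([c % p for c in g], f, p)]        # g_1 = g rem f
        for d in degs[:-1]:
            gs.append(pol_pow_mod(gs[-1], d, f, p))     # g_{i+1} = g_i^{d_i} rem f
        N = sum((d - 1) * (nf - 1) for d in degs) + 1   # enough interpolation points
        alphas = list(range(N))                         # assumes N <= p
        g_values = [[horner_mod(gi, a, p) for a in alphas] for gi in gs]
        c_values = [naive_eval_mod(A_seg, tuple(gv[j] for gv in g_values), p)
                    for j in range(N)]
        return pol_rem(lagrange_interpolate(alphas, c_values, p), f, p)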
The following proposition summarizes the cost in terms of arithmetic operations in .
Proof. If is sufficiently large, then for all . Step 3 performs
operations in in view of (6.3). Steps 4 and 6 respectively take time and . Step 7 takes additional operations in .
We now return to the Turing machine model.
Proof. The integer can be computed in time . Without loss of generality we may suppose , so that . Since , Lemmas 2.15, 2.16 and 2.17 allow us to routinely compute in time . We compute as the largest integer such that , in time , and deduce with additional cost .
The sum may be bounded from above as follows:
From (6.2) we obtain
Since we deduce that
It follows that
(6.4)
whence
(6.5)
We then obtain
and
which imply that . Combined with (6.5), this yields
On the other hand, thanks to (6.4), we have
(6.6)
First we handle the case when . We may generate pairwise distinct in time and use Algorithm 6.1. Step 5 takes time
by Theorem 4.4. This bound simplifies into
for some constant . Notice that
so minimizes . Now
so step 5 takes
Step 2 of Algorithm 6.1 takes time by Proposition 5.1 and (6.3). The cost for other steps, as analyzed by Proposition 6.1, amounts to
operations in , which simplifies into . We are done with the case .
It remains to deal with the case . Basically we construct a suitable algebraic extension of and run Algorithm 6.1 over instead of . In fact we need to hold. We compute the first integer such that , in time , so we have . We next compute the smallest integer that is coprime with . Since , this takes time . We proceed with the construction of an irreducible polynomial of degree . Using [43, Theorem 3.2], this can be done in time
This includes the computation of the irreducible factorization of : the primes can be computed in time by Lemma 3.5, so the prime factors of may be deduced in time .
We let represent the monic part of the resultant in and we write for the corresponding subresultant of degree in . It is well known that is irreducible and that is invertible modulo (see for instance [43, Lemma 2.4]). Setting , we then have the following isomorphism:
We identify to . We may obtain and in time (see [35, Corollary 31] for fast algorithms, but it would be sufficient here to appeal to naive methods). An element of represented by may be sent into in time . For the backward conversion we simply replace by and reduce modulo and , which takes time . Consequently applying Algorithm 6.1 over instead of only involves an additional overhead of in the total complexity.
Remark
Remark
For the case when is a field of small characteristic , Kedlaya and Umans also designed an algebraic algorithm for fast multi-point evaluation [34, Theorem 6.3]. This algorithm turns out to be somewhat more efficient than those from the previous sections. In the present section, we adapt their techniques and prove a complexity bound in terms of the total degree instead of the partial degrees. We also refine their complexity estimates for multi-point evaluation and modular composition.
The base field is written with and prime; we assume it is explicitly given as with monic and irreducible of degree . We let represent a constant such that two matrices can be multiplied with ring operations.
We begin with -th root extractions in . It is well known that such root extractions reduce to linear algebra via the so-called Petr–Berlekamp matrix of the isomorphism of in the canonical basis . Once the inverse of this matrix is known, each root extraction takes ring operations in , and root extractions take ring operations in . We could use this strategy in this section but it would involve an extra factor in our complexity estimates. Instead, since we focus on the case when remains small, it is worth using the following algorithm, borrowed from [34, Theorem 6.1] and originating from [37].
Algorithm
Expand into , where .
Expand into , where .
Select such that , compute , and return .
Proof. The identity is equivalent to . For all we have and thus . Since is irreducible, is invertible modulo . We are done with the correctness.
The computations in steps 1 and 2 take ring operations in . Step 3 requires ring operations in plus inversions in by using fast extended gcds.
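As a point of comparison, the simplest (but slower, for large extension degrees) alternative raises to the power p^(k-1), since the Frobenius map is an automorphism of the field with p^k elements; in Python, reusing pol_pow_mod from the sketch of section 6 (again our own illustration, not the algorithm above):

    def pth_root(a, mu, p):
        """p-th root of a in F_p[t]/(mu), where mu is irreducible of degree k,
        via the identity a^(1/p) = a^(p^(k-1))."""
        k = len(mu) - 1
        return pol_pow_mod(a, p ** (k - 1), mu, p)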
The next algorithm, borrowed from [34, section 6], performs multivariate multi-point evaluations by reduction to the univariate case.
Algorithm
Compute the total degree of .
Let be minimal such that and .
Build a monic irreducible polynomial of degree , and find a primitive root of unity of maximal order in .
Reinterpret as an element of , let , and reinterpret as an element of .
For from to do
Compute .
Interpolate such that for .
Let , compute modulo for .
Return .
Proof. For and we write . Let represent the endomorphism of that raises coefficients to the -th power. Then
In particular has degree whence , by the assumption . Now
Since has degree , we deduce that it coincides with , whence . We are done with the correctness.
Step 1 takes time
by Lemmas 2.6 and 2.8. Steps 2 and 4 take negligible time.
By construction, we have
(7.1)
In step 3, the construction of may be done deterministically in time
by means of [43, Theorem 3.2] (this includes the computation of the irreducible factorization of ).
Now [44, Theorem 1.1] provides us with a universal constant such that there exists a monic polynomial of degree with , for which is a primitive element. It thus suffices to enumerate all the monic polynomials of increasing degrees and to stop as soon as one discovers a primitive one. The loop terminates after testing candidates.
Verifying that a candidate polynomial is a primitive element for the multiplicative group of can be done as follows. We first compute the irreducible factorization of in time (see for instance [11, Theorem 1.1]). We next verify that
for each prime divisor of in time . Altogether, this allows us to compute deterministically in time
Now each step 5.a requires extractions of -th roots in . In total, they take time
by Proposition 7.1. The total cost of step 5.b is .
Since the partial degrees of are and since , the polynomial is the Kronecker substitution , following the notation of section 5.1. The univariate polynomial in step 6 may be obtained in time
by Proposition 5.2, thanks to (7.1) and the assumption that .
Then the simultaneous evaluation of at points in takes
operations in which corresponds to time
(since deg f∗ ⩽ d∗ h^(n-1))
by (7.1).
The final evaluations in step 7 take time .
Remark
Let us now reanalyze the complexity of Algorithm 6.1 when using Algorithm 7.2 for multi-point evaluations. For simplicity we assume that is a fixed prime number, so the constant hidden in the “” below actually depends on . This time, we let be an integer such that
(7.2)
Proof. We adapt the proof of Theorem 6.2 and use (7.2) for the value of instead of (6.1). Here, it is important for to be fixed, in order to benefit from the same kind of asymptotic expansions as in the case of Theorem 6.2:
and
In particular still tends to for large values of .
The main change is the use of Proposition 7.2 to obtain the following cost for step 5 of Algorithm 6.1:
By using (6.4), (6.5), (6.6), and the fact that , the latter cost is bounded by
We then need to upper bound the term
Now for sufficiently large , we have
It follows that
On the other hand, we have
which concludes the proof.
For small fixed values of , Theorem 7.4 therefore improves upon Theorem 6.2. This happens roughly whenever , which rewrites into . A more precise complexity analysis in terms of the parameter could be developed under the latter condition.
An important application of the present results concerns polynomial system solving, for which we prove new complexity bounds in [28]: the key algorithms are the Kronecker solver [14] and fast multivariate modular composition. For the latter problem, we mostly follow the strategy deployed in the proof of Proposition 5.7.
Besides technical adaptations to Turing machines and various refinements within the asymptotic complexity bounds, our main improvements upon Kedlaya and Umans' algorithms concern complexity analyses in terms of the total degree, and the way we appeal to the naive evaluation algorithm as a fallback in Theorem 3.10. In particular, our complexity bounds are quasi-optimal with respect to the bit size of the elements in the ground ring.
Another major motivation behind our work is to understand how relevant the new complexity bounds are for practical implementations. Unfortunately, the input sizes for which our optimized variants of Kedlaya and Umans' algorithms become faster than previous approaches still seem to be extremely large. It is instructive to discuss, even in very informal terms, the orders of magnitude of the cross-over points.
In the univariate case over , fast algorithms for multi-point evaluation allow for an average evaluation cost per point of , by means of the classical subproduct tree technique [13, chapter 10], and as soon as the number of points is of the order of the degree. In the same spirit, in order to minimize the average cost per evaluation point, Theorem 3.10 indicates that it is favorable to take of the order of (that is larger than the cardinality of the support of ).
To simplify the discussion, we discard the factor occurring in Theorem 3.10 , and we focus on the case when and . So when , the average cost per point roughly becomes . Recall that this average cost is with the naive approach. The ratio between both bounds is therefore of the order . In Table 8.1
The above factor corresponds to the value in Algorithm 3.3. In practice it turns out to be more interesting to use smaller values for . In fact, it is worth using Algorithm 3.3 with whenever , since
and the cost given in Proposition 3.7 drops to . Table 8.2
In small positive characteristic , Proposition 7.2 seems more promising for practical purposes. For , the average cost per evaluation point drops to . The efficiency ratio with respect to the naive algorithm is thus of the order . For instance, with and , this ratio rewrites into . Consequently, Algorithm 7.2 might be relevant in practice for input data of a few kilobytes. However we are not aware of such an efficient implementation.
For the above reasons, faster software implementations of multivariate multi-point evaluation and modular composition remain major challenges. At least, we hope that our new detailed complexity bounds will stimulate more theoretical and practical research in this area. For example, is it possible to decrease the contribution of in Theorem 3.10? Or, still in Theorem 3.10, could one decrease the exponent of ? Is it possible to improve upon the constant in Theorem 6.2?
A. Arnold, M. Giesbrecht, and D. S. Roche. Faster sparse multivariate polynomial interpolation of straight-line programs. J. Symbolic Comput., 75:4–24, 2016.
J. Barkley Rosser and L. Schoenfeld. Approximate formulas for some functions of prime numbers. Ill. J. Math, 6:64–94, 1962.
D. J. Bernstein. Composing power series over a finite ring in essentially linear time. J. Symbolic Comput., 26(3):339–341, 1998.
L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and Cartier–Manin operator. SIAM J. Comput., 36:1777–1806, 2007.
A. Bostan, G. Lecerf, and É. Schost. Tellegen's principle into practice. In Hoon Hong, editor, Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, ISSAC '03, pages 37–44, New York, NY, USA, 2003. ACM.
A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. J. Complexity, 21(4):420–446, 2005.
R. P. Brent. Fast multiple-precision evaluation of elementary functions. J. ACM, 23(2):242–251, 1976.
R. P. Brent and H. T. Kung. Fast algorithms for manipulating formal power series. J. ACM, 25(4):581–595, 1978.
D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Infor., 28:693–701, 1991.
E. Costa and D. Harvey. Faster deterministic integer factorization. Math. Comp., 83(285):339–345, 2014.
P. Dusart. Estimates of some functions over primes without R.H. Technical report, ArXiv, 2010. https://arxiv.org/abs/1002.0442.
J. von zur Gathen and J. Gerhard. Modern computer algebra. Cambridge University Press, New York, 3rd edition, 2013.
M. Giusti, G. Lecerf, and B. Salvy. A Gröbner free alternative for polynomial system solving. J. complexity, 17(1):154–211, 2001.
D. Harvey and J. van der Hoeven. Faster integer and polynomial multiplication using cyclotomic coefficient rings. Technical report, ArXiv, 2017. http://arxiv.org/abs/1712.03693.
D. Harvey and J. van der Hoeven. Faster integer multiplication using plain vanilla FFT primes. Math. Comp., 88(315):501–514, 2019.
D. Harvey and J. van der Hoeven. Faster integer multiplication using short lattice vectors. In ANTS XIII. Proceedings of the Thirteenth Algorithmic Number Theory Symposium, volume 2 of The Open Book Series, pages 293–310. Mathematical Sciences Publishers, 2019.
D. Harvey, J. van der Hoeven, and G. Lecerf. Even faster integer multiplication. J. Complexity, 36:1–30, 2016.
D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. J. ACM, 63(6), 2017. Article 52.
J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34(6):479–542, 2002.
J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proceedings of the 2004 international symposium on Symbolic and algebraic computation, ISSAC '04, pages 290–296, New York, NY, USA, 2004. ACM.
J. van der Hoeven. Fast composition of numeric power series. Technical Report 2008-09, Université Paris-Sud, Orsay, France, 2008.
J. van der Hoeven. Faster Chinese remaindering. Technical report, CNRS & École polytechnique, 2016. http://hal.archives-ouvertes.fr/hal-01403810.
J. van der Hoeven and G. Lecerf. On the bit-complexity of sparse polynomial and series multiplication. J. Symbolic Comput., 50:227–254, 2013.
J. van der Hoeven and G. Lecerf. Composition modulo powers of polynomials. In M. Burr, editor, Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC '17, pages 445–452, New York, NY, USA, 2017. ACM.
J. van der Hoeven and G. Lecerf. Modular composition via complex roots. Technical report, CNRS & École polytechnique, 2017. http://hal.archives-ouvertes.fr/hal-01455731.
J. van der Hoeven and G. Lecerf. Modular composition via factorization. J. Complexity, 48:36–68, 2018.
J. van der Hoeven and G. Lecerf. On the complexity exponent of polynomial system solving. https://hal.archives-ouvertes.fr/hal-01848572, 2018.
J. van der Hoeven and É. Schost. Multi-point evaluation in higher dimensions. Appl. Alg. Eng. Comm. Comp., 24(1):37–52, 2013.
Xiaohan Huang and V. Y. Pan. Fast rectangular matrix multiplication and applications. J. Complexity, 14(2):257–299, 1998.
F. Johansson. A fast algorithm for reversion of power series. Math. Comp., 84:475–484, 2015.
E. Kaltofen and V. Shoup. Fast polynomial factorization over high algebraic extensions of finite fields. In Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, ISSAC '97, pages 184–188, New York, NY, USA, 1997. ACM.
K. S. Kedlaya and C. Umans. Fast modular composition in any characteristic. In FOCS'08: IEEE Conference on Foundations of Computer Science, pages 146–155, Washington, DC, USA, 2008. IEEE Computer Society.
K. S. Kedlaya and C. Umans. Fast polynomial factorization and modular composition. SIAM J. Comput., 40(6):1767–1802, 2011.
G. Lecerf. On the complexity of the Lickteig–Roy subresultant algorithm. J. Symbolic Comput., 92:243–268, 2019.
M. Nüsken and M. Ziegler. Fast multipoint evaluation of bivariate polynomials. In S. Albers and T. Radzik, editors, Algorithms – ESA 2004. 12th Annual European Symposium, Bergen, Norway, September 14-17, 2004, volume 3221 of Lect. Notes Comput. Sci., pages 544–555. Springer Berlin Heidelberg, 2004.
D. Panario and D. Thomson. Efficient th root computations in finite fields of characteristic . Des. Codes Cryptogr., 50(3):351–358, 2009.
C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
M. S. Paterson and L. J. Stockmeyer. On the number of nonscalar multiplications necessary to evaluate polynomials. SIAM J. Comput., 2(1):60–66, 1973.
P. Ritzmann. A fast numerical algorithm for the composition of power series with complex coefficients. Theoret. Comput. Sci., 44:1–16, 1986.
A. Schönhage. Schnelle Berechnung von Kettenbruchentwicklungen. Acta Informatica, 1(2):139–144, 1971.
A. Schönhage, A. F. W. Grotefeld, and E. Vetter. Fast algorithms: A multitape Turing machine implementation. B. I. Wissenschaftsverlag, Mannheim, 1994.
V. Shoup. New algorithms for finding irreducible polynomials over finite fields. Math. Comp., 54(189):435–447, 1990.
V. Shoup. Searching for primitive roots in finite fields. Math. Comp., 58:369–380, 1992.