Joris van der Hoeven and
Grégoire Lecerf
Laboratoire d'Informatique, UMR 7161 CNRS
École polytechnique
91128 Palaiseau Cedex, France
{vdhoeven,
lecerf}@lix.polytechnique.fr
On the complexity of multivariate
blockwise polynomial multiplication
Preliminary version of November 3, 2023. This work has been partly supported by the French
In this article, we study the problem of multiplying two multivariate
polynomials which are somewhat but not too sparse, typically like
polynomials with convex supports. We design and analyze an algorithm
which is based on blockwise decomposition of the input polynomials, and
which performs the actual multiplication in an FFT model or some other
more general so called “evaluated model”. If the input
polynomials have total degrees at most ,
then, under mild assumptions on the coefficient ring, we show that their
product can be computed with
ring operations,
where
denotes the number of all the monomials of
total degree at most
.
Keywords: sparse polynomial multiplication; multivariate power series; evaluation-interpolation; algorithm
Let be an effective ring, which means
that we have a data structure for representing the elements of
and algorithms for performing the ring operations in
, including the equality test. The
complexity for multiplying two univariate polynomials with coefficients
in
and of degree
is
rather well understood [11, Chapter 8]. Let us recall that
for small
, one uses the
naive multiplication, as learned at high school, of complexity
. For moderate
, one uses the Karatsuba multiplication [21], of complexity
,
or some higher order Toom-Cook scheme [4, 29],
of complexity
with
.
For large
, one uses FFT [3, 5, 27] and TFT [14,
15] multiplications, both of complexity
whenever
has sufficiently many
-th roots of unity, or the Cantor-Kaltofen
multiplication [3, 27] with complexity in
otherwise.
Unfortunately most of the techniques known in the univariate case do not straightforwardly extend to several variables. Even for the known extensions, the efficiency thresholds are different. For instance, the naive product is generally softly optimal when sparse representations are used, because the size of the output can grow with the product of the sizes of the input. In fact, the nature of the input supports plays a major role in the design of the algorithms and the determination of their mutual thresholds. In this article, we focus on and analyze an intermediate approach to several variables, which roughly corresponds to an extension of the Karatsuba and Toom-Cook strategies, and which turns out to be more efficient than the naive multiplication for instance if the supports of the input polynomials are rather dense in their respective convex hulls.
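For concreteness, here is a short Python sketch of the naive product on sparse representations (our own illustration, not taken from the references); polynomials are stored as dictionaries mapping exponent tuples to coefficients, which makes apparent that the number of coefficient products equals the product of the sizes of the two supports.

```python
# Minimal sketch of naive sparse multiplication; the dictionary representation
# is chosen for illustration only.
def naive_sparse_mul(A, B):
    C = {}
    for ea, ca in A.items():
        for eb, cb in B.items():                      # |supp A| * |supp B| coefficient products
            e = tuple(i + j for i, j in zip(ea, eb))  # exponent of the resulting term
            C[e] = C.get(e, 0) + ca * cb
    return {e: c for e, c in C.items() if c != 0}     # discard accidental cancellations

# Example: (1 + x*y)(2 + x^2) in the variables x, y.
print(naive_sparse_mul({(0, 0): 1, (1, 1): 1}, {(0, 0): 2, (2, 0): 1}))
```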
The main idea in our blockwise polynomial multiplication is to cut the
supports into blocks of size and to rewrite all
polynomials as “block polynomials” in
with “block coefficients” in
These block coefficients are then manipulated in a suitable “evaluated model”, in which the multiplication is intended to be faster than with the direct naive approach. The block polynomials are themselves multiplied in a naive manner. The precise algorithm is described in Section 3.1, where we also discuss the expected speedup compared to the naive product. A precise complexity analysis is subtle, because it depends very much on the nature of the supports. Although the worst-case bound turns out to be no better than with the naive approach, we detail, in Section 3.2, a way to quickly determine a good block size, suited to supports that are not too small, not too thin, and rather dense in their convex hull.
Section 4 is devoted to implementation issues. We first design a cache-friendly version of our blockwise product. Then we adapt the blockwise product to multivariate power series truncated in total degree. Finally, we discuss a technique to slightly improve the blockwise approach for small and thin supports.
In Section 5 we analyze the complexity of the blockwise
product for two polynomials supported by monomials of degree at most
: if
contains
in its center, then we show that their
product can be computed with
operations in
, where
denotes the number of all the monomials of degree at most
. Notice that the constant hidden behind the latter depends neither on nor on the number of variables. With no hypothesis on
, the complexity bound we
reach becomes in
, by a
direct extension of the Karatsuba algorithm. Finally, we prove similar
complexity results for truncated power series.
Algorithms for multiplying multivariate polynomials and series have been designed since the early ages of computer algebra [6, 7, 8, 20, 28]. Even those based on the naive techniques are a matter of constant improvements in terms of data structures, memory management, vectorization and parallelization [10, 18, 23, 24, 25, 26, 30].
With a single variable , the
usual way to multiply two polynomials
and
of degree
with the Karatsuba
algorithm begins with splitting
and
into
and
, where
.
Then
,
, and
are computed
recursively. This approach has been studied for several variables [22], but it turns out to be efficient mainly for plain block
supports, as previously discussed in [8, Section 3].
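As a reminder, here is a univariate Karatsuba sketch in Python (our own illustration; the list-based representation and the splitting point are not the paper's).

```python
# Sketch of univariate Karatsuba on coefficient lists (lowest degree first):
# P = P0 + P1*x^k, Q = Q0 + Q1*x^k, and P*Q is recombined from the three
# recursive products P0*Q0, P1*Q1 and (P0+P1)*(Q0+Q1).
def karatsuba(P, Q):
    if not P or not Q:
        return []
    if min(len(P), len(Q)) <= 1:                      # fall back to the naive product
        out = [0] * (len(P) + len(Q) - 1)
        for i, a in enumerate(P):
            for j, b in enumerate(Q):
                out[i + j] += a * b
        return out
    k = max(len(P), len(Q)) // 2
    P0, P1, Q0, Q1 = P[:k], P[k:], Q[:k], Q[k:]
    L, H = karatsuba(P0, Q0), karatsuba(P1, Q1)
    M = karatsuba(padd(P0, P1), padd(Q0, Q1))
    out = [0] * (len(P) + len(Q) - 1)
    for i, c in enumerate(L):                         # L + (M - L - H)*x^k + H*x^(2k)
        out[i] += c
        out[i + k] -= c
    for i, c in enumerate(H):
        out[i + 2 * k] += c
        out[i + k] -= c
    for i, c in enumerate(M):
        out[i + k] += c
    return out

def padd(a, b):                                       # coefficient-wise sum of two lists
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(max(len(a), len(b)))]
```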
For any kind of support, there exist general purpose sparse
multiplication algorithms [2, 18]. They
feature quasi-linear complexity in terms of the sizes of the supports of
the input and of a given superset for the support of the product.
Nevertheless, the logarithmic overhead of the latter algorithms is significant, and alternative approaches, directly based on the truncated
Fourier transform, have been designed in [14, 15,
19] for supports which are initial segments of (i.e. sets of monomials which are
complementary to a monomial ideal), with the same order of efficiency as
the FFT multiplication for univariate polynomials.
To the best of our knowledge, the blockwise technique was introduced, in a somewhat sketchy manner, in [13, Section 6.3.3], and then refined in [16, Section 6]. In the case of univariate polynomials, its complexity has been analyzed in [12], where important applications are also presented.
Throughout this article, we use the sparse representation for
the polynomials, and the computation tree model with the
total complexity point of view [1, Chapter 4] for
the complexity analysis of the algorithms. Informally speaking, this
means that complexity estimates charge a constant cost for each
arithmetic operation and the equality test in , and that all the constants are thought to be
freely at our disposal.
Consider a multivariate polynomial ,
also written as
We define its support by ,
and denote its cardinality
by
. We interpret
as the
sparse size of
.
Given a vector of positive integers, we define
the set of block coefficients at order
by
Any polynomial can uniquely be rewritten as a
block polynomial
,
where
Given , we also notice that
. We let
represent the size of the block
.
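In code, the block decomposition amounts to splitting each exponent into a quotient and a remainder by the block size. The following sketch (our notation, with the same dictionary representation as above) performs the rewriting and its inverse.

```python
# Sketch of the block decomposition: a term c*x^e goes into the block of index
# e // b (componentwise), where it keeps the local exponent e mod b.
def to_blocks(A, b):
    blocks = {}
    for e, c in A.items():
        q = tuple(ei // bi for ei, bi in zip(e, b))   # block index
        r = tuple(ei % bi for ei, bi in zip(e, b))    # exponent inside the block
        blocks.setdefault(q, {})[r] = c
    return blocks

def from_blocks(blocks, b):
    A = {}
    for q, coeff in blocks.items():
        for r, c in coeff.items():
            e = tuple(qi * bi + ri for qi, bi, ri in zip(q, b, r))
            A[e] = A.get(e, 0) + c                    # overlapping blocks add up
    return {e: c for e, c in A.items() if c != 0}

# Example in two variables with block size b = (2, 2).
print(to_blocks({(0, 0): 1, (3, 1): 2, (2, 2): 5}, (2, 2)))
# {(0, 0): {(0, 0): 1}, (1, 0): {(1, 1): 2}, (1, 1): {(0, 0): 5}}
```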
A univariate evaluation-interpolation scheme in size
on
is the data of
an algorithm for computing an evaluation function , where
is the subset of polynomials in
of degree at
most
, and where
can be seen as the number of evaluation points,
and
an algorithm for computing an interpolation function,
written , but which is
not necessarily exactly the inverse of
,
,
such that and
are linear
and
holds for all and
in
, where
denotes the entry-wise product in
.
We write
for a common bound on the complexity of
and
,
so that two polynomials in
can be multiplied in
time
. For convenience we
still write
and
for
extensions to any
seen as a ring endowed with
the entry-wise product.
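The interface assumed here can be summarized by the following minimal sketch (the names Eval and Interp are ours): two polynomials are multiplied by evaluating both, taking the entry-wise product of the value vectors, and interpolating back.

```python
# Sketch of multiplication through an evaluation-interpolation scheme:
# Interp(Eval(P) * Eval(Q)) recovers the product P*Q, where * is taken
# entry-wise on the vectors of values.
def scheme_multiply(P, Q, Eval, Interp):
    vP, vQ = Eval(P), Eval(Q)
    return Interp([a * b for a, b in zip(vP, vQ)])
```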
Example corresponds
to
,
,
can be interpreted as the evaluation of a
polynomial at the values
,
, and
. By induction, the Karatsuba algorithm for
polynomials of degree at most
corresponds to
,
, where
is the block
polynomial of
with respect to
, and where
actually
corresponds to applying the extension of
to
. This yields
and
.
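Concretely, the Karatsuba scheme of this example can be written as follows (a sketch in our own notation; it matches the interface sketched above).

```python
# Karatsuba as an evaluation-interpolation scheme of size 2 with 3 "points"
# (informally 0, 1 and infinity).
def kara_eval(P):                        # P = [p0, p1] has degree at most 1
    p0, p1 = P[0], P[1]
    return [p0, p1, p0 + p1]

def kara_interp(v):                      # v = entry-wise product of two evaluations
    return [v[0], v[2] - v[0] - v[1], v[1]]   # coefficients of the degree-2 product

vP, vQ = kara_eval([1, 2]), kara_eval([3, 4])
print(kara_interp([a * b for a, b in zip(vP, vQ)]))   # (1 + 2x)(3 + 4x) = [3, 10, 8]
```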
Univariate evaluation-interpolation schemes can be extended to several
variables as follows: if then we define
, define the maps
and take
$$ \mathrm{Eval} = \mathrm{Eval}_1 \circ \cdots \circ \mathrm{Eval}_n, \qquad \mathrm{Interp} = \mathrm{Interp}_1 \circ \cdots \circ \mathrm{Interp}_n \tag{1} $$
where and
represent the
evaluation and interpolation with respect to the variable
:
In the sequel we will only use multivariate evaluation-interpolation schemes constructed in this way.
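A small sketch of this construction follows, applying the univariate maps along one axis at a time; the use of numpy, the dense coefficient arrays, and the choice of the size-2 Karatsuba scheme in every variable are our implementation assumptions.

```python
import numpy as np

# Multivariate Eval and Interp obtained by applying the univariate maps along
# one axis at a time (here the size-2 Karatsuba scheme in each variable).
def eval_nd(C):                                      # C has shape (2, ..., 2)
    for axis in range(C.ndim):
        C = np.apply_along_axis(lambda s: np.array([s[0], s[1], s[0] + s[1]]), axis, C)
    return C                                         # shape (3, ..., 3)

def interp_nd(V):                                    # V has shape (3, ..., 3)
    for axis in range(V.ndim):
        V = np.apply_along_axis(lambda s: np.array([s[0], s[2] - s[0] - s[1], s[1]]), axis, V)
    return V                                         # product coefficients, degree <= 2 per variable

# Two polynomials of degree < 2 in each of two variables; A[i, j] is the
# coefficient of x^i y^j.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 0]])
print(interp_nd(eval_nd(A) * eval_nd(B)))            # 3 x 3 array of coefficients of A*B
```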
For what follows, we first assume that we have a strategy for picking a
suitable evaluation-interpolation scheme for any given block size , and that we have a fast way to
compute the corresponding functions
and
, at least approximately. We also
assume that
and
are
increasing functions.
Let us first assume that we have fixed a block size . The first version of our algorithm for
multiplying two multivariate polynomials
can be summarized as follows:
Algorithm block-multiply
Input: and .
Output: the product .
1. Rewrite and as block polynomials .
2. Compute and .
3. Multiply using the naive algorithm.
4. Compute .
5. Reinterpret as a polynomial and return .
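The following sketch assembles the previous pieces (to_blocks, eval_nd and interp_nd from the sketches above) into this algorithm, for the particular block size (2, ..., 2) and the Karatsuba scheme in every variable; all names and representation choices are ours.

```python
import numpy as np

def block_multiply(A, B, n):                 # n = number of variables
    b = (2,) * n
    # Step 1: rewrite A and B as block polynomials.
    Ab, Bb = to_blocks(A, b), to_blocks(B, b)
    # Step 2: evaluate every block coefficient (stored as a dense array of shape b).
    EA = {q: eval_nd(dense(c, b)) for q, c in Ab.items()}
    EB = {q: eval_nd(dense(c, b)) for q, c in Bb.items()}
    # Step 3: naive multiplication of the block polynomials; block coefficients
    # are multiplied entry-wise in the evaluated model.
    EC = {}
    for qa, va in EA.items():
        for qb, vb in EB.items():
            q = tuple(i + j for i, j in zip(qa, qb))
            EC[q] = EC.get(q, 0) + va * vb
    # Steps 4 and 5: interpolate each block coefficient and add up the
    # overlapping blocks via global exponents.
    C = {}
    for q, v in EC.items():
        for r, c in np.ndenumerate(interp_nd(v)):
            if c != 0:
                e = tuple(qi * 2 + ri for qi, ri in zip(q, r))
                C[e] = C.get(e, 0) + c
    return {e: c for e, c in C.items() if c != 0}

def dense(coeff, b):                          # local exponents -> dense coefficient array
    M = np.zeros(b, dtype=int)                # integer coefficients, for illustration
    for r, c in coeff.items():
        M[r] = c
    return M
```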
Let represent the size of
if there is no coefficient of
which accidentally
vanishes.
operations in .
Proof. Steps 2 and 4 require
operations. Step 3 requires
operations. Step 5
takes at most
operations in order to add up
overlapping block coefficients, where
.
Unfortunately, without any structural knowledge about the supports of
and
,
the quantity
requires a time
to be computed. In the next subsection we explain a heuristic approach
to find a good block size
.
In order to complete our multiplication algorithm, we must explain how
to set the block size. Quickly computing the block size that minimizes
the number of operations in the product of two given polynomials and
looks like a difficult
problem. In Section 5 we analyze it for asymptotically
large simplex supports, but, in this subsection, we describe a heuristic
for the general case.
For approximating a good block size we first need to approximate the
complexity in Proposition 2 by a function, written , which is intended to be easy to
compute in terms of
,
, and the input supports. For
instance one may opt for the formula
Nevertheless, if and
are
expected to be not too sparse, then one may assume
and use the formula
Then we begin the approximating process with , and keep doubling the entry of
which reduces
as much as possible. We stop as
soon as
cannot be further reduced in this way.
Given
and
,
we write
for
.
In order to make explicit the dependency in
during the execution of the algorithm, we write
instead of
.
Algorithm block-size
Input: .
Output: a block size for multiplying and .
1. Set .
2. While and are supported by at least two monomials, repeat:
   a. Let be such that is minimal.
   b. If , then return .
   c. Set .
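A sketch of this heuristic follows; the argument cost is a stand-in for the approximation discussed above, and for simplicity we stop as soon as no doubling reduces the estimate, without the extra guard on the supports.

```python
# Sketch of the block-size heuristic: start from b = (1, ..., 1) and repeatedly
# double the entry whose doubling reduces the cost estimate the most.
def block_size(suppA, suppB, n, cost):
    b = [1] * n
    best = cost(suppA, suppB, b)
    while True:
        trials = [b[:i] + [2 * b[i]] + b[i + 1:] for i in range(n)]
        scored = [(cost(suppA, suppB, t), t) for t in trials]
        c, t = min(scored, key=lambda p: p[0])
        if c >= best:                         # no entry can be doubled profitably
            return tuple(b)
        b, best = t, c
```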
In the computation tree model over ,
the block-size algorithm actually performs no operation in
. Its complexity must be
analyzed in terms of bit-operations, which means here
computation trees over
. For
this purpose we assume that the supports are represented by vectors of
monomials, with each monomial being stored as a vector of integers in
binary representation.
and
have total degrees at
most
, then the block-size algorithm performs
bit-operations, plus
calls to the function
.
and
throughout the main loop. Since
can be computed from
with
bit-operations, each iteration requires at most
bit-operations. The number of iterations is bounded by
.
Remark that in favorable cases the first few iterations are expected to
approximately halve at each stage. The expected
total complexity thus becomes closer to
.
Several problems arise if one tries to implement the basic blockwise multiplication algorithm from Section 3.1 without any modification:
The mere storage of ,
and
involves
coefficients in
.
This number is usually much larger than
.
Coefficients in usually will not fit into
cache memory, which makes the algorithm highly cache inefficient.
Both drawbacks can be removed by using a recursive multiplication
algorithm instead, where the evaluation-interpolation technique is
applied sequentially with respect to for
suitable pairwise distinct
.
Algorithm improved-block-multiply
Input: , and pairwise distinct .
Output: the product .
1. If then return block-multiply .
2. Let and rewrite and as block polynomials .
3. Compute and .
4. For each do , where is identified to .
5. Compute .
6. Reinterpret as a polynomial and return .
It is not hard to check that the improved algorithm has the same
complexity as the basic algorithm in terms of the number of operations
in . Let us now examine how
to choose
in order to increase the performance.
Setting
with
we first choose the set in such a way that
constants in
fit into cache
memory for a certain threshold
.
The idea is that the same coefficients should be reused as much as
possible in the inner multiplication, so
should
not be taken too small, e.g.
.
For a fixed set
of indexes, we next take
such that
,
thereby minimizing the memory consumption of the algorithm.
Let and
.
We define the truncation of
to
by
.
Notice that
if, and only if,
with
. It is sometimes
convenient to represent finite sets
by
polynomials
with
,
for instance by taking
, for
a given
. Given
and a fixed support
,
the computation of
is called the truncated
multiplication. The basic blockwise multiplication algorithm is
adapted as follows:
Algorithm truncated-block-multiply
Input: and .
Output: .
1. Rewrite , and as block polynomials .
2. Compute and .
3. Multiply using a naive algorithm.
4. Compute .
5. Reinterpret as a polynomial and return .
In a similar way as in Proposition 2, we can show that
(2)
operations in . However, this
bound is very pessimistic since
usually has a
special form which allows us to perform the naive truncated
multiplication in step 3 in much less than
operations. For instance, in the special and important case of truncated
power series at order
, we
have
and, by [16, Section 6], .
The naive truncated power series multiplication at order
can be performed using only
operations, as shown in [16, Section 6]. Hence, if and
, so
that
, then
can be replaced by
in our complexity bound (2).
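For illustration, here is a sketch of the naive truncated product for series truncated in total degree, with the dictionary representation used earlier; the strict truncation at the order is our convention.

```python
# Naive truncated product: pairs of terms whose total degrees already exceed
# the truncation order are skipped, and truncated terms are discarded.
def naive_truncated_mul(A, B, order):
    C = {}
    for ea, ca in A.items():
        if sum(ea) >= order:
            continue
        for eb, cb in B.items():
            e = tuple(i + j for i, j in zip(ea, eb))
            if sum(e) < order:                # keep only terms below the order
                C[e] = C.get(e, 0) + ca * cb
    return {e: c for e, c in C.items() if c != 0}
```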
Assuming that we have a way to estimate the number of operations in necessary to compute the truncated product
using the naive algorithm, we can adapt the heuristic of
Section 3.2 to determine a candidate block size. Finally
let us mention that the additional implementation tricks from Section 4.1 also extend to the truncated case in a straightforward
way.
If we increase the block size in
block-multiply, then the block coefficients
get sparser and sparser. At a certain point, this makes it
pointless to further increase
.
One final optimization which can often be applied is to decompose
and
in such a way that the
multiplication
can be done using a larger block
size than the remaining part
of the
multiplication
. Instead of
providing a full and rather technical algorithm, let us illustrate this
idea with the example
The naive multiplication performs products in .
The direct use of the Karatsuba algorithm with
, reduces to the naive
product of two polynomials of
terms with
coefficients in
, which
amounts to
multiplications in
. If we decompose
and
into
then takes
products in
, and the naive
multiplication for
uses
products in
. In this
example, the separate treatment of the border saves 7 products in
out of the 36 of the naive algorithm, and is faster
than the direct use of the blockwise algorithm.
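In code, the idea can be sketched as follows, reusing block_multiply and naive_sparse_mul from the earlier sketches; split_dense is a hypothetical placeholder for whatever criterion separates the dense core from the sparse border.

```python
# Sketch of the border optimisation: the dense cores are multiplied blockwise
# (where a large block size pays off) and the sparse borders naively.
def bordered_multiply(A, B, n, split_dense):
    A1, A2 = split_dense(A)                   # dense core and border of A
    B1, B2 = split_dense(B)
    C = block_multiply(A1, B1, n)
    for P, Q in ((A1, B2), (A2, B1), (A2, B2)):
        for e, c in naive_sparse_mul(P, Q).items():
            C[e] = C.get(e, 0) + c
    return {e: c for e, c in C.items() if c != 0}
```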
In this section we focus on the specific and important problems of
multiplying polynomials of total degree at most
, and truncated power series
at order
. Recall from
Section 4.2 that
.
Let
Let be the first value of
such that
, and define
. Numeric computations yield
and
.
, and assume that
admits evaluation-interpolation schemes with
and
, for all
, and for some constant
. Then two polynomials in
of degree at most
can be
multiplied with
operations in
, if the sizes of the blocks
are chosen as follows:
, if
,
, if
,
, if
.
Proof.
By Stirling's formula, there exists a constant
such that
From , we obtain the
existence of a constant
such that the following
inequality holds for all
and
:
Therefore there exists a constant such that
hold for all and
.
By Proposition 2, the cost of the multiplication is bounded
by
with
where , since
,
,
and
. With
,
and
using (1), there exists a constant
such that:
Since and since
,
we can bound
by
,
and deduce the existence of another constant
such that
From we have that
,
and since
it follows that
, and that there exists a constant
such that
(3)
In order to express the execution time in terms of the output size, we thus have to study the function
The first are plotted in Figure 1.
The right limit of
when
tends to
is 1, while the same limit for the
other
is
.
Since
whenever
,
the sign of
is the same as the one of
From and
,
it follows that there exists a unique zero, written
, of
for each
. These zeros can be approximated
by classical ball arithmetic techniques. We used the
numerix package of the Mathemagix system [17], which is based on MPFR [9].
Still using ball arithmetic, one checks that
achieves its maximum for
at
. Given
,
Lemma 9 of the appendix shows that
.
So far, we have shown that taking for
,
for
and
for larger
, the inequality
holds for all
. Therefore
there exists a constant
,
such that
for all
and
, or
and
.
It remains to bound for
and
sufficiently large. There exists a constant
such that:
Therefore, taking , there
exists a constant
with
which implies that when
is sufficiently large.
Remark in terms of
.
However our choice of
is suboptimal. For
instance, taking
for
is
generally better whenever
remains in
. In fact, we think that the algorithm block-size will choose the block size just as well, and maybe even a little better than the latter bounds suggest, when using a sufficiently precise function . A detailed proof of this claim seems quite technical, so we have not pursued this direction any further in the present article. Nevertheless, the claim is
plausible for the following reasons:
For symmetry reasons, the block size
computed by block-size should usually be of the form
for some
(and modulo
permutation of the indeterminates).
The complexity of of the above form is close to what we
would obtain by taking
in the proof of
Theorem 4. When allowing for real
, the maximum of
is
now reached at
, with
and
.
For rings which do not support large evaluation-interpolation schemes we propose the following extension of Theorem 4.
, and assume that
admits a unit. Then two polynomials in
of degree
at most
can be multiplied using at most
operations in
.
Proof. The unit of
is
written
, and we introduce
and
.
By using the aforementioned TFT algorithm, we can multiply
and
in
(resp. in
) via Theorem 4, with
(resp. with
) being the next power of
(resp. of
) of
if we do not perform the divisions by
(resp. by
). We thus obtain
and
.
If
then we deduce
as
.
Replacing all arithmetic operations in by
arithmetic operations in
gives rise to an
additional overhead of
with respect to the
complexity analysis from the proof of Theorem 4. More
precisely, we have a new constant
such that
The result is thus proved for when for a
sufficiently large constant
.
It remains to examine the case when
.
Again, we argue as at the end of the proof of Theorem 4,
with new constants
and
such that
Hence for
sufficiently
large.
For rings which do not have a unit, we can use a Karatsuba evaluation-interpolation scheme. For this purpose we introduce
Let be the unique positive zero of
, and let
.
Numeric computations lead to
and
.
, let
be any ring. Then two polynomials in
of degree
at most
can be multiplied using at most
operations in
.
and
for
all
. We follow the proof of
Theorem 4 and begin with a new constant
such that
In order to express the execution time in terms of the output size, we thus have to study the function
The right limit of when
tends to
is 1, while the same limit for the
other
is
.
Since
whenever
,
the sign of
is the same as the one of
If , then
and
, which implies the
existence of a unique zero of
,
written
. In particular, with
, we obtain
. For convenience we let
, and display the first few values:
One can verify that holds for all
, which implies that
for all
. Therefore
with
such that
(with the convention that
).
Lemma 10 of the appendix provides us with
. Therefore there exists a constant
, such that
for all
and
,
or
and
.
It remains to bound for
and
sufficiently large, as at the end of the
proof of Theorem 4. In this range we take
with
, so that
. There exist constants
and
such that:
Therefore, using , we have
whenever is sufficiently large.
, and assume that
admits evaluation-interpolation schemes with
and
, for all
, and for some constant
. Then two power series in
truncated at order
can be
multiplied with
operations in
, if the sizes of the blocks
are chosen as follows:
, if
,
, if
,
, if
.
the following inequalities hold:
The cost of the multiplication is bounded by
with
and
.
There exists a new constant
such that:
The present situation is similar to the one of the proof of Theorem 4, with instead of
. Therefore by taking
for
,
for
and
for larger
, there exists a constant
, such that
for all
and
,
or
and
.
It remains to bound for
and
sufficiently large. There exists a new
constant
such that:
Taking , we thus have a
constant
with
We conclude that for
sufficiently large.
P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, 1997.
J. Canny, E. Kaltofen, and Y. Lakshman. Solving systems of non-linear polynomial equations faster. In Proc. ISSAC 1989, pages 121–128, Portland, Oregon, 1989. ACM Press.
D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
S. Czapor, K. Geddes, and G. Labahn. Algorithms for Computer Algebra. Kluwer Academic Publishers, 1992.
J. H. Davenport, Y. Siret, and É. Tournier. Calcul formel : systèmes et algorithmes de manipulations algébriques. Masson, Paris, France, 1987.
R. Fateman. Comparing the speed of programs for sparse polynomial multiplication. SIGSAM Bull., 37(1):4–15, 2003.
L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Transactions on Mathematical Software, 33(2), June 2007. Software available at http://www.mpfr.org.
M. Gastineau and J. Laskar. Development of TRIP: Fast sparse multivariate polynomial multiplication using burst tries. In Proceedings of ICCS 2006, LNCS 3992, pages 446–453. Springer, 2006.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, 2nd edition, 2003.
G. Hanrot and P. Zimmermann. A long note on Mulders' short product. J. Symbolic Comput., 37(3):391–401, 2004.
J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34(6):479–542, 2002.
J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290–296, Univ. of Cantabria, Santander, Spain, July 4–7 2004.
J. van der Hoeven. Notes on the truncated Fourier transform. Technical Report 2005-5, Université Paris-Sud, Orsay, France, 2005.
J. van der Hoeven. Newton's method and FFT trading. J. Symbolic Comput., 45(8), 2010.
J. van der Hoeven et al. Mathemagix. Available from http://www.mathemagix.org, 2002.
J. van der Hoeven and G. Lecerf. On the bit-complexity of sparse polynomial multiplication. Technical Report http://hal.archives-ouvertes.fr/hal-00476223, École polytechnique and CNRS, 2010.
J. van der Hoeven and É. Schost. Multi-point evaluation in higher dimensions. Technical report, HAL, 2010. http://hal.archives-ouvertes.fr/hal-00477658.
S. C. Johnson. Sparse polynomial arithmetic. SIGSAM Bull., 8(3):63–71, 1974.
A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
G. I. Malaschonok and E. S. Satina. Fast multiplication and sparse structures. Programming and Computer Software, 30(2):105–109, 2004.
M. Monagan and R. Pearce. Polynomial division using dynamic arrays, heaps, and packed exponent vectors. In Proc. of CASC 2007, pages 295–315. Springer, 2007.
M. Monagan and R. Pearce. Parallel sparse polynomial multiplication using heaps. In Proc. ISSAC 2009, pages 263–270, New York, NY, USA, 2009. ACM.
M. Monagan and R. Pearce. Polynomial multiplication and division in Maple 14. Communications in Computer Algebra, 44(4):205–209, 2010.
M. Monagan and R. Pearce. Sparse polynomial pseudo division using a heap. J. Symbolic Comput., 46(7):807–822, 2011.
A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281–292, 1971.
D. R. Stoutemyer. Which polynomial representation is best? In Proceedings of the 1984 MACSYMA Users' Conference: Schenectady, New York, July 23–25, 1984, pages 221–243, 1984.
A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.
T. Yan. The geobucket data structure for polynomials. J. Symbolic Comput., 25(3):285–293, 1998.
This appendix contains details on how one can bound the functions and
used in Section 5.
We first prove the bounds in an explicit neighbourhood of infinity and
then we rely on certified ball arithmetic for the remaining compact set.
and
. We shall first show that
for all
.
Let
. Since
, it suffices to show that
. Since
,
we let
, and have to prove
that
. The latter inequality
is correct since
, and
for
. On
the other hand, for
we have
Since , we have shown that
for
.
For
between
and
, we appeal to numeric computations
to verify that
still holds.
,
and
. Let
. We shall first show that
for all
. Let
. Since
,
it suffices to show that
.
Since
, we let
, and have to prove that
. The latter inequality is correct since
, and
.
Let and
.
We introduce
. Since
, for
,
we have
Since , we have shown that
for
.
For , we symbolically
computed
and straightforwardly lower bounded it
by a function
in the domain
. For each value of
, we used ball arithmetic to show that
. For smaller values of
, we could directly compute that
whenever
. Finally we
computed that
for all
, which concludes the proof.