|
Abstract
The Chinese remainder theorem is a key tool for the
design of efficient multi-modular algorithms. In this paper, we study
the case when the moduli are fixed and can
even be chosen by the user. If
is small or
moderately large, then we show how to choose gentle moduli
that allow for speedier Chinese remaindering. The multiplication of
integer matrices is one typical application where we expect practical
gains for various common matrix dimensions and bitsizes of the
coefficients.
Keywords: Chinese remainder theorem, algorithm, complexity, integer matrix multiplication
Modular reduction is an important tool for speeding up computations in computer arithmetic, symbolic computation, and elsewhere. The technique allows to reduce a problem that involves large integer or polynomial coefficients to one or more similar problems that only involve small modular coefficients. Depending on the application, the solution to the initial problem is reconstructed via the Chinese remainder theorem or Hensel's lemma. We refer to [9, chapter 5] for a gentle introduction to this topic.
In this paper, we will mainly be concerned with multi-modular algorithms
over the integers that rely on the Chinese remainder theorem. Given with
, we
will denote by
the remainder of the Euclidean
division of
by
.
Given an
matrix
with
integer coefficients, we will also denote
for
the matrix with coefficients
.
One typical application of Chinese remaindering is the multiplication of
integer matrices
.
Assuming that we have a bound
with
for all
, we
proceed as follows:
Select moduli with
that are mutually coprime.
Compute and
for
.
Multiply for
.
Reconstruct from the
with
.
The simultaneous computation of from
for all
is called the problem of
multi-modular reduction. In step
,
we need to perform
multi-modular reductions for
the coefficients of
and
. The inverse problem of reconstructing
from the
with
is called the problem of multi-modular reconstruction. We need
to perform
such reconstructions in step 3. Our
hypothesis on
allows us to recover
from
.
Let us quickly examine when and why the above strategy pays off. In this
paper, the number should be small or moderately
large, say
. The moduli
typically fit into a machine word. Denoting by
the bitsize of a machine word (say
or
), the coefficients of
and
should therefore be
of bitsize
.
For small , integer
multiplications of bitsize
are usually carried
out using a naive algorithm, of complexity
.
If we directly compute the product
using
naive integer multiplications, the computation time
is therefore of order
. In
comparison, as we will see, one naive multi-modular reduction or
reconstruction for
moduli roughly requires
machine operations, whereas an
matrix multiplication modulo any of the
can be
done in time
. Altogether,
this means that the above multi-modular algorithm for integer matrix
multiplication has running time
,
which is
times faster than the naive algorithm.
If , then the cost
of steps 1 and 3 is negligible with respect to the cost
of step 2. However, if
and
are of the same order of magnitude, then
Chinese remaindering may take an important part of the computation time;
the main purpose of this paper is to reduce this cost. If
, then we notice that other algorithms for
matrix multiplication usually become faster, such as naive
multiplication for small
,
Karatsuba multiplication [13] for larger
, or FFT-based techniques [6] for
very large
.
Two observations are crucial for reducing the cost of Chinese
remaindering. First of all, the moduli are the
same for all
multi-modular reductions and
multi-modular reconstructions in steps
and
. If
is large, then this means that we can essentially
assume that
were fixed once and for all.
Secondly, we are free to choose
in any way that
suits us. We will exploit these observations by precomputing gentle
moduli for which Chinese remaindering can be performed more
efficiently than for ordinary moduli.
The first idea behind gentle moduli is to consider moduli of the form
,
where
is somewhat smaller than
, where
is even, and
. In section 3.1,
we will show that multi-modular reduction and reconstruction both become
a lot simpler for such moduli. Secondly, each
can be factored as
and, if we are lucky, then
both
and
can be factored
into
moduli of bitsize
. If we are very lucky, then this allows us to
obtain
moduli
of bitsize
that are mutually coprime and for which Chinese
remaindering can be implemented efficiently.
Let us briefly outline the structure of this paper. In section 2,
we rapidly recall basic facts about Chinese remaindering and naive
algorithms for this task. In section 3, we introduce gentle
moduli and describe how to speed up Chinese remaindering with respect to
such moduli. The last section 4 is dedicated to the brute
force search of gentle moduli for specific values of
and
. We implemented a
sieving method in
. We expect our technique to be efficient for
, but this still needs to be
confirmed via an actual implementation. The application to
integer matrix multiplication in section 4.3 also has not
been implemented yet.
Let us finally discuss a few related results. In this paper, we have
chosen to highlight integer matrix multiplication as one typical
application in computer algebra. Multi-modular methods are used in many
other areas and the operations of multi-modular reduction and
reconstruction are also known as conversions between the positional
number system (PNS) and the residue number system (RNS). Asymptotically
fast algorithms are based on remainder trees [8,
14, 3], with recent improvements in [4,
2, 10]; we expect such algorithms to become
more efficient when exceeds
.
Special moduli of the form are also known as
pseudo-Mersenne moduli. They have been exploited before in
cryptography [1] in a similar way as in section 3.1,
but with a slightly different focus: whereas the authors of [1]
are keen on reducing the number of additions (good for circuit
complexity), we rather optimize the number of machine instructions on
recent general purpose CPUs (good for software implementations). Our
idea to choose moduli
that can be factored into
smaller moduli is new.
Other algorithms for speeding up multiple multi-modular reductions and
reconstructions for the same moduli (while allowing for additional
pre-computations) have recently been proposed in [7]. These
algorithms essentially replace all divisions by simple multiplications
and can be used in conjunction with our new algorithms for conversions
between residues modulo and residues modulo
.
For any integer , we recall
that
. For machine
computations, it is convenient to use the following effective version of
the Chinese remainder theorem:
Chinese Remainder Theorem. Let
be positive integers that are mutually coprime and denote
. There exist
such that
for any
, the number
satisfies for
.
Proof. For each ,
let
. Since
and
are coprime,
admits
an inverse
modulo
in
. We claim that we may take
. Indeed, for
and any
, we
have
Since is divisible by
for all
, this congruence
relation simplifies into
This proves our claim and the theorem.
Notation. We call the
cofactors for
in
and
also denote these numbers by
.
For practical computations, the moduli are
usually chosen such that they fit into single machine words. Let
denote the bitsize of a machine word, so that we
typically have
or
.
It depends on specifics of the processor how basic arithmetic operations
modulo
can be implemented most efficiently.
For instance, some processors have instructions for multiplying two
-bit integers and return the
exact
-bit product. If not,
then we rather have to assume that the moduli
fit into
instead of
bits, or replace
by
. Some processors do not provide efficient integer
arithmetic at all. In that case, one might rather wish to rely on
floating point arithmetic and take
(assuming
that we have hardware support for double precision). For floating point
arithmetic it also matters whether the processor offers a
“fused-multiply-add” (FMA) instruction; this essentially
provides us with an efficient way to multiply two
-bit integers exactly using floating point
arithmetic.
It is also recommended to choose moduli that fit
into slightly less than
bits whenever possible.
Such extra bits can be used to significantly accelerate implementations
of modular arithmetic. For a more detailed survey of practically
efficient algorithms for modular arithmetic, we refer to [12].
Let ,
,
and
be as in the Chinese remainder theorem. We will refer to the computation
of
as a function of
as
the problem of multi-modular reduction. The inverse problem is
called multi-modular reconstruction. In what follows, we assume
that
have been fixed once and for all.
The simplest way to perform multi-modular reduction is to simply take
Inversely, the Chinese remainder theorem provides us with a formula for multi-modular reconstruction:
Since are fixed, the computation of the
cofactors
can be regarded as a precomputation.
Assume that our hardware provides an instruction for the exact
multiplication of two integers that fit into a machine word. If fits into a machine word, then so does the remainder
. Cutting
into
machine words, it follows that the product
can be computed using
hardware products and
hardware additions.
Inversely, the Euclidean division of an
-word
integer
by
can be done
using
multiplications and
additions/subtractions: we essentially have to multiply the quotient by
and subtract the result from
; each next word of the quotient is obtained
through a one word multiplication with an approximate inverse of
.
The above analysis shows that the naive algorithm for multi-modular
reduction of modulo
requires
hardware multiplications and
additions. The multi-modular reconstruction of
can be done using only
multiplications and
additions. Depending on the
hardware, the moduli
, and
the way we implement things,
more operations may
be required for the carry handling—but it is beyond the scope of
this paper to go into this level of detail.
Let us now reconsider the naive algorithms from section 2.3,
but in the case when the moduli are all close to
a specific power of two. More precisely, we assume that
where and
a small
number. As usual, we assume that the
are
pairwise coprime and we let
.
We also assume that
is slightly smaller than
and that we have a hardware instruction for the
exact multiplication of
-bit
integers.
For moduli as in (3), the naive
algorithm for the Euclidean division of a number
by
becomes particularly simple and essentially
boils down to the multiplication of
with the
quotient of this division. In other words, the remainder can be computed
using
hardware multiplications. In comparison,
the algorithm from section 2.3 requires
multiplication when applied to
-bit
(instead of
-bit) integers.
More generally, the computation of
remainders
can be done using
instead of
multiplications. This leads to a
potential gain of a factor
,
although the remainders are
-bit
integers instead of
-bit
integers, for the time being.
Multi-modular reconstruction can also be done faster, as follows, using
similar techniques as in [1, 5]. Let . Besides the usual binary
representation of
and the multi-modular
representation
, it is also
possible to use the mixed radix representation (or Newton
representation)
where . Let us now show how
to obtain
efficiently from
. Since
,
we must take
. Assume that
have been computed. For
we next compute
using and
Notice that can be computed using
hardware multiplications. We have
Now the inverse of
modulo
can be precomputed. We finally compute
which requires multiplications. For small
values of
, we notice that it
may be faster to divide successively by
modulo
instead of multiplying with
. In total, the computation of the mixed radix
representation
can be done using
multiplications. Having computed the mixed radix
representation, we next compute
for , using the recurrence
relation
Since , the computation of
requires
multiplications. Altogether, the computation of
from
can therefore be done using
hardware multiplications.
For practical applications, we usually wish to work with moduli that fit
into one word (instead of words). With the
as in the previous subsection, this means that we
need to impose the further requirement that each modulus
can be factored
with . If this is possible,
then the
are called
-gentle moduli. For given bitsizes
and
, the main
questions are now: do such moduli indeed exist? If so, then how to find
them?
If , then it is easy to
construct
-gentle moduli
by taking
,
where
is odd. Indeed,
and . Unfortunately, this
trick does not generalize to higher values
.
Indeed, consider a product
where are small compared to
. If the coefficient
of
vanishes, then the coefficient of
becomes the opposite
of a sum of
squares. In particular, both coefficients cannot vanish simultaneously,
unless
.
If , then we are left with
the option to search
-gentle
moduli by brute force. As long as
is
“reasonably small” (say
),
the probability to hit an
-gentle
modulus for a randomly chosen
often remains
significantly larger than
.
We may then use sieving to find such moduli. By what precedes, it is
also desirable to systematically take
for
. This has the additional benefit
that we “only” have to consider
possibilities for
.
We will discuss sieving in more detail in the next section. Assuming
that we indeed have found -gentle
moduli
, we may use the naive
algorithms from section 2.3 to compute
from
and vice versa for
. Given
for all
, this allows us to compute all
remainders
using
hardware multiplications, whereas the opposite conversion only requires
multiplications. Altogether, we may thus obtain
the remainders
from
and
vice versa using
multiplications.
We implemented a sieving procedure in and
,
the goal of our procedure is to find
-gentle
moduli of the form
with the constraints that
for , and
. The parameter
is
small and even. One should interpret
and
as the intended and maximal bitsize of the small
moduli
. The parameter
stands for the minimal bitsize of a prime factor of
. The parameter
should be such that
fits into a
machine word.
In Table 1 below we have shown some experimental results
for this sieving procedure in the case when ,
,
and
. For
, the table provides us with
, the moduli
, as well as the smallest prime power factors
of the product
. Many hits
admit small prime factors, which increases the risk that different hits
are not coprime. For instance, the number
divides both
and
,
whence these
-gentle moduli
cannot be selected simultaneously (except if one is ready to sacrifice a
few bits by working modulo
instead of
).
In the case when we use multi-modular arithmetic for computations with
rational numbers instead of integers (see [9, section 5
and, more particularly, section 5.10]), then small prime factors should
completely be prohibited, since they increase the probability of
divisions by zero. For such applications, it is therefore desirable that
are all prime. In our table, this occurs for
(we indicated this by highlighting the list of
prime factors of
).
In order to make multi-modular reduction and reconstruction as efficient
as possible, a desirable property of the moduli
is that they either divide
or
. In our table, we highlighted the
for which this happens. We notice that this is
automatically the case if
are all prime. If only
a small number of
(say a single one) do not
divide either
or
,
then we remark that it should still be possible to design reasonably
efficient ad hoc algorithms for multi-modular reduction and
reconstruction.
Another desirable property of the moduli is that
is as small as possible: the spare bits can for
instance be used to speed up matrix multiplication modulo
. Notice however that one
“occasional” large modulus
only
impacts on one out of
modular matrix products;
this alleviates the negative impact of such moduli. We refer to section
4.3 below for more details.
For actual applications, one should select gentle moduli that combine all desirable properties mentioned above. If not enough such moduli can be found, then it it depends on the application which criteria are most important and which ones can be released.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Ideally speaking, we want to be as large as
possible. Furthermore, in order to waste as few bits as possible,
should be close to the word size (or half of it) and
should be minimized. When using double precision
floating point arithmetic, this means that we wish to take
. Whenever we have efficient hardware support
for integer arithmetic, then we might prefer
.
Let us start by studying the influence of on the
number of hits. In Table 2, we have increased
by one with respect to Table 1. This results
in an approximate
increase of the
“capacity”
of the modulus
. On the one hand, we observe that
the hit rate of the sieve procedure roughly decreases by a factor of
thirty. On the other hand, we notice that the rare gentle moduli that we
do find are often of high quality (on four occasions the moduli
are all prime in Table 2).
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Without surprise, the hit rate also sharply decreases if we attempt to
increase . The results for
and
are shown in Table
3. A further infortunate side effect is that the quality of
the gentle moduli that we do find also decreases. Indeed, on the one
hand,
tends to systematically admit at least one
small prime factor. On the other hand, it is rarely the case that each
divides either
or
(this might nevertheless happen for other
recombinations of the prime factors of
,
but only modulo a further increase of
).
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
An increase of while maintaining
and
fixed also results in a
decrease of the hit rate. Nevertheless, when going from
(floating point arithmetic) to
(integer
arithmetic), this is counterbalanced by the fact that
can also be taken larger (namely
);
see Table 4 for a concrete example. When doubling
and
while keeping the same upper
bound for
, the hit rate
remains more or less unchanged, but the rate of high quality hits tends
to decrease somewhat: see Table 5.
It should be possible to analyze the hit rate as a function of the
parameters ,
,
and
from a probabilistic point of view, using the idea that a random number
is prime with probability
. However, from a practical perspective, the
priority is to focus on the case when
.
For the most significant choices of parameters
and
, it should be possible
to compile full tables of
-gentle
moduli. Unfortunately, our current implementation is still somewhat
inefficient for
. A helpful
feature for upcoming versions of
(the current version only does this for
prime factors that can be tabulated).
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Let us finally return to our favourite application of multi-modular
arithmetic to the multiplication of integer matrices . From a practical point of view, the second
step of the algorithm from the introduction can be implemented very
efficiently if
fits into the size of a word.
When using floating point arithmetic, this means that we should have
for all
.
For large values of
, this is
unrealistic; in that case, we subdivide the
matrices into smaller
matrices with
. The fact that
may
depend on
is very significant. First of all, the
larger we can take
, the
faster we can multiply matrices modulo
.
Secondly, the
in the tables from the previous
sections often vary in bitsize. It frequently happens that we may take
all
large except for the last modulus
. The fact that matrix
multiplications modulo the worst modulus
are
somewhat slower is compensated by the fact that they only account for
one out of every
modular matrix products.
Several of the tables in the previous subsections were made with the
application to integer matrix multiplication in mind. Consider for
instance the modulus from Table 1.
When using floating point arithmetic, we obtain
,
,
,
,
and
.
Clearly, there is a trade-off between the efficiency of the modular
matrix multiplications (high values of
are
better) and the bitsize
of
(larger capacities are better).
J.-C. Bajard, M. E. Kaihara, and T. Plantard. Selected rns bases for modular multiplication. In Proc. of the 19th IEEE Symposium on Computer Arithmetic, pages 25–32, 2009.
D. Bernstein. Scaled remainder trees. Available from https://cr.yp.to/arith/scaledmod-20040820.pdf, 2004.
A. Borodin and R. T. Moenck. Fast modular transforms. Journal of Computer and System Sciences, 8:366–386, 1974.
A. Bostan, G. Lecerf, and É. Schost. Tellegen's principle into practice. In Proceedings of ISSAC 2003, pages 37–44. ACM Press, 2003.
A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. Journal of Complexity, 21(4):420–446, August 2005. Festschrift for the 70th Birthday of Arnold Schönhage.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
J. Doliskani, P. Giorgi, R. Lebreton, and É. Schost. Simultaneous conversions with the Residue Number System using linear algebra. Technical Report https://hal-lirmm.ccsd.cnrs.fr/lirmm-01415472, HAL, 2016.
C. M. Fiduccia. Polynomial evaluation via the division algorithm: the fast fourier transform revisited. In A. L. Rosenberg, editor, Fourth annual ACM symposium on theory of computing, pages 88–93, 1972.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.
J. van der Hoeven. Faster Chinese remaindering. Technical report, HAL, 2016. http://hal.archives-ouvertes.fr/hal-01403810.
J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002. http://www.mathemagix.org.
J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1):5:1–5:37, 2016.
A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
R. T. Moenck and A. Borodin. Fast modular transforms via division. In Thirteenth annual IEEE symposium on switching and automata theory, pages 90–96, Univ. Maryland, College Park, Md., 1972.
The PARI Group, Bordeaux. PARI/GP, 2012. Software available from http://pari.math.u-bordeaux.fr.