|
Abstract
The Chinese remainder theorem is a key tool for the design of efficient multi-modular algorithms. In this paper, we study the case when the moduli are fixed and can even be chosen by the user. If is small or moderately large, then we show how to choose gentle moduli that allow for speedier Chinese remaindering. The multiplication of integer matrices is one typical application where we expect practical gains for various common matrix dimensions and bitsizes of the coefficients.
Keywords: Chinese remainder theorem, algorithm, complexity, integer matrix multiplication
Modular reduction is an important tool for speeding up computations in computer arithmetic, symbolic computation, and elsewhere. The technique allows to reduce a problem that involves large integer or polynomial coefficients to one or more similar problems that only involve small modular coefficients. Depending on the application, the solution to the initial problem is reconstructed via the Chinese remainder theorem or Hensel's lemma. We refer to [9, chapter 5] for a gentle introduction to this topic.
In this paper, we will mainly be concerned with multi-modular algorithms over the integers that rely on the Chinese remainder theorem. Given with , we will denote by the remainder of the Euclidean division of by . Given an matrix with integer coefficients, we will also denote for the matrix with coefficients .
One typical application of Chinese remaindering is the multiplication of integer matrices . Assuming that we have a bound with for all , we proceed as follows:
Select moduli with that are mutually coprime.
Compute and for .
Multiply for .
Reconstruct from the with .
The simultaneous computation of from for all is called the problem of multi-modular reduction. In step , we need to perform multi-modular reductions for the coefficients of and . The inverse problem of reconstructing from the with is called the problem of multi-modular reconstruction. We need to perform such reconstructions in step 3. Our hypothesis on allows us to recover from .
Let us quickly examine when and why the above strategy pays off. In this paper, the number should be small or moderately large, say . The moduli typically fit into a machine word. Denoting by the bitsize of a machine word (say or ), the coefficients of and should therefore be of bitsize .
For small , integer multiplications of bitsize are usually carried out using a naive algorithm, of complexity . If we directly compute the product using naive integer multiplications, the computation time is therefore of order . In comparison, as we will see, one naive multi-modular reduction or reconstruction for moduli roughly requires machine operations, whereas an matrix multiplication modulo any of the can be done in time . Altogether, this means that the above multi-modular algorithm for integer matrix multiplication has running time , which is times faster than the naive algorithm.
If , then the cost of steps 1 and 3 is negligible with respect to the cost of step 2. However, if and are of the same order of magnitude, then Chinese remaindering may take an important part of the computation time; the main purpose of this paper is to reduce this cost. If , then we notice that other algorithms for matrix multiplication usually become faster, such as naive multiplication for small , Karatsuba multiplication [13] for larger , or FFT-based techniques [6] for very large .
Two observations are crucial for reducing the cost of Chinese remaindering. First of all, the moduli are the same for all multi-modular reductions and multi-modular reconstructions in steps and . If is large, then this means that we can essentially assume that were fixed once and for all. Secondly, we are free to choose in any way that suits us. We will exploit these observations by precomputing gentle moduli for which Chinese remaindering can be performed more efficiently than for ordinary moduli.
The first idea behind gentle moduli is to consider moduli of the form , where is somewhat smaller than , where is even, and . In section 3.1, we will show that multi-modular reduction and reconstruction both become a lot simpler for such moduli. Secondly, each can be factored as and, if we are lucky, then both and can be factored into moduli of bitsize . If we are very lucky, then this allows us to obtain moduli of bitsize that are mutually coprime and for which Chinese remaindering can be implemented efficiently.
Let us briefly outline the structure of this paper. In section 2,
we rapidly recall basic facts about Chinese remaindering and naive
algorithms for this task. In section 3, we introduce gentle
moduli and describe how to speed up Chinese remaindering with respect to
such moduli. The last section 4 is dedicated to the brute
force search of gentle moduli for specific values of
and . We implemented a
sieving method in
Let us finally discuss a few related results. In this paper, we have chosen to highlight integer matrix multiplication as one typical application in computer algebra. Multi-modular methods are used in many other areas and the operations of multi-modular reduction and reconstruction are also known as conversions between the positional number system (PNS) and the residue number system (RNS). Asymptotically fast algorithms are based on remainder trees [8, 14, 3], with recent improvements in [4, 2, 10]; we expect such algorithms to become more efficient when exceeds .
Special moduli of the form are also known as pseudo-Mersenne moduli. They have been exploited before in cryptography [1] in a similar way as in section 3.1, but with a slightly different focus: whereas the authors of [1] are keen on reducing the number of additions (good for circuit complexity), we rather optimize the number of machine instructions on recent general purpose CPUs (good for software implementations). Our idea to choose moduli that can be factored into smaller moduli is new.
Other algorithms for speeding up multiple multi-modular reductions and reconstructions for the same moduli (while allowing for additional pre-computations) have recently been proposed in [7]. These algorithms essentially replace all divisions by simple multiplications and can be used in conjunction with our new algorithms for conversions between residues modulo and residues modulo .
For any integer , we recall that . For machine computations, it is convenient to use the following effective version of the Chinese remainder theorem:
Chinese Remainder Theorem. Let be positive integers that are mutually coprime and denote . There exist such that for any , the number
satisfies for .
Proof. For each , let . Since and are coprime, admits an inverse modulo in . We claim that we may take . Indeed, for and any , we have
Since is divisible by for all , this congruence relation simplifies into
This proves our claim and the theorem.
Notation. We call the cofactors for in and also denote these numbers by .
For practical computations, the moduli are usually chosen such that they fit into single machine words. Let denote the bitsize of a machine word, so that we typically have or . It depends on specifics of the processor how basic arithmetic operations modulo can be implemented most efficiently.
For instance, some processors have instructions for multiplying two -bit integers and return the exact -bit product. If not, then we rather have to assume that the moduli fit into instead of bits, or replace by . Some processors do not provide efficient integer arithmetic at all. In that case, one might rather wish to rely on floating point arithmetic and take (assuming that we have hardware support for double precision). For floating point arithmetic it also matters whether the processor offers a “fused-multiply-add” (FMA) instruction; this essentially provides us with an efficient way to multiply two -bit integers exactly using floating point arithmetic.
It is also recommended to choose moduli that fit into slightly less than bits whenever possible. Such extra bits can be used to significantly accelerate implementations of modular arithmetic. For a more detailed survey of practically efficient algorithms for modular arithmetic, we refer to [12].
Let , , and be as in the Chinese remainder theorem. We will refer to the computation of as a function of as the problem of multi-modular reduction. The inverse problem is called multi-modular reconstruction. In what follows, we assume that have been fixed once and for all.
The simplest way to perform multi-modular reduction is to simply take
Inversely, the Chinese remainder theorem provides us with a formula for multi-modular reconstruction:
Since are fixed, the computation of the cofactors can be regarded as a precomputation.
Assume that our hardware provides an instruction for the exact multiplication of two integers that fit into a machine word. If fits into a machine word, then so does the remainder . Cutting into machine words, it follows that the product can be computed using hardware products and hardware additions. Inversely, the Euclidean division of an -word integer by can be done using multiplications and additions/subtractions: we essentially have to multiply the quotient by and subtract the result from ; each next word of the quotient is obtained through a one word multiplication with an approximate inverse of .
The above analysis shows that the naive algorithm for multi-modular reduction of modulo requires hardware multiplications and additions. The multi-modular reconstruction of can be done using only multiplications and additions. Depending on the hardware, the moduli , and the way we implement things, more operations may be required for the carry handling—but it is beyond the scope of this paper to go into this level of detail.
Let us now reconsider the naive algorithms from section 2.3, but in the case when the moduli are all close to a specific power of two. More precisely, we assume that
where and a small number. As usual, we assume that the are pairwise coprime and we let . We also assume that is slightly smaller than and that we have a hardware instruction for the exact multiplication of -bit integers.
For moduli as in (3), the naive algorithm for the Euclidean division of a number by becomes particularly simple and essentially boils down to the multiplication of with the quotient of this division. In other words, the remainder can be computed using hardware multiplications. In comparison, the algorithm from section 2.3 requires multiplication when applied to -bit (instead of -bit) integers. More generally, the computation of remainders can be done using instead of multiplications. This leads to a potential gain of a factor , although the remainders are -bit integers instead of -bit integers, for the time being.
Multi-modular reconstruction can also be done faster, as follows, using similar techniques as in [1, 5]. Let . Besides the usual binary representation of and the multi-modular representation , it is also possible to use the mixed radix representation (or Newton representation)
where . Let us now show how to obtain efficiently from . Since , we must take . Assume that have been computed. For we next compute
using and
Notice that can be computed using hardware multiplications. We have
Now the inverse of modulo can be precomputed. We finally compute
which requires multiplications. For small values of , we notice that it may be faster to divide successively by modulo instead of multiplying with . In total, the computation of the mixed radix representation can be done using multiplications. Having computed the mixed radix representation, we next compute
for , using the recurrence relation
Since , the computation of requires multiplications. Altogether, the computation of from can therefore be done using hardware multiplications.
For practical applications, we usually wish to work with moduli that fit into one word (instead of words). With the as in the previous subsection, this means that we need to impose the further requirement that each modulus can be factored
with . If this is possible, then the are called -gentle moduli. For given bitsizes and , the main questions are now: do such moduli indeed exist? If so, then how to find them?
If , then it is easy to construct -gentle moduli by taking , where is odd. Indeed,
and . Unfortunately, this trick does not generalize to higher values . Indeed, consider a product
where are small compared to . If the coefficient of vanishes, then the coefficient of becomes the opposite of a sum of squares. In particular, both coefficients cannot vanish simultaneously, unless .
If , then we are left with the option to search -gentle moduli by brute force. As long as is “reasonably small” (say ), the probability to hit an -gentle modulus for a randomly chosen often remains significantly larger than . We may then use sieving to find such moduli. By what precedes, it is also desirable to systematically take for . This has the additional benefit that we “only” have to consider possibilities for .
We will discuss sieving in more detail in the next section. Assuming that we indeed have found -gentle moduli , we may use the naive algorithms from section 2.3 to compute from and vice versa for . Given for all , this allows us to compute all remainders using hardware multiplications, whereas the opposite conversion only requires multiplications. Altogether, we may thus obtain the remainders from and vice versa using multiplications.
We implemented a sieving procedure in
with the constraints that
for , and . The parameter is small and even. One should interpret and as the intended and maximal bitsize of the small moduli . The parameter stands for the minimal bitsize of a prime factor of . The parameter should be such that fits into a machine word.
In Table 1 below we have shown some experimental results for this sieving procedure in the case when , , and . For , the table provides us with , the moduli , as well as the smallest prime power factors of the product . Many hits admit small prime factors, which increases the risk that different hits are not coprime. For instance, the number divides both and , whence these -gentle moduli cannot be selected simultaneously (except if one is ready to sacrifice a few bits by working modulo instead of ).
In the case when we use multi-modular arithmetic for computations with rational numbers instead of integers (see [9, section 5 and, more particularly, section 5.10]), then small prime factors should completely be prohibited, since they increase the probability of divisions by zero. For such applications, it is therefore desirable that are all prime. In our table, this occurs for (we indicated this by highlighting the list of prime factors of ).
In order to make multi-modular reduction and reconstruction as efficient as possible, a desirable property of the moduli is that they either divide or . In our table, we highlighted the for which this happens. We notice that this is automatically the case if are all prime. If only a small number of (say a single one) do not divide either or , then we remark that it should still be possible to design reasonably efficient ad hoc algorithms for multi-modular reduction and reconstruction.
Another desirable property of the moduli is that is as small as possible: the spare bits can for instance be used to speed up matrix multiplication modulo . Notice however that one “occasional” large modulus only impacts on one out of modular matrix products; this alleviates the negative impact of such moduli. We refer to section 4.3 below for more details.
For actual applications, one should select gentle moduli that combine all desirable properties mentioned above. If not enough such moduli can be found, then it it depends on the application which criteria are most important and which ones can be released.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Ideally speaking, we want to be as large as possible. Furthermore, in order to waste as few bits as possible, should be close to the word size (or half of it) and should be minimized. When using double precision floating point arithmetic, this means that we wish to take . Whenever we have efficient hardware support for integer arithmetic, then we might prefer .
Let us start by studying the influence of on the number of hits. In Table 2, we have increased by one with respect to Table 1. This results in an approximate increase of the “capacity” of the modulus . On the one hand, we observe that the hit rate of the sieve procedure roughly decreases by a factor of thirty. On the other hand, we notice that the rare gentle moduli that we do find are often of high quality (on four occasions the moduli are all prime in Table 2).
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Without surprise, the hit rate also sharply decreases if we attempt to increase . The results for and are shown in Table 3. A further infortunate side effect is that the quality of the gentle moduli that we do find also decreases. Indeed, on the one hand, tends to systematically admit at least one small prime factor. On the other hand, it is rarely the case that each divides either or (this might nevertheless happen for other recombinations of the prime factors of , but only modulo a further increase of ).
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
An increase of while maintaining and fixed also results in a decrease of the hit rate. Nevertheless, when going from (floating point arithmetic) to (integer arithmetic), this is counterbalanced by the fact that can also be taken larger (namely ); see Table 4 for a concrete example. When doubling and while keeping the same upper bound for , the hit rate remains more or less unchanged, but the rate of high quality hits tends to decrease somewhat: see Table 5.
It should be possible to analyze the hit rate as a function of the
parameters , , and
from a probabilistic point of view, using the idea that a random number
is prime with probability . However, from a practical perspective, the
priority is to focus on the case when .
For the most significant choices of parameters
and , it should be possible
to compile full tables of -gentle
moduli. Unfortunately, our current implementation is still somewhat
inefficient for . A helpful
feature for upcoming versions of
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Let us finally return to our favourite application of multi-modular arithmetic to the multiplication of integer matrices . From a practical point of view, the second step of the algorithm from the introduction can be implemented very efficiently if fits into the size of a word.
When using floating point arithmetic, this means that we should have for all . For large values of , this is unrealistic; in that case, we subdivide the matrices into smaller matrices with . The fact that may depend on is very significant. First of all, the larger we can take , the faster we can multiply matrices modulo . Secondly, the in the tables from the previous sections often vary in bitsize. It frequently happens that we may take all large except for the last modulus . The fact that matrix multiplications modulo the worst modulus are somewhat slower is compensated by the fact that they only account for one out of every modular matrix products.
Several of the tables in the previous subsections were made with the application to integer matrix multiplication in mind. Consider for instance the modulus from Table 1. When using floating point arithmetic, we obtain , , , , and . Clearly, there is a trade-off between the efficiency of the modular matrix multiplications (high values of are better) and the bitsize of (larger capacities are better).
J.-C. Bajard, M. E. Kaihara, and T. Plantard. Selected rns bases for modular multiplication. In Proc. of the 19th IEEE Symposium on Computer Arithmetic, pages 25–32, 2009.
D. Bernstein. Scaled remainder trees. Available from https://cr.yp.to/arith/scaledmod-20040820.pdf, 2004.
A. Borodin and R. T. Moenck. Fast modular transforms. Journal of Computer and System Sciences, 8:366–386, 1974.
A. Bostan, G. Lecerf, and É. Schost. Tellegen's principle into practice. In Proceedings of ISSAC 2003, pages 37–44. ACM Press, 2003.
A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. Journal of Complexity, 21(4):420–446, August 2005. Festschrift for the 70th Birthday of Arnold Schönhage.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
J. Doliskani, P. Giorgi, R. Lebreton, and É. Schost. Simultaneous conversions with the Residue Number System using linear algebra. Technical Report https://hal-lirmm.ccsd.cnrs.fr/lirmm-01415472, HAL, 2016.
C. M. Fiduccia. Polynomial evaluation via the division algorithm: the fast fourier transform revisited. In A. L. Rosenberg, editor, Fourth annual ACM symposium on theory of computing, pages 88–93, 1972.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.
J. van der Hoeven. Faster Chinese remaindering. Technical report, HAL, 2016. http://hal.archives-ouvertes.fr/hal-01403810.
J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002. http://www.mathemagix.org.
J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1):5:1–5:37, 2016.
A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
R. T. Moenck and A. Borodin. Fast modular transforms via division. In Thirteenth annual IEEE symposium on switching and automata theory, pages 90–96, Univ. Maryland, College Park, Md., 1972.
The PARI Group, Bordeaux. PARI/GP, 2012. Software available from http://pari.math.u-bordeaux.fr.