This paper is concerned with the efficient multiplication of polynomials over the finite field F_2.
Modern algorithms for fast polynomial multiplication are generally based
on evaluation-interpolation strategies and more particularly on
the discrete Fourier transform (DFT). Taking coefficients in the
finite field F_2 with two elements, the problem of multiplying in F_2[x] is also known as carryless integer multiplication (assuming binary notation). The aim of this paper is to present a practically efficient solution for large degrees.
One major obstruction to evaluation-interpolation strategies over small finite fields is the potential lack of evaluation points. The customary remedy is to work in suitable extension fields. There remains the question of how to reduce the incurred overhead as much as possible.
More specifically, it was shown in [7] that multiplication in F_2[x] can be done efficiently by reducing it to polynomial multiplication over the Babylonian field F_{2^60}. Part of this reduction relied on Kronecker segmentation, which involves an overhead of a factor two. In this paper, we present a variant of a new algorithm from [11] that removes this overhead almost entirely. We also report on our implementation of this variant and on the speed-ups that it achieves in practice.
For a long time, the best known algorithm for carryless integer multiplication was Schönhage's triadic variant [16] of Schönhage–Strassen's algorithm [17] for integer multiplication: it achieves a complexity O(n log n log log n) for the multiplication of two polynomials of degree n. Recently [8], Harvey, van der Hoeven and Lecerf proved the sharper bound O(n log n 8^(log* n)), but also showed that several of the new ideas could be used for faster practical implementations [7].
More specifically, they showed how to reduce multiplication in F_2[x] to DFTs over F_{2^60}, which can be computed efficiently due to the existence of many small prime divisors of 2^60 − 1. Their reduction relies on Kronecker segmentation: given two input polynomials A and B in F_2[x] with at most n coefficients each, one cuts them into chunks of 30 bits and forms the bivariate polynomials Ã(x, y) and B̃(x, y), with degree < k in x and degree < 30 in y, where k = ⌈n / 30⌉ (the least integer ≥ n / 30). Hence A(x) = Ã(x, x^30), B(x) = B̃(x, x^30), and the product C = A B satisfies C(x) = C̃(x, x^30), where C̃ = Ã B̃. Now Ã and B̃ are multiplied in F_{2^60}[x] by reinterpreting y as the generator of F_{2^60} over F_2. The recovery of C̃ is possible since its degree in y is at most 58 < 60. However, in terms of input size, half of the coefficients of Ã and B̃ in F_{2^60} are "left blank" when the 30-bit chunks are reinterpreted inside the 60-bit field F_{2^60}. Consequently, this reduction method based on Kronecker segmentation involves a constant overhead of roughly two.
In fact, when considering algorithms with asymptotically softly linear
costs, comparing relative input sizes gives a rough approximation of the
relative costs.
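To make this bookkeeping concrete, here is a small C++ sketch of the segmentation into 30-bit chunks of a polynomial in packed representation; the helper names get_chunk_30 and segment_30 are ours and purely illustrative. Each produced quad word only uses its 30 lower bits, which makes the factor-two overhead apparent.

#include <cstdint>
#include <vector>

// Extract 30 consecutive coefficients of the packed polynomial P (with n
// coefficients), starting at bit position pos; positions beyond n count as zero.
static uint64_t get_chunk_30 (const uint64_t* P, uint64_t pos, uint64_t n) {
  if (pos >= n) return 0;
  uint64_t w = pos >> 6, s = pos & 63;
  uint64_t x = P[w] >> s;
  if (s > 34 && w + 1 < (n + 63) / 64)        // the chunk straddles two quad words
    x |= P[w + 1] << (64 - s);
  uint64_t len = (n - pos < 30) ? n - pos : 30;
  return x & ((uint64_t (1) << len) - 1);
}

// Kronecker segmentation: cut A (with n coefficients) into k = ceil(n / 30)
// chunks of 30 bits and store each chunk in the lower half of a quad word
// that represents one element of F_(2^60); the 30 upper bits are left blank.
std::vector<uint64_t> segment_30 (const uint64_t* A, uint64_t n) {
  uint64_t k = (n + 29) / 30;
  std::vector<uint64_t> chunks (k);
  for (uint64_t i = 0; i < k; i++)
    chunks[i] = get_chunk_30 (A, 30 * i, n);
  return chunks;
}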
Recently, van der Hoeven and Larrieu [11] have proposed a new way to reduce multiplication of polynomials in F_q[x] to the computation of DFTs over an extension F_{q^d}. Roughly speaking, they have shown that the DFT of a polynomial in F_{q^d}[x] could be computed almost d times faster if its coefficients happen to lie in the subfield F_q. Using their algorithm, called the Frobenius FFT, it is theoretically possible to avoid the overhead of Kronecker segmentation, and thereby to gain a factor of two with respect to [7]. However, application of the Frobenius FFT as described in [11] involves computations in all intermediate fields F_{2^e} between F_2 and F_{2^60}, that is, for all divisors e of 60. This makes the theoretical speed-up of two harder to achieve and practical implementations more cumbersome.
Besides Schönhage–Strassen type algorithms, let us mention
that other strategies such as the additive Fourier transform
have been developed for F_2[x] [4, 15]. A competitive implementation based on the latter
transform has been achieved very recently by Chen et al. [2]—notice
that their preprint [2] does not take into account our new
implementation. For more historical details on the complexity of
polynomial multiplication we refer the reader to the introductions of
[7, 8] and to the book by von zur Gathen and
Gerhard [5].
This paper contains two main results. In section 3, we describe a variant of the Frobenius DFT for the special extension of F_{2^60} over F_2. Using a single rewriting step, this new algorithm reduces the computation of a Frobenius DFT to the computation of an ordinary DFT over F_{2^60}, thereby avoiding computations in any intermediate field F_{2^e} with 1 < e < 60 and e dividing 60.
Our second main result is a practical implementation of the new algorithm, which indeed gains a factor approaching two with respect to our previous work. We underline that in both cases, DFTs over F_{2^60} represent the bulk of the computation, but the lengths of the DFTs are halved for the new algorithm. In particular, the observed acceleration is due to our new algorithm and not the result of ad hoc code tuning or hardware specific optimizations.
In section 4, we present some of the low level implementation details concerning the new rewriting step. Our timings are presented in section 5. Our implementation outperforms the reference library gf2x [1]. We also outperform the recent implementation by Chen et al. [2]. Finally, the evaluation-interpolation strategy used by our algorithm is particularly well suited for multiplying matrices of polynomials over F_2, as reported in section 5.
Let ω be a primitive root of unity of order n in a field K. The discrete Fourier transform (DFT) of an n-tuple a = (a_0, …, a_{n−1}) ∈ K^n with respect to ω is DFT_ω(a) = (â_0, …, â_{n−1}), where

  â_k := a_0 + a_1 ω^k + a_2 ω^{2k} + ⋯ + a_{n−1} ω^{(n−1)k}.

Hence â_k is the evaluation of the polynomial A(x) := a_0 + a_1 x + ⋯ + a_{n−1} x^{n−1} at ω^k. For simplicity we often identify A with a and we simply write DFT_ω(A). The inverse transform is related to the direct transform via DFT_ω^{-1} = n^{-1} DFT_{ω^{-1}}, which follows from the well known formula DFT_{ω^{-1}}(DFT_ω(a)) = n a.
If n properly factors as n = n_1 n_2, then ω^{n_2} is a primitive n_1-th root of unity and ω^{n_1} is a primitive n_2-th root of unity. Moreover, for any 0 ≤ i_1 < n_1 and 0 ≤ i_2 < n_2, we have

  â_{n_2 i_1 + i_2} = Σ_{0 ≤ k_1 < n_1} (ω^{n_2})^{k_1 i_1} ω^{k_1 i_2} [ Σ_{0 ≤ k_2 < n_2} (ω^{n_1})^{k_2 i_2} a_{n_1 k_2 + k_1} ].   (1)

If DFT' and DFT'' are algorithms for computing DFTs of length n_1 and n_2, we may use (1) to construct an algorithm for computing DFTs of length n as follows. For each k_1, the sum inside the brackets corresponds to the i_2-th coefficient of a DFT of the n_2-tuple (a_{n_1 k_2 + k_1})_{0 ≤ k_2 < n_2} with respect to ω^{n_1}. Evaluating these inner DFTs requires n_1 calls to DFT''. Next, we multiply by the twiddle factors ω^{k_1 i_2}, at a cost of n operations in K. Finally, for each i_2, the outer sum corresponds to the i_1-th coefficient of a DFT of an n_1-tuple in K^{n_1} with respect to ω^{n_2}. These outer DFTs require n_2 calls to DFT'. Iterating this decomposition for further factorizations of n_1 and n_2 yields the seminal Cooley–Tukey algorithm [3].
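As an illustration of this decomposition, the following generic C++ sketch (our own, with a placeholder field type F and with the two sub-DFT routines passed as parameters; no attempt is made at efficiency) computes a DFT of length n = n_1 n_2 exactly as in formula (1).

#include <cstddef>
#include <functional>
#include <vector>

// Illustration of formula (1): a DFT of length n = n1 * n2 with respect to w
// is obtained from n1 inner DFTs of length n2 (with respect to w^n1),
// multiplications by twiddle factors, and n2 outer DFTs of length n1
// (with respect to w^n2).  The field type F and the sub-DFT routines are
// placeholders and not part of the library described in this paper.
template<typename F>
std::vector<F> dft_cooley_tukey (const std::vector<F>& a, const F& w,
                                 size_t n1, size_t n2,
                                 std::function<std::vector<F> (const std::vector<F>&, const F&)> dft_len_n1,
                                 std::function<std::vector<F> (const std::vector<F>&, const F&)> dft_len_n2)
{
  size_t n = n1 * n2;
  auto pw = [] (F x, size_t e) {                 // naive exponentiation
    F r (1); for (size_t s = 0; s < e; s++) r = r * x; return r; };
  std::vector<std::vector<F>> inner (n1);
  for (size_t k1 = 0; k1 < n1; k1++) {           // n1 inner DFTs of length n2
    std::vector<F> t (n2);
    for (size_t k2 = 0; k2 < n2; k2++) t[k2] = a[n1 * k2 + k1];
    inner[k1] = dft_len_n2 (t, pw (w, n1));
  }
  std::vector<F> ahat (n, F (0));
  for (size_t i2 = 0; i2 < n2; i2++) {           // n2 outer DFTs of length n1
    std::vector<F> t (n1);
    for (size_t k1 = 0; k1 < n1; k1++)
      t[k1] = inner[k1][i2] * pw (w, k1 * i2);   // twiddle factors
    std::vector<F> u = dft_len_n1 (t, pw (w, n2));
    for (size_t i1 = 0; i1 < n1; i1++) ahat[n2 * i1 + i2] = u[i1];
  }
  return ahat;
}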
Let A be a polynomial in F_2[x] and let ω be a primitive root of unity in some extension F_{2^d} of F_2. We write φ for the Frobenius map x ↦ x^2 in F_{2^d} and notice that

  A(φ(α)) = φ(A(α))   (2)

for any α ∈ F_{2^d}. This formula implies many nontrivial relations for the DFT of A: if ω^j = φ(ω^i), then we have A(ω^j) = φ(A(ω^i)). In other words, some values of the DFT of A can be deduced from others, and the advantage of the Frobenius transform introduced in [11] is to restrict the bulk of the evaluations to a minimum number of points.
Let n denote the order of the root ω, and consider the set Ω := {1, ω, ω^2, …, ω^{n−1}}. This set is clearly globally stable under φ, so the group generated by φ acts naturally on it. This action partitions Ω into disjoint orbits. Assume that we have a section Σ of Ω that contains exactly one element in each orbit. Then formula (2) allows us to recover DFT_ω(A) from the evaluations of A at each of the points in Σ. The vector (A(σ))_{σ ∈ Σ} is called the Frobenius DFT of A.
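For moderate orders n, the orbits in question can be enumerated directly by iterating i ↦ 2 i modulo n on the exponents of ω. The following illustrative C++ routine (ours, not part of the implementation described below) returns one representative exponent per orbit together with the size of that orbit.

#include <cstdint>
#include <utility>
#include <vector>

// Enumerate the orbits of the exponents {0, ..., n-1} of a primitive n-th
// root of unity under the Frobenius map x -> x^2, i.e. under i -> 2 i mod n
// (n odd and small enough for a table of n booleans).  The Frobenius DFT
// only evaluates at one representative per orbit.
std::vector<std::pair<uint64_t, unsigned>> frobenius_orbits (uint64_t n) {
  std::vector<std::pair<uint64_t, unsigned>> reps;
  std::vector<bool> seen (n, false);
  for (uint64_t i = 0; i < n; i++) {
    if (seen[i]) continue;
    unsigned len = 0;
    uint64_t j = i;
    do { seen[j] = true; j = (2 * j) % n; len++; } while (j != i);
    reps.emplace_back (i, len);
  }
  return reps;
}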
To efficiently reduce a multiplication in F_2[x] into DFTs over F_{2^60}, we use an order n that divides 2^60 − 1 and such that n = 61 m for some integer m. We perform the decomposition (1) with n_1 = m and n_2 = 61. Let ω be a primitive n-th root of unity in F_{2^60}. The discrete Fourier transform of a polynomial A ∈ F_2[x] of degree < n, given by DFT_ω(A) = (A(1), A(ω), …, A(ω^{n−1})), can be reorganized into 61 slices as follows: for j = 0, …, 60, the j-th slice consists of the values (A(ω^{61 k + j}))_{0 ≤ k < m}. The variant of the Frobenius DFT of A that we introduce in the present paper corresponds to computing only the second slice:

  π(A) := (A(ω^{61 k + 1}))_{0 ≤ k < m} ∈ (F_{2^60})^m.
Let us show that this transform is actually a bijection. The following lemma shows that the slices of index j = 1, …, 60 can all be deduced from the second slice (A(ω^{61 k + 1}))_{0 ≤ k < m} using the action of the Frobenius map φ.
Proof. Let j ∈ {1, …, 60} and k ∈ {0, …, m − 1}. We have φ(ω^{61 k + j}) = ω^{2 (61 k + j)} = ω^{61 k' + j'}, where j' ≡ 2 j (mod 61) and k' is some integer with 0 ≤ k' < m, so the action of φ onto the slices is well defined. Notice that 2 is primitive for the multiplicative group (Z / 61 Z)^*. This implies that for any j ∈ {1, …, 60} there exists i such that 2^i ≡ j (mod 61). Consequently we have φ^i({ω^{61 k + 1} : 0 ≤ k < m}) ⊆ {ω^{61 k + j} : 0 ≤ k < m} for such an i, whence the values of A on the j-th slice can be deduced from those on the second slice via formula (2). Since φ is injective the latter inclusion is an equality.
If we needed the complete DFT_ω(A), then we would still have to compute the first slice (A(ω^{61 k}))_{0 ≤ k < m}. The second main new idea with respect to [11] is to discard this first slice and to restrict ourselves to input polynomials A of degree < 60 m. In this way, π can be inverted, as proved in the following proposition.
Proof. The dimensions of the source and destination spaces of π over F_2 being the same, it suffices to prove that π is injective. Let A ∈ F_2[x] of degree < 60 m be such that π(A) = 0. By construction, A vanishes at m distinct values, namely ω^{61 k + 1} for k = 0, …, m − 1. Under the action of φ it also vanishes at 59 m other values by Lemma 1, whence A = 0.
Remark. π being bijective is due to the fact that 2 is primitive in the multiplicative group (Z / 61 Z)^*. Among the prime divisors of 2^60 − 1, the factors 3, 5, 11 and 13 also have this property, but taking 61 allows us to divide the size of the evaluation-interpolation scheme by 60, which is optimal.
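This can be checked mechanically: the following small program (ours, included for verification only) computes the multiplicative order of 2 modulo each prime divisor of 2^60 − 1 and prints those divisors for which 2 is primitive, namely 3, 5, 11, 13 and 61.

#include <cstdint>
#include <cstdio>

// Multiplicative order of 2 modulo an odd prime p, by brute force.
static unsigned order_of_2_mod (unsigned p) {
  unsigned order = 1;
  uint64_t x = 2 % p;
  while (x != 1) { x = (2 * x) % p; order++; }
  return order;
}

int main () {
  // Prime divisors of 2^60 - 1 = 3^2 * 5^2 * 7 * 11 * 13 * 31 * 41 * 61 * 151 * 331 * 1321.
  const unsigned primes[] = { 3, 5, 7, 11, 13, 31, 41, 61, 151, 331, 1321 };
  for (unsigned p : primes)
    if (order_of_2_mod (p) == p - 1)
      std::printf ("2 is primitive modulo %u\n", p);
  return 0;
}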
We decompose the computation of π into two routines. The first routine is written E and called the Frobenius encoding: writing A = a_0 + a_1 x + ⋯ + a_{60 m − 1} x^{60 m − 1}, we set

  E(A) := (ω^k A_k(ω^m))_{0 ≤ k < m}, where A_k(y) := Σ_{0 ≤ j < 60} a_{j m + k} y^j.   (3)

Below, we will choose ω in such a way that E(A) is essentially a simple reorganization of the coefficients of A. We observe that the coefficients of E(A) are, up to the twiddle factors ω^k, part of the values of the inner DFTs of A in the Cooley–Tukey formula (1), applied with n_1 = m and n_2 = 61. The second task is the computation of the corresponding outer DFT of order m:

  π(A) = DFT_{ω^{61}}(E(A)).
Proof. This formula follows from (1): for 0 ≤ i < m, the i-th entry of DFT_{ω^{61}}(E(A)) equals Σ_{0 ≤ k < m} (ω^{61})^{k i} ω^k A_k(ω^m) = A(ω^{61 i + 1}).
Summarizing, we have reduced the computation of a Frobenius DFT of length 61 m over F_2 to the computation of a DFT of length m over F_{2^60}. This reduction preserves data size.
The computation of E involves the evaluation of the polynomials A_k, of degree < 60 over F_2, at the 61-st root of unity ω^m. In order to perform these evaluations fast, we fix the representation of F_{2^60} and the primitive root of unity of maximal order 2^60 − 1 in such a way that the evaluation of a polynomial of degree < 60 over F_2 at ω^m essentially comes down to reinterpreting its coefficient vector as an element of F_{2^60}. Evaluation of such a polynomial at ω^m can therefore be done efficiently.
Algorithm 1
Input: A = a_0 + a_1 x + ⋯ + a_{60 m − 1} x^{60 m − 1} ∈ F_2[x].
Output: E(A) ∈ (F_{2^60})^m.
Assumption: 61 m divides 2^60 − 1.
For k = 0, …, m − 1, build A_k(y) := Σ_{0 ≤ j < 60} a_{j m + k} y^j.
Return (ω^k A_k(ω^m))_{0 ≤ k < m}.
Proof. This follows immediately from the definition of E in formula (3), using the fact that evaluation at ω^m reduces to a simple reinterpretation of coefficients in our representation.
Algorithm 2
Input: A ∈ F_2[x] of degree < 60 m.
Output: π(A).
Assumption: 61 m divides 2^60 − 1.
Compute the Frobenius encoding E(A) of A by Algorithm 1.
Compute the DFT of E(A) with respect to ω^61.

Proof. The correctness simply follows from Propositions 4 and 5.
By combining Propositions 2 and 4, the map E is invertible and its inverse may be computed by the following algorithm.
Algorithm 3
Input: e = (e_0, …, e_{m−1}) ∈ (F_{2^60})^m.
Output: the polynomial A ∈ F_2[x] of degree < 60 m with E(A) = e.
Assumption: 61 m divides 2^60 − 1.
For k = 0, …, m − 1, build the preimage A_k ∈ F_2[y] of degree < 60 of ω^{−k} e_k.
Return the polynomial A whose coefficient of x^{j m + k} is the coefficient of y^j in A_k, for 0 ≤ j < 60 and 0 ≤ k < m.

Proof. This is a straightforward inversion of Algorithm 1.
Algorithm 4
Input: π(A) ∈ (F_{2^60})^m for some A ∈ F_2[x] of degree < 60 m.
Output: A.
Assumption: 61 m divides 2^60 − 1.
Compute the inverse DFT of π(A) with respect to ω^61.
Compute the Frobenius decoding of the result by Algorithm 3 and return it.

Proof. The correctness simply follows from Propositions 4 and 7.
Using the standard technique of multiplication by evaluation-interpolation, we may now compute products in F_2[x] as follows:

Algorithm 5
Input: A, B ∈ F_2[x].
Output: A B.
Determine m such that 61 m divides 2^60 − 1 and deg A + deg B < 60 m.
Compute π(A) and π(B) by Algorithm 2.
Compute the entry-wise product π(A) π(B) in (F_{2^60})^m.
Recover A B from this product by Algorithm 4 and return it.

Proof. The correctness simply follows from Propositions 6 and 8, using the fact that π(A B) coincides with the entry-wise product of π(A) and π(B), since deg(A B) < 60 m.
For step 1, the actual determination of m has been discussed in [7, section 3]. In fact it is often better not to pick the smallest possible value for m, but a slightly larger one that is also very smooth. Since (2^60 − 1) / 61 admits many small prime divisors, such smooth values of m usually indeed exist.
Our implementations are done for an AVX2-enabled processor and an operating system compliant with the System V Application Binary Interface. Elements of F_{2^60} are stored in packed representation, as integers of type uint64_t. Recall that the platform disposes of 16 AVX registers, which must be allocated carefully in order to minimize read and write accesses to the memory.
Our new polynomial product is implemented in the justinline library of Mathemagix.
Polynomials over F_2 are supposed to be given in packed representation, which means that coefficients are stored as a vector of contiguous bits in memory. For the implementation considered in this paper, a polynomial of degree < n is stored into ⌈n / 64⌉ quad words, starting with the low-degree coefficients: the constant term is the least significant bit of the first word. The last word is suitably padded with zeros.
Reading or writing one coefficient or a range of coefficients of a polynomial in packed representation must be done carefully to avoid invalid memory accesses. Let P be such a polynomial of type uint64_t*. Reading the coefficient of degree i in P is obtained as (P[i >> 6] >> (i & 63)) & 1. However, reading or writing a single coefficient should be avoided as much as possible for efficiency, so we prefer handling ranges of 256 bits. In the sequel, the function with prototype

void load (avx_uint64_t&, const uint64_t*, const uint64_t&, const uint64_t&, const uint64_t&);

copies a range of at most 256 bits of a packed polynomial, starting from a given bit position, into its avx_uint64_t destination argument. Bits beyond the end of the requested range or beyond the end of the polynomial are considered to be zero.
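For illustration, a scalar analogue of such a bounded load (for a single quad word instead of a full AVX register, and with a hypothetical name) could look as follows; the actual vectorized routine differs in its details.

#include <cstdint>

// Fetch 64 consecutive coefficients of the packed polynomial P, starting at
// bit position i; coefficients at positions >= n are considered to be zero.
static uint64_t load_word_sketch (const uint64_t* P, uint64_t i, uint64_t n) {
  if (i >= n) return 0;
  uint64_t nw = (n + 63) >> 6;               // number of allocated quad words
  uint64_t w = i >> 6, s = i & 63;
  uint64_t x = P[w] >> s;
  if (s != 0 && w + 1 < nw)
    x |= P[w + 1] << (64 - s);               // bits coming from the next word
  if (n - i < 64)
    x &= (uint64_t (1) << (n - i)) - 1;      // clear the out-of-range positions
  return x;
}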
For arithmetic operations in F_{2^60} we refer the reader to [7, section 3.1]. In the sequel we only appeal to the function

uint64_t f2_60_mul (const uint64_t&, const uint64_t&);

that multiplies two elements of F_{2^60} in packed representation.
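The core of such a product is a carryless multiplication of two quad words using the CLMUL technology, followed by a reduction modulo the defining polynomial of F_(2^60), for which we refer to [7, section 3.1]. The following fragment (ours; it requires the pclmul and sse4.1 targets) only illustrates the carryless multiplication step.

#include <cstdint>
#include <immintrin.h>

// Carryless product of two 64-bit operands via the pclmulqdq instruction.
// The 128-bit result still has to be reduced modulo the defining polynomial
// of F_(2^60); that reduction is described in [7, section 3.1].
static void clmul_64 (uint64_t a, uint64_t b, uint64_t& lo, uint64_t& hi) {
  __m128i x = _mm_set_epi64x (0, (long long) a);
  __m128i y = _mm_set_epi64x (0, (long long) b);
  __m128i p = _mm_clmulepi64_si128 (x, y, 0x00);  // multiply the low halves
  lo = (uint64_t) _mm_cvtsi128_si64 (p);          // low 64 bits of the product
  hi = (uint64_t) _mm_extract_epi64 (p, 1);       // high 64 bits of the product
}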
We also use a packed column-major representation for matrices over F_2. For instance, an 8 × 8 bit matrix (M_{i,j})_{0 ≤ i, j < 8} is encoded as a quad word whose (8 j + i)-th bit is M_{i,j}. Similarly, a 256 × 64 bit matrix M may be seen as a vector of type avx_uint64_t*, so that M_{i,j} corresponds to the i-th bit of the j-th column M[j].
The Frobenius encoding essentially boils down to matrix transpositions. Our main building block is 256 × 64 bit matrix transposition. We decompose this transposition in a suitable way with regard to data locality, register allocation and vectorization.

For the computation of general transpositions, we repeatedly make use of the well-known divide and conquer strategy: to transpose an r × c matrix M, where r and c are even, we decompose M into four (r/2) × (c/2) blocks M_{1,1}, M_{1,2}, M_{2,1}, M_{2,2}; we swap the anti-diagonal blocks M_{1,2} and M_{2,1} and recursively transpose each block.
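For reference, this strategy can be written down generically as follows; the sketch below (ours) operates on a plain square matrix of bytes and is not vectorized, unlike the packed routines described next.

// Recursive divide and conquer transposition of the square r x r submatrix of
// M (row-major, leading dimension ld) with top-left corner (i0, j0), for r a
// power of two: swap the anti-diagonal blocks, then recurse on the four blocks.
static void transpose_rec (unsigned char* M, unsigned ld,
                           unsigned i0, unsigned j0, unsigned r) {
  if (r == 1) return;
  unsigned h = r / 2;
  for (unsigned i = 0; i < h; i++)           // swap the blocks M12 and M21
    for (unsigned j = 0; j < h; j++) {
      unsigned char t = M[(i0 + i) * ld + (j0 + h + j)];
      M[(i0 + i) * ld + (j0 + h + j)] = M[(i0 + h + i) * ld + (j0 + j)];
      M[(i0 + h + i) * ld + (j0 + j)] = t;
    }
  transpose_rec (M, ld, i0,     j0,     h);
  transpose_rec (M, ld, i0,     j0 + h, h);
  transpose_rec (M, ld, i0 + h, j0,     h);
  transpose_rec (M, ld, i0 + h, j0 + h, h);
}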
The basic task we begin with is the transposition of a packed 8 × 8 bit matrix. The solution used here is borrowed from [18, Chapter 7, section 3].

Function 1
Input: a packed 8 × 8 bit matrix M.
Output: the transpose of M in packed representation.
uint64_t
packed_matrix_bit_8x8_transpose (const uint64_t& M) {
  uint64_t r = M;
  static const uint64_t mask_4 = 0x00000000f0f0f0f0;
  static const uint64_t mask_2 = 0x0000cccc0000cccc;
  static const uint64_t mask_1 = 0x00aa00aa00aa00aa;
  uint64_t t;
  t = ((r >> 28) ^ r) & mask_4;
  r = r ^ t;
  t = t << 28;
  r = r ^ t;
  t = ((r >> 14) ^ r) & mask_2;
  r = r ^ t;
  t = t << 14;
  r = r ^ t;
  t = ((r >> 7) ^ r) & mask_1;
  r = r ^ t;
  t = t << 7;
  r = r ^ t;
  return r;
}
In the first stage (shift by 28), the anti-diagonal 4 × 4 blocks are swapped. In the second stage (shift by 14), the matrix is seen as four 4 × 4 matrices whose anti-diagonal 2 × 2 blocks are swapped. In the last stage (shift by 7), the matrix is seen as sixteen 2 × 2 matrices whose anti-diagonal elements are swapped. All in all, 18 instructions, 3 constants and one auxiliary variable are needed to transpose a packed 8 × 8 bit matrix in this way.
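As a quick sanity check of the column-major conventions, one may verify that bit 8 j + i of the output equals bit 8 i + j of the input, as in the following test helper (ours).

#include <cassert>
#include <cstdint>

uint64_t packed_matrix_bit_8x8_transpose (const uint64_t&);  // defined above

// Verify that the transposition exchanges the roles of rows and columns in
// the packed column-major encoding (bit 8*j + i of m encodes the entry (i, j)).
static void check_transpose_8x8 (uint64_t m) {
  uint64_t t = packed_matrix_bit_8x8_transpose (m);
  for (unsigned i = 0; i < 8; i++)
    for (unsigned j = 0; j < 8; j++)
      assert (((t >> (8 * j + i)) & 1) == ((m >> (8 * i + j)) & 1));
}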
One advantage of the above algorithm is that it admits a straightforward AVX vectorization, that we will denote by

avx_uint64_t
avx_packed_matrix_bit_8x8_transpose
(const avx_uint64_t&);

This routine transposes four 8 × 8 bit matrices that are packed successively into an AVX register of type avx_uint64_t. We emphasize that this task is not the same as transposing a 16 × 16 or a 32 × 32 bit matrix.
Remark 10. Thanks to the BMI2 technology, here is another way to transpose a packed 8 × 8 bit matrix m:

uint64_t mask = 0x0101010101010101;
uint64_t t = 0;
for (unsigned i = 0; i < 8; i++)
  t |= _pext_u64 (m, mask << i) << (8 * i);
The loop can be unrolled while precomputing the shift amounts and masks, which leads to a faster sequential implementation, as sketched below. Unfortunately this approach cannot be vectorized with the AVX2 technology. Other sequential solutions exist, based on lookup tables or integer arithmetic, but their vectorization is again problematic. Practical efficiencies are reported in section 5.
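A possible unrolled variant with precomputed masks (our sketch, equivalent to the loop of the remark above) is:

#include <cstdint>
#include <immintrin.h>

// Unrolled BMI2 transposition of a packed 8 x 8 bit matrix: each _pext_u64
// gathers one row of the input into one byte of the (column-major) result.
static uint64_t transpose_8x8_pext_unrolled (uint64_t m) {
  return  _pext_u64 (m, 0x0101010101010101ull)
       | (_pext_u64 (m, 0x0202020202020202ull) << 8)
       | (_pext_u64 (m, 0x0404040404040404ull) << 16)
       | (_pext_u64 (m, 0x0808080808080808ull) << 24)
       | (_pext_u64 (m, 0x1010101010101010ull) << 32)
       | (_pext_u64 (m, 0x2020202020202020ull) << 40)
       | (_pext_u64 (m, 0x4040404040404040ull) << 48)
       | (_pext_u64 (m, 0x8080808080808080ull) << 56);
}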
Our next task is to design an algorithm that transposes four packed 8 × 8 byte matrices simultaneously. More precisely, it performs the operation

(A_0 | A_1 | A_2 | A_3) ↦ (A_0^T | A_1^T | A_2^T | A_3^T)

on a packed 8 × 32 byte matrix, where the A_j are 8 × 8 byte blocks. This operation has the following prototype in the sequel:

void avx_packed_matrix_byte_8x8_transpose
(avx_uint64_t* dest, const avx_uint64_t* src);

This function works as follows. First the input src is loaded into eight AVX registers r_0, …, r_7. Each r_i is seen as a vector of four uint64_t: for j = 0, …, 3, the j-th quad words of r_0, …, r_7 thus represent one of the four 8 × 8 byte matrices. Then we transpose these four matrices simultaneously in-register by means of AVX shift and blend operations over 32, 16 and 8 bit entries, in the spirit of the aforementioned divide and conquer strategy.
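Ignoring vectorization entirely, the operation performed by this routine corresponds to the following scalar reference code (ours), in which the 8 × 32 byte matrix is given row by row and each of its four 8 × 8 blocks is transposed independently.

// Reference (non-vectorized) version of the simultaneous transposition of four
// 8 x 8 byte matrices laid out side by side in an 8 x 32 byte matrix:
// dest[i][8*b + j] = src[j][8*b + i] for each block b = 0, ..., 3.
static void byte_8x8_transpose_x4_ref (unsigned char dest[8][32],
                                       const unsigned char src[8][32]) {
  for (unsigned b = 0; b < 4; b++)
    for (unsigned i = 0; i < 8; i++)
      for (unsigned j = 0; j < 8; j++)
        dest[i][8 * b + j] = src[j][8 * b + i];
}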
Having the above subroutines at our disposal, we can now present our algorithm to transpose a packed 256 × 64 bit matrix. The input bit matrix, of type const avx_uint64_t*, is written src. The transposed output matrix is written dest and has type uint64_t*. We first compute the auxiliary byte matrix T as follows:

static avx_uint64_t T[64];
for (int i= 0; i < 8; i++) {
  avx_packed_matrix_byte_8x8_transpose (T + 8*i, src + 8*i);
  for (int k= 0; k < 8; k++)
    T[8*i+k]= avx_packed_matrix_bit_8x8_transpose (T[8*i+k]); }
If we write for the byte representing the packed
bit vector
, then
contains the following
byte
matrix:
First, for all , we load
column
into the AVX register
. We interpret these registers as forming a
byte matrix that we transpose in-registers. This
transposition is again performed in the spirit of the aforementioned
divide and conquer strategy and makes use of various specific AVX2
instructions. We obtain
More precisely, for , the
group of four consecutive columns from
until
is in the register
.
We save the registers
at the addresses
and
.
For each , we build a similar
byte matrix from the columns
of
, and transpose this
matrix using the same algorithm. This time the result is saved at the
addresses
and
,
where
. This yields an
efficient routine for transposing
into
, whose prototype is given by
void packed_matrix_bit_256x64_transpose
(uint64_t* dest, const avx_uint64_t* src);
If the input polynomial A has degree less than 60 m and is in packed representation, then it can also be seen as an m × 60 bit matrix in packed representation (except that a padding with zeros may be necessary to adjust the size). In this setting, the polynomials A_k of Algorithm 1 are simply read as the rows of the matrix. Therefore, to compute the Frobenius encoding E(A), we only need to transpose this matrix, then add 4 rows of zeros for alignment (because we store one element of F_{2^60} per quad word) and multiply by twiddle factors. This leads to the following implementation:
Function
Input: A ∈ F_2[x] of degree < 60 m, in packed representation.
Output: E(A), stored in m allocated quad words starting at the destination pointer.
Assumptions: 61 m divides 2^60 − 1 and deg A < 60 m.
void encode (uint64_t* , const uint64_t&
,
const uint64_t* ,
const uint64_t&
)
{
uint64_t = 1,
= 0,
= 0;
avx_uint64_t [64];
uint64_t
[256];
while ( <
) {
= min (
-
, 256);
for (int = 0;
< 64;
++)
load ([
],
,
,
+
*
,
);
packed_matrix_bit_256x64_transpose (,
);
for (int = 0;
< e;
++)
{
[
+
]
= f2_60_mul (
[
],
);
= f2_60_mul (
,
); }
+=
; }
Remark. To optimize read accesses, it is better to run the loop only over the nonzero part of the input and to initialize the remaining entries to zero. Indeed, for a product of degree close to 60 m, we typically multiply two polynomials of degree close to 30 m, which means that about half of the coefficients are known to be zero when computing the direct transforms.
The Frobenius decoding inverts the encoding. The implementation issues are similar, so we refer to our source code for further details.
The platform considered in this paper is equipped with an Intel Skylake CPU (family number 6 and model number 94) and 32 GB of DDR4 memory. This CPU features the AVX2, BMI2 and CLMUL technologies. The platform runs a Linux-based operating system.
We use version 1.2 of the gf2x library [1]. We also compare to the implementation of the additive Fourier transform by Chen et al. [2], using the GIT version of September 1, 2017.
Concerning the cost of the Frobenius encoding and decoding, the scalar Function 1 was benchmarked when compiled with the sole -O3 option. With the additional options -mtune=native -mavx2 -mbmi2, the BMI2 version of Remark 10 takes about 16 CPU cycles. The vectorized version of Function 1 transposes four packed 8 × 8 bit matrices simultaneously in about 20 cycles, which makes an average of 5 cycles per matrix.
It is interesting to examine the performance of the sole transpositions made during the Frobenius encoding and decoding (that is, discarding the products by twiddle factors in F_{2^60}). From sizes of a few kilobytes on, the average cost per quad word is about 8 cycles with the AVX2 technology, and about 23 cycles without. Unfortunately the vectorization speed-up is not as close to 4 as we would have liked.
Since the encoding and decoding costs are linear, their relative
contribution to the total computation time of polynomial products
decreases for large sizes. For two input polynomials in
of
quad words, the contribution is about
%; for
quad
words, it is about
%.
In Figure 1 we report timings in milliseconds for multiplying two polynomials in F_2[x], each of the input size in quad words indicated in abscissa; the timings are obtained from justinline/bench/polynomial_f2_bench.mmx. Notice that our implementation in [7] was faster than version 1.1 of gf2x, as predicted by the asymptotic complexity analysis. Let us mention that our new implementation becomes faster than gf2x once the input size exceeds a rather small threshold (see Figure 1).
As in [7], one major advantage of DFTs over the Babylonian field F_{2^60} is the compactness of the evaluated FFT-representation of polynomials. This makes linear algebra over F_2[x] particularly efficient: instead of multiplying r × r matrices over F_2[x] naively by means of r^3 polynomial products, we use the standard evaluation-interpolation approach. In our context, this comes down to: (a) computing the 2 r^2 Frobenius encodings, (b) the 2 r^2 direct DFTs of all entries of the two matrices to be multiplied, (c) performing, for each evaluation point, the products of r × r matrices over F_{2^60} (step sketched below), and (d) computing the r^2 inverse DFTs and Frobenius decodings of the so-computed matrix products.
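Schematically, step (c) can be organized as in the following sketch, in which the helper names and interfaces are ours and merely stand for the routines described in the previous sections; addition in F_(2^60) is a plain XOR of quad words.

#include <cstdint>
#include <vector>

uint64_t f2_60_mul (const uint64_t&, const uint64_t&);  // from section 4

typedef std::vector<uint64_t> transformed;   // m evaluations in F_(2^60)

// Pointwise step of the evaluation-interpolation product of two r x r matrices
// of polynomials over F_2: Ahat and Bhat hold the direct transforms of all
// entries; one r x r matrix product over F_(2^60) is performed per evaluation
// point.  The result still has to be inverse-transformed and decoded.
std::vector<std::vector<transformed>>
pointwise_matrix_products (const std::vector<std::vector<transformed>>& Ahat,
                           const std::vector<std::vector<transformed>>& Bhat,
                           size_t r, size_t m) {
  std::vector<std::vector<transformed>> Chat
    (r, std::vector<transformed> (r, transformed (m, 0)));
  for (size_t k = 0; k < m; k++)
    for (size_t i = 0; i < r; i++)
      for (size_t j = 0; j < r; j++) {
        uint64_t acc = 0;
        for (size_t l = 0; l < r; l++)
          acc ^= f2_60_mul (Ahat[i][l][k], Bhat[l][j][k]);  // add = XOR
        Chat[i][j][k] = acc;
      }
  return Chat;
}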
Timings for r × r matrices over F_2[x] are obtained from justinline/bench/matrix_polynomial_f2_bench.mmx and are reported in Table 1. The row "this paper" confirms the practical gain of this fast approach within our implementation. For comparison, we also report timings for the naive approach that performs r^3 separate polynomial multiplications using gf2x.
The present paper describes a major new approach for the efficient computation of large carryless products. It confirms the excellent arithmetic properties of the Babylonian field F_{2^60} for practical purposes, when compared to the fastest previously available strategies.
Improvements are still possible for our implementation of DFTs over F_{2^60}. First, taking advantage of the more recent AVX-512 technologies is an important challenge. This is difficult due to the current lack of 256 or 512 bit SIMD counterparts for the vpclmulqdq assembly instruction (carryless multiplication of two quad words). However, larger vector instructions would be beneficial for matrix transposition, and even more so taking into account that there are twice as many 512 bit registers as 256 bit registers; so we can expect a significant speed-up for the Frobenius encoding/decoding stages. The second expected improvement concerns the use of truncated Fourier transforms [9, 14] in order to smooth the graph from Figure 1. Finally we expect that our new ideas around the Frobenius transform might be applicable to other small finite fields.
R. P. Brent, P. Gaudry, E. Thomé, and P.
Zimmermann. Faster multiplication in GF(2)[x]. In A. van der Poorten and A. Stein,
editors, Algorithmic Number Theory, volume 5011 of
Lect. Notes Comput. Sci., pages 153–166. Springer
Berlin Heidelberg, 2008.
Ming-Shing Chen, Chen-Mou Cheng, Po-Chun Kuo, Wen-Ding Li, and Bo-Yin Yang. Faster multiplication for long binary polynomials. https://arxiv.org/abs/1708.09746, 2017.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
S. Gao and T. Mateer. Additive fast Fourier transforms over finite fields. IEEE Trans. Inform. Theory, 56(12):6265–6272, 2010.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, 3rd edition, 2013.
GCC, the GNU Compiler Collection. Software available at http://gcc.gnu.org, from 1987.
D. Harvey, J. van der Hoeven, and G. Lecerf. Fast polynomial multiplication over F_{2^60}.
In M. Rosenkranz, editor, Proceedings of the ACM on
International Symposium on Symbolic and Algebraic
Computation, ISSAC '16, pages 255–262. ACM, 2016.
D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. J. ACM, 63(6), 2017. Article 52.
J. van der Hoeven. The truncated Fourier transform and applications. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 290–296. ACM, 2004.
J. van der Hoeven. Newton's method and FFT trading. J. Symbolic Comput., 45(8):857–878, 2010.
J. van der Hoeven and R. Larrieu. The Frobenius FFT. In M. Burr, editor, Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC '17, pages 437–444. ACM, 2017.
J. van der Hoeven and G. Lecerf. Interfacing Mathemagix with C++. In M. Monagan, G. Cooperman, and M. Giesbrecht, editors, Proceedings of the 2013 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC '13, pages 363–370. ACM, 2013.
J. van der Hoeven and G. Lecerf. Mathemagix User Guide. https://hal.archives-ouvertes.fr/hal-00785549, 2013.
R. Larrieu. The truncated Fourier transform for mixed radices. In M. Burr, editor, Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC '17, pages 261–268. ACM, 2017.
Sian-Jheng Lin, Wei-Ho Chung, and Yunghsiang S. Han. Novel polynomial basis and its application to Reed–Solomon erasure codes. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 316–325. IEEE, 2014.
A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Inform., 7:395–398, 1977.
A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
H. S. Warren. Hacker's Delight. Addison-Wesley, 2nd edition, 2012.