Implementing number theoretic transforms

Joris van der Hoevena, Grégoire Lecerfb

CNRS (UMR 7161, LIX)

Laboratoire d'informatique de l'École polytechnique

Bâtiment Alan Turing, CS35003

1, rue Honoré d'Estienne d'Orves

91120 Palaiseau, France

a. Email: vdhoeven@lix.polytechnique.fr

b. Email: lecerf@lix.polytechnique.fr

Preliminary version of December 18, 2024

Grégoire Lecerf has been supported by the French ANR-22-CE48-0016 NODE project. Joris van der Hoeven has been supported by an ERC-2023-ADG grant for the ODELIX project (number 101142171).

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

This article has been written using GNU TeXmacs [16].

We describe a new, highly optimized implementation of number theoretic transforms on processors with SIMD support (AVX, AVX-512, and Neon). For any prime modulus and any order of the form , our implementation can automatically generate a dedicated codelet to compute the number theoretic transform of order over . New speed-ups were achieved by relying heavily on non-normalized modular arithmetic and allowing for orders that are not necessarily powers of two.

Keywords: number theoretic transform, finite fields, codelets, algorithm, complexity bound, FFT, integer multiplication

A.C.M. subject classification: G.1.0 Computer-arithmetic, F.2.1 Number-theoretic computations

A.M.S. subject classification: 68W30, 68Q17, 68W40

1.Introduction

Number theoretic transforms (NTTs) were introduced by Pollard [21]. He used them as a tool to design practical algorithms for the efficient multiplication of very large integers. Technically speaking, a number theoretic transform is simply a discrete Fourier transform (DFT) over a finite field. Traditional DFTs work over the complex numbers. A fast algorithm to compute such DFTs was published a few years earlier by Cooley and Tukey [4], although the idea goes back to Gauss [14].

As of today, fast practical algorithms for multiplying large integers are based on number theoretic transforms, complex DFTs [23, first algorithm], or Schönhage-Strassen's algorithm [23, second algorithm]. All three strategies are very competitive and the winner depends on hardware and the application. Since the race is very close, it is interesting to heavily optimize each of the three approaches.

In this paper, we will describe a high performance implementation for number theoretic transforms. We focus on modern general purpose CPUs with support for SIMD (Single Instruction Multiple Data) instructions. Our current implementation was written in Mathemagix [17], partly as a test case for the development of our new compiler (a beta version of it has been released this year). It fully exploits SIMD accelerations and we have started experimenting with multi-threading.

A first way to optimize NTTs is to design faster modular arithmetic. Let p be a prime number. Traditionally, the result of any arithmetic operation modulo p is always normalized, e.g. to make it lie in {0, ..., p-1}. It was first observed by Harvey [13] that some of these normalizations can be delayed when computing NTTs and that this can lead to significant speed-ups. In section 3, we show how to take this idea one step further, which allows us to delay even more normalizations.

While we were working on our implementation¹, a few other implementations appeared that use similar ideas. The delayed reduction strategy has been generalized to 4-wide and 8-wide butterflies in [2, 24]. Dan Schultz recently added an optimized implementation of this kind to the Flint library [12]. We note that the idea to delay normalizations is also very popular in linear algebra; see for instance [5, 6]. Another inspiration and noteworthy analogue is the “transient ball arithmetic” that was developed in [18].

1. We started our implementation in 2019, but it remained private due to the experimental nature of the new Mathemagix compiler.

The optimization of traditional DFTs is the subject of an extensive literature. The most efficient implementations are based on so-called “codelets” [8, 9, 22]. For a given target order , the idea is to automatically generate a highly optimized program (called a codelet) for DFTs of order . Such codelets can optimize the use of registers and memory, which makes them in particular cache-efficient. For small orders, we can also unroll the loops, which minimizes the overhead of control structures. In section 4, we apply the codelet approach to number theoretic transforms and discuss some specific issues that arise for this application. Similar experiments have recently been carried out in the Spiral system [10, 25].

The last section 5 is dedicated to timings. We implemented our algorithms on five different systems that support SIMD arithmetic of three different widths: 128-bit (ARM Neon), 256-bit (AVX-2), and 512-bit (AVX-512). For orders , we achieve speed-ups that are proportional to the SIMD width. For higher orders, the x86-based systems start to suffer from memory access costs, whereas the performance of ARM-based systems remains stable. We also experimented with multi-threaded versions of our algorithms, but without pushing our optimization efforts as far as for the mono-threaded case.

The application of the fast NTTs to integer multiplication requires a wrapper for Chinese remaindering and Kronecker segmentation. This turns out to be highly non-trivial to implement with the same level of optimization as our new NTTs. We plan to address this issue in a separate future work.

2.Preliminaries

2.1.Discrete Fourier transforms

Let 𝕂 be a field with a primitive root of unity ω of order n and assume that n is invertible in 𝕂. We define the discrete Fourier transform (or DFT) of an n-tuple a = (a_0, ..., a_{n-1}) ∈ 𝕂^n with respect to ω to be the n-tuple â = DFT_ω(a) ∈ 𝕂^n where

â_k = a_0 + a_1 ω^k + ... + a_{n-1} ω^{(n-1)k}.

That is, â_k is the evaluation of the polynomial A(x) = a_0 + a_1 x + ... + a_{n-1} x^{n-1} at ω^k. If ω is a primitive n-th root of unity, then so is its inverse ω^{-1}, and we have

DFT_{ω^{-1}} ∘ DFT_ω = n Id.	(1)

Indeed, writing b = DFT_{ω^{-1}}(DFT_ω(a)), we have

b_i = Σ_{k=0}^{n-1} Σ_{j=0}^{n-1} a_j ω^{(j-i)k} = Σ_{j=0}^{n-1} a_j δ_{i,j} n = n a_i,

where δ_{i,j} = 1 if i = j and δ_{i,j} = 0 otherwise.

The DFT induces a ring homomorphism from 𝕂[x]/(x^n - 1) into 𝕂^n, by sending (the class of) a polynomial A = a_0 + a_1 x + ... + a_{n-1} x^{n-1} to DFT_ω(a) = (A(1), A(ω), ..., A(ω^{n-1})). Since the DFT is invertible, this map (that we will still denote by DFT_ω) is actually an isomorphism. If we have a fast way to compute DFTs, this yields an efficient way to multiply “cyclic” polynomials A, B ∈ 𝕂[x]/(x^n - 1):

A B = DFT_{ω^{-1}}(DFT_ω(A) DFT_ω(B)) / n.	(2)

Here the multiplication in 𝕂^n is done entry-wise. Given two polynomials A, B ∈ 𝕂[x] with deg(A B) < n, we have

(A mod (x^n - 1)) (B mod (x^n - 1)) = A B mod (x^n - 1),

and the product A B can be read off from A B mod (x^n - 1). Hence we also obtained an algorithm for the multiplication of A and B. This is called FFT-multiplication.
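
To fix ideas, here is a naive quadratic time DFT over Z/pZ written in C. It follows the definition above directly and only serves as a reference point for the fast algorithms of the next subsections; the names and the 64 bit integer representation are ours, not those of our library.

#include <stdint.h>

/* x * y mod p, via a 128 bit intermediate product (GCC/Clang extension). */
static uint64_t mulmod (uint64_t x, uint64_t y, uint64_t p) {
  return (uint64_t) ((__uint128_t) x * y % p);
}

/* Naive DFT of length n over Z/pZ: hat_a[k] = sum_i a[i] * w^(i*k) mod p,
   where w is a primitive n-th root of unity modulo p.  Quadratic time. */
void dft_naive (uint64_t *hat_a, const uint64_t *a,
                uint64_t n, uint64_t w, uint64_t p) {
  uint64_t wk = 1;                                /* wk = w^k */
  for (uint64_t k = 0; k < n; k++) {
    uint64_t s = 0, wik = 1;                      /* wik = w^(i*k) */
    for (uint64_t i = 0; i < n; i++) {
      s = (s + mulmod (a[i], wik, p)) % p;
      wik = mulmod (wik, wk, p);
    }
    hat_a[k] = s;
    wk = mulmod (wk, w, p);
  }
}

FFT-multiplication of two cyclic polynomials then amounts to two direct transforms, one entry-wise product, one inverse transform with respect to the inverse root of unity, and a final division by n.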

2.2.Divide-and-conquer DFTs

Let be a primitive -th root of unity, with for . Then is a primitive -th root of unity and is a primitive -th root of unity. Moreover, for any and , we have

(3)

This formula can be decomposed in four steps:

(4)
(5)
(6)
(7)

Given a linear map and , it will be convenient to use the Kronecker products and , that are defined by

Using these notations, we may rewrite (4) and (6) as

Step (5) consists of scalar multiplications by so-called twiddle factors . We will denote this twiddling map by :

The last step corresponds to the transposition of an matrix. Denoting this transposition map by , the relation (3) becomes

(8)
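
These four steps translate into code in a straightforward way. The following C sketch performs one such divide-and-conquer step for n = n1 * n2 on top of the mulmod and dft_naive helpers sketched in section 2.1; the index conventions (row-major n1 x n2 layout, inner transforms on the slices of stride n2) are ours and need not coincide with those of the formulas above.

#include <stdlib.h>

/* One divide-and-conquer step for n = n1*n2: inner DFTs of length n1 on the
   n2 slices of stride n2, twiddling, outer DFTs of length n2 on the rows,
   and a final n1 x n2 matrix transposition. */
void dft_rec (uint64_t *hat_a, const uint64_t *a,
              uint64_t n1, uint64_t n2, uint64_t w, uint64_t p) {
  uint64_t n = n1 * n2, m = (n1 > n2 ? n1 : n2);
  uint64_t *b = malloc (n * sizeof *b);
  uint64_t *t = malloc (m * sizeof *t), *th = malloc (m * sizeof *th);
  uint64_t w1 = 1, w2 = 1;     /* w1 = w^n2 has order n1, w2 = w^n1 has order n2 */
  for (uint64_t j = 0; j < n2; j++) w1 = mulmod (w1, w, p);
  for (uint64_t j = 0; j < n1; j++) w2 = mulmod (w2, w, p);
  for (uint64_t i2 = 0; i2 < n2; i2++) {                    /* inner DFTs */
    for (uint64_t i1 = 0; i1 < n1; i1++) t[i1] = a[i1 * n2 + i2];
    dft_naive (th, t, n1, w1, p);
    for (uint64_t k1 = 0; k1 < n1; k1++) b[k1 * n2 + i2] = th[k1];
  }
  for (uint64_t k1 = 0, wk1 = 1; k1 < n1; k1++) {           /* twiddle + outer DFTs */
    uint64_t tw = 1;                                        /* tw = w^(k1*i2) */
    for (uint64_t i2 = 0; i2 < n2; i2++) {
      t[i2] = mulmod (b[k1 * n2 + i2], tw, p);
      tw = mulmod (tw, wk1, p);
    }
    dft_naive (&b[k1 * n2], t, n2, w2, p);
    wk1 = mulmod (wk1, w, p);
  }
  for (uint64_t k1 = 0; k1 < n1; k1++)                      /* transposition */
    for (uint64_t k2 = 0; k2 < n2; k2++)
      hat_a[k2 * n1 + k1] = b[k1 * n2 + k2];
  free (b); free (t); free (th);
}

Recursing on the inner and outer transforms instead of calling the naive version yields the classical Cooley–Tukey FFT.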

2.3.DFTs using the Chinese remainder theorem

If in the previous subsection, then it is actually possible to avoid the twiddling step (5), as first noted by Good in [11]. Consider the Bezout relation

where and . Then we have the isomorphism of abelian groups

This Chinese remainder isomorphism induces an isomorphism of -algebras

Representing cyclic polynomials and

by the vectors of their coefficients, the map acts as a permutation : if then , for and . The bivariate DFT

sends to the vector with . If and , then

so and for some permutation . Altogether, this yields
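
Concretely, the input re-indexing can be realized as follows; this is a sketch with our own conventions, and the permutation for the output indices is obtained along the same lines from the Bezout relation.

/* Good's re-indexing for n = n1*n2 with gcd(n1, n2) = 1: coefficient a_i is
   sent to row (i mod n1) and column (i mod n2) of an n1 x n2 array.  After
   this permutation, the bivariate DFT requires no twiddling step between the
   inner and outer transforms of lengths n1 and n2. */
void good_permute (uint64_t *dst /* size n1*n2 */, const uint64_t *src,
                   uint64_t n1, uint64_t n2) {
  for (uint64_t i = 0; i < n1 * n2; i++)
    dst[(i % n1) * n2 + (i % n2)] = src[i];
}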

2.4.DFTs up to permutations

The above two subsections show that efficient algorithms for DFTs often only compute them up to permutations. For some applications, it is actually not required to perform all these permutations. Assume for instance that we have an efficient algorithm for the computation of the “twisted” for some permutation . Then the variant

of (2) still yields an efficient method for FFT-multiplication. This technique requires the inverse transform to be implemented with more care, since one cannot directly use the formula (1). Instead, one typically uses a similar algorithm as for , but with all steps reversed.

The twisting technique can be combined recursively with the algorithms from the previous two subsections. Assume for instance that and for some permutations and , and let

with the notation of (8). Then, equation (8) implies that

where . For some table of precomputable twiddling factors, we have for any with .

For DFTs of small orders, we also note that permutations of the input and output coefficients can be achieved without cost. As will be explained in sections 4.1 and 4.2, we will regard such DFTs as completely unrolled straight-line programs (SLPs); see definition in [3] for instance. Instead of explicitly permuting the input and output coefficients, it then suffices to appropriately rename all local variables in the SLP. This is particularly useful for the algorithm from section 2.3, which requires permutations of both the input and output coefficients.

2.5.Number theoretic transforms

In the special case when is a finite field for some prime number , a discrete Fourier transform is also called a number theoretic transform (NTT). The most favorable case is when is of the form or , where is small. Indeed, since the multiplicative group of is cyclic of order , this ensures the existence of primitive roots of unity of large orders of the form or . Nice primes like this are sometimes called FFT primes. In what follows, for any order , we will denote by a number theoretic transform of the form for some suitable permutation and primitive -th root of unity .

Pollard first noted [21] that NTTs can be used to devise efficient practical algorithms for multiplying large integers: using Kronecker segmentation, the two multiplicands can be rewritten as special values

of integer polynomials with small coefficients . Then and the product has coefficients in . If

(9)

then can be read off from its reduction modulo . If we also have , then the product modulo can be computed fast using FFT-multiplication.

For the application to integer multiplication, Pollard's technique is most efficient when nicely fits into a machine integer or floating point number. As soon as gets large (say ), this constrains to become fairly small (e.g. when using double precision floating point arithmetic with ). In order to increase , and reduce the negative impact of in the constraint (9), Pollard suggested to reduce modulo three FFT-primes , , and instead of a single one and use Chinese remaindering to work modulo instead of . For instance, when using double precision floating point arithmetic, this allows us to take and instead of and . As of today, this is one of the fastest practical methods for integer multiplication [13].

3.Modular floating point arithmetic

We refer to [7, 19] for previous work on the implementation of SIMD versions of modular arithmetic. Before starting such an implementation, one first has to decide whether one wishes to rely on floating point or integer arithmetic. In principle, integer arithmetic should be most suitable, since no bits are wasted on the storage of exponents. However, for large moduli of more than 32 bits, floating point arithmetic tends to be supported better by current hardware.

First of all, at least in the case of Intel processors, throughputs and latencies for basic floating point instructions are better than for their integer counterparts. Secondly, the AVX2 and AVX-512 instruction sets only provide SIMD support for 32 × 32 bit integer products; although there is an SIMD instruction for 64 × 64 bit integer multiplication, the high part of the 128 bit result is lost. In comparison, with floating point arithmetic, the full product of a bit integer multiplication can be obtained using two fused multiply add instructions.

Based on these hardware considerations, we have decided to implement our NTT with floating point arithmetic. Our implementation focuses on odd moduli that fit into double precision numbers and works for SIMD widths up to 8, as allowed by processors with AVX-512 support.

Our approach further extends known ideas. On the one hand, for the normalized representation of integers modulo p, we use signed residues in (-p/2, p/2); this provides us with a cheap way to gain one bit of precision and it also simplifies the mathematical analysis. On the other hand, we will allow the results of certain intermediate computations to be non-normalized. The latter idea was already used with success in [13], but we push it further by taking the modulus a bit smaller (thereby sacrificing a few bits of precision), which allows us to redundantly represent modular integers by integers that can be several times larger than p (this allows for several speed-ups, as we shall see). Several recent implementations [2, 24] also use this kind of super-redundant representations in order to perform basic NTTs of order 4 and 8 without unnecessary reductions. In this paper we will present a generalization to arbitrary orders.

3.1.Pseudo-code

In what follows, we will write for the machine precision and assume that . Then any real number in the interval is a floating point number if and only if it is a multiple of . Floating point numbers can be scaled by powers with exponents . Here is assumed to satisfy . Finally, zero is a special floating point number.

When rounding to nearest, the rounding error of a floating point operation with an exact result in is always bounded by . Given a general operation with an exact result and rounded result , this means that . Note also that necessarily holds in this case. Consequently, we have

(10)

More generally, these two inequalities hold whenever the exponent of does not overflow or underflow. Our current implementation is limited to double precision, for which and , but it could easily be extended to single and other precisions. We assume correct IEEE 754 style rounding to the nearest [1].

We will present our implementation of modular arithmetic using pseudo machine code. The only data type that we will use is that of SIMD vectors of width with floating point entries (e.g. for processors with AVX-512 support, for double precision). Such vectors are typically stored in hardware registers. The first argument of all our algorithms will be the destination register and the subsequent arguments the source registers. For instance, the instruction adds two SIMD vectors and and puts the result in . In the pseudo-code, the modulus stands for a SIMD vector of potentially different moduli. We also denote by its floating point inverse , rounded to nearest.

We assume that basic instructions for SIMD floating point arithmetic are provided by the hardware, namely addition (add), subtraction (sub), multiplication (mul), fused multiply add (fma) and its variants (fms, fnma, fnms), as well as an instruction for rounding to the nearest integer (round). Precisely, the fused operators are defined as follows:

fma(x, y, z) = x y + z,   fms(x, y, z) = x y - z,   fnma(x, y, z) = -x y + z,   fnms(x, y, z) = -x y - z.

On processors with AVX-512 support, one may implement round using the intrinsic _mm512_roundscale_pd.

The modular analogues for the basic arithmetic operations are suffixed by _mod. For instance, mul_mod computes products modulo p. The results of these analogues are not required to be normalized. The modular ring operations, as well as a separate operation for normalization, can be implemented as follows:

Algorithms add_mod, sub_mod, mul_mod, and normalize (pseudo-code).

Here are temporary variables. Modular analogues of the fused operations fma, fms, fnma, and fnms turn out to be of limited use, since non-normalized addition and subtraction take only one operation.
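
For concreteness, here are scalar double precision sketches of these routines in C; the SIMD versions apply the same formulas lane-wise. This is our own rendering of the approach described above rather than a verbatim copy of the Mathemagix code: p is the odd modulus, u a floating point approximation of 1/p, and the results of add_mod, sub_mod, and mul_mod are not necessarily normalized.

#include <math.h>

static double add_mod (double x, double y) { return x + y; }   /* no reduction */
static double sub_mod (double x, double y) { return x - y; }   /* no reduction */

/* Bring x close to its normal form in (-p/2, p/2). */
static double normalize (double x, double p, double u) {
  double q = nearbyint (x * u);        /* round (x / p) to the nearest integer */
  return fma (-q, p, x);               /* x - q * p */
}

/* Non-normalized product modulo p, using the fused multiply-add trick. */
static double mul_mod (double x, double y, double p, double u) {
  double h = x * y;                    /* rounded high part of the product */
  double l = fma (x, y, -h);           /* exact low part: x * y - h */
  double q = nearbyint (h * u);        /* round (h / p) */
  return fma (-q, p, h) + l;           /* (h - q * p) + l */
}

The next subsection analyzes how far the results of such operations can be from their normal forms.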

3.2.Measuring how far we are off from normalization

For the analysis of our algorithms, it suffices to consider the scalar case . Consider a modular integer represented by an integer that fits into a floating point number of precision . Let

and note that if and only if , i.e. if and only if is normalized. If , then we regard as a measure for how non-normalized really is. For each of our algorithms add_mod, sub_mod, mul_mod, and normalize it is interesting to analyze as a function of and .

Note that is the smallest integer that cannot be represented exactly using a floating point number. For this reason, we always assume that the source arguments and of our algorithms as well as the modulus satisfy , , and . Each of our algorithms is easily seen to be correct, provided that . In what follows, we will in particular study under which circumstances this can be guaranteed.

In the cases of addition and subtraction, the analysis is simple, since we clearly have and , provided that (or , equivalently).

Proposition 1. The algorithm satisfies:

  1. If , then .

  2. If , then .

Proof. Let and be the errors for the computations of and :

and recall that , , , and , using (10). Hence,

Let be the distance between and the set . Since is odd, we must have . If , then we necessarily have , and the end-result will indeed be normalized. Here stands for the nearest integer to . The condition is satisfied as soon as , i.e. if . Since is an integer and there are no integers between and , this condition further simplifies into . This proves (a).

If , then the above calculations still yield , whence

and

This in particular implies (b).

Remark 2. If and , then , so a second application of normalize will succeed in normalizing .

Proposition 3. Let

Then the algorithm satisfies:

  1. We have .

  2. If , then .

  3. If , , and , then .

Proof. We first note that the product is represented exactly by , where and . By a similar reasoning as in the beginning of the proof of Proposition 1, we have

whence

Note that implies so is the exact value. We deduce that

This relation in particular yields (a). If , then we also obtain

which proves (b). Assume finally that and . Then

Since is odd, we have , whence the equation

has two positive roots

We took to be the larger of these two roots. This implies (c).

Remark 4. If , , and , then (c) implies . Indeed, in that case, whence .

Remark 5. Multiplication has the benefit of producing partially normalized results, especially if one of the arguments is already normalized. For instance, if and , then (a) yields

If , , and , then (a) yields

since .

4.Codelets for number theoretic transforms

The easiest way to implement DFTs is to write a single routine that takes the order and the coefficients as input and then applies, say, Cooley and Tukey's algorithm using a double loop. However, this strategy is register-inefficient for small orders and cache-inefficient for large orders. Suboptimal loop unrolling may be another problem.

For our implementation, we use the alternative strategy based on “codelets”. For a given finite field and a given target order , the idea is to automatically generate a highly optimized program for number theoretic transforms of order over (up to permutations as explained in section 2.4). Such highly optimized programs that are dedicated to one specific task are called codelets. We refer to [8, 9, 22] for early work on codelets in the more traditional context of DFTs over .

For small orders, codelets are typically designed by hand. For large orders, they are generated automatically in terms of codelets of smaller lengths. We developed a small “codelet library” inside Mathemagix for the automatic generation of codelets for NTTs of different orders as well as other linear maps. One difference with previous work is that all code is written in the Mathemagix language and that codelets can be both generated and executed at runtime.

There are often multiple solutions to a given task. For instance, for an NTT of composite order , there may be multiple ways to factor and then apply one of the algorithms from section 2.2 or 2.3. One advantage of codelets is that we may generate codelets for each solution, do some benchmarking at runtime, and then select the most efficient solution. For large orders, when benchmarking becomes more expensive, the most efficient tactics can be cached on disk. Our codelet library exploits this kind of ideas.

4.1.Straight-line programs over modular integers

A straight-line program (SLP) over is a map that can be implemented using a sequence of ring operations in . For instance, if is a primitive root of unity of order , then the following SLP with input and output computes an NTT of order :

Here , , , are auxiliary variables. Note also that the output has been re-indexed in bit-reversed order: if , then , , , and .
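
As an illustration, here is one way to write such an unrolled order 4 transform in C on top of the modular routines sketched in section 3.1; the variable names are ours, w stands for a primitive 4-th root of unity (so that w^2 = -1 modulo p), and the output comes out in bit-reversed order.

/* Unrolled NTT of order 4 in decimation-in-frequency style, without any
   intermediate normalization.  On exit, a holds (hat_a0, hat_a2, hat_a1, hat_a3). */
void ntt4 (double a[4], double w, double p, double u) {
  double b0 = add_mod (a[0], a[2]);
  double b1 = add_mod (a[1], a[3]);
  double b2 = sub_mod (a[0], a[2]);
  double b3 = mul_mod (sub_mod (a[1], a[3]), w, p, u);
  a[0] = add_mod (b0, b1);             /* hat_a0 */
  a[1] = sub_mod (b0, b1);             /* hat_a2 */
  a[2] = add_mod (b2, b3);             /* hat_a1 */
  a[3] = sub_mod (b2, b3);             /* hat_a3 */
}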

For an actual machine implementation of such an SLP, we may replace the ring operations by the algorithms add_mod, sub_mod, and mul_mod from the previous section. However, we have to be careful that all non-normalized intermediate results fit into bit integers. An easy linear-time greedy algorithm to ensure this is to insert normalize instructions whenever some intermediate result might not fit.

More precisely, for every intermediate result of a binary operation on and , we use the results from section 3.2 to compute a bound for . Whenever this bound exceeds , we insert an instruction to normalize or (we pick if and otherwise). For instance, if is slightly larger than , then applying this strategy to the above SLP yields

(using part (a) of Proposition 3)

(using Proposition 1)

(using Proposition 1)

Here we implicitly assumed that the input arguments are normalized. We also wrote for a number that is slightly larger than (e.g. will do). We indeed had to normalize or before the instruction , since otherwise would exceed . Similarly, we normalized , since . On the other hand, after these normalizations, we did not need to normalize and , since .
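
The following C sketch shows the shape of such a greedy pass for the case of add_mod; sub_mod and mul_mod are handled in the same way, using the bounds of Propositions 1 and 3. The bound function, the normalization threshold, and the symbolic register allocation are simplified placeholders of our own, not the actual Mathemagix code generator.

#include <stdio.h>

/* Every SLP value carries an upper bound eps on how non-normalized it may be;
   a value with eps <= 1 is considered normalized. */
typedef struct { int reg; double eps; } val_t;
static int next_reg = 0;

static double bound_add (double e1, double e2) { return e1 + e2; }   /* bound for add_mod/sub_mod */

static void normalize_if_needed (val_t *x, val_t *y, double bound, double max_eps) {
  if (bound <= max_eps) return;                  /* result still fits */
  val_t *worst = (x->eps >= y->eps ? x : y);     /* normalize the worse argument */
  printf ("t%d := normalize(t%d)\n", worst->reg, worst->reg);
  worst->eps = 1.0;
}

static val_t gen_add_mod (val_t x, val_t y, double max_eps) {
  normalize_if_needed (&x, &y, bound_add (x.eps, y.eps), max_eps);
  val_t r = { next_reg++, bound_add (x.eps, y.eps) };
  printf ("t%d := add_mod(t%d, t%d)\n", r.reg, x.reg, y.reg);
  return r;
}

In a complete generator the updated bounds would of course be stored with the values themselves rather than in local copies.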

A few remarks about this approach.

4.2.Number theoretic transforms of small orders

For NTTs of order up to , it is efficient to use unrolled codelets that can be regarded as SLPs. Throughout our implementation, we assumed that . This allows for moduli with a “capacity” of 48 bits. We wrote our transforms by hand and manually determined the best spots for normalizations. Although this means that we did not apply the general greedy algorithm from the previous section, we did use the same strategy to determine bounds for the , where ranges over the intermediate results, and verify that we always have .

For small power of two orders , we use the classical Cooley–Tukey algorithm without any normalization. Writing and for the input and output vectors in , and setting , Proposition 3 implies . In view of Remark 5, the worst case may only occur for evaluations at 1 and -1 (which only involve additions and subtractions).

NTTs of medium-size power of two orders are decomposed into smaller NTTs with one intermediate twiddling stage: writing for a number theoretic transform of order (while assuming the usual bit-reverse indexing), we apply the formulas

for suitable diagonal linear twiddle maps . Our implementation automatically generates completely unrolled codelets for , , and . The twiddling step has the advantage of partially normalizing the intermediate results (see Remark 5). For instance, if , , , and , then , , and , using part (a) of Proposition 3. Furthermore, the fact that several twiddle factors are equal to one allows for an optimization: whenever we need to multiply with one, we simply normalize instead. In the case of , the error bounds are a bit better, and we only need to perform two normalizations.

We also implemented hand-coded NTTs for some mixed-radix orders of the form . If , and assuming that is a primitive third root of unity, we use the following algorithm, without normalization:

One verifies that and . For , we use Good's algorithm to decompose in terms of and . This yields . For , we use the formula

without normalization, which yields . For , we use Good's algorithm with an intermediate normalization step. For , we use the formula

while taking benefit of the partial normalization of the twiddling step.

4.3.Large orders

Now assume that we wish to compute an NTT of large order with . For input and output vectors , we require that whenever . This is indeed the case for all codelets from the previous subsection.

Let be a decomposition of with . We recursively assume that we know how to compute codelets for and . Now we compute using the formula

(11)

Following (4), we compute using

(12)

for . Note that this yields whenever , by our assumption on . Setting , let be the twiddle factors with

for and . For , we next compute

(13)

which corresponds to combining (5) and (6). If , then and , by our assumption on . This proves our inductive requirement that whenever .

It is important to keep an eye on the cache efficiency for the actual implementation. In the formula (13), we precompute the twiddle factors and store the vectors , , and in contiguous segments of memory. Instead of applying a global twiddling step , we use a loop over and mix the twiddling step with the NTTs of length ; this avoids traversing elements in memory twice and thereby improves the cache efficiency. If is really large, then we might rewrite in terms of smaller NTTs and move the twiddling step even further inside.

As to (12), we first need to move the input slices with “stride” into contiguous memory. After applying the NTT, we also need to put the result back in the slice with stride . These memory rearrangements correspond to two and matrix transpositions. For cache efficiency, we need to minimize the number of full traversals of vectors of size , so it is again better to use a loop over . However, entries of the form are typically stored in the same cache line for some that depends on the hardware. It is better to treat these contiguous entries together, which can be achieved by cutting the loop over into chunks of size (in fact, slightly different chunk sizes may perform even better). This corresponds to rewriting the matrix transposition into transpositions of matrices (and similar for the inverse transposition).
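
The blocked transposition itself can be sketched as follows in C. This is a generic illustration of the chunking idea rather than the actual Mathemagix code; the block size 8 below is only an example and would in practice be matched to the cache line size and combined with the SIMD transpositions of section 4.4.

/* Transpose an n1 x n2 row-major matrix by blocks of size B x B, so that every
   cache line of the source and of the destination is traversed block by block. */
enum { B = 8 };
void transpose_blocked (double *dst /* n2 x n1 */, const double *src /* n1 x n2 */,
                        long n1, long n2) {
  for (long i0 = 0; i0 < n1; i0 += B)
    for (long j0 = 0; j0 < n2; j0 += B) {
      long i1 = (i0 + B < n1 ? i0 + B : n1);
      long j1 = (j0 + B < n2 ? j0 + B : n2);
      for (long i = i0; i < i1; i++)            /* one B x B block */
        for (long j = j0; j < j1; j++)
          dst[j * n1 + i] = src[i * n2 + j];
    }
}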

4.4.SIMD acceleration

Assume now that we are on hardware that supports SIMD vectors of width and let us investigate how to vectorize the algorithm (11) from the previous subsection.

Let us first consider the case when and . The first step (12) is easy to vectorize: we may reinterpret the map on vectors of size with entries in as the map on vectors of size with SIMD entries in . As to the second step (13), we have and the computation of the map reduces to the computation of , using a matrix transposition. In order to compute the map , this leads us to use a loop of length that treats blocks of coefficients at a time. On each block, we first multiply with the twiddle factors, then apply a matrix transposition, and finally an inline codelet for for SIMD coefficients of width .

The idea behind SIMD implementations of matrix transposition is to recursively reduce this operation to two transpositions in different lanes, followed by a linear number of operations to combine the results. This leads to an algorithm that does the full transposition in steps. Concrete implementations on existing hardware are a bit tricky: each new generation or brand of SIMD processors (SSE, AVX2, AVX-512, Neon, ...) tends to provide another set of instructions that are useful for this application (permute, blend, shuffle, etc.). Instead of covering all cases, we illustrate the above recursive meta-algorithm by showing how to perform a 4 × 4 matrix transposition using SIMD instructions on machines with AVX2 support:

// Input rows s[i] = (s_i0, s_i1, s_i2, s_i3); output d[j] = j-th column.
x[0] = _mm256_permute2f128_pd (s[0], s[2], 32)   // (s_00, s_01, s_20, s_21): low lanes of s[0], s[2]
x[2] = _mm256_permute2f128_pd (s[0], s[2], 49)   // (s_02, s_03, s_22, s_23): high lanes of s[0], s[2]
x[1] = _mm256_permute2f128_pd (s[1], s[3], 32)   // (s_10, s_11, s_30, s_31)
x[3] = _mm256_permute2f128_pd (s[1], s[3], 49)   // (s_12, s_13, s_32, s_33)
d[0] = _mm256_shuffle_pd (x[0], x[1], 0)         // (s_00, s_10, s_20, s_30)
d[1] = _mm256_shuffle_pd (x[0], x[1], 15)        // (s_01, s_11, s_21, s_31)
d[2] = _mm256_shuffle_pd (x[2], x[3], 0)         // (s_02, s_12, s_22, s_32)
d[3] = _mm256_shuffle_pd (x[2], x[3], 15)        // (s_03, s_13, s_23, s_33)

Let us now turn to the case when . For any decomposition with and , we may apply the above vectorization technique for the computation of . Using the loop (13), this also allows us to vectorize . Combined with the fact that the map trivially vectorizes, this provides us with an alternative vectorization of the formula (11). This technique has the advantage of providing more flexibility for the choices of and . In particular, for large orders , taking tends to ensure a better cache efficiency.

For our actual implementation, we emulate an SIMD width of twice the hardware SIMD width for the given floating point type. This means that every instruction is “duplexed” for the “low” and “high” parts. For instance, the codelet for from section 4.2 gives rise to the following duplex version for coefficients in :

The main advantage of this technique is that it spreads data dependencies. This makes the code less vulnerable to delays caused by instructions with high latencies. One disadvantage of duplexing is that it doubles the pressure on hardware registers. Nevertheless, this is less of a problem on modern processors with many SIMD registers.
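
As a minimal illustration of this duplexing technique (with our own names, and using a plain non-modular butterfly for brevity), each logical vector of twice the hardware width is stored as a “low” and a “high” hardware vector, and every instruction is issued once for each half:

#include <immintrin.h>

/* A logical vector of 8 doubles on AVX2 hardware: low and high halves. */
typedef struct { __m256d lo, hi; } duplex_t;

/* Duplexed butterfly (x, y) -> (x + y, x - y): the low and high halves are
   independent, so their instructions can be interleaved by the scheduler. */
static inline void butterfly_duplex (duplex_t *x, duplex_t *y) {
  __m256d s_lo = _mm256_add_pd (x->lo, y->lo);
  __m256d s_hi = _mm256_add_pd (x->hi, y->hi);
  __m256d d_lo = _mm256_sub_pd (x->lo, y->lo);
  __m256d d_hi = _mm256_sub_pd (x->hi, y->hi);
  x->lo = s_lo;  x->hi = s_hi;
  y->lo = d_lo;  y->hi = d_hi;
}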

5.Timings

We implemented the algorithms described in this paper in the Mathemagix language. In order to test the new algorithms, one has to add --enable-mmcomp to the configuration options, before compiling Mathemagix. Here mmcomp stands for the new experimental Mathemagix compiler. The code for the number theoretic transforms can be found in the mmx/hpc subdirectory of the mmlib package (revision 11190 available at https://sourcesup.renater.fr/projects/mmx/). The binary that we used to obtain the timings in Table 2 was compiled using the following command:

mmcomp --optimize hpc/bench/dft_best_mixed_bench.mmx

We systematically use the fixed prime number as our modulus for all NTTs. Consequently, has power of two roots of unity of orders up till , which is also an upper bound for our main benchmarks. For orders , our implementation does an extensive search for the best strategy. For larger orders, we make a suboptimal guess based on the best observed strategies for orders .

We tested our implementation on five different systems whose characteristics are specified in Table 1. These systems cover three SIMD widths: 128 bits (Neon, on the m1 and m3), 256 bits (AVX2, on the i7), and 512 bits (AVX-512, on the xeon and zen4).

Note that the AVX-512 support on the AMD RYZEN9 7950X3D processor is “double pumped” in the sense that certain instructions (like FMA) are done in two parts of 256 bits each.

name CPU SIMD cores frequency memory
i7 Intel Core i7 AVX2 4 physical, 8 logical 2.8 GHz 16 GB, 1.6 GHz DDR3
xeon Intel Xeon W AVX-512 8 physical, 16 logical 3.2 GHz 32 GB, 2.67 GHz DDR4
zen4 AMD 7950X3D AVX-512 16 physical, 32 logical 5.759 GHz 128 GB, 5.2 GHz DDR5
m1 Apple M1 Neon 4 performance, 4 efficiency 3.2 GHz 16 GB
m3 Apple M3 Pro Neon 6 performance, 6 efficiency 4.05 GHz 36 GB

Table 1. Overview of the five machines on which we benchmarked our implementation.


i7 xeon zen4 m1 m3 i7 xeon zen4 m1 m3
64 0.10μs 51.4ns 16.5ns 0.11μs 73.4ns 884736 5.52ms 3.10ms 1.53ms 4.51ms 3.25ms
128 0.25μs 0.11μs 56.9ns 0.25μs 0.17μs 995328 6.17ms 3.55ms 1.79ms 5.01ms 3.59ms
192 0.43μs 0.17μs 94.3ns 0.40μs 0.26μs 1048576 7.15ms 4.13ms 1.92ms 5.77ms 4.15ms
256 0.56μs 0.26μs 0.15μs 0.57μs 0.38μs 1179648 7.84ms 4.30ms 2.09ms 6.26ms 4.46ms
384 0.89μs 0.40μs 0.27μs 0.85μs 0.57μs 1327104 8.68ms 4.74ms 2.44ms 6.83ms 4.86ms
512 1.24μs 0.56μs 0.34μs 1.23μs 0.80μs 1492992 9.96ms 5.57ms 2.73ms 7.96ms 5.71ms
576 1.30μs 0.60μs 0.40μs 1.33μs 0.87μs 1572864 10.8ms 6.02ms 3.04ms 8.85ms 6.30ms
768 1.93μs 0.85μs 0.54μs 1.87μs 1.25μs 1769472 11.7ms 6.74ms 3.36ms 9.59ms 6.81ms
1024 2.57μs 1.20μs 0.77μs 2.70μs 1.82μs 1990656 13.2ms 7.50ms 3.84ms 10.6ms 7.54ms
1152 2.91μs 1.29μs 0.85μs 2.99μs 1.99μs 2097152 15.4ms 9.01ms 4.15ms 13.0ms 9.07ms
1536 4.17μs 1.83μs 1.19μs 4.14μs 2.76μs 2359296 16.3ms 9.58ms 4.55ms 13.5ms 9.60ms
1728 4.90μs 2.19μs 1.46μs 4.53μs 3.02μs 2654208 18.4ms 10.5ms 5.10ms 14.7ms 10.4ms
2048 6.21μs 2.65μs 1.67μs 5.81μs 3.98μs 2985984 21.1ms 11.9ms 5.91ms 16.5ms 11.9ms
2304 6.92μs 2.89μs 1.89μs 6.26μs 4.23μs 3145728 22.6ms 13.1ms 6.18ms 18.9ms 13.6ms
3072 9.96μs 4.03μs 2.73μs 8.77μs 5.90μs 3538944 25.1ms 14.3ms 6.97ms 20.4ms 14.9ms
3456 10.7μs 4.82μs 3.51μs 9.54μs 6.61μs 3981312 28.4ms 16.2ms 7.74ms 22.6ms 16.5ms
4096 13.7μs 6.10μs 4.22μs 12.5μs 8.65μs 4194304 33.3ms 19.5ms 9.19ms 27.6ms 21.0ms
4608 14.6μs 6.71μs 5.10μs 13.6μs 9.42μs 4718592 33.2ms 20.6ms 9.75ms 28.2ms 20.5ms
5184 17.5μs 8.10μs 5.66μs 16.2μs 11.2μs 5308416 38.8ms 22.9ms 11.9ms 30.9ms 22.6ms
6144 20.2μs 9.50μs 6.45μs 19.5μs 13.4μs 5971968 44.7ms 25.9ms 13.8ms 34.1ms 24.9ms
6912 21.8μs 10.1μs 7.07μs 20.9μs 14.6μs 6291456 47.7ms 28.6ms 15.5ms 42.0ms 32.6ms
8192 29.6μs 14.2μs 8.76μs 26.9μs 18.8μs 7077888 54.1ms 32.0ms 16.8ms 43.6ms 31.4ms
9216 30.6μs 13.9μs 9.76μs 29.7μs 20.7μs 7962624 62.3ms 36.7ms 19.4ms 47.4ms 34.6ms
10368 36.1μs 16.8μs 11.6μs 34.2μs 24.0μs 8388608 70.0ms 40.1ms 21.7ms 59.2ms 44.3ms
12288 43.0μs 19.7μs 13.4μs 41.6μs 29.0μs 9437184 78.7ms 45.2ms 24.7ms 61.5ms 45.3ms
13824 47.4μs 20.8μs 14.7μs 47.1μs 33.5μs 10616832 86.1ms 51.9ms 27.0ms 65.6ms 47.2ms
16384 63.8μs 27.6μs 18.3μs 61.4μs 42.6μs 11943936 98.5ms 58.4ms 31.1ms 72.8ms 51.9ms
18432 66.9μs 28.9μs 20.7μs 63.9μs 45.7μs 12582912 0.11s 62.6ms 33.4ms 87.2ms 65.5ms
20736 81.3μs 34.4μs 23.9μs 71.0μs 51.6μs 14155776 0.12s 70.4ms 36.5ms 93.6ms 64.9ms
24576 99.7μs 41.4μs 28.4μs 88.3μs 65.0μs 15925248 0.14s 79.3ms 41.5ms 0.10s 71.4ms
27648 0.12ms 44.6μs 32.1μs 97.0μs 70.9μs 16777216 0.16s 87.2ms 46.7ms 0.12s 93.1ms
32768 0.14ms 59.5μs 43.6μs 0.13ms 91.6μs 18874368 0.17s 94.9ms 50.4ms 0.13s 99.0ms
36864 0.16ms 68.9μs 43.8μs 0.13ms 98.3μs 21233664 0.20s 0.11s 57.2ms 0.14s 99.5ms
41472 0.16ms 83.6μs 52.0μs 0.15ms 0.11ms 23887872 0.22s 0.12s 64.2ms 0.16s 0.11s
49152 0.22ms 97.3μs 66.2μs 0.19ms 0.14ms 25165824 0.25s 0.14s 67.6ms 0.18s 0.15s
55296 0.23ms 0.12ms 77.5μs 0.20ms 0.15ms 28311552 0.28s 0.15s 76.7ms 0.21s 0.16s
65536 0.31ms 0.15ms 99.7μs 0.26ms 0.19ms 31850496 0.31s 0.17s 86.9ms 0.23s 0.16s
73728 0.33ms 0.17ms 0.11ms 0.28ms 0.21ms 33554432 0.36s 0.19s 93.7ms 0.26s 0.20s
82944 0.35ms 0.20ms 0.12ms 0.31ms 0.23ms 37748736 0.38s 0.26s 0.12s 0.29s 0.23s
98304 0.46ms 0.25ms 0.15ms 0.44ms 0.32ms 42467328 0.43s 0.28s 0.13s 0.32s 0.25s
110592 0.48ms 0.28ms 0.16ms 0.43ms 0.32ms 47775744 0.49s 0.32s 0.16s 0.36s 0.27s
124416 0.59ms 0.34ms 0.19ms 0.54ms 0.39ms 50331648 0.54s 0.33s 0.16s 0.39s 0.30s
131072 0.68ms 0.38ms 0.19ms 0.58ms 0.43ms 56623104 0.57s 0.38s 0.18s 0.44s 0.34s
147456 0.75ms 0.40ms 0.21ms 0.66ms 0.49ms 63700992 0.64s 0.41s 0.20s 0.49s 0.38s
165888 0.81ms 0.45ms 0.25ms 0.74ms 0.53ms 67108864 0.80s 0.49s 0.24s 0.54s 0.44s
196608 1.02ms 0.58ms 0.31ms 0.92ms 0.66ms 75497472 0.83s 0.51s 0.26s 0.61s 0.47s
221184 1.13ms 0.62ms 0.35ms 1.01ms 0.73ms 84934656 0.92s 0.61s 0.29s 0.67s 0.54s
248832 1.36ms 0.73ms 0.37ms 1.12ms 0.83ms 95551488 0.98s 0.66s 0.33s 0.75s 0.59s
262144 1.45ms 0.88ms 0.42ms 1.27ms 0.93ms 100663296 1.12s 0.78s 0.35s 0.83s 0.66s
294912 1.57ms 0.87ms 0.46ms 1.39ms 1.00ms 113246208 1.19s 0.78s 0.39s 0.90s 0.71s
331776 1.86ms 0.97ms 0.52ms 1.54ms 1.12ms 127401984 1.34s 0.94s 0.44s 1.03s 0.81s
393216 2.33ms 1.26ms 0.65ms 1.92ms 1.41ms 134217728 1.61s 1.00s 0.47s 1.15s 0.93s
442368 2.49ms 1.35ms 0.71ms 2.13ms 1.56ms 150994944 1.77s 1.21s 0.53s 1.26s 1.04s
497664 2.88ms 1.68ms 0.84ms 2.40ms 1.73ms 169869312 1.88s 1.29s 0.59s 1.38s 1.11s
524288 3.19ms 1.88ms 0.91ms 2.71ms 1.99ms 191102976 2.16s 1.46s 0.66s 1.59s 1.28s
589824 3.41ms 1.94ms 1.01ms 2.95ms 2.10ms 201326592 2.34s 1.54s 0.70s 1.69s 1.37s
663552 3.91ms 2.19ms 1.11ms 3.25ms 2.39ms 226492416 2.71s 1.81s 0.79s 1.90s 1.55s
786432 5.03ms 2.80ms 1.34ms 4.11ms 2.97ms 254803968 2.90s 1.93s 0.88s 2.08s 1.66s

Table 2. Extensive timings for number theoretic transforms of lengths up till over the field with .

5.1.Mono-threaded implementation

The main focus of this paper is on a mono-threaded implementation of number theoretic transforms, while exploiting SIMD instruction-level parallelism. We optimized our code for orders with and . For reference and completeness, a full table with all observed timings is provided in Table 2. For each entry, we computed three timings and reported the median one. In Figure 1, we also plotted the corresponding graph. For readability, this graph is limited to the x86-based architectures. The graph shows a familiar nearly affine behavior for the computation time in the double logarithmic scale. The sudden increase of the timings of the xeon curve near is due to an inefficient memory hierarchy on this computer.

Figure 1. Graphical representation of the timings from Table 2, for the first three columns only.

Now one NTT of a power of two length n consists of (n/2)·log2(n) butterflies. For a better understanding of the efficiency of our implementation it is therefore useful to divide our timings by (n/2)·log2(n), even when n is not a power of two. Figure 2 shows the resulting normalized timings. In addition, Figure 3 shows these normalized timings when measuring the cost of a butterfly in cycles instead of nanoseconds.

From Figure 3, we can read off that a doubled SIMD width indeed makes the implementation approximately twice as fast on x86-based architectures. Although Apple's M1 and M3 chips only support 128-bit wide SIMD, they actually behave more like CPUs with 256-bit wide SIMD. This is probably due to the fact that Apple chips provide four instead of two FMA units, which potentially doubles the instruction-level parallelism. For large , the M1 and M3 perform even better, thanks to a more efficient memory hierarchy.

When multiplying the cost in cycles with the SIMD width (while taking for the M1 and M3 chips in view of what precedes), then we note that the cost of a single butterfly stays between 5 and 8 cycles for ; this is similar to the performance of traditional scalar NTTs. For larger orders , the cost on x86-based platforms slowly increases to 12 cycles per butterfly (and even to 15 cycles per butterfly for the Xeon W system and , but we recall that our implementation was not optimized as heavily in this range).

We also note the somewhat irregular behavior of the graphs in Figures 2 and 3, with discrepancies as large as for close orders . This is not due to inaccurate or irreproducible timings: it turns out that the efficiency of the generated codelets is highly dependent on the combinatorics of the divisors of . In particular, if is a power of two, then the “pool of available strategies” is somewhat smaller than if is divisible by a few powers of . Consequently, the average costs in cycles tend to be slightly higher for power of two lengths. When using NTTs as the workhorse of evaluation-interpolation strategies (e.g. for integer multiplication), then it may be interesting to use truncated Fourier transforms (TFTs) from [15, 20]: this should allow us to smooth out the graphs from Figures 2 and 3, and benefit from the best possible NTTs for close orders .

Figure 2. Timings divided by (n/2)·log2(n), which corresponds to the number of butterflies in the case when the order n of the NTT is a power of two.

Figure 3. Variant of Figure 2 when using the cycle time as a unit instead of nanoseconds.

We also compared our new algorithm to the one from [13]. For this purpose, we slightly adapted the companion C code from [13] to work also on M1 and M3 platforms and to support orders . The timings for both implementations are shown in Table 3. We recall that the algorithm from [13] uses a 62 bit prime and unsigned integer arithmetic, instead of a 49 bit prime and double precision floating point arithmetic. We have also indicated the appropriately rescaled timings.

i7 xeon zen4 m1 m3
f48 f61 i61 f48 f61 i61 f48 f61 i61 f48 f61 i61 f48 f61 i61
64 1.51 1.92 4.71 0.86 1.09 4.94 0.49 0.63 4.16 1.87 2.37 2.77 1.55 1.97 2.62
256 1.54 1.95 4.26 0.80 1.02 4.40 0.83 1.06 3.77 1.78 2.26 2.47 1.48 1.89 2.30
1024 1.41 1.79 4.19 0.75 0.96 4.45 0.87 1.10 3.81 1.69 2.15 2.55 1.44 1.83 2.36
4096 1.56 1.98 4.22 0.79 1.01 4.47 0.99 1.26 3.82 1.62 2.06 2.51 1.43 1.81 2.34
16384 1.56 1.98 4.89 0.77 0.98 4.58 0.92 1.17 4.05 1.71 2.18 2.69 1.51 1.91 2.59
65536 1.64 2.08 5.47 0.91 1.15 4.64 1.09 1.39 4.02 1.61 2.05 2.74 1.49 1.89 2.70
262144 1.73 2.19 5.49 1.19 1.51 5.22 1.03 1.31 4.29 1.73 2.19 2.78 1.60 2.03 2.70
1048576 1.91 2.43 5.29 1.26 1.60 5.96 1.06 1.34 4.25 1.76 2.24 2.91 1.60 2.04 2.69
4194304 2.02 2.57 5.41 1.35 1.72 7.48 1.15 1.46 5.07 1.92 2.43 3.18 1.85 2.35 2.81
16777216 2.23 2.84 6.48 1.39 1.76 7.58 1.33 1.70 6.37 1.99 2.52 3.19 1.87 2.38 2.87
67108864 2.56 3.25 8.37 1.80 2.28 7.59 1.61 2.05 8.81 1.99 2.53 3.25 2.06 2.62 2.92

Table 3. Comparison of the number of clock cycles per butterfly between our new implementation and the implementation from [13]. We recall that the implementation from [13] uses a prime with and unsigned integer arithmetic. The “f48” columns correspond to our new timings for our prime with . The “f61” columns contain the scaled variants through multiplication by 61/48. The “i61” columns show the timings for the algorithm from [13].

5.2.Pure SIMD mode

In the previous subsection, we explored the cost of computing a single NTT of order over , while exploiting SIMD support in the processor. However, in order to fully benefit from hardware SIMD support of width , it is better to compute NTTs in parallel, as a single NTT of order over . Here stands for the set of SIMD vectors of length over . Such a vector fits exactly into a single hardware SIMD register. We will say that we are in pure SIMD mode when we can compute NTTs in parallel in this way. This typically happens when we need to compute many NTTs (e.g. when doing an integer matrix product as in [13]), in chunks of NTTs at a time. By contrast, we are in thwarted SIMD mode if we want to compute a single or less than NTTs, as in the previous subsection. Although this mode may still benefit from SIMD arithmetic, this typically requires non-trivial data rearrangements as described in section 4.4.

In order to compare the pure and thwarted modes, it is useful to compare the following three costs:

  1. the cost of a single NTT of length over ,

  2. times the cost of NTTs of length over , and

  3. times the cost of a single NTT of length over .

Although the first two costs ought to be similar most of the time, this is not necessarily the case when is close to one of the (L1, L2, L3, ...) cache sizes. In Table 4, we show a detailed comparison of these costs for power of two lengths. The overhead of thwarted with respect to pure SIMD is most important for small orders .

i7 simd4 xeon simd8 zen4 simd8 m3 simd2
64 0.10μs 0.11μs 83.1ns 51.4ns 65.5ns 37.8ns 16.5ns 33.1ns 24.5ns 73.4ns 73.3ns 52.0ns
128 0.25μs 0.27μs 0.19μs 0.11μs 0.12μs 85.0ns 56.9ns 69.2ns 56.5ns 0.17μs 0.16μs 0.13μs
256 0.56μs 0.57μs 0.40μs 0.26μs 0.28μs 0.18μs 0.15μs 0.17μs 0.13μs 0.38μs 0.37μs 0.29μs
512 1.24μs 1.27μs 1.05μs 0.56μs 0.63μs 0.49μs 0.34μs 0.39μs 0.35μs 0.80μs 0.80μs 0.67μs
1024 2.57μs 2.65μs 2.32μs 1.20μs 1.29μs 1.21μs 0.77μs 0.81μs 0.77μs 1.82μs 1.80μs 1.67μs
2048 6.21μs 6.26μs 4.88μs 2.65μs 2.76μs 2.39μs 1.67μs 1.73μs 1.62μs 3.98μs 3.82μs 3.34μs
4096 13.7μs 13.2μs 11.5μs 6.10μs 6.35μs 5.21μs 4.22μs 4.22μs 4.21μs 8.65μs 8.41μs 7.67μs
8192 29.6μs 31.3μs 27.7μs 14.2μs 14.6μs 13.6μs 8.76μs 8.92μs 8.60μs 18.8μs 18.3μs 16.9μs
16384 63.8μs 64.9μs 57.6μs 27.6μs 31.9μs 33.5μs 18.3μs 18.6μs 17.9μs 42.6μs 40.1μs 36.6μs
32768 0.14ms 0.15ms 0.13ms 59.5μs 69.3μs 75.1μs 43.6μs 42.7μs 41.2μs 91.6μs 86.7μs 77.7μs
65536 0.31ms 0.30ms 0.31ms 0.15ms 0.17ms 0.16ms 99.7μs 0.10ms 95.1μs 0.19ms 0.18ms 0.17ms
131072 0.68ms 0.72ms 0.66ms 0.38ms 0.44ms 0.40ms 0.19ms 0.21ms 0.20ms 0.43ms 0.40ms 0.52ms
262144 1.45ms 1.49ms 1.53ms 0.88ms 1.02ms 0.94ms 0.42ms 0.43ms 0.43ms 0.93ms 0.87ms 0.87ms
524288 3.19ms 3.31ms 3.17ms 1.88ms 2.09ms 1.95ms 0.91ms 0.94ms 0.92ms 1.99ms 1.89ms 1.88ms
1048576 7.15ms 7.03ms 7.04ms 4.13ms 4.62ms 4.31ms 1.92ms 1.97ms 2.13ms 4.15ms 4.27ms 4.15ms
2097152 15.4ms 15.7ms 15.3ms 9.01ms 9.32ms 9.63ms 4.15ms 5.38ms 5.02ms 9.07ms 9.73ms 9.42ms
4194304 33.3ms 36.1ms 35.7ms 19.5ms 19.6ms 20.5ms 9.19ms 11.0ms 11.3ms 21.0ms 20.8ms 20.2ms
8388608 70.0ms 77.9ms 76.5ms 40.1ms 40.9ms 49.4ms 21.7ms 23.4ms 24.4ms 44.3ms 43.3ms 41.7ms
16777216 0.16s 0.16s 0.16s 87.2ms 90.5ms 0.11s 46.7ms 51.4ms 53.2ms 93.1ms 90.6ms 86.0ms

Table 4. Comparison between the timings of NTTs in the thwarted and pure SIMD modes. The greyed entries correspond to timings that were obtained using slightly less optimized DFTs.

5.3.Multi-threading

The main focus of this paper has been on an almost optimal implementation of the NTT using SIMD double precision arithmetic, but in the mono-threaded case. Naturally, we were curious about the performance of our work when using multiple threads. In this last subsection, we report on some preliminary experiments, while emphasizing that this is work in progress which has not yet benefited from a similar level of optimization. We recompiled Mathemagix with the --enable-threads configure option.

1 2 3 4 5 6 7 8 9 10 11 12 14 16 24 32
i7 1.00 2.04 2.79 3.35 3.72 4.09 4.50 4.64
xeon 1.00 2.00 2.99 4.00 5.00 5.98 6.61 6.96 6.84 7.38 7.57 7.47 7.65 7.78
zen4 1.00 1.99 2.93 3.83 4.69 5.57 6.41 7.23 8.07 8.99 9.79 10.6 12.2 13.1 17.5 20.7
m1 1.00 1.93 2.80 3.73 4.21 4.71 5.19 5.61
m3 1.00 1.89 2.82 3.75 4.67 5.32 5.96 6.62 7.26 7.87 8.46 8.56

Table 5. Speed-up when executing a dummy loop using multiple threads. The top row indicates the number of threads that we use to execute our code. The main table indicates the observed speed-ups. The greyed entries correspond to the degraded regime in which hyperthreading or efficiency cores enter the stage.

Before measuring the thread efficiency of our implementation, it is useful to analyze the maximal theoretical speed-up that multi-threading may provide on our testing platforms. For this, we created a simple loop that calls an empty function many times, and measured the speed-up when executing this loop in a multi-threaded fashion. The results have been reported in Table 5. For the x86-based platforms, we note that the number of logical threads is twice as high as the number of physical threads, due to hyperthreading. It turns out that our simple loop does benefit from hyperthreading, but that this is less likely to happen for highly optimized HPC code like our NTTs. For the M1 and M3 processors, efficiency cores tend to be much slower than performance cores, with similar pitfalls as in the case of hyperthreading. Note that the observed speed-ups always decreased when using a number of threads that exceeds the maximal number of logical threads for the platform.

For our main experiments, the number of threads is fixed for each individual timing. The main codelets that benefit from multi-threaded execution are of the form or for some other codelet and some large integer . For a parameter (that we call the flooding factor), let us write with . Now we subdivide the execution of, say, into the execution of chunks for . These chunks are executed by our threads using the work stealing paradigm. In our codelet factory, the most appropriate flooding factor for given and is determined dynamically.
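
As an illustration of how such chunks can be distributed, here is a minimal C sketch based on an atomic chunk counter. It is a simplified stand-in for the work stealing scheduler of our codelet library, with names of our own; the number of chunks would typically be the flooding factor times the number of threads.

#include <pthread.h>
#include <stdatomic.h>

typedef void (*chunk_fn) (long chunk, void *env);   /* one chunk of the codelet */

typedef struct {
  chunk_fn f; void *env;
  long nchunks;
  atomic_long next;                                 /* next chunk to execute */
} job_t;

static void *worker (void *arg) {
  job_t *job = (job_t *) arg;
  for (;;) {
    long i = atomic_fetch_add (&job->next, 1);      /* grab one chunk */
    if (i >= job->nchunks) break;
    job->f (i, job->env);
  }
  return NULL;
}

/* Execute nchunks independent chunks of work on nthreads threads. */
static void run_chunks (chunk_fn f, void *env, long nchunks, int nthreads) {
  pthread_t tid[256];                               /* assume nthreads <= 256 */
  job_t job = { f, env, nchunks, 0 };
  for (int t = 0; t < nthreads; t++) pthread_create (&tid[t], NULL, worker, &job);
  for (int t = 0; t < nthreads; t++) pthread_join (tid[t], NULL);
}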

The first scenario we tested is that of a single NTT of length , while using threads. For each platform, we investigate three values for : the number of physical or performance cores, the number of logical or performance plus efficiency cores, and an intermediate value. The resulting timings are shown in Table 6. We observed a large variance, especially on x86 platforms. We consider the speed-ups on the m1 and m3 platforms to be reasonably satisfactory for a first implementation. The timings for the other platforms are far from optimal. We omitted the results for the zen4 platform, since they were so absurd that they require further investigation.

i7 xeon m1 m3
4 6 8 8 12 16 4 6 8 6 8 12
4096 1.16 2.12 1.94
8192 1.59 1.25 2.54 1.58 2.62 1.53
16384 2.05 1.75 1.34 1.12 3.04 2.24 2.07 3.59 2.44 1.91
32768 2.31 2.33 1.75 1.61 1.24 3.17 2.57 2.45 4.06 3.15 2.79
65536 2.66 2.62 2.17 2.42 1.94 3.30 2.83 2.82 4.42 3.48 3.36
131072 2.74 2.92 2.22 3.94 3.09 1.87 3.50 3.25 3.09 4.72 4.34 4.16
262144 2.65 2.98 2.44 4.75 3.27 2.22 3.69 3.65 3.77 4.87 4.34 5.07
524288 2.98 3.01 2.76 4.22 3.94 3.03 3.63 3.93 3.80 4.67 4.62 5.02
1048576 3.00 3.34 2.97 4.35 4.08 3.48 3.45 3.71 3.71 4.61 4.53 4.71
2097152 3.20 3.29 3.29 3.94 3.77 3.18 3.42 3.76 4.02 4.16 4.50 4.70
4194304 3.16 3.11 3.02 4.04 3.59 3.08 3.28 3.64 3.84 4.75 5.04 5.61
8388608 3.00 3.09 3.02 3.81 3.33 3.11 3.21 3.70 4.09 4.70 5.12 5.63
16777216 3.06 3.12 3.27 4.26 3.66 3.26 3.18 3.69 4.05 4.79 5.15 5.87
33554432 3.06 3.18 3.27 3.78 3.79 3.42 3.30 3.75 4.15 4.72 5.63 6.35

Table 6. Parallel speed-ups for our number theoretic transforms as a function of the transform length , the platform, and the number of threads .

The second scenario is when we wish to compute a large number of NTTs of the same order . This is similar to the pure SIMD mode that we investigated in subsection 5.2. We use a similar work stealing approach as above. Each individual thread repeatedly computes NTTs of order over . In Table 7, we reported detailed timings in the single case when and only for the m3 platform. The table also indicates the best flooding factor as a function of the number of threads .

1 2 4 8 16 32 64 128 256 time
1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 3.50ms
2 1.93 1.93 1.94 1.93 1.93 1.92 1.91 1.88 1.84 1.81ms
3 2.89 2.90 2.88 2.88 2.88 2.87 2.86 2.84 2.76 1.21ms
4 3.82 3.85 3.83 3.81 3.80 3.78 3.73 3.67 3.52 0.91ms
5 4.65 4.66 4.61 4.63 4.61 4.62 4.41 4.34 4.24 0.75ms
6 3.18 4.01 4.56 4.83 4.67 4.87 4.85 4.72 4.28 0.72ms
7 3.58 4.49 4.75 4.65 5.01 5.21 5.17 4.94 4.32 0.67ms
8 3.85 4.80 4.84 5.01 5.24 5.38 5.33 4.91 4.33 0.65ms
9 4.32 5.00 4.99 5.01 5.46 5.64 5.45 5.14 4.30 0.62ms
10 4.56 5.75 5.56 5.78 5.68 5.87 5.71 5.18 4.29 0.60ms
11 4.01 5.14 5.51 5.74 5.79 5.87 5.81 5.14 4.19 0.60ms
12 4.07 4.78 5.26 5.11 5.38 5.39 5.33 4.93 3.81 0.65ms

Table 7. Speed-ups on an M3 processor for NTTs of length with SIMD coefficients in , where , as a function of the number of threads and the flooding factor . The highlighted entries in each row correspond to the best flooding factor for a given number of threads. The last column reports the timing for the best flooding factor.

Bibliography

[1]

ANSI/IEEE. IEEE standard for binary floating-point arithmetic. Technical Report, ANSI/IEEE, New York, 2008. ANSI-IEEE Standard 754-2008. Revision of IEEE 754-1985, approved on June 12, 2008 by IEEE Standards Board.

[2]

J. Bradbury, N. Drucker, and M. Hillenbrand. NTT software optimization using an extended Harvey butterfly. Cryptology ePrint Archive, Paper 2021/1396, 2021. https://eprint.iacr.org/2021/1396.

[3]

P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, Berlin, 1997.

[4]

J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965.

[5]

J.-G. Dumas, P. Giorgi, and C. Pernet. FFPACK: finite field linear algebra package. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 119–126. New York, NY, USA, 2004. ACM.

[6]

J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. ACM Trans. Math. Softw., 35(3):19–1, 2008.

[7]

P. Fortin, A. Fleury, F. Lemaire, and M. Monagan. High-performance SIMD modular arithmetic for polynomial evaluation. Concurr. Comput. Pract. Exp., 33(16):e6270, 2021.

[8]

M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, PLDI '99, pages 169–180. New York, NY, USA, 1999. ACM.

[9]

M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.

[10]

S. Fu, N. Zhang, and F. Franchetti. Accelerating high-precision number theoretic transforms using Intel AVX-512. 2024. https://spiral.ece.cmu.edu/pub-spiral/pubfile/PACT_2024_AVX_371.pdf.

[11]

I. J. Good. The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. Series B, 20(2):361–372, 1958.

[12]

W. Hart and the FLINT Team. FLINT: Fast Library for Number Theory. From 2008. Software available at http://www.flintlib.org.

[13]

D. Harvey. Faster arithmetic for number-theoretic transforms. J. Symbolic Comput., 60:113–119, 2014.

[14]

M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch. Hist. Exact Sci., 34(3):265–277, 1985.

[15]

J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 290–296. New York, NY, USA, 2004. ACM.

[16]

J. van der Hoeven. The Jolly Writer. Your Guide to GNU TeXmacs. Scypress, 2020.

[17]

J. van der Hoeven and G. Lecerf. Mathemagix User Guide. HAL, 2013. https://hal.archives-ouvertes.fr/hal-00785549.

[18]

J. van der Hoeven and G. Lecerf. Evaluating straight-line programs over balls. In P. Montuschi, M. Schulte, J. Hormigo, S. Oberman, and N. Revol, editors, 2016 IEEE 23nd Symposium on Computer Arithmetic, pages 142–149. IEEE, 2016.

[19]

J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1):5–1, 2016.

[20]

R. Larrieu. The truncated Fourier transform for mixed radices. In M. Burr, editor, Proceedings of the 2017 ACM International Symposium on Symbolic and Algebraic Computation, ISSAC '17, pages 261–268. New York, NY, USA, 2017. ACM.

[21]

J. M. Pollard. The fast Fourier transform in a finite field. Math. Comput., 25(114):365–374, 1971.

[22]

M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proc. IEEE, 93(2):232–275, 2005. Special issue on “Program Generation, Optimization, and Adaptation”.

[23]

A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.

[24]

D. Takahashi. An implementation of parallel number-theoretic transform using Intel AVX-512 instructions. In F. Boulier, M. England, T. M. Sadykov, and E. V. Vorozhtsov, editors, Computer Algebra in Scientific Computing. CASC 2022., volume 13366 of Lect. Notes Comput. Sci., pages 318–332. Springer, Cham, 2022.

[25]

N. Zhang, A. Ebel, N. Neda, P. Brinich, B. Reynwar, A. G. Schmidt, M. Franusich, J. Johnson, B. Reagen, and F. Franchetti. Generating high-performance number theoretic transform implementations for vector architectures. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7. Los Alamitos, CA, USA, 2023. IEEE.