Implementing number theoretic transforms

Number theoretic transforms (NTTs) were introduced by Pollard [21]. He used them as a tool to design practical algorithms for the efficient multiplication of very large integers. Technically speaking, a number theoretic transform is simply a discrete Fourier transform (DFT) over a finite field. Traditional DFTs work over the complex numbers. A fast algorithm to compute such DFTs was published a few years earlier by Cooley and Tukey [4], although the idea goes back to Gauss [14].

As of today, fast practical algorithms for multiplying large integers are based on number theoretic transforms, complex DFTs [23, first algorithm], or Schönhage-Strassen's algorithm [23, second algorithm]. All three strategies are very competitive and the winner depends on hardware and the application. Since the race is very close, it is interesting to heavily optimize each of the three approaches.

In this paper, we will describe a high performance implementation for number theoretic transforms. We focus on modern general purpose CPUs with support for SIMD (Single Instruction Multiple Data) instructions. Our current implementation was written in Mathemagix [17], partly as a test case for the development of our new compiler (a

-version of it has been released this year). It fully exploits SIMD accelerations and we started experimentations with multi-threading.

A first way to optimize NTTs is to design faster modular arithmetic. Let

be a prime number. Traditionally, the result of any arithmetic operation modulo

is always normalized, e.g. to make it lie in

. It was first observed by Harvey [13] that some of these normalizations can be delayed when computing NTTs and that this can lead to significant speed-ups. In section 3, we show how to take this idea one step further, which allows us to delay even more normalizations.

The optimization of traditional DFTs is the subject of an extensive literature. The most efficient implementations are based on so-called “codelets” [8, 9, 22]. For a given target order

, the idea is to automatically generate a highly optimized program (called a codelet) for DFTs of order

. Such codelets can optimize the use of registers and memory, which makes them in particular cache-efficient. For small orders, we can also unroll the loops, which minimizes the overhead of control structures. In section 4, we apply the codelet approach to number theoretic transforms and discuss some specific issues that arise for this application. Similar experiments have recently been carried out in the Spiral system [10, 25].

The last section 5 is dedicated to timings. We implemented our algorithms on five different systems that support SIMD arithmetic of three different widths: 128-bit (ARM Neon), 256-bit (AVX-2), and 512-bit (AVX-512). For orders

, we achieve speed-ups that are proportional to the SIMD width. For higher orders, the x86-based systems start to suffer from memory access costs, whereas the performance of ARM-based systems remains stable. We also experimented with multi-threaded versions of our algorithms, but without pushing our optimization efforts as far as for the mono-threaded case.

The application of the fast NTTs to integer multiplication requires a wrapper for Chinese remaindering and Kronecker segmentation. This turns out to be highly non-trivial to implement with the same level of optimization as our new NTTs. We plan to address this issue in a separate future work.

2.Preliminaries

2.1.Discrete Fourier transforms

Let

be a field with a primitive root of unity

of order

and assume that

is invertible in

. We define the discrete Fourier transform (or DFT) of an

-tuple

with respect to

to be

where

That is,

is the evaluation of the polynomial

. If

is a primitive

-th root of unity, then so is its inverse

, and we have

The DFT induces a ring homomorphism from

into

, by sending

. Since the DFT is invertible, this map (that we will still denote by

) is actually an isomorphism. If we have a fast way to compute DFTs, this yields an efficient way to multiply “cyclic” polynomials

Here the multiplication

is done entry-wise. Given two polynomials

with

, we have

and the product

can be read off from

. Hence we also obtained an algorithm for the multiplication of

and

. This is called FFT-multiplication.

2.2.Divide-and-conquer DFTs

Let

be a primitive

-th root of unity, with

for

. Then

is a primitive

-th root of unity and

is a primitive

-th root of unity. Moreover, for any

and

, we have

Given a linear map

and

, it will be convenient to use the Kronecker products

and

, that are defined by

Step (5) consists of

scalar multiplications by so-called twiddle factors

. We will denote this twiddling map by

The last step corresponds to the transposition of an

matrix. Denoting this transposition map by

, the relation (3) becomes

2.3.DFTs using the Chinese remainder theorem

in the previous subsection, then it is actually possible to avoid the twiddling step (5), as first noted by Good in [11]. Consider the Bezout relation

by the vectors

of their coefficients, the map

acts as a permutation

: if

then

, for

and

. The bivariate DFT

2.4.DFTs up to permutations

The above two subsections show that efficient algorithms for DFTs often only compute them up to permutations. For some applications, it is actually not required to perform all these permutations. Assume for instance that we have an efficient algorithm for the computation of the “twisted”

for some permutation

. Then the variant

of (2) still yields an efficient method for FFT-multiplication. This technique requires the inverse transform

to be implemented with more care, since one cannot directly use the formula (1). Instead, one typically uses a similar algorithm as for

, but with all steps reversed.

The twisting technique can be combined recursively with the algorithms from the previous two subsections. Assume for instance that

and

for some permutations

and

, and let

where

. For some table

of precomputable twiddling factors, we have

for any

with

For DFTs of small orders, we also note that permutations of the input and output coefficients can be achieved without cost. As will be explained in sections 4.1 and 4.2, we will regard such DFTs as completely unrolled straight-line programs (SLPs); see definition in [3] for instance. Instead of explicitly permuting the input and output coefficients, it then suffices to appropriately rename all local variables in the SLP. This is particularly useful for the algorithm from section 2.3, which requires permutations of both the input and output coefficients.

2.5.Number theoretic transforms

In the special case when

is a finite field

for some prime number

, a discrete Fourier transform is also called a number theoretic transform (NTT). The most favorable case is when

is of the form

, where

is small. Indeed, since the multiplicative group of

is cyclic of order

, this ensures the existence of primitive roots of unity

of large orders of the form

. Nice primes

like this are sometimes called FFT primes. In what follows, for any order

, we will denote by

a number theoretic transform of the form

for some suitable permutation

and primitive

-th root of unity

Pollard first noted [21] that NTTs can be used to devise efficient practical algorithms for multiplying large integers: using Kronecker segmentation, the two multiplicands

can be rewritten as special values

of integer polynomials

with small coefficients

. Then

and the product

has coefficients in

. If

then

can be read off from its reduction modulo

. If we also have

, then the product

modulo

can be computed fast using FFT-multiplication.

For the application to integer multiplication, Pollard's technique is most efficient when

nicely fits into a machine integer or floating point number. As soon as

gets large (say

), this constraints

to become fairly small (e.g.

when using double precision floating point arithmetic with

). In order to increase

, and reduce the negative impact of

in the constraint (9), Pollard suggested to reduce modulo three FFT-primes

, and

instead of a single one and use Chinese remaindering to work modulo

instead of

. For instance, when using double precision floating point arithmetic, this allows us to take

and

instead of

and

. As of today, this is one of the fastest practical methods for integer multiplication [13].

3.Modular floating point arithmetic

We refer to [7, 19] for previous work on the implementation of SIMD versions of modular arithmetic. Before starting such an implementation, one first has to decide whether one wishes to rely on floating point or integer arithmetic. In principle, integer arithmetic should be most suitable, since no bits are wasted on the storage of exponents. However, for large moduli of more than 32 bits, floating point arithmetic tends to be supported better by current hardware.

First of all, at least in the case of Intel processors, throughputs and latencies for basic floating point instructions are better than for their integer counterparts. Secondly, the AVX2 and AVX-512 instruction sets only provide SIMD support for 32

32 bit integer products; although there is an SIMD instruction for 64

64 bit integer multiplication, the high part of the

bit result is lost. In comparison, with floating point arithmetic, the full product of a

bit integer multiplication can be obtained using two fused multiply add instructions.

Based on these hardware considerations, we have decided to implement our NTT with floating point arithmetic. Our implementation focusses on odd moduli

that fit into double precision numbers and works for SIMD widths up to

, as allowed by processors with AVX-512 support.

Our approach further extends known ideas. On the one hand, for the normalized representation of integers modulo

, we use signed residues in

; this provides us with a cheap way to gain one bit of precision and it also simplifies the mathematical analysis. On the other hand, we will allow the results of certain intermediate computations to be non-normalized. The latter idea was already used with success in [13], but we push it further by taking the modulus a bit smaller (thereby sacrificing a few bits of precision), which allows us to redundantly represent modular integers

by integers

that can be several times larger than

(this allows for several speed-ups, as we shall see). Several recent implementations [2, 24] also use this kind of super-redundant representations in order to perform basic NTTs of order

and

without unnecessary reductions. In this paper we will present a generalization to arbitrary orders.

3.1.Pseudo-code

In what follows, we will write

for the machine precision and assume that

. Then any real number in the interval

is a floating point number if and only if it is a multiple of

. Floating point number can be scaled by powers

with exponents

. Here

is assumed to satisfy

. Finally, zero is a special floating point number.

When rounding to nearest, the rounding error of a floating point operation with an exact result in

is always bounded by

. Given a general operation with an exact result

and rounded result

, this means that

. Note also that

necessarily holds in this case. Consequently, we have

More generally, these two inequalities hold whenever the exponent of

does not overflow or underflow. Our current implementation is limited to double precision, for which

and

, but it could easily be extended to single and other precisions. We assume correct IEEE 754 style rounding to the nearest [1].

We will present our implementation of modular arithmetic using pseudo machine code. The only data type that we will use are SIMD vectors of width

with floating point entries (e.g.

for processors with AVX-512 support, for double precision). Such vectors are typically stored in hardware registers. The first argument of all our algorithms will be the destination register and the subsequent arguments the source registers. For instance, the instruction

adds two SIMD vectors

and

and puts the result in

. In the pseudo-code, the modulus

stands for a SIMD vector of

potentially different moduli. We also denote by

its floating point inverse

, rounded to nearest.

We assume that basic instructions for SIMD floating point arithmetic are provided by the hardware, namely addition (add), subtraction (sub), multiplication (mul), fused multiply add (fma) and its variants (fms, fnma, fnms), as well as an instruction for rounding to the nearest integer (round). Precisely, the fused operators are defined as follows:

On processors with AVX-512 support, one may implement

using the intrinsic _mm512_roundscale_pd.

The modular analogues for the basic arithmetic operations are suffixed by

. For instance,

computes

products modulo

. The results of these analogues are not required to be normalized. The modular ring operations, as well as a separate operation

for normalization, can be implemented as follows:

Here

are temporary variables. Modular analogues of the fused operations fma, fms, fnma, and fnms turn out to be of limited use, since non-normalized addition and subtraction take only one operation.

3.2.Measuring how far we are off from normalization

For the analysis of our algorithms, it suffices to consider the scalar case

. Consider a modular integer

represented by an integer

that fits into a floating point number of precision

. Let

and note that

if and only if

, i.e. if and only if

is normalized. If

, then we regard

as a measure for how non-normalized

really is. For each of our algorithms add_mod, sub_mod, mul_mod, and normalize it is interesting to analyze

as a function of

and

Note that

is the smallest integer that cannot be represented exactly using a floating point number. For this reason, we always assume that the source arguments

and

of our algorithms as well as the modulus

satisfy

, and

. Each of our algorithms is easily seen to be correct, provided that

. In what follows, we will in particular study under which circumstances this can be guaranteed.

In the cases of addition and subtraction, the analysis is simple, since we clearly have

and

, provided that

(or

, equivalently).

Let

be the distance between

and the set

. Since

is odd, we must have

. If

, then we necessarily have

, and the end-result

will indeed be normalized. Here

stands for the nearest integer to

. The condition

is satisfied as soon as

, i.e. if

. Since

is an integer and there are no integers between

and

, this condition further simplifies into

. This proves (a).

Remark 2. If

and

, then

, so a second application of normalize will achieve to normalize

Proof. We first note that the product

is represented exactly by

, where

and

. By a similar reasoning as in the beginning of the proof of Proposition 1, we have

Remark 5. Multiplication has the benefit of producing partially normalized results, especially if one of the arguments is already normalized. For instance, if

and

, then (a) yields

4.Codelets for number theoretic transforms

The easiest way to implement DFTs is to write a single routine that takes the order

and the coefficients as input and then applies, say, Cooley and Tukey's algorithm using a double loop. However, this strategy is register-inefficient for small orders and cache-inefficient for large orders. Suboptimal loop unrolling may be another problem.

For our implementation, we use the alternative strategy based on “codelets”. For a given finite field

and a given target order

, the idea is to automatically generate a highly optimized program

for number theoretic transforms of order

over

(up to permutations as explained in section 2.4). Such highly optimized programs that are dedicated to one specific task are called codelets. We refer to [8, 9, 22] for early work on codelets in the more traditional context of DFTs over

For small orders, codelets are typically designed by hand. For large orders, they are generated automatically in terms of codelets of smaller lengths. We developed a small “codelet library” inside Mathemagix for the automatic generation of codelets for NTTs of different orders as well as other linear maps. One difference with previous work is that all code is written in the Mathemagix language and that codelets can be both generated and executed at runtime.

There are often multiple solutions to a given task. For instance, for an NTT of composite order

, there may be multiple ways to factor

and then apply one of the algorithms from section 2.2 or 2.3. One advantage of codelets is that we may generate codelets for each solution, do some benchmarking at runtime, and then select the most efficient solution. For large orders, when benchmarking becomes more expensive, the most efficient tactics can be cached on disk. Our codelet library exploits this kind of ideas.

4.1.Straight-line programs over modular integers

A straight-line program (SLP) over

is a map

that can be implemented using a sequence of ring operations in

. For instance, if

is a primitive root of unity of order

, then the following SLP with input

and output

computes an NTT of order

Here

are auxiliary variables. Note also that the output has been re-indexed in bit-reversed order: if

, then

, and

For an actual machine implementation of such an SLP, we may replace the ring operations by the algorithms

, and

from the previous section. However, we have to be careful that all non-normalized intermediate results fit into

bit integers. An easy linear-time greedy algorithm to ensure this is to insert

instructions whenever some intermediate result might not fit.

More precisely, for every intermediate result

of a binary operation on

and

, we use the results from section 3.2 to compute a bound for

. Whenever this bound exceeds

, we insert an instruction to normalize

(we pick

and

otherwise). For instance, if

is slightly larger than

, then applying this strategy to the above SLP yields





	(using part (a) of Proposition 3)
	(using Proposition 1)


	(using Proposition 1)

Here we implicitly assumed that the input arguments

are normalized. We also wrote

for a number that is slightly larger than

(e.g.

will do). We indeed had to normalize

before the instruction

, since otherwise

would exceed

. Similarly, we normalized

, since

. On the other hand, after these normalizations, we did not need to normalize

and

, since

4.2.Number theoretic transforms of small orders

For NTTs of order up to

, it is efficient to use unrolled codelets that can be regarded as SLPs. Throughout our implementation, we assumed that

. This allows for moduli

with a “capacity” of 48 bits. We wrote our transforms by hand and manually determined the best spots for normalizations. Although this means that we did apply the general greedy algorithm from the previous section, we did use the same strategy to determine bounds for the

, where

ranges over the intermediate results, and verify that we always have

For small power of two orders

, we use the classical Cooley–Tukey algorithm without any normalization. Writing

and

for the input and output vectors in

, and setting

, Proposition 3 implies

. In view of Remark 5, the worst case

may only occur for evaluations at

and

(which only involve additions and subtractions).

NTTs of medium-size power of two orders

are decomposed into smaller NTTs with one intermediate twiddling stage: writing

for a number theoretic transform of order

(while assuming the usual bit-reverse indexing), we apply the formulas

for suitable diagonal linear twiddle maps

. Our implementation automatically generates completely unrolled codelets for

, and

. The twiddling step has the advantage of partially normalizing the intermediate results (see Remark 5). For instance, if

, and

, then

, and

, using part (a) of Proposition 3. Furthermore, the fact that several twiddle factors are equal to one allows for an optimization: whenever we need to multiply

with one, we simply normalize

instead. In the case of

, the error bounds are a bit better, and we only need to perform two normalizations.

We also implemented hand-coded NTTs for some mixed-radix orders of the form

. If

, and assuming that

is a primitive third root of unity, we use the following algorithm, without normalization:

One verifies that

and

. For

, we use Good's algorithm to decompose

in terms of

and

. This yields

. For

, we use the formula

without normalization, which yields

. For

, we use Good's algorithm with an intermediate normalization step. For

, we use the formula

4.3.Large orders

Now assume that we wish to compute an NTT of large order

with

. For input and output vectors

, we require that

whenever

. This is indeed the case for all codelets from the previous subsection.

Let

be a decomposition of

with

. We recursively assume that we know how to compute codelets for

and

. Now we compute

using the formula

for

. Note that this yields

whenever

, by our assumption on

. Setting

, let

be the twiddle factors with

which corresponds to combining (5) and (6). If

, then

and

, by our assumption on

. This proves our inductive requirement that

whenever

It is important to keep an eye on the cache efficiency for the actual implementation. In the formula (12), we precompute the twiddle factors and store the vectors

, and

in contiguous segments of memory. Instead of applying a global twiddling step

, we use a loop over

and mix the twiddling step with the NTTs of length

; this avoids traversing

elements in memory twice and thereby improves the cache efficiency. If

is really large, then we might rewrite

in terms of smaller NTTs and move the twiddling step even further inside.

As to (12), we first need to move the input slices

with “stride”

into contiguous memory. After applying the NTT, we also need to put the result back in the slice

with stride

. These memory rearrangements correspond to two

and

matrix transpositions. For cache efficiency, we need to minimize the number of full traversals of vectors of size

, so it is again better to use a loop over

. However, entries of the form

are typically stored in the same cache line for some

that depends on the hardware. It is better to treat these contiguous entries together, which can be achieved by cutting the loop over

into chunks of size

(in fact, slightly different chunk sizes may perform even better). This corresponds to rewriting the

matrix transposition into

transpositions of

matrices (and similar for the inverse transposition).

4.4.SIMD acceleration

Assume now that we are on hardware that supports SIMD vectors of width

and let us investigate how to vectorize the algorithm (11) from the previous subsection.

Let us first consider the case when

and

. The first step (12) is easy to vectorize: we may reinterpret the map

on vectors of size

with entries in

as the map

on vectors of size

with SIMD entries in

. As to the second step (13), we have

and the computation of the map

reduces to the computation of

, using a

matrix transposition. In order to compute the map

, this leads us to use a loop of length

that treats blocks of

coefficients at the time. On each block, we first multiply with the twiddle factors, then apply a

matrix transposition, and finally an inline codelet for

for SIMD coefficients of width

The idea behind SIMD implementations of

matrix transposition is to recursively reduce this operation to two

transpositions in different lanes, followed by a linear number of operations to combine the results. This leads to an algorithm that does the full transposition in

steps. Concrete implementations on existing hardware are a bit tricky: every next generation or brand of SIMD processors (SSE, AVX2, AVX-512, Neon, ...) tends to provide another set of instructions that are useful for this application (permute, blend, shuffle, etc.). Instead of covering all cases, we illustrate the above recursive meta-algorithm by showing how to perform a

matrix transposition using

SIMD instructions on machines with AVX2 support:

x[0] = _mm256_permute2f128_pd (s[0], s[2], 32)

x[2] = _mm256_permute2f128_pd (s[0], s[2], 49)

x[1] = _mm256_permute2f128_pd (s[1], s[3], 32)

x[3] = _mm256_permute2f128_pd (s[1], s[3], 49)

d[0] = _mm256_shuffle_pd (x[0], x[1], 0)

d[1] = _mm256_shuffle_pd (x[0], x[1], 15)

d[2] = _mm256_shuffle_pd (x[2], x[3], 0)

d[3] = _mm256_shuffle_pd (x[2], x[3], 15)

Let us now turn to the case when

. For any decomposition

with

and

, we may apply the above vectorization technique for the computation of

. Using the loop (13), this also allows us to vectorize

. Combined with the fact that the map

trivially vectorizes, this provides us with an alternative vectorization of the formula (11). This technique has the advantage of providing more flexibility for the choices of

and

. In particular, for large orders

, taking

tends to ensure a better cache efficiency.

For our actual implementation, we emulate an SIMD width

of twice the hardware SIMD width

for the given floating point type. This means that every instruction is “duplexed” for the “low” and “high” parts. For instance, the codelet for

from section 4.2 gives rise to the following duplex version for coefficients in

The main advantage of this technique is that it spreads data dependencies. This makes the code less vulnerable to delays caused by instructions with high latencies. One disadvantage of duplexing is that it doubles the pressure on hardware registers. Nevertheless, this is less of a problem on modern processors with many SIMD registers.

5.Timings

We implemented the algorithms described in this paper in the Mathemagix language. In order to test the new algorithms, one has to add –enable-mmcomp to the configuration options, before compiling Mathemagix. Here mmcomp stands for the new experimental Mathemagix compiler. The code for the number theoretic transforms can be found in the mmx/hpc subdirectory of the mmlib package (revision 11190 available at https://sourcesup.renater.fr/projects/mmx/). The binary that we used to obtain the timings in Table 2 was compiled using the following command:

We systematically use the fixed prime number

as our modulus for all NTTs. Consequently,

has power of two roots of unity of orders up till

, which is also an upper bound for our main benchmarks. For orders

, our implementation does an extensive search for the best strategy. For larger orders, we make a suboptimal guess based on the best observed strategies for orders

We tested our implementation on five different systems whose characteristics are specified in Table 1. These systems cover three SIMD widths:

Note that the AVX-512 support on the AMD RYZEN9 7950X3D processor is “double pumped” in the sense that certain instructions (like FMA) are done in two parts of 256 bits each

xeon

zen4

xeon

zen4

0.10μs

51.4ns

16.5ns

0.11μs

73.4ns

884736

5.52ms

3.10ms

1.53ms

4.51ms

3.25ms

128

0.25μs

0.11μs

56.9ns

0.25μs

0.17μs

995328

6.17ms

3.55ms

1.79ms

5.01ms

3.59ms

192

0.43μs

0.17μs

94.3ns

0.40μs

0.26μs

1048576

7.15ms

4.13ms

1.92ms

5.77ms

4.15ms

256

0.56μs

0.26μs

0.15μs

0.57μs

0.38μs

1179648

7.84ms

4.30ms

2.09ms

6.26ms

4.46ms

384

0.89μs

0.40μs

0.27μs

0.85μs

0.57μs

1327104

8.68ms

4.74ms

2.44ms

6.83ms

4.86ms

512

1.24μs

0.56μs

0.34μs

1.23μs

0.80μs

1492992

9.96ms

5.57ms

2.73ms

7.96ms

5.71ms

576

1.30μs

0.60μs

0.40μs

1.33μs

0.87μs

1572864

10.8ms

6.02ms

3.04ms

8.85ms

6.30ms

768

1.93μs

0.85μs

0.54μs

1.87μs

1.25μs

1769472

11.7ms

6.74ms

3.36ms

9.59ms

6.81ms

1024

2.57μs

1.20μs

0.77μs

2.70μs

1.82μs

1990656

13.2ms

7.50ms

3.84ms

10.6ms

7.54ms

1152

2.91μs

1.29μs

0.85μs

2.99μs

1.99μs

2097152

15.4ms

9.01ms

4.15ms

13.0ms

9.07ms

1536

4.17μs

1.83μs

1.19μs

4.14μs

2.76μs

2359296

16.3ms

9.58ms

4.55ms

13.5ms

9.60ms

1728

4.90μs

2.19μs

1.46μs

4.53μs

3.02μs

2654208

18.4ms

10.5ms

5.10ms

14.7ms

10.4ms

2048

6.21μs

2.65μs

1.67μs

5.81μs

3.98μs

2985984

21.1ms

11.9ms

5.91ms

16.5ms

11.9ms

2304

6.92μs

2.89μs

1.89μs

6.26μs

4.23μs

3145728

22.6ms

13.1ms

6.18ms

18.9ms

13.6ms

3072

9.96μs

4.03μs

2.73μs

8.77μs

5.90μs

3538944

25.1ms

14.3ms

6.97ms

20.4ms

14.9ms

3456

10.7μs

4.82μs

3.51μs

9.54μs

6.61μs

3981312

28.4ms

16.2ms

7.74ms

22.6ms

16.5ms

4096

13.7μs

6.10μs

4.22μs

12.5μs

8.65μs

4194304

33.3ms

19.5ms

9.19ms

27.6ms

21.0ms

4608

14.6μs

6.71μs

5.10μs

13.6μs

9.42μs

4718592

33.2ms

20.6ms

9.75ms

28.2ms

20.5ms

5184

17.5μs

8.10μs

5.66μs

16.2μs

11.2μs

5308416

38.8ms

22.9ms

11.9ms

30.9ms

22.6ms

6144

20.2μs

9.50μs

6.45μs

19.5μs

13.4μs

5971968

44.7ms

25.9ms

13.8ms

34.1ms

24.9ms

6912

21.8μs

10.1μs

7.07μs

20.9μs

14.6μs

6291456

47.7ms

28.6ms

15.5ms

42.0ms

32.6ms

8192

29.6μs

14.2μs

8.76μs

26.9μs

18.8μs

7077888

54.1ms

32.0ms

16.8ms

43.6ms

31.4ms

9216

30.6μs

13.9μs

9.76μs

29.7μs

20.7μs

7962624

62.3ms

36.7ms

19.4ms

47.4ms

34.6ms

10368

36.1μs

16.8μs

11.6μs

34.2μs

24.0μs

8388608

70.0ms

40.1ms

21.7ms

59.2ms

44.3ms

12288

43.0μs

19.7μs

13.4μs

41.6μs

29.0μs

9437184

78.7ms

45.2ms

24.7ms

61.5ms

45.3ms

13824

47.4μs

20.8μs

14.7μs

47.1μs

33.5μs

10616832

86.1ms

51.9ms

27.0ms

65.6ms

47.2ms

16384

63.8μs

27.6μs

18.3μs

61.4μs

42.6μs

11943936

98.5ms

58.4ms

31.1ms

72.8ms

51.9ms

18432

66.9μs

28.9μs

20.7μs

63.9μs

45.7μs

12582912

0.11s

62.6ms

33.4ms

87.2ms

65.5ms

20736

81.3μs

34.4μs

23.9μs

71.0μs

51.6μs

14155776

0.12s

70.4ms

36.5ms

93.6ms

64.9ms

24576

99.7μs

41.4μs

28.4μs

88.3μs

65.0μs

15925248

0.14s

79.3ms

41.5ms

0.10s

71.4ms

27648

0.12ms

44.6μs

32.1μs

97.0μs

70.9μs

16777216

0.16s

87.2ms

46.7ms

0.12s

93.1ms

32768

0.14ms

59.5μs

43.6μs

0.13ms

91.6μs

18874368

0.17s

94.9ms

50.4ms

0.13s

99.0ms

36864

0.16ms

68.9μs

43.8μs

0.13ms

98.3μs

21233664

0.20s

0.11s

57.2ms

0.14s

99.5ms

41472

0.16ms

83.6μs

52.0μs

0.15ms

0.11ms

23887872

0.22s

0.12s

64.2ms

0.16s

0.11s

49152

0.22ms

97.3μs

66.2μs

0.19ms

0.14ms

25165824

0.25s

0.14s

67.6ms

0.18s

0.15s

55296

0.23ms

0.12ms

77.5μs

0.20ms

0.15ms

28311552

0.28s

0.15s

76.7ms

0.21s

0.16s

65536

0.31ms

0.15ms

99.7μs

0.26ms

0.19ms

31850496

0.31s

0.17s

86.9ms

0.23s

0.16s

73728

0.33ms

0.17ms

0.11ms

0.28ms

0.21ms

33554432

0.36s

0.19s

93.7ms

0.26s

0.20s

82944

0.35ms

0.20ms

0.12ms

0.31ms

0.23ms

37748736

0.38s

0.26s

0.12s

0.29s

0.23s

98304

0.46ms

0.25ms

0.15ms

0.44ms

0.32ms

42467328

0.43s

0.28s

0.13s

0.32s

0.25s

110592

0.48ms

0.28ms

0.16ms

0.43ms

0.32ms

47775744

0.49s

0.32s

0.16s

0.36s

0.27s

124416

0.59ms

0.34ms

0.19ms

0.54ms

0.39ms

50331648

0.54s

0.33s

0.16s

0.39s

0.30s

131072

0.68ms

0.38ms

0.19ms

0.58ms

0.43ms

56623104

0.57s

0.38s

0.18s

0.44s

0.34s

147456

0.75ms

0.40ms

0.21ms

0.66ms

0.49ms

63700992

0.64s

0.41s

0.20s

0.49s

0.38s

165888

0.81ms

0.45ms

0.25ms

0.74ms

0.53ms

67108864

0.80s

0.49s

0.24s

0.54s

0.44s

196608

1.02ms

0.58ms

0.31ms

0.92ms

0.66ms

75497472

0.83s

0.51s

0.26s

0.61s

0.47s

221184

1.13ms

0.62ms

0.35ms

1.01ms

0.73ms

84934656

0.92s

0.61s

0.29s

0.67s

0.54s

248832

1.36ms

0.73ms

0.37ms

1.12ms

0.83ms

95551488

0.98s

0.66s

0.33s

0.75s

0.59s

262144

1.45ms

0.88ms

0.42ms

1.27ms

0.93ms

100663296

1.12s

0.78s

0.35s

0.83s

0.66s

294912

1.57ms

0.87ms

0.46ms

1.39ms

1.00ms

113246208

1.19s

0.78s

0.39s

0.90s

0.71s

331776

1.86ms

0.97ms

0.52ms

1.54ms

1.12ms

127401984

1.34s

0.94s

0.44s

1.03s

0.81s

393216

2.33ms

1.26ms

0.65ms

1.92ms

1.41ms

134217728

1.61s

1.00s

0.47s

1.15s

0.93s

442368

2.49ms

1.35ms

0.71ms

2.13ms

1.56ms

150994944

1.77s

1.21s

0.53s

1.26s

1.04s

497664

2.88ms

1.68ms

0.84ms

2.40ms

1.73ms

169869312

1.88s

1.29s

0.59s

1.38s

1.11s

524288

3.19ms

1.88ms

0.91ms

2.71ms

1.99ms

191102976

2.16s

1.46s

0.66s

1.59s

1.28s

589824

3.41ms

1.94ms

1.01ms

2.95ms

2.10ms

201326592

2.34s

1.54s

0.70s

1.69s

1.37s

663552

3.91ms

2.19ms

1.11ms

3.25ms

2.39ms

226492416

2.71s

1.81s

0.79s

1.90s

1.55s

786432

5.03ms

2.80ms

1.34ms

4.11ms

2.97ms

254803968

2.90s

1.93s

0.88s

2.08s

1.66s

Table 2. Extensive timings for number theoretic transforms of lengths up till over the field with .

5.1.Mono-threaded implementation

The main focus of this paper is on a mono-threaded implementation of number theoretic transforms, while exploiting SIMD instruction-level parallelism. We optimized our code for orders

with

and

. For reference and completeness, a full table with all observed timings is provided in Table 2 . For each entry, we computed three timings and reported the median one. In Figure 1 , we also plotted the corresponding graph. For readability, this graph is limited to the x86-based architectures. The graph shows a familiar nearly affine behavior for the computation time in the double logarithmic scale. The sudden increase of the timings of the xeon curve near

is due to an inefficient memory hierarchy on this computer.

Now one NTT of a power of two length

consists of

butterflies. For a better understanding of the efficiency of our implementation it is therefore useful to divide our timings by

, even when

is not a power of two. Figure 2 shows the resulting normalized timings. In addition, Figure 3 shows these normalized timings when measuring the cost of a butterfly in cycles instead of nanoseconds.

From Figure 3, we can read off that a doubled SIMD width indeed makes the implementation approximately twice as fast on x86-based architectures. Although Apple's M1 and M3 chips only support 128-bit wide SIMD, they actually behave more like CPUs with 256-bit wide SIMD. This is probably due to the fact that Apple chips provide four instead of two FMA units, which potentially doubles the instruction-level parallelism. For large

, the M1 and M3 perform even better, thanks to a more efficient memory hierarchy.

When multiplying the cost in cycles with the SIMD width

(while taking

for the M1 and M3 chips in view of what precedes), then we note that the cost of a single butterfly stays between 5 and 8 cycles for

; this is similar to the performance of traditional scalar NTTs. For larger orders

, the cost on x86-based platforms slowly increases to 12 cycles per butterfly (and even to 15 cycles per butterfly for the Xeon W system and

, but we recall that our implementation was not optimized as heavily in this range).

We also note the somewhat irregular behavior of the graphs in Figures 2 and 3, with discrepancies as large as

for close orders

. This is not due to inaccurate or irreproducible timings: it turns out that the efficiency of the generated codelets is highly dependent on the combinatorics of the divisors of

. In particular, if

is a power of two, then the “pool of available strategies” is somewhat smaller than if

is divisible by a few powers of

. Consequently, the average costs in cycles tend to be slightly higher for power of two lengths. When using NTTs as the workhorse of evaluation-interpolation strategies (e.g. for integer multiplication), then it may be interesting to use truncated Fourier transforms (TFTs) from [15, 20]: this should allow us to smooth out the graphs from Figures 2 and 3, and benefit from the best possible NTTs for close orders

We also compared our new algorithm to the one from [ 13 ]. For this purpose, we slightly adapted the companion C code from [ 13 ] to work also on M1 and M3 platforms and to support orders

. The timings for both implementations are shown in Table 3 . We recall that the algorithm from [ 13 ] uses a 62 bit prime and unsigned integer arithmetic, instead of a 49 bit prime and double precision floating point arithmetic. We have also indicated the appropriately rescaled timings.

5.2.Pure SIMD mode

In the previous subsection, we explored the cost of computing a single NTT of order

over

, while exploiting SIMD support in the processor. However, in order to fully benefit from hardware SIMD support of width

, it is better to compute

NTTs in parallel, as a single NTT of order

over

. Here

stands for the set of SIMD vectors of length

over

. Such a vector fits exactly into a single hardware SIMD register. We will say that we are in pure SIMD mode when we can compute

NTTs in parallel in this way. This typically happens when we need to compute many NTTs (e.g. when doing an integer matrix product as in [13]), in chunks of

NTTs at a time. By contrast, we are in thwarted SIMD mode if we want to compute a single or less than

NTTs, as in the previous subsection. Although this mode may still benefit from SIMD arithmetic, this typically requires non-trivial data rearrangements as described in section 4.4.

In order to compare the pure and thwarted modes, it is useful to compare the following three costs:

Although the first two costs ought to be similar most of the time, this is not necessarily the case when

is close to one of the (L1, L2, L3, . . . ) cache sizes. In Table 4 , we show a detailed comparison of these costs for power of two lengths. The overhead of thwarted with respect to pure SIMD is most important for small orders

simd₄

xeon

simd₈

zen4

simd₈

simd₂

0.10μs

0.11μs

83.1ns

51.4ns

65.5ns

37.8ns

16.5ns

33.1ns

24.5ns

73.4ns

73.3ns

52.0ns

128

0.25μs

0.27μs

0.19μs

0.11μs

0.12μs

85.0ns

56.9ns

69.2ns

56.5ns

0.17μs

0.16μs

0.13μs

256

0.56μs

0.57μs

0.40μs

0.26μs

0.28μs

0.18μs

0.15μs

0.17μs

0.13μs

0.38μs

0.37μs

0.29μs

512

1.24μs

1.27μs

1.05μs

0.56μs

0.63μs

0.49μs

0.34μs

0.39μs

0.35μs

0.80μs

0.67μs

1024

2.57μs

2.65μs

2.32μs

1.20μs

1.29μs

1.21μs

0.77μs

0.81μs

0.77μs

1.82μs

1.80μs

1.67μs

2048

6.21μs

6.26μs

4.88μs

2.65μs

2.76μs

2.39μs

1.67μs

1.73μs

1.62μs

3.98μs

3.82μs

3.34μs

4096

13.7μs

13.2μs

11.5μs

6.10μs

6.35μs

5.21μs

4.22μs

4.21μs

8.65μs

8.41μs

7.67μs

8192

29.6μs

31.3μs

27.7μs

14.2μs

14.6μs

13.6μs

8.76μs

8.92μs

8.60μs

18.8μs

18.3μs

16.9μs

16384

63.8μs

64.9μs

57.6μs

27.6μs

31.9μs

33.5μs

18.3μs

18.6μs

17.9μs

42.6μs

40.1μs

36.6μs

32768

0.14ms

0.15ms

0.13ms

59.5μs

69.3μs

75.1μs

43.6μs

42.7μs

41.2μs

91.6μs

86.7μs

77.7μs

65536

0.31ms

0.30ms

0.31ms

0.15ms

0.17ms

0.16ms

99.7μs

0.10ms

95.1μs

0.19ms

0.18ms

0.17ms

131072

0.68ms

0.72ms

0.66ms

0.38ms

0.44ms

0.40ms

0.19ms

0.21ms

0.20ms

0.43ms

0.40ms

0.52ms

262144

1.45ms

1.49ms

1.53ms

0.88ms

1.02ms

0.94ms

0.42ms

0.43ms

0.93ms

0.87ms

524288

3.19ms

3.31ms

3.17ms

1.88ms

2.09ms

1.95ms

0.91ms

0.94ms

0.92ms

1.99ms

1.89ms

1.88ms

1048576

7.15ms

7.03ms

7.04ms

4.13ms

4.62ms

4.31ms

1.92ms

1.97ms

2.13ms

4.15ms

4.27ms

4.15ms

2097152

15.4ms

15.7ms

15.3ms

9.01ms

9.32ms

9.63ms

4.15ms

5.38ms

5.02ms

9.07ms

9.73ms

9.42ms

4194304

33.3ms

36.1ms

35.7ms

19.5ms

19.6ms

20.5ms

9.19ms

11.0ms

11.3ms

21.0ms

20.8ms

20.2ms

8388608

70.0ms

77.9ms

76.5ms

40.1ms

40.9ms

49.4ms

21.7ms

23.4ms

24.4ms

44.3ms

43.3ms

41.7ms

16777216

0.16s

87.2ms

90.5ms

0.11s

46.7ms

51.4ms

53.2ms

93.1ms

90.6ms

86.0ms

Table 4. Comparison between the timings of NTTs in the thwarted and pure SIMD modes. The greyed entries correspond to timings that were obtained using slightly less optimized DFTs.

5.3.Multi-threading

The main focus of this paper has been on an almost optimal implementation of the NTT using SIMD double precision arithmetic, but in the mono-threaded case. Naturally, we were curious about the performance of our work when using multiple threads. In this last subsection, we report on some preliminary experiments, while emphasizing that this is work in progress which has not yet benefited from a similar level of optimization. We recompiled Mathemagix with the –enable-threads configure option.

Before measuring the thread efficiency of our implementation, it is useful to analyze the maximal theoretical speed-up that multi-threading may provide on our testing platforms. For this, we created a simple loop that calls many times an empty function, and measured the speed-up when executing this loop in a multi-threaded fashion. The results have been reported in Table 5. For the x86-based platforms, we note that the number of logical threads is twice as high as the number of physical threads, due to hyperthreading. It turns out that our simple loop does benefit from hyperthreading, but that this is less likely to happen for highly optimized HPC code like our NTTs. For the M1 and M3 processors, efficiency cores tend to be much slower than performance cores, with similar pitfalls as in the case of hyperthreading. Note that the observed speed-ups always decreased when using a number of threads that exceeds the maximal number of logical threads for the platform.

For our main experiments, the number of threads

is fixed for each individual timing. The main codelets that benefit from multi-threaded execution are of the form

for some other codelet

and some large integer

. For a parameter

(that we call the flooding factor), let us write

with

. Now we subdivide the execution of, say,

into the execution of chunks

for

. These chunks are executed by our

threads using the work stealing paradigm. In our codelet factory, the most appropriate flooding factor

for given

and

is determined dynamically.

The first scenario we tested is that of a single NTT of length

, while using

threads. For each platform, we investigate three values for

: the number of physical or performance cores, the number of logical or performance plus efficiency cores, and an intermediate value. The resulting timings are shown in Table 6 . We observed a large variance, especially on x86 platforms. We consider the speed-ups on the m1 and m3 platforms to be reasonably satisfactory for a first implementation. The timings for the other platform are far from optimal. We omitted the results for the zen4 platform, since they were so absurd that they require further investigation.

The second scenario is when we wish to compute a large number

of NTTs of the same order

. This is similar to the pure SIMD mode that we investigated in subsection 5.2 . We use a similar work stealing approach as above. Each individual thread repeatedly computes NTTs of order

over

. In Table 7 , we reported detailed timings in the single case when

and only for the m3 platform. The table also indicates the best flooding factor

as a function of the number of threads

Bibliography

[1]: ANSI/IEEE. IEEE standard for binary floating-point arithmetic. Technical Report, ANSI/IEEE, New York, 2008. ANSI-IEEE Standard 754-2008. Revision of IEEE 754-1985, approved on June 12, 2008 by IEEE Standards Board.
[2]: J. Bradbury, N. Drucker, and M. Hillenbrand. NTT software optimization using an extended Harvey butterfly. Cryptology ePrint Archive, Paper 2021/1396, 2021. https://eprint.iacr.org/2021/1396.
[3]: P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, Berlin, 1997.
[4]: J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965.
[5]: J.-G. Dumas, P. Giorgi, and C. Pernet. FFPACK: finite field linear algebra package. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 119–126. New York, NY, USA, 2004. ACM.
[6]: J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. ACM Trans. Math. Softw., 35(3):19–1, 2008.
[7]: P. Fortin, A. Fleury, F. Lemaire, and M. Monagan. High-performance SIMD modular arithmetic for polynomial evaluation. Concurr. Comput. Pract. Exp., 33(16):e6270, 2021.
[8]: M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, PLDI '99, pages 169–180. New York, NY, USA, 1999. ACM.
[9]: M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.
[10]: S. Fu, N. Zhang, and F. Franchetti. Accelerating high-precision number theoretic transforms using Intel AVX-512. 2024. https://spiral.ece.cmu.edu/pub-spiral/pubfile/PACT_2024_AVX_371.pdf.
[11]: I. J. Good. The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. Series B, 20(2):361–372, 1958.
[12]: W. Hart and the FLINT Team. FLINT: Fast Library for Number Theory. From 2008. Software available at http://www.flintlib.org.
[13]: D. Harvey. Faster arithmetic for number-theoretic transforms. J. Symbolic Comput., 60:113–119, 2014.
[14]: M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch. Hist. Exact Sci., 34(3):265–277, 1985.
[15]: J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 290–296. New York, NY, USA, 2004. ACM.
[16]: J. van der Hoeven. The Jolly Writer. Your Guide to GNU TeXmacs. Scypress, 2020.
[17]: J. van der Hoeven and G. Lecerf. Mathemagix User Guide. HAL, 2013. https://hal.archives-ouvertes.fr/hal-00785549.
[18]: J. van der Hoeven and G. Lecerf. Evaluating straight-line programs over balls. In P. Montuschi, M. Schulte, J. Hormigo, S. Oberman, and N. Revol, editors, 2016 IEEE 23nd Symposium on Computer Arithmetic, pages 142–149. IEEE, 2016.
[19]: J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1):5–1, 2016.
[20]: R. Larrieu. The truncated Fourier transform for mixed radices. In M. Burr, editor, Proceedings of the 2017 ACM International Symposium on Symbolic and Algebraic Computation, ISSAC '17, pages 261–268. New York, NY, USA, 2017. ACM.
[21]: J. M. Pollard. The fast Fourier transform in a finite field. Math. Comput., 25(114):365–374, 1971.
[22]: M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proc. IEEE, 93(2):232–275, 2005. Special issue on “Program Generation, Optimization, and Adaptation”.
[23]: A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[24]: D. Takahashi. An implementation of parallel number-theoretic transform using Intel AVX-512 instructions. In F. Boulier, M. England, T. M. Sadykov, and E. V. Vorozhtsov, editors, Computer Algebra in Scientific Computing. CASC 2022., volume 13366 of Lect. Notes Comput. Sci., pages 318–332. Springer, Cham, 2022.
[25]: N. Zhang, A. Ebel, N. Neda, P. Brinich, B. Reynwar, A. G. Schmidt, M. Franusich, J. Johnson, B. Reagen, and F. Franchetti. Generating high-performance number theoretic transform implementations for vector architectures. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7. Los Alamitos, CA, USA, 2023. IEEE.

Algorithm	Algorithm

Algorithm	Algorithm

			(4)
			(5)
			(6)
			(7)

1.Introduction

2.Preliminaries

2.1.Discrete Fourier transforms

2.2.Divide-and-conquer DFTs

2.3.DFTs using the Chinese remainder theorem

2.4.DFTs up to permutations

2.5.Number theoretic transforms

3.Modular floating point arithmetic

3.1.Pseudo-code

3.2.Measuring how far we are off from normalization

4.Codelets for number theoretic transforms

4.1.Straight-line programs over modular integers

4.2.Number theoretic transforms of small orders

4.3.Large orders

4.4.SIMD acceleration

5.Timings

5.1.Mono-threaded implementation

5.2.Pure SIMD mode

5.3.Multi-threading

Bibliography