> <\body> <\hide-preamble> >> |<\doc-note> Grégoire Lecerf has been supported by the French NODE project. Joris van der Hoeven has been supported by an ERC-2023-ADG grant for the ODELIX project (number 101142171). <\wide-tabular> ||||||| ||LOGO_ERC-FLAG_FP.png>|61.62pt|28.51pt||>>>>> |>|||<\author-affiliation> CNRS (UMR 7161, LIX) Laboratoire d'informatique de l'École polytechnique Bâtiment Alan Turing, CS35003 1, rue Honoré d'Estienne d'Orves 91120 Palaiseau, France |>>||<\author-affiliation> CNRS (UMR 7161, LIX) Laboratoire d'informatique de l'École polytechnique Bâtiment Alan Turing, CS35003 1, rue Honoré d'Estienne d'Orves 91120 Palaiseau, France |>>|<\doc-note> > We describe a new, highly optimized implementation of number theoretic transforms on processors with SIMD support (AVX, AVX-512, and Neon). For any prime modulus and any order of the form \3\p-1>, our implementation can automatically generate a dedicated codelet to compute the number theoretic transform of order over >. New speed-ups were achieved by relying heavily on non-normalized modular arithmetic and allowing for orders that are not necessarily powers of two. |||> Number theoretic transforms (NTTs) were introduced by Pollard. He used them as a tool to design practical algorithms for the efficient multiplication of very large integers. Technically speaking, a number theoretic transform is simply a discrete Fourier transform (DFT) over a finite field. Traditional DFTs work over the complex numbers. A fast algorithm to compute such DFTs was published a few years earlier by Cooley and Tukey, although the idea goes back to Gauss. As of today, fast practical algorithms for multiplying large integers are based on number theoretic transforms, complex DFTs, or Schönhage-Strassen's algorithm. All three strategies are very competitive and the winner depends on hardware and the application. Since the race is very close, it is interesting to heavily optimize each of the three approaches. 
In this paper, we will describe a high performance implementation for number theoretic transforms. We focus on modern general purpose CPUs with support for SIMD (Single Instruction Multiple Data) instructions. Our current implementation was written in , partly as a test case for the development of our new compiler (a version of it has been released this year). It fully exploits SIMD accelerations and we have started experimenting with multi-threading. A first way to optimize NTTs is to design faster modular arithmetic. Let be a prime number. Traditionally, the result of any arithmetic operation modulo is always normalized, to make it lie in ,p-1|}>>. It was first observed by Harvey that some of these normalizations can be delayed when computing NTTs and that this can lead to significant speed-ups. In section, we show how to take this idea one step further, which allows us to delay even more normalizations. While we were working on our implementation<\footnote> We started our implementation in 2019, but it remained private due to the experimental nature of the new compiler. , a few other implementations appeared that use similar ideas. The delayed reduction strategy has been generalized to 4-wide and 8-wide butterflies in. Dan Schultz recently added an optimized implementation of this kind to the library. We note that the idea of delaying normalizations is also very popular in linear algebra; see for instance>. Another inspiration and noteworthy analogue is the \Ptransient ball arithmetic\Q that was developed in. The optimization of traditional DFTs is the subject of an extensive literature. The most efficient implementations are based on so-called \Pcodelets\Q. For a given target order, the idea is to automatically generate a highly optimized program (called a codelet) for DFTs of order . Such codelets can optimize the use of registers and memory, which in particular makes them cache-efficient.
For small orders, we can also unroll the loops, which minimizes the overhead of control structures. In section, we apply the codelet approach to number theoretic transforms and discuss some specific issues that arise for this application. Similar experiments have recently been carried out in the system. The last section is dedicated to timings. We implemented our algorithms on five different systems that support SIMD arithmetic of three different widths: 128-bit (ARM Neon), 256-bit (AVX-2), and 512-bit (AVX-512). For orders 2>, we achieve speed-ups that are proportional to the SIMD width. For higher orders, the x86-based systems start to suffer from memory access costs, whereas the performance of ARM-based systems remains stable. We also experimented with multi-threaded versions of our algorithms, but without pushing our optimization efforts as far as in the single-threaded case. The application of the fast NTTs to integer multiplication requires a wrapper for Chinese remaindering and Kronecker segmentation. This turns out to be highly non-trivial to implement with the same level of optimization as our new NTTs. We plan to address this issue in a separate future work. Let > be a field with a primitive root of unity > of order and assume that is invertible in >. We define the (or DFT) of a tuple ,\,a|)>\\> with respect to > to be >==,\,|)>\\> where <\eqnarray*> >|>|+a*\+\+a*\*i>.>>>> That is, > is the evaluation of the polynomial \a+a*X+\+a*X> at >. If > is a primitive -th root of unity, then so is its inverse =\>, and we have <\eqnarray*> >>|)>>||>>>> Indeed, writing DFT>>|)>>, we have <\equation*> b=*\=a*\>=a*\>=a*|)>=r*a, where =1> if and =0> otherwise. The DFT induces a ring homomorphism from /-1|)>> into >, by sending +\+a*x> to >>. Since the DFT is invertible, this map (that we will still denote by >>) is actually an isomorphism.
If we have a fast way to compute DFTs, this yields an efficient way to multiply \Pcyclic\Q polynomials |P,Q|\>\\/-1|)>>: <\equation> P*Q=DFT>>*DFT>|)>. Here the multiplication >*DFT>> in > is done entry-wise. Given two polynomials \> with |A*B\r|\>>, we have <\equation*> A*B mod -1|)>=-1|)>|)>*-1|)>|)> and the product can be read off from -1|)>>. Hence we also obtained an algorithm for the multiplication of and . This is called >. Let > be a primitive -th root of unity, with *r> for |r,r|\>\r>. Then \\>> is aprimitive >-th root of unity and \\>> is a primitive >-th root of unity. Moreover, for any \,r-1|}>>> and \,r-1|}>>, we have <\eqnarray*> *r+i>>||=0>-1>=0>-1>a*r+k>*\*r+k|)>**r+i|)>>>>|||=0>-1>\*i>*=0>-1>a*r+k>*>|)>*i>|)>*>|)>*i>.>>>> This formula can be decomposed in four steps: <\eqnarray*> *r+k>>|>|=0>-1>a*r+k>*\*i>>>|*r+k>>|>|*i>*x*r+k>>>|*r+i>>|>|=0>-1>y*r+k>*\*i>>>|*r+i>>|>|*r+i>.>>>> Given a linear map :\\\> and \>, it will be convenient to use the Kronecker products \Id:\\\> and \\:\\\>, that are defined by <\eqnarray*> \Id|)>>|>||\i\,t-1|}>|\>,,\,y*t+i>|)>=\,\,x*t+i>|)>>>|\\|)>>|>||\i\,t-1|}>|\>,,\,y|)>=\,\,x|)>.>>>> Using these notations, we may rewrite() and() as <\eqnarray*> ||>\Id>|)>>>|||>\DFT>|)>.>>>> Step() consists of scalar multiplications by so-called *i>>. We will denote this by ,r,\>>: <\eqnarray*> ||,r,\>.>>>> The last step corresponds to the transposition of an \r> matrix. Denoting this transposition map by ,r>>, the relation() becomes <\equation> \,r>\DFT>=>\DFT>|)>\\,r,\>\>\Id>|)>. If ,r|)>=1> in the previous subsection, then it is actually possible to avoid the twiddling step(), as first noted by Good in. Consider the Bezout relation <\equation*> u*r+u*r=1, where |\|>\r> and |\|>\r>. 
Then we have the isomorphism of abelian groups <\eqnarray*> /|r*\|\>>|>|/|r*\|\>\\/|r*\|\>>>||>||u*i|\> mod r,|u*i|\> mod r|)>>>|*i+r*i|)> mod r>|>| mod r,i mod r|)>>>>> This Chinese remainder isomorphism induces an isomorphism of >-algebras <\eqnarray*> /-1|)>>||\>>||]>/>-1|)>\\|]>/>-1|)>>>|>|>|*i>*x*i>>>|*i+r*i>>|>|>*x>.>>>> Representing cyclic polynomials a*x\\/-1|)>> and <\equation*> B=i\r,0\i\r>b*r+i>*x>*x>\|]>/>-1|)>>\\|]>/>-1|)> by the vectors \> of their coefficients, the map> acts as a permutation :\\\>: if > then *r+i>=a*i+r*i|)> rem r>>, for i\r> and i\r>. The bivariate DFT <\equation*> DFT,\>\DFT>\DFT>\>\DFT>|)>\>\Id>|)> sends ,i>b*r+i>*x>*x>> to the vector > with *r+i>=B>,\>|)>>. If > and =DFT>>, then <\equation*> B>,\>|)>=B*r>,\*r>|)>=A*r*u+i*r*u>|)>, so *r+i>=*r*u+i*r*u|)> rem r>> and =\|)>> for some permutation :\\\>. Altogether, this yields <\equation*> \\DFT>=DFT,\>\\. The above two subsections show that efficient algorithms for DFTs often only compute them up to permutations. For some applications, it is actually not required to perform all these permutations. Assume for instance that we have an efficient algorithm for the computation of the \Ptwisted\Q >=\\DFT>> for some permutation >. Then the variant <\equation*> P*Q=>>*>|)> of() still yields an efficient method for FFT-multiplication. This technique requires the inverse transform >> to be implemented with more care, since one cannot directly use the formula(). Instead, one typically uses a similar algorithm as for >>, but with all steps reversed. The twisting technique can be combined recursively with the algorithms from the previous two subsections. Assume for instance that >\\\DFT>> and >\\\DFT>> for some permutations :\>\\>> and :\>\\>>, and let <\equation*> >\\\|)>\\,r>\DFT>, with the notation of(). 
Then, equation() implies that <\eqnarray*> >>||>\\|)>\\Id>|)>\>\DFT>|)>\\,r,\>\>\Id>|)>>>|||>\\|)>\>\DFT>|)>\\Id>|)>\\,r,\>\>\Id>|)>>>|||>\\|)>\>\DFT>|)>\|~>,r,\>\\Id>|)>\>\Id>|)>>>|||>\>|)>\|~>,r,\>\>\Id>|)>>>>> where |~>,r,\>\\Id>|)>\\,r,\>\\Id>|)>>. For some table >,\,\>|)>> of precomputable twiddling factors, we have =\>*x> for any \> with |~>,r,\>>. For DFTs of small orders, we also note that permutations of the input and output coefficients can be achieved without cost. As will be explained in sections and, we will regard such DFTs as completely unrolled (SLPs); see definition in for instance. Instead of explicitly permuting the input and output coefficients, it then suffices to appropriately rename all local variables in the SLP. This is particularly useful for the algorithm from section, which requires permutations of both the input and output coefficients. In the special case when > is a finite field > for some prime number , a discrete Fourier transform is also called a (NTT). The most favorable case is when is of the form >+1> or >*3>+1>, where is small. Indeed, since the multiplicative group of > is cyclic of order , this ensures the existence of primitive roots of unity > of large orders of the form >> or>*3>>. Nice primes like this are sometimes called . In what follows, for any order >, we will denote by > anumber theoretic transform of the form \DFT>> for some suitable permutation > and primitive -th root of unity >. Pollard first noted that NTTs can be used to devise efficient practical algorithms for multiplying large integers: using Kronecker segmentation, the two multiplicands \>> can be rewritten as special values <\eqnarray*> ||+a*2+\+a*2>>|||+b*2+\+b*2>>>> of integer polynomials \> with small coefficients ,b\,2-1|}>>. Then |)>> and the product has coefficients in ,+1|)>*2-1|}>>. If <\equation> >+1|)>*2\2, then can be read off from its reduction modulo . If we also have r>, then the product modulo can be computed fast using FFTmultiplication. 
For the application to integer multiplication, Pollard's technique is most efficient when nicely fits into a machine integer or floating point number. As soon as >> gets large (say \2>), this constrains to become fairly small ( 15> when using double precision floating point arithmetic with ). In order to increase, and reduce the negative impact of > in the constraint(), Pollard suggested reducing modulo three FFT-primes >, >, and > instead of a single one and using Chinese remaindering to work modulo *p*p> instead of . For instance, when using double precision floating point arithmetic, this allows us to take and 65> instead of and 15>. As of today, this is one of the fastest practical methods for integer multiplication. We refer to for previous work on the implementation of SIMD versions of modular arithmetic. Before starting such an implementation, one first has to decide whether one wishes to rely on floating point or integer arithmetic. In principle, integer arithmetic should be most suitable, since no bits are wasted on the storage of exponents. However, for large moduli of more than 32 bits, floating point arithmetic tends to be supported better by current hardware. First of all, at least in the case of processors, throughputs and latencies for basic floating point instructions are better than for their integer counterparts. Secondly, the AVX2 and AVX-512 instruction sets only provide SIMD support for 32×32 bit integer products; although there is a SIMD instruction for 64×64 bit integer multiplication, the high part of the 128-bit result is lost. In comparison, with floating point arithmetic, the full product of a 53×53 bit integer multiplication can be obtained using two fused multiply add instructions. Based on these hardware considerations, we have decided to implement our NTT with floating point arithmetic.
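The Chinese remaindering step behind the three-prime trick is standard. The following C++ sketch recombines residues modulo three toy primes (97, 101, 103, illustrative stand-ins for actual FFT primes); the function name `crt3` is ours.

```cpp
#include <cassert>
#include <cstdint>

// Modular exponentiation: b^e mod p (p small enough that p*p fits in 64 bits).
inline uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t r = 1 % p;
    while (e) {
        if (e & 1) r = r * b % p;
        b = b * b % p;
        e >>= 1;
    }
    return r;
}

// Recombine residues r1, r2, r3 modulo distinct primes p1, p2, p3 into the
// unique value modulo p1*p2*p3; the inverses come from Fermat's little theorem.
uint64_t crt3(uint64_t r1, uint64_t p1, uint64_t r2, uint64_t p2,
              uint64_t r3, uint64_t p3) {
    uint64_t M1 = p2 * p3, M2 = p1 * p3, M3 = p1 * p2;
    uint64_t t1 = r1 * pow_mod(M1 % p1, p1 - 2, p1) % p1;
    uint64_t t2 = r2 * pow_mod(M2 % p2, p2 - 2, p2) % p2;
    uint64_t t3 = r3 * pow_mod(M3 % p3, p3 - 2, p3) % p3;
    return (t1 * M1 + t2 * M2 + t3 * M3) % (p1 * p2 * p3);
}
```

Each term t_i·M_i is congruent to r_i modulo p_i and vanishes modulo the other two primes, so the sum reconstructs the value modulo p1·p2·p3.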
Our implementation focuses on odd moduli 3> that fit into double precision numbers and works for SIMD widths up to , as allowed by processors with AVX-512 support. Our approach further extends known ideas. On the one hand, for the normalized representation of integers modulo , we use signed residues in ||\>,\,||\>|}>>; this provides us with a cheap way to gain one bit of precision and it also simplifies the mathematical analysis. On the other hand, we will allow the results of certain intermediate computations to be non-normalized. The latter idea was already used with success in, but we push it further by taking the modulus a bit smaller (thereby sacrificing a few bits of precision), which allows us to redundantly represent modular integers by integers that can be several times larger than (this allows for several speed-ups, as we shall see). Several recent implementations also use this kind of super-redundant representation in order to perform basic NTTs of order and without unnecessary reductions. In this paper we will present a generalization to arbitrary orders. In what follows, we will write > for the machine precision and assume that \8>. Then any real number in the interval > is a floating point number if and only if it is a multiple of >>. Floating point numbers can be scaled by powers > with exponents >. Here \> is assumed to satisfy 2* \>. Finally, zero is a special floating point number. When rounding to nearest, the rounding error of a floating point operation with an exact result in > is always bounded by >>. Given a general operation with an exact result > and rounded result>, this means that -x|\|>\2>>. Note also that \1> necessarily holds in this case. Consequently, we have <\equation> -x|\|>\*2>-x|\|>\|\|>*2>. More generally, these two inequalities hold whenever the exponent of does not overflow or underflow.
Our current implementation is limited to double precision, for which =53> and , but it could easily be extended to single and other precisions. We assume correct IEEE 754 style rounding to the nearest. We will present our implementation of modular arithmetic using pseudo machine code. The only data type that we will use is that of SIMD vectors of width with floating point entries ( for processors with AVX-512 support, for double precision). Such vectors are typically stored in hardware registers. The first argument of all our algorithms will be the destination register and the subsequent arguments the source registers. For instance, the instruction > adds two SIMD vectors and and puts the result in. In the pseudo-code, the modulus stands for a SIMD vector of potentially different moduli. We also denote by its floating point inverse>, rounded to nearest. We assume that basic instructions for SIMD floating point arithmetic are provided by the hardware, namely addition (), subtraction (), multiplication (), fused multiply add () and its variants (, , ), as well as an instruction for rounding to the nearest integer (). Precisely, the fused operators are defined as follows: <\equation*> >means>d\a*b+c>|>|>means>d\-a*b+c>>|>means>d\a*b-c>||>means>d\-a*b-c>>>>> On processors with AVX-512 support, one may implement > using the intrinsic . The modular analogues for the basic arithmetic operations are suffixed by >. For instance, > computes products modulo . The results of these analogues are not required to be normalized. The modular ring operations, as well as a separate operation > for normalization, can be implemented as follows: \; <\wide-tabular> <\named-algorithm|>> > |<\cell> <\named-algorithm|>> > >| \; |<\cell> \; >| <\named-algorithm|>> > > > |<\cell> <\named-algorithm|>> > > > > > > >>> \; Here are temporary variables. Modular analogues of the fused operations , , , and turn out to be of limited use, since non-normalized addition and subtraction take only one operation.
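One possible scalar C++ rendering of the modular product along these lines is shown below; the function name and argument conventions are ours, and the actual implementation applies the same instruction sequence to SIMD vectors.

```cpp
#include <cassert>
#include <cmath>

// Non-normalized modular product of two integers stored in doubles: returns
// an integer congruent to a*b modulo m, not necessarily in [-m/2, m/2).
// 'im' is the precomputed floating point inverse of m, rounded to nearest.
double mul_mod(double a, double b, double m, double im) {
    double h = a * b;                    // rounded high part of the product
    double l = std::fma(a, b, -h);       // exact low part: a*b - h
    double q = std::nearbyint(h * im);   // approximate quotient round(h/m)
    double d = std::fma(-q, m, h);       // h - q*m, exact for moduli of modest size
    return d + l;                        // congruent to a*b modulo m
}
```

The two FMAs recover the exact low part of the product and subtract q·m without rounding error, so the return value is exactly congruent to a·b modulo m, although generally non-normalized.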
For the analysis of our algorithms, it suffices to consider the scalar case . Consider a modular integer represented by an integer \> that fits into a floating point number of precision >. Let <\equation*> \\|m> and note that \1> if and only if \|m/2|\>>, if and only if is normalized. If \1>>, then we regard > as a measure for how non-normalized really is. For each of our algorithms , , , and it is interesting to analyze > as afunction of > and >. Note that >+1> is the smallest integer that cannot be represented exactly using afloating point number. For this reason, we always assume that the source arguments and of our algorithms as well as the modulus satisfy \2>>, \2>>, and 2>>. Each of our algorithms is easily seen to be correct, provided that \2>>. In what follows, we will in particular study under which circumstances this can be guaranteed. In the cases of addition and subtraction, the analysis is simple, since we clearly have \\+\> and \2>>, provided that +\2>> (or +\\2+1>/m>, equivalently). <\proposition> The algorithm > satisfies: <\enumerate-alpha> If \2-2>>, then \1>. If -2>\\2>>, then \1+>. <\proof> Let > and > be the errors for the computations of and : <\eqnarray*> >||>>|||>>>> and recall that |\|>\u*2>>, |\|>\*2>>, |\|>\m*2>>, and |\|>\*u*2>>>, using. Hence, |\|>\|\|>+|\|>\*|\|>+|\|>\*u*2>.> Let > be the distance between > and the set +>. Since is odd, we must have \>. If |\|>\\>, then we necessarily have |x|\>=|a*m|\>>, and the end-result will indeed be normalized. Here |x|\>> stands for the nearest integer to . The condition |\|>\\> is satisfied as soon as *u*2>\>, if *>|)>\2-2>>. Since is an integer and there are no integers between -2>/>|)>> and -2>>, this condition further simplifies into \2-2>>. This proves(). If -2>\\2>>, then the above calculations still yield |\|>\2*u>>, whence <\equation*> |\|>\+|\|>\+2*u and <\equation*> =\+2*u*m\+2*|\|>|)>\+2>|)>. This in particular implies(). 
<\remark> If -2>\\2>> and 2-1>-5>, then =v*m/2\/2\2-2>>, so asecond application of will achieve to normalize. <\proposition> Let <\equation*> |||>|>|)>*m*2>>|>|>>>||>|*2>>|>|*m*2>>>|>|>|>|)>*2>>|4*>|)>>>|>|-*m*2>-*>|)>-\.>>>>> Then the algorithm > satisfies: <\enumerate-alpha> We have \1+C*\*\>. If \*B*2>>, then \2>>. If 2-2>>, \\*2>>, and \\*2>>, then \\*2>>. <\proof> We first note that the product is represented exactly by , where \*2>>> and \*2>>. By a similar reasoning as in the beginning of the proof of Proposition, we have <\equation*> |\|>\+*u*2>, whence \+*>|)>*2>.> Note that \8> implies \2>> so is the exact value. We deduce that <\eqnarray*> >|>|+>>||>|+*>|)>*2>+*2>>>||>|+*>|)>*>|)>*2>+*2>>>||>|+*>|)>*2>.>>>> This relation in particular yields(). If \*B*2>>, then we also obtain <\eqnarray*> +*>|)>*2>>|>|+>|)>*B*2>>>||>|+>|)>**2>|)>*2>>>||>|+>|)>*2>>>|||>,>>>> which proves(). Assume finally that \\*2>> and \\*2>>. Then <\eqnarray*> >|>|+2*\*>|)>*2>.>>>> Since 2-2>> is odd, we have 2-2>-1\2-2>/>|)>>, whence the equation <\equation*> 2*t*>|)>*2>-t*2>+=0 has two positive roots <\equation*> t=>|)>*2>>|4*>|)>>. We took > to be the largest of these two roots. This implies(). <\remark> If \2-2>>, \2-2>>, and 2-2>>, then () implies \2-2>>. Indeed, 2-2>-1> in that case, whence \>. <\remark> Multiplication has the benefit of producing partially normalized results, especially if one of the arguments is already normalized. For instance, if \1> and \2>>, then() yields <\equation*> \\1+C*\\1+C*2+1>/m\3+2>. If 2-4>>, \1>, and \4>, then() yields <\equation*> \\1+C*\*\\1+4*>|)>*-4>-1|)>*2>\1+, since \8>. The easiest way to implement DFTs is to write a single routine that takes the order and the coefficients as input and then applies, say, Cooley and Tukey's algorithm using adouble loop. However, this strategy is register-inefficient for small orders and cacheinefficient for large orders. Suboptimal loop unrolling may be another problem. 
For our implementation, we use the alternative strategy based on \Pcodelets\Q. For a given finite field> and a given target order, the idea is to automatically generate a highly optimized program > for number theoretic transforms of order over > (up to permutations as explained in section). Such highly optimized programs that are dedicated to one specific task are called . We refer to for early work on codelets in the more traditional context of DFTs over >. For small orders, codelets are typically designed by hand. For large orders, they are generated automatically in terms of codelets of smaller lengths. We developed a small \Pcodelet library\Q inside for the automatic generation of codelets for NTTs of different orders as well as other linear maps. One difference from previous work is that all code is written in the language and that codelets can be both generated and executed at runtime. There are often multiple solutions to a given task. For instance, for an NTT of composite order , there may be multiple ways to factor *r> and then apply one of the algorithms from section or. One advantage of codelets is that we may generate codelets for each solution, do some benchmarking at runtime, and then select the most efficient solution. For large orders, when benchmarking becomes more expensive, the most efficient tactics can be cached on disk. Our codelet library exploits these kinds of ideas. A straight-line program (SLP) over /|m*\|\>> is a map /|m*\|\>|)>\/|m*\|\>|)>>> that can be implemented using a sequence of ring operations in /|m*\|\>>. For instance, if \\/|m*\|\>> is a primitive root of unity of order , then the following SLP with input ,s,s,s|)>> and output ,d,d,d|)>> computes an NTT of order : <\render-code> \s+s> \s+s> \s-s> \s-s> \\\x> \x+x> \x-x> \x+x> \x-x> Here >, >, >, > are auxiliary variables. Note also that the output has been re-indexed in bit-reversed order: if =DFT>>, then =>, =>, =>, and =>.
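In plain C++ over Z/pZ, with normalized arithmetic throughout (the delayed reductions discussed earlier are omitted for clarity) and with the toy parameters p = 17, ω = 4, such an order-4 SLP reads:

```cpp
#include <cassert>
#include <cstdint>

// Order-4 NTT as a straight-line program, output in bit-reversed order:
// d0 = b0, d1 = b2, d2 = b1, d3 = b3, where b is the DFT of s at omega.
void ntt4(const uint64_t s[4], uint64_t d[4], uint64_t omega, uint64_t p) {
    uint64_t x0 = (s[0] + s[2]) % p;
    uint64_t x1 = (s[1] + s[3]) % p;
    uint64_t x2 = (s[0] + p - s[2]) % p;  // subtraction, kept non-negative
    uint64_t x3 = (s[1] + p - s[3]) % p;
    x3 = x3 * omega % p;                  // multiply by the primitive 4th root
    d[0] = (x0 + x1) % p;
    d[1] = (x0 + p - x1) % p;
    d[2] = (x2 + x3) % p;
    d[3] = (x2 + p - x3) % p;
}
```

For s = (1, 2, 3, 4) this produces (b0, b2, b1, b3) = (10, 15, 7, 6), the DFT values of the earlier examples in bit-reversed order.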
For an actual machine implementation of such an SLP, we may replace the ring operations by the algorithms >, >, and > from the previous section. However, we have to be careful that all non-normalized intermediate results fit into >bit integers. An easy linear-time greedy algorithm to ensure this is to insert > instructions whenever some intermediate result might not fit. More precisely, for every intermediate result of a binary operation on and , we use the results from section to compute a bound for >. Whenever this bound exceeds +1>/m>, we insert an instruction to normalize or (we pick if \\> and otherwise). For instance, if +1>/m> is slightly larger than >, then applying this strategy to the above SLP yields <\render-code> <\wide-tabular> ,s,s|)>> |<\cell> >\2> >| ,s,s|)>> |<\cell> >\2> >| ,s,s|)>> |<\cell> >\2> >| ,s,s|)>> |<\cell> >\2> >| ,\,x|)>> |<\cell> >\2\>(using part(a) of Proposition) >| ,x|)>> |<\cell> >\1>(using Proposition) >| ,x,x|)>> |<\cell> >\3> >| ,x,x|)>> |<\cell> >\3> >| ,x|)>> |<\cell> >\1>(using Proposition) >| ,x,x|)>> |<\cell> >\3> >| ,x,x|)>> |<\cell> >\3> >>> Here we implicitly assumed that the input arguments ,s,s,s> are normalized. We also wrote > for a number that is slightly larger than ( >> will do). We indeed had to normalize > or > before the instruction ,x,x|)>>, since otherwise >+\>=4>> would exceed >. Similarly, we normalized >, since =4\\3\>. On the other hand, after these normalizations, we did not need to normalize > and >, since \3\3\>. A few remarks about this approach. <\itemize> The greedy strategy is not optimal in the sense that we might have prevented overflows by inserting an even smaller number of normalization instructions. One obvious optimization is to do the normalization of > right after its computation, which may improve subsequent bounds and avoid other normalizations. One may iterate the greedy algorithm a few times after applying this optimization. 
It is an interesting question how to design even better algorithms. We may choose to normalize the output ,d,d,d> or not. If we allow for non-normalized output, then it may be interesting to also allow for non-normalized input. The greedy algorithm generalizes to this case as long as we know bounds for >,\>,\>,\>>. The bound for >> after multiplication with > is suboptimal: since > is a constant, >>is also a known constant in >. We may use this constant instead of the rough bound >\1>. For instance, if >\>, then >\1\> after the multiplication of > with>. Consequently, >+\>=3\\3\>, which makes it no longer necessary to normalize >. For NTTs of order up to , it is efficient to use unrolled codelets that can be regarded as SLPs. Throughout our implementation, we assumed that >/m\31>. This allows for moduli 2> with a \Pcapacity\Q of 48 bits. We wrote our transforms by hand and manually determined the best spots for normalizations. Although this means that we did not apply the general greedy algorithm from the previous section, we did use the same strategy to determine bounds for the >, where ranges over the intermediate results, and verify that we always have \31>. For small power of two orders >, we use the classical Cooley\UTukey algorithm without any normalization. Writing and for the input and output vectors in/|m*\|\>|)>>>, and setting \max>,\,\>|)>>, Proposition implies \r*\>. In view of Remark, the worst case=r*\> may only occur for evaluations at and (which only involve additions and subtractions). NTTs of medium-size power of two orders > are decomposed into smaller NTTs with one intermediate twiddling stage: writing > for a number theoretic transform of order (while assuming the usual bit-reverse indexing), we apply the formulas <\eqnarray*> >||\NTT|)>\Twiddle\\Id|)>>>|>||\NTT|)>\Twiddle\\Id|)>>>|>||\NTT|)>\Twiddle\\Id|)>,>>>> for suitable diagonal linear twiddle maps ,r>>. Our implementation automatically generates completely unrolled codelets for >, >, and >.
The twiddling step has the advantage of partially normalizing the intermediate results (see Remark). For instance, if \Id|)>>, >, \NTT|)>>, and \3>, then \24>, \1\\1\>, and \15>, using part(a) of Proposition. Furthermore, the fact that several twiddle factors are equal to one allows for an optimization: whenever we need to multiply with one, we simply normalize instead. In the case of >, the error bounds are a bit better, and we only need to perform two normalizations. We also implemented hand-coded NTTs for some mixed-radix orders of the form*3>>. If , and assuming that > is a primitive third root of unity, we use the following algorithm, without normalization: <\render-code> \s-s> \\\d> \s-s> \d-d> \s-s> \d+d> \s+s> \d+s> One verifies that >\3*\> and >\3*|)>*\>. For >, we use Good's algorithm to decompose > in terms of > and >. This yields \r*|)>*\>. For , we use the formula <\eqnarray*> >||\NTT|)>\Twiddle\\Id|)>>>>> without normalization, which yields \9*|)>*\>. For >, we use Good's algorithm with an intermediate normalization step. For , we use the formula <\eqnarray*> >||\NTT|)>\Twiddle\\Id|)>,>>>> while benefiting from the partial normalization of the twiddling step. Now assume that we wish to compute an NTT of large order 2>*3>> with 64>. For input and output vectors |s,d|\>\/|r*\|\>|)>>, we require that \24> whenever \2>. This is indeed the case for all codelets from the previous subsection. Let *r> be a decomposition of with |r,r|\>\r>. We recursively assume that we know how to compute codelets for >> and >>. Now we compute > using the formula <\equation> NTT=>\NTT>|)>\Twiddle,r>\>\Id>|)>. Following(), we compute >\Id>|)>> using <\equation> ,x+j>,\,x+j>|)>\NTT>,s+j>,\,s+j>|)> for ,r-1>. Note that this yields \24> whenever \2>, by our assumption on>>. Setting Twiddle,r>>, let > be the twiddle factors with <\equation*> y+j>=\*x+j>, for ,r-1> and ,r-1>.
For ,r-1>, we next compute <\equation> >,d+1>,\,d+r-1>|)>\NTT>*x>,\*x+1>,\,\-1>*x+r-1>|)>, which corresponds to combining() and(). If \24>, then \1\\2> and \24>, by our assumption on >>. This proves our inductive requirement that \24> whenever\2>. It is important to keep an eye on the cache efficiency for the actual implementation. In the formula(), we precompute the twiddle factors and store the vectors >,\,d+r-1>|)>>>, >,\,x+r-1>|)>>, and ,\,\-1>|)>> in contiguous segments of memory. Instead of applying a global twiddling step ,r>>, we use a loop over and mix the twiddling step with the NTTs of length >; this avoids traversing elements in memory twice and thereby improves the cache efficiency. If > is really large, then we might rewrite >> in terms of smaller NTTs and move the twiddling step even further inside. As to(), we first need to move the input slices ,s+j>,\,s+j>|)>> with \Pstride\Q > into contiguous memory. After applying the NTT, we also need to put the result back in the slice ,x+j>,\,x+j>|)>> with stride >. These memory rearrangements correspond to two \r> and \r> matrix transpositions. For cache efficiency, we need to minimize the number of full traversals of vectors of size , so it is again better to use a loop over. However, entries of the form >*k>,\,s>*k+2>-1>> are typically stored in the same cache line for some \\> that depends on the hardware. It is better to treat these contiguous entries together, which can be achieved by cutting the loop over into chunks of size >> (in fact, slightly different chunk sizes may perform even better). This corresponds to rewriting the \r>> matrix transposition into /2>> transpositions of \2>> matrices (and similar for the inverse transposition). Assume now that we are on hardware that supports SIMD vectors of width and let us investigate how to vectorize the algorithm() from the previous subsection. Let us first consider the case when \|r> and \w>. 
The first step() is easy to vectorize: we may reinterpret the map >\Id>> on vectors of size with entries in /|m*\|\>>> as the map >\Id/w>> on vectors of size with SIMD entries in /|m*\|\>|)>>. As to the second step(), we have >\NTT=Id/w>\Id\NTT> and the computation of the map \NTT> reduces to the computation of \Id>, using a w> matrix transposition. In order to compute the map >\NTT|)>\Twiddle,w>>, this leads us to use a loop of length /w> that treats blocks of > coefficients at a time. On each block, we first multiply with the twiddle factors, then apply a w> matrix transposition, and finally an inline codelet for > for SIMD coefficients of width . The idea behind SIMD implementations of w> matrix transposition is to recursively reduce this operation to two \> transpositions in different lanes, followed by a linear number of operations to combine the results. This leads to an algorithm that does the full transposition in w|)>> steps. Concrete implementations on existing hardware are a bit tricky: every next generation or brand of SIMD processors (SSE, AVX2, AVX-512, Neon, ) tends to provide another set of instructions that are useful for this application (, , , etc.). Instead of covering all cases, we illustrate the above recursive meta-algorithm by showing how to perform a 4×4 matrix transposition using SIMD instructions on machines with AVX2 support: <\cpp-code> <\with|font|roman> x[0] = _mm256_permute2f128_pd (s[0], s[2], 32);
x[2] = _mm256_permute2f128_pd (s[0], s[2], 49);
x[1] = _mm256_permute2f128_pd (s[1], s[3], 32);
x[3] = _mm256_permute2f128_pd (s[1], s[3], 49);
d[0] = _mm256_shuffle_pd (x[0], x[1], 0);
d[1] = _mm256_shuffle_pd (x[0], x[1], 15);
d[2] = _mm256_shuffle_pd (x[2], x[3], 0);
d[3] = _mm256_shuffle_pd (x[2], x[3], 15);
Let us now turn to the case when \|r>. For any decomposition *r> with r>> and \|r>, we may apply the above vectorization technique for the computation of >>. Using the loop(), this also allows us to vectorize >\NTT>|)>\Twiddle,r>>.
Combined with the fact that the map >\Id>> trivially vectorizes, this provides us with an alternative vectorization of the formula(). This technique has the advantage of providing more flexibility for the choices of > and >. In particular, for large orders, taking \r\> tends to ensure better cache efficiency. For our actual implementation, we emulate a SIMD width \2*w> of twice the hardware SIMD width for the given floating point type. This means that every instruction is \Pduplexed\Q for the \Plow\Q and \Phigh\Q parts. For instance, the codelet for > from section gives rise to the following duplex version for coefficients in /|m*\|\>|)>>: <\render-code> \s-s> \s-s> \\\d> \\\d> \s-s> \s-s> \d-d> \d-d> > The main advantage of this technique is that it spreads data dependencies. This makes the code less vulnerable to delays caused by instructions with high latencies. One disadvantage of duplexing is that it doubles the pressure on hardware registers. Nevertheless, this is less of a problem on modern processors with many SIMD registers. We implemented the algorithms described in this paper in the language. In order to test the new algorithms, one has to add to the configuration options before compiling . Here stands for the new experimental compiler. The code for the number theoretic transforms can be found in the subdirectory of the package (revision 11190 available at ). The binary that we used to obtain the timings in Table was compiled using the following command: <\padded> <\indent> We systematically use the fixed prime number 1439*2^28*3^6+1=281597114843137> as our modulus for all NTTs. Consequently, > has roots of unity of all orders of the form 2^a*3^b with a\28 and b\6, which also bounds the orders in our main benchmarks. For orders 2>, our implementation does an extensive search for the best strategy. For larger orders, we make a suboptimal guess based on the best observed strategies for orders 2>.
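To make the choice of modulus concrete, the following snippet (our own sanity check; the factorization p-1 = 1439*2^28*3^6 is inferred from the stated value of p) extracts the powers of 2 and 3 from p-1, which govern the available orders of roots of unity in F_p:

```cpp
#include <cassert>
#include <cstdint>

// The fixed modulus used for all NTT timings in the paper.
constexpr uint64_t P = 281597114843137ULL;

struct SmoothPart {
    unsigned two;      // exponent of 2 in n
    unsigned three;    // exponent of 3 in n
    uint64_t cofactor; // remaining factor, coprime to 6
};

// Strip all factors 2 and 3 from n.  Applied to P - 1, this tells us
// which orders 2^a * 3^b admit primitive roots of unity modulo P.
SmoothPart split_2_3(uint64_t n) {
    SmoothPart f{0, 0, n};
    while (f.cofactor % 2 == 0) { f.cofactor /= 2; ++f.two; }
    while (f.cofactor % 3 == 0) { f.cofactor /= 3; ++f.three; }
    return f;
}
```

For the stated p this yields p-1 = 2^28 * 3^6 * 1439, consistent with benchmark lengths of the form 2^a*3^b up to 254803968 = 2^20*3^5.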
We tested our implementation on five different systems whose characteristics are specified in Table. These systems cover three SIMD widths: <\itemize> 128 bits, using ARM Neon on the Apple M1 and M3 based machines; 256 bits, on an old laptop whose Intel Core i7 4980HQ CPU supports AVX2; 512 bits, on two machines with AVX-512 support: Intel Xeon W-2140B and AMD RYZEN9 7950X3D. Note that the AVX-512 support on the AMD RYZEN9 7950X3D processor is \Pdouble pumped\Q in the sense that certain instructions (like FMA) are done in two parts of 256 bits each.

<big-table> Overview of the five machines on which we benchmarked our implementation.

<big-table> Extensive timings for number theoretic transforms of lengths 2^6=64 up to 254803968 over the field > with 1439*2^28*3^6+1=281597114843137>.
The main focus of this paper is on a mono-threaded implementation of number theoretic transforms, while exploiting SIMD instruction-level parallelism. We optimized our code for orders with \r\2> and \r\2\3>. For reference and completeness, a full table with all observed timings is provided in Table. For each entry, we computed three timings and reported the median one. In Figure, we also plotted the corresponding graph. For readability, this graph is limited to the x86-based architectures. The graph shows a familiar, nearly affine behavior for the computation time in the double logarithmic scale. The sudden increase of the timings of the xeon curve near 2> is due to an inefficient memory hierarchy on this computer.

<big-figure> Graphical representation of the timings from Table, for the first three columns only.

Now one NTT of a power-of-two length >> consists of *r*\> butterflies. For a better understanding of the efficiency of our implementation, it is therefore useful to divide our timings by *r*log r>, even when is not a power of two. Figure shows the resulting normalized timings. In addition, Figure shows these normalized timings when measuring the cost of a butterfly in cycles instead of nanoseconds. From Figure, we can read off that doubling the SIMD width indeed makes the implementation approximately twice as fast on x86-based architectures. Although Apple's M1 and M3 chips only support 128-bit wide SIMD, they actually behave more like CPUs with 256-bit wide SIMD. This is probably due to the fact that Apple chips provide four instead of two FMA units, which potentially doubles the instruction-level parallelism. For large , the M1 and M3 perform even better, thanks to a more efficient memory hierarchy. When multiplying the cost in cycles by the SIMD width > (while taking for the M1 and M3 chips in view of what precedes), we note that the cost of a single butterfly stays between 5 and 8 cycles for 10>; this is similar to the performance of traditional scalar NTTs. For larger orders , the cost on x86-based platforms slowly increases to 12 cycles per butterfly (and even to 15 cycles per butterfly for the Xeon W system and 2>, but we recall that our implementation was not optimized as heavily in this range). We also note the somewhat irregular behavior of the graphs in Figures and, with discrepancies as large as for close orders . This is due to inaccurate or irreproducible timings: it turns out that the efficiency of the generated codelets is highly dependent on the combinatorics of the divisors of .
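As a rough illustration of this combinatorics (our own simplification; the actual pool of strategies depends on more than the divisor count), one may compare the number of divisors of a power of two with that of a length divisible by powers of both 2 and 3:

```cpp
#include <cassert>
#include <cstdint>

// Number of divisors of n, as a crude proxy for the size of the "pool
// of available strategies": each way of factoring the transform length
// corresponds to a different decomposition of the NTT, and lengths
// divisible by both 2 and 3 admit many more factorizations than pure
// powers of two of comparable size.
int divisor_count(uint64_t n) {
    int count = 0;
    for (uint64_t d = 1; d * d <= n; ++d)
        if (n % d == 0)
            count += (d * d == n) ? 1 : 2;  // d and n/d, counted once if equal
    return count;
}
```

For instance, 4096 = 2^12 has only 13 divisors, whereas 46656 = 2^6*3^6 has 49, leaving the codelet factory far more decompositions to choose from.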
In particular, if is a power of two, then the \Ppool of available strategies\Q is somewhat smaller than if is divisible by a few powers of . Consequently, the average costs in cycles tend to be slightly higher for power-of-two lengths. When using NTTs as the workhorse of evaluation-interpolation strategies (e.g. for integer multiplication), it may be interesting to use truncated Fourier transforms (TFTs) from: this should allow us to smooth out the graphs from Figures and, and benefit from the best possible NTTs for close orders \r>.

<big-figure> Timings divided by *r*log r>, which corresponds to the number of butterflies in the case when the order of the NTT is a power of two.

<big-figure> Variant of Figure when using the cycle time as a unit instead of nanoseconds.

We also compared our new algorithm to the one from. For this purpose, we slightly adapted the companion C code from to work also on M1 and M3 platforms and to support orders 2>. The timings for both implementations are shown in Table. We recall that the algorithm from uses a 62-bit prime and unsigned integer arithmetic, instead of a 49-bit prime and double precision floating point arithmetic. We have also indicated the appropriately rescaled timings.

<big-table> Comparison of the number of clock cycles per butterfly between our new implementation and the implementation from. We recall that the implementation from uses a prime with \p\2> and unsigned integer arithmetic.
The \Pf48\Q columns correspond to our new timings for our prime with \p\2>. The \Pf61\Q columns contain the scaled variants through multiplication by >. The \Pi61\Q columns show the timings for the algorithm from. In the previous subsection, we explored the cost of computing a single NTT of order over >, while exploiting SIMD support in the processor. However, in order to fully benefit from hardware SIMD support of width , it is better to compute NTTs in parallel, as a single NTT of order over >. Here > stands for the set of SIMD vectors of length over >. Such a vector fits exactly into a single hardware SIMD register. We will say that we are in the pure SIMD mode when we can compute NTTs in parallel in this way. This typically happens when we need to compute many NTTs (e.g. when doing an integer matrix product as in), in chunks of NTTs at a time. By contrast, we are in the thwarted mode if we want to compute a single NTT or fewer than NTTs, as in the previous subsection. Although this mode may still benefit from SIMD arithmetic, this typically requires non-trivial data rearrangements as described in section. In order to compare the pure and thwarted modes, it is useful to compare the following three costs: the cost of a single NTT of length over >, > times the cost of NTTs of length over >, and > times the cost of a single NTT of length over >. Although the first two costs ought to be similar most of the time, this is not necessarily the case when is close to one of the (L1, L2, L3, etc.) cache sizes. In Table, we show a detailed comparison of these costs for power-of-two lengths.
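The pure SIMD mode can be pictured as a change of data layout. The sketch below (our own illustration, not the paper's code) interleaves w inputs of length r into r SIMD words, after which a scalar NTT codelet operating on whole words computes all w transforms simultaneously:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Pure SIMD mode" layout: w independent coefficient sequences of
// length r are interleaved so that slot j of word i holds coefficient i
// of input j.  Each group of w consecutive entries then maps onto one
// hardware SIMD register, and a scalar NTT routine acting on such words
// computes w NTTs of length r at once, with no shuffling inside loops.
std::vector<uint64_t> interleave(const std::vector<std::vector<uint64_t>>& in) {
    std::size_t w = in.size(), r = in[0].size();
    std::vector<uint64_t> words(r * w);
    for (std::size_t i = 0; i < r; ++i)      // coefficient index
        for (std::size_t j = 0; j < w; ++j)  // input (SIMD slot) index
            words[i * w + j] = in[j][i];
    return words;
}
```

The thwarted mode must instead perform rearrangements of this kind inside the transform itself, which is where its overhead comes from.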
The overhead of the thwarted mode with respect to the pure SIMD mode is most important for small orders .

<big-table> Comparison between the timings of NTTs in the thwarted and pure SIMD modes. The greyed entries correspond to timings that were obtained using slightly less optimized DFTs.

The main focus of this paper has been on an almost optimal implementation of the NTT using SIMD double precision arithmetic, but in the mono-threaded case. Naturally, we were curious about the performance of our work when using multiple threads. In this last subsection, we report on some preliminary experiments, while emphasizing that this is work in progress which has not yet benefited from a similar level of optimization.
We recompiled with the configure option.

<big-table> Speed-up when executing a dummy loop using multiple threads. The top row indicates the number of threads that we use to execute our code. The main table indicates the observed speed-ups. The greyed entries correspond to the degraded regime in which hyperthreading or efficiency cores enter the stage.

Before measuring the thread efficiency of our implementation, it is useful to analyze the maximal theoretical speed-up that multi-threading may provide on our testing platforms. For this, we created a simple loop that calls an empty function many times, and measured the speed-up when executing this loop in a multi-threaded fashion. The results have been reported in Table. For the x86-based platforms, we note that the number of logical threads is twice as high as the number of physical threads, due to hyperthreading. It turns out that our simple loop does benefit from hyperthreading, but that this is less likely to happen for highly optimized HPC code like our NTTs. For the M1 and M3 processors, efficiency cores tend to be much slower than performance cores, with similar pitfalls as in the case of hyperthreading. Note that the observed speed-ups always decreased when using a number of threads that exceeds the maximal number of logical threads for the platform. For our main experiments, the number of threads > is fixed for each individual timing. The main codelets that benefit from multi-threaded execution are of the form \T> or Id> for some other codelet and some large integer . For a parameter > (that we call the flooding factor), let us write +\+n*\>> with |n/*\|)>|\>=n\\\n*\>=|n/*\|)>|\>>. Now we subdivide the execution of, say, Id> into the execution of chunks Id>> for ,\*\>>. These chunks are executed by our > threads using the work stealing paradigm.
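The chunking scheme just described can be sketched as follows (a minimal illustration under the stated definitions; the work-stealing scheduler itself is omitted):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Split a loop of length n into lambda * nu chunks, where nu is the
// number of threads and lambda the "flooding factor".  Chunk sizes
// differ by at most one, i.e. every n_i equals floor(n/(lambda*nu)) or
// ceil(n/(lambda*nu)); the chunks are then consumed by work stealing.
std::vector<std::size_t> chunk_sizes(std::size_t n,
                                     std::size_t lambda,
                                     std::size_t nu) {
    std::size_t k = lambda * nu;
    std::vector<std::size_t> sizes(k, n / k);
    for (std::size_t i = 0; i < n % k; ++i)
        ++sizes[i];  // distribute the remainder over the first chunks
    return sizes;
}
```

Taking lambda greater than one creates more chunks than threads, which lets faster threads steal leftover work and evens out load imbalance, at the price of a little extra scheduling overhead.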
In our codelet factory, the most appropriate flooding factor > for given and is determined dynamically. The first scenario we tested is that of a single NTT of length , while using > threads. For each platform, we investigated three values for >: the number of physical or performance cores, the number of logical or performance plus efficiency cores, and an intermediate value. The resulting timings are shown in Table. We observed a large variance, especially on x86 platforms. We consider the speed-ups on the m1 and m3 platforms to be reasonably satisfactory for a first implementation. The timings for the other platforms are far from optimal. We omitted the results for the zen4 platform, since they were so absurd that they require further investigation.

<big-table> Parallel speed-ups for our number theoretic transforms as a function of the transform length , the platform, and the number of threads >.

The second scenario is when we wish to compute a large number of NTTs of the same order . This is similar to the pure SIMD mode that we investigated in subsection. We use a similar work stealing approach as above. Each individual thread repeatedly computes NTTs of order over >. In Table, we report detailed timings in the single case when and only for the m3 platform.
The table also indicates the best flooding factor > as a function of the number of threads >.

<big-table> Speed-ups on an M3 processor for 1024> NTTs of length 1024> with SIMD coefficients in >, where 1439*2^28*3^6+1=281597114843137>, as a function of the number > of threads and the flooding factor >. The highlighted entries in each row correspond to the best flooding factor for a given number of threads. The last column reports the timing for the best flooding factor.

<\bibliography|bib|tm-plain|> <\bib-list|22>

ANSI/IEEE. IEEE standard for binary floating-point arithmetic. ANSI/IEEE, New York, 2008. ANSI-IEEE Standard 754-2008. Revision of IEEE 754-1985, approved on June 12, 2008 by the IEEE Standards Board.

J. Bradbury, N. Drucker and M. Hillenbrand. NTT software optimization using an extended Harvey butterfly. 2021.

P. Bürgisser, M. Clausen and M. A. Shokrollahi. Algebraic Complexity Theory. Springer-Verlag, Berlin, 1997.

J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297\U301, 1965.

J.-G. Dumas, P. Giorgi and C. Pernet. FFPACK: finite field linear algebra package. In J. Schicho, editor, ISSAC '04, pages 119\U126. ACM, New York, NY, USA, 2004.

J.-G. Dumas, P. Giorgi and C. Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. 35(3):19, 2008.

P. Fortin, A. Fleury, F. Lemaire and M. Monagan. High-performance SIMD modular arithmetic for polynomial evaluation. 33(16):e6270, 2021.

M. Frigo. A fast Fourier transform compiler. In PLDI '99, pages 169\U180. ACM, New York, NY, USA, 1999.

M. Frigo and S. G. Johnson. The design and implementation of FFTW3. 93(2):216\U231, 2005.

S. Fu, N. Zhang and F. Franchetti. Accelerating high-precision number theoretic transforms using Intel AVX-512. 2024.

I. J. Good. The interaction algorithm and practical Fourier analysis. 20(2):361\U372, 1958.

W. Hart and the FLINT Team. FLINT: Fast Library for Number Theory. From 2008.

D. Harvey. Faster arithmetic for number-theoretic transforms. 60:113\U119, 2014.

M. T. Heideman, D. H. Johnson and C. S. Burrus. Gauss and the history of the fast Fourier transform. 34(3):265\U277, 1985.

J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, ISSAC '04, pages 290\U296. ACM, New York, NY, USA, 2004.

J. van der Hoeven. The Jolly Writer. Your Guide to GNU TeXmacs. Scypress, 2020.

J. van der Hoeven and G. Lecerf. HAL, 2013.

J. van der Hoeven and G. Lecerf. Evaluating straight-line programs over balls. In P. Montuschi, M. Schulte, J. Hormigo, S. Oberman and N. Revol, editors, pages 142\U149. IEEE, 2016.

J. van der Hoeven, G. Lecerf and G. Quintin. Modular SIMD arithmetic in Mathemagix. 43(1), 2016.

R. Larrieu. The truncated Fourier transform for mixed radices. In M. Burr, editor, ISSAC '17, pages 261\U268. ACM, New York, NY, USA, 2017.

J. M. Pollard. The fast Fourier transform in a finite field. Math. Comp., 25(114):365\U374, 1971.

M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson and N. Rizzolo. SPIRAL: code generation for DSP transforms. 93(2):232\U275, 2005. Special issue on \PProgram Generation, Optimization, and Adaptation\Q.

A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281\U292, 1971.

D. Takahashi. An implementation of parallel number-theoretic transform using Intel AVX-512 instructions. In F. Boulier, M. England, T. M. Sadykov and E. V. Vorozhtsov, editors, CASC 2022, LNCS 13366, pages 318\U332. Springer, Cham, 2022.

N. Zhang, A. Ebel, N. Neda, P. Brinich, B. Reynwar, A. G. Schmidt, M. Franusich, J. Johnson, B. Reagen and F. Franchetti. Generating high-performance number theoretic transform implementations for vector architectures. Pages 1\U7. IEEE, Los Alamitos, CA, USA, 2023.