The '''Cooley–Tukey [[algorithm]]''', named after [[James Cooley|J.W. Cooley]] and [[John Tukey]], is the most common [[fast Fourier transform]] (FFT) algorithm.  It re-expresses the [[discrete Fourier transform]] (DFT) of an arbitrary [[composite number|composite]] size ''N'' = ''N''<sub>1</sub>''N''<sub>2</sub> in terms of smaller DFTs of sizes ''N''<sub>1</sub> and ''N''<sub>2</sub>, [[recursion|recursively]], in order  to reduce the computation time to O(''N'' log ''N'') for highly-composite ''N'' ([[smooth number]]s). Because of the algorithm's importance, specific variants and implementation styles have become known by their own names, as described below.

Because the Cooley–Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT.  For example, [[Rader's FFT algorithm|Rader's]] or [[Bluestein's FFT algorithm|Bluestein's]] algorithm can be used to handle large prime factors that cannot be decomposed by Cooley–Tukey, or the [[prime-factor FFT algorithm|prime-factor algorithm]] can be exploited for greater efficiency in separating out [[relatively prime]] factors.

See also the [[fast Fourier transform]] for information on other FFT algorithms, specializations for real and/or symmetric data, and accuracy in the face of finite [[floating-point]] precision.

== History ==
This algorithm, including its recursive application, was invented around 1805 by [[Carl Friedrich Gauss]], who used it to interpolate the trajectories of the [[asteroid]]s [[2 Pallas|Pallas]] and [[3 Juno|Juno]], but his work was not widely recognized (being published only posthumously and in [[New Latin|neo-Latin]]).<ref>Gauss, Carl Friedrich,  "Nachlass: Theoria interpolationis methodo nova tractata", Werke, Band 3, 265&ndash;327 (Königliche Gesellschaft der Wissenschaften, Göttingen, 1866)</ref><ref name=Heideman84>Heideman, M. T., D. H. Johnson, and [[C. Sidney Burrus|C. S. Burrus]], "[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1162257 Gauss and the history of the fast Fourier transform]," IEEE ASSP Magazine, 1, (4), 14&ndash;21 (1984)</ref> Gauss did not analyze the asymptotic computational time, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries.<ref name=Heideman84/>  FFTs became popular after [[J. W. Cooley]] of [[International Business Machines|IBM]] and [[John Tukey|John W. Tukey]] of [[Princeton University|Princeton]] published a paper in 1965 reinventing the algorithm and describing how to perform it conveniently on a computer.<ref name=CooleyTukey65>Cooley, James W., and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," ''Math. Comput.'' '''19''', 297&ndash;301 (1965). {{doi|10.2307/2003354}}</ref> 

Tukey reportedly came up with the idea during a meeting of a US presidential advisory committee discussing ways to detect [[nuclear testing|nuclear-weapon tests]] in the [[Soviet Union]].<ref>Cooley, James W., Peter A. W. Lewis, and Peter D. Welch, "Historical notes on the fast Fourier transform," ''IEEE Trans. on Audio and Electroacoustics'' '''15''' (2), 76&ndash;79 (1967).</ref><ref>Rockmore, Daniel N., "[http://www.cs.dartmouth.edu/~rockmore/cse-fft.pdf The FFT &mdash; an algorithm the whole family can use]," ''Comput. Sci. Eng.'' '''2''' (1), 60 (2000). Special issue on "[http://amath.colorado.edu/resources/archive/topten.pdf top ten algorithms of the century]".</ref>  Another participant at that meeting, [[Richard Garwin]] of IBM, recognized the potential of the method and put Tukey in touch with Cooley, who implemented it for a different (and less-classified) problem: analyzing 3d crystallographic data (see also: [[Fast Fourier transform#Algorithms|multidimensional FFTs]]).  Cooley and Tukey subsequently published their joint paper, and wide adoption quickly followed.

The fact that Gauss had described the same algorithm (albeit without analyzing its asymptotic cost) was not realized until several years after Cooley and Tukey's 1965 paper.<ref name=Heideman84/>  Their paper cited as inspiration only work by I. J. Good on what is now called the [[prime-factor FFT algorithm]] (PFA);<ref name=CooleyTukey65/> although Good's algorithm was initially mistakenly thought to be equivalent to the Cooley–Tukey algorithm, it was quickly realized that PFA is a quite different algorithm (only working for sizes that have [[relatively prime]] factors and relying on the [[Chinese Remainder Theorem]], unlike the support for any composite size in Cooley–Tukey).<ref>James W. Cooley, Peter A. W. Lewis, and Peter D. Welch, "Historical notes on the fast Fourier transform," ''Proc. IEEE'', vol. '''55''' (no. 10), p. 1675–1677 (1967).</ref>

== The radix-2 DIT case ==
A '''radix-2''' decimation-in-time ('''DIT''') FFT is the simplest and most common form of the Cooley–Tukey algorithm, although highly optimized Cooley–Tukey implementations typically use other forms of the algorithm as described below. Radix-2 DIT divides a DFT of size ''N'' into two [[Interleaving|interleaved]] DFTs (hence the name "radix-2") of size ''N''/2 with each recursive stage. 

The discrete Fourier transform (DFT) is defined by the formula:
:<math>      X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk},</math>
where <math>k</math> is an integer ranging from <math>0</math> to <math>N-1</math>.
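
As a point of reference, the definition can be evaluated directly. The following is a minimal Python sketch (the function name <code>naive_dft</code> is chosen here for illustration and does not come from any library); it needs O(''N''<sup>2</sup>) operations, which is the cost that the FFT reduces.

```python
import cmath


def naive_dft(x):
    """Evaluate X_k = sum_n x_n * exp(-2*pi*i*n*k/N) term by term."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]
```

For instance, a unit impulse <code>[1, 0, 0, 0]</code> transforms to four equal outputs, and a constant input puts all of its weight into ''X''<sub>0</sub>.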

Radix-2 DIT first computes the DFTs of the even-indexed inputs
<math>x_{2m} \ </math> (<math>x_0, x_2, \ldots, x_{N-2}</math>)
and of the odd-indexed inputs <math>x_{2m+1} \ </math> (<math>x_1, x_3, \ldots, x_{N-1}</math>), and then combines those two results to produce the DFT of the whole sequence. This idea can then be performed [[recursion|recursively]] to reduce the overall runtime to O(''N'' log ''N'').  This simplified form assumes that ''N'' is a [[power of two]]; since the number of sample points ''N'' can usually be chosen freely by the application, this is often not an important restriction.  

The Radix-2 DIT algorithm rearranges the DFT of the function <math>x_n</math> into two parts: a sum over the even-numbered indices <math>n={2m}</math> and a sum over the odd-numbered indices <math>n={2m+1}</math>:
:<math>
  \begin{matrix} X_k & =
& \sum \limits_{m=0}^{N/2-1} x_{2m}     e^{-\frac{2\pi i}{N} (2m)k}   +
  \sum \limits_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N} (2m+1)k}.
  \end{matrix}
</math>
One can factor a common multiplier <math>e^{-\frac{2\pi i}{N}k}</math> out of the second sum, as shown in the equation below. It is then clear that the two sums are the DFT of the even-indexed part <math>x_{2m}</math> and the DFT of odd-indexed part <math>x_{2m+1}</math> of the function <math>x_n</math>. Denote the DFT of the '''''E'''''ven-indexed inputs <math>x_{2m}</math> by <math>E_k</math> and the DFT of the '''''O'''''dd-indexed inputs <math>x_{2m + 1}</math> by <math>O_k</math> and we obtain:
:<math>
\begin{matrix} X_k= \underbrace{\sum \limits_{m=0}^{N/2-1} x_{2m}   e^{-\frac{2\pi i}{N/2} mk}}_{\mathrm{DFT\;of\;even-indexed\;part\;of\;} x_n} {} +  e^{-\frac{2\pi i}{N}k}
 \underbrace{\sum \limits_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk}}_{\mathrm{DFT\;of\;odd-indexed\;part\;of\;} x_n} =  E_k + e^{-\frac{2\pi i}{N}k} O_k.
\end{matrix}
</math>
However, these smaller DFTs have a length of ''N''/2, so we need compute only ''N''/2 outputs: thanks to the periodicity properties of the DFT, the outputs for <math>N/2 \leq k < N</math> from a DFT of length ''N''/2 are identical to the outputs for <math>0\leq k < N/2</math>. That is, <math>E_{k + N/2} = E_k</math> and <math>O_{k + N/2} = O_k</math>. The phase factor <math>\exp[-2\pi i k/ N]</math> (called a [[twiddle factor]]) obeys the relation: <math>\exp[-2\pi i (k + N/2)/ N] = e^{-\pi i} \exp[-2\pi i k/ N] = -\exp[-2\pi i k/ N]</math>, flipping the sign of the <math>O_{k + N/2}</math> terms. Thus, the whole DFT can be calculated as follows:
:<math>
\begin{matrix} X_k & =
& \left\{
\begin{matrix}
E_k + e^{-\frac{2\pi i}{N}k} O_k & \mbox{if } k < N/2 \\ \\
E_{k-N/2} - e^{-\frac{2\pi i}{N} (k-N/2)} O_{k-N/2} & \mbox{if }
k \geq N/2. \end{matrix} \right. \end{matrix}
</math>
This result, expressing the DFT of length ''N'' recursively in terms of two DFTs of size ''N''/2, is the core of the radix-2 DIT fast Fourier transform. The algorithm gains its speed by re-using the results of intermediate computations to compute multiple DFT outputs.  Note that final outputs are obtained by a +/&minus; combination of <math>E_k</math> and <math>O_k \exp(-2\pi i k/N)</math>, which is simply a size-2 DFT (sometimes called a [[butterfly diagram|butterfly]] in this context); when this is generalized to larger radices below, the size-2 DFT is replaced by a larger DFT (which itself can be evaluated with an FFT).
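
A single level of this decomposition can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the helper names <code>dft</code> and <code>radix2_combine</code> are hypothetical, the two half-size transforms are computed by the direct O(''N''<sup>2</sup>) formula rather than recursively, and ''N'' is assumed to be even.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition (used here for the half-size transforms)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def radix2_combine(x):
    """One radix-2 DIT step: two DFTs of size N/2 followed by N/2 butterflies."""
    N = len(x)
    E = dft(x[0::2])  # DFT of the even-indexed inputs
    O = dft(x[1::2])  # DFT of the odd-indexed inputs
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * O[k]  # twiddle factor times O_k
        X[k] = E[k] + t           # outputs for k < N/2
        X[k + N // 2] = E[k] - t  # outputs for k >= N/2 (sign flipped)
    return X
```

Each pair <code>E[k]</code>, <code>O[k]</code> is used for two outputs, which is the re-use of intermediate results described above.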


[[Image:DIT-FFT-butterfly.png|thumb|300px|right|Data flow diagram for ''N''=8: a decimation-in-time radix-2 FFT breaks a length-''N'' DFT into two length-''N''/2 DFTs followed by a combining stage consisting of many size-2 DFTs called "butterfly" operations (so-called because of the shape of the data-flow diagrams).]]

This process is an example of the general technique of [[divide and conquer algorithm]]s; in many traditional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in [[breadth-first search|breadth-first]] fashion.

The above re-expression of a size-''N'' DFT as two size-''N''/2 DFTs is sometimes called the '''Danielson–[[Cornelius Lanczos|Lanczos]]''' [[lemma (mathematics)|lemma]], since the identity was noted by those two authors in 1942<ref>Danielson, G. C., and C. Lanczos, "Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids," ''J. Franklin Inst.'' '''233''', 365&ndash;380 and 435&ndash;452 (1942).</ref> (influenced by [[Carl David Tolmé Runge|Runge's]] 1903 work<ref name=Heideman84/>).  They applied their lemma in a "backwards" recursive fashion, repeatedly ''doubling'' the DFT size until the transform spectrum converged (although they apparently didn't realize the [[linearithmic]] asymptotic complexity they had achieved).  The Danielson–Lanczos work predated widespread availability of [[computer]]s and required hand calculation (possibly with mechanical aids such as [[adding machine]]s); they reported a computation time of 140 minutes for a size-64 DFT operating on [[Fast Fourier transform#FFT algorithms specialized for real and/or symmetric data|real inputs]] to 3–5 significant digits.  Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size-2048 complex DFT on an [[IBM 7094]] (probably in 36-bit [[floating point|single precision]], ~8 digits).<ref name=CooleyTukey65/>  Rescaling the time by the number of operations, this corresponds roughly to a speedup factor of around 800,000.  (To put the time for the hand calculation in perspective, 140 minutes for size 64 corresponds to an average of at most 16 seconds per floating-point operation, around 20% of which are multiplications.)
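
The quoted speedup can be roughly reproduced with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that the operation count scales as ''N''&nbsp;log<sub>2</sub>''N'', and that a real-input transform costs about half as much as a complex transform of the same size. The variable names are illustrative only.

```python
import math

# Reported timings, in minutes
hand_minutes, hand_size = 140.0, 64         # Danielson-Lanczos, real input
machine_minutes, machine_size = 0.02, 2048  # Cooley-Tukey on the IBM 7094, complex input


def work(n):
    """Assumed operation count, proportional to n * log2(n)."""
    return n * math.log2(n)


# Assumption: a real-input DFT needs about half the work of a complex one
hand_work = 0.5 * work(hand_size)
machine_work = work(machine_size)

# Ratio of time per unit of work
speedup = (hand_minutes / hand_work) / (machine_minutes / machine_work)
print(round(speedup))
```

Under these assumptions the ratio comes out slightly above 800,000, consistent with the rough figure given above.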

===Pseudocode===

In [[pseudocode]], the above process could be written:

 ''Y''<sub>0,...,''N''&minus;1</sub> &larr; '''ditfft2'''(''X'', ''N'', ''s''):             ''DFT of (X''<sub>0</sub>, ''X''<sub>''s''</sub>, ''X''<sub>2''s''</sub>, ..., ''X''<sub>(''N''-1)''s''</sub>):
     if ''N'' = 1 then
         ''Y''<sub>0</sub> &larr; ''X''<sub>0</sub>                                      ''trivial size-1 DFT base case''
     else
         ''Y''<sub>0,...,''N''/2&minus;1</sub> &larr; '''ditfft2'''(''X'', ''N''/2, 2''s'')             ''DFT of (X''<sub>0</sub>, ''X''<sub>2''s''</sub>, ''X''<sub>4''s''</sub>, ...)
         ''Y''<sub>''N''/2,...,''N''&minus;1</sub> &larr; '''ditfft2'''(''X''+''s'', ''N''/2, 2''s'')         ''DFT of (X''<sub>''s''</sub>, ''X''<sub>''s''+2''s''</sub>, ''X''<sub>''s''+4''s''</sub>, ...)
         for ''k'' = 0 to ''N''/2&minus;1                           ''combine DFTs of two halves into full DFT:''
             t ← ''Y''<sub>''k''</sub>
             ''Y''<sub>''k''</sub> &larr; t + exp(&minus;2π''i'' ''k''/''N'') ''Y''<sub>''k''+''N''/2</sub>
             ''Y''<sub>''k''+''N''/2</sub> ← t &minus; exp(&minus;2π''i'' ''k''/''N'') ''Y''<sub>''k''+''N''/2</sub>
         endfor
     endif

Here, <code>'''ditfft2'''</code>(''X'',''N'',1) computes ''Y''=DFT(''X'') [[out-of-place]] by a radix-2 DIT FFT, where ''N'' is an integer power of 2 and ''s''=1 is the [[stride of an array|stride]] of the input ''X'' [[Array data structure|array]].  ''X''+''s'' denotes the array starting with ''X''<sub>''s''</sub>.

(The results are in the correct order in ''Y'' and no further [[bit-reversal permutation]] is required; the often-mentioned necessity of a separate bit-reversal stage only arises for certain in-place algorithms, as described below.)
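
The pseudocode translates almost line for line into Python. In the sketch below, the extra <code>offset</code> parameter is an assumption introduced for illustration: it stands in for the pointer arithmetic ''X''+''s'', which Python lists do not support directly. As in the pseudocode, ''N'' must be a power of two.

```python
import cmath


def ditfft2(x, N, s=1, offset=0):
    """Radix-2 DIT FFT of (x[offset], x[offset+s], x[offset+2s], ...), N terms."""
    if N == 1:
        return [complex(x[offset])]  # trivial size-1 DFT base case
    # DFTs of the even- and odd-indexed halves, stored one after the other
    Y = ditfft2(x, N // 2, 2 * s, offset) + ditfft2(x, N // 2, 2 * s, offset + s)
    for k in range(N // 2):  # combine DFTs of the two halves into the full DFT
        t = Y[k]
        w = cmath.exp(-2j * cmath.pi * k / N) * Y[k + N // 2]
        Y[k] = t + w
        Y[k + N // 2] = t - w
    return Y
```

Calling <code>ditfft2(x, len(x))</code> returns the transform in natural order as a new list and leaves the input unchanged.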

High-performance FFT implementations make many modifications to the implementation of such an algorithm compared to this simple pseudocode.  For example, one can use a larger base case than ''N''=1 to [[amortize]] the overhead of recursion, the twiddle factors exp(&minus;2πi ''k''/''N'') can be precomputed, and larger radices are often used for [[cache]] reasons; these and other optimizations together can improve the performance by an order of magnitude or more.<ref name="Johnson08">S. G. Johnson and M. Frigo, “[http://cnx.org/content/m16336/ Implementing FFTs in practice],” in ''Fast Fourier Transforms'' (C. S. Burrus, ed.), ch. 11, Rice University, Houston TX: Connexions, September 2008.</ref>  (In many textbook implementations the [[depth-first]] recursion is eliminated entirely in favor of a nonrecursive [[breadth-first]] approach, although depth-first recursion has been argued to have better [[memory locality]].<ref name="Johnson08"/><ref name=Singleton67/>) Several of these ideas are described in further detail below.

== General factorizations ==
[[File:Cooley-tukey-general.png|thumb|right|500px|The basic step of the Cooley–Tukey FFT for general factorizations can be viewed as re-interpreting a 1d DFT as something like a 2d DFT. The 1d input array of length ''N'' = ''N''<sub>1</sub>''N''<sub>2</sub> is reinterpreted as a 2d ''N''<sub>1</sub>&times;''N''<sub>2</sub> matrix stored in [[column-major order]]. One performs smaller 1d DFTs along the ''N''<sub>2</sub> direction (the non-contiguous direction), then multiplies by phase factors (twiddle factors), and finally performs 1d DFTs along the ''N''<sub>1</sub> direction, at some point transposing the matrix.  This is done recursively for the smaller transforms.]]

More generally, Cooley–Tukey algorithms recursively re-express  a DFT of a composite size ''N'' = ''N''<sub>1</sub>''N''<sub>2</sub> as:<ref name=DuhamelVe90>Duhamel, P., and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," ''Signal Processing'' '''19''', 259&ndash;299 (1990)</ref>

# Perform ''N''<sub>1</sub> DFTs of size ''N''<sub>2</sub>.
# Multiply by complex [[roots of unity]] called [[twiddle factor]]s.
# Perform ''N''<sub>2</sub> DFTs of size ''N''<sub>1</sub>.

Typically, either ''N''<sub>1</sub> or ''N''<sub>2</sub> is a small factor (''not'' necessarily prime), called the '''radix''' (which can differ between stages of the recursion).  If ''N''<sub>1</sub> is the radix, it is called a '''decimation in time''' (DIT) algorithm, whereas if ''N''<sub>2</sub> is the radix, it is '''decimation in frequency''' (DIF, also called the Sande-Tukey algorithm). The version presented above was a radix-2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/- combination (''butterfly'') of the even and odd transforms is a size-2 DFT.  (The radix's small DFT is sometimes known as a [[butterfly (FFT algorithm)|butterfly]], so-called because of the shape of the [[dataflow diagram]] for the radix-2 case.)

There are many other variations on the Cooley–Tukey algorithm.  '''Mixed-radix''' implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(''N''<sup>2</sup>) algorithm for the prime base cases of the recursion <nowiki>[</nowiki>it is also possible to employ an ''N''&nbsp;log&nbsp;''N'' algorithm for the prime base cases, such as [[Rader's FFT algorithm|Rader]]'s or [[Bluestein's FFT algorithm|Bluestein]]'s algorithm<nowiki>]</nowiki>.  [[Split-radix FFT algorithm|Split radix]] merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve what was long the lowest known arithmetic operation count for power-of-two sizes,<ref name=DuhamelVe90/> although recent variations achieve an even lower count.<ref>Lundy, T., and J. Van Buskirk, "A new matrix approach to real FFTs and convolutions of length 2<sup>''k''</sup>," ''Computing'' '''80''', 23-45 (2007).</ref><ref>Johnson, S. G., and M. Frigo, "[http://www.fftw.org/newsplit.pdf A modified split-radix FFT with fewer arithmetic operations]," ''IEEE Trans. Signal Processing'' '''55''' (1), 111–119 (2007).</ref>  (On present-day computers, performance is determined more by [[CPU cache|cache]] and [[CPU pipeline]] considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size.<ref name=FrigoJohnson05/>)  Another way of looking at the Cooley–Tukey algorithm is that it re-expresses a size ''N'' one-dimensional DFT as an ''N''<sub>1</sub> by ''N''<sub>2</sub> two-dimensional DFT (plus twiddles), where the output matrix is [[transpose]]d. The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices.  
If, instead of using a small radix, one employs a radix of roughly √''N'' and explicit input/output matrix transpositions, it is called a '''four-step''' algorithm (or ''six-step'', depending on the number of transpositions), initially proposed to improve memory locality,<ref name=GenSande66>Gentleman W. M., and G. Sande, "Fast Fourier transforms&mdash;for fun and profit," ''Proc. AFIPS'' '''29''', 563&ndash;578 (1966).</ref><ref name=Bailey90>Bailey, David H., "FFTs in external or hierarchical memory," ''J. Supercomputing'' '''4''' (1), 23&ndash;35 (1990)</ref> e.g. for cache optimization or [[out-of-core]] operation, and was later shown to be an optimal [[cache-oblivious algorithm]].<ref name=Frigo99>M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In ''Proceedings of the 40th IEEE Symposium on Foundations of Computer Science'' (FOCS 99), p.285-297. 1999. [http://ieeexplore.ieee.org/iel5/6604/17631/00814600.pdf?arnumber=814600 Extended abstract at IEEE], [http://citeseer.ist.psu.edu/307799.html at Citeseer].</ref>

The general Cooley–Tukey factorization rewrites the indices ''k'' and ''n'' as <math>k = N_2 k_1 + k_2</math> and <math>n = N_1 n_2 + n_1</math>, respectively, where the indices ''k''<sub>a</sub> and ''n''<sub>a</sub> run from 0..''N''<sub>a</sub>-1 (for ''a'' of 1 or 2).  That is, it re-indexes the input (''n'') and output (''k'') as ''N''<sub>1</sub> by ''N''<sub>2</sub> two-dimensional arrays in [[column-major order|column-major]] and [[row-major order]], respectively; the difference between these indexings is a transposition, as mentioned above.  When this re-indexing is substituted into the DFT formula for ''nk'', the <math>N_1 n_2 N_2 k_1</math> cross term vanishes (its exponential is unity), and the remaining terms give

:<math>X_{N_2 k_1 + k_2} =
      \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1}
         x_{N_1 n_2 + n_1}
         e^{-\frac{2\pi i}{N_1 N_2} \cdot (N_1 n_2 + n_1) \cdot (N_2 k_1 + k_2) }</math>
::<math>= 
    \sum_{n_1=0}^{N_1-1} 
      \left[ e^{-\frac{2\pi i}{N} n_1 k_2 } \right]
      \left( \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1}  
              e^{-\frac{2\pi i}{N_2} n_2 k_2 } \right)
      e^{-\frac{2\pi i}{N_1} n_1 k_1 }
</math>

where each inner sum is a DFT of size ''N''<sub>2</sub>, each outer sum is a DFT of size ''N''<sub>1</sub>, and the <nowiki>[...]</nowiki> bracketed term is the twiddle factor.
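
The three steps of the general factorization can be sketched for a single level as follows. The names <code>dft</code> and <code>cooley_tukey_step</code> are hypothetical, and the sub-transforms are evaluated by the direct formula for clarity; a real implementation would apply the same step recursively.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition, used for the small sub-transforms."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def cooley_tukey_step(x, N1, N2):
    """One Cooley-Tukey step for N = N1*N2, with n = N1*n2 + n1 and k = N2*k1 + k2."""
    N = N1 * N2
    assert len(x) == N
    # Step 1: N1 DFTs of size N2 (inner sums over n2), giving inner[n1][k2]
    inner = [dft(x[n1::N1]) for n1 in range(N1)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*n1*k2/N)
    for n1 in range(N1):
        for k2 in range(N2):
            inner[n1][k2] *= cmath.exp(-2j * cmath.pi * n1 * k2 / N)
    # Step 3: N2 DFTs of size N1 (outer sums over n1), written to index N2*k1 + k2
    X = [0j] * N
    for k2 in range(N2):
        outer = dft([inner[n1][k2] for n1 in range(N1)])
        for k1 in range(N1):
            X[N2 * k1 + k2] = outer[k1]
    return X
```

The slice <code>x[n1::N1]</code> reads the input in column-major fashion, while the output index <code>N2 * k1 + k2</code> is row-major, which is the transposition mentioned above.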

An arbitrary radix ''r'' (as well as mixed radices) can be employed, as was shown by both Cooley and Tukey<ref name=CooleyTukey65/> as well as Gauss (who gave examples of radix-3 and radix-6 steps).<ref name=Heideman84/>  Cooley and Tukey originally assumed that the radix butterfly required O(''r''<sup>2</sup>) work and hence reckoned the complexity for a radix ''r'' to be O(''r''<sup>2</sup>&nbsp;''N''/''r''&nbsp;log<sub>''r''</sub>''N'') = O(''N''&nbsp;log<sub>2</sub>(''N'')&nbsp;''r''/log<sub>2</sub>''r''); by calculating ''r''/log<sub>2</sub>''r'' for integer values of ''r'' from 2 to 12, they found the optimal radix to be 3 (the closest integer to ''[[e (mathematical constant)|e]]'', which minimizes ''r''/log<sub>2</sub>''r'').<ref name=CooleyTukey65/><ref>Cooley, J. W., P. Lewis and P. Welch, "The Fast Fourier Transform and its Applications", ''IEEE Trans on Education'' '''12''', 1, 28-34 (1969)</ref>  This analysis was erroneous, however: the radix-butterfly is also a DFT and can be performed via an FFT algorithm in O(''r''  log ''r'') operations, hence the radix ''r'' actually cancels in the complexity O(''r''&nbsp;log(''r'')&nbsp;''N''/''r''&nbsp;log<sub>''r''</sub>''N''), and the optimal ''r'' is determined by more complicated considerations.  In practice, quite large ''r'' (32 or 64) are important in order to effectively exploit e.g. the large number of [[processor register]]s on modern processors,<ref name=FrigoJohnson05/> and even an unbounded radix ''r''=√''N'' achieves O(''N''&nbsp;log&nbsp;''N'') complexity and has theoretical and practical advantages for large ''N'' as mentioned above.<ref name=GenSande66/><ref name=Bailey90/><ref name=Frigo99/>
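
The original estimate is easy to reproduce. The short Python sketch below tabulates ''r''/log<sub>2</sub>''r'' for ''r'' from 2 to 12; it illustrates only the early O(''r''<sup>2</sup>)-butterfly cost model, which, as noted above, does not determine the best radix in practice. The names <code>radix_cost</code>, <code>costs</code> and <code>best</code> are illustrative.

```python
import math


def radix_cost(r):
    """Cost factor r / log2(r) from the O(r^2)-butterfly estimate."""
    return r / math.log2(r)


# Tabulate the factor for integer radices 2..12 and pick the smallest
costs = {r: radix_cost(r) for r in range(2, 13)}
best = min(costs, key=costs.get)

for r, c in costs.items():
    print(r, round(c, 3))
print("minimum at r =", best)
```

Radices 2 and 4 both give a factor of exactly 2, while radix 3 gives about 1.89, the smallest value in the table.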

== Data reordering, bit reversal, and in-place algorithms ==
Although the abstract Cooley–Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an [[in-place algorithm]] that overwrites its input with its output data using only O(1) auxiliary storage.

The most well-known reordering technique involves explicit '''bit reversal''' for in-place radix-2 algorithms.  [[Bit-reversal permutation|Bit reversal]] is the [[permutation]] where the data at an index ''n'', written in [[binary numeral system|binary]] with digits ''b''<sub>4</sub>''b''<sub>3</sub>''b''<sub>2</sub>''b''<sub>1</sub>''b''<sub>0</sub> (e.g. 5 digits for ''N''=32 inputs), is transferred to the index with reversed digits ''b''<sub>0</sub>''b''<sub>1</sub>''b''<sub>2</sub>''b''<sub>3</sub>''b''<sub>4</sub> . Consider the last stage of a radix-2 DIT algorithm like the one presented above, where the output is written in-place over the input: when <math>E_k</math> and <math>O_k</math> are combined with a size-2 DFT, those two values are overwritten by the outputs.  However, the two output values should go in the first and second ''halves'' of the output array, corresponding to the ''most'' significant bit ''b''<sub>4</sub> (for ''N''=32); whereas the two inputs <math>E_k</math> and <math>O_k</math> are interleaved in the even and odd elements, corresponding to the ''least'' significant bit ''b''<sub>0</sub>.  Thus, in order to get the output in the correct place, these two bits must be swapped in the input. If you include all of the recursive stages of a radix-2 DIT algorithm, ''all'' the bits must be swapped and thus one must pre-process the ''input'' with a bit reversal to get in-order output. Correspondingly, the reversed (dual) algorithm is radix-2 DIF, and this takes in-order input and produces bit-reversed ''output'', requiring a bit-reversal post-processing step.  Alternatively, some applications (such as convolution) work equally well on bit-reversed data, so one can do radix-2 DIF without bit reversal, followed by processing, followed by the radix-2 DIT inverse DFT without bit reversal to produce final results in the natural order.
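
A minimal Python sketch of this scheme is given below: the input is first permuted by bit reversal, and the radix-2 DIT butterflies are then applied in place, stage by stage. The function names <code>bit_reverse</code> and <code>fft_inplace</code> are hypothetical, the length is assumed to be a power of two, and no attempt is made at the optimizations (such as precomputed twiddle factors) used by practical implementations.

```python
import cmath


def bit_reverse(n, bits):
    """Return n with its lowest `bits` binary digits written in reverse order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r


def fft_inplace(a):
    """In-place radix-2 DIT FFT: bit-reversed input, natural-order output."""
    N = len(a)
    bits = N.bit_length() - 1
    assert N == 1 << bits, "length must be a power of two"
    # Pre-processing: swap each element with its bit-reversed partner once
    for i in range(N):
        j = bit_reverse(i, bits)
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages for sub-transform sizes 2, 4, ..., N
    size = 2
    while size <= N:
        half = size // 2
        for start in range(0, N, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
        size *= 2
    return a
```

For ''N''=8 the permutation maps the indices 0, 1, ..., 7 to 0, 4, 2, 6, 1, 5, 3, 7. Apart from a few scalar temporaries, the list passed in is the only storage used.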

Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time,<ref name=FrigoJohnson05/> even though bit reversal can be done in O(''N'') time and has been the subject of much research.<ref>Karp, Alan H., "Bit reversal on uniprocessors," ''SIAM Review'' '''38''' (1), 1&ndash;26 (1996)</ref><ref>Carter, Larry and Kang Su Gatlin, "Towards an optimal bit-reversal permutation program," ''Proc. 39th Ann. Symp. on Found. of Comp. Sci. (FOCS)'', 544&ndash;553 (1998).</ref><ref>Rubio, M., P. Gómez, and K. Drouiche, "A new superfast bit reversal algorithm," ''Intl. J. Adaptive Control and Signal Processing'' '''16''', 703&ndash;707 (2002)</ref> Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.

The problem is greatly simplified if it is '''out-of-place''': the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available.  The '''Stockham auto-sort''' algorithm<ref>Stockham, T. G., "High speed convolution and correlation", ''Spring Joint Computer Conference, Proc. AFIPS'' '''28''', 229&ndash;233 (1966)</ref> performs every stage of the FFT out-of-place, typically writing back and forth between two arrays, transposing one "digit" of the indices with each stage, and has been especially popular on [[SIMD]] architectures.<ref>Swarztrauber, P. N., "Vectorizing the FFTs", in G. Rodrigue (Ed.), ''Parallel Computations'' (Academic Press, New York, 1982), pp. 51&ndash;83.</ref>  Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the '''Pease''' algorithm,<ref>Pease, M. C."An adaptation of the fast Fourier transform for parallel processing", ''J. ACM'' '''15''' (2), 252&ndash;264 (1968)</ref> which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(''N'' log ''N'') storage.  One can also directly apply the Cooley–Tukey factorization definition with explicit ([[depth-first search|depth-first]]) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step (as in the pseudocode above) and can be argued to have [[cache-oblivious algorithm|cache-oblivious]] locality benefits on systems with [[cache|hierarchical memory]].<ref name=Singleton67>Singleton, Richard C., "On computing the fast Fourier transform", ''Commun. of the ACM'' '''10''' (1967), 647&ndash;654</ref><ref name=FrigoJohnson05>Frigo, M. and S. G. Johnson, "[http://fftw.org/fftw-paper-ieee.pdf The Design and Implementation of FFTW3]," ''Proceedings of the IEEE'' '''93''' (2), 216–231 (2005).</ref><ref>Frigo, Matteo and Steven G. Johnson: ''FFTW'', http://www.fftw.org/. 
A free ([[GNU General Public License|GPL]]) C library for computing discrete Fourier transforms in one or more dimensions, of arbitrary size, using the Cooley–Tukey algorithm</ref>
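
The ping-pong structure of an auto-sort algorithm can be illustrated with the following Python sketch. It shows one common radix-2 formulation in the Stockham style, not necessarily the variant of the original paper; the name <code>stockham_fft</code> and the variable names are assumptions made for illustration, and the length must be a power of two.

```python
import cmath


def stockham_fft(x):
    """Radix-2 auto-sort FFT: each stage reads one array and writes the other."""
    N = len(x)
    assert N >= 1 and N & (N - 1) == 0, "length must be a power of two"
    src = [complex(v) for v in x]
    dst = [0j] * N
    n, s = N, 1  # n: current sub-transform size, s: stride between its elements
    while n > 1:
        m = n // 2
        for p in range(m):
            w = cmath.exp(-2j * cmath.pi * p / n)  # twiddle factor for this p
            for q in range(s):
                a = src[q + s * p]
                b = src[q + s * (p + m)]
                # Writing to 2p and 2p+1 reorders one index digit per stage
                dst[q + s * (2 * p)] = a + b
                dst[q + s * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # swap the roles of the two arrays
        n, s = m, 2 * s
    return src
```

No separate bit-reversal pass appears: the reordering is absorbed into the write indices of each stage, at the price of a second array of size ''N''.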

A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data.<ref name=FrigoJohnson05/><ref>Johnson, H. W. and C. S. Burrus, "An in-place in-order radix-2 FFT," ''Proc. ICASSP'', 28A.2.1&ndash;28A.2.4 (1984)</ref><ref>Temperton, C., "Self-sorting in-place fast Fourier transform," ''SIAM J. Sci. Stat. Comput.'' '''12''' (4), 808&ndash;823 (1991)</ref><ref>Qian, Z., C. Lu, M. An, and R. Tolimieri, "Self-sorting in-place FFT algorithm with minimum working space," ''IEEE Trans. ASSP'' '''52''' (10), 2835&ndash;2836 (1994)</ref><ref>Hegland, M., "A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing," ''Numerische Mathematik'' '''68''' (4), 507&ndash;547 (1994)</ref>

==References==
{{reflist}}

== External links ==
* [http://www.librow.com/articles/article-10 A simple, pedagogical radix-2 Cooley–Tukey FFT algorithm in C++]
* [http://sourceforge.net/projects/kissfft/ KISSFFT]: a simple mixed-radix Cooley–Tukey implementation in C (open source)

{{DEFAULTSORT:Cooley–Tukey Fft Algorithm}}
[[Category:FFT algorithms]]
[[Category:Articles with example pseudocode]]