Algorithms of Scientific Computing: Fast Fourier Transform (FFT)


Algorithms of Scientific Computing

Fast Fourier Transform (FFT)

Michael Bader
Technical University of Munich
Summer 2024
Fast Fourier Transform – Outline
• Discrete Fourier transform
• Fast Fourier transform
• Special Fourier transforms:
– real-valued FFT
– sine/cosine transform
• Applications:
– Fast Poisson solver (FST)
– Computer graphics (FCT)
• Efficient Implementation

Literature/Reference:
William L. Briggs and Van Emden Henson: The DFT – An Owner’s Manual for the Discrete
Fourier Transform. SIAM, 1995. https://doi.org/10.1137/1.9781611971514

Michael Bader | Algorithms of Scientific Computing | Fast Fourier Transform (FFT) | Summer 2024 2
The Pair DFT/IDFT as Matrix-Vector Product
DFT and IDFT may be computed in the form
$$F_k = \frac{1}{N} \sum_{n=0}^{N-1} f_n\, \omega_N^{-nk} \qquad\qquad f_n = \sum_{k=0}^{N-1} F_k\, \omega_N^{nk}$$

or as matrix-vector products
$$F = \frac{1}{N}\, W^H f\,, \qquad f = W F\,,$$
with a computational complexity of O(N 2 ).

Note that
$$\mathrm{DFT}(f) = \frac{1}{N}\, \overline{\mathrm{IDFT}\big(\overline{f}\big)}\,.$$

A fast computation is possible via the divide-and-conquer approach.
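As a baseline, the two sums above can be implemented directly in O(N²). The sketch below is our own illustration (the names `dft`/`idft` and the use of `std::vector` are not from the slides); it also lets one verify the round trip IDFT(DFT(f)) = f:

```cpp
#include <complex>
#include <vector>
#include <cmath>

using cvec = std::vector<std::complex<double>>;

// Naive O(N^2) DFT: F_k = (1/N) * sum_{n=0}^{N-1} f_n * omega_N^{-nk}
cvec dft(const cvec& f) {
    const int N = (int)f.size();
    const double pi = std::acos(-1.0);
    cvec F(N);
    for (int k = 0; k < N; k++)
        for (int n = 0; n < N; n++)
            F[k] += f[n] * std::polar(1.0, -2.0 * pi * n * k / N) / double(N);
    return F;
}

// Naive O(N^2) IDFT: f_n = sum_{k=0}^{N-1} F_k * omega_N^{nk}
cvec idft(const cvec& F) {
    const int N = (int)F.size();
    const double pi = std::acos(-1.0);
    cvec f(N);
    for (int n = 0; n < N; n++)
        for (int k = 0; k < N; k++)
            f[n] += F[k] * std::polar(1.0, 2.0 * pi * n * k / N);
    return f;
}
```

Both routines perform N complex multiply-adds per output element, i.e. O(N²) work in total — the cost the FFT will reduce to O(N log N).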


Fast Fourier Transform for N = 2^p
Basic idea: sum up even and odd indices separately in the IDFT
→ first for n = 0, 1, . . . , N/2 − 1:

$$x_n = \sum_{k=0}^{N-1} X_k\,\omega_N^{nk} = \sum_{k=0}^{N/2-1} X_{2k}\,\omega_N^{2nk} + \sum_{k=0}^{N/2-1} X_{2k+1}\,\omega_N^{(2k+1)n}$$

We set $Y_k := X_{2k}$ and $Z_k := X_{2k+1}$, use $\omega_N^{2nk} = \omega_{N/2}^{nk}$, and get a sum of two
IDFTs on N/2 coefficients:

$$x_n = \sum_{k=0}^{N-1} X_k\,\omega_N^{nk} = \underbrace{\sum_{k=0}^{N/2-1} Y_k\,\omega_{N/2}^{nk}}_{=:\,y_n} + \;\omega_N^n \underbrace{\sum_{k=0}^{N/2-1} Z_k\,\omega_{N/2}^{nk}}_{=:\,z_n}\,.$$

Note: this formula is actually valid for all n = 0, . . . , N − 1; however, the IDFTs of size N/2 will only
deliver the y_n and z_n for n = 0, . . . , N/2 − 1 (but: y_n and z_n are periodic!)

Fast Fourier Transform (FFT)
Consider the formula x_n = y_n + ω_N^n z_n for the indices N/2, . . . , N − 1:

$$x_{n+N/2} = y_{n+N/2} + \omega_N^{\,n+N/2}\, z_{n+N/2} \qquad \text{for } n = 0, \ldots, N/2 - 1$$

Since $\omega_N^{\,n+N/2} = -\omega_N^n$, and y_n and z_n have a period of N/2, we obtain the
so-called butterfly scheme:

$$x_n = y_n + \omega_N^n z_n \qquad\qquad x_{n+N/2} = y_n - \omega_N^n z_n$$

(Diagram: one butterfly — y_n and z_n enter, z_n is multiplied by the weight ω_N^n, and the sum y_n + ω^n z_n and the difference y_n − ω^n z_n leave.)

Fast Fourier Transform – Butterfly Scheme
$$(x_0, x_1, \ldots, x_{N-1}) = \mathrm{IDFT}(X_0, X_1, \ldots, X_{N-1})$$
$$(y_0, y_1, \ldots, y_{N/2-1}) = \mathrm{IDFT}(X_0, X_2, \ldots, X_{N-2})$$
$$(z_0, z_1, \ldots, z_{N/2-1}) = \mathrm{IDFT}(X_1, X_3, \ldots, X_{N-1})$$

(Diagram for N = 4: the inputs X_0, X_1, X_2, X_3 are reordered to X_0, X_2, X_1, X_3, transformed by two size-2 IDFTs into y_0, y_1 and z_0, z_1, and combined by butterflies into x_0, x_1, x_2, x_3.)

Fast Fourier Transform – Butterfly Scheme (2)
(Diagram for N = 8: the inputs X_0, . . . , X_7 are first brought into bit-reversed order X_0, X_4, X_2, X_6, X_1, X_5, X_3, X_7; then three butterfly levels of size 2, 4, and 8 produce the intermediate results and finally the outputs x_0, . . . , x_7.)

Recursive Implementation of the FFT
rekFFT(X) −→ x
(1) Generate vectors Y and Z:
for n = 0, . . . , N/2 − 1 : Y_n := X_{2n} and Z_n := X_{2n+1}
(2) compute 2 FFTs of half size:
rekFFT(Y) −→ y and rekFFT(Z) −→ z
(3) combine with “butterfly scheme”:
for k = 0, . . . , N/2 − 1 :
$$x_k = y_k + \omega_N^k z_k\,, \qquad x_{k+N/2} = y_k - \omega_N^k z_k$$
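A minimal C++ transcription of this recursive scheme might look as follows (a sketch: the name `rek_fft`, the use of `std::vector`, and computing ω_N^k via `std::polar` are our own choices):

```cpp
#include <complex>
#include <vector>
#include <cmath>

using cvec = std::vector<std::complex<double>>;

// Recursive IDFT-style FFT for N = 2^p, following steps (1)-(3):
// split into even/odd coefficients, two half-size FFTs, butterfly combine.
cvec rek_fft(const cvec& X) {
    const int N = (int)X.size();
    if (N == 1) return X;
    // (1) generate vectors Y (even indices) and Z (odd indices)
    cvec Y(N/2), Z(N/2);
    for (int n = 0; n < N/2; n++) { Y[n] = X[2*n]; Z[n] = X[2*n+1]; }
    // (2) compute two FFTs of half size
    cvec y = rek_fft(Y), z = rek_fft(Z);
    // (3) combine with the butterfly scheme
    const double pi = std::acos(-1.0);
    cvec x(N);
    for (int k = 0; k < N/2; k++) {
        std::complex<double> w = std::polar(1.0, 2.0 * pi * k / N); // omega_N^k
        x[k]       = y[k] + w * z[k];
        x[k + N/2] = y[k] - w * z[k];
    }
    return x;
}
```

For example, the unit impulse X = (0, 1, 0, 0) should transform to x_n = ω_4^n, i.e. (1, i, −1, −i).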

Observations on the Recursive FFT
• Computational effort C(N) (N = 2^p) given by the recursion equation
$$C(N) = \begin{cases} O(1) & \text{for } N = 1 \\ O(N) + 2\, C(N/2) & \text{for } N > 1 \end{cases} \qquad \Rightarrow \quad C(N) = O(N \log N)$$

• The algorithm splits into two phases:
– resorting of the input data
– combination following the “butterfly scheme”
⇒ Performing the resorting up front enables a simple, iterative algorithm without
additional memory requirements.

Sorting Phase of the FFT – Bit Reversal
Observation:
• even indices are sorted into the upper half,
odd indices into the lower half.
• distinction even/odd based on least significant bit
• distinction upper/lower based on most significant bit
⇒ An index in the sorted field has the reversed (i.e. mirrored) binary
representation compared to the original index.

Sorting of a Vector (N = 2^p Entries, Bit Reversal)
/** FFT sorting phase: reorder data in array X */
for (int n = 0; n < N; n++) {
    // compute the p-bit bit reversal of n in j
    int j = 0; int m = n;
    for (int i = 0; i < p; i++) {
        j = 2*j + m % 2;  m = m / 2;
    }
    // if j > n, exchange X[j] and X[n]:
    if (j > n) {
        complex<double> h = X[j];
        X[j] = X[n];  X[n] = h;
    }
}
Bit reversal needs O(p) = O(log N) operations
⇒ Sorting results also in a complexity of O(N log N)
⇒ Sorting may consume up to 10–30 % of the CPU time!
Iterative Implementation of the “Butterflies”
(Diagram: the complete butterfly scheme for N = 8 again — bit-reversed inputs, then butterfly levels of size L = 2, 4, 8.)

Iterative Implementation of the “Butterflies”

// loop over the size of the IDFTs
for (int L = 2; L <= N; L *= 2)
    // loop over the IDFTs of one level
    for (int k = 0; k < N; k += L)
        // perform all butterflies of one IDFT
        for (int j = 0; j < L/2; j++) {
            // complex computation:
            z ← ω_L^j · X[k+j+L/2]
            X[k+j+L/2] ← X[k+j] − z
            X[k+j] ← X[k+j] + z
        }

• k-loop and j-loop are “permutable”!
• How and when are the ω_L^j computed?
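Combining the sorting phase with this butterfly loop nest yields a complete in-place transform. The following sketch is our own (the name `iter_fft` and all implementation details are our choices; each ω_L^j is recomputed on the fly here) and can be checked against the defining IDFT sum:

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <utility>

// In-place iterative IDFT-style FFT for N = 2^p:
// sorting phase (bit reversal) followed by the butterfly phase.
void iter_fft(std::vector<std::complex<double>>& X) {
    const int N = (int)X.size();
    int p = 0; while ((1 << p) < N) p++;
    // sorting phase: swap each element with its bit-reversed index
    for (int n = 0; n < N; n++) {
        int j = 0, m = n;
        for (int i = 0; i < p; i++) { j = 2*j + m % 2; m /= 2; }
        if (j > n) std::swap(X[j], X[n]);
    }
    // butterfly phase: levels L = 2, 4, ..., N
    const double pi = std::acos(-1.0);
    for (int L = 2; L <= N; L *= 2)
        for (int k = 0; k < N; k += L)
            for (int j = 0; j < L/2; j++) {
                std::complex<double> w = std::polar(1.0, 2.0 * pi * j / L); // omega_L^j
                std::complex<double> z = w * X[k + j + L/2];
                X[k + j + L/2] = X[k + j] - z;
                X[k + j]       = X[k + j] + z;
            }
}
```

For example, the constant input X = (1, 1, 1, 1) transforms to (4, 0, 0, 0), since all twiddle-weighted sums cancel except for n = 0.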

Iterative Implementation – Variant 1

/** FFT butterfly phase: variant 1 */
for (int L = 2; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            complex<double> z = omega(L,j) * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
Advantage: consecutive (“stride-1”) access to data in array X
⇒ suitable for vectorisation
⇒ good cache performance due to prefetching (stream access) and usage
of cache lines
Disadvantage: multiple computations of ω_L^j

Iterative Implementation – Variant 1 Vectorised

/** FFT butterfly phase: variant 1, vectorised */
for (int L = 2; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j += SIMD_LENGTH) {
            kjStart = k+j;
            kjEnd = k+j + (SIMD_LENGTH-1);
            complex<double> z[0:SIMD_LENGTH-1];
            z[0:SIMD_LENGTH-1] = omega(L,j) * X[kjStart+L/2 : kjEnd+L/2];
            X[kjStart+L/2 : kjEnd+L/2] = X[kjStart:kjEnd] - z;
            X[kjStart:kjEnd] = X[kjStart:kjEnd] + z;
        }
Comments on the code:
• we assume that our CPU offers SIMD instructions that can add/multiply short vectors of
SIMD_LENGTH complex numbers
• with the notation X[kjStart:kjEnd] we mean a subvector of SIMD_LENGTH consecutive
complex numbers starting from element X[kjStart]
• Note: the term omega(L,j) has to be turned into such a short vector as well

Iterative Implementation – Variant 2

/** FFT butterfly phase: variant 2 */
for (int L = 2; L <= N; L *= 2)
    for (int j = 0; j < L/2; j++) {
        complex<double> w = omega(L,j);
        for (int k = 0; k < N; k += L) {
            complex<double> z = w * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
    }
Advantage: each ω_L^j is computed only once

Disadvantage: “stride-L” access to the array X
⇒ worse cache performance (inefficient use of cache lines)
⇒ not suitable for vectorisation
Separate Computation of ω_L^j
• necessary: N − 1 factors
$$\omega_2^0;\ \ \omega_4^0, \omega_4^1;\ \ \ldots;\ \ \omega_L^0, \ldots, \omega_L^{L/2-1};\ \ \ldots;\ \ \omega_N^0, \ldots, \omega_N^{N/2-1}$$

• are computed in advance and stored in an array w, e.g.:

for (int L = 2; L <= N; L *= 2)
    for (int j = 0; j < L/2; j++)
        w[L/2+j] ← ω_L^j;
• Variant 2: accesses w in sequential order
• Variant 1: accesses to w are local (but repeated) and compatible with
vectorisation
• Important: weight array w[:] needs to stay in cache!
(as accesses to main memory can be slower than recomputation)
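A sketch of this precomputation (the function name `twiddles` is our own; ω_L^j = e^{2πij/L} matches the IDFT sign convention used throughout these slides, and index w[0] simply stays unused):

```cpp
#include <complex>
#include <vector>
#include <cmath>

// Precompute all N-1 twiddle factors: w[L/2+j] = omega_L^j
// for L = 2, 4, ..., N and j = 0, ..., L/2-1.
std::vector<std::complex<double>> twiddles(int N) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> w(N);
    for (int L = 2; L <= N; L *= 2)
        for (int j = 0; j < L/2; j++)
            w[L/2 + j] = std::polar(1.0, 2.0 * pi * j / L); // omega_L^j
    return w;
}
```

The indexing scheme packs the factors for level L into slots w[L/2], ..., w[L−1], so all levels together fill exactly the N − 1 slots w[1], ..., w[N−1].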

Cache Efficiency – Variant 1 Revisited
/** FFT butterfly phase: variant 1 */
for (int L = 2; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            complex<double> z = w[L/2+j] * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
Observation:
• each L-loop traverses entire array X
• in the ideal case (N log N)/B cache line transfers (N/B per L-loop,
B the size of the cache line), unless all N elements fit into cache
Compare with the recursive scheme:
• as soon as a sub-FFT of size L < M_C (M_C the cache size) is reached, it runs entirely in cache
• is it thus possible to require only ≈ N log N/(B log M_C) cache line transfers?
Butterfly Phase with Loop Blocking
/** FFT butterfly phase: loop blocking for k */
for (int L = 2; L <= N; L *= 2)
    for (int kb = 0; kb < N; kb += M)
        for (int k = kb; k < kb+M; k += L)
            for (int j = 0; j < L/2; j++) {
                complex<double> z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j] = X[k+j] + z;
            }
Question: can we make the L-loop an inner loop?
• kb-loop and L-loop may be swapped, if M > L
• however, we assumed N > MC (“data does not fit into cache”)
• we thus need to split the L-loop into a phase L=2..M (in cache) and a
phase L=2*M..N (out of cache)

Butterfly Phase with Loop Blocking (2)
/** perform all butterfly phases of size M */
for (int kb = 0; kb < N; kb += M)
    for (int L = 2; L <= M; L *= 2)
        for (int k = kb; k < kb+M; k += L)
            for (int j = 0; j < L/2; j++) {
                complex<double> z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j] = X[k+j] + z;
            }

/** perform remaining butterfly levels of size L > M */
for (int L = 2*M; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            complex<double> z = w[L/2+j] * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
Loop Blocking and Recursion – Illustration
(Diagram: the N = 8 butterfly scheme once more, illustrating how the first butterfly levels act only within contiguous blocks that fit into cache, while the last level combines results across blocks.)

Outlook: Parallel External Memory and I/O Model

(Diagram: several CPUs, each with a private local cache of M words, attached to a large external shared memory.)
[Arge, Goodrich, Nelson, Sitchinava, 2008]

Outlook: Parallel External Memory
Classical I/O model:
• large, global memory (main memory, hard disk, etc.)
• CPU can only access a smaller working memory
(cache, main memory, etc.) of M_C words each
• both organised as cache lines of size B words
• algorithmic complexity determined by memory transfers

Extended by the Parallel External Memory model:
• multiple CPUs access private caches
• caches fetch data from external memory
• exclusive/concurrent read/write classification
(similar to the PRAM model)

Outlook: FFT and Parallel External Memory
Consider Loop-Blocking Implementation:

/** perform all butterfly phases of size M */
for (int kb = 0; kb < N; kb += M)
    for (int L = 2; L <= M; L *= 2)
        for (int k = kb; k < kb+M; k += L)
            for (int j = 0; j < L/2; j++) {
                /* ... */
            }
• choose M such that one kb-block (M elements) fits into cache
• then: L-loop and inner loops access only cached data
• number of cache line transfers therefore:
≈ N/B, i.e. N divided by the words per cache line (ideal case)

Outlook: FFT and Parallel External Memory (2)
Consider Non-Blocking Implementation:

/** perform remaining butterfly levels of size L > M */
for (int L = 2*M; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            /* ... */
        }
• assume: N too large to fit all elements into cache
• then: each L-iteration needs to reload all elements of X into cache
• number of cache line transfers therefore:
≈ N/B per L-iteration, i.e. N divided by the words per cache line (ideal case)

Compute-Bound vs. Memory-Bound Performance
Consider a memory-bandwidth-intensive algorithm:
• the CPU can execute far more flops than operands can be read from memory
• computational intensity of a code:
the number of performed flops per accessed byte

Memory-Bound Performance:
• computational intensity smaller than critical ratio
• you could execute additional flops “for free”
• speedup only possible by reducing memory accesses

Compute-Bound Performance:
• enough computational work to “hide” memory latency
• speedup only possible by reducing operations

Outlook: The Roofline Model
(Roofline plot: attainable performance in GFlop/s (log scale) over operational intensity in Flops/Byte (log scale, 1/8 to 64). The slanted part of the roof is given by the peak stream bandwidth, with lower ceilings for non-unit-stride and non-NUMA-aware access; the flat part is the peak FP performance, with lower ceilings without vectorization and without instruction-level parallelism. Sample kernels: SpMV, 5-point stencil, matrix multiplication (100×100).)

[Williams, Waterman, Patterson, 2008]

Outlook: The Roofline Model
Memory-Bound Performance:
• available bandwidth of a bytes per second
• computational intensity small: x flops per byte
• CPU thus executes a · x flops per second
• linear increase of the Flop/s with x → the slanted part of the “roofline”
• “ceilings”: memory bandwidth reduced due to “bad” memory access
(strided access, non-uniform memory access, etc.) → smaller a

Compute-Bound Performance:
• computational intensity large
• CPU executes the highest possible Flop/s → the flat/constant “rooftop”
• “ceilings”: fewer Flop/s due to a “bad” instruction mix
(no vectorization, bad branch prediction, no multiply-add instructions, etc.)
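The resulting roof can be summarised in one formula: attainable Flop/s = min(peak, a · x). A minimal sketch (the function name and all numbers in the usage note are purely illustrative, not measurements):

```cpp
#include <algorithm>

// Roofline model: attainable GFlop/s for a kernel with the given
// operational intensity (flops per byte), limited either by the
// machine's peak FP rate or by its memory bandwidth.
double roofline(double peak_gflops, double bandwidth_gbytes, double intensity) {
    return std::min(peak_gflops, bandwidth_gbytes * intensity);
}
```

For example, on a hypothetical machine with 100 GFlop/s peak and 50 GB/s bandwidth, a kernel at 0.5 Flops/Byte is memory-bound at 25 GFlop/s; the ridge point between the two regimes lies at 2 Flops/Byte.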

