Algorithms of Scientific Computing: Fast Fourier Transform (FFT)
Michael Bader
Technical University of Munich
Summer 2024
Fast Fourier Transform – Outline
• Discrete Fourier transform
• Fast Fourier transform
• Special Fourier transforms:
– real-valued FFT
– sine/cosine transform
• Applications:
– Fast Poisson solver (FST)
– Computer graphics (FCT)
• Efficient Implementation
Literature/Reference:
William L. Briggs and Van Emden Henson: The DFT – An Owner’s Manual for the Discrete
Fourier Transform. SIAM, 1995. https://doi.org/10.1137/1.9781611971514
The Pair DFT/IDFT as Matrix-Vector Product
DFT and IDFT may be computed in the form
$$F_k = \frac{1}{N}\sum_{n=0}^{N-1} f_n\,\omega_N^{-nk}, \qquad f_n = \sum_{k=0}^{N-1} F_k\,\omega_N^{nk},$$
or as matrix-vector products
$$F = \frac{1}{N}\,W^{\mathsf H} f, \qquad f = W F,$$
with a computational complexity of $\mathcal{O}(N^2)$.
Note that
$$\mathrm{DFT}(f) = \frac{1}{N}\,\overline{\mathrm{IDFT}\bigl(\overline{f}\bigr)}.$$
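For reference, a direct evaluation of the first sum costs $\mathcal{O}(N^2)$ operations; a minimal C++ sketch (assuming $\omega_N = e^{2\pi i/N}$; the function name dft and the vector-based interface are illustrative, not part of the lecture code):

#include <complex>
#include <vector>
#include <cmath>
using cplx = std::complex<double>;

/** direct O(N^2) DFT: F_k = (1/N) * sum_n f_n * omega_N^(-nk) */
std::vector<cplx> dft(const std::vector<cplx>& f) {
    const int N = (int) f.size();
    const double PI = std::acos(-1.0);
    std::vector<cplx> F(N);
    for (int k=0; k<N; k++) {
        cplx sum = 0.0;
        for (int n=0; n<N; n++)
            sum += f[n] * std::polar(1.0, -2.0*PI*n*k/N);   // omega_N^(-nk)
        F[k] = sum / (double) N;
    }
    return F;
}

The IDFT looks identical except for a positive exponent and no factor $1/N$.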
Fast Fourier Transform (FFT)
Consider the formula $x_n = y_n + \omega_N^n z_n$ for the indices $n = \frac{N}{2}, \dots, N-1$:
$$x_{n+\frac{N}{2}} = y_{n+\frac{N}{2}} + \omega_N^{\,n+\frac{N}{2}}\, z_{n+\frac{N}{2}} \qquad \text{for } n = 0, \dots, \tfrac{N}{2}-1.$$
Since $\omega_N^{\,n+\frac{N}{2}} = -\omega_N^n$ (because $\omega_N^{N/2} = -1$) and $y_n$ and $z_n$ have a period of $\frac{N}{2}$, we obtain the so-called butterfly scheme:
$$x_n = y_n + \omega_N^n z_n, \qquad x_{n+\frac{N}{2}} = y_n - \omega_N^n z_n.$$
Fast Fourier Transform – Butterfly Scheme
$$(x_0, x_1, \dots, x_{N-1}) = \mathrm{IDFT}(X_0, X_1, \dots, X_{N-1})$$
$$\Downarrow$$
$$(y_0, y_1, \dots, y_{\frac{N}{2}-1}) = \mathrm{IDFT}(X_0, X_2, \dots, X_{N-2})$$
[Figure: butterfly scheme for $N = 4$ — the half-size results $(y_0, y_1)$ and $(z_0, z_1)$, obtained from the even- and odd-indexed $X_k$, are combined into $(x_0, x_1, x_2, x_3)$.]
Fast Fourier Transform – Butterfly Scheme (2)
[Figure: complete butterfly scheme for $N = 8$ — the coefficients $X_0, \dots, X_7$ are first reordered (even/odd splitting over all recursion levels) and then combined over $\log_2 N = 3$ butterfly levels into the result $x_0, \dots, x_7$.]
Recursive Implementation of the FFT
rekFFT(X) $\longrightarrow$ x
(1) Generate vectors Y and Z:
    for $n = 0, \dots, \frac{N}{2}-1$: $\quad Y_n := X_{2n}$ and $Z_n := X_{2n+1}$
(2) Compute 2 FFTs of half size:
    rekFFT(Y) $\longrightarrow$ y and rekFFT(Z) $\longrightarrow$ z
(3) Combine with the "butterfly scheme":
    for $k = 0, \dots, \frac{N}{2}-1$: $\quad x_k = y_k + \omega_N^k z_k, \quad x_{k+\frac{N}{2}} = y_k - \omega_N^k z_k$
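A minimal C++ sketch of this recursion (assuming $N = 2^p$, the IDFT convention $\omega_N = e^{2\pi i/N}$ used above, and in-place overwriting of X with the result x; the vector-based signature is an illustrative choice, not the lecture's reference code):

#include <complex>
#include <vector>
#include <cmath>
using cplx = std::complex<double>;

/** recursive FFT following the scheme above: X is overwritten by x = IDFT(X) */
void rekFFT(std::vector<cplx>& X) {
    const int N = (int) X.size();
    if (N == 1) return;
    // (1) generate vectors Y and Z from the even-/odd-indexed entries
    std::vector<cplx> Y(N/2), Z(N/2);
    for (int n=0; n<N/2; n++) { Y[n] = X[2*n];  Z[n] = X[2*n+1]; }
    // (2) compute 2 FFTs of half size
    rekFFT(Y);  rekFFT(Z);
    // (3) combine with the butterfly scheme
    const double PI = std::acos(-1.0);
    for (int k=0; k<N/2; k++) {
        cplx w = std::polar(1.0, 2.0*PI*k/N);   // omega_N^k
        X[k]     = Y[k] + w * Z[k];
        X[k+N/2] = Y[k] - w * Z[k];
    }
}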
Observations on the Recursive FFT
• Computational effort $C(N)$ (for $N = 2^p$) is given by the recursion equation
$$C(N) = \begin{cases} \mathcal{O}(1) & \text{for } N = 1,\\ \mathcal{O}(N) + 2\,C\!\left(\frac{N}{2}\right) & \text{for } N > 1 \end{cases} \qquad\Rightarrow\qquad C(N) = \mathcal{O}(N \log N)$$
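Unrolling the recursion makes the bound explicit (a short sketch, assuming $N = 2^p$ and writing the $\mathcal{O}(N)$ term as $cN$):
$$C(N) = cN + 2\,C\!\left(\tfrac{N}{2}\right) = cN + cN + 4\,C\!\left(\tfrac{N}{4}\right) = \dots = \underbrace{cN + \dots + cN}_{p\ \text{terms}} + N\,C(1) = \mathcal{O}(N \log N).$$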
Sorting Phase of the FFT – Bit Reversal
Observation:
• even indices are sorted into the upper half,
odd indices into the lower half.
• distinction even/odd based on least significant bit
• distinction upper/lower based on most significant bit
⇒ An index in the sorted array has the reversed (i.e. mirrored) binary representation compared to the original index (e.g., for $N = 8$: index $3 = 011_2$ ends up at position $110_2 = 6$).
Sorting of a Vector (N = 2p Entries, Bit Reversal)
/** FFT sorting phase: reorder data in array X */
for (int n=0; n<N; n++) {
    // compute p-bit bit reversal of n in j
    int j = 0;  int m = n;
    for (int i=0; i<p; i++) {
        j = 2*j + m%2;  m = m/2;
    }
    // if j > n, exchange X[j] and X[n]:
    if (j > n) {
        complex<double> h;
        h = X[j];  X[j] = X[n];  X[n] = h;
    }
}
Bit reversal of a single index needs $\mathcal{O}(p) = \mathcal{O}(\log N)$ operations
⇒ the sorting phase therefore also has a complexity of $\mathcal{O}(N \log N)$
⇒ sorting may consume up to 10–30 % of the CPU time!
Iterative Implementation of the “Butterflies”
[Figure: the $N = 8$ butterfly scheme read column by column — each butterfly level $L = 2, 4, 8$ corresponds to one full pass over the array.]
Iterative Implementation of the “Butterflies”
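A minimal C++ sketch of the complete iterative FFT — the bit-reversal sorting phase followed by level-wise butterfly passes (the loop nest that appears as "variant 1" below); the wrapper function, the twiddle table w with w[L/2+j] $= \omega_L^{\,j}$, and the convention $\omega_N = e^{2\pi i/N}$ are illustrative assumptions, not the lecture's reference code:

#include <complex>
#include <vector>
#include <cmath>
using cplx = std::complex<double>;

/** iterative FFT: X is overwritten by x = IDFT(X), N = 2^p */
void iterFFT(std::vector<cplx>& X) {
    const int N = (int) X.size();
    int p = 0;  while ((1 << p) < N) p++;
    // sorting phase: bit reversal (cf. the sorting slide above)
    for (int n=0; n<N; n++) {
        int j = 0, m = n;
        for (int i=0; i<p; i++) { j = 2*j + m%2;  m = m/2; }
        if (j > n) std::swap(X[j], X[n]);
    }
    // precompute twiddle factors: w[L/2+j] = omega_L^j
    const double PI = std::acos(-1.0);
    std::vector<cplx> w(N);
    for (int L=2; L<=N; L*=2)
        for (int j=0; j<L/2; j++)
            w[L/2+j] = std::polar(1.0, 2.0*PI*j/L);
    // butterfly phase: log2(N) passes over the array ("variant 1" loop order)
    for (int L=2; L<=N; L*=2)
        for (int k=0; k<N; k+=L)
            for (int j=0; j<L/2; j++) {
                cplx z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j]     = X[k+j] + z;
            }
}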
Iterative Implementation – Variant 1
Iterative Implementation – Variant 1 Vectorised
Iterative Implementation – Variant 2
Cache Efficiency – Variant 1 Revisited
/** FFT butterfly phase: variant 1 */
for (int L=2; L<=N; L*=2)
    for (int k=0; k<N; k+=L)
        for (int j=0; j<L/2; j++) {
            complex<double> z = w[L/2+j] * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j]     = X[k+j] + z;
        }
Observation:
• each pass of the L-loop traverses the entire array X
• in the ideal case $(N \log N)/B$ cache line transfers ($N/B$ per value of L, with B the size of a cache line), unless all N elements fit into the cache
Compare with the recursive scheme:
• if $L < M_C$ ($M_C$ the cache size), then an entire sub-FFT of size L fits into the cache
• is it thus possible to require only $N \log N/(M_C\, B)$ cache line transfers?
Butterfly Phase with Loop Blocking
/** FFT butterfly phase: loop blocking for k */
for (int L=2; L<=N; L*=2)
    for (int kb=0; kb<N; kb+=M)
        for (int k=kb; k<kb+M; k+=L)
            for (int j=0; j<L/2; j++) {
                complex<double> z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j]     = X[k+j] + z;
            }
Question: can we make the L-loop an inner loop?
• the kb-loop and the L-loop may be swapped if $M > L$
• however, we assumed $N > M_C$ ("data does not fit into the cache")
• we thus need to split the L-loop into a phase L=2..M (in cache) and a phase L=2*M..N (out of cache)
Butterfly Phase with Loop Blocking (2)
/** perform all butterfly levels of size L <= M (blockwise, in cache) */
for (int kb=0; kb<N; kb+=M)
    for (int L=2; L<=M; L*=2)
        for (int k=kb; k<kb+M; k+=L)
            for (int j=0; j<L/2; j++) {
                complex<double> z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j]     = X[k+j] + z;
            }
/** perform the remaining butterfly levels of size L > M */
for (int L=2*M; L<=N; L*=2)
    for (int k=0; k<N; k+=L)
        for (int j=0; j<L/2; j++) {
            complex<double> z = w[L/2+j] * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j]     = X[k+j] + z;
        }
Loop Blocking and Recursion – Illustration
[Figure: the $N = 8$ butterfly scheme once more, with the butterfly levels grouped into blocks, illustrating how loop blocking mimics the cache behaviour of the recursive scheme.]
Outlook: Parallel External Memory and I/O Model
[Figure: sketch of the parallel external memory model — several CPUs, each with its own fast working memory, attached to a large shared external memory.]
Outlook: Parallel External Memory
Classical I/O model:
• large, global memory (main memory, hard disk, etc.)
• the CPU can only access a smaller working memory (cache, main memory, etc.) of $M_C$ words each
• both are organised in cache lines of B words
• algorithmic complexity is determined by the number of memory transfers
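As a small illustration of counting in this model: scanning an array of N contiguous words costs about
$$\lceil N/B \rceil = \mathcal{O}(N/B)$$
block transfers — this is the unit in which the blocked and non-blocked FFT variants are compared below.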
Outlook: FFT and Parallel External Memory
Consider Loop-Blocking Implementation:
Outlook: FFT and Parallel External Memory (2)
Consider Non-Blocking Implementation:
Compute-Bound vs. Memory-Bound Performance
Consider a memory-bandwidth-intensive algorithm:
• the CPU can execute far more flops per second than operands can be read from memory
• computational intensity of a code: the number of performed flops per accessed byte
Memory-Bound Performance:
• computational intensity smaller than critical ratio
• you could execute additional flops “for free”
• speedup only possible by reducing memory accesses
Compute-Bound Performance:
• enough computational work to “hide” memory latency
• speedup only possible by reducing operations
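An illustrative example (numbers not from the lecture): a vector update $x_i \leftarrow x_i + \alpha\, y_i$ performs 2 flops per element but moves 3 doubles (load $x_i$, load $y_i$, store $x_i$), i.e.
$$\text{computational intensity} = \frac{2\ \text{flops}}{24\ \text{bytes}} \approx 0.083\ \text{flops/byte},$$
which lies far below the critical ratio of current CPUs — such a kernel is memory-bound.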
Outlook: The Roofline Model
[Figure: roofline diagram — attainable GFlop/s (log scale) over computational intensity; the sloped roof "peak stream bandwidth" has ceilings for non-unit-stride and NUMA memory access, the flat roof "peak FP performance" has ceilings for missing vectorization and missing instruction-level parallelism; marked example kernels: SpMV, 5-point stencil, matrix multiplication (100×100).]
Outlook: The Roofline Model
Memory-Bound Performance:
• available bandwidth of $a$ bytes per second
• computational intensity small: $x$ flops per byte
• the CPU thus executes $a \cdot x$ flops per second
• linear increase of the Flop/s with the variable $x$ ⇒ linear part of the "roofline"
• "ceilings": memory bandwidth limited due to "bad" memory access (strided access, non-uniform memory access, etc.) ⇒ smaller $a$
Compute-Bound Performance:
• computational intensity large
• the CPU executes the highest possible Flop/s ⇒ flat/constant "rooftop"
• "ceilings": fewer Flop/s due to a "bad" instruction mix (no vectorization, bad branch prediction, no fused multiply-add instructions, etc.)
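In formula form, the attainable performance for computational intensity $x$ is
$$P(x) = \min\bigl(P_{\text{peak}},\; a \cdot x\bigr);$$
with illustrative numbers (not measured values), e.g. $a = 100\ \text{GB/s}$ and $P_{\text{peak}} = 500\ \text{GFlop/s}$, the ridge point lies at $x = P_{\text{peak}}/a = 5$ flops per byte — kernels with smaller $x$ sit on the bandwidth roof, kernels with larger $x$ on the flat compute roof.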