Algorithms of Scientific Computing: Fast Fourier Transform (FFT)


Algorithms of Scientific Computing

Fast Fourier Transform (FFT)

Michael Bader
Technical University of Munich
Summer 2024
Fast Fourier Transform – Outline
• Discrete Fourier transform
• Fast Fourier transform
• Special Fourier transforms:
– real-valued FFT
– sine/cosine transform
• Applications:
– Fast Poisson solver (FST)
– Computer graphics (FCT)
• Efficient Implementation

Literature/Reference:
William L. Briggs and Van Emden Henson: The DFT – An Owner’s Manual for the Discrete
Fourier Transform. SIAM, 1995. https://doi.org/10.1137/1.9781611971514

Michael Bader | Algorithms of Scientific Computing | Fast Fourier Transform (FFT) | Summer 2024 2
The Pair DFT/IDFT as Matrix-Vector Product
DFT and IDFT may be computed in the form
$$F_k = \frac{1}{N} \sum_{n=0}^{N-1} f_n\, \omega_N^{-nk} \qquad\qquad f_n = \sum_{k=0}^{N-1} F_k\, \omega_N^{nk}$$

or as matrix-vector products
$$F = \frac{1}{N}\, W^H f\,, \qquad f = W F\,,$$
with a computational complexity of O(N 2 ).

Note that
$$\mathrm{DFT}(f) = \frac{1}{N}\, \overline{\mathrm{IDFT}\big(\overline{f}\big)}\,.$$

A fast computation is possible via the divide-and-conquer approach.
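As a baseline, the two sums above can be implemented directly in O(N²). The sketch below is our own illustration (the names `dft`/`idft` and the use of `std::vector` are not from the slides); it also lets one verify the round trip IDFT(DFT(f)) = f:

```cpp
#include <complex>
#include <vector>
#include <cmath>

using cvec = std::vector<std::complex<double>>;

// Naive O(N^2) DFT: F_k = (1/N) * sum_{n=0}^{N-1} f_n * omega_N^{-nk}
cvec dft(const cvec& f) {
    const int N = (int)f.size();
    const double pi = std::acos(-1.0);
    cvec F(N);
    for (int k = 0; k < N; k++)
        for (int n = 0; n < N; n++)
            F[k] += f[n] * std::polar(1.0, -2.0 * pi * n * k / N) / double(N);
    return F;
}

// Naive O(N^2) IDFT: f_n = sum_{k=0}^{N-1} F_k * omega_N^{nk}
cvec idft(const cvec& F) {
    const int N = (int)F.size();
    const double pi = std::acos(-1.0);
    cvec f(N);
    for (int n = 0; n < N; n++)
        for (int k = 0; k < N; k++)
            f[n] += F[k] * std::polar(1.0, 2.0 * pi * n * k / N);
    return f;
}
```

Both routines perform N complex multiply-adds per output element, i.e. O(N²) work in total — the cost the FFT will reduce to O(N log N).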


Fast Fourier Transform for N = 2^p
Basic idea: sum up even and odd indices separately in the IDFT
→ first for n = 0, 1, . . . , N/2 − 1:

$$x_n = \sum_{k=0}^{N-1} X_k\,\omega_N^{nk} = \sum_{k=0}^{N/2-1} X_{2k}\,\omega_N^{2nk} + \sum_{k=0}^{N/2-1} X_{2k+1}\,\omega_N^{(2k+1)n}$$

We set $Y_k := X_{2k}$ and $Z_k := X_{2k+1}$, use $\omega_N^{2nk} = \omega_{N/2}^{nk}$, and get a sum of two
IDFTs on N/2 coefficients:

$$x_n = \sum_{k=0}^{N-1} X_k\,\omega_N^{nk} = \underbrace{\sum_{k=0}^{N/2-1} Y_k\,\omega_{N/2}^{nk}}_{=:\,y_n} + \;\omega_N^n \underbrace{\sum_{k=0}^{N/2-1} Z_k\,\omega_{N/2}^{nk}}_{=:\,z_n}\,.$$

Note: this formula is actually valid for all n = 0, . . . , N − 1; however, the IDFTs of size N/2 will only
deliver the y_n and z_n for n = 0, . . . , N/2 − 1 (but: y_n and z_n are periodic!)

Fast Fourier Transform (FFT)
Consider the formula x_n = y_n + ω_N^n z_n for the indices N/2, . . . , N − 1:

$$x_{n+N/2} = y_{n+N/2} + \omega_N^{\,n+N/2}\, z_{n+N/2} \qquad \text{for } n = 0, \ldots, N/2 - 1$$

Since $\omega_N^{\,n+N/2} = -\omega_N^n$, and y_n and z_n have a period of N/2, we obtain the
so-called butterfly scheme:

$$x_n = y_n + \omega_N^n z_n \qquad\qquad x_{n+N/2} = y_n - \omega_N^n z_n$$

(Diagram: one butterfly — y_n and z_n enter, z_n is multiplied by the weight ω_N^n, and the sum y_n + ω^n z_n and the difference y_n − ω^n z_n leave.)

Fast Fourier Transform – Butterfly Scheme
$$(x_0, x_1, \ldots, x_{N-1}) = \mathrm{IDFT}(X_0, X_1, \ldots, X_{N-1})$$
$$(y_0, y_1, \ldots, y_{N/2-1}) = \mathrm{IDFT}(X_0, X_2, \ldots, X_{N-2})$$
$$(z_0, z_1, \ldots, z_{N/2-1}) = \mathrm{IDFT}(X_1, X_3, \ldots, X_{N-1})$$

(Diagram for N = 4: the inputs X_0, X_1, X_2, X_3 are reordered to X_0, X_2, X_1, X_3, transformed by two size-2 IDFTs into y_0, y_1 and z_0, z_1, and combined by butterflies into x_0, x_1, x_2, x_3.)

Fast Fourier Transform – Butterfly Scheme (2)
(Diagram for N = 8: the inputs X_0, . . . , X_7 are first brought into bit-reversed order X_0, X_4, X_2, X_6, X_1, X_5, X_3, X_7; then three butterfly levels of size 2, 4, and 8 produce the intermediate results and finally the outputs x_0, . . . , x_7.)

Recursive Implementation of the FFT
rekFFT(X) −→ x
(1) Generate vectors Y and Z:
for n = 0, . . . , N/2 − 1 : Y_n := X_{2n} and Z_n := X_{2n+1}
(2) compute 2 FFTs of half size:
rekFFT(Y) −→ y and rekFFT(Z) −→ z
(3) combine with “butterfly scheme”:
for k = 0, . . . , N/2 − 1 :
$$x_k = y_k + \omega_N^k z_k\,, \qquad x_{k+N/2} = y_k - \omega_N^k z_k$$
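A minimal C++ transcription of this recursive scheme might look as follows (a sketch: the name `rek_fft`, the use of `std::vector`, and computing ω_N^k via `std::polar` are our own choices):

```cpp
#include <complex>
#include <vector>
#include <cmath>

using cvec = std::vector<std::complex<double>>;

// Recursive IDFT-style FFT for N = 2^p, following steps (1)-(3):
// split into even/odd coefficients, two half-size FFTs, butterfly combine.
cvec rek_fft(const cvec& X) {
    const int N = (int)X.size();
    if (N == 1) return X;
    // (1) generate vectors Y (even indices) and Z (odd indices)
    cvec Y(N/2), Z(N/2);
    for (int n = 0; n < N/2; n++) { Y[n] = X[2*n]; Z[n] = X[2*n+1]; }
    // (2) compute two FFTs of half size
    cvec y = rek_fft(Y), z = rek_fft(Z);
    // (3) combine with the butterfly scheme
    const double pi = std::acos(-1.0);
    cvec x(N);
    for (int k = 0; k < N/2; k++) {
        std::complex<double> w = std::polar(1.0, 2.0 * pi * k / N); // omega_N^k
        x[k]       = y[k] + w * z[k];
        x[k + N/2] = y[k] - w * z[k];
    }
    return x;
}
```

For example, the unit impulse X = (0, 1, 0, 0) should transform to x_n = ω_4^n, i.e. (1, i, −1, −i).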

Observations on the Recursive FFT
• Computational effort C(N) (N = 2^p) given by the recursion equation
$$C(N) = \begin{cases} O(1) & \text{for } N = 1 \\ O(N) + 2\, C(N/2) & \text{for } N > 1 \end{cases} \qquad \Rightarrow \quad C(N) = O(N \log N)$$

• The algorithm splits into two phases:
– resorting of the input data
– combination following the “butterfly scheme”
⇒ Performing the resorting up front enables a simple, iterative algorithm without
additional memory requirements.

Sorting Phase of the FFT – Bit Reversal
Observation:
• even indices are sorted into the upper half,
odd indices into the lower half.
• distinction even/odd based on least significant bit
• distinction upper/lower based on most significant bit
⇒ An index in the sorted field has the reversed (i.e. mirrored) binary
representation compared to the original index.

Sorting of a Vector (N = 2^p Entries, Bit Reversal)
/** FFT sorting phase: reorder data in array X */
for (int n = 0; n < N; n++) {
    // compute the p-bit bit reversal of n in j
    int j = 0; int m = n;
    for (int i = 0; i < p; i++) {
        j = 2*j + m % 2;  m = m / 2;
    }
    // if j > n, exchange X[j] and X[n]:
    if (j > n) {
        complex<double> h = X[j];
        X[j] = X[n];  X[n] = h;
    }
}
Bit reversal needs O(p) = O(log N) operations
⇒ Sorting results also in a complexity of O(N log N)
⇒ Sorting may consume up to 10–30 % of the CPU time!
Iterative Implementation of the “Butterflies”
(Diagram: the complete butterfly scheme for N = 8 again — bit-reversed inputs, then butterfly levels of size L = 2, 4, 8.)

Iterative Implementation of the “Butterflies”

// loop over the size of the IDFTs
for (int L = 2; L <= N; L *= 2)
    // loop over the IDFTs of one level
    for (int k = 0; k < N; k += L)
        // perform all butterflies of one IDFT
        for (int j = 0; j < L/2; j++) {
            // complex computation:
            z ← ω_L^j · X[k+j+L/2]
            X[k+j+L/2] ← X[k+j] − z
            X[k+j] ← X[k+j] + z
        }

• k-loop and j-loop are “permutable”!
• How and when are the ω_L^j computed?
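Combining the sorting phase with this butterfly loop nest yields a complete in-place transform. The following sketch is our own (the name `iter_fft` and all implementation details are our choices; each ω_L^j is recomputed on the fly here) and can be checked against the defining IDFT sum:

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <utility>

// In-place iterative IDFT-style FFT for N = 2^p:
// sorting phase (bit reversal) followed by the butterfly phase.
void iter_fft(std::vector<std::complex<double>>& X) {
    const int N = (int)X.size();
    int p = 0; while ((1 << p) < N) p++;
    // sorting phase: swap each element with its bit-reversed index
    for (int n = 0; n < N; n++) {
        int j = 0, m = n;
        for (int i = 0; i < p; i++) { j = 2*j + m % 2; m /= 2; }
        if (j > n) std::swap(X[j], X[n]);
    }
    // butterfly phase: levels L = 2, 4, ..., N
    const double pi = std::acos(-1.0);
    for (int L = 2; L <= N; L *= 2)
        for (int k = 0; k < N; k += L)
            for (int j = 0; j < L/2; j++) {
                std::complex<double> w = std::polar(1.0, 2.0 * pi * j / L); // omega_L^j
                std::complex<double> z = w * X[k + j + L/2];
                X[k + j + L/2] = X[k + j] - z;
                X[k + j]       = X[k + j] + z;
            }
}
```

For example, the constant input X = (1, 1, 1, 1) transforms to (4, 0, 0, 0), since all twiddle-weighted sums cancel except for n = 0.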

Iterative Implementation – Variant 1

/** FFT butterfly phase: variant 1 */
for (int L = 2; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            complex<double> z = omega(L,j) * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
Advantage: consecutive (“stride-1”) access to data in array X
⇒ suitable for vectorisation
⇒ good cache performance due to prefetching (stream access) and usage
of cache lines
Disadvantage: multiple computations of ω_L^j

Iterative Implementation – Variant 1 Vectorised

/** FFT butterfly phase: variant 1, vectorised */
for (int L = 2; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j += SIMD_LENGTH) {
            kjStart = k+j;
            kjEnd = k+j + (SIMD_LENGTH-1);
            complex<double> z[0:SIMD_LENGTH-1];
            z[0:SIMD_LENGTH-1] = omega(L,j) * X[kjStart+L/2 : kjEnd+L/2];
            X[kjStart+L/2 : kjEnd+L/2] = X[kjStart:kjEnd] - z;
            X[kjStart:kjEnd] = X[kjStart:kjEnd] + z;
        }
Comments on the code:
• we assume that our CPU offers SIMD instructions that can add/multiply short vectors of
SIMD_LENGTH complex numbers
• with the notation X[kjStart:kjEnd] we mean a subvector of SIMD_LENGTH consecutive
complex numbers starting from element X[kjStart]
• Note: the term omega(L,j) has to be turned into such a short vector as well

Iterative Implementation – Variant 2

/** FFT butterfly phase: variant 2 */
for (int L = 2; L <= N; L *= 2)
    for (int j = 0; j < L/2; j++) {
        complex<double> w = omega(L,j);
        for (int k = 0; k < N; k += L) {
            complex<double> z = w * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
    }
Advantage: each ω_L^j is computed only once

Disadvantage: “stride-L” access to the array X
⇒ worse cache performance (inefficient use of cache lines)
⇒ not suitable for vectorisation
Separate Computation of ω_L^j
• necessary: N − 1 factors
$$\omega_2^0;\ \ \omega_4^0, \omega_4^1;\ \ \ldots;\ \ \omega_L^0, \ldots, \omega_L^{L/2-1};\ \ \ldots;\ \ \omega_N^0, \ldots, \omega_N^{N/2-1}$$

• are computed in advance and stored in an array w, e.g.:

for (int L = 2; L <= N; L *= 2)
    for (int j = 0; j < L/2; j++)
        w[L/2+j] ← ω_L^j;
• Variant 2: accesses w in sequential order
• Variant 1: accesses to w are local (but repeated) and compatible with
vectorisation
• Important: weight array w[:] needs to stay in cache!
(as accesses to main memory can be slower than recomputation)
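A sketch of this precomputation (the function name `twiddles` is our own; ω_L^j = e^{2πij/L} matches the IDFT sign convention used throughout these slides, and index w[0] simply stays unused):

```cpp
#include <complex>
#include <vector>
#include <cmath>

// Precompute all N-1 twiddle factors: w[L/2+j] = omega_L^j
// for L = 2, 4, ..., N and j = 0, ..., L/2-1.
std::vector<std::complex<double>> twiddles(int N) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> w(N);
    for (int L = 2; L <= N; L *= 2)
        for (int j = 0; j < L/2; j++)
            w[L/2 + j] = std::polar(1.0, 2.0 * pi * j / L); // omega_L^j
    return w;
}
```

The indexing scheme packs the factors for level L into slots w[L/2], ..., w[L−1], so all levels together fill exactly the N − 1 slots w[1], ..., w[N−1].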

Cache Efficiency – Variant 1 Revisited
/** FFT butterfly phase: variant 1 */
for (int L = 2; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            complex<double> z = w[L/2+j] * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
Observation:
• each L-loop traverses entire array X
• in the ideal case (N log N)/B cache line transfers (N/B per L-loop,
B the size of the cache line), unless all N elements fit into cache
Compare with the recursive scheme:
• as soon as a sub-FFT of size L < M_C (M_C the cache size) is reached, it runs entirely in cache
• is it thus possible to require only ≈ N log N/(B log M_C) cache line transfers?
Butterfly Phase with Loop Blocking
/** FFT butterfly phase: loop blocking for k */
for (int L = 2; L <= N; L *= 2)
    for (int kb = 0; kb < N; kb += M)
        for (int k = kb; k < kb+M; k += L)
            for (int j = 0; j < L/2; j++) {
                complex<double> z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j] = X[k+j] + z;
            }
Question: can we make the L-loop an inner loop?
• kb-loop and L-loop may be swapped, if M > L
• however, we assumed N > MC (“data does not fit into cache”)
• we thus need to split the L-loop into a phase L=2..M (in cache) and a
phase L=2*M..N (out of cache)

Butterfly Phase with Loop Blocking (2)
/** perform all butterfly phases of size M */
for (int kb = 0; kb < N; kb += M)
    for (int L = 2; L <= M; L *= 2)
        for (int k = kb; k < kb+M; k += L)
            for (int j = 0; j < L/2; j++) {
                complex<double> z = w[L/2+j] * X[k+j+L/2];
                X[k+j+L/2] = X[k+j] - z;
                X[k+j] = X[k+j] + z;
            }

/** perform remaining butterfly levels of size L > M */
for (int L = 2*M; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            complex<double> z = w[L/2+j] * X[k+j+L/2];
            X[k+j+L/2] = X[k+j] - z;
            X[k+j] = X[k+j] + z;
        }
Loop Blocking and Recursion – Illustration
(Diagram: the N = 8 butterfly scheme once more, illustrating how the first butterfly levels act only within contiguous blocks that fit into cache, while the last level combines results across blocks.)

Outlook: Parallel External Memory and I/O Model

(Diagram: several CPUs, each with a private local cache of M words, attached to a large external shared memory.)
[Arge, Goodrich, Nelson, Sitchinava, 2008]

Outlook: Parallel External Memory
Classical I/O model:
• large, global memory (main memory, hard disk, etc.)
• CPU can only access a smaller working memory
(cache, main memory, etc.) of M_C words each
• both organised as cache lines of size B words
• algorithmic complexity determined by memory transfers

Extended by the Parallel External Memory model:
• multiple CPUs access private caches
• caches fetch data from external memory
• exclusive/concurrent read/write classification
(similar to the PRAM model)

Outlook: FFT and Parallel External Memory
Consider Loop-Blocking Implementation:

/** perform all butterfly phases of size M */
for (int kb = 0; kb < N; kb += M)
    for (int L = 2; L <= M; L *= 2)
        for (int k = kb; k < kb+M; k += L)
            for (int j = 0; j < L/2; j++) {
                /* ... */
            }
• choose M such that one kb-block (M elements) fits into cache
• then: L-loop and inner loops access only cached data
• number of cache line transfers therefore:
≈ N/B, i.e. N divided by the words per cache line (ideal case)

Outlook: FFT and Parallel External Memory (2)
Consider Non-Blocking Implementation:

/** perform remaining butterfly levels of size L > M */
for (int L = 2*M; L <= N; L *= 2)
    for (int k = 0; k < N; k += L)
        for (int j = 0; j < L/2; j++) {
            /* ... */
        }
• assume: N too large to fit all elements into cache
• then: each L-iteration needs to reload all elements of X into cache
• number of cache line transfers therefore:
≈ N/B per L-iteration, i.e. N divided by the words per cache line (ideal case)

Compute-Bound vs. Memory-Bound Performance
Consider a memory-bandwidth-intensive algorithm:
• the CPU can execute far more flops than operands can be read from memory
• computational intensity of a code:
the number of performed flops per accessed byte

Memory-Bound Performance:
• computational intensity smaller than critical ratio
• you could execute additional flops “for free”
• speedup only possible by reducing memory accesses

Compute-Bound Performance:
• enough computational work to “hide” memory latency
• speedup only possible by reducing operations

Outlook: The Roofline Model
(Roofline plot: attainable performance in GFlop/s (log scale) over operational intensity in Flops/Byte (log scale, 1/8 to 64). The slanted part of the roof is given by the peak stream bandwidth, with lower ceilings for non-unit-stride and non-NUMA-aware access; the flat part is the peak FP performance, with lower ceilings without vectorization and without instruction-level parallelism. Sample kernels: SpMV, 5-point stencil, matrix multiplication (100×100).)

[Williams, Waterman, Patterson, 2008]

Outlook: The Roofline Model
Memory-Bound Performance:
• available bandwidth of a bytes per second
• computational intensity small: x flops per byte
• CPU thus executes a · x flops per second
• linear increase of the Flop/s with x → the slanted part of the “roofline”
• “ceilings”: memory bandwidth reduced due to “bad” memory access
(strided access, non-uniform memory access, etc.) → smaller a

Compute-Bound Performance:
• computational intensity large
• CPU executes the highest possible Flop/s → the flat/constant “rooftop”
• “ceilings”: fewer Flop/s due to a “bad” instruction mix
(no vectorization, bad branch prediction, no multiply-add instructions, etc.)
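The resulting roof can be summarised in one formula: attainable Flop/s = min(peak, a · x). A minimal sketch (the function name and all numbers in the usage note are purely illustrative, not measurements):

```cpp
#include <algorithm>

// Roofline model: attainable GFlop/s for a kernel with the given
// operational intensity (flops per byte), limited either by the
// machine's peak FP rate or by its memory bandwidth.
double roofline(double peak_gflops, double bandwidth_gbytes, double intensity) {
    return std::min(peak_gflops, bandwidth_gbytes * intensity);
}
```

For example, on a hypothetical machine with 100 GFlop/s peak and 50 GB/s bandwidth, a kernel at 0.5 Flops/Byte is memory-bound at 25 GFlop/s; the ridge point between the two regimes lies at 2 Flops/Byte.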

