Handbook of Real-Time Fast Fourier Transforms
IEEE PRESS Editorial Board
John B. Anderson, Editor in Chief
Technical Reviewers
Vito J. Sisto
E-Systems, Inc.
James S. Walker
Mathematics Department
University ofWisconsin, Eau Claire
John C. Russ
Materials Science and Engineering Department
North Carolina State University
Handbook
of Real-Time
Fast Fourier Transforms
Algorithms to Product Testing
Winthrop W. Smith
Joanne M. Smith
IEEE PRESS
The Institute of Electrical and Electronics Engineers, Inc., New York
WILEY-INTERSCIENCE
A JOHN WILEY & SONS, INC., PUBLICATION
New York • Chichester • Weinheim • Brisbane • Singapore • Toronto
A NOTE TO THE READER
This book has been electronically reproduced from
digital information stored at John Wiley & Sons,
Inc. We are pleased that the use of this new
technology will enable us to keep works of
enduring scholarly value in print as long as there is
reasonable demand for them. The content of this
book is identical to previous printings.
Preface xxi
1 Overview 1
1.0 Introduction 1
1.1 Laying the Foundation 1
1.2 Design Decisions 2
1.2.1 Number of Dimensions 2
1.2.2 Type of Processing 2
1.2.3 Arithmetic Format 2
1.2.4 Weighting Functions 3
1.2.5 Transform Length 3
1.2.6 Algorithm Building Blocks 3
1.2.7 Algorithm Construction 3
1.2.8 DSP Chips 3
1.2.9 Architectures 3
1.2.10 Mapping Algorithms onto Architectures 4
1.2.11 Board Decisions and Selection 4
1.2.12 Test Signals and Procedures 4
1.3 Types of Examples 4
1.3.1 Eight-Point DFT to FFT Example 5
1.3.2 Algorithm Steps and Memory Maps 5
1.3.3 Fifteen-Point or 16-Point
FFT Algorithm Examples 5
1.3.4 Sixteen-Point Radix-4 FFT Algorithm Examples 5
1.3.5 Four-Point FFT and 16-Point Radix-4
FFT Algorithm Examples 5
4 Weighting Functions 35
4.0 Introduction 35
4.1 Six Performance Measures 35
4.1.1 Highest Sidelobe Level 36
4.1.2 Sidelobe Fall-off Ratio 36
4.1.3 Frequency Straddle Loss 36
4.1.4 Coherent Integration Gain 36
4.1.5 Equivalent Noise Bandwidth 36
4.1.6 Three dB Main-Lobe Bandwidth 37
4.2 Weighting Function Equations and Their FFTs 37
4.2.1 Rectangular 37
4.2.2 Triangular 38
4.2.3 Sine Lobe 39
4.2.4 Hanning 40
4.2.5 Sine Cubed 40
4.2.6 Sine to the Fourth 41
4.2.7 Hamming 42
4.2.8 Blackman 43
4.2.9 Three-Sample Blackman-Harris 43
4.2.10 Four-Sample Blackman-Harris 45
4.2.11 Kaiser-Bessel 46
4.2.12 Gaussian 48
4.2.13 Dolph-Chebyshev 49
4.2.14 Finite Impulse Response Filter Design
Techniques 52
4.3 Weighting Function Comparison Matrix 52
4.4 Conclusions 53
5 Frequency Analysis 55
5.0 Introduction 55
5.1 Five Performance Measures 55
5.1.1 Input Sample Overlap 55
5.1.2 Sidelobe Level 56
5.1.3 Frequency Straddle Loss 56
5.1.4 Frequency Resolution 56
5.1.5 Coherent Integration Gain 57
5.2 Computational Techniques 57
5.2.1 Nonoverlapped 57
5.2.2 Overlapped 58
5.2.3 Weighting Functions 58
5.3 Conclusions 59
7 Multidimensional Processing 73
7.0 Introduction 73
7.1 Frequency Analysis 74
7.1.1 Two Dimensions 74
7.1.2 Three or More Dimensions 75
7.2 Linear Filtering 75
7.2.1 Separable Two-Dimensional Filter 76
7.2.2 Frequency Domain Approach 76
7.2.3 Three and More Dimensions 77
7.3 Pattern Matching 78
7.3.1 Separable Two-Dimensional Pattern Matching 78
7.3.2 Frequency Domain Approach 79
7.3.3 Three and More Dimensions 80
7.4 Conclusions 80
8 Building-Block Algorithms 81
8.0 Introduction 81
8.1 Four Performance Measures 81
8.1.1 Number of Adds 82
8.1.2 Number of Multiplies 82
8.1.3 Number of Memory Locations for
Multiplier Constants 82
8.1.4 Number of Data Memory Locations 83
8.2 Ten Building-Block Algorithm Constraints 83
8.3 Two-Point FFT 84
8.4 Three-Point FFT 85
8.4.1 Winograd 3-Point FFT 85
8.4.2 Singleton 3-Point FFT 86
8.5 Four-Point FFT 87
8.6 Five-Point FFT 88
8.6.1 Winograd 5-Point FFT 89
8.6.2 Singleton 5-Point FFT 91
8.6.3 Rader 5-Point FFT 93
8.7 Seven-Point FFT 96
8.7.1 Winograd 7-Point FFT 97
8.7.2 Singleton 7-Point FFT 101
8.8 Eight-Point FFT 103
8.8.1 Winograd 8-Point FFT 104
8.8.2 Eight-Point Radix-4 and -2 Algorithm 107
8.8.3 Eight-Point Radix-2 Algorithm 110
8.8.4 PTL 8-Point FFT 113
14 Chips 323
14.0 Introduction 323
14.1 Five FFT Performance Measures 324
14.1.1 1024-Point Complex FFT 324
14.1.2 Data I/O Ports 324
14.1.3 On-Chip Data Memory Words 325
14.1.4 On-Chip Program Memory Words 325
14.1.5 Number of Address Generators 325
14.2 Generic Programmable DSP Chip 325
14.2.1 Block Diagram 326
14.2.2 On-Chip Data Memory 326
14.2.3 On-Chip Program Memory 327
14.2.4 On-Chip Data Buses 327
14.2.5 Off-Chip Data Bus 327
14.2.6 On-Chip Address Buses 328
14.2.7 Off-Chip Address Bus 328
14.2.8 Address Generators 328
14.2.9 Serial I/O Ports 329
14.2.10 Program Control 332
16 Test 395
16.0 Introduction 395
16.1 Example 395
16.2 Errors during Algorithm Development 395
16.2.1 Arithmetic Check 397
16.2.2 Memory Map Check 399
16.3 Errors during Code Development 400
16.3.1 Coding the Building-Block Algorithm 400
16.3.2 Coding the Multiplier Constants 401
16.3.3 Coding the Memory Mapping 401
16.3.4 Coding the Relabeled Memory Maps 402
16.4 Errors during Product Operation 402
16.4.1 Arithmetic Unit 402
16.4.2 Address Generator 403
16.4.3 Data Memory 403
16.4.4 Program Memory 404
16.4.5 Data I/O 404
16.5 Test Signal Features 404
16.5.1 Unit Pulse 404
16.5.2 Constants 405
16.5.3 Single Sine Waves 406
16.5.4 Pair of Sine Waves 406
Glossary 449
Index 457
Preface
This book gives engineers and other technical innovators the foundation and facts they
need to construct and implement fast Fourier transforms (FFTs) that synthesize, recognize,
enhance, compress, modify, or analyze signals. Because of special integrated circuits,
known as digital signal processing (DSP) chips, a wide array of applications, from magnetic
resonance imaging (MRI) to Doppler weather radar, can be implemented affordably. Increased demand
for wireless communication, multimedia, and consumer products has created the need for
high-volume, low-cost, multifunction, DSP-based products that use FFTs for their signal
processing or data manipulation.
In 1974, E. Oran Brigham lived and worked in the small East Texas town of Greenville.
He was employed by a little-known aerospace company named E-Systems, Inc. when his
230-page book, The Fast Fourier Transform [1], was published. Over the years it has
helped thousands of engineers learn the fundamentals of that analytical tool. After moving
to Greenville in 1991 for Win to join E-Systems, we decided to write a book that continued
the efforts begun here two decades before: putting practical information about FFTs into
the hands of practicing professionals and engineering students.
The explosion of digital products, ignited by the proliferation of integrated circuits
in the 21 years since Brigham's book came out, marks the coming of age for computing
FFTs. Because of personal computers, with chips or plug-in boards for doing DSP functions,
including FFTs, thousands of engineers, scientists, and students now work with and develop
new FFT techniques and products. The National Information Infrastructure, popularly
called "The Information Superhighway," and other digital-based goods and services now
provide the impetus for sophisticated new products, once driven by the Department of
Defense.
The book addresses the following areas of real-time FFT implementation:
• How to select DSP chips and commercial off-the-shelf (COTS) boards for FFT
applications
• How to detect and isolate errors in every phase of development
The goal of the book is to provide a single-source reference for the elements used
in programming real-time FFT algorithms on DSP and special-purpose chips. It uses a
building-block approach to constructing several FFT algorithms. Extensive use is made
of examples and spreadsheet-style comparison charts. With hundreds of figures, tables,
and Algorithm Steps, its practical features are geared to assist design engineers, scientists,
researchers, and students. The book may even open the design of FFT-based products
to innovators with no prior FFT experience, if they have microprocessor programming,
engineering, or mathematics backgrounds. Though useful as a handy reference book by
topic, it is laid out in a logical sequence that can be a textbook for a course on applied FFTs.
Sid Burrus and Tom Parks's book DFT/FFT and Convolution Algorithms [2], written
a decade ago, met the mushrooming hunger of engineers for TMS32010 code, which
would make it easier to use the new Texas Instruments chip for computing FFT algorithms.
Mainstream applications for consumer products incorporating FFTs, precipitated by recent
advances in integrated circuits, especially ASICs, have fostered a need to:
Win's 28-year DSP career in both military and commercial companies, teaching
courses and seminars nationwide, has repeatedly shown him that engineers need to be able
to work easily with FFTs of any length to do real-time signal conversion and analysis.
Joanne's 12 years' experience as founder and president of two DSP companies has given
her exposure to the rapidly changing technology, market, and economic realities of this
industry. Coauthoring a book seemed the logical way to combine our diverse talents and
complementary perspectives to comprehensively address the topic of real-time fast Fourier
transform algorithms.
This book is only one of several tools for expanding the knowledge base of the DSP
community. A service called DSP Net provides access to the latest vendor information in this
field through the Internet. DSP and Multimedia Technology magazine addresses this growing
market, as do two annual applications-oriented conferences, DSPx and the International
Conference on Signal Processing Applications & Technology. The IEEE International
Conference on Acoustics, Speech and Signal Processing holds its 20th annual gathering in
1995. The chip vendors have free bulletin boards for algorithms, code, and other pertinent
information. Additional information on resources available to design engineers should be
sent to the authors, in care of the publisher, for possible inclusion in follow-up publications.
ACKNOWLEDGMENTS
We are pleased to thank Frank J. Thomas, Rosalie Sinnett, Thomas L. Loposer, Randy
Davis, and Wayne Yuhasz, who convinced us we could accomplish this effort; Ross A.
McClain, Jr., Jeffrey W. Marquis, Vito J. Sisto, V. Rex Tanakit, and Joel Morris, Ph.D.,
for their contributions during the editing process; Harold W. Cates, Ph.D., and Robert H.
Whalen, for their mentoring of Win's career; the many friends and colleagues who have
encouraged us throughout our careers; and our daughters Patricia and Paula for not letting
us give up. Most of all we thank God for His inspiration, guidance, and strength throughout
this seemingly impossible task.
REFERENCES
[1] E. Oran Brigham, The Fast Fourier Transform, Prentice-Hall, Englewood Cliffs, NJ,
1974.
[2] C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms, Wiley, New York,
1985.
[3] John P. Sweeney, "Mainstream Applications Require Optimized Assembly Language
for Fast DSPs," EDN, April 28, 1994.
1
Overview
1.0 INTRODUCTION
The increased demand for communication, multimedia, and other consumer products has
created the need for high-volume, low-cost, multifunction DSP-based products that can
use fast Fourier transforms (FFTs) for their signal processing or data manipulation. This
book is the first to cover FFTs from algorithms to product testing, with the information
needed to create FFT algorithms of any length, and convert them to code, on 10 different archi-
tectures. It uses a building-block approach for constructing the algorithms. Included are
recommended Memory Maps to streamline assembly and high-level language coding of 17
small-point FFTs, four general algorithms, and seven FFT algorithm examples. To ensure
that the algorithms work properly, a test approach for the detection and isolation of errors,
refined over many years of time-consuming searches for mistakes in FFT algorithms, is
detailed.
Spreadsheet-style comparison matrices provide easy-to-use inventories of the com-
prehensive array of key FFT elements and performance measures. Dozens of digital signal
processing (DSP) chips and criteria for selecting DSP boards are covered. Four design
examples at the end of the book show how to apply most of what has been explained.
Chapters 2 and 3 provide the technical foundation and mathematical equations for the al-
gorithms in Chapters 8 and 9. The discrete Fourier transform (DFT) is an equation for
converting time domain data into its frequency components. The DFT equation is imple-
mented with FFT algorithms because they are computationally efficient ways of calculating
it. All the properties and strengths of the DFT are shared by the wide variety of FFTs that
have been developed over the years. However, only three of the five weaknesses of the DFT
are also weaknesses of FFT algorithms.
In the beginning of the design process, comparison of the uses and properties of the
DFT with the technical specifications of the application will determine if the DFT is a good
match. If so, then it makes sense to examine the FFT algorithms, hardware architectures,
arithmetic formats, and mappings in this book to decide which combination is best for a
specific design.
The decisions listed are the ones related to real-time FFT selection and implementation.
They are listed in an order which differs from the sequence of the chapters, because learning
the facts happens more easily in an order that is different from applying them.
1.2.9 Architectures
Bit-slice arithmetic chips were used to construct FFT applications prior to the in-
troduction of DSP chips. However, advances in silicon technology have replaced bit-slice
building blocks with DSP chips that include a complete fixed- or floating-point multiplier
and adder, as well as memory and program control logic.
All of the DSP chips in this book use a Harvard architecture for interconnecting
these elements. FFT-specific chips interconnect several arithmetic building blocks into a
small-point FFT to increase performance. Multiprocessor interconnections (pipeline, linear
bus, ring bus, crossbar, two- and three-dimensional massively parallel, star, hypercube, and
hybrid architectures) of DSP chips are used when a single chip is not adequate. In fact, up to
four Harvard processors are now available on a single chip (SPROC 1000 and TMS320C80
families). Chapter 10 describes bit-slice, integrated arithmetic, and FFT-specific hardware
building blocks. Then Chapter 11 shows how to use them in single and multiprocessor
architectures. These two chapters prepare the reader for mapping the algorithms in Chapter 9
onto these architectures.
1. Algorithm performance
2. I/O performance
3. Architecture
4. Software support
5. Expansion capability
because they are large enough to show the pattern of an algorithm yet small enough to easily
follow.
In Chapter 17, frequency analysis, power spectrum estimation, linear filtering, and two-
dimensional processing examples were chosen to illustrate:
Whether the design will be single or multiple chip on single or multiple boards may not
be determined until far into the design process. In this chapter both multiple-chip and
multiple-board applications are developed to illustrate making those decisions. These are
not intended to be full-scale product designs. They are taken far enough into a design to
show how to use the wide array of information in the book.
Example 4 is another PC plug-in board, this one for doing image deblurring. The PC
housing this board could be found at a police station, at a crime lab, or on the bench of
an engineer or researcher. Though deblurring images does not have the widespread uses of
the first three examples, the image processing principles it employs do. Among the applications
that use them are CAT scans and MRIs, seismic exploration, and multimedia applications. Like Example 2,
this product does frequency domain conversion, the third common use of the DFT.
1.5 CONCLUSIONS
This chapter provides an overview of the contents of the book. From a foundation in the
DFT through design examples, the authors have tried to present a logical, easy-to-follow
explanation of how to implement real-time FFTs on commercially available processors.
Digital signal processing is a mushrooming field of technology. The FFT is a valuable
technique for synthesizing, recognizing, enhancing, compressing, modifying, or analyzing
digital signals from many sources.
The next chapter, on the DFT, lays the foundation for all that is said about the FFT in
subsequent chapters.
2
The Discrete Fourier Transform
2.0 INTRODUCTION
The discrete Fourier transform (DFT) is an equation for converting time domain data into
frequency domain data. Discrete means that the signal is sampled in time rather than being
continuous. Therefore, the DFT is an approximation for the continuous Fourier transform
[1]. This approximation works well when the frequencies in the signal are all less than half
the sampling rate (Section 2.3.1) and do not vary more than the filter spacing (Section 2.3.2).
Because of heat-transfer work done by the French mathematician J. B. Fourier in
the early 1800s, many fields of science and engineering have benefited from the use of his
mathematical link between time and frequency domains, called the Fourier transform. This
link is valuable because many natural or man-made signals (waveforms) are periodic and
thus can be expressed in terms of a sum of sine waves. Mathematicians realized that rather
than compute continuous spectra, they could take discrete data points in the time domain and
translate that information into the frequency domain, and so the discrete Fourier transform
came into being.
The DFT equation, unlike the continuous Fourier transform, covers a finite time and
frequency span. These data points may be collected from the output of an analog-to-digital
(A/D) converter, generated by a digital computer, or output from another signal processing
algorithm. They can be the plotted points of the performance of any numerical data, such
as stock prices. The DFT equation is implemented with FFT algorithms because they are
computationally efficient ways of calculating it. The properties (Section 2.3) and strengths
(Section 2.5) of the DFT also belong to the FFT. However, only three of the weaknesses
(Section 2.6) of the DFT are also weaknesses of FFT algorithms.
Comparison of the uses and properties of the DFT, with the technical specifications
of the application, determines if the DFT will be useful. If so, it makes sense to examine
the FFT algorithms, hardware architectures, arithmetic formats, and mappings in this book
to decide which combination of them will provide the specified performance. This chapter
lays the technical foundation for the FFT algorithms in Chapters 8 and 9.
Equation 2-1 is the standard description of the DFT of N complex data points, a(n).
$$A(k) = \sum_{n=0}^{N-1} a(n)\,W_N^{kn}, \quad \text{where } W_N = \cos(2\pi/N) - j\sin(2\pi/N) \tag{2-1}$$
Before the DFT properties are described, it is useful to have a simple picture of the function
that Equation 2-1 is performing.
Since Equation 2-1 takes the same set of N input data points, a(n), and produces
N output signals, A(k), each representing a different frequency, the N-point DFT can be
modeled as an array of N narrowband filters, each providing an output if the input signal has
frequency components in its passband. Since a narrowband filter can be implemented with
a multiplier and a low-pass filter (LPF), Figure 2-1, on page 11, can be used to represent
the DFT. The only difference between the DFT and this array of narrowband filters is that
the DFT only produces an output from each filter every N input samples. A narrowband
filter produces an output for every new input data point.
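This filter-bank behavior can be checked by evaluating Equation 2-1 directly. The short Python sketch below is illustrative only (the 16-point length and the bin-3 test tone are arbitrary choices, and the function name is not from the book); it computes the DFT exactly as the equation is written, so a sine wave that completes an integer number of cycles in the N samples produces an output in a single bin.

import cmath

def dft(a):
    # Direct evaluation of Equation 2-1: A(k) = sum over n of a(n) * W_N^(k*n),
    # with W_N = cos(2*pi/N) - j*sin(2*pi/N) = exp(-j*2*pi/N).
    N = len(a)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(a[n] * W ** (k * n) for n in range(N)) for k in range(N)]

# A complex sine wave with exactly 3 cycles in 16 samples lands entirely in bin 3.
samples = [cmath.exp(2j * cmath.pi * 3 * n / 16) for n in range(16)]
print([round(abs(A), 6) for A in dft(samples)])   # only A(3) is (approximately) nonzero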
2.3 PROPERTIES
All FFT algorithms are just faster ways of computing the DFT equations; they are not ap-
proximations for the DFT equations. Thus the DFT properties described in this section apply
to all FFT algorithms. These properties have been derived in detail in many textbooks [1-4].
[Figure 2-1. The DFT modeled as N narrowband filters: the input a(n) feeds N filters with outputs A(0), A(1), ..., A(N-1).]
the Nyquist rate [6]. The DFT determines the presence of zero-frequency signals in the input
data points by calculating A(0). The A(1) term in Equation 2-1 determines the presence of
a sine wave that goes through exactly one 360° cycle during the N data points. Similarly,
the A(k) term determines the presence of sine waves that go through exactly k 360° cycles
during the N data samples.
The frequencies A(k) in Equation 2-1 are the only ones that the DFT computes. When
the frequency of a signal is higher than the sampling rate, the sampled version of the signal
appears to be at the signal's frequency minus the sampling rate. To illustrate this, consider
a sine-wave signal that goes through exactly N 360° cycles during the N input data points.
That means it goes through exactly one 360° cycle between each data point. Therefore,
every time it is sampled it has the same data value. However, a zero-frequency signal also
has the same value each time it is sampled. Therefore, the DFT cannot distinguish between
zero-frequency sine waves and sine waves that go through N 360° cycles during the N
samples.
The Nyquist rate is a formal mathematical description of this phenomenon. For a DFT
to accurately represent frequencies up to F samples per second, a sample rate of at least
2 * F samples per second is required. Further, frequencies that are higher will appear to be
lower-frequency signals (ambiguous), just as the sine waves in the previous paragraph that
had N 360° cycles in N samples looked the same as the zero-frequency sine wave. A sine
wave with 2 * N 360° cycles in N samples also looks the same as a zero-frequency sine
wave.
For real signals, the sampling theorem, as stated above and by Shannon, holds directly.
If the samples are complex, real and imaginary samples are taken at the sampling rate. The
result is two samples at the sampling rate or samples taken at twice the sampling rate. This
implies that, for complex sampling, frequencies are unambiguously analyzed by the DFT
up to the complex sampling rate F.
Since there are N equally spaced DFT filters between zero and the sampling rate, the
spacing between the filters is 1/N times the sampling rate. It is important to note that 1/N
times the sampling rate is also the reciprocal of the total time period over which the N samples were
taken. Therefore, the filter spacing is equal to 1/(total time for data collected for the DFT input).
Further, the DFT filters are designed so that, if a signal has an input frequency in the center
of one of the filters, the other filters do not respond. Therefore, the spacing between the
center of a DFT filter and its first null response is equal to 1/(total time for data collected
for the DFT input). In filtering terms, each DFT filter has a null in its response at the input
frequencies of the other filters.
2.3.3 Linearity
Linearity means that the output of the DFT for the sum of two input signals is ex-
actly the same as summing the DFT outputs of two individual input signals, as shown in
Equation 2-2.
$$C(k) = \sum_{n=0}^{N-1}[a(n) + b(n)]\,W_N^{kn} = \sum_{n=0}^{N-1}a(n)\,W_N^{kn} + \sum_{n=0}^{N-1}b(n)\,W_N^{kn} = A(k) + B(k) \tag{2-2}$$
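Equation 2-2 is easy to confirm numerically. The following sketch assumes NumPy's FFT as a stand-in for any DFT implementation; the 32-point length and the random test signals are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
N = 32
a = rng.standard_normal(N) + 1j * rng.standard_normal(N)
b = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# The DFT of the sum equals the sum of the individual DFTs (Equation 2-2).
print(np.allclose(np.fft.fft(a + b), np.fft.fft(a) + np.fft.fft(b)))   # True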
2.3.4 Symmetry
The symmetry property is helpful in understanding the response of a DFT to a par-
ticular waveform. It states that if A(k) is the DFT of a(n), then an input waveform with the
shape of A(n) will have a DFT equal to a(N - k).
The inverse discrete Fourier transform (IDFT), shown in Equation 2-3, is used to
convert frequency information into time domain data points. This property allows the DFT
to be used to perform linear filtering and pattern matching in the frequency domain. These
frequency domain algorithms are described in Chapter 6 and often require fewer adds and
multiplies than doing linear filtering and pattern matching directly in the time domain.
$$a(n) = \frac{1}{N}\sum_{k=0}^{N-1} A(k)\,W_N^{-kn}, \quad \text{where } W_N^{-1} = \cos(2\pi/N) + j\sin(2\pi/N) \tag{2-3}$$
Notice that the IDFT, Equation 2-3, is similar to Equation 2-1, which describes the
DFT. This similarity makes it possible to use almost the same algorithm to compute the IDFT
as is used for the DFT. This is most simply illustrated by Equations 2-4 and 2-5. Except for
the factor of 1/N, the difference between the IDFT equation and the DFT equation is the
sign of the sine terms of $W_N^{kn}$.
$$W_N^{kn} = \cos(2\pi kn/N) - j\sin(2\pi kn/N) \tag{2-4}$$
$$W_N^{-kn} = \cos(2\pi kn/N) + j\sin(2\pi kn/N) \tag{2-5}$$
Therefore, any DFT or FFT algorithm can be converted to its comparable IDFT algorithm
by changing the sign of the coefficient multipliers formed by the sine terms and dividing
the results by N. This becomes important when using the frequency domain algorithms in
Chapter 6 to perform linear filtering and pattern matching. In those algorithms, FFTs and
IFFTs are required. This property allows the same FFT algorithm to be used for both the
FFT and IFFT portions of the computations.
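The sign-change rule can be exercised with any forward FFT routine. The sketch below (assuming NumPy's forward FFT) conjugates the input and the output, which flips the sign of the sine terms, and then divides by N to recover the IDFT.

import numpy as np

def idft_via_fft(A):
    # Conjugating before and after the forward FFT changes the sign of the sine
    # terms in the coefficients; the 1/N scaling is applied at the end.
    N = len(A)
    return np.conj(np.fft.fft(np.conj(A))) / N

A = np.fft.fft(np.arange(8, dtype=complex))
print(np.allclose(idft_via_fft(A), np.arange(8)))   # True: the original samples return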
[Figure 2-2. A sampled sine wave a(n) with two overlapping 16-sample windows marked: samples 0-15 and samples 4-19.]
sine-wave phase for samples 0-15 is zero, the A(1) FFT output has zero phase. Since
the sine-wave phase for samples 4-19 is 90°, the A(1) FFT output has 90° phase.
Similarly, if a frequency component A(k) is shifted to a new frequency A(k - i), then
the IDFT of the shifted frequency is a sine wave at frequency k - i. This sine wave can
also be obtained by multiplying a sine wave at frequency k by a sine wave at frequency i.
This is mathematically described by multiplying the original input signal by a complex sine
wave. Again, since the IDFT is linear, this phenomenon is true regardless of the number of
sine waves that comprise the sampled signal.
Time and frequency shifting are represented mathematically by Equations 2-6 and 2-7.
$$a(n + i) \Leftrightarrow A(k)\,e^{+j2\pi ki/N} \tag{2-6}$$
$$A(k - i) \Leftrightarrow a(n)\,e^{+j2\pi ni/N} \tag{2-7}$$
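The frequency-shift relation in Equation 2-7 can be verified in a few lines. The sketch below is a NumPy illustration with arbitrary choices (a 32-point length, a test tone in bin 3, and a shift of i = 5 bins): multiplying the input by a complex sine wave at frequency i moves the DFT output up by i bins.

import numpy as np

N, i = 32, 5
n = np.arange(N)
a = np.exp(2j * np.pi * 3 * n / N)                  # energy in bin 3
shifted = a * np.exp(2j * np.pi * i * n / N)        # multiply by a complex sine at frequency i

# Per Equation 2-7, the spectrum moves from bin 3 to bin 3 + i = 8.
print(int(np.argmax(np.abs(np.fft.fft(a)))), int(np.argmax(np.abs(np.fft.fft(shifted)))))   # 3 8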
Therefore, except for a factor of 1/N, the sum of the squared magnitudes of the FFT outputs
is the same as the sum of the squared magnitudes of the input samples. The forms of the
outputs of an FFT thus allow the power in a signal to be calculated as easily in the frequency
domain as in the time domain.
Figures 2-3 and 2-4 illustrate the effects zero padding has on the real and imaginary
parts of the responses of 12- and 16-point FFTs, for a 1-kHz sine wave that has been sampled
at 12 kHz. In Figure 2-3 the real part has an amplitude of zero and the imaginary part has
a nonzero amplitude at filters 1 and 11. This is because the sine wave has a 270° phase.
This particular phase was used so that the real parts would be obviously different between
the 12- and 16-point transforms. In Figure 2-4 the real and imaginary parts have nonzero
responses in most of the filters because four zeros are appended to the 12 actual samples,
and a 16-point FFT is performed.
[Figure 2-3. Real and imaginary parts of the 12-point FFT of the 1-kHz sine wave sampled at 12 kHz.]
[Figure 2-4. Real and imaginary parts of the 16-point FFT of the same 12 samples with four zeros appended.]
The 16 FFT filter outputs in Figure 2-4 only span a 12-kHz frequency range because
12 kHz is the sample rate. With 16 filters to span the 12 kHz, the frequency spacing between
them is smaller. This example shows that appending zeros to the end of the periodic sine
wave, to make it a power-of-two length, alters the real and imaginary responses of the FFT
filters. The weighting functions in Chapter 4 are used to minimize zero-padding effects.
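The 12- versus 16-point comparison described above can be reproduced with a short NumPy sketch (illustrative only): one cycle of the 270°-phase, 1-kHz sine wave sampled at 12 kHz falls entirely in filters 1 and 11 of the 12-point FFT, while the zero-padded 16-point FFT spreads energy into most of its filters.

import numpy as np

fs, f0 = 12_000.0, 1_000.0
n = np.arange(12)
x = np.sin(2 * np.pi * f0 * n / fs + np.deg2rad(270))   # one cycle, 270-degree phase

print(np.round(np.abs(np.fft.fft(x)), 3))        # 12-point: only bins 1 and 11 respond
print(np.round(np.abs(np.fft.fft(x, 16)), 3))    # 16-point (4 zeros appended): energy spreads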
2.3.11 Resolution
The resolution of two sine waves is defined as how close they can be in frequency
before they can no longer be distinguished. If two frequencies are positioned at adjacent
DFT filter outputs, namely A(k) and A(k+ 1), then they are distinguishable. If the frequency
at k + 1 moves closer to frequency k, then it will start to appear as part of the passband of
A(k), as well as A(k + 1), and it is no longer clear whether there is one signal at a frequency
between k and k + 1 or two separate signals near k and k + 1.
Therefore, the frequency resolution of the DFT is the separation between adjacent
filters. Since there are N filters that cover the region from zero to the sampling frequency,
the DFT resolution is the sampling frequency divided by N. This implies that, for a given
sampling rate, the longer the transform length the better the frequency resolution of the
analysis.
2.3.12 Periodicity
Section 2.3.1 showed that the DFT correctly analyzes frequencies from zero to half
the sampling frequency. All other frequencies appear to be frequencies between zero and
half the sampling rate. For complex inputs the real sampling rate is actually twice the
sampling rate for the real or imaginary parts because both are being sampled at the same
time. This leads to the two rules for the way frequencies below zero and above the sampling
rate are analyzed by the DFT, one for complex signals and the other for real signals.
For complex input signals, periodicity means that frequencies that are higher than
the sampling frequency appear at frequencies that are less than the sampling frequency
(A(N + k) ⇒ A(k)). Similarly, negative frequencies appear as if they are at the sampling
frequency minus their frequency (A(-k) ⇒ A(N - k)).
For real input signals with frequencies, k, below half the sampling rate, DFT filters
k and N - k respond. Note that these two responding filters are symmetric about half
the sampling rate. If the frequency is less than zero, add twice the sampling rate to the
frequency and then apply the rule in the first sentence of this paragraph.
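Both rules can be demonstrated numerically. The sketch below uses NumPy with an arbitrary 16-point length: a complex tone at N + 2 cycles per window aliases into bin 2, and a real tone at k = 3 produces responses in bins k and N - k.

import numpy as np

N = 16
n = np.arange(N)

# Complex input at N + 2 cycles per window: A(N + k) appears at A(k).
high = np.exp(2j * np.pi * (N + 2) * n / N)
print(int(np.argmax(np.abs(np.fft.fft(high)))))          # 2

# Real input at k = 3 cycles per window: DFT filters k and N - k both respond.
mags = np.abs(np.fft.fft(np.cos(2 * np.pi * 3 * n / N)))
print(np.flatnonzero(mags > 1e-6))                        # [ 3 13]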
2.3.13 Summary of Properties
These 12 DFT properties:
• Apply to all of the FFT algorithms in Chapters 8 and 9
• Provide the framework for the capabilities of FFTs described in Chapters 5, 6, and 7
• Allow multiple mapping options for FFTs onto the multiprocessor architectures in
Chapter 12
• Underlie the capabilities of the test signals in Chapter 16
• Provide the basis for using the FFT in the examples in Chapter 17
Equations 2-9 to 2-11 define the process of combining real signals a(n) and b(n) to
form a complex input to the DFT. Since both A(k) and B(k) are complex sets of numbers, an
additional step must be performed on the output of the DFT algorithm to separate these two
real input signals. The algorithms in this section show two ways of utilizing the DFT for
frequency analysis of real signals. The first is for the case of two independent real signals.
The second is to more rapidly compute the frequency content in a single real signal.
$$A(k) = \sum_{n=0}^{N-1} a(n)\,W_N^{kn} \tag{2-9}$$
$$B(k) = \sum_{n=0}^{N-1} b(n)\,W_N^{kn} \tag{2-10}$$
$$C(k) = A(k) + jB(k) = \sum_{n=0}^{N-1}[a(n) + jb(n)]\,W_N^{kn} \tag{2-11}$$
Stage 4: Compute the FFT Outputs for Each Real Input Signal
For each k = 0, 1, 2, ..., N - 1, identify the FFT outputs A(k) and B(k) for each of
the real input signals a(n) and b(n), respectively, as
A(k) = RP(k) + j * IM(k)
B(k) = IP(k) + j * RM(k)
The total number of computations for the two-signal algorithm is the number of adds and
multiplies required by the FFT algorithm plus the 2 * (N - 1) or 2 * (N - 2) adds in Stage 3,
depending on whether N is odd or even.
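Since Stages 1 through 3 are not reproduced here, the sketch below uses the standard formulation of the same two-real-signals technique: the symmetric and antisymmetric combinations of C(k) and C(N - k) play the role of the RP, IP, RM, and IM terms. The function and variable names are illustrative, not the book's.

import numpy as np

def two_real_ffts(a, b):
    # One complex N-point FFT analyzes two real N-point signals at once.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    N = len(a)
    C = np.fft.fft(a + 1j * b)
    Crev = np.conj(C[(-np.arange(N)) % N])        # conj(C(N - k)), with C(N) taken as C(0)
    return 0.5 * (C + Crev), -0.5j * (C - Crev)   # A(k), B(k)

a = np.random.default_rng(1).standard_normal(16)
b = np.random.default_rng(2).standard_normal(16)
A, B = two_real_ffts(a, b)
print(np.allclose(A, np.fft.fft(a)), np.allclose(B, np.fft.fft(b)))   # True True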
This requires 2 * (N - 1) adds and no multiplies because multiplying by 0.5 is just shifting the
binary point to the left 1 bit. Note that this algorithm does require each computed answer to
be stored in two places. This puts an additional burden on the memory address generators
of the DSP chips (Chapter 14) used to compute the answers.
2.5 STRENGTHS
The DFT has four types of strengths. The first two are associated with the types of data
the DFT analyzes. The third is associated with the way data (complex samples) must
be collected and processed by a DFT. The fourth is associated with the signal-to-noise
improvement offered by the DFT.
is ideal for analyzing the sine waves in a signal when the signal repeats an integer number
of times (i.e., is periodic) during the N input data samples.
Even if the data is not periodic during the N samples, the DFT output is still the
amplitude and phase of a set of frequencies that can be used to reconstruct the time domain
signal. However, the DFT's output frequencies are not the actual ones in the signal. The
frequency-shift-keyed (FSK) modem example in Section 2.6.5 is a good illustration of this
phenomenon. Therefore, the DFT is not particularly well suited for signals that are either
never periodic (random or transient) or are periodic at a rate different from the number of
samples in the transform. Example 2 in Chapter 17 shows how to use the DFT to analyze
random signals. The ability to choose any DFT length allows the DFT to match the period
of the transient input signals.
[Figure 2-5. Plot of the sampled signal a(n) versus sample number.]
Equation 2-1 shows that N input samples are summed to obtain each frequency
component value. If the input samples contain a frequency that is in the center of one of
the DFT's narrowband filters (Figure 2-1), then the frequency component at the output of
the appropriate filter will have an amplitude that is N times the amplitude of that input sine
wave. For example, the zero-frequency component A(0) sums the N samples with k = 0.
If those samples are all the same, the output A(0) is N times larger than the amplitude
of the input samples. This is one aspect of coherent integration.
The second aspect of coherent integration exhibited by each DFT output is a reduction
in noise bandwidth by a factor of N over the input signal. This is most easily understood by
using the sampling theorem (Nyquist rate) in Section 2.3.1. Namely, a signal that is properly
prepared for the DFT will have frequency components that go no higher than the sampling
rate. Therefore, the noise bandwidth into the DFT will be limited to the sampling rate. Since
this allowable bandwidth is divided into N pieces by the N DFT bandpass filters, the output
of any one of the filters can only have 1/N of the input noise power. Since white noise
is equally distributed across the available bandwidth by definition, the noise bandwidth of
each DFT filter is 1/N of the input bandwidth. The result is an improvement of a factor of
N in the signal-to-noise ratio of a single sine wave plus noise at the output of the DFT.
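Both aspects can be seen in a short NumPy experiment; the 64-point length, the bin-5 tone, and the noise seed below are arbitrary. The on-bin sine wave grows by a factor of N at its filter output, while the noise in any single filter grows only on the order of the square root of N.

import numpy as np

N = 64
n = np.arange(N)
rng = np.random.default_rng(0)

signal = np.exp(2j * np.pi * 5 * n / N)                                        # unit-amplitude tone in bin 5
noise = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)   # unit-power complex noise

X = np.fft.fft(signal + noise)
print(round(abs(X[5]), 1))                                 # close to N = 64: the tone adds coherently
print(round(float(np.mean(np.abs(np.delete(X, 5)))), 1))   # roughly sqrt(N): noise adds incoherently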
2.6 WEAKNESSES
The DFT has five weaknesses. The first two are improved through the use of FFT algorithms.
The second two are improved by applying a weighting function to data before computing
an FFT of it. The fifth, inaccurate identification of frequencies in a transient signal, is not
improved by FFT algorithms. Transforms that do identify transient signals are not addressed
in this book.
In a digital computer all numbers are represented by some number of bits either
as fixed- or floating-point numbers. When these numbers are used in multiplication, the
resulting number has more bits than either of the input numbers. Because the number of
bits used to represent a number must be controlled, to avoid running out of memory to store
the numbers, the outputs from arithmetic computations must be rounded off at some point.
The round-off process introduces an error that changes the results of all of the rest
of the computations that use the rounded-off results. This is called quantization noise
error. The numerous computations required by the DFT result in a lot of quantization noise
error. One of the advantages of FFT algorithms is that the reduced number of computations
reduces quantization noise error. This will be discussed quantitatively in Chapter 13.
Sidelobes are a way of describing how a filter responds to signals at frequencies that
are not in its main lobe, commonly called its passband. Specific details on the DFT's
sidelobes are discussed in Section 4.1.1, because weighting functions are used to control
the sidelobe behavior of DFT filters. Each DFT filter's first sidelobe is only 13 dB below the
main lobe (therefore considered high), and subsequent sidelobes fall off very slowly. The
result is that a signal with a strong frequency component, far away from the center frequency of a DFT
filter, will not be completely removed by that filter and can look like a significant signal at
the output of that filter.
Frequency straddle loss is the reduced output of a DFT filter caused by the input
signal not being at the filter's center frequency. The coherent gain of the DFT is N when
the input frequency is located at the center of one of the narrowband filters whose output is
A (k). If the input frequency is halfway between two of the narrowband filters, the coherent
gain is reduced, because half of the signal will appear in one filter and half in the other.
The difference between the full coherent gain of N and this lower gain is called frequency
straddle loss. This subject is explained in more detail in Section 4.1.3.
In Section 2.5.1 the DFT was shown to be ideal for analyzing signals that are periodic
within the number of samples being analyzed. Transient signals are not well analyzed by the
DFT. This is true regardless of whether the signal is a true transient or a transient sine wave.
An example of a transient sine wave is an FSK modem signal, which changes frequency
during the set of data points being analyzed. An FSK modem signal is a sum of two sine
waves, each of which lasts for a portion of the sequence of input samples. Figures 2-6 and
2-7 show an FSK modem signal and its DFT.
While the time waveform in Figure 2-6 shows just two frequencies, the DFT of the
time waveform in Figure 2-7 suggests there are five prominent frequencies and some smaller
ones. This is a result of the DFT analyzing transient signals as if they were periodic signals.
[Figure 2-6. Time waveform a(n) of the FSK modem signal versus data samples.]
[Figure 2-7. Magnitude A(k) of the DFT of the FSK modem signal versus frequency bins.]
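The spreading effect can be reproduced with a simple two-segment test signal. The sketch below is not the book's FSK example; it concatenates two sine-wave bursts with hypothetical frequencies and shows the DFT placing energy in clusters of bins around both frequencies rather than in two clean lines.

import numpy as np

N = 256
n = np.arange(N)
f1, f2 = 8, 20                       # cycles per block for each half (hypothetical values)
burst = np.where(n < N // 2,
                 np.sin(2 * np.pi * f1 * n / N),
                 np.sin(2 * np.pi * f2 * n / N))

mags = np.abs(np.fft.fft(burst))[: N // 2]
# The largest responses cluster around bins f1 and f2 instead of forming two single lines,
# because the DFT treats the transient block as if it were periodic.
print(sorted(int(k) for k in np.argsort(mags)[-6:]))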
2.7 CONCLUSIONS
The DFT is a sound computational method, whose characteristics make it useful in ma-
nipulating periodic signals and poor at dealing with transient signals, though it is used on
the latter when applied carefully with a thorough understanding of its limitations. Even
though the DFT equation assumes complex input signals, it is frequently used to analyze
real signals by doing input data reorganization and performing additional computations on
the output data.
Because the FFT inherits all the properties and strengths of the DFT, a firm foundation
about the DFT must be laid in order to see why FFTs are so useful and versatile. Its property
of linearity appears throughout the book in the implementation of many FFT algorithms.
The next two chapters deal with the ways that four of the five weaknesses of the DFT
are minimized. The fifth drawback, being poor at analyzing transient signals, requires
transforms not covered in this book, such as the wavelet and joint time-frequency transforms.
REFERENCES
[1] E. Oran Brigham, The Fast Fourier Transform, Prentice-Hall, Englewood Cliffs, NJ,
1974.
[2] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice-Hall, Engle-
wood Cliffs, NJ, 1975.
[3] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing,
Prentice-Hall, Englewood Cliffs, NJ, 1975.
[4] E. Oran Brigham, The Fast Fourier Transform and Its Applications, Prentice-Hall,
Englewood Cliffs, NJ, 1988.
[5] C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical
Journal, Vol. 27, pp. 379-423 (1948).
[6] H. Nyquist, "Certain Topics in Telegraph Transmission Theory," AIEE Transactions,
Vol. 47, pp. 617-644 (1928).
3
The Fast Fourier Transform
3.0 INTRODUCTION
Fast Fourier transforms (FFTs) are a group of algorithms for significantly speeding up the
computation of the DFT. The most widely known of these algorithms is attributed to Cooley
and Tukey [1] and is used for a number of points N equal to a power of two. A unique
feature of this book is that it provides multiple FFT algorithms for fast computation of any
length DFT. These are found in Chapters 8 and 9. In fact, the article by Cooley and Tukey
presented a non-power-of-two algorithm which has mostly been ignored. Several of the
algorithms in Chapter 9 are spin-offs of that work.
The most important fact about all FFT algorithms is that they are mathematically
equivalent to the DFT, not an approximation of it. This means that all of the properties,
strengths, and most of the weaknesses of the DFT apply to the FFT algorithms in this book.
The FFT improves two weaknesses of the DFT: the high number of adds and multiplies, and
quantization noise.
An example of converting an 8-point DFT to an FFT is used in this chapter to illustrate how FFTs
actually speed up the DFT. The chapter concludes with a detailed explanation of how to use
the building-block approach to construct FFTs.
The FFT improves the DFT by reducing the computational load and quantization noise of
the DFT.
Chapter 9 establishes that the number of computations required for FFT algorithms, regard-
less of the transform length, can be expressed as a constant times $N \log_2(N)$. Therefore, the
computation reduction factor when using an FFT algorithm is a constant times $N / \log_2(N)$.
The constant is different, but near 5, for each algorithm and nearly always provides a
significant advantage for using the FFT.
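The size of that advantage is easy to tabulate. The sketch below compares the direct-DFT real-operation counts used in Section 3.3 (4N² multiplies and 4N² - 2N adds) with the rough constant-times-N·log₂(N) FFT estimate; the constant of 5 is only the ballpark figure quoted above, not an exact count for any particular algorithm.

import math

def dft_ops(N):
    # Real operations for a direct N-point complex DFT: 4*N^2 multiplies plus 4*N^2 - 2*N adds.
    return 4 * N * N + (4 * N * N - 2 * N)

def fft_ops_estimate(N, c=5):
    # Rough FFT total: a constant (near 5 for many algorithms) times N*log2(N).
    return int(c * N * math.log2(N))

for N in (8, 64, 1024):
    print(N, dft_ops(N), fft_ops_estimate(N), round(dft_ops(N) / fft_ops_estimate(N), 1))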
to illustrate these techniques is to show the process for the 8-point DFT. This is the
only place in this book where an FFT algorithm is actually derived from its DFT ori-
gins. The rest of the book focuses on choosing and applying the algorithms, not deriv-
ing them. The building-block algorithms described in Chapter 8 are the result of using
techniques, such as those in this section, to remove redundant computations from small
DFTs.
Equation 3-2 is a simplified matrix representation of the 8-point DFT, based on Equa-
tion 2-1. The simplification over the standard DFT equation is easily visualized by drawing
the $W_8^{kn}$ terms as vectors on a unit circle (Figure 3-1). From Figure 3-1 it is clear that
$W_8^{kn}$ rotates around the unit circle as k * n increases and the vector returns to the same
location when k * n is increased by multiples of 8.
$$\begin{bmatrix} A_0\\ A_1\\ A_2\\ A_3\\ A_4\\ A_5\\ A_6\\ A_7 \end{bmatrix} =
\begin{bmatrix}
W^0 & W^0 & W^0 & W^0 & W^0 & W^0 & W^0 & W^0\\
W^0 & W^1 & W^2 & W^3 & W^4 & W^5 & W^6 & W^7\\
W^0 & W^2 & W^4 & W^6 & W^0 & W^2 & W^4 & W^6\\
W^0 & W^3 & W^6 & W^1 & W^4 & W^7 & W^2 & W^5\\
W^0 & W^4 & W^0 & W^4 & W^0 & W^4 & W^0 & W^4\\
W^0 & W^5 & W^2 & W^7 & W^4 & W^1 & W^6 & W^3\\
W^0 & W^6 & W^4 & W^2 & W^0 & W^6 & W^4 & W^2\\
W^0 & W^7 & W^6 & W^5 & W^4 & W^3 & W^2 & W^1
\end{bmatrix}
\begin{bmatrix} a_0\\ a_1\\ a_2\\ a_3\\ a_4\\ a_5\\ a_6\\ a_7 \end{bmatrix} \tag{3-2}$$
[Figure 3-1. The powers W^0 through W^7 of W = W_8 plotted as vectors on the unit circle.]
For example,
$$W_8^4 = W_8^{12} = W_8^{20} = W_8^{28} = W_8^{36} \tag{3-3}$$
This cyclic feature of $W_8^{kn}$ plays a primary role in the development of all of the FFT
algorithms in this book. In Equation 3-1 all of the exponents (k * n) of W larger than 8 have
been reduced to the equivalent power that is less than 8 by repeatedly subtracting 8 until the
exponent is less than 8. Using the example in Equation 3-3, the powers k * n = 36, 28, 20,
and 12 have all been replaced by $W^4$.
$$\begin{bmatrix} A_0\\ A_1\\ A_2\\ A_3\\ A_4\\ A_5\\ A_6\\ A_7 \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0\\
0 & 1 & 0 & W & 0 & -j & 0 & -jW\\
1 & 0 & -j & 0 & -1 & 0 & j & 0\\
0 & 1 & 0 & -jW & 0 & j & 0 & W\\
1 & 0 & -1 & 0 & 1 & 0 & -1 & 0\\
0 & 1 & 0 & -W & 0 & -j & 0 & jW\\
1 & 0 & j & 0 & -1 & 0 & -j & 0\\
0 & 1 & 0 & jW & 0 & j & 0 & -W
\end{bmatrix}
\begin{bmatrix} a_0+a_4\\ a_0-a_4\\ a_1+a_5\\ a_1-a_5\\ a_2+a_6\\ a_2-a_6\\ a_3+a_7\\ a_3-a_7 \end{bmatrix} \tag{3-4}$$
$$W_8^3 = W_8^2 * W_8^1 = -j * W_8^1$$
$$W_8^5 = W_8^2 * W_8^3 = (-j) * (-j) * W_8^1 = -W_8^1 \tag{3-5}$$
$$W_8^7 = W_8^2 * W_8^5 = -j * (-W_8^1) = j * W_8^1$$
The simplest example of using the property in Equation 3-5 to reduce computations is
in columns 0 and 4 of the matrix in Equation 3-4. Notice that rows 0 and 4 have 1's
multiplying the a0 + a4 and a2 + a6 terms in the right-hand column vector. Similarly,
rows 2 and 6 both subtract the a1 + a5 and a3 + a7 terms. In both cases, redundant
computations can be removed by performing the required computations once and us-
ing the results twice. Other symmetries similar to this illustration also exist in Equa-
tion 3-4. When all these are exploited, matrix Equation 3-4 is converted to matrix
Equation 3-6.
Equation 3-6.
In addition to removing redundant computations, the other important feature of this ap-
proach is that the required computations are performed in a way that allows them to be
efficiently used later in the algorithm. Specifically, the first step in this version of the 8-
point FFT algorithm is to compute the terms found in the right-hand vector in Equation
3-4. The second step is to combine these results as shown in the right-hand vector in
Equation 3-6.
The final observation in this example is based on noticing columns 0 and 4 of rows 0
and 4 in Equation 3-6. Notice that these terms in the matrix require the sum and difference
of terms in the right-hand vector. This does not reduce the overall computations. However,
it does complete the computational symmetry of the algorithm. The advantage of this is
that this algorithm needs only one computational building block, the sum and difference
calculation of a pair of numbers, which is called a butterfly. Therefore, not only has
this set of observations resulted in butterfly computations at each stage, but the number of
computations has also been reduced.
Figure 3-2 is a flowchart of the 8-point FFT. This algorithm's detailed equations are
in Section 8.8.2. Each node in the flowchart represents a complex add, which is two real
adds. There are 24 of these nodes, which corresponds to 48 adds. Similarly, there are
two complex multiplies in the algorithm. Since these multipliers are applied to a complex
number, the algorithm requires eight real multiplies and four additional real adds. Based on
Equation 3-1, the 8-point DFT requires 4 * N² = 256 multiplies and 4 * N² - 2 * N = 240
adds. Therefore, this algorithm reduces the total number of arithmetic operations from
256 + 240 = 496 to 48 + 8 + 4 = 60, more than a factor of 8.
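For readers who want to see the butterfly structure in running code, the recursive radix-2 sketch below computes the same result as the flowchart. It is a generic decimation-in-time formulation written for clarity, not the book's Algorithm Steps (those are in Section 8.8.2), and it is not optimized to match the operation counts quoted above.

import cmath

def fft_radix2(a):
    # Recursive radix-2 FFT: split into even- and odd-indexed halves, then combine
    # with sum/difference butterflies scaled by the twiddle factor W_N^k.
    N = len(a)
    if N == 1:
        return list(a)
    even = fft_radix2(a[0::2])
    odd = fft_radix2(a[1::2])
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t              # butterfly sum
        out[k + N // 2] = even[k] - t     # butterfly difference
    return out

x = [complex(i) for i in range(8)]
print([round(abs(v), 3) for v in fft_radix2(x)])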
To be absolutely fair, the $W^0 = 1$ and $W^4 = -1$ terms in Equation 3-1 do not
require complex multiplications. This reduces the DFT computational load by 16 complex multiplications.
[Figure 3-2. Flow graph of the 8-point FFT: inputs a0 through a7, outputs A0 through A7, with -1, -j, and -jW multiplier labels on the branches.]
[Figure 3-3. The input a(n) applied to a set of P narrowband filters (each a multiplier followed by a low-pass filter) with outputs A0 through A(P-1).]
frequency spectrum into Q equally spaced increments by using a Q-point DFT to implement
Q narrowband filters. The result is Q narrowband filters for each of the P filters as shown
in Figure 3-4. If Figure 3-4 were expanded by using the block diagram in Figure 3-3, there
would be Q narrowband filters for each of the P narrowband filters. Since a narrowband
filter connected to the output of a narrowband filter is also a narrowband filter, Figure 3-4
can be redrawn as P * Q narrowband filters.
Since these N = P * Q narrowband filter outputs are also equally spaced and cover
the same frequency spectrum as an array of N narrowband filters, they must be the same as
the ones implemented by a P * Q-point DFT. This is the strategy used by each of the FFT
algorithms in Chapter 9 to decompose the FFT into the smaller building blocks described
in Chapter 8.
If Figure 3-4 is compared with the prime factor algorithm block diagrams
(Figures 9-17 and 9-18) or the mixed-radix algorithm block diagrams (Figures 9-23, 9-
24, and 9-25), two differences are noticed. First, the frequency component outputs are in
different order in each of the figures. The details of the FFT algorithms result in these
different output frequency orders. Second, while Figure 3-4 and all of the FFT algorithms
have P Q-point FFTs, all of the FFT algorithms have Q P-point FFTs on the input and Fig-
ure 3-4 only has one. This makes it look like Figure 3-4 requires fewer computations than
the FFT algorithms in Chapter 9. The catch is that each of the P narrowband input filters
on the left-hand side of Figure 3-4 must process all N of the input data samples. However,
each of the P-point FFTs on the inputs to the FFT algorithms in Figures 9-17, 9-18, 9-23,
9-24, and 9-25 only processes Q points. In all cases each of the Q-point output filters and
[Figure 3-4. P sets of Q filters; the P-th set produces outputs A((P-1)*Q) through A(P*Q - 1).]
FFTs only processes P intermediate results. Section 3.3 shows how the FFT approach is
used to reduce the total computational load over using the narrowband filter approach.
3.5 CONCLUSIONS
The fast versions of the DFT overcome two of its weaknesses. The FFT reduces the computational
load (adds and multiplies) by significantly reducing the redundancy that is inherent
in the structure of the DFT equation. Quantization noise is also reduced by using FFTs
because the number of computations is less than with the DFT.
While improving the DFf so dramatically that it is now used in hundreds of applica-
tions, the FFT does not add any drawbacks of its own, which cannot be said for the element
covered in the next chapter. Weighting functions get teamed with FFTs to reduce two more
weaknesses of the DFT.
REFERENCES
[1] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex
Fourier Series," Mathematics of Computation, Vol. 19, p. 297 (1965).
4
Weighting Functions
4.0 INTRODUCTION
A weighting function, w(n), is a sequence of numbers that is multiplied times the input data
prior to performing a DFT on that data. Weighting (also called window) functions reduce
sidelobes of DFT filters and widen main lobes while, fortunately, not altering the locations
of the centers of the filters. The weighting functions in this chapter provide options to
reduce sidelobes from the -13-dB peak sidelobe of the DFT to as low as -94 dB.
Weighting function selection can be made early in the design process because the
choices of FFT algorithm and weighting function are independent of each other. Choice of
a weighting function to provide the specified sidelobe level is done without concern for the
FFT algorithm that will be used because:
• They work for any length FFT.
• They work the same for any FFT algorithm.
• They do not alter the FFT's ability to distinguish two frequencies (resolution).
Weighting functions are applied three ways:
• As a rectangular function, which does not modify the input data
• By having all the weighting function coefficients stored in memory
• By computing each coefficient when it is needed
of the narrowband filters in order to analytically compare weighting functions. All these
measures, except frequency straddle loss, refer to individual filters. Frequency straddle loss
is associated with how filters work together.
Sidelobes are a way of describing how a filter responds to signals at frequencies that
are not in its main lobe, commonly called its passband. Each FFT filter has several sidelobes.
With rare exception, the highest one is closest in frequency to the main lobe and is the one
that is most likely to cause the passband filter to respond when it should not. The higher a
sidelobe level is, the lower the amplitude of a signal outside the passband of the filter that
can produce a significant filter response. This response erroneously indicates the presence of
a signal in the passband.
noise. That noise is generally spread over the frequency spectrum of interest, and each
narrowband filter passes a certain amount of that noise through its main lobe and sidelobes.
White noise is used as the input signal and the noise power out of each filter is compared to
the noise power into the filter to determine the equivalent noise bandwidth of each passband
filter. In other words, equivalent noise bandwidth represents how much noise would come
through the filter if it had an absolutely flat passband gain and no sidelobes.
The standard definition of a filter's bandwidth is the frequency range over which sine
waves can pass through the filter without being attenuated more than a factor of 2 (3 dB)
relative to the gain of the filter at its center frequency. The narrower the main lobe, the
smaller the range of frequencies that can contribute to the output of any FFT filter. This
means that the accuracy of the FFT filter, in defining the frequencies in a waveform, is
improved by having a narrower main lobe.
This section gives the equations for 15 weighting functions and shows the plots of the
frequency responses of their corresponding FFT narrowband filters. It also gives the best
use of each weighting function. More details can be found in References 1 and 2.
4.2.1 Rectangular
For n = 0 to N - 1, w(n) = 1
[Figure: FFT of the rectangular weighting function (dB versus tenths of frequency bins).]
The rectangular weighting function is just the plain FFT without modifying the input
data samples. The peak of the highest sidelobe is only 13 dB (a factor of roughly 5) below
the main-lobe response, and the sidelobe peaks do not drop off rapidly. This makes it poor
for signals with multiple frequency components that have amplitudes that are more than 6
dB different from each other.
In contrast to the poor sidelobe performance, the main lobe is narrower and the
coherent gain higher than for any of the other weighting functions. This gives these FFT
filters the highest amplitude response to a frequency in the main lobe (coherent gain) and
the smallest output noise power (3-dB noise bandwidth). The narrow main lobe also causes
these FFT filters to have the poorest response when the frequency is halfway between two
adjacent filters (straddle loss). For these reasons, the rectangular weighting function is used
when maximum signal-to-noise ratios are critical.
4.2.2 Triangular
The triangular weighting function is used to provide sidelobes and straddle loss lower
than the rectangular weighting function and can be easily constructed as a sequence of
two straight-line segments. Notice that the sidelobes start off lower than the rectangular
weighting function by 14 dB and fall off faster than the rectangular weighting function.
The outstanding characteristic of this weighting function is the smaller number of sidelobes
[Figure: FFT of the triangular weighting function.]
than the others in this chapter. It is best used when additional sidelobe reduction, more than
the rectangular weighting function, is required and when the weighting function must be
computed by the processor because there is no room in its memory to store the values of
the weighting function.
[Figure: FFT of the sine-lobe weighting function.]
4.2.4 Hanning
The Hanning weighting function is slightly more complicated to compute than the
sine lobe. However, it provides 9 dB of additional sidelobe attenuation and can be computed
with constants that are already in memory for the complex multiplications between power-
of-two FFT building blocks. The peaks of its sidelobes fall off 50% faster than the triangular
and sine lobe weighting functions. This weighting function has better 3-dB bandwidth and
equivalent noise bandwidth than 16 of the 22 weighting functions in this chapter. These
features make it most useful when better than 32-dB sidelobe attenuation is needed, along
with 3-dB bandwidth that is less than 1.5 filter widths.
[Figure: FFT of the Hanning weighting function.]
be utilized without adding to memory allocated for constants but can afford adding to the
computational load for the arithmetic processor.
[Figure: FFT of the sine-cubed weighting function.]
[Figure: FFT of the sine-to-the-fourth weighting function.]
4.2.6 Sine to the Fourth
The sine to the fourth, like the sine-cubed weighting function, is one whose values are
not used as multiplier constants between power-of-two FFT building blocks. Therefore, if
constant memory is available, the weighting function constants are stored there. If not, two
multiplies are needed to square the values from the multiplier constants and then square
those results (sin²(nπ/N) * sin²(nπ/N)). Notice that the peak sidelobe is 47 dB below the
main lobe, and the peaks of the other sidelobes drop off 2.5 times as fast as the triangular
and sine-lobe weighting functions. This weighting function is most useful when better than
47 dB of sidelobe attenuation is needed, and the weighting function must be utilized without
adding to memory allocated for constants but can afford adding to the computational load
for the arithmetic processor.
4.2.7 Hamming
[Figure: FFT of the Hamming weighting function (vertical scale in dB).]
4.2.8 Blackman
For n = 0 to N - 1,
w(n) = 0.42 - 0.50 * cos(2πn/N) + 0.08 * cos(4πn/N)
The Blackman weighting function is an extension of the Hamming and Hanning
approaches of using multiplier constants that are already in memory for complex multipli-
cations between FFT stages. This weighting function also provides the best fall-off ratio
of any of the weighting functions with peak sidelobes below -50 dB. If the FFT multiplier
constants are used, two multiplies and two adds are required to compute each value to be
multiplied times the complex FFT input data. This increases the weighting function com-
putational load from two to six arithmetic operations per complex input data point, if it is
computed rather than stored in memory. This weighting function is most useful when over
50 dB of sidelobe attenuation is needed close to the main lobe and rapid sidelobe fall-off is
required to attenuate frequency components, with large amplitudes, that are separated from
each other by more than three to four FFT filters.
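A minimal C sketch of the Blackman equation above; whether the two cosines are computed
directly, as here, or assembled from stored FFT multiplier constants is an implementation choice.

#include <math.h>

/* Blackman weighting function:
   w(n) = 0.42 - 0.50*cos(2*pi*n/N) + 0.08*cos(4*pi*n/N), n = 0 .. N-1. */
void blackman_weights(double *w, int N)
{
    const double pi = acos(-1.0);
    for (int n = 0; n < N; n++)
        w[n] = 0.42 - 0.50 * cos(2.0 * pi * n / N)
                    + 0.08 * cos(4.0 * pi * n / N);
}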
Figures 4-9 and 4-10 show two of these weighting functions. Both provide over 60 dB of
peak sidelobe attenuation. Note one peculiarity of both: there is a dip in the peaks of the
sidelobes near the main lobe and then the sidelobes drop off monotonically. The difference
between (a) and (b) is that (b) provides additional sidelobe attenuation but requires a wider
3-dB main-lobe bandwidth. These weighting functions are most useful when over 60 dB of
attenuation is required and the width of the main lobe (frequency accuracy) is not critical.
4.2.11 Kaiser-Bessel
Figure 4-13 FFT of α = 2.0 Kaiser-Bessel weighting function (horizontal axis in
tenths of frequency bins, vertical scale in dB).
The Kaiser-Bessel weighting function is the ratio of two zero-order Bessel func-
tions of the first kind (I0(x)). Even though the summation that defines these Bessel func-
tions has an infinite number of terms, the functions have finite values [3]. In particular,
these Bessel functions have a value of 1 when x = 0, and they increase as x gets larger.
Figures 4-13 to 4-16 show Kaiser-Bessel weighting functions for different values of α.
These weighting functions have the most energy in the main lobe for a given peak sidelobe
level. The peaks of the sidelobes only fall off at 6 dB per octave. Therefore, this set of
weighting functions is most useful when the filters are being used to distinguish multiple
frequencies that have amplitudes that must be attenuated by the filter sidelobes by 46 to 82
dB, depending on which α is chosen.
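Although I0(x) is defined by an infinite series, its terms shrink quickly, so a short loop
suffices. The sketch below uses one common Kaiser-Bessel form, w(n) = I0(πα√(1 - r²))/I0(πα)
with r running from -1 to +1 across the window; the exact α convention used to generate
Figures 4-13 to 4-16 is an assumption here.

#include <math.h>

/* Zero-order modified Bessel function of the first kind, I0(x), summed from its
   power series; the terms fall off fast enough that a fixed tolerance works. */
static double bessel_i0(double x)
{
    double sum = 1.0, term = 1.0;
    for (int k = 1; term > 1e-12 * sum; k++) {
        term *= (x / (2.0 * k)) * (x / (2.0 * k));
        sum += term;
    }
    return sum;
}

/* One common Kaiser-Bessel weighting function for parameter alpha (N > 1). */
void kaiser_bessel_weights(double *w, int N, double alpha)
{
    const double pi = acos(-1.0);
    for (int n = 0; n < N; n++) {
        double r = 2.0 * n / (N - 1) - 1.0;        /* -1 ... +1 across the window */
        w[n] = bessel_i0(pi * alpha * sqrt(1.0 - r * r)) / bessel_i0(pi * alpha);
    }
}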
4.2.12 Gaussian
The next three weighting functions are derived by optimizing the weighting func-
tion for the minimum time-bandwidth product for a given sidelobe level. The narrower a
signal in the time domain, the wider it appears in the frequency domain. Likewise, sig-
nals that are represented with a narrow set of frequency components do not vary rapidly
in the time domain. For a given narrow signal (i.e., a sine wave that lasts less than the
number of samples in the FFT) in the time domain, the Gaussian windows provide the
tightest concentration of energy in the frequency domain. This means that the Gaus-
sian weighting function is most useful in converting transient signals to the frequency
domain. Figures 4-17 to 4-19 show Gaussian weighting functions for different values
of α.
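A minimal C sketch of one common Gaussian form, w(n) = exp(-0.5*[α(n - N/2)/(N/2)]²);
the exact α parameterization used in Figures 4-17 to 4-19 is an assumption here.

#include <math.h>

/* One common Gaussian weighting function; larger alpha narrows the window in
   time, which lowers the sidelobes but widens the main lobe in frequency. */
void gaussian_weights(double *w, int N, double alpha)
{
    double half = N / 2.0;
    for (int n = 0; n < N; n++) {
        double t = alpha * (n - half) / half;
        w[n] = exp(-0.5 * t * t);
    }
}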
4.2.13 Dolph-Chebyshev
For k = 0 to N - 1,
W(k) = (-1)^k cos{N cos^-1[β cos(kπ/N)]} / cosh[N cosh^-1(β)]
4.2.14 Finite Impulse Response (FIR) Filter Design Techniques
The frequency response requirements for an FIR filter are typically specified in terms of:
• Passband width
• Width of the transition between the passband and sidelobes
• Stopband maximum sidelobe level
• Ripple in the filter's gain across the passband
Algorithms have been developed to construct an FIR filter with a frequency response
with the least-mean-squared error relative to these desired frequency response requirements.
The problem with this optimization approach is that it produces filters with gain that peaks
up at the edges of the filter passbands. This is called the Gibbs effect. The Gibbs effect
is reduced by designing the filter with an optimization criterion that minimizes the maxi-
mum, rather than mean-squared, error. Chebyshev polynomial-based filter design uses this
approach. Filters that exhibit this property also have equiripple behavior in the sidelobes.
The most popular of these optimization algorithms was published by Parks and McClellan
and has been named for them [5].
4.4 CONCLUSIONS
Because of the third and fourth weaknesses of the DFT, weighting functions are applied before
data is processed with FFTs to lower high sidelobes and reduce frequency straddle loss. The
trade-off for those improvements to the DFT is reduced coherent gain and an increased
3-dB bandwidth for each FFT filter. Fortunately, a wide selection of
weighting functions allows users to choose one that offers the balance between benefits
and drawbacks needed in a specific application. Chapters 2-4 cover fundamentals of FFTs.
The next three chapters address what can be done well with them.
REFERENCES
[1] F. J. Harris, "On the Use of Windows for Harmonic Analysis with the Discrete Fourier
Transform," Proceedings of the IEEE, Vol. 66, No. 1 (1978).
[2] A. H. Nuttall, "Some Windows with Very Good Sidelobe Behavior," IEEE Transactions
on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, No. 1 (1981).
[3] A. N. Lowan, Table of Bessel Functions for Complex Arguments, Columbia University
Press, New York, pp. 362-381, 1943.
[4] T. W. Parks and C. S. Burrus, Digital Filter Design, Wiley, New York, 1987.
[5] T. W. Parks and J. H. McClellan, "Chebyshev Approximation for Nonrecursive Dig-
ital Filters with Linear Phase," IEEE Transactions on Circuit Theory, Vol. CT-20,
pp. 697-701 (1973).
5
Frequency Analysis
5.0 INTRODUCTION
Frequency analysis is the process of determining the amplitude and phase of the frequencies
that comprise a real or complex sequence of data samples in one or more dimensions. Based
on the Nyquist (also called Shannon) sampling theorem (Chapter 2), those frequencies span
from zero to half the sampling rate for real signals and from zero to the sampling rate
for complex signals. The span of frequencies detected by an FFT is called the frequency
spectrum of the data samples. If the output of the FFT is used to catalogue the frequencies
in a signal, it is performing the first of the common uses of the DFT listed in Section 2.1. If
the output is used as a shorthand way of describing the signal, because of its small number
of frequencies, the FFT is performing the second common use of the DFT. This chapter
presents the steps required for one-dimensional frequency analysis. Chapter 7 presents the
additional steps required for multidimensional frequency analysis.
Frequency analysis can be done with overlapped or nonoverlapped data sets. In either case
the computations can be performed with or without a weighting function. For each of the
four possible cases, five measures can be used to describe the performance of the FFT
algorithm.
When frequency analysis is performed on data sequences larger than the chosen trans-
form length, the sequence gets divided into smaller segments and transforms are computed
on each segment. If the FFT is being used to detect the presence of a frequency that is not
always present, the FFT length is chosen to match the expected duration of the frequency of
interest. If the frequency of interest is present and aligned with a segment of data samples,
the maximum improvement in signal-to-noise ratio is provided by the FFT because the
frequency is amplified by a factor of the transform length and the noise by the square root of
the transform length. The maximum signal-to-noise ratio provides the highest probability
of signal detection.
If the frequency appears in two segments, the signal-to-noise improvement is not
as great in either of the two segments, hence a lower probability of detection. The worst
case is when the frequency appears half the time in each of the two segments. Segments
are overlapped to increase the probability of detecting a frequency of interest. For ex-
ample, if the segments are overlapped 50%, the frequency of interest lines up with the
straddling segment when it is half in each of the two contiguous segments. When segments
are overlapped, some of the data points in the sequence are the input to more than one
transform. In the example, if the data segments overlap 50%, each data sample is used
twice, except for the first and last segments. The larger the overlap, the larger the number
of computations, the more complex the data addressing, and the larger the data memory
required.
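As a small illustration of this bookkeeping, the hypothetical C fragment below lists the
segment boundaries for an overlap of P samples and the resulting growth in computational
load relative to nonoverlapped processing; the function name and arguments are illustrative only.

#include <stdio.h>

/* Illustrative bookkeeping only: with an overlap of P samples, a new N-point
   FFT starts every N - P samples, so the computational load grows by the
   factor N / (N - P) relative to nonoverlapped processing (2x at 50% overlap). */
void overlap_schedule(int total_samples, int N, int P)
{
    for (int start = 0; start + N <= total_samples; start += N - P)
        printf("FFT over samples %d .. %d\n", start, start + N - 1);
    printf("computational growth factor: %.2f\n", (double)N / (N - P));
}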
Sidelobe level is the ratio of the amplitude response of a filter to a frequency in one of
its sidelobes to the response it would have if the frequency were in the center of the filter. A
filter has a sidelobe level for every frequency outside its main lobe. It is important to ensure
that the sidelobe response is attenuated far enough by the filter sidelobes that the filter only
gets a significant output when a frequency in its passband is present. These requirements
change radically from application to application.
Frequency straddle loss is the reduced output of a filter caused by the input signal not
being at the filter's center frequency but still in its main lobe. Frequencies to be detected
in an application seldom fall at the center of any of the filter passbands. When a frequency
is halfway between two filters, the response of the FFT has its lowest amplitude. For a
rectangular weighting function the frequency response halfway between two filters is 4 dB
lower than if the frequency were in the center of a filter. Each of the other weighting functions
in this chapter has less frequency straddle loss than the rectangular one. This performance
measure is important in applications where maximum filter response is needed to detect the
frequency of interest.
Frequency resolution is the measure of how close two frequencies can be before they
can no longer be distinguished by the FFT. Frequencies closer than the separation between
filter center frequencies are generally considered unresolvable. Weighting functions do not
change the separation between the centers of the FFT filters.
Coherent integration gain is the ratio of the amplitude of the filter output to the amplitude
of the input frequency. N-point FFTs have a coherent gain of N for frequencies at the
centers of the filter passbands. Since most of the weighting function coefficients are less than
1, the coherent gain of a weighted FFT is less than N. Like frequency straddle loss, this
performance measure is important in applications where maximum filter response is needed
to detect the frequency of interest.
5.2 COMPUTATIONAL TECHNIQUES
There are four basic ways that the N-point DFT, in any of its fast implementation forms
(FFTs), is used. The first two are associated with the spacing between the starting samples
in the computation of N-point FFTs on data sequences that are longer than N samples. The
third and fourth are modifications that can be made to the input data prior to using either of
the first two techniques. Each of these is described in this section.
5.2.1 Nonoverlapped
[Figure: the input sequence divided into contiguous, nonoverlapped N-sample segments.]
5.2.2 Overlapped
Chapter 2 discusses the weakness of using the DFT to analyze transient signals.
However, there are applications where the frequency content of the data sequence is known
to be constant, but only for a specific number of samples. If the goal of the application is to
detect when this signal is present in a long data sequence, then the best DFT approach is to
use an FFT that matches the expected number of signal samples at the frequency of interest.
However, choosing the correct transform length is not sufficient. If the N-point FFT
does not start when the transient sequence starts, then two effects occur. First, the coherent
gain will not be N because some of the samples integrated by the FFT are noise, not signal.
Second, the transient that is caused when the signal appears will distort the FFT's ability
to recognize the signal. When the N-point FFT matches up with the signal, all N samples
are integrated and the FFT does not see the transient of the signal turning on and off and
therefore performs the analysis without artifacts. An example is a Doppler radar where the
antenna beam is scanning at a constant rate to find a target. Since the antenna beam width
is fixed, the radar receives returns from the target for a fixed period of time as the beam
passes by. Until the target is detected, there is no way to know when this time period starts.
The theoretically best, but computationally most costly, solution is to start a new N-point
FFT every time a new sample arrives.
If the FFT is not overlapped, the worst-case situation is to have half of the returns
in one set of samples and half in the other. The loss of coherent gain associated with this
case is reduced by starting a new N-point FFT every N/2 samples. Figure 5-2 illustrates
this process with an overlap of P samples. With a 2:1 overlap each input data point is used
in two FFT computations. This increases the required computational load by a factor of 2.
For an overlap of P out of N samples, the increase in computational load is N/(N - P).
[Figure 5-2: N-sample segments overlapped by P samples.]
5.2.3 Weighting Functions
Weighting functions can be combined with either the overlapped
or nonoverlapped processing approaches. For a slowly varying signal the FFT provides the
sidelobe and straddle loss improvements described in Chapter 4.
However, for transient signals the weighting function only improves the performance
of the FFT if the FFT is aligned with the signal. In that case the FFT calculates as if the signal
is always present and processes it just like slowly varying signals. When input samples to
an FFT do not align with the time when the transient signal is present, the transient occurs
somewhere in the middle of the set of samples. Then the FFT thinks there is a transient at
that point and also one at the end of the data set. The effect of the transient at the end of
the data set is minimized by the weighting function, but the effects of the transient in the
middle of the data set are virtually unaffected because the transients are not attenuated (see
Chapter 4 for more details).
Figure 5-3 shows an example of a transient signal. The first and third sets of N
data samples match the transient signals exactly. In the first set there is a transient at the
beginning of the data set because the first sample is not zero. For this set of samples a
weighting function will reduce the sidelobe effects associated with this transient.
In the third set of samples the first and last samples are zero. Therefore, adding a
weighting function to the FFT computations provides no improvement because there are
no transient conditions to reduce at the ends of the data set. In fact, the weighting function
has a detrimental effect in this case because the coherent gain of the FFT is reduced by the
weighting function, and the main lobe of each FFT filter is widened.
The second set of samples has transient effects at both ends of the data set and straddles
the two transient signals. Therefore, a weighting function will reduce the transient effects
at the ends of the data set. However, the FFT will provide little useful data about either of
the transients because it straddles them.
5.3 CONCLUSIONS
This chapter covers one of the two functions where FFTs are primarily used. As can be
seen in the Doppler radar and speech processing design examples in Chapter 17, frequency
analysis and the use of FFTs to create a shorthand version of a signal have wide application
in aviation and consumer products. Frequency analysis and the functions explained in the
next chapter get used separately or together in almost every place an FFT is used. This
chapter contains no algorithms because frequency analysis is performed with the algorithms
in Chapters 8 and 9.
6
Linear Filtering
and Pattern Matching
6.0 INTRODUCTION
Linear filtering and pattern matching are techniques for determining the presence of specific
waveforms in a signal of one or more dimensions. Generally, linear filtering is used to pass
certain bands of frequencies and block others. Pattern matching is the process of finding
a pattern in a signal, whether it is a sine-wave frequency or an arbitrary sequence of data
samples that do not resemble any easily defined function.
While neither a linear filter nor a pattern matcher is the same as an FFT, FFT al-
gorithms are often able to speed up their computation. The purpose of this chapter is to
present algorithms for using an FFT to perform one-dimensional linear filtering and pattern
matching. It also shows how to determine when using an FFT requires fewer adds and mul-
tiplies than performing those functions in the time domain. The additional steps required
to perform multidimensional versions of this processing are in Chapter 7.
6.1 EQUATIONS
Linear filtering and pattern matching, also known as convolution and correlation, respec-
tively, are defined by Equations 6-1 and 6-2. For linear filtering applications, x(k - i) is
the input sequence to the filter and h (i) is the unit pulse response of the filter. For pattern
matching applications, x (k + i) is still the input signal and h (i) is the pattern to be found
in the signal. This chapter presents two FFT-based approaches for computing these two
equations because there are many instances when the FFT approach is more efficient than
computing the equations directly. Both approaches can be implemented with any of the
FFT algorithms in Chapters 8 and 9.
y(k) = Σ_{i=0}^{M-1} x(k - i) * h(i)          (6-1)

y(k) = Σ_{i=0}^{M-1} x(k + i) * h(i)          (6-2)
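For comparison with the frequency domain methods that follow, a direct time-domain C
sketch of Equations 6-1 and 6-2 for real data is shown below; how x is defined outside the
collected samples (for example, zero padding) is left to the caller and is an assumption of
this illustration.

/* Direct evaluation of Equations 6-1 and 6-2 for real data; the caller must
   guarantee that x is defined at every index k - i (or k + i) touched below. */
double linear_filter_output(const double *x, const double *h, int M, int k)
{
    double y = 0.0;
    for (int i = 0; i < M; i++)
        y += x[k - i] * h[i];          /* Equation 6-1: linear filtering */
    return y;
}

double pattern_match_output(const double *x, const double *h, int M, int k)
{
    double y = 0.0;
    for (int i = 0; i < M; i++)
        y += x[k + i] * h[i];          /* Equation 6-2: pattern matching */
    return y;
}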
Figure 6-1 shows the steps needed to implement Equations 6-1 and 6-2 in the frequency
domain. Derivations of this approach can be found in several DSP textbooks [1-4].
[Figure 6-1: x(i) and h(j) are each transformed with an FFT, the results are combined
(multiplied), and an IFFT produces y(k).]
These three performance measures provide a way to compare the one direct and two fre-
quency domain methods for computing Equations 6-1 and 6-2 (linear filtering and pattern
matching).
Computational latency is the time between the start of computations and when output
of results begins. Computational latency is considerably different for frequency domain
methods of computing Equations 6-1 and 6-2 than for the direct method. For the direct
method a new output is computed for each new input by performing M multiplies and M - 1
adds. This is a latency of one input sample. In the frequency domain methods, M new
pieces of data are collected and less than M new output values are produced because of the
required input data overlapping. Therefore, the latency is at least M data samples.
For a finite input sequence of length L, Equation 6-1 does not require M complex
multiplies and M - 1 adds for each value of k. In particular:
• For k = 0, the only term in the summation is i = 0, so there is one multiply and
no adds.
• For k = 1, there are two terms to compute and add in Equation 6-1. Namely, a
multiply is required for i = 0 and for i = 1, and an add is required to combine these two
multiplications.
• Each time k increases by 1, the number of adds (k adds) and multiplies (k + 1
multiplies) increases by 1 until k = M - 1.
If the input data is real and the unit pulse response remains real, the basic logic for
determining the number of computations remains unchanged. The only difference is that the
half-complex adds and multiplies are replaced with real adds and real multiplies. Adding
all of these computational requirements, Equation 6-1 requires L * (M - 1) real adds and
(L + 1) * M real multiplies to compute all N = L + M - 1 outputs y(k) if the input data
is complex and the filter is real.
2 * {L * (M - 1) + (L + 1) * M} > 2 * NF + 6 * N
6.4.2 Real Input Signal
If the input signal is real, then all of the FFT computations are reduced by using
the double-length algorithm from Section 2.4. If N/2 is odd, this reduces the input FFT
computations to
# Comp. = N_F + 5 * N - 7
Likewise, if N/2 is even, Chapter 2 shows the total input FFT computations are:
# Comp. = N_F + 5 * N - 9
Then the outputs of the input FFT are multiplied by complex numbers to provide the filter
shaping. Since the FFT input and the unit pulse response are real, the FFT outputs of
both are symmetric around the center filter. This means the only complex multiplies to be
performed are those below the center filter.
Case 1: Real Input Signal with N/2 an Even Number
If N/2 is even, this is N/2 complex multiplies, which is 2 * N real multiplies and N
real adds. If N/2 is odd, the total number of filters to be multiplied is the (N - 1)/2 below
the center filter and the center filter. This is (N - 1)/2 complex multiplies plus one real
multiply for the center filter (see the symmetry properties of DFTs in Chapter 2). This is a
total of 2 * N - 1 real multiplies and N - 1 real adds.
The output of the complex multiplication step is then fed into an N-point IFFT
that requires 2 * N_F computations. Therefore, the equation to determine when the total
computations for N/2 even is less in the frequency domain for real input signals is
3 * N_F + 8 * N - 9 < L * (M - 1) + (L + 1) * M
Case 2: Real Input Signals with N/2 an Odd Integer
For N/2 odd,
3 * N_F + 8 * N - 5 < L * (M - 1) + (L + 1) * M
If the length of the input sequence L is too long to practically compute as a single transform
length, a means must be found to segment the input sequence into manageable lengths and
perform the functions in Figure 6-1 several times. Once these several sets of operations are
performed, the results must be recombined to form the complete output sequence. There
are two algorithms for performing the frequency domain method on long sequences of input
data. These algorithms are described, and the total number of computations determined
and compared with the time domain approach for real and complex input sequences.
6.6.1 Introduction
For complex input signals, the specific overlap-and-add algorithm stages are as
follows.
Compute the N-point FFT of the M members of the sequence for h(i), after N - M
zeros are appended to the end, and label the results H(k).
H(k) = Σ_{i=0}^{N-1} h(i) * W_N^(ik)
This computation only happens once, and the results are stored in memory for use in
multiplying all of the transformed data sets as shown in Figure 6-1.
Stage 3: Set t =0
Stage 4: Load and Augment the Next Set of Input Data Points for Processing
Collect L data points, x[i + t * L], and store them in the input data memory along with
N - L zeros to occupy the last N - L samples in the sequence of N data points, x_t(i).
x_t(i) = x[i + t * L]   for i = 0, 1, 2, ..., (L - 1)
x_t(i) = 0              for i = L, L + 1, ..., (N - 1)
Stage 5: Transform the Next Set of Data Points to the Frequency Domain
Compute the N-point FFT of x_t(i), using one of the appropriate algorithms from
Chapters 8 and 9.
X_t(k) = Σ_{i=0}^{N-1} x_t(i) * W_N^(ik)
This stage requires N_F arithmetic computations. However, the first stage in all of the
algorithms in Chapters 8 and 9 is the sums and differences of the input samples. Therefore,
2 * (N - L) of the input complex adds can be removed from the FFT algorithm because
N - L of the input data points are known to be zero. Therefore, the first time these samples
need to be added to other samples the addition can be omitted. This reduces the total to
N_F - 4 * (N - L) computations.
This stage requires N_F arithmetic computations because the IFFT takes the same number
of computations as the FFT.
# Comp. = 2 * N_F + 4 * N + 2 * L
Since these computations are performed every time L new data samples are used, the number
of computations per complex input data sample is
# Comp. = {2 * N_F + 4 * N + 2 * L} / L
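The following hedged C sketch strings these stages together for complex input. The naive
O(N^2) dft() routine stands in for whichever FFT/IFFT building block from Chapters 8 and
9 would actually be used, and the fixed array limit is an arbitrary assumption of this
illustration.

#include <complex.h>
#include <math.h>

#define MAXN 256   /* arbitrary sketch limit: transform length n must be <= MAXN */

/* Naive O(n^2) DFT/IDFT, used here only as a stand-in for a real FFT/IFFT. */
static void dft(const double complex *in, double complex *out, int n, int inverse)
{
    const double pi = acos(-1.0);
    for (int k = 0; k < n; k++) {
        double complex acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += in[i] * cexp((inverse ? I : -I) * 2.0 * pi * k * i / n);
        out[k] = inverse ? acc / n : acc;
    }
}

/* Overlap-and-add filtering of a complex sequence x (length lx) with a unit
   pulse response h (length m), taking l new samples per block and using the
   transform length n = l + m - 1.  y must hold lx + m - 1 outputs. */
void overlap_add(const double complex *x, int lx,
                 const double complex *h, int m,
                 int l, double complex *y)
{
    int n = l + m - 1;
    double complex hp[MAXN], H[MAXN], seg[MAXN], S[MAXN], out[MAXN];

    for (int i = 0; i < lx + m - 1; i++) y[i] = 0.0;

    /* Transform the zero-padded unit pulse response once and keep H(k). */
    for (int i = 0; i < n; i++) hp[i] = (i < m) ? h[i] : 0.0;
    dft(hp, H, n, 0);

    for (int t = 0; t * l < lx; t++) {
        int avail = lx - t * l;                     /* new samples in this block     */
        if (avail > l) avail = l;
        for (int i = 0; i < n; i++)                 /* load l points, pad with zeros */
            seg[i] = (i < avail) ? x[t * l + i] : 0.0;
        dft(seg, S, n, 0);                          /* forward transform             */
        for (int k = 0; k < n; k++) S[k] *= H[k];   /* filter shaping                */
        dft(S, out, n, 1);                          /* inverse transform             */
        for (int i = 0; i < avail + m - 1; i++)     /* overlap and add               */
            y[t * l + i] += out[i];
    }
}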
6.6.3 Real Input Signals
If the input signal to the overlap-and-add algorithm is real, then all of the FFT compu-
tations are reduced by using the double-length algorithm from Chapter 2. The exact answer
depends on whether N/2 is odd or even. If N/2 is odd, the input FFT computations per
data point are
# Comp. = {N_F2 + 5 * N - 7} / L
where N_F2 is the number of computations for the N/2-point FFT algorithm chosen from
Chapters 8 and 9. If N/2 is even, the input FFT computations per data point are
# Comp. = {N_F2 + 5 * N - 9} / L
Then the outputs of the input FFT are multiplied by complex numbers to provide the filter
shaping. Since the FFT input and the unit pulse response are real, the FFT outputs of
both are symmetric around the center filter. This means the only complex multiplies to be
performed are those below the center filter. If N/2 is even, this is N/2 complex multiplies,
which is 2 * N real multiplies and N real adds. If N/2 is odd, the total number of filters to
be multiplied is the (N - 1)/2 below the center filter and the center filter. This is (N - 1)/2
complex multiplies plus one real multiply for the center filter (see the symmetry properties
of DFTs in Chapter 2). This is a total of 2 * N - 1 real multiplies and N - 1 real adds. The
output of the complex multiplication stage is then fed into an N-point IFFT that requires
N_F2 computations.
The total number of computations per data point is:
# Comp. = 2 * N_F2 + 13 * N - 18
6.7.1 Introduction
The overlap-and-save algorithm overlaps the data sequences into the FFT rather than
artificially creating the overlap by adding zeros (Figure 6-3). The process starts by taking
the first N samples in the sequence x_t(i) and computing its FFT. These results are multiplied
by the N-point FFT of h(j), and the result is transformed back to the time domain by an
IFFT. The result is only accurate starting at the first sample in the sequence until the unit
pulse response h(j) of M samples no longer completely overlaps the data sequence x_t(i).
Therefore, each set of computations generates (N - M + 1) new valid outputs. To cover the
last M - 1 outputs, the next input sequence overlaps the previous one by M - 1 samples.
If this process is continued, the correct outputs are always obtained for y(k).
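A small illustrative C fragment of the segment bookkeeping just described (the function
and its arguments are hypothetical): segment t starts at t * (N - M + 1), and the last M - 1
outputs of each inverse transform are discarded.

#include <stdio.h>

/* Illustrative overlap-and-save schedule: each segment reuses the last M - 1
   inputs of the previous one and yields N - M + 1 valid outputs. */
void overlap_save_schedule(int total_samples, int N, int M)
{
    int step = N - M + 1;
    for (int t = 0; t * step + N <= total_samples; t++) {
        int start = t * step;
        printf("segment %d: inputs %d..%d, valid outputs y(%d)..y(%d)\n",
               t, start, start + N - 1, start, start + N - M);
    }
}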
[Figure 6-3: successive N-sample input segments, each overlapping the previous segment
by M - 1 samples.]
H(k) = Σ_{i=0}^{M-1} h(i) * W_N^(ik)
This computation only happens once, and the results are stored in memory for use in
multiplying all of the transformed data sets.
Stage 3: Set t = 0
Stage 4: Load and Augment the Next Set of Input Data Points for Processing
Collect N data points, x[i + t * (N - M + 1)], and store them in the input data memory,
x_t(i). Note that this means this algorithm will use M - 1 of every N input data points twice.
This makes the input data addressing nonsequential.
Stage 5: Transform the Next Set of Data Points to the Frequency Domain
Compute the N-point FFT of x_t(i), using one of the appropriate algorithms from
Chapters 8 and 9.
X_t(k) = Σ_{i=0}^{N-1} x_t(i) * W_N^(ik)
This stage requires N_F arithmetic computations, where N_F is computed based on the algo-
rithm chosen from Chapters 8 and 9.
A point to note is that the performance measures for both frequency domain methods are the
same. Therefore, this matrix is only useful in determining if Equations 6-1 and 6-2 should
be implemented directly in the time domain or in the frequency domain.
[Comparison matrix columns: algorithm, # of computations per data point, # of data
memory locations, and computational latency.]
6.9 CONCLUSIONS
While linear filtering and pattern matching can be done in the time domain, and often are,
frequency domain implementation using FFTs often requires fewer adds and multiplies.
The algorithms in this chapter, in combination with the FFT algorithms in Chapters 8 and
9, provide all the steps necessary to implement linear filtering and pattern matching in the
frequency domain.
The next chapter describes how to perform these functions and those from Chapter 5
in more than one dimension by simply converting the multidimensional processing to a
sequence of one-dimensional processes.
REFERENCES
[1] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing,
Prentice-Hall, Englewood Cliffs, NJ, 1975.
[2] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice-Hall, Engle-
wood Cliffs, NJ, 1975.
[3] E. Oran Brigham, The Fast Fourier Transform, Prentice-Hall, Englewood Cliffs, NJ,
1974.
[4] E. Oran Brigham, The Fast Fourier Transform and Its Applications, Prentice-Hall,
Englewood Cliffs, NJ, 1988.
7
Multidimensional Processing
7.0 INTRODUCTION
To this point the book has only addressed the use of the DFT and its fast versions (FFTs)
to convert one-dimensional signals to their frequency components. Signals such as music,
speech, radar, and sonar are waveforms that change as a function of one variable, time.
They are usually analyzed with one-dimensional FFTs. However, some signals have more
than one dimension or can be turned into waveforms with more than one dimension. The
most obvious example is an image, a two-dimensional waveform, which is analyzed with
two-dimensional FFTs. Video is described in three-dimensional terms, some number of
two-dimensional pictures per second, with time as the third dimension.
The most important fact about multidimensional DFTs is that they can be decomposed
into a sequence of one-dimensional DFTs. The results of this fact are twofold:
These three separable function properties significantly reduce the number of computations
required for multidimensional DFTs. This, combined with FFT algorithms that provide fast
computation of one-dimensional DFTs, has led to uses of two- and three-dimensional FFTs
for applications such as image formation (synthetic aperture radar and magnetic resonance
imaging) and image analysis (deblurring).
Once the exponential is factored, it can be separated between the two summation signs to
produce
A(k1, k2) = Σ_{n1=0}^{N1-1} e^(-j2π n1 k1 / N1) Σ_{n2=0}^{N2-1} a(n1, n2) * e^(-j2π n2 k2 / N2)          (7-3)
The inner summation is the N2-point one-dimensional DFT of a(n1, n2). Since a(n1, n2)
is different for each value of n1, this DFT must be computed for each n1 = 0, 1, 2, ...,
(N1 - 1). Those results become the terms used to compute the second set of one-dimensional
DFTs described by the outer summation to the right of the equals sign in Equation 7-3. To
summarize, if this two-dimensional image described by a(n1, n2) is to be transformed,
then:
1. For each row: n1 = 0, 1, 2, ..., (N1 - 1), compute its N2-point DFT and place
the results back in the same row.
2. For each column of the results from 1): n2 = 0, 1, 2, ..., (N2 - 1), in this interim
two-dimensional set of numbers, compute its N1-point DFT and place the results
back in the same column.
Each of these N1 + N2 one-dimensional DFTs can be computed using any of the FFT
algorithms in Chapters 8 and 9 to improve the computation time. If the input data is
complex, the complex version of the algorithms is most efficient. If the input is real, then
the overlap-and-add or overlap-and-save approaches from Chapter 6 can also be applied to
the chosen FFT algorithm to further reduce the computational load.
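A minimal C sketch of the row-column decomposition in steps 1 and 2 above, using a naive
one-dimensional DFT in place of the FFT building blocks from Chapters 8 and 9; the
row-major array layout and the dft1d helper are assumptions of this illustration.

#include <complex.h>
#include <math.h>

/* Naive one-dimensional DFT of n stride-spaced values, transformed in place;
   it stands in for an FFT building block from Chapters 8 and 9 (n <= 1024). */
static void dft1d(double complex *data, int n, int stride)
{
    const double pi = acos(-1.0);
    double complex tmp[1024];
    for (int k = 0; k < n; k++) {
        tmp[k] = 0.0;
        for (int i = 0; i < n; i++)
            tmp[k] += data[i * stride] * cexp(-I * 2.0 * pi * k * i / n);
    }
    for (int k = 0; k < n; k++) data[k * stride] = tmp[k];
}

/* Two-dimensional DFT of an N1 x N2 row-major array a, computed as N1 row
   transforms followed by N2 column transforms. */
void dft2d(double complex *a, int N1, int N2)
{
    for (int n1 = 0; n1 < N1; n1++)        /* step 1: each row    */
        dft1d(a + n1 * N2, N2, 1);
    for (int n2 = 0; n2 < N2; n2++)        /* step 2: each column */
        dft1d(a + n2, N1, N2);
}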
For a general unit pulse response this equation requires an enormous number of computa-
tions. Suppose the image has P rows and Q columns of pixels, and the two-dimensional
unit pulse response has N1 rows and N2 columns. Generally, N1 and N2 are much smaller
than P and Q.
The inner summation is a one-dimensional linear filter that is computed for each value
of j = 0, 1, 2, ..., (N1 - 1) in each row k1 = 0, 1, 2, ..., (P - 1). Since each one-
dimensional linear filter requires N2 multiplies and (N2 - 1) adds, the inner summation
requires N1 * P * [2 * N2 - 1] arithmetic computations and produces the signal used by the
outer summation, which is now also only a one-dimensional linear filter. Similarly, the outer
summation requires N2 * Q * [2 * N1 - 1] arithmetic computations. The total computations
for Equation 7-6 are then reduced to N1 * P * [2 * N2 - 1] + N2 * Q * [2 * N1 - 1]. This
total can be roughly approximated as 2 * N1 * N2 * (P + Q). The ratio of the number of
computations required for the separable one-dimensional approach to the number required
for the two-dimensional approach is roughly
(P+ Q)/(P * Q) (7-7)
For a 512 x 512 image this ratio is (512 + 512)/(512 * 512) = 1/256, which is why this
approach to the unit pulse response is commonly found in image processing. Note that
Equation 7-7 is not dependent on the size of the unit pulse response. There actually is a
weak dependence that has been lost in the equation because of the approximations made on
the number of computations near the edge of the image.
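A minimal C sketch of the separable row-then-column filtering just described; the index
convention of Equation 7-6 is not reproduced exactly, and the zero-padded edge handling
is an assumption of this illustration.

/* Separable two-dimensional filtering of a P x Q row-major image: filter each
   row with h2 (length N2), then each column of that result with h1 (length N1).
   Taps that fall outside the image are skipped (zero-padded edges). */
void separable_filter(const double *img, int P, int Q,
                      const double *h1, int N1,
                      const double *h2, int N2,
                      double *tmp, double *out)        /* tmp, out: P x Q */
{
    for (int r = 0; r < P; r++)                        /* inner (row) pass    */
        for (int c = 0; c < Q; c++) {
            double s = 0.0;
            for (int i = 0; i < N2; i++)
                if (c - i >= 0) s += img[r * Q + (c - i)] * h2[i];
            tmp[r * Q + c] = s;
        }
    for (int c = 0; c < Q; c++)                        /* outer (column) pass */
        for (int r = 0; r < P; r++) {
            double s = 0.0;
            for (int j = 0; j < N1; j++)
                if (r - j >= 0) s += tmp[(r - j) * Q + c] * h1[j];
            out[r * Q + c] = s;
        }
}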
# Comp. = P * {2 * N_M2 + 13 * M2 - 16}
for real input sequences x(j, i) and M2/2 odd. If M2/2 is even, this portion of the algorithm
requires
# Comp. = P * {2 * N_M2 + 13 * M2 - 18}
In both cases, N_M2 = number of computations in the M2/2-point FFT.
Stage 3: Choose Outer Filter Transform Length
Choose a transform length M1 for the outer summation in Equation 7-6 based on the
criteria in Chapter 6. Using a larger number than M1 = N1 + P - 1 requires adding zeros
(zero padding), which is equivalent to adding a border of zeros at the ends of the columns
of the image.
# Comp. = Q * {2 * N_M1 + 13 * M1 - 16}
for real input sequences x(j, i) and M1/2 odd. If M1/2 is even, this portion of the algorithm
requires
# Comp. = Q * {2 * N_M1 + 13 * M1 - 18}
In both cases, N_M1 = number of computations in the M1/2-point FFT. The total number of
computations using the frequency domain approach is
Just as frequency analysis can be extended into more than two dimensions, the linear
filtering equation can also be written in more than two dimensions. Again, the most common
technique for reducing the computational load from multidimensional linear filtering is to
restrict the unit pulse response to one that can be factored into functions of the individual
dimensions, and then use frequency domain filtering on the resulting one-dimensional linear
filters.
y(k1, k2) = Σ_{j=0}^{N1-1} Σ_{i=0}^{N2-1} x(k1 + j, k2 + i) * h(j) * h(i)          (7-9)
The inner summation is a one-dimensional pattern matcher that is computed for each value
of j = 0, 1, 2, ..., (N1 - 1) in each row k1 = 0, 1, 2, ..., (P - 1). Since each one-
dimensional pattern matcher requires N2 multiplies and (N2 - 1) adds, the inner summa-
tion requires N1 * P * [2 * N2 - 1] arithmetic computations and produces the signal used
by the outer summation, which is now also only a one-dimensional pattern matcher. Simi-
larly, the outer summation requires N2 * Q * [2 * N1 - 1] arithmetic computations. The total
computations for Equation 7-9 are then reduced to N1 * P * [2 * N2 - 1] + N2 * Q * [2 * N1 - 1].
This total can be roughly approximated as 2 * N1 * N2 * (P + Q). The ratio of the number of
computations required for the separable one-dimensional approach to the number required
for the two-dimensional approach is roughly
(P + Q)/(P * Q)          (7-10)
For a 512 x 512 image, this ratio is (512 + 512)/(512 * 512) = 1/256, which is why this
approach to the unit pulse response is commonly found in image processing. Note that
Equation 7-10 is not dependent on the size of the unit pulse response. There actually is a
weak dependence that has been lost in the equation because of the approximations made on
the number of computations near the edge of the image.
# Comp. = P * {2 * N_M2 + 13 * M2 - 16}
for real input sequences x(j, i) and M2/2 odd. If M2/2 is even, this portion of the algorithm
requires
# Comp. = P * {2 * N_M2 + 13 * M2 - 18}
Stage 3: Choose Outer Pattern Matcher Transform Length
Choose a transform length M1 for the outer summation in Equation 7-9 based on the
criteria in Chapter 6. Using a number larger than M1 = N1 + P - 1 requires adding zeros
(zero padding), which is equivalent to adding a border of zeros at the ends of the columns
of the image.
# Comp. = Q * {2 * N_M1 + 13 * M1 - 16}
for real input sequences x(j, i) and M1/2 odd. If M1/2 is even, this portion of the algorithm
requires
# Comp. = Q * {2 * N_M1 + 13 * M1 - 18}
The total number of computations with the frequency domain approach is roughly
7.4 CONCLUSIONS
Having learned in this chapter how to break down multidimensional processing to more
easily performed sequences of one-dimensional processing, we conclude the foundation
portion of the book. Design Example 4 in Chapter 17, an image deblurrer, demonstrates
two-dimensional processing. Now that what FFTs are and what they can do have been
covered, the next two chapters show how to construct an FFT of any length.
8
Building-Block Algorithms
8.0 INTRODUCTION
In this chapter the 2-, 3-, 4-, 5-, 7-, 8-, 9-, and 16-point FFT algorithms are presented
because they are the most efficient and widely used FFT algorithm building blocks. The
general-purpose FFT algorithms (Rader and Singleton) are included to provide the addi-
tional building blocks necessary to compute any transform length. This is because not all
numbers have only 2, 3, 4, 5, 7, 8, 9, or 16 as factors, for example, 119 = 7 * 17. More than
one algorithm for computing a particular building block, except for 2 and 4, is given because
each has different features that make it better suited to some applications than others. A
unique feature of the book is the format in which they are all presented, with input adds,
multiplies, and then output adds, so that all can be used with the Winograd algorithm in
Chapter 9.
All of the building-block algorithms are FFTs, sometimes called small-point trans-
forms. Since they are FFTs, they have all of the same properties, strengths, and weaknesses
of the DFT described in Chapter 2.
The most common way to evaluate FFT algorithms is in terms of the number of computations
and amount of memory required to compute them. The performance measures in this section
quantify those computations and memory needs. The same four measures are used again
in Chapter 9.
Steps a, e, and f are additions (in one case a subtraction, which is generally implemented
as an addition of a negative number), and steps b-d are real multiplications.
Each algorithm begins and ends by using exactly 2 * N data memory locations to
store the input data and output results, respectively. However, if no temporary registers are
available for intermediate results, most of the algorithms in this chapter require additional
data memory locations during the computations. In this chapter, Algorithm Steps and a
Memory Map are given for each algorithm, and total data memory location requirements
are listed in the Comparison Matrix, assuming the processor has no temporary registers.
The difference between those numbers and 2 * N is the number of temporary registers
needed to avoid using extra data memory locations for intermediate results.
The following are the constraints the authors have used for the small-point transforms in
this chapter:
1. The real and imaginary parts of the i-th input sample are aR(i) and aI(i). AR(i)
and AI(i) are the real and imaginary parts of the i-th output frequency component.
2. All of the algorithms have been segmented to have all of the multiplications in
the center so that they can be used by any of the FFT algorithms in Chapter 9 to
form longer transform lengths. Chapter 9 explains the reasons for this constraint.
3. Intermediate results are labeled with sequential lowercase letters of the alphabet
to indicate where they are located relative to other computational outputs. For ex-
ample, the first set of intermediate computational results in each of the algorithm
building blocks is labeled bR(i) and bI(i).
4. The sum and difference computations are performed by taking two pieces of
data from data memory, performing the required computations, and returning the
results to available data memory locations.
5. The multiply-accumulates are performed by sequentially pulling a data value
from data memory, performing the multiplication, and adding the results to the
processor's accumulator (Section 14.2.11). When the multiply-accumulate func-
tion is complete, the result is stored in a memory location, overwriting data that
is no longer needed.
6. The sequence of computations shown for the first stage in each algorithm has
been left the same as in its referenced article. The data labels have been changed
to make them consistent for all the algorithms in the book.
7. The memory location (Memory Map) for intermediate results or output frequency
components is shown next to each Algorithm Step.
8. For an N-point algorithm building block, the real input data, aR(i), is located in
data memory locations M(i), and the imaginary input data, aI(i), is located in
data memory locations M(N + i), where i = 0, 1, 2, ..., (N - 1).
9. All of the multiplier constants are presented in their sine and cosine forms so that
they may be computed in the arithmetic format (see Chapter 13) appropriate for
the application.
10. All of the intermediate results and output frequency components are stored di-
rectly in data memory, rather than temporary storage locations, to ensure that the
algorithm will work on all processors.
Since each set of results can be placed in the same data memory locations that the inputs
were taken from, this algorithm requires only four data memory locations. The flowchart
for the 2-point FFT is shown in Figure 8-1. Two inputs and two outputs are used to indicate
that the same computational building block is used twice to compute the real and imaginary
portions of the 2-point FFT output.
Note that Figure 8-1 looks similar to the 2-point decimation-in-time (DIT) and
decimation-in-frequency (DIF) figures in Section 10.4. The difference is the multiplier
in the DIT and DIF flowcharts. When the 2-point transform is used in a larger power-of-
two algorithm, it requires data reorganization as well as the complex multiplier to prepare
the data for each succeeding stage of the algorithm. However, in the prime factor algorithm
(Section 9.6), only data reorganization is required. Therefore, the universal building block
is the 2-point FFT in Figure 8-1. Chapter 9 deals with how these algorithm building blocks
are combined in different ways to form larger transform lengths, including power-of-two
and prime factor algorithms.
If the 3-point DFT is calculated directly from Equation 8-4, it requires four complex mul-
tiplies and six complex adds. Since a complex multiply uses 4 real multiplies and 2 real
adds, and a complex add uses 2 real adds, the 3-point DFT requires 16 real multiplies and
20 real adds. The number of adds and multiplies for the two fast algorithms is significantly
less than required for computing the DFT directly. However, if only a subset of the out-
put frequency components is required, it may be more cost effective to compute the DFT
equation directly for those terms. For example, if A(0) is the only term needed, it can be
computed with four adds and no multiplies by using the DFT directly. Each of the other
two output frequencies requires two complex multiplies and two complex adds for a total
of eight real adds and eight real multiplies. With this in mind the crossover point between
using the DFT directly and one of the 3-point FFT algorithms can be determined based on
the number of output frequency components that must be computed.
Since all of the input data is required for each of the output frequency component
calculations, the direct DFT computations require six data memory locations for the input
data and six more for the output frequency components. This is a total of 12 data memory
locations, since the input and output are complex. Similarly, the DFT data addressing is
sequential (i.e., 0 through 2 for each output frequency component), and the computational
architecture is simple since they can all be performed by using a complex multiply ac-
cumulator (see Chapter 10 for details). Addressing the complex multiplier coefficients is
sequential in two orders (1 and 2 or 2 and 1) or requires that the addresses be stored in
program memory.
There are two common 3-point FFT algorithms. Both require 12 adds, 4 multiplies,
and 2 memory locations for multiplier constants. The Winograd [1] algorithm is based
on circular convolution properties and requires six data memory locations. The Singleton
[2] algorithm is based on complex conjugate symmetry properties of the 3-point DFT and
requires seven data memory locations.
The strategy for converting these equations into code is to start at the top (com-
pute bR(1)) and identify the pair of inputs to be used first (in this case aR(1) and aR(2)).
Then look down the list to find the second (compute bR(2)) place where these two in-
puts are used. Pull aR(1) and aR(2) from memory, compute bR(1) and bR(2), and store
the results in data memory locations M(1) and M(2) previously occupied by aR(1) and
aR(2).
Next, look for the computation for bI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses. Note that the algorithm steps for AR(0) and AI(0)
only relabel the data values to their output labels once they have been used as required by
other portions of the algorithm.
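As an illustration of this style of building block, the following is a hedged C sketch of a
3-point FFT organized as input adds, multiplications, and output adds (12 adds and 4
multiplies); the variable names are illustrative and do not reproduce the book's Memory Map.

#include <math.h>

/* 3-point FFT building block in the input-adds / multiplies / output-adds
   order: 12 real adds and 4 real multiplies for complex input a(0)..a(2). */
void fft3(const double aR[3], const double aI[3], double AR[3], double AI[3])
{
    const double pi = acos(-1.0);
    const double c = cos(2.0 * pi / 3.0) - 1.0;        /* multiplier constants  */
    const double s = sin(2.0 * pi / 3.0);

    double bR1 = aR[1] + aR[2], bI1 = aI[1] + aI[2];   /* input adds            */
    double bR2 = aR[1] - aR[2], bI2 = aI[1] - aI[2];

    AR[0] = aR[0] + bR1;                               /* A(0) = a(0)+a(1)+a(2) */
    AI[0] = aI[0] + bI1;

    double mR1 = c * bR1, mI1 = c * bI1;               /* multiplies            */
    double mR2 = s * bI2, mI2 = -s * bR2;              /* -j*sin(2pi/3)*(a(1)-a(2)) */

    double tR = AR[0] + mR1, tI = AI[0] + mI1;         /* output adds           */
    AR[1] = tR + mR2;  AI[1] = tI + mI2;
    AR[2] = tR - mR2;  AI[2] = tI - mI2;
}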
This set of equations is shown pictorially with the flow graph in Figure 8-2.
[Figure 8-2: flow graphs of the 3-point FFT building blocks, with inputs a(0)-a(2),
outputs A(0)-A(2), and multiplier constants cos(2π/3) - 1, cos(2π/3), and j sin(2π/3).]
If the 4-point DFT is computed directly from Equation 8-5, it requires no complex multiplies
and 12 complex adds for a total of 24 real adds. The circular convolution, complex conjugate
symmetry, and 90° and 180° symmetry approaches to a 4-point FFT all result in the same
set of Algorithm Steps. The algorithm requires 16 adds, no multiplications, 8 data memory
locations, and no memory locations for multiplier constants.
Since all of the input data is required for each output frequency component calculation,
the direct DFT computations require eight data memory locations for the input data and
eight more for the output frequency components. This is a total of 16 data memory locations,
since the input and output are complex. Similarly, the DFT data addressing is sequential
(i.e., 0 through 3 for each output frequency component), and the computational architecture
is simple, since they can all be performed with additions.
The strategy for converting these equations into code is to start at the top (compute
bR(0)) and identify the pair of inputs to be used first (in this case aR(0) and aR(2)). Then
look down the list to find the second (compute bR(1)) place where these two inputs are
used. Pull aR(0) and aR(2) from memory, compute bR(0) and bR(1), and store the results
in data memory locations M(0) and M(2) previously occupied by aR(0) and aR(2).
Next, look for the computation for bI(0) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses.
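A hedged C sketch of the 4-point building block (16 adds and no multiplies); again, the
variable names are illustrative rather than the book's Memory Map.

/* 4-point FFT building block: 16 real adds and no multiplies, because every
   multiplier constant is +1, -1, +j, or -j. */
void fft4(const double aR[4], const double aI[4], double AR[4], double AI[4])
{
    double bR0 = aR[0] + aR[2], bI0 = aI[0] + aI[2];   /* input adds  */
    double bR1 = aR[0] - aR[2], bI1 = aI[0] - aI[2];
    double bR2 = aR[1] + aR[3], bI2 = aI[1] + aI[3];
    double bR3 = aR[1] - aR[3], bI3 = aI[1] - aI[3];

    AR[0] = bR0 + bR2;  AI[0] = bI0 + bI2;             /* output adds */
    AR[2] = bR0 - bR2;  AI[2] = bI0 - bI2;
    AR[1] = bR1 + bI3;  AI[1] = bI1 - bR3;             /* A(1) = b1 - j*b3 */
    AR[3] = bR1 - bI3;  AI[3] = bI1 + bR3;             /* A(3) = b1 + j*b3 */
}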
16 real adds and 16 real multiplies. With this in mind the crossover point between using
the DFT directly and one of the 5-point FFT algorithms can be determined based on the
number of output frequency components that must be computed.
Since all of the input data is required for each output frequency component calcu-
lation, the direct DFT computations require 10 data memory locations for the input data
and 10 more for the output frequency components. This is a total of 20 data memory
locations, since the input and output are complex. Similarly, the DFT data addressing is
sequential (i.e., 0 through 4 for each output frequency component), and the computational
architecture is simple, since they can all be performed with a complex multiply accumu-
lator (see Chapter 10 for details). Addressing the complex multiplier coefficients requires
either a modulo arithmetic scheme (k * n mod 5) or that the addresses be stored in program
memory.
Each of the three fast algorithms is presented, characterized, and summarized in
the Comparison Matrix in Table 8-1. For example, the Rader algorithm has the simplest
computational structure but requires the largest number of adds. The Singleton algorithm
has the simplest memory mapping for the multiplier constants but requires more constants
than the Winograd algorithm.
Stage 2: Multiplications
This stage contains all of the multiplications and requires additional data memory
locations to store intermediate results. In all steps the multiplication is performed by
pulling a data value from memory, multiplying it by the appropriate constant, and returning
the result to the same data memory location. All these computations are performed in-
place.
Stage 2: Multiply-Accumulates
This stage contains all of the multiplications and requires additional data memory
locations to perform the sets of multiply-accumulate operations and store the intermediate
results. The strategy for converting these steps into code is explained in Constraint 5 of
Section 8.2.
Algorithm Steps                                              Memory Map
cR(2) = bR(2) * sin(2π/5) + bR(4) * sin(4π/5)                cR(2) => M(10)
cI(2) = bI(2) * sin(2π/5) + bI(4) * sin(4π/5)                cI(2) => M(3)
cR(4) = bR(2) * sin(4π/5) - bR(4) * sin(2π/5)                cR(4) => M(11)
cI(4) = bI(2) * sin(4π/5) - bI(4) * sin(2π/5)                cI(4) => M(4)
cR(1) = bR(1) * cos(2π/5) + bR(3) * cos(4π/5) + aR(0)        cR(1) => M(9)
cI(1) = bI(1) * cos(2π/5) + bI(3) * cos(4π/5) + aI(0)        cI(1) => M(1)
cR(3) = bR(1) * cos(4π/5) + bR(3) * cos(2π/5) + aR(0)        cR(3) => M(8)
cI(3) = bI(1) * cos(4π/5) + bI(3) * cos(2π/5) + aI(0)        cI(3) => M(2)
AR(0) = aR(0) + bR(1) + bR(3)                                AR(0) => M(0)
AI(0) = aI(0) + bI(1) + bI(3)                                AI(0) => M(5)
look down the list to find the second (compute AR(4)) place where these two inputs are
used. Pull cR(1) and cI(2) from memory, compute AR(1) and AR(4), and store the results
in data memory locations M(9) and M(3) previously occupied by cR(1) and cI(2).
Next, look for the computation for AI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses.
look down the list to find the second (compute AR(4)) place where these two inputs are
used. Pull fR(1) and eR(3) from memory, compute AR(1) and AR(4), and store the results
in data memory locations M(3) and M(1) previously occupied by fR(1) and eR(3).
Next, look for the computation for AI(1) and repeat the same set of steps. Continue
this process until all the Algorithm Steps have been computed and all of the results are
returned to the data memory locations.
The Singleton [2] algorithm was developed by using a decomposition based on the complex
conjugate symmetry properties of the 7-point transform.
Next, look for the computation for cI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses. Note that bR(5), bI(5), bR(6), and bI(6) are also
used in Stage 3.
Stage 4: Multiplications
This stage contains all of the multiplications and also requires additional data memory
locations to store intermediate results. In all cases the multiplication is performed by pulling
a data value from memory, multiplying it by the appropriate constant, and returning the result
to the same data memory location.
The strategy for converting these equations to code is to start at the top (compute
hR(1)) and identify the pair of inputs to be used first (in this case gR(1) and eR(3)). For this
set of computations only eR(4) and eI(4) are used more than once. Therefore, pull gR(1)
and eR(3) from memory, compute hR(1), and store the result in data memory location M(3)
previously occupied by eR(3).
Next, look for the computation for hI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and all of the results
are returned to the data memory locations.
The Singleton [2] 7-point FFT requires 60 adds, 36 multiplies, 17 data memory
locations, and 6 multiplier constant memory locations. The three stages are as follows.
This stage does not require additional data memory locations or accessing any of
the multiplier constants. Further, the add/subtract process is the same for all of the real
and imaginary pairs. The strategy for converting these equations to code is to start at the
top (compute bR(1)) and identify the pair of inputs to be used first (in this case aR(1) and
aR(6)). Then look down the list to find the second (compute bR(2)) place where these two
inputs are used. Pull aR(1) and aR(6) from memory, compute bR(1) and bR(2), and store
the results in data memory locations M(1) and M(6) previously occupied by aR(1) and
aR(6).
Next, look for the computation for bI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses.
Stage 2: Multiply-Accumulates
This stage contains all of the multiplications and also requires additional data memory
locations to store intermediate results because of multiple multiply-accumulate operations
requiring the same input data. The terms with the sine multipliers are computed first to
minimize required memory. The Memory Map is based on Constraint 5 of Section 8.2.
If the 8-point DFT is calculated directly using Equation 8-8, it would require 16
complex multiplies and 56 complex adds. The number of complex multiplies is lower
than expected (seven for each of seven output frequency components) because many of the
multiplier constants are ±l or ±j (see Figure 3-1). Since a complex multiply uses 4 real
multiplies and 2 real adds, and a complex add uses 2 real adds, the 8-point DFT would
require 64 real multiplies and 144 real adds. The number of adds and multiplies shown
for each of the fast algorithms is significantly less than required for computing the DFT
directly. However, if only a subset of the output frequency components is required, it may
be more cost effective to compute the DFT equation directly for those terms. For example,
if A (0) is the only term needed, it can be computed with 14 adds and no multiplies using
the DFT directly. Each of the other 7 output frequencies requires 6 complex multiplies and
6 complex adds for a total of 24 real adds and 24 real multiplies. With this in mind the
crossover point between using the DFf directly and one of the 8-point FFT algorithms can
be determined based on the number of output frequency components that must be computed.
Since all of the input data is required for each output frequency component calculation, the direct DFT computations require 16 memory locations for the input data and 16 more for the output frequency components. This is a total of 32 data memory locations, since the input and output are complex. Similarly, the DFT data addressing is sequential (i.e., 0 through 7 for each output frequency component), and the computational architecture is simple since they can all be performed with a complex multiply accumulator (see Chapter 10 for details). Addressing the complex multiplier coefficients requires either a modulo arithmetic scheme (k * n mod 8) or that the addresses be stored in program memory.
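The direct computation described above can be illustrated with a short C sketch. It is not an implementation from the book; the coefficient table and function name are illustrative, and the (k * n) mod 8 indexing follows the modulo addressing scheme mentioned in the text.

#include <math.h>

#define N 8

/* Direct N-point DFT with a complex multiply-accumulator.  The coefficient
 * table holds W^m = e^(-j*2*pi*m/N) for m = 0..N-1 and is addressed with the
 * modulo scheme (k * n) mod N.                                              */
void direct_dft(const double aR[N], const double aI[N], double AR[N], double AI[N])
{
    const double pi = 3.14159265358979323846;
    double wR[N], wI[N];
    for (int m = 0; m < N; m++) {                 /* precomputed constants   */
        wR[m] =  cos(2.0 * pi * m / N);
        wI[m] = -sin(2.0 * pi * m / N);
    }
    for (int k = 0; k < N; k++) {
        double accR = 0.0, accI = 0.0;            /* complex accumulator     */
        for (int n = 0; n < N; n++) {
            int m = (k * n) % N;                  /* modulo coefficient address */
            accR += aR[n] * wR[m] - aI[n] * wI[m];
            accI += aR[n] * wI[m] + aI[n] * wR[m];
        }
        AR[k] = accR;
        AI[k] = accI;
    }
}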
Stage 2: Multiplies
This stage contains all of the multiplications. In all cases the multiplication is per-
formed by pulling a data value from memory, multiplying it by the appropriate constant,
and returning the result to the same data memory location. Note that only one multiplier
constant is required.
The strategy for converting these equations to code is to start at the top (compute dR(0)) and identify the pair of inputs to be used first (in this case cR(0) and cR(2)). Then look down the list to find the second (compute dR(4)) place where these two inputs are used. Pull cR(0) and cR(2) from memory, compute dR(0) and dR(4), and store the results in data memory locations M(0) and M(1) previously occupied by cR(0) and cR(2).
Next, look for the computation for dI(0) on the list and repeat the same set of steps. Continue this process until all the Algorithm Steps have been computed and their results stored in the Memory Map addresses. Notice that some of these additions require one imaginary input and one real input. This approach to these additions implements the required multiplication by j = √-1, which converts real parts of data to imaginary parts and imaginary parts to real parts (with a sign change).
Stage 3: Multiplies
This stage contains all of the multiplications. In all cases, multiplication is performed
by pulling a data value from memory, multiplying it by the appropriate constant, and re-
turning the result to the same data memory location. Note that only one multiplier constant
is required because cos(2π/8) = sin(2π/8).
Next, look for the computation for dI(0) on the list and repeat the same set of steps. Continue this process until all the Algorithm Steps have been computed and their results stored in the Memory Map addresses. Notice that some of these additions require one imaginary input and one real input. This approach to these additions implements the required multiplication by j = √-1, which converts real parts of data to imaginary parts and imaginary parts to real parts (with a sign change).
Pull cR(5) and cI(7) from memory, compute dR(5) and dI(5), and store the results in data memory locations M(5) and M(13) previously occupied by cR(5) and cI(7). Perform the same set of steps for dR(7) and dI(7).
Stage 4: Multiplies
This stage contains all of the multiplications. In all cases the multiplication is per-
formed by pulling a data value from memory, multiplying it by the appropriate constant,
and returning the result to the same data memory location. Note that only one multiplier
constant is required.
This stage also does not require any multiplier constants. Further, the add/subtract process is the same for all of the real and imaginary pairs. The strategy for converting these equations to code is to start at the top (compute AR(0)) and identify the pair of inputs to be used first (in this case cR(0) and cR(4)). Then look down the list to find the second (compute AR(4)) place where these two inputs are used. Pull cR(0) and cR(4) from memory, compute AR(0) and AR(4), and store the results in data memory locations M(0) and M(1) previously occupied by cR(0) and cR(4).
Next, look for the computation for AI(0) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses.
Stage 3: Multiplies
This stage contains all of the multiplications. In all cases except cR(8) and cI(8), the multiplication is performed by pulling a data value from memory, multiplying it by the appropriate constant, and returning the result to the same data memory location. Since cR(8) and cI(8) are multiplied during this stage as well as used in the next stage, the multiplied values fI(10) and fR(10), respectively, are stored in two of the additional data memory locations M(20) and M(24) used earlier.
Note that several of the Algorithm Steps, such as AR(0) and AI(0), only relabel the data values to their output labels once they have been used as required by other portions of the algorithm.
Algorithm Steps                Memory Map
AR(0) = fR(0)                  AR(0) ⇒ M(0)
AI(0) = fI(0)                  AI(0) ⇒ M(9)
AR(1) = mR(1) + mR(2)          AR(1) ⇒ M(2)
AI(1) = mI(1) - mI(2)          AI(1) ⇒ M(7)
AR(2) = mR(3) + mR(4)          AR(2) ⇒ M(3)
AI(2) = mI(3) - mI(4)          AI(2) ⇒ M(6)
AR(3) = mR(5) + mR(6)          AR(3) ⇒ M(8)
AI(3) = mI(5) - mI(6)          AI(3) ⇒ M(5)
AR(4) = mR(7) + mR(8)          AR(4) ⇒ M(4)
AI(4) = mI(7) - mI(8)          AI(4) ⇒ M(13)
AR(5) = mR(7) - mR(8)          AR(5) ⇒ M(1)
AI(5) = mI(7) + mI(8)          AI(5) ⇒ M(10)
AR(6) = mR(5) - mR(6)          AR(6) ⇒ M(14)
AI(6) = mI(5) + mI(6)          AI(6) ⇒ M(17)
AR(7) = mR(3) - mR(4)          AR(7) ⇒ M(15)
AI(7) = mI(3) + mI(4)          AI(7) ⇒ M(12)
AR(8) = mR(1) - mR(2)          AR(8) ⇒ M(16)
AI(8) = mI(1) + mI(2)          AI(8) ⇒ M(11)
Stage 2: Multiply-Accumulates
This algorithm stage contains all of the multiplications and requires additional data
memory locations to store the results because the input data is used for sets of computa-
tions. The data memory mapping assumes the multiply-accumulation process described as
Constraint 5 in Section 8.2.
For example, consider the computation of mR(1), mR(3), mR(5), mR(7), and fR(0), which requires bR(1), bR(3), bR(5), bR(7), and aR(0). Because of the need for all five inputs to compute all five outputs, the first four outputs, say mR(1), mR(3), mR(5), and mR(7), are stored in additional data memory locations M(21), M(20), M(19), and M(18). Finally, fR(0) may be stored in one of the input data memory locations, say data memory location M(0) occupied by aR(0). This leaves the four data memory locations M(1), M(2), M(3), and M(4), the ones used by bR(1), bR(3), bR(5), and bR(7), to be used for the extra
locations required by other sets of multiply-accumulate operations. The extra locations
are used for the imaginary equivalent of the real computations. This process is continued,
always using leftover data memory locations, until all of the computations are performed.
This strategy is continued until all of the computations and all the results are stored in
the data memory locations. One caution is that some of the inputs to this stage are needed
in Stage 3.
Stage 3: Multiplies
This stage contains all of the multiplications. The individual data values are pulled
from memory, multiplied by the appropriate constant, and stored in the same data memory
location.
The 16-point DFT is defined for k = 0, 1, 2, ..., 15 as

A(k) = Σ_{n=0}^{15} a(n) * e^(-j2πkn/16)    (8-10)
The Winograd [1] 16-point DFT was developed by using a decomposition based on circular convolution properties. Other popular 16-point FFTs are based on mixed-radix combinations of the 2-, 4-, and 8-point building-block algorithms and are presented in Chapter 9.
If the 16-point DFT is calculated directly from Equation 8-10, it requires 225 complex multiplies and 240 complex adds. Since a complex multiply uses 4 real multiplies and 2 real adds, and a complex add uses 2 real adds, the 16-point DFT requires 900 real multiplies and 930 real adds. The number of adds and multiplies for the fast algorithm is significantly less than required for computing the DFT directly. However, if only a subset of the output frequency components is required, it may be more cost effective to compute the DFT equation directly for those terms. For example, if A(0) is the only term needed, it can be computed with 30 adds and no multiplies by using the DFT directly. Each of the other 15 output frequencies requires 15 complex multiplies and 15 complex adds for a total of 60 real adds and 60 real multiplies. With this in mind, the crossover point between using the DFT directly and the 16-point FFT algorithm can be determined based on the number of output frequency components that must be computed.
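Using the operation counts quoted above, the crossover point can be estimated once relative costs are assigned to real adds and multiplies. The weighting below is an illustrative assumption, not a procedure from the book; the function name is hypothetical.

/* Rough crossover test: is computing only n_out of the 16 output frequency
 * components directly from the DFT cheaper than running the full Winograd
 * 16-point FFT?  Costs are weighted per real add and per real multiply.     */
int direct_dft_is_cheaper(int n_out, int includes_A0,
                          double add_cost, double mul_cost)
{
    double direct = 0.0;
    if (includes_A0) {                 /* A(0): 30 real adds, no multiplies  */
        direct += 30.0 * add_cost;
        n_out -= 1;
    }
    direct += n_out * (60.0 * add_cost + 60.0 * mul_cost);

    double fft = 148.0 * add_cost + 20.0 * mul_cost;   /* Winograd 16-point  */
    return direct < fft;
}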
Since all of the input data is required for each output frequency component calculation, the direct DFT computations require 32 data memory locations for the input data and 32 more for the output frequency components. This is a total of 64 data memory locations, since the input and output are complex. Similarly, the DFT data addressing is sequential (i.e., 0 through 15 for each output frequency component), and the computational architecture is simple, since they can all be performed by using a complex multiply accumulator (see Chapter 10 for details). Addressing the complex multiplier coefficients requires either a modulo arithmetic scheme (k * n mod 16) or that the addresses be stored in program memory. The Winograd algorithm is presented, characterized, and then summarized in the Comparison Matrix in Table 8-10.
The Winograd [1] 16-point FFT requires 148 adds, 20 multiplies, 36 data memory
locations, and 6 multiplier constant memory locations. The seven stages are as follows.
Next, look for the computation for bI(1) on the list and repeat the same set of steps.
Continue this process until all of the computations are performed and all of the results
returned to the data memory locations.
pairs. The strategy for converting these equations to code is to start at the top (compute cR(1)) and identify the pair of inputs to be used first (in this case bR(1) and bR(3)). Then look down the list to find the second (compute cR(2)) place where these two inputs are used. Pull bR(1) and bR(3) from memory, compute cR(1) and cR(2), and store the results in data memory locations M(0) and M(4) previously occupied by bR(1) and bR(3).
Next, look for the computation for cI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses.
pairs. The strategy for converting these equations to code is to start at the top (compute dR(1)) and identify the pair of inputs to be used first (in this case cR(1) and cR(3)). Then look down the list to find the second (compute dR(2)) place where these two inputs are used. Pull cR(1) and cR(3) from memory, compute dR(1) and dR(2), and store the results in data memory locations M(0) and M(2) previously occupied by cR(1) and cR(3).
Next, look for the computation for dI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses. The additional data memory locations M(32), M(33), M(34), and M(35) are required for dR(7), dR(8), dI(7), and dI(8) because their input values, cR(11) through cR(14) and cI(11) through cI(14), are also needed in Stage 4.
Stage 4: Multiplies
This stage contains all of the multiplications. In all cases the multiplication is performed by pulling a data value from memory, multiplying it by the appropriate constant, and returning the result to the same data memory location. In some of the multiplications the real part of a complex data value is the input and the output has an imaginary label. This process provides the required multiplications by j = √-1. Also note that sin(4π/16) = cos(4π/16), which reduces the number of constants to be stored to 6. Note that several of the Algorithm Steps, such as eR(3) and eI(3), just relabel the data values. This is to make intermediate results from several stages have the same small letter label prior to proceeding with Stage 5.
Stage 5: Postmultiplies
This stage also does not require accessing any multiplier constants. The strategy for converting these equations to code is to start at the top (compute fR(1)) and identify the pair of inputs to be used first (in this case eR(3) and eR(4)). Then look down the list to find the second (compute fR(2)) place where these two inputs are used. Pull eR(3) and eR(4) from memory, compute fR(1) and fR(2), and store the results in data memory locations M(2) and M(19) previously occupied by eR(3) and eR(4).
Next, look for the computation for fI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps have been computed and their results
stored in the Memory Map addresses. This stage does not require additional data memory
locations. However, all four additional data memory locations required for this algorithm
are used during this stage to simplify the data addressing. This leaves input data memory
locations M(ll), M(13), M(27), and M(29) unused. They will be reused in Stage 7 to end
the algorithm with the results in the same data memory locations that were occupied by the
input data.
Additionally, note that this stage has data (eR(13), eR(16), eI(13), and eI(16)) that are independently used to compute two results. The Memory Map strategy in this case is to use the eR(13), eR(16), eI(13), and eI(16) data memory locations for the output of the second computation that required these data values. If those data memory locations were used for the output of the first computations, their values would be destroyed before being able to use them for the second computation.
Algorithm Steps                    Memory Map
fR(1) = eR(3) + eR(4)              fR(1) ⇒ M(2)
fI(1) = eI(3) + eI(4)              fI(1) ⇒ M(18)
fR(2) = eR(3) - eR(4)              fR(2) ⇒ M(19)
fI(2) = eI(3) - eI(4)              fI(2) ⇒ M(3)
fR(3) = eR(5) + eR(7)              fR(3) ⇒ M(4)
fI(3) = eI(5) + eI(7)              fI(3) ⇒ M(20)
fR(4) = eR(5) - eR(7)              fR(4) ⇒ M(21)
fI(4) = eI(5) - eI(7)              fI(4) ⇒ M(5)
fR(5) = eR(6) + eR(8)              fR(5) ⇒ M(22)
fI(5) = eI(6) + eI(8)              fI(5) ⇒ M(6)
fR(6) = eR(6) - eR(8)              fR(6) ⇒ M(7)
fI(6) = eI(6) - eI(8)              fI(6) ⇒ M(23)
fR(7) = eR(9) + eR(12)             fR(7) ⇒ M(8)
fI(7) = eI(9) + eI(12)             fI(7) ⇒ M(24)
fR(8) = eR(9) - eR(12)             fR(8) ⇒ M(14)
fI(8) = eI(9) - eI(12)             fI(8) ⇒ M(30)
fR(9) = eR(10) + eR(11)            fR(9) ⇒ M(28)
fI(9) = eI(10) + eI(11)            fI(9) ⇒ M(12)
fR(10) = eR(10) - eR(11)           fR(10) ⇒ M(26)
fI(10) = eI(10) - eI(11)           fI(10) ⇒ M(10)
fR(11) = eR(13) + eR(14)           fR(11) ⇒ M(25)
fI(11) = eI(13) + eI(14)           fI(11) ⇒ M(9)
fR(12) = eR(13) - eR(15)           fR(12) ⇒ M(34)
fI(12) = eI(13) - eI(15)           fI(12) ⇒ M(32)
fR(13) = eR(17) - eR(16)           fR(13) ⇒ M(15)
fI(13) = eI(17) - eI(16)           fI(13) ⇒ M(31)
fR(14) = eR(18) - eR(16)           fR(14) ⇒ M(33)
fI(14) = eI(18) - eI(16)           fI(14) ⇒ M(35)
The strategy for converting these equations to code is to start at the top (compute AR(1)) and identify the pair of inputs to be used first (in this case gR(5) and gR(9)). Then look down the list to find the second (compute AR(7)) place where these two inputs are used. Pull gR(5) and gR(9) from memory, compute AR(1) and AR(7), and store the results in data memory locations M(8) and M(28) previously occupied by gR(5) and gR(9).
Next, look for the computation for AI(1) on the list and repeat the same set of steps. Continue this process until all the Algorithm Steps have been computed and their results stored in the Memory Map addresses. The only variation in the standard pattern of data addressing is for computing AR(11), AI(11), AR(13), and AI(13). The inputs for these computations come from the additional data memory locations needed earlier in the algorithm. Since the additional data memory locations are no longer needed, these computed results for AR(11), AI(11), AR(13), and AI(13) are stored in M(13), M(29), M(27), and M(11), respectively. The final result is the output frequencies being located in the same data memory locations used for the input data. Note that several of the Algorithm Steps, such as AR(0) and AI(0), only relabel the data values to their output labels once they have been used as required by other portions of the algorithm.
The preceding sections describe specific algorithm building blocks for 2-, 3-, 4-, 5-, 7-, 8-, 9-, and 16-point FFTs. Chapter 9 shows how these can be combined to form any transform length that can be factored into the product of these numbers. However, transform lengths such as 13, 143 = 13 x 11, and 117 = 9 x 13 are not the product of these building-block lengths. To compute all transform lengths efficiently, a fast algorithm must exist for computing all prime number (p) length building blocks. The Rader [3] algorithm provides this capability by converting the p-point FFT to a series of (p - 1)-point FFTs. The 5-point Rader FFT given in Section 8.6.3 is a special case of this algorithm.
Since all prime numbers except 2 are odd (all even numbers have at least one factor
of 2), (p - 1) is always even and therefore has at least one factor of 2. For example, if
p = 67, then (p - 1) = 66 = 11 x 2 x 3. If all of the factors of 2 are grouped (in this
case just one factor of 2), the remaining factors are now all odd (in this case 11 and 3). If
the factors of (p - 1) are 2, 3, 4, 5, 7, 8, 9, or 16, the algorithms in this chapter, combined with those in Chapter 9, can be used to compute the p-point FFT.
If some of the factors are not among the building-block algorithms provided, they
must be obtained from some other source. The power-of-primes algorithm from Chapter
9 can be used for factors of 2 larger than 16. The Singleton [2] or general SWIFT [8]
odd-point algorithms can be used for any odd-numbered factor. Therefore, coupled with
the building blocks presented in this chapter and the algorithms presented in Chapter 9, the
Singleton and general SWIFT odd-point algorithms can be used to compute an FFT of any
length.
The general Rader [3] algorithm uses the circular convolution properties of prime number DFTs, much like the Winograd algorithm [1]. The eight stages are as follows.

p_i = g^i modulo N

for i = 1, 2, ..., (N - 1), where "modulo N" means to take the number g^i and subtract N from it until the result is less than N but greater than zero.
For example, 3 and 5 are the primitive roots of 7. Therefore, either can be used to reorganize the input data to a 7-point DFT to prepare it for the Rader computational algorithm. Namely, the sequences for g = 3 and g = 5 are

g = 3 sequence: 3, 2, 6, 4, 5, and 1
g = 5 sequence: 5, 4, 6, 2, 3, and 1
With the use of the table of primitive roots, this process can be performed for any prime number up to 5003 [9]. This stage requires no computation or data manipulation during FFT computations. For a given N-point prime number DFT, this reorganized data sequence can be computed ahead of time and stored in data or program memory.
For every primitive root there is another primitive root so that the product of the two is 1 modulo N. For the 7-point example, 5 plays this role for the primitive root 3, and 3 plays this role for the primitive root 5 (3 x 5 = 15 = 1 modulo 7). This stage reorganizes the complex multiplier coefficients using this other factor. Namely, for the 7-point transform and the generator g = 3, reorganize the complex multiplier coefficients, using the g = 5 sequence for the exponents, to W7^5, W7^4, W7^6, W7^2, W7^3, and W7^1. This stage requires no computation or data manipulation during FFT computations. For a given N-point prime number DFT, this reorganized complex multiplier coefficient sequence can be computed ahead of time and stored in data or program memory.
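Both reordering sequences can be generated once, offline, from the primitive root and its inverse. The following C sketch is illustrative; the function name, and the assumption that N is prime and g is a known primitive root, are not from the book.

/* Fill seq[0..N-2] with g^1, g^2, ..., g^(N-1) modulo N.  For N = 7 and
 * g = 3 this produces 3, 2, 6, 4, 5, 1; for g = 5 it produces 5, 4, 6, 2,
 * 3, 1, matching the 7-point example in the text.                          */
void primitive_root_sequence(int N, int g, int seq[])
{
    int p = 1;
    for (int i = 0; i < N - 1; i++) {
        p = (p * g) % N;     /* repeated "subtract N until less than N"     */
        seq[i] = p;
    }
}

/* The coefficient reordering uses the other root g_inv with g * g_inv = 1
 * modulo N (5 for g = 3 when N = 7), so both tables come from one routine:
 *
 *   int data_order[6], coef_order[6];
 *   primitive_root_sequence(7, 3, data_order);
 *   primitive_root_sequence(7, 5, coef_order);
 */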
Compute the (N - 1)-point IFFT of the output sequence from Stage 7. Again, this IFFT can be computed using the building blocks in this chapter, the algorithms in Chapter 9, and the facts from Section 2.3. The result is the required A(i) - a(0) for the N-point FFT, reordered by using the same generator that was used to reorder the complex multiplier coefficients. For the 7-point FFT, the output of this stage is:

From Chapters 2 and 3, the IFFT requires the same number of computations as the comparable FFT. In fact, it uses the same algorithm, with some of the multiplier coefficients changed. Therefore, this stage requires the number of computations associated with the (N - 1)-point FFT algorithm chosen from Chapter 9 with the building blocks from this chapter.
The general Singleton [2] algorithm uses the complex conjugate symmetry of the W_N^(kn) multipliers in the DFT (Equation 8-11) and works for all odd numbers. For i = 1, 2, ..., (N - 1)/2:
(a) Pull aR(i) and aR(N - i) from their data memory locations, perform the add and subtract operations, and return the results, bR(2i - 1) and bR(2i), to the data memory locations previously occupied by aR(i) and aR(N - i).
(b) Pull aI(i) and aI(N - i) from their data memory locations, perform the add and subtract operations, and return the results, bI(2i - 1) and bI(2i), to the data memory locations previously occupied by aI(i) and aI(N - i).
Since all of these computations can be performed in-place, no additional data memory is
required.
Stage 2: Multiply-Accumulates
This is a total of (N - 1) * (N - 1) additions and (N - 1) * (N - 1) multiplications. Since the computations are all multiply accumulations and the input values are used by all of the computed results, the most efficient use of data memory is to:
(a) Compute the (N - 1)/2 different cR(2i - 1) terms and store them in (N - 1)/2 new data memory locations.
(b) Compute AR(0) and store its result in the location previously occupied by aR(0).
(c) Compute the (N - 1)/2 different cI(2i - 1) terms and store them in (N - 1)/2 data memory locations previously occupied by the (N - 1)/2 different bR(2n - 1).
(d) Compute AI(0) and store its result in the location previously occupied by aI(0).
(e) Compute the (N - 1)/2 different cI(2i) terms and store them in (N - 1)/2 data memory locations previously occupied by the (N - 1)/2 different bI(2n - 1).
(f) Compute the (N - 1)/2 different cR(2i) terms and store them in (N - 1)/2 data memory locations previously occupied by the (N - 1)/2 different bR(2n).
The result is the need for (N - 1)/2 additional data memory locations.
(a) Pull cR(2i - 1) and cI(2i) from their data memory locations, perform the add and subtract operations, and return the results, AR(i) and AR(N - i), to the data memory locations previously occupied by cR(2i - 1) and cI(2i).
(b) Pull cI(2i - 1) and cR(2i) from their data memory locations, perform the add and subtract operations, and return the results, AI(i) and AI(N - i), to the data memory locations previously occupied by cI(2i - 1) and cR(2i).
The general SWIFT odd-point algorithm also uses the complex conjugate symmetry of the W_N^(kn) multipliers in the DFT (Equation 8-11). The only difference is how the first input sample and first output frequency component are treated. Depending on the approach, half of the multipliers are changed. The three stages are as follows.
This is a total of 3 * (N - 1) additions. Since all of these computations can be performed in-place, no additional data memory is required. These computations are performed in pairs. For i = 1, 2, ..., (N - 1)/2:
(a) Pull aR(i) and aR(N - i) from their data memory locations, perform the add and subtract operations, and return the results, bR(2i - 1) and bR(2i), to the data memory locations previously occupied by aR(i) and aR(N - i).
(b) Pull aI(i) and aI(N - i) from their data memory locations, perform the add and subtract operations, and return the results, bI(2i - 1) and bI(2i), to the data memory locations previously occupied by aI(i) and aI(N - i).
Finally, AR(0) and AI(0) are computed and the results stored in the locations previously occupied by aR(0) and aI(0).
Stage 2: Multiply-Accumulates
For i = 1, 2, ..., (N - 1)/2, compute:

cR(2i - 1) = Σ_{n=1}^{(N-1)/2} bR(2n - 1) * [cos(2πni/N) - 1] + AR(0)

cI(2i - 1) = Σ_{n=1}^{(N-1)/2} bI(2n - 1) * [cos(2πni/N) - 1] + AI(0)

cI(2i) = Σ_{n=1}^{(N-1)/2} bR(2n) * sin(2πni/N)

cR(2i) = Σ_{n=1}^{(N-1)/2} bI(2n) * sin(2πni/N)
(a) Compute the (N - 1)/2 different cR(2i - 1) terms and store them in (N - 1)/2 new data memory locations.
(b) Compute the (N - 1)/2 different cI(2i - 1) terms and store them in (N - 1)/2 data memory locations previously occupied by the (N - 1)/2 different bR(2n - 1).
(c) Compute the (N - 1)/2 different cI(2i) terms and store them in (N - 1)/2 data memory locations previously occupied by the (N - 1)/2 different bI(2n - 1).
(d) Compute the (N - 1)/2 different cR(2i) terms and store them in (N - 1)/2 data memory locations previously occupied by the (N - 1)/2 different bR(2n).
The result is the need for (N - 1)/2 additional data memory locations, and all of the computations are performed with the same multiply-accumulate structure, not in-place.
(a) Pull cR(2i - 1) and cI(2i) from their data memory locations, perform the add and subtract operations, and return the results, AR(i) and AR(N - i), to the data memory locations previously occupied by cR(2i - 1) and cI(2i).
(b) Pull cI(2i - 1) and cR(2i) from their data memory locations, perform the add and subtract operations, and return the results, AI(i) and AI(N - i), to the data memory locations previously occupied by cI(2i - 1) and cR(2i).
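The three stages just described can be sketched in C. The sketch follows the mathematics of the odd-point transform rather than the book's exact intermediate labels or Memory Map conventions, so the array names, the fixed buffer size, and the way the sine and cosine accumulations are paired into outputs are assumptions of this illustration.

#include <math.h>

/* Odd-length DFT using the conjugate symmetry of the W_N^(kn) multipliers.
 * Stage 1 forms sums s and differences d of mirrored inputs (plus A(0)),
 * Stage 2 forms [cos - 1] and sin accumulations (the SWIFT form, which
 * reuses A(0)), and Stage 3 combines them into A(i) and A(N - i).           */
void odd_point_dft(int N, const double aR[], const double aI[],
                   double AR[], double AI[])
{
    const double pi = 3.14159265358979323846;
    int h = (N - 1) / 2;
    double sR[128], sI[128], dR[128], dI[128];   /* illustrative size, N <= 255 */

    /* Stage 1: sums, differences, and A(0).                                 */
    AR[0] = aR[0];
    AI[0] = aI[0];
    for (int i = 1; i <= h; i++) {
        sR[i] = aR[i] + aR[N - i];  dR[i] = aR[i] - aR[N - i];
        sI[i] = aI[i] + aI[N - i];  dI[i] = aI[i] - aI[N - i];
        AR[0] += sR[i];
        AI[0] += sI[i];
    }

    /* Stages 2 and 3: multiply-accumulates, then output recombination.      */
    for (int i = 1; i <= h; i++) {
        double cr = AR[0], ci = AI[0];   /* cosine accumulations             */
        double sinR = 0.0, sinI = 0.0;   /* sine accumulations               */
        for (int n = 1; n <= h; n++) {
            double c = cos(2.0 * pi * n * i / N) - 1.0;
            double s = sin(2.0 * pi * n * i / N);
            cr   += sR[n] * c;   ci   += sI[n] * c;
            sinR += dR[n] * s;   sinI += dI[n] * s;
        }
        AR[i] = cr + sinI;   AR[N - i] = cr - sinI;
        AI[i] = ci - sinR;   AI[N - i] = ci + sinR;
    }
}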
The performance measures of the three general algorithms at the bottom of the Comparison Matrix in Table 8-1 (see page 143) are described as formulas, so the specific values can be computed for any building-block length. The last two columns refer to memory locations.
Algorithm               # of adds              # of multiplies       # of data locations   # of const. locations
2-Point                 4                      0                     4                     0
3-Point
  Winograd              12                     4                     6                     2
  Singleton             12                     4                     7                     2
4-Point                 16                     0                     8                     0
5-Point
  Winograd              34                     10                    12                    5
  Singleton             32                     16                    12                    4
  Rader                 42                     12                    12                    4
7-Point
  Winograd              72                     16                    22                    8
  Singleton             60                     36                    17                    6
8-Point
  Winograd              52                     4                     16                    1
  Split-Radix           52                     4                     16                    1
  Radix-2               52                     4                     16                    1
  PTL                   52                     4                     16                    1
9-Point
  Winograd              90                     20                    26                    10
  PTL                   94                     52                    22                    8
  Burrus-Eschenbacher   84                     20                    26                    8
16-Point
  Winograd              148                    20                    36                    6
General N-Point
  Rader                 2*A(N-1) + 6*(N-1)     2*M(N-1) + 4*(N-1)    D(N-1) + 2            C(N-1) + 2
  Singleton             (N+3)*(N-1)            (N-1)^2               (5*N-1)/2             N-1
  SWIFT                 (N+3)*(N-1)            (N-1)^2               (5*N-1)/2             N-1

Key to Variables
N = Number of complex points in building-block algorithm
A(N-1) = Number of adds required for (N-1)-point FFT
M(N-1) = Number of multiplies required for (N-1)-point FFT
D(N-1) = Number of memory locations used for data in (N-1)-point FFT
C(N-1) = Number of memory locations used for constants in (N-1)-point FFT

8.13 CONCLUSIONS
A lot of space is spent on examples in this chapter because they provide the clearest picture and instruction on how to implement the familiar and not so familiar small-point transforms. Multiple algorithms for each length, except 2 and 4, prove the versatility and flexibility of FFTs to provide optimized and customized products. With the building-block algorithms here an FFT of any length can be created by using the algorithms in the next chapter.
Another unique feature of the book, mapping, was introduced in this chapter and is done on two higher levels in Chapters 9 and 12. Here, mapping the result of each algorithm step into a data memory location is the first step toward converting FFT algorithms to optimized assembly language code. The next chapter shows how to do the necessary relabeling of the mappings in this chapter, so these building blocks can be used in larger algorithms. In Chapter 12 the third level of mapping shows how to distribute data and algorithms among multiple processors.
If an application only needs a small-point transform on a single processor, the methods
and steps detailed in the next four chapters are not needed. The reader can proceed to
Chapter 13 to see how to select an arithmetic format for implementing the algorithm on one
of the chips in Chapter 14.
REFERENCES
[1] S. Winograd, "On Computing the Discrete Fourier Transform," Mathematics of Computation, Vol. 32, No. 141, pp. 175-199 (1978).
[2] R. C. Singleton, "An Algorithm for Computing the Mixed Radix Fast Fourier Transform," IEEE Transactions on Audio and Electroacoustics, Vol. AU-17, pp. 93-103 (1969).
[3] C. M. Rader, "Discrete Fourier Transforms When the Number of Data Samples Is Prime," Proceedings of the IEEE, Vol. 56, pp. 1107-1108 (1968).
[4] J. W. Cooley, "The Structure of FFT Algorithms," IEEE International Conference on Acoustics, Speech and Signal Processing Tutorial Session, pp. 12-14 (1990).
[5] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Mathematics of Computation, Vol. 19, p. 297 (1965).
[6] J. Smith, "Next-Generation FFT Quickly Calculates Odd Sample Sizes," Personal Engineering & Instrumentation News, pp. 21-24 (1984).
[7] C. S. Burrus and P. W. Eschenbacher, "An In-Place In-Order Prime Factor FFT Algorithm," Acoustics, Speech and Signal Processing, Vol. 29, No. 4, pp. 806-817 (1981).
[8] Patent No. 4,293,921, October 6, 1981, Method and Signal Processor for Frequency Analysis of Time Domain Signals, Winthrop W. Smith, Jr.
[9] CRC Standard Mathematical Tables and Formulae, CRC Press, Boca Raton, FL, pp. 96-101, 1991.
9
Algorithm Construction
9.0 INTRODUCTION
An FFT algorithm is a sequence of computational steps used to compute the DFT efficiently.
The most popular of these algorithms work only for transform lengths that are powers-of-two
(i.e., 2, 4,8, 16, 32,64, ... points). However, there are FFT algorithms for any number (N)
of data points. This chapter describes the computational stages and lists the computational
steps for seven FFT algorithms, including the memory maps for storing the intermediate
and final results of each.
The answers to the following questions help determine which FFT algorithm to use. Each of the seven algorithms is:
• Presented with a general two-block algorithm and then with a 15- or 16-point example
• Constructed in a uniform format
• Able to use any of the building-block algorithms from Chapter 8
• Able to be combined to form even larger FFf algorithms
The most common way to evaluate FFT algorithms is in terms of the number of computations
and amount of memory required to compute them. The performance measures in this section
quantify those computations and memory needs. The same four measures were used in
Chapter 8.
The number of adds is the total number of real adds used for each of the algorithms.
It includes the two adds required as part of each of the complex multiplies.
The number of multiplies is the total number of real multiplies for each algorithm.
Each complex multiply takes four real multiplies and two real adds (counted in the number
of adds).
Each algorithm begins and ends by using exactly 2 * N data memory locations to
store the input data and output results, respectively. However, if no temporary regis-
ters are available for intermediate results, most of the algorithms in this chapter require
additional data memory locations during the computations. In this chapter, Algorithm
Steps and a Memory Map are given for each algorithm, and total data memory location
requirements are listed in the Comparison Matrix, assuming the processor has no tem-
porary registers. The difference between those numbers and 2 * N is the number of
temporary registers needed to avoid using extra data memory locations for intermediate
results.
The following are the constraints the authors have used for the transforms in this chapter:
1. The real and imaginary parts of the i-th input sample are aR(i) and aI(i); AR(i) and AI(i) are the real and imaginary parts of the i-th output frequency component.
2. Intermediate results are labeled with subsequent lowercase letters of the alphabet to indicate where they are located relative to other computational outputs. For example, the first set of intermediate computational results in each of the algorithm building blocks is labeled bR(i) and bI(i).
3. The sum and difference computations are performed by taking two pieces of data
from data memory, performing the required computations, and returning the results
to available memory locations.
4. The multiply-accumulates are performed by sequentially pulling a data value from
data memory, performing the multiplication, and adding the results to the processor's accumulator.

9.3 THREE CONSTRUCTION APPROACHES
The seven FFT algorithms presented in this chapter are divided into three approaches: con-
volution, prime factor, and mixed-radix. For each algorithm, the general form is presented
and discussed first. Then a specific example is presented to illustrate the features of each
of the seven algorithms more clearly. These examples are chosen to be 15- and 16-point
transforms. These lengths are large enough to show the characteristics of the algorithms
and yet small enough to be reasonably presented. Keeping the lengths of the different
examples close to each other also allows the algorithms in the different approaches to be
compared.
The first approach is convolution-based algorithms. The mathematical technique for obtaining these FFT algorithms is based on converting the DFT into a set of convolution equations that have special properties to reduce the number of computations. Two convolution-based algorithms, due to Bluestein and Winograd, are presented in general and then illustrated with 15-point examples. Performance measures are used to describe the properties and limitations of the algorithms.
The second approach of FFT algorithms is commonly called prime factor algorithms.
The mathematics for obtaining these algorithms is based on modulo arithmetic theory. Two
prime factor-based algorithms are presented in general and then illustrated with 15-point
examples. Performance measures are used to describe the properties and limitations of the
algorithms.
The third approach of FFT algorithms is called mixed-radix algorithms. This approach can be used for all transform lengths and includes the power-of-two algorithms, which have been the most popular, yet most restrictive. The algorithm takes advantage of the complex conjugate symmetry properties of the DFT. The general algorithm is presented first and is followed by three examples, two of 16 points and one of 15 points. Performance measures are used to describe the properties and limitations of the algorithms.
The memory mappings in the algorithm examples in Chapters 8 and 9 only work directly
if these exact transforms are being computed and memory locations 0 through 2N - 1 are
available. In general, the building blocks in Chapter 8 will be combined in different ways
than the examples in Chapter 9 in order to implement different transform lengths. This
leads to the need to use different memory locations than in the examples.
Rather than having to construct a new memory mapping, this section provides a
straightforward set of steps for converting the memory mappings in the Chapters 8 and 9
examples to any random ordering of the input data that occurred because of where the data
was stored from prior computations. Section 9.4.1 defines the relabeling steps in general,
and Section 9.4.2 provides a specific example.
Step 1: For all of the stages in the N-point FFT, relabel the input addresses for real data with letters. Start with M(AR) for M(0), proceed to use M(BR) for M(1), and so forth, until all of the real data is relabeled.
Step 2: Label all real parts of all intermediate and output results in the algorithm that correspond with the "letter pair" address from Step 1.
Step 3: Repeat Step 1 for the imaginary data, labeling the input address with the letter corresponding to its real-part equivalent. For example, the real part of the zero-th input sample is in location zero. In Step 1 this was assigned memory location M(AR).
Step 4: Label all imaginary parts of all intermediate and output results in the algorithm that correspond with the "letter pair" address from Step 3.
Step 5: For each input address pair M(AR), M(AI), set the AR and AI equal to the actual data location of the data that will be input to the algorithm.
Step 6: For each place in the N-point FFT that has letter labels (constructed in Steps 1 through 4), replace the labels with the actual data location assigned it in Step 5.
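Steps 1 through 6 amount to building a small indirection table that substitutes actual data locations for the letter-labeled template addresses. A minimal C sketch is shown below; the function and array names are illustrative assumptions, not the book's code.

#include <stddef.h>

/* template_addr[j] is the memory location used by the j-th real/imaginary
 * slot in the Chapter 8 building-block map (Steps 1-4: the letter labels,
 * represented here simply as indices 0..2N-1).  actual_addr[j] is where that
 * slot's data really resides in the larger algorithm (Step 5).  The function
 * rewrites a list of algorithm-step addresses in place (Step 6).            */
void relabel_addresses(const size_t template_addr[], const size_t actual_addr[],
                       size_t n_slots, size_t step_addr[], size_t n_steps)
{
    for (size_t s = 0; s < n_steps; s++) {
        for (size_t j = 0; j < n_slots; j++) {
            if (step_addr[s] == template_addr[j]) {   /* match the letter pair   */
                step_addr[s] = actual_addr[j];        /* substitute real location */
                break;
            }
        }
    }
}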
The 4-point FFT from Chapter 8 can be used as a simple example to illustrate Steps 1 through 6. The columns in Table 9-1 show the mapping steps, as follows:
1. The first eight entries in column 1 are the 4-point building-block input data mapping
from Chapter 8.
2. The second eight entries in column 1 are a random ordering of the input data
memory locations that might be required because of previous computations.
3. The first eight entries in column 2 are the result of performing Steps 1 and 3 of
Section 9.4.1.
4. The second eight entries in column 2 are the result of performing Step 5 of Section
9.4.1.
5. The entries in column 3 are the result of performing Steps 2 and 4 of Section 9.4.1.
6. The entries in column 4 are the result of performing Step 6 of Section 9.4.1.
Once this is accomplished, the modified building blocks from Chapter 8 can be used to
construct the needed building block computations with the new input data ordering.
aI(2) ⇒ M(5)    M(5) ⇒ M(CI)    AI(1) ⇒ M(DR)    AI(1) ⇒ M(4)
aI(3) ⇒ M(2)    M(2) ⇒ M(DI)    AI(3) ⇒ M(CI)    AI(3) ⇒ M(5)

[Figure 9-1: Bluestein algorithm block diagram with input complex multipliers, an N-stage linear filter, and output complex multipliers.]
In general, this algorithm only provides a speedup of N^1.5 rather than the N * log2(N) computational speedup of other FFT algorithms. However, if the N-stage linear filter is implemented with the FFT techniques in Chapter 6, the Bluestein algorithm can provide computational performance that varies as N * log2(N). Figure 9-2 shows the Bluestein algorithm with the N-stage linear filter replaced with its frequency domain processing
equivalent from Chapter 6 (Figure 6-1). The M-point FFT that operates on the N-stage
linear filter coefficients is used just once since the filter coefficients stay constant for a given
transform length N.
[Figure 9-2: Bluestein algorithm with the N-stage linear filter implemented in the frequency domain; the N-stage linear filter coefficients are transformed by an M-point FFT.]
It seems logical that if an FFT is going to be used to compute the Bluestein algorithm for FFTs, the FFT might as well be used directly. The reason for the attractiveness of the Bluestein algorithm is that a standard power-of-two algorithm can be used to compute a non-power-of-two FFT. However, for the same non-power-of-two FFT length, the prime factor and Winograd implementations will require fewer multiplications than the Bluestein algorithm.
Once it has been decided that power-of-two algorithms provide the best approach for the M-point FFT needed in the Bluestein algorithm, the mixed-radix section of this chapter (Section 9.7) should be examined to see if other advantages can be taken to simplify the computations. The most useful simplification comes because of additional constraints the Bluestein algorithm puts on M. Namely, the algorithm requires that M, the FFT length, be at least twice N, the number of stages in the linear filter in Figure 9-1. This means that, for N input samples, M - N zeros (Section 2.3.10) are added to obtain the M samples needed by the M-point FFT. Since M ≥ 2 * N, it follows that M - N ≥ N. Therefore, at least the second half of the inputs to the M-point FFT are zeros.
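The length constraint can be captured in a few lines of C; the function name is illustrative, and the power-of-two choice matches the examples later in this chapter.

/* Smallest power-of-two FFT length M with M >= 2*N, as required by the
 * Bluestein construction (so at least the second half of the FFT input,
 * the appended M - N zeros, is guaranteed to be zero).                      */
unsigned bluestein_fft_length(unsigned N)
{
    unsigned M = 1;
    while (M < 2 * N)
        M *= 2;
    return M;          /* e.g., N = 15 gives M = 32 */
}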
In Sections 9.7.5 and 9.7.6 the first input data samples are combined such that one comes from the first half of the data and one from the second half. This is the decimation-in-time decomposition in Section 10.4.1. In Stage 1 of the general mixed-radix algorithm in Section 9.7.4, if P = 2 and Q = M/2, then the samples (k = 0 and k = 1) that are combined in the n-th 2-point input building block are aR(k * M/2 + n) and aI(k * M/2 + n). This always puts one input (k = 0) in the first half of the data samples and the other (k = 1) in the second half. This means that if the first building block for the M-point FFT is two points (P = 2), one input to each 2-point FFT is always zero. Therefore, the 2-point FFTs require no computations. This replaces the single M-point mixed-radix FFT with two M/2-point FFTs, one of which requires a complex multiplier because of the details of the mixed-radix algorithm shown in Stage 2 of Section 9.7.4.
Since less than half of the outputs, N to be exact, of the M-point IFFT are used, only half of its M outputs need be computed. Similar to the input M-point FFT, if the 2-point IFFT is used as Q rather than P, the output 2-point FFT is reduced to its subtract
computation. Combining all of these facts to reduce the Bluestein computations results
in converting the block diagram in Figure 9-2 to the one in Figure 9-3. Following the
description of the general algorithm, a 15-point example is provided to concretely illustrate
the algorithm and provide a direct comparison with the 15- and 16-point examples presented
later in this chapter for other FFT algorithms.
[Figure 9-3: Block diagram of the reduced Bluestein algorithm: input complex multipliers applied to g(i), two M/2-point FFTs, linear filter complex multipliers, two M/2-point IFFTs, and output complex multipliers producing A(i).]
The 10 stages required to implement the general Bluestein algorithm are presented and summarized in Figure 9-3. The total number of real adds required is 10 * N + 2 * M plus the number of real adds required for four M/2-point FFTs. Similarly, the required number of real multiplies is 4 * M + 16 * N plus the number of real multiplies required for four M/2-point FFTs.
Complex multipliers require two additional memory locations for temporary storage, and each M/2-point FFT requires some number of memory locations over and above the input and output data requirements. Since the FFT almost always requires at least two additional data memory locations, the data memory requirements are determined by the chosen M/2-point FFT. If the M/2-point FFTs are computed in sequence, not both at the same time, then the additional data memory required for the intermediate results of the first M/2-point FFT algorithm can also be used for the second M/2-point FFT. Therefore, the data memory requirement is M (for the second M/2-point FFT) plus the requirements for the chosen M/2-point FFT.
There are N complex multiplier constants on the input and the output, and M complex multiplier constants in the center for the unit pulse response of the Bluestein filter. Additionally, there are M/2 complex constants at the input to the lower M/2-point FFT and at the output from the lower M/2-point IFFT. The M/2-point IFFT uses the same constants as the FFT with the sign of the imaginary parts changed. This is a total of (4 * N + 3 * M) memory locations plus those required for the chosen M/2-point FFT.
This sequence of stages assumes that the linear filter complex multipliers have been
computed and stored in memory using the techniques in Chapter 6. The stages of the general
Bluestein algorithm are as follows.
Append the N input data points, a(n) = aR(n) + j * aI(n), with (M - N) zeros to obtain an M-point input sequence for the M-point FFT. The (M - N) zeros are appended to the end of the actual data. The real zeros are stored in data memory locations N through (M - 1), and the imaginary zeros in locations (N + M) through (2 * M - 1). The result is having all of the real input data to the M-point FFT stored in contiguous data memory locations 0 through (M - 1), and the imaginary data stored in data memory locations M through (2 * M - 1).
Contents            Location
DR(0)               0
...                 ...
DR(M/2 - 1)         M/2 - 1
DR(M/2)             M/2
...                 ...
DR(M - 1)           M - 1
DI(0)               M
...                 ...
DI(M/2 - 1)         3*M/2 - 1
DI(M/2)             3*M/2
...                 ...
DI(M - 1)           2*M - 1
Temporary Data      2*M and above
them M/2 addresses higher than in the Chapter 8 building block. Similarly, the imaginary data addresses start at M(M/2 + M) and end at M(M/2 + M/2 - 1 + M). This makes them M addresses higher than in the Chapter 8 building block. This offset of the data locations makes it easy to directly use both the equations from Chapter 8 and their data memory map.
Step 1: First M/2-Point FFT Computations
The assumptions for the first M/2-point FFT (k = 0) are the following:
1. Use the M/2-point algorithm steps directly from Chapter 8 or from one of the mixed-radix algorithms in Section 9.7.
2. Use the memory addresses directly for all real data, except the additional memory locations required in the middle of the computations.
3. For the imaginary data, add M/2 to all of the memory locations, except for the additional memory locations required in the middle of the computations.
4. For the additional memory locations required in the middle of the computations, add M to the memory location.
5. Relabel the output frequency components from AR(n) and AI(n) to AR(2 * n) and AI(2 * n).
Step 2: Second M/2-Point FFT Computations
Similarly, the assumptions for the second M/2-point FFT (k = 1) are the following:
1. Use the M/2-point algorithm steps directly from Chapter 8 or Chapter 9, except modify all of the data labels by adding M/2 to them.
2. Add M/2 to the memory addresses for all real data, except the additional memory locations required in the middle of the computations.
3. Add M to the memory addresses for all imaginary data, except for the additional memory locations required in the middle of the computations.
4. For the additional memory locations required in the middle of the computations, add M to the memory location.
5. Relabel the output frequency components from AR(n) and AI(n) to AR(2 * n + 1) and AI(2 * n + 1).
The total number of computations required for this stage is twice the number of computations needed for the chosen M/2-point transform.
constants prior to forming and storing the output results. Since the complex multiplies are computed sequentially, the same two additional memory locations can be used for each. The CR(n) and CI(n) are stored in the locations from which the AR(n) and AI(n) were pulled to perform the computations.
Some of the building-block algorithms in Chapter 8 and algorithms in Chapter 9 do not have all of their real outputs in the same data locations as the real inputs. Addressing convenience has resulted in some of the imaginary outputs being interspersed. It is convenient to correct this inconsistency during the complex multiply computations in this stage. Specifically, if the imaginary part of one of the AR(n) and AI(n) is stored in the lower portion of the data memory, change this when the complex multiply outputs are stored so that the real parts of all of the terms are stored together in the lower portion of the memory used for CR(n) and CI(n).
by 1. Therefore, one of the two M/2-point IFFTs does not have its outputs modified prior to computing the 2-point IFFTs. Since only M/2 - 1 of these M/2 complex outputs represent the needed result in Stage 8, only M/2 - 1 of the complex multiplies need be performed. The total number of computations for these M/2 - 1 complex multiplies is 4 * (M/2 - 1) real multiplies and 2 * (M/2 - 1) real adds.
Stage 8: Computing the Output 2-Point Building Blocks
This stage has two steps. The first is to properly group the input data for each of the M/2 two-point algorithms. The second is to compute the appropriate part of each of the M/2 two-point algorithms.
Step 1: Grouping the Input Data Points to the 2-Point Building Blocks
For the n-th input to the k-th 2-point building block, choose fR(k * 2 + n) and fI(k * 2 + n) (where k = 0, 1, ..., M/2 - 1 and n = 0, 1) from the input data sequence. In terms of the input labels, aR(n) and aI(n), shown in Chapter 8, the inputs for the k-th 2-point building blocks are:

aR(0) = fR(2 * k)        aR(1) = fR(2 * k + 1)
aI(0) = fI(2 * k)        aI(1) = fI(2 * k + 1)
Step 2: Computing a Portion of the Output 2-Point Building Blocks
Using the 2-point building block from Chapter 8 gives:

AR(0) = aR(0) + aR(1)        AR(1) = aR(0) - aR(1)
AI(0) = aI(0) + aI(1)        AI(1) = aI(0) - aI(1)

The outputs of interest are the second pair of equations. Therefore, if the output frequency components of the M-point IFFT are yR(n * M/2 + k) and yI(n * M/2 + k), for the n-th output of the k-th 2-point building block, the outputs of interest are for n = 1. In terms of the output labels, AR(n) and AI(n), shown for the M/2-point radix-4 FFT, the outputs for the k-th 2-point building block are equated to the complete outputs, using the equations:

yR(M/2 + k) = fR(2 * k) - fR(2 * k + 1)
yI(M/2 + k) = fI(2 * k) - fI(2 * k + 1)
Since only (M/2 - 1) of these M/2 complex outputs represent the needed result in Stage 10, only (M/2 - 1) of the complex adds need be performed. The (M/2 - 1) partial 2-point building blocks require 2 * (M/2 - 1) real adds.
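The partial 2-point computation above reduces, in code, to a single subtraction per real and imaginary part. The sketch below assumes the f values are held in plain arrays rather than the book's Memory Map locations, and it loops over all M/2 blocks for clarity.

/* Output 2-point building blocks reduced to their subtract computation.
 * As the text notes, only M/2 - 1 of these results are actually needed, so
 * one iteration can be skipped in a tuned implementation.                   */
void output_two_point(int M, const double fR[], const double fI[],
                      double yR[], double yI[])
{
    for (int k = 0; k < M / 2; k++) {
        yR[M / 2 + k] = fR[2 * k] - fR[2 * k + 1];
        yI[M / 2 + k] = fI[2 * k] - fI[2 * k + 1];
    }
}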
Stage 9: Adjusting the Output Data
This stage has two steps:
These steps can be combined into a single complex multiply for each of the N outputs. This is a total of N complex multiplies, which is a total of 4 * N real multiplies and 2 * N real adds. If there are no temporary registers in the processor, then two additional memory locations are required to perform the complex computations. The outputs from this step are placed in the same locations from which the inputs to the step were pulled for each n.

[Figure: Block diagram of the 15-point Bluestein algorithm example, showing the input complex multipliers applied to a(i), the 16-point FFTs, the linear filter complex multipliers, the 16-point IFFTs, and the output complex multipliers that produce A(i).]
This example requires 790 real adds and 464 real multiplies. This is about five times the number of computations needed for the other 15-point examples in this chapter. However, it can be computed using only power-of-two algorithms. This removes the need to develop special code or hardware and allows the application to take advantage of hardware and software refinements developed for the standard power-of-two FFTs. Further, the computational difference is not as great when unusual FFT lengths, such as prime numbers, are required.
The data memory required for this algorithm is the same as that required for two
16-point radix-4 mixed-radix algorithms. From the example in Section 9.7.5, this is 40
locations. Since the 16-point algorithms are computed sequentially, the additional eight
(40 - 32) locations can be reused for the second 16-point FFT. The same is true for the
IFFTs. Therefore, the total data memory required is 32 + 32 + 8 = 72. The memory required for data constants is the sum of the requirements for the 16-point FFT plus those for each of the complex multiplies. For this example that is 4 * 15 + 3 * 32 + 6 = 162. The complex multiply algorithm used here is the one used in the Singleton example in Section 9.7.7.
The 32-point FFT is chosen to execute the 15-point FFT because it is the smallest power-of-two greater than 2 * 15 = 30 points.
Modify the 15-point complex input data sequence, g(n) = gR(n) + j * gI(n), by multiplying it by exp(-j * π * n^2 / 15) = cos(π * n^2 / 15) - j * sin(π * n^2 / 15) to obtain a(n) = aR(n) + j * aI(n). This requires 4 * 15 = 60 real multiplies and 2 * 15 = 30 real adds. The equations are (for n = 0, 1, ..., 14):

aR(n) = gR(n) * cos(π * n^2 / 15) + gI(n) * sin(π * n^2 / 15)
aI(n) = gI(n) * cos(π * n^2 / 15) - gR(n) * sin(π * n^2 / 15)
The complex data results are stored in the same locations from which the inputs were pulled. If no temporary registers are available, two additional memory locations, M(64) and M(65) (Figure 9-4), are used to store the values computed from multiplying the sine term by the input data, and the original data locations are used to store the values computed by multiplying the cosine term by the input data. Those values are then pulled from memory and added to form the output values a(n) = aR(n) + j * aI(n).
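The premultiplication equations above translate directly into code. The following C sketch computes the chirp constants on the fly; in an implementation that follows the text, they would be precomputed and stored as multiplier constants, and the function name here is illustrative.

#include <math.h>

/* Multiply the 15-point input g(n) by exp(-j*pi*n*n/15) to obtain a(n).
 * aR/aI may alias gR/gI, since each output uses only its own input sample.  */
void bluestein_premultiply15(const double gR[15], const double gI[15],
                             double aR[15], double aI[15])
{
    const double pi = 3.14159265358979323846;
    for (int n = 0; n < 15; n++) {
        double c = cos(pi * n * n / 15.0);
        double s = sin(pi * n * n / 15.0);
        aR[n] = gR[n] * c + gI[n] * s;   /* gR*cos + gI*sin  */
        aI[n] = gI[n] * c - gR[n] * s;   /* gI*cos - gR*sin  */
    }
}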
Append the 15 input data points, a(n) = aR(n) + j * aI(n), with 17 complex zeros to obtain a 32-point input sequence for the 32-point FFT. The 17 complex zeros are appended to the end of the actual data (i.e., n = 15, 16, ..., 31). The real zeros are stored in data memory locations 15 through 31, and the imaginary zeros in locations 47 through 63.
where k = 1 are zeros. Using the 2-point building block from Chapter 8 gives:

AR(0) = aR(0) + aR(1)        AR(1) = aR(0) - aR(1)
AI(0) = aI(0) + aI(1)        AI(1) = aI(0) - aI(1)

The aR(1) and aI(1) inputs to all 16 of the required 2-point building blocks (n = 0, 1, ..., 15) are zero. Therefore, the outputs of all of those 2-point building blocks are just the input data:

AR(0) = aR(0)        AR(1) = aR(0)
AI(0) = aI(0)        AI(1) = aI(0)
Using the labels from Step 2 of Stage 1 of the general mixed-radix algorithm, the k-th output (k = 0, 1) of the n-th 2-point building block (n = 0, 1, ..., 15) should be labeled BR(k * 16 + n) and BI(k * 16 + n) in preparation for input to the complex multiply portion of the mixed-radix algorithm. Specifically,

BR(k * 16 + n) = aR(n)        BR(k * 16 + n) ⇒ M(k * 16 + n)
BI(k * 16 + n) = aI(n)        BI(k * 16 + n) ⇒ M(k * 16 + n + 32)

The right column shows the corresponding memory mapping, based on the locations of the input data and taking advantage of the initial data mapping that saved room for the added zeros. Each aR(n) and aI(n) is stored in two memory locations in preparation for subsequent steps.
Step 2: Multiplication by FFT Complex Multipliers
Each BR(k * 16 + n) and BI(k * 16 + n) needs to be multiplied by the specific complex number required by the general mixed-radix algorithm prior to entering the 16-point portion of the 32-point algorithm. The equations for this complex multiplication for each k = 0, 1 and n = 0, 1, ..., 15 are:

DR(k * 16 + n) = BR(n) * cos(2π * k * n / 32) + BI(n) * sin(2π * k * n / 32)
DI(k * 16 + n) = BI(n) * cos(2π * k * n / 32) - BR(n) * sin(2π * k * n / 32)
If no temporary registers are assumed, each complex multiply requires two additional data memory locations to store the results of multiplying each input value by two different constants prior to forming and storing the output results. However, if the complex multiplies are performed sequentially, the same two additional memory locations can be reused for all of the complex multiplies. The result is the need for only two additional memory locations. The DR(k * 16 + n) and DI(k * 16 + n) are stored in the locations from which the BR(k * 16 + n) and BI(k * 16 + n) were pulled to perform the computations. This step requires 15 complex multiplies, which is 60 real multiplies and 30 real adds.
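The sequential in-place complex multiplies of this step can be sketched as follows. The scratch array stands in for the two additional data memory locations (such as M(64) and M(65)) used when no temporary registers exist; indexing B by k * 16 + n and computing the constants on the fly are assumptions of this sketch.

#include <math.h>

/* Multiply B(k*16+n) by cos(2*pi*k*n/32) - j*sin(2*pi*k*n/32) in place.
 * scratch[0] and scratch[1] play the role of the two extra data memory
 * locations used when the processor has no temporary registers.            */
void twiddle_multiply32(double BR[32], double BI[32], double scratch[2])
{
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < 2; k++) {
        for (int n = 0; n < 16; n++) {
            double c = cos(2.0 * pi * k * n / 32.0);
            double s = sin(2.0 * pi * k * n / 32.0);
            int idx = k * 16 + n;
            scratch[0] = BR[idx] * c + BI[idx] * s;   /* DR term              */
            scratch[1] = BI[idx] * c - BR[idx] * s;   /* DI term              */
            BR[idx] = scratch[0];                     /* store back in place  */
            BI[idx] = scratch[1];
        }
    }
}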
1. Use the 16-point equations directly from Section 9.7.5, except modify all of the data
labels aR(n) and aI(n) by adding 16 to them to obtain aR(n + 16) and aI(n + 16).
2. Add 16 to the memory addresses from Section 9.7.5 for all real data, except the
additional memory locations required in the middle of the computations.
3. Add 32 to the memory addresses for all imaginary data in Section 9.7.5, except
for the additional memory locations required in the middle of the computations.
4. For the additional memory locations in the middle of the computations in Section
9.7.5, add 32 to the memory location.
5. Relabel the output frequency components from Section 9.7.5 from AR(n) and AI(n) to AR(2 * n + 1) and AI(2 * n + 1).
Table 9-2 shows the output data addresses for the 16-point radix-4 FFT in Section 9.7.5 in
column 1 and the offset addresses for the first and second 16-point FFTs in columns 2 and
3, based on following Steps 2 and 3 of this stage. The two 16-point FFTs require 288 real
adds and 48 real multiplies.
Stage 6: Multiplication by Linear Filter Complex Multipliers
Multiply the 32 complex outputs of the data FFT (AR(i), AI(i)) by the 32 complex outputs of the unit pulse response FFT (HR(i), HI(i)) to obtain C(n) = CR(n) + j * CI(n). In general, this requires 32 complex multiplications, which is 4 * 32 = 128 real multiplies and 2 * 32 = 64 real adds. The equations are (for n = 0, 1, ..., 31):

CR(n) = AR(n) * HR(n) - AI(n) * HI(n)
CI(n) = AR(n) * HI(n) + AI(n) * HR(n)
Addressing convenience has resulted in imaginary parts AI(6), AI(7), AI(12), AI(13),
AI(22), AI(23), AI(24), AI(25), AI(26), AI(27), AI(28), and AI(29) being stored in the
lower half of allotted data memory and their corresponding real parts stored in the upper
half. It is convenient to correct this inconsistency during the complex multiply computa-
tions. Specifically, if the imaginary part of one of the AR(n) and AI(n) is stored in the lower
portion of the data memory, change this when the complex multiply outputs are stored so
that the real parts of all of the results are stored together in the lower portion of the memory
used for CR(n) and CI(n). These 32 complex multiplies require 128 real multiplies and 64
real adds.
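The complex multiply of this stage is a simple pointwise product, shown in the C sketch below (hypothetical function name, not the book's code); each product costs 4 real multiplies and 2 real adds, which gives the 128 and 64 totals quoted above.

#include <complex.h>

/* Hypothetical sketch of Stage 6: pointwise multiply of the data FFT A[]
 * by the unit pulse response FFT H[] to form C[] = A[] * H[]. */
static void filter_multiply(const double complex A[32],
                            const double complex H[32],
                            double complex C[32])
{
    for (int n = 0; n < 32; n++) {
        /* C_R = A_R*H_R - A_I*H_I,  C_I = A_R*H_I + A_I*H_R */
        C[n] = A[n] * H[n];
    }
}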
aR(k) = CR(2 * k + 1)
The inputs to the first 16-point IFFT are the outputs of the first 16-point FFT, modified
by complex multipliers. Therefore, these inputs occupy the same memory locations as the
outputs of the 16-point FFT. In general, the building-block algorithms do not have their
outputs in sequential memory addresses. Therefore, the inputs to the inverse 16-point FFT
will not be in sequential addresses, as was assumed in Chapter 8. However, the inputs to
the first 16-point IFFT do have all of their real parts in the first 16 memory locations and all
of their imaginary parts in memory locations 32 through 47. Likewise, the real parts of the
inputs to the second 16-point IFFT are in memory locations 16 through 31, and the imaginary
parts are in memory locations 48 through 63. With this in mind, data address relabeling from Section
9.4 is applied to the 16-point radix-4 memory mapping in Section 9.7.5.
Step 2: Computing the Two 16-Point IFFTs
If the labels from Step 2 of Stage 1 of the general mixed-radix algorithm are used, the
k-th output (k = 0, 1, ..., 15) of the n-th 16-point transform (n = 0, 1) should be labeled
eR(k * 2 + n) and eI(k * 2 + n) in preparation for input to the complex multiply portion
of the 32-point mixed-radix algorithm. In terms of the output labels, AR(n) and AI(n), for
the 16-point radix-4 FFT in Section 9.7.5, the outputs for the first 16-point FFT are:
The four columns in Table 9-3 are the remapping process for the first of the two 16-point
radix-4 IFFTs.
• Column 1 shows the data mapping out of the first 16-point input FFT.
• Column 2 shows the data mapping after the linear filter complex multiplications.
The data addresses are identical to those in column 1 except for the terms where
column 1 had the imaginary part at a lower address than the real part. In those cases,
the real and imaginary addresses were swapped during the complex multiplication
process.
• Column 3 shows the new memory addresses for each of the inputs to the first
16-point IFFT in terms of the data labeling found in Section 9.7.5.
• Column 4 shows the memory address for each of the first 16-point FFT's outputs,
based on the memory relabeling technique, and the definition of how they are
related to the actual output of the first stage of the required 32-point IFFT.
The four columns in Table 9-4 are the remapping process for the second of the two
16-point IFFTs.
Table 9-4 Output Memory Maps for 15-Point Bluestein Algorithm Example
AR(1) => M(16)    CR(1) => M(16)    aR(0) => M(16)     AR(0) == eR(1) => M(16)
AI(1) => M(48)    CI(1) => M(48)    aI(0) => M(48)     AI(0) == eI(1) => M(48)
AR(3) => M(24)    CR(3) => M(24)    aR(1) => M(24)     AR(1) == eR(3) => M(33)
AI(3) => M(56)    CI(3) => M(56)    aI(1) => M(56)     AI(1) == eI(3) => M(49)
AR(5) => M(20)    CR(5) => M(20)    aR(2) => M(20)     AR(2) == eR(5) => M(8)
AI(5) => M(52)    CI(5) => M(52)    aI(2) => M(52)     AI(2) == eI(5) => M(50)
AR(7) => M(60)    CR(7) => M(28)    aR(3) => M(28)     AR(3) == eR(7) => M(51)
AI(7) => M(28)    CI(7) => M(60)    aI(3) => M(60)     AI(3) == eI(7) => M(19)
AR(9) => M(18)    CR(9) => M(18)    aR(4) => M(18)     AR(4) == eR(9) => M(20)
AI(9) => M(50)    CI(9) => M(50)    aI(4) => M(50)     AI(4) == eI(9) => M(52)
AR(11) => M(26)   CR(11) => M(26)   aR(5) => M(26)     AR(5) == eR(11) => M(21)
AI(11) => M(58)   CI(11) => M(58)   aI(5) => M(58)     AI(5) == eI(11) => M(53)
AR(13) => M(54)   CR(13) => M(22)   aR(6) => M(22)     AR(6) == eR(13) => M(54)
AI(13) => M(22)   CI(13) => M(54)   aI(6) => M(54)     AI(6) == eI(13) => M(22)
AR(15) => M(30)   CR(15) => M(30)   aR(7) => M(30)     AR(7) == eR(15) => M(23)
AI(15) => M(62)   CI(15) => M(62)   aI(7) => M(62)     AI(7) == eI(15) => M(55)
AR(17) => M(17)   CR(17) => M(17)   aR(8) => M(17)     AR(8) == eR(17) => M(24)
AI(17) => M(49)   CI(17) => M(49)   aI(8) => M(49)     AI(8) == eI(17) => M(56)
AR(19) => M(25)   CR(19) => M(25)   aR(9) => M(25)     AR(9) == eR(19) => M(25)
AI(19) => M(57)   CI(19) => M(57)   aI(9) => M(57)     AI(9) == eI(19) => M(57)
AR(21) => M(21)   CR(21) => M(21)   aR(10) => M(21)    AR(10) == eR(21) => M(26)
AI(21) => M(53)   CI(21) => M(53)   aI(10) => M(53)    AI(10) == eI(21) => M(58)
AR(23) => M(61)   CR(23) => M(29)   aR(11) => M(29)    AR(11) == eR(23) => M(59)
AI(23) => M(29)   CI(23) => M(61)   aI(11) => M(61)    AI(11) == eI(23) => M(27)
AR(25) => M(51)   CR(25) => M(19)   aR(12) => M(19)    AR(12) == eR(25) => M(60)
AI(25) => M(19)   CI(25) => M(51)   aI(12) => M(51)    AI(12) == eI(25) => M(28)
AR(27) => M(59)   CR(27) => M(27)   aR(13) => M(27)    AR(13) == eR(27) => M(61)
AI(27) => M(27)   CI(27) => M(59)   aI(13) => M(59)    AI(13) == eI(27) => M(29)
AR(29) => M(55)   CR(29) => M(23)   aR(14) => M(23)    AR(14) == eR(29) => M(62)
AI(29) => M(23)   CI(29) => M(55)   aI(14) => M(55)    AI(14) == eI(29) => M(30)
AR(31) => M(31)   CR(31) => M(31)   aR(15) => M(31)    AR(15) == eR(31) => M(31)
AI(31) => M(63)   CI(31) => M(63)   aI(15) => M(63)    AI(15) == eI(31) => M(63)
• Column 1 shows the data mapping out of the second 16-point input FFT.
• Column 2 shows the data mapping after the linear filter complex multiplications.
The data addresses are identical to those in column 1 except for the terms where
column 1 had the imaginary part at a lower address than the real part. In those cases,
the real and imaginary addresses were swapped during the complex multiplication
process.
• Column 3 shows the new memory addresses for each of the inputs to the second
16-point IFFT, in terms of the data labeling found in Section 9.7.5.
• Column 4 shows the memory address for each of the second 16-point FFT's outputs,
based on the memory relabeling technique, and the definition of how they are related
to the actual output of the first stage of the required 32-point IFFT.
These two 16-point IFFTs require exactly the same number of computations as the 16-point
FFTs in Stage 5. Therefore, Stage 7 requires 288 real adds and 48 real multiplies.
Step 3: Performing Complex Multiplications
Each of the eR(k * 2 + n) and eI(k * 2 + n) needs to be multiplied by a specific complex
number prior to entering the 2-point portion of the 32-point algorithm. The equations for
this complex multiplication for each n = 0, 1 and k = 0, 1, ..., 15 are:
fR(k * 2 + n) = eR(k * 2 + n) * cos(2π * k * n/32) - eI(k * 2 + n) * sin(2π * k * n/32)
fI(k * 2 + n) = eI(k * 2 + n) * cos(2π * k * n/32) + eR(k * 2 + n) * sin(2π * k * n/32)
If no temporary registers are assumed, each complex multiply requires two additional
data memory locations to store the results of multiplying each input value by two different
constants prior to forming and storing the output results. However, if the complex multiplies
are performed sequentially, the same two additional memory locations can be reused for all
of the complex multiplies. The result is the need for only two additional memory locations.
Store the results of the complex multiplies back in the same locations that the inputs to the
complex multiplies were taken from. For n = 0, these complex multiplies are just multiplies
by 1. Therefore, one of the two 16-point IFFTs does not have its outputs modified prior
to computing the 2-point IFFTs. Since only 15 of these 16 complex outputs represent the
needed result in Stage 8, only 15 of the complex multiplies need to be performed. The total
number of computations for these 15 complex multiplies is 60 real multiplies and 30 real
adds.
1. For n = 15, 16, ..., 31, multiply y(n) = yR(n) + j * yI(n) by exp(-j * π * n²/15) =
cos(π * n²/15) - j * sin(π * n²/15) to obtain z(n).
2. For n = 15, 16, ..., 31, multiply z(n) by exp(-j * π * 15) = -1 to obtain q(n).
These two steps can be combined into a single complex multiply by multiplying the first
complex multiplier by -1 to obtain:
qR(n) = -yR(n) * cos(π * n²/15) - yI(n) * sin(π * n²/15)
qI(n) = -yI(n) * cos(π * n²/15) + yR(n) * sin(π * n²/15)
Again, if there are no temporary registers in the processor, then two additional memory
locations are required to perform the complex computations. However, if the complex
multiplies are performed sequentially, the same two additional memory locations can be
reused for all of the complex multiplies. The result is the need for only two additional
memory locations. Store the results of the complex multiplies back in the same locations
that the inputs to the complex multiplies were taken from.
Since only 15 of these 16 complex outputs represent the needed result in Stage 10,
only 15 of the complex multiplies need be performed. This is a total of 60 real multiplies
and 30 real adds.
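A C sketch of the combined chirp multiply is given below (hypothetical function name, not from the book). It follows the index range stated above; as noted, only 15 of the resulting complex multiplies are actually required, so an implementation can skip the unneeded ones.

#include <complex.h>
#include <math.h>

/* Hypothetical sketch of the combined chirp multiply:
 * q(n) = -y(n) * exp(-j*pi*n*n/15) for n = 15, ..., 31. */
static void chirp_multiply(const double complex y[32], double complex q[32])
{
    const double PI = 3.14159265358979323846;
    for (int n = 15; n < 32; n++) {
        double ang = PI * (double)(n * n) / 15.0;
        /* q_R = -y_R*cos - y_I*sin,  q_I = -y_I*cos + y_R*sin */
        q[n] = -y[n] * (cos(ang) - sin(ang) * I);
    }
}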
except when dedicated building blocks have been developed. The primary reason for this
is that the multiply-accumulators (Chapter 10) used in DSP chips (Chapter 14) are all
based on an architecture that does not allow the multiplier and accumulator to be used
independently. Since the Winograd algorithm separates adds from multiplies, it is diffi-
cult to make efficient use of these computational building blocks to compute the Winograd
algorithm.
The available Winograd building blocks (Chapter 8) are 2, 3, 4, 5, 7, 8, 9, and
16 points. Combining relatively prime sets of these allows the following 58 transform
lengths:
N = 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 21, 24, 28, 30, 35, 36, 40, 42, 45, 48,
56, 60, 63, 70, 72, 80, 84, 90, 105, 112, 120, 126, 140, 144, 168, 180, 210, 240, 252,
280, 315, 336, 360, 420, 504, 560, 630, 840, 1008, 1260, 1680, 2520, 5040
In the original derivation of the Winograd algorithm, the Winograd building blocks from
Chapter 8 were combined to form these 58 different transform lengths. However, the
technique can be extended to combining any building blocks that have all of their multiplies
in the center and just adds and subtracts for the input and output computations. This is why
the building blocks in Chapter 8 were configured in this format.
The general algorithm steps for computing the Winograd transform can be described
completely with just two building blocks. The result is a larger transform that still has all
of its multiplies in the center and only adds and subtracts on the input and output. The
larger transform can now be combined into a larger transform with a third building block
with the same technique for combining them that was used for the first two. This process
can be continued as long as the add-multiply-add architecture is followed and all of the
building blocks are relatively prime numbers. This process, using the general odd-number
algorithms in Section 8.11, increases the number of transform lengths for the Winograd
algorithm beyond the 58 listed. The only catch is that, since the non-Winograd building
blocks do not have the minimum number of multiplies, their combination into larger FFTs
does not result in a minimum number of multiplications.
Figure 9-6 is a Winograd algorithm block diagram for two factors, P and Q. Since
all of the N input data points are processed by the P- and Q-point stages, the N data points
must be separated into sets of P data points for the first input addition stage. There are
N / P = Q of these sets. Then the results from the first input addition stage must be divided
into sets of Q data points for processing by the second input addition stage.
In general, there are more outputs of the input adds than there are inputs. The result
is that there are more than N / Q = P sets of Q-point input adds to perform. If the order
of P and Q is reversed, there are P sets of Q-point input adds performed first, followed by
more than Q sets of P-point input adds. This implies that the total number of input adds
(all of the P and Q-point sets combined) changes as a function of which building block is
implemented first.
The k-th complex output of the m-th P-point input adds building block is labeled
bR((Q * k + P * m) mod N) and bI((Q * k + P * m) mod N), where k = 0, 1, ..., (M_P)
and m = 0, 1, ..., (Q - 1). Specifically, the outputs from the first (m = 0) P-point input
adds are bR(Q * k mod N) and bI(Q * k mod N), where k = 0, 1, 2, ..., M_P. The outputs
from the last (m = Q - 1) P-point input adds are bR((Q * k + P * (Q - 1)) mod N) and
bI((Q * k + P * (Q - 1)) mod N), where k = 0, 1, 2, ..., (P - 1). Figure 9-7 shows the
input adds data ordering for the first (m = 0) of these P-point input adds.
[Figure residue: the Q-point input adds map b(i) to c(i); the multiplier array maps c(0), ..., c(M_Q) to d(0), ..., d(M_Q); the output adds map d(i) to e(i) and on to A(i).]
Since there are Q of the P-point output adds blocks, this step requires Q * (number
of P-point building-block output adds) additions. There are P outputs from each of the
Q P-point output adds, for a total of Q * P outputs. The m-th output of the k-th P-point
building block is labeled AR[(Q * m + P * k) mod N] and AI[(Q * m + P * k) mod N],
where k = 0, 1, ..., (Q - 1) and m = 0, 1, ..., (P - 1). This set of computations is
represented by the fifth block from the left in Figure 9-6, and the results are shown more
explicitly in Figure 9-11 for the first (k = 0) P-point output adds stage.
The 15-point Winograd [2] algorithm can be implemented with either the 3-point
or the 5-point building blocks first. Like the prime factor and mixed-radix algorithms
in Sections 9.6 and 9.7, the order of the building blocks does not affect the number of
multiplications. However, unlike the prime factor and mixed-radix algorithms, the order
does affect the number of additions.
This example uses the Winograd 3- and 5-point building blocks. However, any of
the 3- and 5-point building blocks from Chapter 8 can be used because they were designed
to have an input add section, a central multiply section, and an output add section. From
the Comparison Matrix in Chapter 8, the 3-point Winograd building block has six input
adds, six output adds, and uses 3 for the number of multiply paths. The 5-point Winograd
building block has 16 input adds, 18 output adds, and uses 6 for the number of multiply
paths. Substituting these numbers into the equation for the number of computations gives
that the total number of real multiplications is 34 and the total number of real adds is
174 if the input portion of the 5-point Winograd building block is computed first. The total
number of real adds is 162 if the input add portion of the 3-point building block is computed
first.
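The nesting arithmetic behind these totals can be reproduced with the short C sketch below (a hypothetical reconstruction of the count formula, not the book's equation). The block whose input adds are computed first is p and the other is q; each multiply path except the trivial unity path is assumed to cost two real multiplies. With the Chapter 8 figures quoted above, it returns 162 or 174 adds and 34 multiplies.

#include <stdio.h>

/* Hypothetical nesting-count sketch for the 15-point Winograd FFT. */
struct block { int n, in_adds, out_adds, mul_paths; };

static int nested_adds(struct block p, struct block q)
{
    /* q.n copies of the p-point input/output adds, m_p copies of the
     * q-point input/output adds */
    return q.n * p.in_adds + p.mul_paths * q.in_adds
         + p.mul_paths * q.out_adds + q.n * p.out_adds;
}

static int nested_muls(struct block p, struct block q)
{
    /* all but the trivial unity path scale a complex value by a real
     * constant, i.e., two real multiplies each */
    return 2 * (p.mul_paths * q.mul_paths - 1);
}

int main(void)
{
    struct block w3 = { 3, 6, 6, 3 }, w5 = { 5, 16, 18, 6 };
    printf("3-point input adds first: %d adds, %d multiplies\n",
           nested_adds(w3, w5), nested_muls(w3, w5));   /* 162, 34 */
    printf("5-point input adds first: %d adds, %d multiplies\n",
           nested_adds(w5, w3), nested_muls(w5, w3));   /* 174, 34 */
    return 0;
}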
Figure 9-12 shows how the various portions of the 3- and 5-point Winograd building
blocks are nested to form the 15-point Winograd FFT. The various 3- and 5-point input and
output add blocks are labeled as they are below. The three distinct multiplier blocks are also
shown explicitly in Figure 9-12. This 15-point example requires 36 data memory locations
and 17 memory locations for multiplier constants.
The bR(i), bI(i), cR(i), cI(i), dR(i), dI(i), and eR(i), eI(i) used to label intermediate
results in the description of the general Winograd algorithm in Section 9.5.8 are different
from the intermediate result labels in this example. However, the computations and data
reorganization are identical. The labels in Section 9.5.8 were chosen to show the intercon-
nection pattern of the individual building blocks. The labels in this example were chosen to
identify as closely as possible with the 3-point and 5-point Winograd building block labels
in Chapter 8. The nonmodular nature of the different Winograd building blocks makes
complete commonality between these descriptions impossible.
[Figure 9-12: 15-point Winograd FFT block diagram — five 3-point input add blocks, three 5-point input add blocks, the central multiplier blocks (M3 and M5), three 5-point output add blocks, and five 3-point output add blocks, mapping inputs a(n) to outputs A(n).]
The 15 input data points must first be divided into five sets of 3 points to serve as
inputs to each of the 3-point algorithms. Following the addressing in Section 9.5.8, this is
done by starting with complex input data point aR(0), aI(0), and grouping it with complex
input data point pairs aR(5), aI(5) and aR(10), aI(10). These provide the input to the top
one of the five 3-point building blocks. This is followed by grouping the input data point
pairs aR(1), aI(1), aR(6), aI(6), and aR(11), aI(11) to provide the input for the second of
the five 3-point building blocks. The next grouping is data point pairs aR(2), aI(2), aR(7),
aI(7), and aR(12), aI(12) for input into the third of the five 3-point building blocks. The
next grouping is data point pairs aR(3), aI(3), aR(8), aI(8), and aR(13), aI(13) to provide
input for the fourth of the five 3-point building blocks. The final grouping is data point
pairs aR(4), aI(4), aR(9), aI(9), and aR(14), aI(14) for input into the fifth 3-point building
block. The addressing in Section 9.5.8 determines the order in which these data points enter
the 3-point input adds.
The strategy for converting these equations to code is to start at the top (compute
bR(1)) and identify the pair of inputs to be used first (in this case aR(5) and aR(10)). Then
look down the list to find the second (compute bR(2)) place where these two inputs are
used. Pull aR(5) and aR(10) from memory, compute bR(1) and bR(2), and store the results
in memory locations M(5) and M(10), previously occupied by aR(5) and aR(10). The next
step is to look at the next computation bI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps in Stage 1 have been computed and their
results stored in the Memory Map addresses.
First of Five 3-Point Algorithm Building-Block Input Adds
The inputs to these 3-point input adds are aR((5 * k + 3 * m) mod 15) and aI((5 * k + 3 *
m) mod 15) where m = 0. Performing the modulo arithmetic computations to determine
the inputs results in the inputs being aR(0), aI(0), aR(5), aI(5), aR(10), and aI(10) for
k = 0, 1, and 2. These input adds are represented in Figure 9-12 by the 3-point input adds
block labeled 0. Further, the labels on the left of this input add block correspond to the
input labels in the 3-point Winograd building block in Chapter 8.
adds. The input combinations and their resulting outputs are listed below and are based on
the addressing in Section 9.5.8.
The strategy for converting these equations to code is to start at the top (compute
tR(1)) and identify the pair of inputs to be used first (in this case bR(9) and bR(6)). Then
look down the list to find the second (compute BR(2)) place where these two inputs are
used. Pull bR(9) and bR(6) from memory, compute tR(1) and BR(2), and store the results
in memory locations M(12) and M(3), previously occupied by bR(9) and bR(6). The next
step is to look at the next computation tI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps in Stage 2 have been computed and their
results stored in the Memory Map addresses.
First of Three 5-Point Winograd Building-Block Input Adds
The inputs are bR(0), bI(0), bR(6), bI(6), bR(12), bI(12), bR(3), bI(3), bR(9), and
bI(9). They produce six complex outputs. There are many ways to allocate the additional
memory locations, tR(i), tI(i), required to store this additional complex output data value.
For this example they are located at M(30) and M(31). These input adds are represented
in Figure 9-12 by the 5-point input adds block labeled 0. Further, the labels on the left of
this input add block correspond to the input labels in the 5-point Winograd building block
in Chapter 8.
Multiplications for the Outputs of the Third Set of 5-Point Building-Block Input Adds
These multiplications are represented in Figure 9-12 by the bottom multiply block.
Algorithm Steps                                                              Memory Map
MR(10) = -dI(10) * sin(2π/3)                                                 MR(10) => M(25)
MI(10) = -dR(10) * sin(2π/3)                                                 MI(10) => M(10)
MR(11) = -cI(11) * [0.5 * cos(2π/5) + 0.5 * cos(4π/5) - 1] * sin(2π/3)       MR(11) => M(22)
MI(11) = -cR(11) * [0.5 * cos(2π/5) + 0.5 * cos(4π/5) - 1] * sin(2π/3)       MI(11) => M(7)
MR(13) = -cI(13) * [0.5 * cos(2π/5) - 0.5 * cos(4π/5)] * sin(2π/3)           MR(13) => M(16)
MI(13) = -cR(13) * [0.5 * cos(2π/5) - 0.5 * cos(4π/5)] * sin(2π/3)           MI(13) => M(1)
MR(17) = cI(15) * sin(4π/5) * sin(2π/3)                                      MR(17) => M(33)
MI(17) = cR(15) * sin(4π/5) * sin(2π/3)                                      MI(17) => M(32)
MR(12) = BR(12) * [sin(2π/5) + sin(4π/5)] * sin(2π/3)                        MR(12) => M(13)
MI(12) = -BI(12) * [sin(2π/5) + sin(4π/5)] * sin(2π/3)                       MI(12) => M(28)
MR(14) = -BR(14) * [sin(2π/5) - sin(4π/5)] * sin(2π/3)                       MR(14) => M(4)
MI(14) = BI(14) * [sin(2π/5) - sin(4π/5)] * sin(2π/3)                        MI(14) => M(19)
The strategy for converting these equations into code is to start at the top (compute
dR(1)) and identify the pair of inputs to be used first (in this case NR(0) and NR(5)).
Then look down the list to find the second place where these two inputs are used. In
this case, NR(5) is not used again and NR(0) is relabeled to become AR(0), one of the
outputs. Therefore, pull NR(0) and NR(5) from memory, compute dR(1), relabel NR(0)
as AR(0), and store the results in memory locations M(5) and M(0), previously occupied
by NR(5) and NR(0). The next step is to look at the next computation dI(1) on the list
and repeat the same set of steps. Continue this process until all the Algorithm Steps in
Stage 5 have been computed and their results stored in the Memory Map addresses.
First of Five 3-Point Building-Block Output Adds
These output adds are represented in Figure 9-12 by the 3-point output adds block
labeled 0. Further, the labels on the right of this output add block correspond to the output
labels in the 3-point Winograd building block in Chapter 8, for k = 0.
General Prime Factor Algorithm. Prime factor [3] algorithms are characterized
by a sequence of small-point building blocks, from Chapter 8, without complex multipliers
between. This sequence of building blocks is developed by factoring the transform length,
N, into two numbers, N = P * Q, and computing the N-point transform based on P- and
Q-point FFTs (Figure 9-13). Chapter 3 describes why that process works. If P or Q can
be further factored, say Q = R * S, then the Q-point transform can be constructed from
two building blocks (R- and S-point building blocks) with Figure 9-13 as a guide.
extreme, there are numerous orders in which those primes can be combined to form the
complete transform. The order of the building blocks determines the data reordering used
between the stages but does not affect the number of adds and multiplies.
Thirty-Point Example. There are three ways to factor 30 into two numbers (2 * 15,
3 * 10, 5 * 6). Therefore, the 30-point transform can be implemented, using the block
diagram in Figure 9-13, as any one of these sequences of two building blocks. In fact, each
of these choices can be implemented in two ways. The 2 * 15 option can be implemented
with either the 2- or 15-point transform first in Figure 9-13. However, in each case, one of
the two factors can be factored further into two factors. The result in all three cases is three
building blocks (2, 3, and 5 points). There are six ways of ordering these three numbers to
implement the 30-point FFT. To summarize, there are 12 ways to implement the 30-point
FFT independent of which algorithm is used for each building block. These are shown in
Table 9-5. The first six sequence choices only have two building blocks, indicated by N/A in
column S. The choice of building blocks from Chapter 8 for all but the 6-, 10-, and 15-point
FFTs provides additional options to optimize the implementation for an application.
Sequence choices P R S
1 2 15 N/A
2 15 2 N/A
3 5 6 N/A
4 6 5 N/A
5 3 10 N/A
6 10 3 N/A
7 2 3 5
8 2 5 3
9 3 2 5
10 3 5 2
11 5 2 3
12 5 3 2
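The count of 12 in Table 9-5 can be reproduced with the short C sketch below (hypothetical code, not from the book): six ordered two-factor splits of 30 plus the 3! orderings of the prime building blocks 2, 3, and 5.

#include <stdio.h>

int main(void)
{
    int two_factor = 0, three_factor = 0;

    for (int p = 2; p < 30; p++)         /* ordered splits 30 = P * R */
        if (30 % p == 0)
            two_factor++;                /* P = 2, 3, 5, 6, 10, 15 */

    for (int i = 0; i < 3; i++)          /* orderings of the primes 2, 3, 5 */
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                if (i != j && j != k && i != k)
                    three_factor++;      /* 3! = 6 permutations */

    printf("%d + %d = %d sequence choices\n",
           two_factor, three_factor, two_factor + three_factor);
    return 0;
}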
Section 9.6.2 describes how to determine the number of adds and multiplies for the
prime factor algorithm. Section 9.6.3 describes the general prime factor algorithm for two
factors. Then the next two sections give two prime factor algorithms, Kolba-Parks and
SWIFT, using 15-point transforms, so that their features can be most easily compared. The
primary difference between the two algorithms is the strategy for organizing the data and
then reorganizing it between the building blocks. The number of adds and multiplies, data
memory locations, and locations for multiplier constants is the same for both prime factor
algorithms.
used as the P-point building block inputs is different for nearly all transform lengths. The
equations for both input orderings are given. It is important to notice that the 15-point
examples actually use the same ordering of the input data. This is an exception to the
general rule.
For the Kolba-Parks [3] algorithm, the k-th input to the n-th P-point algorithm is
aR((k * Q + P * n) mod N) and aI((k * Q + P * n) mod N) (where k = 0, 1, ..., (P - 1)
and n = 0, 1, ..., Q - 1) from the input data sequence. Therefore, the zero-th (k = 0) input
to the n-th P-point building block is aR(P * n) and aI(P * n), where n = 0, 1, ..., (Q - 1).
Additionally, the subsequent inputs to the same P-point transform are separated by Q
samples because k is incremented to determine the sample. Figure 9-15 shows the inputs
for the second (n = 1) P-point building block.
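The Kolba-Parks input mapping is easy to state in code. The sketch below (hypothetical helper name, not from the book) returns the input index for the k-th input of the n-th P-point building block and reproduces the groupings used in the 15-point examples later in this section.

/* Hypothetical sketch of the Kolba-Parks input index: a((k*Q + P*n) mod N).
 * For N = 15, P = 3, Q = 5, block n = 0 reads a(0), a(5), a(10) and
 * block n = 1 reads a(3), a(8), a(13). */
static int kolba_parks_input(int k, int n, int P, int Q)
{
    int N = P * Q;
    return (k * Q + P * n) % N;
}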
For the SWIFT [4] algorithm, the k-th input to the n-th P-point building block is
aR((k * Q + (Q * d + 1) * n) mod N) and aI((k * Q + (Q * d + 1) * n) mod N), where c
and d are determined as the solution to the equation:
(9-2)
and define the output sequence for the SWIFT algorithm. For the 15-point SWIFT example
(P = 3 and Q = 5), the solution of Equation 9-2 is c = -2 and d = 1. Figure 9-16 shows
these inputs for the second (n = 1) P-point building block.
Step 2: Computing the Q P-Point Building Blocks
Use the complex input data points defined in Step 1 to compute each of the Q P-point
building blocks. Again, the two prime factor algorithms have different output data labeling.
The simplest approach to output labeling is to use the same modulo arithmetic scheme as
on the input. Therefore, for the Kolba-Parks algorithm, the k-th output of the n-th P-point
building block is labeled BR((k * Q + P * n) mod N) and BI((k * Q + P * n) mod N) (where
k = 0, 1, ..., (P - 1) and n = 0, 1, ..., (Q - 1)). Similarly, for the SWIFT algorithm,
the k-th output of the n-th P-point building block is BR((k * Q + (Q * d + 1) * n) mod N)
and BI((k * Q + (Q * d + 1) * n) mod N), where d is defined by Equation 9-2. Figures
9-15 and 9-16 show this labeling for the Kolba-Parks and SWIFT algorithms, respectively.
[Figure residue: two Q-point building-block diagrams. In the first, inputs B(0), B(P mod N), B(2*P mod N), ..., B((Q-1)*P mod N) produce outputs A(0), A(S mod N), A(2*S mod N), ..., A((Q-1)*S mod N). In the second, inputs B(0), B((Q*d+1) mod N), B(2*(Q*d+1) mod N), ..., B((Q-1)*(Q*d+1) mod N) produce outputs A(0), A(P*(1 mod Q)), A(P*(2 mod Q)), ..., A(P*((Q-1) mod Q)).]
is defined by Equation 9-2, n = 0, 1, ..., (Q - 1), and k = 0, 1, ..., (P - 1). Figure 9-18
shows this labeling for the first (k = 0) Q-point building block.
The 15-point Kolba-Parks [3] algorithm can be implemented with either the 3-point
or the 5-point building blocks first. If the 3-point transform is first, the 15 pieces of com-
plex input data are divided into five sets of three complex points, one for each of the
15/3 = 5 3-point building blocks. Following the 3-point transforms, the intermediate
results are reorganized into three sets of five pieces of complex data needed for input to
the 15/5 = 3 5-point building-block computations. The order does not affect how many
computations are required. This example uses the Singleton 3- and 5-point building blocks.
A smaller number of adds and multiplies would be required if the Winograd building blocks
were used.
If the Comparison Matrix in Chapter 8 and Equation 9-1 are used, the total number
of real adds required is 5 * 12 + 3 * 32 = 156 and the total number of real multiplies
is 5 * 4 + 3 * 16 = 68. The total amount of data memory required is driven by the 5-
point building block and is 32 locations. Explicitly, 30 locations are required for the 15
complex data points, plus 2 additional locations for the intermediate computations in the
5-point Singleton building block. Similarly, the 3-point Singleton building block has two
multiplier constants and the 5-point Singleton building block has four for a total of six
memory locations for multiplier constants. Figure 9-19 is a block diagram of this example.
The stages are as follows.
The 15 data points are divided into five sets of 3 points to serve as inputs to each
of the 3-point building blocks. This is done by using the addressing from Section 9.6.3,
starting with complex input data point pair aR(0), aI(0), and grouping it with complex
input data point pairs aR(5), aI(5) and aR(10), aI(10). These provide the input to the top
one of the five 3-point building blocks in Figure 9-19. This is followed by grouping the
input data point pairs aR(3), aI(3), aR(8), aI(8), and aR(13), aI(13) to provide the input
for the second of the five 3-point building blocks. The next grouping is data point pairs
aR(6), aI(6), aR(11), aI(11), and aR(1), aI(1) for input into the third of the five 3-point
building blocks. The next grouping is data point pairs aR(9), aI(9), aR(14), aI(14), and
aR(4), aI(4) to provide input for the fourth of the five 3-point building blocks. The final
grouping is data point pairs aR(12), aI(12), aR(2), aI(2), and aR(7), aI(7) for input into
the fifth 3-point building block.
The order in which this data is used for inputs to the 3-point building blocks is the key
point in removing the need for complex multipliers between the 3- and 5-point algorithms.
From Section 9.6.3, the complex input data for the k-th input to the m-th 3-point building
block is aR((5 * k + 3 * m) mod 15), aI((5 * k + 3 * m) mod 15), where k = 0, 1, and 2,
and m = 0, 1, 2, 3, and 4.
The five groups of computations, listed as (a) through (e), each perform the 3-point
building block. In this example, the Singleton 3-point algorithm building block from Chap-
ter 8 is used. All of these 3-point building blocks could also have been the Winograd 3-point
algorithm building block from Chapter 8. In fact, the five 3-point building blocks can be
any combination of these two 3-point algorithm building blocks. The outputs of each of the
[Figure 9-19: 15-point Kolba-Parks FFT block diagram — five 3-point FFTs feeding three 5-point FFTs, inputs a(n), outputs A(n).]
results in the inputs being aR(6), aI(6), aR(11), aI(11), aR(1), and aI(1) for k = 0, 1,
and 2. This set of computations is represented in Figure 9-19 by 3-point building block 2.
Further, the labels on the left and right of this building block correspond to the input and
output labels in the 3-point Singleton building block in Chapter 8.
Algorithm Steps                                   Memory Map
bR(6) = aR(11) + aR(1)                            bR(6) => M(11)
bR(11) = aR(11) - aR(1)                           bR(11) => M(1)
bI(6) = aI(11) + aI(1)                            bI(6) => M(26)
bI(11) = aI(11) - aI(1)                           bI(11) => M(16)
cR(6) = bR(6) * cos(2π/3) + aR(6)                 cR(6) => M(30)
BR(6) = aR(6) + bR(6)                             BR(6) => M(6)
cR(11) = bI(11) * sin(2π/3)                       cR(11) => M(16)
cI(6) = bI(6) * cos(2π/3) + aI(6)                 cI(6) => M(11)
BI(6) = aI(6) + bI(6)                             BI(6) => M(21)
cI(11) = -bR(11) * sin(2π/3)                      cI(11) => M(1)
BR(11) = cR(6) + cR(11)                           BR(11) => M(16)
BI(11) = cI(6) + cI(11)                           BI(11) => M(1)
BR(1) = cR(6) - cR(11)                            BR(1) => M(26)
BI(1) = cI(6) - cI(11)                            BI(1) => M(11)
Fourth of Five 3-Point Algorithm Building Blocks
The inputs to this 3-point building block are aR((5 * k + 3 * m) mod 15) and aI((5 * k +
3 * m) mod 15) where m = 3. Performing the modulo arithmetic computations results in
the inputs being aR(9), aI(9), aR(14), aI(14), aR(4), and aI(4) for k = 0, 1, and 2. This
set of computations is represented in Figure 9-19 by 3-point building block 3. Further, the
labels on the left and right of this building block correspond to the input and output labels
in the 3-point Singleton building block in Chapter 8.
Algorithm Steps                                   Memory Map
bR(9) = aR(14) + aR(4)                            bR(9) => M(14)
bR(14) = aR(14) - aR(4)                           bR(14) => M(4)
bI(9) = aI(14) + aI(4)                            bI(9) => M(29)
bI(14) = aI(14) - aI(4)                           bI(14) => M(19)
cR(9) = bR(9) * cos(2π/3) + aR(9)                 cR(9) => M(30)
BR(9) = aR(9) + bR(9)                             BR(9) => M(9)
cR(14) = bI(14) * sin(2π/3)                       cR(14) => M(19)
cI(9) = bI(9) * cos(2π/3) + aI(9)                 cI(9) => M(14)
BI(9) = aI(9) + bI(9)                             BI(9) => M(24)
cI(14) = -bR(14) * sin(2π/3)                      cI(14) => M(4)
BR(14) = cR(9) + cR(14)                           BR(14) => M(19)
BI(14) = cI(9) + cI(14)                           BI(14) => M(4)
BR(4) = cR(9) - cR(14)                            BR(4) => M(29)
BI(4) = cI(9) - cI(14)                            BI(4) => M(14)
step is to look at the next computation bI(1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps in Stage 2 have been computed and their
results stored in the Memory Map addresses.
First of Three 5-Point Building Blocks
This 5-point building block (k = 0) has BR((5 * k + 3 * m) mod 15) and BI((5 * k +
3 * m) mod 15) (m = 0, 1, 2, 3, and 4) as inputs and AR((10 * k + 6 * m) mod 15) and
AI((10 * k + 6 * m) mod 15) (m = 0, 1, 2, 3, and 4) as its output frequency components.
Performing the modulo arithmetic computations results in the inputs being BR(0), BI(0),
BR(3), BI(3), BR(6), BI(6), BR(9), BI(9), BR(12), and BI(12).
The multiplication portion of the algorithm requires two additional data memory lo-
cations because no temporary registers are assumed. The variables used for the intermediate
computations were chosen to be the same as those used for the 5-point Singleton building
block in Chapter 8 to make it easier to associate the computational steps with the discussion
in Chapter 8. This set of computations is represented in Figure 9-19 by 5-point building
block 0. Further, the labels on the left and right of this building block correspond to the
input and output labels in the 5-point Singleton building block in Chapter 8.
Algorithm Steps                                            Memory Map
bR(1) = BR(3) + BR(12)                                     bR(1) => M(3)
bI(1) = BI(3) + BI(12)                                     bI(1) => M(18)
bR(2) = BR(3) - BR(12)                                     bR(2) => M(12)
bI(2) = BI(3) - BI(12)                                     bI(2) => M(27)
bR(3) = BR(6) + BR(9)                                      bR(3) => M(6)
bI(3) = BI(6) + BI(9)                                      bI(3) => M(21)
bR(4) = BR(6) - BR(9)                                      bR(4) => M(9)
bI(4) = BI(6) - BI(9)                                      bI(4) => M(24)
cR(2) = bR(2) * sin(2π/5) + bR(4) * sin(4π/5)              cR(2) => M(30)
cI(2) = bI(2) * sin(2π/5) + bI(4) * sin(4π/5)              cI(2) => M(9)
cR(4) = bR(2) * sin(4π/5) - bR(4) * sin(2π/5)              cR(4) => M(31)
cI(4) = bI(2) * sin(4π/5) - bI(4) * sin(2π/5)              cI(4) => M(12)
cR(1) = bR(1) * cos(2π/5) + bR(3) * cos(4π/5) + BR(0)      cR(1) => M(27)
cI(1) = bI(1) * cos(2π/5) + bI(3) * cos(4π/5) + BI(0)      cI(1) => M(3)
cR(3) = bR(1) * cos(4π/5) + bR(3) * cos(2π/5) + BR(0)      cR(3) => M(24)
cI(3) = bI(1) * cos(4π/5) + bI(3) * cos(2π/5) + BI(0)      cI(3) => M(6)
AR(0) = BR(0) + bR(1) + bR(3)                              AR(0) => M(0)
AI(0) = BI(0) + bI(1) + bI(3)                              AI(0) => M(15)
AR(6) = cR(1) + cI(2)                                      AR(6) => M(27)
AI(6) = cI(1) - cR(2)                                      AI(6) => M(18)
AR(12) = cR(3) + cI(4)                                     AR(12) => M(24)
AI(12) = cI(3) - cR(4)                                     AI(12) => M(6)
AR(3) = cR(3) - cI(4)                                      AR(3) => M(12)
AI(3) = cI(3) + cR(4)                                      AI(3) => M(3)
AR(9) = cR(1) - cI(2)                                      AR(9) => M(9)
AI(9) = cI(1) + cR(2)                                      AI(9) => M(21)
The 15-point SWIFT [4] algorithm can be implemented with either the 3-point or the
5-point building blocks first. If the 3-point building block is first, the 15 pieces of complex
input data are divided into five sets of three complex points, one for each of the 15/3 = 5
3-point building blocks. Following the 3-point building blocks, the intermediate results are
divided into three sets of five pieces of complex data needed for input to the 15/5 = 3 5-point
building-block computations. This algorithm is similar to the Kolba-Parks algorithm but
uses a different data mapping strategy. The order does not affect how many computations
are required.
[Figure 9-20: 15-point SWIFT FFT block diagram — five 3-point FFTs feeding three 5-point FFTs, inputs a(n), outputs A(n).]
This example uses the Singleton 3- and 5-point building blocks. A smaller number
of adds and multiplies would be needed if the Winograd building blocks were used. If
the Comparison Matrix in Chapter 8 and the equation presented in the discussion of the
performance features for the prime factor algorithm are used, the total number of real adds
required is 5* 12+3*32 = 156, and the total number of real multiplies is 5*4+3* 16 = 68.
The total amount of data memory required is driven by the 5-point algorithm and is 32
locations. Explicitly, 30 locations are required for the 15 complex data points, plus 2
additional locations for the intermediate computations in the 5-point Singleton building
block. Similarly, the 3-point Singleton building block has two multiplier constants and the
5-point Singleton building block has four, for a total of six memory locations for multiplier
constants. The stages are as follows.
and aI(10) for k = 0, 1, and 2. This set of computations is represented in Figure 9-20 by
3-point building block 0. Further, the labels on the left and right of this building block cor-
respond to the input and output labels in the 3-point Singleton building block in Chapter 8.
the overriding criterion, then the Winograd algorithm building block should be used in place
of the 5-point Singleton building block.
Three sets of 5-point algorithm building-block Algorithm Steps from Chapter 8 are
presented. In Chapter 8 the 5-point algorithm building block was presented as three stages.
Since the features of the individual stages of the 5-point algorithm block are discussed
in Chapter 8, they are not discussed again. The m-th input to the k-th 5-point building
block is BR((5 * k + 6 * m) mod 15) and BI((5 * k + 6 * m) mod 15) from the previous
stage.
The multiply stage of the 5-point Singleton building block required additional data
memory locations under the set of constraints used in Chapter 8. If the 15-point computa-
tions are performed in the order shown, the additional memory locations used by the first
of the three 5-point building blocks can be reused by each of the other two 5-point building
blocks.
The strategy for converting these equations to code is to start at the top (compute
bR(1)) and identify the pair of inputs to be used first (in this case BR(6) and BR(9)). Then
look down the list to find the second (compute bR(2)) place where these two inputs are
used. Pull BR(6) and BR(9) from memory, compute bR(1) and bR(2), and store the results
in memory locations M(6) and M(9), previously occupied by BR(6) and BR(9). The next
step is to look at the next computation b/ (1) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps in Stage 2 have been computed and their
results stored in the Memory Map addresses.
First of Three 5-Point Building Blocks
This 5-point building block (k = 0) has BR((5 * k + 6 * m) mod 15) and
BI((5 * k + 6 * m) mod 15) (m = 0, 1, 2, 3, and 4) as inputs and AR((10 * k + 3 * m) mod 15)
and AI((10 * k + 3 * m) mod 15) (m = 0, 1, 2, 3, and 4) as its output frequency compo-
nents. Performing the modulo arithmetic computations to determine the inputs results in
the inputs being BR(0), BI(0), BR(6), BI(6), BR(12), BI(12), BR(3), BI(3), BR(9), and
BI(9).
The multiplication portion of the building block requires two additional data memory
locations because no temporary registers are assumed. The variables used for the inter-
mediate computations were chosen to be the same as those used for the 5-point Singleton
building block in Chapter 8 to make it easier to associate the computational steps with the
discussion of its features and memory mappings in Chapter 8. This set of computations is
represented in Figure 9-20 by 5-point building block 0. Further, the labels on the left and
right of this building block correspond to the input and output labels in the 5-point Singleton
building block in Chapter 8.
The result of factoring N into P *R *S is an algorithm that has a series of three building
blocks with complex multipliers between (Figure 9-22). The mixed-radix algorithm allows
this factoring process to stop at any point. The extreme case is to factor N until the building
blocks are only prime numbers. Even if N is factored to all prime numbers, there are
numerous orders in which those primes can be combined to form the complete transform.
The order of the building blocks determines the multiplier constants used between the stages
but does not affect the number of adds and multiplies.
Forty-five-Point Example. There are two ways to factor 45 into two numbers
(3 * 15 and 5 * 9). Therefore, the 45-point transform can be implemented by using the
block diagram in Figure 9-21. The 3 * 15 option can be implemented with either the 3- or
15-point transform first in Figure 9-21. However, for either the 3 * 15 or 5 * 9 case, the
second factor can be factored further. The result in all cases is three building blocks
(3, 3, and 5 points). There are three ways of ordering these three numbers to implement the
45-point FFT. To summarize, there are seven ways to implement the 45-point FFT using the
mixed-radix algorithm, without having to choose which algorithm to use for each building
block. These are shown in Table 9-6.
Sequence choices P R S
1 3 15 N/A
2 15 3 N/A
3 5 9 N/A
4 9 5 N/A
5 3 3 5
6 3 5 3
7 5 3 3
The first four sequence choices only have two building blocks, indicated by N/A under
column S. The choice of algorithm building blocks from Chapter 8, for all but the 15-point
FFT, provides additional options to optimize the implementation for the application. The
15-point FFT can be implemented with any of the algorithms in this chapter.
A derivation of the mixed-radix algorithm shows that the complex multipliers between
the P- and Q-point building blocks have a predictable pattern. If the complex multipliers
are viewed as connected to the output of the P-point building block, then:
1. The zeroth P-point building block has all 1's as output multipliers.
2. The outputs of the other (Q - 1) P-point building blocks have complex multipliers
for all but their top output D(n), which has 1 as the multiplier, for a total of P - 1
complex multiplies.
3. The complex multiplier at the k-th output, B(k * Q + n), of the n-th P-point
building block is cos(2 * π * k * n/N) - j * sin(2 * π * k * n/N), as shown in
Figure 9-23.
4. After multiplication, the k-th output, D(k * Q + n), of the n-th P-point building
block is connected to the n-th input of the k-th Q-point building block shown in
Figure 9-24.
[Figure 9-23: the n-th P-point building block — inputs a(n), a(Q+n), ..., a((P-1)*Q+n); the k-th output is multiplied by cos(2*π*k*n/N) - j*sin(2*π*k*n/N) to form D(k*Q+n).]
Comments 1 and 2, combined with Figure 9-23, show that there are Q - 1 of the
P-point building blocks that each have P - 1 complex multiplies on the output for a total
of (Q - 1) * (P - 1) complex multiplies.
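These observations translate directly into code. The sketch below (hypothetical helper names, not the book's code) generates the complex multiplier attached to the k-th output of the n-th P-point building block and counts the nontrivial multipliers, confirming the (P - 1) * (Q - 1) total.

#include <complex.h>
#include <math.h>

/* Multiplier on the k-th output of the n-th P-point block of an N-point
 * mixed-radix FFT: cos(2*pi*k*n/N) - j*sin(2*pi*k*n/N). */
static double complex mixed_radix_twiddle(int k, int n, int N)
{
    const double PI = 3.14159265358979323846;
    double ang = 2.0 * PI * (double)(k * n) / (double)N;
    return cos(ang) - sin(ang) * I;
}

/* Count multipliers that are not 1: k*n == 0 is the only trivial case
 * when k < P and n < Q, so the count is (P - 1)*(Q - 1). */
static int nontrivial_multipliers(int P, int Q)
{
    int count = 0;
    for (int n = 0; n < Q; n++)          /* one P-point block per n */
        for (int k = 0; k < P; k++)
            if (k != 0 && n != 0)
                count++;
    return count;
}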
If the N -point transform is further decomposed into three or more factors, say by
factoring Q, these same four facts determine the number of building blocks and complex
multiplier constants needed for each of the decomposed Q-point transforms. The only
change is to replace N with Q and to replace Q with Rand S, where Q = R * S. With this
information and the algorithm building blocks from Chapter 8, a complete block diagram
can be constructed for a transform of any length with several combinations of building
blocks.
[Figure 9-24: the k-th Q-point building block — inputs D(k*Q+0), D(k*Q+1), ..., D(k*Q+Q-1); outputs A(0*P+k), A(1*P+k), ..., A((Q-1)*P+k).]
in Stages 1 through 6 to form a two-factor decomposition. Then, for each of the P Q-point
building blocks, relabel its inputs as if they were Q consecutive complex data points and
reapply the two-factor decomposition algorithm to split the Q-point building block into two
factors. Each of those can be further subdivided with the same approach. The relabeling
scheme is given in Section 9.4.
The algorithm starts by grouping the input data points for each of the Q P-point
building blocks (Stage 1, Step 1) and computing the Q P-point building blocks with these
data subsets as inputs (Stage 1, Step 2). Then the outputs of the P-point building blocks
are multiplied by the proper complex numbers (Stage 2 and as shown in Figure 9-23). To
complete the algorithm, the outputs of the complex multiplications are reorganized and fed
to the P Q-point building blocks (Stage 3, Step 1 as shown in Figure 9-24). Finally, the P
Q-point building blocks convert their input data to the output frequency components (Stage
3, Step 2).
ing block. Since the complex multiplies are performed one at a time, only two additional
memory locations are required. In the 16-point radix-4 example (Section 9.7.5), the mul-
tiplies are all grouped together. This requires two additional memory locations for each
of the complex multiplies. The 16-point radix-8 and -2 example (Section 9.7.6) and the
15-point Singleton example (Section 9.7.7) reduce the added memory locations required
at the expense of interweaving adds with the multiplies. Details of the architectures in
Chapters 11 and 12 determine which approach is best for an application.
This stage has two steps. The first is to properly group the input data for each of
the P Q-point building blocks. The second is to compute each of the P Q-point building
blocks. The number of adds and multiplies required for this stage is P times the number
of adds and multiplies required for the chosen Q-point building block. Since the Q-point
building blocks are performed sequentially, any additional memory required for the Q-point
building block is only needed once. This is because each Q-point building block uses these
additional locations in sequence, not all at once. Therefore, the total memory required for
this portion of the algorithm is 2 * N for the data plus the additional locations needed for
one Q-point building block.
Step 1: Grouping the Input Data Points to the Q-Point Building Blocks
For the n-th input to the k-th Q-point building block, choose DR(k * Q + n) and
DI(k * Q + n) (where k = 0, 1, ..., (P - 1) and n = 0, 1, ..., (Q - 1)) from the input
data sequence. Each input to a Q-point building block comes from a different P-point
building-block output. Therefore, the data memory locations where the required input data
reside are not in the order assumed by the Q-point building blocks in Chapter 8. To further
complicate this, the output data memory address order for the P -point building blocks in
Chapter 8 is not in order. Therefore, to use the building-block algorithms from Chapter 8,
the specified data memory locations must be relabeled. This process is straightforward and
completely described in Section 9.4.
Step 2: Computing the P Q-Point Building Blocks
Use the complex input data points defined in Step 1 to compute each of the P Q-
point building blocks. The n-th output of the k-th Q-point building block should be labeled
AR(n * P + k) and AI(n * P + k). These are the final outputs of the N-point FFT.
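The two-factor data flow of Stages 1 through 3 can be checked end to end with a direct-DFT sketch. The C program below is hypothetical (it uses brute-force small DFTs rather than the Chapter 8 building blocks), but it follows the index pattern described here: the n-th P-point block reads a(i * Q + n), its k-th output is scaled by cos(2π * k * n/N) - j * sin(2π * k * n/N) and stored as D(k * Q + n), and the k-th Q-point block produces A(n * P + k). Its output should agree with a direct N-point DFT to within rounding error.

#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

static double complex W(int num, int den)        /* exp(-j*2*pi*num/den) */
{
    double ang = 2.0 * PI * (double)num / (double)den;
    return cos(ang) - sin(ang) * I;
}

static void mixed_radix_dft(const double complex *a, double complex *A,
                            int P, int Q)
{
    int N = P * Q;
    double complex *D = malloc((size_t)N * sizeof *D);
    if (!D) return;

    /* Stage 1: Q P-point DFTs; the n-th block reads a(i*Q + n), and its
     * k-th output is scaled by the twiddle W(k*n, N). */
    for (int n = 0; n < Q; n++)
        for (int k = 0; k < P; k++) {
            double complex sum = 0;
            for (int i = 0; i < P; i++)
                sum += a[i * Q + n] * W(i * k, P);
            D[k * Q + n] = sum * W(k * n, N);
        }

    /* Stage 3: P Q-point DFTs; the k-th block reads D(k*Q + n) and its
     * n-th output is the final A(n*P + k). */
    for (int k = 0; k < P; k++)
        for (int n = 0; n < Q; n++) {
            double complex sum = 0;
            for (int i = 0; i < Q; i++)
                sum += D[k * Q + i] * W(i * n, Q);
            A[n * P + k] = sum;
        }
    free(D);
}

int main(void)
{
    enum { P = 3, Q = 5, N = P * Q };
    double complex a[N], A[N], ref;
    for (int i = 0; i < N; i++) a[i] = cos(0.7 * i) + sin(0.3 * i) * I;

    mixed_radix_dft(a, A, P, Q);

    double err = 0.0;                    /* compare against the direct DFT */
    for (int m = 0; m < N; m++) {
        ref = 0;
        for (int i = 0; i < N; i++) ref += a[i] * W(i * m, N);
        err += cabs(A[m] - ref);
    }
    printf("total deviation from direct DFT: %g\n", err);
    return 0;
}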
The primes-to-a-power [5, 6] algorithm requires each FFT building block in Figures
9-21 or 9-22 to have the same algorithm building block. The power-of-two algorithms,
made popular by the 1965 Cooley and Tukey paper [6], are in this class. They are a set of
algorithms for computing an N-point DFT, where N = 2^P, and P is any positive integer.
For example, N = 64 (2^6), N = 256 (2^8), and N = 1024 (2^10). Since 4, 8, and 16 are
also powers of two, the 2-, 4-, 8-, or 16-point building blocks can be inserted into Figures
9-21 and 9-22 to produce a transform from this category. However, any of the other prime
algorithm building blocks could also have been used. For example, an 81-point transform
can be implemented by using four blocks with 3-point building blocks or two blocks with
9-point building blocks.
In Figure 9-21, the radix-4 16-point FFT has 4-point building blocks in each of two
stages (P = Q = 4). It is a five-stage process with 144 adds and 24 multiplications.
The equations for adds and multiplies in Section 9.7.2 imply the need for 146 real adds
and 36 real multiplies, based on the 4-point building block having 16 real adds and no
real multiplies. The actual numbers are reduced by taking advantage of some special-
case multiplier constants. Specifically, multiplication by cos(8π/16) + j * sin(8π/16) = j
requires no multiplication or addition, and multiplication by cos(4π/16) + j * sin(4π/16) =
(√2/2) * (1 + j) requires only two multiplications.
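The two special cases are easy to see in code. The snippets below (hypothetical helpers, not from the book) show that multiplying by j is only a swap and a sign change, and that multiplying by (√2/2) * (1 + j) needs just two real multiplies because the real and imaginary scale factors are equal.

#include <complex.h>
#include <math.h>

static double complex mul_by_j(double complex x)
{
    return -cimag(x) + creal(x) * I;          /* (a + jb)*j = -b + ja: no multiplies */
}

static double complex mul_by_w4(double complex x)   /* times (sqrt(2)/2)*(1 + j) */
{
    double c = sqrt(2.0) / 2.0;
    double a = creal(x), b = cimag(x);
    return c * (a - b) + c * (a + b) * I;     /* two multiplies, two adds */
}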
The storage requirements are 40 locations for data memory and 6 locations for mul-
tiplier constants. This is larger than required by the other mixed-radix algorithms, because
a different approach to complex multiplication was used in this example to illustrate the
difference in storage requirements. Namely, the approach used in this example computed
all of the multiplications required for the complex multiplies between the stages and stored
the results. Then the adds needed to complete the complex multiplies were performed. It
is the multiplies that cause the need for additional data memory locations. Each complex
multiply only requires two additional memory locations. Therefore, if each complex mul-
tiply is completed before proceeding to the next one, only two additional memory locations
are required, making the total 34 rather than 40 locations.
The data mapping shown next to the algorithm steps is an example. Specifically,
Stage 1 is the four 4-point building blocks that must be performed on the input. The next
two stages provide all of the complex multiplications required between Stages 1 and 3, and
the final stage performs the four 4-point output building blocks.
Figure 9-25 is a block diagram of this example that shows the data memory mapping
implemented in the detailed algorithm steps. Each 4-point building block is labeled to
identify it with the steps of each stage of computation. The numbers inside the left and
right edges of the 4-point building blocks are the corresponding input and output labels as
defined in Chapter 8. For example, a(12) is the complex input for the terms labeled aR(3)
and aI(3) in the 4-point building-block description in Chapter 8.
The radix-4 power-of-primes algorithm stages for a 16-point radix-4 FFT are as
follows.
Stage 1: Input 4-Point Building Blocks
This stage does not require additional data memory or accessing any of the multiplier
constants. Further, the add/subtract process is the same for all of the real and imaginary
pairs. The strategy for converting these equations to code is to start at the top (compute
bR(0)) and identify the pair of inputs to be used first (in this case aR(0) and aR(8)). Then
look down the list to find the second (compute bR(1)) place where these two inputs are
used. Pull aR(0) and aR(8) from memory, compute bR(0) and bR(1), and store the results
in memory locations M(0) and M(8), previously occupied by aR(0) and aR(8). The next
step is to look at the next computation bI(0) on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps in Stage 1 have been computed and their
results stored in the Memory Map addresses.
First of Four 4-Point Building Blocks
This set of computations is represented in Figure 9-25 by input 4-point building
block O. Further, the labels on the left and right of this building block correspond to the
input and output labels in the 4-point building block in Chapter 8.
[Figure 9-25: 16-point radix-4 FFT block diagram — four input 4-point building blocks, complex multipliers (powers of W), and four output 4-point building blocks, inputs a(n), outputs A(n).]
requirement. It was done in this algorithm so that the output of each of the computational
steps has the real part in the lower portion of data memory, and the imaginary part is in the
upper portion of data memory. Continue this process until all the Algorithm Steps in Stage
2 have been computed and their results stored in the Memory Map addresses.
The mixed powers-of-primes [7] algorithm computes a transform length that can be
written as one prime number raised to a power, but uses different algorithm building blocks
in the blocks in Figure 9-21, as long as they are all powers of the same prime number. For
example, an 81-point transform has five mixed power-of-primes implementations, namely
3 * 3 * 9, 3 * 9 * 3, 9 * 3 * 3, 3 * 27, and 27 * 3. The 16-point FFT can be implemented using
8-point and 2-point building blocks. Either the 2- or 8-point building blocks can be first,
and any of the 8-point building blocks can be used. This example has the 8-point building
blocks first.
The mixed power-of-primes 16-point FFT is a three-stage process with 148 adds and
28 multiplications. The reason these are lower than the general mixed-radix equation is that
some of the complex multiplies can be performed with fewer computations because of their
specific numerical values. Specifically, multiplication by cos(8π/16) + j * sin(8π/16) = j
requires no multiplication or addition, and multiplication by cos(4π/16) + j * sin(4π/16) =
(√2/2) * (1 + j) requires only two multiplications.
The storage requirements are 34 locations for data memory and 6 locations for mul-
tiplier constants. The input stage implements the 8-point radix-4 and -2 building block
from Section 8.8.2. Stage 2 implements the complex multiplications between Stages 1
and 3, and the output stage implements the eight 2-point building blocks from Section
8.3.
Figure 9-26 is a block diagram of this example. Each of the 8- and 2-point building
blocks is labeled to identify it with the steps of each stage of computations. The numbers
inside the left and right edges of the 8- and 2-point building blocks are the corresponding
input and output labels as defined in Chapter 8. For example, a(12) is the complex input for
the terms labeled aR(6) and aI(6) in the 8-point radix-4 and -2 building-block description
in Chapter 8.
The stages are described below.
[Figure 9-26: 16-point FFT block diagram — two 8-point building blocks, complex multipliers (powers of W), and eight 2-point building blocks, inputs a(n), outputs A(n).]
The Singleton mixed-radix [5] algorithm is the most general one. In Figure 9-21, any
of the algorithm building blocks from Chapter 8 can be placed in the FFT stages.
The I5-point Singleton mixed-radix algorithm can be implemented with either the
3-point or the 5-point building blocks first. If the 3-point building block is first, the 15
pieces of complex input data are divided into five sets of three complex points, one for each
of the 15/3 = 53-point transforms. Following the 3-point building blocks and complex
multiplies, the intermediate results are divided into three sets of five pieces of complex data
needed for input to the 15/5 = 3 5-point building-block computations. The order does not
affect the number of computations required.
Figure 9-27 is a detailed block diagram of this example. At the block diagram level,
any of the 3- and 5-point building blocks from Chapter 8 can be used. This example uses the
Singleton 3- and 5-point building blocks. A smaller number of adds and multiplies would
be needed if the Winograd building blocks were used.
Figure 9-27 Block diagram of the 15-point Singleton mixed-radix FFT example: five 3-point FFT building blocks followed by three 5-point FFT building blocks.
If the Comparison Matrix in Chapter 8 and the equation presented in Section 9.7.2
are used, the total number of real adds required is 5 * 12 + 3 * 32 + 2 * 2 * 4 = 172, and
the total number of real multiplies is 5 * 4 + 3 * 16 + 4 * 2 * 4 = 100. The total amount of
data memory required is driven by the 5-point building block and is 3 * 10 basic complex
data locations plus 2 temporary locations, for a total of 32 memory locations.
The 3-point Singleton building block has two multiplier constants (cos(2π/3) and
sin(2π/3)), the 5-point Singleton building block has four (cos(2π/5), sin(2π/5), cos(4π/5),
and sin(4π/5)), and the complex multiplies between the stages require eight constants that
are not already required by the 3- and 5-point building blocks (cos(2π/15), sin(2π/15),
cos(4π/15), sin(4π/15), cos(8π/15), sin(8π/15), cos(16π/15), and sin(16π/15)). This
is a total of 14 memory locations for multiplier constants.
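The bookkeeping above can be verified with the short Python sketch below, which uses the building-block counts quoted from the Chapter 8 Comparison Matrix; the variable names are illustrative only.

    P, Q = 3, 5                       # 15 = 3 * 5
    A_P, M_P = 12, 4                  # adds and multiplies for the 3-point Singleton building block
    A_Q, M_Q = 32, 16                 # adds and multiplies for the 5-point Singleton building block
    twiddles = (P - 1) * (Q - 1)      # 8 nontrivial complex multiplies between the stages

    adds = Q * A_P + P * A_Q + 2 * twiddles        # 60 + 96 + 16 = 172
    multiplies = Q * M_P + P * M_Q + 4 * twiddles  # 20 + 48 + 32 = 100
    between_stage_constants = 8                    # the eight cosine/sine values listed above
    constants = 2 + 4 + between_stage_constants    # 14 memory locations
    print(adds, multiplies, constants)             # 172 100 14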
Stage 1: Three-Point Building Blocks
The 15 data points must first be divided into five sets of 3 points to serve as inputs
to each of the 3-point building blocks. This is done by starting with complex input data
point pair aR(0), aI(0) and grouping it with complex input data point pairs aR(5), aI(5)
and aR(10), aI(10). These provide the input to the top one of the five 3-point building
blocks. This is followed by grouping the input data point pairs aR(1), aI(1), aR(6), aI(6),
and aR(11), aI(11) to provide the input for the second of the five 3-point building blocks.
The next grouping is data point pairs aR(2), aI(2), aR(7), aI(7), and aR(12), aI(12) for
input into the third of the five 3-point building blocks. The next grouping is data point
pairs aR(3), aI(3), aR(8), aI(8), and aR(13), aI(13) to provide input for the fourth of the
five 3-point building blocks. The final grouping is data point pairs aR(4), aI(4), aR(9),
aI(9), and aR(14), aI(14) for input into the fifth 3-point building block. In general, the
complex input data for the k-th input to the m-th 3-point building block are aR(5 * k + m),
aI(5 * k + m) where k = 0, 1, and 2, and m = 0, 1, 2, 3, and 4.
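This grouping rule is small enough to show directly. The Python sketch below simply lists the input indices a(5 * k + m) feeding each of the five 3-point building blocks; it is an illustration, not part of the Algorithm Steps.

    # The k-th input to the m-th 3-point building block is a(5*k + m).
    for m in range(5):                                  # five 3-point building blocks
        indices = [5 * k + m for k in range(3)]         # k = 0, 1, 2
        print("3-point building block", m, "inputs:", indices)
    # Block 0 uses a(0), a(5), a(10); block 1 uses a(1), a(6), a(11); and so on.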
The five groups of computations, listed as (a) through (e), each perform the 3-point
building block. In this example, the Singleton 3-point algorithm building block from Section
8.4.2 is used. All of these 3-point transforms could also have been the Winograd 3-point
algorithm building block from Chapter 8. In fact, the five 3-point transforms can be any
combination of the two 3-point algorithm building blocks. The outputs of each of the 3-
point building blocks, labeled BR(i) and BI(i) for i = 0, 5, and 10, are the equivalent of the
AR(i) and AI(i) in the 3-point building block in Chapter 8. To translate these data addresses
and data labels to each of the next four 3-point building blocks, add 1, 2, 3, and 4 to the
addresses and data labels.
The strategy for converting these equations to code is to start at the top (compute
bR(5)) and identify the pair of inputs to be used first (in this case aR(5) and aR(10)). Then
look down the list to find the second place (compute bR(10)) where these two inputs are
used. Pull aR(5) and aR(10) from memory, compute bR(5) and bR(10), and store the results
in memory locations M(5) and M(10), previously occupied by aR(5) and aR(10). The next
step is to look at the next computation, bI(5), on the list and repeat the same set of steps.
Continue this process until all the Algorithm Steps in Stage 1 have been computed and their
results stored in the Memory Map addresses.
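A minimal Python sketch of this conversion strategy follows. For illustration it assumes that bR(5) and bR(10) are the sum and difference of aR(5) and aR(10); the actual Algorithm Steps are those of the Singleton 3-point building block in Section 8.4.2.

    M = {n: float(n) for n in range(32)}   # stand-in contents of the 32 data memory locations

    aR5, aR10 = M[5], M[10]                # pull the pair of inputs from memory once
    bR5 = aR5 + aR10                       # first place the pair is used (assumed form)
    bR10 = aR5 - aR10                      # second place the same pair is used (assumed form)
    M[5], M[10] = bR5, bR10                # store back over aR(5) and aR(10)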
First of Five 3-Point Building Blocks
This set of computations is represented in Figure 9-27 by 3-point building block O.
Further, the labels on the left and right of this building block correspond to the input and
output labels in the 3-point Singleton building block in Section 8.4.2.
The complex multiplier to be applied to the k-th output of the m-th 3-point building
block, BR(5 * k + m) + j * BI(5 * k + m), is cos(2 * π * k * m/15) - j * sin(2 * π * k * m/15),
as shown in Figure 9-23. Assuming no temporary storage registers, the complex multiply
requires two additional data memory locations (M(30) and M(31)) if the results are to be
placed back in the same memory locations where the BR(5 * k + m) and BI(5 * k + m) were
accessed. The reason is that the real and imaginary parts, BR(5 * k + m) and BI(5 * k + m),
are multiplied by different constants and both results are used twice. Once one complex
multiply is performed, the two additional data memory locations (M(30) and M(31)) are
free to be used as the extra memory locations for the next complex multiply. Therefore,
only two additional data memory locations are required.
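The Python sketch below illustrates this in-place complex multiply. The memory locations chosen for BR and BI are arbitrary examples; only M(30) and M(31) are used as the two extra locations, and the results overwrite the originals.

    import math

    def complex_multiply_in_place(M, br_loc, bi_loc, cos_c, sin_c):
        # (BR + j*BI) * (cos - j*sin), using M(30) and M(31) as scratch locations.
        M[30] = M[br_loc] * cos_c + M[bi_loc] * sin_c   # real part into scratch
        M[31] = M[bi_loc] * cos_c - M[br_loc] * sin_c   # imaginary part into scratch
        M[br_loc], M[bi_loc] = M[30], M[31]             # copy back over BR and BI

    M = {n: 0.0 for n in range(32)}
    M[6], M[21] = 1.0, 2.0                              # example BR and BI values
    k, m = 1, 1                                         # first nontrivial multiplier
    complex_multiply_in_place(M, 6, 21,
                              math.cos(2 * math.pi * k * m / 15),
                              math.sin(2 * math.pi * k * m / 15))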
Many of the Algorithm Steps in this stage are just renaming the intermediate results.
This is done to make all of the intermediate results labels into the next stage have the same
letter, D. For those Algorithm Steps that perform multiplication, the data is pulled from
memory, the computation performed, and the results stored back in the same location. This
stage's computations are as follows.
First 3-Point Building-Block Output Complex Multiplies
When m = 0, the complex multiplier is 1, which requires no multiplication. The
first four lines are a redefinition of the data variables so that the inputs to the output 5-point
building blocks all use the same variable names. The final three lines are used to reverse
the data memory locations of the real and imaginary parts of the last output of the zero-th
3-point building block. This rearrangement is not required. However, for this example,
all of the real and imaginary parts that will be inputs to the 5-point building blocks are
reordered so that the real part appears in the lower half of data memory and the imaginary
parts appear in the upper half of data memory.
Algorithm               # of adds                 # of multiplies           # of data locations            # of const. locations

Convolution
  Bluestein             2*M + 10*N + 4*AM/2       4*M + 16*N + 4*MM/2       M + DM/2                       4*N + 3*M + CM/2
  Winograd              Q*Ap + (Mp + 1)*AQ        (Mp + 1)*(MQ + 1) - 1     Dp*DQ                          (Mp + 1)*(MQ + 1) - 1

Prime Factor            Q*Ap + P*AQ               Q*Mp + P*MQ               2*P*Q + greatest of            Cp + CQ
                                                                            DQ - 2*Q and Dp - 2*P

Mixed-Radix
  Primes-to-a-power     2*(P-1)*(P-1) + 2*P*Ap    4*(P-1)*(P-1) + 2*P*Mp    2*P*P + greatest of            (P-1)*P + Cp
                                                                            Dp - 2*P and 2
  Mixed power-of-primes 2*(P-1)*(Q-1)             4*(P-1)*(Q-1)             2*P*Q + greatest of            (P-1)*(2*Q - P)
                        + Q*Ap + P*AQ             + Q*Mp + P*MQ             DQ - 2*Q and Dp - 2*P and 2    + Cp + CQ
  Singleton             2*(P-1)*(Q-1)             4*(P-1)*(Q-1)             2*P*Q + greatest of            (P-1)*(2*Q - P)
                        + Q*Ap + P*AQ             + Q*Mp + P*MQ             DQ - 2*Q and Dp - 2*P and 2    + Cp + CQ
Key to Variables
N = number of points in an FFT
M = number of FFT and IFFT points used to implement an N-point Bluestein algorithm
AM/2 = number of adds in M/2-point FFT used for N-point Bluestein algorithm
MM/2 = number of multiplies in M/2-point FFT used for N-point Bluestein algorithm
DM/2 = number of memory locations used for data in M/2-point FFT used for N-point Bluestein algorithm
CM/2 = number of memory locations used for constants in M/2-point FFT used for N-point Bluestein algorithm
P = number of points in the first building block of an N = P * Q-point FFT
Mp = number of multiplies required for P-point building block of N = P * Q-point FFT
Ap = number of adds required for P-point building block of N = P * Q-point FFT
Dp = number of memory locations used for data in P-point building block of N = P * Q-point FFT
Cp = number of memory locations used for constants in P-point building block of N = P * Q-point FFT
Q = number of points in the second building block of an N = P * Q-point FFT
MQ = number of multiplies required for Q-point building block of N = P * Q-point FFT
AQ = number of adds required for Q-point building block of N = P * Q-point FFT
DQ = number of memory locations used for data in Q-point building block of N = P * Q-point FFT
CQ = number of memory locations used for constants in Q-point building block of N = P * Q-point FFT
Algorithm                 # of adds    # of multiplies    # of data locations    # of const. locations

Convolution
  15-point Bluestein      790          464                72                     162
  15-point Winograd       162          34                 36                     17
Prime Factor
  15-point Kolba-Parks    156          68                 32                     6
  15-point SWIFT          156          68                 32                     6
Mixed-Radix
  16-point radix 4        144          24                 40*                    6
  16-point radix 8 and 2  148          28                 34                     6
  15-point Singleton      172          100                32                     14**

* See Section 9.7.5 for why this does not match the formula in the Comparison Matrix in Table 9-7.
** See Section 9.7.7 for why this does not match the formula in the Comparison Matrix in Table 9-7.
9.9 CONCLUSIONS
The algorithms detailed here have memory map relabeling instructions that will work for
every algorithm building block in Chapter 8. Seven examples give detailed memory maps,
with the relabeling incorporated, for each algorithm step. They have accompanying block
diagrams to illustrate the data reorganization needed to combine small-point transforms in
the examples and four general algorithms. These block diagrams help to see how to distribute
data and algorithms on multiprocessor architectures that are explained in Chapter 12.
The next three chapters can be skipped if it is clear that a single processor will ade-
quately compute the algorithm. However, if multiple processors are required, the next three
chapters provide the information needed to learn how to map algorithms on multiprocessor
architectures.
REFERENCES
[1] L. I. Bluestein, "A Linear Filtering Approach to the Computation of Discrete
Fourier Transform," IEEE Transactions on Audio and Electroacoustics, Vol. AU-18,
pp. 451-455 (1970).
[2] S. Winograd, "On Computing the Discrete Fourier Transform," Mathematics of Com-
putation, Vol. 32, No. 141, pp. 175-199 (1978).
[3] D. P. Kolba and T. W. Parks, "A Prime Factor FFT Algorithm Using High-Speed
Convolution", IEEE Transactions Acoustics, Speech, and Signal Processing, Vol.
ASSP-25, No.4, pp. 281-294 (1977).
[4] Patent number 4,293,921, October 6, 1981, Method and Signal Processor for Fre-
quency Analysis of Time Domain Signals, Winthrop W. Smith, Jr.
[5] R. C. Singleton, "An Algorithm for Computing the Mixed Radix Fast Fourier Trans-
form," IEEE Transactions on Audio and Electroacoustics, Vol. AU-17, pp. 93-103
(1969).
[6] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex
Fourier Series," Mathematics of Computation, Vol. 19, p. 297 (1965).
[7] J. W. Cooley, "The Structure of FFT Algorithms," IEEE International Conference on
Acoustics, Speech and Signal Processing Tutorial Session, pp. 12-14 (1990).
10
Arithmetic Building Blocks for Architectures
10.0 INTRODUCTION
Arithmetic building blocks are adders and multipliers combined in different ways that affect
their cost and speed. This chapter does not contain a Comparison Matrix because these
building blocks will already be imbedded in the processors by their vendors. Their memory
and bus configurations are explained in Chapter 11. Arithmetic building blocks fall into
three categories:
• Bit slice
• Integrated arithmetic
• Special purpose
The first two categories are known as general-purpose building blocks. Because most
applications require more than just the computation of FFTs, general-purpose arithmetic
architectures are typically used to allow the non-FFT functions to be computed on the same
processor.
As a rule-of-thumb, if a DSP application requires more than four programmable DSP
chips, and the FFT portion of the computations can be separated onto a dedicated processor,
then a special-purpose arithmetic architecture, such as a hardware implementation of a 2-
point FFT, is used for the dedicated processing. Once the special-purpose FFT architecture
is part of an application, two things often happen. First, the number of programmable
DSP chips can be reduced. Second, other functions being done on the programmable DSP
chip, such as linear filtering and pattern matching, are often performed in the frequency
domain (Chapter 6) using the special-purpose hardware, further reducing the number of
programmable DSP chips needed.
All FFT algorithms have addition and multiplication steps. Sections 10.1.1 through 10.1.5
define five performance measures that can be used to characterize the following:
• How the data enters and leaves the arithmetic building block
• How the adder and multiplier are connected inside the building block
• How long it takes to perform adds and multiplies once the data is inside the building
block
Since adders and multipliers each have two inputs, it is also vital to know whether two
pieces of data to be added or multiplied can be entered into the building block simultaneously.
If entry must be done sequentially, knowing the order of the sequence is important. Input
data organization is described for each of the arithmetic building-block architectures and
explained for each DSP chip in Chapter 14.
When a building block has both an adder and a multiplier, there are two potential
outputs. It is important to know whether the building block has separate outputs for the
adder and multiplier, a single output for both, or a single output that can be multiplexed
between the adder and multiplier. This performance measure has a significant effect on
how flexible the building block is for computing FFT algorithms. Output data organization
is described for each of the arithmetic building-block architectures and explained for each
DSP chip in Chapter 14.
How the adder and multiplier are connected by a bus, within an arithmetic building
block, affects how much an algorithm loads the bus. The most common internal data bus
configuration is a multiplier-accumulator (Figure 10-4). In that configuration the input data
goes to the multiplier and the output comes from the adder. The output of the multiplier
and the delayed adder output are the two inputs to the adder. Internal data bus loading is
described for each arithmetic building-block architecture and explained for each DSP chip
in Chapter 14.
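The data flow of this configuration can be modeled with the short Python sketch below; it is an illustration of the multiplier-accumulator structure, not a description of any particular chip.

    def multiply_accumulate(data, coefficients):
        accumulator = 0.0                          # the delayed adder output
        for x, c in zip(data, coefficients):
            product = x * c                        # multiplier: input data and constant
            accumulator = accumulator + product    # adder: product plus delayed adder output
        return accumulator                         # single output taken from the adder

    print(multiply_accumulate([1.0, 2.0, 3.0], [0.5, 0.25, 0.125]))   # 1.375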
Throughput is the number of adds and multiplies per second that the arithmetic build-
ing block can perform if input data is supplied as fast as the building block can process it.
Since the number of required adds and multiplies is a key performance measure of FFT al-
gorithms, the ability to execute those arithmetic computations is an important performance
measure. Throughput is described for each of the arithmetic building blocks and explained
in more detail in Chapter 12 for algorithm mappings.
Latency is entirely different from throughput. Latency is the delay between when
data enters the arithmetic building block and when answers are ready to be output. Latency
becomes important in applications where the time it takes a system to respond to input data
is critical. In a radar altimeter, if the plane is flying close to the ground, short latency is
important in order to detect rapidly any substantial loss of altitude. Latency is described
for each of the arithmetic building-block architectures and explained in Chapter 12 for
algorithm mappings.
The results of the second and third multiplies and their sum have nonzero digits that are
in the same locations as nonzero digits from the result of the first multiply. This approach
requires four 4-bit multiplies and three 8-bit adds to obtain the results. This replaces doing
one 16-bit multiply in order to reduce hardware. However, it increases computation time
because of the sequence of operations that replace one 16-bit multiply.
The advantage of this architecture is that the multipliers and adders do not handle as
many bits simultaneously. This was very important in the past, but is less important now
because low-power full multipliers are commonly available. However, the technique can
still be used to provide ultrafast arithmetic computations.
10.2.1 Multiplier
Equation 10-2 describes the functions that must be performed by the simplest bit-slice
multiplier. For example, an 8-bit multiply can be performed by this equation using two 4-bit
(M = 4), bit-slice multipliers. Similarly, a 16-bit multiply requires two 8-bit (M = 8)
bit-slice multipliers using Equation 10-2.
Clearly, the technique can be extended to combining any number of bit-slice multi-
pliers to form a larger multiplier. The algorithm is defined by writing the individual data
words as their bit-slice components and then performing all of the required multiplies and
adds. Equation 10-4 is an example for combining four 4-bit (M = 4) bit-slice multipliers
into one large 16-bit multiply.
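The decomposition described above can be illustrated with the Python sketch below, which forms an 8-bit by 8-bit product (a 16-bit result) from four 4-bit multiplies and three adds. It shows the idea only; it is not a transcription of Equation 10-4.

    def bit_slice_multiply(a, b):
        a1, a0 = a >> 4, a & 0xF               # 4-bit slices of the first 8-bit word
        b1, b0 = b >> 4, b & 0xF               # 4-bit slices of the second 8-bit word
        p0 = a0 * b0                           # four 4-bit multiplies
        p1 = a0 * b1
        p2 = a1 * b0
        p3 = a1 * b1
        middle = p1 + p2                       # first add
        return (p3 << 8) + (middle << 4) + p0  # two more adds, with the slices properly scaled

    assert bit_slice_multiply(0xB7, 0x5C) == 0xB7 * 0x5C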
10.2.2 Multiplier-Accumulator
There are two types of bit-slice multiplier-accumulators. The first was shown in Fig-
ure 10-2 as a way of implementing a bit-slice multiply algorithm sequentially. The second
type is used to compute the sums of products of numbers. The core of this second type
of architectural building block is the bit-slice multiplier. To it is added a bit-slice adder.
Equations 10-5 and 10-6 are the bit-slice adder equivalents of Equations 10-2 and 10-4.
Notice that the algorithm for implementing bit-slice addition is considerably simpler than
bit-slice multiplication.
Figure 10-4 shows the most common multiplier-accumulator block diagram. All of the
programmable DSP chips in Chapter 14 use this basic architecture with varying degrees of
bells and whistles to enhance performance for a particular manufacturer's perceived market.
One example is the number of bits in the accumulator, depending on the anticipated number
of multiply-accumulates required to compute results for particular algorithms. To ensure
that a fixed-point accumulator does not overflow, it needs to have at least log2 N bits more
than the multiplier output that feeds it, if N multiplies must be accumulated prior to storing
results.
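As a worked example of this rule, the sketch below computes the accumulator width for several values of N; the 32-bit multiplier output width is an arbitrary illustration.

    import math

    multiplier_output_bits = 32                # e.g., the output of a 16 x 16 multiplier
    for N in (64, 256, 1024):
        guard_bits = math.ceil(math.log2(N))   # extra bits so N accumulations cannot overflow
        print(N, "accumulations ->", multiplier_output_bits + guard_bits, "accumulator bits")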
In applications that require more than four programmable DSP chips to perform the power-
of-two FFT computations, hardware that has an architecture dedicated to FFT computations,
special-purpose chips, should be used. The special-purpose FFT chips in Section 14.7 do
power-of-two FFTs much faster than programmable DSP chips, because the common build-
ing blocks of FFT algorithms are imbedded in the hardware. For the power-of-two FFT
algorithms in Section 9.7, the common arithmetic building block is the 2-point-building-
block algorithm. Building blocks for non-power-of-two algorithms have not become pop-
ular because these algorithms are not common and because they require several building
blocks, not a single one. Section 14.7 describes chips that have been built to implement the
2-, 4-, and 8-point building blocks from Chapter 8.
Since FFT equations assume complex inputs, the 2-point building block assumes
complex input data. The 2-point building block can be implemented in full parallel form with
two complex input signals entering the hardware simultaneously, or it can be implemented
in half-complex form, where the real portion of the two input signals enters the arithmetic
building block first, followed by the imaginary part. The linearity of FFTs allows this
sequential computation, followed by a recombination of the results (Section 2.3.3).
Two forms of the 2-point FFT building block have been developed to implement
the two approaches to decomposing the DFT to form the power-of-two FFT. The data
separation pattern for each of these approaches is presented in Section 10.4.1. Then the
2-point building-block hardware for each approach is presented in Sections 10.4.2 and
10.4.3.
The first FFT data separation approach is called decimation in time (DIT). In the
DIT algorithm, which is used in Chapters 8 and 9, the input samples are first reordered
into two subsets of input samples, one containing the odd-numbered samples and the other
the even-numbered ones, shown in Figure 10-5 as the 1st decimation in time. Then each
of these subsets is further reordered by taking every other one of its members and putting
it into a new subset, shown in Figure 10-5 as the 2nd decimation in time. Once the data
reordering is complete, the paired input data samples are used as the inputs to the 2-point
FFT building block from Section 8.3. Since the input data sequences are usually thought
of as sequences in time, they are being decimated in time by this reordering process.
The second approach, decimation in frequency (DIF), also starts by segmenting the
input sequence into two subsets of data. The difference is that this algorithm puts the first
half of the samples in the first subset and the second half in the second subset, shown in
Figure 10-6 as the 1st decimation in frequency. The next step in the algorithm segments
each of these subsets into new subsets, again by putting the first half of its members in the
first subset and the rest in the other subset. This process is shown in Figure 10-6 as the
2nd decimation in frequency. These four subsets are the inputs to the first set of 2-point
FFTs from Section 8.3. The outputs of the first set of 2-point FFTs are reordered following
this same strategy. This process continues until the output frequencies are reached. At
the output, the output frequency components are in subsets of even- and odd-numbered
frequencies. Therefore, the output frequencies have been decimated, which led to calling
this approach decimation in frequency.
Figure 10-5 Decimation-in-time data reordering for an 8-point FFT: input order, 1st decimation in time, 2nd decimation in time, and the 1st 2-point FFT stage.
Figure 10-6 Decimation-in-frequency data reordering for an 8-point FFT: input order, 1st decimation in frequency, 2nd decimation in frequency, and the 1st 2-point FFT stage.
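The two data-separation patterns can be summarized with the short Python sketch below for an 8-point transform; it reproduces the groupings described above and shown in the two figures.

    def split_in_time(indices):
        return [indices[0::2], indices[1::2]]      # even-numbered samples, then odd-numbered

    def split_in_frequency(indices):
        half = len(indices) // 2
        return [indices[:half], indices[half:]]    # first half, then second half

    samples = list(range(8))
    dit = [pair for subset in split_in_time(samples) for pair in split_in_time(subset)]
    dif = [pair for subset in split_in_frequency(samples) for pair in split_in_frequency(subset)]
    print("DIT 2-point FFT inputs:", dit)   # [[0, 4], [2, 6], [1, 5], [3, 7]]
    print("DIF 2-point FFT inputs:", dif)   # [[0, 1], [2, 3], [4, 5], [6, 7]]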
The flow graph for the DIT 2-point hardware building block is shown in Figure 10-7
(on page 254). One advantage of this algorithm over the decimation-in-frequency algo-
rithm is that it is organized to work easily with multiplier-accumulator arithmetic building
blocks.
The flow graph for the DIF 2-point hardware building block is shown in Figure
10-8. The primary difference between this and the DIT flow graph is the multiplier on the
output rather than the input. While this appears to cause problems with using multiplier-
accumulator building blocks, it does not. The reason is that most FFT applications require a
weighting function prior to the FFT. This weighting function multiplier is then added to the
front end of the flow graph in Figure 10-8 for the first stage, and then the back-end multiplier
is moved to the front end of the next 2-point building block of the FFT algorithm.
Figure 10-7 Decimation-in-time 2-point building-block flow graph.
Figure 10-8 Decimation-in-frequency 2-point building-block flow graph.
10.5 CONCLUSIONS
Prior to the introduction of programmable DSP chips, a detailed understanding of arith-
metic building blocks was crucial in the creation of DSP processors on boards. This was
because the number of processor clock cycles required to perform multiplies was signif-
icantly higher than for additions. Arithmetic building blocks are now imbedded in DSP
chips. Understanding the nuances of how chip manufacturers connect the multipliers and
accumulators helps in the selection of an algorithm from Chapters 8 and 9.
11
Multiprocessor Architectures
11.0 INTRODUCTION
There are two popular single-processor architectures. The first, called Von Neumann [1],
has only one bus and uses it to interconnect the arithmetic unit to the rest of the processor.
The arithmetic unit is used for all algorithm computations and data address generation. The
single bus and arithmetic unit are shared at each step for FFT arithmetic computations and
data addressing. This "Von Neumann bottleneck" stimulated development of the second
type of single processor, called Harvard. This architecture has separate arithmetic and
addressing hardware and buses to alleviate the bottleneck. All the chips in Chapter 14 are
Harvard architectures. Section 11.1.1 presents the Von Neumann architecture to illustrate
specifically the inefficiencies associated with using it for signal processing applications.
11.1 TWO SINGLE PROCESSORS
11.1.1 Von Neumann
The Von Neumann architecture (Figure 11-1) has been the most popular approach to
standard computers for many years because of its simplicity. This architecture has:
• One arithmetic unit shared between address generation and arithmetic computations
• One memory shared between data, constants, and program instructions
• One bus used for moving data addresses and instructions
The arithmetic unit includes not only the adder and multiplier for data computations but
the "next instruction address," "present instruction," and "present data address" registers,
as well as the logic for executing instructions.
Figure 11-1 Von Neumann architecture block diagram.
The simplicity of this architecture allows it to run at high clock speeds and to be used
for a general class of applications. For example, applications that access data sequentially
do not require address generation algorithms, and applications that perform large numbers
of computations on each new data sample use the arithmetic unit for data addressing infre-
quently. A simple example that illustrates both of these is converting an input data sequence
into the logarithm of that sequence using the Taylor series expansion. In this algorithm, a
data value is accessed from memory, followed by a long sequence of adds and multiplies
on that data, to form the logarithm. The result is then stored in the same memory location.
The processor then steps to the next memory location and repeats the process.
The two major disadvantages of this architecture for FFT algorithms are that it has a
single bus for handling data I/O, data movement, and instruction movement, and it needs
the arithmetic unit to perform the data reordering between algorithm steps as well as to
perform the algorithm computations. A simple example is a single multiply accumulation
of data values stored in nonsequential locations of memory. The arithmetic unit steps are
as follows:
1. Use the next instruction address in the arithmetic unit register to access the next
instruction from memory and store it in the present instruction register.
Steps 3, 6, 8, and 10 use the arithmetic unit, steps 1, 4, 7, and 9 make use of the bus between
arithmetic unit and memory, and steps 2, 5, and 10 use the instruction decoding logic. Steps
4 and 5 can be performed in parallel by the Von Neumann architecture. The result is a
sequence of nine steps to perform the multiply-and-store function that is common to FFT
algorithms. Note that step 10 uses the arithmetic unit as well as the instruction decoding
logic. This is the most obvious example of reduced computation time that is obtained if the
instruction and computational functions of the processor are separated. This separation is
the basis of the Harvard architecture described in the next section.
11.1.2 Harvard
The Harvard [2] architecture (Figure 11-2) is the most popular single arithmetic unit
processor for DSP applications. All of the programmable DSP chips in Chapter 14 use a
variant of this architecture. Its main feature is that it physically separates the algorithm
computations from the data and instruction memory addressing (control) functions. It also
uses separate buses to interconnect the building blocks associated with the computational
and control functions. This provides significant improvements in throughput and latency
for FFT algorithms because it removes the Von Neumann bus bottleneck and allows the
arithmetic unit to be used only for algorithm computations.
The multiply-accumulate steps in Section 11.1.1 are identical to those used by the
Harvard architecture. However, they can be overlapped in the Harvard architecture to
speed up the computations. The most recent generations of programmable DSP chips have
two data memory to arithmetic unit buses, two data memories, and two address genera-
tors. This allows the data and multiplier constant address generation and memory accesses
to be accomplished in parallel. For those chips, steps 2, 3, and 4 can be performed in parallel
Figure 11-2 Harvard architecture block diagram.
with steps 5, 6, and 7. Similarly, steps 8 and 9 can be performed in parallel with steps 10
and 1. The result is that the 10 steps can be performed as if they were 5, rather than having
to do the 9 required by the Von Neumann architecture. Thus, the Harvard architecture can
compute FFTs nearly twice as fast as the Von Neumann. That is why all the commercial
DSP chips are based on this more efficient architecture.
11.2 THREE LINEAR ARRAYS
Linear array architectures, the simplest form of multiprocessor systems, fall into three
classes:
• Pipeline, where the output of each processor provides the input for the next
• Linear bus, where all processors are connected to a common communication bus
• Ring bus, an extension of the linear bus with the ends of the common communica-
tion bus connected
Any of the arithmetic building blocks from Chapter 10 can be used as the processors in
these three bus architectures. Further, either of the single processors described in Section
11.1 can be used. Because of this, the key differences between the linear array architectures
are how their interconnections affect their ability to perform FFT algorithms. This section
describes those three architectures, and Section 12.4 shows how they are used to compute
the FFT algorithms from Chapter 9.
11.2.1 Pipeline
The pipeline [1, 3] architecture interconnects processors such that the output of one
becomes the input to the next. The three-block version of the pipeline in Figure 11-3
can be used to illustrate the key features of this architecture. The most important design
consideration is matching the data output rate from one processor to the input data rate of
the next so that it keeps the next processor busy without overloading. If each processor is
kept busy, then the performance of the overall architecture is the sum of the performances
of each processor.
A multiplier-accumulator is a common example of a two-processor pipeline that is
found in nearly all modern programmable DSP chips and is explained in more detail in
Chapter 14. Processor 0 would be the multiplier and Processor 1 the accumulator, as
shown in Figure 10-4. The input to Processor 0 is the next data sample to be multiplied
and its multiplier constant. Each time Processor 0 produces a multiplication result, it
sends that result to Processor 1 to add to the accumulator. Processor 1 then performs the
addition and stores the result in its accumulator register while Processor 0 is performing
the next multiplication. At some point, the multiply-accumulation process is complete, and
Processor 1 outputs its result to data memory.
Therefore, if the input data rate to Processor 0 is R samples per second, the overall
input rate to Processor 0 is 2 * R per second because it must also receive the multiplier
constants. The output data rate from Processor 0 is R per second, which then becomes the
input data rate to Processor 1. If Processor 1 can perform R adds and accumulator register
stores per second, then the data rate between the two processors is ideal. Finally, notice that
the output data rate from Processor 1 is lower than its input rate. If M multiply-accumulates
are performed before an output is produced, then Processor 1's output data rate is R/M per
second.
If further computations are needed on these results, then Processor 2 should be chosen
to perform its portion of those computations at an input data rate of R/M per second. A
well-designed pipeline architecture uses processors at each stage that match the required
data rates of the previous processor outputs.
Figure 11-3 Three-processor pipeline architecture block diagram.
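The rate matching described above can be made concrete with the short Python sketch below; the values of R and M are arbitrary illustrations.

    R = 1_000_000                        # input data samples per second
    M = 16                               # multiply-accumulates per output result

    processor0_input_rate = 2 * R        # data samples plus multiplier constants
    processor0_output_rate = R           # one product per data sample
    processor1_input_rate = processor0_output_rate
    processor1_output_rate = R / M       # one accumulated result every M products
    print(processor0_input_rate, processor1_input_rate, processor1_output_rate)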
11.2.2 Linear Bus
Some programmable DSP chips use this bus architecture when they have multi-
ple arithmetic processors. These are described in more detail in Chapter 14. Again, the
multiply-accumulate example can be used to illustrate the issues associated with using this
architecture. Assume Processor 0 is the multiplier, Processor 1 is the accumulator, and
Processor 2 is the data and multiplier constant memory. To keep the multiplier busy, it must
have a new data word and multiplier constant each computation cycle. Since both of these
come across the bus from Processor 2, this forces Processor 2 to handle two data accesses
per computation cycle and puts a two-word-per-computation cycle load on the bus.
The multiplier also produces a new result each computation cycle, and this answer
must be passed to the accumulator (Processor 1) to allow Processor 0 to continue perform-
ing multiplications and to allow Processor 1 to remain busy performing accumulations.
This adds another word per computation cycle to the bus requirements. Finally, after M
accumulations the accumulator has an output that it must pass back to the data memory
(Processor 2). This adds a load on the bus of 1/M words per computation cycle.
In addition to these computational loads, data must be coming into the processor and
be stored in the data memory so that data is available for multiply-accumulation. Assuming
the new data must enter at the multiplier computation rate, this adds another data word per
computation cycle to the bus requirements. Eventually, results must also exit the processor
to be used elsewhere. If this is assumed to occur at the 1/M rate of the accumulator outputs,
then the output function increases the total bus loading to (4 + 2/M) words per computation
cycle. If the computation rate is R multiplies per second, then the data rate that must be
sustained on the bus is at least [R * (4 + 2/M)] words per second. A well-designed linear bus
architecture uses processors and buses that match the required performance of the chosen
algorithm.
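The bus-loading arithmetic above is summarized in the following sketch; again, the values of R and M are arbitrary illustrations.

    R = 1_000_000        # multiplies per second
    M = 16               # accumulations per output result

    # Two operands from memory, one product to the accumulator, 1/M results back
    # to memory, one new input sample, and 1/M output words leaving the processor.
    words_per_cycle = 2 + 1 + 1 / M + 1 + 1 / M       # equals 4 + 2/M
    bus_words_per_second = R * words_per_cycle
    print(words_per_cycle, bus_words_per_second)      # 4.125 4125000.0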
11.2.3 Ring Bus
At first glance, this architecture does not appear to differ from the linear bus. In fact,
it can be used in that manner. In this case it has the same properties as the linear bus.
However, this architecture allows another type of processing, namely the input data can be
thought of as being sequentially passed from one building block to the next along with a
codeword that tells whether that processor is supposed to perform a function on that piece
of data. The codeword also can tell the processor what function to perform if the processor
is programmable. This allows multiple words to be on the bus at one time because each is
stored in a data register at the input to one of the processors. This makes the architecture
look like a series of linear buses between processors.
For example, consider the multiply-accumulation example again. However, this time
consider Processor 0 to be one of the bit-slice multiplier building blocks described in
Chapter 10. Chapter 10 showed that a complete multiplication can be performed with bit-
slice building blocks by passing the various "slices" of the input data word and multiplier
constant through the bit-slice multiplier, properly scaling the output and adding it to the
accumulator.
Further, assume Processor 1 is a bit-slice adder, Processor 2 is a data memory, and the
data words are bit-sliced into two pieces. From Chapter 10 the multiply process requires four
bit-slice multiplies and three bit-slice adds, as shown in Equation 11-1. The accumulation
portion of the multiply-accumulate can now be integrated with the addition portion of the
bit-slice multiply.
When the final bit-slice addition is complete, the result is put on the bus by Processor 1 to return to data memory in Processor 2. The
data memory processor not only stores the result but removes it from the bus.
The key concern with this architecture is bus contention, just as for the linear bus.
Only this architecture has a more demanding requirement because data passes around the
ring several times before the algorithm computations are complete. When bus contention
occurs, the transmission of processor outputs must be delayed. This results in a reduction
in throughput and an increase in latency.
One solution to bus contention is to allocate specific time slots to each processor
connected to the ring. This completely removes the contention problem. However, the
contention problem is then replaced with the need to design algorithms so that the processors
finish their computations close to their ring bus time slot. Otherwise, the processors have
the overhead of waiting for their turn to output results and input the next set of data. For
FFT algorithms this approach can be efficient because the algorithms are highly modular.
Section 14.11 shows a product family that uses this time-slot technique to remove bus
contention.
11.3 THREE PARALLEL ARRAYS
The three parallel array architectures presented in this section are:
• Crossbar, which is the most general and allows processors to be directly connected
as needed to a large number of others in the array.
• Massively parallel, where the processors are generally connected to just their near-
est neighbors and communications beyond the nearest neighbor requires passing
information through other processors.
• Star, which has all processors connected to a central one. The central processor
may use the connected processors as coprocessors, or it may be a central memory
that is used by the surrounding processors. When the central processor is replaced
with memory, this is called a shared-memory architecture.
11.3.1 Crossbar
A crossbar [1, 3] switch is a device that allows each of its inputs to be directly in-
terconnected to any other one. For example, consider a crossbar switch to interconnect
four processors that each have one I/O port. Table 11-1 shows the number of simultaneous
interconnections available. If the number of processors is larger, or the processors have
additional I/O ports, the number of different interconnection combinations grows exponen-
tially.
Figure 11-6 is a block diagram of a crossbar architecture where the individual cross-
bar elements control the routing of four processors in an overall array of 16. Each cross-
bar switch can arbitrarily connect any of its four processors to any other one. The
crossbar switch used in Figure 11-6 has an additional output that can be connected to
any of the four inputs. This increases the number of combinations shown in Table 11-1
from 3 to 12 because for each combination any of the four processors can also be connected
to the additional output to feed the larger network. Further, the central crossbar switch in
Figure 11-6 can connect any of the four crossbar switches to another. The result is that with
these two levels of crossbar switching, any of the 16 processors can be directly connected
to one of the others without going through another processor.
There are numerous variations to this architecture, depending on the vendor. For
example, the crossbar switch described in Table 11-1 can also be designed to allow a
processor's I/O to connect to more than one of the other processors. Table 11-2 shows the
combinations available under these design constraints. Note that for this set of design rules
(each processor only having one I/O port), if three processors are connected the fourth has
nowhere to be connected. This architecture's interprocessor data I/O rate is not limited by the
buses themselves, but by scheduling the processing tasks so that two or more processors do
not have to feed data to the same one simultaneously. This is more accurately characterized
as processor I/O contention, rather than bus contention.
Figure 11-6 Crossbar architecture block diagram: 16 processors with two levels of crossbar switches.
The multiply-accumulation example is again used to illustrate the processor I/O con-
tention issues. For example, assume that the upper-left-hand crossbar switch in Figure 11-6
has Processor 0 containing the data memory and multiplier constants, Processor 1 contain-
Table 11-2  Crossbar interconnection combinations

Combination    First connection             Second connection
1              Processors 0 and 1           Processors 2 and 3
2              Processors 0 and 2           Processors 1 and 3
3              Processors 0 and 3           Processors 1 and 2
4              Processors 0, 1, and 2       N/A
5              Processors 0, 1, and 3       N/A
6              Processors 0, 2, and 3       N/A
7              Processors 1, 2, and 3       N/A
8              Processors 0, 1, 2, and 3    N/A
ing the multiplier, Processor 2 containing the accumulator, and Processor 3 being the data
I/O. Since data must be input as fast as it is being operated on by the multiply-accumulator,
a single multiply-accumulate cycle will be assumed to also include receiving a new input
data sample.
The first step is to connect Processor 0 to Processor 1 for two cycles to move a data
word and multiplier constant from memory into the multiplier. During the next cycle the
multiplier performs its computation and sends the result to the accumulator in Processor 2.
This requires the crossbar to connect Processors 1 and 2. This is the perfect time to bring in
a new data sample using the data I/O in Processor 3 and connecting it through the crossbar
switch to Processor 0 to store the data.
During the next cycle, the accumulator in Processor 2 performs its task, and the data
memory in Processor 0 is connected, by the crossbar, to Processor 1 to move additional
data into the multiplier. This is a rather simplistic example that does not illustrate all of
the power and flexibility of the crossbar network. This is addressed in conjunction with the
FFT algorithm mappings in Section 12.5.1.
11.3.2 Massively Parallel
A massively parallel [1, 3] processor is defined as having more than 1000 smaller
processors. Most often, the processors are connected in a two-dimensional array with only
nearest-neighbor connections. If the array is rectangular, then the processors are connected
either to four or all eight of their neighbors, as shown in Figures 11-7 and 11-8. There are
a number of variations depending on the manufacturer.
A fundamental assumption of this architecture is that the individual processors have
multiple I/O ports. Figures 11-7 and 11-8 show four and eight I/O ports, respectively.
The result is that there is no data I/O bottleneck between nearest neighbors. However, if
data must be passed to processors beyond nearest-neighbor locations, the nearest neighbors
must participate in the data transfer. This I/O requirement occupies the I/O ports of multiple
processors, thus reducing a processor's capability to pass its own data to another processor.
Another key characteristic of this architecture is whether all of the processors are
controlled by one program or whether each one can implement its own. If all the processors
must execute the same program, the architecture is called single-instruction, multiple-data
(SIMD). If each processor can have its own program to execute, then it is called multiple-
instruction, multiple-data (MIMD).
Figure 11-7 North-east-west-south connected massively parallel array
architecture block diagram.
Figure 11-8 Eight-nearest-neighbor connected massively parallel array architecture block diagram.
Most massively parallel processors have been SIMD architectures. There are two
primary reasons for this and one significant drawback. The first reason is that technology
has not allowed it to be cost efficient to implement a control processor for each of the
1000 or more processors. Second, it is much more difficult to think through how to control
1000 programs working at the same time. The drawback is that it is very difficult to map
individual algorithms onto an array of 1000 or more processors and have them execute it
efficiently.
More recently, programmable signal processor chips have been designed to be inter-
connected in larger arrays. Since each of these has its own program control, they are likely
to be used in an MIMD configuration. While thousands of these devices are not likely to
be connected in the near future, a trend is developing in that direction. Examples of this
are shown in Section 14.11.
Massively parallel array architectures generally have their own special-purpose I/O
subsystem that converts the input data from a sequential stream into data vectors that can
be passed into the processing array along one of its edges. Figure 11-9 shows a specific
example of this I/O strategy for the north-east-west-south (NEWS) connected massively
parallel array in Figure 11-7. When the computations are complete, the results can be
shifted down to the output data reorganizer and converted back to a sequential stream of
data.
Figure 11-9 Data I/O for a massively parallel array architecture block
diagram.
In the first approach, a complete set of M data samples is loaded into each processor's data memory. Every time M data samples have been stored in each processor, all the
processors can be told to perform the M-step multiply-accumulate process on their sets of data.
All the processors then execute the same instruction set and finish at the same time. When
they are finished, multiply-accumulates have been performed on four sets of data. If during
that computation period, M new data samples can be loaded into each of the four processor's
data memory, then the four processors can begin the multiply-accumulation process on the
next set of data as soon as they have finished the present set and have output the results.
In the second approach each set of M inputs is divided equally among the four proces-
sors. Then each of the four processors computes M/4 of the multiply-accumulates, and these
four partial results are combined by adding. In more detail, one-quarter of the multiplier
constants are stored in each of the four processors. Then the input data interface separates
the input data words so that one-quarter of them go to each processor. Then each processor
performs multiply-accumulation on its M/4 data words, using its M/4 multiplier constants.
Once these partial results are obtained, they must be added to form the final M-sample
multiply-accumulation. One way to do this is to send the partial answers from the left two
processors to memory locations in the right two processors, using the "E" output of the
left-hand processors and the "W" input of the right-hand processors. Then the right two
processors can add their partial results to those computed by the processor to their left.
Finally, the top right processor can send its partial result to the bottom right processor for
the final addition needed to produce the desired output.
The second approach takes longer to compute because of the data passing required
and because all of the processors are not active during the final additions usedto combine the
partial results. However, the computation has less latency to produce its result. Namely, a
new multiply accumulation starts every M samples with the second approach, and therefore
answers are output every M samples. In the first approach the processor only starts a new
multiply-accumulate computation every 4 * M data samples. Therefore, it can only produce
results every 4 * M data samples. Hence, even though the individual multiply-accumulate
is produced faster, it takes longer for the answers to be available for further computations.
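The difference can be summarized numerically; the value of M below is an arbitrary illustration.

    M = 64                              # multiply-accumulates per output result

    first_approach_interval = 4 * M     # data samples between successive results from one processor
    second_approach_interval = M        # data samples between successive combined results
    print(first_approach_interval, second_approach_interval)   # 256 64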
11.3.3 Star
The star [1] architecture is most often used when one function or process dominates the
application. It consists of one central processor with interconnections to numerous others,
as shown in Figure 11-10. The star architecture does not have to have four processors
surrounding the central one. It can have more or less, depending on the application.
The interprocessor communications in this architecture all occur via the central unit.
This requires it to have the capability to handle multiple data streams simultaneously or
the architecture will not be efficient. The most likely uses for this architecture are for
applications where either:
1. The central block does the general computations and the surrounding ones are
used as coprocessors to perform specific functions, such as nonlinear operations
or database searching, or
2. The central processor is data memory (shared memory) that needs to be accessed
by multiple processors at the same time, like a simultaneous database search from
multiple remote locations.
Just like the massively parallel architecture, there are many ways to use a star archi-
tecture to implement a set of algorithms. Using the multiply-accumulate as an example,
assume five processors connected to the central processor. In this case let four of the
outlying processors be 8-bit bit-slice multipliers, and let the central processor be the data
memory and an accumulator. Let the fifth outlying processor handle the data I/O func-
tions.
The first step is to move 16-bit input data through the data I/O processor and store
it in the data memory in the central processor. The next step is to have the central pro-
cessor slice the 16-bit input words into 8-bit slices and pass the slices to each of the
four bit-slice multipliers. The next step is for each of the bit-slice multipliers to per-
form one of the multiplications shown in Equation 11-1. Once the computations are
complete, each bit-slice multiplier passes its result back to the central processor. The
central processor is then responsible for performing the scaled additions shown in Equa-
tion 11-1. The final result for the first multiplication now resides in the central processor,
and it can be added to the other multiplied data to form the M -step multiply-accumu-
lation.
11.4 THREE MULTIDIMENSIONAL ARRAYS
Multidimensional arrays are one step beyond parallel arrays because they exhibit interconnectivity that has three or more dimensions. The three presented in this section are the hypercube, a three-dimensional extension of the massively parallel array, and hybrid combinations of the previous architectures.
This type of architecture has been included because there are multidimensional FFT
applications and because even one-dimensional applications can be conveniently written as
a multidimensional FFT computation.
11.4.1 Hypercube
In mathematics a cube is a three-dimensional object with equal sides. The mathe-
matical generalization of this equal-sided object to more than three dimensions is called a
hypercube. A hypercube [1,3] processing architecture is an organization of connections be-
tween processing elements that form cubes. Joining two hypercubes of the same dimension
forms a hypercube of the next higher dimension. A single processor is a zero-dimensional
hypercube. Connecting two of those forms a one-dimensional hypercube. Connecting
two of these forms a square, which is a two-dimensional hypercube. Connecting two
squares forms a cube, called a three-dimensional hypercube. It becomes difficult to envi-
sion higher-dimensional hypercubes. Figure 11-11 shows the four-dimensional hypercube.
Note that it is composed of two interconnected (one inside the other), three-dimensional
hypercubes.
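An equivalent way to state the construction is that node i of a d-dimensional hypercube is connected to every node whose binary label differs from i in exactly one bit. The Python sketch below lists those neighbors for the 16-processor case of Figure 11-11.

    def hypercube_neighbors(node, dimensions):
        return [node ^ (1 << bit) for bit in range(dimensions)]

    d = 4                                       # the four-dimensional, 16-processor hypercube
    for node in range(2 ** d):
        print(node, "->", hypercube_neighbors(node, d))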
In fact, going beyond the four dimensions shown in Figure 11-11 (16 processor elements)
is difficult to visualize. Processor arrays with large numbers of processing elements are
also difficult to program efficiently. Once the visualization of the processor architecture is
removed, it becomes even more difficult to program.
Figure 11-12 Three-dimensional massively parallel interconnections: processors P0-P2 (Layer 1), P3-P5 (Layer 2), and P6-P8 (Layer 3) with north, east, west, south, up, and down connections.
Figure 11-12 is a simplified block diagram of such an interconnection. The top three
processors (P0, P1, and P2) represent one row of the two-dimensional array in Figure 11-7.
The middle (P3, P4, and P5) and bottom (P6, P7, and P8) sets of processors also represent
a row of another two-dimensional array. The vertical interconnections are the up and down
connections between these two-dimensional arrays. The six basic interconnections, north,
east, west, south, up, and down, are labeled in Figure 11-12.
11.4.3 Hybrids
By definition a hybrid architecture is a combination of two or more of the architectures
described in previous sections. The example is a high-level crossbar [1, 3] architecture
(Figure 11-13) where half of the processors (2, 3, 6, 7, 10, 11, 14, and 15) are 3 x 3 arrays
of Harvard [2] architecture processing elements connected in a massively parallel [1, 3]
NEWS architecture for a total of 72 processors. The other half of the high-level crossbar
processors is split between data memory (1, 5, 9, and 13) and data input/output (0,4, 8,
and 12). Therefore, this is a combination of Harvard, massively parallel, and crossbar
architectures.
Figure 11-14 shows the 3 x 3 parallel processor array that exists at each of the
processors 2, 3, 6, 7, 10, 11, 14, and 15 in Figure 11-13, and Figure 11-15 shows the
Figure 11-13 Hybrid architecture block diagram: a 16-element crossbar whose elements are data I/O processors (0, 4, 8, and 12), data memories (1, 5, 9, and 13), and 3 x 3 processor arrays (2, 3, 6, 7, 10, 11, 14, and 15).
Figure 11-14 3 x 3 NEWS-connected array of Harvard processors used at each processor node of Figure 11-13.
Harvard processor at each node of each of these 3 x 3 parallel processor arrays. Multiply-
accumulate functions would be performed with the 72 Harvard processors. This means
that 72 multiply-accumulations can be done at the same time and the answers combined
at whatever level is necessary by using the NEWS and crossbar interconnections. The
strength of this architecture is its processing power. However, the drawback, like all MIMD
architectures, is the difficulty in programming the 72 processors to work efficiently on
complex algorithms. Chapter 12 addresses the complexity of mapping the algorithms from
Chapter 9 onto these architectures.
Figure 11-15 Harvard processor block diagram (data memory, address generator, program memory, arithmetic unit, and program counter, with N, E, W, and S ports).
11.5 CONCLUSIONS
More than a dozen block diagrams illustrate the variety of ways processors are combined
to offer an enormous selection for computing FFT algorithms. Seeing the interconnection of
the processors allows data movement overhead to be estimated. This helps to narrow the
choices of how to map an algorithm onto an architecture, which is shown in the next chapter
for minimum latency and maximum throughput examples.
REFERENCES
[1] T. Fountain, Processor Arrays Architecture and Applications, Academic Press, London,
1987.
[2] S. K. Mitra and J. F. Kaiser, Handbook for Digital Signal Processing, Wiley, New York,
1993.
[3] R. W. Hockney and C. R. Jesshope, Parallel Computers, Adam Hilger, Bristol, England,
1981.
12
Algorithm and Data Mappings
12.0 INTRODUCTION
The method used to distribute and redistribute data and an algorithm in a single or multi-
processor hardware architecture is called algorithm mapping. The process of choosing an
algorithm mapping for a particular application is often complex. The data I/O requirements,
processor interconnections and building-block algorithms must all be considered to reach
an optimal approach for a particular application.
This chapter uses minimum latency and maximum throughput examples to illustrate
how to map the algorithms from Chapter 9 onto the hardware architectures from Chapter
11. It is assumed that each processor takes one instruction cycle for each add, multiply, or
data move. The measures of how well an architecture performs an FFT algorithm are:
• How much delay does the architecture introduce while obtaining the results (la-
tency)?
• How many FFTs per second can be computed (throughput)?
The first three performance measures apply to the first issue and the last two to the
second.
Input data overhead is the number of clock cycles to move the data into the hardware
architecture and store it in the processor that will use it first.
12.2 MAPPINGS
Algorithms and architectures are interesting to study. However, it is the efficiency with
which an architecture can execute an FFf algorithm that is of paramount importance in
making choices in the development of an application. The following sections use the
performance measures to characterize how each algorithm from Chapter 9 will work on
each architecture from Chapter 11.
In general, the best mapping of an algorithm onto processors is to allocate a processor
to each algorithm building block. If a transform length is factored into P smaller numbers,
then:
1. The Bluestein algorithm needs 2P + 3 hardware blocks. Three are used for the
complex multiplies at the beginning, middle, and end of the algorithm. The other
2P are needed to implement the forward and inverse transforms, where P is the
number of building blocks needed to implement the FFT.
2. The Winograd algorithm needs three hardware building blocks to implement the
two sets of adds and one set of multiplies.
3. The prime factor algorithms need P hardware building blocks to compute the P
building-block algorithms.
4. The mixed-radix algorithms need P hardware building blocks to compute the
P building-block algorithms and P - 1 more to implement the complex multipli-
cations between the stages.
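As a quick check on these counts, the following C sketch (not from the text; the function names are illustrative only) prints the number of hardware building blocks each mapping strategy needs when the transform length is factored into P pieces.

#include <stdio.h>

/* Illustrative-only helpers: hardware block counts for the four mapping
 * strategies listed above, for a transform factored into P pieces.        */
static int blocks_bluestein(int P)    { return 2 * P + 3; } /* 2P transforms + 3 multiplier blocks   */
static int blocks_winograd(void)      { return 3;         } /* input adds, multiplies, output adds   */
static int blocks_prime_factor(int P) { return P;         } /* one block per factor                  */
static int blocks_mixed_radix(int P)  { return 2 * P - 1; } /* P blocks + (P - 1) twiddle multiplies */

int main(void)
{
    int P = 2;  /* e.g., 15 = 3 * 5 or 16 = 4 * 4 */
    printf("Bluestein:    %d hardware blocks\n", blocks_bluestein(P));
    printf("Winograd:     %d hardware blocks\n", blocks_winograd());
    printf("Prime factor: %d hardware blocks\n", blocks_prime_factor(P));
    printf("Mixed radix:  %d hardware blocks\n", blocks_mixed_radix(P));
    return 0;
}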
12.3 SINGLE PROCESSOR
Table 12-1 shows input set 1 flowing through the data I/O portion of the processor
during time slot 1 and being stored in data RAM section 1. After one time slot for compu-
tation, the FFT outputs from input set 1 are passed out of the processor during time slot 3.
This process is repeated for each set of complex samples. The only difference is the section
of memory used for each set. Therefore, the processor's real-time computational require-
ment is to perform the entire FFT algorithm during the time slot for inputting one set of
complex samples. This includes algorithm arithmetic and memory address calculations. If
the processor is fast enough to perform all of these functions in real time, a single processor
is sufficient for the application and the throughput is one FFT per time slot. If it is not,
multiple processors are needed, leading to one of the other architectures from Chapter 11.
The latency of this processing architecture is two time slots because the data goes into
the processor during time slot 1 and the results exit the processor during time slot 3. This
performance must also be adequate for the application in order for a single processor to be
sufficient. If the latency must be less than two sets of complex samples, multiple processors
must be used.
For a given transform length the data I/O rates are the same for all of the algorithms
because all N-point FFTs use N input complex samples and produce N output frequency
components. However, if data I/O is marginal, it is important to find the smallest transform
length that meets the performance goals of the application. Generally, the smallest such
transform length is not a power of two.
The other factor affecting data I/O is the data sequence reordering needed to com-
pute the algorithm. On the input, the data is almost always in time sequence order be-
cause it came from an A/D converter or out of some linear filtering function. However,
all of the algorithms in Chapter 9 need the data to be reorganized to be ready for the
first building-block algorithm computations. This can be performed as the data enters the
processor by the way it is stored in memory. Or it can be performed at the beginning
of the first building-block computations by the way data is initially accessed from mem-
ory.
The FFT results are not in sequential order either. Since the next computational stage
generally needs the frequency components in sequential order, another data reorganization
is required. Since the addresses used for the last-stage computational outputs are based on
the building-block addressing, this data reorganization is performed as the data moves from
the data memory through the data I/O hardware.
The algorithms for performing these two reorganizations are given in Chapter 9, and
all use multiplies, adds, and modulo arithmetic. Therefore, there is no significant advantage
of one algorithm over another for this portion of the computations.
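The exact reordering equations are given in Chapter 9. Purely as an illustration of the kind of modulo arithmetic involved, the C sketch below prints one common prime factor input map, Good's mapping for N = N1 * N2 with N1 and N2 relatively prime; the 15 = 3 * 5 example and this particular map are assumptions for illustration and are not necessarily the exact equations used in Chapter 9.

#include <stdio.h>

int main(void)
{
    /* Good's (Ruritanian) prime factor input map for N = N1 * N2,
     * gcd(N1, N2) = 1: input sample a(n) is placed at (row n1, column n2). */
    const int N1 = 3, N2 = 5, N = N1 * N2;

    for (int n1 = 0; n1 < N1; n1++) {
        for (int n2 = 0; n2 < N2; n2++) {
            int n = (N2 * n1 + N1 * n2) % N;   /* modulo-arithmetic index map */
            printf("row %d, col %d  <-  a(%2d)\n", n1, n2, n);
        }
    }
    return 0;
}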
For the 15-point examples in Chapter 9, the straight-line approach requires
five copies of the 3-point algorithm and three copies of the 5-point algorithm. For the
16-point radix-4 FFT example in Chapter 9, eight copies of the
4-point building block are used in the straight-line approach, rather than the one copy for
the building-block subroutine code approach.
The arithmetic unit is responsible for algorithm and data addressing computations.
The algorithm computations are different for each algorithm. The I/O addressing is ex-
plained in Section 12.3.1. The other data addressing computations are to reorganize the
data between each building-block algorithm stage. Each algorithm from Chapter 9 requires
this data reorganization and uses multiplies, adds, and modulo arithmetic. Therefore, there
is no significant advantage of one algorithm over another for this portion of the computa-
tions.
The arithmetic unit must be capable of computing all of these tasks in the time
slot allotted by the real-time requirements of the application. Millions of instructions per
second (MIPS) and millions of operations per second (MOPS) are only crude measures of a
processor's ability to execute the needed FFT algorithm in real time, because no hardware
architecture is 100% efficient at computing FFTs.
The chip Comparison Matrices in Chapter 14 show 1024-point complex FFT timings
for most DSP chips on the market. Section 14.1.1 describes how to estimate timings for
other FFT lengths, based on the 1024-point benchmark. This is a better measure of chip
performance than MIPS and MOPS because it incorporates internal overhead of the chip.
When processors are connected into larger arrays, additional overhead is incurred when
data must be passed between processors. That additional overhead is explained for each
algorithm mapped in this chapter.
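As a rough illustration of that kind of estimate, the C sketch below scales a measured 1024-point benchmark to other lengths by assuming execution time grows as N*log2(N). Both the scaling rule and the 100-microsecond benchmark value are assumptions for illustration only; Section 14.1.1 gives the book's own estimation procedure.

#include <math.h>
#include <stdio.h>

/* Estimate an N-point FFT time from a 1024-point benchmark, assuming
 * execution time scales roughly as N*log2(N) (illustrative rule only).   */
static double estimate_fft_us(double t1024_us, double N)
{
    double ops_N    = N * log2(N);
    double ops_1024 = 1024.0 * 10.0;        /* log2(1024) = 10 */
    return t1024_us * ops_N / ops_1024;
}

int main(void)
{
    double t1024 = 100.0;   /* hypothetical 1024-point benchmark, microseconds */
    printf("256-point  estimate: %.1f us\n", estimate_fft_us(t1024, 256.0));
    printf("4096-point estimate: %.1f us\n", estimate_fft_us(t1024, 4096.0));
    return 0;
}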
[Figures: single-processor block diagrams. The Von Neumann version has a data I/O port, a multiplier with its constants, an arithmetic unit, and one memory holding the program and data sections 1-3. The Harvard version adds a separate program memory, a program counter, and an address generator.]
Since additional hardware is used to compute memory addresses and sequence through
the program, this architecture coupled with building-block subroutine code generally has
better performance than a Von Neumann architecture using straight-line code. Additionally,
the larger memory needed for straight-line code is replaced with a small amount of control
logic in the Harvard architecture.
The extent of the performance improvement over the Von Neumann architecture
depends on the sophistication of the address generators. In the more recent generations
of DSP chips, the address generators, often multiple ones, allow the complex memory
address sequences to be generated at the same speed as the arithmetic computations are
performed. In the early generations of DSP chips, the address generator was nothing more
than a counter. For these less sophisticated address generators, straight-line coding provided
additional performance gain over using building-block subroutines. All of the other data
I/O, memory, and arithmetic unit considerations are virtually the same for the Harvard and
Von Neumann architectures.
Because only one processor is being used, any of the FFT examples from Chapter 9
can be used to illustrate the mapping process. If the 16-point radix-4 FFT is used and it is
assumed that (1) the data addressing is all accomplished by an address generator, in parallel
with the computations, and (2) the arithmetic unit performs either an add or a multiply in
a clock cycle, then 232 clock cycles are required because there are 144 real adds, 24 real
multiplies, and 64 data I/O operations (32 to input 16 complex data samples and 32 to output
16 complex frequency components) to execute. Therefore, the throughput is one 16-point
radix-4 FFT every 232 clock cycles with a processing latency that is also 232 clock cycles.
If the arithmetic unit allows multiplies and adds on the same clock cycle, the clock cycle
total is reduced as a function of how many places in the algorithm adds and multiplies can
be done in parallel.
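The arithmetic behind those numbers is summarized in the short C sketch below, using the same one-cycle-per-operation assumptions as the text.

#include <stdio.h>

int main(void)
{
    /* Single-processor 16-point radix-4 budget: one clock per add, multiply,
     * or data move, with address generation running in parallel.             */
    int real_adds  = 144;
    int real_mults = 24;
    int io_moves   = 64;    /* 32 cycles in + 32 cycles out for 16 complex points */
    int total      = real_adds + real_mults + io_moves;

    printf("total clock cycles       : %d\n", total);  /* 232 */
    printf("latency (clock cycles)   : %d\n", total);
    printf("throughput (cycles/FFT)  : %d\n", total);
    return 0;
}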
12.4 THREE LINEAR ARRAYS
Linear arrays were early architectures for increasing the performance of an FFT algorithm
beyond the capability of a single processor. The primary difference between the various
algorithms on this architecture is the number of processors that are efficient for decomposing
the algorithm into smaller pieces. Table 12-2 shows how each of the FFT examples from
Chapter 9 can be mapped onto a three-processor linear-array architecture. These mappings
are then described in more detail for each linear-array architecture from Chapter 11. Finally,
the 16-point radix-4 FFT example is described in more detail. Throughout this section, when
the k-th input data sample is written as a(k), it means both the real and imaginary parts of
the sample. Specifically, a(k) = aR(k) + j * aI(k). This same shorthand notation is also
used for intermediate results and output frequency components.
12.4.1 Pipeline
The pipeline [1, 3] architecture was one of the first real-time architectures used to
implement the power-of-two FFT. It interconnects processors such that the output of each
one becomes the input to the next. An FFT algorithm is then implemented by segmenting
it into stages, with each stage executed by one of the processors in the pipeline.
Table 12-2 Chapter 9 Example Algorithms Mapped onto a Three-Processor Linear Array
Each FFT algorithm in Chapter 9 requires the input samples, intermediate results, and
output results to be reorganized. These reorganizations are implemented by the sequence
in which data is read into each processor in Figure 12-3 or by the address pattern used to
store the data in the processor. Therefore, the time for data reorganization is similar for all
algorithms.
In terms of algorithm computational efficiency, the key is to provide enough com-
putational capability in each processor so that it can process the outputs from the previous
processor as fast as provided and can provide inputs for the next processor as fast as needed.
If each processor meets these criteria, the P-stage pipeline processor can process P times
as much data as a single processor. The pipeline approach allows each processor to be
tailored to execute the computations in that portion of the algorithm.
There are three contributors to processing latency in a pipeline architecture. The first
is the individual latencies of each of the processors, once they have received the necessary
data to perform the computations. The second is added latency due to one processor not
working fast enough to feed results to the next one. Then the next processor must wait for
data prior to performing its computations.
The final contributor to pipeline processor latency is whether a processor waits until
it has an entire set of complex samples before it begins processing. If it does, the processing
latency of each processor is as described in Section 12.3. However, it is possible to start
processing data prior to the entire set of complex samples being present. This can be
observed by looking at the algorithm steps in Chapter 9 for the 15- and 16-point examples.
In all of the 15-point examples the first computations can be performed once the complex
a(0), a(5), and a(10) samples are received. For the 16-point radix-4 example, computations
can start once complex samples a(0), a(4), a(8), and a(12) are received. The 16-point mixed
powers-of-primes example must wait until sample a(14) is received.
For algorithms where a 2-point transform is computed first, computations can start
after receiving the first sample in the second half of the data. This technique was used
The Bluestein algorithm requires much more processing power for the first and third
blocks than for the second block. This can be accommodated by using blocks with different
processing power or by subdividing the computations for the 16-point algorithm into smaller
blocks. For example, the first and/or third blocks in Figure 12-4 can be replaced with the
three blocks in Figure 12-7, resulting in a pipeline with five or seven blocks with more
comparable amounts of computations. The advantage of this is the possibility of having all
the computational blocks be the same hardware architecture, or at least fill the same amount
of board space. The disadvantage of this approach is that it adds processing latency to the
algorithm, even though it does not decrease the system input data rate.
The Winograd algorithm provides the best chance for optimizing the hardware to the
algorithm because it segregates adds and multiplies. This allows the first and third processors
to be constructed using only adders. Only the center processor needs the multiplication
capability. For the 15-point FFT this algorithm also allows the first and third processors to
be decomposed into a sequence of 3- and 5-point add processors. However, with the cost
of programmable DSP chips decreasing rapidly, it may still be most cost effective to use
those chips for each of the three blocks needed for the 3- and 5-point FFTs.
Figure 12-5 Pipeline architecture block diagram for the 15-point
Winograd algorithm.
The prime factor algorithm (Figure 12-6) has two potentially attractive features be-
cause multipliers are not needed between the stages. The first is that a two-stage algorithm
can be implemented with processors that are much closer to having the same computational
requirements than if the multiply stage were in the middle. The second is the potential for
a smaller processing latency because of the lack of the multiplier processor.
Figure 12-6 Pipeline architecture block diagram for the 15-point
prime factor algorithm.
Additionally, these two blocks can be further decomposed into smaller building blocks
to meet the computational requirements. For example, the Winograd building blocks from
Chapter 8 allow each block in Figure 12-6 to be divided into three blocks. In that case,
each processor can be optimized as described for the adds and multiplies required by the
Winograd algorithm.
The power-of-primes algorithm in Figure 12-7 has the special feature that the first
and third blocks are the same. Further, when they are 4-point FFTs, they do not have
multiplications. Therefore, they can be implemented by using only adder blocks for the
arithmetic unit. Again, the 4-point FFT requires more computations than the complex
multiplies. This means more processing power is needed in the first and third blocks than in
the second block. If the processor latency requirements allow, the 4-point algorithm can be
computed with a pair of 2-point FFTs. This increases processor latency by turning a three-
block process into a five-block process. However, it makes the processing requirements of
each block similar.
The mixed powers-of-primes algorithm in Figure 12-8 has the worst mismatch of
computational tasks of any of the examples because all three blocks have different require-
ments. Again, this can be improved by decomposing the 8-point FFT into three 2-point
stages or into mixed-radix 4- and 2-point stages. The three 2-point FFT algorithms offer the
best computational match because the 2-point FFT requires four adds and each complex
multiply consists of four multiplies and two adds.
Figure 12-8 Pipeline architecture block diagram for the 16-point mixed
powers-of-primes algorithm.
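For reference, the two elementary operations being balanced in that comparison are shown in the C sketch below: a 2-point FFT (four real adds on complex data) and a complex multiply (four real multiplies and two real adds). The sample and twiddle values are arbitrary.

#include <complex.h>
#include <stdio.h>

/* 2-point FFT on complex data: two complex adds = four real adds. */
static void fft2(double complex a, double complex b,
                 double complex *X0, double complex *X1)
{
    *X0 = a + b;
    *X1 = a - b;
}

int main(void)
{
    double complex X0, X1;
    fft2(1.0 + 2.0 * I, 3.0 - 1.0 * I, &X0, &X1);

    /* Complex multiply: (ar + j*ai)(br + j*bi) = (ar*br - ai*bi) + j(ar*bi + ai*br),
     * i.e., four real multiplies and two real adds.                                  */
    double complex w = 0.6 - 0.8 * I;
    double complex y = X1 * w;

    printf("X0 = %g%+gj, X1*w = %g%+gj\n", creal(X0), cimag(X0), creal(y), cimag(y));
    return 0;
}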
A third option for decomposing the 8-point FFT is to use the Winograd 8-point
algorithm. Then it can be decomposed into a sequence of adds, then multiplies, and then adds
again. Since the 2-point FFT is also just adds, it can be implemented with the same hardware
architecture as the Winograd input and output adds. Further, the Winograd multiplies and
the complex multiplies can be implemented with the same hardware architecture.
The block diagram in Figure 12-9 is very similar to the prime factor algorithm in
Figure 12-6. The two drawbacks to this algorithm, over the prime factor algorithm, are
that the processing latency is one more set of complex samples because of the complex
multiplies, and the complex multiplies need a simpler computational architecture than the
3- and 5-point FFTs. The second issue can be resolved by decomposing the 3- and 5-point
FFTs into smaller building blocks. However, this decomposition results in added processing
latency.
Figure 12-9 Pipeline architecture block diagram for the 15-point
Singleton mixed-radix algorithm.
Ring Bus
As explained in Chapter 11, data in this architecture flow along the bus from one
processor to the next, accompanied by a codeword. The codeword tells the next processor
if it has computations to perform on the next set of data and what those computations are.
Additionally, just as in the pipeline section, each processor can be further decomposed so
that more, smaller processors are connected to the ring.
Stage 1: Input set 1 of complex samples to Processor 0 and compute input 4-point
FFTs.
Stage 2: Transfer Processor 0's set 1 results to Processor 1.
Stage 3: Compute complex multiplications on set 1 in Processor 1 and input set 2 to
Processor 0 and compute input 4-point FFTs.
Stage 4: Transfer Processor 0 set 2 results to Processor 1; transfer Processor 1 set 1
results to Processor 2.
This process is repeated for multiple sets of complex samples. Table 12-3 summarizes
these events as a function of clock cycles from the beginning of the process.
0-95     Input 1st set into Processor 0 and compute four input 4-point FFTs.
96-127   Move Processor 0 results from 1st set to Processor 1.
128-223  Input 2nd set into Processor 0 and compute four input 4-point FFTs.
128-191  Compute complex multiplies on 1st set in Processor 1.
192-223  Move Processor 1 results from the 1st set into Processor 2.
224-319  Compute four output 4-point FFTs on 1st set and output results.
224-255  Move Processor 0 results from 2nd set to Processor 1.
256-351  Input 3rd set into Processor 0 and compute four input 4-point FFTs.
Stage 3: Collect the Results of the Three 16-Point Radix-4 FFT Computations
Assuming this step takes two clock cycles to output each complex frequency com-
ponent, the first set of output frequency components is moved out of the third processor
in 32 clock cycles. At the same time, the second set of complex frequency components is
moved from the second processor to the third processor. Also, these same 32 additional
clock cycles are used to move the third set of output frequency components from the first
processor to the second processor. During the next set of 32 clock cycles, the second set of
output frequency components is moved out of the third processor and the third set of output
frequency components is moved from the second to the third processors. Finally, during
the last set of 32 clock cycles, the third set of output frequency components is moved out of
the third processor. Therefore, the three sets of 16 complex output frequencies are output
in 96 clock cycles. This option thus takes a total of 360 clock cycles, which is the
latency and defines the throughput rate of 360/3 = 120 clock cycles per FFT.
Stage 1: Input set 1 of complex samples to Processor 0 and compute input 4-point
FFfs.
Stage 2: Transfer Processor 0's set 1 results to Processor 1.
Stage 3: Compute complex multiplications on set 1 in Processor 1, and input set 2 to
Processor 0 and compute input 4-point FFTs.
Stage 4: Transfer Processor 0 set 2 results to Processor 1; transfer Processor 1 set 1
results to Processor 2.
This process is repeated for multiple sets of complex samples. Table 12-4 summarizes
these events as a function of clock cycles from the beginning of the process.
0-95     Input 1st set into Processor 0 and compute four input 4-point FFTs.
96-127   Move Processor 0 results from 1st set to Processor 1.
128-223  Input 2nd set into Processor 0 and compute four input 4-point FFTs.
128-191  Compute complex multiplies on 1st set in Processor 1.
192-223  Move Processor 1 results from the 1st set into Processor 2.
224-319  Compute four output 4-point FFTs on 1st set and output results.
224-255  Move Processor 0 results from 2nd set to Processor 1.
256-351  Input 3rd set into Processor 0 and compute four input 4-point FFTs.
Stage 3: Collect the Results of the Three 16-Point Radix-4 FFT Computations
Assuming this step takes one clock cycle for each output result, the three sets of 16
complex output frequencies take 96 clock cycles. Therefore, this option takes a total of 360
clock cycles, which is the latency and defines the throughput rate of 360/3 = 120 clock
cycles per FFT.
12.5 THREE PARALLEL ARRAYS
Processors can be combined into parallel arrays in numerous ways, and there are many ways
to use the array to compute each of the algorithms in Chapter 9. At the two data mapping
extremes are:
1. One set of complex samples is distributed among all of the processors in the array
and then computed in one FFT. This approach usually results in minimum latency
processing.
2. A set of complex samples is distributed to each of the processors and then a number
of FFTs are performed in parallel. This usually results in maximum throughput
but has more latency than the first approach.
Each extreme is described by mapping the 16-point radix-4 FFT onto each of the three
parallel arrays from Chapter 11. Throughout this section, when the k-th input data sample
is written as a(k), it means both the real and imaginary parts of the sample. Specifically,
a(k) = aR(k) + j * aI(k). This same shorthand notation is also used for intermediate results
and output frequency components.
cluster. The output 4-point building blocks are then computed and the final results sent
out of the architecture. The data mapping in each of the processors is the same as used in
Section 8.5, because the computations in an individual processor are only 4-point building
blocks.
Stage 1: Distribute the Input Data onto the Processors
Use Processor 0 to load the 16 complex samples and use the crossbar network to
distribute one of the data points to each of the other 16 processors. Group the data points
such that a(0), a(4), a(8), and a(12) are in Processors 4, 5, 6, and 7, respectively. Similarly,
group a(1), a(5), a(9), and a(13) in Processors 8, 9, 10, and 11, respectively; a(2), a(6),
a(10), and a(14) in Processors 12, 13, 14, and 15, respectively; and a(3), a(7), a(11), and
a(15) in Processors 0, 1, 2, and 3, respectively. It takes two clock cycles to input and store
each complex data sample in the processor, if no additional clock cycles are assumed for
passing data through the crossbar switches. This is a total of 32 clock cycles. Figure 12-13
shows which of the 16 processors has each of the 16 complex samples, intermediate results,
and output results after each stage of the 16-point radix-4 algorithm by listing them in their
processor on the same line as the label on the left side of the figure that defines the stage of
the algorithm.
Stage 2: Compute Input 4-Point FFTs
Compute 4-point FFTs in each processor cluster. Use Stage 1 of the 16-point radix-4
FFT example in Chapter 9 as the guideline, along with the memory mapping scheme in
Chapter 8, and each processor cluster's crossbar switch to move data between processors.
Specifically, processor cluster 0-3 is used to compute the fourth of four input 4-point build-
ing blocks in Stage 1 of Section 9.7.5. Similarly, processor cluster 4-7 is used to compute
the first of four input 4-point building blocks. Processor cluster 8-11 is used to compute
the third of the four input 4-point building blocks. Finally, processor cluster 12-15 is used
to compute the second of the four input 4-point building blocks in Stage 1 of Section 9.7.5.
To illustrate how a processor cluster can be used to compute these input 4-point
building blocks, consider processor cluster 4-7. One approach for this cluster is to use
the crossbar switch to connect Processor 4 to Processor 6 and to connect Processor 5 to
Processor 7. Then:
Step 1: Copy a(0) from Processor 4 into Processor 6 and copy a(4) from Processor
5 into Processor 7, simultaneously.
Step 2: Copy a(8) from Processor 6 into Processor 4 and copy a(12) from Processor
7 into Processor 5, simultaneously.
Step 3: Use the equations from the 16-point radix-4 example in Section 9.7.5 to
compute
b(0) = a(0) + a(8) in Processor 4
b(2) = a(4) + a(12) in Processor 5
b(1) = a(0) - a(8) in Processor 6
b(3) = a(4) - a(12) in Processor 7 simultaneously
Use the crossbar switch to connect Processor 4 to Processor 5 and to connect Processor
6 to Processor 7. Then:
Step 4: Copy b(0) from Processor 4 into Processor 5 and copy b(1) from Processor
6 into Processor 7 simultaneously.
Step 5: Copy b(2) from Processor 5 into Processor 4 and copy b(3) from Processor
7 into Processor 6 simultaneously.
Step 6: Use the equations from the 16-point radix-4 example in Section 9.7.5 to
compute
c(0) = b(0) + b(2) in Processor 4
c(1) = b(1) - jb(3) in Processor 6
c(2) = b(0) - b(2) in Processor 5
c(3) = b(1) + jb(3) in Processor 7 simultaneously.
At the same time these computations and data movements are taking place, perform the
equivalent functions in the other three processor clusters, using the data in their processors
and the equations from Section 9.7.5. The data movements and adds each take a clock
cycle, for a total of 12 clock cycles. Figure 12-13 shows the locations of the results of these
computations as the second entry in each of the 16 processor blocks.
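Collapsed into ordinary serial C rather than spread across four processors, the input 4-point building block computed by the cluster uses the same equations quoted above from Section 9.7.5; the test values in main() are arbitrary.

#include <complex.h>
#include <stdio.h>

/* Input 4-point building block on samples a(0), a(4), a(8), a(12),
 * written serially using the equations of Section 9.7.5 quoted above.   */
static void fft4(const double complex a[4], double complex c[4])
{
    double complex b0 = a[0] + a[2];   /* a(0) + a(8)  */
    double complex b1 = a[0] - a[2];   /* a(0) - a(8)  */
    double complex b2 = a[1] + a[3];   /* a(4) + a(12) */
    double complex b3 = a[1] - a[3];   /* a(4) - a(12) */

    c[0] = b0 + b2;
    c[1] = b1 - I * b3;
    c[2] = b0 - b2;
    c[3] = b1 + I * b3;
}

int main(void)
{
    double complex a[4] = { 1, 2, 3, 4 }, c[4];
    fft4(a, c);
    for (int k = 0; k < 4; k++)
        printf("c(%d) = %g%+gj\n", k, creal(c[k]), cimag(c[k]));
    return 0;
}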
Since only the real or imaginary part of one sample can move on a crossbar connection
during any clock cycle, these data moves take 12 clock cycles. Figure 12-13 shows the
locations of the results of this reorganization of intermediate results as the fourth entry in
each of the 16 processor blocks.
Compute 4-point FFTs in each processor cluster, using Stage 3 of the radix-4 16-point
example as the guideline. This uses each processor cluster's crossbar switch to move data
between processors. Specifically, processor cluster 0-3 is used to compute the fourth of
four output 4-point building blocks in Stage 3 of Section 9.7.5. Similarly, processor cluster
4-7 is used to compute the first of four output 4-point building blocks. Processor cluster
8-11 is used to compute the second of the four output 4-point building blocks. Finally,
processor cluster 12-15 is used to compute the third of the four 4-point output building
blocks in Stage 3 of Section 9.7.5.
To illustrate how a processor cluster can be used to compute these output 4-point
building blocks, consider processor cluster 4-7. One approach for this processor cluster
uses crossbar switch 2 to connect Processor 4 and Processor 5 and to connect Processor 6
to Processor 7. Then:
Step 1: Copy c(0) from Processor 4 into Processor 5 and copy c(8) from Processor
6 into Processor 7 simultaneously.
Step 2: Copy c(4) from Processor 5 into Processor 4 and copy c(12) from Processor
7 into Processor 6 simultaneously.
Step 3: Use the equations from the radix-4 16-point example to compute
f(0) = c(0) + c(4) in Processor 4
f(2) = c(8) + c(12) in Processor 6
f(1) = c(0) - c(4) in Processor 5
f(3) = c(8) - c(12) in Processor 7 simultaneously
Use crossbar switch 2 to connect Processor 4 to Processor 6 and to connect Processor
5 to Processor 7. Then:
Step 4: Copy f(0) from Processor 4 into Processor 6 and copy f(1) from Processor
5 into Processor 7 simultaneously.
Step 5: Copy f(2) from Processor 6 into Processor 4 and copy f(3) from Processor
7 into Processor 5 simultaneously.
Step 6: Use the equations from the radix-4 16-point example to compute
A(0) = f(0) + f(2) in Processor 4
A(4) = f(1) - jf(3) in Processor 5
A(8) = f(0) - f(2) in Processor 6
A(12) = f(1) + jf(3) in Processor 7 simultaneously
At the same time these computations and data movements are taking place, perform the
equivalent functions in the other three processor clusters, using the data in their processors
and the algorithm steps in Section 9.7.5. This stage also takes 12 clock cycles.
The input data reorganizer converts the input data from a sequential stream into data vectors that can be passed into
the processing array along one of its edges. Additionally, the outputs are passed out of
the array along another, usually opposite, edge and converted back to a sequential set of
passband filter outputs for further processing. Figure 12-14 shows a specific example of this
I/O strategy for the 4 x 4 NEWS connected massively parallel array described in Section
11.3.2 and used later in the implementation example.
[Figure 12-14: 4 x 4 NEWS-connected processor array. Processors 0-15 are arranged in Rows 1-4 with north, east, west, and south connections; an input data reorganizer feeds the top edge of the array and an output data reorganizer collects results from the bottom edge.]
The details of the I/O data reorganizers depend on whether the computational portion
of the FFT algorithm is distributed across all of the processors (minimal latency) or whether
each processor computes an entire FFT (maximum throughput).
Step 1: Load complex samples a(0), a(1), a(2), and a(3) into the input shift register
so that sample a(0) is above Processor 3 (8 clock cycles because the samples are
complex). Then shift this set of four complex samples into the top four processors.
This takes 2 clock cycles because the data is complex, for a total of 10 clock cycles.
Step 2: Load complex samples a(4), a(5), a(6), and a(7) into the input shift register
so that sample a(4) is above Processor 3. This takes 8 clock cycles. Then shift this
set of four complex samples into the top four processors. At the same time, shift
the first four complex samples from the top row of processors to the second row of
processors. This takes 2 clock cycles, for a total of 10 clock cycles. Figure 12-15
shows which of the 16 processors has each of the 16 complex samples, intermediate
and output results at the end of each stage of this algorithm by listing them in their
processor on the same line as the label on the left side of the figure that defines the
stage of the algorithm.
Step 3: Load complex samples a(8), a(9), a(10), and a(11) into the input shift
register so that sample a(8) is above Processor 3. This takes 8 clock cycles. Then
shift this set of four complex samples into the top four processors. At the same time,
shift the second four complex samples from the top row of processors to the second
row and the first set of complex samples from the second row to the third. This takes
2 clock cycles, for a total of 10 clock cycles.
Step 4: Load complex samples a(12), a(13), a(14), and a(15) into the input shift
register so that sample a(12) is above Processor 3. This takes 8 clock cycles. Then
shift this set of four complex samples into the top four processors. At the same time,
shift the first four complex samples from the third to fourth rows, the second set from
the second row to the third, and the third set from the first row to the second. This
takes 2 clock cycles, for a total of 10 clock cycles.
Figure 12-15 shows the locations of the input data samples in the first row of each
processor block.
Stage 2: Compute the Input 4-Point FFTs
To do this, notice that the complex samples in the columns are the ones that must be
combined. Therefore, whatever processing steps are used for one column can be performed
on all four columns at once to compute the four 4-point input FFTs. The steps are as follows:
Step 1: Move the complex samples a(4), a(5), a(6), and a(7) in row 3 to row 2 and
the complex samples a(8), a(9), a(10), and a(11) in row 2 to row 3. This step takes
4 clock cycles because each data point is complex.
Step 2: Copy the complex samples a(4), a(5), a(6), and a(7) in row 2 into row 1
and copy the complex samples a(12), a(13), a(14), and a(15) from row 1 into row 2
so that rows 1 and 2 both have the same complex samples. At the same time do the
same copy function in rows 3 and 4. This step takes 4 clock cycles.
Step 3: In rows 2 and 4 add the two sets of complex samples. At the same time
subtract the complex samples in rows 1 and 3, following the equations in Section
9.7.5. This step takes 2 clock cycles. At the end of this step:
(i) Intermediate results b(0), b(8), b(4), and b(12) are in Processors 15, 14,
13, and 12 (row 4).
(ii) Intermediate results b(1), b(9), b(5), and b(13) are in Processors 11, 10,
9, and 8 (row 3).
(iii) Intermediate results b(2), b(10), b(6), and b(14) are in Processors 7, 6, 5,
and 4 (row 2).
(iv) Intermediate results b(3), b(11), b(7), and b(15) are in Processors 3, 2, 1,
and 0 (row 1).
Step 4: Move the intermediate results b(2), b(10), b(6), and b(14) in row 2 to row 3
and the intermediate results b(1), b(9), b(5), and b(13) in row 3 to row 2. This step
takes 4 clock cycles.
Step 5: Copy the intermediate results b(1), b(9), b(5), and b(13) in row 2 into row
1 and copy the intermediate results b(3), b(11), b(7), and b(15) from row 1 into row
2 so that rows 1 and 2 both have the same intermediate results. At the same time do
the same copy function in rows 3 and 4. This step takes 4 clock cycles.
Step 6: In rows 2 and 4 add the two sets of intermediate results. In rows 1 and 3
subtract the intermediate results, using the equations in Section 9.7.5. This step takes
2 clock cycles.
(i) Intermediate results c(0), c(8), c(4), and c(12) are in Processors 15, 14, 13, and
12 (row 4).
(ii) Intermediate results c(1), c(9), c(5), and c(13) are in Processors 7, 6, 5, and 4
(row 3).
(iii) Intermediate results c(2), c(10), c(6), and c(14) are in Processors 11, 10, 9, and
8 (row 2).
(iv) Intermediate results c(3), c(11), c(7), and c(15) are in Processors 3, 2, 1, and 0
(row 1).
Figure 12-15 shows the locations of these intermediate results in the second row of
each processor block.
(i) Intermediate results c(0), c(8), c(4), and c(12) are in Processors 15, 14, 13, and
12 (row 4).
(ii) Intermediate results c(1), c(9), c(5), and c(13) are in Processors 7, 6, 5, and 4
(row 3).
(iii) Intermediate results c(2), c(10), c(6), and c(14) are in Processors 11, 10, 9, and
8 (row 2).
(iv) Intermediate results c(3), c(11), c(7), and c(15) are in Processors 3, 2, 1, and 0
(row 1).
Figure 12-15 shows the locations of these intermediate results in the third row of each
processor block.
Compute the four 4-point output FFTs by using the intermediate results that are now
located in the rows of the array. The steps are similar to those used in the columns to
compute the 4-point input FFTs. The columns are defined as numbered from left to right.
The steps are:
Step 1: Move the intermediate results in column 2 to column 3 and the intermediate
results in column 3 to column 2. This step takes 4 clock cycles.
Step 2: Copy the intermediate results in column 2 into column 1 and the intermediate
results from column 1 into column 2 so that columns 1 and 2 both have the same
intermediate results. At the same time do the same function in columns 3 and 4. This
step takes 4 clock cycles.
Step 3: In columns 2 and 4 add the two sets of intermediate results. At the same time
subtract the intermediate results in columns 1 and 3, following the Algorithm Steps
in Section 9.7.5. This step takes 2 clock cycles. At the end of this step
(i) Intermediate results f(0), f(8), f(4), and f(12) are in Processors 15, 11,
7, and 3 (column 4).
(ii) Intermediate results f(1), f(9), f(5), and f(13) are in Processors 14, 10,
6, and 2 (column 3).
(iii) Intermediate results f(2), f(10), f(6), and f(14) are in Processors 13, 9,
5, and 1 (column 2).
(iv) Intermediate results f(3), f(11), f(7), and f(15) are in Processors 12, 8,
4, and 0 (column 1).
Step 4: Move the intermediate results in column 2 to column 3 and the intermediate
results in column 3 to column 2. This step takes 4 clock cycles.
Step 5: Copy the intermediate results in column 2 into column 1 and the intermediate
results from column 1 into column 2 so that columns 1 and 2 both have the same
intermediate results. At the same time do the same function in columns 3 and 4. This
step takes 4 clock cycles.
Step 6: Follow the 16-point radix-4 equations to add or subtract the pairs of inter-
mediate results in columns 2 and 4 and in columns 1 and 3. This step takes 2 clock
cycles, and the output frequency components are located as follows:
(i) A(0), A(2), A(1), and A(3) are in Processors 15, 11, 7, and 3 (column 4).
(ii) A(8), A(10), A(9), and A(11) are in Processors 14, 10, 6, and 2 (col-
umn 3).
(iii) A(4), A(6), A(5), and A(7) are in Processors 13, 9, 5, and 1 (column 2).
(iv) A(12), A(14), A(13), and A(15) are in Processors 12, 8, 4, and 0 (col-
umn 1).
This stage takes 20 clock cycles, and Figure 12-15 shows the locations of these output
frequency components in the fourth row of each processor block.
At the other extreme, 16 sets of complex samples can be loaded into the processor
array and then each processor can compute a 16-point radix-4 FFT and output the results.
Option 1 showed that it takes 40 clock cycles for the data input for one set of complex
samples. Therefore, it takes 16 * 40 = 640 clock cycles for 16 sets of complex samples.
The total number of clock cycles for this approach is the number of clock cycles to
perform the 16-point radix-4 FFT plus the data I/O time for 16 sets of complex samples.
The result is a total of 1448 clock cycles, which is the processing latency. The processing
throughput is an average of 1448/16 = 90.5 clock cycles per FFT. Notice that the data I/O
clock cycle total is much larger (1280) than the computational clock cycles (168). This
time can be improved to 1280 clock cycles by requiring each processor to perform data I/O
and computations simultaneously.
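The cycle budget for this option can be tallied as in the C sketch below; the assumption that outputting a set also takes 40 clock cycles mirrors the input figure and is consistent with the 1280-cycle I/O total quoted above.

#include <stdio.h>

int main(void)
{
    int sets       = 16;
    int io_per_set = 40 * 2;             /* assumed: 40 cycles in + 40 cycles out per set */
    int io_total   = sets * io_per_set;  /* 1280 */
    int compute    = 168;                /* one 16-point radix-4 FFT per processor        */

    int serial     = io_total + compute;                       /* 1448 */
    int overlapped = io_total > compute ? io_total : compute;  /* 1280 when I/O overlaps compute */

    printf("serial:     %d cycles total, %.1f cycles/FFT\n", serial,     serial     / (double)sets);
    printf("overlapped: %d cycles total, %.1f cycles/FFT\n", overlapped, overlapped / (double)sets);
    return 0;
}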
The star [1] architecture is most often used when one function or process dominates
the application. It consists of one central processor with interconnections to numerous
others as shown in Figure 12-16. The number of processing elements depends on the FFT
algorithm to be computed. Figure 12-16 is a natural configuration for the 16-point radix-4
FFT because of the four 4-point FFTs computed on the input and output. For this example,
Processor 0 is the data I/O processor and global memory. The other four processors have
the Harvard architecture from Section 12.3.5.
This architecture can also be used in the two extremes of minimum processing latency
(Option 1) and maximum processing throughput (Option 2) described for the crossbar and
massively parallel architectures. Both are described.
Figure 12-17 (excerpt)   Processor 2                Processor 0                Processor 4
Results of Stage 1       a(1), a(5), a(9), a(13)    No Data                    a(3), a(7), a(11), a(15)
Results of Stage 2       c(8), c(9), c(10), c(11)   No Data                    c(12), c(13), c(14), c(15)
Results of Stage 3       c(8), c(9), c(10), c(11)   No Data                    c(12), c(13), c(14), c(15)
Results of Stage 4       No Data                    All Intermediate Results   No Data
Results of Stage 5       c(1), c(5), c(9), c(13)    No Data                    c(3), c(7), c(11), c(15)
Results of Stage 6       A(1), A(5), A(9), A(13)    No Data                    A(3), A(7), A(11), A(15)
Results of Stage 7       No Data                    All Output Results         No Data
Step 1: Load the input data into Processor O. This step takes 32 clock cycles.
Step 2: Move complex samples a(O), a(4), a(8), and a(12) to Processor 1 using 8
clock cycles.
Step 3: Move complex samples a(1), a(5), a(9), and a(13) to Processor 2 using 8
clock cycles.
Step 4: Move complex samples a(2), a(6), a(10), and a(14) to Processor 3 using 8
clock cycles.
Step 5: Move complex samples a(3), a(7), a(11), and a(15) to Processor 4 using 8
clock cycles.
If Processor 0 were a memory that could move data from all four processors at once
(four-port memory), the data transfers in Steps 2-5 could occur simultaneously. The total
is 64 clock cycles to load data into Processor 0 and then distribute it among Processors
1-4. Figure 12-17 shows the locations of these input data samples in the first row of each
processor block.
(i) Intermediate results c(0), c(1), c(2), and c(3) are in Processor 1.
(ii) Intermediate results c(8), c(9), c(10), and c(11) are in Processor 2.
(iii) Intermediate results c(4), c(5), c(6), and c(7) are in Processor 3.
(iv) Intermediate results c(12), c(13), c(14), and c(15) are in Processor 4.
Figure 12-17 shows the locations of these intermediate results in the second row of
each processor block.
(i) Intermediate results c(0), c(1), c(2), and c(3) are in Processor 1.
(ii) Intermediate results c(8), c(9), c(10), and c(11) are in Processor 2.
(iii) Intermediate results c(4), c(5), c(6), and c(7) are in Processor 3.
(iv) Intermediate results c(12), c(13), c(14), and c(15) are in Processor 4.
Figure 12-17 shows the locations of these intermediate results in the third row of each
processor block.
Step 1: Move intermediate results c(0), c(4), c(8), and c(12) to Processor 1, using 8
clock cycles.
Step 2: Move intermediate results c(1), c(5), c(9), and c(13) to Processor 2, using 8
clock cycles.
Step 3: Move intermediate results c(2), c(6), c(10), and c(14) to Processor 3, using
8 clock cycles.
Step 4: Move intermediate results c(3), c(7), c(11), and c(15) to Processor 4, using
8 clock cycles.
Stages 4 and 5 can be done with 16 fewer clock cycles because one of the four results
from each processor output of Stage 3 ends up back in the same processor for the output
4-point FFT computations. This means it does not have to be moved from its location at the
end of Stage 3 into Processor 0 and then back out to the same location in the same processor.
Moving these four complex intermediate results twice takes 16 clock cycles. Therefore,
Stages 4 and 5 can be performed with 48, not 64, clock cycles. Figure 12-17 shows the
locations of the intermediate results in the fifth row of each processor block.
Figure 12-17 shows the locations of these output frequency components in the sixth
row of each processor block.
Stage 3: Collect the Results of the Four 16-Point Radix-4 FFT Computations
It takes 128 clock cycles to move the four sets of 16-point complex results from the
four processors to Processor 0 and another 128 clock cycles to move them out of the processor
array.
The total number of clock cycles for this option is the number of clock cycles to per-
form the 16-point radix-4 FFT plus the data I/O time for four sets of complex samples. This
is a total of 680 clock cycles, which is the processing latency. The processing throughput
is an average of 680/4 = 170 clock cycles per FFT.
1. One set of complex samples is mapped onto all of the processors in the array
and then one FFT is computed. This option usually results in minimum latency
processing.
2. A set of complex samples is mapped onto each of the processors and then a number
of FFTs are performed in parallel. This usually results in the maximum throughput
but has more latency than the first option.
Each extreme is described for mapping the 16-point radix-4 FFT onto the four-
dimensional hypercube architecture from Section 11.4.1. Mapping onto the massively
parallel and hybrid arrays from Chapter 11 is described in general terms, but a detailed
example is not presented because these complex architectures are not suited to implement-
ing the 16-point radix-4 FFT efficiently. Throughout this section, when the k-th input data
sample is written as a(k), it means both the real and imaginary parts of the sample. Specifi-
cally, a(k) = aR(k) + j * aI(k). This same shorthand notation is also used for intermediate
results and output frequency components.
within the processors and a reordering of the data so that the same squares or another set of
squares can be used to compute the output 4-point FFTs. Figure 12-19 shows which of the
16 processors has each of the 16 complex samples, intermediate results, and output results
at the end of each stage by listing them in their processor on the same line as the label to
the left of the figure that defines the stage of the algorithm.
hypercube architecture. Table 12-5 shows the number of clock cycles required to move a
data word from Processor 0 to one of the other processors in the architecture, assuming one
clock cycle to move a data word between any two processors. Notice that, as mentioned in
Chapter 11, the longest path length for a four-dimensional hypercube is 4. In this example
the path from Processor 0 to Processor 10 is the longest. Since each of the input samples is
complex, the numbers in Table 12-5 must be doubled to determine the actual number of
clock cycles used for each complex data input. This stage takes 42 clock cycles.
Table 12-5 Clock Cycles to Move a Data Word from Processor 0 to Each Processor
Processor    Clock cycles
0            0
1            1
2            2
3            1
4            1
5            2
6            3
7            2
8            2
9            3
10           4
11           3
12           1
13           2
14           3
15           2
Step 1: Load complex samples a(0), a(4), a(8), and a(12) into Processors 0, 2, 1,
and 3, using 8 clock cycles.
Step 2: Load complex samples a(1), a(5), a(9), and a(13) into Processors 4, 6, 5,
and 7 by first loading them into Processors 0, 2, 1, and 3 and then moving them to
Processors 4, 6, 5, and 7 in parallel in 2 additional clock cycles. This step takes 10
clock cycles.
Step 3: Load complex samples a(2), a(6), a(10), and a(14) into Processors 8, 10,
9, and 11 by first loading them into Processors 0, 2, 1, and 3 and then moving them
through Processors 4, 6, 5, and 7 in parallel to Processors 8, 10, 9, and 11 in 4
additional clock cycles. This step takes 14 clock cycles.
Step 4: Load complex samples a(3), a(7), a(11), and a(15) into Processors 12, 14,
13, and 15 by first loading them into Processors 0, 2, 1, and 3 and then moving them
to Processors 12, 14, 13, and 15 in parallel in 2 additional clock cycles. This step
takes 10 clock cycles.
Figure 12-19 shows the locations of the complex input samples in the first row of
each processor block.
(iv) Intermediate results c(12), c(15), c(13), and c(14) are in Processors 12,
14, 13, and 15.
Figure 12-19 shows the locations of these intermediate results in the second row of
each processor block.
These can be computed within the individual processors. Since each takes four
multiplies and two adds, the complex multiplies use 6 clock cycles. At this point:
(i) Intermediate results c(0), c(3), c(1), and c(2) are in Processors 0, 2, 1, and 3.
(ii) Intermediate results c(8), c(11), c(9), and c(10) are in Processors 4, 6, 5, and 7.
(iii) Intermediate results c(4), c(7), c(5), and c(6) are in Processors 8, 10, 9, and 11.
(iv) Intermediate results c(12), c(15), c(13), and c(14) are in Processors 12, 14, 13,
and 15.
Figure 12-19 shows the locations of these intermediate results in the third row of each
processor block.
Figure 12-19 shows the locations of the output frequency components in the fourth
row of each processor block.
Step 1: Move output frequency components A(0), A(1), A(2), and A(3) out of the
hypercube first. This step takes 8 clock cycles based on adding the number of clock
cycles in Processors 0, 1, 2, and 3 in Table 12-5 and multiplying by 2 to account for
complex data.
Step 2: Move the answers in Processors 12, 13, 14, and 15 (A(4), A(5), A(7), and
A(6), respectively) into Processors 0, 1, 2, and 3, respectively. This step takes 2 clock
cycles because all four moves can be done at once.
Step 3: Move A(4), A(5), A(6), and A(7) out of the hypercube. Since A(4), A(5),
A(6), and A(7) are now in Processors 0, 1, 3, and 2, this step takes 8 clock cycles. As
in Step 1 of this stage, this is based on adding the number of clock cycles in Processors
0, 1, 2, and 3 in Table 12-5 and multiplying by 2 to account for complex data.
Step 4: Move the answers in Processors 4, 5, 6, and 7 (A(8), A(9), A(11), and
A(10), respectively) into Processors 0, 1, 2, and 3, respectively. At the same time,
the answers in Processors 8, 9, 10, and 11 (A(12), A(13), A(15), and A(14), re-
spectively) can be moved into Processors 4, 5, 6, and 7. This step takes 2 clock
cycles.
Step 5: Move A(8), A(9), A(10), and A(11) out. Since they are now in Processors
0, 1, 3, and 2, this step takes 8 clock cycles. As in Step 1 of this stage, this is based
on adding the number of clock cycles in Processors 0, 1, 2, and 3 in Table 12-5 and
multiplying by 2 to account for complex data.
Step 6: Move the answers in Processors 4, 5, 6, and 7 (now A(12), A(13), A(15),
and A(14) from Step 4 of this stage) into Processors 0, 1, 2, and 3, respectively. This
step takes 2 clock cycles because all four moves can be done at once, each by one
pair of processors.
Step 7: Move A(12), A(13), A(14), and A(15) out. Since they are now in Processors
0, 1, 3, and 2, this step takes 8 clock cycles. As in Step 1 of this stage, this is based
on adding the number of clock cycles in Processors 0, 1, 2, and 3 in Table 12-5 and
multiplying by 2 to account for complex data.
The total is 134 clock cycles of processing load and processing latency.
These complex sample moves take 16 times as many clock cycles as used to move
one set of complex samples into the 16 processors in Stage 1 of Option 1, in this section.
This is a total of 42 * 16 = 672 clock cycles.
Using a Harvard architecture processor at each node, this takes 168 clock cycles,
based on the assumptions in Section 12.3.5.
[Figure: three-layer processor array. Each layer is a row of three processors (P0-P2 in Layer 1, P3-P5 in Layer 2, P6-P8 in Layer 3) with north, south, east, and west connections; up and down connections link the layers.]
The top three processors represent one row of the massively parallel processor array
in Section 11.4.2. The middle and bottom sets of processors each represent a row of an
additional two-dimensional array. The vertical interconnections are the "up" and "down"
connections between these two-dimensional arrays that make the resulting array three-
dimensional. This is a very complex architecture to use efficiently for computing the small
FFT examples from Chapter 9. In all likelihood, if this architecture had to compute the 16-point
radix-4 FFT, it would use one of the two approaches described for the two-dimensional mas-
sively parallel processor in Section 12.5.2. The two additional layers of two-dimensional
processors would process more sets of data, but the interconnections between vertical layers
would not be used. The result is that the computational throughput and latency would be
multiplied by how many layers of two-dimensional processors were in the array.
Figure 12-21 Harvard architecture block from parallel array.
12.7 ALGORITHM MAPPING EXAMPLES COMPARISON MATRIX
All entries are in clock cycles per FFT (see Table 12-6 on page 314).
12.8 CONCLUSIONS
Algorithms and data are distributed and redistributed among the processors in the course of
computing the entire algorithm. The data map figures for four parallel and multidimensional
arrays depict where the data resides at the end of each stage of computing an algorithm.
This awareness makes it easier to understand how the reorganization of the data among
the processors was done in the examples. This chapter concludes the portion of the book
on architectures and algorithms. The next four chapters deal with selecting hardware and
testing it.
REFERENCES
[1] T. Fountain, Processor Arrays Architecture and Applications, Academic Press, London,
1987.
[2] S. K. Mitra and J. F. Kaiser, Handbook for Digital Signal Processing, Wiley, New York,
1993.
[3] R. W. Hockney and C. R. Jesshope, Parallel Computers, Adam Hilger, Bristol, England,
1981.
13
Arithmetic Formats
13.0 INTRODUCTION
After the hardware architecture selection is made, the exact chip can only be chosen by
deciding what arithmetic format will best meet the specification. The primary effect of
the format choice is in the accuracy of the results. Three arithmetic formats are used for
computing FFTs: fixed-point, floating-point, and block-floating-point.
Prior to the development of DSP chips, the choice of fixed-point arithmetic resulted in
faster and smaller hardware architectures than floating-point or block-floating-point arith-
metic. However, the opposite is generally true today, as can be seen in the Comparison
Matrices of Chapter 14.
Since the primary effect of choosing the arithmetic format is the accuracy of the results,
the performance measures here are those that quantify the computational accuracy of FFT
algorithms.
be used to further refine the arithmetic format decision to specific bit lengths. For example,
16-, 20-, and 24-bit fixed-point programmable DSP chips are commercially available and
described in Chapter 14.
13.2.1 Fixed-Point
Fixed-point [1] numbers are like working with integers. The format has a specific
number of bits, say 16, to represent the numbers, and the binary point (comparable to the
decimal point for base 10 numbers) is located at a fixed position among the bits. It might
be to the right of all the bits. In this case all of the numbers are represented as integers. It
might be to the left of all the bits. In this case all the numbers are less than 1 (i.e., fractions).
The other feature of fixed-point arithmetic formats is that one of the bits is used to
represent the sign of the numerical value. Generally, the sign bit is the most significant bit
with 0 representing positive numbers and 1 representing negative numbers. For an n-bit
format where all of the numbers are represented as fractions, the binary point is between
the sign bit and the other n - 1 bits. All of the fixed-point DSP chips in Chapter 14
have a multiplier-accumulator block diagram similar to that in Figure 13-1 to implement
fixed-point arithmetic.
n = 1 + log2[D + 1]     (13-1)
Arithmetic Accuracy. The binary point in a fixed-point format controls its arith-
metic accuracy. If the binary point is all the way to the right, numbers are all represented
as integers. Therefore, the numbers are only accurate to 1/2. If the binary point in an n-bit
format is just to the right of the sign bit, then there are (n - 1) fractional bits. This makes
the largest fractional bit 2^-1 and the smallest fractional bit 2^-(n-1), which translates into
numbers being represented to an accuracy of 2^-n. For example, in a 16-bit format with the
binary point just to the right of the sign bit, the least significant bit is 2^-15, which means
numbers are accurate to 2^-16. Therefore, the location of the binary point depends on the
required accuracy of the computations.
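As a concrete illustration of such a fractional format, the C sketch below quantizes values into a 16-bit format with the binary point just to the right of the sign bit (commonly called Q15) and performs the fractional multiply of the kind implemented by the multiplier-accumulator of Figure 13-1; the saturation and truncation choices here are illustrative assumptions rather than the behavior of any particular chip.

#include <stdint.h>
#include <stdio.h>

static int16_t to_q15(double x)              /* quantize |x| < 1 to Q15 */
{
    double scaled = x * 32768.0;
    if (scaled >  32767.0) scaled =  32767.0;   /* saturate */
    if (scaled < -32768.0) scaled = -32768.0;
    return (int16_t)scaled;
}

static int16_t q15_mul(int16_t a, int16_t b)  /* fractional multiply, truncated */
{
    return (int16_t)(((int32_t)a * b) >> 15);
}

int main(void)
{
    int16_t a = to_q15(0.5), b = to_q15(-0.25);
    printf("0.5 * -0.25 = %f in Q15\n", q15_mul(a, b) / 32768.0);
    return 0;
}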
Quantization Noise Escalation. Fixed-point quantization noise is a nonlinear
phenomenon that depends on the data and the sequence of computations. Analysis of quanti-
zation noise for power-of-two FFTs has determined a rule-of-thumb for growth of the noise
relative to the signal as a function of the transform length of roughly 1/2 bit per power-of-
two [1]. For example, a 1024-point FFT has twice the quantization noise, relative to the
signal level, as a 256-point FFT has. The actual levels depend on the signal being analyzed.
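Expressed as an amplitude ratio, that half-bit-per-doubling rule of thumb can be evaluated as in the C sketch below; the function name is illustrative only.

#include <math.h>
#include <stdio.h>

/* Rule of thumb from the text: fixed-point FFT noise (relative to the
 * signal) grows by roughly half a bit per doubling of the length.       */
static double noise_growth(double n_from, double n_to)
{
    double octaves = log2(n_to / n_from);   /* number of length doublings */
    return pow(2.0, 0.5 * octaves);         /* 1/2 bit per doubling       */
}

int main(void)
{
    printf("256 -> 1024 points: noise grows by a factor of %.2f\n",
           noise_growth(256.0, 1024.0));    /* two doublings -> 2.00 */
    return 0;
}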
The drawback of the fixed-point format is that this quantization noise is relatively
independent of the size of the frequency component. Therefore, the signal-to-noise ratio is
large for strong frequency components and small for weak ones. This sometimes causes
small frequency components to be masked by the quantization noise.
Quantization noise for fixed-point FFTs has also been analyzed for the Winograd [2]
algorithm. The growth trend is roughly the same as for power-of-two algorithms, and the
actual amount of quantization noise is slightly larger than for power-of-two algorithms.
13.2.2 Floating-Point
Floating-point [3] numbers are like performing computations in scientific notation.
The allotted digits that represent each number are divided between the exponent and the
mantissa of the number. In a decimal floating-point format, numbers such as 536 are
represented as 5.36 * 10^2. In a binary floating-point format, 536 would be represented
based on decomposing it by powers of two. Namely, 536 = 512 + 16 + 8. Just as for
decimal scientific notation, this number can be written as 1000011000, or normalized as
1.000011000 x 2^9. Therefore, a binary floating-point number has a certain number of
digits to represent the mantissa (1.000011000 in the example) and to represent the exponent
(9 = 01001 in the example). Notice that to represent numbers with magnitudes less than 1,
the exponent is negative. In those cases one of the bits in the exponent must be used as a
sign bit. Figure 13-2 is a functional block diagram for floating-point addition, and Figure
13-3 is a functional block diagram for floating-point multiplication, as they are typically
implemented by the floating-point DSP chips in Chapter 14.
[Figure 13-2: functional block diagram for floating-point addition.]
[Figure 13-3: functional block diagram for floating-point multiplication (add the exponents, multiply the mantissas, scale the output results).]
Quantization Noise Escalation. For floating-point arithmetic, the corresponding
rule-of-thumb is that the noise relative to the signal grows roughly in proportion to log2(N)
for an N-point power-of-two FFT. For example, a 1024-point FFT has 10/8 = 1.25 times the
amount of quantization noise, relative to the signal level, of a 256-point FFT. The
actual levels depend on the signal being analyzed and are controlled by the number of bits in
the mantissa: the larger the number of mantissa bits, the smaller the quantization noise level.
Quantization noise for floating-point FFTs has also been analyzed for the prime factor
[4] algorithm. The growth trend is roughly the same, and the actual amount of quantization
noise is slightly larger than for power-of-two algorithms.
13.2.3 Block-Floating-Point
[Block-floating-point block diagram: input data from memory feeds a building-block algorithm and data scaler; magnitude detection drives a scale factor accumulator; multiplier constants are supplied to the building block; output data returns to memory.]
The arithmetic in each building block of the FFT algorithm is performed as fixed-
point arithmetic. However, from stage to stage, the intermediate answers are evaluated to
ensure that the full dynamic range of the fixed-point numbers is being utilized. If not, all
of the intermediate values are scaled enough so that the largest value uses roughly half
of the full dynamic range. Then the next stage of computations is performed and the
results reevaluated. The processor keeps track of the net scaling that has occurred from
stage to stage as an exponent that effectively increases the dynamic range of the processor.
The scaling only uses half the dynamic range because the next stage of a power-of-two FFT
algorithm will have a gain of 2 for sine-wave inputs. This keeps the fixed-point computation
from overflowing.
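A much-simplified C sketch of this stage-to-stage scaling is shown below; it models the magnitude detection and scale factor accumulator in software and is not intended to match any particular chip's hardware. The routine and threshold names are illustrative.

```c
#include <math.h>
#include <stdio.h>

#define N 8

/* Simplified block-floating-point pass: after an FFT stage, find the largest
   magnitude; if it uses less than half the fixed-point range, shift the whole
   block up and remember the shift in a block exponent. */
static void scale_block(float x[], int n, int *block_exp)
{
    float peak = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(x[i]) > peak) peak = fabsf(x[i]);

    /* Treat 1.0 as full scale; keep the peak between one quarter and one half
       of full scale so the next stage's gain of 2 cannot overflow. */
    while (peak > 0.0f && peak < 0.25f) {
        for (int i = 0; i < n; i++) x[i] *= 2.0f;
        peak *= 2.0f;
        (*block_exp)--;            /* net scaling tracked as an exponent */
    }
}

int main(void)
{
    float data[N] = {0.01f, -0.02f, 0.005f, 0.03f, -0.01f, 0.02f, 0.0f, 0.015f};
    int block_exp = 0;

    scale_block(data, N, &block_exp);   /* would be called between stages */
    printf("block exponent after scaling: %d\n", block_exp);
    return 0;
}
```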
Key to Variables
n = number of bits in a fixed-point arithmetic format
LSB = numerical value of least significant bit of fixed-point arithmetic format
p = 2^e, where e is the number of bits used to represent the exponent
Mantissa LSB = numerical value of least significant bit of floating-point mantissa
N = number of points in FFT
13.4 CONCLUSIONS
An application usually has a specification for dynamic range and/or arithmetic accuracy.
This chapter shows how to determine which arithmetic format best meets the product speci-
fication. If a format cannot meet the specifications, the chips in the next chapter that use that
format are automatically eliminated from consideration. This is usually the first decision
in selecting a chip.
REFERENCES
[1] P. D. Welch, "A Fixed-Point Fast Fourier Transform Error Analysis," IEEE Transactions
on Audio and Electroacoustics, Vol. AU-17, pp. 151-157 (1969).
[2] R. W. Patterson and J. H. McClellan, "Fixed-Point Error Analysis of Winograd Fourier
Transform Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol. ASSP-26, No. 4, pp. 447-455 (1978).
[3] C. J. Weinstein, "Roundoff Noise in Floating Point Fast Fourier Transform Computation,"
IEEE Transactions on Audio and Electroacoustics, Vol. AU-17, pp. 209-215 (1969).
[4] D. C. Munson, Jr. and B. Liu, "Floating Point Roundoff Error in the Prime Factor FFT,"
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, No. 4,
pp. 877-882 (1981).
[5] A. V. Oppenheim and C. J. Weinstein, "Effects of Finite Register Length in Digital
Filtering and the Fast Fourier Transform," Proceedings of the IEEE, Vol. 60, No. 8, pp.
957-976 (1972).
14
Chips
14.0 INTRODUCTION
This chapter gives an objective description of commonly available DSP chips for executing
FFT algorithms. A unique feature is the "generic" DSP chip block diagram, to which
all the commercial DSP chips are standardized and compared, to simplify understanding
their differences. Making the decision about which chip to use depends on the arithmetic
format, algorithm and data mapping process (Chapter 12), and the architecture's efficiency
at performing that algorithm. FFT code can be written for any programmable processor
chip; however, Harvard architectures are specifically designed to execute FFTs efficiently
and thus are the only type used in this chapter.
Programmable DSP chips fall into four categories. The most popular category is
general-purpose programmable chips. These chips are designed to efficiently execute FFT
and FIR filter algorithms. However, they also have
enough general-purpose instructions to be used in a variety of non-DSP functions, partic-
ularly when the functions can utilize the on-chip multipliers. Motor controllers, modems,
and matrix arithmetic are good examples of these more general-purpose applications. The
earliest of these chips used fixed-point arithmetic because the more complex floating-point
computations and buses required too much integrated circuit area to be practical. More re-
cent generations are available in fixed- and floating-point arithmetic formats (Chapter 13).
The following performance measures are the keys to characterizing the ability of a pro-
grammable DSP chip to efficiently compute FFT algorithms.
input cycles and 2 * N output cycles to move data on and off the chip. The parallel ports are
also used to move data and program instructions into the chip from off-chip memory. If the
data and program fit in the on-chip memory, these parallel port functions are not needed.
The on-chip data memory words performance measure is the total number of words
of RAM available on a DSP chip for storing the FFT input, output, and intermediate data
values. This is important because it defines how large an FFT can be computed, with
all of the data in the on-chip memory. An N -point complex FFT requires at least 2 * N
data memory locations on the chip for the entire algorithm to be performed on-chip. The
Comparison Matrices in Chapters 8 and 9 show the data memory required to compute each
algorithm, and the Comparison Matrices in this chapter show the data memory available
in each chip. All chips in this chapter have temporary registers. If these registers are not
being used when they are needed by the algorithms in Chapters 8 and 9, they may be used
to reduce the data memory required for intermediate computational results.
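As a quick check of the 2 * N rule, the following C sketch estimates the largest power-of-two complex FFT that fits in a given number of on-chip data RAM words; it ignores any words needed for coefficients, which, as noted above, may be kept in program memory or temporary registers.

```c
#include <stdio.h>

/* Largest power-of-two complex FFT length N such that 2*N data words
   (real and imaginary parts) fit in the available on-chip data RAM. */
static int max_onchip_fft(int data_ram_words)
{
    int n = 1;
    while (2 * (n * 2) <= data_ram_words)   /* would doubling N still fit? */
        n *= 2;
    return (2 * n <= data_ram_words) ? n : 0;
}

int main(void)
{
    printf("512-word data RAM  -> %d-point FFT\n", max_onchip_fft(512));   /* 256  */
    printf("2048-word data RAM -> %d-point FFT\n", max_onchip_fft(2048));  /* 1024 */
    return 0;
}
```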
The on-chip program memory words performance measure is the total number of
words of memory available on a DSP chip for the FFT program. This is important because
it defines how large the FFT program can be without using off-chip program memory.
When off-chip program memory is required, it reduces the efficiency of the chip because
accessing instructions from off-chip memory is usually slower than accessing them from
on-chip memory.
Address generators are used to compute where to get the data for the next computation
and where to store the results of the present computation (the Memory Map), so that the
arithmetic units can spend all of their time computing the Algorithm Steps. There is usually
one address generator for each on-chip data memory block. The address generators that are
capable of stepping multiple, as well as single, address locations can be used by all of the
FFT algorithms given in Chapters 8 and 9.
This section describes the function that each block in Figure 14-1 performs in computing
FFTs. This "generic" block diagram of a programmable DSP chip is a unique feature of the
book. All the vendor block diagrams have been standardized to this generic one to make
it easy to compare them and to see where and how they differ. The following methods are
used to identify how a specific chip varies from the generic diagram: bold lines indicate
where a new connection exists; double bold lines indicate where one or more buses are
added to an existing one; dotted lines show where a connection does not exist; shaded
blocks are modified functions; diagonal shaded blocks are new functions; and dotted line
blocks are ones that do not exist. Differences that do not affect FFT performance are not
covered.
[Figure 14-1: generic programmable DSP chip block diagram, showing program control, multiplier-accumulator, ALU, serial I/O, and the program, data, and I/O buses.]
With dual data memory banks, both operands (data value and multiplier constant) can be accessed from memory in one clock cycle rather
than by sequentially addressing them in one data memory bank.
The role of on-chip program memory was explained in Section 14.1.4. The algorithms
that require the least amount of program memory are the ones with simple computational
building blocks and the simplest memory maps. The power-of-primes algorithms from
Chapter 9 fit this description if the multiplier coefficients are stored in data memory. If these
coefficients are stored in program memory, then the prime factor algorithms can result in
the smallest program memory because they only require a few multiplier coefficients and
are also computed with simple building blocks. The exact length of program memory can
only be determined by writing the code.
All of the DSP chips in this chapter have at least one on-chip bus dedicated to data
movement. Some chips have two data buses, each connected to a data memory. For
FFT algorithms these dual buses make it convenient to store FFT or weighting function
constants in one memory and data in the second. FFT algorithms that are structured for the
maximum use of the multiply-accumulate function have an advantage on the multiple-data-
bus architectures because both multiplier and multiplicand can be pulled from data memory
in one instruction cycle. The SWIFT, Singleton, and PTL algorithms from Chapters 8 and
9 are the best examples of multiply-accumulate-intensive FFT algorithms.
The purpose of the off-chip data bus is to access data blocks that are too large to
store on-chip. Because of pin limitations, there is generally only one off-chip data bus.
There are exceptions, and they are explained under the appropriate chip family. Ideally, the
time required to access off-chip data memory should be the same as for on-chip memory.
However, DSP chip I/O limitations, off-chip data memory speed, or cost factors often result
in the off-chip data access time being larger than the access time for internal data. This
causes FFT performance to degrade when off-chip data memory is required.
Even if off-chip data memory accesses are at the same speed as internal ones, the chip
will be slower executing from off-chip data memory if there are two internal data buses.
The reason is that the external data inputs to the multiplier or adder must be accessed one
at a time rather than in parallel. This adds clock cycles to the computation, which results
in longer FFT execution times.
If off-chip program memory is used, this bus is also used to carry program memory
instructions to the chip. This reduces the data I/O rate that can be supported. Accessing
externally stored program instructions is generally implemented by moving substantial
chunks of program code to the chip's internal program RAM and then executing that code
until another set of code is required. The building-block formulation of the FFT algorithms
in Chapters 8 and 9 is ideal for this approach because each building block's code can be
moved into the chip and executed on the entire data set. Then code for the next building block
is moved into the chip and the process repeated. This implies that mixed-radix algorithms
with identical small building blocks, power-of-primes, are ideal in this situation. Of these,
the power-of-two algorithms are the best because they require the smallest amount of code
to be transferred into the chip.
On-chip address buses have two functions. The first is to provide the address needed
to point to the next program memory location. Second, they are used for providing the
addresses to data memory to access input and intermediate data values and multiplier con-
stants. Figure 14-1 shows a program address bus and a data address bus. DSP chips have
the same number of data buses as they have data memories and the same number of address
buses as they have program and data memories. This makes the address buses extensions of
data and program memory in terms of their effect on FFT algorithms.
For most DSP chips, the off-chip address bus plays a dual role. If data must be stored
off-chip, this bus provides the addresses to access the off-chip data for processing and for
returning answers to the off-chip data memory. If the FFT program is too large to store
in the DSP chip, this bus supplies the address sequence to the off-chip program memory.
DSP chip I/O limitations, off-chip data memory speed, or cost factors often result in the
off-chip access time being larger than the access time for internal memory. This causes
FFT performance to degrade.
However, FFT performance can also degrade when the off-chip memory accesses
work at the full internal rates. This happens when there are independent address buses
inside the chip for program and data memory. Outside the chip, pin limitations usually
result in those buses being multiplexed (MUX) as shown in Figure 14-1. Additionally, if
there are multiple internal data address buses, the off-chip address bus is further shared,
resulting in additional performance decreases.
tractions. In the second and third columns are the initial address and address increment
to accomplish this addressing. The fourth column lists the data memory addressing se-
quence for each group of input data values that resulted from the inputs to the address
generator.
[Figure 14-2: typical DSP chip address generator, built from registers and modulo logic.]
The role of the serial I/O ports was explained in Section 14.1.2. Figure 14-3 is a
typical block diagram for a serial I/O interface in a programmable DSP chip. Some chips
have one serial port and some have as many as six. These appear to have been originally
provided to allow a convenient data interface with inexpensive voice bandwidth A/D and
D/A converters for modem applications. However, more recent generations of DSP chips
also use them for interchip communications in multiprocessor architectures. The value of
this interface is that it requires few pins and reduces the interrupt overhead to the main
processing circuitry to one clock cycle per input or output word.
In Figure 14-3, data is input to the receive shift register one bit at a time. Once an
entire word is loaded, it is shifted in parallel to the receive buffer used to load it into the main
processor. The main processor then uses one instruction cycle to move the data from the
receive buffer to its data memory. The receive buffer allows the main processor to load the
new data word asynchronously with the reception of the word through the serial port. The
reverse sequence of operations is used to output parallel data words through the serial port.
For FFT applications the reduction of interrupt overhead to one instruction cycle makes it
less likely for the data I/O rate to become the system bottleneck.
Table 14-1 Address Generator Sequences for the 16-Point Radix-4 FFT Example
Multiple serial ports also provide a way to interconnect multiple DSP chips into the
architectures defined in Chapter 11, without significant overhead. The programmable DSP
chips described in this chapter have one, two, four, or six serial ports. Figure 14-4 is an
example of how to form a pipeline multiprocessor architecture using two serial ports. Figure
14-5 shows how to form a 2-D array massively parallel architecture using four serial ports.
Figure 14-6 shows how to form a 3-D massively parallel multiprocessor architecture using
six serial ports. The ports that go to the adjacent layers are labeled. Refer to Chapter 12
for details on the features of each of these architectures for the various FFT algorithms in
Chapters 8 and 9.
[Figure 14-3: typical serial I/O interface, showing the internal parallel data bus, transmit and receive buffers, transmit and receive shift registers, and control logic.]
Figure 14-4 Two serial ports to form a bus/pipeline architecture.
[Figure 14-5: four serial ports (S1 through S4) used to form a 2-D array architecture.]
[Figure 14-6: six serial ports (S1 through S6) used to form a 3-D architecture; DSP 0 through DSP 3 are shown with the ports to adjacent layers labeled.]
accumulator output can also be rounded off to n bits and the results returned to data memory.
Several bells and whistles have been added by the individual vendors to optimize the MAC
for specific tasks. The most visible one is shifting logic that aligns the binary point for the
add and multiply processes. This function is not included in Figure 14-8 because it occurs
in different places for different chip families and its location has little effect on the overall
computation time for an FFT algorithm.
[Figure 14-8: generic multiplier-accumulator, showing two input data registers feeding a multiplier, an accumulator with round-off, an ALU, and the output results path.]
Chip vendors usually provide some FFT benchmark for how long it takes their chips
to perform some power-of-two-length FFT. Often the 1024-point FFT is used. From the
given benchmark the performance of any power-of-two FFT length N can be estimated by
using one of two techniques, depending on whether the chip can perform the FFT entirely
on-chip or needs external data memory. The estimated 1024-point FFT benchmarks in the
Comparison Matrices of this chapter are based on the techniques described below.
Case 1: Benchmark and Desired FFT Both Use On-Chip or Off-Chip Data Memory
In this case, the following equation can be used:
N-point FFT time = (1024-point FFT time) * [5 * N * log(N)] / [5 * 1024 * log(1024)]    (14-1)
For example, to estimate the time it takes to perform a 256-point complex FFT, compute
5 * 256 * log(256)/[5 * 1024 * log(1024)] = 0.2 times the 1024-point FFT time.
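Equation 14-1 can be wrapped in a small C helper such as the one below; the 1-ms benchmark value is hypothetical, and any logarithm base may be used because the base cancels in the ratio.

```c
#include <math.h>
#include <stdio.h>

/* Scale a 1024-point FFT benchmark to another power-of-two length N,
   per Equation 14-1: time(N) = time(1024) * [5*N*log(N)] / [5*1024*log(1024)]. */
static double est_fft_time(double t1024_us, int n)
{
    return t1024_us * (5.0 * n * log2((double)n))
                    / (5.0 * 1024.0 * log2(1024.0));
}

int main(void)
{
    double t1024 = 1000.0;   /* hypothetical 1-ms (1000-us) vendor benchmark */
    printf("256-point estimate: %.0f us\n", est_fft_time(t1024, 256));  /* ~200 */
    return 0;
}
```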
Case 2: Benchmark Uses On-Chip Data Memory and the Desired FFT Uses Off-
Chip Memory
The only place Equation 14-1 fails to provide accurate estimates is when the FFT
length gets too long for the FFTs to be computed with on-chip data memory. When off-chip
data memory is required, the efficiency of the chip is reduced because accessing off-chip
memory is slower than accessing on-chip memory. When this occurs, understanding the
building-block approach to the FFT algorithm becomes the key to estimating the perfor-
mance of the chip for the needed FFT length. The steps to estimating the chip's performance
are as follows:
Step 1: Divide the FFT Length into Building-block Lengths with Known FFT
Performance
Chapter 9 presents three categories of FFT algorithms. All three use the building-
block approach. In each case, if the N-point FFT can be factored into P-point and Q-point
building blocks (N = P * Q), then the FFT algorithm requires P of the Q-point building-block
computations, followed by Q of the P-point building-block computations. For those computa-
tions, some algorithms need some complex multiplications. Factor N such that the chip
can perform the P- and Q-point FFTs using only on-chip memory. Further, choose P and
Q such that their on-chip performance is known. If it is not known, choose P and Q so
that their performance can be calculated by using Equation 14-1.
Step 2: Compute the Time Required to Compute All the P- and Q-point FFTs
This is done by computing:
FFT time = P * (Q-point FFT time) + Q * (P-point FFT time)    (14-2)
Step 3: Compute the Time for Moving Data On and Off the Chip
Assume all data is stored in off-chip data memory. To compute a P-point FFT, move
P data samples onto the chip, perform the P-point FFT, and return the answers to off-chip
memory. Since this is done Q times, all of the data is moved onto the chip and the answers
back off again once for the P-point FFTs and once for the Q-point FFTs. Therefore, the
data transfer time is:
Data transfer time = (Data word transfer time) * (2 words) * (2 for on and off) * N    (14-3)
Step 4: Compute the Time for Complex Multiplies
DSP chips usually specify the time required to perform a multiply. Determine the
number, X, of complex multiplies required for the desired algorithm and FFT length. Then
compute
Total complex multiply time = X * (time per complex multiply)    (14-4)
Step 5: Add All Times that Contribute
The total FFT performance time estimate is:
Total time estimate = FFT time + data transfer time + complex multiply time    (14-5)
If all of the data can be stored on-chip, the data transfer time is not part of the total time
estimate. The effect of this on the chip's FFT performance depends on the data I/O speed
of the chip and the speed of the off-chip memory. Table 14-2 illustrates that Equation 14-1
works and also illustrates the performance degradation suffered by using off-chip memory,
with two generations of fixed-point DSP chips from Texas Instruments. In moving from
64 to 256 points, the computation time is expected to increase by roughly a factor of
5 * 256 * log(256)/[5 * 64 * log(64)] = 5.333. Similarly, moving from 256 to 1024
points should increase the computation time by roughly a factor of 5 * 1024 * log(1024)/
[5 * 256 * log(256)] = 5. The TMS320C5x series follows these ratios closely because this
generation of chips has enough on-chip RAM to compute any of these three FFT lengths.
The TMS320C2x series follows closely for the transition from 64 to 256 points because it
has enough RAM for the 256-point FFT. However, the ratio for moving from 256 to 1024
points is larger than expected because off-chip data memory is required.
[Table 14-2: 64-, 256-, and 1024-point FFT clock cycle counts for two generations of TI fixed-point DSP chip families.]
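Equations 14-2 through 14-5 can be collected into one estimate, as in the following C sketch; all of the timing inputs are hypothetical placeholders to be replaced with values from the chip's data book.

```c
#include <stdio.h>

/* Estimate for an N = P*Q point FFT computed with off-chip data memory,
   following Equations 14-2 through 14-5. All times in microseconds. */
struct fft_estimate_in {
    int    P, Q;               /* building-block lengths, N = P*Q            */
    double t_p_fft, t_q_fft;   /* on-chip P- and Q-point FFT times           */
    double t_word_io;          /* time to move one data word on or off chip  */
    double t_cmul;             /* time for one complex multiply              */
    long   n_cmul;             /* number of complex multiplies, X            */
};

static double fft_total_time(const struct fft_estimate_in *in)
{
    long   N        = (long)in->P * in->Q;
    double fft_time = in->P * in->t_q_fft + in->Q * in->t_p_fft;   /* Eq 14-2 */
    double xfer     = in->t_word_io * 2.0 * 2.0 * N;               /* Eq 14-3 */
    double cmul     = in->n_cmul * in->t_cmul;                     /* Eq 14-4 */
    return fft_time + xfer + cmul;                                 /* Eq 14-5 */
}

int main(void)
{
    /* Hypothetical numbers for a 1024 = 32 * 32 point factoring. */
    struct fft_estimate_in in = {32, 32, 20.0, 20.0, 0.1, 0.2, 1024};
    printf("Estimated total: %.0f us\n", fft_total_time(&in));
    return 0;
}
```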
The biggest market for these chips has been telecommunications applications such as modems
and fax. However, today these chips are used for a broad range of applications that require
high-speed arithmetic computations and can tolerate the dynamic range constraints of fixed-
point arithmetic explained in Chapter 13.
14.3.1 Analog Devices ADSP-21xx Family
The ADSP-21xx family is a series of 16-bit DSP chips that offers a variety of bells
and whistles to meet specific application needs. However, few of these have a dramatic
impact on FFT performance. The primary impact is in the data I/O capability for an application.
The members of this family are ADSP-2100A, ADSP-2101, ADSP-2103, ADSP-2105,
ADSP-2111, ADSP-2115, ADSP-216x, ADSP-2171, ADSP-2175, and ADSP-21msp5xx,
where the "x" means that there are several subfamily members of that family member (see
Figure 14-9) [1-4].
[Figure 14-9: ADSP-21xx family block diagram, standardized to the generic diagram, showing program memory, program control, multiplier-accumulator, ALU, serial bus, and analog I/O.]
Serial I/O. All of this family, except the ADSP-2105, have dual serial ports with
hardware companding circuitry. This additional serial port provides the capability to inter-
face these devices into linear bus, pipeline, and ring bus architectures for multiprocessor
applications (Section 14.2.9) without having to use the parallel bus that may be addressing
off-chip data or program memory.
The companding hardware is an advantage in applications where the FFT is obtaining
its data from an A/D converter or sending its results to a D/A converter. If the A/D and
D/A converters are connected to networks such as the telephone system, the voltages they
convert may be logarithmically compressed by using either the A-law (European standard)
or µ-law (U.S. standard). Since the FFT is assuming linear data, the input data must be
converted to linear form. This function is called companding. If companding is performed
in software, it takes several instruction cycles. If the process takes 10 instruction cycles, the
total data I/O time for an N-point complex FFT increases from 4 * N to at least 10 * 4 * N
instruction cycles. Since the FFT takes roughly 5 * N * log2(N) instructions, an FFT
becomes I/O limited when 10 * 4 * N > 5 * N * log2(N). This occurs for N < 256 points.
The companding hardware removes the need for these 10 cycles and allows the data I/O
overhead to return to one cycle per word so that I/O limiting only occurs for 2-point FFTs,
based on the inequality.
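The inequality can be checked directly. The C sketch below assumes, as the text does, 4 * N I/O cycles per complex FFT at one cycle per word and 5 * N * log2(N) computation cycles.

```c
#include <math.h>
#include <stdio.h>

/* Returns 1 if data I/O dominates computation for an N-point complex FFT,
   given the number of instruction cycles spent per I/O word. */
static int io_limited(int n, int cycles_per_word)
{
    double io_cycles  = (double)cycles_per_word * 4.0 * n;
    double fft_cycles = 5.0 * n * log2((double)n);
    return io_cycles > fft_cycles;
}

int main(void)
{
    /* With 10 cycles per word (software companding) a 128-point FFT is I/O
       limited; with 1 cycle per word (hardware companding) it is not. */
    printf("software companding: %d, hardware companding: %d\n",
           io_limited(128, 10), io_limited(128, 1));
    return 0;
}
```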
Other Data I/O. The ADSP-21msp50 and ADSP-21msp51 provide a full voice
band analog interface which includes 16-bit Sigma-Delta A/D and D/A converters, an-
tialiasing and antiimaging filters, and automatic gain control (AGC). Voice applications,
such as speech recognition, that use FFTs (see the example in Chapter 17) can use this
feature to reduce the cost of development and production.
Data Memory. Only the ADSP-2171 and ADSP-2175 have enough on-chip data
RAM to perform a 1024-point FFT, and the ADSP-2171 is marginal since it has just 2048
data memory words. It would require all of the weighting function and multiplier constants
to be in program memory. Therefore, the 1024-point FFT benchmarks for the other chips
in this family already reflect the slowdown incurred by having to store data off-chip. This
means that Equation 14-1, the FFT performance estimator, will work for FFT performance
above 1024 points but gives answers that are too large for smaller transform lengths. The
Programmable Fixed-Point Chips Comparison Matrix (Section 14.4) shows that the ADSP-
2171 and ADSP-2175 have significantly better 1024-point FFT computation times than the
other devices in this family because of the additional on-chip data memory.
Address Generators. All of the members of this family have dual address gener-
ators. This maximizes the ability to address both data and multiplier constants to feed to
the MAC unit on each instruction cycle. The flexibility of the address step sizes for these
generators also allows them to be easily used to execute non-power-of-two algorithms as
well as standard FFTs. Address generator 1 also has bit-reverse logic to accommodate
standard power-of-two algorithms.
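In software, bit-reversed addressing of the kind this logic provides for power-of-two FFTs looks roughly like the following C sketch; it is a generic illustration, not the chip's address generator microcode.

```c
#include <stdio.h>

/* Reverse the low 'bits' bits of index i, the access order used to unscramble
   the outputs of an in-place power-of-two FFT. */
static unsigned bit_reverse(unsigned i, int bits)
{
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

int main(void)
{
    /* For an 8-point FFT (3 bits) the order is: 0 4 2 6 1 5 3 7 */
    for (unsigned i = 0; i < 8; i++)
        printf("%u ", bit_reverse(i, 3));
    printf("\n");
    return 0;
}
```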
Program Boot. This is additional logic to allow the on-chip program RAM to be
loaded during the power-up phase of the application's operation from a low-speed 24-bit-
wide EPROM to lower the cost of the overall application. It also allows multiple programs
to be swapped in and out of the chip's on-chip program memory without having to store
them in high-speed off-chip program RAM.
Unlike other DSP chip manufacturers, AT&T introduced the DSP16 line of fixed-point
chips after having a floating-point chip (DSP32) in the market. The most characteristically
different feature of this fixed-point family is the instruction cache provided to run inner-loop
computations rapidly. The members of this family are DSP16 and DSP16A (see Figure
14-10) [5,6].
[Figure 14-10: AT&T DSP16/DSP16A block diagram, showing the off-chip parallel data bus, on-chip parallel data buses, program control, multiplier-accumulator, ALU, and serial bus.]
Cache RAM. The 15 instructions of on-chip cache RAM can execute a set of
repetitive operations up to 127 times to increase the throughput and coding efficiency.
This is particularly valuable for power-of-prime FFf algorithms where the same building
block is used throughout the computations. In particular, the 2-point building block would
easily fit into this RAM. The 4-point building block is a series of four 2-point building-block
computations, and the 3-point building block uses two complete 2-point building blocks and
two partial ones (just the add). Therefore, it may also be possible to efficiently implement
3- and 4-point building blocks with this cache memory.
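For reference, the 2-point building block mentioned here is only a complex add and subtract, so it needs very few instructions. A generic C sketch (not DSP16 assembly) is shown below.

```c
#include <stdio.h>

/* 2-point FFT building block on a complex pair:
   X0' = X0 + X1, X1' = X0 - X1 (no multiplies needed). */
static void fft2(float *re0, float *im0, float *re1, float *im1)
{
    float ar = *re0, ai = *im0;
    *re0 = ar + *re1;  *im0 = ai + *im1;
    *re1 = ar - *re1;  *im1 = ai - *im1;
}

int main(void)
{
    float re[2] = {1.0f, 2.0f}, im[2] = {0.0f, -1.0f};
    fft2(&re[0], &im[0], &re[1], &im[1]);
    printf("(%g, %g) (%g, %g)\n", re[0], im[0], re[1], im[1]);
    return 0;
}
```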
MUX/Parallel I/O. This chip family does not use multiplexers (MUX)
for interfacing the on-chip address bus to outside the chip because there is only one on-chip
address bus. Even though there are two on-chip data buses, they are not interfaced to a single
bus outside the chip because there are two off-chip parallel bus interfaces. This additional
off-chip bus allows additional freedom in the internal organization of the chip and a way
for data to be input to the on-chip data memory while off-chip data memory is being used
to provide data to the MAC and ALU to perform computations.
If the FFT is small enough to execute entirely on-chip, then this architecture works
best if all data is in the data RAM and all multiplier coefficients are in on-chip program
memory ROM. If the FFT must be executed with off-chip memory, storing the data in
off-chip memory and the multiplier coefficients in on-chip data RAM is the easiest way to
program the algorithm. However, if the off-chip memory is slow, it may be more efficient to
load portions of the data from off-chip to on-chip memory through the parallel I/O port and
execute the FFT internally, in steps, using multiplier coefficients stored in on-chip program
ROM. The manufacturer provides detailed data books to help make those decisions.
Address Generators. Both members of this family have dual address generators.
This maximizes the ability to address both data and multiplier constants to feed to the MAC
unit on each instruction cycle. The flexibility of the address step sizes for these generators
allows them to be easily used to execute non-power-of-two algorithms as well as standard
FFTs.
Program Memory. All on-chip program memory in this family is in ROM, and the
programming strategy is to use this memory for programs and multiplier coefficients. The
architecture does allow off-chip program RAM up to 64K words.
Data Memory. The DSP16 has 512 words and the DSP16A has 2048 words of on-
chip RAM. Therefore, the maximum on-chip complex FFT that can be performed by the
DSP16 is 256 points and by the DSP16A is 1024 points. This assumes all of the multiplier
constants and weighting function constants are stored in program memory. This means that
the FFT performance formula will work for FFT performance above 1024 points for the
DSP16A (above 256 points for the DSP16) but gives answers that are too large for smaller
transform lengths. The Programmable Fixed-Point Chip Comparison Matrix (Section 14.4)
shows that the DSP16A has significantly better 1024-point FFT computation times than the
DSP16 because of this additional internal data RAM.
This series of 16-bit fixed-point chips is focused on the digital cellular marketplace.
However, they are general-purpose programmable DSP chips that can be used to execute
FFT algorithms. In addition to the specific market focus, the primary difference between
this family and the DSP16 family is on-chip RAM for programs. The members of this
family are DSP1610, DSP1616, DSP1617, and DSP1618 (see Figure 14-11) [7-10].
Cache RAM. The 15 instructions of on-chip cache memory can execute a set of
repetitive operations up to 127 times to increase the throughput and coding efficiency. This
is particularly valuable for power-of-prime FFT algorithms where the same building block
is used throughout the computations. In particular, the 2-point building block would easily
fit into this RAM. The 4-point building block is a series of four 2-point building-block
computations and the 3-point building block uses two complete 2-point building blocks and
two partial ones (just the add). Therefore, it may also be possible to efficiently implement
3- and 4-point building blocks using this cache memory.
[Figure 14-11: AT&T DSP161x family block diagram, showing on-chip program and data address buses multiplexed to the off-chip address bus, program control, multiplier-accumulator, ALU, serial bus, and parallel bus.]
Serial Ports. All members of this family have dual serial ports. This additional
serial port provides the capability to interface these devices into linear bus, pipeline, and
ring bus architectures for multiprocessor applications (Section 14.2.9) without having to
use the parallel bus that may be addressing off-chip data or program memory.
Parallel I/O/Interface Bus. In addition to the two on-chip data buses that are in-
terfaced off-chip by using multiplexers, there is an additional parallel interface, just like
the one in the DSP16 family. The difference is that it is multiplexed onto a bus that is then
interfaced with one of the on-chip data buses.
Data Memory. All of the devices in this family, except the DSP1618, have at least
2048 words of data RAM with two access ports. Therefore, the 1024-point FFT can be
performed on-chip if the weighting function and multiplier coefficients are stored in program
memory. The DSP1617 and DSP1618 have 4096 words of dual-ported data RAM, so they
can compute up to 2048-point complex FFTs without going off the chip. The DSP1610 has
8192 words of data RAM. It can compute up to 4096-point complex FFTs without going
off the chip.
Read-Only Memory (ROM). All of the devices in this family have on-chip pro-
gram ROM. The DSP1610 has 512 words, the DSP1616 has 12K words, the DSP1617 has
24K words, and the DSP1618 has 16K words. For high-volume applications, this ROM
can be used to store FFf algorithms. Otherwise, the on-chip RAM can be used to store
the program. However, storing the program in data RAM reduces the memory available for
data, which results in a smaller FFT length that is computable with only on-chip memory.
The DSP56001 was the first programmable DSP chip family from Motorola. Its most
characteristically different feature is that it is a 24-bit fixed-point processor. The members
of this family are DSP56001, DSP56002, DSP56L002, and DSP56004 (see Figure 14-12)
[11-13].
[Figure 14-12: Motorola DSP560xx family block diagram, showing program memory, on-chip program and data address buses multiplexed to the off-chip parallel address bus, on-chip data buses multiplexed to the off-chip parallel data bus, program control, multiplier-accumulator, and ALU.]
Serial Ports. All members of this family have dual serial ports. This additional
serial port provides the capability to interface these devices into linear bus, pipeline, and
ring bus architectures for multiprocessor applications (Section 14.2.9) without having to
use the parallel bus that may be addressing off-chip data or program memory.
In conjunction with these ports, the X-data memory has a built-in table of A-law
and µ-law companding coefficients to simplify the interface with companded data sources.
Since the FFT is assuming linear data, the companded input data must be converted to
linear form. If companding is performed in software, it takes several instruction cycles. If
the process takes 10 instruction cycles, the total data I/O time becomes at least 10 * 4 * N
instruction cycles. Since the FFT takes roughly 5 * N * log2(N) instructions, an FFT will
be I/O limited when 10 * 4 * N > 5 * N * log2(N). This occurs for N < 256 points. The
companding table removes the need for these 10 cycles and allows the data I/O overhead to
return to one cycle per word. At one cycle per data I/O word, the device is only I/O limited
for 2-point FFTs.
Data Memory. All members of this family have 512 words of data RAM on-chip.
Therefore, the largest FFf that can be computed with only on-chip memory is 256 points.
Therefore, the performance numbers in the Programmable Fixed-Point Chips Comparison
Matrix (Section 14.4) already reflect the penalty paid for having to access off-chip data
memory. Further, the data RAM is divided into two 256-word memories called X-data
memory and Y-data memory.
The other nonstandard fact about this family is that it is 24-bit fixed point. This allows
it to be used for digital compact disc (CD) products that require roughly 20 bits of dynamic
range and accuracy. This was the first family of fixed-point DSP processors to offer more
than 16 bits. The advantage for FFT algorithms is that it has less quantization noise than
16-bit fixed-point chips by a factor of 24 dB. See the explanation of quantization error in
Chapter 13 for details.
Data ROM. All of the members in this family have on-chip data ROM. The X-data
memory ROM is programmed with A-law and µ-law companding functions to simplify
interfaces with companded data sources such as telephone lines. The Y-data memory ROM
is programmed with a full, four-quadrant sine table that can be used for the multiplier
coefficients for power-of-two FFTs. This removes the need to store these coefficients in
program memory. This table can also be used for non-power-of-two FFTs with the help
of an interpolation algorithm. For example, to use the table for the 504-point mixed-radix
algorithm, 360 degrees must be divided into 504 pieces, not 512. Therefore, the table entries cannot
be used directly. However, for each needed value, the two surrounding phase angle values
and a linear interpolation algorithm can be used to accurately compute the correct value.
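A C sketch of the interpolation idea follows; the table length and phase convention are illustrative, and a real implementation would index the chip's fixed-point sine ROM rather than a floating-point table.

```c
#include <math.h>
#include <stdio.h>

#define TABLE_LEN 512   /* illustrative power-of-two sine table length */

static const double TWO_PI = 6.283185307179586;
static double table[TABLE_LEN];

/* Linearly interpolate sin(2*pi*k/m) from a table built for a power-of-two
   length, as needed for non-power-of-two twiddle factors (e.g., m = 504). */
static double sin_interp(int k, int m)
{
    double pos  = (double)k * TABLE_LEN / m;   /* fractional table index */
    int    i0   = (int)pos % TABLE_LEN;
    int    i1   = (i0 + 1) % TABLE_LEN;
    double frac = pos - floor(pos);
    return table[i0] + frac * (table[i1] - table[i0]);
}

int main(void)
{
    for (int i = 0; i < TABLE_LEN; i++)
        table[i] = sin(TWO_PI * i / TABLE_LEN);

    /* Compare the interpolated value with the exact one for 37/504 of a cycle. */
    printf("interpolated: %.6f   exact: %.6f\n",
           sin_interp(37, 504), sin(TWO_PI * 37.0 / 504.0));
    return 0;
}
```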
The coefficients in the Y-data ROM can also be used to compute the sine lobe, Han-
ning, sine cubed, sine to the fourth, Hamming, Blackman, 3-sample Blackman-Harris, and
4-sample Blackman-Harris weighting functions in Sections 4.2.3 through 4.2.10. This re-
moves the need to store weighting function coefficients if the chip's computational power
allows the weighting function coefficients to be computed as needed within the required
FFT computation time.
There are two drawbacks to the Y-data memory ROM having the sine table. This
table is specifically designed for power-of-two algorithms. Therefore, it does not contain
the multiplier constants needed for non-power-of-two algorithms. Further, the table is fixed
in the Y-data memory ROM. Therefore, to pull a multiplier coefficient and data value during
the same instruction cycle, the data must be in the X-data memory. For radix-2 algorithms
this is not a problem because the data can always be partitioned so that the values that
require the multiplications are in the X-memory, because only half of the data in the radix-2
building block ever gets multiplied by other than 1. In general, mixed-radix algorithms
require N - 1 of the N -point building-block inputs to be multiplied by a complex number.
For full-speed operation this requires that the data must be modified prior to being input
to the N-point building block to be stored in the X-memory. If that data is stored in the
Y-memory, two memory access clock cycles are required to get the data and multiplier
constant. This slows FFT performance.
Address Generators. All of the members of this family have dual address gener-
ators. This maximizes the ability to address both data and multiplier constants to feed to
the MAC unit on each instruction cycle. The flexibility of the address step sizes for these
generators also allows them to be easily used to execute non-power-of-two algorithms as
well as standard FFTs. Both address generators also have bit-reverse logic to accommodate
standard power-of-two algorithms.
Data Address and Data Buses. To accommodate the extra data memories, there
is an extra data memory bus and an extra data memory address bus. This provides a simpler
way of thinking about programming the devices, because the natural thought process of
pulling two data values from data memory can be programmed.
Boot ROM. Boot ROM is additional memory to allow the on-chip program RAM
to be loaded during the power-up phase of the application's operation from a low-speed
24-bit-wide EPROM to lower the cost of the overall application. It also allows multiple
programs to be swapped in and out of the chip's on-chip program memory without having
to store them in high-speed off-chip program RAM.
The DSP561xx family of 16-bit fixed-point chips is based on the 24-bit fixed-point
DSP560xx series from Motorola. The members of this family are DSP56156,
DSP56156ROM, DSP56166, and DSP56166ROM (see Figure 14-13) [14-17].
Serial I/O and A/D-D/A I/O. All members of this family have dual serial ports.
This additional serial port provides the capability to interface these devices into linear bus,
pipeline, and ring bus architectures for multiprocessor applications (Section 14.2.9) without
having to use the parallel bus that may be addressing off-chip data or program memory.
All members of the family also provide 14-bit Sigma-Delta A/D and D/A conversion
to simplify the application of these devices to telecommunications and digital cellular ap-
plications. Example 3 of Chapter 17 uses these on-chip A/D and D/A converters to simplify
doing the pitch detection portion of speech recognition algorithms.
Data Memory. Both DSP56156 devices have 2048 words of data RAM, and both
DSP56166 devices have 4096 words of data RAM. Therefore, the 1024-point FFT can be
performed on-chip if the weighting function and multiplier coefficients are stored off-chip
or in program memory for the DSP56156 devices, and even without that constraint for the
DSP56166 devices.
Busesand Multiplexers. This family has dual data address buses and an additional
data bus for moving the serial and analog I/O port data on and off the chip. The result is that
the multiplexers for combining on-chip buses to one off-chip bus are both 3:1 rather than
the more standard 2:1 found in other chip families. The additional data bus enhances the
chip's capability to input data in parallel while performing computations. This improves
its FFT performance.
Address Generators. Unlike the DSP5600x family, this family only has one ad-
dress generator. However, its logic is fast enough to compute two addresses per instruction
cycle. Thus, it functions like two address generators and still provides the FFT performance
advantages described for dual-generator architectures.
[Figure 14-13: Motorola DSP561xx family block diagram, showing on-chip program and data address buses multiplexed to the off-chip parallel address bus, on-chip data buses multiplexed to the off-chip global data bus, program control, multiplier-accumulator, ALU, serial bus, and analog I/O.]
Program Memory. A set of addresses in program memory is used to allow the on-
chip program RAM to be loaded during the power-up phase of the application's operation
from a low-speed 24-bit-wide EPROM to lower the cost of the overall application. It also
allows multiple programs to be swapped in and out of the chip's on-chip program memory
without having to store them in high-speed off-chip program RAM. Both the DSP56156
and DSP56166 have 2048 additional words of on-chip program ROM. The DSP56156ROM
and DSP56166ROM devices have 12K and 8K of on-chip program ROM, respectively.
[Figure 14-14: NEC µPD77C20/µPD77C25 block diagram, showing program memory, program control, multiplier-accumulator, ALU, and serial bus.]
Data Memory. This device has two on-chip data memories. One is a ROM (1024 x
16 for the µPD77C25 and 512 x 23 for the µPD77C20) for storing multiplier and weighting
function coefficients. The other is a RAM (256 x 16 for the µPD77C25 and 128 x 16 for
the µPD77C20). This means that the best case is being able to compute 128-point FFTs
(µPD77C25) and 64-point FFTs (µPD77C20) using only on-chip memory. Therefore,
the 1024-point performance numbers in the Programmable Fixed-Point Chips Comparison
Matrix (Section 14.4) assume off-chip data memory.
For FFTs larger than 128 points, FFT performance will lose efficiency because the
off-chip data interface is only 8 bits wide. Therefore, two accesses are required to move
one 16-bit word into and out of the chip. However, since the 16-bit word is stored in a
buffer register prior to becoming two 8-bit words, it only takes one instruction cycle away
from the processor to move data onto and off the chip. Furthermore, the buffer register is
controlled from off-chip timing signals. Therefore, if the off-chip logic can operate at twice
the on-chip instruction speed, the 8-bit I/O inefficiency is removed. Read the detailed timing
information in the manufacturer's data book to determine the effect of the 8-bit interface.
The 8-bit interface is used because the family was designed to interface to 8-bit
microprocessor hosts. The 8-bit interface also slows the data I/O before and after the FFT
algorithm. However, the degree to which this affects overall FFT performance depends
on the speed of the off-chip data transfer, just as it was for off-chip data memory accesses
during the FFT computations.
The 16-bit fixed-point NEC µPD7701x family was developed for the digital cellular
and modem/fax telecommunications markets. However, the Programmable Fixed-Point
Chips Comparison Matrix (Section 14.4) shows it has good performance for FFT compu-
tations. The members of this family are the µPD77016 and µPD77017 (see Figure 14-15)
[20].
[Figure 14-15: NEC µPD7701x family block diagram, showing on-chip parallel data buses brought directly off-chip, program control, multiplier-accumulator, ALU, and serial bus.]
Serial Ports. Both the µPD77016 and µPD77017 have dual serial ports. This
additional serial port provides the capability to interface these devices into linear bus,
pipeline, and ring bus architectures for multiprocessor applications (Section 14.2.9) without
having to use the parallel bus that may be addressing off-chip data or program memory.
Address Generators. Both devices have dual address generators that are very sim-
ilar to Figure 14-2. However, they are directly connected to the two data RAM blocks rather
than to dual address buses because there is only one bus used for carrying address infor-
mation, and it carries program memory addresses and other control data. The flexibility of
these address generators makes them useful for computing all of the algorithms in Chapters
8 and 9. For the standard power-of-two algorithms, both address generators have hardware
for performing bit-reversed addressing arithmetic.
Busesand Multiplexers. Both of the on-chip data memory buses are also available
outside the chip. This eliminates the need for the multiplexers shown in the block diagram.
Furthermore, the reduced number of on-chip buses (two for data and one for program
addressing) and multiple-address generators results in the address generators providing
their output directly to their respective memories.
Data Memory. Both devices have two data memories. Each data memory has
2048 sixteen-bit words of RAM. The µPD77017 also has 4096 words of data ROM in each
data memory. Therefore, both devices can compute a 1024-point FFT on-chip if all of the
multiplier and weighting function coefficients are stored in program memory. Even though
the on-chip data buses are not multiplexed to the outside of the chip, going off-chip for
data does slow down the computations. This is because the two off-chip data buses must be
used for both data and addressing. Therefore, only one data memory value can be accessed
during an instruction cycle, not two as can happen when the data is internal to the chip.
Data Memory. This device has two 256-word data RAM blocks and one 1024-word
data ROM for storing multiplier constants and weighting function coefficients. Externally,
the device supports a 12-bit address word which corresponds to addressing 4096 data words.
This limits this device to performing 2048-point FFTs, even using off-chip memory. Using
on-chip memory with real and imaginary components in respective 256-word blocks of data
memory provides the capability to perform 256-point complex FFTs.
[Figure 14-16: block diagram for this device, showing on-chip program and parallel data buses, the off-chip parallel data bus, program control, multiplier-accumulator, ALU, and serial bus.]
Data memory does not use the main bus to transfer data to the multiplier. Each data
RAM has its own direct path to the multiplier. However, the results from the multiplier or
accumulator are stored in data RAM using the main bus.
Address Generators. This device has an address generator for each of the data
RAMs to avoid having to use the main bus. These generators are simple base address
plus offset calculators that require the offset to be programmed into the instructions for
nonunit values. Therefore, they are not ideally suited for computing non-power-of-two
FFT algorithms.
The TMS320C1x is TI's first family of CMOS programmable DSP chips and is still
used for low-cost applications. It is a follow-on to the NMOS TMS32010 series intro-
duced in 1982. The members of this family are TMS320C10, TMS320C14, TMS320P14,
TMS320E14, TMS320C15, TMS320P15, TMS320E15, TMS320C16, TMS320C17,
TMS320P17, and TMS320E17 (see Figure 14-17) [22]. The "E" indicates the presence of
on-chip EPROM for program memory, and the "P" indicates 3.3-V versions of the chip.
Serial I/O. The TMS320C14, TMS320P14, TMS320E14, TMS320C17,
TMS320P17, and TMS320E17 have one serial port, but the other members of this fam-
ily do not have serial ports. This means that the only input path for data and output path
for results are through the parallel port. This is not a problem for applications where the
[Figure 14-17: TI TMS320C1x family block diagram, showing program memory, program control, multiplier-accumulator, ALU, and serial bus.]
input comes from a data buffer and the outputs go to a data buffer. For applications where
the data I/O is asynchronous, overhead cycles are required to synchronize these DSP chips
with the source of data or destination of results. These overhead cycles reduce the effective
throughput rate of the chip.
The conversion of data to a linear form (frequency analysis with FFTs requires the
data to be in linear form) is called companding. The TMS320C17 and TMS320E17 have
companding hardware, which is an advantage in applications where the FFT is obtaining
its data from an A/D converter or sending its results to a D/A converter. If the A/D and
D/A converters are connected to networks such as the telephone system, the voltages they
convert may be logarithmically compressed by using either the A-law (European standard)
or µ-law (U.S. standard).
If companding is performed in software, it takes several instruction cycles. If the
process takes 10 instruction cycles, the total data I/O time for an N-point complex FFT
increases from 4 * N to at least 10 * 4 * N instruction cycles. Since the FFT takes roughly
5 * N * log2(N) instructions, an FFT will be I/O limited when 10 * 4 * N > 5 * N * log2(N).
This occurs for N < 256 points. The companding hardware removes the need for these 10
cycles and allows the data I/O overhead to return to 1 cycle per word so that I/O limiting
only occurs for 2-point FFTs, based on the inequality.
Buses and Multiplexers. The data address bus is highlighted because it does not
exist in this family. This eliminates the need for the I/O multiplexer for on-chip address
buses. Additionally, the MAC is only connected to the data bus. To multiply numbers,
one cycle is used to load one number, the second cycle to load the other and perform the
multiplication. This two-cycle process, as opposed to one cycle for multiple-bus architec-
tures, results in the significantly higher 1024-point FFT times shown in the Programmable
Fixed-Point Chips Comparison Matrix in Section 14.4.
Data Memory. There are only 256 words of data RAM in this family of devices.
Actually, the TMS320C10 only has 144 data words. This limits the complex FFTs that can
be performed on-chip to 128 and 64 points, respectively. Therefore, the 1024-point FFT
performance numbers in the Programmable Fixed-Point Chips Comparison Matrix (Section
14.4) already reflect the penalty paid for addressing off-chip data memory.
Address Generators. There are no special address generators for data memory in
this family. Nonsequential addressing is done by coding the instructions to perform indirect
addressing. This includes loading auxiliary registers with address offsets and loading data
page pointers because the data memory is partitioned into 128-word pages. Each of these
adds to the time required to perform an FFT.
The TMS320C2x, a second generation of 16-bit fixed-point DSP chips, was intro-
duced by TI in 1986 with the TMS32020. This device has subsequently been discontinued.
The members of this family are TMS320C25, TMS320E25, TMS320C26, and TMS320C28
(see Figure 14-18) [23]. The "E" indicates the presence of on-chip EPROM for program
memory.
[Figure 14-18: TI TMS320C2x family block diagram, showing on-chip program and data buses multiplexed to the off-chip parallel data bus, program control, multiplier-accumulator, ALU, and serial bus.]
Address Generator. Like the TMS320C10 family, this family has an increment-
ing counter for program memory addressing and auxiliary registers to offset data memory
addresses. Data memory address generation operates by loading an offset into an auxil-
iary register and moving the auxiliary register pointer to the correct register. Then indirect
address instructions address the offset data location. For power-of-two FFTs there is re-
verse binary addressing supported in hardware to alleviate the problems associated with
nonsequential memory addressing. However, this support does not help the nonsequential
addressing needed for non-power-of-two algorithms. Therefore, they are less efficient on
this chip family than comparable power-of-two algorithms.
Data Memory. The TMS320C25/E25 and TMS320C28 members of this family
have 544 words of on-chip RAM that can be used for data. This means that the maximum
complex FFT that can be implemented on-chip is 256 points, assuming the multiplier coef-
ficients and weighting function coefficients are stored in ROM/EPROM program memory.
The TMS320C26 has 1568 words of RAM. Of that, 32 words are dedicated to data
and the other 1536 words are in three 512-word blocks that can be used for either data or
program memory. This allows a 512-point complex power-of-two algorithm and roughly
a 768-point complex FFT if all weighting function and multiplier coefficients are stored in
program memory. Since 768 = 256 * 3, this FFT can be computed with existing mixed-
radix 256-point code with the 3-point building block from Chapter 8 added to the front end
or back end of the algorithm.
In all cases, the 1024-point FFT performance numbers in the Programmable Fixed-
Point Chips Comparison Matrix (Section 14.4) reflect the data being in off-chip memory.
If multiplier and/or weighting function coefficients are stored in data memory, this further
reduces the maximum FFT length, depending on the required number of multiplier coeffi-
cients. In this case, larger FFTs can be implemented using the Winograd and prime factor
algorithms from Chapters 8 and 9 because they require fewer multiplier coefficients and
have FFT lengths between 128 and the maximum on-chip FFT length of 256 points.
Program Memory. The TMS320C25/E25 family members have 4096 words of
ROM/EPROM dedicated to programs. Additionally, a 256-word block of RAM can be
used for either data or program memory. If it is used for program memory, the maximum
allowable on-chip FFT length is reduced. This leads to a complex trade because the Wino-
grad and prime factor algorithms from Chapters 8 and 9 require fewer multiplier coefficients
but more program memory. Only detailed implementation can be used to determine the
maximum length in this situation. In the TMS320C26, the program ROM is a 256-word
boot program, and in the TMS320C28 the program memory is 8192 words.
[Figure 14-19: TI TMS320C5x family block diagram, showing on-chip program and data buses multiplexed to the off-chip parallel data bus, program control, multiplier-accumulator, ALU, and serial bus.]
addresses. Data memory address generation operates by loading an offset into an auxil-
iary register and moving the auxiliary register pointer to the correct register. Then indirect
address instructions address the offset data location. For power-of-two FFfs there is re-
verse binary addressing supported in hardware to alleviate the problems associated with
nonsequential memory addressing. However, this support does not help the nonsequential
addressing needed for non-power-of-two algorithms. Therefore, they are less efficient on
this chip family than comparable power-of-two algorithms.
Data Memory. All members of this family have 1056 words of on-chip RAM
dedicated to data. Additionally, the TMS320C50/51/52/53 have 9K/1K/1K/3K of on-chip
RAM, respectively, that can be used for either data or programs. As a result, all mem-
bers of this family have the ability to compute 1024-point complex FFTs on-chip. The
TMS320C51 and TMS320C52 require the complex multiplier coefficients to be stored in
program memory to allow enough room for all 2048 data words. This, combined with the
faster instruction cycle times (35 and 50 ns versus 80 and 100 ns for the TMS320C2x fam-
ily), is the reason for the improved 1024-point FFT performance in the Programmable
Fixed-Point Comparison Matrix (Section 14.4).
on-chip FFT is reduced. This results in a complex trade because the Winograd and prime
factor algorithms from Chapters 8 and 9 require fewer multiplier coefficients but more
program memory. Only detailed implementation can be used to determine the maximum
length in this situation.
Serial Ports. The TMS320C50/51/53 have dual serial ports. This additional serial
port provides the capability to interface these devices into linear bus, pipeline, and ring
bus architectures for multiprocessor applications (Section 14.2.9) without having to use the
parallel bus that may be addressing off-chip data or program memory. The TMS320C52
only has one serial port.
The Zilog Z89Cxx is a family of bare-bones 16-bit fixed-point processors. The most
distinguishing feature of this processor is that the accumulator holds only 24 bits out of
the 16 x 16 multiplier. This means that multiplier outputs are rounded from 32 bits to 24
bits prior to entering the accumulator. This introduces more quantization noise in the FFT
outputs than accumulators that hold 32 bits or more. The only general-purpose member of
this family is the Z89C00 (see Figure 14-20) [25]. Other members are customized to audio
and multimedia applications.
Multiplexers and Serial I/O. This processor does not have a serial I/O function.
Additionally, the device has an off-chip program memory port and off-chip I/O port. Data
is input through the I/O port and no multiplexer exists because the program data bus is only
used to connect the program memory with the program control function. Likewise, there
is no multiplexer needed for the address buses because there is only one external address
bus. Data memory addresses are generated and directly connected to each of the two data
memories as shown in Figure 14-20.
Data Memory. The Z89C00 has two 256-word data memories. Assuming all the
multiplier coefficients and weighting function coefficients can be stored in program memory,
this device can execute up to a 128-point FFT on-chip. Moving data from data memory to
the multiplier is simplified by having it directly connected to the two data memory blocks
as shown in Figure 14-20. This eliminates the need for two data buses in order to feed two
data words to the multiplier during one instruction.
Program Memory. This device has a 4K ROM internal program memory, but no
internal RAM for program memory.
Address Generators. Each data RAM has its own dedicated address generator that
is based on programming offset address pointers rather than having an ALU to compute the
offset address. This makes this device's address generation scheme similar to the first two
generations of TI chips, the TMS320C1x and TMS320C2x.
Multiplier-Accumulator. The 16 x 16 multiplier output is 24 bits and is fed to an
ALU before going to the 24-bit accumulator. The output of the multiplier can also return
to data memory. The multiplier and ALU outputs are returned to data memory through the
chip's bus.
This is the first family of fixed-point DSP chips to compute the 1024-point complex
FFT in less than 1 ms. A second distinguishing feature for FFT computations is that it
performs 20-bit, not 16-bit, integer arithmetic. These additional 4 bits reduce the algorithm-
generated quantization noise by 12 dB and increase the dynamic range by 24 dB. Another
distinguishing feature for these fixed-point processors is the six half-duplex (three two-
way) serial ports. The only member of this family is the ZR38000 (see Figure 14-21)
[26].
Data/Program Memory. This chip has 2048 twenty-bit words of data memory
and 8192 thirty-two-bit words of program/data ROM. Assuming all multiplier coefficients
and weighting function coefficients are stored in program/data ROM, a 1024-point FFT
can be computed on-chip. Therefore, Equation 14-1 works for FFTs less than 1024 points
but not for those above 1024 points. However, the standard product only uses the ROM
for bootstrapping the loading of the main operating program. Therefore, the standard
product can only perform 512-point complex FFTs with on-chip data memory because
it needs the rest of the data memory to store multiplier and weighting function coeffi-
cients.
Address Generator. This chip has only one address generator, and its output is
connected to the data memory address bus. However, this generator and the data memory
are able to support the update of two data memory address locations per instruction cycle
and two accesses of data memory per instruction cycle. The address generator also has built-
in hardware that supports bit-reversed addressing for the power-of-two FFT algorithms in
Chapter 9. The generator also supports modulo addressing, which is useful in implementing
the non-power-of-two FFT algorithms in Chapter 9.
Serial I/O. This device has six half-duplex serial ports. Therefore, it has the capa-
bility of moving data in and out of the processor as if there were three full-duplex serial
ports.
The data in the Comparison Matrix in Table 14-3, on page 354, comes from the referenced
vendor material. In the case of the 1024-point complex FFT performance, this is the fastest
number available in the material. Different versions of a 1024-point FFT may produce
slightly different performance numbers. Versions of the chips that run at slower speeds will
have times that are slower. Conversely, newer versions of these chips, which run faster, will
have faster times. Performance numbers with asterisks are estimated because times for the
1024-point FFT were not available from the vendor.
All of the general-purpose floating-point DSP chips in this chapter use 32-bit arithmetic
with 8 bits of exponent and 24 bits of mantissa. In addition to these chips, the Intel i860 has
also been included. While this chip was initially developed for graphics applications, its
FFT performance is so good that it has been used by many DSP board manufacturers. The
i860 uses the same configuration of 32-bit floating-point numbers described above. The
way the different vendors treat the smallest and largest number varies slightly but has no
effect on the computational performance, except in rare instances when the top or bottom
numbers in the dynamic range are reached.
The 21020 is Analog Devices' first family of 32-bit floating-point processors. Its
most distinguishing feature is that it has no on-chip program or data memory. However,
the on-chip buses are designed to work at full speed with off-chip memory to produce
high-performance computing that is not limited by the inability to put large amounts of
memory on-chip. The only member of this family is the ADSP-21020 (see Figure 14-22)
[27].
Serial I/O. This device does not have a serial I/O port.
Multiplexers. This device does not use the MUX hardware because it provides I/O
pins for all four on-chip data and address buses.
Data and Program Memory. This device does not have any on-chip data or program
memory. It is all accessed directly using off-chip memory. As a result, the FFT
performance numbers in the Programmable Floating-Point Chips Comparison Matrix (Sec-
tion 14.7) can be scaled to estimate larger or smaller FFT computation times using Equation
14-1.
Address Generators. The ADSP-21020 has dual address generators. This max-
imizes the ability to address both data and multiplier constants to feed to the MAC unit
on each instruction cycle. The flexibility of the address step sizes for these generators
also allows them to be easily used to generate non-power-of-two algorithms as well as
standard FFTs. Address generator 1 also has bit-reverse logic to accommodate standard
power-of-two algorithms.
Cache Memory. This device has a 48-word instruction cache memory to run fre-
quently used instruction sequences without having to access off-chip program memory.
Building-block FFT algorithms can be executed from this memory. Because of the small
size, it is likely that only 2-, 3-, and possibly 4-point building blocks from Chapter 8 can be
programmed to fit in the cache.
and to be stored either in on-chip RAM or in off-chip RAM via the interface multiplexers.
These six communications ports allow this device to be connected into a variety of one-,
two-, and three-dimensional architectures. The three-dimensional massively parallel
processor in Figure 14-6 is one example. Others are described in Chapter 11.
The DSP32C is AT&T's first CMOS family of 32-bit floating-point processors and
is a follow-on to their DSP32 introduced in 1984. The most distinguishing feature of this
family is that it operates like a Harvard architecture even though it is actually a von Neumann
architecture. This is accomplished by allowing multiple uses of the data and program buses
during one instruction cycle. The members of this family are DSP32C, DSP3210, and
DSP3207 (see Figure 14-24) [29,30].
Buses and Multiplexers. This family's architecture uses only one data bus and
one address bus. Therefore, all functions must be connected to these, and there is no need
to multiplex multiple buses to access off-chip data and program memory. This high-speed
bus allows the device to access two 32-bit operands from memory, perform multiplication
and accumulation operations on a previous pair of operands, and write a previous result
to an I/O port or memory in one instruction cycle. Therefore, from the outside the device
appears to function like a Harvard architecture.
Address Generator. With only one address bus, there is only need for one ad-
dress generator if it can produce the multiple addresses supportable by the address bus
during an instruction cycle. The address generator in this device family is capable of
that. Additionally, the address generator has an ALU that can be used to perform ad-
dressing in nonunit increments. This makes it useful for implementing any of the FFT
algorithms in Chapter 9. However, the devices are more efficient for power-of-two FFT
algorithms because bit-reversed addressing is directly supported for reorganizing data for
these FFTs.
Data/Program Memory. The DSP32C supports one of two on-chip memory con-
figurations that can be used for data or program. The first is 1024 words of RAM and
4096 words of ROM. The second is 1536 words of RAM. Therefore, the largest power-
of-two complex FFT that can be executed on-chip is 512 points. The limit on the largest
non-power-of-two FFT is more difficult to calculate without getting an estimate on the com-
plexity of the code that must be stored in on-chip memory. It is likely that code will need
to be written to determine the largest allowable FFT. For the 4096-word ROM option, the
answer is clearly 512 points, assuming all multiplier coefficients and weighting function
coefficients are stored in ROM.
The primary difference between the DSP32C and the DSP3210 for executing FFT
algorithms is the larger on-chip memory space. The DSP3210 has two banks of 1024 words
of RAM and a small 256-word boot ROM. Program instructions and data can reside in any
of the 2048 RAM locations, and the boot ROM is preprogrammed to load the on-chip RAM
from off-chip EPROM for lower-cost operation. Again, the largest FFT depends on the
size of the FFT algorithm code, but will not be larger than 512 points for power-of-two
algorithms because the next largest size (1024 points) would not leave any room for the
FFT program code. The largest non-power-of-two algorithm depends on the size of its
code.
Serial I/O. All members of the device family, except the DSP3207, have one serial
I/O port. The DSP3207 has no serial ports.
Multiplier-Accumulator and ALU. Because there is only one data bus in this chip
family, all data must be moved sequentially. Since the data bus can support two of those
data accesses per instruction cycle, the MAC and ALU function can also support two inputs
during an instruction cycle. This makes the MAC/ALU unit appear as if it has two ports.
Bus Control Unit. Intel calls its interface to off-chip data memory the bus control
unit. The i860 family's single on-chip data bus architecture removes the need for the bus
control unit to perform the data bus MUX function found in conventional DSP chips for
off-chip data access.
Multiply Accumulator and ALU. The i860 family has a separate multiplier and
adder. Both are pipelined for maximum computation rate. This means that multiple cycles
are used to perform each arithmetic computation. Conventional DSP chips perform these
functions in one instruction cycle.
Graphics Unit. The i860 chip family was designed with built-in support for high-
speed graphics. While this feature does not modify its capability to compute FFf algorithms,
it is a unique feature worth mentioning. Specifically, this hardware performs the integer
operations necessary for shading and hidden line removal. The 4 x 4 transforms needed
for orienting points are performed by the floating-point hardware.
The DSP96002 is Motorola's first 32-bit floating-point family and is aimed at the
multimedia market. It is basically a 32-bit floating-point extension of the 24-bit fixed-point
DSP5600x family. Its most distinguishing features are the large number of on-chip buses,
dual parallel interfaces off the chip, and an arithmetic unit that has Newton-Raphson-based
square root and 1/(square root) functions. The only member of this family is the 96002 (see
Figure 14-26) [32].
Buses and Multiplexers. In addition to the buses in the Motorola DSP5600x ar-
chitecture (three address and four data), the DSP96002 provides a DMA data bus. Another
feature of the DSP96002 is the dual parallel interfaces off the chip. This additional off-chip
parallel interface allows these devices to be connected into linear bus, pipeline, and ring
bus architectures for multiprocessor applications (Section 14.2.9) without having to use the
parallel bus that may be addressing off-chip data or program memory.
Data RAM and ROM. The DSP96002 has 1024 words of data RAM on-chip.
Therefore, the largest FFT that can be computed with on-chip memory is 512 points. The
performance numbers in the Programmable Floating-Point Chips Comparison Matrix (Sec-
tion 14.7) already reflect the penalty paid for having to access off-chip data memory. Further,
the data RAM is divided into two 512-word memories called X-data memory and Y-data
memory. To accommodate these extra memories, there is an extra data memory bus and
extra data memory address bus.
Grouped with each of these 512-word RAMs is a 512-word ROM. The X-data
ROM contains a full cycle of the "cosine" function, and the Y-data ROM contains a
full cycle of the "sine" function to be used by power-of-two FFT algorithms directly as
the multiplier constants. Specifically, the 360° phase angle is divided into 512 pieces.
These tables can also be used for non-power-of-two FFTs with the help of an interpola-
tion algorithm. For example, to use the table for the 504-point mixed-radix algorithm,
360° must be divided into 504 pieces, not 512. Therefore, the table entries cannot be
used directly. However, for each needed value, the two surrounding phase angle val-
ues and a linear interpolation algorithm can be used to accurately compute the correct
value.
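The sketch below illustrates the interpolation idea in C with a 512-entry full-cycle cosine table (standing in for the X-data ROM) and a twiddle factor needed by a 504-point FFT. The function names, double-precision arithmetic, and table-access details are illustrative assumptions, not the DSP96002's actual ROM interface.

    #include <math.h>
    #include <stdio.h>

    #define TABLE_LEN 512                    /* full cosine cycle in 512 pieces */
    #define PI 3.14159265358979323846

    /* Approximate cos(2*pi*k/N) for an N-point FFT (N not necessarily 512) by
       linearly interpolating between the two surrounding table entries. */
    static double cos_from_table(const double *table, int k, int N)
    {
        double pos  = (double)k * TABLE_LEN / N;   /* position in table units */
        int    i0   = (int)pos % TABLE_LEN;
        int    i1   = (i0 + 1) % TABLE_LEN;
        double frac = pos - (int)pos;
        return table[i0] + frac * (table[i1] - table[i0]);
    }

    int main(void)
    {
        double table[TABLE_LEN];
        for (int i = 0; i < TABLE_LEN; i++)
            table[i] = cos(2.0 * PI * i / TABLE_LEN);

        int k = 100, N = 504;                /* one 504-point twiddle factor */
        printf("interpolated %f  exact %f\n",
               cos_from_table(table, k, N), cos(2.0 * PI * k / N));
        return 0;
    }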
The coefficients in the X- and Y-data ROMs can also be used to compute the sine lobe,
Hanning, sine cubed, sine to the fourth, Hamming, Blackman, three-sample Blackman-
Harris, and four-sample Blackman-Harris weighting functions in Sections 4.2.3 through
4.2.10. This removes the need to store weighting function coefficients if the chip's compu-
tational power allows the weighting function coefficients to be computed as needed within
the required FFT computation time.
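For example, the Hanning weighting function from Section 4.2.4 can be written as w(n) = 0.5 - 0.5*cos(2*pi*n/N), so its coefficients can be generated on the fly from the full-cycle cosine table rather than stored. A minimal sketch, assuming a 512-entry table and a 512-point transform so the table can be indexed directly:

    #include <math.h>
    #include <stdio.h>

    #define N  512
    #define PI 3.14159265358979323846

    int main(void)
    {
        /* Full cycle of cosine, rebuilt here in C in place of the X-data ROM. */
        double cos_table[N];
        for (int n = 0; n < N; n++)
            cos_table[n] = cos(2.0 * PI * n / N);

        /* Hanning weight w(n) = 0.5 - 0.5*cos(2*pi*n/N), computed as needed
           instead of being stored as a separate coefficient set. */
        for (int n = 0; n < 4; n++)
            printf("w(%d) = %f\n", n, 0.5 - 0.5 * cos_table[n]);
        return 0;
    }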
Address Generators. All of the members of this family have dual address gener-
ators. This maximizes the ability to address both data and multiplier constants to feed to
the MAC unit on each instruction cycle. The flexibility of the address step sizes for these
generators also allows them to be easily used to generate non-power-of-two algorithms as
well as standard FFTs. Both address generators also have bit-reverse logic to accommodate
standard power-of-two algorithms.
Multiply Accumulator and ALU. The ALU has a "divide and square root" unit that
uses the Newton-Raphson algorithm to compute square root(x) and 1/(square root(x))
in 12 and 11 instruction cycles, respectively. This is not critical for FFT algorithms but can
accelerate an overall application.
described below. These connections offset the degradation in FFT performance associated
with having only one main bus.
Data Memory. Both devices have two 512-word data RAM blocks and the
µPD77230A has 1024- and 2048-word data ROMs for storing multiplier constants and
weighting function coefficients. Externally, both devices support a 12-bit address word
which corresponds to addressing 4096 data words. This limits them to performing 2048-
point FFTs, even using off-chip memory. Using on-chip memory with real and imaginary
components in respective 512-word blocks of data memory provides the capability to per-
form 512-point complex FFTs.
Data memory does not use the main bus to transfer data to the multiplier. Each data
RAM has its own direct path to the multiplier. However, the results from the multiplier or
accumulator are stored in data RAM using the main bus.
Address Generators. Both devices have an address generator for each of the data
RAMs to avoid having to use the main bus. These generators are simple base address
plus offset calculators that require the offset to be programmed into the instructions for
nonunit values. Therefore, they are not ideally suited for computing non-power-of-two
FFT algorithms.
earlier fixed-point generations primarily because of the additional buses that allow multiple
tasks to occur during the same instruction cycle. The primary distinguishing feature of
this device family is the multiple data and address ports. The members of this family are
TMS320C30 and TMS320C31 (see Figure 14-28) [33].
Buses and Multiplexers. The large number of on-chip buses is a primary charac-
teristic of this family. There are four on-chip data buses and three on-chip address buses,
which make it possible to access multiple pieces of data during one instruction cycle. This
improves the performance of this TI family over the TMS320C1x and TMS320C2x fixed-
point families, which only access one data word per instruction cycle. Additionally, the
on-chip buses are multiplexed off the chip twice. The additional off-chip parallel interface
allows these devices to be connected into linear bus, pipeline, and ring bus architectures for
multiprocessor applications without having to use the parallel bus that may be addressing
off-chip data or program memory.
Data/Program Memory. This family has two 1024-word RAMs and one 4096-
word ROM. Each RAM and ROM can support two memory accesses each instruction
cycle, and the multiple buses allow for parallel program fetches, data reads/writes, and
DMA operations. Additionally, a 64-word instruction cache is provided to store often-used
pieces of code so that they need not be fetched from off-chip memory, which would slow down execution. If all
multiplier constants and weighting function coefficients are stored in program ROM, this
chip family can be used to compute up to a 1024-point complex FFT on-chip.
Address Generators. This is the first generation of TI DSP chips to have a full-
function address generator. This family has two that can do addressing in nonunit steps to
support non-power-of-two FFT algorithms. They can compute two addresses per instruction
cycle to address two pieces of data using two of the four data buses. The address generators
also support bit-reversed addressing for power-of-two FFT algorithms.
Serial I/O. The TMS320C30 has two serial I/O ports. This additional serial port
provides the capability to interface these devices into linear bus, pipeline, and ring bus
architectures for multiprocessor applications (Section 14.2.9) without having to use the
parallel bus that may be addressing off-chip data or program memory. The TMS320C31
only has one serial port.
Another fundamental difference of this family architecture is that the serial ports
interface to the expansion I/O buses rather than directly to the on-chip buses. The advantage
of this is that it allows the serial data port to interface to all of the on-chip data buses.
The disadvantage of this is that the serial port data cannot be input to the on-chip data buses
while the expansion I/O bus is active to some other peripheral. If the serial port were tied to
one of the on-chip data buses, it could be active while the expansion I/O bus was connected
to one of the other on-chip data buses.
constants and weighting function coefficients are stored in program ROM, this chip family
can be used to compute up to a 1024-point complex FFT on-chip.
Address Generators. This is the second generation of TI DSP chips to have a
full-function address generator. This family has two that can do addressing in nonunit
steps to support non-power-of-two FFT algorithms. They can compute two addresses per
instruction cycle to address two pieces of data using two of the four data buses. The address
generators also support bit-reversed addressing.
Serial I/O (Comm Ports 1-6). The TMS320C40 has six serial I/O ports, which
are called communications ports. These ports are independently multiplexed into the on-
chip buses to provide full bus utilization flexibility. These six communications ports allow
this device to be connected into one-, two-, and three-dimensional architectures. The three-
dimensional architecture in Section 12.6.2 shows one option.
Programmable floating-    1024-point          Data I/O   On-chip data   On-chip prog.   # of address
point DSP chip            complex FFT (ms)    ports      memory words   memory words    generators

Analog Devices
ADSP-21020                0.58                0s/2p      0              0               2
ADSP-21060                0.46                8s/1p      65,536         65,536          2

AT&T
DSP32C                    3.2                 1s/1p      1024/1536      4096/0          1
DSP3210                   2.4                 1s/1p      1024/2048      1024/256        1
DSP3207                   1.9                 0s/1p      1024/2048      1024/256        1

Intel
i860XR                    0.74                0s/1p      1024           256             1
i860XP                    0.55                0s/1p      2048           1024            1

Motorola
DSP96002                  1.04                0s/2p      1024           1024            2

NEC
µPD77240                  7.07                1s/1p      1024           0               2
µPD77230A                 11.78               1s/1p      1024           1024/2048       2

TI
TMS320C30                 1.97                2s/2p      2048           4096            2
TMS320C31                 1.97                1s/2p      2048           4096            2
TMS320C40                 1.54                6s/2p      2048           4096            2
set. Since these chips are designed to perform FFfs, it is more relevant to show block
diagrams of how the chips are connected to off-chip memory and address controllers than
to show the internal block diagram of the chip. These block diagrams can then be combined
to form the multiprocessor architectures in Chapter 11. Refer to the manufacturer's data
books and application notes for details on the limitations of each chip for multiprocessor
operation.
The primary disadvantage of these chips is that they are not designed to perform general-
purpose functions, such as user interface and decision making, often required to complete
an application. A second disadvantage is that these chips can only perform power-of-two
FFTs. However, by using the Bluestein algorithm in Section 9.5, these chips and chip sets
can be used to perform non-power-of-two algorithms by customizing the complex multiplications
to the transform length of interest. While this approach is less efficient than power-of-two
algorithms on these chips, these chips do perform power-of-two algorithms 5 to 10 times
faster than programmable DSP chips. Therefore, even a factor of 2 or 3 inefficiency still
results in higher-speed computations than can be obtained from programmable DSP chips.
For some applications this can be the difference between success and failure.
Because these chips are specifically designed to perform FFTs, their performance can
be measured by using more FFT-specific items. These are:
The array Microsystems a66110/a66210 chip set [35] is designed to perform real and
complex FFTs and IFFTs, as well as linear filtering and pattern matching in the time and
frequency domains. The chip has radix-2 and -4 FFT building-block instructions that are
connected using the mixed power-of-primes algorithm from Chapter 9 to implement up to
a 65,536-point complex FFT. The chip uses both the Two-Signal Algorithm and Double-
Length Algorithm from Chapter 2 to compute FFTs of real input data. It uses the Overlap-
and-Add Algorithm from Chapter 6 for performing linear filtering and pattern matching in
the frequency domain. All arithmetic is 16-bit mantissa block-floating-point.
Figure 14-30 is a block diagram of one of several ways to interface this chip set with
data memory and algorithm control logic. In addition to the a66110 (269 pins), the address
generator function is also provided as a chip and is called the a66210 (180 pins). Array
Microsystems also provides a reduced pinout version of this chip set (a66111/a66211), each
having 144 pins. The primary distinguishing feature of this chip set is that it performs FFTs
up to 65,536 points.
Figure 14-30 Block diagram of one configuration of the a66110 FFT processor: real and imaginary input RAMs 1-4, output RAMs 5-8, and complex multiplier coefficient RAMs 9 (cosine terms) and 10 (sine terms).
The operational strategy for the configuration in Figure 14-30 is to start by loading
a set of data into RAMs 1 and 3. Then, that set of data is moved through the processor to
output RAMs 5 and 7 while the first stage of FFT computations is performed. Then, these
intermediate results are passed back through the processor to RAMs 1 and 3 to perform the
second stage of the algorithm. This process continues until the final computations result in
the output frequency components being in RAMs 5 and 7.
During each pass, the appropriate complex multiplier coefficients are addressed from
RAMs 9 and 10 to satisfy the mixed-radix algorithm. During the first stage, these coeffi-
cients can be the weighting function. This capability is also used during frequency-domain
filtering/pattern matching to input the needed complex filter coefficients between the input
FFT and output inverse FFT. The chip supports both 25% and 50% overlapped data sets, as
explained in Chapter 6.
While the first FFT is being computed, the next set of data to be transformed is being
loaded into RAMs 2 and 4. After the first set of data is transformed, RAMs 2 and 4 become
the input, and RAMs 6 and 8 work with those RAMs to produce the next set of outputs. At
the same time, the controller addresses RAMs 5 and 7 to output the results of the previous
FFT. This architecture allows data to be continuously input and the results to be output
while computations are performed. It also allows the input and output data clocks to work
at a different rate than the processing clock, as long as the data is loaded and output before
the end of the present FFT computation.
For computing FFTs of real data, the processor has instructions that support both
types of data reorganization described in Chapter 6. However, the data must be input in
the proper form for the transform to work. Once that has occurred, an output instruction
performs the necessary unraveling of the data.
A subtle point with this chip set is that an odd number of FFT stages is required to
have the output in the memories on the right side of Figure 14-30 (RAMs 5-8). This means
that if 2-point stages are being used, 128-, 512-, 2048-, ... point transforms have the best
performance. To get a 1024-point FFT to the output RAMs requires an extra pass of data
through the processor if 2-point stages are used. Since 4-point stages are also available,
they should be used for 64-, 1024-, and 4096-point FFTs to have an odd number of stages.
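The pass structure behind this constraint can be pictured as a ping-pong loop: each pass reads from one pair of RAMs and writes to the other, so only an odd number of passes leaves the results on the output side (RAMs 5-8). The C sketch below is a generic illustration with a placeholder stage routine; it is not the chip set's actual control interface.

    #include <stdio.h>

    #define N 16                              /* transform length for this sketch */

    /* Placeholder for one FFT pass: reads `in`, writes `out`. */
    static void process_stage(const double *in, double *out, int stage)
    {
        for (int i = 0; i < 2 * N; i++)       /* complex data: 2*N values */
            out[i] = in[i];                   /* real code would do the butterflies */
        printf("pass %d done\n", stage);
    }

    int main(void)
    {
        static double bank_in[2 * N];         /* stands in for RAMs 1 and 3 */
        static double bank_out[2 * N];        /* stands in for RAMs 5 and 7 */
        double *src = bank_in, *dst = bank_out;

        int stages = 3;                       /* odd: results end on the 5/7 side */
        for (int s = 0; s < stages; s++) {
            process_stage(src, dst, s);
            double *tmp = src; src = dst; dst = tmp;   /* swap sides each pass */
        }
        /* After an odd number of passes, `src` points at the output-side bank. */
        return 0;
    }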
The Sharp chip set [36] is designed to perform real and complex FFTs and IFFTs, as
well as linear filtering and pattern matching in the time and frequency domains. The chip
has radix-2, -4, and -16 FFT building-block instructions that are connected by using the
mixed power-of-two algorithm from Chapter 9 to implement up to a 4096-point complex
FFT. The chip uses the Two-Signal Algorithm from Chapter 2 to compute FFTs of real
input data and the Overlap-and-Add Algorithm from Chapter 6 (called overlap and discard
in the Sharp application notes) for performing linear filtering and pattern matching in the
frequency domain.
Figure 14-31 is a block diagram of how to interface this chip set with data memory and
algorithm control logic for the most efficient execution of FFT algorithms. In addition to
the LH9124, the address generator function is also provided as a chip by Sharp and is called
their LH9320. The primary distinguishing feature of this chip set is that it performs FFTs
using 24-bit block-floating-point arithmetic. This makes the random quantization noise
at the output of the FFT computation 8 bits less than using a 16-bit block-floating-point
processor. This allows frequency components that are 24 dB lower to become visible above
quantization noise.
In Figure 14-31, the Q-port is used to input data and to output results from the
processor. The C-port is used to provide weighting function coefficients, complex multiplier
coefficients, and frequency-domain linear filter/pattern matching coefficients. This allows
any weighting function or filter coefficients to be used by the processor.
The A- and B-ports are used to store intermediate results during the various stages of
the computations. If data is stored in the RAM connected to data port A, then the next step
is to pass that data into the processor to execute the next stage of the FFT algorithm and
store the results in the data RAM connected to port B. The opposite process occurs at the
next stage of computations.
Figure 14-31 Block diagram for interfacing the LH9124 with LH9320 address generators and the real and imaginary data RAMs on its Q, A, B, and C data ports.
Unlike the array Microsystems chip set, either intermediate RAM can feed data to
the output. However, the same data RAM is used for both input and output data, as shown
in Figure 14-31. This requires more coordination between the input of data and the output
of results than is required by the array Microsystems chip set.
real or complex FFTs. The chip does not support sequencing for executing real FFTs or
linear filtering in the frequency domain. However, both real FFT algorithms from Chap-
ter 2 and frequency-domain filtering/pattern matching algorithms from Chapter 6 can be
implemented with off-chip logic because the chip does support complex and real multipli-
cation.
Figure 14-32 is a block diagram of how to interface this FFT chip with data memory
and algorithm control logic. The primary distinguishing feature of this chip is that it can
compute all power-of-two FFTs from 16 to 1024 points and has the complex multiplier
coefficients for these algorithms stored in an on-chip ROM. Its 16-bit block-floating-point
arithmetic provides better quantization noise performance than 16-bit fixed-point
processors, and its off-chip weighting function RAM allows any weighting function or complex
filter coefficients to be implemented.
Figure 14-32 Hardware block diagram for computing FFTs using the TMC2310.
Figure 14-33 PDSP16510 with an off-chip complex multiplier on its auxiliary input and an external weighting function memory and control counter.
If another weighting function is required, it must be applied before inputting the data
to the chip. Similarly, if the device is to be used to perform linear filtering or pattern
matching in the frequency domain, an off-chip complex multiplier must be connected as
shown in Figure 14-33. No off-chip data memory is needed for FFTs up to 256 points. Figure
14-34 shows the configuration required for 1024-point FFTs. Plessey makes a companion
chip (PDSP16540) to perform the needed data memory addressing function, including the
address and clock timing interfaces.
The data in the Comparison Matrix in Table 14-5 comes from the referenced vendor material.
For the 1024-point complex FFT performance, this is the fastest number available in the
referenced material. Different versions of a 1024-point FFT may produce slightly different
performance numbers. Versions of the chips that run at slower speeds will have times that
are slower. Conversely, newer versions of these chips, which run faster, will have faster
times.
                        1024-point        Programmed FFT        Largest       # of block floating-
FFT-specific chip/set   complex FFT (µs)  building blocks       complex FFT   point mantissa bits

array Microsystems
a66110/a66210           131               2 and 4 points        65,536        16
a66111/a66211           131               2 and 4 points        65,536        16

Sharp Electronics
LH9124/LH9320           87                2, 4, and 16 points   4,096         24
LH9124L/LH9320          129               2, 4, and 16 points   4,096         24

Raytheon
TMC2310                 514               2 point               1,024         16

Plessey
PDSP16510               96                4 point               1,024         16
complex FFT without adding data memory to the ASIC design. Program memory must be
added to store the algorithm code and the multiplier constants.
ASIC programmable     1024-point        Data I/O   On-chip data   On-chip prog.   # of address
DSP chip core         complex FFT (ms)  ports      memory words   memory words    generators

DSP Semiconductor
Pine core             2.2               0s/0p      2048           0               2
Oak core              2.2               0s/0p      2048           0               2
The multiprocessor architecture is similar to the linear bus described in Section 11.2.2
with multiple processors and data memory on the bus. Star Semiconductor has devised a
unique time-division-multiplexing scheme to remove the complexity of the four (two for
the SPROC1200/1210) processors trying to access the data memory from the same bus. For
example, the program memory bus has a five-cycle sequence. Each of the four processors
is assigned to use the bus during one of the five cycles, and the fifth cycle is for data I/O.
The same is true of the data memory bus.
Each of the four general-purpose DSPs has a five-stage pipeline processing cycle to
match the five-cycle bus multiplexing scheme. By time-multiplexing the program and data
accesses of each processor, all five can be kept busy without causing bus contention. Each
processor has its own 24-bit fixed-point MAC (multiply-accumulator; Figure 14-37).
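The five-slot round robin can be pictured with a few lines of C. The particular slot ordering shown is an assumption made for illustration; the point is that each bus user owns one fixed cycle out of five and therefore never contends with the others.

    #include <stdio.h>

    /* Five-cycle bus sequence: four general signal processors each own one
       slot, and the fifth slot is reserved for data I/O. */
    int main(void)
    {
        const char *owner[5] = { "GSP 0", "GSP 1", "GSP 2", "GSP 3", "data I/O" };

        for (int cycle = 0; cycle < 10; cycle++)
            printf("bus cycle %d -> %s\n", cycle, owner[cycle % 5]);
        return 0;
    }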
The building-block form of FFT algorithms matches well with this architecture. At a
top level, consider the implementation of a 256-point radix-4 FFT algorithm. The algorithm
has four stages, and at each stage it requires 64 four-point FFT computations. One strategy
for performing this algorithm on the SPROC1400 is to allocate 16 of the 64 four-point FFTs
at each stage to one of the four processors. Since each 4-point building block is identical,
each processor has the exact same code to execute and therefore finishes its portion of each
stage at the same time.

This approach also makes this architecture good for computing the Winograd, prime
factor, or mixed-radix algorithms from Chapter 9. For example, consider the 3*5*8 = 120-
point prime factor algorithm. The 3-point stage requires computing 120/3 = 40 three-point
building blocks. For the SPROC1400 this means each processor performs 10 three-point
FFTs. The 5-point stage requires computing 120/5 = 24 five-point building blocks. For
the SPROC1400 this means each processor performs 6 five-point FFTs. Finally, the eight-
point stage requires 120/8 = 15 eight-point building-block computations. For this stage,
three of the four processors compute 4 eight-point FFTs and one only computes three. The
single central data RAM makes accessing the proper inputs for each of these building-block
computations straightforward.
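The per-processor workload for this 120-point example can be tabulated in a few lines of C. Spreading any remainder over the first processors is one reasonable allocation policy and reproduces the counts quoted above; it is not the only possible assignment.

    #include <stdio.h>

    /* Building-block counts per processor for the 3*5*8 = 120-point prime
       factor algorithm running on four processors. */
    int main(void)
    {
        int n = 120, radices[3] = { 3, 5, 8 }, procs = 4;

        for (int s = 0; s < 3; s++) {
            int blocks = n / radices[s];      /* building blocks in this stage */
            for (int p = 0; p < procs; p++) {
                int share = blocks / procs + (p < blocks % procs ? 1 : 0);
                printf("%d-point stage: processor %d computes %d blocks\n",
                       radices[s], p, share);
            }
        }
        return 0;
    }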
At first glance, having all the processors repeat the same algorithm causes lost cycles,
while each processor waits for its turn to obtain input data and output results. In reality, the
solution is simple. The first time the processors finish a block of algorithm
code, the processors send results out in sequence and receive new data in sequence. From
that point on, the processors are out of synchronization by one, two, three, and four clocks
and therefore have outputs available, in time sequence, so that processor cycles are not
lost.
Serial I/O. All members of this family have two serial input ports and two serial
output ports. These additional serial ports provide the capability to interface these devices into
linear bus, pipeline, and ring bus architectures for multiprocessor applications (Section
14.2.9) without having to use the parallel bus that may be addressing off-chip data or
program memory.
Program RAM. The SPROC1200 and SPROC1210 have 512 words of program
RAM, and the SPROC1400 has 1024 words of program RAM.
Data RAM. The SPROC1200 and SPROC1210 have 512 twenty-four-bit words
of data RAM, and the SPROC1400 has 1024 twenty-four-bit words of data RAM. This
limits the complex FFTs that can be performed on-chip to 256 and 512 points, respec-
tively. Therefore, the 1024-point FFT performance numbers in the Multiple Processor
Programmable DSP Chips Comparison Matrix (Section 14.12) already reflect the penalty
paid for addressing off-chip data memory.
Boot ROM. Boot ROM is additional on-chip memory to allow the on-chip program
RAM to be loaded during the power-up phase of the application's operation from a low-speed
24-bit-wide EPROM to lower the cost of the overall application. It also allows multiple
programs to be swapped in and out of the on-chip program memory without having to store
them in high-speed off-chip program RAM.
Multiply-Accumulator (MAC) and Arithmetic Logic Unit (ALU). Unlike the gen-
eric programmable DSP chip block diagram (Figure 14-1), the MAC and ALU in this
architecture have only one bus to input data and output results. This is not a problem
for computing FFTs because the multiply-accumulate function takes three clock cycles to
implement, not one cycle like the generic programmable DSP chip, and a data interface
with the main chip architecture can only occur every five cycles.
Address Generators. Each general signal processor has two address generators.
One handles program memory addressing and one handles data memory addressing. These
generators are capable of direct and indexed addressing needed to implement the FFT
algorithms in Chapters 8 and 9.
Program Control. Program control logic controls the sequencing of the various
functions in the general signal processor, such as address generation and the three steps in
each multiply computation.
The TMS320C8x is the first programmable DSP chip to have four DSP blocks con-
nected by a crossbar switch and controlled by a RISC floating-point processor. The first
block diagram, Figure 14-38, shows how the four processors are interconnected with each
other and on-chip memory. The second block diagram, Figure 14-39, shows the internal
architecture of the programmable DSP blocks. The only member of this family is the
TMS320C80 [41].
Figure 14-38 High-level block diagram of the TMS320C80 family.
putes three. The crossbar switch interface to data RAM makes accessing the proper inputs
for each of these building-block computations straightforward.
The architecture of the individual fixed-point DSPs is shown in Figure 14-39. Each
has two address generators and no data or program memory or multiplexers to combine the
data and program buses. Additionally, there is a third address and data bus pair, called the
global bus. The serial I/O is also missing from the DSPs because it is not needed in this
highly integrated internal chip architecture.
The data in the Comparison Matrix in Table 14-7 comes from the referenced vendor material.
In the case of the 1024-point complex FFT performance, this is the fastest number available
in the referenced material. Different versions of a 1024-point FFT may produce slightly
different performance numbers. Versions of the chips that run at slower speeds will have
times that are slower. Conversely, newer versions of these chips, which run faster, will have
faster times. Performance numbers with an asterisk behind them are estimated because
times for the 1024-point FFT were not available from the vendor.
14.13 CONCLUSIONS
Choices, choices, and more choices! Few engineers have the time to keep abreast of the
rapid changes and hundreds of options available for creating DSP products in general and
FFT products in particular. This comprehensive inventory would be hard to choose from
without the guidelines given with a "standardized" approach to block diagrams for each
chip family. At this stage of the book, the reader is ready to select a chip or multiples of it
for processing the algorithm chosen from the information in Chapters 8, 9, and 12.
The number of board-level companies and products for FFT applications is many
times higher than at the chip level. Therefore, only guidelines for selecting off-the-shelf
boards are provided in the next chapter.
REFERENCES
[1] ADSP-2101 and ADSP-2102 User's Manual-Architecture, Analog Devices, Inc.,
Norwood, MA, 1990.
[2] ADSP-2111 User's Manual-Architecture, Analog Devices, Inc., Norwood, MA,
1990.
[3] Mixed-Signal Processor with Host Interface Port, ADSP-21msp50A/55A/56A, Analog
Devices, Inc., Norwood, MA.
[4] ADSP-2171 DSP Microcomputer, Analog Devices, Inc., Norwood, MA, 1993.
[5] WE DSP16 and DSP16A Digital Signal Processors Information Manual, AT&T Mi-
croelectronics, Allentown, PA, 1989.
[6] WE DSP16C Digital Signal Processor/Codec, AT&T Microelectronics, Allentown,
PA, 1991.
[7] DSP1610 Signal Coding Processor, AT&T Microelectronics, Allentown, PA, 1993.
[8] DSP1616-x11 Digital Signal Processor, AT&T Microelectronics, Allentown, PA,
1993.
[9] Piranha Digital Signal Processor, DSP1616-x30, AT&T Microelectronics, Allentown,
PA, 1993.
[10] DSP1617 Digital Signal Processor, AT&T Microelectronics, Allentown, PA, 1993.
[34] TMS320C4x Technical Brief, Digital Signal Processing Products, Texas Instruments,
Inc., Dallas, TX, 1991.
[35] Digital Signal Processing a66540 FDaP User's Guide, Revision a66540IG/2.0, array
Microsystems, Inc., Colorado Springs, CO, 1992.
[36] Application Notes, Integrated Circuits, Liquid Crystal Displays, RF Components,
Optoelectronics, Sharp Electronics Corporation, Portland, OR, 1993.
[37] 1994 Data Book, ASSP, Standard Products, ASIC Arrays & Standard Cells, Raytheon
Semiconductor, Mountain View, CA, 1993.
[38] Digital Video & Digital Signal Processing IC Handbook, GEC Plessey Semiconductors,
Scotts Valley, CA, 1993.
[39] S. Berger, "An Application Specific DSP for Personal Communications Applications,"
Proceedings of the 1994 DSPx Exposition & Symposium, pp. 63-69 (June 1994).
[40] SPROC-1400 Programmable Signal Processor Data Sheet, STAR Semiconductor
Corp., San Jose, CA, 1993.
[41] TMS320C80, "TI's First Multiprocessor DSP, Product Overview," Arrow Electronics,
Inc., Carrollton, TX, 1994.
15
Board Decisions and Selection
15.0 INTRODUCTION
Getting to market with an FFT product is usually less expensive and faster if commercial-
off-the-shelf (COTS) hardware is available to run the algorithm efficiently. Even if the end
product will not be at the board level, a commercial board can be an inexpensive way to
develop and demonstrate the proof of concept. With several dozen manufacturers selling a
wide variety of DSP boards for PC, VME, SBus, and embedded applications, it is unrealistic
to describe and evaluate them in this chapter. That endeavor is surely an entire book by
itself. This chapter provides guidelines that engineers, managers, and students can use to
make their own decisions about appropriate COTS boards or the need to design one.
The key board specifications are:
• Processor
• Off-chip memory
• Analog I/O ports
• Instruction cycle time
• Parallel and serial I/O ports (buses)
• Host interface
15.1.5 Multiprocessing
In a multiprocessor application, a COTS solution can be a single board with more than
one chip connected in the selected architecture, or multiple boards, with one or more chips,
that can be connected in the selected architecture. Chapters 11 and 12 provide extensive
information on how to select multiprocessor architectures. When boards are connected in
one of those architectures, performance is reduced if data I/O between the processors is
slower than the processor's I/O instruction rate.
Question
1. Which boards have the selected DSP chip?
Answer
The fastest way to narrow the number of board candidates is by eliminating those
that do not have the chip already chosen. If two or more chips would meet product
specifications, all of the boards without one of those chips are eliminated.
Question
2. Does the board slow the FFT performance of the chip?
Answer
The timing on the chip does not always translate to the same timing on the board
because of slower board instruction cycle time and/or memory speed. Board vendors
list instruction cycle time or clock rate (which can be the same as or a multiple of the
instruction cycle time) in the board specifications. Memory speed is listed by vendors
in terms of the number of wait states (ws). If the off-chip memory runs at the same
speed as the chip can access it, this is called 0 ws. If it runs at half the speed the chip
can access it, the ws is 1, because the chip must wait one instruction cycle after it
requests data.
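A small worked example of that definition: each off-chip access costs (1 + ws) instruction cycles, so one wait state doubles the cost of every off-chip access. The 50 ns cycle time in the sketch below is only an example value, not a figure for any particular board.

    #include <stdio.h>

    int main(void)
    {
        double cycle_ns = 50.0;               /* example instruction cycle time */

        /* Effective off-chip access time grows linearly with the wait states. */
        for (int ws = 0; ws <= 2; ws++)
            printf("%d ws: each off-chip access takes %.0f ns\n",
                   ws, (1 + ws) * cycle_ns);
        return 0;
    }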
Question
3. What digital I/O ports does the board have?
Answer
There are three types of digital interfaces found on COTS boards. The first is the
standard bus interface such as PC, VME, or SBus. These are always parallel and
generally slower than a DSP chip is capable of transferring data, which slows the
chip's performance. The second is a serial interface, such as RS-232C. Most of the
general-purpose DSP chips in Sections 14.3 and 14.5 have serial interfaces that work
with an RS-232C.
The third and most preferable type of interface is a dedicated parallel interface,
designed to run at the DSP chip's parallel I/O instruction rate. Not all boards have this
feature because it requires adding a special-purpose connector and interface logic to
the board. However, when this is available, the board's DSP chip is able to function at
its maximum rate. This is a key element of a multiprocessor hardware architecture's
ability to perform at peak efficiency.
Question
4. Does the board have analog I/O ports?
Answer
Not every board has analog I/O ports because some are designed to only receive and
send digital data. The analog I/O port or ports use A/D and D/A functions in the DSP
chip or on the board to convert analog signals to digital ones that the chip can process.
The performance measures for A/D and D/A are the number of bits per sample and
the number of samples per second that they convert.
Question
5. Does the board have enough off-chip data and program memory?
Answer
The amount of memory an application needs is determined by the FFf algorithm and
transform length. The portion of that memory that will be off-chip is a function of the
chip selected. Some may even be off-board, depending on which board is used. The
on-chip memory is subtracted from the total memory to see how much the board needs
to have. If there is too much remaining for a board to handle, an external source such
as host processor RAM or hard disk, or a separate memory board, must be available.
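The budgeting described above is a simple subtraction; the sketch below uses made-up word counts (they are not tied to any particular chip or algorithm) just to show the bookkeeping.

    #include <stdio.h>

    int main(void)
    {
        int data_words_needed  = 8192;   /* e.g., data for the chosen FFT length  */
        int coeff_words_needed = 4096;   /* multiplier and weighting coefficients */
        int on_chip_words      = 2048;   /* memory available on the chosen chip   */

        int total    = data_words_needed + coeff_words_needed;
        int off_chip = total - on_chip_words;
        printf("the board must provide at least %d words\n",
               off_chip > 0 ? off_chip : 0);
        return 0;
    }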
Question
6. Which boards work with the selected high-level language?
Answer
Various versions of C and FORTRAN are common programming languages for en-
gineers and scientists. In recent years, graphical user interface (GUI) software has
become a popular way to go from block diagram design to C code. If the manufacturer
of the board, or the DSP chip on it, supports application software, including library
routine calls, in one of these languages, development time is reduced. The price paid
for faster software development is the inefficiency of cross compilers when converting
C and FORTRAN code to DSP chip code. Code converted from high-level languages
can take two to five times longer to execute than DSP chip assembly language.
Question
7. Does the algorithm library provide the needed FFT length?
Answer
If the chip's algorithm library does not have the needed FFT length, maybe the board's
library will. The more code an algorithm library provides, the less must be written
in high-level or assembly languages. This reduces development time and speeds up
processing because the algorithm library routines are usually written in assembly
language. Even if entire algorithms are not available in the algorithm library, decom-
posing the needed algorithms into building blocks that are available speeds execution
of the algorithm and shortens development time. If code is not available in a chip or
board algorithm library, it may be available from a third-party supplier.
Question
8. Do the algorithm library routines have a common I/O format?
Answer
Ideally, an application can be constructed by using a sequence of routines from the
algorithm library. However, if the data I/O formats for these routines are not the
same, additional algorithms must be executed between the algorithm library routines
to allow the data to flow from one routine to the next.
For example, suppose the application requires an FIR filter followed by an FFT.
The input to and output from the FIR filter library routine is likely to be in sequential
order, simply because that is how FIR filters are implemented. Then the filter routine
will perform all the multiplies and adds to produce a new output each time a new
input data value enters the routine.
On the other hand, the N-point FFT routine needs a set of N samples at one
time. Therefore, a buffer must be set up between the FIR filter routine and the N-point
FFT routine to accumulate N FIR outputs to use for the next N-point FFT input set
(Figure 15-1). The output of the FFT library routine provides N answers at one time.
To convert this block of data back to a sequence of results requires another data buffer
routine. All of this adds to the application execution time and to the development
time and cost.
Figure 15-1 Data buffer between a sequential FIR filter routine and a block-oriented N-point FFT routine.
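A sketch of the buffering described above: a sample-by-sample FIR routine feeds an N-sample buffer, and the block-oriented FFT routine is called only when the buffer fills. Both routines are placeholders standing in for the actual library calls.

    #include <stdio.h>

    #define N 8                               /* FFT length, kept tiny here */

    /* Placeholder FIR routine: one output per input sample. */
    static double fir_filter(double x) { return x; /* real code would convolve */ }

    /* Placeholder FFT routine: consumes a block of N samples at once. */
    static void fft_block(const double *block)
    {
        printf("FFT called on a block starting with %f\n", block[0]);
    }

    int main(void)
    {
        double buffer[N];
        int count = 0;

        for (int i = 0; i < 32; i++) {        /* stream of input samples */
            buffer[count++] = fir_filter((double)i);
            if (count == N) {                 /* buffer full: hand it to the FFT */
                fft_block(buffer);
                count = 0;
            }
        }
        return 0;
    }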
Question
9. Does the board support real-time operating systems (RTOS)?
Answer
In real-time applications, a common but complex portion of the design is the code that
controls the interface between the DSP chip and the data I/O interface hardware. Real-
time operating systems (RTOS) are software subroutines that reduce the programming
necessary to accomplish this portion of the design.
Question
10. What control, data I/O, and graphical display software are available?
Answer
Board manufacturers provide algorithm library software to reduce the time required
for the application developer to implement required functions. Most applications also
require software to control the operation of the board, control the movement of data
on and off the board once the RTOS has synchronized the data interface, and interface
to graphical display software and hardware. If basic algorithms are also provided by
the board manufacturer for these functions, the time to market is reduced. This is
because not only are these functions usually required by the application, but they
can also be used to enter data and view results as part of the algorithm debugging
process. Therefore, it is important to identify which of these functions are relevant
for the application and determine if they are available from the board manufacturer,
chip manufacturer, or a third-party supplier.
Question
11. Can the board be expanded with a daughter card?
Answer
One way to expand the capability of a board is by connecting a smaller board (daughter
card) to it. This has two advantages over adding more boards. The first is cost. The
small boards are generally less expensive than large ones and add little space to the
volume required by the application. The second is performance. The connections
to the daughter cards are much shorter, and therefore faster, than those between full
cards.
Question
12. Does the board have prototyping area?
Answer
Some boards may meet the majority of the needs of an application but be missing
something vital. For example, suppose a board can perform all of the computations
in the required time but does not have the A/D and D/A converters needed. If the
board vendor provides a prototyping area, then the application developer can put
these functions in the prototyping area. The resulting product only requires one
board rather than an additional A/D and D/A interface board. This reduces the cost,
size, and weight of the product.
Question
13. Does the board have the selected architecture?
Answer
The fastest way to narrow the number of board candidates is by eliminating those that
do not have the chip and architecture which have already been selected. If more than
one board meets those specifications, the issues dealt with in the preceding questions
and answers are used to further narrow the choice. If no single board is suitable, the
answer to Question 14 must be used.
Question
14. Can the board be connected to one or more copies of itself, using the selected
architecture?
Answer
The digital I/O ports on the board determine what kinds of multiprocessor architec-
tures can be implemented. The text and figures in Section 14.2.9 show how to use
chip serial I/O ports to form multiprocessor architectures. These same concepts can
be applied to board interconnections by replacing the DSP chips in those figures with
DSP boards, whether the I/O ports are parallel or serial. If no board exists that can
be configured into the selected architecture, a custom board must be designed or the
architecture decision must be revised.
Question
15. Can the board move data at the processor's I/O instruction rate?
Answer
An architecture was chosen because of its throughput and/or latency performance with
a particular algorithm. Chapters 11 and 12 dealt with how efficiently architectures
compared, assuming each processor takes one instruction cycle for each add, multiply,
or data move. If the data input, intermediate, or output results overhead (which
comprises total I/O instruction time) takes more than one cycle, that portion of the
architecture's throughput or latency will be slowed. It is important to be aware of
this possible slowdown and what causes it. This is most likely to occur when a board
uses a standard bus, and is least likely to happen when a board has a dedicated parallel
interface.
15.3 CONCLUSIONS
Many factors must be carefully evaluated to be certain that a COTS board will do the job that
meets the specifications of a product. Designers should know how to answer these questions
for their application before purchasing a board or when deciding on the specifications for
a custom-designed board. The next chapter gives the test signals and methods needed to
detect and isolate errors that occur during software development on the board chosen using
these guidelines.
16
Test
16.0 INTRODUCTION
The book would not be complete without explaining how to test the performance of the FFT
algorithms it shows how to construct and implement. This chapter provides test signals and
shows how to use them to detect and isolate the errors that occur during development of
FFT algorithms, conversion of them to code, and operation of them in a product. Each area
is explained separately. A recommended set of test signals is described, and its ability to
detect and isolate errors is illustrated, using the 4-point FFT example from Section 8.5 and
the 16-point radix-4 FFT example from Section 9.7.5.
16.1 EXAMPLE
This chapter uses the 16-point radix-4 FFT example to illustrate the test signals and methods
explained here. This algorithm is a mixed-radix technique from Chapter 9 and uses the 4-
point building block from Chapter 8. Figure 16-1 is a flow graph of the 4-point building
block, and Figure 16-2 is a flow graph of the 16-point radix-4 FFT. Unlike Chapters 8 and 9,
where Memory Maps are more useful than flow graphs, flow graphs are the most powerful
way to understand the test process, because it is so easy to see the path from the error to the
FFT outputs. This allows the output error patterns to be easily understood.
[Figure 16-1. Flow graph of the 4-point FFT building block (see Sections 16.2.1, 16.3.1, 16.4.1, and 16.4.3).]
[Figure 16-2. Flow graph of the 16-point radix-4 FFT, built from four input and four output 4-point building blocks, with the input samples a(0) through a(15) reordered at the input and the outputs A(0) through A(15) in natural order.]
16.2 ERRORS DURING ALGORITHM DEVELOPMENT
Algorithm development includes the Algorithm Steps and Memory Maps for the needed
building-block algorithms as well as for combining them into the complete N-point FFT.
The building blocks from Chapter 8 and algorithms in Chapter 9 have been checked, using
the techniques described in this section, to ensure there are no algorithm errors. If another
building block or algorithm is going to be used, it is recommended that test signals be used
to verify the Algorithm Steps and Memory Maps prior to implementing the algorithm in
code.
Algorithm Step (arithmetic) errors can occur at the building-block level or in defining
the complex multipliers between the stages. The most complete method for ensuring the
correctness of the arithmetic is to start from each complex output frequency term, A(i), and
replace it with the Algorithm Step that is used to calculate it.
Then continue to move back through the algorithm and replace each term that makes up
those terms. This process continues until the equation is in terms of the complex input data,
a(i). Then compare that equation with the corresponding DFT equation to ensure they are
the same.
The 4-point FFT, shown in Figure 16-1, provides a simple example that illustrates
this approach. The Algorithm Steps for each of the output frequency terms (Equation 16-1)
are listed first, followed by the corresponding 4-point DFT (Equation 16-2).

A(0) = Σ_{n=0 to 3} a(n) * e^(-j2π(0)n/4) = a(0) + a(1) + a(2) + a(3)

A(1) = Σ_{n=0 to 3} a(n) * e^(-j2πn/4) = a(0) - j * a(1) - a(2) + j * a(3)
                                                                                     (16-2)
A(2) = Σ_{n=0 to 3} a(n) * e^(-jπn) = a(0) - a(1) + a(2) - a(3)

A(3) = Σ_{n=0 to 3} a(n) * e^(-j3πn/2) = a(0) + j * a(1) - a(2) - j * a(3)

where

a(n) = aR(n) + j * aI(n)
j * a(n) = -aI(n) + j * aR(n)
If the real and imaginary parts of the input data, a(n), are substituted in Equation 16-2, the
result is

A(0) = [aR(0) + aR(1) + aR(2) + aR(3)] + j * [aI(0) + aI(1) + aI(2) + aI(3)]
A(1) = [aR(0) + aI(1) - aR(2) - aI(3)] + j * [aI(0) - aR(1) - aI(2) + aR(3)]
                                                                                     (16-3)
A(2) = [aR(0) - aR(1) + aR(2) - aR(3)] + j * [aI(0) - aI(1) + aI(2) - aI(3)]
A(3) = [aR(0) - aI(1) - aR(2) + aI(3)] + j * [aI(0) + aR(1) - aI(2) - aR(3)]
The final step is to compare Equations 16-1 and 16-3 to see that they are mathe-
matically identical. Notice that the order of the a(i) terms in the two sets of equations
is different. This is caused by the sequence of Algorithm Steps used to reduce the to-
tal computations. However, the equations all have the same terms. Therefore, all of the
building-block arithmetic is correct.
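This check can also be automated numerically. The sketch below is a minimal illustration, not the book's code: the intermediate sums and differences are given the illustrative names b(0) through b(3) used later in this chapter, and the result of the Algorithm Steps is compared with the 4-point DFT of Equation 16-2 for random complex inputs.

# A minimal sketch (not the book's code): compute the 4-point FFT from the
# add/subtract Algorithm Steps and compare it with the DFT of Equation 16-2.
import numpy as np

def fft4_steps(a):
    """4-point FFT via intermediate sums and differences b(0)..b(3)."""
    b0 = a[0] + a[2]          # node that combines a(0) and a(2)
    b1 = a[0] - a[2]
    b2 = a[1] + a[3]
    b3 = a[1] - a[3]
    return np.array([b0 + b2,           # A(0)
                     b1 - 1j * b3,      # A(1)
                     b0 - b2,           # A(2)
                     b1 + 1j * b3])     # A(3)

def dft4(a):
    """Direct 4-point DFT (Equation 16-2)."""
    n = np.arange(4)
    return np.array([np.sum(a * np.exp(-2j * np.pi * k * n / 4)) for k in range(4)])

rng = np.random.default_rng(0)
a = rng.normal(size=4) + 1j * rng.normal(size=4)
assert np.allclose(fft4_steps(a), dft4(a))   # the building-block arithmetic matches the DFT

If the two results disagree, printing both vectors immediately shows which outputs carry the error, which is the pattern-tracing idea described above.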
If there is an error, the flow graph in Figure 16-1 is invaluable in tracing the source
of that error. For example, suppose the node in Figure 16-1 that adds a(0) to a(2) is a
subtract instead of an add. Then, using Figure 16-1, that error affects A(0) and A(2) but
not A(1) and A(3). Therefore, if a(2) has the wrong sign in A(0) and A(2), it must have
been subtracted from, not added to, a(0). Each arithmetic error in the algorithm has its own
pattern that can be easily discerned by looking at how the error propagates to the output of
the flow graph.
This same process can be used at the complete algorithm level to verify the accuracy
of the complex multiplications between the building blocks and that the output of the first-
stage building blocks is input to the proper places in the second-stage building blocks. At
first this looks like a very large set of computations to perform. Fortunately, the regularity
of the building-block interconnection algorithms and the fact the building blocks have been
checked can be used to simplify these checks significantly.
The 16-point radix-4 FFT, shown in Figure 16-2 and used later as an example, illus-
trates these features. The input to each of the four output 4-point FFTs is 4 of the 16 in-
put building-block outputs, modified by the appropriate complex multipliers. Since the
4-point building-block arithmetic is known to be correct, checking any one of the four
outputs of an output 4-point FFT verifies that the correct data has been sent to it. Therefore, only four output
frequency terms must be checked to verify the algorithm, one from each of the four output
4-point FFTs.
For example, suppose the third output of the second input 4-point FFT is multiplied
by +j, not - j. Then the error propagates into the third output 4-point FFT and affects
frequency outputs A(2), A(6), A(10), and A(14). All of the other outputs will be correct.
Since all four of the outputs of this 4-point FFT are affected by the error, it is immaterial
which is chosen to check the algorithm arithmetic.
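To see this error pattern numerically, the sketch below builds a 16-point FFT from two stages of 4-point DFTs with inter-stage complex multipliers. This is a generic 4 x 4 Cooley-Tukey decomposition, not the exact indexing of Figure 16-2 or Section 9.7.5, and the corrupted multiplier is simply negated rather than being exactly +j or -j; the point is the error-pattern behavior, which is the same.

# A minimal sketch: a 16-point FFT as two stages of 4-point DFTs with twiddle
# factors in between, with a deliberate sign error injected into one inter-stage
# multiplier.
import numpy as np

def dft4(x):
    """Direct 4-point DFT of a length-4 complex vector."""
    n = np.arange(4)
    W = np.exp(-2j * np.pi * np.outer(n, n) / 4)
    return W @ x

def fft16_4x4(x, twiddle_error_at=None):
    """16-point FFT as a 4x4 Cooley-Tukey decomposition.
    twiddle_error_at = (n2, k1) flips the sign of that inter-stage multiplier."""
    x = np.asarray(x, dtype=complex)
    A = np.zeros(16, dtype=complex)
    inner = np.zeros((4, 4), dtype=complex)        # inner[n2, k1]
    # Stage 1: four "input" 4-point DFTs over samples n2, n2+4, n2+8, n2+12
    for n2 in range(4):
        inner[n2, :] = dft4(x[n2::4])
    # Inter-stage complex multipliers W16^(n2*k1)
    for n2 in range(4):
        for k1 in range(4):
            w = np.exp(-2j * np.pi * n2 * k1 / 16)
            if (n2, k1) == twiddle_error_at:
                w = -w                              # injected sign error
            inner[n2, k1] *= w
    # Stage 2: four "output" 4-point DFTs, one per k1, producing A(k1 + 4*k2)
    for k1 in range(4):
        A[k1::4] = dft4(inner[:, k1])
    return A

rng = np.random.default_rng(0)
x = rng.normal(size=16) + 1j * rng.normal(size=16)
good = fft16_4x4(x)
bad = fft16_4x4(x, twiddle_error_at=(1, 2))        # error on a path into output group k1 = 2
assert np.allclose(good, np.fft.fft(x))
print("outputs changed by the error:", np.where(~np.isclose(good, bad))[0])
# -> [ 2  6 10 14] : only the third output 4-point FFT's frequencies are affected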
Once the individual building-block memory mapping schemes have been checked
and used to form the complete FFT, it must also be checked. For a P * Q = N-point
FFT, there are Q P-point FFTs performed as the input computations and P Q-point FFTs
performed as the output computations. This leads to a two-stage memory mapping check
of the complete algorithm. First the input P-point FFT memory mapping is checked. If
the memory mapping strategy from Section 9.4 is used for the input building blocks, this
check is simple. In that strategy, the Memory Map of the input data to each of the input FFT
building blocks is different and follows the pattern of the building-block Memory Maps
from Chapter 8.
The only exception to this is the additional data memory locations that most of the
building blocks require in the center of their computations. The simplest answer to the
additional memory location problem is to allocate those locations to a separate area of
memory not used by any of the building blocks. As mentioned in Chapter 9, only one set
of extra memory locations is required for most applications. This means that, since the
building-block memory mapping is already checked before combining the building blocks
into a larger transform, the only thing to check is that the data memory areas for each building
block do not overlap. The algorithms in Chapter 9 were checked using this approach. A
similar argument ensures that the output Q-point FFTs do not interfere with each other.
Any error in coding the Algorithm Steps of a building block propagates to the output
of the building block and to the output of the complete FFT when the code is combined
by using the algorithms in Chapter 9. Debugging the FFT code during development is
simplified by debugging the individual building blocks before they are combined into the
complete FFT. For example, with the 4-point FFT building-block algorithm in Table 16-1,
if the computation of bR(0) = aR(0) + aR(2) is incorrectly programmed, AR(0) and AR(2)
will be incorrect because bR(0) is used to compute these two outputs. Figure 16-1 shows
the same thing, where bR(0) is the real part of the node that combines a(0) and a(2). Other
arithmetic errors can also cause the same two outputs to be incorrect.
These errors can be checked with the sequence of steps described in Section 16.2 for
the algorithm development stage. However, because the code is in a computer at this point
and has been verified at the algorithm level, test input signals provide the most efficient
means for finding coding errors. The test signals described in Section 16.5 are specifically
designed to isolate errors based on the patterns they exhibit at the building block and
complete FFT outputs. In both cases, the flow graph of the building block makes it easier
to trace and isolate errors.
There are three ways that the multiplier constants, both in building blocks and complex
constants between building blocks, can be incorrectly converted to code. In all three cases,
the error propagates to the building block and complete FFT outputs to cause errors in the
answers.
The first incorrect conversion is to use the wrong equation for computing the constant.
The arguments of the sines and cosines or the way they are combined to form a constant
can be wrong. This causes incorrect numerical values for the constants or a sign error.
For example, in the 4-point FFT, the -j multiplier in Figure 16-1 is -j * sin(90°). If the
argument of the sine term were -90°, then the multiplier would have been +j and an error
would have occurred in A(1) and A(3).
The second incorrect conversion is to use the wrong round-off technique for the
arithmetic format chosen for the application. For this reason all the multiplier constants for
the algorithms in Chapters 8 and 9 are in equation form rather than just numerical values.
Generally, standard round-off to the nearest least significant bit is the correct approach. If
the constants are truncated instead, small errors are introduced into all of the outputs. The
characteristics of these quantization errors are explained in Chapter 13.
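The effect of rounding versus truncating the multiplier constants can be demonstrated with a short experiment. The sketch below is a minimal illustration that quantizes the constants of a direct DFT matrix to an assumed 16-bit fractional (Q15) format; the book's actual fixed-point formats and error analysis are in Chapter 13.

# A minimal sketch of the second conversion error: quantizing the multiplier
# constants by truncation instead of rounding to the nearest least significant bit.
import numpy as np

def dft_with_quantized_constants(x, quantize):
    """Direct DFT computed as a matrix product with quantized constants."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    Wq = quantize(W.real) + 1j * quantize(W.imag)    # quantize each constant
    return Wq @ x

scale = 2 ** 15                                      # assumed Q15: 15 fractional bits
round_q = lambda c: np.round(c * scale) / scale      # round to nearest LSB
trunc_q = lambda c: np.floor(c * scale) / scale      # truncate (drop the low bits)

rng = np.random.default_rng(4)
x = rng.normal(size=64) + 1j * rng.normal(size=64)
exact = np.fft.fft(x)
err_round = np.max(np.abs(dft_with_quantized_constants(x, round_q) - exact))
err_trunc = np.max(np.abs(dft_with_quantized_constants(x, trunc_q) - exact))
print(f"rounded constants: {err_round:.2e}, truncated constants: {err_trunc:.2e}")
# Truncation biases every constant low, so its output error is typically larger.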
The third incorrect conversion is the result of storing the multiplier constants in the
wrong locations. Then, when the multiplier constants are accessed, completely uncontrolled
numbers are used. These errors propagate to the output frequency components and have
the same error patterns as incorrect arithmetic computations.
Data reorganization occurs at the input and between the building-block stages of an
FFT. Additionally, the complete FFT requires the building blocks to memory-map blocks of
data located in multiple locations in data memory. If either of these two memory mapping
schemes is incorrectly converted to code, the FFT outputs will be dramatically altered.
If the equation for input data reorganization is incorrectly implemented, it reorders the
input data sequence and causes the FFT to analyze a shuffled input signal. If the equation
for data reorganization between the building-block stages is incorrect, the partial patterns
computed by the input building-block FFTs are destroyed and the output is also drastically
altered. Finally, if the incorrect memory map conversion results in using locations that do
not contain data, then a portion of the input sequence is altered. The result is a substantial
change in the output of the FFT. All three of these errors can be isolated by using the test
sequences in Section 16.5.
Relabeling of the memory mapping scheme developed for each building block is
required for most FFT algorithms because the data does not exit the first building-block
algorithms in order. When a relabeling technique, like the one recommended in Section
9.4, is needed, it is possible to make a mistake in the relabeling process. When this occurs,
the algorithm memory mapping uses incorrect data for some portion of the computations.
Once the error is made, it generally propagates to several output frequencies. The error
pattern that occurs when each of the test signals is applied can be used to isolate this error.
[Figure 16-3. Generic programmable DSP chip block diagram: data memory, address generator, and program memory connected to the data I/O, the arithmetic unit with its accumulator, and the program counter.]
in Figure 16-1, these arithmetic errors propagate to the output and generally cause all of the
results to be wrong. Because this is a catastrophic arithmetic failure, any test signal is also
likely to have all of its outputs wrong.
One exception is the zero test signal. In most cases a zero input sequence will result
in zero outputs. The exception is if one of the bits of the multiplier, adder, or accumulator
outputs is stuck high. However, these bits represent a very small portion of the total transistor
count in the arithmetic unit. If this occurs, the zero input sequence is likely to produce the
same nonzero outputs for all of the frequency components. The reason for this is that the
only thing generating the nonzero numbers is the failed bit. Therefore, regardless of the
arithmetic to be performed, the answer is likely to look the same.
value of a(2). From Figure 16-1 this means that the error can propagate to A(0) and A(2)
but not to A(1) and A(3). In fact, depending on the specific values of the other inputs, none
of the outputs may be incorrect. One input sequence that can be used to catch this type of
error in any of the memory locations is one that has a nonzero value for only one location.
This is called the unit pulse when it is described in Section 16.5.1.
There are three likely data I/O failures. The first is with the interrupt control logic that
synchronizes the input of data to the processor and the output of results from the processor.
When this occurs, the input data sequence is no longer correct, which results in incorrect
FFT outputs.
The second and third likely failures are associated with the input and output con-
nections for the data itself. If one of these fails, on either side of the data I/O circuitry in
Figure 16-3, the signal is modified. Since the FFT is a linear computation, the resulting
FFT provides answers as if there are two signals present, the actual signal and the signal
which represents the data modification.
             Unit pulse    Constant    Sine wave 1    Sine wave 1 + constant
aR(0)           100           100          100                 200
aI(0)            50            50            0                  50
aR(1)             0           100            0                 100
aI(1)             0            50          100                 150
aR(2)             0           100         -100                   0
aI(2)             0            50            0                  50
aR(3)             0           100            0                 100
aI(3)             0            50         -100                 -50
any of the four positions in the sequence can have the nonzero term. The key feature of
this signal is that it only activates one input to the FFT. Therefore, it shows how each input
signal contributes to the output. One test approach is to apply this signal at each of the FFT
inputs and ensure that the output is correct. Then, because the FFT is linear, it must work
for any arbitrary input signal. The drawback to this approach is that it requires many input
signals. For a 1024-point FFT, 1024 different test signals are required.
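A minimal sketch of the unit pulse check is shown below, with np.fft.fft standing in for the FFT under test and the 100 + j50 pulse amplitude taken from the table above. A pulse applied at input position p should reproduce column p of the DFT matrix, and a transform that passes all N pulse tests must, by linearity, be correct for every input.

# A minimal sketch of the unit pulse test: one test signal per input position.
import numpy as np

N = 4
pulse_value = 100 + 50j
dft_matrix = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)

for position in range(N):
    a = np.zeros(N, dtype=complex)
    a[position] = pulse_value                   # nonzero value at only one location
    expected = pulse_value * dft_matrix[:, position]
    assert np.allclose(np.fft.fft(a), expected)
print("all", N, "unit pulse tests passed")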
16.5.2 Constants
The constant signal is one where all of the complex values are the same. The key
features of this input signal are that it is easy to generate and that incorrect input data
reorganization does not cause errors in the output. It therefore becomes a good first test
signal to verify that much of the arithmetic in an algorithm is working, independent of the
input memory mapping. The biggest drawback is that the input add-subtract arithmetic
common to all of the building-block FFTs has zero as the output of all of the subtractions.
The b(1) terms in Table 16-2 are examples of this effect. Therefore, roughly half of the
algorithm's multipliers and the output arithmetic are not checked.
The single sine wave, centered at the first nonzero output frequency of the FFT, is a
signal that has exactly one cycle during the set of N data values input to the FFT. In general,
this test signal requires all of the multiplier constants to work to provide the correct answers.
Additionally, the data reorganization memory mapping must be correct or the signal will be
scrambled into another signal. This signal is best applied after the constant signal verifies
most of the arithmetic. Table 16-3 shows an example of this signal for the 4-point FFT.
One disadvantage of this signal is that it can also cause some intermediate points in the
computations to be zero. Once that happens, subsequent computations are not checked.
The b(0) terms in Table 16-2 are examples of this phenomenon.
An input signal that is the sum of two sine waves is used to remove the problems
of zeroed-out intermediate results generated by the constant and single sine-wave signals.
However, since these signals are more complicated to generate and to use to decipher errors,
they are best applied after the constant and single sine-wave signals have eliminated most
errors. The right-hand column in Table 16-1 shows a pair of these signals for the 4-point
FFT. Each entry is just the sum of the entries for the constant and single sine-wave signals.
The linearity properties of the FFT ensure that this occurs all the way through the algorithm.
In general, the best characteristics for these two sine waves are that they are centered at FFT
output frequencies and that the frequencies are at output filter numbers that are relatively
prime to each other and to the length of the FFT. The example in Table 16-1 is an exception
to this approach. This is because the 4-point FFT is too small to be able to choose a pair of
frequencies that meet the criteria.
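The sketch below generates these test signals for a 16-point FFT. The constant and single sine-wave amplitudes follow the table above; the choice of output filters 3 and 5 for the pair of sine waves is only an illustration of the relatively prime rule, and np.fft.fft again stands in for the FFT under test.

# A minimal sketch that generates the recommended test signals for an N-point FFT.
import numpy as np

def constant_signal(N, value=100 + 50j):
    return np.full(N, value, dtype=complex)

def sine_wave(N, bin_number, amplitude=100.0, phase=0.0):
    """Complex sine wave with exactly `bin_number` cycles in the N input samples."""
    n = np.arange(N)
    return amplitude * np.exp(1j * (2 * np.pi * bin_number * n / N + phase))

N = 16
tests = {
    "constant": constant_signal(N),
    "single sine (bin 1)": sine_wave(N, 1),
    "pair of sines (bins 3 and 5)": sine_wave(N, 3) + sine_wave(N, 5),  # 3, 5, 16 mutually prime
    "sine + constant": sine_wave(N, 1) + constant_signal(N),
}
for name, a in tests.items():
    A = np.fft.fft(a)                           # stand-in for the FFT under test
    print(f"{name:30s} nonzero output filters: {np.flatnonzero(~np.isclose(A, 0))}")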
16.6 TEST SIGNAL ERROR PATTERNS
The simplest way to illustrate the types of patterns that errors produce is with an example.
Most algorithm errors produce errors with specific patterns, regardless of the input signal.
However, the test signals are specifically designed to produce specific error patterns that can
be easily traced to the source of the error in the algorithm. Figure 16-5 shows the 4-point
FFT from Figure 16-1 with an arithmetic error in adding a(0) to a(2). Bold flow graph lines
are the paths taken by the error as a result of the Algorithm Steps on page 402. The error
is that they are subtracted rather than added. Table 16-3 shows the responses generated by
each of the corresponding signals in Table 16-1 as it goes through those Algorithm Steps.
Comparing Tables 16-3 and 16-2 allows the error patterns to be easily identified for each
test signal.
[Figure 16-5. The 4-point FFT flow graph of Figure 16-1 with the addition of a(0) and a(2) changed to a subtraction; bold lines mark the paths the error follows to the outputs A(0) and A(2).]
[Table 16-3. Response to the Test Signals with an Error in the 4-Point FFT.]
results. In fact, the only version of the unit pulse that would catch this error is one with
a(2) ≠ 0. This is an illustration of the drawback of using the unit pulse test signal first.
Namely, all of the possible versions of the unit pulse must be used to detect the error. For
a 4-point FFT this is not a significant problem. However, for a 1024-point FFT it is. The
best use of the unit pulse test signal is after the constant, single sine wave, and pair of sine
waves tests have been used. If these tests do not pinpoint the error, but only localize it, then
the appropriate unit pulse test signal can be used to positively identify the error.
16.6.2 Constants
Constant input signals exercise a significant portion of the algorithm arithmetic with-
out the need for the input data organization to work properly. With the error shown in
Figure 16-5 and the test signal responses in Table 16-5, the constant signal finds the error.
The only output frequency components affected by the error (different in Tables 16-4 and
16-5) are the A(0) and A(2) terms. A reasonable assumption is that all of the computations
associated with A(1) and A(3) are correct. For the flow graph in Figure 16-5, this means
that the error must be associated with the top addition of one of the two input add-subtracts
(a(0) ± a(2) or a(1) ± a(3)).
To determine which of the two input add computations (a(0) + a(2) or a(1) + a(3))
is incorrect, start with Table 16-5, which shows that the real parts of A(0) and A(2) are
reduced by 200 and the imaginary parts by 100. This implies that the error occurred in such
a way that it affected A(0) and A(2) in the same way. Again for the flow graph in Figure
16-5, the top input add (a(0) + a(2)) is added to A(0) and A(2), and the bottom input add
(a(1) + a(3)) is added to A(0) but subtracted from A(2). Therefore, it must be the top
input add. In Table 16-1 this is the computation that forms the complex intermediate values
bR(0) and bI(0).
Notice that the real part of b(1) (bR(1)) and the imaginary part of b(3) (bI(3)) are
nonzero. In this example, the phase of the sine wave is set to zero. If the sine wave had
nonzero phase, the real and imaginary parts of b( 1) and b(3) would be nonzero. This
eliminates the possibilities of error that cannot be tested by the constant signal.
From the discussion of the constant and single sine-wave input signals and the data
values in Tables 16-3 and 16-4, it is clear that b(l) and b(3) are always zero for the constant
signal, regardless of the phase. Similarly, b(O) and b(2) are always zero for the single sine
wave. Therefore, each test signal has its own class of errors it can detect. If the signals are
combined, the resulting test input can be made to have nonzero outputs for all of the b(i).
The pair of sine waves recommended to catch the errors that the others miss is a pair centered in
output filters whose numbers are relatively prime to each other and to the
FFT length. This set of conditions removes these "always zero" conditions and picks up
remaining algorithm errors. However, this signal should be used after the constant and single
sine-wave tests because the patterns are more complex and the error combinations more
vast than for the simpler signals. Use the simpler signals to remove most of the potential
errors and then rely on this more complex waveform to ferret out the remaining problems.
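The "always zero" behavior is easy to confirm with the 4-point example. The sketch below (using the same illustrative b(0) through b(3) names as before) shows that the constant signal zeroes the subtract outputs, the single sine wave zeroes the add outputs, and their sum leaves every intermediate value nonzero.

# A minimal sketch of the "always zero" intermediate values, using the 4-point
# example and the test signal amplitudes of Table 16-1.
import numpy as np

def intermediates(a):
    """Input add/subtract results b(0)..b(3) of the 4-point building block."""
    return np.array([a[0] + a[2], a[0] - a[2], a[1] + a[3], a[1] - a[3]])

n = np.arange(4)
constant = np.full(4, 100 + 50j)
sine = 100 * np.exp(2j * np.pi * n / 4)

for name, a in [("constant", constant), ("sine wave 1", sine), ("sum", constant + sine)]:
    b = intermediates(a)
    print(f"{name:12s} zero intermediates: {np.flatnonzero(np.isclose(b, 0))}")
# -> constant: b(1), b(3); sine wave 1: b(0), b(2); sum: none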
16.7.1 Assumptions
The 16-point radix-4 FFT, shown in flow graph form in Figure 16-6 and completely
described in Section 9.7.5, is used to illustrate the error isolation approaches explained in
this chapter. A single programmable DSP chip, with external data and program memory, is
used as the implementation architecture because it represents the most common DSP board
configuration and the majority of product applications. Further, the 4-point building-block
code (blocks 1 through 4 on the left and right of Figure 16-6) will be written once and
used each of the eight times it is required by the relabeling techniques in Section 9.4 to
memory-map the data for each building block to different portions of data memory.
In multiprocessor applications it is prudent to test the FFT algorithms at the single-
processor level first to simplify the overall testing process. Additional assumptions are
that the error is found after the algorithms have been developed, in this case using the ones in
Chapters 8 and 9, and after the 4-point building-block coding is checked.
The bold line between the multiplier error and the third output 4-point building block shows
that the outputs of that building block are the only ones affected by the error. Therefore,
any test signal that has an incorrect output will only be incorrect in the A(2), A(6), A(10),
and A(14) terms. An error in one of these terms is the initial indication of an error in the
algorithm. The four bold lines on the input of the third output 4-point building block show
which intermediate results can possibly be in error. The goal of the test signal sequence is
to isolate the error to the correct place in the algorithm. The error introduced is a sign error
in the multiplier used to modify the third output of the second input 4-point building block
between the building-block stages.
[Figure 16-6. Flow graph of the 16-point radix-4 FFT used in this example, with the injected sign error in the multiplier on the third output of the second input 4-point building block; the error path into the third output 4-point building block and its outputs A(2), A(6), A(10), and A(14) are drawn in bold.]
The test signal strategy is to find the error using the least number of signals. Therefore,
the constant signal is applied first, followed by the single sine wave. If needed, the pair of
relatively prime sine waves is used, and the 16 unit pulses are a last resort. Even if the unit
pulses are needed, hopefully the error will have been isolated far enough so that only a few
of the 16 choices are required.
Since the 4-point building block is known to be correct, the error must be in the
multiplier constants between the building-block stages or in the reorganization of data at
the algorithm input or between the algorithm stages. Therefore, the results of applying the
test signals are used to isolate the error to one of those three portions of the algorithm.
Applying the Constant Test Signal. The constant test signal does not find this error,
because the third output of each input 4-point building block is zero for a constant input,
so the faulty multiplier is fed only zeros. While the constant signal does not
locate the error, it does eliminate certain portions of the algorithm. Namely, since all of the
top outputs of the input 4-point FFTs are nonzero for the constant test signal and they are
all inputs to the top output 4-point FFT, the four associated multipliers are correct and that
portion of the data reorganization between stages is correct.
Applying the Single Sine Wave at Frequency 1 Test Signal. The single sine
wave at frequency 1 also does not provide useful information for isolating the error, because
of how the input data is reorganized before entering the input FFT building blocks. From
Figures 16-5 and 16-6 the input data points combined by the add-subtract computations are
eight samples apart. For a sine wave that has only one cycle during the 16 samples, the
samples that are eight apart are the negatives of each other, independent of the phase of the
sine wave. Therefore, the add outputs of the 4-point FFT input computations are always zero.
Since it is these two add outputs that are used to form the zero and second outputs of the
4-point FFT, the signal that feeds the incorrect multiplier value is always zero. Therefore,
that multiplier value can be anything, and the 16-point FFT outputs are unaffected.
While this also does not locate the error, it also eliminates other portions of the
algorithm. Specifically, all of the first and third outputs of the input 4-point FFTs are
nonzero, and they feed the second and fourth output 4-point FFTs. Since these output
4-point building blocks have the correct results, it is likely that they are getting the correct
inputs. Therefore, the respective multiplier constants on the input of those output 4-point
building blocks should be correct and the data reorganizations at the algorithm input and
between the stages must be correct. This leaves the third output from the input 4-point FFTs
or their corresponding multipliers and mappings into the third output 4-point FFT.
Applying the Pair of Relatively Prime Frequency Sine Waves. The choice
of frequency pairs has been aided considerably by the two previous test signals. Namely,
the conclusion to this point is that the error is somewhere in the path between the second
input FFT outputs and the outputs of the third output 4-point FFT. Since that 4-point FFT
produces output frequencies A(2), A(6), A(10), and A(14), the pair of frequencies chosen
must come from that set of four if it is to isolate the error. These sine waves have the feature
that the samples that are eight apart are the same. Therefore, the second output from each
of the 4-point input FFTs will be nonzero, regardless of the phase of those sine waves. As
a result, all of the inputs to the third output 4-point FFT will be nonzero.
To see how this test signal, with any combination of the pairs of frequencies mentioned,
can isolate the error, use Figure 16-5. If the top signal to that 4-point FFT (a(0)) is incorrect,
all of its outputs are modified by the same amount. If the next input signal (a(2)) is in error,
the error is added to its zero and second A(0) and A(2) outputs and subtracted from its
other A(1) and A(3) outputs. If the third input signal (a(1)) is incorrect, the error is added
to the first and subtracted from the second A(0) and A(2) outputs and -j times the error
is added to the second and subtracted from the third A(1) and A(3) outputs. Finally, if the
fourth input (a(3)) is incorrect, its error is added to the first and subtracted from the second
A(0) and A(2) outputs and -j times the error is subtracted from the first and added to the
third A(1) and A(3) outputs.
Therefore, the strategy is to apply the pair of sine-wave signals and compare the
outputs of the third 4-point output FFT with the correct ones. The errors must follow one
of the four patterns described in the last paragraph. Once the error pattern is identified, it
immediately points to which multiplier output is wrong. In this case, the second input to
the third output 4-point FFT has the wrong multiplier. Thus A(2) and A(10) will have the
same error, and A(6) and A(14) will have the negative of that error.
Applying the Unit Pulse Test Signals. In this example, the unit pulse signals are
not needed because the other three test signals were sufficient to isolate the error. If this
were not the case, then the results of the previous three test signals would have narrowed
the error to one of a few places. The unit pulse is then used to test for those few remaining
error locations sequentially until one of them has the wrong answer. However, a unit pulse
signal at a(2) or a(6) can be used to verify the results found by using the other sequence of
inputs.
16.8 CONCLUSIONS
This chapter details an orderly, efficient way to detect and isolate errors in FFTs, from
algorithm development through product operation. Carefully chosen test signals and the
sequence in which they are applied save time in error detection and isolation. Taking the
time to draw a flow graph is one of the best investments for saving time when isolating
errors. Examples have been used to illustrate these techniques, which are the final step in
the design process of an FFT-based product.
The final chapter integrates the concepts, facts, and tools of this and all the preceding
chapters, using four design examples.
17
Design Examples
17.0 INTRODUCTION
How to make the FFT decisions in a design is not easily explained in general because each
application has its own specific requirements. Therefore, four real-time design examples
are developed in this chapter to illustrate the concepts, elements, and tools given throughout
the book. These were chosen to cover:
The key board specifications from Section 15.0 are given for each example, but an actual
board will not be picked or designed because the information needed to illustrate that
selection process is beyond the scope of this book. The design decisions from Section
1.2 appear at the end of each example, with the choices for that example and text that
summarizes those decisions. The sequence in which these decisions get made varies from
example to example.
Issues such as heat dissipation, temperature range, and vibration levels are not covered
in the book or in these examples. While these are important product design decisions, they
are normally related to the specific environment where the product will operate and do not
affect the choice of FFT length, algorithm, or architecture. Issues such as package type (ceramic
versus plastic and pin-grid array versus surface mount) are also not covered because these
options are available from most chip and board vendors and are unlikely to affect FFT-related
decisions.
17.1 EXAMPLE 1: DOPPLER RADAR PROCESSOR
Processing in early Doppler radars was performed with an array of analog bandpass fil-
ters. The capacitors, resistors, and inductors used to create these filters were sensitive to
temperature changes and aging, making the filters' center frequencies and bandwidths hard
to control. The advent of digital integrated circuits in the early 1970s stimulated a rapid
transition of Doppler radar processing from analog filtering to digital filtering, using FFT
algorithms (Section 2.2) [1]. Initially, FFT-based Doppler processors could only be afforded
for military applications. However, the proliferation of the DSP chips listed in Chapter 14
reduced implementation costs to the point where FFT processing is now common in both
military and commercial Doppler radars.
The Doppler processing portion of a ground-based air surveillance radar, which might
be used for commercial airport air traffic control or for Doppler weather radar, is de-
signed in this example. In this class of radar applications, Doppler processing is used for
three reasons. First, aircraft targets and storms are moving relative to the ground, which
means their return frequency is different than the ground's. Therefore, Doppler processing
can be used to separate those returns from ground returns. Second, Doppler processing
determines how fast each target aircraft is moving toward the radar. This, in conjunc-
tion with angle and range measurements, can be used by the radar to track aircraft and
storms.
Finally, Doppler processing is also used to improve the signal-to-noise (S/N) per-
formance of the radar. Since radar system noise is random in time, its value in any target's
range interval is reduced by the number of range intervals, M, within the interpulse period
(time between radio frequency (RF) pulse transmissions). Further, within a particular range
interval, the radar system's noise is also random in frequency. Since the return energy from
a target is concentrated at a particular frequency, S/N is improved by a factor of N when
the Doppler processor divides the frequency range into N smaller passbands. The result is
an overall S/N improvement of a factor of M * N by performing Doppler processing at
each range interval of interest.
17.1.2 Specification
Table 17-1 shows the fundamental system parameters and the values they have for this
example. Range resolution is set by the width of the transmitted pulse. Because RF energy travels
at the speed of light (300,000,000 m/s), the round trip to the target and back corresponds to
150 m/µs (492 ft/µs) of range. This means that 50-ft resolution translates into roughly 0.1-µs pulses.
Azimuth resolution is defined as the 3-dB azimuth beamwidth of the radar antenna, and
radial speed resolution is defined as the spacing between Doppler filters. The conversion
between radial speed (v) and Doppler frequency (f) is

f = 2 * v / λ

where λ is the wavelength of the transmitted RF energy. For an X-band radar, λ ≈ 0.1
ft. Therefore, a 2-ft/s speed resolution requirement converts to a 40-Hz spacing between
Doppler filters (Δf = 2 * 2 ft/s / (0.1 ft) = 40 Hz).
Range resolution 50 ft
Antenna scan rate 6 RPM
Maximum detection range 80 nautical miles
Azimuth resolution 1°
Radial speed resolution 2 ft/s
Product volume 100 systems
Time to market 1 year
Normally these types of radars are designed so that the return from the longest-range
target reaches the receiver before the next pulse is transmitted. For an 80-nautical-mile
maximum range the RF energy must travel 160 nautical miles, which is roughly 296,000
m. Since RF energy travels at 300,000,000 m/s, it takes the RF energy 0.987 ms to make
the maximum round-trip excursion. Therefore, a pulse repetition interval of 1 ms (1000
transmissions per second) satisfies the maximum-range requirements. If the entire time
between transmitted pulses is divided into 0.1-µs pulse widths, 10,000 pulse widths are
required.
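The arithmetic behind these specification numbers is summarized in the short sketch below, using the rounded constants quoted in the text (300,000,000 m/s for the speed of light and a 0.1-ft X-band wavelength).

# A minimal sketch of the radar timeline arithmetic in this specification.
C = 300.0e6                   # speed of light, m/s (book's rounded value)
NMI = 1852.0                  # meters per nautical mile
wavelength_ft = 0.1           # X-band wavelength, ft

doppler_spacing_hz = 2 * 2.0 / wavelength_ft       # f = 2 * v / lambda for 2 ft/s -> 40 Hz

round_trip_m = 2 * 80 * NMI                        # 80 nautical miles out and back
round_trip_s = round_trip_m / C                    # just under 1 ms
pri_s = 1.0e-3                                     # chosen pulse repetition interval
pulse_width_s = 0.1e-6                             # 50-ft range resolution -> roughly 0.1-us pulses
range_gates = int(round(pri_s / pulse_width_s))    # 10,000 per interpulse period

print(f"Doppler filter spacing: {doppler_spacing_hz:.0f} Hz")
print(f"maximum round-trip delay: {round_trip_s * 1e3:.3f} ms")
print(f"range gates per PRI: {range_gates}")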
17.1.3 Description
Doppler radars periodically transmit pulses of RF energy and collect the radar returns
and "noise" as a function of time. Given that RF energy travels at the speed of light, the
time delay between pulse transmission and the reception of energy that has bounced off the
target is directly related to the target's distance from the radar antenna [1].
Because a target's radial speed (motion away from or toward the radar) causes a
change in the frequency of the transmitted pulse (the Doppler effect), frequency analysis
of the return samples is used to aid in detecting targets and determining their radial speed.
The FFT is the most widely used algorithm for determining this frequency shift.
Radar antenna scan rates and beam widths determine how many times the transmitted
radar energy hits the target each time the antenna beam scans by it. The available number
of return samples is rarely a power of two. However, Doppler radar processor transform
lengths (number of samples at a particular range) are usually powers of two because of avail-
ability of power-of-two FFT algorithms. In these radars, the zero-padding technique dis-
cussed in Section 2.3.10 is used to obtain enough data points for a power-of-two algorithm.
The alternative approach is to use one of the non-power-of-two algorithms in Chapters 8
and 9. This alternative may produce a more accurate analysis of the Doppler shift and
use fewer computations and data memory. However, the high-speed FFT-specific chips in
Section 14.7 only perform power-of-two algorithms. This means that non-power-of-two
algorithms require either the Bluestein algorithm (Section 9.5.1) or the programmable DSP
chips from Sections 14.3 and 14.5. Both reduce the throughput possible.
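A minimal sketch of the zero-padding option is shown below. The 28 returns per dwell is an assumed value, consistent with the scan rate and beamwidth of this example; the dwell is padded to the next power of two so a 32-point FFT (or an FFT-specific chip) can be used.

# A minimal sketch of zero padding a dwell to a power-of-two FFT length.
import numpy as np

returns_per_dwell = 28                       # assumed number of pulses hitting the target
fft_length = 1 << (returns_per_dwell - 1).bit_length()   # next power of two: 32

rng = np.random.default_rng(5)
returns = rng.normal(size=returns_per_dwell) + 1j * rng.normal(size=returns_per_dwell)
padded = np.zeros(fft_length, dtype=complex)
padded[:returns_per_dwell] = returns         # zero-pad the dwell to the FFT length

doppler_filters = np.fft.fft(padded)         # 32 Doppler filters instead of 28
print(len(doppler_filters), "Doppler filters from", returns_per_dwell, "returns")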
Length    Factors
  25      5, 5
  26      2, 13
  27      3, 3, 3
  28      2, 2, 7
  29      29
  30      2, 3, 5
  31      31
  32      2, 4, 8, 16
Since the 27-point FFT can be computed by using either three stages of 3-point
building blocks or a 3-point and a 9-point building block, the factors in Table 17-2 include
all of the building blocks in Chapter 8. Additionally, the 29- and 31-point FFTs can be
computed by using any of the three general algorithms for all odd numbers. The Winograd
(26-, 28-, and 30-point FFTs), prime factor (26-, 28-, and 30-point FFTs), and mixed-radix
(25-, 26-, 27-, 28-, 30-, and 32-point FFTs) algorithms from Chapter 9 can be used to
implement the listed transform length choices.
From the Comparison Matrices in Chapter 9 (Tables 9-7 and 9-8), the most likely non-
power-of-two FFT is one of the 28- or 30-point prime factor algorithms (Kolba-Parks or
SWIFT) using the Winograd building-block algorithms from Chapter 8 because they require
the fewest adds and multiplies. The algorithms can be compared by using the Comparison
Matrices from Chapters 8 (Table 8-1) and 9 (Tables 9-7 and 9-8). However, the 32-point
FFT must also be considered because this is a high-computation-rate application which may
result in the use of an FFT-specific chip from Chapter 14.
From the Comparison Matrix in Table 9-8, the 16-point radix-4 FFT algorithm takes
144 adds and 24 multiplies. The mixed-radix algorithm in Chapter 9 can be used to combine
the 16-point FFT with a 2-point building block to form the 32-point FFT. This requires:
function coefficients. Assuming all these are stored, the number of memory locations for
the weighting function coefficients is equal to the FFT length.
Table 17-3 summarizes the performance measures for each of the three most likely
FFT algorithms. If the choice of processor is limited to the programmable processors in
Chapter 14, Table 17-3 can be used to choose the 28-point prime factor algorithm be-
cause of the smaller numbers in columns 2, 3, and 4. However, the 32-point FFT can
also be implemented with the FFT-specific chips in Chapter 14. Therefore, the FFT al-
gorithm decision must be postponed until the chip and architecture choices are exam-
ined.
                             # of      # of           # of data    # of const.
Algorithm                    adds      multiplies     locations    locations
32-point mixed-radix         382       172            64           68
28-point prime factor        400       120            56           36
30-point prime factor        384       160            60           65
Fixed-Point
DSP56001              1.797    24    11
DSP56002              0.908    24     6
DSP56L002             1.497    24     9
DSP56004              1.497    24     9
µPD77220              8.5      24    48
µPD77P220             8.5      24    48
SPROC1400             2.4      24    14
SPROC1200             4.8      24    28
SPROC1210             4.8      24    28
ZR38000               0.88     20     5
Floating-Point
ADSP-21020            0.58     32     4
ADSP-21060            0.46     32     3
DSP32C                3.2      32    18
DSP3210               2.4      32    14
DSP3207               1.9      32    11
i860XR                0.74     32     5
i860XP                0.55     32     4
DSP96002              1.04     32     6
µPD77240              7.07     32    40
µPD77230A             11.78    32    66
TMS320C30             1.97     32    11
TMS320C31             1.97     32    11
TMS320C40             1.54     32     6
Block-Floating-Point
a66110/a66210         0.131    16     1
a66111/a66211         0.131    16     1
LH9124/LH9320         0.087    24     1
LH9124L/LH9320        0.129    24     1
TMC2310               0.514    16     3
PDSP16510/16540       0.096    16     Cannot do
processor. The 32-point FFT takes three passes because it needs two radix-4 passes and one
radix-2 pass.
Based on these observations, two processor architectures are shown in Figures 17-1
and 17-2. To ensure there is plenty of processing power for the non-signal-processing
portions of the radar functions, and to account for inefficiencies encountered with combining
algorithms into an application, four floating-point DSP chips are used for the other radar
processing in both processor architectures.
[Figures 17-1 and 17-2. Block diagrams of the two candidate Doppler processor architectures, showing the input data path, an FFT processor with working RAM, coefficient RAM, and control, banks of floating-point DSP chips with local RAM interconnected by crossbar switches for the remaining Doppler processing, and the output to the display.]
To select a board, the FFT length and radar processor architecture decisions still need
to be made. In Table 17-3 the 28- and 30-point FFT algorithms require fewer computations
and less memory than does the 32-point FFT algorithm. In Table 17-5 both processor
architectures are capable of meeting the processing requirements by using any of the three
FFT lengths. However, 32-point FFT code exists in algorithm libraries for the Analog
Devices ADSP-21060 chip. Therefore, since memory storage requirements for the three
different FFT lengths all need more than 512-kbyte and less than 1-mbyte memory chips,
the 32-point FFT is also the best choice for architecture 2.
Now a direct comparison can be made between the two architecture options. The only
discernible difference is that the FFT-specific architecture already has the 32-point algorithm
and the associated memory management built into the operation of the Sharp chip set. Be-
cause of the benefit of reduced development time and effort for architecture 1, it is the better
choice (time-to-market requirement from Table 17-1). Table 17-6 summarizes the specifi-
cations needed to choose a COTS board that will be used twice for this multiboard design.
functions in coherent gain, equivalent noise bandwidth, and 3-dB bandwidth. Table 17-8
summarizes all of the key element design decisions made for this example.
Number of dimensions                            1
Type of processing                              Frequency analysis
Arithmetic format                               Block-floating-point and 32-bit floating-point
Weighting function                              Dolph-Chebyshev
Transform length                                32-point
Algorithm building blocks                       2- and 16-point
Algorithm                                       Powers-of-primes mixed-radix
DSP chip                                        Sharp FFT-specific and Analog Devices 21060
Architecture                                    Pipeline and crossbar
Mapping the algorithm onto the architecture     Maximum throughput
17.2.2 Specification
Table 17-9 summarizes the specification of the product. Throughput is defined as the
rate at which data sets can be fed to the product without the product getting behind. Latency
is the time from when a data set enters the product until the analyzed version is sent back
to the hard disk. The assumption is that the computational board is not used to display the
results, just to compute them. The results are returned to hard disk, and a standard software
package is used to display the results.
17.2.3 Description
The modified periodogram method [2] of spectral estimation is based on dividing the
sampled signal into subsequences of a manageable length, computing the power spectrum
of those subsequences, and combining the result to estimate the power spectrum of the
complete signal sequence. This strategy allows the sequence length to be controlled to
fit within the memory capabilities of a computer and does not require the entire set of
computations to be redone every time new samples are added to the signal. The power
spectrum estimator uses the FFT in the center of its computations. Therefore, the example
must include the other portions of the algorithm to obtain a realistic design. Since the
modified periodogram method algorithm is not discussed in this book, it is summarized
below. The details can be obtained from other sources [2].
The power spectrum of a data sequence of L samples, a(m) for m = 0, ..., L - 1,
with the modified periodogram method, is computed from the following steps.
Step 1: Sectioning the Input Data Sequence
Section the input data sequence into P overlapping subsequences of length N such
that the combined subsequences span the entire data sequence. Figure 17-3 illustrates this
process with an overlap of M samples and P = 5.
Step 2: Apply the Weighting Function and Compute the FFT of Each Section
For each segment of length N, select the same weighting function, WF(n), multiply
it by the segment data samples, and compute the N-point FFT of the result.
[Figure 17-3. Sectioning of the L-sample input sequence into P overlapping subsequences, each N samples long, with M samples of overlap between adjacent subsequences.]
where U = Σ_{n=0 to N-1} [WF(n)]² is computed ahead of time. For each set of N FFT coefficients,
N complex multiplies are required. Since there are P of these sets, this step requires N * P
complex multiplies. Since each complex multiply uses four real multiplies and two real
adds, this is a total of 4 * N * P real multiplies and 2 * N * P real adds. For the 2:1 overlap
case described in Step 2, P = 2 * L / N. Therefore, the number of real multiplies required
is 8 * L, and the required number of real adds is 4 * L, independent of the FFT length.
Step 4: Compute the Power Spectral Density
Compute the power spectral density of the input data samples a(n) by computing the
average of the modified periodograms from Step 3:

PSD_P(k) = (1/P) * Σ_{p=1 to P} I_p(k),   k = 0, 1, ..., N - 1                    (17-4)
Step 5: Update the Power Spectral Density for Each New Section of Input Data
Samples
To modify the power spectral density in Step 4 when additional data is collected,
another periodogram is computed for the new data and then the average in Step 4 is recom-
puted. There is even a trick to simplify the computation of the new average: rather
than computing P - 1 adds and a divide for each of the N frequency components, compute

PSD_(P+1)(k) = [P * PSD_P(k) + I_P(k)] / (P + 1)                                  (17-5)

which requires only one multiply, one add, and one divide for each of the N frequency
components, k = 0, 1, ..., N - 1.
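A minimal sketch of Steps 1 through 5 is shown below. The triangular weighting and the 2:1 (50 percent) overlap follow this example's choices; the normalization constant U is assumed to be the sum of the squared weighting-function samples, which may differ from the book's exact scaling.

# A minimal sketch of the modified periodogram (Welch) estimator in Steps 1-5.
import numpy as np

def modified_periodogram_psd(a, N, overlap=0.5, wf=None):
    """Average the weighted, overlapped N-point periodograms of the sequence a."""
    if wf is None:
        wf = np.bartlett(N)                      # triangular weighting function
    U = np.sum(wf ** 2)                          # assumed normalization constant
    step = int(N * (1.0 - overlap))
    psd = np.zeros(N)
    count = 0
    for start in range(0, len(a) - N + 1, step):
        segment = a[start:start + N] * wf        # Step 2: weight the segment
        I_p = np.abs(np.fft.fft(segment)) ** 2 / U   # Step 3: modified periodogram
        psd += I_p                               # Step 4: accumulate for the average
        count += 1
    return psd / count

def update_psd(psd_P, I_new, P):
    """Step 5 (Equation 17-5): fold one new periodogram into the running average."""
    return (P * psd_P + I_new) / (P + 1)

rng = np.random.default_rng(1)
t = np.arange(8192)
a = np.sin(2 * np.pi * 0.05 * t) + 0.5 * rng.normal(size=t.size)
psd = modified_periodogram_psd(a, N=256)
print("peak near bin", np.argmax(psd[:128]))     # expect about 0.05 * 256 ≈ 13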
[Figure 17-4. Block diagram of the Bluestein approach to an N-point complex FFT: the input samples a(i) pass through complex multipliers, an FFT-based convolution stage, and a second set of complex multipliers to produce the outputs A(i).]
The block diagram in Figure 17-4 is for performing an N-point complex FFT. Since
the data sets for this product are real, the Double-Length Algorithm from Section 2.4.2
can be used to more efficiently implement the complex algorithm. Therefore, the estimates
made on FFT performance are based on complex data lengths that are half of the real data
lengths.
To simplify the Bluestein algorithm development process, power-of-two algorithms
will be used for the V/2-point FFTs. These algorithms are available for all of the candidate
DSP chips.
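The sketch below is a minimal illustration of the Bluestein idea in Figure 17-4, written in its usual chirp form: input complex multipliers, a power-of-two-length fast convolution, and output complex multipliers. The details of Section 9.5.1, including how V is chosen, may differ from this generic version.

# A minimal sketch of the Bluestein (chirp) algorithm using power-of-two FFTs.
import numpy as np

def bluestein_fft(a):
    """N-point DFT of a complex sequence a via power-of-two FFTs."""
    a = np.asarray(a, dtype=complex)
    N = len(a)
    n = np.arange(N)
    chirp = np.exp(-1j * np.pi * n * n / N)          # e^(-j*pi*n^2/N)
    V = 1 << int(np.ceil(np.log2(2 * N - 1)))        # power-of-two convolution length
    x = np.zeros(V, dtype=complex)
    x[:N] = a * chirp                                # input complex multipliers
    h = np.zeros(V, dtype=complex)
    h[:N] = np.conj(chirp)                           # chirp filter, h(n) = e^(+j*pi*n^2/N)
    h[V - N + 1:] = np.conj(chirp[1:])[::-1]         # wrap the negative-time taps
    y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h))   # circular convolution via FFTs
    return chirp * y[:N]                             # output complex multipliers

a = np.random.default_rng(2).normal(size=12) + 0j    # a non-power-of-two length
assert np.allclose(bluestein_fft(a), np.fft.fft(a))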
Architecture and Chips. The worst-case processing load is when the required
FFT is largest because the FFT computation load increases as N * log2(N). The largest
prime number less than 4096 is 4093, making 4093 the largest value of N. Based on
V being a power of two and the input data being real, V only has to be 4096 points,
which means the largest complex FFT to compute is 2048 points. Since the system re-
quires four of these, it requires a total of sixteen 2048-point FFTs, as well as 4 * (4 * V +
10 * N) adds and 4 * (8 * V + 16 * N) multiplies, based on the Comparison Matrix in
Table 9-7.
Table 17-10 is a list of the floating-point FFT chips from Chapter 14. For the chips
that have less than 2048 locations of on-chip data RAM, the 1024-point FFT performance
number already reflects going off-chip for data. Therefore, the performance numbers for
these chips can be extrapolated to estimate performance for 2048-point FFTs by multiplying
by a factor of 2 * 11/10 = 2.2 (Section 14.1.1). It is easy to see that, even for the slowest
1024-point FFT time, all of the chips can execute the required computations in less than a
second.
Based on the preliminary options available for chips in Table 17-7, the product
should work as a single DSP chip solution with off-chip program and data memory (Figure
17-5). The data and program memory interfaces are shown for the same DSP chip pins,
because the added speed of having separate buses is not required. Therefore, the com-
bined bus approach can be used to choose a DSP chip with fewer pins. This will re-
duce the cost of the product. If all the devices with over 144 pins are eliminated, the list
shrinks to the DSP32xx family, the µPD77240 and TMS320C3x families with 132-pin
packages, and the µPD77230A with a 68-pin package, which are summarized in Table
17-11. The package pin counts were obtained from the respective chip family references in
Chapter 14.
[Figure 17-5. Single-chip power spectrum estimator: a floating-point DSP chip with off-chip data RAM and EPROM program memory on a shared address/data bus, connected to the PC bus through a PC bus interface.]
Category                                  Specification
Processor                                 TMS320C30
Off-chip memory                           8192 32-bit words
Analog I/O ports                          None required
Instruction cycle time                    60 ns
Parallel and serial I/O ports (buses)     PC bus
Host interface                            PC compatible
be any pair of relatively prime numbers up to the length of the transform (2048 points) and
were arbitrarily selected.
Signal Parameters
Number of dimensions                            1
Type of processing                              Frequency analysis
Arithmetic format                               32-bit floating-point
Weighting function                              Triangular
Transform length                                Any up to 2048
Algorithm building blocks                       2-, 4-, 8-, and 16-point
Algorithm                                       Bluestein convolutional
DSP chip                                        µPD77230A or TMS320C30
Architecture                                    One Harvard processor & external memory
Mapping the algorithm onto the architecture     Maximum throughput
1. Speech analysis for products that use speech recognition or speaker recognition
2. Speech synthesis for products that talk to the user from either stored or real-time
input
3. Speech analysis followed by speech synthesis for products that compress speech
to reduce storage space and/or communication bandwidth
The product is defined as the number recognition portion of a system for hands-off
numerical data entry, voice car phone dialing, speaker verification for security, or fraud
applications. FFT-based algorithms are not the only way to perform these tasks, but they
may be more cost efficient for high-volume, low-cost products.
17.3.2 Specification
Table 17-15 shows the system requirements. The bottom four requirements are quali-
tative rather than quantitative because their quantitative values will change with the evolution
of technology. The point is that, for a high-volume portable product, the lower the cost,
weight, volume, and power, the more likely it is to sell.
17.3.3 Description
Speech scientists have determined that the human speech generation system (lungs,
vocal cords, trachea, mouth, and nose) can be modeled by the block diagram in Figure
17-6. Voiced sounds, such as vowels, can be modeled as the response of a time-varying
linear filter to a periodic impulse train. The period of the impulse train (pitch
period) is determined by the dimensions of the vocal cords and trachea. Unvoiced signals,
such as consonants, can be modeled as the response of the time-varying linear filter to a
random number generator. The loudness (amplitude) of the resulting sound is modeled
by the multiplier in front of the time-varying linear filter. The time-varying linear filter
represents the way the human vocal tract and mouth modify the sources of the sound. The
linear filter coefficients change slowly over time to produce different voiced and unvoiced
sounds from the same signal generators. This suggests it should be possible to describe
the speech samples by knowing the pitch period and the time-varying linear filter coeffi-
cients.
Figure 17-7 is a block diagram of the algorithm to be used in this example [3]. The
reason it works is that the impulse train generator waveform has a periodic structure in the
[Figure 17-6. Speech generation model: an impulse train generator (set by the pitch period) and a random number generator drive, through an amplitude multiplier, a time-varying linear filter defined by its coefficients.]
frequency domain that repeats at roughly the pitch frequency of 50 to 100 Hz. Over the 5-
kHz bandwidth of speech, this results in 50 to 100 peaks. Figure 17-8 shows what that pitch
spectrum might look like. On the other hand, the frequency response of the time-varying
linear filter varies smoothly and decreases with increasing frequency. The filter's response
does have peaks in it, generally at three or four frequencies. These peaks are called the
formants of the filter, and their locations can be used to characterize the filter's coefficients.
Thus, in the frequency domain, the pitch and the linear filter have significantly different
structures.
[Figure 17-7. Speech analysis algorithm: the speech samples pass through an FFT, a log function, an IFFT, a cepstrum window, and a second FFT; the pitch period detection block operates on the IFFT output, the filter coefficient detection block operates on the second FFT output, and both send their results to data storage.]
[Figure 17-8. Example pitch spectrum: log magnitude in dB versus frequency bin, showing the regularly spaced pitch peaks.]
If the composite waveform out of the log function in Figure 17-7 is linearly filtered
to remove the high-frequency components, the remaining signal is the slowly varying fre-
quency response of the time-varying linear filter. The three blocks following the log function
are the equivalent of the linear filtering in the frequency domain described in Chapter 6.
The only difference is the exchanged roles of the FFT and IFFT because the waveform
has started out in the frequency domain, not the time domain. Therefore, the output of the
second FFT is the slowly varying frequency response of the time-varying linear filter.
Similarly, since the input to the IFFT is the sum of two waveforms, its output is the
inverse transform of the sum of those two signals because the IFFT is a linear function.
The slowly varying portion of the IFFT output ends up close to zero. In fact, if the slowly
varying function did not fluctuate at all, all of it would be at the zero sample, because the
FFT of the unit pulse at zero time is the same for all frequency components. This fact is
computed from Equation 2-1. If the n = 0 sample is 1 and the rest of the samples are zeros
(unit pulse at sample zero), then Equation 17-6 (Equation 2-1) simplifies to Equation 17-7.
A(k) = Σ_{n=0 to N-1} a(n) * W_N^(k*n)   where   W_N = cos(2π/N) + j * sin(2π/N)        (17-6)

A(k) = a(0)                                                                              (17-7)
At the same time, the periodic nature of the pitch unit pulse train results in a peak in
the IFFT output at roughly the period of that pulse train. Therefore, the output of the IFFT
can be searched to find the pitch frequency by finding the first substantial peak away from
zero. This is the function of the pitch period detection block in Figure 17-7. Similarly, the
filter coefficient detection function in Figure 17-7 finds the peaks in the time-varying linear
filter's frequency response. These are directly related to the time-varying filter's coefficients
[3]. The time-varying filter coefficients and pitch are then combined and used to search
a database to determine the best match. The best match is the pattern for the number
that was verbalized. The number on the database that is the best match to the computed
parameters of the input data is then stored in the computer rather than as a sequence of
speech samples.
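A minimal sketch of the cepstrum-style pitch measurement in Figure 17-7 is shown below, applied to a synthetic voiced signal (an impulse train filtered by a simple decaying impulse response). The windowing, detection thresholds, and formant extraction used in the book are not reproduced; only the FFT, log, IFFT, and peak-search chain is illustrated, and all of the signal parameters are assumptions.

# A minimal sketch of cepstrum-style pitch period detection on a synthetic signal.
import numpy as np

fs = 8000                                    # sampling rate, Hz (see Section 17.3)
pitch_hz = 100
n = np.arange(2048)
excitation = (n % (fs // pitch_hz) == 0).astype(float)   # impulse train at the pitch period
h = np.exp(-n[:64] / 8.0)                    # a simple decaying "vocal tract" impulse response
speech = np.convolve(excitation, h)[:n.size]

spectrum = np.fft.fft(speech)
log_mag = np.log(np.abs(spectrum) + 1e-12)   # log function in Figure 17-7
cepstrum = np.fft.ifft(log_mag).real         # IFFT of the log spectrum

start = 20                                   # skip the low-time (vocal tract) region
period = start + np.argmax(np.abs(cepstrum[start:fs // 50]))
print("estimated pitch:", fs / period, "Hz") # expect roughly 100 Hz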
With the exception of the 512-point transform, the Comparison Matrix in Table
9-8 shows that the prime factor algorithms require the fewest computations and smallest
multiplier constant memory. The Comparison Matrix in Table 8-1 shows that the smaller
FFT building blocks are the more efficient. These two facts suggest limiting the FFT lengths
to 420 = 3 * 4 * 5 * 7, 504 = 7 * 8 * 9, and 512 = 8 * 8 * 8. Further decisions on the
FFT algorithm to choose are deferred to the architecture and chip paragraphs below because
other factors will affect the best choice.
Arithmetic Format. With only 8 bits needed at the input and peak detection being
the final parameter detection process, 16-bit fixed-point numbers are likely to be sufficient.
This means that the arithmetic format does not limit the chip choices because the floating-
and block-floating-point arithmetic formats just provide less quantization noise based on
the Comparison Matrix in Table 13-1.
Architecture and Chips. The desired architecture is a single chip with all the
necessary program and data memory on-chip. Since the input is voice samples, the data
must go through an A/D converter somewhere. Therefore, a plus in the design is to have
an A/D converter on-chip. Table 17-18 shows the FFT performance and on-chip memory
capacities of DSP chips with on-chip A/D converters (Sections 14.3.1 and 14.3.5).
According to the references in Chapter 14 for each of these three devices, the immediate
drawback is that their A/D converters work at 8 kHz, not the 10-kHz sampling rate
assumed earlier. In the interest of taking advantage of the integrated A/D converter to reduce the
overall cost of the product, it makes sense to reevaluate the need for sampling at 10 kHz. The
higher sampling rate is actually a luxury. The telephone system has a 4-kHz bandwidth, and
voice is easily discernible. Based on the sampling theorem (Section 2.3.1), 8 kHz should
be a sufficient rate. Keeping the 40-ms sampling period means that the number of 8-kHz
samples should be at least 320, rather than the 400 calculated for the 10-kHz sampling rate.
This means that the 336 = (3 * 7 * 16)-point and 360 = (5 * 8 * 9)-point prime factor algorithms,
using the building blocks in Chapter 8, should be added to Table 17-16.
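A small arithmetic check of these numbers follows (an illustrative sketch, not from the book). It confirms the 320-sample count and that the factors of each added length are mutually prime, as a prime factor algorithm requires.

    from math import gcd, prod
    from itertools import combinations

    samples = round(0.040 * 8000)                # 320 samples in a 40-ms frame at 8 kHz
    for length, factors in {336: (3, 7, 16), 360: (5, 8, 9)}.items():
        mutually_prime = all(gcd(a, b) == 1 for a, b in combinations(factors, 2))
        print(length, prod(factors) == length, length >= samples, mutually_prime)
    # 336 True True True
    # 360 True True True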
All of the functions in Figure 17-7 must be performed each time a new set of 40 ms
of data is collected. Since all of the chips in Table 17-18 perform 1024-point FFTs in less
than 3 ms, it is clear that they will have no problem completing three FFTs in the range of
336 to 512 points and all of the other computations in the allotted time of 40 ms. Therefore,
the processor architecture block diagram can be as shown in Figure 17-9.
Figure 17-9. Speech analyzer processor architecture block diagram: the microphone feeds the analog I/O, an EPROM program memory sits on the address and data buses, and a serial I/O port connects to the main computer.
Note that the output interface to the main computer is through the serial link to reduce
the number of wires, and therefore the system cost, and to improve its reliability. All of
the chips in Table 17-18 have on-chip boot ROM that allows an external, inexpensive EPROM
to load the program into on-chip program RAM at power-up. If the product becomes a big-enough
seller, the program can be put into on-chip program ROM and the external EPROM
can then be eliminated.
For the product to work in real time, it must collect a new data set while
processing the present one. In high-speed real-time applications it would also have to output
results from the previous computations while processing the present data set. However,
it appears there will be enough processing time that the answers can be output after the
computations and before the next set of data is available for computation. Therefore, there
must be at least enough RAM for two full sets of data. Additionally, the database, as well
as the pitch and formant data used to access the database, must be stored.
The key issue is the two sets of data for the FFT. Since the data is real, the Double-Length
Algorithm from Section 2.4.2 can be used to efficiently utilize the FFT algorithm.
This allows N real data samples to be processed by an N/2-point FFT. Therefore, the chosen
transform length will require storing from 2 * 336 = 672 to 2 * 512 = 1024 data words.
All of the DSP chips in Table 17-16 have sufficient data memory to meet this goal, but the
ADSP-21msp5xx series is marginal because of the need to store the database. Based on
this, the Motorola DSP56166 is selected because it has the largest data RAM.
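A minimal numpy sketch of the double-length idea follows; it is not the book's Section 2.4.2 listing. The even- and odd-indexed real samples are packed into one half-length complex sequence, one N/2-point FFT is taken, and the full spectrum is recovered from the FFT's symmetry properties.

    import numpy as np

    def fft_of_real_via_half_length(x):
        """FFT of a real sequence of even length N using one N/2-point complex FFT."""
        N = len(x)
        z = x[0::2] + 1j * x[1::2]               # pack even/odd samples into N/2 complex values
        Z = np.fft.fft(z)
        Zr = np.conj(np.roll(Z[::-1], 1))        # conj(Z[(N/2 - k) mod N/2])
        Xe = 0.5 * (Z + Zr)                      # spectrum of the even-indexed samples
        Xo = -0.5j * (Z - Zr)                    # spectrum of the odd-indexed samples
        k = np.arange(N // 2)
        Xlow = Xe + np.exp(-2j * np.pi * k / N) * Xo
        Xmid = np.array([Xe[0] - Xo[0]])         # bin N/2
        Xhigh = np.conj(Xlow[1:][::-1])          # upper half from conjugate symmetry
        return np.concatenate([Xlow, Xmid, Xhigh])

    x = np.random.randn(672)                     # 2 * 336 real samples, as in the text
    print(np.allclose(fft_of_real_via_half_length(x), np.fft.fft(x)))   # True

The 336-point complex FFT on the chip plays the role of np.fft.fft(z) here; only the packing and unscrambling steps are added.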
FFT Algorithm Revisited. Now that the DSP chip has been chosen, the FFT algorithm
can be chosen based on the specific characteristics of the chip. Equation 14-1,
for estimating the computation time, will work for the Motorola DSP56166 because it has
enough memory on-chip to execute the 1024-point complex FFT. Based on the formula,
the worst-case 512-point FFT should take about 1.53 * 0.5 * 9/10 = 0.69 ms. Therefore,
three of them should take just over 2 ms out of the 40 ms available. This means that the
differences in the number of adds and multiplies for the different potential FFT lengths are
insignificant in deciding which length to use. Furthermore, there is plenty of time to compute
the weighting function with a small look-up table and interpolation formulas. This
saves program memory locations. The formulas in the Comparison Matrices in Chapter 9
(Tables 9-7 and 9-8), together with the building-block algorithm performance measures from the
Comparison Matrix in Chapter 8 (Table 8-1), are used to compute the performance measures
for the candidate FFT algorithms. They are summarized in Table 17-19.
Algorithm                          # of adds   # of multiplies   # of data locations   # of const. locations
336 = 3 * 7 * 16 Prime factor         7,332         2,596               672                    14
360 = 5 * 8 * 9 Prime factor          8,404         3,412               720                    13
420 = 3 * 4 * 5 * 7 Prime factor      9,648         4,064               840                    12
504 = 7 * 8 * 9 Prime factor         12,860         5,756             1,008                    15
512 = 8 * 8 * 8 Mixed-radix          11,776         4,352             1,024                   128
Because the most critical issue appears to be data and program memory, not computation
time, columns 4 and 5 of Table 17-19 are the most important selection criteria. In
these two columns, the entry showing the most dramatic difference between the algorithms
is the number of multiplier constants required for the 512-point FFT. Therefore, the
first decision is to eliminate the 512-point FFT.
Once the 512-point FFT is eliminated, the fifth column is no longer important in the
decision process because all the other transform lengths are so close to each other. Columns
2, 3, and 4 of Table 17-19 show 336 and 360 as the best technical choices. The 336-point
FFT is selected because it has the smallest entries in these columns.
The DSP56166 with no external memory is the best chip choice in this application. Since weight and volume are
primary specifications for the product, a custom board should be designed to take advantage
of how well the DSP56166 fits the application. Table 17-20 summarizes the specifications
for that board.
Category                                Specification
Processor                               DSP56166
Off-chip memory                         None required
Analog I/O ports                        8-kHz sample rate A/D built into the DSP56166
Instruction cycle time                  33 ns
Parallel and serial I/O ports (buses)   RS-232C serial port
Host interface                          Any that are RS-232C compatible
Section 16.5 introduces four types of test signals in order of increasing complexity.
It also gives the guidelines that were followed to create the specific parameters of each
signal in Table 17-21. They are reordered to match the strategy in Section 16.7.2, which lists
them in an order that allows testing with the fewest signals. The pair of sine waves
can be at any pair of relatively prime frequency bin numbers up to the length of the transform (336 points)
and was arbitrarily selected.
Table 17-21. Test signal parameters.
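As a hedged illustration of the two-sine-wave test signal (the bin numbers 5 and 11 below are arbitrary stand-ins, not the values in Table 17-21):

    import numpy as np
    from math import gcd

    N = 336
    k1, k2 = 5, 11                               # any relatively prime pair below N will do
    assert gcd(k1, k2) == 1
    n = np.arange(N)
    test_signal = np.sin(2 * np.pi * k1 * n / N) + np.sin(2 * np.pi * k2 * n / N)

    spectrum = np.abs(np.fft.fft(test_signal))
    peaks = np.sort(np.argsort(spectrum)[-4:])   # the two tones and their mirror-image bins
    print(peaks)                                 # [  5  11 325 331]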
The 336-point FFT algorithm is chosen because it has the smallest number of adds,
multiplies, and memory locations of the choices in Table 17-19. Many of the single processors
provide sufficient computational power. This allows the weighting function to be
computed rather than stored, which led to choosing the sine-to-the-fourth weighting function.
Any of the arithmetic formats provides the required accuracy and dynamic range. This allowed
the freedom to choose a chip based on other performance measures. The DSP56166
is picked because its combination of an on-chip A/D converter and sufficient on-chip
data memory removes the need for external data RAM chips. Table 17-22 summarizes all
of the key element design decisions made for this example.
Number of dimensions                          1
Type of processing                            Frequency analysis and correlation
Arithmetic format                             16-bit fixed-point
Weighting function                            Sine-to-the-fourth
Transform length                              336 points
Algorithm building blocks                     3-, 7-, and 16-point
Algorithm                                     Prime factor
DSP chip                                      DSP56166
Architecture                                  One Harvard processor with no external memory
Mapping the algorithm onto the architecture   Maximum throughput
The product is a general-purpose board that plugs into IBM PC-compatible hardware
and is used for deblurring images that are downloaded to it from the PC's hard disk. The
deblurred results are stored back on the PC's hard disk before the next image is downloaded.
The product is to be as inexpensive as possible so that it can be sold to law enforcement
agencies for use with images stored from digital cameras, videophones, and other image
input devices. Applications include license plate identification from an image taken in a
moving police car and, in crime labs, identification of suspects in video surveillance
imagery.
17.4.2 Specification
Table 17-23 summarizes the specification of the product. Throughput is defined as
the rate at which images can be fed to the product without the product falling behind.
Latency is the time from when the image enters the product until the deblurred version
exits. Notice that the throughput time is three times the latency. This accounts
for the image being loaded onto the board and for the deblurred image being sent back to
the hard disk.
17.4.3 Description
Figure 17-10 shows a simplified block diagram of an image recording process. The
simplest example of this process is a camera, where the image formation device is the
lens system and the image recording device is photographic film. If the lens system is not
properly focused, the image will be blurred. The photographic film recording process is
nonlinear as well as grainy. If the camera moves during the collection process, another
blur is introduced because the same portion of the input image energy will be recorded in
multiple locations on the film.
Figure 17-10. Simplified image recording process: input image energy passes through image formation and image recording, with image noise added, to produce the received image.
The approach illustrated in this example is called power spectrum equalization [1].
More can be learned about the power spectrum of a signal in Section 17.2. Its basic definition
is the FFT of the autocorrelation of the signal, where the autocorrelation of the signal is
pattern matching of the signal with itself using the techniques given in Chapter 6. The
computational approach is to find an estimate of the actual image that has the same power
spectrum as the recorded image and can be represented by that recorded image after passing
through a two-dimensional linear operator.
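A quick numpy check of that definition (illustrative only, with an arbitrary random test signal): the FFT of a signal's circular autocorrelation equals the squared magnitude of the signal's FFT.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)

    X = np.fft.fft(x)
    autocorr = np.fft.ifft(X * np.conj(X)).real          # circular autocorrelation via the FFT
    power_spectrum = np.fft.fft(autocorr)
    print(np.allclose(power_spectrum, np.abs(X) ** 2))   # True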
The algorithm for computing the deblurred N x M pixel image has the following
steps:
Step 1: Transform the Image to the Two-Dimensional Frequency Domain
Compute the (2 * N x 2 * M)-point, two-dimensional FFT of the received image,
where the outside of the array is filled with zeros as shown in Figure 17-11. Chapter 7
shows that the two-dimensional FFT of a 2 * N x 2 * M array of real data can be computed
as a sequence of 2 * M one-dimensional 2 * N-point FFTs of real data and 2 * N one-dimensional
2 * M-point FFTs of conjugate-symmetric complex data. Further, Chapter 2 shows
that a 2 * N-point FFT of real data can be computed by using an N-point FFT algorithm for
complex data. Therefore, the computational requirement for this step is to compute 2 * N
M-point FFTs and 2 * M N-point FFTs of complex data. Actually, the first dimension of
FFT computations, say the row FFTs, only requires N M-point FFTs because the other N
would be computing the FFT of all zeros (Figure 17-11).
Figure 17-11. The received image placed in a 2 * N x 2 * M pixel array, with the remaining rows and columns filled with zeros.
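A small numpy sketch of Step 1's row/column decomposition (not the book's listing; the 4 x 6 image size is only for illustration): the all-zero rows of the padded array contribute nothing, so their row FFTs can be skipped, and the result still matches a direct two-dimensional FFT.

    import numpy as np

    N, M = 4, 6                                       # illustrative size, not 768 x 1024
    image = np.random.default_rng(1).standard_normal((N, M))

    padded = np.zeros((2 * N, 2 * M))
    padded[:N, :M] = image                            # real image in one corner, zeros elsewhere

    rows = np.zeros((2 * N, 2 * M), dtype=complex)
    rows[:N] = np.fft.fft(padded[:N], axis=1)         # only the N nonzero rows need 2M-point FFTs
    result = np.fft.fft(rows, axis=0)                 # 2M column FFTs of 2N points each

    print(np.allclose(result, np.fft.fft2(padded)))   # True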
the lengths of the functions being correlated. The 1024-point FFT certainly meets that
criterion.
Weighting Function. The defined algorithm does not use weighting functions, so
the Comparison Matrix in Table 4-1 does not play a role in the development of this product.
Arithmetic Formats. The deblurring algorithm used here is sensitive to system
noise. Therefore, it is also sensitive to quantization noise. This suggests that 32-bit floating-
point arithmetic be used to minimize quantization errors.
Architecture and Chips. The arithmetic format requirement immediately eliminates
all but the floating-point DSP chip families in Chapter 14. These are listed in Table
17-25. The processing starts with loading the 1024 x 768 image onto the board, then continues
with the deblurring algorithm, followed by outputting the results to the hard disk.
Therefore, the board needs data memory to store all of the input pixels, but not additional
memory to collect the next image while processing the present one.
Since the processing will be performed in floating-point arithmetic, the on-board data
memory must hold 1024 * 768 = 786,432 thirty-two-bit complex words, or 1008 * 840 =
846,720 thirty-two-bit complex words, depending on the chosen FFT lengths. This amount
of data memory can be cut in half by taking advantage of the symmetries in the FFT outputs
that result from the input data being real rather than complex. However, this savings comes
at the cost of a more complex memory addressing scheme. The cost of developing and
debugging the more complex addressing scheme is not worth the effort, except for a very
high-volume application.
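The data memory estimate is simple arithmetic; a sketch follows (the byte counts assume two 32-bit words per complex value and are not figures from the book):

    for rows, cols in [(768, 1024), (840, 1008)]:
        complex_words = rows * cols
        megabytes = complex_words * 8 / 2**20        # 8 bytes per 32-bit complex value
        print(rows, "x", cols, "->", complex_words, "complex words,",
              round(megabytes, 2), "MB")
    # 768 x 1024 -> 786432 complex words, 6.0 MB
    # 840 x 1008 -> 846720 complex words, 6.46 MB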
The crucial step is to estimate how many DSP chips will be required. This defines the
architecture choices. The two key contributors are the FFT computations and the divides.
As a conservative estimate, assume all the FFTs are 1024 points. This will help account for
the fact that the double-length algorithm requires an extra stage after the FFT to compute
the needed outputs. Therefore, Steps 1 and 4 in Section 17.4.1 require 6 * 1024 = 6144
FFTs of 1024 points. If these took 1 ms each, all 6144 of them would take 6.144 s. At
2 ms per FFT, the time required for this portion of the processing is roughly 12.3 s. Using
2 ms is preferable because it allows more of the floating-point chips in Table 17-25 to be
included and is still well within the 20-s throughput requirement.
To these computations must be added the 4 * N * M complex multiplies, which is
16 * N * M real multiplies and 8 * N * M real adds. Assuming these are performed in
series, rather than making use of the multiplier-accumulator architecture of the DSP chips
to perform these functions in parallel, this is 24 * N * M = 18.87 or 20.3 million arithmetic
computations. These computations can be accomplished in less than 2 s on any of the
floating-point DSP chips in Table 17-25.
To the FFTs and complex multiplies must be added the 4 * N * M = 3.15 or 3.39
million divides, depending on the FFT lengths chosen. To perform the divides in the
remaining 20 - 12.3 - 2 = 5.7 s requires a computation rate of 0.55 or 0.59 million divides
per second. This translates into 1.81 or 1.68 µs per divide. Modeling the divide function
as an inverse followed by a multiplication takes 35 cycles for the inverse and another cycle for the
multiply in the TI series of floating-point chips (Reference 33 from Chapter 14). At the
40-ns clock rate of the TMS320C40, the divide will take roughly 1.44 µs. The Analog
Devices and Intel chip families also use software techniques to implement division. The
Motorola DSP96002 floating-point chip has hardware support for division.
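Pulling the estimates above together in one place (a hedged sketch; the 2-ms FFT time, the 24 * N * M operation count, and the 36-cycle divide at 40 ns are the assumptions stated in the text):

    N, M = 1024, 768                        # padded image dimensions used in the estimate

    fft_time = 6 * 1024 * 2e-3              # 6144 1024-point FFTs at 2 ms each
    mult_add_ops = 24 * N * M               # real multiplies and adds for 4*N*M complex multiplies
    divides = 4 * N * M
    divide_time = divides * 36 * 40e-9      # software divide: 36 cycles at 40 ns per cycle

    print(round(fft_time, 1), "s of FFTs")              # 12.3 s of FFTs
    print(round(mult_add_ops / 1e6, 1), "M mult/adds")  # 18.9 M mult/adds
    print(round(divides / 1e6, 2), "M divides in", round(divide_time, 1), "s")
                                                        # 3.15 M divides in 4.5 s

The roughly 4.5 s of software divides fits within the remaining 5.7 s, which is consistent with the marginal single-chip conclusion that follows.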
It appears there is a single DSP chip solution and that 2 ms is marginal for 1024-point
FFT performance if the divides are performed in software. Table 17-26 summarizes the
candidate DSP chip choices from Table 17-16 that should not be marginal, based on all the
computational estimates.
Therefore, the product can be built with a single DSP chip with off-chip program and
data memory. The off-chip data memory is required to hold the nearly 2 million 32-bit data
words needed for the intermediate frequency-domain computations on the image. Figure
17-12 shows the proposed processor architecture block diagram. The data and program
memory interfaces are shown with separate DSP chip pins to optimize performance. Based
on Table 17-26, the separate parallel memory interfaces assumption reduces the DSP chip
choices to the ADSP-21020, DSP96002, and TMS320C40.
Figure 17-12. Image deblurring processor architecture block diagram: a floating-point DSP chip with separate address and data connections to program memory and data RAM, plus a PC bus interface to the PC bus.
Category                                Specification
Processor                               TMS320C40
Off-chip memory                         256K of 32-bit words
Analog I/O ports                        None required
Instruction cycle time                  40 ns
Parallel and serial I/O ports (buses)   PC bus
Host interface                          PC compatible
The TMS320C40, with external program and data memory chips to accommodate the complex algorithm and huge amount
of data, is selected. Table 17-29 summarizes all of the key element design decisions made
for this example.
Number of dimensions                          2
Type of processing                            Convolution
Arithmetic format                             32-bit floating-point
Weighting function                            None
Transform length                              1024 points
Algorithm building blocks                     2-, 4-, 8-, and 16-point
Algorithm                                     Power-of-primes mixed-radix
DSP chip                                      TMS320C40
Architecture                                  One Harvard processor with external memory
Mapping the algorithm onto the architecture   Maximum throughput
17.5 CONCLUSIONS
The use of FFTs in ever-increasing numbers of industrial and mainstream consumer products
will be driven by the ability of design engineers to optimize code for computing this flexible
class of algorithms. The examples in this chapter, which serve as an applied summary of the
information in the preceding chapters, are just a taste of the astounding number of products
that are possible because of constantly evolving improvements to the work begun by J. B.
Fourier nearly two centuries ago.
It is our fervent hope that insights gained through the use of this book will help
readers invent the FFT-based products that will transform the fields of telecommunication,
medicine, seismology, oceanography, environmental protection, and consumer products
well into the 21st century.
REFERENCES
[1] A. V. Oppenheim, Applications of Digital Signal Processing, Prentice-Hall, Englewood
Cliffs, NJ, 1978.
[2] P. D. Welch, "The Use of the Fast Fourier Transform for the Estimation of Power
Spectra: A Method Based on Time Averaging over Short, Modified Periodograms,"
IEEE Transactions on Audio and Electroacoustics, Vol. AU-15, pp. 70-73, 1967.
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall,
Englewood Cliffs, NJ, 1978.
Glossary
Algorithm
A series of steps to compute a set of equations.
Architecture
A hardware organization of adders, multipliers, control logic, and memory for implementing algorithms.
Assembler
Software that converts assembly language code into machine language 1's and 0's for
a specific processor.
Assembly language
A programming language for controlling a microprocessor or DSP chip at the register
level.
Bandwidth
The measure of the spread of frequencies that pass through a filter or are contained
in a signal.
Bit slice
A method of dividing a number into smaller pieces so that arithmetic can be performed
with less-complex chips.
Block diagram
A drawing to depict the electronic interconnections of hardware components.
Block-floating-point
A floating-point number system that uses only one exponent for an entire set of data.
Bluestein algorithm
An algorithm developed to compute FFTs using convolution.
Bus
The communication network in or between processors or other devices.
Bus interface
Hardware that links a processor or other device to a bus.
Butterfly
The fundamental 2-point building block of the FFT.
Coefficients
The numerical constants in an equation or filter.
Complex arithmetic
Arithmetic with numbers that have real and imaginary parts.
Computational latency
The time between the start of computations and when output of results begins.
Computational load
The amount of computation a processor is required to do, expressed as operations per second.
Convolution
A method of modifying the amplitude and/or phase of the frequency components of
a signal; also known as linear filtering.
Cooley-Tukey algorithm
The most common power-of-two FFT.
Correlation
The operation of comparing or measuring the similarity of two waveforms; also known
as pattern matching.
Cross bar
A bus architecture that allows any processor to directly connect to any other processor.
dB
The abbreviation for decibel, a logarithmic measure of the power level of a signal
relative to a reference level.
Debugger
Software for removing errors from code.
Decimation in frequency (DIF)
A method of computing a power-of-two FFT that has the multiplier on the butterfly
output.
Decimation in time (DIT)
A method of computing a power-of-two FFT that has the multiplier on the butterfly
input.
Discrete Fourier transform (DFT)
A sine-wave-based set of equations to convert sampled time-domain data into
frequency-domain data that has equally spaced frequencies; an array of pattern matchers
where the patterns being matched are sine waves.
Dolph-Chebyshev weighting function
A weighting function with a spectrum characterized by uniform sidelobes.
Doppler radar
A radar that directly measures the radial velocity of a target.
Dynamic range
The ratio of the largest to the smallest number that can be represented by an arithmetic
format.
Emulator
A hardware model for a processor chip that allows access to all the functions of the
chip for program development or debugging.
Equivalent noise bandwidth
The ratio of the input noise power to the noise power in the output of an FFT filter
times the input data sampling rate.
Fast Fourier transform (FFT)
An algorithm for fast DFT computation.
Filter
An analog or digital device that reshapes the spectrum of a signal, typically to enhance
desirable frequencies and attenuate undesirable frequencies.
Fixed point
A number system based on the numbers being represented by a fixed number of digits
relative to the decimal point.
Floating point
A number system based on the numbers being represented by both a fixed number of
digits and an exponential multiplier.
Flowchart
A drawing to depict the sequence for executing the steps of an algorithm or progression
of information through a system.
Fourier transform
A sine-wave-based set of equations to convert continuous time-domain data into
continuous frequency-domain data.
Frequency analysis
Finding the amplitude and phase of the sine waves that comprise any waveform.
Frequency domain
A coordinate system for representing the frequency components of a signal.
Frequency resolution
How close the frequency of two sine waves can be and still be separately distinguished
by a measurement system.
Frequency straddle loss
The reduced output of a filter caused by the input signal not being at the filter's center
frequency.
Harvard architecture
A computer architecture with separate data and program memory buses.
High-level language
A programming language for controlling a microprocessor or DSP chip only at the
function level.
Hybrid architecture
A combination of features from two or more standard architectures.
Hypercube
A parallel processing architecture where the processors are connected in a multidimensional cube configuration.
Passband
The range of frequencies that are not attenuated by a filter.
Pipeline
An architecture where data is sequentially passed from one processor to the next to
execute an algorithm.
Power-of-two
An FFT algorithm where the number of data points or computed frequencies is 2
raised to a power.
Power spectrum estimation
Technique for estimating the power in the frequency components of a signal.
Practical transform length (PTL)
The acronym for a non-power-of-two FFT algorithm using multidimensional decomposition and complex conjugate math, developed by Win Smith.
Prime factor
An FFT algorithm where the factors are relatively prime and there are no twiddle
factors.
Prime number
Any number that has no factors other than itself and 1.
Primes-to-a-power
An FFT algorithm where the number of data points or computed frequencies is a
prime number raised to a power.
Quantization noise
The error signal caused by rounding off numbers and coefficients in a digital processor.
Rader algorithm
A prime number FFT using circular convolution.
Real-time operating system
Software that helps a processor control real-time algorithms.
Real-time operation
Processing of data that keeps up with the input data rate rather than storing it and
performing the processing later.
Relatively prime
Any two numbers with no common factors.
Ring bus
A circular bus architecture that allows data to pass from one processor to another and
end up where it started.
Sampled data
A sequence of data values collected at regular or irregular intervals.
Sampling theorem
The sampling rate must be at least twice as fast as the highest-frequency component
in the signal; also known as the Nyquist rate.
Sidelobes
Unwanted frequency components that are reduced but not removed by a filter.
Simulator
A software model of a processor that is used to develop and debug code prior to
hardware implementation.
Sine wave
A continuous, smooth, periodic signal defined by the mathematical function sin(kt).
Singleton algorithm
Computes non-power-of-two FFTs using multidimensional decomposition.
Small-point transform
A small FFT, usually 16 or fewer points.
Split-radix algorithm
An FFT composed of a mixture of power-of-two small-point transforms.
Star bus
A bus architecture with a central processor with additional processors connected like
spokes of a wheel.
SWIFT
The acronym for a non-power-of-two FFT algorithm using multidimensional decomposition and complex conjugate math, developed by Winthrop W. Smith.
Throughput
The number of times per second that a processor can compute an algorithm.
Time domain
A coordinate system that describes signals as a sequence of values at different points
in time.
Twiddle factor
A standard complex multiplication operation between the small-point transforms of an
FFT.
Unit pulse
A signal with a value of 1 for one time sample and zero for all other time samples.
Versa module eurocard (VME)
A standard hardware interface and software communications protocol for connecting
boards onto a VME system's bus.
Von Neumann
An architecture with a single bus for data and program memory.
Weighting functions
Functions that multiply FFT input data to reduce sidelobes.
Winograd algorithm
An algorithm developed to compute FFTs using a minimum number of multiplications.
Appendix
Comparison Matrices
Programmable floating-point DSP chips, 357-68
  Analog Devices 21020 family, 357-58
  Analog Devices ADSP-21060 family, 358-359
  AT&T DSP32C family, 359-61, 360
  Comparison Matrix, 369
  Intel i860 family, 361-63, 362
  Motorola DSP96002 family, 363-64
  NEC µPD77240/230A family, 364-365
  Texas Instruments TMS320C3x family, 365-67, 366
  Texas Instruments TMS320C40 family, 365-366
Program memory
  as source of error, 404
  See also Data memory; Memory
Prototyping area, as board selection factor, 392
PTL 8-point FFT, 113
PTL 9-point FFT, 121

Q

Q-point building blocks, output of, 189-91, 190
Q-point input adds data configuration for k = 0, 170
Q-point output adds data configuration for k = 0, 172
Quantization noise error
  for DFT, defined, 23
  for FFT, defined, 27, 28
  See also Error
Quantization noise escalation, in arithmetic formats, 316, 318, 319-20

R

Radar
  as changing signal, 73
  See also Doppler radar
Rader algorithms, 81, 88, 136-38
  5-point FFT, 93-96
Real data sequence, DFT of, 16
Real input signals
  for DFTs, 16-20
  double-length algorithm, 18-20
  2-signal algorithm, 17-18
  See also Input signals
Real-time operating systems (RTOS), support for by board, 391
Resolution of two sine waves, defined, 15-16
Ring bus architecture, 258, 260-62, 283-84
  16-point radix-4 FFT, 286-287
ROM. See Data read-only memory (ROM)
Round-off process, error introduction with, 23, 28, 314
RS-232C interface, 389
RTOS. See Real-time operating systems (RTOS)

S

Sampling theorem, for real signals, 11
Sequential 16-bit bit-slice multiplication, 249
Serial I/O ports
  on DSP chip, 324-25, 329-332, 337-83, 389
  See also Digital I/O ports
Sidelobe, for DFT, defined, 23
Sidelobe fall-off ratio, in relation to weighting function, 36
Sidelobe level
  in frequency analysis, 56
  in relation to weighting function, 36
Signals
  periodic, 20-21
  as waveforms, 73
  See also Transient signals
Signal-to-noise ratio
  improvement of with DFT, 20
  improvement of in Doppler processing, 414
Sine waves
  resolution of, 15-16
  in test signals, 406, 408-09
Single-processor architectures
  in algorithm and data mapping, 275-79
  defined, 255
  See also Multiprocessor architectures
Singleton algorithms, 81, 88, 138-40, 242, 327
  Comparison Matrix, 242
  15-point mixed-radix, 283
  7-point FFT, 101-03
  3-point FFT flow graph, 86-87
16-point FFT, response to 12 samples and four zeros of 1-kHz input, 15
16-point radix-4 FFT
  address generator sequences, 330
  crossbar architecture, 288-93, 290
  in Doppler processing, 417
  error isolation in, 409-12, 410
  example, 5

T

Telecommunications, 250
Test, of FFT performance, 395-412
Test signal
  consideration of in FFT design, 4
  constants, 405-06, 408
  error patterns, 406-407
  features of, 404-06
  for 4-point FFT, 405
  sine waves, 406, 408-09
  for speech analyzer, 439
  unit pulse, 404-405, 407-08
3 dB main-lobe bandwidth, in relation to weighting function, 37

U

Unit pulse, in test signals, 404-405

V

Video, as changing signal, 73
Von Neumann architecture, 255-57, 256
  in single processor function, 277-278

W

Waveforms
  periodic nature of, 9
  signals as, 73