


GEOPHYSICAL DATA ANALYSIS: TIME SERIES
Duncan Carr Agnew
Robert L. Parker
SIO 223B Class Notes, Spring 2011
© 2011 Duncan Carr Agnew

Contents

1 Introduction
  1.1 Random Time Series
  1.2 Overview of the Subject

2 Linear Systems and Fourier Transforms
  2.1 Linear Time-Invariant Systems
  2.2 Convolution
  2.3 The Fourier Transform
  2.4 A Transform Pair
  2.5 Fourier Theorems I: Similarity and Shift
  2.6 Generalized Functions
  2.7 Fourier Transforms of Generalized Functions
  2.8 Fourier Theorems II: Derivatives and Differentiability
  2.9 Fourier Theorems III: Convolution, Power, Modulation
  2.10 The Correlation Function

3 Fourier Theory for Discrete Time
  3.1 Introduction
  3.2 Discrete-Time Sequences and Operations
  3.3 Fourier Transforms for Infinite Sequences
  3.4 Discrete Fourier Transform
  3.5 Fourier Theorems for the DFT
  3.6 The Dirichlet Kernel
  3.7 Computing the DFT: the Fast Fourier Transform

4 Sampling Theory for Time-Series
  4.1 Introduction: Logging Data
  4.2 The Sampling Problem
  4.3 The Nyquist Theorem
  4.4 Aliasing
  4.5 Decimation
  4.6 Violating the Nyquist Criterion
  4.7 Quantization Error

5 Digital Filters I: Frequency-Selective Filters
  5.1 Introduction: Digital Filters in General
  5.2 FIR Filters for Frequency Selection

6 Digital Filters II: Recursive Filters and Simulating Linear Systems
  6.1 Introduction
  6.2 Lumped-Parameter Systems
  6.3 The Laplace Transform: System Poles and Zeros
  6.4 The z-transform for Digital Filters
  6.5 Recursive Digital Filters

7 Digital Filters III: Differentiators, Least-Squares, and Minimum-Phase Filters
  7.1 FIR Filters: Other Applications
  7.2 Differentiators and Digital Hilbert Transformers
  7.3 Least-squares Fits as Filters
  7.4 Minimum-Phase Filters

8 Stochastic Processes
  8.1 Introducing Ordered Random Data
  8.2 Stationary Processes and Autocovariance
  8.3 White Noises and their Relatives
  8.4 Examples from the Real World
A Computing the Discrete Fourier Transform
  A.1 Introduction
  A.2 The Basis of the Fast Fourier Transform
  A.3 Computing DFT's of Real Series
  A.4 Computing the Fourier Transform for a Few Frequencies

B Other Reading
  B.1 Introduction
  B.2 Linear Systems and Fourier Analysis
  B.3 Digital Signal Processing
  B.4 Time Series and Spectral Estimation

Bibliography

Index

CHAPTER 1: INTRODUCTION

1.1 Random Time Series

In this course we consider datasets in which the data have a particular structure, namely that the data are ordered, whether in time or space (or both); we shall use the term time series for the general case of data ordered in one dimension. Most data are actually ordered, though we may choose to ignore this; even repeated measurements of the same thing are usually done sequentially rather than all at once.

We begin with some mathematical concepts for ordered data. In fact, we start by focusing on the "ordered" part while ignoring the "random" part: we discuss concepts and methods useful for working with ordered data regarded as conventional (rather than random) variables. Only after we have introduced these concepts, in particular Fourier theory and methods, will we return to randomness. Our ultimate aim is to develop Fourier methods for random variables, which leads to the idea of the power spectrum: this spectrum, and the procedures for estimating it, will be our focus in the latter part of this section of the course.

1.1.1 Two Examples of Time Series

We start with some examples of ordered data to show some of what we will be discussing in more detail. Our first examples come from very close by: sea level at the end of the Scripps pier (Figure 1.1). The ordering of these data is crucial: if we selected values at random from those shown, we would lose significant information. Given the ordering, two obvious features are a periodic variation and a long-term rise. The periodic variation turns out to be annual,[1] so that a preliminary model for the data values as a function of time t (in years) is

    x_i = a_1 + a_2 t_i + a_3 \sin(2\pi t_i) + a_4 \cos(2\pi t_i) + r_i

Figure 1.1: Monthly mean sea level measured at the end of the SIO pier; data from the Permanent Service for Mean Sea Level. The reference level is chosen to be well below the lowest water level.

We can use the least-squares methods learned earlier to estimate the parameters a_1 through a_4; doing this, and subtracting the result from the data, gives the residuals r_i, also plotted in Figure 1.1. These residuals certainly cannot be called "error"; rather, these fluctuations in sea level are caused by changes in the ocean and atmosphere, and show, for example, a period of high sea level in the early 1980's, caused by an El Niño.

[1] This comes from the "steric effect", and is caused by the annual variation in water temperature at shallow depths; as it warms, water expands, and the level rises. This annual change does not reflect any annual variation in the amount of water in the oceans; to what extent the long-term change is steric, and to what extent driven by changes in water storage elsewhere, is much debated.
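As a concrete illustration of this kind of fit, here is a minimal sketch (not the authors' code; the arrays t and x are hypothetical stand-ins for decimal years and monthly mean sea level) showing how the four parameters can be estimated by ordinary least squares:

```python
import numpy as np

# Hypothetical inputs: t = decimal years, x = monthly mean sea level (mm), made up here.
rng = np.random.default_rng(0)
t = 1925 + np.arange(12 * 80) / 12.0
x = 7000 + 2.0 * (t - t[0]) + 40 * np.sin(2 * np.pi * t) + 20 * rng.standard_normal(t.size)

# Design matrix for x_i = a1 + a2*t_i + a3*sin(2*pi*t_i) + a4*cos(2*pi*t_i) + r_i
G = np.column_stack([np.ones_like(t), t, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
a, *_ = np.linalg.lstsq(G, x, rcond=None)   # least-squares estimates of a1..a4
r = x - G @ a                               # residuals r_i, analogous to Figure 1.1
print("trend (mm/yr):", a[1], "annual amplitude (mm):", np.hypot(a[2], a[3]))
```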
Our next example uses exactly the same quantity (water level at the end of the Scripps pier), but measured at a much finer time resolution, to record the ocean waves. Figure 1.2 shows 10 minutes of 1-second data, 24 hours apart. The dominant behavior on both days is an oscillation with amplitude about 15 cm and period about 10 seconds: the ocean swell. More than this it is difficult to say.

Figure 1.2: Sea level measured at the end of the SIO pier; data from the Coastal Data Information Program. Zero level is arbitrary (and not the same as in the previous figure). Each frame shows 10 minutes of data, on the top starting at 1997:240:00:00:00 and on the bottom one day later.

Figure 1.3 shows the power spectral density (PSD) of these data, estimated from a longer sample (an hour of measurements). We define the PSD later; for now, a simple way of looking at it is that it shows how much of the variation is occurring at different frequencies. In this case, the PSD is large for periods of 8-15 seconds (frequencies of 0.06-0.12 Hz), as we might expect from the plot of the time series. But we also can see that the decrease at higher and lower frequencies has very different forms (suggesting different physics). Further, we can see that the peak of the energy on day 241 has moved to a frequency 0.008 Hz higher than on day 240. If we take a series of such spectra and make a contour plot of the power spectral density over time, we see a "ridge line" of high values which shifts gradually with time. The explanation for this is that surface waves in liquids are dispersive; the wave energy was generated over a broad range of frequencies in one place (where there were high winds), but the lower-frequency energy traveled faster and arrived first.[2] Given the frequency shift over three days and the known physics of wave propagation, the distance to the source is found to be 64° ± 1°. We show this example to illustrate a common occurrence: characteristics of the data that are not obvious in a time series can become very obvious after we find the power spectrum of those data.

[2] See, for much more on this subject, based on an ocean-spanning data set, the classic paper by Snodgrass et al. (1966); a more recent summary, using satellite data, is Ardhuin et al. (2009).

Figure 1.3: Spectrum of the wave data in Figure 1.2 (actually, of a full hour beginning at the times shown in that figure). In each frame the spectrum from the other frame is shown as a light line at the lowest frequencies, to indicate the shift in frequency of the lowest-frequency peak.

Figure 1.4: Power spectral density as a function of time for sea level at the end of the SIO pier. Contours are in dB relative to 1 m²/Hz. The two dashed lines show the "slices" that give the spectra in Figure 1.3.

1.1.2 A Note on Decibels

Figures 1.3 and 1.4 show the amplitude of the spectrum plotted in units that may be new to you: decibels (dB). These are another way of creating a logarithmic scale, and the unit is sufficiently common that you need to know what it is. The term comes, via acoustics, from telephony: hence the name, Bel, which refers to Alexander Graham Bell. The original definition is that if a signal has a power (think loudness) P, and there is a reference power P_0, the signal level in decibels is 10 log_10(P/P_0).
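A minimal worked example of this definition (a hypothetical snippet, not part of the notes; the 20 log_10 amplitude convention it also shows is explained in the next paragraph):

```python
import numpy as np

def power_db(p, p0=1.0):
    """Level in dB of power p relative to reference power p0: 10*log10(p/p0)."""
    return 10 * np.log10(p / p0)

def amplitude_db(a, a0=1.0):
    """Level in dB of amplitude a: 20*log10(a/a0), since power goes as amplitude squared."""
    return 20 * np.log10(a / a0)

print(power_db(100.0))      # a factor of 100 in power  -> 20 dB
print(amplitude_db(2.0))    # a factor of 2 in amplitude -> ~6 dB
print(amplitude_db(0.5))    # half the amplitude         -> ~-6 dB
```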
The logarithm is used for two reasons: one is the usual reason that it lets a small range of numbers represent very large changes. The other is peculiar to acoustics, namely that the intensity of a sensation (e.g., perceived loudness) is roughly proportional to the logarithm of the power, a rule of psychophysics known as Fechner's law.

In digital signal processing, the reference level is taken to be unity. If we have something with amplitude A, then by definition the amplitude in dB is 10 log_10(A²) = 20 log_10 A. The square comes from the fact that squared quantities relate to power: for example, if we have a voltage V across a resistance R, the power dissipated in the resistor is V²/R. With this definition, a factor of two in amplitude is very nearly equal to 6 dB; a factor of four to 12 dB, and a factor of eight to 18 dB. Conversely, if signal A is half as large as signal B, we would say "A is 6 dB below B"; if A had a value of one half, with the reference level being one, we would say that "A has a level of −6 dB". And after a little practice, you will too.

1.2 Overview of the Subject

A major aim of this course is to teach you what a spectrum is, how to estimate it from data, and how to decide which of the bumps on the plotted spectrum mean something and which do not. But to get to this, we need to spend time introducing Fourier methods and some specialized parts of probability and statistics: all quite useful in themselves, but with the consequence that we will only discuss spectral methods towards the latter part of the course.

We start with mathematics applicable to ordered conventional variables, in particular the Fourier analysis of functions and linear systems theory. As we just saw, looking at data in terms of frequency rather than time can be a very powerful tool, and this is what Fourier methods are about: moving between a function defined in time and its transform defined in frequency. Linear systems theory helps to explain why this procedure is so useful, and also provides a useful way of looking at many phenomena. This way of looking at time series was developed by electrical engineers (and mathematicians who worked in collaboration with them), something also true of our next topic, digital signal processing. This name covers a range of concepts and methods: notably how to apply Fourier theory to time series in which the data occur as samples from a continuous function, and how to filter data for various purposes.

Figure 1.5: Monthly mean water density at the end of the SIO pier, found from the daily data for salinity and temperature archived at http://shorestation.ucsd.edu/. The upper trace is the raw data; the bottom trace is the residual (offset for clarity) after fitting and removing an annual cycle.

After this we move to ordered random variables. The subfield of probability that deals with random variables ordered in one or more dimensions is called the study of stochastic processes; we will use this mathematics to create descriptions that help us understand the process. For non-ordered random variables such descriptions include things like the moments of the probability density function (pdf); for time series we will develop the theory of the power spectrum, or power spectral density. Of course, defining the mathematics of ideal models is only part of working with data: we also have to develop methods for getting a "good" estimate of this model from actual data.
This, like all estimation from sets of data modeled as random variables, is a problem in statistics, not probability; within statistics it falls into the subfield termed (by statisticians) time-series analysis. The concepts used in estimating other parameters (e.g., bias, efficiency, consistency) apply equally to power spectrum estimation; but spectrum estimation is more complex, in part because the spectrum is a function, not just a single parameter or a set of them. Because of this complexity, this area suffers from much confusion and reliance on obsolete procedures. We will introduce you to useful and modern methods.

The power spectrum is a way of looking at a single set of ordered random variables. Often we have more than one set, and want to know how to relate them. To take an example, Figure 1.5 shows the water density at the end of the Scripps pier; if we compare this with the sea-level data in Figure 1.1, we can see that both show an annual cycle. We might ask if these two series are related at other frequencies, and if so how: to answer this question we would employ a generalization of the power spectrum known as the cross-spectrum, for which there are, again, good and bad estimation methods.

Another generalization of the power spectrum comes when we want to describe a time series in which the "spectrum" (in a rough sense of the amount of energy at different frequencies) changes with time. The data in Figures 1.2 through 1.4 are an example, though an "easy" case because the spectrum changed over times that were long (days) compared to the reciprocal of the frequencies involved (seconds to minutes). But there can be more rapid changes in frequency content, as, for example, in most seismograms. To describe such a process (which is termed nonstationary) we would need to replace the mathematical framework of Fourier analysis by what is called wavelet analysis, a topic we will not have time to discuss. You should realize that there are many other areas of time-series analysis that we will not touch on at all; we hope that what we do discuss will provide a useful framework for any topics you may want to learn about later.

CHAPTER 2: LINEAR SYSTEMS AND FOURIER TRANSFORMS

2.1 Linear Time-Invariant Systems

We begin with the mathematical theory, Fourier analysis, that underlies many kinds of analysis of time series, including signal processing, data filtering, and spectrum estimation. In order to make this theory usable, we will need to introduce new kinds of functions, called generalized functions.

Fourier analysis, as we will use it, is about representing a function, not as something that depends on the original variable (time), but as something that depends on a different variable, frequency. We can use this alternate representation because we can write the original function as the "sum" of a series of sines and cosines in time, these sines and cosines being "indexed" by this frequency variable. We have used quotes around "sum" and "index" because these are only analogies to what is actually done in Fourier transforms of functions on the real line; when we talk about digital data series we could remove the quote marks, since in that case these terms are correct.

We can construct functions from sums of a number of different families of functions; for example, over a finite interval we can form any function from sums of polynomials. Why use sines and cosines?
There are a number of ways of answering this; from the standpoint of geophysical data the answer lies in the ubiquity of linear time-invariant systems, and the especially simple way in which sinusoids are treated by such systems.

What do we mean when we refer to a system? It is just something into which we put an input function x(t) (a function of some variable, usually thought of as time), and out of which we get an output function y(t). We can write this either as a block diagram

    x(t) --> [ System ] --> y(t)

or symbolically, as x(t) ⇒ y(t). Actual systems may well have more than one x and y, but we avoid this complication for now. Some examples of systems are:

    Input                   System               Output
    Ground motion           Seismometer          Voltage
    Magnetized sea-floor    3-4 km of water      Surface magnetics
    Solar-lunar gravity     Oceans               Ocean tides
    Ice loads               Upper mantle         Postglacial rebound
    Wind                    Oceans               Swell
    Solar radiation         Atmosphere & ocean   Weather & climate

These systems are given in order of increasing difficulty in studying them. The first three are examples of systems that are linear and time-invariant, which are the ones we will focus on. The last two are definitely not linear, and as a result are by far the least understood. The fourth, postglacial rebound, may or may not be linear depending on the rheology of the mantle; the advantages of linearity are shown by the almost-universal use of linear models of viscoelasticity in studying this problem. The important point about linear time-invariant systems is that they are basically all alike (the same mathematics applies to them all), whereas nonlinear systems can be nonlinear in many different ways.

A system is linear if x_1(t) ⇒ y_1(t) and x_2(t) ⇒ y_2(t) implies that

    x_1(t) + x_2(t) ⇒ y_1(t) + y_2(t)

which is often called the principle of superposition. Repeated application of this definition shows that ax(t) ⇒ ay(t) for any rational value of a; it can be shown to be true for any real value of a.

Time-invariance is a separate property, and means that for a particular input the output does not depend on the absolute time:

    x(t + τ) ⇒ y(t + τ)  for all τ

An example of a linear system that would not be time-invariant is a pendulum whose length varies over times much longer than its period; even though a simple pendulum is linear in its response to small forces applied to the bob, gradual changes in length would keep it from being time-invariant. Time-invariance automatically holds for any system that depends only on physical laws, as in the magnetics example above, and can often be assumed as an idealization for many actual systems. One of the linear systems mentioned above, the ocean tides, will vary as the shape of the oceans changes over geologic time because of sea-level changes and continental motion; but over the past few centuries we can reasonably assume time invariance.

What do these characterizations of systems have to do with Fourier methods? The answer is that sinusoids (or indeed any exponential) are unchanged "in form" by a linear time-invariant system: the output and the input have the same functional representation. One way to see this is to note that shifting an exponential e^{st} by a time τ simply multiplies it by e^{sτ}, and a linear system does not change the form of the output when the amplitude of the input changes.
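Before making this formal, both defining properties are easy to check numerically for a concrete system (a sketch, not from the notes; the three-point moving average used here as the example linear time-invariant system is an arbitrary choice):

```python
import numpy as np

def moving_average(x):
    """A simple linear time-invariant system: three-point running mean."""
    return np.convolve(x, np.ones(3) / 3, mode="same")

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(100), rng.standard_normal(100)

# Superposition: the response to a sum is the sum of the responses.
print(np.allclose(moving_average(x1 + x2), moving_average(x1) + moving_average(x2)))

# Time-invariance: shifting the input just shifts the output (checked away from the ends).
shift = 5
y, y_shifted = moving_average(x1), moving_average(np.roll(x1, shift))
print(np.allclose(np.roll(y, shift)[10:-10], y_shifted[10:-10]))

# And a sinusoid comes out as a sinusoid of the same frequency, merely rescaled in
# amplitude and phase, which is the property exploited in the argument that follows.
t = np.arange(200)
s = np.cos(2 * np.pi * 0.05 * t)
print(np.std(moving_average(s)[20:-20]) / np.std(s[20:-20]))   # a constant gain below one
```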
More formally (and working with complex exponentials), we suppose that

    x(t) = e^{2\pi i f t} \;\Rightarrow\; y(t) = \tilde{g}(f, t)\, e^{2\pi i f t}

which, since g̃(f, t) is unspecified, allows y to be an arbitrary function of time. Then applying a time shift, and time-invariance, gives

    x(t + \tau) = e^{2\pi i f \tau} e^{2\pi i f t} \;\Rightarrow\; \tilde{g}(f, t + \tau)\, e^{2\pi i f \tau} e^{2\pi i f t}

Linearity allows us to cancel the constant factor e^{2πifτ} from both sides, with the result

    e^{2\pi i f t} \;\Rightarrow\; \tilde{g}(f, t + \tau)\, e^{2\pi i f t}

which implies

    \tilde{g}(f, t + \tau) = \tilde{g}(f, t)

But this means that g̃ can depend only on f; the time-dependence of input and output are then the same, namely a complex exponential, which is to say a sinusoid. We call g̃(f) the frequency response of the system, where f is the frequency of the sinusoid.

In general g̃(f) will be complex (just as x and y are in our example), even though actual series are usually real. This use of complex numbers simplifies the bookkeeping; what we really mean when we say that the input is A e^{2πift} is that

    x(t) = \Re[A e^{2\pi i f t}] = \tfrac{1}{2}[A e^{2\pi i f t} + A^* e^{-2\pi i f t}]
         = \Re[A]\cos 2\pi f t - \Im[A]\sin 2\pi f t \stackrel{\mathrm{def}}{=} a\cos(2\pi f t + \phi)

so that we can write the complex amplitude A as a(cos φ + i sin φ) = a e^{iφ}, a complex number with amplitude a and phase φ. We thus can represent our sinusoid on the Argand diagram often used for complex numbers; in this application it is called the phasor diagram (or vibration diagram) for this sinusoid. The modulus of A is the amplitude, and the angle φ the phase (relative to some time origin, which is arbitrary but must be chosen). Figure 2.1 shows some examples.

Figure 2.1: Some sinusoids, and their amplitude and phase displayed on a phasor diagram. A is a unit sinusoid with 0° phase shift (note that this depends on where we put the t = 0 point). B is a unit sinusoid with −30° phase shift (a delay); C is a unit sinusoid with −90° phase shift (we say that it is in quadrature with A); D is a unit sinusoid with ±180° phase shift; and E is a sinusoid with amplitude 2 and a 0° phase shift. F is another unit sinusoid with 0° phase shift, but a different frequency; in the phasor diagram, it plots in the same location as A, though usually a single phasor diagram is taken to refer to sinusoids of the same frequency.

It is very important to be aware of an arbitrary part of the amplitude and phase representation, namely which sign of φ represents a lag (or lead). Consider our sample sinusoid, e^{2πift}. If φ = 0, a maximum will occur at t = 0. If φ is slightly greater than 0, this maximum will be at t < 0, and the waveform will be advanced (we reach the maximum sooner in time) relative to φ = 0; similarly for φ < 0, we reach the maximum later (a delay). Thus, φ < 0 is a phase lag, and φ > 0 a phase lead; we will see shortly that this convention corresponds to a particular choice of how we define the Fourier transform. Unfortunately there is no universal agreement on this sign convention; our choice is the one used in electrical engineering (and hence in signal processing), but much of the older geophysical literature (for example, in tidal studies) takes phase lags to be positive.

2.2 Convolution

We have shown that what a linear time-invariant system does to a sinusoid is specified by its frequency response g̃(f); but what about more general inputs?
We shall simply state that, in general, we can relate the output y(t) to the input x(t) through the convolution integral

    y(t) = \int_{-\infty}^{\infty} x(u)\, g(t - u)\, du

where g(t) is a function characteristic of the system. We say that y is the convolution of x and g, which we write as y(t) = x(t) ∗ g(t). It is clear that the system described by this integral is linear; to show that it is time-invariant we first show it to be commutative. Letting u' = t − u we see that

    \int_{-\infty}^{\infty} x(u')\, g(t - u')\, du' = \int_{-\infty}^{\infty} x(t - u)\, g(u)\, du

which means that x ∗ g = g ∗ x. Delaying x(t) by τ is expressed in the convolution by x(t − τ), which gives

    \int_{-\infty}^{\infty} x(t - \tau - u)\, g(u)\, du = y(t - \tau)

so convolution is time-invariant.

There are lots of ways of looking at convolutions; see Bracewell (1986) for a selection. The most direct way of viewing convolution, at least for those who think visually, is to imagine a plot of x(t) (strictly, x(u)), together with one of g (plotted reversed, because it is g(t − u)). The position of g depends on the value of t; as t increases, g moves to the right. For any value of t, we form the product gx, and the value of y is just the area under this product function. Figure 2.2 illustrates the process. To develop a better sense of how convolution works, it can be useful to practice drawing functions and sliding one past the other.

Figure 2.2: A graphical representation of convolution. The left panel shows x(u), and g(t − u) for three values of t. The dots in the right panel show the integral of the product gx for these three values; when this operation is performed for all values of t, we get the function y(t) as shown.

Convolution is also associative, as we can show by writing the following series of integrals, using changes of variables to get through most steps:

    (x * y) * z = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} z(t - \tau)\, x(u)\, y(\tau - u)\, du\, d\tau
                = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x(u)\, y(\tau - u)\, z(t - \tau)\, d\tau\, du
                = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x(u)\, y(v)\, z(t - v - u)\, dv\, du
                = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x(u)\, y(v)\, z[(t - u) - v]\, dv\, du
                = \int_{-\infty}^{\infty} x(u)\, w(t - u)\, du = x * (y * z)

where w = y ∗ z. What this means in terms of linear systems is that we can aggregate them together as we see fit, as suggested in the accompanying sketch, where the systems A and B can be viewed separately or as a single system C:

    x(t) --> [ A ] --> [ B ] --> y(t)
             |_________________|
                       C

While this result may seem a bit trivial, it is actually very important: the concept of linear systems is so useful partly because we can combine simple systems together to make complicated ones.

2.3 The Fourier Transform

We are now ready to discuss the Fourier transform of a function. We have seen that exponentials e^{2πift} are modified in an especially simple way by linear time-invariant systems; but how can we use exponentials for a more general input? The most obvious method is to take a sum of exponentials, otherwise known as a Fourier series:

    x(t) = \sum_n \tilde{x}_n\, e^{2\pi i f_n t}

where the sum is over some finite set of frequencies f_n. There are some geophysical signals that can be represented in this way, most notably the tides, for which this is called a harmonic development. But the Fourier series certainly cannot represent an arbitrary function; for one thing, it cannot represent a transient (a function that is zero outside some range of t). So we generalize the weighted sum to an integral, and write

    x(t) = \int_{-\infty}^{\infty} \tilde{x}(f)\, e^{2\pi i f t}\, df        (2.1)

where f now ranges over the whole real line, not just a finite set of values.
It can be shown that this integral representation holds if and only if

    \tilde{x}(f) = \int_{-\infty}^{\infty} x(t)\, e^{-2\pi i f t}\, dt        (2.2)

These two equations define a function, x̃(f), that is the Fourier transform of the function x(t), which we write as F[x(t)] = x̃(f).[1] We say that equations (2.1) and (2.2) define a transform pair, (2.2) being the forward and (2.1) the inverse Fourier transform.

[1] Our notation for the Fourier transform of x(t) being x̃(f) (the tilde is meant to remind you of the sinusoid) is different from Bracewell (1986), who uses regular capital letters (e.g., X(f)) for the transforms of functions; unfortunately, this conflicts with the usage of capital and lower-case letters in probability and statistics. Another common notation denotes the Fourier transform of x(t) by x̂(f), though this, too, conflicts with statistical notation.

Any user of canned programs or other people's formulae should be aware that the definitions (2.1) and (2.2) are common but not universal. One alternative usage writes (2.2) using radian frequency ω rather than cyclic frequency f. The transform pair then becomes

    \tilde{x}(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt
    \qquad x(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \tilde{x}(\omega)\, e^{i\omega t}\, d\omega

or

    \tilde{x}(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt
    \qquad x(t) = \int_{-\infty}^{\infty} \tilde{x}(\omega)\, e^{i\omega t}\, d\omega

or

    \tilde{x}(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt
    \qquad x(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \tilde{x}(\omega)\, e^{i\omega t}\, d\omega

A more insidious danger is that the transform is taken to be

    \tilde{x}(f) = \int_{-\infty}^{\infty} x(t)\, e^{2\pi i f t}\, dt        (2.3)

so that the inverse transform uses e^{−2πift}. This will give results quite different from the convention we use. A check on this, if you want to test a particular algorithm, is to give it the same signal with a time lag, and compare the phases: for our convention, a lag means a more negative phase. As we saw in the previous section, this convention comes from our choice of e^{2πift} as a test function; using e^{−2πift} would give the reverse Fourier convention. There are sometimes good reasons to use this other convention; for example, if we write a traveling wave as e^{i(kx+2πft)} we get something that propagates backwards, which is undesirable. The definition in (2.3) is therefore common when dealing with waves.

2.4 A Transform Pair

When we have a pair of functions related by (2.1) or (2.2), we often say that x(t) is the function viewed in the time domain, and its transform x̃(f) is the same function viewed in the frequency domain. If you have not encountered this phraseology before, talking about the two domains can sound rather mysterious; but in fact we do it all the time: for example, in music pitch is about frequency and rhythm about time. When we say "Doing A in the time domain has effect B in the frequency domain", you should remember that this is just another way of saying, "If we modify the function according to A, its Fourier transform will be modified according to B". And, as we illustrated in the first chapter, what is invisible in one domain is often obvious in the other.

To some extent a familiarity with transform pairs and their behavior comes only with experience; to start developing this we begin with a simple case and explore some of the behaviors that it illustrates, and then look at x̃(f) for a few more functions. Our first transform pair starts with the most basic of transients, the rectangle (really square) function (often called a boxcar, at least in American usage):

    \Pi(t) = \begin{cases} 1 & |t| < 1/2 \\ 1/2 & |t| = 1/2 \\ 0 & |t| > 1/2 \end{cases}        (2.4)

This function is discontinuous; the value at |t| = 1/2 is required in a fully rigorous definition. A more general form of Π is aΠ([t − b]/c), which is centered at b, and has amplitude a and duration c.

Figure 2.3: The left panel shows the Π function. The next two panels show the sinc function plotted on a linear scale (center) and on a log-log scale (right). (In the log-log plot, the zeroes in sinc should go off the bottom of the plot but do not because of the way the plot program works.) This plot also shows the "corner frequency" f_c, defined in two ways: one (left) as the intersection between the high-frequency and low-frequency asymptotes, and the other (right) as the frequency at which sinc(f) = 1/2 (−3 dB).

The Fourier transform of Π(t) is easy to find and gives another important function:
To some extent a familiarity with transform pairs and their behavior comes only with experience; to start developing this we begin with a simple case and explore some of the behaviors that it illustrates, and then look at x̃( f ) for a few more functions. Our first transform pair starts with the most basic of transients, the rectangle (really square) function (often called a boxcar, at least in American usage):   1 Π( t) = 1/2  0 | t| < 1/2 for | t| = 1/2 | t| > 1/2 (2.4) This function is discontinuous; the value at t = 1/2 is required in a fully rigorous definition. A more general form of Π is aΠ([ t − b]/ c), which is centered at b, and has amplitude a and duration c. The Fourier transform of Π( t) is easy to find and gives another impor- 2.5. Fourier Theorems I: Similarity and Shift 17 tant function: F [Π( t)] = = Z∞ −∞ Π( t ) e −2π i f t dt = sinπ f def = sinc( f ) πf Z1/2 −1/2 e −2π i f t dt = Z1/2 −1/2 cos 2π f t dt Figure 2.3 shows what the sinc function looks like. We can then write the inverse transform Z∞ Π( t ) = sinc f e2π i f t d f −∞ but if we interchange f and t, and substitute − f for f , which is allowable because sinc is an even function, we get Z∞ Π( f ) = sinc( t) e−2π i f t dt −∞ another Fourier transform. We see that while sinc( t) is nonzero over an infinite range, and very smooth, its distribution with frequency is Π( f ), which is nonzero only over a finite range (“bandlimited”) and not smooth at all. 2.5 Fourier Theorems I: Similarity and Shift Now, suppose we make the rectangle function broader or narrower, so it is Π(at). The transform is (defining t′ = at): Z1/2a −1/2a e −2π i f t 1 dt = a Z1/2 −1/2 e −2π( f /a) t′ µ ¶ 1 f dt = sinc a a ′ If a becomes large, Π(at) becomes narrower, but its Fourier transform sinc( f /a) becomes wider. This duality of width between the two domains is in fact a general result, called the similarity theorem; again letting t′ = at Z∞ Z ′ 1 ∞ x( t′ ) e−2π i f t /a dt′ x(at) e−2π i f t dt = |a| −∞ −∞ which means that if F [ x( t)] = x̃( f ), F [ x(at)] = 1 x̃( f /a) | a| Chapter 2. Linear Systems and Fourier Transforms 18 where the |a| comes from the exchange of limits for a < 0. Shortening in the time domain thus implies broadening in the frequency domain; a generalization of the obvious case of a sine wave, for which less time between peaks means a higher frequency. We can be more precise about this relationship between broadening and narrowing by noting that, by the definition of the transform and its inverse, x̃(0) = Z∞ x( t) dt and x(0) = −∞ Z∞ x̃( f ) d f −∞ which means that R∞ −∞ x( t) dt Z∞ x(0) = · R∞ −∞ x̃( f ) d f x̃(0) ¸−1 Z∞ 1 x( t) dt is a x(0) −∞ −∞ kind of measure of the width of the function, specifically of the width of a rectangle of height x(0) with the same area. This width, and the width of the transform, are thus reciprocals. Also note that we have now seen the first example of another duality: global aspects in one domain (the area) are connected to local aspects in the other (the value at zero). Rather than changing the scale of the t axis, and thus the “width” of a function we can ask what happens if we shift it in time. 
Rather than changing the scale of the t axis, and thus the "width" of a function, we can ask what happens if we shift it in time. The Fourier transform of a function x(t) that has been shifted in time by an amount τ is:

    F[x(t - \tau)] = \int_{-\infty}^{\infty} x(t - \tau)\, e^{-2\pi i f t}\, dt
                   = \int_{-\infty}^{\infty} x(t')\, e^{-2\pi i f (t' + \tau)}\, dt' = \tilde{x}(f)\, e^{-2\pi i f \tau}

Looking at the amplitude and phase of x̃(f) e^{−2πifτ}, we see that the first, which is |x̃(f)|, is unchanged; but the phase has −2πfτ added to it: a time shift thus causes a phase that varies linearly with frequency. If (as is usual) we use the term amplitude spectrum to refer to the amplitude of the Fourier transform, we can say that a time shift leaves the amplitude spectrum unaltered, but changes the phase spectrum. For our sign convention for the Fourier transform, a time delay will cause the phase to be more negative at a given frequency. In any situation in which uncontrollable, or irrelevant, delays are present, the amplitude spectrum will be the part of the Fourier transform to look at. For example, this is the spectrum usually taken of seismic wave arrivals, unless the travel time is important.

2.6 Generalized Functions

So far we have proceeded as though, given a function, there would in fact be another function that is its Fourier transform. But this may easily not be true; we can easily come up with functions, for example x(t) = cos t, or x(t) = 1, for which the transform integral (2.2) does not exist. We could limit ourselves to functions for which (2.2) does exist, but it turns out to be possible to extend, without loss of rigor, the idea of what x(t) can be, to include what are called generalized functions; once we introduce these we can assume that a much wider class of functions has Fourier transforms.

Adding to the kinds of entities we can consider, rather than declaring results to be meaningless, is often a fruitful approach. To take the simplest examples, if we start with the integers, we have to introduce rational numbers for division of any integer by any other to make sense. Similarly, for subtraction of any two to be possible requires that we introduce negative numbers; and allowing square roots requires irrational and imaginary numbers. As the names of the last two attest, some of these extensions have not been easily accepted.

A full and rigorous treatment of generalized functions goes well beyond the scope of this course. We give a brief, and very nonrigorous, version, though our strategy for developing generalized functions will be the same as the one used in a rigorous approach: we treat such functions as the limit of a sequence of ordinary functions.

As a start, consider the convolution of the rectangle function with another function x(t). Evaluated at t = 0, this is

    x(t) * \Pi(t)\big|_{t=0} = \int_{-\infty}^{\infty} x(u)\, \Pi(-u)\, du = \int_{-1/2}^{1/2} x(u)\, du

Next, do the same thing for aΠ(at) ∗ x(t); the t = 0 value is

    a\int_{-1/(2a)}^{1/(2a)} x(u)\, du

What happens as a → ∞, making the rectangle function higher and narrower? Provided that x(t) is itself well-behaved near zero, what we get is the area under a rectangle with height close to the average of x(t), and with base a^{-1}; we then multiply this by a. The a^{-1} and a cancel, so that this product is the average of x(t) over the range from −1/(2a) to 1/(2a). As a gets larger, this average approaches x(0), so that

    \lim_{a \to \infty} \left[a\Pi(at) * x(t)\right]_{t=0} = x(0)
If we take the limiting operation "outside" the convolution, we may define a generalized function

    \delta(t) = \lim_{a \to \infty} [a\Pi(at)]

such that

    \int_{-\infty}^{\infty} x(u)\, \delta(u)\, du = x(0)

This is the Dirac δ-function, which can be pictured as an infinitely narrow, infinitely high spike at t = 0. By using a shifted version of Π(t) we can equally well show that

    \int_{-\infty}^{\infty} x(u)\, \delta(u - t)\, du = x(t)

which means that convolution with δ(t − a) selects the value of the function at t = a. The δ-function thus is the mathematical link between continuous time and sampled data. Note that from the above we can conclude

    \delta(t) = 0 \;\text{ for } t \neq 0 \qquad \text{but} \qquad \int_{-\infty}^{\infty} \delta(t)\, dt = 1

making the δ-function one that vanishes everywhere but zero, yet still has a finite integral. Note that the value of δ at t = 0 is undefined: this function (like other generalized functions) only has meaningful properties when integrated, not when standing alone.

A function closely related to the δ-function (but not itself a generalized function) is the Heaviside step function

    H(t) = \begin{cases} 1 & t > 0 \\ 0 & t < 0 \end{cases}

which is the mathematical way of saying, "Throw the switch."[2] If we perform the convolution x ∗ H we get

    \int_{-\infty}^{\infty} x(u)\, H(t - u)\, du = \int_{-\infty}^{t} x(u)\, du

which is just the integral of x evaluated up to t. Taking the derivative of this, by the theorems of calculus, gives x(t) again. We may, very nonrigorously, write

    \frac{d}{dt}[H * x] = x

and applying the derivative operator to the step function (interchanging differentiation and integration) gives H' ∗ x = x, which means that H'(t) = δ(t). That the derivative of H is the δ-function, while graphically appealing, does considerable violence to the usual notion of derivative, but again can be made rigorous by considering sequences of functions which approach H(t); if these are differentiable, it can be shown that the derivatives approach δ(t). Indeed, it is possible to keep "taking derivatives" in this fashion to get a function δ'(t), which when convolved with a function x(t) produces the derivative x'(t).

[2] Entirely appropriately, Oliver Heaviside, the inventor of this and much of the other mathematical machinery in this field, worked as a telegrapher.

2.7 Fourier Transforms of Generalized Functions

So, what additional Fourier transforms can we do now? If we take Fourier transforms of the sequence of rectangle functions, F[aΠ(at)], the transform is

    a\int_{-1/(2a)}^{1/(2a)} e^{-2\pi i f t}\, dt = \int_{-1/2}^{1/2} e^{-2\pi i (f/a) t'}\, dt' = \mathrm{sinc}(f/a)

If we now let a go to ∞, f/a → 0, and since sinc(0) = 1, we have (interchanging integration and the limiting operation)

    F[\delta(t)] = 1

which says that the δ-function contains all frequencies equally. Of course just taking the transform, and using the properties of the δ-function, gives the same result:

    F[\delta(t)] = \int_{-\infty}^{\infty} \delta(t)\, e^{-2\pi i f t}\, dt = e^{-2\pi i f \cdot 0} = 1

The inverse transform then implies that

    \int_{-\infty}^{\infty} e^{2\pi i f t}\, df = \delta(t)

One way to "see" this is to note that for t ≠ 0 the integrand oscillates and so gives a zero integral, but for t = 0 the integral is infinite. In ordinary function theory this equation would be complete nonsense: the left side is nonintegrable and the right side is meaningless. The extension to generalized functions allows us to deal with such problems in a consistent and rigorous way.
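A discrete analogue of this flat spectrum is easy to check numerically (a sketch assuming NumPy, not something from the notes): the DFT of a unit impulse is one at every frequency, and the DFT of a delayed impulse has the same unit amplitude with a linearly decreasing phase, anticipating the shift theorem above.

```python
import numpy as np

N = 16
impulse = np.zeros(N)
impulse[0] = 1.0
print(np.fft.fft(impulse))          # all ones: an impulse contains every frequency equally

shifted = np.zeros(N)
shifted[1] = 1.0                    # impulse delayed by one sample
X = np.fft.fft(shifted)
print(np.abs(X))                    # amplitudes still all one
print(np.angle(X[:5]))              # phase = -2*pi*k/N, decreasing linearly with frequency
```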
If we now swap f and t, we get

    F[e^{2\pi i f_0 t}] = \int_{-\infty}^{\infty} e^{2\pi i t (f_0 - f)}\, dt = \delta(f_0 - f) = \delta(f - f_0)

with, as might be expected, an "infinite" peak at frequency f_0. A δ-function away from zero is thus the Fourier transform of a complex sinusoid. This is the transform of a complex sinusoid; what about sines and cosines? Their transforms are:

    F[\cos 2\pi f_0 t] = F[\tfrac{1}{2}(e^{2\pi i f_0 t} + e^{-2\pi i f_0 t})] = \tfrac{1}{2}[\delta(f - f_0) + \delta(f + f_0)]

which is two δ-functions at ±f_0. Similarly,

    F[\sin 2\pi f_0 t] = F[\tfrac{-i}{2}(e^{2\pi i f_0 t} - e^{-2\pi i f_0 t})] = \tfrac{-i}{2}[\delta(f - f_0) - \delta(f + f_0)]

which is purely imaginary, and also has δ-functions at ±f_0, though they are oppositely directed.

These two cases illustrate some important symmetry relations for the Fourier transform: cos 2πf_0 t is real and even, and so is its transform; sin 2πf_0 t is odd, and its transform is also odd, but purely imaginary. Since any real function can be split into even and odd parts, its transform must have a real part which is even, and an imaginary part which is odd: for x(t) real,

    \Re[\tilde{x}(f)] = \Re[\tilde{x}(-f)] \qquad \Im[\tilde{x}(f)] = -\Im[\tilde{x}(-f)]

which is to say x̃(f) = x̃*(−f). Functions with this property are known as Hermitian. Because real functions have Hermitian transforms, the transform only needs to be specified for f ≥ 0.[3] The following table summarizes most of the useful symmetry relations between functions and their transforms:

    Time domain    Frequency domain
    Complex        Complex
    Imaginary      Anti-Hermitian: x̃(f) = −x̃*(−f)
    Real           Hermitian: x̃(f) = x̃*(−f)
    Real, even     Real, even: x̃(f) = x̃(−f)
    Real, odd      Imaginary, odd: x̃(f) = −x̃(−f)

[3] Or f ≤ 0, but nobody does that.

Since we have just established that negative frequencies are superfluous when we are dealing with real-valued functions, and knowing that all data are real-valued, you may wonder if there is any meaning, outside the formalism of the mathematics, for negative frequencies. The answer is that they are useful in certain cases, namely in dealing with two-dimensional vector processes: for example, horizontal currents in the ocean. We can represent such data as complex numbers using the Argand-diagram representation: given a North velocity v_N and an East velocity v_E, we can write the vector velocity as v = v_E + i v_N, where v is complex-valued. The transform of the data v then has no symmetry about f = 0: we need to look at it for both positive and negative frequencies. What the sign of the frequency tells us is whether the vector rotates in one sense or the other (clockwise vs anticlockwise). For systems in which gyroscopic forces are important, this can be a useful way of looking at the data. It would be very useful if there were something like this for vectors in three dimensions; unfortunately there isn't.

2.8 Fourier Theorems II: Derivatives and Differentiability

We now return to general theorems on Fourier transform pairs, concerning ourselves now with derivatives. A (nonrigorous) application of integration by parts gives

    F[x'(t)] = \int_{-\infty}^{\infty} x'(t)\, e^{-2\pi i f t}\, dt
             = \Big[x(t)\, e^{-2\pi i f t}\Big]_{-\infty}^{\infty} + 2\pi i f \int_{-\infty}^{\infty} x(t)\, e^{-2\pi i f t}\, dt        (2.5)

so that F[x'(t)] = 2πif x̃(f); this also holds in reverse, with F^{-1}[x̃'(f)] = −2πit x(t). These results are the basis for using Fourier transforms to solve some classes of linear differential equations, something we discuss later on in designing some kinds of digital filters.
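The derivative theorem is easy to verify numerically on a smooth periodic signal (a sketch, not from the notes, using the DFT as a stand-in for the continuous transform; the test function and grid are arbitrary choices):

```python
import numpy as np

N, T = 512, 8.0                              # number of samples and record length (arbitrary)
dt = T / N
t = np.arange(N) * dt
x = np.exp(np.sin(2 * np.pi * t / T))        # a smooth, periodic test function

f = np.fft.fftfreq(N, d=dt)                  # cyclic frequency for each DFT bin
dx_spectral = np.fft.ifft(2j * np.pi * f * np.fft.fft(x)).real   # F[x'] = 2*pi*i*f * F[x]
dx_true = (2 * np.pi / T) * np.cos(2 * np.pi * t / T) * x        # analytic derivative

print(np.max(np.abs(dx_spectral - dx_true)))   # tiny: differentiation = multiplication by 2*pi*i*f
```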
Figure 2.4: Sketches to show what level of discontinuity (left to right: in the function, the first derivative, and the second derivative) produces what level of asymptotic decay of the Fourier transform.

In the terms we are developing, of looking at the function in two domains, this result is important for showing what taking a derivative or integral in the time domain does to the function in the frequency domain. For example, consider ground motion from an earthquake: for given displacements, the velocities (the first derivative) will have a transform scaled by f, and will thus be much rougher (richer in high frequencies); and the accelerations will be rougher still, with a Fourier transform scaled by f².

This result leads to others, which like the earlier results on widths connect local properties in one domain with global ones in the other. For second derivatives,

    F^{-1}[\tilde{x}''(f)] = -4\pi^2 t^2 x(t)

Taking the transform of both sides gives

    \int_{-\infty}^{\infty} t^2 x(t)\, e^{-2\pi i f t}\, dt = \frac{-\tilde{x}''(f)}{4\pi^2}

which means that the second moment of the function x(t) is

    \int_{-\infty}^{\infty} t^2 x(t)\, dt = \frac{-\tilde{x}''(0)}{4\pi^2}

that is, the second moment of the function is proportional to the second derivative of its Fourier transform, evaluated at the origin. Once again we see that a global property (the second moment) in the time domain is connected to a very local property (the second derivative at the origin) in the frequency domain.

Still another example of the local/global relation between the original and the transform is a theorem which relates the differentiability of x(t) to the asymptotic behavior of x̃(f) at large f; that is, to the bounds on |x̃(f)| as f → ∞. Intuitively one might suppose these to be connected: the "rougher" a function is, the more high frequencies we would expect it to contain. We can motivate (not prove!) the result we seek by supposing x(t) to contain steps; then x'(t) (using our extended notion of differentiation) contains δ-functions, and so |F[x'(t)]| ∝ 1 for f large. But then we have, from equation (2.5),

    |F[x(t)]| = |\tilde{x}(f)| = \left|\frac{F[x'(t)]}{2\pi i f}\right| \propto f^{-1} \quad \text{for } f \to \infty

Similarly, a function with steps is the derivative of one that is continuous but has corners (discontinuous derivatives), and so by the same argument the Fourier transform of such a function will be proportional to f^{-2} as f → ∞. Extended, this shows that if x(t) has a discontinuous n-th derivative, |x̃(f)| will fall off as f^{-(n+1)} at high frequencies; Figure 2.4 gives a graphical summary of the first few cases.

Two extreme examples are Π(t), for which the zeroth derivative (the function itself) is discontinuous; and indeed

    |F[\Pi(t)]| = |\mathrm{sinc}\, f| < k/f \quad \text{as } f \to \infty

(here k = 1/π, but it is the rate of the falloff that matters). At the other extreme, suppose x(t) = e^{-\pi t^2}, which is infinitely differentiable. For this,

    \tilde{x}(f) = F[x(t)] = \int_{-\infty}^{\infty} e^{-(\pi t^2 + 2\pi i f t)}\, dt
                 = e^{-\pi f^2} \int_{-\infty}^{\infty} e^{-\pi (t + i f)^2}\, dt
                 = e^{-\pi f^2} \int_{-\infty}^{\infty} e^{-\pi u^2}\, du = e^{-\pi f^2}

which for large f falls off (goes to 0) more rapidly than any power of f. We also see that the Gaussian is its own Fourier transform.

2.9 Fourier Theorems III: Convolution, Power, Modulation

So far we have looked only at properties of a single function and its Fourier transform. We now consider combinations of functions. Linearity means that if we add two functions, the transform of the result is the sum of the two transforms. A much more interesting result appears if we consider the function formed by the convolution of two functions.
A much more interesting result appears if we consider the 26 Chapter 2. Linear Systems and Fourier Transforms function formed by convolution of two functions. The result is the convolution theorem, which states that F [ x( t) ∗ y( t)] = F [ x( t)]F [ y( t)] which is to say, the transform of the function formed by convolving two functions is the product of the transform of the individual functions. This is not difficult to show; again, using change of variables freely, we have: Z∞ Z∞ Z∞ Z∞ −2π i f t e x( u) y( t − u) dudt = x( u ) y( t − u) e−2π i f t dtdu −∞ −∞ −∞ −∞ Z∞ Z∞ ′ x( u ) y( t′ ) e−2π i f t e−2π i f u dt′ du = −∞ −∞ Z∞ x( u) e−2π i f u du = F [ y( t)]F [ x( t)] = F [ y( t)] −∞ Writing F [ x] = x̃, F [ y] = ỹ, we can state the related results x ∗ y = F −1 [ x̃ ỹ] F −1 [ x̃ ∗ ỹ] = x y F [ x y] = x̃ ∗ ỹ This theorem is the main reason for our use of Fourier transforms: while the action of a linear system is a convolution in the time domain, it is a multiplication in the frequency domain – and multiplication is much easier to grasp than convolution. While the frequency domain may initially seem less natural than the time domain, operations in it are often much simpler. Our earlier result, that a sinusoid e2π i f t put into a linear time-invariant system yields another sinusoid g̃( f ) e2π i f t out, now appears as a special case of the convolution theorem. We write the effect of the system as y = g ∗ x, which implies ỹ( f ) = g̃( f ) x̃( f ), and if x̃( f ) = δ( f − f 0 ), ỹ( f ) will be g̃( f )δ( f − f 0 ) = g̃( f 0 )δ( f − f 0 ) since δ( f − f 0 ) = 0 for f 6= f 0 . Again, and now perhaps more clearly than in our earlier discussion, we see that g̃( f ) is the frequency response of the system. If we make the input to the system x( t) = δ( t), then the output is Z∞ y( t) = δ( t − u) g( u) du = g( t) −∞ which is called the impulse response of the system; albeit this is not easy to produce in practice, it is still a useful characterization. Then the frequency response is seen to be g̃( f ) = F [ g( t)] 2.9. Fourier Theorems III: Convolution, Power, Modulation 27 A function related to the impulse response is its integral, the step response: Zt g( u) du h( t) = −∞ which is the response of the system to a Heaviside step function H ( t) applied at t = 0. For any actual system this is much easier to produce than the impulse response is. As an example of a system’s frequency response, consider differentiation, which is a linear time-invariant system – though not usually thought of as such. We have seen that F [ x′ ( t)] = 2π i f x̃( f ) and so the frequency response for this system is g̃( f ) = 2π i f , with amplitude and phase spectra Ph[ g̃( f )] = π/4 = 90◦ | g̃ ( f )| = 2π f As might be expected, the response rises with increasing frequency; the 90◦ phase response means that sines are turned into cosines, and vice versa. The frequency response of a system is often much simpler to describe than its impulse response (and more indicative of the underlying physics); we will (eventually) get to ways of estimating it given x( t) and y( t). 
An interesting consequence of the convolution theorem relates to the total variance, also called the energy (or, sometimes, power) in x(t), which is

    \int_{-\infty}^{\infty} |x(t)|^2\, dt

We first consider the Fourier transform of the conjugate of x, which is

    F[x^*(t)] = \int_{-\infty}^{\infty} x^*(t)\, e^{-2\pi i f t}\, dt = \left[\int_{-\infty}^{\infty} x(t)\, e^{-2\pi i (-f) t}\, dt\right]^* = \tilde{x}^*(-f)

and so, using one version of the convolution theorem,

    F[|x(t)|^2] = F[x x^*] = \tilde{x}(f) * \tilde{x}^*(-f)

Remembering that the area under a function is the value of its Fourier transform at f = 0, this means that

    \int_{-\infty}^{\infty} |x(t)|^2\, dt = \left.\tilde{x}(f) * \tilde{x}^*(-f)\right|_{f=0}
        = \left.\int_{-\infty}^{\infty} \tilde{x}(u)\, \tilde{x}^*(u - f)\, du\right|_{f=0}
        = \int_{-\infty}^{\infty} |\tilde{x}(f)|^2\, df

so that the total energy is the same in the function as in its transform. This result goes by several names; Bracewell (1986) calls it Rayleigh's theorem, but notes that the name Plancherel's theorem is also in use; the result is also sometimes called Parseval's theorem, especially for the analogous case of discrete-time Fourier transforms, which we will meet later on.

As another example of the convolution theorem, we derive the transform of a modulated waveform, in which the data look like a "slowly varying" sinusoid, where "slow" means "over many cycles of the sinusoid". The slow variation, termed a modulation, can be in amplitude, frequency, or phase; all of these are used in different forms of radio-wave communication. Suppose the sinusoid (the "carrier" in radio parlance) is the cosine cos 2πf_0 t; this has the Fourier transform

    F[\cos 2\pi f_0 t] = \tfrac{1}{2}[\delta(f - f_0) + \delta(f + f_0)]

Suppose also that the modulation is purely in amplitude, and the modulating function is x(t). Then the modulated waveform is x(t) cos 2πf_0 t. We can now apply the inverse of the convolution theorem: the transform of a product of two functions is the convolution of the individual transforms. Remembering that convolution with a δ-function recovers the function being convolved, this means that

    F[x(t)\cos 2\pi f_0 t] = \tfrac{1}{2}[\delta(f - f_0) * \tilde{x}(f) + \delta(f + f_0) * \tilde{x}(f)] = \tfrac{1}{2}[\tilde{x}(f - f_0) + \tilde{x}(f + f_0)]

where x̃(f) = F[x(t)]. The effect of multiplying a function by a sine wave is thus to replicate the original transform at ±f_0 (Figure 2.5); alternatively, we can say that the effect of modulating a sine wave is to spread its original δ-function transform out over a broader band, though if x(t) contains only frequencies much less than f_0 (as is true in the case of radio), the band will be narrow. This is also true for other classes of modulation, though the relation between x̃(f) and the transform of the modulated function is less straightforward.

Figure 2.5: From top to bottom, the time series and spectra for a signal, a carrier wave, and the amplitude-modulated carrier. The spectra are schematic.
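To see the modulation theorem in action numerically (a sketch, not from the notes; the carrier frequency, record length, and Gaussian envelope are arbitrary choices), one can check that the spectrum of the modulated carrier is the envelope's spectrum shifted to ±f_0:

```python
import numpy as np

N, dt = 4000, 0.01
t = np.arange(N) * dt
f0 = 10.0                                    # carrier frequency (Hz), chosen to sit on a DFT bin
envelope = np.exp(-((t - 20.0) / 5.0) ** 2)  # slowly varying amplitude x(t)
signal = envelope * np.cos(2 * np.pi * f0 * t)

f = np.fft.rfftfreq(N, d=dt)
X_env = np.fft.rfft(envelope)
X_sig = np.fft.rfft(signal)

# The modulated spectrum peaks at f0, where it is about half the envelope spectrum at f = 0:
# X_sig(f) ~ 0.5 * [X_env(f - f0) + X_env(f + f0)].
print(f[np.argmax(np.abs(X_sig))])                 # ~10 Hz
k0 = int(round(f0 * N * dt))                       # index of the bin at f0
print(np.abs(X_sig[k0]), 0.5 * np.abs(X_env[0]))   # nearly equal
```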
2.10 The Correlation Function

We can modify the convolution integral by not time-reversing one function. The result is called the cross-correlation, because that is the form taken by the correlation between two random time series, as we will see later on. For now we look at some of the properties of this expression when we use the same function for both parts; this is called the autocorrelation. For a complex-valued x, this is:

    C(\tau) = \int_{-\infty}^{\infty} x^*(t)\, x(t + \tau)\, dt

The location of the conjugate matters; if we conjugate the other x in the integral, we have, since yz* = (y*z)* in general,

    \int_{-\infty}^{\infty} x(t)\, x^*(t + \tau)\, dt = \left[\int_{-\infty}^{\infty} x^*(t)\, x(t + \tau)\, dt\right]^* = C^*(\tau)

which is not the same as C(τ). C has some symmetry properties; if we change variables, we get the result

    C(-\tau) = \int_{-\infty}^{\infty} x^*(t)\, x(t - \tau)\, dt = \int_{-\infty}^{\infty} x^*(u + \tau)\, x(u)\, du

which from the previous result is C*(τ). Hence C(−τ) = C*(τ), which is to say that C is Hermitian; this implies that its Fourier transform is always real. If x(t) is real, C is also real, which means that C, and its transform, must be even. All of these properties will become relevant when we define the power spectral density, since the most general form of this is the Fourier transform of a Hermitian function, and the most usual form is the Fourier transform of a real symmetric function.

CHAPTER 3: FOURIER THEORY FOR DISCRETE TIME

3.1 Introduction

We now turn from Fourier theory for functions to the same theory for ordered sets of numbers: this is part of digital signal processing. As we will see, much of what has been covered in the Fourier theory discussion will have parallels here, though also significant differences, which come about because the theory of the Fourier transform assumes that what is being transformed (and the transforms themselves) are functions on the real line (or a higher-dimensional equivalent). If we specifically take these functions to be functions of time, we say that they are defined "in continuous time". However, digital signals are, intrinsically, not so defined: rather, they are collections of numbers, representing (usually) a continuous-time signal sampled at regular intervals. Such data are called sampled data[1] and are said to be defined in discrete time. We begin by describing how Fourier theory works when applied to such data, for two cases:

• Discrete-time data defined, like data in continuous time, over an infinite range, so the data "go on forever." The Fourier transform of such data turns out to be a function on the real line; but unlike the Fourier transform of a function, which is another function, the transform of an infinite amount of discrete data is a function defined on only a part of the real line. A discrete-time series and its transform are very different kinds of objects.

• Discrete-time data defined over a finite range, which is what we actually have to deal with. The Fourier transform of this turns out to be a finite set of numbers at discrete frequencies: in this case, as with functions on the real line, a function and its transform are the same kind of thing. The Fourier transform of a finite amount of discrete-time data is called (what else?) the Discrete Fourier Transform or DFT.

[1] This terminology is more commonly used (in statistics) for the act of getting data on a subset of some population, ideally without biasing the results of a statistical investigation. We will not deal with this topic.

We will spend some time in this chapter exploring the properties of the DFT, including an overview of how to compute it efficiently, though we relegate the details of this to Appendix A. In Chapter 4 we will use the theory to discuss how to go from continuous time to discrete time without losing information: a question best approached, as it turns out, by looking at Fourier transforms.
The following chapters describe the commonest operations we perform on discrete-time data: filtering them to remove certain frequencies (Chapter 5), to emulate how systems behave in continuous time (Chapter 6), or for other reasons (Chapter 7).

3.2 Discrete-Time Sequences and Operations

In discrete time there are no functions, only sequences of numbers; we denote such a sequence by x_n or x(n), where n is an integer-valued index. To start with we assume that we have an infinite amount of data, with n running from −∞ to ∞. We seek as many parallels with the continuous-time case as we can.

Some things are much simpler in discrete time. For example, the distinction between generalized functions and other functions disappears: all are just sequences. The δ-function is just another sequence:

    δ_n = δ_{n0} = 1 for n = 0,   0 for n ≠ 0

where δ_{ij} is the Kronecker δ. The discrete-time version of the Heaviside step function is

    H_n = 1 for n ≥ 0,   0 for n < 0

so that

    H_n = Σ_{k=−∞}^{n} δ_k

In discrete time summation replaces integration, and we face none of the difficulties in making things rigorous that we had to consider when dealing with functions on the real line. Likewise, convolution is a summation rather than an integral; the convolution of two sequences (also called the serial product) is still written as z = x ∗ y, but this now denotes

    z_n = Σ_{k=−∞}^{∞} x_k y_{n−k}

(again one series is reversed). This definition satisfies the requirements for convolution of being linear in x and y, and also invariant with shifts in time (sequence number); if we shift x by m terms, the convolution with y is

    Σ_{k=−∞}^{∞} x_{k−m} y_{n−k} = Σ_{l=−∞}^{∞} x_l y_{n−m−l} = z_{n−m}    (3.1)

so z is also shifted.

Convolution is just as important in discrete time as in continuous time; indeed, some types of sequences are defined by what they do in convolution. One example is that a sequence {g_k} is causal if and only if g_k = 0 for k < 0; this means that if {x_k} is causal, g ∗ x will be also: the output will not precede the input. If you are processing data in real time, or trying to emulate a system that is causal (such as an instrument, or wave propagation), this is an important distinction; more generally, however, it is not: usually we analyze data long after they are collected, so that for most processing on a computer the actual time is far behind “real time.”

3.3 Fourier Transforms for Infinite Sequences

We started our discussion of Fourier theory by looking at what linear systems do to sinusoids; similarly, we develop Fourier transforms for infinite sequences by considering the convolution of a sinusoidal sequence with the sequence of interest. In continuous time we saw that a sinusoid input into a linear system produced an output sinusoid, scaled by a function (the frequency response) that is the Fourier transform of the convolving function.
Fourier Transforms for Infinite Sequences 33 The same thing holds in discrete time; if we convolve xn = e2π i f n with g n , we get, starting by commuting the series, ∞ X k=−∞ g k x n− k = ∞ X k=−∞ g k e 2π i f ( n − k ) = e 2π i f n ∞ X k=−∞ def g k e−2π i f k = g̃( f ) e2π i f n g̃( f ) is again the frequency response, and regarding it as the Fourier transform of the sequence g k gives the definition for the Fourier transform of an infinite sequence xn ∞ X xk e−2π i f k (3.2) x̃( f ) = k=−∞ Another way to get this definition from the continuous-time Fourier transform is to introduce the sampled-data function xs ( t), defined by xs ( t) = ∞ X k=−∞ δ( t − k ) x ( t ) While this is a continuous-time function (actually a generalized function), it is like a sequence in that it contains information about the value of x( t) only at the sample points. Just as a single δ function samples x at a single time, an infinite array of them samples x at the times of the sequence xn . We will explore the effects of such sampling in the next chapter; for now we simply note that the Fourier transform of the sampled-data function is F [ xs ( t)] = Z∞ X ∞ −∞ k=−∞ δ( t − k) x( t) e−2π i f t dt = ∞ X x( k) e−2π i f k k=−∞ which is equivalent to how we defined the Fourier transform for a sequence. Notice that, as is conventional in digital signal processing, we are setting the sample interval to unity; for a sequence indexed by integer values there is no clue to this interval, and one is a convenient choice. x̃( f ) is a function of f , that is, a function on the real line; but it is a function that is periodic, and with period one: x̃( f + m) = ∞ X k=−∞ xk e−2π i f k e−2π imk = x̃( f ) for m an integer It is often more useful, and certainly sufficient, to regard x̃( f ) as defined only over [−1/2, 1/2]. (This is the interval for f ; for ω = 2π f , the interval would be [−π, π].) When we discuss sampling, we will see that these limits 34 Chapter 3. Fourier Theory for Discrete Time Figure 3.1: An example of the relationships between a function, a discrete-time sequence related to it, and their Fourier transforms. 2 The function (top left) is a Gaussian, e− x ; its Fourier transform is shown top right. The bottom left panel shows part of a discrete-time sequence that matches this (with the original function in gray). The bottom right shows the Fourier transform of the sequence; the region of f between the two dashed lines is sufficient to describe it. Note that the transform of the sequence is close to the Fourier transform of the function, except near to these two lines – the effect of aliasing, which we will discuss in Chapter 4. are what is called the Nyquist frequency for a sample interval (∆) of one. The result for F [ xs ( t)] can be generalized to ∆ 6= 1 by replacing k by k∆, in which case the Fourier transform becomes ∞ X x̃( f ) = xk e−2π i f k∆ k=−∞ which is periodic with period 1/∆, and so defined over −1/(2∆) to 1/(2∆); again, this will turn out to be the range between Nyquist frequencies. This result also shows that formulas for arbitrary ∆ can be gotten from those for ∆ = 1 by replacing f with f ∆. If we swap time and frequency in 3.2 we get an expression for a function of time as a sum of an infinite set of sinusoids in time: ∞ X x̃k e2π itk x( t ) = k=−∞ which is just the Fourier series expansion of a periodic function of time in terms of sinusoids: this is the result that Fourier first obtained, and the 3.3. Fourier Transforms for Infinite Sequences 35 way that Fourier theory is often introduced. 
Of course, we need a way to find the coefficients for the sinusoids; these are given by Fourier’s result that Z1/2 x( t) e−2π itk dt x̃n = −1/2 So, swapping t and f again, we get the inverse transform for the Fourier transform x̃( f ) of a sequence xn ; the xk ’s can be regarded as the Fourier series coefficients, which are given by xn = Z1/2 −1/2 x̃( f ) e2π i f n d f So we have a transform pair – albeit not a symmetric one, since the forward transform is an infinite sum and the inverse transform a definite integral. Many of the theorems already outlined for Fourier transforms carry over to this new transform pair, although others (e.g., the derivative theorems) do not. Some examples of ones that do involve integrals that become sums. Using x̃( f ) for the Fourier transform of both the function x( t) and the sequence xn (remember that these are different x̃’s), we have for continuous time: Z ∞ −∞ x( t) dt = x̃(0) and in discrete time, from 3.2 ∞ X n=−∞ xn = x̃(0) For Rayleigh’s theorem we take complex conjugates of the inverse Fourier transform relation, Z1/2 ∗ xn = x̃∗ ( f ) e−2π i f n d f −1/2 which means that ∞ X n=−∞ 2 | xn | = = ∞ X n=−∞ Z1/2 −1/2 ∗ xn x∗n x̃ ( f ) ∞ X = n=−∞ ∞ X n=−∞ xn e xn Z1/2 −2π i f n −1/2 x̃∗ ( f ) e−2π i f n d f df = Z1/2 −1/2 ∗ x̃ ( f ) x̃( f ) d f = Z1/2 −1/2 | x̃( f )|2 d f Chapter 3. Fourier Theory for Discrete Time 36 The convolution theorem becomes one of sums rather than integrals: if we have a sequence z = x ∗ y then z̃( f ) = ∞ X k=−∞ ∞ X = z k e−2π i f k = l =−∞ xl ∞ X k=−∞ ∞ X ∞ X k=−∞ l =−∞ yk−l e−2π i f k = = xl yk−l e−2π i f k ∞ X xl e−2π i f l k=−∞ l =−∞ ∞ X ∞ X xl e−2π i f l ∞ X m=−∞ l =−∞ yk−l e−2π i f (k−l ) ym e−2π i f m = x̃( f ) ỹ( f ) 3.4 Discrete Fourier Transform Now we suppose that our sequence xn is, not infinite, but finite in length, with terms x0 , x1 . . . xN −1 : N terms in all. How do we define the Fourier transform now? We start in what is a somewhat unusual way, or at least one that is different from most treatments in signal processing: we consider the general problem of fitting sine waves to the data. We can always represent the data as a sum of sine waves plus a residual: xn = LX −1 l =0 C l e2π in f l + ǫn n = 0, ...N − 1 where the frequencies f 0 , f 1 . . . are assumed to be specified. If we choose the NX −1 |ǫn |2 , we coefficients C l to minimize the sum of squares of the residuals, n=0 do so using least-squares theory. To apply this we write the sequences as vectors: xT = ( x0 , x1 . . . xN −1 ) C T = (C 0 , C 1 . . . C L−1 ) and the array of exponentials (in l and n) as a matrix A , which is most easily written in terms of its adjoint (the complex conjugate of the transpose) A†:   1 e−2π i f 1 e−4π i f 1 . . . e−2π i( N −1) f 1   e−2π i f 2 e−4π i f 2 . . . e−2π i( N −1) f 2   1 † A =  ... ... ... ...   ... −2π i f L−1 −4π i f L−1 −2π i( N −1) f L−1 1 e e ... e The coefficients C that minimize the sum of squares of the residual is then given by the normal equations ( A † A )C = A † x or BC = A † x 3.4. Discrete Fourier Transform 37 where B is an L × L matrix with elements B pq = NX −1 n=0 e−2π in f p ( e−2π in f q )∗ = The solution for C is then NX −1 e−2π in( f p − f q ) (3.3) n=0 C = B−1 ( A † x) In a general least-squares problem solving this will involve inverting an L × L matrix: a good-sized computational task if L is large. However, for the special form of B given in equation (3.3), and with proper choice of the frequencies, we can simplify the form of B substantially. 
First, we can compute the elements of B analytically rather than by actually computing the sum 3.3, since there is a closed-form expression for finite sums of the exponential function e−2 iπ f n . To get this expression, we first note that for any z, we have LX −1 n=0 z n = (1 − z L ) ∞ X zn n=0 as can be verified through term-by-term expansion; then setting L = 1 we get ∞ X 1 zn = (1 − z) n=0 Combining these two results gives us a closed-form expression for a finite sum; then if we set z = e−2 iπ f , we get NX −1 n=0 e −2 iπ f n = 1 − e−2 iπ f N 1 − e−2 iπ f = µ ¶ e− iπ f N e iπ f N − e− iπ f N e− iπ f e iπ f − e− iπ f = e− iπ f ( N −1) sinπ N f sinπ f (3.4) which gives, for the elements of B, B pq = e− iπ( N −1)( f p − f q ) sinπ N ( f p − f q ) sinπ( f p − f q ) Next, we choose a particular set of frequencies: fl = l N l = 0, 1, . . . N − 1 (3.5) Chapter 3. Fourier Theory for Discrete Time 38 so that there are as many frequencies as there are data values. Given this choice of frequencies, the matrix elements become: B pq = e − iπ( N −1)( p− q)/ N sinπ( p − q) = sin(π( p − q)/ N ) ½ N 0 p=q = N δ pq p 6= q (3.6) where we get the result for p = q by treating p and q as real variables and taking the limit as p → q. Note that this relationship, as given, is only true if the integers p and q are in the range from 0 through N − 1; but this is true for the matrix elements. A more general expression is NX −1 n=0 e −2π iln/ N =e − iπ l ( N −1)/ N sinπl = sin(πl / N ) ½ N l = 0, ± N, ±2 N . . . 0 otherwise (3.7) which we will use a little bit later. Using 3.6, we find that B = N I , where I is the identity matrix. This could not be easier to invert; the inverse M −1 is just N −1 I , which means that the coefficients for our fit are C= 1 † A x N which we can write out explicitly as Cl = −1 1 NX xn e−2π in(l /N ) N n=0 Getting the matrix B to be the identity matrix has required two conditions and a mathematical result. The first condition is that an increment in sequence number is a constant increment in the fitting function; this is a consequence of equal spacing in time. The second condition is that we choose our frequencies according to equation (3.5). Given these specifications, the result we use is that the imaginary exponentials are orthogonal in summation over the finite interval. A major reason for assuming equispaced samples is that without this, we lose orthogonality in the fitting functions. We define the Discrete Fourier Transform (DFT) as our fit, though with a different normalization: x̃k = NX −1 n=0 xn e−2π ink/N k = 0, ....N − 1 (3.8) 3.5. Fourier Theorems for the DFT 39 Figure 3.2: An example of a DFT pair, using a finite length of the same discrete-time sequence shown in Figure 3.1. The right side shows the DFT values (filled circles), with the gray line being the Fourier Transform of the infinite sequence; in this case these look the same because the infinite sequence is very small outside the range shown. The unfilled circle is an “extra” DFT point to show the periodicity of the DFT. which, if we compare it with the definition of the Fourier transform of an infinite sequence (3.2), shows that x̃k = x̃( f ) for f= k N with k = 0, 1, . . . N − 1 The DFT coefficients can thus be considered as samples from the continuousfrequency transform. 
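The claim that the least-squares coefficients are simply the DFT values divided by N is easy to check numerically. In this sketch (numpy; the random test sequence and its length are arbitrary) a general least-squares solver is applied to the matrix of complex exponentials at the frequencies f_l = l/N, and the result matches x̃_l/N.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 32
    x = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # arbitrary test sequence

    n = np.arange(N)
    l = np.arange(N)
    # Columns of A are e^{2πi n f_l} with f_l = l/N (A† is the matrix written above)
    A = np.exp(2j * np.pi * np.outer(n, l) / N)

    C_lstsq = np.linalg.lstsq(A, x, rcond=None)[0]   # general least-squares solution
    C_dft   = np.fft.fft(x) / N                      # C_l = x̃_l / N

    print(np.allclose(C_lstsq, C_dft))               # True: the fit reproduces the DFT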
To obtain the inverse transform, we use the orthogonality relations 3.7 again, writing −1 −1 −1 −1 N −1 NX X 1 NX 1 NX 1 NX xm e2π i(nk−mk)/N = xm e2π ik(n−m)/N x̃k e2π ink/N = N k=0 N k=0 m=0 N k=0 m=0 = −1 1 NX xm N δnm = xn N m=0 (3.9) where the next-to-last step makes use of the orthogonality relationship (3.7). We thus have a transform pair – the DFT (equation 3.8) and the inverse DFT (equation 3.9) – between finite-length sequences of numbers. 3.5 Fourier Theorems for the DFT While the DFT and inverse DFT are completely consistent, there are notable pitfalls in using them, mostly arising from trying to apply lessons in 40 Chapter 3. Fourier Theory for Discrete Time continuous and infinite time, some of which do not carry over into the world of finite-length sequences. The basic confusion lies in supposing that our finite sequence is just a part of a longer sequence, so that we can apply continuous-time (and infinite-range) Fourier theory simply by multiplying the series by a function that is zero over most of the range, and one over some part of it: that is, by a scaled version of the boxcar Π( t)). This kind of multiplication by a function that is nonzero only over some range is called windowing. It is tempting to take the view that the DFT of a finite sequence is the FT of a windowed version of an infinite series; after all, the data we analyze are usually obtained from a longer series in just this way. However, this view is wrong. Once the data have been obtained, the process of getting them ceases to matter: we have to regard them as a finite sequence, not part of some longer one, and use the mathematics appropriate to such sequences. As an introductory example, suppose we compute the value of the inverse DFT of the DFT of a sequence for term numbers outside the original range from 0 to N − 1. If we do this for the regular Fourier transform of a windowed function, we of course get the original function back; having been windowed, it is indeed zero except over a limited range. But this is not what happens for the DFT. From the definitions of the transforms, we can write the “recovered” series xr ec as # " −1 NX −1 1 NX 2π ik(n−m)/ N r ec (3.10) xm e xn = N k=0 m=0 Now, by 3.7, the part in square brackets is nonzero (and indeed equal to one) for n − m a multiple of N ; that is, for n − m = jN , which we can rewrite as n = m + jN , where j is any integer. The usual way to express this relationship between n and m is to say that “ n equals m, modulo N ”, the notation for which is m = n modN which, if we apply it to (3.10), gives us the result xnr ec = xnmod N In this expression n can be any integer, but n modN ranges only from 0 through N − 1: this is just the range over which we actually know the original sequence. The infinite sequence we have thus “recovered,” and which is in some sense equivalent to the finite sequence, is therefore the finite 3.5. Fourier Theorems for the DFT 41 sequence replicated over and over; because it is periodic, it contains no information not in the original sequence – and after all, how could it?. If we want to apply our earlier Fourier theory to an infinite sequence, it has to be to one that is periodic, not to a finite sequence surrounded by zeros on both ends. 
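The periodic character of the “recovered” series can be seen directly by evaluating the inverse-DFT sum at index values outside 0, ..., N − 1. A minimal sketch (numpy; the five-term test sequence is arbitrary):

    import numpy as np

    x = np.array([3.0, 1.0, -2.0, 5.0, 0.5])        # arbitrary finite sequence, N = 5
    N = len(x)
    xt = np.fft.fft(x)                               # DFT coefficients x̃_k

    def x_rec(n):
        """Inverse-DFT sum (3.9) evaluated at an arbitrary integer index n."""
        k = np.arange(N)
        return np.sum(xt * np.exp(2j * np.pi * n * k / N)).real / N

    for n in range(-3, 8):                           # indices inside and outside 0..N-1
        print(n, np.round(x_rec(n), 6), x[n % N])    # x_rec(n) equals x[n mod N], not zero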
Another interesting example comes from supposing that our finite periodic sequence is just a sinusoid

    x_n = e^{2πi f_0 n}

with f_0 real; the discrete Fourier transform of this is

    x̃_k = Σ_{n=0}^{N−1} e^{2πi n (f_0 − k/N)}    k = 0, ..., N − 1

If, but only if, f_0 = l/N, with l an integer, we get

    x̃_k = N for k = l,   0 for k ≠ l

so that this finite sequence can have a (discrete) Fourier transform that is a discrete-frequency δ-function, with only one nonzero value. In continuous time we get a δ-function as the transform only if the time series is an infinitely long sinusoid; the transform of a windowed section of a sinusoid is

    F[e^{2πi f_0 t} Π(t/T)] = T δ(f − f_0) ∗ sinc(f T) = T sinc[(f − f_0)T]

which is not a δ-function for any choice of f_0. This all fits with the concept of replicating the sequence over an infinite interval: f_0 = l/N means that exactly l cycles occur over the finite segment, and repeating the function then gives an infinite, untruncated sinusoid – with a δ-function transform. But if f_0 ≠ l/N, the replicas of the sequence do not join smoothly, so the Fourier transform is not a δ-function – and neither is the DFT. Figure 3.3 shows an example.

Figure 3.3: The top two pairs of plots show a 12-point sequence and its DFT, where the sequence is x_n = cos(2π f n), with f = 1/6 in the top left plot and f = 1/6.5 in the middle left plot. For each, the right-hand plot shows the corresponding DFT. Especially when plotted on a log scale (lowest plot), a very large difference is apparent between the DFT for a cosine with an integral number of cycles (for which the DFT sequence is a δ sequence, up to roundoff error), and the DFT for one which is merely close to an integral number of cycles.

3.5.1 Shift Theorem

We next consider what is the effect in the time domain of some operation carried out in the frequency domain, with the “time domain” being a finite sequence. For conventional Fourier transforms, the shift theorem states that, if we take the Fourier transform of a function x(t) shifted by an amount τ, we get

    F[x(t − τ)] = ∫_{−∞}^{∞} x(t − τ) e^{−2πi f t} dt = ∫_{−∞}^{∞} x(t′) e^{−2πi f (t′ + τ)} dt′ = x̃(f) e^{−2πi f τ}

and so if we multiplied a Fourier transform x̃(f) by e^{−2πi f τ}, and took the inverse transform, we would get x(t − τ).

Now we consider the parallel case for finite sequences. Call the DFT of x_n, x̃_k, and multiply x̃_k by e^{2πi m k/N}. What does the inverse DFT produce? We use x^s_n for the series whose transform is x̃_k e^{2πi m k/N}; writing down the definitions of the DFT and inverse DFT, we get

    x^s_n = (1/N) Σ_{k=0}^{N−1} x̃_k e^{2πi m k/N} e^{2πi n k/N}
          = (1/N) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} x_l e^{−2πi l k/N} e^{2πi m k/N} e^{2πi n k/N}
          = Σ_{l=0}^{N−1} x_l [ (1/N) Σ_{k=0}^{N−1} e^{2πi k (m + n − l)/N} ]

The sum in square brackets is nonzero only when l = (m + n) mod N, since then m + n − l is a multiple of N. Thus

    x^s_n = x_{(n + m) mod N}

To see what this means, it helps to take a specific value for m; if we take m = 3, then n = 0 picks out x_3, n = 1 picks out x_4, and so on:

    x^s_0 = x_3    x^s_1 = x_4    ...    x^s_{N−4} = x_{N−1}

This is all as we expect, but the “right end” of the back-transformed series is

    x^s_{N−3} = x_{N mod N} = x_0    x^s_{N−2} = x_{(N+1) mod N} = x_1    x^s_{N−1} = x_2

which is to say, the start of the original series. We can look at this in two ways. One way is to view this as a linear shift applied to a replicated, periodic sequence.
The other way is to view it as a circular shift, where we imagine the finite sequence to be on a circle, so that any shift moves the end to the beginning. Chapter 3. Fourier Theory for Discrete Time 44 3.5.2 Convolution Theorem for the DFT A much more important result, with parallels to the one just given, is provided by the extension of the convolution theorem to discrete, finite-length series. We know that the Fourier transform of the convolution of two functions is the product of the two Fourier transforms; so what is the sequence produced by multiplying two sets of DFT coefficients and taking the inverse DFT? More precisely, suppose xn and yn to have DFT’s x̃k and ỹk ; what series z n has the DFT z̃ k = x̃k ỹk ? As before, the derivation just involves writing the inverse transform, replacing z̃ by the product x̃ ỹ, substituting in the DFT expressions for these, and finally collecting all the exponentials together: zn = = = −1 −1 1 NX 1 NX z̃ k e2π ink/N = x̃k ỹk e2π ink/N N k=0 N k=0 −1 −1 NX −1 NX 1 NX xl e−2π ilk/N ym e−2π imk/N e2π ink/N N k=0 l =0 m=0 NX −1 l =0 xl NX −1 m=0 ym −1 1 NX e2π ik(n−m−l )/N N k=0 Once again, we need the sum over k to be nonzero; the orthogonality relation (3.7) then implies that ( n − m − l )modN = 0 or m = ( n − l )modN giving the final result that the sequence z is given by zn = NX −1 l =0 xl y(n−l )mod N This is somewhat like the convolutions we have seen before, though not exactly. Again, an example may help; taking n = 5 gives z5 = x0 y5 + x1 y4 + . . . x5 y0 + x6 yN −1 + x7 yN −2 + . . . xN −1 y6 meaning that as the convolution sum goes off the beginning of the yn sequence it goes around to the end. One way to visualize this is to imagine both sequences mapped onto circles: a shift is then merely a rotation of one circle relative to another, and the kind of convolution described by the above equation is therefore called circular convolution. 3.5. Fourier Theorems for the DFT 45 If we attempt to convolve two series in the time domain by multiplying their DFT’s, we must be careful to ensure that such “wraparound effects” do not create any problems. These effects complicate such seemingly simple techniques as removing certain frequencies from data by taking the DFT, setting some of the DFT values to zero, and taking the inverse DFT. This can work – but we must remember that the series we get back can show the effects of data at one end influencing data at the other, since using the DFT makes the two ends contiguous. It is indeed possible (and indeed often most efficient) to convolve two sequences by multiplying their transforms. To avoid wraparound problems, we need to pad both time series with zeroes before taking the DFT’s; with enough zeros, the circularity of the convolution has no effect. Specifically, if we have an M -term series, xn and an L-term series yn , we will get the correct result if we pad both out to length M + L − 1 before doing the DFT. The circular convolution is then zn = M+ L−2 X l =0 xl y(n−l )mod M +L−1 = M −1 X l =0 xl y(n−l )mod M +L−1 giving (as examples from the two ends of z n ) z0 = x0 y0 + x1 yM +L−2 + . . . + xM −1 yM +L−M z M +L−1 = x0 yM +L−2 + . . . + xM −2 yM +L−M + xM −1 yL−1 For the first sum, all terms but the first are zero; for the second sum, all terms but the last, so we get the same series z n as we would from a time-domain convolution in which both sequences were padded with infinite numbers of zeroes on both ends. 
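Both the wraparound effect and the zero-padding cure can be demonstrated in a few lines. In this sketch (numpy; the two short sequences are arbitrary), multiplying the DFTs of the unpadded sequences gives a circular convolution that differs from the ordinary convolution, while padding both to length M + L − 1 reproduces it exactly.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])            # M = 4 (arbitrary)
    y = np.array([1.0, -1.0, 0.5])                # L = 3 (arbitrary)
    M, L = len(x), len(y)

    # Circular convolution: multiply DFTs of the unpadded sequences (both length M)
    z_circ = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y, M)).real

    # Linear convolution via zero-padding to M + L - 1
    P = M + L - 1
    z_pad = np.fft.ifft(np.fft.fft(x, P) * np.fft.fft(y, P)).real

    z_lin = np.convolve(x, y)                     # direct time-domain convolution

    print(z_circ)                                 # shows the wraparound contamination
    print(np.allclose(z_pad, z_lin))              # True: padding removes it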
3.5.3 Symmetry Relations The DFT output runs from 0 to N − 1, whereas we have been plotting the Fourier transform from negative to positive frequency. The connection is given by the result x̃N −k = NX −1 n=0 xn e−2π in( N −k)/N = NX −1 n=0 xn e−2π in(−k)/N = x̃−k This result is consistent with the notions of periodicity that we have been discussing; but it also means that the “upper half” of the DFT sequence Chapter 3. Fourier Theory for Discrete Time 46 ( k from N /2 to N − 1) can also be thought of as the negative frequencies k = − N /2 to −1. As with the Fourier transform in continuous time, we have symmetry rules for the transforms of special classes of sequences: for example, if the sequence xn is real, the transform x̃ is Hermitian. Combining this with the previous result, we see that, for a real sequence, half of the spectrum is redundant: usually we say this is the top half, or equivalently the negative frequencies. Formally, we have that if xn is real, then x̃N −k = NX −1 n=0 xn e2π ink/N = NX −1 n=0 ³ ´∗ x∗n e−2π ink/N = x̃∗k so the top half is just the complex conjugate of the bottom half. This makes sense, since if we put in N real numbers, we expect to get N independent numbers out – and we do, namely, N /2 independent complex DFT coefficients. Actually, for xn real, two x̃’s are real, namely x̃0 = NX −1 xn n=0 and x̃N /2 = NX −1 n=0 (−1)n xn Our final theorem for the DFT is Parseval’s relation: NX −1 n=0 | xn |2 = = = NX −1 n=0 xn x∗n = NX −1 −1 −1 NX 1 NX 2π ikn/ N x̃∗l e−2π iln/N x̃ e k N 2 n=0 k=0 l =0 −1 −1 NX NX −1 −1 NX −1 1 NX 1 NX ∗ 2π in(k−l )/ N ∗ x̃ x̃ e = 1 x̃ x̃ k k l k N 2 n=0 k=0 l =0 N 2 k=0 n=0 −1 1 NX | x̃k |2 N k=0 which is useful for checking the normalization (and accuracy) of any actual DFT algorithm. 3.6 The Dirichlet Kernel We now use the result (3.4) to find the Fourier transform of the discretetime version of the rectangle (or boxcar) function; remember that in continuous time the Fourier transform is sinc( f ). In discrete time the Fourier transform is called the Dirichlet kernel; this is worth a section because 3.6. The Dirichlet Kernel 47 Figure 3.4: The Dirichlet kernel D N plotted for N equal to 20 and 200, on the left on linear scales and on the right on log scales. it plays a fundamental role in many topics in Fourier analysis and signal processing. We consider a rectangle function in discrete time that is centered on the origin and runs from − M to M ; the total length N = M + 1 is then odd, but the derivation may be extended to an even length. The sequence is: wn = ½ 1 | n| ≤ M 0 | n| > M If we now form the Fourier transform of this (in continuous frequency, since this is an infinite sequence), we find W(f ) = nX =∞ n=−∞ wn e−2 iπ f n = M X n=− M ( e−2 iπ f )n = e2 iπ f M " n= 2M X n=0 e−2 iπ f n # We evaluate the sum in square brackets using equation (3.4), with 2 M = N − 1; the result is sin π(2 M + 1) f WM ( f ) = sin π f The Dirichlet kernel is normalized slightly differently: DN ( f ) = sin π N f N sin π f 48 Chapter 3. Fourier Theory for Discrete Time which means that WM ( f ) = (2 M + 1)D 2 M +1 ( f ), and also that at f = 0 the value of the Dirichlet kernel is one. Sometimes the Dirichlet kernel is normalized so that D 2 M +1 ( f ) = WM ( f ), in which case the value at zero depends on N but the total area under all the lobes is invariant. Either way, the zeroes are at f = l / N , with |l | ranging from 1 through N /2. The first minimum, which is close to f = 3/2N, is about − N /5, the next extremum is about N /7, and so on. 
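A quick numerical check of these lobe values (numpy; N = 201 is an arbitrary odd length) evaluates sin(πNf)/sin(πf) on a fine grid just beyond the main lobe and prints the first two extrema as fractions of N.

    import numpy as np

    N = 201                                    # arbitrary odd length, N = 2M + 1
    f = np.linspace(1.0 / N, 6.0 / N, 20001)   # just past the main lobe
    W = np.sin(np.pi * N * f) / np.sin(np.pi * f)

    # crude extremum finder: sign changes of the first difference
    dW = np.diff(W)
    ext = np.where(np.sign(dW[1:]) != np.sign(dW[:-1]))[0] + 1

    for i in ext[:2]:
        print(f"extremum near f = {f[i]*N:.2f}/N, value/N = {W[i]/N:+.3f}")
    # prints the first minimum (a negative value near -0.22 N) and the next maximum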
Away from zero the function thus decays rather slowly with increasing f . To see the connection between the Dirichlet kernel and the sinc function, put y = N f and let N go to infinity while keeping y fixed: this is analogous to finer and finer sampling. Then D N ( f ) behaves like sin π y sin π y → = sinc( y) N sin π y/ N πy 3.7 Computing the DFT: the Fast Fourier Transform Many methods of signal processing and data analysis require that we compute the DFT – sometimes many DFT’s. This computation can be done rapidly, though this is not immediately obvious; the DFT formula, equation (3.8), appears at first sight to be the multiplication of an N -length vector (the sequence xn ) and an N -by- N matrix (the e’s) to produce another N -vector (the x̃k ’s). Computing such a matrix multiplication in general requires a multiple of N 2 operations (additions and multiplications), an algorithm that we say takes polynomial time: the number of operations is a polynomial (above a linear one) of the number of data. It might also appear that we have to compute N 2 sines and cosines (for the matrix of e’s), but this is not so; because of the periodicity of e−2π im/N , we have e2π ink/N = e−2π i(nk)mod N /N , so that while the matrix of e’s has N 2 elements there are only N distinct values. In fact, the periodic nature of e2π ink/N allows us to compute the DFT with many fewer than N 2 operations. For certain N we can compute all the x̃k ’s with only N log2N operations, so that the computation can be done N / log2 N times faster: for example, 100 times faster for N = 1024. The algorithm that does this is called the Fast Fourier Transform (or FFT). You should realize that the FFT is just an algorithm for computing the DFT quickly, not a definition of the DFT. A common error is to suppose that 3.7. Computing the DFT: the Fast Fourier Transform 49 any restrictions that apply to an FFT algorithm (notably restrictions on N ) also apply to the DFT; but this is not so: the DFT is defined for any N . The FFT algorithm is not completely obvious, as is shown by how many people failed to develop it: at the time the FFT was first announced, in 1965, many groups were computing DFT’s using matrix multiplication, and it was with some amazement that they learned how much time they could save just by using a different procedure.2 At the same time, the FFT was obvious enough that by 1965 it had already been invented independently several times: but either the inventors dismissed it as too small an improvement to be worth publishing (true for N very small), or the invention was simply ignored.3 The publication, and popularization, of an FFT algorithm for N = 2 M , by Cooley and Tukey (1965) led to a complete change, and since then, devising faster transforms, and fast techniques for other transforms, has become a subarea of signal processing. The FFT is something you are very unlikely to program yourself – indeed, you shouldn’t, since other people have spent lots of time doing so. We have therefore relegated our derivation of it to Appendix A. But it is worth discussing the FFT because it illustrates several points: 1. The FFT shows that considerable increases in speed can come from using a good algorithm; for whatever problem you will be doing, you should become familiar with how the time needed (the operations count) scales with problem size. There is a big difference between scaling as N log N , which is not much worse than scaling as N (linear time), and scaling as a power of N (even if the power is only N 2 ). 
Even worse is when a computation scales not in polynomial time but in exponential time: that is, the operations count scales as e N . If the only algorithms available have this scaling, computational methods are impracticable unless N is quite small. 2. A related lesson from the FFT is that the scaling with N is much 2 For example, Alsop and Nowroozi (1966), computing spectra of seismic data for free oscillations, reported that their computation time shrank from 23 minutes to 2.4 seconds. 3 The first (and uninfluential) discovery of an FFT algorithm was by Gauss in 1805 (Heideman et al., 1985). A notable example of the use and non-use of the FFT in the early 1960’s was here at SIO: Philip Rudnick, at MPL, had been using his own version of an FFT for a number of years, but only published it after Cooley and Tukey published theirs. At the same time, Walter Munk’s group, at IGPP, was computing transforms in the less-efficient way. Chapter 3. Fourier Theory for Discrete Time 50 Figure 3.5: Fraction of highly-composite numbers for all values less than N, as a function of N. “Highly-composite” means having no prime factors larger than 23, a not-uncommon restriction in some implementations of the FFT. more important than the details of implementation. The FFT at once gave at least two orders of magnitude improvement in speed for even moderate values of N . Thirty-five more years of effort have produced about a factor of four improvement between a general-purpose FFT from 1969 (Singleton, 1969) and the best one available in 2004.4 3. The shortest and simplest FFT algorithms are for N a power of 2. But this is not a requirement: there are algorithms that are nearly as fast if N is highly composite: that is, can be factored into a product of small primes. While some special cases ( N prime) also can be done quickly, you should if possible avoid values of N whose 4 FFTW, currently the fastest general-purpose program (hence the name: Fastest Fourier Transform in the West). For information on FFTW, and links to lots of other FFT programs and results, see http://www.fftw.org. Note that FFTW only gives its best performance if you will be doing many transforms of the same length within a single execution of the program – which may well not be true. 3.7. Computing the DFT: the Fast Fourier Transform 51 factors include large primes, Fortunately this is not difficult. Figure 3.5 shows the percent of lengths available, as a function of length; while this decreases towards zero, it does so slowly: up to lengths of 10000, about 5% of the lengths are feasible. Another way to put this is that for lengths greater than 1000, the largest change in series length needed to be at an acceptable value is almost always less than 1%. You should approach with care any DFT program that will handle all N ; if N is not highly composite, then what the program actually does may be a slow Fourier transform. 4. If, as is usually the case, your sequence xn is real, you should not simply treat it as complex with all imaginary parts zero; you can get a factor of two improvement in speed by instead treating alternate terms as real and imaginary parts of a sequence of length N /2. The DFT result then needs some auxiliary processing, which we describe in Appendix A. 5. If you only need the transform x̃ at a few frequencies, or at those that are not automatically produced by the DFT relationship, there are special algorithms for rapid computation. We describe one, the Goertzel algorithm, in Appendix A. 
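The Goertzel algorithm just mentioned is short enough to sketch here; this is the standard textbook recursion (written in Python, with the FFT used only as a check), and not necessarily identical in detail to the version in Appendix A. It computes a single DFT coefficient with one multiply-add pair per sample, which is worthwhile when only a few frequencies are needed.

    import numpy as np

    def goertzel(x, k):
        """Compute the single DFT coefficient x̃_k of the sequence x by Goertzel's recursion."""
        N = len(x)
        w = 2.0 * np.pi * k / N
        c = 2.0 * np.cos(w)
        s1 = s2 = 0.0
        for xn in x:                      # second-order recursion, one pass through the data
            s = xn + c * s1 - s2
            s2, s1 = s1, s
        s = c * s1 - s2                   # one extra step with a zero input
        return s - np.exp(-1j * w) * s1   # combine the two state variables

    rng = np.random.default_rng(1)
    x = rng.standard_normal(64)           # arbitrary real test sequence
    k = 5                                 # the single frequency bin we want
    print(np.allclose(goertzel(x, k), np.fft.fft(x)[k]))   # True

Because the recursion needs only the running state and the final two values, it is also convenient when samples arrive one at a time.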
CHAPTER 4

SAMPLING THEORY FOR TIME-SERIES

4.1 Introduction: Logging Data

We now consider the transition between continuous time and discrete time, namely the sampling of a function at specified times (or, for spatial data, specified places) to get the numbers that we later process. This sampling is what takes place in a digital datalogger, which converts an analog function of continuous time (almost always a voltage) x(t) into a sequence of numbers x_n, each number being the value of x(t) at some instant: x_n = x(t_n). For this chapter we use x_n for a single sample and {x_n} for a sequence of samples.

[Block diagram: raw voltage → antialiasing filter → filtered voltage → sample & hold (driven by a clock) → stepped voltage / discrete sequence → quantizer (A/D converter) → digital sequence. The sample & hold marks the division between continuous and discrete time.]

The block diagram above shows the steps of converting an analog signal to a digital sequence; Figure 4.1 shows the corresponding time series. The first step is to remove high frequencies from the continuous function; this is done with an analog filter (we will encounter a simple analog filter in the next chapter), called an anti-aliasing filter for reasons we will see below. The filtered signal now varies smoothly; it goes to a sample-and-hold circuit, which produces a continuously-varying voltage that is held at fixed values for a time determined by a clock: each time a clock signal goes to the sample-and-hold, the value of the output changes to the value of the input at that time (t_n). We call this the “stepped voltage”. This voltage is still a function of time x(t), but it has the same information as, and so can be regarded as equivalent to, a sequence of values x_n.

Figure 4.1: The upper block diagram shows the steps included in recording an analog signal and converting it to a digital sequence; the lower plots are cartoons of the different forms of the signal at each step. See text for details.

In the block diagram this point is the division between continuous and discrete time, though in the datalogger there is no explicit division. The next stage in the datalogger is a quantizer, or analog-to-digital converter (usually called “an A to D”). This device is a nonlinear system which, given a real number x, produces a value x̆ that can be represented in a finite number of digits; the reason for the sample-and-hold is to give a constant input voltage to the A-to-D for the amount of time needed for it to compute its output value. We naturally want x̆ to be as close to x as possible, and it can be an interesting problem to choose a representation for x̆ and then design the quantizer to make the difference x̆ − x unimportant. However, in practice this is usually not too difficult, as is suggested by the bottom two plots, which show the discrete sequence x_n and the digital sequence x̆_n, quantized by rounding to the nearest integer. For most of this chapter we assume that we have the real-valued x_n’s available; we discuss quantization error very briefly at the end.

Figure 4.2: A mathematical model for sampling, shown in pictures. We start with a function (on the real line), then create another function whose values depend only on those of the sampled data. The arrows are schematic for differently-weighted δ-functions. If we can reconstruct the original function from the sampled-data function, we have lost no information by sampling. The gray lines in the left and right panels are the zero of the y-axis.
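Before turning to the sampling problem, a toy simulation of the last two stages of this chain may be useful. In the sketch below (numpy; the synthetic “voltage”, the sample interval, and the least count are arbitrary choices) a smooth signal is sampled at the clock times and rounded to integer counts, and the resulting quantization error is roughly uniform, with a standard deviation near 0.3 least counts, as discussed at the end of the chapter.

    import numpy as np

    # A smooth "analog" voltage (synthetic; frequencies and amplitudes are arbitrary)
    t_fine = np.arange(0.0, 60.0, 0.001)
    v = 2.0 * np.sin(2 * np.pi * 0.3 * t_fine) + 0.5 * np.sin(2 * np.pi * 1.1 * t_fine)

    x = v[::50]                              # sample at the clock times (every 0.05 s)
    eps = 0.01                               # least count: volts per output integer
    x_digital = np.round(x / eps)            # the A-to-D output, in integer counts
    err = x - eps * x_digital                # quantization error, in volts

    # The error is roughly uniform on (-eps/2, eps/2): std ≈ eps/sqrt(12) ≈ 0.29 eps
    print(err.std() / eps)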
4.2 The Sampling Problem What we shall discuss for most of the chapter is how to choose the t n ’s such that, given { xn } we could (in principle) reconstruct x( t) – and what errors we incur if we cannot. Classical interpolation theory would be one way of studying this; but a different approach, working in the frequency domain, turns out to be much more useful. To avoid what would be substantial complications, we require the samples to be equispaced in time; the spacing is the sample interval ∆, given in whatever time units we are using, so the times are: t n = n∆ n = −∞, . . . − 2, −1, 0, 1, 2, . . . ∞ In the last chapter we implicitly set ∆ to be unity; for now we do not do this, though we will assume it to be unity later on. Slight departures of t n from absolutely equal spacing do not much matter, since they can be viewed as adding some error (noise) to the data; for a timing error ǫ, the sample value x( n∆ + ǫ) will be approximately x( n∆) + ǫ ẋ( n∆), and so long as the second term is usually smaller than the first there will not be a problem. As we saw in our earlier discussion of the Discrete Fourier Transform, larger irregularities complicate analysis greatly and are best avoided if at all possible, if necessary by filling in missing data. 4.3. The Nyquist Theorem 55 To get from functions to sequences we need an intermediate step. We start with x( t), a function in continuous time; mathematically we say that it is a function on the real line, which we would express pictorially as the left-hand panel of Figure 4.2. The sequence xn , on the other hand, is just an array of numbers, which we would express pictorially as the right-hand panel of Figure 4.2. We connect these two using the delta-function δ( t − t n ); the (generalized) function x ( t ) δ( t − t n ) is zero at t 6= t n , and contains the value x( t n ) in the weighting of the deltafunction. It thus is equivalent to the sampled value xn : not equal to it, since one is a function and the other a number, but equivalent in the sense that either could be constructed given the other. Such a link between functions and numbers is just what we seek.1 For equispaced sampling the sequence xn is thus equivalent to the function ∞ ∞ X X x( t ) δ( t − n∆) = x( n∆)δ( t − n∆) = xs ( t) n=−∞ n=−∞ which is shown in the middle panel of Figure 4.2 (using arrows for deltafunctions). We can view this function as formed by multiplying x( t) by an infinite “comb” of delta-functions. The normalized version of such a comb is the III function (so called by Bracewell because it looks like the Hebrew letter shah): ∞ X III( t) = δ( t − n ) n=−∞ Changing the spacing to ∆ rescales the III, making our sampled-data function, for a sample interval ∆, xs ( t) = 1 III( t/∆) x( t) ∆ 4.3 The Nyquist Theorem To return to our original question, we can, given xn easily get xs ( t) as a weighted array of delta-functions, but how do we get (and can we get) x( t) 1 It might seem more in keeping with the action of a sampling circuit to multiply x( t) by a function that is 1 at t = t n and zero elsewhere. But such a function, even multiplied by any finite value, has an integral of 0. This makes any integral transform, such as the Fourier transform, also 0: not very useful. Chapter 4. Sampling Theory for Time-Series 56 from xs ( t)? We approach this problem by comparing x̃s ( f ) = F [ xs ( t)] with x̃( f ) = F [ x( t)]. 
By the convolution theorem, · ¸ 1 1 x̃s ( f ) = F III( t/∆) = ∗ F [III( t/∆)] ∆ x( t ) ∆ x̃( f ) To proceed further we need F [III( t)], which is F [III( t)] = Z∞ e−2π i f t −∞ ∞ X n=−∞ δ( t − n) dt = ∞ X e 2π i f n n=−∞ An extremely nonrigorous way to see what this might give is to observe P that for f an integer this is 1, and hence infinite; while for f noninteger 2π i f n assumes all possible values (modulo 2π) and so the sum will be zero. This suggests what is the actual result: F [III( t)] = III( f ) that is, that the III function is its own Fourier transform. This can in fact be proved rigorously. We then have x̃s ( f ) = x̃( f ) ∗ 1 F [III( t/∆)] = x̃( f ) ∗ III(∆ f ) ∆ Convolving any function with a δ-function simply recovers the original, and if we convolve with δ( f − f 0 ) we recover x̃( f ) shifted by an amount f 0 . Thus x̃s ( f ) = x̃( f ) ∗ III(∆ f ) = ∆ =∆ ∞ X n=−∞ ∞ X n=−∞ δ( f − n/∆) ∗ x̃( f ) x̃( f − n/∆) Our sampled-data series xs thus has a Fourier transform that can be gotten by replicating that of the original series at frequency interval ∆−1 , and adding up the replicas. Figure 4.3 shows the whole process. We can now show that if x̃s has a certain form, we can recover x( t) from xs ( t). The requirement for this is called the Nyquist criterion, and is given in the frequency domain: 4.3. The Nyquist Theorem 57 Figure 4.3: The upper panels show a function, then a sampled version of it, then a version sampled half as often. The lower panels show the corresponding Fourier transform: the original transform (this is actually a made-up function), and the replicated-and-summed version that results from sampling. In the middle, the transform replicas are separated; on the right, they are not. The right plot shows the replicas (aliases) with dashed lines. x̃( f ) can be recovered from x̃s ( f ) and hence x( t) recovered from xs ( t), if and only if 1 def = fN (4.1) x̃( f ) = 0 for |f | > 2∆ where f N is called the Nyquist frequency. This requirement for x̃( f ) comes from its replicated nature. The replicas are spaced 1/∆ apart; if the Nyquist criterion holds, each replica will remain separate, as in the lower middle panel of Figure 4.3. However, if x̃( f ) is nonzero above f N , the replicas will overlap, and when summed will blend together, as in the lower right panel of Figure 4.3. There is then no way of disentangling them. Chapter 4. Sampling Theory for Time-Series 58 If the Nyquist criterion holds, we also have a way of getting x̃( f ) from x̃s ( f ): we simply set all replicas but the center one to zero. by multiplying x̃s ( f ) by a suitable boxcar function, scaled to be zero outside | f | < f N : x̃( f ) = x̃s ( f )∆Π( f ∆) Taking inverse transforms we have ∞ X x( t) = xs ( t) ∗ sinc( t/∆) = sinc( t/∆) ∗ x( n∆)δ( t − n∆) n=−∞ ¶ µ ∞ X t − n∆ = x( n∆)sinc ∆ n=−∞ (4.2) The final sum is the expression for getting x( t) from xn ; note that if we put t = m∆ the zeros of the sinc function fall on all the other values than the one of interest, so we recover only one, x( m∆), as we should. It is useful to look at a pure sinusoid to see how the Nyquist criterion works. Suppose we have a sine (or cosine) with f = f N , so there are two points per cycle. The cosine then seems to be recoverable, but the sine is zero at all points sampled. Since 1 [δ( f − 1/T ) − δ( f + 1/T )] 2i we see that for ∆ = T /2, the replicas just meet, causing the opposing deltafunctions to cancel, so that x̃s ( f ) = 0, consistant with the sampling. 
But for a cosine F [cos(2π t/T )] = 1/2[δ( f − 1/T ) + δ( f + 1/T )] F [sin(2π t/T )] = and for ∆ = T /2, the two δ-functions meet and reinforce, so that x̃( f ) and thus x( t) can be recovered. If x̃( f ) = 0 for | f − f c | < f b , we say that x( t) is bandlimited, with a bandwidth 2 f b and center frequency f c . A sine wave modulated in amplitude would be an example of such a function. For the Nyquist criterion to hold, x( t) must be bandlimited about zero with a bandwidth 2 f N ; when we sample data we ideally want ∆ to have a value that will make this true. 4.4 Aliasing But what if x( t) is not bandlimited? And it never will be, since no transient function is.2 The best way to understand the resulting errors is again to 2 The apparent paradox that no continuous function can be limited in both time and frequency (which is what we expect of all signals we record) is discussed by Slepian (1976). 4.4. Aliasing 59 look at replication of the Fourier transform in the frequency domain. To start with, we suppose that x( t) is nearly bandlimited and falls to near zero at large frequency (otherwise we might as well give up), and that f N is large enough that only the overlap with the two nearest neighbors matters much: x̃s ( f ) ≈ x̃( f ) + x̃( f − 1/∆) + x̃( f + 1/∆) For frequencies near f N , the major overlap is with x̃( f N − 1/∆) ≈ x̃(− f N ); the other overlap is with x̃(3 f N ) which we assume is smaller. Thus for “large” positive frequency, the major contamination is from energy at “large” negative frequency. We have used the term “replica” to emphasize what the convolution with III does, but this is not the usual terminology – usually the replicas are called aliases of the transform of the continuous-time function, the contaminating energy being termed aliased energy, and the whole process aliasing. The expression (4.4) shows that for a complex-valued sequence the effect of sampling is to alias energy below the negative Nyquist frequency onto positive frequencies just below the positive Nyquist frequency. In the more usual case that x( t) is real, you may remember that x̃(− f ) = ∗ x̃ ( f ), so we can write the “nearest-neighbor” aliases as x̃s ( f ) ≈ x̃( f ) + x̃∗ (2 f N − f ) + x̃(2 f N + f ) It is useful to just consider | x̃| and ignore the complex conjugates. Consider the positive frequencies from 0 to f N , and consider what the aliased frequencies are as f goes from 0 to f N (which we write as 0 → f → f N : 0 → f 2fN → 2fN − f 2fN → 2fN + f → f N (main) → f N (first alias) → 3 f N (second alias) (4.3) We could continue this on indefinitely, but it serves to show that, for x( t) real, we can regard sampling as mapping the real line of positive frequencies, from zero to infinity, into the finite interval 0 to f N ; if x( t) is complex, sampling maps the entire real line into − f N to f N . Another way of seeing this is to note that our result (4.2) for the reconstructed x( t) is a sum of sinc functions, and hence will be, as they are, bandlimited. Given a sequence of samples, we can construct an infinity This paper is well worth reading for its commentary on the relationship between mathematical idealizations and the real world. 60 Chapter 4. Sampling Theory for Time-Series Figure 4.4: The left panel shows a made-up Fourier transform (amplitude), with two possible locations for the Nyquist frequency. The next two panels show what the transform would be for these two choices, with the aliases dashed and their sum solid. 
Choice A gives a badlyaliased result, choice B one that is probably acceptable, since the only aliasing is near the Nyquist frequency. of functions on the real line – but only one of these will be bandlimited. For real functions and sequences, that bandlimited one has the transform given by mapping [0, ∞] into [0, f N ] according to the rule (4.3), continued to infinite frequency. One way to visualize this rule is as an accordion folding of the frequency axis (a fold at f N brings 2 f N to 0; then another fold at 2 f N brings 3 f N to f N , and so on). We can write a formula for this: for f > f N , let N be the integer part of f / f N ; then f maps to  f − N fn N even  for f′=  ( N + 1) f n − f N odd In practice, we can do two things to minimize the effect of aliasing: • Before we sample the data, shape its Fourier transform using an analog filter (physical or electronic) to remove high frequencies: this is the “antialias” filter in the block diagram at the start of the chapter. • Choose ∆ so that what aliased energy there is will be much less than the energy in the main transform in the frequency band of interest. We do not want “significant” contamination, but what level is significant depends on the problem in hand. Figure 4.4 shows an example; while choice B allows some aliasing, the amount is small except near f N , which is probably acceptable. Near f n , x̃( f ) is always slightly 4.5. Decimation 61 aliased; our concern is usually with lower frequencies. In actual data an indication of aliasing is successive values with alternating sign (perhaps around a trend); this behavior means that there is substantial energy near f N , most likely because of significant aliasing. For a somewhat specialized example of what aliasing might do, consider the effect of 60-Hz noise on data sampled less often (unless it is filtered out, 60-Hz hum is common on most electronic signals). If ∆ is 1 second, N (see above) is 120 ( f N = 1/2 Hz) and f ′ is 0. This would be acceptable, except that “60-Hz” line noise is really 60 Hz ±700µ Hz, which maps into 0 to 700 µHz; this would interfere with a signal with a period of 12 hours. We can avoid this peblem by sampling at some different interval; if we sampled at 2048 times per hour, f N = 0.2844 Hz, N = 210, and f ′ = 0.2667 Hz, so that any 60-Hz energy aliases to near the Nyquist, out of the way of lower-frequency signals. 4.5 Decimation A very common activity in digital signal processing is decimation. We say that we have decimated a discrete-time series by n if we take every n-th value to create a new and shorter series with sample interval n∆. (Another term for this is downsampling.) The new series is related to the one it came from by exactly the same aliasing rules that apply to sampling of a continuous-time series; the only difference is that the folding of the frequency axis extends only up to the Nyquist frequency for the original series, instead of to infinity. For example, if we decimate by two, there is a single folding, so that frequencies from the old Nyquist f N to the new one (1/2 f N ) fold into the new interval from 0 to 1/2 f N . Just as in the continuous case, the degree to which the decimated series is aliased depends on the relative levels of the Fourier transform above and below the new Nyquist frequency; before decimation we may need to digitally filter the data, as we describe in the next chapter, to avoid aliasing of the output. 
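The folding rule above is easy to encode, and doing so gives a quick check of the 60-Hz example; the same rule, with f_N taken as the new Nyquist frequency, also tells you where energy lands after decimation. A minimal sketch (plain Python; the function name is ours):

    def alias(f, f_nyq):
        """Fold a frequency f into the band [0, f_nyq] (real data, sample rate 2*f_nyq)."""
        if f <= f_nyq:
            return f
        n = int(f // f_nyq)                 # integer part of f / f_N
        if n % 2 == 0:
            return f - n * f_nyq            # N even
        return (n + 1) * f_nyq - f          # N odd

    # 60-Hz line noise, 1-second sampling (f_N = 0.5 Hz): aliases to 0 Hz
    print(alias(60.0, 0.5))
    # 60 Hz + 700 μHz lands in the 0 to 700 μHz band, where tidal-period signals live
    print(alias(60.0007, 0.5))
    # Sampling 2048 times per hour: f_N = 0.2844 Hz, and 60 Hz aliases to about 0.267 Hz
    print(alias(60.0, 2048.0 / 3600.0 / 2.0))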
The opposite of decimation is putting additional values between each value in a sequence to give a series sampled more often; this is called either interpolation or, what else, upsampling. Although it seems counterintuitive, we can do this just by setting the new values to zero and then digitally filtering the series to remove energy above the old Nyquist frequency. One reason for doing such an interpolation would be to create a smooth series for conversion back into analog form: standard practice in digital music players.

Decimation and interpolation can be combined to change sample rates; we might, for example, decimate by 5 and then interpolate by 3 to get a new rate that is 0.6 of the old. But whatever we do, we need to worry about, and correct for, any possible aliasing.

4.6 Violating the Nyquist Criterion

We have discussed the importance of not aliasing data – but there are situations in which the data can violate the Nyquist criterion without loss of information. We might, for example, have additional information at the sample times, such as the values of both x(t) and its derivative ẋ(t), though this is rarely the case. More often, the signal has a special structure that allows us to sample less often:

• A periodic signal can, under certain conditions, be measured adequately while being sampled at random times: one practical area where this occurs is in measurements of regularly variable stars, since for the longest-period ones the observations may occur at random times over many decades.

• If the signal can be described by a few parameters, these may be determined even though there is aliasing. To take an example from some work done here at IGPP, the signal produced by the interferometer in an absolute gravimeter is what is called a chirp: it has a time-varying frequency,

    x(t) = x_0 + cos[(f_0 + f_1 t) t + φ]

where the parameter f_1 is related to the local acceleration of gravity. The frequency rapidly increases to a value that cannot be sampled fast enough to satisfy the Nyquist criterion, but it is still possible to process the signal to determine its parameters (Parker et al., 1995).

• Another case is a real signal that is bandlimited between frequencies f_l and f_h. To not lose information, we need to sample it so that the sampling “folds” the frequency axis [0, ∞) into the interval [0, f_n] in such a way that none of the folds fall within the frequency band f_l to f_h. This is equivalent to requiring that, for some integer n (not initially determined), the Nyquist frequency f_n satisfies

    (n − 1) f_n ≤ f_l    and    n f_n ≥ f_h

which combine to give

    f_h / n ≤ f_n ≤ f_l / (n − 1)

This in turn means that n must be such that f_h/n ≤ f_l/(n − 1), equivalent to requiring that

    n ≤ f_h / (f_h − f_l)

Taking n to be the largest integer that satisfies this inequality, we then have that f_n ≥ f_h/n, which means that the sample frequency f_s = 2 f_n must satisfy f_s ≥ 2 f_h/n and, from the other inequality involving f_l, f_s ≤ 2 f_l/(n − 1).

4.7 Quantization Error

We close this chapter by discussing briefly the effects of quantization of a signal; we consider the simple (and common) case in which an A-to-D converts a real-valued input voltage to an output integer. A-to-D’s are specified by the number of bits in this integer: for example, a 16-bit system outputs values from −32768 to 32767 (that is, −2^15 to 2^15 − 1) over the full range of input voltage.
So the more bits, the smaller the voltage corresponding to a unit change in the output, and the larger the ratio of the maximum signal to the smallest one that can be detected. This ratio is called the dynamic range, for an N -bit A-to-D, it is 2 N , though it is usually expressed in dB. The voltage range and number of bits determine the size of the least count in volts, which may be scaled by the instrument sensitivity to find the least count in the units of whatever is being recorded; we denote this value by ǫ. If (as is usual) the A-to-D rounds the real value to the nearest integer, the maximum error is ǫ/2. Provided the signal moves by more than ǫ between samples, it is easy to see that this error will be uniformly distributed, provided that where the signal “lands”, relative to the integer 64 Chapter 4. Sampling Theory for Time-Series boundaries, is reasonably random. The error thus has a uniform distribution from −ǫ/2 to ǫ/2, which has a variance of one-twelfth, equivalent to a standard deviation of about 0.3 least counts. This result suggests that quantization error will not be large, which in fact is the case: in general, it is not difficult to make quantization error small enough to ignore. You should however always estimate it and compare it to the size of what signal you might wish to record. This is often best done, again, in the frequency domain: though this comparison requires use of the power spectral density – which we will get to. C HAPTER 5 D IGITAL F ILTERS I: F REQUENCY-S ELECTIVE F ILTERS Engineering is the art of doing that thing well with one dollar, which any bungler can do with two after a fashion. Arthur M. Wellington, The Economic Theory of the Location of Railways (1887). 5.1 Introduction: Digital Filters in General Given a discrete-time sequence { xn } we often wish to perform some kind of operation on it. The DFT is one such operation; many others go by the name of filters. There are many different things that filters may be used to do; and for each of these actions there are many different ways to design filters that are “good”. You should remember that filter design is engineering, not science; our aim is to get a good enough, and economical, answer, for the situation we face. The list of tasks that filters can be used for includes: A. Select out certain frequencies. This is the most common operation in the kind of analysis done in geophysics: especially removing high frequencies, to allow decimation of a sequence to make it smaller; or low frequencies, to reduce large low-frequency changes that hide smaller fluctuations. B. Differentiate or integrate, at least approximately; since these are continuoustime operations, they only make sense for sequences properly sampled from a continuous-time function. The design of filters for these operations is quite similar to the design of frequency-selective ones. 65 66 Chapter 5. Digital Filters I: Frequency-Selective Filters C. Simulate what a continuous-time linear system would do to the series, again supposing the sequence to have been sampled from a continuoustime function. This goal leads to another way of designing filters, through simulating a linear system represented by an ordinary differential equation. In the signal processing literature, these methods are used to design frequency-selective filters by simulating filter designs developed for analog systems. 
For most geophysical data, such simulation filters are more important for reproducing a synthetic instrument through which one wants to pass synthetic data, for comparison with a real dataset recorded by a real instrument. D. Predict future values, which is an major subfield all by itself: it can be economically very valuable, but it is also used a lot in such matters as autopilots for airplanes, where knowledge of where the plane is about to be is very important. Many methods exist, in part because prediction depends on what model we take to characterize the sequence. Quite a few methods model the sequence probabilistically: for example, the Kalman filter operates in the time domain to infer future values from past ones. A rapidly-growing part of this field connects to the burgeoning subject of nonlinear dynamics, since if a time series that appears to be random can actually be described as a deterministic but chaotic system, better predictions may be possible. E. Detect a signal, defined as some kind of anomalous behavior. This goal is closely related to the previous one: if the series is not as predicted, then there might be an anomaly. Anomaly detection can be described using the methods of statistical hypothesis testing, and has developed a large literature of its own, again because of its technological importance in radar and sonar. F. Assuming that the sequence has been produced as the result of a convolution of two other sequences wn and yn , determine something about one or both of these other two. This is known as deconvolution, and covers a wide variety of techniques, which depend on what we think we know about the convolving sequence wn . One example would be the reverse of (C): we have the result of a sequence (or a function) being passed through a known linear system, and want to determine the original. Two geophysical examples of this are correcting data for instrument response, and downward continuation of magnetic or gravity data measured above its source (the height acts as a filter). Another type 5.2. FIR Filters for Frequency Selection 67 Figure 5.1: Ideal frequency-selective filters: the four commonest types. The parts of the frequency band for which the response is 1 are called the passband(s); where the response is 0 is the stopband(s). The dashed lines show the region in which (for most design methods) the response goes from 1 to 0 (not necessarily linearly as shown here), which is called the transition band. Another name for the bandstop filter is notch filter. of deconvolution (used in exploration geophysics and image processing) occurs when we have only partial knowledge about the two sequences (e.g., some of their statistical properties). In this chapter we discuss filters for purpose (A): certainly the commonest application in geophysics. In the next chapter we discuss (C), partly so we can introduce some new ways of looking at digital filters. Purpose (B), along with some other filtering topics, will be placed in a third chapter. Purposes (D), (E), and (F), all of which combine filtering and statistics, will not be covered in this course. 5.2 FIR Filters for Frequency Selection Even just for purpose (A), we have many choices in how to design filters. We start by introducing the ideal filter response Wd ( f ); for frequency-selective filters Wd is either 1 or 0 over particular frequency bands. Figure 5.1 shows some typical designs, and their names. 
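For later reference, the ideal responses shown in Figure 5.1 are easy to write down on a grid of frequencies; a minimal sketch (the function name and the band edges used in the example are arbitrary):

```python
import numpy as np

def ideal_response(f, edges, kind="lowpass"):
    """Ideal frequency-selective response Wd(f): 1 in the passband(s), 0 in the
    stopband(s).  f is in cycles/sample (0 to 0.5); edges is (f1,) for
    lowpass/highpass or (f1, f2) for bandpass/bandstop."""
    f = np.asarray(f, dtype=float)
    if kind == "lowpass":
        return (f <= edges[0]).astype(float)
    if kind == "highpass":
        return (f >= edges[0]).astype(float)
    if kind == "bandpass":
        return ((f >= edges[0]) & (f <= edges[1])).astype(float)
    if kind == "bandstop":
        return 1.0 - ideal_response(f, edges, "bandpass")
    raise ValueError(kind)

f = np.linspace(0, 0.5, 6)
print(ideal_response(f, (0.25,), "lowpass"))    # [1. 1. 1. 0. 0. 0.]
```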
In this chapter we restrict ourselves to filters that consist of a finite sequence {wn }, usually called the weights. These filters are used by convolving the weight sequence {wn } with the data sequence { xn } to produce a filtered sequence { yn } = {wn } ∗ { xn }. In the signal-processing literature this type of filter is called a finite impulse response (FIR) filter; in the statistics literature the usual name is moving average, usually abbreviated as MA. Another term is nonrecursive filter. Chapter 5. Digital Filters I: Frequency-Selective Filters 68 The frequency response of a FIR filter is just the Fourier transform of the weight sequence: W(f ) = NX −1 n=0 ∞ X wn e−2π i f n = wn e−2π i f n for 0 ≤ f ≤ 1 (or − 1/2 ≤ f ≤ 1/2) n=−∞ (5.1) where N is the number of weights. It is useful to consider the weights to be an infinite sequence, though one that happens to be zero outside the finite range from 0 to N − 1. We next impose three more restrictions: • The weights are real. • The number of weights, N , is odd. • The weights are “symmetric”: that is, for N = 2 M + 1, w M −k = w M +k k = 0, . . . M The consequence of all this is that wn = w2 M −n , making the frequency response W(f ) = 2X M n=0 wn e−2π i f n = =e −2π i f M =e −2π i f M " wM + " M −1 X n=0 M −1 X wn ( e n=0 wM + 2 wn ( e−2π i f n + e−2π i f (2 M −n) ) + w M e−2π i f M M −1 X n=0 −2π i f (n− M ) +e −2π i f ( M −n) wn cos 2π( M − n) f ) # (5.2) # The part in square brackets is purely real; like any real quantity viewed using the amplitude and phase form for complex numbers, its phase is zero. The shift theorem shows that the initial exponential is equivalent to an M point shift, which is to say a time delay. Ignoring this, a symmetric FIR filter thus does not shift the phases of different frequencies: this minimizes distortion of the input, other than by removing energy at certain frequencies. If the part in square brackets were one, we would have just an M -term delay, which obviously creates no distortion at all. Because the phase shift is linear in frequency, symmetric FIR filters are usually termed linear phase. 5.2. FIR Filters for Frequency Selection 69 Figure 5.2: Frequency response (shown in both linear and log amplitude) for lowpass filters designed by simple truncation of the sequence of ideal weights. The ideal response Wd ( f ) has a passband (Wd = 1) from 0 to 0.25, and a stopband (Wd = 0) from 0.25 to 0.5. The left-hand pair of plots show the response for M = 49 (99 weights total), and the right-hand pair for M = 99 (199 weights total). If we do not mind having the filter be acausal (the output for a step input precedes the step), we can make it symmetrical around n = 0, defining wn = w−n for n = − M, . . . M . The frequency response is then W ( f ) = w0 + 2 M X wn cos 2πn f (5.3) n=1 which has zero phase shift; in this form the FIR filter is often called a zerophase filter. Because such a filter is acausal it can be implemented only after the data have been collected, but in digital systems this can always be the case – if only because we are free to relabel the absolute time of the terms in the series as we see fit. In general a lack of causality is not a problem, though seismic data can be an exception to this: if our interest is in the exact time of arrival of bursts of energy, we do not want the filtering to put any energy before it actually arrives.1 In Chapter 7 we will see how to make FIR filters that are better suited to this application. 
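The linear-phase property is easy to verify numerically. In this sketch the symmetric weights are arbitrary numbers, not a designed filter; the check is that, once the M-sample delay of equation (5.2) is removed, the response has no imaginary part.

```python
import numpy as np

# Arbitrary real, symmetric weights with N = 2M+1 = 7, so M = 3 (illustrative values).
w = np.array([0.05, 0.12, 0.20, 0.26, 0.20, 0.12, 0.05])
M = (len(w) - 1) // 2
n = np.arange(len(w))

f = np.linspace(0.0, 0.5, 101)
W = np.array([np.sum(w * np.exp(-2j * np.pi * fi * n)) for fi in f])

# Equation (5.2) says W(f) = exp(-2*pi*i*f*M) * (purely real factor), so
# removing the M-sample delay should leave a zero-phase (purely real) response.
resid = W * np.exp(2j * np.pi * f * M)
print("max imaginary part after removing the delay:", np.abs(resid.imag).max())
```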
Given all the restrictions so far applied, the filter design problem becomes how to choose the M + 1 weights w0 , w1 , . . . w M +1 , to best approximate a given frequency response Wd ( f ). As we will see, the meaning of “best” is not unique. Indeed it is characteristic of filter design that the best answer is not always the same: what we want from a filter will depend on what we are trying to do. A naive approach would be to just take the inverse Fourier transform of the ideal filter response: since Wd ( f ) is real and symmetric, it might appear 1 A good example of exactly this problem confounding the interpretation of the initial rupture in earthquakes is given by Scherbaum and Bouin (1997). Chapter 5. Digital Filters I: Frequency-Selective Filters 70 that we can use the inverse transform for sequences to get wn = Z1/2 −1/2 Wd ( f ) e 2π i f n df =2 Z1/2 0 Wd ( f ) cos2π f n d f However, this will not in general give a sequence of finite length – something we certainly need for a practical filter. We need to be more sophisticated. 5.2.1 Designs Using Tapers One method of filter design is to construct, from the ideal response, the (infinite) sequence of ideal weights wnd =2 Z1/2 0 Wd ( f ) cos2π f n d f and then create a finite sequence from this by multiplying by a taper (or window) sequence {a n } to create the final weights wn = a n wnd The taper sequence must have two properties. The first is that a n = 0 for n > M , so that all but a finite section of {wn } is zero, to give a finite-length filter. The second is that the taper must also be symmetric about n = 0, to keep the final filter weights symmetric. We might expect that this time-domain multiplication would be equivalent to a convolution in the frequency domain; and so it is: W(f ) = = ∞ X n=−∞ Z1/2 −1/2 a n wnd e−2π i f n Wd ( u) ∞ X n=−∞ = ∞ X an e −2π i f n n=−∞ a n e−2π in( f −u) du = Z1/2 −1/2 Z1/2 −1/2 Wd ( u) e2π i f nu du Wd ( u) A ( f − u) du where in the final convolution integral both functions are assumed to be periodic outside the range [−1/2, 1/2]. This means that the closer A ( f ), the transform of the taper sequence, is to a delta-function, the closer the filter response will be to the ideal Wd ( f ) Of course, what we mean by “close to” will, again, depend on what we think is important: in thinking about closeness to a delta-function, is it more 5.2. FIR Filters for Frequency Selection 71 Figure 5.3: The three Dirichlet kernels used in forming the response of the von Hann taper (left, dashed), and that response itself (solid). The right plot shows A( f ) for the Dirichlet and von Hann kernels on a log-log scale, for a larger value of M. important that the peak of A ( f ) be narrow or the values away from the peak be small? Our answer to this will depend on what we want the filter to do, which is why there are no universal rules for filter designs. The simplest taper is the rectangular taper, with a n = 1 for |n| ≤ M ; this corresponds to simply truncating the sequence of ideal weights. We can see that this is not a particularly good solution if we look at the corresponding A ( f ): M X A( f ) = e−2π i f n = (2 M + 1)D 2 M +1 ( f ) n=− M where D N ( f ) is the Dirichlet kernel discussed in Chapter 3. As described there, successive maxima of |D N ( f )| do decrease, but only as f −1 : if what we want is small values away from the peak, this A ( f ) is not a good approximation to a delta-function. 
Convolving this A ( f ) with the ideal response causes much of the passband response to “leak” into the stopband, and vice-versa. Figure 5.2 shows this effect clearly, and also shows that simply increasing the number of weights M does not diminish its importance. If you are familiar with Fourier series theory, another way to view this result is to realize that in taking the rectangular taper we are simply finding partial sums of the Fourier series expression for Wd ( f ); such partial sums always suffer from Gibbs’ phenomenon, with poor convergence near Chapter 5. Digital Filters I: Frequency-Selective Filters 72 Figure 5.4: A collection of data tapers (top plots) and their Fourier transforms A( f ), for 2M + 1 = 52 (the transforms have been interpolated for plotting, and normalized to 1 at zero frequency). The sinc taper is given by (π n/M)−1 sin(π n/M), and the Potter taper by P3 k=0 a k cos(π kn/(2M + 1)) with a 0 = 0.3557702, a 1 = 0.4873966, a 2 = 0.1442299, and a 3 = 0.0126033. the discontinuities in the frequency response.2 There is a taper that is almost as simple, but that has much smaller values away from the central peak: this is the von Hann taper, also called the hanning, cos2 , or 1 + cos taper: a n = 1/2[1 + cos(πn/ M )] = cos2 (πn/2 M ) for n = − M, . . . M which has the transform: A( f ) = M X a n e−2π i f n = 1/2 n=− M M X = 1/2 n=− M e−2π i f n + 1/4 M X n=− M M X n=− M e−2π i f n + 1/4 M X n=− M [ eπ in/M + e−π in/M ] e−2π i f n e−2π i( f −( M /2))n + 1/4 = (2 M + 1) [1/2D 2 M +1 ( f ) + 1/4D 2 M +1 ( f M X e−2π i( f +( M /2))n n=− M − ( M /2)) + 1/4D 2 M +1 ( f + ( M /2))] The three Dirichlet kernels and their sum are shown in Figure 5.3: it is clear that, as f increases, the combination approaches zero much more rapidly than does the original Dirichlet kernel, a point made even more dramatically by a log-log plot. 2 For a good introduction, both historical and mathematical, to Gibbs’ phenomenon, see Hewitt and Hewitt (1979). 5.2. FIR Filters for Frequency Selection 73 Figure 5.5: Frequency responses of lowpass filters for three data tapers, and also for the minimax design procedure described in Section 5.2.2 For each, the error (departure from unity) is shown over part of the passband (note the different scales), and the complete response is shown in log form. There are many taper sequences available;3 the frequency responses of these can vary considerably, but inevitably have a tradeoff between width of the central peak and smallness of the fluctuations (these are called the sidelobes) away from it. Figure 5.4 gives a small selection of tapers, including two with especially small sidelobes: the Potter taper (an empiricallydesigned one) and the 4π prolate spheroidal taper. The last, as we will see in Chapter ??, is the function for which A ( f ) is most concentrated within a frequency band around zero, and is one of several that are important when estimating a power spectrum. The effect of using different tapers for filters is shown in Figure 5.5. As expected, the use of a Hann taper gives much lower sidelobes than does a rectangular taper; the 4π prolate spheroidal taper gives even lower ones, with sidelobe amplitudes due in part to roundoff error in computing the DFT. But filter design using tapers is somewhat inflexible: given a particular taper the tradeoff between sidelobe level and the width of the transition between passband and stopband is fixed. 
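A minimal sketch of the tapered design, for the same ideal lowpass (passband out to f = 0.25) used in Figure 5.2; the filter length, and the frequency above which stopband leakage is measured, are arbitrary choices made here for illustration.

```python
import numpy as np

M = 49
n = np.arange(-M, M + 1)
fc = 0.25                                   # ideal lowpass cutoff (cycles/sample)

# Ideal weights: w_n = 2 * integral_0^fc cos(2*pi*f*n) df = sin(2*pi*fc*n)/(pi*n)
wd = np.where(n == 0, 2 * fc,
              np.sin(2 * np.pi * fc * n) / (np.pi * np.where(n == 0, 1, n)))

rect = wd                                        # simple truncation (rectangular taper)
hann = wd * 0.5 * (1 + np.cos(np.pi * n / M))    # von Hann taper

def response(w, nfreq=4096):
    f = np.arange(nfreq) / (2.0 * nfreq)         # 0 ... just below 0.5
    e = np.exp(-2j * np.pi * np.outer(f, n))
    return f, np.abs(e @ w)

for name, w in [("rectangular", rect), ("von Hann", hann)]:
    f, A = response(w)
    stop = A[f > 0.30]                           # well inside the stopband
    print(f"{name:12s} worst stopband leakage: {20*np.log10(stop.max()):6.1f} dB")
```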
A somewhat more flexible ap3 Harris (1976) is a kind of bestiary of every taper then thought of, with plots of the time and frequency responses. Most of these tapers deserve to be forgotten, since much better ones, notably the prolate spheroidal tapers, have been developed since. It is worth noting that the ‘Kaiser-Bessel’ tapers described in this paper closely approximate the prolates. Chapter 5. Digital Filters I: Frequency-Selective Filters 74 proach has been developed by Kaiser and Reed (1977) and Kaiser and Reed (1978), using an adjustable taper and some empirical rules for setting it (and the filter length) to meet a given sidelobe level and transition width; this method offers a convenient design that is adequate for many applications. However, even more flexibility is possible using different techniques, one of which we now describe. 5.2.2 Design by Optimal Fitting of the Frequency Response A more powerful technique in designing filters is to look at the difference between the ideal response Wd ( f ) and the actual response W ( f ), and choose the filter weights to minimize this difference in some way. What seems like the most obvious approach does not however work very well, this being to minimize the mean-square misfit. Minimizing this is just like the least-squares criterion, though in this case expressed as an integral over frequency: the quantity to minimize is 2 ǫ = Z1/2 −1/2 |Wd ( f ) − W ( f )|2 d f The difference between the ideal and actual responses, in terms of the weights, is ∞ X Wd ( f ) − W ( f ) = (wnd − wn ) e−2π i f n n=−∞ where the actual weights wn are extended to infinity by adding zeroes at both ends. But then we may apply Parseval’s theorem for infinite sequences to get: ǫ2 = ∞ X n=−∞ |wnd − wn |2 = −M X−1 n=−∞ |wnd |2 + ∞ X n= M +1 |wnd |2 + M X n=− M |wnd − wn |2 and since only the last term is adjustable, we minimize ǫ2 by setting wn = wnd . But this is no different from using a rectangular taper on the ideal sequence – and we have already seen that this produces filters with large sidelobes. Least squares, for all its merits, is not a panacea. What is more useful is to minimize the maximum value that the misfit |Wd ( f ) − W ( f )| attains over a range of frequencies. This is computationally more complicated than anything we have discussed so far, and we will not 5.2. FIR Filters for Frequency Selection 75 describe how it is done; suffice it to say that it is possible to find weights that will do this. Specifically, given K non-overlapping frequency bands f L k < f ≤ f H k , for k = 1, 2 . . . K , we can find the M weights which minimize b k max |Wd ( f ) − W ( f )| [ f L ,f H ] over all the bands; b k is a weight applied to each band, which allows the fit to the ideal response to be tighter or looser. The actual amount of misfit is largely a function of the filter length M ; what this misfit actually is, can only be determined after the optimal weights are found. Usually some trial and error is needed to decide on the appropriate tradeoff between length and misfit. This approach applies the L ∞ norm, whereas the mean-squareerror criterion applies the L 2 norm; and in general gives very good designs, with more flexibility than the tapering method. The resulting filters are termed equiripple filters, because the frequency response shows a uniform level of departure from the ideal. 
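Returning for a moment to the adjustable-taper approach of Kaiser and Reed mentioned above: SciPy's kaiserord and firwin implement closely related empirical rules that pick a taper shape and filter length from a target attenuation and transition width. A sketch, with arbitrary target values chosen only for illustration:

```python
import numpy as np
from scipy.signal import kaiserord, firwin, freqz

# Targets (arbitrary choices): ~60 dB stopband attenuation and a transition
# band 0.05 cycles/sample wide, for a lowpass with its cut at 0.25.
atten_db = 60.0
width = 0.05                                   # cycles/sample
fc = 0.25                                      # cycles/sample

numtaps, beta = kaiserord(atten_db, width / 0.5)          # width as a fraction of Nyquist
h = firwin(numtaps, fc / 0.5, window=("kaiser", beta))    # cutoff as a fraction of Nyquist

w, H = freqz(h, worN=4096)
f = w / (2 * np.pi)                            # back to cycles/sample
stop = np.abs(H)[f > fc + width / 2]
print("number of weights:", numtaps)
print("worst stopband level: %.1f dB" % (20 * np.log10(stop.max())))
```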
A filter designed using this method is shown in the fourth example in Figure 5.5: by sacrificing low sidelobes at high frequency, the equiripple design can have a narrower transition band than any of the tapered designs. The procedure for designing equiripple filters is called the Parks-McClellan algorithm; it is included in MATLAB as the routine firpm. The underlying algorithm is called the Remez exchange method. There are many other variants of this method of filter design; we may, for example, require, as well as a good fit to Wd, that the derivative dW/df be of one sign in the passband: this will completely eliminate ripple there.4 In general, most such refinements, and how to design filters where the word length is short and roundoff a problem, are rarely important in geophysical data analysis.

4 Steiglitz et al. (1992) present a fairly general program for including different kinds of constraints using linear programming: relatively slow, but very flexible. See http://www.cs.princeton.edu/~ken/meteor.html for source code and examples.

5.2.3 A Filtering Example

We close with an example that illustrates how frequency-selective filters can be used, and also illustrates why there are no fixed rules for designing them – what filter you choose depends on the problem you want to solve. The data to be filtered are measurements of strain at Piñon Flat Observatory, made at one-second intervals with a long-base laser strainmeter. For the time period shown, these strain data show the signal from the 1994 Northridge earthquake, 203 km away. This signal, though dominated by the seismic waves from the earthquake, also contains the static change in strain caused by the elastic rebound of the rocks, which is what drives the faulting. The aim of the filtering will be to remove the “high-frequency” seismic signal so as to show the static change more clearly.

Figure 5.6: Equiripple filter designs. The lower right shows a schematic response, with the frequency parameters indicated. The different designs (A through E) have different frequency settings or lengths M. For each, the passband response is shown on a linear scale, the overall response in decibels.

For maximum flexibility we use the equiripple design; Figure 5.6 shows some possible filter responses. The lower right corner of this figure shows the parameters we can choose: the transition frequency f_c and the width of the transition band f_t. Once we choose these parameters, we next pick a length for the filter; the program will then find the best design possible, with the smallest amount of ripple in both the passband and stopband. In this case we choose these amounts to be the same; while they can be in any proportion to each other, making one smaller makes the other larger.

Figure 5.7: Strain variations from the Northridge earthquake recorded on the NW-SE laser strainmeter at Piñon Flat Observatory (PFO). The lower right plot is the original unfiltered data; this actually has a much larger range than is shown. Plots A through E show the results of lowpassing the data with the filters whose response is shown in Figure 5.6.

Our first design (A) has a narrow passband, and a narrow transition band – and the consequence is that the ripple is large, nearly 10%. This means that in the stopband any signal will be only 0.1 of what it was originally – we would say, 20 dB down, which is not a large reduction.
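The tradeoffs explored in designs A through E can be previewed numerically. The sketch below uses SciPy's remez (an implementation of the Parks-McClellan algorithm, the counterpart of MATLAB's firpm) with stand-in values of f_c, f_t, and the filter length, not the actual settings used for Figure 5.6:

```python
import numpy as np
from scipy.signal import remez, freqz

def stopband_ripple(numtaps, fc, ft):
    """Equiripple lowpass with a transition band of width ft centered on fc
    (frequencies in cycles/sample); returns the worst stopband leakage."""
    h = remez(numtaps, [0.0, fc - ft / 2, fc + ft / 2, 0.5], [1.0, 0.0])
    w, H = freqz(h, worN=8192)
    f = w / (2 * np.pi)
    return np.abs(H)[f > fc + ft / 2].max()

# Stand-in parameters: a short filter with a narrow transition band, the same
# length with a wider band, and a longer filter with the narrow band.
for numtaps, fc, ft in [(101, 0.10, 0.02), (101, 0.10, 0.05), (199, 0.10, 0.02)]:
    r = stopband_ripple(numtaps, fc, ft)
    print(f"N = {numtaps:3d}, transition width = {ft:.2f}: "
          f"ripple = {r:.4f} ({20*np.log10(r):5.1f} dB)")
```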
To do better, we can try two approaches: (B) make the transition band wider, and (C) use more filter weights (199 instead of 101). Either one gives much less ripple; for these two examples we are trading computation time (filter length) against the width of the passband response. In (D) we try making the passband much larger by increasing f c ; in (E) we combine the larger f t of (B) with the more numerous weights of (C) to get a response even closer to the ideal. 78 Chapter 5. Digital Filters I: Frequency-Selective Filters Figure 5.7 shows what these filters do to the data; without filtering the static offset is well hidden in the seismic coda. Filter A reduces this enough that the offset can be seen, but still leaves a lot of energy: looking at this, you would (or should) want to say that the “true” signal would be a smoothed version of this: a sure sign that the data need more filtering. Filters B and C provide more filtering: since their stopband levels are the same, the results look about the same as well; between the two, B is probably preferable because it is shorter. A much wider passband (D) turns out to be a poor idea, at least for showing the static offset. Filter E is perhaps “best” for this application. C HAPTER 6 D IGITAL F ILTERS II: R ECURSIVE F ILTERS AND S IMULATING L INEAR S YSTEMS 6.1 Introduction We now turn to another kind of digital filter: one that will allow us to use a computer to imitate what some physical system does. We might need this when, for example, we want to model a seismogram. The first step would be to have a way of computing the the ground motion input to the seismometer. The next step would be to simulate what the output of the seismometer (a physical system) is for this ground motion. The FIR filters of the previous chapter are not well-suited for this, but other designs are, and it is these we will now describe. But, to discuss these filters and how to design them we need to first spend some time introducing additional mathematics for the linear systems we discussed in Chapter 2. We will then return to the problem of designing digital filters, introducing what is known as a recursive filter; finally, we will show how to make such a digital filter accurately simulate an analog system. 6.2 Lumped-Parameter Systems We saw in Chapter 2 that a linear time-invariant system could be characterized in three different ways: • By its frequency response g̃( f ); this expresses the ratio (a complex number) between output and input when the input is a pure sinusoid, e 2π i f t . 79 Chapter 6. Digital Filters II: Recursive Filters and Simulating Linear Systems 80 Figure 6.1: Frequency response of an RC filter with τ = 1, shown as log amplitude, and phase, plotted against log frequency. . • By its impulse response g( t): the time-domain behavior when the input is a delta-function. This is also the inverse Fourier transform of g̃( f ): g f o( f ) = F [ g( t)]. • By the step response h( t), which is the integral of g( t). We now look at these characterizations for a special class of linear timeinvariant systems – though one that includes many actual cases. This class is systems that are described by constant-coefficient linear differential equations; a common term, especially in electrical engineering, for these is lumped-parameter systems. For such systems the relationship between the input x and output y is d n−1 y dy dn y + a + a0 y = + . . . + a1 n − 1 n n − 1 dt dt dt dm x d m−1 x dx b m m + b m−1 m−1 + . . . 
+ b 1 + b0 x dt dt dt an (6.1) The a’s and b’s are the parameters that describe the system. There are standard procedures for solving such an equation to get explicit forms for y( t), especially for certain classes of inputs x( t); we shall instead look at the frequency response of such systems. But first we provide a few examples. 6.2. Lumped-Parameter Systems 81 Figure 6.2: Frequency response of a seismometer to displacement. The natural frequency is ω0 = 1. The response is shown for λ = 0 (dashed), λ = 0.8 (dotted), and λ = 1 (solid). • The first example is the simplest analog electronic filter: the RC lowpass filter, which consists of a resistor and capacitor in series, with input and output voltages as shown. The input voltage x( t) = Vin is just the output voltage y( t) = Vout plus the voltage drop across the resistor, which is given by R I ( t), where I ( t) is the current flowing through the capacitor. This current is given by I ( t) = C ẏ( t), making the differential equation y( t) + RC dy ( t ) = x( t ) dt (6.2) To get the frequency response, we assume, as usual, a sinusoidal input x( t) = e2π i f t ; by definition of g̃, y( t) = g̃( f ) e2π i f t . Substituting these expressions into (6.2), we get g̃( f ) = 1 1 + 2π i f τ (6.3) where τ = RC is the time constant of the filter. The two plots in Figure 6.1 show the amplitude and phase of g̃( f ) for τ equal to 1, plotted against the logarithm of the frequency (and with the amplitude plotted in dB); these are called Bode plots. Clearly this is a lowpass filter: high frequencies are attenuated. Chapter 6. Digital Filters II: Recursive Filters and Simulating Linear Systems 82 Figure 6.3: Response of the Earth’s polar motion to excitation; note that since we represent the input and output (both 2-vectors) as complex numbers. we need to show both positive and negative frequency. The response is given for a Q of 100; the Chandler resonance is at 0.849 cycles/year. The rapid change of phase at this frequency makes it difficult to compare polar motion with excitations. • A seismometer (Figure 6.2). The simplest form (a mass on a spring, or a pendulum) obeys the well-known equation for a simple harmonic oscillator: ÿ + 2λω0 ẏ + ω20 y = ẍ where y is the displacement of the mass with respect to the frame, and x the displacement of the frame with respect to inertial space. The system parameters are the natural frequency ω0 , and the damping λ. Again, we substitute e2π i f t for x( t) and g̃( f ) e2π i f t for y( t), and find −4π2 f 2 g̃( f ) = −4π2 f 2 + 4 i πλω0 f + ω20 whose phase and amplitude response are shown in Bode plots (Figure 6.2) for several values of λ: λ = 0 is undamped, λ = 0.8 gives the flattest response, and λ = 1 is critically damped. • The Earth’s polar motion. The position of the rotation pole of the solid earth changes because of variations in the angular momentum and mass distribution of the fluid parts. The pole position can be described by two coordinates, p 1 and p 2 , giving the (dimensionless) 6.2. Lumped-Parameter Systems 83 angular displacement of (say) the North rotation pole in directions toward the Greenwich meridian and at 90◦ to it. 
If these changes are small, it can be shown that the relationship between the pole position and the motions of the fluid parts is 1 d p 1 ( t) + p 2 ( t ) = ψ2 ( t ) ω c dt 1 d p 2 ( t) + p 1 ( t ) = ψ1 ( t ) ω c dt (6.4) where ψ1 and ψ2 are integrals over the mass distribution and velocity of the fluid parts of the earth (usually termed the “excitation”).1 The frequency ω c is determined by the properties of the solid earth; for a rigid ellipsoidal body ω c = [(C − A )/ A ]Ω, where C and A are the polar and equatorial moments of inertia, and Ω is the frequency of the Earth’s spin (once per day). The values of C and A for the Earth give a value for ω c that corresponds to a period of 305 days; various corrections, too complicated to discuss here, make the actual period equal to 430 days. We can write this pair of equations as a single equation by forming the complex variables p = p 1 + i p 2 and ψ = ψ1 + i ψ2 ; then we can combine the two equations in (6.4) to get: i ṗ(t) + p(t) = ψ(t) ωc (6.5) If we make the input e f t, and the output g̃e f t, we find g̃( f ) = 1 1 − (2π f /ω c ) which has the response shown in Figure 6.3, in this case plotted against both positive and negative frequency, since the response is different for these: a common attribute of gyroscopic systems. (Remember that, with this trial function, negative frequency corresponds to clockwise [deasil] rotation, and positive to counterclockwise [widdershins].) We will use these examples in the discussion below, both to illustrate concepts about systems, and also as examples for filter design. 1 For a derivation of this equation, see Munk and McDonald (1960); the version given here has been revised to match the newer and more accurate result of Gross (1992). Chapter 6. Digital Filters II: Recursive Filters and Simulating Linear Systems 84 6.3 The Laplace Transform: System Poles and Zeros Yet another way of looking at the response of a system comes if we look at another integral transform: the Laplace transform. The Laplace transform of a function x( t) is defined as Z∞ x̃( s) = x( t) e−st dt −∞ for s a complex number: x̃( s) is defined on the complex plane. The Fourier transform is x̃( f ) is just x̃( s) evaluated along the imaginary axis of the complex plane; that is, x̃( f ) = x̃( s) for s = 2π i f . The Laplace transform can thus be thought of as a generalization of the Fourier transform; but, viewed as a transform, it is nowhere near as useful. And, mostly because x̃( s) is a complex-valued function over the complex plane, the Laplace transform is much more difficult to visualize than the Fourier transform. The inverse Laplace transform is Z c+ i ∞ 1 x( t ) = x̃( s) e st ds 2 pi c− i∞ where the integral is evaluated over a line parallel to the imaginary axis, the value of the real part depending on the nature of x̃( s). The Laplace transform and its inverse are thus much less symmetrical than the Fourier transform pair. The utility of the Laplace transform for us comes through applying it to the lumped-parameter systems described by equation (6.1). The value of the Laplace transform is that it enables us to get significant understanding of such a system without solving the differential equation at all.2 Taking the Fourier transform of both sides of equation (6.1) gives us the same result as we would get (and did get for the examples of the previous section) by substituting e2π i f t as the input. 
What we mean by “taking the Fourier transform” of the system is to find the equation that describes the connection between the Fourier transforms of the input and output: that is, between x̃(f) and ỹ(f). We get this result by applying the theorem for the Fourier transform of the derivative of a function to equation (6.1); this gives us a polynomial in f:

$$[a_n (2\pi i f)^n + a_{n-1}(2\pi i f)^{n-1} + \dots + a_1 (2\pi i f) + a_0]\,\tilde y = [b_m (2\pi i f)^m + b_{m-1}(2\pi i f)^{m-1} + \dots + b_1 (2\pi i f) + b_0]\,\tilde x$$

where ỹ and x̃ are the Fourier transforms of y and x. Therefore, the frequency response of the system is

$$\tilde g(f) = \frac{\tilde y(f)}{\tilde x(f)} = \frac{\sum_{j=0}^{m} b_j (2\pi i f)^j}{\sum_{k=0}^{n} a_k (2\pi i f)^k} \qquad (6.6)$$

2 Deakin (1992) describes how this approach grew to dominance.

We can follow a similar route if we take the Laplace transform of both sides of (6.1). While this might seem like an unnecessary generalization (who needs the system response at a complex frequency?) it actually leads to some very useful insights into how the system will behave. We need the derivative theorem for the Laplace transform, which we simply state without proof: if x̃(s) is the Laplace transform of x(t), the Laplace transform of ẋ is s x̃(s). Applying this rule to the differential equation shows that the ratio of the Laplace transforms of the input and output, which is called the transfer function, is

$$\frac{\tilde y(s)}{\tilde x(s)} = \frac{\sum_{j=0}^{m} b_j s^j}{\sum_{k=0}^{n} a_k s^k} \qquad (6.7)$$

Comparing this with equation (6.6), we see that the frequency response is a special case of the transfer function: the frequency response g̃(f) is just the transfer function g̃(s) evaluated on the imaginary axis. The additional insight to be gotten from looking at the transfer function comes if, instead of expressing the polynomials in (6.7) in terms of their coefficients, we instead write them as products of their roots:

$$\frac{\tilde y(s)}{\tilde x(s)} = C\,\frac{\prod_{j=1}^{m} (s - r_j)}{\prod_{k=1}^{n} (s - p_k)}$$

The roots of these polynomials in s come in two forms, the zeros r_j of the numerator, and the poles p_k of the denominator. At the zeros the transfer function is zero; and at the poles it is infinite. The scaling value, C, is needed to make the description complete. Looking at the locations of the poles and zeros, and particularly how close they are to the imaginary axis, will show where the frequency response will be large and where it will be small. Finding the locations for either set of roots has to be done numerically if m or n exceeds five, but our three examples can all be done analytically:

• The RC filter has a transfer function

$$\tilde g(s) = (1 + \tau s)^{-1} \qquad (6.8)$$

which has a single pole on the negative real axis at s = −τ^{−1}. This is shown in the left-hand plot of Figure 6.4; as is conventional, we use a cross to show the location of a pole.

• The seismometer transfer function is

$$\tilde g(s) = \frac{s^2}{s^2 + 2\lambda\omega_0 s + \omega_0^2}$$

which has a pair of zeros at the origin of the s-plane. The poles lie at ω_0[−λ ± √(λ² − 1)]. For λ = 0, the poles are on the imaginary axis, at ±iω_0. Since this axis corresponds to the frequency axis for the Fourier transform, these on-axis poles make the response infinite at the frequencies for which 2πf = ±ω_0. As λ increases from zero, the two poles leave the imaginary s-axis and follow a circular path about the origin: because the coefficients of the polynomial are real, its roots (these poles) must be complex conjugates.
The case of λ = 0.8, which gives the flattest response, would be termed a twopole Butterworth filter. At λ = 1 the poles meet; for larger values they lie on the negative real axis, one approaching and the other receding from the origin. • The transfer function for polar motion is 1 P(s) 1+ is ωc = = Ψ(s) ωc ω c + is which has a single pole, on the imaginary axis, at s = i ω c . Again, this gives an infinite response for f = ω c : in the Earth this resonance gives rise to the Chandler wobble, hence the subscript c on ω. In reality dissipation keeps the response from being infinite; we can add this effect to the equations by moving the pole of the transfer function away from the imaginary axis. We make this pole complex, rather than imaginary, by defining ω c as µ ¶ i 2π 1+ ωc = Tc 2πQ c g̃( s) = where T c is now the period of the resonance, and Q c a measure of the amount of dissipation.3 3 It is certain that dissipation occurs in the Chandler wobble, but what causes it remains unclear. Smith and Dahlen (1981) provide an exhaustive discussion of the theoretical values for these two parameters for a given Earth model. 6.4. The z-transform for Digital Filters 87 Figure 6.4: Pole-zero plots, showing the locations of these on the complex s-plane for some analog systems. Poles are crosses, zeros circles; multiple roots have numbers next to them. For all our examples, including the last one with finite Q c , the poles are left of the imaginary axis. This is good, because for any lumped-parameter system, stability requires that all the poles of the transfer function have negative real parts, lying in the left half-plane. And for an unstable system, any nonzero input leads to an infinite output. Whether or not a system is stable is not obvious from looking at the differential equation; but finding the poles shows this immediately. Likewise, looking at the poles we can easily see what at what frequencies the response is large: it will be those that are near poles that are close to the imaginary axis. Conversely, zeros close to the imaginary axis will produce dips in the amplitude response. 6.4 The z-transform for Digital Filters In Chapter 3 we examined the Fourier transform for sequences. We have just seen that the Laplace transform can be regarded as the generalization of the Fourier transform; the equivalent generalization for sequences is called the z-transform. An infinite sequence xn , has a z-transform ζ[ xn ] Chapter 6. Digital Filters II: Recursive Filters and Simulating Linear Systems 88 that is a function of the complex variable z: ζ[ x( n)] = x̃( z) = ∞ X xn z −n n=−∞ It is important to note that this is not a definition that is universally followed. In particular, the geophysical exploration industry (which does a lot of signal processing) defines the z-transform as a sum over xn z n , Our usage is that found in electrical engineering; if we call our z-variable zEE , −1 and the exploration one zoil , zEE = zoil . Since z is a complex variable, this usage is equivalent to an inversion of the complex plane on the unit circle: the outside of the circle becomes the inside, and vice-versa, with the origin and infinity mapping into each other. Either convention is consistent, but in looking at results (and software) you need to know which one is being used. We have seen that the Fourier transform can be viewed as the Laplace transform evaluated on the imaginary axis of the s-plane. In just the same way, the Fourier transform of a sequence is a special case of the ztransform. 
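Before moving fully to discrete time, the pole locations and the stability rule just described can be checked numerically for the seismometer example. This is a sketch using the denominator s² + 2λω₀s + ω₀² with the damping values plotted in Figure 6.2, plus one overdamped case added for illustration:

```python
import numpy as np

omega0 = 1.0
for lam in (0.0, 0.8, 1.0, 2.0):
    # Denominator of the seismometer transfer function: s^2 + 2*lam*omega0*s + omega0^2
    poles = np.roots([1.0, 2 * lam * omega0, omega0**2])
    stable = np.all(poles.real < 0)
    print(f"lambda = {lam:3.1f}  poles = {np.round(poles, 3)}  "
          f"{'stable' if stable else 'marginal/unstable'}")
```

For λ = 0 the poles sit on the imaginary axis (marginal), for 0 < λ < 1 they form a complex-conjugate pair in the left half-plane, and for λ ≥ 1 they lie on the negative real axis, in agreement with the discussion above.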
Remember that the Fourier transform of a sequence is x̃( f ) = ∞ X xn e−2π i f n (6.9) n=−∞ This is equivalent to the z-transform for z = e2π i f ; in words, The Fourier transform of a sequence is the z-transform evaluated on the unit circle in the complex z-plane. There are other ways of looking at the z-transform. We can, for example, regard z−1 as a unit delay. This may sound peculiar, but is easily derived. Suppose y is the same series as x, but delayed by m terms: yn = xn−m ; then the z-transform of y is just ỹ( z) = ∞ X n=−∞ x n− m z − n = ∞ X l =−∞ xl z−(l +m) = x̃( z) z−m = x̃( z)( z−1 )m (6.10) so that the effect of delaying x is to multiply the z-transform by z−1 a total −1 of m times: z−1 “represents” a delay. (That is, zEE does; in the usage of the exploration industry, zoil does.) 6.4. The z-transform for Digital Filters 89 Figure 6.5: Mapping between the s-plane and the z-plane if we get our sequence in discrete time by sampling in continuous time. The right-hand side of the s-plane maps outside the unit circle in the zplane, and the left-hand side to the inside of the unit circle, but in both cases the mapping is not one-to-one. The bottom plot shows how the continuous-time frequency f a maps into discrete-time frequency fd. Another way to look at the z-transform comes from taking the Laplace transform of the function which is equivalent to the sequence xn , namely P∞ n=−∞ x( n)δ( t − n), which we used in Chapter 4 to discuss sampling of a function given in continuous time. The Laplace transform of this infinite sum of delta-functions is Z∞ X Z∞ ∞ ∞ ∞ X X −st xn δ( t − n) e dt = xn e−st δ( t − n) dt = xn e−sn −∞ n=−∞ n=−∞ −∞ n=−∞ which becomes equivalent to the z-transform if we set z = es (6.11) which for any s gives a value z. This derivation of the z-transform thus introduces a mapping from the s-plane to the z-plane; as we will see below, Chapter 6. Digital Filters II: Recursive Filters and Simulating Linear Systems 90 such mappings are important in devising digital (discrete-time) models of continuous-time systems. The particular mapping given by equation (6.11) has an important feature, namely that it is nonunique: different points in the s-plane all map to the same point in the z-plane. If we consider the strip for which |ℑ( s)| ≤ π, we can see that the region of the strip with negative real part maps inside the unit circle, the region with positive real part maps to outside the unit circle, and the segment of the imaginary axis that lies in the strip maps onto the unit circle.4 But exactly the same mapping occurs for each strip defined by m − π ≤ ℑ( s) ≤ m + π, for any integer m. This is yet another example of aliasing: when we form a sequence by sampling a function, one frequency in the sequence can come from many different frequencies in the function; with the mapping (6.11) we have generalized this to complex frequencies. It is possible to use complex-variable theory to find the inverse of the z-transform, and also to derive extensions to the convolution theorems in order to show, for example, how to convolve two z-transforms to get the z-transform of the product of two sequences. Since these results are not needed to design filters for system simulation, we do not discuss them further; see Oppenheim and Schafer (1989) for the details. 6.5 Recursive Digital Filters In a FIR filter, the output is just a weighted combination of input values. 
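A quick numerical illustration of the delay property (6.10), evaluated on the unit circle where the z-transform reduces to the Fourier transform of the sequence; the test sequence and the delay are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)              # arbitrary finite sequence
m = 5                                    # delay by m samples
y = np.concatenate([np.zeros(m), x])     # y_n = x_{n-m}

def dtft(seq, f):
    n = np.arange(len(seq))
    return np.array([np.sum(seq * np.exp(-2j * np.pi * fi * n)) for fi in f])

f = np.linspace(-0.5, 0.5, 201)
X, Y = dtft(x, f), dtft(y, f)

# On the unit circle z = exp(2*pi*i*f), so z^{-m} = exp(-2*pi*i*f*m):
print(np.allclose(Y, X * np.exp(-2j * np.pi * f * m)))   # True
```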
This kind of filter is inadequate for simulating an analog system, for which purpose we introduce a more general form: the recursive filter, where the output value depends on previous values of the output, as well as on the input values. The equation for such a digital filter is NX −1 k=0 a k yn−k = LX −1 l =0 b l x n− l (6.12) where xn is the input sequence and yn is the output. Because the sums are one-sided, we can compute the “current” value of y, yn , without needing any values of y except for the “past” ones that we have already computed. Using only past values of x is not necessary except in real-time processing; but is conventional and not restrictive in practice. 4 In geophysical-exploration usage, the mappings to the inside and outside of the circle would be reversed. 6.5. Recursive Digital Filters 91 Filters of this type, like the nonrecursive filters of the previous chapter, go by several different names: electrical engineers call them Infinite Impulse Response (IIR) filters, and statisticians call them auto-regressive moving-average (ARMA) systems. If L = 1, so that only the current value of the input is used, the statisticians’ term is that the system is autoregressive (AR). Just as we used the Laplace transform to find the transfer function of a differential equation, we may use the z-transform to find a transfer function for a discrete-time filter. If we take the z-transforms of both sides of equation (6.12), equation, and use the result (6.10) that the z-transform of a delayed sequence is z−m times the transform of the original sequence, the two sides of (6.12) become polynomials in z. The transfer function becomes the ratio of these polynomials: PL−1 l ỹ( z) l =0 b l z (6.13) = P N −1 x̃( z) a zk k=0 k This is actually a general equation for any type of digital filter, recursive or nonrecursive; note that for zero phase shift FIR filters the range of summation will include negative as well as positive values of l . The parallel between (6.12) and (6.13) on the one hand, and the differential equation (6.1) and its transfer function (6.7) is quite intentional: this parallelism shows how to reach our goal of simulating a system with a digital filter. The simulation problem now becomes how to find the coefficients of the filter (6.12), with z-transform (6.13), that will best imitate the behavior described by the differential equation (6.1) with transfer function (6.7). We start with our simplest example, the lowpass RC filter, and begin with the step response. If x is constant and equal to x0 , the solution to the differential equation (6.2) is y = x0 . If at time t = 0, x then becomes zero, the solution for y for t > 0 is y = x0 e−t/τ . An FIR filter cannot reproduce the infinitely long response to a step that this system shows; even approximating it would require a large number of weights. But a very simple recursive filter can give a similar response. This is yn − a yn−1 = xn (1 − a) For x constant this gives a constant y equal to x. If xn is 0 for n ≤ 0, and 1 for n ≥ 1, it is easy to compute y: y0 = 1 y1 = a y2 = a2 ym = a m = e m ln a Chapter 6. Digital Filters II: Recursive Filters and Simulating Linear Systems 92 giving the same exponential falloff as in the analog filter. 
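The same little filter can be run with scipy.signal.lfilter, which implements the difference equation (6.12) directly. In this sketch a is an arbitrary value and the input is a step that switches off, reproducing the exponential falloff just derived:

```python
import numpy as np
from scipy.signal import lfilter

a = 0.9                                     # filter parameter, 0 < a < 1 for stability
x = np.r_[np.ones(50), np.zeros(50)]        # input: on for 50 samples, then off

# y_n - a*y_{n-1} = (1 - a)*x_n  maps to  lfilter(b, a_coeffs, x)
# with b = [1 - a] and a_coeffs = [1, -a].
y = lfilter([1 - a], [1, -a], x)

print(np.round(y[45:55], 4))
# After the input switches off, successive outputs fall by the factor a,
# i.e. y_n ~ a^(n - n_off): the discrete analog of exp(-t/tau).
print(np.round(y[60:65] / y[59:64], 6))     # all equal to 0.9
```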
The transfer function of this digital filter is ỹ( z)(1 − az−1 ) = x̃( z)(1 − a) or ỹ( z) 1−a = x̃( z) 1 − az−1 which has the frequency response (using z = e2π i f ) 1−a (6.14) 1 − ae−2π i f where f is now the digital frequency, running over the range from −1/2 to 1/2 (the range between the Nyquist frequencies). Clearly the analog response (6.3) and the digital response (6.14), though similar, are not the same; the problem we will now address is to devise a recursive filter that will be closer. However, before doing so, we need to discuss the stability criterion for digital filters: what makes them stable or not. If we look at the output of the discrete-time filter for the case given above (equation (6.13), we see that for a less than 1, the output decays when x goes to zero (from one); but if a were greater than 1, the output would increase exponentially. The value of a thus determines if the filter is stable or unstable, the boundary being a = 1, for which the filter is metastable. In the z-plane, the one filter pole is at z = a, so the condition for stability is that the pole is inside the unit circle. This can be shown to be true in general. Just as a continuous-time system, to be stable, must have all poles on the negative real half of the splane, for a discrete-time filter, all its poles must fall within the unit circle. These criteria matter because we want any mapping from the s-plane to the z-plane to preserve stability: the “stable” part of each plane should map only into the stable part of the other. The mapping given by equation (6.11), corresponding to sampling, fulfills this criterion; some others do not. For example, creating discrete-time systems by replacing derivatives by forward differences does not; if we perform this substitution on a stable differential equation, the resulting discrete-time filter may not be stable. A better mapping than (6.11) for system simulation is called the bilinear transformation. To motivate it, we again construct a discrete-time system from the differential equation, but instead of using numerical differences to approximate derivatives, we use numerical approximations to integrals. For example, we can re-express the differential equation for an RC lowpass filter in integral form as: Zt y( t) = ẏ( u) du + y( t 0) t0 6.5. Recursive Digital Filters 93 Figure 6.6: Mapping between the s-plane and the z-plane if the bilinear transformation is used. The right-hand side of the s-plane maps outside the unit circle in the z-plane, and the left-hand side to the inside of the unit circle; in both cases the mapping is one-to-one. The bottom plot shows how the continuous-time frequency f a maps into discrete-time frequency f d : the mapping is one-to-one, but nonlinear, though approximately linear for f a ≪ f n . For equispaced intervals, which we set equal to 1, this becomes y( n) = Zn n−1 ẏ( u) du + y( n − 1) We now use the trapezoidal rule for this integral to write y( n) = 1/2[ ẏ( n) + ẏ( n − 1)] + y( n − 1) The differential equation itself gives an expression for ẏ: ẏ( t) = τ−1 [ x( t) − y( t)]. We may then write the above expression for y( n) as y( n) = 1 [ x( n) + x( n − 1) − y( n) − y( n − 1)] + y( n − 1) 2τ Chapter 6. 
Digital Filters II: Recursive Filters and Simulating Linear Systems 94 which we may, finally, write in the form (6.12) of a recursive filter; using subscripts rather than arguments since we are now working with a sequence, we get ¶ µ ¶ µ 1 1 1 − yn−1 1 + = ( xn + xn−1 ) yn 1 + 2τ 2τ 2τ The transfer function in the z-plane of this discrete-time filter is found in the usual way; we take z-transforms to get ·µ ¸ ¶ 1 −1 ỹ( z) 1 + = x̃( z)[1 + z−1 ] −z 2τ whence the transfer function is ¶ µ 1 1 − z−1 1 + 2τ 1 + z−1 We can equate this with the transfer function for the continuous-time case, equation (6.8), if we make the relationship between s and z 1 − z−1 s=2 1 + z−1 µ ¶ (6.15) which is the bilinear transformation. In the more general case where the sampling interval is ∆, the same derivation may be used to give the same result, with the leading 2 replaced by 2/∆. The inverse mapping from s to z has a similar form, namely z= 1 + 1/2s∆ 1 − 1/2s∆ (6.16) While our derivation used a particular differential equation, using multiple integrals for any constant-coefficient linear differential equation will produce the same mapping. This mapping takes the imaginary axis of the s-plane onto the unit circle in the z-plane; since it maps the entire left half of the s-plane into the inside of the unit circle on the z-plane, it maintains stability. The bilinear transformation thus gives us a design procedure for getting discrete-time filters from continuous-time ones. The steps are: 1. Find the transfer function in the s-plane from the differential equation of the system; this will be a ratio of polynomials in s. 6.5. Recursive Digital Filters 95 2. Perform the bilinear mapping (6.16) to produce a transfer function in the z-plane; reduce this to a ratio of polynomials in z. 3. Get the filter weights from the coefficients of the polynomials, as in the relation between (6.12) and (6.13). There is often a step before step 1: we may have to modify the differential equation to minimize a distortion from the bilinear mapping. Let the frequency of the discrete-time system be f d , so that (equation (6.9)) z = e2π i f d , and of the analog system be f a , so that s = 2π i f a . Then 1 − z−1 = i tan(π f d ) 1 + z−1 so that the bilinear transformation of the frequencies is fd = 1 arctan(π f a ∆) π which maps the entire analog frequency axis, from −∞ to ∞, onto the digital frequencies from −1/2 to 1/2: that is, from the lower to the upper Nyquist frequencies. This warping of the frequency axis, though it avoids aliasing of the filter response, means that we must adjust the time constants of the differential equation so that after applying the bilinear transformation these have the correct frequency, an adjustment known as prewarping. To show how to do this, we design a filter to simulate the polar-motion equation (6.4). If we apply the bilinear transformation (6.15) to the transfer function (6.5), and compute the frequency response of the resulting digital filter, we find that it is ω′c ω′c − (2/∆) tan(π f d ) where we have used ω′c to denote that this parameter is not necessarily the same one as in the actual continuous-time system. For ω c real, the frequency of the resonance in the continuous-time system is at f a = ω c /2π; when the series is sampled at an interval ∆, this becomes the (nondimensional) frequency ω c ∆/2π, providing it is not aliased. In the discrete-time system the resonance will be at µ ′ ¶ ω ∆ 1 f d = arctan c π 2 Chapter 6. 
Digital Filters II: Recursive Filters and Simulating Linear Systems 96 Equating these two digital frequencies gives us the relationship µ ¶ 2 ∆ ′ ω c = tan ωc ∆ 2 for the time constant (in this case a frequency) used in the discrete-time filter as a function of the one actually present in the system. If ω c ∆ ≪ 1, the two are nearly equal; but as this product approaches π more and more correction is needed. In this particular example, the usual sample interval ∆ for p and ψ is 30 days; for ω c = 0.01461rad/day, ω′c = 0.01485, only a 2% change. Having made this correction, we can compute the actual filter weights by the procedure just described. We find that the polynomial in z is ω′c ∆ + ω′c ∆ z−1 (ω′c ∆ + 2 i ) + (ω′c ∆ − 2 i ) z−1 which gives a recursive filter pn (ω′c ∆ + 2 i ) = −pn−1 (ω′c ∆ − 2 i ) + ω′c ∆(ψn + ψn−1 ) for producing simulated polar motion from a possible excitation. C HAPTER 7 D IGITAL F ILTERS III: D IFFERENTIATORS , L EAST-S QUARES , AND M INIMUM -P HASE F ILTERS 7.1 FIR Filters: Other Applications In this chapter we look at some other uses for FIR filters than removing bands of frequencies: differentiation, Hilbert transforms, and linear regression. We also describe the most important class of non-symmetric FIR filters for frequency selection, the minimum-phase filters; these are sometimes important for use in seismology. 7.2 Differentiators and Digital Hilbert Transformers We have mentioned before (Section 2.8) that differentiation can be thought of as a linear system; this makes it appropriate for digital filtering. Equation (2.5) F [ ẋ( t)] = 2π i f F [ x( t)] shows that differentiation is a filter with frequency response 2π i f : that is, a constant phase shift of π/2 (the i part of the coefficient) and an amplitude response that is proportional to frequency. To imitate this digitally, we make the number of weights odd, with N = 2 M + 1, and also make them antisymmetric: w−n = −wn for 97 n = 0, . . . M Chapter 7. Digital Filters III: Differentiators, Least-Squares, and Minimum-Phase Filters 98 Figure 7.1: The amplitude response, and weights, for a differentiating filter, constructed by tapering the weights for an ideal response with a Kaiser-Bessel taper so that the response at the Nyquist is −40dB. where we have centered the filter about 0 rather than term M ; this asymmetry means that w0 = 0. Substituting these weights into the general expression (equation (5.1)) for the frequency response of a FIR filter, we find W ( f ) = −2 i M X wn sin 2πn f n=1 which we may compare with equation (5.3) for a symmetric acausal FIR filter. An antisymmetric filter thus will have exactly the same phase response as differentiation: purely imaginary. To get an amplitude response that approximates differentiation, we take the ideal response to be W (F ) = 2π i f – but, to avoid over-amplifying the energy close to the Nyquist, only up to some cutoff frequency α. That is, we want the filter to have an amplitude response that is zero at zero frequency, increases linearly to the cutoff frequency, and then falls to zero. As with the frequency-selective filters, we can get the filter weights corresponding to this ideal response by taking the inverse Fourier transform of it: Z1/2α Z1/2α Z1/2α wn = W ( f ) e 2π i f n d f = 4 π i f i sin 2π f n d f = −4π f sin 2πn f d f −1/2α = 1 π n2 0 0 Z1/2α 0 2πn f sin 2πn f d (2πn f ) = 1 π n2 Zπnα 0 x sin x dx = α cos(πn) n − sin πnα π n2 7.2. 
Differentiators and Digital Hilbert Transformers 99 Figure 7.2: On the left, half of the weights for a (windowed) Hilbert transform filter with total length of 97. On the right, a time varying series (x3 e− x sin(1.5x2 ), where x = n/10), shown as the black solid line, its Hilbert transform (gray solid line) and the instantaneous amplitude (dashed line). where we have made use of the antisymmetry of f around zero frequency. As with the ideal lowpass filter, this series of weights goes on forever; and, as in that case, we can approximate the ideal response better if we taper the series of weights instead of simply truncating it. Figure 7.1 shows an example. It is also possible to adapt the Parks-McClellan procedure to this design task. Antisymmetric filters may have other amplitude responses. An important case is when there is no frequency dependence at all: an amplitude response equal to one from zero frequency to the Nyquist. This is the digital Hilbert transform filter, whose ideal response is W(f ) =   i  −i for f <0 f >0 Note that this has the opposite phase shift from differentiation. The ideal weights are 2 sin2 ( nπ/2) wn = nπ 100 Chapter 7. Digital Filters III: Differentiators, Least-Squares, and Minimum-Phase Filters Again, actual filters can be designed by using tapers (as was done for the filter in Figure 7.2) or using the Parks-McClellan algorithm. Either of these methods will produce a filter whose amplitude response is only approximately flat over the full frequency band, but whose phase response is, like that of the differentiator, exact, thanks to the antisymmetry of the filter, and its having an odd number of weights. If a Hilbert-transform filter, is convolved with a series it converts all the cosine components into sines and vice versa. This can be useful for a series that is approximately sinusoidal, since then we can create what is called a complex-valued analytic series { x a } = { x} + i { x h } where { xh } is the digital Hilbert transform of the original sequence { x}. From this, we can find the instantaneous amplitude (also called the envelope function) and instantaneous phase; these are the amplitude and phase of each term of { xa }; differentiating the phase gives the instantaneous frequency. Figure 7.2 shows an example for a graduallychanging sinusoid. 7.3 Least-squares Fits as Filters Next we show that parameters estimated through least-squares can be found by filtering the data – and that sometimes, thinking of least-squares fitting as a filter can help us understand what the fit is doing. The basis of this is that the solution to any least-squares problem gives the parameters in terms of a linear combination of the data; but any linear combination of the data is a linear time-invariant system, and hence a filtering operation. More formally, suppose we have a data vector x, which we aim to fit parameters y to by least squares. The equations for y are the normal equations: y = ( A T A )−1 A T x where A is the design matrix.1 Now suppose we want to find a particular linear combination of the parameters, z = b T y; z is a scalar (and might just be one of the parameters). Then the solution is def 1 z = b T ( A T A )−1 A T x = wT x We have assumed that the errors are identical and independent; if they are not the equations are slightly more complicated but we get an equivalent end result. 7.3. 
Least-squares Fits as Filters 101 Figure 7.3: An example of the frequency response of a least-squares fit: in this case, for the slope of daily data fit over a two-year span. where the weights w are given by b T ( A T A )−1 A T , which can be seen to be an N -term vector, N being the number of data. But this is just a digital convolution (albeit with only one output value) – so we may regard the weights as filter weights, and compute the frequency response by taking their Fourier transform. A simple, and common, example is fitting a straight line to data; assume we have 2 M + 1 values, centered about 0. Then if y1 is the value of the fit at n = 0 and y2 is the slope, the design matrix is T A = · 1 1 −M −M + 1 ... 1 1 1 . . . −1 0 1 ... ... 1 1 M −1 M ¸ which, conveniently, means that the matrix for the normal equations is diagonal: · ¸ 2M + 1 0 T A A= 0 M ( M + 1)(2 M + 1) 102 Chapter 7. Digital Filters III: Differentiators, Least-Squares, and Minimum-Phase Filters which in turn makes the weights for finding the slope wn = 3n M ( M + 1)(2 M + 1) for n = − M, . . . M Note that this is quite different from the weights for finding the derivative, which is a local measure of slope: the weights for the fitted slope are largest far from the center, something that is often stated by saying that such parts of the series have more influence. The frequency response, which then is W(f ) = M X 6 n sin(2πn f ) M ( M + 1)(2 M + 1) n=1 so that the frequency response depends on M , the length over which the fit is made. Figure 7.3 shows the frequency response for N = 730: this might be, for example, 2 years of daily geodetic data. What is interesting is that the response actually peaks at a frequency of about 0.3 cycles/year. This is lower than the lowest frequency that is usually resolvable from data, namely one cycle over the record, which would be 0.5 cycles/year. The location of this peak in the response becomes more obvious if we think about what kind of sinusoid “looks like” a straight line: namely, one with about a quarter of a cycle over the range of data. Much lower frequencies look more and more like a constant. 7.4 Minimum-Phase Filters Next, we turn to what are called minimum-phase filters. As mentioned in Section 5.2, sometimes we do not want to use zero-phase filters because they lack causality: if the data have sudden changes (notably from earthquakes) we may not want leakage of energy into times earlier than the time at which the change occurs. But, also, we usually want the filter to be as compact as possible: that is, we would like the filter to spread any impulses or steps over the minimum possible time span. Making the filter compact, but still causal, turns out to be equivalent to minimizing the phase shift. The design of such filters depends on the locations of the roots of the z-transform polynomial; these are, for weights w0 , w1 , . . . w N , W ( z) = NX −1 k=0 wk z −k 7.4. Minimum-Phase Filters 103 where z is complex. This polynomial has N − 1 roots, r 1 , r 2 , . . . r N −1 ; given these roots, the z-transform can alternatively be written as a product of degree-one polynomials: W ( z) = w0 NY −1 k=1 (1 − r k z−k ) with the w0 providing appropriate scaling. For common choices of the weights wn , there are a number of “symmetry” relationships that apply to the roots r k . If the weights are real, the roots r k must either be real, or occur as complex conjugate pairs. 
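These symmetry relationships are easy to check numerically. The following sketch (Python/NumPy, with weights chosen purely for illustration) finds the roots of the z-transform of a real, symmetric filter and verifies both the conjugate pairing just described and the reciprocal pairing derived next.

```python
import numpy as np

# Real, symmetric FIR weights, chosen only for illustration.
w = np.array([1.0, -2.0, 5.0, -2.0, 1.0])

# W(z) = sum_k w_k z^{-k} has the same roots as the ordinary polynomial
# w_0 z^{N-1} + w_1 z^{N-2} + ... + w_{N-1}, so np.roots applies directly.
r = np.roots(w)                       # N - 1 = 4 roots
print(r)

# Real weights: the set of roots is closed under complex conjugation.
print(np.allclose(np.sort_complex(r), np.sort_complex(r.conj())))   # True

# Symmetric weights: the set of roots is also closed under r -> 1/r
# (reciprocal pairs), unless a root lies on the unit circle.
print(np.allclose(np.sort_complex(r), np.sort_complex(1.0 / r)))    # True
```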
If the filter is symmetric, with an odd number N = 2 M + 1 of weights, then if r k is a root, so is r l = 1/ r k : the roots occur as reciprocal pairs unless they are on the unit circle. To see this, write the z-transform as W ( z) = 2X M k=0 wk z −k =z −M " wM + M −1 X k=0 wk ( z M −k +z k− M # ) = z−2 M W ( z−1 ) so that if W ( r k ) = 0, then W (1/ r k ) = 0 as well. Combining these results, we have that for real and symmetric weights (the commonest case), the roots of W ( z) occur in reciprocal pairs plus single roots on the unit circle, and all pairs (or singlets) not purely real occur as complex conjugate pairs. If the weights are real and symmetric, the response on the unit circle is " W ( f ) = e−2π f M w M + 2 M −1 X k=0 # wk cos 2πk f = e−2π f M Wa ( f ) where Wa ( f ) is the amplitude response, which is purely real. If there are single roots of W ( z) on the unit circle, there must be values of f for which Wa ( f ) ≤ 0. Conversely, if Wa ( f ) > 0 for all f , there can be no roots on the unit circle: all roots occur as reciprocal pairs. Quite commonly, Wa ( f ) is nonnegative, but equal to zero for particular values of f ; these values of f then correspond to double roots of W ( z), for which the root and its reciprocal are equal. We have discussed the locations of the roots in such detail because of some less obvious results that relate to the goal of obtaining a compact filter. Suppose some sequence wn , with z-transform W ( z), has a root within the unit circle; without loss of generality we can take this to be the “last” 104 Chapter 7. Digital Filters III: Differentiators, Least-Squares, and Minimum-Phase Filters root r N −1 . Now consider another z-transform polynomial ′ W ( z) = W ( z) z−1 − r ∗N −1 1 − r N −1 z−1 def = W ( z) A ( z) (7.1) The form of the term A ( z) means that for the polynomial W ′ the root at r N −1 has been divided out of the polynomial for W ( f ) and replaced by its reciprocal complex conjugate, which will be outside the unit circle. For z = e−2 iπ f , ¯ ¯ ¯ −1 1 − r ∗N −1 z ¯2 2 ¯ ¯ =1 | A ( Z )| = ¯ z 1 − r z−1 ¯ n which means that the amplitude responses |W ( f )|2 and |W ′ ( f )|2 are identical. For this reason filter expressed by the A ( z) polynomial is termed an allpass filter; note that this is a recursive filter. We can use the factorization (7.1) to get a relationship between the two sets of weights, wk and w′k , that have the responses W ( z) and W ′ ( z). Let V ( z) be a polynomial of degree N − 2, obtained by factoring out the last root (this is called deflation of the polynomial). Then we can rearrange equation (7.1), to get W ( z) = V ( z)(1 − r N −1 z−1 ) and W ′ ( z) = V ( z)( z−1 − r ∗N −1 ) But these polynomial relationships imply the following expressions for the coefficients of W and W ′ : wk = vn ∗ (1, − r n ) = vk − r N −1 vk−1 w′k = vn ∗ (− r ∗n , 1) = − r N −1 vk + vk−1 Next, we compute the difference of the sums of squares of the weights up to the penultimate term. Most of the terms in the sum cancel, giving NX −2 k=0 |wk |2 − |w′k |2 = (1 − |r N −1 |2 )|v N −2 |2 Because |r N −1 | < 1 (by assumption, we chose a root with this property), this sum has to be positive. This means that the sequence wk has more energy (measured by the sum of squared amplitudes) concentrated before the last point than the sequence w′k does. This result can be extended to show that, among all FIR filters with the same amplitude response |W ( f )|2 , the one whose z-transform has all its 7.4. 
Minimum-Phase Filters 105 roots on or inside the unit circle will have the most concentration of energy towards the early terms of the sequence; that is, m X k=0 |wk |2 will be maximized for all m < N − 1. (For m = N − 1 the sum is the same, being the filter response at f = 0.) This choice of roots thus makes the filter as compact as possible. Additional results follow if we express W ( f ) as amplitude and phase: W ( f ) = Wa ( f ) e−φ( f ) . The phase of W and W ′ are related, by equation (7.1), though the phase of the allpass term in equation (7.1). If we write the root in polar form as r N −1 = ρ e2π iθ , we find that its phase is the phase of e −2π i f µ 1 − ρ e2π i( f −θ) 1 − ρ e−2π i( f −θ) ¶ which gives a phase φ( f ) = −2π f − 2 arctan · ρ sin 2π( f − θ ) 1 − ρ cos2π( f − θ ) ¸ (7.2) A more interesting quantity is the group delay D ( f ), which defines the delay energy in a signal with frequency f (exactly analogous to group velocity in wave propagation), which is given by the derivative of the phase. If the phase is given by ¶ µ S( f ) φ( f ) = arctan C( f ) the group delay is given by D( f ) = 1 S ′ ( f )C ( f ) − C ′ ( f ) S ( f ) 1 dφ = 2π d f 2π S2( f ) + C2( f ) which when applied to (7.2) gives D( f ) = −1 − ρ 2 1 + ρ 2 − 2ρ cos 2π( f − θ ) which is negative for ρ < 0 (the root inside the unit circle) so that W ′ ( f ) has a greater delay, and a larger (more negative) phase than W ( f ). The sequence with all the roots inside the unit circle is thus not only the most 106 Chapter 7. Digital Filters III: Differentiators, Least-Squares, and Minimum-Phase Filters Figure 7.4: Weights for two filters with the same (lowpass) frequency response: on the left, a symmetric filter, on the right, its minimumphase equivalent. Both are shown as being applied causally. compact; of all the filters with a given amplitude response, it minimizes group delay, and makes the phase lag as small as possible. Such sequences are therefore usually referred to as providing minimum-phase FIR filters. To design such filters requires, besides approximating the desired frequency response, that the roots of the z-transform all fall inside of or on the unit circle in the complex plane. There are two different ways to accomplish this. One, suggested by equation (7.1), is to replace all roots outside the unit circle with their reciprocal complex conjugates, leaving roots on the unit circle in place. This procedure is called the allpass decomposition, because of the allpass term in equation (7.1). This equation, generalized to contain as many such factors as there are roots outside the unit circle, becomes the result that any FIR filter can be expressed as the convolution of a minimum-phase FIR filter and an allpass filter. Note that the allpass filter itself is not a FIR filter, but would need to be computed recursively; this is acceptable for correcting for the previous application of a non-causal FIR filter to data (Scherbaum and Bouin, 1997). A second approach, called spectral factorization, depends on the result that, for symmetric filters, if Wa ( f ) ≥ 0 on the unit circle, then all the roots can be paired into reciprocals, with double roots on the unit circle. The factorization consists of taking all the roots inside the unit circle and half of the roots on it, and constructing the corresponding sequence.2 The 2 A good description of the procedure, and some pitfalls, is given in Orchard and Willson (2003) – though there have been difficulties with the Newton’s method procedure given 7.4. 
Minimum-Phase Filters 107 p resulting filter will have an amplitude response |W ( f )| = Wa ( f ), which therefore requires that the initial filter have a nonnegative amplitude response, W ( f ) ≥ 0. Such a filter can be created using the Parks-McClellan algorithm if we constrain the stopband response to oscillate around the error level ǫ, rather than (as is more usual) zero. Figure 7.4 shows an example of filter weights for a zero-phase and minimum-phase filter with the same amplitude response.3 It is obvious that the symmetric (linear-phase) filter, applied as a causal filter. is less compact than the minimum-phase filter applied the same way. there. 3 This is one of several filters combined to produce 5-minute data from 1-Hz data recorded with the strainmeters installed by the Plate Boundary Observatory. For more details, see Agnew and Hodgkinson (2007), C HAPTER 8 S TOCHASTIC P ROCESSES 8.1 Introducing Ordered Random Data We now turn to sequences of random data: that is, data in which (1) the order of the values is important, and (2) the data are random. The first attribute is the same kind of specification that we have been dealing with in signal processing: we have a sequence of data x1 , x2 , x3 , . . . xn , which we will often call, as statisticians do, a time series. As before, we assume that the index n denotes a time t = n∆ at which the observation was made: we are again sampling uniformly in time at an interval ∆. Obviously we could equally well be treating a data set sampled in space or, much more rarely, some other variable. For theoretical purposes it is sometimes useful to be able to treat time as a continuous variable, and to work with functions x( t); but as we will see, if we do this for a random variable things can become mathematically extremely difficult without some further quite severe simplifications. We will, as before, often want to consider infinite sequences, maybe infinite in both directions. The time series of data { xn } is modeled by a stochastic process { X n } or { X ( t)}. Such a process is a family of random variables indexed by the integer n (when it is a discrete process) or by the real number t (when it is a continuous process).1 What do we mean when we say “random variable”? The conventional variables we have been dealing with up to now (and which we did not, up to now, need a special name for) were a particular kind of mathematical entity that obeyed certain rules. Random variables are also mathematical entities, with different rules. A. A conventional variable is assumed to have a definite value, although we may not know it: it might, for example, be integer-valued, real1 Notation alert: random variables or functions will normally be denoted by uppercase letters, but not all uppercase letters denote random variables. 108 8.1. Introducing Ordered Random Data 109 valued, complex, or a vector. A random variable does not have a definite value; rather, it is described by its probability distribution function (pdf), giving the probability density over some range. The range, and any parameters of the function used for the pdf, are however conventional variables. B. Conventional variables obey the rules of algebra (arithmetic): we can add, subtract, multiply, or (for real-valued ones) divide them. Formally, we can perform the same operations on random variables, but what happens when we do is described by a new set of rules. 
For example, the sum of two random variables is another one: but the pdf of this sum is not the sum of the pdf’s of the two variables that have been added. Conventional variables are appropriate models for things we want to represent as having a definite value: for example, data values that we have measured or otherwise acquired. This makes them appropriate for signal processing: we assume the signal has some definite value. The terms of a sequence may not be equal to a known number (that is why they are variables, after all), but we treat them as though at some point they will. In contrast, random variables are appropriate models for things that are somehow uncertain. We might be using a random variable to model the outcome of some experiment, in which case we can think of it as one of an infinite collection of values from which the experiment will extract a single one. Or it might be used to model something about which we are simply uncertain. A key point is that the choice to model something by a random variable is just that: a choice of what mathematical entity we decide is appropriate to what we are studying. There are no absolute rules for this, and it may well be that, depending on what different models are for, we might treat the same thing as a conventional variable in one, and as a random variable in the other. One way to think about a given series of random variables is as being the result of an experiment that could be repeated as many times as we like. We say that each repetition produces one realization, drawn from an ensemble of series that would be produced if we made many repetitions; in that sense the ensemble is generated by the underlying random process. So we can think of such operations as finding the expected value at a particular n or t, as the average, over an infinite number of realizations, of the variable for that n or t. Figure 8.1 shows realizations of a several stochastic processes. Chapter 8. Stochastic Processes 110 Figure 8.1: Five realizations of four different stochastic processes: from left to right, lowpassed white noise, a random walk, a randomly modulated sinusoid, and f −1, or “flicker” noise. Even in discrete time the general stochastic process is horribly complex. A single random variable is specified by a single pdf; but a set of them is specified completely only if we have the joint pdf of every element X n with every other element. Suppose all the elements have a Gaussian distribution. Then for a sequence of length N the joint pdf is a function φ of an N -vector X = ( X 1 , X 2 , . . . X N )T φ(X) = 1 (2π) N /2 det(C ) exp[−1/2(X − x̄)T C −1 (X − x̄)] (8.1) This expression contains N parameters x̄ ∈ |ℜN |, the vector of mean values, and C ∈ |ℜN ×N |, the covariance matrix. This is a symmetric, positive definite matrix that describes the correlations among the random variables X n , and has 1/2 N ( N + 1) values. So, to describe completely these N random variables, we need a total of 1/2 N ( N + 3) parameters, far more than we are likely to have observations: even a short time series of 50 numbers would need 1325 parameters to be fully specified, and for a reasonably long data series of 2000 terms, we would need more than 2 million parameters. 
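Realizations like those in Figure 8.1 are straightforward to generate by filtering or transforming white noise. The sketch below uses simple recipes (a moving average for the lowpassed noise, a cumulative sum for the random walk, frequency-domain shaping for the flicker noise); these are illustrative choices, not necessarily the ones used to make the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
t = np.arange(N)

def realization(kind):
    w = rng.standard_normal(N)                   # Gaussian white noise
    if kind == "lowpassed":
        return np.convolve(w, np.ones(16) / 16, mode="same")
    if kind == "random walk":
        return np.cumsum(w)
    if kind == "modulated sinusoid":
        # sinusoid with a slowly varying random amplitude
        amp = 1 + 0.5 * np.convolve(w, np.ones(64) / 64, mode="same")
        return amp * np.sin(2 * np.pi * 0.05 * t)
    if kind == "flicker":
        # shape the spectrum so power ~ 1/f (amplitude ~ f^{-1/2})
        W = np.fft.rfft(w)
        f = np.fft.rfftfreq(N)
        f[0] = f[1]                              # avoid dividing by zero
        return np.fft.irfft(W / np.sqrt(f), n=N)
    raise ValueError(kind)

# five realizations of each of the four kinds of process
ensemble = {k: [realization(k) for _ in range(5)]
            for k in ["lowpassed", "random walk", "modulated sinusoid", "flicker"]}
```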
If we instead consider continuous time the problem is almost completely intractable, and such a general treatment has no practical value because it would be impossible to make estimates of the necessary parameters, even if the mathematics were possible – and this turns out to be very hard for the general case (see Priestley (1981), Chapter 3). 8.2. Stationary Processes and Autocovariance 111 8.2 Stationary Processes and Autocovariance Since the perfectly general stochastic process is much too general to be useful, we follow the conventional approach of limiting ourselves to a much more restricted class of random processes: those called stationary processes. These have the property that, while the actual observables vary over time, the underlying statistical description is time invariant. This restriction can be weakened a bit, but we will consider only these so-called completely stationary processes. Assuming stationarity (as this restriction is called) has the gratifying effect of reducing the number of parameters needed to a manageable number. For example, the mean value of a stationary process is E [ X n ] = x̄ where E stands for “the expectation of”; we could call this the expectation operator, which converts a random variable into a conventional one. For a stationary process the mean is one number: a constant independent of n, or of time t if the process is continuous. It is often assumed the mean is zero, since it is a trivial operation to remove or add the mean value into a time series, if necessary. Of course, many observational series do not look as if the mean is constant – there may be a secular trend, as there was in the first example we gave in Chapter 1. Stationarity is such a powerful and useful property that one often attempts to convert an evidently nonstationary series into a stationary process, for example, by fitting and removing a straight line trend, or forming a new stationary sequence by differencing: Yn = X n+1 − X n The variance of a single random variable is defined as σ2X = V [ X n ] = E [( X n − x̄)2 ] which defines the variance operator V ; again, this converts from a random to a conventional variable. With stationarity σ2X must also be independent of n (or t). But most importantly, the covariance between any two random variables in the sequence cannot depend on where we are in the series; it must look like def C [ X m , X n ] = E [( X m − x̄)( X n − x̄)] = R X ( m − n) (8.2) Chapter 8. Stochastic Processes 112 which is to say that the covariance can be expressed by a function R X , whose argument is the interval between the two points. (Note that though we use a capital letter for this it is not itself random). The function R X is called the autocovariance. For continuous processes equation (8.2) is usually written C [ X ( t), X ( t + τ)] = R X (τ) (8.3) and τ is called the lag. Observe that by definition R X (0) = σ2X Also notice that, because of stationarity, one can set s = t − τ in (8.3) and the result will be the same, since the answer is independent of which time was selected. This yields: R X (−τ) = R X (τ) which shows that the autocovariance function is an even function of the lag. A stationary process does not contain information on which way time is flowing – it is the same if time is reversed. For a stochastic process with a Gaussian pdf as in (8.1), we see that in place of the vector x̄ of mean values we have a single number. How about the covariance matrix? 
You may recall that the j - kth entry of C is def C jk = C [ X j , X k ] = R X ( j − k) and so, for a stationary process, the covariance matrix has the same values on all its diagonals:   R X (0) R X (1) R X (2) R X (3) ... R X (N )  R (1) R X (0) R X (1) R X (2) . . . R X ( N − 1)    X   R X (1) R X (0) R X (1) . . . R X ( N − 2)  C =  R X (2)     ... ... ... ... ... ... R X ( N ) R X ( N − 1) R X ( N − 2) R X ( N − 3) . . . R X (0) This type of matrix is called a Toeplitz matrix. So we only need N + 1 parameters, instead of 1/2 N ( N + 3) parameters, to describe the pdf of the process. You should note that just because a stochastic process is composed of random variables, this does not mean it is completely unpredictable. Recall the correlation coefficient of two random variables: ρXY = p C [X , Y ] V [ X ]V [Y ] (8.4) 8.3. White Noises and their Relatives 113 From (8.3) we find the correlation coefficient between any two points in a sequence in continuous time: ρ (τ) = R X (τ) σ2X with a similar result for discrete processes. The function ρ is called the autocorrelation function; it has the advantage over the autocovariance that the variance has been removed from the definition, so that for any process −1 ≤ ρ ≤ 1. Unless the autocovariance function is zero for the lag τ, there is some correlation between the value X ( t), and the later value X ( t + τ); we can predict something about one given the other. So far we have considered the average of the variables, and also averages of the products of variables, ( X ( t 1 ) − x̄)( X ( t 2 ) − x̄). Such averages are examples of moments of the process. The second-order moments are the ones involving products, and provide the variance and the autocorrelation function (or the autocovariance, which includes both), nothing else. We can define higher order moments, for example the third-order moments that come from the product of three X ’s. But if the process is based on a Gaussian pdf, we have seen that the complete pdf is specified by the second-order moments alone: all the higher-order moments, if we wanted them, can be found from these second-order moments. And even if the pdf is not Gaussian, a great deal can be learned about a process from its second-order moments; the higher order moments are rarely studied–and we will not do so. 8.3 White Noises and their Relatives Let us look a few concrete examples of stationary stochastic processes. The simplest one is what is called white noise. This is defined as a stationary process in which the random variable at any time (or sequence number) is independent of the variable at any other time.2 It is commonly assumed that the mean value is zero. In the discrete case we have RW ( n) = σ20 δn0 where δ jk is the Kronecker delta symbol. In this case the random process 2 More precisely, that the correlation is zero; for non-Gaussian variables this can be true even if the variables have some dependence. But this gets into the matters of higher moments that we just promised to ignore. Chapter 8. Stochastic Processes 114 Figure 8.2: Three noises and their pdf’s. All have the same variance σ2 . The top two are white noises; the random telegraph turns out to be slightly nonwhite. really is unpredictable: we learn nothing about the value at one time from knowing the value at another. However, the values are still not completely unpredictable; we can say something about the range of values to be expected based on the known variance, σ20 . 
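The sketch below makes these ideas concrete: it checks that the estimated autocorrelation of a white-noise sample is nearly a delta function at zero lag, and then builds the Toeplitz covariance matrix for an arbitrarily chosen, exponentially decaying autocovariance and draws one realization of the corresponding zero-mean stationary Gaussian process.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 512

# Discrete white noise: estimated autocorrelation ~ delta at lag 0.
sigma0 = 1.5
W = sigma0 * rng.standard_normal(N)
rho = np.correlate(W, W, mode="full")[N - 1:] / (W @ W)   # lags 0..N-1
print(np.round(rho[:4], 2))          # close to [1, 0, 0, 0]

# A stationary, non-white Gaussian process: choose an autocovariance
# R_X(k) = sigma^2 a^|k| (an illustrative choice), build the Toeplitz
# covariance matrix C_jk = R_X(j - k), and draw one realization.
sigma2, a = 2.0, 0.9
lags = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
C = sigma2 * a ** lags               # constant along each diagonal
x = np.linalg.cholesky(C) @ rng.standard_normal(N)

# Neighboring values of x are strongly correlated, unlike the white noise.
print(np.round(np.corrcoef(x[:-1], x[1:])[0, 1], 2))      # near a = 0.9
```

Note that only the first row of C (that is, R_X(0) through R_X(N − 1)) had to be specified: the N + 1 parameters mentioned above.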
For continuous processes, it turns out that a white noise is singular because the variance at any point must be infinite: RW (τ) = σ20 δ(τ) (8.5) But these definitions do not completely specify the stochastic process. At any particular time t (or index n) X is a random variable with a pdf; that pdf will be the same for every t, but so far we have not specified it. Of course, the most common choice is the Gaussian, for which φ( X ) = 1 p σ0 2π 1 2 e− /2( X /σ0 ) Because of the statistical independence, the joint pdf is φ( X 1 , X 2 , X 3 , . . .) = φ( X 1 )φ( X 2 )φ( X 3 ) . . . (8.6) This is Gaussian white noise which is shown in Figure 8.2. We could allow any other pdf for X n ; suppose we choose a uniform distribution: φ( X ) = b−1 Π( X / b) so that V [ X ] = b2 /12. This is a different kind of white noise, as is easily seen by comparing the top two panels in Figure 8.2. The joint pdf is given 8.3. White Noises and their Relatives 115 by (8.6) again, with the appropriate choice of φ. We have actually encountered this kind of white noise already, since it is what quantization noise (Section 4.7) looks like. If the series varies by much more than ±1 between samples, the value recorded is the true value of the measurement plus an unpredictable (that is, random) amount that is distributed uniformly over the interval (−1/2, 1/2). Since the consecutive values of the rounding error are uncorrelated, the recorded series appears to be the true signal with a white noise added. The noise is zero mean and its variance is 1/12. If the last digit doesn’t change very often in the series, then the round-off noise added is no longer uncorrelated; but then the signal is most likely not being recorded with enough accuracy. Another white series with a limited range is the random telegraph signal: this one switches discontinuously between two values, say 0 and 1, at random times: φ( X ) = 1/2[δ( X ) + δ( X − 1)] Here the mean value is not zero but 1/2, and the variance is a 1/4. The random telegraph signal is used as a calibration signal for seismometers or other instruments since it is easy to generate electronically. A zero mean version of it has sometimes been suggested as a model for the Earth’s dipole moment over time scales of 105 to 106 years, but this is actually very implausible because the moment is far from constant between reversals. The bottom panel of Figure 8.2 shows a sample. These three examples of white noise are clearly different to the eye. Somewhat remarkably, if they are normalized to the same variance and converted into a sound track, they sound identical – the ear doesn’t have a very good density-function discriminator. These three sequences were not observational; they were made with a random number generator. We can obtain other kinds of stochastic sequences by filtering white noise in various ways; the series in Figure 8.1 were generated that way. Note that as consequence of the Central Limit Theorem. filtered white noise tends to have a Gaussian distribution. Many physical processes can be thought of as resulting from filtering applied to some stochastic process, so Gaussian distributions are often assumed in a signal, and are quite often. though not always, observed too. A common exception, an example of which we will see later, is marine magnetic anomalies; these are usually heavy in the tails compared with a Gaussian distribution. Suppose we apply a FIR filter, of any type, to white noise; that is, we Chapter 8. 
Stochastic Processes 116 convolve the white noise sequence W j with a finite number of weights wk : Yn = K X k=1 wk Wn−k Assume for simplicity that the mean of the white noise is zero. Then we can calculate the autocovariance of the new sequence using the definition (8.2): R Y ( l ) = C [Yn + l, Yn ] = E [Yn+l Yn ] " # K K X X =E w j Wn+l − j wk Wn−k j =1 = = k=1 K X K X w j wk E [Wn+l − j Wn−k ] j =1 k=1 K X K X j =1 k=1 w j wk σ20 δl − j+k,0 The delta symbol vanishes except when l − j + k = 0 or j = l + k; so R Y ( l ) = σ20 K X k=1 wk wk+l = σ20 wk ∗ w−k And so we see that the previously uncorrelated (zero covariance) white noise Wk has become correlated. Notice that R Y is zero for |l | > K . You are invited to verify that the same result is obtained for the continuous time version: if Y = w∗W then R Y = σ2 w( t) ∗ w(− t) It is possible to produce an even wider range of stochastic processes if we use recursive filters. For example, suppose our recursive filter is yn = yn−1 + xn which is about as simple as you could ask for. If we apply this to a whitenoise sequence X n , the resulting Y is a process known as a random walk 8.4. Examples from the Real World 117 Figure 8.3: On the left, a time series of water height on the shore of the Salton Sea, showing a seiche. The right plot shows the correlation between data at two times eight minutes apart. or a Brownian process.3 This process is actually nonstationary, with a variance that grows with time. A great deal of the statistical literature on time series is devoted to processes defined by filters; in Chapter 7 we introduced the terms autoregressive (AR) and autoregressive/moving-average (ARMA) as names for different kinds of recursive filters; and from these come the names AR and ARMA processes. The utility of this approach for geophysical data, outside some special cases, seems to be limited, and we shall not speak of it again. 8.4 Examples from the Real World Instead, we look at some examples of actual observations which a stationary process might be an appropriate model for. Our first example is the water height at the edge of a lake4 over several hours. The water rises and falls in response to the wind, and because of standing waves in the basin, there are well-defined oscillations, known as seiches. Figure 8.3 shows nearly 10 hours of data. Note that the mean is not zero; on the other hand, the zero level is purely arbitrary. and there is no reason not to make the 3 The “Brownian” name comes from the applicability of this process to explaining the motion of small particles that are moved about by the kinetic motions of (much smaller) molecules, something known as Brownian motion because it was first observed. in pollen, by the microscopist Robert Brown. 4 A very large, shallow one: the Salton Sea Chapter 8. Stochastic Processes 118 Figure 8.4: Magnetic anomaly profile over the eastern Pacific Ocean. A standard geomagnetic model has been removed, so that the mean values are nearly zero. The sample interval (in space) is 0.35 km. mean zero. Next notice the oscillations, which are not completely regular, but nonetheless suggest periodicity. This definitely looks like a stochastic process, but one might be skeptical that it is stationary, given the amplitude increase at about 210 minutes. However, we will stick with that model because it is so much more mathematically tractable than the alternative. Do these data look like a good approximation to a white noise? 
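One way to begin answering is with the kind of lagged-product estimate sketched below; applied first to simulated filtered white noise (weights chosen arbitrarily), it also provides a numerical check of the result derived above for the autocovariance of filtered white noise.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0
w = np.array([0.5, 1.0, -0.3, 0.2])           # arbitrary FIR weights
N = 200_000                                    # long series -> good estimate

W = np.sqrt(sigma2) * rng.standard_normal(N)   # white noise
Y = np.convolve(W, w, mode="valid")            # filtered noise

# Estimated autocovariance of Y at lags 0..5 (lagged-product averages)
lags = np.arange(6)
Ry_est = np.array([np.mean(Y[: len(Y) - l] * Y[l:]) for l in lags])

# Theory: R_Y(l) = sigma2 * sum_k w_k w_{k+l}, zero for |l| >= K
Ry_theory = np.array([sigma2 * np.sum(w[: len(w) - l] * w[l:]) if l < len(w) else 0.0
                      for l in lags])

print(np.round(Ry_est, 3))      # the two should agree closely
print(np.round(Ry_theory, 3))
```

The same lagged-product computation, applied to the lake-level record, is what underlies the scatter-plot check described next.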
If they did, values at one time would not be related, to values at other times. That looks improbable to the eye. If we draw a scatter plot (Figure 8.3) where the observed height at one time is plotted against the height 8 minutes later, we see what looks like a very clear correlation; and indeed, the correlation coefficient estimated using (8.4) is 0.626. We used 1293 data values in finding this; the probability that ρ would be this big in an uncorrelated Gaussian sample5 is less than 10−180. Perhaps surprisingly, the distribution of the data is almost perfectly Gaussian. Our second example of a geophysical data sequence is a series of values in space: the magnetic field measured on an aircraft flying 7 km above the south-eastern Pacific Ocean. Figure 8.4 shows the vertical component Z , and the horizontal components X , which is along the flight path. As with the lake-level data, we see an irregular line with a certain amount of order. A stochastic process seems like a good model, but there seems to be little evidence of a regular oscillation, or even an irregular one. A feature to notice here is that the two components appear to be related, varying together in some fashion: there is a lag and perhaps a suggestion of differentiation of Z to obtain X . We will discuss later how to look at pairs of time series for evidence of this kind of relationship. 5 Found using the t test; t N −2 = 28.8. 8.4. Examples from the Real World 119 Figure 8.5: Left: estimated autocorrelation function for magnetic component Z. Right: pdf of Z data (gray) and of a Gaussian distribution with the same mean and variance (black). Concentrating for the moment on the Z component, let us look at an estimate of the autocorrelation function, shown in Figure 8.5. We do not describe yet how that estimate was made; that will be the subject of a later lecture. For now notice how the R Z dies away monotonically from one. This means that neighboring values are likely to be very similar indeed, but as one separates two samples in space, their correlation fades away so that by a lag of 30 samples (about 100 km), they are uncorrelated. It is easy to believe this series could be generated by applying a suitable filter to white noise. The histogram of the data shows something mentioned earlier: the magnetic anomaly data is somewhat non-Gaussian, being asymmetric, with a large positive tail and a compressed lower tail. A test for Gaussian behavior shows that Gaussian variables would depart this much from a true Gaussian about 18% of the time: not a resounding rejection of the Gaussian model, but a reason to be suspicious of it. The reason for this commonly observed behavior is not understood. The final example (Figure 8.6) is a bathymetry profile across one side of the East Pacific Rise: another example of a spatial series, though this would be a true time series if we used the age of seafloor rather than distance. Here the data are very obviously not stationary, because the seafloor becomes deeper with age: as a realization of a stationary process, this is a miserable failure. But if we remove a average trend, (the line shown), we get a much more satisfactory-looking approximation to a stationary series, as we see from the gray line in Figure 8.6. Now in fact we should 120 Chapter 8. Stochastic Processes Figure 8.6: The upper plot shows bathymetry along a profile perpendicular to the East Pacific rise, along with a linear trend and a funcp tion where depth increases as t. 
The bottom plot shows the “depth anomaly” found by removing these nonstationary features. have removed, not a straight line but a parabola, because of the famous p t approximation to depth changes caused by the cooling of the plate as it ages. The least-squares best-fitting parabolic curve is also shown; it fits the observations slightly better than the straight line does, and produces a residual that is the black curve in Figure 8.6. Here we have done something very common (we saw another example in the sea-level data in Chapter 1), which is to model the observations as a steadily evolving part, plus a random, stationary part. Unusually in this example, we have a good model for the evolving piece of the model; normally we would just take a straight-line trend to represent that. The pdf of the depth anomaly is again not quite Gaussian – the distribution is heavy tailed, but not enough for a test to reject it as not being Gaussian. A PPENDIX A C OMPUTING THE D ISCRETE F OURIER T RANSFORM A.1 Introduction As we noted in Chapter 3, the nominal form of the Discrete Fourier Transform (equation 3.8), which is x̃k = NX −1 xn e2π ink/N for n=0 k = 0, . . . N − 1 does not well express how to actually perform the computation. First of all, as noted in that chapter, we would first compute the exponents using e2π ink/N = e−2π i(nk)mod N /N ; then the formula above would imply doing N “operations” (additions and multiplications) to get a single x̃k – and we would then do this N times to get all the DFT values, for a total of N 2 operations. The Fast Fourier Transform gives us the same result with many fewer operations. A.2 The Basis of the Fast Fourier Transform The first discovery of an FFT algorithm was by Gauss in 1805 Heideman et al. (1985); like all the others before Cooley and Tukey’s, it was not noticed. In fact, these algorithms are the same, at least if Gauss’ procedure is updated to use complex exponentials. We now outline how it works. The key to the whole procedure lies in assuming that N , the series length, be a composite integer, so that we can write N = N1 N2 , where N1 and N2 are both integers. We will show how to compute the DFT in fewer operations for this case; and then we will assume that N1 is also composite, apply the same procedure, and continue until we run out of factors. The simplest algorithm is for all the composite factors of N to be the same size, 121 122 Appendix A. Computing the Discrete Fourier Transform and as small as possible – which amounts to requiring that N be a power of two. We divide the N -length sequence of data xn into N2 sequences of length N1 , and the N -length sequence of DFT values x̃k into N1 sequences of length N2 . Then we re-index the subscripts for the two sequences as n = N2 n 1 + n 2 and where n 1 , k 1 = 0, . . . N1 − 1 and k = k 1 + N1 k 2 n 2 , k 2 = 0, . . . N2 − 1 We apply this indexing to the e2π ink/N in the DFT expression: e−2π i(k1 n1 N2 +k1 n2 )/N e−2π iN1 k2 n2 /N e−2π ik2 n1 and note that the last exponential is always one. 
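The re-indexing can be verified numerically. The sketch below evaluates the resulting double sum (written out compactly just below) with explicit, deliberately slow loops, and compares the answer with a library FFT; it is meant only to expose the bookkeeping, not to be efficient.

```python
import numpy as np

def dft_split(x, N1, N2):
    """DFT of a length N = N1*N2 sequence via the index split
    n = N2*n1 + n2,  k = k1 + N1*k2 (explicit loops, for clarity only)."""
    N = N1 * N2
    xt = np.zeros(N, dtype=complex)
    for k1 in range(N1):
        for k2 in range(N2):
            s = 0j
            for n2 in range(N2):
                # inner DFT of length N1 over the subsequence x[n2::N2]
                inner = sum(x[N2 * n1 + n2] * np.exp(-2j * np.pi * k1 * n1 / N1)
                            for n1 in range(N1))
                # remaining exponentials in n2
                s += inner * np.exp(-2j * np.pi * k1 * n2 / N) \
                           * np.exp(-2j * np.pi * k2 * n2 / N2)
            xt[k1 + N1 * k2] = s
    return xt

x = np.random.default_rng(3).standard_normal(12)
print(np.allclose(dft_split(x, N1=3, N2=4), np.fft.fft(x)))   # True
```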
Then the full DFT expression becomes # " NX 1 −1 2 −1 NX xn e−2π i(k1 n1 N2 +k1 n2 )/N e−2π iN1 k2 n2 /N x̃k1 +N1 k2 = n2 =0 n1 =0 which we can write more compactly as x̃k1 +N1 k2 = NX 2 −1 S k1 n2 e−2π iN1 k2 n2 /N n2 =0 where S k1 n2 is the sum in brackets: we can view this as an N1 by N2 matrix, each of whose terms takes N1 operations to compute.1 Each of the columns of this matrix is just the result of performing a DFT on a sequence of length N − 1 to get N − 1 coefficients; So the total operations count to get all the elements of S is N1 ( N1 N2 ) = N12 N2 . Then to get the final result requires the second (outer) sum, multiplying the S matrix into the N2 -length vector represented by the final exponential to get a second vector of length N1 – which we have to do for N2 values of k 2 . The operations count for each matrix multiplication is N1 N2 ; doing it N2 times makes the total operations count for this step N1 N22 . The total number of operations is the sum: N12 N2 + N22 N1 and remembering that N = N1 N2 , this is N2 + N · N2 N2 1 Actually 2 N1 operations if we count both additions and multiplications as operations; but here and in the rest of the derivation we ignore expressions that do not depend on N . A.3. Computing DFT’s of Real Series 123 which for N large and N2 small would be less than N 2 ; for N2 = 2 it would be 1/2 N 2 + 2 N . But now suppose that N1 is composite, with N1 = N2 N3 , which means that N = N3 N22 . then we can decompose each of the N1 -length sequences in the same way, and use the same decomposition to compute each of the DFT’s in S more efficiently. The total number of operations then becomes, replacing the N12 term in the above, N2 ( N32 N2 + N22 N3 ) + N22 N1 = N2 N22 + 2 N2 N q And finally, if N = N2 we can do this decomposition q times, so that the total operations count is N2 q N2 + qN2 N = N (1 + qN2 ) Since q = log N2 N , the operations count scales as N log N2 , or for N2 = 2, N log2 N : a much smaller number than N 2 , even for N as small as 1024, 100 times smaller. A.3 Computing DFT’s of Real Series If we are computing the DFT of a real-valued series, we can always reduce the computation time, and storage requirements, by a factor of two. The standard definition of the DFT, which is followed by almost all FFT programs, assumes that the sequence xn is complex. Transforming N real numbers then means transforming N complex numbers, all with imaginary parts equal to zero – and then half the output will be redundant, because the x̃n ’s will be Hermitian. While we could replace the DFT with a purely real transform, (such as the Hartley transform), this is not necessary. Instead, we represent our N real numbers as N /2 complex numbers, take an N /2-length complex DFT, and rearrange the output to give the N /2 complex x̃k ’s that we want. To do this, we combine adjacent pairs of real xn ’s to form our complexvalued sequence; we may not even have to do this explicitly, because in many computer languages (certainly FORTRAN), passing a real-valued sequence to a subroutine that expects complex numbers will automatically Appendix A. Computing the Discrete Fourier Transform 124 cause this reassignment to occur. The DFT is then Ck = NX /2−1 n=0 ( x2n + ix2n+1 ) e−2πkn/M k = 0, . . . M − 1 for where M = N /2. 
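The complete bookkeeping — packing the real series, taking one half-length complex DFT, and applying the unpacking relations derived next — can be sketched as follows and checked against a full-length FFT.

```python
import numpy as np

def rfft_via_half_length(x):
    """DFT coefficients X_0..X_{N/2} of a real, even-length series x,
    computed from one complex DFT of length M = N/2."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    M = N // 2
    z = x[0::2] + 1j * x[1::2]            # pack adjacent pairs into complex numbers
    C = np.fft.fft(z)                     # length-M complex DFT

    k = np.arange(M)
    Crev = np.conj(C[(-k) % M])           # C_{M-k}^* (index taken mod M)
    A = 0.5 * (C + Crev)                  # DFT of the even-index samples
    B = -0.5j * (C - Crev)                # DFT of the odd-index samples

    X = np.empty(M + 1, dtype=complex)
    X[:M] = A + np.exp(-2j * np.pi * k / N) * B
    X[M] = A[0] - B[0]                    # the Nyquist coefficient
    return X

x = np.random.default_rng(4).standard_normal(16)
print(np.allclose(rfft_via_half_length(x), np.fft.fft(x)[:9]))   # True
```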
To see what to do with the $C_k$'s, construct two complex-valued sequences $A$ and $B$, defined by the sums

$$A_k = \sum_{n=0}^{M-1} x_{2n}\, e^{-2\pi i nk/M} \qquad B_k = \sum_{n=0}^{M-1} x_{2n+1}\, e^{-2\pi i nk/M}$$

which means that $C_k = A_k + iB_k$. If we can find the $A$ and $B$ sequences, we can compute the $X_k$ that we actually want, since

$$X_k = \sum_{n=0}^{2M-1} x_n e^{-2\pi i nk/2M} = \sum_{n=0}^{M-1} x_{2n} e^{-2\pi i (2n)k/2M} + \sum_{n=0}^{M-1} x_{2n+1} e^{-2\pi i (2n+1)k/2M} = A_k + e^{-2\pi i k/N} B_k \qquad \text{for } k = 0,\ldots,N/2-1$$

To get the $A$'s and $B$'s from the $C$'s that we have computed with our DFT, consider

$$C_{M-k} + C_k = A_{M-k} + A_k + i(B_{M-k} + B_k)$$

Since $A$ and $B$ are both derived from real sequences, both are Hermitian, with $A_{M-k} = A_k^*$ and $B_{M-k} = B_k^*$. Therefore,

$$C_{M-k} + C_k = 2\Re[A_k] + 2i\,\Re[B_k]$$

which means that

$$\Re[A_k] = \tfrac{1}{2}\Re[C_{M-k} + C_k] \qquad \Re[B_k] = \tfrac{1}{2}\Im[C_{M-k} + C_k]$$

and similarly,

$$\Im[A_k] = \tfrac{1}{2}\Im[C_k - C_{M-k}] \qquad \Im[B_k] = -\tfrac{1}{2}\Re[C_k - C_{M-k}]$$

So, by computing first $C$, then $A$ and $B$, and finally $X$, we have achieved our aim. We can get slightly greater efficiency using DFT algorithms specifically designed to handle real sequences (or, for the inverse transform, Hermitian ones), though the improvements are not large enough to be important unless you plan to do a lot of transforms; Sorensen et al. (1987) give both algorithms and comparisons.

A.4 Computing the Fourier Transform for a Few Frequencies

Finally, suppose we want to compute, instead of the DFT coefficients, the transform $\tilde{x}(f)$ for arbitrary $f$:

$$\tilde{x}(f) = \sum_{n=0}^{N-1} x_n e^{-2\pi i f n} \qquad \text{(A.1)}$$

Certainly you might imagine (correctly) that, even for the DFT case where $f = k/N$, the FFT will not be the most efficient procedure if we only want a few values of $f$. But again, we do not need to do the sum as written; by using a recursive method known as the Goertzel algorithm we need to compute only one trigonometric function for each frequency. We define a series of sums

$$U_k(f) = \frac{1}{\sin 2\pi f}\sum_{n=N-k}^{N-1} x_n \sin\!\big(2\pi f\,[\,n - (N-k) + 1\,]\big) \qquad \text{for } k = 1,\ldots,N$$

and set $U_0 = U_{-1} = 0$. Then the final sum, minus the penultimate one multiplied by $e^{2\pi i f}$, turns out to be the Fourier transform we seek:

$$U_N(f) - e^{2\pi i f}U_{N-1}(f) = \frac{1}{\sin 2\pi f}\left[\sum_{n=0}^{N-1} x_n \sin 2\pi(n+1)f - \cos 2\pi f\sum_{n=1}^{N-1} x_n \sin 2\pi n f\right] - i\sum_{n=1}^{N-1} x_n \sin 2\pi n f$$

$$= x_0 + \frac{1}{\sin 2\pi f}\sum_{n=1}^{N-1} x_n\big(\sin 2\pi n f\cos 2\pi f + \cos 2\pi n f\sin 2\pi f - \cos 2\pi f\sin 2\pi n f\big) - i\sum_{n=1}^{N-1} x_n \sin 2\pi n f$$

$$= x_0 + \sum_{n=1}^{N-1} x_n(\cos 2\pi n f - i\sin 2\pi n f) = \sum_{n=0}^{N-1} x_n e^{-2\pi i n f}$$

The point of all this is that we can compute the $U_k$'s recursively, and need to compute only one cosine, once, to do so. As above, we start with the answer and show that it gives the result we wish. Take

$$x_{N-k} + 2\cos(2\pi f)\,U_{k-1} - U_{k-2} = x_{N-k} + \frac{1}{\sin 2\pi f}\sum_{n=N-k+1}^{N-1} x_n\big[\,2\cos 2\pi f\,\sin 2\pi(n-N+k)f - \sin 2\pi(n-N+k-1)f\,\big]$$

A general expression for the trigonometric functions is

$$2\cos u\,\sin[(m-1)u] - \sin[(m-2)u] = \sin mu$$

which, applied to the sum above, gives

$$x_{N-k} + 2\cos(2\pi f)\,U_{k-1} - U_{k-2} = x_{N-k} + \frac{1}{\sin 2\pi f}\sum_{n=N-k+1}^{N-1} x_n \sin\!\big(2\pi(n-N+k+1)f\big) = \frac{1}{\sin 2\pi f}\sum_{n=N-k}^{N-1} x_n \sin\!\big(2\pi f\,[\,n - (N-k) + 1\,]\big) = U_k$$

a recursion that allows us to find $U_1, U_2, \ldots, U_N$ in $N$ real multiplications and $2N$ real additions; and once we have $U_{N-1}$ and $U_N$, we can find $\tilde{x}(f)$. One difficulty with this approach is that, for values of $f$ near 0 or 1/2, it can be inaccurate because of roundoff (Gentleman, 1969).
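Before turning to that difficulty, here is a direct rendering of the recursion, checked against the defining sum; the test frequency is an arbitrary choice.

```python
import numpy as np

def goertzel(x, f):
    """x~(f) = sum_n x_n e^{-2 pi i f n} via the U_k recursion
    (the data are processed from x_{N-1} down to x_0)."""
    c = 2.0 * np.cos(2.0 * np.pi * f)   # the one cosine needed by the recursion
    u1 = u2 = 0.0                       # U_{k-1} and U_{k-2}
    for xn in x[::-1]:                  # x_{N-1}, x_{N-2}, ..., x_0
        u1, u2 = xn + c * u1 - u2, u1
    return u1 - np.exp(2j * np.pi * f) * u2   # U_N - e^{2 pi i f} U_{N-1}

rng = np.random.default_rng(5)
x = rng.standard_normal(100)
f = 0.173                               # an arbitrary, non-Fourier frequency
n = np.arange(len(x))
direct = np.sum(x * np.exp(-2j * np.pi * f * n))
print(np.allclose(goertzel(x, f), direct))    # True
```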
For example, for f = 0, the recursion becomes U1 = xN −1 U2 = xN −2 + 2 xN −1 U3 = xN −3 + 2 xN −2 + 3 xN −1 ... Finally, we would compute x̃(0) = U N − U N −1 by differencing two numbers that are potentially very large: this is a sure invitation to roundoff error. There is however a simple solution to this: define ∆Uk = Uk − Uk−1 . Then the recursion in terms of ∆U is ∆Uk = Uk − Uk−1 = xN −k + 2(cos 2π f − 1)Uk−1 − Uk−2 + Uk−1 = xN −k + 2(cos 2π f − 1)Uk−1 + ∆Uk−1 A.4. Computing the Fourier Transform for a Few Frequencies 127 and the expression for x̃( f ) is X ( f ) = ∆U N − (cos 2π f − 1)U N −1 − i (sin 2π f )U N −1 Now, for f small, cos 2π f − 1 becomes small, and the recursion and its final processing become numerically stable. There is a similar problem with the original recursion for f close to 1/2; for f = 1/2 the original recursion is for which the original recursion gives U1 = xN −1 U2 = xN −2 − 2 xN −1 U3 = xN −3 − 2 xN −2 − 3 xN −1 ... which again means (potentially) differencing large numbers. The cure is again to define a new quantity (for which we use the same notation): ∆Uk = Uk + Uk−1 in terms of which the recursion becomes ∆Uk = xN −k + 2(cos 2π f + 1)Uk−1 − ∆Uk−1 and the final step becomes X ( f ) = ∆U N − (cos 2π f + 1)U N −1 − (− i (sin 2π f )U N −1 ) The need for three recursions complicates the process somewhat, but not so much so as to render this method unattractive for special purposes. A PPENDIX B O THER R EADING B.1 Introduction There are a many books that cover different aspects of this course, but none that provide the mix we think most useful. Most books tend to either focus on the statistical viewpoint, without much discussion either of Fourier theory or of actual algorithms; or else focus on the signal-processing topics with little discussion of randomness. The list here is certainly not complete, but includes some of our own favorites, both textbooks and reference monographs. B.2 Linear Systems and Fourier Analysis Pippard (1978) gives a particularly detailed and readable discussion of one class of linear system: the damped oscillator (or sets of oscillators), including a number of interesting ways of looking at the Fourier transform. Books on the Fourier transform differ substantially in their level of mathematical rigor. The two least rigorous are James (1995), which is at about the level of these notes, and Bracewell (1986), which is much more complete, and is probably the best book on Fourier theory for non-mathematicians. It emphasizes getting a “feel” for transforms, being able to visualize the connection between the time and frequency domains, and the ubiquity of Fourier transforms, as shown by a wide range of examples (most from electrical engineering). Editions after the first one are basically unchanged, except for a discussion of the Hartley transform, about which the author is, in our view, overenthusiastic. Moving up the scale of rigor, Champeney (1987) covers the Fourier theorems rigorously, with due attention to what kinds of functions they apply to and what the integrals really mean, though results are given rather than proved. Korner (1988) is a collection of anecdotes, theorems and applica128 B.3. Digital Signal Processing 129 tions, mostly to do with Fourier analysis; many of the essays can be read in isolation from the rest of the book. It provides an entertaining look at some mathematical culture. 
Lighthill (1958) is a brief, elegant, readable, and rigorous exposition of Fourier analysis and its connection with generalized functions; if you want to see the proofs done right, but readably, this is the place to look to fill in many of the details we have skipped. It is not, however, complete: there is no discussion of convolution. Finally, for the mathematically inclined Dym and McKean (1972) treats the subject with modern notation, and full rigor but in a lively style. They show lots and lots of unexpected applications ranging from differential equations to the Prime Number Theorem. They do not make any use of generalized functions. B.3 Digital Signal Processing Because digital signal processing is, now, used inside devices ranging from cell phones to satellites to DVD players and washing machines, it has a huge literature on it; and because this subject is a standard course for electrical engineers, many authors have tried their hands at textbooks. Perusal of the relevant classification number (TK5102.9) in the library catalog will reveal dozens of books, all with similar content, though with levels varying from introductions for computer musicians whose mathematical background may be quite modest, to detailed treatments for graduate-level engineers. Many of these books address issues of finite word length and computational efficiency that, while of considerable engineering importance, are rarely relevant to data analysis. Many books also include a section on spectrum analysis, but usually without doing justice to the statistical questions involved. Given this abundance, we can only offer a selection. Two basic introductions are Scherbaum (2001), which is oriented towards seismology but not very complete, and Steiglitz (1996), which is oriented towards people doing computer music – and thus avoids the assumption of many of these books, which is that you are an electrical engineer. Another in this class is Hamming (1983), which is very readable, and pays lots of attention to common stumbling blocks. It is however somewhat idiosyncratic in its methods and its coverage. One particular strong point is the especially good treatment of what some common operations (e.g. numerical integration) look like if viewed as filters. Oppenheim and Schafer (1989) is a standard introductory textbook, 130 Appendix B. Other Reading with probably all the information that you are likely to ever need, presented in a way that shows the authors’ command of the subject and how to teach it. It is much more detailed than the books just mentioned; and it does assume some familiarity with continuous-time theory and (in places) an electrical-engineering background. The notation will be familiar, since the notation in our class notes comes from an earlier version of this text. Signal processing as done in exploration geophysics, has very different goals and standards from the rest of the field. Robinson and Treitel (1980), by two of the founders of the subject, begins at a very basic level, but can be quite advanced in the later chapters, which tend to focus on special topics the authors happen to be interested in. B.4 Time Series and Spectral Estimation Jenkins and Watts (1968) was long the standard text in the field, but it is now showing its age in terms of estimation methods (the theory is still perfectly correct). It is probably still one of the best discussions of multivariate methods. Chapter Four, on statistical inference, is still well worth reading. 
Priestley (1981) provides a careful but readable treatment of all the standard material with a nice balance between proof, discussion and illustration. This book focuses on the statistical issues, and includes some discussion of time-domain estimation. Bendat and Piersol (1986) has a strong emphasis on estimation, particularly of response functions for linear systems. The orientation is towards engineering, not statistics. Percival and Walden (1993) introduces univariate spectral analysis at a reasonably high level. They cover stochastic processes, filtering, Fourier theory, and spectral estimation, with special emphasis on direct multitaper estimation techniques. They also cover parametric spectral estimation and harmonic analysis. A promised volume on cross spectra has not yet appeared. This is probably the best single book on univariate spectral estimation, though it is not always easy going; the authors have emulated the inventor of multitaper methods, Dave Thomson, in the lavish use of subscripts and superscripts, which tends to obscure the discussion. Bibliography Agnew, D. C., and K. M. Hodgkinson (2007), Designing compact causal digital filters for low-frequency strainmeter data, Bull. Seismol. Soc. Amer., 97, 91–99. Alsop, L. E., and A. A. Nowroozi (1966), Faster Fourier analysis, J. Geophys. Res., 71, 5482–5483, doi:10.1029/JZ071i022p05482. Ardhuin, F., B. Chapron, and F. Collard (2009), Observation of swell dissipation across oceans, Geophys. Res. Lett., 36, L06,607, doi: 10.1029/2008GL037030. Bendat, J. S., and A. G. Piersol (1986), Random Data: Analysis and Measurement Procedures, John Wiley, New York. Bracewell, R. (1986), The Fourier Transform and its Applications, McGrawHill, New York. Champeney, D. C. (1987), A Handbook of Fourier Theorems, Cambridge University Press), New York. Cooley, J. W., and J. W. Tukey (1965), An algorithm for the machine computation of complex Fourier series, Math, Comp., 19, 297–301, doi: 10.2307/2003354. Deakin, M. A. B. (1992), The ascendancy of the Laplace Transform and how it came about, Arch. Hist. Exact Sci, 44, 265–286. Dym, H., and H. P. McKean (1972), Fourier Series and Integrals, Academic Press, New York. Gentleman, W. M. (1969), An error analysis of Goertzel’s (Watt’s) algorithm, Computer J., 12, 160–165. Gross, R. (1992), Correspondence between theory and observations of polar motion, Geophys. J. Internat., 109, 162–170. Hamming, R. W. (1983), Digital Filters, Prentice-Hall, Englewood Cliffs. 131 132 Bibliography Harris, F. J. (1976), On the use of windows for harmonic analysis with the discrete Fourier transform, IEEE Proc., 66, 51–83. Heideman, M. T., D. H. Johnson, and C. S. Burrus (1985), Gauss and the history of the Fast Fourier Transform, Arch. Hist. Exact. Sci., 34, 265– 277. Hewitt, E., and R. Hewitt (1979), The Gibbs-Wilbraham phenomenon: an episode in Fourier analysis, Arch. Hist. Exact Sci., 21, 129–169. James, J. F. (1995), A Student’s Guide to Fourier Transforms: With Applications in Physics and Engineering, Cambridge University Press), New York. Jenkins, G. M., and D. G. Watts (1968), Spectral Analysis and its Applications, 525 pp., Holden-Day, San Francisco. Kaiser, J. F., and W. A. Reed (1977), Data smoothing using low-pass digital filters, Rev. Sci. Instrum., 48, 1447–1454. Kaiser, J. F., and W. A. Reed (1978), Bandpass (bandstop) digital filter design routine„ Rev. Sci. Instrum., 49, 1103–1106. Korner, T. W. (1988), Fourier Analysis, Cambridge University Press, Cambridge. Lighthill, M. J. 
(1958), Fourier Analysis and Generalized Functions, Cambridge University Press, Cambridge. Munk, W. H., and G. McDonald (1960), The Rotation of the Earth: a Geophysical Discussion, Cambridge University Press, Cambridge. Oppenheim, A. V., and R. W. Schafer (1989), Discrete-time Signal Processing, Prentice Hall, Englewood Cliffs, N.J. Orchard, H. J., and A. N. Willson (2003), On the computation of a minimumphase spectral factor, IEEE Trans. Circuits Systems I, 50, 365–375. Parker, P. R., M. A. Zumberge, and R. L. Parker (1995), A new method of fringe-signal processing in absolute gravity meters, Manuscrip. Geodet., 20, 173–181. Bibliography 133 Percival, D. B., and A. T. Walden (1993), Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques, Cambridge University Press, Cambridge. Pippard, A. B. (1978), The Physics of Vibration, Cambridge University Press, New York. Priestley, M. B. (1981), Spectral Analysis and Time Series, 890 pp., Academic, Orlando, Fla. Robinson, E. A., and S. Treitel (1980), Geophysical Signal Analysis, Prentice Hall, Englewood Cliffs, N.J. Scherbaum, F. (2001), Of Poles and Zeros: Fundamentals of Digital Seismology, Kluwer Academic Publishers, Boston. Scherbaum, F., and M.-P. Bouin (1997), FIR filter effects and nucleation phases, Geophys. J. Internat., 130, 661–668. Singleton, R. C. (1969), An algorithm for computing the mixed radix fast Fourier transform, IEEE Trans. Audio Electroacous., AU-17, 93–103. Slepian, D. (1976), On bandwidth, Proc. IEEE, 6, 292–300. Smith, M. L., and F. A. Dahlen (1981), The period and Q of the Chandler wobble, Geophys. J. R. Astron. Soc., 64, 223–281. Snodgrass, F. E., G. W. Groves, K. F. Hasselmann, G. R. Miller, W. H. Munk, and W. H. Powers (1966), Propagation of ocean swell across the Pacific, Phil. Trans. Roy. Soc., Ser. A, 259, 431–497. Sorensen, H., D. L. Jones, M. T. Heideman, and C. S. Burrus (1987), Real-valued Fast Fourier Transform algorithms, IEEE Trans. Acoustics Speech Sign. Proc., 35, 849–863. Steiglitz, K. (1996), A DSP Primer: with Applications to Digital Audio and Computer Music, Addison-Wesley, Menlo Park. Steiglitz, K., T. W. Parks, and J. F. Kaiser (1992), METEOR: A constraintbased FIR filter design program, IEEE Trans. Signal Proc., 40, 1901– 1909. Index Amplitude of complex number, see Complex numbers Argand diagram, see Phasor diagram Complex numbers amplitude, 11 phase, 11 Decibels, 5 Fechner’s law, 5 Fourier series, 34 Fourier transform symmetry relations, 22 Linear systems, 9 definition, 9 time-invariant, 9 Phase of complex number, see Complex numbers sign convention for, 12 Phasor diagram, 11 Quadrature, defined, 11 Sea level at SIO pier, 1 steric effect, 1 swell, 2 Swell, propagation, 3 Symmetry relations of Fourier transform, see Fourier transform Systems, see Linear systems Time invariance, see Linear systems, time-invariant Vibration diagram, see Phasor diagram 134