
CHAPTER 5

Some extra notes on body waves and mantle tomography


1. The travel time inverse problem
In ray theory, the travel time of the ray is
T = \int_{\mathrm{ray}} \frac{1}{v}\, d\ell \qquad (1)
where the integral is taken along the ray path and v is the seismic velocity. In global tomography, we usually
start from a 1-D spherically symmetric Earth. Fermat's principle states that a ray path between two points
is a path of stationary time. Thus the travel time will not change to first order if the ray path is slightly
perturbed. If we make a small perturbation in velocity structure there will be a change in travel time due to
the change in velocity structure and due to the change in the ray path but the latter term is of second order
and so can be neglected. We can therefore differentiate equation 1 giving:
\delta T = -\int_{\mathrm{ray}} \frac{1}{v}\,\frac{\delta v}{v}\, d\ell \qquad (2)
where we can use the spherical Earth ray path and v can be taken to be the 1-D velocity. More generally,
we can write
\delta T = \int_V K(r, \theta, \phi)\, m(r, \theta, \phi)\, dV \qquad (3)
where m is our model (in our case, the relative perturbation in velocity δv/v), and K is some kernel which
in ray theory is just a delta function along the ray, but in more sophisticated theories which take into account
finite frequency effects, will occupy a volume around the geometrical ray.
2. Long- and short-period travel times
Historically, seismograms were recorded either at "long" periods or "short" periods. The reason for this
is that a major source of motion of the ground is the "microseisms" which are due to nonlinear interactions
of ocean waves causing pressure variations on the ocean floor. Microseisms have a main peak at a period of about
14 seconds and a secondary peak at about 7 seconds. It is actually the secondary peak that is mainly seen on
seismometers. In the past, seismic recording systems did not have the dynamic range to record both the
microseisms and the small seismic signals which ride on them. Thus instruments were designed to see
periods shorter than 7 seconds (usually peak response at about 1 second) or periods longer than about 15
seconds. Modern seismic recording systems have enough dynamic range to be able to record the microseisms
(so-called broad-band recording) but, for most earthquakes, we must still filter out the microseisms so we
can see the small seismic signals.
On the short period side, body waves of dominant period 1 second are seen and the first arriving P wave can
be accurately picked. Scattering by short-wavelength heterogeneity causes large "codas" which can obscure
secondary arrivals. Many observations have been made and are collected by the International Seismological
Centre (ISC) which has used them to make a more comprehensive tabulation of earthquake locations.
Such data are also used in tomography. P wave tomography using the ISC data has been quite successful but
the S waves are more problematic. This is because S waves typically have a lower frequency content due to
attenuation and are more poorly recorded by short period instruments. Furthermore, ISC picks are usually
made from vertical component instruments so interference of S by the SKS phase at distances beyond 80
degrees is a problem. This makes it very difficult to image S velocity in the lowermost mantle from ISC S
picks.
Long-period data offer some advantages over the ISC data: in particular, codas from scattering are nearly
nonexistent so later phases can be accurately picked. One picking algorithm is discussed in detail in the
paper by Bolton and Masters (2001) and we shall give some demonstrations during the lecture. Differential
times can also be picked but require corrections for relative attenuation and, sometimes, corrections for
waveform distortion due to propagation effects. Again, we shall give some examples in the lecture.
3. Parameterization.
Consider a linearized inverse problem of the form:
d_i = \int_0^R G_i\, m\, dr \qquad (4)
where d_i is a datum such as the difference between an observed frequency of free oscillation of the Earth
and one calculated for a starting model, G_i is some continuous kernel which we can compute, and m is a
continuous model perturbation. When we have relatively few data, it is possible to avoid parameterization
of the model and make an expansion of the form:
m = \sum_{i=1}^{N} a_i\, G_i(r) \qquad (5)
where N is the number of data. Inserting this into equation 4 gives
d = \Gamma\, a \quad \text{where} \quad \Gamma_{ij} = \int_0^R G_i\, G_j\, dr \qquad (6)
and Γ is a matrix which is of dimension N × N. Equation 6 can be solved in a variety of ways (e.g., we
can impose smoothness constraints on the model perturbation or on the total model), we can explore the
trade-off with fit to the data, and we can look at the ability of our data to resolve features of our models in
a quantitative way.
Unfortunately, once N exceeds a few thousand, the computational burden of dealing with huge matrices
becomes too great. The conventional way around this is to parameterize the model by expanding it in a set
of basis functions where the number of parameters is chosen to be computationally manageable:
m = \sum_{i=1}^{M} a_i\, f_i(r) \qquad (7)
where M is the number of parameters. Of course, in 3D tomography, the basis functions f are functions of
r, θ, and φ. Substituting 7 into equation 4 gives
d = A\, a \quad \text{where} \quad A_{ij} = \int_0^R G_i\, f_j\, dr \qquad (8)

and A is a matrix which is of dimension N × M.
The choice of basis functions in equation 7 can impact the kinds of models we can recover and can
also impact the computational difficulty of solving equation 8. In global tomography, the choice of basis
functions has, for many years, involved the use of spherical harmonics for parameterizing lateral variations:
m = \sum_{s,t} m_s^t(r)\, Y_s^t(\theta, \phi) \qquad (9)
where the radial expansion coefficients m_s^t(r) are further parameterized either in global functions (e.g.
Legendre polynomials or Chebyshev polynomials) or as local functions (e.g. layers or B-splines). The
reason for the choice of spherical harmonics is that these are efficient for parameterizing the long-wavelength
structure which dominates many of the seismic datasets. Unfortunately, a consequence of using global bases
is that every datum effectively becomes sensitive to the entire model so that the matrix A is very dense. This
means that global models using spherical harmonics are typically limited to about 10,000 model parameters:
if we divide the mantle up into roughly 20 layers (see below), each layer could have about 500 parameters.
A spherical harmonic expansion up to degree l has (l + 1)^2 expansion coefficients so this means l is limited
to about 21. If we recall that
ka = l + \tfrac{1}{2} = \frac{2\pi a}{\lambda} \qquad (10)
where k is wavenumber, λ is wavelength, and 2πa is the circumference of the Earth (about 40,000 km),
we find that the minimum wavelength we can capture is about 2000 km. Unfortunately, dynamically
interesting structures such as slabs typically have much smaller dimensions than this (and many of our data
are, in principle, sensitive to small wavelength structure). This has motivated the use of local bases in global
tomography (e.g. blocks of uniform lateral dimension, equal area blocks, non-uniform distribution of blocks
mimicking data sampling, tessellations, etc.).
Why are local bases so useful? Let us suppose we have several hundred thousand travel time measurements.
To a fairly good approximation, ray theory can be used to interpret such data so each datum is
sensitive to only a small fraction of the total number of parameters in the model (i.e. along a particular
ray). For example, using blocks of lateral dimension 4 degrees at the equator (this corresponds to a surface
wavelength of about 880 km or an l of about 45 if we had done a spherical harmonic expansion) gives roughly
2500 blocks per layer for a total of 50,000 model parameters for a 20 layer model. However, each datum
samples only about 1% or less of the blocks so each row of the matrix A will have less than 500 non-zero
entries. Sparse matrix techniques can then be efficiently used to solve equation 8.
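To make this sparsity concrete, here is a minimal Python sketch (not from the original notes) of how such a sensitivity matrix might be stored; the block indices and kernel values for each ray are random stand-ins for the output of the ray tracing described in the next section.

    import numpy as np
    from scipy.sparse import lil_matrix

    n_data, n_blocks = 1000, 50_000            # rays and model blocks (illustrative sizes)
    A = lil_matrix((n_data, n_blocks))

    rng = np.random.default_rng(0)
    for i in range(n_data):
        # stand-in for ray tracing: which blocks ray i crosses and the
        # kernel contribution accumulated in each of them
        hit_blocks = rng.choice(n_blocks, size=300, replace=False)
        A[i, hit_blocks] = rng.random(300)

    A = A.tocsr()                              # compressed sparse row storage for fast products
    print("non-zeros per row:", A.nnz / n_data)
    print("fill fraction:", A.nnz / (n_data * n_blocks))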
4. Computing matrix elements for travel times using ray theory
Consider a ray through the Earth as shown in Figure 1.
[Figure 1: ray path through the Earth from source to receiver]

Now focus on the small segment of the ray which subtends the angle dΔ at the center of the Earth (Fig. 2).
[Figure 2: a small ray segment at radius r making an angle i with the vertical and subtending an angle dΔ]
i is the angle the ray makes with the vertical and we know that the ray parameter is related to i by

p = \frac{r}{v}\,\sin i \qquad (11)
where v is the velocity for the ray segment at radius r. Consider equation (2):

\delta T = -\int_{\mathrm{ray}} \frac{1}{v}\,\frac{\delta v}{v}\, d\ell
From Figure 2, we have

\sin i = \frac{r\, d\Delta}{d\ell} \quad \text{so} \quad d\ell = \frac{r^2}{p v}\, d\Delta \qquad (12)
and we can rewrite equation (2) as an integral over distance:

\delta T = \int_0^{\Delta} G(\Delta)\, \frac{\delta v}{v}(\Delta)\, d\Delta \qquad (13)
where

G(\Delta) = -\frac{r^2}{p\, v^2} \qquad (14)
This kernel is evaluated by keeping track of the depth of the ray for every position Δ along the ray. To do
this we need an equation relating Δ to r. Reconsider Figure 2 and note that (using equation 11 for p)
r\,\frac{d\Delta}{dr} = \tan i = \frac{v p}{r}\left[1 - \left(\frac{v p}{r}\right)^{2}\right]^{-1/2} \qquad (15)
or

\frac{d\Delta}{dr} = \frac{p}{r}\left[\frac{r^2}{v^2} - p^2\right]^{-1/2} \qquad (16)
On an aspherical Earth where we have used a local block parameterization, we step finely along in
distance starting from a specific source position to a specific receiver position. At each point, we compute
the radius we are at using equation 16 and then evaluate the kernel using equations 13 and 14. We also keep
track of which block we are in at each step along the ray and then integrate the contributions to each block at
the end.
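As an illustration of this bookkeeping, the sketch below (an assumption-laden stand-in, not the original code) takes samples of radius and velocity along an already traced ray path, evaluates the kernel of equation 14 at each step in Δ, and accumulates the contribution of each block, producing one sparse row of the matrix A; the function block_of is a hypothetical lookup from a sample position to a block index.

    import numpy as np

    def row_of_A(delta, r, v, p, block_of):
        # delta, r, v: arrays sampled finely along the (already traced) ray --
        #              epicentral distance, radius and velocity at each sample
        # p:           ray parameter of this ray
        # block_of:    maps a sample index to the index of the block it lies in
        G = -r**2 / (p * v**2)                 # kernel of equation 14 at each sample
        row = {}                               # block index -> accumulated sensitivity
        for k in range(len(delta) - 1):
            d_delta = delta[k + 1] - delta[k]
            b = block_of(k)
            row[b] = row.get(b, 0.0) + G[k] * d_delta   # simple quadrature of equation 13
        return row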
So far, we have been considering the effect of a volume perturbation in velocity. There may also be
perturbations in the levels of discontinuities which, if the velocity is different on either side, produce travel
time anomalies. The formula for δT for a transmitted ray if a boundary is moved up by δr is

\delta T = -\frac{\delta r}{r}\left[\left(\frac{r^2}{v^2} - p^2\right)^{1/2}\right]^{+}_{-} \qquad (17)

where [f]^{+}_{-} indicates the value of f below subtracted from the value of f above the discontinuity. For a
reflected ray, we get
\delta T = -\frac{2\,\delta r}{r}\left(\frac{r^2}{v^2} - p^2\right)^{1/2} \qquad (18)

for a topside reflection (so v is the velocity just above the discontinuity) and
\delta T = +\frac{2\,\delta r}{r}\left(\frac{r^2}{v^2} - p^2\right)^{1/2} \qquad (19)

for a bottomside reflection (so v is the velocity just below the discontinuity).
5. Finding a model
Let us suppose we are solving the problem
A x = d (20)
for the vector x. Further, we shall assume that we have divided each row of this system of equations by
the observation error on the datum so that the data vector d has a covariance matrix which is just I (i.e. we
are assuming our data are statistically independent from each other). If our system of equations (20) were
well-conditioned, we might just find the least-squares solution:
x = (A^T A)^{-1} A^T d \qquad (21)
which minimizes |A\,x - d|^2. In reality, A is usually not well-conditioned and A^T A is even worse (the
condition number is effectively squared) so the solution (21) is rarely chosen. One way around squaring
the condition number is to use a singular value decomposition (SVD) on equation 20. The matrix A is
decomposed into singular values and matrices of left and right eigenvectors (here, we assume M < N):
A = U \Lambda V^T \qquad (22)

where U has dimension N × M, V has dimension M × M, and Λ is an M × M diagonal matrix with non-zero
diagonal elements. Note that U^T U = I and V^T V = I. The least-squares solution in terms of the SVD is
x = V \Lambda^{-1} U^T d = A^{+} d \qquad (23)
where A^{+} can be thought of as the (generalized) inverse of A. If A is not well-conditioned, it will have
some small singular values which will generally lead to some poorly determined contributions to x. To
see why this is so, consider the covariance matrix of the model. To get the model we are taking a linear
combination of data: A^{+} d. Now d has covariance matrix I so x has covariance matrix:
A^{+} I (A^{+})^T = V \Lambda^{-1} U^T U \Lambda^{-1} V^T = V \Lambda^{-2} V^T \qquad (24)
The square roots of the diagonal elements of this matrix are the errors on our model parameters. Clearly,
small singular values are going to make these errors large. One way to avoid this is to exclude small singular
values from the sums implicit in equations 23 and 24 but this will mean that A^{+} A will no longer be I. In
fact, substituting 20 into 23 gives
\hat{x} = A^{+} A\, x = R\, x \qquad (25)
and the matrix R = A^{+} A is sometimes called the "resolution matrix". In a perfectly resolved system, R = I
but, in general, each model element estimated will be a linear combination of all the model elements. For
the truncated SVD approximation to the generalized inverse, R = V V^T (where V now contains only the
retained right singular vectors). We use the resolution matrix to estimate how much we are "blurring" the model.
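The following minimal numpy sketch (an illustration, not code from the notes) computes the truncated-SVD generalized inverse of equation 23 together with the model covariance of equation 24 and the resolution matrix of equation 25; the truncation level n_keep is an assumed choice.

    import numpy as np

    def truncated_svd_inverse(A, d, n_keep):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U diag(s) V^T
        Uk, sk, Vk = U[:, :n_keep], s[:n_keep], Vt[:n_keep].T
        x = Vk @ np.diag(1.0 / sk) @ Uk.T @ d                # generalized inverse applied to d (eq. 23)
        covariance = Vk @ np.diag(1.0 / sk**2) @ Vk.T        # V Lambda^-2 V^T (eq. 24)
        resolution = Vk @ Vk.T                               # R = V V^T (eq. 25)
        return x, covariance, resolution

    # tiny synthetic test
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 10))
    d = A @ rng.standard_normal(10) + 0.01 * rng.standard_normal(50)
    x, C, R = truncated_svd_inverse(A, d, n_keep=8)
    print("model standard deviations:", np.sqrt(np.diag(C)))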
The process of throwing away small singular values is an example of "regularization" of the inverse
problem. It is not a commonly used method because the model we end up with doesn't satisfy any
particularly sensible optimization criterion. Usually we seek a model which has some property optimized
and still adequately satisfies the data. For example, we might seek a model which has minimum first or
second derivative. Let D be some "roughening" operation on the model. Then we might want to minimize
f = (A\,x - d)^T (A\,x - d) + \lambda\, (D\,x)^T (D\,x) \qquad (26)
where the parameter λ controls the degree of smoothing. Expanding out the brackets, taking the derivative
with respect to x, and setting it to zero gives
x = (A^T A + \lambda\, D^T D)^{-1} A^T d \qquad (27)
Clearly, setting λ to zero gives us our least-squares result. Comparing equations 23 and 27 gives A^{+} =
(A^T A + λ D^T D)^{-1} A^T and we can use 24 and 25 to estimate the model covariance matrix and the resolution
matrix. Increasing λ will result in models which have a smaller value of x^T D^T D x. One choice for D is
I, which results in a process called "ridge regression" and ends up minimizing the Euclidean length of the
solution vector. This turns out to be a bad thing to do in tomography as it results in models which have
wildly underestimated amplitudes. A good choice for D is the first difference operator which in 1D looks
like:
\begin{pmatrix} 1 & -1 & 0 & 0 & \cdots \\ 0 & 1 & -1 & 0 & \cdots \\ 0 & 0 & 1 & -1 & \cdots \end{pmatrix} \qquad (28)
In tomography, we use this for smoothing in the radial direction and we use a form which minimizes the
sum of the first differences between a block and its four nearest lateral neighbors for lateral smoothing
(an approximation to the Laplacian). In practice, very different degrees of radial and lateral smoothing
are required in the tomography problem because radial and lateral length scales are so different for mantle
structure.
We have already complained about forming matrix products like A^T A when the matrices are ill-conditioned
and, in any case, making A^T A can itself be time consuming (and may remove the sparsity). In
practice, we construct the following equivalent system:
\begin{pmatrix} A \\ \lambda^{1/2} D \end{pmatrix} x = \begin{pmatrix} d \\ 0 \end{pmatrix} \qquad (29)
and solve this rectangular system using SVD or, more likely, a solver which takes advantage of the sparseness
of the matrices A and D.
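As a sketch of how equation 29 might be handled in practice (assuming A and D are available as scipy sparse matrices; names and sizes are illustrative), the stacked system can be built and handed to a sparse least-squares solver such as LSQR, which is discussed later in these notes:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import lsqr

    def regularized_solve(A, D, d, lam):
        # stacked system [A; sqrt(lam) D] x = [d; 0] of equation 29
        stacked = sp.vstack([A.tocsr(), np.sqrt(lam) * D.tocsr()])
        rhs = np.concatenate([d, np.zeros(D.shape[0])])
        return lsqr(stacked, rhs, atol=1e-8, btol=1e-8, iter_lim=500)[0]

    # tiny illustrative example: a 1-D model with first-difference smoothing
    n = 20
    A = sp.random(100, n, density=0.2, random_state=0, format="csr")
    D = sp.diags([1.0, -1.0], [0, 1], shape=(n - 1, n), format="csr")   # the operator of equation 28
    x = regularized_solve(A, D, np.ones(100), lam=1.0)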
One final technical point about solving equation 29 is that we can help the conditioning of the system by
solving a slightly different system:
C\, y = \begin{pmatrix} A \\ \lambda^{1/2} D \end{pmatrix} W\, y = \begin{pmatrix} d \\ 0 \end{pmatrix} \quad \text{where} \quad y = W^{-1} x \qquad (30)
for y and then getting x from x = W y. W can be chosen in a variety of ways: one is to make it a diagonal
matrix such that the Euclidean lengths of the columns of C are the same. This makes the range of singular
values of C much less extreme and also speeds up convergence of some of the iterative techniques we
discuss in the next section. This process of weighting is sometimes called "preconditioning" of the system
and whole books have been written on the topic.
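A minimal sketch of the diagonal column-scaling choice of W described above (an illustration under the stated assumptions, with C = stacked @ W):

    import numpy as np
    import scipy.sparse as sp

    def column_scaling(stacked):
        # diagonal W that equalizes the Euclidean column norms of C = stacked @ W
        col_norms = np.sqrt(np.asarray(stacked.multiply(stacked).sum(axis=0)).ravel())
        col_norms[col_norms == 0.0] = 1.0      # leave empty columns alone
        return sp.diags(1.0 / col_norms)

    # usage: W = column_scaling(stacked); solve (stacked @ W) y = rhs, then x = W @ y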
We now consider some "iterative" techniques for solving large systems of (hopefully) sparse equations.
Such techniques can operate on one row of the matrix at a time (and are sometimes called row-action
methods).
6. True iterative techniques
For simplicity, we go back to equation 20: A x = d, though we are more likely to be solving something
like equation 30 in practice. Let x^q be the qth iterate and define the residual vector

r^q = d - A\, x^q \qquad (31)
Now we want to perturb x^q to get a better answer. One way to do this is to work one equation at a time. Let
δx^q be the desired perturbation. We choose δx^0 to be the perturbation that makes the first element of r^0 be
zero, δx^1 is chosen to make the second element of r^1 zero, and so on; we then cycle through the equations
until we get convergence. To get a unique perturbation, we choose the one that has |δx^q| minimized. Thus
we minimize

\Bigl(\sum_j A_{ij}\,\delta x_j^q - r_i^q\Bigr)^{2}

Then

\delta x_j = \frac{A_{ij}\, r_i}{\sum_k A_{ik}^2} \qquad (32)
This is the original procedure of Kaczmarz and is not terribly efficient. One popular modification to this is
to compute the correction for each row (as above) and then average all the corrections to get a mean δx:

\delta x_j = \frac{1}{M}\sum_{i=1}^{M} \frac{A_{ij}\, r_i}{\sum_k A_{ik}^2} \qquad (33)
where M is the number of non-zero elements A_{ij} in column j (i.e. the number of equations that actually
involve parameter j). This process is called the Simultaneous Iterative Reconstruction Technique (SIRT)
and is still commonly used. Some modifications are described in Hager and Clayton, 1989. A general family
of SIRT methods is given by
\delta x_j = \frac{1}{\nu_j}\sum_{i=1}^{M} \frac{A_{ij}\, r_i}{\mu_i} \quad \text{where} \quad \nu_j = \sum_i |A_{ij}|^{\alpha}, \qquad \mu_i = \sum_k |A_{ik}|^{2-\beta}
with 0 < α < 2 and 0 < β < 2. Hager et al. use (α = 1, β = 1). It turns out that SIRT as described
above converges to a solution which is not the least squares solution of the original system of equations and
some weighting must be applied to correct this (van der Sluis and van der Vorst, 1987). SIRT works well
in practice but it is now more common to use a conjugate gradient method; one particular variant called
LSQR has become popular in seismic tomography.
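A minimal sketch of the basic SIRT iteration of equations 31 and 33 (an illustration assuming A is a scipy sparse matrix; not code from the notes):

    import numpy as np

    def sirt(A, d, n_iter=100):
        A = A.tocsr()
        row_norm_sq = np.asarray(A.multiply(A).sum(axis=1)).ravel()   # sum_k A_ik^2 for each row
        row_norm_sq[row_norm_sq == 0.0] = 1.0
        n_per_col = np.asarray((A != 0).sum(axis=0)).ravel()          # M: equations involving each parameter
        n_per_col[n_per_col == 0] = 1
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            r = d - A @ x                                             # residual vector (eq. 31)
            x = x + A.T @ (r / row_norm_sq) / n_per_col               # averaged corrections (eq. 33)
        return x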
7. Gradient (Projection) techniques
Consider the function defined by

f(x) = \tfrac{1}{2}\, |A\,x - d|^2 \qquad (34)
In two dimensions (x = (x_1, x_2)), f is a surface which has hills and valleys. Expanding out this function gives
f = \tfrac{1}{2}\,(A\,x - d)^T (A\,x - d) = \tfrac{1}{2}\bigl[\, d^T d + x^T A^T A\, x - 2\, x^T A^T d \,\bigr]
Now define the square symmetric matrix B = A^T A and the vector b = A^T d; then
f = \tfrac{1}{2}\bigl[\, d^T d + x^T B\, x - 2\, x^T b \,\bigr]
The first term on the right is just a constant (half the squared length of the data vector) so we define the
misfit function φ(x) as the last two terms:

\phi(x) = \tfrac{1}{2}\, x^T B\, x - x^T b \qquad (35)
(This is the same function as f with all the same hills and valleys but with an offset removed.)
The gradient of φ with respect to x is simply

\nabla\phi(x) = B\,x - b \qquad (36)

At any point x_k on the surface, the downhill slope is given by

-\nabla\phi(x_k) = b - B\,x_k = r_k \qquad (37)
and is actually zero at a solution which fits the data (B x - b = 0).
Our procedure is to find x by moving in a sequence of directions which take us down the misfit surface.
Let

x_{k+1} = x_k + \alpha_k\, u_k \qquad (38)
where u_k is a direction we choose to go in. We can find the value of α_k (assuming u_k is specified) that
minimizes

\phi(x_k + \alpha_k u_k) = \tfrac{1}{2}\,(x_k + \alpha_k u_k)^T B\,(x_k + \alpha_k u_k) - (x_k + \alpha_k u_k)^T b

so

\frac{\partial \phi}{\partial \alpha_k} = u_k^T B\, x_k + \alpha_k\, u_k^T B\, u_k - u_k^T b = 0

so

u_k^T (B\, x_k - b) + \alpha_k\, u_k^T B\, u_k = 0

\alpha_k = \frac{u_k^T r_k}{u_k^T B\, u_k} \qquad (39)
The next question is how to specify u_k. If we choose u_k = r_k we get the "steepest descent algorithm"
(remember r is the local downhill direction, see equation 37):

x_{k+1} = x_k + \alpha_k\, r_k \quad \text{where} \quad \alpha_k = \frac{r_k^T r_k}{r_k^T B\, r_k} \qquad (40)
This isn't always a very good idea since it is possible to go from one side of the valley to another rather than
going down the middle. A better method is to choose directions so that they are "conjugate" (perpendicular
in some sense) to all previous directions.
Reconsider equation 38:

x_{k+1} = x_k + \alpha_k\, u_k
Note that x_{k+1} is actually a linear combination of all the directions taken to date: u_1, ..., u_k; if there are M
model parameters, then the final x can be completely specified by an expansion in M (orthogonal) directions:

x = \alpha_1 u_1 + \alpha_2 u_2 + \cdots + \alpha_M u_M
If the directions were truly orthogonal to each other, we could just dot this equation with the transpose of
the jth u and that would pick out the jth term. It turns out that this isn't computationally helpful but it
is helpful to make the directions "B-orthogonal", which means that

u_k^T B\, u_j = 0 \quad (j \neq k)
Applying this to the above equation gives

u_k^T B\, x = u_k^T b = \alpha_k\, u_k^T B\, u_k
A conjugate-gradient algorithm can now be developed. We start with x_1 = 0 and compute r_1 = b. For the
first direction, we choose steepest descent so u_1 = r_1 and we get α_1 from equation 39. We are now at point
x_2 and can compute r_2. In steepest descents, r_2 would be our next direction but this is not "B-orthogonal"
to the previous direction. To achieve this, we let the new direction be

u_{k+1} = r_{k+1} + \beta_k\, u_k \qquad (41)
Dotting through by (B\, u_k)^T gives

\beta_k = -\frac{u_k^T B\, r_{k+1}}{u_k^T B\, u_k}
This form for β_k is not computationally optimal as we shall see. To get our final algorithm, we first note
that the r's can be computed recursively. Multiply equation 38 by B and subtract b from both sides:

B\, x_{k+1} - b = B\, x_k - b + \alpha_k\, B\, u_k

so

r_{k+1} = r_k - \alpha_k\, B\, u_k \qquad (42)
We can further manipulate the above formulae to get some identities which allow us to compute α_k and
β_k more efficiently. First, note that we recover equation 39 from equation 42 if we require u_k^T r_{k+1} = 0.
Forcing this to be true and dotting r_{k+1}^T into equation 41 gives the result that r_k^T u_k = r_k^T r_k. Furthermore,
if we dot r_{k+1}^T into 42 and use equation 39 for α_k and the above formula for β_k, we get

r_{k+1}^T r_{k+1} = r_{k+1}^T r_k - \alpha_k\, r_{k+1}^T B\, u_k = r_{k+1}^T r_k + \beta_k\, u_k^T r_k \qquad (43)
Similarly, dotting (B\, u_{k+1})^T into 41 shows that u_k^T B\, u_k = r_k^T B\, u_k. Dotting r_k^T into 42 and using this
result allows us to show that r_{k+1}^T r_k = 0. These identities allow us to compute α_k and β_k as

\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \qquad \alpha_k = \frac{r_k^T r_k}{u_k^T B\, u_k} \qquad (44)
The algorithm can now be written (taking x_1 = 0):

    k = 0
    r_1 = b
    u_1 = r_1
    x_1 = 0
    begin loop
        k = k + 1
        w = B u_k
        α = (r_k^T r_k) / (u_k^T w)
        x_{k+1} = x_k + α u_k
        r_{k+1} = r_k - α w
        β = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
        u_{k+1} = r_{k+1} + β u_k
    end loop
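For concreteness, here is a minimal, runnable Python translation of the loop above (an illustrative sketch; in practice B is never formed explicitly, as discussed next):

    import numpy as np

    def conjugate_gradient(B, b, n_iter=None, tol=1e-10):
        # solve B x = b for symmetric positive-definite B using the CG loop above
        n_iter = len(b) if n_iter is None else n_iter
        x = np.zeros(len(b))
        r = b.copy()                       # r_1 = b since x_1 = 0
        u = r.copy()                       # first direction is steepest descent
        for _ in range(n_iter):
            w = B @ u                      # the single matrix-vector multiply per iteration
            alpha = (r @ r) / (u @ w)
            x = x + alpha * u
            r_new = r - alpha * w
            if np.sqrt(r_new @ r_new) < tol:
                break
            beta = (r_new @ r_new) / (r @ r)
            u = r_new + beta * u
            r = r_new
        return x

    # tiny test using the normal equations B = A^T A, b = A^T d
    A = np.random.default_rng(2).standard_normal((30, 10))
    d = np.ones(30)
    x = conjugate_gradient(A.T @ A, A.T @ d)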
Note that there is only one matrix-vector multiply per iteration. M iterations of this process would give the
exact solution (in the absence of roundoff) but it is anticipated that far fewer than M iterations will be
required to get an acceptable solution.
The algorithm described above is the standard CG algorithm; Golub and Van Loan (1996, Chapter 10)
give an extensive discussion of the theory. This is not in the best form for numerical application since it
uses the "normal" equations B\,x = b which, as we have already noted, can square the condition number
and introduce instability. We would like to go back to the rectangular system in equation 20. Remember,
even just forming B can turn a sparse A matrix into a dense B matrix, though the sparseness can be retained
by computing B\,u as A^T(A\,u). An equivalent sparse square system can be written down:
\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} d \\ 0 \end{pmatrix}
and used to develop algorithms which do not implicitly use the normal equations and which are stable when
systems are not well-conditioned (e.g. LSQR). We leave this as an exercise to the reader.
One final point: knowing when to stop iterative techniques can be a bit of an art form. Typically, much
of the misfit to the data is taken up in the first few iterations but convergence to a stable model can take
much longer. In particular, where we include a smoother (as in equation 30), it seems that the effect of the
smoother becomes more apparent at later iterations even though the fit to the data does not change much.
Several stopping criteria for LSQR have been suggested (see original papers by Paige and Saunders) but it
pays to be conservative and to iterate longer than you think you need to! I have found that stopping when
the norm of the model vector has reached a stable value works best.
8. Resolution and error analysis when the generalized inverse is not available
In section 5, we discussed resolution and error and gave results in terms of the generalized inverse of A
(equations 24 and 25). How do we go about computing resolution and error when A^{+} is not available (as
when using an iterative technique)? One way of estimating the resolution matrix is to do an inversion where
we set the mth element of the model vector x to one and all the others to zero; call this vector x_m. Now,
compute d_m = A\,x_m and solve A\,x = d_m using exactly the same iterative algorithm as you used to get your
true model. This process computes a single row (and column) of the resolution matrix corresponding to the
mth model element. The complete resolution matrix can be computed by performing M such inversions,
one for each model parameter. Clearly this is infeasible if we are talking about 50,000 model parameters
but we can focus on key areas of the model where we are particularly interested in the resolvability of a
particular structure.
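A minimal sketch of this "spike test" for one row of the resolution matrix (iterative_solve is a hypothetical stand-in for whatever solver and regularization were used for the real inversion):

    import numpy as np

    def resolution_row(A, m, iterative_solve):
        x_spike = np.zeros(A.shape[1])
        x_spike[m] = 1.0                       # unit spike in model element m
        d_spike = A @ x_spike                  # synthetic data predicted by the spike
        return iterative_solve(A, d_spike)     # shows how the inversion smears the spike

    # stand-in solver for demonstration only; a real study would reuse its own LSQR/SIRT call
    demo_solve = lambda A, d: np.linalg.lstsq(A, d, rcond=None)[0]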
A modification of the above process (which is sometimes called a "spike test") is to solve for some pattern
to test resolution over a broad region. A common choice is to use a checkerboard pattern in one of the layers
of the model. A synthetic data set is computed for this checkerboard model and then inverted using exactly
the same iterative algorithm used to get the real model. The recovered checkerboard can indicate areas of
problematic recovery in the layer being tested and can show leakage into adjacent layers above and below.
The estimation of the covariance matrix of the model can also be problematic but usually we are satisfied
with the diagonal elements (the square roots of which are the standard deviations of the model parameters).
It turns out that the best way to estimate these is to add a noise vector to the data vector, d' = d + e, where the
elements of e are randomly chosen from a normal distribution with unit standard deviation (remember, we
divided all data by their errors initially). We then solve for a model using this perturbed data vector in our
iterative procedure. We repeat this process many times (100, say) and then look at the standard deviations
of the elements of the 100 models we have generated. Tests show that this process produces an excellent
estimate of the diagonal elements of the model covariance matrix.
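A sketch of this randomized error estimate (again with iterative_solve standing in for the actual inversion code):

    import numpy as np

    def model_errors(A, d, iterative_solve, n_realizations=100, seed=0):
        rng = np.random.default_rng(seed)
        models = [iterative_solve(A, d + rng.standard_normal(len(d)))   # unit-variance noise:
                  for _ in range(n_realizations)]                       # data were pre-scaled by their errors
        return np.std(models, axis=0)          # estimates of the model parameter standard deviations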
9. Importance of earthquake location in tomography
It turns out that our (in)ability to locate earthquakes accurately means that we have a source of noise in
our tomographic problem which can rival the signal from 3D structure (at least for P-wave tomography).
We can estimate the uncertainty due to event mislocation by considering the following equation
\delta t = \frac{\partial t}{\partial x}\,\delta x + \frac{\partial t}{\partial y}\,\delta y + \frac{\partial t}{\partial z}\,\delta z + \delta t_0\,,
where δx, δy, δz are errors in event location, δt_0 is the error in origin time, and δt is the resulting error in
travel time. Analyses of mislocations of events located by independent means lead to estimates of the
length of a typical mislocation vector, δX, of 14-18 km. To convert this number to a typical change in
epicentral distance, we assume that the stations are uniformly distributed around the event so that stations
in a direction perpendicular to the mislocation vector see no change in epicentral distance while stations in
the direction of mislocation will see the full value. Assuming a cosinusoidal dependence as a function of
azimuth suggests that, on average, the error in epicentral distance is δX/√2. Assuming that δx and δy do
not co-vary (as suggested by an analysis of the differences of the NEIC and ISC locations), the error in the
travel time due to the error in each of δx and δy is

p\,\delta X / \sqrt{2}\,,
where p is the ray parameter. It is well known that errors in depth and origin time do covary, with δt_0 ≈ δz/9
(depth in kilometers). Since ∂t/∂z is negative, the errors in origin time and depth tend to cancel in their
contribution to the total error and the errors in δx and δy dominate the error budget. We now assume a typical
depth uncertainty of about 10 km and find that the resulting travel time error is 0.6-1.2 seconds for P waves
at epicentral distances of about 70 degrees for mislocation vectors of length 10-20 km. The corresponding
estimate for S waves is 1.6-2.5 seconds. These numbers rival the signals from 3D structure.
These results mean that we cannot ignore earthquake mislocation in our tomography and we must either
relocate events or make our data insensitive to event location. Consider the travel time residuals for one
event:

\delta t = A\,\delta h + B\,\delta v \qquad (45)

where δh contains the perturbations to the event location and origin time and δv is the velocity model
perturbation. We could solve this equation iteratively: first relocate the events, then solve separately for
velocity structure, then relocate again but now including the new velocity structure. Convergence is usually
attained after a few iterations, but often to an incorrect solution. Alternatively, we can seek linear combinations
of the data for each event which, to first order, are insensitive to the event location. This reduces to finding P
such that

P\,\delta t = P A\,\delta h + P B\,\delta v = P B\,\delta v \qquad (46)
i.e., we want P A = 0. Note that if A has the SVD A = U \Lambda V^T then P = G(I - U U^T) where G is
any matrix. We choose G so that the new data δt' = P δt are statistically independent. If δt has a
covariance matrix I then δt' has covariance matrix G(I - U U^T)G^T (since (I - U U^T) = (I - U U^T)^T and
(I - U U^T)(I - U U^T) = (I - U U^T)). Thus, if (I - U U^T) has the eigenvalue decomposition R \Lambda R^T then
choosing G = \Lambda^{-1/2} R^T leads to the desired covariance matrix which is I. It is interesting that the eigenvalues
of (I - U U^T) are one or zero and we lose four eigenvalues during the projection process; we have effectively
used up four data to remove sensitivity to location.
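A small numpy sketch of the projection idea (illustrative only): build the matrix A of location partial derivatives for one event, form I - U U^T from its SVD, and check that the projected data have lost their sensitivity to mislocation.

    import numpy as np

    rng = np.random.default_rng(3)
    n_obs = 25                                  # picks for one event
    A = rng.standard_normal((n_obs, 4))         # partials w.r.t. x, y, z and origin time (stand-ins)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    P0 = np.eye(n_obs) - U @ U.T                # the (I - U U^T) factor of the projection operator

    print(np.allclose(P0 @ A, 0.0))             # True: projected residuals are insensitive to mislocation
    print(np.linalg.matrix_rank(P0))            # n_obs - 4: four data are effectively used up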
The alternative process is to relocate initially, solving

\delta t = A\,\delta h \qquad (47)

which, if we have used an SVD, would lead to a mislocation vector

\delta h = V \Lambda^{-1} U^T \delta t \qquad (48)

and equation 45 would become

\delta t - A\,\delta h = (I - U U^T)\,\delta t = B\,\delta v \qquad (49)
Note this is similar to the projection method with P = G(I - U U^T) except that we have not taken account
of the fact that we have "used some data up" in doing the relocation and we have not consistently operated on
the B part of equation 45 as we did with the projection method. Ignoring these niceties does still leave us with
a B matrix which is sparse whereas, in the projection method, each new travel time is a linear combination
of all the travel times for that event so that P B is no longer as sparse as we would like. Unfortunately,
alternating between solving for locations and then for 3D velocity structure can result in convergence to the
wrong answer so this technique should be used with great care.
It is also possible to solve equation 45 directly which means expanding the unknown vector we are
solving for by four times the number of events. Since we are often dealing with 10,000 events, this means
we are adding 40,000 more unknowns to the problem. However, the matrices remain extremely sparse so
this approach is entirely feasible from a computational point of view.