
INVERSE PROBLEMS IN GEOPHYSICS

GEOS 567






A Set of Lecture Notes

by

Professors Randall M. Richardson and George Zandt
Department of Geosciences
University of Arizona
Tucson, Arizona 85721







Revised and Updated Fall 2009




TABLE OF CONTENTS


PREFACE .......................................................................................................................................v

CHAPTER 1: INTRODUCTION ..................................................................................................1

1.1 Inverse Theory: What It Is and What It Does ........................................................1
1.2 Useful Definitions ...................................................................................................2
1.3 Possible Goals of an Inverse Analysis ....................................................................3
1.4 Nomenclature ..........................................................................................................4
1.5 Examples of Forward Problems ..............................................................................7
1.5.1 Example 1: Fitting a Straight Line ..........................................................7
1.5.2 Example 2: Fitting a Parabola .................................................................8
1.5.3 Example 3: Acoustic Tomography .........................................................9
1.5.4 Example 4: Seismic Tomography .........................................................10
1.5.5 Example 5: Convolution ...................................................................... 10
1.6 Final Comments ....................................................................................................11

CHAPTER 2: REVIEW OF LINEAR ALGEBRA AND STATISTICS.....................................12

2.1 Introduction ...........................................................................................................12
2.2 Matrices and Linear Transformations ....................................................................12
2.2.1 Review of Matrix Manipulations ...........................................................12
2.2.2 Matrix Transformations .........................................................................15
2.2.3 Matrices and Vector Spaces ...................................................................19
2.2 Probability and Statistics .......................................................................................21
2.3.1 Introduction ............................................................................................21
2.3.2 Definitions, Part 1...................................................................................21
2.3.3 Some Comments on Applications to Inverse Theory ............................24
2.3.4 Definitions, Part 2 ..................................................................................25

CHAPTER 3: INVERSE METHODS BASED ON LENGTH ...................................................31

3.1 Introduction ...........................................................................................................31
3.2 Data Error and Model Parameter Vectors .............................................................31
3.3 Measures of Length................................................................................................31
3.4 Minimizing the Misfit: Least Squares...................................................................33
3.4.1 Least Squares Problem for a Straight Line ............................................33
3.4.2 Derivation of the General Least Squares Solution .................................36
3.4.3 Two Examples of Least Squares Problems ............................................38
3.4.4 Four-Parameter Tomography Problem ..................................................40
3.5 Determinancy of Least Squares Problems .............................................................42
3.5.1 Introduction ............................................................................................42
3.5.2 Even-Determined Problems: M = N ......................................................43
3.5.3 Overdetermined Problems: Typically, N > M .......................................43
3.5.4 Underdetermined Problems: Typically M > N ......................................43
3.6 Minimum Length Solution.....................................................................................44
3.6.1 Background Information ........................................................................44
3.6.2 Lagrange Multipliers ..............................................................................45
3.6.3 Application to the Purely Underdetermined Problem ............................48
3.6.4 Comparison of Least Squares and Minimum Length Solutions .............50
3.6.5 Example of Minimum Length Problem .................................................50
3.7 Weighted Measures of Length...............................................................................51
3.7.1 Introduction ............................................................................................51
3.7.2 Weighted Least Squares..........................................................................52
3.7.3 Weighted Minimum Length ...................................................................55
3.7.4 Weighted Damped Least Squares ...........................................................57
3.8 A Priori Information and Constraints ....................................................................58
3.8.1 Introduction ............................................................................................58
3.8.2 A First Approach to Including Constraints ............................................59
3.8.3 A Second Approach to Including Constraints .......................................61
3.8.4 Example From Seismic Receiver Functions ..........................................64
3.9 Variance of the Model Parameters.........................................................................65
3.9.1 Introduction ............................................................................................65
3.9.2 Application to Least Squares .................................................................65
3.9.3 Application to the Minimum Length Problem .......................................66
3.9.4 Geometrical Interpretation of Variance .................................................66

CHAPTER 4: LINEARIZATION OF NONLINEAR PROBLEMS............................................70
4.1 Introduction ...........................................................................................................70
4.2 Linearization of Nonlinear Problems ....................................................................70
4.3 General Procedure for Nonlinear Problems ..........................................................73
4.4 Three Examples ....................................................................................................73
4.4.1 A Linear Example ..................................................................................73
4.4.2 A Nonlinear Example ............................................................................75
4.4.3 Nonlinear Straight-Line Example ..........................................................81
4.5 Creeping vs Jumping (Shaw and Orcutt, 1985) ....................................................86

CHAPTER 5: THE EIGENVALUE PROBLEM ........................................................................89

5.1 Introduction ...........................................................................................................89
5.2 The Eigenvalue Problem for Square (M × M) Matrix A .......................................89
5.2.1 Background ............................................................................................89
5.2.2 How Many Eigenvalues, Eigenvectors? ................................................90
5.2.3 The Eigenvalue Problem in Matrix Notation .........................................92
5.2.4 Summarizing the Eigenvalue Problem for A .........................................94
5.3 Geometrical Interpretation of the Eigenvalue Problem for Symmetric A ............95
5.3.1 Introduction ............................................................................................95
5.3.2 Geometrical Interpretation .....................................................................96
5.3.3 Coordinate System Rotation ................................................................100
5.3.4 Summarizing Points .............................................................................101
5.4 Decomposition Theorem for Square A................................................................102
5.4.1 The Eigenvalue Problem for A^T ..........................................................102
5.4.2 Eigenvectors for A^T .............................................................................103
5.4.3 Decomposition Theorem for Square Matrices .....................................103
5.4.4 Finding the Inverse A^-1 for the M × M Matrix A ................................110
5.4.5 What Happens When There Are Zero Eigenvalues? ...........................111
5.4.6 Some Notes on the Properties of S_P and R_P ......................................114
5.5 Eigenvector Structure of m_LS .............................................................................115
5.5.1 Square Symmetric A Matrix With Nonzero Eigenvalues ....................115
5.5.2 The Case of Zero Eigenvalues ..............................................................117
5.5.3 Simple Tomography Problem Revisited ..............................................118

CHAPTER 6: SINGULAR-VALUE DECOMPOSITION (SVD) ............................................123

6.1 Introduction .........................................................................................................123
6.2 Formation of a New Matrix B .............................................................................123
6.2.1 Formulating the Eigenvalue Problem With G .....................................123
6.2.2 The Role of G^T as an Operator ............................................................124
6.3 The Eigenvalue Problem for B ...........................................................................125
6.3.1 Properties of B .....................................................................................125
6.3.2 Partitioning W ......................................................................................126
6.4 Solving the Shifted Eigenvalue Problem ............................................................127
6.4.1 The Eigenvalue Problem for G^T G .......................................................127
6.4.2 The Eigenvalue Problem for GG^T ........................................................128
6.5 How Many η_i Are There, Anyway?? ..................................................................129
6.5.1 Introducing P, the Number of Nonzero Pairs (+η_i, −η_i) ....................130
6.5.2 Finding the Eigenvector Associated with −η_i .....................................131
6.5.3 No New Information From the −η_i System .........................................131
6.5.4 What About the Zero Eigenvalues η_i's, i = 2(P + 1), . . . , N + M? .....132
6.5.5 How Big is P? ........................................................................................133
6.6 Introducing Singular Values ...............................................................................134
6.6.1 Introduction ..........................................................................................134
6.6.2 Definition of the Singular Value ..........................................................135
6.6.3 Definition of Λ, the Singular-Value Matrix .........................................135
6.7 Derivation of the Fundamental Decomposition Theorem for General G
(N × M, N ≠ M) ...................................................................................137
6.8 Singular-Value Decomposition (SVD) ...............................................................138
6.8.1 Derivation of Singular-Value Decomposition .....................................138
6.8.2 Rewriting the Shifted Eigenvalue Problem ..........................................140
6.8.3 Summarizing SVD ...............................................................................141
6.9 Mechanics of Singular-Value Decomposition ....................................................142
6.10 Implications of Singular-Value Decomposition .................................................143
6.10.1 Relationships Between U, U_P, and U_0 .................................................143
6.10.2 Relationships Between V, V_P, and V_0 .................................................144
6.10.3 Graphic Representation of U, U_P, U_0, V, V_P, and V_0 Spaces ..............145
6.11 Classification of d = Gm Based on P, M, and N ................................................146
6.11.1 Introduction ..........................................................................................146
6.11.2 Class I: P = M = N ..............................................................................147
6.11.3 Class II: P = M < N .............................................................................147
6.11.4 Class III: P = N < M ............................................................................148
6.11.5 Class IV: P < min(N, M) .....................................................................149

CHAPTER 7: THE GENERALIZED INVERSE AND MEASURES OF QUALITY .............150

7.1 Introduction .........................................................................................................150
7.2 The Generalized Inverse Operator G_g^-1 ..............................................................152
7.2.1 Background Information ......................................................................152
7.2.2 Class I: P = N = M ..............................................................................152
7.2.3 Class II: P = M < N .............................................................................153
7.2.4 Class III: P = N < M ............................................................................161
7.2.5 Class IV: P < min(N, M) .....................................................................164
7.3 Measures of Quality for the Generalized Inverse ...............................................166
7.3.1 Introduction ..........................................................................................166
7.3.2 The Model Resolution Matrix R ..........................................................167
7.3.3 The Data Resolution Matrix N .............................................................171
7.3.4 The Unit (Model) Covariance Matrix [cov_u m] ...................................176
7.3.5 A Closer Look at Stability ....................................................................176
7.3.6 Combining R, N, [cov_u m] ..................................................................180
7.3.7 An Illustrative Example .......................................................................181
7.4 Quantifying the Quality of R, N, and [cov_u m] ..................................................184
7.4.1 Introduction ..........................................................................................184
7.4.2 Classes of Problems .............................................................................184
7.4.3 Effect of the Generalized Inverse Operator G_g^-1 .................................185
7.5 Resolution Versus Stability ................................................................................187
7.5.1 Introduction ..........................................................................................187
7.5.2 R, N, and [cov_u m] for Nonlinear Problems ........................................189

CHAPTER 8: VARIATIONS OF THE GENERALIZED INVERSE ......................................195

8.1 Linear Transformations .......................................................................................195
8.1.1 Analysis of the Generalized Inverse Operator G_g^-1 ............................195
8.1.2 G_g^-1 Operating on a Data Vector d ......................................................197
8.1.3 Mapping Between Model and Data Space: An Example ....................198
8.2 Including Prior Information, or the Weighted Generalized Inverse ...................200
8.2.1 Mathematical Background ...................................................................200
8.2.2 Coordinate System Transformation of Data and Model Parameter
Vectors ...........................................................................................203
8.2.3 The Maximum Likelihood Inverse Operator, Resolution, and
Model Covariance .........................................................................204
8.2.4 Effect on Model- and Data-Space Eigenvectors ..................................208
8.2.5 An Example .........................................................................................210
8.3 Damped Least Squares and the Stochastic Inverse .............................................216
8.3.1 Introduction ..........................................................................................216
8.3.2 The Stochastic Inverse .........................................................................216
8.3.3 Damped Least Squares .........................................................................220
8.4 Ridge Regression ................................................................................................225
8.4.1 Mathematical Background ...................................................................225
8.4.2 The Ridge Regression Operator ...........................................................226
8.4.3 An Example of Ridge Regression Analysis .........................................228
8.5 Maximum Likelihood .........................................................................................232
8.5.1 Background ..........................................................................................232
8.5.2 The General Case..................................................................................235

CHAPTER 9: CONTINUOUS INVERSE THEORY AND OTHER APPROACHES.............239

9.1 Introduction .........................................................................................................239
9.2 The Backus-Gilbert Approach ............................................................................240
9.3 Neural Networks .................................................................................................248
9.4 The Radon Transform and Tomography (Approach 1) .......................................251
9.4.1 Introduction .............................................................................................251
9.4.2 Interpretation of Tomography Using the Radon Transform ...................254
9.4.3 Slant-Stacking as a Radon Transform (following Claerbout, 1985) .......255
9.5 A Review of the Radon Transform (Approach 2) ...............................................259
9.6 Alternative Approach to Tomography ................................................................262
PREFACE

This set of lecture notes has its origin in a nearly incomprehensible course in inverse
theory that I took as a first-semester graduate student at MIT. My goal, as a teacher and in these
notes, is to present inverse theory in such a way that it is not only comprehensible but useful.

Inverse theory, loosely defined, is the fine art of inferring as much as possible about a
problem from all available information. Information includes not only data in the traditional sense but
also the relationship between actual and predicted data. In a nuts-and-bolts definition, it is one
(some would argue the best!) way to find and assess the quality of a solution to some
(mathematical) problem of interest.

Inverse theory has two main branches dealing with discrete and continuous problems,
respectively. This text concentrates on the discrete case, covering enough material for a single-
semester course. A background in linear algebra, probability and statistics, and computer
programming will make the material much more accessible. Review material is provided on the
first two topics in Chapter 2.

This text could stand alone. However, it was written to complement and extend the
material covered in the supplemental text for the course, which deals more completely with some
areas. Furthermore, these notes make numerous references to sections in the supplemental text.
Besides, the supplemental text is, by far, the best textbook on the subject and should be a part of
the library of anyone interested in inverse theory. The supplemental text is:

Geophysical Data Analysis: Discrete Inverse Theory (Revised Edition)
by William Menke, Academic Press, 1989.

The course format is largely lecture. We may, from time to time, read articles from the
literature and work in a seminar format. I will try to schedule a couple of guest lectures in
applications. Be forewarned: there is a lot of homework for this course, and the assignments are occasionally
very time consuming. I make every effort to avoid pure algebraic nightmares, but my general
philosophy is summarized below:

I hear, and I forget.
I see, and I remember.
I do, and I understand.
Chinese Proverb

I try to have you do a simple problem by hand before turning you loose on the computer,
where all realistic problems must be solved. You will also have access to existing code and a
computer account on a SPARC workstation. You may use and modify the code for some of the
homework and for the term project. The term project is an essential part of the learning process
and, I hope, will help you tie the course work together. Grading for this course will be as
follows:

60% Homework
30% Term Project
10% Class Participation

Good luck, and may you find the trade-off between stability and resolution less traumatic
than most, on average.

Randy Richardson
August 2009





CHAPTER 1: INTRODUCTION


1.1 Inverse Theory: What It Is and What It Does


Inverse theory, at least as I choose to define it, is the fine art of estimating model
parameters from data. It requires a knowledge of the forward model capable of predicting data if
the model parameters were, in fact, already known. Anyone who attempts to solve a problem in
the sciences is probably using inverse theory, whether or not he or she is aware of it. Inverse
theory, however, is capable (at least when properly applied) of doing much more than just
estimating model parameters. It can be used to estimate the quality of the predicted model
parameters. It can be used to determine which model parameters, or which combinations of
model parameters, are best determined. It can be used to determine which data are most
important in constraining the estimated model parameters. It can determine the effects of noisy
data on the stability of the solution. Furthermore, it can help in experimental design by
determining where, what kind, and how precise data must be to determine model parameters.

Inverse theory is, however, inherently mathematical and as such does have its limitations.
It is best suited to estimating the numerical values of, and perhaps some statistics about, model
parameters for some known or assumed mathematical model. It is less well suited to provide the
fundamental mathematics or physics of the model itself. I like the example Albert Tarantola
gives in the introduction of his classic book¹ on inverse theory. He says, ". . . you can always
measure the captain's age (for instance by picking his passport), but there are few chances for
this measurement to carry much information on the number of masts of the boat." You must
have a good idea of the applicable forward model in order to take advantage of inverse theory.
Sooner or later, however, most practitioners become rather fanatical about the benefits of a
particular approach to inverse theory. Consider the following as an example of how, or how not,
to apply inverse theory. The existence or nonexistence of a God is an interesting question.
Inverse theory, however, is poorly suited to address this question. However, if one assumes that
there is a God and that She makes angels of a certain size, then inverse theory might well be
appropriate to determine the number of angels that could fit on the head of a pin. Now, who said
practitioners of inverse theory tend toward the fanatical?



In the rest of this chapter, I will give some useful definitions of terms that will come up
time and again in inverse theory, and give some examples, mostly from Menke's book, of how to
set up forward problems in an attempt to clearly identify model parameters from data.

¹ Inverse Problem Theory, by Albert Tarantola, Elsevier Scientific Publishing Company, 1987.
1.2 Useful Definitions


Let us begin with some definitions of things like forward and inverse theory, models and
model parameters, data, etc.


Forward Theory: The (mathematical) process of predicting data based on some physical or
mathematical model with a given set of model parameters (and perhaps some other appropriate
information, such as geometry, etc.).

Schematically, one might represent this as follows:

predicted data
model parameters model


As an example, consider the two-way vertical travel time t of a seismic wave through M layers of
thickness h_i and velocity v_i. Then t is given by

t = \sum_{i=1}^{M} \frac{2 h_i}{v_i}        (1.1)

The forward problem consists of predicting data (travel time) based on a (mathematical) model
of how seismic waves travel. Suppose that for some reason thickness was known for each layer
(perhaps from drilling). Then only the M velocities would be considered model parameters. One
would obtain a particular travel time t for each set of model parameters one chooses.
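For concreteness, a quick numerical check of Equation (1.1), with numbers chosen here purely for illustration (they are not from the notes): take M = 2 layers with h_1 = 1.0 km, h_2 = 2.0 km, v_1 = 2.0 km/s, and v_2 = 4.0 km/s. Then

t = 2\left( \frac{1.0}{2.0} + \frac{2.0}{4.0} \right) = 2.0 \text{ s}

Changing either velocity changes the predicted travel time; evaluating t for a given set of model parameters is exactly the forward problem.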


Inverse Theory: The (mathematical) process of predicting (or estimating) the numerical values
(and associated statistics) of a set of model parameters of an assumed model based on a set of
data or observations.

Schematically, one might represent this as follows:


model data
predicted (or estimated)
model parameters


As an example, one might invert the travel time t above to determine the layer velocities. Note
that one needs to know the (mathematical) model relating travel time to layer thickness and
velocity information. Inverse theory should not be expected to provide the model itself.


Model: The model is the (mathematical) relationship between model parameters (and other
auxiliary information, such as the layer thickness information in the previous example) and the
data. It may be linear or nonlinear, etc.
Model Parameters: The model parameters are the numerical quantities, or unknowns, that one
is attempting to estimate. The choice of model parameters is usually problem dependent, and
quite often arbitrary. For example, in the case of travel times cited earlier, layer thickness is not
considered a model parameter, while layer velocity is. There is nothing sacred about these
choices. As a further example, one might choose to cast the previous example in terms of
slowness s_i, where

s_i = 1 / v_i        (1.2)

Travel time t is a nonlinear function of layer velocities but a linear function of layer slowness.
As you might expect, it is much easier to solve linear than nonlinear inverse problems. A more
serious problem, however, is that linear and nonlinear formulations may result in different
estimates of velocity if the data contain any noise. The point I am trying to impress on you now
is that there is quite a bit of freedom in the way model parameters are chosen, and it can affect
the answers you get!
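To see explicitly why the slowness parameterization is attractive (a small derivation added here; it follows directly from Equations (1.1) and (1.2)), substitute s_i = 1/v_i into Equation (1.1):

t = \sum_{i=1}^{M} 2 h_i s_i

This is already of the explicit linear form d = Gm introduced below, with the single datum d = t, the model vector m = [s_1, . . . , s_M]^T, and G a 1 × M row matrix whose entries are the known constants 2h_i. Written in terms of the velocities v_i, the same relationship is nonlinear in the model parameters.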


Data: Data are simply the observations or measurements one makes in an attempt to constrain
the solution of some problem of interest. Travel time in the example above is an example of
data. There are, of course, many other examples of data.


Some examples of inverse problems (mostly from Menke) follow:

Medical tomography
Earthquake location
Earthquake moment tensor inversion
Earth structure from surface or body wave inversion
Plate velocities (kinematics)
Image enhancement
Curve fitting
Satellite navigation
Factor analysis


1.3 Possible Goals of an Inverse Analysis


Now let us turn our attention to some of the possible goals of an inverse analysis. These
might include:

1. Estimates of a set of model parameters (obvious).
2. Bounds on the range of acceptable model parameters.
3. Estimates of the formal uncertainties in the model parameters.
4. How sensitive is the solution to noise (or small changes) in the data?
5. Where, and what kind, of data are best suited to determine a set of model parameters?
6. Is the fit between predicted and observed data adequate?
7. Is a more complicated (i.e., more model parameters) model significantly better than a
simpler model?

Not all of these are completely independent goals. It is important to realize, as early as possible,
that there is much more to inverse theory than simply a set of estimated model parameters. Also,
it is important to realize that there is very often not a single correct answer. Unlike a
mathematical inverse, which either exists or does not exist, there are many possible approximate
inverses. These may give different answers. Part of the goal of an inverse analysis is to
determine if the answer you have obtained is reasonable, valid, acceptable, etc. This takes
experience, of course, but you have begun the process.
Before going on with how to formulate the mathematical methods of inverse theory, I
should mention that there are two basic branches of inverse theory. In the first, the model
parameters and data are discrete quantities. In the second, they are continuous functions. An
example of the first might occur with the model parameters we seek being given by the moments
of inertia of the planets:

model parameters = I_1, I_2, I_3, . . . , I_10        (1.3)

and the data being given by the perturbations in the orbital periods of satellites:

data = T_1, T_2, T_3, . . . , T_N        (1.4)

An example of a continuous function type of problem might be given by velocity as a
function of depth:

model parameters = v(z) (1.5)

and the data given by a seismogram of ground motion

data = d(t) (1.6)

Separate strategies have been developed for discrete and continuous inverse theory.
There is, of course, a fair bit of overlap between the two. In addition, it is often possible to
approximate continuous functions with a discrete set of values. There are potential problems
(aliasing, for example) with this approach, but it often makes otherwise intractable problems
tractable. Menke's book deals exclusively with the discrete case. This course will certainly
emphasize discrete inverse theory, but I will also give you a little of the continuous inverse
theory at the end of the semester.


1.4 Nomenclature


Now let us introduce some nomenclature. In these notes, vectors will be denoted by
boldface lowercase letters, and matrices will be denoted by boldface uppercase letters.

Suppose one makes N measurements in a particular experiment. We are trying to
determine the values of M model parameters. Our nomenclature for data and model parameters
will be

data: d = [d
1
, d
2
, d
3
, . . . , d
N
]
T
(1.7)

model parameters: m = [m
1
, m
2
, m
3
, . . . , m
M
]
T
(1.8)

where d and m are N and M dimensional column vectors, respectively, and T denotes transpose.

The model, or relationship between d and m, can have many forms. These can generally
be classified as either explicit or implicit, and either linear or nonlinear.

Explicit means that the data and model parameters can be separated onto different sides
of the equal sign. For example,

d_1 = 2m_1 + 4m_2        (1.9)

and

d_1 = 2m_1 + 4m_1^2 m_2        (1.10)

are two explicit equations.

Implicit means that the data cannot be separated on one side of an equal sign with model
parameters on the other side. For example,

d_1 − (m_1 + m_2) = 0        (1.11)

and

d_1 − (m_1 + m_1^2 m_2) = 0        (1.12)

are two implicit equations. In each example above, the first represents a linear relationship
between the data and model parameters, and the second represents a nonlinear relationship.

In this course we will deal exclusively with explicit type equations, and predominantly
with linear relationships. Then, the explicit linear case takes the form

d = Gm (1.13)

where d is an N-dimensional data vector, m is an M-dimensional model parameter vector, and G
is an N × M matrix containing only constant coefficients.

The matrix G is sometimes called the kernel or data kernel or even the Green's function
because of the analogy with the continuous function case:

d(x) = \int G(x, t)\, m(t)\, dt        (1.14)

Consider the following discrete case example with two observations (N = 2) and three
model parameters (M = 3):

d_1 = 2m_1 + 0m_2 − 4m_3
d_2 = m_1 + 2m_2 + 3m_3        (1.15)


which may be written as


\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} 2 & 0 & -4 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix}        (1.16)

or simply

d = Gm (1.13)

where

d = [d_1, d_2]^T

m = [m_1, m_2, m_3]^T

and

G = \begin{bmatrix} 2 & 0 & -4 \\ 1 & 2 & 3 \end{bmatrix}        (1.17)

Then d and m are 2 × 1 and 3 × 1 column vectors, respectively, and G is a 2 × 3 matrix with
constant coefficients.
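As a concrete illustration (this short program is not part of the original notes; the model values in the second DATA statement are arbitrary test numbers), the forward problem of Equation (1.16) can be evaluated directly:

C     Illustrative sketch, not from the original notes: evaluate
C     d = Gm for the 2 x 3 example of Equations (1.15)-(1.17),
C     using an arbitrary test model m = (1, 2, 3).
      PROGRAM FWDEX
      REAL G(2,3), M(3), D(2)
      DATA G /2., 1., 0., 2., -4., 3./
      DATA M /1., 2., 3./
      DO 10 I = 1, 2
         D(I) = 0.0
         DO 10 J = 1, 3
   10    D(I) = D(I) + G(I,J)*M(J)
      PRINT *, 'predicted data d = ', D(1), D(2)
      END

With m = [1, 2, 3]^T it prints d = [-10, 14]^T, which is easily checked by hand against Equation (1.15).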



On the following pages I will give some examples of how forward problems are set up
using matrix notation. See pages 10–16 of Menke for these and other examples.
1.5 Examples of Forward Problems


1.5.1 Example 1: Fitting a Straight Line (See Page 10 of Menke)


[Figure: temperature T versus depth z; the N data points scatter about a straight line with intercept a and slope b.]


Suppose that N temperature measurements T_i are made at depths z_i in the earth. The data
are then a vector d of N measurements of temperature, where d = [T_1, T_2, T_3, . . . , T_N]^T. The
depths z_i are not data. Instead, they provide some auxiliary information that describes the
geometry of the experiment. This distinction will be further clarified below.

Suppose that we assume a model in which temperature is a linear function of depth: T =
a + bz. The intercept a and slope b then form the two model parameters of the problem, m =
[a, b]^T. According to the model, each temperature observation must satisfy T = a + zb:

T_1 = a + bz_1
T_2 = a + bz_2
  ⋮
T_N = a + bz_N


These equations can be arranged as the matrix equation Gm = d:

\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_N \end{bmatrix} = \begin{bmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_N \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
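A sketch of how this G might be filled in code (the snippet below is not part of the original notes; the routine and array names are arbitrary). Note that the depths Z(I), being auxiliary information, go into G itself, while the model parameters a and b never appear:

C     Illustrative sketch, not from the original notes: fill the
C     N x 2 data kernel for the straight-line problem T = a + b*z.
      SUBROUTINE LINEG(Z, N, G)
      REAL Z(N), G(N,2)
      DO 10 I = 1, N
         G(I,1) = 1.0
   10    G(I,2) = Z(I)
      RETURN
      END

Predicted temperatures then follow from a single matrix multiplication, d = Gm, with m = [a, b]^T.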

1.5.2 Example 2: Fitting a Parabola (See Page 11 of Menke)


[Figure: temperature T versus depth z; the N data points follow a parabolic trend.]


If the model in example 1 is changed to assume a quadratic variation of temperature with
depth of the form T = a + bz + cz^2, then a new model parameter is added to the problem, m = [a,
b, c]^T. The number of model parameters is now M = 3. The data are supposed to satisfy

T_1 = a + bz_1 + cz_1^2
T_2 = a + bz_2 + cz_2^2
  ⋮
T_N = a + bz_N + cz_N^2


These equations can be arranged into the matrix equation

\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_N \end{bmatrix} = \begin{bmatrix} 1 & z_1 & z_1^2 \\ 1 & z_2 & z_2^2 \\ \vdots & \vdots & \vdots \\ 1 & z_N & z_N^2 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix}


This matrix equation has the explicit linear form Gm = d. Note that, although the
equation is linear in the data and model parameters, it is not linear in the auxiliary variable z.

The equation has a very similar form to the equation of the previous example, which
brings out one of the underlying reasons for employing matrix notation: it can often emphasize
similarities between superficially different problems.
1.5.3 Example 3: Acoustic Tomography (See Pages 12–13 of Menke)


Suppose that a wall is assembled from a rectangular array of bricks (Figure 1.1 from
Menke, below) and that each brick is composed of a different type of clay. If the acoustic
velocities of the different clays differ, one might attempt to distinguish the different kinds of
bricks by measuring the travel time of sound across the various rows and columns of bricks, in
the wall. The data in this problem are N = 8 measurements of travel times, d = [T_1, T_2, T_3, . . . ,
T_8]^T. The model assumes that each brick is composed of a uniform material and that the travel
time of sound across each brick is proportional to the width and height of the brick. The
proportionality factor is the brick's slowness s_i, thus giving M = 16 model parameters, m = [s_1,
s_2, s_3, . . . , s_16]^T, where the ordering is according to the numbering scheme of the figure as



The travel time of acoustic waves (dashed lines) through the rows and columns of a square array
of bricks is measured with the acoustic source S and receiver R placed on the edges of the square.
The inverse problem is to infer the acoustic properties of the bricks (which are assumed to be
homogeneous).

row 1:        T_1 = hs_1 + hs_2 + hs_3 + hs_4
row 2:        T_2 = hs_5 + hs_6 + hs_7 + hs_8
  ⋮                  ⋮
column 4:  T_8 = hs_4 + hs_8 + hs_12 + hs_16


and the matrix equation is

\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_8 \end{bmatrix} = h \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ & & & & & & & \vdots & & & & & & & & \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_{16} \end{bmatrix}


Here the bricks are assumed to be of width and height h.
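A sketch of how the 8 × 16 data kernel could be filled in code (not part of the original notes; the routine name is arbitrary, and the row-by-row brick numbering follows the description above):

C     Illustrative sketch, not from the original notes: build the
C     8 x 16 kernel G for the brick-wall example.  Bricks are
C     numbered row by row (row 1 holds bricks 1-4, etc.), and H is
C     the common brick width and height.
      SUBROUTINE WALLG(H, G)
      REAL G(8,16), H
      DO 10 I = 1, 8
         DO 10 J = 1, 16
   10    G(I,J) = 0.0
C     Data 1-4: travel times along the four rows of bricks
      DO 20 I = 1, 4
         DO 20 J = 1, 4
   20    G(I, 4*(I-1)+J) = H
C     Data 5-8: travel times along the four columns of bricks
      DO 30 J = 1, 4
         DO 30 I = 1, 4
   30    G(4+J, 4*(I-1)+J) = H
      RETURN
      END

Each row of G then reproduces one of the travel-time equations written out above.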
1.5.4 Example 4: Seismic Tomography


An example of the impact of inverse methods in the geosciences: Northern California.
• A large amount of data is available, much of it redundant.
• Patterns in the data can be interpreted qualitatively.
• Inversion results quantify the patterns.
• Perhaps more importantly, inverse methods provide quantitative information on the resolution, standard error, and "goodness of fit."
• We cannot overemphasize the "impact" of colorful graphics, for both good and bad.
• Inverse theory is not a magic bullet. Bad data will still give bad results, and interpretation of even good results requires breadth of understanding in the field.
• Inverse theory does provide quantitative information on how well the model is "determined," the importance of the data, and model errors.
• Another example: improvements in "imaging" subduction zones.


1.5.5 Example 5: Convolution


Convolution is widely significant as a physical concept and offers an advantageous
starting point for many theoretical developments. One way to think about convolution is that it
describes the action of an observing instrument when it takes a weighted mean of some physical
quantity over a narrow range of some variable. All physical observations are limited in this way,
and for this reason alone convolution is ubiquitous (paraphrased from Bracewell, The Fourier
Transform and Its Applications, 1964). It is widely used in time series analysis as well to
represent physical processes.

The convolution of two functions f(x) and g(x), represented as f(x)*g(x), is

\int_{-\infty}^{+\infty} f(u)\, g(x - u)\, du        (1.18)
For discrete finite functions with common sampling intervals, the convolution is
h_k = \sum_{i=0}^{m} f_i\, g_{k-i}, \qquad 0 < k < m + n        (1.19)


A FORTRAN computer program for convolution would look something like:

      L=M+N-1
C     zero the output H (G holds M points, F holds N points,
C     and the result H holds L = M+N-1 points), then accumulate
      DO 10 I=1,L
   10 H(I)=0
      DO 20 I=1,M
      DO 20 J=1,N
   20 H(I+J-1)=H(I+J-1)+G(I)*F(J)
Convolution may also be written using matrix notation as

\begin{bmatrix} f_1 & 0 & \cdots & 0 \\ f_2 & f_1 & & \vdots \\ \vdots & f_2 & \ddots & 0 \\ f_m & \vdots & & f_1 \\ 0 & f_m & & f_2 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & f_m \end{bmatrix} \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix} = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_{n+m-1} \end{bmatrix}        (1.20)

In the matrix form, we recognize our familiar equation Gm = d (ignoring the confusing
notation differences between fields, when, for example, g_1 above would be m_1), and we can
define deconvolution as the inverse problem of finding m = G^-1 d. Alternatively, we can also
reformulate the problem as G^T Gm = G^T d and find the solution as m = [G^T G]^-1 [G^T d].
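A small worked check of the matrix form (the numbers are chosen here for illustration and are not from the notes): let f = [1, 2]^T (m = 2) and g = [3, 4, 5]^T (n = 3). Then Equation (1.20) reads

\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 3 \\ 10 \\ 13 \\ 10 \end{bmatrix}

which matches the result h = [3, 10, 13, 10]^T of the FORTRAN loop above.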


1.6 Final Comments


The purpose of the previous examples has been to help you formulate forward problems
in matrix notation. It helps you to clearly differentiate model parameters from other information
needed to calculate predicted data. It also helps you separate data from everything else.
Getting the forward problem set up in matrix notation is essential before you can invert the
system.

The logical next step is to take the forward problem given by

d = Gm (1.13)

and invert it for an estimate of the model parameters m^est as

m^est = G^inverse d        (1.21)

We will spend a lot of effort determining just what G^inverse means when the inverse
does not exist in the mathematical sense of

GG^inverse = G^inverse G = I        (1.22)

where I is the identity matrix.


The next order of business, however, is to shift our attention to a review of the basics of
matrices and linear algebra as well as probability and statistics in order to take full advantage of
the power of inverse theory.




CHAPTER 2: REVIEW OF LINEAR ALGEBRA AND STATISTICS


2.1 Introduction


In discrete inverse methods, matrices and linear transformations play fundamental roles.
So do probability and statistics. This review chapter, then, is divided into two parts. In the first,
we will begin by reviewing the basics of matrix manipulations. Then we will introduce some
special types of matrices (Hermitian, orthogonal and semiorthogonal). Finally, we will look at
matrices as linear transformations that can operate on vectors of one dimension and return a
vector of another dimension. In the second section, we will review some elementary probability
and statistics, with emphasis on Gaussian statistics. The material in the first section will be
particularly useful in later chapters when we cover eigenvalue problems, and methods based on
the length of vectors. The material in the second section will be very useful when we consider
the nature of noise in the data and when we consider the maximum likelihood inverse.


2.2 Matrices and Linear Transformations


Recall from the first chapter that, by convention, vectors will be denoted by lower case
letters in boldface (i.e., the data vector d), while matrices will be denoted by upper case letters in
boldface (i.e., the matrix G) in these notes.


2.2.1 Review of Matrix Manipulations

Matrix Multiplication

If A is an N × M matrix (as in N rows by M columns), and B is an M × L matrix, we write
the N × L product C of A and B as

C = AB (2.1)

We note that matrix multiplication is associative, that is

(AB)C = A(BC) (2.2)

but in general is not commutative. That is, in general

AB ≠ BA        (2.3)
In fact, even if AB exists, the product BA exists only if the number of columns of B equals the number of rows of A (for example, when A and B are both square matrices of the same size).

In Equation (2.1) above, the ijth entry in C is the product of the ith row of A and the jth
column of B. Computationally, it is given by

c_{ij} = \sum_{k=1}^{M} a_{ik} b_{kj}        (2.4)

One way to form C using standard FORTRAN code would be

DO 300 I = 1, N
DO 300 J = 1, L
C(I,J) = 0.0
DO 300 K = 1, M
300 C(I,J) = C(I,J) + A(I,K)*B(K,J) (2.5)

A special case of the general rule above is the multiplication of a matrix G (N × M) and a
vector m (M × 1):

d = G m        (1.13)
(N × 1) (N × M) (M × 1)

In terms of computation, the vector d is given by

d_i = \sum_{j=1}^{M} G_{ij} m_j        (2.6)


The Inverse of a Matrix

The mathematical inverse of the M × M matrix A, denoted A^-1, is defined such that:

AA^-1 = A^-1 A = I_M        (2.7)

where I_M is the M × M identity matrix given by:

\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix}        (2.8)
(M × M)
A^-1 is the matrix which, when either pre- or postmultiplied by A, returns the identity matrix.
Clearly, since only square matrices can both pre- and postmultiply each other, the mathematical
inverse of a matrix only exists for square matrices.

A useful theorem follows concerning the inverse of a product of matrices:

Theorem: If A = B C D        (2.9)

where A, B, C, and D are each N × N, then A^-1, if it exists, is given by

A^-1 = D^-1 C^-1 B^-1        (2.10)


Proof:   A(A^-1) = BCD(D^-1 C^-1 B^-1)

= BC (DD^-1) C^-1 B^-1

= BC I C^-1 B^-1

= B (CC^-1) B^-1

= BB^-1

= I        (2.11)


Similarly, (A^-1)A = D^-1 C^-1 B^-1 BCD = · · · = I        (Q.E.D.)
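As a concrete reminder (a standard result, added here for reference), the inverse of a general 2 × 2 matrix can be written down directly:

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \qquad A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}, \qquad ad - bc \ne 0

For larger matrices there is no such simple formula, and when the determinant ad − bc is zero the mathematical inverse does not exist, a situation we will meet repeatedly in inverse problems.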


The Transpose and Trace of a Matrix

The transpose of a matrix A is written as A^T and is given by

(A^T)_{ij} = A_{ji}        (2.12)

That is, you interchange rows and columns.

The transpose of a product of matrices is the product of the transposes, in reverse order.
That is,

(AB)^T = B^T A^T        (2.13)

Just about everything we do with real matrices A has an analog for complex matrices. In
the complex case, wherever the transpose of a matrix occurs, it is replaced by the complex
conjugate transpose of the matrix, denoted A^H. That is,

if A_{ij} = a_{ij} + b_{ij} i        (2.14)

then

(A^H)_{ij} = c_{ij} + d_{ij} i        (2.15)

where c_{ij} = a_{ji}        (2.16)

and d_{ij} = -b_{ji}        (2.17)

that is,

(A^H)_{ij} = a_{ji} - b_{ji} i        (2.18)

Finally, the trace of A is given by


trace (A) = \sum_{i=1}^{M} a_{ii}        (2.19)


Hermitian Matrices

A matrix A is said to be Hermitian if it is equal to its complex conjugate transpose. That
is, if

A = A^H        (2.20)

If A is a real matrix, this is equivalent to

A = A^T        (2.21)

This implies that A must be square. The reason that Hermitian matrices will be important is that
they have only real eigenvalues. We will take advantage of this many times when we consider
eigenvalue and shifted eigenvalue problems later.
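A small example (added for illustration; it is not from the notes):

A = \begin{bmatrix} 2 & 1 + i \\ 1 - i & 3 \end{bmatrix}

is Hermitian, since interchanging rows and columns and conjugating every entry returns the same matrix. Its eigenvalues, the roots of \lambda^2 - 5\lambda + 4 = 0, are \lambda = 1 and \lambda = 4, both real, as promised.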


2.2.2 Matrix Transformations

Linear Transformations

A matrix equation can be thought of as a linear transformation. Consider, for example,
the original matrix equation:
d = Gm (1.13)
where d is an N × 1 vector, m is an M × 1 vector, and G is an N × M matrix. The matrix G can
be thought of as an operator that operates on an M-dimensional vector m and returns an
N-dimensional vector d.

Equation (1.13) represents an explicit, linear relationship between the data and model
parameters. The operator G, in this case, is said to be linear because if m is doubled, for example,
so is d. Mathematically, one says that G is a linear operator if the following is true:
If d = Gm

and f = Gr

then [d + f] = G[m + r] (2.22)

Another way to look at matrix multiplication is the following: in the by-now-familiar Equation (1.13),

d = Gm (1.13)

the column vector d can be thought of as a weighted sum of the columns of G, with the
weighting factors being the elements in m. That is,

d = m_1 g_1 + m_2 g_2 + · · · + m_M g_M        (2.23)

where

m = [m_1, m_2, . . . , m_M]^T        (2.24)

and

g_i = [g_{1i}, g_{2i}, . . . , g_{Ni}]^T        (2.25)

is the ith column of G. Also, if GA = B, then the above can be used to infer that the first column
of B is a weighted sum of the columns of G with the elements of the first column of A as
weighting factors, etc. for the other columns of B. Each column of B is a weighted sum of the
columns of G.
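To make this concrete, consider again the 2 × 3 matrix G of Equation (1.17). There,

d = m_1 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + m_2 \begin{bmatrix} 0 \\ 2 \end{bmatrix} + m_3 \begin{bmatrix} -4 \\ 3 \end{bmatrix}

so the predicted data vector is always some weighted combination of the three columns of G, whatever the values of m_1, m_2, and m_3.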

Next, consider

d^T = [Gm]^T        (2.26)

or

d^T = m^T G^T        (2.27)
(1 × N) (1 × M) (M × N)

The row vector d^T is the weighted sum of the rows of G^T, with the weighting factors again being
the elements in m. That is,

d^T = m_1 g_1^T + m_2 g_2^T + · · · + m_M g_M^T        (2.28)

Extending this to

A^T G^T = B^T        (2.29)

we have that each row of B^T is a weighted sum of the rows of G^T, with the weighting factors
being the elements of the appropriate row of A^T.

In a long string of matrix multiplications such as

ABC = D (2.30)

each column of D is a weighted sum of the columns of A, and each row of D is a weighted sum
of the rows of C.


Orthogonal Transformations

An orthogonal transformation is one that leaves the length of a vector unchanged. We
can only talk about the length of a vector being unchanged if the dimension of the vector is
unchanged. Thus, only square matrices may represent an orthogonal transformation.

Suppose L is an orthogonal transformation. Then, if

Lx = y (2.31)

where L is N × N, and x, y are both N-dimensional vectors. Then

x^T x = y^T y        (2.32)

where Equation (2.32) represents the dot product of the vectors with themselves, which is equal
to the length squared of the vector. If you have ever done coordinate transformations in the past,
you have dealt with an orthogonal transformation. Orthogonal transformations rotate vectors but
do not change their lengths.


Properties of orthogonal transformations. There are several properties of orthogonal
transformations that we will wish to use.

First, if L is an N × N orthogonal transformation, then

L^T L = I_N        (2.33)
This follows from

y^T y = [Lx]^T [Lx]

= x^T L^T Lx        (2.34)

but y^T y = x^T x by Equation (2.32). Thus,

L^T L = I_N        (Q.E.D.)        (2.35)

Second, the relationship between L and its inverse is given by
L^-1 = L^T        (2.36)

and

L = [L^T]^-1        (2.37)

These two follow directly from Equation (2.35) above.

Third, the determinant of a matrix is unchanged if it is operated upon by orthogonal
transformations. Recall that the determinant of a 3 3 matrix A, for example, where A is given
by

=
33 32 31
23 22 21
13 12 11
a a a
a a a
a a a
A (2.38)

is given by

det (A) = a_{11}(a_{22} a_{33} - a_{23} a_{32})
- a_{12}(a_{21} a_{33} - a_{23} a_{31})
+ a_{13}(a_{21} a_{32} - a_{22} a_{31})        (2.39)

Thus, if A is an M × M matrix, and L is an orthogonal transformation, and if

A′ = (L)A(L)^T        (2.40)

it follows that

det (A′) = det (A)        (2.41)

Fourth, the trace of a matrix is unchanged if it is operated upon by an orthogonal
transformation, where trace (A) is defined as

trace (A) = \sum_{i=1}^{M} a_{ii}        (2.42)

That is, the sum of the diagonal elements of a matrix is unchanged by an orthogonal
transformation. Thus,

trace (A′) = trace (A)        (2.43)
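A familiar example satisfying all of the properties above (included here as an illustration) is the 2 × 2 rotation matrix

L = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

for which L^T L = I_2 because \cos^2\theta + \sin^2\theta = 1, det(L) = 1, and L^{-1} = L^T is simply a rotation by -\theta. Operating on any 2-dimensional vector, it changes the vector's direction but not its length.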


Semiorthogonal Transformations

Suppose that the linear operator L is not square, but N × M (N ≠ M). Then L is said to
be semiorthogonal if and only if

L^T L = I_M, but LL^T ≠ I_N,   N > M        (2.44)

or

LL^T = I_N, but L^T L ≠ I_M,   M > N        (2.45)

where I_N and I_M are the N × N and M × M identity matrices, respectively.

A matrix cannot be both orthogonal and semiorthogonal. Orthogonal matrices must be
square, and semiorthogonal matrices cannot be square. Furthermore, if L is a square N × N
matrix, and

L^T L = I_N        (2.35)

then it is not possible to have

LL^T ≠ I_N        (2.46)

2.2.3 Matrices and Vector Spaces

The columns or rows of a matrix can be thought of as vectors. For example, if A is an N ×
M matrix, each column can be thought of as a vector in N-space because it has N entries.
Conversely, each row of A can be thought of as being a vector in M-space because it has M
entries.

We note that for the linear system of equations given by

Gm = d (1.13)

where G is N × M, m is M × 1, and d is N × 1, that the model parameter vector m lies in M-space
(along with all the rows of G), while the data vector lies in N-space (along with all the columns
of G). In general, we will think of the M × 1 vectors as lying in model space, while the N × 1
vectors lie in data space.
Spanning a Space

The notion of spanning a space is important for any discussion of the uniqueness of
solutions or of the ability to fit the data. We first need to introduce definitions of linear
independence and vector orthogonality.

A set of M vectors v_i, i = 1, . . . , M, in M-space (the set of all M-dimensional vectors),
is said to be linearly independent if and only if

a_1 v_1 + a_2 v_2 + · · · + a_M v_M = 0        (2.47)

where the a_i are constants, has only the trivial solution a_i = 0, i = 1, . . . , M.

This is equivalent to saying that an arbitrary vector s in M-space can be written as a linear
combination of the v_i, i = 1, . . . , M. That is, one can find a_i such that for an arbitrary vector s

s = a_1 v_1 + a_2 v_2 + · · · + a_M v_M        (2.48)

Two vectors r and s in M-space are said to be orthogonal to each other if their dot, or inner,
product with each other is zero. That is, if

r \cdot s = |r|\,|s| \cos\theta = 0        (2.49)

where θ is the angle between the vectors, and |r|, |s| are the lengths of r and s, respectively.

The dot product of two vectors is also given by

r^T s = s^T r = \sum_{i=1}^{M} r_i s_i        (2.50)

M-space is spanned by any set of M linearly independent M-dimensional vectors.


Rank of a Matrix

The number of linearly independent rows in a matrix, which is also equal to the number of
linearly independent columns, is called the rank of the matrix. The rank of matrices is defined for
both square and nonsquare matrices. The rank of a matrix cannot exceed the minimum of the number
of rows or columns in the matrix (i.e., the rank is less than or equal to the minimum of N, M).

If an M M matrix is an orthogonal matrix, then it has rank M. The M rows are all linearly
independent, as are the M columns. In fact, not only are the rows independent for an orthogonal
matrix, they are orthogonal to each other. The same is true for the columns. If a matrix is
semiorthogonal, then the M columns (or N rows, if N < M) are orthogonal to each other.

We will make extensive use of matrices and linear algebra in this course, especially when we
work with the generalized inverse. Next, we need to turn our attention to probability and statistics.
2.3 Probability and Statistics

2.3.1 Introduction

We need some background in probability and statistics before proceeding very far. In
this review section, I will cover the material from Menke's book, using some material from other
math texts to help clarify things.

Basically, what we need is a way of describing the noise in data and estimated model
parameters. We will need the following terms: random variable, probability distribution, mean
or expected value, maximum likelihood, variance, standard deviation, standardized normal
variables, covariance, correlation coefficients, Gaussian distributions, and confidence intervals.


2.3.2 Definitions, Part 1

Random Variable: A function that assigns a value to the outcome of an experiment. A random
variable has well-defined properties based on some distribution. It is called random because you
cannot know beforehand the exact value for the outcome of the experiment. One cannot measure
directly the true properties of a random variable. One can only make measurements, also called
realizations, of a random variable, and estimate its properties. The birth weight of baby goslings
is a random variable, for example.


Probability Density Function: The true properties of a random variable b are specified by the
probability density function P(b). The probability that a particular realization of b will fall
between b and b + db is given by P(b)db. (Note that Menke uses d where I use b. His notation is
bad when one needs to use integrals.) P(b) satisfies


1 = \int_{-\infty}^{+\infty} P(b)\, db        (2.51)
which says that the probability of b taking on some value is 1. P(b) completely describes the
random variable b. It is often useful to try and find a way to summarize the properties of P(b)
with a few numbers, however.


Mean or Expected Value: The mean value E(b) (also denoted <b>) is much like the mean of a
set of numbers; that is, it is the balancing point of the distribution P(b) and is given by


E(b) = \int_{-\infty}^{+\infty} b\, P(b)\, db        (2.52)


Maximum Likelihood: This is the point in the probability distribution P(b) that has the highest
likelihood or probability. It may or may not be close to the mean E(b) = <b>. An important
point is that for Gaussian distributions, the maximum likelihood point and the mean E(b) = <b>
are the same! The graph below (after Figure 2.3, p. 23, Menke) illustrates a case where the two
are different.
[Figure: an asymmetric probability distribution P(b), with the maximum likelihood point b_ML and the mean <b> marked at different values of b.]


The maximum likelihood point b_ML of the probability distribution P(b) for data b gives the most
probable value of the data. In general, this value can be different from the mean datum <b>,
which is at the balancing point of the distribution.


Variance: Variance is one measure of the spread, or width, of P(b) about the mean E(b). It is
given by

\sigma^2 = \int_{-\infty}^{+\infty} (b - <b>)^2 P(b)\, db        (2.53)

Computationally, for L experiments in which the kth experiment gives b_k, the variance is given
by

\sigma^2 = \frac{1}{L - 1} \sum_{k=1}^{L} (b_k - <b>)^2        (2.54)


Standard Deviation: Standard deviation is the positive square root of the variance, given by

\sigma = +\sqrt{\sigma^2}        (2.55)
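For example (numbers invented for illustration), five repeated measurements 9.8, 10.1, 10.0, 9.9, and 10.2 have mean <b> = 10.0, so Equation (2.54) gives

\sigma^2 = \frac{1}{4}\left[ (-0.2)^2 + (0.1)^2 + (0.0)^2 + (-0.1)^2 + (0.2)^2 \right] = 0.025

and \sigma = +\sqrt{0.025} \approx 0.16 in the same units as the data.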


Covariance: Covariance is a measure of the correlation between errors. If the errors in two
observations are uncorrelated, then the covariance is zero. We need another definition before
proceeding.
Joint Density Function P(b): The probability that b_1 is between b_1 and b_1 + db_1, that b_2 is
between b_2 and b_2 + db_2, etc. If the data are independent, then

P(b) = P(b_1) P(b_2) . . . P(b_n)        (2.56)

If the data are correlated, then P(b) will have some more complicated form. Then, the
covariance between b_1 and b_2 is defined as

cov(b_1, b_2) = ∫_{-∞}^{+∞} · · · ∫_{-∞}^{+∞} (b_1 − <b_1>)(b_2 − <b_2>) P(b) db_1 db_2 · · · db_N    (2.57)

In the event that the data are independent, this reduces to

cov(b_1, b_2) = ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} (b_1 − <b_1>)(b_2 − <b_2>) P(b_1) P(b_2) db_1 db_2 = 0    (2.58)

The reason is that for any value of (b_1 − <b_1>), the term (b_2 − <b_2>) is as likely to be positive as
negative, i.e., the sum will average to zero. The matrix [cov b] contains all of the covariances
defined using Equation (2.57) in an N × N matrix. Note also that the covariance of b_i with itself
is just the variance of b_i.

In practical terms, if one has an N-dimensional data vector b that has been measured L
times, then the ijth term in [cov b], denoted [cov b]_ij, is defined as

[cov b]_ij = [1/(L − 1)] Σ_{k=1}^{L} (b_i^k − <b_i>)(b_j^k − <b_j>)    (2.59)

where b_i^k is the value of the ith datum in b on the kth measurement of the data vector, <b_i> is
the mean or average value of b_i over all L measurements, and the L − 1 term results from
sampling theory.


Correlation Coefficients: This is a normalized measure of the degree of correlation of errors. It
takes on values between −1 and +1, with a value of 0 implying no correlation.

The correlation coefficient matrix [cor b] is defined as

[cor b]_ij = [cov b]_ij / (σ_i σ_j)    (2.60)

where [cov b]_ij is the covariance matrix defined term by term as above for cov[b_1, b_2], and σ_i, σ_j
are the standard deviations for the ith and jth observations, respectively. The diagonal terms of
[cor b] are equal to 1, since each observation is perfectly correlated with itself.
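As a concrete illustration of Equations (2.59) and (2.60), the short Python sketch below estimates
[cov b] and [cor b] from L repeated measurements of an N-dimensional data vector. This is only a
sketch: the array b_samples and its contents are invented for the example, and numpy's built-in
cov/corrcoef functions (which use the same L − 1 normalization) could be used instead.

    import numpy as np

    # Hypothetical data: L = 200 measurements of an N = 3 component data vector,
    # stored row-wise so that b_samples[k, i] is b_i on the kth measurement.
    rng = np.random.default_rng(0)
    b_samples = rng.normal(loc=[1.0, 2.0, 3.0], scale=[0.5, 1.0, 2.0], size=(200, 3))

    L, N = b_samples.shape
    b_mean = b_samples.mean(axis=0)                 # <b_i> for each i

    # Equation (2.59): [cov b]_ij = (1/(L-1)) sum_k (b_i^k - <b_i>)(b_j^k - <b_j>)
    dev = b_samples - b_mean
    cov_b = dev.T @ dev / (L - 1)

    # Equation (2.60): [cor b]_ij = [cov b]_ij / (sigma_i * sigma_j)
    sigma = np.sqrt(np.diag(cov_b))
    cor_b = cov_b / np.outer(sigma, sigma)

    print(np.allclose(cov_b, np.cov(b_samples, rowvar=False)))       # True
    print(np.allclose(cor_b, np.corrcoef(b_samples, rowvar=False)))  # True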

The figure below (after Figure 2.8, page 26, Menke) shows three different cases of degree
of correlation for two observations b_1 and b_2.

[Figure: three contour plots of P(b_1, b_2) in the (b_1, b_2) plane, panels (a), (b), and (c).]

Contour plots of P(b_1, b_2) when the data are (a) uncorrelated, (b) positively correlated, and (c)
negatively correlated. The dashed lines indicate the four quadrants of alternating sign used to
determine correlation.


2.3.3 Some Comments on Applications to Inverse Theory

Some comments are now in order about the nature of the estimated model parameters.
We will always assume that the noise in the observations can be described as random variables.
Whatever inverse we create will map errors in the data into errors in the estimated model
parameters. Thus, the estimated model parameters are themselves random variables. This is true
even though the true model parameters may not be random variables. If the distribution of noise
for the data is known, then in principle the distribution for the estimated model parameters can be
found by mapping through the inverse operator.

This is often very difficult, but one particular case turns out to have a rather simple form.
We will see where this form comes from when we get to the subject of generalized inverses. For
now, consider the following as magic.

If the transformation between data b and model parameters m is of the form
m = Mb + v (2.61)

where M is any arbitrary matrix and v is any arbitrary vector, then

<m> = M<b> + v (2.62)

and

[cov m] = M [cov b] M^T    (2.63)
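The mapping in Equations (2.62) and (2.63) is easy to verify numerically. The sketch below
propagates a data covariance matrix through an arbitrary linear transformation; the particular M,
v, and [cov b] used here are made up purely for illustration, and numpy is assumed to be available.

    import numpy as np

    rng = np.random.default_rng(1)

    M = np.array([[1.0, 2.0, 0.0],
                  [0.5, -1.0, 3.0]])        # arbitrary 2 x 3 transformation
    v = np.array([0.1, -0.2])               # arbitrary shift
    cov_b = np.diag([0.5, 1.0, 2.0])        # assumed data covariance (uncorrelated errors)
    b_mean = np.array([1.0, 2.0, 3.0])

    # Equation (2.63): [cov m] = M [cov b] M^T
    cov_m_theory = M @ cov_b @ M.T

    # Monte Carlo check: draw many realizations of b and map each to m = Mb + v
    b = rng.multivariate_normal(b_mean, cov_b, size=100000)
    m = b @ M.T + v
    cov_m_sample = np.cov(m, rowvar=False)

    print(cov_m_theory)
    print(cov_m_sample)   # close to cov_m_theory for a large number of realizations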

2.3.4 Definitions, Part 2

Gaussian Distribution: This is a particular probability distribution given by

P(b) = [1/(√(2π) σ)] exp[ −(b − <b>)^2 / (2σ^2) ]    (2.64)

The figure below (after Figure 2.10, page 29, Menke) shows the familiar bell-shaped
curve. It has the following properties:

Mean = E(b) = <b>    and    Variance = σ^2

[Figure: two zero-mean Gaussian curves of P(b) versus b, a narrow curve A and a broad curve B.]

Gaussian distribution with zero mean and σ = 1 for curve A, and σ = 2 for curve B.

Many distributions can be approximated fairly accurately (especially away from the tails)
by the Gaussian distribution. It is also very important because it is the limiting distribution for
the sum of random variables. This is often just what one assumes for noise in the data.

One also needs a way to represent the joint probability introduced earlier for a set of
random variables each of which has a Gaussian distribution. The joint probability density
function for a vector b of observations that all have Gaussian distributions is chosen to be [see
Equation (2.10) of Menke, page 30]


P(b) = [1 / ((2π)^{N/2} (det[cov b])^{1/2})] exp{ −(1/2) [b − <b>]^T [cov b]^{-1} [b − <b>] }    (2.65)
which reduces to the previous case in Equation (2.64) for N = 1 and var(b_1) = σ^2. In statistics
books, Equation (2.65) is often given as

P(b) = (2π)^{-N/2} |Σ_b|^{-1/2} exp{ −(1/2) [b − μ_b]^T Σ_b^{-1} [b − μ_b] }

With this background, it makes sense (statistically, at least) to replace the original
relationship:

b = Gm (1.13)

with

<b> = Gm (2.66)

The reason is that one cannot expect that there is an m that should exactly predict any particular
realization of b when b is in fact a random variable.

Then the joint probability is given by


P(b) = [1 / ((2π)^{N/2} (det[cov b])^{1/2})] exp{ −(1/2) [b − Gm]^T [cov b]^{-1} [b − Gm] }    (2.67)

What one then does is seek an m that maximizes the probability that the predicted data
are in fact close to the observed data. This is the basis of the maximum likelihood or probabilistic
approach to inverse theory.


Standardized Normal Variables: It is possible to standardize random variables by subtracting
their mean and dividing by the standard deviation.

If the random variable had a Gaussian (i.e., normal) distribution, then so does the
standardized random variable. Now, however, the standardized normal variables have zero mean
and standard deviation equal to one. Random variables can be standardized by the following
transformation:

s = (m − <m>) / σ_m    (2.68)

where you will often see z replacing s in statistics books.

We will see, when all is said and done, that most inverses represent a transformation to
standardized variables, followed by a simple inverse analysis, and then a transformation back
for the final solution.


Chi-Squared (Goodness of Fit) Test: A statistical test to see whether a particular observed
distribution is likely to have been drawn from a population having some known form.
The application we will make of the chi-squared test is to test whether the noise in a
particular problem is likely to have a Gaussian distribution. This is not the kind of question one
can answer with certainty, so one must talk in terms of probability or likelihood. For example, in
the chi-squared test, one typically says things like there is only a 5% chance that this sample
distribution does not follow a Gaussian distribution.

As applied to testing whether a given distribution is likely to have come from a Gaussian
population, the procedure is as follows: One sets up an arbitrary number of bins and compares
the number of observations that fall into each bin with the number expected from a Gaussian
distribution having the same mean and variance as the observed data. One quantifies the
departure between the two distributions, called the chi-squared value and denoted χ^2, as

χ^2 = Σ_{i=1}^{k} [ (# obs in bin i − # expected in bin i)^2 / (# expected in bin i) ]    (2.69)

where the sum is over the number of bins, k. Next, the number of degrees of freedom for the
problem must be considered. For this problem, the number of degrees is equal to the number of
bins minus three. The reason you subtract three is as follows: You subtract 1 because if an
observation does not fall into any subset of k − 1 bins, you know it falls in the one bin left over.
You are not free to put it anywhere else. The other two come from the fact that you have
assumed that the mean and standard deviation of the observed data set are the mean and standard
deviations for the theoretical Gaussian distribution.

With this information in hand, one uses standard chi-squared test tables from statistics
books and determines whether such a departure would occur randomly more often than, say, 5%
of the time. Officially, the null hypothesis is that the sample was drawn from a Gaussian
distribution. If the observed value for χ^2 is greater than χ^2_c, the critical χ^2 value for the α
significance level, then the null hypothesis is rejected at the α significance level. Commonly,
α = 0.05 is used for this test, although α = 0.01 is also used. The α significance level is
equivalent to the 100(1 − α)% confidence level (i.e., α = 0.05 corresponds to the 95%
confidence level).

Consider the following example, where the underlying Gaussian distribution from which
all data samples d are drawn has a mean of 7 and a variance of 10. Seven bins are set up with
edges at −4, 2, 4, 6, 8, 10, 12, and 18, respectively. Bin widths are not prescribed for the chi-
squared test, but ideally are chosen so there are about an equal number of occurrences expected
in each bin. Also, one rule of thumb is to only include bins having at least five expected
occurrences. I have not followed the "about equal number expected in each bin" suggestion
because I want to be able to visually compare a histogram with an underlying Gaussian shape.
However, I have chosen wider bins at the edges in these test cases to capture more occurrences at
the edges of the distribution.

Suppose our experiment with 100 observations yields a sample mean of 6.76 and a
sample variance of 8.27, and 3, 13, 26, 25, 16, 14, and 3 observations, respectively, in the bins
from left to right. Using standard formulas for a Gaussian distribution with a mean of 6.76 and a
variance of 8.27, the number expected in each bin is 4.90, 11.98, 22.73, 27.10, 20.31, 9.56, and
3.41, respectively. The calculated χ^2, using Equation (2.69), is 4.48. For seven bins, the DOF
for the test is 4, and χ^2_c = 9.49 for α = 0.05. Thus, in this case, the null hypothesis would be
accepted. That is, we would accept that this sample was drawn from a Gaussian distribution with
a mean of 6.76 and a variance of 8.27 at the α = 0.05 significance level (95% confidence level).
The distribution is shown below, with a filled circle in each histogram at the number expected in
that bin.
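The arithmetic in this example is easy to reproduce. The Python sketch below recomputes the
expected counts and the χ^2 statistic for the observed histogram above; the outer bin edges of −4
and 18 are taken from the bin definition given earlier, and scipy is assumed to be available for the
Gaussian CDF and the critical value.

    import numpy as np
    from scipy.stats import norm, chi2

    edges = np.array([-4, 2, 4, 6, 8, 10, 12, 18], dtype=float)
    observed = np.array([3, 13, 26, 25, 16, 14, 3], dtype=float)   # 100 observations
    mean, var = 6.76, 8.27                                         # sample mean and variance
    sigma = np.sqrt(var)

    # Expected counts in each bin for a Gaussian with the sample mean and variance
    cdf = norm.cdf(edges, loc=mean, scale=sigma)
    expected = 100.0 * np.diff(cdf)    # approximately [4.90, 11.98, 22.73, 27.10, 20.31, 9.56, 3.41]

    # Equation (2.69)
    chi_sq = np.sum((observed - expected) ** 2 / expected)   # approximately 4.5

    # Critical value for alpha = 0.05 with k - 3 = 4 degrees of freedom
    chi_sq_crit = chi2.ppf(0.95, df=len(observed) - 3)       # approximately 9.49

    print(chi_sq, chi_sq_crit, chi_sq < chi_sq_crit)         # accept the null hypothesis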



It is important to note that this distribution does not look exactly like a Gaussian
distribution, but still passes the χ^2 test. A simple, non-chi-square analogy may help better
understand the reasoning behind the chi-square test. Consider tossing a true coin 10 times. The
most likely outcome is 5 heads and 5 tails. Would you reject a null hypothesis that the coin is a
true coin if you got 6 heads and 4 tails in your one experiment of tossing the coin ten times?
Intuitively, you probably would not reject the null hypothesis in this case, because 6 heads and 4
tails is not that unlikely for a true coin.

In order to make an informed decision, as we try to do with the chi-square test, you would
need to quantify how likely, or unlikely, a particular outcome is before accepting or rejecting the
null hypothesis that it is a true coin. For a true coin, 5 heads and 5 tails has a probability of 0.246
(that is, on average, it happens 24.6% of the time), while the probability of 6 heads and 4 tails is
0.205, 7 heads and 3 tails is 0.117, and 8 heads and 2 tails is 0.044, respectively. A distribution
of 7 heads and 3 tails does not look like 5 heads and 5 tails, but occurs more than 10% of the
time with a true coin.

Hence, by analogy, it is not too unlikely and you would probably not reject the null
hypothesis that the coin is a true coin just because you tossed 7 heads and 3 tails in one
experiment. Ten heads and no tails only occurs, on average, one time in 1024 experiments (or
about 0.098% of the time). If you got 10 heads and 0 tails, you'd probably reject the null
hypothesis that you are tossing a true coin because the outcome is very unlikely. Eight heads and
two tails occurs 4.4% of the time, on average. You might also reject the null hypothesis in this
case, but you would do so with less confidence, or at a lower significance level. In both cases,
however, your conclusion will be wrong occasionally just due to random variations. You accept
the possibility that you will be wrong rejecting the null hypothesis 4.4% of the time in this case,
even if the coin is true.

The same is true with the chi-square test. That is, at the α = 0.05 significance level (95%
confidence level), with χ^2 greater than χ^2_c, you reject the null hypothesis, even though you
recognize that you will reject the null hypothesis incorrectly about 5% of the time in the presence
of random variations. Note that this analogy is a simple one in the sense that it is entirely
possible to actually do a chi-square test on this coin toss example. Each time you toss the coin
ten times you get one outcome: x heads and (10 − x) tails. This falls into the "x heads and
(10 − x) tails" bin. If you repeat this many times you get a distribution across all bins from "0 heads
and 10 tails" to "10 heads and 0 tails." Then you would calculate the number expected in each
bin and use Equation (2.69) to calculate a chi-square value to compare with the critical value at
the α significance level.

Now let us return to another example of the chi-square test where we reject the null
hypothesis. Consider a case where the observed number in each of the seven bins defined above
is now 2, 17, 13, 24, 26, 9, and 9, respectively, and the observed distribution has a mean of 7.28
and variance of 10.28. The expected number in each bin, for the observed mean and variance, is
4.95, 10.32, 19.16, 24.40, 21.32, 12.78, and 7.02, respectively. The calculated χ^2 is now 10.77,
and the null hypothesis would be rejected at the α = 0.05 significance level (95% confidence
level). That is, we would reject that this sample was drawn from a Gaussian distribution with a
mean of 7.28 and variance of 10.28 at this significance level. The distribution is shown below,
again with a filled circle in each histogram at the number expected in that bin.




Confidence Intervals: One says, for example, with 98% confidence that the true mean of a
random variable lies between two values. This is based on knowing the probability distribution
for the random variable, of course, and can be very difficult, especially for complicated
distributions that include nonzero correlation coefficients. However, for Gaussian distributions,
these are well known and can be found in any standard statistics book. For example, Gaussian
distributions have 68% and 95% confidence intervals of approximately ±1σ and ±2σ, respectively.


T and F Tests: These two statistical tests are commonly used to determine whether the
properties of two samples are consistent with the samples coming from the same population.

The F test in particular can be used to test the improvement in the fit between predicted
and observed data when one adds a degree of freedom in the inversion. One expects to fit the
data better by adding more model parameters, so the relevant question is whether the
improvement is significant.

As applied to the test of improvement in fit between case 1 and case 2, where case 2 uses
more model parameters to describe the same data set, the F ratio is given by


F = [ (E_1 − E_2) / (DOF_1 − DOF_2) ] / [ E_2 / DOF_2 ]    (2.70)

where E is the residual sum of squares and DOF is the number of degrees of freedom for each
case.

If F is large, one accepts that the second case with more model parameters provides a
significantly better fit to the data. The calculated F is compared to published tables with
DOF_1 − DOF_2 and DOF_2 degrees of freedom at a specified confidence level. (Reference: T. M.
Hearn, Pn travel times in Southern California, J. Geophys. Res., 89, 1843–1855, 1984.)
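As a numerical illustration of Equation (2.70), the sketch below computes the F ratio for a
hypothetical pair of fits and compares it with the critical value from scipy; the residual sums of
squares and degrees of freedom used here are invented for the example.

    from scipy.stats import f

    # Hypothetical case 1: straight line (2 parameters) fit to 20 data -> DOF_1 = 18
    # Hypothetical case 2: parabola (3 parameters) fit to the same data -> DOF_2 = 17
    E1, DOF1 = 12.0, 18
    E2, DOF2 = 8.0, 17

    # Equation (2.70)
    F = ((E1 - E2) / (DOF1 - DOF2)) / (E2 / DOF2)

    # Critical F value at the alpha = 0.05 level with (DOF1 - DOF2, DOF2) degrees of freedom
    F_crit = f.ppf(0.95, dfn=DOF1 - DOF2, dfd=DOF2)

    # The added parameter is judged significant only if F exceeds F_crit
    print(F, F_crit, F > F_crit)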



The next section will deal with solving inverse problems based on length measures. This
will include the classic least squares approach.



CHAPTER 3: INVERSE METHODS BASED ON LENGTH


3.1 Introduction


This chapter is concerned with inverse methods based on the length of various vectors
that arise in a typical problem. The two most common vectors concerned are the data-error or
misfit vector and the model parameter vector. Methods based on the first vector give rise to
classic least squares solutions. Methods based on the second vector give rise to what are known
as minimum length solutions. Improvements over simple least squares and minimum length
solutions include the use of information about noise in the data and a priori information about
the model parameters, and are known as weighted least squares or weighted minimum length
solutions, respectively. This chapter will end with material on how to handle constraints and on
variances of the estimated model parameters.


3.2 Data Error and Model Parameter Vectors


The data error and model parameter vectors will play an essential role in the development
of inverse methods. They are given by

data error vector = e = d^obs − d^pre    (3.1)

and

model parameter vector = m (3.2)

The dimension of the error vector e is N × 1, while the dimension of the model parameter vector
is M × 1, respectively. In order to utilize these vectors, we next consider the notion of the size, or
length, of vectors.


3.3 Measures of Length


The norm of a vector is a measure of its size, or length. There are many possible
definitions for norms. We are most familiar with the Cartesian (L_2) norm. Some examples of
norms follow:

L_1 = Σ_{i=1}^{N} |e_i|    (3.3)

L_2 = [ Σ_{i=1}^{N} |e_i|^2 ]^{1/2}    (3.4)

L_M = [ Σ_{i=1}^{N} |e_i|^M ]^{1/M}    (3.5)

and finally,

L_∞ = max_i |e_i|    (3.6)

Important Notice!
Inverse methods based on different norms can, and often do, give different
answers!


The reason is that different norms give different weight to outliers. For example, the
L_∞ norm gives all the weight to the largest misfit. Low-order norms give more equal weight to
errors of different sizes.

The L_2 norm gives the familiar Cartesian length of a vector. Consider the total misfit E
between observed and predicted data. It has units of length squared and can be found either as
the square of the L_2 norm of e, the error vector (Equation 3.1), or by noting that it is also
equivalent to the dot (or inner) product of e with itself, given by

E = e^T e = [e_1  e_2  · · ·  e_N] | e_1 |
                                   | e_2 |  = Σ_{i=1}^{N} e_i^2    (3.7)
                                   |  .  |
                                   | e_N |

Inverse methods based on the L_2 norm are also closely tied to the notion that errors in the
data have Gaussian statistics. They give considerable weight to large errors, which would be
considered unlikely if, in fact, the errors were distributed in a Gaussian fashion.

Now that we have a way to quantify the misfit between predicted and observed data, we
are ready to define a procedure for estimating the value of the elements in m. The procedure is
to take the partial derivative of E with respect to each element in m and set the resulting
equations to zero. This will produce a system of M equations that can be manipulated in such a
way that, in general, leads to a solution for the M elements of m.

The next section will show how this is done for the least squares problem of finding a
best fit straight line to a set of data points.
3.4 Minimizing the Misfit: Least Squares


3.4.1 Least Squares Problem for a Straight Line

Consider the figure below (after Figure 3.1 from Menke, page 36):


[Figure: (a) data points d plotted against z with a best-fitting straight line; (b) detail of the ith
observation, showing d_i^obs, d_i^pre, and the misfit e_i.]

(a) Least squares fitting of a straight line to (z, d) pairs. (b) The error e_i for each
observation is the difference between the observed and predicted datum: e_i = d_i^obs − d_i^pre.

The ith predicted datum d_i^pre for the straight line problem is given by

d_i^pre = m_1 + m_2 z_i    (3.8)

where the two unknowns, m_1 and m_2, are the intercept and slope of the line, respectively, and z_i is
the value along the z axis where the ith observation is made.

For N points we have a system of N such equations that can be written in matrix form as:


| d_1 |   | 1   z_1 |
| d_2 | = | 1   z_2 |  | m_1 |
|  .  |   | .    .  |  | m_2 |
| d_N |   | 1   z_N |
                                  (3.9)

Or, in the by now familiar matrix notation, as

Geosciences 567: CHAPTER 3 (RMR/GZ)
34
d = G m    (1.13)
(N × 1)  (N × 2)  (2 × 1)

The total misfit E is given by


E = e^T e = Σ_{i=1}^{N} [d_i^obs − d_i^pre]^2    (3.10)

  = Σ_{i=1}^{N} [d_i^obs − (m_1 + m_2 z_i)]^2    (3.11)

Dropping the "obs" in the notation for the observed data, we have

E = Σ_{i=1}^{N} [d_i^2 − 2 d_i m_1 − 2 d_i m_2 z_i + 2 m_1 m_2 z_i + m_1^2 + m_2^2 z_i^2]    (3.12)

Then, taking the partials of E with respect to m_1 and m_2, respectively, and setting them to zero
yields the following equations:

∂E/∂m_1 = 2 N m_1 − 2 Σ_{i=1}^{N} d_i + 2 m_2 Σ_{i=1}^{N} z_i = 0    (3.13)

and

∂E/∂m_2 = −2 Σ_{i=1}^{N} d_i z_i + 2 m_1 Σ_{i=1}^{N} z_i + 2 m_2 Σ_{i=1}^{N} z_i^2 = 0    (3.14)

Rewriting Equations (3.13) and (3.14) above yields



N m_1 + m_2 Σ_i z_i = Σ_i d_i    (3.15)

and

m_1 Σ_i z_i + m_2 Σ_i z_i^2 = Σ_i d_i z_i    (3.16)

Combining the two equations in matrix notation in the form Am = b yields

|    N      Σ_i z_i   |  | m_1 |   |  Σ_i d_i     |
| Σ_i z_i   Σ_i z_i^2 |  | m_2 | = |  Σ_i d_i z_i |    (3.17)

or simply

Geosciences 567: CHAPTER 3 (RMR/GZ)
35
A m = b    (3.18)
(2 × 2)  (2 × 1)  (2 × 1)

Note that by the above procedure we have reduced the problem from one with N equations in two
unknowns (m_1 and m_2) in Gm = d to one with two equations in the same two unknowns in
Am = b.

The matrix equation Am = b can also be rewritten in terms of the original G and d when
one notices that the matrix A can be factored as


|    N      Σ_i z_i   |   | 1    1   · · ·  1   |  | 1   z_1 |
| Σ_i z_i   Σ_i z_i^2 | = | z_1  z_2 · · ·  z_N |  | 1   z_2 |  = G^T G    (3.19)
                                                   | .    .  |
                                                   | 1   z_N |
 (2 × 2)                   (2 × N)                  (N × 2)      (2 × 2)

Also, b above can be rewritten similarly as

|  Σ_i d_i     |   | 1    1   · · ·  1   |  | d_1 |
|  Σ_i d_i z_i | = | z_1  z_2 · · ·  z_N |  | d_2 |  = G^T d    (3.20)
                                            |  .  |
                                            | d_N |

Thus, substituting Equations (3.19) and (3.20) into Equation (3.17), one arrives at the so-called
normal equations for the least squares problem:

G^T G m = G^T d    (3.21)

The least squares solution m_LS is then found as

m_LS = [G^T G]^{-1} G^T d    (3.22)

assuming that [G^T G]^{-1} exists.

In summary, we used the forward problem (Equation 3.9) to give us an explicit
relationship between the model parameters (m_1 and m_2) and a measure of the misfit to the
observed data, E. Then, we minimized E by taking the partial derivatives of the misfit function
with respect to the unknown model parameters, setting the partials to zero, and solving for the
model parameters.
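A short numerical sketch of this procedure, assuming numpy is available and using a small
invented data set, is given below. It forms G for the straight-line problem, builds the normal
equations of Equation (3.21), and solves Equation (3.22) directly.

    import numpy as np

    # Invented (z, d) observations that scatter about the line d = 1.0 + 2.0 z
    z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    d = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

    # Forward problem d = Gm for a straight line: each row of G is [1, z_i]
    G = np.column_stack([np.ones_like(z), z])

    # Normal equations (3.21): G^T G m = G^T d, solved for m_LS as in (3.22)
    m_LS = np.linalg.solve(G.T @ G, G.T @ d)

    print(m_LS)             # approximately [intercept, slope] = [1.0, 2.0]
    print(d - G @ m_LS)     # residuals e = d_obs - d_pre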


3.4.2 Derivation of the General Least Squares Solution

We start with any system of linear equations which can be expressed in the form

d = G m    (1.13)
(N × 1)  (N × M)  (M × 1)

Again, let E = e^T e = [d − d^pre]^T [d − d^pre]

E = [d − Gm]^T [d − Gm]    (3.23)


E = Σ_{i=1}^{N} ( d_i − Σ_{j=1}^{M} G_ij m_j ) ( d_i − Σ_{k=1}^{M} G_ik m_k )    (3.24)

As before, the procedure is to write out the above equation with all its cross terms, take partials
of E with respect to each of the elements in m, and set the corresponding equations to zero. For
example, following Menke, page 40, Equations (3.6)–(3.9), we obtain an expression for the
partial of E with respect to m_q:

∂E/∂m_q = 2 Σ_{k=1}^{M} m_k Σ_{i=1}^{N} G_iq G_ik − 2 Σ_{i=1}^{N} G_iq d_i = 0    (3.25)

We can simplify this expression by recalling Equation (2.4) from the introductory remarks on
matrix manipulations in Chapter 2:

C_ij = Σ_{k=1}^{M} a_ik b_kj    (2.4)

Note that the first summation on i in Equation (3.25) looks similar in form to Equation (2.4), but
the subscripts on the first G term are backwards. If we further note that interchanging the
subscripts is equivalent to taking the transpose of G, we see that the summation on i gives the
qkth entry in G^T G:

Σ_{i=1}^{N} G_iq G_ik = Σ_{i=1}^{N} [G^T]_qi G_ik = [G^T G]_qk    (3.26)

Thus, Equation (3.25) reduces to


∂E/∂m_q = 2 Σ_{k=1}^{M} [G^T G]_qk m_k − 2 Σ_{i=1}^{N} G_iq d_i = 0    (3.27)

Now, we can further simplify the first summation by recalling Equation (2.6) from the same
section

d_i = Σ_{j=1}^{M} G_ij m_j    (2.6)
To see this clearly, we rearrange the order of terms in the first sum as follows:


Σ_{k=1}^{M} [G^T G]_qk m_k = [G^T G m]_q    (3.28)

which is the qth entry in G^T G m. Note that G^T G m has dimension (M × N)(N × M)(M × 1) =
(M × 1). That is, it is an M-dimensional vector.

In a similar fashion, the second summation on i can be reduced to a term in [G^T d]_q, the
qth entry in an (M × N)(N × 1) = (M × 1) dimensional vector. Thus, for the qth equation, we
have

0 = ∂E/∂m_q = 2 [G^T G m]_q − 2 [G^T d]_q    (3.29)

Dropping the common factor of 2 and combining the q equations into matrix notation, we arrive
at

G^T G m = G^T d    (3.30)

The least squares solution for m is thus given by

m_LS = [G^T G]^{-1} G^T d    (3.31)

The least squares operator, G_LS^{-1}, is thus given by

G_LS^{-1} = [G^T G]^{-1} G^T    (3.32)

Recalling basic calculus, we note that m_LS above is the solution that minimizes E, the total
misfit. Summarizing, setting the q partial derivatives of E with respect to the elements in m to
zero leads to the least squares solution.

We have just derived the least squares solution by taking the partial derivatives of E with
respect to m_q and then combining the terms for q = 1, 2, . . ., M. An alternative, but equivalent,
formulation begins with Equation (3.23) but is written out as

E = [d − Gm]^T [d − Gm]    (3.23)

  = [d^T − m^T G^T][d − Gm]

  = d^T d − d^T Gm − m^T G^T d + m^T G^T Gm    (3.33)

Then, taking the partial derivative of E with respect to m^T turns out to be equivalent to what was
done in Equations (3.25)–(3.30) for m_q, namely

∂E/∂m^T = −G^T d + G^T Gm = 0    (3.34)

which leads to

G^T Gm = G^T d    (3.30)

and

m_LS = [G^T G]^{-1} G^T d    (3.31)

It is also perhaps interesting to note that we could have obtained the same solution
without taking partials. To see this, consider the following four steps.


Step 1. We begin with

Gm = d    (1.13)


Step 2. We then premultiply both sides by G^T:

G^T Gm = G^T d    (3.30)


Step 3. Premultiply both sides by [G^T G]^{-1}:

[G^T G]^{-1} G^T Gm = [G^T G]^{-1} G^T d    (3.35)


Step 4. This reduces to

m_LS = [G^T G]^{-1} G^T d    (3.31)

as before. The point is, however, that this way does not show why m_LS is the solution
which minimizes E, the misfit between the observed and predicted data.

All of this assumes that [G^T G]^{-1} exists, of course. We will return to the existence and
properties of [G^T G]^{-1} later. Next, we will look at two examples of least squares problems to
show a striking similarity that is not obvious at first glance.


3.4.3 Two Examples of Least Squares Problems

Example 1. Best-Fit Straight-Line Problem

We have, of course, already derived the solution for this problem in the last section.
Briefly, then, for the system of equations
d = Gm (1.13)

given by


| d_1 |   | 1   z_1 |
| d_2 | = | 1   z_2 |  | m_1 |
|  .  |   | .    .  |  | m_2 |
| d_N |   | 1   z_N |
                                  (3.9)

we have

G^T G = | 1    1   · · ·  1   |  | 1   z_1 |   = |    N      Σ_i z_i   |    (3.36)
        | z_1  z_2 · · ·  z_N |  | 1   z_2 |     | Σ_i z_i   Σ_i z_i^2 |
                                 | .    .  |
                                 | 1   z_N |

and

G^T d = | 1    1   · · ·  1   |  | d_1 |   = |  Σ_i d_i     |    (3.37)
        | z_1  z_2 · · ·  z_N |  | d_2 |     |  Σ_i d_i z_i |
                                 |  .  |
                                 | d_N |

Thus, the least squares solution is given by


| m_1 |        |    N      Σ_i z_i   |^{-1}  |  Σ_i d_i     |
| m_2 |_LS  =  | Σ_i z_i   Σ_i z_i^2 |       |  Σ_i d_i z_i |    (3.38)


Example 2. Best-Fit Parabola Problem

The ith predicted datum for a parabola is given by

d_i = m_1 + m_2 z_i + m_3 z_i^2    (3.39)

where m_1 and m_2 have the same meanings as in the straight line problem, and m_3 is the
coefficient of the quadratic term. Again, the problem can be written in the form:

d = Gm (1.13)

where now we have

| d_1 |   | 1   z_1   z_1^2 |
| d_2 |   | 1   z_2   z_2^2 |  | m_1 |
|  .  | = | .    .      .   |  | m_2 |    (3.40)
|  .  |   | .    .      .   |  | m_3 |
| d_N |   | 1   z_N   z_N^2 |

and

          |    N        Σ_i z_i     Σ_i z_i^2 |              |  Σ_i d_i       |
G^T G  =  | Σ_i z_i     Σ_i z_i^2   Σ_i z_i^3 | ,   G^T d =  |  Σ_i d_i z_i   |    (3.41)
          | Σ_i z_i^2   Σ_i z_i^3   Σ_i z_i^4 |              |  Σ_i d_i z_i^2 |

As before, we form the least squares solution as

m_LS = [G^T G]^{-1} G^T d    (3.31)

Although the forward problems of predicting data for the straight line and parabolic cases
look very different, the least squares solution is formed in a way that emphasizes the fundamental
similarity between the two problems. For example, notice how the straight-line problem is
buried within the parabola problem. The upper left hand 2 × 2 part of G^T G in Equation (3.41) is
the same as Equation (3.36). Also, the first two entries in G^T d in Equation (3.41) are the same as
Equation (3.37).
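The nesting of the straight-line problem inside the parabola problem is easy to see numerically.
The sketch below, again assuming numpy and using invented data, forms G^T G and G^T d for
both problems; the upper-left 2 × 2 block of the parabola's G^T G and the first two entries of its
G^T d match the straight-line quantities exactly.

    import numpy as np

    z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    d = np.array([1.1, 2.0, 4.9, 10.2, 16.8, 26.1])     # invented data, roughly d = 1 + z^2

    # Straight line: G has columns [1, z]; parabola: columns [1, z, z^2]
    G_line = np.column_stack([np.ones_like(z), z])
    G_parab = np.column_stack([np.ones_like(z), z, z ** 2])

    GTG_line, GTd_line = G_line.T @ G_line, G_line.T @ d
    GTG_parab, GTd_parab = G_parab.T @ G_parab, G_parab.T @ d

    print(np.allclose(GTG_parab[:2, :2], GTG_line))     # True: Equation (3.36) sits inside (3.41)
    print(np.allclose(GTd_parab[:2], GTd_line))         # True: Equation (3.37) sits inside (3.41)

    m_LS = np.linalg.solve(GTG_parab, GTd_parab)        # Equation (3.31) for the parabola
    print(m_LS)                                         # approximately [1, 0, 1]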

Next we consider a four-parameter example.


3.4.4 Four-Parameter Tomography Problem

Finally, let's consider a four-parameter problem, but this one based on the concept of
tomography.


[Figure: a 2 × 2 grid of square blocks of side h, numbered 1 through 4, with sources S and
receivers R placed so that the travel time t_1 samples blocks 1 and 2, t_2 samples blocks 3 and 4,
t_3 samples blocks 1 and 3, and t_4 samples blocks 2 and 4.]

t_1 = h(1/v_1) + h(1/v_2) = h(s_1 + s_2)
t_2 = h(1/v_3) + h(1/v_4) = h(s_3 + s_4)
t_3 = h(1/v_1) + h(1/v_3) = h(s_1 + s_3)
t_4 = h(1/v_2) + h(1/v_4) = h(s_2 + s_4)
                                            (3.42)




| t_1 |       | 1  1  0  0 |  | s_1 |
| t_2 |       | 0  0  1  1 |  | s_2 |
| t_3 | =  h  | 1  0  1  0 |  | s_3 |    (3.43)
| t_4 |       | 0  1  0  1 |  | s_4 |

or
d = Gm (1.13)


            | 1  0  1  0 |  | 1  1  0  0 |        | 2  1  1  0 |
G^T G = h^2 | 1  0  0  1 |  | 0  0  1  1 | = h^2  | 1  2  0  1 |    (3.44)
            | 0  1  1  0 |  | 1  0  1  0 |        | 1  0  2  1 |
            | 0  1  0  1 |  | 0  1  0  1 |        | 0  1  1  2 |

          | t_1 + t_3 |
G^T d = h | t_1 + t_4 |    (3.45)
          | t_2 + t_3 |
          | t_2 + t_4 |

So, the normal equations are

G^T G m = G^T d    (3.21)

  | 2  1  1  0 |  | s_1 |   | t_1 + t_3 |
h | 1  2  0  1 |  | s_2 | = | t_1 + t_4 |    (3.46)
  | 1  0  2  1 |  | s_3 |   | t_2 + t_3 |
  | 0  1  1  2 |  | s_4 |   | t_2 + t_4 |

or


    ( | 2 |       | 1 |       | 1 |       | 0 |     )   | t_1 + t_3 |
h   ( | 1 | s_1 + | 2 | s_2 + | 0 | s_3 + | 1 | s_4 ) = | t_1 + t_4 |    (3.47)
    ( | 1 |       | 0 |       | 2 |       | 1 |     )   | t_2 + t_3 |
    ( | 0 |       | 1 |       | 1 |       | 2 |     )   | t_2 + t_4 |

Example: s_1 = s_2 = s_3 = s_4 = 1, h = 1; then t_1 = t_2 = t_3 = t_4 = 2.

By inspection, s_1 = s_2 = s_3 = s_4 = 1 is a solution, but so is s_1 = s_4 = 2, s_2 = s_3 = 0, or
s_1 = s_4 = 0, s_2 = s_3 = 2.

Solutions are nonunique! Look back at G. Are all of the columns or rows independent?
No! What does that imply about G (and G^T G)? Rank < 4. What does that imply about (G^T G)^{-1}?
It does not exist. So does m_LS exist? No.

Other ways of saying this: The vectors g_i do not span the space of m. Or, the
experimental set-up is not sufficient to uniquely determine the solution. Note that this analysis
can be done without any data, based strictly on the experimental design.

Another way to look at it: Are the columns of G independent? No. For example,
coefficients −1, +1, +1, and −1 will make the columns add to zero. What pattern does that
suggest is not resolvable? (A numerical check of this rank deficiency is sketched below.)
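The sketch below, assuming numpy, builds G for this tomography geometry (with h = 1) and
confirms the statements above: G has rank 3, so [G^T G]^{-1} does not exist, and the vector
[−1, +1, +1, −1]^T lies in its null space.

    import numpy as np

    G = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]], dtype=float)     # h = 1

    print(np.linalg.matrix_rank(G))               # 3, not 4
    print(np.linalg.matrix_rank(G.T @ G))         # also 3, so [G^T G]^{-1} does not exist

    # The unresolvable pattern: this combination of model parameters predicts zero data
    null_vector = np.array([-1.0, 1.0, 1.0, -1.0])
    print(G @ null_vector)                        # [0, 0, 0, 0]

    # Consequently both m = [1, 1, 1, 1] and m = [2, 0, 0, 2] predict t = [2, 2, 2, 2]
    print(G @ np.array([1.0, 1.0, 1.0, 1.0]), G @ np.array([2.0, 0.0, 0.0, 2.0]))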

Now that we have derived the least squares solution, and considered some examples, we
next turn our attention to something called the determinancy of the system of equations given by
Equation (1.13):

d = Gm (1.13)

This will begin to permit us to classify systems of equations based on the nature of G.


3.5 Determinancy of Least Squares Problems
(See Pages 46–52, Menke)

3.5.1 Introduction

We have seen that the least squares solution to d = Gm is given by

m_LS = [G^T G]^{-1} G^T d    (3.31)

There is no guarantee, as we saw in Section 3.4.4, that the solution even exists. It fails to exist
when the matrix G^T G has no mathematical inverse. We note that G^T G is square (M × M), and it
is at least mathematically possible to consider inverting G^T G. (N.B. The dimension of G^T G is
M × M, independent of the number of observations made). Mathematically, we can say that G^T G
has an inverse, and it is unique, when G^T G has rank M. The rank of a matrix was considered in
Section 2.2.3. Essentially, if G^T G has rank M, then it has enough information in it to resolve
M things (in this case, model parameters). This happens when all M rows (or equivalently, since
G^T G is square, all M columns) are independent. Recall also that independent means you cannot
write any row (or column) as a linear combination of the other rows (columns).

G^T G will have rank < M if the number of observations N is less than M. Menke gives the
example (pp. 45–46) of the straight-line fit to a single data point as an illustration. If [G^T G]^{-1}
does not exist, an infinite number of estimates will all fit the data equally well. Mathematically,
G^T G has rank < M if |G^T G| = 0, where |G^T G| is the determinant of G^T G.

Now, let us introduce Menke's nomenclature based on the nature of G^T G and on the
prediction error. In all cases, the number of model parameters is M and the number of
observations is N.


3.5.2 Even-Determined Problems: M = N

If a solution exists, it is unique. The prediction error [d^obs − d^pre] is identically zero. For
example,

| 1 |   | 1  0 |  | m_1 |
| 2 | = | 5  1 |  | m_2 |    (3.48)

for which the solution is m = [1, −3]^T.


3.5.3 Overdetermined Problems: Typically, N > M

With more observations than unknowns, typically one cannot fit all the data exactly. The
least squares problem falls in this category. Consider the following example:

|  1 |   | 1  0 |  | m_1 |
|  2 | = | 5  1 |  | m_2 |    (3.49)
| −1 |   | 3  1 |

This overdetermined case consists of adding one equation to Equation (3.48) in the previous
example. The least squares solution is [1.333, −4.833]^T. The data can no longer be fit exactly.


3.5.4 Underdetermined Problems: Typically, M > N

With more unknowns than observations, m has no unique solution. A special case of the
underdetermined problem occurs when you can fit the data exactly, which is called the purely
underdetermined case. The prediction error for the purely underdetermined case is exactly zero
(i.e., the data can be fit exactly). An example of such a problem is

[1] = [2  1] | m_1 |    (3.50)
             | m_2 |

Possible solutions include [0, 1]^T, [0.5, 0]^T, [5, −9]^T, [1/3, 1/3]^T and [0.4, 0.2]^T. The solution
with the minimum length, in the L_2 norm sense, is [0.4, 0.2]^T.

The following example, however, is also underdetermined, but no choice of m_1, m_2, m_3
will produce zero prediction error. Thus, it is not purely underdetermined.

| 1 |   | 1  2  1 |  | m_1 |
| 1 | = | 2  4  2 |  | m_2 |    (3.51)
                     | m_3 |

(You might want to verify the above examples. Can you think of others?)

Although I have stated that overdetermined (underdetermined) problems typically have
N > M (N < M), it is important to realize that this is not always the case. Consider the following:

| d_1 |   | 1  0  0 |  | m_1 |
| d_2 | = | 1  0  0 |  | m_2 |    (3.52)
| d_3 |   | 0  1  1 |  | m_3 |
| d_4 |   | 0  2  2 |

For this problem, m_1 is overdetermined (that is, no choice of m_1 can exactly fit both d_1 and d_2
unless d_1 happens to equal d_2), while at the same time m_2 and m_3 are underdetermined. This is
the case even though there are two equations (i.e., the last two) in only two unknowns (m_2, m_3).
The two equations, however, are not independent, since two times the next to last row in G
equals the last row. Thus this problem is both overdetermined and underdetermined at the same
time.

For this reason, I am not very satisfied with Menke's nomenclature. As we will see later,
when we deal with vector spaces, the key will be the singular values (much like eigenvalues) and
associated eigenvectors for the matrix G.


3.6 Minimum Length Solution


The minimum length solution arises from the purely underdetermined case (N < M, and
can fit the data exactly). In this section, we will develop the minimum length operator, using
Lagrange multipliers and borrowing on the basic ideas of minimizing the length of a vector
introduced in Section 3.4 on least squares.


3.6.1 Background Information

We begin with two pieces of information:

1. First, [G^T G]^{-1} does not exist. Therefore, we cannot calculate the least squares solution
m_LS = [G^T G]^{-1} G^T d.

2. Second, the prediction error e = d^obs − d^pre is exactly equal to zero.

To solve underdetermined problems, we must add information that is not already in G.
This is called a priori information. Examples might include the constraint that density be greater
than zero for rocks, or that v_n, the seismic P-wave velocity at the Moho, falls within the range
5 < v_n < 10 km/s, etc.

Another a priori assumption is called solution simplicity. One seeks solutions that are
as simple as possible. By analogy to seeking a solution with the simplest misfit to the data
(i.e., the smallest) in the least squares problem, one can seek a solution which minimizes the total
length of the model parameter vector, m. At first glance, there may not seem to be any reason to
do this. It does make sense for some cases, however. Suppose, for example, that the unknown
model parameters are the velocities of points in a fluid. A solution that minimized the length of
m would also minimize the kinetic energy of the system. Thus, it would be appropriate in this
case to minimize m. It also turns out to be a nice property when one is doing nonlinear
problems, and the m that one is using is actually a vector of changes to the solution at the
previous step. Then it is nice to have small step sizes. The requirement of solution simplicity
will lead us, as shown later, to the so-called minimum length solution.


3.6.2 Lagrange Multipliers (See Page 50 and Appendix A.1, Menke)

Lagrange multipliers come to mind whenever one wishes to solve a problem subject to
some constraints. In the purely underdetermined case, these constraints are that the data misfit
be zero. Before considering the full purely underdetermined case, consider the following
discussion of Lagrange Multipliers, mostly after Menke.


Lagrange Multipliers With 2 Unknowns and 1 Constraint

Consider E(x, y), a function of two variables. Suppose that we want to minimize E(x, y)
subject to some constraint of the form φ(x, y) = 0.

The steps, using Lagrange multipliers, are as follows (next page):

Step 1. At the minimum in E, small changes in x and y lead to no change in E:

[Figure: a sketch of E(x, (y = constant)) versus x, with the minimum marked.]

dE = (∂E/∂x) dx + (∂E/∂y) dy = 0    (3.53)

Step 2. The constraint equation, however, says that dx and dy cannot be varied independently
(since the constraint equation is independent, or different, from E). Since φ(x, y) = 0 for
all x, y, then so must dφ(x, y) = 0. But,

dφ = (∂φ/∂x) dx + (∂φ/∂y) dy = 0    (3.54)

Step 3. Form the weighted sum of (3.53) and (3.54) as

dE + λ dφ = (∂E/∂x + λ ∂φ/∂x) dx + (∂E/∂y + λ ∂φ/∂y) dy = 0    (3.55)

where λ is a constant. Note that (3.55) holds for arbitrary λ.

Step 4. If λ is chosen, however, in such a way that

∂E/∂x + λ ∂φ/∂x = 0    (3.56)

then it follows that

∂E/∂y + λ ∂φ/∂y = 0    (3.57)

since at least one of dx, dy (in this case, dy) is arbitrary (i.e., dy may be chosen
nonzero).

When λ has been chosen as indicated above, it is called the Lagrange multiplier.
Therefore, (3.55) above is equivalent to minimizing E + λφ without any constraints, i.e.,

∂(E + λφ)/∂x = ∂E/∂x + λ ∂φ/∂x = 0    (3.58)

and

∂(E + λφ)/∂y = ∂E/∂y + λ ∂φ/∂y = 0    (3.59)

Step 5. Finally, one must still solve the constraint equation

φ(x, y) = 0    (3.60)

Thus, the solution for (x, y) that minimizes E subject to the constraint that φ(x, y) = 0 is
given by (3.58), (3.59), and (3.60).

That is, the problem has reduced to the following three equations:

∂E/∂x + λ ∂φ/∂x = 0    (3.56)

∂E/∂y + λ ∂φ/∂y = 0    (3.57)

and

φ(x, y) = 0    (3.60)

in the three unknowns (x, y, λ).


Extending the Problem to M Unknowns and N Constraints

The above procedure, used for a problem with two variables and one constraint, can be
generalized to M unknowns in a vector m subject to N constraints φ_j(m) = 0, j = 1, . . . , N. This
leads to the following system of M equations, i = 1, . . . , M:

∂E/∂m_i + Σ_{j=1}^{N} λ_j ∂φ_j/∂m_i = 0    (3.61)

with N constraints of the form

φ_j(m) = 0    (3.62)
3.6.3 Application to the Purely Underdetermined Problem

With the background we now have in Lagrange multipliers, we are ready to reconsider
the purely underdetermined problem. First, we pose the following problem: find m such that
m^T m is minimized subject to the N constraints that the data misfit be zero:

e_i = d_i^obs − d_i^pre = d_i^obs − Σ_{j=1}^{M} G_ij m_j = 0,    i = 1, . . ., N    (3.63)

That is, minimize

φ(m) = m^T m + Σ_{i=1}^{N} λ_i e_i    (3.64)

with respect to the elements m_i in m. We can expand the terms in Equation (3.64) and obtain

φ(m) = Σ_{k=1}^{M} m_k^2 + Σ_{i=1}^{N} λ_i ( d_i − Σ_{j=1}^{M} G_ij m_j )    (3.65)

Then, we have

∂φ/∂m_q = 2 Σ_{k=1}^{M} m_k (∂m_k/∂m_q) − Σ_{i=1}^{N} λ_i Σ_{j=1}^{M} G_ij (∂m_j/∂m_q)    (3.66)

but

∂m_k/∂m_q = δ_kq    and    ∂m_j/∂m_q = δ_jq    (3.67)

where δ_ij is the Kronecker delta, given by

δ_ij = 1, i = j
     = 0, i ≠ j

Thus

∂φ/∂m_q = 2 m_q − Σ_{i=1}^{N} λ_i G_iq = 0,    q = 1, 2, . . ., M    (3.68)

In matrix notation over all q, Equation (3.68) can be written as

2m − G^T λ = 0    (3.69)

where λ is an N × 1 vector containing the N Lagrange multipliers λ_i, i = 1, . . . , N. Note that G^T λ
has dimension (M × N)(N × 1) = M × 1, as required to be able to subtract it from m.

Now, solving explicitly for m yields

m = (1/2) G^T λ    (3.70)

The constraints in this case are that the data be fit exactly. That is,

d = Gm    (1.13)

Substituting (3.70) into (1.13) gives

d = Gm = G[(1/2) G^T λ]    (3.71)

which implies

d = (1/2) GG^T λ    (3.72)

where GG^T has dimension (N × M)(M × N), or simply N × N. Solving for λ, when [GG^T]^{-1}
exists, yields

λ = 2 [GG^T]^{-1} d    (3.73)

The Lagrange multipliers are not ends in and of themselves. But, upon substitution of Equation
(3.73) into (3.70), we obtain

m = (1/2) G^T λ = (1/2) G^T {2 [GG^T]^{-1}} d

Rearranging, we arrive at the minimum length solution, m_ML:

m_ML = G^T [GG^T]^{-1} d    (3.74)

where GG^T is an N × N matrix and the minimum length operator, G_ML^{-1}, is given by

G_ML^{-1} = G^T [GG^T]^{-1}    (3.75)

The above procedure, then, is one that determines the solution which has the minimum
length (L_2 norm = [m^T m]^{1/2}) amongst the infinite number of solutions that fit the data exactly.
In practice, one does not actually calculate the values of the Lagrange multipliers, but goes
directly to (3.74) above.

The above derivation shows that the length of m is minimized by the minimum length
operator. It may make more sense to seek a solution that deviates as little as possible from some
prior estimate of the solution, <m>, rather than from zero. The zero vector is, in fact, the prior
estimate <m> for the minimum length solution given in Equation (3.74). If we wish to explicitly
include <m>, then Equation (3.74) becomes

m_ML = <m> + G^T [GG^T]^{-1} [d − G<m>]

     = <m> + G_ML^{-1} [d − G<m>] = G_ML^{-1} d + [I − G_ML^{-1} G]<m>    (3.76)

We note immediately that Equation (3.76) reduces to Equation (3.74) when <m> = 0.


3.6.4 Comparison of Least Squares and Minimum Length Solutions

In closing this section, it is instructive to note the similarity in form between the
minimum length and least squares solutions:

Least Squares:      m_LS = [G^T G]^{-1} G^T d    (3.31)

                    with G_LS^{-1} = [G^T G]^{-1} G^T    (3.32)

Minimum Length:     m_ML = <m> + G^T [GG^T]^{-1} [d − G<m>]    (3.76)

                    with G_ML^{-1} = G^T [GG^T]^{-1}    (3.75)

The minimum length solution exists when [GG^T]^{-1} exists. Since GG^T is N × N, this is
the same as saying when GG^T has rank N. That is, when the N rows (or N columns) are
independent. In this case, your ability to predict or calculate each of the N observations is
independent.


3.6.5 Example of Minimum Length Problem

Recall the four-parameter, four-observation tomography problem we introduced in
Section 3.4.4. At that time, we noted that the least squares solution did not exist because [G^T G]^{-1}
does not exist, since G does not contain enough information to solve for 4 model parameters. In
the same way, G does not contain enough information to fit an arbitrary 4 observations, and
[GG^T]^{-1} does not exist either for this example. The basic problem is that the four paths through
the structure do not provide independent information. However, if we eliminate any one
observation (let's say the fourth), then we reduce the problem to one where the minimum length
solution exists. In this new case, we have three observations and four unknown model
parameters, and hence N < M. G, which still has enough information to determine three
observations uniquely, is now given by

    | 1  1  0  0 |
G = | 0  0  1  1 |    (3.77)
    | 1  0  1  0 |
And GG^T is given by

       | 2  0  1 |
GG^T = | 0  2  1 |    (3.78)
       | 1  1  2 |

Now [GG^T]^{-1} does exist, and we have

                  |  0.25  −0.25   0.50 |
G^T [GG^T]^{-1} = |  0.75   0.25  −0.50 |    (3.79)
                  | −0.25   0.25   0.50 |
                  |  0.25   0.75  −0.50 |

If we assume a true model given by m = [1.0, 0.5, 0.5, 0.5]^T, then the data are given by
d = [1.5, 1.0, 1.5]^T. The minimum length solution m_ML is given by

m_ML = G^T [GG^T]^{-1} d = [0.875, 0.625, 0.625, 0.375]^T    (3.80)

Note that the minimum length solution is not the "true" solution. This is generally the
case, since the "true" solution is only one of an infinite number of solutions that fit the data
exactly, and the minimum length solution is the one of shortest length. The length squared of the
"true" solution is 1.75, while the length squared of the minimum length solution is 1.6875. Note
also that the minimum length solution varies from the "true" solution by [0.125, −0.125, −0.125,
0.125]^T. This is the same direction in model space (i.e., [1, −1, −1, 1]^T) that represents the linear
combination of the original columns of G in the example in Section 3.4.4 that add to zero. We
will return to this subject when we have introduced singular value decomposition and the
partitioning of model and data space.
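The numbers in this example can be verified directly. The sketch below, assuming numpy,
computes the minimum length solution of Equation (3.74) for the three-observation problem and
confirms that it fits the data exactly while being shorter than the "true" model.

    import numpy as np

    G = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 1, 0]], dtype=float)      # Equation (3.77)
    m_true = np.array([1.0, 0.5, 0.5, 0.5])
    d = G @ m_true                                  # [1.5, 1.0, 1.5]

    # Minimum length solution, Equation (3.74): m_ML = G^T [G G^T]^{-1} d
    m_ML = G.T @ np.linalg.solve(G @ G.T, d)

    print(m_ML)                                     # [0.875, 0.625, 0.625, 0.375]
    print(G @ m_ML)                                 # fits the data exactly: [1.5, 1.0, 1.5]
    print(m_true @ m_true, m_ML @ m_ML)             # squared lengths 1.75 and 1.6875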


3.7 Weighted Measures of Length


3.7.1 Introduction

One way to improve our estimates using either the least squares solution

m_LS = [G^T G]^{-1} G^T d    (3.31)

or the minimum length solution

m_ML = <m> + G^T [GG^T]^{-1} [d − G<m>]    (3.76)

is to use weighted measures of the misfit vector

e = d^obs − d^pre    (3.81)


or the model parameter vector m, respectively. The next two subsections will deal with these
two approaches.


3.7.2 Weighted Least Squares

Weighted Measures of the Misfit Vector e

We saw in Section 3.4 that the least squares solution m_LS was the one that minimized the
total misfit between predicted and observed data in the L_2 norm sense. That is, E in

E = e^T e = [e_1  e_2  · · ·  e_N] | e_1 |
                                   | e_2 |  = Σ_{i=1}^{N} e_i^2    (3.7)
                                   |  .  |
                                   | e_N |

is minimized.
Consider a new E, defined as follows:

E = e^T W_e e    (3.82)

where W_e is an, as yet, unspecified N × N weighting matrix. W_e can take any form, but one
convenient choice is

W_e = [cov d]^{-1}    (3.83)

where [cov d]^{-1} is the inverse of the covariance matrix for the data. With this choice for the
weighting matrix, data with large variances are weighted less than ones with small variances.
While this is true in general, it is easier to show in the case where W_e is diagonal. This happens
when [cov d] is diagonal, which implies that the errors in the data are uncorrelated. The diagonal
entries in [cov d]^{-1} are then given by the reciprocal of the diagonal entries in [cov d]. That is, if


          | σ_1^2    0     · · ·    0     |
[cov d] = |  0     σ_2^2   · · ·    0     |    (3.84)
          |  .       .       .      .     |
          |  0       0     · · ·  σ_N^2   |

then

               | 1/σ_1^2     0      · · ·     0      |
[cov d]^{-1} = |   0      1/σ_2^2   · · ·     0      |    (3.85)
               |   .         .        .       .      |
               |   0         0      · · ·  1/σ_N^2   |

With this choice for W_e, the weighted misfit becomes

E = e^T W_e e = Σ_{i=1}^{N} Σ_{j=1}^{N} e_i W_ij e_j    (3.86)

But,

W_ij = δ_ij (1/σ_i^2)    (3.87)

where δ_ij is the Kronecker delta. Thus, we have

E = Σ_{i=1}^{N} e_i^2 / σ_i^2    (3.88)

If the ith variance σ_i^2 is large, then the component of the error vector in the ith direction,
e_i^2, has little influence on the size of E. This is not the case in the unweighted least squares
problem, where an examination of Equation (3.4) clearly shows that each component of the error
vector contributes equally to the total misfit.


Obtaining the Weighted Least Squares Solution m_WLS

If one uses E = e^T W_e e as the weighted measure of error, we will see below that this leads
to the weighted least squares solution:

m_WLS = [G^T W_e G]^{-1} G^T W_e d    (3.89)

with a weighted least squares operator G_WLS^{-1} given by

G_WLS^{-1} = [G^T W_e G]^{-1} G^T W_e    (3.90)

While this is true in general, it is easier to arrive at Equation (3.89) in the case where W_e is a
diagonal matrix and the forward problem d = Gm is given by the least squares problem for a
best-fitting straight line [see Equation (3.9)].

Step 1.

E = e^T W_e e = Σ_{i=1}^{N} Σ_{j=1}^{N} e_i W_ij e_j = Σ_{i=1}^{N} W_ii e_i^2    (3.91)

  = Σ_{i=1}^{N} W_ii ( d_i^obs − d_i^pre )^2 = Σ_{i=1}^{N} W_ii ( d_i − Σ_{j=1}^{M} G_ij m_j )^2    (3.92)

  = Σ_{i=1}^{N} W_ii ( d_i^2 − 2 d_i m_1 − 2 d_i m_2 z_i + 2 m_1 m_2 z_i + m_1^2 + m_2^2 z_i^2 )    (3.93)

Step 2. Then

∂E/∂m_1 = −2 Σ_{i=1}^{N} d_i W_ii + 2 m_1 Σ_{i=1}^{N} W_ii + 2 m_2 Σ_{i=1}^{N} z_i W_ii = 0    (3.94)

and

∂E/∂m_2 = −2 Σ_{i=1}^{N} d_i z_i W_ii + 2 m_1 Σ_{i=1}^{N} z_i W_ii + 2 m_2 Σ_{i=1}^{N} z_i^2 W_ii = 0    (3.95)

This can be written in matrix form as

| Σ_i W_ii       Σ_i z_i W_ii   |  | m_1 |   | Σ_i d_i W_ii     |
| Σ_i z_i W_ii   Σ_i z_i^2 W_ii |  | m_2 | = | Σ_i d_i z_i W_ii |    (3.96)

Step 3. The left-hand side can be factored as

| Σ_i W_ii       Σ_i z_i W_ii   |   | 1    1   · · ·  1   |  | W_11   0    · · ·    0   |  | 1   z_1 |
| Σ_i z_i W_ii   Σ_i z_i^2 W_ii | = | z_1  z_2 · · ·  z_N |  |  0    W_22  · · ·    0   |  | 1   z_2 |    (3.97)
                                                             |  .     .      .      .   |  | .    .  |
                                                             |  0     0    · · ·  W_NN  |  | 1   z_N |

or simply

| Σ_i W_ii       Σ_i z_i W_ii   |
| Σ_i z_i W_ii   Σ_i z_i^2 W_ii | = G^T W_e G    (3.98)

Similarly, the right-hand side can be factored as

| Σ_i d_i W_ii     |   | 1    1   · · ·  1   |  | W_11   0    · · ·    0   |  | d_1 |
| Σ_i d_i z_i W_ii | = | z_1  z_2 · · ·  z_N |  |  0    W_22  · · ·    0   |  | d_2 |    (3.99)
                                                |  .     .      .      .   |  |  .  |
                                                |  0     0    · · ·  W_NN  |  | d_N |

or simply

| Σ_i d_i W_ii     |
| Σ_i d_i z_i W_ii | = G^T W_e d    (3.100)

Step 4. Therefore, using Equations (3.98) and (3.100), Equation (3.96) can be written as

G^T W_e G m = G^T W_e d    (3.101)

The weighted least squares solution, m_WLS, from Equation (3.89) is thus

m_WLS = [G^T W_e G]^{-1} G^T W_e d    (3.102)

assuming that [G^T W_e G]^{-1} exists, of course.
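A minimal numerical sketch of Equation (3.102), assuming numpy and using the choice
W_e = [cov d]^{-1} with invented, uncorrelated data errors, is given below for the straight-line
problem.

    import numpy as np

    z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    d = np.array([1.0, 3.2, 5.1, 6.8, 20.0])       # last datum is an outlier with a large variance
    sigma = np.array([0.1, 0.1, 0.1, 0.1, 5.0])    # assumed data standard deviations

    G = np.column_stack([np.ones_like(z), z])
    W_e = np.diag(1.0 / sigma ** 2)                # Equation (3.83) with a diagonal [cov d]

    # Weighted least squares, Equation (3.102)
    m_WLS = np.linalg.solve(G.T @ W_e @ G, G.T @ W_e @ d)

    # Ordinary least squares for comparison, Equation (3.31)
    m_LS = np.linalg.solve(G.T @ G, G.T @ d)

    print(m_WLS)   # close to the trend of the first four points (intercept ~1, slope ~2)
    print(m_LS)    # pulled strongly toward the poorly determined outlier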


3.7.3 Weighted Minimum Length

The development of a weighted minimum length solution is similar to that of the
weighted least squares problem. The steps are as follows.

First, recall that the minimum length solution minimizes m^T m. By analogy with
weighted least squares, we can choose to minimize

m^T W_m m    (3.103)

instead of m^T m. For example, if one wishes to use

W_m = [cov m]^{-1}    (3.104)

then one must replace m above with

m − <m>    (3.105)

where <m> is the expected, or a priori, estimate for the parameter values. The reason for this is
that the variances must represent fluctuations about zero. In the weighted least squares problem,
it is assumed that the error vector e which is being minimized has a mean of zero. Thus, for the
weighted minimum length problem, we replace m by its departure from the expected value <m>.
Therefore, we introduce a new function L to be minimized:

L = [m − <m>]^T W_m [m − <m>]    (3.106)

If one then follows the procedure in Section 3.6 with this new function, one eventually
(as in "It is left to the student as an exercise!!") is led to the weighted minimum length solution
m_WML given by

m_WML = <m> + W_m^{-1} G^T [G W_m^{-1} G^T]^{-1} [d − G<m>]    (3.107)

and the weighted minimum length operator, G_WML^{-1}, is given by

G_WML^{-1} = W_m^{-1} G^T [G W_m^{-1} G^T]^{-1}    (3.108)

This expression differs from Equation (3.38), page 54 of Menke, which uses W_m rather than
W_m^{-1}. I believe Menke's equation is wrong. Note that the solution depends explicitly on the
expected, or a priori, estimate of the model parameters <m>. The second term represents a
departure from the a priori estimate <m>, based on the inadequacy of the forward problem
G<m> to fit the data d exactly.

Other choices for W_m include:

1. D^T D, where D is a derivative matrix (a measure of the flatness of m) of dimension
(M − 1) × M:

    | −1   1   0   · · ·   0 |
D = |  0  −1   1   · · ·   0 |    (3.109)
    |  .        .     .    . |
    |  0   0   · · ·  −1   1 |

2. D^T D, where D is an (M − 2) × M roughness (second derivative) matrix given by

    | 1  −2   1   0   · · ·   0 |
D = | 0   1  −2   1   · · ·   0 |    (3.110)
    | .            .     .    . |
    | 0   0   · · ·  1  −2   1 |

Note that for both choices of D presented, D^T D is an M × M matrix of rank less than M (for the
first-derivative case, it is of rank M − 1, while for the second it is of rank M − 2). This means that
W_m does not have a mathematical inverse. This can introduce some nonuniqueness into the
solution, but does not preclude finding a solution. Finally, note that many choices for W_m are
possible.
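For concreteness, the sketch below, assuming numpy, builds the first- and second-derivative
matrices of Equations (3.109) and (3.110) for M = 6 and confirms the rank statements made
above.

    import numpy as np

    M = 6

    # Equation (3.109): (M-1) x M first-derivative (flatness) matrix
    D1 = np.zeros((M - 1, M))
    for i in range(M - 1):
        D1[i, i], D1[i, i + 1] = -1.0, 1.0

    # Equation (3.110): (M-2) x M second-derivative (roughness) matrix
    D2 = np.zeros((M - 2, M))
    for i in range(M - 2):
        D2[i, i], D2[i, i + 1], D2[i, i + 2] = 1.0, -2.0, 1.0

    print(np.linalg.matrix_rank(D1.T @ D1))   # M - 1 = 5: a constant vector is in the null space
    print(np.linalg.matrix_rank(D2.T @ D2))   # M - 2 = 4: constant and linear trends are in the null space
    print(D1 @ np.ones(M))                    # all zeros
    print(D2 @ np.arange(M, dtype=float))     # all zeros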


Geosciences 567: CHAPTER 3 (RMR/GZ)
57
3.7.4 Weighted Damped Least Squares

In Sections 3.7.2 and 3.7.3 we considered weighted versions of the least squares and
minimum length solutions. Both unweighted and weighted problems can be very unstable if the
matrices that have to be inverted are nearly singular. In the weighted problems, these are

G^T W_e G    (3.111)

and

G W_m^{-1} G^T    (3.112)

respectively, for least squares and minimum length problems. In this case, one can form a
weighted penalty, or cost function, given by

E + ε^2 L    (3.113)

where E is from Equation (3.91) for weighted least squares and L is from Equation (3.106) for
the weighted minimum length problem. One then goes through the exercise of minimizing
Equation (3.113) with respect to the model parameters m, and obtains what is known as the
weighted, damped least squares solution m_WD. It is, in fact, a weighted mix of the weighted least
squares and weighted minimum length solutions. One finds that m_WD is given by either

m_WD = <m> + [G^T W_e G + ε^2 W_m]^{-1} G^T W_e [d − G<m>]    (3.114)

or

m_WD = <m> + W_m^{-1} G^T [G W_m^{-1} G^T + ε^2 W_e^{-1}]^{-1} [d − G<m>]    (3.115)

where the weighted, damped least squares operator, G_WD^{-1}, is given by

G_WD^{-1} = [G^T W_e G + ε^2 W_m]^{-1} G^T W_e    (3.116)

or

G_WD^{-1} = W_m^{-1} G^T [G W_m^{-1} G^T + ε^2 W_e^{-1}]^{-1}    (3.117)

The two forms for G_WD^{-1} can be shown to be equivalent. The ε^2 term has the effect of damping
the instability. As we will see later in Chapter 6 using singular-value decomposition, the above
procedure minimizes the effects of small singular values in G^T W_e G or G W_m^{-1} G^T.
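The damping in Equation (3.114) is easy to demonstrate numerically. The sketch below,
assuming numpy, applies weighted damped least squares to the rank-deficient tomography
problem of Section 3.4.4, taking W_e = I, W_m = I, and <m> = 0 for simplicity; the value of ε is
arbitrary and purely illustrative.

    import numpy as np

    G = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]], dtype=float)
    d = np.array([2.0, 2.0, 2.0, 2.0])

    W_e = np.eye(4)          # identity data weights
    W_m = np.eye(4)          # identity model weights, <m> = 0
    eps = 0.1                # arbitrary damping parameter

    # G^T W_e G is singular, so plain (weighted) least squares fails,
    # but adding eps^2 W_m makes the matrix invertible (Equation 3.114)
    m_WD = np.linalg.solve(G.T @ W_e @ G + eps ** 2 * W_m, G.T @ W_e @ d)

    print(np.linalg.matrix_rank(G.T @ W_e @ G))   # 3 < 4
    print(m_WD)                                   # close to [1, 1, 1, 1], the shortest solution
    print(G @ m_WD)                               # close to the data [2, 2, 2, 2]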

In the next section we will learn two methods of including a priori information and
constraints in inverse problems.


Geosciences 567: CHAPTER 3 (RMR/GZ)
58
3.8 A Priori Information and Constraints
(See Menke, Pages 55–57)


3.8.1 Introduction

Another common type of a priori information takes the form of linear equality
constraints:

Fm = h    (3.118)

where F is a P × M matrix, and P is the number of linear constraints considered. As an example,
consider the case for which the mean of the model parameters is known. In this case with only
one constraint, we have

(1/M) Σ_{i=1}^{M} m_i = h_1    (3.119)

Then, Equation (3.118) can be written as

Fm = (1/M) [1  1  · · ·  1] | m_1 |
                            | m_2 |  = h_1    (3.120)
                            |  .  |
                            | m_M |

As another example, suppose that the jth model parameter m_j is actually known in
advance. That is, suppose

m_j = h_1    (3.121)

Then Equation (3.118) takes the form

Fm = [0  0  · · ·  1  · · ·  0  0] | m_1 |
         (1 in the jth column)     |  .  |  = h_1    (3.122)
                                   | m_M |

Note that for this example it would be possible to remove m_j as an unknown, thereby reducing
the system of equations by one. It is often preferable to use Equation (3.122), even in this case,
rather than rewriting the forward problem in a computer code.

3.8.2 A First Approach to Including Constraints

We will consider two basic approaches to including constraints in inverse problems.
Each has its strengths and weaknesses. The first includes the constraint matrix F in the forward
problem, and the second uses Lagrange multipliers. The steps for the first approach are as
follows.


Step 1. Include Fm = h as rows in a new G̃ that operates on the original m:

G̃ m = | G | m = | d |    (3.123)
      | F |     | h |

with m = [m_1, m_2, . . ., m_M]^T, and dimensions (N + P) × M, M × 1, and (N + P) × 1,
respectively.


Step 2. The new (N + P) × 1 misfit vector e becomes

e = | d^obs − d^pre |    (3.124)
    | h − h^pre     |

Performing a least squares inversion would minimize the new e^T e, based on Equation
(3.124). The difference

h − h^pre    (3.125)

which represents the misfit to the constraints, may be small, but it is unlikely that it
would vanish, which it must if the constraints are to be satisfied.


Step 3. Introduce a weighted misfit:

e^T W_e e    (3.126)

where W_e is a diagonal matrix of the form


      | 1   0   · · ·   0      0        · · ·     0      |
      | 0   1   · · ·   0      0        · · ·     0      |   N rows
      | .   .    .      .      .                  .      |
W_e = | 0   0   · · ·   1      0        · · ·     0      |    (3.127)
      | 0   0   · · ·   0   (big #)     · · ·     0      |
      | .   .           .      .          .       .      |   P rows
      | 0   0   · · ·   0      0        · · ·  (big #)   |

That is, it has relatively large values for the last P entries associated with the constraint
equations. Recalling the form of the weighting matrix used in Equation (3.83), one sees
that Equation (3.127) is equivalent to assigning the constraints very small variances.
Hence, a weighted least squares approach in this case will give large weight to fitting
the constraints. The size of the big numbers in W_e must be determined empirically.
One seeks a number that leads to a solution that satisfies the constraints acceptably, but
does not make the matrix in Equation (3.111) that must be inverted to obtain the
solution too poorly conditioned. Matrices with a large range of values in them tend to
be poorly conditioned.

Consider the example of the smoothing constraint here, with P = M − 2:

Dm = 0 (3.128)

where the dimensions of D are (M − 2) × M, of m are M × 1, and of 0 are (M − 2) × 1. The augmented
equations are

\[ \tilde{\mathbf{G}}\mathbf{m} = \begin{bmatrix} \mathbf{G} \\ \mathbf{D} \end{bmatrix}\mathbf{m}
 = \begin{bmatrix} \mathbf{d} \\ \mathbf{0} \end{bmatrix} \]   (3.129)

Let's use the following weighting matrix:

\[ \mathbf{W}_e = \begin{bmatrix} \mathbf{I}_{N\times N} & \mathbf{0} \\ \mathbf{0} & \varepsilon^2\,\mathbf{I}_{P\times P} \end{bmatrix} \]   (3.130)

where ε² is a constant. This results in the following, with the dimensions of the three matrices in
the first set of brackets being M × (N + P), (N + P) × (N + P), and (N + P) × M, respectively:

\[ \mathbf{m}_{WLS} =
\left( \begin{bmatrix} \mathbf{G}^T & \mathbf{D}^T \end{bmatrix}
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \varepsilon^2\mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{G} \\ \mathbf{D} \end{bmatrix} \right)^{-1}
\begin{bmatrix} \mathbf{G}^T & \mathbf{D}^T \end{bmatrix}
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \varepsilon^2\mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{d} \\ \mathbf{0} \end{bmatrix} \]

where the lower (second) set of matrices has dimensions M × (N + P) and (N + P) × 1.

\[ = [\mathbf{G}^T\mathbf{G} + \varepsilon^2\mathbf{D}^T\mathbf{D}]^{-1}\,[\mathbf{G}^T\mathbf{d}] \]   (3.131)

where the two bracketed terms have dimensions M × M and M × 1, respectively.


\[ = \left( \begin{bmatrix} \mathbf{G} \\ \varepsilon\mathbf{D} \end{bmatrix}^{T}
\begin{bmatrix} \mathbf{G} \\ \varepsilon\mathbf{D} \end{bmatrix} \right)^{-1}\mathbf{G}^T\mathbf{d} \]   (3.132)

The three matrices within (3.132) have dimensions M × (N + P), (N + P) × M, and M × 1,
respectively, which produce an M × 1 matrix when evaluated. In this form we can see this is
simply the m_LS for the problem


\[ \begin{bmatrix} \mathbf{G} \\ \varepsilon\mathbf{D} \end{bmatrix}\mathbf{m}
 = \begin{bmatrix} \mathbf{d} \\ \mathbf{0} \end{bmatrix} \]   (3.133)

By varying ε, we can trade off the misfit and the smoothness for the model.
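
A minimal numpy sketch of this row-augmentation approach is given below. The forward problem G, the data d, and the values of ε are illustrative assumptions; D is the usual second-difference roughening matrix with M − 2 rows.

```python
import numpy as np

# Illustrative forward problem (assumed sizes and values, not from the notes)
rng = np.random.default_rng(1)
N, M = 20, 10
G = rng.standard_normal((N, M))
m_true = np.sin(np.linspace(0, np.pi, M))          # a smooth "true" model
d = G @ m_true + 0.05 * rng.standard_normal(N)     # noisy data

# Second-difference smoothing matrix D, (M-2) x M, so that Dm ~ 0 for a smooth m
D = np.zeros((M - 2, M))
for i in range(M - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

# Augmented least squares, Equation (3.133): [G; eps*D] m = [d; 0]
for eps in [0.01, 0.1, 1.0, 10.0]:
    G_aug = np.vstack([G, eps * D])
    d_aug = np.concatenate([d, np.zeros(M - 2)])
    m_est, *_ = np.linalg.lstsq(G_aug, d_aug, rcond=None)
    misfit = np.linalg.norm(d - G @ m_est)
    roughness = np.linalg.norm(D @ m_est)
    print(f"eps={eps:5.2f}  misfit={misfit:.3f}  roughness={roughness:.3f}")
```

Printing the misfit against the roughness for each ε traces out the trade-off curve described above.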


3.8.3 A Second Approach to Including Constraints

Whenever the subject of constraints is raised, Lagrange multipliers come to mind! The
steps for this approach are as follows.


Step 1. Form a weighted sum of the misfit and the constraints:

\[ \Phi(\mathbf{m}) = \mathbf{e}^T\mathbf{e} + 2\boldsymbol{\lambda}^T[\mathbf{Fm} - \mathbf{h}] \]   (3.134)

which can be expanded as



\[ \Phi(\mathbf{m}) = \sum_{i=1}^{N}\left( d_i - \sum_{j=1}^{M} G_{ij}m_j \right)^2
 + 2\sum_{i=1}^{P}\lambda_i\left( \sum_{j=1}^{M} F_{ij}m_j - h_i \right) \]   (3.135)

where the notation differs slightly from Equation (3.43) on page 56 in Menke, where
there are P linear equality constraints, and where the factor of 2 has been added as a
matter of convenience to make the form of the final answer simpler.
Step 2. One then takes the partials of Equation (3.135) with respect to all the entries in m and
sets them to zero. That is,

\[ \frac{\partial \Phi(\mathbf{m})}{\partial m_q} = 0 \qquad q = 1, 2, \ldots, M \]   (3.136)
which leads to


\[ 2\sum_{i=1}^{M} m_i \sum_{j=1}^{N} G_{jq}G_{ji} - 2\sum_{i=1}^{N} G_{iq}d_i
 + 2\sum_{i=1}^{P}\lambda_i F_{iq} = 0 \qquad q = 1, 2, \ldots, M \]   (3.137)

where the first two terms are the same as the least squares case in Equation (3.25) since
they come directly from e^T e, and the last term shows why the factor of 2 was added in
Equation (3.135).


Step 3. Equation (3.137) is not the complete description of the problem. To the M equations in
Equation (3.137), P constraint equations must also be added. In matrix form, this yields


\[ \begin{bmatrix} \mathbf{G}^T\mathbf{G} & \mathbf{F}^T \\ \mathbf{F} & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \mathbf{m} \\ \boldsymbol{\lambda} \end{bmatrix}
 = \begin{bmatrix} \mathbf{G}^T\mathbf{d} \\ \mathbf{h} \end{bmatrix} \]   (3.138)

where the dimensions are (M + P) × (M + P), (M + P) × 1, and (M + P) × 1, respectively.
Step 4. The above system of equations can be solved as


\[ \begin{bmatrix} \mathbf{m} \\ \boldsymbol{\lambda} \end{bmatrix}
 = \begin{bmatrix} \mathbf{G}^T\mathbf{G} & \mathbf{F}^T \\ \mathbf{F} & \mathbf{0} \end{bmatrix}^{-1}
\begin{bmatrix} \mathbf{G}^T\mathbf{d} \\ \mathbf{h} \end{bmatrix} \]   (3.139)


As an example, consider constraining a straight line to pass through some point (z', d').
That is, for N observations, we have

d_i = m_1 + m_2 z_i    i = 1, N   (3.140)

subject to the single constraint

d' = m_1 + m_2 z'   (3.141)

Then Equation (3.118) has the form

\[ \mathbf{Fm} = \begin{bmatrix} 1 & z' \end{bmatrix}\begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = d' \]   (3.142)

We can then write out Equation (3.139) explicitly, and obtain the following:

\[ \begin{bmatrix} m_1 \\ m_2 \\ \lambda \end{bmatrix}
 = \begin{bmatrix} N & \sum z_i & 1 \\ \sum z_i & \sum z_i^2 & z' \\ 1 & z' & 0 \end{bmatrix}^{-1}
\begin{bmatrix} \sum d_i \\ \sum z_i d_i \\ d' \end{bmatrix} \]   (3.143)

Note the similarity between Equations (3.143) and (3.36), the least squares solution to fitting a
straight line to a set of points without any constraints:


\[ \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}_{LS}
 = \begin{bmatrix} N & \sum z_i \\ \sum z_i & \sum z_i^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum d_i \\ \sum z_i d_i \end{bmatrix} \]   (3.36)

If you wanted to get the same result for the straight line passing through a point using the
first approach with W_e, you would assign

W_ii = 1    i = 1, . . . , N   (3.144)

and

W_{N+1, N+1} = big #   (3.145)

which is equivalent to assigning a small variance (relative to the unconstrained part of the
problem) to the constraint equation. The solution obtained with Equation (3.103) should
approach the solution obtained using Equation (3.143).
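
The sketch below sets up Equation (3.143) for a small synthetic data set and compares it with the first (big-weight) approach; the data values, the constraint point (z', d'), and the size of the big weight are assumptions chosen only for illustration.

```python
import numpy as np

# Illustrative data and constraint point (assumed values)
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([0.9, 2.1, 2.9, 4.2, 4.8])
zp, dp = 2.0, 3.5            # the line is required to pass through (z', d')

# Lagrange multiplier system, Equation (3.143)
A = np.array([[len(z),  z.sum(),      1.0],
              [z.sum(), (z**2).sum(), zp ],
              [1.0,     zp,           0.0]])
rhs = np.array([d.sum(), (z * d).sum(), dp])
m1, m2, lam = np.linalg.solve(A, rhs)
print("Lagrange:   m1 =", m1, " m2 =", m2)

# Big-weight approach, Equations (3.144)-(3.145): the constraint is appended as a
# heavily weighted extra row of the weighted least squares system
big = 1.0e6
G = np.column_stack([np.ones(len(z) + 1), np.append(z, zp)])
dd = np.append(d, dp)
W = np.diag(np.append(np.ones(len(z)), big))
m_w = np.linalg.solve(G.T @ W @ G, G.T @ W @ dd)
print("Big weight: m1 =", m_w[0], " m2 =", m_w[1])   # approaches the Lagrange result
```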

Note that it is easy to constrain lines to pass through the origin using Equation (3.143).
In this case, we have

d' = z' = 0   (3.146)

and Equation (3.143) becomes


\[ \begin{bmatrix} m_1 \\ m_2 \\ \lambda \end{bmatrix}
 = \begin{bmatrix} N & \sum z_i & 1 \\ \sum z_i & \sum z_i^2 & 0 \\ 1 & 0 & 0 \end{bmatrix}^{-1}
\begin{bmatrix} \sum d_i \\ \sum z_i d_i \\ 0 \end{bmatrix} \]   (3.147)

The advantage of using the Lagrange multiplier approach to constraints is that the
constraints will be satisfied exactly. It often happens, however, that the constraints are only
approximately known, and using Lagrange multipliers to fit the constraints exactly may not be
appropriate. An example might be a gravity inversion where depth to bedrock at one point is
known from drilling. Constraining the depth to be exactly the drill depth may be misleading if
the depth in the model is an average over some area. Then the exact depth at one point may not
be the best estimate of the depth over the area in question. A second disadvantage of the
Lagrange multiplier approach is that it adds one equation to the system of equations in Equation
(3.143) for each constraint. This can add up quickly, making the inversion considerably more
difficult computationally.
An entirely different class of constraints is called linear inequality constraints; these take
the form

Fm ≥ h   (3.148)

These can be solved using linear programming techniques, but we will not consider them further
in this class.


3.8.4 Seismic Receiver Function Example

The following is an example of using smoothing constraints in an inverse problem.
Consider a general problem in time series analysis, with a delta function input. Then the output
from the "model" is the Green's function of the system. The inverse problem is this: Given the
Green's function, find the parameters of the model.

(Figure: a delta-function impulse input to the model produces the Green's function as output.)

In a little more concrete form:

(Figure: a layered model m (model space, parameters 1, 2, 3, . . ., M versus depth z) is mapped by
the forward problem Fm = d into the time series c_1, c_2, . . ., c_N (data space, amplitude versus time t).)


If d is very noisy, then m_LS will have a high-frequency component to try to "fit the
noise," but this will not be real. How do we prevent this? So far, we have learned two ways:
use m_WLS if we know cov d, or if not, we can place a smoothing constraint on m. An example of
this approach using receiver function inversions can be found in Ammon, C. J., G. E. Randall,
and G. Zandt, On the nonuniqueness of receiver function inversions, J. Geophys. Res., 95,
15,303–15,318, 1990.

The important points are as follows:

• This approach is used in the real world.
• The forward problem is written

d_j = F_j m    j = 1, 2, 3, . . . , N

• This is nonlinear, but after linearization (discussed in Chapter 4), the equations are the
same as discussed previously (with minor differences).
• Note the correlation between the roughness in the model and the roughness in the data.
• The way to choose the weighting parameter, ε, is to plot the trade-off between
smoothness and waveform fit.


3.9 Variances of Model Parameters
(See Pages 58–60, Menke)


3.9.1 Introduction

Data errors are mapped into model parameter errors through any type of inverse. We
noted in Chapter 2 [Equations (2.61)–(2.63)] that if

m^est = Md + v   (2.61)

and if [cov d] is the data covariance matrix which describes the data errors, then the a posteriori
model covariance matrix is given by

[cov m] = M[cov d]M^T   (2.63)

The covariance matrix in Equation (2.63) is called the a posteriori model covariance
matrix because it is calculated after the fact. It gives what are sometimes called the formal
uncertainties in the model parameters. It is different from the a priori model covariance matrix
of Equation (3.85), which is used to constrain the underdetermined problem.

The a posteriori covariance matrix in Equation (2.63) shows explicitly the mapping of
data errors into uncertainties in the model parameters. Although the mapping will be clearer
once we consider the generalized inverse in Chapter 7, it is instructive at this point to consider
applying Equation (2.63) to the least squares and minimum length problems.


3.9.2 Application to Least Squares

We can apply Equation (2.63) to the least squares problem and obtain

[cov m] = {[G^T G]^{-1} G^T}[cov d]{[G^T G]^{-1} G^T}^T   (3.149)

Further, if [cov d] is given by

[cov d] = σ² I_N   (3.150)

then

[cov m] = [G^T G]^{-1} G^T [σ² I]{[G^T G]^{-1} G^T}^T

= σ² [G^T G]^{-1} G^T G {[G^T G]^{-1}}^T

= σ² {[G^T G]^{-1}}^T

= σ² [G^T G]^{-1}   (3.151)

since the transpose of a symmetric matrix returns the original matrix.
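
A short numpy illustration of this result is sketched below; G and the data variance σ² are assumed values, and the general mapping (3.149) is checked against the simplified form σ²[G^T G]^{-1}.

```python
import numpy as np

# Illustrative least squares problem (assumed sizes and data variance)
rng = np.random.default_rng(2)
N, M, sigma2 = 15, 3, 0.25
G = rng.standard_normal((N, M))
cov_d = sigma2 * np.eye(N)                      # Equation (3.150)

# General mapping of data covariance into model covariance, Equation (3.149)
M_op = np.linalg.inv(G.T @ G) @ G.T             # least squares inverse operator
cov_m_general = M_op @ cov_d @ M_op.T

# Simplified form for [cov d] = sigma^2 I, Equation (3.151)
cov_m_simple = sigma2 * np.linalg.inv(G.T @ G)

print(np.allclose(cov_m_general, cov_m_simple))   # True
print(np.sqrt(np.diag(cov_m_simple)))             # formal 1-sigma model uncertainties
```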


3.9.3 Application to the Minimum Length Problem

Application of Equation (2.63) to the minimum length problem leads to the following for
the a posteriori model covariance matrix:

[cov m] = {G^T [GG^T]^{-1}}[cov d]{G^T [GG^T]^{-1}}^T   (3.152)

If the data covariance matrix is again given by

[cov d] = σ² I_N   (3.153)

we obtain

[cov m] = σ² G^T [GG^T]^{-2} G   (3.154)

where

[GG^T]^{-2} = [GG^T]^{-1}[GG^T]^{-1}   (3.155)


3.9.4 Geometrical Interpretation of Variance

There is another way to look at the variance of model parameter estimates for the least
squares problem that considers the prediction error, or misfit, to the data. Recall that we defined
the misfit E as
E = e^T e = [d − d^pre]^T[d − d^pre]

= [d − Gm]^T[d − Gm]   (3.23)
which explicitly shows the dependence of E on the model parameters m. That is, we have

E = E(m)   (3.156)

If E(m) has a sharp, well-defined minimum, then we can conclude that our solution m_LS is well
constrained. Conversely, if E(m) has a broad, poorly defined minimum, then we conclude that
our solution m_LS is poorly constrained.

After Figure 3.10, page 59, of Menke, we have the following:

(Figure: E(m) plotted versus model parameter m, showing (a) a narrow minimum and (b) a wide
minimum about m^est.)

(a) The best estimate m^est of model parameter m occurs at the minimum of E(m).
If the minimum is relatively narrow, then random fluctuations in E(m) lead to
only small errors Δm in m^est. (b) If the minimum is wide, then large errors Δm
can occur.

One way to quantify this qualitative observation is to realize that the width of the
minimum for E(m) is related to the curvature, or second derivative, of E(m) at the minimum.
For the least squares problem, we have


\[ \left.\frac{\partial^2 E}{\partial \mathbf{m}^2}\right|_{\mathbf{m}=\mathbf{m}_{LS}}
 = \left.\frac{\partial^2}{\partial \mathbf{m}^2}\,[\mathbf{d}-\mathbf{Gm}]^T[\mathbf{d}-\mathbf{Gm}]\right|_{\mathbf{m}=\mathbf{m}_{LS}} \]   (3.157)

Evaluating the right-hand side, we have for the qth term



\[ \frac{\partial^2 E}{\partial m_q^2}
 = \frac{\partial^2}{\partial m_q^2}\,[\mathbf{d}-\mathbf{Gm}]^T[\mathbf{d}-\mathbf{Gm}]
 = \frac{\partial^2}{\partial m_q^2}\sum_{i=1}^{N}\left( d_i - \sum_{j=1}^{M}G_{ij}m_j \right)^2 \]   (3.158)


\[ = \frac{\partial}{\partial m_q}\left[ -2\sum_{i=1}^{N}\left( d_i - \sum_{j=1}^{M}G_{ij}m_j \right)G_{iq} \right] \]   (3.159)


\[ = -2\,\frac{\partial}{\partial m_q}\sum_{i=1}^{N}\left( G_{iq}d_i - \sum_{j=1}^{M}G_{iq}G_{ij}m_j \right) \]   (3.160)

Using the same steps as we did in the derivation of the least squares solution in Equations
(3.24)–(3.29), it is possible to see that Equation (3.160) represents the qth term in
−2(∂/∂m){G^T[d − Gm]}. Combining the q equations into matrix notation yields

\[ \frac{\partial^2}{\partial \mathbf{m}^2}\,[\mathbf{d}-\mathbf{Gm}]^T[\mathbf{d}-\mathbf{Gm}]
 = -2\,\frac{\partial}{\partial \mathbf{m}}\left\{ \mathbf{G}^T[\mathbf{d}-\mathbf{Gm}] \right\} \]   (3.161)

Evaluating the first derivative on the right-hand side of Equation (3.161), we have for the qth
term


\[ \frac{\partial}{\partial m_q}\left\{ \mathbf{G}^T[\mathbf{d}-\mathbf{Gm}] \right\}_q
 = \frac{\partial}{\partial m_q}\sum_{i=1}^{N}\left( G_{iq}d_i - \sum_{j=1}^{M}G_{iq}G_{ij}m_j \right) \]   (3.162)


\[ = -\frac{\partial}{\partial m_q}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{iq}G_{ij}m_j \]   (3.163)

\[ = -\sum_{i=1}^{N} G_{iq}G_{iq} \]   (3.164)

which we recognize as the negative of the (q, q) entry in G^T G. Therefore, we can write the M × M matrix
equation as

\[ \frac{\partial}{\partial \mathbf{m}}\left\{ \mathbf{G}^T[\mathbf{d}-\mathbf{Gm}] \right\} = -\mathbf{G}^T\mathbf{G} \]   (3.165)

From Equations (3.157)–(3.165) we can conclude that the second derivative of E in the least
squares problem is proportional to G^T G. That is,

\[ \left.\frac{\partial^2 E}{\partial \mathbf{m}^2}\right|_{\mathbf{m}=\mathbf{m}_{LS}}
 = 2\,\mathbf{G}^T\mathbf{G} = (\mathrm{constant})\,\mathbf{G}^T\mathbf{G} \]   (3.166)

Furthermore, from Equation (3.151) we have that [cov m] is proportional to [G^T G]^{-1}. Therefore,
we can associate large values of the second derivative of E, given by (3.166), with (1) sharp
curvature for E, (2) a narrow well for E, and (3) good (i.e., small) model variance.

As Menke points out, [cov m] can be interpreted as being controlled either by (1) the
variance of the data times a measure of how error in the data is mapped into model parameters or
(2) a constant times the curvature of the prediction error at its minimum.

I like Menke's summary for his chapter (page 60) on this material very much. Hence,
I've reproduced his closing paragraph for you as follows:

The methods of solving inverse problems that have been discussed in this
chapter emphasize the data and model parameters themselves. The method of
least squares estimates the model parameters with smallest prediction length.
The method of minimum length estimates the simplest model parameters. The
ideas of data and model parameters are very concrete and straightforward, and
the methods based on them are simple and easily understood. Nevertheless, this
viewpoint tends to obscure an important aspect of inverse problems. Namely, that
the nature of the problem depends more on the relationship between the data and
model parameters than on the data or model parameters themselves. It should,
for instance, be possible to tell a well-designed experiment from a poorly
designed one without knowing what the numerical values of the data or model
parameters are, or even the range in which they fall.



Before considering the relationships implied in the mapping between model parameters
and data in Chapter 5, we extend what we now know about linear inverse problems to nonlinear
problems in the next chapter.





CHAPTER 4: LINEARIZATION OF NONLINEAR PROBLEMS


4.1 Introduction


Thus far we have dealt with the linear, explicit forward problem given by

Gm = d (1.13)

where G is a matrix of coefficients (constants) that multiply the model parameter vector m and
return a data vector d. If m is doubled, then d is also doubled.

We can also write Equation (1.13) out explicitly as

\[ d_i = \sum_{j=1}^{M} G_{ij}m_j \qquad i = 1, 2, \ldots, N \]   (4.1)

This form emphasizes the linear nature of the problem. Next, we consider a more general
relationship between data and model parameters.


4.2 Linearization of Nonlinear Problems


Consider a general (explicit) relationship between the ith datum and the model parameters
given by

d_i = g_i(m)   (4.2)

An example might be

d_1 = 2m_1³   (4.3)

The steps required to linearize a problem of the form of Equation (4.2) are as follows:


Step 1. Expand g_i(m) about some point m_0 in model space using a Taylor series expansion:

\[ d_i = g_i(\mathbf{m}) = g_i(\mathbf{m}_0)
 + \sum_{j=1}^{M}\left.\frac{\partial g_i(\mathbf{m})}{\partial m_j}\right|_{\mathbf{m}=\mathbf{m}_0}\Delta m_j
 + \frac{1}{2}\sum_{j=1}^{M}\left.\frac{\partial^2 g_i(\mathbf{m})}{\partial m_j^2}\right|_{\mathbf{m}=\mathbf{m}_0}(\Delta m_j)^2
 + O(\Delta m_j^3) \]   (4.4)


where Δm is the difference between m and m_0, or

Δm = m − m_0   (4.5)

If we assume that terms in (Δm_j)^n, n ≥ 2, are small with respect to the Δm_j terms, then

\[ d_i = g_i(\mathbf{m}) = g_i(\mathbf{m}_0)
 + \sum_{j=1}^{M}\left.\frac{\partial g_i(\mathbf{m})}{\partial m_j}\right|_{\mathbf{m}=\mathbf{m}_0}\Delta m_j \]   (4.6)

Step 2. The predicted data d̂_i at m = m_0 are given by

d̂_i = g_i(m_0)   (4.7)

Therefore

\[ d_i - \hat{d}_i = \sum_{j=1}^{M}\left.\frac{\partial g_i(\mathbf{m})}{\partial m_j}\right|_{\mathbf{m}=\mathbf{m}_0}\Delta m_j \]   (4.8)

Step 3. We can define the misfit c_i as

c_i = d_i − d̂_i   (4.9)

= observed data − predicted data

c_i is not necessarily noise. It is just the misfit between observed and predicted data
for some choice of the model parameter vector m_0.


Step 4. The partial derivative of the ith data equation with respect to the jth model parameter is
given by

∂g_i(m)/∂m_j   (4.10)

These partial derivatives are functions of the model parameters and may be nonlinear
(gasp) or occasionally even nonexistent (shudder).
Fortunately, the values of these partial derivatives, evaluated at some point in model
space m_0, and given by

∂g_i(m)/∂m_j |_{m=m_0}   (4.11)

are just numbers (constants), if they exist, and not functions. We then define G_ij as
follows:

G_ij = ∂g_i(m)/∂m_j |_{m=m_0}   (4.12)

Step 5. Finally, combining the above we have

\[ c_i = \sum_{j=1}^{M} G_{ij}\,\Delta m_j \qquad i = 1, \ldots, N \]   (4.13)

or, in matrix notation, the linearized problem becomes

c = G Δm   (4.14)

where

c_i = d_i − d̂_i = observed data − predicted data

= d_i − g_i(m_0)   (4.15)

G_ij = ∂g_i(m)/∂m_j |_{m=m_0}   (4.16)

and

Δm_j = change from (m_0)_j   (4.17)

Thus, by linearizing Equation (4.2), we have arrived at a set of linear equations, where
now c_i (the difference between observed and predicted data) is a linear function of changes in
the model parameters from some starting model.

Some general comments on Equation (4.14):

1. In general, Equation (4.14) only holds in the neighborhood of m_0, and for small changes
Δm. The region where the linearization is valid depends on the smoothness of g_i(m).

2. Note that G now changes with each iteration. That is, one may obtain a different G for
each spot in solution space. Having to reform G at each step can be very time (computer)
intensive, and often one uses the same G for more than one iteration.


4.3 General Procedure for Nonlinear Problems


Step 1. Pick some starting model vector m_0.

Step 2. Calculate the predicted data vector d̂ and form the misfit vector

c = d − d̂   (4.18)

Step 3. Form

G_ij = ∂g_i(m)/∂m_j |_{m=m_0}   (4.19)

Step 4. Solve for Δm using any appropriate inverse operator (i.e., least squares, minimum
length, weighted least squares, etc.)

Step 5. Form a new model parameter vector

m_1 = m_0 + Δm   (4.20)

One repeats Steps 1–5 until Δm becomes sufficiently small (convergence is obtained), or until c
becomes sufficiently small (acceptable misfit), or until a maximum number of iterations is reached
(failsafe). Note that m_i (note the boldfaced m) is the estimate of the model parameters at the ith
iteration, and not the ith component of the model parameter vector.
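
A minimal numpy sketch of this procedure is given below. The forward function g(m), its analytic partial derivatives, and the synthetic data are illustrative assumptions, with the least squares inverse used in Step 4.

```python
import numpy as np

# Illustrative nonlinear forward problem: d_i = g_i(m) = m1 * exp(m2 * z_i)  (assumed)
z = np.linspace(0.0, 1.0, 8)
m_true = np.array([2.0, -1.5])
d_obs = m_true[0] * np.exp(m_true[1] * z)

def g(m):
    return m[0] * np.exp(m[1] * z)

def G_matrix(m):
    # Partial derivatives of g_i with respect to m1 and m2, evaluated at m (Eq. 4.19)
    return np.column_stack([np.exp(m[1] * z), m[0] * z * np.exp(m[1] * z)])

m = np.array([1.0, -1.0])                     # Step 1: starting model m0
for it in range(20):
    c = d_obs - g(m)                          # Step 2: misfit vector (Eq. 4.18)
    G = G_matrix(m)                           # Step 3
    dm = np.linalg.solve(G.T @ G, G.T @ c)    # Step 4: least squares for delta-m
    m = m + dm                                # Step 5
    if np.linalg.norm(dm) < 1e-10:            # convergence test
        break

print(m)    # converges toward m_true = [2.0, -1.5]
```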


4.4 Three Examples


4.4.1 A Linear Example

Suppose g_i(m) = d_i is linear and of the form

2m_1 = 4   (4.21)

With only one equation, we have G = [2], m = [m_1], and d = [4]. (I know, I know. It's easy!)
Then

∂d_1/∂m_1 = G_11 = 2  (for all m_1)   (4.22)

Suppose that the initial estimate of the model vector is m_0 = [0]. Then d̂ = Gm_0 = [2][0] = [0], and
we have

c = d − d̂ = [4] − [0] = [4]   (4.23)


or the change in the first and only element of our misfit vector is c_1 = 4. Looking at our lone
equation then,

G_11 Δm_1 = c_1   (4.24)

or 2Δm_1 = 4   (4.25)

or Δm_1 = 2   (4.26)

Since this is the only element in our model-change vector [in this case, (Δm_1)_1 = Δm_1], we have
Δm_1 = [2], and our next approximation of the model vector, m_1, then becomes

m_1 = m_0 + Δm_1 = [0] + [2] = [2]   (4.27)

We have just completed Steps 1–5 for the first iteration. Now it is time to update the
misfit vector and see if we have reached a solution. Thus, for the predicted data we obtain

d̂ = Gm_1 = [2][2] = [4]   (4.28)

and for the misfit we have

c = d − d̂ = [4] − [4] = [0]   (4.29)



which indicates that the solution has converged in one iteration. To see that the solution does not
depend on the starting point if Equation (4.2) is linear, let's start with

(m_0)_1 = 1000 = m_1   (4.30)

Considering the one and only element of our predicted-data and misfit vectors, we have

d̂_1 = 2 × 1000 = 2000   (4.31)

and c_1 = 4 − 2000 = −1996   (4.32)

then 2Δm_1 = −1996   (4.33)

or Δm_1 = −998   (4.34)
Since Δm_1 is the only element of our first model-change vector, Δm_1, we have Δm_1 = [−998],
and therefore

m_1 = m_0 + Δm_1 = [1000] + [−998] = [2]   (4.35)

As before, the solution has converged in one iteration. This is a general conclusion if the
relationship between the data and the model parameters is linear. This problem also illustrates
that the nonlinear approach outlined above works when g_i(m) is linear.

Consider the following graphs for the system 2m_1 = 4:

(Figure: two plots versus m_1. The left plot shows d̂ = g(m) = 2m_1, a straight line through the
origin with slope 2; the right plot shows the misfit c = d − d̂ = 4 − 2m_1.)

We note the following:

1. For d̂ versus m_1, we see that the slope ∂g(m_1)/∂m_1 = 2 for all m_1.

2. For our initial guess of (m_0)_1 = 0, c = 4 − 0 = 4, denoted by a square symbol on the plot
of c versus m_1. We extrapolate back down the slope to the point (m_1) where c = 0 to
obtain our answer.

3. Because the slope does not change (the problem is linear), the starting guess m_0 has no
effect on the final solution. We can always get to c = 0 in one iteration.


4.4.2 A Nonlinear Example

Now consider, as a second example of the form g_1(m) = d_1, the following:

2m³ = 16   (4.36)

Since we have only one unknown, I chose to drop the subscript. Instead, I will use the subscript
to denote the iteration number. For example, m_3 will be the estimate of the model parameter m at
the third iteration. Note also that, by inspection, m = 2 is the solution.
Working through this example as we did the last one, we first note that G_11, at the ith
iteration, will be given by

\[ G_{11} = \left.\frac{\partial g(m)}{\partial m}\right|_{m=m_i}
 = \left.\frac{\partial (2m^3)}{\partial m}\right|_{m=m_i} = 6m_i^2 \]   (4.37)

Note also that G_11 is now a function of m.


Iteration 1. Let us pick as our starting model

m_0 = 1   (4.38)

then G_11 = ∂g(m)/∂m |_{m=m_0} = 6m² = 6   (4.39)

also d̂ = 2 × 1³ = 2   (4.40)

and c = d − d̂ = 16 − 2 = 14

Because we have only one element in our misfit vector, we have c = [14], and the length
squared of c, ||c||², is given simply by (c)²:

(c)² = 14 × 14 = 196   (4.41)

Now, we find Δm_1, the change to m_0, as

6Δm_1 = 14   (4.42)

and Δm_1 = 14/6 = 2.3333   (4.43)

Thus, our estimate of the model parameter at the first iteration, m_1, is given by

m_1 = m_0 + Δm_1 = 1 + 2.3333 = 3.3333   (4.44)

Iteration 2. Continuing,

G_11 = 6m_1² = 66.66   (4.45)

and

d̂ = 2(3.333)³ = 74.07   (4.46)

thus c = d − d̂ = 16 − 74.07 = −58.07   (4.47)

now (c)² = 3372   (4.48)

and 66.66Δm_2 = −58.07   (4.49)

gives Δm_2 = −0.871   (4.50)

thus m_2 = m_1 + Δm_2 = 3.3333 − 0.871 = 2.462   (4.51)

Iteration 3. Continuing,

G_11 = 6m_2² = 36.37   (4.52)

and

d̂ = 29.847   (4.53)

thus c = −13.847   (4.54)

now (c)² = 192   (4.55)

and 36.37Δm_3 = −13.847   (4.56)

gives Δm_3 = −0.381   (4.57)

thus m_3 = m_2 + Δm_3 = 2.462 − 0.381 = 2.081   (4.58)

Iteration 4. (Will this thing ever end??)

G_11 = 6m_3² = 25.983   (4.59)

and

d̂ = 18.024   (4.60)

thus c = −2.024   (4.61)

now (c)² = 4.1   (4.62)

and 25.983Δm_4 = −2.024   (4.63)

gives Δm_4 = −0.078   (4.64)

thus m_4 = m_3 + Δm_4 = 2.081 − 0.078 = 2.003   (4.65)

Iteration 5. (When were computers invented???)

G_11 = 6m_4² = 24.072   (4.66)

and

d̂ = 16.072   (4.67)

thus c = −0.072   (4.68)

now (c)² = 0.005   (4.69)

and 24.072Δm_5 = −0.072   (4.70)

gives Δm_5 = −0.003   (4.71)

thus m_5 = m_4 + Δm_5 = 2.003 − 0.003 = 2.000   (4.72)

Iteration 6. Beginning, we have

G_11 = 6m_5² = 24.000   (4.73)

and

d̂ = 16.000   (4.74)

thus c = 0.000   (4.75)

and we quit!!!! We should note that we have quit because the misfit has been reduced to some
acceptable level (three significant figures in this case). The solution happens to be an integer,
and we have found it to three places. Most solutions are not integers, and we must decide how
many significant figures are justified. The answer depends on many things, but one of the most
important is the level of noise in the data. If the data are noisy, it does not make sense to claim
that a solution to seven places is meaningful.
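
The iteration above is compact enough to reproduce in a few lines of Python; this sketch simply repeats Steps 1–5 for g(m) = 2m³ and d = 16, starting from m_0 = 1, and prints essentially the same sequence of estimates (to rounding).

```python
# Linearized iteration for the single-parameter problem 2m^3 = 16, starting at m0 = 1
d = 16.0
m = 1.0                          # starting model m0
for it in range(1, 10):
    d_pre = 2.0 * m**3           # predicted datum
    c = d - d_pre                # misfit at the current model
    G11 = 6.0 * m**2             # partial derivative, Equation (4.37)
    dm = c / G11                 # solve G11 * dm = c
    m = m + dm                   # updated model estimate
    print(f"iteration {it}: m = {m:.4f}, c = {c:.3f}, (c)^2 = {c*c:.1f}")
    if abs(c) < 1e-3:            # acceptable misfit reached
        break
```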

Consider the following graph for this problem:

(Figure: d̂ = g(m) = 2m³ plotted versus m. The slope ∂g(m)/∂m = 6m_i² equals 6 at m = 1;
extrapolating this slope to d̂ = 16 gives m = 3.333.)


Note that the slope at m = 1, when extrapolated to give d̂ = 16, yields m = 3.333. The solution
then iterates back down the curve to the correct solution m = 2.

Consider plotting c, rather than d̂, versus m (see the diagram below). This is perhaps
more useful because when we solve for Δm_i, we are always extrapolating to c = 0.

(Figure: c = 16 − 2m³ plotted versus m, showing c = 14 at m_0 = 1 and the slope at m_0 = 1
extrapolated to c = 0 at m = 3.333.)


For this example, we see that at m_0 = 1, the slope ∂g(m)/∂m = 6. We used this slope to
extrapolate to the point where c = 0.

At the second iteration, m_1 = 3.333 and is farther from the true solution (m = 2) than was
our starting model. Also, the length squared of the misfit is 3372, much worse than the misfit
(196) at our initial guess. This makes an important point: you can still get to the right answer
even if some iteration takes you farther from the solution than where you have been. This is
especially true for early steps in the iteration when you may not be close to the solution.

Note also that if we had started with m_0 closer to zero, the shallow slope would have sent
us to an even higher value for m_1. We would still have recovered, though (do you see why?).
The shallow slope corresponds to a small singular value, illustrating the problems associated
with small singular values. We will consider singular-value analysis in the next chapter.

What do you think would happen if you take m_0 < 0???? Would it still converge to the
correct solution? Try m_0 = −1 if your curiosity has been piqued!


4.4.3 Nonlinear Straight-Line Example

An interesting nonlinear problem is fitting a straight line to a set of data points (y_i, z_i)
which may contain errors, or noise, along both the y and z axes. One could cast the problem as

y_i = a + bz_i    i = 1, . . . , N   (4.76)

Assuming z were perfectly known, one obtains a solution for a, b by a linear least squares
approach [see Equations (3.32) and (3.37)–(3.39)].

Similarly, if y were perfectly known, one obtains a solution for

z_i = c + dy_i    i = 1, . . . , N   (4.77)

again using (3.32) and (3.37)–(3.39). These two lines can be compared by rewriting (4.77) as a
function of y, giving

y_i = −(c/d) + (1/d)z_i    i = 1, . . . , N   (4.78)

In general, a ≠ −c/d and b ≠ 1/d because in (4.76) we assumed all of the error, or
misfit, was in y, while in (4.77) we assumed that all of the error was in z. Recall that the quantity
being minimized in (4.76) is

\[ E_1 = [\mathbf{y}^{\mathrm{obs}} - \mathbf{y}^{\mathrm{pre}}]^T[\mathbf{y}^{\mathrm{obs}} - \mathbf{y}^{\mathrm{pre}}]
 = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \]   (4.79)

where ŷ_i is the predicted y value. The comparable quantity for (4.77) is

\[ E_2 = [\mathbf{z}^{\mathrm{obs}} - \mathbf{z}^{\mathrm{pre}}]^T[\mathbf{z}^{\mathrm{obs}} - \mathbf{z}^{\mathrm{pre}}]
 = \sum_{i=1}^{N}(z_i - \hat{z}_i)^2 \]   (4.80)

where ẑ_i is the predicted z value.

For the best fit line in which both y and z have errors, the function to be minimized is

\[ E = \begin{bmatrix} \mathbf{y} - \hat{\mathbf{y}} \\ \mathbf{z} - \hat{\mathbf{z}} \end{bmatrix}^T
\begin{bmatrix} \mathbf{y} - \hat{\mathbf{y}} \\ \mathbf{z} - \hat{\mathbf{z}} \end{bmatrix} \]   (4.81)

\[ E = \sum_{i=1}^{N}\left[ (y_i - \hat{y}_i)^2 + (z_i - \hat{z}_i)^2 \right] \]   (4.82)
where y − ŷ and z − ẑ together compose a vector of dimension 2N if N is the number of pairs
(y_i, z_i).

Consider the following diagram:


(Figure: a data point P(y_i, z_i), its projection A onto the line along the y direction, its projection C
along the z direction, and its perpendicular projection B(ŷ_i, ẑ_i) onto the line.)


Line PA above represents the misfit in y (see 4.79), line PC represents the misfit in z (see 4.80),
while line PB represents the misfit for the combined case (see 4.82).

In order to minimize (4.82) we must be able to write the forward problem for the
predicted data (ŷ_i, ẑ_i). Let the solution we seek be given by

y = m_1 + m_2 z   (4.83)

Line PB is perpendicular to (4.83), and thus has a slope of −1/m_2. The equation of a line through
P(y_i, z_i) with slope −1/m_2 is given by

y − y_i = (−1/m_2)(z − z_i)   (4.84)

or

y = (1/m_2)z_i + y_i − (1/m_2)z   (4.85)

The point B(ŷ_i, ẑ_i) is thus the intersection of the lines given by (4.83) and (4.85). Equating the
right-hand sides of (4.83) and (4.85) for y = ŷ_i and z = ẑ_i gives

m_1 + m_2 ẑ_i = (1/m_2)z_i + y_i − (1/m_2)ẑ_i   (4.86)

which can be solved for ẑ_i as

\[ \hat{z}_i = \frac{-m_1 m_2 + z_i + m_2 y_i}{1 + m_2^2} \]   (4.87)

Rearranging (4.83) and (4.85) to give z as a function of y and again equating for y = ŷ_i
and z = ẑ_i gives

(1/m_2)(ŷ_i − m_1) = −m_2 ŷ_i + z_i + m_2 y_i   (4.88)

which can be solved for ŷ_i as

\[ \hat{y}_i = \frac{m_1 + m_2 z_i + m_2^2 y_i}{1 + m_2^2} \]   (4.89)

Substituting (4.89) for ŷ_i and (4.87) for ẑ_i into (4.82) now gives E as a function of the
unknowns m_1 and m_2. The approach used for the linear problem was to take partials of E with
respect to m_1 and m_2, set them equal to zero, and solve for m_1 and m_2, as was done in (4.10) and
(3.13)–(3.14). Unfortunately, the resulting equations for the partials of E with respect to m_1 and m_2
in (4.82) are not linear in m_1 and m_2 and cannot be cast in the linear form

G^T Gm = G^T d   (3.21)

Instead, we must consider (4.87) and (4.89) to be of the form of (4.2):

d_i = g_i(m)   (4.2)

and linearize the problem by expanding (4.87) and (4.89) in a Taylor series about some starting
guesses ẑ_0 and ŷ_0, respectively. This requires taking partials of ŷ_i and ẑ_i with respect to m_1
and m_2, which can be obtained from (4.87) and (4.89).

Consider the following data set, shown also on the diagram below.
y_i:  1  4  5
z_i:  1  2  5



The linear least squares solution to (4.76)

y_i = a + bz_i    i = 1, 2, 3   (4.76)

is

y_i = 1.077 + 0.846z_i    i = 1, 2, 3   (4.90)

The linear least squares solution to (4.77)

z_i = c + dy_i    i = 1, 2, 3   (4.77)

is

z_i = −0.154 + 0.846y_i    i = 1, 2, 3   (4.91)

For comparison with a and b above, we can rewrite (4.91) with y as a function of z as

y_i = 0.182 + 1.182z_i    i = 1, 2, 3   (4.92)

The nonlinear least squares solution which minimizes

\[ E = \sum_{i=1}^{N}\left[ (y_i - \hat{y}_i)^2 + (z_i - \hat{z}_i)^2 \right] \]   (4.82)
is given by

y_i = 0.667 + 1.000z_i    i = 1, 2, 3   (4.93)

From the figure you can see that the nonlinear solution lies between the other two solutions.
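
The small sketch below carries out this linearized iteration numerically for the three-point data set above, using finite differences for the partial derivatives of ŷ_i and ẑ_i with respect to m_1 and m_2; the starting model and tolerances are assumptions. It should converge to approximately the solution in (4.93).

```python
import numpy as np

# Data set from the notes
y = np.array([1.0, 4.0, 5.0])
z = np.array([1.0, 2.0, 5.0])

def predicted(m):
    """Predicted points on the line y = m1 + m2*z, Equations (4.87) and (4.89)."""
    m1, m2 = m
    z_hat = (-m1 * m2 + z + m2 * y) / (1.0 + m2**2)
    y_hat = (m1 + m2 * z + m2**2 * y) / (1.0 + m2**2)
    return np.concatenate([y_hat, z_hat])        # stacked 2N-vector

d_obs = np.concatenate([y, z])                   # stacked observed data
m = np.array([1.077, 0.846])                     # start from the linear fit (4.90)

for it in range(50):
    c = d_obs - predicted(m)                     # misfit vector
    # Finite-difference partials of the predicted data with respect to m1, m2
    G = np.zeros((2 * len(y), 2))
    h = 1.0e-6
    for j in range(2):
        dm = np.zeros(2); dm[j] = h
        G[:, j] = (predicted(m + dm) - predicted(m - dm)) / (2.0 * h)
    delta = np.linalg.solve(G.T @ G, G.T @ c)    # least squares step for delta-m
    m = m + delta
    if np.linalg.norm(delta) < 1.0e-8:
        break

print(m)    # approximately [0.667, 1.000], Equation (4.93)
```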

It is also possible to consider a weighted nonlinear least squares best fit to a data set. In
this case, we form a new E, after (3.83), as

\[ E = \begin{bmatrix} \mathbf{y} - \hat{\mathbf{y}} \\ \mathbf{z} - \hat{\mathbf{z}} \end{bmatrix}^T
\mathbf{W}_e
\begin{bmatrix} \mathbf{y} - \hat{\mathbf{y}} \\ \mathbf{z} - \hat{\mathbf{z}} \end{bmatrix} \]   (4.94)

and where W_e is a 2N × 2N weighting matrix. The natural choice for W_e is

\[ \left[ \operatorname{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{z} \end{pmatrix} \right]^{-1} \]

the inverse data covariance matrix. If the errors in y_i, z_i are uncorrelated, then the data
covariance matrix will be a diagonal matrix with the variances for y_i as the first N entries and the
variances for z_i as the last N entries. If we further let V_y and V_z be the variances for y_i and z_i,
respectively, then Equations (4.87) and (4.89) become


\[ \hat{z}_i = \frac{-m_1 m_2 V_z + V_y z_i + m_2 V_z y_i}{m_2^2 V_z + V_y} \]   (4.95)

and

\[ \hat{y}_i = \frac{m_1 V_y + m_2 V_y z_i + m_2^2 V_z y_i}{m_2^2 V_z + V_y} \]   (4.96)

If V_z = V_y, then dividing through either (4.95) or (4.96) by the variance returns (4.87) or (4.89).
Thus we see that weighted least squares techniques are equivalent to general least squares
techniques when all the data variances are equal and the errors are uncorrelated, as expected.

Furthermore, dividing both the numerator and denominator of (4.96) by V_y yields

\[ \hat{y}_i = \frac{m_1 + m_2 z_i + m_2^2 (V_z/V_y)\,y_i}{m_2^2 (V_z/V_y) + 1} \]   (4.97)

Then, in the limit that V_z goes to zero, (4.97) becomes

ŷ_i = m_1 + m_2 z_i   (4.98)
which is just the linear least squares solution of (4.76). That is, if z_i is assumed to be perfectly
known (V_z = 0), the nonlinear problem reduces to a linear problem. Similar arguments can be
made for (4.95) to show that the linear least squares solution of (4.77) results when V_y goes to
zero.


4.5 Creeping vs. Jumping (Shaw and Orcutt, 1985)


The general procedure described in Section 4.3 is termed "creeping" by Shaw and Orcutt
(Shaw, P. R., and Orcutt, J. A., Waveform inversion of seismic refraction data and applications
to young Pacific crust, Geophys. J. Roy. Astr. Soc., 82, 375-414, 1985). It finds, from the set of
acceptable misfit models, the one closest to the starting model under the Euclidian norm.

(Figure: the starting model m_0 lies outside the region of acceptable misfit models; successive
iterates m_1, m_2, m_3 creep from m_0 toward that region.)


Because of the nonlinearity, often several iterations are required to reach the desired
misfit. The acceptable level of misfit is free to vary and can be viewed as a parameter that
controls the trade-off between satisfying the data and keeping the model perturbation small.
Because the starting model itself is physically reasonable, the unphysical model estimates tend to
be avoided. There are several potential disadvantages to the creeping strategy. Creeping
analysis depends significantly on the choice of the initial model. If the starting model is changed
slightly, a new final model may well be found. In addition, constraints applied to model
perturbations may not be as meaningful as those applied directly to the model parameters.
(Figure: two slightly different starting models, m_0 and m_0', creep to two different final models.)



Parker (1994) introduced an alternative approach with a simple algebraic substitution
(Parker, R. L., Geophysical Inverse Theory, Princeton University Press, 1994). The new method,
called jumping, directly calculates the new model in a single step rather than calculating a
perturbation to the initial model. Now, any suitable norm can be applied to the model rather than
to the perturbations.

This new strategy is motivated, in part, by the desire to map the neighborhood of starting
models near m_0 to a single final model, thus making the solution less sensitive to small changes in
m_0.


Let's write the original nonlinear equations as

g(m) = d   (4.99)

After linearization about an initial model m_0, we have

G Δm = c   (4.100)

where

\[ \mathbf{G} = \left.\frac{\partial \mathbf{g}}{\partial \mathbf{m}}\right|_{\mathbf{m}_0},
\qquad \mathbf{c} = \mathbf{d} - \mathbf{g}(\mathbf{m}_0) \]   (4.101)

The algebraic substitution suggested by Parker is to simply add Gm_0 to both sides,
yielding

G Δm + Gm_0 = c + Gm_0

then

G[Δm + m_0] = c + Gm_0   (4.102)

and

Gm_1 = c + Gm_0   (4.103)

or

Gm_1 = d − g(m_0) + Gm_0   (4.104)

At this point, this equation is algebraically equivalent to our starting linearized equation.
But the crucial difference is that now we are solving directly for the model m rather than a
perturbation Δm. This slight algebraic difference means we can now apply any suitable
constraint to the model. A good example is a smoothing constraint. If the initial model is not
smooth, applying a smoothing constraint to the model perturbations may not make sense. In the
new formulation, we can apply the constraint directly to the model. In the jumping scheme, the
new model is computed directly, and the norm of this model is minimized relative to an absolute
origin 0 corresponding to this norm.
(Figure: with jumping, nearby starting models m_0 and m_0' map to the same final model, whose
norm is measured from the origin 0.)


The explicit dependence on the starting model is greatly reduced.

In our example of the second derivative smoothing matrix D, we can now apply this
directly to the jumping equations:

\[ \begin{bmatrix} \mathbf{G} \\ \mathbf{D} \end{bmatrix}\mathbf{m}_1
 = \begin{bmatrix} \mathbf{c} + \mathbf{G}\mathbf{m}_0 \\ \mathbf{0} \end{bmatrix} \]   (4.105)

We should keep in mind that since the problem is nonlinear, there is still no guarantee that the
final model will be unique, and even this "jumping" scheme is iterative.
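
A compact sketch of one jumping step is given below; the forward operator, the smoothing weight, and the starting model are all illustrative assumptions. The point is only that the augmented system of (4.105) is solved directly for the new model m_1, rather than for a perturbation, and is compared with a creeping step.

```python
import numpy as np

# Illustrative linearized pieces at the current model m0 (assumed values)
rng = np.random.default_rng(3)
N, M = 12, 8
G = rng.standard_normal((N, M))       # partial derivatives of g at m0
m0 = rng.standard_normal(M)           # current (not necessarily smooth) model
d = rng.standard_normal(N)            # observed data
c = d - G @ m0                        # stand-in for d - g(m0) in this linear sketch

# Second-difference smoothing matrix D
D = np.zeros((M - 2, M))
for i in range(M - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

eps = 0.5                             # smoothing weight (assumed trade-off value)

# Jumping, after Equation (4.105): solve directly for the new model m1
A = np.vstack([G, eps * D])
b = np.concatenate([c + G @ m0, np.zeros(M - 2)])
m1_jump, *_ = np.linalg.lstsq(A, b, rcond=None)

# Creeping, for comparison: smooth the perturbation, then add it to m0
dm, *_ = np.linalg.lstsq(A, np.concatenate([c, np.zeros(M - 2)]), rcond=None)
m1_creep = m0 + dm

# Typically the jumping model is smoother, since the constraint acts on m1 itself
print(np.linalg.norm(D @ m1_jump), np.linalg.norm(D @ m1_creep))
```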


To summarize, the main advantage of the jumping scheme is that the new model is
calculated directly. Thus, constraints can be imposed directly on the new model. The
minimization of the squared misfit can be traded off with the constraint measures, allowing
optimization of some physically significant quantity. This tends to reduce the strong dependence
on the initial model that is associated with the creeping scheme.

We now turn our attention to the generalized inverse in the next three chapters. We will
begin with eigenvalue problems, and then continue with singular-value problems, the generalized
inverse, and ways of quantifying the quality of the solution.




CHAPTER 5: THE EIGENVALUE PROBLEM


5.1 Introduction


In Chapter 3 the emphasis was on developing inverse operators and solutions based on
minimizing some combination of the (perhaps weighted) error vector e and the model parameter
vector m. In this chapter we will change the emphasis to the operator G itself. Using the
concepts of vector spaces and linear transformations, we will consider how G maps model
parameter vectors into predicted data vectors. This approach will lead to the generalized inverse
operator. We will see that the operators introduced in Chapter 3 can be thought of as special
cases of the generalized inverse operator. The power of the generalized inverse approach,
however, lies in the ability to assess the resolution and stability of the solution based on the
mapping back and forth between model and data spaces.

We will begin by considering the eigenvalue problem for square matrices, and extend this
to the shifted eigenvalue problem for nonsquare matrices. Once we have learned how to
decompose a general matrix using the shifted eigenvalue problem, we will introduce the
generalized inverse operator, show how it reduces to the operators and solutions from Chapter 3
in special cases, and consider measures of quality and stability for the solution.

The eigenvalue problem plays a central role in the vector space representation of inverse
problems. Although some of this is likely to be review, it is important to cover it in some detail
so that the underlying linear algebra aspects are made clear. This will lay the foundation for an
understanding of the mapping of vectors back and forth between model and data spaces.


5.2 Eigenvalue Problem for the Square (M × M) Matrix A


5.2.1 Background

Given the following system of linear equations

Ax = b (5.1)

where A is a general M × M matrix, and x and b are M × 1 column vectors, respectively, it is
natural to immediately plunge in, calculate A^{-1}, the inverse of A, and solve for x. Presumably,
there are lots of computer programs available to invert square matrices, so why not just hand the
problem over to the local computer hack and be done with it? The reasons are many, but let it
suffice to say that in almost all problems in geophysics, A^{-1} will not exist in the mathematical
sense, and our task will be to find an approximate solution, we hope along with some measure of
the quality of that solution.

One approach to solving Equation (5.1) above is through the eigenvalue problem for A.
It will have the benefit that, in addition to finding A^{-1} when it exists, it will lay the groundwork
for finding approximate solutions for the vast majority of situations where the exact,
mathematical inverse does not exist.

In the eigenvalue problem, the matrix A is thought of as a linear transformation that
transforms one M-dimensional vector into another M-dimensional vector. Eventually, we will
seek the particular x that solves Ax = b, but the starting point is the more general problem of
transforming any particular M-dimensional vector into another M-dimensional vector.

We begin by defining M space as the space of all M-dimensional vectors. For example, 2
space is the space of all vectors in a plane. The eigenvalue problem can then be defined as
follows:

For A, an M × M matrix, find all vectors s in M space such that

As = λs   (5.2)

where λ is a constant called the eigenvalue and s is the associated eigenvector.

In words, this means find all vectors s that, when operated on by A, return a vector that points in
the same direction (up to the sign of the vector) with a length scaled by the constant λ.


5.2.2 How Many Eigenvalues, Eigenvectors?

If As = λs, then

[A − λI_M]s = 0_M   (5.3)

where 0_M is an M × 1 vector of zeros. Equation (5.3) is called a homogeneous equation because
the right-hand side consists of zeros. It has a nontrivial solution (e.g., s ≠ [0, 0, . . . , 0]^T) if and
only if the determinant of [A − λI_M] is zero, or

|A − λI_M| = 0   (5.4)

where


=
MM M
M
M
M
a a
a a a
a a a
L
M O M
L
1
2 22 21
1 12 11
I A
(5.5)

Equation (5.4), also called the characteristic equation, is a polynomial of order M in λ of
the form

\[ \lambda^M + C_{M-1}\lambda^{M-1} + C_{M-2}\lambda^{M-2} + \cdots + C_0\lambda^0 = 0 \]   (5.6)

Equation (5.6) has M and only M roots for λ. In general, these roots may be complex (shudder!).
However, if A is Hermitian, then all of the M roots are real (whew!). They may, however, have
repeated values (ugh!), be negative (ouch), or zero (yuk).

We may write the M roots for λ as

λ_1, λ_2, . . . , λ_M

For each λ_i an associated eigenvector s_i which solves Equation (5.2) can be (easily) found. It is
important to realize that the ordering of the λ_i is completely arbitrary.

Equation (5.2), then, holds for M values of λ_i. That is, we have

λ = λ_1,  s = (s_1^1, s_2^1, . . . , s_M^1)^T = s_1
λ = λ_2,  s = (s_1^2, s_2^2, . . . , s_M^2)^T = s_2   (5.7)
  ⋮
λ = λ_M,  s = (s_1^M, s_2^M, . . . , s_M^M)^T = s_M


About the above notation: s_i^j is the ith component of the jth eigenvector. It turns out to be very
inconvenient to use superscripts to denote which eigenvector it is, so s_j will denote the jth
eigenvector. Thus, if I use only a subscript, it refers to the number of the eigenvector. Of course,
if it is a vector instead of a component of a vector, it should always appear as a bold-face letter.

The length of eigenvectors is arbitrary. To see this, note that if

As_i = λ_i s_i   (5.8)

then

A(2s_i) = λ_i(2s_i)   (5.9)

since A is a linear operator.


5.2.3 The Eigenvalue Problem in Matrix Notation

Now that we know that there are M values of λ_i and M associated eigenvectors, we can
write Equation (5.2) as

As_i = λ_i s_i    i = 1, 2, . . . , M   (5.10)

Consider, then, the following matrices:

\[ \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\
\vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_M \end{bmatrix} \]   (5.11)

which is M × M, and

\[ \mathbf{S} = \begin{bmatrix} \mathbf{s}_1 & \mathbf{s}_2 & \cdots & \mathbf{s}_M \end{bmatrix} \]   (5.12)

which is also M × M,

where the ith column of S is the eigenvector s_i associated with the ith eigenvalue λ_i. Then
Equation (5.10) can be written in compact, matrix notation as

AS = SΛ   (5.13)

where the order of the matrices on the right-hand side is important! To see this, let us consider
the ith columns of [AS] and [SΛ], respectively. The first component of the ith column of [AS] is
given by


\[ (\mathbf{AS})_{1i} = \sum_{k=1}^{M} a_{1k}s_{ki} = \sum_{k=1}^{M} a_{1k}s_k^i \]   (5.14)

where s_{ki} = s_k^i is the kth component of the ith eigenvector.

Similarly, components 2 through M of the ith column are given by

\[ (\mathbf{AS})_{2i} = \sum_{k=1}^{M} a_{2k}s_{ki} = \sum_{k=1}^{M} a_{2k}s_k^i \]
  ⋮

\[ (\mathbf{AS})_{Mi} = \sum_{k=1}^{M} a_{Mk}s_{ki} = \sum_{k=1}^{M} a_{Mk}s_k^i \]   (5.15)

Therefore, the ith column of [AS] is given by

\[ [\mathbf{AS}]_i = \begin{bmatrix} \sum_{k=1}^{M} a_{1k}s_k^i \\ \sum_{k=1}^{M} a_{2k}s_k^i \\ \vdots \\
\sum_{k=1}^{M} a_{Mk}s_k^i \end{bmatrix} = \mathbf{A}\mathbf{s}_i \]   (5.16)

That is, the ith column of [AS] is given by the product of A and the ith eigenvector s_i.

Now, the first element in the ith column of [SΛ] is found as follows:

\[ [\mathbf{S}\boldsymbol{\Lambda}]_{1i} = \sum_{k=1}^{M} s_{1k}\Lambda_{ki} \]   (5.17)

But,

\[ \Lambda_{ki} = \lambda_i\,\delta_{ki} = \begin{cases} 0, & k \neq i \\ \lambda_i, & k = i \end{cases} \]   (5.18)

where δ_{ki} is the Kronecker delta. Therefore,

[SΛ]_{1i} = s_1^i λ_i   (5.19)

Entries 2 through M are then given by

\[ [\mathbf{S}\boldsymbol{\Lambda}]_{2i} = \sum_{k=1}^{M} s_{2k}\Lambda_{ki} = s_2^i\lambda_i \]
  ⋮
\[ [\mathbf{S}\boldsymbol{\Lambda}]_{Mi} = \sum_{k=1}^{M} s_{Mk}\Lambda_{ki} = s_M^i\lambda_i \]   (5.20)

Thus, the ith column of [SΛ] is given by


\[ [\mathbf{S}\boldsymbol{\Lambda}]_i = \begin{bmatrix} s_1^i\lambda_i \\ s_2^i\lambda_i \\ \vdots \\ s_M^i\lambda_i \end{bmatrix}
 = \lambda_i\begin{bmatrix} s_1^i \\ s_2^i \\ \vdots \\ s_M^i \end{bmatrix} = \lambda_i\mathbf{s}_i \]   (5.21)

Thus, we have that the original equation defining the eigenvalue problem [Equation (5.10)]

As_i = λ_i s_i

is given by the ith columns of the matrix equation

AS = SΛ   (5.13)

Consider, on the other hand, the ith column of ΛS:

\[ [\boldsymbol{\Lambda}\mathbf{S}]_{1i} = \sum_{k=1}^{M}\Lambda_{1k}s_{ki} = \lambda_1 s_1^i \]

\[ [\boldsymbol{\Lambda}\mathbf{S}]_{2i} = \sum_{k=1}^{M}\Lambda_{2k}s_{ki} = \lambda_2 s_2^i \]
  ⋮
\[ [\boldsymbol{\Lambda}\mathbf{S}]_{Mi} = \sum_{k=1}^{M}\Lambda_{Mk}s_{ki} = \lambda_M s_M^i \]   (5.22)

Therefore, the ith column of [ΛS] is given by

[ΛS]_i = Λs_i   (5.23)

That is, the product of Λ with the ith eigenvector s_i. Clearly, this is not equal to As_i above in
Equation (5.10).


5.2.4 Summarizing the Eigenvalue Problem for A

In review, we found that we could write the eigenvalue problem for Ax = b as

AS = SΛ   (5.13)

where


\[ \mathbf{S} = \begin{bmatrix} \mathbf{s}_1 & \mathbf{s}_2 & \cdots & \mathbf{s}_M \end{bmatrix} \]   (5.12)
is an M × M matrix, each column of which is an eigenvector s_i such that

As_i = λ_i s_i   (5.10)

and where


\[ \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\
\vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_M \end{bmatrix} \]   (5.11)

is a diagonal M × M matrix with the eigenvalue λ_i of As_i = λ_i s_i along the diagonal and zeros
everywhere else.


5.3 Geometrical Interpretation of the Eigenvalue Problem
for Symmetric A


5.3.1 Introduction

It is possible, of course, to cover the mechanics of the eigenvalue problem without ever
considering a geometrical interpretation. There is, however, a wealth of information to be
learned from looking at the geometry associated with the operator A. This material is not
covered in Menke's book, although some of it is covered in Statistics and Data Analysis in
Geology, 2nd Edition, 1986, pages 131–139, by J. C. Davis, published by John Wiley and Sons.

We take as an example the symmetric 2 × 2 matrix A given by

\[ \mathbf{A} = \begin{bmatrix} 6 & 2 \\ 2 & 3 \end{bmatrix} \]   (5.24)

Working with a symmetric matrix assures real eigenvalues. We will discuss the case for the
general square matrix later.

The eigenvalue problem is solved by

|A − λI| = 0   (5.25)

where it is found that λ_1 = 7, λ_2 = 2, and the associated eigenvectors are given by

s_1 = [0.894, 0.447]^T   (5.26)

and

s_2 = [−0.447, 0.894]^T   (5.27)

respectively. Because A is symmetric, s_1 and s_2 are perpendicular to each other. The length of
eigenvectors is arbitrary, but for orthogonal eigenvectors it is common to normalize them to unit
length, as done in Equations (5.26) and (5.27). The sign of eigenvectors is arbitrary as well, and
I have chosen the signs for s_1 and s_2 in Equations (5.26) and (5.27) for convenience when I later
relate them to orthogonal transformations.
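
This 2 × 2 example is easy to verify numerically; the short check below uses numpy's symmetric eigenvalue routine and also confirms the rotation into the principal coordinate system discussed in Section 5.3.3 (anticipating Equation (5.43)).

```python
import numpy as np

A = np.array([[6.0, 2.0],
              [2.0, 3.0]])

# Eigenvalues and (unit-length, orthogonal) eigenvectors of the symmetric matrix A
lam, S = np.linalg.eigh(A)          # eigh returns eigenvalues in ascending order
print(lam)                          # [2. 7.], i.e. lambda_1 = 7, lambda_2 = 2
print(S)                            # columns are eigenvectors (possibly with flipped signs)

# Check the defining relation A s_i = lambda_i s_i for each column
print(np.allclose(A @ S, S @ np.diag(lam)))     # True

# Rotating A into the principal coordinate system diagonalizes it, Equation (5.43)
print(S.T @ A @ S)                  # diag(2, 7) to rounding
```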


5.3.2 Geometrical Interpretation

The system of linear equations

Ax = b (5.1)

implies that the M × M matrix A transforms M × 1 vectors x into M × 1 vectors b. In our
example, M = 2, and hence x = [x_1, x_2]^T represents a point in 2-space, and b = [b_1, b_2]^T is just
another point in the same plane.

Consider the unit circle given by

x^T x = x_1² + x_2² = 1   (5.28)

Matrix A maps every point on this circle onto another point. The eigenvectors s_1 and s_2, having
unit length, are points on this circle. Then, since

As_i = λ_i s_i   (5.8)

we have

(
(
(

(
(
(

(
(
(

(
(
(

= = =
447 . 0
894 . 0
7
3.129
6.258
447 . 0
894 . 0

3 2
2 6
1
As (5.29)

and


(
(
(

(
(
(

(
(
(

(
(
(


= =

=
894 . 0
447 . 0
2
1.788
0.894 -
894 . 0
447 . 0

3 2
2 6
2
As (5.30)

As expected, when A operates on an eigenvector, it returns another vector parallel (or
antiparallel) to the eigenvector, scaled by the eigenvalue. When A operates on any direction
different from the eigenvector directions, it returns a vector that is not parallel (or antiparallel) to
the original vector.

What is the shape mapped out by A operating on the unit circle? We have already seen
where s_1 and s_2 map. If we map out this transformation for the unit circle, we get an ellipse, and
this is an important element of the geometrical interpretation. What is the equation for this
ellipse? We begin with our unit circle given by

x^T x = 1   (5.28)

and

Ax = b   (5.1)

then

x = A^{-1} b   (5.31)

assuming A^{-1} exists. Then, substituting (5.31) into (5.28) we get for a general A

[A^{-1}b]^T[A^{-1}b] = b^T[A^{-1}]^T A^{-1} b = 1   (5.32)

Where A is symmetric,

[A^{-1}]^T = A^{-1}   (5.33)

and in this case

b^T A^{-1} A^{-1} b = 1   (5.34)

After some manipulations, Equations (5.32)–(5.34) for the present choice of A given in Equation
(5.24) give

\[ \frac{b_1^2}{15.077} - \frac{b_1 b_2}{5.444} + \frac{b_2^2}{4.900} = 1 \]   (5.35)

which we recognize as the equation of an ellipse inclined to the b_1 and b_2 axes.

We can now make the following geometrical interpretation on the basis of the eigenvalue
problem:

1. The major and minor axis directions are given by the directions of the eigenvectors s_1
and s_2.

2. The lengths of the semimajor and semiminor axes of the ellipse are given by the absolute
values of the eigenvalues λ_1 and λ_2.

3. The columns of A (i.e., [6, 2]^T and [2, 3]^T in our example) are vectors from the origin to
points on the ellipse.

The third observation follows from the fact that A operating on [1, 0]^T and [0, 1]^T return
the first and second columns of A, respectively, and both unit vectors clearly fall on the unit
circle.

If one of the eigenvalues is negative, the unit circle is still mapped onto an ellipse, but
some points around the unit circle will be mapped back through the circle to the other side. The
absolute value of the eigenvalue still gives the length of the semiaxis.

The geometrical interpretation changes somewhat if A is not symmetrical. The unit circle
is still mapped onto an ellipse, and the columns of A still represent vectors from the origin to
points on the ellipse. Furthermore, the absolute values of the eigenvalues still give the lengths of
the semimajor and semiminor axes. The only difference is that the eigenvectors s_1 and s_2 will no
longer be orthogonal to each other and will not give the directions of the semimajor and
semiminor axes.

The geometrical interpretation holds for larger-dimensional matrices. In the present 2 × 2
case for A, the unit circle maps onto an ellipse. For a 3 × 3 case, a unit sphere maps onto an
ellipsoid. In general, the surface defined by the eigenvalue problem is of dimension one less than
the dimension of A.

(Figure: a diagram of this geometry, showing the unit circle mapped by A onto an inclined ellipse
whose semiaxes lie along s_1 and s_2 with lengths |λ_1| and |λ_2|.)
5.3.3 Coordinate System Rotation

For the case of a symmetric A matrix, it is possible to relate the eigenvectors of A with a
coordinate transformation that diagonalizes A. For the example given in Equation (5.24), we
construct the eigenvector matrix S with the eigenvectors s_1 and s_2 as the first and second
columns, respectively. Then

\[ \mathbf{S} = \begin{bmatrix} 0.894 & -0.447 \\ 0.447 & 0.894 \end{bmatrix} \]   (5.36)

Since s_1 and s_2 are unit vectors perpendicular to one another, we see that S represents an
orthogonal transformation, and

SS^T = S^T S = I_2   (5.37)

Now consider the coordinate transformation shown below:

(Figure: the (x′, y′) coordinate axes rotated by an angle θ counterclockwise from the (x, y) axes.)

where


\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = \mathbf{T}\begin{bmatrix} x \\ y \end{bmatrix}
 = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \]   (5.38)

and

\[ \begin{bmatrix} x \\ y \end{bmatrix} = \mathbf{T}^T\begin{bmatrix} x' \\ y' \end{bmatrix}
 = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x' \\ y' \end{bmatrix} \]   (5.39)
(5.39)

where T transforms [x, y]
T
to [x, y]
T
and T
T
transforms [x, y]
T
to [x, y], where is positive for
counterclockwise rotation from x to x, and where

T
T
T = TT
T
= I
2
(5.40)

It is always possible to choose the signs of s_1 and s_2 such that they can be associated with x′ and
y′, as I have done in Equations (5.26) and (5.27). Looking at the components of s_1, then, we see
that θ = 26.6°, and we further note that
S = T^T   (5.41)

The matrix A can also be thought of as a symmetric second-order tensor (such as stress or strain,
for example) in the original (x, y) coordinate system.

Tensors can also be rotated into the (x′, y′) coordinate system as

A′ = TAT^T   (5.42)

where A′ is the rotated tensor. Using Equation (5.41) to replace T in Equation (5.42) yields

A′ = S^T AS   (5.43)

If you actually perform this operation on the example, you will find that A′ is given by

\[ \mathbf{A}' = \begin{bmatrix} 7 & 0 \\ 0 & 2 \end{bmatrix} \]   (5.44)

Thus, we find that in the new coordinate system defined by the s_1 and s_2 axes, A is a
diagonal matrix with the diagonals given by the eigenvalues of A. If we were to write the
equation for the ellipse in Equation (5.32) in the (x′, y′) coordinates, we would find

\[ \frac{(x')^2}{49} + \frac{(y')^2}{4} = 1 \]   (5.45)

which is just the equation of an ellipse with semimajor and semiminor axes aligned with the
coordinate axes and of length 7 and 2, respectively. This new (x′, y′) coordinate system is often
called the principal coordinate system.

In summary, we see that the eigenvalue problem for symmetric A results in an ellipse
whose semimajor and semiminor axis directions and lengths are given by the eigenvectors and
eigenvalues of A, respectively, and that these eigenvectors can be thought of as the orthogonal
transformation that rotates A into a principal coordinate system where the ellipse is aligned with
the coordinate axes.


5.3.4 Summarizing Points

A few points can be made:

1. The trace of a matrix is unchanged by orthogonal transformations, where the trace is
defined as the sum of the diagonal terms, that is

\[ \operatorname{trace}(\mathbf{A}) = \sum_{i=1}^{M} a_{ii} \]   (2.18)
This implies that trace(A) = trace(A′). You can use this fact to verify that you have
correctly found the eigenvalues.

2. If the eigenvalues had been repeated, it would imply that the length of the two axes of the
ellipse are the same. That is, the ellipse would degenerate into a circle. In this case, the
uniqueness of the directions of the eigenvectors vanishes. Any two vectors, preferably
orthogonal, would suffice.

3. If one of the eigenvalues is zero, it means the minor axis has zero length, and the ellipse
collapses into a straight line. No information about the direction perpendicular to the
major axis can be obtained.


5.4 Decomposition Theorem for Square A


We have considered the eigenvalue problem for A. Now it is time to turn our attention to
the eigenvalue problem for A^T. It is important to do so because we will learn that A^T has the
same eigenvalues as A, but in general, different eigenvectors. We will be able to use the
information about shared eigenvalues to decompose A into a product of matrices.


5.4.1 The Eigenvalue Problem for A^T

The eigenvalue problem for A^T is given by

A^T r_i = μ_i r_i   (5.46)

where μ_i is the eigenvalue and r_i is the associated M × 1 column eigenvector. Proceeding in a
manner similar to the eigenvalue problem for A, we have

[A^T − μI_M]r = 0_M   (5.47)

This has nontrivial solutions for r if and only if

|A^T − μI_M| = 0   (5.48)

But mathematically, |B| = |B^T|. That is, the determinant is unchanged when you interchange rows
and columns! The implication is that the Mth-order polynomial in μ

\[ \mu^M + b_{M-1}\mu^{M-1} + \cdots + b_0\mu^0 = 0 \]   (5.49)

is exactly the same as the Mth-order polynomial in λ for the eigenvalue problem for A. That is,
A and A^T have exactly the same eigenvalues. Therefore,

μ_i = λ_i   (5.50)
5.4.2 Eigenvectors for A^T

In general, the eigenvectors s for A and the eigenvectors r for A^T are not the same.
Nevertheless, we can write the eigenvalue problem for A^T in matrix notation as

A^T R = RΛ   (5.51)

where

\[ \mathbf{R} = \begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \cdots & \mathbf{r}_M \end{bmatrix} \]   (5.52)

is an M × M matrix of the eigenvectors r_i of A^T r_i = λ_i r_i, and where


\[ \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\
\vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_M \end{bmatrix} \]   (5.53)

is the same eigenvalue matrix shared by A.


5.4.3 Decomposition Theorem for Square Matrices

Statement of the Theorem

We are now in a position to decompose the square matrix A as the product of three
matrices.

Theorem: The square M × M matrix A can be written as

A = SΛR^T   (5.54)

where S is the M × M matrix whose columns are the eigenvectors of A, R is the M ×
M matrix whose columns are the eigenvectors of A^T, and Λ is the M × M
diagonal matrix whose diagonal entries are the eigenvalues shared by A and A^T
and whose off-diagonal entries are zero.

We will use this theorem to find the inverse of A, if it exists. In this section, then, we will go
through the steps to show that the decomposition theorem is true. This involves combining the
results for the eigenvalue problems for A and A^T.

Proof of the Theorem

Step 1. Combining the results for A and A^T. We start with Equation (5.13):

AS = SΛ   (5.13)

Premultiply Equation (5.13) by R^T, which gives

R^T AS = R^T SΛ   (5.55)

Now, returning to Equation (5.51),

A^T R = RΛ   (5.51)

Taking the transpose of Equation (5.51),

[A^T R]^T = (RΛ)^T   (5.56)

or

R^T [A^T]^T = Λ^T R^T   (5.57)

or

R^T A = ΛR^T   (5.58)

since

[A^T]^T = A   (5.59)

and

Λ^T = Λ   (5.60)

Next, we postmultiply Equation (5.58) by S to get

R^T AS = ΛR^T S   (5.61)

Noting that Equations (5.55) and (5.61) have the same left-hand sides, we have

R^T SΛ = ΛR^T S   (5.62)

or

R^T SΛ − ΛR^T S = 0_M   (5.63)

where 0_M is an M × M matrix of all zeroes.


Step 2. Showing that R^T S = I_M. In order to proceed beyond Equation (5.63), we need to show
that

R^T S = I_M   (5.64)

This means two things. First, it means that

[R^T]^{-1} = S   (5.65)

and

S^{-1} = R^T   (5.66)

Second, it means the dot product of the eigenvectors in R (the columns of R) with the
eigenvectors in S is given by

\[ \mathbf{r}_i^T\mathbf{s}_j = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \]   (5.67)

That is, if i ≠ j, r_i has no projection onto s_j. If i = j, then the projection of r_i onto s_i can
be made to be 1. I say that it can be made to be 1 because the lengths of eigenvectors are
arbitrary. Thus, if the two vectors have a nonzero projection onto each other, the length
of one, or the other, or some combination of both vectors, can be changed such that the
projection onto each other is 1.

Let us consider, for the moment, the matrix product R^T S, and let

W = R^T S   (5.68)

Then Equation (5.63) implies

\[ \begin{bmatrix}
(\lambda_1-\lambda_1)W_{11} & (\lambda_2-\lambda_1)W_{12} & \cdots & (\lambda_M-\lambda_1)W_{1M} \\
(\lambda_1-\lambda_2)W_{21} & (\lambda_2-\lambda_2)W_{22} & \cdots & (\lambda_M-\lambda_2)W_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
(\lambda_1-\lambda_M)W_{M1} & (\lambda_2-\lambda_M)W_{M2} & \cdots & (\lambda_M-\lambda_M)W_{MM}
\end{bmatrix} = \mathbf{0}_M \]   (5.69)

Thus, independent of the values for W_ii, the diagonal entries, we have that

\[ \begin{bmatrix}
0 & (\lambda_2-\lambda_1)W_{12} & \cdots & (\lambda_M-\lambda_1)W_{1M} \\
(\lambda_1-\lambda_2)W_{21} & 0 & \cdots & (\lambda_M-\lambda_2)W_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
(\lambda_1-\lambda_M)W_{M1} & (\lambda_2-\lambda_M)W_{M2} & \cdots & 0
\end{bmatrix} = \mathbf{0}_M \]   (5.70)

If none of the eigenvalues is repeated (i.e., λ_i ≠ λ_j), then

(λ_i − λ_j) ≠ 0   (5.71)

and it follows that

W_ij = 0    i ≠ j   (5.72)

If λ_i = λ_j for some pair of eigenvalues, then it can still be shown that Equation (5.72)
holds. The explanation rests on the fact that when an eigenvalue is repeated, there is a
plane (or hyperplane if M > 3) associated with the eigenvalues rather than a single
direction, as is the case for the eigenvector associated with a nonrepeated eigenvalue.
One has the freedom to choose the eigenvectors r_i, s_i and r_j, s_j in such a way that W_ij =
0, while still having the eigenvectors span the appropriate planes. Needless to say, the
proof is much more complicated than for the case without repeated eigenvalues, and
will be left to the student as an exercise.

The end result, however, is that we are left with

	R^T S = diag(W_11, W_22, …, W_MM)		(5.73)

We recognize the W_ii entries as the dot product of r_i and s_i, given by

	W_ii = Σ_{k=1}^{M} r_ki s_ki = Σ_{k=1}^{M} (r_i)_k (s_i)_k = r_i^T s_i		(5.74)

If W_ii ≠ 0, then we can make W_ii = 1 by scaling r_i, s_i, or some combination of both. We can claim that W_ii ≠ 0 as follows:

1. r_i is orthogonal to s_j, i ≠ j.

   That is, r_i, a vector in M space, is perpendicular to M − 1 other vectors in M space. These M − 1 vectors are not all perpendicular to each other, or our work would be done.

2. However, the vectors in R (and S) span M space.

   That is, one can write an arbitrary vector in M space as a linear combination of the vectors in R (or S). Thus, r_i, which has no projection on M − 1 independent vectors in M space, must have some projection on the only vector left in S, s_i.

Since the projection is nonzero, one has the freedom to choose the r_i and s_i such that

	W_ii = Σ_{k=1}^{M} r_ki s_ki = r_i^T s_i = 1		(5.75)

Thus, finally, we have shown that it is possible to scale the vectors in R and/or S such that

	R^T S = I_M		(5.76)

This means that

	R^T = S^(−1)		(5.77)

and

	[R^T S]^T = I^T = I = S^T R		(5.78)

and

	S^T = R^(−1)		(5.79)

Thus, the inverse of one is the transpose of the other, etc.
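As a quick numerical check of these relationships (a minimal Python/numpy sketch; the matrix and variable names are illustrative only, not part of the notes), one can build R directly from S using R^T = S^(−1) and verify the decomposition:

import numpy as np

# Sketch: verify A = S Lambda R^T for a random nonsymmetric square A,
# using R^T = S^(-1) so that R^T S = I_M automatically.
M = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M))

lam, S = np.linalg.eig(A)          # A S = S Lambda (lam may be complex in general)
Lam = np.diag(lam)
R = np.linalg.inv(S).T             # columns of R are eigenvectors of A^T,
                                   # scaled so that R^T S = I_M

print(np.allclose(R.T @ S, np.eye(M)))   # R^T S = I_M
print(np.allclose(S @ Lam @ R.T, A))     # A = S Lambda R^T (imaginary parts ~ 0)
print(np.allclose(A.T @ R, R @ Lam))     # A^T R = R Lambda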

Before leaving this subject, I should emphasize that R and S are not orthogonal matrices. This is because, in general,

	R^T R ≠ I_M ≠ R R^T		(5.80)

and

	S^T S ≠ I_M ≠ S S^T		(5.81)

We cannot even say that S and R are orthogonal to each other. It is true that r_i is perpendicular to s_j, i ≠ j, because we have shown that

	r_i^T s_j = 0,   i ≠ j		(5.82)

but

	r_i^T s_i = 1		(5.83)

does not imply that r_i is parallel to s_i. Recall that

	r_i^T s_i = |r_i| |s_i| cos θ		(5.84)

where θ is the angle between r_i and s_i. The fact that you can choose r_i, s_i such that the dot product is equal to 1 does not require that θ be zero. In fact, it usually will not be zero. It will be zero, however, if A is symmetric. In that case, R and S become orthogonal matrices.


Step 3. Final justification that A = S Λ R^T. Finally, now that we have shown that R^T S = I_M, we go back to Equations (5.13) and (5.51):

	A S = S Λ		(5.13)

and

	A^T R = R Λ		(5.51)

If we postmultiply Equation (5.13) by R^T, we have

	A S R^T = S Λ R^T		(5.85)

But

	S R^T = I_M		(5.86)

because, if we start with

	R^T S = I_M		(5.76)

which we so laboriously proved in the last pages, and postmultiply by S^(−1), we obtain

	R^T S S^(−1) = S^(−1)		(5.87)

or

	R^T = S^(−1)		(5.88)

Premultiplying by S gives

	S R^T = I_M		(5.89)

as required. Therefore, at long last, we have

	A = S Λ R^T		(5.90)

Equation (5.90) shows that an arbitrary square M × M matrix A can be decomposed into the product of three matrices:

	S = [s_1 s_2 ⋯ s_M]		(5.12)

where S is an M × M matrix, each column of which is an eigenvector s_i such that

	A s_i = λ_i s_i		(5.10)

	R = [r_1 r_2 ⋯ r_M]		(5.52)

where R is an M × M matrix, each column of which is an eigenvector r_i such that

	A^T r_i = λ_i r_i		(5.47)

	Λ = diag(λ_1, λ_2, …, λ_M)		(5.11)

where Λ is a diagonal M × M matrix with the eigenvalues λ_i of

	A s_i = λ_i s_i		(5.10)

along the diagonal and zeros everywhere else.

A couple of points are worth noting before going on to use this theorem to find the inverse of A, if it exists. First, not all of the eigenvalues λ_i are necessarily real. Some may be zero. Some may even be repeated. Second, note also that taking the transpose of Equation (5.90) yields

	A^T = R Λ S^T		(5.91)



5.4.4 Finding the Inverse A^(−1) for the M × M Matrix A

The goal in this section is to use Equation (5.90) to find A^(−1), the inverse to A in the exact mathematical sense of

	A^(−1) A = I_M		(5.92)

and

	A A^(−1) = I_M		(5.93)

We start with Equation (5.10), the original statement of the eigenvalue problem for A

	A s_i = λ_i s_i		(5.10)

Premultiply by A^(−1) (assuming, for the moment, that it exists)

	A^(−1) A s_i = A^(−1) (λ_i s_i)
	             = λ_i A^(−1) s_i		(5.94)

But, of course,

	A^(−1) A = I_M

by Equation (5.92). Therefore,

	s_i = λ_i A^(−1) s_i		(5.95)

or, rearranging terms,

	A^(−1) s_i = (1/λ_i) s_i,   λ_i ≠ 0		(5.96)

Equation (5.96) is of the form of an eigenvalue problem. In fact, it is a statement of the eigenvalue problem for A^(−1)! The eigenvalues for A^(−1) are given by the reciprocals of the eigenvalues for A. A and A^(−1) share the same eigenvectors s_i.

Since we know the eigenvalues and eigenvectors for A, we can use the decomposition theorem for square matrices in Equation (5.90) to write A^(−1) as

	A^(−1) = S Λ^(−1) R^T		(5.97)

where

	Λ^(−1) = diag(1/λ_1, 1/λ_2, …, 1/λ_M)		(5.98)

Hence, if we can decompose A as

	A = S Λ R^T		(5.90)

one can find, if it exists, A^(−1) as

	A^(−1) = S Λ^(−1) R^T		(5.97)

Of course, if λ_i = 0 for any i, then 1/λ_i is undefined and hence A^(−1) does not exist.
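A quick numerical illustration of Equation (5.97) (a minimal Python/numpy sketch; the example matrix is illustrative only, and assumes no zero eigenvalues):

import numpy as np

# Sketch: build the inverse from the decomposition A = S Lambda R^T as
# A^-1 = S Lambda^-1 R^T, where R^T = S^-1.
A = np.array([[2.0, 1.0, 0.0],
              [0.5, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

lam, S = np.linalg.eig(A)                 # eigenvalues may be complex in general
R_T = np.linalg.inv(S)                    # R^T = S^-1
A_inv = S @ np.diag(1.0 / lam) @ R_T      # eigenvalues of A^-1 are 1/lambda_i

print(np.allclose(A_inv @ A, np.eye(3)))  # A^-1 A = I_M
print(np.allclose(A @ A_inv, np.eye(3)))  # A A^-1 = I_M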


5.4.5 What Happens When There are Zero Eigenvalues?

Suppose that some of the eigenvalues λ_i of

	A s_i = λ_i s_i		(5.10)

are zero. What happens then? First, of course, A^(−1) does not exist in the mathematical sense of Equations (5.92) and (5.93). Given that, however, let us look in more detail.

First, suppose there are P nonzero λ_i and (M − P) zero λ_i. We can order the λ_i such that

	|λ_1| ≥ |λ_2| ≥ ⋯ ≥ |λ_P| > 0		(5.99)

and

	λ_(P+1) = λ_(P+2) = ⋯ = λ_M = 0		(5.100)

Recall that one is always free to order the λ_i any way one chooses, as long as the s_i and r_i in Equation (5.90) are ordered the same way.

Then, we can rewrite Λ as

	Λ = diag(λ_1, …, λ_P, 0, …, 0)		(5.101)

Consider Equation (5.90) again. We have that

	A = S Λ R^T		(5.90)

We can write out the right-hand side as

	A = [s_1 ⋯ s_P | s_(P+1) ⋯ s_M]  diag(λ_1, …, λ_P, 0, …, 0)  [r_1 r_2 ⋯ r_M]^T		(5.102)

where S, Λ, and R are all M × M matrices. Multiplying out S Λ explicitly yields

	A = [λ_1 s_1  λ_2 s_2  ⋯  λ_P s_P | 0 ⋯ 0]  [r_1 r_2 ⋯ r_M]^T		(5.103)

where we see that the last (M − P) columns of the product S Λ are all zero. Note also that the last (M − P) rows of R^T (or, equivalently, the last (M − P) columns of R) will all be multiplied by zeros.

This means that s_(P+1), s_(P+2), . . . , s_M and r_(P+1), r_(P+2), . . . , r_M are not needed to form A! We say that these eigenvectors are obliterated in (or by) A, or that A is blind to them.

In order not to have to write out this partitioning in long hand each time, let us make the following definitions:

1. Let S = [S_P | S_0], where

	S_P = [s_1 s_2 ⋯ s_P]		(5.104)

is an M × P matrix with the P eigenvectors of A s_i = λ_i s_i associated with the P nonzero eigenvalues λ_i as columns, and where

	S_0 = [s_(P+1) s_(P+2) ⋯ s_M]		(5.105)

is an M × (M − P) matrix with the (M − P) eigenvectors associated with the zero eigenvalues of A s_i = λ_i s_i as columns.

2. Let R = [R_P | R_0], where

	R_P = [r_1 r_2 ⋯ r_P]		(5.106)

is an M × P matrix with the P eigenvectors of A^T r_i = λ_i r_i associated with the P nonzero eigenvalues λ_i as columns, and where

	R_0 = [r_(P+1) r_(P+2) ⋯ r_M]		(5.107)

is an M × (M − P) matrix with the (M − P) eigenvectors associated with the zero eigenvalues of A^T r_i = λ_i r_i as columns.

3. Let

	Λ = [ Λ_P   0 ]
	    [  0    0 ]		(5.108)

where Λ_P is the P × P subset of Λ with the P nonzero eigenvalues λ_i along the diagonal and zeros elsewhere, as shown below

	Λ_P = diag(λ_1, λ_2, …, λ_P)		(5.109)

and where the rest of Λ consists entirely of zeros.

We see, then, that Equation (5.103) implies that A can be reconstructed with just S_P, Λ_P, and R_P as

	 A    =   S_P   Λ_P   R_P^T		(5.110)
	M×M       M×P   P×P   P×M

It is important to note that A can be reconstructed using either Equation (5.90) or Equation (5.110). The benefit of using Equation (5.110) is that the matrices are smaller, and it could save you effort not to have to calculate the r_i, s_i associated with the zero eigenvalues. An important insight into the problem, however, is that although A can be reconstructed without any information about directions associated with eigenvectors having zero eigenvalues, no information can be retrieved, or gained, about these directions in an inversion.
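The reconstruction in Equation (5.110) is easy to confirm numerically. The following short Python/numpy sketch (illustrative only; the matrix is a made-up rank-deficient example, not one from the notes) builds a square A with a zero eigenvalue and rebuilds it from S_P, Λ_P, and R_P alone:

import numpy as np

# Sketch: a rank-deficient square A is rebuilt exactly from only the P nonzero
# eigenvalues and the corresponding columns of S and R.
M, P = 4, 3
rng = np.random.default_rng(5)
S = rng.standard_normal((M, M))
lam = np.array([3.0, 2.0, 1.0, 0.0])           # one zero eigenvalue
A = S @ np.diag(lam) @ np.linalg.inv(S)        # A = S Lambda R^T, with R^T = S^-1

R_T = np.linalg.inv(S)
A_rebuilt = S[:, :P] @ np.diag(lam[:P]) @ R_T[:P, :]   # S_P Lambda_P R_P^T
print(np.allclose(A_rebuilt, A))               # True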


5.4.6 Some Notes on the Properties of S_P and R_P

1. At best, S_P is semiorthogonal. It is possible that

	S_P^T S_P = I_P		(5.111)

depending on the form (or information) of A. Note that the product S_P^T S_P has dimension (P × M)(M × P) = (P × P), independent of M. The product will equal I_P if and only if the P columns of S_P are all orthogonal to one another.

It is never possible that

	S_P S_P^T = I_M		(5.112)

This product has dimension (M × P)(P × M) = (M × M). S_P has M rows, but only P of them can be independent. Each row of S_P can be thought of as a vector in P-space (since there are P columns). Only P vectors in P-space can be linearly independent, and M > P.

2. Similar arguments can be made about R_P. That is, at best, R_P is semiorthogonal. Thus

	R_P^T R_P = I_P		(5.113)

is possible, depending on the structure of A. It is never possible that

	R_P R_P^T = I_M		(5.114)

3. A^(−1) does not exist (since there are zero eigenvalues).

4. If there is some solution x to Ax = b (which is possible, even if A^(−1) does not exist), then there are an infinite number of solutions; see the sketch after this list. To see this, we note that

	A s_i = λ_i s_i = 0,   i = P + 1, P + 2, . . . , M		(5.115)

This means that if

	A x = b		(5.1)

for some x, then if we add s_i, (P + 1) ≤ i ≤ M, to x, we have

	A[x + s_i] = b + λ_i s_i		(5.116)
	           = b		(5.117)

since A is a linear operator. This means that one could write a general solution as

	x′ = x + Σ_{i=P+1}^{M} α_i s_i		(5.118)

where α_i is an arbitrary weighting factor.
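The null-space property in Equations (5.115)–(5.118) can be seen numerically with a few lines of Python/numpy (an illustrative sketch only; the matrix happens to be the G^T G of the simple tomography problem revisited below):

import numpy as np

# Sketch: a singular symmetric A has a null-space eigenvector; adding any
# multiple of it to x leaves Ax unchanged.
A = np.array([[2.0, 1.0, 1.0, 0.0],
              [1.0, 2.0, 0.0, 1.0],
              [1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])

lam, S = np.linalg.eigh(A)                   # symmetric, so use eigh
s_null = S[:, np.argmin(np.abs(lam))]        # eigenvector with (near-)zero eigenvalue

x = np.array([1.0, 0.0, 0.0, 0.0])
b = A @ x
print(np.allclose(A @ (x + 3.7 * s_null), b))   # True: A is blind to s_null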


5.5 Eigenvector Structure of m_LS


5.5.1 Square Symmetric A Matrix With Nonzero Eigenvalues

Recall that the least squares problem can always be transformed into the normal equations form that involves square, symmetric matrices. If we start with

	G m = d		(5.119)

	G^T G m = G^T d		(5.120)

	m_LS = [G^T G]^(−1) G^T d		(5.121)

Let A = G^T G and b = G^T d. Then we have

	A m = b		(5.122)

	m_LS = A^(−1) b		(5.123)

where A is a square, symmetric matrix. Now, recall the decomposition theorem for square, symmetric matrices:

	A = S Λ S^T		(5.124)

where S is the M × M matrix with columns of eigenvectors of A, and Λ is the M × M diagonal matrix with the eigenvalues of A along its diagonal. Then, we have shown

	A^(−1) = S Λ^(−1) S^T		(5.125)

and

	m_LS = S Λ^(−1) S^T b		(5.126)

Let's take a closer look at the structure of m_LS. The easiest way to do this is to use a simple example with M = 2. Then

	A^(−1) = [ s_11  s_12 ] [ 1/λ_1    0   ] [ s_11  s_21 ]		(5.127)
	         [ s_21  s_22 ] [   0    1/λ_2 ] [ s_12  s_22 ]

	       = [ s_11/λ_1   s_12/λ_2 ] [ s_11  s_21 ]		(5.128)
	         [ s_21/λ_1   s_22/λ_2 ] [ s_12  s_22 ]

	       = [ s_11²/λ_1 + s_12²/λ_2           s_11 s_21/λ_1 + s_12 s_22/λ_2 ]		(5.129)
	         [ s_21 s_11/λ_1 + s_22 s_12/λ_2   s_21²/λ_1 + s_22²/λ_2         ]

This has the form of the sum of two matrices, one with a 1/λ_1 coefficient and the other with a 1/λ_2 coefficient:

	A^(−1) = (1/λ_1) [ s_11²       s_11 s_21 ]  +  (1/λ_2) [ s_12²       s_12 s_22 ]
	                 [ s_21 s_11   s_21²     ]             [ s_22 s_12   s_22²     ]

	       = (1/λ_1) s_1 s_1^T + (1/λ_2) s_2 s_2^T		(5.130)

In general, for a square symmetric matrix,

	A^(−1) = (1/λ_1) s_1 s_1^T + (1/λ_2) s_2 s_2^T + ⋯ + (1/λ_M) s_M s_M^T		(5.131)

and

	A = λ_1 s_1 s_1^T + λ_2 s_2 s_2^T + ⋯ + λ_M s_M s_M^T		(5.132)

where s_i is the ith eigenvector of A and A^(−1).

Now, let's finish forming m_LS for the simple 2 × 2 case.

	m_LS = A^(−1) b
	     = [ (1/λ_1) s_1 s_1^T + (1/λ_2) s_2 s_2^T ] b
	     = (1/λ_1) (s_1^T b) s_1 + (1/λ_2) (s_2^T b) s_2		(5.133)

Or, in general,

	m_LS = Σ_{i=1}^{M} (1/λ_i) (s_i^T b) s_i = Σ_{i=1}^{M} c_i s_i		(5.134)

where s_i^T b ≡ s_i · b is the projection of the data in the s_i direction, and the c_i are the constants.

In this form, we can see that the least squares solution of Am = b is composed of a weighted sum of the eigenvectors of A. The coefficients of the eigenvectors are constants composed of two parts: one is the inverse of the corresponding eigenvalue, and the second is the projection of the data in the direction of the corresponding eigenvector. This form also clearly shows why there is no inverse if one or more of the eigenvalues of A is zero and why very small eigenvalues can make m_LS unstable. It also suggests how we might handle the case where A has one or more zero eigenvalues.
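Equation (5.134) is easy to test numerically. The following Python/numpy sketch (illustrative only; the G and d are random, not from any example in the notes) builds m_LS as a weighted sum of eigenvectors and compares it with the direct normal-equations solution:

import numpy as np

# Sketch: m_LS = sum_i (1/lambda_i)(s_i . b) s_i matches solving A m = b directly.
rng = np.random.default_rng(1)
G = rng.standard_normal((8, 3))          # N = 8 observations, M = 3 parameters
d = rng.standard_normal(8)

A, b = G.T @ G, G.T @ d
lam, S = np.linalg.eigh(A)               # A is symmetric

m_sum = sum((S[:, i] @ b) / lam[i] * S[:, i] for i in range(3))
m_direct = np.linalg.solve(A, b)
print(np.allclose(m_sum, m_direct))      # True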


5.5.2 The Case of Zero Eigenvalues

As we saw in section 5.4.5, we can order the eigenvalues from largest to smallest in
absolute value, as long as the associated eigenvectors are ordered in the same way. Then we saw
the remarkable result in Equation (5.110) that the matrix A can be completely reconstructed
using just the nonzero eigenvalues and associated eigenvectors.

	|λ_1| ≥ |λ_2| ≥ ⋯ ≥ |λ_P| > 0   and   λ_(P+1) = ⋯ = λ_M = 0		(5.135)

That is,

	A = S_P Λ_P S_P^T
	  = λ_1 s_1 s_1^T + ⋯ + λ_P s_P s_P^T		(5.136)

This suggests that perhaps we can construct A^(−1) similarly, using just the nonzero eigenvalues and their corresponding eigenvectors. Let's try

	Ã^(−1) = S_P Λ_P^(−1) S_P^T		(5.137)

which must, at least, exist. But, note that

	A Ã^(−1) = [S_P Λ_P S_P^T] [S_P Λ_P^(−1) S_P^T]		(5.138)

Note that S_P^T S_P = I_P because A is symmetric, and hence the eigenvectors are orthogonal, and Λ_P Λ_P^(−1) = I_P, so

	A Ã^(−1) = S_P S_P^T ≠ I_M   (see 5.112)		(5.139)

Just as when reconstructing A itself, we cannot write a true inverse for A using just the P nonzero eigenvectors; the true inverse does not exist. But we can use Ã^(−1) as an "approximate" inverse for A. Thus, in the case when A^(−1) does not exist, we can use

	m̃_LS = Ã^(−1) b = S_P Λ_P^(−1) S_P^T b = Σ_{i=1}^{P} [(s_i^T b)/λ_i] s_i		(5.140)


5.5.3 Simple Tomography Problem Revisited

The concepts developed in the previous section are best understood in the context of an
example problem. Recall the simple tomography problem:

[Figure: 2 × 2 cell model (cells 1–4) sampled by four ray paths d_1–d_4, one along each row and one along each column.]

	G = [ 1  0  1  0
	      0  1  0  1
	      1  1  0  0
	      0  0  1  1 ]		(5.141)

	A = G^T G = [ 2  1  1  0
	              1  2  0  1
	              1  0  2  1
	              0  1  1  2 ]		(5.142)

	λ_1 = 4,  λ_2 = 2,  λ_3 = 2,  λ_4 = 0   (the eigenvalues of A)		(5.143)

	S = [s_1 s_2 s_3 s_4] = [ 0.5   0.5   0.5   0.5
	                          0.5  −0.5   0.5  −0.5
	                          0.5   0.5  −0.5  −0.5
	                          0.5  −0.5  −0.5   0.5 ]   (the eigenvectors of A)		(5.144)

	A = S Λ S^T		(5.145)

where

	Λ = [ 4  0  0  0
	      0  2  0  0
	      0  0  2  0
	      0  0  0  0 ]		(5.146)

Let's look at the "patterns" in the eigenvectors, s
i
.
Geosciences 567: CHAPTER 5 (RMR/GZ)
120

1
= 4, s
1

2
= 2, s
2

3
= 2, s
3

4
= 0, s
4


+ +
+ +

+
+


+
+

comps: dc L R T B checkerboard

These patterns are the fundamental building blocks (basis functions) of all solutions in this four-space. Do you see why the eigenvalues correspond to their associated eigenvector patterns? Explain this in terms of the sampling of the four paths shooting through the medium. What other path(s) must we sample in order to make the zero eigenvalue nonzero?

Now, let's consider an actual model and the corresponding noise-free data. The model is

	m = [1, 0, 0, 0]^T   or, graphically,   [ 1  0 ]
	                                        [ 0  0 ]		(5.147)

The data are

	d = [1, 0, 1, 0]^T   and   b = G^T d = [2, 1, 1, 0]^T		(5.148)

First, A^(−1) does not exist. But Ã^(−1) = S_P Λ_P^(−1) S_P^T does, where S_P = [s_1 s_2 s_3] and

	Λ_P^(−1) = [ 1/4   0    0
	              0   1/2   0
	              0    0   1/2 ]		(5.149)

then

	m̃_LS = Ã^(−1) G^T d		(5.150)

We now have all the parts necessary to construct this solution. Let's use the alternate form of the solution:

	m̃_LS = (1/λ_1)(s_1^T b) s_1 + (1/λ_2)(s_2^T b) s_2 + (1/λ_3)(s_3^T b) s_3		(5.151)


	s_1^T b = [0.5  0.5  0.5  0.5] [2  1  1  0]^T = 1.0 + 0.5 + 0.5 + 0 = 2		(5.152)

	s_2^T b = [0.5  −0.5  0.5  −0.5] [2  1  1  0]^T = 1.0 − 0.5 + 0.5 − 0 = 1		(5.153)

	s_3^T b = [0.5  0.5  −0.5  −0.5] [2  1  1  0]^T = 1.0 + 0.5 − 0.5 − 0 = 1		(5.154)

and note,

	s_4^T b = [0.5  −0.5  −0.5  0.5] [2  1  1  0]^T = 1.0 − 0.5 − 0.5 + 0 = 0		(5.155)

The data have no projection in the null-space direction s_4. The geometry of the experiment provides no constraints on the checkerboard pattern. How would we change the experiment to remedy this problem? Continuing the construction of the solution,

	m̃_LS = (2/4) [0.5, 0.5, 0.5, 0.5]^T + (1/2) [0.5, −0.5, 0.5, −0.5]^T + (1/2) [0.5, 0.5, −0.5, −0.5]^T

	      = [0.25, 0.25, 0.25, 0.25]^T + [0.25, −0.25, 0.25, −0.25]^T + [0.25, 0.25, −0.25, −0.25]^T

	      = [0.75, 0.25, 0.25, −0.25]^T		(5.156)

Graphically, this is

	[ 0.75   0.25 ]
	[ 0.25  −0.25 ]

Does this solution fit the data? What do we need to add to this solution to get the true solution?

	[ 0.25  −0.25 ]
	[ −0.25  0.25 ]

This is the eigenvector associated with the zero eigenvalue and represents the "null" space. Any scaled value of this pattern can be added and the data will not change; the data are blind to this pattern.

	m = m̃_LS + const. × s_4 = [0.75, 0.25, 0.25, −0.25]^T + const. × [0.25, −0.25, −0.25, 0.25]^T		(5.157)
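The whole calculation above can be reproduced in a few lines of Python/numpy (an illustrative sketch only, using the same G and model as this example):

import numpy as np

# Sketch: reproduce the simple tomography solution with a truncated
# eigenvalue expansion (drop the zero-eigenvalue term).
G = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
m_true = np.array([1.0, 0.0, 0.0, 0.0])
d = G @ m_true                         # noise-free data
A, b = G.T @ G, G.T @ d

lam, S = np.linalg.eigh(A)             # eigenvalues returned in ascending order
keep = lam > 1e-10                     # P = 3 nonzero eigenvalues (4, 2, 2)
m_ls = (S[:, keep] / lam[keep]) @ (S[:, keep].T @ b)

print(m_ls)                            # approximately [0.75  0.25  0.25 -0.25]
print(np.allclose(G @ m_ls, d))        # the data are still fit exactly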


Summary

We now have all the tools to solve inverse problems, even those with zero eigenvalues!
There are many real inverse problems that have been tackled using just what we have learned so
far. In addition to eliminating zero eigenvalues and associated eigenvectors, the truncation
method can be used to remove the effects of small eigenvalues. As you might expect, there is a
cost or trade-off in truncating nonzero eigenvalues. Before we get into techniques to evaluate
such trade-offs, we first turn to the generalization of the eigenvalue-eigenvector concepts to nonsquare matrices.




CHAPTER 6: SINGULAR-VALUE DECOMPOSITION (SVD)


6.1 Introduction


Having finished with the eigenvalue problem for Ax = b, where A is square, we now turn our attention to the general N × M case Gm = d. First, the eigenvalue problem, per se, does not exist for Gm = d unless N = M. This is because G maps (transforms) a vector m from M-space into a vector d in N-space. The concept of "parallel" breaks down when the vectors lie in spaces of different dimension.

Since the eigenvalue problem is not defined for G, we will try to construct a square matrix that includes G (and, as it will turn out, G^T) for which the eigenvalue problem is defined. This eigenvalue problem will lead us to singular-value decomposition (SVD), a way to decompose G into the product of three matrices (two eigenvector matrices V and U, associated with model and data spaces, respectively, and a singular-value matrix very similar to Λ from the eigenvalue problem for A). Finally, it will lead us to the generalized inverse operator, defined in a way that is analogous to the inverse matrix to A found using eigenvalue/eigenvector analysis.

The end result of SVD is

	 G    =  U_P  Λ_P  V_P^T		(6.1)
	N×M      N×P  P×P  P×M

where U_P contains the P N-dimensional eigenvectors of GG^T, V_P contains the P M-dimensional eigenvectors of G^T G, and Λ_P is the P × P diagonal matrix with the P singular values (positive square roots of the nonzero eigenvalues shared by GG^T and G^T G) on the diagonal.


6.2 Formation of a New Matrix B

6.2.1 Formulating the Eigenvalue Problem With G

The way to construct an eigenvalue problem that includes G is to form a square (N + M) × (N + M) matrix B partitioned as follows:

	B = [ 0_(N×N)     G_(N×M) ]
	    [ G^T_(M×N)   0_(M×M) ]		(6.2)


B is Hermitian because

	B^T = B		(6.3)

Note, for example,

	B_(1, N+3) = G_13		(6.4)

and

	B_(N+3, 1) = (G^T)_31 = G_13, etc.		(6.5)


6.2.2 The Role of G^T as an Operator

Analogous to Equation (1.13), we can define an equation for G^T as follows:

	G^T   y   =   c		(6.6)
	M×N  N×1     M×1

We do not have to have a particular y and c in mind when we do this. We are simply interested in the mapping of an N-dimensional vector into an M-dimensional vector by G^T.

We can combine Gm = d and G^T y = c, using B, as

	[ 0     G ] [ y ]   =   [ d ]		(6.7)
	[ G^T   0 ] [ m ]       [ c ]

or

	     B           z     =     b		(6.8)
	(N+M)×(N+M)   (N+M)×1     (N+M)×1

where we have

	z = [ y ]		(6.9)
	    [ m ]

and

	b = [ d ]		(6.10)
	    [ c ]

Note that z and b are both (N + M) × 1 column vectors.


6.3 The Eigenvalue Problem for B


The eigenvalue problem for the (N + M) × (N + M) matrix B is given by

	B w_i = η_i w_i,   i = 1, 2, . . . , N + M		(6.11)


6.3.1 Properties of B

The matrix B is Hermitian. Therefore, all N + M eigenvalues η_i are real. In preparation for solving the eigenvalue problem, we define the eigenvector matrix W for B as follows:

	W = [w_1 w_2 ⋯ w_(N+M)]		(6.12)

which is (N + M) × (N + M). We note that W is an orthogonal matrix, and thus

	W^T W = W W^T = I_(N+M)		(6.13)

This is equivalent to

	w_i^T w_j = δ_ij		(6.14)

where w_i is the ith eigenvector in W.


6.3.2 Partitioning W

Each eigenvector w_i is (N + M) × 1. Consider partitioning w_i such that

	w_i = [ u_i ]   } N
	      [ v_i ]   } M		(6.15)

That is, we stack an N-dimensional vector u_i and an M-dimensional vector v_i into a single (N + M)-dimensional vector.

Then the eigenvalue problem for B from Equation (6.11) becomes

	[ 0     G ] [ u_i ]   =   η_i [ u_i ]		(6.16)
	[ G^T   0 ] [ v_i ]           [ v_i ]

This can be written as

	 G   v_i  =  η_i u_i,    i = 1, 2, . . . , N + M		(6.17)
	N×M  M×1         N×1

and

	G^T  u_i  =  η_i v_i,    i = 1, 2, . . . , N + M		(6.18)
	M×N  N×1         M×1

Equations (6.17) and (6.18) together are called the shifted eigenvalue problem for G. It is not an eigenvalue problem for G, since G is not square and eigenvalue problems are only defined for square matrices. Still, it is analogous to an eigenvalue problem. Note that G operates on an M-dimensional vector and returns an N-dimensional vector. G^T operates on an N-dimensional vector and returns an M-dimensional vector. Furthermore, the vectors are shared by G and G^T.


6.4 Solving the Shifted Eigenvalue Problem

Equations (6.17) and (6.18) can be solved by combining them into two related eigenvalue problems involving G^T G and GG^T, respectively.

6.4.1 The Eigenvalue Problem for G^T G

Eigenvalue problems are only defined for square matrices. Note, then, that G^T G is M × M, and hence has an eigenvalue problem. The procedure is as follows.

Starting with Equation (6.18)

	G^T u_i = η_i v_i		(6.18)

Multiply both sides by η_i

	η_i G^T u_i = η_i² v_i		(6.19)

or

	G^T (η_i u_i) = η_i² v_i		(6.20)

But, by Equation (6.17), we have

	η_i u_i = G v_i		(6.17)

Thus

	G^T G v_i = η_i² v_i,   i = 1, 2, . . . , M		(6.21)

This is just the eigenvalue problem for G^T G! We were able to manipulate the shifted eigenvalue problem into an eigenvalue problem that, presumably, we can solve.

We make the following notes:

1. G^T G is Hermitian:

	(G^T G)_ij = Σ_{k=1}^{N} g_ki g_kj		(6.22)

	(G^T G)_ji = Σ_{k=1}^{N} g_kj g_ki = (G^T G)_ij		(6.23)

2. Therefore, all M of the η_i² are real. Furthermore, since v_i^T G^T G v_i = ‖G v_i‖² ≥ 0, all η_i² are also ≥ 0. This means that G^T G is positive semidefinite (one definition of which is, simply, that all the eigenvalues are real and ≥ 0).

We can combine the M equations implied by Equation (6.21) into matrix notation as

	G^T G   V   =   V    M		(6.24)
	 M×M   M×M     M×M  M×M

where V is defined as follows:

	V = [v_1 v_2 ⋯ v_M]		(6.25)

and M is the M × M diagonal matrix of eigenvalues

	M = diag(η_1², η_2², …, η_M²)		(6.26)

3. Because G^T G is a Hermitian matrix, V is itself an orthogonal matrix:

	V V^T = V^T V = I_M		(6.27)


6.4.2 The Eigenvalue Problem for GG^T

The procedure for forming the eigenvalue problem for GG^T is very analogous to that for G^T G. We note that GG^T is N × N. Starting with Equation (6.17),

	G v_i = η_i u_i		(6.17)

Again, multiply by η_i

	G (η_i v_i) = η_i² u_i		(6.28)

But by Equation (6.18), we have

	G^T u_i = η_i v_i		(6.18)

Thus

	G G^T u_i = η_i² u_i,   i = 1, 2, . . . , N		(6.29)

We make the following notes for this eigenvalue problem:

1. GG^T is Hermitian.

2. GG^T is positive semidefinite.

3. Combining the N equations in Equation (6.29), we have

	G G^T U = U N		(6.30)

where

	U = [u_1 u_2 ⋯ u_N]		(6.31)

is N × N and N is the N × N diagonal matrix of eigenvalues

	N = diag(η_1², η_2², …, η_N²)		(6.32)

4. U is an orthogonal matrix:

	U^T U = U U^T = I_N		(6.33)


6.5 How Many η_i Are There, Anyway??

A careful look at Equations (6.11), (6.21), and (6.29) shows that the eigenvalue problems for B, G^T G, and GG^T are defined for (N + M), M, and N values of i, respectively. Just how many η_i are there?
6.5.1 Introducing P, the Number of Nonzero Pairs (+η_i, −η_i)

Equation (6.11)

	B w_i = η_i w_i		(6.11)

can be used to determine (N + M) real η_i. Equation (6.21),

	G^T G v_i = η_i² v_i		(6.21)

can be used to determine M real η_i² since G^T G is M × M. Equation (6.29)

	G G^T u_i = η_i² u_i		(6.29)

can be used to determine N real η_i² since GG^T is N × N.

This section will convince you, I hope, that the following are true:

1. There are P pairs of nonzero η_i, where each pair consists of (+η_i, −η_i).

2. If +η_i is an eigenvalue of

	B w_i = η_i w_i		(6.11)

and the associated eigenvector w_i is given by

	w_i = [ u_i ]
	      [ v_i ]		(6.34)

then the eigenvector associated with −η_i is given by

	w_i = [ −u_i ]
	      [  v_i ]		(6.35)

3. There are (N + M) − 2P zero η_i.

4. You can know everything you need to know about the shifted eigenvalue problem by retaining only the information associated with the positive η_i.

5. P is less than or equal to the minimum of N and M:

	P ≤ min(N, M)		(6.36)


6.5.2 Finding the Eigenvector Associated With −η_i

Suppose that you have found w_i, a solution to Equation (6.11) associated with η_i. It also satisfies the shifted eigenvalue problem

	G v_i = η_i u_i		(6.17)

and

	G^T u_i = η_i v_i		(6.18)

Let us try −η_i as an eigenvalue and w_i given by

	w_i = [ −u_i ]
	      [  v_i ]		(6.35)

and see if it satisfies Equations (6.17) and (6.18):

	G v_i = (−η_i)(−u_i) = η_i u_i		(6.37)

and

	G^T (−u_i) = (−η_i) v_i		(6.38)

or

	G^T u_i = η_i v_i		(6.39)

From this we conclude that the nonzero eigenvalues of B come in pairs. The relationship between the solutions is given in Equations (6.34) and (6.35). Note that this property of paired eigenvalues and eigenvectors is not the case for the general eigenvalue problem. It results from the symmetry of the shifted eigenvalue problem.
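The pairing of eigenvalues of B can be seen numerically. The following Python/numpy sketch (illustrative only; B is never actually formed in practice, as noted below in Section 6.6.2) builds B for a random G and checks that the nonzero eigenvalues come in ± pairs whose magnitudes are the singular values of G:

import numpy as np

# Sketch: B = [[0, G], [G^T, 0]] has nonzero eigenvalues in +/- pairs whose
# magnitudes equal the singular values of G, plus (N + M) - 2P zeros.
rng = np.random.default_rng(2)
N, M = 5, 3
G = rng.standard_normal((N, M))

B = np.block([[np.zeros((N, N)), G],
              [G.T, np.zeros((M, M))]])
eta = np.linalg.eigvalsh(B)                  # B is symmetric (Hermitian)
sv = np.linalg.svd(G, compute_uv=False)      # P = 3 singular values here

print(np.sort(np.abs(eta))[::-1][:2 * M:2])  # paired magnitudes
print(sv)                                    # ... equal the singular values
print(np.sum(np.abs(eta) < 1e-10))           # (N + M) - 2P = 2 zero eigenvalues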


6.5.3 No New Information From the −η_i System

Let us form an ordered eigenvalue matrix D for B given by

	D = diag(η_1, η_2, …, η_P, −η_1, −η_2, …, −η_P, 0, …, 0)		(6.40)

which is (N + M) × (N + M), where η_1 ≥ η_2 ≥ ⋯ ≥ η_P. Note that the ordering of matrices in eigenvalue problems is arbitrary, but must be internally consistent. Then the eigenvalue problem for B from Equation (6.11) becomes

	B W = W D		(6.41)

where now the (N + M) × (N + M) dimensional matrix W is given by

	W = [ u_1 ⋯ u_P   −u_1 ⋯ −u_P   ⋯ ]
	    [ v_1 ⋯ v_P    v_1 ⋯  v_P   ⋯ ]		(6.42)
	      |    P    | |     P     | |(N + M) − 2P|

with the last (N + M) − 2P columns associated with the zero eigenvalues (see Section 6.5.4). The second P eigenvectors certainly contain independent information about the eigenvectors w_i in (N + M)-space. They contain no new information, however, about u_i or v_i, in N- and M-space, respectively, since −u_i contains no information not already contained in +u_i.


6.5.4 What About the Zero Eigenvalues η_i, i = (2P + 1), . . . , N + M?

For the zero eigenvalues, the shifted eigenvalue problem becomes

	G v_i = η_i u_i = 0 u_i = 0,   i = (2P + 1), . . . , (N + M)		(6.43)

and

	G^T u_i = η_i v_i = 0 v_i = 0,   i = (2P + 1), . . . , (N + M)		(6.44)

where 0 is a vector of zeros of the appropriate dimension (N × 1 and M × 1, respectively).

If you premultiply Equation (6.43) by G^T and Equation (6.44) by G, you obtain

	G^T G v_i = G^T 0 = 0   (M × 1)		(6.45)

and

	G G^T u_i = G 0 = 0   (N × 1)		(6.46)

Therefore, we conclude that the u_i, v_i associated with zero η_i for B are simply the eigenvectors of GG^T and G^T G associated with zero eigenvalues for GG^T and G^T G, respectively!


6.5.5 How Big Is P?

Now that we have seen that the eigenvalues come in P pairs of nonzero values, how can we determine the size of P? We will see that you can determine P from either G^T G or GG^T, and that P is bounded by the smaller of N and M, the number of observations and model parameters, respectively. The steps are as follows.

Step 1. Let the number of nonzero eigenvalues η_i² of G^T G be P. Since G^T G is M × M, there are only M values of η_i² all together. Thus, P is less than or equal to M.

Step 2. If η_i² ≠ 0 is an eigenvalue of G^T G, then it is also an eigenvalue of GG^T since

	G^T G v_i = η_i² v_i		(6.21)
and
	G G^T u_i = η_i² u_i		(6.29)

Thus the nonzero η_i² are shared by G^T G and GG^T.

Step 3. P is less than or equal to N since GG^T is N × N. Therefore, since P ≤ M and P ≤ N,

	P ≤ min(N, M)

Thus, to determine P, you can do the eigenvalue problem for either G^T G (M × M) or GG^T (N × N). It makes sense to choose the smaller of the two matrices. That is, one chooses G^T G if M < N, or GG^T if N < M.

6.6 Introducing Singular Values

6.6.1 Introduction

Recall Equation (6.30), defining the eigenvalue problem for GG^T,

	G G^T U = U N		(6.30)

The matrix U contains the eigenvectors u_i, and can be ordered as

	U = [ u_1 u_2 ⋯ u_P | u_(P+1) ⋯ u_N ]		(6.47)
	     |      P      | |    N − P    |

or

	U = [ U_P | U_0 ]		(6.48)

Recall Equation (6.24), which defined the eigenvalue problem for G^T G,

	G^T G V = V M		(6.24)

The matrix V of eigenvectors is given by

	V = [ v_1 v_2 ⋯ v_P | v_(P+1) ⋯ v_M ]		(6.49)
	     |      P      | |    M − P    |

or

	V = [ V_P | V_0 ]		(6.50)

where the u_i, v_i satisfy

	G v_i = η_i u_i		(6.17)
and
	G^T u_i = η_i v_i		(6.18)

and where we have chosen the P positive η_i from

	G^T G v_i = η_i² v_i		(6.21)

	G G^T u_i = η_i² u_i		(6.29)

Note that it is customary to order the u_i, v_i such that

	η_1 ≥ η_2 ≥ ⋯ ≥ η_P		(6.51)


6.6.2 Definition of the Singular Value

We define a singular value λ_i from Equation (6.21) or (6.29) as the positive square root of the eigenvalue η_i² for G^T G or GG^T. That is,

	λ_i = +√(η_i²)		(6.52)

Singular values are not eigenvalues. λ_i is not an eigenvalue for G or G^T, since the eigenvalue problem is not defined for G or G^T when N ≠ M. They are, of course, eigenvalues for B in Equation (6.11), but we will never explicitly deal with B. The matrix B is a construct that allowed us to formulate the shifted eigenvalue problem, but in practice, it is never formed. Nevertheless, you will often read, or hear, λ_i referred to as an eigenvalue.


6.6.3 Definition of Λ, the Singular-Value Matrix

We can form an N × M matrix Λ with the singular values on the diagonal. If M > N, it has the form

	Λ = [ diag(λ_1, …, λ_P, 0, …, 0)   0_(N×(M−N)) ]		(6.53)

that is, an N × N diagonal block (with λ_1, …, λ_P followed by zeros along the diagonal) next to an N × (M − N) block of zeros. If N > M, it has the form

	Λ = [ diag(λ_1, …, λ_P, 0, …, 0) ]
	    [        0_((N−M)×M)         ]		(6.54)

that is, an M × M diagonal block above an (N − M) × M block of zeros.

Then the shifted eigenvalue problem

	G v_i = η_i u_i		(6.17)

and

	G^T u_i = η_i v_i		(6.18)

can be written as

	G v_i = λ_i u_i		(6.55)
and
	G^T u_i = λ_i v_i		(6.56)

where η_i has been replaced by λ_i, since all information about U, V can be obtained from the positive η_i.

Equations (6.55) and (6.56) can be written in matrix notation as

	 G    V   =   U    Λ		(6.57)
	N×M  M×M     N×N  N×M

and

	G^T   U   =   V    Λ^T		(6.58)
	M×N  N×N     M×M  M×N

6.7 Derivation of the Fundamental Decomposition Theorem for General G (N × M, N ≠ M)

Recall that we used the eigenvalue problems for square A and A^T to derive a decomposition theorem for square matrices:

	A = S Λ R^T		(5.90)

where S, R, and Λ are eigenvector and eigenvalue matrices associated with A and A^T. We are now ready to derive an analogous decomposition theorem for the general N × M, N ≠ M matrix G.

We start with Equation (6.57)

	 G    V   =   U    Λ		(6.57)
	N×M  M×M     N×N  N×M

and postmultiply by V^T:

	G V V^T = U Λ V^T		(6.59)

But V is an orthogonal matrix. That is,

	V V^T = I_M		(6.27)

since G^T G is Hermitian, and the eigenvector matrices of Hermitian matrices are orthogonal. Therefore, we have the fundamental decomposition theorem for a general matrix G given by

	 G   =   U    Λ    V^T		(6.60)
	N×M     N×N  N×M   M×M

By taking the transpose of Equation (6.60), we obtain also

	G^T   =   V    Λ^T   U^T		(6.61)
	M×N      M×M   M×N   N×N

where

	U = [ u_1 ⋯ u_P | u_(P+1) ⋯ u_N ] = [ U_P | U_0 ]		(6.62)

and

	V = [ v_1 ⋯ v_P | v_(P+1) ⋯ v_M ] = [ V_P | V_0 ]		(6.63)

and

	Λ = [ diag(λ_1, …, λ_P)   0 ]
	    [         0           0 ]   (N × M)		(6.64)



6.8 Singular-Value Decomposition (SVD)

6.8.1 Derivation of Singular-Value Decomposition

We will see below that G can be decomposed without any knowledge of the parts of U or V associated with zero singular values λ_i, i > P. We start with the fundamental decomposition theorem

	G = U Λ V^T		(6.60)

Let us introduce a P × P singular-value matrix Λ_P that is a subset of Λ:

	Λ_P = diag(λ_1, λ_2, …, λ_P)		(6.65)

We now write out Equation (6.60) in terms of the partitioned matrices as

	G = [ U_P | U_0 ] [ Λ_P   0 ] [ V_P^T ]
	                  [  0    0 ] [ V_0^T ]		(6.66)

	  = [ U_P Λ_P | 0 ] [ V_P^T ]
	                    [ V_0^T ]		(6.67)

	  = U_P Λ_P V_P^T		(6.68)

That is, we can write G as

	 G   =  U_P  Λ_P  V_P^T		(6.69)
	N×M     N×P  P×P  P×M

Equation (6.69) is known as the Singular-Value Decomposition Theorem for G.
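For reference, Equation (6.69) is exactly what standard numerical libraries return. A minimal Python/numpy sketch (illustrative only; the random G is not from the notes):

import numpy as np

# Sketch: numpy's SVD returns the U_P, Lambda_P, V_P^T factors; with
# full_matrices=False it keeps only min(N, M) columns.
rng = np.random.default_rng(3)
G = rng.standard_normal((6, 4))                  # N = 6, M = 4

U_p, sv, V_pT = np.linalg.svd(G, full_matrices=False)
print(np.allclose(U_p @ np.diag(sv) @ V_pT, G))  # G = U_P Lambda_P V_P^T
print(np.allclose(U_p.T @ U_p, np.eye(4)))       # U_P is semiorthogonal
print(np.allclose(V_pT @ V_pT.T, np.eye(4)))     # V_P is semiorthogonal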

The matrices in Equation (6.69) are:

1. G, an arbitrary N × M matrix.

2. The eigenvector matrix U_P,

	U_P = [u_1 u_2 ⋯ u_P]		(6.70)

where the u_i are the P N-dimensional eigenvectors of

	G G^T u_i = η_i² u_i		(6.29)

associated with nonzero singular values λ_i.

3. The eigenvector matrix V_P,

	V_P = [v_1 v_2 ⋯ v_P]		(6.71)

where the v_i are the P M-dimensional eigenvectors of

	G^T G v_i = η_i² v_i		(6.21)

associated with nonzero singular values λ_i.

4. The singular-value matrix

	Λ_P = diag(λ_1, λ_2, …, λ_P)		(6.65)

where λ_i is the nonzero singular value associated with u_i and v_i, i = 1, . . . , P.


6.8.2 Rewriting the Shifted Eigenvalue Problem

Now that we have seen that we can reconstruct G using only the subsets of U, V, and Λ defined in Equations (6.70), (6.71), and (6.65), we can rewrite the shifted eigenvalue problem given by Equations (6.57) and (6.58),

	G V = U Λ		(6.57)
and
	G^T U = V Λ^T		(6.58)

as

1.	 G   V_P  =  U_P  Λ_P		(6.72)
	N×M  M×P     N×P  P×P

2.	G^T  U_P  =  V_P  Λ_P		(6.73)
	M×N  N×P     M×P  P×P

3.	 G   V_0  =  0		(6.74)
	N×M  M×(M−P)   N×(M−P)

4.	G^T  U_0  =  0		(6.75)
	M×N  N×(N−P)   M×(N−P)

Note that the eigenvectors in V are a set of M orthogonal vectors which span model space, while the eigenvectors in U are a set of N orthogonal vectors which span data space. The P vectors in V_P span a P-dimensional subset of model space, while the P vectors in U_P span a P-dimensional subset of data space. V_0 and U_0 are called null, or zero, spaces. They are (M − P)- and (N − P)-dimensional subsets of model and data spaces, respectively.


6.8.3 Summarizing SVD

In summary, we started with Equations (1.13) and (6.6)

	G m = d		(1.13)
and
	G^T y = c		(6.6)

We constructed

	B = [ 0_(N×N)     G_(N×M) ]
	    [ G^T_(M×N)   0_(M×M) ]		(6.2)

We then considered the eigenvalue problem for B

	B w_i = η_i w_i,   i = 1, 2, . . . , (N + M)		(6.11)

This led us to the shifted eigenvalue problem

	G v_i = η_i u_i,   i = 1, 2, . . . , (N + M)		(6.17)

and

	G^T u_i = η_i v_i,   i = 1, 2, . . . , (N + M)		(6.18)

We found that the shifted eigenvalue problem leads us to eigenvalue problems for G^T G and GG^T:

	G^T G v_i = η_i² v_i,   i = 1, 2, . . . , M		(6.21)

and

	G G^T u_i = η_i² u_i,   i = 1, 2, . . . , N		(6.29)

We then introduced the singular value λ_i, given by the positive square root of the eigenvalues from Equations (6.20) and (6.28):

	λ_i = +√(η_i²)		(6.52)

Equations (6.16), (6.17), (6.20), and (6.28) give us a way to find U, V, and Λ. They also lead, eventually, to

	 G   =   U    Λ    V^T		(6.60)
	N×M     N×N  N×M   M×M

We then considered partitioning the matrices based on P, the number of nonzero singular values. This led us to singular-value decomposition

	 G   =  U_P  Λ_P  V_P^T		(6.76)
	N×M     N×P  P×P  P×M

Before considering an inverse operator based on singular-value decomposition, it is probably useful to cover the mechanics of singular-value decomposition.


6.9 Mechanics of Singular-Value Decomposition

The steps involved in singular-value decomposition are as follows:

Step 1. Begin with Gm = d.

Form G^T G (M × M) or GG^T (N × N), whichever is smaller. (N.B. Typically, there are more observations than model parameters; thus, N > M, and G^T G is the more common choice.)

Step 2. Solve the eigenvalue problem for the Hermitian matrix G^T G (or GG^T)

	G^T G v_i = η_i² v_i		(6.21)

1. Find the P nonzero η_i² and associated v_i.

2. Let λ_i = +(η_i²)^(1/2).

3. Form

	V_P = [v_1 v_2 ⋯ v_P]   (M × P)		(6.71)

and

	Λ_P = diag(λ_1, λ_2, …, λ_P)   (P × P)		(6.65)

Step 3. Use G v_i = λ_i u_i to find u_i for each known λ_i, v_i.

Note: Finding u_i this way preserves the sign relationship implicit between u_i, v_i by taking the positive member of each pair (+η_i, −η_i). You will not preserve the sign relationship (except by luck) if you use GG^T u_i = η_i² u_i to find the u_i.

Step 4. Form

	U_P = [u_1 u_2 ⋯ u_P]   (N × P)		(6.70)

Step 5. Finally, form G as

	 G   =  U_P  Λ_P  V_P^T		(6.69)
	N×M     N×P  P×P  P×M
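These steps translate directly into a few lines of Python/numpy. The following sketch (illustrative only; the random G and the tolerance used to identify nonzero eigenvalues are assumptions, not part of the notes) follows Steps 1 through 5 literally:

import numpy as np

# Sketch: build the SVD of G from the eigenvalue problem for G^T G, then find
# each u_i from G v_i = lambda_i u_i (Step 3), preserving the sign relationship.
rng = np.random.default_rng(4)
G = rng.standard_normal((6, 3))            # N = 6 > M = 3

eta2, V = np.linalg.eigh(G.T @ G)          # Step 2: eigenvalues/vectors of G^T G
order = np.argsort(eta2)[::-1]             # order largest to smallest
eta2, V = eta2[order], V[:, order]
keep = eta2 > 1e-10 * eta2[0]              # the P nonzero eigenvalues
lam = np.sqrt(eta2[keep])                  # singular values lambda_i
V_p = V[:, keep]

U_p = (G @ V_p) / lam                      # Step 3: u_i = G v_i / lambda_i
print(np.allclose(U_p @ np.diag(lam) @ V_p.T, G))   # Step 5: G recovered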


6.10 Implications of Singular-Value Decomposition


6.10.1 Relationships Between U, U_P, and U_0

1. U^T U = U U^T = I_N (each product is N × N), since U is an orthogonal matrix.

2. U_P^T U_P = I_P, where (P × N)(N × P) = (P × P). U_P is semiorthogonal because all P vectors in U_P are perpendicular to each other.

3. U_P U_P^T ≠ I_N (unless P = N). U_P U_P^T is N × N, and U_P cannot span N-space with only P (N-dimensional) vectors.

4. U_0^T U_0 = I_(N−P), where ((N − P) × N)(N × (N − P)) = ((N − P) × (N − P)). U_0 is semiorthogonal since the (N − P) vectors in U_0 are all perpendicular to each other.

5. U_0 U_0^T ≠ I_N. U_0 has (N − P) N-dimensional vectors in it; it cannot span N-space. U_0 U_0^T is N × N.

6. U_P^T U_0 = 0, a P × (N − P) matrix of zeros, since all the eigenvectors in U_P are perpendicular to all the eigenvectors in U_0.

7. U_0^T U_P = 0, an (N − P) × P matrix of zeros, again since all the eigenvectors in U_0 are perpendicular to all the eigenvectors in U_P.



6.10.2 Relationships Between V, V_P, and V_0

1. V^T V = V V^T = I_M (each product is M × M), since V is an orthogonal matrix.

2. V_P^T V_P = I_P, where (P × M)(M × P) = (P × P). V_P is semiorthogonal because all P vectors in V_P are perpendicular to each other.

3. V_P V_P^T ≠ I_M (unless P = M). V_P V_P^T is M × M, and V_P cannot span M-space with only P (M-dimensional) vectors.

4. V_0^T V_0 = I_(M−P), where ((M − P) × M)(M × (M − P)) = ((M − P) × (M − P)). V_0 is semiorthogonal since the (M − P) vectors in V_0 are all perpendicular to each other.

5. V_0 V_0^T ≠ I_M. V_0 has (M − P) M-dimensional vectors in it; it cannot span M-space. V_0 V_0^T is M × M.

6. V_P^T V_0 = 0, a P × (M − P) matrix of zeros, since all the eigenvectors in V_P are perpendicular to all the eigenvectors in V_0.

7. V_0^T V_P = 0, an (M − P) × P matrix of zeros, again since all the eigenvectors in V_0 are perpendicular to all the eigenvectors in V_P.


6.10.3 Graphic Representation of U, U_P, U_0, V, V_P, V_0 Spaces

Recall that starting with Equation (1.13)

	 G    m   =   d		(1.13)
	N×M  M×1     N×1

we obtained the fundamental decomposition theorem

	 G  =  U    Λ    V^T		(6.60)
	N×M   N×N  N×M   M×M

and singular-value decomposition

	 G  =  U_P  Λ_P  V_P^T		(6.69)
	N×M    N×P  P×P  P×M

This gives us the following:

1. Recall the definitions of U, U_P, and U_0

	U = [ u_1 ⋯ u_P | u_(P+1) ⋯ u_N ] = [ U_P | U_0 ]		(6.62)

2. Similarly, recall the definitions for V, V_P, and V_0

	V = [ v_1 ⋯ v_P | v_(P+1) ⋯ v_M ] = [ V_P | V_0 ]		(6.63)

3. Combining U, U_P, U_0, and V, V_P, V_0 graphically:

	U = [ U_P | U_0 ]   (N rows; P columns in U_P, N − P columns in U_0)
	V = [ V_P | V_0 ]   (M rows; P columns in V_P, M − P columns in V_0)		(6.77)

4. Summarizing:

(1) V is an M × M matrix with the eigenvectors of G^T G as columns. It is an orthogonal matrix.

(2) V_P is an M × P matrix with the P eigenvectors of G^T G associated with nonzero eigenvalues of G^T G. V_P is a semiorthogonal matrix.

(3) V_0 is an M × (M − P) matrix with the M − P eigenvectors of G^T G associated with the zero eigenvalues of G^T G. V_0 is a semiorthogonal matrix.

(4) The eigenvectors in V, V_P, or V_0 are all M-dimensional vectors. They are all associated with model space, since m, the model parameter vector of Gm = d, is an M-dimensional vector.

(5) U is an N × N matrix with the eigenvectors of GG^T as columns. It is an orthogonal matrix.

(6) U_P is an N × P matrix with the P eigenvectors of GG^T associated with the nonzero eigenvalues of GG^T. U_P is a semiorthogonal matrix.

(7) U_0 is an N × (N − P) matrix with the N − P eigenvectors of GG^T associated with the zero eigenvalues of GG^T. U_0 is a semiorthogonal matrix.

(8) The eigenvectors in U, U_P, or U_0 are all N-dimensional vectors. They are all associated with data space, since d, the data vector of Gm = d, is an N-dimensional vector.


6.11 Classification of d = Gm Based on P, M, and N


6.11.1 Introduction

In Section 3.3 we introduced a classification of the system of equations

d = Gm (1.13)

based on the dimensions of d (N × 1) and m (M × 1). I said at the time that I found the
classification lacking, and would return to it later. Now that we have considered singular-value
decomposition, including finding P, the number of nonzero singular values, I would like to
introduce a better classification scheme.

There are four basic classes of problems, based on the relationship between P, M, and N.
We will consider each class one at a time below.


6.11.2 Class I: P = M = N

Graphically, for this case, we have U_P = U and V_P = V (both N × N); there are no null spaces.

1. U_0 and V_0 are empty.

2. G has a unique, mathematical inverse G^(−1).

3. There is a unique solution for m.

4. The data can be fit exactly.

6.11.3 Class II: P = M < N

Graphically, for this case, we have U = [U_P | U_0] (N × N, with U_P being N × P and U_0 being N × (N − P)) and V_P = V (M × M).

1. V_0 is empty since P = M.

2. U_0 is not empty since P < N.

3. G has no mathematical inverse.

4. There is a unique solution in the sense that only one solution has the smallest misfit to the data.

5. The data cannot be fit exactly unless the compatibility equations are satisfied, which are defined as follows:

	U_0^T d = 0		(6.78)
	((N − P) × N)(N × 1) = (N − P) × 1

If the compatibility equations are satisfied, one can fit the data exactly. The compatibility equations are equivalent to saying that d has no projection onto U_0.

Equation (6.78) can be thought of as the N − P dot products of the eigenvectors in U_0 with the data vector d. If all of the dot products are zero, then d has no component in the (N − P)-dimensional subset of N-space spanned by the vectors in U_0. G, operating on any vector m, can only predict a vector that lies in the P-dimensional subset of N-space spanned by the P eigenvectors in U_P. We will return to this later.

6. P = M < N is the classic least squares environment. We will consider least squares again when we introduce the generalized inverse.


6.11.4 Class III: P = N < M

Graphically, for this case, we have U_P = U (N × N) and V = [V_P | V_0] (M × M, with V_P being M × P and V_0 being M × (M − P)).

1. U_0 is empty since P = N.

2. V_0 is not empty since P < M.

3. G has no mathematical inverse.

4. You can fit the data exactly because U_0 is empty.

5. The solution is not unique. If m_est is a solution which fits the data exactly,

	G m_est = d_pre = d		(6.79)

then

	m_est + Σ_{i=P+1}^{M} α_i v_i		(6.80)

is also a solution, where α_i is any arbitrary constant.

6. P = N < M is the minimum length environment. The minimum length solution sets α_i, i = (P + 1), . . . , M, to zero.


6.11.5 Class IV: P < min(N, M)

Graphically, for this case, we have U = [U_P | U_0] and V = [V_P | V_0]; neither null space is empty.

1. Neither U_0 nor V_0 is empty.

2. G has no mathematical inverse.

3. You cannot fit the data exactly unless the compatibility equations (Equation 6.78) are satisfied.

4. The solution is nonunique.

This sounds like a pretty bleak environment. No mathematical inverse. Cannot fit the data. The solution is nonunique. It probably comes as no surprise that most realistic problems are of this type [P < min(N, M)]!

In the next chapter we will introduce the generalized inverse operator. It will reduce to the unique mathematical inverse when P = M = N. It will reduce to the least squares operator when we have P = M < N, and to the minimum length operator when P = N < M. It will also give us a solution in the general case where we have P < min(N, M) that has many of the properties of the least squares and minimum length solutions.




CHAPTER 7: THE GENERALIZED INVERSE AND
MEASURES OF QUALITY


7.1 Introduction


Thus far we have used the shifted eigenvalue problem to do singular-value decomposition for the system of equations Gm = d. That is, we have

	 G  =  U    Λ    V^T		(6.60)
	N×M   N×N  N×M   M×M

and also

	 G  =  U_P  Λ_P  V_P^T		(6.69)
	N×M    N×P  P×P  P×M

where U is an N × N orthogonal matrix whose ith column is given by the ith eigenvector u_i, which satisfies

	G G^T u_i = η_i² u_i		(6.29)

V is an M × M orthogonal matrix whose ith column is given by the ith eigenvector v_i, which satisfies

	G^T G v_i = η_i² v_i		(6.21)

Λ is an N × M diagonal matrix with the singular values λ_i = +√(η_i²) along the diagonal. U_P, Λ_P, and V_P are the subsets of U, Λ, and V, respectively, associated with the P nonzero singular values, P ≤ min(N, M).

We found four classes of problems for Gm = d based on P, N, M:

Class I: P = N = M; G^(−1) (mathematical) exists.

Class II: P = M < N; least squares. Recall m_LS = [G^T G]^(−1) G^T d.

Class III: P = N < M; minimum length. Recall m_ML = <m> + G^T [G G^T]^(−1) [d − G<m>].

Class IV: P < min(N, M); at present, we have no way of obtaining an m_est.

Thus, in this chapter we seek an inverse operator that has the following properties:

1. Reduces to G^(−1) when P = N = M.

2. Reduces to [G^T G]^(−1) G^T when P = M < N (least squares).

3. Reduces to G^T [G G^T]^(−1) when P = N < M (minimum length).

4. Exists when P < min(N, M).

In the following pages we will consider each of these classes of problems, beginning with P = N = M (Class I). In this case,

	 G  =  U    Λ    V^T		(6.60)
	N×N   N×N  N×N   N×N

with

	Λ = diag(λ_1, λ_2, …, λ_N)		(7.1)

Since P = N = M, there are no zero singular values and we have

	λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_N > 0		(7.2)

In order to find an inverse operator based on Equation (6.60), we need to find the inverse of a product of matrices. Applying the results from Equation (2.8) to Equation (6.60) above gives

	G^(−1) = [V^T]^(−1) Λ^(−1) U^(−1)		(7.3)

We know Λ^(−1) exists and is given by

	Λ^(−1) = diag(1/λ_1, 1/λ_2, …, 1/λ_N)   (N × N)		(7.4)

We now make use of the fact that both U and V are orthogonal matrices. The properties of U and V that we wish to use are

	[V^T]^(−1) = V		(7.5)

and

	U^(−1) = U^T		(7.6)

Therefore

	G^(−1)  =  V   Λ^(−1)  U^T,    P = N = M		(7.7)
	 N×N      N×N   N×N    N×N

Equation (7.7) implies that G^(−1), the mathematical inverse of G, can be found using singular-value decomposition when P = N = M.

What we need now is to find an operator for the other three classes of problems that will reduce to the mathematical inverse G^(−1) when it exists.


7.2 The Generalized Inverse Operator G_g^(−1)

7.2.1 Background Information

We start out with three pieces of information:

1. G = U Λ V^T		(6.60)

2. G^(−1) = V Λ^(−1) U^T   (when G^(−1) exists)		(7.7)

3. G = U_P Λ_P V_P^T   (singular-value decomposition)		(6.69)

Then, by analogy with the way the inverse in Equation (7.7) was defined from the form of Equation (6.60), we introduce the generalized inverse operator:

	G_g^(−1)  =  V_P  Λ_P^(−1)  U_P^T		(7.8)
	  M×N        M×P    P×P      P×N

and find out the consequences for our four cases. Menke points out that there may be many generalized inverses, but Equation (7.8) is by far the most common generalized inverse.
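In numerical practice, Equation (7.8) is what pseudoinverse routines compute. A short Python/numpy sketch (illustrative only; the rank-deficient G is a made-up example, and the cutoff for "nonzero" singular values is an assumption):

import numpy as np

# Sketch: the generalized inverse V_P Lambda_P^-1 U_P^T is what numpy's pinv
# computes, with small singular values treated as zero.
rng = np.random.default_rng(6)
G = rng.standard_normal((5, 4)) @ rng.standard_normal((4, 7))   # N=5, M=7, P<=4

U, sv, Vt = np.linalg.svd(G, full_matrices=False)
keep = sv > 1e-10 * sv[0]                       # the P nonzero singular values
G_g = Vt[keep].T @ np.diag(1.0 / sv[keep]) @ U[:, keep].T

print(np.allclose(G_g, np.linalg.pinv(G)))      # True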


7.2.2 Class I: P = N = M

In this case, we have

1. V_P = V and U_P = U.

2. V_0 and U_0 are empty.

We start with the definition of the generalized inverse operator in Equation (7.8):

	G_g^(−1) = V_P Λ_P^(−1) U_P^T		(7.8)

But, since P = M we have

	V_P = V		(7.9)

Similarly, since P = N we have

	U_P = U		(7.10)

Finally, since P = N = M, we have

	Λ_P^(−1) = Λ^(−1)		(7.11)

Therefore, combining Equations (7.8)–(7.11), we recover Equation (7.7),

	G^(−1) = V Λ^(−1) U^T,   P = N = M		(7.7)

the exact mathematical inverse. Thus, we have shown that the generalized inverse operator reduces to the exact mathematical inverse in the case of P = N = M. Next we consider the case of P = M < N.


7.2.3 Class II: P = M < N

This is the least squares environment where we have more observations than unknowns, but where a unique solution exists. In this case, we have:

1. V_P = V.

2. V_0 is empty.

3. U_0 exists.

Ultimately we wish to show that the generalized inverse operator reduces to the least squares operator when P = M < N.
The Role of G^T G

Recall that the least squares operator, as defined in Equation (3.27), for example, is given by

	m_LS = [G^T G]^(−1) G^T d		(3.31)

We first consider G^T G, using singular-value decomposition for G, obtaining

	G^T G = [U_P Λ_P V_P^T]^T [U_P Λ_P V_P^T]		(7.12)

Recall that the transpose of the product of matrices is given by the product of the transposes in the reverse order:

	[A B]^T = B^T A^T

Therefore

	G^T G = [V_P^T]^T Λ_P^T U_P^T U_P Λ_P V_P^T		(7.13)

or

	G^T G = V_P Λ_P U_P^T U_P Λ_P V_P^T		(7.14)

since

	Λ_P^T = Λ_P		(7.15)

and

	[V_P^T]^T = V_P		(7.16)

We know, however, that U_P is a semiorthogonal matrix. Thus

	U_P^T U_P = I_P		(7.17)

Therefore, Equation (7.14) reduces to

	G^T G = V_P Λ_P Λ_P V_P^T		(7.18)

Now, consider the product of Λ_P with itself:

	Λ_P Λ_P = diag(λ_1, …, λ_P) diag(λ_1, …, λ_P)		(7.19)

or

	Λ_P Λ_P = diag(λ_1², …, λ_P²) ≡ Λ_P²		(7.20)

where we have introduced the notation Λ_P² for the product of Λ_P with itself.

Please be aware, as noted before, that the notation A², when A is a matrix, has no universally accepted definition. We will use the definition implied by Equation (7.20). Finally, then, we write Equation (7.18) as

	G^T G  =  V_P  Λ_P²  V_P^T		(7.21)
	 M×M      M×P  P×P   P×M

Finding the Inverse of G^T G

I claim that G^T G has a mathematical inverse [G^T G]^(−1) when P = M. The reason that [G^T G]^(−1) exists in this case is that G^T G has the following eigenvalue problem:

	G^T G v_i = η_i² v_i,   i = 1, . . . , M		(6.21)

where, because P = M, we know that all M of the η_i² are nonzero. That is, G^T G has no zero eigenvalues. Thus, it has a mathematical inverse.

Using the theorem presented earlier about the inverse of a product of matrices in Equation (2.8), we have

	[G^T G]^(−1) = [V_P^T]^(−1) [Λ_P²]^(−1) V_P^(−1)		(7.22)

The inverse of V_P^T is found as follows. First,

	V_P^T V_P = I_P		(7.23)

is always true because V_P is semiorthogonal. But, because we have that P = M in this case, we also have

	V_P V_P^T = I_M		(7.24)

Thus, V_P is itself an orthogonal matrix, and we have

	[V_P^T]^(−1) = V_P		(7.25)

and

	V_P^(−1) = V_P^T		(7.26)

Thus, we can write Equation (7.22) as

	[G^T G]^(−1) = V_P [Λ_P²]^(−1) V_P^T		(7.27)

Finally, we note that [Λ_P²]^(−1) is given by

	[Λ_P²]^(−1) ≡ Λ_P^(−2) = diag(1/λ_1², 1/λ_2², …, 1/λ_P²)		(7.28)

Therefore

	[G^T G]^(−1) = V_P Λ_P^(−2) V_P^T		(7.29)

where Λ_P^(−2) is as defined in Equation (7.28).

Equivalence of G_g^(−1) and Least Squares When P = M < N

We start with the least squares operator, from Equation (3.28), for example:

	G_LS^(−1) = [G^T G]^(−1) G^T		(3.32)

We can use Equation (7.29) for [G^T G]^(−1) in Equation (3.32) and singular-value decomposition for G^T to obtain

	[G^T G]^(−1) G^T = V_P Λ_P^(−2) V_P^T [U_P Λ_P V_P^T]^T		(7.30)

	                 = V_P Λ_P^(−2) V_P^T V_P Λ_P U_P^T		(7.31)

But, because V_P is semiorthogonal, we have from Equation (7.23)

	V_P^T V_P = I_P		(7.23)

Thus, Equation (7.31) becomes

	[G^T G]^(−1) G^T = V_P Λ_P^(−2) Λ_P U_P^T		(7.32)

Now, considering Equations (7.20) and (7.28), we see that

	Λ_P^(−2) Λ_P = Λ_P^(−1)		(7.33)

Finally, then, Equation (7.32) becomes

	[G^T G]^(−1) G^T = V_P Λ_P^(−1) U_P^T = G_g^(−1)		(7.34)

as required. That is, we have shown that when P = M < N, the generalized inverse operator is equivalent to the least squares operator.
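This equivalence is easy to check numerically. A short Python/numpy sketch (illustrative only; the random overdetermined G is an assumption):

import numpy as np

# Sketch: for an overdetermined full-rank G (P = M < N), the generalized
# inverse V_P Lambda_P^-1 U_P^T equals the least squares operator.
rng = np.random.default_rng(7)
G = rng.standard_normal((8, 3))                  # N = 8, M = P = 3

U_p, sv, V_pT = np.linalg.svd(G, full_matrices=False)
G_g = V_pT.T @ np.diag(1.0 / sv) @ U_p.T         # generalized inverse
G_ls = np.linalg.inv(G.T @ G) @ G.T              # least squares operator

print(np.allclose(G_g, G_ls))                    # True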


Geometrical Interpretation of G_g^(−1) When P = M < N

It is possible to gain some insight into the generalized inverse operator by considering a geometrical argument. An arbitrary data vector d may have components in both U_P and U_0 spaces. The generalized inverse operator returns a solution m_g for which the predicted data lie completely in U_P space and minimize the misfit to the observed data. The steps necessary to see this follow:

Step 1. Let m_g be the generalized inverse solution, given by

	m_g = G_g^(−1) d		(7.35)

Step 2. Let d̂ be the predicted data, given by

	d̂ = G m_g		(7.36)

We now introduce the following theorem about the relationship of the predicted data to U_P space.

Theorem: d̂ lies completely in U_P-space (the subset of N-space spanned by the P eigenvectors in U_P).

Proof: If d̂ lies in U_P-space, it is orthogonal to U_0-space. That is,

	U_0^T d̂ = U_0^T G m_g = U_0^T U_P Λ_P V_P^T m_g = 0   ((N − P) × 1)		(7.37)

which follows from

	U_0^T U_P = 0		(7.38)

That is, every eigenvector in U_0 is perpendicular to every eigenvector in U_P. Another way to see this is that all of the eigenvectors in U are perpendicular to each other. Thus, any subset of U is perpendicular to the rest of U. Q.E.D.

Step 3. Let d − d̂ be the residual data vector (i.e., observed minus predicted data, also known as the misfit vector), given by

	d − d̂ = d − G m_g
	      = d − G [G_g^(−1) d]
	      = d − G G_g^(−1) d
	      = d − [U_P Λ_P V_P^T] [V_P Λ_P^(−1) U_P^T] d
	      = d − U_P U_P^T d		(7.39)

We cannot further reduce Equation (7.39) whenever P < N because in this case

	U_P U_P^T ≠ I_N

Next, we introduce a theorem about the relationship between the misfit vector and U_0 space.

Theorem: The misfit vector d − d̂ is orthogonal to U_P.

Proof:

	U_P^T [d − d̂] = U_P^T [d − U_P U_P^T d]
	              = U_P^T d − U_P^T U_P U_P^T d
	              = U_P^T d − U_P^T d
	              = 0   (P × 1)   Q.E.D.		(7.40)

The crucial step in going from the second to the third line is that U_P^T U_P = I_P, since U_P is semiorthogonal. This implies that the misfit vector d − d̂ lies completely in the space spanned by U_0.

Combining the results from the above two theorems, we introduce the final theorem of this section concerning the relationship between the predicted data and the misfit vector.

Theorem: The predicted data vector d̂ is perpendicular to the misfit vector d − d̂.

Proof:
1. The predicted data vector d̂ lies in U_P space.
2. The misfit vector d − d̂ lies in U_0 space.
3. Since the vectors in U_P are perpendicular to the vectors in U_0, d̂ is perpendicular to the misfit vector d − d̂. Q.E.D.

Step 4. Consider the following schematic graph showing the relationship between the various vectors and spaces:

[Figure: schematic of data space with axes labeled U_P-space and U_0-space; the observed data vector d has components along both axes, the predicted data d̂ = G m_g lies along the U_P axis, and the misfit vector d − d̂ connects d̂ to d along the U_0 direction.]

The data vector d has components in both U_P and U_0 spaces. Note the following points:

1. The predicted data vector d̂ = G m_g lies completely in U_P space.

2. The residual vector d̂ − d lies completely in U_0 space.

3. The shortest distance from the observed data d to the U_P axis is given by the misfit vector d̂ − d.

Thus the generalized inverse G_g^(−1) minimizes the distance between the observed data vector d and U_P, the subset of data space in which all possible predicted data d̂ must lie.

Recall that the least squares operator given in Equation (3.32)

	G_LS^(−1) = [G^T G]^(−1) G^T		(3.32)

minimizes the length of the misfit vector. Thus, the generalized inverse operator is equivalent to the least squares operator when P = M < N.

Step 5. For P = M < N, it is possible to write the generalized inverse without forming U_P. To
        see this, note that the generalized inverse is equivalent to least squares for P = M < N.
        That is,

            G_g^{-1} = [G^T G]^{-1} G^T                         (7.41)

        But, by Equation (7.28), [G^T G]^{-1} is given by

            [G^T G]^{-1} = V_P Λ_P^{-2} V_P^T                   (7.27)

        Thus, the generalized inverse in this case is given by

            G_g^{-1} = [G^T G]^{-1} G^T = V_P Λ_P^{-2} V_P^T G^T        (7.42)

        Equation (7.42) shows that the generalized inverse can be found without ever forming
        U_P when P = M < N. In general, this shortcut is not used, even though you can form the
        inverse this way, because useful information about data space is lost.
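
        As a quick numerical check of Equations (7.41)-(7.42), the following minimal numpy
        sketch (the 4 × 2 G matrix is made up purely for illustration) builds the generalized
        inverse three ways and confirms they agree for a P = M < N problem.

        import numpy as np

        # Hypothetical overdetermined example: N = 4 observations, M = 2 parameters, P = M < N.
        G = np.array([[1.0, 0.0],
                      [1.0, 1.0],
                      [1.0, 2.0],
                      [1.0, 3.0]])

        # Compact singular-value decomposition G = U_P Lambda_P V_P^T.
        U_p, lam, V_pT = np.linalg.svd(G, full_matrices=False)
        V_p = V_pT.T

        G_gen   = V_p @ np.diag(1.0 / lam)    @ U_p.T        # V_P Lambda_P^{-1} U_P^T, Eq. (7.8)
        G_short = V_p @ np.diag(1.0 / lam**2) @ V_pT @ G.T   # V_P Lambda_P^{-2} V_P^T G^T, Eq. (7.42)
        G_ls    = np.linalg.inv(G.T @ G) @ G.T               # least squares operator, Eq. (3.32)

        print(np.allclose(G_gen, G_ls), np.allclose(G_short, G_ls))   # True True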


Step 6. Finally, recall the compatibility equations given by

            U_0^T d = 0                                         (6.78)
            (N − P) × 1

        Note that if the observed data d has any projection in U_0 space, it is not possible to find
        a solution m that can fit the data exactly. All estimates m lead to predicted data Gm that
        lie in U_P space. Thus, from the graph above, one sees that if the observed data, d, lie
        completely in U_P space, the compatibility equations are automatically satisfied.
7.2.4 Class III: P = N < M

        This is the minimum length environment, where we have more model parameters than
observations. There are an infinite number of possible solutions that can fit the data exactly.
Recall that the minimum length solution is the one that has the shortest length. Ultimately we
wish to show that the generalized inverse operator reduces to the minimum length operator when
P = N < M.

        For P = N < M we have

        1. U_P = U.

        2. U_0 is empty.

        3. V_0 is not empty.


The Role of GG^T

        Recall that the minimum length operator, as defined in Equation (3.75), is given by

            G_ML^{-1} = G^T [G G^T]^{-1}                        (3.75)

We seek, thus, to show that G_g^{-1} = G^T [G G^T]^{-1} in this case. First consider writing G G^T using
singular-value decomposition:

            G G^T = U_P Λ_P V_P^T [U_P Λ_P V_P^T]^T
                  = U_P Λ_P V_P^T V_P Λ_P U_P^T
                  = U_P Λ_P^2 U_P^T                             (7.43)


Finding the Inverse of GG^T

        Note that G G^T is N × N and P = N. This implies that [G G^T]^{-1}, the mathematical inverse
of G G^T, exists. Again using the theorem stated in Equation (2.8) about the inverse of a product
of matrices, we have

            [G G^T]^{-1} = [U_P^T]^{-1} [Λ_P^2]^{-1} U_P^{-1}
                         = U_P Λ_P^{-2} U_P^T          P = N    (7.44)

Then

            G^T [G G^T]^{-1} = [U_P Λ_P V_P^T]^T U_P Λ_P^{-2} U_P^T
                             = V_P Λ_P U_P^T U_P Λ_P^{-2} U_P^T
                             = V_P Λ_P Λ_P^{-2} U_P^T
                             = V_P Λ_P^{-1} U_P^T
                             = G_g^{-1}                         (7.45)

as required.


Fitting the Data Exactly When P = N < M

        As before, let the generalized inverse solution m_g be given by

            m_g = G_g^{-1} d                                    (7.35)

Then the predicted data d̂ are given by

            d̂ = G m_g                                           (7.36)
               = G [G_g^{-1} d]
               = G G_g^{-1} d
               = U_P Λ_P V_P^T V_P Λ_P^{-1} U_P^T d
               = U_P U_P^T d
               = d                                              (7.46)

since U_P U_P^T = I_N whenever P = N.

        Thus, one can fit the data exactly whenever P = N. The reason is that U_0 is empty when
P = N. That is, U_P is equal to U space.


The Generalized Inverse Solution m_g Lies in V_P Space

        The generalized inverse solution m_g is given by

            m_g = G_g^{-1} d                                    (7.35)

and is a vector in model space. It lies completely in V_P space. The way to see this is to take the
dot product of m_g with the eigenvectors in V_0. If m_g has no projection in V_0 space, then it lies
completely in V_P space.

        Thus,

            V_0^T m_g = V_0^T G_g^{-1} d
                      = V_0^T V_P Λ_P^{-1} U_P^T d
                      = 0                                       (7.47)
                        (M − P) × 1

since V_0^T V_P = 0.


Nonuniqueness of the Solution When P = N < M

        The solution to Gm = d is nonunique because V_0 exists when P < M. Let the general
solution m̃ to Gm = d be given by

            m̃ = m_g + Σ_{i=P+1}^{M} α_i v_i                     (7.48)

That is, the general solution is given by the generalized inverse solution m_g plus a linear
combination of the eigenvectors in V_0 space, where the α_i are constants. The predicted data for
the general case are given by

            G m̃ = G [m_g + Σ_{i=P+1}^{M} α_i v_i]
                 = G m_g + Σ_{i=P+1}^{M} α_i G v_i              (7.49)

When G operates on a vector in V_0 space, however, it returns a zero vector. That is,

            G V_0 = U_P Λ_P V_P^T V_0
                  = 0                                           (7.50)
                    N × (M − P)

which follows from the fact that the eigenvectors in V_P are perpendicular to the eigenvectors in
V_0. Thus,

            G m̃ = G m_g + 0
                 = d                                            (7.51)

        Now, consider the length squared of m̃:

            ||m̃||^2 = ||m_g||^2 + Σ_{i=P+1}^{M} α_i^2           (7.52)

which follows from the fact that [v_i]^T v_j = δ_ij. Thus,

            ||m̃||^2 ≥ ||m_g||^2                                 (7.53)

That is, m_g, the generalized inverse solution, is the smallest of all possible solutions to Gm = d.
This is precisely what was stated at the beginning of this section: the generalized inverse
solution is equivalent to the minimum length solution when P = N < M.
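
        A minimal numpy sketch of this result (the 2 × 3 G and the data vector are made up for
        illustration): the minimum length solution fits the data exactly, any added V_0 component
        still fits the data, and the minimum length solution has the shortest length.

        import numpy as np

        # Hypothetical underdetermined example: N = 2 observations, M = 3 parameters, P = N < M.
        G = np.array([[1.0, 1.0, 0.0],
                      [0.0, 1.0, 1.0]])
        d = np.array([2.0, 3.0])

        # Minimum length / generalized inverse solution, Eqs. (3.75) and (7.45).
        m_g = G.T @ np.linalg.inv(G @ G.T) @ d
        print(np.allclose(G @ m_g, d))                          # True: data fit exactly, Eq. (7.46)

        # Adding any V_0 (null space) component leaves the fit unchanged, Eqs. (7.49)-(7.51).
        _, _, VT = np.linalg.svd(G)
        v0 = VT[2]                                              # the single eigenvector in V_0 (M - P = 1)
        m_tilde = m_g + 0.7 * v0
        print(np.allclose(G @ m_tilde, d))                      # True: still fits the data
        print(np.linalg.norm(m_tilde) >= np.linalg.norm(m_g))   # True: m_g is shortest, Eq. (7.53)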


It Is Possible to Write G_g^{-1} Without V_P When P = N < M

        To see this, we write the generalized inverse operator as the minimum length operator
and use singular-value decomposition. That is,

            G_g^{-1} = G^T [G G^T]^{-1}
                     = G^T U_P Λ_P^{-2} U_P^T                   (7.54)

Typically, this shortcut is not used because knowledge of V_P space is useful in the interpretation
of the results.


7.2.5 Class IV: P < min(N, M)

        This is the class of problems for which neither the least squares nor the minimum length
operator exists. That is, the least squares operator

            G_LS^{-1} = [G^T G]^{-1} G^T                        (3.32)

does not exist because [G^T G]^{-1} exists only when P = M. Similarly, the minimum length operator

            G_ML^{-1} = G^T [G G^T]^{-1}                        (3.75)

does not exist because [G G^T]^{-1} exists only when P = N.

        For P < min(N, M) we have

        1. V_0 is not empty.

        2. U_0 is not empty.

In this environment the solution is both nonunique (because V_0 exists) and it is impossible to fit
the data exactly unless the compatibility equations [Equation (6.78)] are satisfied. That is, it is
impossible to fit the data exactly unless the data have no projection onto U_0 space.

        The generalized inverse operator cannot be further reduced and is given by Equation (7.8):

            G_g^{-1} = V_P Λ_P^{-1} U_P^T                       (7.8)

The generalized inverse operator G_g^{-1} simultaneously minimizes the length of the misfit vector
d − d̂ in data space and the solution length ||m_g||^2 in model space.

        In summary, in this section we have shown that the generalized inverse operator G_g^{-1}
reduces to

        1. The exact inverse when P = N = M.

        2. The least squares inverse [G^T G]^{-1} G^T when P = M < N.

        3. The minimum length inverse G^T [G G^T]^{-1} when P = N < M.

        Since we have shown that the generalized inverse is equivalent to the exact, least squares,
and minimum length operators when they exist, it is worth comparing the way the solution m_g is
written. In the least squares or unique inverse environment, for example, we would write

            m_g = G_g^{-1} d                                    (7.55)

but in the minimum length environment we would write

            m_g = <m> + G_g^{-1} [d − G<m>]                     (7.56)

which explicitly includes a dependence on the prior estimate <m>. It is somewhat disconcerting
to have to carry around two forms of the solution for the generalized inverse. Consider what
happens, however, if we use Equation (7.56) for the unique or least squares environment. Then

            m_g = <m> + G_g^{-1} [d − G<m>]
                = <m> + G_g^{-1} d − G_g^{-1} G<m>
                = G_g^{-1} d + [I_M − G_g^{-1} G]<m>            (7.57)

For the unique inverse environment, G_g^{-1} G = I_M, and hence Equation (7.57) reduces to
Equation (7.55). For the least squares environment, we have

            G_g^{-1} G = [G^T G]^{-1} G^T G = I_M               (7.58)

and hence Equation (7.57) again reduces to Equation (7.55). The unique inverse and least
squares environments thus have no dependence on <m>. Equation (7.56), however, is true for
the generalized inverse in all environments and is thus adopted as the general form of the
generalized inverse solution m_g.

        In the next section we will introduce measures of the quality of the generalized inverse
operator. These will include the model resolution matrix R, the data resolution matrix N (also
called the data information density matrix), and the unit (model) covariance matrix [cov_u m].


7.3 Measures of Quality for the Generalized Inverse


7.3.1 Introduction

        In this section three measures of quality for the generalized inverse will be considered.
They are

        1. The M × M model resolution matrix R

        2. The N × N data resolution matrix N

        3. The M × M unit covariance matrix [cov_u m]

The model resolution matrix R measures the ability of the inverse operator to uniquely determine
the estimated model parameters. The data resolution matrix N measures the ability of the inverse
operator to uniquely determine the data. This is equivalent to describing the importance, or
independent information, provided by the data. The two resolution matrices depend upon the
partitioning of model and data spaces into V_P, V_0 and U_P, U_0 spaces, respectively. Finally, the
unit covariance matrix [cov_u m] is a measure of how uncorrelated noise with unit variance in the
data is mapped into uncertainties in the estimated model parameters.


7.3.2 The Model Resolution Matrix R

        Imagine for the moment that there is some true solution m_true that exactly satisfies

            G m_true = d                                        (7.59)

In any inversion, we estimate this true solution with m_est:

            m_est = G_est^{-1} d                                (7.60)

where G_est^{-1} is some inverse operator. It is then possible to ask how m_est compares to m_true.
        Specifically considering the generalized inverse, we start with Equation (7.35) and
replace d with G m_true, obtaining

            m_g = G_g^{-1} G m_true                             (7.61)

The model resolution matrix R is then defined as

            R = G_g^{-1} G                                      (7.62)

where R is an M × M symmetric matrix.

        If R = I_M, then m_est = m_true, and we say that all of the model parameters are perfectly
resolved, or equivalently that all of the model parameters are uniquely determined. If R ≠ I_M,
then m_est is some weighted average of m_true.

        Consider the kth element of m_est, denoted m_k^est, given by the product of the kth row of R
and m_true:

            m_k^est = [kth row of R] [m_1^true, m_2^true, . . . , m_M^true]^T       (7.63)

        The rows of R can thus be seen as windows, or filters, through which the true solution is
viewed. For example, suppose that the kth row of R is given by

            R_k = [0, 0, . . . , 0, 1, 0, . . . , 0, 0]         (7.64)
                                    ↑
                               kth column

We see that

            m_k^est = 0·m_1^true + · · · + 0·m_{k−1}^true + 1·m_k^true + 0·m_{k+1}^true + · · · + 0·m_M^true    (7.65)

or simply

            m_k^est = m_k^true                                  (7.66)

In this case we say that the kth model parameter is perfectly resolved, or uniquely determined.
Suppose, however, that the kth row of R were given by

            R_k = [0, . . . , 0, 0.1, 0.3, 0.8, 0.4, 0.2, 0, . . . , 0]     (7.67)
                                        ↑
                                   kth column

Then the kth estimated model parameter m_k^est is given by

            m_k^est = 0.1 m_{k−2}^true + 0.3 m_{k−1}^true + 0.8 m_k^true + 0.4 m_{k+1}^true + 0.2 m_{k+2}^true  (7.68)

That is, m_k^est is a weighted average of several terms in m_true. In the case just considered, it depends
most heavily (0.8) on m_k^true, but it also depends on other components of the true solution. We
say, then, that m_k^est is not perfectly resolved in this case. The closer the row of R is to the row of
an identity matrix, the better the resolution.

        From the above discussion, it is clear that model resolution may be considered element by
element. If R = I_M, then all elements are perfectly resolved. If a single row of R is equal to the
corresponding row of the identity matrix, then the associated model parameter estimate is
perfectly resolved.

        Finally, we can rewrite Equation (7.57) as

            m_g = G_g^{-1} d + [I − R]<m>                       (7.69)


Some Properties of R

        1. R = V_P V_P^T                                        (7.70)

        Using singular-value decomposition on Equation (7.62), R can be written as

            R = {V_P Λ_P^{-1} U_P^T}{U_P Λ_P V_P^T}
              = V_P Λ_P^{-1} Λ_P V_P^T
              = V_P V_P^T                                       (7.71)

        In general, V_P V_P^T ≠ I. However, if P = M, then V_P = V, and V_0 is empty. In this case,

            R = V_P V_P^T = V V^T = I_M                         (7.72)

        since V is an orthogonal matrix. Thus, the condition for perfect model resolution is that
        V_0 be empty, or equivalently that P = M.

        2. Trace(R) = Σ_{i=1}^{M} r_ii = P, the number of nonzero singular values


        Proof: If R = I_M, then P = M and trace(R) = M.

        For the general case, it is possible to write R as the product of the following three
        partitioned matrices:

            R = [V_P  V_0] [ I_P  0 ] [ V_P^T ]
                           [  0   0 ] [ V_0^T ]                 (7.73)

              = V A V^T

        where no part of V_0 actually contributes to R because of the extra zeros in A.

        The trace of A is equal to P. Note, however, that matrix A has been obtained from R
        by an orthogonal transformation because V is an orthogonal matrix. Thus, by
        Equation (2.43), which states that the trace of a matrix is unchanged by an orthogonal
        transformation, we conclude that trace(R) = P, as required.   Q.E.D.

        Trace(R) = P implies that G has enough information to uniquely resolve P aspects of the
        solution. These aspects are, in fact, the P directions in model space given by the
        eigenvectors v_i in V_P. Whenever a row of R is equal to a row of the identity matrix I,
        then no part of the associated model parameter m_i falls in V_0 space (i.e., it all falls in V_P
        space) and that model parameter is perfectly resolved. When R is not equal to the
        identity matrix I, some part of the problem is not perfectly resolved. Sometimes this is
        acceptable and other times it is not, depending on the problem. Forming new model
        parameters as linear combinations of the old model parameters is one way to reduce the
        nonuniqueness of the problem. One way to do this is to form new model parameters by
        using the eigenvectors v_i to define the linear combinations. Suppose that v_i, i ≤ P, is
        given by

            v_i = (1/√M) [1, 1, 1, . . . , 1]^T                 (7.74)

        This tells us that the average of all the model parameters is resolved, even if the
        individual model parameters may not be. If we defined a new model parameter as the
        average of all the old model parameters, it would be perfectly resolved.

        If, as is often the case, G represents some kind of an averaging function, you can attempt
        to reduce the nonuniqueness of the problem by forming new model parameters that are
        the sum or average of a subset of the old ones, even without using the full information in
        V_P. If the model parameters are discretized versions of a continuous function, such as
        velocity or density versus depth, you may be able to improve the resolution by combining
        layers. A rule of thumb in this case is to sum the entries along the diagonal of the
        resolution matrix R until you get close to one. At this point, your system is able to
        resolve one aspect of the solution uniquely. You can try forming a new model parameter
        as the average of the layer velocities or densities up to this point. Depending on the
        details of G, you may have perfect resolution of this average of the old model parameters.
        Depending on the problem, it may be more useful to uniquely know the average of the
        model parameters over some depth range than it is to have nonunique estimates of the
        values over the same range.

        3. Σ_{j=1}^{M} r_ij^2 = Σ_{j=1}^{M} r_ji^2 = r_ii = importance of ith model parameter

        If r_ii = 1, then the ith model parameter is uniquely resolved, and it is thus said to be very
        important. If, on the other hand, r_ii is very small, then the parameter is poorly resolved
        and is said to be not very important.

        If we further note that R can be written as

            R = [r_1  r_2  · · ·  r_M]                          (7.75)

        where r_i is the ith column of R, then the estimated solution m_g from (7.61) can be written
        as

            m_g = R m_true
                = r_1 m_1 + r_2 m_2 + · · · + r_M m_M           (7.76)

        where m_i is the ith component of m_true, which follows from (2.23)-(2.30). That is, the
        estimated solution vector can also be thought of as the weighted sum of the columns of R,
        with the weighting factors being given by the true solution.

        The length squared of each column of R can then be thought of as the importance of m_i in
        the solution. The length squared of r_i is given by

            ||r_i||^2 = Σ_{j=1}^{M} r_ji^2 = r_ii               (7.77)

        Thus, the diagonal entries in R give the importance of each model parameter for the
        problem.
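
        These properties are easy to verify numerically. The sketch below (numpy, with a
        made-up rank-deficient 2 × 3 G) computes R for the generalized inverse and checks
        Equation (7.70), the trace property, and the diagonal "importance" interpretation.

        import numpy as np

        # Hypothetical underdetermined problem: N = 2, M = 3, so P = 2 < M and R cannot equal I.
        G = np.array([[1.0, 1.0, 0.0],
                      [0.0, 1.0, 1.0]])

        U_p, lam, V_pT = np.linalg.svd(G, full_matrices=False)
        G_gen = V_pT.T @ np.diag(1.0 / lam) @ U_p.T   # generalized inverse, Eq. (7.8)

        R = G_gen @ G                                 # model resolution matrix, Eq. (7.62)
        print(np.allclose(R, V_pT.T @ V_pT))          # True: R = V_P V_P^T, Eq. (7.70)
        print(np.trace(R))                            # 2.0 = P
        print(np.diag(R))                             # importance of each model parameter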
        We will return to the model resolution matrix R later to show how the generalized inverse
is the inverse operator that minimizes the difference between R = G_g^{-1} G and I_M in the least
squares sense, and when we discuss the trade-off between resolution and variance.


7.3.3 The Data Resolution Matrix N

        Consider the development of the data resolution matrix N, which follows closely that of
the model resolution matrix R. The estimated solution, for any inverse operator G_est^{-1}, is given
by

            m_est = G_est^{-1} d                                (7.78)

The predicted data d̂ for this estimated solution are given by

            d̂ = G m_est                                         (7.79)

Replacing m_est in (7.79) with (7.78) gives

            d̂ = G G_est^{-1} d = N d                            (7.80)

where N is an N × N matrix called the data resolution matrix.


A Specific Example

        As a specific example, consider the generalized inverse operator G_g^{-1}. Then N is given by

            N = G G_g^{-1}                                      (7.81)

If N = I_N, then the predicted data d̂ equal the observed data d, and the observed data can be fit
exactly. If N ≠ I_N, then the predicted data are some weighted average of the observed data d.

        Consider the kth element of the predicted data d̂:

            d̂_k = [kth row of N] [d_1, d_2, . . . , d_N]^T      (7.82)

The rows of N are windows through which the observed data are viewed. If the kth row of N
has a 1 in the kth column and zeroes elsewhere, the kth observation is perfectly resolved. We
also say that the kth observation, in this case, provides completely independent information. For
this reason, N is sometimes also referred to as the data information density matrix. Equation
(7.82) shows that the kth predicted datum is a weighted average of all of the observations, with
the weighting given by the entries in the kth row of N. If the kth row of N has many nonzero
elements, then the kth predicted observation depends on the true value of many of the
observations, and not just on the kth. The data resolution of the kth observation, then, is said to
be poor.


Some Properties of N for the Generalized Inverse

        1. N = G G_g^{-1} = U_P U_P^T                           (7.83)

        Using singular-value decomposition, we have that

            N = G G_g^{-1} = U_P Λ_P V_P^T V_P Λ_P^{-1} U_P^T
                           = U_P U_P^T                          (7.84)

        since V_P^T V_P = I (V_P is semiorthogonal) and Λ_P Λ_P^{-1} = I.

        In general, U_P U_P^T ≠ I_N. However, if P = N, then U_P = U, and U_0 is empty. Then N =
        U_P U_P^T = U U^T = I_N, since U is itself an orthogonal matrix. Thus, the condition for
        perfect data resolution is that U_0 be empty, or that P = N.


        2. trace(N) = P                                         (7.85)

        The proof follows that of trace(R) = P:

            N = [U_P  U_0] [ I_P  0 ] [ U_P^T ]
                           [  0   0 ] [ U_0^T ]                 (7.86)

        and

            trace [ I_P  0 ] = P     Q.E.D.                     (7.87)
                  [  0   0 ]

        If, for example,

            n_11 + n_22 ≈ 1                                     (7.88)

        one might choose to form a new observation d_1′ as a linear combination of d_1 and d_2,
        given in the simplest case by

            d_1′ = d_1 + d_2                                    (7.89)

        The actual linear combination of the two observations that is resolved depends on the
        eigenvectors in U_P, or equivalently upon the structure of the data resolution matrix N. In
        any case, the new observation d_1′ could provide essentially independent information and
        could be a way to reduce the computational effort of the inverse problem by reducing the
        number of observations N. In many cases, however, the benefit of being able to average
        out data errors over the observations is more important than any computational savings
        that might come from combining observations.


        3. Σ_{j=1}^{N} n_ij^2 = Σ_{j=1}^{N} n_ji^2 = n_ii = importance of the ith observation   (7.90)

        That is, the sum of squares of the entries in a row (or column, since N is symmetric) of N
        is equal to the diagonal entry in that row. Thus, as the diagonal entry gets large (close to
        one), the other entries in that row must become small. As the importance of a particular
        datum becomes large, the dependence of the predicted datum on other observations must
        become small.

        If we further note that we can write N as

            N = [n_1  n_2  · · ·  n_N]                          (7.91)

        where n_i is the ith column of N, then the predicted data d̂ from (7.80) can be written as

            d̂ = N d = n_1 d_1 + n_2 d_2 + · · · + n_N d_N       (7.92)

        where d_i is the ith component of d. Equation (7.92) follows from (2.23)-(2.30). That is,
        the predicted data vector can also be thought of as the weighted sum of the columns of N,
        with the weighting factors being given by the actual observations.

        The length squared of each column of N can then be thought of as the importance of d_i in
        the solution. The length squared of n_i is given by

            ||n_i||^2 = Σ_{j=1}^{N} n_ji^2 = n_ii               (7.93)

        Thus, the diagonal entries in N give the importance of each observation in the solution.
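
        The data resolution properties can be checked the same way as the model resolution
        properties. A minimal numpy sketch (the 4 × 2 G is made up for illustration; any
        P = M < N matrix behaves the same way):

        import numpy as np

        # Hypothetical overdetermined problem: N = 4, M = 2, so P = 2 < N and N cannot equal I.
        G = np.array([[1.0, 0.0],
                      [1.0, 1.0],
                      [1.0, 2.0],
                      [1.0, 3.0]])

        U_p, lam, V_pT = np.linalg.svd(G, full_matrices=False)
        G_gen = V_pT.T @ np.diag(1.0 / lam) @ U_p.T

        N = G @ G_gen                              # data resolution matrix, Eq. (7.81)
        print(np.allclose(N, U_p @ U_p.T))         # True: N = U_P U_P^T, Eq. (7.83)
        print(np.trace(N))                         # 2.0 = P, Eq. (7.85)
        print(np.diag(N))                          # importance of each observation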

It can also be shown that the generalized inverse operator G
g
1
= V
P

P
1
U
P
T
minimizes the
difference between N and I
N
. Let us now turn our attention to another measure of quality for the
generalized inverse, the unit covariance matrix [cov
u
m].
7.3.4 The Unit (Model) Covariance Matrix [cov_u m]

        Any errors (noise) in the data will be mapped into errors in the estimates of the model
parameters. The mapping was first considered in Section 3.7 of Chapter 3. We will now
reconsider it from the generalized inverse viewpoint.

        Let the error (noise) in the data d be Δd. Then the error in the model parameters due to
Δd is given by

            Δm = G_g^{-1} Δd                                    (7.94)

Step 1. Recall from Equation (2.59) that the N × N data covariance matrix [cov d] is given by

            [cov d] = (1/(k − 1)) Σ_{i=1}^{k} Δd_i [Δd_i]^T     (7.95)

        where k is the number of experiments and i is the experiment number. The diagonal
        terms are the data variances and the off-diagonal terms are the covariances.

        The data covariance is also written as <Δd Δd^T>, where < > denotes averaging.


Step 2. We seek, then, a model covariance matrix <mm
T
> = [cov m].

<m m
T
> = <G
g
1
d[G
g
1
d]
T
>

= <G
g
1
d d
T
[G
g
1
]
T
> (7.96)
G
g
1
is not changing with each experiment, so we can take it outside the averaging,
implying

<m m
T
> = G
g
1
<d d
T
> [G
g
1
]
T


or

[cov m] = G
g
1
[cov d] [G
g
1
]
T
(7.97)

The above derivation provides some of the logic behind Equation (2.63), which was
introduced in Chapter 2 as magic.

Step 3. Finally, define a unit (model) covariance matrix [cov_u m] by assuming that [cov d] =
        I_N, that is, by assuming that all the data variances are equal to 1 and the covariances are
        all 0 (uncorrelated data errors). Then

            [cov_u m] = G_g^{-1} [cov d] [G_g^{-1}]^T
                      = G_g^{-1} [G_g^{-1}]^T                   (7.98)


Some Properties of [cov_u m]

        1. Using singular-value decomposition, we can write the unit model covariance matrix as

            [cov_u m] = G_g^{-1} [G_g^{-1}]^T
                      = V_P Λ_P^{-1} U_P^T [V_P Λ_P^{-1} U_P^T]^T
                      = V_P Λ_P^{-1} U_P^T U_P Λ_P^{-1} V_P^T
                      = V_P Λ_P^{-2} V_P^T                      (7.99)

        This emphasizes the importance of the size of the singular values λ_i in determining the
        model parameter variances. As λ_i gets small, the entries in [cov_u m] tend to get big
        (implying large model parameter estimate variances) due to the terms in 1/λ_i^2.

        Consider the kth diagonal entry in [cov_u m], [cov_u m]_kk, where

            [cov_u m] = [v_1  v_2  · · ·  v_P] diag(1/λ_1^2, 1/λ_2^2, . . . , 1/λ_P^2) [v_1  v_2  · · ·  v_P]^T     (7.100)

        If we multiply out the first two matrices, we can then identify the kk entry in [cov_u m] as
        the product of the kth row of the resulting matrix with the kth column of V_P^T:

            [cov_u m]_kk = Σ_{i=1}^{P} v_ki^2 / λ_i^2           (7.101)


        Thus, as λ_i gets small, [cov_u m]_kk will get large if v_ki is not zero. Recall that v_ki is the kth
        component of the ith eigenvector v_i in V_P. Thus, it is the combination of λ_i getting small
        and v_i having a nonzero component in the kth row that makes the variance for the kth
        model parameter potentially very large.

        2. Even if the data covariance is diagonal (i.e., all the observations have errors that are
        uncorrelated), [cov_u m] need not be diagonal. That is, the model parameter estimates
        may well have nonzero covariances, even though the data have zero covariances.

        For example, from Equation (7.101) above, we can see that

            [cov_u m]_1k = Σ_{i=1}^{P} v_1i v_ki / λ_i^2        (7.102)

        Note that the sum in the above equation is not the dot product of two columns of V_P. In
        fact, even if the numerator were the dot product between columns (i.e., Σ_i v_1i v_ik), the
        fact that every term is divided by λ_i^2 would likely yield something other than 1. The
        numerator is the dot product of two rows of V_P and is likely nonzero anyway.

        3. Notice that [cov_u m] is a function of the forward problem as expressed in G, and not a
        function of the actual data. Thus, it can be useful for experimental design.


7.3.5 A Closer Look at Stability

        The unit model covariance matrix [cov_u m] is very helpful in experiment design for
getting a sense of the basic stability of an inverse problem. And the general rule that small
singular values lead to instability is certainly true. However, oversimplified analysis of
[cov_u m] and the size of singular values can be very misleading.

        First, what determines the size of the singular values λ_i? There are two main factors at
work. The first is simply the size of m and d. Consider the simple 1 × 1 forward problem:

            G m = d                                             (1.13)

where the expected (or reasonable) value of m is 5 widgets/hr and the expected (or reasonable)
value of d is $100. Then, clearly, the expected value of G is 20. If G were indeed 20, then
trivially λ = 20. Another way of looking at this is that, in some sense (and in every sense in this
1 × 1 case):

            |G| |m| = |d|                                       (7.103)

and thus

            |G| = |d| / |m|                                     (7.104)

Thus one important control on the average size of the singular values is simply the relative or
average size of the values for the data over the relative or average size of the values of the model
parameters. Now, consider what happens if we change units for our data to kilodollars. Then, in
this case, d = 0.1, and the numbers in G must change to reflect the new data units. By Equation
(7.2a), we would now expect |G| ≈ 0.02 and of course λ ≈ 0.02. Is this problem now inherently
more unstable? Not really. It would be if the size of the expected noise in d were unchanged
when we changed data units, but that would not make sense.

        The second factor that determines the size of the singular values is the degree to which
columns or rows of G are nearly parallel, or dependent upon each other. Consider two 2 × 2 G
matrices for which the average singular value in both cases is about 2:

            G = [  2.00  2.00 ]     and     G = [ 2.00  2.00 ]
                [ −1.00  2.00 ]                 [ 2.00  2.05 ]          (7.105)

For the first case, λ_1 = 3.00 and λ_2 = 2.00, while for the second case, λ_1 = 4.025 and λ_2 = 0.025.
In the first case, the columns or rows of G are at high angles to each other, while in the second
case, the rows or columns are at low angles to each other, or nearly parallel.

Thus, two factors control the size of singular values: (1) the ratio of the average size of
data to model parameters, and (2) the internal structure of G, specifically the degree to which
columns or rows of G are nearly parallel to one another. Or, more properly, the degree to which
any column or row of G can be nearly written as a linear combination of other columns or rows.

The crux of the stability question is whether small or reasonable or expected noise in the
data leads to acceptably small changes in the solution for the model parameters.

        A blind application of [cov_u m] can be very misleading. Inherent in [cov_u m] is the
assumption that

            [cov d] = σ_d^2 I = I                               (7.106)

or that the data variances all equal 1, and thus the data standard deviations σ_d = 1. Is σ_d = 1
small, reasonable, or expected? In our original example where d = 100 it would represent very
good data, with an expected noise level of 1, or 1%. However, for our second example with the
rescaled data of 0.1, it would represent 1000% error!

        How do we know what the expected noise is in the solution? [cov_u m] gives us this
information, and more. On the diagonal of [cov_u m] we have the variances for the model
parameters. The [1, 1] entry in [cov_u m] is σ²_m1, the variance for m_1. Correspondingly, the
standard deviation of the error for m_1 is σ_m1. Thus, in some inverse problem, if we found a
solution of 50 for m_1 and σ²_m1 = 10, then σ_m1 = √10 ≈ 3.3. That would mean our solution for
m_1 could be expressed as

            m_1 = 50 ± 3.3                                      (7.107)

        This may or may not be an acceptable noise level in the solution. For example, if you
needed to know m_1 to 0.1% in order for your satellite to land on Mars rather than become a
Mars impactor or flyby, then knowing m_1 to ±3.3 would clearly not be acceptable. Like so many
things, acceptable noise level is a problem-dependent question.

        One way to look at the stability of an inverse problem is to see what a particular
percentage noise level in the data does to the expected noise level in the solution. You can look
at [cov d] to get the data noise level. For example, the ratio of the data standard deviation to the
expected or average data value,

            σ_di / |d_i|                                        (7.108)

can be expressed as a percentage. If σ²_di = 9 and |d_i| = 20, then σ_di / |d_i| = 3/20 = 15%.

        Then, looking at model parameter stability, you can look at [cov m]. The ratio of the
model parameter standard deviation to the expected or derived model parameter value,

            σ_mj / |m_j|                                        (7.109)

can also be expressed as a percentage. If σ²_mj = 16 and |m_j| = 50, then σ_mj / |m_j| = 4/50 = 8%.
In this example, since a given percentage noise level in the data leads to a lower percentage noise
level in this model parameter, this part of the problem is stable. Of course, you need to look over
all d_i, i = 1, N, and m_j, j = 1, M, to have a full discussion of stability.

        What all of this implies is that if you know [cov d], you should include this information
throughout the inverse analysis. If you do not know [cov d], or just want a quick look at
stability, you can assume

            [cov d] = σ_d^2 I = I                               (7.106)

and use [cov_u m]. You can also scale [cov_u m] for other choices of σ_d^2. For example, [cov_u m]
for the second G matrix example,

            G = [ 2.00  2.00 ]
                [ 2.00  2.05 ]                                  (7.110)

is given by

            [cov_u m] ≈ [  820  −810 ]
                        [ −810   800 ]                          (7.111)

If [cov d] is given by

            [cov d] = [ 0.01  0.00 ]
                      [ 0.00  0.01 ]                            (7.112)

then

            [cov m] ≈ [  8.20  −8.10 ]
                      [ −8.10   8.00 ]                          (7.113)

where we have used [cov m] rather than [cov_u m] because we are no longer assuming that
[cov d] = I.

        This example shows that the entries in [cov m] scale with the assumed uniform data
variance. You can calculate [cov_u m] and then scale all entries by the assumed scaling factor for
the difference between realistic data variances and assumed unit data variances.

        Yet another way to look at this is to go back to Equation (7.97) defining the model
covariance matrix:

            [cov m] = G_g^{-1} [cov d] [G_g^{-1}]^T             (7.97)

If we assume that the data covariance matrix [cov d] is given by

            [cov d] = σ_d^2 I                                   (7.106)

then Equation (7.97) becomes

            [cov m] = σ_d^2 G_g^{-1} [G_g^{-1}]^T               (7.114)

For [cov_u m] we assumed σ_d^2 = 1, but we can get to [cov m] from [cov_u m] by multiplying all
entries in [cov_u m] by σ_d^2.
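
        A minimal numpy check of this scaling, using the nearly singular 2 × 2 G of Equation
        (7.110) (only the use of numpy is assumed; the numbers come from the text):

        import numpy as np

        # Reproduce the scaling of Eqs. (7.110)-(7.114).
        G = np.array([[2.00, 2.00],
                      [2.00, 2.05]])

        G_gen = np.linalg.inv(G)            # P = N = M here, so the generalized inverse is just inv(G)
        cov_u_m = G_gen @ G_gen.T           # unit model covariance, Eq. (7.98)
        print(np.round(cov_u_m, 1))         # approximately [[820.2, -810.0], [-810.0, 800.0]]

        sigma_d2 = 0.01                     # assumed uniform data variance, [cov d] = 0.01 I
        cov_m = sigma_d2 * cov_u_m          # Eq. (7.114)
        print(np.round(cov_m, 2))           # approximately [[8.20, -8.10], [-8.10, 8.00]]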


7.3.6 Combining R, N, and [cov_u m]

        Note that, in general, G, G_g^{-1}, R, N, and [cov_u m] can be written in terms of singular-
value decomposition as

        1. G = U_P Λ_P V_P^T                                    (7.115)

        2. G_g^{-1} = V_P Λ_P^{-1} U_P^T                        (7.116)

        3. R = G_g^{-1} G = V_P V_P^T                           (7.117)

        4. N = G G_g^{-1} = U_P U_P^T                           (7.118)

        5. [cov_u m] = G_g^{-1} [G_g^{-1}]^T = V_P Λ_P^{-2} V_P^T       (7.119)


Case I: P = M = N

            R = G_g^{-1} G = I_M, since G_g^{-1} = G^{-1}       (7.120)

            N = G G_g^{-1} = G G^{-1} = I_N, since G_g^{-1} = G^{-1}    (7.121)

            [cov_u m] = G_g^{-1} [G_g^{-1}]^T
                      = V Λ^{-1} U^T [V Λ^{-1} U^T]^T
                      = V Λ^{-1} U^T U Λ^{-1} V^T = V Λ^{-2} V^T        (7.122)


Case II: P = M < N (Least Squares)

            G_g^{-1} = V_P Λ_P^{-1} U_P^T = [G^T G]^{-1} G^T    (7.123)

            R = G_g^{-1} G = V_P V_P^T = V V^T = I_M   since P = M      (7.124)

            N = G G_g^{-1} = G {[G^T G]^{-1} G^T} = (using SVD . . .) = U_P U_P^T       (7.125)

            [cov_u m] = G_g^{-1} [G_g^{-1}]^T
                      = [G^T G]^{-1} G^T {[G^T G]^{-1} G^T}^T
                      = (using SVD . . .) = V_P Λ_P^{-2} V_P^T  (7.126)


Case III: P = N < M (Minimum Length)

            G_g^{-1} = V_P Λ_P^{-1} U_P^T = G^T [G G^T]^{-1}    (7.127)

            R = G_g^{-1} G = G^T [G G^T]^{-1} G = (using SVD . . .) = V_P V_P^T         (7.128)

            N = G G_g^{-1} = G {G^T [G G^T]^{-1}} = (using SVD . . .) = U_P U_P^T = U U^T = I_N     (7.129)

            [cov_u m] = G_g^{-1} [G_g^{-1}]^T
                      = G^T [G G^T]^{-1} {G^T [G G^T]^{-1}}^T
                      = (using SVD . . .) = V_P Λ_P^{-2} V_P^T  (7.130)


Case IV: P < min(M, N) (General Case)

        This is just the general case.


7.3.7 An Illustrative Example

        Consider a system of equations Gm = d given by

            [ 1.00  1.00 ] [ m_1 ]   [ 2.00 ]
            [ 2.00  2.01 ] [ m_2 ] = [ 4.10 ]                   (7.131)

Doing singular-value decomposition, one finds

            λ_1 = 3.169                                         (7.132)

            λ_2 = 0.00316                                       (7.133)

            U = U_P = [ 0.446  −0.895 ]
                      [ 0.895   0.446 ]                         (7.134)

            V = V_P = [ 0.706  −0.709 ]
                      [ 0.709   0.706 ]                         (7.135)

            R = V_P V_P^T = [ 1.0  0.0 ] = I_2
                            [ 0.0  1.0 ]                        (7.136)

            N = U_P U_P^T = [ 1.0  0.0 ] = I_2
                            [ 0.0  1.0 ]                        (7.137)

            G_g^{-1} = [  201  −100 ]
                       [ −200   100 ]                           (7.138)

            m_g = G_g^{-1} d = [  201  −100 ] [ 2.00 ]   [ −8.0 ]
                               [ −200   100 ] [ 4.10 ] = [ 10.0 ]       (7.139)

Note that the solution has perfect model resolution (R = I, and hence the solution is unique) and
perfect data resolution (N = I, and hence the data can be fit exactly). Note also that P = N = M,
and the generalized inverse is, in fact, the unique mathematical inverse.
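
        The example is easy to reproduce numerically. A minimal numpy sketch (only the use of
        numpy is assumed; the numbers are those of the text), which also previews the sensitivity
        test discussed next:

        import numpy as np

        # Reproduce the illustrative example of Eq. (7.131).
        G = np.array([[1.00, 1.00],
                      [2.00, 2.01]])
        d = np.array([2.00, 4.10])

        U, lam, VT = np.linalg.svd(G)
        print(lam)                               # approximately [3.169, 0.00316]

        G_gen = VT.T @ np.diag(1.0 / lam) @ U.T  # equals inv(G) here since P = N = M
        print(G_gen @ d)                         # approximately [-8.0, 10.0], Eq. (7.139)

        # Sensitivity to a small data change: d_2 = 4.0 instead of 4.1 (about a 2.5% change).
        print(G_gen @ np.array([2.00, 4.00]))    # approximately [2.0, 0.0], Eq. (7.141)

        cov_u_m = G_gen @ G_gen.T                # unit model covariance, Eq. (7.140)
        print(np.round(cov_u_m))                 # entries of order 5 x 10^4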

        This solution is, however, essentially meaningless if the data contain even a small amount
of noise. To see this, consider the unit covariance matrix [cov_u m] for this case:

            [cov_u m] = G_g^{-1} [G_g^{-1}]^T = V_P Λ_P^{-2} V_P^T = V_P [ 0.0996      0      ] V_P^T
                                                                         [   0    100,300.9 ]

                      = [  50,401  −50,200 ]
                        [ −50,200   50,000 ]                    (7.140)

These are very large variances for m_1 and m_2, which indicate that the solution, while unique
and fitting the data perfectly, is very unstable, or sensitive to noise in the data. For example,
suppose that d_2 is 4.0 instead of 4.1 (a 2.5% error). Then the generalized inverse solution m_g is
given by

            m_g = G_g^{-1} d = [  201  −100 ] [ 2.00 ]   [ 2.0 ]
                               [ −200   100 ] [ 4.00 ] = [ 0.0 ]        (7.141)

That is, errors of less than a few percent in d result in errors on the order of several hundred
percent in m_g. Whenever small changes in d result in large changes in m_g, the problem is
considered unstable. In this particular problem, if a solution is desired with a standard deviation
of order 0.1, then the data standard deviations must be less than about 5 × 10^{-4}!

        Another way of quantifying the instability of the inversion is with the condition number,
defined as

            condition number = λ_max / λ_min                    (7.142)

For this particular problem, the condition number is approximately 1000, which indicates
considerable instability. The condition number, by itself, can be misleading. If a problem has
two singular values λ_1 and λ_2, with λ_1 = 1,000,000 and λ_2 = 1000, then λ_1/λ_2 = 1000. This
problem, however, is very stable with changes of length order one in the data (do you see why?).
If, however, λ_1 = 0.001 and λ_2 = 0.000001, then λ_1/λ_2 = 1000, and unit length changes in the
data will cause large changes in the solution. In addition to just the condition number, the
absolute size of the singular values is important, especially compared to the size of the possible
noise in the data.

        In order to gain a better understanding of the origin of the instability, one must consider
the structure of the G matrix itself. For the present example, inspection shows that the columns,
or rows, of G are very nearly parallel to one another. For example, the angle between the vectors
given by the columns is 0.114°, obtained by taking the dot product of the two columns. The two
columns of G span the two-dimensional data space, and hence the data resolution is perfect, but
the fact that they are nearly parallel leads to a significant instability.

        It is not a coincidence, therefore, that the data eigenvector associated with the larger
singular value, u_1 = [0.446, 0.895]^T, is essentially parallel to the common direction given by the
columns of G. Nor is it a coincidence that u_2, associated with the smaller singular value, is
perpendicular to the almost uniform direction given by the columns of G. The eigenvector u_1
represents a stable direction in data space as far as noise is concerned, while u_2 represents an
unstable direction in data space as far as noise is concerned. Noise in data space parallel to u_1
will be damped by 1/λ_1, while noise parallel to u_2 will be amplified by 1/λ_2.

        Similar arguments can be made about the rows of G, which lie in model space. That is,
v_1 is essentially parallel to the almost uniform direction given by the rows of G, while v_2 is
essentially perpendicular to the direction given by the rows of G. Noise parallel to u_1, when
operated on by the generalized inverse, creates noise in the solution parallel to v_1, while noise
parallel to u_2 creates noise parallel to v_2. Thus, v_2 is the unstable direction in model space.

        Methods to stabilize the model parameter variances will be considered in a later section,
but it will also be shown that any gain in stability is obtained at a cost in resolution. First,
however, we will introduce ways to quantify R, N, and [cov_u m]. We will return to the above
example and show specifically how stability can be enhanced while resolution is lost.


7.4 Quantifying the Quality of R, N, and [cov_u m]


7.4.1 Introduction

        In the preceding sections we have shown that the model resolution matrix R, the data
resolution matrix N, and the unit model covariance matrix [cov_u m] can be very useful, at least in
a qualitative way, in assessing the quality of a particular inversion. In this section, we will
quantify these measures of quality and show that the generalized inverse is the inverse that gives
the best possible model and data resolution.

        First, consider the following definitions (see Menke, page 68):


            spread(R) = ||R − I||^2 = Σ_{i=1}^{M} Σ_{j=1}^{M} [r_ij − δ_ij]^2    (7.143)

            spread(N) = ||N − I||^2 = Σ_{i=1}^{N} Σ_{j=1}^{N} [n_ij − δ_ij]^2    (7.144)

and

            size([cov_u m]) = Σ_{i=1}^{M} [cov_u m]_ii          (7.145)

The spread function measures how different R (or N) is from an identity matrix. If R (or N) = I,
then spread(R) (or spread(N)) = 0. The size function is the trace of the unit model covariance
matrix, which gives the sum of the model parameter variances.
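
        These measures are one-liners in code. A minimal numpy sketch (the rank-deficient
        3 × 3 G is made up for illustration and gives a Class IV problem):

        import numpy as np

        def spread(A):
            """Squared difference from the identity matrix, Eqs. (7.143)-(7.144)."""
            return np.sum((A - np.eye(A.shape[0]))**2)

        def size_cov(cov_u_m):
            """Trace of the unit model covariance matrix, Eq. (7.145)."""
            return np.trace(cov_u_m)

        G = np.array([[1.0, 1.0, 0.0],
                      [2.0, 2.0, 0.0],
                      [0.0, 0.0, 0.0]])          # rank 1, so P < min(N, M)

        U, lam, VT = np.linalg.svd(G)
        keep = lam > 1e-10 * lam.max()           # retain only the P nonzero singular values
        U_p, lam_p, V_p = U[:, keep], lam[keep], VT[keep].T

        G_gen = V_p @ np.diag(1.0 / lam_p) @ U_p.T
        print(spread(G_gen @ G), spread(G @ G_gen), size_cov(G_gen @ G_gen.T))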

We can now look at the spread and size functions for various classes of problems.


7.4.2 Classes of Problems

Class I: P = N = M

        spread(R) = spread(N) = 0       perfect model and data resolution
        size([cov_u m])                 depends on the size of the singular values


Class II: P = M < N (Least Squares)

        spread(R) = 0                   perfect model resolution
        spread(N) ≠ 0                   data not all independent
        size([cov_u m])                 depends on the size of the singular values


Class III: P = N < M (Minimum Length)

        spread(R) ≠ 0                   nonunique solution
        spread(N) = 0                   perfect data resolution
        size([cov_u m])                 depends on the size of the singular values


Class IV: P < min(N, M) (General Case)

        spread(R) ≠ 0                   nonunique solution
        spread(N) ≠ 0                   data not all independent
        size([cov_u m])                 depends on the size of the singular values

We also note that the position of an off-diagonal nonzero entry in R or N does not affect the
spread. This is as it should be if the model parameters and data have no physical ordering.


7.4.3 Effect of the Generalized Inverse Operator G_g^{-1}

        We are now in a position to show that the generalized inverse operator G_g^{-1} gives the best
possible R, N matrices in terms of minimizing the spread functions as defined in (7.143)-
(7.144). Menke (pp. 68-70) does this for the P = M < N case, and less fully for the P = N < M
case. Consider instead the more general derivation (after Jackson, 1972). For any estimate of
the inverse operator G_est^{-1}, the model resolution matrix R is given by

            R = G_est^{-1} G
              = G_est^{-1} U_P Λ_P V_P^T
              = B V_P^T                                         (7.146)

where B = G_est^{-1} U_P Λ_P. From (2.23)-(2.30), each row of R will be a linear combination of the
rows of V_P^T, or equivalently a linear combination of the columns of V_P. The weighting factors
are determined by B, which depends on the choice of the inverse operator.

        The goal, then, is to choose an inverse operator that will make R most like the identity
matrix I in the sense of minimizing spread(R). Define b_k^T as the kth row of B, and d_k^T as the kth
row of I_M. We seek b_k^T as the least squares solution to

            b_k^T   V_P^T   =   d_k^T                           (7.147)
            (1 × P)(P × M)     (1 × M)

Taking the transposes implies

            V_P   b_k   =   d_k                                 (7.148)
            (M × P)(P × 1)  (M × 1)

Equation (7.148) can be solved with the least squares operator [see (3.22)] as

            b_k = [V_P^T V_P]^{-1} V_P^T d_k
                = I^{-1} V_P^T d_k
                = V_P^T d_k                                     (7.149)

Taking the transpose of (7.149) gives

            b_k^T = d_k^T V_P                                   (7.150)

Writing this out specifically, we have

            [b_k1  b_k2  · · ·  b_kP] = [0 · · · 0  1  0 · · · 0] [ v_11  v_12  · · ·  v_1P ]
                                                    ↑             [ v_21  v_22  · · ·  v_2P ]
                                               kth column         [   ⋮     ⋮            ⋮  ]
                                                                   [ v_M1  v_M2  · · ·  v_MP ]      (7.151)

Looking at (7.151), we see that the ith entry in b_k^T is given by

            b_ki = v_ki                                         (7.152)

That is, each element in the kth row of B is the corresponding element in the kth row of V_P. Or,
simply put, the kth row of B is given by the kth row of V_P.

        Making similar arguments for each row of B gives us

            B = V_P                                             (7.153)

Substituting B back into (7.146) gives

            R = V_P V_P^T                                       (7.154)

This is, however, exactly the model resolution matrix for the generalized inverse, given in (7.70).
Thus, we have shown that the generalized inverse is the operator with the best model resolution
in the sense that the least squares difference between R and I_M is minimized. Very similar
arguments can be made to show that the generalized inverse is the operator with the best data
resolution in the sense that the least squares difference between N and I_N is minimized.
        In cases where the model parameters or data have a natural ordering, such as a
discretization of density versus depth (for model parameters) or gravity measurements along a
profile (for data), we might want to modify the definition of the spread functions in (7.143)-
(7.144). One such modification leads to the Backus-Gilbert inverse. A modified spread function
is defined by

            spread(R) = Σ_{i=1}^{M} Σ_{j=1}^{M} W(i, j) [r_ij − δ_ij]^2         (7.155)

where W(i, j) = (i − j)^2. This gives more weight (penalty) to entries far from the diagonal. It has
the effect, however, of canceling out any i = j contribution to the spread. To handle this, a
constraint equation is added and satisfied by the use of Lagrange multipliers. The constraint
equation is given by

            Σ_{j=1}^{M} r_ij = 1                                (7.156)

This ensures that not all entries in a row of R are allowed to go to zero, which would minimize
the spread in (7.155). The inverse operator based on (7.155) is called the Backus-Gilbert
inverse, first developed for continuous (rather than discrete) problems.


7.5 Resolution Versus Stability


7.5.1 Introduction

        We will see in this section that stability can be improved by removing small singular
values from an inversion. We will also see, however, that this reduces the resolution. There is
an unavoidable trade-off between solution stability and resolution.

        Recall the example from Equation (7.131):

            [ 1.00  1.00 ] [ m_1 ]   [ 2.00 ]
            [ 2.00  2.01 ] [ m_2 ] = [ 4.10 ]                   (7.131)

The singular values, eigenvector matrices, generalized inverse, and other relevant matrices are
given in Equations (7.132)-(7.140).

        One option is to arbitrarily set λ_2 = 0. Then P is reduced from 2 to 1, and

            U_P = [ 0.446 ]
                  [ 0.895 ]                                     (7.157)

            V_P = [ 0.706 ]
                  [ 0.709 ]                                     (7.158)

            R = V_P V_P^T = [ 0.5  0.5 ]
                            [ 0.5  0.5 ]                        (7.159)

            N = U_P U_P^T = [ 0.2  0.4 ]
                            [ 0.4  0.8 ]                        (7.160)

            G_g^{-1} = V_P Λ_P^{-1} U_P^T = [ 0.099  0.199 ]
                                            [ 0.100  0.200 ]    (7.161)

            [cov_u m] = V_P Λ_P^{-2} V_P^T = [ 0.0496  0.0498 ]
                                             [ 0.0498  0.0500 ]         (7.162)

            m_g = G_g^{-1} d = [ 1.016 ]
                               [ 1.020 ]                        (7.163)

and

            d̂ = G m_g = [ 2.04 ]
                        [ 4.08 ]                                (7.164)

        First, note that the size of the unit model covariance matrix has been significantly
reduced, indicating a dramatic improvement in the stability of the solution. The model parameter
variances are of order 0.05 for data with unit variance.

        Second, note that the fit to the data, while not perfect, is fairly close. The misfits for d_1
and d_2 are, at most, 2%.

        Third, however, note that both the model and data resolution have been degraded from
perfect resolution when both singular values were retained. In fact, R now indicates that the
estimates for both m_1 and m_2 are given by the average of the true values of m_1 and m_2. That is,
0.706 m_1 plus 0.709 m_2 is perfectly resolved, but there is no information about the difference.
This can also be seen by examining V_P, which points in the [0.706, 0.709]^T direction in model
space. This is the only direction in model space that can be resolved.

        Recall from the example of Equation (7.141) that when d_2 was changed from 4.1 to 4.0,
the solution changed from [−8, 10]^T to [2, 0]^T. The sum of m_1 and m_2 remained constant, but
the difference changed significantly. If d_2 is changed from 4.1 to 4.0 now, the solution is
[0.998, 1.000]^T, very close to the solution with d_2 = 4.1.
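
        The truncated inversion can be reproduced with a few lines of numpy (a sketch; only the
        truncation level is a choice made here, everything else follows the text):

        import numpy as np

        # Truncated-SVD version of the example: keep only lambda_1 and set lambda_2 to zero.
        G = np.array([[1.00, 1.00],
                      [2.00, 2.01]])
        d = np.array([2.00, 4.10])

        U, lam, VT = np.linalg.svd(G)
        p = 1                                      # keep only lambda_1 = 3.169
        U_p, lam_p, V_p = U[:, :p], lam[:p], VT[:p].T

        G_gen = V_p @ np.diag(1.0 / lam_p) @ U_p.T
        print(G_gen @ d)                           # approximately [1.016, 1.020], Eq. (7.163)
        print(V_p @ V_p.T)                         # R of Eq. (7.159): each estimate is an average
        print(np.trace(G_gen @ G_gen.T))           # size([cov_u m]) drops from ~10^5 to ~0.1
        print(G_gen @ np.array([2.00, 4.00]))      # a small data change now gives a small solution change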

        In some cases, knowing the sum of m_1 and m_2 may be useful, such as when m gives the
velocity of some layered structure. Then knowing the average velocity, even if the individual
layer velocities cannot be resolved, may be useful. In any case, we have shown that the original
decomposition, with two nonzero singular values, was so unstable that the solution, while
unique, was essentially meaningless.

        The data resolution matrix N indicates that the second observation is more important than
the first (0.8 versus 0.2 along the diagonal). This can be seen either from noting that the second
row of either column of G is larger than the first row, and U_P is formed as a linear combination
of the columns of G, or by looking at U_P, which points in the [0.446, 0.895]^T direction in data
space.

        Another way to look at the trade-off is by plotting resolution versus stability, as shown
schematically below:

        [Figure: trade-off curve of spread(R) (smaller = better) versus size([cov_u m]) (smaller =
        better). As P decreases, size([cov_u m]) decreases while spread(R) increases; the "best"
        choice of P lies near the bend in the curve.]


As P is decreased, by setting small singular values to zero, the resolution degrades while
the stability increases. Sometimes it is possible to pick an optimal cut-off value for small
singular values based on this type of graph.


7.5.2 R, N, and [cov_u m] for Nonlinear Problems

        The resolution matrices and the unit model covariance matrix are also useful in a
nonlinear analysis, although the interpretations are somewhat different than they are for the linear
case.


Model Resolution Matrix R

        In the linear case the solution is unique whenever R = I. For the nonlinear problem, a
unique solution is not guaranteed, even if R = I. In fact, no solution may exist, even when R = I.
Consider the following simple nonlinear problem:

            m_1^2 = d_1                                         (7.165)

With a single model parameter and a single observation, we have P = M = N. Thus, R = I at
every iteration. If d_1 = 4, the process will iterate successfully to the solution m_1 = 2 unless, by
chance, the iterative process ever gives m_1 exactly equal to zero, in which case the inverse is
undefined. However, if d_1 is negative, there is no real solution, and the iterative process will
never converge to an answer, even though R = I.

        The uniqueness of nonlinear problems also depends on the existence of local minima. It
is always a good idea in nonlinear problems to explore solution space to make sure that the
solution obtained corresponds to a global minimum. Take, for example, the following case with
two observations and two model parameters:

            m_1^4 + m_2^2 = 2                                   (7.166)

            m_1^2 + m_2^4 = 2                                   (7.167)

This simple set of two nonlinear equations in two unknowns has R = I almost everywhere in
solution space. By inspection, however, there are four solutions that fit the data exactly, given by
[m_1, m_2]^T = [1, 1]^T, [1, −1]^T, [−1, 1]^T, and [−1, −1]^T.

        To see the role of the model resolution matrix for a nonlinear analysis, recall Equations
(4.13)-(4.17), where, for example,

            Δc = G Δm                                           (4.14)

and where Δc is the misfit to the data, given by the observed minus the predicted data, Δm
contains the changes to the model at this iteration, and G is the matrix of partial derivatives of the
forward equations with respect to the model parameters. If R = I at the solution, then the
changes Δm are perfectly resolved in the close vicinity of the solution. If R ≠ I, then there will
be directions in model space (corresponding to V_0) that do not change the predicted data, and
hence the fit to the data. All of this analysis, of course, is based on the linearization of a
nonlinear problem in the vicinity of the solution. The analysis of R is only as good as the
linearization of the nonlinear problem. If the problem is very nonlinear at the solution, the
validity of conclusions based on an analysis of R may be suspect.

        Note also that R, which depends on both G and the inverse operator, may change during
the iterative process. For example, in the problem of Equations (7.166)-(7.167) above, we noted
that R = I almost everywhere in solution space. At the point given by [m_1, m_2]^T =
[0.7071, 0.7071]^T, you may verify for yourselves that all four entries in the G matrix of partial
derivatives are equal (each is about 1.414). In this case, P is reduced from 2 to 1. The next
iteration, however, will take the solution to somewhere else where the resolution matrix is again
an identity matrix. The analysis of R is thus generally reserved for the final iteration at the
solution. At intermediate steps, R determines whether there is a unique direction in which to
move toward the solution. Since the path to the solution is less critical than the final solution,
little emphasis is generally placed on R during the iterative process.

        The generalized inverse operator, which finds the minimum length solution for Δm, finds
the smallest possible change to the linearized problem to minimize the misfit to the data. This is
a benefit because large changes in Δm will take the estimated parameter values farther away
from the region where the linearization of the problem is valid.


Data Resolution Matrix N

        In the linear case, N = I implies perfectly independent (and resolved) data. In the
nonlinear case, N = I implies that the misfit Δc, and not necessarily the data vector d itself, is
perfectly resolved for the linearized problem. If N ≠ I, then any part of the misfit Δc that lies in
U_0 space will not contribute to changes in the model parameter estimates. In the vicinity of the
solution, if N = I, then data space is completely resolved, and the misfit should typically go to
zero. If N ≠ I at the solution, then there may be a part of the data that cannot be fit. But, even if
N = I, there is no guarantee that there is any solution that will fit the data exactly. Recall the
example in Equation (7.165) above, where N = I everywhere. If d_1 is negative, no real solution
can be found that fits the data exactly.

        As with the model resolution matrix, N is most useful at the solution and less useful
during the iterative process. Also, it should always be recalled that the analysis of N is only as
valid as the linearization of the problem.


The Unit Model Covariance Matrix [cov_u m]

        For a linear analysis, the unit model covariance matrix provides variance estimates for the
model parameters assuming unit, uncorrelated data variances. For the nonlinear case, the unit
model covariance matrix provides variance estimates for the model parameter changes Δm. At
the solution, these can be interpreted as variances for the model parameters themselves, as long
as the problem is not too nonlinear. Along the iterative path, and at the final solution, the unit
covariance matrix provides an estimate of the stability of the process. If the variances are large,
then there must be small singular values, and the misfit may be mapped into large changes in the
model parameters. Analysis of particular model parameter variances is usually reserved for the
final iteration. As with both resolution matrices, the model parameter variance estimates are
based on the linearized problem and are only as good as the linearization itself.

        Consider a simple N = M = 1 nonlinear forward problem given by

            d = m^{1/3}                                         (7.168)

The inverse solution (exact, or equivalently the generalized inverse) is of course given by

            m = d^3                                             (7.169)

These relationships are shown in the accompanying figures.

        Suppose we consider a case where d_true = 1. The true solution is m = 1. A generalized
inverse analysis leads to a linearized estimate of the uncertainty in the solution, [cov_u m], of 9.
This analysis assumes uncorrelated Gaussian noise with mean zero and variance 1. If we use
Gaussian noise with mean zero and standard deviation σ = 0.25 (i.e., variance = 0.0625), then
[cov m] = 0.5625. The simple nature of this problem leads to an amplification by a factor of 9
between the data variance and the linearized estimate of the solution variance.

        Now consider an experiment in which 50,000 noisy data values are collected. The noise
in these data has a mean of 0.0 and a standard deviation of 0.25. For each noisy data value a
solution is found from the above equations. This will create 50,000 estimates of the solution.
Distributions of both the data noise and the solution are shown in the accompanying figures.

        Note that due to the nonlinear nature of the forward problem, the distribution of solutions
is not even Gaussian. The mean value, <m>, is 1.18, greater than the true value of 1. The
standard deviation is 0.84. Also shown on the figure is the maximum likelihood solution m_ML,
as determined empirically from the distribution.

        The purpose of this example is to show that caution must be applied to the interpretation
of all inverse problems, but especially nonlinear ones.
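
        A minimal numpy sketch of this Monte Carlo experiment (the sample size and noise level
        follow the text; the random seed is arbitrary):

        import numpy as np

        # Monte Carlo experiment for the d = m^(1/3) example.
        rng = np.random.default_rng(0)

        d_true = 1.0
        d_noisy = d_true + rng.normal(0.0, 0.25, size=50_000)  # mean 0, standard deviation 0.25

        m_est = d_noisy**3                                      # exact inverse, Eq. (7.169)

        print(m_est.mean())   # biased above the true value of 1 (compare the 1.18 quoted in the text)
        print(m_est.std())    # about 0.84, as quoted in the text
        # For comparison, the linearized estimate is [cov m] = 9 * 0.0625 = 0.5625, i.e. sigma of about 0.75.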
        [Figures: the forward relationship d = m^{1/3} and the exact inverse m = d^3 of Equations
        (7.168)-(7.169), followed by histograms of the Gaussian data noise and of the resulting
        50,000 solution estimates, with <m>, the standard deviation, and m_ML indicated.]




CHAPTER 8: VARIATIONS OF THE GENERALIZED INVERSE


8.1 Linear Transformations


8.1.1 Analysis of the Generalized Inverse Operator G_g^{-1}


        Recall Equation (2.30),

            ABC = D                                             (2.30)

which states that if the matrix D is given by the product of matrices A, B, and C, then each
column of D is the weighted sum of the columns of A and each row of D is the weighted sum of
the rows of C. Applying this to Gm = d, we saw in Equation (2.15) that the data vector d is the
weighted sum of the columns of G. Note that both the data vector and the columns of G are
N × 1 vectors in data space.

        We can extend this analysis by using singular-value decomposition. Specifically, writing
out G as

              G   =   U_P    Λ_P    V_P^T                       (6.69)
            (N × M) (N × P)(P × P)(P × M)

Each column of G is now seen as a weighted sum of the columns of U_P. Each column of G is an
N × 1 dimensional vector (i.e., in data space), and is the weighted sum of the P eigenvectors u_1,
u_2, . . . , u_P in U_P. Each row of G is a weighted sum of the rows of V_P^T, or equivalently, the
columns of V_P. Each row of G is a 1 × M row vector in model space. It is the weighted sum of
the P eigenvectors v_1, v_2, . . . , v_P in V_P.

        A similar analysis may be considered for the generalized inverse operator, where

            G_g^{-1} =   V_P   Λ_P^{-1}  U_P^T                  (7.8)
            (M × N)   (M × P) (P × P) (P × N)

Each column of G_g^{-1} is a weighted sum of the columns of V_P. Each row of G_g^{-1} is the weighted
sum of the rows of U_P^T, or equivalently, the columns of U_P.

        Let us now consider what happens in the system of equations Gm = d when we take one
of the eigenvectors in V_P as m. Let m = v_i, the ith eigenvector in V_P. Then
            G v_i  =   U_P    Λ_P    V_P^T   v_i                (8.1)
            (N × 1)  (N × P)(P × P)(P × M)(M × 1)

We can expand this as

            G v_i = U_P Λ_P [v_1^T v_i, v_2^T v_i, . . . , v_P^T v_i]^T         (8.2)

The product of V_P^T with v_i is a P × 1 vector with zeros everywhere except for the ith row, which
represents the dot product of v_i with itself.

        Continuing, we have

            G v_i = U_P Λ_P [0, . . . , 0, 1, 0, . . . , 0]^T
                  = U_P [0, . . . , 0, λ_i, 0, . . . , 0]^T
                  = λ_i [u_1i, u_2i, . . . , u_Ni]^T            (8.3)

Or simply,

            G v_i = λ_i u_i                                     (8.4)
This is, of course, simply the statement of the shifted eigenvalue problem from Equation (6.16).
The point was not, however, to reinvent the shifted eigenvalue problem, but to emphasize the
linear algebra, or mapping, between vectors in model and data space.

        Note that v_i, a unit-length vector in model space, is transformed into a vector of length λ_i
(since u_i is also of unit length) in data space. If λ_i is large, then a unit-length change in model
space in the v_i direction will have a large effect on the data. Conversely, if λ_i is small, then a
unit-length change in model space in the v_i direction will have little effect on the data.


8.1.2 G_g^{-1} Operating on a Data Vector d

Now consider a similar analysis for the generalized inverse operator G_g^{-1}, which operates
on a data vector d. Suppose that d is given by one of the eigenvectors in U_P, say u_i. Then

    G_g^{-1} u_i = V_P Λ_P^{-1} U_P^T u_i    (8.5)
    (M × 1) = (M × P)(P × P)(P × N)(N × 1)

Following the development above, note that the product of U_P^T with u_i is a P × 1 vector with
zeros everywhere except the ith row, which represents the dot product of u_i with itself. Then

    G_g^{-1} u_i = V_P Λ_P^{-1} [ 0, . . . , 0, 1, 0, . . . , 0 ]^T

Continuing,

    G_g^{-1} u_i = V_P [ 0, . . . , 0, 1/λ_i, 0, . . . , 0 ]^T
                 = (1/λ_i) [ v_{1i}, v_{2i}, . . . , v_{Mi} ]^T    (8.6)

Or simply,

    G_g^{-1} u_i = (1/λ_i) v_i    (8.7)

This is not a statement of the shifted eigenvalue problem, but it has an important
implication for the mapping between data and model spaces. Specifically, it implies that a unit-
length vector (u_i) in data space is transformed into a vector of length 1/λ_i in model space. If λ_i is
large, then small changes in d in the direction of u_i will have little effect in model space. This is
good if, as usual, these small changes in the data vector are associated with noise. If λ_i is small,
however, then small changes in d in the u_i direction will have a large effect on the model
parameter estimates. This reflects a basic instability in inverse problems whenever there are
small, nonzero singular values. Noise in the data, in directions parallel to eigenvectors
associated with small singular values, will be amplified into very unstable model parameter
estimates.

Note also that there is an intrinsic relationship, or coupling, between the eigenvectors v_i
in model space and u_i in data space. When G operates on v_i, it returns u_i, scaled by the singular
value λ_i. Conversely, when G_g^{-1} operates on u_i it returns v_i, scaled by 1/λ_i. This represents a
very strong coupling between the v_i and u_i directions, even though the former are in model space and
the latter are in data space. Finally, the linkage between these vectors depends very strongly on
the size of the nonzero singular value λ_i.
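
The paired mappings of Equations (8.4) and (8.7) are easy to verify numerically. The sketch below uses an arbitrary (hypothetical) 3 × 2 matrix G, not an example from the notes, and NumPy's SVD routine.

    import numpy as np

    # Hypothetical G; any full-column-rank matrix illustrates the coupling.
    G = np.array([[1.0, 2.0],
                  [0.5, 1.0],
                  [2.0, 0.1]])

    U, lam, VT = np.linalg.svd(G, full_matrices=False)   # U_P, singular values, V_P^T
    Gg_inv = VT.T @ np.diag(1.0 / lam) @ U.T             # generalized inverse V_P L^-1 U_P^T

    i = 0
    v_i, u_i, lam_i = VT[i], U[:, i], lam[i]

    # G maps v_i into lambda_i * u_i           (Equation 8.4)
    print(np.allclose(G @ v_i, lam_i * u_i))             # True
    # G_g^-1 maps u_i into (1/lambda_i) * v_i  (Equation 8.7)
    print(np.allclose(Gg_inv @ u_i, v_i / lam_i))        # True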


8.1.3 Mapping Between Data and Model Space: An Example

One useful way to graphically represent the mapping back and forth between model and
data spaces is with the use of "stick figures." These are formed by plotting the components of
the eigenvectors in model and data space for each model parameter and observation as a stick,
or line, whose length is given by the size of the component. These can be very helpful in
illustrating directions in model space associated with stability and instability, as well as
directions in data space where noise will have a large effect on the estimated solution.

For example, recall the previous example, given by

    [ 1.00  1.00 ] [ m_1 ]   [ 2.00 ]
    [ 2.00  2.01 ] [ m_2 ] = [ 4.10 ]    (7.131)

The singular values and associated eigenvectors are given by

    λ_1 = 3.169   and   λ_2 = 0.00316    (8.8)

    V = V_P = [ 0.706  −0.710 ]
              [ 0.709   0.704 ]    (8.9)

and

    U = U_P = [ 0.446  −0.895 ]
              [ 0.895   0.446 ]    (8.10)

From this information, we may plot the following figure:

[Figure: stick-figure plots of the eigenvector components. Top row: v_1 (components along m_1 and m_2) is mapped by G, with λ_1 = 3.169, into u_1 (components along d_1 and d_2), and back by G_g^{-1}. Bottom row: v_2 is paired in the same way with u_2, with λ_2 = 0.003.]

From V_P, we see that v_1 = [0.706, 0.709]^T. Thus, on the figure for v_1, the component
along m_1 is +0.706, while the component along m_2 is +0.709. Similarly, u_1 = [0.446, 0.895]^T,
and thus the components along d_1 and d_2 are +0.446 and +0.895, respectively. For v_2 = [−0.710,
0.704]^T, the components along m_1 and m_2 are −0.710 and +0.704, respectively. Finally, the
components of u_2 = [−0.895, 0.446]^T along d_1 and d_2 are −0.895 and +0.446, respectively.

These figures illustrate, in a simple way, the mapping back and forth between model and
data space. For example, the top figure shows that a unit-length change in model space in the
[0.706, 0.709]^T direction will be mapped by G into a change of length 3.169 in data space in the
[0.446, 0.895]^T direction. A unit-length change in data space along the [0.446, 0.895]^T direction
will be mapped by G_g^{-1} into a change of length 1/(3.169) = 0.316 in model space in the [0.706,
0.709]^T direction. This is a stable mapping back and forth, since noise in the data is damped
when it is mapped back into model space. The pairing between the v_2 and u_2 directions is less
stable, however, since a unit-length change in data space parallel to u_2 will be mapped back into
a change of length 1/(0.00316) = 317 parallel to v_2. The v_2 direction in model space will be
associated with a very large variance. Since v_2 has significant components along both m_1 and
m_2, they will both individually have large variances, as seen in the unit model covariance matrix
for this example, given by

    [cov_u m] = [  50,551  −50,154 ]
                [ −50,154   49,753 ]    (8.11)

For a particular inverse problem, these figures can help one understand both the
directions in model space that affect the data the least or most and the directions in data space
along which noise will affect the estimated solution the least or most.
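
The numbers quoted above can be recomputed directly; the short sketch below redoes the singular-value decomposition of the G matrix in Equation (7.131) and forms the unit model covariance matrix. Note that an SVD routine may return either sign for each singular-vector pair, so signs may differ from the text.

    import numpy as np

    G = np.array([[1.00, 1.00],
                  [2.00, 2.01]])

    U, lam, VT = np.linalg.svd(G)
    print(lam)                 # approximately [3.169, 0.00316]   (Equation 8.8)
    print(VT.T)                # columns ~ Equation (8.9), up to sign
    print(U)                   # columns ~ Equation (8.10), up to sign

    # Unit model covariance [cov_u m] = G_g^-1 [G_g^-1]^T = V_P Lambda_P^-2 V_P^T.
    Gg_inv = VT.T @ np.diag(1.0 / lam) @ U.T
    print(Gg_inv @ Gg_inv.T)   # entries of order 5 x 10^4, as in Equation (8.11);
                               # the tiny singular value 0.00316 dominates.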


8.2 Including Prior Information, or the Weighted Generalized Inverse


8.2.1 Mathematical Background

As we have seen, the generalized inverse operator is a very powerful operator, combining
the attributes of both least squares and minimum length estimators. Specifically, the generalized
inverse minimizes both

    e^T e = [d − d^pre]^T [d − d^pre] = [d − Gm]^T [d − Gm]    (8.12)

and [m − <m>]^T [m − <m>], where <m> is the a priori estimate of the solution.

As discussed in Chapter 3, however, it is useful to include as much prior information in
an inverse problem as possible. Two forms of prior information were included in weighted least
squares and weighted minimum length, and resulted in new minimization criteria given by

    e^T W_e e = e^T [cov d]^{-1} e = [d − Gm]^T [cov d]^{-1} [d − Gm]    (8.13)

and

    m^T W_m m = [m − <m>]^T [cov m]^{-1} [m − <m>]    (8.14)

where [cov d] and [cov m] are a priori data and model covariance matrices, respectively. It is
possible to include this information in a generalized inverse analysis as well.

The basic procedure is as follows. First, transform the problem into a coordinate system
where the new data and model parameters each have uncorrelated errors and unit variance. The
transformations are based on the information contained in the a priori data and model parameter
covariance matrices. Then, perform a generalized inverse analysis in the transformed coordinate
system. This is the appropriate inverse operator because both of the covariance matrices are
identity matrices. Finally, transform everything back to the original coordinates to obtain the
final solution.

One may assume that the data covariance matrix [cov d] is a positive definite Hermitian
matrix. This is equivalent to assuming that all variances are positive, and none of the correlation
coefficients are exactly equal to plus or minus one. Then the data covariance matrix can be
decomposed as

    [cov d] = B Λ_d B^T    (8.15)
    (N × N) = (N × N)(N × N)(N × N)

where Λ_d is a diagonal matrix containing the eigenvalues of [cov d] and B is an orthonormal
matrix containing the associated eigenvectors. B is orthonormal because [cov d] is Hermitian,
and all of the eigenvalues are positive because [cov d] is positive definite.

The inverse data covariance matrix is easily found as

    [cov d]^{-1} = B Λ_d^{-1} B^T    (8.16)

where we have taken advantage of the fact that B is an orthonormal matrix. It is convenient to
write the right-hand side of (8.16) as

    B Λ_d^{-1} B^T = D^T D    (8.17)

where

    D = Λ_d^{-1/2} B^T    (8.18)

Thus,

    [cov d]^{-1} = D^T D    (8.19)
The reason for writing the data covariance matrix in terms of D will be clear when we
introduce the transformed data vector. The covariance matrix itself can be expressed in terms of
D as

    [cov d] = {[cov d]^{-1}}^{-1} = [D^T D]^{-1} = D^{-1}[D^T]^{-1}    (8.20)

Similarly, the positive definite Hermitian model covariance matrix may be decomposed
as

    [cov m] = M Λ_m M^T    (8.21)
    (M × M) = (M × M)(M × M)(M × M)

where Λ_m is a diagonal matrix containing the eigenvalues of [cov m] and M is an orthonormal
matrix containing the associated eigenvectors.

The inverse model covariance matrix is thus given by

    [cov m]^{-1} = M Λ_m^{-1} M^T    (8.22)

where, as before, we have taken advantage of the fact that M is an orthonormal matrix. The
right-hand side of (8.22) can be written as

    M Λ_m^{-1} M^T = S^T S    (8.23)

where

    S = Λ_m^{-1/2} M^T    (8.24)

Thus,

    [cov m]^{-1} = S^T S    (8.25)

As before, it is possible to write the covariance matrix in terms of S as

    [cov m] = {[cov m]^{-1}}^{-1} = [S^T S]^{-1} = S^{-1}[S^T]^{-1}    (8.26)
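
A minimal sketch of how D and S can be constructed from Equations (8.18) and (8.24) is given below; the two covariance matrices used here are hypothetical, purely for illustration.

    import numpy as np

    def whitening_matrix(cov):
        """Return W = Lambda^(-1/2) B^T so that W cov W^T = I
        (cov must be symmetric positive definite)."""
        eigvals, B = np.linalg.eigh(cov)            # cov = B Lambda B^T
        return np.diag(eigvals**-0.5) @ B.T

    # hypothetical a priori covariance matrices
    cov_d = np.array([[4.0, 1.0], [1.0, 9.0]])
    cov_m = np.array([[2.0, 0.5], [0.5, 1.0]])

    D = whitening_matrix(cov_d)                     # Equation (8.18)
    S = whitening_matrix(cov_m)                     # Equation (8.24)

    print(np.allclose(D.T @ D, np.linalg.inv(cov_d)))   # Equation (8.19): True
    print(np.allclose(S.T @ S, np.linalg.inv(cov_m)))   # Equation (8.25): True
    print(np.allclose(D @ cov_d @ D.T, np.eye(2)))      # transformed errors have unit variance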
8.2.2 Coordinate System Transformation of Data and Model Parameter Vectors

The utility of D and S can now be seen as we introduce transformed data and model
parameter vectors. First, we introduce a transformed data vector d′ as

    d′ = Λ_d^{-1/2} B^T d    (8.27)

or

    d′ = D d    (8.28)

The transformed model parameter vector m′ is given by

    m′ = Λ_m^{-1/2} M^T m    (8.29)

or

    m′ = S m    (8.30)

The forward operator G must also be transformed into G′, the operator in the new coordinates.
The transformation can be found by recognizing that

    G m = d    (8.31)

    G′ S m = D d    (8.32)

or

    D^{-1} G′ S m = d = G m    (8.33)

That is,

    D^{-1} G′ S = G    (8.34)

Finally, by pre- and postmultiplying by D and S^{-1}, respectively, we obtain G′ as

    G′ = D G S^{-1}    (8.35)

The transformations back from the primed coordinates to the original coordinates are given by

    d = B Λ_d^{1/2} d′    (8.36)

or

    d = D^{-1} d′    (8.37)

    m = M Λ_m^{1/2} m′    (8.38)

or

    m = S^{-1} m′    (8.39)

and

    G = B Λ_d^{1/2} G′ Λ_m^{-1/2} M^T    (8.40)

or

    G = D^{-1} G′ S    (8.41)

In the new coordinate system, the generalized inverse will minimize

    e′^T e′ = [d′ − d′^pre]^T [d′ − d′^pre] = [d′ − G′m′]^T [d′ − G′m′]    (8.42)

and [m′]^T m′.

Replacing d′, m′, and G′ in (8.42) with Equations (8.27)–(8.35), we have

    [d′ − G′m′]^T [d′ − G′m′] = [Dd − DGS^{-1}Sm]^T [Dd − DGS^{-1}Sm]
                              = [Dd − DGm]^T [Dd − DGm]
                              = {D[d − Gm]}^T {D[d − Gm]}
                              = [d − Gm]^T D^T D [d − Gm]
                              = [d − Gm]^T [cov d]^{-1} [d − Gm]    (8.43)

where we have used (8.19) to replace D^T D with [cov d]^{-1}.

Equation (8.43) shows that the unweighted misfit in the primed coordinate system is
precisely the weighted misfit to be minimized in the original coordinates. Thus, the least squares
solution in the primed coordinate system is equivalent to weighted least squares in the original
coordinates.

Furthermore, using (8.29) for m′, we have

    m′^T m′ = [Sm]^T Sm
            = m^T S^T S m
            = m^T [cov m]^{-1} m    (8.44)

where we have used (8.25) to replace S^T S with [cov m]^{-1}.

Equation (8.44) shows that the unweighted minimum length solution in the new
coordinate system is equivalent to the weighted minimum length solution in the original
coordinate system. Thus minimum length in the new coordinate system is equivalent to
weighted minimum length in the original coordinates.


8.2.3 The Maximum Likelihood Inverse Operator, Resolution, and Model Covariance

The generalized inverse operator in the primed coordinates can be transformed into an
operator in the original coordinates. We will show later that this is, in fact, the maximum
likelihood operator in the case where all distributions are Gaussian. Let this inverse operator be
G_MX^{-1}, and be given by

    G_MX^{-1} = S^{-1} [G′]_g^{-1} D    (8.45)

where [G′]_g^{-1} is the generalized inverse of G′ = DGS^{-1}.

The solution in the original coordinates, m_MX, can be expressed either as

    m_MX = G_MX^{-1} d    (8.46)

or as

    m_MX = S^{-1} m′_g = S^{-1} [G′]_g^{-1} d′    (8.47)

Now that the operator has been expressed in the original coordinates, it is possible to calculate
the resolution matrices and an a posteriori model covariance matrix.

The model resolution matrix R is given by

    R = G_MX^{-1} G
      = {S^{-1}[G′]_g^{-1}D}{D^{-1}G′S}
      = S^{-1}[G′]_g^{-1}G′S
      = S^{-1} R′ S    (8.48)
where R′ is the model resolution matrix in the transformed coordinate system.

Similarly, the data resolution matrix N is given by

    N = G G_MX^{-1}
      = {D^{-1}G′S}{S^{-1}[G′]_g^{-1}D}
      = D^{-1}G′[G′]_g^{-1}D
      = D^{-1} N′ D    (8.49)

The a posteriori model covariance matrix [cov m]_P is given by

    [cov m]_P = G_MX^{-1} [cov d] [G_MX^{-1}]^T    (8.50)

Replacing [cov d] in (8.50) with (8.20) gives

    [cov m]_P = G_MX^{-1} D^{-1}[D^T]^{-1} [G_MX^{-1}]^T
              = {S^{-1}[G′]_g^{-1}D} D^{-1}[D^T]^{-1} {S^{-1}[G′]_g^{-1}D}^T
              = S^{-1}[G′]_g^{-1} D D^{-1}[D^T]^{-1} D^T {[G′]_g^{-1}}^T [S^{-1}]^T
              = S^{-1}[G′]_g^{-1} {[G′]_g^{-1}}^T [S^{-1}]^T
              = S^{-1} [cov_u m′] [S^{-1}]^T    (8.51)

That is, an a posteriori estimate of model parameter uncertainties can be obtained by
transforming the unit model covariance matrix from the primed coordinates back to the original
coordinates.
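
The whole procedure of Sections 8.2.1–8.2.3 can be sketched compactly as below. This is a minimal sketch, assuming G, d, [cov d], and [cov m] are given; it whitens the data and model spaces, applies the generalized inverse (via the SVD-based pseudoinverse) in the primed coordinates, and transforms the operator back.

    import numpy as np

    def whiten(cov):
        # W = Lambda^(-1/2) B^T, so that W cov W^T = I   (Equations 8.18 / 8.24)
        w, B = np.linalg.eigh(cov)
        return np.diag(w**-0.5) @ B.T

    def weighted_generalized_inverse(G, d, cov_d, cov_m):
        D, S = whiten(cov_d), whiten(cov_m)
        Sinv = np.linalg.inv(S)

        G_p = D @ G @ Sinv                  # G' = D G S^-1          (Equation 8.35)
        Gp_g = np.linalg.pinv(G_p)          # [G']_g^-1 via SVD
        G_mx = Sinv @ Gp_g @ D              # back to original coordinates (8.45)

        m_mx = G_mx @ d                     # solution                (Equation 8.46)
        R = G_mx @ G                        # model resolution        (Equation 8.48)
        N = G @ G_mx                        # data resolution         (Equation 8.49)
        cov_post = G_mx @ cov_d @ G_mx.T    # a posteriori [cov m]    (Equation 8.50)
        return m_mx, R, N, cov_post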

It is important to realize that the transformations introduced by D and S in (8.27)–(8.41)
are not, in general, orthonormal. Thus,

    d′ = D d    (8.28)

implies that the length of the transformed data vector d′ is, in general, not equal to the length of
the original data vector d. The function of D is to transform the data space into one in which the
data errors are uncorrelated and all observations have unit variance. If the original data errors are
uncorrelated, the data covariance matrix will be diagonal and B, from

    [cov d]^{-1} = B Λ_d^{-1} B^T    (8.16)

will be an identity matrix. Then D, given by

    D = Λ_d^{-1/2} B^T    (8.18)

will be a diagonal matrix given by Λ_d^{-1/2}. The transformed data d′ are then given by

    d′ = Λ_d^{-1/2} d    (8.52)

or

    d′_i = d_i / σ_di        i = 1, N    (8.53)

where σ_di is the data standard deviation for the ith observation. If the original data errors are
uncorrelated, then each transformed observation is given by the original observation, divided by
its standard deviation. The transformation in this case can be thought of as leaving the direction
of each axis in data space unchanged, but stretching or compressing each axis, depending on the
standard deviation. To see this, consider a vector in data space representing the d_1 axis. That is,

    d = [ 1, 0, . . . , 0 ]^T    (8.54)

This data vector is transformed into

    d′ = [ 1/σ_d1, 0, . . . , 0 ]^T    (8.55)

That is, the direction of the axis is unchanged, but the magnitude is changed by 1/σ_d1. If the data
errors are correlated, then the axes in data space are rotated (by B^T), and then stretched or
compressed.

Very similar arguments can be made about the role of S in model space. That is, if the a
priori model covariance matrix is diagonal, then the directions of the transformed axes in model
space are the same as in the original coordinates (i.e., m_1, m_2, . . . , m_M), but the lengths are
stretched or compressed by the appropriate model parameter standard deviations. If the errors
are correlated, then the axes in model space are rotated (by M^T) before they are stretched or
compressed.
8.2.4 Effect on Model- and Data-Space Eigenvectors

This stretching and compressing of directions in data and model space affects the
eigenvectors as well. Let V̄ be the set of vectors transformed back into the original coordinates
from V′, the set of model eigenvectors in the primed coordinates. Thus,

    V̄ = S^{-1} V′    (8.56)

For example, suppose that [cov m] is diagonal; then

    V̄ = Λ_m^{1/2} V′    (8.57)

For v̄_i, the ith vector in V̄, this implies

    v̄_i = [ σ_m1 v′_1i, σ_m2 v′_2i, . . . , σ_mM v′_Mi ]^T    (8.58)

where σ_mj is the a priori standard deviation of the jth model parameter. Clearly, for a general
diagonal [cov m], v̄_i will no longer have unit length. This is true whether or not [cov m] is
diagonal. Thus, in general, the vectors in V̄ are not unit-length vectors. They can, of course, be
normalized to unit length. Perhaps more importantly, however, the directions of the v̄_i have been
changed, and the vectors in V̄ are no longer perpendicular to each other. Thus, the vectors in V̄
cannot be thought of as orthonormal eigenvectors, even if they have been normalized to unit
length.

These vectors still play an important role in the inverse analysis, however. Recall that the
solution m_MX is given by

    m_MX = G_MX^{-1} d    (8.46)

or as

    m_MX = S^{-1} m′_g = S^{-1} [G′]_g^{-1} d′    (8.47)

We can expand (8.47) as

    m_MX = S^{-1} V′_P [Λ′_P]^{-1} [U′_P]^T D d
         = V̄_P [Λ′_P]^{-1} [U′_P]^T D d    (8.59)

Recall that the solution m_MX can be thought of as a linear combination of the columns of
the first matrix in a product of several matrices [see Equations (2.23)–(2.30)]. This implies that
the solution m_MX consists of a linear combination of the columns of V̄_P. The solution is still a
linear combination of the vectors in V̄_P, even if they have been normalized to unit length. Thus,
V̄_P still plays a fundamental role in the inverse analysis.

It is important to realize that [cov m] will only affect the solution if P < M. If P = M,
then V′_P = V′, and V′_P spans all of model space. V̄_P will also span all of solution space. In this
case, all of model space can be expressed as a linear combination of the vectors in V̄_P, even
though they are not an orthonormal set of vectors. Thus, the same solution will be reached,
regardless of the values in [cov m]. If P < M, however, the mapping of vectors from the primed
coordinates back to the original space can affect the part of solution space that is spanned by V̄_P.
We will return to this point later with a specific example.

Very similar arguments can be made for the data eigenvectors as well. Let Ū be the set
of vectors obtained by transforming the data eigenvectors U′ in the primed coordinates back into
the original coordinates. Then

    Ū = D^{-1} U′    (8.60)

In general, the vectors in Ū will not be either of unit length or perpendicular to each other.

The predicted data d̂ are given by

    d̂ = G m_MX
      = D^{-1} G′ S m_MX
      = D^{-1} U′_P Λ′_P V′_P^T S m_MX
      = Ū_P Λ′_P V′_P^T S m_MX    (8.61)

Thus, the predicted data are a linear combination of the columns of Ū_P.

It is important to realize that the transformations introduced by [cov d] will only affect
the solution if P < N. If P = N, then U′_P = U′, and U′_P spans all of data space. The matrix Ū_P
will also span all of data space. In this case, all of data space can be expressed as a linear
combination of the vectors in Ū_P, even though they are not an orthonormal set of vectors. Thus,
the same solution will be reached, regardless of the values in [cov d]. If P < N, however, the
mapping of vectors from the primed coordinates back to the original space can affect the part of
solution space that is spanned by Ū_P. We are now in a position to consider a specific example.


8.2.5 An Example

Consider the following specific example of the form Gm = d, where G and d are given
by

    G = [ 1.00  1.00 ]
        [ 2.00  2.00 ]    (8.62)

    d = [ 4.00 ]
        [ 5.00 ]    (8.63)

If we assume for the moment that the a priori data and model parameter covariance matrices are
identity matrices and perform a generalized inverse analysis, we obtain

    P = 1 < M = N = 2    (8.64)

    λ_1 = 3.162    (8.65)

    V = [ 0.707  −0.707 ]
        [ 0.707   0.707 ]    (8.66)

    U = [ 0.447  −0.894 ]
        [ 0.894   0.447 ]    (8.67)

    R = [ 0.500  0.500 ]
        [ 0.500  0.500 ]    (8.68)

    N = [ 0.200  0.400 ]
        [ 0.400  0.800 ]    (8.69)

    m_g = [ 1.400 ]
          [ 1.400 ]    (8.70)

    d̂ = [ 2.800 ]
        [ 5.600 ]    (8.71)

    e^T e = e^T [cov d]^{-1} e = 1.800    (8.72)

The two rows (or columns) of G are linearly dependent, and thus the number of nonzero
singular values is one. Thus, the first column of V (or U) gives V_P (or U_P), while the second
column gives V_0 (or U_0). The generalized inverse solution m_g must lie in V_P space, and is thus
parallel to the [0.707, 0.707]^T direction in model space. Similarly, the predicted data d̂ must lie
in U_P space, and is thus parallel to the [0.447, 0.894]^T direction in data space. The model
resolution matrix R indicates that only the sum, equally weighted, of the model parameters m_1
and m_2 is resolved. Similarly, the data resolution matrix N indicates that only the sum of d_1 and
d_2, with more weight on d_2, is resolved, or important, in constraining the solution.
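
These unweighted results are easy to verify; the sketch below uses NumPy's pinv, which computes the generalized inverse by discarding the zero singular value.

    import numpy as np

    # Unweighted example, Equations (8.62)-(8.72).
    G = np.array([[1.0, 1.0],
                  [2.0, 2.0]])
    d = np.array([4.0, 5.0])

    Gg = np.linalg.pinv(G)            # generalized inverse
    m_g = Gg @ d
    d_pre = G @ m_g
    e = d - d_pre

    print(m_g)                        # [1.4, 1.4]                      (Equation 8.70)
    print(d_pre)                      # [2.8, 5.6]                      (Equation 8.71)
    print(e @ e)                      # 1.8                             (Equation 8.72)
    print(Gg @ G)                     # R = [[0.5, 0.5], [0.5, 0.5]]    (Equation 8.68)
    print(G @ Gg)                     # N = [[0.2, 0.4], [0.4, 0.8]]    (Equation 8.69)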

Now let us assume that the a priori data and model parameter covariance matrices are not
equal to a constant times the identity matrix. Suppose

    [cov d] = [  4.362  −2.052 ]
              [ −2.052  15.638 ]    (8.73)

and

    [cov m] = [ 23.128   5.142 ]
              [  5.142  10.872 ]    (8.74)

The data covariance matrix [cov d] can be decomposed as

    [cov d] = B Λ_d B^T = [ 0.985  −0.174 ] [ 4.000   0.000 ] [  0.985  0.174 ]
                          [ 0.174   0.985 ] [ 0.000  16.000 ] [ −0.174  0.985 ]    (8.75)

Recall that B contains the eigenvectors of the symmetric matrix [cov d]. Furthermore,
these eigenvectors represent the directions of the major and minor axes of an ellipse. Thus, for
the present case, the first vector in B, [0.985, 0.174]^T, is the direction in data space of the minor
axis of an ellipse having a half-length of 4. Similarly, the second vector in B, [−0.174, 0.985]^T,
is the direction in data space of the major axis of an ellipse having length 16. The eigenvectors
in B^T represent a 10° counterclockwise rotation of data space, as shown below:

[Figure: the (d_1, d_2) axes and the rotated axes d_1′, d_2′ (a 10° counterclockwise rotation), with the data error ellipse having axes of length 4 and 16 along d_1′ and d_2′, respectively.]

The negative off-diagonal entries in [cov d] indicate a negative correlation of errors between d_1
and d_2. Compare the figure above with figure (c) just after Equation (2.43).

The inverse data covariance [cov d]^{-1} can also be written as

    [cov d]^{-1} = B Λ_d^{-1} B^T = D^T D    (8.76)

where D is given by

    D = Λ_d^{-1/2} B^T = [  0.492  0.087 ]
                         [ −0.043  0.246 ]    (8.77)
Similarly, the model covariance matrix [cov m] can be decomposed as

    [cov m] = M Λ_m M^T = [ 0.940  −0.342 ] [ 25.000  0.000 ] [  0.940  0.342 ]
                          [ 0.342   0.940 ] [  0.000  9.000 ] [ −0.342  0.940 ]    (8.78)

The matrix M^T represents a 20° counterclockwise rotation of the m_1 and m_2 axes in model space.
In the new coordinate system, the a priori model parameter errors are uncorrelated and have
variances of 25 and 9, respectively. The major and minor axes of the error ellipse are along the
[0.940, 0.342]^T and [−0.342, 0.940]^T directions, respectively. The geometry of the problem in
model space is shown below:

[Figure: the (m_1, m_2) axes and the rotated axes m_1′, m_2′ (a 20° counterclockwise rotation), with the a priori model error ellipse having axes of length 25 and 9 along m_1′ and m_2′, respectively.]

The inverse model parameter covariance matrix [cov m]^{-1} can also be written as

    [cov m]^{-1} = M Λ_m^{-1} M^T = S^T S    (8.79)

where S is given by

    S = Λ_m^{-1/2} M^T = [  0.188  0.068 ]
                         [ −0.114  0.313 ]    (8.80)

With the information in D and S, it is now possible to transform G, d, and m into G′, d′,
and m′ in the new coordinate system:

    G′ = DGS^{-1} = [  0.492  0.087 ] [ 1.000  1.000 ] [ 4.698  −1.026 ]
                    [ −0.043  0.246 ] [ 2.000  2.000 ] [ 1.710   2.819 ]

                  = [ 4.26844  1.19424 ]
                    [ 2.87739  0.80505 ]    (8.81)

and

    d′ = Dd = [  0.492  0.087 ] [ 4.000 ]   [ 2.40374 ]
              [ −0.043  0.246 ] [ 5.000 ] = [ 1.05736 ]    (8.82)
In the new coordinate system, the data and model parameter covariance matrices are
identity matrices. Thus, a generalized inverse analysis gives

    P = 1 < M = N = 2    (8.83)

    λ′_1 = 5.345    (8.84)

    V′ = [ 0.963  −0.269 ]
         [ 0.269   0.963 ]    (8.85)

    U′ = [ 0.829  −0.559 ]
         [ 0.559   0.829 ]    (8.86)

    [G′]_g^{-1} = [ 0.149  0.101 ]
                  [ 0.042  0.028 ]    (8.87)

    R′ = [ 0.927  0.259 ]
         [ 0.259  0.073 ]    (8.88)

    N′ = [ 0.688  0.463 ]
         [ 0.463  0.312 ]    (8.89)

    m′_g = [ 0.466 ]
           [ 0.130 ]    (8.90)

    d̂′ = [ 2.143 ]
         [ 1.445 ]    (8.91)

    [e′]^T e′ = e^T [cov d]^{-1} e = 0.218    (8.92)

The results may be transformed back to the original coordinates, using Equations (8.37), (8.39),
(8.44), (8.48), (8.49), (8.56), and (8.60), as

    G_MX^{-1} = [ 0.305  0.167 ]
                [ 0.173  0.094 ]    (8.93)

    λ′_1 = 5.345    (8.94)

    m_MX = S^{-1} m′_g = [ 2.054 ]
                         [ 1.163 ]    (8.95)

    d̂ = D^{-1} d̂′ = [ 3.217 ]
                    [ 6.434 ]    (8.96)

    e^T e = 2.670    (8.97)

    e^T [cov d]^{-1} e = [ 0.783  −1.434 ] [ 0.244  0.032 ] [  0.783 ]
                                           [ 0.032  0.068 ] [ −1.434 ] = 0.218    (8.98)

    V̄ = [ 0.870  −0.707 ]
        [ 0.493   0.707 ]    (8.99)

    Ū = [ 0.447  −0.479 ]
        [ 0.894   0.878 ]    (8.100)

    R = [ 0.639  0.639 ]
        [ 0.362  0.362 ]    (8.101)

    N = [ 0.478  0.261 ]
        [ 0.956  0.522 ]    (8.102)

Note that e^T e = 2.670 for the weighted case is larger than the misfit e^T e = 1.800 for the
unweighted case. This is to be expected because the unweighted case should produce the
smallest misfit. The weighted case provides an answer that gives more weight to better-known
data, but it produces a larger total misfit.

The Ū and V̄ matrices were obtained by transforming each eigenvector in the primed
coordinate system into a vector in the original coordinates, and then scaling to unit length. Note
that the vectors in Ū (and V̄) are not perpendicular to each other. Note also that the solution
m_MX is parallel to the [0.870, 0.493]^T direction in model space, also given by the first column of
V̄. The predicted data d̂ are parallel to the [0.447, 0.894]^T direction in data space, also given by
the first column in Ū.

The resolution matrices were obtained from the primed-coordinate resolution matrices
using Equations (8.48)–(8.49). Note that they are no longer symmetric matrices, but that the trace
has remained equal to one. The model resolution matrix R still indicates that only a sum of the
two model parameters m_1 and m_2 is resolved, but now we see that the estimate of m_1 is better
resolved than that of m_2. This may not seem intuitively obvious, since the a priori variance of
m_2 is less than that of m_1, and thus m_2 is better known. Because m_2 is better known, the
inverse operator will leave m_2 closer to its prior estimate. Thus, m_1 will be allowed to vary
further from its prior estimate. It is in this sense that the resolution of m_1 is greater than that of
m_2. The data resolution matrix N still indicates that only the sum of d_1 and d_2 is resolved, or
important, in constraining the solution. Now, however, the importance of the first observation
has been increased significantly from the unweighted case, reflecting the smaller variance for d_1
compared to d_2.
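
This weighted example can be reproduced numerically with the weighted_generalized_inverse() sketch given after Equation (8.51); that helper function is assumed to be in scope here.

    import numpy as np

    G = np.array([[1.0, 1.0], [2.0, 2.0]])
    d = np.array([4.0, 5.0])
    cov_d = np.array([[ 4.362, -2.052], [-2.052, 15.638]])
    cov_m = np.array([[23.128,  5.142], [ 5.142, 10.872]])

    m_mx, R, N, cov_post = weighted_generalized_inverse(G, d, cov_d, cov_m)

    print(m_mx)                                 # ~[2.054, 1.163]   (Equation 8.95)
    print(G @ m_mx)                             # ~[3.217, 6.434]   (Equation 8.96)
    print(np.trace(R), np.trace(N))             # both ~1.0
    e = d - G @ m_mx
    print(e @ e)                                # ~2.67             (Equation 8.97)
    print(e @ np.linalg.inv(cov_d) @ e)         # ~0.218            (Equation 8.98)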


8.3 Damped Least Squares and the Stochastic Inverse


8.3.1 Introduction

As we have seen, the presence of small singular values causes significant stability
problems with the generalized inverse. One approach is simply to set small singular values to
zero, and relegate the associated eigenvectors to the zero spaces. This improves stability, with an
inevitable decrease in resolution. Ideally, the cut-off value for small singular values should be
based on how noisy the data are. In practice, however, the decision is almost always arbitrary.

We will now introduce a damping term, the function of which is to improve the stability
of inverse problems with small singular values. First, however, we will consider another inverse
operator, the stochastic inverse.


8.3.2 The Stochastic Inverse

Consider a forward problem given by

Gm + n = d (8.103)

where n is an N × 1 noise vector. It is similar to

Gm = d (1.13)
except that we explicitly separate out the contribution of noise to the total data vector d. This has
some important implications, however.

We assume that both m and n are stochastic (i.e., random variables, as described in
Chapter 2, that are characterized by their statistical properties) processes, with mean (or
expected) values of zero. This is natural for noise, but implies that the mean value must be
subtracted from all model parameters. Furthermore, we assume that we have estimates for the
model parameter and noise covariance matrices, [cov m] and [cov n], respectively.

The stochastic inverse is defined by minimizing the average, or statistical, discrepancy
between m and G_s^{-1}d, where G_s^{-1} is the stochastic inverse. Let G_s^{-1} = L, and determine L by
minimizing

    ⟨ [ m_i − Σ_{j=1}^{N} L_ij d_j ]² ⟩    (8.104)

for each i. Consider repeated experiments in which m and n are generated. Let these values, on
the kth experiment, be m^k and n^k, respectively. If there are a total of q experiments, then we seek
L which minimizes

    (1/q) Σ_{k=1}^{q} [ m_i^k − Σ_{j=1}^{N} L_ij d_j^k ]²    (8.105)

The minimum of Equation (8.105) is found by differentiating with respect to L_il and
setting it equal to zero:

    ∂/∂L_il { (1/q) Σ_{k=1}^{q} [ m_i^k − Σ_{j=1}^{N} L_ij d_j^k ]² } = 0    (8.106)

or

    −(2/q) Σ_{k=1}^{q} [ m_i^k − Σ_{j=1}^{N} L_ij d_j^k ] d_l^k = 0    (8.107)

This implies

    (1/q) Σ_{k=1}^{q} m_i^k d_l^k = Σ_{j=1}^{N} L_ij [ (1/q) Σ_{k=1}^{q} d_j^k d_l^k ]    (8.108)

The left-hand side of Equation (8.108), when taken over i and l, is simply the covariance
matrix between the model parameters and the data, or

    [cov md] = <md^T>    (8.109)

The right-hand side, again taken over i and l and recognizing that L will not vary from
experiment to experiment, gives [see Equation (2.63)]

    L[cov d] = L<dd^T>    (8.110)

where [cov d] is the data covariance matrix. Note that [cov d] is not the same matrix used
elsewhere in these notes. As used here, [cov d] is a derived quantity, based on [cov m] and
[cov n]. With Equations (8.109) and (8.110), we can write Equation (8.108), taken over i and l,
as

    [cov md] = L[cov d]    (8.111)

or

    L = [cov md][cov d]^{-1}    (8.112)

We now need to rewrite [cov d] and [cov md] in terms of [cov m], [cov n], and G. This
is done as follows:

    [cov d] = <dd^T>
            = <[Gm + n][Gm + n]^T>
            = G<mm^T>G^T + G<mn^T> + <nm^T>G^T + <nn^T>    (8.113)

If we assume that model parameter and noise errors are uncorrelated, that is, that <mn^T> = 0 =
<nm^T>, then Equation (8.113) reduces to

    [cov d] = G<mm^T>G^T + <nn^T>
            = G[cov m]G^T + [cov n]    (8.114)
Similarly,

    [cov md] = <md^T>
             = <m[Gm + n]^T>
             = <mm^T>G^T + <mn^T>
             = [cov m]G^T    (8.115)

if <mn^T> = 0.

Replacing [cov md] and [cov d] in Equation (8.112) with the expressions from Equations
(8.115) and (8.114), respectively, gives the definition of the stochastic inverse operator G_s^{-1} as

    G_s^{-1} = [cov m]G^T {G[cov m]G^T + [cov n]}^{-1}    (8.116)

Then the stochastic inverse solution, m_s, is given by

    m_s = G_s^{-1} d
        = [cov m]G^T [cov d]^{-1} d    (8.117)

It is possible to decompose the symmetric covariance matrices [cov d] and [cov m] in
exactly the same manner as was done for the maximum likelihood operator [Equations (8.19) and
(8.25)]:

    [cov d] = B Λ_d B^T = {B Λ_d^{1/2}}{Λ_d^{1/2} B^T} = D^{-1}[D^{-1}]^T    (8.118)

    [cov d]^{-1} = B Λ_d^{-1} B^T = D^T D    (8.119)

    [cov m] = M Λ_m M^T = {M Λ_m^{1/2}}{Λ_m^{1/2} M^T} = S^{-1}[S^{-1}]^T    (8.120)

    [cov m]^{-1} = M Λ_m^{-1} M^T = S^T S    (8.121)

where Λ_d and Λ_m contain the eigenvalues of [cov d] and [cov m], respectively. The orthogonal
matrices B and M are the associated eigenvectors.

At this point it is useful to reintroduce a set of transformations based on the
decompositions in (8.118)–(8.121) that will transform d, m, and G back and forth between the
original coordinate system and a primed coordinate system.

    m′ = S m    (8.122)

    d′ = D d    (8.123)

    G′ = D G S^{-1}    (8.124)

    m = S^{-1} m′    (8.125)

    d = D^{-1} d′    (8.126)

    G = D^{-1} G′ S    (8.127)

Then, Equation (8.117), using primed coordinate variables, is given by

    S^{-1} m′_s = [cov m]G^T[cov d]^{-1}d

    S^{-1} m′_s = S^{-1}[S^{-1}]^T [D^{-1}G′S]^T D^T D D^{-1} d′
                = S^{-1}[S^{-1}]^T S^T [G′]^T [D^{-1}]^T D^T d′    (8.128)

but

    [S^{-1}]^T S^T = I_M    (8.129)

and

    [D^{-1}]^T D^T = I_N    (8.130)

and hence

    S^{-1} m′_s = S^{-1} [G′]^T d′    (8.131)

Premultiplying both sides by S yields

    m′_s = [G′]^T d′    (8.132)

That is, the stochastic inverse in the primed coordinate system is simply the transpose of G′ in the
primed coordinate system. Once you have found m′_s, you can transform back to the original
coordinates to obtain the stochastic solution as

    m_s = S^{-1} m′_s    (8.133)

The stochastic inverse minimizes the sum of the weighted model parameter vector and
the weighted data misfit. That is, the quantity

    m^T [cov m]^{-1} m + [d − d̂]^T [cov d]^{-1} [d − d̂]    (8.134)

is minimized. The generalized inverse, or maximum likelihood, operator minimizes each term
individually but not the sum.

It is important to realize that the transformations introduced in Equations (8.118)–(8.121),
while of the same form and nomenclature as those introduced in the weighted generalized inverse
case in Equations (8.17) and (8.23), differ in an important aspect. Namely, as mentioned after
Equation (8.110), [cov d] is now a derived quantity, given by Equation (8.114):

    [cov d] = G[cov m]G^T + [cov n]    (8.114)

The data covariance matrix [cov d] is only equal to the noise covariance matrix [cov n] if you
assume that the errors in m are exactly zero. Thus, before doing a stochastic inverse
analysis and the transformations given in Equations (8.118)–(8.121), [cov d] must be constructed
from the noise covariance matrix [cov n] and the mapping of model parameter uncertainties in
[cov m] as shown in Equation (8.114).
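
A minimal sketch of the stochastic inverse, built directly from Equations (8.114), (8.116), and (8.117), is given below. G and the covariance values used here are hypothetical, chosen only for illustration.

    import numpy as np

    G = np.array([[1.0, 1.0],
                  [2.0, 2.0]])
    cov_m = 4.0 * np.eye(2)          # assumed a priori model covariance
    cov_n = 0.5 * np.eye(2)          # assumed noise covariance

    cov_d = G @ cov_m @ G.T + cov_n                   # derived [cov d], Equation (8.114)
    G_s = cov_m @ G.T @ np.linalg.inv(cov_d)          # stochastic inverse, Equation (8.116)

    d = np.array([4.0, 5.0])
    m_s = G_s @ d                                     # stochastic solution, Equation (8.117)
    print(m_s)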


8.3.3 Damped Least Squares

We are now ready to see how this applies to damped least squares. Suppose

    [cov m] = σ_m² I_M    (8.135)

and

    [cov n] = σ_n² I_N    (8.136)

Define a damping term ε² as

    ε² = σ_n² / σ_m²    (8.137)

The stochastic inverse operator, from Equation (8.116), becomes

    G_s^{-1} = G^T [GG^T + ε²I_N]^{-1}    (8.138)

To determine the effect of adding the ε² term, consider the following:

    GG^T = U_P Λ_P² U_P^T    (7.43)

[GG^T]^{-1} exists only when P = N, and is given by

    [GG^T]^{-1} = U_P Λ_P^{-2} U_P^T        P = N    (7.44)

We can therefore write GG^T + ε²I as

    GG^T + ε²I = [ U_P  U_0 ] [ Λ_P² + ε²I_P       0        ] [ U_P^T ]
                              [      0         ε²I_{N−P}    ] [ U_0^T ]    (8.139)
    (N × N)

Thus

    [GG^T + ε²I]^{-1} = [ U_P  U_0 ] [ [Λ_P² + ε²I_P]^{-1}         0          ] [ U_P^T ]
                                     [         0           [ε²I_{N−P}]^{-1}  ] [ U_0^T ]    (8.140)

Explicitly multiplying Equation (8.140) out gives

    [GG^T + ε²I]^{-1} = U_P [Λ_P² + ε²I_P]^{-1} U_P^T + U_0 [ε²I_{N−P}]^{-1} U_0^T    (8.141)

Next, we write out Equation (8.138), using singular-value decomposition, as

    G_s^{-1} = G^T [GG^T + ε²I_N]^{-1}
             = {V_P Λ_P U_P^T}{U_P [Λ_P² + ε²I_P]^{-1} U_P^T + U_0 [ε²I_{N−P}]^{-1} U_0^T}
             = V_P Λ_P [Λ_P² + ε²I_P]^{-1} U_P^T    (8.142)

since U_P^T U_0 = 0.

Note the similarity between the stochastic inverse in Equation (8.142) and the generalized
inverse

    G_g^{-1} = V_P Λ_P^{-1} U_P^T    (7.8)

The net effect of the stochastic inverse is to suppress the contributions of eigenvectors with
singular values less than ε. To see this, let us write out Λ_P [Λ_P² + ε²I_P]^{-1} explicitly:

    Λ_P [Λ_P² + ε²I_P]^{-1} = diag[ λ_1/(λ_1² + ε²),  λ_2/(λ_2² + ε²),  . . . ,  λ_P/(λ_P² + ε²) ]    (8.143)

If λ_i >> ε, then λ_i/(λ_i² + ε²) ≈ 1/λ_i, the same as the generalized inverse. If λ_i << ε, then
λ_i/(λ_i² + ε²) ≈ λ_i/ε² ≈ 0. The stochastic inverse, then, dampens the contributions of eigenvectors
associated with small singular values.
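
The filter-factor view of Equations (8.142)–(8.143) is sketched below, applied to the example of Equation (7.131), which has one very small singular value.

    import numpy as np

    def damped_inverse(G, eps2):
        # Replace each 1/lambda_i of the generalized inverse by
        # lambda_i / (lambda_i^2 + eps^2)   (Equations 8.142-8.143).
        U, lam, VT = np.linalg.svd(G, full_matrices=False)
        factors = lam / (lam**2 + eps2)
        return VT.T @ np.diag(factors) @ U.T

    G = np.array([[1.00, 1.00],
                  [2.00, 2.01]])
    d = np.array([2.00, 4.10])

    for eps2 in [0.0, 1e-4, 1e-2, 1.0]:
        print(eps2, damped_inverse(G, eps2) @ d)
    # As eps^2 grows, the contribution of the eigenvector paired with the small
    # singular value (0.00316) is suppressed and the solution stabilizes.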

The stochastic inverse in Equation (8.138) looks similar to the minimum length inverse

    G_ML^{-1} = G^T [GG^T]^{-1}    (3.75)

To see why the stochastic inverse is also called damped least squares, consider the following:

    [G^TG + ε²I_M]^{-1} G^T = {V_P [Λ_P² + ε²I_P]^{-1} V_P^T + ε^{-2} V_0 V_0^T}{V_P Λ_P U_P^T}
                            = {V_P [Λ_P² + ε²I_P]^{-1}}{V_P^T V_P Λ_P U_P^T} + ε^{-2} V_0 V_0^T V_P Λ_P U_P^T
                            = V_P Λ_P [Λ_P² + ε²I_P]^{-1} U_P^T        (the second term vanishes because V_0^T V_P = 0)
                            = G^T [GG^T + ε²I_N]^{-1}    (8.144)

Thus

    [G^TG + ε²I_M]^{-1} G^T = G^T [GG^T + ε²I_N]^{-1}    (8.145)

The choice of σ_m² is often arbitrary. Thus, ε² is often chosen arbitrarily to stabilize the problem.
Solutions are obtained for a variety of ε², and a final choice is made based on the a posteriori
model covariance matrix.
The stability gained with damped least squares is not obtained without loss elsewhere.
Specifically, resolution degrades with increased damping. To see this, consider the model
resolution matrix for the stochastic inverse:

    R = G_s^{-1} G = V_P Λ_P² [Λ_P² + ε²I_P]^{-1} V_P^T    (8.146)

It is easy to see that the stochastic inverse model resolution matrix reduces to the generalized
inverse case when ε² goes to 0, as expected.

The reduction in model resolution can be seen by considering the trace of R:

    trace(R) = Σ_{i=1}^{P} λ_i² / (λ_i² + ε²)  ≤  P    (8.147)

Similarly, the data resolution matrix N is given by

    N = G G_s^{-1} = U_P Λ_P² [Λ_P² + ε²I_P]^{-1} U_P^T    (8.148)

    trace(N) = Σ_{i=1}^{P} λ_i² / (λ_i² + ε²)  ≤  P    (8.149)

Finally, consider the unit model covariance matrix [cov_u m], given by

    [cov_u m] = G_s^{-1} [G_s^{-1}]^T = V_P Λ_P² [Λ_P² + ε²I_P]^{-2} V_P^T    (8.150)

which reduces to the generalized inverse case when ε² = 0. The introduction of ε² reduces the
size of the covariance terms, a reflection of the stability added by including a damping term.

An alternative approach to damped least squares is achieved by adding equations of the
form

    ε m_i = 0        i = 1, 2, . . . , M    (8.151)

to the original set of equations

    Gm = d    (1.13)

The combined set of equations can be written in partitioned form as

    [   G   ]       [ d ]
    [ ε I_M ] m  =  [ 0 ]    (8.152)
    (N + M) × M     (N + M) × 1

The least squares solution to Equation (8.152) is given by

    m = { [ G^T  εI_M ] [   G   ] }^{-1} [ G^T  εI_M ] [ d ]
                        [ ε I_M ]                      [ 0 ]

      = [G^TG + ε²I_M]^{-1} G^T d    (8.153)

The addition of ε²I_M ensures a least squares solution because G^TG + ε²I_M will have no
eigenvalues less than ε², and hence is invertible.
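
The augmented-system form of Equations (8.151)–(8.153) is sketched below and checked against the normal-equation form; the small example matrix is the one from Equation (7.131).

    import numpy as np

    def damped_lsq(G, d, eps):
        # Append eps * I rows with zero data and solve by ordinary least squares
        # (Equation 8.152); equivalent to Equation (8.153).
        N, M = G.shape
        G_aug = np.vstack([G, eps * np.eye(M)])
        d_aug = np.concatenate([d, np.zeros(M)])
        m, *_ = np.linalg.lstsq(G_aug, d_aug, rcond=None)
        return m

    G = np.array([[1.00, 1.00],
                  [2.00, 2.01]])
    d = np.array([2.00, 4.10])
    eps = 0.1

    m1 = damped_lsq(G, d, eps)
    m2 = np.linalg.solve(G.T @ G + eps**2 * np.eye(2), G.T @ d)
    print(np.allclose(m1, m2))        # True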

In signal processing, the addition of ε² is equivalent to adding white noise to the signal.
Consider transforming

    [G^TG + ε²I_M] m = G^T d    (8.154)

into the frequency domain as

    [F_i*(ω) F_i(ω) + ε²] M(ω) = F_i*(ω) F_o(ω)    (8.155)

where F_i(ω) is the Fourier transform of the input waveform to some filter, * represents complex
conjugation, F_o(ω) is the Fourier transform of the output waveform from the filter, M(ω) is the
Fourier transform of the impulse response of the filter, and ε² is a constant for all frequencies ω.
Solving for m as the inverse Fourier transform of Equation (8.155) gives

    m = F.T.^{-1} [ F_i*(ω) F_o(ω) / ( F_i*(ω) F_i(ω) + ε² ) ]    (8.156)

The addition of ε² in the denominator assures that the solution is not dominated by small values
of F_i(ω), which can arise when the signal-to-noise ratio is poor. Because the ε² term is added
equally at all frequencies, this is equivalent to adding white light to the signal.

Damping is particularly useful in nonlinear problems. In nonlinear problems, small
singular values can produce very large changes, or steps, during the iterative process. These
large steps can easily violate the assumption of linearity in the region where the nonlinear
problem was linearized. In order to limit step sizes, an ε² term can be added. Typically, one uses
a fairly large value of ε² during the initial phase of the iterative procedure, gradually letting ε² go
to zero as the solution is approached.
Recall that the generalized inverse minimized [d − Gm]^T[d − Gm] and m^Tm
individually. Consider a new function E to minimize, defined by

    E = [d − Gm]^T[d − Gm] + ε² m^Tm
      = m^TG^TGm − m^TG^Td − d^TGm + d^Td + ε² m^Tm    (8.157)

Differentiating E with respect to m^T and setting it equal to zero yields

    ∂E/∂m^T = G^TGm − G^Td + ε²m = 0    (8.158)

or

    [G^TG + ε²I_M] m = G^T d    (8.159)

This shows why damped least squares minimizes a weighted sum of the misfit and the length of
the model parameter vector.


8.4 Ridge Regression


8.4.1 Mathematical Background

Recall the least squares operator

    [G^TG]^{-1} G^T    (8.160)

If the data covariance matrix [cov d] is given by

    [cov d] = σ² I    (8.161)

then the a posteriori model covariance matrix [cov m], also called the dispersion of m, is given
by

    [cov m] = σ² [G^TG]^{-1}    (8.162)

In terms of singular-value decomposition, it is given by

    [cov m] = σ² V_P Λ_P^{-2} V_P^T    (8.163)

This can also be written as

    [cov m] = σ² [ V_P  V_0 ] [ Λ_P^{-2}  0 ] [ V_P^T ]
                              [    0      0 ] [ V_0^T ]    (8.164)

The total variance is defined as the trace of the model covariance matrix, given by

    trace [cov m] = σ² trace{[G^TG]^{-1}} = σ² Σ_{i=1}^{P} 1/λ_i²    (8.165)

which follows from the fact that the trace of a matrix is invariant under an orthogonal coordinate
transformation.

It is clear from Equation (8.165) that the total variance will get large as λ_i gets small. We
saw that the stochastic inverse operator

    G_s^{-1} = [G^TG + ε²I_M]^{-1} G^T = G^T [GG^T + ε²I_N]^{-1}    (8.145)

resulted in a reduction of the model covariance (8.150). In fact, the addition of ε² to each
diagonal entry of G^TG results in a total variance given by

    trace [cov m] = σ² trace{ G_s^{-1}[G_s^{-1}]^T } = σ² Σ_{i=1}^{P} λ_i² / (λ_i² + ε²)²    (8.166)

Clearly, Equation (8.166) is less than (8.165) for all ε² > 0.


8.4.2 The Ridge Regression Operator

The stochastic inverse operator of Equation (8.145) is also called ridge regression for
reasons that I will explain shortly. The ridge regression operator is derived as follows. We seek
an operator that finds a solution m_RR that is closest to the origin (as in the minimum length case),
subject to the constraint that the solution lie on an ellipsoid defined by

    [m_RR − m_LS]^T G^TG [m_RR − m_LS] = δ_0    (8.167)
    (1 × M)(M × M)(M × 1) = (1 × 1)

where m_LS is the least squares solution (i.e., obtained by setting ε² equal to 0). Equation (8.167)
represents a single-equation quadratic in m_RR.

The ridge regression operator G_RR^{-1} is obtained using Lagrange multipliers. We form the
function

    φ(m_RR) = m_RR^T m_RR + η{ [m_RR − m_LS]^T G^TG [m_RR − m_LS] − δ_0 }    (8.168)

where η is the Lagrange multiplier, and differentiate with respect to m_RR^T to obtain

    m_RR + η G^TG [m_RR − m_LS] = 0    (8.169)
Solving Equation (8.169) for m_RR gives

    [G^TG + (1/η)I_M] m_RR = G^TG m_LS

or

    m_RR = [G^TG + (1/η)I_M]^{-1} G^TG m_LS    (8.170)

The least squares solution m_LS is given by

    m_LS = [G^TG]^{-1} G^T d    (3.31)

Substituting m_LS from Equation (3.31) into (8.170),

    m_RR = [G^TG + (1/η)I_M]^{-1} G^TG [G^TG]^{-1} G^T d
         = [G^TG + (1/η)I_M]^{-1} G^T d    (8.171)

If we let 1/η = ε², then Equation (8.171) becomes

    m_RR = [G^TG + ε²I_M]^{-1} G^T d    (8.172)

and the ridge regression operator G_RR^{-1} is defined as

    G_RR^{-1} = [G^TG + ε²I_M]^{-1} G^T    (8.173)

In terms of singular-value decomposition, the ridge regression operator G_RR^{-1} is identical
to the stochastic inverse operator, and following Equation (8.142),

    G_RR^{-1} = V_P Λ_P [Λ_P² + ε²I_P]^{-1} U_P^T    (8.174)

In practice, we determine ε² (and thus η) by trial and error, with the attendant trade-off between
resolution and stability. As defined, however, every choice of ε² is associated with a particular
δ_0 and hence a particular ellipsoid from Equation (8.167). Changing δ_0 does not change the
orientation of the ellipsoid; it simply stretches or contracts the major and minor axes. We can
think of the family of ellipsoids defined by varying ε² (or δ_0) as a ridge in solution space, with
each particular ε² (or δ_0) being a contour of the ridge. We then obtain the ridge regression
solution by following one of the contours around the ellipsoid until we find the point closest to
the origin, hence the name ridge regression.


8.4.3 An Example of Ridge Regression Analysis

A simple example will help clarify the ridge regression operator. Consider the following
Gm = d problem:

    [ 2  0 ] [ m_1 ]   [ 8 ]
    [ 0  1 ] [ m_2 ] = [ 4 ]    (8.175)

Singular-value decomposition gives

    U_P = U = I_2    (8.176)

    V_P = V = I_2    (8.177)

    Λ_P = Λ = [ 2  0 ]
              [ 0  1 ]    (8.178)

The generalized inverse G_g^{-1} is given by

    G_g^{-1} = V_P Λ_P^{-1} U_P^T = I_2 [ 1/2  0 ] I_2 = [ 1/2  0 ]
                                        [  0   1 ]       [  0   1 ]    (8.179)

The generalized inverse solution (also the exact, or least squares, solution) is

    m_LS = G_g^{-1} d = [ 4 ]
                        [ 4 ]    (8.180)


The ridge regression solution is given by

    m_RR = G_RR^{-1} d = V_P Λ_P [Λ_P² + ε²I_P]^{-1} U_P^T d

         = [ 2/(4 + ε²)      0      ] [ 8 ]   [ 16/(4 + ε²) ]
           [     0       1/(1 + ε²) ] [ 4 ] = [  4/(1 + ε²) ]    (8.181)

Note that for ε² = 0, the least squares solution is recovered. Also, as ε² → ∞, the solution goes
to the origin. Thus, as expected, the solution varies from the least squares solution to the origin
as more and more weight is given to minimizing the length of the solution vector.
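
The behavior of Equation (8.181) is easy to tabulate numerically:

    import numpy as np

    # Ridge regression solutions for the example of Equation (8.175),
    # compared with the closed-form entries 16/(4+eps^2) and 4/(1+eps^2).
    G = np.array([[2.0, 0.0],
                  [0.0, 1.0]])
    d = np.array([8.0, 4.0])

    for eps2 in [0.0, 1.0, 10.0, 100.0]:
        m_rr = np.linalg.solve(G.T @ G + eps2 * np.eye(2), G.T @ d)   # Equation (8.172)
        print(eps2, m_rr, [16.0 / (4.0 + eps2), 4.0 / (1.0 + eps2)])
    # eps^2 = 0 recovers the least squares solution [4, 4]; as eps^2 grows the
    # solution is damped toward the origin, and for eps^2 = 1 it equals [3.2, 2.0].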

We can now determine the ellipsoid associated with a particular value of ε². For
example, let ε² = 1. Then the ridge regression solution, from Equation (8.181), is

    [ m_1 ]       [ 16/(4 + 1) ]   [ 3.2 ]
    [ m_2 ]_RR  = [  4/(1 + 1) ] = [ 2.0 ]    (8.182)

Now, returning to the constraint Equation (8.167), we have that

    [ m_1 − 4.0 ]^T [ 4  0 ] [ m_1 − 4.0 ]
    [ m_2 − 4.0 ]   [ 0  1 ] [ m_2 − 4.0 ] = δ_0

or

    4(m_1 − 4.0)² + (m_2 − 4.0)² = δ_0    (8.183)

To find δ_0, we substitute the solution from Equation (8.182) into (8.167) and obtain

    4(3.2 − 4.0)² + (2.0 − 4.0)² = δ_0    (8.184)

or

    δ_0 = 6.56    (8.185)

Substituting δ_0 from Equation (8.185) back into (8.183) and rearranging gives

    (m_1 − 4.0)²/1.64 + (m_2 − 4.0)²/6.56 = 1.0    (8.186)
Equation (8.186) is of the form

    (x − h)²/b² + (y − k)²/a² = 1.0    (8.187)

which represents an ellipse centered at (h, k), with semimajor and semiminor axes a and b
parallel to the y and x axes, respectively. Thus, for the current example, the lengths of the
semimajor and semiminor axes are √6.56 = 2.56 and √1.64 = 1.28, respectively. The axes of the
ellipse are parallel to the m_2 and m_1 axes, and the ellipse is centered at (4, 4). Different choices
for ε² will produce a family of ellipses centered on (4, 4), with semimajor and semiminor axes
parallel to the m_2 and m_1 axes, respectively, and with the semimajor axis always twice the length
of the semiminor axis.

The shape and orientation of the family of ellipses follow completely from the structure
of the original G matrix. The axes of the ellipse coincide with the m_1 and m_2 axes because the
original G matrix was diagonal. If the original G matrix had not been diagonal, the axes of the
ellipse would have been inclined to the m_1 and m_2 axes. The center of the ellipse, given by the
least squares solution, is, of course, both a function of G and the data vector d.

The graph below illustrates this particular problem for ε² = 1.

[Figure: the ellipse of Equation (8.186) in the (m_1, m_2) plane, centered on the least squares solution (4, 4), with semiminor axis √1.64 = 1.28 along m_1 and semimajor axis √6.56 = 2.56 along m_2. The ridge regression solution (3.2, 2.0) is the point on the ellipse closest to the origin.]

It is also instructive to plot the length squared of the solution, m^Tm, as a function of ε²:
[Figure: m^Tm versus ε², decreasing from the least squares value at ε² = 0 toward zero as ε² increases.]

This figure shows that adding ε² damps the solution from least squares toward zero length as ε²
increases.

Next consider a plot of the total variance from Equation (8.166) as a function of ε² for
data variance σ² = 1.

[Figure: total variance, trace [cov m], versus ε² over the range 0 to 10, decreasing from 1.25 at ε² = 0 as ε² increases.]

The total variance decreases, as expected, as more damping is included.

Finally, consider the model resolution matrix R given by

    R = G_RR^{-1} G = V_P Λ_P² [Λ_P² + ε²I_P]^{-1} V_P^T    (8.188)

We can plot trace(R) as a function of ε² and get

[Figure: trace(R) versus ε² over the range 0 to 10, decreasing from 2.0 at ε² = 0 as ε² increases.]

For ε² = 0, we have perfect model resolution, with trace(R) = P = 2 = M = N. As ε² increases,
the model resolution decreases. Comparing the plots of total variance and the trace of the model
resolution matrix, we see that as ε² increases, stability improves (total variance decreases) while
resolution degrades. This is an inevitable trade-off.

In this particular simple example, it is hard to choose the most appropriate value for ε²
because, in fact, the sizes of the two singular values differ very little. In general, when the
singular values differ greatly, the plots for total variance and trace(R) can help us choose ε². If
the total variance initially diminishes rapidly and then very slowly for increasing ε², choosing ε²
near the bend in the total variance curve is most appropriate.

We have shown in this section how the ridge regression operator is formed and how it is
equivalent to damped least squares and the stochastic inverse operator.
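
The trade-off curves just described can be tabulated directly from the singular values; the sketch below does so for the example of Equation (8.175) with data variance σ² = 1, using Equations (8.147) and (8.166).

    import numpy as np

    lam = np.array([2.0, 1.0])        # singular values of G in Equation (8.175)
    sigma2 = 1.0

    for eps2 in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
        trace_R = np.sum(lam**2 / (lam**2 + eps2))                  # Equation (8.147)
        total_var = sigma2 * np.sum(lam**2 / (lam**2 + eps2)**2)    # Equation (8.166)
        print(f"eps^2={eps2:5.1f}  trace(R)={trace_R:.3f}  total variance={total_var:.3f}")
    # trace(R) falls from 2 (perfect resolution) while the total variance falls
    # from 1.25, illustrating the stability-resolution trade-off plotted above.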


8.5 Maximum Likelihood


8.5.1 Background

The maximum likelihood approach is fundamentally probabilistic in nature. A
probability density function (PDF) is created in data space that assigns a probability P(d) to
every point in data space. This PDF is a function of the model parameters, and hence P(d) may
change with each choice of m. The underlying principle of the maximum likelihood approach is
to find a solution m_MX such that P(d) is maximized at the observed data d_obs. Put another way, a
solution m_MX is sought such that the probability of observing the observed data is maximized.
At first thought, this may not seem very satisfying. After all, in some sense there is a 100%
chance that the observed data are observed, simply because they are the observed data. The point
is, however, that P(d) is a calculated quantity, which varies over data space as a function of m.
Put this way, does it make sense to choose m such that P(d_obs) is small, meaning that the
observed data are an unlikely outcome of some experiment? This is clearly not ideal. Rather, it
makes more sense to choose m such that P(d_obs) is as large as possible, meaning that you have
found an m for which the observed data, which exist with 100% certainty, are as likely an
outcome as possible.

Imagine a very simple example with a single observation where P(d) is Gaussian with
fixed mean <d> and a variance σ² that is a function of some model parameter m. For the
moment we need not worry about how m affects σ², other than to realize that as m changes, so
does σ². Consider the diagram below, where the vertical axis is probability, and the horizontal
axis is d. Shown on the diagram are d_obs, the observed datum; <d>, the mean value for the
Gaussian P(d); and two different P(d) curves based on two different variance estimates σ_1² and
σ_2², respectively.

[Figure: two Gaussian P(d) curves centered on <d>, one narrow (variance σ_1²) and one broad (variance σ_2²), with the observed datum d_obs lying in the tail of the narrow curve.]

The area under both P(d) curves is equal to one, since this represents integrating P(d)
over all possible data values. The curve for σ_1², where σ_1² is small, is sharply peaked at <d>, but
is very small at d_obs. In fact, d_obs appears to be several standard deviations from <d>,
indicating that d_obs is a very unlikely outcome. P(d) for σ_2², on the other hand, is not as sharply
peaked at <d>, but because the variance is larger, P(d) is larger at the observed datum, d_obs. You
could imagine letting σ² get very large, in which case values far from <d> would have P(d)
larger than zero, but no value of P(d) would be very large. In fact, you could imagine P(d_obs)
becoming smaller than the case for σ_2². Thus, the object would be to vary m, and hence σ², such
that P(d_obs) is maximized. Of course, for this simple example we have not worried about the
mechanics of finding m, but we will later for more realistic cases.

A second example, after one near the beginning of Menke's Chapter 5, is also very
illustrative. Imagine collecting a single datum N times in the presence of Gaussian noise. The
observed data vector d_obs has N entries and hence lies in an N-dimensional data space. You can
think of each observation as a random variable with the same mean <d> and variance σ², both of
which are unknown. The goal is to find <d> and σ². We can cast this problem in our familiar
Gm = d form by associating m with <d> and noting that G = [1, 1, . . . , 1]^T. Consider the
simple case where N = 2, shown below:

[Figure: the (d_1, d_2) plane showing the observed data point d_obs, the U_P direction along the line d_1 = d_2 with the point Q (the projection of d_obs onto U_P), the U_0 direction perpendicular to it, and circular contours of P(d).]

The observed data d_obs are a point in the d_1–d_2 plane. If we do singular-value decomposition on
G, we see immediately that, in general, U_P = (1/√N)[1, 1, . . . , 1]^T, and for our N = 2 case, U_P =
[1/√2, 1/√2]^T and U_0 = [1/√2, −1/√2]^T. We recognize that all predicted data must lie in U_P
space, which is a single vector. Every choice of m = <d> gives a point on the line d_1 = d_2 = ⋯ =
d_N. If we slide <d> up to the point Q on the diagram, we see that all the misfit lies in U_0 space,
and we have obtained the least squares solution for <d>. Also shown on the figure are contours
of P(d) based on σ². If σ² is small, the contours will be close together, and P(d_obs) will be
small. The contours are circular because the variance is the same for each d_i. Our N = 2 case has
thus reduced to the one-dimensional case discussed on the previous page, where some value of
σ² will maximize P(d_obs). Menke (Chapter 5) shows that P(d) for the N-dimensional case with
Gaussian noise is given by

    P(d) = [ 1 / ( (2π)^{N/2} σ^N ) ] exp[ −(1/(2σ²)) Σ_{i=1}^{N} (d_i − <d>)² ]    (8.189)

where the d_i are the observed data and <d> and σ are the unknown model parameters. The solution
for <d> and σ is obtained by maximizing P(d). That is, the partial derivatives of P(d) with respect to <d>
and σ are formed and set to zero. Menke shows that this leads to

    <d>^est = (1/N) Σ_{i=1}^{N} d_i    (8.190)

    σ^est = [ (1/N) Σ_{i=1}^{N} (d_i − <d>)² ]^{1/2}    (8.191)

We see that <d> is found independently of σ, and this shows why the least squares solution
(point Q on the diagram) seems to be found independently of σ. Now, however, Equation
(8.191) indicates that σ^est will vary for different choices of <d>, affecting P(d_obs).
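
A minimal sketch of the estimates in Equations (8.190)–(8.191), using hypothetical repeated measurements of a single datum, is given below.

    import numpy as np

    rng = np.random.default_rng(1)
    d_obs = 5.0 + rng.normal(0.0, 0.5, size=1000)       # assumed true <d> = 5.0, sigma = 0.5

    d_est = d_obs.mean()                                 # Equation (8.190)
    sigma_est = np.sqrt(np.mean((d_obs - d_est)**2))     # Equation (8.191)

    print(d_est, sigma_est)                              # close to 5.0 and 0.5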

The example can be extended to the general data vector d case where the Gaussian noise
(possibly correlated) in d is described by the data covariance matrix [cov d]. Then it is possible
to assume that P(d) has the form

    P(d) ∝ exp{ −(1/2) [d − Gm]^T [cov d]^{-1} [d − Gm] }    (8.192)

We note that the exponential in Equation (8.192) reduces to the exponential in Equation (8.189)
when [cov d] = σ²I, and Gm gives the predicted data, given by <d>. P(d) in Equation (8.192) is
maximized when [d − Gm]^T[cov d]^{-1}[d − Gm] is minimized. This is, of course, exactly what is
minimized in the weighted least squares [Equations (3.89) and (3.90)] and weighted generalized
inverse [Equation (8.13)] approaches. We can make the very important conclusion that
maximum likelihood approaches are equivalent to weighted least squares or weighted
generalized inverse approaches when the noise in the data is Gaussian.

8.5.2 The General Case

We found in the generalized inverse approach that whenever P < M, the solution is
nonunique. The equivalent viewpoint with the maximum likelihood approach is that P(d) does
not have a well-defined peak. In this case, prior information (such as minimum length for the
generalized inverse) must be added. We can think of d_obs and [cov d] as prior information for
the data, which we could summarize as P_A(d). The prior information about the model parameters
could also be summarized as P_A(m) and could take the form of a prior estimate of the solution
<m> and a covariance matrix [cov m]_A. Graphically (after Figure 5.9 in Menke) you can
represent the joint distribution P_A(m, d) = P_A(m) P_A(d) detailing the prior knowledge of data and
model spaces as

[Figure: the model space–data space plane with elliptical contours of P_A(m, d) centered on the point (d_obs, <m>).]

where P_A(m, d) is contoured about (d_obs, <m>), the most likely point in the prior distribution.
The contours are not inclined to the model or data axes because we assume that there is no
correlation between our prior knowledge of d and m. As shown, the figure indicates less
confidence in <m> than in the data. Of course, if the maximum likelihood approach were applied
to P_A(m, d), it would return (d_obs, <m>) because there has not been any attempt to include the
forward problem Gm = d.

Each choice of m leads to a predicted data vector d_pre. In the schematic figure below, the
forward problem Gm = d is thus shown as a line in the model space–data space plane:

[Figure: the same plane with the line Gm = d added; the maximum likelihood estimate m_est is the point on the line where P_A(m, d) is largest, with corresponding predicted data d_pre.]

The maximum likelihood solution m
est
is the point where the P(d) obtains its maximum value
along the Gm = d curve. If you imagine that P(d) is very elongated along the model-space axis,
this is equivalent to saying that the data are known much better than the prior model parameter
estimate <m>. In this case d
pre
will be very close to the observed data d
obs
, but the estimated
solution m
est
may be very far from <m>. Conversely, if P(d) is elongated along the data axis,
then the data uncertainties are relatively large compared to the confidence in <m>, and m
est
will
be close to <m>, while d
pre
may be quite different from d
obs
.

Menke also points out that there may be uncertainties in the theoretical forward
relationship Gm = d. These may be expressed in terms of an N × N inexact-theory covariance
matrix [cov g]. This covariance matrix deserves some comment. As in any covariance matrix of
a single term (e.g., d, m, or G), the diagonal entries are variances, and the off-diagonal terms are
covariances. What does the (1, 1) entry of [cov g] refer to, however? It turns out to be the
variance of the first equation (row) in G. Similarly, each diagonal term in [cov g] refers to an
uncertainty of a particular equation (row) in G, and off-diagonal terms are covariances between
rows in G. Each row in G times m gives a predicted datum. For example, the first row of G
times m gives d_1^pre. Thus a large variance for the (1, 1) term in [cov g] would imply that we do
not have much confidence in the theory's ability to predict the first observation. It is easy to see
that this is equivalent to saying that not much weight should be given to the first observation.
We will see, then, that [cov g] plays a role similar to [cov d].

We are now in a position to give the maximum likelihood operator G_MX^{−1} in terms of G and
the data ([cov d]), model parameter ([cov m]), and theory ([cov g]) covariance matrices as

	G_MX^{−1} = [cov m] G^T {[cov d] + [cov g] + G [cov m] G^T}^{−1}		(8.193)

	        = [G^T {[cov d] + [cov g]}^{−1} G + [cov m]^{−1}]^{−1} G^T {[cov d] + [cov g]}^{−1}		(8.194)
where Equations (8.193) and (8.194) are equivalent. There are several points to make. First, as
mentioned previously, [cov d] and [cov g] appear everywhere as a pair. Thus, the two
covariance matrices play equivalent roles. Second, if we ignore all of the covariance
information, we see that Equation (8.193) looks like G^T[GG^T]^{−1}, which is the minimum length
operator. Third, if we again ignore all covariance information, Equation (8.194) looks like
[G^TG]^{−1}G^T, which is the least squares operator. Thus, we see that the maximum likelihood
operator can be viewed as some kind of a combined weighted least squares and weighted
minimum length operator.
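
The following small numerical sketch (the 2 × 3 G and the scalar covariances are invented for illustration) checks that the two forms (8.193) and (8.194) agree, and that with negligible data and theory error the operator collapses to the minimum length operator G^T[GG^T]^{−1}.

```python
import numpy as np

G = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])       # N = 2 equations, M = 3 unknowns (P < M)
covd = 0.01 * np.eye(2)               # data covariance [cov d]
covg = 0.02 * np.eye(2)               # inexact-theory covariance [cov g]
covm = 4.00 * np.eye(3)               # prior model covariance [cov m]
B = covd + covg                       # [cov d] and [cov g] always appear as a pair

# Equation (8.193): [cov m] G^T {B + G [cov m] G^T}^-1
G1 = covm @ G.T @ np.linalg.inv(B + G @ covm @ G.T)

# Equation (8.194): [G^T B^-1 G + [cov m]^-1]^-1 G^T B^-1
Binv = np.linalg.inv(B)
G2 = np.linalg.inv(G.T @ Binv @ G + np.linalg.inv(covm)) @ G.T @ Binv

print("forms (8.193) and (8.194) agree:", np.allclose(G1, G2))

# Limiting case: negligible data/theory error -> minimum length operator G^T [G G^T]^-1
Gml = G.T @ np.linalg.inv(G @ G.T)
Glimit = covm @ G.T @ np.linalg.inv(1e-12 * np.eye(2) + G @ covm @ G.T)
print("max difference from minimum length operator:", np.abs(Glimit - Gml).max())
```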

The maximum likelihood solution m_MX is given by

	m_MX = <m> + G_MX^{−1} [d − G<m>]

	     = <m> + G_MX^{−1} d − G_MX^{−1} G <m>

	     = G_MX^{−1} d + [I − R] <m>		(8.195)

where R is the model resolution matrix. Equation (8.195) explicitly shows the dependence of
m_MX on the prior estimate of the solution <m>. If there is perfect model resolution, then R = I,
and m_MX is independent of <m>. If the ith row of R is equal to the ith row of the identity
matrix, then there will be no dependence of the ith entry in m_MX on the ith entry in <m>.

Menke points out that there are several interesting limiting cases for the maximum
likelihood operator. We begin by assuming some simple forms for the covariance matrices:

	[cov g] = σ_g^2 I_N		(8.196)

	[cov m] = σ_m^2 I_M		(8.197)

	[cov d] = σ_d^2 I_N		(8.198)


In the first case we assume that the data and theory are much better known than <m>. In the
limiting case we can assume σ_d^2 = σ_g^2 = 0. If we do, then Equation (8.193) reduces to
G_MX^{−1} = G^T[GG^T]^{−1}, the minimum length operator. If we assume that [cov m] still has some structure,
then Equation (8.193) reduces to

	G_MX^{−1} = [cov m] G^T {G [cov m] G^T}^{−1}		(8.150)

the weighted minimum length operator. If we assume only that σ_d^2 and σ_g^2 are much less than
σ_m^2 and that 1/σ_m^2 goes to 0, then Equation (8.194) reduces to G_MX^{−1} = [G^TG]^{−1}G^T, or the least
squares operator. It is important to realize that [G^TG]^{−1} only exists when P = M, and [GG^T]^{−1}
only exists when P = N. Thus, either form, or both, may fail to exist, depending on P. The
simplifying assumptions about σ_d^2, σ_g^2, and σ_m^2 can thus break down the equivalence between
Equations (8.193) and (8.194).

A second limiting case involves assuming no confidence in either (or both) the data or
theory. That is, we let σ_d^2 and/or σ_g^2 go to infinity. Then we see that G_MX^{−1} goes to 0 and
m_MX = <m>. This makes sense if we realize that we have assumed the data (and/or the theory) are
useless, and hence we do not have a useful forward problem to move us away from our prior
estimate <m>.

We have assumed in deriving Equations (8.193) and (8.194) that all of the covariance
matrices represent Gaussian processes. In this case, we have shown that maximum likelihood
approaches will yield the same solution as weighted least squares (P = M), weighted minimum
length (P = N), or weighted generalized inverse approaches. If the probability density functions
are not Gaussian, then maximum likelihood approaches can lead to different solutions. If the
distributions are Gaussian, however, then all of the modifications introduced in Section 8.2 for
the generalized inverse can be thought of as the maximum likelihood approach.




CHAPTER 9: CONTINUOUS INVERSE PROBLEMS
AND OTHER APPROACHES


9.1 Introduction


Until this point, we have only considered discrete inverse problems, either linear or
nonlinear, that can be expressed in the form

d = Gm (1.13)

We now turn our attention to another branch of inverse problems, called continuous inverse
problems, in which at least the model parameter vector m is replaced by a continuous function,
and the matrix G is replaced by an integral relationship. The general case of a linear continuous
inverse problem involving continuous data and a continuous model function is given by a
Fredholm equation of the first kind

	g(y) = ∫ k(y, x) m(x) dx		(continuous data)		(9.1)

where the data g are a continuous function of some variable y, the model m is a continuous
function of some other variable x, and k, called the data kernel or Green's function, is a function
of both x and y.

In most situations, data are not continuous, but rather are a finite sample. For example,
analog (continuous) seismic data are digitized to become a finite, discrete data set. In the case
where there are N data points, Equation (9.1) becomes

	g_j = ∫ k_j(x) m(x) dx		(discrete data, j = 1, N)		(9.2)

One immediate implication of a finite data set of dimension N with a continuous (infinite
dimensional) model is that the solution is underconstrained, and if there is any solution m(x) that
fits the data, there will be an infinite number of solutions that will fit the data as well. This basic
problem of nonuniqueness was encountered before for the discrete problem in the minimum
length environment (Chapter 3).
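
A small numerical sketch of Equation (9.2) (the Gaussian kernels, the "true" model, and the simple quadrature below are invented for illustration) makes this concrete: only N numbers g_j constrain the continuous model m(x).

```python
import numpy as np

x = np.linspace(0.0, 4.0, 401)            # discretize the model interval 0 <= x <= 4
dx = x[1] - x[0]

def kernel(c, x):
    return np.exp(-(x - c) ** 2)          # hypothetical Gaussian data kernels k_j(x)

m_true = np.sin(np.pi * x / 4.0)          # a made-up continuous model m(x)

N = 3                                     # a finite number of data
K = np.vstack([kernel(c, x) for c in (1.0, 2.0, 3.0)])   # sampled kernels, one row per k_j

g = K @ m_true * dx                       # g_j = integral k_j(x) m(x) dx (quadrature sum)
print("discrete data g_j:", np.round(g, 4))
# Only these N numbers constrain the (effectively infinite-dimensional) model m(x),
# so any m(x) that reproduces them is as acceptable as any other: the problem is nonunique.
```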

Given that all continuous inverse problems with discrete data are nonunique, and almost
all real problems have discrete data, the goals of a continuous inverse analysis can be somewhat
different than a typical discrete analysis. Three possible goals of a continuous analysis include:
(1) find a model m(x) that fits the data g_j, also known as construction; (2) find unique properties
or values of all possible solutions that fit the data by taking linear combinations of the data, also
known as appraisal; and (3) find the values of other linear combinations of the model using the
data g_j, also known as inference (Oldenburg, 1984). There are many parallels, and some
fundamental differences, between discrete and continuous inverse theory. For example, we have
encountered the construction phase before for discrete problems whenever we used some
operator to find a solution that best fits the data. The appraisal phase is most similar to an
analysis of resolution and stability for discrete problems. We have not encountered the
inference phase before. Emphasis in this chapter on continuous inverse problems will be on the
construction and appraisal phases, and the references, especially Oldenburg [1984], can be used
for further information on the inference phase.

The material in this chapter is based primarily on the following references:

Backus, G. E. and J. F. Gilbert, Numerical application of a formalism for geophysical inverse
	problems, Geophys. J. Roy. Astron. Soc., 13, 247–276, 1967.

Huestis, S. P., An introduction to linear geophysical inverse theory, unpublished manuscript,
	1992.

Jeffrey, W., and R. Rosner, On strategies for inverting remote sensing data, Astrophys. J., 310,
	463–472, 1986.

Oldenburg, D. W., An introduction to linear inverse theory, IEEE Trans. Geosci. Remote Sensing,
	GE-22, No. 6, 665–674, 1984.

Parker, R. L., Understanding inverse theory, Ann. Rev. Earth Planet. Sci., 5, 35–64, 1977.

Parker, R. L., Geophysical Inverse Theory, Princeton University Press, 1994.


9.2 The Backus–Gilbert Approach


There are a number of approaches to solving Equations (9.1) or (9.2). This chapter will
deal exclusively with one approach, called the Backus–Gilbert approach, which was developed
by geophysicists in the 1960s. This approach is based on taking a linear combination of the data
g_j, given by

	l = Σ_{j=1}^{N} α_j g_j = Σ_{j=1}^{N} α_j ∫ k_j(x) m(x) dx = ∫ [ Σ_{j=1}^{N} α_j k_j(x) ] m(x) dx		(9.3)

where the α_j are as yet undefined constants. The essence of the Backus–Gilbert approach is
deciding how the α_j's are chosen.
If the expression in square brackets, [Σ_j α_j k_j(x)], has certain special properties, it is
possible in theory to construct a solution from the linear combination of the data. Specifically, if

	Σ_{j=1}^{N} α_j k_j(x) ≅ δ(x − x_0)		(9.4)

where δ(x − x_0) is the Dirac delta function at x_0, then

	l(x_0) = Σ_{j=1}^{N} α_j g_j ≅ ∫ δ(x − x_0) m(x) dx = m(x_0)		(9.5)

That is, choosing the α_j such that [Σ_j α_j k_j(x)] is as much like a delta function as possible at x_0, we
can obtain an estimate of the solution at x_0. The expression [Σ_j α_j k_j(x)] plays such a fundamental
role in Backus–Gilbert theory that we let

	Σ_{j=1}^{N} α_j k_j(x) = A(x, x_0)		(9.6)

where A(x, x_0) has many names in the literature, including averaging kernel, optimal averaging
kernel, scanning function, and resolving kernel. The averaging kernel A(x, x_0) may take many
different forms, but in general is a function which at least peaks near x_0, as shown below.

[Figure: an averaging kernel A(x, x_0) plotted against x, peaking at x = x_0.]

Recall from Equation (9.6) that the averaging kernel A(x, x_0) is formed as a linear
combination of a finite set of data kernels k_j. An example of a set of three data kernels k_j over the
interval 0 < x < 4 is shown below.

[Figure: three data kernels k_1, k_2, k_3 plotted over the interval 0 < x < 4.]


One of the fundamental problems encountered in the Backus–Gilbert approach is that the finite
set of data kernels k_j, j = 1, N is an incomplete set of basis functions from which to construct the
Dirac delta function. You may recall from Fourier analysis, for example, that a spike (Dirac
delta) function in the spatial domain has a white spectrum in the frequency domain, which
implies that it takes an infinite sum of sine and cosine terms (basis functions) to construct the
spike function. It is thus impossible for the averaging kernel A(x, x_0) to exactly equal the Dirac
delta with a finite set of data kernels k_j.

Much of the Backus–Gilbert approach thus comes down to deciding how best to make
A(x, x_0) approach a delta function. Backus and Gilbert defined three measures of the "deltaness"
of A(x, x_0) as follows:

	J = ∫ [A(x, x_0) − δ(x − x_0)]^2 dx		(9.7)

	K = 12 ∫ (x − x_0)^2 [A(x, x_0) − δ(x − x_0)]^2 dx		(9.8)

	W = ∫ [ ∫_{−∞}^{x} A(x′, x_0) dx′ − H(x − x_0) ]^2 dx		(9.9)

where H(x − x_0) is the Heaviside, or unit step, function at x_0.

The smaller J, K, or W is, the more the averaging kernel approaches the delta function in
some sense. Consider first the K criterion:


	K = 12 ∫ (x − x_0)^2 { A^2(x, x_0) − 2A(x, x_0) δ(x − x_0) + δ^2(x − x_0) } dx		(9.10)

The second and third terms drop out because δ(x − x_0) is nonzero only when x = x_0, and then the
(x − x_0)^2 term is zero. Thus

	K = 12 ∫ (x − x_0)^2 A^2(x, x_0) dx = 12 ∫ (x − x_0)^2 [ Σ_{j=1}^{N} α_j k_j(x) ]^2 dx		(9.11)


We minimize K by taking the partials with respect to the α_j's and setting them to zero:

	∂K/∂α_i = 12 ∫ (x − x_0)^2 2 [ Σ_{j=1}^{N} α_j k_j(x) ] k_i(x) dx
	        = 24 Σ_{j=1}^{N} α_j ∫ (x − x_0)^2 k_j(x) k_i(x) dx = 0		(9.12)


Writing out the sum over j explicitly for the ith partial derivative gives

	∂K/∂α_i :   α_1 ∫ (x − x_0)^2 k_i(x) k_1(x) dx + α_2 ∫ (x − x_0)^2 k_i(x) k_2(x) dx + ⋯
	            + α_N ∫ (x − x_0)^2 k_i(x) k_N(x) dx = 0		(9.13)


Combining the N partial derivatives and using matrix notation, this becomes

	[ ∫(x − x_0)^2 k_1 k_1 dx   ∫(x − x_0)^2 k_1 k_2 dx   ⋯   ∫(x − x_0)^2 k_1 k_N dx ] [ α_1 ]     [ 0 ]
	[ ∫(x − x_0)^2 k_2 k_1 dx   ∫(x − x_0)^2 k_2 k_2 dx   ⋯   ∫(x − x_0)^2 k_2 k_N dx ] [ α_2 ]  =  [ 0 ]
	[          ⋮                         ⋮                ⋱            ⋮             ] [  ⋮  ]     [ ⋮ ]
	[ ∫(x − x_0)^2 k_N k_1 dx   ∫(x − x_0)^2 k_N k_2 dx   ⋯   ∫(x − x_0)^2 k_N k_N dx ] [ α_N ]     [ 0 ]
											(9.14)

or

	B α = 0		(9.15)

B α = 0 has the trivial solution α = 0. Therefore, we add the constraint, with Lagrange
multipliers, that

	∫ A(x, x_0) dx = 1		(9.16)

which says the area under the averaging kernel is one, or that A(x, x_0) is a unimodular function.
Adding the constraint to the original K criterion creates a new criterion K′ given by

	K′ = 12 ∫ (x − x_0)^2 [A(x, x_0)]^2 dx + λ [ ∫ A(x, x_0) dx − 1 ]		(9.17)

which leads to the (N + 1) × (N + 1) system

	[ 24∫(x − x_0)^2 k_1 k_1 dx   ⋯   24∫(x − x_0)^2 k_1 k_N dx   ∫k_1 dx ] [ α_1 ]     [ 0 ]
	[            ⋮                ⋱              ⋮                   ⋮    ] [  ⋮  ]     [ ⋮ ]
	[ 24∫(x − x_0)^2 k_N k_1 dx   ⋯   24∫(x − x_0)^2 k_N k_N dx   ∫k_N dx ] [ α_N ]  =  [ 0 ]
	[         ∫k_1 dx             ⋯           ∫k_N dx                0    ] [  λ  ]     [ 1 ]
											(9.18)

or

	C [ α_1 ⋯ α_N  λ ]^T = [ 0 ⋯ 0 1 ]^T		(9.19)

Then the α_j's are found by inverting the square, symmetric matrix C.
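
A minimal numerical sketch of this construction (the three Gaussian data kernels, the grid, and the target point x_0 below are invented for illustration): it builds the bordered system of Equation (9.18), solves for the α_j, and checks that the resulting averaging kernel is unimodular.

```python
import numpy as np

x = np.linspace(0.0, 4.0, 801)                  # model interval
dx = x[1] - x[0]
K = np.vstack([np.exp(-(x - c) ** 2) for c in (1.0, 2.0, 3.0)])   # data kernels k_j(x)
N = K.shape[0]
x0 = 1.5                                        # point where we want an estimate m(x0)

# Bordered system of Equation (9.18): [ 24*S  c ; c^T  0 ] [alpha; lambda] = [0, ..., 0, 1]
S = np.array([[np.sum((x - x0) ** 2 * K[i] * K[j]) * dx for j in range(N)]
              for i in range(N)])               # S_ij = integral (x - x0)^2 k_i k_j dx
c = K.sum(axis=1) * dx                          # c_j = integral k_j(x) dx
C = np.zeros((N + 1, N + 1))
C[:N, :N] = 24.0 * S
C[:N, N] = c
C[N, :N] = c
rhs = np.zeros(N + 1)
rhs[N] = 1.0

sol = np.linalg.solve(C, rhs)
alpha = sol[:N]                                 # sol[N] is the Lagrange multiplier
A = alpha @ K                                   # averaging kernel A(x, x0), Equation (9.6)
print("alpha_j =", np.round(alpha, 3))
print("integral of A (should be 1):", round(np.sum(A) * dx, 6))
```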

The factor of 12 in the original definition of K was added to facilitate the geometrical
interpretation of K. Specifically, with the factor of 12 included, K = Δ if A(x, x_0) is a
unimodular (i.e., ∫ A(x, x_0) dx = 1) boxcar of width Δ centered on x_0. In general, when K is
small, it is found (Oldenburg, 1984, p. 669) that K is a good estimate of the width of the
averaging function A(x, x_0) at half its maximum value. Thus K gives essential information
about the resolving power of the data. If K is large, then the estimate of the solution m(x) will
be a smoothed average of the true solution around x_0, and the solution will have poor resolution.
Of course, you can also look directly at A(x, x_0) and get much the same information. If A(x, x_0)
has a broad peak around x_0, then the solution will be poorly resolved in that neighborhood.
Features of the solution m(x) with a scale length less than the width of A(x, x_0) cannot be resolved
(i.e., are nonunique). Thus, in the figure below, the high-frequency variations in m(x) are not
resolved.

[Figure: a model m(x) whose high-frequency variations near x_0 are narrower than the width of A(x, x_0).]


The analysis of the averaging kernel above falls within the appraisal phase of the possible goals
of a continuous inverse problem. Since any solution to Equation (9.2) is nonunique, often
the most important aspect of the inverse analysis is the appraisal phase.

The above discussion does not include possible data errors. Without data errors,
inverting C to get the α_j's gives you the solution. Even if C is nearly singular, then except for
numerical instability, once you have the α_j's, you have a solution. If, however, the data contain
noise, then near-singularity of C can cause large errors in the solution for small fluctuations in
data values.

It may help to think of the analogy between k_j and the jth row of G in the discrete case
Gm = d. Then

	A(x, x_0) = Σ_{j=1}^{N} α_j k_j		(9.20)

is equivalent to taking some linear combination of the rows of G (which are M-dimensional, and
therefore represent vectors in model space) and trying to make a row vector as close as possible
to a row of the identity matrix. If there is a near-linear dependency of rows in G, then the
coefficients in the linear combination will get large. One can also speak of the near
interdependence of the data kernels k_j(x), j = 1, N in Hilbert space. If this is the case, then terms
like ∫ k_i k_j dx can be large, and C will be nearly singular. This will lead to large values for α as
it tries to approximate δ(x − x_0) with a set of basis functions (kernels) k_j that have near
interdependence. Another example arises in the figure after Equation (9.6) with three data
kernels that are all nearly zero at x = 3. It is difficult to approximate a delta function near x = 3
with a linear combination of the data kernels, and the coefficients of that linear combination are
likely to be quite large.

You can quantify the effect of the near singularity of C by considering the data to have
some covariance matrix [cov d]. Then the variance of your estimate of the solution m(x_0) at x_0 is

	σ^2_{m(x_0)} = [ α_1  α_2  ⋯  α_N ] [cov d] [ α_1  α_2  ⋯  α_N ]^T		(9.21)

and if

	[cov d] = [ σ_1^2    0    ⋯    0
	             0     σ_2^2  ⋯    0
	             ⋮       ⋮    ⋱    ⋮
	             0       0    ⋯  σ_N^2 ]		(9.22)

then

	σ^2_{m(x_0)} = Σ_{j=1}^{N} α_j^2 σ_j^2		(9.23)

[See Oldenburg, 1984, Equation (18), p. 670, or Jeffrey and Rosner, Equation (3.5), p. 467.]

If σ^2_{m(x_0)} is large, the solution is unstable. We now have two conflicting goals:

	(1) minimize K
	versus
	(2) minimize σ^2_{m(x_0)}

This leads to trade-off curves of stability (small σ^2_{m(x_0)}) versus resolving power (small K).
These trade-off curves are typically plotted as

[Figure: trade-off curve of log σ^2_{m(x_0)} against scanning width (K′), running from best resolving power (small K′, large variance) to best stability (large K′, small variance), with an optimal point in between reached by throwing away the larger α_j's.]


Typically, one gains a lot of stability without too much loss in resolving power early on by
throwing away the largest (few) α_j's.

Another way to accomplish the same goal is called spectral expansion or spectral
synthesis. With this technique, you do singular-value decomposition on C and start throwing
away the smallest singular values and associated eigenvectors until the error amplification (i.e.,
log σ^2_{m(x_0)}) is sufficiently small. In either case, you give up resolving power to gain stability.
Small singular values, or large α_j, are associated with high-frequency components of the solution
m(x). As you give up resolving power to gain stability, the width of the resolving kernel
increases. Thus, in general, the narrower the resolving kernel, the better the resolution but the
poorer the stability. Similarly, the wider the resolving kernel, the poorer the resolution, but the
better the stability.

The logic behind the trade-off between resolution and stability can be looked at another
way. With the best resolving power you obtain the best fit to the data in the sense of minimizing
the difference between observed and predicted data. However, if the data are known to contain
noise, then fitting the data exactly (or too well) would imply fitting the noise. It does not make
sense to fit the data better than the data uncertainties in this case. We can quantify this relation
for the case in which the data errors are Gaussian and uncorrelated with standard deviation σ_j by
introducing χ^2:

	χ^2 = Σ_{j=1}^{N} (g_j − ĝ_j)^2 / σ_j^2		(9.24)

where ĝ_j is the predicted jth datum, which depends on the choice of α. χ^2 will be in the range
0 ≤ χ^2 < ∞. If χ^2 = 0, then the data are fit perfectly, noise included. If χ^2 ≈ N, then the data
are being fit at about the one standard deviation level. If χ^2 >> N, then the data are fit poorly.
By using the trade-off curve, or the spectral expansion method, you affect the choice of α_j's, and
hence the predicted data ĝ_j and ultimately the value of χ^2. Thus the best solution is obtained by
adjusting the α_j until χ^2 ≈ N.
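
A toy calculation of Equation (9.24) (made-up numbers, for illustration only):

```python
import numpy as np

g_obs = np.array([1.02, 1.95, 3.10, 3.98])     # observed data g_j
g_pre = np.array([1.00, 2.00, 3.00, 4.00])     # predicted data for some choice of alpha
sigma = np.array([0.05, 0.05, 0.10, 0.10])     # standard deviations sigma_j

chi2 = np.sum((g_obs - g_pre) ** 2 / sigma ** 2)
N = len(g_obs)
print(f"chi^2 = {chi2:.2f}, N = {N}")          # aim for chi^2 ~ N, not chi^2 ~ 0
```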

Now reconsider the J criterion:

	J = ∫ [A(x, x_0) − δ(x − x_0)]^2 dx
	  = ∫ [ A^2(x, x_0) − 2A(x, x_0) δ(x − x_0) + δ^2(x − x_0) ] dx		(9.25)

Recall that for the K criterion, the second and third terms vanished because they were multiplied
by (x − x_0)^2, which is zero at the only place the other two terms are nonzero because of the
δ(x − x_0) term. The J criterion is minimized by taking partial derivatives with respect to α_i
and setting them equal to zero. The A^2(x, x_0) terms are similar to the K criterion case, but

	∂/∂α_i [ −2 ∫ A(x, x_0) δ(x − x_0) dx ] = −2 ∂/∂α_i ∫ [ Σ_{j=1}^{N} α_j k_j(x) ] δ(x − x_0) dx
	                                       = −2 ∫ k_i(x) δ(x − x_0) dx = −2 k_i(x_0)		(9.26)


Thus the matrix equations from minimizing J become

	[ ∫k_1 k_1 dx   ∫k_1 k_2 dx   ⋯   ∫k_1 k_N dx ] [ α_1 ]       [ k_1(x_0) ]
	[ ∫k_2 k_1 dx   ∫k_2 k_2 dx   ⋯   ∫k_2 k_N dx ] [ α_2 ]  =  2 [ k_2(x_0) ]
	[      ⋮             ⋮        ⋱        ⋮      ] [  ⋮  ]       [    ⋮     ]
	[ ∫k_N k_1 dx   ∫k_N k_2 dx   ⋯   ∫k_N k_N dx ] [ α_N ]       [ k_N(x_0) ]
											(9.27)

or

	D α = 2k		(9.28)

Notice now that α = 0 is no longer a trivial solution, as it was in the case of the K criterion. Thus, we
do not have to use Lagrange multipliers to add a constraint to ensure a nontrivial solution. Also,
notice that D, the matrix that must be inverted to find α, no longer depends on x_0, and thus need
only be formed once. This benefit is often sacrificed by adding the constraint that A(x, x_0) be
unimodular anyway! Then the α's found no longer necessarily give m(x_0) corresponding to a
solution of g_j = ∫ k_j(x) m(x) dx at all, but this may be acceptable if your primary goal is appraisal
of the properties of solutions, rather than construction of one among many possible solutions.

The J and K criteria lead to different α. The J criterion typically leads to narrower
resolving kernels, but often at the expense of negative side lobes. These side lobes confuse the
interpretation of A(x, x_0) as an averaging kernel. Thus, most often the K criterion is used, even
though it leads to somewhat broader resolving kernels.


9.3 Neural Networks


Neural networks, based loosely on models of the brain and dating from research in the
1940s (long before the advent of computers), have become very successful in a wide range of
applications, from pattern recognition (one of the earliest applications) to aerospace (autopilots,
aircraft control systems), defense (weapon steering, signal processing), entertainment (animation
and other special effects), manufacturing (quality control, product design), medical (breast cancer
cell analysis, EEG analysis, optimization of transplant times), oil and gas (exploration),
telecommunications (image and data compression), and speech recognition.

The basic idea behind neural networks is to design a system, based heavily on a parallel
processing architecture, that can learn to solve problems of interest.

We begin our discussion of neural networks by introducing the neuron model, which
predictably has a number of names, including, of course, neurons, but also nodes, units, and
processing elements (PEs):

[Figure: a single neuron with input P, weight w, and output O.]

where the neuron sees some input P coming, which is weighted by w before entering the neuron.
The neuron processes this weighted input Pw and creates an output O. The output of the
neuron can take many forms. In early models, the output was always +1 or −1 because the neural
network was being used for pattern recognition. Today, this neuron model is still used (called
the step function, or Heaviside function, among other things), but other neuron models have the
output equal to the weighted input to the neuron (called the linear model) and, perhaps the most
common of all, the sigmoid model, which looks like

[Figure: the sigmoid function, rising smoothly from 0 to 1 and passing through 0.5 at x = 0.]

often taken to be given by S(x) = 1/(1 + e^{−x}), where x is the weighted input to the neuron. This
model has elements of both the linear and step function but has the advantage over the step
function of being continuously differentiable.

So, how is the neuron model useful? It is useful because, like the brain, the neuron can
be trained and can learn from experience. What it learns is w, the correct weight to give the input
so that the output of the neuron matches some desired value.

As an almost trivial example, let us assume that our neuron in the figure above behaves as
a linear model, with the output equal to the weighted input. We train the neuron to learn the
correct w by giving it examples of inputs and output that are true. For example, consider
examples to be

input = 1 output = 2
10 20
20 40
42 84

By inspection we recognize that w = 2 is the correct solution. However, in general we
start out not knowing what w should be. We thus begin by assuming it to be some number,
perhaps 0, or as is typically done, some random number. Let us assume that w = 0 for our
example. Then for the first test with input = 1, our output would be 0. The network recognizes
that the output does not match the desired output and changes w. There are many ways to
change w based on the mismatch between the output and the desired output (more on this later),
but let us assume that the system will increase w, but not all the way to 2, stopping at 0.5.
Typically, neural networks do not change w to perfectly fit the desired output because making
large changes to w very often results in instability. Now our system, with w = 0.5, inputs 10 and
outputs 5. It again increases w and moves on to the other examples. When it has cycled through
the known input/output pairs once, this is called an epoch. Once it has gone through an epoch, it
goes back to the first example and cycles again until there is an acceptably small misfit between
all of the outputs and desired outputs for all of the known examples. In neural network
programs, the codes typically go through all of the known examples randomly rather than
sequentially, but the idea is the same. In our example, it will eventually settle on w = 2. In
more realistic examples, it takes hundreds to thousands of iterations (one iteration equals an
epoch) to find an acceptable set of weights.
In our nomenclature, we have the system

d = Gm (1.13)

and in this example, G = 2. The beauty of the neural network is that it learned this relationship
without even having to know it formally. It did it simply by adjusting weights until it was able
to correctly match a set of example input/output pairs.
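
The training loop described above can be sketched in a few lines (the learning rate, number of epochs, and the specific update rule — a simple gradient step — are illustrative choices, not a prescribed algorithm):

```python
import numpy as np

inputs  = np.array([1.0, 10.0, 20.0, 42.0])
targets = np.array([2.0, 20.0, 40.0, 84.0])   # desired outputs (the true weight is 2)

w = 0.0                       # arbitrary starting weight
lr = 0.0005                   # small learning rate: partial steps, not full jumps

for epoch in range(100):                      # one pass through all examples = one epoch
    for P, desired in zip(inputs, targets):
        output = w * P                        # linear neuron: output = weighted input
        error = desired - output
        w += lr * error * P                   # nudge w to reduce the mismatch

print("learned w =", round(w, 4))             # converges to ~2
```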

Suppose we change our example input/output pairs to

input = 1 output = 5
10 23
20 43
42 87

where the output has been shifted 3 units from the previous example. You may recognize that
we will be unable to find a single weight w to make the output of our neuron equal the desired
output. This leads to the first additional element in the neuron model, called the bias. Consider
the diagram below

[Figure: a neuron with input P weighted by w, plus a bias input of 1 weighted by b, producing output O.]


where the bias is something added to the weighted input Pw before the neuron processes it to
make output. It is convention that the bias has an input of 1 and another weight (the bias) b that is
to be determined in the learning process. In this example, the neuron will learn that, as before,
w = 2, and now also that the bias b is 3.

The next improvement to our model is to consider a neuron that has multiple inputs:

[Figure: a neuron with inputs P_1, P_2, P_3 weighted by w_1, w_2, w_3, a bias input of 1 weighted by b, and output O.]


We now introduce the nomenclature that the process of the neuron is going to be some function
of the sum of the weighted input vector and the bias b. That is, we describe the neuron by the
function of (P · w + b), where

	P · w = P_1 w_1 + P_2 w_2 + ⋯ + P_N w_N = Σ_{i=1}^{N} P_i w_i		(9.29)
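
A minimal sketch of such a multiple-input neuron (the numbers below are invented; the sigmoid is used as the processing function):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # the sigmoid model S(x)

P = np.array([0.5, -1.2, 2.0])            # inputs P_1, P_2, P_3 (arbitrary example values)
w = np.array([0.8,  0.3, -0.5])           # weights w_1, w_2, w_3 learned during training
b = 0.1                                   # bias weight

net = np.dot(P, w) + b                    # P.w + b, Equation (9.29) plus the bias
O = sigmoid(net)                          # neuron output
print(f"net input = {net:.3f}, output O = {O:.3f}")
```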

This is just a very incomplete introduction to neural networks. Maybe someday we will add
more material.


9.4 The Radon Transform and Tomography (Approach 1)


[Figure: ray geometry in the (x, y) plane — a straight ray at angle θ, with perpendicular distance u and along-ray coordinate s.]


9.4.1 Introduction

Consider a model space m(x, y) through which a straight-line ray is parameterized by the
perpendicular distance u and angle θ. Position (x, y) and ray coordinates (u, s) are related by

	[ x ]   [ cos θ   −sin θ ] [ u ]
	[ y ] = [ sin θ    cos θ ] [ s ]		(9.30)

and

	[ u ]   [  cos θ   sin θ ] [ x ]
	[ s ] = [ −sin θ   cos θ ] [ y ]		(11.18 in Menke)

If m(x, y) is a slowness (1/v) model, and t(u, θ) is the travel time along a ray at distance u
and angle θ, then

	t(u, θ) = ∫ m(x, y) ds		(9.31)

is known as the Radon transform (RT). Another way of stating this is that t(u, θ) is the
"projection" of m(x, y) onto the line defined by u and θ.

The inverse problem is: Given t(u, θ) for many values of u and θ, find the model m(x, y),
i.e., the inverse Radon transform (IRT).

Define a 2D Fourier transform of the model:

	m̃(k_x, k_y) = ∫∫ m(x, y) e^{−2πi(k_x x + k_y y)} dx dy		(9.32)

	m(x, y) = ∫∫ m̃(k_x, k_y) e^{+2πi(k_x x + k_y y)} dk_x dk_y		(9.33)

Now, define the 1D Fourier transform of the projection data:

	t̃(k_u, θ) = ∫ t(u, θ) e^{−2πi k_u u} du		(9.34)

	t(u, θ) = ∫ t̃(k_u, θ) e^{+2πi k_u u} dk_u		(9.35)

Substituting Equation (9.31) into Equation (9.34) gives

	t̃(k_u, θ) = ∫ [ ∫ m(x, y) ds ] e^{−2πi k_u u} du		(9.36)

Making a change of variables ds du → dx dy and using the fact that the Jacobian determinant is
unity, we have

	t̃(k_u, θ) = ∫∫ m(x, y) e^{−2πi k_u (x cos θ + y sin θ)} dx dy
	          = m̃(k_u cos θ, k_u sin θ)		(9.37)

Equation (9.37) states that the 1D Fourier transform of the projected data is equal to the
2D Fourier transform of the model. This relationship is known as the Fourier (central) slice
theorem because for a fixed angle θ_0, the projected data provide the Fourier transform of the
model along a slice through the origin of the wavenumber space, i.e., along

	k_x = k_u cos θ_0 ,   k_y = k_u sin θ_0 ,   θ_0 fixed

[Figure: left, the projection t(u, θ_0) of m(x, y) along direction θ_0 in the (x, y) plane; right, its 1D Fourier transform gives m̃(k_x, k_y) along the radial line at angle θ_0 in the (k_x, k_y) plane.]
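
A rough numerical sketch of the projection operation t(u, θ) (the blocky test model and the rotate-and-sum implementation using scipy.ndimage.rotate are illustrative choices, not a prescribed algorithm):

```python
import numpy as np
from scipy.ndimage import rotate

m = np.zeros((64, 64))
m[20:30, 35:50] = 1.0                    # a simple rectangular slowness anomaly (made up)

def projection(model, theta_deg):
    """Return t(u, theta): line integrals of the model at angle theta."""
    rotated = rotate(model, theta_deg, reshape=False, order=1)
    return rotated.sum(axis=0)           # summing along one axis integrates along s

for theta in (0.0, 45.0, 90.0):
    t = projection(m, theta)
    print(f"theta = {theta:5.1f} deg, total projected mass = {t.sum():.1f}")
# The total projected mass is (nearly) the same for every angle, as it must be, since
# each projection integrates the same model; small differences come from interpolation.
```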


Now, we can invert the 2D Fourier transform and recover the model,
	m(x, y) = ∫∫ m̃(k_x, k_y) e^{+2πi(k_x x + k_y y)} dk_x dk_y ,   with (k_x, k_y) = k_u (cos θ, sin θ)		(9.38)

Changing variables to polar coordinates gives

	m(x, y) = ∫_0^π ∫_{−∞}^{+∞} m̃(k_u, θ) e^{+2πi k_u (x cos θ + y sin θ)} |k_u| dk_u dθ		(9.39)

where |k_u| arises because dk_x dk_y = |k_u| dk_u dθ (with 0 ≤ θ < π and −∞ < k_u < ∞). Carrying out
the k_u integral first gives

	m(x, y) = ∫_0^π m*(x cos θ + y sin θ, θ) dθ		(9.40)

where m* is obtained from the inverse Fourier transform of |k_u| m̃(k_u, θ).

The Radon transform can be written more simply as

	F(a, b) = ∫_r f(x, a + bx) dr		(9.41)

where F is the "shadow" (projection), f is the model, and the integral is taken along the path r:

[Figure: a straight path r through the (x, y) plane.]

with the inverse

	f(x, y) = ρ(x, y) ∗ ∫_q F(y − bx, b) dq		(9.42)

where f is the model, ρ is the "rho filter," F are the shadows, and the integral is again over paths.
Note the similarity of the inverse and forward transforms, with a change of sign in the
argument of the integrand.

9.4.2 Interpretation of Tomography Using the Radon Transform


[Figure: a source illuminating a body containing an anomaly, and the resulting projection.]

The resolution and accuracy of the reconstructed image are controlled by the coverage of
the sources.

[Figure: back projection with one ray, which blurs the anomaly along the ray path, versus back projection with many rays from different sources.]

Because of this requirement for excellent coverage for a good reconstruction, the Radon
transform approach is generally not used much in geophysics. One exception to this is in the
area of seismic processing of petroleum industry data, where data density and coverage are, in
general, much better.
9.4.3 Slant-Stacking as a Radon Transform (following Claerbout, 1985)

Let u(x, t) be a wavefield. An example would be as follows.

[Figure: a linear event with slope p in the x–t plane.]

The slant stack of the wavefield is defined by

	ū(p, τ) = ∫ u(x, τ + px) dx		(9.43)

The integral along x is done at constant τ, which defines a slanting straight line in the x–t
plane with slope p. Note the similarity of Equation (9.43) to Equation (9.41). Equation (9.43) is
a Radon transform.
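
A minimal numerical sketch of Equation (9.43) (the synthetic wavefield with one dipping impulsive event, the grid spacing, and the nearest-sample summation are invented for illustration):

```python
import numpy as np

nx, nt, dt = 50, 200, 0.004
x = np.arange(nx) * 10.0                      # receiver offsets (m)
t = np.arange(nt) * dt                        # time samples (s)
p_true, tau_true = 0.0004, 0.2                # slope (s/m) and intercept (s) of the event

u = np.zeros((nx, nt))
for i in range(nx):
    j = int(round((tau_true + p_true * x[i]) / dt))
    u[i, j] = 1.0                             # impulsive arrival along t = tau + p*x

def slant_stack(u, x, t, p_vals, tau_vals):
    dt = t[1] - t[0]
    out = np.zeros((len(p_vals), len(tau_vals)))
    for ip, p in enumerate(p_vals):
        for it, tau in enumerate(tau_vals):
            j = np.rint((tau + p * x) / dt).astype(int)   # nearest sample along the line
            ok = (j >= 0) & (j < u.shape[1])
            out[ip, it] = u[np.arange(len(x))[ok], j[ok]].sum()
    return out

p_vals = np.linspace(0.0, 0.0008, 41)
tau_vals = np.linspace(0.0, 0.4, 41)
S = slant_stack(u, x, t, p_vals, tau_vals)
ip, it = np.unravel_index(np.argmax(S), S.shape)
print("stack peaks at p =", p_vals[ip], " tau =", tau_vals[it])   # near p_true, tau_true
```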

To get a better feel for this concept, let's step back a little and consider the travel time
equations of rays in a layered medium.

[Figure: reflection ray paths in a layer of thickness z_1 and velocity v_1 over a halfspace of velocity v_2, recorded at offsets x_1, x_2, x_3, and the corresponding hyperbolic travel time curve t(x).]


The travel time equation is

	t^2 = (x^2 + 4z_1^2) / v^2   or   v^2 t^2 − x^2 = 4z_1^2		(9.44)

Because the signs of the t^2 and x^2 terms are opposite, this is an equation of a hyperbola.
Slant stacking is changing variables from t–x to τ–p, where

	τ = t − px		(9.45)

and

	p = dt/dx		(9.46)

From Equation (9.44),

	2t dt v^2 = 2x dx		(9.47)

so

	dt/dx = x / (v^2 t)		(9.48)

The equations simplify if we introduce the parametric substitution

	z = vt cos θ
	x = vt sin θ		so   x = z tan θ		(9.49)

Now, take the linear moveout Equation (9.45) and substitute for t and x (using p = (sin θ)/v):

	τ = z/(v cos θ) − (sin θ/v) z tan θ = (z/v) cos θ		(9.50)

This can be written after some substitution as

	τ = (z/v) √(1 − p^2 v^2)		(9.51)

and finally,

	(τ/z)^2 + p^2 = 1/v^2		(9.52)

This is an equation of an ellipse in τ–p space. All this was to show that hyperbolic curves
in t–x space transform to ellipses in τ–p space.

[Figure: a hyperbola in the t–x plane and the corresponding ellipse in the τ–p plane.]


With this background, consider how the wavefield in a layer-over-halfspace slant stacks
(or Radon transforms).

[Figure: left, a source over a layer of velocity v_0 above a halfspace of velocity v_1 > v_0, with direct-, reflected-, and head-wave ray paths; middle, the corresponding t–x arrivals; right, the τ–p curves, with 1/v_0 and 1/v_1 marked on the p axis.]

A couple of important features of the τ–p plot are as follows:

1. Curves cross one another in t–x space but not in τ–p space.
2. The p axis has dimensions of 1/v.
3. The velocity of each layer can be read from its τ–p curve as the inverse of the maximum p
   value on its ellipse.
4. Head waves are points in τ–p space located where the ellipses touch.

Returning to our original slant-stack equation,

	ū(p, τ) = ∫ u(x, τ + px) dx		(9.43)

This equation can be transformed into the Fourier space using

	U(k, w) = ∫∫ u(x, t) e^{iwt − ikx} dt dx		(9.53)

Let p = k/w:

	U(wp, w) = ∫∫ u(x, t) e^{iw(t − px)} dt dx		(9.54)

and change variables from t to τ = t − px:

	U(wp, w) = ∫ [ ∫ u(x, τ + px) dx ] e^{iwτ} dτ		(9.55)

The expression in square brackets is just Equation (9.43). Inserting Equation (9.43) gives

	U(wp, w) = ∫ ū(p, τ) e^{iwτ} dτ		(9.56)

Think of this as a 1D function of w that is extracted from the k–w plane along the line k = wp.

Finally, taking the inverse Fourier transform,

	ū(p, τ) = ∫ U(wp, w) e^{−iwτ} dw		(9.57)

This result shows that a slant stack can be done by Fourier transform operations:
1. Fourier transform u(x, t) to U(k, w).
2. Extract U(wp, w) from U(k, w) along the line k = wp.
3. Inverse Fourier transform from w to t.
4. Repeat for all interesting values of p.

Tomography is based on the same mathematics as inverse slant stacking. In simple
terms, tomography is the reconstruction of a function given line integrals through it.

The inverse slant stack is based on the inverse Fourier transform of Equation (9.53),

	u(x, t) = ∫∫ U(k, w) e^{ikx} e^{−iwt} dk dw		(9.58)

Substitute k = wp and dk = w dp. Notice that when w is negative, the integration with dp is from
positive to negative. To keep the integration in the conventional sense, we introduce |w|:

	u(x, t) = ∫∫ U(wp, w) |w| e^{iwpx} e^{−iwt} dp dw		(9.59)

Changing the order of integration gives

	u(x, t) = ∫ { ∫ U(wp, w) |w| e^{iwpx} e^{−iwt} dw } dp		(9.60)

Note that the term in { } contains the inverse Fourier transform of a product of three
functions of frequency. The three functions are (1) U(wp, w), which is the Fourier transform of
the slant stack, (2) e^{iwpx}, which can be thought of as a delay operator, and (3) |w|, which is called
the rho filter. A product of three terms in Fourier space is a convolution in the original space.
Let the delay px be a time shift in the argument, so

	u(x, t) = rho(t) ∗ ∫ ū(p, t − px) dp		(9.61)

This is the inverse slant-stack equation. Comparing the forward and inverse Equations
(9.43) and (9.61), note that the inverse is basically another slant stack with a sign change:

	ū(p, τ) = ∫ u(x, τ + px) dx		(9.43)

	u(x, t) = rho(t) ∗ ∫ ū(p, t − px) dp		(9.61)


9.5 Review of the Radon Transform (Approach 2)


If m(x, y) is a 2-D slowness (1/v) model, and t(u, θ) is the travel time along a ray
parameterized by distance u and angle θ, then

	t(u, θ) = ∫ m(x, y) ds		is the Radon transform		(9.62)

The inverse problem is: Given t(u, θ) for many values of u and θ, find the model m(x, y).
Take the 1-D Fourier transform of the projection data,

	t̃(k_u, θ) = ∫ t(u, θ) e^{−2πi k_u u} du		(9.63)

Substitute Equation (9.62),

	t̃(k_u, θ) = ∫ [ ∫ m(x, y) ds ] e^{−2πi k_u u} du		(9.64)

Change variables ds du → dx dy:

	t̃(k_u, θ) = ∫∫ m(x, y) e^{−2πi k_u (x cos θ + y sin θ)} dx dy		(9.65)

Recognize that the right-hand side is a 2-D Fourier transform:

	m̃(k_x, k_y) = ∫∫ m(x, y) e^{−2πi (k_x x + k_y y)} dx dy		(9.66)

So

	t̃(k_u, θ) = m̃(k_u cos θ, k_u sin θ)		(9.67)

In words: the 1-D Fourier transform of the projected data is equal to the 2-D Fourier
transform of the model evaluated along a radial line in the k_x–k_y plane at angle θ.

[Figure: left, the projection t(u, θ_0) of m(x, y) in the (x, y) plane; right, the radial line at angle θ_0 in the (k_x, k_y) plane along which m̃(k_x, k_y) is obtained.]

If the Radon transform is known for all values of (u, θ), then the Fourier transform image
of m is known for all values of (k_u, θ). The model m(x, y) can then be found by taking the inverse
Fourier transform of m̃(k_x, k_y).


Slant-stacking is also a Radon transform. Let u(x, t) be a wavefield. Then the slant stack
is

[Figure: a linear event with slope p in the x–t plane.]

	ū(p, τ) = ∫ u(x, τ + px) dx		(9.68)

Now, consider a simple example of reflections in a layer.

[Figure: a layer of thickness z_1 and velocity v_1 over a halfspace of velocity v_2, with horizontal offset x.]

Travel time (t–x) equation:

	t^2 v^2 − x^2 = z^2		(hyperbolas)

[Figure: hyperbolic t(x) curves.]

Slant-stack (τ–p) equation:

	(τ/z)^2 + p^2 = 1/v^2		(ellipses)

[Figure: elliptical τ(p) curves.]



With this background, consider how the complete wavefield in a layer slant-stacks:

[Figure: left, a source over a layer of velocity v_0 above a halfspace of velocity v_1 > v_0, with direct-, reflected-, and head-wave ray paths; middle, the corresponding t–x arrivals; right, the τ–p curves, with 1/v_0 and 1/v_1 marked on the p axis.]


The τ–p equation can be transformed into k–w space by a 2-D Fourier transform:

	U(k, w) = ∫∫ u(x, t) e^{iwt − ikx} dt dx		(9.69)

The inverse slant stack (IRT) is based on the inverse Fourier transform of the above
equation,

	u(x, t) = ∫∫ U(k, w) e^{ikx} e^{−iwt} dk dw		(9.70)

Substitute k = wp and dk = w dp. Notice that when w is negative, the integral with respect to dp
is from +∞ to −∞. To keep the integration in the conventional sense, introduce |w|:

	u(x, t) = ∫∫ U(wp, w) |w| e^{iwpx} e^{−iwt} dp dw		(9.71)

Changing the order of integration,

	u(x, t) = ∫ { ∫ U(wp, w) |w| e^{iwpx} e^{−iwt} dw } dp		(9.72)

The term in { } contains an inverse Fourier transform of a product of three functions of w:

1. U(wp, w) is the Fourier transform of the slant stack evaluated at k = wp. So, the slant
   stack is given by

	ū(p, τ) = ∫ U(wp, w) e^{−iwτ} dw		(9.73)

2. The e^{iwpx} term can be thought of as a delay operator, where τ = t − px.
3. The |w| term is called the rho filter.

Now we can rewrite the above equation as

	u(x, t) = rho(t) ∗ ∫ ū(p, t − px) dp		(Inverse Radon Transform)		(9.74)

Compare with

	ū(p, τ) = ∫ u(x, τ + px) dx		(Radon Transform)		(9.43)


9.6 Alternative Approach to Tomography


An alternative approach to tomography is to discretize the travel time equation, as
follows.

[Figure: a ray path r through the (x, y) plane, with path element dl.]

	m(x, y) = 1 / v(x, y)		(9.75)

	t = ∫_r m dl		(9.76)

With some assumptions, we can linearize and discretize this equation to

	t_r = Σ_b l_rb m_b		(9.77)

where	t_r = travel time of the rth ray
	m_b = slowness of the bth block
	l_rb = length of the rth ray segment in the bth block.
In matrix form,

	d = Gm		Aha!!		(1.13)

We know how to solve this equation. But what if we have a 3D object with 100 blocks on a
side? Then M ≈ (100)^3 = 10^6, and even G^TG is a matrix with M^2 elements, or ~10^12. Try
throwing that into your MATLAB program. So what can we do?

Let's start at the beginning again.

	d = Gm		(1.13)

	G^T d = G^T G m		(9.78)

In a sense, G^T is an approximate inverse operator that transforms a data vector into model
space. Also, in this respect, G^TG can be thought of as a resolution matrix R that shows you the
filter between your estimate of the model, G^T d, and the real model, m. Since the ideal R = I,
let's try inverting Equation (9.78) by using only the diagonal elements of G^TG. Then the
solution can be computed simply.


	[ Σ_{i=1}^{N} l_i1 t_i ]     [ Σ_{i=1}^{N} l_i1^2         0          ⋯         0          ] [ m_1 ]
	[ Σ_{i=1}^{N} l_i2 t_i ]  =  [        0          Σ_{i=1}^{N} l_i2^2   ⋯         0          ] [ m_2 ]
	[          ⋮          ]     [        ⋮                   ⋮          ⋱         ⋮          ] [  ⋮  ]
	[ Σ_{i=1}^{N} l_iM t_i ]     [        0                   0          ⋯  Σ_{i=1}^{N} l_iM^2 ] [ m_M ]
											(9.79)

so

	m_b = [ Σ_{i=1}^{N} l_ib t_i ] / [ Σ_{i=1}^{N} l_ib^2 ]		("tomographic approximation")		(9.80)

Operationally, each ray is back-projected, and at each block a ray hits, the contributions
to the two sums are accumulated in two vectors. At the end, each element of the numerator
vector is divided by the corresponding element of the denominator vector. This reduces storage
requirements to ~2M instead of M^2.
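
A minimal sketch of this back-projection bookkeeping (the random sparse ray-length matrix and the "true" model below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 50                               # N rays, M blocks
L = rng.random((N, M)) * (rng.random((N, M)) < 0.2)   # sparse, made-up ray lengths l_ib
m_true = 0.5 + 0.1 * rng.random(M)           # made-up "true" slowness model
t = L @ m_true                               # travel times t_i = sum_b l_ib m_b

numer = np.zeros(M)                          # accumulates sum_i l_ib t_i
denom = np.zeros(M)                          # accumulates sum_i l_ib^2
for i in range(N):                           # back-project one ray at a time
    numer += L[i] * t[i]
    denom += L[i] ** 2

m_approx = numer / denom                     # Equation (9.80); needs only ~2M storage
print("rms error of tomographic approximation:",
      np.round(np.sqrt(np.mean((m_approx - m_true) ** 2)), 4))
```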

Let's see how this works for our simple tomography problem.

[Figure: a 2 × 2 block model (blocks 1–4), with travel times of 1 and 0 measured along the two rows and 1 and 0 along the two columns.]



	m = [1, 0, 0, 0]^T	for	d = [1, 0, 1, 0]^T		(9.81)

	G^T d = [2, 1, 1, 0]^T	and	m_ML = [3/4, 1/4, 1/4, −1/4]^T	(fits the data)		(9.82)

	G^T G = [ 2  1  1  0 ]		[G^T G]_diag = [ 2  0  0  0 ]
	        [ 1  2  0  1 ]		               [ 0  2  0  0 ]
	        [ 1  0  2  1 ]		               [ 0  0  2  0 ]
	        [ 0  1  1  2 ]		               [ 0  0  0  2 ]		(9.83)

The approximate equation is

	[G^T G]_diag m̃ = G^T d ,   i.e.,   2I m̃ = [2, 1, 1, 0]^T ,   so   m̃ = [1, 1/2, 1/2, 0]^T		(9.84)
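
A quick numerical check of this example (a sketch; numpy's pinv is used here to reproduce the generalized-inverse solution quoted in Equation (9.82)):

```python
import numpy as np

G = np.array([[1, 1, 0, 0],      # ray along row 1 (blocks 1, 2)
              [0, 0, 1, 1],      # ray along row 2 (blocks 3, 4)
              [1, 0, 1, 0],      # ray along column 1 (blocks 1, 3)
              [0, 1, 0, 1]], dtype=float)   # ray along column 2 (blocks 2, 4)
m_true = np.array([1.0, 0.0, 0.0, 0.0])
d = G @ m_true                                   # = [1, 0, 1, 0], Equation (9.81)

print("G^T d          =", G.T @ d)               # [2, 1, 1, 0], Equation (9.82)
print("G^T G          =\n", G.T @ G)             # matrix in Equation (9.83)

D = np.diag(np.diag(G.T @ G))                    # diagonal part, = 2I here
m_approx = np.linalg.solve(D, G.T @ d)           # tomographic approximation, Eq. (9.84)
print("m_approx       =", m_approx)              # [1, 0.5, 0.5, 0]
print("predicted data =", G @ m_approx)          # [1.5, 0.5, 1.5, 0.5]: does not fit d

m_ml = np.linalg.pinv(G) @ d                     # generalized-inverse solution
print("m_ML           =", m_ml)                  # [0.75, 0.25, 0.25, -0.25], fits the data
```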

Not bad, but this does not predict the data. Can we improve on this? Yes! Following
Menke, develop an iterative solution:

	[G^T G] m = G^T d		(9.85)

	[I − I + G^T G] m = G^T d		(9.86)

	m − [I − G^T G] m = G^T d		(9.87)

so

	m^est(i) = G^T d + [I − G^T G] m^est(i−1)		(9.88)

	[Menke's Equations (11) and (24)]
A slightly different iterative technique is

	m^(k) = D^{−1} G^T d + [D − G^T G] m^(k−1)		(9.89)

where D is the diagonalized G^T G matrix. Then for m^(0) = 0, m^(1) = D^{−1} G^T d is the tomographic
approximation solution.
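
A short sketch of the simple iteration in Equation (9.88) applied to the 2 × 2 example (as the next paragraph points out, such rudimentary schemes are not guaranteed to converge, and in fact this one diverges for this G):

```python
import numpy as np

# Equation (9.88) on the 2x2 example; an eigenvalue of G^T G exceeds 2, so the
# iteration matrix [I - G^T G] amplifies part of the error and the scheme diverges.
G = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
d = np.array([1.0, 0.0, 1.0, 0.0])

m = np.zeros(4)                              # m_est(0) = 0
for i in range(1, 6):
    m = G.T @ d + (np.eye(4) - G.T @ G) @ m  # Equation (9.88)
    print(f"iteration {i}: misfit ||d - Gm|| = {np.linalg.norm(d - G @ m):.1f}")
# A damped update such as m <- m + a*G^T(d - Gm) with a small step a would converge;
# SIRT and LSQR (mentioned below) are the practical choices for large problems.
```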

These are still relatively rudimentary iterative techniques. They do not always
converge. Ideally, we want iterative techniques to converge to the least squares solution. A
number of such techniques exist and have been used in tomography. They include the
simultaneous iterative reconstruction techniques, or SIRT, and LSQR. See the literature.


This chapter is only an introduction to continuous inverse theory, neural networks, and
the Radon transform. The references cited at the beginning of the chapter are an excellent place
to begin a deeper study of continuous inverse problems. The Backus–Gilbert approach provides
one way to handle the construction (i.e., finding a solution) and appraisal (i.e., what do we really
know about the solution) phases of a continuous inverse analysis. In this sense, these are tasks
that were covered in detail in the chapters on discrete inverse problems. For all the division
between continuous and discrete inverse practitioners (continuous inverse types have been
known to look down their noses at discrete types saying that they're not doing an inverse
analysis, but rather parameter estimation; conversely, discrete types sneer and say that
continuous applications are too esoteric and don't really work very often anyway), we have
shown in this chapter that the goals of constructing (finding) a solution and appraising it are
universal between the two approaches. We hope to add more material on these topics Real Soon
Now.


References

Claerbout, J. F., Imaging the Earth's Interior, 398 pp., Blackwell Scientific Publications, Oxford,
UK, 1985.
