Aetna Book 2013 Print
An Engineer’s Toolkit of Numerical Algorithms
July 2013
1 Motivation
Illustration 8
8.4.2 Condition number
Illustration 9
8.4.3 Perturbation of A
8.4.4 Induced matrix norm
Illustration 10
8.4.5 Condition number in pictures
Illustration 11
8.4.6 Condition number for symmetric matrices
Illustration 12
8.5 QR factorization
8.5.1 Householder reflections
Illustration 13
Illustration 14
8.6 Annotated bibliography
Illustration 3
10.10 Conjugate Gradients method
10.11 Generalization to multiple equations
Illustration 4
10.12 Direct versus iterative methods
10.13 Least-squares minimization
10.13.1 Geometry of least squares fitting
10.13.2 Solving least squares problems
10.14 Annotated bibliography
Index
1 Motivation
The narrative in this chapter is provided in the hope that it will motivate the esteemed reader to
take the present subject seriously. Please do not be discouraged if the text in this Chapter is found
lacking in entertainment value. The rest of the book will make it up to you to excess.
Let us consider the experience of the renowned structural engineer WTP (in Figure 1.1
accompanied by his attorney CR). It concerns a planar truss structure designed by WTP and
analyzed for static loads. The structural software used was developed by Owl & Co.
Fig. 1.1. The renowned structural engineer WTP depicted on the stairs of his house with his attorney CR.
CR was instrumental in keeping WTP’s engineering career on track.
The structure was first analyzed with default analysis settings and the shape after the deformation
is shown in Figure 1.2 (also included is a visual representation of the applied loading and the three
pin supports). The shape before deformation is shown in broken line. The deformation is highly
magnified.
Later that day WTP was exploring the menus of the analysis software (there is always a first time
for everything), and got intrigued by the fact that the analysis option “Use automatic stabilization”
was checked. The documentation was not very helpful in explaining the effects of this option (WTP
was in fact not sure in which language the documentation was written), and therefore an experiment
was in order. The analysis option “Use automatic stabilization” was unchecked, and the analysis
was repeated. To WTP’s surprise the results were practically identical, except that a slight
displacement unsymmetry developed (for instance, displacement of -0.7039 versus -0.6518 units). This was
disquieting since the structure and the boundary conditions (loads and supports) were symmetric.
WTP was however unperturbed, especially given that the analysis was to be delivered to the client
the next day.
Several weeks later the software developers alerted WTP that a bug was found in the analysis
software, the bug was fixed, and an update was to be installed. WTP remembered the slight
unsymmetry, and therefore checked whether the update removed it. Since the unsymmetry remained,
a brief discussion ensued in whose course WTP ascertained that the bug had to do with the color
in which the logo of the company was drawn in the splash screen, and therefore it was somewhat
unlikely to be the cause of the unsymmetry.
Fig. 1.2. The planar truss structure. Deformation under indicated static loads is shown in solid line
(magnified). The undeformed structure is shown in dashed line.
During a discussion with a colleague WTP was able to convince himself that no unsymmetry
was to be expected in the analysis, and that if it appeared it should be considered an error. At
that point WTP began to draw on his immense powers of reasoning. After only a few hours he
was able to recall the name of the text in which properties of coupled systems of linear algebraic
equations were discussed in his junior year in college. An intense session with the textbook followed,
and WTP was quickly able to find the page that pertained to errors that can appear in the solution
of systems of equations. The error was found to be proportional to the error in the right-hand side
(the loads) and to the “condition number” of the stiffness matrix. The loads were, as WTP checked,
specified correctly, and consequently the mysterious “condition number” was probably the source of
the confounding error.
WTP was now able to find in the textbook that the condition number of the stiffness matrix
was rather expensive to compute as one had to solve an “eigenvalue problem”. WTP was not to be
deterred however, and subcontracted this work out to a group of students from the local university,
cost it what it may (it wasn’t much). The magnitudes of the eigenvalues of the stiffness matrix found
by the students are shown in Figure 1.3.
[Figure: eigenvalue magnitude (logarithmic axis, 10^-15 to 10^5) versus eigenvalue number (0 to 50).]
Fig. 1.3. The magnitudes of the eigenvalues of the stiffness matrix of the structure from Figure 1.2
The rather small first eigenvalue did not escape WTP and a few more rewarding hours were
spent looking for information that could lead to an understanding of the relationship between the
condition number, the eigenvalue problem, and the stiffness matrix. Eventually the critical piece of
information that the so-called singular matrix had at least one zero eigenvalue was located, and the
conclusion that the stiffness matrix was somehow close to singular was reached.
The displacement shape corresponding to the first eigenvector (Figure 1.4) facilitated the
ultimate breakthrough. The structure contained a mechanism: a floppy piece of structure that was
insufficiently connected to the rest of the structure (which was in fact sufficiently supported).
Fig. 1.4. The eigenvector 1 of the stiffness matrix of structure from Figure 1.2
The structure was consequently subjected to a redesign to remove the mechanism, and the
redesign was eagerly adopted by the client who remarked on the propitious circumstance that a
superior design became available before the structure was realized. WTP has yet again demonstrated
that superior skill and knowledge cannot fail to win the day. Even though his friend CR’s assistance
was not required in this matter, his comforting presence during these trials and tribulations was
gratefully noted by WTP.
2 Modeling with differential equations
Summary
1. In this chapter we develop an understanding of initial value problems (IVPs). We look at the
simple but illustrative model of motion in a viscous fluid, and the model of satellite motion. The
main idea: these models can be treated similarly since they are both members of the class
of IVPs. The constituents of an IVP are the governing equation and the initial conditions.
2. The IVPs that will be considered in this book will be in the form of coupled first-order (only
one derivative with respect to the independent variable) equations.
3. We develop simple methods for integrating IVPs numerically in time. The main idea:
approximate the curve by its tangent in order to make one discrete step in time. The basic visual picture
is provided by the direction field.
4. We discuss the essential differences between IVPs and BVPs (boundary value problems). The
main idea: BVPs are harder to solve than IVPs because the problem data is located on the entire
boundary of the domain of the independent variables.
5. We investigate the accuracy of some simple numerical solvers for IVPs. The main concepts:
a monomial relationship between the error and the time step length gives us formulas to estimate
the error, and the log-log plot of the error against the time step length reveals the
convergence rate.
6. We wrap up the exposition of the various time integrators by describing the Runge-Kutta
integrators. Main idea: try to aim the time step for optimal accuracy by sampling the right-hand
side function (that is the slope) within the time step.
George Gabriel Stokes was a 19th century mathematician who has had an enormous impact on
many areas of engineering through his work on properties of fluids. Perhaps his most significant
accomplishment was the work describing the motion of a sphere in a viscous fluid. This work led to
the development of Stokes’ Law. This is a mathematical description of the force required to move a
sphere through a viscous fluid at a specific velocity.
Stokes’ Law for a sphere descending under the influence of gravity in a viscous fluid is written
as
$$ \eta\, 6\pi r\, v = \frac{4}{3}\pi r^3 (\rho_s - \rho_f)\, g , \tag{2.1} $$
where $\frac{4}{3}\pi r^3$ is the volume of the sphere, $\eta$ is the dynamic fluid viscosity (for instance in SI units
Pa · s), $6\pi r$ is the shape factor of the sphere of radius $r$, $v$ is the velocity of the falling sphere relative
to the fluid, $m = \frac{4}{3}\pi r^3 \rho_s$ is the mass of the sphere, and $g$ is the gravitational acceleration. On the left of
equation (2.1) is the so-called drag force $F_d$, on the right is the gravitational force $F_g$ (i.e. $\frac{4}{3}\pi r^3 \rho_s g$,
where $\rho_s$ is the mass density of the material of the sphere) minus the buoyancy force (i.e. $\frac{4}{3}\pi r^3 \rho_f g$,
where $\rho_f$ is the mass density of the fluid); compare with Figure 2.1.
[Figure 2.1: sphere with the forces $F_d$ and $F_g$ indicated.]
An application of this law to structural engineering may be found for instance in composites
manufacturing: a technique commonly used for large parts, Vacuum Assisted Resin Transfer
Moulding (VARTM), infuses dry fibers laid up on a bagged mold with resin by creating a partial
vacuum to “suck the resin into the fibers”. A critical property of the polymer resins is
their dynamic viscosity: if the resin is too viscous, the fibers may be incompletely impregnated and
the part must be discarded. Some of the techniques to determine the viscosity of the liquid resemble
a high school science experiment: drop a ball into a tube filled with this liquid. Measure the time it
takes the ball to travel some distance. From that calculate the ball’s velocity (distance/time), and
knowing the ball’s diameter and mass obtain from (2.1) the liquid’s viscosity.
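The falling-ball measurement just described can be sketched in a few lines of MATLAB. This is only an illustration of solving (2.1) for the viscosity: the numerical values (densities, travel distance, travel time) are assumptions made up for the example, not data from the text, and the variable names `dist` and `time` are hypothetical.

```matlab
% Falling-ball viscometer estimate based on equation (2.1).
% All numerical values are illustrative assumptions, not measured data.
r    = 0.005;   % sphere radius [m]
rhos = 7850;    % sphere density [kg/m^3] (assumed: steel)
rhof = 1100;    % liquid density [kg/m^3] (assumed)
g    = 9.81;    % gravitational acceleration [m/s^2]
dist = 0.20;    % measured travel distance [m] (hypothetical)
time = 0.60;    % measured travel time [s] (hypothetical)

v   = dist/time;                    % steady descent speed [m/s]
eta = 2*r^2*(rhos-rhof)*g/(9*v);    % (2.1) solved for the viscosity [Pa*s]
fprintf('v = %.4f m/s, eta = %.4f Pa*s\n', v, eta);
```

The estimate is only as good as the assumption that the ball moves at a steady speed over the measured distance, which is the question taken up next.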
Of course, if we calculate the ball’s velocity as (distance/time) it better be uniform in that
interval. So how does the velocity of the falling ball vary with time? Let us say we observe the proceedings
with a high-speed camera. We drop the ball from rest, and then we see the ball rapidly accelerate.
Eventually it seems to settle down to a steady speed on the way downwards. The modeling keyword
is “acceleration”, and consequently we shall use Newton’s equation: Acceleration is proportional to
force. The acceleration may be written as ẍ (measuring the distance traveled downwards: Figure 2.1),
and the total applied force is Fg − Fd . Therefore, we write
$$ \frac{4}{3}\pi r^3 \rho_s\, \ddot{x} = \frac{4}{3}\pi r^3 (\rho_s - \rho_f)\, g - \eta\, 6\pi r\, v . \tag{2.2} $$
Simplifying a little bit, we obtain
$$ \ddot{x} = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v . \tag{2.3} $$
We see that we have one equation, but two variables: x and v. These are not independent, since
the velocity is defined as the rate of change of the distance fallen, v = ẋ. We have two choices. Either
we express equation (2.3) in terms of the distance
$$ \ddot{x} = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, \dot{x} \tag{2.4} $$
and we obtain a second order differential equation, or we express equation (2.3) in terms of the
velocity
$$ \dot{v} = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v \tag{2.5} $$
and we obtain a first order differential equation. Since we are at the moment primarily interested in
the velocity, we will stick to the latter.
All these equations are the so-called equations of motion. They are differential equations,
expressing rate of change of some variable (x or v) in terms of the same variable (and/or other variables,
in general). The independent variable is the time t and the dependent variable is the velocity v.
We realize that to obtain a solution we must somehow integrate both sides of the equation of
motion. From calculus we know that integration brings in constant(s) of integration. So, for instance
for equation (2.5), we may write
$$ \int_{t_0}^{t} \dot{v}(\tau)\, d\tau = \int_{t_0}^{t} \left( \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v(\tau) \right) d\tau $$
Here the task is to find a suitable form of function $v(\tau)$ to satisfy this equation for all times. The
value $v(t_0)$ is arbitrary. Its physical meaning is that of the velocity at the beginning of the interval
$t_0 \le \tau \le t$. Therefore, setting the value $v(t_0)$ to some particular number
$$ v(t_0) = v_0 \tag{2.6} $$
is called specifying the initial condition. The initial condition makes the solution to the equation
of motion meaningful to a particular problem. Therefore, we always think of the models of this type
in terms of the pair “governing equation” (the equation of motion) plus the “initial condition”. This
type of model is called the initial value model (and the problem which is modeled this way is called
an initial value problem: IVP). The problem of the falling sphere is an initial value problem,
and the model that needs to be solved is
$$ \dot{v} = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v , \quad v(0) = v_0 \tag{2.7} $$
where we have quite sensibly taken $t_0 = 0$.
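For future reference we will sketch the construction of an analytical solution. One possible
approach uses the decomposition of the solution into a general solution of the homogeneous equation
$$ \dot{v}_h = -\frac{9\eta}{2r^2\rho_s}\, v_h $$
and one particular solution to the inhomogeneous equation
$$ \dot{v}_p = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v_p . $$
The homogeneous equation may be solved by assuming the solution in the form of an exponential
$$ v_h(\tau) = \exp(a\tau) . $$
Substituting this form into the homogeneous equation gives $a = -9\eta/(2r^2\rho_s)$. A particular
solution is obtained by taking $v_p$ constant, so that $\dot{v}_p = 0$ and
$$ v_p = \frac{2r^2(\rho_s - \rho_f)}{9\eta}\, g . $$
The solution to the initial value problem is the sum of the particular solution and some multiple of
the general solution
$$ v(\tau) = v_p(\tau) + C v_h(\tau) . $$
The initial condition $v(0) = v_0$ determines the constant
$$ C = v_0 - \frac{2r^2(\rho_s - \rho_f)}{9\eta}\, g $$
and the analytical solution to the initial value problem (2.7) is
$$ v(\tau) = \frac{2r^2(\rho_s - \rho_f)}{9\eta}\, g + \left( v_0 - \frac{2r^2(\rho_s - \rho_f)}{9\eta}\, g \right) \exp\left( -\frac{9\eta}{2r^2\rho_s}\, \tau \right) . $$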
Figure 2.2 displays the time variation of the speed of the falling sphere for some common data
(epoxy resin, and steel sphere of 5 mm radius). The analytical formula is easily recognizable in the
MATLAB code that produces the figure:
% The constants r, rhos, rhof, eta, g, v0 and the interval tspan are assumed to be defined beforehand.
t=linspace(tspan(1),tspan(2), 100);
vt =(2*r^2*(rhos-rhof))/(9*eta)*g +...
(v0-(2*r^2*(rhos-rhof))/(9*eta)*g)*exp (-(9*eta)/(2*r^2*rhos)*t);
plot(t,vt, 'linewidth', 2, 'color', 'black', 'marker', '.'); hold on
xlabel('t [s]'),ylabel('v(t) [m/s]')
As we can see, the sphere attains an essentially unchanging velocity within a fraction of a second.
Our model can give us additional information: we can see that in theory the time dependence of
the velocity vanishes only for τ → ∞. In other words, it takes an infinite time for the sphere to
stop accelerating. The corresponding velocity is called terminal velocity. So much for theory. In
experiments we expect that practical limits to measurement accuracy would allow us to say that
terminal velocity will be reached within a finite time (for instance, if we can measure velocities with
accuracy of about 1 mm per second, the time to reach terminal velocity in our example is less than
0.3 seconds).
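The time needed to come within a given measurement resolution of the terminal velocity follows directly from the analytical solution, since the gap to the terminal velocity decays as an exponential. The sketch below carries out this calculation. The material data are assumptions chosen to be plausible for a 5 mm steel sphere in epoxy resin (the text does not list the values it used), and the names `k`, `vterm`, `tol` and `ttol` are introduced here only for the example.

```matlab
% Time to approach the terminal velocity to within a measurement resolution.
% Material data are assumed values, not taken from the text.
r = 0.005; rhos = 7850; rhof = 1100; eta = 1.1; g = 9.81;
v0 = 0;                                 % sphere released from rest

k     = 9*eta/(2*r^2*rhos);             % decay rate in the exponential [1/s]
vterm = 2*r^2*(rhos-rhof)*g/(9*eta);    % terminal velocity [m/s]
tol   = 1e-3;                           % velocity resolution, 1 mm/s

% |v(t) - vterm| = |v0 - vterm|*exp(-k*t) <= tol  gives  t >= log(|v0 - vterm|/tol)/k
ttol = log(abs(v0 - vterm)/tol)/k;
fprintf('terminal velocity %.4f m/s, reached to within %g m/s after %.3f s\n', ...
    vterm, tol, ttol);
```

With these assumed values the time comes out at roughly 0.23 s, consistent with the bound of 0.3 seconds quoted above.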
[Figure: v(t) [m/s] versus t [s].]
Fig. 2.2. Sphere falling in viscous fluid. Time variation of the descent speed.
not available. The tools available to engineers for these problems will most likely be numerical in
nature. (Hence the reason for this book.)
The simplest method with which to introduce numerical solutions to initial value problems (IVP)
is Euler’s method. It is based on a very simple observation: the solution graph is a curve. The
solution process itself could be understood as constructing a curve. The curve passes through a point
that is known to us from the specification of the IVP: the point $(t_0, v_0)$. A curve consists of an
infinite number of points, and we do not want to have to compute the coordinates of an infinite
number of points. The next best thing would be to compute the solution at only a few points along
the curve, and somehow approximate the curve in-between. It is logical to try to approach this task
by starting from the point we know from the beginning, $(t_0, v_0)$, and to compute next another point
on the curve, let us say $(t_1, v_1)$. Then restart the process by moving one point forward in time,
compute $(t_2, v_2)$, and so on. This is an important aspect of numerical methods: the algorithms make
discrete steps and they produce discrete solutions (as opposed to a continuous analytical solution).
In general we will not be able to compute this sequence of points so that they all lie on the
“exact” solution curve. The points will only be “close” to the curve we wish to find (they will be
only approximately “on the curve”). In fact, there is in general an infinite number of solution curves,
those passing through all possible initial conditions. Refer to Figure 2.3: shown are five solution
curves for five different initial velocities. So if our numerical solution process drifts off the desired
solution curve, it will most probably lie on an adjacent solution curve.
Since the process is repetitive (start from a known solution point and then compute the next
solution point), we may just as well think in terms of the pair $(t_j, v_j)$ (known) and $(t_{j+1}, v_{j+1})$
(unknown, to be computed). How do we approximately locate the point $(t_{j+1}, v_{j+1})$ from what we know
of the solution curve passing through $(t_j, v_j)$? We know the location $(t_j, v_j)$, but is there anything
else? The answer is yes: having $(t_j, v_j)$ allows us to substitute these values on the right-hand side of
the governing equation (2.7) and compute
$$ \dot{v}(t_j, v_j) = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v_j \tag{2.8} $$
(there is no mention of $t_j$, so it does not appear). The meaning of $\dot{v}(t_j, v_j)$ ($\sim \frac{dv}{dt}(t_j, v_j)$) is the slope
of the solution curve that passes through $(t_j, v_j)$! And here is Euler’s critical insight: if we can’t
move along the actual curve (since we don’t know it), we will move instead along the straight line
tangent to the solution curve. How far? Just a little bit, since if the curve really curves, the straight
line motion will quickly become a very poor approximation of the curve. Therefore we compute the
next solution point as (compare with Figure 2.3) [3]
$$ (t_{j+1}, v_{j+1}) \leftarrow (t_j + \Delta t,\; v_j + \Delta t\, \dot{v}(t_j, v_j)) , \quad \Delta t \text{ “small”} . \tag{2.9} $$
Here $\dot{v}(t_j, v_j)$ may become confusing, since by the superimposed dot we don’t mean that a time
derivative of some quantity was taken. We simply mean the value of the given function on the right
of (2.8). Therefore, we give the right-hand side function a name and we use the notation
$$ \dot{v} = f(t, v) , \quad v(t_0) = v_0 \tag{2.10} $$
for the IVP. Here by $f(t, v)$ we mean that the right-hand side of the governing equation is known as
a function of $t$ and $v$. Then the Euler algorithm may be written as
$$ (t_{j+1}, v_{j+1}) \leftarrow (t_j + \Delta t,\; v_j + \Delta t\, f(t_j, v_j)) , \quad \Delta t \text{ “small”} . \tag{2.11} $$
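The update (2.11) translates almost literally into a time-stepping loop. The following is a minimal sketch rather than the book’s own script: the material data are assumed values for a 5 mm steel sphere in epoxy resin, and the choices of `tspan`, `v0` and `nsteps` are likewise assumptions made for the illustration.

```matlab
% Forward Euler time stepping for the IVP (2.10), following the update (2.11).
% Material data and step count are assumed values for illustration.
r = 0.005; rhos = 7850; rhof = 1100; eta = 1.1; g = 9.81;
f = @(t,v) (rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v;   % right-hand side f(t,v)

tspan  = [0 0.5];              % time interval [s]
v0     = 0;                    % initial velocity [m/s]
nsteps = 20;                   % number of time steps
dt     = diff(tspan)/nsteps;   % fixed time step length

t = zeros(nsteps+1,1);  v = zeros(nsteps+1,1);
t(1) = tspan(1);        v(1) = v0;
for j = 1:nsteps
    t(j+1) = t(j) + dt;                 % advance the time
    v(j+1) = v(j) + dt*f(t(j), v(j));   % move along the tangent at (t_j, v_j)
end
plot(t, v, 'o');   % computed points only; nothing is known in between
xlabel('t [s]'), ylabel('v(t) [m/s]')
```

The two arrays `t` and `v` hold the discrete solution: one row per computed point, starting with the initial condition.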
One more remark is in order in reference to Figure 2.3. The short red lines indicate the slope
of the solution curves passing through the points from which the straight red lines emanate (the
left-hand side ends). The straight lines represent the tangents to the solution curves (the slopes).
They are also known as the direction field. Plotting the direction field is a good way in which
the behavior of solutions to ordinary differential equations can be understood. It works best for a
single scalar equation since it is hard to visualize the direction fields when there is more than one
dependent variable.
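A direction field for the Stokes problem can be drawn by evaluating the right-hand side on a grid of points and plotting a short segment with that slope at each point. The sketch below uses the built-in `quiver` function and is not the book’s stokesdirf.m script. The material data are assumed values, and the grid ranges are simply read off the axes of Figure 2.3.

```matlab
% Direction field of dv/dt = f(t,v) on a grid of (t,v) points.
% Material data are assumed values; grid ranges follow the axes of Figure 2.3.
r = 0.005; rhos = 7850; rhof = 1100; eta = 1.1; g = 9.81;
f = @(t,v) (rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v;

[T,V] = meshgrid(linspace(0,0.2,9), linspace(-0.1,0.4,11));
S = f(T,V);            % slope of the solution curve through each grid point
L = sqrt(1 + S.^2);    % normalization so that all segments have the same length
quiver(T, V, 1./L, S./L, 0.5);   % segment direction (1, slope), scaled down
xlabel('t [s]'), ylabel('v(t) [m/s]')
```

Since the right-hand side does not depend on t, the slopes are the same along every horizontal line of the grid, and they vanish at the terminal velocity.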
[3] See: aetna/Stokes/stokesdirf.m
[Figure: v(t) [m/s] versus t [s].]
Fig. 2.3. Stokes problem solutions corresponding to different initial conditions, with the direction field
shown at a few selected points.
[Figure: v(t) [m/s] versus t [s].]
Fig. 2.4. Stokes problem solution computed with a simple implementation of Euler’s method.
Numerically solving IVPs is a common and important task. Not surprisingly, MATLAB has a
menagerie of functions that can do this job very well indeed. Here we illustrate how to use the
MATLAB integrator (that’s what these types of functions are usually called) ode23 [5]. For brevity
we omit the definitions of the constants (same as above). Then as above we define an anonymous
function for the right-hand side of the governing equation, and we pass it as the first argument to
the integrator. The integrator returns two arrays, whose meaning is the same as in our simple code
above.
f=@(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);
[t,v] = ode23 (f, tspan, [v0]);
The solution pairs are now plotted. However, this time we let the plotter connect the computed
points (as indicated by markers) with straight lines. Note well: we are not computing the points
in between, those are interpolated from the computed data points only “for show”. We may take
note of the spacing of the computed data points: the spacing is not uniform. The integrator is clever
enough to figure out how long a step may be taken from one time instant to the next without losing
too much accuracy. We will do our own so-called adaptive time stepping later on in the book.
plot(t,v,'o-')
xlabel('t [s]'),ylabel('v(t) [m/s]')
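The non-uniform spacing of the computed points can be checked numerically by looking at the differences between consecutive time instants returned by the integrator. The following self-contained sketch repeats the call to `ode23` with assumed material data and an assumed time interval, and the name `steps` is introduced here only for the example.

```matlab
% Inspect the step lengths chosen by the adaptive integrator ode23.
% Material data, time interval and initial velocity are assumed values.
r = 0.005; rhos = 7850; rhof = 1100; eta = 1.1; g = 9.81;
tspan = [0 0.5];  v0 = 0;
f = @(t,v) (rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v;

[t,v] = ode23(f, tspan, v0);
steps = diff(t);   % length of each accepted time step
fprintf('%d steps, shortest %.4g s, longest %.4g s\n', ...
    numel(steps), min(steps), max(steps));
```

One would expect the shortest steps near the start, where the velocity changes rapidly, and longer steps once the solution flattens out.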
Now consider the equation of motion written in terms of the second derivative of the distance
traveled (2.3). Since two time derivatives are present, we should expect to have to integrate twice
to obtain a solution. This will result in two constants of integration. The first integration yields
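$$ \int_{t_0}^{t} \ddot{x}\, d\tau = \int_{t_0}^{t} \left( \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, \dot{x} \right) d\tau , \tag{2.12} $$
which results in
$$ \dot{x}(t) - \dot{x}(t_0) = (t - t_0)\, \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, (x(t) - x(t_0)) . $$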
Similarly the second integration gives
$$ \int_{t_0}^{t} (\dot{x}(\tau) - \dot{x}(t_0))\, d\tau = \int_{t_0}^{t} (\tau - t_0)\, \frac{\rho_s - \rho_f}{\rho_s}\, g\, d\tau - \int_{t_0}^{t} \frac{9\eta}{2r^2\rho_s}\, (x(\tau) - x(t_0))\, d\tau . $$
[Figure: v(t) [m/s] versus t [s].]
[5] See: aetna/Stokes/stokes2.m
This expression could be further simplified, but my point can be made here: the two constants of
integration are already present, $x(t_0)$ and $\dot{x}(t_0)$. Therefore the IVP (the governing equation plus the
initial conditions) may be written
$$ \ddot{x} = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, \dot{x} , \quad x(t_0) = x_0 , \quad \dot{x}(t_0) = v_0 . \tag{2.13} $$
The meaning of the IVP is: Find a function (distance traveled) $x(t)$ such that it satisfies the equation
of motion, and such that the initial distance and the initial velocity at the time $t_0$ are $x_0$ and $v_0$
respectively.
The integration of IVPs in MATLAB is made general by requiring that all IVPs be first order
(only first derivatives of the variables may be present). Our IVP (2.13) is second order, but we can
see that it may be converted to a first order form. Just introduce the velocity to write
$$ \dot{v} = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v , \quad \dot{x} = v , \quad x(t_0) = x_0 , \quad v(t_0) = v_0 . \tag{2.14} $$
The price to pay for having to deal with only the first order derivatives is an increased number of
variables: now we have two. Since we have two variables, we better also have two equations. Note
that the initial conditions are now written in terms of the two variables, but we still have two of
them. That is not entirely surprising since we still need two integration constants: we have two first-
order equations, each of them needs to be integrated once, which will again result in two constants
of integration.
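The IVP now deals with a system of coupled ordinary differential equations. Such systems are
usually written in the so-called vector form. We introduce a vector to collect our variables
$$ z = \begin{bmatrix} x \\ v \end{bmatrix} $$
so that the IVP (2.14) takes the form
$$ \dot{z} = f(t, z) , \quad z(t_0) = z_0 , \tag{2.15} $$
where
$$ f(t, z) = \begin{bmatrix} v \\ \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v \end{bmatrix} , \quad z_0 = \begin{bmatrix} x_0 \\ v_0 \end{bmatrix} . $$
Formally, this is the same as the IVP (2.7), except that our variable is a vector, and the function
on the right-hand side returns a vector and takes the time and a vector as arguments. This parallel
makes it possible to treat a variety of IVPs with the same code in MATLAB. Here we show an
implementation [6] that computes the solution to (2.15).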
The definitions of the constants are the same as above, except for the initial conditions. The
initial condition now is a column vector.
z0 = [0;0];% Initial distance and velocity, meters per second
The right-hand side function looks very similar to the one introduced above, except that it needs to
return a vector, and whenever it refers to velocity it needs to take it out of the input vector as z(2):
f=@(t,z)([z(2); (rhos-rhof)/rhos*g-(9*eta)/(2*r^2*rhos)*z(2)]);
The MATLAB integrator is called exactly as before.
[t,z] = ode23 (f, tspan, z0);
The arrays returned by the integrator collect results in the form of a table:
t(:) z(:,1) z(:,2)
t1 x1 v1
t2 x2 v2
... ... ...
Plotting the two arrays then yields two curves: the distance traveled and the velocity (Figure 2.6).
plot(t,z,'o-')
xlabel('t [s]'),ylabel('x(t) [m], v(t) [m/s]')
[Figure: x(t) [m] and v(t) [m/s] versus t [s].]
Fig. 2.6. Stokes problem. Solution of (2.15) computed with a MATLAB integrator.
The solutions are computed in the form of a table. An important parameter in that table is the
spacing along the time direction, the time step. The time step is a form of a control knob: it
appears that turning the knob so that the time step decreases would compute more points along
the solution curve. This should be helpful, if for nothing else than to render better representations
of the solutions. Since the major approximation in Euler’s method is the replacement of the actual
curve with a straight line, we can also see that making the time step smaller will somehow decrease
the error that we will make in each step.
[6] See: aetna/Stokes/stokes3.m
Making the step smaller is however expensive. The more steps we make the algorithm take, the
longer we have to wait for the computer to give us the solution. Hence we may wish to use a step that
is sufficiently large for the results to arrive quickly, but large steps also have consequences. What if
I wanted to take only 10 steps instead of 20 in the first MATLAB script (Section 2.2.1, set
nsteps=10;)? The result [7] is shown in Figure 2.7 and it is clearly unphysical: in the actual experiment (and
in the analytical solution and in our prior calculations) the dropped sphere certainly seems to be
monotonically speeding up, whereas here the result tells us that the velocity oscillates, and moreover
at times seems to be higher than the terminal velocity.
[Figure: v(t) [m/s] versus t [s].]
Fig. 2.7. Stokes problem. Solution with a larger time step than in Figure 2.4.
An explanation of this phenomenon [8] may be found in Figure 2.8. Note the direction field which
will help us understand the numerical solution. Starting from $(t_1 = 0, v_1 = 0)$ we proceed along the
steep straight line so far that the next solution point $(t_2, v_2)$ overshoots the terminal velocity. The
next step is along a straight line with a negative slope, and again we go so far that we undershoot
the terminal velocity. The third step takes us along a straight line with a positive slope, and we
overshoot again. This kind of computed response is not useful to us since the qualitative feature of
the solution, namely the monotonic increase of the speed, is lost in the numerical solution.
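The overshoot can be quantified with a short calculation. The shorthand used here is not from the text: write $k = 9\eta/(2r^2\rho_s)$ for the coefficient of the velocity in (2.7), and $v_t = 2r^2(\rho_s - \rho_f)g/(9\eta)$ for the terminal velocity, so that the right-hand side of (2.7) becomes $k(v_t - v)$. One step of Euler’s method (2.9) then changes the distance from the terminal velocity as follows:

```latex
\begin{aligned}
v_{j+1} &= v_j + \Delta t\, k\, (v_t - v_j) , \\
v_{j+1} - v_t &= (1 - k\,\Delta t)\,(v_j - v_t) .
\end{aligned}
```

Each step thus multiplies the gap to the terminal velocity by the factor $1 - k\Delta t$. For $k\Delta t < 1$ the factor is positive and the computed velocity approaches $v_t$ from one side, as the exact solution does. For $1 < k\Delta t < 2$ the factor is negative, so the computed points alternate above and below $v_t$ while still closing in on it, which is the behavior described above. For $k\Delta t > 2$ the oscillations would grow from step to step.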
[Figure: v(t) [m/s] versus t [s].]
Fig. 2.8. Stokes problem. Solution with a larger time step than in Figure 2.4. The direction field and the
analytical solution are shown.
[7] See: aetna/Stokes/stokes4.m
[8] See: aetna/Stokes/stokes4ill.m
In summary, we see that the selection of the time step length has two kinds of implications.
Firstly, the time step affects the accuracy (how far are the computed solutions from the curve that
we would like to track?). Secondly, the time step affects the quality of the solution (is the shape of the
computed solution series a reasonable approximation of the shape of the “exact” solution curve?).
The first aspect is generally referred to as accuracy. The second aspect is generally considered a
manifestation of stability (or instability, depending on how we look at it).
Euler’s method proposes to follow a straight line set up at $(t_j, v_j)$ to arrive at the point
$(t_{j+1}, v_{j+1})$. As an alternative, let us consider the possibility of following a straight line set up
at the (initially unknown) point $(t_{j+1}, v_{j+1})$. For simplicity let us work with the IVP (2.7). The
right-hand side of the equation of motion is the function
$$ f(t, v) = \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v . $$
Taking the slope at the point $(t_{j+1}, v_{j+1})$, the step reads
$$ v_{j+1} = v_j + \Delta t\, f(t_{j+1}, v_{j+1}) = v_j + \Delta t \left( \frac{\rho_s - \rho_f}{\rho_s}\, g - \frac{9\eta}{2r^2\rho_s}\, v_{j+1} \right) , $$
which has the unknown velocity $v_{j+1}$ on both sides of the equation. Equations of this type are called
implicit, as opposed to the original Euler’s algorithmic equation (2.9). The latter was explicit: the
unknown was explicitly defined by the right-hand side. In implicit equations, the unknown may be
hidden in some nasty expressions on both sides of the equation, and typically a numerical method
must be used to extract the value of the unknown.
In the present case, solving the implicit equation is not that hard:
$$ v_{j+1} = \frac{v_j + \Delta t\, \dfrac{\rho_s - \rho_f}{\rho_s}\, g}{1 + \Delta t\, \dfrac{9\eta}{2r^2\rho_s}} . $$
The MATLAB script of Section 2.2.1 may be easily modified to incorporate our new algorithm. The
only change occurs inside the time-stepping loop
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =(v(j)+dt*(rhos-rhof)/rhos*g)/(1+dt*(9*eta)/(2*r^2*rhos));
end
With this modification⁹ we can now compute the numerical solution without overshoot not only
with 10 steps, but with just five or even two (see Figure 2.9). The computed points are not
particularly accurate, but the qualitative character of the solution curves is preserved. In this sense,
the present modification of Euler’s algorithm has rather different properties than the original.
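The contrast between the two update formulas can be tried out in a few lines. The following is a minimal sketch, not the script of the Stokes example: the constants a and b stand for $(\rho_s - \rho_f)g/\rho_s$ and $9\eta/(2r^2\rho_s)$, and their values, the time step, and the number of steps are assumed purely for illustration.

```matlab
% Forward vs. backward Euler for dv/dt = a - b*v, v(0) = 0.
% a, b, dt and nsteps are assumed illustrative values, not the book's data.
a = 1.0;            % stands for (rhos-rhof)/rhos*g   [m/s^2]
b = 10.0;           % stands for 9*eta/(2*r^2*rhos)   [1/s]
vterm = a/b;        % terminal velocity of the exact solution
dt = 0.15;          % deliberately large time step
nsteps = 5;
vf = zeros(1,nsteps+1);   % forward Euler velocities
vb = zeros(1,nsteps+1);   % backward Euler velocities
for j = 1:nsteps
    vf(j+1) = vf(j) + dt*(a - b*vf(j));      % explicit update
    vb(j+1) = (vb(j) + dt*a)/(1 + dt*b);     % implicit update, solved for vb(j+1)
end
disp([vf; vb])
```

With these assumed values the forward Euler velocities overshoot the terminal velocity and oscillate about it, while the backward Euler velocities approach it from below.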
9
See: aetna/Stokes/stokes5.m
[Figure 2.9: v(t) [m/s] versus t [s].]
In order to be able to distinguish between these algorithms we will call the original algorithm
of Section 2.2.1 the forward Euler , and the algorithm introduced in this section will be called
backward Euler . The justification for this nomenclature may be sought in the visual analogy of
approximating a curve with a tangent: in the forward Euler method this tangent points forward from
the point (tj , vj ), in the backward Euler method the tangent points backward from (tj+1 , vj+1 ).
In this book we shall spend some time experimenting with the forward and backward Euler methods.
However, MATLAB does not come with integrators implementing these methods. They are too
simplistic to serve the general-purpose aspirations of MATLAB. Since it will make our life easier if
we don’t have to code the forward and backward Euler methods every time we want to apply them to a
different problem, the toolbox aetna that comes with the book provides integrators for this pair of
methods.
The aetna forward and backward Euler integrators are called in the same way as the built-in
MATLAB integrators. We have seen in Section 2.2.2 an example of the built-in MATLAB integrator,
ode23. There is one difference, however, which is unavoidable. The built-in MATLAB integrators are
able to determine the time step automatically, and in general the time step is changed from step to
step. The aetna forward and backward Euler integrators are fixed-time-step implementations: the
user controls the time step, and it will not change. Therefore, we have to supply the initial time step
as an option to the integrator. (In fact, even the MATLAB built-in integrators take that options
argument. It is used to control various aspects of the solution process.) The MATLAB odeset
function is used to create the options argument. To compute the solution with the forward Euler
integrator odefeul¹⁰, replace the ode23 line in the script in Section 2.2.2 with these two lines¹¹:
options = odeset('initialstep', 0.01);
[t,v] = odefeul(f, tspan, [v0], options);
To compute the solution with a backward Euler integrator, use odebeul¹² instead. The inquisitive
reader now probably wonders: how does odebeul solve for $v_{j+1}$ from the implicit equation
when it cannot even know how the function f was defined (all it is given is the function handle f)? The
answer is: the equation is solved numerically. Solving (systems of) nonlinear algebraic equations is
so important that MATLAB cannot fail to deliver some methods for dealing with them. MATLAB’s
fzero implements a few methods by which the root of a single nonlinear equation may be located.
It takes two arguments: a handle to the function whose zero we wish to find, and the initial guess
of the root location.
10
See: aetna/utilities/ODE/integrators/odefeul.m
11
See: aetna/Stokes/stokes6.m
12
See: aetna/utilities/ODE/integrators/odebeul.m
First we define the function
$$ F(v_{j+1}) = v_{j+1} - v_j - \Delta t\, f(t_{j+1}, v_{j+1}) $$
by moving all the terms to the left-hand side, and our goal will be to find $v_{j+1}$ such that $F(v_{j+1}) = 0$.
For that purpose we will create a handle to an anonymous function @(v1)(v1-v(j)-dt*f(t(j+1),v1))
in which we readily recognize the function $F(v_{j+1})$ (the argument $v_{j+1}$ is called v1). Finally, the
time stepping loop for the backward Euler method is written as¹³
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =fzero(@(v1)(v1-v(j)-dt*f(t(j+1),v1)),v(j));
end
where the second line inside the loop solves the implicit equation using fzero.
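To see the loop at work without the rest of the script, here is a self-contained sketch. The right-hand side f, the constants a and b, the time step, and the number of steps are assumptions made for illustration. For this linear f the implicit equation can also be solved by hand, which gives a convenient check on the values returned by fzero.

```matlab
% Backward Euler with fzero for dv/dt = f(t,v) = a - b*v, v(0) = 0.
% a, b, dt and nsteps are assumed illustrative values.
a = 1.0;  b = 10.0;
f = @(t,v) a - b*v;       % right-hand side, passed around only as a handle
dt = 0.15;  nsteps = 5;
t = zeros(1,nsteps+1);
v = zeros(1,nsteps+1);    % solution obtained with fzero
vc = zeros(1,nsteps+1);   % solution from the closed-form update, for comparison
for j = 1:nsteps
    t(j+1) = t(j) + dt;
    F = @(v1) v1 - v(j) - dt*f(t(j+1),v1);   % residual of the implicit equation
    v(j+1) = fzero(F, v(j));                 % previous value as the initial guess
    vc(j+1) = (vc(j) + dt*a)/(1 + dt*b);     % hand-solved implicit equation
end
maxdiff = max(abs(v - vc));
disp(maxdiff)
```

Using the previous value v(j) as the initial guess is a natural choice, since the solution is not expected to change much within a single step.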
Illustration 1
Note that the beam is not loaded (q = 0). The boundary conditions correspond to a beam simply
supported at one end and on the other side with a free end with a nonzero shear force. In the absence
of other forces and moments, the shear force at x = L cannot be balanced. Such a beam is not stably
supported, and therefore no solution exists for these boundary conditions.
We will handle in this book some simple boundary value problems, but most of their intricacies
are outside of the scope of this book.
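[Figure: differential element of the beam, with the deflection v(x), the distributed load q(x), the
shear forces V(x) and V(x + dx), the bending moments M(x) and M(x + dx), and the axes x, y.]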
It is relatively straightforward to add the aspect of dynamics to the equation of motion (2.18).
All terms are moved to one side of the equation, and they represent the total applied force on the
differential element of the beam. Then invoking Newton’s law of motion, we obtain
$$ \mu \ddot{v} = q - EI\, v'''' . \qquad (2.20) $$
Here µ is the mass per unit length, and v̈ is the acceleration. The equation of motion has now become
a partial differential equation, since there are now derivatives with respect to space and time. With
the time derivatives there comes the need for more “constants” of integration. It is consistent with
our physical reality that the integration constants will come from the beginning of the time interval
on which the equation of motion (2.20) holds. Therefore, they will be obtained from the so-called
initial conditions. The solution will still be subject to the boundary conditions as before, and thus we
obtain an initial boundary value problem (IBVP) for the function v(t, x) of the midline deflection.
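For instance
$$ \begin{aligned}
& \mu \ddot{v} = q - EI\, v'''' , \quad v(t, 0) = 0 , \quad v''(t, 0) = 0 , \quad v(t, L) = 0 , \quad v''(t, L) = 0 , \\
& v(0, x) = v_{t0}(x) , \quad \dot{v}(0, x) = \dot{v}_{t0}(x) .
\end{aligned} \qquad (2.21) $$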
This IBVP model describes the vibration of a simply-supported beam, whose deflection at time t = 0
(initial deflection) is described by the shape vt0 (x) and whose (initial) velocity at time t = 0 is given
as v̇t0 (x).
2.4 Model of satellite motion 19
Fig. 2.11. Satellite motion. Satellite path and the gravitational force.
formulation is straightforward. The equation of motion is a classical Newton’s law: the acceleration
of the mass of the satellite is proportional to the acting force
$$ \mathbf{F} = m \ddot{\mathbf{r}} . $$
$$ \mathbf{r}(0) = \mathbf{r}_0 , \quad \dot{\mathbf{r}}(0) = \mathbf{v}_0 . $$
Note that the mass of the satellite canceled in the equation of motion.
As before we can introduce the same formal way of writing the IVP using a single dependent
variable. Introduce the vector
$$ \mathbf{z} = \begin{bmatrix} \mathbf{r} \\ \mathbf{v} \end{bmatrix} $$
ż = f (t, z) , z(0) = z 0 .
The complete MATLAB code¹⁴ to compute the solution starts with a few definitions. Note
especially the initial conditions: the velocity v0 and the position r0.
G=6.67428 *10^-11;% cubic meters per kilogram second squared;
M=5.9736e24;% kilogram
R=6378e3;% meters
v0=[-2900;-3200;0]*0.9;% meters per second
r0=[R+20000e3;0;0];% meters
dt=0.125*60;% in seconds
te=50*3600; % seconds
Now the right-hand side function is defined (as an anonymous function, assigned to the variable f).
Clearly the MATLAB code corresponds very closely to equation (2.24).
f=@(t,z)([z(4:6);-G*M*z(1:3)/(norm(z(1:3))^3)]);
We set the initial time step (the MATLAB integrator may or may not honor it, since its step selection
is always driven by accuracy), and then we call the integrator ode45, passing the options along.
opts=odeset('InitialStep',dt);
[t,z]=ode45(f,[0,te],[r0;v0],opts);
Finally, we do some visualization in order to understand the output better than a printout of the
numbers can afford. In Figure 2.12 we compare results for this problem obtained with two MATLAB
integrators, ode45 and ode23, and with the forward and backward Euler integrators, odefeul and
odebeul. Some of the interesting features are: ode45 is nominally of higher accuracy than ode23.
However, we can see the individual curves spread out quite distinctly for ode45 while only a single
curve, at this resolution of the image, is presented for ode23. From what we know about analytical
solutions to this problem (remember Kepler?), the curve is an ellipse and the computed paths for
repeated revolutions of the satellite around the planet would ideally overlap and represent a single
curve. Therefore we have to conclude that ode23 is actually doing a better job, but not perfect (the
trajectory is not actually closed). The two Euler integrators produce altogether useless solutions.
The problem is not accuracy, it is the qualitative character of the orbits. From years and years of
observations of the motion of satellites (and from the analytical solution to this model) we know
that the energy of a satellite moving without contact with the atmosphere should be conserved to
a high degree. For the forward Euler the satellite is spiraling out (which would correspond to its
gaining energy), while for the backward Euler it is spiraling in (losing energy). A lot of energy! We
say that all these integrators fail to reproduce the qualitative character of the solution, but some
fail more spectacularly than others.
Looking at energy is a good way of judging the performance of the above integrators. The kinetic
energy of the satellite is
14
See: aetna/SatelliteMotion/satellite1.m
2.5 On existence and uniqueness of solutions to IVPs 21
Fig. 2.12. Satellite motion. Solution computed with (left to right): ode45, ode23, odefeul, odebeul.
$$ K = \frac{m \|\mathbf{v}\|^2}{2} $$
and the potential energy of the satellite is written as
$$ V = -G\, \frac{mM}{\|\mathbf{r}\|} . $$
The total energy T = K + V should be conserved for all times. Let us compute this quantity
for the solutions produced by these various integrators, and graph it. Or rather, we will graph
T/m = K/m + V/m so that the expressions do not depend on the mass of the satellite, which did
not appear in the IVP in the first place. Here is the code¹⁵ to produce Figure 2.13, which shows what
the time variation of the energies should look like (the total energy is conserved, hence a horizontal
line).
Km=0*t;
Vm=0*t;
for i=1:length(t)
Km(i)=norm(z(i,4:6))^2/2;
Vm(i)=-G*M/norm(z(i,1:3));
end
plot(t,Km,'k--'); hold on
plot(t,Vm,'k:'); hold on
plot(t,Km+Vm,'k-'); hold on
xlabel('t [s]'),ylabel('T/m,K/m,V/m [m^2/s^2]')
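The same energy bookkeeping can be tried on a small self-contained case. The sketch below is not the book's script: it uses assumed nondimensional data (G*M = 1, a circular orbit of radius 1) and a hand-written forward Euler loop in place of the toolbox integrators, and it evaluates the energies per unit mass without an explicit loop.

```matlab
% Energy per unit mass along a forward Euler solution of the satellite IVP.
% Nondimensional, assumed data: G*M = 1, circular orbit of radius 1.
GM = 1.0;
f = @(t,z) [z(4:6); -GM*z(1:3)/norm(z(1:3))^3];   % same form as the f above
z0 = [1;0;0; 0;1;0];         % position and velocity of a circular orbit
dt = 0.01;  nsteps = 1000;   % roughly 1.6 revolutions
z = zeros(nsteps+1,6);       % one row per time level, as returned by the integrators
z(1,:) = z0';
for j = 1:nsteps
    z(j+1,:) = z(j,:) + dt*f(0,z(j,:)')';   % forward Euler step
end
Km = sum(z(:,4:6).^2,2)/2;            % kinetic energy per unit mass
Vm = -GM./sqrt(sum(z(:,1:3).^2,2));   % potential energy per unit mass
Tm = Km + Vm;                         % total energy per unit mass
disp([Tm(1), Tm(end)])
```

For this setup the total energy per unit mass starts at -1/2 and drifts upward, in line with the outward spiral described for the forward Euler integrator.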
In Figure 2.14 we compare the four integrators. Only the total energy is shown, so ideally we
should see horizontal lines corresponding to perfectly conserved energy. On the contrary, we can see
that none of the four integrators conserves the total energy. Note that the vertical axes have rather
different scales. The Euler integrators perform very poorly: the change in total energy is huge. The
ode45 is significantly outperformed by ode23, but both integrators lose kinetic energy nevertheless.
Since ode45 is significantly more expensive than ode23, this example illustrates that choosing an
appropriate integrator can make the difference between success and failure.
Fig. 2.13. Satellite motion. Total energy (solid line), potential energy (dotted line), kinetic energy (dashed
line). Axes: T/m, K/m, V/m [m^2/s^2] versus t [s].
Fig. 2.14. Satellite motion. Total energy computed with (left to right, top to bottom): ode45, ode23,
odefeul, odebeul. Axes in each panel: T/m, K/m, V/m [m^2/s^2] versus t [s].
2.6 First look at accuracy 23
Fig. 2.15. Dry friction sliding of eccentric mass shaker. Sliding motion: Displacement in dotted line, velocity
in solid line. Axes: v(t) [m/s], x(t) [m] versus t [s].
with a polished steel base lying on a steel platform. The mass of the shaker is m. The harmonic
force due to the eccentric mass motion is added to the weight of the shaker to give the normal
contact force between those two as mg + A sin(ωt), and the horizontal force parallel to the platform
A cos(ωt). The IVP of the shaker sliding motion may be written in terms of its velocity as
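$$ m \dot{v} + \mu(v)\, (mg + A \sin(\omega t))\, \operatorname{sign} v = A \cos(\omega t) , \quad v(0) = v_0 . $$
Here µ(v) is the friction coefficient. For steel on steel contact we could take
$$ \mu(v) = \begin{cases} \mu_s , & \text{for } |v| \le v_{\mathrm{stick}} , \\ \mu_k \ll \mu_s , & \text{otherwise.} \end{cases} $$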
Here µs is the so-called static friction coefficient, µk is the kinetic friction coefficient, and vstick is the
sticking velocity. In words, for low sliding velocity the coefficient of friction is high, for high sliding
velocity the coefficient of friction is low.
Run the simulation script stickslip harm 2 animate and watch the animation a few times to
get a feel for the motion.16 Figure 2.15 shows the displacement and velocity of the sliding motion.
The brief stick phases should be noted. Also note the drift of the shaker due to the lift off force
which reduces the contact force and hence also the friction force when the mass is moving upwards.
Now consider Figure 2.16 which shows the velocity of the sliding motion for two slightly different
initial conditions.17 Note well the direction field and consider how quickly (in fact discontinuously)
it changes for some values of the velocity of sliding.
We take the initial velocity of the shaker to be 0.99vstick and 1.01vstick. We may expect that for
such close initial conditions the velocity curves would also stay close, but they don’t. The reason is
the discontinuous (and divergent) direction field, as especially evident in the close-up on the right.
The direction field is also discontinuous at zero velocity, but there it is convergent, and solution
curves that arrive there are forced to remain at zero (and sticking occurs).
The divergent discontinuous direction field makes the solution non-unique. As users of numerical
algorithms for IVPs we must be aware of such potential complications, and address them by careful
consideration of the formulation of the problem and interpretation of the results.
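The jump in the direction field at the sticking velocity is easy to reproduce. The sketch below encodes the friction law and the right-hand side of the shaker IVP, solved for the acceleration, and evaluates the slope just below and just above the sticking velocity at t = 0. All numerical values are assumed for illustration and are not the data of the simulation script.

```matlab
% Slope of the direction field on either side of the sticking velocity.
% All numerical values are assumed for illustration only.
m = 1.0;  g = 9.81;        % mass [kg], gravitational acceleration [m/s^2]
A = 3.0;  omega = 20.0;    % force amplitude [N], angular frequency [rad/s]
mus = 0.5;  muk = 0.1;     % static and kinetic friction coefficients
vstick = 0.01;             % sticking velocity [m/s]
mu = @(v) mus*(abs(v) <= vstick) + muk*(abs(v) > vstick);
% dv/dt obtained by solving the equation of motion for the acceleration
f = @(t,v) (A*cos(omega*t) - mu(v)*(m*g + A*sin(omega*t))*sign(v))/m;
slope_below = f(0, 0.99*vstick);   % static friction applies
slope_above = f(0, 1.01*vstick);   % kinetic friction applies
disp([slope_below, slope_above])
```

With these assumed values the slope is negative just below the sticking velocity and positive just above it, so neighboring solution curves are pushed apart, which is the divergent behavior discussed above.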
Fig. 2.16. Dry friction sliding of eccentric mass shaker. Direction field and velocity curves for two initial
conditions. Axes in both panels: v(t) [m/s] versus t [s]; the right panel is a close-up.
First, what do we mean by error? Consider for example that we want to obtain a numerical
solution to the IVP
$$ \dot{y} = f(t, y) , \quad y(0) = y_0 \qquad (2.25) $$
in the sense that our goal is the approximation of the value of the solution at some given point $t = \bar{t}$.
The difference between our computed solution $y_{\bar{t}}$ and the true answer $y(\bar{t})$ will be the true error
$$ E_t = y(\bar{t}) - y_{\bar{t}} . $$
We have already seen that the time step length apparently controls the error (we will see later exactly
how it achieves this feat). So let us compute the solution for a few decreasing time step lengths. The
result will be a table of time step length versus true error.
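| Time step length | Solution at $\bar{t}$ for $\Delta t_j$ | True error for $\Delta t_j$ |
|---|---|---|
| $\Delta t_1$ | $y_{\bar{t},1}$ | $E_{t,1} = y(\bar{t}) - y_{\bar{t},1}$ |
| $\Delta t_2$ | $y_{\bar{t},2}$ | $E_{t,2} = y(\bar{t}) - y_{\bar{t},2}$ |
| ... | ... | ... |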
The true error is a fine concept, but not very useful as knowing it implies knowing the exact
solution. In practical applications of numerical methods we will never know the true error (otherwise
why would we be computing a numerical solution?). In practice we will have to be content with the
concept of an approximate error . A useful form of approximate error is the difference of successive
solutions. So now we can construct the table of approximate errors
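| Time step length | Solution at $\bar{t}$ for $\Delta t_j$ | Approximate error for $\Delta t_j$ |
|---|---|---|
| $\Delta t_1$ | $y_{\bar{t},1}$ | (none) |
| $\Delta t_2$ | $y_{\bar{t},2}$ | $E_{a,1} = y_{\bar{t},2} - y_{\bar{t},1}$ |
| $\Delta t_3$ | $y_{\bar{t},3}$ | $E_{a,2} = y_{\bar{t},3} - y_{\bar{t},2}$ |
| $\Delta t_4$ | $y_{\bar{t},4}$ | $E_{a,3} = y_{\bar{t},4} - y_{\bar{t},3}$ |
| ... | ... | ... |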
For definiteness we will be working in this section with the IVP
$$ \dot{y} = -\frac{1}{2}\, y , \quad y(0) = 1.0 \qquad (2.26) $$
and our goal will be to compute $y(\bar{t} = 4)$. Figure 2.17 shows on the left the succession of computed
solutions with various time steps. As we can see, the two methods used, the forward and backward
Euler, approach the same value as the time step gets smaller. We call this behavior convergence.
From the computed sequence of solutions we can obtain the approximate errors as discussed
above¹⁸. The approximate errors are shown in Figure 2.17 on the right.
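A convergence study of this kind takes only a few lines. The sketch below is not the book's script: it writes out both Euler updates by hand for the IVP with k = −1/2 and y(0) = 1, assumes the sequence of halved time steps 2, 1, 1/2, ..., 1/32, and forms the approximate errors as differences of successive solutions at t = 4.

```matlab
% Forward and backward Euler approximations of y(4) for dy/dt = -y/2, y(0) = 1,
% and the approximate errors as differences of successive solutions.
k = -1/2;  y0 = 1.0;  tbar = 4;
dts = [2, 1, 1/2, 1/4, 1/8, 1/16, 1/32];   % assumed sequence of time steps
yf = zeros(size(dts));   % forward Euler values at tbar
yb = zeros(size(dts));   % backward Euler values at tbar
for i = 1:length(dts)
    dt = dts(i);
    nsteps = round(tbar/dt);
    yf(i) = y0;  yb(i) = y0;
    for j = 1:nsteps
        yf(i) = yf(i) + dt*k*yf(i);    % forward Euler step
        yb(i) = yb(i)/(1 - dt*k);      % backward Euler step, solved in closed form
    end
end
ea_f = diff(yf);   % Ea,j = y_{j+1} - y_j for the forward Euler
ea_b = diff(yb);   % the same for the backward Euler
disp([dts(1:end-1)', ea_f', ea_b'])
```

The backward Euler update is written in closed form here because the right-hand side is linear in y; for a general f a root finder would be needed.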
Fig. 2.17. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors
(on the right). Red curve backward Euler, blue curve forward Euler. Axes: y(t) (left) and Ea (right) versus ∆t.
With this data at hand we can try to ask some questions. How does the error depend on the
time step length? The curves in Figure 2.17 suggest a linear relationship. Before we look at this
question in more detail, we will consider the problem of numerical integration of the IVP again, this
time with a view towards devising a better (read: more accurate) integrator than the first two Euler
algorithms.
As discussed below equation (2.5), the governing equation of the IVP (2.25) may be subject to
integration from $t_0$ to $t$ to obtain
$$ y(t) - y(t_0) = \int_{t_0}^{t} f(\tau, y(\tau))\, d\tau . $$
To use this expression to obtain an actual solution may not be easy because of the integral on the
right-hand side. This gives us an incentive to try to approximate the right-hand side integral. One
possibility is to write
$$ \int_{t_0}^{t} f(\tau, y(\tau))\, d\tau \approx (t - t_0)\, f(t_0, y(t_0)) , $$
which gives the forward Euler method. Evaluating the integrand at the other end of the interval instead,
$$ \int_{t_0}^{t} f(\tau, y(\tau))\, d\tau \approx (t - t_0)\, f(t, y(t)) , $$
we get the backward Euler method. This should be familiar: we are approximating the “areas
under curves” (integrals of functions) by rectangles (recall the concept of the Riemann sum). A
better approximation would be achieved with trapezoids. Thus we may try
18
See: aetna/ScalarODE/scalardecayconv.m
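$$ \int_{t_0}^{t} f(\tau, y(\tau))\, d\tau \approx \frac{(t - t_0)}{2} \left[ f(t_0, y(t_0)) + f(t, y(t)) \right] , $$
which leads to the trapezoidal rule
$$ y(t) = y(t_0) + \frac{(t - t_0)}{2} \left[ f(t_0, y(t_0)) + f(t, y(t)) \right] . \qquad (2.27) $$
To obtain a method that is explicit in y(t) one may try the following trick: in the above equation
approximate y(t) in f(t, y(t)) using the forward Euler step to arrive at
$$ y(t) = y(t_0) + \frac{(t - t_0)}{2} \left[ f(t_0, y(t_0)) + f\big(t,\, y(t_0) + (t - t_0)\, f(t_0, y(t_0))\big) \right] . \qquad (2.28) $$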
We will now compute²¹ the solution to (2.26) also with the modified Euler method (2.28). Figure 2.18
shows that the modified Euler approaches the solution somewhat more quickly than both the backward
and forward Euler methods. This is especially clear when we look at the approximate errors (on the
right).
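The modified Euler method needs only two evaluations of the right-hand side per step. The following minimal sketch applies it to the IVP (2.26); the time step of 1.0 is an assumed value chosen so that the arithmetic is easy to follow by hand, and the exact solution exp(−t/2) is used only to report the true error.

```matlab
% Modified Euler for dy/dt = f(t,y) = -y/2, y(0) = 1, up to tbar = 4.
f = @(t,y) -y/2;
y0 = 1.0;  tbar = 4;
dt = 1.0;                       % assumed time step for this illustration
nsteps = round(tbar/dt);
t = 0;  y = y0;
for j = 1:nsteps
    k1 = f(t, y);               % slope at the beginning of the step
    k2 = f(t + dt, y + dt*k1);  % slope at the forward Euler prediction
    y = y + dt/2*(k1 + k2);     % advance with the average of the two slopes
    t = t + dt;
end
err = exp(-2) - y;              % true error at tbar (exact solution is exp(-t/2))
disp([y, err])
```

For this linear problem each step multiplies the solution by 1 − ∆t/2 + ∆t²/8, which equals 0.625 for the assumed step.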
Fig. 2.18. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors
(on the right). Red curve backward Euler, blue curve forward Euler, black curve modified Euler. Horizontal
axis in both panels: ∆t.
The errors seem to decrease roughly linearly for the backward and forward Euler methods. The
modified Euler method errors decrease along some curve. Can we find out what kind of curve? Could
it be a polynomial? That would be the first thing to try, because polynomials tend to be very useful
in this way (viz the Taylor series later in the book).
19
See: aetna/utilities/ODE/integrators/odetrap.m
20
See: aetna/utilities/ODE/integrators/odemeul.m
21
See: aetna/ScalarODE/scalardecayconv1.m
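We will assume that the approximate errors depend on the time step length as
$$ E_a(\Delta t) \approx C\, \Delta t^{\beta} . \qquad (2.29) $$
The exponent β is unknown. One clever way in which we can use data to find out the value of β
relies on taking logarithms on both sides of the equation
$$ \log E_a \approx \log \left( C\, \Delta t^{\beta} \right) , $$
which yields
$$ \log E_a \approx \log C + \beta \log \Delta t . $$
This is an expression for a straight line on a plot with logarithmic axes. The slope of the line would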
be β. Figure 2.19 shows the approximate errors re-plotted on the log-log scale. Also shown are two
red triangles. The hypotenuse in those triangles has slope 1 or 2 respectively. This may be compared
with the plotted data. The forward and backward Euler approximate errors (at least for the smaller
time step lengths) appear to lie along a straight line with slope equal to one. The modified Euler
approximate errors on the other hand are close to a straight line with slope equal to two. Therefore,
we may hypothesize that the approximate errors behave as $E_a(\Delta t) \approx C \Delta t$ for the forward and
backward Euler, and as $E_a(\Delta t) \approx C \Delta t^2$ for the modified Euler. The exponent β is called the rate
of convergence (convergence rate). The higher the rate of convergence, the faster the errors drop.
Later in the book we will use mathematical analysis tools (the Taylor series) to understand where
the convergence rate is coming from.
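Instead of reading the slope off a triangle, one can fit a straight line to the logarithms of the data. The sketch below does this with polyfit for the forward Euler approximate errors of the IVP (2.26) obtained with time steps 2, 1, 1/2, ..., 1/32. Two choices here are assumptions: each error is paired with the larger of the two time steps that produced it (for time steps related by a constant factor this shifts the line but does not change its slope), and only the three smallest time steps are used in the fit.

```matlab
% Slope of the approximate errors on the log-log scale by a least-squares line fit.
% Data: forward Euler approximate errors for dy/dt = -y/2, y(0) = 1, at t = 4.
ea_f = [6.2500e-2 3.7613e-2 1.7954e-2 8.7217e-3 4.2952e-3 2.1311e-3];
dts  = [2, 1, 1/2, 1/4, 1/8, 1/16, 1/32];
idx = 4:6;                               % assumed choice: the smallest time steps only
p = polyfit(log(dts(idx)), log(ea_f(idx)), 1);
beta = p(1);                             % slope of the fitted line = rate estimate
disp(beta)
```

Restricting the fit to the smallest time steps keeps the points with the largest time steps, which visibly depart from the straight line, from distorting the estimate.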
Fig. 2.19. The approximate errors from Figure 2.18 re-plotted on the log-log scale. Red curve backward
Euler, blue curve forward Euler, black curve modified Euler. Axes: Ea versus ∆t, both logarithmic.
What about the first few points in the computed series, which fail to lie along a straight line on the
log-log plot? We have assumed that the errors depended only on a single power of the time step length.
This is a good assumption for very small time step lengths (the so-called asymptotic range, where
∆t → 0), but for larger time step lengths (the so-called pre-asymptotic range) the error more
likely depends on a mixture of powers of the time step length. Then the data points will not lie on
a straight line on the log-log plot.
Plotting the data as in Figure 2.19 is very useful in that it gives us the convergence rate. Could
we use this information to get a handle on the true error? As explained above, we assumed that the
approximate error depended on the time step length through the monomial relation (2.29). Using
a simple trick, we can relate the approximate errors to the true errors
$$ E_{a,j} = y_{\bar{t},j+1} - y_{\bar{t},j} = y_{\bar{t},j+1} - y(\bar{t}) + y(\bar{t}) - y_{\bar{t},j} , $$
where $-E_{t,j+1} = y_{\bar{t},j+1} - y(\bar{t})$ and $E_{t,j} = y(\bar{t}) - y_{\bar{t},j}$, and so we have
$$ E_{a,j} = E_{t,j} - E_{t,j+1} . $$
Then if the approximate error on the left behaves as the monomial (2.29), so will the true
errors on the right. There are two parameters in (2.29), the rate β and the constant C. We have
estimated the rate by plotting the approximate errors on a log-log scale. Now we can estimate the
constant C by taking
$$ E_{t,j} \approx C\, \Delta t_j^{\beta} $$
to obtain
$$ E_{a,j} \approx C\, \Delta t_j^{\beta} - C\, \Delta t_{j+1}^{\beta} \qquad (2.31) $$
and
$$ C = \frac{E_{a,j}}{\Delta t_j^{\beta} - \Delta t_{j+1}^{\beta}} . $$
For instance, for the forward Euler we have obtained the following approximate errors
>> ea_f =
6.2500e-2 3.7613e-2 1.7954e-2 8.7217e-3 4.2952e-3 2.1311e-3
>> dts =
2, 1, 1/2, 1/4, 1/8, 1/16, 1/32
and we have estimated from Figure 2.19 that the convergence rate was β = 1. Therefore, we can
estimate the constant using (for example) Ea,3 as
>> C=ea_f(3)/(dts(3)-dts(4))
C =
0.071816687928745
This is useful: we can now predict for instance how small a time step will be required to obtain the
solution within the absolute tolerance $10^{-4}$:
$$ E_{t,j} \approx C\, \Delta t_j^{\beta} \le 10^{-4} \quad \Rightarrow \quad \Delta t_j \le \left( \frac{10^{-4}}{C} \right)^{1/\beta} $$
>> 1e-4/(ea_f(3)/(dts(3)-dts(4)))
ans =
0.001392434027300
Indeed, we find that for time step length 1/1024 < 0.00139 the true error is computed as 0.000066 <
$10^{-4}$.
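The prediction can be checked directly, because for the IVP (2.26) the exact solution y(t) = exp(−t/2) is available. The following sketch (a hand-written forward Euler loop, not the toolbox integrator) computes the true error at t = 4 for the time step 1/1024.

```matlab
% Check of the predicted time step: forward Euler for dy/dt = -y/2, y(0) = 1.
k = -1/2;  y0 = 1.0;  tbar = 4;
dt = 1/1024;                    % below the predicted limit of about 0.00139
nsteps = round(tbar/dt);
y = y0;
for j = 1:nsteps
    y = y + dt*k*y;             % forward Euler step
end
Et = exp(k*tbar) - y;           % true error, using the exact solution y0*exp(k*t)
disp(Et)
```

Such a check is of course only possible when the exact solution is known; in practice the estimate from the approximate errors has to be trusted on its own.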
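If we do not have an estimate of the convergence rate, we can try solving for it. Provided we
have at least two approximate errors, let us say $E_{a,1}$ and $E_{a,2}$, we can write (2.31) twice as
$$ E_{a,1} = C \left( \Delta t_1^{\beta} - \Delta t_2^{\beta} \right) , \qquad E_{a,2} = C \left( \Delta t_2^{\beta} - \Delta t_3^{\beta} \right) . $$
This system of two nonlinear equations will allow us to solve for both unknowns C and β. In general
a numerical solution of this nonlinear system of equations will be required. Only if the time steps
are always related by a constant factor so that
$$ \Delta t_{j+1} = \alpha\, \Delta t_j , $$
where α is a fixed constant, will we be able to solve the system analytically: First we write
$$ \frac{E_{a,1}}{\Delta t_1^{\beta} - \Delta t_2^{\beta}} = \frac{E_{a,2}}{\Delta t_2^{\beta} - \Delta t_3^{\beta}} . \qquad (2.33) $$
Now we realize that
$$ \Delta t_1^{\beta} - \Delta t_2^{\beta} = \Delta t_1^{\beta} - (\alpha \Delta t_1)^{\beta} = \Delta t_1^{\beta} \left( 1 - \alpha^{\beta} \right) $$
and also
$$ \Delta t_2^{\beta} - \Delta t_3^{\beta} = \Delta t_2^{\beta} - (\alpha \Delta t_2)^{\beta} = \Delta t_2^{\beta} \left( 1 - \alpha^{\beta} \right) , $$
so that the common factor $(1 - \alpha^{\beta})$ cancels in (2.33), leaving
$$ \frac{E_{a,1}}{\Delta t_1^{\beta}} = \frac{E_{a,2}}{\Delta t_2^{\beta}} . $$
This is then easily solved for the convergence rate by taking a logarithm of both sides to give
$$ \beta = \frac{\log E_{a,1} - \log E_{a,2}}{\log \Delta t_1 - \log \Delta t_2} . \qquad (2.34) $$
The second unknown C then follows
$$ C = \frac{E_{a,1}}{\Delta t_1^{\beta} - \Delta t_2^{\beta}} . \qquad (2.35) $$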
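As a sketch of how the two formulas might be used, the script below applies them to the forward Euler data listed earlier in this section. The pair of approximate errors is picked from the small-time-step end of the series; that choice (j = 4) is an assumption, as is the tolerance of 1e-4 used to predict a time step.

```matlab
% Convergence rate and constant from two approximate errors, formulas (2.34)-(2.35).
% Data: forward Euler approximate errors and time steps listed earlier in this section.
ea_f = [6.2500e-2 3.7613e-2 1.7954e-2 8.7217e-3 4.2952e-3 2.1311e-3];
dts  = [2, 1, 1/2, 1/4, 1/8, 1/16, 1/32];
j = 4;                                   % assumed choice within the asymptotic range
beta = (log(ea_f(j)) - log(ea_f(j+1)))/(log(dts(j)) - log(dts(j+1)));
C = ea_f(j)/(dts(j)^beta - dts(j+1)^beta);
dtmax = (1e-4/C)^(1/beta);               % time step predicted for a tolerance of 1e-4
disp([beta, C, dtmax])
```

The estimated rate comes out close to, but not exactly equal to, one; repeating the calculation with a different pair of errors gives an idea of how sensitive the estimate is.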
The described procedure for the estimation of the parameters of the relation (2.29) is a special
case of the so-called Richardson extrapolation. When the data for the extrapolation is “nice”,
this procedure is very useful. The data may not be nice: for instance for some reason we haven’t
reached the asymptotic range. Or, perhaps there is a lot of noise in the data. Then the extrapolation
procedure cannot work. It is a good idea to always visualize the approximate error on the log-log
graph. If the approximate error data does not appear to lie on a straight line, the extrapolation
should not be attempted.
An important note: the above Richardson extrapolation can work only for results obtained with
fixed step-length integrators. The step length is the parameter in the extrapolation formula. It varies
from step to step when the MATLAB adaptive step length integrators (e.g. ode23) are used,
which the formula cannot accommodate, and the extrapolation is then not applicable.
which means that from y(t0 ) we follow a slope which is determined as a linear combination of slopes
kj evaluated at various points within the current time step
where ∆t = (t − t0 ), and the coefficients asj , bj , cj are determined in various ingenious ways so that
the method has the best accuracy and stability properties.
Figure 2.20 shows graphically an example of such an explicit Runge-Kutta method, the modified
Euler method. It can be written in the above notation as
$$ \begin{aligned}
y(t) &= y(t_0) + \Delta t \left( \tfrac{1}{2} k_1 + \tfrac{1}{2} k_2 \right) \\
k_1 &= f(t_0 + 0 \times \Delta t,\; y(t_0)) \\
k_2 &= f(t_0 + 1 \times \Delta t,\; y(t_0) + 1 \times \Delta t\, k_1)
\end{aligned} \qquad (2.38) $$
Fig. 2.20. The modified Euler algorithm as a graphical schematic: The solution at the time $t = t_0 + \Delta t$ is
arrived at from $y(t_0)$ using the average slope $\tfrac{1}{2} k_1 + \tfrac{1}{2} k_2$
The coefficients of Runge-Kutta methods $a_{sj}$, $b_j$, $c_j$ are usually presented in the form of the
so-called Butcher tableau
$$ \begin{array}{c|c} c & a \\ \hline & b \end{array} \qquad (2.39) $$
where the coefficients are elements of the three matrices. For the explicit RK methods $c_1 = 0$ always,
and the matrix a is strictly lower triangular. The modified Euler method is an RK method with s = 2
and the tableau
$$ \begin{array}{c|cc} 0 & 0 & 0 \\ 1 & 1 & 0 \\ \hline & \tfrac{1}{2} & \tfrac{1}{2} \end{array} $$
We must mention the fourth-order explicit Runge-Kutta. It represents perhaps the most common
RK method. An improvement of this method in the form of the explicit Runge-Kutta (4,5) pair
of Dormand and Prince (a combination of fourth- and fifth-order methods) appears in
MATLAB in the ode45 integrator. The tableau of the fourth-order explicit Runge-Kutta with a
fixed time step is
$$ \begin{array}{c|cccc}
0 & 0 & 0 & 0 & 0 \\
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 \\
\tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\ \hline
 & \tfrac{1}{6} & \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{6}
\end{array} $$
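A tableau translates almost literally into code. The sketch below takes one step of an explicit Runge-Kutta method from the arrays c, a, b, here filled with the fourth-order tableau. The scalar test problem ẏ = −y/2 and the step length are assumed for illustration, and the loop is a generic sketch rather than the implementation of any particular integrator.

```matlab
% One step of an explicit Runge-Kutta method defined by its Butcher tableau,
% here the fourth-order tableau. Test problem and step length are assumed.
c = [0; 1/2; 1/2; 1];
a = [0   0   0 0
     1/2 0   0 0
     0   1/2 0 0
     0   0   1 0];
b = [1/6 1/3 1/3 1/6];
f = @(t,y) -y/2;
t0 = 0;  y0 = 1.0;  dt = 1.0;
s = length(b);                 % number of stages
ks = zeros(1,s);               % slopes k_1, ..., k_s (scalar problem)
for i = 1:s
    % a is strictly lower triangular, so only already computed slopes contribute
    ks(i) = f(t0 + c(i)*dt, y0 + dt*(a(i,:)*ks'));
end
y1 = y0 + dt*(b*ks');          % weighted combination of the slopes
disp([y1, exp(-dt/2)])
```

Swapping in the arrays of the modified Euler tableau (c = [0; 1], a = [0 0; 1 0], b = [1/2 1/2]) turns the same loop into the modified Euler step.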
Illustration 2
Figure 2.21 shows the same results as Figure 2.19, but supplemented²³ with results for the fourth-order
Runge-Kutta integrator oderk4. Clearly the fourth-order method is much more accurate, and by drawing a
triangle in the log-log scale plot we easily ascertain that RK4 converges with a convergence rate of 4.
Fig. 2.21. The approximate errors plotted on the log-log scale. Red curve backward Euler, blue curve
forward Euler, black curve with “o” markers modified Euler, black curve with “x” markers fourth-order
Runge-Kutta oderk4. Axes: Ea versus ∆t, both logarithmic.
22
See: aetna/utilities/ODE/integrators/oderk4.m
23
See: aetna/ScalarODE/scalardecayconv2.m
To round off this discussion we will consider the adaptive-step Runge-Kutta method implemented
in the MATLAB ode23 integrator. The tableau reads
$$ \begin{array}{c|cccc}
0 & 0 & 0 & 0 & 0 \\
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 \\
\tfrac{3}{4} & 0 & \tfrac{3}{4} & 0 & 0 \\
1 & \tfrac{2}{9} & \tfrac{1}{3} & \tfrac{4}{9} & 0 \\ \hline
 & \tfrac{2}{9} & \tfrac{1}{3} & \tfrac{4}{9} & 0 \\
 & \tfrac{7}{24} & \tfrac{1}{4} & \tfrac{1}{3} & \tfrac{1}{8}
\end{array} $$
The array b with two rows instead of one makes the method so useful: the solution at the time
$t = t_0 + \Delta t$ may be computed in two different ways from the slopes $k_1, ..., k_4$. One of these (the first
row) is third-order accurate and the other (the second row) is second-order accurate. The difference
between them can be used to guide the change of the time step to maintain accuracy.
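The mechanics of the two-row array b can be seen in a single step written out by hand. In the sketch below the coefficients are those of the tableau above; the scalar test problem ẏ = −y/2, the step length, and the use of the plain absolute difference as an error indicator are assumptions made for illustration, not a description of how ode23 itself controls the step.

```matlab
% One step with the ode23 tableau: two solutions from the same slopes and
% their difference as an error indicator. Test problem and step are assumed.
f = @(t,y) -y/2;
t0 = 0;  y0 = 1.0;  dt = 1.0;
k1 = f(t0, y0);
k2 = f(t0 + dt/2,   y0 + dt*(1/2)*k1);
k3 = f(t0 + 3*dt/4, y0 + dt*(3/4)*k2);
ya = y0 + dt*(2/9*k1 + 1/3*k2 + 4/9*k3);            % first row of b
k4 = f(t0 + dt, ya);                                % fourth slope, evaluated at ya
yb = y0 + dt*(7/24*k1 + 1/4*k2 + 1/3*k3 + 1/8*k4);  % second row of b
est = abs(ya - yb);                                 % difference of the two solutions
disp([ya, yb, est, exp(-dt/2)])
```

Note that the fourth row of a coincides with the first row of b, so the fourth slope is evaluated at the solution ya and can be reused as the first slope of the next step.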
Suggested experiments
1. The integrator ode87fixed²⁴ uses a high-order Runge-Kutta formula and a fixed time step length.
Repeat the above exercise with this integrator, and estimate its convergence rate.
24
See: aetna/utilities/ODE/integrators/ode87fixed.m
3 Preservation of solution features: stability
Summary
1. In this chapter we investigate the central role that the eigenvalue problem plays in the design
of ODE integrators. The goal is to preserve important solution features. This is referred to as
the stability of the integration algorithm. Main idea: stability can be investigated on the model
equation of the scalar linear ODE.
2. For the model IVP, the formula of a particular integrator can be written down so that the new
value of the solution is expressed as a multiple of the solution value in the previous step. Main
idea: the amplification factor depends on the product of the eigenvalue and the time step, and
therefore the “shape” of the numerical solution is determined by these quantities. The eigenvalue
is given as data, the time step can be (needs to be) chosen by the user.
3. The scalar linear ODE with a complex coefficient is equivalent to two coupled real equations in
two real variables. Main idea: the ODE with a complex coefficient describes harmonic oscillations.
4. For the model IVP with a complex coefficient, the same procedure that leads to an amplification
factor is used. Main idea: the amplification factor and the solution now “live” in the complex
plane. The magnitude of the amplification factor again is seen to play a role in the stability
investigation.
5. Understanding the amplification factors is aided by appropriate diagrams. Main idea: The preser-
vation of solution features is illustrated by a complete stability diagram for the various methods.
The magnitude of the amplification factor may also be visualized as a surface above the complex
plane.
ẏ = ky , y(0) = y0 . (3.1)
For the moment we shall consider k real. As an example take k = −1/2, with an arbitrary initial
condition
$$ \dot{y} = -\frac{1}{2}\, y , \quad y(0) = 1.3361 . \qquad (3.2) $$
Substituting a trial solution of exponential form,
$$ y = B \exp(\lambda t) , $$
into the differential equation gives
$$ \dot{y} = B \lambda \exp(\lambda t) = ky = kB \exp(\lambda t) . $$
The constant B ≠ 0 (otherwise we don’t have a solution!), and for the above to hold for all times t
we must require
$$ B(\lambda - k) = 0 . $$
The above equation is called the eigenvalue problem, and this is definitely not the last time we
will encounter this type of equation in the present book. Here λ is the eigenvalue, and B is the
eigenvector. The solution is easy: we see that λ = k. Any B ≠ 0 will satisfy the eigenvalue equation.
We could determine B so that the initial value problem (3.1) was solved by substituting into the
initial condition to obtain $B = y_0$.
The solution to the IVP (3.2) is drawn with a solid line in Figure 3.1. It is a “decaying” solution.
In the same figure there’s also a “growing” solution (for k = 1/2), and a constant solution (for
k = 0).
[Figure 3.1: y(t) versus t for the decaying, growing, and constant solutions.]
to obtain
$$ y_{j+1} = (1 + \Delta t\, k)\, y_j . $$
We would like to see a monotonically decaying numerical solution, $|y_{j+1}| < |y_j|$, so the so-called
amplification factor $(1 + \Delta t k)$ must be positive and its magnitude must be less than one
$$ |1 + \Delta t k| < 1 . $$
If this condition is satisfied but $(1 + \Delta t k) < 0$, the solution decreases in magnitude, but changes
sign from step to step. Finally, $(1 + \Delta t k) = 0$ implies that the solution drops to zero in one step
and stays zero. Recall that for our example k = −1/2. Correspondingly, in Figure 3.2 we see¹ a
monotonically decaying solution for ∆t = 1.0 (|1 + ∆tk| = |1 + 1.0 × (−1/2)| = 1/2 < 1), a solution
dropping to zero in one step for ∆t = 2.0, a solution decaying, but non-monotonically, for ∆t = 3.0
(as 1 + ∆tk = 1 + 3.0 × (−1/2) = −1/2), and finally for ∆t = 4.0 we get a solution which oscillates
between ±y0. Note that for an even bigger time step we would get an oscillating solution which
would increase in amplitude rather than decrease.
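The four cases can be reproduced with a few lines of code. The sketch below is not the book's script: it applies the forward Euler recursion to the example with k = −1/2 and y(0) = 1.3361 for the four time step lengths simultaneously, one row of the array Y per time step. The number of steps is an assumed value, just large enough to show the pattern.

```matlab
% Forward Euler applied to dy/dt = k*y, y(0) = 1.3361, for several time steps.
k = -1/2;  y0 = 1.3361;
dts = [1.0, 2.0, 3.0, 4.0];
nsteps = 6;                        % assumed number of steps, enough to see the pattern
amp = 1 + dts*k;                   % amplification factors, one per time step length
Y = zeros(length(dts), nsteps+1);  % one row of solution values per time step length
Y(:,1) = y0;
for j = 1:nsteps
    Y(:,j+1) = amp'.*Y(:,j);       % y_{j+1} = (1 + dt*k)*y_j
end
disp(amp)
disp(Y)
```

The rows of Y show, in turn, monotone decay, a drop to zero in one step, decay with alternating sign, and an oscillation of constant amplitude.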
Fig. 3.2. Forward Euler solutions to (3.2) for time steps (left to right) ∆t = 1.0, 2.0, 3.0, 4.0. Axes in each
panel: y versus t.
In summary, for a negative coefficient k < 0 in the model IVP (3.1) we can reproduce the correct
shape of the solution curve with the forward Euler method provided
This is visualized in Figure 3.3. On the top we show the real line, the thick part indicates where
the eigenvalues λ = k are located when they are negative. On the bottom we show the real line
for the quantity λ∆t. The thick segment corresponds to equation (3.5). The filled circle indicates
“included”, the empty circle indicates “excluded”. The meaning of (3.5) is expressed in words as:
¹ See: aetna/ScalarODE/scalarsimple.m
36 3 Preservation of solution features: stability
for a negative λ = k the forward Euler will reproduce the correct decaying behavior provided the
quantity λ∆t lands in the segment −1 ≤ λ∆t < 0 as indicated by the arrow.
The time step lengths that satisfy equation (3.5) are called stable. If we need to be precise, we
would say that such time step lengths are stable for the forward Euler applied to IVPs with decaying
solutions.
[Figure: two copies of the real axis with ticks at −1, 0, 1; the top axis is labeled λ, the bottom axis λ∆t.]
Fig. 3.3. Forward Euler stability when applied to the model problem (3.1) for negative eigenvalues. The
given coefficient λ is located in the negative part of the real axis on top. The time step ∆t needs to be chosen
to place the product λ∆t in the unit interval −1 ≤ λ∆t < 0 on the axis at the bottom.
The time step lengths that satisfy equation (3.8) are called stable. If we need to be precise, we
would say that such time step lengths are stable for the backward Euler applied to IVPs with growing
solutions. We see that the situation mirrors the one discussed for the forward Euler applied
to decaying solutions. Figure 3.4, which corresponds to (3.8), illustrates this clearly: it
is quite literally a mirror image of Figure 3.3 for the forward Euler and k < 0.
[Figure: two copies of the real axis with ticks at −1, 0, 1; the top axis is labeled λ, the bottom axis λ∆t.]
Fig. 3.4. Backward Euler stability when applied to the model problem (3.1) for positive eigenvalues. The
given coefficient λ is located in the positive part of the real axis on top. The time step ∆t needs to be chosen
to place the product λ∆t in the unit interval 0 < λ∆t ≤ 1 on the axis at the bottom.
(1 + ∆tk)
is positive for all time step lengths, and also (1 + ∆tk) > 1. Hence we see that any time step length
is stable for the forward Euler method applied to an IVP with a growing solution. This is again a
mirror image, this time of the backward Euler applied to an IVP with a decaying solution.
y = y0 exp(kt) .
Note that the complex exponential may be expressed in terms of sine and cosine
The solution is now to be sought with a time dependence in the form of a complex exponential. Let
us write the solution in terms of the real and imaginary parts
y = Rey + i Imy ,
which can be substituted into the differential equation, together with k = Rek + i Imk, to give
Expanding we obtain
and in order to get a real zero on the right-hand side, we require that both brackets vanish identically,
and we obtain a system of coupled real differential equations
Re ẏ = RekRey − ImkImy
Im ẏ = RekImy + ImkRey . (3.9)
Also, the initial condition y(0) = y0 is equivalent to
So we see that to solve (3.1) with k complex is equivalent to solving the real IVP (profitably written
in matrix form)
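\[
\begin{bmatrix} \mathrm{Re}\,\dot{y} \\ \mathrm{Im}\,\dot{y} \end{bmatrix}
=
\begin{bmatrix} \mathrm{Re}\,k & -\mathrm{Im}\,k \\ \mathrm{Im}\,k & \mathrm{Re}\,k \end{bmatrix}
\begin{bmatrix} \mathrm{Re}\,y \\ \mathrm{Im}\,y \end{bmatrix} ,
\qquad
\begin{bmatrix} \mathrm{Re}\,y(0) \\ \mathrm{Im}\,y(0) \end{bmatrix}
=
\begin{bmatrix} \mathrm{Re}\,y_0 \\ \mathrm{Im}\,y_0 \end{bmatrix} .
\tag{3.10}
\]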
The method of Section 3.2 can be used again, but with a little modification since we now have a
matrix differential equation instead of a scalar ODE. We will seek the solution to (3.10) as
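\[
\begin{bmatrix} \mathrm{Re}\,y \\ \mathrm{Im}\,y \end{bmatrix}
= \exp(\lambda t) \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} .
\]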
and
\[
K = \begin{bmatrix} \mathrm{Re}\,k & -\mathrm{Im}\,k \\ \mathrm{Im}\,k & \mathrm{Re}\,k \end{bmatrix}
\tag{3.11}
\]
ẇ = Kw , w(0) = w 0 . (3.12)
w = exp(λt)z , (3.13)
where z is a time independent vector. Performing the time differentiation, we obtain
ẇ = λ exp(λt)z = K exp(λt)z .
Collecting the terms, we get, entirely analogously to (3.3),
exp(λt) (Kz − λz) = 0 .
To satisfy this equation for all times, the following conditions must be true
Kz = λz . (3.14)
This is the so-called matrix eigenvalue problem. The vector z is the eigenvector, the scalar
λ is the eigenvalue, and both may be complex. The eigenvalue problem (EP) is highly nonlinear:
for larger matrices it is impossible to solve analytically and quite difficult to solve
numerically.
Looking at (3.14) we realize that there are too many unknowns here: λ , z1 , and z2 (three), and
not enough equations (two). We need one more equation, and to get it we rewrite (3.14) as
(K − λ1)z = 0 ,
where 1 is an identity matrix. This is a system of linear equations for the vector z with a zero
right-hand side. In order for the above equation to have a nonzero solution, the square matrix
K − λ1
must be singular. (A linear combination of the columns of K − λ1 with coefficients that are not all
zero, namely the components of z, yields a zero vector, which is just another way of saying that the
columns are linearly dependent. Hence, the matrix is singular.)
We may put the fact that K − λ1 is singular differently by referring to its determinant
det (K − λ1) = 0 . (3.15)
This is the additional equation that makes the solution of the eigenvalue problem possible (the
characteristic equation).
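The singularity condition can be checked numerically. The sketch below builds K for one illustrative coefficient (k = −0.2 + 3i is an arbitrary assumption, not a value from the text), lets eig compute the eigenpairs, and then evaluates both det(K − λ1) and the residual of Kz = λz.

```matlab
% Illustrative coefficient; any complex k could be used here (assumption).
k = -0.2 + 3i;
K = [real(k), -imag(k); imag(k), real(k)];
[Z, D] = eig(K);                 % columns of Z: eigenvectors; diag(D): eigenvalues
for j = 1:2
    lam = D(j,j);
    z = Z(:,j);
    detval = det(K - lam*eye(2));    % vanishes up to round-off
    resid  = norm(K*z - lam*z);      % residual of K*z = lambda*z
    fprintf('lambda = %s, |det| = %g, residual = %g\n', num2str(lam), abs(detval), resid);
end
```

Both printed quantities should be at the level of round-off for each of the two eigenvalues.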
Illustration 1
For a 2 × 2 matrix the polynomial is quadratic, and with each additional row and column the
order of the polynomial goes up by one. As a consequence, to solve the eigenvalue problem means to
find the roots of the characteristic polynomial. This is a highly nonlinear and unstable computation,
which for larger matrices must be done numerically, since no general formulas in radicals exist for
the roots of polynomials of degree five or higher.
Illustration 2
Display the characteristic polynomial of the matrix [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1].
The MATLAB symbolic solution
>> syms lambda real
>> det( [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1]-lambda*eye(4))
ans =
lambda+6*lambda^2-5*lambda^3-2+lambda^4
>> ezplot(ans)
>> grid on
yields a curve similar to the one shown in Figure 3.5. One has to zoom in to be able to estimate
where the roots lie. There are going to be four of them, corresponding to the highest power λ^4.
[Figure 3.5: plot of p(λ) = λ + 6λ^2 − 5λ^3 − 2 + λ^4 against λ, −3 ≤ λ ≤ 5.]
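Rather than zooming into the plot, the roots can be computed numerically. The following sketch uses the coefficients of the polynomial printed above and compares its roots with the eigenvalues returned by eig; it is offered as a cross-check, not as the book's procedure.

```matlab
A = [2,-1,0,0; -1,1,-1,0; 0,-1,1,-1; 0,0,-1,1];
% Coefficients of lambda^4 - 5*lambda^3 + 6*lambda^2 + lambda - 2, highest power first
p = [1, -5, 6, 1, -2];
r = sort(real(roots(p)));   % A is symmetric, so all four roots are real
e = sort(eig(A));           % eigenvalues computed directly from the matrix
disp([r, e])                % the two columns should agree
disp(poly(A))               % poly returns the same coefficients from the matrix
```

The two columns agree to round-off, and λ = 2 is one of the roots, as substitution into the polynomial confirms.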
Illustration 3
We may familiarize ourselves with the concepts of the EP solutions by looking at some simple 2 × 2
matrices.
• Zero matrix. The characteristic polynomial is
\[
\det\left( \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = \lambda^2 = 0
\]
which has the double root λ1,2 = 0. Evidently any nonzero vector v is an eigenvector, since the zero
matrix times v equals the scalar zero times v:
\[
\mathbf{0}\, v = 0 \times v .
\]
The MATLAB solution agrees with our analytical consideration (columns of V are the eigenvectors,
the diagonal elements of D are the eigenvalues). The eigenvectors we obtained are particularly
nice because they are orthogonal.
>> [V,D]=eig([0,0;0,0])
V =
1 0
0 1
D =
0 0
0 0
3.7 Complex IVP 41
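• Identity matrix. The characteristic polynomial is
\[
\det\left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = (1 - \lambda)^2 = 0
\]
which has the double root λ1,2 = 1. Again any nonzero vector is an eigenvector. The MATLAB solution
agrees with our analytical consideration (note that the eigenvectors are again orthonormal).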
>> [V,D]=eig([1,0;0,1])
V =
1 0
0 1
D =
1 0
0 1
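• Diagonal matrix. The characteristic polynomial is
\[
\det\left( \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = (a - \lambda)(b - \lambda) = 0
\]
which has the roots λ1 = a and λ2 = b. The eigenvectors may be calculated by substituting the
eigenvalue (let us start with λ1)
\[
\begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} v_1 = \lambda_1 v_1 = a v_1
\]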
The roots λ1 and λ2 need to be solved for from this quadratic equation. The symbolic
MATLAB expression below evaluates the determinant
>> syms a b c d lambda real
>> det([a,d;c,b]-lambda*[1,0;0,1])
ans =
a*b-a*lambda-lambda*b+lambda^2-d*c
A helpful observation usually made in a linear algebra course is that the trace of the 2 × 2 matrix
(i.e. the sum of the diagonal elements) is equal to the sum of the eigenvalues a + b = λ1 + λ2 , and
the determinant of the matrix is equal to the product of the eigenvalues ab − cd = λ1 λ2 . We can
easily verify this symbolically in MATLAB by first computing the eigenvalues and eigenvectors
(symbolically)
syms a b c d lambda real
[V,D]=eig([a,d;c,b])
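The same two identities can also be spot-checked numerically, without the Symbolic Math Toolbox. The entries below are arbitrary illustrative values (an assumption, not taken from the text).

```matlab
% Arbitrary illustrative entries (hypothetical values).
a = 2; b = -1; c = 0.5; d = 3;
M = [a, d; c, b];
lam = eig(M);
fprintf('trace       = %g, sum of eigenvalues     = %g\n', a + b, sum(lam));
fprintf('determinant = %g, product of eigenvalues = %g\n', a*b - c*d, prod(lam));
```

A single numerical case proves nothing in general, but it is a quick sanity check of the symbolic result.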
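So that
\[
\begin{bmatrix} -1 & -1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
These two equations are linearly dependent, and we cannot determine both elements z11, z21
from a single equation. Choosing for instance z11 = 1 gives (one possible) solution for the first
eigenvector
\[
\begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
\]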
λ1 + λ2 = 2a , λ1 λ2 = a^2 + b^2
and the identity (a + i b)(a − i b) = a^2 + b^2 we can see that the eigenvalues are in fact
λ1 = a + i b , λ2 = a − i b .
So the diagonal elements of the matrix are the real parts of the eigenvalues, and the off-diagonal
elements give the (real-valued) imaginary parts of the eigenvalues.
3.10 Case of Rek = 0 and Imk ≠ 0 43
Suggested experiments
1. When we compute the eigenvector by solving the system with the singular matrix we have to
choose one element of the vector, apparently arbitrarily. Discuss whether the choice is truly
arbitrary. For instance, could we choose z11 = 0?
det (K − λ1) = 0 ,
which yields
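\[
\det\left( \begin{bmatrix} \mathrm{Re}\,k & -\mathrm{Im}\,k \\ \mathrm{Im}\,k & \mathrm{Re}\,k \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = (\mathrm{Re}\,k - \lambda)^2 + (\mathrm{Im}\,k)^2 = 0 .
\tag{3.16}
\]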
We know that for the scalar case the eigenvalue is λ = k = Rek + iImk. Would this eigenvalue satisfy
also the characteristic equation above? Substituting and simplifying we obtain:
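The substitution can be written out explicitly. With λ = Rek + i Imk the first bracket of (3.16) reduces to −i Imk, so that

```latex
\[
(\mathrm{Re}\,k - \lambda)^2 + (\mathrm{Im}\,k)^2
= (-\,i\,\mathrm{Im}\,k)^2 + (\mathrm{Im}\,k)^2
= -(\mathrm{Im}\,k)^2 + (\mathrm{Im}\,k)^2
= 0 .
\]
```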
It does! That is not all, however. Numbers whose real parts are equal and whose imaginary parts have
equal magnitude but opposite signs are called complex conjugate (see Figure 3.6). The characteristic
equation (3.16) also has the root λ = k̄ = Rek − iImk, where the overbar means “complex conjugate”.
This holds because (iImk)^2 = (−iImk)^2. The eigenvalue problem in Section 3.2 is saying the same
thing, since forming a complex conjugate of the equation
\[
\overline{B(\lambda - k)} = \bar{B}(\bar{\lambda} - \bar{k}) = 0
\]
becomes a multiple of the identity. The eigenvalues are λ1,2 = Rek. Depending on the numerical
value of Rek both equations describe the same growth, decay, or stagnation.
[Figure 3.6: complex plane with axes Re and Im, showing a number a, its conjugate ā, and the real sum a + ā.]
These are interesting matrices, which occur commonly in many important applications. We will
hear more about them. The eigenvalues are λ1,2 = ±i Imk, which means purely imaginary. We write
λ1 = λ̄2 (and λ̄1 = λ2).
We solve for the components of the first eigenvector. The procedure is the same as in the example
above: substitute the computed eigenvalue into the eigenproblem equation, and since the resulting
equations are linearly dependent, choose one of the components of the eigenvector and solve for the
rest. Thus we get for λ1 = i Imk
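\[
\left( \begin{bmatrix} 0 & -\mathrm{Im}\,k \\ \mathrm{Im}\,k & 0 \end{bmatrix} - \lambda_1 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} .
\]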
Note that z1 and z2 are complex conjugate, as are their corresponding eigenvalues. We can easily
convince ourselves that a real matrix with complex conjugate eigenvalues must have complex
conjugate eigenvectors. For an arbitrary real matrix A take the complex conjugate of both sides
of the equation
\[
A z = \lambda z \;\longrightarrow\; \overline{A z} = A \bar{z} = \bar{\lambda}\,\bar{z}
\tag{3.18}
\]
w1 = exp(λ1 t)z 1
and
w2 = exp(λ2 t)z2 = w̄1
could be solutions of the IVP (3.10). A general solution therefore is likely to be a mix of these two
w = C1 w1 + C2 w2 .
We expect w to be a real vector, whereas w1 and w2 are both complex quantities. However, they are
complex conjugate which suggests that if the constants are also complex conjugates the expression
on the right may be real (refer to Figure 3.6).
w = C1 w1 + C̄1 w̄1
In general, the complex constant may be written as
C1 = ReC1 + i ImC1 (3.19)
and the complex exponential has the equivalent expression (Euler’s formula from complex analysis)
exp(i Imkt) = cos(Imkt) + i sin(Imkt) . (3.20)
Therefore, the product of the three complex quantities may be expanded as
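\[
C_1 w_1 = C_1 \exp(\lambda_1 t) z_1 =
\begin{bmatrix}
-\mathrm{Re}\,C_1 \sin(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \cos(\mathrm{Im}\,k\,t) \\
\mathrm{Re}\,C_1 \cos(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \sin(\mathrm{Im}\,k\,t)
\end{bmatrix}
+ i
\begin{bmatrix}
\mathrm{Re}\,C_1 \cos(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \sin(\mathrm{Im}\,k\,t) \\
\mathrm{Re}\,C_1 \sin(\mathrm{Im}\,k\,t) + \mathrm{Im}\,C_1 \cos(\mathrm{Im}\,k\,t)
\end{bmatrix} .
\]
Next we take into account that C2 w2 is the complex conjugate of C1 w1, and we readily simplify the
expression w = C1 w1 + C2 w2 by canceling the imaginary part
\[
w = 2
\begin{bmatrix}
-\mathrm{Re}\,C_1 \sin(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \cos(\mathrm{Im}\,k\,t) \\
\mathrm{Re}\,C_1 \cos(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \sin(\mathrm{Im}\,k\,t)
\end{bmatrix} .
\tag{3.21}
\]
Substituting the above expression into the initial condition we arrive at
\[
w(0) = 2 \begin{bmatrix} -\mathrm{Im}\,C_1 \\ \mathrm{Re}\,C_1 \end{bmatrix} = w_0
\]
and that allows us to solve immediately for ImC1, ReC1.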
So finally we can write the solution to the matrix IVP (3.10)
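\[
w = \begin{bmatrix}
-\mathrm{Im}\,y_0 \sin(\mathrm{Im}\,k\,t) + \mathrm{Re}\,y_0 \cos(\mathrm{Im}\,k\,t) \\
\mathrm{Im}\,y_0 \cos(\mathrm{Im}\,k\,t) + \mathrm{Re}\,y_0 \sin(\mathrm{Im}\,k\,t)
\end{bmatrix}
\]
or even more profitably in the matrix form
\[
w = \begin{bmatrix}
\cos(\mathrm{Im}\,k\,t) & -\sin(\mathrm{Im}\,k\,t) \\
\sin(\mathrm{Im}\,k\,t) & \cos(\mathrm{Im}\,k\,t)
\end{bmatrix}
\begin{bmatrix} \mathrm{Re}\,y_0 \\ \mathrm{Im}\,y_0 \end{bmatrix} .
\tag{3.22}
\]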
The matrix in the above equation is the so-called rotation matrix . The quantity Imk has the
meaning of angular velocity, and correspondingly Imkt is the rotation angle. One way of visualizing
rotations is through phasors: see Figure 3.7. A phasor is a rotating vector whose components vary
harmonically.
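As a quick check, the rotation-matrix form (3.22) can be compared with the real and imaginary parts of the scalar complex solution y0 exp(kt) for k = i Imk. The sketch below uses the data quoted in the caption of Figure 3.8 (Imk = 0.3, Rey0 = 0, Imy0 = 8); the sample times are an arbitrary choice.

```matlab
% Data quoted for Figure 3.8; the sample times are an arbitrary choice.
omega = 0.3;  y0 = 0 + 8i;
t = linspace(0, 10, 5);
for j = 1:numel(t)
    R = [cos(omega*t(j)), -sin(omega*t(j)); sin(omega*t(j)), cos(omega*t(j))];
    w = R*[real(y0); imag(y0)];          % equation (3.22)
    y = y0*exp(1i*omega*t(j));           % scalar complex solution, k = i*omega
    fprintf('t = %5.2f: w = [%8.4f, %8.4f], |w| = %6.4f, mismatch = %g\n', ...
        t(j), w(1), w(2), norm(w), norm(w - [real(y); imag(y)]));
end
```

The length of w stays equal to |y0| at every time, which is the constant-amplitude feature the rotation picture expresses.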
Figure 3.8 provides a link between different ways of visualizing rotations². The black circle
corresponds to the trace of the tip of the rotating vector of Figure 3.7. The rainbow colored helical
tube (time advances from blue to red) is the black circle stretched in the time dimension. (Think
Slinky.)
The red curve is the projection of the helix onto the plane Imy = 0, and it is the graph of t
versus Rey. The blue curve is the projection of the helix onto the plane Rey = 0, and it is the graph
of t versus Imy. When we plot the solutions computed by the MATLAB integrators they are the
superposition of these (red and blue) curves in one plane, as shown on the right in Figure 3.8.³ The
rotating-vector picture tells us what kind of curves we should expect: the vector rotates with constant
angular velocity, which when projected onto either of the two coordinates will yield a sinusoidal
phase-shifted curve in time (compare with Figure 3.8).
² See: aetna/ScalarODE/scalaroscillstream.m
³ See: aetna/ScalarODE/scalaroscillplot.m
[Figure: phasor in the (Rey, Imy) plane; w(t) is w0 rotated through the angle Imk t.]
Fig. 3.8. Graphical representation of the solution to (3.12) for Imk = 0.3, Rey0 = 0, Imy0 = 8
⁴ See: aetna/ScalarODE/scalaroscill1st.m
3.12 Euler methods for oscillating solutions 47
[Figure: Re y and Im y plotted against t, 0 ≤ t ≤ 10.]
time step, but the forward Euler integrator odefeul fails spectacularly: the solution blows up very
quickly (Figure 3.10 on the left). The backward Euler integrator is not much better, except that the
amplitude goes to zero (Figure 3.10 on the right). With smaller time steps we can reduce the rate of
the blowup (decay) of the amplitude, but we can never remove it (try it: decrease the time step by
a couple of orders of magnitude, and arm yourselves with patience, it is going to take a long time to
integrate). We consider the constant amplitude as the main feature of the solutions to this problem.
Therefore, we must conclude that for this problem the two integrators appear to be unconditionally
unstable, as they are unable to maintain an unchanging amplitude of the oscillations no matter how
small the time step. For comparison we show the results for the built-in ode45 integrator, applied
over a long integration time⁵, in Figure 3.11 (there are so many oscillations that the curves visually
melt into a solid block). We see that even for this integrator there is a systematic change (decay)
in the amplitude of the oscillation. By reducing the time step length we can reduce the drift, but
we cannot remove it entirely (as observed in numerical experiments). Again, this behavior has to do
with stability, not accuracy.
[Figure: two panels of Re y, Im y plotted against t, 0 ≤ t ≤ 10; the vertical range is ±600 on the left and ±8 on the right.]
Fig. 3.10. Example of Section 3.11, odefeul integrator (on the left), and odebeul integrator (on the right).
Time step ∆t = 0.099.
[Figure: Re y, Im y plotted against t, 0 ≤ t ≤ 1200.]
Fig. 3.11. Example of Section 3.11, ode45 integrator, long integration time
and we work in the knowledge that k, yj, yj+1 are complex. We now understand that for a purely
imaginary k = i Imk the solution may be represented as a circle in the plane Rey, Imy. Another way
of saying this is “the modulus of the complex quantity y is constant”. We take the modulus on both
sides
|yj+1| = |1 + ∆tk| |yj| ,
and in order to get |yj+1| = |yj| (so that the solution points lie on a circle) we need for the complex
amplification factor to satisfy
|1 + ∆tk| = 1 . (3.24)
Figure 3.12 illustrates the meaning of the above equation graphically. The circle of radius equal to
1.0 centered at (0, 0) is translated to be centered at (−1, 0) in order for the complex number ∆tk to
satisfy (3.24). Now consider the purely imaginary value of the coefficient k = i Imk. Such numbers
lie along the imaginary axis, Rek = 0, and when multiplied by ∆t > 0 the resulting product just
moves closer to or further away from the origin. One such number ∆tk is shown in Figure 3.12.
In order for ∆tk to satisfy (3.24) the dot representing the number must move to the thick circle
in Figure 3.12. We can see that no such non-zero time step length exists: only ∆t = 0 will make
∆tk = 0 lie on the circle at (0, 0). Therefore, we must conclude that the forward Euler method
is unconditionally unstable for imaginary k as there is no time step length that would satisfy the
stability requirement (3.24).
[Figure 3.12: complex plane with axes ∆tRek and ∆tImk, showing the circle |x| = 1, the translated circle |1 + x| = 1 centered at −1, and a point ∆tk on the imaginary axis.]
3.13 General complex k 49
Next we shall consider the backward Euler method (3.7) for the same problem. Taking the
modulus on both sides we obtain
\[
|y_{j+1}| = \left| \frac{y_j}{1 - \Delta t\,k} \right| = \frac{|y_j|}{|1 - \Delta t\,k|} ,
\]
and in order to get |yj+1 | = |yj | (so that the solution points lie on a circle) we need
|1 − ∆tk| = 1 . (3.25)
Figure 3.13 now illustrates that the circle of radius equal to 1.0 centered at (0, 0) needs to be
translated to be centered at (+1, 0) in order for ∆tk to satisfy (3.25). Again, we must conclude that
the backward Euler method is unconditionally unstable for imaginary k as there is no non-zero time
step length that would satisfy the stability requirement (3.25).
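Both conclusions are easy to confirm numerically. In the sketch below the coefficient k = 3i is a hypothetical choice, and ∆t = 0.099 is the step quoted in the caption of Figure 3.10; the code only evaluates the moduli of the two amplification factors.

```matlab
% k = 3i is a hypothetical purely imaginary coefficient; dt as quoted for Figure 3.10.
k = 3i;  dt = 0.099;  nsteps = 100;
ampF = abs(1 + dt*k);        % forward Euler: modulus of the amplification factor
ampB = 1/abs(1 - dt*k);      % backward Euler: modulus of the amplification factor
fprintf('forward:  |factor| = %.6f, growth after %d steps = %.4f\n', ampF, nsteps, ampF^nsteps);
fprintf('backward: |factor| = %.6f, decay after %d steps  = %.4f\n', ampB, nsteps, ampB^nsteps);
% Reducing dt moves both moduli closer to one, but never onto it.
for h = dt*[1, 0.5, 0.25]
    fprintf('dt = %.5f: forward %.8f, backward %.8f\n', h, abs(1 + h*k), 1/abs(1 - h*k));
end
```

For purely imaginary k the two moduli are reciprocals of each other, so the forward Euler amplitude grows at exactly the rate at which the backward Euler amplitude decays.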
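[Figure 3.13: complex plane with axes ∆tRek and ∆tImk, showing the circle |x| = 1, the translated circle |1 − x| = 1 centered at +1, and a point ∆tk.]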
The solution will be in the form of (3.21), except that everything will be multiplied by the real
exponential exp(Rekt)
\[
w = 2 \exp(\mathrm{Re}\,k\,t)
\begin{bmatrix}
-\mathrm{Re}\,C_1 \sin(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \cos(\mathrm{Im}\,k\,t) \\
\mathrm{Re}\,C_1 \cos(\mathrm{Im}\,k\,t) - \mathrm{Im}\,C_1 \sin(\mathrm{Im}\,k\,t)
\end{bmatrix} .
\]
Following the same steps as in Section 3.10, we arrive at the solution to the IVP in the form
\[
w = \exp(\mathrm{Re}\,k\,t)
\begin{bmatrix}
\cos(\mathrm{Im}\,k\,t) & -\sin(\mathrm{Im}\,k\,t) \\
\sin(\mathrm{Im}\,k\,t) & \cos(\mathrm{Im}\,k\,t)
\end{bmatrix}
\begin{bmatrix} \mathrm{Re}\,y_0 \\ \mathrm{Im}\,y_0 \end{bmatrix} ,
\tag{3.27}
\]
which may be interpreted readily as the rotation of a phasor with exponentially decreasing (Rek < 0)
or increasing (Rek > 0) amplitude.
Let us take first Rek < 0 and the forward Euler algorithm. Equation (3.23) is still our starting
point, but now we are asking if there is a time step length that would make the modulus of the
solution decrease in time, or in mathematical terms
|1 + ∆tk| < 1 . (3.28)
For the accompanying picture refer to Figure 3.14: One possible complex coefficient k is shown, as
is its scaling (down) by the time step ∆tk. Clearly, it is now possible by choosing a sufficiently small
time step length to bring ∆tk inside the circle so that its distance from (−1, 0) is less than one, and
so that the stability criterion (3.28) is satisfied. Since now there is a time step length so that the
forward Euler can reproduce the correct solution shape, we call forward Euler for general complex k
and Rek < 0 conditionally stable. The condition implied by “conditionally” is equation (3.28), and
for a given k we can use it to solve for an appropriate ∆t.
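Solving for ∆t can be made concrete. Squaring the condition |1 + ∆tk| < 1 gives 1 + 2∆t Rek + ∆t^2 |k|^2 < 1, that is, 0 < ∆t < −2 Rek/|k|^2. The sketch below evaluates this bound for one illustrative coefficient (the value of k is an assumption) and checks the modulus of the amplification factor below, at, and above it.

```matlab
% Illustrative coefficient with Re k < 0 (hypothetical value).
k = -0.1 + 3i;
dtcrit = -2*real(k)/abs(k)^2;      % |1 + dt*k| < 1 exactly when 0 < dt < dtcrit
for dt = dtcrit*[0.5, 1.0, 1.5]
    fprintf('dt = %.5f, |1 + dt*k| = %.6f\n', dt, abs(1 + dt*k));
end
```

Note that a lightly damped, rapidly oscillating solution (small |Rek|, large |Imk|) leads to a very small admissible time step.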
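[Figure 3.14: complex plane with axes ∆tRek and ∆tImk, showing the circle |1 + x| = 1 centered at −1, a coefficient k outside the circle, and its scaled image ∆tk inside the circle.]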
On the other hand, we can now see that for the forward Euler algorithm we achieve stability
for Rek > 0 for any ∆t, where stability now means that the modulus of the numerical solution grows,
as that of the analytical solution does: the coefficient k is in the right-hand side half plane,
and the stability circle is in the left-hand side half plane. Therefore multiplying a complex k with
an arbitrary ∆t > 0 will satisfy |1 + ∆tk| > 1. Hence, for Rek > 0 the forward Euler method is
unconditionally stable.
This state of affairs is again mirrored by the behavior of the backward Euler algorithm. First
take Rek > 0. Equation (3.25) is now used to figure out if there is a time step length that would
make the modulus of the solution increase in time, or in mathematical terms
\[
\frac{1}{|1 - \Delta t\,k|} > 1 .
\tag{3.29}
\]
For the accompanying picture refer to Figure 3.15: One possible complex coefficient k is shown,
as is its scaling ∆tk. Clearly, it is now possible by choosing a sufficiently small time step length
to bring ∆tk inside the circle so that its distance from (+1, 0) is less than one, which will ensure
satisfaction of (3.29). Thus, the backward Euler method is conditionally stable for general complex
k and Rek > 0. Also, we now conclude that the backward Euler algorithm achieves stability
for Rek < 0 for any ∆t, in the sense that the modulus of the numerical solution decays,
1/|1 − ∆tk| < 1: the coefficient k is in the left-hand side half plane, and the
stability circle is in the right-hand side half plane. Similar reasoning as for the forward Euler leads
us to conclude that backward Euler is unconditionally stable for complex k and Rek < 0.
Illustration 4
Apply the modified Euler (2.28) to the model equation (3.1), and derive the amplification factor.
3.14 Summary of integrator stability 51
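[Figure 3.15: complex plane with axes ∆tRek and ∆tImk, showing the circle |1 − x| = 1 centered at +1, a coefficient k, and its scaled image ∆tk.]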
Substituting the right-hand side of the model equation into the formula (2.28) we get
\[
y_a = y(t_0) + (t - t_0)\,k\,y(t_0)
\]
and
\[
\begin{aligned}
y(t) &= y(t_0) + \frac{(t - t_0)}{2} \left[ k y(t_0) + k y_a \right] \\
     &= y(t_0) + \frac{(t - t_0)}{2} \left[ k y(t_0) + k \left( y(t_0) + (t - t_0) k y(t_0) \right) \right] \\
     &= y(t_0) \left[ 1 + k(t - t_0) + \frac{[k(t - t_0)]^2}{2} \right] .
\end{aligned}
\]
The term in square brackets that multiplies y(t0) is the amplification factor for the modified Euler.
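The derivation can be verified by taking a single predictor-corrector step numerically and comparing the ratio y(t)/y(t0) with the closed-form factor. The values of k, the step h = t − t0, and y0 below are illustrative assumptions; the two-stage step is written in the form used in the derivation above.

```matlab
% Illustrative data (hypothetical values): complex k, step h = t - t0.
k = -0.5 + 2i;  h = 0.1;  y0 = 1.5 - 0.5i;
f  = @(y) k*y;                     % right-hand side of the model equation
ya = y0 + h*f(y0);                 % predictor: a forward Euler step
y1 = y0 + h/2*(f(y0) + f(ya));     % corrector: average of the two slopes
amp = 1 + h*k + (h*k)^2/2;         % amplification factor derived in the text
fprintf('one-step ratio y1/y0 = %s\n', num2str(y1/y0));
fprintf('amplification factor = %s\n', num2str(amp));
```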
Suggested experiments
Figure 3.16 shows the classification of the various behaviors for the linear differential equation with
constant coefficients
ẏ = ky , y(0) = y0 , k, y complex .
The eigenvalue λ = k (a complex number) is plotted in the complex plane. Depending on where
it lands, the analytical solution will display the following behaviors: In the left half-plane we get
decaying oscillations, in the right half-plane we get growing oscillations. If the eigenvalue is purely
imaginary, we get pure oscillation. If the eigenvalue is purely real, we get either exponentially
decaying or growing solutions. Finally, zero eigenvalue yields a stagnant (unchanging) solution.
Figure 3.17 shows the behaviors produced by the forward Euler integrator. The same color coding as in
Figure 3.16 is used. The key to understanding whether the forward Euler integrator can give us a discrete
solution that mimics the analytical one is to compare the two figures. The complex number ∆tλ is
plotted in the complex plane in Figure 3.17. Forward Euler can reproduce the desired behavior if
there is such a ∆t as to place the number ∆tλ in Figure 3.17 in the region with the same color as
the one in which λ was located in Figure 3.16.
[Figure: complex plane with axes Re and Im; regions labeled Oscillation, Decaying, Growing, Decaying Oscillations, Growing Oscillations.]
Fig. 3.16. Behavior classification for the first order linear differential equation
Illustration 5
Example 1: consider λ = −0.1 + i3. The analytical solution is decaying oscillation. In Figure 3.17 we
can see that a sufficiently small time step ∆t will indeed place ∆tλ inside the circle of unit radius
centered at −1 which has the same color as the left-hand side half-plane in Figure 3.16. Forward
Euler is conditionally stable in this case. (The condition is that ∆t must be sufficiently small.)
Example 2: consider λ = −i3. The analytical solution is pure oscillation. In Figure 3.17 we can
see that it is not possible to find any other time step but ∆t = 0 to place ∆tλ on the circle of unit
radius centered at −1 (which has the same color as the imaginary axis in Figure 3.16). Forward
Euler is unconditionally unstable for pure oscillations.
Example 3: consider λ = 13.3. The analytical solution is exponential growth. In Figure 3.17 we
can see that the positive part of the real axis has the same color in both figures. Therefore, for
all ∆t > 0 we get the correct behavior. Forward Euler is unconditionally stable for exponentially
growing solutions.
Example 4: consider λ = −0.61. The analytical solution is exponentially decaying. In Figure 3.17
we can see that a sufficiently small time step ∆t will indeed place ∆tλ within the interval −1 ≤
∆tλ < 0 which has the same color as the negative part of the real axis in Figure 3.16. Forward Euler
is conditionally stable in this case. (The condition is that ∆t must be sufficiently small.)
In words, using the pair of images 3.16 and 3.17 the forward Euler integrator is found to be
unconditionally unstable for pure oscillations, unconditionally stable for growing oscillations and
exponentially growing non-oscillating solutions, and conditionally stable for exponentially decaying
oscillating and non-oscillating solutions. Analogous observations can be made about the backward
Euler integrator, which is found to be unconditionally unstable for pure oscillations, conditionally
stable for growing oscillations and exponentially growing non-oscillating solutions, and unconditionally
stable for exponentially decaying oscillating and non-oscillating solutions.
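The comparison of the two diagrams can be mimicked numerically for the forward Euler integrator. The sketch below takes the four eigenvalues used in the examples of Illustration 5 and one trial time step (∆t = 0.01 is an assumption) and compares growth or decay of the exact solution with growth or decay of the discrete one. It looks only at the modulus of the amplification factor, so for real negative λ it does not test the additional requirement of a monotone decay.

```matlab
% Eigenvalues from the four examples; the trial step dt = 0.01 is an assumption.
lambdas = [-0.1 + 3i, -3i, 13.3, -0.61];
dt = 0.01;
G = abs(1 + dt*lambdas);                 % moduli of the forward Euler amplification factors
labels = {'decays', 'constant modulus', 'grows'};
for j = 1:numel(lambdas)
    exact    = labels{sign(real(lambdas(j))) + 2};   % behavior of |exp(lambda*t)|
    discrete = labels{sign(G(j) - 1) + 2};           % behavior of |1 + dt*lambda|^j
    fprintf('lambda = %-12s exact: %-17s forward Euler: %-17s |1+dt*lambda| = %.6f\n', ...
        num2str(lambdas(j)), exact, discrete, G(j));
end
```

Only the purely imaginary eigenvalue shows a mismatch at this step size; increasing dt far enough makes the two decaying cases fail as well.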
[Figure: complex plane of ∆tλ with axes Re and Im and ticks at −1 and +1; regions labeled Oscillation, Decaying, Growing, Decaying Oscillations, Growing Oscillations.]
Fig. 3.17. Behavior classification for the first order linear differential equation, Forward Euler algorithm
[Figure: complex plane of ∆tλ with axes Re and Im and ticks at −1 and +1; regions labeled Oscillation, Decaying, Growing, Decaying Oscillations, Growing Oscillations.]
Fig. 3.18. Behavior classification for the first order linear differential equation, Backward Euler algorithm.
The stability diagrams that we have developed for the Euler algorithms are complete and unambiguous.
Nevertheless, it will be instructive to visualize the amplification factors of the algorithms
discussed so far in yet another way⁶.
⁶ See: aetna/StabilitySurfaces/StabilitySurfaces.m
For instance, the amplification factor for the modified Euler may be written in terms of ∆tλ as
(see Illustration 4)
\[
1 + \Delta t\lambda + \frac{1}{2} (\Delta t\lambda)^2 .
\tag{3.30}
\]
All possible complex λ are allowed, which means that ∆tλ may represent an arbitrary point of the
complex plane. The magnitude of the amplification factor may be therefore considered a function
of the complex number ∆tλ, and it is often useful to visualize such functions as surfaces raised
above the complex plane. The MATLAB function surf is designed to do just that. It takes three
matrices which represent the coordinates of points of a logically rectangular grid. The elements
x(k,m), y(k,m), z(k,m) represent Cartesian coordinates of the k,m vertex of the grid. The grid
then may be rendered with surf(x,y,z). Here we set up a grid with 99 rectangular faces in each
direction (which is why we have 100 × 100 matrices for the corners of those faces). First the extent
of the grid and the number of corners.
xlow =-3.2; xhigh= 0.9;
ylow =-3.2; yhigh= 3.2;
n=100;
Then we set up the matrices for the coordinates. Note that k corresponds to moving in the x
direction, the index m corresponds to moving in the y direction. dtlambda is a complex number (1i
is the complex unit), so taking its absolute value means getting the magnitude of the amplification
factor.
x=zeros(n,n); y=zeros(n,n); z=zeros(n,n);
for k =1:n
for m =1:n
x(k,m) =xlow +(k-1)/(n-1)*(xhigh-xlow);
y(k,m) =ylow +(m-1)/(n-1)*(yhigh-ylow);
dtlambda = x(k,m) + 1i*y(k,m);
z(k,m) = abs(1 + dtlambda + 0.5*dtlambda.^2);
end
end
Of course there is more than one way of accomplishing this. Here is the whole setup accomplished
with just three lines using the handy meshgrid and linspace functions.
[x,y] = meshgrid(linspace(xlow,xhigh,n),linspace(ylow,yhigh,n));
dtlambda = x + 1i*y;
% Modified Euler
z = abs(1 + dtlambda + 0.5*dtlambda.^2);
Next we draw the color-coded surface that represents the height z above the complex plane: blue
is the lowest, red is the highest.
surf(x,y,z,'edgecolor','none')
Then we draw into the same figure the level curve at height 1.0 of the same function z of x, y. We
set the linewidth of the curve using a handle returned from the function contour3.
hold on
[C,H] = contour3(x,y,z,[1, 1],'k');
set(H,'linewidth', 3)
Finally set up the view, and label the axes.
axis([-4 0.6 -4 4 0 8])
axis equal
xlabel('Re (\Delta{t}\lambda)')
ylabel('Im (\Delta{t}\lambda)')
Voilà: Figure 3.19. It shows how the amplification factor falls below 1.0 in magnitude inside an oval
shape in the left-hand side half plane.
As shown in the MATLAB script StabilitySurfaces, corresponding surface representations of
the amplification factors for the methods discussed so far, forward and backward Euler (Figures 3.20
and 3.21), trapezoidal rule (Figure 3.22), and the fourth-order Runge-Kutta (Figure 3.23), are easily
obtained just by commenting out or uncommenting the appropriate definitions of the variable z.
Fig. 3.19. Surface of the amplitude of the amplification factor for the first order linear differential equation,
MEUL=modified Euler algorithm. The contour of unit amplitude is shown in black.
Fig. 3.20. Surface of the amplitude of the amplification factor for the first order linear differential equation,
FEUL=forward Euler algorithm. The contour of unit amplitude is shown in black.
[Figure: surface plotted over the axes Re (∆tλ) and Im (∆tλ).]
Fig. 3.21. Surface of the amplitude of the amplification factor for the first order linear differential equation,
BEUL=backward Euler algorithm. The contour of unit amplitude is shown in black.
Fig. 3.22. Surface of the amplitude of the amplification factor for the first order linear differential equation,
TRAP=trapezoidal rule algorithm. The contour of unit amplitude is shown in black.
Fig. 3.23. Surface of the amplitude of the amplification factor for the first order linear differential equation,
RK4=fourth-order Runge-Kutta algorithm. The contour of unit amplitude is shown in black.
Figure 3.24 compares the level curves at 1.0 for the amplitude of the amplification factor
for the first order linear differential equation for the integrators FEUL=forward Euler algorithm,
BEUL=backward Euler algorithm, MEUL=modified Euler algorithm, TRAP=trapezoidal rule algorithm,
RK4=fourth-order Runge-Kutta algorithm. Note that the level curve for the trapezoidal
rule coincides with the vertical axis in the figure. For a decaying solution, the integrator will
produce a stable solution if ∆tλ is inside the contours in the left-hand side plane, or outside the
circle in the right-hand side plane for the backward Euler. Clearly, comparing Figure 3.24 with the surface
representations of integrator stability in Figures 3.20 – 3.23 we can see that visualizing the stability
with contours is only part of the story: the surface figures supply the missing information about the
magnitude of the amplification factor.
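A comparison of level curves in the spirit of Figure 3.24 can be sketched as follows. The forward Euler and modified Euler factors appear in the text; the backward Euler, trapezoidal, and fourth-order Runge-Kutta factors are the standard textbook forms and are an assumption here, since the script StabilitySurfaces itself is not reproduced. The grid extent and resolution are also illustrative choices.

```matlab
% Amplification factors as functions of z = dt*lambda.
% FEUL and MEUL are as given in the text; BEUL, TRAP, RK4 are the standard
% forms, assumed here because the book's script is not shown.
amp.FEUL = @(z) 1 + z;
amp.BEUL = @(z) 1./(1 - z);
amp.MEUL = @(z) 1 + z + z.^2/2;
amp.TRAP = @(z) (1 + z/2)./(1 - z/2);
amp.RK4  = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;

[x, y] = meshgrid(linspace(-3.2, 3.2, 201), linspace(-3.2, 3.2, 201));
z = x + 1i*y;
names = fieldnames(amp);
figure; hold on
for j = 1:numel(names)
    f = amp.(names{j});
    contour(x, y, abs(f(z)), [1, 1]);    % level curve at height 1.0
end
axis equal; grid on
legend(names);
xlabel('Re (\Delta{t}\lambda)'); ylabel('Im (\Delta{t}\lambda)');
```

Evaluating a factor at a single point is often all that is needed, for instance abs(amp.RK4(-2.5)) to see whether a given ∆tλ lies inside the RK4 contour.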
Suggested experiments
1. Use the information in Figure 3.22 to estimate the stability diagram for the integrator
odetrap similar to that shown in Figures 3.17 and 3.18.
3.15 Annotated bibliography 57
[Figure: level curves labeled RK4, BEUL, FEUL, TRAP, MEUL in the plane of Re(∆tλ) and Im(∆tλ), both axes from −3 to 3.]
Fig. 3.24. Level curves (contours) at value 1.0 of the amplitude of the amplification factor for the
first order linear differential equation, FEUL=forward Euler algorithm, BEUL=backward Euler algorithm,
MEUL=modified Euler algorithm, TRAP=trapezoidal rule algorithm, RK4=fourth-order Runge-Kutta
algorithm.
Summary
1. The model of the linear oscillator with a single degree of freedom is investigated from the point
of view of the uncoupling procedure (so-called modal expansion), and the solution in the form
of a matrix exponential. Main idea: solve the eigenvalue problem for the governing ODE system,
and expand the original variables in terms of the eigenvectors. The modal expansion is a critical
piece in engineering vibration analysis.
2. For the single degree of freedom linear vibrating system we study how to transform between
the second order and the first order matrix form, and we discuss the relationship of the scalar
equation with the complex coefficient from Chapter 3 with the linear oscillator model. Main
idea: the two IVPs are shown to be equivalent descriptions.
3. It is shown that modal analysis is possible as long as the system matrix is not defective, i.e. as
long as it has a full set of eigenvectors. The case of critical damping is discussed as a special case
which leads to a defective system matrix.
4. The modal analysis allows multiple degree of freedom systems to be understood in terms of the
properties of multiple single degree of freedom linear oscillators.
x(0) = x0 , ẋ(0) = v0
together this will constitute the complete definition of the IVP of the linear oscillator. Using the
definition of the velocity
v = ẋ
will yield the general first-order form of the 1-dof damped oscillator IVP as
\[ \dot{y} = A \cdot y , \quad y(0) = y_0 , \tag{4.1} \]
where
\[ A = \begin{bmatrix} 0 & 1 \\ -k/m & -c/m \end{bmatrix} \tag{4.2} \]
60 4 Linear Single Degree of Freedom Oscillator
and
\[ y = \begin{bmatrix} x \\ v \end{bmatrix} . \]
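[Figure: single degree of freedom oscillator with spring stiffness k, mass m, damping c, and displacement x.]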
The discussion of Section 3.7 (refer to equation (3.13)) applies here too. We assume the solution
in the form of an exponential
\[ y = e^{\lambda t} z . \]
The quantity ωn
Illustration 1
Show that the IVP of the undamped (c = 0) one degree of freedom oscillator is equivalent to a single
scalar equation with a complex coefficient as in equation (3.12).
Solution is obtained by substitution of (4.3) into the matrix of (4.2)
\[ A = \begin{bmatrix} 0 & 1 \\ -\omega_n^2 & 0 \end{bmatrix} . \]
The trick (yes, there is one) is to introduce a new set of variables. The first is the same as deflection
of the mass x and the second is the velocity scaled by the negative natural frequency:
\[ \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} x \\ -v/\omega_n \end{bmatrix} . \]
4.1 Linear single degree of freedom oscillator 61
Therefore, the differential equation of motion may be written in terms of the new variables as
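\[ \begin{bmatrix} \dot{w}_1 \\ -\omega_n \dot{w}_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -\omega_n^2 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ -\omega_n w_2 \end{bmatrix} , \]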
which is in perfect agreement with Section 3.10. Carrying out the products gives ẇ1 = −ωn w2 and
ẇ2 = ωn w1: we get two variables, the displacement x and the velocity scaled by the natural
frequency, −v/ωn, coupled together by a skew-symmetric matrix which
is the same as in equation (3.17) (where ωn = Im k). The solution in the new variables w1, w2 is
therefore expressed by the rotation matrix as in (3.22). Now we can understand that Figure 3.7
describes the motion of an oscillating mass.
4.1.1 ω = 0: No oscillation
For ω = 0 the imaginary part of the eigenvalue vanishes, and then there is no oscillation. The real
component α is obtained from
\[ \alpha^2 + (c/m)\alpha + k/m = 0 \]
giving
\[ \alpha_{1,2} = -\frac{c}{2m} \pm \sqrt{\left(\frac{c}{2m}\right)^2 - \frac{k}{m}} . \]
Notice that we must require
\[ \left(\frac{c}{2m}\right)^2 \ge \frac{k}{m} \tag{4.4} \]
for α1,2 to come out real.
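A minimal numerical sketch of the two real roots, under stated assumptions: the data m = 13, k = 6100 and the damping ratio 3/2 are borrowed from the illustrations of this chapter, and the variable names (`disc`, `alpha_ref`) are introduced here only for illustration. The closed-form roots are compared with those returned by `roots`.

```matlab
% Real roots alpha_{1,2} of alpha^2 + (c/m)*alpha + k/m = 0 (sketch).
m = 13; k = 6100;
zeta = 3/2;                        % assumed damping ratio (supercritical)
c = zeta*2*sqrt(k*m);              % c = zeta*c_cr, with c_cr = 2*m*omega_n = 2*sqrt(k*m)

disc = (c/(2*m))^2 - k/m;          % must be >= 0 for real roots, see (4.4)
alpha = -c/(2*m) + [1, -1]*sqrt(disc);   % closed-form roots

alpha_ref = roots([1, c/m, k/m]);  % the same roots from the polynomial coefficients
```

Both roots are negative here, so both solution components decay without oscillating.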
This is the second subcase: Substituting α = −(c/2m) into the first equation, we obtain
\[ \left(\frac{c}{2m}\right)^2 - \omega^2 + (c/m)\left(-\frac{c}{2m}\right) + \frac{k}{m} = 0 , \]
is the so-called critical damping. The critically damped oscillator needs special handling, which
we will postpone to its own section that will follow the discussion of the generic cases of the
supercritically and the subcritically damped oscillator.
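\[ (A - \lambda_1 \mathbf{1}) z_1 = 0 . \]
Substituting we have
\[ \begin{bmatrix} -\lambda_1 & 1 \\ -\frac{k}{m} & -\frac{c}{m} - \lambda_1 \end{bmatrix} \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} . \]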
The two equations are really only one equation (the rows and columns of the matrix on the left
are linearly dependent, since that is the condition from which we solved for λ1 ). Therefore, using
4.2 Supercritically damped oscillator 63
for instance the first equation and choosing z11 = 1 we compute z21 = λ1 . We repeat the same
procedure for the second root to arrive at the two eigenvectors
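\[ z_1 = \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 1 \\ \lambda_1 \end{bmatrix} , \quad z_2 = \begin{bmatrix} z_{12} \\ z_{22} \end{bmatrix} = \begin{bmatrix} 1 \\ \lambda_2 \end{bmatrix} . \]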
The general solution of the differential equation of motion of the oscillator is therefore
This can be conveniently cast in matrix form using the matrix of eigenvectors
\[ V = [z_1 , z_2] \]
Provided λ1 ≠ λ2, the two eigenvectors are linearly independent, which means that the matrix V is
non-singular. The constants are then
\[ \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = V^{-1} y_0 . \]
Illustration 2
It may be illustrative to work out in detail the inverse of the matrix of eigenvectors.
We write the matrix of the eigenvectors as above:
\[ V = \begin{bmatrix} 1 & 1 \\ \lambda_1 & \lambda_2 \end{bmatrix} . \]
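As a hedged sketch of what working out the inverse involves, the standard 2 × 2 adjugate formula gives inv(V) = [λ2, −1; −λ1, 1]/(λ2 − λ1), which exists exactly when λ1 ≠ λ2. The numerical values below are arbitrary and chosen only for illustration.

```matlab
% Inverse of the eigenvector matrix V = [1, 1; lambda1, lambda2] (sketch).
lambda1 = -2; lambda2 = -5;                 % any two distinct values (assumed)
V = [1, 1; lambda1, lambda2];

% Adjugate divided by the determinant, det(V) = lambda2 - lambda1
Vinv = [lambda2, -1; -lambda1, 1]/(lambda2 - lambda1);

check = norm(Vinv*V - eye(2));              % should be at round-off level
```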
Namely, with the definition of a new variable w (also commonly referred to as change of coordinates)
we can write
\[ w = \begin{bmatrix} e^{\lambda_1 t} & 0 \\ 0 & e^{\lambda_2 t} \end{bmatrix} w_0 \]
as a completely equivalent solution to the oscillator IVP, using the new variable w. Each component
of the solution is independent of the other, as we can see from the scalar equivalent to the above
matrix equation
\[ V^{-1} \dot{y} = V^{-1} A y = V^{-1} A V V^{-1} y \]
and with the new variable w we may rewrite the original IVP in this form
\[ \dot{w} = V^{-1} A V w , \quad w(0) = w_0 = V^{-1} y_0 . \]
The matrix V⁻¹AV is a very nice one: it is diagonal. To see this, we realize that for each column
of the matrix V the eigenvalue problem
\[ A z_j = \lambda_j z_j , \quad j = 1, 2 \tag{4.10} \]
holds, and writing all such eigenvalue problems in one shot is possible as
\[ A [z_1 , z_2] = [z_1 , z_2] \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \tag{4.11} \]
Therefore we have
\[ A [z_1 , z_2] = A V = V \Lambda \]
\[ V^{-1} A V = \Lambda . \tag{4.13} \]
We say that the matrix A is similar to a diagonal matrix Λ. (We also say that A is diagonalizable.)
So the IVP for the oscillator can be written in the new variable w as
4.3 Change of coordinates: similarity transformation 65
\[ \dot{w} = V^{-1} A V w = \Lambda w , \quad w(0) = w_0 = V^{-1} y_0 . \]
This means that we can write totally independent scalar IVPs for each component
\[ \dot{w}_1(t) = \lambda_1 w_1 , \quad w_1(0) = w_{10} , \qquad \dot{w}_2(t) = \lambda_2 w_2 , \quad w_2(0) = w_{20} , \]
which as we know have the solutions (4.9):
\[ w_j(t) = e^{\lambda_j t} w_{j0} . \tag{4.14} \]
This is the well-known decoupling procedure: the original variables y are in general coupled together
since the matrix A is in general non-diagonal. Therefore, to make things easier for us, we switch to
a different set of variables w with the transformation (4.8), in which all the variables are uncoupled.
Each uncoupled variable has its own IVP, which is easily solved. Finally, if we wish to, we
switch back to the original variables y. This procedure may be summarized as
\[ \dot{y} = A y , \quad y(0) = y_0 \quad \text{(original IVP)}, \tag{4.15} \]
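The decoupling steps described above can be sketched numerically, without the symbolic toolbox. This is a minimal sketch under stated assumptions: the data (m = 13, k = 6100, damping ratio 3/2, x0 = 0, v0 = 1) are borrowed from the illustration that follows, the time instant is arbitrary, and MATLAB's `expm` is used only as an independent reference.

```matlab
% Decoupling procedure for ydot = A*y, y(0) = y0 (numeric sketch).
m = 13; k = 6100; zeta = 3/2;
omega_n = sqrt(k/m);
c = zeta*2*m*omega_n;                 % supercritical damping
A = [0, 1; -k/m, -c/m];
y0 = [0; 1];                          % [x0; v0]

[V, Lambda] = eig(A);                 % columns of V are the eigenvectors
w0 = V\y0;                            % modal initial conditions, w0 = inv(V)*y0
t = 0.05;                             % arbitrary time instant (assumed)
w = exp(diag(Lambda)*t).*w0;          % each w_j(t) = exp(lambda_j*t)*w_j0
y = V*w;                              % switch back to the original variables

y_ref = expm(A*t)*y0;                 % reference solution via the matrix exponential
err = norm(y - y_ref);
```

The modal equations are solved independently of each other; the coupling reappears only in the final product V*w.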
Illustration 3
Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 3/2, x0 = 0, and v0 = 1.
We shall follow the procedure (4.15). The MATLAB solution is based on the symbolic algebra
toolbox¹. First the definitions of the variables. The variable names are self-explanatory.
function froscill_super_symb
syms m k c omega_n t x0 v0 real
y0= [x0;v0];
c_cr=2*m*omega_n;
c=3/2*c_cr;
A = [0, 1; -omega_n^2, -(c/m)];
We compute symbolically the eigenvalues and eigenvectors, and we construct the diagonal matrix Λ
(called L in the code)
[V,D] =eig(A);
L =simple(inv(V)*A*V);
(Control question: How do L and D compare?) Next we can compute the matrix with $e^{\lambda_j t}$ on the
diagonal (called eLt). Note that calling the MATLAB function exp on a matrix would exponentiate
each element of the matrix. This is not what we intend: only the elements on the diagonal should
be affected. Therefore we have to extract the diagonal of L with diag, exponentiate, and then
reconstruct a square matrix with another call to diag
¹ See: aetna/LinearOscillator/froscill_super_symb.m
eLt =diag(exp(diag(L)*t));
Now we are ready to write down the last equation (4.15) to construct the solution components
(displacement and velocity).
y=simple(V*eLt*inv(V))*y0;
It only remains to substitute numbers and plot. These are the given numbers and we also define an
auxiliary variable.
x0= 0; v0=1;% [initial displacement; initial velocity]
m= 13; k= 6100; omega_n= sqrt(k/m);
For the plotting we need data to plot on the horizontal and vertical axis. Here we set it up so that
the time variable array t consists of 200 points spanning two periods of vibration of the undamped
system.
T_n=(2*pi)/omega_n;
t=linspace(0, 2*T_n, 200);
Finally the plotting of the components of the solution.
plot(t,eval(vectorize(y(1))),'m-'); hold on
plot(t,eval(vectorize(y(2))),'r--'); hold on
Remember the components of y are symbolic expressions. Now that we have provided all the variables
with numerical values, we need to evaluate the numerical value of the solution components using
the MATLAB function eval. It also doesn’t hurt to use the function vectorize: the variable t is an
array. In case the expression for the solution components contained arithmetic operators of two or
more terms that referred to t (such as exp(t)*sin(t)) we would want the expressions to evaluate
element-by-element. vectorize replaces all references to operators such as “*” or “^” with “.*” or
“.^” so that these operators work on each scalar element of the arrays in turn.
which are complex, since λj are complex numbers. The solution is again written as in (4.6) but with
the important difference that all quantities on the right-hand side are complex while the left-hand
side is expected to be real.
The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of
the first one, $\lambda_2 = \bar{\lambda}_1$. We see this easily writing the complex conjugate of the equation A · z = λz
(see equation (3.18)). The two constants cj can be determined from the initial condition
and since y0 is real, the two constants must be complex conjugates of each other, $c_2 = \bar{c}_1$. The
constants are still determined by
\[ \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = V^{-1} y_0 . \]
Now we can follow all the derivations from the previous section, and the solution will still be arrived
at in the form of (4.7). Since both y and y0 are real, the product of the three complex matrices
\[ V \begin{bmatrix} e^{\lambda_1 t} & 0 \\ 0 & e^{\lambda_2 t} \end{bmatrix} V^{-1} \]
must also be real, and however surprising it may seem, it is real. (We can do the algebra by hand
or with MATLAB to check this.)
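One way to carry out that check numerically is sketched below. The data (m = 13, k = 6100, damping ratio 0.2) are borrowed from the illustration that follows, and the time instant is an arbitrary assumption. The product of the three complex matrices is formed and its imaginary part is measured.

```matlab
% Check that V*exp(Lambda*t)*inv(V) is real for subcritical damping (sketch).
m = 13; k = 6100; zeta = 0.2;
omega_n = sqrt(k/m);
c = zeta*2*m*omega_n;
A = [0, 1; -k/m, -c/m];

[V, Lambda] = eig(A);                   % complex conjugate eigenpairs
t = 0.1;                                % arbitrary time instant (assumed)
P = V*diag(exp(diag(Lambda)*t))/V;      % X/V is X*inv(V)

imag_part = norm(imag(P));              % round-off level only
real_err = norm(real(P) - expm(A*t));   % agreement with the matrix exponential
```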
Illustration 4
Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 0.2 (< 1 so that the
damping is subcritical), x0 = 0, and v0 = 1.
We shall follow the procedure (4.15). The MATLAB solution is based on the symbolic algebra
toolbox². First the definitions of the variables. The variable names are self-explanatory. The code is
pretty much the same as for the supercritically damped oscillator example above, except
function froscill_sub_symb
...
c=0.2*c_cr;
...
We may verify that the eigenvalues (and eigenvectors) are now general complex numbers. For instance
K>> D(1,1)
ans =
(-1/5+2/5*i*6^(1/2))*omega_n
It is rather satisfying to find that no modifications to the code of froscill_super_symb that was
written for the real (supercritical) case are required to account for the complex eigenvalues and
eigenvectors: it just works as is.
The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of the
first one, $\lambda_2 = \bar{\lambda}_1 = -i\omega_n$,
² See: aetna/LinearOscillator/froscill_sub_symb.m
\[ z_2 = \begin{bmatrix} z_{12} \\ z_{22} \end{bmatrix} = \bar{z}_1 = \begin{bmatrix} 1 \\ -i\omega_n \end{bmatrix} . \]
The general solution of the free undamped oscillator motion is a linear combination of the eigenvectors
\[ y = c_1 e^{\lambda_1 t} z_1 + c_2 e^{\lambda_2 t} z_2 . \]
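Because of the complex conjugate status of the pairs of the eigenvalues and eigenvectors, we have
\[ y = c_1 e^{\lambda_1 t} z_1 + c_2 e^{\bar{\lambda}_1 t} \bar{z}_1 . \]
Introducing the initial condition, which is real, we obtain
\[ y(0) = c_1 z_1 + c_2 \bar{z}_1 \]
and we must conclude $c_2 = \bar{c}_1$, otherwise the right-hand side couldn't be real. Using
\[ \mathrm{Re}\, a = (a + \bar{a})/2 \]
we see that the sum $c_1 z_1 + \bar{c}_1 \bar{z}_1$ therefore evaluates to $2\mathrm{Re}(c_1 z_1)$, and the constants can be determined from
\[ y(0) = 2\mathrm{Re}(c_1 z_1) = 2\left(\mathrm{Re}\,c_1\, \mathrm{Re}\,z_1 - \mathrm{Im}\,c_1\, \mathrm{Im}\,z_1\right) = 2 \left[\mathrm{Re}\,z_1 , -\mathrm{Im}\,z_1\right] \begin{bmatrix} \mathrm{Re}\,c_1 \\ \mathrm{Im}\,c_1 \end{bmatrix} . \]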
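We will introduce the matrix composed of the real and imaginary part of the eigenvector z1
\[ Z = \left[\mathrm{Re}\,z_1 , -\mathrm{Im}\,z_1\right] = \begin{bmatrix} 1 & 0 \\ 0 & -\omega_n \end{bmatrix} . \tag{4.16} \]
Then we can write
\[ \begin{bmatrix} \mathrm{Re}\,c_1 \\ \mathrm{Im}\,c_1 \end{bmatrix} = \frac{1}{2} Z^{-1} y(0) = \frac{1}{2} \begin{bmatrix} 1 & 0 \\ 0 & -\omega_n^{-1} \end{bmatrix} y(0) . \]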
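Using the same principle that we obtain a real number from the sum of the complex conjugates, we
write
\[ y = 2\mathrm{Re}\left(c_1 e^{\lambda_1 t} z_1\right) , \tag{4.17} \]
which may be expanded into
\[ y = 2\left[\mathrm{Re}\,c_1 \left(\cos\omega_n t\, \mathrm{Re}\,z_1 - \sin\omega_n t\, \mathrm{Im}\,z_1\right) - \mathrm{Im}\,c_1 \left(\sin\omega_n t\, \mathrm{Re}\,z_1 + \cos\omega_n t\, \mathrm{Im}\,z_1\right)\right] . \]
Then collecting the terms leads to the matrix expression
\[ y = 2 \left[\mathrm{Re}\,z_1 , -\mathrm{Im}\,z_1\right] \begin{bmatrix} \cos\omega_n t & -\sin\omega_n t \\ \sin\omega_n t & \cos\omega_n t \end{bmatrix} \begin{bmatrix} \mathrm{Re}\,c_1 \\ \mathrm{Im}\,c_1 \end{bmatrix} , \]
which after substitution of Re c1, Im c1 finally results in the matrix expression
\[ y = 2 Z \begin{bmatrix} \cos\omega_n t & -\sin\omega_n t \\ \sin\omega_n t & \cos\omega_n t \end{bmatrix} \frac{1}{2} Z^{-1} y(0) = Z R(t) Z^{-1} y(0) . \tag{4.18} \]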
We have in this way introduced the time-dependent rotation matrix
\[ R(t) = \begin{bmatrix} \cos\omega_n t & -\sin\omega_n t \\ \sin\omega_n t & \cos\omega_n t \end{bmatrix} . \tag{4.19} \]
The solution for the displacement and velocity of the linear single degree of freedom oscillator can
therefore be understood as the result of the rotation of the initial-value quantity Z⁻¹y(0) (phasor)
\[ Z^{-1} y(t) = R(t) Z^{-1} y(0) . \tag{4.20} \]
4.5 Undamped oscillator: alternative treatment 69
Illustration 5
Check that the procedure (4.15) and the alternative formula (4.18) lead to the same solution.
We don't want to do this by hand. It is faster to use the MATLAB symbolic algebra. The function
froscill_un_symb³ computes the solution twice, and then subtracts one from the other. If the
result is zero, the two solutions are the same.
The code begins with the same variable definitions and solution of the eigenvalue problem as
froscill_sub_symb. We compute the first solution using (4.15).
L =simple(inv(V)*A*V);
eLt =diag(exp(diag(L)*t));
y1=simple(V*eLt*inv(V))*y0;
Next, we compute the solution using the alternative with the rotation matrix (4.19).
Z =[real(V(:,1)),-imag(V(:,1))];
R = [cos(omega_n*t),-sin(omega_n*t);
sin(omega_n*t),cos(omega_n*t)];
y2 =simple(Z*R*inv(Z))*y0;
Finally we evaluate y1-y2.
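The same comparison can be sketched purely numerically, as a hedged alternative to the symbolic check. The natural frequency is computed from m = 13, k = 6100 (borrowed from the earlier illustrations), the time instant is arbitrary, and `expm` stands in for the eigenvector-based solution.

```matlab
% Numerical comparison of Z*R(t)*inv(Z) with the matrix exponential (sketch).
omega_n = sqrt(6100/13);
A = [0, 1; -omega_n^2, 0];               % undamped oscillator, c = 0

z1 = [1; 1i*omega_n];                    % eigenvector for lambda_1 = +i*omega_n
Z = [real(z1), -imag(z1)];               % as in (4.16)
t = 0.3;                                 % arbitrary time instant (assumed)
R = [cos(omega_n*t), -sin(omega_n*t);
     sin(omega_n*t),  cos(omega_n*t)];   % rotation matrix (4.19)

Y2 = Z*R/Z;                              % Z*R*inv(Z)
Y1 = expm(A*t);                          % reference propagator
difference = norm(Y1 - Y2);              % should be at round-off level
```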
Finally we can realize that the solution (4.20) is of the same form as that derived in Section 3.10
(as in (3.22)) and then again in the Illustration in Section 4.1. The new variables are w1 = y1 , w2 =
−y2 /ωn as in Section 4.1.
we see that (4.18) requires only a change of the matrix R, which should for the damped oscillator
read
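\[ R(t) = e^{\alpha t} \begin{bmatrix} \cos\omega t & -\sin\omega t \\ \sin\omega t & \cos\omega t \end{bmatrix} . \]
Here
\[ \alpha = -\frac{c}{2m} , \quad \omega = \sqrt{\frac{k}{m} - \left(\frac{c}{2m}\right)^2} . \]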
Let us note that ω is the frequency of damped oscillation.
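A hedged numerical sketch of the damped variant follows. It assumes the eigenvector z1 = [1; α + iω] (the same normalization as for the earlier eigenvectors [1; λ]), uses the subcritical data m = 13, k = 6100, damping ratio 0.2, and compares the result with `expm`.

```matlab
% Damped oscillator: Z*R(t)*inv(Z) with R(t) = exp(alpha*t)*rotation (sketch).
m = 13; k = 6100; zeta = 0.2;
c = zeta*2*sqrt(k*m);                    % subcritical damping
A = [0, 1; -k/m, -c/m];

alpha = -c/(2*m);
omega = sqrt(k/m - (c/(2*m))^2);         % frequency of damped oscillation
z1 = [1; alpha + 1i*omega];              % eigenvector for lambda_1 = alpha + i*omega
Z = [real(z1), -imag(z1)];

t = 0.2;                                 % arbitrary time instant (assumed)
R = exp(alpha*t)*[cos(omega*t), -sin(omega*t);
                  sin(omega*t),  cos(omega*t)];
difference = norm(Z*R/Z - expm(A*t));    % should be at round-off level
```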
³ See: aetna/LinearOscillator/froscill_un_symb.m
\[ \dot{y} = a y , \quad y(0) = y_0 . \]
\[ y = e^{at} y_0 , \]
that is as an exponential. Therefore it may not be a terrible stretch of imagination to anticipate the
solution to (4.1) to be formally identical so that the IVP
\[ \dot{y} = A \cdot y , \quad y(0) = y_0 \]
\[ y = e^{At} y_0 . \]
Of course, we must explain the meaning of the matrix exponential $e^{At}$. Even here the analogy
with the scalar case is of help: consider defining a scalar exponential using the Taylor series starting
from t = 0
\[ e^{at} = e^{a0} + a e^{a0} t + a^2 e^{a0} t^2/2 + \ldots = \sum_{k=0}^{\infty} \frac{a^k t^k}{k!} . \]
The matrix exponential could be defined (and in fact this is one of its definitions) as
\[ e^{At} = \sum_{k=0}^{\infty} \frac{A^k t^k}{k!} . \tag{4.21} \]
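For a general matrix A evaluating the infinite series would be difficult. Fortunately, for some
special matrices it turns out to be easy. Especially the nice diagonal matrix makes this a breeze:
\[ e^{Dt} = \sum_{k=0}^{\infty} \frac{D^k t^k}{k!} = \begin{bmatrix} \sum_{k=0}^{\infty} \frac{D_{11}^k t^k}{k!} & 0 & \ldots & 0 & 0 \\ 0 & \sum_{k=0}^{\infty} \frac{D_{22}^k t^k}{k!} & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & \sum_{k=0}^{\infty} \frac{D_{n-1,n-1}^k t^k}{k!} & 0 \\ 0 & 0 & \ldots & 0 & \sum_{k=0}^{\infty} \frac{D_{nn}^k t^k}{k!} \end{bmatrix} . \]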
This result is easily verified by just multiplying through the diagonal matrix with itself. Finally we
realize that on the right-hand side we have a matrix with exponentials $e^{D_{jj} t}$ on the diagonal
\[ e^{Dt} = \sum_{k=0}^{\infty} \frac{D^k t^k}{k!} = \begin{bmatrix} e^{D_{11} t} & 0 & \ldots & 0 & 0 \\ 0 & e^{D_{22} t} & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & e^{D_{n-1,n-1} t} & 0 \\ 0 & 0 & \ldots & 0 & e^{D_{nn} t} \end{bmatrix} . \tag{4.22} \]
This is very helpful indeed, since we already saw that having a full set of eigenvectors as in equation (4.13)
allows us to write the matrix A as similar to a diagonal matrix. Let us substitute into
the definition of a matrix exponential the similarity
\[ V^{-1} A V = \Lambda , \quad A = V \Lambda V^{-1} \]
4.7 Critically damped oscillator 71
as
\[ e^{At} = \sum_{k=0}^{\infty} \frac{A^k t^k}{k!} = \sum_{k=0}^{\infty} \frac{\left(V \Lambda V^{-1}\right)^k t^k}{k!} . \]
Now we work out the matrix powers: The zeroth and first,
\[ \left(V \Lambda V^{-1}\right)^0 = \mathbf{1} = V \mathbf{1} V^{-1} , \quad \left(V \Lambda V^{-1}\right)^1 = V \Lambda V^{-1} , \]
To compute the matrix exponential of the diagonal Λt is easy, so the only thing we need in order to
compute the exponential of At is a full set of eigenvectors of A. (Warning: There are matrices that
do not have a full set of linearly independent eigenvectors. Such matrices are called defective. More
details are discussed in the next section.)
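The three routes to the matrix exponential mentioned so far can be compared in a short sketch: the truncated series (4.21), the eigenvector form, and MATLAB's built-in `expm`. The test matrix, the time instant, and the number of series terms are arbitrary assumptions made for illustration.

```matlab
% Matrix exponential three ways: truncated series, eigen-decomposition, expm (sketch).
A = [0, 1; -4, -5];                 % arbitrary test matrix with eigenvalues -1 and -4
t = 0.5;

E_series = zeros(2);
term = eye(2);                      % (A*t)^0/0!
for kk = 0:30                       % 31 terms are ample for this small norm(A*t)
    E_series = E_series + term;
    term = term*(A*t)/(kk + 1);     % next term, (A*t)^(kk+1)/(kk+1)!
end

[V, Lambda] = eig(A);
E_eig = V*diag(exp(diag(Lambda)*t))/V;   % V*exp(Lambda*t)*inv(V)

E_ref = expm(A*t);                  % built-in reference
```

The series route needs no eigenvectors, which is why it remains usable for defective matrices, at the price of many matrix products.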
As a matter of fact we have been using the matrix exponential all along. The solution (4.7)
is of the form $V e^{\Lambda t} V^{-1}$. In equation (4.18) the matrix R(t) (rotation matrix) is also a matrix
exponential of a special matrix: the skew-symmetric matrix
\[ S = \begin{bmatrix} 0 & -\omega_n \\ \omega_n & 0 \end{bmatrix} = \omega_n \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} . \]
Constructing the infinite matrix series, this gives the correct Taylor expansions for cosines and
sines of the rotation matrix
\[ R(t) = e^{St} = \underbrace{\left(1 - \frac{\omega_n^2 t^2}{2!} + \frac{\omega_n^4 t^4}{4!} + \ldots\right)}_{\cos\omega_n t} \mathbf{1} + \underbrace{\left(\omega_n t - \frac{\omega_n^3 t^3}{3!} + \frac{\omega_n^5 t^5}{5!} + \ldots\right)}_{\sin\omega_n t} \frac{1}{\omega_n} S . \]
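A brief numerical sketch of this identity, with an arbitrary frequency and time instant chosen only for illustration:

```matlab
% The rotation matrix as the exponential of the skew-symmetric matrix S*t (sketch).
omega_n = 3; t = 0.7;                       % arbitrary values (assumed)
S = omega_n*[0, -1; 1, 0];
R = [cos(omega_n*t), -sin(omega_n*t);
     sin(omega_n*t),  cos(omega_n*t)];
gap = norm(expm(S*t) - R);                  % should be at round-off level
```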
Inconveniently, this is the only eigenvector that we are going to get for the case of the critically
damped oscillator. Since we obtained a double real root, the second eigenvector is exactly the same
as the first. We say inconveniently, because our approach was developed for an invertible eigenvector
matrix
\[ V = [z_1 , z_2] \]
and it will now fail since both columns of V are the same, and such a matrix is not invertible.
We call matrices that have missing eigenvectors defective. For the critically damped oscillator the
matrix A is defective.
[Figure 4.2: eigenvalues of the oscillator in the complex plane for ζ = 1.005; horizontal axis Re λ from −25 to 25, vertical axis Im λ from −20 to 20.]
Let us approach the degenerate case of the critically damped oscillator as the limit of the
supercritically damped oscillator whose two eigenvalues approach each other to become one. Figure 4.2
shows a circle of radius equal to ωn for the data of the IVP (4.1) set to m = 13, k = 6100,
ζ = 1.005 (in other words close to critical damping). The two (real) eigenvalues are indicated by
small circular markers (the function animated_eigenvalue_diagram⁴ illustrates with an animation
how the eigenvalues change depending on the amount of damping). For critical damping (ζ = 1.0)
the two eigenvalues would merge on the black circle and become one real eigenvalue (also referred to
⁴ See: aetna/LinearOscillator/animated_eigenvalue_diagram.m
as a repeated eigenvalue). As the eigenvalues approach each other, λ2 → λ1, the solution may still
be written as
\[ y = c_1 e^{\lambda_1 t} z_1 + c_2 e^{\lambda_2 t} z_2 . \]
In order to understand the behavior of the eigenvalues as they approach each other, we can write
the exponential $e^{\lambda_2 t}$ using the Taylor series with λ1 as the starting point
\[ e^{\lambda_2 t} = e^{\lambda_1 t} + \left.\frac{d}{d\lambda_2} e^{\lambda_2 t}\right|_{\lambda_1} (\lambda_2 - \lambda_1) + \ldots = e^{\lambda_1 t} + t e^{\lambda_1 t} (\lambda_2 - \lambda_1) + \ldots . \]
From this result we conclude that as λ2 → λ1, a linearly independent basis will be the two functions
$e^{\lambda_1 t}$ and $t e^{\lambda_1 t}$.
With essentially the same reasoning we can now look for the missing eigenvector. Write (again
assuming λ2 → λ1)
\[ z_2 \approx z_1 + (\lambda_2 - \lambda_1) \left.\frac{dz_2}{d\lambda_2}\right|_{\lambda_1} . \]
This allows us to subtract the two eigenvector equations from each other to obtain
\[ \begin{array}{rl} (+) & A z_2 = \lambda_2 z_2 \\ (-) & A z_1 = \lambda_1 z_1 \\ \hline & A (z_2 - z_1) = \lambda_2 z_2 - \lambda_1 z_1 \end{array} , \]
Note that $\left.\frac{dz_2}{d\lambda_2}\right|_{\lambda_1}$ has the direction of the difference between the two vectors z2 and z1. Since z2
and z1 are linearly independent vectors for λ2 ≠ λ1, so are the vectors z1 and $\left.\frac{dz_2}{d\lambda_2}\right|_{\lambda_1}$. Therefore,
when λ2 = λ1, we can obtain a full set of linearly independent vectors that go with the double root
as the two vectors z1 and p2 that solve
\[ A z_1 = \lambda_1 z_1 , \quad A p_2 = z_1 + \lambda_2 p_2 . \tag{4.23} \]
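Here p2 is not an eigenvector. Rather, it is called a principal vector. To continue with our critically
damped oscillator: we can compute the principal vector as
\[ \begin{bmatrix} 0 & 1 \\ -\frac{k}{m} & -\frac{c}{m} \end{bmatrix} \begin{bmatrix} p_{12} \\ p_{22} \end{bmatrix} = \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} + \lambda_2 \begin{bmatrix} p_{12} \\ p_{22} \end{bmatrix} , \]
or, upon substitution,
\[ \begin{bmatrix} 0 & 1 \\ -\left(\frac{c}{2m}\right)^2 & -\frac{c}{m} \end{bmatrix} \begin{bmatrix} p_{12} \\ p_{22} \end{bmatrix} = \begin{bmatrix} 1 \\ -\frac{c}{2m} \end{bmatrix} - \frac{c}{2m} \begin{bmatrix} p_{12} \\ p_{22} \end{bmatrix} , \]
or, rearranging the terms,
\[ \begin{bmatrix} \frac{c}{2m} & 1 \\ -\left(\frac{c}{2m}\right)^2 & -\frac{c}{2m} \end{bmatrix} \begin{bmatrix} p_{12} \\ p_{22} \end{bmatrix} = \begin{bmatrix} 1 \\ -\frac{c}{2m} \end{bmatrix} . \]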
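Since the matrix on the left-hand side is singular, the principal vector is not determined uniquely.
One possible solution, which satisfies both rows of the rearranged system, is
\[ p_2 = \begin{bmatrix} p_{12} \\ p_{22} \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} . \]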
Similarly as for the general oscillator eigenproblem (4.10) which could be written in the matrix
form (4.11), we can write here for the critically damped oscillator
\[ A [z_1 , p_2] = [z_1 , p_2] \begin{bmatrix} \lambda_1 & 1 \\ 0 & \lambda_2 \end{bmatrix} , \tag{4.24} \]
\[ M = [z_1 , p_2] . \]
We see that for critical damping the matrix A cannot be diagonalized (i.e. be made similar to a
diagonal matrix). It becomes defective (i.e. it doesn’t have a full set of eigenvectors). The best we
can do is to make it similar to the Jordan matrix
\[ M^{-1} A M = J . \tag{4.26} \]
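A hedged numerical sketch of this similarity for the critically damped oscillator follows. The data m = 13, k = 6100 come from the illustrations of this chapter. The principal vector p2 = [0; 1] is one admissible choice (it satisfies A*p2 = z1 + λ*p2), picked here as an assumption since p2 is not unique.

```matlab
% Jordan form of the critically damped oscillator matrix (sketch).
m = 13; k = 6100;
omega_n = sqrt(k/m);
c = 2*m*omega_n;                  % critical damping, zeta = 1
A = [0, 1; -k/m, -c/m];

lambda = -c/(2*m);                % the repeated eigenvalue
z1 = [1; lambda];                 % the only eigenvector
p2 = [0; 1];                      % one admissible principal vector (assumed choice)
M = [z1, p2];

J = M\(A*M);                      % expected: [lambda, 1; 0, lambda]
```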
Illustration 6
Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 1.0 (critical damping),
x0 = 0, and v0 = 1.
We shall follow the procedure that leads to the Jordan matrix. The MATLAB solution is based
on the symbolic algebra toolbox⁵.
The solution to the eigenvalue problem yields a rectangular one-column V. Therefore we solve for
the principal vector p2 , and we form the matrix M
⁵ See: aetna/LinearOscillator/froscill_crit_symb.m
4.8 Annotated bibliography 75
Illustration 7
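Compute the matrix exponential of the Jordan matrix
\[ J = t \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix} . \]
Solution: The matrix can be decomposed as
\[ J = t\lambda \mathbf{1} + t \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} = t\lambda \mathbf{1} + t\Theta . \]
Because we have
\[ (t\lambda \mathbf{1}) (t\Theta) = (t\Theta) (t\lambda \mathbf{1}) \]
(i.e. the matrices commute), it holds for the matrix exponential
\[ e^{t\lambda \mathbf{1} + t\Theta} = e^{t\lambda \mathbf{1}} e^{t\Theta} = e^{t\Theta} e^{t\lambda \mathbf{1}} . \]
The exponential of the diagonal matrix is easy: see equation (4.22). For the matrix Θ using the
definition (4.21) we readily get
\[ e^{t\Theta} = \sum_{k=0}^{\infty} \frac{\Theta^k t^k}{k!} = \mathbf{1} + t\Theta \]
because all its powers from the second on are zero matrices, Θ² = 0, and so on. Therefore, we have
\[ e^{t\lambda \mathbf{1} + t\Theta} = e^{t\lambda \mathbf{1}} e^{t\Theta} = e^{\lambda t} \mathbf{1} (\mathbf{1} + t\Theta) = e^{\lambda t} \begin{bmatrix} 1 & t \\ 0 & 1 \end{bmatrix} . \]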
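The closed form can be checked against MATLAB's `expm` in a few lines. The values of λ and t below are arbitrary assumptions made for illustration.

```matlab
% Exponential of the Jordan matrix t*[lambda, 1; 0, lambda] (sketch).
lambda = -2; t = 0.4;                        % arbitrary values (assumed)
J = t*[lambda, 1; 0, lambda];
Theta = [0, 1; 0, 0];                        % nilpotent part, Theta^2 = 0

E_closed = exp(lambda*t)*[1, t; 0, 1];       % closed form derived above
gap = norm(expm(J) - E_closed);              % should be at round-off level
```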
Summary
1. For the multiple degree of freedom linear vibrating system we study how to transform between
the second order and the first order matrix form. Modal analysis is discussed in detail for both
forms.
2. Modal analysis decouples the equations of the multiple degree of freedom system. The original
coupled system may be understood in terms of the individual modal components. Main
idea: whether coupled or uncoupled, the response of the system is determined by the modal
characteristics. Each uncoupled equation evolves as governed by its own eigenvalue.
3. We can analyze a scalar real or complex linear differential equation to gain insight into the
stability behavior. When the equations are coupled, stability is usually decided by the fastest
changing component of the solution (as dictated by the largest eigenvalue). This information is
used to select the time step for direct integration of the equations of motion.
4. The frequency content (spectrum) is a critical piece of information. We use the Fourier transform
and we discuss the Nyquist frequency.
5. The first-order form of the vibrating system equations is used to analyze damped systems.
M ẍ = −Kx − C ẋ , (5.1)
where M is the mass matrix, K is the stiffness matrix, C is the damping matrix, and x is the vector
of displacements. In conjunction with the initial conditions
x(0) = x0 , ẋ(0) = v0
this will define the multi-degree of freedom (dof) damped oscillator IVP. Using the definition
v = ẋ
will yield the general first-order form of the multi-dof damped oscillator IVP as
\[ \dot{y} = A \cdot y , \quad y(0) = y_0 , \tag{5.2} \]
where
\[ A = \begin{bmatrix} 0 & \mathbf{1} \\ -M^{-1} K & -M^{-1} C \end{bmatrix} \]
78 5 Linear Multiple Degree of Freedom Oscillator
and
\[ y = \begin{bmatrix} x \\ v \end{bmatrix} . \]
The vector variable y collects both the vector of displacements x and the vector of velocities v.
Figure 5.1 shows an example of a multi-degree of freedom oscillator that is physically realized as
three carriages connected by springs and dampers. This will be our sample mechanical system that
will be studied in the following sections.
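[Figure 5.1: three carriages of masses m1, m2, m3 connected in series by spring and damper pairs (k1, c1), (k2, c2), (k3, c3), with displacements x1, x2, x3.]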
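Assembling the first-order system matrix from M, K, and C can be sketched as follows. All numerical values are assumptions made for illustration: equal masses and stiffnesses, the tridiagonal stiffness pattern that appears in the characteristic determinant later in this chapter, and a damping matrix of the same pattern with a made-up coefficient.

```matlab
% First-order system matrix for the three-carriage oscillator (sketch).
m = 130; k = 6100; c = 20;               % assumed illustrative values, not from the text
M = m*eye(3);                            % mass matrix
K = k*[2, -1, 0; -1, 2, -1; 0, -1, 1];   % stiffness matrix
C = c*[2, -1, 0; -1, 2, -1; 0, -1, 1];   % damping matrix (assumed pattern)

n = size(M, 1);
A = [zeros(n), eye(n);                   % xdot = v
     -(M\K),   -(M\C)];                  % vdot = -inv(M)*K*x - inv(M)*C*v

lam = eig(A);                            % six eigenvalues of the damped system
```

Using `M\K` rather than `inv(M)*K` avoids forming the inverse explicitly, which is the usual MATLAB idiom.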
The second order equations of motion (5.1) have a solution (this is an educated guess)
\[ x = e^{\lambda t} z , \]
\[ \lambda^2 M e^{\lambda t} z = -K e^{\lambda t} z . \]
\[ -\lambda^2 M z = K z , \]
\[ \omega^2 M z = K z \tag{5.3} \]
Similarly to the characteristic equation for the standard eigenvalue problem (3.15) we can write
\[ \det\left(K - \omega^2 M\right) = 0 . \tag{5.4} \]
5.2 Undamped vibrations 79
Illustration 1
For the stiffness and mass matrices given above, the characteristic polynomial is
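\[ \det\left( \begin{bmatrix} 2k & -k & 0 \\ -k & 2k & -k \\ 0 & -k & k \end{bmatrix} - \omega^2 \begin{bmatrix} m & 0 & 0 \\ 0 & m & 0 \\ 0 & 0 & m \end{bmatrix} \right) = k^3 - 6k^2 m \omega^2 + 5 k m^2 (\omega^2)^2 - m^3 (\omega^2)^3 \]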
Illustration 2
For the stiffness and mass matrices given above, the characteristic equation is
\[ k^3 - 6k^2 m \omega^2 + 5 k m^2 (\omega^2)^2 - m^3 (\omega^2)^3 = 0 . \]
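The characteristic equation is a cubic in ω², so its roots can be sketched with MATLAB's `roots`. The values m = 130 and k = 6100 are an assumption: they are not stated in this part of the text, but they reproduce the eigenvalues quoted further on (9.2937, 72.9634, 152.3583).

```matlab
% Roots of the characteristic equation, treated as a cubic in omega^2 (sketch).
m = 130; k = 6100;                        % assumed values (see the lead-in)

% Coefficients in descending powers of omega^2
p = [-m^3, 5*k*m^2, -6*k^2*m, k^3];
omega2 = sort(real(roots(p)));            % the three roots omega_j^2, ascending
omega = sqrt(omega2);                     % angular frequencies
```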
[Figure: the characteristic polynomial plotted against ω²; horizontal axis ω² from 0 to 150, vertical axis in units of 10⁵.]
The eigenvalues (and eigenvectors) of the generalized eigenvalue problem are known to be
real for M , K symmetric. Also, when the stiffness matrix is nonsingular, the eigenvalues will be
positive. Hence we write
\[ -\lambda^2 = \omega^2 \ge 0 . \]
¹ See: aetna/ThreeCarriages/n3_undamped_modes_MK_symbolic.m
² See: aetna/ThreeCarriages/n3_undamped_modes_MK.m
For the above matrices, the eigenvalues are ω1² = 9.2937 (i.e. angular frequency ω1 = ±3.0486),
ω2² = 72.9634 (i.e. angular frequency ω2 = ±8.5419), and ω3² = 152.3583 (i.e. angular frequency
ω3 = ±12.3433). Therefore, we see that the λ's are all imaginary, λj = ±i ωj. Note that there
are three eigenvalues, but each eigenvalue generates two solutions because of the ± for the square
roots. That is necessary, because there are six constants needed to satisfy the initial conditions (two
conditions, each with three equations).
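The generalized eigenvalue problem (5.3) can be solved directly with the two-argument form of `eig`. This is a minimal sketch; m = 130 and k = 6100 are assumed values, chosen because they reproduce the eigenvalues quoted above.

```matlab
% Generalized eigenvalue problem K*z = omega^2*M*z (sketch).
m = 130; k = 6100;                         % assumed values (see the lead-in)
M = m*eye(3);
K = k*[2, -1, 0; -1, 2, -1; 0, -1, 1];

[Z, W2] = eig(K, M);                       % columns of Z are the mode shapes
[omega2, ix] = sort(diag(W2));             % ascending omega_j^2
Z = Z(:, ix);                              % keep mode shapes in the same order
omega = sqrt(omega2);                      % angular frequencies
lambda = [1i*omega; -1i*omega];            % the six roots lambda = +/- i*omega_j
```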
The solutions are therefore found to be both
\[ x = e^{+i\omega_j t} z_j \quad \text{and} \quad x = e^{-i\omega_j t} z_j , \]
which are complex vectors. The solution however needs to be real. This is easily accomplished by
taking as the solutions a linear combination of the above, for instance
\[ x = \mathrm{Re}\left(e^{+i\omega_j t} + e^{-i\omega_j t}\right) z_j \quad \text{and} \quad x = \mathrm{Im}\left(e^{+i\omega_j t} - e^{-i\omega_j t}\right) z_j . \]
From Euler's formula we know that
\[ \mathrm{Re}\left(e^{+i\omega_j t} + e^{-i\omega_j t}\right) = 2\cos\omega_j t \]
and
\[ \mathrm{Im}\left(e^{+i\omega_j t} - e^{-i\omega_j t}\right) = 2\sin\omega_j t . \]
Therefore, we can take as the six linearly independent solutions (two for each j = 1, 2, 3)
\[ x = \cos\omega_j t\, z_j \quad \text{and} \quad x = \sin\omega_j t\, z_j . \]
In this way we will obtain enough integration constants to satisfy the initial conditions, since the
general solution may be written as
\[ x = \sum_{j=1}^{3} \left(A_j \cos\omega_j t + B_j \sin\omega_j t\right) z_j . \]
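How the six constants follow from the initial conditions can be sketched as below: setting t = 0 gives x0 = Σ Aj zj, and differentiating gives v0 = Σ ωj Bj zj. The masses, stiffness, initial conditions, and time instant are all assumptions made for illustration, and the result is compared with the first-order form via `expm`.

```matlab
% Constants A_j, B_j of the general undamped solution from x0 and v0 (sketch).
m = 130; k = 6100;                         % assumed values
M = m*eye(3);
K = k*[2, -1, 0; -1, 2, -1; 0, -1, 1];
[Z, W2] = eig(K, M);                       % mode shapes and omega_j^2
omega = sqrt(diag(W2));

x0 = [0.1; 0; 0]; v0 = [0; 0; 1];          % assumed initial conditions
Acoef = Z\x0;                              % from x(0)    = sum A_j*z_j
Bcoef = (Z\v0)./omega;                     % from xdot(0) = sum omega_j*B_j*z_j

t = 0.37;                                  % arbitrary time instant (assumed)
x = Z*(Acoef.*cos(omega*t) + Bcoef.*sin(omega*t));

A = [zeros(3), eye(3); -(M\K), zeros(3)];  % first-order form, no damping
y_ref = expm(A*t)*[x0; v0];
err = norm(x - y_ref(1:3));                % should be at round-off level
```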
The undamped mode shapes for our example are shown in Figure 5.2, both graphically as arrows
and numerically as the values of the components.3
Next we will explore the free vibration of the same system in its first-order form. The system matrix
is (note: no damping)
\[ A = \begin{bmatrix} 0 & \mathbf{1} \\ -M^{-1} K & 0 \end{bmatrix} . \]
Fig. 5.2. Linear 3-degree of freedom oscillator: second-order model, undamped modes
We can reorder them using the sort function: the first line sorts the diagonal elements by ascending
modulus, the second line re-orders the rows and columns of D, and constructs the new D, the third
line then reorders the columns of V .
[Ignore,ix] = sort(abs(diag(D)));
D =D(ix,ix);
V =V(:,ix);
Here is the reordered D (be sure to compare with the eigenvalues computed in the previous section
for the generalized EP)
D =
   0+3.05i   0         0         0         0         0
   0         0-3.05i   0         0         0         0
   0         0         0+8.54i   0         0         0
   0         0         0         0-8.54i   0         0
   0         0         0         0         0+12.3i   0
   0         0         0         0         0         0-12.3i
Note that the eigenvalues come in complex conjugate pairs. The corresponding eigenvectors are
also complex conjugate. Each pair of complex conjugate eigenvalues corresponds to a one-degree of
freedom oscillator with complex-conjugate solutions.
Figure 5.3 illustrates graphically the modes of the A matrix. There are six components to each
eigenvector: the first three elements represent the components of the displacement, and the last
three elements represent the components of the velocity. Therefore, the eigenvectors are visualized
using two arrows at each mass. We use the classical complex-vector (phasor) representation: the
real part is on the horizontal axis, and the imaginary part is on the vertical axis. Note that all
displacement components (green) are purely imaginary, while all the velocity components (red) are
real. An animation of the motion described by a single eigenvector
Fig. 5.3. Linear 3-degree of freedom oscillator: first-order model, undamped modes
Figure 5.4 shows the free-vibration response to excitation in the form of the initial condition set
to (the real part of) mode 2.⁶ Note that the displacements go through zero at the same time, and
that the amplitude does not change.
[Figure 5.4: displacements y(1:3), ranging between −0.25 and 0.25, plotted against time t from 0 to 6.]
Fig. 5.4. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to
initial condition in the form of mode 2.
We have made the observation that the eigenvalues and eigenvectors come in complex conjugate
pairs. Each pair of complex conjugate eigenvalues corresponds to a one-degree of freedom oscillator
with complex-conjugate solutions. We have shown in Section 4.3 that all the individual eigenvalue
problems for the 2 × 2 matrix A may be written as one matrix expression
AV = V Λ ,
where each column of V corresponds to one eigenvector, and the eigenvalues are the diagonal elements
of the diagonal matrix Λ. So that provided V was invertible, the matrix A was similar to a
⁵ See: aetna/ThreeCarriages/n3_undamped_A_animation.m
⁶ See: aetna/ThreeCarriages/n3_undamped_IC.m
5.3 Direct time integration and eigenvalues 83
diagonal matrix (4.13). Exactly the same transformation may be used no matter what the size of
the matrix A. The 6 × 6 A is also similar to a diagonal matrix
\[ V^{-1} A V = D \]
using the matrix of eigenvectors V. Therefore, the original IVP (5.2) may be written in the
completely equivalent form
\[ \dot{w} = D \cdot w , \quad w(0) = V^{-1} y_0 \tag{5.5} \]
for the new variables, the modal coordinates, w. Each modal coordinate wj is independent of the
others since the matrix D is diagonal.
Let us consider the task of finding a numerical solution to the IVP (5.2) by the so-called direct time
integration using a MATLAB integrator. Let us assume such an integrator is conditionally stable for
our current vibration system (pure oscillation, no decay, no growth). As an example, let us take
the fourth-order Runge-Kutta integrator (oderk4⁷). The amplification factor for this method when
applied to the scalar IVP for the modal coordinate
\[ \dot{w} = \lambda w , \quad w(0) = w_0 \]
reads
\[ \alpha = 1 + \Delta t \lambda + \frac{(\Delta t \lambda)^2}{2} + \frac{(\Delta t \lambda)^3}{6} + \frac{(\Delta t \lambda)^4}{24} . \]
The stability diagram is shown in Figure 3.24. The intersection of the imaginary axis with the level
curve |α| = 1 of the amplification factor gives one and only one stable time step for purely oscillatory
solutions. Numerically we can solve for the corresponding stable time step with fzero as
F=@(dt)(abs(1+(dt*lambda)+(dt*lambda)^2/2+(dt*lambda)^3/6+(dt*lambda)^4/24)-1);
dt =fzero(F, 1.0)
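A self-contained variant of this computation is sketched below. The eigenvalue is the lowest one quoted in the text; the bracketing interval is an assumption introduced here so that `fzero` cannot converge to the trivial root at ∆t = 0. For comparison, expanding |α|² on the imaginary axis gives 1 − x⁶/72 + x⁸/576 with x = ∆t|λ|, so the crossing is at x = 2√2.

```matlab
% Stable RK4 time step for a purely imaginary eigenvalue (sketch).
lambda = 3.0486i;                               % lowest-mode eigenvalue from the text
F = @(dt) abs(1 + (dt*lambda) + (dt*lambda)^2/2 + ...
              (dt*lambda)^3/6 + (dt*lambda)^4/24) - 1;

dt_guess = 2.8/abs(lambda);                     % rough location of the crossing (assumed)
dt = fzero(F, [0.5*dt_guess, 1.5*dt_guess]);    % bracket with a sign change of F

dt_exact = 2*sqrt(2)/abs(lambda);               % closed-form crossing on the imaginary axis
```

The result, approximately 0.9278, agrees with the first entry of the list of stable time steps given later in this section.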
Integrating with the stable time step leads to an oscillating solution with unchanging amplitude.
Using a longer time step yields oscillating solutions with increasing amplitude, and decreasing the
time step leads to oscillations with decaying amplitude. Figure 5.5 was produced by the script
n3_undamped_direct_modal⁸. The modal coordinate w2 (λ2 = 3.0486i) was integrated by oderk4
with a stable time step ∆t (horizontal curve), slightly longer time step 1.00001∆t (rising curve), and
shorter time step ∆t/10 (dropping curve), and it is a good illustration of the above derivation.
If we were to numerically integrate the IVP (5.5), i.e. the uncoupled form of the original (5.2),
we could integrate each equation separately from all the others since in the uncoupled form they
are totally independent. Hence we could also use different time steps for different equations. Let us
say we were to use a conditionally stable integrator such as the oderk4. Then for each equation j
we could find a stable time step and integrate wj with that time step. Of course to construct the
original solution as y = V w would take additional work: All the wj would be computed at different
time instants, whereas all the components of y should be known at the same time instants.
Alternately, if we were to integrate the original IVP (5.2) in the coupled form, the uncoupled
modal coordinates wj would still be present in the solution y, only now they would be mixed
together (coupled) in the variables yk . Again, let us assume that we need to use a conditionally
stable integrator such as the oderk4. However, now we have to use only one time step for all the
components of the solution. It would be in general impossible for purely oscillatory solutions to
⁷ See: aetna/utilities/ODE/integrators/oderk4.m
⁸ See: aetna/ThreeCarriages/n3_undamped_direct_modal.m
[Figure 5.5: left panel, Im(w2) versus Re(w2), both from −0.5 to 0.5; right panel, |w| versus t from 0 to 100.]
Fig. 5.5. Integration of modal coordinate w2 (λ2 = 3.0486i). The real and imaginary part of the solution
(phase-space diagram) on the left, absolute value of the complex solution on the right. Integrated with stable
time step ∆t (exactly one circle on the left, on the horizontal curve on the right), slightly longer time step
1.00001∆t (increasing radius on the left, rising curve on the right), and shorter time step ∆t/10 (decreasing
radius on the left, dropping curve on the right)
integrate at a time step that was stable for all wj at the same time. If we cannot integrate all
solution components so that their amplitude of oscillation is conserved, then we would probably
elect to have the amplitudes decay rather than grow. Therefore, we would integrate the coupled IVP
with the time step equal to or shorter than the shortest stable time step. For our example the stable
time step lengths are⁹
dts =
0.9278 0.9278 0.3311 0.3311 0.2291 0.2291
The shortest stable time step (for solution components five and six) is ∆tmin ≈ 0.2291. Figure 5.6
shows that running the integrator at the shortest stable time step yields a solution of the original,
coupled, vibrating system which is non-growing (decaying oscillations), because two components
are integrated at the stable time step (and therefore their amplitude is maintained), and the first
four components are integrated below their stable time step and hence their amplitude decays.
Running the integration at just a slightly longer time step than ∆tmin means that the first four
components are still integrated below their stable time steps. Their amplitude will still decay. The
last two components are integrated very slightly above their stable time step, which means that the
amplification factor for them is just a tad greater than one. We can clearly see how that can easily
destroy the solution as we get a sharply growing oscillation amplitude of the coupled solution (on
the right).
The eigenvalues of the matrix A of the IVP (5.2) (sometimes referred to as the spectrum of A) need
to be taken into account when the IVP is to be integrated numerically. We have shown the reasons
for this above, and now we are going to try to summarize a few practical rules.
• If the decoupling of the original system is feasible and cost-effective, each of the resulting in-
dependent modal equations can be integrated separately with its own time step. In particular,
exponentially decaying (or growing) solutions may require the time step to be smaller than some
appropriate length for stability. Purely oscillating solutions may also pose a limit on the time
step, depending on the integrator. To achieve stability we need to solve for an appropriate time
step from the amplification factor as shown for instance above for the fourth-order Runge-Kutta
integrator, or for the Euler integrators in Chapter 3.
9 See: aetna/ThreeCarriages/n3 undamped stable rk4.m
5.4 Analyzing the frequency content 85
[Figure 5.6: two panels showing y(1:3) versus t, 0 ≤ t ≤ 50]
Fig. 5.6. Integration of the undamped IVP with the shortest stable time step ∆tmin (non-growing solution
on the left), and slightly longer time step than the shortest stable time step 1.002∆tmin (growing solution
on the right)
• All types of solutions may also require a time step that provides sufficient accuracy. In this respect
we should remember that equations should not be integrated at a time step that is longer than the
stable time step. Therefore we first consider stability, and then, if necessary, we further shorten
the time step length for accuracy. For oscillating solutions, good accuracy is typically achieved if
the time step is less than 1/10 of the period of oscillation. In particular, let us say we got a purely
imaginary eigenvalue for the jth mode, λj = iωj. Then the time step for acceptable accuracy
should be
$$\Delta t \le \frac{T_j}{10},$$
where Tj is the period of vibration for the jth mode
$$T_j = \frac{2\pi}{\omega_j}.$$
• If the equations cannot be decoupled (such as when the cost of solving the complete eigenvalue
problem is too high), the system has to be integrated in its coupled form. Firstly, we shall
think about stability. A time step must be chosen that works well for all the eigenvalues and
eigenvectors in the system. That shouldn’t be a problem for unconditionally stable integrators:
they would give reasonable answers for any time step length. Unfortunately, there is really only
one such integrator on our list, the trapezoidal rule. For conditionally stable integrators we have
to choose a suitable time step length. In particular, we would most likely try to avoid integrating
at a time step length that would make some of the solution components grow when they should
not grow (oscillating or decaying components). Then we should choose a time step that is the
smallest of all the time step limits computed for the individual eigenvector/eigenvalue pairs.
Secondly, the time step is typically assessed with respect to accuracy requirements; this was
discussed above.
More on the topic of time step selection follows in the next two sections, which deal with solutions to
initial boundary value problems.
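The rules above can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: the eigenvalues are taken as purely imaginary, λj = iωj, the three frequencies are the ones quoted for the undamped system in this chapter, and the amplification factor is the standard one for the classical fourth-order Runge-Kutta method. The variable names are made up for this sketch and are not part of the aetna toolbox.

```matlab
% Sketch: time step limits for purely imaginary eigenvalues lambda_j = i*omega_j.
% Assumption: natural frequencies in Hz as quoted in the text.
f = [0.485; 1.359; 1.965];
omega = 2*pi*f;                                   % angular frequencies [rad/s]
% Amplification factor of the classical fourth-order Runge-Kutta method
rk4amp = @(z) abs(1 + z + z.^2/2 + z.^3/6 + z.^4/24);
% On the imaginary axis the modulus equals one when omega*dt = 2*sqrt(2)
dt_stable = 2*sqrt(2)./omega;                     % amplitude-preserving step per mode
amp_at_limit = rk4amp(1i*omega.*dt_stable);       % equals 1 up to round-off
amp_above = rk4amp(1i*omega.*dt_stable*1.002);    % slightly above the limit: growth
% Accuracy rule of thumb: at most one tenth of the period of each mode
dt_accurate = (2*pi./omega)/10;
% Coupled integration uses a single step: the smallest of all the limits
dt_coupled = min(min(dt_stable), min(dt_accurate));
```

For these frequencies the accuracy rule (one tenth of the shortest period) turns out to be more restrictive than the stability limit, so it would govern the choice of the time step.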
Next we look at a couple of experiments that will provide insight into the frequency content of the
response. First we simulate the free vibration of the undamped system, with the initial condition
being a mixture of the modes 1, 2, 5, 6.¹⁰ The “measurement” of the response will be the displacement
of the mass 3, which the simulation will give us as a “discrete signal”. The signal is a sequence of
numbers xj measured at equally spaced time instants tj such that tj − tj−1 = ∆t.
10 See: aetna/ThreeCarriages/n3 undamped fft.m
The sampling interval is a critical quantity. With a given sampling interval length it is only pos-
sible to sample signals faithfully up to a certain frequency. Figure 5.7 shows two signals of different
frequencies sampled with the same sampling interval. Even though the signals have different fre-
quencies, their sampling produces exactly the same numbers and therefore we would be interpreting
them as one and the same. This is called aliasing. The so-called Nyquist rate 1/∆t is the minimum
sampling rate required to avoid aliasing, i.e. viewing two very different frequencies as being the same
due to inadequate sampling.
[Figure 5.7: two signals s(t) of different frequencies sharing the same sample points]
Fig. 5.7. Illustration of the Nyquist rate. Sampling at a rate that is lower than the Nyquist rate for the
signal represented with the dashed line. Clearly, as far as the information obtained from the sampling is
concerned, the two signals shown in the figure are completely equivalent, even though they have different frequencies.
We can see from Figure 5.8 that the Nyquist rate is twice the frequency
we wish to reproduce faithfully. The highest frequency that is reproduced faithfully when sampling at the
rate 1/∆t is the Nyquist frequency
$$f_{Ny} = \frac{1}{2}\,\frac{1}{\Delta t}, \qquad (5.6)$$
where ∆t is the sampling interval. If we sample at an even higher rate (with a smaller sampling
interval), the signal is going to be reproduced much better; on the other hand, if we sample slower, below
the Nyquist rate, i.e. with a longer sampling interval, the signal is going to be aliased: we will get
the wrong idea of its frequency.
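Aliasing is easy to reproduce numerically. The following minimal sketch (the sampling interval and the two frequencies are arbitrary choices made for this illustration) samples two sinusoids whose frequencies differ by exactly the sampling rate 1/∆t; the second one lies far above the Nyquist frequency, and its samples coincide with those of the first.

```matlab
% Sketch: two signals of different frequency that produce the same samples.
dt = 0.1;                    % sampling interval [s] (arbitrary choice)
fs = 1/dt;                   % sampling rate [Hz]
f_Ny = fs/2;                 % Nyquist frequency (5.6)
t = (0:49)*dt;               % sampling instants
f1 = 2;                      % below the Nyquist frequency: sampled faithfully
f2 = f1 + fs;                % above the Nyquist frequency: aliased onto f1
s1 = sin(2*pi*f1*t);
s2 = sin(2*pi*f2*t);
maxdiff = max(abs(s1 - s2)); % zero up to round-off: the samples coincide
```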
In order to extract the frequencies that contribute to the response from the measured signal we
perform an FFT analysis. A quick refresher: The discrete Fourier transform (DFT) is expressed
by the formula
$$A_m = \frac{1}{N}\sum_{n=1}^{N} e^{-i2\pi(n-1)(m-1)/N}\, a_n, \quad m = 1, \ldots, N \qquad (5.7)$$
that links two sets of numbers, the input signal an and its Fourier transform coefficients Am. The
Fast Fourier transform (FFT) is simply a fast way of multiplying with the complex transform matrix,
i.e. evaluating the sum $\sum_{n=1}^{N} e^{-i2\pi(n-1)(m-1)/N} a_n$.
The Fourier transform (Fourier series) of a periodic function x(t) with period T is defined as
$$x(t) = \sum_{m=-\infty}^{\infty} X_m\, e^{im(2\pi/T)t}, \qquad (5.8)$$
where
$$X_m = \frac{1}{T}\int_0^T x(t)\, e^{-im(2\pi/T)t}\, dt. \qquad (5.9)$$
Here 2π/T = ω0 is the fundamental frequency. The following illustration shows how equation (5.7)
that defines the transformation between the Fourier coefficients and the input discrete signal can be
obtained from the above expressions for the continuous transform by a numerical approximation of
the integral.
[Figure 5.8: four panels of s(t) versus t, labeled f = 1 × fNy, f = 1.1 × fNy, f = 2 × fNy, f = 10 × fNy]
Fig. 5.8. Illustration of the Nyquist frequency. Frequencies which are lower than the Nyquist frequency are
sampled at a higher rate.
Illustration 3
Consider the possibility that the function x(t) is known only by its values xj = x(tj) at equally spaced
time instants tj such that tj − tj−1 = ∆t. Assume the period of the function is an integer number
of the time intervals, T = N∆t, and the function is periodic between 0 and T. The integral (5.9)
may then be approximated by a Riemann sum
$$\frac{1}{T}\int_0^T x(t)\, e^{-i2\pi mt/T}\, dt \approx \frac{1}{T}\sum_{n=1}^{N} x(t_n)\, e^{-i2\pi m t_n/T}\, \Delta t,$$
and finally, using tn = (n − 1)∆t and T = N∆t,
$$\frac{1}{T}\int_0^T x(t)\, e^{-i2\pi mt/T}\, dt \approx \frac{1}{N}\sum_{n=1}^{N} x_n\, e^{-i2\pi m(n-1)/N}.$$
This is already close to formula (5.7). The remaining difference may be removed by a shift of the
index m. Therefore if we number the coefficients m = 1, 2, ..., so that m − 1 takes the place of m, the above will change to
$$\frac{1}{T}\int_0^T x(t)\, e^{-i2\pi (m-1)t/T}\, dt \approx \frac{1}{N}\sum_{n=1}^{N} x_n\, e^{-i2\pi(m-1)(n-1)/N}.$$
As an example of the use of the DFT we will analyze the spectrum of an earthquake acceleration
record to find out which frequencies were represented strongly in the ground motion.
Fig. 5.9. Workspace variables stored in elcentro.mat. The variable desc is the description of the data
stored in the file.
Illustration 4
The earthquake record is from the notorious 1940 El Centro earthquake. The acceleration data is
stored in elcentro.mat (Figure 5.9), and processed by the script dft example 1.¹¹ Note that when
the file is loaded as Data=load('elcentro.mat');, the variables stored in the file become fields of
a structure (in this case called Data).
Data=load('elcentro.mat');
dt=Data.delt;% The sampling interval
x=Data.han;% This is the signal: Let us process the North-South acceleration
t=(0:1:length(x)-1)*dt;% The times at which samples were taken
Next the signal is going to be padded to a length that is an integer power of 2 for efficiency. The
product of the complex transform matrix with the signal is carried out by fft.
N = 2^nextpow2(length(x)); % Next power of 2 from length of x
X = (1/N)*fft(x,N);% Now we compute the coefficients X_k
The Nyquist frequency is calculated and used to determine the N/2 frequencies of interest, which
are all frequencies lower than one half of the Nyquist rate.
f_Ny=(1/dt)/2; % This is the Nyquist frequency
f = f_Ny*linspace(0,1,N/2);% These are the frequencies
Because of the aliasing there is a symmetry of the computed coefficients, and hence we also take
only one half of the coefficients, X(1:N/2). In order to preserve the energy of the signal we multiply
by two.
absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients
Finally, the coefficients are plotted.
plot(f,absX,'Color', 'r','LineWidth', 3,'LineStyle', '-','Marker', '.'); hold on
xlabel (' Frequency f [Hz]'); ylabel (' |X(f)|');
11 See: aetna/FourierTransform/dft example 1.m
[Figure: |X(f)| (axis scale ×10⁻³) versus frequency f [Hz], 0 to 25 Hz]
We can see that the highest-magnitude accelerations in the north-south direction occur with
frequencies below 5 Hz.
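Since elcentro.mat may not be at hand, the same processing steps can be tried on a synthetic signal. The sketch below is an assumption-laden stand-in for the earthquake record: it builds a signal from two sinusoids (2 Hz and 5 Hz, amplitudes 0.7 and 0.3, all chosen arbitrarily) and then repeats the steps of the illustration to locate the dominant frequency.

```matlab
% Sketch: one-sided amplitude spectrum of a synthetic signal (not the El Centro data).
dt = 0.01;                                   % sampling interval [s] (assumed)
t = (0:999)'*dt;                             % sampling instants
x = 0.7*sin(2*pi*2*t) + 0.3*sin(2*pi*5*t);   % made-up signal: 2 Hz and 5 Hz
N = 2^nextpow2(length(x));                   % pad to the next power of 2 (1024)
X = (1/N)*fft(x,N);                          % Fourier coefficients
f_Ny = (1/dt)/2;                             % Nyquist frequency
f = f_Ny*linspace(0,1,N/2);                  % frequencies, as in the illustration
absX = 2*abs(X(1:N/2));                      % one-sided amplitude spectrum
[peak,k] = max(absX);                        % strongest component
f_peak = f(k);                               % its frequency, close to 2 Hz
```

Because the record length is not an integer number of periods and the signal is zero-padded, the peak is slightly smeared over neighboring frequencies and its height is somewhat less than the true amplitude of 0.7.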
Finally, we are ready to come back to our vibration example. The displacement at the third mass
is the signal to transform.
x=y(:,3);% this is the signal to transform
The computation of the Fourier transform coefficients proceeds as
N = 2^nextpow2(length(x)); % Next power of 2 from length of x
X = (1/N)*fft(x,N);% Now we compute the coefficients X_k
f_Ny=(1/dt)/2; % This is the Nyquist frequency
f = f_Ny*linspace(0,1,N/2);% These are the frequencies
absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients
Note that the absolute value of one half of the coefficients (shown in Figure 5.10) is often called the
one-sided amplitude spectrum.
The three frequencies that we may expect to show up correspond to the angular frequencies
above and are 0.485 Hz, 1.359 Hz and 1.965 Hz. Figure 5.10 shows that the
intermediate frequency, 1.359 Hz, is missing in the FFT. By including only the modes 1,2 and 5,6
with frequencies 0.485 Hz and 1.965 Hz in the initial condition, we have excluded the intermediate
two modes from the response. Not having been excited by the initial condition, the two modes will
not appear in the FFT: they will not contribute to the response of the system at any time.
Next we simulate the forced vibration of the system, with zero initial condition and sinusoidal
force at the frequency of 3 Hz applied at the mass 3.¹² With the inclusion of forcing the second order
equations of motion are rewritten as
$$M\ddot{x} = -Kx + L,$$
where L is the vector of forces applied to the individual masses. Converting this to first order form
results in
$$\begin{bmatrix} \dot{x} \\ \dot{v} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -M^{-1}K & 0 \end{bmatrix} \begin{bmatrix} x \\ v \end{bmatrix} + \begin{bmatrix} 0 \\ M^{-1}L \end{bmatrix}.$$
Therefore, we add in the forcing to the right-hand side function supplied to the integrator: now it
includes a harmonic force applied to mass 3.¹³
12 See: aetna/ThreeCarriages/n3 undamped fft f.m
13 See: aetna/utilities/ODE/integrators/odetrap.m
[Figure 5.10: spectrum plotted versus frequency f [Hz], 0 to 5 Hz]
Fig. 5.10. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to
initial condition in the form of mode 1,2,5,6 mixture.
[t,y]=odetrap(@(t,y)A*y+sin(2*pi*3*t)*[0;0;0;0;0;1],...
tspan,y0,odeset('InitialStep',dt));
Again, the “measurement” of the response (the signal) will be the displacement of the mass 3. The
simulation will give us the displacement x3 as a discrete signal. The FFT analysis on this signal is
shown in Figure 5.11. We can see that now all free-vibration frequencies are present, and of course
the forcing frequency shows up strongly.
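The assembly of the forced first-order system can be sketched as follows. The mass and stiffness values (m = 1.3, k = 61) and the stiffness pattern are assumptions made for this illustration; they give natural frequencies close to those quoted above. The force here is divided by the mass, following the second-order equation of motion, and the integrator itself is not called.

```matlab
% Sketch: first-order form of the forced, undamped three-mass system.
m = 1.3; k = 61;                          % assumed values, for illustration only
M = m*eye(3);                             % mass matrix
K = k*[2 -1 0; -1 2 -1; 0 -1 1];          % assumed stiffness pattern (chain anchored at one end)
A = [zeros(3) eye(3); -M\K zeros(3)];     % first-order system matrix
Lforce = @(t) sin(2*pi*3*t)*[0; 0; 1];    % 3 Hz harmonic force on mass 3
rhs = @(t,y) A*y + [zeros(3,1); M\Lforce(t)];   % right-hand side for an integrator
lam = eig(A);                             % purely imaginary: undamped system
f_nat = sort(abs(imag(lam)))/(2*pi);      % natural frequencies [Hz], in pairs
dydt = rhs(1/12, zeros(6,1));             % at t = 1/12 s the force is at its peak
```

The handle rhs has the (t,y) signature expected by integrators that follow the MATLAB ODE-suite convention, so it could be passed on in place of the anonymous function used in the text.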
[Figure 5.11: one-sided amplitude spectrum |X(f)| (axis scale ×10⁻³) versus frequency f [Hz], 0 to 5 Hz]
Fig. 5.11. Linear 3-degree of freedom oscillator: first-order model, undamped. Forced-vibration response.
14 See: aetna/ThreeCarriages/n3 damped modes A.m
5.5 Proportionally damped system 91
$$C = \begin{bmatrix} 2c & -c & 0 \\ -c & 2c & -c \\ 0 & -c & c \end{bmatrix},$$
where for our particular data c = 3.13. This is an example of the so-called Rayleigh damping.
(In addition to stiffness-proportional there is also a mass-proportional Rayleigh damping.) The
eigenvalues are now complex with negative real parts
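$$D = \begin{bmatrix}
-0.238+3.04i & 0 & 0 & 0 & 0 & 0 \\
0 & -0.238-3.04i & 0 & 0 & 0 & 0 \\
0 & 0 & -1.87+8.33i & 0 & 0 & 0 \\
0 & 0 & 0 & -1.87-8.33i & 0 & 0 \\
0 & 0 & 0 & 0 & -3.91+11.7i & 0 \\
0 & 0 & 0 & 0 & 0 & -3.91-11.7i
\end{bmatrix}.$$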
Clearly the system is strongly damped (the real parts of the eigenvalues are quite large in magnitude).
The eigenvectors show that the velocities (the last three components) are no longer phase-shifted
by 90° with respect to the displacements.
$$V = 10^{-2} \begin{bmatrix}
-0.8-10.2i & -0.8+10.2i & -1.88-8.36i & -1.88+8.36i & 1.51+4.53i & 1.51-4.53i \\
-1.44-18.4i & -1.44+18.4i & -0.836-3.72i & -0.836+3.72i & -1.88-5.64i & -1.88+5.64i \\
-1.8-22.9i & -1.8+22.9i & 1.51+6.7i & 1.51-6.7i & 0.839+2.51i & 0.839-2.51i \\
31.2 & 31.2 & 73.2 & 73.2 & -58.9 & -58.9 \\
56.2 & 56.2 & 32.6 & 32.6 & 73.5 & 73.5 \\
70 & 70 & -58.7 & -58.7 & -32.7 & -32.7
\end{bmatrix}.$$
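The eigenvalues of the stiffness-proportionally damped system can be recomputed with a short sketch. The damping coefficient c = 3.13 is taken from the text; the mass m = 1.3 and stiffness k = 61 are assumptions of this sketch (the stiffness matrix is assumed to have the same pattern as C). With these values the computed eigenvalues come out close to the entries of D shown above.

```matlab
% Sketch: eigenvalues of the first-order model with stiffness-proportional damping.
m = 1.3; k = 61;                       % assumed values, not given in this section
c = 3.13;                              % damping coefficient quoted in the text
P = [2 -1 0; -1 2 -1; 0 -1 1];         % common pattern of K and C
M = m*eye(3); K = k*P; C = c*P;
A = [zeros(3) eye(3); -M\K -M\C];      % first-order system matrix with damping
[V,D] = eig(A);                        % modes (columns of V) and eigenvalues
lam = diag(D);
slowest_decay = max(real(lam));        % real part closest to zero
```

All real parts are negative, so every mode decays; the least damped pair is the one with the lowest frequency.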
[Figure 5.12: arrow diagrams of the mode components; each panel pairs a displacement component with the corresponding velocity component (z1j, z4j), (z2j, z5j), (z3j, z6j) of mode j]
Fig. 5.12. Linear 3-degree of freedom oscillator: first-order model, modes for stiffness-proportional damping
Figure 5.13 shows the free-vibration response to excitation in the form of the initial condition set
to (the real part of) mode 2.¹⁵ Note that the displacements go through zero at the same time. This
may be also deduced in Figure 5.12 from the fact that all the displacement arrows for any particular
mode are parallel, which means they all have the same phase shift. Next we repeat the frequency
analysis we’ve performed for the undamped system previously: we simulate the forced vibration of
the damped system, with zero initial condition and sinusoidal force at the frequency of 3 Hz applied
at the mass 3.¹⁶ Again, the “measurement” of the response will be displacement of the mass 3. The
one-sided amplitude FFT analysis on this signal is shown in Figure 5.14. We can see that not all
free-vibration frequencies are clearly distinguishable, while the forcing frequency shows up strongly.
15 See: aetna/ThreeCarriages/n3 damped IC.m
16 See: aetna/ThreeCarriages/n3 damped fft f.m
[Figure 5.13: y(1:3) versus t, 0 ≤ t ≤ 6]
Fig. 5.13. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Free-
vibration response to initial condition in the form of mode 2.
[Figure 5.14: one-sided amplitude spectrum |X(f)| (axis scale ×10⁻³) versus frequency f [Hz], 0 to 5 Hz]
Fig. 5.14. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Forced-
vibration response.
17 See: aetna/ThreeCarriages/n3 damped non modes A.m
5.6 Non-proportionally damped system 93
$$D = \begin{bmatrix}
-2.4 & 0 & 0 & 0 & 0 & 0 \\
0 & -0.641+3.98i & 0 & 0 & 0 & 0 \\
0 & 0 & -0.641-3.98i & 0 & 0 & 0 \\
0 & 0 & 0 & -0.254+11.1i & 0 & 0 \\
0 & 0 & 0 & 0 & -0.254-11.1i & 0 \\
0 & 0 & 0 & 0 & 0 & -21.4
\end{bmatrix}.$$
Correspondingly, the first and last eigenvectors are real, and the rest are complex conjugate pairs.
$$V = 10^{-2} \begin{bmatrix}
26 & 5.22+1.33i & 5.22-1.33i & -1.25+0.191i & -1.25-0.191i & 4.65 \\
21.1 & 4.16+12.5i & 4.16-12.5i & -0.173-7.57i & -0.173+7.57i & 0.398 \\
18.8 & 3.09+19.2i & 3.09-19.2i & 0.446+4.61i & 0.446-4.61i & 0.0369 \\
-62.5 & -8.64+19.9i & -8.64-19.9i & -1.8-13.9i & -1.8+13.9i & -99.5 \\
-50.7 & -52.5+8.5i & -52.5-8.5i & 84.1 & 84.1 & -8.52 \\
-45.2 & -78.2 & -78.2 & -51.3+3.79i & -51.3-3.79i & -0.79
\end{bmatrix}.$$
We can illustrate that the motion for instance for mode 6 is non-oscillatory in Figure 5.15 where we
show the response for the initial conditions in the form of mode 6.¹⁸
[Figure 5.15: y(1:3) versus t, 0 ≤ t ≤ 6]
Fig. 5.15. Linear 3-degree of freedom oscillator: first-order model, modes for non-proportional damping.
Response for initial conditions in the form of mode 6.
Figure 5.16 illustrates graphically the modes of the A matrix. It is noteworthy that the displacements
and velocities for the purely decaying modes are phase shifted by 180° (they are out of phase).
Figure 5.17 shows the free-vibration response to excitation in the form of the initial condition
set to (the real part of) mode 2. Note that the displacements no longer go through zero at the same
time: they are phase shifted. This may be also deduced in Figure 5.16 because the displacement
arrows for any particular mode are not parallel any more.
18 See: aetna/ThreeCarriages/n3 damped non IC.m
[Figure 5.16: arrow diagrams of the mode components; each panel pairs a displacement component with the corresponding velocity component (z1j, z4j), (z2j, z5j), (z3j, z6j) for modes j = 1, ..., 6]
Fig. 5.16. Linear 3-degree of freedom oscillator: first-order model, modes for non-proportional damping
[Figure 5.17: y(1:3) versus t, 0 ≤ t ≤ 6]
Fig. 5.17. Linear 3-degree of freedom oscillator: first-order model, nonproportional damping. Free-vibration
response to initial condition in the form of mode 2.
Illustration 5
The dynamics of the system discussed above is to be integrated with the time step ∆t = 0.06 s with
the modified Euler integrator. Determine if this integrator will be stable.
The eigenvalues, diag(D), are
lambda=[ -2.4030
-0.6411+3.9785i
-0.6411-3.9785i
-0.2541+11.1142i
-0.2541-11.1142i
-21.4220]
Each eigenvalue needs to be substituted into the amplification factor for the modified Euler
integrator (3.30), and its modulus (absolute value) needs to be evaluated. The result is
>> abs(1+dt*lambda+1/2*(dt*lambda).^2)
ans =
0.8662
0.9616
0.9616
1.0063
1.0063
0.5407
Since two of the amplification factors (for the complex-conjugate eigenvalues 4 and 5) are
greater than one in modulus, the integrator is not going to be stable with the given time step, as the
contribution of the modes 4 and 5 would grow in time.
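The check in this illustration can be reproduced as a self-contained snippet. The eigenvalues and the time step are the ones listed above; the additional trial with half the time step is an extra experiment added in this sketch, not part of the original illustration.

```matlab
% Sketch: stability check of the modified Euler integrator for a given time step.
dt = 0.06;                                        % time step from the illustration
lambda = [-2.4030; -0.6411+3.9785i; -0.6411-3.9785i; ...
          -0.2541+11.1142i; -0.2541-11.1142i; -21.4220];
amp = abs(1 + dt*lambda + 1/2*(dt*lambda).^2);    % moduli of the amplification factors
unstable_modes = find(amp > 1);                   % modes whose contribution would grow
is_stable = all(amp <= 1);                        % false for dt = 0.06
% Extra experiment (an assumption of this sketch): try half the time step
dt_half = dt/2;
amp_half = abs(1 + dt_half*lambda + 1/2*(dt_half*lambda).^2);
is_stable_half = all(amp_half <= 1);
```

With the halved time step all six moduli drop below one, which suggests that the stability limit for this system lies somewhere between 0.03 s and 0.06 s.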
5.8 Annotated bibliography 95
[Figure 5.18: arrow diagrams of the mode components, pairing displacement and velocity components (z1j, z4j), (z2j, z5j), (z3j, z6j) of each mode j]
Fig. 5.18. Linear 3-degree of freedom oscillator: first-order model, modes for singular-stiffness non-
proportional damping
Note the zero eigenvalue: for a singular stiffness the entire matrix A must be singular (consider
whether the first three columns of A can be linearly independent when K has linearly dependent
columns).
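$$D = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & -0.679+4.31i & 0 & 0 & 0 & 0 \\
0 & 0 & -0.679-4.31i & 0 & 0 & 0 \\
0 & 0 & 0 & -0.237+11.2i & 0 & 0 \\
0 & 0 & 0 & 0 & -0.237-11.2i & 0 \\
0 & 0 & 0 & 0 & 0 & -23.8
\end{bmatrix}.$$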
Correspondingly, the first and last eigenvectors are real, and the rest are complex conjugate pairs. The
first eigenvector is as expected: all displacements the same, no velocities:
$$V = 10^{-2} \begin{bmatrix}
-57.7 & -4.98+1.26i & -4.98-1.26i & -1.16+0.371i & -1.16-0.371i & -4.19 \\
-57.7 & -4.03-10.8i & -4.03+10.8i & -0.161-7.57i & -0.161+7.57i & -0.3 \\
-57.7 & -2.86-18.2i & -2.86+18.2i & 0.409+4.56i & 0.409-4.56i & -0.023 \\
0 & -2.07-22.4i & -2.07+22.4i & -3.86-13i & -3.86+13i & 99.7 \\
0 & 49.3-10i & 49.3+10i & 84.4 & 84.4 & 7.13 \\
0 & 80.4 & 80.4 & -51+3.48i & -51-3.48i & 0.546
\end{bmatrix}.$$
Under these conditions no forces are generated in any of the springs or the damper.
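The zero eigenvalue and the associated rigid-body mode can be verified with a small sketch. All the numbers and matrix patterns below are hypothetical: the stiffness matrix is that of a three-mass chain not anchored to the ground, and a single damper is assumed to sit between masses 2 and 3. They are not the data of the example in the text, but they share its key property, a singular K.

```matlab
% Sketch: a singular stiffness matrix gives a zero eigenvalue of A (rigid-body mode).
m = 1.3; k = 61; c = 3.13;             % hypothetical values
M = m*eye(3);
K = k*[1 -1 0; -1 2 -1; 0 -1 1];       % hypothetical free-floating chain: K*[1;1;1] = 0
C = c*[0 0 0; 0 1 -1; 0 -1 1];         % hypothetical damper between masses 2 and 3
A = [zeros(3) eye(3); -M\K -M\C];      % first-order system matrix
z = [1; 1; 1; 0; 0; 0];                % equal displacements, zero velocities
residual = norm(A*z);                  % zero: z is an eigenvector for eigenvalue 0
smallest = min(abs(eig(A)));           % the zero eigenvalue, up to round-off
```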
19 See: aetna/ThreeCarriages/n3 damped sing modes A.m
6
Initial Boundary Value Models
Summary
1. Models based on partial differential equations lead to the so-called initial boundary value prob-
lems (IBVP).
2. A common technique for solution of IBVPs deals with the space dimensions first by creating
a lumped representation of the interactions in space and then integrates the resulting ordinary
differential equations in time. This is called “spatial discretization”.
3. In this chapter we study in this way models of elastic wave propagation and transient heat
conduction. We discuss two approaches to discretization: particles and control volumes.
4. As a result the IBVP is converted to an IVP, which is integrated in time using techniques
developed in previous chapters. Main idea: All previous developments relying on modal analysis
still apply.
5. We discuss the issue of the time step length selection based on the frequency spectrum of the
discrete model.
[Figure: bar with axis x, displacement u, and properties A, E, ρ]
By considering a differential element of the bar (Figure 6.2), we can formulate Newton’s
equation of motion of that element as
$$N(x + dx, t) - N(x, t) = \rho A\, dx\, \frac{\partial^2 u(x,t)}{\partial t^2},$$
where on the left is the total applied force, and on the right is the mass of the differential element
ρAdx, which when multiplied by the acceleration gives the inertial force. The internal force in the
bar at x + dx may be expanded in the truncated Taylor series as
$$N(x + dx) \approx N(x) + \frac{\partial N(x)}{\partial x}\, dx.$$
Here we envision that the higher order terms of the Taylor series will eventually play no role as
the length of the differential element will approach zero. The Taylor series approximation may be
substituted into the equation of motion to yield
$$\frac{\partial N(x)}{\partial x}\, dx = \rho A\, dx\, \ddot{u}.$$
The length of the differential element dx cancels, and we obtain
$$\frac{\partial N(x)}{\partial x} = \rho A \ddot{u}.$$
Further we express the internal force in terms of the deformation of the material and the constitutive
relationship between the stress and strain ε
$$N(x) = EA\varepsilon = EA\, \frac{\partial u(x)}{\partial x},$$
where E is the Young’s modulus. The resulting partial differential equation contains a single unknown
function u(x, t)
$$EA\, \frac{\partial^2 u}{\partial x^2} = \rho A \ddot{u}.$$
[Figure 6.2: differential element of length dx with internal forces N(x) and N(x + dx)]
For the boundary conditions let us consider a free bar which is loaded at the left-hand side end by
a force impulse Î. The force impulse sends a pressure wave down the bar. Let us assume a sinusoidal
time variation for the force generating the impulse. The force is nonzero in a short time interval
t ≤ T and then becomes identically zero
$$N(0, t) = \frac{\hat{I}\pi}{2T} \sin\left(\frac{\pi t}{T}\right), \quad t \le T, \quad \text{and} \quad N(0, t) = 0, \quad t > T$$
(with this amplitude, which is also the one used in the code of this section, the force integrates over 0 ≤ t ≤ T to the impulse Î).
At the other end the bar has a free (unloaded) cross section
$$N(L, t) = 0, \quad t > 0.$$
The initial conditions describe a bar initially unstretched and at rest
$$u(x, 0) = 0, \quad \dot{u}(x, 0) = 0.$$
6.1 Propagation of elastic wave 99
[Figure: bar with axis x, displacement u, and properties A, E, ρ]
The MATLAB code¹ that computes the required matrices and integrates the solution to the IVP is
shown below. We are applying a nonzero force at x = 0 for a very short time interval 0 ≤ t ≤ T .
For generality we also include the possibility of stiffness-proportional damping, even though it isn’t
present in the preceding discussion.
First let us define some variables, Young’s modulus, mass density, cross-sectional area, length,
applied force impulse. Also define the distance between particles and a few auxiliary variables.
function wavePulse
npart= 159;
% MPa tonne/mm^3 mm^2 mm
E=200000; rho=7.85e-9; A=200; L=7850; Ihat =1;
h=L/(npart-1);% element length
m= A*rho*h; k = E*A/h; c = 0.00000*k;
Next we create the mass, stiffness, and damping matrix. We are defining the latter using a pattern
matrix S, which is scaled to give us either stiffness or damping.
M = diag(ones(1,npart))*m;% mass matrix
M(1,1) =m/2;
M(npart,npart) =m/2;
S = diag(ones(1,npart))*3 -tril(triu(ones(npart),-1),+1);
S(1,1) =1;
S(npart,npart) =1;
K =k*S;% stiffness matrix
C = c*S;% damping matrix
This is the definition of the time-dependent force.
T =2e-4;% duration of forcing in seconds
F =zeros(npart,1); F(1) =Ihat*pi/T/2;% force distribution
And in the definition of the right-hand side function to be passed to the integrator, the time-
dependent force gets used. Note the term (t<T): this is a logical expression, whose value is either
false (0) or true (1). Therefore this in effect acts as a switch that either turns the force on (for t < T )
or off (for t ≥ T ).
function out=rhs(t, y, varargin)
% First half of y: displacements; second half: velocities
out = [y(npart+1:2*npart); ...
M\(-K*y(1:npart)-C*y(npart+1:2*npart)...
+F*(t<T)*sin(pi*t/T))];% solve with M rather than forming inv(M)
end
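The effect of the logical switch (t<T) can be examined in isolation. The sketch below uses the values of Ihat and T from the listing above; the anonymous function force and the sampling grid are constructs of this sketch only. It also checks numerically that the time integral of the force is the impulse Ihat.

```matlab
% Sketch: the force history with the (t<T) switch, and its time integral.
Ihat = 1; T = 2e-4;                                  % values as in the listing above
force = @(t) (Ihat*pi/T/2)*(t<T).*sin(pi*t/T);       % on for t < T, off otherwise
t = linspace(0, 2*T, 4001);                          % sampling grid (arbitrary)
impulse = trapz(t, force(t));                        % approximately equal to Ihat
f_peak = force(T/2);                                 % peak force, Ihat*pi/(2*T)
f_after = force(1.5*T);                              % exactly zero: switch is off
```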
The trapezoidal integrator² is run with a time step corresponding to a 1/4 fraction of the smallest
period of vibration Tmin of the system. This is determined from the formula
$$T_{min} = \frac{1}{f_{max}},$$
where fmax is the highest frequency of vibration. The highest frequency is related to the maximum
eigenvalue of the free vibration problem as
$$f_{max} = \frac{\omega_{max}}{2\pi},$$
where ωmax is the square root of the largest eigenvalue of (5.3). The MATLAB general routine eig for the solution of
eigenvalue problems is used.
1 See: aetna/ElasticWave/wavePulse.m
2 See: aetna/utilities/ODE/integrators/odetrap.m
[V,D]=eig(K,M);
max_omega=sqrt(max(abs(diag(D))))
min_omega=sqrt(min(abs(diag(D))))
dt= (2*pi/max_omega)/4;
nsteps = round(2.4*L/(sqrt(E/rho))/dt);% how many steps?
tspan= [0,nsteps*dt];
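The time step selection can be tried out on a smaller model. The sketch below uses the material data of the listing but only nine particles (an arbitrary choice made to keep the example small), and builds the tridiagonal pattern matrix in an equivalent but more explicit way. For this particular lumped model the largest angular frequency should come out as 2*sqrt(k/m), which the last line uses as a cross-check.

```matlab
% Sketch: time step from the largest natural frequency of a small particle model.
npart = 9;                                       % reduced number of particles (assumed)
E = 200000; rho = 7.85e-9; A = 200; L = 7850;    % data as in the listing
h = L/(npart-1); m = A*rho*h; k = E*A/h;
M = m*eye(npart); M(1,1) = m/2; M(npart,npart) = m/2;   % lumped masses, halved at the ends
S = 2*eye(npart) - diag(ones(npart-1,1),1) - diag(ones(npart-1,1),-1);
S(1,1) = 1; S(npart,npart) = 1;                  % free ends
K = k*S;
lam = eig(K,M);                                  % eigenvalues are omega^2
omega = sqrt(abs(lam));                          % abs guards against round-off near zero
max_omega = max(omega);
min_omega = min(omega);                          % rigid-body mode: essentially zero
dt = (2*pi/max_omega)/4;                         % one quarter of the shortest period
rel_err = abs(max_omega - 2*sqrt(k/m))/max_omega;   % cross-check
```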
The initial condition is trivial: zero displacement and zero velocity.
y0(1:npart,1)= zeros(npart,1);%init. [disp,velocity]
y0(npart+1:2*npart,1) = zeros(npart,1);
Options setting the initial time step are set, and the trapezoidal integrator odetrap is invoked. If
we’d like to try a different integrator, we can comment out the odetrap³ line and uncomment a
different one.
options =odeset('InitialStep',dt);
[t,y]=odetrap(@rhs,tspan,y0,options);style ='r-';
Finally, postprocessing type may be selected by uncommenting one of the lines that invoke the
postprocessing functions defined inside the file; if you happen to be running the code, be sure to
check out Spring animate, Displacement animate, and Stress animate to watch a few helpful
animated visuals.
Figure 6.4 shows a sample result from the simulation with 159 particles. The displacement and
the velocity of the particles at x = 0, L/2, L are shown as functions of time.
[Figure 6.4: six panels showing displacements y(1), y(81), y(159) and velocities y(160), y(240), y(318) versus t, 0 ≤ t ≤ 2×10⁻³ and beyond]
Fig. 6.4. Elastic wave simulation. Displacements and velocities of the particles at x = 0, L/2, L.
An often used technique
is to plot axial displacement in one-dimensional models perpendicularly to the axis of the bar, as if it
was a function graph. Figure 6.5 shows the displacements of the particles along the bar at three time
instants: the first snapshot was taken early on, when only the first dozen particles have appreciably
moved. The second snapshot shows the displacement along the bar after the wave passed through
the midpoint of the bar for the first time. The third snapshot shows the displacement after the wave
reflected off the free end at x = L, and we should note the doubled displacement amplitude as the
wave is coming back.
3 See: aetna/utilities/ODE/integrators/odetrap.m
[Figure 6.5: three panels of u(x, t) versus x, 0 ≤ x ≤ 7000, at t = 0.22×10⁻³, 1.2×10⁻³, and 2.1×10⁻³]
Fig. 6.5. Elastic wave simulation. Displacement shape at three time instants. Note: axial displacement is
plotted perpendicularly to the axis of the bar.
Figure 6.5 may take some getting used to. To help us along, Figure 6.6 visualizes the wave in
the steel bar as if it was a coiled spring. Ten snapshots are taken: the first one records the initial
compression under the applied force. The next three snapshots show the compressed coils moving
rightward as a wave. Behind the compression the coils are undeformed, except that they are displaced
to the right. The fifth snapshot shows a bunch of stretched coils moving to the left immediately after
the wave reflected off the free right-hand side end. The next three snapshots show the stretched
coils moving towards the left as a wave. The last two snapshots show the compressed coils moving
again to the right after the reflection of the tensile wave off the left-hand side end.
Fig. 6.6. Elastic wave simulation. Displacements shown as snapshots of the steel bar represented as a coiled
spring.
With the help of Figure 6.6 it should be reasonably easy to interpret a stress diagram. Figure 6.7
shows snapshots of the axial (normal to the cross-section) stress σ. The snapshots are the same as
those for which the displacement is shown in Figure 6.5. The top graph shows the compressive (hence
negative) stress immediately after the applied force ceased to act on the left-hand side cross-section.
Since the time variation of the applied force was sinusoidal, the stress pulse sent down the bar is also
sinusoidal in shape. The solution is approximate, hence the shape is also approximate. The pulse
shape gets further distorted by the numerical integrator as it moves along the bar. Note that after
the reflection the pulse is tensile.
[Figure 6.7: three panels of σ(x, t) versus x, 0 ≤ x ≤ 7000, at t = 0.22×10⁻³, 1.2×10⁻³, and 2.1×10⁻³; vertical axis from −40 to 40]
Fig. 6.7. Elastic wave simulation. Stress along the bar at three time instants.
Finally, we shall mention a useful approach to the visualization of the solutions to partial dif-
ferential equations in x and t: the solution may be interpreted as a surface raised above the plane
x, t. Figure 6.8 shows such a surface. The magnitude of the displacement is indicated by color: blue
corresponds to zero displacement, red corresponds to maximum displacement during the recorded
event. Slices through the surface have appeared previously above. For instance, slices at particular
locations along the x axis have appeared previously in Figure 6.4 as time records of displacement at
particular locations. On the other hand, slices at particular locations along the t axis are shapes of
the deformed bar and have previously appeared in Figure 6.5.
Let us also look at the energy balance to help us appreciate how well the trapezoidal rule works
for the wave propagation IVP. Figure 6.9 displays the total energy
$$TE = \frac{1}{2}\dot{u}^T M \dot{u} + \frac{1}{2} u^T K u$$
as the sum of the kinetic energy (the first term) and the potential energy of deformation (the second
term). After the initial rise due to the work input by the impulse, the total energy is conserved to
within numerical precision. As an interesting feature we may appreciate the significance of the time
instants where the potential energy drops to zero: this is where the wave is reflected off the free
end. The bar is at this point undeformed, however its material has nonzero kinetic energy which
immediately starts the wave going again.
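The energy bookkeeping can be illustrated on a tiny system. The sketch below is not the wave-propagation model and does not call odetrap: it uses a made-up two-mass system and a hand-written trapezoidal step for the linear free-vibration problem, and then evaluates the kinetic, potential, and total energy at every step.

```matlab
% Sketch: energy balance for a trapezoidal-rule integration of a small linear system.
M = diag([1 2]); K = [3 -1; -1 1];           % made-up mass and stiffness matrices
A = [zeros(2) eye(2); -M\K zeros(2)];        % first-order form, y = [u; udot]
dt = 0.05; nsteps = 200;                     % arbitrary time step and step count
y = zeros(4, nsteps+1); y(:,1) = [1; 0; 0; 0];   % initial displacement, zero velocity
G = (eye(4) - dt/2*A)\(eye(4) + dt/2*A);     % trapezoidal rule step matrix
for j = 1:nsteps
    y(:,j+1) = G*y(:,j);                     % one trapezoidal step
end
u = y(1:2,:); v = y(3:4,:);
KE = 0.5*sum(v.*(M*v), 1);                   % kinetic energy at each step
PE = 0.5*sum(u.*(K*u), 1);                   % potential energy at each step
TE = KE + PE;                                % total energy
drift = max(abs(TE - TE(1)))/TE(1);          % relative change of the total energy
```

For a linear undamped system without forcing, the trapezoidal rule keeps the total energy constant up to round-off, while the kinetic and potential parts are exchanged back and forth.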
Figure 6.10 is analogous to Figure 6.4, but the simulation was run with a much larger time step4 .
The trapezoidal integrator was applied with a time step corresponding to a 1/10 fraction of the
fourth largest period of vibration T4 of the system,
1 2π
T4 = = .
f4 ω4
4
See: aetna/ElasticWave/wavePulseUnder.m
104 6 Initial Boundary Value Models
Fig. 6.8. Elastic wave simulation. Displacement as function of time and position along the bar visualized
as a surface.
[Figure: three stacked panels showing PE(t), KE(t), and TE(t) versus time t (×10^-3); each vertical axis runs from 0 to 1000.]
Fig. 6.9. Elastic wave simulation. Potential, kinetic, and total energy for the trapezoidal integrator.
(this corresponds to frequencies of [0, 321.5, 643.0, 964.4, 1285.7, 1606.8] Hz). We may note the zero
frequency due to the stiffness matrix being singular : the bar is free-floating. (We also say it has a
rigid body mode.)
The procedure of picking the time step to integrate well the first few lowest frequencies is a sound
one when used for vibration problems, where the lowest frequencies are the most meaningful, and the
highest frequencies are sometimes just artifacts of the discrete nature of the model. Here we claim
that in this case picking the time step by looking at the lowest natural frequencies would be wrong.
The present problem is a wave-propagation event, and all frequencies collaborate to propagate the
wave (it is a “broadband” event). Figure 6.11 shows the magnitude of the natural frequencies from
the generalized eigenvalue problem (5.3). Evidently, about 7/8 of the frequencies are above 1 kHz.
Figure 6.12 demonstrates that also all of these high frequencies non-negligibly contribute to the
response as the drop-off in the amplitude is very slight towards high frequencies.
6.1 Propagation of elastic wave 105
[Figure: six panels versus time t (×10^-3). Left column: y(1), y(81), y(159), vertical range 0 to 0.5. Right column: y(160), y(240), y(318), vertical range −2000 to 2000.]
Fig. 6.10. Elastic wave simulation. Displacements and velocities of the particles at x = 0, L/2, L. Time step
set from ω4 .
[Figure: frequency in Hz (logarithmic axis, 10^-4 to 10^6) versus mode number (0 to 160).]
Fig. 6.11. Elastic wave simulation. Frequency spectrum from the generalized eigenvalue problem.
The largest angular frequency in the system is $\omega_{max} = 20.319 \times 10^4$, which is about 33 times larger
than $\omega_4$. Consequently, if we are running the simulation at a time step
\[ \Delta t = \frac{2\pi}{10\,\omega_4} \]
we cannot possibly represent the high frequencies well. According to the Nyquist rate equation, the
highest frequency (Nyquist frequency) that can be sampled with this time step is
\[ f = \frac{1}{2}\,\frac{1}{\Delta t} = \frac{1}{2}\,\frac{10\,\omega_4}{2\pi} \approx 4821.8\ \mathrm{Hz} . \]
That excludes most frequencies from the spectrum (Figure 6.12), and consequently the computed
displacement in Figure 6.10 looks smeared out and shows significant misleading artifacts. The com-
puted velocity is practically useless.
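The arithmetic above can be checked with a few lines of MATLAB. This is only a sketch: it takes the fourth frequency (964.4 Hz) and the largest angular frequency from the values quoted in the text, and the variable names are our own.

```matlab
% Nyquist frequency for the time step dt = T4/10 (values quoted in the text)
f4       = 964.4;               % fourth natural frequency [Hz]
omega4   = 2*pi*f4;             % corresponding angular frequency
dt       = 2*pi/(10*omega4);    % time step: one tenth of the period T4
fNyquist = 1/(2*dt);            % highest frequency that can be sampled [Hz]
omegaMax = 20.319e4;            % largest angular frequency of the model
fMax     = omegaMax/(2*pi);     % the same, expressed in Hz
fprintf('Nyquist frequency: %g Hz, highest model frequency: %g Hz\n', ...
    fNyquist, fMax);
fprintf('omegaMax/omega4 = %g\n', omegaMax/omega4);
```

The highest model frequency comes out several times larger than the Nyquist frequency, which is why most of the spectrum cannot be represented at this time step.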
[Figure: amplitude (logarithmic axis, 10^-4 to 10^0) versus frequency f in Hz (0 to 7 ×10^4).]
Fig. 6.12. Elastic wave simulation. Single-Sided Amplitude Spectrum of two signals: displacement at particle
13 (red curve), and particle 130 (purple curve).
[Figure: wall section with the coordinate x and the properties A, κ, c_V marked.]
Fig. 6.13. Conduction of heat through a thick concrete wall. Temperature is known (as a function of time)
on the left-hand side face of the wall, and the right-hand side face of the wall is insulated. We will write the
balance of energy in the shaded “pipe” of cross-section A.
The boundary conditions on the surface of the modeled volume will be: zero heat flux everywhere
along the surface, except at x = 0 where the temperature is prescribed (and hence the heat flux is
unknown). The initial conditions that we will consider are
T (x, 0) = 0 .
6.2 Transient heat conduction 107
The heat conduction phenomena in three-dimensional solids are described by the following balance
equation. It expresses the rate of change of the heat energy inside a given volume as the amount of
heat lost per unit time through heat flowing through the surface of the volume from the inside out,
and the amount of heat energy generated per unit time inside the volume.
\[ \int_V c_V \frac{\partial T}{\partial t}\, dV = -\int_S n \cdot q \, dS + \int_V Q \, dV . \tag{6.1} \]
Here cV is the specific heat per unit volume, n is the outer unit normal to the surface S, q is the
heat flux (heat power per unit area), and Q is the rate of heat generation per unit volume.
Fig. 6.14. Balance of heat energy in a 3-D solid. Volume =V , surface =S, outer unit normal=n, heat flux
=q.
In this Section we will introduce a different discretization scheme, the so-called control-volume
approach. We will construct the discrete model by dividing the dark gray cylinder into a number of
control volumes as shown in Figure 6.15. Figure 6.15 shows eight control volumes, the outer ones
of length h/2, and the inner ones of length h = L/7. The outer surfaces and the midpoints of the
inner control volumes are associated with temperatures Ti , i = 0, 1, . . . 7; T0 is known – prescribed,
and Ti , i = 1, . . . 7 are to be computed.
[Figure: eight control volumes numbered 1 to 8 along the wall, with temperatures T0, T1, ..., T7 and the properties A, κ, c_V marked.]
Fig. 6.15. Balance of heat energy in a 3-D solid. Volume =V , surface =S, outer unit normal=n, heat flux
=q.
The three-dimensional heat energy balance equation will be now applied to each control volume.
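[Figure: interior control volume of length h associated with T_j, its neighbors T_{j−1} and T_{j+1} at distance h, and the heat fluxes q_{j,L} (left face) and q_{j,R} (right face).]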
Further, we will assume that the temperature varies linearly between the centers of adjacent
control volumes. Therefore, the heat flux between the centers $j-1$ and $j$ will be determined
using the Fourier law as
\[ q_{j,L} = -\kappa \frac{T_j - T_{j-1}}{h} \]
and the heat flux between the centers $j$ and $j+1$ will read
\[ q_{j,R} = -\kappa \frac{T_{j+1} - T_j}{h} . \]
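Putting everything together we will obtain for the control volume associated with temperature $T_j$
\[ c_V \dot{T}_j hA = \kappa \frac{T_{j+1} - T_j}{h} A - \kappa \frac{T_j - T_{j-1}}{h} A \]
or
\[ c_V \dot{T}_j hA = \frac{\kappa A}{h} \left( T_{j+1} - 2T_j + T_{j-1} \right) . \]
This specializes for the first control volume to
\[ c_V \dot{T}_1 hA = \frac{\kappa A}{h} \left( T_2 - 2T_1 + T_0 \right) , \]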
where $T_0$ is prescribed to a known value. For the rightmost control volume $n$, which has the
length $h/2$, the heat flux on the left is
\[ q_{n,L} = -\kappa \frac{T_n - T_{n-1}}{h} . \]
The heat flux on the right, on the other hand, is known to be zero,
\[ q_{n,R} = 0 . \]
Together we will obtain for the control volume associated with temperature $T_n$
\[ c_V \dot{T}_n (h/2) A = -\kappa \frac{T_n - T_{n-1}}{h} A \]
or
\[ c_V \dot{T}_n (h/2) A = -\frac{\kappa A}{h} \left( T_n - T_{n-1} \right) . \]
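In this way, we obtain a coupled system of ODEs
\[
c_V hA
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & 1/2
\end{bmatrix}
\begin{bmatrix}
\dot{T}_1 \\ \dot{T}_2 \\ \dot{T}_3 \\ \vdots \\ \dot{T}_{n-1} \\ \dot{T}_n
\end{bmatrix}
+ \frac{\kappa A}{h}
\begin{bmatrix}
2 & -1 & 0 & \cdots & 0 & 0 \\
-1 & 2 & -1 & \cdots & 0 & 0 \\
0 & -1 & 2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 2 & -1 \\
0 & 0 & 0 & \cdots & -1 & 1
\end{bmatrix}
\begin{bmatrix}
T_1 \\ T_2 \\ T_3 \\ \vdots \\ T_{n-1} \\ T_n
\end{bmatrix}
= \frac{\kappa A}{h}
\begin{bmatrix}
T_0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0
\end{bmatrix}
\]
or, in compact form,
\[ C \dot{T} + K T = L , \quad T(0) = 0 , \tag{6.2} \]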
where C is the capacity matrix, Ṫ is the vector of temperature rates (as functions of time), K is the
conductivity matrix, T is the vector of temperatures (also functions of time), and L is the vector of
thermal loads.
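As a minimal sketch of how the matrices in (6.2) might be assembled, consider the following lines. They are not taken from heatCond1: the wall thickness (0.6 m, read off the x axis of the temperature plots), the cross-section A = 1, and the prescribed temperature T0 = 100 are assumptions made for illustration, and the variable names are our own.

```matlab
% Assemble C, K, L of the control-volume model (6.2); illustrative values
nTemps = 30;                 % number of unknown temperatures T1..Tn
Lwall  = 0.6;                % wall thickness [m] (assumed)
A      = 1.0;                % cross-section of the "pipe" [m^2] (assumed)
T0     = 100;                % prescribed temperature at x = 0 (assumed)
kappa  = 1.81;               % thermal conductivity of concrete
cv     = 0.22*2350;          % specific heat per unit volume
h      = Lwall/nTemps;       % length of an interior control volume
e      = ones(nTemps,1);
% Capacity matrix: the last control volume has only half the length
C = cv*h*A*diag([e(1:end-1); 0.5]);
% Conductivity matrix: tridiagonal, the last row reflects the insulated face
K = (kappa*A/h)*(diag(2*e) - diag(e(1:end-1),1) - diag(e(1:end-1),-1));
K(end,end) = kappa*A/h;
% Thermal load: only the first equation feels the prescribed temperature
Lvec = zeros(nTemps,1);
Lvec(1) = (kappa*A/h)*T0;
% Steady state (zero temperature rates): K*T = L
Tsteady = K\Lvec;
```

A useful sanity check is the steady state: with the right-hand face insulated, the solution of K T = L should be the uniform temperature T0 throughout the wall.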
The computation is implemented in the script described below.5 First we define a few variables,
function heatCond1
nTemps = 30;                         % number of unknown temperatures
% Data for concrete
kappa_concrete = 1.81;               % thermal conductivity, W/K/m
rho_concrete = 2350;                 % mass density, kg/m^3
cv_concrete = 0.22*rho_concrete;     % specific heat per unit volume
% Select the data for this calculation
kappa = kappa_concrete; cv = cv_concrete;
including the thickness of the wall, and the length of the control volume.
5
See: aetna/HeatConduction/heatCond1.m
We consider the differential equations (6.2) in the uncoupled form, and we realize that the equation
with the largest eigenvalue will govern the time step when they are all coupled together. The time
step length will be set by requiring the satisfaction of the condition (3.6), i.e. ∆t ≤ 2/|λmax |, as all
we care about is the decay of the solution and we don’t mind any possible oscillations: more about
this below. In order to find the largest eigenvalue we sort the computed eigenvalues by magnitude
[ignr,IX]=sort (abs(diag(D)));
V=V(:,IX); D=D(IX,IX);
Ds =diag(D);
and then we take the last one as λmax :
dt=2/abs(Ds(end))
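The steps just described can be sketched in a self-contained form as follows. The assembly of C and K follows (6.2); the wall thickness of 0.6 m and the unit cross-section are assumptions for illustration, and the call to eig on C\K is our guess at how V and D were obtained, not a quotation from heatCond1.

```matlab
% Stable time step of the forward Euler integrator from the largest eigenvalue
nTemps = 30;  Lwall = 0.6;  A = 1.0;      % Lwall and A are assumed values
kappa  = 1.81;  cv = 0.22*2350;           % data for concrete
h = Lwall/nTemps;
e = ones(nTemps,1);
K = (kappa*A/h)*(diag(2*e) - diag(e(1:end-1),1) - diag(e(1:end-1),-1));
K(end,end) = kappa*A/h;                   % insulated right-hand face
C = cv*h*A*diag([e(1:end-1); 0.5]);       % half-length last control volume
[V,D] = eig(C\K);                         % modes of the uncoupled equations
[~,IX] = sort(abs(diag(D)));              % sort by magnitude, ascending
V = V(:,IX);  D = D(IX,IX);
Ds = diag(D);
dt = 2/abs(Ds(end));                      % condition (3.6) for the largest one
fprintf('largest |lambda| = %g, stable time step dt = %g\n', abs(Ds(end)), dt);
```

By the Gershgorin circle theorem the largest eigenvalue cannot exceed 4κ/(c_V h²), so the computed time step should never be smaller than c_V h²/(2κ).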
Finally, with the definition of the integration interval
tend= 30;
nsteps = tend/dt;
tspan= [0,nsteps*dt];
and the initial conditions
y0(1:nTemps)= zeros(nTemps,1);% initial conditions
we call the forward Euler integrator.
% Select integrator
[t,y]=odefeul(@rhs,tspan,y0,odeset('InitialStep',dt)); style='m^-.';
The result is shown in Figure 6.18, which shows the final distribution of temperature throughout
the wall.
[Figure: temperature T (0 to 100) versus position x (0 to 0.6) at t = 30.0378 = tend.]
Fig. 6.18. Temperature distribution at the end of the time interval. Simulation for ∆t = 2/|λ30 | .
Figure 6.19 visualizes the eigenvectors (modes) of the matrix $C^{-1}K$.6 Similarly to the vibration
response (modal analysis), the distribution of the temperature through the thickness of the wall may
be described as a linear combination of the modes. The first few modes (those for small eigenvalues)
can be seen to be very smooth. Not so for the modes 29 and 30 corresponding to the largest
eigenvalues. Clearly we would expect the temperature distribution to be smooth, most like the first
mode, and not at all like modes 29 and 30. Yet, if we uncoupled all the equations, the modes 29 and
30 would have to be integrated with the smallest time steps according to (3.6). Perhaps, we could be
thinking, there is no point for the numerical integration of the system of ODEs to spend time and
effort on integrating the modes which obviously shouldn’t play a significant role in the representation
of the temperature. So, could we integrate the coupled system of ODEs at a time step length which
is larger than the one necessary for the highest mode? (Recall that we are using a forward Euler
integrator which does have restrictions on the time step length for negative eigenvalues – decaying
solutions.)
Figure 6.20 shows the distribution of temperature computed with a script identical to heatCond1,
except that the time step is set not from the largest eigenvalue (the 30th), but the largest but one
eigenvalue (the 29th).7 In other words, the 30th mode is not going to be integrated with a sufficiently
short time step length for the forward Euler to perform well for it. As a consequence, the integrator
blows up for the 30th mode: the mode instead of decaying, as it should, given that it corresponds
to a negative eigenvalue, grows. Within the short integration interval this mode grows sufficiently
large to pollute the picture of the temperature distribution with totally unphysical oscillations.
So unfortunately the answer is no: when the system is integrated as coupled, if the integrator
has a restriction on the time step, this restriction needs to be applied to the mode with the largest
eigenvalue. No exceptions. And the response is very sensitive: look at Figure 6.21: the difference
between eigenvalues 29 and 30, and consequently between the stable time step lengths for modes 29
and 30, is very slight. Nevertheless, increasing the time step length beyond the shortest one leads
immediately to a blowup as illustrated by Figure 6.20.
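The sensitivity can be made concrete with the amplification factor of the forward Euler method, |1 + λΔt|, applied to each uncoupled mode. The two eigenvalues below are hypothetical numbers chosen only to be close to each other; they are not the eigenvalues of the heat conduction model.

```matlab
% Forward Euler amplification factors for two neighboring modes
lambda29 = -990;                 % hypothetical next-to-largest eigenvalue
lambda30 = -1000;                % hypothetical largest eigenvalue
dt  = 2/abs(lambda29);           % time step set from mode 29 instead of mode 30
g29 = abs(1 + lambda29*dt);      % amplification factor of mode 29 per step
g30 = abs(1 + lambda30*dt);      % amplification factor of mode 30 per step
nsteps   = 1000;
growth30 = g30^nsteps;           % accumulated growth of mode 30
fprintf('g29 = %g, g30 = %g, growth of mode 30 after %d steps: %g\n', ...
    g29, g30, nsteps, growth30);
```

Even though the factor for mode 30 exceeds one by only about two percent, the growth compounds over the many steps of the simulation, so a small initial content of that mode eventually dominates the solution.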
6
See: aetna/HeatConduction/heatCond1modes.m
7
See: aetna/HeatConduction/heatCond2.m
[Figure: six panels of temperature T versus control volume number (5 to 30), one panel per mode.]
Fig. 6.19. Eigenvectors for the model (6.2). Left to right, top to bottom: Modes 1, 2, 3, 4, 29, 30.
[Figure: temperature T (0 to 100) versus position x (0 to 0.6) at t = 30.0186 = tend.]
Fig. 6.20. Temperature distribution at the end of the time interval. Simulation for ∆t = 2/|λ29 |.
A concluding remark concerning the formula for the calculation of the stable time step: The stable
time step for the coupled system is determined by the stable time step for the highest frequency.
However the highest-frequency mode was seen above not to contribute very much to the overall shape
of the solution, and hence we can afford to use the condition (3.6) instead of the more restrictive (3.5).
We don’t mind the oscillations induced in the highest modes as they are unimportant, we just want
them to decay.
Differential equation models of this type are called stiff. By numerical stiffness we mean that while
accuracy may require a certain step length, stability dictates step lengths much smaller (potentially
by orders of magnitude). This makes numerical integration with integrators such as ode45, ode23,
odefeul very costly (long run times to get an acceptable solution). These are explicit solvers which
are non-optimal or unsuitable for stiff problems. There are specialized solvers for stiff IVPs, and
they are invariably implicit (as opposed to explicit). For instance, MATLAB has ode23s (note the
suffix s, indicating that the solver is for stiff problems). The aetna integrator odetrap8 is also an
example of a solver appropriate for stiff problems.
[Figure: largest possible time step 2/|λ| (logarithmic axis, 10^-2 to 10^2) versus mode number (0 to 30).]
Fig. 6.21. Largest possible time step ∆ti = 2/|λi | for all modes i = 1, 2, . . . 30.
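To see why an implicit integrator tolerates step lengths that defeat an explicit one, we can compare the forward Euler and trapezoidal updates on the scalar model equation ẏ = λy. This is a sketch with made-up numbers (λ = −1000 and a time step five times the forward Euler limit 2/|λ|); it does not call odefeul or odetrap, it only applies their one-step formulas for this linear equation.

```matlab
% Forward Euler versus trapezoidal rule for a stiff scalar equation
lambda = -1000;                  % fast-decaying mode (assumed value)
dt     = 0.01;                   % five times the forward Euler limit 2/|lambda|
nsteps = 50;
yfe = 1;  ytr = 1;               % same initial condition for both methods
for j = 1:nsteps
    yfe = (1 + lambda*dt)*yfe;                        % forward Euler step
    ytr = (1 + lambda*dt/2)/(1 - lambda*dt/2)*ytr;    % trapezoidal step
end
fprintf('forward Euler: %g, trapezoidal: %g, exact: %g\n', ...
    yfe, ytr, exp(lambda*dt*nsteps));
```

The trapezoidal solution decays (with an oscillating sign, since the step is long), while the forward Euler solution grows without bound.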
Suggested experiments
1. Use the integrator odetrap to integrate the transient heat conduction problem in heatCond2.9
Verify that the time step length can be increased beyond that allowed by the explicit solvers.
8
See: aetna/utilities/ODE/integrators/odetrap.m
9
See: aetna/HeatConduction/heatCond2.m
7
Analyzing errors
Summary
1. The basic tool here is the Taylor series. Especially important is the Lagrange remainder term.
2. We use it to reason about order-of estimates (i.e. big-O notation). Main idea: as we control error
in numerical algorithms by decreasing the time step length, the element size, and other control
parameters, towards zero, the first term of the Taylor series that is missing in our model will
dominate the error. We use these ideas to evaluate errors of integrals and estimate local and
global errors of ODE integrators.
3. Combining order-of error estimates with repeated solutions with different time step lengths allows
us to construct time-adaptive integrators. Main idea: by controlling the local error (estimated
from the Taylor series) we attempt to deliver the solution within a user-given error tolerance.
4. We discuss the approximation of derivatives by the so-called finite difference stencils. Main idea:
the total error has components of a distinct nature, the truncation error and the machine-
representation error.
5. The computer represents numbers as collections of bits. Main idea: The machine-representation
error (round-off) is due to the ability of the computer to store only some values, to which
results of arithmetic operations must be converted (with the attendant loss of precision).
Illustration 1
Warning: The Taylor series need not be convergent. For instance, the Taylor series of the function
log(1 + x) at x = 0 converges in the interval −1 < x ≤ 1. Outside this interval the Taylor series does not
converge (the more terms are added, the worse the approximation becomes). Try the following code
that uses the taylor MATLAB function.
syms x real
t=taylor(log(1+x),x,'Order',6);   % terms up to x^5 (name-value syntax of current releases)
x=linspace(-1,+2,100);            % from here on x is an array of numbers
plot (x,log(1+x))
hold on
plot (x,eval(vectorize(t)),'--')
Note the use of vectorize: MATLAB will choke on all those powers of x from the Taylor series
function when x is an array of numbers.
Often it is useful to truncate the Taylor series exactly (that is to write down a finite number of
the terms, but still preserve the exact meaning). The Lagrange remainder can be used for this
purpose. For instance we can write
\[ y(x) = y(x_0) + \frac{dy(\hat{x})}{dx} (x - x_0) \]
to truncate after the first term, or
\[ y(x) = y(x_0) + \frac{dy(x_0)}{dx} (x - x_0) + \frac{d^2 y(\hat{x})}{dx^2} \frac{(x - x_0)^2}{2} \]
to truncate after the second term. Both truncations are exact (when the Taylor series converges, of
course). The trick is to write the last term (which is the Lagrange remainder) with a derivative taken
at $\hat{x}$ somewhere between $x_0$ and $x$. The location $\hat{x}$ is not the same in the two truncations above.
In general, we would write
\[ R_n = \frac{d^{n+1} y(\hat{x})}{dx^{n+1}} \frac{(x - x_0)^{n+1}}{(n+1)!} . \tag{7.1} \]
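The two exact truncations can be tried out numerically. As a sketch we pick y(x) = exp(x) (our choice, convenient because every derivative is again exp(x)), expand at x0 = 0, and solve each truncation for the location of the derivative in the remainder.

```matlab
% Locate the point xhat of the Lagrange remainder for y(x) = exp(x)
x0 = 0;  x = 0.5;
y  = @(s) exp(s);                        % the function and all its derivatives
% Truncation after the first term: y(x) = y(x0) + y'(xhat1)*(x - x0)
xhat1 = log((y(x) - y(x0))/(x - x0));
% Truncation after the second term:
% y(x) = y(x0) + y'(x0)*(x - x0) + y''(xhat2)*(x - x0)^2/2
xhat2 = log(2*(y(x) - y(x0) - y(x0)*(x - x0))/(x - x0)^2);
fprintf('xhat1 = %g, xhat2 = %g (both between %g and %g)\n', ...
    xhat1, xhat2, x0, x);
```

Both locations fall strictly between x0 and x, and they differ from each other, as stated above.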
Having reminded ourselves of the basics of Taylor series approximation, we can look at a very
useful tool (terminology really) to help us with engineering analyses of all kinds.
\[ \lim_{x \to 0} \frac{|f(x)|}{|g(x)|} < M < \infty , \]
where we require $g(x) \neq 0$ for $x \neq 0$. In words, the absolute values of the two functions are in some
proportion that is of finite magnitude. We write $f(x) \in O(g(x))$ and say “f of x is big o g of x as x
goes to zero”. The meaning of this definition is that “$|f(x)|$ decreases towards zero at least as fast
as $|g(x)|$”.
7.2 Order-of analysis 117
Illustration 2
Example 1: Consider $f(x) = 0.1x + 30x^2$, for $x > 0$. Show that it is of order $g(x) = x$ as $x \to 0$.
We form the fraction and simplify
\[ \lim_{x \to 0} \frac{|f(x)|}{|g(x)|} = \lim_{x \to 0} \frac{|0.1x + 30x^2|}{|x|} = \lim_{x \to 0} \frac{0.1x + 30x^2}{x} = \lim_{x \to 0} (0.1 + 30x) = 0.1 < \infty \]
Conclusion: $f(x) = 0.1x + 30x^2$ is of order $g(x) = x$ as $x \to 0$. We say “f of x is big o x”, and write
$f(x) = 0.1x + 30x^2 \in O(x)$.
Example 2: Consider $f(x) = 0.1x + 30$, for $x > 0$. Show that it is of order $g(x) = 1$ as $x \to 0$.
We form the fraction and simplify
\[ \lim_{x \to 0} \frac{|f(x)|}{|g(x)|} = \lim_{x \to 0} \frac{|0.1x + 30|}{|1|} = \lim_{x \to 0} \frac{0.1x + 30}{1} = 30 < \infty \]
Conclusion: $f(x) = 0.1x + 30$ is of order $g(x) = 1$ as $x \to 0$. We say “f of x is big o one”, and write
$f(x) = 0.1x + 30 \in O(1)$.
Example 3: Consider $f(x) = 0.1x + 30x^2$, for $x > 0$. Show that $f(x)$ is not of order $g(x) = x^2$ as
$x \to 0$.
We form the fraction and simplify
\[ \lim_{x \to 0} \frac{|f(x)|}{|g(x)|} = \lim_{x \to 0} \frac{|0.1x + 30x^2|}{|x^2|} = \lim_{x \to 0} \frac{0.1x + 30x^2}{x^2} = \lim_{x \to 0} (0.1/x + 30) \to \infty \]
Conclusion: the limit is not finite, and therefore $f(x) = 0.1x + 30x^2 \notin O(x^2)$.
When analyzing algorithms, our interest is typically to find out how quickly their errors decrease as
a function of the accuracy control knob (which may be the time step, or the grid spacing, according
to the algorithm). The assumption is that accuracy is improving as the control knob makes the time
step (or the grid spacing) smaller (approaching zero).
Given an expression such as $f(\Delta t) = 0.1\Delta t + 30\Delta t^2$ our interest would be to find the dominant
term, that is the term that decreases to zero the slowest, as $\Delta t \to 0$. In the examples above we have
discovered that $f(\Delta t) = 0.1\Delta t + 30\Delta t^2 \in O(\Delta t)$. This to us indicates that $f(\Delta t)$ decreases toward
zero at most as quickly as $\Delta t$. It does not decrease as quickly as $\Delta t^2$. Also, it does decrease toward
zero, which a constant, 1, does not. The notation $f(\Delta t) \in O(\Delta t)$, and $f(\Delta t) \notin O(\Delta t^2)$, $f(\Delta t) \notin O(1)$
helps us filter out things that are not important, the numerical values of the coefficients (0.1 and 30),
what other unimportant terms there might be ($\Delta t^2$), and keep just the information that matters to
us: $f(\Delta t) = 0.1\Delta t + 30\Delta t^2 \in O(\Delta t)$.
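The limits in the examples above can also be explored numerically, by tabulating the ratios for a decreasing sequence of arguments. The following lines are a small sketch with variable names of our choosing.

```matlab
% Ratios |f|/|g| for f(x) = 0.1*x + 30*x^2 and two candidate orders g
f = @(x) 0.1*x + 30*x.^2;
x = 10.^(-(1:6));                % x = 1e-1, 1e-2, ..., 1e-6
ratio1 = f(x)./x;                % candidate g(x) = x: approaches 0.1
ratio2 = f(x)./x.^2;             % candidate g(x) = x^2: grows without bound
disp([x', ratio1', ratio2'])
```

The second column of the printout settles at a finite value, while the third column keeps growing as x decreases.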
Illustration 3
Illustration 4
Estimate the resulting magnitude of the Taylor series sum for tj+1 → tj . Assume that all the
derivatives exist and are finite numbers.
\[ \frac{d^2 y(t_j)}{dt^2} \frac{(t_{j+1} - t_j)^2}{2} + \frac{d^3 y(t_j)}{dt^3} \frac{(t_{j+1} - t_j)^3}{3!} + \frac{d^4 y(t_j)}{dt^4} \frac{(t_{j+1} - t_j)^4}{4!} + \ldots . \]
First of all, the Taylor series is a polynomial in the quantity tj+1 − tj , and this quantity goes to
zero as tj+1 → tj . Therefore, we can introduce the new variable τ = tj+1 − tj and write
The goal here is to estimate the error of the Riemann-sum approximation of integrals of one variable
using the order-of analysis. For instance, as shown in Figure 7.1, approximate the integral
\[ \int_a^b y(x) \, dx \]
using the Riemann-sum approximation indicated by the filled rectangles in the figure. The error of
approximating the actual area between x0 and x0 + h by the rectangle y(x0 )h may be estimated by
expressing the Taylor series of y(x) at x0
\[ y(x) = y(x_0) + \frac{dy(x_0)}{dx} (x - x_0) + \frac{d^2 y(x_0)}{dx^2} \frac{(x - x_0)^2}{2} + \ldots \]
and integrating the Taylor series, where we can conveniently introduce the change of variables
s = x − x0
\[ \int_{x_0}^{x_0+h} y(x) \, dx = \int_0^h \left[ y(x_0) + \frac{dy(x_0)}{dx} s + \frac{d^2 y(x_0)}{dx^2} \frac{s^2}{2} + \ldots \right] ds . \]
We obtain
\[ \int_{x_0}^{x_0+h} y(x) \, dx = y(x_0) h + \frac{dy(x_0)}{dx} \frac{h^2}{2} + \frac{d^2 y(x_0)}{dx^2} \frac{h^3}{6} + \ldots . \]
Comparing with the approximate area $y(x_0)h$, we express the error as
\[ e = \frac{dy(x_0)}{dx} \frac{h^2}{2} + \frac{d^2 y(x_0)}{dx^2} \frac{h^3}{6} + \ldots . \]
We recall that the lowest polynomial power dominates, and therefore
\[ e \in O(h^2) . \]
The integral of the function y(x) between a and b is approximated as a sum of the areas of the
rectangles, let us say all of the same width h. There are
\[ n = \frac{b - a}{h} \]
such rectangles. A pessimistic estimate of the total error magnitude would ignore the possibility of
error canceling, so that the absolute value of the total error could be bounded by the sum of the
absolute values of the errors committed for each subinterval
\[ |E| \le \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} O(h^2) = n \, O(h^2) = \frac{b - a}{h} O(h^2) = O(h) . \]
Note that when we write the equals sign in the above equation, we don’t really mean equality, we
use it rather informally to mean “is”. In the terms of the order-of analysis, we would write for the
error E of the integral from a to b
E ∈ O(h) .
From the point of view of the user of the Riemann-sum approximation this is good news: The error
can be controlled. By decreasing h (that is by using more subintervals) we can make the total error
smaller. It would be even nicer if the error was $O(h^2)$, since then it would decrease faster when h
was decreased. We demonstrate this as follows: assume that we use twice as many subintervals, so
that h is replaced by h/2. For $E \in O(h)$ the error would decrease as
\[ \frac{E(h/2)}{E(h)} \approx \frac{h/2}{h} = \frac{1}{2} , \]
so the error decreases by a factor of two. For $E \in O(h^2)$ the error would decrease as
\[ \frac{E(h/2)}{E(h)} \approx \frac{(h/2)^2}{h^2} = \frac{1}{4} , \]
so the error decreases by a factor of four. The payoff of using twice as many intervals is better
this time.
Now we estimate the error of the midpoint approximation of integrals of one variable
using the order-of analysis. For instance, as shown in Figure 7.2, approximate the integral
\[ \int_a^b y(x) \, dx \]
using the midpoint approximation indicated by the filled rectangles in the figure. The error of
approximating the actual area between x0 − h/2 and x0 + h/2 by the rectangle y(x0 )h may be
estimated by expressing the Taylor series of y(x) at x0
\[ y(x) = y(x_0) + \frac{dy(x_0)}{dx} (x - x_0) + \frac{d^2 y(x_0)}{dx^2} \frac{(x - x_0)^2}{2} + \ldots \]
and integrating the Taylor series, where we introduce the change of variables $s = x - x_0$
\[ \int_{x_0 - h/2}^{x_0 + h/2} y(x) \, dx = \int_{-h/2}^{h/2} \left[ y(x_0) + \frac{dy(x_0)}{dx} s + \frac{d^2 y(x_0)}{dx^2} \frac{s^2}{2} + \ldots \right] ds . \]
We obtain
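\[ \int_{x_0 - h/2}^{x_0 + h/2} y(x) \, dx = y(x_0) h + \frac{d^2 y(x_0)}{dx^2} \frac{h^3}{24} + \ldots . \]
Comparing with the approximate area $y(x_0)h$, the error is
\[ e = \frac{d^2 y(x_0)}{dx^2} \frac{h^3}{24} + \ldots \in O(h^3) , \]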
which is one order higher than the error estimated for the Riemann sum. The integral of the function
y(x) between a and b is approximated as a sum of the areas of the n rectangles and the absolute
value of the total error could be bounded by the sum of the absolute values of the errors committed
for each subinterval
\[ |E| \le \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} O(h^3) = n \, O(h^3) = \frac{b - a}{h} O(h^3) = O(h^2) . \]
The order-of analysis tells us that the error E of the integral from a to b for the midpoint rule is
E ∈ O(h2 )
and therefore the midpoint rule is more accurate than either of the Riemann sum rules.
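Both estimates can be checked experimentally by doubling the number of subintervals and watching the error. The sketch below uses the integral of exp(x) from 0 to 1 as a test case (our choice, because the exact value is known), with the left-endpoint Riemann sum and the midpoint rule.

```matlab
% Convergence of the Riemann sum and the midpoint rule for int_0^1 exp(x) dx
y = @(x) exp(x);  a = 0;  b = 1;
exact = exp(1) - 1;
n  = [100 200];                          % number of subintervals, then doubled
EL = zeros(1,2);  EM = zeros(1,2);
for i = 1:2
    h  = (b - a)/n(i);
    xl = a + h*(0:n(i)-1);               % left endpoints of the subintervals
    EL(i) = abs(h*sum(y(xl)) - exact);        % Riemann sum error
    EM(i) = abs(h*sum(y(xl + h/2)) - exact);  % midpoint rule error
end
ratioL = EL(1)/EL(2);                    % expect about 2 for O(h)
ratioM = EM(1)/EM(2);                    % expect about 4 for O(h^2)
fprintf('error ratio: Riemann %g, midpoint %g\n', ratioL, ratioM);
```

Halving h reduces the Riemann-sum error by roughly a factor of two and the midpoint error by roughly a factor of four, in agreement with the order-of estimates.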
ẏ = f (t) , y(0) = y0 .
\[ y(\Delta t) = y_0 + \int_0^{\Delta t} f(\tau) \, d\tau . \]
\[ y(\Delta t) \approx y_0 + f(0) \Delta t \]
which leads to exactly the same kind of error estimate, $O(\Delta t^2)$, for moving the solution forward by one
time step.
The situation is complicated somewhat by considering right-hand sides which depend both on
the time t and the solution y. For instance, Figure 7.3 shows what happens for the equation
ẏ = − cos(2t)y , y(0) = 1 .
Each step of the forward Euler algorithm drifts off from the original curve. So we see one solution
curve departing from the starting point (t0 , y0 ), but after one step the forward Euler no longer tries
to follow that curve, but rather the one starting at (t1 , y1 ), and so on. Clearly, here is the potential
for amplifying small errors if the solution curves part company rapidly as the time goes on. However,
provided we use time steps which are sufficiently small so that the forward Euler does not excessively
amplify these little drifts, we can estimate the error on the entire solution interval (the so-called
global error ) from the so-called local errors in each time step.
[Figure: computed points y0, y1, y2, y3, y4 at times t0, t1, t2, t3, t4, each lying on a different solution curve.]
Fig. 7.3. Forward Euler integration drifting off the original solution path.
ẏ = f (t, y) , y(0) = y 0 ,
At the same time we can expand the solution in a Taylor series at $(t_j, y_j)$, using
\[ \frac{dy(t_j)}{dt} = f(t_j, y_j) \]
to get
\[ y(t_{j+1}) = y(t_j) + f(t_j, y_j)(t_{j+1} - t_j) + \frac{d^2 y(t_j)}{dt^2} \frac{(t_{j+1} - t_j)^2}{2} + \ldots \]
and then move the first two terms on the right-hand side onto the left-hand side
\[ y(t_{j+1}) - y(t_j) - f(t_j, y_j)(t_{j+1} - t_j) = \frac{d^2 y(t_j)}{dt^2} \frac{(t_{j+1} - t_j)^2}{2} + \ldots . \]
Finally, the second and third term on the left-hand side are $-y_{j+1}$, and so we obtain the local error
(also called truncation error) in this time step as
\[ y(t_{j+1}) - y_{j+1} = \frac{d^2 y(t_j)}{dt^2} \frac{(t_{j+1} - t_j)^2}{2} + \ldots . \]
We note two things: first, the leading term of the local error is proportional to the square of the time step length;
and secondly, the coefficient of this term is the second derivative at $(t_j, y_j)$ which measures the
curvature of the solution curve at that point. The more the curve curves, the larger the error. If
the solution happens to have a zero curvature at (tj , y j ) then we would predict that the Euler step
should not incur any error. It still might: our prediction neglected all those “dots” (the higher order
terms) in the Taylor series, but at least for zero curvature the second order term in the error would be
absent. The local error resulted from the truncation of the Taylor series, which is a good explanation
of why it is called the truncation error.
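The second-order behavior of the local error is easy to observe for the equation ẏ = −cos(2t)y, y(0) = 1 used above, whose exact solution is y(t) = exp(−sin(2t)/2). The following sketch takes a single forward Euler step from the initial condition with two step lengths (the step lengths are our choice).

```matlab
% Local (truncation) error of one forward Euler step for ydot = -cos(2t)*y
f      = @(t,y) -cos(2*t)*y;
yexact = @(t) exp(-sin(2*t)/2);          % exact solution with y(0) = 1
dts = [0.01 0.005];                      % a step length and half of it
LE  = zeros(1,2);
for i = 1:2
    y1    = 1 + dts(i)*f(0,1);           % one Euler step from (t0,y0) = (0,1)
    LE(i) = abs(yexact(dts(i)) - y1);    % local error of that step
end
ratio = LE(1)/LE(2);                     % expect about 4 for O(dt^2)
fprintf('local errors %g and %g, ratio %g\n', LE(1), LE(2), ratio);
```

Since the second derivative of the exact solution at t = 0 equals one, the local error should also be close to Δt²/2, which the printed values confirm.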
We have demonstrated above (see Figure 7.3) that the global error , that is the difference between
the analytical exact solution y(tn ) and the computational solution yn , is a mixture of two compo-
nents. Now we will look at the global error in detail. We will try to estimate the global error at
time tn+1 , GEn+1 , from the global error GEn at time tn : see Figure 7.4. Note that we are thinking
in terms of a scalar differential equation, but the conclusions may be readily generalized to coupled
equations.
7.3 Estimating error in ODE integrators 123
The first component of the global error is the local (truncation) error which is caused by the
truncation of the Taylor series as explained in the previous section.
The second component is caused by the drift-off in the previous steps of the algorithm: every
step of the integrator will cause the solution to drift off the original curve passing through the
initial condition. Let us consider performing one single step of the numerical integration, from $t_n$ to
$t_{n+1}$. Two different curves pass through the two points $(t_n, y_n)$ and $(t_n, y(t_n))$: let us say $\tilde{y}(t)$ passes
through $(t_n, y_n)$, and $y(t)$ passes through $(t_n, y(t_n))$. The difference between the points $(t_n, y_n)$ and
$(t_n, y(t_n))$ is the global error at time $t_n$, $GE_n$.
The difference between the two curves $y(t_{n+1}) - \tilde{y}(t_{n+1})$ at time $t_{n+1}$ measures the propagated
error. We can estimate the propagated error $PE_{n+1}$ as the global error $GE_n$ plus the
increase of the distance between the two curves. The increase can be approximated to first order
from the slopes $\dot{\tilde{y}}(t_n) = f(t_n, y_n)$ and $\dot{y}(t_n) = f(t_n, y(t_n))$
\[ PE_{n+1} \approx GE_n + \left( f(t_n, y(t_n)) - f(t_n, y_n) \right) (t_{n+1} - t_n) . \]
We can also use the Taylor series to expand the right-hand side function f as
\[ f(t_n, y(t_n)) \approx f(t_n, y_n) + \frac{\partial f(t_n, y_n)}{\partial y} (y(t_n) - y_n) \]
to obtain
\[ PE_{n+1} \approx GE_n + \frac{\partial f(t_n, y_n)}{\partial y} (y(t_n) - y_n)(t_{n+1} - t_n) \]
and substituting $GE_n = y(t_n) - y_n$ we arrive at
\[ PE_{n+1} \approx GE_n \left[ 1 + \frac{\partial f(t_n, y_n)}{\partial y} (t_{n+1} - t_n) \right] . \]
This is really saying that the propagated error in step tn+1 is the global error in step tn plus a
little bit more due to the difference between the slopes at yn and y(tn ). As an illustration consider
a model equation
ẏ = λy , y(0) = y0 .
For this model equation the propagated error will read
P En+1 ≈ GEn (1 + λ(tn+1 − tn )) .
Thus we see that the propagated error will be controlled by the stability (growth versus de-
cay) of the analytical solution: for positive λ the propagated error will exponentially increase as
(1 + λ(tn+1 − tn )) > 1, for negative lambda (and sufficiently small time step) the propagated error
will likely decrease as (1 + λ(tn+1 − tn )) < 1.
Under reasonable assumptions concerning the smoothness of the right-hand side function f (note
well that this will not include models such as the friction stick-slip), the global error may be estimated
from the local errors using a (pessimistic) assumption that the local errors will never cancel each
other, they will always add up. Then we can estimate the global error $GE_{n+1} = y(t_{n+1}) - y_{n+1}$ as
\[ |GE_{n+1}| \le \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} O(\Delta t^2) = n \, O(\Delta t^2) = \frac{t}{\Delta t} O(\Delta t^2) = O(\Delta t) . \]
Thus we see that we lost one order in the error estimate going from local to global errors. The
forward Euler algorithm was second order locally, but it is only first order globally.
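The loss of one order can be observed numerically. As a sketch, we integrate ẏ = −cos(2t)y, y(0) = 1 (exact solution exp(−sin(2t)/2)) with a hand-written forward Euler loop up to t = 4, using a fixed step and then half of it; the interval and the step counts are our choices.

```matlab
% Global error of forward Euler at the final time for two step lengths
f      = @(t,y) -cos(2*t)*y;
yexact = @(t) exp(-sin(2*t)/2);          % exact solution with y(0) = 1
tend   = 4;
nsteps = [2000 4000];                    % second run uses half the step length
GE = zeros(1,2);
for i = 1:2
    dt = tend/nsteps(i);
    t = 0;  y = 1;
    for j = 1:nsteps(i)
        y = y + dt*f(t,y);               % forward Euler step
        t = t + dt;
    end
    GE(i) = abs(yexact(tend) - y);       % global error at the final time
end
ratio = GE(1)/GE(2);                     % expect about 2 for O(dt)
fprintf('global errors %g and %g, ratio %g\n', GE(1), GE(2), ratio);
```

Halving the time step roughly halves the global error, whereas the local error of a single step would drop by a factor of four.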
Illustration 5
Now we can go back to the graphs of Chapter 2, especially Figure 2.19. The slopes of the error curves on
the log-log scale will now make sense. For the forward Euler we now know that its local error is
second order, but the global error is first order. Figure 2.19 displays the global error, and hence
the slope (i.e. the convergence rate) is one. For the modified Euler the global error is second order,
consequently its local error is cubic in the time step.
[Figure: solution curves and computed points y0, y1, y2, ..., yn, yn+1 over t0, ..., t4, with the labels y(tn), y(tn+1), y(t), f(tn, y(tn)), f(tn, yn), GEn, and LEn+1.]
Fig. 7.4. Global error of the forward Euler integration. LEn+1 = local (truncation) error, P En+1 = propa-
gated error, GEn = global error at time tn , GEn+1 = global error at time tn+1 .
Suggested experiments
1. Estimate from Figure 2.21 the order of the local error of the oderk4 Runge-Kutta integrator.
\[ \| y(t_{1,p}) - \tilde{y}_{1,p} \| , \]
where $y(t_{1,p})$ is the elusive true solution obtained by moving along the exact curve from $(t_0, y_0)$.
The same integrator can then be applied again from $(t_0, y_0)$, but this time with step size only half
as long, $\Delta t_p / 2$. The resulting solution is $\tilde{\tilde{y}}_{1,p}$. This results in the error
\[ \| y(t_{1,p}) - \tilde{\tilde{y}}_{1,p} \| . \]
Now we use the knowledge of the local error of the particular integrator as obtained from the Taylor
series expansion as explained above. The local error for the full step length is
\[ \| y(t_{1,p}) - \tilde{y}_{1,p} \| = C \Delta t_p^k \tag{7.2} \]
and the local error for the half step length is (assuming the errors from the two steps add up,
excluding thereby possible canceling)
\[ \| y(t_{1,p}) - \tilde{\tilde{y}}_{1,p} \| = 2C \left( \frac{\Delta t_p}{2} \right)^k , \tag{7.3} \]
where for instance for the forward Euler integrator we have derived above k = 2.
Now the goal of the adaptive time stepping is to keep the magnitude of the local error in each
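time step below a certain tolerance abstol. We will require
\[ \| y(t_{1,p}) - \tilde{y}_{1,p} \| \le \mathrm{abstol} , \]
which, in view of (7.2), leads to the time step length
\[ \Delta t_{estim} = \left( \frac{\mathrm{abstol}}{C} \right)^{1/k} . \]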
We have here a formula for the desired time step length $\Delta t_{estim}$, but we need to estimate the constant
C. We could use (7.2) and (7.3) to solve for C, but we have to estimate the unknown errors on the left-
hand side. We can think of it as estimating the distance of two vectors, and the tool we need may
be found in the well-known triangle inequality: $\|a - b\| = \|a - c + c - b\| \le \|a - c\| + \|c - b\|$. Thus
we obtain
\[ \| y(t_{1,p}) - \tilde{y}_{1,p} \| = \| y(t_{1,p}) - \tilde{\tilde{y}}_{1,p} + \tilde{\tilde{y}}_{1,p} - \tilde{y}_{1,p} \| \le \| y(t_{1,p}) - \tilde{\tilde{y}}_{1,p} \| + \| \tilde{\tilde{y}}_{1,p} - \tilde{y}_{1,p} \| \]
and replacing the first term on the right-hand side from (7.3) yields
\[ \| y(t_{1,p}) - \tilde{y}_{1,p} \| \le C \Delta t_p^k \le 2C \frac{\Delta t_p^k}{2^k} + \| \tilde{\tilde{y}}_{1,p} - \tilde{y}_{1,p} \| \]
where the last term on the right-hand side is known. Hence the constant is expressed as
\[ C \le \frac{2^{k-1}}{2^{k-1} - 1} \, \frac{\| \tilde{\tilde{y}}_{1,p} - \tilde{y}_{1,p} \|}{\Delta t_p^k} \]
Now recall that we have already made a tentative time step with step length ∆tp . That computation
is an expense we can’t take back, so to make the best of it we can do the following:
• If the estimated time step is longer than the step we actually took, $\Delta t_{estim} \ge \Delta t_p$, the estimation
is telling us that we could have taken an even longer time step. In order to save computation, we
may just as well accept the solution $(t_1 = t_{1,p}, y_1 = \tilde{y}_{1,p})$, and move on to the next time step.
The next predicted time step length is $\Delta t_p = \Delta t_{estim}$.
• Otherwise, if the time step we took was longer than the estimated step ∆testim < ∆tp , the
estimation tells us that we should have taken a shorter time step. Therefore, in order to maintain
good accuracy we repeat the step from (t0 , y 0 ) with the time step reset to ∆tp = ∆testim . The
calculation of the solutions and the calculation of the estimated time step is repeated as above.
As an alternative, we may consider the value obtained with the two half-steps as more accurate
than the one obtained with the full step, and accept the new solution as
($t_1 = t_{1,p}$, $y_1 = \tilde{\tilde{y}}_{1,p}$). The
aetna toolbox implements a forward Euler adaptive integrator odefeuladapt¹, and a fourth-order
Runge-Kutta adaptive integrator oderk4adapt² using these principles.
¹ See: aetna/utilities/ODE/integrators/odefeuladapt.m
² See: aetna/utilities/ODE/integrators/oderk4adapt.m
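The accept/reject logic described above can be sketched in a few lines of MATLAB. This is only an illustration for the forward Euler case (k = 2), not the code of odefeuladapt: the right-hand side f, the tolerance, the initial step, and the safety factor applied on rejection are all assumptions made for this sketch.

```matlab
% Step-doubling adaptive forward Euler: an illustrative sketch, not the aetna code.
f = @(t,y) -2*y;            % hypothetical right-hand side (exact solution exp(-2*t))
t = 0; y = 1; tend = 1;     % hypothetical initial condition and final time
abstol = 1e-4;              % tolerance on the local error
k = 2;                      % exponent of the local error of forward Euler
dtp = 0.1;                  % first tentative step length
safety = 0.9;               % assumed shrink factor on rejection (not in the text)
while t < tend
    dtp = min(dtp, tend - t);                    % do not step past tend
    y1 = y + dtp*f(t,y);                         % one full step
    yh = y + dtp/2*f(t,y);                       % first half-step
    y2 = yh + dtp/2*f(t+dtp/2,yh);               % second half-step
    C = 2^(k-1)/(2^(k-1)-1)*norm(y2-y1)/dtp^k;   % bound on the constant C
    dtestim = (abstol/max(C,realmin))^(1/k);     % from C*dtestim^k = abstol
    if dtestim >= dtp
        t = t + dtp; y = y1;                     % accept the full-step solution
        dtp = dtestim;                           % predicted next step length
    else
        dtp = safety*dtestim;                    % reject, repeat with a shorter step
    end
end
```

The safety factor guards against a rejected step being retried with a step length that fails the same test by a rounding error; the text itself resets the step to ∆testim without such a factor.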
126 7 Analyzing errors
Illustration 6
We are solving a system of two coupled first-order linear differential equations with forcing in the
form of a periodic function. The right-hand side function and the initial condition are defined as³
rhsf =@(t,y) ([-20,1;1,-2]*y+exp(sin(t)));
y0=[1,-1]';
The integrators may be selected by commenting or uncommenting lines of the form
odesolver=@odefeuladapt; Color ='blue'; Marker ='v';
The integrator options are set by the following line. Note that the aetna integrators work only
with absolute tolerance. In order to level the field, we set the relative tolerance for the MATLAB
integrator ode45 to a very low value (on the order of machine epsilon). Note also that we have to
set the 'refine' control parameter to 1 so that ode45 reports the solution only at the time steps it
actually computed. By default ode45 also produces output at points in between the computed time
steps in order to make the solution look smoother.
options=odeset('InitialStep',0.095,'AbsTol',1e-2,'reltol',2*eps,...
'refine',1);
Figure 7.5 shows on the left the solution components. Note the very sharp initial transient response
that accommodates the initial conditions. The fast-changing initial response requires a rather small
time step. The time step length later increases and reflects the periodic nature of the signals. It is
interesting to note that even though forward Euler is nominally much less accurate than the fourth-
order Runge-Kutta, the time steps required to maintain the required tolerance are not that different.
Also interesting is that ode45 runs at a shorter and more variable time step than either of the aetna
integrators.
[Figure: two panels. Left: solution components versus time. Right: time step versus time. Curves: ode45, odefeuladapt, oderk4adapt.]
Fig. 7.5. Adaptive time-stepping solutions with adaptive forward Euler, adaptive RK4, and ode45. Solution
components on the left, time step length on the right.
Furthermore, we have in Figure 7.6 a good illustration of the fact that adaptive time stepping
tends to be a complicated business. Figure 7.6 shows a close-up of the solution components produced
by ode45, and we may note the strange irregularities in the solid curve (no, they are not supposed
to be there). Somehow the changing time step length in ode45 makes it veer off course (even though
it is using a shorter time step than our methods). This flaw is not present in the aetna odefeuladapt
and oderk4adapt results.
³ See: aetna/AdaptiveTimeStepping/adaptode2.m
7.5 Approximation of derivatives 127
[Figure: close-up of the solution components versus time, computed with ode45.]
Fig. 7.6. Adaptive time-stepping solutions with ode45: notice the irregularly straying off solution curves
Since
\[ \frac{d^2 f(x_0)}{dx^2} \, \frac{(x - x_0)}{2!} \in O(|x - x_0|) \tag{7.5} \]
we see that it will dominate the error as the control parameter, the step along the x axis, x − x0 ,
becomes shorter and shorter. The accuracy of the algorithm (7.4) is quite poor, the error being only
O(|x − x0 |). We call this kind of error the truncation error, since it is the result of the truncation of
the Taylor series.
Illustration 7
Consider a common counterexample where (7.5) is not valid. In Figure 7.7 a piecewise linear function
is shown (in solid line) with its derivative (dashed). If we take (7.4) with x0 to the left of b, for x < b
the formula works perfectly. The second derivative is in fact
\[ \frac{d^2 f(x_0)}{dx^2} = 0 , \]
which makes our derivative computation perfect: no error. Now we will make x0 approach b from
the left arbitrarily closely. The error estimate (7.5) is then no longer valid since at x0 = b
\[ \frac{d^2 f(x_0)}{dx^2} \to \infty . \]
This unfortunate behavior is due to the first derivative being discontinuous at b.
[Figure 7.7: piecewise linear function f(x) (solid line) and its derivative f′(x) (dashed line), with the break at x = b.]
Now we will consider the approximation formula (7.4) for two cases: x > x0 and x < x0 . When
x > x0 we are looking “forward” with the formula to determine the slope at x0 , hence we get the
forward Euler approximation of the derivative. Let us write h = |x − x0 |. Then the formula (7.4)
may be rewritten in the familiar form
\[ \frac{df(x_0)}{dx} \approx \frac{f(x_0 + h) - f(x_0)}{h} . \]
[Figure: graph of f(x) near x0, with the points x0 − h, x0, x0 + h and the values f(x0 − h), f(x0), f(x0 + h) marked.]
Evidently the figure suggests an improvement on these two algorithms. The green line seems to
have a slope rather close to the average of the slopes of the red and blue lines. (The angles between
the blue and green line and between the red and green line are about the same.) So what happens
if we average those Euler predictions?
\[ \frac{1}{2} \left[ \frac{f(x_0 + h) - f(x_0)}{h} + \frac{f(x_0) - f(x_0 - h)}{h} \right] = \frac{f(x_0 + h) - f(x_0 - h)}{2h} . \tag{7.6} \]
The above formula defines another algorithm, the centered difference approximation of the derivative.
Figure 7.9 shows the dashed green line which represents the centered difference approximation
of the tangent, and we can see that the slopes of the dashed and solid green lines are indeed
quite close. It appears that the centered difference approximation should be more accurate, in general,
and we can investigate this analytically by averaging not only the approximation formulas, but
the entire expressions including the errors.
[Figure: graph of f(x) near x0, with the points x0 − h, x0, x0 + h and the values f(x0 − h), f(x0), f(x0 + h) marked.]
Fig. 7.9. Forward and backward Euler and centered difference approximation of the derivative.
The forward difference approximation of the derivative, including the truncation error R2,f
It is one order higher than the truncation errors of the Euler algorithms (O(h^2) versus O(h)), and
higher is better: the error decreases faster with decreasing h.
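The comparison of the orders can be retraced with a short worked derivation. It is a sketch that assumes f has continuous derivatives up to the order used at x0; the notation with primes is chosen here for brevity and need not match the remainder symbols used elsewhere in the text.

```latex
% Taylor expansions about x_0
f(x_0 \pm h) = f(x_0) \pm h f'(x_0) + \frac{h^2}{2} f''(x_0) \pm \frac{h^3}{6} f'''(x_0) + O(h^4)

% Forward and backward Euler approximations, including the leading errors
\frac{f(x_0 + h) - f(x_0)}{h} = f'(x_0) + \frac{h}{2} f''(x_0) + \frac{h^2}{6} f'''(x_0) + O(h^3)
\frac{f(x_0) - f(x_0 - h)}{h} = f'(x_0) - \frac{h}{2} f''(x_0) + \frac{h^2}{6} f'''(x_0) + O(h^3)

% Averaging: the O(h) terms cancel, leaving an O(h^2) truncation error
\frac{f(x_0 + h) - f(x_0 - h)}{2h} = f'(x_0) + \frac{h^2}{6} f'''(x_0) + O(h^4)
```

The terms with the second derivative have opposite signs in the two Euler formulas, which is why they drop out of the average (7.6).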
The formulas for the numerical approximation of derivatives of functions, forward and backward
Euler, and the centered differences, are called finite difference stencils, and many more, sometimes
with a considerably higher accuracy, can be found in the technical literature. The price to pay is
that with higher accuracy one needs more function values around the point x0 .
Illustration 8
We shall now investigate the numerical evidence for these estimates of truncation error.⁴ In the script
compare_conv_driver, x is the point where the derivative is evaluated, n is the number of reductions
of the step, and dx0 is the initial step, which is subsequently reduced by the factor divFactor.
funhand and funderhand are the handles of the function and its derivative (as anonymous MATLAB
functions).
funhand=@(x)2*x^2-1/3*x^3;
funderhand=@(x)4*x-3/3*x^2;
x=1e1;
n= 9;
dx0= 0.3;
divFactor=4;
Figure 7.10 both confirms the expected outcome and presents an unexpected one: the forward
and backward Euler are of the same accuracy, and on the log-log scale the error decreases with rate
of convergence equal to one, and the centered difference is both more accurate in absolute terms
and the error decreases with a convergence rate of two. What may be unexpected however is the
behavior of the centered difference error for very small steps. The error does not decrease anymore,
rather the opposite occurs.
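The experiment can be reproduced in outline without the aetna script. The following is a sketch under the assumption that the step is reduced as dx0/divFactor^(i-1); the variable names errFE, errBE, errCD, rateFE and rateCD are introduced here and are not taken from compare_conv_driver.

```matlab
% Errors of the three difference formulas for a sequence of shrinking steps.
funhand = @(x) 2*x^2 - 1/3*x^3;
funderhand = @(x) 4*x - 3/3*x^2;
x = 1e1; n = 9; dx0 = 0.3; divFactor = 4;
dx = dx0 ./ divFactor.^(0:n-1);                 % assumed step reduction rule
errFE = zeros(1,n); errBE = zeros(1,n); errCD = zeros(1,n);
for i = 1:n
    h = dx(i);
    errFE(i) = abs((funhand(x+h) - funhand(x))/h - funderhand(x));        % forward Euler
    errBE(i) = abs((funhand(x) - funhand(x-h))/h - funderhand(x));        % backward Euler
    errCD(i) = abs((funhand(x+h) - funhand(x-h))/(2*h) - funderhand(x));  % centered difference
end
% Slopes on the log-log scale, estimated from the larger steps only,
% where the truncation error still dominates.
rateFE = log(errFE(1)/errFE(3))/log(dx(1)/dx(3));
rateCD = log(errCD(1)/errCD(3))/log(dx(1)/dx(3));
% loglog(dx,errFE,'b-o',dx,errBE,'r-s',dx,errCD,'g-d') % plot err versus dx
```

Estimating the rates from the first few steps avoids the small-step range in which the centered difference error stops decreasing.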
Shifting x, the point where the derivative is evaluated, by changing the third line to read
x=1e4;
gives the results in Figure 7.11. The performance of the numerical differentiation algorithms has now
very much deteriorated, and a decrease in the step size does not necessarily lead to an improvement
in the result, either for the two Euler approximations of the derivative or for the centered difference
approximation.
⁴ See: aetna/RoundoffTruncation/compare_conv_driver.m
7.6 Computer arithmetic 131
[Figure: log-log plot of the error err versus the step dx, with curves FE, BE, and CD.]
Fig. 7.10. Forward and backward Euler and centered difference approximation of the derivative. Error
versus the step size.
The explanation for the behavior described in the Illustration above rests in what is displayed
in the graphs: the graphs present the total error incurred by the numerical algorithm, and this error
is the result of the interplay between the truncation error and the effect of the so-called machine-
representation error. The term “round-off error” is commonly used for this type of error. However,
round-off is only a special case of the broader class of machine-representation errors. Another term
which would be equivalent is “computer-arithmetic”, or just “arithmetic” error. We will sometimes
use interchangeably machine-representation and arithmetic error.
[Figure: log-log plot of the error err versus the step dx, with curves FE, BE, and CD.]
Fig. 7.11. Forward and backward Euler and centered difference approximation of the derivative. As
Figure 7.10, but the point of evaluation is shifted to a much bigger number, x=1e4.
The computer architectures in current use are based on binary storage: the smallest piece of data
is a bit, which assumes values 0 or 1. A collection of bits can store a binary number. In particular,
computers nowadays use a chunk of eight bits called a byte. The position of the bit in the byte indicates
the power of two, similarly to what we’re used to with decimal numbers. For instance, the decimal
number 13 = 1×10^1 + 3×10^0 can be written in the binary system as 13 = 1×2^3 + 1×2^2 + 0×2^1 + 1×2^0.
Hence its binary representation is 1101. We can use the MATLAB function dec2bin:
>> dec2bin(13)
ans =
1101
The largest number we can store in a byte (more precisely in an unsigned byte) is 255, viz
>> dec2bin(255)
ans =
11111111
since in that case all the bits are toggled to 1. If we wish to represent signed numbers, we must
reserve one bit for the storage of the sign (positive or negative). Then we have only seven bits for
the storage of the actual pattern of 0s and 1s. The largest number that seems to be available then is
>> bin2dec('1111111')
ans =
127
However, by some clever manipulation it is possible to squeeze out one more number out of the eight
bits, and so we get as the algebraically smallest and largest integers using the MATLAB functions
intmin and intmax
>> intmin('int8')
ans =
-128
>> intmax('int8')
ans =
127
The clever trick is called the “2’s complement” representation, and the bits represent numbers as
shown here
00000000=0
00000001=1
00000010=2
00000011=3
...
01111111=127
11111111=-1
11111110=-2
11111101=-3
...
10000000 =-128
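The table can be checked directly with the MATLAB function typecast, which reinterprets the stored bits as another type without changing them. This is a small sketch; the variable names bits, vals, and pattern are introduced here for illustration.

```matlab
% Reinterpret the same eight bits as unsigned and as signed (2's complement).
bits = uint8([0 1 127 128 253 255]);    % bit patterns viewed as unsigned bytes
vals = typecast(bits,'int8');           % the same bits read as signed int8
% The bit pattern that stores -3, shown as a string of eight binary digits
pattern = dec2bin(double(typecast(int8(-3),'uint8')),8);
```

Patterns with the leading bit set (128 and above when read as unsigned) are the ones that come out negative when read as int8.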
The argument 'int8' denotes the so-called integer type, and there are four signed and four unsigned
varieties in MATLAB (with 8, 16, 32, and 64 bits). As an example, here are the smallest and largest
unsigned 64-bit integers
>> intmin('uint64')
ans =
0
>> intmax('uint64')
ans =
18446744073709551615
Integers are nice to work with, and they are very useful for instance as counters in loops. If we’re
not careful, bad things can happen though. Take the following code fragment: First we create the
variable a as an 8-bit integer zero with int8
>> a= int8(0);
and then we increment it 1000 times by one. The result is a bit unexpected, perhaps:
for i=1: 1000
a=a+1;
end
a
a =
127
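What happened? Overflow! When the variable reached the largest value that can be stored in a
variable of this type, it stopped increasing: the variable overflowed. (MATLAB integer arithmetic
saturates at the largest representable value, intmax('int8') = 127, rather than wrapping around.)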
The floating-point numbers are represented with values for the so-called mantissa M and exponent
E, stored in bits essentially as described above, as
M*2^E
The basic datatype in MATLAB is a floating-point number stored in 64 bits, the so-called double. The
machine representation for this number is standardized, as described in the ANSI/IEEE Standard
754-1985, Standard for Binary Floating Point Arithmetic. The exponent and the mantissa are stored
as patterns of bits, which may be represented as numbered from 0 to 63, left to right. The first bit is
the sign bit, ’S’, the next eleven bits are the exponent bits, ’E’, and the final 52 bits are the mantissa
bits ’M’:
S EEEEEEEEEEE MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
0 1 11 12 63
The value V represented by the 64-bit word may be determined by an algorithm (how else?):
1. If E=2047
a) If M is nonzero, then V=NaN (“Not a Number”)
b) Else (M==0)
i. If S is 1, then V=-Inf
ii. Else V=Inf
2. Else if 0<E<2047 then we get the so-called normalized values
V=(-1)^S * 2^(E-1023) * (1.M)
where 1.M is intended to represent the binary number created by prefixing M with an implicit
leading 1 and a binary point.
3. Else (E==0)
a) If M is nonzero, then we get the so-called unnormalized values
V=(-1)^S * 2^(-1022) * (0.M)
b) Else (M==0)
i. If S is 1, then V=-0
ii. Else V=0
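For a normalized value the algorithm can be traced on an actual number. The following sketch pulls the sign, exponent, and mantissa fields out of a double using typecast and the bit functions. The sample value x is arbitrary, and the variable names are introduced here for illustration.

```matlab
% Decode the three bit fields of a normalized double and rebuild its value.
x = -6.5;                                            % arbitrary sample value
u = typecast(x,'uint64');                            % the 64 bits as an unsigned integer
S = double(bitshift(u,-63));                         % sign bit
E = double(bitand(bitshift(u,-52), uint64(2047)));   % eleven exponent bits
M = double(bitand(u, uint64(2^52-1)));               % 52 mantissa bits as an integer
V = (-1)^S * 2^(E-1023) * (1 + M/2^52);              % case 2: 1.M = 1 + M/2^52
```

Here -6.5 = -1.625 × 2^2, so the stored exponent is E = 2 + 1023.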
The cleverness of this representation should be appreciated. It allows us not only to store a zero
(twice!), and the regular (normalized) numbers, but also numbers which are extremely small (the
unnormalized values). In addition, we can also store negative and positive infinity (Inf, and -Inf),
which may result if we divide by zero (most likely by accident)
>> 1/0
ans =
Inf
and finally we can also store something that isn’t a number at all (for instance because the result of
an operation isn’t defined at all):
>> 0/0
ans =
NaN
The following two functions can be used to obtain the smallest (normalized) and largest floating-
point double value
>> realmin('double')
ans =
2.225073858507201e-308
>> realmax('double')
ans =
1.797693134862316e+308
Here we have one unnormalized value
>> realmin('double')/1e6
ans =
2.225073858324152e-314
The machine representation of the double floating-point values described above tells us something
important about which values can be represented in the computer, and they are not all the numbers
we can think of! To get from one particular number to the one right next to it we change one of the
bits in the mantissa. The least significant one is the 52nd. So if we take a normalized number
V=(-1)^S * 2^(E-1023) * (1.M)
changing the bit in the 52nd position behind the binary point amounts to adding to or subtracting
from the value V the tiny value (-1)^S * 2^(E-1023)*2^(-52). For instance
for E=1023 and S=0 the tiny difference is
>> 2^-52
ans =
2.220446049250313e-016
For S=0, E=1023, and all the bits of the mantissa M set to zero the value is V=1.0. Then the next closest
number to 1.0 that can be represented in the computer is 1.0 + 2^(-52). Between these two values is a
gap where no numbers live. It is a tiny gap, so it may not bother us too much, but consider what
happens as the exponent 2^(E-1023) gets bigger. The MATLAB function eps computes the size of
the gap next to any particular double value. For instance, for a value representative of the Young’s
modulus in some units the gap will get bigger
>> eps(3e11)
ans =
6.103515625000000e-005
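The practical meaning of the gap is that an increment smaller than half the gap is simply lost. A short sketch (the variable names are introduced here):

```matlab
% Increments below half the gap vanish; an increment of a full gap does not.
gap1 = eps(1);                        % gap next to 1.0
same = (1 + gap1/2 == 1);             % half a gap is rounded away
next = (1 + gap1 > 1);                % a full gap reaches the next number
gapE = eps(3e11);                     % gap next to 3e11
lost = (3e11 + gapE/4 == 3e11);       % a quarter of the gap is rounded away
```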
For the distance across the Milky Way, the gap already amounts to (in meters)
end
a
a =
0.999999999999906
to be compared with
>> a=0.0001*10000
a =
1
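A loop of the following kind shows the accumulation of representation errors. It is a sketch written to be consistent with the output quoted above; the exact loop used in the text is assumed rather than known.

```matlab
% Repeated addition accumulates the representation error of 0.0001,
% while a single multiplication rounds only once.
a = 0;
for i = 1:10000
    a = a + 0.0001;     % 0.0001 is not exactly representable in binary
end
b = 0.0001*10000;
```

Each addition rounds the running sum to the nearest representable number, and the small errors do not cancel.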
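Similarly to the integer data types, floating-point values can overflow.
>> realmax('double')+1==realmax('double')
ans =
1
Mathematically, this test should not have evaluated to “true” (numerical value 1). It did because the
gap between representable numbers next to realmax is enormous (eps(realmax('double')) is about
2e292), so the added 1 is lost to rounding and the left-hand side is rounded back to realmax. When a
result exceeds realmax by more than rounding can absorb, it overflows to infinity: for instance,
2*realmax('double') evaluates to Inf. Floating-point values can also underflow: the value becomes so
small that it gets converted to the exact zero in the computer.
>> realmin('double')/1e16
ans =
0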
There is a second floating-point type available in MATLAB, the single. Since there are only 32
bits available for its storage, the budgets for the exponent (eight bits) and the mantissa (23 bits)
are correspondingly reduced with respect to the double. This simply compounds all the problems
we described above for the double. For instance, the machine epsilon for the single at 1.0 is
>> eps(single(1.0))
ans =
1.1920929e-007
so almost 9 orders of magnitude larger than the one for the double. This will make the precision
considerably less for all sorts of operations. The only reason one may consider using a single is
that it saves half the storage space compared to a double. Therefore (pieces of) some commercial
software packages use single-precision storage, which could make them considerably less robust for
certain inputs than we would hope for. We as users need to be aware of such pitfalls.
7.6.3 Summary
We will inspect the centered-difference formula for the approximation of derivatives (7.6). As the
step size in the denominator decreases, the numerator contains the difference of two numbers which
are closer and closer in magnitude. We can estimate that the arithmetic error of the subtraction
f (x0 + h) − f (x0 − h) is going to be on the order of machine epsilon. Therefore, the error of the
derivative can be estimated as
\[ E_R = \frac{\varepsilon(f(x_0))}{2h} . \]
Here ε(f (x0 )) is the machine epsilon at the real number f (x0 ). We can see that the arithmetic error
ER increases in inverse proportion to the step size h. Also, we see that the error increases with the
magnitude of the numbers to be subtracted ≈ f (x0 ), since the machine epsilon depends on it.
Now let us go back to Figure 7.10. The total error displayed in Figure 7.10 is the sum of the
truncation error and the arithmetic error. The descending branches of the errors are dominated by
the truncation error, either O(h) (slope +1) or O(h^2) (slope +2). In the climbing branch of the
total error the arithmetic error dominates. The dependence of the arithmetic error on 1/h = h^(-1)
can be clearly detected in the slope −1 of the climbing branch of the total error of the derivative in
Figure 7.11.
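The two estimates can be put side by side for the function of Illustration 8. The sketch below evaluates the arithmetic error estimate and the leading centered-difference truncation term, and then the step at which the two are equal. Balancing the two estimates in this way is a common rule of thumb added here for illustration, not a result stated in the text. The names ER, ET, d3f, and hbal are introduced for this sketch.

```matlab
% Arithmetic versus truncation error estimate for the centered difference.
funhand = @(x) 2*x^2 - 1/3*x^3;
x0 = 1e4; h = 1e-6;
epsf = eps(funhand(x0));              % machine epsilon at f(x0)
ER = epsf/(2*h);                      % arithmetic error estimate
d3f = -2;                             % third derivative of funhand (a constant)
ET = abs(d3f)*h^2/6;                  % leading truncation error, magnitude
% Step length at which ET equals ER: abs(d3f)*h^2/6 = epsf/(2*h)
hbal = (3*epsf/abs(d3f))^(1/3);
```

For this small step the arithmetic estimate ER exceeds the truncation estimate ET by many orders of magnitude, which corresponds to the climbing branch of the total error.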
Note well that while talking about the total error we disregard the avoidable errors of the nature
of bugs and mistakes. Sadly, these errors are sometimes present, but their unpredictable nature
makes them very difficult to discuss in general.
Illustration 9
The report “DCS Upgrades for Nuclear Power Plants: Saving Money and Reducing Risk through
Virtual-Stimulation Control System Checkout” by G. McKim, M. Yeager and C. Weirich from 2011,
states on page 5 when discussing a software simulator of a nuclear power plant subsystem: “Here
was the first surprise. The emulated Bailey response in Figure 5 didn’t show this rate limiting. The
controller output traveled as fast as 12% per second. This led to a line-by-line examination of the
FORTRAN source code for the Bailey emulation, whereupon it was discovered that, contrary to
belief, the rate limiting was not included in the simulation.”
This is an example of a software bug: the feature that was supposed to be programmed was
either never implemented or was implemented and later deleted.
Illustration 10
The Deepwater Horizon Accident Investigation Report from September 8, 2010 states on page 64
“The 13.97 ppg interval at 18,200 ft. was included in the OptiCem model report as the reservoir
zone. The investigation team was unable to clarify why this pressure (13.97 ppg) was used in the
model since available log data measured the main reservoir pressure to be 12.6 ppg at the equivalent
depth. Use of the higher pressure would tend to increase the predicted gas flow potential. The same
OptiCem report refers to a 14.01 ppg zone at 17,700 ft. (which, in fact, should be 14.1 ppg: the
actual pressure measured using the GeoTap logging-while-drilling tool). ” (Emphasis is mine.)
These two instances are illustrations of an input error (mistake of the operator). Undoubtedly
important, but they are outside of the scope of error control that numerical methods can exercise
and therefore will not be discussed in this book.
Summary
1. We discuss a couple of representative methods for the solution of a scalar nonlinear equation.
Main idea: Newton’s method and the bisection method are complementary with respect to the rate
of convergence and robustness.
2. Newton’s method (in one of its several variants) is a crucial building block in nonlinear analysis
of structures, where systems of coupled nonlinear equations need to be solved repeatedly. Main
idea: efficient solvers for systems of coupled linear equations are critical to the success of
Newton’s method.
3. We discuss solution methods for systems of coupled linear equations that fall under the class of
factorizations, using the LU and QR decompositions as examples. Main idea: factorizations provide
critical infrastructure to a variety of numerical algorithms, especially Newton-like solvers of
nonlinear equations and eigenvalue problem solvers.
4. Errors produced by factorization algorithms depend on the so-called condition number. Main
idea: condition numbers are related to eigenvalues.
In general, for an arbitrary right-hand side function f this will require the solution of a nonlinear
algebraic equation to obtain yj+1 . For convenience we will define the function of the unknown yj+1
\[ F(y^{(*)}) = 0 \]
Expanding F into a Taylor series at an initial guess $y^{(0)}$ gives
\[ F(y^{(*)}) = F(y^{(0)}) + \frac{dF(y^{(0)})}{dy_{j+1}} (y^{(*)} - y^{(0)}) + R_1 = 0 . \]
The term
\[ \frac{dF(y^{(0)})}{dy_{j+1}} \]
is referred to as the Jacobian. Provided the remainder R1 is negligible compared to the other terms,
we can write approximately
\[ F(y^{(0)}) + \frac{dF(y^{(0)})}{dy_{j+1}} (y^{(*)} - y^{(0)}) \approx 0 , \tag{8.2} \]
Thus we arrive at the Newton’s algorithm for finding the solution of a nonlinear algebraic equation:
Guess the starting point of the iteration, y (0) , as close to the expected root of the equation y (∗) as
possible. Then repeat until the error (in some measure to be determined) drops below acceptable
tolerance.
\[ y^{(k)} = y^{(k-1)} - \left[ \frac{dF(y^{(k-1)})}{dy_{j+1}} \right]^{-1} F(y^{(k-1)}) \tag{8.3} \]
if error e(k) < tolerance, break; otherwise go on
k = k + 1 and repeat from the top
The error could be measured as the difference between the successive iterations,
\[ e^{(k)} = | y^{(k)} - y^{(k-1)} | , \]
or as the magnitude of the residual,
\[ e^{(k)} = | F(y^{(k)}) | . \]
Or, convergence can be decided by looking at some composite of the above errors, for instance the
iteration could be considered converged when either of these errors drops below a certain tolerance.
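The iteration (8.3) with a composite stopping test can be sketched as follows. The equation being solved, the starting guess, and the tolerance are hypothetical choices made only to have something concrete to run. This is not the aetna solver.

```matlab
% Newton iteration for a scalar equation F(y) = 0 (illustrative sketch).
F  = @(y) y - cos(y);        % hypothetical nonlinear function
dF = @(y) 1 + sin(y);        % its derivative (the Jacobian, here a scalar)
y = 1;                       % starting guess y(0)
tol = 1e-12;                 % assumed tolerance
for k = 1:50
    ynew = y - F(y)/dF(y);   % formula (8.3)
    e = abs(ynew - y);       % difference between successive iterations
    y = ynew;
    if e < tol || abs(F(y)) < tol   % either error measure may stop the iteration
        break;
    end
end
```

The cap of 50 iterations is a safeguard so that a diverging iteration cannot run forever.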
Illustration 1
How do we apply the Newton’s algorithm to solve the nonlinear equation that defines a single step
for the backward Euler algorithm?
To advance the solution we have to solve F (yj+1 ) = yj+1 − yj − (tj+1 − tj )f (tj+1 , yj+1 ) = 0. The
only difficulty may be the derivative of the function f , which we need to compute.
This turns out to be really easy for the simple function f of a linear ODE with a constant coefficient,
f (t, y) = λy, for which
\[ \frac{dF(y^{(k-1)})}{dy_{j+1}} = 1 - (t_{j+1} - t_j) \lambda . \]
For this special right-hand side function it works out precisely as we would expect from the definition
of the backward Euler method. For general right-hand side functions f the solution will require
several iterations of the algorithm, until some tolerance is reached as discussed above.
Concerning the implementation of the backward Euler time integration in MATLAB: Either we
have to provide not only a function for the right-hand side f but also its derivative ∂f (., y)/∂y, or the
software must make do without the derivative. Fortunately, we realize that numerical differentiation
could be used, and we have developed some approaches in the previous Chapter.
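A single backward Euler step that makes do without an analytical derivative might look as follows. The right-hand side, step length, difference step h, and tolerance are hypothetical, and the derivative of F is approximated with the centered difference (7.6).

```matlab
% One backward Euler step solved by Newton's method with a numerical derivative.
f = @(t,y) -y^3;                        % hypothetical nonlinear right-hand side
tj = 0; yj = 1; dt = 0.1; tj1 = tj + dt;
F = @(y) y - yj - dt*f(tj1,y);          % function whose root is y_{j+1}
h = 1e-6;                               % assumed step of the centered difference
dFdy = @(y) (F(y+h) - F(y-h))/(2*h);    % numerical approximation of dF/dy
y = yj;                                 % start the iteration from the previous value
tol = 1e-12;
for k = 1:20
    dy = -F(y)/dFdy(y);                 % Newton increment
    y = y + dy;
    if abs(dy) < tol || abs(F(y)) < tol
        break;
    end
end
yj1 = y;                                % accepted value at t_{j+1}
```

Starting from the previous solution value yj is a natural initial guess when the time step is short.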
Write the Taylor series for the scalar function F (x), but this time keep the remainder
\[ F(x^{(*)}) = F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(*)} - x^{(k)}) + R_1 . \]
The remainder is written as
\[ R_1 = \frac{d^2 F(\xi)}{dx^2} \, \frac{(x^{(*)} - x^{(k)})^2}{2!} , \qquad \xi \in \left\langle x^{(k)}, x^{(*)} \right\rangle . \]
Since $x^{(*)}$ is the root, we know that
\[ F(x^{(*)}) = 0 \]
and these may be substituted both in Equation (8.4) and in the expression for the remainder.
Thus (8.4) may be written in terms of the errors as
\[ F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (E_k - E_{k+1}) = 0 . \tag{8.7} \]
Now (8.7) may be subtracted from (8.6) to yield
We say that the Newton’s method attains a quadratic convergence rate, because the error in the
current iteration is proportional to the square of the error in the previous iteration (and this is good,
assuming the error is going to be small, and the square of a small number is even smaller).
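The quadratic relationship between successive errors can be retraced in a few lines. This is a sketch that assumes the error is defined as $E_k = x^{(*)} - x^{(k)}$, which is consistent with (8.7); the sign convention of the original formulas is assumed.

```latex
% Taylor series with remainder, written at the root (F(x^{(*)}) = 0)
0 = F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} E_k + \frac{1}{2} \frac{d^2 F(\xi)}{dx^2} E_k^2

% Newton step in terms of the errors, equation (8.7)
0 = F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (E_k - E_{k+1})

% Subtracting the second equation from the first
0 = \frac{dF(x^{(k)})}{dx} E_{k+1} + \frac{1}{2} \frac{d^2 F(\xi)}{dx^2} E_k^2
\quad \Longrightarrow \quad
E_{k+1} = - \left[ 2 \, \frac{dF(x^{(k)})}{dx} \right]^{-1} \frac{d^2 F(\xi)}{dx^2} \, E_k^2
```

The derivation requires the first derivative to be nonzero at $x^{(k)}$.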
Illustration 2
We shall solve the equation f (x) = −0.5 + (x − 1)^3 = 0 with Newton’s method.¹ The solver used
is the aetna implementation of Newton’s method, newt². The approximate errors in the seven
iterations required for convergence to machine precision are
Iteration Approximate Error
1 0.806299474015900
2 0.338070307349234
3 0.090929677656663
4 0.009026273904915
5 0.000101115651663
6 0.000000012879718
7 0.000000000000000
A good rule of thumb is that the number of zeros behind the decimal point of the error doubles with
each iteration. That is excellent convergence indeed.
Figure 8.1 illustrates the formula (8.8). We plot the approximate errors Ek+1 versus Ek as
plot(e(1:end-2),e(2:end-1),'ro-')% e = approximate errors
Clearly the data resembles a parabolic arc, exactly as predicted by the formula. Re-plotted on a
log-log scale (Figure 8.2) as
loglog(e(1:end-2),e(2:end-1),'ro-')
confirms the relationship between Ek+1 and Ek . It is quadratic, since the slope on the log-log plot
is very close to 2.
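The slope can also be estimated numerically. The sketch below repeats the iteration with a hand-written Newton loop (not the aetna function newt) and measures the error against the root, which for this equation is known in closed form. The names E, p, and ratio are introduced here.

```matlab
% Newton's method for f(x) = -0.5 + (x-1)^3, with the true error recorded.
f  = @(x) -0.5 + (x-1)^3;
df = @(x) 3*(x-1)^2;
xstar = 1 + 0.5^(1/3);            % root, used only to measure the error
x = 2.6; nit = 6;                 % starting guess x0 and number of iterations
E = zeros(1,nit+1); E(1) = abs(x - xstar);
for k = 1:nit
    x = x - f(x)/df(x);           % Newton update
    E(k+1) = abs(x - xstar);      % error after iteration k
end
p = log(E(6)/E(5))/log(E(5)/E(4));    % slope on the log-log plot
ratio = E(6)/E(5)^2;                  % factor multiplying the squared error
```

The slope is estimated from errors that are already small but still well above machine precision, where the quadratic behavior is cleanest.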
¹ See: aetna/NonlinearEquations/testnewt_conv_rate.m
² See: aetna/NonlinearEquations/newt.m
8.1 Single-variable nonlinear algebraic equation 143
[Figure: error Ek+1 versus Ek for @(x)(−0.5+(x−1)^3), x0=2.6; linear axes (left), log-log axes (right).]
[Figure: two panels showing the graph of F(x) with the Newton iterates x0, x1, x2, x3.]
Fig. 8.2. Failure of Newton’s method due to divergence (left), and successful convergence upon the selection
of the initial guess closer to the root (right).
Newton’s method can converge very satisfactorily, but the bad news is it can also spectacularly fail
to deliver the goods. Consider for instance Figure 8.2. On the left: Choosing as the initial guess x0
leads to a succession of xj which drift away from the root rather than converging to it. On the right:
The function graph is scaled in the horizontal direction for clarity. Therefore the initial guess x0
that is shown there is in fact chosen much closer to the desired root than in the figure on the left.
Consequently, Newton’s method generates a succession of root estimates which converge. This is
quite typical: as good an initial guess of the location of the root as possible is critical to the success
of the method.
Figure 8.3 shows a situation in which one may be looking for a root where there is none. Use
your imagination to reduce the gap between the horizontal axis and the hump of the function so
that the two almost merge visually. (In the figure we keep the gap large for clarity.) Then starting
the iteration in the vicinity of the presumed root will not lead to convergence. In fact, since the
function graph has a zero slope at some point at the top of the hump, there is a potential for
Newton’s method to blow up (remember, we need to divide by the value of the derivative).
Figure 8.4 illustrates another difficulty. For rapidly oscillating functions with many roots it is
quite possible for Newton’s method to jump from root to root, and to eventually locate a root,
but not the one we were looking for originally. If the Newton solver is used in an automatic fashion,
we might not even be aware of the switch.
144 8 Solution of systems of equations
[Figure: graph of F(x) with the Newton iterates x0, x1, x2, x3, x4, x5.]
Fig. 8.3. Failure of Newton’s method: first it gets stuck next to a false root (maximum), then the iterations
blast off to infinity.
[Figure: graph of F(x) with the Newton iterates x0, x1, x2, x3, x4.]
Fig. 8.4. Failure of Newton’s method: if the initial guess of the root is not sufficiently close, it does not
find the root that was intended.
The bisection method is a complement to the Newton’s method. (a) While the Newton’s method
converges quickly, bisection is slow to converge. (b) While the Newton’s method may fail to find a
root, bisection is guaranteed to converge to a root. (c) Newton’s method needs to know both the
function and its derivative, while bisection can work with just the function. (d) While for bisection
we need the so-called bracket (pair of locations at which a given function gives values of opposite
signs), this is not needed for Newton’s method.
Perhaps the best way to describe the bisection method is by an algorithm:³
function [xl,xu] = bisect(funhandle,xl,xu,tolx,tolf)
    if (xl >xu)
        temp =xl; xl = xu; xu =temp;
    end
    fl=feval(funhandle,xl);
    fu=feval(funhandle,xu);
    % ... a bit of error checking omitted for brevity
    while 1
        xr=(xu+xl)/2; % bisect interval
        fr=feval(funhandle,xr); % value at the midpoint
        if (fr*fl < 0), xu=xr; fu=fr;% upper --> midpoint
        elseif (fr == 0), xl=xr; xu=xr;% exactly at the root
        else, xl=xr; fl=fr;% lower --> midpoint
        end
        if (abs(xu-xl) < tolx) || (abs(fr) < tolf)
            break; % converged (the listing is cut off here; this closing is assumed)
        end
    end
end
³ See: aetna/NonlinearEquations/bisect.m
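The same idea can be tried out as a short script. The following sketch is written independently of the aetna function bisect: the bracket, the tolerance, and the iteration counter n are assumptions introduced here, and only the bracket width is used as the stopping test.

```matlab
% Bisection for f(x) = -0.5 + (x-1)^3 on a hypothetical bracket [1, 3].
funhandle = @(x) -0.5 + (x-1)^3;
xl = 1; xu = 3; tolx = 1e-6;
fl = funhandle(xl); fu = funhandle(xu);
if fl*fu > 0
    error('The function values at xl and xu must have opposite signs.');
end
n = 0;                               % number of bisections performed
while (xu - xl) > tolx
    xr = (xu + xl)/2;                % bisect the interval
    fr = funhandle(xr);              % value at the midpoint
    if fr*fl < 0
        xu = xr;                     % root lies in the lower half
    elseif fr == 0
        xl = xr; xu = xr;            % landed exactly on the root
    else
        xl = xr; fl = fr;            % root lies in the upper half
    end
    n = n + 1;
end
```

Each pass halves the bracket, so the number of iterations grows only with the logarithm of the ratio of the initial bracket width to the tolerance.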
[Figure: error Ek+1 versus Ek for @(x)(−0.5+(x−1)^3), xl=1.7934, xu=1.7953; linear axes (left), log-log axes (right).]
Figure 8.6 is a good comparison of the typical convergence properties of Newton’s method and the
bisection method.⁶ Evidently the bisection method requires many more iterations than Newton’s
method. When each evaluation of the function is expensive, the quicker converging method wins.
When the robustness of bisection is required (such as when Newton’s method would not converge), the
slower method is preferable. Wouldn’t it make sense to combine such disparate methods and switch
between them as needed? That is how the MATLAB fzero function works. (Find out from the
documentation which methods are combined in fzero.)
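For comparison, fzero can be applied to the same equation. The bracket [1 3] is a hypothetical choice; fzero accepts either a starting point or an interval on whose ends the function changes sign.

```matlab
% Root of f(x) = -0.5 + (x-1)^3 with the MATLAB function fzero.
f = @(x) -0.5 + (x-1)^3;
xroot = fzero(f,[1 3]);      % interval with a sign change of f
```

Supplying a bracket rather than a single starting point gives the solver the same guarantee that bisection relies on.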
[Figure: error Ek versus iteration k, logarithmic vertical axis; title: @(x)(−0.5+(x−1)^3), Bisection versus Newton.]
Fig. 8.6. Comparison of the convergence of the bisection method (dashed line), and the Newton’s method
(solid line).
⁴ See: aetna/NonlinearEquations/testbisect_conv_rate.m
⁵ See: aetna/NonlinearEquations/bisect.m
⁶ See: aetna/NonlinearEquations/bisection_versus_Newton.m
For an arbitrary right-hand side function f this will require the solution of a nonlinear vector
algebraic equation to obtain y j+1 . We will define the vector function of the vector unknown y
which is clearly the backward Euler algorithm for y = y j+1 . The solution y (∗) to the equation
\[ F(y^{(*)}) = 0 \]
may be sought by expanding F into a Taylor series at an initial guess $y^{(0)}$:
\[ F(y^{(*)}) = F(y^{(0)}) + \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) + R_1 = 0 . \]
Provided the remainder R1 is negligible compared to the other terms, we can write approximately
\[ F(y^{(0)}) + \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) \approx 0 , \tag{8.9} \]
which at first sight looks exactly like (8.2). There must be a difference here, however, as we are
dealing with a system of equations. What do we mean by
\[ \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) \; ? \tag{8.10} \]
The expression (8.9) holds for each component (row) of the vector (column matrix) separately. The
components of the vector function F and of the argument y may be written as
\[ [F(\cdot)]_r , \qquad [y]_c . \]
Each of the components $[F(\cdot)]_r$ is a function of all the components $[y]_c$. Therefore, equation (8.9)
in components must have the meaning
\[ [F(y^{(0)})]_r + \sum_{c=1:n} \frac{\partial [F(y^{(0)})]_r}{\partial [y]_c} \, [y^{(*)} - y^{(0)}]_c \approx 0 , \]
i.e. in words: the change in the component $[F(y^{(0)})]_r$ is due to the change of this component in the
direction of each of the components $[y]_c$ of the argument, which is expressed by the first term of
the Taylor series. Thus we see that the left-hand side of (8.9) is the sum of two vectors, $F(y^{(0)})$ and the
vector
\[ \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) , \]
which is the product of a square matrix and the vector $(y^{(*)} - y^{(0)})$. The matrix
\[ \frac{dF(y^{(0)})}{dy} \]
is the Jacobian matrix; its entry in row r and column c is $\partial [F(y^{(0)})]_r / \partial [y]_c$.
8.2 System of nonlinear algebraic equations 147
Thus we arrive at Newton’s algorithm for finding the solution of a nonlinear vector algebraic
equation: Initially guess $y^{(0)}$; then compute
\[ y^{(k)} = y^{(k-1)} - \left[ \frac{dF(y^{(k-1)})}{dy} \right]^{-1} F(y^{(k-1)}) \tag{8.11} \]
k = k + 1 and repeat previous line
until the error (in some measure to be determined) drops below acceptable tolerance. In general it
is a good idea not to invert a matrix if we can help it. Rewriting the Newton algorithm as
\[
\begin{aligned}
& J(y^{(k-1)}) = \frac{dF(y^{(k-1)})}{dy} && \text{\% Compute the Jacobian matrix} \\
& J(y^{(k-1)}) \, \Delta y = -F(y^{(k-1)}) && \text{\% Compute the increment } \Delta y \\
& y^{(k)} = y^{(k-1)} + \Delta y && \text{\% Update the solution} \\
& \text{if error } e^{(k)} < \text{tolerance, break; otherwise go on} \\
& k = k + 1 \text{ and go to the top}
\end{aligned}
\tag{8.12}
\]
we see that the Newton algorithm will require repeated solutions of a system of linear algebraic
equations, since the second line of the above algorithm means solve for ∆y.
Clearly this could mean major computational effort, depending on how many equations there
are (how big the matrix J (y (k−1) ) is), whether the Jacobian is symmetric, how many zeros and in
what pattern there might be in the Jacobian (in other words, is it dense, and if it isn’t what is the
pattern of the sparse matrix), and so on. We will take up the subject of the solution of systems of
linear equations later in this chapter.
The error $e^{(k)}$ of the solution in iteration $k$ could be measured as the difference between successive
iterations, as for the scalar equation (8.3), and it should be expressed in terms of vector norms
\[ e^{(k)} = \| F(y^{(k)}) \| . \]
Illustration 3
The two expressions f (x, y) and g(x, y) may be interpreted as surfaces raised above the x, y plane.
Setting these to zero is equivalent to forcing the points that satisfy these equations, individually, to
lie on the level curves of the surfaces. The solution of the two equations being satisfied simultaneously
corresponds to the intersection of the level curves. The figures of the surfaces were produced by the
script two_surfaces [7].
The solution will be attempted with the Newton method. The vector argument is
\[ y = \begin{bmatrix} x \\ y \end{bmatrix} \]
and the vector function is
\[ F(y) = \begin{bmatrix} f(x, y) \\ g(x, y) \end{bmatrix} . \]
Therefore the necessary Jacobian matrix is
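\[ J_{11} = \frac{\partial f(x,y)}{\partial x} = x , \quad J_{12} = \frac{\partial f(x,y)}{\partial y} = 3y , \quad J_{21} = \frac{\partial g(x,y)}{\partial x} = y , \quad J_{22} = \frac{\partial g(x,y)}{\partial y} = x . \]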
The Matlab code defines both the vector function and the Jacobian matrix as anonymous functions.
F=@(x,y) [((x.^2 + 3*y.^2)/2 -2); (x.*y +3/4)];
J=@(x,y) [x, 3*y; y, x];
With these functions at hand it is easy to carry out the iteration interactively, step-by-step. For
instance, guessing
w0= [-0.5;0.5];
we update the solution as
>> w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))
w =
-0.5000
1.5000
For the next iteration, we reset the variable w0
>> w0=w;
and repeat the solution
>> w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))
w =
-0.6154
1.1538
We can watch the differences between the successive iterations getting smaller. With four iterations
we get five decimal digits converged.
w =
-0.6923
1.0833
This point will be one of the four possible solutions (level-curve intersections). To get a different
solution we need to start with a different guess w0, for instance w0= [-2;0.5];.
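The interactive steps above can also be wrapped in a loop. The following is a minimal sketch, not the aetna implementation: the tolerance tol, the iteration cap maxit, and the choice of stopping on the size of the increment are assumptions made for illustration.

```matlab
% Newton iteration for the two-equation example, written as a loop.
% tol and maxit are assumed values, chosen only for this sketch.
F = @(w) [(w(1)^2 + 3*w(2)^2)/2 - 2;  w(1)*w(2) + 3/4];  % vector function
J = @(w) [w(1), 3*w(2);  w(2), w(1)];                    % analytical Jacobian
w = [-0.5; 0.5];      % initial guess used in the text
tol = 1e-10;          % assumed tolerance on the increment
maxit = 20;           % assumed cap on the number of iterations
converged = false;
for k = 1:maxit
    dw = -J(w)\F(w);  % solve J*dw = -F (no explicit matrix inverse)
    w = w + dw;       % update the solution
    if norm(dw, inf) < tol   % error measured by the size of the increment
        converged = true;
        break;
    end
end
disp(w')              % approximately -0.6923  1.0833
```

Starting the same loop from w = [-2; 0.5] should lead to a different intersection of the level curves.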
To compute the Jacobian analytically is often not possible or feasible. The elements of the Jacobian
matrix can then be computed by numerical differentiation. MATLAB includes a sophisticated routine for
forming Jacobians numerically, numjac. Here we discuss just the basic idea.
Consider the vector function $F(z)$, whose derivative should be evaluated at $\bar{z}$. Each element of
the matrix
\[ \frac{\partial [F(z)]_r}{\partial [z]_c} \]
is a partial derivative of the component $r$ of the vector $F$ with respect to the component $c$ of the
argument. The index $c$ is the column index. Therefore, just one evaluation of the vector function $F$
per column is necessary for forward or backward difference evaluation of the numerical derivative.
First, evaluate $\bar{F} = F(\bar{z})$. Then, for each column $c$ of the Jacobian matrix evaluate
\[ {}^{c}F = F(\bar{z} + h_c e_c) , \]
where
\[ [e_c]_m = 1 \ \text{for } c = m , \qquad [e_c]_m = 0 \ \text{otherwise,} \]
and $h_c$ is a suitably small number (not too small: let us not forget the effect of computer arithmetic).
The Jacobian matrix is approximated by the computed column vectors as
\[ \frac{\partial F(\bar{z})}{\partial z} \approx \left[ ({}^{1}F - \bar{F})/h_1 ,\ ({}^{2}F - \bar{F})/h_2 ,\ \ldots ,\ ({}^{n}F - \bar{F})/h_n \right] . \qquad (8.13) \]
In these columns we recognize numerical approximations of derivatives of the vector function $F$
(divided differences).
One can recognize in Newton's method with the numerical approximation of the Jacobian a
variation of the so-called secant method.
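The column-by-column recipe (8.13) can be sketched in a few lines. This is only an illustration under stated assumptions: the test function is the one from Illustration 3, the step size h = 1e-6 is an arbitrary choice, and the names Jac_fd and Jac_exact are hypothetical.

```matlab
% Forward-difference approximation of the Jacobian, one column per
% extra function evaluation. Step size and variable names are assumed.
F = @(w) [(w(1)^2 + 3*w(2)^2)/2 - 2;  w(1)*w(2) + 3/4];
zbar = [-0.5; 0.5];              % point at which the Jacobian is wanted
n = numel(zbar);
h = 1e-6*ones(n,1);              % assumed steps h_c, one per column
Fbar = F(zbar);                  % base evaluation, done only once
Jac_fd = zeros(numel(Fbar), n);
for c = 1:n
    ec = zeros(n,1);  ec(c) = 1; % unit vector e_c
    Jac_fd(:,c) = (F(zbar + h(c)*ec) - Fbar)/h(c);  % divided difference
end
Jac_exact = [zbar(1), 3*zbar(2);  zbar(2), zbar(1)]; % analytical Jacobian
disp(norm(Jac_fd - Jac_exact, inf))                  % small, of the order of h
```

Making h much smaller than this does not keep improving the result, because the subtraction in the divided difference loses digits in floating-point arithmetic.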
Illustration 4
In the script Jacobian_example [8] we compare the analytically determined Jacobian matrix with its
numerical approximation. The vector function is taken as
\[ F(z) = \begin{bmatrix} z_1^2 + 2 z_1 z_2 + z_2^2 \\ z_1 z_2 \end{bmatrix} . \]
[8] See: aetna/NonlinearEquations/Jacobian_example.m
Therefore the Jacobian matrix is evaluated at $z_1 = -0.23$, $z_2 = 0.6$ as
>> F=@(z)[z(1)^2+2*z(1)*z(2)+z(2)^2;...
z(1)*z(2)];
dFdz=@(z)[2*z(1)+2*z(2),2*z(1)+2*z(2);...
z(2), z(1)];
zbar = [-0.23;0.6];
>> Jac =dFdz(zbar)
Jac =
0.7400 0.7400
0.6000 -0.2300
Evaluating the function at the base point and using the step size of 0.1
>> Fbar =F(zbar);
h=1e-1;
we obtain the approximate (numerically differentiated) Jacobian matrix
>> Jac_approx =[(F(zbar+[1;0]*h)-Fbar)/h, (F(zbar+[0;1]*h)-Fbar)/h]
Jac_approx =
0.8400 0.8400
0.6000 -0.2300
We may note that the second row is in fact exact. (Why?) On the other hand the Jacobian matrix
will not be evaluated exactly in any component for the second example in Jacobian_example. Check
it out.
Consider a high-strength steel cable structure shown in Figure 8.7. The dashed line shows how the
cables are connected, but the geometry has no physical meaning. In reality, the cables are strung
between the joints so that the unstressed lengths of three sections of the main cable, connecting
joints p3 , p1 , p2 , and p4 , are given as 1.025× the distance between the joints. The two tiedowns
between joints p5 and p1 and between joints p5 and p2 have unstressed lengths which are less than
the distances between the points p5 and p1 and p5 and p2 : tiedown 4 has unstressed length 0.88×
distance between p5 and p1 , and tiedown 5 has unstressed length 0.81× distance between p5 and p2 .
Therefore after the structure is assembled, the structure must deform and it must experience tensile
stress (it becomes prestressed). The goal is to find the forces in the cables after the structure was
assembled. Since the problem is statically indeterminate, we will use the deformation method. The
requisite equations are going to be the equilibrium equations for the joints p1 , p2 , and the unknowns
are going to be the locations of these two joints. (Note that the locations of joints p3 , p4 , and p5 are
fixed, those joints are supported.)
For instance, for the joint 1 we write the equilibrium equations as
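\[
\begin{aligned}
-N_1 \frac{\Delta_{x,1}}{L_1} + N_2 \frac{\Delta_{x,2}}{L_2} - N_4 \frac{\Delta_{x,4}}{L_4} &= 0 \\
-N_1 \frac{\Delta_{y,1}}{L_1} + N_2 \frac{\Delta_{y,2}}{L_2} - N_4 \frac{\Delta_{y,4}}{L_4} &= 0 .
\end{aligned}
\]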
The geometrical relationships for cable 1 are based on these expressions
\[ \Delta_{x,1} = Y_1 - p_{x,3} , \quad \Delta_{y,1} = Y_2 - p_{y,3} , \quad L_1 = \sqrt{\Delta_{x,1}^2 + \Delta_{y,1}^2} , \]
where $Y_1$, $Y_2$ are the coordinates of joint 1 after deformation. Similarly for cable 2
\[ \Delta_{x,2} = Y_3 - Y_1 , \quad \Delta_{y,2} = Y_4 - Y_2 , \quad L_2 = \sqrt{\Delta_{x,2}^2 + \Delta_{y,2}^2} , \]
where $Y_3$, $Y_4$ are the coordinates after deformation of joint 2. Together $Y_1$, $Y_2$, $Y_3$, $Y_4$ constitute the
unknowns in the problem. Finally for the third cable running into joint 1 we have
\[ \Delta_{x,4} = Y_1 - p_{x,5} , \quad \Delta_{y,4} = Y_2 - p_{y,5} , \quad L_4 = \sqrt{\Delta_{x,4}^2 + \Delta_{y,4}^2} . \]
Fig. 8.7. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate
supported joints.
Therefore, we are going to construct the Jacobian matrix numerically using the numerical
differentiation technique from the preceding section.
We are going to present the computation as implemented in a MATLAB function [9]. First we
define the data of the problem.
function [y,sigma]=cable_config_myjac
% undeformed configuration, lengths in millimeters
p =[10,10; 25,25; 0,0; 40,40; 40,0]*1000;
[9] See: aetna/NonlinearEquations/cable_config_myjac.m
sigma =
1.0e+002 *
4.494507981897321
4.851283944479819
2.463132114467833
2.051731752058516
3.659994343900810
Figure 8.8 displays the results of the computation. Note that the stresses are distributed somewhat
non-uniformly. A cool improvement on our computation would be to optimize the unstressed lengths
of the cables so that the prestress was uniform across the structure.
Fig. 8.8. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate
supported joints. Thick solid line: actual configuration of the prestressed structure. Tensile stresses are
indicated.
As a final note, we shall point out that MATLAB comes with its own sophisticated function for
the numerical evaluation of the Jacobian matrix, numjac. The pieces of code that would need to
be changed with respect to our implementation [10] are the computation of the residual (the function
needs to accept additional arguments)
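function R=Force_residual(Ignore1,Y,varargin)
% The first argument and varargin are not used; they are present
% only so that the function can be called by numjac.
y(1,:) =Y(1:2)';% Location of joint 1
y(2,:) =Y(3:4)';% Location of joint 2
F =zeros(size(p,1),2);
for j=1:size(conn,1)
L=Length(j);
N(j)=E*A(j)*(L-Initial_L(j))/L;% Axial force in cable j
F(conn(j,1),:) =F(conn(j,1),:) +N(j)*Delta(j)/L;
F(conn(j,2),:) =F(conn(j,2),:) -N(j)*Delta(j)/L;
end
R =[F(1,:)';F(2,:)'];% Residual: forces on the two free joints
end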
and the evaluation of the numerical Jacobian in the Newton loop (there are a few additional
arguments to pass)
Y=[y(1,:)';y(2,:)'];% Initialize deformed configuration
for iteration = 1: maximum_iteration % Newton loop
R=Force_residual(0,Y);% Compute residual
[dRdy] = numjac(@Force_residual,0,...
Y,R,Y/1e3,[],0);% Compute Jacobian
dY=-dRdy\R;% Solve for correction
if norm(dY,inf)<AbsTol % Check convergence
y(1,:) =Y(1:2)';% Converged
y(2,:) =Y(3:4)';
R=Force_residual(0,Y);% update the forces
sigma =N./A';% Stress
return;
end
Y=Y+dY;% Update configuration
end
error('Not converged')% bummer :(
We can easily check that the two implementations of the computation give identical results.
In summary, Newton’s method, in its several variants and refinements, has a special place among
the mainstream methods for solving a system of nonlinear algebraic equations in engineering appli-
cations. One of the building blocks of this class of algorithms is a solver for repeatedly solving a
system of linear algebraic equations. This is the topic we will take up in the following sections.
8.3 LU factorization
Consider a system of linear algebraic equations
Ax = b
with a square matrix A. It is possible to factorize the matrix into the product of a lower triangular
matrix and an upper triangular matrix
A = LU
[10] See: aetna/NonlinearEquations/cable_config_numjac.m
The triangular matrices are not determined uniquely. Here we will consider the variant where the
lower triangular matrix L has ones on the diagonal.
What is the value of the LU factorization? It derives from the efficiency with which a system with
a triangular matrix can be solved. For instance, consider the system
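\[
Ly = \begin{bmatrix}
\bullet & & & & & \\
\bullet & \bullet & & & & \\
\bullet & \bullet & \bullet & & & \\
\bullet & \bullet & \bullet & \bullet & & \\
\bullet & \bullet & \bullet & \bullet & \bullet & \\
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet
\end{bmatrix} y = b ,
\]
where $L$ is lower triangular (non-zeros are indicated by the black dots, the zeros are not shown). In
the first row of $L$ there is only one nonzero, $L_{11}$. Therefore we can solve immediately for $y_1$. Next,
$y_1$ may be substituted into the second equation, from which we can solve for $y_2$, and so on. Since
we are solving for the unknowns in the order of their indexes, $1, 2, 3, \ldots, n$, we call this the forward
substitution. [11]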
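function c=fwsubs(L,b)
% Forward substitution: solve L*c=b for lower triangular L.
[n m] = size(L);
if n ~= m, error('Matrix must be square!'); end
c=zeros(n,1);
c(1)=b(1)/L(1,1);% First row holds a single nonzero
for i=2:n
c(i)=(b(i)-L(i,1:i-1)*c(1:i-1))/L(i,i);% Use the known c(1:i-1)
end
end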
Now consider the system
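\[
Ux = \begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & & \bullet & \bullet & \bullet & \bullet \\
 & & & \bullet & \bullet & \bullet \\
 & & & & \bullet & \bullet \\
 & & & & & \bullet
\end{bmatrix} x = c ,
\]
where $U$ is upper triangular. In the last row of $U$ there is only one nonzero, $U_{nn}$. Therefore we can
solve immediately for $x_n$. Next, $x_n$ may be substituted into the last but one equation, from which
we can solve for $x_{n-1}$, and so on. Since we are solving for the unknowns in the reverse order of their
indexes, $n, n-1, n-2, \ldots, 2, 1$, we call this the backward substitution. [12]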
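function x=bwsubs(U,c)
% Backward substitution: solve U*x=c for upper triangular U.
[n m] = size(U);
if n ~= m, error('Matrix must be square!'); end
x=zeros(n,1);
x(n)=c(n)/U(n,n);% Last row holds a single nonzero
for i=n-1:-1:1
x(i)=(c(i)-U(i,i+1:n)*x(i+1:n))/U(i,i);% Use the known x(i+1:n)
end
end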
[11] See: aetna/LUFactorization/fwsubs.m
[12] See: aetna/LUFactorization/bwsubs.m
And so we come to the punchline: provided we can factorize a general matrix $A$ into the triangular
factors, we can solve the system $Ax = b$ in two steps. Write
\[ Ax = LUx = L\underbrace{(Ux)}_{y} = Ly = b . \]
First solve for $y$ by forward substitution,
\[ Ly = b , \]
and then solve for $x$ by backward substitution,
\[ Ux = y . \]
Both solution steps can be done very efficiently since the matrices involved are triangular. This is
handy in many situations where the right-hand side $b$ will change several times while the matrix $A$
stays the same. For instance, here is how we compute the inverse of a general square matrix $A$:
write the definition of the inverse
\[ A A^{-1} = \mathbf{1} \]
column-by-column as
\[ A\, c_k(A^{-1}) = c_k(\mathbf{1}) , \quad k = 1, 2, \ldots, n . \]
Here by $c_k(A^{-1})$ we mean the $k$th column of $A^{-1}$, and by $c_k(\mathbf{1})$ we mean the $k$th column of the
identity matrix. So if we successively set the right-hand side vector to $b = c_k(\mathbf{1})$, $k = 1, 2, \ldots$ and
solve $Ax = b$, we obtain the columns of the inverse matrix as $c_k(A^{-1}) = x$.
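The column-by-column construction of the inverse might be sketched as follows. The assumptions are stated plainly: the small matrix is a hypothetical test case, the factors come from the built-in lu (which also returns a row permutation matrix P, so the right-hand side is permuted as P*b), and the backslash operator applied to a triangular matrix stands in for the forward and backward substitution routines.

```matlab
% Inverse of A computed column by column from a single factorization.
% The matrix is a made-up example; lu with three outputs gives L*U = P*A.
A = [4 1 2; 1 3 0; 2 0 5];
n = size(A,1);
[L, U, P] = lu(A);          % factorize once
One = eye(n);               % identity matrix, its columns are the right-hand sides
Ainv = zeros(n);
for k = 1:n
    b = One(:,k);           % k-th column of the identity
    y = L\(P*b);            % forward substitution on the permuted right-hand side
    x = U\y;                % backward substitution
    Ainv(:,k) = x;          % k-th column of the inverse
end
disp(norm(A*Ainv - One, inf))   % close to zero
```

The factorization is done once and reused for all n right-hand sides, which is the point of the two-step solve.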
8.3.2 Factorization
The crucial question is: how do we compute the factors? LU factorization can be easily explained
by reference to the well-known Gaussian elimination. We shall start with an example:
\[
A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
-0.3649 & 1.216 & -0.3435 & -0.5449 \\
0.0186 & -0.093 & 1.204 & -0.0012 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix} .
\]
First we will change the numbers below the diagonal in the first column to zeros. Gaussian elimination
does this by replacing a row in which a zero should be introduced, let us say row $j$, by a combination of
the row $j$ and the so-called pivot row. Thus zero will be introduced in the element 2, 1 by subtracting
$(-0.3649)/(0.796) \times$ row 1 from row 2 to obtain
\[
\begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0.0186 & -0.093 & 1.204 & -0.0012 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix} .
\]
The element 1, 1 (the number 0.796) is called a pivot. Evidently, the success of the proceedings is going
to rely on the pivot being different from zero (not only strictly different from zero, but “sufficiently
different”: it shouldn’t be too small compared to the other numbers in the same column). The
manipulation described above can be executed by the following code fragment
i=1;
A(2,i:end) =A(2,i:end)-A(2,i)/A(i,i)*A(i,i:end)
Importantly, the same can also be written as a result of a matrix-matrix multiplication by the
so-called elimination matrix
\[
E_{(2,1)} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0.4584 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} ,
\]
so that the product $E_{(2,1)} A$ is the matrix obtained above.
Next we will change 0.0186 to a zero. Again, we will do this with an elimination matrix, and note
well that we will be working with the above right-hand side matrix, not the original A. So we will
construct
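\[
E_{(3,1)} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
-0.02337 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]
and compute
\[
E_{(3,1)} E_{(2,1)} A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0 & -0.1104 & 1.201 & -0.003315 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix} .
\]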
And so on: the elimination of the non-zeros in the first column is accomplished by the sequence
\[
E_{(4,1)} E_{(3,1)} E_{(2,1)} A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0 & -0.1104 & 1.201 & -0.003315 \\
0 & -0.5073 & -0.03914 & 0.431
\end{bmatrix} .
\]
Now we start working on the second column. Note again that we are working with the matrix
$E_{(4,1)} E_{(3,1)} E_{(2,1)} A$, not the elements of the original matrix. Thus $0.07087 = -(-0.1104/1.558)$,
and the elimination matrix to put a zero in the element 3, 2 reads
\[
E_{(3,2)} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0.07087 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} .
\]
After the element 4, 2 is treated in the same way, we finally apply the elimination matrix to the
element 4, 3 and the entire Gaussian elimination sequence will read
\[
E_{(4,3)} E_{(4,2)} E_{(3,2)} E_{(4,1)} E_{(3,1)} E_{(2,1)} A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0 & 0 & 1.18 & -0.03899 \\
0 & 0 & 0 & 0.2627
\end{bmatrix} .
\]
[13] See: aetna/LUFactorization/elim_matrix.m
We recall that we wish to construct the factorization $A = LU$, which means that the above matrix
on the right is $U$ and consequently
\[ L^{-1} = E_{(4,3)} E_{(4,2)} E_{(3,2)} E_{(4,1)} E_{(3,1)} E_{(2,1)} . \]
So now we have the matrix $U$ and the inverse of $L$. Fortunately, $L$ is obtained very easily. Not by
inverting the above product, but rather by inverting each of the terms separately
\[ L = E^{-1}_{(2,1)} E^{-1}_{(3,1)} E^{-1}_{(4,1)} E^{-1}_{(3,2)} E^{-1}_{(4,2)} E^{-1}_{(4,3)} . \]
For instance, to invert E (2,1) we realize that the effect of the matrix multiplication in the product
E (2,1) A is to make the second row of the result the sum of a multiple of the first row and 1× the
second row. Therefore, to multiply with the inverse of E (2,1) is to undo this operation, to subtract
a multiple of the first row from the second row. The inverse of E (2,1) also has ones on the diagonal,
the only change is that the off-diagonal element changes its sign (we want subtraction instead of
addition)
\[ E^{-1}_{(2,1)} = 2\,\mathbf{1} - E_{(2,1)} . \]
The same reasoning applies to the other elimination matrices. Now we only have to figure out the
product of the inverses of the elimination matrices. Take for instance the product $E^{-1}_{(2,1)} E^{-1}_{(3,1)}$:
\[
E^{-1}_{(2,1)} E^{-1}_{(3,1)} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
-0.4584 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0.02337 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
-0.4584 & 1 & 0 & 0 \\
0.02337 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} .
\]
The pattern is clear: each matrix in the product will simply copy its only nonzero off-diagonal
element into the same location in the resulting matrix. Thus we have
\[
L = \begin{bmatrix}
1 & 0 & 0 & 0 \\
-0.4584 & 1 & 0 & 0 \\
0.02337 & -0.07087 & 1 & 0 \\
-0.2178 & -0.3256 & -0.1127 & 1
\end{bmatrix} .
\]
The entire elimination process for our given matrix can be expressed as a series of matrix multipli-
cations
E21 =elim_matrix(A,2,1)
E31 =elim_matrix(E21*A,3,1)
E41 =elim_matrix(E31*E21*A,4,1)
E32 =elim_matrix(E41*E31*E21*A,3,2)
E42 =elim_matrix(E32*E41*E31*E21*A,4,2)
E43 =elim_matrix(E42*E32*E41*E31*E21*A,4,3)
U = E43*E42*E32*E41*E31*E21*A
Inefficient, but correct. In reality the elimination is usually done in place. The upper triangle and
the diagonal of $A$ store the matrix $U$, the lower triangle (below the diagonal) of $A$ stores the matrix
$L$ (we do not store the diagonal, since we know that the diagonal of $L$ consists of ones). naivelu4
is one of the naive implementations of the LU factorization in aetna. [14]
[14] See: aetna/LUFactorization/naivelu4.m
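To make the in-place storage concrete, here is a minimal sketch of an elimination without pivoting, applied to the example matrix of this section. It is written for illustration only and is not the aetna routine naivelu4; the working array name LUmat is an assumption.

```matlab
% Naive in-place LU factorization (no pivoting) of the example matrix.
% The multipliers overwrite the zeros created below the diagonal.
A = [ 0.796,   0.7448,  0.1201,  0.0905; ...
     -0.3649,  1.216,  -0.3435, -0.5449; ...
      0.0186, -0.093,   1.204,  -0.0012; ...
     -0.1734, -0.6695, -0.0653,  0.4113];
n = size(A,1);
LUmat = A;                       % working copy, overwritten by the factors
for j = 1:n-1                    % pivot column
    for i = j+1:n                % rows below the pivot row
        LUmat(i,j) = LUmat(i,j)/LUmat(j,j);   % multiplier, stored where the zero would go
        LUmat(i,j+1:n) = LUmat(i,j+1:n) - LUmat(i,j)*LUmat(j,j+1:n); % update rest of row i
    end
end
L = tril(LUmat,-1) + eye(n);     % unit lower triangular factor
U = triu(LUmat);                 % upper triangular factor
disp(norm(L*U - A, inf))         % close to zero
```

Up to the rounding of the printed entries, the factor L obtained this way should agree with the matrix L displayed above.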
8.3.3 Pivoting
The implementation of the LU factorization presented above is naive: it blithely divides by the
numerical value in the diagonal element, the so-called pivot. Unless the user is reasonably sure that
all the numbers encountered in the pivot locations are sufficiently large during the factorization,
it is preferable to use an implementation that does either partial or full pivoting. The MATLAB
implementation of the LU factorization can perform pivoting. Normally only the so-called partial
pivoting is performed. Partial pivoting consists of selecting which row should be used as the pivot
row when working in column j, and all the rows j and below are considered. The row with the
largest number in absolute value in column j is chosen. Complete pivoting would also consider
the possibility of switching columns in order to get the best element in the pivot position, but that
involves extensive searching throughout the matrix and is therefore expensive (and hence rarely
done).
The MATLAB implementation of the LU factorization will return the information in three matrices.
Consider this example
\[
A = \begin{bmatrix}
0.4653 & 0.1766 & 0.8463 & 0.7917 \\
0.1805 & 0.9188 & 0.3244 & 0.6952 \\
0.7891 & 0.236 & 0.007259 & 0.4891 \\
0.09073 & 0.6998 & 0.9637 & 0.9205
\end{bmatrix} .
\]
The three matrices $L$, $U$, and $P$ returned by lu satisfy
\[ LU = PA . \]
The matrix $P$ permutes (switches) the rows of the matrix $A$. That is the actual pivoting. Note that
the permutation matrix has an interesting inverse: it is its own transpose (the permutation matrix
is orthogonal). Therefore we can write the above as
\[ P^T L U = A . \]
The matrix
\[
P^T L = \begin{bmatrix}
0.5897 & 0.04327 & 1 & 0 \\
0.2287 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0.115 & 0.7778 & 0.8597 & 1
\end{bmatrix}
\]
is the so-called psychologically lower triangular matrix. Such a matrix would be returned if we called
lu with only two output arguments
[L,U]=lu(A)
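The relationship between the two calling forms can be checked directly. The sketch below uses only the built-in lu with three and with two outputs; the variable names res3, res2, and same are introduced here just for the check.

```matlab
% Three-output and two-output forms of lu for the example matrix.
A = [0.4653,  0.1766, 0.8463,   0.7917; ...
     0.1805,  0.9188, 0.3244,   0.6952; ...
     0.7891,  0.236,  0.007259, 0.4891; ...
     0.09073, 0.6998, 0.9637,   0.9205];
[L, U, P] = lu(A);                 % L*U = P*A, L is truly lower triangular
res3 = norm(L*U - P*A, inf);       % close to zero
[L2, U2] = lu(A);                  % L2 is the "psychologically" lower triangular P'*L
res2 = norm(L2*U2 - A, inf);       % close to zero
same = norm(L2 - P'*L, inf);       % the two forms carry the same information
disp(L2)
```

The first pivot row is row 3 of A, because 0.7891 is the largest entry in absolute value in the first column.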
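How do we use the three output matrices? Without pivoting, the factorization $A = LU$ is used to solve
$Ax = b$ symbolically as (we do not actually use inverses, we use forward and backward
substitution!)
\[ y = L^{-1} b , \quad x = U^{-1} y \]
or
\[ x = U^{-1} L^{-1} b . \]
With pivoting we have $LU = PA$, so the right-hand side must be permuted in the same way as the rows of $A$:
\[ x = U^{-1} L^{-1} (P b) . \]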
How much does it cost to perform an LU factorization? We see that the procedure is essentially
that of Gaussian elimination, which processes the matrix in blocks. First the block A(2:n,1:n) is
modified, then the block A(3:n,2:n), A(4:n,3:n), all the way down to A(n:n,n-1:n). If we take
as a measure of required time the number of modified elements of the matrix, we have
\[ t_{LU} = C \left[ (n-1)\,n + (n-2)(n-1) + (n-3)(n-2) + \ldots + 2 \times 3 + 1 \times 2 \right] , \]
where $C$ is a time constant that measures how much time it takes to manipulate a single element of
the matrix. Multiplying through we see that the required time is the sum
\[
\begin{aligned}
& C \left[ n^2 - n + (n-1)^2 - (n-1) + (n-2)^2 - (n-2) + \ldots + 3^2 - 3 + 2^2 - 2 \right] \\
& \quad = C \left[ n^2 + (n-1)^2 + (n-2)^2 + \ldots + 3^2 + 2^2 \right] - C \left[ n + (n-1) + (n-2) + \ldots + 3 + 2 \right] .
\end{aligned}
\]
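The count of modified elements is easy to tally numerically. The sketch below adds up the block sizes listed above for an assumed size n = 200 and compares the total with the leading cubic term n^3/3; the variable names are chosen for this illustration only.

```matlab
% Count the matrix elements modified during elimination, block by block,
% and compare with the leading cubic term. n = 200 is an arbitrary choice.
n = 200;
count = 0;
for k = 1:n-1
    rows = n - k;          % block A(k+1:n, k:n) has n-k rows ...
    cols = n - k + 1;      % ... and n-k+1 columns
    count = count + rows*cols;
end
ratio = count/(n^3/3);     % approaches 1 as n grows
disp([count, ratio])
```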
In Chapter 7 we have seen the big-O notation used as a means of describing how a function value
decreases as the argument decreases towards zero. Here we introduce the opposite viewpoint: the
notation can also be used to express how quickly a function value grows. As we discussed, the big-O
notation typically expresses how complicated functions behave in terms of a simple monomial
(say $x^2$). When we measure how quickly a function value decreases the low powers dominate;
contrariwise, when we measure how quickly a function value grows the high powers dominate.
Illustration 5
Consider the simple function $f(x) = x^2 + 30000x$. Use the big-O notation to describe its behavior
as $x \to 0$ and as $x \to \infty$.
As $x \to 0$ the function value decrease is dominated by the linear term ($30000x$) as it drops in
magnitude much slower than the square. On the contrary, the square term grows much faster than
the linear term as $x \to \infty$. Therefore we conclude that $f(x) \in O(x)$ as $x \to 0$ and that $f(x) \in O(x^2)$
as $x \to \infty$.
The big-O notation is often used in computer science to express how quickly the cost of an
algorithm grows as the number of quantities to be processed grows. For instance, nice algorithms are
those whose cost grows linearly or nearly linearly: computing the mean of a vector of length
$n$ is an operation of $O(n)$, and the FFT is an operation of $O(n \log n)$. Not so nice algorithms may be very
expensive for large $n$: for instance a naive discrete Fourier transform (the slow version of FFT) is
$O(n^2)$. Much more expensive than FFT!
The LU factorization is one of the more computationally-intensive algorithms. Based on the
expression that includes both a cubic term and a quadratic term we conclude that for sufficiently
large $n$ we should write $t_{LU} = O(n^3)$. Rather costly!
Illustration 6
Figure 8.9 shows the results of a numerical experiment. The MATLAB LU factorization is run for a
sequence of variously sized matrices, and the factorization time is recorded.
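t = [];
nTrials = 1000;% repeat so that the elapsed time is measurable
for n = 10:10:600
A=rand(n);
tic;
for Trial = 1:nTrials
[L,U,p]=lu(A,'vector');
end
t(end+1) =toc/nTrials;% average time per factorization
end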
The curve of required CPU time per factorization illustrates our estimate: first the time grows more
slowly than predicted, but asymptotically it appears to approach a straight line with slope 3 which
corresponds to a cubic dependence on the number of equations.
[Figure 8.9: log-log plot of Time [s] versus Matrix size n, with a reference triangle of slope 1:3.]
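The slope on the log-log plot can be estimated with a straight-line fit. The sketch below uses synthetic times, generated from assumed constants for a cubic and a quadratic term, as a stand-in for measured data; with real measurements the vector t from the experiment would replace the synthetic one.

```matlab
% Estimate the asymptotic slope on a log-log plot with polyfit.
% The times are synthetic (assumed constants), not measurements.
n = 10:10:600;                        % matrix sizes as in the experiment
t = 2e-10*n.^3 + 1e-9*n.^2;           % made-up cost model: cubic plus quadratic term
tail = n >= 300;                      % fit only the larger sizes
p = polyfit(log10(n(tail)), log10(t(tail)), 1);
slope = p(1);                         % close to 3 when the cubic term dominates
disp(slope)
```

Restricting the fit to the larger sizes matters: for small n the lower-order term (and, in real timings, overhead) flattens the curve.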
In a similar way, we can show that the time for forward or backward substitution is going to
grow as $O(n^2)$. This is good news, since for many right-hand sides the time is only going to grow
as quickly as for the factorization itself. For instance, to compute a matrix inverse we need to solve
$n$ times an $n \times n$ system of linear algebraic equations. If we use LU factorization with forward and
backward substitution, it will take
\[ O(n^3) + n\,O(n^2) = O(n^3) \]
time. If we use just plain Gaussian elimination for each solve, it will take
\[ n\,O(n^3) = O(n^4) . \]
Illustration 7
The cost estimate $t_{LU} = O(n^3)$ can be put to good use guessing the time that it may take to
factorize larger matrices. From Figure 8.9 we can read off that on this particular computer a $400 \times 400$
matrix takes about one hundredth of a second. For a $3000 \times 3000$ matrix the cubic estimate then predicts
\[ t_{LU,3000} \approx 0.01 \times (3000/400)^3 \ \mathrm{s} \approx 4.2 \ \mathrm{s} . \]
Running the calculation we find 2.35 s. This is a substantial difference with respect to the prediction.
First, the measurement of 0.01 s is likely to be substantially in error as it is difficult to measure the
execution times for computations that conclude very quickly – there are just too many confounding
factors in the software (think of all the operating system overhead) and hardware. Second, our
estimate was based on the cubic term, but we know there is also a quadratic term and that was not
taken into account. The matrix may not be large enough for the asymptotic big-O estimate to work
based on the largest term only.
Furthermore, let us say we want to use the second measurement, $t_{LU,3000} = 2.35$ s, to predict
the factorization time for a $30{,}000 \times 30{,}000$ matrix. If we had a computer with enough memory to
accommodate a matrix of this size, our prediction would be that the factorization time would go
up by a factor of $1000 = 10^3$ with respect to the time measured for the $3000 \times 3000$ matrix,
so to about 40 minutes. We would find the prediction rather more accurate this time. (Try it with a
slightly more modest increase: for instance a factor of 2 increase in the size of the matrix would
increase the factorization time by a factor of 8.)
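The extrapolations in this illustration amount to scaling a reference time by the cube of the size ratio. A small sketch of that arithmetic follows; the two reference times are the ones quoted in the text, and the helper predict is a hypothetical name introduced here.

```matlab
% Cubic scaling of a measured factorization time to a larger matrix size.
% predict is a hypothetical helper; reference times are taken from the text.
predict = @(t_ref, n_ref, n_new) t_ref*(n_new/n_ref)^3;
t3000_pred  = predict(0.01, 400, 3000);     % from the 400 x 400 reading: about 4.2 s
t30000_pred = predict(2.35, 3000, 30000);   % from the 3000 x 3000 measurement
minutes = t30000_pred/60;                   % about 39 minutes
t6000_pred  = predict(2.35, 3000, 6000);    % doubling the size: 8 times longer
disp([t3000_pred, minutes, t6000_pred])
```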
Structural engineers nowadays meet almost daily with results produced by models which are much
larger than the ones encountered so far in this book. Structural analysis programs, or more generally
finite element analysis programs, work on a daily basis with models where one million unknowns
is not uncommon. In recent years there have been reports of successful analyses with billions of
unknowns (simulation of seismic events). How do our algorithms handle the linear algebra in big
models?
First we may note that in many analyses we work with symmetric matrices. Considerable savings
are possible then. Take the LU factorization of a symmetric matrix
A = LU .
Now it is possible to factor $U$ by dividing the rows with its diagonal elements, so that we can write
$U$ as the product of the diagonal matrix $D = \mathrm{diag}(\mathrm{diag}(U))$ (expressed in MATLAB notation) with the
matrix $\hat{U}$
\[ U = D\hat{U} . \]
Since $A$ is symmetric,
\[ A = LU = LD\hat{U} \]
implies that $\hat{U} = L^T$. Therefore for symmetric $A$ we can make one more step from the LU factorization
to the $LDL^T$ factorization
\[ A = LDL^T . \]
This saves both time (we don’t have to compute U ) and space (we don’t have to store U ).
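The claim that the scaled upper factor equals the transpose of L can be verified on a small case. The sketch below assumes a hypothetical symmetric positive definite 3 x 3 matrix and an elimination without pivoting, so that the factors keep the structure discussed in the text.

```matlab
% Check U = D*Uhat with Uhat = L' for a small symmetric matrix.
% The matrix is a made-up example; elimination is done without pivoting.
A = [4 1 2; 1 3 0; 2 0 5];
n = size(A,1);
L = eye(n);  U = A;
for j = 1:n-1
    for i = j+1:n
        L(i,j) = U(i,j)/U(j,j);            % multiplier
        U(i,:) = U(i,:) - L(i,j)*U(j,:);   % eliminate the element i, j
    end
end
D = diag(diag(U));            % diagonal of the pivots
Uhat = D\U;                   % rows of U divided by their diagonal elements
disp(norm(Uhat - L', inf))    % close to zero
disp(norm(L*D*L' - A, inf))   % close to zero
```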
Figure 8.10 displays a finite element model with over 2000 unknowns. A small model, it can be
handled comfortably on a reasonably equipped laptop, yet it will serve us well to illustrate some of
the aspects of the so-called large-scale computing algorithms of which we need to be aware.
The figure shows a tuning fork. This one sounds approximately the note of A (440 Hz, interna-
tional “concert pitch”). To find this vibration frequency, we need to solve an eigenvalue problem (in
our terminology, the free vibration problem).
The impedance matrix $A = K - \omega^2 M$, which couples together the stiffness and the mass matrix,
has dimension roughly $2000 \times 2000$. However, not all 4 million numbers are nonzero. Figure 8.11
illustrates this by displaying the nonzeros as black dots (the zeros are not shown). The code to get
an image like this for the matrix A is as simple as
spy(A)
Where do the unknowns come from? The vibration model describes the motion of each node (that
would be the corners and the midsides of the edges of the tetrahedral shapes which constitute the
mesh of the tuning fork). At each node we have three displacements. Through the stiffness and mass
of each of the tetrahedra the nodes which are connected by the tetrahedra are dynamically coupled (in
the sense that the motion of one node creates forces on another node). All these coupling interactions
are recorded in the impedance matrix A. If an unknown displacement j at node K is coupled to an
unknown displacement k at node M, there will be a nonzero element Ajk in the impedance matrix.
If we do not care how we number the individual unknowns, the impedance matrix may look for
instance as shown in Figure 8.11: there are some interesting patterns in the matrix, but otherwise
the connections seem to be pretty random.
An important aspect of working with large matrices is that as a rule only the non-zeros in
matrices will be stored. The matrices will be stored as sparse. So far we have been working with
dense matrices: all the numbers were stored in a two-dimensional table. A sparse matrix has a more
complicated storage, since only the non-zeros are kept, and all the zeros are implied (not stored, but
when we ask for an element of the matrix that is not in storage, we will get back a zero). This may
mean considerable savings for matrices that hold only a very small number of non-zeros.
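Sparse storage can be tried out on a simple model matrix. The sketch below builds a hypothetical tridiagonal matrix (not the tuning fork matrix) with spdiags and compares it with its dense counterpart; only built-in functions are used.

```matlab
% Sparse versus dense storage for a tridiagonal model matrix.
% The matrix is a stand-in chosen for illustration.
n = 2000;
e = ones(n,1);
A = spdiags([-e, 2*e, -e], -1:1, n, n);  % only the three diagonals are stored
nz = nnz(A);                             % number of stored non-zeros
density = nz/numel(A);                   % fraction of the n*n entries that are nonzero
z = full(A(1,n));                        % an entry not in storage reads back as zero
Afull = full(A);                         % dense copy holds all n*n numbers
whos A Afull                             % compare the memory used by the two
```

Calling spy(A) on this matrix would show just the three diagonals.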
Fig. 8.11. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Original
numbering of the unknowns. The black dots represent non-zeros, zeros are not shown.
The reason we might want to worry about how the unknowns are numbered lies in the way the
LU factorization works. Remember, we are removing non-zeros below the diagonal by combining
rows. That means that if we are eliminating element $km$, we are adding a multiple of the row $m$ to
the row $k$. If the row $m$ happens to have non-zeros to the right of the column $m$, all those non-zeros
will now appear in row $k$. In this way, some of the zeros in a certain envelope around the diagonal
will become non-zeros during the elimination. This is clearly evident in Figure 8.11, where we can
see almost entirely black (non-zero) matrices $L$ and $U$. Why is this a problem? Because there are
many more non-zeros in the LU factors than in the original matrix $A$. The more numbers we have to
operate on, the more it costs to factorize the matrix, and the longer it takes. Also, all the non-zeros
need to be stored, and to update a sparse matrix with additional non-zeros is very expensive.
The appearance of additional non-zeros in the matrix during the elimination is called fill-in.
Fortunately, there are ways in which the fill-in may be minimized by carefully numbering coupled
unknowns. Figure 8.12 and Figure 8.13 visualize the impedance matrix and its factors for two
different renumbering schemes: the reverse Cuthill-McKee and symmetric approximate minimum
degree permutation. The matrix A holds the same number of non-zeros in all three figures (original
numbering, and the two renumbered cases). However the factors in the renumbered cases hold about
10 times fewer non-zeros than the original factors. This may be significant. Recall that for a dense
matrix the cost scales as $O(N^3)$. For a sparse matrix with a nice numbering which will limit the
fill-in to say 100 elements per row, the cost will scale as $O(100 \times N^2)$. For $N = 10^6$ this will be the
difference between having to wait for the factors for one minute or for a full week.
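The effect of renumbering on fill-in can be reproduced on a model problem. The sketch below rests on several assumptions: the matrix is a two-dimensional Laplacian on a 30 x 30 grid rather than the tuning fork matrix, a random permutation mimics a careless numbering, and because the model matrix is symmetric positive definite the Cholesky factor from chol is used as a stand-in for the LU factors.

```matlab
% Fill-in under three numberings of the same sparse matrix.
% Model problem and random "bad" numbering are assumptions of this sketch.
m = 30;
T = spdiags(ones(m,1)*[-1 2 -1], -1:1, m, m);
I = speye(m);
S = kron(I,T) + kron(T,I);          % 2D Laplacian, 900 unknowns
rng(0);                             % make the random numbering repeatable
pr = randperm(size(S,1));
Sr = S(pr,pr);                      % same matrix, carelessly numbered
prcm = symrcm(Sr);                  % reverse Cuthill-McKee ordering
pamd = symamd(Sr);                  % approximate minimum degree ordering
nnz_rand = nnz(chol(Sr));
nnz_rcm  = nnz(chol(Sr(prcm,prcm)));
nnz_amd  = nnz(chol(Sr(pamd,pamd)));
disp([nnz(S), nnz_rand, nnz_rcm, nnz_amd])
```

The number of non-zeros in the matrix itself is the same for all three numberings; only the factors differ.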
Fig. 8.12. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Renumbering
of the unknowns with symrcm. The black dots represent non-zeros, zeros are not shown.
As a last note on the subject we may take into account techniques of solving systems of
linear algebraic equations other than factorization. There is a large class of iterative algorithms, a
lineup starting with Jacobi and Gauss-Seidel solvers and currently ending with the so-called multigrid
solvers. These algorithms are much less sensitive to the numbering of the unknowns. In this
book we do not discuss these techniques, only a couple of minimization-based solvers, including the
powerful conjugate gradients, but refer for instance to Trefethen, Bau for an interesting story on
current iterative solvers. They are becoming ubiquitous in commercial software, hence we had better
know something about them.
Fig. 8.13. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Renumbering
of the unknowns with symamd. The black dots represent non-zeros, zeros are not shown.
Some of the uses of the LU factorization have been mentioned above: computing the matrix inverse,
in particular. Other uses to which the factorization can be put are computing the matrix
determinant, finding out whether the matrix has full rank, and assessing the so-called definiteness
(positive definiteness is of particular interest to us).
The determinant of the matrix A can be computed from the LU factorization as
$$\det A = \det(LU) = \det L \times \det U .$$
Provided L is indeed lower triangular, the determinant of each of the two triangular matrices is the
product of its diagonal elements, which yields
$$\det L = \prod_{i=1}^{n} L_{ii} = 1 , \qquad \det U = \prod_{i=1}^{n} U_{ii} ,$$
so that we have $\det A = \prod_{i=1}^{n} U_{ii}$.
If on the other hand L has been modified by pivoting permutations, its determinant can be ±1,
according to how many permutations occurred. (It is probably best to use the MATLAB built-in
det function. It uses the LU factorization, and correctly accounts for pivoting.)
That’s how determinants are computed, not by Cramer’s rule (not if we wish to live to see the
result).
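The remarks on pivoting can be checked with a short sketch. It assumes an arbitrary 3 × 3 example matrix (not taken from the book) and uses the three-output form of the built-in lu, for which P*A = L*U with a permutation matrix P.

```matlab
% Determinant from the LU factors with partial pivoting: P*A = L*U.
A = [2 1 1; 4 -6 0; -2 7 2];   % arbitrary example matrix (assumption)
[L, U, P] = lu(A);             % L has a unit diagonal, P is a permutation matrix
detP = det(P);                 % +1 or -1, depending on the row interchanges
dLU = detP * prod(diag(U));    % det(A) = det(U)/det(P) = det(P)*det(U)
d = det(A);                    % built-in function, for comparison
fprintf('from LU: %g, built-in: %g\n', dLU, d);
```

The sign correction is the only place where the permutations enter; the magnitude of the determinant is always the product of the pivots.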
We might consider using the LU factorization for determining the number of independent rows
(columns) of a matrix, the so-called rank. If the LU factorization succeeds, the matrix A has
full rank. Otherwise, it is possible that the factorization failed just because full pivoting was
not applied: the factorization might succeed if all possibilities for pivoting were
exploited. MATLAB does not use factorization to compute the rank for this reason (and for other
reasons that have to do with the stability of the computation); it rather takes advantage of the
so-called singular value decomposition. If the matrix A does not have full rank (the number of
linearly independent columns, or linearly independent rows, is less than the dimension of the
matrix), it is singular and cannot be LU factorized.
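A minimal sketch of the rank computation based on singular values follows. The example matrix is an assumption made up for illustration (its third row is the sum of the first two), and the tolerance is the default one documented for the built-in rank function.

```matlab
% Rank from the singular values of a deliberately rank-deficient matrix.
A = [1 2 3; 4 5 6; 5 7 9];     % row 3 = row 1 + row 2 (assumption: example only)
r = rank(A);                   % built-in rank, based on the singular values
s = svd(A);                    % singular values, largest first
tol = max(size(A)) * eps(max(s));   % default tolerance used by rank
r_svd = sum(s > tol);          % count the singular values above the tolerance
fprintf('rank = %d, smallest singular value = %g\n', r, s(end));
```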
On the diagonal of the matrix U we have the pivots. The signs of the pivots determine the
so-called positive or negative definiteness (or indefiniteness) of a matrix. More about this in the
chapter on optimization.
vector b and the coefficient matrix A itself are not represented during the solution process faithfully.
Therefore, in this section we will consider how the properties of A and b affect the error of the
solution x.
First, we shall inspect the sensitivity of the solution of the system of coupled linear algebraic
equations Ax = b to the magnitude of the error of the right-hand side, and the properties of the
matrix A. Equivalently, we could also state this in terms of errors: how large can they get?
8.4.1 Perturbation of b
Imagine the right-hand side vector changes just a little bit to $b + \Delta b$. The solution will then also
change
$$A (x + \Delta x) = b + \Delta b ,$$
and canceling $Ax = b$ gives
$$A \Delta x = \Delta b .$$
Now we would like to measure the relative change in the solution $\|\Delta x\|/\|x\|$ due to the relative
change in the right-hand side $\|\Delta b\|/\|b\|$. In terms of norms we can write (symbolically, we never
actually invert the matrix)
$$\|\Delta x\| = \|A^{-1} \Delta b\| , \qquad (8.14)$$
so that using the so-called CBS inequality (CBS: Cauchy, Bunyakovsky, Schwarz) we estimate
$$\|A^{-1} \Delta b\| \le \|A^{-1}\| \, \|\Delta b\| . \qquad (8.15)$$
It does not matter very much which norm is meant here, they are all equivalent. Also we can write
for the norms of the solution vector on the left-hand side and the vector on the right-hand side
$$\|b\| = \|A x\| \le \|A\| \, \|x\| . \qquad (8.16)$$
Now we substitute (8.15) into (8.14) and divide both sides by $\|b\|$
$$\frac{\|\Delta x\|}{\|b\|} \le \|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} .$$
On the right-hand side we now have the relative error $\|\Delta b\|/\|b\|$. Now we can introduce (8.16) to
replace $\|b\|$ on the left-hand side
$$\frac{\|\Delta x\|}{\|A\| \, \|x\|} \le \frac{\|\Delta x\|}{\|b\|} \le \|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} ,$$
which will give us the relative error of the solution $\|\Delta x\|/\|x\|$. Finally we rearrange this result into
$$\frac{\|\Delta x\|}{\|x\|} \le \|A\| \, \|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} . \qquad (8.17)$$
The quantity $\|A\| \, \|A^{-1}\|$ is the so-called condition number of the matrix A. This inequality relates
the relative error of the solution to the relative error of the right-hand side vector. The coefficient
of proportionality is found to be determined by the properties of the coefficient matrix.
Illustration 8
When the condition number is large, we see that there is a possibility of the change in the right-
hand side being very much magnified in the change of the solution. An example of the effect is given
here. [15] Consider the least-squares computation of a quadratic function passing through three points:
the point locations are x = [0,1.11,1.13]', and the values of the function at those three points are
y = [1,0.5,0.513]'. The least-squares computation is set up as
A = [x.^2,x.^1,x.^0];
p = (A'*A)\(A'*y)
to solve for the parameters p of the quadratic fit from the so-called normal equations (see details in
Section 10.13). The solution is
p =
0.973849956151390
-1.531423901778500
1.000000000000001
Now change the values of the quadratic function by dy = [0,0.00746,-0.006658]', which is a
relative change of norm(dy)/norm(y) = 0.00813. The solution changes by
dp = (A'*A)\(A'*dy)
dp =
-0.630637805947415
0.706728685322350
-0.000000000000000
which can be appreciated as a pretty substantial change. We see that
norm(dy)/norm(y)
ans =
0.008128568566353
norm(dp)/norm(p)
ans =
0.457113748779369
This means that while the data changed by less than 1%, the solution for the parameters changed by
almost 50%. We call matrices that produce this kind of large sensitivity ill conditioned . Figure 8.14
produced by
x = linspace(0,1.13,100)';
plot(x,[x.^2,x.^1,x.^0]*p,'r-','linewidth',2); hold on
plot(x,[x.^2,x.^1,x.^0]*(p+dp),'k--','linewidth',2)
shows the effect of the ill conditioning : It shows two quadratic curves fitted to the original data y
(red solid curve), and to the perturbed data y+dy (black dashed curve). The curves are very different
despite the fact that the points through which they pass have been moved only very little.
The amplification of the right-hand side error can be measured as shown in equation (8.17) by
assessing the magnitude of the condition number . In MATLAB this can be evaluated with the
function cond. For instance, we find for the matrix A from the Illustration above
15
See: aetna/DifficultMatrices/ill conditioned.m
8.4 Errors and condition numbers 169
Fig. 8.14. Quadratic curves fitted to the original data y (red solid curve), and to the perturbed data y+dy
(black dashed curve). Axes: x (horizontal), y (vertical).
cond(A'*A)
ans =
7.145344297615475e+004
The magnitude of the condition number can be understood in relative terms by considering the
condition numbers of identity matrices (these are probably the best matrices to work with!), which
are equal to one. More generally, orthogonal matrices also have condition numbers that are equal to
one. That is as low as the condition number goes: no matrix has a condition number smaller than one.
The bigger the condition number, the bigger the ill conditioning problem. In particular, we can see
that the condition number depends on the existence of the inverse of A. The closer the matrix A is
to being not invertible, the larger the condition number is going to get. For a singular matrix the
condition number is defined to be infinite. In the present case, the condition number is seen to be
fairly large. Hence we get the substantial amplification of the change of the right-hand side in the
solution vector.
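The bound (8.17) can be checked numerically on the data of Illustration 8. The sketch below is an assumption-laden restatement, not the book's script: it treats the normal equations as the linear system, so the matrix is A'*A and the right-hand side and its perturbation are A'*y and A'*dy. The names N, b, db, rel_sol and bound are introduced here only for illustration.

```matlab
% Check of the bound (8.17) for the normal equations of Illustration 8.
x = [0; 1.11; 1.13];
y = [1; 0.5; 0.513];
dy = [0; 0.00746; -0.006658];
A = [x.^2, x.^1, x.^0];
N = A'*A;                      % matrix of the normal equations
b = A'*y;                      % right-hand side of the normal equations
db = A'*dy;                    % perturbation of the right-hand side
p = N\b;                       % parameters of the quadratic fit
dp = N\db;                     % change of the parameters
rel_sol = norm(dp)/norm(p);            % relative change of the solution
bound = cond(N)*norm(db)/norm(b);      % right-hand side of (8.17)
fprintf('relative change %g <= bound %g\n', rel_sol, bound);
```

The bound is an upper estimate: the actual amplification depends on the direction of the perturbation and is usually well below the worst case.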
Illustration 9
To continue the previous Illustration, we change the horizontal position of one of the points,
x = [0,0.61,1.13]'. [16] The perturbed quadratic curve is found to differ only slightly from the original.
The condition number confirms that the matrix is considerably less ill-conditioned
cond(A'*A)
ans =
193.7789
8.4.3 Perturbation of A
We can also consider the effect of changes in the matrix itself, for instance when the elements of
the matrix are calculated with some error. So when the matrix changes (not the right-hand side,
that remains the same), we write for the changed solution
16
See: aetna/DifficultMatrices/better conditioned.m
$$(A + \Delta A)(x + \Delta x) = b ;$$
canceling $Ax = b$ gives
$$A \Delta x + \Delta A (x + \Delta x) = 0$$
or
$$\Delta x = -A^{-1} \Delta A (x + \Delta x)$$
and
$$\|\Delta x\| \le \|A^{-1}\| \, \|\Delta A\| \, \|x + \Delta x\| .$$
To bring in relative changes again, we divide by $\|x + \Delta x\|$ on both sides and divide and multiply
with $\|A\|$ on the right-hand side
$$\frac{\|\Delta x\|}{\|x + \Delta x\|} \le \|A\| \, \|A^{-1}\| \frac{\|\Delta A\|}{\|A\|} .$$
We see that the relative change in the solution is expressed as before. It is bounded by the relative
change in the left-hand side matrix, and the multiplier is again the condition number.
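A quick numerical check of this bound is sketched below. The matrix is the matrix A of Figure 8.16; the right-hand side b and the perturbation dA are arbitrary values assumed only for the purpose of the example.

```matlab
% Check of the bound for a perturbation of the matrix.
A = [2 1.5; -1.5 3];           % matrix A of Figure 8.16
b = [1; 2];                    % arbitrary right-hand side (assumption)
dA = 1e-3*[1 -2; 0.5 1];       % small perturbation of the matrix (assumption)
x = A\b;                       % original solution
xp = (A + dA)\b;               % solution of the perturbed system, x + dx
dx = xp - x;
lhs = norm(dx)/norm(xp);               % relative change, measured against x + dx
rhs = cond(A)*norm(dA)/norm(A);        % condition number times relative change of A
fprintf('%g <= %g\n', lhs, rhs);
```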
The condition number appears to be an important quantity. In order to understand the condition
number we have to understand a little bit where the norms of the matrix and its inverse come from.
An easy way in which we can talk of matrix norms while introducing nothing more than norms of
vectors stems from the so-called induced matrix norm. We think of the matrix A (here we will
discuss only square matrices, but this would also apply to rectangular matrices) as producing a map
from the vector space $R^n$ to the same vector space by taking input x and producing output y
$$y = Ax .$$
We can measure “how big” a matrix is (that is its norm) by measuring how much all possible input
vectors x get stretched by A. We take the largest possible stretch as the induced norm of A
$$\|A\| = \max_{\|x\| \ne 0} \frac{\|Ax\|}{\|x\|} .$$
Note that on the left we have a matrix norm, and on the right we have a vector norm. That is why
we say that the matrix norm on the left is induced by the vector norm on the right. An alternative
form of the above equation, and a very useful one, can be expressed as
$$\|A\| = \max_{\|x\| = 1} \|Ax\| . \qquad (8.18)$$
The vector norm itself may be chosen in several ways. A commonly used family is the p-norms
$$\|x\|_p = \left( \sum_{j=1}^{n} |x_j|^p \right)^{1/p}$$
(we may recall the similarity with the root-mean-square formula for p = 2). Taking p = 1 we get
$\|x\|_1 = \sum_{j=1}^{n} |x_j|$ (the so-called 1-norm), taking p = 2 we obtain the usual Euclidean norm (also
called the 2-norm)
$$\|x\|_2 = \left( \sum_{j=1}^{n} |x_j|^2 \right)^{1/2} .$$
Also used is the so-called infinity norm, which has to be worked out by a limiting process,
$\|x\|_\infty = \max_{j=1:n} |x_j|$.
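The three norms can be evaluated directly from their definitions and compared with the built-in norm function. The vector x below is an arbitrary example (an assumption), and the exponent p = 50 merely stands in for the limiting process that leads to the infinity norm.

```matlab
% Vector norms computed from the definitions.
x = [3; -4; 1];                % arbitrary example vector (assumption)
n1 = sum(abs(x));              % 1-norm
n2 = sqrt(sum(abs(x).^2));     % 2-norm (Euclidean)
ninf = max(abs(x));            % infinity norm
p = 50;                        % a large exponent (assumption)
np = sum(abs(x).^p)^(1/p);     % p-norm, already very close to the infinity norm
fprintf('%g %g %g %g\n', n1, n2, ninf, np);
```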
Illustration 10
The three norms introduced above are illustrated in Figure 8.15. The squares and the circle represent
vectors of unit norm, as measured by the various norm definitions. The arrows are vectors of unit
norms, using the three norm definitions given above.17
Fig. 8.15. Illustration of vector norms (1, 2, ∞): the sets of vectors with $\|x\|_1 = 1$, $\|x\|_2 = 1$, and $\|x\|_\infty = 1$, plotted in the $x_1$, $x_2$ plane.
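We are going to work out a useful visual association for the condition number $\|A\| \, \|A^{-1}\|$. We have
the definition (8.18), and the induced matrix norm of the matrix inverse can be obtained by the
following substitution
$$Ay = x ,$$
which gives
$$\|A^{-1}\| = \max_{\|x\| \ne 0} \frac{\|A^{-1} x\|}{\|x\|} = \max_{\|y\| \ne 0} \frac{\|y\|}{\|Ay\|} .$$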
17
See: aetna/MatrixNorms/vecnordemo.m
Note that we assume A to be invertible, and then $\|Ay\| \ne 0$ for $\|y\| \ne 0$. Also, we can change the
maximum into a one-over-minimum fraction, so that we can write for the norm of $A^{-1}$
$$\|A^{-1}\| = \left( \min_{\|y\| \ne 0} \frac{\|Ay\|}{\|y\|} \right)^{-1} = \left( \min_{\|y\| = 1} \|Ay\| \right)^{-1} .$$
With these formulas for the norms, we can write for the condition number
$$\|A\| \, \|A^{-1}\| = \frac{\max_{\|x\| = 1} \|Ax\|}{\min_{\|y\| = 1} \|Ay\|} . \qquad (8.19)$$
Now this is relatively easy to visualize. Figures 8.16 and 8.17 present a gallery of matrices. The
images visualize the results of the multiplication of unit-length vectors pointing in various directions
from the origin. The induced 2-norm is used, and consequently the heads of the unit-length vectors
form a circle of unit radius. We can see how the formula for the condition number (8.19) correlates
with the largest and smallest length of the vector that results from the multiplication of the matrix
and the unit vector. For instance, for the matrix A we may estimate the length of the longest and
shortest Ax vector as ≈ 3 and ≈ 2, and therefore we guess the condition number to be ≈ 3/2. This
may be compared with the computed condition number $\|A\| \, \|A^{-1}\| \approx 1.414$. Alternatively, we could
take the length of the longest vector Ax as ≈ 3 and the length of the longest vector $A^{-1}x$ as ≈ 1/2,
and therefore we guess the condition number to be ≈ 3 × 1/2.
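The reading of the condition number off the picture can be imitated numerically. The sketch below samples unit vectors on the circle, which is what the figures show, and evaluates formula (8.19) for the matrix A of Figure 8.16. The number of sample directions is an arbitrary choice (an assumption).

```matlab
% Condition number estimated from formula (8.19) by sampling the unit circle.
A = [2 1.5; -1.5 3];                 % matrix A of Figure 8.16
theta = linspace(0, 2*pi, 2001);     % sample directions (assumption: 2001 samples)
X = [cos(theta); sin(theta)];        % unit-length vectors stored as columns
len = sqrt(sum((A*X).^2, 1));        % lengths of the vectors A*x
kappa_est = max(len)/min(len);       % longest stretch over shortest stretch
kappa = cond(A);                     % built-in 2-norm condition number
fprintf('estimate %g, cond(A) %g\n', kappa_est, kappa);
```

The longest and shortest stretch come out close to 3.4 and 2.4, which is consistent with the rough visual estimates of 3 and 2.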
Illustration 11
Use the function matnordemo18 to create for each of the three norms a diagram similar to those of
Figure 8.16 for the matrix [2 -0.2; -1.5 3], and then try to read off the norm of this matrix from
the figure. Compare with the matrix norm computed as
norm([2 -0.2; -1.5 3],1)
norm([2 -0.2; -1.5 3],2)
norm([2 -0.2; -1.5 3],inf)
Note that for the symmetric matrices B, D, F in Figures 8.16 and 8.17 the largest and the smallest
stretch occurs in the direction of some vector x. In other words, we have
$$Bx = \lambda x$$
and we see that the extreme stretches have to do with the eigenvalues of the symmetric matrix.
This may be contrasted with, for instance, the unsymmetric matrix A, where the stretch Ax never
occurs in the direction of x. Other examples similar to A are the matrices C, E in Figure 8.16 and
Figure 8.17.
Symmetric matrices have real eigenvalues and can always be made similar to a diagonal matrix,
which means that symmetric matrices always have a full set of eigenvectors. Now we have seen
that for symmetric matrices the 2-norms are directly related to their eigenvalues. We all will fondly
remember the stress and strain representations as symmetric matrices: the principal stresses and
strains, and the directions of the principal stresses and strains, are the eigenvalues and eigenvectors
of these matrices.
18 See: aetna/MatrixNorms/matnordemo.m
Fig. 8.16. Matrix and matrix inverse norm illustration. Panels (left: the matrix applied to the unit vectors x, right: the inverse applied to x): A=[2 1.5; −1.5 3], B=[3 −1.2; −1.2 2], C=[0 −1; 1 0]. Matrix condition numbers: $\|A\| \, \|A^{-1}\| = 1.414$; $\|B\| \, \|B^{-1}\| = 3.167$; $\|C\| \, \|C^{-1}\| = 1.0$.
In fact, for all matrices, symmetric and unsymmetric, the matrix norm has something to do with
eigenvalues and eigenvectors. Consider the definition of the induced matrix norm
$$\|A\| = \max_{\|x\| \ne 0} \frac{\|Ax\|}{\|x\|}$$
and square both sides
$$\|A\|^2 = \max_{\|x\| \ne 0} \frac{\|Ax\|^2}{\|x\|^2} .$$
Fig. 8.17. Matrix and matrix inverse norm illustration. Panels (left: the matrix applied to the unit vectors x, right: the inverse applied to x): D, E=[1,1; 0,1], F=[1,1; 1,1.2]. Matrix condition numbers: $\|D\| \, \|D^{-1}\| = 10.0$; $\|E\| \, \|E^{-1}\| = 2.618$; $\|F\| \, \|F^{-1}\| = 22.15$.
For the moment we shall consider that the vector norms are Euclidean norms (2-norms). From the
definition of the vector norms, we have
$$\|A\|^2 = \max_{\|x\| \ne 0} \frac{(Ax)^T (Ax)}{x^T x} = \max_{\|x\| \ne 0} \frac{x^T A^T A x}{x^T x} .$$
The expression on the right is the so-called Rayleigh quotient of the matrix $A^T A$ (not of A itself!).
It is the result of the pre-multiplication of the eigenvalue problem
$$A^T A x = \lambda x \qquad (8.20)$$
with $x^T$, which gives
$$\lambda = \frac{x^T A^T A x}{x^T x} .$$
Note that
$$x^T A^T A x \ge 0 , \qquad x^T x > 0 ,$$
where $x^T x = 0$ is not allowed by the definition of the norm. Clearly, the Rayleigh quotient attains
its maximum for the largest eigenvalue in absolute value $\max |\lambda|$, and its minimum for the smallest
eigenvalue in absolute value $\min |\lambda|$. From this we can deduce
$$\|A\| = \sqrt{\max |\lambda|} .$$
Similarly, we obtain
$$\|A^{-1}\| = 1 / \sqrt{\min |\lambda|} .$$
For a symmetric matrix A we can go one step further. Consider the eigenvalue problem of the matrix
A itself
$$A v = \lambda' v . \qquad (8.21)$$
Pre-multiplying with $A^T = A$ gives
$$A^T A v = \lambda' A^T v = (\lambda')^2 v . \qquad (8.22)$$
In comparison with (8.20) we see that $\lambda = (\lambda')^2$. Therefore, the norm of a symmetric matrix will be
$$\|A\| = \max |\lambda'| ,$$
where $\lambda'$ solves the eigenvalue problem (8.21). Analogously, the norm of the inverse of a symmetric
matrix will be
$$\|A^{-1}\| = \frac{1}{\min |\lambda'|} ,$$
and the condition number of the symmetric matrix is therefore
$$\|A\| \, \|A^{-1}\| = \frac{\max |\lambda'|}{\min |\lambda'|} . \qquad (8.23)$$
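These relations are easy to confirm with the matrices of Figure 8.16. The following is only a sketch: the variable names are introduced here for illustration, and the built-in functions eig, norm and cond serve as the reference.

```matlab
% Norms and condition numbers expressed through eigenvalues.
A = [2 1.5; -1.5 3];           % unsymmetric matrix A of Figure 8.16
lam = eig(A'*A);               % eigenvalues of A'*A (real, non-negative)
nA = sqrt(max(lam));           % 2-norm of A
nAinv = 1/sqrt(min(lam));      % 2-norm of the inverse of A
B = [3 -1.2; -1.2 2];          % symmetric matrix B of Figure 8.16
lamB = eig(B);                 % eigenvalues of B itself
condB = max(abs(lamB))/min(abs(lamB));   % formula (8.23)
fprintf('norm(A) = %g, cond(B) = %g\n', nA, condB);
```

For B the eigenvalues are 3.8 and 1.2, so the ratio reproduces the value 3.167 quoted in the caption of Figure 8.16.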
Illustration 12
8.5 QR factorization
Consider a system of linear algebraic equations
Ax = b
with a square matrix A. It is possible to factorize the matrix into the product of an orthogonal
matrix Q and an upper triangular matrix R
A = QR .
How does this work? If we write this relationship between the matrices in terms of their columns
things become clearer:
$$c_k(A) = Q \, c_k(R) .$$
Then the first column of A is $c_1(A) = c_1(Q) R_{11}$ (all other coefficients in the first column
of R are zero). The fourth column of A is a linear combination of the first four columns of Q (the
coefficients are the entries of the fourth column of R)
$$c_4(A) = c_1(Q) R_{14} + c_2(Q) R_{24} + c_3(Q) R_{34} + c_4(Q) R_{44}$$
and so on. The principle is now clear: each of the columns of A is constructed of columns of Q which
are orthogonal, and the columns of Q can be obtained by straightening out the columns of A as
long as the columns of A are linearly independent (refer to Figure 8.18): $q_1$ is a unit vector in the
direction of $a_1$, and $q_2$ is obtained from the part of $a_2$ that is orthogonal to $q_1$.
Fig. 8.18. Two arbitrary linearly independent vectors $a_1$ and $a_2$, and two orthonormal vectors $q_1$
and $q_2$ that span the same plane
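The "straightening out" of two vectors can be sketched in a few lines. This is the Gram-Schmidt construction suggested by Figure 8.18, not the Householder algorithm developed later, and the vectors a1 and a2 are arbitrary values assumed for the example.

```matlab
% Orthonormal q1, q2 from two linearly independent vectors a1, a2.
a1 = [3; 1]; a2 = [1; 2];      % arbitrary example vectors (assumption)
q1 = a1/norm(a1);              % unit vector in the direction of a1
w = a2 - (q1'*a2)*q1;          % part of a2 that is orthogonal to q1
q2 = w/norm(w);                % normalize it
Q = [q1, q2];                  % orthogonal factor
R = Q'*[a1, a2];               % upper triangular factor (R(2,1) is zero)
disp(Q*R - [a1, a2]);          % should be zero up to rounding
```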
The great advantage that can be derived from this factorization stems from the fact that the
inverse of an orthogonal matrix is simply its transpose
$$Q^{-1} = Q^T .$$
Therefore, the system
$$Ax = QRx = b$$
can be rewritten as
$$Rx = Q^T b .$$
Now since the matrix R is upper triangular, solving for the unknown x is very efficient: starting
at the bottom we proceed by back substitution. The solution is not for free, of course. We had to
construct the factorization in the first place.
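The solution procedure may be sketched as follows. The built-in qr function supplies the factors here, the matrix and the right-hand side are arbitrary example values (assumptions), and the back substitution is written out explicitly only to show the bottom-up order of the computation.

```matlab
% Solution of A*x = b with the QR factorization and back substitution.
A = [4 1 2; 1 3 0; 2 0 5];     % arbitrary nonsingular matrix (assumption)
b = [1; 2; 3];                 % arbitrary right-hand side (assumption)
[Q, R] = qr(A);                % built-in QR factorization
c = Q'*b;                      % right-hand side of R*x = Q'*b
n = length(c);
x = zeros(n, 1);
for i = n:-1:1                 % start at the bottom row and work upwards
    x(i) = (c(i) - R(i,i+1:n)*x(i+1:n))/R(i,i);
end
disp(norm(A*x - b));           % residual, should be of the order of rounding error
```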
An additional benefit of this particular factorization is in the ability to factorize rectangular
matrices, not just square. Furthermore, due to the orthogonality of Q operations with it are as
nice numerically as possible (remember the perfect condition number of one?). Therefore the QR
factorization is used when numerical stability is at a premium. Examples may be found in the least-
squares fitting subject. Also, the QR factorization leads to a valuable algorithm for the computation
of eigenvalues and eigenvectors for general matrices.
The question now is how to compute the QR factorization. A particularly popular and effective
algorithm is based on the so-called Householder reflections.
The Householder transformation (reflection) is designed to modify a column matrix so that the
result of the transformation has only one nonzero element, the first one, but the length of the result
(that is its norm) is preserved. Matrix transformations that preserve lengths are either rotations or
reflections (the Householder transformation is the latter):
$$H a = \tilde{a} , \quad \text{where } \|a\| = \|\tilde{a}\| .$$
The transformation produces the vector $\tilde{a}$
$$\tilde{a} = \begin{bmatrix} \pm \|a\| \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
by reflection in a plane that is defined by the normal generated as the difference $n = \tilde{a} - a$ and
passes through the origin O (see Figure 8.19). This follows from the two vectors $\tilde{a}$ and a being of
the same length.
Fig. 8.19. Householder transformation: the geometrical relationships. The reflection plane is shown by the
dashed line. Consider that in the two-dimensional figure there are two possible reflection planes.
The relationship between the three vectors may be written as $\tilde{a} = a + n$, which may be tweaked
using a little trick (note carefully the position of the parentheses)
$$\tilde{a} = a + n = a + n \frac{n^T a}{n^T a} = a + \frac{(n n^T) a}{n^T a} = \left( 1 + \frac{n n^T}{n^T a} \right) a .$$
Note that both matrices (the identity and the rest) in the parentheses are square. Together they
constitute an orthogonal matrix
$$H = 1 + \frac{n n^T}{n^T a} , \qquad H^T H = 1 . \qquad (8.24)$$
Interestingly, this matrix is also symmetric. This is really how it should be: H produces a mirror
image of a, $\tilde{a} = Ha$. The mirror image of $\tilde{a}$, the inverse operation $a = H^{-1} \tilde{a}$, must give us back
a, but the inverse operation is again a reflection, the same reflection that gave us $\tilde{a}$ from a.
To compute the Householder matrix we could use the function Householder_matrix. [19] The sign
of the non-zero element of $\tilde{a}$ is chosen with particular attention to numerical stability: when we
compute $n = \tilde{a} - a$, the vector $\tilde{a}$ has only one nonzero element. To avoid numerical error when
subtracting two similar numbers $\tilde{a}_1 - a_1$ we choose $\mathrm{sign}\, \tilde{a}_1 = -\mathrm{sign}\, a_1$.
function H = Householder_matrix(a)
    if (a(1)>0)
        at1 = -norm(a); % choose the sign wisely
    else
        at1 = +norm(a);
    end
    n = -a; n(1) = n(1)+at1; % this is the subtraction a~ - a
    H = eye(length(a))+(n*n')/(n'*a); % this is the formula (8.24)
end
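A small worked case makes the formula concrete. The sketch below repeats the steps of Householder_matrix inline, for a vector a chosen (as an assumption) so that the arithmetic can be followed by hand: with a = [3; 4; 0] the normal is n = [-8; -4; 0] and the reflected vector is [-5; 0; 0].

```matlab
% Householder reflection of a = [3; 4; 0], computed step by step.
a = [3; 4; 0];                 % example vector (assumption)
if a(1) > 0
    at1 = -norm(a);            % sign opposite to a(1), here -5
else
    at1 = +norm(a);
end
n = -a; n(1) = n(1) + at1;     % normal n = a~ - a
H = eye(length(a)) + (n*n')/(n'*a);   % formula (8.24)
at = H*a;                      % reflected vector, only the first element is nonzero
disp(at');
```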
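How do we use the Householder transformation? We consider the columns of the matrix to be
transformed as the vectors that we can reflect as shown above. The first step zeroes out the elements
of A below the diagonal of the first column (here for a 6 × 6 matrix; the bullets represent entries that are in general nonzero)
$$
H_1 A =
\begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & \bullet & \bullet & \bullet & \bullet & \bullet
\end{bmatrix} .
$$
We write $H_1$ for the 6 × 6 matrix obtained from the first column of A. We write $H_2$ for the 5 × 5
matrix obtained from the second column of $H_1 A$, from the diagonal to the bottom of the column.
Analogously for the other Householder matrices.
$$
\begin{bmatrix} 1 & 0 \\ 0 & H_2 \end{bmatrix} H_1 A =
\begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & \bullet & \bullet & \bullet & \bullet & \bullet \\
0 & 0 & \bullet & \bullet & \bullet & \bullet \\
0 & 0 & \bullet & \bullet & \bullet & \bullet \\
0 & 0 & \bullet & \bullet & \bullet & \bullet \\
0 & 0 & \bullet & \bullet & \bullet & \bullet
\end{bmatrix} .
$$
Proceeding in this way through the fifth column, we arrive at the upper triangular matrix R.
To obtain A from R we would successively invert the above relationships one by one. That is not
difficult since we realize that those matrices are orthogonal and symmetric, so the inverse is equal
to the original matrix. We just have to switch the order of the matrices. We get
$$
A = H_1
\begin{bmatrix} 1 & \\ & H_2 \end{bmatrix}
\cdots
\begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & 1 & \\ & & & & H_5 \end{bmatrix}
R ,
$$
so that the product of the reflection matrices in front of R is the orthogonal factor Q.
19 See: aetna/QRFactorization/Householder_matrix.m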
Illustration 13
Here we present a factorization which is based directly on the schemas above. The function
Householder_matrix [20] computes the Householder matrix of equation (8.24). Note that the matrices
$H_j$ are blocks embedded in an identity matrix. The following code fragment should be stepped
through, and I will bet that it will nicely reinforce our ideas of how Householder reflections work.
format short
A=rand(5); R=A % this is where R starts
Q=eye(size(A));% this is where Q starts
for k=1:size(A,1)-1
    H=eye(size(A));% Start with an identity...
    % ...and then put in the Householder matrix as a block,
    % built from column k of R, from the diagonal down
    H(k:end,k:end) = Householder_matrix(R(k:end,k))
    R= H*R % this matrix is becoming R
    Q= Q*H % this matrix is becoming Q
end
Q*Q'% check that this is an orthogonal matrix: should get identity
A-Q*R % check that the factorization is correct
R-Q'*A % another way to check
The algorithm to produce the QR factorization [21] is designed to be a little bit more efficient than
the code above, but it is still surprisingly short and readable
function [Q,R] = HouseQR(A)
    m=size(A,1);
    Q=eye(m); R =A;
    for k=1:size(A,1)-1
        n = Householder_normal(R(k:end,k:k));
        R(k:end,k:end) =R(k:end,k:end)-2*n*(n'*R(k:end,k:end));
        Q(:,k:end)=Q(:,k:end)-2*(Q(:,k:end)*n)*n';
    end
end
20
See: aetna/QRFactorization/Householder matrix.m
21
See: aetna/QRFactorization/HouseQR.m
Instead of the Householder matrix (8.24) we use in HouseQR the equivalent expression
$$H = 1 - 2 N N^T ,$$
where $N = n / \|n\|$ is the normal scaled to unit length.
The Householder normal is also computed with attention to numerical stability by choosing the sign
of the nonzero element of the normal to eliminate cancellation. Note well that the computed normal
is of unit length. [22]
function n = Householder_normal(a)
    if (a(1)>0)
        at1 = -norm(a); % choose the sign wisely
    else
        at1 = +norm(a);
    end
    n = -a; n(1) = n(1)+at1; % this is the subtraction a~ - a
    n = n/sqrt(n'*n); % normalize
end
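Why the expression with the unit normal is equivalent to (8.24) can be shown in two lines. The derivation below is a sketch that uses only the definition $n = \tilde{a} - a$ and the equality of the lengths of $\tilde{a}$ and a; the symbol N for the unit normal is notation assumed here.

```latex
% Step 1: n^T n expressed through n^T a, using \tilde{a}^T \tilde{a} = a^T a
n^T n = (\tilde{a} - a)^T (\tilde{a} - a)
      = \tilde{a}^T \tilde{a} - 2 \tilde{a}^T a + a^T a
      = -2 (\tilde{a}^T a - a^T a)
      = -2 \, n^T a .

% Step 2: substitute n^T a = -n^T n / 2 into (8.24), with N = n / \|n\|
H = 1 + \frac{n n^T}{n^T a}
  = 1 - 2 \frac{n n^T}{n^T n}
  = 1 - 2 N N^T .
```

The second form needs only the unit normal, which is why HouseQR can apply the reflection with two matrix-vector products instead of forming H.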
Illustration 14
The HouseQR function acts as a black box: A goes in, Q,R come out in their finished form. It is
however possible to set a breakpoint inside the function to watch the matrices form layer-by-layer
by the Householder reflections. Try it.
22
See: aetna/QRFactorization/Householder normal.m
9
Solutions methods for eigenvalue problems
Summary
1. We discover a few basic algorithms for the solution of the eigenvalue problem, both the standard
and the generalized form.
2. Repeated multiplication by a matrix tends to amplify the directions associated with the eigenvectors
of the dominant eigenvalues. Main idea: write the modal expansion, and consider the powers of the
eigenvalues.
3. Various forms of the power iteration, including the QR iteration, form the foundations of some
of the workhorse routines used in vibration analysis and in general purpose software (with
appropriate, and sometimes considerable, refinements).
4. The Rayleigh quotient is an invaluable tool both for algorithm design and for quick ad hoc
checks.
5. This area of numerical analysis has seen considerable progress in recent years and some power-
ful new algorithms have emerged. Solving large-scale eigenvalue problems nevertheless remains
nontrivial, even with sophisticated software packages.
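$$A v_j = \lambda_j v_j .$$
Multiplying both the left-hand side and the right-hand side of the above equation again by A yields
$$A (A v_j) = A^2 v_j = \lambda_j A v_j = \lambda_j^2 v_j ,$$
and in general, for the k-th power,
$$A^k v_j = \lambda_j^k v_j .$$
Now imagine that an arbitrary vector x is going to be multiplied repeatedly by A. Our goal is
to analyze the result of $A^k x$. We will use an expansion of the arbitrary vector x in terms of the
eigenvectors of the matrix A (the so-called modal expansion)
$$x = \sum_{j=1:n} c_j v_j .$$
Then
$$A^k x = \sum_{j=1:n} c_j \lambda_j^k v_j = |\lambda_1|^k \sum_{j=1:n} c_j \frac{\lambda_j^k}{|\lambda_1|^k} v_j .$$
Here we assume that the eigenvalues are numbered so that the first one dominates, $|\lambda_1| > |\lambda_2| \ge \ldots \ge |\lambda_n|$.
Due to our assumption that the first eigenvalue dominates, the coefficients $\lambda_j^k / |\lambda_1|^k$ will approach
zero in absolute value as $k \to \infty$, except for $\lambda_1^k / |\lambda_1|^k$ which will maintain absolute value equal to
one. Therefore, as $k \to \infty$ the only term left from the modal expansion of x will be
$$\lim_{k \to \infty} A^k x = c_1 \lambda_1^k v_1 .$$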
Figure 9.1 illustrates the effect of repeated multiplication of an arbitrary vector x by the 2 × 2
matrix A
$$Ax , \quad AAx = A^2 x , \quad \ldots$$
The eigenvalues are $\lambda_1 = 1.6$ (with eigenvector $v_1$), $\lambda_2 = 0.37$ (with eigenvector $v_2$), so the first
eigenvalue is dominant, and evidently the result of the multiplication leans more and more towards the
first eigenvector. The “leaning” is very rapid. The reason is that the fraction $\lambda_2^k / |\lambda_1|^k = (0.23125)^k$
will decrease very rapidly with higher powers (for instance, $(0.23125)^4 = 0.00286$). Therefore, the
contribution of the eigenvector $v_2$ to the vector $A^k x$ will become vanishingly small rather quickly.
Fig. 9.1. The effect of several matrix-vector multiplications (the vectors $A^0 x, \ldots, A^4 x$ are shown together with the eigenvectors $v_1$, $v_2$). Eigenvalues $\lambda_1 = 1.6$, $\lambda_2 = 0.37$
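The leaning towards the dominant eigenvector can be reproduced with a few lines. The matrix behind Figure 9.1 is not given in the text, so the sketch below builds a hypothetical symmetric matrix with the same eigenvalues 1.6 and 0.37; the eigenvector directions and the starting vector are assumptions.

```matlab
% Repeated multiplication by a matrix with eigenvalues 1.6 and 0.37.
th = pi/6;                                    % eigenvector orientation (assumption)
V = [cos(th) -sin(th); sin(th) cos(th)];      % orthonormal eigenvectors v1, v2
A = V*diag([1.6, 0.37])*V';                   % hypothetical matrix, not the book's
x = [-0.3; 1];                                % arbitrary starting vector (assumption)
for k = 1:4
    x = A*x;                                  % x now holds A^k times the start vector
    c = abs(V(:,1)'*x)/norm(x);               % cosine of the angle between x and v1
    fprintf('k = %d, alignment with v1 = %.6f\n', k, c);
end
```

After only four multiplications the cosine of the angle to $v_1$ differs from one by less than one part in a thousand.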
The repeated multiplication to amplify the components of the dominant eigenvector is the
principle behind the so-called power iteration method for the calculation of the dominant eigen-
value/eigenvector.
matrix will diminish the contributions of all other eigenvectors except the first one so that eventually
the product Ak x will be mostly in the direction of the first eigenvector v 1 .
The method is not fail-proof. Firstly, it appears that if the starting vector x does not contain any
contribution of the first eigenvector, $c_1 = 0$, the power method is not going to converge. Fortunately,
any amount of the inevitable arithmetic error will likely introduce some contribution of the first
eigenvector to which the power method will ultimately converge. Unfortunately, it may take a long
time.
Secondly, the method is definitely going to have trouble with converging for $|\lambda_2| \approx |\lambda_1|$ (in
words, when the second eigenvalue is close to the first eigenvalue in magnitude). The ratio $\lambda_2^k / |\lambda_1|^k$
will decrease slowly, resulting in slow convergence. Such a situation is illustrated in Figure 9.2: the
eigenvalues are $\lambda_1 = -0.8$, $\lambda_2 = 0.75$. The iterated vector $A^k x$ appears to converge to the direction
of $v_1$, but slowly.
A few observations can be made from Figure 9.2. The iterated vector $A^k x$ decreases in magnitude
($|\lambda_1| < 1$), and if we iterate sufficiently long the vector will get so short that we may risk underflow,
or at least numerical issues due to arithmetic error. (Note that for $|\lambda_1| > 1$ the approximations
to the eigenvector will grow, which may eventually result in overflow.) Further, since $\lambda_1 < 0$ the
iterated vector aligns itself alternately with $v_1$ and $-v_1$. This is fine, since both are perfectly good
eigenvectors, but it complicates somewhat the issue of how to measure convergence. We want to
measure convergence of directions, not of the individual components of the vector!
Fig. 9.2. The effect of several matrix-vector multiplications (the vectors $A^0 x, \ldots, A^4 x$ are shown together with the eigenvectors $v_1$, $v_2$). Eigenvalues $\lambda_1 = -0.8$, $\lambda_2 = 0.75$
To address the concerns about underflow and overflow we may introduce normalization (rescaling)
of the iterated vector as
    $x^{(0)}$ given
    for $k = 1, 2, \ldots$
        $x^{(k)} = A x^{(k-1)}$
        $x^{(k)} = x^{(k)} / \|x^{(k)}\|$
How to measure the convergence of the algorithm may be made easier by considering the associated
problem of finding the eigenvalue $\lambda_1$. An excellent tool is offered by the Rayleigh quotient. Pre-multiply
the eigenvalue problem on both sides with $v_j^T$
$$A v_j = \lambda_j v_j \quad \Rightarrow \quad v_j^T A v_j = \lambda_j v_j^T v_j ,$$
to obtain
$$\lambda_j = \frac{v_j^T A v_j}{v_j^T v_j} .$$
Now consider the vector $x^{(k)}$ as an approximation of the eigenvector $v_1$. A good approximation of
the eigenvalue will be
$$\lambda_1 \approx \frac{x^{(k)T} A x^{(k)}}{x^{(k)T} x^{(k)}} .$$
It will be much easier to measure relative approximate errors in the eigenvalue than to measure the
convergence of the direction of the eigenvector. An actual implementation of the power iteration
algorithm then follows easily: [1]
function [lambda,v,converged]=pwr2(A,v,tol,maxiter)
    % ... some error checking omitted
    plambda=Inf;% eigenvalue in previous iteration
    converged = false;
    for iter=1:maxiter
        u=A*v; % update eigenvector approx
        lambda=(u'*v)/(v'*v);% Rayleigh quotient
        v=u/norm(u);% normalize
        if (abs(lambda-plambda)/abs(lambda)<tol)
            converged = true; break;% converged!
        end
        plambda=lambda;% eigenvalue in previous iteration
    end
end
Note that we have to return a Boolean flag to indicate whether the iteration process inside the
function converged or not. This is a common design feature of software implementing iterative
processes, since the iterations may or may not succeed.
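To see the iteration at work, the loop of pwr2 can be run inline on a small matrix. The sketch below is not a call of the book's function: it repeats the same steps in a script, and the test matrix, starting vector, tolerance and iteration limit are assumptions chosen for illustration.

```matlab
% Power iteration with Rayleigh quotient on a 2 x 2 symmetric matrix.
A = [2 1; 1 3];                % small symmetric test matrix (assumption)
v = [1; 0];                    % starting vector (assumption)
tol = 1e-12; maxiter = 200;    % stopping parameters (assumptions)
plambda = Inf; converged = false;
for iter = 1:maxiter
    u = A*v;                           % update eigenvector approximation
    lambda = (u'*v)/(v'*v);            % Rayleigh quotient
    v = u/norm(u);                     % normalize
    if abs(lambda - plambda)/abs(lambda) < tol
        converged = true; break;       % converged
    end
    plambda = lambda;
end
lambda_ref = max(eig(A));              % reference value from eig
fprintf('lambda = %.10f after %d iterations\n', lambda, iter);
```

The eigenvalues of this matrix are well separated, so the flag is set after a modest number of iterations; for closely spaced eigenvalues the same loop may exhaust maxiter and return converged = false.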
We conclude this section with pointing out that power iteration relies on the existence of a
dominant eigenvalue. This is not applicable in many important problems, for example for the first
order form of the equations of motion of a vibrating system. For such systems eigenvalues come in
complex conjugate pairs. There is no single dominant eigenvalue, and consequently power iteration
will not converge. This is illustrated in Figure 9.3, where we show the progress of the power iteration
for two different starting vectors for a matrix with eigenvalues $\lambda_{1,2} = \pm 0.7$. There is no progress
towards any of the eigenvectors, since the iterated vectors just switch between two different directions,
neither of which is the eigenvector direction.
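The switching between two directions is easy to reproduce. The matrix of Figure 9.3 is not given in the text, so the sketch uses a hypothetical matrix with eigenvalues +0.7 and -0.7, for which multiplying twice by A only rescales the vector by 0.49.

```matlab
% Power iteration without a dominant eigenvalue.
A = [0 0.7; 0.7 0];            % hypothetical matrix with eigenvalues +0.7 and -0.7
x = [1; 0.2];                  % arbitrary starting vector (assumption)
X = zeros(2, 5);               % normalized iterates stored as columns
for k = 1:5
    X(:,k) = x/norm(x);
    x = A*x;
end
d13 = abs(X(:,1)'*X(:,3));     % cosine of the angle between iterates 0 and 2
d12 = abs(X(:,1)'*X(:,2));     % cosine of the angle between iterates 0 and 1
fprintf('iterates 0 and 2: %g, iterates 0 and 1: %g\n', d13, d12);
```

Every second iterate points in the same direction as the starting vector, so the sequence of directions never settles.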
In what follows we shall work with real symmetric matrices, unless we explicitly say
otherwise. The main reasons: these matrices are very important in practice, we don’t
have to treat special cases such as missing eigenvectors, and the eigenvalues and eigen-
vectors are real.
Illustration 1
Figure 9.4 shows the model of two linked buildings. Each building is represented by a concentrated
mass m standing in for the total mass of the floor, and springs linking the floors kc which would be
representative of the total horizontal stiffness of the columns in between the floors (or the ground).
The buildings are linked at each floor with another spring kℓ , representative of walkways (bridges)
that connect the buildings. The masses in the system are numbered as shown.
1
See: aetna/EigenvalueProblems/pwr2.m
9.2 Power iteration 185
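Fig. 9.3. The effect of several matrix-vector multiplications, shown for two different starting vectors (the vectors $A^0 x, \ldots, A^4 x$ together with the eigenvectors $v_1$, $v_2$). Eigenvalues $\lambda_{1,2} = \pm 0.7$
[Figure 9.4: model of the two linked buildings. The masses are numbered 1 to 5 from the top down in the first building, and 6 to 10 from the top down in the second building.]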
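The mass matrix is simply M = m × (10 × 10 identity matrix). The stiffness matrix K has the structure
shown below. Note that if the buildings are not linked by the walkways ($k_\ell = 0$), the stiffness matrix
will split into two uncoupled 5 × 5 diagonal blocks that correspond to each building separately.
Nonzero walkway stiffness will couple the vibrations of the two buildings together.
$$
K = \begin{bmatrix}
k_c{+}k_\ell & -k_c & 0 & 0 & 0 & -k_\ell & 0 & 0 & 0 & 0 \\
-k_c & 2k_c{+}k_\ell & -k_c & 0 & 0 & 0 & -k_\ell & 0 & 0 & 0 \\
0 & -k_c & 2k_c{+}k_\ell & -k_c & 0 & 0 & 0 & -k_\ell & 0 & 0 \\
0 & 0 & -k_c & 2k_c{+}k_\ell & -k_c & 0 & 0 & 0 & -k_\ell & 0 \\
0 & 0 & 0 & -k_c & 2k_c{+}k_\ell & 0 & 0 & 0 & 0 & -k_\ell \\
-k_\ell & 0 & 0 & 0 & 0 & k_c{+}k_\ell & -k_c & 0 & 0 & 0 \\
0 & -k_\ell & 0 & 0 & 0 & -k_c & 2k_c{+}k_\ell & -k_c & 0 & 0 \\
0 & 0 & -k_\ell & 0 & 0 & 0 & -k_c & 2k_c{+}k_\ell & -k_c & 0 \\
0 & 0 & 0 & -k_\ell & 0 & 0 & 0 & -k_c & 2k_c{+}k_\ell & -k_c \\
0 & 0 & 0 & 0 & -k_\ell & 0 & 0 & 0 & -k_c & 2k_c{+}k_\ell
\end{bmatrix}
$$
The free-vibration eigenvalue problem for the angular frequency ω and the mode shape z reads
$$\omega^2 M z = K z .$$
Since the mass matrix is just a multiple of the identity, this may be written as
$$A z = \lambda z ,$$
where we define
$$A = \frac{1}{m} K , \quad \text{and} \quad \lambda = \omega^2 .$$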
As a first exercise we apply the power method to the computation of the largest frequency of
vibration. We assume m = 133, kc = 61000, kℓ = 3136 (in consistent units). The solution with
MATLAB's eig is written for the eigenvalue problem as
[M,K,A] = lb_prop;
[V,D]=eig(A) % This may be replaced with [V,D]=eig(K,M)
disp('Frequencies [Hz]')
sqrt(diag(D)')/(2*pi)
which yields the resulting frequencies as
Frequencies [Hz]
ans =
0.9702 1.4614 2.8319 3.0354 4.4641 4.5960 5.7348 5.8380 6.5408 6.6315
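The function lb_prop is not listed here. As a minimal sketch, the same matrices could be assembled directly from the block structure of K shown above. The use of kron and all variable names below are our assumptions, not the toolbox implementation.

```matlab
% Assemble the linked-buildings matrices directly (a stand-in for lb_prop).
m  = 133;  kc = 61000;  kl = 3136;         % values quoted in the text
% One building: mass 1 is the top floor, mass 5 is tied to the ground.
Kb = kc*(2*eye(5) - diag(ones(4,1),1) - diag(ones(4,1),-1));
Kb(1,1) = kc;                              % top floor has a single column spring
% Two uncoupled buildings plus walkway springs linking mass i to mass i+5.
K  = kron(eye(2),Kb) + kl*kron([1 -1; -1 1],eye(5));
M  = m*eye(10);
A  = K/m;
f  = sort(sqrt(eig(A)))/(2*pi);            % natural frequencies in Hz
disp(f')
```

Setting kl = 0 in this sketch removes the second kron term, and K falls apart into the two uncoupled 5 × 5 blocks mentioned above.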
Applying the power method as shown in the script lb_A_power² with a random starting vector yields
an approximation of the highest eigenvalue, but it is not anywhere close to being converged. This
should not surprise us. We expect the convergence to be slow because the two largest eigenvalues are
very closely spaced (the largest eigenvalue is only weakly dominant): see Figure 9.5. Together with
the inherent symmetry of the structure, this makes for an interesting experiment: see below.
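A rough estimate of how slow the convergence must be can be made from the frequencies listed above. It assumes the usual result that each power iteration reduces the error of the eigenvector by a factor of about $|\lambda_9/\lambda_{10}|$, the ratio of the two largest eigenvalues, with $\lambda = \omega^2 = (2\pi f)^2$:

```latex
\frac{\lambda_9}{\lambda_{10}} = \left(\frac{f_9}{f_{10}}\right)^2
  = \left(\frac{6.5408}{6.6315}\right)^2 \approx 0.973 ,
\qquad
k \approx \frac{\ln 10^{-6}}{\ln 0.973} \approx 500 .
```

Under this assumption, on the order of 500 iterations would be needed to reduce the eigenvector error by six orders of magnitude.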
Suggested experiments
1. Use a starting vector in the form of ones(10,1). Do we get convergence to the largest eigenvalue?
If not, try to explain. [Difficult]
[Fig. 9.5: mode shapes 9 and 10 of the linked buildings, f = 6.5408 Hz and f = 6.6315 Hz.]
2 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_power.m
9.3 Inverse power iteration
The power iteration can be used to compute the eigenvalue/eigenvector pair for the eigenvalue with
the largest absolute value. The inverse power iteration can look at the other end of the spectrum,
at the smallest eigenvalues.
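The eigenvalues of a matrix A and of its inverse $A^{-1}$ are related as follows: Provided the matrix is
invertible (and therefore does not have λ = 0 among its eigenvalues), we can multiply the eigenvalue
problem for A
$$Ax = \lambda x$$
by $\lambda^{-1} A^{-1}$ to obtain
$$A^{-1} x = \frac{1}{\lambda} x .$$
The inverse has the same eigenvectors as A, and its eigenvalues are the reciprocals 1/λ.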
Therefore, to find the eigenvalue/eigenvector pair of A for the smallest eigenvalue in absolute value
we can perform the power iteration on $A^{-1}$. We would not wish to invert the matrix, of course, and
so we formulate the algorithm as
$x^{(0)}$ given
for $k = 1, 2, \ldots$
    $A x^{(k)} = x^{(k-1)}$ % solve
    $x^{(k)} = x^{(k)} / \| x^{(k)} \|$
where the first line of the loop simply means: solve for $x^{(k)}$ from $A x^{(k)} = x^{(k-1)}$. (Compare with
the power iteration algorithm on page 183; there is only one change, but an important one.) Since the
solution is needed during each iteration, we may conveniently and efficiently take advantage of the
LU factorization.
The inverse power iteration algorithm is summarized in the code below. Note the changes with
respect to the power iteration in the first two lines in the for loop.³
function [lambda,v,converged]=invpwr2(A,v,tol,maxiter)
% ... some error checking omitted
plambda=Inf;% initialize eigenvalue in previous iteration
[L,U,p]=lu(A,'vector');%Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx, equiv. to u=A\v
    lambda=(v'*v)/(u'*v);% Rayleigh quotient: note the inverse
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end
3 See: aetna/EigenvalueProblems/invpwr2.m
Note the shortcut to the value of the Rayleigh quotient: since $u = A^{-1} v$, the vector product (u'*v)
already incorporates the multiplication with $A^{-1}$. Then, because we are iterating to find 1/λ, we
invert the fraction.
The inverse power iteration also relies on the existence of a dominant eigenvalue. Dominant
here means that the smallest eigenvalue should be strictly smaller in absolute value than any other
eigenvalue of A. We assume again that the eigenvalues are ordered in decreasing magnitude, and for
the success of the inverse iteration we require
$$|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_{n-1}| > |\lambda_n| .$$
Analogously to the power iteration, the convergence of the inverse power iteration will be faster for
very dominant eigenvalues, $|\lambda_{n-1}| \gg |\lambda_n|$, and painfully slow for $|\lambda_{n-1}| \approx |\lambda_n|$.
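The following is a minimal sketch of the iteration on a small test problem. The 3 × 3 matrix and the starting vector are assumptions chosen for illustration (they are not taken from the aetna toolbox); the smallest eigenvalue of this matrix is known to be 2 − √2.

```matlab
% Inverse power iteration on an assumed 3x3 symmetric test matrix.
A = [2 -1 0; -1 2 -1; 0 -1 2];  % eigenvalues 2-sqrt(2), 2, 2+sqrt(2)
v = [1; 1; 1];                  % assumed starting vector
[L,U,p] = lu(A,'vector');       % factorize once, reuse in every iteration
for iter = 1:50
    u = U\(L\v(p));             % solve A*u = v
    lambda = (v'*v)/(u'*v);     % inverted Rayleigh quotient of inv(A)
    v = u/norm(u);              % normalize
end
fprintf('lambda = %.10f, exact = %.10f\n', lambda, 2-sqrt(2));
```

Here the ratio of the two smallest eigenvalues is about 0.29, so the smallest eigenvalue is strongly dominant for the inverse iteration and a few dozen iterations are more than enough.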
Illustration 2
Here we illustrate the convergence of the inverse power iteration on the example of two symmetric
matrices.⁴ We construct two random matrices with spectra that are identical except for the smallest
eigenvalue. The smallest eigenvalue is dominant in one matrix, and rather close to the second
eigenvalue in magnitude in the second matrix. Consequently Figure 9.6 displays quite disparate
convergence behaviors of the inverse power iteration: very good in the first case, poor in the second.
[Plot: relative eigenvalue error (logarithmic scale) versus iteration; curves labeled λn = 13 and λn = 6.1.]
Fig. 9.6. The relative error of the smallest eigenvalue for two symmetric 13 × 13 matrices with eigenvalues
[13, 14 : 25] and [6.1, 14 : 25].
Illustration 3
Apply the inverse power iteration method to the structure described in the Illustration on page 184.
The inverse power method as shown in the script lb_A_invpower⁵ with a random starting vector
yields an approximation of the lowest eigenvalue with satisfactory convergence. The first two mode
shapes are shown in Figure 9.7 (only the mode on the left was computed with inverse power iteration,
the mode on the right was added using eig()).
4 See: aetna/EigenvalueProblems/test_invpwr_conv1.m
5 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_invpower.m
[Fig. 9.7: mode shapes 1 and 2 of the linked buildings, f = 0.97015 Hz and f = 1.4614 Hz.]
Suggested experiments
1. Change the stiffness of the link spring to kℓ = 0. Does the inverse power iteration converge? If
not, why?
Consider the effect of adding an identity −σx = −σx to the eigenvalue problem:
$$Ax - \sigma x = \lambda x - \sigma x .$$
At first blush, this does not seem to have any effect, but rewritten as
$$(A - \sigma \mathbf{1}) x = (\lambda - \sigma) x$$
or
$$(A - \sigma \mathbf{1}) x = \varrho x$$
it is revealed that it leads to a slightly different eigenvalue problem, with the same eigenvector, but
a shifted eigenvalue $\varrho = \lambda - \sigma$. This leads to the idea of searching for an eigenvalue/eigenvector pair
of the shifted matrix, not the original one, because the smallest eigenvalue min |λ| can be made to
correspond to min |ϱ| ≈ 0. Then, the eigenvalue min |ϱ| could be very strongly dominant, since
1/ min |ϱ| is going to be large compared to the reciprocals of the other shifted eigenvalues.
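Figure 9.8 illustrates this concept with an example with four eigenvalues
$$\lambda_1 = 2.80, \quad \lambda_2 = 1.167, \quad \lambda_3 = 0.609, \quad \lambda_4 = 0.452 .$$
The ratio $\lambda_3/\lambda_4 \approx 1.34$. Applying a shift σ = 0.3 leads to a shifted problem with eigenvalues
$$\varrho_1 = 2.50, \quad \varrho_2 = 0.867, \quad \varrho_3 = 0.309, \quad \varrho_4 = 0.152 ,$$
and the ratio $\varrho_3/\varrho_4 \approx 2.04 > 1.34$. The larger this ratio, the better. The solution of the inverse
power iteration on the shifted problem will converge faster.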
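The effect of the shift on the ratio that governs convergence can be checked with a few lines. This is only a sketch of the arithmetic, using the eigenvalues rounded as in the caption of Figure 9.9, so the printed ratios agree with the text only to about two digits.

```matlab
lam    = [2.80 1.167 0.609 0.452];  % eigenvalues of the example (rounded)
ratio0 = lam(3)/lam(4);             % unshifted: governs convergence to lam(4)
for sigma = [0.3 0.4]
    rho = lam - sigma;              % shifted eigenvalues
    fprintf('sigma = %.1f: rho3/rho4 = %.2f (unshifted %.2f)\n', ...
            sigma, rho(3)/rho(4), ratio0);
end
```

The closer the shift is to the smallest eigenvalue, the larger the ratio becomes.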
[Fig. 9.8: the eigenvalues λ1, ..., λ4 and their reciprocals on the real axis, before and after the shift ϱ = λ − σ, σ > 0.]
[Plot: relative eigenvalue error (logarithmic scale) versus iteration; curves labeled no shift, σ = 0.3, and σ = 0.4.]
Fig. 9.9. The relative error of the smallest eigenvalue λ4 for the symmetric 4 × 4 matrices with eigenvalues
[2.80, 1.167, 0.609, 0.452]. Comparison of un-shifted and shifted inverse power iteration.
Figure 9.9 shows the effect of shifting. Two shifts are applied, one corresponding to Figure 9.8
(σ = 0.3), and one even closer to the eigenvalue λ4 in magnitude, σ = 0.4. The effect of shifting is
quite dramatic. The closer we can guess the magnitude of the smallest eigenvalue (so that we can set
the shift equal to the guess of the eigenvalue), the higher the convergence rate.
The inverse power iteration algorithm with shifting is given in MATLAB code below.⁶
function [lambda,v,converged]=sinvpwr2(A,v,sigma,tol,maxiter)
% ... some error checking omitted
n=size(A,1);
plambda=Inf;% initialize eigenvalue in previous iteration
v=v/norm(v);% normalize
[L,U,p]=lu((A-sigma*eye(n)),'vector');%Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx, equiv. to u=(A-sigma*eye(n))\v
    lambda=(u'*A*u)/(u'*u);% Rayleigh q. using the definition
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end
6 See: aetna/EigenvalueProblems/sinvpwr2.m
Illustration 4
Consider the following eigenvalue problem with a 3 × 3 matrix whose eigenvalues are 1, 2, 4.⁷
A =[ 2.486697669648270 -0.326429831194336 -1.065046141649933
-0.326429831194336 2.167809045836811 1.032918306492685
-1.065046141649933 1.032918306492685 2.345493284514918];
n=3;
[V,D]=eig(A)
tol =1e-6; maxiter= 24;
v=rand(n,1);% starting vector
sigma =1.6;% the shift
[lambda,phi,converged]=sinvpwr2(A,v,sigma,tol,maxiter)
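We guessed that the smallest eigenvalue was close to 1.6 and applied the shift 1.6. The shifted inverse
power iteration produced the eigenvalue approximation of 2, instead of the smallest eigenvalue we
hoped to find. The reason is that the iteration converges to the eigenvalue closest to the shift, and
|2 − 1.6| = 0.4 is smaller than |1 − 1.6| = 0.6.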
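The outcome can be reproduced without sinvpwr2. The sketch below first finds which of the quoted eigenvalues is nearest to the shift, and then runs a bare-bones shifted inverse iteration; the fixed starting vector and the fixed number of iterations are our assumptions.

```matlab
A = [ 2.486697669648270 -0.326429831194336 -1.065046141649933
     -0.326429831194336  2.167809045836811  1.032918306492685
     -1.065046141649933  1.032918306492685  2.345493284514918];
lam   = [1 2 4];                 % eigenvalues quoted in the text
sigma = 1.6;                     % the shift
[~,nearest] = min(abs(lam - sigma));
fprintf('Eigenvalue closest to the shift: %g\n', lam(nearest));
% Shifted inverse iteration with a fixed (assumed) starting vector.
v = ones(3,1);
B = A - sigma*eye(3);
for iter = 1:100
    u = B\v;                     % solve (A - sigma*1)*u = v
    v = u/norm(u);               % normalize
end
lambda = v'*A*v;                 % Rayleigh quotient (v has unit length)
fprintf('Shifted inverse iteration gives lambda = %.8f\n', lambda);
```

A shift below 1.5 would be closer to the eigenvalue 1 than to the eigenvalue 2, and the iteration would then pick out the smallest eigenvalue.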
Illustration 5
Apply the inverse power iteration method to the structure described in Illustration on page 184, but
change the stiffness of the link spring to kℓ = 0. Would shifting help with convergence to the first
frequency?
9.4 Simultaneous power iteration
Fig. 9.10. The effect of several matrix-vector multiplications. Eigenvalues λ1 = 1.11, λ2 = 0.556. No effort
is made to maintain the iteration vectors linearly independent.
The fact that the most dominant eigenvector will be swamping out all the other eigenvectors is
going to keep us from obtaining reasonable approximations of the other eigenvectors. In other words,
since the dominant eigenvector components will be getting magnified more than the components
of the other vectors, eventually all the vectors on which we iterate will become aligned with the
dominant eigenvector. Figure 9.10 illustrates the effect of simultaneous iteration on two vectors:
the starting vectors are $w_1^{(0)}, w_2^{(0)}$. After just four iterations the vectors $w_1^{(4)}, w_2^{(4)}$ are pretty
much aligned with the dominant eigenvector $v_1$. They are still linearly independent, but only barely.
So iteration on multiple vectors will be tricky. The desired eigenvectors will still be present,
but they will be hard to extract from such an ill-conditioned basis (all vectors essentially parallel).
Therefore, similarly to power (inverse power) iteration where we normalized the approximation in
each step so as to avoid underflow or overflow, we will normalize the set of vectors on which we iterate:
not only so that they are of unit magnitude, but also so that they are mutually orthogonal. (Technical
term: the vectors are orthonormal.) An excellent tool for this purpose is the QR factorization: the
columns of the matrix Q are orthonormal, and they span the same space as the columns of the input
matrix.
In this way we get the so-called simultaneous power iteration (also called block power
iteration). The starting vectors will be arranged as columns of a rectangular matrix
$$W^{(0)} = \left[ w_1^{(0)}, w_2^{(0)}, \ldots, w_p^{(0)} \right] .$$
The algorithm will repeatedly multiply the iterated n × p matrix $W^{(k)}$ by the n × n matrix A and
also orthogonalize the columns of the iterated matrix by the QR factorization.
$W^{(0)}$ given
for $k = 1, 2, \ldots$
    $W^{(k)} = A W^{(k-1)}$    (9.1)
    $QR = W^{(k)}$ % compute QR factorization
    $W^{(k)} = Q$
The eigenvalue approximations may be computed as before from the Rayleigh quotient
$$\lambda_j \approx w_j^{(k)T} A w_j^{(k)} .$$
Note that we have omitted dividing by $w_j^{(k)T} w_j^{(k)}$ because these vectors are orthonormal:
$$w_j^{(k)T} w_m^{(k)} = \begin{cases} 1, & \text{when } j = m \\ 0, & \text{otherwise.} \end{cases}$$
Figure 9.11 shows the effect of orthogonalization for the same matrix and the same starting vectors
as in Figure 9.10, but this time with QR factorization. The iterated vectors now converge to the two
eigenvectors.
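A minimal sketch of the simultaneous power iteration (9.1) follows. The test matrix and the two starting vectors are assumptions chosen for illustration; the two largest eigenvalues of this matrix are 2 + √2 and 2.

```matlab
A = [2 -1 0; -1 2 -1; 0 -1 2];   % assumed symmetric test matrix
W = [1 0; 0 1; 0 0];             % two assumed starting vectors (n x p, p = 2)
for k = 1:200
    W = A*W;                     % multiply
    [W,~] = qr(W,0);             % economy QR: orthonormal columns
end
lambda = diag(W'*A*W);           % Rayleigh quotients; no division needed
disp(lambda')
```

Without the QR step both columns of W would line up with the dominant eigenvector; with it, the first column converges to the dominant eigenvector and the second to the eigenvector of the next eigenvalue.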
Fig. 9.11. The effect of several matrix-vector multiplications. Eigenvalues λ1 = 1.11, λ2 = 0.556. Iteration
vectors are orthogonalized after each iteration.
In order to switch from the block power iteration to the block inverse power iteration we just
switch the one line that refers to the repeated multiplication with the coefficient matrix so that the
multiplication is with its inverse
$W^{(0)}$ given
for $k = 1, 2, \ldots$
    $A W^{(k)} = W^{(k-1)}$ % solve    (9.2)
    $QR = W^{(k)}$ % compute QR factorization
    $W^{(k)} = Q$
The MATLAB code for the block inverse power iteration is given below. Note that the so-called
economy QR factorization is used: the matrix Q is rectangular rather than square.⁸
function [lambda,v,converged]=binvpwr2(A,v,tol,maxiter)
% ... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);% eigenvalues in previous iteration
lambda =plambda;
[v,r]=qr(v,0);% normalize
[L,U,p] =lu(A,'vector');% Factorized for efficiency
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p,:)); % update vectors
    for j=1:nvecs % Rayleigh quotient
        lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j));
    end
    [v,r]=qr(u,0);% economy QR factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end
8 See: aetna/EigenvalueProblems/binvpwr2.m
Note that when we’re computing the Rayleigh quotient we have to account for u being the result of
the inverse power iteration. Also, we could have replaced
lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j)) with
lambda(j)= 1.0./(u(:,j)'*v(:,j)) (why?).
Shifting could also be applied to block inverse power iteration. Even though only one shift value
can be used, the beneficial effect applies to all iterated eigenvectors: The iteration will converge to
the eigenvectors with eigenvalues closest to the shift.
Illustration 6
Apply the block inverse power iteration method to the structure described in Illustration on page 184,
but change the stiffness of the link spring to kℓ = 0. Use it to find the first two modes.
A possible solution is given in the script lb_A_blinvpower.⁹
Suggested experiments
1. Compare the mode shapes obtained above with the solution provided by MATLAB's eig. The
mode shapes are different. Does it matter?
9.5 QR iteration
An obvious step to take with simultaneous power iteration is to compute all the eigenvalues and
eigenvectors of the n × n matrix A by iterating on n vectors at the same time. This is shown in the
following algorithm (note the choice of the initial orthonormal vectors as the columns of an identity
matrix):
$W^{(0)} = \mathbf{1}$
for $k = 1, 2, \ldots$
    $W^{(k)} = A W^{(k-1)}$    (9.3)
    $QR = W^{(k)}$ % compute QR factorization
    $W^{(k)} = Q$
The matrix $W^{(k)}$ converges to a matrix of eigenvectors. Recall that the matrix of eigenvectors can
make the matrix A similar to a diagonal matrix, the matrix of the eigenvalues (see (4.13)). The
matrix $W^{(k)}$ is only close to the matrix of eigenvectors (and getting closer with the iteration), and
therefore the matrix
$$A^{(k)} = W^{(k)T} A W^{(k)}$$
will be only close to a diagonal matrix, not perfectly diagonal, and the numbers on the diagonal will
approximate the eigenvalues.
9 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_blinvpower.m
It can be shown that the above simultaneous iteration is equivalent to the so-called QR iteration
(note well that this is different from QR factorization). The QR iteration is given by the following
algorithm:
$A^{(0)} = A$
for $k = 1, 2, \ldots$
    $QR = A^{(k-1)}$ % compute QR factorization    (9.4)
    $A^{(k)} = RQ$ % note the switched factors
The matrix $A^{(k)}$ that appears in the last step of (9.4) is the same as $A^{(k)} = W^{(k)T} A W^{(k)}$ in
the algorithm (9.3) (explained in detail in Trefethen, Bau (1997)). In this sense the two algorithms
are equivalent. The script qr_power_correspondence¹⁰ demonstrates the equivalence of the two
algorithms for a randomly generated matrix.
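The correspondence can also be tried out on a small example. The sketch below is not the script qr_power_correspondence; it is a stand-in under our own assumptions (a fixed 3 × 3 symmetric test matrix with eigenvalues 3 − √3, 3, 3 + √3, and a fixed number of iterations). It runs (9.3) and (9.4) side by side and compares the diagonals, which are insensitive to the sign of the columns of Q.

```matlab
A0 = [4 1 0; 1 3 1; 0 1 2];     % assumed symmetric test matrix
n  = size(A0,1);
W  = eye(n);                    % simultaneous iteration, algorithm (9.3)
Ak = A0;                        % QR iteration, algorithm (9.4)
for k = 1:60
    [W,~] = qr(A0*W);           % multiply, then orthonormalize
    [Q,R] = qr(Ak);             % factorize ...
    Ak    = R*Q;                % ... and multiply in the switched order
end
d_sim = diag(W'*A0*W);          % Rayleigh quotients from (9.3)
d_qr  = diag(Ak);               % diagonal of the QR iterate from (9.4)
disp([d_sim, d_qr])
```

Both columns of the printed array should agree to rounding error, and both approach the eigenvalues of the test matrix.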
The QR iteration (9.4) is amenable to several significant enhancements as pointed out below.
The QR iteration is one of the most important algorithms used in eigenvalue/eigenvector problems.
First we will inspect the properties of the transformations effected by the above algorithm.
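The matrix $A^{(k)}$ in (9.4) converges to an upper triangular matrix. In fact, for our assumption of A
being symmetric, $A^{(k)}$ converges to a diagonal matrix. In the limit of k → ∞ the transformation
$$A^{(k)} = W^{(k)T} A W^{(k)}$$
approaches the Schur factorization
$$T = U^T A U , \qquad (9.5)$$
in which the matrix T is upper triangular. This can be shown as follows: the square matrix A has at
least one eigenvalue and one eigenvector. Therefore, we can write (for simplicity the procedure is
demonstrated here for a 6 × 6 matrix; the symbols $\bullet$, $\diamondsuit$, ... mean here general complex numbers;
zeros are not shown)
$$A U_1 = U_1 \begin{bmatrix}
\lambda_1 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit
\end{bmatrix} ,$$
where the first column of $U_1$ is an eigenvector of A: $A u_1 = \lambda_1 u_1$, and the other columns of $U_1$ are
arbitrarily selected to form an orthonormal basis (this is always possible). Now we write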
10 See: aetna/EigenvalueProblems/qr_power_correspondence.m
11 Hermitian matrix: $A = \bar{A}^T$, where $\bar{A}^T$ is the so-called conjugate transpose (its elements are complex
conjugates of the transposed matrix).
12 Defective matrix does not have a full set of eigenvectors. Example: [0, 1; 0, 0]. Double eigenvalue 0, a
single eigenvector [1; 0].
13 Unitary matrix: complex matrix U such that $\bar{U}^T U = U \bar{U}^T = \mathbf{1}$. For real matrices unitary = orthogonal.
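$$U_1^T A U_1 = \begin{bmatrix}
\lambda_1 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit
\end{bmatrix}$$
and we apply exactly the same argument to the smaller 5 × 5 matrix (the $\diamondsuit$ elements). This again
leads to the first column having zeros below the diagonal, which we write as
$$U_2^T \begin{bmatrix}
\lambda_1 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit
\end{bmatrix} U_2 = U_2^T U_1^T A U_1 U_2 = \begin{bmatrix}
\lambda_1 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \lambda_2 & \diamondsuit & \diamondsuit & \diamondsuit & \diamondsuit \\
 & & \spadesuit & \spadesuit & \spadesuit & \spadesuit \\
 & & \spadesuit & \spadesuit & \spadesuit & \spadesuit \\
 & & \spadesuit & \spadesuit & \spadesuit & \spadesuit \\
 & & \spadesuit & \spadesuit & \spadesuit & \spadesuit
\end{bmatrix} .$$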
Since we can define a unitary matrix as $U = U_1 U_2 \cdots U_5$ we have completed the Schur factorization.
This construction highlights the main attraction of the Schur factorization: the upper triangular
matrix on the right-hand side has the eigenvalues of A on the diagonal. It also points to a major
difficulty: in order to compute the Schur factorization we have to solve a sequence of eigenvalue
problems. This is not possible in a finite number of steps in general, as follows from the impossibility
of finding the roots of an arbitrarily high order polynomial by explicit formulas. As a consequence,
computing the Schur factorization must be an iterative procedure, and in fact the QR iteration is
precisely such a procedure.
[Four panels showing the rows and columns of the iterated matrix, with the panel titles:
Approx λ = [4.980, 5.016, −1.332, 3.872, 4.702, 3.263]
Approx λ = [6.854, 4.528, 4.623, 1.073, 0.516, 2.907]
Approx λ = [6.994, 4.576, 4.897, 3.491, −2.351, 2.894]
Approx λ = [7.000, 4.678, 4.812, 3.954, −2.837, 2.893]]
Fig. 9.12. QR factorization example. Matrix eigenvalues [−3, 3, 4, 4.5, 5, 7]. QR iterations 1, 5, 9, 13 are
shown top to bottom, left to right.
The off-diagonal elements decrease in magnitude with successive iterations, and the diagonal elements
come to dominate. Figure 9.13 shows a similar computation as in Figure 9.12, but with a different
matrix. This time the QR iteration gets stuck on the three eigenvalues in the top left corner, and the
iteration does not result in a diagonal matrix. The lack of convergence is due to the eigenvalues of
equal absolute value (5, −5, 5), and additional sophistication is needed to extract them.
Shifting may be introduced into the QR iteration similarly as in the simultaneous inverse iteration.
The QR iteration may in fact be shown to be equivalent not only to simultaneous iteration, but
also to simultaneous inverse iteration. Therefore, the shifting will have a very similar effect: faster
convergence in the lower eigenvalues. The shift can be selected in various judicious ways. Here we
will discuss a simple choice: the Rayleigh quotient shift. We have seen that the QR iteration was
successively transforming the original matrix to a diagonal matrix. The elements on the diagonal of
the iterated matrix are in fact the Rayleigh quotients. A good shift therefore is the element $A^{(k-1)}_{nn}$
of the iterated matrix. The shift is applied as
$A^{(0)} = A$
for $k = 1, 2, \ldots$
    $\rho = A^{(k-1)}_{nn}$    (9.6)
    $QR = A^{(k-1)} - \rho \mathbf{1}$ % compute QR factorization
    $A^{(k)} = RQ + \rho \mathbf{1}$
[Four panels showing the rows and columns of the iterated matrix, with the panel titles:
Approx λ = [1.257, 4.873, −0.387, 2.183, 1.108, 1.967]
Approx λ = [0.208, 4.915, −0.163, 3.041, 1.997, 1.001]
Approx λ = [0.180, 4.913, −0.094, 3.001, 2.000, 1.000]
Approx λ = [0.180, 4.913, −0.093, 3.000, 2.000, 1.000]]
Fig. 9.13. QR factorization example. Matrix eigenvalues [5, −5, 5, 3, 2, 1]. QR iterations 1, 5, 9, 13 are shown
top to bottom, left to right.
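function A = qrstepS(A) % one step of the shifted QR iteration (9.6); this function line is assumed
[m,n]=size(A);
rho = A(n,n); % shift
[Q,R]=qr(A-rho*eye(n,n));
A = R*Q + rho*eye(n,n);
end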
In practice, once an eigenvalue converges, the corresponding row and column are removed from
the matrix, and the QR iteration continues on the smaller remaining matrix. This is called deflation.
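Shifting and deflation together can be sketched as follows. The test matrix (eigenvalues 3 − √3, 3, 3 + √3), the tolerance, and the iteration limit are our assumptions, and the Rayleigh quotient shift is not guaranteed to converge for every matrix, so this is an illustration rather than a robust solver.

```matlab
A = [4 1 0; 1 3 1; 0 1 2];        % assumed symmetric test matrix
n = size(A,1);
lambda = zeros(n,1);
T = A;
for m = n:-1:2
    for iter = 1:100
        rho = T(m,m);                      % Rayleigh quotient shift
        [Q,R] = qr(T - rho*eye(m));
        T = R*Q + rho*eye(m);
        if norm(T(m,1:m-1)) < 1e-12        % last row decoupled?
            break
        end
    end
    lambda(m) = T(m,m);                    % converged eigenvalue
    T = T(1:m-1,1:m-1);                    % deflation: drop row and column m
end
lambda(1) = T(1,1);
disp(sort(lambda)')
```

Each pass of the outer loop works on a matrix that is one row and one column smaller, which is where the savings of deflation come from.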
Illustration 7
Apply the shifted QR iteration method to the structure described in the Illustration on page 184.
The shifted QR algorithm using qrstepS in the script lb_A_qr¹⁷ does not in fact converge very
well. The basic algorithm without shifting¹⁸ actually works better. Even better is the strategy of
shifting known under the name of Wilkinson (James Hardy Wilkinson, 1919 - 1986, was a giant in
the 20th century history of numerical algorithms)¹⁹.
Suggested experiments
1. Change the stiffness of the link spring to kℓ = 0. Does the QR iteration converge? Try the
variants with shifting.
17 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_qr.m
18 See: aetna/EigenvalueProblems/qrstep.m
19 See: aetna/EigenvalueProblems/qrstepW.m
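9.6 Spectrum slicing
To find the number of eigenvalues of a symmetric matrix A that are less than a given number σ, form
the $LDL^T$ factorization
$$A - \sigma \mathbf{1} = L D L^T$$
and count the number of negative elements (these are the pivots) in the diagonal matrix D.
This spectrum slicing approach is also easily extended to the generalized eigenvalue problem.
To find the number of eigenvalues of $Kx = \lambda M x$ less than σ, form the $LDL^T$ factorization of the
matrix
$$K - \sigma M = L D L^T .$$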
Illustration 8
For the mechanical system of Figure 5.1 the mass and stiffness matrices are
$$M = \begin{bmatrix} m & 0 & 0 \\ 0 & m & 0 \\ 0 & 0 & m \end{bmatrix} , \quad
K = \begin{bmatrix} 2k & -k & 0 \\ -k & 2k & -k \\ 0 & -k & k \end{bmatrix} .$$
Here k = 61, all the masses are equal m = 1.3. For instance, we can check how many natural
frequencies lie below 0.5 Hz. We form the matrix
$$A = K - (0.5 \times 2\pi)^2 M$$
Since there is only one negative number on the diagonal of U (that is, on the diagonal of the matrix
D from the $LDL^T$ matrix factorization) we conclude that only one natural frequency lies below
0.5 Hz.
Next we check how many natural frequencies lie below 2.0 Hz. The factorization gives
$$L = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0.732 & 0.633 & 1 \end{bmatrix} , \quad
U = \begin{bmatrix} -83.3 & -61 & 0 \\ 0 & -61 & -144 \\ 0 & 0 & 30.3 \end{bmatrix} , \quad
P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} ,$$
which we compare with the frequencies given in Section 5.4 and conclude that something is wrong:
there are two negative numbers on the diagonal, but all three frequencies are in fact below 2.0 Hz.
The reason is that once the partial pivoting introduces a non-identity permutation matrix, so that
$$LU = PA ,$$
the congruence that the Sylvester theorem relies upon is no longer applicable. In fact, the product
LU is no longer symmetric and it is not possible to factor it into $LDL^T$. The pivoting has to be done
carefully to preserve the symmetry of the resulting product of factors. For instance, the MATLAB
function ldl produces directly the $LDL^T$ factorization and returns the “psychologically” lower-
triangular factor L. We can write [L,D] =ldl(A), with the result
$$L = \begin{bmatrix} 1 & 0 & 0 \\ 0.732 & 0.423 & 1 \\ 0 & 1 & 0 \end{bmatrix} , \quad
D = \begin{bmatrix} -83.3 & 0 & 0 \\ 0 & -144 & 0 \\ 0 & 0 & -12.8 \end{bmatrix} .$$
Now we see three negative numbers on the diagonal of D, which indeed corresponds to our prior
knowledge that all three frequencies are below 2.0 Hz.
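The two counts of this illustration can be reproduced with a short sketch. It assumes that no pivoting is needed, that is, that no zero pivot occurs, which happens to hold for these two shifts; in general a symmetric pivoting routine such as ldl is the safer choice. The loop is our own minimal symmetric elimination, not a toolbox function.

```matlab
k = 61;  m = 1.3;                          % values from the illustration
K = k*[2 -1 0; -1 2 -1; 0 -1 1];
M = m*eye(3);
f_limits = [0.5 2.0];                      % frequency limits in Hz
count = zeros(size(f_limits));
for i = 1:numel(f_limits)
    S = K - (2*pi*f_limits(i))^2*M;        % shifted matrix K - sigma*M
    n = size(S,1);
    d = zeros(n,1);                        % pivots = diagonal of D
    for j = 1:n                            % symmetric elimination, no pivoting
        d(j) = S(j,j);
        S(j+1:n,j+1:n) = S(j+1:n,j+1:n) - S(j+1:n,j)*S(j,j+1:n)/d(j);
    end
    count(i) = sum(d < 0);                 % negative pivots = eigenvalues below
end
disp(count)
```

The pivots differ from the diagonal of D returned by ldl because the elimination order differs, but by the Sylvester theorem the number of negative ones is the same.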
9.7 Generalized eigenvalue problem
$$L D L^T = K$$
by defining $R = L\sqrt{D}$ so that $RR^T = K$. We see that we need to work with a positive definite
stiffness matrix so that the diagonal matrix D will give real roots. With the Cholesky factors at
hand we transform the generalized eigenvalue problem $Kz = \omega^2 M z$ as
$$Kz = RR^T z = \omega^2 M z$$
$W^{(0)}$ given
for $k = 1, 2, \ldots$
    $K W^{(k)} = M W^{(k-1)}$ % solve
    $QR = W^{(k)}$ % compute QR factorization
    $W^{(k)} = Q$
The eigenvalues may be estimated during the iteration using the Rayleigh quotient. For the
generalized eigenvalue problem the Rayleigh quotient is computed from
$$\omega^2 M z = K z \quad \Rightarrow \quad \omega^2 = \frac{z^T K z}{z^T M z} .$$
The MATLAB code for the generalized eigenvalue problem solved with block inverse power
iteration is given below:²⁰
20 See: aetna/EigenvalueProblems/gepbinvpwr2.m
function [lambda,v,converged]=gepbinvpwr2(K,M,v,tol,maxiter)
% ... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);% previous eigenvalue
lambda =plambda;
[L,U,p] =lu(K,'vector');
converged = false;% not yet
for iter=1:maxiter
    Mv=M*v;% right-hand side
    u=U\(L\Mv(p,:)); % update vectors: solve K*u=M*v (permute the rows of M*v, not of v)
    for j=1:nvecs
        lambda(j)=(v(:,j)'*K*v(:,j))/(v(:,j)'*M*v(:,j));% Rayleigh quotient
    end
    [v,r]=qr(u,0);% economy factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end
Illustration 9
Apply the block inverse power iteration method for the generalized eigenvalue problem to the struc-
ture described in Illustration on page 184.
The algorithm gepbinvpwr2 converges as well as the regular block inverse power iteration for the
standard eigenvalue problem.²¹ No surprise, given how easy it was to transition from the generalized
to the standard eigenvalue problem for this particular mass matrix.
9.7.1 Shifting
Shifting could also be introduced into the block inverse power iteration for the generalized eigenvalue
problem. Not only to speed up convergence to the smallest eigenvalue by making it more dominant,
but also for precisely the opposite: to make the smallest eigenvalue less dominant. What we mean by
this is that if a structure contains rigid body modes (the structure can move without experiencing
any resisting forces), it has at least one zero frequency of vibration. Such a frequency is very strongly
dominant in the inverse power iteration (1/0!!!). The effect of this dominance cannot be exploited,
however, since the matrix K is not invertible. This would make the block inverse power iteration
algorithm (page 200) impossible.
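Shifting can help. To the eigenvalue problem (with $\lambda = \omega^2$)
$$\lambda M z = K z$$
we add $\sigma M z$ on both sides
$$\sigma M z + \lambda M z = \sigma M z + K z$$
and obtain
$$(\lambda + \sigma) M z = (K + \sigma M) z .$$
This is again a generalized eigenvalue problem, with the matrix K + σM in place of K and the
eigenvalue λ + σ in place of λ; the shift σ is subtracted from the computed eigenvalue at the end.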
Illustration 10
We consider a variation on the three-carriage vibrating system of Section 5.1, where the middle
spring is removed. The stiffness matrix of such a vibrating system is singular:
$$K = \begin{bmatrix} k & 0 & 0 \\ 0 & k & -k \\ 0 & -k & k \end{bmatrix} .$$
Equivalently, we say that the structure has a rigid body mode. The frequency corresponding to the
rigid body mode is zero. Figure 9.14 shows this rigid body mode as a translation of the masses
2, 3. Mass 1 does not displace.²² Clearly, all springs maintain their unstressed length: the rigid body
motion does not induce any forces in the structure.
Fig. 9.14. Structure with a singular stiffness matrix. The rigid body mode (ω = 0).
Now we shall try to apply the block inverse power iteration with gepbinvpwr2.²³ The script
n3_sing_undamped_modes_MK2²⁴ invokes gepbinvpwr2 to obtain the first mode without shifting,
and the resulting eigenvector and eigenvalue are worthless. The eigenvector in fact contains
not-a-numbers (NaN). Why? Because the stiffness matrix is singular, its LU factorization cannot
produce a nonsingular U factor. The MATLAB function lu (put a breakpoint inside gepbinvpwr2)
returns the factors as
K>> L,U
L =
1 0 0
0 1 0
0 -1 1
U =
61 0 0
0 61 -61
0 0 0
The 0 in the element 3,3 of the U factor is a problem: at some point we will have to divide by it.
Hence the not-a-numbers.
22 See: aetna/ThreeCarriages/n3_sing_undamped_modes_MK1.m
23 See: aetna/EigenvalueProblems/gepbinvpwr2.m
24 See: aetna/ThreeCarriages/n3_sing_undamped_modes_MK2.m
25 See: aetna/ThreeCarriages/n3_sing_undamped_modes_MK3.m
The script n3_sing_undamped_modes_MK3²⁵ invokes gepbinvpwr2 to obtain the first mode with
shifting. The shift is guessed as 0.2. This number is arbitrary, but it should be sufficiently small
to avoid getting close to the first nonzero frequency. The script shows how we invoke gepbinvpwr2
for a stiffness matrix that is modified by the addition of a multiple of the mass matrix to make it
non-singular.
[M,C,K,A,k1,k2,k3,c1,c2,c3] = properties_sing_undamped;
v=rand(size(M,1),1);% initial guess of the eigenvector
tol=1e-9; maxiter =4;% tolerance, how many iterations allowed?
sigma = 0.2;% this is the shift
[lambda,v,converged]=gepbinvpwr2(K+sigma*M,M,v,tol,maxiter)
lambda =lambda-sigma % subtract the shift to get the original eigenvalue
The output evidently shows that the iteration was successful.
lambda =
0.2000 % shifted
v =
-0.0000
-0.7071
-0.7071
converged =
1
lambda =
6.3838e-016 % shift removed: ~0
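The same experiment can be sketched without the toolbox functions. The values k = 61 and m = 1.3 are assumptions (k matches the U factor printed above, m is borrowed from the three-carriage data used in the spectrum slicing illustration); the starting vector and the number of iterations are also ours.

```matlab
k = 61;  m = 1.3;                       % assumed values
K = k*[1 0 0; 0 1 -1; 0 -1 1];          % singular stiffness matrix
M = m*eye(3);
sigma = 0.2;                            % the shift
Ks = K + sigma*M;                       % shifted stiffness: nonsingular
v = [1; 2; 3];                          % assumed starting vector
for iter = 1:20
    u = Ks\(M*v);                       % solve (K + sigma*M)*u = M*v
    v = u/norm(u);                      % normalize
end
lambda_shifted = (v'*Ks*v)/(v'*M*v);    % Rayleigh quotient, shifted problem
lambda = lambda_shifted - sigma;        % remove the shift
fprintf('shifted %.4f, original %.3e\n', lambda_shifted, lambda);
disp(v')
```

The iterate settles on the rigid body mode, in which masses 2 and 3 translate together and mass 1 stays put, and removing the shift recovers a zero eigenvalue up to rounding.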
Suggested experiments
9.8 Annotated bibliography
1. C. Meyer, Matrix Analysis and Applied Linear Algebra Book and Solutions Manual, SIAM:
Society for Industrial and Applied Mathematics, 2001.
Good coverage of eigenvalue and eigenvector problems. Interesting examples. Best of all, freely
available at http://matrixanalysis.com/.
2. D. E. Newland, Mechanical Vibration Analysis and Computation, Dover Publications Inc., 2006.
It covers well matrix analysis of natural frequencies and mode shapes, and some numerical
methods for modal analysis.
3. G. Strang, Linear Algebra and Its Applications, Brooks Cole; 4th edition, 2005. (Alternatively,
the 3rd edition, 1988.)
Good coverage of the basics of the eigenvalue problem.
4. L. N. Trefethen, D. Bau III, Numerical Linear Algebra, SIAM: Society for Industrial and Applied
Mathematics, 1997.
The treatment of QR factorization is excellent.
10 Unconstrained Optimization
Summary
1. A number of basic techniques in structural analysis rely on results from the area of optimization.
Main idea: Equilibrium of structures and minimization of potential functions are intimately tied.
Equilibrium equations are the conditions of the minimum.
2. Stability of structures is connected to the classification of the stiffness matrix. Main idea: positive
definite matrices correspond to stable structures.
3. The line search is a basic tool in minimization. Main idea: Monitor the gradient of the objective
function. Minimum (extremum) is indicated when the gradient becomes orthogonal to the line
search direction.
4. Solving a system of linear equations and minimizing an objective function are two roads to the
same destination. Main idea: We show that minimizing the so-called quadratic form solves a
system of linear algebraic equations.
5. The method of steepest descent may be improved by the method of conjugate gradients. Main
idea: keep track of directions of past line searches.
6. Direct versus iterative methods. Main idea: direct and iterative methods are rather different in
their properties (cost vs. accuracy). Iterative algorithms seem to be becoming more and more
important in modern software.
7. Least-squares fitting is an important example of optimization.
This can be easily changed into a maximization task by flipping the objective function about the
horizontal axis (i.e. changing its sign) and seeking the maximum as
[Figure: labels x1, x2, spring k, angle 30°.]
This confirms that for the displacements (10.5) the energy stored in the spring is equal to zero. This
property is encountered in structures which are mechanisms: they can move in some ways without
deformation, that is without the need to store energy. Such structures are unstable.
Furthermore, we can see that for the displacements (10.5) we get
Kx = 0 .
Thus we see that the matrix K of (10.3) is singular. Clearly, the fact that the matrix is singular and
the fact that the deformation energy may be zero for some nonzero displacement are related.
Fig. 10.2. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.
Illustration 1
Modify the code below to display the surface in Figure 10.2. The second and the last line need
to be modified to reflect a particular objective function. The last line is supposed to draw arrows
representing the gradient.
[x,y]=meshgrid(-10:10,-10:10);
z=x.*y; % function
surf(x,y,z,'Edgecolor','none'); hold on
contour3(x,y,z,20,'w'); hold on
quiver(x,y,y,x); % gradient
[Figure: labels x1, x2, springs k and k/2, angles 30°, length L.]
Figure 10.4 shows the variation of the deformation energy as a function of x1 , x2 : the only point
where the DE assumes the value of zero is at x1 = 0, x2 = 0. Everywhere else the deformation
energy is positive. This means that whenever the displacements are different from zero, the springs
will store nonzero energy. This is the hallmark of stable structures.
Fig. 10.4. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.
Matrices A that have the property
\[ x^T A x > 0 \quad \text{for all } x \neq 0 , \]
so that
\[ x^T A x = 0 \]
only for x = 0, are called positive definite. Stable structures have positive definite stiffness matrices.
Positive definite matrices are nonsingular (they are regular). This is a fact well worth retaining.
Note that the stiffness matrix is symmetric. An important property of the quadratic forms is
that only symmetric matrices contribute to the value of the quadratic form. We can show that as
follows: For the moment assume that A is in general unsymmetric. The quadratic form is a scalar
(real number), and as such it is equal to its transpose
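\[ x^T A x = \left( x^T A x \right)^T . \]
Expanding the transpose of the product gives
\[ x^T A x = x^T A^T x \]
or
\[ x^T A x - x^T A^T x = x^T \left( A - A^T \right) x = 0 . \tag{10.7} \]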
The general matrix A may be written as a sum of a symmetric matrix and a skew-symmetric
(anti-symmetric) matrix
\[ A = \frac{1}{2} \left( A + A^T \right) + \frac{1}{2} \left( A - A^T \right) . \]
In the expression (10.7) we recognize the anti-symmetric part of A. Therefore, we conclude that
the anti-symmetric part does not contribute to the quadratic form, only the symmetric part does.
Therefore, normally we work only with symmetric matrices in quadratic forms.
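The conclusion is easy to check numerically. The following is a minimal sketch; the matrix A and the vector x are arbitrary illustrative values (assumptions), not data from the text.

```matlab
% Sketch: only the symmetric part of A contributes to x'*A*x.
% A and x are arbitrary illustrative values (assumptions).
A = [2 5; -1 3];            % a general unsymmetric matrix
x = [1.5; -2];              % an arbitrary vector
As = (A + A')/2;            % symmetric part of A
Aa = (A - A')/2;            % skew-symmetric (anti-symmetric) part of A
q_full = x'*A*x;            % quadratic form with the full matrix
q_sym  = x'*As*x;           % quadratic form with the symmetric part only
q_skew = x'*Aa*x;           % contribution of the skew-symmetric part
fprintf('full: %g, symmetric part: %g, skew part: %g\n', q_full, q_sym, q_skew);
```

The first two numbers agree, and the third one vanishes.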
Consider how to compute the derivative with respect to x of the product aT b: both vectors need
to be differentiated in turn using the product rule. So that we don’t have to differentiate a transpose
of the vector a, we take advantage of the fact that the result of the product aT b is a scalar, which
may be transposed at will without changing anything
\[ a^T b = b^T a . \]
Illustration 2
Compute the components of the Hessian of the potential Φ(x) = (1/2) xT Ax.
As shown above, the gradient of Φ(x) is
\[ \nabla\Phi(x) = \frac{1}{2} x^T \left( A + A^T \right) . \]
The result is a row matrix, with components
\[ [\nabla\Phi(x)]_c = \sum_i \frac{1}{2} x_i \left( A_{ic} + A_{ci} \right) . \]
The components of the Hessian matrix Hrc can be obtained by differentiating the gradient with
respect to each xr . Therefore, we obtain
\[ H_{rc} = \frac{1}{2} \left( A_{rc} + A_{cr} \right) . \]
Clearly, the Hessian is symmetric, Hrc = Hcr . For A symmetric we have
\[ H_{rc} = A_{rc} . \]
10.6 Two degrees of freedom static equilibrium: computing displacement 211
The last expression is going to be positive for any combination of zi only if Dii > 0 for all i. So
Dii > 0 for all i guarantees that the quadratic form is positive definite.
If any of the Dii were equal to zero (obtaining this factorization when an element in a pivot
position is zero would be tricky!) and all the others were positive, the matrix would be positive
semi-definite (and singular). (Just for completeness, if the pivots were a mixture of positive and
negative numbers, the matrix would be indefinite.)
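In practice the sign test is often delegated to library routines. The sketch below uses the Cholesky factorization (a close relative of the pivot-based factorization discussed above, not the same factorization) and the eigenvalues; the two matrices are illustrative assumptions, not the stiffness matrices of the text.

```matlab
% Sketch: testing a symmetric matrix for positive definiteness.
% Both matrices are illustrative assumptions.
Kstable = [2 -1; -1 2];          % symmetric, both eigenvalues positive
Kmech   = [1 -1; -1 1];          % symmetric and singular (a "mechanism")
[~, p] = chol(Kstable);          % p == 0 signals a successful Cholesky factorization
isPosDef = (p == 0);             % true for a positive definite matrix
lamStable = eig(Kstable);        % all eigenvalues positive
lamMech   = eig(Kmech);          % one eigenvalue is zero: positive semi-definite
xm = [1; 1];                     % nonzero displacement ...
energyMech = xm'*Kmech*xm/2;     % ... that stores no energy in the mechanism
```

The zero energy for the nonzero vector xm is the numerical counterpart of the singular stiffness matrix discussed earlier.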
Consider a one degree of freedom system (particle on a grounded spring). The deformation energy
(elastic energy stored in the spring) is
\[ DE = \frac{1}{2} x K x , \]
where K is the stiffness constant of the spring. The potential energy of the applied forces is defined
as
\[ W = -L x . \]
The total energy is the sum of the two:
\[ TE = DE + W . \tag{10.9} \]
The solution for the equilibrium displacement is determined by the principle of minimum total
energy: for the equilibrium displacement x∗ the total energy assumes the smallest possible value
\[ x^* = \arg\min_x TE . \tag{10.10} \]
(It should be read: find x∗ as the argument that minimizes T E.) This is an unconstrained
minimization problem. The minimum of the total energy is distinguished by the condition that the
slope at the minimum is zero:
\[ \frac{d\,TE}{dx} = \frac{d}{dx} \left( \frac{1}{2} x K x - L x \right) = K x - L = 0 . \]
This condition is seen to be simply the equation of equilibrium, whose solution indeed is the equi-
librium displacement.
The meaning of equation (10.9) and of the minimization problem (10.10) is illustrated in Fig-
ure 10.5. The deformation energy is represented by a parabolic arc (dashed line), which attains
zero value (that is its minimum) at zero displacement. The potential energy of the external force is
represented by the straight dashed line. The sum of the deformation energy and the energy of the
external force tilts the dashed parabola into the solid line parabola, the total energy. That shifts the
original minimum on the dashed parabola into the new minimum on the solid parabola (negative
value) at x∗ . The minimum is easily seen to be
\[ \min TE = \frac{1}{2} x^* K x^* - L x^* = \frac{1}{2} x^* K x^* - K x^* x^* = -\frac{1}{2} x^* K x^* = -\frac{1}{2} x^* L . \]
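A quick numerical check of the one degree of freedom result may look as follows. This is a sketch only: the values of K and L are made up for illustration, and fminbnd is used merely as a convenient one-dimensional minimizer.

```matlab
% Sketch: one degree of freedom, total energy TE(x) = K*x^2/2 - L*x.
% K and L are illustrative values (assumptions).
K = 200;                          % spring stiffness
L = 50;                           % applied force
TE = @(x) 0.5*K*x.^2 - L*x;       % total energy as a function of displacement
xstar = L/K;                      % equilibrium displacement from K*x - L = 0
TEmin = TE(xstar);                % minimum total energy, equals -xstar*L/2
xnum = fminbnd(TE, -1, 1);        % numerical minimization on the interval [-1, 1]
fprintf('x* = %g (numerical %g), min TE = %g\n', xstar, xnum, TEmin);
```

The numerically located minimum agrees with the solution of the equilibrium equation to within the tolerance of the minimizer.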
\[ W = -L^T x . \]
The effect of this term on the parabolic surface in Figure 10.4 is very similar to that of Figure 10.5,
except now it is in more than one variable: the parabolic surface of the deformation energy is tilted
into the parabolic surface of the total energy (TE). This surface is shown in Figure 10.6. The red
cross at the bottom represents the solution of the static equilibrium equations.
10.9 Application of the total energy minimization 213
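[Figure 10.5: energy plotted against the displacement x, showing the curves DE, W, and TE, with the minimum of TE at x∗.]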
Fig. 10.6. Static equilibrium of particle suspended on two springs. The surface of total energy.
A commonly used technique for these kinds of problems is the so-called line search method. It
works as follows: start at a point. Then repeat as many times as necessary: pick a direction, and
find along this direction a location where the objective function has a lower value than at the start
point. Make this the new start point, and go back to picking a direction.
The algorithm is seen to be a sort of walkabout on the surface of the objective function. The
goal is to reach the lowest point. Two issues need to be addressed: how to choose the direction,
and how to choose where to stop when moving along this direction from the starting point. One
particular strategy for addressing the first issue is to choose the direction of the negative gradient
at the starting point. Since this direction leads to the steepest decrease of the objective function out
of all the directions at the starting point, this strategy is called steepest descent. For general
objective functions, the second issue is difficult to address. To know when to stop when moving
from the starting point along the chosen direction could be expensive to compute. Compare with
Figure 10.7: the objective function appears to be rather complex, the minimum in the middle is only
a local one, not a global minimum: following the drop-off of the objective function in either of the
descending corners would lead to further decrease. Fortunately, our present objective function (10.11)
is much simpler, and hence much nicer to work with.
Fig. 10.7. Walk towards the minimum of the objective function. Starting point is p0 , the walk proceeds
against the direction of the gradient.
First things first: let us figure out the gradient of the objective function. Since the objective function
is based on a quadratic form, we have in fact already done something very much like this before
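\[ \nabla TE = \frac{\partial TE}{\partial x} = \frac{\partial}{\partial x} \left( \frac{1}{2} x^T K x - L^T x \right) . \]
From (10.8) we have
\[ \frac{\partial TE}{\partial x} = \frac{1}{2} x^T \left( K + K^T \right) - L^T . \]
Since the matrix K is symmetric, we can simplify
\[ \frac{\partial}{\partial x} \left( \frac{1}{2} x^T K x \right) = x^T K \]
and finally
\[ \nabla TE = \frac{\partial TE}{\partial x} = \frac{\partial}{\partial x} \left( \frac{1}{2} x^T K x - L^T x \right) = x^T K - L^T . \tag{10.13} \]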
Note that the gradient is a row matrix.
So now we know how to determine the direction in which to move from a given point in order
to decrease the objective function. For direction vectors we usually use column matrices, and so we
define the direction of steepest descent as
\[ r = - \left( \nabla TE \right)^T = L - K x . \]
The vector r is called the residual. We make it into a column matrix in order for the addition of the
vector x and r to make sense.
Next we have to find out how far to go. One possible strategy is to go as far as possible, meaning
that we would follow along a given direction until we’ve reached the lowest possible value of the
objective function starting from a given point in a given direction. Denoting the starting point x0 ,
we write the motion in the direction r
x = x0 + αr .
The lowest point will be reached when we stop descending and if we went any further we would start
ascending on the surface of the objective function. We are moving along a direction which subtends
various angles with the gradient at any given point. When we are descending we are moving against
the direction of the gradient. This would be expressed as (see Figure 10.8, and observe the gradient
of function f at point p2 )
\[ \nabla f(p_2)\, r(p_0) < 0 . \]
Note that the result of the multiplication ∇f (p2 )r (row matrix with one row times column matrix
with one column) is a number, proportional to the cosine of the angle that these two arrows subtend.
Fig. 10.8. Walk to find the minimum of the objective function along a given direction. Starting point is
p0 , the walk proceeds in the direction of r(p0 ) towards the point p1 .
On the other hand, when we are ascending we are moving broadly in the same direction in which
the gradient points, and we have (see Figure 10.8, and observe the gradient of function f at point
p3 )
\[ \nabla f(p_3)\, r(p_0) > 0 . \]
Finally, we must conclude that when we are standing at a point from which to move in any direction
would mean ascending, the path at that point must be perpendicular to the direction of the gradient
at that point (see Figure 10.8, observe the gradient of function f at point p4 )
∇f (p4 )r(p0 ) = 0 .
(Remark: This may be an oversimplification for more general objective functions. There is also the
possibility that a part of the path from p0 to p1 runs level – no descending or ascending.)
The condition that the gradient (10.13) at the lowest point x∗ must be orthogonal to the direction
of descent r can be written down as
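\[ \nabla f(x^*)\, r = \left( {x^*}^T K - L^T \right) r = 0 . \]
Substituting x∗ = x0 + α∗ r and solving for the step length gives
\[ \alpha^* = \frac{r^T r}{r^T K r} . \]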
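The intermediate algebra can be spelled out as follows. This is a sketch that uses only the definitions above, namely x∗ = x0 + α∗ r, r = L − Kx0 , and the symmetry of K.

```latex
\begin{aligned}
\left( (x_0 + \alpha^* r)^T K - L^T \right) r &= 0 \\
\left( x_0^T K - L^T \right) r + \alpha^* \, r^T K r &= 0 \\
- r^T r + \alpha^* \, r^T K r &= 0
\quad\Longrightarrow\quad
\alpha^* = \frac{r^T r}{r^T K r}
\end{aligned}
```

The last line uses the fact that x0T K − LT is the gradient at the starting point, that is −rT .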
This is really the entire algorithm of steepest descent applied to the quadratic form objective func-
tion (10.11): improve the location of the lowest value of the objective function by moving from the
starting point x0 to the new point x
\[ x = x_0 + \frac{r^T r}{r^T K r}\, r , \qquad r = L - K x_0 \]
and then reset the starting point x0 = x. Such an algorithm is concisely written in MATLAB as
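for iter=1:maxiter
    r = b-A*x0; % residual (here A stands for K and b for L)
    x = x0 + (dot(r,r)/dot(A*r,r))* r; % exact line search step along r
    x0 = x; % the new point becomes the next starting point
end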
The steepest descent solver for quadratic objective functions is provided in the toolbox as
SteepestAxb. 1
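The loop above runs for a fixed number of iterations. A variant with a stopping test on the residual could be sketched as follows. This is not the toolbox function SteepestAxb; the 2 × 2 system, the tolerance tol, and the iteration limit are illustrative assumptions.

```matlab
% Sketch: steepest descent with a residual-based stopping test.
% The 2x2 system is an illustrative assumption, not the two-spring data of the text.
A = [3 1; 1 2];  b = [1; 1];          % symmetric positive definite matrix, right-hand side
x0 = [0; 0];                          % starting point
tol = 1e-10;  maxiter = 200;          % assumed tolerance and iteration limit
for iter = 1:maxiter
    r = b - A*x0;                     % residual = direction of steepest descent
    if norm(r) <= tol*norm(b), break; end
    Ar = A*r;                         % matrix-vector product, computed once per step
    x0 = x0 + (dot(r,r)/dot(Ar,r))*r; % exact line search step
end
err = norm(x0 - A\b);                 % compare with the direct solution
fprintf('stopped at iteration %d, error %g\n', iter, err);
```

Even for this tiny, well-conditioned system a few dozen iterations are needed, which is consistent with the slow convergence discussed in the text.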
Illustration 3
In Figure 10.9 we apply the solver SteepestAxb to the two-spring equilibrium problem from Sec-
tion 10.8. Given that this is a two-unknowns system of linear algebraic equations, it takes a lot of
iterations to arrive at a solution: inefficient! So why would we bother with this method? It does
have some redeeming characteristics. To mention one, it requires very little memory. More about
this later in Section 10.12.
Fig. 10.9. Convergence in the norm of the solution error for the steepest descent algorithm applied to the
two-spring equilibrium problem.
Fig. 10.10. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting
point is p0 , the walk proceeds against the direction of the gradient.
that effort is wasted by zigzagging in towards the minimum, with each step going too much sideways
with too little progress in the direction of the minimum.
We realize that there are only two independent directions in the plane x1 , x2 . The first direction
is d(0) = −∇f (x(0) )T , the direction for the first descent step. Therefore, it must be possible to find
a direction for the second step d(1) that would lead directly to the minimum. The reason is that at
the point x(2) (that is at the minimum) the gradient must vanish, which will make it perpendicular
to any vector, including the first and second descent direction
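The second orthogonality condition, that is ∇f (x(2) )d(1) = 0, occurs naturally as a stopping
condition for the step along d(1) (we go as far downhill as possible). For the first condition we
write, with x(2) = x(1) + αd(1) ,
\[ \nabla f(x^{(2)})\, d^{(0)} = \left( {x^{(2)}}^T K - L^T \right) d^{(0)} = \left( {x^{(1)}}^T K - L^T \right) d^{(0)} + \alpha\, {d^{(1)}}^T K d^{(0)} = 0 , \]
and since x(1)T K − LT = ∇f (x(1) ) is orthogonal to d(0) , we get
\[ {d^{(1)}}^T K d^{(0)} = 0 . \tag{10.14} \]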
From this condition we can determine the second descent direction. We can see that it must be a
combination of the first direction d(0) and of −∇f (x(1) )T : these two vectors are orthogonal and
therefore they span the plane. In other words any vector can be expressed as a linear combination
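of these two. Thus we write
\[ d^{(1)} = \beta d^{(0)} - \nabla f(x^{(1)})^T , \]
and substituting into (10.14) we obtain
\[ \beta = \frac{\nabla f(x^{(1)})\, K d^{(0)}}{{d^{(0)}}^T K d^{(0)}} . \]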
To show that the solution can indeed be obtained in just two steps in this case is possible with
MATLAB symbolic math:²
K =[sym('K11'),sym('K12');sym('K12'),sym('K22')];% stiffness
L =[sym('L1');sym('L2')];% load
X0 =[sym('X01');sym('X02')];% starting point
g=@(x)(x'*K-L');% compute gradient
a=@(x,d)(-g(x)*d)/(d'*K*d);% compute alpha
b=@(x,d)(g(x)*K*d)/(d'*K*d);% compute beta
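The listing above defines only the helper functions. A numeric counterpart that actually takes the two steps might be sketched as follows; the values of K, L, and X0 are made up for illustration (assumptions), and the same helper definitions are reused with plain numbers in place of symbolic variables.

```matlab
% Sketch: two conjugate steps reach the minimum in two dimensions.
% K, L, X0 are illustrative numbers (assumptions).
K = [4 1; 1 3];  L = [1; 2];  X0 = [2; 1];   % stiffness, load, starting point
g = @(x) (x'*K - L');                         % gradient (row matrix)
a = @(x,d) (-g(x)*d)/(d'*K*d);                % step length alpha
b = @(x,d) (g(x)*K*d)/(d'*K*d);               % coefficient beta
d0 = -g(X0)';                                 % first direction: steepest descent
X1 = X0 + a(X0,d0)*d0;                        % first step
d1 = b(X1,d0)*d0 - g(X1)';                    % second direction, K-conjugate to d0
X2 = X1 + a(X1,d1)*d1;                        % second step
kconj = d1'*K*d0;                             % vanishes up to round-off, cf. (10.14)
gradEnd = g(X2);                              % gradient at X2 vanishes up to round-off
```

Up to round-off, X2 coincides with the solution of KX = L.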
Fig. 10.11. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting
point is p0 , the walk proceeds in the directions determined to reach the minimum in just two steps.
We will again make the gradient at the point x(k+1) orthogonal to the two directions d(k−1) and
d(k) ,
\[ \nabla f(x^{(k+1)})\, d^{(k-1)} = 0 , \qquad \nabla f(x^{(k+1)})\, d^{(k)} = 0 , \]
only this time the gradient does not have to vanish identically at x(k+1) since there are many vectors
to which it could be orthogonal without having to become identically zero. First we will work out
the gradient at the point x(k+1)
\[ \nabla f(x^{(k+1)}) = {x^{(k+1)}}^T K - L^T = \left( x^{(k)} + \alpha d^{(k)} \right)^T K - L^T = {x^{(k)}}^T K + \alpha\, {d^{(k)}}^T K - L^T , \]
which results in
\[ \nabla f(x^{(k+1)}) = \nabla f(x^{(k)}) + \alpha\, {d^{(k)}}^T K . \]
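We realize that the point x(k) was reached along the direction d(k−1) and at that point the gradient
was orthogonal to the marching direction
\[ \nabla f(x^{(k)})\, d^{(k-1)} = 0 . \]
Requiring the gradient at x(k+1) to be orthogonal to d(k−1) as well therefore leaves the condition
\[ {d^{(k)}}^T K d^{(k-1)} = 0 . \]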
We say that the directions d(k−1) and d(k) are K-orthogonal or K-conjugate (or just conjugate
directions for short).
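So that we can determine the new direction d(k) to be K-conjugate to the old one d(k−1) we
assume the new descent direction is a combination of the direction of steepest descent −∇f (x(k) )T
and the old direction d(k−1)
\[ d^{(k)} = \beta d^{(k-1)} - \nabla f(x^{(k)})^T , \]
and the K-conjugacy condition then fixes the coefficient
\[ \beta = \frac{\nabla f(x^{(k)})\, K d^{(k-1)}}{{d^{(k-1)}}^T K d^{(k-1)}} . \]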
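The conjugate gradients algorithm may be succinctly sketched as³
x=x0;
g = x'*A-b'; % gradient (row matrix)
d=-g'; % first direction: steepest descent
for iter=1:maxiter
    alpha =(-g*d)/(d'*A*d); % step length
    x = x + alpha* d;
    g = x'*A-b'; % gradient at the new point
    beta =(g*A*d)/(d'*A*d);
    d =beta*d-g'; % new direction, conjugate to the old one
end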
Note well that this is not at all an efficient implementation. For instance, the product A*d should
be computed just once. For a real industrial-strength conjugate gradient implementation check out
the MATLAB pcg solver.
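To illustrate the remark about the product A*d, here is a sketch in which that product is computed once per iteration and the gradient is updated recursively instead of being recomputed. The gradient is stored as a column here, the right-hand side of ones, the tolerance, and the iteration limit are assumptions, and the sketch makes no claim to match the toolbox implementation.

```matlab
% Sketch: conjugate gradients with A*d computed once per iteration.
% Right-hand side, tolerance, and iteration limit are assumptions.
A = gallery('poisson', 18);          % 324 x 324 test matrix
b = ones(size(A,1), 1);              % assumed right-hand side
x = zeros(size(b));                  % initial guess
g = A*x - b;                         % gradient, stored as a column
d = -g;                              % first direction: steepest descent
tol = 1e-10;  maxiter = 500;
for iter = 1:maxiter
    Ad = A*d;                        % the only matrix-vector product in the loop
    dAd = d'*Ad;
    alpha = -(g'*d)/dAd;             % exact line search along d
    x = x + alpha*d;
    g = g + alpha*Ad;                % gradient update: g(x + alpha*d) = g(x) + alpha*A*d
    if norm(g) <= tol*norm(b), break; end
    beta = (g'*Ad)/dAd;              % makes the new direction conjugate to d
    d = beta*d - g;
end
relres = norm(b - A*x)/norm(b);      % true relative residual at exit
fprintf('iterations: %d, relative residual: %g\n', iter, relres);
```

The recursive gradient update is the column form of the relation between the gradients at two successive points derived above.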
Illustration 4
Here we apply the steepest descent and the conjugate gradient solvers to a system of linear algebraic
equations with a “standard” 324 × 324 matrix.⁴
Figure 10.12 illustrates that the method of conjugate gradients is a definite improvement over
the method of steepest descent. The convergence is much quicker. Note that after just 75 iterations
or so we could have stopped the conjugate gradient iteration since it reached a limit imposed by the
machine precision. The difference between the two methods can be also dramatically displayed by
showing how the solution is approached during the iterations. Figure 10.13 shows how the iterated
solution (red dashed curve) approaches the converged solution (black solid line) for the steepest
descent method in relation to the number of iterations. Figure 10.14 shows the same kind of
information for the conjugate gradients method. Clearly, even though the two methods started with
essentially the same magnitude of error, conjugate gradients managed to reduce it much more quickly.
³ See: aetna/SteepestDescent/ConjGradAxb.m
⁴ See: aetna/SteepestDescent/test cg 1.m
10.11 Generalization to multiple equations 221
Fig. 10.12. Comparison of the convergence of the steepest-descent algorithm (dashed line) and the Conju-
gate Gradients algorithm (solid line). Matrix: poisson(18), 324 unknowns.
[Figure 10.13 panels: axes labeled Solution (vertical) and Iteration (horizontal); surviving panel titles: iter = 65, iter = 108, iter = 162.]
Fig. 10.13. Solution obtained with the Steepest Descent algorithm for the matrix gallery(’poisson’,18),
324 unknowns, using various numbers of iterations.
[Figure 10.14 panels: axes labeled Solution (vertical) and Iteration (horizontal).]
Fig. 10.14. Solution obtained with the Conjugate Gradients algorithm for the matrix poisson(18), 324
unknowns, using various numbers of iterations.
We have seen two representatives of two classes of numerical methods: the LU factorization as a
representative of the so-called direct methods, and the method of steepest descent as a representative
of the iterative methods.
The direct methods will complete their work in a number of steps that can be determined before
they start. If we took the time and effort, we could count every single addition and multiplication
that will be required for a given size matrix.
On the other hand, for iterative methods this is not possible. There may be constituents in the
iteration procedure whose cost may be evaluated a priori, but the number of iterations is typically
impossible to determine beforehand.
Where does the method of conjugate gradients fit? It can be shown that even though we have
enforced the orthogonality of gradients to two successive directions at a time, orthogonality of all
previous directions to the gradient at the current point is carried forward. Therefore, theoretically,
given infinite arithmetic precision, after n steps we will again reach a point, x(n) , where the gradient
must be orthogonal to all n descent directions. Thus, the gradient at x(n) must vanish identically,
otherwise in an n-dimensional space it couldn’t be simultaneously orthogonal to n linearly independent directions. In this
sense, the method of conjugate gradients is able to complete its work in time that can be determined
before the computation starts. On the other hand, it can also be used as an iteration procedure since
it is possible to stop it at any time, and the current point would be an improvement of the initial
guess.
The characteristics we have just introduced can be illustrated in Figure 10.15. The direct method
will start computing, and after a certain time and effort (which we can predict in advance: advantage!)
it will stop and deliver the solution with an error within the limits of computer precision (machine
epsilon). Until it does, we have nothing. (Disadvantage.)
The iterative method will start reducing the error of the initial guess right away. After a certain
time and effort the method will reduce the error to machine precision. (For simplicity we have
assumed that the two methods we are comparing will reach machine precision in the same time, this
may or may not be so.) Importantly, the iterative method can be stopped before it reaches machine
precision. If we are satisfied with a cruder tolerance, we could accept the solution much sooner, and
potentially save time (advantage!). For the iterative method we will not know in advance how long
it’s going to take to compute an acceptable solution. (Disadvantage.)
Nowadays there seems to be an agreement in the scientific and engineering computing community
that iterative methods are for many applications the preferred algorithms. This makes it a little bit
harder for the users of software built on iterative algorithms since iterative algorithms typically
include some tuning parameters, and various tolerances are involved. A judicious choice of these is
not always easy, and it can have a very significant impact on the cost of such computations.
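The role of the tolerance can be seen in a small experiment. The sketch below compares the direct solution obtained with the backslash operator against the MATLAB pcg solver run with a crude and with a tight tolerance; the test matrix, the right-hand side, and both tolerance values are assumptions chosen for illustration.

```matlab
% Sketch: direct solve versus an iterative solve with two tolerances.
% Test problem and tolerances are illustrative assumptions.
A = gallery('poisson', 18);                            % 324 x 324 test matrix
b = ones(size(A,1), 1);                                % assumed right-hand side
xd = A\b;                                              % direct method: all or nothing
[xc, flagc, relresc, iterc] = pcg(A, b, 1e-3, 500);    % crude tolerance
[xf, flagf, relresf, iterf] = pcg(A, b, 1e-10, 500);   % tight tolerance
errc = norm(xc - xd)/norm(xd);                         % error relative to the direct solution
errf = norm(xf - xd)/norm(xd);
fprintf('tol 1e-3 : %d iterations, relative error %g\n', iterc, errc);
fprintf('tol 1e-10: %d iterations, relative error %g\n', iterf, errf);
```

The cruder tolerance stops the iteration earlier at the price of a larger error, which is precisely the trade-off sketched in Figure 10.15.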
Consider the following problem. The load-deflection diagram of a stainless steel 303 round coupon
was determined experimentally as shown in Figure 10.16. The data comes from the initial, more or
less straight portion of the curve. What is the stiffness coefficient of the coupon? If the data were all
located on a straight line, it would be the slope of that straight line. However, we can see that not only
is there some experimental scatter, but the data points appear to lie on a curve, not a straight line.
Fig. 10.15. Comparison of effort versus error for direct and iterative methods.
Fig. 10.16. Stainless steel 303 round coupon, and the load-deflection diagram.
The so-called linear regression approach to the above problem could start from the assumption that
the stiffness could be determined as the slope of a straight line which somehow “best” approximates
the measured data points. If the data were all on a straight line, we could write
\[ F(w) = p_1 w + p_2 \]
for the relationship between the displacement w and the force F , where p1 is the stiffness coefficient
of the coupon, K = dF/dw = p1 . The data points are not located on a straight line however, which
means that substituting the displacement wk and the force measured for that displacement Fk into
the above relationship will not render it an equality, something will be left over: we will call it the
residual.
\[ F_k - F(w_k) = F_k - \left( p_1 w_k + p_2 \right) = r_k . \]
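This may be written in matrix form for all the data points as
\[ \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_n \end{bmatrix} - \begin{bmatrix} w_1 & 1 \\ w_2 & 1 \\ \vdots & \vdots \\ w_n & 1 \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} . \]
For convenience, using the measured data w1 , w2 , ..., wn and F1 , F2 , ..., Fn we will define the matrix
\[ A = \begin{bmatrix} w_1 & 1 \\ w_2 & 1 \\ \vdots & \vdots \\ w_n & 1 \end{bmatrix} \]
and the vector
\[ b = \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_n \end{bmatrix} . \]
The vector of the residuals (also called the error of the linear fit) is
\[ e = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} . \]
So we write
\[ b - A u = e , \]
where u = [p1 , p2 ]T collects the parameters of the fit and the matrix A has more rows than columns.
This is the reason why it will not be possible to make the error exactly zero in general: there are
more equations than unknowns.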
We realize that, since we cannot zero out the error, we have to go for the next best
thing, which is to somehow minimize the magnitude of the error. In terms of the norm of the vector
e it means to find the minimum of the following objective function
\[ \min \|e\|^2 = \min e^T e = \min\, (b - Au)^T (b - Au) \]
with respect to the parameter vector of the linear fit u. This is a classical unconstrained
minimization problem. The argument u∗ for which the minimum of the objective function is attained is
found from the above expression, which in expanded form reads
\[ u^* = \arg\min_u\, (b - Au)^T (b - Au) = \arg\min_u \left( u^T A^T A u - 2 b^T A u + b^T b \right) . \]
The first-order condition for the existence of an extremum is the vanishing of the gradient of the
objective function
\[ \frac{\partial}{\partial u} \left( u^T A^T A u - 2 b^T A u + b^T b \right) = 2 u^T A^T A - 2 b^T A = 0 . \]
Canceling out the factor 2 and transposing leads to the so-called normal equations
\[ A^T A u = A^T b . \]
They are linear algebraic equations with a symmetric matrix, and since the columns of A are
linearly independent (provided the wk ’s are not all the same), the matrix AT A is invertible. (It is however
not necessarily well conditioned. We have seen evidence of this in the Illustration on page 168.)
The coupon data are
w = 10^{-2} × [ 1.3, 1.8, 2.3, 2.8, 3.1, 3.6, 4.1, 4.5, 4.8, 5.3, 5.8, 6.3, 6.7 ]
and
F = 10^{2} × [ 1.2, 2.6, 4.3, 6.1, 7.3, 9.2, 11, 13, 14, 16, 18, 20, 21 ] .
Substituting our data into the normal equations leads to the solution
A =[w, ones(length(w),1)]; % w and F are column vectors
pl =(A'*A)\A'*F
pl =
1.0e+004 *
3.799600652696673
-0.042276210924081
So the stiffness of the coupon based on the linear fit is approximately 37996 lb/in. Continuing our
investigation, we realize that the data points appear to lie on an S-shaped curve, which suggests a linear
regression with a cubic polynomial. This is easily accommodated in our model by taking
\[ F(w) = p_1 w^3 + p_2 w^2 + p_3 w + p_4 . \]
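Both fits can be computed with a few lines of MATLAB. In the sketch below the least-squares problems are solved with the backslash operator applied to the rectangular matrix, rather than by forming the normal equations as in the text; the two approaches solve the same minimization problem. Since the data are entered here as printed above, that is rounded to two or three significant digits, the coefficients need not reproduce the digits quoted in the text exactly. The variable names are ours.

```matlab
% Sketch: linear and cubic least-squares fits of the coupon data.
% Data as printed in the text (rounded); variable names are assumptions.
w = 1e-2*[1.3 1.8 2.3 2.8 3.1 3.6 4.1 4.5 4.8 5.3 5.8 6.3 6.7]';   % deflection [in]
F = 1e2*[1.2 2.6 4.3 6.1 7.3 9.2 11 13 14 16 18 20 21]';          % force [lb]
A1 = [w, ones(size(w))];                  % matrix of the linear model
A3 = [w.^3, w.^2, w, ones(size(w))];      % matrix of the cubic model
p1 = A1\F;                                % least-squares parameters, linear fit
p3 = A3\F;                                % least-squares parameters, cubic fit
r1 = F - A1*p1;                           % residual of the linear fit
r3 = F - A3*p3;                           % residual of the cubic fit
stiff3 = @(wq) 3*p3(1)*wq.^2 + 2*p3(2)*wq + p3(3);   % dF/dw for the cubic fit
fprintf('linear-fit stiffness: %g lb/in\n', p1(1));
fprintf('residual norms: linear %g, cubic %g\n', norm(r1), norm(r3));
```

The function handle stiff3 evaluates the deflection-dependent stiffness of the cubic fit, while the linear fit yields the single constant p1(1).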
Figure 10.17 shows the linear and cubic polynomial fit of the experimental data. The difference is
somewhat inconspicuous, but plotting the residuals is quite enlightening. Figure 10.18 shows the
residual for the linear and cubic polynomial fit. The linear polynomial fit residual shows a clear bias
in the form of a cubic curve. This indicates that a cubic polynomial would be a better fit. That is
indeed true: for the cubic fit the magnitude of the residual decreased and the bias was removed.
Fig. 10.17. Stainless steel 303 round coupon, and the load-deflection diagram. Linear polynomial fit on the
left, cubic polynomial fit on the right.
Fig. 10.18. Stainless steel 303 round coupon load-deflection diagram. Linear polynomial fit residual in
dashed line, cubic polynomial fit in solid line.
Figure 10.19 shows the variation of the stiffness coefficient as a function of the deflection for
both the linear and the cubic polynomial fit. It may be appreciated that the stiffness varies by a
substantial amount when determined from the cubic fit, while it is constant based on the linear fit.
Fig. 10.19. Stainless steel 303 round coupon, and the load-deflection diagram. Stiffness coefficient as a
function of deflection. Dashed line: from linear polynomial fit, solid line: from cubic polynomial fit.
Let us now come back to the geometrical meaning of the least squares equations. The equation
b − Au = e
expresses that we cannot satisfy all the individual equations since there are more equations than there
are unknown parameters: the vector b belongs to Rn , while u belongs to Rm , and we have m < n.
In other words, the matrix A is rectangular (tall and skinny).
The geometry viewpoint would imagine b as a vector (arrow) in Rn . Each of the columns of the
matrix A also represents a vector (arrow) in Rn . The product Au is a linear combination of the
columns of the matrix A
\[ A u = c_1(A)\, u_1 + c_2(A)\, u_2 + \cdots + c_m(A)\, u_m , \]
where c1 (A) is used to mean column 1 of the matrix A and so on. To reach every single point of Rn ,
we would need n linearly independent basis vectors. Since there are only m columns of the matrix A
they cannot serve as such basis vectors, and the linear combination of the columns of the matrix A
is only going to cover a subspace of Rn . Inspect Figure 10.20: the columns of the matrix A generate the
gray plane as a graphical representation of this subspace of Rn . The vector b is of course not confined
to the plane and somehow sticks out of it. The difference e between b and Au also sticks out. To
make the error e as small as possible (as short as possible) then amounts to making it orthogonal
to the gray plane. The shortest possible error e∗ = b − Au∗ will be orthogonal to all possible
vectors Au in the gray plane, as expressed here
\[ (Au)^T e^* = 0 . \]
Substituting we obtain
\[ (Au)^T (b - Au^*) = 0 \]
or
\[ u^T A^T (b - Au^*) = u^T \left( A^T b - A^T A u^* \right) = 0 . \]
When we say for all possible vectors in the gray plane, Au, we mean for all parameters u, and since
the above equation must be true for all u, we have again the normal equations
\[ A^T b - A^T A u^* = 0 . \]
The solution of the normal equations is the parameter vector u∗ that makes the error of the least
squares fit as small as possible.
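The orthogonality of the shortest error to the columns of A is easily verified numerically. The sketch below uses the coupon data as printed above (rounded) with the linear model; it also compares the condition numbers of A and AT A, which bears on the earlier remark that the matrix of the normal equations need not be well conditioned. Variable names are ours.

```matlab
% Sketch: the least-squares error is orthogonal to the columns of A.
% Coupon data as printed in the text (rounded); variable names are assumptions.
w = 1e-2*[1.3 1.8 2.3 2.8 3.1 3.6 4.1 4.5 4.8 5.3 5.8 6.3 6.7]';   % deflection [in]
F = 1e2*[1.2 2.6 4.3 6.1 7.3 9.2 11 13 14 16 18 20 21]';          % force [lb]
A = [w, ones(size(w))];          % tall and skinny matrix of the linear model
u = A\F;                         % least-squares parameters from the backslash operator
e = F - A*u;                     % shortest error vector
orth = A'*e;                     % vanishes up to round-off
un = (A'*A)\(A'*F);              % the same parameters from the normal equations
c1 = cond(A);                    % condition number of A
c2 = cond(A'*A);                 % condition number of A'*A, approximately c1^2
fprintf('norm(A''*e) = %g, cond(A) = %g, cond(A''*A) = %g\n', norm(orth), c1, c2);
```

The squaring of the condition number is the reason why solvers that work directly with A are often preferred over the normal equations for poorly conditioned fits.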
1. R. Fletcher, Practical Methods of Optimization, second edition, John Wiley and Sons, 2000.
Lucid presentation of the basics of unconstrained optimization.
2. P.Y. Papalambros, D.J. Wilde, Principles of Optimal Design, second edition, Cambridge University
Press, 2000.
Practical engineering treatment of both unconstrained and constrained optimization.
3. G. Strang, Linear Algebra and Its Applications, fourth edition, Brooks Cole, 2005.
Great reference for the least-squares methodology.
11
Constrained Optimization
Summary
1. Constraints are used to eliminate some values of the optimization variables from consideration
(i.e. such values are inadmissible as solutions).
2. Candidate solutions are sought in the so-called feasible regions.
3. Constraints often make the optimization problem much harder, especially because the optima
may be located on the boundary of the feasible region.
4. Sometimes the constrained optimization may be simplified by converting the optimization prob-
lem to an unconstrained one, for instance by adding a penalty term for the constraint. Main idea:
the objective function is augmented with cost associated with the violation of the constraint.
5. Method of feasible directions is an important tool in constrained optimization.
Fig. 11.1. Maximization problem: Climb as high as possible. Much easier when the search is limited to the
red rectangle rather than the entire map.
Fig. 11.2. Descend as low as possible. The objective function surface is an axially symmetric paraboloid
f = x_1^2 + x_2^2 , where blue is low and red is high. The feasible region is the outside of the gray circle. Quite
hard to compute the solution, since there are an infinite number of points at the same height.
Fig. 11.3. Non-smooth minimization problem in one dimension: minimum may occur locally at the corner,
the slope at the minimum may be ill-defined (infinite, non-unique, or nonexistent).
minimum as a point where the first derivative becomes zero needs to be revised for non-smooth
optimization problems.
\[ x^* = \arg\min_x f . \]
The function f has a minimum at the point x∗ if the objective function value at points reached from
x∗ in all possible directions d by steps of length s ≥ 0 is greater than or equal to the objective
function value at x∗ :
\[ f(x^* + s d) \ge f(x^*) . \]
We can express this in terms of the first derivatives (the gradient) as follows: Using the Taylor series,
just one term and the remainder in Lagrange form, moving from point x∗ in the unit direction d
with the step length measured by s, we write for the function value at the points x∗ + sd in the
neighborhood of x∗
so that
∇f (x∗ )d ≥ 0
could be true for an arbitrary vector d (in this case, with the equals sign). In fact, a necessary
condition for the existence of a minimum may be written down in the unconstrained case as
\[ \nabla f(x^*)\, d = 0 , \quad \forall d \neq 0 . \]
Now we will consider minima in problems with constraints. As an example consider the following
structural optimization task (adapted from Papalambros and Wilde, 2000): design the thickness t
and the diameter d of a steel pipe so that its weight is minimized (Figure 11.4). The pipe is a simply
supported beam, prestressed by a large tensile force. The optimization is subject to the following
constraints: (a) the deflection of the beam δ must be less than a fraction of the span L: δ ≤ 0.001L;
(b) the tensile stress σ due to the applied force P must be below the yield stress σ ≤ Sy ; and (c)
the thickness of the pipe is limited from below, t ≥ 0.05 in. The data: E = 30 × 10^6 psi, L = 100 in,
ρ = 0.283 lb/in^3 , Sy = 36 kpsi, and P = 10^4 lb. The objective function minimization with the three
inequality constraints may be simplified to an expression in the two variables t, d
\[ w^* = \arg\min_w f , \qquad f = 88.9\, t d , \qquad w = \begin{bmatrix} t \\ d \end{bmatrix} \]
subject to
td − 0.0885 ≥ 0 ,
d − 0.994 ≥ 0 ,
t − 0.05 ≥ 0 .
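Whether a candidate design is admissible can be checked by evaluating the three constraint expressions. The following sketch does that for a few candidate designs; the candidates are picked purely for illustration (assumptions), and the function names are ours.

```matlab
% Sketch: objective and constraints of the pipe problem.
% Candidate designs and function names are illustrative assumptions.
f  = @(t,d) 88.9*t.*d;                               % objective function
gc = @(t,d) [t.*d - 0.0885; d - 0.994; t - 0.05];    % constraints: feasible when all >= 0
isFeasible = @(t,d) all(gc(t,d) >= 0);
okA = isFeasible(0.05, 1.8);      % t*d = 0.09, thickness at its lower limit: feasible
okB = isFeasible(0.06, 1.2);      % t*d = 0.072 violates the first constraint
okC = isFeasible(0.10, 1.0);      % all three constraints satisfied with some margin
fA = f(0.05, 1.8);                % objective of design A
fC = f(0.10, 1.0);                % objective of design C
fLow = 88.9*0.0885;               % objective value on the level curve t*d = 0.0885
fprintf('A: feasible %d, f = %g;  C: feasible %d, f = %g;  bound %g\n', okA, fA, okC, fC, fLow);
```

Both feasible candidates have objective values above the value attained on the level curve td = 0.0885, as the first constraint requires.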
Figure 11.5 shows the geometry of the feasible and infeasible regions for the above constrained
minimization. The constraints are indicated as overlapping gray regions. The feasible points are
located outside of the union of the gray regions.
Fig. 11.4. Example of constrained minimization. Optimization of a steel pipe carrying a tensile load, and
subject to its own weight.
Fig. 11.5. Constrained minimization. The unconstrained minimum of the objective function lies outside of
the feasible region. Up to two constraints may be active at any point of the solution set.
Since the objective function increases towards the upper right corner, we can see that the minimum
is obtained at any point of the level curve
td = 0.0885 between the second and third constraint. The vectors d1 , d2 , d3 are representatives of
the so-called limiting feasible directions. They are tangent to the boundaries of the constraint
regions. Their usefulness derives from the fact that they can be viewed as starting directions of the
so-called feasible sequences: sequences of feasible points that originate at the tail of the feasible
direction arrow and initially proceed along it. Let us now look at some such feasible sequences. Starting from point
p0 along d3 we see that the objective function value increases. Starting from point p0 along d2
we see that there are feasible sequences (along the contour) such that the value of the objective
function stays the same. There are of course other sequences starting from the same point in the
same direction such that the objective function increases. This information is visually available from
the gradient of the objective function at point p0 . If we have
∇f (p0 )d3 > 0 ,
we conclude that the objective function in the direction d3 increases. The gradient is orthogonal to
d2 , and therefore
∇f (p0 )d2 = 0
and our observation is that starting along paths tangent to d2 does not lead to either increase or
decrease of the objective function (to first order in the distance).
On the other hand, we see that starting from point p1 along d1 produces a decrease in the value
of the objective function since
∇f (p1 )d1 < 0 .
With these observations at hand, we can formulate the conditions from which one can decide whether
a given point is a constrained minimum: If at a point x∗ we have for all feasible directions d
∇f (x∗ )d ≥ 0 ,
then such a point is, to first order, a (local) minimum of the objective function. On the other hand, if for some feasible
direction d we have ∇f (x∗ )d < 0, the point x∗ is not a minimum, and the location of the minimum
may be improved by moving along d. Such a direction d is called a feasible descent direction.
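The test on the sign of ∇f d can be tried numerically on the pipe problem. The sketch below is an illustration under our own assumptions: the point on the curve td = 0.0885 is chosen arbitrarily, and a direction is treated as feasible (to first order) when it does not decrease the active constraint td − 0.0885.

```matlab
% First-order check at a point on the active constraint t*d = 0.0885.
thk = 0.07;  dia = 0.0885/thk;     % hypothetical point (t, d) on the curve
gradf  = 88.9*[dia, thk];          % row vector: derivative of f = 88.9*t*d w.r.t. [t; d]
gradc1 = [dia, thk];               % gradient of the active constraint t*d - 0.0885

theta = linspace(0, 2*pi, 361);    % sample unit directions, one per degree
D = [cos(theta); sin(theta)];      % each column is a direction [dt; dd]
feasible = (gradc1*D) >= 0;        % directions that do not leave the feasible set (first order)

slopes   = gradf*D(:, feasible);   % directional derivatives along feasible directions
minSlope = min(slopes);

tangent      = [thk; -dia];        % tangent to the contour t*d = const
slopeTangent = gradf*tangent;      % zero: the objective does not change to first order

descent      = -[1; 1];            % decreases f, but points into the infeasible region
slopeDescent = gradf*descent;
isDescentFeasible = (gradc1*descent) >= 0;
fprintf('min slope over feasible directions: %g\n', minSlope);
```

No feasible direction gives a negative slope here, because the gradient of the objective is a positive multiple of the gradient of the active constraint.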
The characteristics of the one-dimensional constrained line search are shown in Figure 11.6. The
minimum is sought within an interval of the real line, let us say a ≤ x ≤ b. A local minimum may
obtain at an interior point of the interval (the point x∗ ), or at the boundary (the point b). The theory
of feasible directions may also be illustrated by reference to Figure 11.6. The feasible direction at a
is da and a feasible descent sequence is possible from a in the direction of da : the point a is therefore
not a local minimum. The feasible direction at b is db and there is no feasible descent sequence from
b: the point b is therefore a local minimum.
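[Fig. 11.6: graph of f (x) on the interval a ≤ x ≤ b, marking the feasible directions da and db, the interior minimum x∗ , and the end point b.]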
A popular technique for locating a constrained minimum on the given interval (that is a one-
dimensional line search) is a combination of the so-called Golden-section algorithm and the quadratic
interpolation method. This is implemented in the MATLAB function fminbnd.
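A minimal sketch of the two situations described above (interior minimum versus minimum at the boundary of the interval), using fminbnd on a test function of our own choosing:

```matlab
% Hypothetical test function with its unconstrained minimum at x = 2.
f = @(x) (x - 2).^2;
opts = optimset('Display', 'off');

% Interval containing the minimum: the solution is an interior point.
xInterior = fminbnd(f, 0, 5, opts);

% Interval excluding the minimum: f increases on [3, 5], so the
% constrained minimum sits at the lower end of the interval.
xBoundary = fminbnd(f, 3, 5, opts);

fprintf('interior: %.4f, boundary: %.4f\n', xInterior, xBoundary);
```

The returned boundary solution is only close to the end point, not exactly equal to it, since the search works with points strictly inside the interval and stops at a finite tolerance.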
The golden-section algorithm is a bracketing method, which means that it starts from an initial guess of the interval in which
the minimum is located and then progressively refines that interval without losing the guarantee
that a minimum is still inside.
The method starts by locating three points, xL , xU , and xI , such that
xL < xI < xU
and
f (xI ) < f (xL ) , f (xI ) < f (xU ) .
This last condition guarantees the existence of a minimum in the interval xL < x < xU . Observe
Figure 11.7: the initial interval is xL = xk−1 , xU = xk , and the intermediate point is xI = xk+1 . Now
we generate one more point, xk+2 , and we will reconfigure the interval bounds and the intermediate
point. We can see that from the two choices
and
the second one yields f (xI ) < f (xL ) and f (xI ) < f (xU ). The process is continued by generating
another point, xk+3 . The interval is then reconfigured to
and so the iteration converges by shrinking the size of the interval xU −xL that encloses the minimum.
The trick that makes this algorithm as efficient as possible by minimizing the number of needed
newly generated points while at the same time maintaining convergence at a steady rate is to set
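$x_{k+2} - x_{k-1} = x_k - x_{k+1} = \ell$ and
$$ \frac{x_k - x_{k-1}}{x_k - x_{k+1}} = \frac{x_{k+2} - x_{k-1}}{x_{k+1} - x_{k-1}} = \phi . $$
Substituting we get
$$ \frac{x_k - x_{k-1}}{\ell} = \frac{\ell}{x_{k+1} - x_{k-1}} = \phi $$
so that
$$ x_k - x_{k-1} = \phi\ell \quad \text{and} \quad \frac{\ell}{\phi} = x_{k+1} - x_{k-1} . $$
Since $x_{k+1} - x_{k-1} = (x_k - x_{k-1}) - (x_k - x_{k+1})$, which means $\ell/\phi = \ell\phi - \ell$, we see that
$\phi = (\sqrt{5} + 1)/2 \approx 1.618$ is the positive solution to the quadratic equation $\phi^2 - \phi - 1 = 0$. The constant
$\phi$ is the so-called golden ratio.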
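The interval-shrinking process can be sketched in a few lines. This is an illustrative implementation in the common form that keeps two interior points, not the code used by fminbnd; the test function, the initial bracket, and the stopping tolerance are our own assumptions.

```matlab
% Golden-section search on a hypothetical test function.
f = @(x) (x - 0.3).^2;          % assumed objective, minimum at x = 0.3
phi = (sqrt(5) + 1)/2;          % golden ratio
xL = -5;  xU = 5;               % initial bracket
x1 = xU - (xU - xL)/phi;        % lower interior point
x2 = xL + (xU - xL)/phi;        % upper interior point
f1 = f(x1);  f2 = f(x2);
nEval = 2;                      % number of function evaluations so far
while (xU - xL) > 1e-6
    if f1 < f2
        % Minimum is bracketed by [xL, x2]; the old x1 is reused as upper interior point.
        xU = x2;  x2 = x1;  f2 = f1;
        x1 = xU - (xU - xL)/phi;  f1 = f(x1);
    else
        % Minimum is bracketed by [x1, xU]; the old x2 is reused as lower interior point.
        xL = x1;  x1 = x2;  f1 = f2;
        x2 = xL + (xU - xL)/phi;  f2 = f(x2);
    end
    nEval = nEval + 1;          % only one new evaluation per iteration
end
xmin = (xL + xU)/2;
fprintf('xmin = %.6f after %d function evaluations\n', xmin, nEval);
```

Because of the golden-ratio spacing, one of the two interior points always lands where the next iteration needs it, so each shrinking of the interval by the factor 1/φ costs a single new function evaluation.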
Fig. 11.7. Line search for a one-dimensional minimization problem with the golden-section method.
Given three data points, (xL , f (xL )), (xU , f (xU )), and (xI , f (xI )), such that
xL < xI < xU
and
f (xI ) < f (xL ) , f (xI ) < f (xU ) ,
as for the Golden-section method, we can approximate the location of the minimum of f (x) by the
location of the minimum of the parabolic interpolant q(x) (compare with Figure 11.8). Using the so-
called Lagrange interpolation polynomials, we can write for the interpolating quadratic polynomial
Differentiating we obtain
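For reference, in the notation above, one standard way to write the Lagrange form of the interpolating parabola and the stationary point that follows from setting q′(x) = 0 is given below. This is the textbook form of these expressions, offered as a sketch; the arrangement of the terms is our own choice.

```latex
q(x) = f(x_L)\,\frac{(x - x_I)(x - x_U)}{(x_L - x_I)(x_L - x_U)}
     + f(x_I)\,\frac{(x - x_L)(x - x_U)}{(x_I - x_L)(x_I - x_U)}
     + f(x_U)\,\frac{(x - x_L)(x - x_I)}{(x_U - x_L)(x_U - x_I)}

q'(x) = 0 \quad \Longrightarrow \quad
x = \frac{1}{2}\,
\frac{f(x_L)\,(x_I^2 - x_U^2) + f(x_I)\,(x_U^2 - x_L^2) + f(x_U)\,(x_L^2 - x_I^2)}
     {f(x_L)\,(x_I - x_U) + f(x_I)\,(x_U - x_L) + f(x_U)\,(x_L - x_I)}
```

The stationary point is a minimum of q(x) when the three points satisfy the bracketing conditions stated above, since the parabola then opens upward.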
Fig. 11.8. Line search for a one-dimensional minimization problem with the quadratic-interpolation method.
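A single step of the quadratic-interpolation method can be sketched as follows, using the standard closed-form expression for the stationary point of the parabola through three points. The test function and the three abscissas are assumptions made for illustration.

```matlab
% One parabolic-interpolation step on a hypothetical test function.
f = @(x) (x - 0.3).^2;            % assumed objective, minimum at x = 0.3
xL = -5;  xI = 0;  xU = 5;        % bracketing points, xL < xI < xU
fL = f(xL);  fI = f(xI);  fU = f(xU);

% Stationary point of the quadratic through (xL,fL), (xI,fI), (xU,fU).
num = fL*(xI^2 - xU^2) + fI*(xU^2 - xL^2) + fU*(xL^2 - xI^2);
den = fL*(xI - xU)     + fI*(xU - xL)     + fU*(xL - xI);
xq  = 0.5*num/den;
fprintf('estimated minimizer: %.6f\n', xq);
```

When the objective is itself a quadratic, as here, the interpolant reproduces it exactly and one step lands on the minimizer. For a general function the step is repeated with an updated triple of points.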
Illustration 1
Solve for the static equilibrium displacement of a particle on a spring. The total energy of such a system
is given in equation (10.9). Use the fminbnd constrained line search function. Set K = 100, L = 30.
Set the interval limits as x1=-5 and x2=+5. Set the options with optimset('Display','iter'):
this will print out information about which method is used at each step. How many iterations does
it take to obtain a solution? How many steps would just the parabolic interpolation need? Is there
any danger associated with choosing the interval limits as x1=-5/100 and x2=+5/100?
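A possible starting point for this illustration is sketched below. Equation (10.9) is not reproduced here, so the energy expression in the code is an assumption (a linear spring of stiffness K under a constant load L); replace it with the expression from (10.9) if it differs.

```matlab
K = 100;  L = 30;                       % data from the illustration
TE = @(x) 0.5*K*x.^2 - L*x;             % ASSUMED form of the total energy

% Wide interval: the equilibrium lies inside.
options = optimset('Display', 'iter');  % prints the method used at each step
[xWide, TEwide, flagWide, outWide] = fminbnd(TE, -5, 5, options);

% Narrow interval: check whether the equilibrium is still inside.
optionsQuiet = optimset('Display', 'off');
xNarrow = fminbnd(TE, -5/100, 5/100, optionsQuiet);

fprintf('wide interval:   x = %.5f in %d iterations\n', xWide, outWide.iterations);
fprintf('narrow interval: x = %.5f\n', xNarrow);
```

Under the assumed energy the equilibrium is at x = L/K = 0.3, which is outside the narrow interval; fminbnd then returns a point next to the interval limit without any warning that the result is not the true equilibrium.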
[Figure: structure with an obstacle; labels x1, x2, k, k/2, 30°, L, g, ∇c.]
In terms of the minimization of the total potential energy of the system, the required change will
be the introduction of a constraint equation (actually, an inequality). The constraint inequality will
evaluate to a non-negative (positive or zero) value for the so-called admissible displacements of the
structure. Inadmissible displacements (in optimization lingo they would be called infeasible) would
be associated with negative values of the constraint. In the present situation, the constraint would
be written as
$$ c(x) = g + n^{T} x \ge 0 , $$
where the normal to the obstacle surface is the gradient of the constraint function, $n^{T} = \nabla c$.
The originally unconstrained minimization problem (10.12) is now changed into the so-called constrained
minimization problem
$$ x^{*} = \arg\min_{x} TE \quad \text{subject to} \quad c(x) \ge 0 . \qquad (11.1) $$
Figure 11.10 shows the objective function in the presence of a contact constraint that is sufficiently
removed from the loaded joint so that the applied loads do not bring the joint into contact with the
obstacle. The minimum of the total potential energy is located in the feasible region. We say that
the constraint is inactive. In Figure 11.11 we see the situation in which the loads are sufficiently
large to push the free joint against the obstacle. The minimum of the total potential energy occurs
in the infeasible region, and a contact constraint is active. In other words, the constraint prevents
the displacements from reaching the minimum of the original unconstrained total potential energy
of the structure.
Fig. 11.10. Constrained minimization. The unconstrained minimum of the objective function lies inside of
the feasible region. In other words, the constraint is inactive.
Fig. 11.11. Constrained minimization. The unconstrained minimum of the objective function lies outside
of the feasible region. In other words, the constraint is active.
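Whether the contact constraint is active can be decided by evaluating c(x) at the unconstrained minimum. The sketch below uses a made-up quadratic energy with two degrees of freedom; the stiffness matrix, the load vector, the obstacle normal, and the two gap values are hypothetical and are not the data of the truss in the figures.

```matlab
% Hypothetical two-degree-of-freedom system: TE(x) = 0.5*x'*K*x - F'*x.
K = [3 1; 1 2];                 % assumed stiffness matrix (symmetric positive definite)
F = [1; 0.5];                   % assumed load vector
n = [-1; 0];                    % assumed obstacle normal: c(x) = g + n'*x = g - x(1)

xunc = K\F;                     % unconstrained minimum of the total energy

gaps = [0.5, 0.1];              % two assumed gap values g
isActive = false(size(gaps));
for i = 1:numel(gaps)
    cval = gaps(i) + n'*xunc;   % constraint value at the unconstrained minimum
    isActive(i) = cval < 0;     % negative: the minimum is infeasible, the constraint is active
    fprintf('g = %.2f: c(xunc) = %+.3f, active = %d\n', gaps(i), cval, isActive(i));
end
```

With the larger gap the obstacle is never reached and the constraint is inactive; with the smaller gap the unconstrained minimum would penetrate the obstacle, so the constraint is active.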
A suitable choice of the penalty parameters α1 > 0, α2 > 0 will ensure that the penalty term is
properly scaled with respect to the total potential energy T E. Furthermore, by taking the values of
the penalty parameters to the limit we can theoretically ensure that the solution of the unconstrained
problem coincides with that of the constrained problem. Figure 11.12 shows the total potential energy
for the two-degree-of-freedom truss system with the penalty term above included to approximately
enforce the contact constraint. Note how the surface of the total potential energy has a minimum
just adjacent to the contact surface.
Fig. 11.12. Constrained static two-degree of freedom system. Objective function: Total potential energy
plus a penalty term for the contact constraint.
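The idea of replacing the constrained problem by an unconstrained one can be sketched as follows. The form of the penalty term used here, α1 exp(−α2 c(x)), is an assumption made in the spirit of the exponential penalties in the truss example later in this chapter; the system data and the parameter values are hypothetical.

```matlab
% Hypothetical two-degree-of-freedom system with a contact constraint.
K = [3 1; 1 2];                         % assumed stiffness matrix
F = [1; 0.5];                           % assumed load vector
TE = @(x) 0.5*x'*K*x - F'*x;            % total potential energy
g = 0.1;  n = [-1; 0];                  % c(x) = g + n'*x >= 0, i.e. x(1) <= 0.1
c = @(x) g + n'*x;

alpha1 = 1e-3;  alpha2 = 200;           % ASSUMED penalty parameters
penalized = @(x) TE(x) + alpha1*exp(-alpha2*c(x));   % ASSUMED penalty form

opts = optimset('Display', 'off', 'TolX', 1e-8, 'TolFun', 1e-10, ...
                'MaxFunEvals', 5000, 'MaxIter', 5000);
xpen = fminsearch(penalized, [0.05; 0.1], opts);     % unconstrained search
xunc = K\F;                                          % minimum without the obstacle

fprintf('unconstrained: x = [%.4f, %.4f], c = %+.4f\n', xunc(1), xunc(2), c(xunc));
fprintf('penalized:     x = [%.4f, %.4f], c = %+.4f\n', xpen(1), xpen(2), c(xpen));
```

The penalized minimum sits very close to the contact surface with a small negative value of c, that is, the constraint is violated slightly. Increasing α2 tightens the approximation at the cost of a steeper, harder to minimize objective.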
Fig. 11.13. Truss cantilever. Static configuration showing the live loads. Dynamic configuration showing
the added mass.
1 See: aetna/TrussCantilever/tcant_data.m
function [XY,en,A,E,rho,W,Widx,addM,addMidx,...
neqf,maxtipd,Lowestfreq] =tcant_data
% coordinates of the joints
XY = [7.0, 2.5; ...
6., 3.5; ...
7.0, -1.0; ...
6.0, 1.5; ...
0, -1; ...
0, 1.5]*1000;% mm
% Which joints are linked by the bars?
en= [5,3;3,1;6,4;4,2;1,2;3,4;1,4;3,6];
A= zeros(size(en,1),1)+pi*60*7;% cross-sectional areas mm^2
E= 70000;% Young’s modulus: aluminum, MPa
rho=2.700e-9;% mass density, 1000*kg/mm^3
W=6000;% Live load, N
Widx=[1, 2];% Live load at which joints?
addM = 0.096;% additional mass, 1000*kg
addMidx =[1];% Mass at which joints?
neqf=8;% number of free degrees of freedom
maxtipd =10;%mm, Maximum deflection magnitude
Lowestfreq =13;% Hertz
end
The original design of the truss does not satisfy the constraints. The truss cantilever structure
is shown in Figure 11.14. The largest deflection under the live load is ≈ 15.39 mm, and the lowest
frequency is ≈ 10.617 Hz.
[Fig. 11.14: truss cantilever, original design, with numbered joints and bars.]
An approximate minimum of the objective function is found by converting the original constrained
problem into an unconstrained one. The MATLAB simplex-search function fminsearch is used.2
options = optimset('Display','iter',...
'MaxFunEvals', 350000,...
'MaxIter', 3000,...
'TolX',1e-3,...
'TolFun', 1.0e-3);
[XY,en,A,E,rho,W,Widx,addM,addMidx,...
neqf,maxtipd,Lowestfreq] =tcant_data;
XY34 = XY([3,4],:);% initial guess: coordinates of joints 3 and 4
XY34 = fminsearch(@tcant_objective_function,XY34,options);
2 See: aetna/TrussCantilever/tcant_optimize.m
The objective function consists of two contributions, the normalized volume of the structure V,
and a penalty term for either the static deflection constraint (D), or a penalty term for the frequency
constraint (F) (or possibly both).3
function f =tcant_objective_function(XY34)
[XY,en,minA,E,rho,W,Widx,addM,addMidx,...
neqf,maxtipd,Lowestfreq] =tcant_data;
V =tcant_volume(XY34)/tcant_volume(XY([3,4],:));
Frequency =tcant_frequency(XY34);
F=0.0008*exp(-200*(Frequency-Lowestfreq)/Lowestfreq);
tipd=tcant_tip_deflection(XY34);
D=norm(0.0001*exp(-200*((maxtipd-abs(tipd))/maxtipd)),Inf);
f=V;% Initialize the value of the objective function
% f=f+F;% + term for frequency constraint (uncomment or comment out)
f=f+D;% + term for static deflection constraint (uncomment or comment out)
end
The penalty terms allow for a slight violation of the constraints (that is in the nature of the exponential
functions). The penalty coefficient values need some adjusting. A basic adjustment can be
based on how large a penalty we want to attach to a violation of, let's say, 1%. If you have
a few minutes to spare, run the script tcant_optimize_movie to watch the various shapes that the
structure assumes during the optimization.4
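To get a feel for the scaling, the frequency penalty term F from the objective function above can be tabulated for a few relative violations of the constraint. The list of violation levels is our own choice for illustration.

```matlab
% Frequency penalty from tcant_objective_function, as a function of the frequency.
Lowestfreq = 13;                                   % Hz, required lowest frequency
F = @(Frequency) 0.0008*exp(-200*(Frequency - Lowestfreq)/Lowestfreq);

% Relative shortfall of the frequency: negative = safety margin, positive = violation.
violation = [-0.01, 0, 0.01, 0.02];
freq = Lowestfreq*(1 - violation);
penalty = F(freq);

for i = 1:numel(violation)
    fprintf('violation %+5.1f%%: penalty = %.6f\n', 100*violation(i), penalty(i));
end
```

The penalty is added to the normalized volume, which equals 1 for the reference design. A 1% violation costs 0.0008 e^2, roughly 0.0059, so it is worthwhile only if it saves more than about 0.6% of the reference volume; each additional 1% of violation multiplies the penalty by e^2.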
Fig. 11.15. Truss cantilever. Shape optimized to maintain the lowest frequency constraint.
The truss cantilever structure optimized for lowest frequency of vibration is shown in Figure 11.15.
The largest deflection under the live load is approximately 12.46 mm, and the lowest frequency is
approximately 13.06 Hz. The structure volume with respect to the reference configuration is reduced
by approximately 25.7%. While the structure optimized for lowest frequency has a lower volume than
the one optimized for static deflection, it does not satisfy the deflection constraint.
The truss cantilever structure optimized for static deflection is shown in Figure 11.16. The largest
deflection under the live load is approximately 10.05 mm, and the lowest frequency is approximately
14.37 Hz. The structure volume with respect to the reference configuration is reduced by approxi-
mately 25.0%. More material is used in this structure, but it (almost, but in our engineering judgment
sufficiently) satisfies simultaneously the static deflection and the lowest frequency constraints.
3 See: aetna/TrussCantilever/tcant_objective_function.m
4 See: aetna/TrussCantilever/tcant_optimize_movie.m
Fig. 11.16. Truss cantilever. Shape optimized to maintain the static deflection constraint (which also
happens to satisfy the frequency constraint).
Suggested experiments
1. Modify the objective function so that it incorporates at the same time the static (deflection)
and the dynamic (frequency) constraints.