
Lecture Notes CME 108: Introduction to Scientific Computing

Cindy Orozco

Summer 2016
Abstract

These notes are designed for the course CME 108: Introduction to Scientific Computing (MATH 114) during Summer 2016. In addition to lecture notes from previous editions of the course (given by Prof. Eric Dunham), this material follows these references:

• U. M. Ascher, C. Greif, A First Course on Numerical Methods, SIAM, Philadelphia, PA, 2011.

• B. Bradie, A Friendly Introduction to Numerical Analysis, Pearson Prentice Hall, Upper Saddle River, NJ, 2006.

If you want some engineering applications, I recommend:

• S. Chapra, R. Canale, Numerical Methods for Engineers, McGraw-Hill Science, 6th edition, 2009.

• A. Antoniou, W. Lu, Practical Optimization: Algorithms and Engineering Applications, Springer, New York, NY, 2007.

And for more specialized content:

• L. N. Trefethen, D. Bau, Numerical Linear Algebra, SIAM, Philadelphia, PA, 1997.

• J. Nocedal, S. J. Wright, Numerical Optimization, Springer, New York, NY, 2006.

• S. J. Wright, Coordinate Descent Algorithms, ArXiv e-prints, arXiv:1502.04759, Feb 2015. Provided by the SAO/NASA Astrophysics Data System.

• R. LeVeque, Finite Difference Methods for Ordinary and Partial Differential Equations: Steady-State and Time-Dependent Problems, SIAM, Philadelphia, PA, 2007.

• A. Vasy, Partial Differential Equations: An Accessible Route through Theory and Applications, AMS, Providence, RI, 2015.

If you find any typo or mistake, just send me an email at orozcocc@stanford.edu.

Enjoy the material!


Contents

1 Basic Concepts of Scientific Computing 4


1.1 Floating point arithmetic and Roundoff errors . . . . . . . . . . . . . 6
1.2 Errors related with Algorithms: Robustness and Efficiency . . . . . . 8
1.2.1 Robustness and stability . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Efficiency and Convergence . . . . . . . . . . . . . . . . . . . 10
1.3 Errors related with Discretization: Calculus tools . . . . . . . . . . . 11

2 Root-finding algorithms 15
2.1 Closed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 False Position . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Open methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Fixed Point iteration . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.3 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Minimizing a function in 1D . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Golden Section . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Parabolic Interpolation . . . . . . . . . . . . . . . . . . . . . 25

3 Linear systems of equations 27


3.1 Linear Algebra Review . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 Singular Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.2 Famous matrices . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.3 Nice Factorizations . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.4 Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Direct Methods: Gauss Elimination . . . . . . . . . . . . . . . . . . . 31
3.2.1 Number of flops . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Vector and Matrix Norms . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Condition number . . . . . . . . . . . . . . . . . . . . . . . . 39

3.4 Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 Nonlinear systems of equations 45


4.1 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Non-Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Unconstrained Optimization . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Steepest descent . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 Line Search methods . . . . . . . . . . . . . . . . . . . . . . . 55

5 Interpolation 60
5.1 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.1 Monomial Interpolation . . . . . . . . . . . . . . . . . . . . . 61
5.1.2 Lagrange interpolation . . . . . . . . . . . . . . . . . . . . . . 62
5.1.3 Newton Interpolation . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Piecewise Polynomial Interpolation . . . . . . . . . . . . . . . . . . . 66
5.2.1 Runge’s Phenomenon . . . . . . . . . . . . . . . . . . . . . . 66
5.2.2 Piecewise linear interpolation . . . . . . . . . . . . . . . . . . 68
5.2.3 Cubic Spline Interpolation . . . . . . . . . . . . . . . . . . . . 69
5.3 Some additional comments . . . . . . . . . . . . . . . . . . . . . . . 70

6 Numerical Integration 72
6.1 Newton-Cotes Formulas . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1.1 Composite Newton-Cotes formula . . . . . . . . . . . . . . . 74
6.1.2 Error of Newton-Cotes formulas . . . . . . . . . . . . . . . . 75
6.2 Gaussian Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Some additional comments . . . . . . . . . . . . . . . . . . . . . . . 78

7 Numerical Differentiation 79
7.1 Numerical Differentiation using Taylor series . . . . . . . . . . . . . 80
7.2 Numerical Differentiation using Interpolation . . . . . . . . . . . . . 82
7.3 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 84

8 Initial Value Ordinary Differential Equations 85


8.1 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Properties of the methods . . . . . . . . . . . . . . . . . . . . . . . . 86
8.3 Higher order methods . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.4 Systems of equations, Stiffness . . . . . . . . . . . . . . . . . . . . . 93

9 Boundary Value Ordinary Differential Equations 96


9.1 Shooting Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1.1 The linear case . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.1.2 Other types of Boundary Conditions . . . . . . . . . . . . . . 98

9.2 Finite Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2.1 Other types of Boundary Conditions . . . . . . . . . . . . . . 101

10 Solution of Partial Differential Equations 103


10.1 Elliptic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10.2 Parabolic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

11 Introduction to Finite Elements 111


11.1 Mathematical Principle behind it: Distributions . . . . . . . . . . . . 112
11.2 The Variational Formulation: Weak form vs. Strong form . . . . . 113
11.3 Discretization: Galerkin’s method . . . . . . . . . . . . . . . . . . . . 115
11.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

Chapter 1

Basic Concepts of Scientific Computing

Sometimes there is the misconception that the most reliable way to solve a problem is to use math and computers. The principal issue is that if we do not keep track of the sources of error, things can go terribly wrong. For example, in 1996 the European Space Agency's launch of the Ariane 5 rocket failed. The reason: a 16-bit integer overflow.

In order to keep track of what can go wrong when solving a problem, first we need to understand all the decisions made along the way and, more importantly, all the implications and limitations we are subject to. To illustrate the thinking process, let us introduce the following example:

It is a sunny Sunday morning in your grandparents' home. You are thinking that it is nice to see all the family together, but you really need to go back to your office to finish that report, when, all of a sudden, you hear a gigantic crash: your nephew just broke Grandpa's pendulum clock. Since you took carpentry classes last summer, the family unanimously decides that you are in charge of fixing the clock. What would you do?

Well, if the carpentry classes paid off, you would definitely know how much wood you need for the case, but you probably would not have any idea about the appropriate arm length that achieves the precision of the pendulum. Therefore, let us try to model the clock as a simple pendulum. Given the drag coefficient b and the natural frequency ω = √(g/L), where L is the arm length and g is the gravitational acceleration, the angle of oscillation θ with the vertical at time t can be described by the second order differential equation:

Mathematical Model:   θ̈ + bθ̇ + ω² sin θ = 0   (1.1)

where we use θ̇ = dθ/dt and θ̈ = d²θ/dt².

Notice that this is a nonlinear equation in θ, because of the term ω² sin(θ). If you try for a couple of minutes, you will notice it is really difficult to solve analytically. Therefore it is preferable to discretize it. This means that we will choose a finite number of times t at which we will approximate the solution θ. Using these values, we approximate the derivatives using sums and divisions:

Discretization:   (θᵢ₊₁ + θᵢ₋₁ − 2θᵢ)/∆t² + b (θᵢ₊₁ − θᵢ₋₁)/(2∆t) + ω² sin θᵢ = 0   (1.2)
where θᵢ = θ(tᵢ) for i = 0, 1, ..., N. After we specify some initial and boundary conditions, we can rewrite (1.2) as a nonlinear system of equations:

KΘ + ω² sin Θ = 0   (1.3)

where K ∈ R^{n×n} is a matrix of coefficients and Θ = (θ₁, ..., θₙ) is the vector of unknowns. Notice that another way to see the problem is to minimize the difference between KΘ and −ω² sin Θ. Therefore (1.3) is equivalent to finding Θ such that:

min_{Θ∈Rⁿ} ||KΘ + ω² sin Θ||₂²   (1.4)

To solve this minimization problem we use an iterative algorithm along feasible points, each time decreasing the value of the objective function. At each iteration we find a feasible point by solving a linear system of the form Ax = b. To solve this system we factorize A into lower and upper triangular matrices and then use forward and backward substitution to find the result.

Notice that in solving this simple problem we used a lot of different techniques: finite differences, numerical optimization and numerical linear algebra. This also illustrates the way problems are solved using Scientific Computing.

In Figure 1.1 we can see the process, starting from a problem in the real world all the way to implementation. Associated with each step there is a different type of error (notice the red text). There are three levels of relevance of these errors:

• Error relative to each field: The tolerance of our method needs to include
measurement errors and the consequences of assuming certain physical model.

Figure 1.1: Different steps in Scientific Computing

• Numerical analysis errors: When we decide to solve a problem using a certain method, we usually make two approximations: we solve for a finite number of points (discretization - truncation errors) and we iterate a finite number of times (rate and order of convergence).

• Implementation errors: First, we will always find bugs in the code. And second, we cannot represent numbers exactly in the computer. This type of error is called roundoff error.

Since we cannot avoid any of these three types of errors, we will start by discussing how to identify and quantify them. From now on we will use:

Definition 1. Let p̃ be an approximation for the value p. Then

Absolute error: |p̃ − p|        Relative error: |p̃ − p| / |p|   (1.5)

1.1 Floating point arithmetic and Roundoff errors


How do computers represent numbers? Which challenges do we have? Can we represent any number?

Computers represent numbers using the binary system. Each bit can take values {0, 1}, similar to a light bulb that goes on/off. A set of 8 bits makes a byte, and usually every language or architecture measures the amount of memory in bytes. We cannot represent every number because we have a finite amount of memory. Additionally, even if memory were not a problem, we have finite time to make computations: more memory used means more communication time and more computation time.

Therefore it is important to be aware of the type of number representation a programming language uses. For example, in MATLAB you can check the convention for integers and floating point numbers.

The floating point representation is based on the fact that, given a base β ∈ N, any x ∈ R can be represented uniquely as:

x = ±( d₀ + d₁/β + d₂/β² + d₃/β³ + ... ) × β^e,   where 1 ≤ d₀ < β, 0 ≤ dᵢ < β   (1.6)

Therefore any number x can be represented by:

• A sign ±

• An exponent e

• A sequence of naturals d0 , d1 , d2 ...

In a computer we can only store a limited number of the dᵢ. The maximum number of dᵢ's a computer can store is called the precision t.

Definition 2. Given x ∈ R, its floating point representation fl(x) is the truncated sum of (1.6), i.e.

fl(x) = ±( d₀ + d₁/β + d₂/β² + ... + d_{t−1}/β^{t−1} ) × β^e = ±( m/β^{t−1} ) × β^e   (1.7)

where the integer m, with β^{t−1} ≤ m ≤ β^t − 1, is called the mantissa.

A better measure of the tolerance is the rounding unit:

ε_machine = (1/2) β^{1−t}   (1.8)

which represents the maximum error made when representing a number between 1 and its closest right neighbor in this representation.

Notice that if β = 2 and x ≠ 0, then it must be the case that d₀ = 1. Therefore we do not need to store d₀, and we gain an additional position. For example, double precision in MATLAB uses 8 bytes, of which 52 bits represent the mantissa. Therefore t = 53 and ε_machine = 2⁻⁵³ ≈ 1.1 × 10⁻¹⁶.

Why is it important to know ε_machine? The floating point representation satisfies the following two properties:

• Relative error: If x is within the range of the system, then

|x − fl(x)| ≤ |x| ε_machine   (1.9)

• Definition of operations: For any arithmetic operation ∗, its floating point analogue ⊛ is defined by:

fl(x) ⊛ fl(y) = fl( fl(x) ∗ fl(y) )   (1.10)

Therefore, for any x, y that can be represented in the system:

|x ∗ y − fl(x) ⊛ fl(y)| ≤ |fl(x) ∗ fl(y) − fl(x) ⊛ fl(y)| + |x ∗ y − fl(x) ∗ fl(y)|   (1.11)

where the first term on the right is the introduced error and the second is the propagated error.

The introduced error is related to the representation of the number. If all the floating point numbers we can represent have magnitude in the interval [L, U], we can have underflow if |x| < L (and the computer will assume x = 0) or overflow if |x| > U (and the computer will assume x = ∞).

The propagated error is related to how the operations change the initial error, i.e. whether the error gets magnified or we get catastrophic cancellation. For example, if x ≫ y then fl(x) ⊕ fl(y) = fl(x), and if x ≈ y then fl(x) ⊖ fl(y) = 0.

Try the following exercises:

• In MATLAB, let x = 1/7. What is the result of x + x + x + x + x + x + x == 1? Is it what you expected?

• Operations are no longer associative. Suppose the representation system

x = ±( d₀ + d₁/10 + d₂/100 ) × 10^e,   where 1 ≤ d₀ ≤ 9, 0 ≤ dᵢ ≤ 9

Let x = −1.01, y = 1.00 and z = 1.00e−3. What is the value of (x + y) + z? What is the value of x + (y + z)?
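A minimal MATLAB sketch of the first exercise (this code is ours, not part of the original notes):

x = 1/7;
s = x + x + x + x + x + x + x;   % seven rounded additions
disp(s == 1)                     % on most machines this prints 0 (false)
disp(abs(s - 1))                 % the gap is a small multiple of eps
disp(eps)                        % eps = 2^(-52), the spacing between 1 and the next
                                 % double, i.e. twice the rounding unit 2^(-53) above

Since 1/7 has no finite binary expansion, fl(1/7) ≠ 1/7, and each addition rounds again, so the sum need not be exactly 1.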

1.2 Errors related with Algorithms: Robustness and Efficiency
Definition 3. An algorithm is a sequence of steps to perform a task.

People cared about algorithms even before computers existed. To compare algorithms, we analyze how close the approximate solution is to the real one, and how expensive it is to get to the solution. Designing an algorithm is like designing a building for an earthquake:

You always need a trade-off between admissible loads (robustness) and cost (efficiency). Old buildings are designed to admit large loads during a seismic event without any visible deformation. Nevertheless, their cost is five times that of a new building. This does not mean new buildings will collapse in any earthquake, but they are more likely to show visible damage.

The same happens when choosing an algorithm to solve a problem. Since we do not know the real solution in advance (i.e. the real earthquake), we need to know how long an algorithm takes to solve a problem of a certain type when the inputs lie within a certain threshold.

1.2.1 Robustness and stability

Suppose that, given a real problem, we can get the real solution as a function f of the input data x, i.e. real solution = f(x). Similarly, if we neglect errors in the input, we can define the result of the algorithm as another function of x, i.e. approx solution = f̃(x). If we knew the real solution, we could measure how close both solutions are.
Definition 4. A method is accurate if |f(x) − f̃(x)| is sufficiently small.

Since in reality we do not know the real solution, we would like to know under which conditions the method gives a "nearly correct" solution, i.e. we want to analyze the robustness of the method. Instead of talking about accuracy, we talk about stability.

Definition 5. A method is stable if, given two inputs x and x̃ that are sufficiently close (i.e. |x − x̃|/|x| = O(ε)), then |f̃(x) − f(x̃)|/|f(x̃)| is sufficiently small. Moreover, we say a method is backward stable if the result of the algorithm on the input x, f̃(x), is equal to the real solution of the problem for a perturbed input, f(x̃).

For example, when solving Ay = b for y, we can identify the input A and the real solution y = f(A) = A⁻¹b. If my algorithm is f̃, then we can call the result ỹ = f̃(A). Backward stability says that ỹ = f(Ã) = Ã⁻¹b for some Ã close to A.

Notice that stability does not depend only on the method; it also depends on the mathematical problem. If the problem does not behave "nicely", then it does not matter how stable the method is: we cannot ensure a desirable solution. Think of solving Ay = b when A is a singular matrix: depending on b, we can have infinitely many solutions or none. How can the algorithm select the solution we are looking for?

To avoid these issues, it is better to work with well-conditioned problems: "problems that behave nicely".

Definition 6. A problem is well conditioned if:

• The solution exists

• The solution is unique

• The solution depends continuously on the data: small changes in the data produce small changes in the solution.

1.2.2 Efficiency and Convergence

Usually, algorithms in numerical analysis involve many iterations. Therefore, to measure efficiency we need to count the cost per iteration and the number of iterations we take. To count the cost per iteration we count flops = the number of elementary operations performed (+, −, ×, /).

To measure the number of iterations an algorithm takes, we talk about convergence.

Definition 7. We say a sequence pₙ converges to p, i.e. pₙ → p, if

lim_{n→∞} pₙ = p   (1.12)

We say an algorithm converges if the sequence of partial results converges.

Once we know the algorithm converges to the correct result, then we want to
know the speed of convergence. For this we have two measures:

Definition 8. We say a method pₙ has rate of convergence β(n) if there exists a constant K < ∞ such that

|pₙ − p| ≤ K|β(n)|,   which is equivalent to   |pₙ − p| = O(β(n))   (1.13)

Definition 9. We say a method pₙ has order of convergence α if there exists a constant 0 < λ < ∞ such that

lim_{n→∞} |pₙ₊₁ − p| / |pₙ − p|^α = lim_{n→∞} |eₙ₊₁| / |eₙ|^α = λ   (1.14)

For α = 1 we talk about linear convergence and for α = 2 about quadratic convergence.

Notice that O(β(n)) comes from asymptotic notation. Some useful definitions are

Definition 10. Let f, g be functions on R (or N), and fix a limit L = ∞ or 0 or .... We informally say:

• Bounded above asymptotically: f(x) = O(g(x)) ⟺ ∃M > 0 : |f(x)| ≤ M|g(x)| for x → L

• Dominated above asymptotically: f(x) = o(g(x)) ⟺ ∀M > 0 : |f(x)| ≤ M|g(x)| for x → L

• Bounded below asymptotically: f(x) = Ω(g(x)) ⟺ ∃M > 0 : |f(x)| ≥ M|g(x)| for x → L

• Dominated below asymptotically: f(x) = ω(g(x)) ⟺ ∀M > 0 : |f(x)| ≥ M|g(x)| for x → L

• Bounded above and below asymptotically: f(x) = Θ(g(x)) ⟺ ∃M₁, M₂ > 0 : M₁|g(x)| ≤ |f(x)| ≤ M₂|g(x)| for x → L

Here, ∃ means "there exists" and ∀ means "for all".

1.3 Errors related with Discretization: Calculus tools


If someone asks "Which problems can be solved using Numerical Analysis?", the answer is not straightforward. If we only count the problems for which there exists a known algorithm that gives the exact answer, then only a few problems in Linear Algebra and Optimization would be covered. But of course, these are not the only interesting problems. So what is the trick? ... Discretization.

Instead of working with arbitrary functions, we work with polynomials, which we know how to handle. And instead of solving for an infinite number of points, in practice it is enough to solve for certain points. Of course, these two techniques introduce errors, but once we know how their magnitude depends on the approximation, we can refine it until we get acceptable results.

The principal tool is

Theorem 1. Taylor's Theorem. If f ∈ C^{n+1}[a, b] (this means that f and all its derivatives up to the (n+1)-th are continuous in the interval [a, b]), then for any x and x₀ in [a, b] there exists ξ(x) between x and x₀ such that:

f(x) = Σ_{k=0}^{n} [f^{(k)}(x₀)/k!] (x − x₀)^k + [f^{(n+1)}(ξ(x))/(n+1)!] (x − x₀)^{n+1}   (1.15)

where the sum is the n-th Taylor polynomial and the last term is the remainder Rₙ. We can also express the remainder as:

Rₙ = (x − x₀)^{n+1} ∫₀¹ [(1 − s)^n / n!] f^{(n+1)}(x₀ + s(x − x₀)) ds   (1.16)

Example 1. What is the rate of convergence of (sin x)/x → 1 when x → 0?

Solution. The Taylor series of sin x around 0 is

sin x = x − x³/3! + x⁵/5! − (x⁶/6!) sin(ξ(x))

Therefore

lim_{x→0} (sin x)/x = lim_{x→0} [ 1 − x²/3! + x⁴/5! − (x⁵/6!) sin(ξ(x)) ] = 1

And moreover, when x → 0,

(sin x)/x − 1 = −x²/3! + x⁴/5! − (x⁵/6!) sin(ξ(x)) = O(x²)

Why O(x²)? Because of the three terms, x² is the one that goes to zero most slowly.

Example 2. Consider the iterative scheme to find x = √a:

xₙ₊₁ = (1/2)( xₙ + a/xₙ )

What is its order of convergence?

Solution. Notice that

xₙ₊₁ − √a = (1/2)( xₙ + a/xₙ ) − √a
          = ( xₙ² − 2√a xₙ + a ) / (2xₙ)
          = ( xₙ − √a )² / (2xₙ)


Assuming that xₙ → √a when n → ∞, then

lim_{n→∞} |xₙ₊₁ − √a| / |xₙ − √a|² = lim_{n→∞} 1/(2xₙ) = 1/(2√a)

Therefore the order of convergence is quadratic. Notice that

lim_{n→∞} |xₙ₊₁ − √a| / |xₙ − √a|^α =  0 if α = 1,   1/(2√a) if α = 2,   ∞ if α > 2.
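A quick MATLAB check of this quadratic convergence (a sketch of ours, not part of the original notes):

a = 2; x = 1;                                % approximate sqrt(2) starting from x0 = 1
for n = 1:5
    x = (x + a/x)/2;                         % the iteration of Example 2
    fprintf('n = %d   error = %.3e\n', n, abs(x - sqrt(a)))
end
% The number of correct digits roughly doubles at each step, as expected for
% quadratic convergence, until the error reaches machine precision.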

In addition to Taylor's theorem, we have other useful theorems.


Theorem 2. Intermediate Value. Suppose f ∈ C[a, b] and let s be between f(a) and f(b). Then there exists c ∈ [a, b] such that f(c) = s.

Figure 1.2: Intermediate value theorem

Theorem 3. Mean Value. Suppose f ∈ C[a, b] and f ∈ C¹(a, b). Then there exists c ∈ (a, b) such that

f'(c) = ( f(b) − f(a) ) / (b − a)

Theorem 4. Rolle's. Apply the mean value theorem to the case f(a) = f(b). Then there exists c ∈ (a, b) such that

f'(c) = 0

Figure 1.3: Mean value theorem

Theorem 5. Fundamental Theorem of Calculus. Suppose f ∈ C[a, b] and F is an antiderivative of f, i.e. F'(x) = f(x). Then

∫ₐᵇ f(x) dx = F(b) − F(a)   (1.17)

Theorem 6. Integration by parts. (It is not really a theorem, but it is really useful.) Assuming f, g ∈ C[a, b] and f, g ∈ C¹(a, b), then

∫ₐᵇ f'(x)g(x) dx = [ f(b)g(b) − f(a)g(a) ] − ∫ₐᵇ f(x)g'(x) dx   (1.18)

In almost every proof in Numerical Analysis we use one or more of these theorems. The key step is always related to integrating by parts or approximating a function with a Taylor series. Therefore you will really want to remember these theorems, not only for the course but for your career.

Chapter 2

Root-finding algorithms

Suppose that we want to find x such that f(x) = 0. There are two ways of solving it:

• Closed Methods: The root is enclosed in an interval. At each iteration we reduce the length of the interval. For example: Bisection.

• Open Methods: We approximate the function with a model. At each iteration we calculate the new iterate based on the previous one, i.e. xₙ = g(xₙ₋₁). For example: Newton's method, fixed point iteration.

Figure 2.1: Closed methods vs. Open methods

Closed methods are robust: we ensure that we are always reducing the level of uncertainty, so they always converge. In contrast, open methods are not that robust, but they are efficient. Since they approximate the function, they preserve more information about the model. If they converge, they converge faster than closed methods. The problem is that we cannot bound the error at each iteration, and that explains why they can diverge.

2.1 Closed Methods
2.1.1 Bisection
Notice that in general our function can have more than one root. Therefore the first step, regardless of the method we are using, is to select an appropriate interval [a, b] containing the desired root. We can do this by plotting the function.

If we select this interval such that f(a) and f(b) have different signs, by the Intermediate Value theorem there exists c ∈ [a, b] s.t. f(c) = 0. Therefore the simplest method is to reduce the interval [a, b] while always preserving that f(a) and f(b) have different signs. If we only use the sign of f, the healthiest option is to guess the midpoint xₙ = (aₙ + bₙ)/2. Depending on the sign of f(c) we update the interval, and we ensure that the error is at most the length of the interval.

Set xl = a, xr = b;
while |xr − xl| > tol do
  x = (xl + xr)/2
  if f(x) = 0 then
    return x
  end
  if f(x)f(xl) < 0 then
    xr = x
  else
    xl = x
  end
end
return x
Algorithm 1: Bisection Pseudo-code
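A minimal MATLAB translation of Algorithm 1 (our own sketch; the function name is hypothetical):

function x = bisection(f, a, b, tol)
% Assumes f(a)*f(b) < 0, so a root lies in [a, b].
xl = a; xr = b;
x = (xl + xr)/2;
while abs(xr - xl) > tol
    x = (xl + xr)/2;
    if f(x) == 0
        return
    end
    if f(x)*f(xl) < 0
        xr = x;    % the root is in [xl, x]
    else
        xl = x;    % the root is in [x, xr]
    end
end
end

For example, bisection(@(x) x.^2 + 3*x, -2, 1, 1e-8) converges to the root x = 0 of Figure 2.2.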

Notice that at each iteration the maximum error is reduced by half. Therefore, if {xₙ} is the sequence of iterates and x is the real root, we know that x, xₙ ∈ [aₙ, bₙ], and then:

|xₙ − x| ≤ (1/2)|bₙ − aₙ| = (1/2²)|bₙ₋₁ − aₙ₋₁| = ... = (1/2ⁿ⁺¹)|b − a|

Therefore the rate of convergence is O(2⁻ⁿ). Additionally, we can estimate the number of iterations needed to reach a tolerance |xₙ − x| < tol: it is enough that

n ≥ log₂( (b − a)/tol ) − 1   (2.1)

For example, with b − a = 1 and tol = 10⁻⁶ we need about 19 iterations.

Figure 2.2: Bisection for x² + 3x

Figure 2.3: Bisection for atan(x)

2.1.2 False Position
Notice that in bisection we only use the sign of the function. What would happen if we approximated the function with the line joining both end points? This method is called False Position. We update the interval, guessing that xₙ is the root of the straight line between (a, f(a)) and (b, f(b)).

Set xl = a, xr = b;
while |xr − xl| > tol do
  x = xr − f(xr)(xr − xl)/(f(xr) − f(xl))
  if f(x) = 0 then
    return x
  end
  if f(x)f(xl) < 0 then
    xr = x
  else
    xl = x
  end
end
return x
Algorithm 2: False Position Pseudo-code

Although this looks like a better alternative, we cannot ensure that the interval is reduced by as much as 1/2 each time. Therefore, depending on the function, its convergence can be faster or slower than Bisection. For example, consider a function that is almost flat and negative, and suddenly goes straight up near one of the end points. In such a case, the interval reduction using false position is less than using bisection.

Figure 2.4: Bisection is not always worse than False Position

2.2 Open methods
2.2.1 Fixed Point iteration
Open methods are based on expressing the next guess in terms of the previous ones. In general we have something of the form
xₙ₊₁ = g(xₙ)   (2.2)
Notice that if we reach the solution of f(x) = 0, we would like to stay there, i.e.
x = g(x)   (2.3)
This means that the solution x is a fixed point of g.

Set p0 ;
p1 = g(p0 ); n = 1
while |pn − pn−1 | > tol and n < max iterations do
pn+1 = g(pn );
n = n + 1;
end
return pn
Algorithm 3: Fixed Point Iteration Pseudo-code

How can we choose g if our problem is in terms of f?

There are many different options for choosing g so that x is a fixed point of it. For example we can take:

g₁(x) = x + f(x)     g₂(x) = x − 10f(x)²     g₃(x) = x − f(x)/f'(x)     ...

The problem will be to know whether, starting from a certain x₀, the sequence {xₙ = g(xₙ₋₁)} → x as n → ∞, i.e. whether it converges and at which speed.
Example 3. Check the implementation in class solving f(x) = x³ + x² − 3x − 3 = 0 using:

g₁(x) = (x³ + x² − 3)/3                        (isolating 3x and dividing by 3)
g₂(x) = −1 + (3x + 3)/x²                       (isolating x³ and dividing by x²)
g₃(x) = √( (3 + 3x − x²)/x )                   (isolating x³, dividing by x and taking the square root)
g₄(x) = x − (x³ + x² − 3x − 3)/(3x² + 2x − 3)  (Newton's method)

Do all of the schemes converge to the expected root? What is the order of convergence?
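The in-class implementation is not reproduced here, but a minimal fixed point loop looks like this (our own sketch), shown for g₄; g₁, g₂ and g₃ can be tested the same way:

f  = @(x) x.^3 + x.^2 - 3*x - 3;                             % roots: -1, -sqrt(3), sqrt(3)
g4 = @(x) x - (x.^3 + x.^2 - 3*x - 3)./(3*x.^2 + 2*x - 3);   % Newton's g
p = 2;                                                       % initial guess
for n = 1:10
    p = g4(p);                                               % fixed point step
end
disp(abs(p - sqrt(3)))           % essentially zero: fast (quadratic) convergence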

It is not easy to decide whether a given scheme and initial condition converge or not. Nevertheless, we have a theorem that helps us in a large number of cases.

Theorem 7. Fixed Point Theorem. If g ∈ C[a, b] and g ∈ C¹(a, b) are such that:
1. g : [a, b] → [a, b]  ("the range goes into the domain")
2. ∃k < 1 : |g'(x)| ≤ k < 1 for all x ∈ (a, b)  ("the function is not steep")
Then g has a unique fixed point p ∈ [a, b], and the sequence {pₙ = g(pₙ₋₁)} starting from any p₀ ∈ [a, b] converges to p, with |pₙ − p| = O(kⁿ).

Notice that if we have a function g that satisfies the two conditions of this theorem, we can conclude convergence from the theorem. But of course there are functions that do not satisfy the hypotheses and still converge.

How can we explain the rate of convergence? Notice that

|pₙ − p| = |g(pₙ₋₁) − g(p)|      using the fixed point definition
         = |g'(ξ)||pₙ₋₁ − p|     for ξ between pₙ₋₁ and p, using the mean value theorem
         ≤ k|pₙ₋₁ − p|           by hypothesis
         ≤ kⁿ|p₀ − p|            applying it n times

Therefore, what is the order of convergence? If pₙ → p, since ξ(pₙ) is between pₙ and p, then ξ(pₙ) → p, and

lim_{n→∞} |pₙ − p| / |pₙ₋₁ − p| = lim_{n→∞} |g'(ξ(pₙ))| = |g'(p)|   (2.4)

If |g'(p)| ≠ 0, then we have linear order of convergence.

If |g'(p)| = 0, we have a better order of convergence. Consider the Taylor series of g(pₙ) around p:

g(pₙ) = g(p) + g'(p)(pₙ − p) + [g''(p)/2!](pₙ − p)² + ... + [g^{(k)}(p)/k!](pₙ − p)^k + [g^{(k+1)}(ξ)/(k+1)!](pₙ − p)^{k+1}

Using the definition of fixed point and of the sequence {pₙ}:

pₙ₊₁ = p + g'(p)(pₙ − p) + [g''(p)/2!](pₙ − p)² + ... + [g^{(k)}(p)/k!](pₙ − p)^k + [g^{(k+1)}(ξ)/(k+1)!](pₙ − p)^{k+1}

Therefore, if all the first k derivatives of g at p are zero, i.e. g'(p) = g''(p) = ... = g^{(k)}(p) = 0 and g^{(k+1)}(p) ≠ 0, then

|pₙ₊₁ − p| = [|g^{(k+1)}(ξ)|/(k+1)!] |pₙ − p|^{k+1}   ⟹   lim_{n→∞} |pₙ₊₁ − p| / |pₙ − p|^{k+1} = |g^{(k+1)}(p)|/(k+1)!

and the order of convergence is k + 1.

2.2.2 Newton's method
Notice that if f'(xₙ) ≠ 0, we can approximate our function with the tangent line at the point (xₙ, f(xₙ)), and approximate the root as the zero of the line

y = f'(xₙ)(x − xₙ) + f(xₙ)

Therefore we can define Newton's method as a fixed point iteration:

xₙ₊₁ = xₙ − f(xₙ)/f'(xₙ)   (2.5)

Notice that in this case

g(x) = x − f(x)/f'(x)   and therefore   g'(x) = f(x)f''(x)/(f'(x))²

Let p be the fixed point and assume f'(p) ≠ 0. We have two results:

• Since g'(p) = f(p)f''(p)/(f'(p))² = 0 and g' is continuous, we know there exists δ such that |g'(x)| ≤ k < 1 if x ∈ [p − δ, p + δ]

• Since g'(p) = 0, we have at least quadratic order of convergence

Notice that the first result tells us that if we are close enough to the root, we can apply the fixed point theorem and ensure convergence. But "close enough" is hard to define.
Example 4. Consider the example in class f (x) = atan(x) that converges when
x0 = −1 but diverges when x0 = −2. Check the implementation.
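A minimal sketch of this experiment (our own code, not the in-class file):

f  = @(x) atan(x);
fp = @(x) 1./(1 + x.^2);
for x0 = [-1, -2]
    x = x0;
    for n = 1:8
        x = x - f(x)/fp(x);            % Newton step
    end
    fprintf('x0 = %g   x8 = %g\n', x0, x)
end
% From x0 = -1 the iterates approach the root 0; from x0 = -2 the
% iterates oscillate with growing magnitude and diverge.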

2.2.3 Secant method


If f is really complicated, or not even explicit, it may be almost impossible to compute f'(xₙ). Therefore in the secant method we approximate

f'(xₙ) ≈ ( f(xₙ) − f(xₙ₋₁) ) / ( xₙ − xₙ₋₁ )   (2.6)

and the iteration is given by

xₙ₊₁ = xₙ − f(xₙ)( xₙ − xₙ₋₁ ) / ( f(xₙ) − f(xₙ₋₁) )   (2.7)

Notice that this iteration looks similar to the update in the False Position method. The difference is that in False Position we take the line that crosses both extremes of the interval [a, b], for which we know the function values have different signs. In the secant method we take the line between the last two guesses, which we do not know a priori to have different signs or not. Therefore the root is not necessarily between them.

Figure 2.5: Newton for x² + 3x

Figure 2.6: False Position vs. Secant

2.3 Minimizing a function in 1D
Up to now, we have talked about solving nonlinear equations in 1D. This means that, given a problem where we need to find x ∈ R satisfying an equation, we can solve it if we express it as f(x) = 0. We have talked about two strategies: enclosing the root (using bisection) or approximating the model (using a fixed point iteration, such as Newton). We have seen that the first type of method is robust, whereas the second type is faster if it converges to the solution. And depending on how much information about the function we know (its sign, value or derivative), our approximation has more or fewer properties.

Now let us talk about optimizing a function in 1D. In general, this means that we need to find the x̂ that minimizes φ(x):

x̂ = argmin_{x∈R} φ(x)   (2.8)

What about maximizing a function ψ(x)? Just take φ(x) = −ψ(x).

How can I identify a minimum point?

Recall that when we were finding the root of a function, we could check whether we were approaching the solution, or at least measure the error, by evaluating our guess xₙ in f(x). Since we were looking for x such that f(x) = 0, we could say that a point xₙ₊₁ was better than xₙ if |f(xₙ₊₁)| < |f(xₙ)|, and moreover we could stop if |f(xₙ)| was sufficiently small.

But when we are minimizing a function, we can certainly check whether f(xₙ₊₁) < f(xₙ), but we cannot define a stopping criterion based only on the function value, since we generally do not know whether there exists another point with an even smaller value. Therefore we need to evaluate the derivative at the point.

If we want to minimize a φ ∈ C²[a, b]:

Definition 11. We say that x* ∈ (a, b) is a critical point if

φ'(x*) = 0   (2.9)

Using the Taylor series we can approximate φ(x) by a quadratic model:

φ(x* + ∆x) = φ(x*) + φ'(x*)∆x + [φ''(x*)/2]∆x² + O(∆x³)   (2.10)

Therefore, for sufficiently small ∆x, we can see that the approximation has a minimum or a maximum depending on φ''(x*):

Definition 12. Given a critical point x*:

• If φ''(x*) > 0 then x* is a local minimizer

• If φ''(x*) < 0 then x* is a local maximizer

• If φ''(x*) = 0 then we need more information to decide whether it is a minimizer, a maximizer or a saddle point.

Therefore, setting f(x) = φ'(x), to minimize φ(x) we need to find x̂ such that not only f(x̂) = 0 but also f'(x̂) > 0. Notice also that finding a global minimizer x̃ is much more difficult, because we also require φ(x) ≥ φ(x̃) ∀x ∈ [a, b]. Therefore, unless we know our function is of a certain type (e.g. convex), we only look for local minimizers.

How can we find a local minimizer?

One way is to solve f(x) = 0 for f(x) = φ'(x) using either closed or open methods, and once we find x* such that f(x*) ≈ 0, verify whether f'(x*) = φ''(x*) > 0.

If we have the information of φ, φ', φ'', we can apply Newton's method, and to ensure convergence we can use a hybrid method that also reduces the interval of uncertainty. If we do not know f'(x) = φ''(x), we can approximate it as in the secant method. We can also perform bisection on f(x) and approximate f'(x) to check the final result.

But what can we do if we do not even know f(x) = φ'(x)?

Notice that in root-finding we always needed 2 pieces of information to reconstruct the model, either f(a), f(b) or f(xₙ), f'(xₙ), because we model the function as a straight line. Straight lines do not have local minimizers. Instead we use a parabola, and therefore we need three pieces of information: either φ(xₙ), φ'(xₙ), φ''(xₙ), or φ(xₙ), φ'(aₙ), φ'(bₙ), or φ(a), φ(b), φ(c).

2.3.1 Golden Section

When we have three points a < c < b, we can ensure there is a (local) minimum in [a, b] if we have the pattern "high, low, high". If we start with only three points, and we iterate to reduce the interval of uncertainty (similarly to bisection), how can we do it efficiently? It is no longer as easy as evaluating the midpoint as in bisection.

Notice that we need a fourth point in order to choose a subinterval. If we do not distribute our points efficiently, we will end up evaluating 2 internal points at each iteration. The special distribution that locates our points optimally turns out to be given by the golden ratio ϕ ≈ 1.618. How can we see it?

Figure 2.7: The "high, low, high" pattern; but how can we reduce the interval to choose either the red or the green model?

Suppose we have the interval [aₙ, bₙ] with internal points cₙ < dₙ. First, we would like the points to be symmetric, i.e. cₙ − aₙ = bₙ − dₙ. Therefore we can write cₙ = aₙ + α(bₙ − aₙ) and dₙ = aₙ + (1 − α)(bₙ − aₙ).

Second, we would like to reuse the point that is enclosed by the subinterval we choose. Without loss of generality, assume we choose the subinterval [aₙ, dₙ], so cₙ is its internal point. Then we would like

(cₙ − aₙ)/(dₙ − aₙ) = (dₙ − aₙ)/(bₙ − aₙ)
α(bₙ − aₙ)/((1 − α)(bₙ − aₙ)) = (1 − α)(bₙ − aₙ)/(bₙ − aₙ)
α = (1 − α)²
0 = α² − 3α + 1
α = 1 + (1 − √5)/2 = 1 − 1/ϕ
1 − α = (−1 + √5)/2 = 1/ϕ

Set a, b
α = 1 − 1/ϕ
c = a + α ∗ (b − a); φc = φ(c)
d = a + (1 − α)(b − a); φd = φ(d)
while |b − a| > tol do
if φc < φd then
b=d
d = c; φd = φc
c = a + α(b − a); φc = φ(c)
else
a=c
c = d; φc = φd
d = a + (1 − α)(b − a); φd = φ(d)
end
end
return c
Algorithm 4: Golden Section Pseudo-code

2.3.2 Parabolic Interpolation

Similarly to the bisection case, with the golden section we only use the function values to decide whether a point remains or not, but we do not use them to estimate the next iterate. When we were finding roots of the function, we introduced the method of false position to estimate the next iterate by approximating the function with a straight line. Now we can use a similar approach and use three points to estimate a quadratic model for the minimizer.

Through 3 distinct points there passes a unique parabola. If the general equation of the parabola is

ψ(x) = αx² + βx + γ

we want to find the α, β, γ that define the parabola through (xₙ₋₂, f(xₙ₋₂)), (xₙ₋₁, f(xₙ₋₁)), (xₙ, f(xₙ)). Therefore they should satisfy:

f(xₙ₋₂) = αxₙ₋₂² + βxₙ₋₂ + γ
f(xₙ₋₁) = αxₙ₋₁² + βxₙ₋₁ + γ
f(xₙ)   = αxₙ² + βxₙ + γ

If we write this in matrix form, we have

[ xₙ₋₂²  xₙ₋₂  1 ] [ α ]   [ f(xₙ₋₂) ]
[ xₙ₋₁²  xₙ₋₁  1 ] [ β ] = [ f(xₙ₋₁) ]   (2.11)
[ xₙ²    xₙ    1 ] [ γ ]   [ f(xₙ)   ]

Solving this linear system, we find the coefficients of the quadratic model and, moreover, its minimizer:

ψ'(x) = 2αx + β

If α > 0, we take xₙ₊₁ = −β/(2α), which is the minimum of the quadratic model.
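One step of this idea in MATLAB (a sketch; the function and points are ours):

phi = @(x) x.^2 + 3*x;               % function to minimize (minimum at x = -1.5)
xs  = [-2; 0; 1];                    % three current points
V   = [xs.^2, xs, ones(3,1)];        % the matrix of (2.11)
c   = V \ phi(xs);                   % c = [alpha; beta; gamma]
if c(1) > 0
    xnew = -c(2)/(2*c(1));           % vertex of the fitted parabola
    disp(xnew)                       % here phi is itself quadratic, so xnew = -1.5 exactly
end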

Chapter 3

Linear systems of equations

3.1 Linear Algebra Review


3.1.1 Singular Matrices
The goal of this review is to emphasize some key concepts that will determine the success of algorithms for solving linear systems of equations.

Suppose that we have a linear system of equations:

a11 x1 + a12 x2 + ... + a1m xm = b1


a21 x1 + a22 x2 + ... + a2m xm = b2
...
an1 x1 + an2 x2 + ... + anm xm = bn

Each of the equations represents a hyperplane in R^m, and the solution is the intersection of all of them. Notice that depending on the equations, there can be a unique solution (a single point), infinitely many solutions (an affine subspace of dimension at least 1) or no solution. We can also express the system as

     [ a11 a12 ... a1m ] [ x1 ]   [ b1 ]
Ax = [ a21 a22 ... a2m ] [ x2 ] = [ b2 ] = b   (3.1)
     [ ...         ... ] [ ...]   [ ...]
     [ an1 an2 ... anm ] [ xm ]   [ bn ]

And using the geometric intuition, now we know that not every system Ax = b
where A ∈ Rn×m and b ∈ Rn has a unique solution in Rm .

We can also see the system as a linear combination of the column vectors of A:

 m          [ a11 ]      [ a12 ]            [ a1m ]
 Σ aⱼxⱼ =   [ a21 ] x1 + [ a22 ] x2 + ... + [ a2m ] xm = b   (3.2)
j=1         [ ... ]      [ ... ]            [ ... ]
            [ an1 ]      [ an2 ]            [ anm ]

Another way to understand Ax = b is to see it as a linear transformation T : R^m → R^n that transforms vectors from R^m to R^n. There are two important sets for any transformation:

Definition 13. Let A ∈ R^{n×m} be a linear transformation. Then:

• The range of A, range(A) ⊂ R^n, is the image of the transformation, i.e.

range(A) = {y ∈ R^n | ∃x : Ax = y}

or equivalently, y ∈ range(A) ⟺ y is a linear combination of the column vectors of A.

• The nullspace of A, null(A) ⊂ R^m, is the kernel of the transformation, i.e.

null(A) = {x ∈ R^m | Ax = 0}

Therefore, given a system Ax = b:

• If b ∉ range(A), i.e. b is not a linear combination of the columns of A, then the system does not have a solution.

• If dim(null(A)) > 0, then if x is a solution (i.e. Ax = b), for every x̃ ∈ null(A), x + x̃ is also a solution of the system.

Definition 14. For a square matrix A ∈ R^{n×n}, we say that the system Ax = b has a unique solution if A is invertible, which is equivalent to:

• dim(null(A)) = 0

• range(A) = Rⁿ

• The columns of A are linearly independent (similarly for the rows)

• det(A) ≠ 0

• There exists A⁻¹ such that AA⁻¹ = A⁻¹A = I

A matrix that is not invertible is called singular.

Why are we concerned about A being singular or not?

We know that if A is invertible, we will have a unique solution. Otherwise, if A is singular, we will have either infinitely many solutions or none.

Now the problem is that there exist matrices that are invertible but nearly singular. For example, consider

A = [ 1+ε  1 ]
    [ 3    3 ]

For ε > 0 the matrix is invertible, with

A⁻¹ = (1/(3ε)) [ 3   −1  ]
               [ −3  1+ε ]

But when ε = 0, A is singular and we can no longer compute A⁻¹. Small perturbations, i.e. changes in ε (for example due to roundoff errors), produce huge changes in the solution. These problems are called ill-conditioned problems (recall the discussion in Chapter 1 about well-conditioned problems and stability).

3.1.2 Famous matrices

• Diagonal

• Triangular (its eigenvalues are on the diagonal)

• Symmetric: Aᵀ = A

• Positive definite: ∀x ≠ 0, xᵀAx > 0

• Positive semi-definite: ∀x, xᵀAx ≥ 0

• Orthogonal: QᵀQ = I. Its column vectors form an orthonormal set:

qᵢᵀqⱼ = 0 if i ≠ j (orthogonal vectors), and qᵢᵀqⱼ = 1 if i = j (2-norm equal to 1)

If Q is square, then we also have QQᵀ = I; otherwise this is not true.

• Sparse: most of its elements are zero. Instead of O(n²) nonzero entries, it usually has O(n).
3.1.3 Nice Factorizations


Working with any A can be exhausting and messy. Therefore we prefer to factorize
A as a product of matrices, each one with nice properties:

• Eigenvalue Decomposition: A = XΛX⁻¹, with X invertible (its columns are eigenvectors) and Λ diagonal (eigenvalues). Exists only when the algebraic multiplicity equals the geometric multiplicity for every eigenvalue. It is used for Aᵏ = XΛᵏX⁻¹.

• Jordan Form: A = XJX⁻¹, with X invertible and J block diagonal. Any matrix.

• Unitary Diagonalization: A = QΛQ*, with Q unitary (eigenvectors) and Λ diagonal (eigenvalues). Only normal matrices (AA* = A*A).

• Schur Decomposition: A = QTQ*, with Q unitary and T (upper) triangular. Any matrix; eigenvalue revealing.

• SVD: A = UΣVᵀ, with U, V unitary and Σ diagonal, σᵢ ≥ σᵢ₊₁ ≥ 0. Any matrix; reveals the behavior of the transformation. U ∈ R^{m×m} (full) or U ∈ R^{m×n} (thin).

• QR: A = QR, with Q orthogonal and R upper triangular. Ax = b ⟹ Rx = Qᵀb (solved using backward substitution).

• LU: A = LU, with L lower triangular (lᵢᵢ = 1) and U upper triangular. det(U) = det(A). To solve Ax = b: Ly = b, then Ux = y.

• Cholesky: A = RᵀR, with R upper triangular. Stable if A is positive definite.

Table 3.1: Useful factorizations of a matrix A

3.1.4 Eigenvalues
In problems where we want to find stable solutions of physical systems, we usually need to evaluate Aᵏ. If A is dense, each matrix-matrix product costs O(n³) operations. Therefore computing A···A k times is too expensive (O(kn³)), and in addition we accumulate roundoff errors.

Nevertheless, for a certain type of matrices, diagonalizable matrices, we can express A = XΛX⁻¹ where Λ is diagonal. In that case Aᵏ = XΛᵏX⁻¹ costs O(n³) for any k.

In general, if for λ ∈ R there exists x ≠ 0 such that

Ax = λx

we say that λ is an eigenvalue of A, and x is a corresponding eigenvector.

Notice that finding the eigenvalues of A is equivalent to finding the roots of the characteristic polynomial ρ_A(λ):

Ax = λx ⟹ ρ_A(λ) = det(A − λI) = 0

And we define two types of multiplicity for an eigenvalue:
• Algebraic: multiplicity of the root in ρ_A(λ)
• Geometric: dim(E_λ), where E_λ = {x ∈ Rⁿ | Ax = λx}

We know that algebraic multiplicity ≥ geometric multiplicity. If we have equality for all the eigenvalues, then A is diagonalizable.

Do you think we should find the eigenvalues by finding the roots of ρ_A(λ)? Not really. Usually the matrices we are interested in belong to R^{n×n} where n is large, so the characteristic polynomial has order n too. And in general, finding the roots of high order polynomials is really unstable. Therefore, to find eigenvalues and eigenvectors we use iterative methods (for example, the power method). These methods are outside the scope of this course.

3.2 Direct Methods: Gauss Elimination


Given a system Ax = b, we can create the augmented version

[A|b] = [ a11 a12 ... a1m | b1 ]
        [ a21 a22 ... a2m | b2 ]
        [ ...             | ...]
        [ an1 an2 ... anm | bn ]   (3.3)

Depending on A, this system can be really easy or really hard to solve. For example, if A is diagonal, we only need to divide bⱼ by aⱼⱼ to find xⱼ.

Another relatively easy type of matrix is the lower or upper triangular matrix. For example, with an upper triangular matrix in R^{n×n}, every variable xₖ only depends on the following ones, xₖ₊₁, ..., xₙ₋₁, xₙ. Therefore we can move backward, starting from xₙ, then xₙ₋₁, and so on. This method is called Backward Substitution.

In general, the goal of Gauss Elimination is first reduce any matrix A into an
upper triangular form and then solve the system using backward substitution. To
do this we use the following three rules:

• ri ↔ rj Interchange rows i and j

• ri ← αri Multiply a row by a scalar

• ri ← ri + βrj Add a multiple to the row j to the row i.

Let us consider the system

[A|b] = [ 2  1  0 | 2 ]
        [ −1 1 −1 | 1 ]
        [ −3 4 −4 | 2 ]

and apply Gaussian elimination using only the third rule:

[ 2  1  0 | 2 ]                     [ 2  1    0  | 2 ]
[ −1 1 −1 | 1 ]  r2 ← r2 + (1/2)r1  [ 0  3/2  −1 | 2 ]
[ −3 4 −4 | 2 ]  r3 ← r3 + (3/2)r1  [ 0  11/2 −4 | 5 ]

                                    [ 2  1   0    | 2    ]
                 r3 ← r3 − (11/3)r2 [ 0  3/2 −1   | 2    ]
                                    [ 0  0   −1/3 | −7/3 ]

And then doing backward substitution:

x3 = (−7/3)/(−1/3) = 7
x2 = (2 − (−1)x3)/(3/2) = (2/3)(2 + 7) = 6
x1 = (2 − (1)x2 − (0)x3)/2 = (2 − 6)/2 = −2

Notice that we can express the set of transformations as multiplying A by the matrices

L1 = [ 1   0 0 ]      L2 = [ 1 0     0 ]
     [ 1/2 1 0 ]           [ 0 1     0 ]
     [ 3/2 0 1 ]           [ 0 −11/3 1 ]

Therefore

L2 L1 A = U = [ 2 1   0    ]
              [ 0 3/2 −1   ]
              [ 0 0   −1/3 ]

Moreover, notice that

L1⁻¹ = [ 1    0 0 ]      L2⁻¹ = [ 1 0    0 ]
       [ −1/2 1 0 ]             [ 0 1    0 ]
       [ −3/2 0 1 ]             [ 0 11/3 1 ]

Then

A = L1⁻¹ L2⁻¹ U = LU

where

L = [ 1    0    0 ]      U = [ 2 1   0    ]
    [ −1/2 1    0 ]          [ 0 3/2 −1   ]
    [ −3/2 11/3 1 ]          [ 0 0   −1/3 ]
Therefore, in general, using Gauss elimination we can get an LU decomposition of A. If we are solving for only one b, this decomposition is not that relevant. But imagine the case where we have systems Axᵢ = bᵢ, where we need the solution xᵢ for different right hand side vectors bᵢ. With an LU factorization we can transform the problem as follows:

Ax = LUx = b   ⟹   first solve Ly = b, then solve Ux = y   (3.4)

Notice that for the first step you perform Forward substitution and for the second
one Backward substitution.
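We can check the worked example with MATLAB's built-in lu (note that lu uses partial pivoting, so its factors may differ from the ones computed by hand, while the solution is the same):

A = [2 1 0; -1 1 -1; -3 4 -4];
b = [2; 1; 2];
[L, U, P] = lu(A);       % P*A = L*U, with P a permutation matrix
y = L \ (P*b);           % forward substitution
x = U \ y;               % backward substitution
disp(x)                  % [-2; 6; 7], matching the hand computation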

3.2.1 Number of flops


Why do we want to use Gauss elimination (or LU), instead of just
computing the inverse?

for k = 1 : n − 1 do
  for i = k + 1 : n do
    lik = aik / akk
    for j = k + 1 : n do
      aij = aij − lik akj
    end
    bi = bi − lik bk
  end
end
Algorithm 5: Gauss Elimination Pseudo-code

for k = n : −1 : 1 do
  sum = 0
  for j = k + 1 : n do
    sum = sum + akj xj
  end
  xk = (bk − sum)/akk
end
Algorithm 6: Backward substitution Pseudo-code

Notice that we can find the inverse of a matrix by applying Gauss elimination to the system

[ A | I ] ∼ ... ∼ [ I | A⁻¹ ]

Imagine that we do not have roundoff errors; why don't we just compute the inverse of the matrix instead of computing the LU factorization? The answer is the number of operations (flops). Let us compute the number of flops for the different methods. We count the sums and products, where the outer sums run over the iterations of the loops. Gauss Jordan refers to the method that, instead of reducing the matrix to triangular form, leaves it diagonal.
  
flops(Gauss Elimination) = Σ_{k=1}^{n−1} Σ_{i=k+1}^{n} ( 3 + Σ_{j=k+1}^{n} 2 )
                         = Σ_{k=1}^{n−1} Σ_{i=k+1}^{n} ( 3 + 2(n − k) )
                         = Σ_{k=1}^{n−1} (n − k)( 3 + 2(n − k) )
                         = Σ_{k=1}^{n−1} ( 2k² + 3k )
                         = (2/3)n³ + O(n²)

flops(Backward substitution) = Σ_{k=1}^{n} ( 2 + Σ_{j=k+1}^{n} 2 )
                             = Σ_{k=1}^{n} 2(1 + n − k)
                             = 2 Σ_{k=1}^{n} k
                             = n² + O(n)

 
  
flops(Gauss Jordan) = Σ_{k=1}^{n} Σ_{i=1, i≠k}^{n} ( 3 + Σ_{j=k+1}^{n} 2 )
                    = Σ_{k=1}^{n} Σ_{i=1, i≠k}^{n} ( 3 + 2(n − k) )
                    = Σ_{k=1}^{n} (n − 1)( 3 + 2(n − k) )
                    = (n − 1) Σ_{k=1}^{n} ( 3 + 2(n − k) )
                    = n³ + O(n²)

flops(Inverse) = Σ_{k=1}^{n−1} Σ_{i=k+1}^{n} ( 1 + 2n + Σ_{j=k+1}^{n} 2 )
               = Σ_{k=1}^{n−1} Σ_{i=k+1}^{n} ( 1 + 2n + 2(n − k) )
               = Σ_{k=1}^{n−1} (n − k)( 1 + 2n + 2(n − k) )
               = Σ_{k=1}^{n−1} ( 2k² + 2nk + k )
               = (5/3)n³ + O(n²)

As you can see, finding the inverse and finding the LU factorization are both O(n³). The difference is the coefficient in front: finding the inverse is two and a half times as expensive as finding the LU.

Also notice that backward substitution (and likewise forward substitution) is O(n²). Therefore, once we have the LU, solving (3.4) is really cheap. Think about the number of flops for solving a diagonal or quasi-diagonal (e.g. tridiagonal) system!

3.2.2 Pivoting
Does this method work for any invertible matrix?

For example take  


0 1
A=
1 0
In this case, with standard Gauss elimination, we will divide by 0. This will break
the method!. Therefore we need to use the rule of switching rows in order to get
a value different than zero (a pivot). Notice that in a computer, we will not only
find the problem when the entry is exactly zero but when is relatively small with
respect to the rest of the entries of the column. Therefore we would like to apply
pivoting in order to have a more stable algorithm.

Clearly, this pivoting changes the factorization of A into:

PA = LU

where L is lower triangular, U is upper triangular and P is a permutation matrix. To make the algorithm even more stable, we can also permute rows and columns, including an additional permutation matrix Q:

PAQ = LU

When solving the system using these permutation matrices, we need to modify x and b appropriately.

3.2.3 Cholesky Factorization

What happens if A is symmetric? Or even symmetric positive definite?

Notice that from

A = LU

if all the diagonal entries of U are different from zero, we can form the decomposition

A = LDŨ

where Ũ is upper triangular with diagonal entries equal to one, obtained by dividing each row Uᵢ of U by its diagonal entry uᵢᵢ. Notice that if A is symmetric, we get a symmetric decomposition

A = LDLᵀ

Moreover, if all the entries of D are > 0, we can take the square root and

A = LD^{1/2} D^{1/2} Lᵀ

If we call G = LD^{1/2}, then we have

A = GGᵀ

where G is a lower triangular matrix with positive entries on the diagonal. This is called the Cholesky factorization. If the matrix is far from being singular and moreover is symmetric positive definite (i.e. all the eigenvalues are positive and far from zero), then the Cholesky factorization of A exists and the method is stable.
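In MATLAB, chol returns the upper triangular factor R with A = RᵀR (so G = Rᵀ in the notation above). A small sketch:

A = [4 2; 2 3];             % symmetric positive definite
R = chol(A);                % fails with an error if A is not positive definite
disp(norm(A - R'*R))        % essentially zero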

3.3 Conditioning
3.3.1 Vector and Matrix Norms
Definition 15. A norm is a function || · || : V → R such that:
• ||x|| ≥ 0, and ||x|| = 0 ⟺ x = 0
• ||αx|| = |α| ||x||
• ||x + y|| ≤ ||x|| + ||y|| for all x, y

Some typical vector norms defined on Rⁿ are

||x||₂ = √(xᵀx) = ( Σ_{j=1}^{n} |xⱼ|² )^{1/2}
||x||₁ = Σ_{j=1}^{n} |xⱼ|
||x||_p = ( Σ_{j=1}^{n} |xⱼ|^p )^{1/p}
||x||_∞ = max_{1≤j≤n} |xⱼ|

How can we define a matrix norm?

First, we can see a matrix as an object in R^{n×m}. This means we can unwrap the rows of A into one long vector with nm entries. If we compute the 2-norm of this vector, it is called the Frobenius norm:

||A||_F = ( Σ_{i,j} aᵢⱼ² )^{1/2}   (3.5)

As a nice fact, we can even define an inner product between matrices based on this norm:

⟨A, B⟩ = tr(AᵀB)

On the other hand, we can see a matrix as an operator R^m → R^n. This means we can analyze how the matrix transforms vectors from one space to the other. In this way we can define the induced norm:

||A||_p = sup_{x≠0} ||Ax||_p / ||x||_p   for 1 ≤ p ≤ ∞   (3.6)

Some of the typical induced norms are:

||A||₁ = max_j Σᵢ |aᵢⱼ|     ||A||_∞ = max_i Σⱼ |aᵢⱼ|     ||A||₂ = max_i σᵢ

where the σᵢ are the singular values, the diagonal entries of Σ in the SVD (singular value decomposition). Also

||A||₂ = √( ρ(AAᵀ) )   where ρ(M) = max_{λᵢ∈Λ(M)} |λᵢ| is the spectral radius

All the induced norms satisfy the sub-multiplicative property:

||AB|| ≤ ||A|| ||B||

3.3.2 Condition number

How can we measure the expected error in solving a system? As we discussed in Chapter 1, if we have perfect arithmetic (i.e. we do not count roundoff errors), the sources of error depend on the problem itself (whether it is a well-conditioned problem) and on the method we use (whether it is stable or not). If we know beforehand that our problem tends to be ill-conditioned, then we should use a method that is really robust.

In the case of linear systems, let us understand where the error comes from. First let us define the residual (which we can measure in practice, without knowing the real solution) as:

r = b − Ax̃ = Ax − Ax̃ = A(x − x̃)

Then the absolute error is:

||x − x̃|| = ||A⁻¹r|| ≤ ||A⁻¹|| ||r||

If we prefer to consider the relative error, using that ||b|| ≤ ||A|| ||x||, we have

||x − x̃|| / ||x|| ≤ ||A|| ||A⁻¹|| ||r|| / ||b||   (3.7)

Here we see two sources of error:

• The method, through how the residual ||r|| changes from method to method.

• The problem itself, regardless of the method we use.

In this setting, choosing a specific norm, we define the condition number of a matrix as

κ(A) = ||A|| ||A⁻¹||   (3.8)

Notice that we have the following properties:

• Using the sub-multiplicative property, 1 = ||I|| = ||AA⁻¹|| ≤ ||A|| ||A⁻¹||, so κ(A) ≥ 1.

• Using that for orthogonal matrices Qᵀ = Q⁻¹ and ||Q||₂ = ||Qᵀ||₂ = 1, we get κ₂(Q) = 1.

• Using the SVD decomposition, κ₂(A) = σmax/σmin.

Notice from (3.7) that if a matrix has a large condition number, the bound on the relative error of Ax = b is bigger, so we would need a robust method. If you want to explore more about the condition number, I recommend reading Ascher section 5.8 or Bradie section 3.4, or, for deeper understanding, a Numerical Linear Algebra book such as the one by Trefethen and Bau.
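For instance, we can watch the condition number of the nearly singular matrix of Section 3.1.1 blow up as ε → 0 (a small sketch of ours):

for ep = [1e-2, 1e-5, 1e-8]
    A = [1+ep 1; 3 3];
    fprintf('epsilon = %g   cond(A) = %g\n', ep, cond(A))   % grows roughly like 1/epsilon
end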

3.4 Linear Least Squares

Let us assume we have a set of stock prices over a certain period of time, and we want to fit a model to them. Each measurement has some error, and of course, if we pass through all the points, we will probably overfit the model. Therefore we prefer to fit a model that minimizes the error between the curve and the data. If the model is

f(t) = α₀ + α₁t + α₂t² + ... + α_{m−1}t^{m−1}

then the error between the model and the data (tₖ, pₖ) is given by

[ r1 ]   [ p1 − f(t1) ]   [ p1 ]   [ 1  t1  ...  t1^{m−1} ] [ α0      ]
[ r2 ] = [ p2 − f(t2) ] = [ p2 ] − [ 1  t2  ...  t2^{m−1} ] [ α1      ]   (3.9)
[ ...]   [ ...        ]   [ ...]   [ ...             ... ] [ ...     ]
[ rn ]   [ pn − f(tn) ]   [ pn ]   [ 1  tn  ...  tn^{m−1} ] [ α_{m−1} ]

In matrix form we have

r = b − Ax

Here we are assuming that A has full column rank; this means that the columns (the variables of the model) are linearly independent. If A were square, we could just solve the system so that r = 0. But given that A is not square, we should minimize ||r||₂.

Figure 3.1: Data Fitting problem: suppose we have the data series {(tᵢ, pᵢ)} and we want to see how the model f(t) = α₀ + α₁t + α₂t² approximates the price.

Why || · ||₂ and not another norm?

The other two popular norms are || · ||₁ and || · ||_∞. The first helps to get rid of outliers (it also makes the coefficients of the model sparse), whereas the second keeps track of the worst-case error (a min-max problem). The problem with these two norms is that they are not differentiable, so we cannot apply the same techniques as for || · ||₂. In such cases we use Linear Programming, a constrained optimization problem with both linear objective function and constraints.

Therefore, in linear least squares we look for the x that minimizes

min_{x∈R^m} (1/2)||b − Ax||₂²

Let us call

ψ(x) = (1/2)||b − Ax||₂²
     = (1/2)(b − Ax)ᵀ(b − Ax)
     = (1/2)( bᵀb − 2bᵀAx + xᵀAᵀAx )
     = (1/2)xᵀAᵀAx − (Aᵀb)ᵀx + (1/2)bᵀb

In this case, to have a minimum, as in 1D we need a critical point (i.e. ∇ψ = 0)
and a second-derivative condition (∇²ψ positive definite).
If it is not easy to see what the derivatives of ψ look like, let us compute them by
components:
ψ(x) = (1/2) Σ_{i=1}^{n} ( bᵢ − Σ_{k=1}^{m} a_{ik}x_k )²

Therefore, taking the partial derivative with respect to x_j we have

∂ψ/∂x_j = (1/2) ∂/∂x_j Σ_{i=1}^{n} ( bᵢ − Σ_{k=1}^{m} a_{ik}x_k )²
        = Σ_{i=1}^{n} ( bᵢ − Σ_k a_{ik}x_k ) ∂/∂x_j ( bᵢ − Σ_k a_{ik}x_k )
        = Σ_{i=1}^{n} ( bᵢ − Σ_k a_{ik}x_k ) ( −Σ_k a_{ik}δ_{kj} )
        = −Σ_{i=1}^{n} a_{ij} ( bᵢ − Σ_{k=1}^{m} a_{ik}x_k )

If we express the gradient in matrix form we have that

∇ψ(x) = −Aᵀ(b − Ax) = AᵀAx − Aᵀb   (3.10)

The components of the Hessian are ∂²ψ/∂x_j∂x_l. Therefore:

∂²ψ/∂x_j∂x_l = ∂/∂x_l Σ_{i=1}^{n} a_{ij} ( Σ_{k=1}^{m} a_{ik}x_k − bᵢ )
             = Σ_{i=1}^{n} a_{ij} Σ_{k=1}^{m} a_{ik} ∂x_k/∂x_l
             = Σ_{i=1}^{n} a_{ij}a_{il}

Then the Hessian in matrix form is given by

∇²ψ(x) = AᵀA   (3.11)

Since AᵀA is positive definite (A has full column rank), we can now solve the least
squares problem by finding x such that

AᵀAx = Aᵀb   (3.12)

These are called the normal equations, because they satisfy

Aᵀ(Ax − b) = Aᵀr = 0

which means that the residual is normal (orthogonal) to the range of A.


How can we solve them?
Notice that (3.12), if AᵀA is invertible (i.e. A has full column rank), is equivalent to

x = (AᵀA)⁻¹Aᵀb

Therefore we can call A† = (AᵀA)⁻¹Aᵀ the pseudo-inverse of A, and as for linear
systems we can show that

||x − x̃||/||x|| ≤ ||A|| ||A†|| ||r||/||b||   (3.13)

Therefore we can also define a condition number for rectangular matrices A ∈ R^{n×m},
which in the 2-norm is equal to

κ(A) = σ₁/σ_m

If we solve (3.12) using Gauss elimination, or in this case a Cholesky factorization
(since AᵀA is symmetric positive definite), then the error will not depend on κ(A)
but on κ(AᵀA). And from the SVD we can see that

κ(AᵀA) = σ₁²/σ_m² = (κ(A))²

Therefore if the condition number of A is large, solving through the normal
equations squares it, so the problem becomes even more ill-conditioned.

To avoid this, we would like to use other methods, such as:

• Use QR factorization. At the beginning of the chapter, we talked about factorizing
A = QR where Q is orthogonal and R is upper triangular. We can get this
factorization by applying Gram-Schmidt to the column vectors of A, to transform
them into an orthonormal basis of the range. Although Gram-Schmidt is not
really stable by itself, there are other methods, such as Householder reflections
and Givens rotations, to achieve this factorization. It is useful because:

||Ax − b||₂² = ||QRx − b||₂² = ||Rx − Qᵀb||₂²

Therefore the only way to reduce the residual is to solve exactly the first
m equations of R. And the condition number of R is comparable with that
of A. As an additional comment, when we have A ∈ R^{n×m}, n ≠ m, with
full column rank, and b ∈ Rⁿ, writing A\b in MATLAB returns the
least squares solution using QR factorization. Check the last part of
http://www.mathworks.com/help/matlab/ref/mldivide.html .

• We can only use QR factorization if A has full column rank. In the case A is
rank deficient or almost rank deficient we can use the SVD factorization. We get

||Ax − b||₂² = ||UΣVᵀx − b||₂² = ||ΣVᵀx − Uᵀb||₂²

and we can neglect the equations where σᵢ is almost zero.

As we can always expect, as the method gets more robust, the cost of computing it
increases. Therefore, depending on the problem and the conditioning of A, we may
prefer a different method.
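For concreteness, here is a minimal MATLAB sketch comparing the approaches above
on a small data-fitting problem; the data and the quadratic model are arbitrary
placeholders, not from the notes:

t = linspace(0, 1, 50)';
b = exp(t) + 0.01 * randn(50, 1);   % noisy synthetic measurements
A = [ones(50,1), t, t.^2];          % model f(t) = a0 + a1*t + a2*t^2
x_ne  = (A'*A) \ (A'*b);            % normal equations: squares cond(A)
[Q,R] = qr(A, 0);                   % thin QR factorization
x_qr  = R \ (Q'*b);                 % QR-based least squares solution
x_bs  = A \ b;                      % backslash, also QR-based here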

Chapter 4

Non Linear systems of equations

4.1 Newton’s Method


After giving a general introduction to Scientific Computing, in chapter 2 we started
with finding the solutions of nonlinear equations of the shape f(x) = 0. Some
of the methods were based on reducing the uncertainty interval (closed methods),
whereas others used a fixed point iteration to define the guess (open methods:
x_{n+1} = g(x_n)). In general, closed methods were robust but open methods had faster
convergence. Therefore in HW 1 we explored hybrid methods, mixing both
approaches to ensure convergence.

Then, in chapter 3, we dealt with linear systems of equations of the shape Ax = b;
they are linear because all the operations between unknowns are additions or
multiplications by scalars. Now we will deal with a more general problem. Let us
consider

Find x = [x₁, ..., x_m]ᵀ ∈ R^m s.t. F(x) = 0, where F(x) = [f₁(x₁, ..., x_m), f₂(x₁, ..., x_m), ..., f_n(x₁, ..., x_m)]ᵀ   (4.1)

The first remark is that, in general, we cannot generalize the closed methods to
this multidimensional setting, because we have neither a graphical picture of the
function nor basic conditions on the neighborhood of a root that must be satisfied.
Therefore we only have open methods. Here we will talk about Newton's method.

As in 1D case, we approximate our function using Taylor series:

Definition 16. Let x = [x1 , ..., xm ]T ∈ Rm and F = [f1 , ..., fn ]T with bounded
derivatives up to order at least 2, then for any direction p = [p1 , ..., pm ]T ∈ Rm the

Taylor expansion of F is

F(x + p) = F(x) + J(x)p + O(||p||2 ) (4.2)

where J ∈ R^{n×m} is the Jacobian matrix

J(x) = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_m} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_m} \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \dots & \frac{\partial f_n}{\partial x_m}
\end{bmatrix}   (4.3)

Therefore for each component function fᵢ we have that

fᵢ(x₁ + p₁, ..., x_m + p_m) = fᵢ(x₁, ..., x_m) + Σ_{k=1}^{m} (∂fᵢ/∂x_k) p_k + O(||p||²)   (4.4)

Let us start with the case n = m. Starting at x(0) we want to find x* such
that F(x*) = 0. We can approximate F with its Taylor series to find x(1). Let
x(1) = x(0) + p(0), then

F(x(1)) = F(x(0) + p(0)) = F(x(0)) + J(x(0))p(0) + O(||p||²)

But since we would like F(x(1)) = 0, we can choose p(0) such that

J(x(0)) p(0) = −F(x(0))   (4.5)

where J(x(0)) is an n×n matrix and F(x(0)) a vector in Rⁿ.

Notice that we get a linear system of equations that we can solve using any of the
methods we discussed in Chapter 3.

Similar to the 1D case, we have quadratic convergence if we are close enough to
the solution x*, i.e. there exists a constant M such that if ||x(k) − x*|| is sufficiently
small, then

||x(k+1) − x*|| ≤ M ||x(k) − x*||²   (4.6)

As stopping criteria you can choose the absolute error between iterations, the relative
error, the norm of the function value, etc.
Example 5. (See Ascher Section 9.1, Example 9.1) Consider the intersection between
the circle of radius 1 centered at the origin, x₁² + x₂² = 1, and the parabola
x₂ = (x₁ − 1)². We want to find the points where the 2 curves intersect.
Solution. Notice that we can express the problem as:

Find x = [x₁, x₂]ᵀ ∈ R² s.t. F(x) = 0 where F(x) = \begin{bmatrix} x_1^2 + x_2^2 - 1 \\ (x_1 - 1)^2 - x_2 \end{bmatrix}   (4.7)

Set x(0); k = 0;
while abs_error(x(k)) > tol and k < max_iterations do
    Solve for p(k) in J(x(k)) p(k) = −F(x(k));
    x(k+1) = x(k) + p(k);
    k = k + 1;
end
return x(k)
Algorithm 7: Multidimensional Newton Pseudo-code

Therefore the Jacobian at x is

J(x) = \begin{bmatrix} 2x_1 & 2x_2 \\ 2(x_1 - 1) & -1 \end{bmatrix}

(the −1 entry comes from differentiating (x₁ − 1)² − x₂ with respect to x₂).
Therefore, starting from x(0) = (x₁(0), x₂(0)) we have that

x(k+1) = x(k) + p(k) where p(k) solves J(x(k)) p(k) = −F(x(k))
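Here is a minimal MATLAB sketch of Algorithm 7 applied to this example; the
starting guess and the tolerance are arbitrary choices:

F = @(x) [x(1)^2 + x(2)^2 - 1; (x(1) - 1)^2 - x(2)];
J = @(x) [2*x(1), 2*x(2); 2*(x(1) - 1), -1];
x = [1; 1];                       % initial guess
for k = 1:20
    p = -(J(x) \ F(x));           % solve J(x) p = -F(x)
    x = x + p;
    if norm(p) < 1e-12, break; end
end
% From this guess, x converges to the intersection point (1, 0); the
% other intersection is found from a different starting guess.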

What are the downsides of Newton's method?

As in the 1D case, we need to be near the root to have convergence, and now we do
not have any other method to combine it with to ensure convergence. Nevertheless,
we can use p(k) as a direction instead of a full step and find the best α such that

x(k+1) = x(k) + αp(k)

is a good guess. These are called Line-search methods and they are a keystone
of Numerical Optimization.

On the other hand, notice that at each step we need to compute the Jacobian
matrix at x(k), i.e. J(x(k)), and depending on F this can be a very expensive
process. Therefore, we would like to approximate J with a matrix B(k) that we
update at each iteration so that it stays near the value of J(x(k)). These are called
Quasi-Newton methods. One of them is Broyden's method: with s = x(k) − x(k−1),

B(k) = B(k−1) + [ ( F(x(k)) − F(x(k−1)) − B(k−1)s ) sᵀ ] / ( sᵀs )   (4.8)
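As a minimal MATLAB sketch of one Broyden update (4.8): B is the current
Jacobian approximation, and x_new, x_old and F are assumed given as in the text.

s = x_new - x_old;                       % step
y = F(x_new) - F(x_old);                 % change in F
B = B + ((y - B*s) * s') / (s' * s);     % rank-1 update of B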

4.2 Non-Linear Least Squares
In section 3.4 we talked about minimizing a specific type of problem: linear least
squares, where we had a matrix A ∈ R^{n×m} with n > m and we wanted to find

minimize_x (1/2)||Ax − b||₂²

In most cases Ax represents a model, linear in the parameters, that approximates
b; i.e. if x = (α₁, ..., α_m) then the model has the shape

f(t) = α₁f₁(t) + α₂f₂(t) + ... + α_m f_m(t)

Now let us consider the case where the model is not linear in the parameters. For
example, the logistic model for population growth:

N(t) = N₀Ke^{rt} / ( K + N₀(e^{rt} − 1) )

which comes from the differential equation

dN/dt = rN(1 − N/K),  N(0) = N₀

In the logistic model the parameters are: N₀ the initial population, r the growth
rate, and K the carrying capacity. It is clear that N(t) is not linear in the
parameters. Let us define x = (x₁, x₂, x₃) = (N₀, r, K), and let us suppose that we
have measurements (tᵢ, Nᵢ), 1 ≤ i ≤ n, of a bacteria population. Therefore we have
the system of equations

gᵢ(x) = x₁x₃e^{x₂tᵢ} / ( x₃ + x₁(e^{x₂tᵢ} − 1) ) = Nᵢ,  i = 1, ..., n

where the left-hand sides form g(x) and the right-hand sides Nᵢ form b.

In vector form, we want to find x = (x₁, x₂, x₃) such that g(x) = b. Since we
only have 3 unknowns and n equations, as in the linear case this is in general not
possible to solve exactly. Therefore we prefer to formulate the problem as

minimize_{x ∈ R^m} (1/2)||g(x) − b||₂²

Following the same steps we did in the linear case, let us call

ψ(x) = (1/2)||g(x) − b||₂²
     = (1/2)(g(x) − b)ᵀ(g(x) − b)
     = (1/2)g(x)ᵀg(x) − g(x)ᵀb + (1/2)bᵀb

And component-wise

ψ(x) = (1/2) Σ_{i=1}^{n} (gᵢ(x) − bᵢ)²

In this case, to have a minimum, as in 1D we need a critical point (i.e. ∇ψ = 0) and
a second-derivative condition (∇²ψ positive definite). Therefore, computing the
gradient component-wise:

∂ψ/∂x_j = ∂/∂x_j (1/2) Σ_{i=1}^{n} (gᵢ(x) − bᵢ)² = Σ_{i=1}^{n} (gᵢ(x) − bᵢ) ∂gᵢ/∂x_j

where the partial derivatives ∂gᵢ/∂x_j are the entries of the Jacobian A(x) of g(x).
In matrix form we get the general form of the normal equations

0 = ∇ψ(x) = Aᵀ(x)(g(x) − b)   (4.9)

where A(x) is the Jacobian of g(x). Notice that the Hessian of ψ component-wise
is given by:

∂²ψ/∂x_l∂x_j = ∂/∂x_l Σ_{i=1}^{n} (gᵢ(x) − bᵢ) ∂gᵢ/∂x_j
             = Σ_{i=1}^{n} [ (∂gᵢ/∂x_j)(∂gᵢ/∂x_l) + (gᵢ(x) − bᵢ) ∂²gᵢ/∂x_l∂x_j ]

where the first term gives the entries of Aᵀ(x)A(x) and the second term defines
L(x). Therefore

∇²ψ(x) = Aᵀ(x)A(x) + L(x)   (4.10)

Notice that to solve (4.9) we need to solve a non-linear system of equations. If we
call φ(x) = Aᵀ(x)(g(x) − b), then we want to find x such that φ(x) = 0. Therefore
we can solve it using Newton's method, starting from an initial guess and finding a
new p such that φ(x + p) ≈ 0. Using the Taylor series of φ(x) we find

J_φ(x₀)p = −φ(x₀)

But recall that φ(x) = ∇ψ(x), so we can find p from

∇²ψ(x₀)p = −∇ψ(x₀)
( Aᵀ(x₀)A(x₀) + L(x₀) ) p = −Aᵀ(x₀)(g(x₀) − b)

Notice that this equation is a bit complicated because it involves the function g with
its first derivatives (in A(x)) and its second derivatives (in L(x)).

What if we drop the second derivatives (i.e. L(x))?

When we neglect the second derivatives, the method is called Gauss-Newton, and it
computes p from

Aᵀ(x₀)A(x₀)p = −Aᵀ(x₀)(g(x₀) − b)

which corresponds to the normal equations of the problem

minimize_{p ∈ R^m} (1/2)||A(x₀)p + (g(x₀) − b)||₂² = minimize_{p ∈ R^m} (1/2)||(g(x₀) + A(x₀)p) − b||₂²   (4.11)

where g(x₀) + A(x₀)p is the linear approximation of g(x) around x₀.

Notice that Gauss-Newton has the following properties:

• The approximation Aᵀ(x)A(x) of the Hessian is always positive semidefinite
(even when AᵀA + L is not).

• It does not compute the second derivatives of g, therefore it is cheaper per
iteration.

• Since there is always a difference between Newton and Gauss-Newton (namely L),
Gauss-Newton does not in general have quadratic convergence.
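Algorithm 8 below summarizes the iteration. Here is a minimal MATLAB sketch for
the logistic model above; the data vectors t and N, the starting guess, and the
finite-difference Jacobian are all hypothetical choices, not prescribed by the notes:

g = @(x,t) x(1)*x(3)*exp(x(2)*t) ./ (x(3) + x(1)*(exp(x(2)*t) - 1));
x = [5; 1; 100];                          % guess for (N0, r, K)
for k = 1:50
    r = g(x,t) - N;                       % residual g(x) - b
    A = zeros(numel(t), 3);               % Jacobian of g at x
    for j = 1:3
        e = zeros(3,1); e(j) = 1e-6;
        A(:,j) = (g(x + e, t) - g(x, t)) / 1e-6;   % finite differences
    end
    p = A \ (-r);                         % solves min_p ||A p + r||_2
    x = x + p;
    if norm(p) < 1e-8, break; end
end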

Set x(0); k = 0;
while abs_error(x(k)) > tol and k < max_iterations do
    Solve the linear least squares problem
        minimize_{p ∈ R^m} (1/2)||A(x(k))p + (g(x(k)) − b)||₂²
    Set the minimizer p(k) = p;
    Update x(k+1) = x(k) + p(k);
    k = k + 1;
end
return x(k)
Algorithm 8: Gauss-Newton for Non-Linear Least Squares Pseudo-code

4.3 Unconstrained Optimization

Up to now we have solved linear and non-linear systems of equations mapping R^m
to R^n. If m = n we used Newton's method to find the solution. If we had more
equations than unknowns we used a least squares approach. Least squares is a
particular type of minimization problem, but it is not the only one. In general,
given a function φ : Rⁿ → R, we are interested in finding x* such that there exists
δ > 0 with φ(x*) ≤ φ(y) for any y with ||y − x*|| < δ.

The simplest way to find a minimum is by simple inspection of the function. We
can guess x* by choosing a random set of points {x(1), x(2), ..., x(k)} and taking x*
as the point with the minimum evaluation, i.e. x* = x(i) where φ(x(i)) = min_j φ(x(j)).
But as you can imagine, this is a very inefficient method.

Another way is to minimize φ one direction at a time. Start with a guess
x(1) = (x₁(1), x₂(1), ..., x_n(1)) and fix all the components except the first one, i.e.
consider

f(α) = φ(α, x₂(1), ..., x_n(1))

This is a univariate function of α and we can find its minimum using any of the
methods described in chapter 2. Once we find α*, we can update our guess as
x(2) = (α*, x₂(1), ..., x_n(1)), and now we can optimize in the second component
x₂, and so on. Once we reach x_n, we start again at x₁ and we repeat the process
until we cannot minimize any more. This method is called Coordinate descent.

Do you think this is the best way of minimizing a function?

Consider the two runs in Figure 4.1. They use exactly the same function and the
same initial guess; the only difference is that we rotated our coordinate system.
For the first one we get the minimum in one iteration whereas in the second one we
need two. If we selected our search direction in a better way, we could find the
minimum in both cases in the same number of iterations.

To choose this direction, let us consider the Taylor series for a function φ : Rⁿ → R.

Figure 4.1: Minimize φ(x, y) = x² + y² choosing one direction at a time for different
initial conditions. How can we do everything in one step?

Definition 17. Let x = [x₁, ..., x_n]ᵀ ∈ Rⁿ and φ : Rⁿ → R with bounded derivatives
up to order at least 2. Then for any direction p = [p₁, ..., p_n]ᵀ ∈ Rⁿ the Taylor
expansion of φ is

φ(x + p) = φ(x) + ∇φ(x)ᵀp + (1/2)pᵀ∇²φ(x)p + O(||p||³)   (4.12)

where ∇φ(x) is the gradient of φ

∇φ(x)ᵀ = [ ∂φ/∂x₁, ..., ∂φ/∂x_n ]   (4.13)

and ∇²φ(x) is the Hessian of φ

∇²φ(x) = \begin{bmatrix}
\frac{\partial^2\phi}{\partial x_1^2} & \frac{\partial^2\phi}{\partial x_1\partial x_2} & \dots & \frac{\partial^2\phi}{\partial x_1\partial x_n} \\
\frac{\partial^2\phi}{\partial x_1\partial x_2} & \frac{\partial^2\phi}{\partial x_2^2} & \dots & \frac{\partial^2\phi}{\partial x_2\partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2\phi}{\partial x_1\partial x_n} & \frac{\partial^2\phi}{\partial x_2\partial x_n} & \dots & \frac{\partial^2\phi}{\partial x_n^2}
\end{bmatrix}   (4.14)

4.3.1 Steepest descent


If we consider just the first-order Taylor expansion, we notice that for sufficiently
small p, φ(x + p) < φ(x) if

∇φᵀp < 0   (4.15)

If a direction p satisfies (4.15), it is called a descent direction at x. Let θ be
the angle between the two vectors, so that ∇φᵀp = ||∇φ|| ||p|| cos θ. Then among
all directions p such that ||p|| = 1, the biggest descent is obtained for

p = −∇φ(x)   (4.16)

normalized. When we choose the direction p that satisfies (4.16), the method is
called the steepest descent or gradient descent method.

Now if we try this on the functions of Figure 4.1, we take only one iteration using
the steepest descent direction. But now let us consider Figure 4.2. We have the
same function and we choose two different initial conditions, both on the same
level curve of φ. We optimize choosing the steepest descent direction. In the first
case we get the minimum in only one iteration, whereas in the second case we
start to zig-zag towards the minimum. Why is this happening?

Figure 4.2: Minimize φ(x, y) = x² + 4y² choosing one direction at a time for different
initial conditions. How can we do everything in one step?

4.3.2 Newton’s Method


Notice that using only a first-order approximation of φ amounts to replacing our
function by a plane, and a plane is not bounded below. Therefore it does not seem
a clever choice to approximate a bounded function with an unbounded one in order
to find the minimum. So let us go one step further and take the second-order
Taylor approximation. Let

φ(x + p) ≈ ψ(p) = φ(x) + ∇φ(x)ᵀp + (1/2)pᵀ∇²φ(x)p   (4.17)

Let us recall the necessary and sufficient conditions for a local minimum.

Definition 18. We say that x* ∈ Rⁿ is a critical point of φ if

∇φ(x*) = 0   (4.18)

Definition 19. Given a critical point x*, then

• If it is a local minimizer, then ∇²φ(x*) is positive semidefinite (necessary)

• If ∇²φ(x*) is positive definite, then it is a local minimizer (sufficient)

If x is sufficiently near x*, then ψ is a quadratic function of p whose Hessian is
positive definite. Since ∇ψ(p) = ∇φ(x) + ∇²φ(x)p, ψ has a unique minimizer p*
such that

∇²φ(x)p* = −∇φ(x)   (4.19)

Therefore we just need to find p* by solving the linear system of equations (4.19).
With this direction, we take only one iteration for both functions in Figure 4.2.

Why is it called Newton's method?

Consider the Taylor series of the gradient of φ: let G = ∇φ. Using (4.2) we have,
for x, p ∈ Rⁿ, if φ is smooth enough:

G(x + p) = G(x) + J_G(x)p + O(||p||²)   (4.20)

where J_G is the Jacobian of G = ∇φ = [∂φ/∂x₁, ..., ∂φ/∂x_n]ᵀ. Its (i, j) entry is

(J_G)_{ij} = ∂gᵢ/∂xⱼ = ∂/∂xⱼ ( ∂φ/∂xᵢ ) = ∂²φ/∂xⱼ∂xᵢ

Therefore J_G = ∇²φ. If we apply Newton's method of section 4.1 to solve G(x) = 0,
then in each iteration with a guess x(k) we find p(k) such that:

J_G(x(k))p(k) = −G(x(k)),  i.e.  ∇²φ(x(k))p(k) = −∇φ(x(k))

which is indeed the same equation we got from approximating our function with a
quadratic model from Taylor's series.

Nevertheless, similar to the problem in 1D (section 2.3), it is not enough to solve
(4.19) in order to find a minimum. We also need ∇²φ(x(k)) to be positive definite.
If we do not enforce this condition, the direction we get might not be a descent
direction or, even worse, we might not even be able to solve (4.19) if ∇²φ(x(k)) is
singular or ill-conditioned.

Notice that if ∇²φ(x(k)) is positive definite, we can solve (4.19) using a Cholesky
factorization. Therefore one of the methods to solve (4.19), when we do not know
whether ∇²φ(x(k)) is positive definite, is to modify its diagonal until a Cholesky
factorization exists. This method is called Modified Cholesky and it is one of the
most used approaches.

4.3.3 Line Search methods


Up to now we have talked about three different methods to minimize φ : Rⁿ → R:
Coordinate descent, Steepest descent and Newton's method. In all of them we can
split each iteration into 2 steps:

1. Find or select a search direction p(k) that is a descent direction (4.15):

   • Coordinate descent: p(k) = −( ∂φ/∂x_{i_k}(x(k)) ) e_{i_k}
   • Steepest descent: p(k) = −∇φ(x(k))
   • Newton's method: solve ∇²φ(x(k))p(k) = −∇φ(x(k))

2. Minimize your function in the direction p(k). Consider f(α) = φ(x(k) + αp(k)).
This is the restriction of your function to the direction p(k), and f : R → R.
Therefore we can apply any of the techniques of chapter 2 to find the α(k) that
minimizes f(α). Then we update the guess as

x(k+1) = x(k) + α(k)p(k)

With this method we construct a sequence of points {x(1), x(2), x(3), ...} such that
φ(x(1)) ≥ φ(x(2)) ≥ φ(x(3)) ≥ ... and it converges to a local minimum of φ.

Any minimization method based on these two steps is called a Line Search
method. And as you can expect, there is always a trade-off between the effort spent
finding a descent direction and the effort spent optimizing the function along that
direction.

For example, steepest descent has a linear rate of convergence if we solve the 1D
minimization problem exactly at each step, whereas Newton's method has a quadratic
rate of convergence if we always take α(k) = 1 and we are sufficiently close to the
real minimum. On the other hand, steepest descent only requires the first derivative
of the function, while Newton's method requires first and second derivatives.

Example 6. Consider the function φ(x₁, x₂) = (x₁ + x₂)² + 4(x₁ − x₂)². What does
the first iteration look like for each of the methods if x(0) = (−1, −2)?
Solution. Let us compute the gradient and the Hessian matrix of φ:

∇φ(x₁, x₂) = [ 2(x₁ + x₂) + 8(x₁ − x₂), 2(x₁ + x₂) − 8(x₁ − x₂) ]ᵀ

∇²φ(x₁, x₂) = \begin{bmatrix} 10 & -6 \\ -6 & 10 \end{bmatrix}

• Coordinate descent. We have that:

p(0) = −( ∂φ/∂x₁(−1, −2) ) e₁ = −(2, 0)

Therefore we need to optimize

f(α) = φ((−1, −2) − α(2, 0)) = ((−1 − 2α) − 2)² + 4((−1 − 2α) + 2)²

Setting the first derivative to zero:

0 = f′(α) = 2((−1 − 2α) − 2)(−2) + 8((−1 − 2α) + 2)(−2)
0 = (−3 − 2α) + 4(1 − 2α) = 1 − 10α
α = 1/10

Therefore the new guess is x(1) = x(0) − (1/10)(2, 0) = (−6/5, −2).

• Steepest descent. We have that:

p(0) = −∇φ(−1, −2) = (−2, 14)

Therefore we need to optimize

f(α) = φ((−1, −2) + α(−2, 14)) = ((−1 − 2α) + (−2 + 14α))² + 4((−1 − 2α) − (−2 + 14α))²

Setting the first derivative to zero:

0 = 2((−1 − 2α) + (−2 + 14α))(12) + 8((−1 − 2α) − (−2 + 14α))(−16)
0 = (−3 + 12α)(3) + 4(1 − 16α)(−4)
α = 25/292 ≈ 0.0856

Therefore the new guess is x(1) = x(0) + (25/292)(−2, 14) = (−1.1712, −0.8014).

• Newton's method. We solve

\begin{bmatrix} 10 & -6 \\ -6 & 10 \end{bmatrix} p(0) = −∇φ(−1, −2) = \begin{bmatrix} -2 \\ 14 \end{bmatrix}

Therefore p(0) = [1, 2]ᵀ and we need to optimize

f(α) = φ((−1, −2) + α(1, 2)) = ((−1 + α) + (−2 + 2α))² + 4((−1 + α) − (−2 + 2α))²

Setting the first derivative to zero:

0 = 2((−1 + α) + (−2 + 2α))(3) + 8((−1 + α) − (−2 + 2α))(−1)
0 = −9(1 − α) − 4(1 − α)
α = 1

Therefore the new guess is x(1) = x(0) + (1, 2) = (0, 0), which is the exact minimizer.
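As a minimal MATLAB check of the Newton step in this example:

phi     = @(x) (x(1) + x(2))^2 + 4*(x(1) - x(2))^2;
gradphi = @(x) [2*(x(1)+x(2)) + 8*(x(1)-x(2));
                2*(x(1)+x(2)) - 8*(x(1)-x(2))];
H  = [10 -6; -6 10];              % constant Hessian of phi
x0 = [-1; -2];
p  = -(H \ gradphi(x0));          % Newton direction, gives [1; 2]
x1 = x0 + p;                      % lands exactly on the minimizer (0, 0)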

Some comments about finding the search direction

Since computing the second derivatives of φ can be really expensive or complicated,
one approach is to approximate ∇²φ(x(k)) at each iteration. There are two ways to
approximate it:

• Approximate it term by term with finite differences. For example, if we know
the gradient, then we can approximate:

∂²φ/∂x_k∂x_l (x(k)) ≈ [ ∂φ/∂x_l(x(k) + h e_k) − ∂φ/∂x_l(x(k)) ] / h

• Approximate ∇²φ as an operator: find a matrix with a similar behavior. For
example, if we approximate ∇²φ ≈ I, the identity, then we get the steepest descent
direction; but of course doing this we lose all the curvature information of φ.

Another way to approximate ∇²φ is to consider the Taylor series of ∇φ(x(k))
around x(k+1); then we get

∇φ(x(k)) ≈ ∇φ(x(k+1)) − ∇²φ(x(k+1))(x(k+1) − x(k))   (4.21)

Therefore, if we want B_{k+1} to be a good approximation of ∇²φ(x(k+1)), it should
satisfy (4.21), i.e.

B_{k+1}s_k = y_k   (4.22)

where s_k = x(k+1) − x(k) and y_k = ∇φ(x(k+1)) − ∇φ(x(k)). Equation (4.22)
is called the secant condition, and the methods that use this condition to
approximate the Hessian are called Quasi-Newton methods.

In all of them we find the search direction solving

Bk p(k) = −∇φ(x(k) ) (4.23)

And some of the updates are:

• BFGS method. It is the most popular of the Quasi-Newton methods
because of its fast convergence in practice, and it is at the heart of several
optimization packages:

B_{k+1} = B_k − ( B_k s_k s_kᵀ B_k )/( s_kᵀ B_k s_k ) + ( y_k y_kᵀ )/( y_kᵀ s_k )   (4.24)

• DFP method. It applies the analogous update to the inverse of the Hessian
instead of the Hessian itself. This means that if we call B_k⁻¹ = H_k, then the
DFP update is

H_{k+1} = H_k − ( H_k y_k y_kᵀ H_k )/( y_kᵀ H_k y_k ) + ( s_k s_kᵀ )/( s_kᵀ y_k )   (4.25)

• SR1 method. It is a symmetric rank-1 update of the matrix. It is easier to
compute, but the result can be non-positive definite:

B_{k+1} = B_k + ( (y_k − B_k s_k)(y_k − B_k s_k)ᵀ )/( (y_k − B_k s_k)ᵀ s_k )   (4.26)
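As a minimal MATLAB sketch of the BFGS update (4.24); note that no safeguards
(e.g. checking the curvature condition y'*s > 0) are included here:

function B = bfgs_update(B, s, y)
    Bs = B * s;                                   % B_k s_k
    B  = B - (Bs * Bs') / (s' * Bs) + (y * y') / (y' * s);
end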

Some comments about minimizing φ(x + αp)

How much effort should we invest in finding the minimum of f(α) = φ(x(k) + αp(k))
if we are far from x*? Do we have an initial guess for α? Notice that if we are
considerably far from the minimum, it is not worthwhile to spend a lot of time finding
the exact minimum along one direction if that does not bring us sufficiently closer
to x*. This means that most of the time, the methods used for this 1D minimization
are not the most sophisticated. For example we can use backtracking, i.e. start with
α_max and iterate using αᵢ = (1/2)ⁱα_max until we find a reasonable decrease. Other
methods use bisection or parabolic interpolation with loose tolerances.

Also notice that for steepest descent and coordinate descent we do not have a
clear idea of the size of α, because the direction comes from a linear model where the
optimal α is ∞. In contrast, Newton's method comes from a quadratic model, where
the optimal α is 1. Therefore we have a natural initial guess for Newton's method,
and also for Quasi-Newton methods, but not for steepest descent or coordinate
descent. In more general cases, when the problem also has constraints (i.e. another
set of conditions that the solution must satisfy), α_max is chosen so that all the
constraints remain satisfied.

Another important concept in choosing α is that it should be large enough
to ensure convergence to a real minimum (and not to another point), and small
enough to avoid divergence or zig-zagging. For this, we use a set of conditions that
a new guess should satisfy. Some of them are:

• Armijo condition. To avoid divergence: for c ∈ (0, 1),

φ(x(k) + αp(k)) ≤ φ(x(k)) + c α ∇φ(x(k))ᵀp(k)

• Goldstein condition. To avoid small α and also divergence: for c ∈ (0, 1/2),

φ(x(k)) + (1 − c) α ∇φ(x(k))ᵀp(k) ≤ φ(x(k) + αp(k)) ≤ φ(x(k)) + c α ∇φ(x(k))ᵀp(k)

For Newton's method we can take c = 1e−4, but for steepest descent it depends
on the problem; sometimes it should be around c = 0.1.
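Here is a minimal MATLAB sketch of backtracking until the Armijo condition holds;
phi, gradphi, the current point x, and a descent direction p are assumed given:

c = 1e-4;  alpha = 1;             % alpha = 1 is natural for (quasi-)Newton
while phi(x + alpha*p) > phi(x) + c * alpha * (gradphi(x)' * p)
    alpha = alpha / 2;            % shrink the step until Armijo holds
end
x = x + alpha * p;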

Just an additional comment: coordinate descent is a really basic algorithm, and
that is the reason it is not a main topic in many optimization books. Nevertheless,
its performance is considerably good, especially nowadays in applications
such as machine learning and statistics. If you want to learn more about this topic,
check Wright's paper on arXiv; its reference is at the beginning of the notes.

Chapter 5

Interpolation

5.1 Polynomial Interpolation


Interpolation is a special way to approximate functions. This can be used to find
a function that passes through all given data (data fitting) or to approximate a
complicated function with a simple one. Once we have this model, we can use it to
predict new points or to approximate derivatives and integrals.
What are we looking for in interpolation?
Example 7. Find the line between 2 data points. This is called linear interpolation.
Solution. We know that the slope of the straight line should satisfy

( f(x) − f(x₀) ) / ( x − x₀ ) = ( f(x₁) − f(x₀) ) / ( x₁ − x₀ )   (5.1)

If we have the points (x₀, y₀) and (x₁, y₁), then we have that

( f(x) − y₀ ) / ( x − x₀ ) = ( y₁ − y₀ ) / ( x₁ − x₀ )

f(x) = y₀ + ( (y₁ − y₀)/(x₁ − x₀) ) (x − x₀)   (5.2)

In general, when we talk about interpolation, we are looking for a function of the
form

f(x) = c₀φ₀(x) + c₁φ₁(x) + ... + c_nφ_n(x)   (5.3)

such that, for our given set of data (x₀, y₀), ..., (x_n, y_n), it satisfies

yᵢ = f(xᵢ) = c₀φ₀(xᵢ) + c₁φ₁(xᵢ) + ... + c_nφ_n(xᵢ)

If we fixed the basis functions φi , in order to know the coefficients ci we need to
solve a linear system of n equations and n unknowns:
    
φ0 (x0 ) φ1 (x0 ) . . . φn (x0 ) c0 y0
 φ0 (x1 ) φ1 (x1 ) . . . φn (x1 )   c1   y1 
  =  (5.4)
 ... ... ... ...  ...   ... 
φ0 (xn ) φ1 (xn ) . . . φn (xn ) cn yn
When we talk about polynomial interpolation, we choose the basis functions to be
polynomials. But since polynomials form a vector space, we can take different sets
of basis functions to represent the same polynomial interpolant. For example:

• Monomial interpolation, if we choose φᵢ(x) = xⁱ
• Lagrange interpolation, if we choose φᵢ(x) = ∏_{j≠i} (x − xⱼ)/(xᵢ − xⱼ)
• Newton interpolation, if we choose φᵢ(x) = ∏_{j<i} (x − xⱼ)

Of course, with each choice of basis the coefficients of the polynomial will change,
but the resulting polynomial is the same.

5.1.1 Monomial Interpolation


The simplest interpolation to imagine is monomial interpolation. In this case
we are looking for the coefficients {c₀, ..., c_n} of

f(x) = c₀ + c₁x + c₂x² + ... + c_nxⁿ   (5.5)

that make it pass exactly through (x₀, y₀), ..., (x_n, y_n). Therefore we should solve
the linear system

\begin{bmatrix} 1 & x_0 & \dots & x_0^n \\ 1 & x_1 & \dots & x_1^n \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \dots & x_n^n \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix} =
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{bmatrix}   (5.6)

where the matrix is the Vandermonde matrix.

We know that the determinant of the Vandermonde matrix X is

det(X) = ∏_{i=0}^{n−1} ∏_{j=i+1}^{n} (xⱼ − xᵢ)

Therefore, if all the x entries are distinct, the determinant is nonzero and X
is invertible; then there is a unique solution for the coefficients. This means
there is a unique polynomial of order n that passes exactly through n + 1 points
with distinct abscissas. Therefore it does not matter which basis we work with:
we will get the same polynomial. The only difference is that the representation
in some bases is easier to understand than in others.

What are the drawbacks of the monomial representation?

• X is often ill-conditioned, and solving the system with Gauss elimination takes
about (2/3)n³ flops.

• The coefficients do not give much insight into the behavior of the interpolant.
For example, notice that it is easier to picture

f(x) = (x − 3)³

than

f(x) = x³ − 9x² + 27x − 27

5.1.2 Lagrange interpolation


What if we want the coefficients to be exactly the points? i.e. ci = yi

Recall that we are looking for an interpolant that, given (x₀, y₀), ..., (x_n, y_n),
satisfies

yᵢ = f(xᵢ) = c₀φ₀(xᵢ) + c₁φ₁(xᵢ) + ... + c_nφ_n(xᵢ)

For example, consider the basis functions φᵢ(x) = Lᵢ(x) such that

Lᵢ(xⱼ) = 1 if i = j, and Lᵢ(xⱼ) = 0 if i ≠ j   (5.7)

Then we have that

yᵢ = c₀L₀(xᵢ) + c₁L₁(xᵢ) + ... + cᵢLᵢ(xᵢ) + ... + c_nL_n(xᵢ)
   = c₀·0 + c₁·0 + ... + cᵢ·1 + ... + c_n·0
   = cᵢ

But now, how do we construct Lᵢ(x)?

Let us think of the case n = 2: we have (x₀, y₀), (x₁, y₁), (x₂, y₂). To construct
L₀(x) we need:

• L0 (x1 ) = 0 and L0 (x2 ) = 0

• L0 (x0 ) = 1

Then notice that we can choose

L0 (x) = a(x − x1 )(x − x2 )

That satisfies the first condition. For the second condition, we need to find the value
of a such that L₀(x₀) = 1. Therefore

1 = L₀(x₀) = a(x₀ − x₁)(x₀ − x₂)  ⟹  a = 1 / ( (x₀ − x₁)(x₀ − x₂) )

And similarly we can solve for L₁(x) and L₂(x) to get:

L₀(x) = (x − x₁)(x − x₂) / ( (x₀ − x₁)(x₀ − x₂) ),
L₁(x) = (x − x₀)(x − x₂) / ( (x₁ − x₀)(x₁ − x₂) ),
L₂(x) = (x − x₀)(x − x₁) / ( (x₂ − x₀)(x₂ − x₁) )

Now for a general n we can define the Lagrange polynomials Lᵢ(x) as

Lᵢ(x) = ∏_{j≠i} (x − xⱼ)/(xᵢ − xⱼ)   (5.8)

And therefore the interpolant is given by

f(x) = Σ_{i=0}^{n} yᵢLᵢ(x) = Σ_{i=0}^{n} yᵢ ∏_{j≠i} (x − xⱼ)/(xᵢ − xⱼ)   (5.9)

There is another expression for (5.9) that makes it easier to compute, using the
barycentric weights wᵢ. Define

ρᵢ = ∏_{j≠i} (xᵢ − xⱼ),  wᵢ = 1/ρᵢ   (5.10)

And also define the function

ψ(x) = (x − x₀)(x − x₁)...(x − x_n) = ∏_{i=0}^{n} (x − xᵢ)

Then for any set of points {(xᵢ, yᵢ)}ᵢ₌₀ⁿ

f(x) = Σ_{i=0}^{n} yᵢ wᵢ ψ(x)/(x − xᵢ)

In particular, if we interpolate the constant function one, then all yᵢ = 1 and

1 = Σ_{i=0}^{n} wᵢ ψ(x)/(x − xᵢ) = ψ(x) Σ_{i=0}^{n} wᵢ/(x − xᵢ)  ⟹  ψ(x) = 1 / Σ_{i=0}^{n} wᵢ/(x − xᵢ)

And therefore we get the barycentric interpolation formula. For x ≠ xᵢ:

f(x) = Σ_{i=0}^{n} yᵢLᵢ(x) = [ Σ_{i=0}^{n} yᵢ wᵢ/(x − xᵢ) ] / [ Σ_{i=0}^{n} wᵢ/(x − xᵢ) ]   (5.11)
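Here is a minimal MATLAB sketch of the barycentric formula (5.11); the nodes xi,
values yi and evaluation points xx (assumed distinct from the nodes) are
placeholder vectors:

n = numel(xi);
w = zeros(n, 1);
for i = 1:n
    w(i) = 1 / prod(xi(i) - xi([1:i-1, i+1:n]));   % weights (5.10)
end
num = zeros(size(xx));  den = zeros(size(xx));
for i = 1:n
    c   = w(i) ./ (xx - xi(i));
    num = num + yi(i) * c;
    den = den + c;
end
fxx = num ./ den;                 % interpolant evaluated at xx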

5.1.3 Newton Interpolation
What if we want an adaptive algorithm where we can add points one by one?

Sometimes not all the interpolation points are available when we start to interpolate.
Imagine an experiment where we collect a few points, look at the result,
and then measure more points to add to the interpolation. When we compute the
coefficients of the new polynomial, if we use either the Lagrange or the monomial
basis, we need to recompute all the values; we cannot reuse the information we had
before.

Therefore let us consider the following basis functions:

φᵢ(x) = ∏_{j=0}^{i−1} (x − xⱼ)   (5.12)

This means

φ₀ = 1, φ₁ = (x − x₀), φ₂ = (x − x₀)(x − x₁), φ₃ = (x − x₀)(x − x₁)(x − x₂), ...

We will look for the coefficients cᵢ such that

yᵢ = p_n(xᵢ) = c₀φ₀(xᵢ) + c₁φ₁(xᵢ) + ... + c_nφ_n(xᵢ)

Notice that if p_n interpolates {(x₀, y₀), ..., (x_n, y_n)}, then the polynomial p_{n+1}
that interpolates all the previous points plus an extra one (x_{n+1}, y_{n+1}) is
given by:

p_{n+1}(x) = p_n(x) + c_{n+1}(x − x₀)(x − x₁)...(x − x_n)
How can we compute the coefficients? Divided differences.
Since the definition of the polynomial is recursive, we can expect the definition
of the coefficients to be recursive as well. Notice that

1. f(x₀) = c₀

2. f(x₁) = c₀ + c₁(x₁ − x₀). Therefore:

c₁ = ( f(x₁) − f(x₀) ) / ( x₁ − x₀ )

3. f(x₂) = c₀ + c₁(x₂ − x₀) + c₂(x₂ − x₀)(x₂ − x₁). Therefore:

f(x₂) = f(x₀) + ( (f(x₁) − f(x₀))/(x₁ − x₀) ) (x₂ − x₀) + c₂(x₂ − x₀)(x₂ − x₁)

f(x₂) − f(x₁) = ( (f(x₁) − f(x₀))/(x₁ − x₀) ) (x₂ − x₁) + c₂(x₂ − x₀)(x₂ − x₁)

c₂ = [ (f(x₂) − f(x₁))/(x₂ − x₁) − (f(x₁) − f(x₀))/(x₁ − x₀) ] / ( x₂ − x₀ )

4. ...

So in general, we can define the i-th coefficient as the i-th divided difference, where:

c₀ = f[x₀] = f(x₀)
c₁ = f[x₀, x₁] = ( f(x₁) − f(x₀) )/( x₁ − x₀ ) = ( f[x₁] − f[x₀] )/( x₁ − x₀ )
c₂ = f[x₀, x₁, x₂] = ( f[x₁, x₂] − f[x₀, x₁] )/( x₂ − x₀ )
...
cᵢ = f[x₀, x₁, ..., xᵢ] = ( f[x₁, ..., xᵢ] − f[x₀, ..., xᵢ₋₁] )/( xᵢ − x₀ )   (5.13)
Therefore we can construct a table with columns i, xᵢ, f[xᵢ], f[xᵢ, xᵢ₊₁], ... and so on,
and each time we add a new value to interpolate, we just add one extra row to the table.
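Here is a minimal MATLAB sketch of the divided-difference table (5.13), with data
vectors x and y as placeholders; column k of D holds f[x_i, ..., x_{i+k-1}], and
the first row gives the Newton coefficients:

n = numel(x);
D = zeros(n);  D(:,1) = y(:);
for k = 2:n
    for i = 1:n-k+1
        D(i,k) = (D(i+1,k-1) - D(i,k-1)) / (x(i+k-1) - x(i));
    end
end
c = D(1,:);                       % Newton coefficients c_0, ..., c_{n-1}

% Evaluating the Newton form at a point t with Horner's scheme:
p = c(n);
for k = n-1:-1:1
    p = c(k) + (t - x(k)) * p;
end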
How can we estimate the error of the interpolation?
Notice that the divided differences look like approximations of higher derivatives;
indeed we have the following theorem.
Theorem 8. Let f be a function with k bounded derivatives in the interval [a, b] and
let x₀, ..., x_k be k + 1 distinct points in [a, b]. Then there is ξ ∈ [a, b] such that

f[x₀, ..., x_k] = f⁽ᵏ⁾(ξ)/k!   (5.14)

So now let us consider the points {(x₀, f(x₀)), ..., (x_n, f(x_n))} and the interpolant
p_n(x). We want to compute the error at x:

e_n(x) = f(x) − p_n(x)   (5.15)

We can construct an interpolant for the points {(x₀, f(x₀)), ..., (x_n, f(x_n))} ∪ {(x, f(x))}
given by

p_{n+1}(y) = p_n(y) + f[x₀, ..., x_n, x] ∏_{i=0}^{n} (y − xᵢ)

Therefore when we evaluate at y = x we get that

e_n(x) = f(x) − p_n(x) = f[x₀, ..., x_n, x] ∏_{i=0}^{n} (x − xᵢ) = ( f⁽ⁿ⁺¹⁾(ξ)/(n + 1)! ) ∏_{i=0}^{n} (x − xᵢ)   (5.16)

Therefore we can bound the maximum error as

max_{a≤x≤b} |f(x) − p_n(x)| ≤ ( max_ξ |f⁽ⁿ⁺¹⁾(ξ)| / (n + 1)! ) max_{a≤s≤b} ∏_{i=0}^{n} |s − xᵢ|   (5.17)

Name     | Basis                                   | Construction cost | Properties
Monomial | {xⁱ}ᵢ₌₀ⁿ                                | (2/3)n³           | Intuitive; proves existence
Lagrange | {∏_{j≠i}(x − xⱼ)/(xᵢ − xⱼ)}ᵢ₌₀ⁿ          | n²                | Stable; cᵢ = yᵢ
Newton   | {∏_{j<i}(x − xⱼ)}ᵢ₌₀ⁿ                   | (3/2)n²           | Adaptive; easy error estimate

Table 5.1: Properties of different basis polynomials

5.2 Piecewise Polynomial Interpolation


5.2.1 Runge’s Phenomenon
It looks that everything works well with Polynomial Interpolation, any
set of points has a unique polynomial, is that enough?
1
Let us consider the function f (x) = 1+x 2 for x ∈ [−5, 5], and suppose that we
sample the data (x0 , f (x0 )), (x1 , f (x1 )), ..., (xn , f (xn )) where all the x-coordinates
are equally spaced, i.e. xi = −5 + i ∗ 10/n. Let us see how is the behavior of the
interpolant in Figure 5.1.

Figure 5.1: Equally spaced interpolation of Runge phenomenon

As you can see, as we increase n, the interpolant has more and more oscillations
at the tails of the interval, and the error is concentrated there. Adding more points
only makes the situation worse.

Now let us use a different set of x-coordinates called the Chebyshev points.
These coordinates correspond to the roots of the Chebyshev polynomials in the
interval [−1, 1], and they are given by:

x_k = cos( (2k − 1)π / (2n) ),  k = 1, ..., n   (5.18)

In order to scale them to the interval [−5, 5] we multiply them by 5, and in this case
we also include the endpoints of the interval.

Figure 5.2: Interpolation of Runge phenomenon using Chebyshev points

Now when we plot the interpolant of this new set {(x_k, f(x_k))}, we do not
have as many oscillations as before and the error stays bounded.
Why does this happen?
This is called Runge's phenomenon, and it illustrates why we usually do not
consider polynomial interpolation of high degree unless the set of x-coordinates is
really special. If we recall equation (5.17), the error depends on the maximum
of the (n+1)-th derivative of the function, and also on the distribution of the
points, through

max_{a≤s≤b} ∏_{i=0}^{n} |s − xᵢ|   (5.19)

In particular, the Chebyshev points, properly scaled, minimize (5.19), and that
explains the different behavior between Figures 5.1 and 5.2.

Therefore, in general, if we can choose the x-coordinates of the interpolation, we
can pick a special set of points, for example Chebyshev points, and we can use
high-order polynomial interpolation. But in general we do not have the freedom to
choose the abscissas. Therefore we need another way to approximate our functions
without having huge oscillations.
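Here is a minimal MATLAB sketch reproducing the comparison of Figures 5.1 and 5.2;
the degree n = 12 is an arbitrary choice (polyfit may warn about conditioning for
the equispaced nodes, which is part of the point):

f  = @(x) 1 ./ (1 + x.^2);
n  = 12;
xe = linspace(-5, 5, n+1);                           % equispaced nodes
xc = 5 * cos((2*(1:n+1) - 1) * pi / (2*(n+1)));      % scaled Chebyshev nodes
xx = linspace(-5, 5, 500);
pe = polyval(polyfit(xe, f(xe), n), xx);             % oscillates near +-5
pc = polyval(polyfit(xc, f(xc), n), xx);             % stays close to f
plot(xx, f(xx), xx, pe, xx, pc); legend('f', 'equispaced', 'Chebyshev');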

5.2.2 Piecewise linear interpolation


If I have a hundred pairs of data, how can I interpolate them?

In the previous section we saw that interpolating a small set of data leads to a
low-order polynomial, and everything is well behaved. Therefore the idea now
is to split the data into small chunks and interpolate each piece separately.

The simplest interpolation we can propose is piecewise constant interpolation,
where given the points {(xᵢ, f(xᵢ))}ᵢ₌₀ⁿ we define our interpolant φ(x) as

φ(x) = f(xᵢ) if (xᵢ₋₁ + xᵢ)/2 ≤ x ≤ (xᵢ + xᵢ₊₁)/2   (5.20)

The problem with this interpolant is that it is not even continuous: it has
jumps at (xᵢ + xᵢ₊₁)/2 for all i = 0, 1, ..., n − 1.

The next simplest interpolation is piecewise linear interpolation, which consists
in joining each pair of adjacent points with a straight line. Given the points
{(xᵢ, f(xᵢ))}ᵢ₌₀ⁿ we define our interpolant φ(x) as

φ(x) = f(xᵢ) (x − xᵢ₋₁)/(xᵢ − xᵢ₋₁) + f(xᵢ₋₁) ( 1 − (x − xᵢ₋₁)/(xᵢ − xᵢ₋₁) ) if xᵢ₋₁ ≤ x ≤ xᵢ   (5.21)

This representation is really easy to compute, and is often a reasonably accurate
approximation. Sometimes that is all we need. But other times we are
interested in a smoother approximation. That means we should increase the order
of the polynomial, and here many options open up: there is not a single
path to take.

On one hand, notice that in the constant interpolation we took 1 point and in the
linear interpolation we took 2 points, so we can increase the order of
the polynomial by taking more points to define each curve (bigger chunks). For
example, if we want to construct a piecewise quadratic interpolation, we can take
x_{2i−1}, x_{2i}, x_{2i+1} and construct a parabola through them. The only problem with
this approach is what happens at the break points: the parabola is smooth inside
(x_{2i−1}, x_{2i+1}), but the derivative is not continuous at x_{2i−1} and x_{2i+1}.

The second approach is to work with the same chunks (only two points defining
each polynomial piece) and impose not only continuity at each xᵢ but also continuity
of the derivatives at each xᵢ.

5.2.3 Cubic Spline Interpolation


Suppose we want to do a piecewise cubic interpolation between 2 points. We
want to fit the polynomial

φᵢ(x) = aᵢ + bᵢ(x − xᵢ) + cᵢ(x − xᵢ)² + dᵢ(x − xᵢ)³ for xᵢ ≤ x ≤ xᵢ₊₁

so we need 4 equations. If we know (xᵢ, f(xᵢ), f′(xᵢ)) and (xᵢ₊₁, f(xᵢ₊₁), f′(xᵢ₊₁)),
we can solve this system of equations and find the cubic interpolant. This process
is called piecewise cubic Hermite interpolation, because it uses the
information of the derivative at each point. If we modify one point, only 2 curves
change, so the interpolation is local. Notice that the function is smooth inside
each (xᵢ, xᵢ₊₁), but the second derivatives are in general not continuous at the xᵢ.

Let us see the equations we get:

f(xᵢ) = aᵢ
f(xᵢ₊₁) = aᵢ + bᵢ(xᵢ₊₁ − xᵢ) + cᵢ(xᵢ₊₁ − xᵢ)² + dᵢ(xᵢ₊₁ − xᵢ)³
f′(xᵢ) = bᵢ
f′(xᵢ₊₁) = bᵢ + 2cᵢ(xᵢ₊₁ − xᵢ) + 3dᵢ(xᵢ₊₁ − xᵢ)²

Therefore to find the coefficients aᵢ, bᵢ, cᵢ, dᵢ we solve:

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & (x_{i+1}-x_i) & (x_{i+1}-x_i)^2 & (x_{i+1}-x_i)^3 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 2(x_{i+1}-x_i) & 3(x_{i+1}-x_i)^2 \end{bmatrix}
\begin{bmatrix} a_i \\ b_i \\ c_i \\ d_i \end{bmatrix} =
\begin{bmatrix} f(x_i) \\ f(x_{i+1}) \\ f'(x_i) \\ f'(x_{i+1}) \end{bmatrix}   (5.22)
But what if we do not have the derivatives?
If we do not have the derivatives, we can use a finite difference approximation of
them, or we can simply impose that the derivatives of two adjacent pieces agree at
the break points, without fixing their values. Then we have the following set of
equations:

φᵢ(xᵢ) = f(xᵢ)
φᵢ(xᵢ₊₁) = f(xᵢ₊₁)
φ′ᵢ(xᵢ₊₁) = φ′ᵢ₊₁(xᵢ₊₁)
φ″ᵢ(xᵢ₊₁) = φ″ᵢ₊₁(xᵢ₊₁)

Therefore if we have n + 1 points, we have n splines, each one with 4 unknowns,
and so far we have 2n + 2(n − 1) equations. Therefore we need to impose 2
additional conditions, which can be:

• Free boundary or Natural spline: φ″(x₀) = φ″(x_n) = 0

• Clamped boundary or Complete spline: φ′(x₀) = f′(x₀), φ′(x_n) = f′(x_n)

• Not-a-knot: the third derivative is also continuous at x₁ and at x_{n−1}

As you can see, in this case all 4n equations are coupled; therefore the interpolation,
although it looks local, is not local in reality. To solve for the coefficients,
we construct a linear system Ax = b, where A is a tridiagonal matrix. Once
we find the coefficients, we get a spline that is C² on the whole interval (x₀, x_n).
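In MATLAB this construction is built in; here is a minimal sketch (the data are
arbitrary placeholders). By default spline() uses the not-a-knot end conditions:

x  = 0:10;
y  = sin(x);
xx = 0:0.01:10;
yy = spline(x, y, xx);            % C^2 piecewise cubic through the data
plot(x, y, 'o', xx, yy);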

5.3 Some additional comments


When we introduced interpolation, we talked about basis functions φᵢ, and we said
that we were looking for a function

f(x) = c₀φ₀(x) + c₁φ₁(x) + ... + c_nφ_n(x)

but when we started to talk about piecewise interpolation, it looks like we forgot
about them. Basis functions are important because they help us describe any
function in the cases where we have no idea of its shape, for example when solving
differential equations. Some of them have a simple form, and some others are not
that simple.

In the case of piecewise linear functions, the basis functions are called hat
functions, and they are really famous in the Finite Element method for solving
differential equations. Like the Lagrange polynomials they satisfy φⱼ(xᵢ) = δᵢⱼ,
but the difference is that they have compact support (i.e. they are nonzero only
on a small interval). They are defined by

φⱼ(x) = (x − xⱼ₋₁)/(xⱼ − xⱼ₋₁) if xⱼ₋₁ ≤ x ≤ xⱼ,
        (x − xⱼ₊₁)/(xⱼ − xⱼ₊₁) if xⱼ ≤ x ≤ xⱼ₊₁,
        0 otherwise   (5.23)

The splines also have related basis functions called B-splines; they are the basis
of Computer-Aided Design (CAD), and sometimes they are also used for solving
differential equations.

What if my curve is too complicated or I want to interpolate points
in more dimensions?

In those cases we use parametric curves. If we have the data {(xᵢ, yᵢ)}, we
introduce a new variable τᵢ that parametrizes the curve. Then we solve two separate
interpolation problems with the data {(τᵢ, xᵢ)} and {(τᵢ, yᵢ)}, and the resulting
curve is (x(τ), y(τ)).

But if I have 100 points, which is the best method to choose?

There is not a simple answer in this case. As we saw in chapter 4, we can formulate
any model and fit the data by solving a least squares problem. In that case we
can choose whatever polynomial order we want and fit as many points as we have;
there is no correspondence between the number of points and the polynomial degree.
The problem (or sometimes the advantage) of this approach is that the model is
not guaranteed to pass exactly through all the points; instead, the residual error
is minimized. Therefore if we have experimental data that we know has measurement
errors, the best option is to go for least squares, and it is up to you which model
to select.

But then, why does interpolation exist? Well, not every model wants to fit
experimental data. Sometimes you are sure your function has to satisfy certain
points, and moreover you are interested in the integral or the derivative of your
function. Think of a really complicated function whose integral you want to
estimate on a certain interval. Then you would like to take as many points
as you can in order to approximate that integral. You can use a global
interpolation with "special" points, such as Chebyshev points, or you can use
piecewise interpolation: constant, linear or cubic.

Chapter 6

Numerical Integration

The idea behind numerical integration is to remember that if f is nice enough, the
integral of f over an interval is the limit of Riemann sums. Therefore, if we want
to approximate an integral, it sounds reasonable to propose the following:

I = ∫_a^b f(x)dx ≈ Σ_{i=0}^{n} ωᵢf(xᵢ)   (6.1)

where the xᵢ are called the abscissas and the ωᵢ the weights. This form is called a
quadrature rule.

Does this formula look familiar?

When we were interpolating points {(xᵢ, f(xᵢ))}ᵢ₌₀ⁿ, we chose certain basis functions
such that

f(x) ≈ p_n(x) = Σ_{i=0}^{n} f(xᵢ)φᵢ(x)

For example, in the case of polynomial interpolation φᵢ(x) = Lᵢ(x), the Lagrange
polynomials, and in the case of piecewise linear interpolation the φᵢ(x) were the
hat functions.

Therefore, using the interpolant, we have that

I = ∫_a^b f(x)dx ≈ ∫_a^b p_n(x)dx = ∫_a^b Σ_{i=0}^{n} f(xᵢ)φᵢ(x)dx = Σ_{i=0}^{n} f(xᵢ) ∫_a^b φᵢ(x)dx

where the weights are ωᵢ = ∫_a^b φᵢ(x)dx. In conclusion, when we approximate an
integral using a quadrature rule, we interpolate the function using a set of basis
functions whose integrals we know, and then we just transform the integral into a
sum of terms.
6.1 Newton-Cotes Formulas
To introduce the different Newton-Cotes Formulas, let us follow the same steps from
Polynomial Interpolation. First we have a function, and let us choose a set of n + 1
points in the interval [a, b]. Once we have the points {(xi , f (xi ))}ni=0 , then we can
find a unique polynomial of order n that passes through all the points.

For example, if n = 0 and we choose x₀ = (b + a)/2, then

p₀(x) = f(x₀) = f( (b + a)/2 )

And the corresponding quadrature formula is the Midpoint rule

I ≈ ∫_a^b p₀(x)dx = f( (b + a)/2 ) (b − a)

Now, if n = 1 and we choose x₀ = a and x₁ = b, then

p₁(x) = f(x₀) (x − x₁)/(x₀ − x₁) + f(x₁) (x − x₀)/(x₁ − x₀) = [ f(b)(x − a) − f(a)(x − b) ] / (b − a)

And the corresponding quadrature formula is the Trapezoidal rule

I ≈ ∫_a^b p₁(x)dx = (b − a)/2 ( f(a) + f(b) )

And if n = 2 and we choose x₀ = a, x₁ = (a + b)/2, x₂ = b, then

p₂(x) = f(x₀) (x − x₁)(x − x₂)/((x₀ − x₁)(x₀ − x₂)) + f(x₁) (x − x₀)(x − x₂)/((x₁ − x₀)(x₁ − x₂)) + f(x₂) (x − x₀)(x − x₁)/((x₂ − x₀)(x₂ − x₁))

And the corresponding quadrature formula is Simpson's rule

I ≈ ∫_a^b p₂(x)dx = (b − a)/6 ( f(a) + 4f( (b + a)/2 ) + f(b) )

Using equally spaced points, with n = 3 we get the three-eighths rule and with n = 4
we get Boole's rule.

On the other hand, the Midpoint rule is an example of an open quadrature formula,
because it does not include the endpoints a, b, whereas the trapezoidal rule and
Simpson's rule are called closed quadrature formulas, because they include the
endpoints.

But following the same reasoning as in interpolation, you can imagine that using
equidistant points with a large n is not the most stable algorithm. Therefore we
have two choices: split the interval into small chunks (which is equivalent to doing
a piecewise interpolation) or choose an optimal set of points for the quadrature rule.
The first approach is called Composite Newton-Cotes formulas and the second
one is called Gauss quadrature.

6.1.1 Composite Newton-Cotes formula


Are we interested in preserving the derivatives at every point?

In the case of integration, we can split the interval into different segments and
compute the integral separately on each of them. Therefore we can use an
interpolation scheme whose derivatives, or even the function itself, may be
discontinuous at finitely many points.

I = ∫_a^b f(x)dx = Σ_{i=1}^{n} ∫_{aᵢ}^{bᵢ} f(x)dx ≈ Σ_{i=1}^{n} ∫_{aᵢ}^{bᵢ} p(i)(x)dx = ∫_a^b φ(x)dx   (6.2)

where p(i) is the interpolant on the i-th segment and φ is the full piecewise
interpolant; we integrate each segment separately.

If we use piecewise linear interpolation on n + 1 points x₀, x₁, ..., x_n, it is
equivalent to splitting the interval [a, b] into n segments and applying the
trapezoidal rule to each one. Therefore the composite trapezoidal rule is

I ≈ Σ_{i=1}^{n} (xᵢ − xᵢ₋₁)/2 ( f(xᵢ₋₁) + f(xᵢ) )   (6.3)

If all of the points are equally spaced, i.e. xᵢ = a + ih where h = (b − a)/n, then

I ≈ (h/2) ( f(a) + 2 Σ_{i=1}^{n−1} f(xᵢ) + f(b) )   (6.4)

Similarly, with piecewise quadratic interpolation we get the composite Simpson's
rule (here we assume that x_{2i−1} is the midpoint between x_{2(i−1)} and x_{2i};
this is not the most general case):

I ≈ Σ_{i=1}^{n/2} ( x_{2i} − x_{2(i−1)} )/6 ( f(x_{2(i−1)}) + 4f(x_{2i−1}) + f(x_{2i}) )

Using equally spaced points, i.e. xᵢ = a + ih where h = (b − a)/n and n even, we get

I ≈ (h/3) ( f(a) + 4 Σ_{i=1}^{n/2} f(x_{2i−1}) + 2 Σ_{i=1}^{n/2−1} f(x_{2i}) + f(b) )   (6.5)
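Here is a minimal MATLAB sketch of the composite rules (6.4) and (6.5); the
integrand and n are arbitrary choices, and n must be even for Simpson:

f = @(x) exp(-x.^2);  a = 0;  b = 1;  n = 100;
h = (b - a) / n;  x = a + (0:n)*h;  y = f(x);
I_trap = h/2 * (y(1) + 2*sum(y(2:n)) + y(n+1));                 % (6.4)
I_simp = h/3 * (y(1) + 4*sum(y(2:2:n)) + 2*sum(y(3:2:n-1)) + y(n+1));  % (6.5)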

6.1.2 Error of Newton-Cotes formulas
Since we are using a polynomial to interpolate f(x) and then integrating, we can
express the quadrature error in terms of the interpolation error. This means that
if we call

I = ∫_a^b f(x)dx,  I_n = ∫_a^b φ_n(x)dx   (6.6)

where φ_n(x) is an interpolant of f(x) using n + 1 points, we know that

I − I_n = ∫_a^b f(x) − φ_n(x)dx   (6.7)

For example, if φ_n(x) is the polynomial that interpolates the points {(xᵢ, f(xᵢ))},
then we know from (5.16) that

e = I − I_n = ∫_a^b f(x) − φ_n(x)dx = ∫_a^b f[x₀, x₁, ..., x_n, x] ∏_{i=0}^{n} (x − xᵢ) dx   (6.8)

And therefore, for the Newton-Cotes formulas defined above, we have:

• Midpoint rule: e = ( f″(ξ₁)/24 ) (b − a)³

• Trapezoidal rule: e = −( f″(ξ₂)/12 ) (b − a)³

• Simpson's rule: e = −( f⁽⁴⁾(ξ₃)/90 ) ( (b − a)/2 )⁵

• Composite trapezoidal rule (equally spaced points): e = −( f″(η₁)/12 ) (b − a)h²

• Composite Simpson's rule (equally spaced points): e = −( f⁽⁴⁾(η₂)/180 ) (b − a)h⁴

So, for example, how can we get the error of the Midpoint rule? Using a Taylor
series around (b + a)/2:

e = ∫_a^b [ f(x) − f( (b + a)/2 ) ] dx
  = ∫_a^b [ f′( (b + a)/2 )( x − (b + a)/2 ) + (1/2)f″(ξ)( x − (b + a)/2 )² ] dx
  = ( f″(ξ₁)/24 ) (b − a)³

(the first-order term integrates to zero by symmetry around the midpoint).
Definition 20. We call the precision of a quadrature formula the largest integer p
such that the quadrature error is zero (e = 0) for all polynomials of degree at most p.

Example 8. What is the precision of the previous quadrature formulas?

Solution. Since all the errors are expressed using derivatives, it is easy to see that:

• Midpoint rule: uses 1 point and has precision p = 1

• Trapezoidal rule: uses 2 points and has precision p = 1

• Simpson's rule: uses 3 points and has precision p = 3

• In general, a Newton-Cotes rule that uses n + 1 points has precision p = n if n
is odd and p = n + 1 if n is even (because of symmetry around intermediate
points)

6.2 Gaussian Quadrature


Going back to (6.1), we want to find weights ωᵢ and abscissas xᵢ such that:

I = ∫_a^b f(x)dx ≈ Σ_{i=0}^{n} ωᵢf(xᵢ)

With the Newton-Cotes approach, we saw that using n + 1 points we can get at most
precision p = n + 1; this means we can integrate exactly polynomials of degree at
most n + 1. But notice that in (6.1) we have 2(n + 1) free choices, so something
tells us that we can optimize the choice of weights and abscissas to get a better
precision with the same number of points.

Let us suppose we want to find a quadrature rule that exactly integrates all the
polynomials of degree at most 3 on the interval [−1, 1] using the minimum number
of points. In particular it should integrate {1, x, x², x³}. Let us suppose we use
2 abscissas; they should satisfy:
∫_{−1}^{1} 1 dx = ω₁(1) + ω₂(1)
∫_{−1}^{1} x dx = ω₁x₁ + ω₂x₂
∫_{−1}^{1} x² dx = ω₁x₁² + ω₂x₂²
∫_{−1}^{1} x³ dx = ω₁x₁³ + ω₂x₂³

Now we have a non-linear system of equations to solve for {ω₁, ω₂, x₁, x₂}. Using
symmetry we know that x₁ = −x₂ and ω₁ = ω₂, and then we get

ω₁ = 1, x₁ = −1/√3, ω₂ = 1, x₂ = 1/√3

Therefore using only 2 points we get precision p = 3 (in contrast to the trapezoidal
rule, where we only get p = 1). In general, we can extend the same procedure
to any number of points, and we can show that using n + 1 "optimal" points we get a
precision of p = 2n + 1. This quadrature approach is called Gaussian quadrature.
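As a minimal MATLAB check: the 2-point Gauss rule integrates this cubic exactly
on [−1, 1], while the trapezoidal rule (also using 2 points) does not. The test
integrand is an arbitrary choice:

f = @(x) x.^3 + x.^2;                     % exact integral on [-1,1] is 2/3
I_gauss = f(-1/sqrt(3)) + f(1/sqrt(3));   % w1 = w2 = 1, gives 2/3 exactly
I_trap  = (2/2) * (f(-1) + f(1));         % (b-a)/2 * (f(a)+f(b)), gives 2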

The theory behind Gaussian quadrature is a family of orthogonal polynomials
called the Legendre polynomials. The idea is that when we use their roots
to interpolate a function, we minimize the error in its integral. If φ_k is the
Legendre polynomial of order k, then for any polynomial q(x) of order j < k we
have that

∫_{−1}^{1} q(x)φ_k(x)dx = 0   (6.9)

Therefore, if we want to integrate a polynomial f(x) of order p = 2k − 1, there exist
q(x), r(x) of degree less than k such that

f(x) = q(x)φ_k(x) + r(x)

Therefore

∫_{−1}^{1} f(x)dx = ∫_{−1}^{1} q(x)φ_k(x) + r(x)dx = ∫_{−1}^{1} r(x)dx   (6.10)

and if we evaluate f at the roots xᵢ of φ_k, then

f(xᵢ) = q(xᵢ)φ_k(xᵢ) + r(xᵢ) = r(xᵢ)   (6.11)

Then if the quadrature rule integrates r(x) exactly, it also integrates f(x) exactly.

If you want to see a complete explanation check Bradie, Section 6.6.

6.3 Some additional comments
What does it mean to interpolate (integrate) using "..." polynomials,
for example using Legendre polynomials?

It means that if we want to use n + 1 points, we use the roots of the "..." polynomial
of order n + 1 as the abscissas.

The Legendre polynomials are optimal to integrate polynomials, but
what if my function is not a polynomial?

Gaussian quadrature can be generalized to integrals of the shape

I = ∫_a^b w(x)f(x)dx ≈ Σ_{i=0}^{n} ωᵢf(xᵢ)   (6.12)

where w(x) is a weight function that is positive and integrable. For each weight
function, we can find an orthogonal family of polynomials whose roots are the
optimal choice for the abscissas. Some examples are:

• Laguerre polynomials: w(x) = e^{−x} on the interval [0, ∞)

• Hermite polynomials: w(x) = e^{−x²} on the interval (−∞, ∞)

• Chebyshev polynomials: w(x) = 1/√(1 − x²) on the interval [−1, 1]

How can I integrate in multiple dimensions?

Consider a simple domain Ω ⊂ R^m; then the integral of the function over the
domain can be decomposed into iterated integrals over each variable:

I = ∫_Ω f(x)dx = ∫_{a(1)}^{b(1)} ∫_{a(2)}^{b(2)} ... ∫_{a(m)}^{b(m)} f(x₁, x₂, ..., x_m) dx_m ... dx₂ dx₁   (6.13)

And then we can apply a quadrature rule to each integral. For example, if we use
a quadrature rule of n + 1 points in each direction, we will end up with a grid of
(n + 1)^m points to evaluate the whole integral.

Chapter 7

Numerical Differentiation

When we started the course, one of the principal goals in developing all the methods
was to numerically solve differential equations. First we learned how to
solve systems of equations (linear and non-linear) and we applied them to solve
unconstrained optimization problems. Some of those optimization problems involved
data fitting using linear and non-linear least squares. At the heart of all these
methods was approximating the function with a Taylor series expansion around a
certain point x₀.

Later, we switched subject a bit, and we looked for methods to fit polynomials
that exactly satisfy some data, i.e. that interpolate it. So given certain points
(xᵢ, yᵢ), we proposed a model

φ(x) = Σ_{j=1}^{n} cⱼφⱼ(x)

i.e. a linear combination of basis functions, and we looked for coefficients cⱼ such
that yᵢ = φ(xᵢ). First we proposed global polynomials, and using different bases we
proved existence and error bounds. Then, to avoid oscillations, we proposed local
polynomials (piecewise functions), and we enforced continuity at the break points.

Lastly, we talked about integration as a direct application of interpolation. In this
case we approximate the function with an interpolant, and instead of integrating
the function, we integrate the interpolant. With this idea behind, we developed
quadrature rules such that

I ≈ Σ_{i=1}^{n} f(xᵢ)ωᵢ

But if the idea is to solve differential equations, what about numerical
differentiation? In this chapter we will show two ways to derive numerical
differentiation formulas. The first one is intuitive, using Taylor series, and the
second one is more general, using Lagrange interpolation.

7.1 Numerical Differentiation using Taylor series


Throughout this quarter, we have been using numerical differentiation without intro-
ducing it formally. Recalling Theorem 1, we have that

f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(\xi) \quad \text{where } \xi \in [x_0, x_0 + h]    (7.1)

Therefore we have

f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi) \quad \text{where } \xi \in [x_0, x_0 + h]
        = \frac{f(x_0 + h) - f(x_0)}{h} + O(h)    (7.2)

Notice that when h \to 0, \frac{f(x_0+h) - f(x_0)}{h} \to f'(x_0).

The formula (7.2) is called the forward difference approximation and it is first
order accurate because the error is O(h). Similarly, using the Taylor series of
f(x_0 - h) we get the backward difference formula

f'(x_0) = \frac{f(x_0) - f(x_0 - h)}{h} + \frac{h}{2} f''(\xi) \quad \text{where } \xi \in [x_0 - h, x_0]
        = \frac{f(x_0) - f(x_0 - h)}{h} + O(h)    (7.3)
Now using Taylor series up to order 2 around x_0 we have that there exist \xi_1 and \xi_2
near x_0 such that

f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \frac{h^3}{3!} f'''(\xi_1)
f(x_0 - h) = f(x_0) - h f'(x_0) + \frac{h^2}{2} f''(x_0) - \frac{h^3}{3!} f'''(\xi_2)

Therefore, subtracting both equations and using the intermediate value theorem, we get
that for \xi \in [x_0 - h, x_0 + h]

f(x_0 + h) - f(x_0 - h) = 2h f'(x_0) + \frac{h^3}{3} f'''(\xi)
f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{h^2}{6} f'''(\xi)    (7.4)

Equation (7.4) is called the centered difference formula and it is second order accurate.
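To see these orders in practice, here is a small MATLAB sketch (the test function f(x) = sin(x) is an arbitrary choice) that compares the forward formula (7.2) with the centered formula (7.4):

f = @(x) sin(x);  x0 = 1;  exact = cos(x0);          % f'(x0) is known here
h = 10.^-(1:6);
err_fwd = abs((f(x0+h) - f(x0))./h       - exact);   % decays like O(h)
err_ctr = abs((f(x0+h) - f(x0-h))./(2*h) - exact);   % decays like O(h^2)

Each time h shrinks by a factor of 10, err_fwd should drop by about 10 and err_ctr by about 100, until roundoff error starts to dominate.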

Now, to approximate the second derivative, we can again use the Taylor series of
x_0 + h and x_0 - h, up to order 3:

f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \frac{h^3}{3!} f'''(x_0) + \frac{h^4}{4!} f^{(4)}(\xi_1)
f(x_0 - h) = f(x_0) - h f'(x_0) + \frac{h^2}{2} f''(x_0) - \frac{h^3}{3!} f'''(x_0) + \frac{h^4}{4!} f^{(4)}(\xi_2)

Adding both equations we get the centered formula for the second derivative:

f(x_0 + h) + f(x_0 - h) = 2 f(x_0) + h^2 f''(x_0) + 2\frac{h^4}{4!} f^{(4)}(\xi)
f''(x_0) = \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi)    (7.5)

And equation (7.5) is second order accurate.

Using more points tends to give us higher accuracy. So how can
we generalize these methods to involve more points?

For example, if we want to use x_0 - 2h, x_0 - h, x_0, x_0 + h and x_0 + 2h to generate a
fourth order approximation formula for the first derivative, we can solve the system

\begin{pmatrix}
f(x_0) & f(x_0) & f(x_0) & f(x_0) & f(x_0) \\
0 & (-2h)f'(x_0) & (-h)f'(x_0) & (h)f'(x_0) & (2h)f'(x_0) \\
0 & \frac{(-2h)^2}{2!}f''(x_0) & \frac{(-h)^2}{2!}f''(x_0) & \frac{(h)^2}{2!}f''(x_0) & \frac{(2h)^2}{2!}f''(x_0) \\
0 & \frac{(-2h)^3}{3!}f'''(x_0) & \frac{(-h)^3}{3!}f'''(x_0) & \frac{(h)^3}{3!}f'''(x_0) & \frac{(2h)^3}{3!}f'''(x_0) \\
0 & \frac{(-2h)^4}{4!}f^{(4)}(x_0) & \frac{(-h)^4}{4!}f^{(4)}(x_0) & \frac{(h)^4}{4!}f^{(4)}(x_0) & \frac{(2h)^4}{4!}f^{(4)}(x_0)
\end{pmatrix}
\begin{pmatrix} \alpha_0 \\ \alpha_{-2h} \\ \alpha_{-h} \\ \alpha_h \\ \alpha_{2h} \end{pmatrix}
=
\begin{pmatrix} 0 \\ f'(x_0) \\ 0 \\ 0 \\ 0 \end{pmatrix}

and find the coefficients by which we should multiply each Taylor expansion.

Dividing each row by f^{(i)}(x_0) h^i / i!, we get

\begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
0 & -2 & -1 & 1 & 2 \\
0 & (-2)^2 & (-1)^2 & (1)^2 & (2)^2 \\
0 & (-2)^3 & (-1)^3 & (1)^3 & (2)^3 \\
0 & (-2)^4 & (-1)^4 & (1)^4 & (2)^4
\end{pmatrix}
\begin{pmatrix} \alpha_0 \\ \alpha_{-2h} \\ \alpha_{-h} \\ \alpha_h \\ \alpha_{2h} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 1/h \\ 0 \\ 0 \\ 0 \end{pmatrix}

Nevertheless, with this approach we do not gain much insight into what each of the
coefficients means, nor into how the underlying approximation behaves.
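Even so, the system itself is easy to solve numerically. Here is a minimal MATLAB sketch (the variable names are arbitrary) that recovers the weights of the fourth order formula:

h = 0.1;
M = [1  1  1  1  1;      % rows: Taylor orders 0 through 4
     0 -2 -1  1  2;
     0  4  1  1  4;
     0 -8 -1  1  8;
     0 16  1  1 16];
rhs = [0; 1/h; 0; 0; 0];
alpha = M \ rhs;         % gives [0, 1, -8, 8, -1]'/(12h), the 5-point weights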

7.2 Numerical Differentiation using Interpolation
Now let us go back to the case of approximating the first derivative at x_0 using
5 points: x_0 - 2h, x_0 - h, x_0, x_0 + h, x_0 + 2h. In this case let us consider the Lagrange
polynomial that interpolates the 5 points:

\phi(x) = \sum_{i=-2}^{2} f(x_0 + ih) L_i(x) \quad \text{where} \quad L_i(x) = \prod_{\substack{j=-2 \\ j \neq i}}^{2} \frac{x - (x_0 + jh)}{(x_0 + ih) - (x_0 + jh)}

Therefore

\phi'(x) = \sum_{i=-2}^{2} f(x_0 + ih) L_i'(x)

\text{where} \quad L_i'(x) = \sum_{\substack{k=-2 \\ k \neq i}}^{2} \frac{1}{(x_0 + ih) - (x_0 + kh)} \prod_{\substack{j=-2 \\ j \neq i,k}}^{2} \frac{x - (x_0 + jh)}{(x_0 + ih) - (x_0 + jh)}

In particular we have that

L_i'(x_i) = \frac{1}{h} \sum_{\substack{k=-2 \\ k \neq i}}^{2} \frac{1}{i - k}    (7.6)

L_i'(x_l) = \frac{1}{h} \frac{1}{i - l} \prod_{\substack{j=-2 \\ j \neq i,l}}^{2} \frac{l - j}{i - j}    (7.7)

Therefore

L_0'(x_0) = -\frac{1}{h} \sum_{\substack{k=-2 \\ k \neq 0}}^{2} \frac{1}{k} = -\frac{1}{h}\left( \frac{1}{-2} + \frac{1}{-1} + \frac{1}{1} + \frac{1}{2} \right) = 0

L_i'(x_0) = \frac{1}{ih} \prod_{\substack{j=-2 \\ j \neq i,0}}^{2} \frac{-j}{i - j} \quad (i \neq 0)
Then

L_{-2}'(x_0) = -\frac{1}{2h} \frac{(1)(-1)(-2)}{(-2+1)(-2-1)(-2-2)} = \frac{1}{12h}

L_{-1}'(x_0) = -\frac{1}{1h} \frac{(2)(-1)(-2)}{(-1+2)(-1-1)(-1-2)} = -\frac{8}{12h}

L_{1}'(x_0) = \frac{1}{1h} \frac{(2)(1)(-2)}{(1+2)(1+1)(1-2)} = \frac{8}{12h}

L_{2}'(x_0) = \frac{1}{2h} \frac{(2)(1)(-1)}{(2+2)(2+1)(2-1)} = -\frac{1}{12h}
Therefore, using the error formula for interpolation, we have that

f'(x_0) = \phi'(x_0) + \frac{\partial}{\partial x}\left( f[x_0-2h, x_0-h, x_0, x_0+h, x_0+2h, x] \prod_{i=-2}^{2} (x - (x_0 + ih)) \right)\Big|_{x=x_0}
        = \sum_{i=-2}^{2} f(x_0 + ih) L_i'(x_0) + f[x_0-2h, x_0-h, x_0, x_0+h, x_0+2h, x_0] \prod_{\substack{i=-2 \\ i \neq 0}}^{2} (ih)
        = \frac{1}{12h}\left( f(x_0-2h) - 8f(x_0-h) + 8f(x_0+h) - f(x_0+2h) \right) + \frac{h^4}{30} f^{(5)}(\xi)

which is exactly the same result as solving the linear system given by the Taylor
expansions around x_0. The difference is that now we understand that the deriva-
tive comes from fitting a polynomial of order 4 at the points x_0 - 2h, x_0 - h, x_0, x_0 + h, x_0 + 2h.

The behavior we saw in interpolation carries over to differentiation. So unless we are
working with special abscissas (e.g. Chebyshev points; we call these spectral methods),
we do not want to use all the points in the grid to approximate the derivative;
otherwise we can have oscillations because of the Runge phenomenon.

Therefore what we prefer to do is use local approximations of the derivatives (for
example (7.2), (7.3), (7.4), (7.5)). This is equivalent to using piecewise polynomial
interpolation.

On the other hand, notice that regardless of the method we use to approximate
the derivatives, we can write

f'(x_i) \approx \sum_{j=i_1}^{i_k} a_{ij} f(x_j) = (Af)_i    (7.8)

This means that we can find a matrix A that transforms the vector f =
[f(x_0), ..., f(x_n)] into f' = [f'(x_0), ..., f'(x_n)]. This matrix is called the differentiation
matrix. Therefore, once we construct the matrix, we can find the derivative at all
the points in just “one operation”.
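For example, a differentiation matrix based on centered differences on a uniform grid can be assembled in MATLAB roughly like this (a sketch, with first order one-sided formulas at the two endpoints):

n = 10;  h = 1/n;  x = (0:h:1)';
e = ones(n+1,1);
A = spdiags([-e, zeros(n+1,1), e], -1:1, n+1, n+1)/(2*h);  % centered stencil
A(1,1:2)         = [-1, 1]/h;    % forward difference at x_0
A(end,end-1:end) = [-1, 1]/h;    % backward difference at x_n
fp = A*sin(x);                   % approximates cos(x) at all nodes at once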

7.3 Richardson Extrapolation


Up to now, all the error estimates we have are based on the theoretical derivation
of the methods (i.e. using Taylor series or interpolation). Nevertheless, we would like
to have a practical error estimate. Let us suppose we have 2 discretizations. In the
first one the points are equally spaced at a distance h. In the second one, they are
equally spaced at a distance h/2. Now let us use the same differentiation method in
both to find f'(x_0), for example centered differences. Let D be the real derivative,
D_h the approximation from the first grid and D_{h/2} the approximation from the
second one. We know from Taylor series that

D = D_h + \frac{1}{6} f'''(x_0) h^2 + O(h^3)    (7.9)
D = D_{h/2} + \frac{1}{6} f'''(x_0) (h/2)^2 + O(h^3)    (7.10)
Since we are approximating the derivative, we typically have no idea of the value
of f'''(x_0) either. If we only had one discretization, we could not estimate the error,
but now with 2 approximations we can subtract them and get:

0 = (D_h - D_{h/2}) + \frac{1}{6} f'''(x_0) \frac{3}{4} h^2 + O(h^3)

Therefore we have an approximation for the error of O(h^2):

\frac{1}{6} f'''(x_0) h^2 = -\frac{4}{3} (D_h - D_{h/2}) + O(h^3)

And this implies that we have a better approximation for the derivative (of O(h^3)):

D = D_h - \frac{4}{3}(D_h - D_{h/2}) + O(h^3)    (7.11)
  = \frac{4 D_{h/2} - D_h}{3} + O(h^3)
  = \frac{1}{3}\left( 4\,\frac{f(x_0 + h/2) - f(x_0 - h/2)}{h} - \frac{f(x_0 + h) - f(x_0 - h)}{2h} \right) + O(h^3)
  = \frac{8 f(x_0 + h/2) - 8 f(x_0 - h/2) - f(x_0 + h) + f(x_0 - h)}{12 (h/2)} + O(h^3)

This gives the same 5-point differentiation rule derived using Taylor series or
interpolation. The meaning here is a bit different. The idea is to have an adaptive
method, where we reduce h by half in order to have a better approximation, and we
estimate the error using the previous guess.
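A minimal MATLAB sketch of this estimate (f(x) = e^x is an arbitrary test function):

f = @(x) exp(x);  x0 = 0;  h = 0.1;
Dh  = (f(x0+h)   - f(x0-h))  /(2*h);       % centered difference, spacing h
Dh2 = (f(x0+h/2) - f(x0-h/2))/h;           % centered difference, spacing h/2
err_est  = -(4/3)*(Dh - Dh2);              % practical estimate of the O(h^2) error in Dh
D_extrap = (4*Dh2 - Dh)/3;                 % improved approximation, as in (7.11)

Here err_est can drive the adaptivity: if it is larger than a tolerance, we halve h again and repeat.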

Chapter 8

Initial Value Ordinary


Differential Equations

In this part of the course we will start to work with the simplest model that includes
a derivative. We will look for a function y(t) such that

\frac{dy}{dt} = f(t, y), \quad \text{for } a \le t \le b    (8.1)

We call t the independent variable and y the dependent variable. What makes
the problem difficult to solve is that f depends on the unknown variable.

Also notice that (8.1) has an infinite number of solutions. For example, if f(t, y) = 0,
then y(t) = constant and it can take any value. Therefore we need to specify
additional conditions in order to have a unique solution. In this case we will specify
the value of the function at t = a:

y(a) = \alpha    (8.2)

Together, (8.1) plus (8.2) are called an Initial Value problem. Also, to have a
unique solution, we need f(t, y) to be Lipschitz continuous (see Bradie Section 7.1,
LeVeque Section 5.2).

8.1 Euler’s method


Instead of finding a numerical approximation of y for all t ∈ [a, b], we will choose
a discrete set of points:
a = t0 < t1 < t2 < · · · < tN −1 < tN = b (8.3)
And for simplicity let us suppose they are equally spaced, i.e. ti = a + ih where
h = (b − a)/N .

For each of these t's we have:

y(t_0) = \alpha
\frac{dy}{dt}(t_i) = f(t_i, y(t_i))    (8.4)
How can we get rid of the derivative?
Let us call y_i the numerical approximation of y(t_i). We can use any of the approxi-
mations from numerical differentiation to approximate the derivative. For example,
if we use forward differences we transform (8.4) into

y_0 = \alpha
\frac{y_{i+1} - y_i}{h} = f(t_i, y_i) \implies y_{i+1} = y_i + h f(t_i, y_i)    (8.5)

This is the so-called Euler's method. Notice that the advantage of (8.5) is that
y_{i+1} is explicitly defined in terms of y_i. This type of method is called an explicit
method.
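A minimal MATLAB implementation of (8.5) could look like this (the function name euler_fwd is just a choice for this sketch):

function [t, y] = euler_fwd(f, a, b, alpha, N)
    h = (b - a)/N;
    t = a + (0:N)'*h;
    y = zeros(N+1,1);  y(1) = alpha;       % MATLAB indices start at 1
    for i = 1:N
        y(i+1) = y(i) + h*f(t(i), y(i));   % explicit update (8.5)
    end
end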

Notice that we could also use backward differences to approximate the derivative
and get:

y_0 = \alpha
\frac{y_{i+1} - y_i}{h} = f(t_{i+1}, y_{i+1}) \implies y_{i+1} = y_i + h f(t_{i+1}, y_{i+1})    (8.6)

This is the so-called Backward Euler's method. The shape of Euler and
backward Euler is the same, but the substantial difference is that y_{i+1} de-
pends implicitly on itself (notice the f(t_{i+1}, y_{i+1}) term). This is called an implicit
method. The difficulty of implicit methods is that in general we need to solve a
nonlinear system of equations. Nevertheless, later we will see that they have advan-
tages.
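For comparison, a sketch of backward Euler: each step now solves the (generally nonlinear) equation (8.6) for y_{i+1}, here with MATLAB's fzero and the previous value as the starting guess:

function [t, y] = euler_bwd(f, a, b, alpha, N)
    h = (b - a)/N;
    t = a + (0:N)'*h;
    y = zeros(N+1,1);  y(1) = alpha;
    for i = 1:N
        g = @(z) z - y(i) - h*f(t(i+1), z);  % residual of the implicit update
        y(i+1) = fzero(g, y(i));             % one nonlinear solve per step
    end
end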

8.2 Properties of the methods


How can we decide which method to use?

Let us remember the discussion about scientific computing we had at the beginning
of the quarter. When we talked about algorithms, there were 2 important proper-
ties an algorithm should have: robustness and efficiency. The first one was related
to the stability of the method (if I perturb the input, how much does the output
change?). The second one was related to the convergence of the method (how many
steps should I take in order to get a small error?).

These two concepts are again present in the solution of differential equations.
More precisely, we will look for a one step method:

y_{i+1} = y_i + h_i \phi(f, t_i, y_i, y_{i+1}, h_i)

• Consistent: Define the local truncation error as

\tau_i = \frac{y(t_{i+1}) - y(t_i)}{h_i} - \phi(f, t_i, y(t_i), y(t_{i+1}), h_i)    (8.7)

This means \tau_i is the error that we get when we evaluate the approximation
using the exact solution y. For example, for Euler's method we have

\tau_i = \frac{y(t_{i+1}) - y(t_i)}{h} - f(t_i, y(t_i))    (8.8)

Using Taylor series around t_i we have:

\tau_i = \frac{y(t_i) + y'(t_i)h + y''(t_i)h^2/2 + O(h^3) - y(t_i)}{h} - f(t_i, y(t_i))
       = y'(t_i) + y''(t_i)h/2 + O(h^2) - f(t_i, y(t_i)) \quad \text{using the ODE}
       = \frac{h}{2} y''(t_i) + O(h^2)

Therefore it is a first order method because \tau_i = O(h). A method is consistent
if \tau_i \to 0 as h_i \to 0.

• Convergent: Define the global discretization error as y(t_i) - y_i; it is the cu-
mulative error introduced in all the steps. We say that a method is convergent
if

\lim_{h \to 0} \max_{1 \le i \le N} |y(t_i) - y_i| = 0    (8.9)

• Stable: How can we relate the local truncation error and the global discretization
error?
Let us suppose that we want to solve the ODE

\frac{dy}{dt} = \lambda y + g(t), \quad 0 \le t \le T
y(0) = y_0

Integrating both sides (or using Duhamel's principle), the solution is given
by

y(t) = y_0 e^{\lambda t} + \int_0^t e^{\lambda(t-s)} g(s)\,ds
On the other hand, using forward Euler we have that

y_{i+1} = y_i + \lambda h y_i + h g(t_i) = (1 + \lambda h) y_i + h g(t_i)

If we evaluate forward Euler with the exact solution, and using the local
truncation error, we get:

y_{i+1} = (1 + \lambda h) y_i + h g(t_i)
y(t_{i+1}) = (1 + \lambda h) y(t_i) + h g(t_i) + h \tau_i
e_{i+1} = (1 + \lambda h) e_i - h \tau_i    (8.10)

Using the recurrence relation, we have

e_n = -h \sum_{i=1}^{n} (1 + \lambda h)^{n-i} \tau_{i-1}

Since |1 + \lambda h| < e^{|\lambda| h} (this is called zero-stability) we can bound |(1 + \lambda h)^{n-i}| \le e^{T|\lambda|}, and since we have n terms, nh = T. Therefore if \tau = [\tau_0, \tau_1, ..., \tau_{n-1}],
then

|e_n| \le T e^{T|\lambda|} \|\tau\|_\infty

Since Euler is consistent, \|\tau\|_\infty \to 0 and therefore |e_n| \to 0, i.e. the
method is convergent.
But notice that this is not enough. If we go back to (8.10), at each time step
the previous error is amplified by a factor of (1 + \lambda h). If \lambda < 0, the solution is
characterized by an exponential decay. Therefore, if the method is absolutely
stable, we expect that the errors will also decay. This requires that:

|1 + \lambda h| < 1 \iff -1 < 1 - |\lambda| h < 1 \iff h < \frac{2}{|\lambda|}    (8.11)

Therefore we need to choose h sufficiently small and \lambda < 0 to have a nice
behavior in Euler.

For the case of backward differences, we have that

y_{i+1} = y_i + \lambda h y_{i+1} + h g(t_{i+1}) \implies y_{i+1} = \frac{1}{1 - \lambda h} (y_i + h g(t_{i+1}))

If we evaluate backward Euler with the exact solution, and using the local
truncation error, we get:

y_{i+1} = \frac{1}{1 - \lambda h} (y_i + h g(t_{i+1}))
y(t_{i+1}) = \frac{1}{1 - \lambda h} (y(t_i) + h g(t_{i+1})) + h \tau_i
e_{i+1} = \frac{1}{1 - \lambda h} e_i - h \tau_i
Then

e_n = -h \sum_{i=1}^{n} \left( \frac{1}{1 - \lambda h} \right)^{n-i} \tau_{i-1}

And as before we can show convergence. To have absolute stability we need

\left| \frac{1}{1 - \lambda h} \right| < 1 \iff 1 < |1 - \lambda h|

If \lambda < 0, or if \lambda > 0 and h > 2/\lambda, we have absolute stability. In particular, if \lambda < 0 we do
not have any restriction on h.
As we can see, the choice of time step h determines 2 different properties:

• Truncation error: Since \tau_i = O(h^p) for a certain p, we need to choose a
really small h in order to keep the truncation error small.

• Stability: In order to have absolute stability, we have a restriction on h given
by \lambda. In the linear case, it is easy to see what \lambda is. In the non-linear case, we can
always think of linearizing f(t, y) using Taylor series:

f(t, y) = \underbrace{f(t_0, y_0) + \frac{\partial f}{\partial t}\Big|_{(t_0,y_0)} (t - t_0)}_{\approx g(t)} + \underbrace{\frac{\partial f}{\partial y}\Big|_{(t_0,y_0)} (y - y_0)}_{\approx \lambda y} + O(|t - t_0|^2 + |y - y_0|^2)

Therefore \lambda is usually related to \frac{\partial f}{\partial y} around the points of the solution. For
example, we can take

\lambda = \max_{(t,y) \in \Omega} \left| \frac{\partial f}{\partial y}(t, y) \right|

for a predefined set \Omega where we know the solution lives.

8.3 Higher order methods


Up to now, forward and backward Euler are only first order accurate. This means that
we need to take a really small h in order to have a small truncation error. If we want to
find higher order methods, let us first see where we can derive Euler's method from:

1. Finite Differences: We are looking for an approximation of \frac{dy}{dt}. Therefore
we can have:

• Forward differences: Euler's method

\frac{dy}{dt}(t_i) = \frac{y_{i+1} - y_i}{h} + O(h)

• Backward differences: Backward Euler's method

\frac{dy}{dt}(t_i) = \frac{y_i - y_{i-1}}{h} + O(h)

• Centered differences: Multi-step method

\frac{dy}{dt}(t_i) = \frac{y_{i+1} - y_{i-1}}{2h} + O(h^2)

Therefore, replacing in the ODE we get

\frac{y_{i+1} - y_{i-1}}{2h} = f(t_i, y_i)
y_{i+1} = y_{i-1} + 2h f(t_i, y_i)    (8.12)

And it is a multi-step method because y_{i+1} depends on 2 time-steps:
y_i and y_{i-1}. This type of method is more complicated because we
need to specify more initial conditions, and the analysis of global error
and stability is more sensitive. Nevertheless, multi-step methods have a higher
order of accuracy.
2. Taylor Series: Recall Theorem 1, and let us use it for y(t_{i+1}) around
y(t_i); then we have

y(t_{i+1}) = y(t_i) + y'(t_i) h + y''(t_i) \frac{h^2}{2!} + y'''(t_i) \frac{h^3}{3!} + ...    (8.13)

And using the ODE we have that

y'(t_i) = f(t_i, y_i)
y''(t_i) = \frac{d}{dt} f(t_i, y_i) = \frac{\partial}{\partial t} f(t_i, y_i) + \frac{\partial}{\partial y} f(t_i, y_i) \frac{dy}{dt} = \frac{\partial}{\partial t} f(t_i, y_i) + \frac{\partial}{\partial y} f(t_i, y_i)\, f(t_i, y_i)
y'''(t_i) = \dots

Therefore, for a second order scheme, we can ignore all the terms of third order or
higher and just replace the equivalences given by the ODE:

y(t_{i+1}) = y(t_i) + f(t_i, y_i) h + \left( \frac{\partial}{\partial t} f(t_i, y_i) + \frac{\partial}{\partial y} f(t_i, y_i)\, f(t_i, y_i) \right) \frac{h^2}{2!}    (8.14)

This is called a Taylor method, and it is an explicit, high order accurate one-
step method. The only drawback of this family of methods is that we need to
compute the derivatives of f, and as the order increases, handling them starts
to be complicated. It is a nice theoretical method, but not that useful in practice.

3. Numerical Integration: If we go back to (8.1), integrating both sides we
have that

y(t_{i+1}) = y(t_i) + \int_{t_i}^{t_{i+1}} f(t, y(t))\,dt    (8.15)

Therefore we can use different quadrature rules to approximate the integral:

• If we use the value on the left \implies Euler's method:

\int_{t_i}^{t_{i+1}} f(t, y(t))\,dt = h \left( f(t_i, y_i) + O(h) \right)

• If we use the value on the right \implies Backward Euler's method:

\int_{t_i}^{t_{i+1}} f(t, y(t))\,dt = h \left( f(t_{i+1}, y_{i+1}) + O(h) \right)

• If we use the value in the middle \implies Midpoint method:

\int_{t_i}^{t_{i+1}} f(t, y(t))\,dt = h \left( f(t_{i+1/2}, y_{i+1/2}) + O(h^2) \right)

• If we use both endpoints \implies Trapezoidal method:

\int_{t_i}^{t_{i+1}} f(t, y(t))\,dt = h \left( \frac{f(t_i, y_i) + f(t_{i+1}, y_{i+1})}{2} + O(h^2) \right)

• If we use both endpoints and the middle \implies Simpson's method:

\int_{t_i}^{t_{i+1}} f(t, y(t))\,dt = h \left( \frac{f(t_i, y_i) + 4 f(t_{i+1/2}, y_{i+1/2}) + f(t_{i+1}, y_{i+1})}{6} + O(h^4) \right)

Of all of these approximations, only Euler is explicit, and moreover some of the
methods involve points that are not in the original discretization (for example
(t_{i+1/2}, y_{i+1/2})).
If we want to transform all the methods into explicit one-step
methods, what should we do?
In this case, we can approximate the intermediate points using Euler's method.
This family of methods is called Runge-Kutta methods. For example:

• For the Midpoint method, we can approximate

y_{i+1/2} \approx y_i + \frac{h}{2} f(t_i, y_i)

Therefore we can define the Modified Euler method as

y_{i+1} = y_i + h f(t_{i+1/2}, y_{i+1/2}) = y_i + h f\left( t_i + \frac{h}{2},\; y_i + \frac{h}{2} f(t_i, y_i) \right)    (8.16)
• For the Trapezoidal method, we can approximate

y_{i+1} \approx y_i + h f(t_i, y_i)

Therefore we can define Heun's method as

y_{i+1} = y_i + \frac{h}{2} \left( f(t_i, y_i) + f(t_{i+1}, y_{i+1}) \right)
        = y_i + \frac{h}{2} \left( f(t_i, y_i) + f(t_i + h,\; y_i + h f(t_i, y_i)) \right)    (8.17)
• For Simpson's method, we can approximate

y^{(1)}_{i+1/2} = y_i + \frac{h}{2} f(t_i, y_i) \quad \text{using forward Euler}
y^{(2)}_{i+1/2} = y_i + \frac{h}{2} f(t_{i+1/2}, y^{(1)}_{i+1/2}) \quad \text{using backward differences}
y^{(1)}_{i+1} = y_i + h f(t_{i+1/2}, y^{(2)}_{i+1/2}) \quad \text{using Modified Euler}

Therefore, when we use Simpson's formula we get the famous 4th order
Runge-Kutta method:

y_{i+1} = y_i + \frac{h}{6} \left( f(t_i, y_i) + 2 f(t_i + h/2, y^{(1)}_{i+1/2}) + 2 f(t_i + h/2, y^{(2)}_{i+1/2}) + f(t_i + h, y^{(1)}_{i+1}) \right)    (8.18)

A general Runge-Kutta method of k stages is of the form

y_{i+1} = y_i + \sum_{j=1}^{k} w_j k_j    (8.19)

\text{where} \quad k_j = h f\left( t_i + c_j h,\; y_i + \sum_{l=1}^{j-1} a_{j,l} k_l \right)    (8.20)

with c_1 = 0.
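For instance, the classical 4-stage method (8.18) reads in MATLAB as follows (a sketch, with the stages k_j written out explicitly and weights w = [1 2 2 1]/6):

function [t, y] = rk4(f, a, b, alpha, N)
    h = (b - a)/N;
    t = a + (0:N)'*h;
    y = zeros(N+1,1);  y(1) = alpha;
    for i = 1:N
        k1 = h*f(t(i),       y(i));
        k2 = h*f(t(i) + h/2, y(i) + k1/2);
        k3 = h*f(t(i) + h/2, y(i) + k2/2);
        k4 = h*f(t(i) + h,   y(i) + k3);
        y(i+1) = y(i) + (k1 + 2*k2 + 2*k3 + k4)/6;
    end
end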

How are the ODE methods implemented in MATLAB?

Depending on the requirements, MATLAB offers multiple ODE solvers: http://www.mathworks.com/help/matlab/math/choose-an-ode-solver.html. Most of
them are generalizations of Runge-Kutta methods (implicit or explicit) and multi-
step methods.

The principal characteristic among them is that the points in time are not equally
spaced (i.e. h_i is not constant). Moreover, at each iteration they adjust the position
of the new t_{i+1} = t_i + h depending on the local error tolerance they want to have. To
have an estimate of the local error, they solve for the value of y_{i+1} using 2 methods
of order q and q + 1, and then they compute the difference between the results. If
\hat{y}_{i+1} is the estimate of order q and y_{i+1} the estimate of order q + 1, then we
accept the value of y_{i+1} if

|\hat{y}_{i+1} - y_{i+1}| \le h\, tol

If not, we compute a new \bar{h}:

\bar{h} = h\, \mu \left( \frac{h\, tol}{|\hat{y}_{i+1} - y_{i+1}|} \right)^{1/q}

where \mu is a factor between 0 and 1 (we can take \mu = 0.9).

For example, ode45 is based on 2 Runge-Kutta methods, of order 4 and 5. For
each of the ODE solvers that MATLAB offers, you can check the documentation,
and at the end of it you can find a description of the algorithm that is used behind
the scenes. Most of them have fancy names, but all of them are based on the basic
ideas described here.

8.4 Systems of equations, Stiffness


Most problems do not deal with an isolated ODE. In general, we have
multiple variables that depend on time and relations between them. For
example, think of a biological system analyzing a predator-prey model:

\text{Prey population:} \quad \frac{dN_1}{dt} = \underbrace{\alpha N_1}_{\text{growth}} - \underbrace{\beta N_1 N_2}_{\text{loss from predation}}

\text{Predator population:} \quad \frac{dN_2}{dt} = \underbrace{\delta N_1 N_2}_{\text{growth from predation}} - \underbrace{\gamma N_2}_{\text{death rate}}

Other types of models come from chemical reactions. If I have a concentration u_1 of
chemical compound A and a concentration u_2 of chemical compound B, and A is
transformed into B at rate K_1 while B is transformed into A at rate K_2:

A \underset{K_2}{\overset{K_1}{\rightleftharpoons}} B

then we get a set of ODEs for u_1 and u_2:

\frac{du_1}{dt} = -K_1 u_1 + K_2 u_2
\frac{du_2}{dt} = K_1 u_1 - K_2 u_2
And in general, even if we are working with only one differential equation, most of
the time it involves higher order derivatives. For example, the pendulum equation
we had at the beginning of the course:

\frac{d^2\theta}{dt^2} + b \frac{d\theta}{dt} + \omega^2 \sin\theta = 0

We can call \phi = \frac{d\theta}{dt} and we have the following system:

\frac{d\theta}{dt} = \phi
\frac{d\phi}{dt} = -b\phi - \omega^2 \sin\theta
In general, when we talk about an Initial Value problem for an ODE system, we want
to solve for a vector of unknowns u(t) = [u_1(t), u_2(t), ..., u_n(t)]' such that

\frac{du}{dt} = f(t, u_1(t), u_2(t), ..., u_n(t)), \qquad u(0) = [\alpha_1, \alpha_2, ..., \alpha_n] \text{ known}    (8.21)

And as we can expect, all the numerical methods that we developed previously (Eu-
ler, backward Euler, Runge-Kutta) extend naturally to solve systems of ODEs. The
only difference is that we need to reconsider the stability properties of the methods,
and how that affects their performance.

The simplest non-trivial case is when f(t, u_1(t), u_2(t), ..., u_n(t)) is linear in u, i.e.

\frac{du}{dt} = Au, \qquad u(0) = [\alpha_1, \alpha_2, ..., \alpha_n] \text{ known}    (8.22)

If we assume that A is diagonal, i.e. A = diag(\lambda_1, \lambda_2, ..., \lambda_n), then we can uncouple
the equations:

\frac{du_j}{dt} = \lambda_j u_j, \quad u_j(0) = \alpha_j \implies u_j(t) = \alpha_j e^{\lambda_j t}, \qquad j = 1, ..., n    (8.23)

Let us assume all \lambda_j < 0. Therefore each \lambda_j is associated with the rate of decay of
u_j. To use Euler to solve each equation we need a spacing h_j < 2/|\lambda_j|, where

\frac{u_j^{(i+1)} - u_j^{(i)}}{h_j} = \lambda_j u_j^{(i)}
If we solve the system as a whole, to have stability we need h \le \min_j h_j <
2/\max_j |\lambda_j|. But on the other hand, if I am interested in the large-scale behav-
ior, the variables that will dominate are those with the smallest rate of decay, i.e.
with the smallest values of |\lambda_i|.

This means that in order to have stability, we need to choose h according to the
“modes” |\lambda_i| that we are least interested in, making the computation really slow.
If the gap between the largest \lambda_i and the smallest one is considerably big, then we
say that the problem is stiff.

Usually stiff problems are characterized by a separation of time scales, and when
we are using explicit methods, we require a really small time step to preserve the
stability of the method, avoiding effects of the small scale. This is a very common
situation in reaction-diffusion and transport equations.

In these cases, it is strongly recommended to use implicit methods based on back-
ward Euler and the trapezoidal method, because they are absolutely stable regardless of
the size of h and \lambda. This means that if we solve the system using implicit methods,
we can choose h based only on the accuracy we are looking for (based on
the slowest modes) without worrying about stability issues caused by the fastest modes.

But in general we are not solving a diagonal system, so how can we
know a problem is stiff?

Determining whether a problem is stiff is not an easy task. It is similar to the case of
well or ill-conditioned matrices: there is no magic threshold that defines
stiffness. Nevertheless, it is important to know that if we have a linear ODE system
(8.22), such that A is diagonalizable, then we can apply a change of variables to
each eigenspace and end up with (8.23), where the \lambda_i are the eigenvalues of A.

In the more general case, if f(t, u_1, u_2, ..., u_n) is not linear, we can always think
of its linear approximation using the Taylor series. In such a case A would be the
Jacobian matrix of f at (t, u_1, u_2, ..., u_n).
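A quick way to feel the effect of stiffness is to compare an explicit and an implicit MATLAB solver on a small linear system with well separated eigenvalues (a sketch; the eigenvalues -1 and -1000 are arbitrary):

A  = [-1, 0; 0, -1000];                      % slow mode and fast mode
u0 = [1; 1];
[t1, ~] = ode45 (@(t,u) A*u, [0 10], u0);    % explicit: step limited by lambda = -1000
[t2, ~] = ode15s(@(t,u) A*u, [0 10], u0);    % implicit: designed for stiff problems
fprintf('ode45: %d steps, ode15s: %d steps\n', numel(t1), numel(t2));

After a short time the fast mode has completely decayed, yet the explicit solver keeps taking tiny steps to stay stable, while the implicit one can take large steps.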

Chapter 9

Boundary Value Ordinary


Differential Equations

In the previous chapter, we started with a first order ODE of the form

\frac{dy}{dt} = f(t, y) \quad \text{for } t_0 \le t \le t_f, \qquad y(t_0) = y_0

Since it is first order, we only require one extra condition to have a unique solution,
y(t_0) = y_0. But what about higher order ODEs? For example, in the homework we
have

\theta'' + b\theta' + \omega^2 \sin(\theta) = 0 \quad \text{for } 0 \le t \le T

and we can solve it if \theta and \theta' are specified at the initial point t = 0, i.e. if it is
an initial value problem. But what if we have other types of conditions? For
example, if we specify the values \theta(0) = \theta_0 and \theta(T) = \theta_T?

In the case where the extra conditions are specified at more than one point,
generally at both endpoints (t = t_0 and t = t_f), we say we have a boundary value
problem. And now we can have different types of conditions:

• Dirichlet condition: The value of the function is specified, e.g. \theta(T) = 0.

• Neumann condition: The value of the (normal) derivative of the function
is specified, e.g. \theta'(0) = 1.

• Robin condition: A linear combination of function and derivative is specified,
e.g. \alpha\theta(0) + \beta\theta'(0) = \gamma.

• Periodic condition: We identify the initial point and the end point as the same
point, i.e. \theta(0) = \theta(T) and \theta'(0) = \theta'(T).

In this chapter we will talk about two different methods to solve a boundary value
problem: the shooting method and finite differences. The first one uses the tools we
developed in the previous chapter to recast the problem as an Initial Value problem,
whereas the second approach, which extends to Partial Differential Equations,
solves for all the discretization points at once.

9.1 Shooting Method


Suppose we have the following Boundary Value problem:

\theta'' + b\theta' + \omega^2 \sin(\theta) = 0 \quad \text{for } 0 \le t \le T, \qquad \theta(0) = \theta_0, \quad \theta(T) = \theta_T    (9.1)

If we would like to use the methods for solving Initial Value Problems,
how would we solve this problem?
To use an initial value approach we would like to solve the system for 0 \le t \le T:

\theta' = \phi
\phi' = -b\phi - \omega^2 \sin(\theta)    (9.2)
\theta(0) = \theta_0
\phi(0) = \phi_0

The only problem is that we do not know \phi_0 a priori. Therefore, in the shooting
method we “try” different values of \phi_0 until we get \theta(T) = \theta_T.
Does this setting look familiar?
Indeed! We are solving a root finding problem. This means that we want to find \phi_0
such that \theta, as a function of its initial values, gives us \theta(T; \phi_0) = \theta_T. Therefore we
can use any of the methods in Chapter 2 to find a solution. At each iteration we
solve an initial value problem for a guess of \phi_0. If |\theta(T; \phi_0) - \theta_T| > tol, then we
update \phi_0 using a root finding algorithm.

Since each iteration is really expensive, we would like to start with an appropriate
first guess and use a fast convergence algorithm, such as Newton's method. The problem
with Newton's method for root finding is that if we call g(\phi_0) = \theta(T; \phi_0), it requires the
derivative of g with respect to \phi_0, which is not easy to compute. Instead we prefer
to use the secant method, and we update \phi_0 as

\phi_0^{(i+1)} = \phi_0^{(i)} - g\left(\phi_0^{(i)}\right) \frac{\phi_0^{(i)} - \phi_0^{(i-1)}}{g\left(\phi_0^{(i)}\right) - g\left(\phi_0^{(i-1)}\right)}    (9.3)

How do we choose an initial guess for \phi_0?

If the ODE is linear (i.e. \theta'' = p(t)\theta' + q(t)\theta + r(t)), then we can start by assuming that
\phi_0 = (\theta(T) - \theta(0))/T. If the ODE is nonlinear, it is more difficult to predict how
the method behaves, and moreover there could exist more than one solution. Therefore
just try different initial guesses to see how it performs.
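Putting the pieces together, here is a minimal MATLAB sketch of the shooting method for (9.1), using ode45 for the inner initial value problems and the secant update (9.3) (all parameter values and the helper name shoot_end are arbitrary choices):

b = 0.1;  om = 1;  T = 2;  th0 = 0;  thT = 1;  tol = 1e-8;
rhs = @(t,u) [u(2); -b*u(2) - om^2*sin(u(1))];     % u = [theta; phi]
g   = @(phi0) shoot_end(rhs, T, th0, phi0) - thT;  % residual g(phi0)
p0 = (thT - th0)/T;  p1 = 1.1*p0;                  % two starting guesses
while abs(g(p1)) > tol
    p2 = p1 - g(p1)*(p1 - p0)/(g(p1) - g(p0));     % secant step (9.3)
    p0 = p1;  p1 = p2;
end

function thetaT = shoot_end(rhs, T, th0, phi0)
    [~, u] = ode45(rhs, [0 T], [th0; phi0]);       % solve the IVP for this guess
    thetaT = u(end, 1);                            % theta(T; phi0)
end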

9.1.1 The linear case


Consider the following ODE:

y'' = p(t) y' + q(t) y + r(t), \quad t_0 \le t \le t_f
y(t_0) = y_0    (9.4)
y(t_f) = y_f

With linear equations, the superposition principle holds. It means that we can
decompose the extra conditions (either boundary or initial conditions) into a homo-
geneous and a nonhomogeneous part, solve them separately, and add the two solutions
as the final result. Following this approach we can consider the following 2 ODE
systems:

u'' = p(t) u' + q(t) u + r(t), \quad t_0 \le t \le t_f, \qquad u(t_0) = y_0, \quad u'(t_0) = 0    (9.5)

w'' = p(t) w' + q(t) w, \quad t_0 \le t \le t_f, \qquad w(t_0) = 0, \quad w'(t_0) = 1    (9.6)

And assume that y = u + cw. Therefore, in order to satisfy (9.4), we need y(t_f) =
u(t_f) + c\,w(t_f). Solving (9.5) and (9.6) numerically, we can approximate c as:

c = \frac{y_f - u(t_f)}{w(t_f)}    (9.7)

Since the superposition principle does not hold in nonlinear equations, we cannot
use this approach to solve them.

9.1.2 Other types of Boundary Conditions


Up to now, we have talked only about Dirichlet boundary conditions. What about
Robin and Neumann? Since Robin is the most general one, we will focus only on it.
For the linear case, instead of solving (9.5) and (9.6), we can consider the
following set of Initial Value Problems:

u'' = p(t) u' + q(t) u, \quad t_0 \le t \le t_f, \qquad u(t_0) = 1, \quad u'(t_0) = 0    (9.8)

w'' = p(t) w' + q(t) w, \quad t_0 \le t \le t_f, \qquad w(t_0) = 0, \quad w'(t_0) = 1    (9.9)

z'' = p(t) z' + q(t) z + r(t), \quad t_0 \le t \le t_f, \qquad z(t_0) = 0, \quad z'(t_0) = 0    (9.10)

And we assume that y(t) = \alpha_1 u(t) + \alpha_2 w(t) + \alpha_3 z(t), and we solve for the boundary
conditions.

In general, for the nonlinear case, we continue solving a root finding problem.
If the conditions are Neumann (y'(t_0) is given), then we start with a guess of the
function value y(t_0) = \alpha, and we iterate until the boundary conditions are satisfied.
If the conditions are Robin, we can decide to guess either y'(t_0) or y(t_0), and the
other term is given by the boundary conditions. If the problem is nonlinear, we
cannot ensure the stability of the algorithm beforehand. Therefore, if it is possible,
it is always good to try shooting both from left to right and from right to left, and
solving for y(t_0) or for y'(t_0).

9.2 Finite Differences


Let us recall the Boundary Value problem we had in the shooting method:

\theta'' + b\theta' + \omega^2 \sin(\theta) = 0 \quad \text{for } 0 \le t \le T, \qquad \theta(0) = \alpha, \quad \theta(T) = \beta    (9.11)

Before, we solved this problem based on an initial value problem, using a guess for
the derivative of the function \theta' at zero, and then applying root finding techniques to
converge to the exact solution.

But now, if we just analyze the problem, we can discretize both \theta'' and \theta' using
finite differences and replace them in (9.11). Notice that we would like to have the
same truncation error in both approximations; otherwise only one of them would be
responsible for the error. Therefore let us use centered differences for the first and
second derivatives, and for every \theta_i \approx \theta(t_i), where t_i = t_0 + ih = iT/N, we have

\theta_0 = \alpha
\frac{\theta_{i+1} - 2\theta_i + \theta_{i-1}}{h^2} + b \frac{\theta_{i+1} - \theta_{i-1}}{2h} + \omega^2 \sin(\theta_i) = 0, \quad 1 \le i \le N-1    (9.12)
\theta_N = \beta

Therefore we can rearrange this into a system of equations:

\begin{pmatrix}
1 & & & & \\
\frac{1}{h^2} - \frac{b}{2h} & -\frac{2}{h^2} & \frac{1}{h^2} + \frac{b}{2h} & & \\
& \ddots & \ddots & \ddots & \\
& & \frac{1}{h^2} - \frac{b}{2h} & -\frac{2}{h^2} & \frac{1}{h^2} + \frac{b}{2h} \\
& & & & 1
\end{pmatrix}
\begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{N-1} \\ \theta_N \end{pmatrix}
=
\begin{pmatrix} \alpha \\ -\omega^2 \sin(\theta_1) \\ \vdots \\ -\omega^2 \sin(\theta_{N-1}) \\ \beta \end{pmatrix}    (9.13)

Since this is a nonlinear system of equations (N + 1 equations, N + 1 unknowns),
we solve it iteratively, for example using Newton's method. Since it is a nonlinear system,
different initial guesses could result in different solutions.

It is important to notice that we get a nonlinear system because the original
ODE is nonlinear. Similarly, we can expect a linear system from a linear second
order ODE. So in general, if we have

y'' = f(t, y, y'), \quad t_0 \le t \le t_f, \qquad y(t_0) = \alpha, \quad y(t_f) = \beta    (9.14)

to find a numerical approximation we solve the system of equations:

\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = f\left( t_i, y_i, \frac{y_{i+1} - y_{i-1}}{2h} \right), \quad t_i = t_0 + ih, \quad 1 \le i \le N-1    (9.15)
y_0 = \alpha, \quad y_N = \beta
What is the difference between the shooting method and finite differences?
In the shooting method, the approach is to solve for y_{i+1} using only the information at
y_i (or, in a multistep method, some other points before that one). Therefore we get the values
of y_i sequentially, and at each step the total number of points in the grid does not
influence the cost per step.

On the other hand, finite differences gives us a general stencil for all the points
in the grid, and to solve it we should solve a system of equations involving all the
points in the grid. Since the matrix of coefficients is usually banded (for example
tridiagonal), there are efficient ways to solve for all the points at once. Nevertheless,
the cost of solving the system increases with the total number of points in the grid.
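For the linear model problem y'' = r(t) with Dirichlet conditions, the whole finite difference solve fits in a few MATLAB lines (a sketch; the right hand side r is an arbitrary example whose exact solution is sin(pi t) + t):

t0 = 0;  tf = 1;  ya = 0;  yb = 1;  N = 100;
h = (tf - t0)/N;  t = t0 + (0:N)'*h;
r = @(t) -pi^2*sin(pi*t);                        % example right hand side
e = ones(N-1,1);
A = spdiags([e, -2*e, e], -1:1, N-1, N-1)/h^2;   % interior stencil of (9.15)
F = r(t(2:N));
F(1)   = F(1)   - ya/h^2;                        % move known boundary values
F(end) = F(end) - yb/h^2;                        %   to the right hand side
y = [ya; A\F; yb];                               % all interior points at once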

9.2.1 Other types of Boundary Conditions
Let us consider now the following problem:

\theta'' + b\theta' + \omega^2 \sin(\theta) = 0 \quad \text{for } 0 \le t \le T
\alpha_0 \theta(0) + \beta_0 \theta'(0) = \gamma_0    (9.16)
\alpha_T \theta(T) + \beta_T \theta'(T) = \gamma_T

So how do we construct a system of equations similar to (9.13), but involving
the Robin conditions?

One way is to approximate the Robin conditions using finite differences. We could
use forward differences for \theta'(0) and backward differences for \theta'(T). But we would break the
second order approximation, and the error would be concentrated at the endpoints.

Another option is to have a second order approximation for the derivative. Since
t_0 and t_N are at the boundary, the only way to approximate the derivatives using
the existing points is to take t_0, t_1 and t_2 on one side, and t_{N-2}, t_{N-1} and t_N on the other.
The problem with this approach is that the tridiagonal structure of (9.13) is lost.

If t_0 and t_N were not boundary points, we could approximate the derivative using
centered differences, and the tridiagonal pattern would not be lost. Therefore the
idea of this approach is to include fictitious points t_{-1} and t_{N+1}, and approximate
the Robin conditions with centered differences.

But adding 2 additional unknowns that don't exist does not look like
the best answer.

Well, if we combine the Robin conditions with the numerical equation at \theta_0 we get:

\alpha_0 \theta_0 + \beta_0 \frac{\theta_1 - \theta_{-1}}{2h} = \gamma_0    (9.17)

\frac{\theta_1 - 2\theta_0 + \theta_{-1}}{h^2} + b \frac{\theta_1 - \theta_{-1}}{2h} + \omega^2 \sin(\theta_0) = 0    (9.18)

From (9.17), if \beta_0 \neq 0, we have that:

\frac{\theta_1 - \theta_{-1}}{2h} = \frac{1}{\beta_0} (\gamma_0 - \alpha_0 \theta_0)
\theta_{-1} = \theta_1 - \frac{2h}{\beta_0} (\gamma_0 - \alpha_0 \theta_0)
Replacing in (9.18) we get:

\frac{\theta_1 - 2\theta_0}{h^2} + \frac{1}{h^2} \left( \theta_1 - \frac{2h}{\beta_0} (\gamma_0 - \alpha_0 \theta_0) \right) + b \frac{1}{\beta_0} (\gamma_0 - \alpha_0 \theta_0) + \omega^2 \sin(\theta_0) = 0

\frac{2\theta_1 - 2\theta_0}{h^2} + \left( b - \frac{2}{h} \right) \frac{\gamma_0 - \alpha_0 \theta_0}{\beta_0} + \omega^2 \sin(\theta_0) = 0

This equation replaces the first row of the system, the analogous treatment at t_N replaces the last one, and the tridiagonal structure is preserved with second order accuracy everywhere.

Chapter 10

Solution of Partial Differential


Equations

In Ordinary Differential Equations (ODEs) we had an independent variable t and a
dependent variable y, which was only a function of t, and we could find the behavior
of y with respect to t by solving an equation that involves derivatives of y.

In contrast, in a Partial Differential Equation (PDE), the dependent variable u
is a function of multiple independent variables, say t, x and y, and we find
the behavior of u by solving an equation that involves the partial derivatives of u with
respect to t, x and y.

Let us say u = u(x, y), meaning u is a function of x and y. Then a first order
semilinear PDE is given by:

a(x, y) \frac{\partial u}{\partial x} + b(x, y) \frac{\partial u}{\partial y} = c(x, y)    (10.1)

This equation defines the directional derivative of u with respect to the vector field
V = (a(x, y), b(x, y)). This means that, given certain initial conditions, the solution
of the equation moves following characteristic curves defined by the vector field.
In other words, if we know the solution on the characteristic curves, we know the
solution in the whole domain.

One simple example of a first order PDE is the advection equation:

\frac{\partial u}{\partial t} + a \frac{\partial u}{\partial x} = 0, \qquad u(x, 0) = \eta(x)    (10.2)

And the exact solution is given by

u(x, t) = \eta(x - at)    (10.3)

When we want to solve this equation numerically, we really need to understand how
the solution should behave. For example, if in (10.2) a > 0, it means that the char-
acteristics are straight lines with positive slope in the t vs. x plane. Therefore, if we
use a forward discretization in x, we won't capture the behavior of the characteristic
curves and, as a result, the numerical method will be unstable.

For this reason solving PDEs numerically is not only a matter of approximating
the derivatives (for example using finite differences), but also being sure that the
numerical approximation follows the same physical principles that the original PDE
does.

A second order constant coefficient PDE in 2D is given by:

a_1 \frac{\partial^2 u}{\partial x^2} + a_2 \frac{\partial^2 u}{\partial x \partial y} + a_3 \frac{\partial^2 u}{\partial y^2} + a_4 \frac{\partial u}{\partial x} + a_5 \frac{\partial u}{\partial y} + a_6 u = f    (10.4)

and we can classify it into 3 types:

• Elliptic if a_2^2 - 4a_1 a_3 < 0. For example the Laplace equation:

\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0    (10.5)

or the Poisson equation

\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y)    (10.6)

both related to the steady-state solution of the heat equation, where f(x, y)
is a source term.

• Hyperbolic if a_2^2 - 4a_1 a_3 > 0. For example the wave equation

c^2 \frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial t^2} = 0    (10.7)

Since it can be factorized as 2 first order PDEs, its behavior is similar to
the advection equation. This means that the solution at a point is given by
the characteristic curve that passes through it, and it is not affected by the
solution on the whole domain.

• Parabolic if a_2^2 - 4a_1 a_3 = 0. For example the heat equation

\frac{\partial u}{\partial t} = k \frac{\partial^2 u}{\partial x^2}    (10.8)
And the principal difference with the wave equation is that the speed of propagation
is infinite; this means that any change in the function at any point of the domain
affects the whole domain. Think of the behavior of a heated pan versus the
waves generated by throwing a stone into a lake. These two phenomena
are summarized as:

• Diffusion: It has an instantaneous smoothing effect on the initial conditions.
For example elliptic and parabolic equations.

• Advection or Transport: It carries the initial conditions along the domain
using characteristic curves. For example first order PDEs and hyperbolic
equations.

In a more general setting, we could have equations that combine parabolic and
hyperbolic behaviors, for example the advection-diffusion-reaction equation:

\frac{\partial u}{\partial t} = \underbrace{k \frac{\partial^2 u}{\partial x^2}}_{\text{diffusion}} - \underbrace{a \frac{\partial u}{\partial x}}_{\text{advection}} + \underbrace{R(x)}_{\text{reaction}}    (10.9)

For this type of equations, we should develop specific numerical methods that can
capture the effects of both systems, dealing with the stability issues of a stiff problem.

10.1 Elliptic Equations


We call \Delta the Laplacian operator, and it is defined as

\Delta u(x) = \sum_{i=1}^{D} \frac{\partial^2 u}{\partial x_i^2}    (10.10)

where D is the dimension of the domain (it can be 1, 2 or 3) and x = (x_i)_{i=1}^{D}. The
general heat equation with a source term \psi is given by

\frac{\partial u}{\partial t} = k \Delta u + \psi    (10.11)

The steady-state solution of the heat equation means that u is not changing with
respect to time, therefore \frac{\partial u}{\partial t} = 0, and at the same time the source is independent
of time, \psi(x, t) = f(x). With these conditions we derive the Poisson equation as

∆u = f (x) (10.12)

Notice that in 1D, (10.12) is a second order ODE u'' = f(x), and we can solve it
using the methods of the previous chapter. In 2D we get:

\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y)    (10.13)
In addition to (10.13), we should specify boundary conditions, and these can be
Dirichlet, Neumann, Robin or more general conditions. As you can imagine, dis-
cretizing (10.13) using finite differences is not really complicated. The challenge
now is the different types of domains we could have.

First let us consider a rectangular domain x_0 \le x \le x_f, y_0 \le y \le y_f, and let us
divide it into (N_x, N_y) equally spaced points, i.e. x_i = x_0 + i h_x and y_j = y_0 + j h_y,
such that h_x = (x_f - x_0)/N_x and h_y = (y_f - y_0)/N_y.

Then we want to approximate the function at the grid points as u(x_i, y_j) \approx u_{i,j}.
Therefore, using centered finite differences to approximate the second order derivatives
in (10.13), we get:

\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h_x^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{h_y^2} = f(x_i, y_j)    (10.14)

If h_x = h_y = h we get:

\frac{1}{h^2} (u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}) = f(x_i, y_j)    (10.15)
This is the 5-point stencil for the Laplacian operator.
And notice that if we solve for u_{i,j} we get that

u_{i,j} = \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}}{4} - \frac{h^2}{4} f(x_i, y_j)

Therefore, if we do not have any forcing (i.e. f = 0), then each point is the average of
its 4 neighbors. This shows that the numerical scheme also satisfies the Maximum
Principle: let \Omega be the set of all grid points and \partial\Omega the set of all boundary points;
then

\max_{\Omega} u_{ij} = \max_{\partial\Omega} u_{ij}

The maximum principle is one of the most important properties of the Laplace
equation, because it explains why solutions are smooth inside the domain.

How can we solve this system?

If we have N_x - 1 internal nodes in x and N_y - 1 internal nodes in y, depending on
the boundary conditions we have at least (N_x - 1)(N_y - 1) unknowns, and for each
internal node we have (10.14), which only involves 5 points per equation; therefore
we have a sparse linear system to solve. In 1D the way to organize the equations
was straightforward: we just followed the orientation of x. But now in 2D, we do
not have a natural order to follow. One of the simplest ways is to enumerate the
nodes row by row. That produces a nice pattern in the matrix but it is not the
optimal choice in order to solve the system. A lot of people have worked on finding
optimal configurations for this matrix.

Figure 10.1: Assembly of the matrix to solve the Poisson equation in 2D with equally
spaced points and Dirichlet Boundary conditions. Notice that we can remove the
boundary terms from the system, just passing their values to the right hand side of
the system of equations. Having other types of boundary conditions will only change
the rows corresponding to the boundary nodes.
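With the row-by-row ordering and zero Dirichlet data, the whole 2D matrix can be assembled in MATLAB with Kronecker products (a sketch; the manufactured source corresponds to the exact solution sin(pi x) sin(pi y)):

N = 50;  h = 1/N;
xi = h*(1:N-1);  [X, Y] = meshgrid(xi, xi);      % interior nodes only
e = ones(N-1,1);
T = spdiags([e, -2*e, e], -1:1, N-1, N-1);       % 1D second difference
I = speye(N-1);
L = (kron(I, T) + kron(T, I))/h^2;               % 5-point Laplacian, as in (10.15)
f = -2*pi^2*sin(pi*X).*sin(pi*Y);                % manufactured source term
u = L \ f(:);                                    % all unknowns in one sparse solve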

On the other hand, dealing with different types of boundary conditions is exactly
the same as in the Boundary Value problems with ODE’s.

What to do in more general domains?

Depending on the domain, sometimes it is necessary to have different spacings
between the points. This is possible to do in finite differences, as long as the
expansion accounts for that change in spacing. For those cases, we can think of the
Lagrange polynomial that is behind the finite difference approximation.

10.2 Parabolic Equations


The canonical parabolic equation is the heat equation. If we have a domain \Omega, the
heat equation is defined as:

\frac{\partial u}{\partial t} = D \Delta u \quad \text{for } x \in \Omega,\; t > 0    (10.16)

And now, to have a particular solution, we need to specify initial conditions in time
and boundary conditions in space.

Notice that the right hand side only involves partial derivatives with respect to
space; therefore for each t we can discretize it using methods for elliptic equations.
Once we do that, we will have a linear system that depends on u(x, t) and its neigh-
bors. We can call this approximation \tilde{\Delta}_h u.

And then (10.16) can be approximated as

\frac{\partial u}{\partial t} = D \tilde{\Delta}_h u(t), \quad t > 0    (10.17)

And for each x, we can solve this Initial Value ODE problem in time using any of the
methods derived in Chapter 8, for example Euler, backward Euler or any other
Runge-Kutta or multistep method. And of course, as we could expect from the
stability analysis of ODE methods, there will be a tight relationship between the
discretization in space and the discretization in time.

Let us work with the simplest case to see how this procedure works. If \Omega is just
one line segment, then we will be working in one dimension. The heat equation in
1D with Dirichlet boundary conditions is

\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}, \quad x_0 \le x \le x_f,\; t > 0
u(x_0, t) = \alpha(t)    (10.18)
u(x_f, t) = \beta(t)
u(x, 0) = v(x)

To discretize in space, let us take N + 1 equally spaced points x_0, ..., x_N = x_f, where
x_i = x_0 + ih and h = (x_f - x_0)/N. Let u_i(t) \approx u(x_i, t), and let us approximate
the second derivative using centered differences. Therefore for each 1 \le i \le N - 1
we have that

\frac{du_i(t)}{dt} = D \frac{u_{i+1}(t) - 2u_i(t) + u_{i-1}(t)}{h^2}, \quad 1 \le i \le N-1
u_0(t) = \alpha(t)    (10.19)
u_N(t) = \beta(t)
u_i(0) = v(x_i)

Notice that if we consider the vector U(t) = [u_1(t), u_2(t), ..., u_{N-1}(t)], then (10.19)
is an initial value ODE of the form

\frac{dU(t)}{dt} = -\frac{D}{h^2} (A U(t) + b(t)), \qquad U(0) = V

where

A = \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 & -1 \\
& & & -1 & 2
\end{pmatrix}, \quad
b(t) = \begin{pmatrix} -\alpha(t) \\ 0 \\ \vdots \\ 0 \\ -\beta(t) \end{pmatrix}, \quad
V = \begin{pmatrix} v(x_1) \\ v(x_2) \\ \vdots \\ v(x_{N-2}) \\ v(x_{N-1}) \end{pmatrix}

And now we can solve (10.19) using Euler, backward Euler or the trapezoidal method.
If we define t^{(n)} = nk, where k is the time-step, then we will approximate our
solution as u(x_i, t^{(n)}) \approx u_i^{(n)}. Applying the different time schemes we have:

• Euler \implies FTCS (Forward in Time, Centered in Space):

\frac{u_i^{(n+1)} - u_i^{(n)}}{k} = D \frac{u_{i+1}^{(n)} - 2u_i^{(n)} + u_{i-1}^{(n)}}{h^2}    (10.20)

• Backward Euler \implies BTCS (Backward in Time, Centered in Space):

\frac{u_i^{(n+1)} - u_i^{(n)}}{k} = D \frac{u_{i+1}^{(n+1)} - 2u_i^{(n+1)} + u_{i-1}^{(n+1)}}{h^2}    (10.21)

• Trapezoidal \implies Crank-Nicolson:

\frac{u_i^{(n+1)} - u_i^{(n)}}{k} = \frac{D}{2} \left( \frac{u_{i+1}^{(n)} - 2u_i^{(n)} + u_{i-1}^{(n)}}{h^2} + \frac{u_{i+1}^{(n+1)} - 2u_i^{(n+1)} + u_{i-1}^{(n+1)}}{h^2} \right)    (10.22)

For the three cases the initial conditions are

u_i^{(0)} = v(x_i)    (10.23)

And the boundary conditions are

u_0^{(n)} = \alpha(t^{(n)}), \quad u_N^{(n)} = \beta(t^{(n)})    (10.24)
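For instance, FTCS with zero Dirichlet data takes only a few MATLAB lines (a sketch; the initial condition sin(pi x) is an arbitrary example, and k is chosen below the stability bound derived next):

D = 1;  x0 = 0;  xf = 1;  N = 50;
h = (xf - x0)/N;  x = x0 + (0:N)'*h;
k = 0.4*h^2/D;                        % safely below the bound (10.27)
u = sin(pi*x);                        % initial condition v(x)
for n = 1:500
    u(2:N) = u(2:N) + D*k/h^2*(u(3:N+1) - 2*u(2:N) + u(1:N-1));  % (10.20)
    u(1) = 0;  u(N+1) = 0;            % boundary conditions (10.24)
end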
What about convergence?
Notice that the truncation error of FTCS and BTCS is O(h^2 + k), whereas for
Crank-Nicolson it is O(h^2 + k^2), because the trapezoidal scheme is second order
accurate in time.
What about stability?
Since (10.19) is a linear system of first order ODEs, we can analyze the stability of each
method based on the eigenvalues of A. The nice property of A is that it is a
Toeplitz symmetric tridiagonal matrix, and therefore its eigenvalues are well known.
If A is N \times N and the value on the main diagonal is a and on the outer diagonals is
b, then

\lambda_s = a + 2b \cos\left( \frac{s\pi}{N+1} \right), \quad s = 1, 2, 3, ..., N

Therefore in this case we have that

\lambda_s = 2 - 2\cos\left( \frac{s\pi}{N} \right), \quad s = 1, 2, 3, ..., N-1    (10.25)

We know that backward Euler and trapezoidal are unconditionally stable; this
means that we can take any k, and the error per iteration will remain bounded. But
for Euler's method we had the condition that

|1 + k\lambda| < 1

In this case, this is translated to

\left| 1 - 2D \frac{k}{h^2} \left( 1 - \cos\left( \frac{s\pi}{N} \right) \right) \right| < 1, \quad s = 1, 2, 3, ..., N-1    (10.26)

which implies that

k \le \frac{h^2}{2D}    (10.27)

Chapter 11

Introduction to Finite Elements

The Finite Elements method appeared around the 50's, when engineers were solving solid
elasticity problems. Since then, it has been one of the best known methods to
solve partial differential equations, because of its generality of applications and its
solid mathematical background. In the previous chapters we have seen that finite
differences are really simple to implement and they work in a broad set of cases.
Nevertheless:
• Our solutions are not always completely smooth (continuous). Consider having
a cord, fixed at both ends, to which we apply a point force in the middle. We
would like to model its displacement using the linear elasticity formula:

-\frac{d}{dx}\left( E(x) A(x) \frac{d}{dx} u(x) \right) = f(x), \quad x_0 < x < x_f; \quad u(x_0) = u(x_f) = 0    (11.1)

where E(x) is the elasticity modulus, A(x) is the cross-section area, and f(x)
is the source term. From experience we know that once we apply a point force,
the cord takes a triangular shape, with the lowest vertex at the point of
application of the force. The problem with equation (11.1) is that we cannot
represent a point force as a function f(x), because it is zero everywhere except
at one point, where it is infinite. And also, the solution u(x) is piecewise linear
and not continuously differentiable at the point source, so what does it
mean to take first and second derivatives there?

• The domain is not always suitable for finite differences. Solving elliptic equa-
tions in a rectangular domain was really simple: we just put our 5-point
stencil along the domain, and we got an equally spaced discretization. But
what would happen if, instead of a plate, we would like to model the helmet
of a professional cyclist? In such a case it is more difficult to fit a finite dif-
ferences scheme to the object, and even if we could do it with non-equally
spaced points, it would be a nightmare if we wanted to refine (add more
points to) the mesh.

The advantage of Finite Elements (FEM) is that it is general enough that in the
nicest cases (for example a rectangular plate), where we can use finite differences, we
can also use FEM and get the same results; and in the most complicated cases,
where we do not know how to adapt finite differences efficiently, applying FEM is
straightforward.

11.1 Mathematical Principle behind it: Distributions


Let us go back to the problem of representing the point force. How
can we represent a “function” that is zero everywhere and infinite at just
one point?

Instead of functions we can use generalized functions or distributions, where in ad-
dition to functions, we add other objects that behave like them. Let us consider the
bump function

\chi_{\epsilon, x_k}(x) = \begin{cases} \exp\left( -1/(\epsilon^2 - |x - x_k|^2) \right) & |x - x_k| < \epsilon \\ 0 & \text{otherwise} \end{cases}    (11.2)

Notice that when \epsilon \to 0, this function tends to behave like a point force, because
its support (the set of values where it is different from zero) gets smaller, and at x = x_k
the function value increases. We call the point force the delta distribution \delta_{x_k},
and it has the property that for any sufficiently nice function \phi (i.e. infinitely
differentiable with compact support),

\int_{-\infty}^{\infty} \delta_{x_k}(x) \phi(x)\,dx = \phi(x_k)    (11.3)

And although imagining the delta distribution as a function is a bit difficult, when we
multiply it by a nice function and integrate, it is really simple to see its value.

But what about derivatives and so on?

The most important technique is integration by parts. For example, let us con-
sider the following piecewise function:

\psi(x) = \begin{cases} -1 & x < 0 \\ 1 & x > 0 \end{cases}    (11.4)

And we would like to understand what \frac{d\psi}{dx} means. We take a nice function \phi
and we integrate:

\int_{-\infty}^{\infty} \frac{d\psi(x)}{dx} \phi(x)\,dx
From here we can apply integration by parts:

\int_a^b u'(x) v(x)\,dx = u(x)v(x)\Big|_a^b - \int_a^b u(x) v'(x)\,dx

Therefore:

\int_{-\infty}^{\infty} \frac{d\psi(x)}{dx} \phi(x)\,dx = \underbrace{\phi(x)\psi(x)\Big|_{-\infty}^{\infty}}_{=0, \;\phi \text{ is nice}} - \int_{-\infty}^{\infty} \frac{d\phi(x)}{dx} \psi(x)\,dx
= \int_{-\infty}^{0} \frac{d\phi(x)}{dx}\,dx - \int_{0}^{\infty} \frac{d\phi(x)}{dx}\,dx \quad \text{(applying the definition of } \psi\text{)}
= \phi(x)\Big|_{-\infty}^{0} - \phi(x)\Big|_{0}^{\infty}
= (\phi(0) - 0) - (0 - \phi(0))
= 2\phi(0) = 2\int_{-\infty}^{\infty} \delta_0(x) \phi(x)\,dx

Now we can say that \frac{d\psi}{dx} = 2\delta_0, and therefore we can generalize derivatives to
generalized functions that are not continuously differentiable functions.

To deal more formally with these concepts, some ideas from functional analysis
are required. An introductory course in functional analysis or an advanced course
in PDEs would do the job. A nice explanation about distributions is found in Vasy
(2015). Nevertheless, from an engineering perspective, it is not always necessary to un-
derstand the mathematical theory behind the method, but it is useful to understand when the
method works and how we can explain the errors.

11.2 The Variational Formulation: Weak form vs. Strong form
As we discussed at the beginning of the chapter, we cannot always make sense of
a differential equation, even though our intuition tells us how the solution should look.
Also, when we derived numerical differentiation, we used the Taylor series assuming
that the function was smooth enough; for example, for the second derivative we
got that the truncation error was

\tau = \frac{u^{(4)}(\xi)}{12} h^2

but what if u does not have a fourth derivative?

In the previous section we saw how we could make sense of “functions” and
derivatives that do not exist in the proper sense of a function. Therefore, we
can expect that, applying the technique of the previous section, we can give meaning
to problems that we could not solve before. So we have 2 different
formulations of a problem:

• Strong form, which only makes sense when the functions involved are suffi-
ciently smooth and their derivatives exist. For example:

Look for u \in C^2(x_0, x_f) such that:

-\frac{d}{dx}\left( E(x) A(x) \frac{d}{dx} u(x) \right) = f(x), \quad x_0 < x < x_f; \quad u(x_0) = u(x_f) = 0

• Weak form, which we obtain when we multiply by a very nice function v(x)
(i.e. infinitely differentiable with compact support), integrate,
and apply integration by parts:

\int_{x_0}^{x_f} -\frac{d}{dx}\left( E(x)A(x) \frac{d}{dx} u(x) \right) v(x)\,dx = \int_{x_0}^{x_f} f(x) v(x)\,dx

\underbrace{-\left[ E(x)A(x) \frac{du}{dx} v(x) \right]_{x_0}^{x_f}}_{=0 \text{ since } v(x_0) = v(x_f) = 0} + \int_{x_0}^{x_f} E(x)A(x) \frac{du}{dx} \frac{dv}{dx}\,dx = \int_{x_0}^{x_f} f(x) v(x)\,dx

Notice that once we have the integral form, v does not have to be a really
smooth function; it is enough that v(x) and \frac{dv}{dx} make sense once we multiply
them by a function and integrate. Therefore we can generalize the
integral form above as:

Look for u \in U such that for all v \in V:

a(u, v) = F(v)

where

a(u, v) = \int_{x_0}^{x_f} E(x)A(x) \frac{du}{dx} \frac{dv}{dx}\,dx
F(v) = \int_{x_0}^{x_f} f(x) v(x)\,dx

In this case U is the space of possible solutions, which we call trial functions, and V is
the space of functions for which the equation has to be satisfied, which we call test functions.

For people more interested in the mathematical details: if we define

L^2 = \{ f : (x_0, x_f) \to \mathbb{R} \;:\; \int f^2\,dx < \infty \}

and

H^1 = \{ f \in L^2 : f' \in L^2 \}

then

U = V = \{ f \in H^1 : f(x_0) = f(x_f) = 0 \}

11.3 Discretization: Galerkin’s method

11.4 Implementation
So our system of equations derived from the weak formulation is given by

\sum_{j=1}^{n} u_j \underbrace{\int_{x_0}^{x_f} E(x)A(x) \frac{d\phi_i}{dx} \frac{d\phi_j}{dx}\,dx}_{a_{ij}} = \underbrace{\int_{x_0}^{x_f} f(x) \phi_i(x)\,dx}_{b_i}    (11.5)

which is equivalent to solving the system

AU = B

where A = (a_{ij}), U = (u_j) and B = (b_i).

How can we compute A and B?

First we need to specify the basis functions \phi_i. Remember that in interpolation we
have the option of using one polynomial for all the points, piecewise polynomials
that preserve derivatives (Hermite), or splines.
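As a preview of the simplest choice, with piecewise linear “hat” functions on a uniform mesh and E(x)A(x) = 1, the element integrals in (11.5) can be computed exactly and the assembly reduces to a familiar tridiagonal system (a sketch; the load f is an arbitrary example whose exact solution is sin(pi x)):

N = 100;  h = 1/N;  x = (0:N)'*h;
f = @(x) pi^2*sin(pi*x);                          % example load, -u'' = f
e = ones(N-1,1);
A = spdiags([-e, 2*e, -e], -1:1, N-1, N-1)/h;     % a_ij = int phi_i' phi_j' dx
B = h*f(x(2:N));                                  % b_i = int f phi_i dx, lumped quadrature
u = [0; A\B; 0];                                  % approximates sin(pi x)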
